Commit: Add a citation to dottxt blog
merrymercy committed Feb 5, 2024
1 parent 501d9ba commit 2c36999
Showing 1 changed file with 3 additions and 1 deletion: blog/2024-02-05-compressed-fsm.md
@@ -109,8 +109,10 @@ to form a more frequent token
Moreover, during jump-forward decoding, we've found that different tokenization strategies applied to the jump-forwarded part can lead to different logit distributions for the subsequent tokens. Simply appending the tokenized jump-forwarded section to the current token sequence may therefore yield unexpected outcomes.

To manage these issues, we propose the following solutions:
-- Prefer the use of a comprehensive regular expression to guide the entire decoding process, rather than employing multiple concatenated regular expressions. This approach ensures that both Finite State Machines (FSM) and Large Language Models (LLM) are cognizant of the entire decoding process, thereby minimizing boundary-related issues as much as possible.
- We have implemented a re-tokenization mechanism during the jump-forward phase. This involves appending the string itself rather than its tokens, followed by a re-tokenization of the entire text. This method resolves most tokenization issues and adds only minor computational overhead, approximately 4% (see the first sketch below).
+- Prefer the use of a comprehensive regular expression to guide the entire decoding process, rather than employing multiple concatenated regular expressions. This approach ensures that both the FSM and the LLM are cognizant of the entire decoding process, thereby minimizing boundary-related issues as much as possible (see the second sketch below).
+
+You can also read some additional discussion in this [blog post](http://blog.dottxt.co/coalescence.html).
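
To make the re-tokenization bullet concrete, here is a minimal sketch of the idea, not SGLang's actual implementation: append the jump-forward *string* to the decoded text and re-encode the whole prefix, instead of concatenating token IDs. The `gpt2` tokenizer and the helper names are illustrative assumptions.

```python
# A minimal sketch of re-tokenization during jump-forward decoding.
# Helper names and the `gpt2` tokenizer are illustrative assumptions.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")

def jump_forward_naive(current_ids: list[int], jump_str: str) -> list[int]:
    # Tokenize the jump-forward string on its own and concatenate the IDs;
    # tokens at the seam may differ from tokenizing the full text at once,
    # which can shift the logit distribution of the subsequent tokens.
    return current_ids + tokenizer.encode(jump_str, add_special_tokens=False)

def jump_forward_retokenize(current_ids: list[int], jump_str: str) -> list[int]:
    # Append the *string* and re-encode the entire prefix, so characters
    # at the seam are free to merge into more frequent tokens.
    text = tokenizer.decode(current_ids) + jump_str
    return tokenizer.encode(text, add_special_tokens=False)

prefix = tokenizer.encode('{"name": "Par', add_special_tokens=False)
print(jump_forward_naive(prefix, 'is", "population": '))
print(jump_forward_retokenize(prefix, 'is", "population": '))
```

The two calls can produce different token sequences for the same text, which is exactly the boundary effect the re-tokenization mechanism avoids.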

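And a second sketch for the comprehensive-regex bullet: compose one regex that describes the entire output up front, rather than stitching several small regexes together mid-generation. The JSON shape below is an assumed example in the spirit of the post; how `full_pattern` is handed to a constrained-decoding engine is left abstract, since APIs differ.

```python
import re

# Sub-patterns for the individual fields (names are illustrative).
name_pattern = r'"[\w\d\s]*"'
population_pattern = r'[\d]+'

# One comprehensive regex describing the entire output. A single FSM
# compiled from this pattern sees every token boundary in context,
# including the constant parts that jump-forward decoding skips over.
full_pattern = re.compile(
    r'\{"name": ' + name_pattern + r', "population": ' + population_pattern + r'\}'
)

assert full_pattern.fullmatch('{"name": "Paris", "population": 2102650}')
```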
## Benchmark Results

