Sampling from Your Language Model One Byte at a Time



Jonathan Hayase♡  Alisa Liu♡  Noah A. Smith♡♣  Sewoong Oh♡
♡University of Washington  ♣Allen Institute for AI
jhayase@cs.washington.edu

Abstract

Tokenization is used almost universally by modern language models, enabling efficient text representation using multi-byte or multi-character tokens. However, prior work has shown that tokenization can introduce distortion into the model's generations, an issue known as the Prompt Boundary Problem (PBP). For example, users are often advised not to end their prompts with a space because it prevents the model from including the space as part of the next token. While this heuristic is effective in English, the underlying PBP continues to affect languages such as Chinese as well as code generation, where tokens often do not line up with word and syntactic boundaries. In this work, we present an inference-time method to convert any autoregressive LM with a BPE tokenizer into a character-level or byte-level LM. Our method efficiently solves the PBP and is also able to unify the vocabularies of language models with different tokenizers, allowing one to ensemble LMs with different tokenizers at inference time or transfer the post-training from one model to another using proxy-tuning. We demonstrate in experiments that the ensemble and proxy-tuned models outperform their constituents on downstream evals.[1]

1 Introduction

Tokenization is a crucial component of nearly all modern language models: it allows them to consume and produce arbitrary streams of text using only finite vocabularies. The vast majority of tokenizers in use today, such as those based on Byte-Pair Encoding (BPE) [58] or Unigram [27], feature tokens spanning multiple bytes or characters, allowing them to represent text more efficiently than purely byte-level or character-level tokenization [12, 78, 74].
Users of LMs are generally unaware of the tokenization and expect LMs to operate on strings, consuming a prompt as a string and producing a useful string completion thereof. Tokenized LMs approximate this by (i) encoding the text as a sequence of tokens, (ii) feeding the resulting sequence to the language model, and (iii) decoding the generated token sequence back into text. More precisely, let prompt ∈ Σ* be a string of arbitrary length over some alphabet Σ, and let encode : Σ* → V* and decode : V* → Σ* represent the translation between strings and token sequences over a vocabulary V. To complete the prompt, a typical scheme is to sample from the distribution

    P(t_1, …, t_n | [t_1, …, t_k] = encode(prompt)),    (1)

where encode(prompt) is the tokenization of the prompt, which in this example has length k. Note that sampling from this distribution can be done very conveniently by following the three steps above when the model has an autoregressive structure, i.e.,

    P(t_{k+1}, …, t_n | t_1, …, t_k) = ∏_{i=k+1}^{n} P(t_i | t_1, …, t_{i−1}),

which is used to sample the completion from P(t_{k+1}, …, t_n | t_1, …, t_k) given the tokenized prompt [t_1, …, t_k]. We then return decode(t_1, …, t_n) to the user. For the most part, this process happens transparently to the user, but under certain circumstances it can introduce distortion into the language model's completions.

[1] Code is available at https://github.com/SewoongLab/byte-sampler.

Preprint. Under review.
Figure 1: ByteSampler resolves the prompt boundary problem (exhibited in the output of generate()). In this example, "test", "都是", and ".getElementById" are all single tokens in the respective tokenizers.

    > olmo.generate(tok.encode("This is a tes"))
    "erstor"
    > ByteSampler(olmo, "This is a tes")
    "t"
    > qwen.generate(tok.encode("日本的首都是东京, 中国的首都"))   # "Japan's capital is Tokyo, China's capital"
    "也是北京"                                                     # "also is Beijing"
    > ByteSampler(qwen, "日本的首都是东京, 中国的首都")
    "是北京"                                                       # "is Beijing"
    > olmo.generate(tok.encode("document.getElement"))
    "('div')"
    > ByteSampler(olmo, "document.getElement")
    "ById('button')"

The Prompt Boundary Problem (PBP). In particular, Eq. (1) introduces distortion whenever the prompt ends on a prefix of what could otherwise be a single token. More concretely, consider LLAMA-3.2-1B and suppose the user's prompt ends with the text "becau" (["bec" = 17106, "au" = 2933] as tokens). The user most likely expects the continuation to begin with "se" (325), since "because" is a common word. However, during training, the model has only ever seen the word "because" represented as a single token (11458) and never as the sequence [17106, 2933, 325]. Accordingly, the actual next token LLAMA-3.2-1B predicts is "z" (89), which, while plausible in some scenarios, is an arguably unlikely continuation representing an artifact of tokenization. While this example may seem contrived at first glance, there are many situations where this problem can arise (Fig. 1 shows a few more examples):

1. In languages that do not separate words with whitespace, such as Chinese and Japanese, tokens can span multiple words, so this issue can arise even when the prompt ends with a complete word.
2. Any tokenizer that features multi-word tokens, which can bring gains in encoding efficiency [18, 29, 34], suffers from the same problem as Chinese and Japanese.
3.
When completing code, it is common to request completions while in the middle of an identifier [23].
4. This issue also occurs when performing constrained generation from language models [55].

In general, the user, unaware of the tokenization, expects samples from the properly conditioned distribution

    P(t_1, …, t_n | prompt ⊑ decode(t_1, …, t_n)),    (2)

where ⊑ denotes the prefix relation. However, the token-prefix conditioned distribution of Eq. (1) and the byte-prefix conditioned distribution of Eq. (2) can differ substantially (e.g., Figure 1). Eq. (2) transcends the arbitrary token boundary where the user-provided prompt stops, decoupling the prompt boundary from token boundaries, to complete the prompt with the exact distribution of the language model. This leads to a fundamental algorithmic question of interest: how do we sample from the byte-prefix conditioned distribution of Eq. (2) exactly and efficiently?

Contributions. We introduce an efficient procedure to condition a BPE tokenizer-based model on an arbitrary byte prefix given only access to the tokenizer and log-probability queries to the model (Section 3). We demonstrate in experiments that this represents an exact solution to the Prompt Boundary Problem presented above (Section 4.2). We show that our method can be used to convert the model into a byte-level language model and that this ability can be used to unify the vocabularies of different models. This enables exact byte-level ensembles of language models with different tokenizers (Section 4.3) and allows one to transfer the post-training of one model onto another model at inference time using proxy-tuning [33] (Section 4.4). We demonstrate in proof-of-concept experiments that language model ensembles and proxy-tuned models constructed with our method are able to outperform their constituent models in downstream evaluations.
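The token-prefix vs. byte-prefix mismatch above can be illustrated with a toy tokenizer. This is a minimal sketch: it uses greedy longest-match tokenization as a simplified stand-in for BPE, and the vocabulary is hypothetical, not that of any real model.

```python
# Toy illustration of the Prompt Boundary Problem (PBP). Greedy
# longest-match tokenization stands in for BPE; the vocabulary is
# hypothetical.
VOCAB = ["because", "bec", "au", "se", "b", "e", "c", "a", "u", "s"]

def encode(text):
    """Greedy longest-match tokenization (a simplification of BPE)."""
    tokens = []
    while text:
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

# The full word is a single token, so during training the model only
# ever sees "because" in this form...
print(encode("because"))   # ['because']
# ...but a prompt that stops mid-word encodes differently, producing a
# token context the model has never observed:
print(encode("becau"))     # ['bec', 'au']
```

Appending "se" would give ['bec', 'au', 'se'], an encoding the tokenizer itself never emits for "because", which is why conditioning on the token prefix (Eq. (1)) rather than the byte prefix (Eq. (2)) distorts the continuation.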
2 Background

In this section we give essential background regarding tokenization as well as prior work addressing the Prompt Boundary Problem. We discuss additional related works in Appendix A.

Table 1: Incremental complexity of various mitigations for the prompt boundary problem. We list the complexity (in both preprocessing time and LM evaluations) when sampling each new character while generating an n-character string. Our method has the same complexity as backtracking methods while remaining exact, i.e., matching Eq. (2) in distribution, modulo invalid sequences (see below for discussion). We report the LM inference complexity as originally presented, as well as upper bounds using the analysis from Section 3.1 when using prefix caching. "(optimal)" indicates that the token evaluations for any input will be the minimum required for exactness.

    Method                     | Exact | Preprocessing | Token evaluations | TE w/ prefix caching
    Backtracking [55, 13, 2]   | No    | O(1)          | O(1)              | N/A
    Prefix Covering [72]       | Yes   | 2^O(n)        | 2^O(n)            | 2^O(n)
    Back Tokenization [70]     | Yes   | 2^O(n)        | O(n)              | O(1) (optimal)
    Byte-Pair Correction [51]  | Yes   | O(n)          | O(n)              | O(1)
    ByteSampler (ours)         | Yes   | O(1)          | N/A               | O(1) (optimal)

Byte-Pair Encoding. BPE was originally presented as a form of data compression in Gage [16] and was proposed for use in NLP in Sennrich et al. [58]. To tokenize a piece of text with a typical BPE-based tokenizer, the text is first split into chunks, a process called pretokenization. These chunks, or pretokens, are then tokenized separately using BPE (thus no token may cross the boundary between pretokens). The BPE tokenizer processes each pretoken by first converting the text into a sequence of elements of the tokenizer's base vocabulary (common choices for the base vocabulary are individual characters or bytes under UTF-8 encoding). Next, an ordered list of merges is applied to the sequence to form larger tokens.
Each merge specifies a contiguous pair of tokens (which may include the products of previous merges) and a new token that represents their concatenation. The merges are applied left-to-right, and once all valid merges have been applied, the tokenization is complete. We show an example application of these steps in Table 2.

Table 2: Step-by-step execution of an example BPE tokenizer. Brackets delimit pretokens; spaces separate tokens.

    Step | Description                   | Result
    0    | Original text                 | "llama and deepseek."
    1    | Pretokenization               | [llama] [␣and] [␣deepseek] [.]
    2    | Convert to base vocabulary    | [l l a m a] [␣ a n d] [␣ d e e p s e e k] [.]
    3    | Apply merges from merge list  | [ll ama] [␣and] [␣deep seek] [.]

Prompt Boundary Problem. Issues surrounding tokenization have been extensively documented in prior work. The prompt boundary problem was presented for maximum-prefix encoding in Phan et al. [51] and for BPE tokenizers in Vieira et al. [72] and Ribeiro [55]. Many methods have been proposed to address the prompt boundary issue. One line of heuristic techniques, including token healing [55] and its generalizations [13, 2], performs "backtracking" by (i) removing one or more of the most recent tokens, followed by (ii) sampling a continuation of the partial prompt using the language model, constraining the newly generated tokens to match the remaining text. Exact methods, which preserve the sampling distribution of the original language model as shown in (5), have also been proposed. Vieira et al. [72] gave an exact method which requires exponential time, as well as an approximate solution leveraging beam search. Turaga [70] proposed a method that combines backtracking with the exponential-time method of Vieira et al. [72], adding a "back tokenization" step that significantly reduces the number of necessary calls to the language model, but still requires exponential preprocessing. Additionally, Phan et al. [51] proposed an exact method which requires only linear time.
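The pretokenize / base-vocabulary / merge steps of Table 2 can be sketched as follows. This is a simplified illustration: the merge list and the pretokenization regex are hypothetical stand-ins (real tokenizers use far larger merge lists and more elaborate pretokenization rules), and each merge is applied in a single left-to-right pass in priority order.

```python
# Minimal sketch of BPE encoding following the steps in Table 2:
# pretokenize, split into base symbols, then apply merges in list order.
# The merge list and pretokenizer here are hypothetical, not a real
# tokenizer's.
import re

MERGES = [("l", "l"), ("a", "m"), ("am", "a"), ("a", "n"), ("an", "d"),
          ("␣", "and")]

def pretokenize(text):
    # Crude stand-in for a real pretokenizer: keep a leading space with
    # each word, split punctuation off separately.
    return re.findall(r" ?\w+|[^\w\s]", text)

def bpe(pretoken):
    seq = list(pretoken.replace(" ", "␣"))  # base vocabulary: characters
    for left, right in MERGES:              # apply merges in priority order
        i = 0
        while i < len(seq) - 1:
            if seq[i] == left and seq[i + 1] == right:
                seq[i:i + 2] = [left + right]  # merge the pair in place
            else:
                i += 1
    return seq

def encode(text):
    # no token may cross a pretoken boundary
    return [tok for pre in pretokenize(text) for tok in bpe(pre)]

print(encode("llama and"))  # ['ll', 'ama', '␣and']
```

Note that "llama" stops at ['ll', 'ama'] because this toy merge list contains no ("ll", "ama") merge, mirroring step 3 of Table 2.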
Although all of the above methods, except for backtracking, are "exact," they may produce slightly different sampling distributions. This is because the methods differ in their handling of invalid token sequences: sequences that can never be output by the tokenizer, but can still be generated erroneously by the model. For now, we will assume that the model always produces valid token sequences, in which case all of the exact methods are identical. We discuss this assumption and the differences when it does not hold in more detail in Appendix D.

3 Method

In this section, we present some simple building blocks and use them to construct a procedure for sampling from a tokenizer-based language model one byte at a time. The fundamental structure of the algorithm is based on what we call the Valid Covering Tree: the tree of all possible valid token sequences that share a specific byte prefix and do not extend past the end of the prefix by more than one full token. We show the construction of the Valid Covering Tree in Fig. 2.

[Figure 2: Construction of the Valid Covering Tree for the string prefix "hypot": (a) starting with the infinite tree of all possible token sequences (many edges not shown), we prune branches that (b) do not match the given prefix or begin after the prefix ends, or (c) contain invalid contiguous pairs of tokens. More example trees are shown in Appendix E.]

The tree depicted in Fig. 2b corresponds to the cover described in Vieira et al.
[72], who remark that it will generally have exponential size in the length of the prefix. In contrast, the Valid Covering Tree, which is a subtree of the one in Fig. 2b, has several properties which will prove useful:

1. Correctness: the tree represents exactly the set of valid token sequences with the prompt as a prefix. (See Section 3.1 and Appendix D.)
2. Compactness: the tree is composed of a "trunk" of tokens that are fully determined (starting at the root, every node has only one child) plus a finite number of "branching" nodes at the end of the trunk. (The number is bounded by a constant which depends only on the tokenizer; see Section 3.2.)
3. Convenience: the tree can be updated to reflect the addition of a new byte using only constant time and space. (See Algorithm 1.)

Additional implementation details and optimizations are presented in Appendix C.

3.1 Pairwise Validation

Recall that a token sequence is valid if it is the encoding of some string under the BPE encoder.[2] The correctness of the pairwise pruning depends on the following proposition regarding validity under BPE tokenization.

Proposition 3.1. Let (encode, decode) denote a BPE encoder and decoder pair corresponding to some merge list M and vocabulary V. We call a token sequence T = [t_1, t_2, …, t_n] ∈ V^n valid if encode(decode(T)) = T. Then T is valid if and only if [t_i, t_{i+1}] is valid for all i ∈ {1, …, n − 1}.

To see that this proposition is true, consider two valid token sequences T_1 = encode(S_1) and T_2 = encode(S_2). If, while tokenizing the concatenation S_1 ++ S_2, there is no merge applied that

[2] The notion of pairwise validation of token sequences was first used in van Antwerpen and Neubeck [71] as the basis for a streaming algorithm and a fast backtracking-based algorithm for BPE tokenization (without addressing the PBP).
crosses the boundary between S_1 and S_2, then the two strings "evolve" independently, and we will have encode(S_1 ++ S_2) = T_1 ++ T_2, which means T_1 ++ T_2 is valid. Conversely, if a merge is applied that does cross the boundary, then the final encoding must feature a token crossing the boundary (since no merge can be undone), which means T_1 ++ T_2 cannot be valid, since it has no such token. We depict an example of both cases using OpenAI's cl100k tokenizer [48] in Fig. 3.[3]

[Figure 3: Example of valid and invalid token pairs. We show the initial string's bytes and the merges m_t ∈ M that are applied to the string (in order of t) to tokenize it: (a) a valid pair, with no merge crossing the boundary; (b) an invalid pair, where merge m_53058 cannot occur because a conflicting merge m_20252 was applied earlier. The key observation is that we only need to consider the trajectory at the boundary to decide if the pair is valid.]

This implies a fast method to test whether a pair of tokens is valid: we inspect the merge trajectory along the boundary between the tokens and check if any conflicting merges would be applied. The worst-case merge tree depth is fixed by the tokenizer, so this check can be done in constant time.[4]

3.2 Streaming Tokenization

Given a stream of input bytes, we will use Algorithm 1 to update the "branches" of the Valid Covering Tree while writing the fully determined "trunk" of tokens to an output stream.

Algorithm 1: Streaming BPE tokenization maintaining a tree matching Fig. 2c
    Input: branching tree T, new byte b
    Output: stream of fully determined tokens
    for every node N that ends one byte before b do
        add all valid next tokens as children of N;    // see Fig. 2c
    end
    prune branches that do not match b;                // see Fig. 2b
    while the root of T has only one child do
        add the root token to the output stream and make its only child the new root;
    end

Now we show that this approach is computationally efficient. To bound the worst-case behavior, we use the observation of Berglund and van der Merwe [3] that each output token can be fully determined using only a constant amount of lookahead (in bytes), where the constant depends only on the tokenizer. This implies that the branching tree T will have bounded depth, since any token that is fully determined will be removed from the tree and written to the output stream. The branching factor of the tree is also bounded by a constant depending on the tokenizer. Thus, the number of edges of T is bounded by a constant, which also means the pruning described in Fig. 2 can be carried out in constant time. For more concrete performance numbers, see Section 4.1, where we show that the tree has only 0.72 extra non-leaf nodes on average.

[3] It is worth noting that the analogs of Proposition 3.1 do not hold for either Unigram [27] or Wordpiece [57] tokenizers.
[4] We generally expect the depth of the merge trees to scale with the logarithm of the vocabulary size |V|, although we ignore scaling with respect to the tokenizer's parameters for brevity.

3.3 Language Modeling Using Valid Covering Trees

Now that we can easily compute Valid Covering Trees, we can use them to perform various common language modeling operations. To compute the probability of a prefix under the LM, we sum the cumulative probabilities the LM assigns to the sequences represented by all leaves of the tree. To sample a continuation of a prefix, we compute the probability (as above) of every leaf and sample one of them accordingly. We are then free to continue sampling a continuation from that leaf using normal token-level sampling. This can be used to solve the PBP without paying the cost of sampling one byte at a time.
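The leaf-summing computation above can be sketched with a brute-force version that rebuilds the cover of the prefix from scratch on every call (unlike the incremental, constant-overhead tree maintained by Algorithm 1) and omits the pairwise-validity pruning of Fig. 2c, so it enumerates the larger tree of Fig. 2b. The vocabulary and the uniform "LM" are hypothetical stand-ins.

```python
# Brute-force sketch of the next-character distribution of Section 3.3:
# enumerate token sequences that cover the prefix, extending at most one
# token past its end (the leaves), weight each leaf by the toy LM's
# probability, and group leaves by the character they imply next.
# Validity pruning (Fig. 2c) is skipped for brevity.
from collections import defaultdict

VOCAB = ["hyp", "hypo", "o", "t", "ten", "th", "e", "n", "h"]

def toy_lm_prob(tokens):
    # Hypothetical stand-in for an autoregressive LM: uniform over the
    # vocabulary at every step, so P(t_1..t_m) = |V|^-m.
    return (1.0 / len(VOCAB)) ** len(tokens)

def next_char_dist(prefix):
    dist, total = defaultdict(float), 0.0
    stack = [((), "")]                   # (tokens so far, decoded text)
    while stack:
        tokens, text = stack.pop()
        if len(text) > len(prefix):      # leaf: one token past the end
            p = toy_lm_prob(tokens)
            dist[text[len(prefix)]] += p # group by the next character
            total += p
            continue
        for tok in VOCAB:
            new = text + tok
            # keep only branches still consistent with the prefix
            if new.startswith(prefix[:len(new)]):
                stack.append((tokens + (tok,), new))
    return {c: p / total for c, p in dist.items()}  # renormalize

d = next_char_dist("hypot")
print(d)
```

Note that the string "hypo" is reached both as the single token "hypo" and as the pair ["hyp", "o"]; summing over all such covers is exactly what makes the byte-prefix conditioning of Eq. (2) exact, and what the Valid Covering Tree computes incrementally instead of by re-enumeration.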
To compute the next-byte distribution given a prefix, we group the leaves by the next byte they would entail and sum the probabilities (as above) of the leaves in each group. This can be combined with a sampling rule to generate text one byte at a time. Naturally, this will generate text more slowly than sampling at the token level. We quantify this overhead in Section 4.2. For convenience, we use "ByteSampler" to refer to this collection of capabilities.

4 Experiments

In our experiments, we apply ByteSampler at inference time to off-the-shelf language models. In Section 4.1 we show that our method has less computational overhead than other exact methods. Next, in Section 4.2, we show that exact methods perform better than heuristics in character-level language modeling. Finally, we present several applications of our method that enable higher-level functions, such as ensembling (Section 4.3) and proxy-tuning (Section 4.4) models with mismatched tokenizers.

4.1 Efficiency

As discussed in Section 2, there are several existing methods which are also "exact." Although each technically corresponds to a different sampling distribution, we do not expect there to be any significant differences between them in practice. Therefore, the main distinguishing factor to consider is each method's computational cost. To estimate the cost in a realistic setting, we sample a random 100-character substring from the OLMo 2 pretraining corpus [47] and estimate how many inference tokens (according to the OLMo 2 tokenizer) each method requires to calculate the probability of the substring as a text prefix. Note that the substring is sampled uniformly, so it is about 80% likely to end in the middle of a word. We report the inference cost in tokens, averaged over 10,000 samples, for several methods in Table 3.

Table 3: Inference cost of various exact solutions to the prompt boundary problem. Our method has 65% less overhead than the next best method. Overhead vs. BPE measures the average additional tokens of inference required by the method, compared to plain BPE. Importantly, the overhead is paid for each byte when sampling at the byte level, making low overhead crucial for efficient sampling.

    Method                                    | Inference tokens | Overhead vs. BPE
    No mitigation (plain BPE)                 | 23.51            | 0
    Prefix Covering [72]                      | 2.12 × 10^30     | +2.12 × 10^30
    Byte-Pair Correction [51]                 | 72.99            | +49.47
    Byte-Pair Correction with prefix caching  | 25.61            | +2.09
    ByteSampler (ours)[5]                     | 24.24            | +0.72

[5] We believe Back Tokenization [70] should match our method in required inference tokens. However, its worst-case exponential preprocessing time limits its practicality.

4.2 Character-Level Language Modeling

Table 4: Language modeling loss of OLMo-2-1B on English text using various methods. We compare three settings: (i) the original token-level cross-entropy loss when predicting the next token; (ii) the character-level loss when predicting the next character by directly tokenizing the prompt and calculating the next-character distribution; and (iii) the character-level loss obtained using ByteSampler to predict the next character. The higher loss per unit for token-level prediction is to be expected, as tokens are harder to predict than bytes. Once the loss is normalized to bits per character, our method and the original model achieve similar results, which demonstrates that our method does not degrade language modeling quality.

    Prediction unit | Method                     | Loss per unit | Bits per character[6]
    Token           | Plain BPE                  | 2.67          | 0.80
    Character       | No mitigation (plain BPE)  | 4.81          | 6.53
    Character       | ByteSampler (ours)         | 0.60          | 0.81

[6] For token-level prediction, calculated using a conversion rate of 4.518 characters per token.

In this section, we focus on converting off-the-shelf language models into character-level language models.[7]
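The bits-per-character normalization used in Table 4 can be sketched as follows. This is a plausible reconstruction rather than the authors' exact computation: cross-entropy losses in nats per prediction unit are divided by the average characters per unit (the conversion rate given in the table footnotes) and by ln 2 to express everything in bits per character.

```python
# Sketch of the bits-per-character normalization behind Tables 4 and 6:
# convert a loss in nats per prediction unit into bits per character so
# that token-level and character-level models are comparable.
import math

def bits_per_char(loss_nats_per_unit, chars_per_unit=1.0):
    # nats -> bits: divide by ln 2; per unit -> per character: divide by
    # the average number of characters each prediction unit spans.
    return loss_nats_per_unit / (chars_per_unit * math.log(2))

# A character-level loss of ln(2) nats is exactly 1 bit per character,
# and ln(256) nats per character is 8 bits (a uniform byte):
print(bits_per_char(math.log(2)))        # 1.0
print(bits_per_char(math.log(256)))      # 8.0
```

For token-level rows, `chars_per_unit` would be the corpus-average token length (e.g., the 4.518 characters per token reported for the OLMo 2 tokenizer); the rounded table entries will not reproduce each other exactly from this formula alone.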
We then evaluate the character-level prediction performance using the standard cross-entropy loss as well as next-character prediction accuracy in two languages: English in Section 4.2.1 and Chinese in Section 4.2.2.

4.2.1 OLMo 2 for English Text

In this setting, we sample a document randomly from the OLMo 2 pretraining corpus [47] and choose a random prefix of the document of length at most 1000 characters. We then compute the next-character distribution according to OLMo-2-1B [65] using various methods. To allow comparison with the original token-based model, we also truncate the prefix to the nearest token boundary and perform next-token prediction with the original model. We can compare the character-level and token-level losses via bits per character [41], which normalizes the loss to account for the fact that tokens are more difficult to predict due to their greater information content. We report the average loss of the predictions over 100,000 such documents in Table 4.

From the results in Table 4, we can clearly see the effect of the prompt boundary problem: naively predicting the next character by directly applying the tokenizer to an arbitrary string prefix as in Eq. (1) leads to poor performance ("no mitigation" in Table 4). In contrast, ByteSampler nearly matches the performance of the original token-based model ("plain BPE") in bits per character, as expected for exact methods.

For backtracking methods, it is not easy to compute the probability of any particular next character, which prevents us from calculating the cross-entropy loss as in Table 4. For our experiments, we compare to the Token Alignment method of Athiwaratkun et al. [2], which is the most advanced of the proposed backtracking methods and also includes token healing as a special case. We use it to directly predict the next character by sampling greedily and report the average accuracy over 100,000 samples in Table 5.
Interestingly, we find that too much backtracking hurts the performance of the Token Alignment method. We believe this is because the sampling step often segments the remainder of the prompt in a non-standard way, which may harm the performance of the model.

4.2.2 Qwen 3 for Chinese Text

Since Chinese writing does not use whitespace, ending the prompt with a complete word does not generally provide a reliable token boundary. This makes it more difficult to heuristically avoid the PBP. Similar to Section 4.2.1, we sample a random prefix of length at most 500 characters of a random document from the Chinese subset of the MADLAD-400 dataset [28]. We then compute the distribution of next characters according to Qwen3-1.7B-Base [66] using various methods and report the average cross-entropy loss over 100,000 documents in Table 6.

[7] We choose character-level modeling for this section, even though our method supports byte-level predictions, because some related methods can only operate on character strings.
[8] For token-level prediction, calculated using a conversion rate of 1.415 characters per token.

Table 5: Next-character prediction accuracy of OLMo-2-1B on English text using various methods. We compare three settings: (i) directly tokenizing the prompt and greedily sampling until the first character of the completion is determined; (ii) using backtracking with Token Alignment (of which Token Healing is a special case) to predict the next character; and (iii) using ByteSampler to predict the next character. Overhead vs. BPE measures the average additional tokens of inference required by the method, compared to (i).

    Method                                   | Next-character accuracy | Overhead vs. BPE
    No mitigation (plain BPE)                | 29.490                  | 0
    1-token backtracking (Token Healing)     | 71.634                  | +0.43
    2-token backtracking (Token Alignment)   | 76.281                  | +0.53
    4-token backtracking (Token Alignment)   | 75.407                  | +1.08
    ByteSampler (ours)                       | 81.560                  | +1.72
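The backtracking heuristic compared against above can be sketched on the toy greedy tokenizer from before. This is a minimal one-step sketch: the vocabulary is hypothetical, the "LM" is a placeholder callback, and a full implementation would keep constraining subsequent tokens whenever the chosen token covers only part of the removed text.

```python
# Minimal sketch of backtracking ("token healing"): drop the final
# token(s) of the prompt, then let the LM pick a token constrained to be
# consistent with the removed text. Toy vocabulary and greedy
# longest-match tokenizer; both are hypothetical.
VOCAB = ["because", "bec", "au", "se", "b", "e", "c", "a", "u", "s"]

def encode(text):
    tokens = []
    while text:
        match = max((v for v in VOCAB if text.startswith(v)), key=len)
        tokens.append(match)
        text = text[len(match):]
    return tokens

def heal(prompt, lm_pick, backtrack=1):
    """Drop the last `backtrack` tokens and let the (toy) LM choose any
    vocabulary item consistent with the dropped text."""
    tokens = encode(prompt)
    kept, removed = tokens[:-backtrack], "".join(tokens[-backtrack:])
    # allowed tokens either extend the removed text or are a prefix of it
    allowed = [v for v in VOCAB
               if v.startswith(removed) or removed.startswith(v)]
    return kept + [lm_pick(kept, allowed)]

# With an LM that prefers the longest allowed token, backing off two
# tokens lets "becau" heal into the single token "because":
pick_longest = lambda ctx, allowed: max(allowed, key=len)
print(heal("becau", pick_longest, backtrack=2))  # ['because']
```

With `backtrack=1` the removed text is only "au", so the healed result is no better than the naive encoding; this mirrors the observation above that the right amount of backtracking matters, and why exact methods avoid the choice entirely.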
Table 6: Language modeling loss of Qwen3-1.7B-Base on Chinese text using various methods. We use the same settings and metrics as Table 4. As in our English results, ByteSampler achieves a normalized language modeling loss (in bits per character) similar to that of the original model, which can only perform next-token prediction.

    Prediction unit | Method                     | Loss per unit | Bits per character[8]
    Token           | Plain BPE                  | 3.43          | 3.29
    Character       | No mitigation (plain BPE)  | 3.79          | 5.16
    Character       | ByteSampler (ours)         | 2.38          | 3.23

Once again, the naive method fails while our method achieves a similar normalized loss to the original token-level model. We also report next-character prediction accuracy to allow comparison with backtracking methods. Note that Chinese has much higher entropy at the character level, so the average accuracies are proportionally lower.

4.3 Byte-Level Ensemble

Another application enabled by byte-level sampling is the ensembling of language models with different tokenizers. In general, when the vocabularies of LMs are the same, their next-token probability or logit distributions can be combined via arithmetic into a single distribution, but this cannot be done directly when the vocabularies differ. Several works have proposed methods to combine LM predictions despite mismatching vocabularies [25, 38, 35, 75], but these may introduce bias into the sampling distribution. Our method makes the direct ensemble possible by converting models with BPE tokenizers into byte-wise models, thus unifying their vocabularies.

In our experiment, we consider an ensemble of three small base language models: Qwen3-1.7B [66], OLMo-2-1B [47, 65], and Llama-3.2-1B [64]. We combine the predictions by computing the average p_ensemble = (1/n) Σ_{i=1}^{n} p_i, where p_1, …, p_n are the next-byte probability distributions of the models. We evaluate the models on a suite of seven tasks and report the results in Table 8.
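Once every model predicts over the same 256-symbol byte alphabet, the averaging step is direct arithmetic on distributions. A minimal sketch, with hypothetical placeholder distributions standing in for real model outputs:

```python
# Sketch of the byte-level ensemble of Section 4.3: a uniform mixture
# p_ensemble = (1/n) * sum_i p_i of next-byte distributions. The
# per-model distributions below are hypothetical placeholders.
def ensemble(dists):
    """Average a list of discrete distributions (dicts symbol -> prob)."""
    n = len(dists)
    symbols = set().union(*dists)
    return {s: sum(d.get(s, 0.0) for d in dists) / n for s in symbols}

p1 = {"a": 0.7, "b": 0.3}   # model 1's next-byte distribution
p2 = {"a": 0.5, "c": 0.5}   # model 2's next-byte distribution
p = ensemble([p1, p2])
# Since each input sums to 1, the mixture does too, e.g. p["a"] == 0.6
# and p["b"] == 0.15.
```

The mixture of proper distributions needs no renormalization, which is exactly what the shared byte vocabulary buys: with mismatched token vocabularies there is no common index set over which to average.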
4.4 Byte-Level Proxy-Tuning

In addition to additive ensembles over probabilities, the logit-level predictions of multiple LMs can be combined via arithmetic, with individual LMs acting as "experts" (if their predictions are combined additively) or "anti-experts" (if subtractively) [32, 31, 60, 20, 11, 59]. In particular, this form of ensembling can be used to achieve the effect of tuning a large pretrained LM without accessing its weights. To see how this can be done, note that, trivially, for logit vectors ℓ_tuned = ℓ_base + (ℓ_tuned − ℓ_base).

[9] Chinese typically uses three bytes per character when encoded in UTF-8.

Table 7: Next-character prediction accuracy of Qwen3-1.7B-Base on Chinese text using various methods. We use the same settings and metrics as Table 5. As in the English results, ByteSampler achieves the best prediction accuracy, but unlike in English, it also requires the least overhead of all mitigation methods. This highlights that languages with multi-byte characters[9] can behave differently from ones which typically use a single byte per character.

    Method                                   | Next-character accuracy | Overhead vs. BPE
    No mitigation (plain BPE)                | 32.8                    | 0
    1-token backtracking (Token Healing)     | 49.2                    | +1.82
    2-token backtracking (Token Alignment)   | 49.6                    | +2.98
    4-token backtracking (Token Alignment)   | 49.0                    | +5.30
    ByteSampler (ours)                       | 52.7                    | +1.60

Table 8: Byte-level ensemble results. We report the performance (accuracy) of a byte-level ensemble of three models on downstream evals, along with the individual performance of each model.

    Task           | Qwen3 | OLMo 2 | Llama 3.2 | Average | Ensemble
    Arithmetic [5] | 0.974 | 0.838  | 0.831     | 0.881   | 0.978
    DROP [15]      | 0.470 | 0.409  | 0.299     | 0.393   | 0.479
    Jeopardy [68]  | 0.274 | 0.327  | 0.264     | 0.288   | 0.347
    LAMBADA [50]   | 0.727 | 0.628  | 0.510     | 0.622   | 0.755
    SQuAD [54]     | 0.845 | 0.802  | 0.694     | 0.780   | 0.836
    TriviaQA [24]  | 0.389 | 0.535  | 0.443     | 0.456   | 0.526
    WikidataQA [4] | 0.689 | 0.643  | 0.658     | 0.663   | 0.719
We see that the ensemble is competitive with the best individual model on each task and consistently outperforms the average performance of the three models. We give more details regarding the evaluation in Appendix B.2.

The idea of proxy-tuning [33] is to approximate the term ℓ_tuned − ℓ_base using the difference between a pair of tuned and base proxy models, ℓ_expert − ℓ_anti-expert. In our experiments, we proxy-tune a strong base model, Llama-3.1-8B, using OLMo-2-1B-Instruct and OLMo-2-1B as the expert and anti-expert, respectively, which together represent a strong post-training recipe [47, 30]. As shown in Table 9, we find that the proxy-tuned Llama 3.1 [63] model consistently outperforms the base model alone as well as the small tuned expert. This highlights a practical application of ByteSampler: "applying" post-training to base models without actually training them, thus disentangling the quality of the base model from that of the post-training recipe.

Table 9: Proxy-tuning results. We report performance on downstream evaluations when proxy-tuning Llama-3.1-8B using OLMo-2-1B-Instruct as the expert and OLMo-2-1B as the anti-expert. We see that the proxy-tuned model gains the instruction-following capability (AlpacaEval 2) and chain-of-thought capabilities (GSM8K, MMLU) of OLMo-2-1B-Instruct while also benefiting from its larger size, allowing it to surpass the expert's individual performance. For details regarding the evaluation, see Appendix B.3.

    Task         | Metric         | Llama 3.1 | OLMo 2 Inst. | Llama 3.1 (proxy-tuned)
    AlpacaEval 2 | LC winrate     | 0.88      | 33.5         | 33.5
    GSM8K        | 5 ICE, CoT, EM | 55.3      | 51.9         | 76.6
    MMLU         | 0 ICE, CoT, MC | 27.8      | 35.2         | 59.5
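The logit arithmetic behind proxy-tuning can be sketched directly. The logit vectors here are hypothetical stand-ins for real model outputs over a shared byte vocabulary (which is what ByteSampler provides when the three models have different tokenizers):

```python
# Sketch of proxy-tuning at the logit level (Section 4.4): approximate
# the tuned large model as l_base + (l_expert - l_anti_expert), then
# normalize with a softmax. All logit vectors are hypothetical.
import math

def proxy_tune(l_base, l_expert, l_anti):
    # l_tuned ~= l_base + (l_expert - l_anti), elementwise per symbol
    return [b + (e - a) for b, e, a in zip(l_base, l_expert, l_anti)]

def softmax(logits):
    m = max(logits)                          # subtract max for stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [x / z for x in exps]

l_base = [2.0, 1.0, 0.0]   # strong base model (prefers symbol 0)
l_exp  = [1.0, 2.0, 0.0]   # small tuned expert (prefers symbol 1)
l_anti = [1.5, 1.0, 0.0]   # small base anti-expert
p = softmax(proxy_tune(l_base, l_exp, l_anti))
# The expert/anti-expert difference shifts the base model's preference
# toward symbol 1 without touching the base model's weights.
```

The per-symbol difference ℓ_expert − ℓ_anti-expert encodes what the post-training recipe changed, so the same pair of small proxies can steer any base model that shares the (unified) vocabulary.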
5 Conclusion

In this work, we introduced ByteSampler, an algorithm that eliminates the Prompt Boundary Problem by converting any BPE tokenizer-based language model into a byte-level model while preserving its generative distribution at the text level. Interesting extensions of this method include automatic support for arbitrary pretokenizers (discussed in Appendix C.3), generalization to other tokenization schemes (such as Unigram [27], WordPiece [57], and other variants of BPE [53, 10]), and speculative decoding at the byte level. Beyond correcting sampling artifacts at the prompt boundary, which is useful in its own right in many situations, the ability to unify vocabularies at inference time enables many forms of model composition, including ensembles of (and post-training transfer between) models with different tokenizers. Other applications of this technology include (i) byte-level knowledge distillation to transfer skills more effectively between models with different tokenizers, (ii) rapid post-training research leveraging the fact that a post-training recipe (represented by a pair of proxy-tuning experts) can be applied to any number of models without additional training, (iii) routing dynamically between models [81] during generation without requiring matching tokenizers, and potentially (iv) more convenient LM-powered compression of byte streams. In general, whenever (mismatched) tokenizers represent an obstacle or inconvenience, our method has the potential to bypass them entirely at the cost of (minimally) increased inference compute. We hope that this will prove useful to LM researchers and users alike.

Acknowledgments

We would like to thank Hao Xu and Ke Hayase for helping us brainstorm a good example of the PBP in Chinese. JH and AL are supported by the NSF Graduate Research Fellowship Program.
This work was partially funded by NSF grants 2113530, 2112471, and 2229876, and a Microsoft Grant for Customer Experience Innovation.

References

[1] O. Ahia, S. Kumar, H. Gonen, V. Hofmann, T. Limisiewicz, Y. Tsvetkov, and N. A. Smith. MAGNET: Improving the multilingual fairness of language models with adaptive gradient-based tokenization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024. URL https://openreview.net/forum?id=1e3MOwHSIX.

[2] B. Athiwaratkun, S. Wang, M. Shang, Y. Tian, Z. Wang, S. K. Gonugondla, S. K. Gouda, R. Kwiatkowski, R. Nallapati, P. Bhatia, and B. Xiang. Token alignment via character matching for subword completion. In L.-W. Ku, A. Martins, and V. Srikumar, editors, Findings of the Association for Computational Linguistics: ACL 2024, pages 15725–15738, Bangkok, Thailand, Aug. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.findings-acl.929. URL https://aclanthology.org/2024.findings-acl.929.

[3] M. Berglund and B. van der Merwe. Formalizing BPE tokenization. In 13th International Workshop on Non-Classical Models of Automata and Applications, NCMA 2023, 18–19 September 2023, Famagusta, Cyprus, pages 16–27. Open Publishing Association, 2023.

[4] BIG-bench. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023. ISSN 2835-8856. URL https://openreview.net/forum?id=uyTL5Bvosj.

[5] T. B. Brown, B. Mann, N. Ryder, M. Subbiah, J. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, S. Agarwal, A. Herbert-Voss, G. Krueger, T. Henighan, R. Child, A. Ramesh, D. M. Ziegler, J. Wu, C. Winter, C. Hesse, M. Chen, E. Sigler, M. Litwin, S. Gray, B. Chess, J. Clark, C. Berner, S. McCandlish, A. Radford, I. Sutskever, and D. Amodei. Language models are few-shot learners.
In Proceedings of the 34th International Conference on Neural Information Processing Systems, pages 1877–1901, 2020.

[6] K. Cao and L. Rimell. You should evaluate your language model on marginal likelihood over tokenisations. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 2104–2114, 2021.

[7] Y. Chen, K. Marchisio, R. Raileanu, D. Adelani, P. Stenetorp, S. Riedel, and M. Artetxe. Improving language plasticity via pretraining with active forgetting. In Advances in Neural Information Processing Systems. NeurIPS, 2023.

[8] Z. Chen, J. Li, P. Chen, Z. Li, K. Sun, Y. Luo, Q. Mao, D. Yang, H. Sun, and P. S. Yu. Harnessing multiple large language models: A survey on LLM ensemble. arXiv preprint arXiv:2502.18036, 2025.

[9] N. Chirkova, G. Kruszewski, J. Rozen, and M. Dymetman. Should you marginalize over possible tokenizations? In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 1–12, 2023.

[10] P. Chizhov, C. Arnett, E. Korotkova, and I. P. Yamshchikov. BPE gets picky: Efficient vocabulary refinement during tokenizer training. arXiv preprint arXiv:2409.04599, 2024.

[11] Y.-S. Chuang, Y. Xie, H. Luo, Y. Kim, J. R. Glass, and P. He. DoLa: Decoding by contrasting layers improves factuality in large language models. In The Twelfth International Conference on Learning Representations, 2024.

[12] J. H. Clark, D. Garrette, I. Turc, and J. Wieting. Canine: Pre-training an efficient tokenization-free encoder for language representation. Transactions of the Association for Computational Linguistics, 10:73–91, 2022. doi: 10.1162/tacl_a_00448. URL https://aclanthology.org/2022.tacl-1.5.

[13] G. Dagan, G. Synnaeve, and B. Rozière. Getting the most out of your tokenizer for pre-training and domain adaptation. In Proceedings of the 41st International Conference on Machine Learning, ICML '24.
JMLR.org, 2024.

[14] K. Dobler and G. De Melo. FOCUS: Effective embedding initialization for monolingual specialization of multilingual models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 13440–13454, 2023.

[15] D. Dua, Y. Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. DROP: A reading comprehension benchmark requiring discrete reasoning over paragraphs. In J. Burstein, C. Doran, and T. Solorio, editors, Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 2368–2378, Minneapolis, Minnesota, June 2019. Association for Computational Linguistics. doi: 10.18653/v1/N19-1246. URL https://aclanthology.org/N19-1246.

[16] P. Gage. A new algorithm for data compression. The C Users Journal archive, 12:23–38, 1994. URL https://api.semanticscholar.org/CorpusID:59804030.

[17] L. Gee, A. Zugarini, L. Rigutini, P. Torroni, et al. Fast vocabulary transfer for language model compression. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 409–416. Association for Computational Linguistics (ACL), 2022.

[18] L. Gee, L. Rigutini, M. Ernandes, and A. Zugarini. Multi-word tokenization for sequence compression. In M. Wang and I. Zitouni, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: Industry Track, pages 612–621, Singapore, Dec. 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-industry.58. URL https://aclanthology.org/2023.emnlp-industry.58.

[19] R. L. Geh, H. Zhang, K. Ahmed, B. Wang, and G. V. d. Broeck. Where is the signal in tokenization space? arXiv preprint, 2024.

[20] A. Gera, R. Friedman, O. Arviv, C. Gunasekara, B. Sznajder, N. Slonim, and E. Shnarch.
The benefits of bad advice: Autocontrastive decoding across model layers. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10406–10420, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.580. URL https://aclanthology.org/2023.acl-long.580.

[21] A. Holtzman, J. Buys, L. Du, M. Forbes, and Y. Choi. The curious case of neural text degeneration. In International Conference on Learning Representations, 2020.

[22] Y. Huang, X. Feng, B. Li, Y. Xiang, H. Wang, T. Liu, and B. Qin. Ensemble learning for heterogeneous large language models with deep parallel collaboration. Advances in Neural Information Processing Systems, 37:119838–119860, 2024.

[23] J. Jackson. Character prefix conditioning, 2025. URL https://www.cursor.com/blog/cpc.

[24] M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In R. Barzilay and M.-Y. Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1601–1611, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi: 10.18653/v1/P17-1147. URL https://aclanthology.org/P17-1147.

[25] J. Kasai, K. Sakaguchi, R. Le Bras, H. Peng, X. Lu, D. Radev, Y. Choi, and N. A. Smith. Twist decoding: Diverse generators guide each other. In Y. Goldberg, Z. Kozareva, and Y. Zhang, editors, Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4909–4923, Abu Dhabi, United Arab Emirates, Dec. 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.326. URL https://aclanthology.org/2022.emnlp-main.326.

[26] T. Kudo.
Subword regularization: Improving neural network translation models with multiple subword candidates. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 66–75, 2018.

[27] T. Kudo and J. Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 66–71, 2018.

[28] S. Kudugunta, I. Caswell, B. Zhang, X. Garcia, D. Xin, A. Kusupati, R. Stella, A. Bapna, and O. Firat. MADLAD-400: A multilingual and document-level large audited dataset. Advances in Neural Information Processing Systems, 36:67284–67296, 2023.

[29] D. Kumar and A. Thawani. BPE beyond word boundary: How NOT to use multi word expressions in neural machine translation. In S. Tafreshi, J. Sedoc, A. Rogers, A. Drozd, A. Rumshisky, and A. Akula, editors, Proceedings of the Third Workshop on Insights from Negative Results in NLP, pages 172–179, Dublin, Ireland, May 2022. Association for Computational Linguistics. doi: 10.18653/v1/2022.insights-1.24. URL https://aclanthology.org/2022.insights-1.24.

[30] N. Lambert, J. Morrison, V. Pyatkin, S. Huang, H. Ivison, F. Brahman, L. J. V. Miranda, A. Liu, N. Dziri, S. Lyu, Y. Gu, S. Malik, V. Graf, J. D. Hwang, J. Yang, R. L. Bras, O. Tafjord, C. Wilhelm, L. Soldaini, N. A. Smith, Y. Wang, P. Dasigi, and H. Hajishirzi. Tulu 3: Pushing frontiers in open language model post-training, 2025.

[31] X. L. Li, A. Holtzman, D. Fried, P. Liang, J. Eisner, T. Hashimoto, L. Zettlemoyer, and M. Lewis. Contrastive decoding: Open-ended text generation as optimization. In A. Rogers, J. Boyd-Graber, and N.
Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12286–12312, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.687. URL https://aclanthology.org/2023.acl-long.687.

[32] A. Liu, M. Sap, X. Lu, S. Swayamdipta, C. Bhagavatula, N. A. Smith, and Y. Choi. DExperts: Decoding-time controlled text generation with experts and anti-experts. In C. Zong, F. Xia, W. Li, and R. Navigli, editors, Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 6691–6706, Online, Aug. 2021. Association for Computational Linguistics. doi: 10.18653/v1/2021.acl-long.522. URL https://aclanthology.org/2021.acl-long.522.

[33] A. Liu, X. Han, Y. Wang, Y. Tsvetkov, Y. Choi, and N. A. Smith. Tuning language models by proxy. In First Conference on Language Modeling, 2024.

[34] A. Liu, J. Hayase, V. Hofmann, S. Oh, N. A. Smith, and Y. Choi. SuperBPE: Space travel for language models. arXiv preprint arXiv:2503.13423, 2025.

[35] C. Liu, X. Quan, Y. Pan, L. Lin, W. Wu, and X. Chen. Cool-Fusion: Fuse large language models without training. arXiv preprint arXiv:2407.19807, 2024.

[36] Y. Liu, P. Lin, M. Wang, and H. Schütze. OFA: A framework of initializing unseen subword embeddings for efficient large-scale multilingual continued pretraining. In Findings of the Association for Computational Linguistics: NAACL 2024, pages 1067–1097, 2024.

[37] S. Lundberg. The art of prompt design: Prompt boundaries and token healing, 2023. URL https://medium.com/towards-data-science/the-art-of-prompt-design-prompt-boundaries-and-token-healing-3b2448b0be38.

[38] B. Lv, C. Tang, Y. Zhang, X. Liu, Y. Yu, and P. Luo.
SpecFuse: Ensembling large language models via next-segment prediction. arXiv preprint, 2024.

[39] K. Marchisio, P. Lewis, Y. Chen, and M. Artetxe. Mini-model adaptation: Efficiently extending pretrained models to new languages via aligned shallow training. In The 61st Annual Meeting of the Association for Computational Linguistics, 2023.

[40] C. Mavromatis, P. Karypis, and G. Karypis. Pack of LLMs: Model fusion at test-time via perplexity optimization. In First Conference on Language Modeling, 2024.

[41] S. J. Mielke. Can you compare perplexity across different segmentations?, Apr 2019. URL https://sjmielke.com/comparing-perplexities.htm.

[42] B. Minixhofer, F. Paischer, and N. Rekabsaz. WECHSEL: Effective initialization of subword embeddings for cross-lingual transfer of monolingual language models. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 3992–4006, 2022.

[43] B. Minixhofer, E. Ponti, and I. Vulić. Zero-shot tokenizer transfer. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[44] B. Minixhofer, I. Vulić, and E. M. Ponti. Universal cross-tokenizer distillation via approximate likelihood matching. arXiv preprint arXiv:2503.20083, 2025.

[45] P. Nawrot, J. Chorowski, A. Lancucki, and E. M. Ponti. Efficient transformers with dynamic token pooling. In A. Rogers, J. Boyd-Graber, and N. Okazaki, editors, Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 6403–6417, Toronto, Canada, July 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.acl-long.353. URL https://aclanthology.org/2023.acl-long.353.

[46] B.-D. Oh and W. Schuler. Leading whitespaces of language models' subword vocabulary pose a confound for calculating word probabilities. In Y. Al-Onaizan, M. Bansal, and Y.-N.
Chen, editors, Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 3464–3472, Miami, Florida, USA, Nov. 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.emnlp-main.202. URL https://aclanthology.org/2024.emnlp-main.202.

[47] T. OLMo, P. Walsh, L. Soldaini, D. Groeneveld, K. Lo, S. Arora, A. Bhagia, Y. Gu, S. Huang, M. Jordan, N. Lambert, D. Schwenk, O. Tafjord, T. Anderson, D. Atkinson, F. Brahman, C. Clark, P. Dasigi, N. Dziri, M. Guerquin, H. Ivison, P. W. Koh, J. Liu, S. Malik, W. Merrill, L. J. V. Miranda, J. Morrison, T. Murray, C. Nam, V. Pyatkin, A. Rangapur, M. Schmitz, S. Skjonsberg, D. Wadden, C. Wilhelm, M. Wilson, L. Zettlemoyer, A. Farhadi, N. A. Smith, and H. Hajishirzi. 2 OLMo 2 Furious, 2024.

[48] OpenAI. OpenAI platform documentation, 2023. URL https://platform.openai.com/docs. Accessed: 2025/05/10.

[49] A. Pagnoni, R. Pasunuru, P. Rodriguez, J. Nguyen, B. Muller, M. Li, C. Zhou, L. Yu, J. Weston, L. Zettlemoyer, G. Ghosh, M. Lewis, A. Holtzman, and S. Iyer. Byte latent transformer: Patches scale better than tokens, 2024.

[50] D. Paperno, G. Kruszewski, A. Lazaridou, N. Q. Pham, R. Bernardi, S. Pezzelle, M. Baroni, G. Boleda, and R. Fernández. The LAMBADA dataset: Word prediction requiring a broad discourse context. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1525–1534, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1144. URL https://aclanthology.org/P16-1144.

[51] B. Phan, M. Havasi, M. Muckley, and K. Ullrich. Understanding and mitigating tokenization bias in language models, 2024.

[52] T. Pimentel and C. Meister. How to compute the probability of a word.
In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 18358–18375, 2024.

[53] I. Provilkov, D. Emelianenko, and E. Voita. BPE-dropout: Simple and effective subword regularization. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Association for Computational Linguistics, 2020.

[54] P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. SQuAD: 100,000+ questions for machine comprehension of text. In J. Su, K. Duh, and X. Carreras, editors, Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, Nov. 2016. Association for Computational Linguistics. doi: 10.18653/v1/D16-1264. URL https://aclanthology.org/D16-1264.

[55] M. T. Ribeiro. A guidance language for controlling large language models, 2023. URL https://github.com/guidance-ai/guidance?tab=readme-ov-file#text-not-tokens.

[56] R. S. 4D masks support in transformers, 2024. URL https://huggingface.co/blog/poedator/4d-masks.

[57] M. Schuster and K. Nakajima. Japanese and Korean voice search. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5149–5152. IEEE, 2012.

[58] R. Sennrich, B. Haddow, and A. Birch. Neural machine translation of rare words with subword units. In K. Erk and N. A. Smith, editors, Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1715–1725, Berlin, Germany, Aug. 2016. Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URL https://aclanthology.org/P16-1162.

[59] R. Shi, Y. Chen, Y. Hu, A. Liu, H. Hajishirzi, N. A. Smith, and S. S. Du. Decoding-time language model alignment with multiple objectives. In The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[60] W. Shi, X. Han, M. Lewis, Y. Tsvetkov, L. Zettlemoyer, and W.-t. Yih.
Trusting your evidence: Hallucinate less with context-aware decoding. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 2: Short Papers), pages 783–791, Mexico City, Mexico, June 2024. Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-short.69. URL https://aclanthology.org/2024.naacl-short.69.

[61] Y. Tay, V. Q. Tran, S. Ruder, J. Gupta, H. W. Chung, D. Bahri, Z. Qin, S. Baumgartner, C. Yu, and D. Metzler. Charformer: Fast character transformers via gradient-based subword tokenization. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=JtBRnrlOEFN.

[62] L. Team. The Llama 3 herd of models, 2024.

[63] L. Team. Introducing Llama 3.1: Our most capable models to date, 2024. URL https://ai.meta.com/blog/meta-llama-3-1/.

[64] L. Team. Llama 3.2: Revolutionizing edge AI and vision with open, customizable models, 2024. URL https://ai.meta.com/blog/llama-3-2-connect-2024-vision-edge-mobile-devices/. Accessed: 2025/05/10.

[65] O. Team. OLMo release notes, 2025. URL https://allenai.org/olmo/release-notes#olmo-2-1b. Accessed: 2025/05/10.

[66] Q. Team. Qwen3: Think deeper, act faster, 2025. URL https://qwenlm.github.io/blog/qwen3/. Accessed: 2025/05/10.

[67] K. Tran. From English to foreign languages: Transferring pre-trained language models. arXiv preprint arXiv:2002.07306, 2020.

[68] B. Tunguz. 200,000+ Jeopardy! questions, 2019. URL https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions.

[69] B. Tunguz. 200,000+ Jeopardy! questions, 2019. URL https://www.kaggle.com/datasets/tunguz/200000-jeopardy-questions.

[70] A. Turaga. Character prefix conditioning with back tokenization, 2025. URL https://anilturaga.github.io/cpc.

[71] H. van Antwerpen and A. Neubeck.
So many tokens, so little time: Introducing a faster, more flexible byte-pair tokenizer, 2025. URL https://github.blog/ai-and-ml/llms/so-many-tokens-so-little-time-introducing-a-faster-more-flexible-byte-pair-tokenizer/. Accessed: 2025/05/10.

[72] T. Vieira, B. LeBrun, M. Giulianelli, J. L. Gastaldi, B. DuSell, J. Terilla, T. J. O'Donnell, and R. Cotterell. From language models over tokens to language models over characters. arXiv preprint arXiv:2412.03719, 2024.

[73] T. Vieira, T. Liu, C. Pasti, Y. Emara, B. DuSell, B. LeBrun, M. Giulianelli, J. L. Gastaldi, T. J. O'Donnell, and R. Cotterell. Language models over canonical byte-pair encodings. arXiv preprint arXiv:2506.07956, 2025.

[74] J. Wang, T. Gangavarapu, J. N. Yan, and A. M. Rush. MambaByte: Token-free selective state space model. In First Conference on Language Modeling, 2024. URL https://openreview.net/forum?id=X1xNsuKssb.

[75] Y. Xu, J. Chen, J. Wu, and J. Zhang. Hit the sweet spot! Span-level ensemble for large language models. arXiv preprint arXiv:2409.18583, 2024.

[76] Y. Xu, J. Lu, and J. Zhang. Bridging the gap between different vocabularies for LLM ensemble. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 7133–7145, 2024.

[77] Y. Xu, J. Chen, J. Wu, and J. Zhang. Hit the sweet spot! Span-level ensemble for large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 8314–8325, 2025.

[78] L. Xue, A. Barua, N. Constant, R. Al-Rfou, S. Narang, M. Kale, A. Roberts, and C. Raffel. ByT5: Towards a token-free future with pre-trained byte-to-byte models. Transactions of the Association for Computational Linguistics, 10:291–306, 2022. doi: 10.1162/tacl_a_00461. URL https://aclanthology.org/2022.tacl-1.17.

[79] L. Yu, D. Simig, C. Flaherty, A. Aghajanyan, L.
Zettlemoyer, and M. Lewis. MEGABYTE: Predicting million-byte sequences with multiscale transformers. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=JTmO2V9Xpz.

[80] Y.-C. Yu, C. C. Kuo, Y. Ziqi, C. Yucheng, and Y.-S. Li. Breaking the ceiling of the LLM community by treating token generation as a classification for ensembling. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 1826–1839, 2024.

[81] W. Zheng, Y. Chen, W. Zhang, S. Kundu, Y. Li, Z. Liu, E. P. Xing, H. Wang, and H. Yao. CITER: Collaborative inference for efficient large language model decoding with token-level routing. arXiv preprint arXiv:2502.01976, 2025.

A Related Work

Byte-level language models. Although our method can convert a model using a traditional BPE tokenizer into a byte-level model, allowing it to be used in situations where byte-level models are required, the converted model may not enjoy the benefits of being trained natively at the byte level. Training byte-level models is an active area of research [12, 78, 74]. However, byte-level language models may still implicitly aggregate multiple bytes into a single "patch" to help reduce the required sequence length. These patches can be segmented either statically [61, 79] or dynamically [45, 49, 1], which may lead to issues analogous to the Prompt Boundary Problem at the patch level, depending on the architecture.

Tokenizer transfer. Methods have been proposed to adapt a model to use a tokenizer other than the one it was trained with. These methods may rely on interventions during training [7], continued training of a subset of the model with the new tokenizer [39], self-distillation [44], careful initialization of a new embedding matrix followed by fine-tuning [42, 17, 67, 36, 14], or zero-shot transfer using a hypernetwork [43].
While these methods can, in principle, be used to convert any model into a byte-level model, they will inevitably introduce some distortion into the model's sampling distribution.

Ensembles of language models. Many methods have been proposed to address the mismatched vocabularies one encounters when ensembling models. These include bridging the vocabularies using a mapping based on model features [22] or edit distance [40], as well as sampling from the union [80] or intersection [76] of multiple vocabularies. There are also several methods that sample multiple tokens of continuation from each model and then select the best one using a scoring metric [35, 77, 38]. For a survey of such methods, including ones that require training or additional data, see Chen et al. [8]. However, unlike our exact method, all of these methods may introduce distortion into the model's outputs.

Word-level probabilities. The popular decision to include whitespace with the following word in most modern tokenizers presents a challenge when computing next-word probabilities [46, 52], which is closely related to the Prompt Boundary Problem.

Nondeterministic tokenizers. Our analysis crucially relies on the determinism of BPE; however, nondeterministic tokenizers such as Unigram [26] and BPE-dropout [53] are of interest to the community. Lundberg [37] remarks that nondeterministic tokenizers may reduce the severity of the prompt boundary problem, but they cannot correct it perfectly. It is possible that more advanced techniques may be able to fully correct the PBP for these tokenizers as well.

B Experimental Details

In this appendix, we report additional experimental details.

B.1 Calculation of the Naive Method

The naive method is simple to state: we merely report the average probability that the next character sampled after the prompt will be the correct one.
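Computing this next-character probability exactly requires marginalizing over every token sequence whose concatenation begins with the target. The toy sketch below is our own illustrative code with a prefix-independent toy "LM"; the paper's actual procedure builds the tree of Fig. 2b and additionally handles multibyte characters:

```python
def prob_text_starts_with(target, vocab_probs, prefix=()):
    """Exact probability that sampled text begins with `target` (bytes),
    marginalizing over every token sequence that could produce it.

    `vocab_probs(prefix)` returns {token_bytes: P(token | tokens so far)}.
    A real implementation would query a language model for the
    conditional distribution at each node of the recursion tree.
    """
    total = 0.0
    for tok, p in vocab_probs(prefix).items():
        if tok.startswith(target):
            # Leaf: this single token already covers the whole target.
            total += p
        elif target.startswith(tok):
            # Token consumes part of the target; recurse on the remainder.
            total += p * prob_text_starts_with(
                target[len(tok):], vocab_probs, prefix + (tok,))
    return total

def toy_vocab(_prefix):
    # Prefix-independent toy distribution; real conditionals depend on it.
    return {b"a": 0.5, b"ab": 0.2, b"b": 0.3}

# P(text starts with "ab") = P("ab") + P("a") * P(next token starts "b")
p_ab = prob_text_starts_with(b"ab", toy_vocab)  # 0.2 + 0.5 * 0.3 = 0.35
```

Note that the token "ab" and the sequence "a", "b" both contribute: summing over these alternative tokenizations is exactly what distinguishes the text-level probability from any single tokenization's probability.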
However, some complexity arises when considering multibyte characters, which occur occasionally in English text and essentially constantly in Chinese. A multibyte character may correspond to multiple tokens under a byte-level BPE tokenizer, which means that multiple sampling steps may be necessary to form the next character. To handle this properly, we compute the tree of all token sequences which start with the desired character (depicted in Fig. 2b) and score the log-probability of all of its leaves to determine the exact probability that the desired next character will be generated. Note that we do not perform the pairwise pruning in this step, as described in Fig. 2c and Section 3.1. This is not strictly necessary, since a single character can be at most four bytes under UTF-8, so the size of the tree will always be small, and omitting the pruning step presents the baseline in the best light.

B.2 Details for Ensemble Evaluations

For the ensemble evaluations we use few-shot prompting with five in-context examples for each query. We choose the few-shot examples randomly to avoid any bias and ensure that the question being tested is not among the examples. We sample the continuation greedily and test whether the resulting text contains the correct answer.

1. Arithmetic contains simple arithmetic problems [5].^10 We use the 2da, 2dm, and 2ds splits for addition, multiplication, and division of (up to) 2-digit numbers.
2. DROP contains questions about passages, potentially requiring reasoning over multiple pieces of information in the passage [15].
3. Jeopardy contains open-ended questions from the "Jeopardy!" quiz show [69].
4. LAMBADA contains narratives without the last word, which is inferrable given the context [50]. This task requires models to attend to the full narrative instead of only the local context.
5. SQuAD contains passages paired with questions about the passage [54]. The answer is always a span from the passage.
6. TriviaQA contains open-ended questions about world knowledge [24].
7. BIG-bench WikidataQA requires models to complete factual statements with the correct continuation [4].

To save compute, we randomly subsample large datasets down to 5,000 examples.

B.3 Details for Proxy-Tuning Evaluations

Following Liu et al. [33], we use the proper instruct template for OLMO2-INSTRUCT and a basic Question/Answer format for the base models. Unlike in the previous section, we use a more varied evaluation setup.

1. For AlpacaEval 2, we prompt using the instruction as the question and take the response as the answer. This is done with no chain-of-thought prompting or in-context examples. We use the default AlpacaEval 2 judge and report the length-controlled win rate in our results.
2. For GSM8K, we use five in-context examples, which naturally cause the model to produce chains of thought. We extract the final number produced by the model and test whether it exactly matches the answer (removing any commas).
3. For MMLU, we use no in-context examples and use the chain-of-thought prompt from Lambert et al. [30] to elicit chains of thought resulting in a multiple-choice answer. Unlike with the other datasets, we do not truncate MMLU to 5,000 examples since its examples are distributed across various domains. We report the multiple-choice accuracy in our results.

These evaluations are intended to benefit from instruction-following capability and general knowledge.

B.4 Compute Resources

Our experiments were conducted with a variety of computing resources, including NVIDIA A40, L40S, and A100 GPUs. Our method requires only one GPU at a time and features minimal memory overhead compared to regular sampling. We estimate that the total compute required to reproduce all of our results is less than 200 L40S-hours.
C Implementation Details and Optimizations

In this appendix, we report implementation details that improve performance and ensure correctness.

C.1 Inference Optimizations

To ensure that our method is practical, we employ a number of optimizations. To quickly compute the Valid Cover Tree, we maintain a cache of token masks which are valid following a given token and a separate cache of masks specifying tokens that begin with certain common byte prefixes. Then, given a node of the tree, we can quickly expand it, as described in Algorithm 1, by fetching the relevant masks from both caches and intersecting them on the GPU to find the valid children to add.

^10 https://huggingface.co/datasets/EleutherAI/arithmetic

When evaluating the probabilities of the leaves of the Valid Cover Tree, we use 4D attention masks [56] to perform inference for the entire tree in a single query. Additionally, while sampling we use KV-caching to avoid redundant computation. Combining these two techniques can lead to excessive memory usage because tokens corresponding to branches that are ultimately not selected by sampling take up space in the KV cache. To address this, we implement a copying garbage collector for the KV cache which discards such tokens from the cache. Since the GC can be run one layer at a time, its total memory overhead is negligible. When using the GC, the KV cache will store exactly one set of keys and values for each token in the current Valid Cover Tree, reducing the memory overhead compared to naive sampling to a constant. We also implement batching, allowing one to sample multiple sequences of bytes in parallel, which permits better utilization of GPU resources.

C.2 Byte-level vs. Character-level BPE

Throughout this work, we assume that BPE is carried out at the byte level. However, the alternative, performing BPE at the character level, is also a popular choice.
Our method extends to character-level BPE merges in a natural manner: one simply performs it at the character level instead. All of the analysis we provide, including the guarantees for the Valid Cover Tree in Section 3.1, continues to hold regardless of the choice of base vocabulary. The only additional logic that needs to be implemented concerns byte fallback, a feature that allows the tokenizer to represent characters not explicitly included in the base vocabulary using their Unicode encoding. To handle this properly, we must "reset" the tree whenever we encounter a character encoded using byte fallback, since BPE merges do not interact with byte fallback (essentially, the byte-encoded character acts as a pretokenization boundary). To condition on an arbitrary byte sequence, we must also consider the possibility that a partial character will be completed to form one not in the base vocabulary, necessitating the addition of a "byte fallback" branch to the Valid Cover Tree. In all other regards, the approach is the same as the one we outline in Section 3.

C.3 Handling pretokenization

So far, we have focused on correctly handling byte pair encoding, ignoring the pretokenization conventionally applied beforehand. To illustrate why this step is important, recall that pretokenization is often used to ensure that tokens cannot span multiple words and that whitespace separating words is merged with the following word rather than the preceding one. To correctly handle all aspects of modern tokenizers, we must also perform pretokenization in an online fashion, which is challenging in its own right. Pretokenization is typically implemented using a regular expression: beginning at the start of the text, the longest prefix matching the regular expression is found greedily. This prefix is then extracted into a pretoken, and the process is repeated on the suffix.
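This greedy longest-prefix loop can be sketched in a few lines; the pattern below is a simplified stand-in for illustration, not any real tokenizer's pretokenization regex:

```python
import re

# Simplified pretokenization pattern (hypothetical): optional leading space
# followed by letters or digits, runs of whitespace, or runs of punctuation.
PRETOKEN = re.compile(r" ?[A-Za-z]+| ?[0-9]+|\s+|[^\sA-Za-z0-9]+")

def pretokenize(text: str) -> list[str]:
    """Greedily split text into pretokens: repeatedly take the longest
    prefix matching the pattern, then continue on the remaining suffix."""
    out, pos = [], 0
    while pos < len(text):
        m = PRETOKEN.match(text, pos)
        out.append(m.group())
        pos = m.end()
    return out

print(pretokenize("hello world42!"))  # ['hello', ' world', '42', '!']
```

Note how the leading space of " world" joins the following word, as described above.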
This continues until the entire string has been processed. To properly handle pretokenization, we must also perform this splitting online. Due to the expressivity of regular expressions, this requires maintaining a tree of possible splits, which are resolved once enough text is observed to conclude whether the regex will match or not.

C.3.1 General Solution

In principle, the implementation of this idea is straightforward. We can convert any splitting regular expression into a finite automaton, which allows us to detect matches incrementally. By performing connectivity analysis on the automaton's state graph, we can infer (i) whether there exists a suffix that could produce a regex match (which would mean that the pretokenization might not end up splitting at this point) and (ii) whether there exists a suffix that would cause the regex to stop matching at this point (which would mean that the pretokenization might end up splitting at this point). This analysis can be precomputed for each state in the automaton, allowing these checks to be performed in constant time for each new byte. If the verdict is ambiguous (both splitting and not splitting are possible), then we add an additional subtree to the Valid Cover Tree which assumes that the split has indeed happened. The portion to the left of the split can only be tokenized one way (since its endpoint is fixed), while the portion to the right of the split is isomorphic to a new Valid Cover Tree for just the portion of the prefix following the hypothetical split. As we continue to add new bytes, we maintain both branches of the tree, just as we would normally. Once enough bytes have been added, we can determine conclusively which option was taken, allowing us to discard the subtree corresponding to the opposite possibility. Of course, it is possible that a new position may occur where the splitting cannot be determined conclusively before the first one is resolved.
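The connectivity analysis can be illustrated on a toy DFA for the invented pretoken pattern " ?a+" (all states and names below are made up for the sketch):

```python
# Tiny DFA for the hypothetical pretoken pattern " ?a+":
# states: 0 = start, 1 = seen leading space, 2 = inside a-run (accepting), 3 = dead.
DELTA = {0: {" ": 1, "a": 2}, 1: {"a": 2}, 2: {"a": 2}, 3: {}}
ACCEPT = {2}

def can_extend(state):
    """Connectivity check: is an accepting state reachable via at least one
    more transition, i.e. could the current match still grow?"""
    seen, stack = set(), list(DELTA[state].values())
    while stack:
        s = stack.pop()
        if s in seen:
            continue
        seen.add(s)
        stack.extend(DELTA[s].values())
    return bool(seen & ACCEPT)

# Precompute the verdict for every state so each new byte costs O(1).
EXTEND = {s: can_extend(s) for s in DELTA}

def feed(state, ch):
    return DELTA[state].get(ch, 3)  # missing transition -> dead state

# After reading " a" the DFA is in an accepting state that can still extend,
# so whether pretokenization splits here is ambiguous: both subtrees are kept.
state = feed(feed(0, " "), "a")
print(state in ACCEPT, EXTEND[state])  # True True
```

In the real method, the same reachability flags are read off the automaton of the full pretokenization regex rather than this toy pattern.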
This will necessitate further splitting of the tree (potentially in both subtrees). In general, this may lead to trees of exponential size, but for the typical pretokenizers in use today, we can still guarantee that the tree has finite size.

C.3.2 Practical Solution

Unfortunately, the general solution outlined in the previous section is difficult to implement in practice. First, most regular expression engines in use today support matching features that are not strictly regular, which makes the conversion of their regexes into automata impossible in the general case. While these features are not used by any pretokenizer we are aware of, the possibility thereof has made it difficult to find routines that can perform this conversion for existing regex engines. To provide a correct implementation while avoiding the complexity of writing our own regex engine, we provide bespoke handlers for the pretokenization rules in common use. In general, most pretokenization regular expressions have the desirable property that any prefix of a match is also a match; we call this property closed under prefix. This makes the detection of possible splitting points very easy, since once the regex stops matching new characters, we know there is no suffix that can extend it. There are only a handful of rules which do not have this property:

• Most tokenizers have a lookahead rule which stops matching whitespace one character before the last whitespace. Thus, given three spaces in a row followed by a letter, the first two spaces would form one pretoken, and the last space and the letter would form a second pretoken.

• Many tokenizers have a "contraction merging" rule which forces contraction suffixes such as 〈’ve〉 to be individual pretokens. This is tricky because 〈’ve〉 is considered a match but 〈’v〉 is not.

We provide handlers for expressions that are closed under prefix, as well as for the two special cases listed above.
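The whitespace lookahead rule can be reproduced with Python's re module (the pattern below is a simplified fragment for illustration, not a full pretokenizer):

```python
import re

# Simplified whitespace-lookahead fragment (illustration only): the lookahead
# in `\s+(?!\S)` stops the whitespace run one character early, so the final
# space attaches to the following word.
PAT = re.compile(r" ?[A-Za-z]+|\s+(?!\S)|\s+")

def split(text):
    out, pos = [], 0
    while pos < len(text):
        m = PAT.match(text, pos)
        out.append(m.group())
        pos = m.end()
    return out

# Three spaces then a letter: the first two spaces form one pretoken, and the
# last space joins the letter, exactly the behavior described above.
print(split("   x"))  # ['  ', ' x']
```

The contraction-merging rule is handled by a similar dedicated special case.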
This is enough to correctly support all pretokenizers we are aware of.

C.4 Handling special tokens

Special tokens are tokens that are assigned special meaning and are not used simply to represent text. They have a variety of uses, including marking the beginning or end of documents or separating turns in a dialog. It is easy to handle special tokens in the prompt: when we see a special token, we terminate the tree at that point (discarding potential continuations), output the ID of the special token, and then start a new tree. To handle special tokens in the output, we consider the special-token distribution on the branch of the tree that ends exactly at the end of the prompt and add those tokens as generation options, with the corresponding probabilities, alongside the 256 possible next bytes. When composing multiple models that have different sets of special tokens, we require a mapping specifying which tokens have the same meaning. This mapping is detected automatically for BOS and EOS tokens but must be specified manually for others.

C.5 Converting Merge Lists to Normal Form

Throughout this work, we have assumed that the tokenizer is constructed using the BPE training algorithm, which proceeds by iteratively merging the most frequent pair in the partially tokenized training corpus. This assumption leads to merge lists with three desirable properties: (i) every token has a unique merge that forms it, (ii) every token can be achieved as the tokenization of some string, and (iii) the merges always appear in the order they are merged. We assume these properties hold in the analysis we present in Section 3. However, in practice, some models use "BPE" tokenizers with merge lists that are not directly produced by the BPE training algorithm.
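To make the discussion concrete, here is a minimal sketch (with an invented three-merge list) of applying a merge list by repeatedly taking the earliest-ranked merge present anywhere in the sequence, the strategy some tokenization libraries use for such lists; real implementations use a heap rather than this linear scan:

```python
# Invented merge list for illustration; RANK maps each pair to its position.
MERGES = [("a", "b"), ("ab", "c"), ("b", "c")]
RANK = {pair: i for i, pair in enumerate(MERGES)}

def apply_merges(symbols):
    """Repeatedly apply the earliest-ranked adjacent merge until none apply.
    This permits merges to fire out of their listed order."""
    symbols = list(symbols)
    while True:
        best_rank, best_i = None, None
        for i in range(len(symbols) - 1):
            r = RANK.get((symbols[i], symbols[i + 1]))
            if r is not None and (best_rank is None or r < best_rank):
                best_rank, best_i = r, i
        if best_i is None:
            return symbols
        symbols[best_i:best_i + 2] = ["".join(symbols[best_i:best_i + 2])]

print(apply_merges("abc"))  # ['abc']: ("a","b") fires first, then ("ab","c")
print(apply_merges("bc"))   # ['bc']: the late merge ("b","c") still applies
```

Running this procedure on each token's own string, and recording which merge fires last, is exactly step (1) of the normal-form conversion.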
One example of this is the tokenizer of LLAMA 3 [62], which appears to have been constructed by extending OpenAI's cl100k tokenizer [48] with additional merges intended to add multilingual capabilities. Because of the way this extension was done, the LLAMA 3 tokenizer does not have any of the three properties outlined above. Despite this, inference with the tokenizer is still possible because some tokenization libraries, such as HuggingFace Tokenizers (https://github.com/huggingface/tokenizers), employ a heap-based algorithm which simply applies the earliest merge available until no more merges can be applied, permitting the merges to be applied out of order.

Fortunately, every merge list can be converted into a functionally equivalent one in "normal form" which has identical behavior while also satisfying the three properties above. This is done in a two-step process: (1) for each token, we run the heap-based algorithm on it as a string and track which merges are used during the tokenization process. If the resulting token sequence is not just the single corresponding token ID, then we mark the token as "unreachable" and drop it (this ensures property (ii)). Otherwise, we check which merge was applied last and drop any other merges which form the same token, since they are also unreachable (this ensures property (i)). Then (2), for every merge, we check the positions of the merges forming its two inputs and move it to immediately after the later of the two if it appears after the original merge (this ensures property (iii)). This procedure allows our method to be used with any tokenizer that can be specified using a merge list, even if it was not trained using BPE.

D Invalid segmentations

D.1 Differences in exact methods

In this work, we consider a method exact if it samples according to the distribution in Eq. (2), modulo the probability mass placed on invalid sequences, as defined in Section 3.1.
Here we describe exactly how these methods differ in their handling of invalid sequences. The method of Turaga [70] and our method condition on a valid covering of the prompt. This corresponds to the distribution

$$P\bigl(t_1,\dots,t_n \,\big|\, \text{prompt} \sqsubseteq \operatorname{decode}(t_1,\dots,t_n),\ [t_1,\dots,t_k] \text{ is valid, where } k = \min\{i \mid \text{prompt} \sqsubseteq \operatorname{decode}(t_1,\dots,t_i)\}\bigr). \tag{3}$$

While difficult to notate, this simply means that the portion of the sequence overlapping the prompt is required to be valid. This is roughly similar to the common practice described in Eq. (1) of directly tokenizing the prompt and sampling a continuation, while avoiding the PBP. Meanwhile, Phan et al. [51] consider a relaxation of the above, which does not require the last pair to be valid. This corresponds to

$$P\bigl(t_1,\dots,t_n \,\big|\, \text{prompt} \sqsubseteq \operatorname{decode}(t_1,\dots,t_n),\ [t_1,\dots,t_{k-1}] \text{ is valid, where } k = \min\{i \mid \text{prompt} \sqsubseteq \operatorname{decode}(t_1,\dots,t_i)\}\bigr). \tag{4}$$

The less strict conditioning explains why this method has greater overhead, as seen in Section 4.1. It may seem desirable to sample from the distribution

$$P\bigl(t_1,\dots,t_n \,\big|\, \text{prompt} \sqsubseteq \operatorname{decode}(t_1,\dots,t_n),\ [t_1,\dots,t_n] \text{ is valid}\bigr), \tag{5}$$

where the entire sequence is required to be valid. However, it is not clear how to efficiently sample from this distribution. Vieira et al. [73] highlight this difficulty and propose several alternative approaches, including approximations of Eq. (5) and architectural modifications that make it easier to sample from Eq. (5). When applying ByteSampler iteratively, the validity of the sequence is enforced continuously. Since this is done locally, the resulting distribution corresponds to the "locally canonicalized approximation" of Eq. (5) described in Vieira et al. [73], additionally conditioned on a prompt.
D.2 Significance of invalid segmentations

For the most part, we have ignored the contribution of invalid sequences to the language model's distribution. This is done out of necessity, since the number of invalid sequences scales exponentially with the prompt length [72]. However, it is worth considering whether these segmentations could contribute meaningfully to the model's capabilities. This is closely related to the concept of marginalization [6]: the idea that calculating the probability of generating a string with a language model requires summing over all segmentations of the string, including invalid ones. Of note, Chirkova et al. [9] found that $P([t_1,\dots,t_n] \text{ is not valid})$ makes up a negligible fraction of the language model's distribution; however, later works [19, 73] came to the opposite conclusion.

E Example Valid Cover Trees

Here we show complete Valid Cover Trees for several example prefixes. Unlike the tree in Fig. 2c, we show the actual tree as calculated by our algorithm. However, to allow them to fit on a page, we display only the internal nodes of the tree (not the leaves). To denote where the hidden leaves would be, we display nodes that have leaves in bold font.

[Figure 4: Example Valid Cover Tree for prefix "this is a tes" with the OLMO 2 tokenizer.]

[Figure 5: Example Valid Cover Tree for prefix "def eule" with the OLMO 2 tokenizer.]

[Figure 6: Example Valid Cover Tree for prefix "BPE Tokenizatio" with the OLMO 2 tokenizer.]

[Figure 7: Example Valid Cover Tree for prefix "inductive hypothe" with the OLMO 2 tokenizer.]

[Figure 8: Example Valid Cover Tree for prefix "日本的首都是东京,中国的首都" with the QWEN 3 tokenizer. We use l to denote nodes with leaves omitted.]
We hide the leaves because it is typical for nodes that do have leaves to have dozens or even hundreds of them. To see how this can occur, imagine a prompt that ends on a space and an internal node that ends right before that space. The node's children will be all valid tokens that begin with a space. Most tokenizers have tens of thousands of tokens that begin with a space, and nearly all of them will be valid continuations. While this may sound problematic, we only need to query the next-token distribution for the parent once in order to score all of its children, so this can be done efficiently in combination with the masking cache we describe in Appendix C.1.

F Advanced Decoding Methods

In Section 3, we focused on showing that our method is "exact." To be precise, this means that sampling bytewise using our method and sampling normally give exactly the same distributions of output text (modulo invalid token sequences, as discussed in Appendix D). However, this applies only when doing standard sampling from the model, and not when transforming the distribution using popular decoding techniques such as greedy decoding, top-k, top-p [21], or even temperatures other than 1. This is because these transformations have different effects when applied at different granularities (clearly, greedily selecting the most likely next byte is not the same as greedily selecting the most likely next token). It is not immediately clear what advantages or disadvantages are gained by transforming the LM's textual distribution in this way, and we believe this presents an interesting direction for future work.
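The granularity dependence of greedy decoding is easy to see on a toy distribution (probabilities invented for illustration):

```python
# Toy distribution over three two-byte strings (hypothetical probabilities)
# showing that greedy decoding depends on granularity.
P = {"ba": 0.4, "ab": 0.35, "aa": 0.25}

# Token-level greedy: pick the single most likely string.
token_greedy = max(P, key=P.get)  # "ba"

# Byte-level greedy: pick the most likely byte at each step.
first = max("ab", key=lambda c: sum(p for s, p in P.items() if s[0] == c))  # "a" (0.60 vs 0.40)
second = max("ab", key=lambda c: P.get(first + c, 0.0))                     # "b" (0.35 vs 0.25)
print(token_greedy, first + second)  # ba ab
```

Here token-level greedy selects "ba", while byte-level greedy commits to the more probable first byte "a" and ends up with "ab", so the two procedures disagree.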
