Fix tokenizer: EOS bug + decode skip_special_tokens=True empty string

by kashif HF Staff - opened 15 days ago

base: refs/heads/main ← from: refs/pr/1


+54 -20

kashif

Hugging Face Biology Research org 15 days ago

Tokenizer bug fixes

Bug 1: EOS appended when `add_special_tokens=True`

`encode(add_special_tokens=True)` was appending an EOS token, which breaks lighteval's `tok_encode_pair` invariant. Qwen3 doesn't add BOS/EOS either, so the EOS append is removed.
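To illustrate why a trailing EOS is a problem for pair encoding, here is a minimal sketch. It assumes the pair encoder tokenizes `context + continuation` as a whole, tokenizes `context` alone, and slices the whole encoding by the context length; that is my reading of the usual pattern, not code copied from lighteval. The toy `encode`, the `append_eos` switch, and `EOS_ID` are all hypothetical and stand in for the real tokenizer.

```python
EOS_ID = 0  # hypothetical EOS id for this toy vocabulary


def encode(text, add_special_tokens=True, append_eos=False):
    """Toy character-level encoder; ids are ord(ch), so none collide with EOS_ID."""
    ids = [ord(ch) for ch in text]
    if add_special_tokens and append_eos:
        ids.append(EOS_ID)  # mimics the pre-fix behaviour described above
    return ids


def encode_pair(context, continuation, append_eos):
    """Split the joint encoding by the length of the encoded context."""
    whole = encode(context + continuation, append_eos=append_eos)
    ctx = encode(context, append_eos=append_eos)
    return ctx, whole[len(ctx):]


def prefix_invariant_holds(context, continuation, append_eos):
    """The invariant assumed here: context ids + continuation ids == joint ids."""
    whole = encode(context + continuation, append_eos=append_eos)
    ctx, cont = encode_pair(context, continuation, append_eos)
    return ctx + cont == whole


print(prefix_invariant_holds("AC", "GT", append_eos=False))  # True
print(prefix_invariant_holds("AC", "GT", append_eos=True))   # False
```

With the EOS appended, the encoded context is one token longer than its share of the joint encoding, so the slice drops the first continuation token.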

Bug 2: `decode(skip_special_tokens=True)` returns empty string for pure-DNA generations

The common generation scenario: `<dna>` is in the prompt, so the generated portion being decoded contains only k-mer tokens plus `</dna>`. The `elif tid in dna_id_to_token` branch treated every DNA-vocab token (including k-mer content) as a special token and dropped it when `skip_special_tokens=True`, so decoding returned an empty string instead of the DNA sequence.

Fix: only skip actual DNA special tokens (`<dna>`, `</dna>`, `<oov>`); always decode k-mer content tokens.
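The before/after behaviour can be sketched roughly as follows. This is a simplified illustration, not the repository's decode method: the vocabulary, the token ids, and both function names are hypothetical, and non-DNA (BPE) ids are ignored to keep the example short.

```python
# Hypothetical DNA vocabulary: three special tokens plus two 3-mer content tokens.
DNA_ID_TO_TOKEN = {100: "<dna>", 101: "</dna>", 102: "<oov>", 103: "ACG", 104: "TTA"}
DNA_SPECIAL_TOKENS = {"<dna>", "</dna>", "<oov>"}


def decode_before_fix(ids, skip_special_tokens=False):
    """Pre-fix logic: every id in the DNA vocabulary is treated as special."""
    pieces = []
    for tid in ids:
        if tid in DNA_ID_TO_TOKEN:
            if skip_special_tokens:
                continue  # drops k-mer content along with the tags
            pieces.append(DNA_ID_TO_TOKEN[tid])
    return "".join(pieces)


def decode_after_fix(ids, skip_special_tokens=False):
    """Post-fix logic: only <dna>, </dna>, and <oov> are skipped."""
    pieces = []
    for tid in ids:
        token = DNA_ID_TO_TOKEN.get(tid)
        if token is None:
            continue  # non-DNA ids are out of scope for this sketch
        if skip_special_tokens and token in DNA_SPECIAL_TOKENS:
            continue
        pieces.append(token)
    return "".join(pieces)


# Generated portion only: <dna> was in the prompt, so it is not decoded here.
generated = [103, 104, 101]
print(repr(decode_before_fix(generated, skip_special_tokens=True)))  # ''
print(repr(decode_after_fix(generated, skip_special_tokens=True)))   # 'ACGTTA'
```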

Also: `auto_dna_tags` parameter added (default `False`)

Allows raw DNA strings to be automatically wrapped in `<dna>...</dna>` for k-mer tokenization. Default is `False` to preserve existing behaviour (metadata BPE tokens must not be auto-wrapped).
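A rough sketch of what such opt-in wrapping could look like is below. The PR description does not say how the tokenizer decides that a string is raw DNA, so the nucleotide-only regex and the helper name `maybe_wrap_dna` are assumptions made purely for illustration.

```python
import re

# Assumption: "raw DNA" means the whole string is made of nucleotide letters.
_DNA_ONLY = re.compile(r"[ACGTNacgtn]+")


def maybe_wrap_dna(text, auto_dna_tags=False):
    """Wrap a raw DNA string in <dna>...</dna> only when the flag is enabled."""
    if not auto_dna_tags:
        return text  # default: leave input untouched, as before the PR
    if _DNA_ONLY.fullmatch(text):
        return f"<dna>{text}</dna>"
    return text  # already tagged, or not pure DNA (e.g. metadata text)


print(maybe_wrap_dna("ACGTTA"))                      # ACGTTA
print(maybe_wrap_dna("ACGTTA", auto_dna_tags=True))  # <dna>ACGTTA</dna>
```

Keeping the flag off by default means text that merely resembles a sequence is never re-tagged unless the caller asks for it.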

tokenizer: fix EOS append bug and decode skip_special_tokens=True bug (fc796726)

tokenizer: add auto_dna_tags to dna_config.json (437e6757)

tokenizer: fix auto_dna_tags None -> False in tokenizer_config.json (e3cb1186)

loubnabnl

Hugging Face Biology Research org 15 days ago

LGTM!

kashif changed pull request status to open 15 days ago

kashif

Hugging Face Biology Research org 15 days ago

Merging tokenizer fixes: EOS append bug, decode skip_special_tokens=True empty string, auto_dna_tags support.

kashif changed pull request status to merged 15 days ago

