PangolinTokenizer

Byte-level BPE tokenizer for Traditional Chinese, Taiwan text, multilingual text, rich transcription, OCR-style text, and generic control formats.

This revision adds the Open Formosa required control tokens as special tokens. The base BPE vocabulary size remains 114,688. The effective tokenizer length, including added special tokens, is 114,822.
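The two sizes above can be compared directly on a loaded tokenizer. The sketch below is illustrative, not part of the released tokenizer: the helper name vocab_summary is hypothetical, and it assumes the usual Hugging Face convention that vocab_size reports the base vocabulary while len(tokenizer) includes added tokens.

```python
def vocab_summary(tokenizer):
    """Report base vocabulary size, total length, and their difference.

    Assumes the Hugging Face convention: `vocab_size` excludes added
    tokens, `len(tokenizer)` includes them.
    """
    base = tokenizer.vocab_size   # base vocabulary, without added tokens
    total = len(tokenizer)        # base vocabulary plus added tokens
    return {"base": base, "total": total, "added": total - base}


# Arithmetic on the figures stated in this README.
STATED_BASE = 114_688
STATED_TOTAL = 114_822
STATED_ADDED = STATED_TOTAL - STATED_BASE  # entries beyond the base vocabulary

# Hypothetical usage once the tokenizer is loaded as in the Usage section:
# summary = vocab_summary(tokenizer)
# print(summary)
```

With the stated figures the difference is 134 entries. That is smaller than the 157 required tokens listed under Open Formosa Compatibility, which would be consistent with some required tokens already being in the base vocabulary, although this README does not state that.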

Usage

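from transformers import AutoTokenizer

# Load the tokenizer from the Hugging Face Hub; no custom code is executed.
tokenizer = AutoTokenizer.from_pretrained(
    "voidful/PangolinTokenizer",
    trust_remote_code=False,
)

# Mixed sample: a control marker, Traditional Chinese, Zhuyin, and romanized Taiwanese.
text = "<|system|>台灣健保與注音ㄅㄆㄇ，Tailo: Tâi-uân"
ids = tokenizer.encode(text)      # list of token IDs
decoded = tokenizer.decode(ids)   # IDs converted back to a string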

Open Formosa Compatibility

Required special tokens present: 157
Required special tokens encode as single IDs: yes
Standard special tokens: <unk>, <s>, </s>, <pad>
Model max length metadata: 131,072
trust_remote_code: not required
No discrete audio codec token ranges are included.
No dense timestamp token ranges are included.
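
The single-ID claim above can be spot-checked on a loaded tokenizer. This is a minimal sketch under stated assumptions: the helper name find_multi_id_tokens is hypothetical, and only the four standard special tokens named in this README are listed, because the full set of 157 required tokens is not reproduced here.

```python
def find_multi_id_tokens(tokenizer, tokens):
    """Return a dict of tokens that do not encode to exactly one ID.

    An empty result means every token in `tokens` maps to a single ID.
    """
    failures = {}
    for token in tokens:
        # add_special_tokens=False prevents BOS/EOS from being added around the token.
        ids = tokenizer.encode(token, add_special_tokens=False)
        if len(ids) != 1:
            failures[token] = ids
    return failures


# The standard special tokens named in this README.
STANDARD_TOKENS = ["<unk>", "<s>", "</s>", "<pad>"]

# Hypothetical usage with the real tokenizer (requires network access):
# from transformers import AutoTokenizer
# tokenizer = AutoTokenizer.from_pretrained("voidful/PangolinTokenizer")
# assert find_multi_id_tokens(tokenizer, STANDARD_TOKENS) == {}
```

To cover the full Open Formosa requirement, the same helper could be called with the complete list of required tokens from the Open Formosa specification.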