A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data
Monostich is a ~100M-parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight: the bfloat16 weights occupy about 191 MiB.
`<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n`

Pipeline: Chat Prompt → BPE-32K Tokenizer → LLaMA Decoder (12L) → Token Prediction
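The prompt layout shown above can be assembled with a few lines of string handling. The following is a minimal sketch, not the code in `inference.py`: the function name `format_chat` and the `add_generation_prompt` parameter are hypothetical, and it assumes that multi-turn conversations simply repeat the header, content, and end-of-turn pattern.

```python
# Special-token strings as they appear in this card's chat template.
BOS = "<|begin_of_text|>"
START_HEADER = "<|start_header_id|>"
END_HEADER = "<|end_header_id|>"
EOT = "<|eot_id|>"


def format_chat(messages, add_generation_prompt=True):
    """Render a list of {"role", "content"} dicts as a Llama-3 style prompt."""
    parts = [BOS]
    for message in messages:
        parts.append(f"{START_HEADER}{message['role']}{END_HEADER}\n\n")
        parts.append(f"{message['content']}{EOT}")
    if add_generation_prompt:
        # Open an assistant header so the model continues with its reply.
        parts.append(f"{START_HEADER}assistant{END_HEADER}\n\n")
    return "".join(parts)


prompt = format_chat([{"role": "user", "content": "Hello"}])
print(repr(prompt))
```

For a single user turn containing `Hello`, this reproduces the example string character for character.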
Each transformer layer contains:

| Property | Value |
|---|---|
| Architecture | LLaMA-style Decoder-Only Transformer |
| Parameters | 100,092,672 (~100M) |
| Hidden Dimension | 768 |
| Intermediate (MLP) | 2,048 |
| Layers | 12 |
| Attention Heads | 12 (Q) / 4 (KV) — GQA 3:1 |
| Head Dimension | 64 |
| Context Length | 1024 |
| RoPE θ | 10,000 |
| Vocabulary | 32,000 (BPE) |
| Tied Embeddings | Yes |
| Precision | bfloat16 |
| Weight Size | ~191 MiB (bf16) |
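The parameter count and weight size in the table can be checked by hand. The arithmetic below is a sketch under stated assumptions that the card does not spell out: linear layers carry no bias terms, each layer has two RMSNorm weight vectors, there is one final RMSNorm, and the tied LM head adds no parameters of its own.

```python
# Values copied from the architecture table.
vocab_size = 32_000
hidden = 768
intermediate = 2_048
layers = 12
q_heads, kv_heads, head_dim = 12, 4, 64

# Assumptions: bias-free linear layers, two RMSNorm vectors per layer,
# one final RMSNorm, and an LM head that reuses the embedding matrix.
embedding = vocab_size * hidden
attention = (
    hidden * q_heads * head_dim          # Q projection
    + 2 * hidden * kv_heads * head_dim   # K and V projections (GQA)
    + q_heads * head_dim * hidden        # output projection
)
mlp = 3 * hidden * intermediate          # gate, up, and down projections
norms = 2 * hidden                       # pre-attention and pre-MLP RMSNorm
per_layer = attention + mlp + norms
total = embedding + layers * per_layer + hidden  # + final RMSNorm

bf16_mib = total * 2 / 2**20             # two bytes per bfloat16 value
print(f"{total:,} parameters, {bf16_mib:.1f} MiB in bf16")
```

Under these assumptions the total comes to 100,092,672 parameters and roughly 191 MiB, matching the table.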
| Feature | Description | Origin |
|---|---|---|
| RoPE | Rotary Positional Embeddings for relative position encoding | LLaMA |
| GQA | Grouped Query Attention (3:1) for efficient KV cache | LLaMA-2 |
| SwiGLU | Gated linear unit with SiLU activation | PaLM, LLaMA |
| RMSNorm | Root Mean Square normalization (faster than LayerNorm) | LLaMA |
| Flash Attention | Memory-efficient attention via PyTorch SDPA | Dao et al. |
| Weight Tying | Embedding and LM head share weights | Standard |
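To make the "efficient KV cache" claim for GQA concrete, the cache size can be estimated from the architecture table. This is a back-of-the-envelope sketch that assumes keys and values are cached in bfloat16 at batch size 1; the helper name `kv_cache_bytes` is illustrative only.

```python
layers = 12
q_heads, kv_heads, head_dim = 12, 4, 64
context_length = 1024
bytes_per_value = 2  # bfloat16 (assumed cache dtype)


def kv_cache_bytes(num_kv_heads, tokens):
    """Bytes for cached keys and values across all layers, batch size 1."""
    return 2 * layers * num_kv_heads * head_dim * bytes_per_value * tokens


gqa = kv_cache_bytes(kv_heads, context_length)
mha = kv_cache_bytes(q_heads, context_length)  # if every Q head had its own K/V
print(f"GQA: {gqa / 2**20:.0f} MiB, full multi-head: {mha / 2**20:.0f} MiB")
```

With 4 KV heads instead of 12, the cache at full context is one third the size it would be with standard multi-head attention.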

| Tokenizer | Value |
|---|---|
| Type | Byte-Pair Encoding (BPE) |
| Vocabulary | 32,000 tokens |
| Library | HuggingFace tokenizers |
| Token | ID | Purpose |
|---|---|---|
| `<\|pad\|>` | 0 | Padding |
| `<\|unk\|>` | 1 | Unknown |
| `<\|begin_of_text\|>` | 2 | Beginning of text |
| `<\|end_of_text\|>` | 3 | End of text (document boundary) |
| `<\|start_header_id\|>` | 4 | Chat role header open |
| `<\|end_header_id\|>` | 5 | Chat role header close |
| `<\|eot_id\|>` | 6 | End of turn (generation stop token) |
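Because `<|eot_id|>` is the generation stop token, a decoding loop would typically cut the output at its first occurrence. The sketch below shows one way to do that with the IDs from the table. Treating `<|end_of_text|>` as an additional stop token is an assumption, and `trim_at_stop` is a hypothetical helper, not a function from `inference.py`.

```python
# IDs copied from the special-token table.
SPECIAL_TOKEN_IDS = {
    "<|pad|>": 0,
    "<|unk|>": 1,
    "<|begin_of_text|>": 2,
    "<|end_of_text|>": 3,
    "<|start_header_id|>": 4,
    "<|end_header_id|>": 5,
    "<|eot_id|>": 6,
}
# Assumption: end-of-text also ends generation, in addition to end-of-turn.
STOP_IDS = {SPECIAL_TOKEN_IDS["<|eot_id|>"], SPECIAL_TOKEN_IDS["<|end_of_text|>"]}


def trim_at_stop(generated_ids):
    """Return the generated IDs up to, but not including, the first stop token."""
    for position, token_id in enumerate(generated_ids):
        if token_id in STOP_IDS:
            return generated_ids[:position]
    return generated_ids


# 812, 94, and 2051 are made-up placeholders for ordinary BPE token IDs.
print(trim_at_stop([812, 94, 2051, 6, 4]))
```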

| Pretraining | Value |
|---|---|
| Dataset | FineWeb-Edu + Wikipedia |
| Tokens | ~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia) |
| Context Length | 1024 |
| Objective | Next-token prediction (all tokens) |
| Peak LR | 3 × 10⁻⁴ |
| Min LR | 3 × 10⁻⁵ |
| Warmup | 200 steps |
| Schedule | Warmup → Plateau (10%) → Cosine Decay |
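The pretraining schedule (warmup, then a plateau, then cosine decay) can be written as a function of the step index. This is a sketch, not the training code: it assumes linear warmup, reads "Plateau (10%)" as 10% of the total step count, and uses a hypothetical `total_steps` value because the card does not state the real one.

```python
import math

PEAK_LR = 3e-4
MIN_LR = 3e-5
WARMUP_STEPS = 200
PLATEAU_FRACTION = 0.10  # assumed to be a fraction of total steps


def pretrain_lr(step, total_steps):
    """Learning rate at a 0-indexed optimizer step."""
    plateau_end = WARMUP_STEPS + int(PLATEAU_FRACTION * total_steps)
    if step < WARMUP_STEPS:
        # Linear warmup (assumed shape) up to the peak.
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    if step < plateau_end:
        # Hold the peak learning rate.
        return PEAK_LR
    # Cosine decay from the peak down to the minimum.
    progress = (step - plateau_end) / max(1, total_steps - plateau_end)
    progress = min(progress, 1.0)
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


total = 10_000  # hypothetical step count, not taken from the card
for s in (0, 199, 1_000, 5_600, 10_000):
    print(s, f"{pretrain_lr(s, total):.2e}")
```

The instruction-tuning stage described in this card would use the same shape without the plateau, with its own peak, minimum, and warmup values.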

| Instruction tuning | Value |
|---|---|
| Datasets | Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation |
| Context Length | 1024 |
| Objective | Masked cross-entropy (assistant tokens only) |
| Chat Template | Llama-3 style with header tokens |
| Peak LR | 5 × 10⁻⁵ |
| Min LR | 5 × 10⁻⁶ |
| Warmup | 100 steps |
| Schedule | Warmup → Cosine Decay |
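"Masked cross-entropy (assistant tokens only)" usually means that label positions outside assistant replies are set to an ignore value so they contribute nothing to the loss. The sketch below builds such labels in plain Python. It assumes the end-of-turn token after each assistant reply is also supervised, and the `(role, header_ids, content_ids)` turn layout and `build_labels` name are hypothetical.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy
BOS_ID, EOT_ID = 2, 6  # from this card's special-token table


def build_labels(turns):
    """Flatten (role, header_ids, content_ids) turns into input IDs and labels.

    Labels copy the input IDs on assistant content and its end-of-turn token,
    and hold IGNORE_INDEX everywhere else. The usual one-position shift between
    logits and labels is left to the training loop.
    """
    input_ids, labels = [BOS_ID], [IGNORE_INDEX]
    for role, header_ids, content_ids in turns:
        body = [*content_ids, EOT_ID]
        input_ids += [*header_ids, *body]
        labels += [IGNORE_INDEX] * len(header_ids)
        labels += body if role == "assistant" else [IGNORE_INDEX] * len(body)
    return input_ids, labels


# All IDs other than 2, 4, 5, and 6 are made-up placeholders.
turns = [
    ("user", [4, 900, 5, 901], [812, 94]),
    ("assistant", [4, 902, 5, 901], [2051, 77]),
]
input_ids, labels = build_labels(turns)
print(labels)
```

Only the last three positions (the assistant reply and its `<|eot_id|>`) keep real labels in this example; the user turn and both headers are masked.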

| Training setup | Value |
|---|---|
| Optimizer | AdamW (fused), β₁=0.9, β₂=0.95, ε=10⁻⁸ |
| Weight Decay | 0.0 |
| Gradient Clipping | 1.0 (global norm) |
| Precision | bfloat16 autocast |
| Compilation | Optional torch.compile (max-autotune) |
| Multi-GPU | Automatic DDP when ≥2 GPUs detected |
| Dataset | Source | Notes |
|---|---|---|
| Kyoto-Corpus | Nikity/Kyoto-Corpus | Multi-turn instruction pairs |
| LMSYS-Chat-1M | lmsys/lmsys-chat-1m | Real-world conversations (redacted rows skipped) |
| Nomi-150M-Chat | guus4324343/Nomi-150M-Chat | Synthetic chat data |
| Chat-Compilation | aklein4/chat-compilation | Multi-source compilation (system-prompt conversations excluded) |
pip install torch safetensors tokenizers huggingface_hub
wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py
The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).
Interactive chat (default):
python inference.py
Single prompt:
python inference.py --prompt "What is the capital of France?"
Options:
| Flag | Default | Description |
|---|---|---|
| `--prompt` | None | Single prompt (omit for interactive REPL) |
| `--temperature` | 0.28 | Sampling temperature |
| `--top-p` | 0.95 | Nucleus sampling threshold |
| `--max-new-tokens` | context max | Max tokens to generate |
| `--device` | cuda | Device (cuda or cpu) |
| `--seed` | 1234 | Random seed |
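The temperature and top-p options interact: temperature sharpens or flattens the distribution, then nucleus sampling keeps only the most probable tokens whose cumulative mass reaches the threshold. The following pure-Python sketch illustrates the idea with the default values from the table. It is not the implementation in `inference.py`, and `sample_top_p` is a hypothetical name.

```python
import math
import random


def sample_top_p(logits, temperature=0.28, top_p=0.95, rng=random):
    """Sample one token index using temperature scaling and nucleus filtering."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    weights = [math.exp(x - peak) for x in scaled]  # numerically stable softmax
    total = sum(weights)
    probs = [w / total for w in weights]

    # Keep the most probable tokens until their cumulative mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for index in order:
        kept.append(index)
        cumulative += probs[index]
        if cumulative >= top_p:
            break

    # random.choices treats the weights as relative, so no renormalizing is needed.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(1234)
logits = [2.0, 1.0, 0.5, -1.0]  # made-up scores for a four-token vocabulary
print(sample_top_p(logits, rng=rng))
```

With these made-up logits, a temperature of 0.28 gives the top token about 97% of the probability mass, so the 0.95 nucleus contains only that token and the sample is always index 0. Raising the temperature widens the nucleus.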
| Model | Parameters | Context | Status |
|---|---|---|---|
| Monostich | ~100M | 1024 | Available |
| Couplet | ~200M | 1024 | Training |
    kerzgrr/monostich/
      README.md                  # This model card
      inference.py               # Standalone inference script
      monostich.safetensors      # Weights (bfloat16, SafeTensors)
      config.json                # Model architecture config
      tokenizer.json             # BPE tokenizer (HuggingFace format)
      tokenizer_config.json      # Tokenizer metadata
      special_token_ids.json     # Token ID mapping
      special_tokens_map.json    # Token string mapping
@misc{monostich2026,
title={Monostich: A Compact Instruction-Tuned Language Model},
year={2026},
url={https://huggingface.co/kerzgrr/monostich}
}
A monostich is a poem of a single line — small, but complete.