Monostich 100M

A Compact Instruction-Tuned Language Model

A from-scratch LLaMA-style language model pretrained on 16.6B tokens and instruction-tuned on multi-turn chat data

Overview

Monostich is a ~100M parameter decoder-only transformer trained entirely from scratch. It uses a LLaMA-compatible architecture with modern components (GQA, RoPE, SwiGLU, RMSNorm) and is designed to be lightweight: the bf16 weights total about 191 MiB.

Pretraining: ~16.6B tokens from FineWeb-Edu + Wikipedia
SFT: Multi-turn instruction tuning on 4 mixed datasets with Llama-3-style chat templates
Chat template: Llama-3 style — <|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\nHello<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n
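
A minimal sketch of how a prompt in this format could be assembled from a list of messages. The function name `build_prompt` and the message-dict layout are illustrative assumptions and are not taken from `inference.py`; only the token strings come from the template above.

```python
# Hypothetical helper: renders messages into the Llama-3-style template shown above.
BOS = "<|begin_of_text|>"
SOH = "<|start_header_id|>"
EOH = "<|end_header_id|>"
EOT = "<|eot_id|>"


def build_prompt(messages):
    """messages: list of {"role": "user" | "assistant", "content": str}."""
    parts = [BOS]
    for msg in messages:
        parts.append(f"{SOH}{msg['role']}{EOH}\n\n{msg['content']}{EOT}")
    # Open an assistant header so the model generates the next reply.
    parts.append(f"{SOH}assistant{EOH}\n\n")
    return "".join(parts)


prompt = build_prompt([{"role": "user", "content": "Hello"}])
print(repr(prompt))
```

Whether the tokenizer adds `<|begin_of_text|>` on its own is not stated here, so it is worth checking before prepending it a second time.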

Model Architecture

Pipeline: Chat Prompt → BPE-32K Tokenizer → LLaMA Decoder (12L) → Token Prediction

Decoder Block (×12)

Each transformer layer contains:

Grouped Query Attention with RoPE positional embeddings (12 Q heads, 4 KV heads)
SwiGLU MLP with gated activation (768 → 2048 → 768)
RMSNorm pre-attention and pre-MLP
SDPA backend (Flash Attention when available)
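
The 12-to-4 head grouping can be illustrated with a small index mapping. This is a sketch that assumes the common convention in which consecutive query heads share one KV head; the model's actual implementation may arrange heads differently.

```python
N_Q_HEADS = 12
N_KV_HEADS = 4
GROUP_SIZE = N_Q_HEADS // N_KV_HEADS  # 3 query heads per KV head


def kv_head_for(q_head):
    """Index of the KV head that a given query head attends with (assumed layout)."""
    return q_head // GROUP_SIZE


mapping = {q: kv_head_for(q) for q in range(N_Q_HEADS)}
print(mapping)
```

Under this layout, query heads 0 to 2 read KV head 0, heads 3 to 5 read KV head 1, and so on.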

Technical Specifications

Architecture	LLaMA-style Decoder-Only Transformer
Parameters	100,092,672 (~100M)
Hidden Dimension	768
Intermediate (MLP)	2,048
Layers	12
Attention Heads	12 (Q) / 4 (KV) — GQA 3:1
Head Dimension	64
Context Length	1024
RoPE θ	10,000
Vocabulary	32,000 (BPE)
Tied Embeddings	Yes
Precision	bfloat16
Weight Size	~191 MiB (bf16)
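
The parameter count and weight size can be reproduced from the table. The breakdown below is a sketch that assumes bias-free linear layers, one weight vector per RMSNorm, and a final RMSNorm before the tied LM head. These assumptions are not stated in this card, but they reproduce the listed total exactly.

```python
VOCAB, HIDDEN, INTERMEDIATE, LAYERS = 32_000, 768, 2_048, 12
Q_HEADS, KV_HEADS, HEAD_DIM = 12, 4, 64

embedding = VOCAB * HIDDEN                      # shared with the LM head (tied)
attn = (
    HIDDEN * Q_HEADS * HEAD_DIM                 # Q projection
    + 2 * HIDDEN * KV_HEADS * HEAD_DIM          # K and V projections
    + Q_HEADS * HEAD_DIM * HIDDEN               # output projection
)
mlp = 3 * HIDDEN * INTERMEDIATE                 # gate, up, and down projections
norms = 2 * HIDDEN                              # pre-attention and pre-MLP RMSNorm
per_layer = attn + mlp + norms

total = embedding + LAYERS * per_layer + HIDDEN  # + final RMSNorm (assumed)
size_mib = total * 2 / 2**20                     # bf16 = 2 bytes per parameter

print(f"{total:,} parameters, {size_mib:.1f} MiB")  # 100,092,672 parameters, 190.9 MiB
```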

Design Choices

Feature	Description	Origin
RoPE	Rotary Positional Embeddings for relative position encoding	RoFormer (Su et al.), LLaMA
GQA	Grouped Query Attention (3:1) for efficient KV cache	Ainslie et al., LLaMA-2
SwiGLU	Gated linear unit with SiLU activation	Shazeer, PaLM, LLaMA
RMSNorm	Root Mean Square normalization (faster than LayerNorm)	Zhang & Sennrich, LLaMA
Flash Attention	Memory-efficient attention via PyTorch SDPA	Dao et al.
Weight Tying	Embedding and LM head share weights	Standard
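
To make the KV-cache saving from GQA concrete, the arithmetic below compares cache size at full context for the model's 4 KV heads against a hypothetical 12-KV-head (standard multi-head) variant. It assumes a bf16 cache and batch size 1.

```python
LAYERS, HEAD_DIM, CONTEXT = 12, 64, 1024
BYTES_PER_VALUE = 2  # bf16 (assumed cache dtype)


def kv_cache_bytes(kv_heads, seq_len=CONTEXT):
    # Factor 2: one tensor for keys and one for values in every layer.
    return 2 * LAYERS * kv_heads * HEAD_DIM * seq_len * BYTES_PER_VALUE


gqa = kv_cache_bytes(kv_heads=4)    # this model
mha = kv_cache_bytes(kv_heads=12)   # hypothetical multi-head baseline
print(gqa / 2**20, mha / 2**20)     # 12.0 36.0 (MiB)
```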

Tokenizer

Type	Byte-Pair Encoding (BPE)
Vocabulary	32,000 tokens
Library	HuggingFace `tokenizers`

Special Tokens

Token	ID	Purpose
`<|pad|>`	0	Padding
`<|unk|>`	1	Unknown
`<|begin_of_text|>`	2	Beginning of text
`<|end_of_text|>`	3	End of text (document boundary)
`<|start_header_id|>`	4	Chat role header open
`<|end_header_id|>`	5	Chat role header close
`<|eot_id|>`	6	End of turn (generation stop token)
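
Because the end-of-turn token is the generation stop token, a decoding loop would typically cut the output there. A minimal sketch using the ID from the table above; the function name is illustrative and not taken from `inference.py`.

```python
EOT_ID = 6  # end-of-turn token ID from the table above


def trim_at_stop(token_ids, stop_id=EOT_ID):
    """Return the generated IDs up to, but not including, the first stop token."""
    if stop_id in token_ids:
        return token_ids[: token_ids.index(stop_id)]
    return token_ids


print(trim_at_stop([812, 45, 6, 900]))  # [812, 45]
```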

Training Details

Phase 1: Pretraining

Dataset	FineWeb-Edu + Wikipedia
Tokens	~16.6B (~11.6B FineWeb-Edu + ~5B Wikipedia)
Context Length	1024
Objective	Next-token prediction (all tokens)
Peak LR	3 × 10^-4
Min LR	3 × 10^-5
Warmup	200 steps
Schedule	Warmup → Plateau (10%) → Cosine Decay
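
The pretraining schedule can be sketched as a function of the step index. This card does not state the total step count, the warmup shape, or what the 10% plateau is measured against, so the code below assumes linear warmup, a plateau lasting 10% of total steps, and a hypothetical `total_steps` of 10,000.

```python
import math

PEAK_LR, MIN_LR = 3e-4, 3e-5
WARMUP_STEPS = 200


def lr_at(step, total_steps=10_000, plateau_frac=0.10):
    plateau_steps = int(plateau_frac * total_steps)  # assumption: 10% of all steps
    decay_start = WARMUP_STEPS + plateau_steps
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS   # linear warmup (assumed)
    if step < decay_start:
        return PEAK_LR                               # hold at peak
    # Cosine decay from PEAK_LR down to MIN_LR over the remaining steps.
    progress = min(1.0, (step - decay_start) / max(1, total_steps - decay_start))
    return MIN_LR + 0.5 * (PEAK_LR - MIN_LR) * (1 + math.cos(math.pi * progress))


for s in (0, 199, 600, 5_000, 10_000):
    print(s, f"{lr_at(s):.2e}")
```

The SFT schedule has the same shape without the plateau, which corresponds to `plateau_frac=0` with the SFT peak, minimum, and warmup values substituted.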

Phase 2: Supervised Fine-Tuning (SFT)

Datasets	Kyoto-Corpus + LMSYS-Chat-1M + Nomi-150M-Chat + Chat-Compilation
Context Length	1024
Objective	Masked cross-entropy (assistant tokens only)
Chat Template	Llama-3 style with header tokens
Peak LR	5 × 10^-5
Min LR	5 × 10^-6
Warmup	100 steps
Schedule	Warmup → Cosine Decay
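
Assistant-only loss masking can be sketched by building a label sequence in which every non-assistant position is set to an ignore value. The turn layout and helper name are assumptions for illustration; -100 is the default `ignore_index` of PyTorch's cross-entropy loss. Whether header tokens and the closing end-of-turn token count toward the loss is not stated here, so this version keeps the assistant content plus the end-of-turn token and masks everything else.

```python
IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss
EOT_ID = 6           # end-of-turn token ID


def build_labels(turns):
    """turns: list of (role, header_ids, content_ids), already tokenized.

    Returns (input_ids, labels); labels equal input_ids on assistant content
    and its closing end-of-turn token, and IGNORE_INDEX everywhere else.
    """
    input_ids, labels = [], []
    for role, header_ids, content_ids in turns:
        body = content_ids + [EOT_ID]
        input_ids += header_ids + body
        labels += [IGNORE_INDEX] * len(header_ids)
        labels += body if role == "assistant" else [IGNORE_INDEX] * len(body)
    return input_ids, labels


# Toy IDs: 4 and 5 are the header tokens; 100 and 101 stand in for role names.
ids, labels = build_labels([
    ("user", [4, 100, 5], [11, 12]),
    ("assistant", [4, 101, 5], [21, 22, 23]),
])
print(labels)
```

The usual one-position shift between inputs and targets is left to the loss computation.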

Shared Training Config

Optimizer	AdamW (fused), β₁=0.9, β₂=0.95, ε=10^-8
Weight Decay	0.0
Gradient Clipping	1.0 (global norm)
Precision	bfloat16 autocast
Compilation	Optional `torch.compile` (max-autotune)
Multi-GPU	Automatic DDP when ≥2 GPUs detected

SFT Datasets

Dataset	Source	Notes
Kyoto-Corpus	Nikity/Kyoto-Corpus	Multi-turn instruction pairs
LMSYS-Chat-1M	lmsys/lmsys-chat-1m	Real-world conversations (redacted rows skipped)
Nomi-150M-Chat	guus4324343/Nomi-150M-Chat	Synthetic chat data
Chat-Compilation	aklein4/chat-compilation	Multi-source compilation (system-prompt conversations excluded)

Quick Start

Installation

pip install torch safetensors tokenizers huggingface_hub

Run

wget https://huggingface.co/kerzgrr/monostich/resolve/main/inference.py
python inference.py

The script downloads the model, tokenizer, and config from Hugging Face automatically (cached after first run).

Usage

Interactive chat (default):

python inference.py

Single prompt:

python inference.py --prompt "What is the capital of France?"

Options:

Flag	Default	Description
`--prompt`	None	Single prompt (omit for interactive REPL)
`--temperature`	0.28	Sampling temperature
`--top-p`	0.95	Nucleus sampling threshold
`--max-new-tokens`	context max	Max tokens to generate
`--device`	cuda	Device (`cuda` or `cpu`)
`--seed`	1234	Random seed
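
How `--temperature` and `--top-p` interact can be sketched in plain Python. This is a generic nucleus-sampling illustration, not the code in `inference.py`; the function name and the toy logits are made up.

```python
import math
import random


def sample_top_p(logits, temperature=0.28, top_p=0.95, rng=random):
    # Temperature-scaled softmax (shifted by the max for numerical stability).
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep the smallest set of highest-probability tokens whose mass reaches top_p.
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break

    # choices() normalizes the weights over the kept tokens.
    return rng.choices(kept, weights=[probs[i] for i in kept], k=1)[0]


rng = random.Random(1234)  # same value as the --seed default
token = sample_top_p([2.0, 1.0, 0.5, -1.0], rng=rng)
print(token)
```

With the default temperature of 0.28 the distribution is sharpened considerably: in this toy case the top token alone carries more than 95% of the probability mass, so it is the only candidate left after the top-p cut.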

Model Family

Model	Parameters	Context	Status
Monostich	~100M	1024	Available
Couplet	~200M	1024	Training

Limitations

Scale: At 100M parameters this model is a research prototype, not a production system

File Contents

kerzgrr/monostich/
  README.md                # This model card
  inference.py             # Standalone inference script
  monostich.safetensors    # Weights (bfloat16, SafeTensors)
  config.json              # Model architecture config
  tokenizer.json           # BPE tokenizer (HuggingFace format)
  tokenizer_config.json    # Tokenizer metadata
  special_token_ids.json   # Token ID mapping
  special_tokens_map.json  # Token string mapping

Citation

@misc{monostich2026,
  title={Monostich: A Compact Instruction-Tuned Language Model},
  year={2026},
  url={https://huggingface.co/kerzgrr/monostich}
}

Acknowledgments

Built on:

LLaMA architecture (Meta AI)
FineWeb-Edu dataset (HuggingFace)
Wikipedia dataset (Wikimedia)
Kyoto-Corpus (Nikity)
LMSYS-Chat-1M (LMSYS)
Nomi-150M-Chat (guus4324343)
Chat-Compilation (aklein4)
PyTorch SDPA / Flash Attention
HuggingFace tokenizers and hub

A monostich is a poem of a single line — small, but complete.
