Instructions to use codelion/sprog-9m with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- MLX
How to use codelion/sprog-9m with MLX:
# Download the model from the Hub pip install huggingface_hub[hf_xet] huggingface-cli download --local-dir sprog-9m codelion/sprog-9m
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- LM Studio
SPROG-9M
Symbolic PROGram solver — a 9.37M-parameter, from-scratch model that solves grade-school math word problems without any LLM at inference time.
Instead of generating text, SPROG abstracts the numbers in a question to slots
([N0], [N1], …) and predicts a postfix program over them, which is then executed
symbolically. It draws 96 temperature samples and selects a single answer with a free
symbolic verifier (0 trainable parameters) that never sees the ground truth — i.e.
self-consistency selection, not a pass@k oracle. The whole thing runs on a
CPU/Apple-Silicon GPU via MLX.
Trained on codelion/gsm8k-synth.
Quick start
pip install mlx numpy huggingface_hub
huggingface-cli download codelion/sprog-9m --local-dir ./sprog-9m
python sprog-9m/inference.py -q "A baker had 24 muffins. She sold 3/4 of them, then baked 10 more. How many muffins does she have now?"
# Answer: 16.0
from huggingface_hub import snapshot_download
from pathlib import Path
import sys
p = snapshot_download("codelion/sprog-9m"); sys.path.insert(0, p)
from inference import load_model, solve
model, stoi, cfg = load_model(Path(p))
print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12. How many does he have?"))
# 30.0
Results
Evaluated on the full GSM8K test set (1,319 problems), averaged over 3 training seeds.
The model commits to one answer per question. It draws 96 temperature samples, then a 0-parameter symbolic verifier picks a single answer without ever seeing the gold answer. This is a single-answer accuracy (the self-consistency / maj@k family) — not a pass@k oracle.
| metric | GSM8K test | what it measures |
|---|---|---|
| verifier @ 96 (headline) | 11.8% (best seed 12.6%) | verifier commits to one answer; gold never used |
| plurality @ 96 | ≈9.3% | most-voted answer; gold never used |
| pass@96 (oracle) | ≈39% | gold is somewhere in the 96 samples — an upper bound that uses the gold to check |
| trainable parameters | 9.37M | — |
| LLM used at inference | none | — |
So 11.8% is a committed single answer, not pass@96 (that would be the ≈39% oracle). The free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a single decode. Results are stable across seeds (range 11.1–12.6%).
Why 96 samples? Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288), but the verifier's conversion peaks around 64–96 then declines — extra samples add plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is the sweet spot between recall and selectability.
How it was built
- Number-slot abstraction. The model never sees raw numbers — they become slots, so it learns program structure, not arithmetic, and generalizes across values.
- Symbolic program target. It predicts a postfix program (
[N0] [N1] * [N2] -) executed by a tiny deterministic evaluator. - Self-consistency + free verifier. 96 sampled programs are scored by a 0-parameter symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken by vote frequency.
- Data is the main lever. Trained on real GSM8K-train plus ≈117K LLM-generated GSM8K-style problems (Claude + Gemini). What mattered most was matching the real GSM8K step-distribution and rigorous decontamination (0% test overlap), not raw data volume or model size — a deeper/bigger model did not help beyond noise.
Constraints (by design)
- ≤10M trainable parameters (9.37M)
- From scratch — no pretrained weights, no frozen LLM, no linear probe on a foundation model
- LLM-free at inference — pure MLX + symbolic execution
- Decontaminated — training data has 0% 8-gram overlap with the GSM8K test set
Files
| file | purpose |
|---|---|
model.npz |
MLX weights (9.37M, d=304, 4 enc + 4 dec layers) |
config.json |
architecture config |
src_vocab.json |
source vocabulary (6,000 tokens) |
inference.py |
self-contained inference (model + slot tokenizer + verifier) |
Why not mlx-lm / AutoModel?
SPROG is a custom encoder-decoder seq2seq with a slot tokenizer and a symbolic
decode + verify pipeline — not a standard causal LM. So it ships its own inference.py
rather than loading through mlx-lm or transformers.AutoModel. The script has no
dependencies beyond mlx and numpy.
Limitations
This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it misses many multi-step problems that require deeper reading comprehension. It is not a general math model and should not be used as one.
License
MIT.
- Downloads last month
- 45