SPROG-9M

Symbolic PROGram solver — a 9.37M-parameter, from-scratch model that solves grade-school math word problems without any LLM at inference time.

Instead of generating text, SPROG abstracts the numbers in a question to slots ([N0], [N1], …) and predicts a postfix program over them, which is then executed symbolically. It draws 96 temperature samples and selects a single answer with a free symbolic verifier (0 trainable parameters) that never sees the ground truth — i.e. self-consistency selection, not a pass@k oracle. The whole thing runs on a CPU/Apple-Silicon GPU via MLX.

Trained on codelion/gsm8k-synth.

Quick start

pip install mlx numpy huggingface_hub
huggingface-cli download codelion/sprog-9m --local-dir ./sprog-9m
python sprog-9m/inference.py -q "A baker had 24 muffins. She sold 3/4 of them, then baked 10 more. How many muffins does she have now?"
# Answer: 16.0

from huggingface_hub import snapshot_download
from pathlib import Path
import sys

p = snapshot_download("codelion/sprog-9m"); sys.path.insert(0, p)
from inference import load_model, solve

model, stoi, cfg = load_model(Path(p))
print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12. How many does he have?"))
# 30.0

Results

Evaluated on the full GSM8K test set (1,319 problems), averaged over 3 training seeds.

The model commits to one answer per question. It draws 96 temperature samples, then a 0-parameter symbolic verifier picks a single answer without ever seeing the gold answer. This is a single-answer accuracy (the self-consistency / maj@k family) — not a pass@k oracle.

metric	GSM8K test	what it measures
verifier @ 96 (headline)	11.8% (best seed 12.6%)	verifier commits to one answer; gold never used
plurality @ 96	≈9.3%	most-voted answer; gold never used
pass@96 (oracle)	≈39%	gold is somewhere in the 96 samples — an upper bound that uses the gold to check
trainable parameters	9.37M	—
LLM used at inference	none	—

So 11.8% is a committed single answer, not pass@96 (that would be the ≈39% oracle). The free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a single decode. Results are stable across seeds (range 11.1–12.6%).

Why 96 samples? Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288), but the verifier's conversion peaks around 64–96 then declines — extra samples add plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is the sweet spot between recall and selectability.

How it was built

Number-slot abstraction. The model never sees raw numbers — they become slots, so it learns program structure, not arithmetic, and generalizes across values.
Symbolic program target. It predicts a postfix program ([N0] [N1] * [N2] -) executed by a tiny deterministic evaluator.
Self-consistency + free verifier. 96 sampled programs are scored by a 0-parameter symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken by vote frequency.
Data is the main lever. Trained on real GSM8K-train plus ≈117K LLM-generated GSM8K-style problems (Claude + Gemini). What mattered most was matching the real GSM8K step-distribution and rigorous decontamination (0% test overlap), not raw data volume or model size — a deeper/bigger model did not help beyond noise.

Constraints (by design)

≤10M trainable parameters (9.37M)
From scratch — no pretrained weights, no frozen LLM, no linear probe on a foundation model
LLM-free at inference — pure MLX + symbolic execution
Decontaminated — training data has 0% 8-gram overlap with the GSM8K test set

Files

file	purpose
`model.npz`	MLX weights (9.37M, d=304, 4 enc + 4 dec layers)
`config.json`	architecture config
`src_vocab.json`	source vocabulary (6,000 tokens)
`inference.py`	self-contained inference (model + slot tokenizer + verifier)

Why not `mlx-lm` / `AutoModel`?

SPROG is a custom encoder-decoder seq2seq with a slot tokenizer and a symbolic decode + verify pipeline — not a standard causal LM. So it ships its own inference.py rather than loading through mlx-lm or transformers.AutoModel. The script has no dependencies beyond mlx and numpy.

Limitations

This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it misses many multi-step problems that require deeper reading comprehension. It is not a general math model and should not be used as one.

License

MIT.

Downloads last month: 45

codelion
/

sprog-9m

SPROG-9M

Quick start

Results

How it was built

Constraints (by design)

Files

Why not `mlx-lm` / `AutoModel`?

Limitations

License

Dataset used to train codelion/sprog-9m

SPROG-9M

Quick start

Results

How it was built

Constraints (by design)

Files

Why not mlx-lm / AutoModel?

Limitations

License

Dataset used to train codelion/sprog-9m

Why not `mlx-lm` / `AutoModel`?