SPROG-9M

Symbolic PROGram solver — a 9.37M-parameter, from-scratch model that solves grade-school math word problems without any LLM at inference time.

Instead of generating text, SPROG abstracts the numbers in a question to slots ([N0], [N1], …) and predicts a postfix program over them, which is then executed symbolically. It draws 96 temperature samples and selects a single answer with a free symbolic verifier (0 trainable parameters) that never sees the ground truth — i.e. self-consistency selection, not a pass@k oracle. The whole thing runs on a CPU/Apple-Silicon GPU via MLX.

Trained on codelion/gsm8k-synth.

Quick start

pip install mlx numpy huggingface_hub
huggingface-cli download codelion/sprog-9m --local-dir ./sprog-9m
python sprog-9m/inference.py -q "A baker had 24 muffins. She sold 3/4 of them, then baked 10 more. How many muffins does she have now?"
# Answer: 16.0
from huggingface_hub import snapshot_download
from pathlib import Path
import sys

p = snapshot_download("codelion/sprog-9m"); sys.path.insert(0, p)
from inference import load_model, solve

model, stoi, cfg = load_model(Path(p))
print(solve(model, stoi, "Tom has 15 apples. He buys 27 more, then gives away 12. How many does he have?"))
# 30.0

Results

Evaluated on the full GSM8K test set (1,319 problems), averaged over 3 training seeds.

The model commits to one answer per question. It draws 96 temperature samples, then a 0-parameter symbolic verifier picks a single answer without ever seeing the gold answer. This is a single-answer accuracy (the self-consistency / maj@k family) — not a pass@k oracle.

metric GSM8K test what it measures
verifier @ 96 (headline) 11.8% (best seed 12.6%) verifier commits to one answer; gold never used
plurality @ 96 ≈9.3% most-voted answer; gold never used
pass@96 (oracle) ≈39% gold is somewhere in the 96 samples — an upper bound that uses the gold to check
trainable parameters 9.37M
LLM used at inference none

So 11.8% is a committed single answer, not pass@96 (that would be the ≈39% oracle). The free symbolic verifier adds ≈+2.5 points over majority voting, at 96× the inference cost of a single decode. Results are stable across seeds (range 11.1–12.6%).

Why 96 samples? Recall rises with sample count (≈39% gold-in-pool at 96 → ≈50% at 288), but the verifier's conversion peaks around 64–96 then declines — extra samples add plausible-but-wrong distractors that hurt selection (measured: 192 → 8.5%, 288 → 8.3%). 96 is the sweet spot between recall and selectability.

How it was built

  • Number-slot abstraction. The model never sees raw numbers — they become slots, so it learns program structure, not arithmetic, and generalizes across values.
  • Symbolic program target. It predicts a postfix program ([N0] [N1] * [N2] -) executed by a tiny deterministic evaluator.
  • Self-consistency + free verifier. 96 sampled programs are scored by a 0-parameter symbolic verifier (number-coverage, magnitude sanity, intermediate-value sanity), tie-broken by vote frequency.
  • Data is the main lever. Trained on real GSM8K-train plus ≈117K LLM-generated GSM8K-style problems (Claude + Gemini). What mattered most was matching the real GSM8K step-distribution and rigorous decontamination (0% test overlap), not raw data volume or model size — a deeper/bigger model did not help beyond noise.

Constraints (by design)

  • ≤10M trainable parameters (9.37M)
  • From scratch — no pretrained weights, no frozen LLM, no linear probe on a foundation model
  • LLM-free at inference — pure MLX + symbolic execution
  • Decontaminated — training data has 0% 8-gram overlap with the GSM8K test set

Files

file purpose
model.npz MLX weights (9.37M, d=304, 4 enc + 4 dec layers)
config.json architecture config
src_vocab.json source vocabulary (6,000 tokens)
inference.py self-contained inference (model + slot tokenizer + verifier)

Why not mlx-lm / AutoModel?

SPROG is a custom encoder-decoder seq2seq with a slot tokenizer and a symbolic decode + verify pipeline — not a standard causal LM. So it ships its own inference.py rather than loading through mlx-lm or transformers.AutoModel. The script has no dependencies beyond mlx and numpy.

Limitations

This is a research model demonstrating how far a tiny, LLM-free, from-scratch solver can go on GSM8K (≈12%). It handles 1–4 step arithmetic word problems with common operations; it misses many multi-step problems that require deeper reading comprehension. It is not a general math model and should not be used as one.

License

MIT.

Downloads last month
45
Inference Providers NEW
This model isn't deployed by any Inference Provider. 🙋 Ask for provider support

Dataset used to train codelion/sprog-9m