Trim README
README.md CHANGED
@@ -24,10 +24,7 @@ snippet across **107 languages**.
 
 ## Inference
 
-
-ONNX export is **float16** (`onnx/model.onnx`, ~286 MB).
-
-### Single snippet (PyTorch, bf16)
+### PyTorch
 
 ```python
 import torch
@@ -48,10 +45,6 @@ with torch.no_grad():
 print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
 ```
 
-Passing `torch_dtype=torch.bfloat16` keeps the weights in bf16 for inference.
-Omit it to run in fp32 (HF will upcast). `attn_implementation="eager"` is
-recommended: the SDPA path has produced NaNs on some ModernBERT builds.
-
 ### Batch
 
 ```python
@@ -65,11 +58,10 @@ for i, pred in enumerate(logits.argmax(-1).tolist()):
     print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
 ```
 
-### ONNX Runtime
+### ONNX Runtime
 
-An ONNX export
-
-deployments.
+An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
+pulling PyTorch: handy for non-Python consumers and edge deployments.
 
 ```python
 from optimum.onnxruntime import ORTModelForSequenceClassification
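The deleted note after the ONNX snippet reports accuracy parity between the fp16 export and bf16 PyTorch (0.940 vs 0.940). A parity check like that boils down to scoring each backend's predicted labels against the same gold labels; here is a minimal backend-free sketch, where the prediction lists are stand-ins rather than real model outputs:

```python
# Hypothetical sketch of an fp16-ONNX vs bf16-PyTorch parity check:
# score both backends' predicted labels against shared gold labels.
# The lists below are illustrative stand-ins, not real model outputs.

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

gold          = ["Python", "C", "Go", "Rust", "Python"]
pytorch_preds = ["Python", "C", "Go", "Rust", "Java"]   # from the bf16 PyTorch model
onnx_preds    = ["Python", "C", "Go", "Rust", "Java"]   # from the fp16 ONNX export

pt_acc = accuracy(pytorch_preds, gold)
ort_acc = accuracy(onnx_preds, gold)
print(f"PyTorch {pt_acc:.3f} vs ONNX {ort_acc:.3f}")  # equal scores -> export parity
```

In practice both prediction lists would come from running the same tokenized evaluation sample through each backend.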
@@ -86,23 +78,7 @@ logits = ort_model(**inputs).logits
 print(ort_model.config.id2label[int(logits.argmax(-1))])
 ```
 
-
-sample: fp16 ONNX accuracy 0.940 vs bf16 PyTorch 0.940).
-
-### Top-k with confidence
-
-```python
-probs = logits.softmax(-1)
-top_probs, top_ids = probs.topk(3, dim=-1)
-for prob, label_id in zip(top_probs[0].tolist(), top_ids[0].tolist()):
-    print(f"{model.config.id2label[label_id]:30s} {prob:.3f}")
-```
-
-Tip: trim the snippet to its first 512 characters before tokenizing: the
-model was trained and evaluated with the `head` strategy at inference time.
-
-See [`notebooks/inference_examples.ipynb`](../../../workspace/projects/accuknox/guardrail/code-language-id/notebooks/inference_examples.ipynb)
-for a runnable walkthrough.
+**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)**: download and run in Colab or Jupyter.
 
 ## Evaluation
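The trimmed "Top-k with confidence" snippet relied on torch tensors. The same ranking can be sketched in plain Python, which also makes the softmax step explicit; the logits and label map below are made up for illustration (the real values come from `model(**inputs).logits` and `model.config.id2label`):

```python
import math

# Hypothetical raw logits for one snippet and an illustrative label map.
logits = [4.1, 1.3, 0.2, -0.5]
id2label = {0: "Python", 1: "Ruby", 2: "Perl", 3: "Lua"}

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Top-3 (label-id, probability) pairs, highest confidence first.
top3 = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:3]
for label_id, prob in top3:
    print(f"{id2label[label_id]:30s} {prob:.3f}")
```

The subtraction of `max(xs)` before exponentiating is what `logits.softmax(-1)` does internally to avoid overflow.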
@@ -149,21 +125,9 @@ mislabels between mainstream languages was removed.
 Splits are grouped by task to prevent task-level leakage:
 72,549 / 9,495 / 8,880 rows (train / val / test).
 
-## Training
-
-- Base: `answerdotai/ModernBERT-base`
-- 10 epochs, bf16, AdamW (lr 2e-5, weight decay 0.01)
-- Linear schedule, 6% warmup
-- Attention: eager
-- Augmentation: random-window over the source code per training step
-  (window length ∈ [64, 512] chars, random offset). Evaluation takes the
-  first 512 characters.
-
 ## Limitations
 
--
-
--
-
-- The classifier is purely content-based: it does not read file extensions
-  or shebang lines. In production, combine it with extension heuristics.
+- Only the first **512 characters** of each input are used: longer files are
+  truncated before classification.
+- The classifier is purely content-based. If you have file extensions, treat
+  them as a strong prior in a production pipeline.
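The two practices the Limitations section points at, the 512-character `head` truncation and treating file extensions as a prior, can be combined into one small production wrapper. Everything here (`classify`, the 0.5 confidence threshold, the extension table) is a hypothetical stand-in, not part of the released model:

```python
# Sketch of a production wrapper around the classifier: apply the `head`
# strategy (first 512 characters) before tokenizing, and let a file-extension
# heuristic override low-confidence model output. `classify` and the
# extension table are illustrative stand-ins.

HEAD_CHARS = 512
EXT_PRIOR = {".py": "Python", ".rs": "Rust", ".go": "Go"}  # illustrative subset

def head(snippet: str, limit: int = HEAD_CHARS) -> str:
    """Trim to the first `limit` characters, matching how the model was evaluated."""
    return snippet[:limit]

def identify(path: str, source: str, classify) -> str:
    """Classify `source`, falling back to the extension when the model is unsure."""
    label, confidence = classify(head(source))
    ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
    # Content-only model: a known extension is a strong prior when confidence is low.
    if confidence < 0.5 and ext in EXT_PRIOR:
        return EXT_PRIOR[ext]
    return label

# Usage with a stub classifier standing in for the real model:
stub = lambda text: ("Perl", 0.31)
print(identify("script.py", "print('hi')", stub))  # low confidence -> "Python"
```

The threshold and table would be tuned per deployment; the point is only the shape of the combination.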