# Programming Language Identification (100+ languages)

A ModernBERT classifier that identifies the programming language of a code snippet across 107 languages.
## Inference

### PyTorch
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
).eval()

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```
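Beyond the argmax label, a softmax over the logits gives per-language confidence scores, which are useful for thresholding ambiguous snippets. A minimal sketch with stand-in logits and a toy `id2label` map (in practice, use `model(**inputs).logits` and `model.config.id2label` from the snippet above):

```python
import torch

# Stand-in values for illustration; the real model has 107 labels.
id2label = {0: "C", 1: "Go", 2: "Python", 3: "Ruby", 4: "Rust"}
logits = torch.tensor([[1.2, 0.3, 4.1, -0.5, 2.2]])

probs = torch.softmax(logits, dim=-1)  # normalize logits to probabilities
top = torch.topk(probs, k=3, dim=-1)   # three most likely languages
for p, idx in zip(top.values[0].tolist(), top.indices[0].tolist()):
    print(f"{id2label[idx]}: {p:.3f}")
```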
### Batch
```python
snippets = [py_code, rust_code, go_code]  # list of strings
inputs = tokenizer(
    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
for i, pred in enumerate(logits.argmax(-1).tolist()):
    print(snippets[i].splitlines()[0][:40], "→", model.config.id2label[pred])
```
### ONNX Runtime

An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without pulling in PyTorch, which is handy for non-Python consumers and edge deployments.
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, subfolder="onnx"
)

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```
An inference notebook is also included; download it and run it in Colab or Jupyter.
## Evaluation

Held-out validation split (9,495 rows, 107 labels):

| metric   | value  |
|----------|--------|
| macro F1 | 0.9206 |
| accuracy | 0.9306 |
It wins on every shared label; the largest F1 gaps are ARM Assembly +0.354, Erlang +0.270, COBOL +0.216, Pascal +0.206, Fortran +0.193, and Mathematica/Wolfram +0.173.
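Macro F1 averages per-label F1 with equal weight, so all 107 labels count equally regardless of how many validation rows they have. A self-contained sketch of the metric (toy labels, not the real evaluation data):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-label F1, so rare languages weigh as much as common ones."""
    labels = set(y_true) | set(y_pred)
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for lab in labels:
        prec = tp[lab] / (tp[lab] + fp[lab]) if tp[lab] + fp[lab] else 0.0
        rec = tp[lab] / (tp[lab] + fn[lab]) if tp[lab] + fn[lab] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(labels)

# One rare-language error drags macro F1 far below accuracy (0.75 here).
y_true = ["Python", "Python", "Python", "Erlang"]
y_pred = ["Python", "Python", "Python", "Elixir"]
print(macro_f1(y_true, y_pred))  # one perfect label out of three -> ≈ 0.33
```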
## Supported languages (107)
ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey, AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon, Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D, Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom, Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java, JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB, MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle, NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp, PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Racket, Raku, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk, Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET, Wren, Zig, jq
## Training data
91,209 code samples across 107 languages, drawn from Rosetta Code
(cakiki/rosetta-code) and The Stack v1 (bigcode/the-stack). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.
Splits are grouped by task to prevent task-level leakage: 72,549 / 9,495 / 8,880 rows (train / val / test).
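The card does not say how the grouped split was implemented; one common way to keep every solution of a Rosetta Code task in the same split is to bucket deterministically on a hash of the task id, as in this sketch (the field names and split fractions here are hypothetical, not the card's actual procedure):

```python
import hashlib

def split_for(task_id: str, val_frac=0.10, test_frac=0.10) -> str:
    """Deterministically assign a whole task to train/val/test so that
    no task's solutions appear in more than one split."""
    h = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 10_000
    if h < val_frac * 10_000:
        return "val"
    if h < (val_frac + test_frac) * 10_000:
        return "test"
    return "train"

samples = [
    {"task": "fizzbuzz", "lang": "Python"},
    {"task": "fizzbuzz", "lang": "Rust"},  # same task -> always same split
    {"task": "quine",    "lang": "Go"},
]
for s in samples:
    print(s["task"], s["lang"], "->", split_for(s["task"]))
```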
## Limitations
- Only the first 512 tokens of each input are used; longer files are truncated before classification.
- The classifier is purely content-based. If you have file extensions, treat them as a strong prior in a production pipeline.
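One way to apply such a prior is a simple score blend: boost the language hinted by the file extension and let the classifier override it only when it is decisive. Everything in this sketch (`EXT_PRIOR`, `pick_language`, the weight) is illustrative, not part of the model:

```python
import os

# Illustrative subset of an extension-to-language prior.
EXT_PRIOR = {".py": "Python", ".rs": "Rust", ".erl": "Erlang"}

def pick_language(path: str, model_probs: dict, prior_weight: float = 0.6) -> str:
    """Blend the classifier's probability distribution with an extension prior."""
    ext = os.path.splitext(path)[1]
    scores = dict(model_probs)
    hinted = EXT_PRIOR.get(ext)
    if hinted is not None:
        # Boost the extension-hinted language; a very confident model
        # prediction elsewhere can still win.
        scores[hinted] = scores.get(hinted, 0.0) + prior_weight
    return max(scores, key=scores.get)

# The extension breaks a near-tie between two similar languages.
print(pick_language("solve.erl", {"Erlang": 0.40, "Elixir": 0.45}))  # -> Erlang
```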
Base model: answerdotai/ModernBERT-base