---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- code
- programming-language-identification
- language-detection
- modernbert
base_model: answerdotai/ModernBERT-base
datasets:
- cakiki/rosetta-code
- bigcode/the-stack
metrics:
- accuracy
- f1
---

# Programming Language Identification (100+ languages)

A ModernBERT classifier that identifies the programming language of a code
snippet across **107 languages**.

## Inference

### PyTorch

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
).eval()

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```

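### Batch

```python
# Reuses `torch`, `tokenizer`, and `model` from the PyTorch example above.
# Placeholder inputs; substitute your own snippets.
py_code = "def add(a, b):\n    return a + b"
rust_code = 'fn main() {\n    println!("hello");\n}'
go_code = 'package main\n\nfunc main() {\n\tprintln("hello")\n}'
snippets = [py_code, rust_code, go_code]  # list of strings
inputs = tokenizer(
    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
for i, pred in enumerate(logits.argmax(-1).tolist()):
    print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
```
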
### ONNX Runtime

An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
pulling PyTorch, which is handy for non-Python consumers and edge deployments.
Note that the Optimum example below still creates PyTorch tensors
(`return_tensors="pt"`), so it is a convenience path rather than a
PyTorch-free one.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, subfolder="onnx"
)

# Same example snippet as in the PyTorch section.
code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```

**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)**: download and run in Colab or Jupyter.

## Evaluation

Held-out validation split (9,495 rows, 107 labels):

| metric | value |
|---|---|
| macro F1 | **0.9206** |
| accuracy | 0.9306 |

Head-to-head vs `philomath-1209/programming-language-identification` on the 26
labels both models support (3,057 test rows):

| model | accuracy | macro F1 |
|---|---|---|
| **this model** | **0.9444** | **0.9636** |
| philomath-1209 | 0.8449 | 0.8445 |

Wins on every shared label. Largest gaps: ARM Assembly +0.354, Erlang +0.270,
COBOL +0.216, Pascal +0.206, Fortran +0.193, Mathematica/Wolfram +0.173.

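For readers less familiar with the two headline metrics, here is a minimal pure-Python sketch of how accuracy and macro F1 are typically computed from gold and predicted labels. The function names and the toy labels are illustrative assumptions, not the evaluation script behind the numbers above.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of examples whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # missed the true label g
    labels = set(gold) | set(pred)
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data (hypothetical), not taken from the model's evaluation set.
gold = ["Python", "Python", "Rust", "Go"]
pred = ["Python", "Rust", "Rust", "Go"]
print(accuracy(gold, pred))
print(round(macro_f1(gold, pred), 4))
```

Macro F1 gives every label equal weight, so rare languages count as much as common ones. That is why it can sit above or below accuracy on the same split.
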
## Supported languages (107)

ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
Wren, Zig, jq

## Training data

91,209 code samples across 107 languages, drawn from Rosetta Code
(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.

Splits are grouped by task to prevent task-level leakage:
72,549 / 9,495 / 8,880 rows (train / val / test).

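One common way to get task-grouped splits is to assign each task, rather than each row, to a split, so every solution to the same task lands on the same side. The sketch below does this with a stable hash of the task name. The `task` field, the 80/10/10 ratios, and the hashing scheme are assumptions for illustration, not the procedure used to build this dataset.

```python
import hashlib


def split_for_task(task: str) -> str:
    """Map a task name to train/val/test deterministically (assumed 80/10/10)."""
    digest = hashlib.sha256(task.encode("utf-8")).digest()
    bucket = digest[0] * 256 + digest[1]  # integer in 0..65535
    frac = bucket / 65536
    if frac < 0.8:
        return "train"
    if frac < 0.9:
        return "val"
    return "test"


# Hypothetical rows: several languages solving the same task.
rows = [
    {"task": "FizzBuzz", "language": "Python"},
    {"task": "FizzBuzz", "language": "Rust"},
    {"task": "Quicksort", "language": "Go"},
    {"task": "Quicksort", "language": "Haskell"},
]
splits = {"train": [], "val": [], "test": []}
for row in rows:
    splits[split_for_task(row["task"])].append(row)

# Sanity check: no task appears in more than one split.
seen = {}
for name, members in splits.items():
    for row in members:
        assert seen.setdefault(row["task"], name) == name
```

Because the assignment depends only on the task name, re-running the split on new rows for an existing task cannot move that task across the train/test boundary.
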
## Limitations

- Inputs are truncated to **512 tokens** (`truncation=True, max_length=512` in
  the examples above), so only the beginning of a longer file is classified.
- The classifier is purely content-based. If you have file extensions, treat
  them as a strong prior in a production pipeline.

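One way to use file extensions as a prior is to multiply the classifier's probabilities by a per-extension weight and renormalise. The sketch below is a hypothetical illustration: the `EXTENSION_PRIOR` table, the `boost` value, and the example probabilities are all assumptions, and in practice the probabilities would come from a softmax over the model's logits.

```python
import os

# Hypothetical mapping from file extension to plausible labels.
EXTENSION_PRIOR = {
    ".py": {"Python"},
    ".m": {"MATLAB", "Objective-C", "Mathematica/Wolfram Language"},
    ".rs": {"Rust"},
}


def rerank(probs, filename, boost=5.0):
    """Reweight label probabilities using the file extension, then renormalise.

    Labels consistent with the extension are multiplied by `boost` (an assumed
    value); unknown extensions leave the distribution unchanged.
    """
    ext = os.path.splitext(filename)[1].lower()
    favoured = EXTENSION_PRIOR.get(ext)
    if not favoured:
        return dict(probs)
    weighted = {
        label: p * (boost if label in favoured else 1.0)
        for label, p in probs.items()
    }
    total = sum(weighted.values())
    return {label: w / total for label, w in weighted.items()}


# Made-up classifier output for an ambiguous `.m` file.
probs = {"MATLAB": 0.40, "Scilab": 0.45, "Objective-C": 0.15}
reranked = rerank(probs, "solver.m")
best = max(reranked, key=reranked.get)
print(best)
```

A multiplicative boost keeps the classifier in charge when it is very confident, while letting the extension break near-ties between look-alike languages.
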