---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- code
- programming-language-identification
- language-detection
- modernbert
base_model: answerdotai/ModernBERT-base
datasets:
- cakiki/rosetta-code
- bigcode/the-stack
metrics:
- accuracy
- f1
---
# Programming Language Identification (100+ languages)
A ModernBERT classifier that identifies the programming language of a code
snippet across **107 languages**.
## Inference
### PyTorch
```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer
model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
model_id,
attn_implementation="eager",
torch_dtype=torch.bfloat16,
).eval()
code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))]) # -> "Python"
```
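The argmax alone carries no confidence signal. A softmax over the logits gives per-label probabilities, and `torch.topk` surfaces the runner-up candidates. The sketch below uses a toy 4-logit tensor so it runs standalone; the same post-processing applies unchanged to the model's `[1, 107]` output.

```python
import torch

# Toy logits standing in for model(**inputs).logits (batch of 1);
# with the real model the tensor shape is [1, 107].
logits = torch.tensor([[4.0, 1.5, 0.2, -1.0]])
probs = torch.softmax(logits, dim=-1)

# Top-3 candidate label ids with their probabilities.
top = torch.topk(probs[0], k=3)
for p, idx in zip(top.values.tolist(), top.indices.tolist()):
    # With the real model, map idx through model.config.id2label[idx].
    print(idx, f"{p:.3f}")
```

A low top-1 probability is a useful trigger for falling back to other signals (file extension, shebang line) in a pipeline.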
### Batch
```python
snippets = [py_code, rust_code, go_code] # list of strings
inputs = tokenizer(
snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
logits = model(**inputs).logits
for snippet, pred in zip(snippets, logits.argmax(-1).tolist()):
    print(snippet.splitlines()[0][:40], "→", model.config.id2label[pred])
```
### ONNX Runtime
An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
pulling PyTorch — handy for non-Python consumers and edge deployments.
```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer
model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
model_id, subfolder="onnx"
)
code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```
**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)** — download and run in Colab or Jupyter.
## Evaluation
Held-out validation split (9,495 rows, 107 labels):
| metric | value |
|---|---|
| macro F1 | **0.9206** |
| accuracy | 0.9306 |
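Macro F1 averages per-label F1 with equal weight, so rare languages count as much as Python; that is why it is reported alongside raw accuracy. A minimal illustration of the two metrics (assumes scikit-learn, which this repo does not otherwise require):

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy predictions over three labels; one Rust snippet is mistaken for Go.
y_true = ["Python", "Rust", "Rust", "Go"]
y_pred = ["Python", "Rust", "Go", "Go"]

print(accuracy_score(y_true, y_pred))                       # 0.75
print(round(f1_score(y_true, y_pred, average="macro"), 4))  # 0.7778
```

One error out of four rows costs 0.25 accuracy but drags two of the three per-label F1 scores down to 2/3, which is why macro F1 sits below accuracy here and in the table above.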
Head-to-head vs `philomath-1209/programming-language-identification` on the 26
labels both models support (3,057 test rows):
| model | accuracy | macro F1 |
|---|---|---|
| **this model** | **0.9444** | **0.9636** |
| philomath-1209 | 0.8449 | 0.8445 |
Wins on every shared label. Largest gaps: ARM Assembly +0.354, Erlang +0.270,
COBOL +0.216, Pascal +0.206, Fortran +0.193, Mathematica/Wolfram +0.173.
## Supported languages (107)
ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
Wren, Zig, jq
## Training data
91,209 code samples across 107 languages, drawn from Rosetta Code
(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.
Splits are grouped by task to prevent task-level leakage:
72,549 / 9,495 / 8,880 rows (train / val / test).
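Grouped splitting of this kind can be reproduced with scikit-learn's `GroupShuffleSplit` (a sketch under assumed task ids, not this repo's actual preprocessing code): every row sharing a task id lands in exactly one split, so the model never sees the same Rosetta task in both train and test.

```python
from sklearn.model_selection import GroupShuffleSplit

# Six samples from three tasks; each task's solutions must stay together.
rows = ["s0", "s1", "s2", "s3", "s4", "s5"]
tasks = ["fizzbuzz", "fizzbuzz", "quicksort", "quicksort", "quine", "quine"]

splitter = GroupShuffleSplit(n_splits=1, test_size=0.34, random_state=0)
train_idx, test_idx = next(splitter.split(rows, groups=tasks))

train_tasks = {tasks[i] for i in train_idx}
test_tasks = {tasks[i] for i in test_idx}
assert train_tasks.isdisjoint(test_tasks)  # no task straddles the split
```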
## Limitations
- Only the first **512 tokens** of each input are used; longer files are
  truncated before classification (matching `max_length=512` in the examples
  above).
- The classifier is purely content-based. If you have file extensions, treat
them as a strong prior in a production pipeline.
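A hypothetical sketch of that pipeline: trust an unambiguous extension outright and fall back to the classifier otherwise. `EXTENSION_PRIOR` and `classify` are illustrative names, not part of this repo.

```python
import os

# Illustrative prior; extend as needed. Ambiguous extensions (e.g. ".m" for
# MATLAB vs Objective-C vs Mercury) are deliberately left out so those files
# fall through to the model.
EXTENSION_PRIOR = {
    ".py": "Python",
    ".rs": "Rust",
    ".erl": "Erlang",
    ".cob": "COBOL",
}

def identify(path: str, source: str, classify) -> str:
    """Return the extension prior when unambiguous, else classify(source)."""
    ext = os.path.splitext(path)[1].lower()
    return EXTENSION_PRIOR.get(ext) or classify(source)

# With a stub classifier standing in for the ModernBERT model:
print(identify("lib.rs", "fn main() {}", lambda s: "unknown"))  # -> Rust
print(identify("script.m", "x = 1;", lambda s: "MATLAB"))       # -> MATLAB
```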