---
license: apache-2.0
library_name: transformers
pipeline_tag: text-classification
tags:
- text-classification
- code
- programming-language-identification
- language-detection
- modernbert
base_model: answerdotai/ModernBERT-base
datasets:
- cakiki/rosetta-code
- bigcode/the-stack
metrics:
- accuracy
- f1
---

# Programming Language Identification (100+ languages)

A ModernBERT classifier that identifies the programming language of a code
snippet across **107 languages**.

## Inference

### PyTorch

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    attn_implementation="eager",
    torch_dtype=torch.bfloat16,
).eval()

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```

### Batch

```python
snippets = [py_code, rust_code, go_code]  # list of strings
inputs = tokenizer(
    snippets, return_tensors="pt", padding=True, truncation=True, max_length=512
)
with torch.no_grad():
    logits = model(**inputs).logits
for i, pred in enumerate(logits.argmax(-1).tolist()):
    print(snippets[i].splitlines()[0][:40], "→", model.config.id2label[pred])
```

### ONNX Runtime

An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
pulling PyTorch — handy for non-Python consumers and edge deployments.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

model_id = "FrameByFrame/programming-language-identification-100plus"
tokenizer = AutoTokenizer.from_pretrained(model_id)
ort_model = ORTModelForSequenceClassification.from_pretrained(
    model_id, subfolder="onnx"
)

code = "def greet(name: str) -> None:\n    print(f'hello, {name}')"
inputs = tokenizer(code, return_tensors="pt", truncation=True, max_length=512)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```

**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)** — download and run in Colab or Jupyter.

## Evaluation

Held-out validation split (9,495 rows, 107 labels):

| metric | value |
|---|---|
| macro F1 | **0.9206** |
| accuracy | 0.9306 |

Head-to-head vs `philomath-1209/programming-language-identification` on the 26
labels both models support (3,057 test rows):

| model | accuracy | macro F1 |
|---|---|---|
| **this model** | **0.9444** | **0.9636** |
| philomath-1209 | 0.8449 | 0.8445 |

Wins on every shared label. Largest gaps: ARM Assembly +0.354, Erlang +0.270,
COBOL +0.216, Pascal +0.206, Fortran +0.193, Mathematica/Wolfram +0.173.
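
Macro F1 averages per-label F1 scores with equal weight, so rare languages count as much as common ones; that is why it sits below plain accuracy here. A minimal sketch of the computation, using tiny synthetic labels rather than the actual eval data:

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Average of per-label F1 scores, weighting every label equally."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[t] += 1  # true label t was missed
    f1s = []
    for label in set(y_true) | set(y_pred):
        prec = tp[label] / (tp[label] + fp[label]) if tp[label] + fp[label] else 0.0
        rec = tp[label] / (tp[label] + fn[label]) if tp[label] + fn[label] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

y_true = ["Python", "Rust", "Rust", "Go"]
y_pred = ["Python", "Rust", "Go", "Go"]
print(round(macro_f1(y_true, y_pred), 3))  # -> 0.778 (accuracy would be 0.75)
```

`sklearn.metrics.f1_score(..., average="macro")` computes the same quantity; the hand-rolled version above just makes the averaging explicit.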

## Supported languages (107)

ABAP, APL, ARM Assembly, ATS, Ada, ActionScript, AppleScript, AutoHotkey,
AutoIt, Awk, BASIC, BQN, Batchfile, Befunge, C, C#, C++, COBOL, Ceylon,
Clojure, CoffeeScript, ColdFusion, Common Lisp, Component Pascal, Crystal, D,
Dart, E, Eiffel, Elixir, Emacs Lisp, Erlang, Euphoria, F#, Factor, Fantom,
Forth, Fortran, FreeBASIC, GAP, Go, Groovy, Haskell, Haxe, IDL, Io, J, Java,
JavaScript, Julia, Kotlin, LabVIEW, LFE, Lasso, Logtalk, Lua, M, M4, MATLAB,
MAXScript, Mathematica/Wolfram Language, Mercury, Modula-2, Modula-3, Nemerle,
NewLisp, Nim, OCaml, Objective-C, Oz, PHP, Pascal, Perl, Pike, PicoLisp,
PowerShell, Processing, Prolog, PureBasic, Python, QuickBASIC, R, REXX, Raku,
Racket, Rebol, Red, Ring, Ruby, Rust, SAS, Scala, Scheme, Scilab, Smalltalk,
Standard ML, Stata, Swift, Tcl, V, VBA, VBScript, Vala, Visual Basic .NET,
Wren, Zig, jq

## Training data

91,209 code samples across 107 languages, drawn from Rosetta Code
(`cakiki/rosetta-code`) and The Stack v1 (`bigcode/the-stack`). Labels were
independently verified by an LLM judge, and a small set of high-confidence
mislabels between mainstream languages was removed.

Splits are grouped by task to prevent task-level leakage:
72,549 / 9,495 / 8,880 rows (train / val / test).
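
Grouping by task means every solution to a given Rosetta Code task lands in exactly one split, so the model cannot score on a task it memorized during training. A sketch of one way to produce such a split (the `task_id` field and hash-bucket scheme are illustrative, not the card's actual tooling; the 80/10/10 ratio approximates the row counts above):

```python
import hashlib

def split_for(task_id: str) -> str:
    """Assign an entire task to one split via a stable hash of its id."""
    bucket = int(hashlib.sha256(task_id.encode()).hexdigest(), 16) % 100
    if bucket < 80:
        return "train"
    if bucket < 90:
        return "val"
    return "test"

samples = [
    {"task_id": "fizzbuzz", "lang": "Python"},
    {"task_id": "fizzbuzz", "lang": "Rust"},  # same task -> same split
    {"task_id": "quine", "lang": "C"},
]
splits = {s["lang"]: split_for(s["task_id"]) for s in samples}
assert splits["Python"] == splits["Rust"]  # no task-level leakage
```

Hashing the group id keeps the assignment deterministic across runs, unlike a seeded shuffle that changes when rows are added.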

## Limitations

- Only the first **512 tokens** of each input are used (`max_length=512` in the
  examples above); longer files are truncated before classification.
- The classifier is purely content-based. If you have file extensions, treat
  them as a strong prior in a production pipeline.
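
One way to fold an extension into the decision is to add it as a log-prior on top of the model's logits before the argmax. A hedged sketch with synthetic logits and a toy label set (the extension-to-language map and prior weight here are illustrative, not shipped with the model):

```python
# Illustrative prior: boost languages consistent with the file extension.
EXT_PRIOR = {".py": {"Python": 2.0}, ".rb": {"Ruby": 2.0}, ".pl": {"Perl": 2.0}}

def predict(labels, logits, ext):
    """Add an extension-derived log-prior to the logits, then take the argmax."""
    prior = EXT_PRIOR.get(ext, {})
    scored = [z + prior.get(label, 0.0) for label, z in zip(labels, logits)]
    return labels[max(range(len(scored)), key=scored.__getitem__)]

labels = ["Python", "Ruby", "Perl"]        # stand-ins for model.config.id2label
print(predict(labels, [2.1, 1.9, 0.4], None))   # -> Python (content alone)
print(predict(labels, [1.9, 2.1, 0.4], ".py"))  # -> Python (prior flips a close call)
```

Because the prior is additive in logit space, a confident model prediction can still override a misleading extension, which is usually the behavior you want for misnamed files.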