Trim README
README.md CHANGED
@@ -24,10 +24,7 @@ snippet across **107 languages**.
 
 ## Inference
 
-
-ONNX export is **float16** (`onnx/model.onnx`, ~286 MB).
-
-### Single snippet (PyTorch, bf16)
+### PyTorch
 
 ```python
 import torch
@@ -48,10 +45,6 @@ with torch.no_grad():
 print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
 ```
 
-Passing `torch_dtype=torch.bfloat16` keeps the weights in bf16 for inference.
-Omit it to run in fp32 (HF will upcast). `attn_implementation="eager"` is
-recommended: the SDPA path has produced NaNs on some ModernBERT builds.
-
 ### Batch
 
 ```python
@@ -65,11 +58,10 @@ for i, pred in enumerate(logits.argmax(-1).tolist()):
     print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
 ```
 
-### ONNX Runtime
+### ONNX Runtime
 
-An ONNX export
-
-deployments.
+An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
+pulling PyTorch: handy for non-Python consumers and edge deployments.
 
 ```python
 from optimum.onnxruntime import ORTModelForSequenceClassification
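The deleted note after the ONNX snippet reports accuracy parity between the fp16 export and bf16 PyTorch (0.940 vs 0.940). A parity check like that boils down to scoring each backend's predicted labels against the same gold labels; here is a minimal backend-free sketch, where the prediction lists are stand-ins rather than real model outputs:

```python
# Hypothetical sketch of an fp16-ONNX vs bf16-PyTorch parity check:
# score both backends' predicted labels against shared gold labels.
# The lists below are illustrative stand-ins, not real model outputs.

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    assert len(preds) == len(golds)
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

gold          = ["Python", "C", "Go", "Rust", "Python"]
pytorch_preds = ["Python", "C", "Go", "Rust", "Java"]   # from the bf16 PyTorch model
onnx_preds    = ["Python", "C", "Go", "Rust", "Java"]   # from the fp16 ONNX export

pt_acc = accuracy(pytorch_preds, gold)
ort_acc = accuracy(onnx_preds, gold)
print(f"PyTorch {pt_acc:.3f} vs ONNX {ort_acc:.3f}")  # equal scores -> export parity
```

In practice both prediction lists would come from running the same tokenized evaluation sample through each backend.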
@@ -86,23 +78,7 @@ logits = ort_model(**inputs).logits
 print(ort_model.config.id2label[int(logits.argmax(-1))])
 ```
 
-
-sample: fp16 ONNX accuracy 0.940 vs bf16 PyTorch 0.940).
-
-### Top-k with confidence
-
-```python
-probs = logits.softmax(-1)
-top_probs, top_ids = probs.topk(3, dim=-1)
-for prob, label_id in zip(top_probs[0].tolist(), top_ids[0].tolist()):
-    print(f"{model.config.id2label[label_id]:30s} {prob:.3f}")
-```
-
-Tip: trim the snippet to its first 512 characters before tokenizing: the
-model was trained and evaluated with the `head` strategy at inference time.
-
-See [`notebooks/inference_examples.ipynb`](../../../workspace/projects/accuknox/guardrail/code-language-id/notebooks/inference_examples.ipynb)
-for a runnable walkthrough.
+**[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)**: download and run in Colab or Jupyter.
 
 ## Evaluation
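The trimmed "Top-k with confidence" snippet relied on torch tensors. The same ranking can be sketched in plain Python, which also makes the softmax step explicit; the logits and label map below are made up for illustration (the real values come from `model(**inputs).logits` and `model.config.id2label`):

```python
import math

# Hypothetical raw logits for one snippet and an illustrative label map.
logits = [4.1, 1.3, 0.2, -0.5]
id2label = {0: "Python", 1: "Ruby", 2: "Perl", 3: "Lua"}

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Top-3 (label-id, probability) pairs, highest confidence first.
top3 = sorted(enumerate(probs), key=lambda pair: pair[1], reverse=True)[:3]
for label_id, prob in top3:
    print(f"{id2label[label_id]:30s} {prob:.3f}")
```

The subtraction of `max(xs)` before exponentiating is what `logits.softmax(-1)` does internally to avoid overflow.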
@@ -149,21 +125,9 @@ mislabels between mainstream languages was removed.
 Splits are grouped by task to prevent task-level leakage:
 72,549 / 9,495 / 8,880 rows (train / val / test).
 
-## Training
-
-- Base: `answerdotai/ModernBERT-base`
-- 10 epochs, bf16, AdamW (lr 2e-5, weight decay 0.01)
-- Linear schedule, 6% warmup
-- Attention: eager
-- Augmentation: random-window over the source code per training step
-  (window length ∈ [64, 512] chars, random offset). Evaluation takes the
-  first 512 characters.
-
 ## Limitations
 
--
-
--
-
-- The classifier is purely content-based: it does not read file extensions
-  or shebang lines. In production, combine it with extension heuristics.
+- Only the first **512 characters** of each input are used: longer files are
+  truncated before classification.
+- The classifier is purely content-based. If you have file extensions, treat
+  them as a strong prior in a production pipeline.
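The two practices the Limitations section points at, the 512-character `head` truncation and treating file extensions as a prior, can be combined into one small production wrapper. Everything here (`classify`, the 0.5 confidence threshold, the extension table) is a hypothetical stand-in, not part of the released model:

```python
# Sketch of a production wrapper around the classifier: apply the `head`
# strategy (first 512 characters) before tokenizing, and let a file-extension
# heuristic override low-confidence model output. `classify` and the
# extension table are illustrative stand-ins.

HEAD_CHARS = 512
EXT_PRIOR = {".py": "Python", ".rs": "Rust", ".go": "Go"}  # illustrative subset

def head(snippet: str, limit: int = HEAD_CHARS) -> str:
    """Trim to the first `limit` characters, matching how the model was evaluated."""
    return snippet[:limit]

def identify(path: str, source: str, classify) -> str:
    """Classify `source`, falling back to the extension when the model is unsure."""
    label, confidence = classify(head(source))
    ext = "." + path.rsplit(".", 1)[-1] if "." in path else ""
    # Content-only model: a known extension is a strong prior when confidence is low.
    if confidence < 0.5 and ext in EXT_PRIOR:
        return EXT_PRIOR[ext]
    return label

# Usage with a stub classifier standing in for the real model:
stub = lambda text: ("Perl", 0.31)
print(identify("script.py", "print('hi')", stub))  # low confidence -> "Python"
```

The threshold and table would be tuned per deployment; the point is only the shape of the combination.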