vijaym committed · Commit bcd4b22 · verified · 1 Parent(s): c96794f

Trim README

Files changed (1): README.md +9 -45
README.md CHANGED
@@ -24,10 +24,7 @@ snippet across **107 languages**.
 
 ## Inference
 
- Weights are published in **bfloat16** (`model.safetensors`, ~286 MB). The
- ONNX export is **float16** (`onnx/model.onnx`, ~286 MB).
-
- ### Single snippet (PyTorch, bf16)
+ ### PyTorch
 
 ```python
 import torch
@@ -48,10 +45,6 @@ with torch.no_grad():
 print(model.config.id2label[int(logits.argmax(-1))]) # -> "Python"
 ```
 
- Passing `torch_dtype=torch.bfloat16` keeps the weights in bf16 for inference.
- Omit it to run in fp32 (HF will upcast). `attn_implementation="eager"` is
- recommended; the SDPA path has produced NaNs on some ModernBERT builds.
-
 ### Batch
 
 ```python
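
For reference, the guidance removed above translates into a loading pattern like the one below. This is a minimal sketch, not the repository's own example: the repo id is assumed from the notebook link later in this commit, and the sample snippet is illustrative.

```python
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Repo id assumed from the notebook link in this commit; adjust if it differs.
repo = "FrameByFrame/programming-language-identification-100plus"

tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,    # keep the published bf16 weights; omit for fp32
    attn_implementation="eager",   # avoids the SDPA NaN issue noted above
)
model.eval()

snippet = "def add(a, b):\n    return a + b\n"  # illustrative input
inputs = tokenizer(snippet[:512], return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[int(logits.argmax(-1))])  # -> "Python"
```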
@@ -65,11 +58,10 @@ for i, pred in enumerate(logits.argmax(-1).tolist()):
 print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
 ```
 
- ### ONNX Runtime (fp16)
+ ### ONNX Runtime
 
- An ONNX export in fp16 lives in `onnx/` (~286 MB). Use it for CPU or GPU
- inference without pulling PyTorch; handy for non-Python consumers and edge
- deployments.
+ An ONNX export lives in `onnx/`. Use it for CPU or GPU inference without
+ pulling PyTorch; handy for non-Python consumers and edge deployments.
 
 ```python
 from optimum.onnxruntime import ORTModelForSequenceClassification
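
The body of the ONNX block is elided by the diff; a minimal end-to-end sketch of the same path follows. The repo id and the `subfolder="onnx"` argument are assumptions based on the `onnx/` layout described above, and the test input is illustrative.

```python
from optimum.onnxruntime import ORTModelForSequenceClassification
from transformers import AutoTokenizer

repo = "FrameByFrame/programming-language-identification-100plus"  # assumed repo id

tokenizer = AutoTokenizer.from_pretrained(repo)
# `subfolder="onnx"` matches the onnx/ layout described above; an assumption here.
ort_model = ORTModelForSequenceClassification.from_pretrained(repo, subfolder="onnx")

inputs = tokenizer("SELECT id FROM users WHERE age > 21;", return_tensors="pt",
                   truncation=True)
logits = ort_model(**inputs).logits
print(ort_model.config.id2label[int(logits.argmax(-1))])
```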
@@ -86,23 +78,7 @@ logits = ort_model(**inputs).logits
 print(ort_model.config.id2label[int(logits.argmax(-1))])
 ```
 
- Argmax predictions match the PyTorch model (verified on a 500-row validation
- sample: fp16 ONNX accuracy 0.940 vs bf16 PyTorch 0.940).
-
- ### Top-k with confidence
-
- ```python
- probs = logits.softmax(-1)
- top_probs, top_ids = probs.topk(3, dim=-1)
- for prob, label_id in zip(top_probs[0].tolist(), top_ids[0].tolist()):
-     print(f"{model.config.id2label[label_id]:30s} {prob:.3f}")
- ```
-
- Tip: trim the snippet to its first 512 characters before tokenizing; the
- model was trained and evaluated with the `head` strategy at inference time.
-
- See [`notebooks/inference_examples.ipynb`](../../../workspace/projects/accuknox/guardrail/code-language-id/notebooks/inference_examples.ipynb)
- for a runnable walkthrough.
+ **[Open Inference Notebook](https://huggingface.co/FrameByFrame/programming-language-identification-100plus/blob/main/inference_examples.ipynb)**: download and run in Colab or Jupyter.
 
 ## Evaluation
 
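Likewise, the middle of the batch example is elided by the diff. A sketch of what a batched call can look like, reusing `model` and `tokenizer` from the PyTorch sketch above; the snippet list is illustrative, not the file's elided lines.

```python
import torch

# Reuses `model` and `tokenizer` from the PyTorch sketch above.
snippets = [
    'fn main() { println!("hi"); }',  # Rust-looking input
    "def f(x):\n    return x * 2",    # Python-looking input
]
inputs = tokenizer([s[:512] for s in snippets], return_tensors="pt",
                   padding=True, truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits
for i, pred in enumerate(logits.argmax(-1).tolist()):
    print(snippets[i][:40].splitlines()[0], "→", model.config.id2label[pred])
```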
@@ -149,21 +125,9 @@ mislabels between mainstream languages was removed.
 Splits are grouped by task to prevent task-level leakage:
 72,549 / 9,495 / 8,880 rows (train / val / test).
 
- ## Training
-
- - Base: `answerdotai/ModernBERT-base`
- - 10 epochs, bf16, AdamW (lr 2e-5, weight decay 0.01)
- - Linear schedule, 6% warmup
- - Attention: eager
- - Augmentation: random-window over the source code per training step
-   (window length ∈ [64, 512] chars, random offset). Evaluation takes the
-   first 512 characters.
-
 ## Limitations
 
- - Input is truncated to 512 characters. Very short snippets (<60 chars) are
-   inherently ambiguous.
- - Rare languages (PicoLisp, LabVIEW, Lasso, Wren, Ring, Fantom, Pike, etc.)
-   have lower per-label F1 because training data for them is scarce.
- - The classifier is purely content-based; it does not read file extensions
-   or shebang lines. In production, combine it with extension heuristics.
+ - Only the first **512 characters** of each input are used; longer files are
+   truncated before classification.
+ - The classifier is purely content-based. If you have file extensions, treat
+   them as a strong prior in a production pipeline.
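
The random-window augmentation bullet removed in this hunk can be pictured with the following sketch; `random_window` and `head` are hypothetical helper names, with the window bounds taken from the deleted bullet.

```python
import random

def random_window(code: str, lo: int = 64, hi: int = 512) -> str:
    """Hypothetical sketch of the training-time augmentation: sample a
    window of [64, 512] characters at a random offset, per training step."""
    if len(code) <= lo:
        return code
    length = random.randint(lo, min(hi, len(code)))
    start = random.randint(0, len(code) - length)
    return code[start:start + length]

def head(code: str, n: int = 512) -> str:
    """Evaluation/inference strategy noted above: take the first 512 chars."""
    return code[:n]
```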
 
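The new Limitations bullets suggest a simple production pattern: run the content-based classifier, then let a file-extension prior override low-confidence predictions. A sketch under those assumptions; the mapping, threshold, and `classify` callable are all illustrative.

```python
# Illustrative extension-to-language prior; extend for your file types.
EXT_PRIOR = {".py": "Python", ".rs": "Rust", ".go": "Go", ".ts": "TypeScript"}

def identify(path: str, code: str, classify) -> str:
    """`classify` is any callable returning (label, confidence in [0, 1])."""
    label, conf = classify(code[:512])  # model sees content only, head-truncated
    dot = path.rfind(".")
    prior = EXT_PRIOR.get(path[dot:]) if dot != -1 else None
    # Treat the extension as a strong prior when the model is unsure
    # (0.80 is an illustrative threshold, not a tuned value).
    if prior is not None and conf < 0.80:
        return prior
    return label
```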