digitflow/privacy-filter-de-ft
A German-language fine-tune of openai/privacy-filter.
It exposes the same inference API and OPF label space as the base
model, so existing OPF call sites work without changes on German
input.
Caveat. This model is not a perfect redactor for German PII. No
warranty is provided and Digitflow accepts no legal responsibility
for decisions made on its output. Use at your own risk. For
non-German text, use openai/privacy-filter
directly.
Benchmark
Evaluated on the German subset (language == 'de', n = 1,000) of the
ai4privacy/open-pii-masking-500k-ai4privacy
validation split, scored with OPF-containment F1 (the char-level,
label-agnostic completeness metric from the OPF reference scoring
code). 95 % confidence intervals are estimated by 1,000-sample
bootstrap resampling with replacement, taking the 2.5th and 97.5th
percentiles of the resulting F1 distribution.
| Metric | openai/privacy-filter | digitflow/privacy-filter-de-ft | Δ |
|---|---|---|---|
| OPF-containment F1 | 0.8437 | 0.8706 | +0.027 |
| Leak rate (1 − char recall, label-agnostic) | 23.05 % | 20.49 % | −2.56 pp |
| Char-coverage F1, label-aware | 0.6791 | 0.8368 | +0.158 |
| Strict span F1 | 0.4348 | 0.6445 | +0.210 |
| Strict span precision | 0.5645 | 0.7518 | +0.187 |
| Strict span recall | 0.3536 | 0.5640 | +0.210 |
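The metric families in the table differ in what counts as a match. The sketch below illustrates that difference on one sentence. It is not the OPF reference scoring code: the helper names (`char_set`, `char_prf`, `strict_prf`) and the `(start, end, label)` tuple layout are assumptions made for illustration, and the predicted spans are invented to show a partial hit and a wrong label.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int, str]  # (start, end, label), end exclusive (assumed layout)


def char_set(spans: List[Span], label_aware: bool) -> Set[tuple]:
    """Expand spans into covered character positions.

    With label_aware=True each position carries its label, so a character
    only matches when the label agrees as well.
    """
    out = set()
    for start, end, label in spans:
        for i in range(start, end):
            out.add((i, label) if label_aware else (i,))
    return out


def prf(tp: int, n_pred: int, n_gold: int) -> Tuple[float, float, float]:
    """Precision, recall and F1 from a true-positive count and both totals."""
    p = tp / n_pred if n_pred else 0.0
    r = tp / n_gold if n_gold else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


def char_prf(pred: List[Span], gold: List[Span], label_aware: bool):
    """Character-level scores: overlap of covered positions."""
    p_set, g_set = char_set(pred, label_aware), char_set(gold, label_aware)
    return prf(len(p_set & g_set), len(p_set), len(g_set))


def strict_prf(pred: List[Span], gold: List[Span]):
    """Strict span scores: boundaries and label must match exactly."""
    p_set, g_set = set(pred), set(gold)
    return prf(len(p_set & g_set), len(p_set), len(g_set))


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
gold = [(14, 27, "private_person"), (45, 52, "private_address")]
# Invented prediction: first name only, and the city with the wrong label.
pred = [(14, 20, "private_person"), (45, 52, "private_person")]

agnostic = char_prf(pred, gold, label_aware=False)
aware = char_prf(pred, gold, label_aware=True)
strict = strict_prf(pred, gold)
leak_rate = 1 - agnostic[1]  # 1 - char recall, label-agnostic

print("label-agnostic char P/R/F1:", agnostic)
print("label-aware char P/R/F1:   ", aware)
print("strict span P/R/F1:        ", strict)
print("leak rate:                 ", leak_rate)
```

In this toy case the label-agnostic score credits 13 of 20 gold characters, the label-aware score credits only the 6 characters of the correctly labelled first name, and the strict score credits nothing. That ordering mirrors the ordering of the three F1 rows in the table.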
| Model | OPF-containment F1 | 95 % bootstrap CI |
|---|---|---|
| openai/privacy-filter | 0.8437 | [0.8294, 0.8579] |
| digitflow/privacy-filter-de-ft | 0.8706 | [0.8585, 0.8812] |
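The intervals do not overlap, though only narrowly: the base model's upper bound (0.8579) lies just below the fine-tune's lower bound (0.8585). The +0.027 lift therefore exceeds what sampling noise on this single evaluation slice would explain.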
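The percentile bootstrap described above can be sketched as follows. This is a minimal illustration, not the script that produced the reported intervals. It assumes that whole documents are the resampling unit, that each document contributes `(tp, fp, fn)` character counts to a micro-averaged F1, and that percentiles are taken by simple rank without interpolation. The function names and the toy counts are hypothetical.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # per-document (tp, fp, fn) character counts (assumed)


def micro_f1(rows: Sequence[Counts]) -> float:
    """Micro-averaged F1: pool the counts first, then compute one F1."""
    tp = sum(r[0] for r in rows)
    fp = sum(r[1] for r in rows)
    fn = sum(r[2] for r in rows)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(
    rows: Sequence[Counts],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for micro F1.

    Each resample draws len(rows) documents with replacement and rescores
    them; the interval is read off the sorted scores at the alpha/2 and
    1 - alpha/2 ranks (no interpolation).
    """
    rng = random.Random(seed)
    n = len(rows)
    scores: List[float] = sorted(
        micro_f1([rows[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    low = scores[int((alpha / 2) * (n_resamples - 1))]
    high = scores[int((1 - alpha / 2) * (n_resamples - 1))]
    return low, high


# Toy counts standing in for real per-document scoring output.
rows = [(10, 2, 3), (7, 0, 1), (0, 1, 4), (12, 3, 0), (5, 5, 5)] * 40

point = micro_f1(rows)
low, high = bootstrap_ci(rows)
print(f"F1 = {point:.4f}, 95 % CI = [{low:.4f}, {high:.4f}]")
```

Fixing the seed makes the interval reproducible across runs. Resampling documents rather than individual characters keeps the within-document correlation of errors intact.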
Examples
Output of m.redact(text), formatted as label:'redacted text'.
(none) means the model returned no spans.
| Input | openai/privacy-filter | digitflow/privacy-filter-de-ft |
|---|---|---|
| Mein Name ist Jürgen Müller und ich wohne in Hamburg. | (none) | private_person:'Jürgen Müller', private_address:'Hamburg' |
| Mein Passwort lautet SicherPasswort123! | (none) | secret:'SicherPasswort123!' |
| Senden Sie das Paket an Hauptstraße 25, 10115 Berlin. | (none) | private_address:'Hauptstraße 25, 10115 Berlin' |
| Hans-Jürgen Brömmelmeyer hat den Termin bestätigt. | (none) | private_person:'Hans-Jürgen Brömmelmeyer' |
| Server-Status: https://intern.firma.de/health. | (none) | private_url:'https://intern.firma.de/health' |
| Termin mit Mariella von Schönefeld-Brixius um 15:00. | private_person:'Mariella von Schönefeld-Brixius' | private_person:'Mariella von Schönefeld-Brixius', private_date:'15:00' |
How it was built
The fine-tune adapts the base model to German PII through slot-filled augmentation of public German carriers.
The augmented data is supplemented by a hand-authored curriculum spanning real-world text registers. Training ran on a single NVIDIA Jetson Orin.
The training set is screened against the evaluation slice for contamination before training begins.
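The two data steps named above, slot-filled augmentation and contamination screening, can be sketched as below. The card does not describe the actual slot syntax, value pools, or screening method. The `{SLOT}` marker format, the helper names, and the use of exact match after Unicode and whitespace normalization are all assumptions chosen for illustration.

```python
import re
import unicodedata
from typing import Dict, List, Tuple

SLOT = re.compile(r"\{(\w+)\}")  # hypothetical slot syntax, e.g. {PERSON}


def fill_slots(
    carrier: str, values: Dict[str, Tuple[str, str]]
) -> Tuple[str, List[Tuple[int, int, str]]]:
    """Fill each {SLOT} in a carrier sentence with (value, label).

    Returns the filled text plus (start, end, label) spans that point at
    the inserted values, so the gold annotation comes for free.
    """
    parts: List[str] = []
    spans: List[Tuple[int, int, str]] = []
    pos = 0      # write position in the output text
    cursor = 0   # read position in the carrier
    for m in SLOT.finditer(carrier):
        literal = carrier[cursor:m.start()]
        parts.append(literal)
        pos += len(literal)
        value, label = values[m.group(1)]
        spans.append((pos, pos + len(value), label))
        parts.append(value)
        pos += len(value)
        cursor = m.end()
    parts.append(carrier[cursor:])
    return "".join(parts), spans


def normalize(text: str) -> str:
    """Unicode-normalize, case-fold and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(text.split())


def drop_contaminated(train: List[str], eval_texts: List[str]) -> List[str]:
    """Keep only training texts whose normalized form is absent from eval."""
    seen = {normalize(t) for t in eval_texts}
    return [t for t in train if normalize(t) not in seen]


carrier = "Mein Name ist {PERSON} und ich wohne in {CITY}."
filled, spans = fill_slots(
    carrier,
    {
        "PERSON": ("Jürgen Müller", "private_person"),
        "CITY": ("Hamburg", "private_address"),
    },
)
print(filled)
print(spans)
print(drop_contaminated(["Hallo  Welt", "Guten Tag"], ["hallo welt"]))
```

Exact-match screening only catches verbatim overlap. A stricter screen would also compare n-gram overlap or carrier templates, which matters here because the training carriers and the evaluation slice come from the same dataset.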
How to use it
The OPF Python API is unchanged. Fetch the checkpoint with
huggingface_hub.snapshot_download(...) and pass the resulting local
path to opf.OPF.
from huggingface_hub import snapshot_download
import opf

path = snapshot_download("digitflow/privacy-filter-de-ft")
m = opf.OPF(
    model=path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
result = m.redact(text)
for span in result.detected_spans:
    print(f"{span.label}: {text[span.start:span.end]!r}")
# private_person: 'Jürgen Müller'
# private_address: 'Hamburg'
snapshot_download caches the weights under ~/.cache/huggingface/,
so subsequent calls reuse the local copy instead of downloading
again. The current opf release does not resolve a Hub repo id
directly; it expects a local checkpoint directory.
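If downstream code needs the text with placeholders substituted, the span offsets shown above are enough to build it. The sketch below relies only on the `label`, `start` and `end` attributes used in the example. The `Span` dataclass is a hypothetical stand-in so that the snippet runs without `opf` installed, and `mask` is an illustrative helper, not part of the OPF API. It assumes the spans do not overlap.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Span:
    """Hypothetical stand-in for one item of result.detected_spans."""
    label: str
    start: int
    end: int


def mask(text: str, spans: Iterable[Span]) -> str:
    """Replace each span with a [label] placeholder.

    Spans are applied right to left so that the offsets of spans earlier
    in the string remain valid after each substitution.
    """
    out = text
    for s in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:s.start] + f"[{s.label}]" + out[s.end:]
    return out


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
spans = [Span("private_person", 14, 27), Span("private_address", 45, 52)]
masked = mask(text, spans)
print(masked)
# Mein Name ist [private_person] und ich wohne in [private_address].
```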
Reproducing the benchmark
from datasets import load_dataset
from huggingface_hub import snapshot_download
import opf
# ... plus shared.span_prf and metrics.char_coverage_prf from the
# openai/privacy-filter reference scoring code.
ds = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",
    split="validation",
)
de = ds.filter(lambda r: r["language"] == "de").select(range(1000))
ft_path = snapshot_download("digitflow/privacy-filter-de-ft")
m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
m_ft = opf.OPF(model=ft_path,
               device="cuda", output_mode="typed", decode_mode="viterbi")
# Run m.redact() per row, collect predicted spans, score against gold
# with `char_coverage_prf(predictions, golds, label_aware=False)`.
# Report the __micro__.f1 as OPF-containment F1.
License and citations
License. MIT.
ai4privacy/open-pii-masking-500k-ai4privacy
was used as the source of training carriers (with augmentation) and
as the validation slice for the benchmark above.
openai/privacy-filter
is the base model (Apache 2.0).