digitflow/privacy-filter-de-ft

A German-language fine-tune of openai/privacy-filter. It exposes the same inference API and OPF label space as the base model, so existing OPF call sites work without changes on German input.

Caveat. This model is not a perfect redactor for German PII. No warranty is provided and Digitflow accepts no legal responsibility for decisions made on its output. Use at your own risk. For non-German text, use openai/privacy-filter directly.

Benchmark

Evaluated on the first 1,000 German rows (language == 'de') of the ai4privacy/open-pii-masking-500k-ai4privacy validation split and scored with OPF-containment F1, the char-level, label-agnostic completeness metric from the OPF reference scoring code. The 95 % confidence intervals come from a 1,000-sample bootstrap (resampling with replacement), taking the 2.5th and 97.5th percentiles of the resulting F1 distribution.

Metric	`openai/privacy-filter`	`digitflow/privacy-filter-de-ft`	Δ
OPF-containment F1	0.8437	0.8706	+0.027
Leak rate (1 − char recall, label-agnostic)	23.05 %	20.49 %	−2.56 pp
Char-coverage F1, label-aware	0.6791	0.8368	+0.158
Strict span F1	0.4348	0.6445	+0.210
Strict span precision	0.5645	0.7518	+0.187
Strict span recall	0.3536	0.5640	+0.210
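
To make the character-level rows concrete, here is a minimal sketch of a label-agnostic character precision, recall and F1 for a single text, with the leak rate taken as 1 − recall as in the table. It is an illustration only, not the OPF reference scoring code: the function names, the `locate` helper and the example prediction are assumptions, and the reported numbers are micro-averaged over the whole slice rather than computed per text.

```python
from typing import Iterable, Set, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def covered_chars(spans: Iterable[Span]) -> Set[int]:
    """Collect every character offset covered by at least one span."""
    chars: Set[int] = set()
    for start, end in spans:
        chars.update(range(start, end))
    return chars


def char_prf(pred: Iterable[Span], gold: Iterable[Span]) -> Tuple[float, float, float]:
    """Label-agnostic character-level precision, recall and F1 for one text."""
    pred_chars, gold_chars = covered_chars(pred), covered_chars(gold)
    overlap = len(pred_chars & gold_chars)
    precision = overlap / len(pred_chars) if pred_chars else 0.0
    recall = overlap / len(gold_chars) if gold_chars else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def locate(text: str, value: str) -> Span:
    """Hypothetical helper: offsets of the first occurrence of value."""
    start = text.index(value)
    return start, start + len(value)


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
gold = [locate(text, "Jürgen Müller"), locate(text, "Hamburg")]
# Made-up prediction that finds the city but only the first name.
pred = [locate(text, "Jürgen"), locate(text, "Hamburg")]

precision, recall, f1 = char_prf(pred, gold)
leak_rate = 1.0 - recall
print(f"P={precision:.3f} R={recall:.3f} F1={f1:.3f} leak={leak_rate:.3f}")
# P=1.000 R=0.650 F1=0.788 leak=0.350
```

In this toy case 13 of the 20 gold characters are covered, so 35 % of the PII characters would leak even though nothing was over-redacted. The strict span rows treat a prediction as correct only when its boundaries match the gold span exactly, which is why they sit well below the character-level scores.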

Model	OPF-containment F1	95 % bootstrap CI
`openai/privacy-filter`	0.8437	[0.8294, 0.8579]
`digitflow/privacy-filter-de-ft`	0.8706	[0.8585, 0.8812]

The intervals do not overlap (0.8579 < 0.8585), so the +0.027 lift is larger than the bootstrap sampling noise on this single evaluation slice.
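
The percentile bootstrap described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the script used for the table: it assumes each document has been reduced to hypothetical (tp, fp, fn) character counts, resamples documents with replacement, recomputes a pooled (micro) F1 per resample, and reads the interval off the sorted scores with a simple nearest-rank rule. The toy counts are random placeholders.

```python
import random
from typing import List, Sequence, Tuple

Counts = Tuple[int, int, int]  # per-document (tp, fp, fn) character counts
Interval = Tuple[float, float]


def micro_f1(counts: Sequence[Counts]) -> float:
    """Pool counts over all documents, then compute a single F1."""
    tp = sum(c[0] for c in counts)
    fp = sum(c[1] for c in counts)
    fn = sum(c[2] for c in counts)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def bootstrap_ci(counts: Sequence[Counts], n_resamples: int = 1000,
                 seed: int = 0) -> Interval:
    """95 % percentile interval from document-level resampling."""
    rng = random.Random(seed)
    n = len(counts)
    scores: List[float] = sorted(
        micro_f1([counts[rng.randrange(n)] for _ in range(n)])
        for _ in range(n_resamples)
    )
    # Nearest-rank approximation of the 2.5th and 97.5th percentiles.
    lower = scores[int(0.025 * n_resamples)]
    upper = scores[min(n_resamples - 1, int(0.975 * n_resamples))]
    return lower, upper


def overlaps(a: Interval, b: Interval) -> bool:
    """True if two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Placeholder data: 200 documents with random counts.
data_rng = random.Random(42)
toy = [(data_rng.randint(5, 40), data_rng.randint(0, 6), data_rng.randint(0, 10))
       for _ in range(200)]
low, high = bootstrap_ci(toy)
print(f"point={micro_f1(toy):.4f} CI=[{low:.4f}, {high:.4f}]")

# The two intervals reported in the table above.
print(overlaps((0.8294, 0.8579), (0.8585, 0.8812)))  # False
```

Resampling whole documents rather than individual characters keeps the within-document correlation of errors intact, which is the usual choice for span-level metrics.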

Examples

Output of m.redact(text), formatted as label:'redacted text'. (none) means the model returned no spans.

Input	`openai/privacy-filter`	`digitflow/privacy-filter-de-ft`
Mein Name ist Jürgen Müller und ich wohne in Hamburg.	`(none)`	`private_person:'Jürgen Müller'`, `private_address:'Hamburg'`
Mein Passwort lautet SicherPasswort123!	`(none)`	`secret:'SicherPasswort123!'`
Senden Sie das Paket an Hauptstraße 25, 10115 Berlin.	`(none)`	`private_address:'Hauptstraße 25, 10115 Berlin'`
Hans-Jürgen Brömmelmeyer hat den Termin bestätigt.	`(none)`	`private_person:'Hans-Jürgen Brömmelmeyer'`
Server-Status: https://intern.firma.de/health.	`(none)`	`private_url:'https://intern.firma.de/health'`
Termin mit Mariella von Schönefeld-Brixius um 15:00.	`private_person:'Mariella von Schönefeld-Brixius'`	`private_person:'Mariella von Schönefeld-Brixius'`, `private_date:'15:00'`

How it was built

The fine-tune adapts the base model to German PII through slot-filled augmentation of public German carriers.
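
The card does not describe the augmentation pipeline in detail, so the following is only a sketch of what slot filling can look like. The `{label}` template syntax, the value pools and the `fill_slots` function are all hypothetical; the labels are taken from the OPF label space shown in the examples above.

```python
import random
import re
from typing import Dict, List, Tuple

SLOT = re.compile(r"\{(\w+)\}")  # hypothetical slot syntax: {label}

# Hypothetical value pools, keyed by OPF label.
VALUES: Dict[str, List[str]] = {
    "private_person": ["Jürgen Müller", "Hans-Jürgen Brömmelmeyer"],
    "private_address": ["Hamburg", "Hauptstraße 25, 10115 Berlin"],
}


def fill_slots(carrier: str, values: Dict[str, List[str]],
               rng: random.Random) -> Tuple[str, List[Tuple[str, int, int]]]:
    """Fill each slot with a sampled value and record its character span."""
    pieces: List[str] = []
    spans: List[Tuple[str, int, int]] = []
    pos = 0      # read position in the carrier
    length = 0   # length of the output built so far
    for match in SLOT.finditer(carrier):
        literal = carrier[pos:match.start()]
        pieces.append(literal)
        length += len(literal)
        label = match.group(1)
        value = rng.choice(values[label])
        spans.append((label, length, length + len(value)))
        pieces.append(value)
        length += len(value)
        pos = match.end()
    pieces.append(carrier[pos:])
    return "".join(pieces), spans


carrier = "Mein Name ist {private_person} und ich wohne in {private_address}."
text, spans = fill_slots(carrier, VALUES, random.Random(0))
for label, start, end in spans:
    print(f"{label}: {text[start:end]!r}")
```

Because the spans are recorded while the text is assembled, the gold offsets stay exact regardless of how long the sampled values are.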

The augmented data is supplemented by a hand-authored curriculum spanning real-world text registers. Training ran on a single NVIDIA Jetson Orin.

The training set is screened against the evaluation slice for contamination before training begins.
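
How the screening is done is not specified here. One simple way to approximate it is an exact-match filter on normalized text, sketched below. The normalization steps (NFKC, case folding, whitespace collapsing) and the function names are assumptions for illustration, not the procedure actually used.

```python
import hashlib
import unicodedata
from typing import Iterable, List


def fingerprint(text: str) -> str:
    """Hash a normalized form of the text so trivial variants collide."""
    norm = unicodedata.normalize("NFKC", text).casefold()
    norm = " ".join(norm.split())  # collapse runs of whitespace
    return hashlib.sha256(norm.encode("utf-8")).hexdigest()


def drop_contaminated(train: Iterable[str], eval_texts: Iterable[str]) -> List[str]:
    """Keep only training texts whose fingerprint is absent from the eval slice."""
    blocked = {fingerprint(t) for t in eval_texts}
    return [t for t in train if fingerprint(t) not in blocked]


eval_slice = ["Mein Name ist Jürgen Müller."]
train_set = [
    "Mein Name ist Jürgen Müller.",             # exact duplicate
    "mein  name ist jürgen müller.",            # differs only in case and spacing
    "Senden Sie das Paket an Hauptstraße 25.",  # unrelated, kept
]
print(drop_contaminated(train_set, eval_slice))
# ['Senden Sie das Paket an Hauptstraße 25.']
```

An exact-match filter like this would not catch two slot-filled variants of the same carrier sentence; a stricter screen would compare the carriers themselves or use n-gram overlap.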

How to use it

The OPF Python API is unchanged. Fetch the checkpoint with huggingface_hub.snapshot_download(...) and pass the resulting local path to opf.OPF.

from huggingface_hub import snapshot_download
import opf

path = snapshot_download("digitflow/privacy-filter-de-ft")

m = opf.OPF(
    model=path,
    device="cuda",
    output_mode="typed",
    decode_mode="viterbi",
)

text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
result = m.redact(text)
for span in result.detected_spans:
    print(f"{span.label}: {text[span.start:span.end]!r}")
# private_person: 'Jürgen Müller'
# private_address: 'Hamburg'

snapshot_download caches the weights under ~/.cache/huggingface/, so subsequent calls reuse the local copy instead of downloading again. The current opf release does not resolve a Hub repo id directly; it expects a local checkpoint directory.
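
If a masked string is needed rather than a list of spans, the span offsets can be applied to the text directly. The sketch below uses a stand-in `Span` dataclass with the same `label`, `start` and `end` attributes as the loop above, so it runs without opf installed. The `mask` helper and the `[label]` placeholder format are hypothetical, and the result object may already expose a redacted string that makes this unnecessary.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Span:
    """Stand-in for a detected span: label plus character offsets."""
    label: str
    start: int
    end: int


def mask(text: str, spans: Iterable[Span]) -> str:
    """Replace each span with [label]; assumes spans do not overlap."""
    out = text
    # Work right to left so earlier offsets stay valid after each replacement.
    for span in sorted(spans, key=lambda s: s.start, reverse=True):
        out = out[:span.start] + f"[{span.label}]" + out[span.end:]
    return out


text = "Mein Name ist Jürgen Müller und ich wohne in Hamburg."
name_at = text.index("Jürgen Müller")
city_at = text.index("Hamburg")
spans = [
    Span("private_person", name_at, name_at + len("Jürgen Müller")),
    Span("private_address", city_at, city_at + len("Hamburg")),
]
print(mask(text, spans))
# Mein Name ist [private_person] und ich wohne in [private_address].
```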

Reproducing the benchmark

from datasets import load_dataset
from huggingface_hub import snapshot_download
import opf
# ... plus shared.span_prf and metrics.char_coverage_prf from the
# openai/privacy-filter reference scoring code.

ds = load_dataset(
    "ai4privacy/open-pii-masking-500k-ai4privacy",
    split="validation",
)
de = ds.filter(lambda r: r["language"] == "de").select(range(1000))

ft_path = snapshot_download("digitflow/privacy-filter-de-ft")
m_base = opf.OPF(device="cuda", output_mode="typed", decode_mode="viterbi")
m_ft   = opf.OPF(model=ft_path,
                 device="cuda", output_mode="typed", decode_mode="viterbi")

# Run m.redact() per row, collect predicted spans, score against gold
# with `char_coverage_prf(predictions, golds, label_aware=False)`.
# Report the __micro__.f1 as OPF-containment F1.

License and citations

License. MIT.

ai4privacy/open-pii-masking-500k-ai4privacy was used as the source of training carriers (with augmentation) and as the validation slice for the benchmark above.

openai/privacy-filter is the base model (Apache 2.0).

Model size: 1B params (Safetensors, tensor type BF16)
