# SecBERT-PT
SecBERT is a binary classifier for detecting harmful and jailbreak prompts in Brazilian Portuguese. It is built on top of BERTimbau Base with a fully fine-tuned backbone and a two-layer MLP classification head.
This model was introduced in the paper:
Robustness of Language Models against Portuguese Harmful Prompts
Eduardo Alexandre de Amorim, Cleber Zanchettin
International Joint Conference on Neural Networks (IJCNN)
[Paper] [Code] [Dataset]
## Model Description
SecBERT frames harmful prompt detection as a binary classification task. Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where $y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates a benign one.
**Architecture:**

The `[CLS]` pooler output $h_{\mathrm{CLS}} \in \mathbb{R}^{768}$ from BERTimbau Base is passed through a two-layer MLP head, $\hat{p} = \mathrm{softmax}(W_2\,\phi(W_1 h_{\mathrm{CLS}} + b_1) + b_2)$, where $\phi$ is the hidden nonlinearity and $W_2$ projects the hidden representation onto the two class logits.
**Training:**
| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
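The warmup-plus-decay schedule in the table can be sketched as a pure function of the step index (a minimal sketch; the actual training code may rely on `transformers.get_linear_schedule_with_warmup`, which produces the same shape):

```python
def linear_warmup_decay(step: int, total_steps: int, base_lr: float = 2e-5,
                        warmup_frac: float = 0.10) -> float:
    """Linear warmup over the first 10% of steps, then linear decay to zero.

    Illustrative sketch of the LR schedule in the training table,
    not the repository's exact code.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr (end of warmup) to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Example: 1000 training steps -> the peak LR of 2e-5 is reached at step 100.
lrs = [linear_warmup_decay(s, 1000) for s in range(1001)]
```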
## Evaluation

Evaluated on a held-out test set (25% of the harmful-prompts-pt dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the KS-optimal threshold (τ*), which maximizes class separability.
| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
**Separability (threshold-independent):**
| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |
The KS statistic measures the maximum separation between the cumulative score distributions of benign and harmful classes. A value of 91.2% indicates that the model assigns well-separated probability scores to each class, making threshold selection robust in deployment.
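The KS-optimal threshold can be recovered from validation scores by scanning candidate thresholds for the maximum gap between the two classes' empirical cumulative score distributions. A minimal sketch, using hypothetical synthetic scores (the paper's exact selection procedure may differ):

```python
def ks_optimal_threshold(scores, labels):
    """Return (tau_star, ks_stat): the threshold maximizing the KS statistic.

    scores: predicted P(harmful) per example; labels: 1 = harmful, 0 = benign.
    KS(t) = |F_benign(t) - F_harmful(t)|, the gap between the empirical CDFs.
    """
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    harmful = sorted(s for s, y in zip(scores, labels) if y == 1)

    def cdf(sorted_vals, t):
        # Fraction of values <= t (empirical CDF).
        return sum(v <= t for v in sorted_vals) / len(sorted_vals)

    best_t, best_ks = 0.5, 0.0
    for t in sorted(set(scores)):
        ks = abs(cdf(benign, t) - cdf(harmful, t))
        if ks > best_ks:
            best_t, best_ks = t, ks
    return best_t, best_ks

# Hypothetical well-separated scores: benign cluster low, harmful cluster high.
scores = [0.05, 0.10, 0.20, 0.30, 0.80, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
tau_star, ks = ks_optimal_threshold(scores, labels)
```

With perfectly separated score distributions, as above, the KS statistic reaches 1.0 and τ* lands between the two clusters; on real validation data it selects the point of maximum CDF separation, such as the τ* = 0.72 reported here.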
## Usage

```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
model.load_state_dict(torch.load("best_model.pth", weights_only=True))
model.eval()

# KS-optimal threshold from the paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs)
prob = torch.softmax(logits, dim=1)[0, 1].item()
label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```
For the full `BertMLPClassifier` definition, clone the source repository.
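For orientation, a minimal sketch of what such a class might look like, assuming the architecture described above (BERTimbau backbone plus a two-layer MLP head over the `[CLS]` pooler output). The names and details here are illustrative; the repository's definition is authoritative:

```python
import torch
import torch.nn as nn
from transformers import BertModel


def build_head(hidden_dim: int = 768, num_labels: int = 2) -> nn.Module:
    # Two-layer MLP head: 768-dim pooler output -> hidden_dim -> 2 logits.
    return nn.Sequential(
        nn.Linear(768, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, num_labels),
    )


class BertMLPClassifier(nn.Module):
    """Illustrative sketch; see the source repository for the real definition."""

    def __init__(self, model_name: str, hidden_dim: int = 768,
                 freeze_backbone: bool = False):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        if freeze_backbone:
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = build_head(hidden_dim)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        # pooler_output is the [CLS] vector passed through BERT's pooler layer.
        return self.head(out.pooler_output)
```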
## Limitations
- The dataset was generated via automated translation. Organically crafted Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors. Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a standalone solution.
## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```
## License

MIT License, for research use only. Users are responsible for complying with the terms of the original WildJailbreak dataset.