# SecBERT-PT
SecBERT is a binary classifier for detecting harmful and jailbreak prompts in Brazilian Portuguese. It is built on top of BERTimbau Base with a fully fine-tuned backbone and a two-layer MLP classification head.
This model was introduced in the paper:
Robustness of Language Models against Portuguese Harmful Prompts
Eduardo Alexandre de Amorim, Cleber Zanchettin
International Joint Conference on Neural Networks (IJCNN)
[Paper] [Code] [Dataset]
## Model Description
SecBERT frames harmful prompt detection as a binary classification task. Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where $y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates a benign one.
**Architecture:**

The `[CLS]` pooler output $h_{\mathrm{CLS}} \in \mathbb{R}^{768}$ from BERTimbau Base is passed through a two-layer MLP head, $\hat{p} = \mathrm{softmax}(W_2\,\phi(W_1 h_{\mathrm{CLS}} + b_1) + b_2)$, where $\phi$ is the hidden nonlinearity and $W_2$ projects the hidden representation onto the two class logits.
**Training:**
| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping patience | 20 (on validation loss) |
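The warmup-plus-decay schedule in the table can be sketched as a pure function of the step index (a minimal sketch; the actual training code may rely on `transformers.get_linear_schedule_with_warmup`, which produces the same shape):

```python
def linear_warmup_decay(step: int, total_steps: int, base_lr: float = 2e-5,
                        warmup_frac: float = 0.10) -> float:
    """Linear warmup over the first 10% of steps, then linear decay to zero.

    Illustrative sketch of the LR schedule in the training table,
    not the repository's exact code.
    """
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Ramp from 0 up to base_lr during warmup.
        return base_lr * step / max(1, warmup_steps)
    # Decay linearly from base_lr (end of warmup) to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)

# Example: 1000 training steps -> the peak LR of 2e-5 is reached at step 100.
lrs = [linear_warmup_decay(s, 1000) for s in range(1001)]
```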
## Evaluation

Evaluated on a held-out test set (25% of the harmful-prompts-pt dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the KS-optimal threshold (τ*), which maximizes class separability.
| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
**Separability (threshold-independent):**
| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |
The KS statistic measures the maximum separation between the cumulative score distributions of benign and harmful classes. A value of 91.2% indicates that the model assigns well-separated probability scores to each class, making threshold selection robust in deployment.
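The KS-optimal threshold can be recovered from validation scores by scanning candidate thresholds for the maximum gap between the two classes' empirical cumulative score distributions. A minimal sketch, using hypothetical synthetic scores (the paper's exact selection procedure may differ):

```python
def ks_optimal_threshold(scores, labels):
    """Return (tau_star, ks_stat): the threshold maximizing the KS statistic.

    scores: predicted P(harmful) per example; labels: 1 = harmful, 0 = benign.
    KS(t) = |F_benign(t) - F_harmful(t)|, the gap between the empirical CDFs.
    """
    benign = sorted(s for s, y in zip(scores, labels) if y == 0)
    harmful = sorted(s for s, y in zip(scores, labels) if y == 1)

    def cdf(sorted_vals, t):
        # Fraction of values <= t (empirical CDF).
        return sum(v <= t for v in sorted_vals) / len(sorted_vals)

    best_t, best_ks = 0.5, 0.0
    for t in sorted(set(scores)):
        ks = abs(cdf(benign, t) - cdf(harmful, t))
        if ks > best_ks:
            best_t, best_ks = t, ks
    return best_t, best_ks

# Hypothetical well-separated scores: benign cluster low, harmful cluster high.
scores = [0.05, 0.10, 0.20, 0.30, 0.80, 0.85, 0.90, 0.95]
labels = [0, 0, 0, 0, 1, 1, 1, 1]
tau_star, ks = ks_optimal_threshold(scores, labels)
```

With perfectly separated score distributions, as above, the KS statistic reaches 1.0 and τ* lands between the two clusters; on real validation data it selects the point of maximum CDF separation, such as the τ* = 0.72 reported here.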
## Usage

```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
model.load_state_dict(torch.load("best_model.pth", weights_only=True))
model.eval()

# KS-optimal threshold from the paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logits = model(**inputs)
prob = torch.softmax(logits, dim=1)[0, 1].item()
label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```
For the full `BertMLPClassifier` definition, clone the source repository.
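For orientation, a minimal sketch of what such a class might look like, assuming the architecture described above (BERTimbau backbone plus a two-layer MLP head over the `[CLS]` pooler output). The names and details here are illustrative; the repository's definition is authoritative:

```python
import torch
import torch.nn as nn
from transformers import BertModel


def build_head(hidden_dim: int = 768, num_labels: int = 2) -> nn.Module:
    # Two-layer MLP head: 768-dim pooler output -> hidden_dim -> 2 logits.
    return nn.Sequential(
        nn.Linear(768, hidden_dim),
        nn.ReLU(),
        nn.Linear(hidden_dim, num_labels),
    )


class BertMLPClassifier(nn.Module):
    """Illustrative sketch; see the source repository for the real definition."""

    def __init__(self, model_name: str, hidden_dim: int = 768,
                 freeze_backbone: bool = False):
        super().__init__()
        self.bert = BertModel.from_pretrained(model_name)
        if freeze_backbone:
            for p in self.bert.parameters():
                p.requires_grad = False
        self.head = build_head(hidden_dim)

    def forward(self, input_ids, attention_mask=None, token_type_ids=None):
        out = self.bert(input_ids=input_ids, attention_mask=attention_mask,
                        token_type_ids=token_type_ids)
        # pooler_output is the [CLS] vector passed through BERT's pooler layer.
        return self.head(out.pooler_output)
```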
## Limitations
- The dataset was generated via automated translation. Organically crafted Portuguese jailbreaks from native attackers may not be fully represented.
- The model was trained on a static snapshot of WildJailbreak attack vectors. Novel jailbreak strategies not present in the training data may evade detection.
- SecBERT is designed as one layer of a defense-in-depth strategy, not as a standalone solution.
## Citation

```bibtex
@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}
```
## License

MIT License, for research use only. Users are responsible for complying with the terms of the original WildJailbreak dataset.