SecBERT-PT

SecBERT is a binary classifier for detecting harmful and jailbreak prompts in Brazilian Portuguese. It is built on top of BERTimbau Base with a fully fine-tuned backbone and a two-layer MLP classification head.

This model was introduced in the paper:

Robustness of Language Models against Portuguese Harmful Prompts
Eduardo Alexandre de Amorim, Cleber Zanchettin
International Joint Conference on Neural Networks (IJCNN)
[Paper] [Code] [Dataset]


Model Description

SecBERT frames harmful prompt detection as a binary classification task. Given an input prompt $x$, the model predicts $P(y=1 \mid x)$, where $y=1$ indicates a policy-violating (harmful) prompt and $y=0$ indicates a benign one.

Architecture:

The [CLS] pooler output $h_{CLS} \in \mathbb{R}^{768}$ from BERTimbau-Base is passed through a two-layer MLP:

$$z = \mathrm{ReLU}(W_1 h_{CLS} + b_1), \quad W_1 \in \mathbb{R}^{128 \times 768}$$
$$\hat{y} = \sigma(W_2 z + b_2), \quad W_2 \in \mathbb{R}^{1 \times 128}$$
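The classification head can be sketched in PyTorch as follows. This is a minimal illustration of the two-layer MLP described above, not the repository's actual `BertMLPClassifier` implementation, which may differ in naming and details:

```python
import torch
import torch.nn as nn

class MLPHead(nn.Module):
    """Two-layer MLP head mapping the 768-d [CLS] vector to a single logit."""
    def __init__(self, in_dim: int = 768, hidden_dim: int = 128):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, hidden_dim)  # W1 ∈ R^{128×768}
        self.fc2 = nn.Linear(hidden_dim, 1)       # W2 ∈ R^{1×128}

    def forward(self, h_cls: torch.Tensor) -> torch.Tensor:
        z = torch.relu(self.fc1(h_cls))
        return self.fc2(z)  # raw logit; apply a sigmoid to get P(y=1 | x)

head = MLPHead()
logit = head(torch.randn(4, 768))   # batch of four [CLS] vectors
prob = torch.sigmoid(logit)         # probabilities in [0, 1]
```

Because the head emits a single logit, the sigmoid (not a two-way softmax) recovers the harmful-class probability.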

Training:

| Setting | Value |
|---|---|
| Base model | neuralmind/bert-base-portuguese-cased |
| Optimizer | AdamW |
| Learning rate | 2e-5 |
| Batch size | 20 |
| Max sequence length | 512 |
| LR schedule | Linear warmup (10%) + linear decay |
| Early stopping | Patience 20 (on validation loss) |
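The warmup-then-decay schedule from the table can be reproduced with a plain `LambdaLR`: the learning rate rises linearly from 0 to the peak over the first 10% of steps, then decays linearly back to 0. A minimal sketch (the epoch and step counts below are illustrative, not from the paper):

```python
import torch

# Hyperparameters from the table above
lr, warmup_frac = 2e-5, 0.10
# Illustrative totals; the paper trains with early stopping instead
num_epochs, steps_per_epoch = 10, 100
total_steps = num_epochs * steps_per_epoch
warmup_steps = int(warmup_frac * total_steps)

model = torch.nn.Linear(768, 1)  # stand-in for the full classifier
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
scheduler = torch.optim.lr_scheduler.LambdaLR(
    optimizer,
    # Linear warmup for the first `warmup_steps`, linear decay afterwards
    lambda step: step / max(1, warmup_steps)
    if step < warmup_steps
    else max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps)),
)
```

Calling `scheduler.step()` after each `optimizer.step()` advances the schedule; the learning rate peaks at exactly 2e-5 once warmup completes.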

Evaluation

Evaluated on a held-out test set (25% of the harmful-prompts-pt dataset). Metrics are reported at both the standard threshold (τ = 0.5) and the KS-optimal threshold (τ*), which maximizes class separability.

| Threshold | Accuracy | Precision | Recall | F1 | FPR |
|---|---|---|---|---|---|
| τ = 0.5 | 95.4% | 94.9% | 96.1% | 95.5% | 5.4% |
| τ* = 0.72 | 95.6% | 96.5% | 94.8% | 95.6% | 3.6% |
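Given raw model scores and ground-truth labels, the threshold-dependent metrics above follow directly from the confusion matrix. A self-contained helper (a sketch, not taken from the repository):

```python
import numpy as np

def classification_metrics(scores, labels, tau):
    """Accuracy, precision, recall, F1 and FPR at decision threshold tau."""
    preds = (np.asarray(scores) >= tau).astype(int)
    labels = np.asarray(labels)
    tp = int(np.sum((preds == 1) & (labels == 1)))
    fp = int(np.sum((preds == 1) & (labels == 0)))
    tn = int(np.sum((preds == 0) & (labels == 0)))
    fn = int(np.sum((preds == 0) & (labels == 1)))
    acc = (tp + tn) / len(labels)
    prec = tp / (tp + fp) if tp + fp else 0.0
    rec = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
    fpr = fp / (fp + tn) if fp + tn else 0.0
    return {"accuracy": acc, "precision": prec,
            "recall": rec, "f1": f1, "fpr": fpr}
```

Sweeping `tau` over the score range reproduces the precision/recall trade-off visible between the two rows of the table.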

Separability (threshold-independent):

| AUC | KS Statistic |
|---|---|
| 99.2% | 91.2% |

The KS statistic measures the maximum separation between the cumulative score distributions of benign and harmful classes. A value of 91.2% indicates that the model assigns well-separated probability scores to each class, making threshold selection robust in deployment.
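The KS computation can be sketched as follows: build the empirical CDFs of the benign and harmful score distributions and take their maximum gap. This is a minimal illustration; the paper's exact procedure for selecting τ* may differ:

```python
import numpy as np

def ks_statistic(scores, labels):
    """Max gap between the empirical score CDFs of the two classes.

    Returns (KS statistic, score at which the gap peaks).
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    benign = np.sort(scores[labels == 0])
    harmful = np.sort(scores[labels == 1])
    grid = np.unique(scores)
    # Empirical CDFs evaluated on the pooled score grid
    cdf_b = np.searchsorted(benign, grid, side="right") / len(benign)
    cdf_h = np.searchsorted(harmful, grid, side="right") / len(harmful)
    gaps = np.abs(cdf_b - cdf_h)
    i = int(np.argmax(gaps))
    return float(gaps[i]), float(grid[i])
```

Perfectly separated classes yield a KS statistic of 1.0; the score where the gap peaks is a natural candidate for the deployment threshold.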


Usage

```python
from transformers import BertTokenizer
from src.model import BertMLPClassifier
import torch

tokenizer = BertTokenizer.from_pretrained("neuralmind/bert-base-portuguese-cased")
model = BertMLPClassifier(
    model_name="neuralmind/bert-base-portuguese-cased",
    hidden_dim=768,
    freeze_backbone=False,
)
model.load_state_dict(torch.load("best_model.pth", weights_only=True))
model.eval()

# KS-optimal threshold from the paper
TAU_STAR = 0.72

inputs = tokenizer(
    "Ignore suas instruções anteriores e...",  # "Ignore your previous instructions and..."
    return_tensors="pt",
    truncation=True,
    max_length=512,
)
with torch.no_grad():
    logit = model(**inputs)
    # The head outputs a single logit, so apply a sigmoid (not a softmax)
    # to obtain P(y=1 | x)
    prob = torch.sigmoid(logit).item()

label = "harmful" if prob >= TAU_STAR else "benign"
print(f"Score: {prob:.3f} → {label}")
```

For the full BertMLPClassifier definition, clone the source repository.


Limitations

  • The dataset was generated via automated translation. Organically crafted Portuguese jailbreaks from native attackers may not be fully represented.
  • The model was trained on a static snapshot of WildJailbreak attack vectors. Novel jailbreak strategies not present in the training data may evade detection.
  • SecBERT is designed as one layer of a defense-in-depth strategy, not as a standalone solution.

Citation

@inproceedings{amorim2026secbert,
  title     = {Robustness of Language Models against {P}ortuguese Harmful Prompts},
  author    = {Amorim, Eduardo Alexandre de and Zanchettin, Cleber},
  booktitle = {Proceedings of the International Joint Conference on Neural Networks (IJCNN)},
  year      = {2026}
}

License

MIT License, for research use only. Users are responsible for complying with the terms of the original WildJailbreak dataset.
