---
title: Code Comment Classification API
emoji: 📚
colorFrom: indigo
colorTo: yellow
sdk: docker
pinned: false
license: mit
short_description: Multi-label classification of code-comment sentences
---

## Overview
CodeCommentClassification is an end-to-end pipeline that classifies comment sentences into language-specific categories and aggregates the results at file/PR level, so reviewers can focus on rationale, usage notes, deprecations, examples, and other high-value signals.
The project targets the NLBSE’26 baselines and aims to surpass them, providing reproducible training, evaluation, and inference.
The full documentation is available here: https://se4ai2526-uniba.github.io/TheClouds/
Core choices:
- Task: multi-label text classification at sentence level
- Scope: three languages with per-language models (Java, Python, Pharo)
- Usage: batch predictions on submissions (pre-review), summaries per file/PR
- Human-in-the-loop: reviewer confirmations/overrides feed threshold recalibration
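The per-file summaries mentioned above can be illustrated with a small sketch. The category names and the `summarize_file` helper are illustrative, not the project's actual schema:

```python
from collections import Counter

# Hypothetical sentence-level predictions for one file (multi-label).
# Category names are examples of the "high-value signals" described above.
predictions = [
    {"sentence": "Use this helper instead of FooUtil.", "labels": ["usage", "deprecation"]},
    {"sentence": "We cache results to avoid repeated I/O.", "labels": ["rationale"]},
    {"sentence": "Example: client.fetch(id=42)", "labels": ["example", "usage"]},
]

def summarize_file(preds):
    """Aggregate sentence-level labels into per-file category counts."""
    counts = Counter()
    for p in preds:
        counts.update(p["labels"])
    return dict(counts)

print(summarize_file(predictions))
# {'usage': 2, 'deprecation': 1, 'rationale': 1, 'example': 1}
```

A reviewer-facing PR summary would simply merge these per-file counts across the files in the submission.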
Current model type:
CodeBERT, a bimodal transformer pretrained on both code and natural language, produces contextual embeddings for comment sentences that drive the multi-label classification (e.g., Java Macro F1 0.7457, Micro F1 0.8364; Python Macro F1 0.6385). The current best model for each language is downloaded automatically from MLflow. Model cards are available here: https://huggingface.co/spaces/seai2526-uniba-TheClouds/Code-Comment-Classification-Api/tree/main/models/model_cards
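Multi-label decisions of this kind are typically made by passing per-label scores through a sigmoid and comparing against per-label thresholds, which is also where the reviewer-driven recalibration mentioned above would plug in. The label set, thresholds, and logits below are illustrative values, not the trained models' outputs:

```python
import math

LABELS = ["summary", "usage", "rationale"]  # illustrative label set
THRESHOLDS = {"summary": 0.5, "usage": 0.5, "rationale": 0.6}  # assumed per-label thresholds

def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def decide(logits):
    """Map raw per-label logits to the set of predicted labels."""
    return [
        label
        for label, logit in zip(LABELS, logits)
        if sigmoid(logit) >= THRESHOLDS[label]
    ]

print(decide([2.0, -1.0, 0.5]))
# sigmoid scores ~0.88, ~0.27, ~0.62 → ['summary', 'rationale']
```

Raising or lowering an entry in `THRESHOLDS` per label is what threshold recalibration from reviewer confirmations/overrides amounts to.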
## API
The API module runs as a secure FastAPI web service in a Python 3.11 Docker container and syncs the latest champion models from MLflow at startup. It exposes endpoints for dynamic model listing and prediction: a request specifies a code comment, a language, and a model type, and the service returns multi-label classifications from SetFit, Random Forest, or Transformer models via lazy-loaded predictors. Endpoints:
- `/` (GET): Root health check returning a welcome message that points to `/docs`.
- `/privacy` (GET): Static privacy notice confirming that no data is persisted.
- `/status` (GET): Simple running-status indicator.
- `/models` (GET): Scans `MODELS_DIR` to list the available language/model_type pairs dynamically.
- `/predict` (POST): Core inference endpoint; validates the `PredictRequest` payload, loads the predictor on demand, runs classification on the input text, and returns a list of predictions or a detailed error.

## main.py
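The on-demand loading behind `/predict` can be sketched as a cache keyed by (language, model_type), so each predictor is instantiated at most once per container. The names `get_predictor` and `load_model` are assumptions for illustration, not the actual main.py API, and the loader here is a stand-in for reading a real model from `MODELS_DIR`:

```python
from typing import Callable, Dict, List, Tuple

# Cache of already-instantiated predictors, keyed by (language, model_type).
_PREDICTORS: Dict[Tuple[str, str], Callable[[str], List[str]]] = {}

def load_model(language: str, model_type: str) -> Callable[[str], List[str]]:
    """Stand-in for the real model loading (e.g. deserializing from MODELS_DIR)."""
    return lambda text: [f"{language}-{model_type}-label"]  # dummy predictor

def get_predictor(language: str, model_type: str) -> Callable[[str], List[str]]:
    """Load each (language, model_type) predictor lazily, at most once."""
    key = (language, model_type)
    if key not in _PREDICTORS:
        _PREDICTORS[key] = load_model(language, model_type)
    return _PREDICTORS[key]

pred = get_predictor("java", "transformer")
print(pred("Returns the cached value."))  # ['java-transformer-label']
```

Keeping the cache module-level means the first `/predict` call for a given pair pays the load cost and later calls reuse the same object.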