Sky-Blue-da-ba-dee committed on
Commit
dbf9ef6
·
1 Parent(s): eed5cfa

improved README

Files changed (1)
  1. README.md +19 -3
README.md CHANGED
@@ -10,12 +10,28 @@ short_description: Multi-label classification of code-comment sentences
10
  ---
11
 
12
  ## Overview
13
- This repository implements an end-to-end pipeline to classify comment sentences into language-specific categories and to aggregate results at file/PR level so reviewers can focus on rationale, usage notes, deprecations, examples, and other high-value signals.
 
14
  The project targets and aims to surpass the NLBSE’26 baselines, providing reproducible training, evaluation, and inference.
15
 
16
- Core choices:
17
 
 
18
  - Task: multi-label text classification at sentence level
19
  - Scope: three languages with per-language models (Java, Python, Pharo)
20
  - Usage: batch predictions on submissions (pre-review), summaries per file/PR
21
- - Human-in-the-loop: reviewer confirmations/overrides feed threshold recalibration

10
  ---
11
 
12
  ## Overview
13
+ CodeCommentClassification is an end-to-end pipeline to classify comment sentences into language-specific categories and to aggregate results at file/PR level so reviewers can focus on rationale, usage notes, deprecations, examples, and other high-value signals.
14
+
15
  The project targets and aims to surpass the NLBSE’26 baselines, providing reproducible training, evaluation, and inference.
16
 
17
+ The full documentation is available here: https://se4ai2526-uniba.github.io/TheClouds/
18
 
19
+ #### Core choices:
20
  - Task: multi-label text classification at sentence level
21
  - Scope: three languages with per-language models (Java, Python, Pharo)
22
  - Usage: batch predictions on submissions (pre-review), summaries per file/PR
23
+ - Human-in-the-loop: reviewer confirmations/overrides feed threshold recalibration
24
+
25
+ #### Current model type:
26
+ CodeBERT, a bimodal transformer pretrained on code and natural language, excels in code understanding tasks by generating contextual embeddings for comments, enabling superior multi-label classification (e.g., Java Macro F1 0.7457, Micro F1 0.8364; Python Macro F1 0.6385).
27
+ The current (best) models are automatically downloaded from MLflow for each language.
28
+ Model cards are available here: https://huggingface.co/spaces/seai2526-uniba-TheClouds/Code-Comment-Classification-Api/tree/main/models/model_cards
29
+
30
+ #### API:
31
+ The API module runs as a secure FastAPI web service in a Python 3.11 Docker container, automatically syncing the latest champion models from MLflow at startup. It exposes endpoints for dynamic model listing and core prediction, accepting code comments with specified language and model type, then returning multi-label classifications using SetFit, Random Forest, or Transformer models via lazy-loaded predictors.
32
+ Endpoints:
33
+ - / (GET): Root health check returning a welcome message pointing to /docs.
34
+ - /privacy (GET): Static privacy notice confirming no data persistence.
35
+ - /status (GET): Simple running status indicator.
36
+ - /models (GET): Scans MODELS_DIR to list available language/model_type pairs dynamically.
37
+ - /predict (POST): Core inference endpoint; validates PredictRequest payload, loads predictor on-demand, runs classification on input text, and returns predictions list or detailed errors.## main.py
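
As a rough illustration of the `/predict` endpoint described in the added README text, the following is a minimal client sketch using only the Python standard library. The base URL, the JSON field names (`text`, `language`, `model_type`), and the example values are assumptions made for illustration. The diff only states that the endpoint accepts a comment with a language and model type, so the actual `PredictRequest` schema may differ.

```python
import json
from urllib import request

# Hypothetical base URL; replace with the actual deployment address.
BASE_URL = "http://localhost:8000"


def build_predict_request(text, language, model_type, base_url=BASE_URL):
    """Build (but do not send) a POST request for the /predict endpoint.

    The JSON field names are assumptions, not the confirmed
    PredictRequest schema.
    """
    payload = {"text": text, "language": language, "model_type": model_type}
    return request.Request(
        url=f"{base_url}/predict",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict(text, language, model_type, base_url=BASE_URL):
    """Send the request and return the decoded JSON response body."""
    req = build_predict_request(text, language, model_type, base_url)
    with request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `predict("Returns the user id.", "java", "transformer")` against a running instance would return whatever JSON the service produces. The `/models` endpoint is the place to check which language/model_type pairs are actually available before sending a request.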