YAML Metadata Warning:empty or missing yaml metadata in repo card
Check out the documentation for more information.
- KairoAI - Complete Project Documentation
- Table of Contents
- 1. Executive Summary
- 2. Project Vision & Goals
- 3. Technical Architecture
- 4. Technology Stack
- 5. Understanding the AI Pipeline
- 6. Data Flow & Pipeline
- 7. MediaPipe Explained
- 8. DNN Model Explained
- 9. Platform Channels Explained
- 10. Dataset Creation Guide
- 5. Understanding the AI Pipeline
- 6. Data Flow & Pipeline
- 7. MediaPipe Explained
- 8. DNN Model Explained
- 9. Platform Channels Explained
- 10. Dataset Creation Guide
- 11. Model Training Guide
- 12. Implementation Roadmap
- 13. Code Structure
- 14. Challenges & Solutions
- 15. Feasibility Assessment
- 16. Resources & Learning Path
- Appendix: Quick Reference
KairoAI - Complete Project Documentation
Indian Sign Language Learning App with AI-Powered Hand Detection
Author: Megh Modi
Created: December 18, 2025
Version: 1.0.0
Status: Planning & Architecture Phase
Table of Contents
- Executive Summary
- Project Vision & Goals
- Technical Architecture
- Technology Stack
- Understanding the AI Pipeline
- Data Flow & Pipeline
- MediaPipe Explained
- DNN Model Explained
- Platform Channels Explained
- Dataset Creation Guide
- Model Training Guide
- Implementation Roadmap
- Code Structure
- Challenges & Solutions
- Feasibility Assessment
- Resources & Learning Path
1. Executive Summary
What is KairoAI?
KairoAI is an Indian Sign Language (ISL) learning application designed specifically for children. The app uses real-time hand gesture detection via the device camera to teach ISL alphabets and words, providing instant feedback to students.
Core Innovation
The app combines three powerful technologies:
- Flutter for cross-platform UI
- MediaPipe for hand detection (running natively on Android)
- TensorFlow Lite for sign language classification
Key Differentiator
Unlike traditional learning apps, KairoAI provides real-time visual feedback by:
- Showing the user what sign to make
- Detecting their hand position using the camera
- Validating if they're making the correct sign
- Providing instant feedback (success/try again)
2. Project Vision & Goals
Primary Goal
Create an accessible, engaging platform for children to learn Indian Sign Language through interactive, AI-powered lessons.
Target Users
- Primary: Children aged 6-14 learning ISL
- Secondary: Parents and educators teaching ISL
- Tertiary: Anyone interested in learning ISL
Core Features
1. Lesson Mode
- Display a target alphabet (e.g., "A") or word (e.g., "MEGH")
- Open device camera
- Detect student's hand sign in real-time
- Validate against expected sign
- Show success animation/sound on correct detection
- Provide guidance hints on incorrect attempts
2. Quiz Mode
- Present random alphabets or words
- Student performs signs sequentially
- Each detected letter is validated in order
- Progress only on correct detection
- Track accuracy and completion time
3. Progress Tracking
- Store lesson completion in Firebase Firestore
- Track quiz scores and accuracy
- Visualize learning progress over time
- Gamification elements (badges, streaks)
3. Technical Architecture
High-Level Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FLUTTER LAYER (UI) β
β Written in Dart β
β β
β β’ Lessons UI β’ Quiz UI β’ Progress Dashboard β
β β’ Camera Preview β’ Feedback Animations β
β β’ Firebase Integration (Auth, Firestore) β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β
β Platform Channels (Bridge)
β
ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
β KOTLIN LAYER (Android Native) β
β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β CameraX ββ β MediaPipe ββ β TensorFlow β β
β β β β Hands β β Lite β β
β β Capture β β Detect hand β β Classify β β
β β frames β β Extract 21 β β sign β β
β β β β landmarks β β β β
β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β
β β
β Returns: { letter: "A", confidence: 0.95, handDetected: true } β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Architecture Principles
Why Hybrid Architecture?
| Component | Layer | Reason |
|---|---|---|
| UI & Navigation | Flutter (Dart) | Cross-platform, fast development, beautiful UI |
| Camera & ML | Kotlin (Native) | Direct hardware access, optimized performance |
| Firebase | Flutter (Dart) | Easy integration, real-time sync |
Key Design Decision
DO NOT attempt to run MediaPipe or camera processing in Dart.
Why?
- Flutter cannot directly access native camera APIs efficiently
- MediaPipe requires native Android/iOS libraries
- ML inference is faster in native code
- Better battery performance with native implementation
4. Technology Stack
Flutter Side (Dart)
Core Dependencies
dependencies:
flutter:
sdk: flutter
# Firebase
firebase_core: ^2.24.2 # Firebase initialization
firebase_auth: ^4.16.0 # User authentication
cloud_firestore: ^4.14.0 # Database for progress tracking
# State Management
provider: ^6.1.1 # For managing app state
# Navigation
go_router: ^13.0.0 # Declarative routing
# UI/UX Enhancements
lottie: ^3.0.0 # Success animations
audioplayers: ^5.2.1 # Sound effects
# Utilities
equatable: ^2.0.5 # Value comparison
Why These Libraries?
| Library | Purpose | Alternative |
|---|---|---|
provider |
Simple state management | riverpod, bloc |
go_router |
Type-safe routing | auto_route, manual routing |
lottie |
Beautiful animations | flare, custom animations |
Android/Kotlin Side
Build Configuration
// android/app/build.gradle.kts
plugins {
id("com.android.application")
id("kotlin-android")
id("dev.flutter.flutter-gradle-plugin")
}
android {
namespace = "com.kairo.ai"
compileSdk = 34
defaultConfig {
applicationId = "com.kairo.ai"
minSdk = 26 // Required for CameraX
targetSdk = 34
versionCode = 1
versionName = "1.0"
}
compileOptions {
sourceCompatibility = JavaVersion.VERSION_17
targetCompatibility = JavaVersion.VERSION_17
}
kotlinOptions {
jvmTarget = "17"
}
// Required for TFLite model files
aaptOptions {
noCompress("tflite")
}
}
Dependencies
dependencies {
// MediaPipe Tasks Vision (Hand Landmark Detection)
implementation("com.google.mediapipe:tasks-vision:0.10.14")
// TensorFlow Lite
implementation("org.tensorflow:tensorflow-lite:2.14.0")
implementation("org.tensorflow:tensorflow-lite-support:0.4.4")
// CameraX (Camera API)
implementation("androidx.camera:camera-core:1.3.1")
implementation("androidx.camera:camera-camera2:1.3.1")
implementation("androidx.camera:camera-lifecycle:1.3.1")
implementation("androidx.camera:camera-view:1.3.1")
// Coroutines for async operations
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}
Python Side (Model Training)
Required Libraries
tensorflow==2.15.0 # Deep learning framework
mediapipe==0.10.9 # Hand landmark extraction
opencv-python==4.8.1.78 # Image processing
numpy==1.26.2 # Numerical operations
pandas==2.1.3 # Data manipulation
scikit-learn==1.3.2 # ML utilities
matplotlib==3.8.2 # Visualization
Installation
pip install tensorflow mediapipe opencv-python numpy pandas scikit-learn matplotlib
5. Understanding the AI Pipeline
What is AI Doing in This App?
The AI has one primary job: "Look at the camera and tell me which ISL letter the user is showing"
The Problem Breakdown
Traditional Approach (Pure Image Classification)
Camera Image β CNN Model β Letter
Problem: Slow, requires huge dataset, background sensitive
Our Smart Approach (Landmark-based Classification)
Camera Image β MediaPipe (find hand) β Extract landmarks β DNN Model β Letter
Benefit: Fast, small dataset, background-independent
Why Two AI Models?
Model 1: MediaPipe Hands (Google's Pre-trained Model)
Job: Find the hand and identify 21 key points
Input: Camera frame (640Γ480 pixels)
Output: 21 landmark points (x, y, z coordinates)
Example landmarks:
Point 0: Wrist
Point 1-4: Thumb (base to tip)
Point 5-8: Index finger
Point 9-12: Middle finger
Point 13-16: Ring finger
Point 17-20: Pinky finger
Why use it?
- Already trained by Google on millions of images
- Works in real-time on mobile devices
- Handles different hand sizes, skin tones, lighting
- Free to use
Model 2: Your Custom TFLite Model (Train Yourself)
Job: Classify the 21 landmark points into ISL letters
Input: 63 numbers (21 points Γ 3 coordinates)
Output: Letter (A-Z) + confidence score
Example:
Input: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
Output: { letter: "A", confidence: 0.95 }
Why train your own?
- ISL signs are unique (different from ASL)
- You control accuracy by adding more training data
- Model is tiny (~50-100 KB)
- Fast inference (~1-5ms)
6. Data Flow & Pipeline
Complete Pipeline: Camera β Detection β Flutter UI
Step-by-Step Data Transformation
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 1: CAMERA CAPTURE (CameraX) β
β β
β Input: Nothing (hardware) β
β Process: Open camera, capture frames at 30 FPS β
β Output: Bitmap (640Γ480 RGB image) β
β Data Size: 921,600 values (640 Γ 480 Γ 3) β
β β
β Visual: β
β βββββββββββββββββββ β
β β βββββββββββββββ β β
β β βββββββββββββββ β β Raw camera frame β
β β βββββββββββββββ β (user's hand visible) β
β β βββββββββββββββ β β
β β βββββββββββββββ β β
β β βββββββββββββββ β β
β βββββββββββββββββββ β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 2: HAND DETECTION (MediaPipe) β
β β
β Input: Bitmap (921,600 values) β
β Process: β
β 1. Detect if hand exists in frame β
β 2. Locate 21 anatomical landmarks β
β 3. Extract (x, y, z) for each point β
β Output: FloatArray[63] = [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]β
β Data Size: 63 float values β
β Reduction: 921,600 β 63 (99.99% reduction!) β
β β
β Visual - 21 Landmark Points: β
β 8 12 16 20 (fingertips) β
β | | | | β
β 7 11 15 19 | β
β | | | | | β
β 6 10 14 18 | β
β | | | | | β
β 5βββ9βββ13ββ17βββ β
β \ β
β 4βββ3βββ2βββ1 (thumb) β
β \ β
β 0 (wrist) β
β β
β Example Output for Letter "A": β
β [0.45, 0.82, 0.01, β Point 0 (wrist) β
β 0.52, 0.75, 0.02, β Point 1 (thumb base) β
β 0.58, 0.65, 0.03, β Point 2 (thumb middle) β
β ... β
β 0.63, 0.42, 0.01] β Point 20 (pinky tip) β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 3: SIGN CLASSIFICATION (TensorFlow Lite) β
β β
β Input: FloatArray[63] β
β Process: β
β Neural Network Layers: β
β [63] β [128 neurons] β [64 neurons] β [32 neurons] β [26] β
β Dense Layer Dense Layer Dense Layer Softmax β
β β
β Output: Probabilities for each letter β
β [p_A, p_B, p_C, ..., p_Z] β
β β
β Example: β
β Input: [0.45, 0.82, 0.01, ...] β
β Output: [0.01, 0.03, 0.95, 0.00, ...] β
β (1% 3% 95% 0% ...) β
β β
β Best Prediction: Letter "C" with 95% confidence β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 4: SEND TO FLUTTER (Platform Channel - EventChannel) β
β β
β Kotlin prepares data as Map: β
β { β
β "letter": "C", β
β "confidence": 0.95, β
β "handDetected": true, β
β "timestamp": 1702907345000 β
β } β
β β
β Sends through EventChannel (continuous stream) β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 5: FLUTTER RECEIVES & DISPLAYS (Dart) β
β β
β EventChannel stream listener receives data β
β β
β setState(() { β
β currentLetter = "C"; β
β confidence = 0.95; β
β handDetected = true; β
β }); β
β β
β UI Updates: β
β ββββββββββββββββββββββββββββββ β
β β π― Target: Letter A β β
β β β β
β β π· [Camera Preview] β β
β β β β
β β β Detected: C (95%) β β Red (incorrect) β
β β π‘ Hint: Try again! β β
β ββββββββββββββββββββββββββββββ β
β β
β If correct (C == target): β
β ββββββββββββββββββββββββββββββ β
β β β
Correct! (95%) β β Green (success) β
β β π [Success Animation] β β
β β π [Success Sound] β β
β ββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Processing Speed
| Step | Processing Time | Frequency |
|---|---|---|
| Camera Frame | ~33ms | 30 FPS |
| MediaPipe Detection | ~10-20ms | Per frame |
| TFLite Classification | ~1-5ms | Per detection |
| Platform Channel | <1ms | Per result |
| Total Latency | ~50ms | ~20 detections/sec |
This means users get near-instant feedback!
7. MediaPipe Explained
What is MediaPipe?
MediaPipe is Google's open-source framework for building multimodal (video, audio, text) applied machine learning pipelines.
MediaPipe Hands Solution
Specifically designed to detect and track hands in real-time.
Key Capabilities:
- Detects up to 2 hands simultaneously
- Works in various lighting conditions
- Handles different hand sizes and skin tones
- Runs efficiently on mobile devices
- Provides 21 3D landmark points per hand
The 21 Hand Landmarks
Landmark Numbering System:
8 12 16 20
β β β β
7βββ11ββ15ββ19βββ
β β β β β
6βββ10ββ14ββ18βββ
β β β β β
5βββ9βββ13ββ17βββ
\
4βββ3βββ2βββ1
\
0
Point 0: Wrist
Point 1: Thumb CMC (base)
Point 2: Thumb MCP
Point 3: Thumb IP
Point 4: Thumb tip
Point 5: Index finger MCP
Point 6: Index finger PIP
Point 7: Index finger DIP
Point 8: Index finger tip
Point 9: Middle finger MCP
Point 10: Middle finger PIP
Point 11: Middle finger DIP
Point 12: Middle finger tip
Point 13: Ring finger MCP
Point 14: Ring finger PIP
Point 15: Ring finger DIP
Point 16: Ring finger tip
Point 17: Pinky MCP
Point 18: Pinky PIP
Point 19: Pinky DIP
Point 20: Pinky tip
Coordinate System
Each landmark has 3 coordinates:
X Coordinate
- Range: 0.0 to 1.0
- 0.0 = left edge of image
- 1.0 = right edge of image
- Normalized (independent of image resolution)
Y Coordinate
- Range: 0.0 to 1.0
- 0.0 = top edge of image
- 1.0 = bottom edge of image
- Normalized (independent of image resolution)
Z Coordinate
- Approximate depth from wrist
- Smaller values = closer to camera
- Relative to wrist (Point 0)
- Units: roughly in same scale as X
Example Coordinates
Letter "A" (closed fist with thumb up):
Point 0 (Wrist): x=0.50, y=0.70, z=0.00
Point 1 (Thumb base): x=0.48, y=0.65, z=0.02
Point 2 (Thumb mid): x=0.46, y=0.58, z=0.03
Point 3 (Thumb bend): x=0.44, y=0.52, z=0.04
Point 4 (Thumb tip): x=0.42, y=0.45, z=0.05
Point 5 (Index base): x=0.54, y=0.66, z=0.01
Point 6 (Index mid): x=0.56, y=0.68, z=0.00
Point 7 (Index bend): x=0.57, y=0.69, z=-0.01
Point 8 (Index tip): x=0.58, y=0.70, z=-0.02
...
MediaPipe Integration Code
// File: android/app/src/main/kotlin/com/kairo/ai/ml/HandLandmarkDetector.kt
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.framework.image.MPImage
import android.graphics.Bitmap
import android.content.Context
class HandLandmarkDetector(context: Context) {
private val handLandmarker: HandLandmarker
init {
// Configure MediaPipe Hands
val options = HandLandmarker.HandLandmarkerOptions.builder()
.setBaseOptions(
BaseOptions.builder()
.setModelAssetPath("hand_landmarker.task") // MediaPipe's pre-trained model
.build()
)
.setNumHands(1) // Detect only one hand
.setMinHandDetectionConfidence(0.5f) // 50% confidence threshold
.setMinHandPresenceConfidence(0.5f)
.setMinTrackingConfidence(0.5f)
.build()
handLandmarker = HandLandmarker.createFromOptions(context, options)
}
/**
* Detect hand landmarks from a camera frame
*
* @param bitmap The camera frame
* @return FloatArray of 63 values [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]
* or null if no hand detected
*/
fun detectLandmarks(bitmap: Bitmap): FloatArray? {
// Convert Android Bitmap to MediaPipe Image
val mpImage: MPImage = BitmapImageBuilder(bitmap).build()
// Run hand detection
val result: HandLandmarkerResult = handLandmarker.detect(mpImage)
// Check if any hands were detected
if (result.landmarks().isEmpty()) {
return null // No hand found
}
// Get landmarks from first detected hand
val handLandmarks = result.landmarks()[0]
// Convert to flat array
val landmarkArray = FloatArray(63)
for (i in 0 until 21) {
val landmark = handLandmarks[i]
landmarkArray[i * 3 + 0] = landmark.x()
landmarkArray[i * 3 + 1] = landmark.y()
landmarkArray[i * 3 + 2] = landmark.z()
}
return landmarkArray
}
/**
* Optional: Normalize landmarks relative to wrist
* This makes the model robust to hand position/size
*/
fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
val normalized = FloatArray(63)
// Get wrist coordinates (point 0)
val wristX = landmarks[0]
val wristY = landmarks[1]
val wristZ = landmarks[2]
// Normalize all points relative to wrist
for (i in 0 until 21) {
normalized[i * 3 + 0] = landmarks[i * 3 + 0] - wristX
normalized[i * 3 + 1] = landmarks[i * 3 + 1] - wristY
normalized[i * 3 + 2] = landmarks[i * 3 + 2] - wristZ
}
return normalized
}
}
8. DNN Model Explained
What is a DNN (Dense Neural Network)?
A DNN is a type of artificial neural network where every neuron in one layer is connected to every neuron in the next layer.
Simple Analogy
Think of it as a decision-making chain:
Your Brain Recognizing a Friend:
Eyes see features β Brain processes patterns β Brain decides who it is
(height, hair, (tall + brown hair ("It's John!")
glasses, voice) + glasses = pattern)
DNN for Sign Language
Input: 63 numbers β Hidden layers process β Output: Letter prediction
(hand landmarks) (find patterns) (A-Z with confidence)
DNN vs CNN vs Other Models
| Model Type | Input | Best For | When to Use |
|---|---|---|---|
| DNN (Dense) | Numbers/Features | Structured data, pre-extracted features | When you have landmark coordinates |
| CNN (Convolutional) | Images | Raw image classification | When working with raw pixels |
| RNN/LSTM | Sequences | Time series, text | When order matters (sentences, videos) |
| Transformer | Sequences | Language, large-scale | Complex NLP tasks (ChatGPT) |
Why DNN for KairoAI?
Comparison: CNN vs DNN Approach
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CNN APPROACH (Not Used) β
β β
β Camera Frame (640Γ480Γ3 = 921,600 values) β
β β β
β Convolutional Layers (find edges, shapes) β
β β β
β Pooling Layers (reduce size) β
β β β
β More Conv Layers (find complex patterns) β
β β β
β Dense Layers (classify) β
β β β
β Output: Letter β
β β
β Problems: β
β β Slow (100-200ms inference) β
β β Large model (10-50 MB) β
β β Needs 10,000+ images per class β
β β Background variations affect accuracy β
β β Sensitive to lighting, hand size, position β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DNN APPROACH (Our Choice) β
β
β β
β MediaPipe Output (63 landmark values) β
β β β
β Dense Layer 1 (128 neurons, ReLU activation) β
β β β
β Dense Layer 2 (64 neurons, ReLU activation) β
β β β
β Dense Layer 3 (32 neurons, ReLU activation) β
β β β
β Output Layer (26 neurons, Softmax activation) β
β β β
β Output: Letter + Confidence β
β β
β Benefits: β
β β
Fast (1-5ms inference) β
β β
Tiny model (~50-100 KB) β
β β
Needs only 500-1000 images per class β
β β
Background doesn't matter (landmarks only) β
β β
Works in any lighting, any hand size β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Model Architecture
Layer-by-Layer Breakdown
# Our DNN Model Architecture
Input Layer:
Shape: (63,)
Description: 63 landmark coordinates
Example: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
β (fully connected)
Hidden Layer 1:
Neurons: 128
Activation: ReLU (Rectified Linear Unit)
Dropout: 30% (prevents overfitting)
Description: Learns basic patterns (finger positions)
β (fully connected)
Hidden Layer 2:
Neurons: 64
Activation: ReLU
Dropout: 30%
Description: Learns complex patterns (hand shapes)
β (fully connected)
Hidden Layer 3:
Neurons: 32
Activation: ReLU
Dropout: 20%
Description: Learns letter-specific features
β (fully connected)
Output Layer:
Neurons: 26
Activation: Softmax
Description: Probability for each letter (A-Z)
Example Output: [0.01, 0.03, 0.95, 0.00, ..., 0.01]
(1% 3% 95% 0% 1% )
A B C D ... Z
What Each Layer Does
1. Input Layer (63 neurons)
- Receives raw landmark coordinates
- No processing, just passes data forward
2. Hidden Layer 1 (128 neurons)
- Learns basic geometric relationships
- Examples:
- "Are fingers spread apart?"
- "Is thumb extended?"
- "What's the palm orientation?"
3. Hidden Layer 2 (64 neurons)
- Combines basic patterns into complex ones
- Examples:
- "Thumb up + fingers curled = might be 'A'"
- "All fingers extended = might be 'B'"
4. Hidden Layer 3 (32 neurons)
- Fine-tunes letter-specific features
- Distinguishes similar signs
- Examples:
- "Is this 'M' or 'N'?" (very similar in ISL)
5. Output Layer (26 neurons)
- Each neuron represents one letter
- Softmax ensures probabilities sum to 1.0
- Highest probability = predicted letter
Activation Functions
ReLU (Rectified Linear Unit)
Formula: f(x) = max(0, x)
Graph:
β β±
β β±
β β±
βββββΌβββββ
β
Benefits:
- Fast to compute
- Prevents vanishing gradients
- Works well for hidden layers
Softmax
Formula: softmax(xi) = e^xi / Ξ£(e^xj)
Purpose:
- Converts raw scores to probabilities
- All outputs sum to 1.0
- Used in output layer for classification
Example:
Raw scores: [2.1, 0.5, 4.2, 1.3]
After softmax: [0.12, 0.02, 0.84, 0.02]
(12% 2% 84% 2% )
Dropout Layers
Purpose: Prevent overfitting
During Training:
- Randomly "turn off" 30% of neurons
- Forces network to learn robust features
- Network can't rely on specific neurons
During Inference (app usage):
- All neurons active
- Uses learned patterns to classify
Model Training Code
# File: model_training/train_model.py
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load landmark dataset
print("Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')
print(f"Dataset shape: {data.shape}")
print(f"Classes: {sorted(data['label'].unique())}")
# Separate features (X) and labels (y)
X = data.iloc[:, :-1].values # 63 landmark columns
y = data.iloc[:, -1].values # Label column
# Encode labels: A=0, B=1, C=2, ..., Z=25
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)
# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"Label mapping: {label_mapping}")
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y_categorical,
test_size=0.2,
random_state=42,
stratify=y_categorical # Maintain class distribution
)
print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
# Build the DNN model
model = keras.Sequential([
# Input layer
keras.layers.Input(shape=(63,), name='landmark_input'),
# Hidden layer 1
keras.layers.Dense(128, activation='relu', name='dense_1'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
# Hidden layer 2
keras.layers.Dense(64, activation='relu', name='dense_2'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
# Hidden layer 3
keras.layers.Dense(32, activation='relu', name='dense_3'),
keras.layers.Dropout(0.2),
# Output layer
keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Display model summary
model.summary()
# Training callbacks
callbacks = [
# Stop training if validation loss doesn't improve for 10 epochs
keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
),
# Reduce learning rate if validation loss plateaus
keras.callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=5,
min_lr=0.00001
),
# Save best model during training
keras.callbacks.ModelCheckpoint(
'best_model.h5',
monitor='val_accuracy',
save_best_only=True
)
]
# Train the model
print("\nTraining model...")
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2, # Use 20% of training data for validation
callbacks=callbacks,
verbose=1
)
# Evaluate on test set
print("\nEvaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")
# Save final model
model.save('isl_model.h5')
print("\nModel saved as 'isl_model.h5'")
# Convert to TFLite
print("\nConverting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('isl_model.tflite', 'wb') as f:
f.write(tflite_model)
print(f"TFLite model saved!")
print(f"Model size: {len(tflite_model) / 1024:.2f} KB")
Model Summary Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
landmark_input (InputLayer) [(None, 63)] 0
dense_1 (Dense) (None, 128) 8,192
batch_normalization (None, 128) 512
dropout (Dropout) (None, 128) 0
dense_2 (Dense) (None, 64) 8,256
batch_normalization_1 (None, 64) 256
dropout_1 (Dropout) (None, 64) 0
dense_3 (Dense) (None, 32) 2,080
dropout_2 (Dropout) (None, 32) 0
output (Dense) (None, 26) 858
=================================================================
Total params: 20,154 (78.73 KB)
Trainable params: 19,770 (77.23 KB)
Non-trainable params: 384 (1.50 KB)
_________________________________________________________________
What Happens During Training
Epoch 1/100
ββββββββββββββββββββββββββββββββββββββββ
Batch 1/250: Forward pass β Calculate loss β Backward pass β Update weights
Batch 2/250: Forward pass β Calculate loss β Backward pass β Update weights
...
Batch 250/250: Forward pass β Calculate loss β Backward pass β Update weights
Validation:
- Test on validation set (20% of training data)
- Calculate validation accuracy
- If improved, save as best model
Epoch 1: loss=0.5423, accuracy=0.8234, val_loss=0.4321, val_accuracy=0.8567
Epoch 2: loss=0.3215, accuracy=0.8876, val_loss=0.3102, val_accuracy=0.8923
Epoch 3: loss=0.2543, accuracy=0.9123, val_loss=0.2876, val_accuracy=0.9034
...
Epoch 45: loss=0.0234, accuracy=0.9892, val_loss=0.0456, val_accuracy=0.9845
Epoch 46: loss=0.0231, accuracy=0.9894, val_loss=0.0458, val_accuracy=0.9844
(No improvement for 10 epochs β Early stopping triggered)
Best model: Epoch 45 with val_accuracy=0.9845
9. Platform Channels Explained
What Are Platform Channels?
Platform channels are Flutter's official mechanism for communication between Dart code and native platform code (Kotlin/Swift).
The Problem They Solve
Flutter (Dart) runs in its own runtime
β
Cannot directly access:
- Native camera APIs
- MediaPipe library (Android/iOS)
- Hardware sensors
- Native ML frameworks
- Bluetooth, NFC, etc.
Solution: Platform Channels = Bridge between worlds
Types of Platform Channels
1. MethodChannel (Request-Response)
Use Case: One-time requests with responses
Flutter asks β Kotlin does work β Kotlin responds β Flutter receives
Example: Start/stop camera, take photo, get device info
// Flutter side
final result = await methodChannel.invokeMethod('getCameraStatus');
print(result); // "active" or "inactive"
// Kotlin side
methodChannel.setMethodCallHandler { call, result ->
when (call.method) {
"getCameraStatus" -> {
val status = if (cameraActive) "active" else "inactive"
result.success(status)
}
}
}
2. EventChannel (Continuous Stream)
Use Case: Continuous data stream from native to Flutter
Kotlin continuously sends β Flutter receives stream β UI updates in real-time
Example: Hand detection results, sensor data, location updates
// Flutter side
eventChannel.receiveBroadcastStream().listen((data) {
print(data); // Continuous updates
});
// Kotlin side
eventChannel.setStreamHandler(object : EventChannel.StreamHandler {
override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
// Send data continuously
events?.success(detectionResult)
}
})
3. BasicMessageChannel (Bi-directional)
Use Case: Custom protocols, binary data
Not commonly used for this project
Comparison Table
| Feature | MethodChannel | EventChannel |
|---|---|---|
| Direction | Bi-directional | Kotlin β Flutter (one way) |
| Pattern | Request-Response | Stream |
| Use Case | Commands | Continuous data |
| Example | "Start camera" | Detection results every 50ms |
| Frequency | On-demand | Continuous |
Complete Implementation for KairoAI
Flutter Side (Dart)
// File: lib/services/sign_detection_service.dart
import 'package:flutter/services.dart';
import 'dart:async';
class SignDetectionService {
// Method channel for commands
static const MethodChannel _methodChannel =
MethodChannel('com.kairo.ai/detection');
// Event channel for continuous detection stream
static const EventChannel _eventChannel =
EventChannel('com.kairo.ai/detection_stream');
Stream<DetectionResult>? _detectionStream;
/// Start hand sign detection
Future<void> startDetection() async {
try {
await _methodChannel.invokeMethod('startDetection');
print('β
Detection started');
} on PlatformException catch (e) {
print('β Error starting detection: ${e.message}');
rethrow;
}
}
/// Stop hand sign detection
Future<void> stopDetection() async {
try {
await _methodChannel.invokeMethod('stopDetection');
print('β
Detection stopped');
} on PlatformException catch (e) {
print('β Error stopping detection: ${e.message}');
rethrow;
}
}
/// Get continuous stream of detection results
Stream<DetectionResult> get detectionStream {
_detectionStream ??= _eventChannel
.receiveBroadcastStream()
.map((event) {
final data = Map<String, dynamic>.from(event);
return DetectionResult.fromMap(data);
});
return _detectionStream!;
}
/// Check camera permission status
Future<bool> checkCameraPermission() async {
try {
final bool hasPermission =
await _methodChannel.invokeMethod('checkCameraPermission');
return hasPermission;
} on PlatformException catch (e) {
print('β Error checking permission: ${e.message}');
return false;
}
}
/// Request camera permission
Future<bool> requestCameraPermission() async {
try {
final bool granted =
await _methodChannel.invokeMethod('requestCameraPermission');
return granted;
} on PlatformException catch (e) {
print('β Error requesting permission: ${e.message}');
return false;
}
}
}
/// Data class for detection results
class DetectionResult {
final String letter;
final double confidence;
final bool handDetected;
final int timestamp;
DetectionResult({
required this.letter,
required this.confidence,
required this.handDetected,
required this.timestamp,
});
factory DetectionResult.fromMap(Map<String, dynamic> map) {
return DetectionResult(
letter: map['letter'] as String,
confidence: (map['confidence'] as num).toDouble(),
handDetected: map['handDetected'] as bool,
timestamp: map['timestamp'] as int,
);
}
@override
String toString() {
return 'DetectionResult(letter: $letter, confidence: ${confidence.toStringAsFixed(2)}, handDetected: $handDetected)';
}
}
Kotlin Side (Android)
// File: android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt
package com.kairo.ai
import android.Manifest
import android.content.pm.PackageManager
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat
import io.flutter.embedding.android.FlutterActivity
import io.flutter.embedding.engine.FlutterEngine
import io.flutter.plugin.common.MethodChannel
import io.flutter.plugin.common.EventChannel
import android.graphics.Bitmap
class MainActivity : FlutterActivity() {
// Channel names (must match Flutter side)
private val METHOD_CHANNEL = "com.kairo.ai/detection"
private val EVENT_CHANNEL = "com.kairo.ai/detection_stream"
// Camera permission request code
private val CAMERA_PERMISSION_CODE = 100
// Our custom classes (to be implemented)
private lateinit var cameraManager: CameraManager
private lateinit var handDetector: HandLandmarkDetector
private lateinit var signClassifier: SignClassifier
// Event sink for streaming data to Flutter
private var eventSink: EventChannel.EventSink? = null
override fun configureFlutterEngine(flutterEngine: FlutterEngine) {
super.configureFlutterEngine(flutterEngine)
// Initialize our components
cameraManager = CameraManager(this)
handDetector = HandLandmarkDetector(this)
signClassifier = SignClassifier(this)
// Setup MethodChannel for commands
MethodChannel(
flutterEngine.dartExecutor.binaryMessenger,
METHOD_CHANNEL
).setMethodCallHandler { call, result ->
when (call.method) {
"startDetection" -> {
startDetection()
result.success(null)
}
"stopDetection" -> {
stopDetection()
result.success(null)
}
"checkCameraPermission" -> {
val hasPermission = checkCameraPermission()
result.success(hasPermission)
}
"requestCameraPermission" -> {
requestCameraPermission()
result.success(null) // Actual result comes via callback
}
else -> {
result.notImplemented()
}
}
}
// Setup EventChannel for streaming data
EventChannel(
flutterEngine.dartExecutor.binaryMessenger,
EVENT_CHANNEL
).setStreamHandler(object : EventChannel.StreamHandler {
override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
eventSink = events
println("β
Flutter is now listening to detection stream")
}
override fun onCancel(arguments: Any?) {
eventSink = null
stopDetection()
println("β Flutter stopped listening to detection stream")
}
})
}
/**
* Start camera and begin hand detection
*/
private fun startDetection() {
println("π₯ Starting camera and detection...")
cameraManager.startCamera { bitmap ->
processFrame(bitmap)
}
}
/**
* Stop camera and detection
*/
private fun stopDetection() {
println("π Stopping camera and detection...")
cameraManager.stopCamera()
}
/**
* Process each camera frame
*/
private fun processFrame(bitmap: Bitmap) {
// Step 1: Detect hand landmarks
val landmarks = handDetector.detectLandmarks(bitmap)
if (landmarks != null) {
// Step 2: Classify the sign
val result = signClassifier.classify(landmarks)
// Step 3: Send to Flutter
val data = mapOf(
"letter" to result.letter,
"confidence" to result.confidence,
"handDetected" to true,
"timestamp" to System.currentTimeMillis()
)
// Must run on UI thread
runOnUiThread {
eventSink?.success(data)
}
} else {
// No hand detected
val data = mapOf(
"letter" to "",
"confidence" to 0.0,
"handDetected" to false,
"timestamp" to System.currentTimeMillis()
)
runOnUiThread {
eventSink?.success(data)
}
}
}
/**
* Check if camera permission is granted
*/
private fun checkCameraPermission(): Boolean {
return ContextCompat.checkSelfPermission(
this,
Manifest.permission.CAMERA
) == PackageManager.PERMISSION_GRANTED
}
/**
* Request camera permission from user
*/
private fun requestCameraPermission() {
ActivityCompat.requestPermissions(
this,
arrayOf(Manifest.permission.CAMERA),
CAMERA_PERMISSION_CODE
)
}
/**
* Handle permission request result
*/
override fun onRequestPermissionsResult(
requestCode: Int,
permissions: Array<out String>,
grantResults: IntArray
) {
super.onRequestPermissionsResult(requestCode, permissions, grantResults)
when (requestCode) {
CAMERA_PERMISSION_CODE -> {
val granted = grantResults.isNotEmpty() &&
grantResults[0] == PackageManager.PERMISSION_GRANTED
if (granted) {
println("β
Camera permission granted")
} else {
println("β Camera permission denied")
}
}
}
}
}
Communication Flow Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FLUTTER (Dart) β
β β
β User taps "Start Lesson" β
β β β
β LessonScreen calls: β
β signDetectionService.startDetection() β
β β β
β MethodChannel sends: "startDetection" β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β
βΌ Platform Channel Bridge
β
ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
β KOTLIN (Android) β
β β
β MethodChannel receives: "startDetection" β
β β β
β MainActivity.startDetection() β
β β β
β CameraManager.startCamera() β
β β β
β Camera frames arrive (30 FPS) β
β β β
β processFrame(bitmap) for each frame: β
β 1. HandDetector.detectLandmarks(bitmap) β 63 floats β
β 2. SignClassifier.classify(landmarks) β Letter + confidence β
β 3. Package into Map β
β 4. EventChannel sends to Flutter β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β
βΌ Event Channel Stream
β
ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
β FLUTTER (Dart) β
β β
β EventChannel.receiveBroadcastStream() β
β β β
β .listen((data) { β
β setState(() { β
β currentLetter = data['letter']; β
β confidence = data['confidence']; β
β }); β
β }) β
β β β
β UI rebuilds with new data β
β Shows: "Detected: A (95%)" β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Data Format Examples
Method Channel (Command)
Flutter β Kotlin:
{
"method": "startDetection",
"arguments": null
}
Kotlin β Flutter:
{
"success": true
}
Event Channel (Stream)
Kotlin β Flutter (continuous stream):
Frame 1:
{
"letter": "A",
"confidence": 0.87,
"handDetected": true,
"timestamp": 1702907345000
}
Frame 2 (50ms later):
{
"letter": "A",
"confidence": 0.91,
"handDetected": true,
"timestamp": 1702907345050
}
Frame 3 (50ms later):
{
"letter": "A",
"confidence": 0.95,
"handDetected": true,
"timestamp": 1702907345100
}
Frame 4 (user changes hand):
{
"letter": "B",
"confidence": 0.23,
"handDetected": true,
"timestamp": 1702907345150
}
Frame 5:
{
"letter": "B",
"confidence": 0.78,
"handDetected": true,
"timestamp": 1702907345200
}
Error Handling
// Flutter side - Handling errors
try {
await signDetectionService.startDetection();
} on PlatformException catch (e) {
switch (e.code) {
case 'CAMERA_ERROR':
showSnackBar('Camera failed to start');
break;
case 'PERMISSION_DENIED':
showSnackBar('Camera permission required');
break;
case 'MEDIAPIPE_ERROR':
showSnackBar('Hand detection failed');
break;
default:
showSnackBar('Unknown error: ${e.message}');
}
}
// Kotlin side - Sending errors
try {
startCamera()
result.success(null)
} catch (e: SecurityException) {
result.error("PERMISSION_DENIED", "Camera permission not granted", null)
} catch (e: CameraAccessException) {
result.error("CAMERA_ERROR", "Failed to access camera: ${e.message}", null)
} catch (e: Exception) {
result.error("UNKNOWN_ERROR", e.message, null)
}
10. Dataset Creation Guide
Understanding the Dataset
What You Need
For a 26-letter ISL alphabet app, you need:
Dataset Size Calculation:
- 26 letters (A-Z)
- 500-1000 images per letter (recommended)
- Total: 13,000 - 26,000 images
Actual Data After Extraction:
- Each image β 1 row in CSV
- Each row = 63 landmark values + 1 label
- Final CSV: 13,000-26,000 rows Γ 64 columns
Dataset Quality Factors
| Factor | Impact on Accuracy | Recommendation |
|---|---|---|
| Number of samples | High | 500+ per letter |
| Variety of people | High | 5-10 different people |
| Hand orientations | Medium | Multiple angles |
| Lighting conditions | Low (landmarks robust) | Normal indoor lighting OK |
| Background | None (landmarks only) | Any background works |
| Camera distance | Medium | Keep consistent (arm's length) |
Option 1: Use Existing Dataset (Fastest)
Step 1: Find ISL Dataset
# Search on Kaggle
https://www.kaggle.com/search?q=indian+sign+language
# Popular datasets:
# 1. "ISL Dataset" by various authors
# 2. "Indian Sign Language Recognition Dataset"
# 3. "ISL Alphabet Dataset"
Step 2: Download and Extract
# Download from Kaggle (requires Kaggle account)
kaggle datasets download -d <dataset-name>
# Extract
unzip dataset.zip
# Expected structure:
ISL_Dataset/
βββ A/
β βββ img_001.jpg
β βββ img_002.jpg
β βββ ...
βββ B/
β βββ img_001.jpg
β βββ ...
βββ Z/
βββ ...
Option 2: Collect Your Own (Better Accuracy)
Method 1: Video Recording (Recommended)
Equipment Needed:
- Smartphone camera
- Good lighting (natural or indoor)
- Plain background (optional but helpful)
Process:
1. Record 30-second video per letter
2. Person makes the sign continuously
3. Vary hand position slightly
4. Extract frames β 200-300 images per video
Advantages:
- Quick data collection (30 min for all 26 letters)
- Natural hand movements
- Variety in positioning
Step-by-Step Video Collection
# File: data_collection/extract_frames_from_video.py
import cv2
import os
def extract_frames_from_video(video_path, output_folder, letter, frame_interval=3):
"""
Extract frames from video at specified interval
Args:
video_path: Path to video file
output_folder: Where to save frames
letter: ISL letter (A-Z)
frame_interval: Extract every Nth frame (3 = every 3rd frame)
"""
# Create output directory
letter_folder = os.path.join(output_folder, letter)
os.makedirs(letter_folder, exist_ok=True)
# Open video
cap = cv2.VideoCapture(video_path)
frame_count = 0
saved_count = 0
print(f"Processing video: {video_path}")
while True:
ret, frame = cap.read()
if not ret:
break
# Extract every Nth frame
if frame_count % frame_interval == 0:
output_path = os.path.join(
letter_folder,
f"{letter}_{saved_count:04d}.jpg"
)
cv2.imwrite(output_path, frame)
saved_count += 1
frame_count += 1
cap.release()
print(f"β
Extracted {saved_count} frames for letter '{letter}'")
print(f" Saved to: {letter_folder}")
# Usage
if __name__ == "__main__":
# Extract frames from all videos
videos = [
("videos/letter_A.mp4", "A"),
("videos/letter_B.mp4", "B"),
# ... add all 26 letters
]
output_folder = "extracted_frames"
for video_path, letter in videos:
extract_frames_from_video(video_path, output_folder, letter, frame_interval=3)
print("\nβ
All frames extracted!")
Method 2: Photo Collection App
# File: data_collection/photo_collector.py
import cv2
import os
import time
def collect_photos_for_letter(letter, num_photos=500):
"""
Interactive photo collection using webcam
Args:
letter: ISL letter to collect (A-Z)
num_photos: Number of photos to capture
"""
# Create output directory
output_folder = f"collected_data/{letter}"
os.makedirs(output_folder, exist_ok=True)
# Open webcam
cap = cv2.VideoCapture(0)
print<!-- filepath: d:\study files\FlutterProjects\KairoAI\DOCUMENTATION.md -->
# KairoAI - Complete Project Documentation
## Indian Sign Language Learning App with AI-Powered Hand Detection
**Author:** Megh Modi
**Created:** December 18, 2025
**Version:** 1.0.0
**Status:** Planning & Architecture Phase
---
# Table of Contents
1. [Executive Summary](#executive-summary)
2. [Project Vision & Goals](#project-vision--goals)
3. [Technical Architecture](#technical-architecture)
4. [Technology Stack](#technology-stack)
5. [Understanding the AI Pipeline](#understanding-the-ai-pipeline)
6. [Data Flow & Pipeline](#data-flow--pipeline)
7. [MediaPipe Explained](#mediapipe-explained)
8. [DNN Model Explained](#dnn-model-explained)
9. [Platform Channels Explained](#platform-channels-explained)
10. [Dataset Creation Guide](#dataset-creation-guide)
11. [Model Training Guide](#model-training-guide)
12. [Implementation Roadmap](#implementation-roadmap)
13. [Code Structure](#code-structure)
14. [Challenges & Solutions](#challenges--solutions)
15. [Feasibility Assessment](#feasibility-assessment)
16. [Resources & Learning Path](#resources--learning-path)
---
# 1. Executive Summary
## What is KairoAI?
KairoAI is an Indian Sign Language (ISL) learning application designed specifically for children. The app uses real-time hand gesture detection via the device camera to teach ISL alphabets and words, providing instant feedback to students.
## Core Innovation
The app combines three powerful technologies:
- **Flutter** for cross-platform UI
- **MediaPipe** for hand detection (running natively on Android)
- **TensorFlow Lite** for sign language classification
## Key Differentiator
Unlike traditional learning apps, KairoAI provides **real-time visual feedback** by:
1. Showing the user what sign to make
2. Detecting their hand position using the camera
3. Validating if they're making the correct sign
4. Providing instant feedback (success/try again)
---
# 2. Project Vision & Goals
## Primary Goal
Create an accessible, engaging platform for children to learn Indian Sign Language through interactive, AI-powered lessons.
## Target Users
- **Primary:** Children aged 6-14 learning ISL
- **Secondary:** Parents and educators teaching ISL
- **Tertiary:** Anyone interested in learning ISL
## Core Features
### 1. Lesson Mode
- Display a target alphabet (e.g., "A") or word (e.g., "MEGH")
- Open device camera
- Detect student's hand sign in real-time
- Validate against expected sign
- Show success animation/sound on correct detection
- Provide guidance hints on incorrect attempts
### 2. Quiz Mode
- Present random alphabets or words
- Student performs signs sequentially
- Each detected letter is validated in order
- Progress only on correct detection
- Track accuracy and completion time
### 3. Progress Tracking
- Store lesson completion in Firebase Firestore
- Track quiz scores and accuracy
- Visualize learning progress over time
- Gamification elements (badges, streaks)
---
# 3. Technical Architecture
## High-Level Architecture
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ β FLUTTER LAYER (UI) β β Written in Dart β β β β β’ Lessons UI β’ Quiz UI β’ Progress Dashboard β β β’ Camera Preview β’ Feedback Animations β β β’ Firebase Integration (Auth, Firestore) β ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ β β Platform Channels (Bridge) β ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ β KOTLIN LAYER (Android Native) β β β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β β CameraX ββ β MediaPipe ββ β TensorFlow β β β β β β Hands β β Lite β β β β Capture β β Detect hand β β Classify β β β β frames β β Extract 21 β β sign β β β β β β landmarks β β β β β ββββββββββββββββ ββββββββββββββββ ββββββββββββββββ β β β β Returns: { letter: "A", confidence: 0.95, handDetected: true } β ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
## Architecture Principles
### Why Hybrid Architecture?
| Component | Layer | Reason |
|-----------|-------|--------|
| **UI & Navigation** | Flutter (Dart) | Cross-platform, fast development, beautiful UI |
| **Camera & ML** | Kotlin (Native) | Direct hardware access, optimized performance |
| **Firebase** | Flutter (Dart) | Easy integration, real-time sync |
### Key Design Decision
**DO NOT attempt to run MediaPipe or camera processing in Dart.**
Why?
- Flutter cannot directly access native camera APIs efficiently
- MediaPipe requires native Android/iOS libraries
- ML inference is faster in native code
- Better battery performance with native implementation
---
# 4. Technology Stack
## Flutter Side (Dart)
### Core Dependencies
```yaml
dependencies:
flutter:
sdk: flutter
# Firebase
firebase_core: ^2.24.2 # Firebase initialization
firebase_auth: ^4.16.0 # User authentication
cloud_firestore: ^4.14.0 # Database for progress tracking
# State Management
provider: ^6.1.1 # For managing app state
# Navigation
go_router: ^13.0.0 # Declarative routing
# UI/UX Enhancements
lottie: ^3.0.0 # Success animations
audioplayers: ^5.2.1 # Sound effects
# Utilities
equatable: ^2.0.5 # Value comparison
Why These Libraries?
| Library | Purpose | Alternative |
|---|---|---|
provider |
Simple state management | riverpod, bloc |
go_router |
Type-safe routing | auto_route, manual routing |
lottie |
Beautiful animations | flare, custom animations |
Android/Kotlin Side
Build Configuration
// android/app/build.gradle.kts
plugins {
id("com.android.application")
id("kotlin-android")
id("dev.flutter.flutter-gradle-plugin")
}
android {
namespace = "com.kairo.ai"
compileSdk = 34
defaultConfig {
applicationId = "com.kairo.ai"
minSdk = 26 // Required for CameraX
targetSdk = 34
versionCode = 1
versionName = "1.0"
}
compileOptions {
sourceCompatibility = JavaVersion.VERSION_17
targetCompatibility = JavaVersion.VERSION_17
}
kotlinOptions {
jvmTarget = "17"
}
// Required for TFLite model files
aaptOptions {
noCompress("tflite")
}
}
Dependencies
dependencies {
// MediaPipe Tasks Vision (Hand Landmark Detection)
implementation("com.google.mediapipe:tasks-vision:0.10.14")
// TensorFlow Lite
implementation("org.tensorflow:tensorflow-lite:2.14.0")
implementation("org.tensorflow:tensorflow-lite-support:0.4.4")
// CameraX (Camera API)
implementation("androidx.camera:camera-core:1.3.1")
implementation("androidx.camera:camera-camera2:1.3.1")
implementation("androidx.camera:camera-lifecycle:1.3.1")
implementation("androidx.camera:camera-view:1.3.1")
// Coroutines for async operations
implementation("org.jetbrains.kotlinx:kotlinx-coroutines-android:1.7.3")
}
Python Side (Model Training)
Required Libraries
tensorflow==2.15.0 # Deep learning framework
mediapipe==0.10.9 # Hand landmark extraction
opencv-python==4.8.1.78 # Image processing
numpy==1.26.2 # Numerical operations
pandas==2.1.3 # Data manipulation
scikit-learn==1.3.2 # ML utilities
matplotlib==3.8.2 # Visualization
Installation
pip install tensorflow mediapipe opencv-python numpy pandas scikit-learn matplotlib
5. Understanding the AI Pipeline
What is AI Doing in This App?
The AI has one primary job: "Look at the camera and tell me which ISL letter the user is showing"
The Problem Breakdown
Traditional Approach (Pure Image Classification)
Camera Image β CNN Model β Letter
Problem: Slow, requires huge dataset, background sensitive
Our Smart Approach (Landmark-based Classification)
Camera Image β MediaPipe (find hand) β Extract landmarks β DNN Model β Letter
Benefit: Fast, small dataset, background-independent
Why Two AI Models?
Model 1: MediaPipe Hands (Google's Pre-trained Model)
Job: Find the hand and identify 21 key points
Input: Camera frame (640Γ480 pixels)
Output: 21 landmark points (x, y, z coordinates)
Example landmarks:
Point 0: Wrist
Point 1-4: Thumb (base to tip)
Point 5-8: Index finger
Point 9-12: Middle finger
Point 13-16: Ring finger
Point 17-20: Pinky finger
Why use it?
- Already trained by Google on millions of images
- Works in real-time on mobile devices
- Handles different hand sizes, skin tones, lighting
- Free to use
Model 2: Your Custom TFLite Model (Train Yourself)
Job: Classify the 21 landmark points into ISL letters
Input: 63 numbers (21 points Γ 3 coordinates)
Output: Letter (A-Z) + confidence score
Example:
Input: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
Output: { letter: "A", confidence: 0.95 }
Why train your own?
- ISL signs are unique (different from ASL)
- You control accuracy by adding more training data
- Model is tiny (~50-100 KB)
- Fast inference (~1-5ms)
6. Data Flow & Pipeline
Complete Pipeline: Camera β Detection β Flutter UI
Step-by-Step Data Transformation
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 1: CAMERA CAPTURE (CameraX) β
β β
β Input: Nothing (hardware) β
β Process: Open camera, capture frames at 30 FPS β
β Output: Bitmap (640Γ480 RGB image) β
β Data Size: 921,600 values (640 Γ 480 Γ 3) β
β β
β Visual: β
β βββββββββββββββββββ β
β β βββββββββββββββ β β
β β βββββββββββββββ β β Raw camera frame β
β β βββββββββββββββ β (user's hand visible) β
β β βββββββββββββββ β β
β β βββββββββββββββ β β
β β βββββββββββββββ β β
β βββββββββββββββββββ β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 2: HAND DETECTION (MediaPipe) β
β β
β Input: Bitmap (921,600 values) β
β Process: β
β 1. Detect if hand exists in frame β
β 2. Locate 21 anatomical landmarks β
β 3. Extract (x, y, z) for each point β
β Output: FloatArray[63] = [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]β
β Data Size: 63 float values β
β Reduction: 921,600 β 63 (99.99% reduction!) β
β β
β Visual - 21 Landmark Points: β
β 8 12 16 20 (fingertips) β
β | | | | β
β 7 11 15 19 | β
β | | | | | β
β 6 10 14 18 | β
β | | | | | β
β 5βββ9βββ13ββ17βββ β
β \ β
β 4βββ3βββ2βββ1 (thumb) β
β \ β
β 0 (wrist) β
β β
β Example Output for Letter "A": β
β [0.45, 0.82, 0.01, β Point 0 (wrist) β
β 0.52, 0.75, 0.02, β Point 1 (thumb base) β
β 0.58, 0.65, 0.03, β Point 2 (thumb middle) β
β ... β
β 0.63, 0.42, 0.01] β Point 20 (pinky tip) β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 3: SIGN CLASSIFICATION (TensorFlow Lite) β
β β
β Input: FloatArray[63] β
β Process: β
β Neural Network Layers: β
β [63] β [128 neurons] β [64 neurons] β [32 neurons] β [26] β
β Dense Layer Dense Layer Dense Layer Softmax β
β β
β Output: Probabilities for each letter β
β [p_A, p_B, p_C, ..., p_Z] β
β β
β Example: β
β Input: [0.45, 0.82, 0.01, ...] β
β Output: [0.01, 0.03, 0.95, 0.00, ...] β
β (1% 3% 95% 0% ...) β
β β
β Best Prediction: Letter "C" with 95% confidence β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 4: SEND TO FLUTTER (Platform Channel - EventChannel) β
β β
β Kotlin prepares data as Map: β
β { β
β "letter": "C", β
β "confidence": 0.95, β
β "handDetected": true, β
β "timestamp": 1702907345000 β
β } β
β β
β Sends through EventChannel (continuous stream) β
ββββββββββββββββββββββββββ¬ββββββββββββββββββββββββββββββββββββββββββ
β
βΌ
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β STEP 5: FLUTTER RECEIVES & DISPLAYS (Dart) β
β β
β EventChannel stream listener receives data β
β β
β setState(() { β
β currentLetter = "C"; β
β confidence = 0.95; β
β handDetected = true; β
β }); β
β β
β UI Updates: β
β ββββββββββββββββββββββββββββββ β
β β π― Target: Letter A β β
β β β β
β β π· [Camera Preview] β β
β β β β
β β β Detected: C (95%) β β Red (incorrect) β
β β π‘ Hint: Try again! β β
β ββββββββββββββββββββββββββββββ β
β β
β If correct (C == target): β
β ββββββββββββββββββββββββββββββ β
β β β
Correct! (95%) β β Green (success) β
β β π [Success Animation] β β
β β π [Success Sound] β β
β ββββββββββββββββββββββββββββββ β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Processing Speed
| Step | Processing Time | Frequency |
|---|---|---|
| Camera Frame | ~33ms | 30 FPS |
| MediaPipe Detection | ~10-20ms | Per frame |
| TFLite Classification | ~1-5ms | Per detection |
| Platform Channel | <1ms | Per result |
| Total Latency | ~50ms | ~20 detections/sec |
This means users get near-instant feedback!
7. MediaPipe Explained
What is MediaPipe?
MediaPipe is Google's open-source framework for building multimodal (video, audio, text) applied machine learning pipelines.
MediaPipe Hands Solution
Specifically designed to detect and track hands in real-time.
Key Capabilities:
- Detects up to 2 hands simultaneously
- Works in various lighting conditions
- Handles different hand sizes and skin tones
- Runs efficiently on mobile devices
- Provides 21 3D landmark points per hand
The 21 Hand Landmarks
Landmark Numbering System:
8 12 16 20
β β β β
7βββ11ββ15ββ19βββ
β β β β β
6βββ10ββ14ββ18βββ
β β β β β
5βββ9βββ13ββ17βββ
\
4βββ3βββ2βββ1
\
0
Point 0: Wrist
Point 1: Thumb CMC (base)
Point 2: Thumb MCP
Point 3: Thumb IP
Point 4: Thumb tip
Point 5: Index finger MCP
Point 6: Index finger PIP
Point 7: Index finger DIP
Point 8: Index finger tip
Point 9: Middle finger MCP
Point 10: Middle finger PIP
Point 11: Middle finger DIP
Point 12: Middle finger tip
Point 13: Ring finger MCP
Point 14: Ring finger PIP
Point 15: Ring finger DIP
Point 16: Ring finger tip
Point 17: Pinky MCP
Point 18: Pinky PIP
Point 19: Pinky DIP
Point 20: Pinky tip
Coordinate System
Each landmark has 3 coordinates:
X Coordinate
- Range: 0.0 to 1.0
- 0.0 = left edge of image
- 1.0 = right edge of image
- Normalized (independent of image resolution)
Y Coordinate
- Range: 0.0 to 1.0
- 0.0 = top edge of image
- 1.0 = bottom edge of image
- Normalized (independent of image resolution)
Z Coordinate
- Approximate depth from wrist
- Smaller values = closer to camera
- Relative to wrist (Point 0)
- Units: roughly in same scale as X
Example Coordinates
Letter "A" (closed fist with thumb up):
Point 0 (Wrist): x=0.50, y=0.70, z=0.00
Point 1 (Thumb base): x=0.48, y=0.65, z=0.02
Point 2 (Thumb mid): x=0.46, y=0.58, z=0.03
Point 3 (Thumb bend): x=0.44, y=0.52, z=0.04
Point 4 (Thumb tip): x=0.42, y=0.45, z=0.05
Point 5 (Index base): x=0.54, y=0.66, z=0.01
Point 6 (Index mid): x=0.56, y=0.68, z=0.00
Point 7 (Index bend): x=0.57, y=0.69, z=-0.01
Point 8 (Index tip): x=0.58, y=0.70, z=-0.02
...
MediaPipe Integration Code
// File: android/app/src/main/kotlin/com/kairo/ai/ml/HandLandmarkDetector.kt
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarker
import com.google.mediapipe.tasks.vision.handlandmarker.HandLandmarkerResult
import com.google.mediapipe.framework.image.BitmapImageBuilder
import com.google.mediapipe.framework.image.MPImage
import android.graphics.Bitmap
import android.content.Context
class HandLandmarkDetector(context: Context) {
private val handLandmarker: HandLandmarker
init {
// Configure MediaPipe Hands
val options = HandLandmarker.HandLandmarkerOptions.builder()
.setBaseOptions(
BaseOptions.builder()
.setModelAssetPath("hand_landmarker.task") // MediaPipe's pre-trained model
.build()
)
.setNumHands(1) // Detect only one hand
.setMinHandDetectionConfidence(0.5f) // 50% confidence threshold
.setMinHandPresenceConfidence(0.5f)
.setMinTrackingConfidence(0.5f)
.build()
handLandmarker = HandLandmarker.createFromOptions(context, options)
}
/**
* Detect hand landmarks from a camera frame
*
* @param bitmap The camera frame
* @return FloatArray of 63 values [x0,y0,z0, x1,y1,z1, ..., x20,y20,z20]
* or null if no hand detected
*/
fun detectLandmarks(bitmap: Bitmap): FloatArray? {
// Convert Android Bitmap to MediaPipe Image
val mpImage: MPImage = BitmapImageBuilder(bitmap).build()
// Run hand detection
val result: HandLandmarkerResult = handLandmarker.detect(mpImage)
// Check if any hands were detected
if (result.landmarks().isEmpty()) {
return null // No hand found
}
// Get landmarks from first detected hand
val handLandmarks = result.landmarks()[0]
// Convert to flat array
val landmarkArray = FloatArray(63)
for (i in 0 until 21) {
val landmark = handLandmarks[i]
landmarkArray[i * 3 + 0] = landmark.x()
landmarkArray[i * 3 + 1] = landmark.y()
landmarkArray[i * 3 + 2] = landmark.z()
}
return landmarkArray
}
/**
* Optional: Normalize landmarks relative to wrist
* This makes the model robust to hand position/size
*/
fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
val normalized = FloatArray(63)
// Get wrist coordinates (point 0)
val wristX = landmarks[0]
val wristY = landmarks[1]
val wristZ = landmarks[2]
// Normalize all points relative to wrist
for (i in 0 until 21) {
normalized[i * 3 + 0] = landmarks[i * 3 + 0] - wristX
normalized[i * 3 + 1] = landmarks[i * 3 + 1] - wristY
normalized[i * 3 + 2] = landmarks[i * 3 + 2] - wristZ
}
return normalized
}
}
8. DNN Model Explained
What is a DNN (Dense Neural Network)?
A DNN is a type of artificial neural network where every neuron in one layer is connected to every neuron in the next layer.
Simple Analogy
Think of it as a decision-making chain:
Your Brain Recognizing a Friend:
Eyes see features β Brain processes patterns β Brain decides who it is
(height, hair, (tall + brown hair ("It's John!")
glasses, voice) + glasses = pattern)
DNN for Sign Language
Input: 63 numbers β Hidden layers process β Output: Letter prediction
(hand landmarks) (find patterns) (A-Z with confidence)
DNN vs CNN vs Other Models
| Model Type | Input | Best For | When to Use |
|---|---|---|---|
| DNN (Dense) | Numbers/Features | Structured data, pre-extracted features | When you have landmark coordinates |
| CNN (Convolutional) | Images | Raw image classification | When working with raw pixels |
| RNN/LSTM | Sequences | Time series, text | When order matters (sentences, videos) |
| Transformer | Sequences | Language, large-scale | Complex NLP tasks (ChatGPT) |
Why DNN for KairoAI?
Comparison: CNN vs DNN Approach
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β CNN APPROACH (Not Used) β
β β
β Camera Frame (640Γ480Γ3 = 921,600 values) β
β β β
β Convolutional Layers (find edges, shapes) β
β β β
β Pooling Layers (reduce size) β
β β β
β More Conv Layers (find complex patterns) β
β β β
β Dense Layers (classify) β
β β β
β Output: Letter β
β β
β Problems: β
β β Slow (100-200ms inference) β
β β Large model (10-50 MB) β
β β Needs 10,000+ images per class β
β β Background variations affect accuracy β
β β Sensitive to lighting, hand size, position β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β DNN APPROACH (Our Choice) β
β
β β
β MediaPipe Output (63 landmark values) β
β β β
β Dense Layer 1 (128 neurons, ReLU activation) β
β β β
β Dense Layer 2 (64 neurons, ReLU activation) β
β β β
β Dense Layer 3 (32 neurons, ReLU activation) β
β β β
β Output Layer (26 neurons, Softmax activation) β
β β β
β Output: Letter + Confidence β
β β
β Benefits: β
β β
Fast (1-5ms inference) β
β β
Tiny model (~50-100 KB) β
β β
Needs only 500-1000 images per class β
β β
Background doesn't matter (landmarks only) β
β β
Works in any lighting, any hand size β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Model Architecture
Layer-by-Layer Breakdown
# Our DNN Model Architecture
Input Layer:
Shape: (63,)
Description: 63 landmark coordinates
Example: [0.45, 0.82, 0.01, 0.52, 0.75, ...]
β (fully connected)
Hidden Layer 1:
Neurons: 128
Activation: ReLU (Rectified Linear Unit)
Dropout: 30% (prevents overfitting)
Description: Learns basic patterns (finger positions)
β (fully connected)
Hidden Layer 2:
Neurons: 64
Activation: ReLU
Dropout: 30%
Description: Learns complex patterns (hand shapes)
β (fully connected)
Hidden Layer 3:
Neurons: 32
Activation: ReLU
Dropout: 20%
Description: Learns letter-specific features
β (fully connected)
Output Layer:
Neurons: 26
Activation: Softmax
Description: Probability for each letter (A-Z)
Example Output: [0.01, 0.03, 0.95, 0.00, ..., 0.01]
(1% 3% 95% 0% 1% )
A B C D ... Z
What Each Layer Does
1. Input Layer (63 neurons)
- Receives raw landmark coordinates
- No processing, just passes data forward
2. Hidden Layer 1 (128 neurons)
- Learns basic geometric relationships
- Examples:
- "Are fingers spread apart?"
- "Is thumb extended?"
- "What's the palm orientation?"
3. Hidden Layer 2 (64 neurons)
- Combines basic patterns into complex ones
- Examples:
- "Thumb up + fingers curled = might be 'A'"
- "All fingers extended = might be 'B'"
4. Hidden Layer 3 (32 neurons)
- Fine-tunes letter-specific features
- Distinguishes similar signs
- Examples:
- "Is this 'M' or 'N'?" (very similar in ISL)
5. Output Layer (26 neurons)
- Each neuron represents one letter
- Softmax ensures probabilities sum to 1.0
- Highest probability = predicted letter
Activation Functions
ReLU (Rectified Linear Unit)
Formula: f(x) = max(0, x)
Graph:
β β±
β β±
β β±
βββββΌβββββ
β
Benefits:
- Fast to compute
- Prevents vanishing gradients
- Works well for hidden layers
Softmax
Formula: softmax(xi) = e^xi / Ξ£(e^xj)
Purpose:
- Converts raw scores to probabilities
- All outputs sum to 1.0
- Used in output layer for classification
Example:
Raw scores: [2.1, 0.5, 4.2, 1.3]
After softmax: [0.12, 0.02, 0.84, 0.02]
(12% 2% 84% 2% )
Dropout Layers
Purpose: Prevent overfitting
During Training:
- Randomly "turn off" 30% of neurons
- Forces network to learn robust features
- Network can't rely on specific neurons
During Inference (app usage):
- All neurons active
- Uses learned patterns to classify
Model Training Code
# File: model_training/train_model.py
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
# Load landmark dataset
print("Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')
print(f"Dataset shape: {data.shape}")
print(f"Classes: {sorted(data['label'].unique())}")
# Separate features (X) and labels (y)
X = data.iloc[:, :-1].values # 63 landmark columns
y = data.iloc[:, -1].values # Label column
# Encode labels: A=0, B=1, C=2, ..., Z=25
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)
# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"Label mapping: {label_mapping}")
# Train/test split (80% train, 20% test)
X_train, X_test, y_train, y_test = train_test_split(
X, y_categorical,
test_size=0.2,
random_state=42,
stratify=y_categorical # Maintain class distribution
)
print(f"\nTraining samples: {len(X_train)}")
print(f"Testing samples: {len(X_test)}")
# Build the DNN model
model = keras.Sequential([
# Input layer
keras.layers.Input(shape=(63,), name='landmark_input'),
# Hidden layer 1
keras.layers.Dense(128, activation='relu', name='dense_1'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
# Hidden layer 2
keras.layers.Dense(64, activation='relu', name='dense_2'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
# Hidden layer 3
keras.layers.Dense(32, activation='relu', name='dense_3'),
keras.layers.Dropout(0.2),
# Output layer
keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])
# Compile model
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy']
)
# Display model summary
model.summary()
# Training callbacks
callbacks = [
# Stop training if validation loss doesn't improve for 10 epochs
keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=10,
restore_best_weights=True
),
# Reduce learning rate if validation loss plateaus
keras.callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=5,
min_lr=0.00001
),
# Save best model during training
keras.callbacks.ModelCheckpoint(
'best_model.h5',
monitor='val_accuracy',
save_best_only=True
)
]
# Train the model
print("\nTraining model...")
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2, # Use 20% of training data for validation
callbacks=callbacks,
verbose=1
)
# Evaluate on test set
print("\nEvaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")
# Save final model
model.save('isl_model.h5')
print("\nModel saved as 'isl_model.h5'")
# Convert to TFLite
print("\nConverting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('isl_model.tflite', 'wb') as f:
f.write(tflite_model)
print(f"TFLite model saved!")
print(f"Model size: {len(tflite_model) / 1024:.2f} KB")
Model Summary Output
Model: "sequential"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
landmark_input (InputLayer) [(None, 63)] 0
dense_1 (Dense) (None, 128) 8,192
batch_normalization (None, 128) 512
dropout (Dropout) (None, 128) 0
dense_2 (Dense) (None, 64) 8,256
batch_normalization_1 (None, 64) 256
dropout_1 (Dropout) (None, 64) 0
dense_3 (Dense) (None, 32) 2,080
dropout_2 (Dropout) (None, 32) 0
output (Dense) (None, 26) 858
=================================================================
Total params: 20,154 (78.73 KB)
Trainable params: 19,770 (77.23 KB)
Non-trainable params: 384 (1.50 KB)
_________________________________________________________________
What Happens During Training
Epoch 1/100
ββββββββββββββββββββββββββββββββββββββββ
Batch 1/250: Forward pass β Calculate loss β Backward pass β Update weights
Batch 2/250: Forward pass β Calculate loss β Backward pass β Update weights
...
Batch 250/250: Forward pass β Calculate loss β Backward pass β Update weights
Validation:
- Test on validation set (20% of training data)
- Calculate validation accuracy
- If improved, save as best model
Epoch 1: loss=0.5423, accuracy=0.8234, val_loss=0.4321, val_accuracy=0.8567
Epoch 2: loss=0.3215, accuracy=0.8876, val_loss=0.3102, val_accuracy=0.8923
Epoch 3: loss=0.2543, accuracy=0.9123, val_loss=0.2876, val_accuracy=0.9034
...
Epoch 45: loss=0.0234, accuracy=0.9892, val_loss=0.0456, val_accuracy=0.9845
Epoch 46: loss=0.0231, accuracy=0.9894, val_loss=0.0458, val_accuracy=0.9844
(No improvement for 10 epochs β Early stopping triggered)
Best model: Epoch 45 with val_accuracy=0.9845
9. Platform Channels Explained
What Are Platform Channels?
Platform channels are Flutter's official mechanism for communication between Dart code and native platform code (Kotlin/Swift).
The Problem They Solve
Flutter (Dart) runs in its own runtime
β
Cannot directly access:
- Native camera APIs
- MediaPipe library (Android/iOS)
- Hardware sensors
- Native ML frameworks
- Bluetooth, NFC, etc.
Solution: Platform Channels = Bridge between worlds
Types of Platform Channels
1. MethodChannel (Request-Response)
Use Case: One-time requests with responses
Flutter asks β Kotlin does work β Kotlin responds β Flutter receives
Example: Start/stop camera, take photo, get device info
// Flutter side
final result = await methodChannel.invokeMethod('getCameraStatus');
print(result); // "active" or "inactive"
// Kotlin side
methodChannel.setMethodCallHandler { call, result ->
when (call.method) {
"getCameraStatus" -> {
val status = if (cameraActive) "active" else "inactive"
result.success(status)
}
}
}
2. EventChannel (Continuous Stream)
Use Case: Continuous data stream from native to Flutter
Kotlin continuously sends β Flutter receives stream β UI updates in real-time
Example: Hand detection results, sensor data, location updates
// Flutter side
eventChannel.receiveBroadcastStream().listen((data) {
print(data); // Continuous updates
});
// Kotlin side
eventChannel.setStreamHandler(object : EventChannel.StreamHandler {
override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
// Send data continuously
events?.success(detectionResult)
}
})
3. BasicMessageChannel (Bi-directional)
Use Case: Custom protocols, binary data
Not commonly used for this project
Comparison Table
| Feature | MethodChannel | EventChannel |
|---|---|---|
| Direction | Bi-directional | Kotlin β Flutter (one way) |
| Pattern | Request-Response | Stream |
| Use Case | Commands | Continuous data |
| Example | "Start camera" | Detection results every 50ms |
| Frequency | On-demand | Continuous |
Complete Implementation for KairoAI
Flutter Side (Dart)
// File: lib/services/sign_detection_service.dart
import 'package:flutter/services.dart';
import 'dart:async';
class SignDetectionService {
// Method channel for commands
static const MethodChannel _methodChannel =
MethodChannel('com.kairo.ai/detection');
// Event channel for continuous detection stream
static const EventChannel _eventChannel =
EventChannel('com.kairo.ai/detection_stream');
Stream<DetectionResult>? _detectionStream;
/// Start hand sign detection
Future<void> startDetection() async {
try {
await _methodChannel.invokeMethod('startDetection');
print('β
Detection started');
} on PlatformException catch (e) {
print('β Error starting detection: ${e.message}');
rethrow;
}
}
/// Stop hand sign detection
Future<void> stopDetection() async {
try {
await _methodChannel.invokeMethod('stopDetection');
print('β
Detection stopped');
} on PlatformException catch (e) {
print('β Error stopping detection: ${e.message}');
rethrow;
}
}
/// Get continuous stream of detection results
Stream<DetectionResult> get detectionStream {
_detectionStream ??= _eventChannel
.receiveBroadcastStream()
.map((event) {
final data = Map<String, dynamic>.from(event);
return DetectionResult.fromMap(data);
});
return _detectionStream!;
}
/// Check camera permission status
Future<bool> checkCameraPermission() async {
try {
final bool hasPermission =
await _methodChannel.invokeMethod('checkCameraPermission');
return hasPermission;
} on PlatformException catch (e) {
print('β Error checking permission: ${e.message}');
return false;
}
}
/// Request camera permission
Future<bool> requestCameraPermission() async {
try {
final bool granted =
await _methodChannel.invokeMethod('requestCameraPermission');
return granted;
} on PlatformException catch (e) {
print('β Error requesting permission: ${e.message}');
return false;
}
}
}
/// Data class for detection results
class DetectionResult {
final String letter;
final double confidence;
final bool handDetected;
final int timestamp;
DetectionResult({
required this.letter,
required this.confidence,
required this.handDetected,
required this.timestamp,
});
factory DetectionResult.fromMap(Map<String, dynamic> map) {
return DetectionResult(
letter: map['letter'] as String,
confidence: (map['confidence'] as num).toDouble(),
handDetected: map['handDetected'] as bool,
timestamp: map['timestamp'] as int,
);
}
@override
String toString() {
return 'DetectionResult(letter: $letter, confidence: ${confidence.toStringAsFixed(2)}, handDetected: $handDetected)';
}
}
Kotlin Side (Android)
// File: android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt
package com.kairo.ai
import android.Manifest
import android.content.pm.PackageManager
import androidx.core.app.ActivityCompat
import androidx.core.content.ContextCompat
import io.flutter.embedding.android.FlutterActivity
import io.flutter.embedding.engine.FlutterEngine
import io.flutter.plugin.common.MethodChannel
import io.flutter.plugin.common.EventChannel
import android.graphics.Bitmap
class MainActivity : FlutterActivity() {
// Channel names (must match Flutter side)
private val METHOD_CHANNEL = "com.kairo.ai/detβ¦
<!-- Continue from line 1131 -->
private val EVENT_CHANNEL = "com.kairo.ai/detection_stream"
// Camera permission request code
private val CAMERA_PERMISSION_CODE = 100
// Our custom classes (to be implemented)
private lateinit var cameraManager: CameraManager
private lateinit var handDetector: HandLandmarkDetector
private lateinit var signClassifier: SignClassifier
// Event sink for streaming data to Flutter
private var eventSink: EventChannel.EventSink? = null
override fun configureFlutterEngine(flutterEngine: FlutterEngine) {
super.configureFlutterEngine(flutterEngine)
// Initialize our components
cameraManager = CameraManager(this)
handDetector = HandLandmarkDetector(this)
signClassifier = SignClassifier(this)
// Setup MethodChannel for commands
MethodChannel(
flutterEngine.dartExecutor.binaryMessenger,
METHOD_CHANNEL
).setMethodCallHandler { call, result ->
when (call.method) {
"startDetection" -> {
startDetection()
result.success(null)
}
"stopDetection" -> {
stopDetection()
result.success(null)
}
"checkCameraPermission" -> {
val hasPermission = checkCameraPermission()
result.success(hasPermission)
}
"requestCameraPermission" -> {
requestCameraPermission()
result.success(null)
}
else -> {
result.notImplemented()
}
}
}
// Setup EventChannel for streaming data
EventChannel(
flutterEngine.dartExecutor.binaryMessenger,
EVENT_CHANNEL
).setStreamHandler(object : EventChannel.StreamHandler {
override fun onListen(arguments: Any?, events: EventChannel.EventSink?) {
eventSink = events
println("β
Flutter is now listening to detection stream")
}
override fun onCancel(arguments: Any?) {
eventSink = null
stopDetection()
println("β Flutter stopped listening to detection stream")
}
})
}
private fun startDetection() {
println("π₯ Starting camera and detection...")
cameraManager.startCamera { bitmap ->
processFrame(bitmap)
}
}
private fun stopDetection() {
println("π Stopping camera and detection...")
cameraManager.stopCamera()
}
private fun processFrame(bitmap: Bitmap) {
// Step 1: Detect hand landmarks
val landmarks = handDetector.detectLandmarks(bitmap)
if (landmarks != null) {
// Step 2: Classify the sign
val result = signClassifier.classify(landmarks)
// Step 3: Send to Flutter
val data = mapOf(
"letter" to result.letter,
"confidence" to result.confidence,
"handDetected" to true,
"timestamp" to System.currentTimeMillis()
)
runOnUiThread {
eventSink?.success(data)
}
} else {
// No hand detected
val data = mapOf(
"letter" to "",
"confidence" to 0.0,
"handDetected" to false,
"timestamp" to System.currentTimeMillis()
)
runOnUiThread {
eventSink?.success(data)
}
}
}
private fun checkCameraPermission(): Boolean {
return ContextCompat.checkSelfPermission(
this,
Manifest.permission.CAMERA
) == PackageManager.PERMISSION_GRANTED
}
private fun requestCameraPermission() {
ActivityCompat.requestPermissions(
this,
arrayOf(Manifest.permission.CAMERA),
CAMERA_PERMISSION_CODE
)
}
override fun onRequestPermissionsResult(
requestCode: Int,
permissions: Array<out String>,
grantResults: IntArray
) {
super.onRequestPermissionsResult(requestCode, permissions, grantResults)
when (requestCode) {
CAMERA_PERMISSION_CODE -> {
val granted = grantResults.isNotEmpty() &&
grantResults[0] == PackageManager.PERMISSION_GRANTED
if (granted) {
println("β
Camera permission granted")
} else {
println("β Camera permission denied")
}
}
}
}
}
Communication Flow Diagram
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β FLUTTER (Dart) β
β β
β User taps "Start Lesson" β
β β β
β LessonScreen calls: β
β signDetectionService.startDetection() β
β β β
β MethodChannel sends: "startDetection" β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β
βΌ Platform Channel Bridge
β
ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
β KOTLIN (Android) β
β β
β MethodChannel receives: "startDetection" β
β β β
β MainActivity.startDetection() β
β β β
β CameraManager.startCamera() β
β β β
β Camera frames arrive (30 FPS) β
β β β
β processFrame(bitmap) for each frame: β
β 1. HandDetector.detectLandmarks(bitmap) β 63 floats β
β 2. SignClassifier.classify(landmarks) β Letter + confidence β
β 3. Package into Map β
β 4. EventChannel sends to Flutter β
ββββββββββββββββββββββββββ¬βββββββββββββββββββββββββββββββββββββββββ
β
βΌ Event Channel Stream
β
ββββββββββββββββββββββββββ΄βββββββββββββββββββββββββββββββββββββββββ
β FLUTTER (Dart) β
β β
β EventChannel.receiveBroadcastStream() β
β β β
β .listen((data) { β
β setState(() { β
β currentLetter = data['letter']; β
β confidence = data['confidence']; β
β }); β
β }) β
β β β
β UI rebuilds with new data β
β Shows: "Detected: A (95%)" β
βββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
10. Dataset Creation Guide
Understanding the Dataset
What You Need
Dataset Size Calculation:
- 26 letters (A-Z)
- 500-1000 images per letter (recommended)
- Total: 13,000 - 26,000 images
Actual Data After Extraction:
- Each image β 1 row in CSV
- Each row = 63 landmark values + 1 label
- Final CSV: 13,000-26,000 rows Γ 64 columns
Option 1: Use Existing Dataset (Fastest)
Find ISL Dataset on Kaggle
# Popular datasets:
# 1. "ISL Dataset"
# 2. "Indian Sign Language Recognition Dataset"
# 3. "ISL Alphabet Dataset"
# Expected structure:
ISL_Dataset/
βββ A/
β βββ img_001.jpg
β βββ img_002.jpg
β βββ ...
βββ B/
β βββ ...
βββ Z/
βββ ...
Option 2: Extract Landmarks from Images
Complete Extraction Script
# File: model_training/extract_landmarks.py
import mediapipe as mp
import cv2
import os
import csv
import numpy as np
mp_hands = mp.solutions.hands
hands = mp_hands.Hands(
static_image_mode=True,
max_num_hands=1,
min_detection_confidence=0.5
)
def extract_landmarks_from_image(image_path):
"""
Extract 21 hand landmarks from an image.
Returns 63 values (21 points Γ 3 coordinates) or None if no hand detected.
"""
image = cv2.imread(image_path)
if image is None:
return None
image_rgb = cv2.cvtColor(image, cv2.COLOR_BGR2RGB)
results = hands.process(image_rgb)
if not results.multi_hand_landmarks:
return None
hand_landmarks = results.multi_hand_landmarks[0]
landmarks = []
for landmark in hand_landmarks.landmark:
landmarks.extend([landmark.x, landmark.y, landmark.z])
return landmarks
def normalize_landmarks(landmarks):
"""
Normalize landmarks relative to wrist and scale
Makes model robust to hand position and size
"""
landmarks = np.array(landmarks).reshape(21, 3)
# Get wrist position (point 0)
wrist = landmarks[0]
# Translate to origin (wrist at 0,0,0)
landmarks = landmarks - wrist
# Calculate bounding box
min_vals = landmarks.min(axis=0)
max_vals = landmarks.max(axis=0)
# Scale to unit box
scale = max_vals - min_vals
scale[scale == 0] = 1 # Avoid division by zero
landmarks = (landmarks - min_vals) / scale
return landmarks.flatten()
def process_dataset(dataset_path, output_csv):
"""
Process all images in dataset and save landmarks to CSV.
"""
header = []
for i in range(21):
header.extend([f'x{i}', f'y{i}', f'z{i}'])
header.append('label')
total_images = 0
successful_extractions = 0
failed_extractions = 0
with open(output_csv, 'w', newline='') as csvfile:
writer = csv.writer(csvfile)
writer.writerow(header)
for label in sorted(os.listdir(dataset_path)):
label_path = os.path.join(dataset_path, label)
if not os.path.isdir(label_path):
continue
print(f"Processing class: {label}")
for image_name in os.listdir(label_path):
image_path = os.path.join(label_path, image_name)
if not image_name.lower().endswith(('.jpg', '.jpeg', '.png', '.bmp')):
continue
total_images += 1
landmarks = extract_landmarks_from_image(image_path)
if landmarks is not None:
# Normalize landmarks
normalized = normalize_landmarks(landmarks)
writer.writerow(list(normalized) + [label])
successful_extractions += 1
else:
failed_extractions += 1
print(f" β οΈ No hand detected: {image_name}")
print("\n" + "="*50)
print("EXTRACTION COMPLETE")
print("="*50)
print(f"Total images processed: {total_images}")
print(f"Successful extractions: {successful_extractions}")
print(f"Failed extractions: {failed_extractions}")
print(f"Success rate: {(successful_extractions/total_images)*100:.1f}%")
print(f"Output saved to: {output_csv}")
if __name__ == "__main__":
DATASET_PATH = "ISL_Dataset" # Change to your dataset path
OUTPUT_CSV = "landmarks_dataset.csv"
process_dataset(DATASET_PATH, OUTPUT_CSV)
11. Model Training Guide
Training Process Overview
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
β TRAINING PIPELINE β
ββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββββ
Step 1: Load CSV Dataset
β
Step 2: Split into Features (X) and Labels (y)
β
Step 3: Encode Labels (Aβ0, Bβ1, ..., Zβ25)
β
Step 4: Split into Train/Test Sets (80%/20%)
β
Step 5: Build DNN Model
β
Step 6: Train Model (with callbacks)
β
Step 7: Evaluate on Test Set
β
Step 8: Save Model (.h5)
β
Step 9: Convert to TFLite (.tflite)
β
Step 10: Deploy to Android App
Complete Training Script
# File: model_training/train_model.py
import tensorflow as tf
from tensorflow import keras
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
import matplotlib.pyplot as plt
# Load dataset
print("π Loading dataset...")
data = pd.read_csv('landmarks_dataset.csv')
print(f"Dataset shape: {data.shape}")
print(f"Number of samples: {len(data)}")
print(f"Classes: {sorted(data['label'].unique())}")
# Check for class imbalance
class_counts = data['label'].value_counts()
print("\nπ Class distribution:")
print(class_counts)
# Separate features and labels
X = data.iloc[:, :-1].values # 63 landmark columns
y = data.iloc[:, -1].values # Label column
# Encode labels
encoder = LabelEncoder()
y_encoded = encoder.fit_transform(y)
y_categorical = keras.utils.to_categorical(y_encoded)
# Save label mapping
label_mapping = {i: label for i, label in enumerate(encoder.classes_)}
print(f"\nπ·οΈ Label mapping: {label_mapping}")
# Save encoder for later use
import pickle
with open('label_encoder.pkl', 'wb') as f:
pickle.dump(encoder, f)
# Train/test split
X_train, X_test, y_train, y_test = train_test_split(
X, y_categorical,
test_size=0.2,
random_state=42,
stratify=y_categorical
)
print(f"\nπ Training samples: {len(X_train)}")
print(f"π Testing samples: {len(X_test)}")
# Build model
model = keras.Sequential([
keras.layers.Input(shape=(63,), name='landmark_input'),
keras.layers.Dense(128, activation='relu', name='dense_1'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
keras.layers.Dense(64, activation='relu', name='dense_2'),
keras.layers.BatchNormalization(),
keras.layers.Dropout(0.3),
keras.layers.Dense(32, activation='relu', name='dense_3'),
keras.layers.Dropout(0.2),
keras.layers.Dense(len(encoder.classes_), activation='softmax', name='output')
])
model.compile(
optimizer=keras.optimizers.Adam(learning_rate=0.001),
loss='categorical_crossentropy',
metrics=['accuracy']
)
model.summary()
# Callbacks
callbacks = [
keras.callbacks.EarlyStopping(
monitor='val_loss',
patience=15,
restore_best_weights=True,
verbose=1
),
keras.callbacks.ReduceLROnPlateau(
monitor='val_loss',
factor=0.5,
patience=5,
min_lr=0.00001,
verbose=1
),
keras.callbacks.ModelCheckpoint(
'best_model.h5',
monitor='val_accuracy',
save_best_only=True,
verbose=1
)
]
# Train
print("\nπ Training model...")
history = model.fit(
X_train, y_train,
epochs=100,
batch_size=32,
validation_split=0.2,
callbacks=callbacks,
verbose=1
)
# Evaluate
print("\nπ Evaluating on test set...")
test_loss, test_accuracy = model.evaluate(X_test, y_test, verbose=0)
print(f"Test Loss: {test_loss:.4f}")
print(f"Test Accuracy: {test_accuracy*100:.2f}%")
# Plot training history
plt.figure(figsize=(12, 4))
plt.subplot(1, 2, 1)
plt.plot(history.history['accuracy'], label='Training Accuracy')
plt.plot(history.history['val_accuracy'], label='Validation Accuracy')
plt.title('Model Accuracy')
plt.xlabel('Epoch')
plt.ylabel('Accuracy')
plt.legend()
plt.subplot(1, 2, 2)
plt.plot(history.history['loss'], label='Training Loss')
plt.plot(history.history['val_loss'], label='Validation Loss')
plt.title('Model Loss')
plt.xlabel('Epoch')
plt.ylabel('Loss')
plt.legend()
plt.tight_layout()
plt.savefig('training_history.png')
print("π Training history saved to training_history.png")
# Save model
model.save('isl_model.h5')
print("\nπΎ Model saved as 'isl_model.h5'")
# Convert to TFLite
print("\nπ Converting to TFLite...")
converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
tflite_model = converter.convert()
with open('isl_model.tflite', 'wb') as f:
f.write(tflite_model)
print(f"β
TFLite model saved!")
print(f"π¦ Model size: {len(tflite_model) / 1024:.2f} KB")
print("\nπ Training complete!")
print("\nπ Next steps:")
print("1. Copy isl_model.tflite to android/app/src/main/assets/")
print("2. Copy label_encoder.pkl for future use")
print("3. Test the model in your Android app")
12. Implementation Roadmap
Phase 1: Proof of Concept (1-2 Weeks)
Goal: Detect hand and show "Hand detected!" in Flutter
Tasks:
Setup Android Dependencies
- Add MediaPipe to
build.gradle.kts - Add CameraX libraries
- Configure permissions
- Add MediaPipe to
Basic Camera Integration
- Create
CameraManager.kt - Open camera
- Capture frames
- Create
Hand Detection Only
- Create
HandLandmarkDetector.kt - Detect hand (yes/no)
- Print landmarks to logcat
- Create
Platform Channel Setup
- Create MethodChannel
- Send "Hand detected: true/false" to Flutter
- Display in Flutter UI
Success Criteria:
- Camera opens
- Hand detection works
- Flutter shows "Hand detected!" message
Phase 2: Single Letter Recognition (1-2 Weeks)
Goal: Recognize letter "A" only
Tasks:
Collect Data for Letter A
- Record 100 images of letter "A"
- Record 100 images of letter "B" (for contrast)
- Extract landmarks to CSV
Train 2-Class Model
- Train DNN to distinguish A vs B
- Convert to TFLite
- Achieve 85%+ accuracy
Integrate TFLite
- Create
SignClassifier.kt - Load .tflite model
- Run inference on landmarks
- Create
Flutter Integration
- EventChannel for continuous detection
- Show detected letter
- Show confidence score
Success Criteria:
- Model can distinguish A from B
- Real-time detection works
- Flutter UI shows letter and confidence
Phase 3: Full Alphabet (2-3 Weeks)
Goal: Recognize all 26 letters
Tasks:
Complete Dataset
- Find or collect full ISL dataset
- Extract landmarks for all 26 letters
- Validate data quality
Train 26-Class Model
- Train full model
- Aim for 90%+ test accuracy
- Optimize for mobile inference
Lesson Mode UI
- Create lesson screen
- Show target letter
- Camera preview
- Success/failure feedback
Quiz Mode Logic
- Random letter selection
- Sequential validation
- Score tracking
Success Criteria:
- 90%+ accuracy on test set
- All 26 letters recognized
- Lesson and quiz modes working
Phase 4: Polish & Deploy (1-2 Weeks)
Goal: Production-ready app
Tasks:
Firebase Integration
- Store user progress
- Track quiz scores
- Sync across devices
UI/UX Polish
- Lottie animations
- Sound effects
- Smooth transitions
- Error handling
Performance Optimization
- Reduce latency
- Battery optimization
- Memory management
Testing
- User testing
- Edge cases
- Different devices
- Bug fixes
Success Criteria:
- App feels polished
- No crashes
- Good user feedback
- Ready for release
13. Code Structure
Complete Project Structure
KairoAI/
βββ android/
β βββ app/
β βββ src/
β β βββ main/
β β βββ kotlin/com/kairo/ai/
β β β βββ MainActivity.kt
β β β βββ camera/
β β β β βββ CameraManager.kt
β β β βββ ml/
β β β βββ HandLandmarkDetector.kt
β β β βββ SignClassifier.kt
β β βββ assets/
β β β βββ isl_model.tflite
β β β βββ hand_landmarker.task
β β βββ AndroidManifest.xml
β βββ build.gradle.kts
β
βββ lib/
β βββ main.dart
β βββ services/
β β βββ sign_detection_service.dart
β βββ models/
β β βββ lesson.dart
β β βββ quiz_result.dart
β βββ screens/
β β βββ home_screen.dart
β β βββ lesson_screen.dart
β β βββ quiz_screen.dart
β β βββ progress_screen.dart
β βββ widgets/
β β βββ camera_preview.dart
β β βββ detection_overlay.dart
β β βββ success_animation.dart
β βββ providers/
β βββ detection_provider.dart
β
βββ model_training/
β βββ extract_landmarks.py
β βββ train_model.py
β βββ convert_to_tflite.py
β βββ requirements.txt
β βββ README.md
β
βββ assets/
β βββ animations/
β β βββ success.json
β β βββ loading.json
β βββ sounds/
β β βββ success.mp3
β β βββ error.mp3
β βββ images/
β βββ isl_alphabet_guide.png
β
βββ pubspec.yaml
βββ README.md
βββ DOCUMENTATION.md (this file)
14. Challenges & Solutions
Technical Challenges
Challenge 1: MediaPipe Not Detecting Hand
Symptoms:
landmarksreturnsnull- No hand landmarks extracted
Possible Causes & Solutions:
| Cause | Solution |
|---|---|
| Poor lighting | Improve lighting conditions |
| Hand too small in frame | Move camera closer |
| Hand partially out of frame | Ensure full hand visible |
| Wrong model file | Download correct hand_landmarker.task |
| Low confidence threshold | Reduce minHandDetectionConfidence to 0.3 |
Debug Code:
val result = handLandmarker.detect(mpImage)
println("Hands detected: ${result.landmarks().size}")
if (result.landmarks().isEmpty()) {
println("β No hand detected")
println("Try: better lighting, move hand closer, show full hand")
}
Challenge 2: Low Model Accuracy
Symptoms:
- Test accuracy < 85%
- Wrong letter predictions
- Low confidence scores
Solutions:
Collect More Data
# Aim for at least 500 samples per letter # Current: 100 per letter β Low # Target: 500+ per letter β GoodNormalize Landmarks
def normalize_landmarks(landmarks): landmarks = np.array(landmarks).reshape(21, 3) wrist = landmarks[0] landmarks = landmarks - wrist # Relative to wrist # Scale to unit box min_vals = landmarks.min(axis=0) max_vals = landmarks.max(axis=0) scale = max_vals - min_vals landmarks = (landmarks - min_vals) / scale return landmarks.flatten()Tune Hyperparameters
# Try different: - Learning rates: 0.0001, 0.001, 0.01 - Batch sizes: 16, 32, 64 - Dropout rates: 0.2, 0.3, 0.5 - Number of layers: 2, 3, 4
Challenge 3: Platform Channel Not Working
Symptoms:
- Flutter doesn't receive data
MissingPluginException- No communication
Solutions:
Check Channel Names Match
// Flutter MethodChannel('com.kairo.ai/detection'); // Must match exactly// Kotlin MethodChannel(..., "com.kairo.ai/detection"); // Same stringVerify Plugin Registration
// MainActivity must extend FlutterActivity class MainActivity : FlutterActivity() { // ... }Debug with Logs
// Flutter try { await _methodChannel.invokeMethod('test'); print('β Channel working'); } catch (e) { print('β Channel error: $e'); }// Kotlin methodChannel.setMethodCallHandler { call, result -> println("π Received call: ${call.method}") result.success("OK") }
Challenge 4: App Crashes on TFLite Inference
Symptoms:
- App crashes when detecting
IllegalArgumentExceptionArrayIndexOutOfBoundsException
Solutions:
Check Input Shape
// Model expects: [1, 63] val inputBuffer = Array(1) { FloatArray(63) } inputBuffer[0] = landmarks // landmarks must be exactly 63 floats if (landmarks.size != 63) { println("β Wrong input size: ${landmarks.size}, expected 63") return }Check Output Shape
// Model outputs: [1, 26] val outputBuffer = Array(1) { FloatArray(26) } interpreter.run(inputBuffer, outputBuffer) val probabilities = outputBuffer[0] if (probabilities.size != 26) { println("β Wrong output size: ${probabilities.size}") }Verify Model File
// Check if model file exists and loads correctly try { val modelFile = loadModelFile(context, "isl_model.tflite") println("β Model loaded: ${modelFile.capacity()} bytes") } catch (e: Exception) { println("β Failed to load model: ${e.message}") }
Common Pitfalls
Pitfall 1: Not Normalizing Landmarks
Problem: Model accuracy drops when user changes hand position or distance
Solution: Always normalize landmarks during both training and inference
# Training time
def normalize_landmarks(landmarks):
# Make relative to wrist and scale to unit box
# ...
# Inference time (Kotlin)
fun normalizeLandmarks(landmarks: FloatArray): FloatArray {
// Same normalization logic
// ...
}
Pitfall 2: Imbalanced Dataset
Problem: Some letters have 1000 samples, others have 100
Solution: Balance the dataset
# Check class distribution
print(data['label'].value_counts())
# Undersample majority classes or oversample minority classes
from imblearn.over_sampling import SMOTE
X_balanced, y_balanced = SMOTE().fit_resample(X, y)
Pitfall 3: Forgetting to Close Camera
Problem: Camera stays on even after leaving screen, draining battery
Solution: Properly manage lifecycle
class LessonScreen extends StatefulWidget {
@override
_LessonScreenState createState() => _LessonScreenState();
}
class _LessonScreenState extends State<LessonScreen> {
@override
void initState() {
super.initState();
signDetectionService.startDetection();
}
@override
void dispose() {
signDetectionService.stopDetection(); // Important!
super.dispose();
}
}
15. Feasibility Assessment
Is This Project Doable?
β Absolutely YES, Here's Why:
All Technology Exists
- MediaPipe: Production-ready, used by millions
- TFLite: Mature, optimized for mobile
- Flutter Platform Channels: Well-documented
Similar Apps Exist
- ASL learning apps
- Gesture recognition apps
- Hand tracking AR filters
You Have Foundation
- Flutter project setup β
- Firebase working β
- Android configuration β
Realistic Timeline
Beginner Path (No ML/Native experience)
| Phase | Time | Milestone |
|---|---|---|
| Learning | 1-2 weeks | Understand MediaPipe, TFLite basics |
| Setup | 3-5 days | Add dependencies, configure Android |
| Data Collection | 1 week | Find/collect dataset, extract landmarks |
| Model Training | 1-2 days | Train model, achieve 85%+ accuracy |
| Integration | 1-2 weeks | Platform channels, camera, detection |
| Testing | 1 week | Fix bugs, improve accuracy |
| Polish | 1 week | UI/UX, animations, Firebase |
| TOTAL | 6-8 weeks | Working MVP |
With Experience Path
| Phase | Time | Milestone |
|---|---|---|
| Setup | 1 day | Dependencies configured |
| Data | 2-3 days | Dataset ready |
| Training | 4 hours | Model trained |
| Integration | 1 week | Everything connected |
| Testing | 3-4 days | Bugs fixed |
| TOTAL | 2-3 weeks | Working app |
Skills You'll Gain
Even if this takes longer than expected:
β
Flutter platform channels (valuable)
β
Android native development (Kotlin)
β
Machine learning basics (TensorFlow)
β
Computer vision (MediaPipe)
β
Mobile camera handling
β
Firebase integration
β
Real-world app architecture
β
Complex system integration
β
Problem-solving under constraints
Value: These skills are highly marketable
Risk Mitigation
Risk 1: Can't Collect Enough Data
Mitigation:
- Use Kaggle datasets (many available)
- Start with 10 letters instead of 26
- Use data augmentation
Risk 2: Model Accuracy Too Low
Mitigation:
- Start with 70% accuracy goal (acceptable for MVP)
- Improve gradually with more data
- Use transfer learning if needed
Risk 3: Platform Channel Issues
Mitigation:
- Use exact code patterns from documentation
- Start with simple "Hello World" channel
- Test incrementally (string β number β map)
16. Resources & Learning Path
Week 1: Learn Basics
MediaPipe Resources
Official Docs
YouTube Tutorials
- "MediaPipe Hands Tutorial" by Nicholas Renotte
- "Real-time Hand Tracking" by TensorFlow
Example Code
git clone https://github.com/google/mediapipe cd mediapipe/examples/hand_landmarker/android # Open in Android Studio and run
TensorFlow Lite Resources
Official Docs
Tutorials
- "TFLite on Android" by TensorFlow
- "Custom Model Deployment" tutorials
Week 2-3: Hands-On Practice
Practice Project 1: Hand Detection
Goal: Display camera feed and draw hand landmarks
// Simple app that just shows hand landmarks
class MainActivity : AppCompatActivity() {
// Use MediaPipe to detect hand
// Draw landmarks on camera preview
// No classification yet
}
Practice Project 2: Platform Channel Hello World
Goal: Send message from Kotlin to Flutter
// Flutter
final result = await channel.invokeMethod('sayHello');
print(result); // "Hello from Kotlin!"
// Kotlin
channel.setMethodCallHandler { call, result ->
if (call.method == "sayHello") {
result.success("Hello from Kotlin!")
}
}
Recommended Development Tools
Essential
| Tool | Purpose | Download |
|---|---|---|
| Android Studio | Android development | Download |
| VS Code | Flutter development | Download |
| Python 3.10+ | Model training | Download |
Optional but Helpful
| Tool | Purpose |
|---|---|
| Google Colab | Free GPU for training |
| Postman | API testing |
| Firebase Console | Database management |
| Android Device | Real device testing |
Community Support
Where to Get Help
Stack Overflow
- Tag:
[flutter] [mediapipe] - Tag:
[tensorflow-lite]
- Tag:
GitHub Discussions
Discord Communities
- Flutter Discord
- ML/AI Discord servers
Reddit
- r/FlutterDev
- r/MachineLearning
- r/androiddev
Debugging Checklist
When Something Doesn't Work
β‘ Check logs (Android Studio Logcat)
β‘ Verify dependencies versions match
β‘ Clean and rebuild project
β‘ Restart Android Studio
β‘ Check permissions in AndroidManifest.xml
β‘ Verify channel names match exactly
β‘ Test on real device (not emulator)
β‘ Check model file exists in assets
β‘ Verify input/output shapes
β‘ Print debug information at each step
Next Steps: Your Action Plan
Immediate (This Week)
Download MediaPipe Example
git clone https://github.com/google/mediapipe cd mediapipe/examples/hand_landmarker/androidRun the Example
- Open in Android Studio
- Build and run on device
- Verify hand detection works
Report Back
- Did it work?
- Any errors?
- What did you learn?
Week 2-3
Create Basic Flutter App
- Camera preview
- Platform channel setup
- Simple hand detection
Collect Small Dataset
- 50 images each of letters A and B
- Extract landmarks
- Train 2-class model
Integrate TFLite
- Load model in Android
- Run inference
- Send results to Flutter
Week 4-6
- Expand to Full Alphabet
- Build UI
- Add Firebase
- Test and Polish
Final Thoughts
This Project is:
β Technically feasible - All pieces exist and work β Educationally valuable - You'll learn A LOT β Portfolio-worthy - Impressive for job applications β Challenging but doable - With persistence
This Project is NOT:
β A weekend project β Impossible for beginners β Requiring PhD-level ML knowledge β Dependent on expensive tools
Success Factors
You WILL succeed if you:
- β Start small (hand detection first, full app later)
- β Break problems into tiny steps
- β Debug systematically (logs everywhere)
- β Ask for help when stuck (community is helpful)
- β Accept imperfection (70% accuracy is a great start)
- β Stay persistent (debugging takes time)
You might struggle if you:
- β Try to do everything at once
- β Skip the learning phase
- β Give up at first error
- β Aim for perfection immediately
Contact & Support
If you need help while building this:
- GitHub Discussions - Most responsive
- Stack Overflow - Tag your questions properly
- Flutter Discord - Real-time chat
- This AI Assistant - Come back anytime!
Conclusion
KairoAI is an ambitious but achievable project.
You have:
- β Clear architecture
- β Detailed implementation guide
- β Code examples for every component
- β Realistic timeline
- β Troubleshooting guides
Now it's time to build!
Start with the MediaPipe example this week. Once you see hand detection working on your device, you'll realize this is not just possibleβit's inevitable.
Good luck! π
Last updated: December 18, 2025 Version: 1.0.0 Author: Megh Modi
Appendix: Quick Reference
Key Commands
# Flutter
flutter doctor
flutter clean
flutter pub get
flutter run
# Android
./gradlew clean
./gradlew assembleDebug
# Python
pip install -r requirements.txt
python extract_landmarks.py
python train_model.py
Important File Paths
android/app/src/main/assets/isl_model.tflite
android/app/src/main/kotlin/com/kairo/ai/MainActivity.kt
lib/services/sign_detection_service.dart
model_training/train_model.py
Channel Names (Must Match!)
com.kairo.ai/detection # MethodChannel
com.kairo.ai/detection_stream # EventChannel
Common Error Codes
| Error | Meaning | Fix |
|---|---|---|
MissingPluginException |
Channel not registered | Check MainActivity |
PlatformException |
Native code error | Check Kotlin logs |
IllegalArgumentException |
Wrong input shape | Verify array size is 63 |
FileNotFoundException |
Model not found | Check assets folder |
End of Documentation
- Downloads last month
- 8