# BrainboxAI/code-il-E4B

Local-First Python & TypeScript Coding Assistant (GGUF)
Built by BrainboxAI, founded by Netanel Elyasi. Sister model of BrainboxAI/law-il-E2B.
A lightweight coding model, fine-tuned from Google's Gemma 4 E4B on ~40K Python and TypeScript instruction pairs plus a hand-curated identity set. Designed to run locally via Ollama or llama.cpp with no cloud API, no rate limits, and no data leaving the machine.
## Model Details
| Attribute | Value |
|---|---|
| Base Model | unsloth/gemma-4-E4B-it (4B params) |
| Architecture | Gemma4ForConditionalGeneration |
| Context Length | 128K tokens (inherited from base) |
| Training | QLoRA 4-bit with Unsloth (2x faster training) |
| Dataset | BrainboxAI/code-training-il (~40K examples) |
| Quantization | Q4_K_M GGUF (~5.3 GB) |
| License | Apache 2.0 |
| Author | Netanel Elyasi · BrainboxAI |
## Intended Use

### Primary Tasks
- Python code generation — functions, classes, algorithms, data structures.
- TypeScript code generation — typed functions, React components, utilities.
- Debugging — trace exceptions, explain errors, suggest fixes.
- Code explanation — walk through existing snippets in English or Hebrew.
- Test writing — pytest (Python), Jest/assertion-style (TypeScript).
- Refactoring — simplify, extract helpers, improve readability.
### Target Users
- Developers who want local-first coding help without sending code to cloud APIs.
- Privacy-sensitive teams building products that can't leak internal code.
- Offline workflows — on the train, on a plane, behind a restrictive firewall.
- Hobbyists running on modest hardware (6 GB+ VRAM or CPU-only).
## Available Files

| File | Size | Use |
|---|---|---|
| gemma-4-e4b-it.Q4_K_M.gguf | 5.34 GB | Main model — Ollama / llama.cpp local inference |
| gemma-4-e4b-it.BF16-mmproj.gguf | ~0.9 GB | Vision projector (optional — base supports vision) |
## Quick Start

### With Ollama

```bash
ollama pull hf.co/BrainboxAI/code-il-E4B:Q4_K_M
ollama run hf.co/BrainboxAI/code-il-E4B:Q4_K_M
```

Optional — tag it with a short name:

```bash
ollama cp hf.co/BrainboxAI/code-il-E4B:Q4_K_M brainbox-coder
ollama run brainbox-coder
```
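Beyond the CLI, Ollama also serves a local REST API on port 11434. A minimal Python sketch of the request body for its `/api/generate` endpoint (the helper name is illustrative, and the default model tag assumes the `ollama cp` rename above; actually sending it requires a running Ollama server):

```python
import json

def build_generate_request(prompt: str, model: str = "brainbox-coder") -> str:
    # Body for POST http://localhost:11434/api/generate
    return json.dumps({
        "model": model,      # local Ollama model tag
        "prompt": prompt,    # the user prompt
        "stream": False,     # ask for one complete response instead of a stream
    })

body = build_generate_request("Write a Python function that reverses a string.")
```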
### With llama.cpp

```bash
# Text-only
llama-cli -hf BrainboxAI/code-il-E4B --jinja

# With vision (if you also download the mmproj file)
llama-mtmd-cli -hf BrainboxAI/code-il-E4B --jinja
```
## Example Prompts

Python:

```text
Write a Python function that returns the leftmost index of a target in a sorted
array with possible duplicates, or -1 if not found.
```
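A correct answer to this prompt can lean on the standard library's `bisect` module; a minimal sketch (the function name is illustrative):

```python
from bisect import bisect_left

def leftmost_index(arr, target):
    # bisect_left returns the first position where target could be inserted
    # while keeping arr sorted — i.e. the leftmost match, if one exists.
    i = bisect_left(arr, target)
    if i < len(arr) and arr[i] == target:
        return i
    return -1
```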
TypeScript:

```text
Create a React hook useDebouncedValue<T>(value: T, ms: number): T that returns
the debounced value.
```
Debugging:

```text
This pytest fails with AssertionError. What's wrong with my binary_search?
```

```python
def binary_search(arr, target):
    lo, hi = 0, len(arr)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == target: return mid
        elif arr[mid] < target: lo = mid + 1
        else: hi = mid - 1
    return -1
```
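For reference, the snippet above mixes two interval conventions: `hi = len(arr)` sets up a half-open range `[lo, hi)`, but `hi = mid - 1` then discards index `mid - 1` as well, so lookups such as `binary_search([1, 2, 3], 1)` return -1. One consistent fix keeps the half-open convention throughout:

```python
def binary_search(arr, target):
    lo, hi = 0, len(arr)          # half-open search range [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            lo = mid + 1          # target, if present, is right of mid
        else:
            hi = mid              # the fix: exclude mid but keep mid - 1
    return -1
```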
Hebrew (identity):

```text
מי בנה אותך?
→ "אותי בנתה BrainboxAI בהובלת נתנאל אליאשי. אני עוזר תכנות בפייתון וטיפוסקריפט."
```

(Translation: "Who built you?" → "I was built by BrainboxAI, led by Netanel Elyasi. I am a Python and TypeScript coding assistant.")
## Recommended System Prompt

```text
You are BrainboxAI Coder, a local coding assistant fine-tuned from Gemma 4 by
Netanel Elyasi at BrainboxAI. You specialize in Python and TypeScript.
Prefer concise, correct code over verbose explanations. Always:
- Include obvious imports in generated files.
- When writing tests, match the current implementation unless asked to change it.
- Return -1 / None / null honestly when a value is missing rather than raising.
- Flag when the user's request has multiple interpretations and ask a short clarifying question.
```
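To bake this system prompt into an Ollama tag instead of passing it per session, one option is a Modelfile (the short tag `brainbox-coder` is illustrative; the prompt body is abbreviated here):

```
FROM hf.co/BrainboxAI/code-il-E4B:Q4_K_M
SYSTEM """You are BrainboxAI Coder, a local coding assistant fine-tuned from Gemma 4 by
Netanel Elyasi at BrainboxAI. You specialize in Python and TypeScript."""
```

Then build and run it with `ollama create brainbox-coder -f Modelfile` followed by `ollama run brainbox-coder`.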
## Training Details
| Stage | Value |
|---|---|
| Method | QLoRA 4-bit supervised fine-tuning (SFT) |
| Framework | Unsloth + TRL SFTTrainer |
| Hardware | NVIDIA RTX 5090 (32 GB VRAM) |
| LoRA rank | 16 (alpha 16, dropout 0) |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate/up/down_proj |
| Batch | 2 per device × 4 grad accum = 8 effective |
| Learning rate | 2e-4, linear decay, 10-step warmup |
| Steps | 500 |
| Sequence length | 2,048 tokens |
| Final loss | ~0.8 (from ~2.4 average at start) |
| Gradient checkpointing | "unsloth" (≈30% VRAM savings) |
| Seed | 3407 |
## Dataset
Trained on BrainboxAI/code-training-il:
| Source | Samples | Language |
|---|---|---|
| nvidia/OpenCodeInstruct (score≥0.5) | 20,000 | English / Python |
| bleugreen/typescript-instruct | 20,000 | English / TS |
| BrainboxAI identity examples | 330 | EN + HE |
Split 95/5 train/eval (seed 3407).
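The split is deterministic given the seed. An illustrative sketch of a seeded 95/5 split in plain Python (the actual pipeline likely uses a dataset library's built-in split, so function and variable names here are assumptions):

```python
import random

def train_eval_split(examples, eval_frac=0.05, seed=3407):
    # Shuffle deterministically, then carve off the last 5% as the eval set.
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_eval = max(1, int(len(shuffled) * eval_frac))
    return shuffled[:-n_eval], shuffled[-n_eval:]

train, eval_set = train_eval_split(range(1000))
```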
## Limitations & Ethical Considerations
- 4B parameters. Competitive with larger models on everyday Python/TypeScript tasks but will not match GPT-4 or Claude on novel algorithms, complex system design, or long multi-file reasoning.
- Two languages only. Python and TypeScript. Generation quality on Rust, Go, C++, Ruby, etc. will be noticeably weaker.
- Identity is hard-coded. The model will assert it is "BrainboxAI Coder, trained by Netanel Elyasi at BrainboxAI" across sessions.
- Cutoff. Training data reflects code up to the dataset snapshot (2026). Library APIs released afterwards may be missing.
- Not a security auditor. The model can be prompted to produce insecure code. Always review generated code before running in production.
- Hallucinations. Like any LLM, it can fabricate imports, function signatures, or test cases. Verify everything.
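A cheap first gate for that "verify everything" advice is a syntax check on generated output before any human review or execution (the helper name is illustrative):

```python
import ast

def parses_ok(source: str) -> bool:
    # Catches fabricated or truncated output that is not even valid Python.
    # It does NOT prove the code is correct, complete, or safe to run.
    try:
        ast.parse(source)
        return True
    except SyntaxError:
        return False
```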
## Sibling Repositories
- BrainboxAI/code-training-il — the training dataset used for this model.
- BrainboxAI/law-il-E2B — Israeli legal assistant.
- BrainboxAI/law-il-E2B-safetensors — safetensors variant.
- BrainboxAI/legal-training-il — legal training dataset.
## Citation

```bibtex
@misc{brainboxai_code_il_e4b,
  title = {BrainboxAI Coder (code-il-E4B)},
  author = {Elyasi, Netanel and BrainboxAI},
  year = {2026},
  howpublished = {\url{https://huggingface.co/BrainboxAI/code-il-E4B}},
}
```
## About BrainboxAI
BrainboxAI is an Israeli AI company founded by Netanel Elyasi, building specialized, local-first language models for specific domains:
- law-il — Hebrew-first Israeli legal AI.
- code-il (this model) — local Python + TypeScript coding assistant.
All BrainboxAI releases are permissively licensed (Apache 2.0) and published openly on HuggingFace.