# Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS
AWQ INT4 quantization of nvidia/Nemotron-Orchestrator-8B optimized for low-VRAM consumer hardware (RTX 3060 6 GB).
Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).
- Founder: Ilia Bolotnikov
- Organization: AMAImedia.com
- X (Twitter): @AMAImediacom
- LinkedIn: Ilia Bolotnikov
- Telegram: @djbionicl
- NOESIS version: v14.6
- Release date: 2026-04
## ⚠️ License notice
This model inherits the NVIDIA Open Model License from the upstream
nvidia/Nemotron-Orchestrator-8B. The base model is designated by NVIDIA as
"for research and development only".
This AWQ derivative is published to make the model accessible to the broader
research and development community on consumer GPUs. Users are responsible
for compliance with NVIDIA's license terms — see the LICENSE file in
this repository for the full text.
By downloading or using this model you agree to the upstream NVIDIA license.
## Model summary
| Property | Value |
|---|---|
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, dense, NOT MoE) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (full-resident on 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |
A companion BF16 reference checkpoint is also published: amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS.
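The disk figure in the table follows from simple back-of-the-envelope arithmetic. A minimal sketch, assuming an 8B parameter count and a rough 4 bytes of scale/zero-point metadata per 128-weight group (these are estimates, not measured values; FP16 embeddings and the LM head add some extra on top):

```python
# Rough footprint estimate for an 8B-parameter model quantized to INT4
# with group_size=128 (AWQ stores FP16 scales + zero points per group).
PARAMS = 8e9        # assumed parameter count of the 8B base model
GROUP_SIZE = 128

int4_weights_gb = PARAMS * 0.5 / 1e9        # 4 bits = 0.5 bytes per weight
scales_gb = PARAMS / GROUP_SIZE * 4 / 1e9   # ~4 bytes of metadata per group

disk_gb = int4_weights_gb + scales_gb
print(f"approx quantized weights: {disk_gb:.2f} GB")  # prints 4.25
```

Unquantized FP16 embeddings plus serialization overhead account for the gap up to the observed ~4.5 GB on disk.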
## Why this quantization
The original Nemotron-Orchestrator-8B ships in FP32 (~32 GB on disk), which
does not fit on any consumer GPU. Community quantizations exist (mostly
GGUF), but none is calibrated specifically for orchestration / tool-calling
workloads or packaged for the AutoAWQ GEMM kernel path, which integrates
directly with transformers and vLLM on Windows hosts.
This AWQ build:

- Fits inside the 4.5 GB SEALED VRAM window of the NOESIS specialist sequential-swapping protocol
- Uses the GEMM kernel (compatible with `device_map={"": 0}`; no CPU offload)
- Is provenance-tracked (`noesis_provenance.json` ships with the model)
- Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K)
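For orientation, the quantization step presumably looked like the following AutoAWQ sketch. This is a reconstruction from the settings in the model summary table, not the actual NOESIS pipeline script; the in-house calibration prompts are not published, and `quantize_to_awq` is an illustrative name:

```python
# Quantization settings matching the model card
# (zero_point=True, group_size=128, INT4, GEMM kernel)
quant_config = {
    "zero_point": True,
    "q_group_size": 128,
    "w_bit": 4,
    "version": "GEMM",
}

def quantize_to_awq(base_id: str, out_dir: str, calib_prompts: list) -> None:
    """Quantize the FP32 base checkpoint to AWQ INT4 (needs a large GPU)."""
    # deferred imports: requires the autoawq and transformers packages
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_id)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_prompts,   # 128 orchestration/tool-calling prompts
        max_calib_seq_len=512,      # matches the calibration set in the table
    )
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```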
## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},       # keep every layer on GPU 0 (no CPU offload)
    torch_dtype=torch.float16,
    fuse_layers=False,
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```
## NOESIS context
In NOESIS this model serves as the English orchestration teacher for
Specialist M9-ORCH-4B during knowledge distillation. It is loaded
sequentially (per the NOESIS swapping protocol) onto the RTX 3060,
producing top-K=512 logits at temperature=4.0, which are then aggregated
in build_ensemble_labels.py with proposed weight w=0.22 on the
orchestration data shard.
NOESIS specialists overview:
| ID | Role | Size |
|---|---|---|
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| M9 | Orchestrator + routing | 4B/0.8B |
## Acknowledgements & citation
Base model: ToolOrchestra by NVIDIA & University of Hong Kong.
```bibtex
@misc{toolorchestra,
  title         = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author        = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year          = {2025},
  eprint        = {2511.21689},
  archivePrefix = {arXiv}
}
```
Quantization & NOESIS integration:
```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```