# Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS

AWQ INT4 quantization of nvidia/Nemotron-Orchestrator-8B optimized for low-VRAM consumer hardware (RTX 3060 6 GB).

Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).


## ⚠️ License notice

This model inherits the NVIDIA Open Model License from the upstream nvidia/Nemotron-Orchestrator-8B. The base model is designated by NVIDIA as "for research and development only".

This AWQ derivative is published to make the model accessible to the broader research and development community on consumer GPUs. Users are responsible for compliance with NVIDIA's license terms — see the LICENSE file in this repository for the full text.

By downloading or using this model you agree to the upstream NVIDIA license.


## Model summary

| Property | Value |
|---|---|
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, dense, not MoE) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (fully resident on a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

A companion BF16 reference checkpoint is also published: amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS.
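The settings in the table above can be reproduced with AutoAWQ roughly as follows. This is a sketch, not the actual NOESIS pipeline: the calibration-prompt list, output path, and `quantize` helper are illustrative, and running it requires a GPU with the base checkpoint available.

```python
# Quantization settings matching the model card: 4-bit weights,
# group size 128, zero-point quantization, GEMM kernel layout.
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}

def quantize(base_id: str, calib_prompts: list[str], out_dir: str) -> None:
    """Quantize a base checkpoint with AutoAWQ (needs a GPU; sketch only)."""
    # Imports kept inside the function: AutoAWQ is only needed
    # when the quantization is actually run.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_id)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_prompts,   # e.g. 128 tool-calling prompts
        max_calib_seq_len=512,      # matches the card's calibration setting
    )
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```

The GEMM layout is what makes the checkpoint loadable through both AutoAWQ and vLLM's AWQ path.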


## Why this quantization

The original Nemotron-Orchestrator-8B ships as FP32 safetensors (~32 GB on disk), which does not fit on any consumer GPU. Community quantizations exist (mostly GGUF), but none is calibrated specifically for orchestration / tool-calling workloads, and none is packaged for the AutoAWQ GEMM kernel path that integrates directly with transformers and vLLM on Windows hosts.

This AWQ build:

  1. Fits inside the 4.5 GB SEALED VRAM window of the NOESIS specialist sequential-swapping protocol.
  2. Uses the GEMM kernel, compatible with `device_map={"": 0}` (no CPU offload).
  3. Ships with provenance tracking: `noesis_provenance.json` is distributed alongside the model.
  4. Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K).
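Because of the GEMM kernel layout, the checkpoint should also load directly in vLLM. A minimal serving sketch follows; the `max_model_len` and `gpu_memory_utilization` values are illustrative guesses for a 6 GB card, not tuned settings.

```python
# Engine settings for loading this AWQ checkpoint in vLLM.
# max_model_len / gpu_memory_utilization are illustrative, not tuned.
engine_kwargs = dict(
    quantization="awq",
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.92,
)

def generate_with_vllm(model_id: str, prompt: str) -> str:
    """Greedy generation through vLLM (sketch; needs a GPU host)."""
    # vLLM imported lazily: only needed on the serving host.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id, **engine_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=128)
    return llm.generate([prompt], params)[0].outputs[0].text
```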

## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the INT4 weights fully onto GPU 0 (no CPU offload).
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,  # fused layers cut latency but raise peak VRAM
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## NOESIS context

In NOESIS this model serves as the English orchestration teacher for Specialist M9-ORCH-4B during knowledge distillation. It is loaded sequentially onto the RTX 3060 (per the NOESIS swapping protocol), where it produces top-K=512 logits at temperature T=4.0; these are aggregated in build_ensemble_labels.py with a proposed weight of w=0.22 on the orchestration data shard.
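The aggregation step can be sketched as a temperature-softened, weighted mixture of teacher distributions. The function and variable names below are illustrative, not the actual build_ensemble_labels.py API:

```python
import numpy as np

def soften(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax over a teacher's top-K logits."""
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_labels(teacher_logits: list[np.ndarray],
                    weights: list[float],
                    temperature: float = 4.0) -> np.ndarray:
    """Weighted mixture of softened teacher distributions.

    Weights are renormalized so the result is a valid distribution.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    probs = [soften(t, temperature) for t in teacher_logits]
    return sum(wi * p for wi, p in zip(w, probs))

# Toy example: this orchestration teacher at w=0.22 mixed with one
# other teacher at w=0.78, over a tiny 5-token vocabulary.
rng = np.random.default_rng(1729)
t_orch = rng.normal(size=(1, 5))
t_other = rng.normal(size=(1, 5))
labels = ensemble_labels([t_orch, t_other], weights=[0.22, 0.78])
```

Each row of `labels` sums to 1, so the result can be consumed directly as a soft target by a KL-divergence distillation loss.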

NOESIS specialists overview:

| ID | Role | Size |
|---|---|---|
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| M9 | Orchestrator + routing | 4B/0.8B |

## Acknowledgements & citation

Base model: ToolOrchestra by NVIDIA & University of Hong Kong.

```bibtex
@misc{toolorchestra,
  title  = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year   = {2025},
  eprint = {2511.21689},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title  = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author = {Bolotnikov, Ilia},
  year   = {2026},
  publisher = {AMAImedia},
  url    = {https://amaimedia.com}
}
```