# Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS

AWQ INT4 quantization of nvidia/Nemotron-Orchestrator-8B optimized for low-VRAM consumer hardware (RTX 3060 6 GB).

Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).


## ⚠️ License notice

This model inherits the NVIDIA Open Model License from the upstream nvidia/Nemotron-Orchestrator-8B. The base model is designated by NVIDIA as "for research and development only".

This AWQ derivative is published to make the model accessible to the broader research and development community on consumer GPUs. Users are responsible for compliance with NVIDIA's license terms — see the LICENSE file in this repository for the full text.

By downloading or using this model you agree to the upstream NVIDIA license.


## Model summary

| Property | Value |
|---|---|
| Base model | nvidia/Nemotron-Orchestrator-8B |
| Underlying architecture | Qwen3-8B (decoder-only transformer, dense, not MoE) |
| Original precision | FP32 safetensors (~32 GB) |
| Quantized precision | AWQ INT4 (group_size=128, GEMM, zero_point=True) |
| Vocab size | 151936 |
| Language | English (per base model) |
| Disk footprint | ~4.5 GB |
| Inference VRAM | ~5.0 GB (fully resident on a 6 GB GPU) |
| Quantization library | AutoAWQ 0.2.9 |
| Calibration set | 128 in-house orchestration / tool-calling prompts, max_seq_len=512 |
| RNG seed | 1729 (NOESIS reproducibility lock) |

A companion BF16 reference checkpoint is also published: amaimedia/Nemotron-Orchestrator-8B-Qwen3-BF16-NOESIS.
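The settings in the table above can be reproduced with AutoAWQ roughly as follows. This is a sketch, not the actual NOESIS pipeline: the calibration-prompt list, output path, and `quantize` helper are illustrative, and running it requires a GPU with the base checkpoint available.

```python
# Quantization settings matching the model card: 4-bit weights,
# group size 128, zero-point quantization, GEMM kernel layout.
quant_config = {
    "w_bit": 4,
    "q_group_size": 128,
    "zero_point": True,
    "version": "GEMM",
}

def quantize(base_id: str, calib_prompts: list[str], out_dir: str) -> None:
    """Quantize a base checkpoint with AutoAWQ (needs a GPU; sketch only)."""
    # Imports kept inside the function: AutoAWQ is only needed
    # when the quantization is actually run.
    from awq import AutoAWQForCausalLM
    from transformers import AutoTokenizer

    model = AutoAWQForCausalLM.from_pretrained(base_id)
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_prompts,   # e.g. 128 tool-calling prompts
        max_calib_seq_len=512,      # matches the card's calibration setting
    )
    model.save_quantized(out_dir)
    tokenizer.save_pretrained(out_dir)
```

The GEMM layout is what makes the checkpoint loadable through both AutoAWQ and vLLM's AWQ path.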


## Why this quantization

The original Nemotron-Orchestrator-8B ships as FP32 safetensors (~32 GB on disk), which does not fit on any consumer GPU. Community quantizations exist (mostly GGUF), but none is calibrated specifically for orchestration / tool-calling workloads, and none is packaged for the AutoAWQ GEMM kernel path that integrates directly with transformers and vLLM on Windows hosts.

This AWQ build:

  1. Fits inside the 4.5 GB SEALED VRAM window of the NOESIS specialist sequential-swapping protocol.
  2. Uses the GEMM kernel, compatible with `device_map={"": 0}` (no CPU offload).
  3. Ships with provenance tracking: `noesis_provenance.json` is distributed alongside the model.
  4. Is calibrated on orchestration / tool-calling prompts matching the base model's training distribution (ToolScale + GeneralThought-430K).
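Because of the GEMM kernel layout, the checkpoint should also load directly in vLLM. A minimal serving sketch follows; the `max_model_len` and `gpu_memory_utilization` values are illustrative guesses for a 6 GB card, not tuned settings.

```python
# Engine settings for loading this AWQ checkpoint in vLLM.
# max_model_len / gpu_memory_utilization are illustrative, not tuned.
engine_kwargs = dict(
    quantization="awq",
    dtype="half",
    max_model_len=4096,
    gpu_memory_utilization=0.92,
)

def generate_with_vllm(model_id: str, prompt: str) -> str:
    """Greedy generation through vLLM (sketch; needs a GPU host)."""
    # vLLM imported lazily: only needed on the serving host.
    from vllm import LLM, SamplingParams

    llm = LLM(model=model_id, **engine_kwargs)
    params = SamplingParams(temperature=0.0, max_tokens=128)
    return llm.generate([prompt], params)[0].outputs[0].text
```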

## How to use

```python
from awq import AutoAWQForCausalLM
from transformers import AutoTokenizer
import torch

model_id = "amaimedia/Nemotron-Orchestrator-8B-Qwen3-AWQ-INT4-NOESIS"
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Load the INT4 weights fully onto GPU 0 (no CPU offload).
model = AutoAWQForCausalLM.from_quantized(
    model_id,
    device_map={"": 0},
    torch_dtype=torch.float16,
    fuse_layers=False,  # fused layers cut latency but raise peak VRAM
)

prompt = "Plan a multi-step task: search for recent AWQ papers, then summarize."
inputs = tokenizer(prompt, return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

## NOESIS context

In NOESIS this model serves as the English orchestration teacher for Specialist M9-ORCH-4B during knowledge distillation. It is loaded sequentially onto the RTX 3060 (per the NOESIS swapping protocol), where it produces top-K=512 logits at temperature T=4.0; these are aggregated in build_ensemble_labels.py with a proposed weight of w=0.22 on the orchestration data shard.
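The aggregation step can be sketched as a temperature-softened, weighted mixture of teacher distributions. The function and variable names below are illustrative, not the actual build_ensemble_labels.py API:

```python
import numpy as np

def soften(logits: np.ndarray, temperature: float) -> np.ndarray:
    """Temperature-scaled softmax over a teacher's top-K logits."""
    z = logits / temperature
    z -= z.max(axis=-1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def ensemble_labels(teacher_logits: list[np.ndarray],
                    weights: list[float],
                    temperature: float = 4.0) -> np.ndarray:
    """Weighted mixture of softened teacher distributions.

    Weights are renormalized so the result is a valid distribution.
    """
    w = np.asarray(weights, dtype=np.float64)
    w = w / w.sum()
    probs = [soften(t, temperature) for t in teacher_logits]
    return sum(wi * p for wi, p in zip(w, probs))

# Toy example: this orchestration teacher at w=0.22 mixed with one
# other teacher at w=0.78, over a tiny 5-token vocabulary.
rng = np.random.default_rng(1729)
t_orch = rng.normal(size=(1, 5))
t_other = rng.normal(size=(1, 5))
labels = ensemble_labels([t_orch, t_other], weights=[0.22, 0.78])
```

Each row of `labels` sums to 1, so the result can be consumed directly as a soft target by a KL-divergence distillation loss.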

NOESIS specialists overview:

| ID | Role | Size |
|---|---|---|
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| M9 | Orchestrator + routing | 4B/0.8B |

## Acknowledgements & citation

Base model: ToolOrchestra by NVIDIA & University of Hong Kong.

```bibtex
@misc{toolorchestra,
  title  = {ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration},
  author = {Hongjin Su and Shizhe Diao and Ximing Lu and others},
  year   = {2025},
  eprint = {2511.21689},
  archivePrefix = {arXiv}
}
```

Quantization & NOESIS integration:

```bibtex
@misc{noesis_v14,
  title  = {NOESIS v14.6: DHCF-FNO Multilingual Dubbing Platform},
  author = {Bolotnikov, Ilia},
  year   = {2026},
  publisher = {AMAImedia},
  url    = {https://amaimedia.com}
}
```