# Darwin-Qwen3.5-35B-A3B-Opus-AWQ-INT4-NOESIS

Custom AWQ-style INT4 quantization of FINAL-Bench/Darwin-35B-A3B-Opus, converted from the Q8_0 GGUF and optimized for RAM-constrained machines (64 GB RAM, RTX 3060 6 GB).
Released as part of the NOESIS Professional Multilingual Dubbing Automation Platform (framework: DHCF-FNO — Deterministic Hybrid Control Framework for Frozen Neural Operators).
- Founder: Ilia Bolotnikov
- Organization: AMAImedia.com
- X (Twitter): @AMAImediacom
- LinkedIn: Ilia Bolotnikov
- Telegram: @djbionicl
- NOESIS version: v14.7
- Release date: 2026-04
## ⚠️ License notice
This model is derived from FINAL-Bench/Darwin-35B-A3B-Opus, which itself is derived from
Qwen/Qwen3.5-35B-A3B — both licensed under Apache 2.0.
This INT4 quantization retains the same Apache 2.0 license — see
the LICENSE file in this repository for the full text.
## Model summary
| Property | Value |
|---|---|
| Base model | FINAL-Bench/Darwin-35B-A3B-Opus |
| Quantization source | FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF (Q8_0, ~36.9 GB) |
| Architecture | qwen3_5_moe — Qwen3.5 MoE with Gated DeltaNet |
| Total parameters | 35B |
| Active parameters | ~3B per forward pass (8 routed + 1 shared expert) |
| Experts per layer | 256 routed + 1 shared |
| Layers | 40 (hybrid: 30 GDN/linear_attention + 10 full_attention, every 4th) |
| Hidden size | 2 048 |
| Original vocab size | 248 320 |
| Context length | 262 144 tokens (native) |
| Languages | 201 |
| Quantization format | Custom nibble AWQ-INT4 (group_size=128, symmetric, no AutoAWQ) |
| Precision: linear layers | nibble uint8 (weight_i4 [out, in//2] + weight_scale_i4 [n_groups, out]) |
| Precision: MoE experts | nibble uint8 3D (gate_up_proj_q4 [256, out, in//2] + scales/zeros) |
| Precision: lm_head | BF16 (AWQ standard — output projection kept full precision) |
| Precision: embed_tokens | BF16 |
| Disk footprint | ~17.8 GB |
| Inference RAM (CPU offload) | ~20 GB RAM + ~5.4 GB VRAM (device_map="auto") |
| trust_remote_code | required |
| Quantization library | Custom pipeline (NOESIS v14.7), no AutoAWQ dependency |
| RNG seed | 1729 (NOESIS reproducibility lock) |
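As a concrete illustration of the nibble layout in the table above, here is a minimal NumPy sketch of symmetric group-wise INT4 quantization with two values packed per uint8. The function names and the low-nibble-first packing order are illustrative assumptions, not the shipped pipeline:

```python
import numpy as np

def quantize_int4_sym(w, group_size=128):
    """Symmetric per-group INT4 quantization with nibble packing.

    w: float weight matrix [out, in]; `in` must be divisible by
    group_size (and by 2 for packing). Returns packed uint8
    [out, in//2] and float32 scales [n_groups, out].
    """
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    # Symmetric: map the max |w| in each group onto the int4 limit 7.
    scale = np.abs(groups).max(axis=-1) / 7.0            # [out, n_groups]
    scale = np.where(scale == 0, 1.0, scale)             # avoid div by zero
    q = np.clip(np.round(groups / scale[..., None]), -8, 7).astype(np.int8)
    q = q.reshape(out_f, in_f)
    # Pack two signed int4 values per uint8: low nibble = even column.
    u = (q & 0x0F).astype(np.uint8)
    packed = u[:, 0::2] | (u[:, 1::2] << 4)
    return packed, scale.T.astype(np.float32)

def dequantize_int4_sym(packed, scales, group_size=128):
    """Inverse of quantize_int4_sym; returns float32 [out, in]."""
    lo = (packed & 0x0F).astype(np.int8)
    hi = (packed >> 4).astype(np.int8)
    # Sign-extend the 4-bit values.
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    q = np.empty((packed.shape[0], packed.shape[1] * 2), dtype=np.float32)
    q[:, 0::2], q[:, 1::2] = lo, hi
    out_f, in_f = q.shape
    groups = q.reshape(out_f, in_f // group_size, group_size)
    return (groups * scales.T[..., None]).reshape(out_f, in_f)
```

With group_size=128, the worst-case round-trip error per weight is half a quantization step, i.e. at most max|w| / 14 within each group.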
Architecture note: Darwin-35B-A3B-Opus was created with Darwin V5 — a diagnostic-guided evolutionary merge engine (DARE-TIES via mergekit).
- Father: Qwen/Qwen3.5-35B-A3B (base architecture + RLHF)
- Mother: Jackrong/Qwen3.5-35B-A3B-Claude-4.6-Opus-Reasoning-Distilled (LoRA SFT)
Key diagnostic finding: Mother had 50–65% dead experts (activation < 5%) from text-only LoRA SFT. Darwin V5 compensated by reducing Mother density and using Father's living experts to fill inactive slots. Layer 38 (reasoning core) uses 90% Mother weights (peak probe cosine distance).
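The dead-expert diagnostic above can be sketched as a routing census: a hypothetical helper that counts how often each routed expert lands in a token's top-k set over probe prompts and flags experts below the 5% activation threshold. The function name and the top-k definition of "activated" are assumptions:

```python
import numpy as np

def dead_expert_report(router_logits, top_k=8, threshold=0.05):
    """Flag experts whose selection frequency falls below `threshold`.

    router_logits: [n_tokens, n_experts] router scores collected from
    probe prompts. An expert counts as "activated" on a token when it
    lands in that token's top-k routed set.
    """
    n_tokens, n_experts = router_logits.shape
    # Indices of the top-k experts for each token.
    topk = np.argpartition(-router_logits, top_k, axis=-1)[:, :top_k]
    counts = np.bincount(topk.ravel(), minlength=n_experts)
    freq = counts / n_tokens          # fraction of tokens routing here
    dead = np.flatnonzero(freq < threshold)
    return freq, dead
```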
## Benchmark results (original BF16 model, Q8_0 ≈ BF16)
| Benchmark | Darwin-35B-A3B-Opus | Father (Qwen3.5-35B-A3B) | Mother (Claude 4.6 Opus Distilled) |
|---|---|---|---|
| GPQA Diamond | 90.0% | 84.2% | 85.0% |
| MMMLU (29 langs) | 85.0% | 85.2% | — |
## Why a custom format (not AutoAWQ / transformers AwqConfig)
AutoAWQ and transformers `AwqConfig` only quantize standard `nn.Linear` modules.
Darwin-35B stores all 256 routed experts as merged `nn.Parameter` tensors
`[256, out_features, in_features]` inside `Qwen3_5MoeExperts` — not as 256 individual
`nn.Linear` modules. AutoAWQ skips them, leaving ~80% of the model weights in BF16 and
causing OOM on any device with less than ~65 GB RAM.
This quantization handles both components with a single custom pass:
| Component | Approach |
|---|---|
| All `nn.Linear` (attn, MLP shared expert, router) | `Linear4bit` — nibble uint8, dequantize on forward |
| `mlp.experts` (256 routed experts per layer) | `Darwin35BExpertsInt4` — nibble uint8 3D, dequantize on forward |
| `lm_head`, `in_proj_a/b` | BF16 (kept full precision) |
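The dequantize-on-forward pattern shared by both INT4 module types can be sketched in PyTorch as follows. The class name and buffer layout here are illustrative; the actual `Linear4bit` / `Darwin35BExpertsInt4` implementations ship with the repository's remote code:

```python
import torch
import torch.nn as nn

class Linear4bitSketch(nn.Module):
    """Illustrative dequantize-on-forward INT4 linear layer.

    Stores weights as packed nibbles [out, in//2] plus per-group
    scales [n_groups, out]; each forward unpacks to BF16 and runs a
    plain matmul, so no custom CUDA kernel is required.
    """
    def __init__(self, packed, scales, group_size=128):
        super().__init__()
        self.register_buffer("packed", packed)   # uint8 [out, in//2]
        self.register_buffer("scales", scales)   # [n_groups, out]
        self.group_size = group_size

    def dequantize(self):
        lo = (self.packed & 0x0F).to(torch.int8)
        hi = (self.packed >> 4).to(torch.int8)
        # Sign-extend the 4-bit values (low nibble = even column).
        lo = torch.where(lo > 7, lo - 16, lo)
        hi = torch.where(hi > 7, hi - 16, hi)
        q = torch.stack((lo, hi), dim=-1).flatten(1).float()  # [out, in]
        out_f, in_f = q.shape
        g = q.view(out_f, in_f // self.group_size, self.group_size)
        w = g * self.scales.t().unsqueeze(-1)   # apply per-group scales
        return w.view(out_f, in_f).to(torch.bfloat16)

    def forward(self, x):
        return x @ self.dequantize().t().to(x.dtype)
```

Because the weights are unpacked on every call, throughput is bounded by memory bandwidth rather than compute, which is why the card recommends faster formats for production inference.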
Source was the Q8_0 GGUF (not BF16 safetensors), processed layer-by-layer: peak RAM during quantization ~22 GB (one transformer block ~800 MB BF16 at a time).
## How to use
Requires `trust_remote_code=True` — uses the custom `Darwin35BForCausalLMInt4` class. Do NOT use `AutoAWQForCausalLM.from_quantized()` — this is not AutoAWQ GEMV format.
```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "amaimedia/Darwin-Qwen3.5-35B-A3B-Opus-AWQ-INT4-NOESIS"
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    max_memory={0: "5.4GiB", "cpu": "54GiB"},
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)

messages = [{"role": "user", "content": "Explain the Mixture of Experts architecture."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512, do_sample=False)
print(tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
CPU-only inference (no GPU):
```python
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="cpu",
    trust_remote_code=True,
    torch_dtype=torch.bfloat16,
)
```
Note: This format dequantizes weights to BF16 on each forward pass (no dedicated INT4 CUDA kernel). Inference speed is proportional to your CPU/RAM bandwidth. For production fast inference, use the AWQ-INT8 variant (higher quality, larger) or the original GGUF Q8_0 with llama.cpp.
## Thinking mode
Darwin-35B-A3B-Opus supports thinking mode (enabled by default at temperature < 0.7).
Use `<think>` tags or set the generation config to control reasoning:
```python
# Disable thinking (faster, less verbose)
out = model.generate(
    **inputs,
    max_new_tokens=1024,
    temperature=1.0,
    do_sample=True,
)

# Enable extended thinking (default at temperature ≤ 0.6)
out = model.generate(
    **inputs,
    max_new_tokens=4096,
    temperature=0.6,
    do_sample=True,
)
```
## NOESIS context
In NOESIS this model serves as a high-capability reasoning teacher for
Specialists M4-CHAT, M5-CODE, and M6-RESEARCH during knowledge
distillation (`step110` in `extraction_master.py`). Proposed KD weight: w=0.25.
⚠️ KD pipeline note: Darwin-35B-A3B-Opus has `vocab_size=248320` (Qwen3.5 extended vocab including codec and vision tokens), while NOESIS student models use the Qwen3-8B native vocab of 151 936. Logit extraction requires truncating the vocab head to index 151 936 via `purify_logits()` before ensemble aggregation in `build_ensemble_labels.py`.
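A minimal sketch of this truncation step, assuming `purify_logits()` simply slices the vocab head (the real signature in `build_ensemble_labels.py` may differ), plus a soft-target helper showing that a later softmax renormalizes over the shared vocabulary:

```python
import numpy as np

def purify_logits(teacher_logits, student_vocab_size=151_936):
    """Truncate the teacher's vocab head to the student vocabulary.

    The source names purify_logits(); this signature is an assumption.
    Darwin's logits cover the 248 320-token extended vocab; only
    indices below 151 936 exist in the Qwen3-8B student vocab, so the
    trailing codec/vision token logits are sliced off.
    """
    return np.asarray(teacher_logits)[..., :student_vocab_size]

def kd_targets(teacher_logits, student_vocab_size=151_936, temperature=1.0):
    """Soft KD targets: softmax over the truncated head, so probability
    mass renormalizes onto the shared 151 936-token vocabulary."""
    z = purify_logits(teacher_logits, student_vocab_size) / temperature
    z = z - z.max(axis=-1, keepdims=True)   # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)
```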
NOESIS Specialist models:

| ID | Role | Size |
|---|---|---|
| M1 | ASR (150+ langs) | 10B/3B |
| M2 | Dubbing LM (30 langs full) | 10B/3B |
| M3 | TTS + voice cloning | 10B/3B |
| M4 | Chat + creative writing | 10B/3B |
| M5 | Code + math | 10B/3B |
| M6 | Deep research (1M ctx) | 10B/3B |
| M7 | Prompt engineering | 4B/0.8B |
| M8 | Quality control (PRM) | 4B/0.8B |
| M9 | Orchestrator + routing | 4B/0.8B |
## Provenance
A `noesis_provenance.json` file ships alongside the model weights with the full
quantization trace: source GGUF path, NOESIS version, quantization methodology,
group size, and specialist assignment.
## Acknowledgements & citation
Base model: Darwin-35B-A3B-Opus by FINAL-Bench (Darwin V5 evolutionary merge of Qwen3.5-35B-A3B + Claude 4.6 Opus Reasoning Distilled).
```bibtex
@misc{darwin35b_opus,
  title     = {Darwin-35B-A3B-Opus},
  author    = {FINAL-Bench},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus}
}

@misc{darwin35b_opus_gguf,
  title     = {Darwin-35B-A3B-Opus-Q8-GGUF},
  author    = {VIDRAFT},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/FINAL-Bench/Darwin-35B-A3B-Opus-Q8-GGUF}
}
```
Quantization & NOESIS integration:
```bibtex
@misc{noesis_v14,
  title     = {NOESIS v14.7: DHCF-FNO Multilingual Dubbing Platform},
  author    = {Bolotnikov, Ilia},
  year      = {2026},
  publisher = {AMAImedia},
  url       = {https://amaimedia.com}
}
```