Hydra

Hydra is an experimental bounded-residency attention kernel for long-context decode. It keeps sink tokens, recent tokens, and selected older pages resident instead of forcing each decode step to attend over the full KV cache.
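To make "bounded residency" concrete, here is a minimal sketch of which KV positions a single decode step might attend to. It is an illustration only, not Hydra's implementation: the function name, the parameters (`num_sink`, `num_recent`, `page_size`, `selected_pages`), and the example values are all assumptions, and how Hydra actually chooses older pages is not described here.

```python
def resident_token_indices(kv_len, num_sink, num_recent, page_size, selected_pages):
    """Return the sorted KV positions kept resident for one decode step.

    Hypothetical helper: the resident set is the union of the first
    `num_sink` positions, the last `num_recent` positions, and every
    position that falls inside one of the selected older pages.
    """
    sink = set(range(min(num_sink, kv_len)))
    recent = set(range(max(0, kv_len - num_recent), kv_len))
    paged = set()
    for page in selected_pages:
        start = page * page_size
        # Clamp the page to the cache length so a partial last page is safe.
        paged.update(range(start, min(start + page_size, kv_len)))
    return sorted(sink | recent | paged)


# Illustrative numbers only; none of these values come from Hydra itself.
indices = resident_token_indices(
    kv_len=8192, num_sink=4, num_recent=256, page_size=64, selected_pages=[10, 57]
)
print(len(indices), "of 8192 positions resident")  # 388 of 8192 positions resident
```

With these made-up settings the step touches 388 positions rather than 8192, which is the kind of reduction that matters when the full-attention path is memory-bound.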

Source code: https://github.com/newjordan/hydra/tree/main/hf-kernels/hydra

This release is intentionally narrow. It is not a general replacement for full attention, and it does not claim universal speedups or broad quality preservation. The current target is fit and usability for specific long-context inference workloads where the full-attention path is memory-bound.

Usage

After the kernel is published:

import torch
from kernels import get_kernel

hydra = get_kernel("Frosty40/hydra")

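# One decode step: a single query token (Tq = 1) across 32 query heads.
q = torch.randn(1, 32, 1, 128, device="cuda", dtype=torch.bfloat16)
# KV cache of 8192 tokens across 8 KV heads; head dim is 128 throughout.
k = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)
v = torch.randn(1, 8, 8192, 128, device="cuda", dtype=torch.bfloat16)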

out = hydra.hydra(q, k, v)
print(out.shape)

For local development from the public source checkout:

from pathlib import Path
import sys

sys.path.insert(0, str(Path("hf-kernels") / "hydra" / "torch-ext"))
import hydra

readme_example.py uses the local source packet by default so it can run before publication. Set HYDRA_USE_HUB=1 after publication to exercise the Hub-loaded path.

API

hydra.hydra(
    q,
    k,
    v,
    *,
    is_causal=True,
    sliding_window=None,
    policy_layer_idx=None,
    precision="high",
)

Current constraints:

  • CUDA tensors only
  • bf16 q, k, and v
  • shape (B, H, T, D) with D=128
  • causal attention only
  • decode path supports Tq == 1 with arbitrary Tkv
  • prefill path requires T % BLOCK_SIZE == 0
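The constraints above can be checked before calling the kernel. The sketch below is a hypothetical pre-flight helper, not part of the Hydra package: it works on plain metadata (shape tuples plus dtype and device strings) so it runs without a GPU, it assumes the prefill rule applies to the query length, and `block_size=64` in the usage lines is a placeholder because the real BLOCK_SIZE value is not stated here.

```python
def check_hydra_inputs(q_shape, kv_shape, dtype, device, block_size):
    """Collect violations of the documented constraints; an empty list means OK.

    Hypothetical helper. `dtype` and `device` are plain strings such as
    "bfloat16" and "cuda:0", so no torch import is needed.
    """
    problems = []
    if not device.startswith("cuda"):
        problems.append("tensors must be on a CUDA device")
    if dtype != "bfloat16":
        problems.append("q, k, and v must be bf16")
    if len(q_shape) != 4 or len(kv_shape) != 4:
        problems.append("expected 4-D (B, H, T, D) shapes")
        return problems  # the remaining checks need all four dimensions
    if q_shape[3] != 128 or kv_shape[3] != 128:
        problems.append("head dim D must be 128")
    tq = q_shape[2]
    # Assumption: Tq == 1 selects the decode path, anything else is prefill.
    if tq != 1 and tq % block_size != 0:
        problems.append("prefill needs T % BLOCK_SIZE == 0")
    return problems


# Decode-shaped inputs from the usage example pass every check.
decode_ok = check_hydra_inputs((1, 32, 1, 128), (1, 8, 8192, 128), "bfloat16", "cuda", block_size=64)
# A float16 CPU prefill of length 100 trips three checks.
prefill_bad = check_hydra_inputs((1, 32, 100, 128), (1, 8, 100, 128), "float16", "cpu", block_size=64)
print(decode_ok, len(prefill_bad))
```

Returning a list of messages instead of raising on the first failure lets a caller report every violated constraint at once.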

Evidence Boundary

Submission-facing evidence must come from checked artifacts, not prose notes. Treat evidence in three separate scopes:

  • kernel/package validation: tests, CUDA parity logs, kernel-builder logs, and isolated decode benchmarks for this source packet
  • broad Hydra research campaign: capacity, quality, sparse-attention comparison, edge/OOM, diagnostic, and model-family reports from the staging repo
  • exact-model proof-of-concept: checked Qwen/Qwen3.6-35B-A3B-FP8 rows for named GPUs only

The exact-Qwen proof-of-concept appendix in the staging repo is under:

results/raw/qwen3p6_35b_a3b_fp8/
results/reports/QWEN3P6_FP8_EVIDENCE_TABLE.md

Each cited row must include all three:

  • fit/headroom: GPU, context length, memory allocated/reserved, and OOM state
  • quality/correctness: prompt/task ID and generated answer artifact
  • speed/usability: wall time, generated tokens, tokens/sec, and comparison target
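The three-part requirement above lends itself to a mechanical check. The following is a sketch under stated assumptions: the field names are hypothetical and do not reflect the actual schema of the staging-repo artifacts, and the example row holds placeholder values, not measured results.

```python
# Hypothetical field names for each evidence group listed above.
REQUIRED_FIELDS = {
    "fit/headroom": ["gpu", "context_length", "memory_allocated", "memory_reserved", "oom"],
    "quality/correctness": ["task_id", "answer_artifact"],
    "speed/usability": ["wall_time_s", "generated_tokens", "tokens_per_sec", "comparison_target"],
}


def missing_evidence(row):
    """Map each evidence group to the fields a row lacks; empty dict means citable."""
    missing = {}
    for group, fields in REQUIRED_FIELDS.items():
        # `is None` (not truthiness) so that a legitimate oom=False still counts.
        absent = [name for name in fields if row.get(name) is None]
        if absent:
            missing[group] = absent
    return missing


# Placeholder row, not real data: it has fit and speed fields but no quality fields.
example_row = {
    "gpu": "EXAMPLE-GPU", "context_length": 32768,
    "memory_allocated": 1.0, "memory_reserved": 2.0, "oom": False,
    "wall_time_s": 10.0, "generated_tokens": 50, "tokens_per_sec": 5.0,
    "comparison_target": "full attention",
}
gaps = missing_evidence(example_row)
print(gaps)  # {'quality/correctness': ['task_id', 'answer_artifact']}
```

A row with any non-empty result would not qualify for citation under the rule above.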

Do not cite proxy models, loader-only probes, failed dependency checks, or non-matching model runs as Hydra benchmark results. Do not describe the exact-Qwen proof-of-concept subset as the full Hydra validation campaign.

Current Proof-Of-Concept Scope

The current exact-Qwen artifact-backed proof-of-concept scope is:

GPU              Model                     Scope
RTX PRO 6000 WS  Qwen/Qwen3.6-35B-A3B-FP8  32k/80k/160k repeat packet, 160k c96 warm packet, and frontier/headroom sweeps
RTX 3090         Qwen/Qwen3.6-35B-A3B-FP8  2k/3k/4k/6k/8k fit probes and completed 10k/12k/14k edge sweep

The 3090 result should be framed as fit/usability evidence, not a speedup claim. Token rates are slow in the long-context edge rows. The broader Hydra campaign includes additional GPUs, tasks, and comparison lanes outside this exact-model appendix.

Validation Required Before Merge

Minimum gates for source changes:

cd hf-kernels/hydra
python3 -m pytest -q tests
nix run .#ci-test
python3 benchmarks/benchmark_hydra_decode.py --repo .
python3 readme_example.py

Run the CUDA tests on real GPUs. Local syntax checks are not enough for a kernel submission.
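For repeatable runs, the gate commands listed above can be wrapped in a small driver that stops at the first failure. This is a sketch, not a script shipped with the repo: `run_gates` and its `runner` parameter are hypothetical names, and the default `cwd` assumes you start from the repository root.

```python
import shlex
import subprocess

# The four gate commands, copied from the list above.
GATES = [
    "python3 -m pytest -q tests",
    "nix run .#ci-test",
    "python3 benchmarks/benchmark_hydra_decode.py --repo .",
    "python3 readme_example.py",
]


def run_gates(cwd="hf-kernels/hydra", runner=subprocess.run):
    """Run each gate in order; return the first failing command, or None.

    `runner` defaults to subprocess.run and is injectable so the control
    flow can be exercised without actually launching the commands.
    """
    for command in GATES:
        result = runner(shlex.split(command), cwd=cwd)
        if result.returncode != 0:
            return command
    return None
```

Calling `run_gates()` from the repository root on a machine with a real GPU runs all four gates and returns `None` only if every command exits with status 0.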

Benchmark Snapshot

The current 8192-token decode smoke/benchmark matrix is intentionally reported as kernel/package evidence, not as a universal speedup claim.

GPU                      Package smoke decode  HF benchmark mean
RTX 3060                 0.2574 ms             0.3229 ms
RTX 3070                 0.1474 ms             0.2532 ms
RTX 3080                 0.2051 ms             0.3157 ms
RTX 3090                 0.1492 ms             0.3107 ms
RTX 4070 Ti              0.1261 ms             0.2215 ms
RTX 4090                 0.1132 ms             0.2245 ms
A100 SXM4                0.1408 ms             0.2568 ms
RTX PRO 6000 Blackwell   0.1158 ms             0.1371 ms
RTX A6000 builder smoke  0.2166 ms             0.3230 ms

The final kernel-builder gate passed on a Vast RTX A6000 with BUILDER_VARIANT=torch210-cxx11-cu128-x86_64-linux. Results: local pytest, 6 passed; decode smoke, 0.2166 ms/iter; builder pytest, 4 passed and 2 skipped; exit status 0.
