

Embedl Chronos-2 (Quantized for TensorRT)

Deployable INT8-quantized version of amazon/chronos-2, optimized with embedl-deploy for low-latency NVIDIA TensorRT inference on edge GPUs. Two static-context variants ship: ctx=512 for short-history forecasting and ctx=2048 for long-history use cases.

Upstream Model

amazon/chronos-2

Highlights

Per-tensor INT8 activations + per-channel INT8 weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
Drop-in replacement for amazon/chronos-2 inference: same (context, group_ids) → quantile_preds signature; 21 evenly spaced quantile levels with the median at index 10.
Validated on the GIFT-Eval benchmark across 125 task configurations. See Accuracy below.
Two ctx variants so you can pick the latency/history-window trade-off that fits your deployment.

Quick Start

pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512    # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048   # 1.3× faster than FP16 on Orin

The infer_trt.py helper script builds a TensorRT engine from the ONNX on first run (cached as *.engine next to the artifact) and feeds a synthetic seasonal context for demonstration. Replace the context generator with your own series of the right length.
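As a rough illustration of what a synthetic seasonal context can look like, the sketch below builds a trend plus sine seasonality plus noise. The function name `make_seasonal_context`, the period of 24, and the `(1, ctx)` array layout are assumptions made for this example; the generator inside infer_trt.py may differ.

```python
import numpy as np

def make_seasonal_context(ctx: int, period: int = 24, seed: int = 0) -> np.ndarray:
    """Hypothetical demo generator: linear trend + sine seasonality + Gaussian noise."""
    rng = np.random.default_rng(seed)
    t = np.arange(ctx, dtype=np.float64)
    series = (
        10.0
        + 0.01 * t                               # slow upward trend
        + 2.0 * np.sin(2.0 * np.pi * t / period)  # seasonal component
        + rng.normal(0.0, 0.1, ctx)               # small noise
    )
    # Assumed layout: one series per batch row, float32, shape (1, ctx).
    return series.astype(np.float32)[None, :]

context = make_seasonal_context(512)
print(context.shape, context.dtype)
```

To forecast real data, swap this generator for a loader that returns your own series with exactly `ctx` values.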

Inputs must be finite float32 context values. If your source series contains missing values, impute or reject them before TensorRT inference; the example script validates this contract before launching the engine.
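The input contract above can be sketched as a small pre-processing step. This is a minimal example, not the validation code from infer_trt.py: the helper name `prepare_context`, the choice of linear interpolation for imputation, and the `(1, ctx)` output layout are all assumptions.

```python
import numpy as np

def prepare_context(series, ctx: int, impute: bool = False) -> np.ndarray:
    """Return the last `ctx` values as a finite float32 array of shape (1, ctx)."""
    values = np.asarray(series, dtype=np.float64).ravel()
    if values.size < ctx:
        raise ValueError(f"need at least {ctx} values, got {values.size}")
    values = values[-ctx:]  # keep the most recent history window

    finite = np.isfinite(values)
    if not finite.all():
        if not impute:
            raise ValueError("context contains NaN or infinite values")
        if not finite.any():
            raise ValueError("context has no finite values to impute from")
        idx = np.arange(ctx)
        # Linear interpolation between valid neighbours; gaps at either
        # end are filled with the nearest valid value.
        values = np.interp(idx, idx[finite], values[finite])

    out = values.astype(np.float32)
    if not np.isfinite(out).all():
        raise ValueError("values overflow float32")
    return out[None, :]  # assumed engine layout: (1, ctx)

raw = [float(i) for i in range(600)]
raw[-3] = float("nan")
context = prepare_context(raw, 512, impute=True)
print(context.shape, context.dtype)
```

Rejecting by default and imputing only on request keeps silent data repair out of the inference path; whether interpolation is appropriate depends on the series.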

Files

File	Purpose
`embedl_chronos_2_ctx512_int8.onnx`	INT8 ONNX with Q/DQ — ctx=512, 1024-step horizon.
`embedl_chronos_2_ctx2048_int8.onnx`	INT8 ONNX with Q/DQ — ctx=2048, 1024-step horizon.
`infer_trt.py`	ONNX Runtime / TensorRT inference example.

Both artifacts emit a (1, 21, 1024) quantile tensor: 21 quantile levels, each covering 64 output patches × 16 steps per patch = 1024 horizon steps. Slice the median (preds[0, 10]) for a point forecast and truncate it to the prediction length you need.
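The slicing step can be sketched with NumPy as follows. The array `preds` here is a synthetic stand-in for the engine output, and `point_forecast` is a hypothetical helper name; only the shape and the median index come from the description above.

```python
import numpy as np

NUM_QUANTILES = 21
NUM_PATCHES = 64
PATCH_LEN = 16
HORIZON = NUM_PATCHES * PATCH_LEN  # 64 * 16 = 1024 steps
MEDIAN_INDEX = 10                  # middle of the 21 quantile rows

def point_forecast(preds: np.ndarray, prediction_length: int) -> np.ndarray:
    """Take the median row and keep only the first `prediction_length` steps."""
    if preds.shape != (1, NUM_QUANTILES, HORIZON):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 1 <= prediction_length <= HORIZON:
        raise ValueError(f"prediction_length must be in [1, {HORIZON}]")
    return preds[0, MEDIAN_INDEX, :prediction_length]

# Synthetic stand-in for the engine output, used only to show the indexing.
preds = np.arange(NUM_QUANTILES * HORIZON, dtype=np.float32).reshape(1, NUM_QUANTILES, HORIZON)
forecast = point_forecast(preds, 96)
print(forecast.shape)
```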

Performance

Latency measured with TensorRT + trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked (nvpmodel -m 0 && jetson_clocks on Jetson).
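For readers who want to run a comparable measurement, the sketch below assembles a trtexec command line with the options named above. The precision flags (`--int8 --fp16`) and the engine file name are assumptions; the exact invocation behind the published numbers is not stated here. The script only prints the command and does not execute it.

```python
import shlex
from typing import List

def trtexec_command(onnx_path: str, engine_path: str) -> List[str]:
    """Build a trtexec argument list for a GPU-compute-only latency run."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        f"--saveEngine={engine_path}",
        "--int8",             # assumption: enable INT8 kernels for the Q/DQ graph
        "--fp16",             # assumption: allow FP16 for non-quantized layers
        "--noDataTransfers",  # exclude host<->device copies from timing
        "--useCudaGraph",     # CUDA Graph enabled
        "--useSpinWait",      # spin-wait synchronization
    ]

cmd = trtexec_command(
    "embedl_chronos_2_ctx512_int8.onnx",
    "embedl_chronos_2_ctx512_int8.engine",  # hypothetical engine file name
)
print(shlex.join(cmd))
```

On Jetson, lock the clocks first (nvpmodel -m 0 && jetson_clocks), as noted above, so that runs are comparable.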

Jetson AGX Orin (MAXN)

ctx=512

Chronos-2 INT8 latency, ctx=512

Build	Mean latency (ms)
TensorRT FP16	2.977
TensorRT `--best`	2.974
embedl INT8	2.432
Speedup (FP16 → embedl INT8)	1.22×

ctx=2048

Chronos-2 INT8 latency, ctx=2048

Build	Mean latency (ms)
TensorRT FP16	4.482
TensorRT `--best`	4.482
embedl INT8	3.482
Speedup (FP16 → embedl INT8)	1.29×
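The speedup rows are simply the FP16 latency divided by the INT8 latency. The short check below recomputes them from the two tables and also reports the absolute time saved per inference; the dictionary layout is just a convenience for this example.

```python
# Mean latencies in milliseconds, copied from the two tables above.
latency_ms = {
    512: {"fp16": 2.977, "int8": 2.432},
    2048: {"fp16": 4.482, "int8": 3.482},
}

speedups = {}
for ctx, row in latency_ms.items():
    speedups[ctx] = row["fp16"] / row["int8"]
    saved_ms = row["fp16"] - row["int8"]
    print(f"ctx={ctx}: {speedups[ctx]:.2f}x speedup, {saved_ms:.3f} ms saved per inference")
```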

Accuracy

Evaluated on the GIFT-Eval benchmark — 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) reported using the TIME-paper normalization: geomean of per-task ratio against the Seasonal-Naive baseline.

Metric	FP32 baseline	embedl INT8 ctx=512	embedl INT8 ctx=2048
Geomean WQL / Seasonal-Naive	0.549	0.634	0.618
Geomean WQL / FP32	1.000	1.156×	1.126×
Median WQL / FP32	1.000	1.074×	1.045×
Cells within 10 % of FP32	—	71 / 125 (57 %)	79 / 125 (63 %)
Cells within 20 % of FP32	—	96 / 125 (77 %)	98 / 125 (78 %)
Cells beating FP32	—	14 / 125	19 / 125
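The aggregate rows are geometric means of per-task ratios. The sketch below shows that aggregation on made-up per-task WQL values (they are not GIFT-Eval results), and also why the "/ FP32" row equals the quotient of the two "/ Seasonal-Naive" geomeans. The function name `geomean_ratio` is hypothetical.

```python
import math
import numpy as np

def geomean_ratio(model_wql, baseline_wql) -> float:
    """Geometric mean of per-task model/baseline WQL ratios."""
    model = np.asarray(model_wql, dtype=np.float64)
    base = np.asarray(baseline_wql, dtype=np.float64)
    if model.shape != base.shape:
        raise ValueError("model and baseline must cover the same tasks")
    if (model <= 0).any() or (base <= 0).any():
        raise ValueError("WQL values must be positive")
    return float(np.exp(np.mean(np.log(model / base))))

# Made-up per-task WQL values for four tasks, for illustration only.
naive = [0.20, 0.35, 0.10, 0.50]
fp32 = [0.10, 0.21, 0.06, 0.25]
int8 = [0.105, 0.24, 0.058, 0.30]

int8_vs_naive = geomean_ratio(int8, naive)
fp32_vs_naive = geomean_ratio(fp32, naive)
int8_vs_fp32 = geomean_ratio(int8, fp32)

# The geomean of ratios factorizes, so both routes give the same number.
print(int8_vs_fp32, int8_vs_naive / fp32_vs_naive)
```

A geometric mean keeps one task with a very large or very small ratio from dominating the aggregate, which is one reason to prefer it over an arithmetic mean for per-task ratios.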

How to read the headline number. Geomean WQL/S-Naive values of 0.634 (ctx=512) and 0.618 (ctx=2048) mean the INT8 models retain the bulk of chronos-2's skill margin over the no-model Seasonal-Naive baseline. The FP32 model sits at 0.549 by the same convention; the INT8 versions' geomean WQL is roughly 16 % (ctx=512) and 13 % (ctx=2048) higher than FP32, which moves them closer to S-Naive, but they still convincingly beat it on the geomean.
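The headline figures can be recomputed from the rounded table values. The sketch below derives the relative WQL increase over FP32 and the share of the FP32 skill margin (distance below 1.0, the Seasonal-Naive level) that each INT8 variant keeps. "Skill margin retained" is a derived quantity introduced here for illustration, and results can differ from the table in the last digit because the inputs are already rounded.

```python
FP32 = 0.549                       # geomean WQL / Seasonal-Naive, FP32 baseline
INT8 = {512: 0.634, 2048: 0.618}   # same metric for the two INT8 variants

summary = {}
for ctx, wql in INT8.items():
    increase_pct = (wql / FP32 - 1.0) * 100.0         # WQL increase relative to FP32
    retained_pct = (1.0 - wql) / (1.0 - FP32) * 100.0  # share of FP32's margin kept
    summary[ctx] = (increase_pct, retained_pct)
    print(f"ctx={ctx}: WQL +{increase_pct:.1f} % vs FP32, "
          f"{retained_pct:.0f} % of the FP32 skill margin retained")
```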

Where the regression concentrates. Worst-case cells are out-of-distribution low-frequency series (us_births/M, m4_hourly/{medium,long}) and high-frequency long-horizon forecasts (solar/10T/{medium,long}). The full per-task CSVs ship with the artifacts; check them before deploying to a domain that resembles those outliers.

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. The same workflow applies to your own models — see the documentation for installation and usage.

License

Component	License
Optimized model artifacts (this repo)	Embedl Models Community Licence v1.0 — no redistribution as a hosted service
Upstream architecture and weights	Amazon Chronos-2 License

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord.

Quantization gotchas, hardware questions, fine-tuning tips — bring them all.

