Embedl Chronos-2 (Quantized for TensorRT)
Deployable INT8-quantized version of
amazon/chronos-2,
optimized with
embedl-deploy for
low-latency NVIDIA TensorRT inference on edge GPUs. Two
static-context variants ship: ctx=512 for short-history
forecasting and ctx=2048 for long-history use cases.
Upstream Model
Highlights
- Per-tensor INT8 activations + per-channel INT8 weights via embedl-deploy's PTQ flow on top of TensorRT's fused MHA kernel. No QAT or distillation needed.
- Drop-in replacement for amazon/chronos-2 inference: same (context, group_ids) → quantile_preds signature; 21 evenly spaced quantile levels with the median at index 10.
- Validated on the GIFT-Eval benchmark across 125 task configurations. See Accuracy below.
- Two ctx variants so you can pick the latency/history-window trade-off that fits your deployment.
Quick Start
pip install tensorrt pycuda numpy
python infer_trt.py --ctx 512 # 1.2× faster than FP16 on Orin
python infer_trt.py --ctx 2048 # 1.3× faster than FP16 on Orin
The infer_trt.py helper script builds a TensorRT engine from the
ONNX on first run (cached as *.engine next to the artifact) and
feeds a synthetic seasonal context for demonstration. Replace the
context generator with your own series of the right length.
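As a rough illustration of what a synthetic seasonal context can look like, the sketch below builds a sine wave plus a slow trend and some noise at the required length. The function name make_seasonal_context, the period, and the noise level are assumptions chosen for illustration; they are not taken from infer_trt.py.

```python
import numpy as np


def make_seasonal_context(ctx_len: int, period: int = 24, seed: int = 0) -> np.ndarray:
    """Return a float32 series of length ctx_len with one seasonal component.

    Hypothetical helper: the period, trend slope, and noise scale are
    arbitrary demo values, not the ones used by infer_trt.py.
    """
    rng = np.random.default_rng(seed)
    t = np.arange(ctx_len)
    seasonal = np.sin(2.0 * np.pi * t / period)   # repeating pattern
    trend = 0.001 * t                             # slow upward drift
    noise = 0.05 * rng.standard_normal(ctx_len)   # small Gaussian jitter
    return (seasonal + trend + noise).astype(np.float32)


context = make_seasonal_context(512)
print(context.shape, context.dtype)
```

Swapping in a real series means producing an array of the same length and dtype as the variant you run (512 or 2048 points).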
Inputs must be finite float32 context values. If your source series
contains missing values, impute or reject them before TensorRT
inference; the example script validates this contract before launching
the engine.
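A minimal sketch of that input contract is shown below, assuming a single 1-D series per call. The helper names prepare_context and interpolate_missing are hypothetical, keeping only the most recent ctx_len points is an assumption about how to fit a longer history into a static-context engine, and linear interpolation is just one possible imputation choice.

```python
import numpy as np


def interpolate_missing(series) -> np.ndarray:
    """Fill NaN/inf gaps by linear interpolation over the index.

    Hypothetical helper: one simple imputation strategy among many.
    """
    arr = np.asarray(series, dtype=np.float64)
    mask = np.isfinite(arr)
    if not mask.any():
        raise ValueError("series has no finite values to interpolate from")
    idx = np.arange(arr.size)
    return np.interp(idx, idx[mask], arr[mask]).astype(np.float32)


def prepare_context(series, ctx_len: int) -> np.ndarray:
    """Keep the last ctx_len points as float32 and reject non-finite values."""
    arr = np.asarray(series, dtype=np.float32)
    if arr.ndim != 1:
        raise ValueError(f"expected a 1-D series, got shape {arr.shape}")
    if arr.size < ctx_len:
        raise ValueError(f"need at least {ctx_len} points, got {arr.size}")
    arr = arr[-ctx_len:]  # assumption: use the most recent history
    if not np.isfinite(arr).all():
        raise ValueError("context contains NaN or inf; impute or reject it first")
    return arr


raw = [1.0, float("nan"), 3.0, 4.0]
filled = interpolate_missing(raw)
context = prepare_context(filled, ctx_len=4)
```

Rejecting bad input on the host is cheaper than debugging NaN-filled quantiles after the engine has run.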
Files
| File | Purpose |
|---|---|
| embedl_chronos_2_ctx512_int8.onnx | INT8 ONNX with Q/DQ, ctx=512, 1024-step horizon. |
| embedl_chronos_2_ctx2048_int8.onnx | INT8 ONNX with Q/DQ, ctx=2048, 1024-step horizon. |
| infer_trt.py | ONNX Runtime / TensorRT inference example. |
Both artifacts emit a (1, 21, 1024) quantile tensor: 21 quantile
levels, each covering 64 output patches × 16 steps per patch = 1024
horizon steps. Slice the median (preds[0, 10]) for a point forecast and
truncate it to the prediction length you need.
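The slicing described above can be sketched with NumPy as follows. The function name point_forecast and the stand-in preds array are illustrative assumptions; in practice preds would be the engine output copied back to the host.

```python
import numpy as np

NUM_QUANTILES = 21   # quantile levels emitted by the model
HORIZON = 1024       # 64 output patches x 16 steps per patch
MEDIAN_INDEX = 10    # middle of the 21 quantile levels


def point_forecast(preds: np.ndarray, prediction_length: int) -> np.ndarray:
    """Return the median forecast truncated to prediction_length steps."""
    if preds.shape != (1, NUM_QUANTILES, HORIZON):
        raise ValueError(f"unexpected output shape {preds.shape}")
    if not 1 <= prediction_length <= HORIZON:
        raise ValueError(f"prediction_length must be in [1, {HORIZON}]")
    return preds[0, MEDIAN_INDEX, :prediction_length]


# Stand-in output: every step of quantile row i holds the value i.
preds = np.tile(
    np.arange(NUM_QUANTILES, dtype=np.float32)[None, :, None], (1, 1, HORIZON)
)
forecast = point_forecast(preds, prediction_length=96)
print(forecast.shape)
```

The same indexing with a different row gives the other quantiles, for example a lower and upper row to form a prediction interval.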
Performance
Latency measured with TensorRT + trtexec, GPU compute time only
(--noDataTransfers), CUDA Graph + Spin Wait enabled, clocks locked
(nvpmodel -m 0 && jetson_clocks on Jetson).
Jetson AGX Orin (MAXN)
ctx=512
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | 2.977 |
| TensorRT --best | 2.974 |
| embedl INT8 | 2.432 |
| Speedup (FP16 → embedl INT8) | 1.22× |
ctx=2048
| Build | Mean latency (ms) |
|---|---|
| TensorRT FP16 | 4.482 |
| TensorRT --best | 4.482 |
| embedl INT8 | 3.482 |
| Speedup (FP16 → embedl INT8) | 1.29× |
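The speedup rows are the ratio of FP16 to INT8 mean latency. The short check below reproduces them from the figures in the two tables; the dictionary layout is only an illustrative choice.

```python
# Mean latencies in milliseconds, copied from the tables above.
latency_ms = {
    512: {"fp16": 2.977, "int8": 2.432},
    2048: {"fp16": 4.482, "int8": 3.482},
}

# Speedup = FP16 latency / INT8 latency, rounded to two decimals.
speedup = {
    ctx: round(row["fp16"] / row["int8"], 2) for ctx, row in latency_ms.items()
}
print(speedup)  # {512: 1.22, 2048: 1.29}
```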
Accuracy
Evaluated on the GIFT-Eval benchmark: 125 task configurations spanning 50 datasets × {short, medium, long} horizons. Aggregate WQL (weighted quantile loss, lower is better) reported using the TIME-paper normalization: geomean of per-task ratio against the Seasonal-Naive baseline.
| Metric | FP32 baseline | embedl INT8 ctx=512 | embedl INT8 ctx=2048 |
|---|---|---|---|
| Geomean WQL / Seasonal-Naive | 0.549 | 0.634 | 0.618 |
| Geomean WQL / FP32 | 1.000 | 1.156× | 1.126× |
| Median WQL / FP32 | 1.000 | 1.074× | 1.045× |
| Cells within 10 % of FP32 | n/a | 71 / 125 (57 %) | 79 / 125 (63 %) |
| Cells within 20 % of FP32 | n/a | 96 / 125 (77 %) | 98 / 125 (78 %) |
| Cells beating FP32 | n/a | 14 / 125 | 19 / 125 |
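For readers who want to recompute these aggregates from the per-task results, the sketch below shows one way to do it. The per-task WQL values are made-up placeholders, the function names are hypothetical, and treating "within 10 %" as a ratio of at most 1.10 is an assumption about how the cells were counted.

```python
import numpy as np


def geomean_ratio(model_wql, reference_wql) -> float:
    """Geometric mean of the per-task ratio model / reference."""
    ratios = np.asarray(model_wql, dtype=np.float64) / np.asarray(
        reference_wql, dtype=np.float64
    )
    return float(np.exp(np.mean(np.log(ratios))))


def count_within(model_wql, reference_wql, tolerance: float) -> int:
    """Count tasks whose ratio model / reference is at most 1 + tolerance."""
    ratios = np.asarray(model_wql, dtype=np.float64) / np.asarray(
        reference_wql, dtype=np.float64
    )
    return int(np.sum(ratios <= 1.0 + tolerance))


# Placeholder per-task WQL values (three tasks), NOT benchmark results.
naive_wql = [0.10, 0.20, 0.40]
fp32_wql = [0.05, 0.10, 0.20]
int8_wql = [0.05, 0.115, 0.25]

print(geomean_ratio(fp32_wql, naive_wql))       # FP32 vs Seasonal-Naive
print(geomean_ratio(int8_wql, fp32_wql))        # INT8 vs FP32
print(count_within(int8_wql, fp32_wql, 0.10))   # tasks within 10 % of FP32
print(count_within(int8_wql, fp32_wql, 0.20))   # tasks within 20 % of FP32
```

A geometric mean is used rather than an arithmetic one so that a task where the ratio doubles and a task where it halves cancel out.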
How to read the headline number. A geomean WQL/S-Naive of 0.634
(ctx=512) and 0.618 (ctx=2048) means the INT8 model retains the
bulk of chronos-2's skill margin over the no-model Seasonal-Naive
baseline. The FP32 model sits at 0.549 by the same convention; the
INT8 ratios are about 13 % (ctx=2048) and 16 % (ctx=512) higher than
FP32, so closer to S-Naive, but both still clearly beat it on the
geomean.
Where the regression concentrates. Worst-case cells are
out-of-distribution low-frequency series (us_births/M,
m4_hourly/{medium,long}) and high-frequency long-horizon
forecasts (solar/10T/{medium,long}). The full per-task CSVs
ship with the artifacts; check them before deploying to a domain
that resembles those outliers.
Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. The same workflow applies to your own models; see the documentation for installation and usage.
License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | Amazon Chronos-2 License |
Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.