postprocess at model res, defer resize+write to CPU (saves ~35s GPU) f4a7288 Nekochu committed on 30 days ago
add blue 1024 ONNX (FP16 on disk, FP32 at runtime), rename models 646f0cd Nekochu committed on 30 days ago
safetensors loading, Phase 0 4x faster (uint8), total time in status 33e616b Nekochu committed on 30 days ago
quality: lower clean_matte threshold 0.25→0.02, always keep largest component 1363975 Nekochu committed on 30 days ago
cleanup: stale comments, dead import, redundant makedirs, fix batch size in UI a2a7a3e Nekochu committed on 30 days ago
simplify: merge write functions, fix missing Processed output, bulk transfer 9d23c67 Nekochu committed on 30 days ago
remove dead code: AOTI export, inductor/triton cache, shared_results, deferred write 2a4471f Nekochu committed on 30 days ago
disable torch.compile on ZeroGPU — net negative for GreenFormer f4a2965 Nekochu committed on 30 days ago
fix: reduce-overhead instead of max-autotune (118s→~30s), dedicated export endpoint c53eb28 Nekochu committed on about 1 month ago
enable triton+CUDA graphs via torch.compile(max-autotune) f1d6e7f Nekochu committed on about 1 month ago
fix README: accurate torch.compile description, no triton/AOTI claim cdef1d9 Nekochu committed on about 1 month ago
update README: fix batch size, add GPU optimization docs 69eef34 Nekochu committed on about 1 month ago
fix randperm overflow at 4K: per-frame connected components c9b234a Nekochu committed on about 1 month ago
GPU postprocessing pipeline + TF32 + conditional torch.compile 8ea0c8b Nekochu committed on about 1 month ago
add ZeroGPU GPU inference (FP16, flash-attn, batch=32@1024/16@2048) 0b6961f Nekochu committed on Mar 25