
	# OmniMorph Multi-Node XPU Training Progress

	## Active Jobs (2026-03-28)

	### Job 26457072: 8-node production training (RUNNING)
	- Date: 2026-03-28
	- Status: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime
	- Script: `bash_train_multi_nodes.sh` (srun timeout reduced to 1.5h)
	- Config: `all_om_net`, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice
	- Data: 14,814 diffusion + 2,583 registration (full real data)
	- Steps/epoch: 41 (14,814 samples / (batch size 2 × 64 tiles), then divided by `DIFF_REG_BATCH_RATIO`), ~108 s/step → ~1h14m per epoch
	- Changes from previous job:
	  - Contrastive `clip_grad_norm` max_norm: 0.02 → 1e-3 (avoid contrastive dominating training)
	  - Registration activation threshold: -0.6 → -0.7 (stricter gate)
	  - `dist.all_reduce` → `dist.broadcast(src=0)` for NaN sync + registration gate (CCL hang workaround)
	  - srun timeout: 7200s → 5400s (1.5h, reduces CCL hang waste from 46min to ~16min)
	  - Excluded pvc-s-135 (hardware failure: "No XPU devices are available")
	- Logs: `Logs/train_multi_26457072.out`
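
The timing figures in this entry can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes the hang begins right after the first epoch finishes, so the wasted time is simply the `srun` timeout minus the epoch duration. The function names are illustrative and not taken from the training scripts.

```python
STEPS_PER_EPOCH = 41      # from the job description
SECONDS_PER_STEP = 108    # approximate, from the job description


def epoch_seconds(steps=STEPS_PER_EPOCH, sec_per_step=SECONDS_PER_STEP):
    """Wall-clock duration of one epoch in seconds."""
    return steps * sec_per_step


def hang_waste_minutes(timeout_s):
    """Minutes spent hung before the srun timeout fires (assumes the hang
    starts as soon as the first epoch completes)."""
    return (timeout_s - epoch_seconds()) / 60


print(epoch_seconds())                      # 4428 s, i.e. about 1h14m
print(round(hang_waste_minutes(7200), 1))   # old 2h timeout
print(round(hang_waste_minutes(5400), 1))   # new 1.5h timeout
```

Under that assumption the old timeout wastes roughly 46 minutes per restart and the new one roughly 16, matching the numbers quoted above.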

	### CCL Epoch-Boundary Hang (ongoing issue)
	- Pattern: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op.
	- Workaround: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h.
	- Root cause: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context.
	- Impact: ~16min wasted per epoch (hang time before timeout), but training progresses reliably.
	- TODO: Investigate reinitializing CCL mid-training (`destroy_process_group` + `init_process_group`) to avoid the restart overhead.

	### Single-node CCL hang (unresolved)
	- Single-node job (26446050) hung at step 1 on explicit `dist.all_reduce` calls.
	- Changed to `dist.broadcast(src=0)` but never verified (cancelled when multi-node started).
	- Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch).
	- TODO: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport.

	### Checkpoint status
	- Latest: `000036_all_om_net.pth` (Mar 28, 2.9 GB)
	- Total: epochs 3-36 in `Models/all_om_net/` (85+ GB)
	- CUDA copies: epochs 0,1,2,8,11 in `Models/all_om_net/cuda_ckpts/`
	- Older model: epochs 0-10 in `Models/all_recmulmodmutattnnet/` (pre-om_net migration)

	### Loss history (real data, 64 XPU tiles)
	| Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf |
	|-------|-----|------|-------|-------------|----------------|--------|--------|-----|
	| 3-7 | -0.02→-0.10 | 1.50→1.02 | n/a | 9.2e-4 | 0.0 | n/a | n/a | n/a |
	| 8-11 | n/a | n/a | n/a | n/a | n/a | n/a | n/a | n/a (CUDA) |
	| 31 | n/a | n/a | n/a | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 |
	| 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 |

	Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37).

	---

	### Previous Job 25899265: Production training (v3) — COMPLETED (historical)
	- Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to node path issue)
	- Validated memory stability and CCL cache fix, but loss data not meaningful

	### Previous Strategy A — Job 25898957: CRASH-LOOPED (torch.load bug)
	- Ran 1 good iteration (epoch 5 step 31→epoch 6 step 9), then crash-looped 63 times
	- Root cause: the checkpoint stored `np.random.get_state()`, which contains numpy arrays; `torch.load` with `weights_only=True` (the default since PyTorch 2.6) refuses to unpickle numpy globals
	- Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started.

	### Previous Strategy A — Job 25898957: Production training (v2, all audit fixes)
	- Date: 2026-03-23
	- Status: RUNNING at the time of this entry (epoch 6 in progress); the job later crash-looped, as recorded in the entry above
	- Script: `bash_train_multi_nodes.sh`
	- Config: `Config/config_om.yaml` (om_net, img_size=128, batchsize=2, device=xpu)
	- Resources: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task
	- Walltime: 36:00:00
	- Progress (as of 2026-03-23 02:45):
	  - Completed epoch 5 (full, no restart needed), checkpoint `000005_all_om_net.pth` saved
	  - Epoch 6 step 2 in progress, memory stable at ~46 GiB free
	  - Zero restarts triggered: the leak rate of ~0.07 GiB/step allows full epochs without OOM
	- Fixes applied (10 bugs found across 4 independent audits):
	  1. (Critical) Optimizer on all DDP ranks: all ranks load checkpoint
	  2. (Critical) XPU device RNG saved/restored: `torch.xpu.get_rng_state()` in checkpoint
	  3. (Critical) RNG restored after DataLoader skip: prevents `__getitem__` corruption
	  4. (Critical) `loss_nan_step` not overwritten: guarded with else branch
	  5. (Critical) Off-by-one in step skip: `step <= initial_step` with `initial_step > 0` guard
	  6. (Low) tmp/ cleanup: DDP race + stale checkpoint fixes
	  7. (Medium) Per-rank RNG divergence: non-rank-0 re-seeded (CPU + XPU device RNG)
	  8. (Critical) Step 0 skipped on fresh start: `initial_step > 0` guard added
	  9. (Config) Timeout: `SRUN_TIMEOUT` 2400 → 7200
	  10. (Low) `total_reg` division by zero: `max(total_reg, 1)` guard
	- Leak rate: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv)
	- Logs: `Logs/train_multi_25898957.out`
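
Fixes 5 and 8 interact, so a small sketch of the step-skip condition may help. It is a minimal stand-in, not the code from `OM_train_3modes.py`: the generator name `steps_to_run` is hypothetical, and `initial_step` is assumed to be the last step stored in the mid-epoch checkpoint, with 0 meaning a fresh start.

```python
def steps_to_run(num_steps, initial_step):
    """Yield the step indices that still need training after a resume.

    initial_step: last step already covered by the checkpoint, or 0 when
    starting fresh (assumption of this sketch).
    """
    for step in range(num_steps):
        # Skip only when resuming; without the `initial_step > 0` guard a
        # fresh start would wrongly skip step 0.
        if initial_step > 0 and step <= initial_step:
            continue  # the real loop would still advance the DataLoader here
        yield step


fresh = list(steps_to_run(41, 0))
resumed = list(steps_to_run(41, 30))
print(fresh[0], resumed[0])  # 0 31
```

One limitation of this form is that a checkpoint taken exactly at step 0 would be indistinguishable from a fresh start; with mid-epoch saves every 10 steps that case does not arise.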

	### Job 25899049: CANCELLED (checkpoint conflict)
	- Submitted as continuation but got nodes while 25898957 was still running. Both would write to same `Models/all_om_net/`. Cancelled before training started.

	### Previous Strategy A — Job 25898349: CANCELLED (had 6 bugs)
	- Ran: 4 restart iterations, reached epoch 5 step 33
	- Issues: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6) `loss_nan_step` reset after restore
	- Checkpoints saved: `000004_all_om_net.pth` (epoch 4), `000005_step0030_all_om_net.pth` (epoch 5 step 30)
	- Memory: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step)

	### Strategy B — Job 25916258: Full validation (v4, CCL cache fix)
	- Date: 2026-03-23
	- Status: RUNNING
	- Resources: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime
	- Fix: `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` (was default 1000, caused driver segfault at ~400 steps)
	- Goal: Survive full 2h walltime without crash — validates both UpsampleConv fix (no OOM) and CCL cache fix (no segfault)
	- Logs: `Logs/stratB_25916258.out`

	### Previous Strategy B — Job 25899266 (v3, segfault at epoch 74)
	- 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB.
	- Crashed at epoch 74 step 0: CCL IPC handle cache hit 1000 limit → driver segfault (`drm_neo.cpp:288`). Fixed with `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000`.
	- Logs: `Logs/stratB_25899266.out`

	### Previous Strategy B — Job 25898356 (crashed on logging bug)
	- Ran 6 steps, leak ~0.375 GiB/step. `ZeroDivisionError` on `total_reg=0` (now fixed).

	### Previous Job 25892717 — CANCELLED (was hung)
	- Status: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at `loss_contra.backward()`). `srun` hung because the CCL processes did not exit, so auto-resubmit never triggered. More than 4 hours of walltime were wasted idle.
	- Last checkpoint: `Models/all_om_net/tmp/000004_step0020_all_om_net.pth`
	- Lesson: `--kill-on-bad-exit=1` is not reliable for CCL cleanup. Strategy A's `timeout` wrapper solves this.

	---

	## Implementation Details

	### Strategy A: Proactive Restart (backup approach)

	Problem: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via `sbatch` failed because `srun` hangs after CCL rank crash — the bash script never reaches the resubmit logic.

	Solution: Proactive exit + restart loop within the same SLURM allocation.

	Changes to `OM_train_3modes.py` (training behavior unchanged):
	- Added `EXIT_CODE_RESTART = 42` constant
	- Added `--max-steps-before-restart N` CLI argument (default 0 = disabled)
	- Added `steps_since_start` counter in the training loop
	- After N steps: saves mid-epoch checkpoint → `dist.barrier()` → `dist.destroy_process_group()` → `sys.exit(42)`
	- `SystemExit(42)` is NOT caught by `except Exception`, because `SystemExit` derives from `BaseException`, so it propagates cleanly
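
The proactive-exit behaviour can be illustrated without any XPU code. The sketch below is an assumption-laden stand-in for the real training loop: `run_with_restart_budget` is a hypothetical name, the training step is a no-op, and the checkpoint save and process-group teardown are only indicated by a comment.

```python
import sys

EXIT_CODE_RESTART = 42


def run_with_restart_budget(total_steps, max_steps_before_restart=0):
    """Run up to total_steps; exit with code 42 once the budget is used up.

    A budget of 0 disables the proactive restart, mirroring the CLI default.
    """
    steps_since_start = 0
    try:
        for _step in range(total_steps):
            steps_since_start += 1  # stand-in for one real training step
            if 0 < max_steps_before_restart <= steps_since_start:
                # Real script: save mid-epoch checkpoint, barrier,
                # destroy the process group, then exit.
                sys.exit(EXIT_CODE_RESTART)
    except Exception as err:
        # SystemExit is a BaseException subclass, so it never lands here.
        print(f"training error: {err}")
        return 1
    return 0
```

Calling `run_with_restart_budget(100, 20)` raises `SystemExit` with code 42 after 20 steps even though the loop sits inside a broad `except Exception` handler, which is the property the restart loop relies on.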

	Changes to `bash_train_multi_nodes.sh`:
	- Replaced single `srun` call with a `while` loop (up to 500 iterations)
	- Each `srun` wrapped with `timeout 2400` (40 min) to catch CCL hangs (later raised to 7200 s, then lowered to 5400 s for job 26457072)
	- Exit code handling:
	  - `0` → training complete, break
	  - `42` → proactive restart, 5s pause, re-launch
	  - `124` → timeout (CCL hang), 10s pause, re-launch
	  - Other → crash (OOM etc.), 10s pause, re-launch from checkpoint
	- Passes `--max-steps-before-restart 20` to Python
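
The real loop lives in bash, but its control flow can be sketched in Python for readability. Everything here is illustrative: `classify_exit` and `restart_loop` are hypothetical names, and `launch` is an injected callable that stands in for the `timeout`-wrapped `srun` call (GNU `timeout` reports 124 when the time limit is hit).

```python
import time

MAX_ITERATIONS = 500  # loop budget stated above


def classify_exit(code):
    """Map an exit code to (action, pause_seconds) per the list above."""
    if code == 0:
        return ("done", 0)
    if code == 42:
        return ("proactive restart", 5)
    if code == 124:
        return ("timeout", 10)
    return ("crash", 10)


def restart_loop(launch, sleep=time.sleep, max_iterations=MAX_ITERATIONS):
    """Call launch() until it returns 0 or the iteration budget runs out.

    Returns the list of exit codes seen, in order.
    """
    history = []
    for _ in range(max_iterations):
        code = launch()
        history.append(code)
        action, pause = classify_exit(code)
        if action == "done":
            break
        sleep(pause)  # brief pause before re-launching from the checkpoint
    return history


# Dry run with a scripted launcher and no real sleeping.
codes = iter([42, 124, 1, 0])
print(restart_loop(lambda: next(codes), sleep=lambda s: None))  # [42, 124, 1, 0]
```

Injecting `launch` and `sleep` keeps the sketch testable without a SLURM allocation; a real driver would pass a function that runs the wrapped `srun` command and returns its exit status.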

	Training behavior: Identical to original. Two independent audits verified that:
	- Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored
	- Off-by-one fixed: `step <= initial_step` correctly skips the checkpointed step
	- RNG restored AFTER DataLoader skip loop (not before) to avoid `__getitem__` corruption
	- `loss_nan_step`, `total_reg`, and all 9 epoch loss accumulators preserved across restarts
	- All DDP ranks load optimizer state (not just rank 0)

	### Strategy B: Leak Rate Diagnostic

	Problem: Need to know whether existing mitigations (gradient checkpointing, `UpsampleConv`, `empty_cache`) reduce the XPU leak rate, or if the leak is fundamental to all backward ops.

	Approach: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps.

	Script: `bash_train_stratB.sh` — 1-node, dummy data, no checkpoint saves, `--max-steps-before-restart 0`.

	Training loop structure (both strategies use the same code):
	1. Diffusion: `forward → backward → step` (NO gradient clipping)
	2. Contrastive: `forward → backward → clip(max_norm=0.02) → step` (max_norm lowered to 1e-3 for job 26457072)
	3. Registration: `forward → backward → clip(max_norm=0.1) → step`

	The three phases are separated by `gc.collect() + synchronize() + empty_cache()`.
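
To make the effect of the `max_norm` values concrete, here is a pure-Python mirror of global-norm gradient clipping. It follows the rescaling rule used by `torch.nn.utils.clip_grad_norm_` (scale by `max_norm / (total_norm + eps)`, capped at 1), but operates on a flat list of floats; the function name and the example gradients are illustrative only.

```python
import math


def clip_by_global_norm(grads, max_norm, eps=1e-6):
    """Rescale grads so their L2 norm does not exceed max_norm.

    Returns (clipped_grads, original_total_norm).
    """
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))
    return [g * scale for g in grads], total_norm


grads = [3.0, 4.0]  # toy gradient with L2 norm 5
for max_norm in (0.1, 0.02, 1e-3):
    clipped, _ = clip_by_global_norm(grads, max_norm)
    new_norm = math.sqrt(sum(g * g for g in clipped))
    print(f"max_norm={max_norm}: clipped norm ~ {new_norm:.4g}")
```

Since the update magnitude is bounded by `max_norm`, moving the contrastive phase from 0.02 to 1e-3 cuts its largest possible gradient norm by a factor of 20 relative to the unclipped diffusion phase.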

	### Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement)
	- `ConvTranspose3d` backward was identified as leaking ~0.33 GiB/step per layer on XPU
	- Replaced with `UpsampleConv` (F.interpolate + Conv) in `Diffusion/networks.py` for `OM_net`
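	- Initial result: negligible impact, with a leak rate of ~1.78 GiB/step measured both before and after
	- Initial conclusion: `ConvTranspose3d` was NOT the primary leak source and the leak is fundamental to XPU autograd. The per-operation analysis in the next section later overturned this: the before/after comparison was confounded, and `ConvTranspose3d` backward is the dominant leaker.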

	### Strategy B: Per-Operation XPU Leak Analysis (job 25893155)

	Ran `tests/diagnose_xpu_leak_ops.py` on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures `device_free` drift.

	| Operation | Leak Rate (GiB/step) | Pattern | Verdict |
	|-----------|----------------------|---------|---------|
	| ConvTranspose3d (256→3, 8³→128³) | 0.335 | Linear, persistent | LEAKS (primary source) |
	| Full OM_net (rec_num=2, 128³) | 1.15 | Linear, OOM at step 17 | LEAKS (aggregated) |
	| Stacked Conv3d encoder (1→256, 128³→8³) | 0.013 | One-time alloc | OK (initial alloc only) |
	| F.grid_sample (128³) | 0.007 | One-time alloc | OK |
	| Conv3d (16→32, 64³) | 0.004 | One-time alloc | OK |
	| MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK |
	| Adam optimizer only | 0.0005 | One-time alloc | OK |
	| F.interpolate trilinear (32³→256³) | 0.000 | No leak | ZERO LEAK |
	| UpsampleConv (256→3, 8³→128³) | Not tested (env error) | n/a | Expected zero (uses F.interpolate) |

	Key findings:
	1. `ConvTranspose3d` backward is the dominant leaker — 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step.
	2. `F.interpolate` has ZERO leak — confirming that `UpsampleConv` (which uses F.interpolate + Conv) is the correct fix.
	3. All other ops have only one-time allocations (no linear drift) — Conv3d, grid_sample, attention, Adam are all clean.
	4. Full OM_net leaks 1.15 GiB/step: the same order of magnitude as the ~1.7 GiB/step estimated for 5 `ConvTranspose3d` layers, though lower than that estimate, with minor contributions from other ops.

	Component-level diagnostic (job 25826494):
	- Forward only (no backward): ZERO leak — 62.75 GiB stable for 20 steps
	- Forward + backward: 1.12 GiB/step — 62.97 → 40.53 GiB over 20 steps
	- Confirms the leak is in the autograd backward pass, specifically in `ConvTranspose3d` backward kernel.
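
The rates quoted in this section follow from simple two-point arithmetic on `device_free` readings. The snippet below reproduces them as a sanity check; the helper name is illustrative and the inputs are the figures reported above.

```python
def leak_rate_gib_per_step(free_start_gib, free_end_gib, steps):
    """Average device memory lost per step between two device_free readings."""
    return (free_start_gib - free_end_gib) / steps


# Component-level diagnostic (job 25826494): forward + backward over 20 steps.
fwd_bwd_rate = leak_rate_gib_per_step(62.97, 40.53, 20)

# Per-operation figure for one ConvTranspose3d layer (job 25893155).
CONVT_RATE = 0.335
lost_over_20_steps = CONVT_RATE * 20   # memory lost by a single layer
five_layer_estimate = CONVT_RATE * 5   # naive sum over 5 decoder layers

print(f"{fwd_bwd_rate:.2f} GiB/step")          # 1.12 GiB/step
print(f"{lost_over_20_steps:.1f} GiB")         # 6.7 GiB
print(f"{five_layer_estimate:.3f} GiB/step")   # 1.675 GiB/step
```

The five-layer figure is a naive sum that treats every decoder layer as leaking at the rate measured for the one tested configuration, which may explain why it overshoots the 1.12 to 1.15 GiB/step measured for the full network.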

	Why the leak rate dropped in recent runs (jobs 25898349/25898957): The `UpsampleConv` fix in `OM_net` replaced all 5 decoder `ConvTranspose3d` layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is attributed to minor allocations that are expected to stabilize quickly.

	Why the old runs (25892717) still leaked 1.78 GiB/step: The diagnostic `diagnose_xpu_leak_ops.py` test used the OLD `RecMulModMutAttnNet` with `ConvTranspose3d`. The `OM_net` class (used in production training) already had `UpsampleConv`. The earlier measurement of "negligible impact" was incorrect — the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective.

	### Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps)
	- Job: 25899266 (Strategy B v3)
	- Symptom: GPU segfault (`drm_neo.cpp:288`) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free).
	- Root cause: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver.
	- Warning before crash: `CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000`
	- Fix: `export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` in SLURM scripts. Also added to Strategy A's `bash_train_multi_nodes.sh`.
	- Note: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs.
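
How much headroom the raised threshold buys can be estimated roughly. The sketch below assumes, without evidence from the logs, that cached IPC handles grow linearly with DDP steps at the rate observed in job 25899266 (1000 handles after about 403 steps on 1 node); the real growth on 8 nodes may differ.

```python
OBSERVED_LIMIT = 1000   # default cache limit that was hit
OBSERVED_STEPS = 403    # DDP steps completed when the limit was reached


def steps_until_cache_full(limit, observed_limit=OBSERVED_LIMIT,
                           observed_steps=OBSERVED_STEPS):
    """Linear extrapolation of how many steps fit under a given limit."""
    handles_per_step = observed_limit / observed_steps
    return round(limit / handles_per_step)


print(steps_until_cache_full(1000))    # 403
print(steps_until_cache_full(10000))   # 4030
```

Under this assumption the new limit postpones the problem by a factor of ten instead of removing it, so a single process running well beyond a few thousand steps could still reach it.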

	---

	## Scripts Directory

	| File | Purpose |
	|------|---------|
	| `bash_train_multi_nodes.sh` | Strategy A: 8-node production training with proactive restart loop |
	| `bash_train_stratB.sh` | Strategy B: 1-node leak rate diagnostic (no restart, dummy data) |
	| `bash_infer.sh` | Inference / augmentation SLURM job |
	| `bash_diagnose_leak.sh` | Submits `tests/diagnose_xpu_leak.py` to diagnose per-component leak |
	| `bash_diagnose_ops.sh` | Submits `tests/diagnose_xpu_leak_ops.py` for per-operation leak analysis |
	| `bash_verify_fix.sh` | Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) |
	| `bash_compare_opt.sh` | Speed comparison: optimized vs original 3-mode training |
	| `bash_compare_orig.sh` | Speed comparison: original 3-mode training baseline |
	| `tests/diagnose_xpu_leak.py` | Component-level leak test (network, DeformDDPM, DDP) |
	| `tests/diagnose_xpu_leak_ops.py` | Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) |
	| `tests/test_3modes_opt_equivalence.py` | Verifies optimized training matches original |
	| `tests/test_mslncc.py` | MSLNCC loss function unit tests |
	| `tests/compare_3modes_speed.py` | Speed benchmark for 3-mode training variants |

	---

	## Issues Resolved (2026-03-22)

	### 14. XPU Autograd Engine Memory Leak — ~1.0 GiB/Step (ROOT CAUSE IDENTIFIED)
	- Jobs: All XPU training jobs; diagnosed in 25826494
	- Symptom: `device_free` (via `torch.xpu.mem_get_info`) decreases linearly at ~1.0 GiB/step. Memory is outside PyTorch's caching allocator — not tracked by `memory_allocated` / `memory_reserved`.
	- Root cause: PyTorch XPU autograd engine bug. The `loss.backward()` call leaks device memory on every invocation. Confirmed by isolated diagnostic (`tests/diagnose_xpu_leak.py`):
	  - Test 1 (forward only, no backward): NO LEAK, device_free perfectly stable at 62.75 GiB over 20 steps
	  - Test 2a (forward + backward, no optimizer): LEAK, 1.0 GiB/step (62.97 → 42.98 GiB over 20 steps)
	  - Test 2b (forward + backward + optimizer.step): LEAK, 1.1 GiB/step (slightly worse)
	- NOT caused by: CCL all-reduce (`no_sync()` showed identical leak rate), DDP (leak occurs without DDP), garbage collection (`gc.collect()` had no effect), caching allocator (`empty_cache()` had no effect), deferred ops (`synchronize()` had no effect)
	- Why it works on CUDA: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime).
	- Workaround applied:
	  1. Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 → 26 GiB, buying ~26 steps before OOM
	  2. Mid-epoch checkpoints every 10 steps to `tmp/` subfolder
	  3. Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets)
	- Upstream: Should be reported to [intel/torch-xpu-ops](https://github.com/intel/torch-xpu-ops) with `tests/diagnose_xpu_leak.py` as minimal reproduction
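
A leak of this kind shows up as a steady slope in `device_free` and not as a one-off drop, so a least-squares fit over per-step readings is a reasonable way to quantify it. The sketch below is not the code in `tests/diagnose_xpu_leak.py`; both function names are hypothetical, and `read_free_gib` is an injected callable that in practice could wrap `torch.xpu.mem_get_info` (assumed here to report free and total bytes, as its CUDA counterpart does).

```python
def collect_readings(read_free_gib, step_fn, steps):
    """Record free device memory before the first step and after each step."""
    readings = [read_free_gib()]
    for _ in range(steps):
        step_fn()
        readings.append(read_free_gib())
    return readings


def fit_leak_rate(free_gib_readings):
    """Least-squares slope of free memory vs. step, sign-flipped so that a
    positive result means GiB lost per step."""
    n = len(free_gib_readings)
    if n < 2:
        raise ValueError("need at least two readings")
    mean_x = (n - 1) / 2
    mean_y = sum(free_gib_readings) / n
    cov = sum((i - mean_x) * (y - mean_y)
              for i, y in enumerate(free_gib_readings))
    var = sum((i - mean_x) ** 2 for i in range(n))
    return -cov / var


# Simulated device that loses 1.0 GiB on every step (no real XPU involved).
state = {"free": 62.97}


def fake_step():
    state["free"] -= 1.0


readings = collect_readings(lambda: state["free"], fake_step, 20)
print(round(fit_leak_rate(readings), 3))  # 1.0
```

A one-time allocation would show up as a single drop followed by a flat line, giving a fitted slope that shrinks as more steps are recorded, whereas a true leak keeps the same slope.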

	### 13. Pre-allocation Approach — Wrong Direction
	- Jobs: 25799043 (92%), 25823021 (78%)
	- Finding: Pre-allocating device memory into PyTorch's caching allocator REDUCED the device memory left over to absorb the autograd leak, causing EARLIER crashes. OOM step by pre-allocation level: 92% → step 3, 78% → step 10, none → step 15.
	- Resolution: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap).

	### 12. Contrastive Backward OOM — Diffusion Tensors Not Freed
	- Jobs: 25823021, 25823710
	- Finding: `del pre_dvf_I, dvf_I, trm_pred` was placed AFTER the contrastive step. During `loss_contra.backward()`, diffusion output tensors were still alive, pushing peak above the limit.
	- Fix: Moved `del` + `empty_cache()` to BEFORE the contrastive step. `loss_gen_a.item()` is also saved before deleting, since it is needed for the registration decision.

	---

	## Issues Resolved (2026-03-20 to 2026-03-21)

	### 11. DDP Collective Hangs and Registration Desync
	- Jobs: 25699197, 25709947 (hung), 25704670, 25706470
	- Symptoms: Job hangs after step 1 (log files stop growing); or `Expected to have finished reduction` error
	- Root cause 1: OOM try/except guards with `continue` skip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. OOM guards are fundamentally incompatible with DDP.
	- Root cause 2: Registration conditional block (`loss_gen_a.item() < -0.6`) differs per rank — some ranks call `Deformddpm(...)` for registration while others skip, causing DDP desync.
	- Fixes applied:
	  1. Removed all OOM try/except guards: let OOM crash the job; rely on checkpoint auto-resume
	  2. `DDP(..., find_unused_parameters=True)`: handles detached recovery iterations and conditional registration
	  3. `dist.all_reduce(regist_flag, op=ReduceOp.MIN)`: all 64 ranks collectively decide whether to run registration. Only runs when ALL ranks agree.
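
The MIN reduction in fix 3 can be simulated without a process group. In this sketch a plain `min()` over a list stands in for `dist.all_reduce(..., op=ReduceOp.MIN)`, and the function names and the default threshold argument are illustrative only.

```python
def local_regist_flag(loss_gen_a, threshold=-0.6):
    """1 if this rank's loss passes the registration gate, else 0."""
    return 1 if loss_gen_a < threshold else 0


def all_reduce_min(flags):
    """Stand-in for a MIN all-reduce: every rank ends up with the minimum."""
    return min(flags)


def should_run_registration(per_rank_losses, threshold=-0.6):
    """True only if every rank's flag is 1, so all ranks take the same branch."""
    flags = [local_regist_flag(loss, threshold) for loss in per_rank_losses]
    return all_reduce_min(flags) == 1


print(should_run_registration([-0.65, -0.70, -0.61]))  # True
print(should_run_registration([-0.65, -0.55, -0.61]))  # False
```

Because every rank receives the same reduced value, no rank can enter the registration forward pass alone, which is the desync that caused the hangs. The newer `dist.broadcast(src=0)` variant noted in the active-job entry gives the same uniformity by letting rank 0 decide.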

	### 10. XPU OOM — Allocator 70% Memory Cap
	- Jobs: 25530886 through 25780909
	- Error: `UR_RESULT_ERROR_OUT_OF_RESOURCES` at step 12-14 of each epoch
	- Root cause: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug ([torch-xpu-ops#1543](https://github.com/intel/torch-xpu-ops/issues/1543)). The same code works on 4x 48GB CUDA GPUs because the CUDA allocator can use nearly all device memory.
	- Key diagnostic (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps — no fragmentation, no creep. OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap.
	- Resolution: Gradient checkpointing reduced peak from 43 → 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed.
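
The headroom figures behind this resolution are easy to verify. The constants below are the values quoted in this entry; the helper names are illustrative.

```python
DEVICE_GIB = 64.0      # memory per XPU tile, as quoted above
CAP_FRACTION = 0.70    # approximate allocator cap


def allocator_cap_gib():
    """Reserved-memory ceiling imposed by the caching allocator."""
    return DEVICE_GIB * CAP_FRACTION


def headroom_gib(peak_gib):
    """Margin between the allocator cap and the peak working set."""
    return allocator_cap_gib() - peak_gib


print(f"cap: {allocator_cap_gib():.1f} GiB")                    # cap: 44.8 GiB
print(f"before checkpointing: {headroom_gib(43):.1f} GiB")      # 1.8 GiB
print(f"after checkpointing: {headroom_gib(26):.1f} GiB")       # 18.8 GiB
```

A margin of under 2 GiB at a 43 GiB peak is consistent with the OOM being a peak spike during forward/backward, with fragmentation or gradual creep ruled out by the stable alloc/reserved readings.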

	---

	## Issues Resolved (2026-03-19 to 2026-03-20)

	### 1. torchrun Permission Denied
	- Fix: Switched to `python -m torch.distributed.run`, then later to direct `srun` launch

	### 2. GPUS_PER_NODE Mismatch
	- Fix: `--nodes=8 --ntasks-per-node=8` for 64 total XPU tiles (each node has 4 cards x 2 tiles/card = 8 tiles)

	### 3. `.to(rank)` Sends to CUDA Not XPU
	- Fix: Changed to `.to(f"{DEVICE_TYPE}:{rank}")`

	### 4. No DistributedSampler
	- Fix: Added `DistributedSampler` for both dataloaders + `set_epoch()` per epoch

	### 5. CCL Backend Not Found
	- Fix: dn-mo1 rebuilt the conda env with compatible packages

	### 6. MPI/PMI Init Failure
	- Fix: Switched from torchrun to direct `srun --ntasks-per-node=8` with SLURM env var mapping

	### 7. CCL Worker Thread Startup Failure
	- Fix: Increased to `--cpus-per-task=12` (job 25522164 failed with 3 CPUs) and set `CCL_WORKER_AFFINITY=auto`

	### 8. gloo Backend Incompatible with XPU
	- Fix: Must use `ccl` backend for XPU DDP

	### 9. Print Spam from All Ranks
	- Fix: Guarded prints with `gpu_id == 0`

	---

	## Working Configuration Summary

	```bash
	# SLURM
	--nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12

	# Environment
	I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so
	I_MPI_HYDRA_BOOTSTRAP=slurm
	CCL_WORKER_AFFINITY=auto
	CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000  # raised from the default 1000 (see Issue 15)
	PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

	# Launch
	srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...'

	# DDP
	backend = "ccl"
	DDP(model, device_ids=[rank], find_unused_parameters=True)

	# XPU autograd leak workaround
	# - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True)
	# - Mid-epoch checkpoint every 10 steps to Models/.../tmp/
	# - Restart loop inside the SLURM allocation, each srun wrapped in `timeout`
	#   (replaces the sbatch auto-resubmit, which never fired when srun hung)
	# - No pre-allocation (gradient checkpointing keeps peak under 70% cap)
	```
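
On the Python side, the launch line above implies a mapping from SLURM variables to the rank variables that DDP setup expects. The sketch below shows that mapping in isolation; `ddp_env_from_slurm` and `device_string` are hypothetical helpers, and using the local rank as the device index is an assumption based on the `.to(f"{DEVICE_TYPE}:{rank}")` fix.

```python
import os


def ddp_env_from_slurm(environ=None):
    """Translate SLURM task variables into LOCAL_RANK / RANK / WORLD_SIZE."""
    env = os.environ if environ is None else environ
    return {
        "LOCAL_RANK": int(env["SLURM_LOCALID"]),   # index within the node
        "RANK": int(env["SLURM_PROCID"]),          # global task index
        "WORLD_SIZE": int(env["SLURM_NTASKS"]),    # total number of tasks
    }


def device_string(device_type, local_rank):
    """Build an explicit device string such as 'xpu:3'."""
    return f"{device_type}:{local_rank}"


# Example: task 19 of 64, which is the fourth task on the third node.
fake_env = {"SLURM_LOCALID": "3", "SLURM_PROCID": "19", "SLURM_NTASKS": "64"}
ranks = ddp_env_from_slurm(fake_env)
print(ranks, device_string("xpu", ranks["LOCAL_RANK"]))
```

Passing the environment as an argument keeps the helper testable outside a SLURM job; inside one, calling it with no argument reads `os.environ` directly.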

	## Previous Test Jobs

	| Job ID | Date | Nodes | Status | Issue |
	|--------|------|-------|--------|-------|
	| 25379684 | 03-18 | 16 | FAILED | torchrun permission denied |
	| 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) |
	| 25433261 | 03-19 | 8 | FAILED | xccl not built in |
	| 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure |
	| 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure |
	| 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) |
	| 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault |
	| 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible |
	| 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails |
	| 25525830 | 03-20 | 1 | SUCCESS | First working run: ccl + 12 CPUs/task |
	| 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 |
	| 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) |
	| 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 |
	| 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added |
	| 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 |
	| 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP |
	| 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration |
	| 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state |
	| 25709947 | 03-21 | 8 | HUNG | Registration conditional desync |
	| 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync: no hang! OOM step 14. |
	| 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) |
	| 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB): OOM step 3, WORSE |
	| 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc: diffusion OK (43.3 GiB), contra OOM step 10 |
	| 25823544 | 03-21 | 8 | FAILED | del tensors before contra: UnboundLocalError bug |
	| 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed |
	| 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak |
	| 25825585 | 03-22 | 8 | FAILED | no_sync(): same leak rate; NOT from all-reduce |
	| 25825861 | 03-22 | 8 | FAILED | gc.collect+sync+empty_cache: no effect on leak |
	| 25826494 | 03-22 | 1 | DIAG | Root cause found: fwd=no leak, bwd=1.0 GiB/step leak. XPU autograd bug. |
	| 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43→22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). |
	| 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at step 10,20. |
	| 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10,20. srun hung after crash (no kill-on-bad-exit). |
	| 25892717 | 03-22 | 8 | HUNG→CANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. |
	| 25898349 | 03-22 | 8 | CANCELLED (Strat A v1) | Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc). |
	| 25898356 | 03-22 | 1 | CRASHED (Strat B v1) | Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). |
	| 25899114 | 03-23 | 1 | PENDING (Strat B v2) | Full leak validation: no restart, dummy data, all fixes. Goal: survive 2h/80+ steps. |
	| 25898957 | 03-23 | 8 | CRASH-LOOPED (Strat A v2) | Epoch 6 step 9 then crash-loop ×63. torch.load rejects numpy RNG in checkpoint. |
	| 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict: ran simultaneously with 25898957. |
	| 25899114 | 03-23 | 1 | CANCELLED | Strategy B: cancelled with 25898957 for code fix. |
	| 25899265 | 03-23 | 8 | RUNNING (Strat A v3) VALIDATED | 100+ epochs, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime. |
	| 25899266 | 03-23 | 1 | COMPLETED (Strat B v3) | 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). |
	| 25916258 | 03-23 | 1 | RUNNING (Strat B v4) | CCL cache fix (10000 handles). Goal: survive full 2h walltime. |