# OmniMorph Multi-Node XPU Training Progress
## Active Jobs (2026-03-28)
### Job 26457072: 8-node production training (RUNNING)
- **Date**: 2026-03-28
- **Status**: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime
- **Script**: `bash_train_multi_nodes.sh` (srun timeout reduced to 1.5h)
- **Config**: `all_om_net`, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice
- **Data**: 14,814 diffusion + 2,583 registration (full real data)
- **Steps/epoch**: 41 (14814 / (2 × 64) ÷ DIFF_REG_BATCH_RATIO steps), **~108s/step** → ~1h14m per epoch
- **Changes from previous job**:
  - Contrastive `clip_grad_norm` max_norm: 0.02 → **1e-3** (avoid contrastive dominating training)
  - Registration activation threshold: -0.6 → **-0.7** (stricter gate)
  - `dist.all_reduce` → `dist.broadcast(src=0)` for NaN sync + registration gate (CCL hang workaround)
  - srun timeout: 7200s → **5400s** (1.5h, reduces CCL hang waste from 46min to ~16min)
  - Excluded pvc-s-135 (hardware failure: "No XPU devices are available")
- **Logs**: `Logs/train_multi_26457072.out`
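The timing figures in this entry can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers quoted above (41 steps per epoch, ~108 s/step, srun timeouts of 7200 s and 5400 s). It assumes that a hung srun sits idle from the end of its first epoch until the timeout fires, which is a simplification.

```python
# Cross-check of the epoch time and the time lost to a CCL hang per srun.
STEPS_PER_EPOCH = 41
SECONDS_PER_STEP = 108  # approximate value from the job log

epoch_seconds = STEPS_PER_EPOCH * SECONDS_PER_STEP


def wasted_minutes(srun_timeout_s: int) -> float:
    """Idle minutes if the process hangs right after one full epoch."""
    return (srun_timeout_s - epoch_seconds) / 60


hours, minutes = divmod(round(epoch_seconds / 60), 60)
print(f"epoch time: {hours}h{minutes:02d}m")
print(f"waste with 7200 s timeout: {wasted_minutes(7200):.0f} min")
print(f"waste with 5400 s timeout: {wasted_minutes(5400):.0f} min")
```

Under these assumptions the shorter timeout recovers roughly half an hour per srun iteration.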
### CCL Epoch-Boundary Hang (ongoing issue)
- **Pattern**: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op.
- **Workaround**: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h.
- **Root cause**: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context.
- **Impact**: ~16min wasted per epoch (hang time before timeout), but training progresses reliably.
- **TODO**: Investigate reinitializing CCL mid-training (`destroy_process_group` + `init_process_group`) to avoid the restart overhead.
### Single-node CCL hang (unresolved)
- Single-node job (26446050) hung at step 1 on explicit `dist.all_reduce` calls.
- Changed to `dist.broadcast(src=0)` but never verified (cancelled when multi-node started).
- Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch).
- **TODO**: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport.
### Checkpoint status
- **Latest**: `000036_all_om_net.pth` (Mar 28, 2.9 GB)
- **Total**: epochs 3-36 in `Models/all_om_net/` (85+ GB)
- **CUDA copies**: epochs 0,1,2,8,11 in `Models/all_om_net/cuda_ckpts/`
- **Older model**: epochs 0-10 in `Models/all_recmulmodmutattnnet/` (pre-om_net migration)
### Loss history (real data, 64 XPU tiles)
| Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf |
|-------|-----|------|-------|-------------|---------------|--------|--------|-----|
| 3-7 | -0.02 → -0.10 | 1.50 → 1.02 | - | 9.2e-4 | 0.0 | - | - | - |
| 8-11 | - | - | - | - | - | - | - | - (CUDA) |
| 31 | - | - | - | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 |
| 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 |
Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37).
---
### Previous Job 25899265: Production training (v3), COMPLETED (historical)
- Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to node path issue)
- Validated memory stability and CCL cache fix, but loss data not meaningful
### Previous Strategy A, Job 25898957: CRASH-LOOPED (torch.load bug)
- Ran 1 good iteration (epoch 5 step 31 → epoch 6 step 9), then crash-looped 63 times
- **Root cause**: saving `np.random.get_state()` in checkpoint → numpy arrays → `torch.load` with `weights_only=True` (PyTorch 2.6 default) rejects numpy globals
- Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started.
### Previous Strategy A, Job 25898957: Production training (v2, all audit fixes)
- **Date**: 2026-03-23
- **Status**: RUNNING, epoch 6 in progress (snapshot; the job later crash-looped, see the entry above)
- **Script**: `bash_train_multi_nodes.sh`
- **Config**: `Config/config_om.yaml` (om_net, img_size=128, batchsize=2, device=xpu)
- **Resources**: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task
- **Walltime**: 36:00:00
- **Progress** (as of 2026-03-23 02:45):
  - Completed epoch 5 (full, no restart needed), checkpoint `000005_all_om_net.pth` saved
  - Epoch 6 step 2 in progress, memory stable at ~46 GiB free
  - **Zero restarts triggered**: leak rate ~0.07 GiB/step allows full epochs without OOM
- **Fixes applied** (10 bugs found across 4 independent audits):
  1. **(Critical) Optimizer on all DDP ranks**: all ranks load checkpoint
  2. **(Critical) XPU device RNG saved/restored**: `torch.xpu.get_rng_state()` in checkpoint
  3. **(Critical) RNG restored after DataLoader skip**: prevents `__getitem__` corruption
  4. **(Critical) `loss_nan_step` not overwritten**: guarded with else branch
  5. **(Critical) Off-by-one in step skip**: `step <= initial_step` with `initial_step > 0` guard
  6. **(Low) tmp/ cleanup**: DDP race + stale checkpoint fixes
  7. **(Medium) Per-rank RNG divergence**: non-rank-0 re-seeded (CPU + XPU device RNG)
  8. **(Critical) Step 0 skipped on fresh start**: `initial_step > 0` guard added
  9. **(Config) Timeout**: `SRUN_TIMEOUT` 2400 → 7200
  10. **(Low) `total_reg` division by zero**: `max(total_reg, 1)` guard
- **Leak rate**: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv)
- **Logs**: `Logs/train_multi_25898957.out`
### Job 25899049: CANCELLED (checkpoint conflict)
- Submitted as continuation but got nodes while 25898957 was still running. Both would write to same `Models/all_om_net/`. Cancelled before training started.
### Previous Strategy A, Job 25898349: CANCELLED (had 6 bugs)
- **Ran**: 4 restart iterations, reached epoch 5 step 33
- **Issues**: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6) `loss_nan_step` reset after restore
- **Checkpoints saved**: `000004_all_om_net.pth` (epoch 4), `000005_step0030_all_om_net.pth` (epoch 5 step 30)
- **Memory**: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step)
### Strategy B, Job 25916258: Full validation (v4, CCL cache fix)
- **Date**: 2026-03-23
- **Status**: RUNNING
- **Resources**: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime
- **Fix**: `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` (was default 1000, caused driver segfault at ~400 steps)
- **Goal**: Survive full 2h walltime without crash, which validates both the UpsampleConv fix (no OOM) and the CCL cache fix (no segfault)
- **Logs**: `Logs/stratB_25916258.out`
### Previous Strategy B, Job 25899266 (v3, segfault at epoch 74)
- 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB.
- Crashed at epoch 74 step 0: CCL IPC handle cache hit 1000 limit → driver segfault (`drm_neo.cpp:288`). Fixed with `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000`.
- **Logs**: `Logs/stratB_25899266.out`
### Previous Strategy B, Job 25898356 (crashed on logging bug)
- Ran 6 steps, leak ~0.375 GiB/step. `ZeroDivisionError` on `total_reg=0` (now fixed).
### Previous Job 25892717: CANCELLED (was hung)
- **Status**: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at `loss_contra.backward()`). `srun` hung due to CCL processes not exiting, auto-resubmit never triggered. 4+ hours of walltime wasted idle.
- **Last checkpoint**: `Models/all_om_net/tmp/000004_step0020_all_om_net.pth`
- **Lesson**: `--kill-on-bad-exit=1` is not reliable for CCL cleanup. Strategy A's `timeout` wrapper solves this.
---
## Implementation Details
### Strategy A: Proactive Restart (backup approach)
**Problem**: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via `sbatch` failed because `srun` hangs after a CCL rank crash, so the bash script never reaches the resubmit logic.
**Solution**: Proactive exit + restart loop within the same SLURM allocation.
**Changes to `OM_train_3modes.py`** (training behavior unchanged):
- Added `EXIT_CODE_RESTART = 42` constant
- Added `--max-steps-before-restart N` CLI argument (default 0 = disabled)
- Added `steps_since_start` counter in the training loop
- After N steps: saves mid-epoch checkpoint → `dist.barrier()` → `dist.destroy_process_group()` → `sys.exit(42)`
- `SystemExit(42)` is NOT caught by `except Exception`, so it propagates cleanly
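The last point can be illustrated with a toy loop. This is a minimal sketch rather than the code in `OM_train_3modes.py`: the function names and the 100-step bound are hypothetical, and the checkpoint save and process-group teardown are only indicated by a comment.

```python
import sys

EXIT_CODE_RESTART = 42  # constant named in the notes above


def train_step_loop(max_steps_before_restart: int) -> None:
    """Toy loop: exit with code 42 after N steps (0 disables the restart)."""
    steps_since_start = 0
    while steps_since_start < 100:  # hypothetical bound standing in for the real loop
        steps_since_start += 1
        if 0 < max_steps_before_restart <= steps_since_start:
            # The real script would save a mid-epoch checkpoint and tear
            # down the process group here before exiting.
            sys.exit(EXIT_CODE_RESTART)


def run_guarded(max_steps_before_restart: int) -> str:
    """Mimic a broad `except Exception` handler wrapped around training."""
    try:
        train_step_loop(max_steps_before_restart)
    except Exception:
        # Never reached for sys.exit(): SystemExit derives from
        # BaseException, not Exception.
        return "swallowed"
    return "finished"
```

Calling `run_guarded(20)` raises `SystemExit` with code 42 through the handler, which is what lets the outer restart loop see the exit code.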
**Changes to `bash_train_multi_nodes.sh`**:
- Replaced single `srun` call with a `while` loop (up to 500 iterations)
- Each `srun` wrapped with `timeout 2400` (40 min) to catch CCL hangs
- Exit code handling:
  - `0` → training complete, break
  - `42` → proactive restart, 5s pause, re-launch
  - `124` → timeout (CCL hang), 10s pause, re-launch
  - Other → crash (OOM etc.), 10s pause, re-launch from checkpoint
- Passes `--max-steps-before-restart 20` to Python
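The exit-code handling above lives in a bash loop. As an illustration of the same control flow, here is a Python sketch. The function names are hypothetical, and `subprocess.run(..., timeout=...)` stands in for the `timeout` wrapper around `srun`. Unlike `timeout`, it only kills the direct child process.

```python
import subprocess
import sys
import time

EXIT_OK, EXIT_RESTART, EXIT_TIMEOUT = 0, 42, 124


def classify(code: int) -> tuple[str, float]:
    """Map an exit code to (action, pause in seconds), as in the list above."""
    if code == EXIT_OK:
        return "done", 0.0
    if code == EXIT_RESTART:
        return "proactive-restart", 5.0
    if code == EXIT_TIMEOUT:
        return "timeout-restart", 10.0
    return "crash-restart", 10.0


def run_once(cmd: list[str], timeout_s: float) -> int:
    """Run cmd once; report 124 on timeout, like coreutils `timeout`."""
    try:
        return subprocess.run(cmd, timeout=timeout_s).returncode
    except subprocess.TimeoutExpired:
        return EXIT_TIMEOUT


def supervise(cmd, timeout_s=2400.0, max_iters=500, sleep=time.sleep) -> list[str]:
    """Re-launch cmd until it exits with 0 or max_iters is reached."""
    history = []
    for _ in range(max_iters):
        action, pause = classify(run_once(cmd, timeout_s))
        history.append(action)
        if action == "done":
            break
        sleep(pause)
    return history
```

Keeping the loop inside one allocation means a restart costs seconds of pause rather than a new queue wait.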
**Training behavior**: Identical to original. Two independent audits verified that:
- Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored
- Off-by-one fixed: `step <= initial_step` correctly skips the checkpointed step
- RNG restored AFTER DataLoader skip loop (not before) to avoid `__getitem__` corruption
- `loss_nan_step`, `total_reg`, and all 9 epoch loss accumulators preserved across restarts
- All DDP ranks load optimizer state (not just rank 0)
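The off-by-one fix can be sketched in isolation. The helper below is hypothetical and assumes zero-based step indices, with `initial_step` holding the last step stored in the checkpoint and 0 meaning a fresh start.

```python
def steps_to_run(num_steps: int, initial_step: int) -> list[int]:
    """Return the step indices that still need training after a resume."""
    remaining = []
    for step in range(num_steps):
        # `initial_step > 0` keeps step 0 on a fresh start; `<=` (not `<`)
        # also skips the checkpointed step itself, so it is not re-trained.
        if initial_step > 0 and step <= initial_step:
            continue
        remaining.append(step)
    return remaining
```

With 41 steps per epoch and a checkpoint at step 30, the resumed run starts at step 31, which matches the "epoch 5 step 31" restart recorded for job 25898957.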
### Strategy B: Leak Rate Diagnostic
**Problem**: Need to know whether existing mitigations (gradient checkpointing, `UpsampleConv`, `empty_cache`) reduce the XPU leak rate, or if the leak is fundamental to all backward ops.
**Approach**: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps.
**Script**: `bash_train_stratB.sh` (1-node, dummy data, no checkpoint saves, `--max-steps-before-restart 0`).
**Training loop structure** (both strategies use the same code):
1. **Diffusion**: `forward → backward → step` (NO gradient clipping)
2. **Contrastive**: `forward → backward → clip(max_norm=0.02) → step`
3. **Registration**: `forward → backward → clip(max_norm=0.1) → step`
`gc.collect() + synchronize() + empty_cache()` runs between consecutive phases.
### Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement)
- `ConvTranspose3d` backward was identified as leaking ~0.33 GiB/step per layer on XPU
- Replaced with `UpsampleConv` (F.interpolate + Conv) in `Diffusion/networks.py` for `OM_net`
- Result: **Negligible impact**: leak rate ~1.78 GiB/step before and after
- Conclusion at the time: `ConvTranspose3d` was NOT the primary leak source; the leak is fundamental to XPU autograd (later revised by the per-operation analysis in the next section)
### Strategy B: Per-Operation XPU Leak Analysis (job 25893155)
Ran `tests/diagnose_xpu_leak_ops.py` on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures `device_free` drift.
| Operation | Leak Rate (GiB/step) | Pattern | Verdict |
|-----------|---------------------|---------|---------|
| **ConvTranspose3d** (256 → 3, 8³ → 128³) | **0.335** | **Linear, persistent** | **LEAKS (primary source)** |
| Full OM_net (rec_num=2, 128³) | **1.15** | **Linear, OOM at step 17** | **LEAKS (aggregated)** |
| Stacked Conv3d encoder (1 → 256, 128³ → 8³) | 0.013 | One-time alloc | OK (initial alloc only) |
| F.grid_sample (128³) | 0.007 | One-time alloc | OK |
| Conv3d (16 → 32, 64³) | 0.004 | One-time alloc | OK |
| MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK |
| Adam optimizer only | 0.0005 | One-time alloc | OK |
| **F.interpolate** trilinear (32³ → 256³) | **0.000** | **No leak** | **ZERO LEAK** |
| **UpsampleConv** (256 → 3, 8³ → 128³) | Not tested (env error) | - | Expected zero (uses F.interpolate) |
**Key findings:**
1. **`ConvTranspose3d` backward is the dominant leaker**: 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step.
2. **`F.interpolate` has ZERO leak**, confirming that `UpsampleConv` (which uses F.interpolate + Conv) is the correct fix.
3. **All other ops have only one-time allocations** (no linear drift): Conv3d, grid_sample, attention, Adam are all clean.
4. **Full OM_net leaks 1.15 GiB/step**, consistent with `ConvTranspose3d` × 5 layers plus minor contributions from other ops.
**Component-level diagnostic** (job 25826494):
- Forward only (no backward): **ZERO leak**: 62.75 GiB stable for 20 steps
- Forward + backward: **1.12 GiB/step**: 62.97 → 40.53 GiB over 20 steps
- Confirms the leak is in the autograd backward pass, specifically in `ConvTranspose3d` backward kernel.
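A leak rate like the ones above can be estimated as the slope of `device_free` against the step index. The sketch below is one way to do that with a least-squares fit; how the diagnostic scripts actually compute the rate is not stated in these notes. The series is synthetic, built only from the two endpoints quoted above.

```python
def leak_rate_gib_per_step(device_free_gib: list[float]) -> float:
    """Least-squares slope of free memory vs. step, sign-flipped (positive = leak)."""
    n = len(device_free_gib)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(device_free_gib) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(device_free_gib))
    var = sum((x - mean_x) ** 2 for x in range(n))
    return -cov / var


# Synthetic, perfectly linear series between the quoted endpoints
# (62.97 GiB free at the start, 40.53 GiB free after 20 steps).
start, end, steps = 62.97, 40.53, 20
series = [start + (end - start) * i / steps for i in range(steps + 1)]
print(f"{leak_rate_gib_per_step(series):.2f} GiB/step")  # prints 1.12 GiB/step
```

A fit over all samples is less sensitive to a one-time allocation in the first step than a simple endpoint difference, which matters when separating "one-time alloc" ops from linear leakers.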
**Why the leak rate dropped in recent runs** (jobs 25898349/25898957): The `UpsampleConv` fix in `OM_net` replaced all 5 decoder `ConvTranspose3d` layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is from minor one-time allocations that stabilize quickly.
**Why the old runs (25892717) still leaked 1.78 GiB/step**: The diagnostic `diagnose_xpu_leak_ops.py` test used the OLD `RecMulModMutAttnNet` with `ConvTranspose3d`. The `OM_net` class (used in production training) already had `UpsampleConv`. The earlier measurement of "negligible impact" was incorrect: the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective.
### Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps)
- **Job**: 25899266 (Strategy B v3)
- **Symptom**: GPU segfault (`drm_neo.cpp:288`) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free).
- **Root cause**: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver.
- **Warning before crash**: `CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000`
- **Fix**: `export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` in SLURM scripts. Also added to Strategy A's `bash_train_multi_nodes.sh`.
- **Note**: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs.
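The fix above is applied with `export` in the SLURM scripts. As a complementary safeguard, a launch-time check in Python could make sure the variable is never lower than intended. The helper name is hypothetical, and the sketch assumes oneCCL reads the variable when the process group is initialized, so it would have to run before that point.

```python
import os

CCL_IPC_VAR = "CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD"


def ensure_ipc_cache_threshold(env, minimum: int = 10000) -> int:
    """Raise the threshold stored in `env` to at least `minimum`; return the result."""
    raw = env.get(CCL_IPC_VAR, "")
    current = int(raw) if raw.isdigit() else 0  # unset or malformed counts as 0
    final = max(current, minimum)
    env[CCL_IPC_VAR] = str(final)
    return final


if __name__ == "__main__":
    # Assumed to run before any process group is created.
    print(ensure_ipc_cache_threshold(os.environ))
```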
---
## Scripts Directory
| File | Purpose |
|------|---------|
| `bash_train_multi_nodes.sh` | **Strategy A**: 8-node production training with proactive restart loop |
| `bash_train_stratB.sh` | **Strategy B**: 1-node leak rate diagnostic (no restart, dummy data) |
| `bash_infer.sh` | Inference / augmentation SLURM job |
| `bash_diagnose_leak.sh` | Submits `tests/diagnose_xpu_leak.py` to diagnose per-component leak |
| `bash_diagnose_ops.sh` | Submits `tests/diagnose_xpu_leak_ops.py` for per-operation leak analysis |
| `bash_verify_fix.sh` | Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) |
| `bash_compare_opt.sh` | Speed comparison: optimized vs original 3-mode training |
| `bash_compare_orig.sh` | Speed comparison: original 3-mode training baseline |
| `tests/diagnose_xpu_leak.py` | Component-level leak test (network, DeformDDPM, DDP) |
| `tests/diagnose_xpu_leak_ops.py` | Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) |
| `tests/test_3modes_opt_equivalence.py` | Verifies optimized training matches original |
| `tests/test_mslncc.py` | MSLNCC loss function unit tests |
| `tests/compare_3modes_speed.py` | Speed benchmark for 3-mode training variants |
---
## Issues Resolved (2026-03-22)
### 14. XPU Autograd Engine Memory Leak: ~1.0 GiB/Step (ROOT CAUSE IDENTIFIED)
- **Jobs**: All XPU training jobs; diagnosed in 25826494
- **Symptom**: `device_free` (via `torch.xpu.mem_get_info`) decreases linearly at ~1.0 GiB/step. Memory is outside PyTorch's caching allocator, so it is not tracked by `memory_allocated` / `memory_reserved`.
- **Root cause**: **PyTorch XPU autograd engine bug**. The `loss.backward()` call leaks device memory on every invocation. Confirmed by isolated diagnostic (`tests/diagnose_xpu_leak.py`):
  - Test 1 (forward only, no backward): **NO LEAK**: device_free perfectly stable at 62.75 GiB over 20 steps
  - Test 2a (forward + backward, no optimizer): **LEAK**: 1.0 GiB/step (62.97 → 42.98 GiB over 20 steps)
  - Test 2b (forward + backward + optimizer.step): **LEAK**: 1.1 GiB/step (slightly worse)
- **NOT caused by**: CCL all-reduce (`no_sync()` showed identical leak rate), DDP (leak occurs without DDP), garbage collection (`gc.collect()` had no effect), caching allocator (`empty_cache()` had no effect), deferred ops (`synchronize()` had no effect)
- **Why it works on CUDA**: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime).
- **Workaround applied**:
  1. Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 → 26 GiB, buying ~26 steps before OOM
  2. Mid-epoch checkpoints every 10 steps to `tmp/` subfolder
  3. Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets)
- **Upstream**: Should be reported to [intel/torch-xpu-ops](https://github.com/intel/torch-xpu-ops) with `tests/diagnose_xpu_leak.py` as minimal reproduction
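Restarting from the last checkpoint requires picking the most recent file among end-of-epoch and mid-epoch saves. The sketch below shows one way to order them. The naming pattern is inferred from file names quoted in these notes (`000004_all_om_net.pth`, `000005_step0030_all_om_net.pth`); the helper functions are hypothetical and not taken from the training script.

```python
import re

# Six-digit epoch, optional _stepNNNN part, then a free-form tag.
_CKPT_RE = re.compile(r"^(\d{6})(?:_step(\d{4}))?_(.+)\.pth$")


def checkpoint_name(epoch: int, tag: str, step=None) -> str:
    """Build a file name; pass `step` for a mid-epoch checkpoint."""
    middle = f"_step{step:04d}" if step is not None else ""
    return f"{epoch:06d}{middle}_{tag}.pth"


def resume_key(name: str):
    """Sort key (epoch to resume in, step to resume after); None if no match."""
    m = _CKPT_RE.match(name)
    if m is None:
        return None
    epoch, step = int(m.group(1)), m.group(2)
    # An end-of-epoch file means that epoch is finished, so training
    # resumes at the start of the next one.
    return (epoch, int(step)) if step is not None else (epoch + 1, 0)


def latest_checkpoint(names):
    """Return the name that represents the furthest training progress."""
    candidates = [n for n in names if resume_key(n) is not None]
    return max(candidates, key=resume_key, default=None)
```

Ordering by resume position rather than by file name or modification time avoids picking a stale mid-epoch file over a newer end-of-epoch checkpoint.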
### 13. Pre-allocation Approach: Wrong Direction
- **Jobs**: 25799043 (92%), 25823021 (78%)
- **Finding**: Pre-allocating device memory into PyTorch's caching allocator REDUCED available memory for the autograd leak, causing EARLIER crashes. 92% → step 3, 78% → step 10, none → step 15.
- **Resolution**: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap).
### 12. Contrastive Backward OOM: Diffusion Tensors Not Freed
- **Jobs**: 25823021, 25823710
- **Finding**: `del pre_dvf_I, dvf_I, trm_pred` was placed AFTER the contrastive step. During `loss_contra.backward()`, diffusion output tensors were still alive, pushing peak above the limit.
- **Fix**: Moved `del` + `empty_cache()` to BEFORE the contrastive step. Also save `loss_gen_a.item()` before deleting since it's needed for registration decision.
---
## Issues Resolved (2026-03-20 to 2026-03-21)
### 11. DDP Collective Hangs and Registration Desync
- **Jobs**: 25699197, 25709947 (hung), 25704670, 25706470
- **Symptoms**: Job hangs after step 1 (log files stop growing); or `Expected to have finished reduction` error
- **Root cause 1**: OOM try/except guards with `continue` skip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. **OOM guards are fundamentally incompatible with DDP.**
- **Root cause 2**: Registration conditional block (`loss_gen_a.item() < -0.6`) differs per rank β some ranks call `Deformddpm(...)` for registration while others skip, causing DDP desync.
- **Fixes applied**:
  1. Removed all OOM try/except guards; let OOM crash the job and rely on checkpoint auto-resume
  2. `DDP(..., find_unused_parameters=True)`: handles detached recovery iterations and conditional registration
  3. `dist.all_reduce(regist_flag, op=ReduceOp.MIN)`: all 64 ranks collectively decide whether to run registration. Only runs when ALL ranks agree.
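The MIN reduction in fix 3 can be sketched as follows. The two pure helpers model the decision without any distributed runtime. `sync_registration_flag` shows the corresponding collective call and assumes an already initialized process group. All function names are hypothetical, and the default threshold is the -0.6 quoted in this section.

```python
def local_flag(loss_ang: float, threshold: float = -0.6) -> int:
    """1 if this rank would run registration on its own, else 0."""
    return 1 if loss_ang < threshold else 0


def min_reduce(flags: list[int]) -> int:
    """Value an all-reduce with ReduceOp.MIN leaves on every rank."""
    return min(flags)


def sync_registration_flag(flag: int, device: str) -> bool:
    """Collective version; every rank must call it at the same point."""
    # Imported lazily so the pure helpers above run without torch installed.
    import torch
    import torch.distributed as dist

    t = torch.tensor([flag], dtype=torch.int32, device=device)
    dist.all_reduce(t, op=dist.ReduceOp.MIN)
    return bool(t.item())
```

Because the minimum is 1 only when every rank contributes 1, a single rank above the threshold disables registration for that step on all ranks, so no rank enters the registration forward pass alone.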
### 10. XPU OOM β Allocator 70% Memory Cap
- **Jobs**: 25530886 through 25780909
- **Error**: `UR_RESULT_ERROR_OUT_OF_RESOURCES` at step 12-14 of each epoch
- **Root cause**: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug ([torch-xpu-ops#1543](https://github.com/intel/torch-xpu-ops/issues/1543)). Works on 4x 48GB CUDA GPUs because CUDA allocator uses nearly all device memory.
- **Key diagnostic** (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps β no fragmentation, no creep. OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap.
- **Resolution**: Gradient checkpointing reduced peak from 43 → 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed.
---
## Issues Resolved (2026-03-19 to 2026-03-20)
### 1. torchrun Permission Denied
- **Fix**: Switched to `python -m torch.distributed.run`, then later to direct `srun` launch
### 2. GPUS_PER_NODE Mismatch
- **Fix**: `--nodes=8 --ntasks-per-node=8` for 64 total XPU tiles (4 cards x 2 tiles/card)
### 3. `.to(rank)` Sends to CUDA Not XPU
- **Fix**: Changed to `.to(f"{DEVICE_TYPE}:{rank}")`
### 4. No DistributedSampler
- **Fix**: Added `DistributedSampler` for both dataloaders + `set_epoch()` per epoch
### 5. CCL Backend Not Found
- **Fix**: dn-mo1 rebuilt the conda env with compatible packages
### 6. MPI/PMI Init Failure
- **Fix**: Switched from torchrun to direct `srun --ntasks-per-node=8` with SLURM env var mapping
### 7. CCL Worker Thread Startup Failure
- **Fix**: Increased `--cpus-per-task=12` + `CCL_WORKER_AFFINITY=auto`
### 8. gloo Backend Incompatible with XPU
- **Fix**: Must use `ccl` backend for XPU DDP
### 9. Print Spam from All Ranks
- **Fix**: Guarded prints with `gpu_id == 0`
---
## Working Configuration Summary
```bash
# SLURM
--nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12
# Environment
I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so
I_MPI_HYDRA_BOOTSTRAP=slurm
CCL_WORKER_AFFINITY=auto
PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512
# Launch
srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...'
# DDP
backend = "ccl"
DDP(model, device_ids=[rank], find_unused_parameters=True)
# XPU autograd leak workaround
# - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True)
# - Mid-epoch checkpoint every 10 steps to Models/.../tmp/
# - SLURM script auto-resubmits on crash (sbatch at end of script)
# - No pre-allocation (gradient checkpointing keeps peak under 70% cap)
```
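On the Python side, the launch line's variable mapping can be mirrored by reading the SLURM task variables directly. This is a minimal sketch with hypothetical helper names. It assumes the device index passed to `.to(...)` is the node-local rank.

```python
import os


def ddp_env_from_slurm(env=os.environ) -> dict:
    """Translate SLURM task variables into the names exported by the launch line."""
    return {
        "LOCAL_RANK": int(env["SLURM_LOCALID"]),
        "RANK": int(env["SLURM_PROCID"]),
        "WORLD_SIZE": int(env["SLURM_NTASKS"]),
    }


def device_string(device_type: str, local_rank: int) -> str:
    """Device string in the "<type>:<index>" form, e.g. "xpu:3"."""
    return f"{device_type}:{local_rank}"
```

For the configuration above (8 nodes, 8 tasks per node), `WORLD_SIZE` would be 64 and `LOCAL_RANK` would range from 0 to 7 on each node.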
## Previous Test Jobs
| Job ID | Date | Nodes | Status | Issue |
|--------|------|-------|--------|-------|
| 25379684 | 03-18 | 16 | FAILED | torchrun permission denied |
| 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) |
| 25433261 | 03-19 | 8 | FAILED | xccl not built in |
| 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure |
| 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure |
| 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) |
| 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault |
| 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible |
| 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails |
| 25525830 | 03-20 | 1 | **SUCCESS** | First working run: ccl + 12 CPUs/task |
| 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 |
| 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) |
| 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 |
| 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added |
| 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 |
| 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP |
| 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration |
| 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state |
| 25709947 | 03-21 | 8 | HUNG | Registration conditional desync |
| 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync: no hang! OOM step 14. |
| 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) |
| 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB) → OOM step 3, WORSE |
| 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc → diffusion OK (43.3 GiB), contra OOM step 10 |
| 25823544 | 03-21 | 8 | FAILED | del tensors before contra → UnboundLocalError bug |
| 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed |
| 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak |
| 25825585 | 03-22 | 8 | FAILED | no_sync(): same leak rate; NOT from all-reduce |
| 25825861 | 03-22 | 8 | FAILED | gc.collect+sync+empty_cache: no effect on leak |
| 25826494 | 03-22 | 1 | **DIAG** | **Root cause found**: fwd=no leak, bwd=1.0 GiB/step leak. XPU autograd bug. |
| 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43 → 22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). |
| 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at step 10,20. |
| 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10,20. srun hung after crash (no kill-on-bad-exit). |
| 25892717 | 03-22 | 8 | HUNG → CANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. |
| 25898349 | 03-22 | 8 | CANCELLED (Strat A v1) | Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc). |
| 25898356 | 03-22 | 1 | CRASHED (Strat B v1) | Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). |
| **25899114** | **03-23** | **1** | **PENDING (Strat B v2)** | **Full leak validation: no restart, dummy data, all fixes. Goal: survive 2h/80+ steps.** |
| 25898957 | 03-23 | 8 | CRASH-LOOPED (Strat A v2) | Epoch 6 step 9 then crash-loop ×63. torch.load rejects numpy RNG in checkpoint. |
| 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict: ran simultaneously with 25898957. |
| 25899114 | 03-23 | 1 | CANCELLED | Strategy B: cancelled with 25898957 for code fix. |
| **25899265** | **03-23** | **8** | **RUNNING (Strat A v3) VALIDATED** | **100+ epochs, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime.** |
| 25899266 | 03-23 | 1 | COMPLETED (Strat B v3) | 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). |
| **25916258** | **03-23** | **1** | **RUNNING (Strat B v4)** | **CCL cache fix (10000 handles). Goal: survive full 2h walltime.** |