| # OmniMorph Multi-Node XPU Training Progress |
|
|
| ## Active Jobs (2026-03-28) |
|
|
| ### Job 26457072: 8-node production training (RUNNING) |
| - **Date**: 2026-03-28 |
| - **Status**: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime |
| - **Script**: `bash_train_multi_nodes.sh` (srun timeout reduced to 1.5h) |
| - **Config**: `all_om_net`, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice |
| - **Data**: 14,814 diffusion + 2,583 registration (full real data) |
| - **Steps/epoch**: 41 (14,814 samples ÷ (batchsize 2 × 64 tiles) ÷ `DIFF_REG_BATCH_RATIO`), **~108 s/step** → ~1h14m per epoch |
| - **Changes from previous job**: |
| - Contrastive `clip_grad_norm` max_norm: 0.02 → **1e-3** (avoid contrastive dominating training) |
| - Registration activation threshold: -0.6 → **-0.7** (stricter gate) |
| - `dist.all_reduce` → `dist.broadcast(src=0)` for NaN sync + registration gate (CCL hang workaround) |
| - srun timeout: 7200s → **5400s** (1.5h, reduces CCL hang waste from 46min to ~16min) |
| - Excluded pvc-s-135 (hardware failure: "No XPU devices are available") |
| - **Logs**: `Logs/train_multi_26457072.out` |
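The timing figures in this entry follow from simple arithmetic. A minimal sketch, using only the numbers quoted above (41 steps/epoch, ~108 s/step, and the two srun timeouts) and assuming the hang starts right at the end of the first epoch of each srun; the function names are illustrative and not taken from the training scripts:

```python
STEPS_PER_EPOCH = 41
SECONDS_PER_STEP = 108  # approximate value from the log


def epoch_seconds(steps=STEPS_PER_EPOCH, sec_per_step=SECONDS_PER_STEP):
    """Wall-clock seconds for one full epoch."""
    return steps * sec_per_step


def hang_waste_minutes(srun_timeout_s):
    """Minutes spent hung between the end of the first epoch and the timeout kill."""
    return (srun_timeout_s - epoch_seconds()) / 60


epoch_minutes = epoch_seconds() / 60          # about 74 min, i.e. ~1h14m
waste_old = hang_waste_minutes(7200)          # about 46 min with the 2h timeout
waste_new = hang_waste_minutes(5400)          # about 16 min with the 1.5h timeout
```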
|
|
| ### CCL Epoch-Boundary Hang (ongoing issue) |
| - **Pattern**: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op. |
| - **Workaround**: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h. |
| - **Root cause**: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context. |
| - **Impact**: ~16min wasted per epoch (hang time before timeout), but training progresses reliably. |
| - **TODO**: Investigate reinitializing CCL mid-training (`destroy_process_group` + `init_process_group`) to avoid the restart overhead. |
|
|
| ### Single-node CCL hang (unresolved) |
| - Single-node job (26446050) hung at step 1 on explicit `dist.all_reduce` calls. |
| - Changed to `dist.broadcast(src=0)` but never verified (cancelled when multi-node started). |
| - Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch). |
| - **TODO**: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport. |
|
|
| ### Checkpoint status |
| - **Latest**: `000036_all_om_net.pth` (Mar 28, 2.9 GB) |
| - **Total**: epochs 3-36 in `Models/all_om_net/` (85+ GB) |
| - **CUDA copies**: epochs 0,1,2,8,11 in `Models/all_om_net/cuda_ckpts/` |
| - **Older model**: epochs 0-10 in `Models/all_recmulmodmutattnnet/` (pre-om_net migration) |
| |
| ### Loss history (real data, 64 XPU tiles) |
| | Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf | |
| |-------|-----|------|-------|-------------|---------------|--------|--------|-----| |
| | 3-7 | -0.02→-0.10 | 1.50→1.02 | — | 9.2e-4 | 0.0 | — | — | — | |
| | 8-11 | — | — | — | — | — | — | — | — (CUDA) | |
| | 31 | — | — | — | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 | |
| | 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 | |
| |
| Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37). |
| |
| --- |
| |
| ### Previous Job 25899265: Production training (v3) — COMPLETED (historical) |
| - Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to node path issue) |
| - Validated memory stability and CCL cache fix, but loss data not meaningful |
| |
| ### Previous Strategy A — Job 25898957: CRASH-LOOPED (torch.load bug) |
| - Ran 1 good iteration (epoch 5 step 31→epoch 6 step 9), then crash-looped 63 times |
| - **Root cause**: saving `np.random.get_state()` in checkpoint → numpy arrays → `torch.load` with `weights_only=True` (PyTorch 2.6 default) rejects numpy globals |
| - Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started. |
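One common way to sidestep this class of failure is to store the numpy RNG state as builtin Python types, so the checkpoint contains nothing that `weights_only=True` needs to reject. The log does not say which fix was actually applied, so the helper below (`rng_state_to_plain`, a hypothetical name) is only a sketch. It assumes the legacy tuple layout returned by `np.random.get_state()` and uses a short stand-in tuple so it runs without numpy:

```python
def rng_state_to_plain(state):
    """Convert a legacy MT19937 state tuple into builtin types only.

    `state` is expected to look like np.random.get_state():
    (name, uint32 key array, pos, has_gauss, cached_gaussian).
    """
    name, keys, pos, has_gauss, cached = state
    return {
        "name": str(name),
        "keys": [int(k) for k in keys],  # works for an ndarray or a list
        "pos": int(pos),
        "has_gauss": int(has_gauss),
        "cached_gaussian": float(cached),
    }


# Stand-in for np.random.get_state(); a real state carries 624 keys.
fake_state = ("MT19937", [1, 2, 3], 624, 0, 0.0)
plain = rng_state_to_plain(fake_state)
```

On resume, the `keys` list would be converted back to a uint32 array before being handed to `np.random.set_state`. For checkpoints that are fully trusted, passing `weights_only=False` to `torch.load` is the other obvious option.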
| |
| ### Previous Strategy A — Job 25898957: Production training (v2, all audit fixes) |
| - **Date**: 2026-03-23 |
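| - **Status** (as of 2026-03-23 02:45): RUNNING, epoch 6 in progress. The job later crash-looped on the `torch.load` bug described in the entry above. |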
| - **Script**: `bash_train_multi_nodes.sh` |
| - **Config**: `Config/config_om.yaml` (om_net, img_size=128, batchsize=2, device=xpu) |
| - **Resources**: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task |
| - **Walltime**: 36:00:00 |
| - **Progress** (as of 2026-03-23 02:45): |
| - Completed epoch 5 (full, no restart needed), checkpoint `000005_all_om_net.pth` saved |
| - Epoch 6 step 2 in progress, memory stable at ~46 GiB free |
| - **Zero restarts triggered** — leak rate ~0.07 GiB/step allows full epochs without OOM |
| - **Fixes applied** (10 bugs found across 4 independent audits): |
| 1. **(Critical) Optimizer on all DDP ranks** — all ranks load checkpoint |
| 2. **(Critical) XPU device RNG saved/restored** — `torch.xpu.get_rng_state()` in checkpoint |
| 3. **(Critical) RNG restored after DataLoader skip** — prevents `__getitem__` corruption |
| 4. **(Critical) `loss_nan_step` not overwritten** — guarded with else branch |
| 5. **(Critical) Off-by-one in step skip** — `step <= initial_step` with `initial_step > 0` guard |
| 6. **(Low) tmp/ cleanup** — DDP race + stale checkpoint fixes |
| 7. **(Medium) Per-rank RNG divergence** — non-rank-0 re-seeded (CPU + XPU device RNG) |
| 8. **(Critical) Step 0 skipped on fresh start** — `initial_step > 0` guard added |
| 9. **(Config) Timeout** — `SRUN_TIMEOUT` 2400 → 7200 |
| 10. **(Low) `total_reg` division by zero** — `max(total_reg, 1)` guard |
| - **Leak rate**: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv) |
| - **Logs**: `Logs/train_multi_25898957.out` |
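Fixes 5 and 8 in the list above both concern the step-skip condition used when resuming mid-epoch. A minimal sketch of that condition as described (`step <= initial_step` with an `initial_step > 0` guard); the function names are illustrative and not taken from `OM_train_3modes.py`:

```python
def should_skip(step, initial_step):
    """True if `step` was already trained before the checkpoint was written."""
    # The `initial_step > 0` guard keeps a fresh start from skipping step 0 (fix 8).
    # Using `<=` rather than `<` avoids re-training the checkpointed step (fix 5).
    return initial_step > 0 and step <= initial_step


def steps_to_train(steps_per_epoch, initial_step):
    """Steps that actually run in the resumed (or fresh) epoch."""
    return [s for s in range(steps_per_epoch) if not should_skip(s, initial_step)]


fresh = steps_to_train(steps_per_epoch=5, initial_step=0)    # [0, 1, 2, 3, 4]
resumed = steps_to_train(steps_per_epoch=5, initial_step=2)  # [3, 4]
```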
| |
| ### Job 25899049: CANCELLED (checkpoint conflict) |
| - Submitted as a continuation but was allocated nodes while 25898957 was still running. Both jobs would have written to the same `Models/all_om_net/`. Cancelled before training started. |
| |
| ### Previous Strategy A — Job 25898349: CANCELLED (had 6 bugs) |
| - **Ran**: 4 restart iterations, reached epoch 5 step 33 |
| - **Issues**: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6) `loss_nan_step` reset after restore |
| - **Checkpoints saved**: `000004_all_om_net.pth` (epoch 4), `000005_step0030_all_om_net.pth` (epoch 5 step 30) |
| - **Memory**: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step) |
| |
| ### Strategy B — Job 25916258: Full validation (v4, CCL cache fix) |
| - **Date**: 2026-03-23 |
| - **Status**: RUNNING |
| - **Resources**: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime |
| - **Fix**: `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` (was default 1000, caused driver segfault at ~400 steps) |
| - **Goal**: Survive full 2h walltime without crash — validates both UpsampleConv fix (no OOM) and CCL cache fix (no segfault) |
| - **Logs**: `Logs/stratB_25916258.out` |
| |
| ### Previous Strategy B — Job 25899266 (v3, segfault at epoch 74) |
| - 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB. |
| - Crashed at epoch 74 step 0: CCL IPC handle cache hit 1000 limit → driver segfault (`drm_neo.cpp:288`). Fixed with `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000`. |
| - **Logs**: `Logs/stratB_25899266.out` |
| |
| ### Previous Strategy B — Job 25898356 (crashed on logging bug) |
| - Ran 6 steps, leak ~0.375 GiB/step. `ZeroDivisionError` on `total_reg=0` (now fixed). |
| |
| ### Previous Job 25892717 — CANCELLED (was hung) |
| - **Status**: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at `loss_contra.backward()`). `srun` hung due to CCL processes not exiting, auto-resubmit never triggered. 4+ hours of walltime wasted idle. |
| - **Last checkpoint**: `Models/all_om_net/tmp/000004_step0020_all_om_net.pth` |
| - **Lesson**: `--kill-on-bad-exit=1` is not reliable for CCL cleanup. Strategy A's `timeout` wrapper solves this. |
| |
| --- |
| |
| ## Implementation Details |
| |
| ### Strategy A: Proactive Restart (backup approach) |
| |
| **Problem**: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via `sbatch` failed because `srun` hangs after CCL rank crash — the bash script never reaches the resubmit logic. |
| |
| **Solution**: Proactive exit + restart loop within the same SLURM allocation. |
| |
| **Changes to `OM_train_3modes.py`** (training behavior unchanged): |
| - Added `EXIT_CODE_RESTART = 42` constant |
| - Added `--max-steps-before-restart N` CLI argument (default 0 = disabled) |
| - Added `steps_since_start` counter in the training loop |
| - After N steps: saves mid-epoch checkpoint → `dist.barrier()` → `dist.destroy_process_group()` → `sys.exit(42)` |
| - `SystemExit(42)` is NOT caught by `except Exception` — propagates cleanly |
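The last point relies on standard Python behaviour: `SystemExit` derives from `BaseException`, not `Exception`, so a broad `except Exception` handler around the training step does not swallow it. A small self-contained illustration (the function names are hypothetical, and the checkpoint and process-group teardown are only indicated by a comment):

```python
import sys

EXIT_CODE_RESTART = 42  # same value as the constant named in the log


def train_step_with_restart():
    try:
        # A real run would save the checkpoint and destroy the process group here.
        sys.exit(EXIT_CODE_RESTART)
    except Exception:
        # Never reached: SystemExit is not a subclass of Exception.
        return "swallowed"


def run():
    """Return the exit code that would propagate to the shell."""
    try:
        train_step_with_restart()
    except SystemExit as exc:
        return exc.code
    return None


exit_code = run()
```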
|
|
| **Changes to `bash_train_multi_nodes.sh`**: |
| - Replaced single `srun` call with a `while` loop (up to 500 iterations) |
| - Each `srun` wrapped with `timeout 2400` (40 min) to catch CCL hangs |
| - Exit code handling: |
| - `0` → training complete, break |
| - `42` → proactive restart, 5s pause, re-launch |
| - `124` → timeout (CCL hang), 10s pause, re-launch |
| - Other → crash (OOM etc.), 10s pause, re-launch from checkpoint |
| - Passes `--max-steps-before-restart 20` to Python |
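The exit-code handling above lives in the bash script. The sketch below restates the same dispatch in Python purely to make the decision table explicit; the action labels and function names are made up for illustration, while the codes and pause lengths are the ones listed above (124 is the status GNU `timeout` returns when it kills the command):

```python
def next_action(exit_code):
    """Return (action, pause_seconds) for one srun exit code."""
    if exit_code == 0:
        return ("done", 0)
    if exit_code == 42:
        return ("restart", 5)                  # proactive restart
    if exit_code == 124:
        return ("restart_after_timeout", 10)   # CCL hang killed by timeout
    return ("restart_after_crash", 10)         # OOM or other crash


def simulate(exit_codes, max_iterations=500):
    """Replay a sequence of srun exit codes through the restart loop."""
    actions = []
    for code in exit_codes[:max_iterations]:
        action, _pause = next_action(code)
        actions.append(action)
        if action == "done":
            break
    return actions


history = simulate([42, 124, 1, 0, 42])
```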
| |
| **Training behavior**: Identical to original. Two independent audits verified that: |
| - Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored |
| - Off-by-one fixed: `step <= initial_step` correctly skips the checkpointed step |
| - RNG restored AFTER DataLoader skip loop (not before) to avoid `__getitem__` corruption |
| - `loss_nan_step`, `total_reg`, and all 9 epoch loss accumulators preserved across restarts |
| - All DDP ranks load optimizer state (not just rank 0) |
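The ordering requirement for the RNG restore can be shown with a toy model. In the sketch below, Python's `random.Random` stands in for the CPU/XPU/numpy generators and `draw` stands in for a `__getitem__` that consumes random numbers; none of these names come from the training code. Restoring the saved state after the skip loop reproduces the uninterrupted sequence, while restoring it before the skip loop does not:

```python
import random


def draw(rng):
    """Stand-in for a dataset __getitem__ that consumes RNG (e.g. augmentation)."""
    return rng.random()


def uninterrupted(seed, n_steps):
    rng = random.Random(seed)
    return [draw(rng) for _ in range(n_steps)]


def run_and_checkpoint(seed, last_step):
    """Train steps 0..last_step, then return (outputs, saved RNG state)."""
    rng = random.Random(seed)
    outputs = [draw(rng) for _ in range(last_step + 1)]
    return outputs, rng.getstate()


def resume(saved_state, initial_step, n_steps, restore_after_skip=True):
    rng = random.Random(999)  # a fresh process starts with an unrelated state
    if not restore_after_skip:
        rng.setstate(saved_state)      # buggy order: restore, then skip
    for _ in range(initial_step + 1):
        draw(rng)                      # the skip loop still consumes RNG
    if restore_after_skip:
        rng.setstate(saved_state)      # fixed order: skip, then restore
    return [draw(rng) for _ in range(initial_step + 1, n_steps)]


full = uninterrupted(seed=0, n_steps=10)
head, state = run_and_checkpoint(seed=0, last_step=3)
good = head + resume(state, initial_step=3, n_steps=10)
bad = head + resume(state, initial_step=3, n_steps=10, restore_after_skip=False)
```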
| |
| ### Strategy B: Leak Rate Diagnostic |
| |
| **Problem**: Need to know whether existing mitigations (gradient checkpointing, `UpsampleConv`, `empty_cache`) reduce the XPU leak rate, or if the leak is fundamental to all backward ops. |
| |
| **Approach**: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps. |
| |
| **Script**: `bash_train_stratB.sh` — 1-node, dummy data, no checkpoint saves, `--max-steps-before-restart 0`. |
| |
| **Training loop structure** (both strategies use the same code): |
| 1. **Diffusion**: `forward → backward → step` (NO gradient clipping) |
| 2. **Contrastive**: `forward → backward → clip(max_norm=0.02) → step` |
| 3. **Registration**: `forward → backward → clip(max_norm=0.1) → step` |
| |
| The three phases are separated by `gc.collect()` + `synchronize()` + `empty_cache()`. |
| |
| ### Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement) |
| - `ConvTranspose3d` backward was identified as leaking ~0.33 GiB/step per layer on XPU |
| - Replaced with `UpsampleConv` (F.interpolate + Conv) in `Diffusion/networks.py` for `OM_net` |
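| - Initial result: **Negligible impact**, with a leak rate of ~1.78 GiB/step before and after |
| - Initial conclusion: `ConvTranspose3d` was NOT the primary leak source and the leak is fundamental to XPU autograd. This was later overturned by the per-operation analysis (job 25893155), which attributes the "negligible impact" reading to differing node conditions and confirms that the fix works. |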
| |
| ### Strategy B: Per-Operation XPU Leak Analysis (job 25893155) |
| |
| Ran `tests/diagnose_xpu_leak_ops.py` on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures `device_free` drift. |
| |
| | Operation | Leak Rate (GiB/step) | Pattern | Verdict | |
| |-----------|---------------------|---------|---------| |
| | **ConvTranspose3d** (256→3, 8³→128³) | **0.335** | **Linear, persistent** | **LEAKS — primary source** | |
| | Full OM_net (rec_num=2, 128³) | **1.15** | **Linear, OOM at step 17** | **LEAKS — aggregated** | |
| | Stacked Conv3d encoder (1→256, 128³→8³) | 0.013 | One-time alloc | OK (initial alloc only) | |
| | F.grid_sample (128³) | 0.007 | One-time alloc | OK | |
| | Conv3d (16→32, 64³) | 0.004 | One-time alloc | OK | |
| | MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK | |
| | Adam optimizer only | 0.0005 | One-time alloc | OK | |
| | **F.interpolate** trilinear (32³→256³) | **0.000** | **No leak** | **ZERO LEAK** | |
| | **UpsampleConv** (256→3, 8³→128³) | Not tested (env error) | — | Expected zero (uses F.interpolate) | |
|
|
| **Key findings:** |
| 1. **`ConvTranspose3d` backward is the dominant leaker** — 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step. |
| 2. **`F.interpolate` has ZERO leak** — confirming that `UpsampleConv` (which uses F.interpolate + Conv) is the correct fix. |
| 3. **All other ops have only one-time allocations** (no linear drift) — Conv3d, grid_sample, attention, Adam are all clean. |
| 4. **Full OM_net leaks 1.15 GiB/step** — consistent with `ConvTranspose3d` × 5 layers plus minor contributions from other ops. |
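The distinction between a linear leak and a one-time allocation in the table can be made mechanical. The sketch below fits a least-squares slope to a series of `device_free` readings and checks whether memory is still dropping over the last few readings. It is not the logic of `tests/diagnose_xpu_leak_ops.py`; the `tail` and `tol` thresholds are arbitrary assumptions and both example series are synthetic (the first is shaped like the 62.97 → 40.53 GiB forward+backward run):

```python
def leak_rate_gib_per_step(device_free_gib):
    """Least-squares slope of free memory vs. step, sign-flipped so a leak is positive."""
    n = len(device_free_gib)
    mean_x = (n - 1) / 2
    mean_y = sum(device_free_gib) / n
    num = sum((i - mean_x) * (y - mean_y) for i, y in enumerate(device_free_gib))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return -num / den


def classify(device_free_gib, tail=10, tol=0.01):
    """'linear' if memory is still dropping over the last `tail` readings, else 'one-time'."""
    tail_rate = leak_rate_gib_per_step(device_free_gib[-tail:])
    return "linear" if tail_rate > tol else "one-time"


# Synthetic series: a steady 1.122 GiB/step drop over 20 steps ...
linear_series = [62.97 - 1.122 * i for i in range(21)]
# ... and a single 0.26 GiB allocation on the first step, flat afterwards.
one_time_series = [62.97] + [62.71] * 20

linear_rate = leak_rate_gib_per_step(linear_series)
```

Averaged over 20 steps, a one-time allocation still produces a small non-zero rate, which is presumably why the table reports a Pattern column alongside the rate.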
| |
| **Component-level diagnostic** (job 25826494): |
| - Forward only (no backward): **ZERO leak** — 62.75 GiB stable for 20 steps |
| - Forward + backward: **1.12 GiB/step** — 62.97 → 40.53 GiB over 20 steps |
| - Confirms the leak is in the autograd backward pass, specifically in `ConvTranspose3d` backward kernel. |
| |
| **Why the leak rate dropped in recent runs** (jobs 25898349/25898957): The `UpsampleConv` fix in `OM_net` replaced all 5 decoder `ConvTranspose3d` layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is from minor one-time allocations that stabilize quickly. |
|
|
| **Why the old runs (25892717) still leaked 1.78 GiB/step**: The diagnostic `diagnose_xpu_leak_ops.py` test used the OLD `RecMulModMutAttnNet` with `ConvTranspose3d`. The `OM_net` class (used in production training) already had `UpsampleConv`. The earlier measurement of "negligible impact" was incorrect — the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective. |
|
|
| ### Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps) |
| - **Job**: 25899266 (Strategy B v3) |
| - **Symptom**: GPU segfault (`drm_neo.cpp:288`) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free). |
| - **Root cause**: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver. |
| - **Warning before crash**: `CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000` |
| - **Fix**: `export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` in SLURM scripts. Also added to Strategy A's `bash_train_multi_nodes.sh`. |
| - **Note**: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs. |
|
|
| --- |
|
|
| ## Scripts Directory |
|
|
| | File | Purpose | |
| |------|---------| |
| | `bash_train_multi_nodes.sh` | **Strategy A**: 8-node production training with proactive restart loop | |
| | `bash_train_stratB.sh` | **Strategy B**: 1-node leak rate diagnostic (no restart, dummy data) | |
| | `bash_infer.sh` | Inference / augmentation SLURM job | |
| | `bash_diagnose_leak.sh` | Submits `tests/diagnose_xpu_leak.py` to diagnose per-component leak | |
| | `bash_diagnose_ops.sh` | Submits `tests/diagnose_xpu_leak_ops.py` for per-operation leak analysis | |
| | `bash_verify_fix.sh` | Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) | |
| | `bash_compare_opt.sh` | Speed comparison: optimized vs original 3-mode training | |
| | `bash_compare_orig.sh` | Speed comparison: original 3-mode training baseline | |
| | `tests/diagnose_xpu_leak.py` | Component-level leak test (network, DeformDDPM, DDP) | |
| | `tests/diagnose_xpu_leak_ops.py` | Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) | |
| | `tests/test_3modes_opt_equivalence.py` | Verifies optimized training matches original | |
| | `tests/test_mslncc.py` | MSLNCC loss function unit tests | |
| | `tests/compare_3modes_speed.py` | Speed benchmark for 3-mode training variants | |
|
|
| --- |
|
|
| ## Issues Resolved (2026-03-22) |
|
|
| ### 14. XPU Autograd Engine Memory Leak — ~1.0 GiB/Step (ROOT CAUSE IDENTIFIED) |
| - **Jobs**: All XPU training jobs; diagnosed in 25826494 |
| - **Symptom**: `device_free` (via `torch.xpu.mem_get_info`) decreases linearly at ~1.0 GiB/step. Memory is outside PyTorch's caching allocator — not tracked by `memory_allocated` / `memory_reserved`. |
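| - **Root cause**: **PyTorch XPU autograd engine bug**. The `loss.backward()` call leaks device memory on every invocation (later narrowed to the `ConvTranspose3d` backward kernel by the per-operation analysis, job 25893155). Confirmed by isolated diagnostic (`tests/diagnose_xpu_leak.py`): |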
| - Test 1 (forward only, no backward): **NO LEAK** — device_free perfectly stable at 62.75 GiB over 20 steps |
| - Test 2a (forward + backward, no optimizer): **LEAK** — 1.0 GiB/step (62.97 → 42.98 GiB over 20 steps) |
| - Test 2b (forward + backward + optimizer.step): **LEAK** — 1.1 GiB/step (slightly worse) |
| - **NOT caused by**: CCL all-reduce (`no_sync()` showed identical leak rate), DDP (leak occurs without DDP), garbage collection (`gc.collect()` had no effect), caching allocator (`empty_cache()` had no effect), deferred ops (`synchronize()` had no effect) |
| - **Why it works on CUDA**: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime). |
| - **Workaround applied**: |
| 1. Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 → 26 GiB, buying ~26 steps before OOM |
| 2. Mid-epoch checkpoints every 10 steps to `tmp/` subfolder |
| 3. Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets) |
| - **Upstream**: Should be reported to [intel/torch-xpu-ops](https://github.com/intel/torch-xpu-ops) with `tests/diagnose_xpu_leak.py` as minimal reproduction |
|
|
| ### 13. Pre-allocation Approach — Wrong Direction |
| - **Jobs**: 25799043 (92%), 25823021 (78%) |
| - **Finding**: Pre-allocating device memory into PyTorch's caching allocator REDUCED available memory for the autograd leak, causing EARLIER crashes. 92% → step 3, 78% → step 10, none → step 15. |
| - **Resolution**: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap). |
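A back-of-envelope model makes the finding plausible: if the leak can only consume memory outside the pre-allocated pool, the number of surviving steps is roughly that headroom divided by the leak rate. The sketch below assumes a 64 GiB tile (implied by 44.8 GiB being 70%), the ~1.3 GiB/step rate reported for jobs 25823710 and 25824128, and for the no-pre-allocation case an idle `device_free` of 62.97 GiB minus the 43 GiB pre-checkpointing peak. This is an illustrative model, not something the jobs measured directly, and its agreement with the observed steps 3, 10 and 15 may be partly coincidental:

```python
DEVICE_TOTAL_GIB = 64.0   # assumed per-tile capacity (44.8 GiB is 70% of it)
LEAK_GIB_PER_STEP = 1.3   # rate reported for jobs 25823710 / 25824128


def headroom_with_prealloc(fraction, total=DEVICE_TOTAL_GIB):
    """Memory left outside a pre-allocated pool covering `fraction` of the device."""
    return total * (1.0 - fraction)


def steps_until_oom(headroom_gib, leak=LEAK_GIB_PER_STEP):
    """Whole steps that fit before the leak exhausts `headroom_gib`."""
    return int(headroom_gib // leak)


steps_92 = steps_until_oom(headroom_with_prealloc(0.92))
steps_78 = steps_until_oom(headroom_with_prealloc(0.78))
# No pre-allocation: idle free memory minus the 43 GiB training peak.
steps_none = steps_until_oom(62.97 - 43.0)
```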
|
|
| ### 12. Contrastive Backward OOM — Diffusion Tensors Not Freed |
| - **Jobs**: 25823021, 25823710 |
| - **Finding**: `del pre_dvf_I, dvf_I, trm_pred` was placed AFTER the contrastive step. During `loss_contra.backward()`, diffusion output tensors were still alive, pushing peak above the limit. |
| - **Fix**: Moved `del` + `empty_cache()` to BEFORE the contrastive step. Also save `loss_gen_a.item()` before deleting since it's needed for registration decision. |
|
|
| --- |
|
|
| ## Issues Resolved (2026-03-20 to 2026-03-21) |
|
|
| ### 11. DDP Collective Hangs and Registration Desync |
| - **Jobs**: 25699197, 25709947 (hung), 25704670, 25706470 |
| - **Symptoms**: Job hangs after step 1 (log files stop growing); or `Expected to have finished reduction` error |
| - **Root cause 1**: OOM try/except guards with `continue` skip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. **OOM guards are fundamentally incompatible with DDP.** |
| - **Root cause 2**: Registration conditional block (`loss_gen_a.item() < -0.6`) differs per rank — some ranks call `Deformddpm(...)` for registration while others skip, causing DDP desync. |
| - **Fixes applied**: |
| 1. Removed all OOM try/except guards — let OOM crash the job; rely on checkpoint auto-resume |
| 2. `DDP(..., find_unused_parameters=True)` — handles detached recovery iterations and conditional registration |
| 3. `dist.all_reduce(regist_flag, op=ReduceOp.MIN)` — all 64 ranks collectively decide whether to run registration. Only runs when ALL ranks agree. |
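The effect of the collective gate in fix 3 can be emulated without any distributed runtime. The sketch below models `all_reduce(..., op=MIN)` over 0/1 flags as a plain `min`, and also shows the `broadcast(src=0)` variant adopted later as a CCL hang workaround, where every rank follows rank 0's decision. The function names and the example loss values are made up for illustration:

```python
def all_ranks_agree(per_rank_loss, threshold=-0.6):
    """Emulate all_reduce(flag, op=MIN): True only if every rank's flag is 1."""
    flags = [1 if loss < threshold else 0 for loss in per_rank_loss]
    return min(flags) == 1


def rank0_decides(per_rank_loss, threshold=-0.7):
    """Emulate broadcast(flag, src=0): every rank adopts rank 0's decision."""
    return per_rank_loss[0] < threshold


mixed = [-0.70, -0.65, -0.50]      # one rank is above the threshold
unanimous = [-0.70, -0.65, -0.61]  # all ranks are below it

run_mixed = all_ranks_agree(mixed)
run_unanimous = all_ranks_agree(unanimous)
```

Either way, all ranks take the same branch, which is what prevents the desync. The two variants differ in who decides: MIN requires unanimity, while the broadcast version lets rank 0's loss alone open the gate.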
|
|
| ### 10. XPU OOM — Allocator 70% Memory Cap |
| - **Jobs**: 25530886 through 25780909 |
| - **Error**: `UR_RESULT_ERROR_OUT_OF_RESOURCES` at step 12-14 of each epoch |
| - **Root cause**: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug ([torch-xpu-ops#1543](https://github.com/intel/torch-xpu-ops/issues/1543)). Works on 4x 48GB CUDA GPUs because CUDA allocator uses nearly all device memory. |
| - **Key diagnostic** (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps — no fragmentation, no creep. OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap. |
| - **Resolution**: Gradient checkpointing reduced peak from 43 → 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed. |
|
|
| --- |
|
|
| ## Issues Resolved (2026-03-19 to 2026-03-20) |
|
|
| ### 1. torchrun Permission Denied |
| - **Fix**: Switched to `python -m torch.distributed.run`, then later to direct `srun` launch |
|
|
| ### 2. GPUS_PER_NODE Mismatch |
| - **Fix**: `--nodes=8 --ntasks-per-node=8` for 64 total XPU tiles (8 nodes x 4 cards/node x 2 tiles/card) |
|
|
| ### 3. `.to(rank)` Sends to CUDA Not XPU |
| - **Fix**: Changed to `.to(f"{DEVICE_TYPE}:{rank}")` |
|
|
| ### 4. No DistributedSampler |
| - **Fix**: Added `DistributedSampler` for both dataloaders + `set_epoch()` per epoch |
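To see why `set_epoch()` matters, here is a simplified pure-Python model of per-rank sharding in the spirit of `DistributedSampler`: every rank builds the same permutation from `seed + epoch` and then takes a strided slice. It leaves out the padding the real sampler adds to make shards equal in size, and `shard_indices` is a hypothetical helper, not part of the training code:

```python
import random


def shard_indices(n_samples, rank, world_size, epoch, seed=0):
    """Indices seen by `rank` in `epoch` under an epoch-seeded global shuffle."""
    order = list(range(n_samples))
    random.Random(seed + epoch).shuffle(order)  # identical permutation on every rank
    return order[rank::world_size]


# The shards of one epoch are disjoint and together cover the dataset.
shards_epoch0 = [shard_indices(100, r, 4, epoch=0) for r in range(4)]
covered = sorted(i for shard in shards_epoch0 for i in shard)

# Without advancing the epoch, a rank would see the same samples in the same order.
same_epoch = shard_indices(100, 0, 4, epoch=0)
next_epoch = shard_indices(100, 0, 4, epoch=1)
```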
|
|
| ### 5. CCL Backend Not Found |
| - **Fix**: dn-mo1 rebuilt the conda env with compatible packages |
|
|
| ### 6. MPI/PMI Init Failure |
| - **Fix**: Switched from torchrun to direct `srun --ntasks-per-node=8` with SLURM env var mapping |
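The SLURM env var mapping is done in the shell by the launch line, but the same mapping can be sketched in Python to show what each rank ends up with. The helper name is hypothetical, and building the device string from the local rank (rather than the global rank) is an assumption:

```python
import os


def ddp_env_from_slurm(environ=None, device_type="xpu"):
    """Derive DDP rank information from the variables srun sets for each task."""
    env = os.environ if environ is None else environ
    local_rank = int(env["SLURM_LOCALID"])
    return {
        "LOCAL_RANK": local_rank,
        "RANK": int(env["SLURM_PROCID"]),
        "WORLD_SIZE": int(env["SLURM_NTASKS"]),
        "device": f"{device_type}:{local_rank}",
    }


# Example: sixth task on the second of eight 8-tile nodes.
example = ddp_env_from_slurm(
    {"SLURM_LOCALID": "5", "SLURM_PROCID": "13", "SLURM_NTASKS": "64"}
)
```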
|
|
| ### 7. CCL Worker Thread Startup Failure |
| - **Fix**: Increased `--cpus-per-task=12` + `CCL_WORKER_AFFINITY=auto` |
|
|
| ### 8. gloo Backend Incompatible with XPU |
| - **Fix**: Must use `ccl` backend for XPU DDP |
|
|
| ### 9. Print Spam from All Ranks |
| - **Fix**: Guarded prints with `gpu_id == 0` |
|
|
| --- |
|
|
| ## Working Configuration Summary |
|
|
| ```bash |
| # SLURM |
| --nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12 |
| |
| # Environment |
| I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so |
| I_MPI_HYDRA_BOOTSTRAP=slurm |
| CCL_WORKER_AFFINITY=auto |
| PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512 |
| CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000  # added after Issue 15 (default 1000) |
| |
| # Launch |
| srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...' |
| |
| # DDP (set in the Python training script, shown here for reference) |
| #   backend = "ccl" |
| #   DDP(model, device_ids=[rank], find_unused_parameters=True) |
| |
| # XPU autograd leak workaround |
| # - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True) |
| # - Mid-epoch checkpoint every 10 steps to Models/.../tmp/ |
| # - Restart on crash: originally sbatch auto-resubmit at end of script; superseded by the timeout-wrapped srun restart loop (Strategy A) |
| # - No pre-allocation (gradient checkpointing keeps peak under 70% cap) |
| ``` |
|
|
| ## Previous Test Jobs |
|
|
| | Job ID | Date | Nodes | Status | Issue | |
| |--------|------|-------|--------|-------| |
| | 25379684 | 03-18 | 16 | FAILED | torchrun permission denied | |
| | 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) | |
| | 25433261 | 03-19 | 8 | FAILED | xccl not built in | |
| | 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure | |
| | 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure | |
| | 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) | |
| | 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault | |
| | 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible | |
| | 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails | |
| | 25525830 | 03-20 | 1 | **SUCCESS** | First working run: ccl + 12 CPUs/task | |
| | 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 | |
| | 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) | |
| | 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 | |
| | 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added | |
| | 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 | |
| | 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP | |
| | 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration | |
| | 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state | |
| | 25709947 | 03-21 | 8 | HUNG | Registration conditional desync | |
| | 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync — no hang! OOM step 14. | |
| | 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) | |
| | 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB) — OOM step 3, WORSE | |
| | 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc — diffusion OK (43.3 GiB), contra OOM step 10 | |
| | 25823544 | 03-21 | 8 | FAILED | del tensors before contra — UnboundLocalError bug | |
| | 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed | |
| | 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak | |
| | 25825585 | 03-22 | 8 | FAILED | no_sync() — same leak rate; NOT from all-reduce | |
| | 25825861 | 03-22 | 8 | FAILED | gc.collect+sync+empty_cache — no effect on leak | |
| | 25826494 | 03-22 | 1 | **DIAG** | **Root cause found**: fwd=no leak, bwd=1.0 GiB/step leak. XPU autograd bug. | |
| | 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43→22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). | |
| | 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at step 10,20. | |
| | 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10,20. srun hung after crash (no kill-on-bad-exit). | |
| | 25892717 | 03-22 | 8 | HUNG→CANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. | |
| | 25898349 | 03-22 | 8 | CANCELLED (Strat A v1) | Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc). | |
| | 25898356 | 03-22 | 1 | CRASHED (Strat B v1) | Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). | |
| | 25898957 | 03-23 | 8 | CRASH-LOOPED (Strat A v2) | Epoch 6 step 9 then crash-loop ×63. torch.load rejects numpy RNG in checkpoint. | |
| | 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict: ran simultaneously with 25898957. | |
| | 25899114 | 03-23 | 1 | CANCELLED (Strat B v2) | Planned full leak validation (no restart, dummy data, all fixes; goal: survive 2h/80+ steps). Cancelled with 25898957 for code fix. | |
| | **25899265** | **03-23** | **8** | **COMPLETED (Strat A v3), VALIDATED** | **100+ epochs on dummy data, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime.** | |
| | 25899266 | 03-23 | 1 | COMPLETED (Strat B v3) | 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). | |
| | **25916258** | **03-23** | **1** | **RUNNING (Strat B v4)** | **CCL cache fix (10000 handles). Goal: survive full 2h walltime.** | |
| |