# OmniMorph Multi-Node XPU Training Progress

## Active Jobs (2026-03-28)

### Job 26457072: 8-node production training (RUNNING)
- **Date**: 2026-03-28
- **Status**: RUNNING on pvc-s-[41-42,118,125,129,136,140-141], 36h walltime
- **Script**: `bash_train_multi_nodes.sh` (srun timeout reduced to 1.5h)
- **Config**: `all_om_net`, img_size=128, batchsize=2, timesteps=80, lr=1e-5, condition_type=slice
- **Data**: 14,814 diffusion + 2,583 registration (full real data)
- **Steps/epoch**: 41 (14814 / (2 × 64) ÷ DIFF_REG_BATCH_RATIO steps), **~108s/step** → ~1h14m per epoch
- **Changes from previous job**:
  - Contrastive `clip_grad_norm` max_norm: 0.02 → **1e-3** (avoid contrastive dominating training)
  - Registration activation threshold: -0.6 → **-0.7** (stricter gate)
  - `dist.all_reduce` → `dist.broadcast(src=0)` for NaN sync + registration gate (CCL hang workaround)
  - srun timeout: 7200s → **5400s** (1.5h, reduces CCL hang waste from 46min to ~16min)
  - Excluded pvc-s-135 (hardware failure: "No XPU devices are available")
- **Logs**: `Logs/train_multi_26457072.out`
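
The epoch time and the hang-waste figures above follow from simple arithmetic. A minimal sketch that reproduces them from the numbers quoted in this entry; the function names are illustrative and do not come from the training scripts, and the waste estimate assumes the hang starts right at the epoch boundary:

```python
def epoch_seconds(steps_per_epoch: int, seconds_per_step: float) -> float:
    """Wall-clock time for one epoch."""
    return steps_per_epoch * seconds_per_step


def hang_waste_minutes(timeout_s: float, epoch_s: float) -> float:
    """Minutes spent hung before `timeout` kills srun.

    Assumes one epoch completes and the process then hangs until the timeout.
    """
    return (timeout_s - epoch_s) / 60


epoch_s = epoch_seconds(41, 108)
print(f"epoch length: {epoch_s:.0f} s ({epoch_s / 3600:.2f} h)")
print(f"waste with 7200 s timeout: {hang_waste_minutes(7200, epoch_s):.0f} min")
print(f"waste with 5400 s timeout: {hang_waste_minutes(5400, epoch_s):.0f} min")
```

With a 5400 s timeout each srun iteration costs one epoch plus roughly a quarter of an hour of idle time, which matches the "1 epoch per ~1.5h" effective rate.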

### CCL Epoch-Boundary Hang (ongoing issue)
- **Pattern**: First epoch per srun completes fine (~1h14m). Second epoch always hangs at a CCL collective op.
- **Workaround**: srun timeout kills the hung process, bash loop restarts with fresh CCL state. Effective rate: 1 epoch per ~1.5h.
- **Root cause**: Likely CCL/Level Zero IPC handle leak or state corruption after ~200+ collective ops. Fresh process resets L0 context.
- **Impact**: ~16min wasted per epoch (hang time before timeout), but training progresses reliably.
- **TODO**: Investigate reinitializing CCL mid-training (`destroy_process_group` + `init_process_group`) to avoid the restart overhead.

### Single-node CCL hang (unresolved)
- Single-node job (26446050) hung at step 1 on explicit `dist.all_reduce` calls.
- Changed to `dist.broadcast(src=0)` but never verified (cancelled when multi-node started).
- Single-node epochs take ~10h (8 tiles), so the restart-per-epoch strategy doesn't work (2h timeout kills mid-epoch).
- **TODO**: Test broadcast fix on single node; if still hangs, investigate CCL intra-node transport.

### Checkpoint status
- **Latest**: `000036_all_om_net.pth` (Mar 28, 2.9 GB)
- **Total**: epochs 3-36 in `Models/all_om_net/` (85+ GB)
- **CUDA copies**: epochs 0,1,2,8,11 in `Models/all_om_net/cuda_ckpts/`
- **Older model**: epochs 0-10 in `Models/all_recmulmodmutattnnet/` (pre-om_net migration)
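
Picking the checkpoint to resume from can be sketched as below. The filename pattern is inferred from names that appear in this log (`000036_all_om_net.pth`, `000005_step0030_all_om_net.pth`); the helper names are hypothetical and the real resume code may work differently. An end-of-epoch file is treated as newer than any mid-epoch file of the same epoch.

```python
import re
from typing import List, Optional, Tuple

# Pattern inferred from filenames in this log; an assumption, not the real code.
CKPT_RE = re.compile(
    r"^(?P<epoch>\d{6})(?:_step(?P<step>\d{4}))?_(?P<model>.+)\.pth$"
)


def parse_ckpt(name: str) -> Optional[Tuple[int, Optional[int], str]]:
    """Return (epoch, step, model); step is None for end-of-epoch files."""
    m = CKPT_RE.match(name)
    if m is None:
        return None
    step = int(m.group("step")) if m.group("step") else None
    return int(m.group("epoch")), step, m.group("model")


def sort_key(name: str) -> Tuple[int, float]:
    epoch, step, _ = parse_ckpt(name)
    # An end-of-epoch checkpoint sorts after every mid-epoch one.
    return epoch, float("inf") if step is None else step


def latest_checkpoint(names: List[str]) -> Optional[str]:
    valid = [n for n in names if parse_ckpt(n) is not None]
    return max(valid, key=sort_key) if valid else None


print(latest_checkpoint(["000004_all_om_net.pth",
                         "000005_step0030_all_om_net.pth"]))
```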

### Loss history (real data, 64 XPU tiles)
| Epoch | Ang | Dist | Regul | Contrastive | Regist (total) | imgsim | imgmse | ddf |
|-------|-----|------|-------|-------------|---------------|--------|--------|-----|
| 3-7 | -0.02 → -0.10 | 1.50 → 1.02 | - | 9.2e-4 | 0.0 | - | - | - |
| 8-11 | - | - | - | - | - | - | - | - (CUDA) |
| 31 | - | - | - | 6.9e-5 | -0.09 | -0.21 | 0.35 | 4.2e-4 |
| 35 | -0.50 | 0.85 | 1.3e-4 | 7.1e-5 | -0.10 | -0.26 | 0.32 | 1.4e-4 |

Note: Epochs 3-7 used UpsampleConv. Epochs 8-11 trained on CUDA with ConvTranspose3d. Epoch 12+ on XPU. Registration activates when ang < -0.7 (was -0.6 before epoch 37).

---

### Previous Job 25899265: Production training (v3) - COMPLETED (historical)
- Ran 100+ epochs with UpsampleConv on dummy data (only 100 samples loaded due to node path issue)
- Validated memory stability and CCL cache fix, but loss data not meaningful

### Previous Strategy A - Job 25898957: CRASH-LOOPED (torch.load bug)
- Ran 1 good iteration (epoch 5 step 31 → epoch 6 step 9), then crash-looped 63 times
- **Root cause**: the checkpoint stored `np.random.get_state()`, which contains numpy arrays; `torch.load` with `weights_only=True` (the PyTorch 2.6 default) rejects numpy globals
- Epoch 5 completed and saved. Epoch 6 reached step 9 before crash loop started.

### Previous Strategy A - Job 25898957: Production training (v2, all audit fixes)
- **Date**: 2026-03-23
- **Status**: RUNNING when this entry was written (epoch 6 in progress); the job later crash-looped, see the entry above
- **Script**: `bash_train_multi_nodes.sh`
- **Config**: `Config/config_om.yaml` (om_net, img_size=128, batchsize=2, device=xpu)
- **Resources**: 8 nodes x 8 XPU tiles = 64 XPU tiles, 12 CPUs/task
- **Walltime**: 36:00:00
- **Progress** (as of 2026-03-23 02:45):
  - Completed epoch 5 (full, no restart needed), checkpoint `000005_all_om_net.pth` saved
  - Epoch 6 step 2 in progress, memory stable at ~46 GiB free
  - **Zero restarts triggered**: leak rate ~0.07 GiB/step allows full epochs without OOM
- **Fixes applied** (10 bugs found across 4 independent audits):
  1. **(Critical) Optimizer on all DDP ranks**: all ranks load checkpoint
  2. **(Critical) XPU device RNG saved/restored**: `torch.xpu.get_rng_state()` in checkpoint
  3. **(Critical) RNG restored after DataLoader skip**: prevents `__getitem__` corruption
  4. **(Critical) `loss_nan_step` not overwritten**: guarded with else branch
  5. **(Critical) Off-by-one in step skip**: `step <= initial_step` with `initial_step > 0` guard
  6. **(Low) tmp/ cleanup**: DDP race + stale checkpoint fixes
  7. **(Medium) Per-rank RNG divergence**: non-rank-0 re-seeded (CPU + XPU device RNG)
  8. **(Critical) Step 0 skipped on fresh start**: `initial_step > 0` guard added
  9. **(Config) Timeout**: `SRUN_TIMEOUT` 2400 → 7200
  10. **(Low) `total_reg` division by zero**: `max(total_reg, 1)` guard
- **Leak rate**: ~0.07 GiB/step (ConvTranspose3d eliminated via UpsampleConv)
- **Logs**: `Logs/train_multi_25898957.out`
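
Fix 3 (restore the RNG after the DataLoader skip loop) can be illustrated with the standard-library `random` module standing in for the CPU, XPU and numpy generators. This is a toy sketch under that assumption, not the project's code: `draw_batch` plays the role of a `__getitem__` that consumes random numbers, and all function names are hypothetical.

```python
import random


def draw_batch(rng: random.Random) -> float:
    """Stand-in for a `__getitem__` that consumes random numbers."""
    return rng.random()


def uninterrupted(seed: int, total_steps: int):
    rng = random.Random(seed)
    return [draw_batch(rng) for _ in range(total_steps)]


def resumed(seed: int, total_steps: int, ckpt_step: int,
            restore_after_skip: bool = True):
    # First process: train through ckpt_step, then save the RNG state.
    rng = random.Random(seed)
    out = [draw_batch(rng) for _ in range(ckpt_step + 1)]
    saved_state = rng.getstate()

    # Restarted process: starts with an unrelated RNG state.
    rng = random.Random(seed + 12345)
    if not restore_after_skip:
        rng.setstate(saved_state)   # buggy order: the skip loop corrupts it
    for _ in range(ckpt_step + 1):
        draw_batch(rng)             # skipping still consumes random numbers
    if restore_after_skip:
        rng.setstate(saved_state)   # fixed order: restore AFTER the skip
    out += [draw_batch(rng) for _ in range(ckpt_step + 1, total_steps)]
    return out


print(resumed(0, 10, 3) == uninterrupted(0, 10))
print(resumed(0, 10, 3, restore_after_skip=False) == uninterrupted(0, 10))
```

Only the restore-after-skip order reproduces the uninterrupted sequence, which is the property the audits checked for.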

### Job 25899049: CANCELLED (checkpoint conflict)
- Submitted as continuation but got nodes while 25898957 was still running. Both would write to same `Models/all_om_net/`. Cancelled before training started.

### Previous Strategy A - Job 25898349: CANCELLED (had 6 bugs)
- **Ran**: 4 restart iterations, reached epoch 5 step 33
- **Issues**: (1) timeout too short (killed healthy runs), (2) off-by-one re-trained steps 10/20/30, (3) optimizer not loaded on non-rank-0, (4) epoch stats not saved/restored, (5) RNG state not preserved, (6) `loss_nan_step` reset after restore
- **Checkpoints saved**: `000004_all_om_net.pth` (epoch 4), `000005_step0030_all_om_net.pth` (epoch 5 step 30)
- **Memory**: Stable at ~46-47 GiB free (leak only ~0.07 GiB/step)

### Strategy B - Job 25916258: Full validation (v4, CCL cache fix)
- **Date**: 2026-03-23
- **Status**: RUNNING
- **Resources**: 1 node x 8 XPU tiles, dummy data, no proactive restart, 2h walltime
- **Fix**: `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` (was default 1000, caused driver segfault at ~400 steps)
- **Goal**: Survive full 2h walltime without crash, which validates both the UpsampleConv fix (no OOM) and the CCL cache fix (no segfault)
- **Logs**: `Logs/stratB_25916258.out`

### Previous Strategy B - Job 25899266 (v3, segfault at epoch 74)
- 68 epochs, 403 steps, zero OOM. Memory stable 46-47 GiB.
- Crashed at epoch 74 step 0: CCL IPC handle cache hit 1000 limit → driver segfault (`drm_neo.cpp:288`). Fixed with `CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000`.
- **Logs**: `Logs/stratB_25899266.out`

### Previous Strategy B - Job 25898356 (crashed on logging bug)
- Ran 6 steps, leak ~0.375 GiB/step. `ZeroDivisionError` on `total_reg=0` (now fixed).

### Previous Job 25892717 - CANCELLED (was hung)
- **Status**: Cancelled after 5 hours. Crashed at epoch 4 step 26 (OOM at `loss_contra.backward()`). `srun` hung due to CCL processes not exiting, auto-resubmit never triggered. 4+ hours of walltime wasted idle.
- **Last checkpoint**: `Models/all_om_net/tmp/000004_step0020_all_om_net.pth`
- **Lesson**: `--kill-on-bad-exit=1` is not reliable for CCL cleanup. Strategy A's `timeout` wrapper solves this.

---

## Implementation Details

### Strategy A: Proactive Restart (backup approach)

**Problem**: XPU autograd leaks ~1.78 GiB/step of device memory (Intel UR backend bug). Training crashes at ~step 26. The old auto-resubmit via `sbatch` failed because `srun` hangs after a CCL rank crash, so the bash script never reaches the resubmit logic.

**Solution**: Proactive exit + restart loop within the same SLURM allocation.

**Changes to `OM_train_3modes.py`** (training behavior unchanged):
- Added `EXIT_CODE_RESTART = 42` constant
- Added `--max-steps-before-restart N` CLI argument (default 0 = disabled)
- Added `steps_since_start` counter in the training loop
- After N steps: saves mid-epoch checkpoint → `dist.barrier()` → `dist.destroy_process_group()` → `sys.exit(42)`
- `SystemExit(42)` is NOT caught by `except Exception`, so it propagates cleanly
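
The last point relies on Python's exception hierarchy: `SystemExit` derives from `BaseException`, not `Exception`. A minimal standalone sketch (the wrapper and function names are illustrative, not taken from `OM_train_3modes.py`):

```python
import sys

EXIT_CODE_RESTART = 42  # value quoted in the list above


def run_guarded(fn):
    """Mimic a training loop wrapped in a broad `except Exception`."""
    try:
        fn()
    except Exception as exc:  # does not match SystemExit
        return f"swallowed: {exc!r}"
    return "finished"


def request_restart():
    sys.exit(EXIT_CODE_RESTART)


def boom():
    raise RuntimeError("ordinary failure")


print(run_guarded(boom))  # an ordinary error is swallowed by the handler
try:
    run_guarded(request_restart)
except SystemExit as exc:
    print("propagated exit code:", exc.code)
```

A bare `except:` or `except BaseException:` in the call chain would intercept the exit, so this design only holds while such handlers are absent.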

**Changes to `bash_train_multi_nodes.sh`**:
- Replaced single `srun` call with a `while` loop (up to 500 iterations)
- Each `srun` wrapped with `timeout 2400` (40 min) to catch CCL hangs
- Exit code handling:
  - `0` → training complete, break
  - `42` → proactive restart, 5s pause, re-launch
  - `124` → timeout (CCL hang), 10s pause, re-launch
  - Other → crash (OOM etc.), 10s pause, re-launch from checkpoint
- Passes `--max-steps-before-restart 20` to Python
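
The exit-code handling above amounts to a small decision table. The real loop lives in bash; the following Python sketch only models its logic, with hypothetical names (`Action`, `classify_exit`, `simulate`). Exit status 124 is what GNU `timeout` returns when it kills the wrapped command.

```python
from typing import Iterable, NamedTuple

MAX_ITERATIONS = 500  # loop bound quoted above


class Action(NamedTuple):
    relaunch: bool
    pause_s: int
    reason: str


def classify_exit(code: int) -> Action:
    if code == 0:
        return Action(False, 0, "training complete")
    if code == 42:
        return Action(True, 5, "proactive restart")
    if code == 124:  # GNU `timeout` killed the command
        return Action(True, 10, "timeout (CCL hang)")
    return Action(True, 10, "crash, resume from checkpoint")


def simulate(exit_codes: Iterable[int]) -> int:
    """Walk a sequence of srun exit codes; return the number of launches."""
    launches = 0
    for code in exit_codes:
        launches += 1
        if launches >= MAX_ITERATIONS or not classify_exit(code).relaunch:
            break
    return launches


print(simulate([42, 124, 1, 0, 42]))  # stops at the first exit code 0
```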

**Training behavior**: Identical to original. Two independent audits verified that:
- Model weights, optimizer state (Adam momentum/variance), and RNG states (CPU + XPU + numpy + python) are all saved and restored
- Off-by-one fixed: `step <= initial_step` correctly skips the checkpointed step
- RNG restored AFTER DataLoader skip loop (not before) to avoid `__getitem__` corruption
- `loss_nan_step`, `total_reg`, and all 9 epoch loss accumulators preserved across restarts
- All DDP ranks load optimizer state (not just rank 0)
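
The step-skip rule from the audits can be written as a single predicate. A minimal sketch, assuming 0-based step indices and that `initial_step == 0` means a fresh start; the helper names are hypothetical:

```python
def should_skip(step: int, initial_step: int) -> bool:
    """True for steps already covered by the mid-epoch checkpoint."""
    return initial_step > 0 and step <= initial_step


def steps_to_train(steps_per_epoch: int, initial_step: int):
    return [s for s in range(steps_per_epoch)
            if not should_skip(s, initial_step)]


fresh = steps_to_train(41, 0)     # fresh start: step 0 must still run
resume = steps_to_train(41, 30)   # resume from a checkpoint saved at step 30
print(fresh[0], len(fresh))
print(resume[0], len(resume))
```

Using `step < initial_step` instead would re-train the checkpointed step, and dropping the `initial_step > 0` guard would skip step 0 on a fresh start; both failure modes are listed in the audit findings.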

### Strategy B: Leak Rate Diagnostic

**Problem**: Need to know whether existing mitigations (gradient checkpointing, `UpsampleConv`, `empty_cache`) reduce the XPU leak rate, or if the leak is fundamental to all backward ops.

**Approach**: Run training WITHOUT proactive restart, measure how many steps survive before OOM. Compare with historical ~26 steps.

**Script**: `bash_train_stratB.sh` (1-node, dummy data, no checkpoint saves, `--max-steps-before-restart 0`).

**Training loop structure** (both strategies use the same code):
1. **Diffusion**: `forward → backward → step` (NO gradient clipping)
2. **Contrastive**: `forward → backward → clip(max_norm=0.02) → step`
3. **Registration**: `forward → backward → clip(max_norm=0.1) → step`

Consecutive phases are separated by `gc.collect() + synchronize() + empty_cache()`.

### Earlier attempted fix: UpsampleConv (ConvTranspose3d replacement)
- `ConvTranspose3d` backward was identified as leaking ~0.33 GiB/step per layer on XPU
- Replaced with `UpsampleConv` (F.interpolate + Conv) in `Diffusion/networks.py` for `OM_net`
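- Result: **Negligible impact** measured at the time (leak rate ~1.78 GiB/step before and after)
- Conclusion at the time: `ConvTranspose3d` was NOT the primary leak source; the leak is fundamental to XPU autograd. The per-operation analysis in the next section later overturned this conclusion.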

### Strategy B: Per-Operation XPU Leak Analysis (job 25893155)

Ran `tests/diagnose_xpu_leak_ops.py` on 1 XPU tile to isolate which ops leak. Each test runs 20 forward+backward iterations and measures `device_free` drift.

| Operation | Leak Rate (GiB/step) | Pattern | Verdict |
|-----------|---------------------|---------|---------|
| **ConvTranspose3d** (256→3, 8³→128³) | **0.335** | **Linear, persistent** | **LEAKS (primary source)** |
| Full OM_net (rec_num=2, 128³) | **1.15** | **Linear, OOM at step 17** | **LEAKS (aggregated)** |
| Stacked Conv3d encoder (1→256, 128³→8³) | 0.013 | One-time alloc | OK (initial alloc only) |
| F.grid_sample (128³) | 0.007 | One-time alloc | OK |
| Conv3d (16→32, 64³) | 0.004 | One-time alloc | OK |
| MultiheadAttention (512 tokens) | 0.002 | One-time alloc | OK |
| Adam optimizer only | 0.0005 | One-time alloc | OK |
| **F.interpolate** trilinear (32³→256³) | **0.000** | **No leak** | **ZERO LEAK** |
| **UpsampleConv** (256→3, 8³→128³) | Not tested (env error) | - | Expected zero (uses F.interpolate) |

**Key findings:**
1. **`ConvTranspose3d` backward is the dominant leaker**: 0.335 GiB/step, linear and persistent (6.7 GiB lost over 20 steps). With 5 decoder layers in the old network, this alone accounts for ~1.7 GiB/step.
2. **`F.interpolate` has ZERO leak**, confirming that `UpsampleConv` (which uses F.interpolate + Conv) is the correct fix.
3. **All other ops have only one-time allocations** (no linear drift): Conv3d, grid_sample, attention, Adam are all clean.
4. **Full OM_net leaks 1.15 GiB/step**, consistent with `ConvTranspose3d` × 5 layers plus minor contributions from other ops.

**Component-level diagnostic** (job 25826494):
- Forward only (no backward): **ZERO leak**, 62.75 GiB stable for 20 steps
- Forward + backward: **1.12 GiB/step**, 62.97 → 40.53 GiB over 20 steps
- Confirms the leak is in the autograd backward pass, specifically in `ConvTranspose3d` backward kernel.
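
The leak rates quoted in this section are endpoint slopes of `device_free`. A short sketch that re-derives them from the numbers above (the function name is illustrative; a least-squares fit over all samples would be more robust than two endpoints):

```python
def leak_rate(free_start_gib: float, free_end_gib: float, steps: int) -> float:
    """Average device memory lost per step, in GiB."""
    return (free_start_gib - free_end_gib) / steps


# Component-level diagnostic: forward + backward, 20 steps.
print(round(leak_rate(62.97, 40.53, 20), 3))

# Per-operation result for ConvTranspose3d, extrapolated as in the findings.
per_layer = 0.335
print(round(per_layer * 20, 2))  # GiB lost over the 20-step test
print(round(per_layer * 5, 3))   # GiB/step if all 5 decoder layers leak alike
```

The 5-layer extrapolation assumes every decoder layer leaks at the rate measured for the single tested layer shape, which the diagnostic did not verify layer by layer.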

**Why the leak rate dropped in recent runs** (jobs 25898349/25898957): The `UpsampleConv` fix in `OM_net` replaced all 5 decoder `ConvTranspose3d` layers with F.interpolate+Conv. This eliminated the primary leak source. The remaining ~0.07 GiB/step is from minor one-time allocations that stabilize quickly.

**Why the old runs (25892717) still leaked 1.78 GiB/step**: The diagnostic `diagnose_xpu_leak_ops.py` test used the OLD `RecMulModMutAttnNet` with `ConvTranspose3d`. The `OM_net` class (used in production training) already had `UpsampleConv`. The earlier measurement of "negligible impact" was incorrect: the UpsampleConv fix DID work, but the comparison was confounded by different node conditions. The diagnostic data now confirms the fix is effective.

### Issue 15: CCL IPC Handle Cache Segfault (~400 DDP Steps)
- **Job**: 25899266 (Strategy B v3)
- **Symptom**: GPU segfault (`drm_neo.cpp:288`) after ~403 DDP all-reduce steps. Memory was healthy (47 GiB free).
- **Root cause**: oneCCL's IPC memory handle cache has a default limit of 1000 entries. After ~400 steps of DDP all-reduce, the cache fills. Handle eviction triggers a use-after-free in the Intel compute-runtime driver.
- **Warning before crash**: `CCL_WARN: mem handle cache limit is reached: mem_handle_cache size: 1000, limit: 1000`
- **Fix**: `export CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD=10000` in SLURM scripts. Also added to Strategy A's `bash_train_multi_nodes.sh`.
- **Note**: Strategy A's proactive restart naturally avoids this (process resets before 400 steps), but the fix is still needed for long-running single-process jobs.
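
As a rough sanity check on the new threshold, the observed failure point can be extrapolated linearly. This is only a sketch: it assumes handles accumulate at a constant rate and are never reused, which this log does not establish, and the helper names are hypothetical. The second helper mirrors the `export` in the SLURM scripts for a Python launcher, on the assumption that the variable must be set before oneCCL is initialised.

```python
CCL_VAR = "CCL_ZE_CACHE_OPEN_IPC_HANDLES_THRESHOLD"


def steps_until_cache_full(limit: int,
                           observed_limit: int = 1000,
                           observed_steps: int = 403) -> int:
    """Linear extrapolation from 'limit 1000 reached after ~403 steps'."""
    return limit * observed_steps // observed_limit


def apply_cache_fix(env: dict, threshold: int = 10000) -> dict:
    """Set the threshold unless the caller already chose a value."""
    env.setdefault(CCL_VAR, str(threshold))
    return env


print(steps_until_cache_full(10000))
print(apply_cache_fix({})[CCL_VAR])
```

If the linear assumption holds, the higher threshold postpones the limit rather than removing it, so very long single-process runs may still need a restart.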

---

## Scripts Directory

| File | Purpose |
|------|---------|
| `bash_train_multi_nodes.sh` | **Strategy A**: 8-node production training with proactive restart loop |
| `bash_train_stratB.sh` | **Strategy B**: 1-node leak rate diagnostic (no restart, dummy data) |
| `bash_infer.sh` | Inference / augmentation SLURM job |
| `bash_diagnose_leak.sh` | Submits `tests/diagnose_xpu_leak.py` to diagnose per-component leak |
| `bash_diagnose_ops.sh` | Submits `tests/diagnose_xpu_leak_ops.py` for per-operation leak analysis |
| `bash_verify_fix.sh` | Compares ConvTranspose3d vs UpsampleConv leak rates (failed due to env issue) |
| `bash_compare_opt.sh` | Speed comparison: optimized vs original 3-mode training |
| `bash_compare_orig.sh` | Speed comparison: original 3-mode training baseline |
| `tests/diagnose_xpu_leak.py` | Component-level leak test (network, DeformDDPM, DDP) |
| `tests/diagnose_xpu_leak_ops.py` | Operation-level leak test (Conv3d, grid_sample, attention, ConvTranspose3d, UpsampleConv) |
| `tests/test_3modes_opt_equivalence.py` | Verifies optimized training matches original |
| `tests/test_mslncc.py` | MSLNCC loss function unit tests |
| `tests/compare_3modes_speed.py` | Speed benchmark for 3-mode training variants |

---

## Issues Resolved (2026-03-22)

### 14. XPU Autograd Engine Memory Leak: ~1.0 GiB/Step (ROOT CAUSE IDENTIFIED)
- **Jobs**: All XPU training jobs; diagnosed in 25826494
- **Symptom**: `device_free` (via `torch.xpu.mem_get_info`) decreases linearly at ~1.0 GiB/step. The memory is outside PyTorch's caching allocator, so it is not tracked by `memory_allocated` / `memory_reserved`.
- **Root cause**: **PyTorch XPU autograd engine bug**. The `loss.backward()` call leaks device memory on every invocation. Confirmed by isolated diagnostic (`tests/diagnose_xpu_leak.py`):
  - Test 1 (forward only, no backward): **NO LEAK**, device_free perfectly stable at 62.75 GiB over 20 steps
  - Test 2a (forward + backward, no optimizer): **LEAK**, 1.0 GiB/step (62.97 → 42.98 GiB over 20 steps)
  - Test 2b (forward + backward + optimizer.step): **LEAK**, 1.1 GiB/step (slightly worse)
- **NOT caused by**: CCL all-reduce (`no_sync()` showed identical leak rate), DDP (leak occurs without DDP), garbage collection (`gc.collect()` had no effect), caching allocator (`empty_cache()` had no effect), deferred ops (`synchronize()` had no effect)
- **Why it works on CUDA**: CUDA autograd engine does not have this leak. The issue is specific to the Intel XPU backend (Level Zero / SYCL runtime).
- **Workaround applied**:
  1. Gradient checkpointing (3 encoder levels in OM_net) reduces peak memory from 43 → 26 GiB, buying ~26 steps before OOM
  2. Mid-epoch checkpoints every 10 steps to `tmp/` subfolder
  3. Auto-resubmitting SLURM job restarts training from last checkpoint with fresh memory (leak resets)
- **Upstream**: Should be reported to [intel/torch-xpu-ops](https://github.com/intel/torch-xpu-ops) with `tests/diagnose_xpu_leak.py` as minimal reproduction
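
The shape of such a diagnostic can be sketched without any device: run a step function repeatedly, sample free memory after each step, and classify the drift. This is not the content of `tests/diagnose_xpu_leak.py`; all names are hypothetical, and `FakeDevice` simulates a leak. On real hardware the probe would presumably wrap the `torch.xpu.mem_get_info` call named in the symptom above.

```python
from typing import Callable, List


def measure_drift(step_fn: Callable[[], None],
                  free_gib: Callable[[], float],
                  n_steps: int = 20) -> List[float]:
    """Run `step_fn` n_steps times, sampling free memory after each step."""
    samples = [free_gib()]
    for _ in range(n_steps):
        step_fn()
        samples.append(free_gib())
    return samples


def classify(samples: List[float], tol_gib: float = 0.01) -> str:
    total = samples[0] - samples[-1]
    if total < tol_gib:
        return "no leak"
    after_first = samples[1] - samples[-1]
    return "one-time alloc" if after_first < tol_gib else "linear"


class FakeDevice:
    """Simulated device that loses a fixed amount of memory per step."""

    def __init__(self, free_gib: float, leak_per_step: float):
        self.free = free_gib
        self.leak = leak_per_step

    def step(self) -> None:
        self.free -= self.leak

    def probe(self) -> float:
        return self.free


dev = FakeDevice(62.97, 1.0)
samples = measure_drift(dev.step, dev.probe, n_steps=20)
print(classify(samples), round(samples[0] - samples[-1], 2))
```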

### 13. Pre-allocation Approach: Wrong Direction
- **Jobs**: 25799043 (92%), 25823021 (78%)
- **Finding**: Pre-allocating device memory into PyTorch's caching allocator REDUCED available memory for the autograd leak, causing EARLIER crashes. 92% → step 3, 78% → step 10, none → step 15.
- **Resolution**: Removed all pre-allocation. The 70% allocator cap is sufficient when gradient checkpointing reduces peak to 26 GiB (well under the 44.8 GiB cap).

### 12. Contrastive Backward OOM: Diffusion Tensors Not Freed
- **Jobs**: 25823021, 25823710
- **Finding**: `del pre_dvf_I, dvf_I, trm_pred` was placed AFTER the contrastive step. During `loss_contra.backward()`, diffusion output tensors were still alive, pushing peak above the limit.
- **Fix**: Moved `del` + `empty_cache()` to BEFORE the contrastive step. Also save `loss_gen_a.item()` before deleting since it's needed for registration decision.

---

## Issues Resolved (2026-03-20 to 2026-03-21)

### 11. DDP Collective Hangs and Registration Desync
- **Jobs**: 25699197, 25709947 (hung), 25704670, 25706470
- **Symptoms**: Job hangs after step 1 (log files stop growing); or `Expected to have finished reduction` error
- **Root cause 1**: OOM try/except guards with `continue` skip the DDP-synchronized backward pass, causing other ranks to wait forever at all-reduce. **OOM guards are fundamentally incompatible with DDP.**
- **Root cause 2**: Registration conditional block (`loss_gen_a.item() < -0.6`) differs per rank: some ranks call `Deformddpm(...)` for registration while others skip, causing DDP desync.
- **Fixes applied**:
  1. Removed all OOM try/except guards: let OOM crash the job and rely on checkpoint auto-resume
  2. `DDP(..., find_unused_parameters=True)`: handles detached recovery iterations and conditional registration
  3. `dist.all_reduce(regist_flag, op=ReduceOp.MIN)`: all 64 ranks collectively decide whether to run registration. Only runs when ALL ranks agree.
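
The difference between the per-rank decision (the desync bug) and the two synchronised gates used in this project can be shown with a pure-Python simulation. No `torch.distributed` call is made here; lists stand in for ranks, the function names are hypothetical, and the threshold is the -0.6 value from the root-cause description.

```python
from typing import List


def gate_local(per_rank_loss: List[float], threshold: float = -0.6) -> List[bool]:
    """Bug: each rank decides alone, so ranks can disagree."""
    return [loss < threshold for loss in per_rank_loss]


def gate_all_reduce_min(per_rank_loss: List[float],
                        threshold: float = -0.6) -> List[bool]:
    """Models all_reduce(MIN) over 0/1 flags: run only if ALL ranks agree."""
    flags = [1 if loss < threshold else 0 for loss in per_rank_loss]
    decision = bool(min(flags))
    return [decision] * len(per_rank_loss)


def gate_broadcast_src0(per_rank_loss: List[float],
                        threshold: float = -0.6) -> List[bool]:
    """Models broadcast(src=0): every rank follows rank 0's decision."""
    decision = per_rank_loss[0] < threshold
    return [decision] * len(per_rank_loss)


losses = [-0.65, -0.55, -0.70]        # made-up per-rank values
print(gate_local(losses))             # ranks disagree -> collective mismatch
print(gate_all_reduce_min(losses))    # unanimous decision
print(gate_broadcast_src0(losses))    # unanimous, decided by rank 0 alone
```

Both synchronised gates give every rank the same answer; they differ in policy, since MIN requires unanimity while the broadcast variant trusts rank 0's loss.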

### 10. XPU OOM: Allocator 70% Memory Cap
- **Jobs**: 25530886 through 25780909
- **Error**: `UR_RESULT_ERROR_OUT_OF_RESOURCES` at step 12-14 of each epoch
- **Root cause**: XPU caching allocator caps reserved memory at ~70% of device (44.8/64 GiB). Known Intel bug ([torch-xpu-ops#1543](https://github.com/intel/torch-xpu-ops/issues/1543)). Works on 4x 48GB CUDA GPUs because CUDA allocator uses nearly all device memory.
- **Key diagnostic** (job 25780909): Memory logging showed alloc/reserved perfectly stable at 9.84/44.82 GiB across all steps (no fragmentation, no creep). OOM is purely a peak spike during forward/backward exceeding the 44.8 GiB cap.
- **Resolution**: Gradient checkpointing reduced peak from 43 → 26 GiB, well within the 44.8 GiB cap. Pre-allocation is no longer needed.

---

## Issues Resolved (2026-03-19 to 2026-03-20)

### 1. torchrun Permission Denied
- **Fix**: Switched to `python -m torch.distributed.run`, then later to direct `srun` launch

### 2. GPUS_PER_NODE Mismatch
- **Fix**: `--nodes=8 --ntasks-per-node=8` for 64 total XPU tiles (4 cards x 2 tiles/card)

### 3. `.to(rank)` Sends to CUDA Not XPU
- **Fix**: Changed to `.to(f"{DEVICE_TYPE}:{rank}")`

### 4. No DistributedSampler
- **Fix**: Added `DistributedSampler` for both dataloaders + `set_epoch()` per epoch

### 5. CCL Backend Not Found
- **Fix**: dn-mo1 rebuilt the conda env with compatible packages

### 6. MPI/PMI Init Failure
- **Fix**: Switched from torchrun to direct `srun --ntasks-per-node=8` with SLURM env var mapping

### 7. CCL Worker Thread Startup Failure
- **Fix**: Increased `--cpus-per-task=12` + `CCL_WORKER_AFFINITY=auto`

### 8. gloo Backend Incompatible with XPU
- **Fix**: Must use `ccl` backend for XPU DDP

### 9. Print Spam from All Ranks
- **Fix**: Guarded prints with `gpu_id == 0`

---

## Working Configuration Summary

```bash
# SLURM
--nodes=8 --gres=gpu:4 --ntasks-per-node=8 --cpus-per-task=12

# Environment
I_MPI_PMI_LIBRARY=/usr/local/software/slurm/current-rhel8/lib/libpmi2.so
I_MPI_HYDRA_BOOTSTRAP=slurm
CCL_WORKER_AFFINITY=auto
PYTORCH_ALLOC_CONF=expandable_segments:True,max_split_size_mb:512

# Launch
srun --kill-on-bad-exit=1 bash -c 'LOCAL_RANK=$SLURM_LOCALID RANK=$SLURM_PROCID WORLD_SIZE=$SLURM_NTASKS python ...'

# DDP
backend = "ccl"
DDP(model, device_ids=[rank], find_unused_parameters=True)

# XPU autograd leak workaround
# - Gradient checkpointing on 3 encoder levels (OM_net.use_checkpoint = True)
# - Mid-epoch checkpoint every 10 steps to Models/.../tmp/
# - SLURM script auto-resubmits on crash (sbatch at end of script)
# - No pre-allocation (gradient checkpointing keeps peak under 70% cap)
```
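
The launch line maps SLURM task variables onto the names PyTorch's distributed setup reads. The same mapping, sketched in Python for clarity; the function names are hypothetical, and using the local rank as the device index is an assumption based on the `.to(f"{DEVICE_TYPE}:{rank}")` fix listed earlier.

```python
from typing import Dict, Mapping


def slurm_to_torch_env(env: Mapping[str, str]) -> Dict[str, str]:
    """Translate SLURM per-task variables into RANK / LOCAL_RANK / WORLD_SIZE."""
    return {
        "LOCAL_RANK": env["SLURM_LOCALID"],
        "RANK": env["SLURM_PROCID"],
        "WORLD_SIZE": env["SLURM_NTASKS"],
    }


def device_string(device_type: str, local_rank: int) -> str:
    """Device identifier such as 'xpu:3' (local rank as index is assumed)."""
    return f"{device_type}:{local_rank}"


# Example task on an 8-node x 8-task job; the values are made up.
fake_env = {"SLURM_LOCALID": "3", "SLURM_PROCID": "19", "SLURM_NTASKS": "64"}
mapped = slurm_to_torch_env(fake_env)
print(mapped)
print(device_string("xpu", int(mapped["LOCAL_RANK"])))
```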

## Previous Test Jobs

| Job ID | Date | Nodes | Status | Issue |
|--------|------|-------|--------|-------|
| 25379684 | 03-18 | 16 | FAILED | torchrun permission denied |
| 25416001 | 03-19 | 8 | FAILED | ccl backend not found (SYCL mismatch) |
| 25433261 | 03-19 | 8 | FAILED | xccl not built in |
| 25518305 | 03-19 | 8 | FAILED | MPI PMI_Init failure |
| 25521705 | 03-20 | 8 | FAILED | CCL OFI transport init failure |
| 25522164 | 03-20 | 8 | FAILED | CCL worker startup failure (3 CPUs) |
| 25522734 | 03-20 | 8 | FAILED | CCL worker=0 segfault |
| 25522979 | 03-20 | 1 | FAILED | gloo + XPU incompatible |
| 25523654 | 03-20 | 1 | FAILED | gloo + XPU (no device_ids) still fails |
| 25525830 | 03-20 | 1 | **SUCCESS** | First working run: ccl + 12 CPUs/task |
| 25528754 | 03-20 | 8 | FAILED | Superseded by 25530886 |
| 25530886 | 03-20 | 8 | FAILED | XPU OOM at step 12/41 (original code, no workarounds) |
| 25635451 | 03-20 | 8 | FAILED | empty_cache regression, OOM step 1 |
| 25678499 | 03-21 | 8 | FAILED | OOM step 15, variable cleanup added |
| 25696461 | 03-21 | 8 | FAILED | Epoch 2 reached but forward OOM killed rank 61 |
| 25699197 | 03-21 | 8 | HUNG | OOM try/except broke DDP |
| 25704670 | 03-21 | 8 | FAILED | find_unused_parameters error at registration |
| 25706470 | 03-21 | 8 | FAILED | OOM guard broke DDP reducer state |
| 25709947 | 03-21 | 8 | HUNG | Registration conditional desync |
| 25763882 | 03-21 | 8 | FAILED | all_reduce(MIN) sync: no hang! OOM step 14. |
| 25780909 | 03-21 | 8 | FAILED | Confirmed 70% allocator cap (9.84/44.82 GiB stable) |
| 25799043 | 03-21 | 8 | FAILED | 92% pre-alloc (59 GiB): OOM step 3, WORSE |
| 25823021 | 03-21 | 8 | FAILED | 78% pre-alloc: diffusion OK (43.3 GiB), contra OOM step 10 |
| 25823544 | 03-21 | 8 | FAILED | del tensors before contra: UnboundLocalError bug |
| 25823710 | 03-21 | 8 | FAILED | Fixed bug; OOM step 10 again, ~1.3 GiB/step device leak confirmed |
| 25824128 | 03-21 | 8 | FAILED | No pre-alloc + empty_cache; device_free monitoring confirms 1.3 GiB/step leak |
| 25825585 | 03-22 | 8 | FAILED | no_sync(): same leak rate; NOT from all-reduce |
| 25825861 | 03-22 | 8 | FAILED | gc.collect+sync+empty_cache: no effect on leak |
| 25826494 | 03-22 | 1 | **DIAG** | **Root cause found**: fwd=no leak, bwd=1.0 GiB/step leak. XPU autograd bug. |
| 25832610 | 03-22 | 8 | PARTIAL | Grad checkpoint works! Peak 43→22 GiB. Epoch 3 completed. Retry loop hung (srun won't exit). |
| 25853940 | 03-22 | 8 | PARTIAL | Resumed from step 25; epoch 3 completed + epoch 4 started. Epoch 4 mid-epoch saved at step 10,20. |
| 25867855 | 03-22 | 8 | HUNG | Epoch 4 reached step 26. Mid-epoch saved at 10,20. srun hung after crash (no kill-on-bad-exit). |
| 25892717 | 03-22 | 8 | HUNG→CANCELLED | Crashed at epoch 4 step 26 (OOM contra_bwd). srun hung 4+ hrs, auto-resubmit never triggered. |
| 25898349 | 03-22 | 8 | CANCELLED (Strat A v1) | Epoch 5 step 33. Leak 0.07 GiB/step. Had 6 bugs (off-by-one, optimizer, RNG, etc). |
| 25898356 | 03-22 | 1 | CRASHED (Strat B v1) | Leak 0.375 GiB/step (dummy data). ZeroDivisionError at epoch end (bug fixed). |
| **25899114** | **03-23** | **1** | **PENDING (Strat B v2)** | **Full leak validation: no restart, dummy data, all fixes. Goal: survive 2h/80+ steps.** |
| 25898957 | 03-23 | 8 | CRASH-LOOPED (Strat A v2) | Epoch 6 step 9 then crash-loop ×63. torch.load rejects numpy RNG in checkpoint. |
| 25899049 | 03-23 | 8 | CANCELLED | Checkpoint conflict: ran simultaneously with 25898957. |
| 25899114 | 03-23 | 1 | CANCELLED | Strategy B: cancelled with 25898957 for code fix. |
| **25899265** | **03-23** | **8** | **RUNNING (Strat A v3) VALIDATED** | **100+ epochs, 7 restarts, zero OOM. Memory stable 45-47 GiB. 6h runtime.** |
| 25899266 | 03-23 | 1 | COMPLETED (Strat B v3) | 68 epochs, 403 steps, zero OOM. GPU segfault at epoch 74 (CCL IPC cache limit). |
| **25916258** | **03-23** | **1** | **RUNNING (Strat B v4)** | **CCL cache fix (10000 handles). Goal: survive full 2h walltime.** |