CortexFM: A Lightweight Multimodal Foundation Model for Spike + EMG BCI
A 5.04 M-parameter multimodal Transformer foundation model that jointly learns spike trains and EMG envelopes from public DANDI motor-cortex data, serving as a lightweight BCI backbone evaluated on the FALCON M1 benchmark.
Model description
CortexFM is a small, public, and fully reproducible foundation model for invasive brain–computer interface (BCI) decoding. It targets the regime where private million-hour pretraining data and 45 M to 350 M-parameter backbones are not available, and asks how far we can push neural-decoding quality with ~3.85 hours of public data and a ~5 M-parameter model trained in about six minutes on a single consumer GPU.
CortexFM follows three design principles: (i) preserving per-unit and per-muscle identity, (ii) reproducibility using only public data and public benchmarks, and (iii) alignment with the FALCON standard. The backbone is a 10-layer × 6-head × d = 192 PreNorm Transformer (4.45 M params, FLASH SDPA); the heads form three branches: spike Poisson NLL reconstruction, EMG MSE reconstruction, and cross-modal InfoNCE contrastive learning.
Architecture summary
| Component | Configuration |
|---|---|
| Backbone | PreNorm Transformer, 10 layers, 6 heads, d_model = 192, FFN = 768, GELU |
| Attention | SDPA with FLASH / EFFICIENT backends (PyTorch 2.10) |
| Backbone params | 4,449,024 (β 4.45 M) |
| Spike tokenizer | Per-unit learned embedding combined with log(1 + α · count), plus temporal positional embedding |
| EMG tokenizer | Per-muscle learned embedding combined with a scalar-to-vector MLP, plus temporal positional embedding |
| Heads | Spike recon (Poisson NLL), EMG recon (MSE), Contrastive projector (d_p = 128) |
| Total params | 5,044,994 (β 5.04 M) |
| Bin size | 20 ms (FALCON official) |
| Context length T | 64 bins (1.28 s), 1,088 tokens |
| Mixed precision | BF16 (InfoNCE softmax promoted to FP32) |
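As a sanity check on the 4,449,024 backbone figure, the count can be reproduced by hand. The sketch assumes a standard PreNorm block (biased Q/K/V/output projections, a two-layer FFN with biases, two LayerNorms per block) plus one final LayerNorm; that layout is an assumption, but under it the arithmetic lands exactly on the number in the table. The 6 heads do not change the count, because heads only partition the 192-d projections.

```python
def prenorm_block_params(d_model: int, d_ffn: int) -> int:
    """Parameters of one PreNorm Transformer block (assumed standard layout)."""
    attn = 4 * (d_model * d_model + d_model)  # Q, K, V and output projections, each with bias
    ffn = (d_model * d_ffn + d_ffn) + (d_ffn * d_model + d_model)  # two linear layers with bias
    norms = 2 * (2 * d_model)  # two LayerNorms, weight + bias each
    return attn + ffn + norms


def backbone_params(n_layers: int = 10, d_model: int = 192, d_ffn: int = 768) -> int:
    final_norm = 2 * d_model  # assumed LayerNorm after the last block
    return n_layers * prenorm_block_params(d_model, d_ffn) + final_norm


print(backbone_params())  # 4449024
```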
Intended uses
- Research-grade neural decoding of primate motor cortex (M1) spike trains into 16-channel surface/intramuscular EMG envelopes.
- Backbone for downstream BCI probes: as a frozen feature extractor with a thin (~3 K-param) per-session output-space affine adapter, CortexFM enables session-1 adaptation to held-out recording days.
- Cross-modal pretraining baseline for studies that compare per-unit tokenization against patch-tokenized BCI foundation models (e.g., NDT-3) at a 1/9 to 1/69 parameter ratio.
- Educational reference for compact (~5 M-param) foundation-model training from public data on a single consumer GPU.
Out-of-scope uses
- Clinical or assistive deployment. This is a research checkpoint trained on a single non-human primate (MonkeyL, DANDI 000941). It is not intended for human BCI control or medical decision-making.
- Cross-subject generalization. The pretraining set is one subject; cross-subject transfer (e.g., MonkeyN, MC_Maze, human cortex) has not been validated.
- Direct kinematic decoding. The model outputs EMG envelopes; downstream kinematic readouts require an additional decoding stage.
- Real-time control without calibration. Held-out sessions require a brief (≥ ~8 s) per-session affine calibration to enter the positive-R² regime.
Training data
| Dataset | DOI | Subject | Modality | Duration |
|---|---|---|---|---|
| DANDI:000941 (Rouse & Schieber 2018) | 10.48324/dandi.000941/0.211015.0907 | MonkeyL (1 NHP) | M1 spikes (64 units) + intramuscular EMG (16 muscles) | 11 sessions total |
Pretraining uses the four held-in calibration sessions of DANDI 000941 (sessions 20120924, 20120926, 20120927, 20120928), totaling 3 h 38 min of paired spike + EMG recordings. The remaining 7 sessions (4 minival + 3 held-out calibration) are reserved for FALCON M1 evaluation and OOD session-1 adaptation.
License: CC-BY-4.0 (DANDI public release).
Preprocessing pipeline
- EMG: 60/180/200/300/400 Hz notch → 4th-order Butterworth high-pass at 65 Hz → rectify → 99 % clip → 95 % normalize → polyphase resample (1 kHz → 50 Hz) → re-rectify → 10 Hz low-pass envelope.
- Spike: 20 ms bin counts per unit on the same time grid.
- Output: a Zarr store with `/emg/envelope`, `/spike/counts`, `/eval_mask`, and trial markers. Spike and EMG share a common 20 ms bin axis (FALCON official invariant).
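The spike-binning step can be sketched in plain Python. This is a minimal illustration, assuming per-unit spike timestamps in seconds; the `spike_times` input format and the `bin_spikes` helper are hypothetical, not part of the released pipeline. It shows how a half-open 20 ms grid gives spike counts the same bin index k as the 50 Hz EMG envelope.

```python
import math
from typing import Dict, List, Sequence

BIN_S = 0.020  # 20 ms bins, i.e. the same grid as the 50 Hz EMG envelope


def bin_spikes(spike_times: Dict[int, Sequence[float]], t_start: float, t_stop: float,
               bin_s: float = BIN_S) -> List[List[int]]:
    """Return a (T, N) nested list of per-unit spike counts on a shared bin grid.

    spike_times maps unit index -> spike timestamps in seconds (hypothetical format).
    Bin k covers the half-open interval [t_start + k*bin_s, t_start + (k+1)*bin_s).
    """
    n_bins = int(round((t_stop - t_start) / bin_s))
    n_units = max(spike_times) + 1 if spike_times else 0
    counts = [[0] * n_units for _ in range(n_bins)]
    for unit, times in spike_times.items():
        for t in times:
            k = math.floor((t - t_start) / bin_s)
            if 0 <= k < n_bins:  # spikes outside the window are dropped
                counts[k][unit] += 1
    return counts


print(bin_spikes({0: [0.001, 0.019, 0.041], 1: [0.030]}, 0.0, 0.06))
```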
Training procedure
Objectives
Joint loss with three components:
- $\mathcal{L}_{\text{spike}}$: Poisson NLL over per-unit spike counts.
- $\mathcal{L}_{\text{emg}}$: MSE on per-muscle EMG envelopes.
- $\mathcal{L}_{\text{cont}}$: Symmetric InfoNCE on pooled cross-modal embeddings (FP32-promoted), temperature $\tau = 0.1$.
Loss weights $(w_{\text{spike}}, w_{\text{emg}}, w_{\text{cont}}) = (1.0, 1.0, 0.5)$.
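The three objectives and their weighting can be illustrated with a dependency-free sketch. It assumes that the spike head emits log-rates (as the `log_rate` output in the usage snippet suggests), that the Poisson NLL drops the constant log(k!) term (as PyTorch's `PoissonNLLLoss(log_input=True)` does by default), and that InfoNCE uses cosine similarities with the matched pair on the diagonal, averaged over the spike→EMG and EMG→spike directions. These are assumptions, not a transcription of the CortexFM code.

```python
import math
from typing import List, Sequence, Tuple


def poisson_nll(log_rate: Sequence[float], counts: Sequence[int]) -> float:
    """Mean Poisson NLL with log-rate inputs: exp(x) - k * x (log k! omitted)."""
    return sum(math.exp(x) - k * x for x, k in zip(log_rate, counts)) / len(counts)


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(target)


def _l2_normalize(v: Sequence[float]) -> List[float]:
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _mean_cross_entropy(logits: List[List[float]]) -> float:
    """Row-wise cross-entropy where row i's positive is column i (log-sum-exp for stability)."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        lse = m + math.log(sum(math.exp(v - m) for v in row))
        total += lse - row[i]
    return total / len(logits)


def symmetric_info_nce(z_spike: Sequence[Sequence[float]], z_emg: Sequence[Sequence[float]],
                       tau: float = 0.1) -> float:
    a = [_l2_normalize(v) for v in z_spike]
    b = [_l2_normalize(v) for v in z_emg]
    logits = [[sum(x * y for x, y in zip(ai, bj)) / tau for bj in b] for ai in a]
    transposed = [list(col) for col in zip(*logits)]
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


def joint_loss(l_spike: float, l_emg: float, l_cont: float,
               weights: Tuple[float, float, float] = (1.0, 1.0, 0.5)) -> float:
    w_spike, w_emg, w_cont = weights
    return w_spike * l_spike + w_emg * l_emg + w_cont * l_cont


eye = [[1.0, 0.0], [0.0, 1.0]]
print(joint_loss(poisson_nll([0.0, 0.5], [1, 2]), mse([0.1, 0.2], [0.0, 0.4]),
                 symmetric_info_nce(eye, eye)))
```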
Masking
Spike and EMG tokens are masked independently, with a 50 % per-bin masking rate in each modality. Each modality must be reconstructed from its own unmasked tokens together with the (independently masked) tokens of the other modality.
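A minimal sketch of the independent masking, assuming one mask value per bin shared by all units (or all muscles) of a modality and exactly half of the 64 bins masked; the `sample_bin_masks` helper is hypothetical, and the real implementation may instead draw Bernoulli(0.5) per bin.

```python
import random
from typing import List, Tuple


def sample_bin_masks(T: int = 64, ratio: float = 0.5,
                     seed: int = 0) -> Tuple[List[bool], List[bool]]:
    """Draw independent per-bin masks for the spike and EMG streams.

    True means the bin is masked: hidden from the encoder and used as a
    reconstruction target. Exactly round(ratio * T) bins are masked per modality.
    """
    rng = random.Random(seed)
    n_masked = int(round(ratio * T))

    def draw() -> List[bool]:
        hidden = set(rng.sample(range(T), n_masked))
        return [t in hidden for t in range(T)]

    return draw(), draw()  # two separate draws, so the masks are independent


spike_mask, emg_mask = sample_bin_masks()
both_hidden = sum(s and e for s, e in zip(spike_mask, emg_mask))
print(f"spike masked: {sum(spike_mask)}, EMG masked: {sum(emg_mask)}, both: {both_hidden}")
```

With two independent 50 % masks, about a quarter of bins hide both modalities in expectation (32 × 32 / 64 = 16 of 64), so roughly half of the masked spike bins still have visible EMG in the same bin, and vice versa.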
Optimization
| Hyperparameter | Value |
|---|---|
| Optimizer | AdamW |
| Learning rate | 3 × 10⁻⁴ |
| Weight decay | 0.01 |
| LR schedule | Linear warmup (500 steps) → cosine decay |
| Batch size | 8 |
| Context length T | 64 bins |
| Mixed precision | BF16 (InfoNCE softmax in FP32) |
| Gradient clip | 1.0 |
| Max epochs | 50 (best validation loss at epoch 28) |
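The learning-rate schedule in the table can be written as a per-step function. A sketch, assuming the cosine phase decays to zero at a hypothetical `total_steps` (the card does not state the total step count or a floor learning rate):

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 3e-4,
          warmup_steps: int = 500, min_lr: float = 0.0) -> float:
    """Linear warmup to peak_lr over warmup_steps, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = min((step - warmup_steps) / max(1, total_steps - warmup_steps), 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


for s in (0, 499, 500, 5_250, 10_000):
    print(s, lr_at(s, total_steps=10_000))
```

In PyTorch the same shape could be attached to AdamW through `torch.optim.lr_scheduler.LambdaLR`, passing `lambda s: lr_at(s, total_steps) / peak_lr` as the multiplier on the base learning rate.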
Training environment
| Item | Value |
|---|---|
| GPU | NVIDIA RTX 5080 (16 GB GDDR7, sm_120), single consumer card |
| OS / runtime | WSL2 Ubuntu 24.04 |
| Framework | PyTorch 2.10.0 + cu128, PyTorch Lightning |
| Wall-clock training time | ≈ 6 minutes for 30 epochs |
| Best checkpoint | epoch28-0.2599.ckpt (60.7 MB, val_loss = 0.2599) |
| External cloud GPU | None (fully on-device) |
The train/val loss gap stayed below 0.03 throughout training, so no early stopping was applied; the checkpoint with the lowest validation loss was kept as is.
Evaluation
FALCON M1 (held-in calibration sessions, variance-weighted R² over 16 muscles)
| Setting | Params (used) | Per-session R² (mean ± std) | Pooled R² | NL |
|---|---|---|---|---|
| POYO-1 zero-EMG floor | 15.47 M (0 used) | −1.273 ± 0.299 | n/a | 2.4e-5 |
| POYO-1 frozen + per-session affine | 15.47 M + ~2 K/sess | +0.451 ± 0.112 | +0.498 | n/a |
| CortexFM zero-shot | 5.04 M | −1.035 ± 0.234 | n/a | 0.131 |
| CortexFM frozen + Ridge linear probe | 5.04 M + ~3 K | −0.258 ± 0.327 | +0.125 | n/a |
| CortexFM + EMG-head FT, 200 steps (ZS) | 5.04 M (FT 37 K = 0.75 %) | −0.038 ± 0.063 | n/a | n/a |
| CortexFM frozen + per-session affine | 5.04 M + ~3 K/sess | +0.484 ± 0.102 | +0.529 | n/a |
NL = FALCON normalized latency (inference time / data duration).
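Variance-weighted R² sums residual and total sums of squares across channels before taking the ratio, so high-variance muscles carry more weight. The sketch below computes it directly; it is the same quantity scikit-learn returns for `r2_score(..., multioutput="variance_weighted")`. The reading that "pooled" R² concatenates bins from all sessions before this computation, while the per-session column averages per-session scores, is an assumption.

```python
from typing import Sequence


def variance_weighted_r2(y_true: Sequence[Sequence[float]],
                         y_pred: Sequence[Sequence[float]]) -> float:
    """1 - sum_j SS_res_j / sum_j SS_tot_j for (n_bins, n_channels) arrays."""
    n_channels = len(y_true[0])
    ss_res = ss_tot = 0.0
    for j in range(n_channels):
        col_true = [row[j] for row in y_true]
        col_pred = [row[j] for row in y_pred]
        mean_true = sum(col_true) / len(col_true)
        ss_res += sum((t - p) ** 2 for t, p in zip(col_true, col_pred))
        ss_tot += sum((t - mean_true) ** 2 for t in col_true)
    return 1.0 - ss_res / ss_tot


y = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
print(variance_weighted_r2(y, y))                   # perfect prediction
print(variance_weighted_r2(y, [[0.0, 0.0]] * 3))    # all-zero prediction
```

An all-zero prediction scores well below zero whenever the channel means sit far from zero relative to their spread (here −6.0), which is the mechanism behind the strongly negative zero-EMG floor row.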
Auxiliary co-bps (CortexFM only)
Mean co-bps of 0.756 ± 0.128 bits/spike above a per-unit mean-rate baseline, computed on the four held-in calibration files.
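Bits per spike measures the Poisson log-likelihood gained over a baseline, normalized by spike count and converted to bits. A minimal sketch, assuming the baseline is each unit's mean count over the same bins that are scored and that rates are expected counts per 20 ms bin (e.g. `exp(log_rate)`); the helper names are hypothetical.

```python
import math
from typing import Sequence


def poisson_ll(rates: Sequence[float], counts: Sequence[int], eps: float = 1e-9) -> float:
    # log(k!) is omitted because it cancels in the model-minus-baseline difference
    return sum(k * math.log(max(r, eps)) - r for r, k in zip(rates, counts))


def bits_per_spike(rates_by_unit: Sequence[Sequence[float]],
                   counts_by_unit: Sequence[Sequence[int]]) -> float:
    """(LL_model - LL_mean_rate) / (n_spikes * ln 2), summed over units and bins."""
    ll_model = ll_null = 0.0
    n_spikes = 0
    for rates, counts in zip(rates_by_unit, counts_by_unit):
        mean_rate = sum(counts) / len(counts)
        ll_model += poisson_ll(rates, counts)
        ll_null += poisson_ll([mean_rate] * len(counts), counts)
        n_spikes += sum(counts)
    return (ll_model - ll_null) / (n_spikes * math.log(2))


print(bits_per_spike([[0.5, 1.5]], [[0, 2]]))
```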
Held-out OOD calibration sessions (DANDI 000941, days +6 to +30, 3 sessions)
Variance-weighted R², with a calibration budget of about 640 bins (about 12.8 s) per session:
| Session | CortexFM + affine | POYO-1 + affine | Δ (POYO-1 − CortexFM) |
|---|---|---|---|
| 20121004 | +0.4443 | −0.0209 | −0.4652 |
| 20121017 | +0.2730 | −0.2326 | −0.5056 |
| 20121024 | +0.4046 | +0.1824 | −0.2222 |
| Per-session mean ± std | +0.374 ± 0.073 | −0.024 ± 0.169 | −0.398 |
| Pooled R² | +0.387 | −0.008 | −0.395 |
The decisive separation between CortexFM and POYO-1 appears on the held-out OOD sessions. On the held-in sessions the gap is small (Δ = +0.031 pooled R² in CortexFM's favor), but on the held-out sessions it grows to Δ = +0.395 pooled R². Because both backbones use the identical per-session output-space affine adapter, this gap is attributable to backbone representation quality rather than to the adapter recipe.
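The per-session summary row can be reproduced from the three session values. Doing so shows that the ± entries are population standard deviations (ddof = 0); the sample standard deviation would be noticeably larger, which matters when n = 3.

```python
from statistics import mean, pstdev

cortexfm = [0.4443, 0.2730, 0.4046]  # CortexFM + affine, per held-out session
poyo1 = [-0.0209, -0.2326, 0.1824]   # POYO-1 + affine, per held-out session
deltas = [p - c for p, c in zip(poyo1, cortexfm)]  # Δ = POYO-1 − CortexFM

for name, vals in (("CortexFM + affine", cortexfm), ("POYO-1 + affine", poyo1)):
    print(f"{name}: {mean(vals):+.3f} ± {pstdev(vals):.3f}")
print(f"Delta (POYO-1 - CortexFM): {mean(deltas):+.3f}")
# CortexFM + affine: +0.374 ± 0.073
# POYO-1 + affine: -0.024 ± 0.169
# Delta (POYO-1 - CortexFM): -0.398
```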
Why zero-shot RΒ² is negative
Three factors documented in the thesis (Chapter 6):
- (1) Objective mismatch: pretraining minimizes a joint masked-reconstruction + InfoNCE objective, whereas FALCON M1 measures EMG-only regression.
- (2) Inference-time input shift: EMG is the prediction target at evaluation time, so the EMG tokenizer is fed zeros, which lie outside the pretraining distribution.
- (3) No per-session linear correction: standard FALCON pipelines fit a shallow regressor per session; zero-shot CortexFM does not.
The Ridge linear probe addresses factor (3), lifting pooled R² into the positive regime (+0.125) while the per-session mean stays negative (−0.258). EMG-head fine-tuning addresses factors (1) and (2), raising per-session R² to −0.038. The per-session affine adapter addresses all three jointly (pooled R² = +0.529).
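The card does not spell out the exact form of the linear probe or the per-session adapter. A minimal sketch, assuming both are a ridge-regularized affine map from 192-d backbone features to the 16 EMG channels, fitted in closed form on a session's calibration bins. Under that assumption the map has (192 + 1) × 16 = 3,088 parameters, in line with the ~3 K quoted; the mapping, the unpenalized bias, and λ are all assumptions, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def _solve(a: Matrix, b: Matrix) -> Matrix:
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting (a square)."""
    n = len(a)
    aug = [row_a[:] + row_b[:] for row_a, row_b in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def fit_affine_ridge(feats: Matrix, targets: Matrix, lam: float = 1.0) -> Matrix:
    """Fit targets ≈ [feats, 1] @ W with an L2 penalty on the weights (bias unpenalized).

    Returns W with shape (d + 1, m); the last row is the bias.
    """
    x = [row + [1.0] for row in feats]
    d1, m = len(x[0]), len(targets[0])
    xtx = [[sum(r[i] * r[j] for r in x) for j in range(d1)] for i in range(d1)]
    for i in range(d1 - 1):  # penalize weights only, not the bias
        xtx[i][i] += lam
    xty = [[sum(r[i] * t[k] for r, t in zip(x, targets)) for k in range(m)] for i in range(d1)]
    return _solve(xtx, xty)


def apply_affine(feats: Matrix, w: Matrix) -> Matrix:
    return [[sum(f * w[i][k] for i, f in enumerate(row + [1.0])) for k in range(len(w[0]))]
            for row in feats]


w = fit_affine_ridge([[0.0], [1.0], [2.0], [3.0]], [[1.0], [3.0], [5.0], [7.0]], lam=1e-9)
print(w, apply_affine([[4.0]], w))
```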
Limitations
- Single-subject pretraining. Pretraining is restricted to MonkeyL (DANDI 000941). Cross-subject transfer to MonkeyN, MC_Maze, or human cortex is not validated.
- n = 3 OOD sessions. The held-out evaluation uses three sessions; effect sizes are large but formal Holm-corrected statistical power is limited.
- Calibration dependence on OOD. With fewer than ~400 calibration bins (< 8 s) on a held-out session, OOD RΒ² becomes unstable. Real-time deployment therefore requires a brief calibration cycle per session.
- EMG-only readout. The model decodes 16-channel EMG envelopes, not kinematics directly. A downstream kinematic stage is needed for end-effector control.
- No clinical validation. The model is research-grade. It has not been evaluated for safety, robustness, or efficacy in any clinical BCI setting and must not be used as such.
How to use
```python
import torch

from cortex_fm.training import CortexFMPretrainModule

# Load the released checkpoint onto the GPU.
module = CortexFMPretrainModule.load_from_checkpoint(
    "epoch28-0.2599.ckpt",
    map_location="cuda",
    strict=True,
)
module.eval()

# Inference: spike counts -> EMG envelope.
# spike_counts:    (B, T=64, N=64) integer spike counts per 20 ms bin
# emg_placeholder: (B, T=64, M=16) float, all zeros at inference (EMG is the target)
B, T, N, M = 1, 64, 64, 16
spike_counts = torch.zeros(B, T, N, dtype=torch.long, device="cuda")
emg_placeholder = torch.zeros(B, T, M, device="cuda")

with torch.no_grad():
    out = module(spike_counts, emg_placeholder)
    emg_pred = out["emg_pred"].reshape(B, T, M)[:, -1, :]  # (B, M) prediction at the last bin
    log_rate = out["log_rate"]                              # (B, T, N) Poisson log-rates
```
For FALCON M1 evaluation, see benchmark_wrapper/ and the CortexFMFalconDecoder reference implementation.
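For streaming use, one simple pattern is to keep a rolling 64-bin context and read out only the newest bin, mirroring the `[:, -1, :]` indexing above. The sketch below is hypothetical and is not the `CortexFMFalconDecoder` implementation: `model_fn` stands in for a wrapper around the forward pass, and starting from an all-zero history is an assumption.

```python
from collections import deque
from typing import Callable, List, Sequence

Window = List[List[int]]


class RollingContextDecoder:
    """Keep the latest `context_bins` bins of spike counts and return the
    model's EMG prediction for the newest bin (hypothetical wrapper)."""

    def __init__(self, model_fn: Callable[[Window], List[List[float]]],
                 context_bins: int = 64, n_units: int = 64) -> None:
        self.model_fn = model_fn
        # All-zero history (assumption) so the very first call already sees T bins.
        self.buffer = deque(([0] * n_units for _ in range(context_bins)), maxlen=context_bins)

    def step(self, counts_t: Sequence[int]) -> List[float]:
        self.buffer.append(list(counts_t))             # oldest bin is evicted automatically
        emg_window = self.model_fn(list(self.buffer))  # (T, M) prediction for the window
        return emg_window[-1]                          # newest bin only


# Toy model_fn: "EMG" is the summed spike count of each bin.
dec = RollingContextDecoder(lambda w: [[float(sum(row))] for row in w], context_bins=3, n_units=2)
print(dec.step([1, 2]), dec.step([4, 0]))
```

This recomputes the full window at every 20 ms step; caching intermediate activations would be an optimization and is not shown.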
Citation
```bibtex
@mastersthesis{shin2026cortexfm,
  author  = {Shin, Jaeguk},
  title   = {{CortexFM}: A Lightweight Multimodal Foundation Model for Spike--EMG Decoding on Public Brain--Computer Interface Data},
  school  = {Dong-eui University},
  type    = {{M.S.} thesis},
  year    = {2026},
  month   = jun,
  address = {Busan, Republic of Korea},
}
```
If you also use the FALCON benchmark, please cite Karpowicz et al. 2024.
Ethical considerations
CortexFM is a research artifact. The following points apply:
- Animal data. Pretraining data come from a single non-human primate recorded under the original Rouse & Schieber 2018 protocols (DANDI 000941, CC-BY-4.0). No additional animal experiments were conducted for this release.
- No human data. The released checkpoint has not been trained or evaluated on human neural recordings.
- Dual-use awareness. Invasive BCI decoding can in principle inform assistive devices as well as surveillance or commercial neuro-monitoring systems. The author releases this checkpoint to support open scientific reproduction and lightweight benchmarking; downstream users are responsible for ensuring their applications respect informed consent, neural privacy, and applicable medical-device regulation.
- No clinical claims. CortexFM has not been evaluated against clinical-grade BCIs and must not be deployed in patient-facing systems without full regulatory validation.
Model details
- Developed by: Jaeguk Shin (신재국), Department of Artificial Intelligence, Dong-eui University. M.S. thesis (June 2026), advised by faculty of the Dong-eui University AI Department.
- Model type: Multimodal Transformer foundation model (spike + EMG).
- Language: N/A (the inputs are neural signals; the model card is bilingual EN/KO).
- License: MIT (see LICENSE).
- Finetuned from: Trained from scratch.
- Related links: Thesis full text and reproducibility scripts to be released at the GitHub companion repository.