LLMic Mamba version

Architecture: based on the NVIDIA Nemotron-H paper, with some modifications.

The model is saved to huggingface.co/faur-ai/llmamba (private)

The model was trained on the fulg dataset (the tokenized data is in /storage/hdd/alexgh/fulg_v1_ssm-mamba_tokenized_megatron/).

Training stopped at iteration 43906 out of 132016 (about 33% of the planned run), because the Leonardo credits ran out.

All the scripts used + checkpoints + logs are in /storage/hdd/alexgh/llmic_mamba-3.2b/.

Some instructions on installing the environment:

pip install torch==2.5.0 (2.4.0 is probably the better choice; 2.5.0 didn't really work with mamba-ssm)
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
set CUDA_HOME: conda env config vars set CUDA_HOME=$CONDA_PREFIX
conda install cudnn
pip install causal-conv1d
pip install mamba-ssm==2.1.* (2.2.4 didn't work) -> if pip doesn't work, clone the repository and run CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE CAUSAL_CONV1D_FORCE_CXX11_ABI=TRUE pip install .
flash-attn
apex
transformer-engine -> you need to copy nvtx3/nvToolsExt.h into ~/miniconda3/envs/megatron-env/include (you can use find ../meg-env -name "" to locate it)
megatron-lm
at this point, I had to reinstall torch==2.4.0 for everything to work, because transformer_engine overrides the installed version
tensorboard
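
After the steps above, it can help to confirm which versions actually ended up in the environment, since a later install may silently replace torch. The following is a minimal sketch of such a check, not part of the original setup. The package list and the helper names (`installed_versions`, `torch_was_overridden`) are assumptions made for illustration; apex and megatron-lm are left out because their distribution names depend on how they were installed.

```python
from importlib import metadata

# Distribution names taken from the install notes; this list is an assumption
# and may need adjusting for your environment.
PACKAGES = [
    "torch",
    "causal-conv1d",
    "mamba-ssm",
    "flash-attn",
    "transformer-engine",
    "tensorboard",
]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


def torch_was_overridden(versions, expected_prefix="2.4."):
    """True if torch is installed but is not in the expected 2.4.x series."""
    torch_version = versions.get("torch")
    return torch_version is not None and not torch_version.startswith(expected_prefix)


if __name__ == "__main__":
    found = installed_versions(PACKAGES)
    for name, version in found.items():
        print(f"{name:20s} {version or 'NOT INSTALLED'}")
    if torch_was_overridden(found):
        print("torch is not 2.4.x; a later install step may have replaced it")
```

Running it once right after installing transformer-engine and once at the end should show whether the torch reinstall is still needed.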

When training on Leonardo, using 4 GPUs per node didn't work; only 2 GPUs per node did. The failure was some unexplained NCCL network error.

Some problems and future ideas:

  • I used only fulg as RO (Romanian) data -> should probably also add English data (roughly 1T tokens in total)
  • Try to distill / quantize Nemotron-H 8B (and see how it compares on some data; perhaps that's a better way than training from scratch)
  • The tokenizer used is state-spaces/mamba-2.8b-hf (just a GPT-NeoX tokenizer) with 50304 tokens -> probably use LLMic's tokenizer instead
  • The layer ratios are a bit off; for example, there are 2 Mamba layers back to back at some point -> try to follow the architecture that Nemotron-H 8B uses more closely
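
The layer-ratio issue in the last point can be checked mechanically before launching a run. The sketch below assumes the layer layout is written as a pattern string with one symbol per layer, in the style I believe Megatron-LM uses for hybrid models (`M` for Mamba, `*` for attention, `-` for MLP). The example pattern is hypothetical and is not the actual layout of this model; the function names are also made up for illustration.

```python
# Symbols assumed to match the hybrid-pattern convention (an assumption).
MAMBA, ATTENTION, MLP = "M", "*", "-"


def adjacent_mamba_positions(pattern):
    """Return every index i where layers i and i+1 are both Mamba layers."""
    return [
        i
        for i in range(len(pattern) - 1)
        if pattern[i] == MAMBA and pattern[i + 1] == MAMBA
    ]


def layer_counts(pattern):
    """Count how many layers of each type the pattern contains."""
    return {symbol: pattern.count(symbol) for symbol in (MAMBA, ATTENTION, MLP)}


# Hypothetical pattern, for illustration only.
example = "M-M-MM-M*-M-"
print(adjacent_mamba_positions(example))  # [4]
print(layer_counts(example))              # {'M': 6, '*': 1, '-': 5}
```

An empty list from `adjacent_mamba_positions` means no two Mamba layers sit directly next to each other, and the counts give a quick way to compare ratios against a reference layout.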