LLMic Mamba version
Architecture: see NVIDIA's Nemotron-H paper (with some modifications)
The model is saved to huggingface.co/faur-ai/llmamba (private)
The model was trained on the fulg dataset (the data used is in /storage/hdd/alexgh/fulg_v1_ssm-mamba_tokenized_megatron/).
The training iteration reached is 43906 out of 132016 total, as Leonardo credits ran out.
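To put the iteration count in perspective, here is a small sketch that turns the two numbers above into a completed fraction. Only the iteration counts come from this card; the variable names are illustrative.

```python
# Iteration counts taken from the sentence above.
reached = 43906
total = 132016

progress = reached / total      # fraction of the planned schedule completed
remaining = total - reached     # iterations that were never run

print(f"{progress:.1%} done, {remaining} iterations remaining")
# prints "33.3% done, 88110 iterations remaining"
```

In other words, the checkpoint covers roughly a third of the planned schedule.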
All the scripts used + checkpoints + logs are in /storage/hdd/alexgh/llmic_mamba-3.2b/.
Some instructions on installing the environment:
pip install torch==2.5.0 (2.4.0 is probably the better choice, since 2.5.0 didn't really work with mamba-ssm)
conda install nvidia/label/cuda-12.4.0::cuda-toolkit
set CUDA_HOME: conda env config vars set CUDA_HOME=$CONDA_PREFIX
conda install cudnn
pip install causal-conv1d
pip install "mamba-ssm==2.1.*" (2.2.4 didn't work) -> if pip doesn't work, then clone the repo and run CAUSAL_CONV1D_FORCE_BUILD=TRUE CAUSAL_CONV1D_SKIP_CUDA_BUILD=TRUE CAUSAL_CONV1D_FORCE_CXX11_ABI=TRUE pip install .
flash-attn
apex
transformer-engine -> you need to copy nvtx3/nvToolsExt.h into ~/miniconda3/envs/megatron-env/include (you can run find ../meg-env -name "" to locate it)
megatron-lm
at this point, I had to reinstall torch==2.4.0 for everything to work, because transformer_engine overrides it
tensorboard
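Because several of the steps above can silently replace an earlier package (torch in particular), a quick check that everything is still importable may save some debugging. The sketch below is not part of the original setup; the import names in MODULES are assumptions about how each package exposes itself (for example, megatron-lm as megatron), so adjust them if your install differs.

```python
import importlib.util

# Assumed import names for the packages installed above; these are guesses,
# not something the original instructions specify.
MODULES = [
    "torch",
    "causal_conv1d",
    "mamba_ssm",
    "flash_attn",
    "apex",
    "transformer_engine",
    "megatron",
    "tensorboard",
]


def check_modules(names):
    """Map each top-level module name to True if Python can locate it.

    find_spec only looks the module up; it does not import it, so this
    stays cheap and does not need a GPU.
    """
    return {name: importlib.util.find_spec(name) is not None for name in names}


for name, found in check_modules(MODULES).items():
    print(f"{name:20s} {'ok' if found else 'MISSING'}")
```

This only checks that the packages can be found. It does not verify versions, so it is still worth confirming the installed torch version by hand after the transformer-engine step.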
When training on Leonardo, using 4 GPUs per node didn't work (only 2 did). The failure was some weird NCCL network error.
Some problems and future ideas:
- I used only fulg as RO data -> should probably also use English data (totalling around 1T)
- Try to distill / quantize Nemotron-H 8B (see how it compares on some data, perhaps that's a better way than training)
- The tokenizer used is state-spaces/mamba-2.8b-hf (just a GPT-NeoX tokenizer), 50304 tokens -> probably use llmic's tokenizer
- The layer ratios are a bit off, for example there are 2 Mamba layers back to back at some point -> try to follow the architecture that Nemotron-H 8B uses more closely
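For the layer-ratio point, a small helper can flag back-to-back Mamba layers before a run is launched. This is a sketch under assumptions: it supposes the layer layout is written as a pattern string in which M marks a Mamba layer, * an attention layer, and - an MLP layer, and the example pattern is made up for illustration, not the one used for this model.

```python
def find_back_to_back(pattern: str, symbol: str = "M") -> list[int]:
    """Return every index i where pattern[i] and pattern[i + 1] are both `symbol`."""
    return [
        i
        for i in range(len(pattern) - 1)
        if pattern[i] == symbol and pattern[i + 1] == symbol
    ]


# Hypothetical layout, NOT the pattern this model was trained with.
example_pattern = "M-M-M*-MM-M-M*-"

print(find_back_to_back(example_pattern))  # prints [7]
```

An empty list means no two Mamba layers are adjacent in the given pattern.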