seems broken..
I'm using text-generation-webui with this model and comparing it against https://huggingface.co/spaces/HuggingFaceH4/starchat-playground. The responses I get locally don't seem remotely on the same level. Am I missing some setting, such as instruct mode, or some setup step for running locally? What model is the online version using?
What model are you testing? Because you've posted in StarCoder Plus, but linked StarChat Beta, which are different models with different capabilities and prompting methods.
I have a StarChat Beta model here: https://huggingface.co/TheBloke/starchat-beta-GPTQ
If you are using StarChat Beta like you linked, are you using the right prompt template and tokens? I just edited the README to make it clearer what the prompt template is:
Prompt template
<|system|> system message goes here <|end|>
<|user|> prompt goes here <|end|>
<|assistant|>
Example:
<|system|> Below is a conversation between a human user and a helpful AI coding assistant. <|end|>
<|user|> How do I sort a list in Python? <|end|>
<|assistant|>
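As a minimal sketch, the template above can be assembled with a small helper. The function name `build_starchat_prompt` is hypothetical, and the exact spacing and newlines around the special tokens are an assumption taken from the layout shown here, so check them against the README before relying on them.

```python
def build_starchat_prompt(user_message: str, system_message: str = "") -> str:
    """Assemble a prompt in the system/user/assistant layout shown above.

    Hypothetical helper: the whitespace around the tokens mirrors the
    template as written in this thread and is an assumption.
    """
    return (
        f"<|system|> {system_message} <|end|>\n"
        f"<|user|> {user_message} <|end|>\n"
        "<|assistant|>"
    )


prompt = build_starchat_prompt(
    "How do I sort a list in Python?",
    "Below is a conversation between a human user and a helpful AI coding assistant.",
)
print(prompt)
```

The prompt ends at `<|assistant|>` so that the model continues from there with its answer.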
If you are using StarCoder Plus, then please be aware that it is not an instruction-tuned model. From its README:
So it should be able to auto-complete, or fill in the middle. But it's not going to work with "How do I sort a list in Python?". That's what StarChat Beta is for.
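To illustrate the difference, here is a rough sketch of the two kinds of prompt a base code model expects. The helper names are hypothetical, and the fill-in-the-middle token strings (`<fim_prefix>`, `<fim_suffix>`, `<fim_middle>`) are assumed from the StarCoder family tokenizer; confirm them against the model's tokenizer before use.

```python
# Assumed special tokens from the StarCoder family tokenizer.
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def completion_prompt(code_so_far: str) -> str:
    """Auto-complete: the prompt is simply the text to be continued."""
    return code_so_far


def fim_prompt(prefix: str, suffix: str) -> str:
    """Fill in the middle: the model generates the text between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Auto-complete: give the model the start of a function and let it continue.
print(completion_prompt("def sort_list(items):\n    "))

# Fill in the middle: give the model the code before and after the gap.
print(fim_prompt("def sort_list(items):\n    ", "\n    return result\n"))
```

Neither prompt contains a question in natural language, which is why a chat-style request tends to work poorly with a model that has not been instruction-tuned.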
OK, I might have got confused on that. I'm downloading starchat-beta.ggmlv3.q5_1.bin now, and hopefully I'll get it working. The online demo (https://huggingface.co/spaces/HuggingFaceH4/starchat-playground, the beta) worked pretty well, so hopefully it will be the same locally. It's definitely faster online. Is that just much faster GPU hardware being used behind the scenes? I'm not even sure I'm getting full speed out of my local setup, an RTX 4090. What is the average token speed I should expect out of a card like that on GGML or GPTQ? I'm on Windows, using the one-click installers and text-generation-webui, all up to date with the git repos. I'm just not sure if I'm missing something. I assume the one-click installer is getting all the correct libraries; I did specify NVIDIA for the install and the update.
RTX 4090: what is the average token speed I should expect out of a card like that on GGML or GPTQ?
It's going to be slow(er) compared to something like Google Colab or HF; they're using farms of computers with GPUs like the A100 to run this infrastructure. For me (TITAN RTX), it takes anywhere from ~2 to 15 seconds on average to generate a full response, depending on length.
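Seconds per response depends heavily on response length, so tokens per second is an easier number to compare across cards. A minimal sketch of how to measure it follows. The `tokens_per_second` helper and the `fake_generate` stand-in are hypothetical; in practice you would pass a function that calls your own backend and returns the generated token IDs.

```python
import time
from typing import Callable, Sequence


def tokens_per_second(generate: Callable[[], Sequence[int]]) -> float:
    """Time one generation call and return generated tokens per second."""
    start = time.perf_counter()
    tokens = generate()
    elapsed = time.perf_counter() - start
    return len(tokens) / elapsed if elapsed > 0 else float("inf")


def fake_generate() -> Sequence[int]:
    """Stand-in for a real model call: pretends to emit 100 tokens."""
    time.sleep(0.05)
    return list(range(100))


rate = tokens_per_second(fake_generate)
print(f"{rate:.1f} tokens/s")
```

Running this a few times with the same prompt and a fixed maximum length gives a figure that can be compared between GGML and GPTQ on the same machine.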
