seems broken..

by Boffy - opened Jun 9, 2023

Discussion

Boffy

Jun 9, 2023

I'm using text-generation-webui with this model, and compared with https://huggingface.co/spaces/HuggingFaceH4/starchat-playground the questions and generated responses don't seem remotely on the same level. I'm not sure if I'm missing some setting (instruct mode?) or setup step for running locally. What model is the online version using?

TheBloke

Owner Jun 9, 2023 (edited Jun 9, 2023)

What model are you testing? Because you've posted in StarCoder Plus, but linked StarChat Beta, which are different models with different capabilities and prompting methods.

I have a StarChat Beta model here: https://huggingface.co/TheBloke/starchat-beta-GPTQ

If you are using StarChat Beta like you linked, are you using the right prompt template and tokens? I just edited the README to make it clearer what the prompt template is:

Prompt template

<|system|> system message goes here <|end|>
<|user|> prompt goes here <|end|>
<|assistant|>

Example:

<|system|> Below is a conversation between a human user and a helpful AI coding assistant. <|end|>
<|user|> How do I sort a list in Python? <|end|>
<|assistant|>
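
For anyone assembling this prompt in a script rather than in a UI, here is a minimal sketch of the template as a string builder. The function name `build_starchat_prompt` is hypothetical (it is not part of any library), and the spacing simply mirrors the template as written above; check the model's README if the output looks off.

```python
def build_starchat_prompt(user_message: str, system_message: str = "") -> str:
    """Format a single turn with the <|system|>/<|user|>/<|assistant|> template.

    Hypothetical helper for illustration only; spacing follows the template
    exactly as it is written in this thread.
    """
    return (
        f"<|system|> {system_message} <|end|>\n"
        f"<|user|> {user_message} <|end|>\n"
        "<|assistant|>"
    )


prompt = build_starchat_prompt(
    "How do I sort a list in Python?",
    "Below is a conversation between a human user and a helpful AI coding assistant.",
)
print(prompt)
```

The resulting string ends with `<|assistant|>`, so the model continues from there with its reply.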

If you are using StarCoder Plus then please be aware that it is not an instruction tuned model. From its README:

So it should be able to auto-complete, or fill in the middle. But it's not going to work with "How do I sort a list in Python?". That's what StarChat Beta is for.
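
To make the difference concrete, here is a rough sketch of the two prompt shapes a completion model like this expects. It assumes the StarCoder-family fill-in-the-middle tokens `<fim_prefix>`, `<fim_suffix>` and `<fim_middle>`; verify them against the model's tokenizer before relying on them. `build_fim_prompt` is a hypothetical helper, not a library function.

```python
# Assumed StarCoder-family fill-in-the-middle special tokens (verify in the tokenizer).
FIM_PREFIX = "<fim_prefix>"
FIM_SUFFIX = "<fim_suffix>"
FIM_MIDDLE = "<fim_middle>"


def build_fim_prompt(prefix: str, suffix: str) -> str:
    """Ask the model to generate the code that belongs between prefix and suffix."""
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}"


# Plain auto-complete: give the model the start of some code and let it continue.
completion_prompt = "def sort_numbers(numbers):\n    "

# Fill in the middle: the model is asked to write the missing function body.
fim_prompt = build_fim_prompt(
    prefix="def sort_numbers(numbers):\n    ",
    suffix="\n    return result\n",
)
print(fim_prompt)
```

Neither prompt is a question; both are code for the model to continue, which is why a chat-style request does not work well here.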

Boffy

Jun 9, 2023

OK, I might have got confused on that. I'm downloading starchat-beta.ggmlv3.q5_1.bin now, and hopefully I'll get it working. The online demo (https://huggingface.co/spaces/HuggingFaceH4/starchat-playground, beta) worked pretty well, so hopefully it will be the same locally. It's definitely faster online (is that just much faster GPU hardware being used behind the scenes?). I'm not even sure I'm getting full speed out of my local setup, an RTX 4090. What average token speed should I expect from a card like that on GGML or GPTQ? I'm on Windows with the one-click installers and text-generation-webui, all up to date with the git repos. I'm just not sure if I'm missing something. I assume the one-click installer is getting all the correct libraries; I did specify NVIDIA during the install and the update.

psyberm

Jun 9, 2023 (edited Jun 9, 2023)

4090rtx what is the average token speed I should expect out of a card like that on ggml or gptq

It's going to be slow(er) compared to something like Google Colab or HF; they're using farms of computers with GPUs like the A100 to run this infrastructure. For me (TITAN RTX), it takes anywhere from ~2 to 15 seconds on average to generate a full response, depending on length.
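
If you want a number for your own card instead of an estimate, one option is to time a generation call and divide the token count by the elapsed time. The sketch below is backend-agnostic: `measure_generation` is a hypothetical helper, and `fake_generate` is a stand-in that only sleeps, so the printed figure says nothing about real hardware. Swap in a function that calls your actual backend and returns the generated token IDs.

```python
import time
from typing import Callable, Sequence, Tuple


def measure_generation(generate: Callable[[], Sequence[int]]) -> Tuple[int, float, float]:
    """Run generate() once and return (token_count, seconds, tokens_per_second)."""
    start = time.perf_counter()
    token_ids = generate()
    elapsed = time.perf_counter() - start
    rate = len(token_ids) / elapsed if elapsed > 0 else float("inf")
    return len(token_ids), elapsed, rate


def fake_generate() -> Sequence[int]:
    """Placeholder for a real model call: sleeps briefly, returns 100 dummy token IDs."""
    time.sleep(0.05)
    return list(range(100))


count, seconds, rate = measure_generation(fake_generate)
print(f"{count} tokens in {seconds:.2f}s -> {rate:.1f} tokens/s")
```

Timing a few runs of different lengths gives a fairer picture than a single run, since short responses are dominated by prompt processing.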
