---
license: apache-2.0
language:
- en
base_model:
- WinKawaks/vit-small-patch16-224
- google/bert_uncased_L-2_H-128_A-2
pipeline_tag: image-to-text
library_name: transformers
tags:
- vit
- bert
- vision
- caption
- captioning
- image
---
An image captioning model based on bert-tiny and vit-small, weighing only 100 MB!

It runs very fast on CPU.

```python
from transformers import AutoTokenizer, AutoImageProcessor, VisionEncoderDecoderModel
import requests, time
from PIL import Image

model_path = "cnmoro/tiny-image-captioning"

# load the image captioning model and corresponding tokenizer and image processor
model = VisionEncoderDecoderModel.from_pretrained(model_path)
tokenizer = AutoTokenizer.from_pretrained(model_path)
image_processor = AutoImageProcessor.from_pretrained(model_path)

# preprocess an image (convert to RGB so grayscale or RGBA inputs also work)
url = "https://upload.wikimedia.org/wikipedia/commons/thumb/4/47/New_york_times_square-terabass.jpg/800px-New_york_times_square-terabass.jpg"
image = Image.open(requests.get(url, stream=True).raw).convert("RGB")
pixel_values = image_processor(image, return_tensors="pt").pixel_values

start = time.time()

# generate caption - suggested settings
# note: temperature, top_p and top_k only take effect when do_sample=True is also passed
generated_ids = model.generate(
    pixel_values,
    temperature=0.7,
    top_p=0.8,
    top_k=50,
    num_beams=3  # you can use 1 for even faster inference with a small drop in quality
)
generated_text = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

end = time.time()

print(generated_text)
# a group of people walking in the middle of a city.

print(f"Time taken: {end - start} seconds")
# Time taken: 0.11215853691101074 seconds
# on CPU !
```
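
The timing above comes from a single `time.time()` measurement, which can include one-off overhead from the first call. A minimal sketch of a steadier measurement follows. It assumes a hypothetical helper named `benchmark` (not part of this model or of Transformers) that runs a callable a few times after a warm-up and reports the minimum, median, and maximum wall-clock time.

```python
import statistics
import time
from typing import Callable


def benchmark(fn: Callable[[], object], warmup: int = 1, runs: int = 5) -> dict:
    """Time fn() over several runs and summarize the wall-clock durations.

    Hypothetical helper for illustration; the name and defaults are assumptions.
    """
    if runs < 1:
        raise ValueError("runs must be at least 1")

    # Warm-up calls are executed but not timed, so one-off setup cost is excluded.
    for _ in range(warmup):
        fn()

    durations = []
    for _ in range(runs):
        start = time.perf_counter()  # monotonic clock intended for short intervals
        fn()
        durations.append(time.perf_counter() - start)

    return {
        "runs": len(durations),
        "min": min(durations),
        "median": statistics.median(durations),
        "max": max(durations),
    }
```

With the objects from the snippet above, one possible call is `benchmark(lambda: model.generate(pixel_values, num_beams=3))`. Repeating it with `num_beams=1` gives a rough view of the speed difference between the two settings on your own hardware.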