TLDR: PaliGemma 2 mix is a versatile vision-language model from Google, part of the Gemma family. A single model handles captioning, OCR, image question answering, object detection, and segmentation. It comes in several sizes and resolutions, works out of the box for these tasks, and can be fine-tuned for specific needs. You can try it in the Hugging Face demo or download the model weights from Kaggle and Hugging Face.
Google has launched PaliGemma 2 mix, an upgraded vision-language model in the Gemma family. It builds on PaliGemma 2: the mix models are tuned to perform a variety of tasks, so users can explore the model’s capabilities and use it immediately for common use cases.
Features
PaliGemma 2 mix stands out with its ability to handle multiple tasks using a single model. It can perform short and long captioning, optical character recognition (OCR), image question answering, object detection, and segmentation. This versatility makes it a valuable tool for developers and researchers working across different domains. The models are available in various sizes (3B, 10B, and 28B parameters) and resolutions (224px and 448px), allowing users to select the best fit for their specific needs. PaliGemma 2 mix supports various frameworks, including Hugging Face Transformers, Keras, PyTorch, JAX, and Gemma.cpp.
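Choosing a variant comes down to picking one size and one resolution. The sketch below enumerates the combinations and builds a checkpoint name for each. The `google/paligemma2-{size}-mix-{resolution}` pattern and the `mix_checkpoint` helper are assumptions for illustration, modeled on the `PaliGemma-2-3b-mix-224` name used in the detection example of this post; check the Hugging Face or Kaggle model pages for the exact identifiers.

```python
from itertools import product

SIZES = ("3b", "10b", "28b")   # parameter counts listed in the post
RESOLUTIONS = (224, 448)       # input resolutions in pixels


def mix_checkpoint(size, resolution, org="google"):
    """Build a checkpoint name. The naming pattern is an assumption."""
    if size not in SIZES:
        raise ValueError(f"unknown size {size!r}, expected one of {SIZES}")
    if resolution not in RESOLUTIONS:
        raise ValueError(
            f"unknown resolution {resolution!r}, expected one of {RESOLUTIONS}"
        )
    return f"{org}/paligemma2-{size}-mix-{resolution}"


# Print every size/resolution combination.
for size, resolution in product(SIZES, RESOLUTIONS):
    print(mix_checkpoint(size, resolution))
```

Smaller sizes and the 224px resolution are the cheaper starting point; larger sizes and 448px trade compute for capacity and finer image detail.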
How PaliGemma 2 mix works
The model performs different tasks depending on the prompt it receives. Here are two examples of how you can use PaliGemma 2 mix:
- Object Detection: The model can identify and locate objects within an image. For example, the prompt “detect android” would find any Android devices present in the image.
  - Task: Detection (PaliGemma-2-3b-mix-224)
  - Input: “detect android\n”
  - Result: [image not reproduced]
- Optical Character Recognition (OCR): The model can extract text from images. For example, when processing an image containing a warning sign that says “WARNING DANGEROUS RIP CURRENT”, the model can recognize and output that text.
  - Result: WARNING DANGEROUS RIP CURRENT
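OCR output is plain text, but a detection result has to encode box coordinates in the generated text. This post does not state the format, so the sketch below assumes the location-token convention described in the PaliGemma documentation: each box is four `<locNNNN>` tokens in y_min, x_min, y_max, x_max order on a 0 to 1023 grid, followed by the label, with multiple boxes separated by `;`. The `parse_detections` helper and the sample string are hypothetical.

```python
import re

# Assumed output format (not stated in this post):
#   "<loc0256><loc0128><loc0768><loc0896> android ; <loc...>... android"
_BOX = re.compile(
    r"<loc(\d{4})><loc(\d{4})><loc(\d{4})><loc(\d{4})>\s*([^;<]+)"
)


def parse_detections(text, width, height):
    """Convert location tokens to pixel boxes (x_min, y_min, x_max, y_max)."""
    detections = []
    for y0, x0, y1, x1, label in _BOX.findall(text):
        detections.append({
            "label": label.strip(),
            # Token values are bins on a 1024-step grid; scale to pixels.
            "box": (
                int(x0) / 1024 * width,
                int(y0) / 1024 * height,
                int(x1) / 1024 * width,
                int(y1) / 1024 * height,
            ),
        })
    return detections


sample = "<loc0256><loc0128><loc0768><loc0896> android"  # made-up output
print(parse_detections(sample, width=1024, height=512))
```

Text without location tokens, such as the OCR result above, yields an empty list, so the same helper can be applied safely to any task's output.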
Getting started
PaliGemma 2 mix offers strong performance across multiple tasks. To explore it, try the mix model in the Hugging Face demo, or download the model weights from Kaggle or Hugging Face. The Keras inference notebook shows how to run the model in Google Colab or locally. PaliGemma 2 mix can also be deployed and tuned directly in Vertex Model Garden. For the best results on a specific task or domain, fine-tuning PaliGemma 2 is recommended.
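As an alternative to the Keras notebook, a local run with Hugging Face Transformers might look like the sketch below. It is a minimal sketch, not the official recipe. The model ID, the `build_prompt` and `run_prompt` helpers, and every task prefix other than `detect` (the only one shown in this post) are assumptions. The trailing `\n` from the example input is left out on the understanding that the Transformers processor appends it.

```python
# Assumed task prefixes; only "detect <object>" appears in this post.
TASK_PREFIXES = {
    "caption": "caption en",
    "ocr": "ocr",
    "vqa": "answer en {arg}",
    "detect": "detect {arg}",
    "segment": "segment {arg}",
}


def build_prompt(task, arg=""):
    """Return the text prompt for a task, e.g. ("detect", "android")."""
    return TASK_PREFIXES[task].format(arg=arg).strip()


def run_prompt(image_path, prompt,
               model_id="google/paligemma2-3b-mix-224",  # assumed ID
               max_new_tokens=64):
    """Generate text for one image and prompt (downloads the weights)."""
    # Heavy imports are kept local so build_prompt works without them.
    import torch
    from PIL import Image
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    processor = AutoProcessor.from_pretrained(model_id)
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        model_id, torch_dtype=torch.bfloat16
    ).eval()

    image = Image.open(image_path).convert("RGB")
    inputs = processor(text=prompt, images=image, return_tensors="pt")
    inputs = inputs.to(torch.bfloat16).to(model.device)
    prompt_len = inputs["input_ids"].shape[-1]

    with torch.inference_mode():
        output = model.generate(
            **inputs, max_new_tokens=max_new_tokens, do_sample=False
        )
    # Drop the prompt tokens and decode only the newly generated ones.
    return processor.decode(output[0][prompt_len:], skip_special_tokens=True)


# Example call (hypothetical file name):
# print(run_prompt("sign.jpg", build_prompt("ocr")))
```

Loading the model inside `run_prompt` keeps the sketch short; for repeated calls, load the processor and model once and reuse them.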
Source: Google Developer Blog