Gemma 4 12B Unified: Google’s new local multimodal model for agents

Google has expanded the Gemma 4 family with a particularly interesting release for developers: Gemma 4 12B Unified, an open-weight multimodal model designed to run locally on laptops and power agents that work with text, images, audio, and video.

The news is not simply that there is another Gemma model. The important part is the direction: Google is positioning Gemma 4 12B as a practical building block for local agents, developer tools, and multimodal experiences that do not always depend on a cloud API.

What Gemma 4 12B Unified is

Gemma 4 12B Unified is a Google DeepMind model released on June 3, 2026. It is part of Gemma 4, Google’s family of open models built from research and technology related to Gemini.

According to the official documentation, Gemma 4 models are multimodal open-weight models, available in pre-trained and instruction-tuned variants, licensed under Apache 2.0, with support for more than 140 languages and context windows up to 256K tokens on medium and large models.

The new 12B Unified variant sits in the middle: more capable than the small edge models, but still compact enough to run on a laptop with around 16 GB of memory when quantized.

The key difference: encoder-free multimodality

The most interesting technical detail is that Gemma 4 12B uses an encoder-free multimodal architecture.

Many multimodal models rely on separate encoders for vision or audio. That approach can work well, but it also adds latency, memory overhead, and fine-tuning complexity.

Gemma 4 12B takes a different path:

For vision, it uses a small vision embedder that projects image patches into the LLM space.
For audio, it projects raw 16 kHz audio in 40 ms frames.
Text, image, and audio inputs all flow into the same decoder-only backbone.
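To make the audio numbers concrete: at 16 kHz, a 40 ms frame holds 16,000 × 0.040 = 640 samples, or 25 frames per second. The sketch below only illustrates that framing arithmetic in plain Python. It assumes non-overlapping frames and drops any trailing partial frame, neither of which the source confirms, and `frame_audio` is a hypothetical helper, not part of any Gemma API.

```python
SAMPLE_RATE_HZ = 16_000   # from the article: raw 16 kHz audio
FRAME_MS = 40             # from the article: 40 ms frames


def samples_per_frame(sample_rate_hz: int = SAMPLE_RATE_HZ, frame_ms: int = FRAME_MS) -> int:
    """Number of raw samples in one frame."""
    return sample_rate_hz * frame_ms // 1000


def frame_audio(samples: list[float], frame_len: int) -> list[list[float]]:
    """Split samples into consecutive, non-overlapping frames (an assumption).

    A trailing partial frame is dropped here for simplicity; a real
    pipeline might pad it instead.
    """
    usable = len(samples) - len(samples) % frame_len
    return [samples[i:i + frame_len] for i in range(0, usable, frame_len)]


if __name__ == "__main__":
    frame_len = samples_per_frame()
    one_second = [0.0] * SAMPLE_RATE_HZ            # one second of silence
    frames = frame_audio(one_second, frame_len)
    print(frame_len, len(frames))                  # samples per frame, frames per second
```

Each of those frames would then be projected into the model's embedding space; that projection is learned and is not reproduced here.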

The result is a more unified architecture for experimenting with multimodal agents, video analysis, audio understanding, coding, and reasoning.

Supported modalities

The Gemma 4 family supports:

Text
Images
Video
Audio

Google documents native audio support on E2B, E4B, and 12B. That makes Gemma 4 12B especially interesting because it is the first medium-sized Gemma model capable of natively ingesting audio.

That enables use cases such as:

Local transcription and audio analysis.
Video understanding with frames plus audio.
Agents that reason over visual content.
Local tools that process images or clips without uploading data to the cloud.
Coding workflows powered by a local model.

Gemma 4 model sizes

Google documents five main variants:

Model	Focus
Gemma 4 E2B	Edge / ultra-mobile
Gemma 4 E4B	More capable edge use cases
Gemma 4 12B Unified	Encoder-free multimodal laptop model
Gemma 4 26B A4B	Efficient Mixture-of-Experts model
Gemma 4 31B Dense	More powerful dense model

Gemma 4 also supports Multi-Token Prediction (MTP) for faster inference through speculative decoding.
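In general terms, speculative decoding lets a cheap drafter propose several tokens that the main model then verifies, keeping the longest agreeing prefix. The toy below is a greedy draft-and-verify loop over integer "tokens". `draft_next` and `target_next` are hypothetical stand-ins for the two predictors, and nothing here reflects how Gemma 4's MTP is actually implemented.

```python
from typing import Callable, List

NextToken = Callable[[List[int]], int]  # maps a token prefix to the next token


def speculative_step(prefix: List[int], draft_next: NextToken,
                     target_next: NextToken, k: int) -> List[int]:
    """Return the tokens accepted in one greedy draft-and-verify round."""
    # 1. The cheap drafter proposes k tokens autoregressively.
    draft, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        draft.append(tok)
        ctx.append(tok)

    # 2. The target checks each proposal. A real system does this in one
    #    batched forward pass; here it is a plain loop.
    accepted, ctx = [], list(prefix)
    for tok in draft:
        expected = target_next(ctx)
        if tok != expected:
            accepted.append(expected)   # replace the first mismatch and stop
            return accepted
        accepted.append(tok)
        ctx.append(tok)

    # 3. All k drafts matched, so the target contributes one extra token.
    accepted.append(target_next(ctx))
    return accepted


def generate(prefix: List[int], draft_next: NextToken, target_next: NextToken,
             k: int, max_new: int) -> List[int]:
    """Repeat draft-and-verify rounds until max_new tokens were produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft_next, target_next, k))
    return out[:len(prefix) + max_new]


if __name__ == "__main__":
    target = lambda ctx: (ctx[-1] + 1) % 10                            # toy main model: counts up
    drafter = lambda ctx: (ctx[-1] + 1) % 10 if len(ctx) % 4 else 0    # sometimes wrong
    print(generate([0], drafter, target, k=3, max_new=8))
```

Because every accepted token is one the target would have chosen itself, the output matches plain greedy decoding with the target alone; the drafter only affects how many rounds are needed.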

Approximate memory requirements

Google AI for Developers lists approximate memory requirements depending on quantization. For Gemma 4 12B, the rough numbers are:

Precision / quantization	Approx. memory
FP16/BF16	~26.7 GB
8-bit	~13.4 GB
4-bit	~6.7 GB

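That is why Google presents it as viable on laptops with around 16 GB of VRAM or unified memory: the 4-bit build fits comfortably and the 8-bit build fits with little room to spare, while the full FP16/BF16 weights do not fit at all. Leave some headroom beyond these figures for the context (KV cache) and the runtime itself.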
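The table follows from simple arithmetic: weight memory is roughly the parameter count times the bytes per parameter. Working backwards from the ~26.7 GB FP16/BF16 figure gives about 13.35 billion stored parameters. That count is inferred here, not stated by Google, and the estimate ignores the KV cache and runtime overhead.

```python
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}

# Inferred from the ~26.7 GB BF16 figure in the table (26.7e9 bytes divided
# by 2 bytes per parameter); an assumption, not an official parameter count.
ASSUMED_PARAMS = 26.7e9 / BYTES_PER_PARAM["bf16"]


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


def fits(n_params: float, precision: str, budget_gb: float) -> bool:
    """True if the weights alone fit in the given memory budget."""
    return weight_memory_gb(n_params, precision) <= budget_gb


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        gb = weight_memory_gb(ASSUMED_PARAMS, precision)
        ok = fits(ASSUMED_PARAMS, precision, budget_gb=16)
        print(f"{precision}: ~{gb:.2f} GB, weights fit in 16 GB: {ok}")
```

The same two functions can be reused to check other budgets, for example an 8 GB GPU, by changing `budget_gb`.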

How to try Gemma 4 12B locally

Google mentions several ways to run or experiment with the model: LM Studio, Ollama, Google AI Edge Gallery, LiteRT-LM, Hugging Face, llama.cpp, MLX, SGLang, vLLM, and Unsloth.

Option 1: Ollama

If you have Ollama installed, the simplest path should look like this:

ollama pull gemma4:12b
ollama run gemma4:12b

Quick test:

ollama run gemma4:12b "Explain what makes Gemma 4 12B Unified interesting for local agents."

Note: the exact tag name may vary depending on how Ollama publishes the variant. If it fails, check the official Ollama library entry for Gemma 4.
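The same prompt can also be sent from code through Ollama's local REST API. This sketch assumes Ollama's default port 11434 and reuses the `gemma4:12b` tag from above, which, as noted, may not match the published tag. The helper names are illustrative.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint
MODEL_TAG = "gemma4:12b"                        # assumed tag; verify in the Ollama library


def build_chat_payload(prompt: str, model: str = MODEL_TAG) -> dict:
    """Build a non-streaming request body for Ollama's /api/chat endpoint."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": False,  # return one JSON object instead of a stream of chunks
    }


def chat(prompt: str, url: str = OLLAMA_URL, timeout: float = 300.0) -> str:
    """Send the prompt to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_chat_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["message"]["content"]


# Usage (requires a running Ollama server with the model pulled):
# print(chat("Explain what makes Gemma 4 12B Unified interesting for local agents."))
```

Only the standard library is used, so the sketch runs without installing a client package.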

Option 2: LM Studio

Open LM Studio.
Search for Gemma 4 12B.
Download an instruction-tuned quantized build.
Load it in local server mode.
Use the OpenAI-compatible endpoint from your apps.

Example request:

curl http://localhost:1234/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-12b-it",
    "messages": [
      {"role": "system", "content": "You are a concise technical assistant."},
      {"role": "user", "content": "Give me 5 ideas for using Gemma 4 12B locally."}
    ],
    "temperature": 0.7
  }'
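The curl request above translates directly to Python. The sketch below targets any server that exposes the OpenAI-style `/v1/chat/completions` route; the base URL and model name are taken from the curl example and should be replaced with whatever your LM Studio instance actually reports. The function names are illustrative.

```python
import json
import urllib.request


def build_request(base_url: str, model: str, user_text: str,
                  system_text: str = "You are a concise technical assistant.",
                  temperature: float = 0.7) -> urllib.request.Request:
    """Build a POST request for an OpenAI-compatible chat completions route."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": system_text},
            {"role": "user", "content": user_text},
        ],
        "temperature": temperature,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Pull the assistant text out of a chat completions response."""
    return response_body["choices"][0]["message"]["content"]


def ask(base_url: str, model: str, user_text: str, timeout: float = 300.0) -> str:
    """Send one chat turn to a running local server and return the reply."""
    request = build_request(base_url, model, user_text)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_reply(json.load(response))


# Usage (requires the LM Studio local server to be running):
# print(ask("http://localhost:1234", "gemma-4-12b-it",
#           "Give me 5 ideas for using Gemma 4 12B locally."))
```

Keeping request building and response parsing in separate functions makes both easy to test without a server.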

Option 3: LiteRT-LM

Google is pushing LiteRT-LM for local experiences and OpenAI-compatible local servers.

Example from Google’s documented flow:

litert-lm import \
  --from-huggingface-repo=litert-community/gemma-4-12B-it-litert-lm \
  gemma-4-12B-it.litertlm \
  gemma4-12b

litert-lm serve

Then connect OpenAI-compatible tools to the local server.

Option 4: Hugging Face Transformers

Install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -U transformers accelerate torch

Basic example:

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "google/gemma-4-12B-it"

tok = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

messages = [
    {"role": "system", "content": "You are a technical assistant."},
    {"role": "user", "content": "Explain Gemma 4 12B Unified in 5 bullet points."}
]

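# Let the chat template tokenize directly so special tokens are not added twice.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_dict=True, return_tensors="pt"
).to(model.device)

# temperature only takes effect when sampling is enabled.
out = model.generate(**inputs, max_new_tokens=500, do_sample=True, temperature=0.7)

# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[0][inputs["input_ids"].shape[-1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))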

Note: confirm the exact Hugging Face repository name before running this, because Google publishes multiple variants and quantizations.

Option 5: llama.cpp / GGUF

If you use llama.cpp, download a GGUF quantized build from the Gemma 4 collections.

Generic example:

./llama-server \
  -m ./models/gemma-4-12b-it-q4_k_m.gguf \
  --ctx-size 32768 \
  --port 8080

Then:

curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4-12b-it",
    "messages": [
      {"role": "user", "content": "Create a plan to test function calling with Gemma 4."}
    ]
  }'

Prompt formatting in Gemma 4

Google introduced Gemma 4-specific control tokens. If you are using libraries such as Transformers, the safest approach is to use the tokenizer’s official chat template instead of manually building prompts.

For custom apps:

Use system, user, and model roles when supported by your runtime.
Avoid manual prompt concatenation when the library provides an official template.
Keep system instructions separate from user input.
For function calling, always validate generated output before executing code or tools.

Function calling and agents

Gemma 4 is designed for agentic workflows and function calling. One point matters, though: the model does not execute tools itself. It only emits a tool call or code, and your application must validate and execute it safely.

Conceptual example:

{
  "tool": "search_docs",
  "arguments": {
    "query": "Gemma 4 MTP speculative decoding"
  }
}

Best practices:

Validate JSON/schema before execution.
Limit the available tools.
Log every tool call.
Use allowlists for commands or file paths.
Never execute generated code without a sandbox.
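These practices can be sketched as a small gatekeeper that sits between the model's output and your tools. The `search_docs` tool and the `{"tool": ..., "arguments": ...}` shape come from the conceptual example above. The registry layout, `validate_tool_call`, `dispatch`, and the stub implementation are hypothetical and would need to match whatever call format your runtime actually emits.

```python
import json
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("tool-calls")


def search_docs(query: str) -> str:
    """Stub tool; a real one would query a documentation index."""
    return f"results for {query!r}"


# Allowlist: tool name -> (callable, required argument names and types).
TOOLS = {
    "search_docs": (search_docs, {"query": str}),
}


def validate_tool_call(raw: str):
    """Parse model output and return (function, arguments) or raise ValueError."""
    try:
        call = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(call, dict) or set(call) != {"tool", "arguments"}:
        raise ValueError("expected exactly the keys 'tool' and 'arguments'")
    name, args = call["tool"], call["arguments"]
    if not isinstance(name, str) or name not in TOOLS:
        raise ValueError(f"tool {name!r} is not on the allowlist")
    func, schema = TOOLS[name]
    if not isinstance(args, dict) or set(args) != set(schema):
        raise ValueError(f"arguments must be exactly {sorted(schema)}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"argument {key!r} must be {expected_type.__name__}")
    return func, args


def dispatch(raw: str) -> str:
    """Validate, log, then execute a single tool call."""
    func, args = validate_tool_call(raw)
    log.info("tool call: %s(%s)", func.__name__, args)
    return func(**args)


if __name__ == "__main__":
    print(dispatch('{"tool": "search_docs", '
                   '"arguments": {"query": "Gemma 4 MTP speculative decoding"}}'))
```

Anything that is not valid JSON, names an unknown tool, or carries missing, extra, or wrongly typed arguments is rejected before any tool runs. Sandboxing of generated code is a separate concern that this sketch does not cover.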

Practical use cases

Gemma 4 12B can be useful for:

Local privacy-focused assistants.
Document and image analysis without uploading data to a remote server.
Agent prototypes with tool calling.
Local transcription and audio analysis.
Video analysis using frames plus audio.
Local coding assistants.
Local RAG with embeddings plus Gemma for generation.
Educational or enterprise apps where data must stay on-device.

Why it matters

The open model race is no longer only about parameter count. Gemma 4 12B points to another direction: models that are capable enough, multimodal, and practical to run locally.

That matters for developers because it makes it possible to build agents, automations, and AI experiences without depending entirely on proprietary cloud APIs. It can also improve privacy, latency, and cost control.

It will not replace frontier cloud models for every use case, but it is a compelling option for local apps, edge deployments, and internal developer tools.

Upcoming local test

This release is especially interesting to test on real local hardware. I’ll try to run it locally soon to evaluate performance, memory usage, response quality, multimodal capabilities, and how practical it feels for everyday agentic workflows. Once I have run that test, I’ll update the article with my impressions and a more grounded hands-on guide.

Official sources

Google AI for Developers: Gemma releases
Google AI for Developers: Gemma 4 model overview
Google AI for Developers: Gemma 4 model card
Google Developers Blog: Gemma 4 12B: The Developer Guide
Google DeepMind: Gemma 4