Phi-4-multimodal-instruct

Audio · by Microsoft

Microsoft's Phi-4-multimodal-instruct is a 5.6B-parameter model for speech recognition, translation, and visual QA in 20+ languages.

Model Card

🎉Phi-4: [mini-reasoning | reasoning] | [multimodal-instruct | onnx]; [mini-instruct | onnx]

Model Summary

Phi-4-multimodal-instruct is a lightweight open multimodal foundation model that leverages the language, vision, and speech research and datasets used for Phi-3.5 and 4.0 models. The model processes text, image, and audio inputs, generates text outputs, and comes with a 128K token context length. The model underwent an enhancement process that incorporated supervised fine-tuning, direct preference optimization, and RLHF (Reinforcement Learning from Human Feedback) to support precise instruction adherence and safety measures. The languages that each modality supports are the following:

  • Text: Arabic, Chinese, Czech, Danish, Dutch, English, Finnish, French, German, Hebrew, Hungarian, Italian, Japanese, Korean, Norwegian, Polish, Portuguese, Russian, Spanish, Swedish, Thai, Turkish, Ukrainian
  • Vision: English
  • Audio: English, Chinese, German, French, Italian, Japanese, Spanish, Portuguese

📰 Phi-4-multimodal Microsoft Blog
📖 Phi-4-multimodal Technical Report
🏡 Phi Portal
👩‍🍳 Phi Cookbook
🖥️ Try It on Azure, GitHub, Nvidia, Huggingface playgrounds
📱 Huggingface Spaces: Thoughts Organizer, Stories Come Alive, Phine Speech Translator

Watch as Phi-4 Multimodal analyzes spoken language to help plan a trip to Seattle, demonstrating its advanced audio processing and recommendation capabilities.

See how Phi-4 Multimodal tackles complex mathematical problems through visual inputs, demonstrating its ability to process and solve equations presented in images.

Explore how Phi-4 Mini functions as an intelligent agent, showcasing its reasoning and task execution abilities in complex scenarios.

Intended Uses

Primary Use Cases

The model is intended for broad multilingual and multimodal commercial and research use. It is suited to general-purpose AI systems and applications that require:

  1. Memory/compute constrained environments
  2. Latency bound scenarios
  3. Strong reasoning (especially math and logic)
  4. Function and tool calling
  5. General image understanding
  6. Optical character recognition
  7. Chart and table understanding
  8. Multiple image comparison
  9. Multi-image or video clip summarization
  10. Speech recognition
  11. Speech translation
  12. Speech QA
  13. Speech summarization
  14. Audio understanding

The model is designed to accelerate research on language and multimodal models, for use as a building block for generative AI powered features.

Use Case Considerations

The model is not specifically designed or evaluated for all downstream purposes. Developers should consider common limitations of language models and multimodal models, as well as performance differences across languages, as they select use cases, and should evaluate and mitigate for accuracy, safety, and fairness before using the model within a specific downstream use case, particularly for high-risk scenarios. Developers should be aware of and adhere to applicable laws or regulations (including but not limited to privacy, trade compliance laws, etc.) that are relevant to their use case.

Nothing contained in this Model Card should be interpreted as or deemed a restriction or modification to the license the model is released under.

Release Notes

This release of Phi-4-multimodal-instruct is based on valuable user feedback from the Phi-3 series. Previously, users could use a speech recognition model to talk to the Mini and Vision models. To achieve this, users needed a pipeline of two models: one to transcribe the audio to text, and another for the language or vision tasks. With such a pipeline, the core model did not receive the full breadth of input information: for example, it could not directly observe multiple speakers or background noises, or jointly align speech, vision, and language information in the same representation space at the same time. With Phi-4-multimodal-instruct, a single new open model has been trained across text, vision, and audio, meaning that all inputs and outputs are processed by the same neural network. The model employs a new architecture and a larger vocabulary for efficiency and for multilingual and multimodal support. Better post-training techniques were used for instruction following and function calling, and additional data led to substantial gains on key multimodal capabilities.

It is anticipated that Phi-4-multimodal-instruct will greatly benefit app developers and various use cases. The enthusiastic support for the Phi-4 series is greatly appreciated. Feedback on Phi-4 is welcomed and crucial to the model's evolution and improvement. Thank you for being part of this journey!

Model Quality

To understand its capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of benchmarks using an internal benchmark platform (see Appendix A for benchmark methodology). Users can refer to the Phi-4-Mini-Instruct model card for details of language benchmarks. A high-level overview of the model quality on representative speech and vision benchmarks follows:

Speech

Phi-4-multimodal-instruct was observed to:

  • Have strong automatic speech recognition (ASR) and speech translation (ST) performance, surpassing the expert ASR model WhisperV3 and the expert ST model SeamlessM4T-v2-Large.
  • Rank number 1 on the Huggingface OpenASR leaderboard with a word error rate of 6.14%, compared with 6.5% for the current best model, as of March 04, 2025.
  • Be the first open-sourced model that can perform speech summarization, with performance close to GPT-4o.
  • Have a gap with closed models, e.g. Gemini-1.5-Flash and GPT-4o-realtime-preview, on the speech QA task. Work is being undertaken to improve this capability in the next iterations.

Speech Recognition (lower is better)

The performance of Phi-4-multimodal-instruct on the aggregated benchmark datasets:

[Figure: ASR word error rate on the aggregated benchmark datasets]

The performance of Phi-4-multimodal-instruct on different languages, averaging the WERs of CommonVoice and FLEURS:

[Figure: per-language ASR word error rate, averaged over CommonVoice and FLEURS]
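
This section reports word error rate (WER). As a reminder of what the metric measures, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The function name `word_error_rate` and the plain whitespace tokenization are illustrative assumptions; the model card does not describe the text normalization applied by its benchmark platform, so numbers from this sketch are not comparable to the reported ones.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") over six reference words.
print(word_error_rate("the cat sat on the mat", "the cat sit on the mat"))
```

Averaging across datasets, as done for the per-language results, would then be a plain mean of the per-dataset WER values.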

Speech Translation (higher is better)

Translating from German, Spanish, French, Italian, Japanese, Portuguese, Chinese to English:

[Figure: speech translation results, X to English]

Translating from English to German, Spanish, French, Italian, Japanese, Portuguese, Chinese. Note that WhisperV3 does not support this capability:

[Figure: speech translation results, English to X]

Speech Summarization (higher is better)

[Figure: speech summarization results]

Speech QA

MT bench scores are scaled by 10x to match the score range of MMMLU:

[Figure: speech QA results]

Audio Understanding

AIR bench scores are scaled by 10x to match the score range of MMAU:

[Figure: audio understanding results]

Vision

Vision-Speech tasks

Phi-4-multimodal-instruct can process image and audio inputs together. The following table shows the model quality on chart/table understanding and document reasoning tasks when the input query for the vision content is synthetic speech. Compared to other existing state-of-the-art omni models that accept audio and visual signals as input, Phi-4-multimodal-instruct achieves much stronger performance on multiple benchmarks.

| Benchmarks | Phi-4-multimodal-instruct | InternOmni-7B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Gemini-1.5-Pro |
|---|---|---|---|---|---|
| s_AI2D | 68.9 | 53.9 | 62.0 | 69.4 | 67.7 |
| s_ChartQA | 69.0 | 56.1 | 35.5 | 51.3 | 46.9 |
| s_DocVQA | 87.3 | 79.9 | 76.0 | 80.3 | 78.2 |
| s_InfoVQA | 63.7 | 60.3 | 59.4 | 63.6 | 66.1 |
| Average | 72.2 | 62.6 | 58.2 | 66.2 | 64.7 |

Vision tasks

To understand the vision capabilities, Phi-4-multimodal-instruct was compared with a set of models over a variety of zero-shot benchmarks using an internal benchmark platform. A high-level overview of the model quality on representative benchmarks follows:

| Dataset | Phi-4-multimodal-ins | Phi-3.5-vision-ins | Qwen 2.5-VL-3B-ins | Intern VL 2.5-4B | Qwen 2.5-VL-7B-ins | Intern VL 2.5-8B | Gemini 2.0-Flash Lite-preview-0205 | Gemini2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|---|
| Popular aggregated benchmark | | | | | | | | | | |
| MMMU | 55.1 | 43.0 | 47.0 | 48.3 | 51.8 | 50.6 | 54.1 | 64.7 | 55.8 | 61.7 |
| MMBench (dev-en) | 86.7 | 81.9 | 84.3 | 86.8 | 87.8 | 88.2 | 85.0 | 90.0 | 86.7 | 89.0 |
| MMMU-Pro (std/vision) | 38.5 | 21.8 | 29.9 | 32.4 | 36.9 | 34.4 | 45.1 | 54.4 | 54.3 | 53.0 |
| Visual science reasoning | | | | | | | | | | |
| ScienceQA Visual (img-test) | 97.5 | 91.3 | 79.4 | 96.2 | 87.7 | 97.3 | 85.0 | 88.3 | 81.2 | 88.2 |
| Visual math reasoning | | | | | | | | | | |
| MathVista (testmini) | 62.4 | 43.9 | 60.8 | 51.2 | 67.8 | 56.7 | 57.6 | 47.2 | 56.9 | 56.1 |
| InterGPS | 48.6 | 36.3 | 48.3 | 53.7 | 52.7 | 54.1 | 57.9 | 65.4 | 47.1 | 49.1 |
| Chart & table reasoning | | | | | | | | | | |
| AI2D | 82.3 | 78.1 | 78.4 | 80.0 | 82.6 | 83.0 | 77.6 | 82.1 | 70.6 | 83.8 |
| ChartQA | 81.4 | 81.8 | 80.0 | 79.1 | 85.0 | 81.0 | 73.0 | 79.0 | 78.4 | 75.1 |
| DocVQA | 93.2 | 69.3 | 93.9 | 91.6 | 95.7 | 93.0 | 91.2 | 92.1 | 95.2 | 90.9 |
| InfoVQA | 72.7 | 36.6 | 77.1 | 72.1 | 82.6 | 77.6 | 73.0 | 77.8 | 74.3 | 71.9 |
| Document Intelligence | | | | | | | | | | |
| TextVQA (val) | 75.6 | 72.0 | 76.8 | 70.9 | 77.7 | 74.8 | 72.9 | 74.4 | 58.6 | 73.1 |
| OCR Bench | 84.4 | 63.8 | 82.2 | 71.6 | 87.7 | 74.8 | 75.7 | 81.0 | 77.0 | 77.7 |
| Object visual presence verification | | | | | | | | | | |
| POPE | 85.6 | 86.1 | 87.9 | 89.4 | 87.5 | 89.1 | 87.5 | 88.0 | 82.6 | 86.5 |
| Multi-image perception | | | | | | | | | | |
| BLINK | 61.3 | 57.0 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |
| Video MME 16 frames | 55.0 | 50.8 | 56.5 | 57.3 | 58.2 | 58.7 | 58.8 | 65.5 | 60.2 | 68.2 |
| Average | 72.0 | 60.9 | 68.7 | 68.8 | 73.1 | 71.1 | 70.2 | 74.3 | 69.1 | 72.4 |

Visual Perception

Below are the comparison results on existing multi-image tasks. On average, Phi-4-multimodal-instruct outperforms competitor models of the same size and is competitive with much bigger models on multi-frame capabilities. BLINK is an aggregated benchmark with 14 visual tasks that humans can solve very quickly but that are still hard for current multimodal LLMs.

| Dataset | Phi-4-multimodal-instruct | Qwen2.5-VL-3B-Instruct | InternVL 2.5-4B | Qwen2.5-VL-7B-Instruct | InternVL 2.5-8B | Gemini-2.0-Flash-Lite-prv-02-05 | Gemini-2.0-Flash | Claude-3.5-Sonnet-2024-10-22 | Gpt-4o-2024-11-20 |
|---|---|---|---|---|---|---|---|---|---|
| Art Style | 86.3 | 58.1 | 59.8 | 65.0 | 65.0 | 76.9 | 76.9 | 68.4 | 73.5 |
| Counting | 60.0 | 67.5 | 60.0 | 66.7 | 71.7 | 45.8 | 69.2 | 60.8 | 65.0 |
| Forensic Detection | 90.2 | 34.8 | 22.0 | 43.9 | 37.9 | 31.8 | 74.2 | 63.6 | 71.2 |
| Functional Correspondence | 30.0 | 20.0 | 26.9 | 22.3 | 27.7 | 48.5 | 53.1 | 34.6 | 42.3 |
| IQ Test | 22.7 | 25.3 | 28.7 | 28.7 | 28.7 | 28.0 | 30.7 | 20.7 | 25.3 |
| Jigsaw | 68.7 | 52.0 | 71.3 | 69.3 | 53.3 | 62.7 | 69.3 | 61.3 | 68.7 |
| Multi-View Reasoning | 76.7 | 44.4 | 44.4 | 54.1 | 45.1 | 55.6 | 41.4 | 54.9 | 54.1 |
| Object Localization | 52.5 | 55.7 | 53.3 | 55.7 | 58.2 | 63.9 | 67.2 | 58.2 | 65.6 |
| Relative Depth | 69.4 | 68.5 | 68.5 | 80.6 | 76.6 | 81.5 | 72.6 | 66.1 | 73.4 |
| Relative Reflectance | 26.9 | 38.8 | 38.8 | 32.8 | 38.8 | 33.6 | 34.3 | 38.1 | 38.1 |
| Semantic Correspondence | 52.5 | 32.4 | 33.8 | 28.8 | 24.5 | 56.1 | 55.4 | 43.9 | 47.5 |
| Spatial Relation | 72.7 | 80.4 | 86.0 | 88.8 | 86.7 | 74.1 | 79.0 | 74.8 | 83.2 |
| Visual Correspondence | 67.4 | 28.5 | 39.5 | 50.0 | 44.2 | 84.9 | 91.3 | 72.7 | 82.6 |
| Visual Similarity | 86.7 | 67.4 | 88.1 | 87.4 | 85.2 | 87.4 | 80.7 | 79.3 | 83.0 |
| Overall | 61.6 | 48.1 | 51.2 | 55.3 | 52.5 | 59.3 | 64.0 | 56.9 | 62.4 |

Usage

Requirements

The Phi-4 family has been integrated in version 4.48.2 of transformers. The installed transformers version can be verified with: pip list | grep transformers. Running with Python 3.10 is suggested. Examples of required packages:

flash_attn==2.7.4.post1
torch==2.6.0
transformers==4.48.2
accelerate==1.3.0
soundfile==0.13.1
pillow==11.1.0
scipy==1.15.2
torchvision==0.21.0
backoff==2.2.1
peft==0.13.2
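
As an alternative to checking packages one at a time with pip, the pinned versions above can be compared against the current environment from Python. This is a minimal sketch using only the standard library; the helper name `check_requirements` is hypothetical, and the comparison ignores local build tags (for example a "+cu124" suffix on torch), which is an assumption about what counts as a match.

```python
from importlib import metadata

# Pinned versions copied from the list above.
REQUIRED = {
    "flash_attn": "2.7.4.post1",
    "torch": "2.6.0",
    "transformers": "4.48.2",
    "accelerate": "1.3.0",
    "soundfile": "0.13.1",
    "pillow": "11.1.0",
    "scipy": "1.15.2",
    "torchvision": "0.21.0",
    "backoff": "2.2.1",
    "peft": "0.13.2",
}


def check_requirements(required):
    """Return {package: (installed_version_or_None, matches_pin)}."""
    report = {}
    for name, pinned in required.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop any local build tag such as "+cu124" before comparing.
        matches = installed is not None and installed.split("+")[0] == pinned
        report[name] = (installed, matches)
    return report


for name, (installed, ok) in check_requirements(REQUIRED).items():
    status = "ok" if ok else f"expected {REQUIRED[name]}"
    print(f"{name}: {installed or 'not installed'} ({status})")
```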

Phi-4-multimodal-instruct is also available in Azure AI Studio.

Tokenizer

Phi-4-multimodal-instruct supports a vocabulary size of up to 200064 tokens. The tokenizer files already provide placeholder tokens that can be used for downstream fine-tuning, but they can also be extended up to the model's vocabulary size.

Input Formats

Given the nature of the training data, the Phi-4-multimodal-instruct model is best suited for prompts using the chat format as follows:

Text chat format

This format is used for general conversation and instructions:

<|system|>You are a helpful assistant.<|end|><|user|>How to explain Internet for a medieval knight?<|end|><|assistant|>
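
The text chat format can be assembled programmatically from a list of turns. The sketch below is one possible way to do it; the helper name `build_chat_prompt` and the role/content dictionary layout are assumptions made for illustration, not part of the model's API.

```python
def build_chat_prompt(messages):
    """Render [{'role': ..., 'content': ...}] into the text chat format."""
    allowed_roles = {"system", "user", "assistant"}
    parts = []
    for message in messages:
        role = message["role"]
        if role not in allowed_roles:
            raise ValueError(f"unsupported role: {role!r}")
        # Each turn is a role tag, the content, and an <|end|> marker.
        parts.append(f"<|{role}|>{message['content']}<|end|>")
    # The trailing assistant tag asks the model to generate the next turn.
    return "".join(parts) + "<|assistant|>"


prompt = build_chat_prompt([
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "How to explain Internet for a medieval knight?"},
])
print(prompt)
```

With these two turns, the helper reproduces the example prompt shown above.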

Tool-enabled function-calling format

This format is used when the user wants the model to provide function calls based on the given tools. The user should provide the available tools in the system prompt, wrapped by <|tool|> and <|/tool|> tokens. The tools should be specified in JSON format, using a JSON dump structure. Example:

<|system|>You are a helpful assistant with some tools.<|tool|>[{"name": "get_weather_updates", "description": "Fetches weather updates for a given city using the RapidAPI Weather API.", "parameters": {"city": {"description": "The name of the city for which to retrieve weather information.", "type": "str", "default": "London"}}}]<|/tool|><|end|><|user|>What is the weather like in Paris today?<|end|><|assistant|>
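
Since the tools are given as a JSON dump, the system prompt can be built from ordinary Python data. The following is a sketch under that reading; the helper name `build_tool_prompt` is hypothetical, and the tool definition is the one from the example above.

```python
import json


def build_tool_prompt(system_text, tools, user_text):
    """Wrap a JSON dump of the tool list in <|tool|> ... <|/tool|> inside the system turn."""
    tools_json = json.dumps(tools)
    return (
        f"<|system|>{system_text}<|tool|>{tools_json}<|/tool|><|end|>"
        f"<|user|>{user_text}<|end|><|assistant|>"
    )


tools = [
    {
        "name": "get_weather_updates",
        "description": "Fetches weather updates for a given city using the RapidAPI Weather API.",
        "parameters": {
            "city": {
                "description": "The name of the city for which to retrieve weather information.",
                "type": "str",
                "default": "London",
            }
        },
    }
]

prompt = build_tool_prompt(
    "You are a helpful assistant with some tools.",
    tools,
    "What is the weather like in Paris today?",
)
print(prompt)
```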

Vision-Language Format

This format is used for conversation with image:

<|user|><|image_1|>Describe the image in detail.<|end|><|assistant|>

For multiple images, the user needs to insert multiple image placeholders in the prompt as below:

<|user|><|image_1|><|image_2|><|image_3|>Summarize the content of the images.<|end|><|assistant|>

Speech-Language Format

This format is used for various speech and audio tasks:

<|user|><|audio_1|>{task prompt}<|end|><|assistant|>

The task prompt can vary for different tasks. Automatic Speech Recognition:

<|user|><|audio_1|>Transcribe the audio clip into text.<|end|><|assistant|>

Automatic Speech Translation:

<|user|><|audio_1|>Translate the audio to {lang}.<|end|><|assistant|>

Automatic Speech Translation with chain-of-thought:

<|user|><|audio_1|>Transcribe the audio to text, and then translate the audio to {lang}. Use <sep> as a separator between the original transcript and the translation.<|end|><|assistant|>
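
With the chain-of-thought prompt, the response is expected to contain the transcript and the translation separated by <sep>. A small post-processing sketch follows; the helper name `split_transcript_and_translation` and the sample response are hypothetical, and the sketch assumes the separator survives decoding as literal text.

```python
def split_transcript_and_translation(response, separator="<sep>"):
    """Split a chain-of-thought translation response into (transcript, translation)."""
    transcript, found, translation = response.partition(separator)
    if not found:
        # The separator is missing; treat the whole response as the transcript.
        return response.strip(), None
    return transcript.strip(), translation.strip()


# Hypothetical model output, for illustration only.
sample = "Hello, how are you? <sep> Bonjour, comment allez-vous ?"
transcript, translation = split_transcript_and_translation(sample)
print(transcript)
print(translation)
```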

Spoken-query Question Answering:

<|user|><|audio_1|><|end|><|assistant|>

Vision-Speech Format

This format is used for conversation with image and audio. The audio may contain a query related to the image:

<|user|><|image_1|><|audio_1|><|end|><|assistant|>

For multiple images, the user needs to insert multiple image placeholders in the prompt as below:

<|user|><|image_1|><|image_2|><|image_3|><|audio_1|><|end|><|assistant|>
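
The vision, speech, and vision-speech formats all follow the same pattern: numbered placeholders, an optional text instruction, then the end and assistant tags. The sketch below generates such prompts; the helper name `build_multimodal_prompt` is hypothetical, and numbering audio placeholders beyond <|audio_1|> is an assumption, since only a single audio placeholder appears in the examples above.

```python
def build_multimodal_prompt(num_images=0, num_audios=0, text=""):
    """Build a single-turn user prompt with numbered image and audio placeholders."""
    if num_images < 0 or num_audios < 0:
        raise ValueError("placeholder counts must be non-negative")
    # Placeholders are 1-indexed and images come before audio, as in the examples above.
    images = "".join(f"<|image_{i}|>" for i in range(1, num_images + 1))
    audios = "".join(f"<|audio_{i}|>" for i in range(1, num_audios + 1))
    return f"<|user|>{images}{audios}{text}<|end|><|assistant|>"


# Three images queried by one spoken question.
print(build_multimodal_prompt(num_images=3, num_audios=1))
# One image with a text instruction.
print(build_multimodal_prompt(num_images=1, text="Describe the image in detail."))
```

The number of images and audio clips passed to the processor should match the number of placeholders in the prompt.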

Vision

  • Any common RGB/gray image format (e.g., ".jpg", ".jpeg", ".png", ".ppm", ".bmp", ".pgm", ".tif", ".tiff", ".webp") can be supported.
  • Resolution depends on the GPU memory size. Higher resolution and more images produce more tokens and therefore use more GPU memory. During training, 64 crops can be supported. For a square image, the resolution would be around 8*448 by 8*448. For multiple images, at most 64 frames can be supported, but with more frames as input, the resolution of each frame needs to be reduced to fit in memory.

Audio

  • Any audio format that can be loaded by the soundfile package should be supported.
  • To maintain satisfactory performance, the suggested maximum audio length is 40 seconds. For summarization tasks, the suggested maximum audio length is 30 minutes.
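
The suggested length limits can be checked before sending audio to the model. The following is a minimal sketch; the helper names and the task labels "default" and "summarization" are assumptions chosen for illustration, and the limits are the suggested values stated above rather than hard constraints.

```python
# Suggested maximum audio lengths in seconds, taken from the guidance above.
MAX_SECONDS = {"default": 40, "summarization": 30 * 60}


def audio_duration_seconds(num_samples, samplerate):
    """Duration of a clip given its sample count and sampling rate in Hz."""
    if samplerate <= 0:
        raise ValueError("samplerate must be positive")
    return num_samples / samplerate


def exceeds_suggested_length(num_samples, samplerate, task="default"):
    """True if the clip is longer than the suggested maximum for the task."""
    limit = MAX_SECONDS.get(task, MAX_SECONDS["default"])
    return audio_duration_seconds(num_samples, samplerate) > limit


# A 45-second clip at 16 kHz is too long for recognition but fine for summarization.
samples = 45 * 16000
print(exceeds_suggested_length(samples, 16000))
print(exceeds_suggested_length(samples, 16000, task="summarization"))
```

When the audio is loaded with soundfile as `audio, samplerate = sf.read(path)`, `len(audio)` gives the number of samples per channel and can be passed as `num_samples`.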

Loading the model locally

After obtaining the Phi-4-multimodal-instruct model checkpoints, users can use this sample code for inference.

import requests
import torch
import os
import io
from PIL import Image
import soundfile as sf
from transformers import AutoModelForCausalLM, AutoProcessor, GenerationConfig
from urllib.request import urlopen


# Define model path
model_path = "microsoft/Phi-4-multimodal-instruct"

# Load model and processor
processor = AutoProcessor.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, 
    device_map="cuda", 
    torch_dtype="auto", 
    trust_remote_code=True,
    # if you do not use Ampere or later GPUs, change attention to "eager"
    _attn_implementation='flash_attention_2',
).cuda()

# Load generation config
generation_config = GenerationConfig.from_pretrained(model_path)

# Define prompt structure
user_prompt = '<|user|>'
assistant_prompt = '<|assistant|>'
prompt_suffix = '<|end|>'

# Part 1: Image Processing
print("\n--- IMAGE PROCESSING ---")
image_url = 'https://www.ilankelman.org/stopsigns/australia.jpg'
prompt = f'{user_prompt}<|image_1|>What is shown in this image?{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open image
image = Image.open(requests.get(image_url, stream=True).raw)
inputs = processor(text=prompt, images=image, return_tensors='pt').to('cuda:0')

# Generate response
generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')

# Part 2: Audio Processing
print("\n--- AUDIO PROCESSING ---")
audio_url = "https://upload.wikimedia.org/wikipedia/commons/b/b0/Barbara_Sahakian_BBC_Radio4_The_Life_Scientific_29_May_2012_b01j5j24.flac"
speech_prompt = "Transcribe the audio to text, and then translate the audio to French. Use <sep> as a separator between the original transcript and the translation."
prompt = f'{user_prompt}<|audio_1|>{speech_prompt}{prompt_suffix}{assistant_prompt}'
print(f'>>> Prompt\n{prompt}')

# Download and open audio file
audio, samplerate = sf.read(io.BytesIO(urlopen(audio_url).read()))

# Process with the model
inputs = processor(text=prompt, audios=[(audio, samplerate)], return_tensors='pt').to('cuda:0')

generate_ids = model.generate(
    **inputs,
    max_new_tokens=1000,
    generation_config=generation_config,
)
generate_ids = generate_ids[:, inputs['input_ids'].shape[1]:]
response = processor.batch_decode(
    generate_ids, skip_special_tokens=True, clean_up_tokenization_spaces=False
)[0]
print(f'>>> Response\n{response}')
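
The comment in the sample code says to switch the attention implementation to "eager" on GPUs older than Ampere. One way to automate that choice is sketched below: NVIDIA Ampere GPUs report CUDA compute capability 8.x, so the major version can be compared against 8. The helper name `pick_attn_implementation` is hypothetical; the sketch assumes the capability tuple comes from `torch.cuda.get_device_capability()` and that the flash_attn package is installed whenever "flash_attention_2" is selected.

```python
def pick_attn_implementation(compute_capability):
    """Choose an attention implementation from a CUDA (major, minor) capability tuple."""
    major, _minor = compute_capability
    # Ampere GPUs report compute capability 8.x; later generations report higher values.
    return "flash_attention_2" if major >= 8 else "eager"


print(pick_attn_implementation((7, 5)))  # Turing-class GPU such as a T4
print(pick_attn_implementation((8, 0)))  # Ampere-class GPU such as an A100
```

In the loading code above, the returned string would replace the hard-coded value passed as `_attn_implementation`.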

More inference examples can be found here.

vLLM inference

Users can start a server with this command:

python -m vllm.entrypoints.openai.api_server --model 'microsoft/Phi-4-multimodal-instruct' --dtype auto --trust-remote-code --max-model-len 131072 --enable-lora --max-lora-rank 320 --lora-extra-vocab-size 0 --limit-mm-per-prompt audio=3,image=3 --max-loras 2 --lora-modules speech=<path to speech lora folder> vision=<path to vision lora folder>

The speech LoRA and vision LoRA folders are inside the Phi-4-multimodal-instruct folder downloaded by vLLM. The following script can also be used to find them:

from huggingface_hub import snapshot_download
model_path = snapshot_download(repo_id="microsoft/Phi-4-multimodal-instruct")
speech_lora_path = model_path+"/speech-lora"
vision_lora_path = model_path+"/vision-lora"
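
Once the snapshot path is known, the server command can be assembled with the LoRA paths filled in. The sketch below only builds and prints the command; the helper name `build_vllm_command` and the placeholder path are hypothetical, and every flag is copied from the command shown above rather than from vLLM documentation.

```python
import os
import shlex


def build_vllm_command(model_path, model_id="microsoft/Phi-4-multimodal-instruct"):
    """Return the vLLM server command as an argument list with LoRA paths filled in."""
    speech_lora = os.path.join(model_path, "speech-lora")
    vision_lora = os.path.join(model_path, "vision-lora")
    return [
        "python", "-m", "vllm.entrypoints.openai.api_server",
        "--model", model_id,
        "--dtype", "auto",
        "--trust-remote-code",
        "--max-model-len", "131072",
        "--enable-lora",
        "--max-lora-rank", "320",
        "--lora-extra-vocab-size", "0",
        "--limit-mm-per-prompt", "audio=3,image=3",
        "--max-loras", "2",
        "--lora-modules", f"speech={speech_lora}", f"vision={vision_lora}",
    ]


# Placeholder path; in practice this would be the value returned by snapshot_download.
command = build_vllm_command("/path/to/snapshot")
print(shlex.join(command))
```

The argument list can be passed directly to `subprocess.run(command)` to launch the server from Python.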

Training

Fine-tuning

Basic examples of supervised fine-tuning (SFT) are provided for speech and vision, respectively.

An example of how to extend speech recognition to a new language is also provided.

Model

  • Architecture: Phi-4-multimodal-instruct has 5.6B parameters and is a multimodal transformer model. The model uses the pretrained Phi-4-Mini-Instruct as the backbone language model, together with advanced vision and speech encoders and adapters.
  • Inputs: Text, image, and audio. It is best suited for prompts using the chat format.
  • Context length: 128K tokens
  • GPUs: 512 A100-80G
  • Training time: 28 days
  • Training data: 5T tokens, 2.3M speech hours, and 1.1T image-text tokens
  • Outputs: Generated text in response to the input
  • Dates: Trained between December 2024 and January 2025
  • Status: This is a static model trained on offline datasets with the cutoff date of June 2024 for publicly available data.
  • **Supported language

Author: Microsoft (organization: microsoft, verified)

Details

  • Downloads: 444.9K
  • Likes: 1.6K
  • Access: Open Source
  • Task: automatic-speech-recognition
  • Parameters: 5.6B
  • License: mit
  • Library: transformers
  • Created: Feb 24, 2025
  • Updated: Dec 10, 2025
  • Languages: multilingual, ar, zh, cs, da, nl, en, fi, fr, de, he, hu, it, ja, ko, no, pl, pt, ru, es, sv, th, tr, uk