Who created Boogu-Image-0.1-Edit?

Boogu-Image-0.1-Edit was published by Boogu on Hugging Face.

Boogu-Image-0.1-Edit

Name: Boogu-Image-0.1-Edit
Author: Boogu

Boogu's diffusion-based image editing model for instruction-guided edits.

Boosting Open-Source Unified Multimodal Understanding and Generation

⚠️ Important Notice

The Boogu team does NOT currently provide any paid API, subscription, or commercial service for Boogu-Image. Any paid product or service offered under the name "Boogu-Image" — or any similar / variant name such as booguimage, Boogu Image, Boogu, etc. — is NOT affiliated with this project and is unofficial. Please verify carefully before making any payment, and stay vigilant to protect your personal privacy and financial safety.

Boogu-Image-0.1 is a research project only, and not an official model release.

📖 Introduction

Boogu-Image-0.1 is a competitive Apache-2.0 open-source unified image generation and editing model family, including Base, Turbo, Edit, and other variants that provide stable, practical capabilities for high-quality text-to-image generation, fast generation, image editing, and Chinese-English text rendering. Closed-source multimodal understanding and generation systems like Nano Banana Pro and GPT-Image-2 achieve remarkable performance not because of a single model, but through a highly unified suite of system capabilities. However, under training compute that is extremely limited compared with closed-source systems, we find that systematically improving a model's understanding ability, data quality, and training pipeline can still significantly improve image generation and editing performance. Specifically, compared with some existing open-source models, our training data scale is roughly one order of magnitude smaller. We hope our empirical study and open-source release will help advance the open-source ecosystem for multimodal generation and understanding.

This repository provides checkpoints and inference code for Boogu-Image-0.1.

🏆 Boogu Arena

Since we could not evaluate on LM Arena directly, we built Boogu Arena, an LM Arena-style preference evaluation. We use an LLM to generate diverse user personas, then ask each persona to produce image generation prompts, resulting in 1K+ test prompts that we will release publicly for community reproduction. The ELO leaderboard below spans leading closed- and open-source systems. We welcome teams with questions about the results to contact us so that we can work toward a more objective, fair, and reproducible evaluation.
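
For readers unfamiliar with Elo-style leaderboards, the sketch below shows how a single pairwise preference vote can move two ratings. It uses the textbook Elo rule, which is not necessarily the exact procedure behind Boogu Arena; the K-factor of 32, the starting rating of 1000, and the model names are assumptions chosen for illustration.

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that A is preferred over B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))


def update_elo(rating_a: float, rating_b: float, a_wins: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one preference vote."""
    exp_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_wins else 0.0
    delta = k * (score_a - exp_a)
    # Elo is zero-sum: whatever A gains, B loses.
    return rating_a + delta, rating_b - delta


# Hypothetical votes: (model_a, model_b, a_wins). Names are placeholders.
votes = [("model_x", "model_y", True), ("model_y", "model_x", False)]
ratings = {"model_x": 1000.0, "model_y": 1000.0}
for a, b, a_wins in votes:
    ratings[a], ratings[b] = update_elo(ratings[a], ratings[b], a_wins)
```

Because each update depends on the current ratings, the result of a sequential Elo pass depends on vote order; arena-style leaderboards often fit all votes jointly instead.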

✨ Highlights

📸 Beautiful and Precise Photography — Accurately understands photography prompts and generates high-quality images with natural lighting, coherent composition, and faithful details, preserving coherent subject, background, and spatial relationships even in complex real-world scenes
📝 Diverse and Stable Text Rendering — Supports a wide range of text-heavy designs — posters, stamps, documents, interfaces, brand guides, and handwritten boards — with readable structure, stable typography, and robust bilingual (Chinese/English) rendering across diverse layouts
🎨 Diverse and Beautiful Stylization — Handles stylized generation across miniature 3D scenes, Chinese-inspired gilded aesthetics, shining fantasy visuals, anime portraits, and mythic character art — not just style transfer, but stable, attractive, and prompt-aware creative generation
🖌️ Versatile Image Editing — Handles a wide spectrum of editing tasks, including object insertion, replacement and removal, attribute and material modification, background and scene replacement, and faithful style transfer across artistic looks, while keeping the source subject and composition coherent
🪧 Personalized Poster Design & Product Rendering — Generates personalized poster layouts and clean product visualizations with consistent branding, refined typography, and product-grade lighting and composition
✍️ Precise Text Editing — Enables fine-grained, in-image text editing — replacing, adding, or removing characters in both Chinese and English — and flexibly adapts fonts, weights, colors, and layouts to match different design intents
📊 Competitive General Performance — Demonstrates competitive performance across many scenarios and benchmarks, with the Boogu-Image-0.1 family ranking among the very top of evaluated open- and closed-source systems in Boogu Arena

📖 For the full set of practical lessons and an honest account of current limitations, see Responsible AI & Limitations below.

🔬 Scenario-wise Comparison

Beyond overall arena rankings, we break performance down by scenario across leading open-source peers. Ratings reflect our internal evaluation of typical prompts in each category.

Model	Realistic Photography	Simple Text Rendering	Dense Text Rendering
Boogu-Image-0.1-Turbo	⭐⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐
Boogu-Image-0.1-Base	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐⭐
Z-Image-Turbo	⭐⭐⭐⭐	⭐⭐⭐	⭐⭐
Qwen-Image-2512	⭐⭐⭐	⭐⭐⭐⭐	⭐⭐⭐

📸 Photography with reliable text rendering — Boogu-Image-0.1-Turbo delivers realistic photography, while also offering solid performance on both simple and dense text rendering.
📝 Strong dense text rendering — Boogu-Image-0.1-Base shows competitive results on dense, layout-heavy text scenarios such as posters, documents, brand guides, and complex bilingual designs.
💡 Recommendation — When your workload is dominated by dense / ultra-dense text rendering needs, we recommend running Boogu-Image-0.1-Base at 2K output resolution for the best layout fidelity and character accuracy.

📣 News

2026-06-XX 🧊 Boogu-Image-0.1-Edit-Turbo (Image-to-Image) is coming!
2026-06-XX 🧊 Boogu-Image-0.1-Turbo-2K (Text-to-Image) is coming!
2026-06-20 🧊 Happy Dragon Boat Festival! We have seen many community reviews and feedback, and we will continue to update the model accordingly. Due to differences in product design philosophy, the Boogu series stands apart from most existing open-source models. While other models tend to rely on reinforcement learning techniques to enhance aesthetics, Boogu focuses on using diverse data to give users more control. This is precisely why we adopt an integrated understanding-and-generation system: we need more precise instruction control. We will release a user manual in three days to help everyone make better use of the Boogu series models.
2026-06-17 🔥 ComfyUI-Boogu powered by ComfyUI is released! Thank you, ComfyUI!
2026-06-17 🔥 ComfyUI-Boogu is released!
2026-06-16 🔥 Boogu-Image-0.1-Base (Text-to-Image) is released! The core text-to-image foundation model. Try the online demo.
2026-06-16 🎨 Boogu-Image-0.1-Edit (Image-to-Image) is released! Image editing and transformation capabilities are now available. Note that the reference image must be resized to 1K resolution. Try the online demo. Only one reference image is supported for now; we will try our best to support more. Stay tuned! Boogu-Image-0.1-Edit is strong on single-image editing, and reports of failure cases are welcome.
2026-06-16 🚀 Boogu-Image-0.1-Turbo is released! Four-step distilled variant for fast inference and photorealistic generation. Try the online demo.

📥 Model Zoo

Model	Params	Training	Steps	CFG	Task	Demo
Boogu-Image-0.1-Base	10B	Joint Training	25~50	2.0~5.0 (e.g., 4.0)	T2I
Boogu-Image-0.1-Base-fp8	10B	Joint Training	25~50	2.0~5.0 (e.g., 4.0)	T2I	—
Boogu-Image-0.1-Edit	10B	Joint Training	25~50	2.0~5.0 (e.g., 5.0)	TI2I
Boogu-Image-0.1-Edit-fp8	10B	Joint Training	25~50	2.0~5.0 (e.g., 5.0)	TI2I	—
Boogu-Image-0.1-Turbo	10B	+ Decoupled DMD	4	1.0	T2I
Boogu-Image-0.1-Turbo-fp8	10B	+ Decoupled DMD	4	1.0	T2I	—

Boogu-Image-0.1-Base: Foundation model with strong diversity and controllability — ideal for fine-tuning and downstream development. Mainly intended for ultra-dense text rendering; for photorealism, Turbo is usually the better default.
Boogu-Image-0.1-Edit: Image editing and transformation variant.
Boogu-Image-0.1-Turbo: Distilled variant with the same parameter count, typically requiring only 3~4 steps. Focuses on high-quality generation and photorealism while preserving bilingual text rendering and prompt adherence.
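
The recommended settings in the Model Zoo table can be kept in one place so that scripts do not hard-code them. The following is a minimal sketch: the `SamplingDefaults` class and `defaults_for` helper are hypothetical names, not part of the repository, and the step count of 50 for Base and Edit is simply one value picked from the stated 25~50 range.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SamplingDefaults:
    steps: int
    cfg: float
    task: str


# CFG values are the "e.g." entries from the Model Zoo table.
# steps=50 for Base/Edit is an assumption within the 25~50 range.
DEFAULTS = {
    "Boogu-Image-0.1-Base": SamplingDefaults(steps=50, cfg=4.0, task="T2I"),
    "Boogu-Image-0.1-Edit": SamplingDefaults(steps=50, cfg=5.0, task="TI2I"),
    "Boogu-Image-0.1-Turbo": SamplingDefaults(steps=4, cfg=1.0, task="T2I"),
}


def defaults_for(model_name: str) -> SamplingDefaults:
    """Look up defaults; fp8 variants share their unquantized row's settings."""
    key = model_name.removesuffix("-fp8")  # requires Python 3.9+
    return DEFAULTS[key]


print(defaults_for("Boogu-Image-0.1-Turbo-fp8"))
```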

🛠️ Installation

Tested environment: Python 3.10 · CUDA 12.6 · PyTorch 2.7.1

# Use a brand new conda environment
conda create -y -n boogu python=3.10
conda activate boogu
# Install necessary dependencies
# PyTorch up to 2.11.0 with CUDA up to 12.8 is supported
# Check `requirements/<torch>_<cuda>.txt`
pip install -r requirements/torch2.7-cu126.txt
pip install -e .
python utils/get_flash_attn.py

bash quick_start.sh
conda activate boogu

Download Checkpoints

Download the model weights into a local models/ directory before running inference. We recommend using the official Hugging Face CLI:

pip install -U "huggingface_hub[cli]"

# Download to ./models/<model-name>
huggingface-cli download Boogu/Boogu-Image-0.1-Base --local-dir models/Boogu-Image-0.1-Base
huggingface-cli download Boogu/Boogu-Image-0.1-Turbo --local-dir models/Boogu-Image-0.1-Turbo
huggingface-cli download Boogu/Boogu-Image-0.1-Edit --local-dir models/Boogu-Image-0.1-Edit

Example layout after download:

models/
└── Boogu-Image-0.1-Base/
    ├── model_index.json
    ├── mllm
    ├── processor
    ├── scheduler
    ├── transformer
    └── vae

Then point inference to the local path via --model models/Boogu-Image-0.1-Base.
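
Before running inference, it can help to confirm that a download finished and the directory matches the example layout. The sketch below checks only for the entries listed in that layout; `missing_entries` is a hypothetical helper, and it assumes the Turbo and Edit checkpoints use the same top-level structure as Base, which the layout above does not confirm.

```python
from pathlib import Path

# Entries taken from the example layout shown for Boogu-Image-0.1-Base.
EXPECTED_ENTRIES = (
    "model_index.json",
    "mllm",
    "processor",
    "scheduler",
    "transformer",
    "vae",
)


def missing_entries(model_dir) -> list:
    """Return the expected entries that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in EXPECTED_ENTRIES if not (root / name).exists()]


if __name__ == "__main__":
    missing = missing_entries("models/Boogu-Image-0.1-Base")
    print("layout looks complete" if not missing else f"missing: {missing}")
```

The check is deliberately shallow: it does not verify file sizes or hashes, so a partially downloaded weight file would still pass.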

Flash Attention

This repository provides utils/get_flash_attn.py to automatically install a compatible flash-attn wheel for your environment.

Requirements:

Python and PyTorch with CUDA already installed
Linux x86_64

# Auto: detect environment, download a prebuilt wheel, fallback to source build
python utils/get_flash_attn.py

# Force source compilation
python utils/get_flash_attn.py --build

The script first searches mjun0812/flash-attention-prebuild-wheels, then tries official Dao-AILab/flash-attention release wheels with both cxx11abi variants, and finally falls back to source compilation via pip install flash-attn --no-build-isolation.
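
To see whether the requirements listed above are met before or after running the installer, a small standard-library check is often enough. This sketch is not part of the repository and does not reproduce what `utils/get_flash_attn.py` does internally; `environment_report` is a hypothetical helper that only reports whether `torch` and `flash_attn` can be found, without importing them.

```python
import importlib.util
import platform
import sys


def environment_report() -> dict:
    """Collect basic environment facts without importing heavy packages."""
    return {
        "python": platform.python_version(),
        # The installer expects Linux x86_64, e.g. "linux-x86_64".
        "platform": f"{sys.platform}-{platform.machine()}",
        "torch_installed": importlib.util.find_spec("torch") is not None,
        "flash_attn_installed": importlib.util.find_spec("flash_attn") is not None,
    }


if __name__ == "__main__":
    for key, value in environment_report().items():
        print(f"{key}: {value}")
```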

🚀 Quick Start

PyTorch Native TI2I Edit Inference

export device="cuda:0" # Required
mkdir -p outputs/test_ti2i/


python inference.py \
    --pretrained_pipeline_name_or_path "models/Boogu-Image-0.1-Edit" \
    --input_image_paths "input_image_examples/03.jpg" \
    --instruction "Change the style to a colored pencil drawing." \
    --num_inference_steps 50 \
    --height 1024 --width 1024 \
    --text_guidance_scale 5.0 --image_guidance_scale 1.0 \
    --output_image_path "outputs/test_ti2i/out_1.png" \
    --device "$device"
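
To apply several instructions to the same image, the command above can be assembled programmatically. The sketch below only reuses the flags shown in that command; `build_edit_command` and the second instruction are hypothetical, and the defaults mirror the values used above.

```python
import shlex


def build_edit_command(image_path, instruction, output_path, device="cuda:0",
                       steps=50, size=1024, text_cfg=5.0, image_cfg=1.0):
    """Build the argv list for one inference.py edit call."""
    return [
        "python", "inference.py",
        "--pretrained_pipeline_name_or_path", "models/Boogu-Image-0.1-Edit",
        "--input_image_paths", image_path,
        "--instruction", instruction,
        "--num_inference_steps", str(steps),
        "--height", str(size), "--width", str(size),
        "--text_guidance_scale", str(text_cfg),
        "--image_guidance_scale", str(image_cfg),
        "--output_image_path", output_path,
        "--device", device,
    ]


# Hypothetical batch of instructions applied to the same example image.
instructions = [
    "Change the style to a colored pencil drawing.",
    "Change the style to a watercolor painting.",
]
for i, text in enumerate(instructions, start=1):
    cmd = build_edit_command(
        "input_image_examples/03.jpg", text, f"outputs/test_ti2i/out_{i}.png"
    )
    # Print the shell-quoted command; to execute it, pass cmd to
    # subprocess.run(cmd, check=True) from the repository root.
    print(shlex.join(cmd))
```

Launching a new process per edit reloads the weights every time, so this is convenient for a handful of edits rather than large batches; INFERENCE_GUIDE.md covers batch inference.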

Hardware Notes

📖 For full CLI options, device setup, offload strategies, caching acceleration, Torch Compile, FP8, and batch inference details, see INFERENCE_GUIDE.md. Torch Compile note: --enable_torch_compile can occasionally produce all-black outputs on some GPUs/models. If that happens, disable it first.

VRAM	Recommended Config (T2I 1K)	Recommended Config (T2I 2K)
12GB	Unquantized: `--enable_sequential_cpu_offload_flag`; Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights`	Unquantized: `--enable_sequential_cpu_offload_flag`; Quantized: `--enable_group_offload_flag --use_fp8_weights`
16GB	Unquantized: `--enable_sequential_cpu_offload_flag`; Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights`	Unquantized: `--enable_sequential_cpu_offload_flag`; Quantized: `--enable_model_cpu_offload_flag --use_fp8_weights`
24GB	Unquantized: `--enable_model_cpu_offload_flag`; Quantized: `--use_fp8_weights`	`--enable_model_cpu_offload_flag`
32GB	Unquantized: `--enable_model_cpu_offload_flag`; Quantized: `--use_fp8_weights`	Unquantized: `--enable_model_cpu_offload_flag`; Quantized: `--use_fp8_weights`
40GB	Base Model	Unquantized: `--enable_model_cpu_offload_flag`; Quantized: `--use_fp8_weights`
80GB	Base Model	Base Model
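
The table above can be turned into a small lookup that picks flags for a given amount of VRAM. This is a sketch under stated assumptions: `recommended_flags` is a hypothetical helper, "Base Model" is read as "no extra flags", the single unlabeled 24GB 2K entry is treated as the unquantized option, and a GPU between two tiers falls back to the lower tier. The table covers T2I only, so Edit workloads may need different settings.

```python
SEQ = ["--enable_sequential_cpu_offload_flag"]
MODEL = ["--enable_model_cpu_offload_flag"]
GROUP = ["--enable_group_offload_flag"]
FP8 = ["--use_fp8_weights"]

# Transcribed from the hardware table: tier (GB) -> resolution -> variant -> flags.
TABLE = {
    12: {"1K": {"unquantized": SEQ, "quantized": MODEL + FP8},
         "2K": {"unquantized": SEQ, "quantized": GROUP + FP8}},
    16: {"1K": {"unquantized": SEQ, "quantized": MODEL + FP8},
         "2K": {"unquantized": SEQ, "quantized": MODEL + FP8}},
    24: {"1K": {"unquantized": MODEL, "quantized": FP8},
         "2K": {"unquantized": MODEL}},  # table gives one unlabeled entry here
    32: {"1K": {"unquantized": MODEL, "quantized": FP8},
         "2K": {"unquantized": MODEL, "quantized": FP8}},
    40: {"1K": {"unquantized": []},  # "Base Model": assumed to mean no extra flags
         "2K": {"unquantized": MODEL, "quantized": FP8}},
    80: {"1K": {"unquantized": []}, "2K": {"unquantized": []}},
}


def recommended_flags(vram_gb, resolution="1K", quantized=False):
    """Return flags for the largest table tier that fits in vram_gb."""
    tiers = [tier for tier in sorted(TABLE) if tier <= vram_gb]
    if not tiers:
        raise ValueError("the table has no entry below 12GB")
    options = TABLE[tiers[-1]][resolution]
    key = "quantized" if quantized else "unquantized"
    if key not in options:
        raise KeyError(f"no {key} entry for {tiers[-1]}GB at {resolution}")
    return list(options[key])


print(recommended_flags(12, "2K", quantized=True))
```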

⚠️ Responsible AI & Limitations

Boogu-Image-0.1 is released for research purposes and is not intended for production deployment without additional safeguards. We took responsible-AI considerations into account during data curation, training, and evaluation; however, the model may still produce outputs that are inaccurate, biased, or otherwise inappropriate.

Known Limitations

🌍 World Knowledge Gap

For tasks requiring rich common sense, domain knowledge, real brands or people, famous landmarks, celebrities, products, or complex contextual understanding, Boogu still has a clear gap from strong closed-source systems
This capability is extraordinarily expensive to measure; even Arena-style evaluation struggles to assess it fully, so existing benchmarks barely quantify this dimension and the real gap is likely larger than measured scores suggest

🖼️ Image-to-Image Consistency & In-Context Scenarios

For editing tasks requiring strict preservation of the input subject, identity, layout, or fine details, Boogu's image-to-image consistency is still not stable enough
Because our image-to-image capability focuses more on photography and text-generation applications, Boogu still trails Seedream 5.0 and Nano Banana Pro in some in-context generation scenarios

📝 Text Rendering Stability

Boogu can handle many Chinese and English text scenarios, but long text, dense typography, small fonts, and complex design layouts can still produce typos, missing characters, or layout drift
Text rendering is currently focused on Chinese and English; other languages are not specifically optimized and may degrade noticeably

🦴 Body Structure in Complex Poses

In multi-person interaction, occlusion, exaggerated motion, or unusual viewpoints, hands, limbs, and body structure may still become unnatural or inconsistent

👤 Small Faces & Small Limbs

Because we use the open-source FLUX.1 VAE, reconstruction loss is relatively large, so details such as small faces, small limbs, eyes, and text may still show artifacts or instability

📦 Limited Release Scope

Due to resource constraints, engineering complexity, and release boundaries, we are not able to open-source every training and system detail
The current open-source release aims to balance reproducibility, usability, and sustainable maintenance while providing a reliable starting point for community research and improvement

Downstream users are responsible for applying content moderation, validation, and compliance checks appropriate to their use case.

🙏 Acknowledgements

Closed-source systems such as GPT-Image, Nano Banana, and the Seedream series helped us understand the frontier capabilities and practical boundaries of unified understanding-and-generation systems. We thank the Qwen-Image, Z-Image, OmniGen2, FLUX, and broader open-source communities for the foundations they provide, and DeepSeek for strong open-source understanding models that support open-source unified multimodal systems.

📄 License

This project is released under the Apache-2.0 License.

Author: Boogu
Organization: Boogu

Details

Downloads: 374
Likes: 74
Access: Open Source
Trending: 74
License: apache-2.0
Library: diffusers
Created: Jun 16, 2026
Updated: Jun 21, 2026
Languages: en, zh
