
HunyuanImage 3.0 is Tencent Hunyuan’s newest text-to-image system built as a native multimodal, autoregressive Mixture-of-Experts (MoE) model that unifies understanding and generation — eschewing the traditional DiT-only diffusion stack for a unified framework that reasons over language and pixels together. It is presented as the largest open-source image-generation MoE to date, featuring 64 experts with 80B total parameters and ~13B active per token, and it targets quality on par with leading closed-source systems through curated data and reinforcement learning post-training.
What’s new

- Unified multimodal autoregressive core: Moves beyond DiT-only to a native framework that models text and image jointly, improving prompt adherence and contextual fidelity while enabling richer reasoning during synthesis.
- Largest open MoE for images: 64-expert design with 80B total parameters and ~13B active per token, trading raw size for expert sparsity to boost capacity without prohibitive inference cost.
- Reinforcement post-training: RL post-tuning calibrated for semantic accuracy and aesthetics yields strong prompt alignment with photorealistic detail and typography accuracy.
- World-knowledge reasoning: The unified stack elaborates sparse prompts intelligently and respects long, structured instructions (Chinese/English), enabling complex compositions with consistent semantics.
Architecture at a glance
- Autoregressive multimodal backbone: A single framework that encodes prompts and autoregressively generates image tokens, integrating understanding and generation steps for tighter semantic control.
- MoE routing: 64 experts with roughly 13B parameters activated per token, improving scaling efficiency and enabling domain specialization across experts (a minimal routing sketch follows after this list).
- Transfusion-style integration: Community analyses note synergy with diffusion-style refinement (“Transfusion”), aligning the autoregressive core with high-fidelity image synthesis for texture and detail.
- Inference kernels and options: Attention implementation selectable between SDPA and FlashAttention-2; MoE path supports eager or FlashInfer for throughput.
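For intuition about how expert sparsity keeps active compute near 13B despite 80B total parameters, here is a minimal, illustrative top-k routing layer in PyTorch. It is a toy sketch (tiny dimensions, generic router), not HunyuanImage 3.0's actual implementation: each token's router scores all experts, only the top-k expert MLPs run for that token, and their outputs are blended by the renormalized router weights.

import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy sparse MoE layer: each token is processed by only k of E expert MLPs."""
    def __init__(self, d_model=64, d_ff=256, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        weights, idx = self.router(x).topk(self.top_k, dim=-1)  # pick k experts per token
        weights = F.softmax(weights, dim=-1)                    # renormalize over the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e          # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

tokens = torch.randn(10, 64)
print(TopKMoE()(tokens).shape)  # torch.Size([10, 64]); only 2 of 8 experts ran per token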
Benchmarks and positioning
- SSAE (Machine Evaluation)
SSAE (Structured Semantic Alignment Evaluation) is an intelligent evaluation metric for image-text alignment based on advanced multimodal large language models (MLLMs). We extracted 3,500 key points across 12 categories, then used MLLMs to automatically score each generated image against these key points based on its visual content. Mean Image Accuracy averages the per-image scores (each image's average over its own key points), while Global Accuracy pools all key points and averages them directly.

- GSB (Human Evaluation)
We adopted the GSB (Good/Same/Bad) evaluation method commonly used to assess the relative performance of two models from an overall image-perception perspective. In total, we utilized 1,000 text prompts, generating an equal number of image samples for all compared models in a single run. For a fair comparison, we conducted inference only once per prompt, avoiding any cherry-picking of results, and kept the default settings for all baseline models. The evaluation was performed by more than 100 professional evaluators. A short sketch of how both metrics aggregate their scores follows below.
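The reference scoring code is not reproduced here, but the aggregation described above is simple arithmetic. The sketch below uses hypothetical scores and summarizes GSB with the common (G - B) / total margin, which is an assumption and may differ from the exact reporting convention used officially.

from statistics import mean

# Hypothetical MLLM judgments: one list of 0/1 key-point scores per generated image.
keypoint_scores = [
    [1, 1, 0, 1],  # image 1: 3 of 4 key points satisfied
    [1, 0],        # image 2: 1 of 2
    [1, 1, 1],     # image 3: 3 of 3
]

# Mean Image Accuracy: average each image's own accuracy, then average across images.
mean_image_acc = mean(mean(scores) for scores in keypoint_scores)

# Global Accuracy: pool every key point and average once.
pooled = [s for scores in keypoint_scores for s in scores]
global_acc = sum(pooled) / len(pooled)

print(f"Mean Image Accuracy: {mean_image_acc:.3f}")  # 0.750
print(f"Global Accuracy:     {global_acc:.3f}")      # 0.778

# GSB: evaluators mark each pairwise comparison Good / Same / Bad for model A vs. model B.
votes = {"good": 420, "same": 400, "bad": 180}       # hypothetical tallies over 1,000 prompts
margin = (votes["good"] - votes["bad"]) / sum(votes.values())
print(f"GSB margin (G - B) / total: {margin:+.1%}")  # +24.0%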

Use cases
- Product visualization and concept art: High adherence to complex multi-attribute prompts and styles with repeatable lighting and composition.
- Branding and typography: Improved inline text generation and layout control critical for posters, banners, and packaging comps.
- Education and comics: Fine-grained character consistency and panel composition over long instructions, including expressive emoji/iconography.
System Requirements
- GPU: NVIDIA GPU with CUDA support
- Disk Space: 170GB for model weights
- GPU Memory: ≥3×80GB (4×80GB recommended for better performance)
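The GPU-memory figure follows from simple arithmetic. A rough sketch, assuming the full 80B parameters are held in bf16 (2 bytes each) and ignoring activations, KV cache, and framework overhead:

# Back-of-the-envelope memory estimate; actual usage is higher once activations,
# KV cache, CUDA context, and framework overhead are added.
total_params = 80e9    # 80B total parameters (all experts stay resident; ~13B are active per token)
bytes_per_param = 2    # bf16
weights_gib = total_params * bytes_per_param / 1024**3
print(f"Weights alone: ~{weights_gib:.0f} GiB -> ~{weights_gib / 80:.1f} x 80GB GPUs")
# ~149 GiB of weights: two 80GB cards are nearly full before any runtime overhead,
# which is why >=3x80GB is the stated minimum and 4x80GB is recommended.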
Install and run locally
The official instructions provide a simple path through Hugging Face weights and a runnable demo script, with switches to pick attention and MoE kernels.
- Clone and download:
git clone https://github.com/Tencent-Hunyuan/HunyuanImage-3.0.git
cd HunyuanImage-3.0/
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
- Quick demo:
python3 run_image_gen.py \
    --model-id ./HunyuanImage-3 \
    --verbose 1 \
    --prompt "A brown and white dog is running on the grass"
- Useful flags (a sketch combining them follows after this list):
- --attn-impl sdpa|flash_attention_2 to toggle the attention backend.
- --moe-impl eager|flashinfer to select the MoE kernel; FlashInfer can improve throughput.
- --diff-infer-steps (50 by default) to trade quality against latency.
- --image-size auto, or an explicit size such as 1280x768 or a 16:9 aspect ratio.
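The flags above can also be composed programmatically. A small sketch, assuming it runs from the HunyuanImage-3.0/ checkout and that the flag names are exactly as listed above (values here are examples only):

import subprocess

cmd = [
    "python3", "run_image_gen.py",
    "--model-id", "./HunyuanImage-3",
    "--attn-impl", "flash_attention_2",  # or "sdpa" if FlashAttention-2 is unavailable
    "--moe-impl", "flashinfer",          # or "eager" if FlashInfer is unavailable
    "--diff-infer-steps", "50",          # raise for intricate detail, lower for quick drafts
    "--image-size", "auto",
    "--seed", "42",                      # fix for reproducible comparisons (see Practical tips)
    "--verbose", "1",
    "--prompt", "A brown and white dog is running on the grass",
]
subprocess.run(cmd, check=True)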
Hugging Face usage patterns
Weights and scripts are hosted on Hugging Face with runnable examples and discussions; local usage hinges on loading the provided repo and invoking the demo entry points. The weight repo and file tree are available for direct inspection and pinning.
# Download from HuggingFace and rename the directory.
# Notice that the directory name should not contain dots, which may cause issues when loading using Transformers.
hf download tencent/HunyuanImage-3.0 --local-dir ./HunyuanImage-3
from transformers import AutoModelForCausalLM
# Load the model
model_id = "./HunyuanImage-3"
# Currently we can not load the model using HF model_id `tencent/HunyuanImage-3.0` directly
# due to the dot in the name.
kwargs = dict(
    attn_implementation="sdpa",  # Use "flash_attention_2" if FlashAttention is installed
    trust_remote_code=True,
    torch_dtype="auto",
    device_map="auto",
    moe_impl="eager",  # Use "flashinfer" if FlashInfer is installed
)
model = AutoModelForCausalLM.from_pretrained(model_id, **kwargs)
model.load_tokenizer(model_id)
# generate the image
prompt = "A brown and white dog is running on the grass"
image = model.generate_image(prompt=prompt, stream=True)
image.save("image.png")
Training data and scale
- Scale summary: Community coverage cites ~5B image–text pairs plus ~6T tokens for multimodal pretraining, underpinning the model’s broad world knowledge and style generalization.
- Size footprint: Reported full weights around the hundreds of GB range with active parameters ~13B at inference; check system memory and GPU plan accordingly.
Practical tips
- Kernel selection: Prefer FlashAttention-2 if available, and FlashInfer for MoE paths to keep latency manageable on multi-GPU systems.
- Steps vs quality: Increase --diff-infer-steps for more intricate patterns or typography; reduce it for drafts or interactive ideation.
- Long prompts: Keep structured, attribute-ordered prompts; the autoregressive core tends to respect explicit hierarchy and constraints in long instructions.
- Seeds and reproducibility: Set --seed for deterministic outputs when comparing parameter sweeps or style variants.
How it compares
- Versus DiT-only pipelines: The unified autoregressive design integrates understanding into generation, which often improves compositionality, typography, and adherence without separate encoders/conditioning stacks.
- Versus smaller open models: The 80B/64-expert MoE offers higher ceiling for realism and complexity, though compute demands are higher; use kernel optimizations and careful step counts.
Sample prompts to try
- Product sheet: “Front-lit studio shot of a matte-black wireless earbud on acrylic stand, single rim light, 85mm, high-contrast shadows, minimal backdrop typography ‘Echo Pro’ in Futura.”
- Educational poster: “Bilingual infographic about the water cycle, clean flat design, labeled arrows and section headers, legible English and Chinese text, A3 layout.”
- Comic panel: “Four-panel manga of a cat detective in rain-soaked city, panel borders, consistent character design, speech bubbles with legible text.”
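To try all three in one pass with the Python API shown earlier (assuming the model and its tokenizer have already been loaded as in the Hugging Face usage section), a simple loop is enough:

# Reuses the `model` object from the Hugging Face usage example above.
sample_prompts = {
    "product_sheet.png": "Front-lit studio shot of a matte-black wireless earbud on acrylic stand, "
                         "single rim light, 85mm, high-contrast shadows, minimal backdrop typography "
                         "'Echo Pro' in Futura.",
    "water_cycle_poster.png": "Bilingual infographic about the water cycle, clean flat design, labeled "
                              "arrows and section headers, legible English and Chinese text, A3 layout.",
    "cat_detective_manga.png": "Four-panel manga of a cat detective in rain-soaked city, panel borders, "
                               "consistent character design, speech bubbles with legible text.",
}
for filename, prompt in sample_prompts.items():
    image = model.generate_image(prompt=prompt, stream=True)  # same call as in the demo above
    image.save(filename)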
Early community notes
Developers point out the model behaves like an LLM that outputs images, favoring dialog-driven refinement and making CLIP-like text encoders unnecessary; the trade-off is model size and integration complexity in legacy diffusion toolchains. Expect community-adapted runtimes and potential smaller variants for broader accessibility.
Resources
- Hugging Face model hub page with instructions, arguments, and discussions.
- Official GitHub with framework overview and examples.
- Technical explainers and guides aggregating specs and setup steps.
HunyuanImage 3.0’s bet on a native multimodal, autoregressive MoE pays off with stronger semantic control and high-fidelity imagery, while kernel and MoE optimizations keep active compute reasonable relative to its scale; for teams needing top-tier prompt adherence, bilingual typography, and compositional reliability, it is a compelling open alternative to closed systems.