vLLM x Qwen3-Next: Hybrid Attention, Multi-Token Prediction, and Thinking Controls for Production-Grade Inference
vLLM now supports the Qwen3-Next family, enabling hybrid attention (Gated DeltaNet + full attention), high-sparsity MoE activation, and native multi-token prediction (MTP) with OpenAI-compatible serving and chat-template thinking controls for clean integration into existing apps. This support lands in nightly and mainline builds, with recipes that demonstrate long-context serving, MTP, and toggling thinking/non-thinking modes.
Why it matters
Qwen3-Next brings a next-gen hybrid architecture tuned for long context and efficiency, interleaving linear and full attention to scale to extremely long inputs while maintaining reasoning fidelity, and vLLM adds hybrid KV-cache management and CUDA graphs to keep latency low. On the flagship 80B-A3B MoE, only about 3B parameters are active per token due to 1:50 routing, which vLLM’s optimized MoE kernels handle efficiently for strong throughput on multi-GPU nodes.
Long context and MTP
The Qwen3-Next model card recommends 256K default context, with the option to reduce if memory is tight; vLLM exposes this via max-model-len and an env gate for long max settings. MTP is natively supported in vLLM via speculative-config, letting the model predict multiple tokens per step to boost decode speed without app changes.
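To build intuition for what num_speculative_tokens buys you, here is a back-of-the-envelope sketch (not vLLM's internal accounting) of expected tokens per decode step under a simple model where each draft token is accepted independently with the same probability:

```python
def expected_tokens_per_step(num_speculative: int, accept_rate: float) -> float:
    # The k-th draft token is kept only if all earlier drafts were accepted,
    # so it contributes accept_rate**k in expectation; the target model
    # always emits one token of its own per step, hence the +1.
    return sum(accept_rate ** k for k in range(1, num_speculative + 1)) + 1.0

# With 2 speculative tokens and an 80% acceptance rate:
print(expected_tokens_per_step(2, 0.8))  # 1 + 0.8 + 0.64 = 2.44 tokens/step
```

Real acceptance rates depend on the workload, but the shape of the curve explains why even num_speculative_tokens=2 can meaningfully raise decode throughput.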
Thinking mode controls
Qwen3-Next supports thinking and non-thinking modes; with transformers and vLLM, thinking is controlled via the chat template's enable_thinking switch or via the API's chat_template_kwargs. For non-thinking outputs in API calls, set chat_template_kwargs: {"enable_thinking": false}, or pass enable_thinking=False to tokenizer.apply_chat_template, to suppress chain-of-thought while preserving final answers.
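As a concrete illustration, the raw request body sent to the vLLM OpenAI-compatible server nests the switch like this (the prompt is a placeholder; the chat_template_kwargs field name matches vLLM's serving API):

```python
import json

# Sketch of a /v1/chat/completions request body that disables thinking
# at request time; chat_template_kwargs is forwarded to the chat template,
# so no prompt or template changes are needed.
payload = {
    "model": "Qwen/Qwen3-Next-80B-A3B-Instruct",
    "messages": [{"role": "user", "content": "Summarize the request in one line."}],
    "chat_template_kwargs": {"enable_thinking": False},  # serialized as false on the wire
}
body = json.dumps(payload)
print(body)
```

Note that Python's False becomes JSON false in the serialized body, which is the form the server sees.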
Install and serve
- Nightly install: uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto for the latest Qwen3-Next support and kernels.
- Model server: VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --tensor-parallel-size 4 --max-model-len 262144 on a 4×GPU node; reduce --max-model-len (e.g., 32768) if startup fails.
- Enable MTP: add --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' to increase decode throughput on the 80B-A3B model.
Key implementation details
- Hybrid attention: vLLM integrates Triton kernels for Flash Linear Attention and a hybrid KV cache manager that balances logical block sizes across linear and full attention layers to keep physical memory usage uniform and efficient. This reduces fragmentation and boosts throughput at high memory utilization.
- CUDA Graphs: full CUDA graph mode is enabled by default to reduce CPU overhead from launching Triton kernels, improving low-latency decode.
- High-sparsity MoE: vLLM’s MoE path supports Qwen3-Next’s 1:50 activation ratio, maintaining strong latency and throughput with modern GPU interconnects and tensor parallelism.
- Long context: the model card defaults to 256K; vLLM gates long max-model-len with VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 to prevent accidental OOMs and encourages sizing down as needed.
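The block-size balancing idea behind the hybrid KV cache manager can be illustrated with a small sketch (illustrative only, not vLLM's actual implementation, and the byte footprints below are made up): pick a physical page size that holds a whole number of token slots for every layer type, so a single free-page pool can serve both linear and full attention without fragmenting.

```python
from math import lcm

def common_page_bytes(per_token_bytes: list[int]) -> int:
    # Smallest physical page size (in bytes) that is an exact multiple of
    # every layer type's per-token footprint, so each page holds a whole
    # number of token slots regardless of which layer type claims it.
    return lcm(*per_token_bytes)

# Hypothetical footprints: a full-attention layer's KV entry vs. a
# linear-attention layer's smaller state per logical token slot.
full_attn, linear_attn = 4096, 1536
page = common_page_bytes([full_attn, linear_attn])
print(page, page // full_attn, page // linear_attn)  # 12288 bytes -> 3 and 8 slots
```

Because every page is interchangeable between layer types, freeing a page from one layer immediately makes it available to any other, which is what keeps physical memory usage uniform at high utilization.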
Python usage (transformers + vLLM, thinking control)
Below is a minimal end-to-end Python example consistent with Qwen3-Next guidance: tokenizer-side enable_thinking for local generation and vLLM engine generation for throughput. The sampling parameters reflect Qwen’s recommended settings for thinking mode.
from transformers import AutoTokenizer
from vllm import LLM, SamplingParams

# Initialize the tokenizer (Qwen3-Next 80B MoE Instruct); thinking is enabled by default
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-Next-80B-A3B-Instruct")

# Configure sampling parameters (recommended for thinking mode)
sampling_params = SamplingParams(
    temperature=0.6,   # recommended for thinking mode
    top_p=0.95,        # recommended for thinking mode
    top_k=20,          # recommended for thinking mode
    max_tokens=32768,  # fit to your memory and latency budget
)

# Initialize the vLLM engine (ensure a nightly/main build with Qwen3-Next support)
llm = LLM(model="Qwen/Qwen3-Next-80B-A3B-Instruct")

# Prepare the chat with thinking enabled or disabled via the chat template
prompt = "Give me a short introduction to large language models."
messages = [{"role": "user", "content": prompt}]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True,  # set False to strictly disable thinking
)

# Generate outputs with vLLM
outputs = llm.generate([text], sampling_params)
for output in outputs:
    print("Prompt:", repr(output.prompt))
    print("Generated:", repr(output.outputs[0].text))
OpenAI-compatible client (non-thinking via API)
When serving with vLLM, pass chat_template_kwargs through extra_body to toggle thinking off at request time without changing prompts or templates. This mirrors the Qwen docs’ API example.
from openai import OpenAI

# Point the client at the local vLLM server (OpenAI-compatible API)
client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

chat_response = client.chat.completions.create(
    model="Qwen/Qwen3-Next-80B-A3B-Instruct",
    messages=[{"role": "user", "content": "Give me a short introduction to large language models."}],
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},  # hard-disable thinking
    },
)
print("Chat response:", chat_response)
Serving commands
- Standard serve, long context: VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --port 8000 --tensor-parallel-size 4 --max-model-len 262144 for 256K context; reduce if capacity is insufficient.
- With MTP: VLLM_ALLOW_LONG_MAX_MODEL_LEN=1 vllm serve Qwen/Qwen3-Next-80B-A3B-Instruct --port 8000 --tensor-parallel-size 4 --max-model-len 262144 --speculative-config '{"method":"qwen3_next_mtp","num_speculative_tokens":2}' to enable multi-token prediction.
- Nightly install: uv pip install vllm --extra-index-url https://wheels.vllm.ai/nightly --torch-backend=auto to access the Qwen3-Next kernels and features showcased in the announcement.
Best practices
- Toggle thinking deliberately: use enable_thinking=True with recommended sampling for math/reasoning; disable via chat_template_kwargs for fast chat and privacy-safe outputs without internal traces.
- Size context to hardware: start with 32K–64K max-model-len on single nodes, and only raise to 256K once utilization and memory headroom are confirmed.
- Use tensor parallelism: the 80B-A3B model expects multi-GPU; start with --tensor-parallel-size 4 and scale with NVLink/Hopper nodes for best latency.
- Turn on MTP for throughput: the qwen3_next_mtp speculative config improves decode efficiency without code changes.
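When sizing max-model-len to hardware, a rough upper bound on full-attention KV-cache memory per sequence is a useful sanity check. The sketch below uses hypothetical layer counts and head dimensions (not Qwen3-Next's actual config) purely to show the arithmetic; a hybrid model's real footprint is lower, since its linear-attention layers keep a constant-size state instead of a per-token cache.

```python
def kv_cache_gib(num_full_attn_layers: int, num_kv_heads: int,
                 head_dim: int, context_len: int, dtype_bytes: int = 2) -> float:
    # Upper bound for one sequence's full-attention KV cache:
    # 2 (K and V) * layers * kv_heads * head_dim * tokens * bytes per element.
    total_bytes = (2 * num_full_attn_layers * num_kv_heads
                   * head_dim * context_len * dtype_bytes)
    return total_bytes / 2**30

# Hypothetical config: 12 full-attention layers, 2 KV heads (GQA),
# head_dim 256, fp16, at the 256K default context length.
print(f"{kv_cache_gib(12, 2, 256, 262144):.1f} GiB per sequence")  # 6.0 GiB
```

Multiplying by your expected concurrency quickly shows why starting at 32K-64K and raising the limit only after confirming headroom is the safer path.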
Roadmap and ecosystem
The vLLM team notes upcoming kernel and memory optimizations for hybrid models, including better management for Gated DeltaNet and disaggregated KV memory flows. Community partners and the Qwen team actively contributed to correctness and performance, and the public recipes continue to evolve as more users scale Qwen3-Next in production.