Fine-Tuning Liquid AI’s LFM2-VL Series: Revolutionizing Edge AI with Efficient Vision-Language Models

Liquid AI’s LFM2‑VL series lands at a pivotal moment for edge AI, where the mandate is clear: deliver multimodal intelligence that is fast, compact, and private by design. Built by a team of MIT CSAIL alumni and released on August 26, 2025, LFM2‑VL arrives in two open‑weight variants — LFM2‑VL‑450M (350M language + 86M vision) and LFM2‑VL‑1.6B (1.2B language + 400M vision) — that prioritize real‑world deployability without sacrificing capability. With up to 2x faster GPU inference than contemporaries like InternVL3 and SmolVLM2, native resolution image handling, and robust text‑image understanding, these models are tailored for smartphones, laptops, wearables, and IoT systems where latency, cost, and privacy constraints dominate.
Architectural Details: A Modular, Efficiency-Driven Design
Liquid AI’s LFM2-VL series builds on the company’s signature Liquid Foundation Models (LFMs), which leverage Liquid Neural Networks (LNNs) inspired by dynamical systems and numerical linear algebra, challenging the compute-heavy dominance of transformer-based architectures. Unlike traditional VLMs with fixed structures, LFM2-VL combines a hybrid convolution-attention backbone with a shape-optimized vision encoder and a tunable multimodal projector, enabling seamless deployment across diverse hardware — from smartphones to embedded systems.
Core Components
The architecture consists of three key modules, as detailed in Liquid AI’s technical documentation:
— Language Model Backbone (LFM2 Tower):
- LFM2-VL-450M: Powered by LFM2–350M (350M parameters).
- LFM2-VL-1.6B: Powered by LFM2–1.2B (1.2B parameters).
- The backbone uses the Linear Input-Varying (LIV) framework, generating dynamic weights on-the-fly to reduce memory usage and maintain near-constant inference time for contexts up to 32K tokens. Each tower comprises 16 blocks: 10 double-gated short-range convolution blocks for local patterns and 6 grouped query attention (GQA) blocks for global context, optimized for CPU, GPU, and NPU. This hybrid design delivers 200% faster decode and prefill on CPUs compared to models like Qwen3, with a memory footprint of ~2GB for the 1.6B model (vs. 4GB for peers).
— Vision Encoder (SigLIP2 NaFlex):
- Parameters: 86M (450M variant), 400M (1.6B variant).
- Built on SigLIP2 (a sigmoid-loss CLIP variant for enhanced semantics), it processes images at native resolutions up to 512×512 pixels, preserving aspect ratios without distortion. For larger images (e.g., 1024×1024), it uses non-overlapping tiling (512×512 patches) and, in the 1.6B model, encodes a downscaled thumbnail for global context, avoiding resizing artifacts.
— Multimodal Projector:
- A 2-layer MLP with “pixel unshuffle” technology, reducing token dimensionality (e.g., 256×384 image yields ~96 tokens; 1000×3000 ~1,020 tokens). It fuses vision and language embeddings, allowing runtime tuning of parameters like maximum image tokens for speed-quality trade-offs (e.g., 64 tokens for sub-second mobile inference).
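The projector’s token reduction can be sketched with PyTorch’s built-in pixel unshuffle. The patch size (16) and hidden width (768) below are illustrative assumptions rather than LFM2-VL’s published configuration, but the arithmetic reproduces the ~96 tokens quoted for a 256×384 image:

```python
import torch
import torch.nn.functional as F

# A 256x384 image with assumed 16x16 patches yields a 16x24 grid of vision features.
feats = torch.randn(1, 768, 16, 24)  # (batch, hidden, H_patches, W_patches)

# Pixel unshuffle with factor 2 folds each 2x2 block of patches into the
# channel dimension, quartering the number of spatial positions (tokens).
folded = F.pixel_unshuffle(feats, downscale_factor=2)  # (1, 768*4, 8, 12)
tokens = folded.flatten(2).transpose(1, 2)             # (1, 96, 3072)
print(tokens.shape[1])  # 96 tokens, matching the ~96 cited for 256x384
```

The wider, shorter token sequence is then mapped into the language model’s embedding space by the 2-layer MLP, which is where the speed-quality trade-off (fewer tokens, more information per token) is realized.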
Efficiency Innovations
- Input-Adaptive Processing: LIV operators minimize KV cache size, ensuring constant memory complexity for edge devices with <1GB RAM.
- Hybrid Convolution-Attention: Convolutions handle local features efficiently, while GQA reduces attention overhead, yielding 2x GPU speedups.
- Tunable Inference: Runtime adjustments (e.g., capping image tokens at 64) enable sub-second latency without retraining.
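The KV-cache savings from grouped query attention can be shown with a minimal shape sketch; the head counts here are illustrative, not LFM2-VL’s actual configuration:

```python
import torch
import torch.nn.functional as F

# Grouped-query attention (GQA): many query heads share a smaller set of
# key/value heads, shrinking the KV cache that dominates edge memory use.
B, T, head_dim = 1, 128, 64
n_q_heads, n_kv_heads = 16, 4

q = torch.randn(B, n_q_heads, T, head_dim)
k = torch.randn(B, n_kv_heads, T, head_dim)  # only 4 KV heads are cached
v = torch.randn(B, n_kv_heads, T, head_dim)

# Each group of 4 query heads attends over one shared KV head.
k = k.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
v = v.repeat_interleave(n_q_heads // n_kv_heads, dim=1)
out = F.scaled_dot_product_attention(q, k, v)
print(out.shape)  # torch.Size([1, 16, 128, 64])
```

The cache stores 4 KV heads instead of 16, a 4x reduction, which is why mixing GQA blocks with convolution blocks keeps memory near-constant on long contexts.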
Training Details: Efficiency Meets Multimodal Mastery
LFM2-VL’s training pipeline is a testament to Liquid AI’s focus on cost-effective scaling, using ~100 billion multimodal tokens to achieve 3x faster training than prior LFMs. The process leverages the LFM2 language backbone, progressively integrating vision capabilities for robust multimodal performance.
Training Phases
— Pre-Training on LFM2 Backbone:
- Dataset: 10T tokens (75% English, 20% multilingual — Japanese, Arabic, Korean, Spanish, French, German — 5% code).
- Method: Knowledge distillation from LFM1–7B using cross-entropy loss, extending context to 32K tokens. Liquid’s STAR neural architecture search optimizes for 50+ internal evals (e.g., MMLU, GSM8K, IFEval).
- Compute: ~50,000 GPU-hours on 96 NVIDIA H200s, with custom GEMM/ScatterAdd kernels for FP8/bfloat16 stability.
— Joint Mid-Training for Multimodal Fusion:
- Dataset: Shifts from 95% text to 30% image-text pairs, using open-source datasets (e.g., for captioning, VQA) and in-house synthetic images for edge cases (e.g., varied resolutions).
- Method: SigLIP2 encoder aligns vision embeddings; projector learns token mapping via pixel unshuffle, ensuring efficient fusion.
— Supervised Fine-Tuning (SFT):
- Dataset: ~100B multimodal tokens refine OCR, real-world QA, and reasoning, focusing on low-resource scenarios.
- Method: Progressive data adjustment ensures convergence; no tool-use training (unlike Gemma), prioritizing core multimodal tasks.
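The distillation objective in the pre-training phase can be sketched as a soft-label cross-entropy between student and teacher output distributions. The vocabulary size and temperature below are illustrative assumptions, not values from the technical report:

```python
import torch
import torch.nn.functional as F

# Sketch of logit-level knowledge distillation: the student is trained with
# cross-entropy against the teacher's (optionally softened) distribution.
def distillation_ce(student_logits, teacher_logits, temperature=1.0):
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Soft-label cross-entropy: -sum_v p_teacher(v) * log p_student(v)
    return -(p_teacher * log_p_student).sum(dim=-1).mean()

student = torch.randn(4, 32000)  # (batch, vocab) logits from the LFM2 student
teacher = torch.randn(4, 32000)  # logits from the LFM1-7B teacher
loss = distillation_ce(student, teacher)
print(loss.item())
```

Minimizing this objective pulls the student’s next-token distribution toward the teacher’s, which is how a 350M/1.2B backbone can inherit behavior from a 7B teacher at a fraction of the training cost.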
The pipeline trades a ~5% slowdown for deterministic gradients, positioning LFM2-VL as a leader in sustainable AI, critical amid 2025’s $330B capex concerns.
Benchmark Performance: Competitive Edge in Multimodal Tasks
LFM2-VL’s performance shines in edge-optimized multimodal benchmarks, balancing accuracy and speed. The technical report provides detailed comparisons against models like InternVL3, SmolVLM2, and Gemma-2–2B, focusing on vision-language tasks critical for real-world applications. Below is a comprehensive table summarizing key benchmark results, with notes on strengths and gaps.

Analysis
- Strengths: LFM2-VL-1.6B leads in efficiency (2x faster, ~40% less memory) and excels in OCR, document understanding, and real-world QA, making it ideal for edge tasks like mobile photo analysis or IoT anomaly detection. The 450M variant is remarkably competitive despite its size, fitting wearables with <1GB RAM.
- Weaknesses: MMMU scores lag behind larger models like InternVL3, indicating generalization limits for complex reasoning. Users on X note that fine-tuning is needed for academic tasks.
- Context: Compared to Nous Hermes 4 (96.3% MATH-500) or GPT-5 (95.8% MedQA), LFM2-VL prioritizes speed over raw reasoning, aligning with edge use cases.
Applications: Transforming Edge AI Use Cases
LFM2-VL’s low-latency, low-memory design unlocks a range of applications where cloud reliance is impractical. Key use cases include:
- Mobile and Wearables: Real-time image captioning (e.g., describing photos on smartphones) or VQA on AR glasses (e.g., identifying objects offline). The 450M model suits smartwatches for gesture-based tasks.
- Robotics and Autonomous Systems: Visual reasoning for drones/robots (e.g., obstacle detection via camera feeds). Native patching handles high-res inputs efficiently.
- IoT and Smart Devices: Anomaly detection in security cameras or visual analysis in smart home assistants. The ~2GB footprint fits IoT constraints.
- Enterprise Tools: Visual search for e-commerce (e.g., product matching) or chatbots with image uploads. Integrates with Liquid’s LEAP platform for iOS/Android deployment.
Code Implementation: Inference and LoRA Fine-Tuning
LFM2-VL is hosted on Hugging Face under the LFM1.0 license (Apache 2.0-based, allowing academic use and commercial for firms < $10M revenue). It integrates with Transformers v4.55+ and TRL for fine-tuning. Below are examples for basic inference and LoRA fine-tuning.
Basic Inference
from transformers import AutoProcessor, AutoModelForImageTextToText
from transformers.image_utils import load_image
import torch

# Load model and processor
model_id = "LiquidAI/LFM2-VL-1.6B"
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# Load image and create conversation
url = "https://www.ilankelman.org/stopsigns/australia.jpg"  # Replace with your image URL/path
image = load_image(url)
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Generate response
inputs = processor.apply_chat_template(
    conversation,
    add_generation_prompt=True,
    return_tensors="pt",
    return_dict=True,
    tokenize=True,
    max_image_tokens=64  # Tune for speed-quality trade-off
).to(model.device)
with torch.no_grad():
    outputs = model.generate(**inputs, max_new_tokens=100, do_sample=False)
response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
print(response)
Fine-Tuning with LoRA
# Install dependencies in Colab
# !pip install torch transformers peft datasets accelerate pillow flash-attn==2.5.8 --no-build-isolation -q
import logging

import torch
from transformers import AutoProcessor, AutoModelForImageTextToText, Trainer, TrainingArguments
from transformers.image_utils import load_image
from peft import LoraConfig, get_peft_model, TaskType
from datasets import load_dataset
from torch.cuda.amp import autocast
from torch.nn.utils.rnn import pad_sequence

# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Check for flash-attn availability
try:
    import flash_attn
    FLASH_ATTENTION_AVAILABLE = True
    logger.info(f"Flash Attention 2 installed: {flash_attn.__version__}")
except ImportError:
    FLASH_ATTENTION_AVAILABLE = False
    logger.warning(
        "Flash Attention 2 not installed. Falling back to standard attention. "
        "See https://huggingface.co/docs/transformers/perf_infer_gpu_one#flashattention-2 for installation."
    )

# GPU configuration with architecture detection
if not torch.cuda.is_available():
    raise RuntimeError("CUDA GPU not detected. This implementation requires a GPU (e.g., T4 in Colab).")
device = torch.device("cuda")
torch.cuda.set_device(0)
major, minor = torch.cuda.get_device_capability(0)  # e.g., (7, 5) for T4
compute_cap = major * 10 + minor
logger.info(
    f"Using GPU: {torch.cuda.get_device_name(0)} (Compute Capability: {major}.{minor}) "
    f"with {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB"
)

# Flash Attention 2 requires an Ampere or newer GPU (compute capability >= 8.0)
if compute_cap >= 80 and FLASH_ATTENTION_AVAILABLE:
    ATTN_IMPL = "flash_attention_2"
    logger.info("GPU supports Flash Attention 2. Enabling it.")
else:
    ATTN_IMPL = "eager"
    logger.warning(f"Compute Capability {major}.{minor} does not support Flash Attention 2 (requires >=8.0). Using standard attention.")

# Load model and processor
model_id = "LiquidAI/LFM2-VL-450M"  # The 450M variant fits Colab's T4 GPU
try:
    processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        torch_dtype=torch.bfloat16,
        device_map="cuda:0",
        attn_implementation=ATTN_IMPL,
        trust_remote_code=True
    )
    model.eval()
    logger.info(f"Model {model_id} loaded on GPU with {ATTN_IMPL} attention")
except Exception as e:
    logger.error(f"Failed to load model: {e}")
    raise
# Inference function
def run_inference(image_url, text_prompt, max_image_tokens=32, max_new_tokens=100, output_file="output.txt"):
    try:
        # Load image and build the chat-format conversation
        image = load_image(image_url)
        conversation = [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": image},
                    {"type": "text", "text": text_prompt}
                ]
            }
        ]
        # Prepare inputs
        inputs = processor.apply_chat_template(
            conversation,
            add_generation_prompt=True,
            return_tensors="pt",
            return_dict=True,
            tokenize=True,
            max_image_tokens=max_image_tokens
        ).to(device)
        # Mixed-precision inference
        with torch.no_grad(), autocast(dtype=torch.bfloat16):
            outputs = model.generate(
                **inputs,
                max_new_tokens=max_new_tokens,
                do_sample=False,
                pad_token_id=processor.tokenizer.pad_token_id
            )
        response = processor.batch_decode(outputs, skip_special_tokens=True)[0]
        logger.info(f"Generated response: {response}")
        with open(output_file, "w") as f:
            f.write(response)
        return response
    except torch.cuda.OutOfMemoryError:
        logger.error("GPU out of memory. Reduce max_image_tokens (e.g., 16) or use a smaller model.")
        raise
    except Exception as e:
        logger.error(f"Inference failed: {e}")
        raise
class DataCollatorForVisionLanguage:
    def __init__(self, tokenizer):
        self.tokenizer = tokenizer
        self.pad_token_id = tokenizer.pad_token_id

    def __call__(self, batch):
        # Convert lists -> tensors
        input_ids = [torch.tensor(item["input_ids"], dtype=torch.long) for item in batch]
        attention_mask = [torch.tensor(item["attention_mask"], dtype=torch.long) for item in batch]
        labels = [torch.tensor(item["labels"], dtype=torch.long) for item in batch]
        # Pad all sequences to the longest example in the batch
        input_ids = pad_sequence(input_ids, batch_first=True, padding_value=self.pad_token_id)
        attention_mask = pad_sequence(attention_mask, batch_first=True, padding_value=0)
        labels = pad_sequence(labels, batch_first=True, padding_value=-100)  # ignore loss on padding
        return {
            "input_ids": input_ids,
            "attention_mask": attention_mask,
            "labels": labels,
        }
# Fine-tuning function with a small image-captioning dataset
def fine_tune_lora(dataset_name="lambdalabs/naruto-blip-captions", output_dir="/content/lfm2-vl-lora-finetuned"):
    try:
        # Load a tiny slice of the dataset (small for Colab)
        dataset = load_dataset(dataset_name, split="train[:10]")
        logger.info(f"Loaded dataset with columns: {dataset.column_names}")

        def preprocess(examples):
            input_ids_list = []
            attention_masks_list = []
            labels_list = []
            for img, caption in zip(examples["image"], examples["text"]):
                conversation = [
                    {
                        "role": "user",
                        "content": [
                            {"type": "image", "image": img},
                            {"type": "text", "text": "Describe this image."}
                        ]
                    }
                ]
                # Encode the multimodal input
                inputs = processor.apply_chat_template(
                    conversation,
                    add_generation_prompt=True,
                    return_tensors="pt",
                    return_dict=True,
                    tokenize=True,
                    max_image_tokens=32,
                )
                input_ids = inputs["input_ids"][0]
                attention_mask = inputs["attention_mask"][0]
                # Encode the target caption, aligned to the input length
                labels = processor.tokenizer(
                    caption,
                    return_tensors="pt",
                    padding="max_length",
                    truncation=True,
                    max_length=input_ids.shape[0],
                )["input_ids"][0]
                # Replace pad token ids with -100 (ignore_index for the loss)
                labels[labels == processor.tokenizer.pad_token_id] = -100
                input_ids_list.append(input_ids)
                attention_masks_list.append(attention_mask)
                labels_list.append(labels)
            return {
                "input_ids": input_ids_list,
                "attention_mask": attention_masks_list,
                "labels": labels_list,
            }

        # Apply preprocessing
        tokenized_dataset = dataset.map(
            preprocess,
            batched=True,
            remove_columns=dataset.column_names
        )

        lora_config = LoraConfig(
            task_type=TaskType.CAUSAL_LM,
            r=8,
            lora_alpha=32,
            lora_dropout=0.1,
            bias="none",
            target_modules=["q_proj", "k_proj", "v_proj", "o_proj", "fc_in", "fc_out"],  # text LM layers
        )
        model_peft = get_peft_model(model, lora_config)
        model_peft.print_trainable_parameters()

        # Training args optimized for Colab
        training_args = TrainingArguments(
            output_dir=output_dir,
            num_train_epochs=3,
            per_device_train_batch_size=2,
            gradient_accumulation_steps=8,
            learning_rate=1e-4,
            bf16=True,
            save_steps=10,
            logging_steps=1,
            remove_unused_columns=False,
            dataloader_num_workers=2,
            max_grad_norm=1.0,
            report_to="none"
        )
        data_collator = DataCollatorForVisionLanguage(processor.tokenizer)
        trainer = Trainer(
            model=model_peft,  # train the LoRA-wrapped model, not the frozen base
            args=training_args,
            train_dataset=tokenized_dataset,
            tokenizer=processor.tokenizer,
            data_collator=data_collator,
        )
        trainer.train()
        trainer.save_model()
        logger.info(f"LoRA adapters saved to {output_dir}")
    except Exception as e:
        logger.error(f"Fine-tuning failed: {e}")
        raise
# Example usage
if __name__ == "__main__":
    # Inference
    image_url = "https://picsum.photos/512"  # Reliable test image URL
    text_prompt = "Describe this image in detail."
    try:
        response = run_inference(image_url, text_prompt)
        print(f"Response: {response}")
    except Exception as e:
        print(f"Error during inference: {e}")

    # LoRA fine-tuning on the small captioning dataset
    try:
        fine_tune_lora()
        print("Fine-tuning completed!")
    except Exception as e:
        print(f"Error during fine-tuning: {e}")
Colab Notebook:
FineTuning Liquid AI’s LFM2-VL Series: Revolutionizing Edge AI with Efficient Vision-Language… was originally published in Data Science in Your Pocket on Medium.