Speed Testing NVIDIA GPUs for LLM inferencing and fine-tuning
In the third installment of our GPU comparison series (following CPU vs. T4, and RTX 4090 vs. 5090), we’re putting two absolute beasts in the ring — NVIDIA H100 vs. NVIDIA H200.
Spoiler alert: this ain’t your average GPU bake-off.
These GPUs are not meant for your weekend Stable Diffusion side project. We’re talking server-grade monsters designed for pre-training and massive-scale inference of Large Language Models (LLMs). If you’re building your own GPT, fine-tuning 10B+ parameter models, or working at Anthropic on a rainy day — these are your toys.
So, what did we test? How did they perform? And which one should you consider (if you’re lucky enough to afford either)? Let’s get right into it.
Quick Specs Showdown: H200 vs. H100
Both GPUs are built on NVIDIA’s Hopper architecture, which debuted with the H100 and continues with the H200. While they share the same DNA, the H200 ups the game on memory and bandwidth.

So on paper, H200 is clearly a beast — faster memory, higher bandwidth, more room to breathe for large models. But does it live up to the hype? Let’s look at the real-world tests.
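For reference, the headline memory figures below come from NVIDIA’s public datasheets for the SXM variants (treat the exact numbers as approximate and check the current spec sheets); a quick ratio check shows where the H200’s advantage sits:

```python
# Headline memory specs from NVIDIA's public datasheets (SXM variants);
# figures are approximate and should be verified against current spec sheets.
h100 = {"memory_gb": 80, "bandwidth_tb_s": 3.35}   # HBM3
h200 = {"memory_gb": 141, "bandwidth_tb_s": 4.8}   # HBM3e

print(f"Memory: {h200['memory_gb'] / h100['memory_gb']:.2f}x")          # 1.76x
print(f"Bandwidth: {h200['bandwidth_tb_s'] / h100['bandwidth_tb_s']:.2f}x")  # 1.43x
```

So the H200 carries roughly 1.76x the memory and 1.43x the bandwidth; compute throughput is broadly similar since both are Hopper parts.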
Test 1: Inferencing the Qwen3-8B LLM
We used the Qwen3-8B model, ran it for 100 inference iterations, and measured the wall-clock time on both GPUs, keeping everything else identical, from code to configuration.
Code used:
from transformers import AutoModelForCausalLM, AutoTokenizer
from transformers import modeling_utils
import time

# Workaround for a transformers version where ALL_PARALLEL_STYLES is unset
if not hasattr(modeling_utils, "ALL_PARALLEL_STYLES") or modeling_utils.ALL_PARALLEL_STYLES is None:
    modeling_utils.ALL_PARALLEL_STYLES = ["tp", "none", "colwise", "rowwise"]

model_name = "Qwen/Qwen3-8B"

# Load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# Prepare the model input
prompt = "Give me a short introduction to large language model. Keep it very short, 1 line only"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
    enable_thinking=True  # Switches between thinking and non-thinking modes. Default is True.
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Conduct text completion, timing 100 iterations
start = time.time()
for _ in range(100):
    generated_ids = model.generate(
        **model_inputs,
        max_new_tokens=100
    )
end = time.time()
print(f"100 iterations in {end - start:.2f} seconds")
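One caveat on this style of timing: CUDA kernels launch asynchronously, so wall-clock measurements on GPU should bracket the timed region with torch.cuda.synchronize(). Generation loops typically sync with the host each step, so the numbers above are likely fine, but as a general pattern a small helper like this sketch (using a cheap matmul as a stand-in workload) is safer:

```python
import time
import torch

def timed(fn):
    """Run fn once and return (result, elapsed seconds). Synchronizes the
    GPU before and after so queued CUDA kernels are fully counted."""
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    start = time.time()
    result = fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return result, time.time() - start

# Cheap stand-in workload (a matmul) in place of model.generate()
x = torch.randn(256, 256)
_, elapsed = timed(lambda: x @ x)
print(f"elapsed: {elapsed:.4f}s")
```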
The results?

The H200 was ~2.5x faster than the H100, significantly better than the 30%-50% improvement the spec sheet would suggest.
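Put differently, a wall-clock speedup maps directly onto tokens-per-second throughput. A quick sketch with placeholder timings (hypothetical numbers chosen to mirror the ~2.5x ratio, not our measured results):

```python
iterations, max_new_tokens = 100, 100

def throughput(elapsed_s, iters=iterations, toks=max_new_tokens):
    """Upper-bound tokens/sec, assuming every run emits max_new_tokens."""
    return iters * toks / elapsed_s

# Hypothetical timings for illustration only
t_h100, t_h200 = 500.0, 200.0
print(f"H100: {throughput(t_h100):.0f} tok/s, H200: {throughput(t_h200):.0f} tok/s")
print(f"speedup: {t_h100 / t_h200:.1f}x")  # 2.5x
```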
Test 2: Summarizing 100 Articles with T5-Large
Next, we ran 100 inference iterations with Google’s T5-Large, a 770M-parameter model. Smaller than Qwen3-8B, but still no joke.
Code used:
from transformers import pipeline
import torch
import time

# Detect device
device = 0 if torch.cuda.is_available() else -1
print("Device:", "GPU" if device == 0 else "CPU")

# Load summarizer
summarizer = pipeline("summarization", model="t5-large", device=device)

# Create dummy "articles" (100 repetitive samples)
fake_article = "The quick brown fox jumps over the lazy dog. " * 30
articles = [fake_article for _ in range(100)]

# Run summarization in batches
batch_size = 32
summaries = []
start = time.time()
print("Starting summarization")
for i in range(0, len(articles), batch_size):
    batch = articles[i:i + batch_size]
    result = summarizer(batch, do_sample=False)
    summaries.extend(result)
end = time.time()

device_label = torch.cuda.get_device_name(0) if device == 0 else "CPU"
print(f"Summarized {len(articles)} articles in {end - start:.2f} seconds using {device_label}")

A more modest 1.3x gain, roughly in line with NVIDIA’s own guidance. For memory-light tasks, the H200 still edges ahead, thanks to its faster memory bandwidth and larger HBM3e capacity.
Test 3: Fine-Tuning DistilBERT
In a last experiment, we fine-tuned DistilBERT on 7.5K records for 5 epochs, and got some mind-boggling results.
Code used:
import time
import numpy as np
import pandas as pd
import torch
import evaluate
from datasets import Dataset, DatasetDict
from sklearn.model_selection import train_test_split
from transformers import (
    AutoTokenizer,
    AutoModelForSequenceClassification,
    DataCollatorWithPadding,
    TrainingArguments,
    Trainer,
)

checkpoint = "distilbert-base-uncased-finetuned-sst-2-english"  # Pre-trained model we will be using
classifier = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)  # Get the classifier
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Load the training data
df = pd.read_csv("train.csv")
print("dataset size", len(df))
df = df.loc[:, ["text", "target"]]

# Stratified splitting
df_train, df_eval = train_test_split(df, train_size=0.8, stratify=df.target, random_state=42)

raw_datasets = DatasetDict({
    "train": Dataset.from_pandas(df_train),
    "eval": Dataset.from_pandas(df_eval)
})
tokenized_datasets = raw_datasets.map(lambda batch: tokenizer(batch["text"], truncation=True), batched=True)
tokenized_datasets = tokenized_datasets.remove_columns(["text", "__index_level_0__"])
tokenized_datasets = tokenized_datasets.rename_column("target", "labels")

# Padding for batches of data that will be fed into the model during training
data_collator = DataCollatorWithPadding(tokenizer=tokenizer)

# Training args
training_args = TrainingArguments("test-trainer", num_train_epochs=5,
                                  weight_decay=5e-4, save_strategy="no", report_to="none")

# Metric for validation error
def compute_metrics(eval_preds):
    metric = evaluate.load("glue", "mrpc")  # F1 and Accuracy
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Define trainer
trainer = Trainer(
    classifier,
    training_args,
    train_dataset=tokenized_datasets["train"],
    eval_dataset=tokenized_datasets["eval"],
    data_collator=data_collator,
    tokenizer=tokenizer,
    compute_metrics=compute_metrics,
)

# Start the fine-tuning
start = time.time()
trainer.train()
end = time.time()
print("Training time using {}".format(torch.cuda.get_device_name(0)), end - start)
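One note on the metrics helper: evaluate.load("glue", "mrpc") fetches a metric script from the Hub at runtime. If you’d rather avoid that dependency, an equivalent offline version with scikit-learn (an alternative sketch, not the code used in our runs) looks like this:

```python
import numpy as np
from sklearn.metrics import accuracy_score, f1_score

def compute_metrics_offline(eval_preds):
    """Accuracy and F1 without a Hub download (assumes binary labels)."""
    logits, labels = eval_preds
    predictions = np.argmax(logits, axis=-1)
    return {
        "accuracy": accuracy_score(labels, predictions),
        "f1": f1_score(labels, predictions),
    }

# Sanity check on fake logits: predictions come out as [1, 0, 1]
fake_logits = np.array([[0.1, 0.9], [0.8, 0.2], [0.3, 0.7]])
fake_labels = np.array([1, 0, 0])
print(compute_metrics_offline((fake_logits, fake_labels)))
```

Passing compute_metrics_offline to the Trainer in place of compute_metrics should behave identically for this two-class setup.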
Unexpectedly, the H200 lagged big time here.

Wait, what?
Yes, H100 was 3× faster in this test. That’s right — the newer GPU lost. This can feel counterintuitive, but here’s the deal:
Possible Reasons:
- Software stack lag: the fine-tuning libraries used in this code might not be fully optimized for the H200 yet.
- Model size mismatch: DistilBERT is tiny. H200 is like using a nuke to light a candle.
Still, this was highly unexpected.
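When debugging a regression like this, a sensible first step is checking what the software stack actually sees; a minimal check with PyTorch (version strings will vary by install):

```python
import torch

# Report the framework/CUDA versions and the visible device;
# mismatched driver, CUDA, or library versions are a common cause
# of underperformance on newer GPUs like the H200.
print("PyTorch:", torch.__version__)
print("CUDA build:", torch.version.cuda)
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("Compute capability:", torch.cuda.get_device_capability(0))  # Hopper reports (9, 0)
else:
    print("No CUDA device visible")
```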
So… Which One Should You Use?
Neither, if you don’t plan to fine-tune or pre-train LLMs and are happy running medium-sized LLMs with a slight speed lag.
Go with H100 if:
- You’re fine-tuning medium or small models like BERT, RoBERTa, or DistilBERT.
- You want a more widely available and slightly more cost-effective GPU.
- Your codebase or framework isn’t yet H200-optimized.
Go with H200 if:
- You’re running inference on 30B+ LLMs, or pretraining your own GPT.
- You deal with huge context windows and memory-heavy tasks.
- You’re optimizing for batch throughput, not single-instance latency.
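That last point, batch throughput versus single-instance latency, can be illustrated with a toy cost model (the constants are assumed, purely for illustration): a fixed per-step overhead gets amortized across the batch, so throughput rises even as per-request latency grows.

```python
def step_time(batch_size, fixed_overhead=0.05, per_item=0.01):
    """Toy model: each forward step pays a fixed overhead plus a per-item cost."""
    return fixed_overhead + per_item * batch_size

for bs in (1, 8, 32):
    t = step_time(bs)
    print(f"batch={bs:2d}  latency={t:.2f}s  throughput={bs / t:.1f} items/s")
```

With these made-up constants, going from batch 1 to batch 32 roughly quintuples throughput while latency grows about sixfold, which is exactly the trade a big-memory GPU lets you make at larger scales.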
Final Thoughts: The Real MVP?
The H200 is an absolute monster on paper — and it delivers when it counts: large-model inference and memory-intensive pretraining. But unless your task is scaled to match its muscle, you might be better off with the tried-and-tested H100.
And hey, software maturity matters. Drivers, CUDA versions, PyTorch/XLA backends — all these things can drag down performance, no matter how good your silicon is.
Tested NVIDIA H200 vs H100 GPUs for AI: The Winner Will Surprise You was originally published in Data Science in Your Pocket on Medium.