Tiny-R1 has just 5% of DeepSeek-R1's parameters
DeepSeek-R1 took the world by storm a few days back. Being both very capable and open source, it quickly became everyone's first choice. Unfortunately, even though the weights are open, most people still can't run the model locally.
Why? It's huge: 671B parameters, to be precise.
For anyone with modest hardware, running DeepSeek-R1 is still a dream. But not any longer: a new model, Tiny-R1, with just 32B parameters (about 5% of DeepSeek-R1's total), has almost matched its performance on major benchmarks.
What is Tiny-R1 32B?
The Tiny-R1-32B-Preview model by Qihoo360 is a first-generation reasoning model designed to deliver near-R1 performance while using only about 5% of the parameters of the full R1 model. The model is optimized using SuperDistillation and outperforms several larger models, such as Deepseek-R1-Distill-Llama-70B, particularly on math, coding, and science tasks.
What is SuperDistillation?
SuperDistillation is a technique that refines knowledge distillation (transferring the knowledge of large models to smaller ones). While traditional distillation trains a smaller model (the student) to replicate the behavior of a larger, pre-trained model (the teacher), superdistillation enhances this by transferring finer-grained knowledge, such as internal representations or intermediate features, in addition to the final outputs. This leads to more efficient and effective student models.
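As a rough illustration of the idea (not Qihoo360's actual training code), a distillation loss combining soft teacher outputs with an intermediate-feature matching term might look like the sketch below; the temperature and weighting values are arbitrary assumptions:

```python
import numpy as np

def softmax(logits, T=1.0):
    """Temperature-softened softmax (numerically stable)."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits,
                      student_feats, teacher_feats,
                      T=2.0, alpha=0.5):
    """Soft-label KL term plus an intermediate-feature matching term.

    The feature term is the 'finer-grained' signal that distinguishes
    this from plain output-only distillation. T and alpha are
    illustrative hyperparameters, not published values.
    """
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    # KL(teacher || student) on temperature-softened distributions
    kl = np.sum(p_t * (np.log(p_t + 1e-9) - np.log(p_s + 1e-9)), axis=-1).mean()
    # Match intermediate representations (hidden states) with MSE
    feat = np.mean((student_feats - teacher_feats) ** 2)
    # T*T rescales the soft-label gradient, as in standard distillation
    return alpha * kl * T * T + (1 - alpha) * feat
```

In a real training loop, the feature term would compare hidden states from chosen teacher and student layers (via a projection if their widths differ), and the whole loss would be minimized with gradient descent.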
Key Features:
- Performance: Tiny-R1–32B-Preview achieves high scores across various domains:

Math:
- Tiny-R1-32B-Preview (78.1) is very close to Deepseek-R1 (79.8), though slightly lower.
- Both Deepseek-R1-Distill models lag behind, at 72.6 (Qwen-32B) and 70.0 (Llama-70B).

Coding:
- Tiny-R1-32B-Preview (61.6) outperforms both Deepseek-R1-Distill models (57.2 and 57.5).
- Deepseek-R1 shows the highest performance in this domain (65.9).

Science:
- Tiny-R1-32B-Preview (65.0) is quite competitive with Deepseek-R1-Distill-Llama-70B (65.2), but still falls behind Deepseek-R1 (71.5).
How was Tiny-R1 trained?
Base Model Selection:
- The team started with Deepseek-R1-Distill-Qwen-32B, a large pretrained model.
Supervised Fine-Tuning (SFT):
- They applied Supervised Fine-Tuning (SFT) to adapt the model to three specific domains: Mathematics, Code, and Science.
- This involves training the model on domain-specific data to specialize it for each task.
360-LLaMA-Factory Framework:
- The fine-tuning was done using the 360-LLaMA-Factory training framework, which is designed to efficiently train large models on specialized tasks.
Using Open-Source Data:
- For each domain, open-source data was used as seeds (starting points).
- These seeds consisted of questions in Math, Code, and Science to help the model learn task-specific knowledge.
Generating Responses with Deepseek-R1:
- Deepseek-R1 itself was used to generate responses to the seed questions in each domain (Math, Code, and Science), producing high-quality training data.
Creating Specialized Models:
- From this, three specialized models were created, one for each domain: Math Model, Code Model, and Science Model.
Combining Models Using Mergekit:
- The team then used the Mergekit tool (developed by the Arcee team) to combine these three specialized models into one unified model.
Creating Tiny-R1–32B-Preview:
- The final result was Tiny-R1–32B-Preview, a compact model that demonstrates strong performance across all three domains.
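The merge step above can be sketched as a Mergekit config. The model paths, merge method, and parameters below are illustrative assumptions, not the team's published recipe:

```yaml
# Hypothetical mergekit config: paths, method, and parameters are assumptions
models:
  - model: ./tinyr1-math      # SFT'd math specialist (hypothetical path)
  - model: ./tinyr1-code      # SFT'd code specialist (hypothetical path)
  - model: ./tinyr1-science   # SFT'd science specialist (hypothetical path)
merge_method: ties            # one of mergekit's supported merge methods
base_model: deepseek-ai/DeepSeek-R1-Distill-Qwen-32B
parameters:
  density: 0.5
  weight: 1.0
dtype: bfloat16
```

Mergekit then combines the specialists' weights into a single checkpoint, so one 32B model retains the strengths of all three domain experts.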
How to use Tiny-R1?
The model is open-sourced, and the weights are available on Hugging Face at qihoo360/TinyR1-32B-Preview.
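A minimal way to try it locally, assuming the standard Hugging Face transformers loading pattern (note: a 32B model still needs roughly 64 GB of memory in bf16, so quantization or a multi-GPU setup may be required; the prompt and generation settings below are illustrative):

```python
# Sketch: loading TinyR1-32B-Preview with Hugging Face transformers.
# Assumes `pip install transformers accelerate torch`.
MODEL_ID = "qihoo360/TinyR1-32B-Preview"

def generate(prompt: str, max_new_tokens: int = 512) -> str:
    # Import inside the function so the sketch can be read without the libraries installed
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
    model = AutoModelForCausalLM.from_pretrained(
        MODEL_ID, torch_dtype="auto", device_map="auto"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Decode only the newly generated tokens, skipping the prompt
    new_tokens = output_ids[0][inputs["input_ids"].shape[1]:]
    return tokenizer.decode(new_tokens, skip_special_tokens=True)

if __name__ == "__main__":
    print(generate("How many positive divisors does 360 have?"))
```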
Hope you try out the model
Tiny-R1: 32B model achieves DeepSeek-R1 performance was originally published in Data Science in your pocket on Medium.