What is meant by open-sourced LLMs?
Since the rise of ChatGPT, the AI world has split into two tribes.
- On one side: closed, paid models like GPT-4, Claude, and Gemini, which perform impressively but live behind corporate walls.
- On the other: open-weight challengers like DeepSeek-R1, GLM-4.6, Mistral, and Falcon. They promise freedom, transparency, and community-driven innovation.
But here’s the catch: most of what’s called “open” in this space isn’t truly open. The term “open source” gets thrown around so loosely that it has practically lost its meaning.
What Does “Open Source” Actually Mean?
In traditional software, open source means full transparency. You can read the code, modify it, redistribute it, and even use it in your own projects, usually under a recognized license like MIT, Apache, or GPL.
For large language models, it’s more complicated. A truly open-source LLM would mean:
- The model weights (the trained parameters) are available for download.
- The architecture and training code are publicly accessible.
- The dataset (or at least a clear description of it) is disclosed.
- The training process, hyperparameters, and evaluation benchmarks are open.
Only a handful of models actually meet all these conditions. Most stop halfway, releasing only the weights, or just the architecture, but not the data or code.
Open Weights ≠ Open Source
This is where the confusion usually begins. Companies love to say their models are “open” when they’re really open-weights, not open-source.
- Open weights: You get the trained model parameters. You can run, fine-tune, or deploy the model (depending on the license). But you can’t retrain it from scratch, because the data and full training code aren’t available.
- Open source: You get everything, from weights and data to architecture, scripts, and configs. You can replicate, audit, or even improve the model independently.
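The three buckets above can be sketched as a tiny classifier. This is an illustrative toy, not a real audit tool: the `ModelRelease` schema and its fields are assumptions invented for this example, and real releases are messier than four booleans.

```python
from dataclasses import dataclass

@dataclass
class ModelRelease:
    """What a lab has actually published for a model (hypothetical schema)."""
    weights: bool = False             # trained parameters downloadable
    training_code: bool = False       # architecture + training scripts public
    data: bool = False                # dataset, or a clear description of it, disclosed
    permissive_license: bool = False  # e.g. MIT or Apache 2.0

def openness_level(release: ModelRelease) -> str:
    """Classify a release into the article's three buckets."""
    if (release.weights and release.training_code
            and release.data and release.permissive_license):
        return "open source"
    if release.weights:
        return "open weights"
    return "closed"

# Rough characterizations of the releases discussed above:
print(openness_level(ModelRelease()))                                   # GPT-4-style
print(openness_level(ModelRelease(weights=True,
                                  permissive_license=True)))            # Mistral-7B-style
print(openness_level(ModelRelease(weights=True, training_code=True,
                                  data=True, permissive_license=True))) # GPT-Neo-style
```

The point of the strict `and` chain is that one missing ingredient, usually the data, is enough to knock a release out of the “open source” bucket.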
When Mistral released its 7B model, for example, it shared the weights and allowed full commercial use under Apache 2.0. That’s generous, but still not open source in the strict sense. Compare that to early projects like EleutherAI’s GPT-Neo or GPT-J, which made both code and datasets public. Those were genuinely open in spirit.
The Role of Licenses
Licenses are the fine print that decides what “open” really allows you to do.
Here’s how they differ in spirit and restriction:
- MIT License: The simplest and most permissive. You can do almost anything with the software (use, modify, distribute, sell) as long as you include attribution.
- Apache 2.0 License: Also permissive, but with some added legal clarity around patents and contribution rules. It’s popular among AI companies because it’s business-friendly.
- GPL (GNU General Public License): Copyleft in nature. If you modify and distribute the code, you must release your changes under the same license. It keeps derivatives open.
- Custom AI Licenses (like Meta’s Llama License or Stability AI’s terms): Often pretend openness: you can download and experiment, but commercial usage or large-scale deployment is restricted. They’re open enough to attract attention, but closed enough to protect business interests.
So when someone says “our model is open,” your first question should be: under what license?
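As a rough illustration of why that question matters, the licenses above can be summarized in a lookup table. This is a deliberate simplification, not legal advice: the license identifiers and fields are chosen for this example, and custom AI licenses in particular have terms that do not reduce to a few flags.

```python
# Simplified summary of the licenses discussed above (not legal advice).
LICENSES = {
    "MIT":        {"commercial": True,         "copyleft": False},
    "Apache-2.0": {"commercial": True,         "copyleft": False},
    "GPL-3.0":    {"commercial": True,         "copyleft": True},
    # Custom AI licenses vary; "restricted" flags the ones with usage caveats.
    "Llama":      {"commercial": "restricted", "copyleft": False},
}

def can_use_commercially(license_id: str) -> bool:
    """True only when the license clearly permits unrestricted commercial use.

    Unknown or restricted licenses return False, on the principle that
    "open" claims should be verified, not assumed.
    """
    return LICENSES.get(license_id, {}).get("commercial") is True

print(can_use_commercially("Apache-2.0"))
print(can_use_commercially("Llama"))
print(can_use_commercially("SomeCustomLicense"))
```

Defaulting unknown licenses to False mirrors the article’s advice: until you have read the license, treat the model as closed.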
Why Companies Avoid Full Openness
Full openness sounds noble, but it’s rarely practical for commercial labs. There are real reasons:
- Data privacy: Many training datasets contain copyrighted or personal material.
- Security risks: Completely open models can be fine-tuned for malicious use.
- Business protection: Open sourcing everything means competitors can instantly clone your work.
So companies compromise. They open weights to gain community goodwill, while keeping training data and pipelines secret. It’s transparency with a marketing filter.
Why Open-Weights Still Matter
Even if they aren’t fully open source, open-weights models are still crucial. They’ve democratized access to high-quality AI. Researchers can fine-tune them for niche use cases. Startups can build local, private systems without relying on API calls to big tech. And regulators, at least theoretically, can audit their behavior more directly.
Without open-weights releases, the entire ecosystem would depend on a handful of paid APIs. The open community (Hugging Face, Together AI, EleutherAI, and MosaicML before its merger into Databricks) keeps that monopoly in check.
The Honest Definition
So, to clear the fog:
- Closed models (GPT-4, Claude, Gemini) keep everything private: weights, data, and code.
- Open-weights models (Mistral, Llama, DeepSeek-R1) share trained weights but restrict data or code.
- Truly open-source models (like early GPT-Neo or RedPajama) share the entire pipeline and license it permissively.
Most of the AI world today sits in that murky middle ground: “open enough to use, closed enough to control.”
Is every Open-Sourced LLM Truly Open-Sourced? NO was originally published in Data Science in Your Pocket on Medium, where people are continuing the conversation by highlighting and responding to this story.