Google DeepMind’s C2S-Scale 27B: Teaching AI the Language of Cells to Crack Cancer’s Code

Google DeepMind, Google Research, and Yale University announced a watershed moment in AI-driven scientific discovery: the Cell2Sentence-Scale 27B (C2S-Scale) foundation model, a 27-billion-parameter AI system that generated a novel hypothesis about cancer cellular behavior which was then experimentally validated in living human cells. This marks the first time an AI model has not merely analyzed existing biological data but produced a new, testable scientific hypothesis that led to a potential cancer therapy pathway.

Google CEO Sundar Pichai called it “an exciting milestone for AI in science,” emphasizing that with further preclinical and clinical validation, such AI-generated discoveries could accelerate the development of new cancer treatments.​

The Cell2Sentence Framework: Translating Biology Into Language

Core Innovation: Cells That Speak

The revolutionary insight behind C2S-Scale lies in the Cell2Sentence (C2S) framework, which transforms complex biological data into a format that language models can natively understand. Here’s how it works:​

From Molecules to Sentences:

Single-cell RNA sequencing (scRNA-seq) captures the expression levels of thousands of genes within individual cells. The C2S framework converts this high-dimensional expression data into ordered sequences of gene names — termed “cell sentences” — where genes are ranked by their expression levels from highest to lowest.​

Example Cell Sentence:

MALAT1 TMSB4X B2M EEF1A1 H3F3B ACTB FTL RPL13 ...

This space-separated string represents a cell’s molecular identity in a format readable by large language models. By treating gene expression as language, the model can leverage natural language processing capabilities to understand cellular states, contexts, and behaviors.​
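
To make the conversion concrete, here is a minimal sketch of the ranking step, assuming a dense expression vector and a matching list of gene names (the function and variable names are illustrative, not part of the released tooling):

import numpy as np

def to_cell_sentence(expression, gene_names, top_k=1000):
    # Rank genes by expression, highest first, and keep only expressed genes
    order = np.argsort(expression)[::-1]
    expressed = [i for i in order if expression[i] > 0][:top_k]
    return " ".join(gene_names[i] for i in expressed)

# Toy example: four genes, three of them expressed
genes = ["MALAT1", "ACTB", "B2M", "GAPDH"]
counts = np.array([120.0, 15.0, 48.0, 0.0])
print(to_cell_sentence(counts, genes))  # -> MALAT1 B2M ACTB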

Why This Matters

Traditional computational biology models struggle with the combinatorial complexity of cellular interactions. By representing cells as sentences, C2S-Scale can apply the same conditional reasoning, pattern recognition, and contextual understanding that make LLMs successful in natural language tasks, now applied to biological reasoning.

Architecture and Technical Specifications

Building on Gemma 2

C2S-Scale 27B is built upon Google’s Gemma-2 27B architecture, a decoder-only transformer model from the Gemma family of lightweight, state-of-the-art open LLMs. The model inherits Gemma 2’s robust architecture while being extensively fine-tuned for single-cell biology applications.​

Key Technical Specifications:

  • Parameters: 27 billion
  • Base Architecture: Gemma-2 27B (decoder-only transformer)
  • Training Infrastructure: Google TPU v5 pods
  • Training Framework: JAX
  • License: CC-BY-4.0 (open weights)
  • Training Data: Over 57 million cells from 800+ datasets
  • Data Sources: CellxGene and Human Cell Atlas
  • Training Corpus: Over 1 billion tokens of transcriptomic data, biological text, and metadata​

The Power of TPU v5 Infrastructure

The model was trained on Google TPU v5 pods, which enabled unprecedented scaling in both model size and capability. TPUs (Tensor Processing Units) are custom ASICs designed by Google specifically for the matrix operations required by large neural networks, offering exceptional throughput on transformer layers and deep integration with JAX.

This infrastructure choice was critical: TPU v5p delivers market-leading performance for training large language models and dense transformer networks, with a single Google Cloud pod able to support models beyond 500 billion parameters.

Scaling Laws in Biology

The research builds on earlier findings that biological models follow clear scaling laws — just like natural language models, larger biological models perform better. The critical question the team addressed: Does scaling just improve existing tasks, or does it unlock emergent capabilities?​

The answer proved transformative: C2S-Scale 27B demonstrated conditional reasoning abilities that smaller models could not achieve, specifically the capacity to identify context-dependent biological effects.

The Cancer Breakthrough: From Cold Tumors to Hot

The Challenge: Invisible Tumors

A major obstacle in cancer immunotherapy is that many tumors are “cold” — they remain invisible to the body’s immune system, allowing cancer cells to proliferate unchecked. A key strategy to make tumors “hot” and detectable is forcing them to display immune-triggering signals through antigen presentation.​

The AI’s Mission: Finding a Conditional Amplifier

The research team gave C2S-Scale 27B a sophisticated task: identify a drug that acts as a conditional amplifier — one that would boost immune signals only in a specific “immune-context-positive” environment where low levels of interferon (a key immune-signaling protein) were present but insufficient to trigger antigen presentation alone.​

This required emergent conditional reasoning capabilities that appeared only at scale; smaller models could not resolve this context-dependent effect.​

The Dual-Context Virtual Screen

To identify such drugs, researchers designed a dual-context virtual screening approach with two stages:

Stage 1 — Immune-Context-Positive:
Real-world patient samples with intact tumor-immune interactions and low-level interferon signaling​

Stage 2 — Immune-Context-Neutral:
Isolated cell line data with no immune context​

The model simulated the effects of over 4,000 drugs across both contexts and predicted which drugs would boost antigen presentation only in the patient-relevant immune-positive context.​
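
Conceptually, the screen boils down to scoring every drug in both contexts and keeping candidates whose predicted effect appears only in the immune-positive setting. A simplified sketch, where predict_mhc1_boost is a hypothetical stand-in for querying the model's perturbation predictions (the function name, context labels, and thresholds are all assumptions for illustration):

def dual_context_screen(drugs, predict_mhc1_boost, min_boost=0.5, max_neutral=0.1):
    # Keep drugs predicted to boost antigen presentation only in the
    # immune-context-positive setting (the desired "context split")
    hits = []
    for drug in drugs:
        positive = predict_mhc1_boost(drug, context="immune_positive")
        neutral = predict_mhc1_boost(drug, context="immune_neutral")
        if positive >= min_boost and abs(neutral) <= max_neutral:
            hits.append((drug, positive - neutral))
    return sorted(hits, key=lambda hit: -hit[1])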

The Discovery: Silmitasertib (CX-4945)

Among the drug candidates, C2S-Scale identified a striking “context split” for silmitasertib (CX-4945), a CK2 (casein kinase 2) inhibitor. The model predicted:​

  • Strong increase in antigen presentation when silmitasertib was applied in the immune-context-positive setting
  • Little to no effect in the immune-context-neutral condition​

What Made This Novel?

Although CK2 has been implicated in various cellular functions, including immune system modulation, inhibiting CK2 via silmitasertib had never been reported in the literature to enhance MHC-I expression or antigen presentation. The AI generated a genuinely new, testable hypothesis rather than restating known facts.

Of the drug hits identified, only 10–30% were already known from prior literature; the rest were “surprising hits” with no previously reported connection to the screening objective.

Laboratory Validation: AI Prediction Meets Reality

From In Silico to In Vitro

The true test of any scientific hypothesis is experimental validation. The team tested the AI’s prediction in human neuroendocrine cell models — a cell type the model had never encountered during training.​

The Experimental Results

The laboratory experiments demonstrated:

  1. Silmitasertib alone: No effect on antigen presentation (MHC-I)
  2. Low-dose interferon alone: Modest effect on antigen presentation
  3. Silmitasertib + low-dose interferon: Marked, synergistic amplification of antigen presentation

The Outcome:
The combination produced an approximately 50% increase in antigen presentation, making tumors significantly more visible to the immune system.

The model’s in silico prediction was confirmed multiple times in vitro, validating that C2S-Scale had successfully identified a novel, interferon-conditional amplifier.​

Clinical Significance

This discovery reveals a promising new pathway for converting “cold” tumors into “hot” ones, potentially making them more responsive to immunotherapy. While this represents an early first step, it provides a powerful, experimentally validated lead for developing combination therapies that use multiple drugs in concert to achieve more robust effects.

Silmitasertib is already being evaluated in clinical trials for various cancer types, including combination therapies with gemcitabine and cisplatin.

Model Capabilities and Applications

Core Performance Features

C2S-Scale 27B excels across diverse single-cell and multi-cell analysis tasks:

Predictive Tasks:

  • Cell type prediction: Identifying cell types based on gene expression profiles
  • Tissue classification: Determining tissue origins of cells
  • Perturbation prediction: Forecasting how cells respond to drugs or genetic modifications
  • Biomarker discovery: Identifying gene patterns marking specific cell states or diseases​

Generative Tasks:

  • Cell generation: Creating realistic single-cell gene expression profiles under specific conditions
  • In silico experiments: Simulating cellular responses to test biological hypotheses
  • Cluster captioning: Generating natural language descriptions of cell populations​

Advanced Reasoning:

  • Biological question answering: Responding to complex biological queries
  • Context-dependent analysis: Understanding conditional effects across different cellular environments
  • Multicellular reasoning: Synthesizing information across multiple cellular contexts​
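
As a concrete illustration, a perturbation-prediction query can be phrased as a natural-language prompt over a cell sentence. The wording below is an assumption for illustration; the exact task templates are documented with the model release:

# Hypothetical perturbation-prediction prompt (phrasing is illustrative)
cell_sentence = "MALAT1 TMSB4X B2M EEF1A1 ..."
prompt = (
    "The following is a list of gene names ordered by descending expression "
    "level in a Homo sapiens cell. Predict the gene expression of this cell "
    f"after treatment with silmitasertib.\n\nCell sentence: {cell_sentence}"
)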

Real-World Applications

For Researchers:

  • Drug Discovery: Virtual screening of thousands of compounds for specific biological effects
  • Cell Atlas Annotation: Streamlining annotation of large-scale single-cell datasets
  • Hypothesis Generation: Creating testable predictions about cellular behavior
  • Biomarker Identification: Discovering diagnostic or therapeutic targets​

For Healthcare:

  • Personalized Medicine: Predicting patient-specific responses to treatments
  • Cancer Immunotherapy: Identifying strategies to enhance immune responses
  • Disease Mechanism Discovery: Uncovering molecular pathways underlying diseases

Training Data and Methodology

Massive-Scale Dataset

The model was trained on an unprecedented corpus:

  • Total cells: Over 57 million human and mouse cells
  • Datasets: 800+ public scRNA-seq datasets from CellxGene and Human Cell Atlas
  • Data diversity: Broad range of tissues, cell types, and experimental conditions
  • Training tokens: Over 1 billion tokens combining transcriptomic data, biological text, and metadata​

Instruction Fine-Tuning Approach

The model underwent instruction fine-tuning using the Cell2Sentence framework. This process involved:​

  1. Converting scRNA-seq expression data into gene token sequences
  2. Creating task-specific prompts for various biological analyses
  3. Training the model to generate appropriate responses
  4. Applying modern reinforcement learning techniques for targeted fine-tuning​
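
A minimal sketch of what a single instruction-tuning record might look like under this framework (the field names and wording are assumptions, not the released format):

# Hypothetical instruction-tuning record built from a cell sentence
example = {
    "instruction": "Identify the cell type of the following Homo sapiens cell.",
    "input": "Cell sentence: MALAT1 TMSB4X B2M EEF1A1 ...",
    "output": "CD8-positive T cell",
}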

Preprocessing Pipeline

Standard preprocessing steps ensure data quality:

  • Filtering cells with fewer than 200 expressed genes
  • Filtering genes expressed in fewer than 200 cells
  • Quality control based on mitochondrial gene counts
  • Removing low-quality cells with excessive counts or high mitochondrial transcript percentages​
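
These steps correspond to a standard scanpy workflow. A sketch, assuming raw counts stored in an AnnData file; the total-count and mitochondrial-percentage cutoffs are illustrative assumptions, since the article does not specify them:

import scanpy as sc

adata = sc.read_h5ad("raw_counts.h5ad")  # assumed input file

# Filter low-information cells and genes (thresholds from the list above)
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=200)

# Flag mitochondrial genes and compute per-cell QC metrics
adata.var["mt"] = adata.var_names.str.startswith("MT-")
sc.pp.calculate_qc_metrics(adata, qc_vars=["mt"], inplace=True)

# Drop low-quality cells; the cutoffs below are illustrative
adata = adata[(adata.obs["total_counts"] < 50_000)
              & (adata.obs["pct_counts_mt"] < 20)].copy()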

How to Use C2S-Scale 27B

# pip install accelerate transformers sentencepiece
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model from the Hugging Face Hub.
# bfloat16 halves the memory footprint; a 27B model still needs a large GPU.
model_id = "vandijklab/C2S-Scale-Gemma-2-27B"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16
).to(device)

# A cell sentence: gene names ordered by descending expression
cell_sentence = "MALAT1 TMSB4X B2M EEF1A1 H3F3B ACTB FTL RPL13 ..."
num_genes = 1000
organism = "Homo sapiens"

# Format the cell-type prediction prompt
prompt = f"""The following is a list of {num_genes} gene names ordered by descending expression level in a {organism} cell. Your task is to give the cell type which this cell belongs to based on its gene expression.

Cell sentence: {cell_sentence}.

The cell type corresponding to these genes is:"""

# Tokenize, generate, and decode
inputs = tokenizer(prompt, return_tensors="pt").to(device)
outputs = model.generate(**inputs, max_new_tokens=20)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)

# The decoded output echoes the prompt; keep only the completion
predicted_cell_type = response.split("The cell type corresponding to these genes is:")[-1].strip()
print(f"Predicted Cell Type: {predicted_cell_type}")

Ongoing Research and Future Directions

Current Investigations

Teams at Yale University are actively:

  • Exploring the molecular mechanisms uncovered by the silmitasertib-interferon discovery
  • Testing additional AI-generated predictions in other immune contexts
  • Conducting further preclinical validation studies
  • Preparing pathways toward clinical trials​

Potential Extensions

  1. Model Fine-Tuning:
    C2S-Scale serves as a powerful pretrained foundation that can be fine-tuned for specialized, domain-specific single-cell analysis tasks using proprietary or disease-specific datasets.​
  2. Virtual Cell Development:
    The approach paves the way for developing “virtual cells” — comprehensive computational models that can simulate cellular behavior under arbitrary conditions, accelerating drug discovery and disease research.​
  3. Multi-Omic Integration:
    Future versions could integrate additional biological data types (proteomics, metabolomics, epigenomics) to create even more comprehensive cellular models.​

Limitations and Considerations

Current Constraints

  1. Training Data Scope:
    The model’s knowledge is limited to genes, cell types, and conditions present in the training data (57 million cells from public datasets). Performance on completely novel cell types or technologies requires validation.​
  2. Prompt Formatting:
    Performance is not guaranteed when input prompts deviate significantly from the training prompt formats. Users should follow documented formatting guidelines.​
  3. Out-of-Distribution Generalization:
    While the silmitasertib discovery demonstrated impressive generalization to unseen cell types, systematic evaluation of out-of-distribution performance remains an active research area.​

Intended Use

The model is designed for:

  • Research in single-cell genomics and computational biology
  • As a foundation for specialized biological domain models
  • Aiding annotation and interpretation of large-scale scRNA-seq experiments

Not intended for: Direct clinical use without further validation and regulatory approval.

