Tahoe Therapeutics, Arc Institute, & Biohub Join Forces to Create the Largest Perturbation Dataset
In a landmark collaboration aimed at accelerating the future of AI-driven biology, Tahoe Therapeutics, Arc Institute, and Biohub have each committed multi-million-dollar investments to close one of the most critical gaps in modern computational biology: large-scale, high-quality perturbation data for virtual cell models.
The teams revealed their plans to generate over 120 million single-cell data points across more than 225,000 perturbations, using Tahoe’s proprietary Mosaic platform to map how drug molecules interact with biological systems at unprecedented scale. Once released, the dataset is expected to become one of the most influential open resources ever created for virtual cell modeling.
Why Virtual Cells Need More Data
Virtual cells—AI models trained on transcriptomic data to predict how gene expression shifts across cell states—are emerging as a powerful tool in drug discovery. In principle, these models can simulate how a cell transitions from a diseased to a healthy state, helping researchers identify drug candidates with fewer off-target effects.
Yet despite rapid progress, the field remains fundamentally data-limited. Biology is not simply another modality that scales smoothly with more compute. It is combinatorially complex, context-dependent, and deeply sensitive to perturbations. Without massive, diverse, and carefully controlled datasets, even the most sophisticated architectures struggle to generalize. This new partnership aims to address this exact bottleneck.
High-quality perturbation data is what teaches virtual cell models cause and effect in biology. Instead of just observing what cells look like, scientists deliberately “poke” cells—by adding a drug, changing a gene, or altering their environment—and carefully measure how the cells respond. When this data is collected at large scale, across many cell types, doses, and conditions, and is clean and well-labeled, AI models can start to learn real biological rules rather than memorizing patterns. This is what allows virtual cells to make useful predictions, like how a drug might push a sick cell toward a healthy state, or which treatments are likely to work with fewer side effects.
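As a rough illustration of what "measuring how the cells respond" can mean computationally, the toy sketch below compares mean expression in perturbed cells against untreated controls and reports a log2 fold change per gene. The gene names, counts, and the `log2_fold_change` helper are hypothetical and chosen only for illustration; they are not data or code from the partners.

```python
import math
from statistics import mean

def log2_fold_change(control, treated, pseudocount=1.0):
    """Per-gene log2 ratio of treated mean to control mean.

    A pseudocount is added to both means so that genes with zero
    counts do not cause a division by zero or log of zero.
    """
    effects = {}
    for gene in control[0]:
        control_mean = mean(cell[gene] for cell in control)
        treated_mean = mean(cell[gene] for cell in treated)
        effects[gene] = math.log2(
            (treated_mean + pseudocount) / (control_mean + pseudocount)
        )
    return effects

# Hypothetical toy counts: each dict is one cell, keyed by gene name.
control_cells = [{"GENE_A": 7, "GENE_B": 3}, {"GENE_A": 7, "GENE_B": 3}]
treated_cells = [{"GENE_A": 1, "GENE_B": 3}, {"GENE_A": 5, "GENE_B": 3}]

effects = log2_fold_change(control_cells, treated_cells)
print(effects)  # GENE_A drops (negative value), GENE_B is unchanged (zero)
```

Because the control cells are measured under the same conditions as the perturbed ones, the difference can be attributed to the perturbation itself, which is the cause-and-effect signal that purely observational data lacks.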
Building on a Foundation of Influential Datasets
Each partner organization brings a track record of shaping the virtual cell ecosystem through foundational data releases. Tahoe-100M, produced by Tahoe, is currently the largest perturbation dataset ever released and has surpassed 250,000 downloads since its open-source debut last February. scBaseCount, curated by Arc Institute, has supported large-scale transcriptomic modeling across diverse cell states. Finally, CELLxGENE, stewarded by Biohub, has become a cornerstone for sharing and exploring single-cell datasets globally.
Together, these resources have powered influential model architectures such as Tahoe-x1, STATE, and TranscriptFormer, raising expectations that the combined scale and perturbative diversity of the new dataset will translate directly into better biological inference. The newly announced perturbation dataset will exceed Tahoe-100M by more than fourfold in perturbation richness, spanning:
- ~50 cell lines
- ~1,400 chemically diverse scaffolds
- Three dose levels per compound
- ~100 cytokine perturbations
- Metadata capturing patient-relevant biological contexts
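To give a feel for how these dimensions multiply, here is a back-of-the-envelope calculation. It assumes a fully crossed design (every compound at every dose in every cell line, and every cytokine in every cell line) and one compound per scaffold. Neither assumption is confirmed in the announcement, and all inputs are the approximate figures listed above.

```python
# Approximate figures quoted in the announcement.
cell_lines = 50
scaffolds = 1_400
doses_per_compound = 3
cytokines = 100

# Assumption: one compound per scaffold, tested at every dose in every cell line.
chemical_conditions = cell_lines * scaffolds * doses_per_compound

# Assumption: each cytokine is tested once in every cell line.
cytokine_conditions = cell_lines * cytokines

total_conditions = chemical_conditions + cytokine_conditions

print(chemical_conditions)  # 210000
print(cytokine_conditions)  # 5000
print(total_conditions)     # 215000
```

Under these assumptions the estimate lands in the same range as the "more than 225,000 perturbations" figure the teams reported, which shows how quickly a combinatorial design grows even with modest numbers along each axis.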
Data will initially be shared exclusively among the partners before being released open-source for both commercial and non-commercial use. While the public release timeline has not yet been disclosed, expectations across the field are high.
The Next Inflection Point for Virtual Biology
According to the project leads, large-scale perturbation data represents the next inflection point for virtual cell models. Initially, the dataset will help uncover new scaling laws, guiding researchers on how much data, diversity, and experimental resolution are required to justify additional investment.
In the long term, the ambition is far bolder: predicting clinical outcomes directly from virtual cells, effectively narrowing the gap between in silico biology and patient response. Reflecting on the origins of the collaboration, Tahoe CEO Nima Alidoust, PhD, emphasized that the partnership emerged organically from years of shared groundwork.
“We’ve all worked for years to build the foundations of this field,” Alidoust told GEN Edge. “After the success of Tahoe-100M, it became clear that launching this initiative together could push the data-starved field of virtual cells forward in a meaningful way.”
Technology, Scale, and the Limits of the Average Lab
Patricia Brennan, Vice President of Technology and General Manager for Science at Biohub, highlighted a key tension shaping modern biology: sequencing costs have plummeted, but biological complexity continues to outpace what individual labs can generate.
“Biology demands more data at larger scale and deeper representation than most labs can achieve,” Brennan noted. “This partnership allows us to expand perturbative diversity, cell-type coverage, and patient relevance in ways that simply weren’t feasible before.”
Once released, the dataset will also become a centerpiece of Arc’s Virtual Cell Challenge, an annual benchmarking competition evaluating how well models predict cellular responses to perturbations. The inaugural 2025 challenge—sponsored by Nvidia, 10x Genomics, and Ultima Genomics—underscored both the promise and the difficulty of defining biologically meaningful evaluation metrics.
Complementing Genetic and Spatial Approaches
While enthusiasm around large-scale perturbation data is strong, experts caution that scale alone is not enough. Ron Alfa, MD, PhD, CEO of Noetik, praised the initiative but emphasized the importance of spatial context.
“Large-scale controlled perturbation data is a valuable community resource,” Alfa said, “but spatial cellular context remains critical for learning representations that translate to patients.”
Noetik’s own approach—grounded in spatial proteomics, spatial transcriptomics, pathology imaging, and clinical metadata—highlights how chemical perturbation datasets like this one can complement, rather than replace, patient-derived multimodal data. Together, these approaches may form the backbone of clinically relevant virtual cells.
Mosaic: Unlocking Chemical Perturbations at Scale
Tahoe’s Mosaic single-cell perturbation platform is at the center of the partnership. Originating from the lab of Hani Goodarzi, PhD, an Arc core investigator, Mosaic was developed to overcome the prohibitive cost and time required to study chemical perturbations across diverse biological models.
By multiplexing cells from multiple models into a single tumor and computationally deconvolving their responses, Mosaic reduces the cost of single-cell sequencing by up to 100-fold. This makes it feasible to explore chemical perturbations at a scale previously reserved for pooled genetic screens like Perturb-seq.
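The announcement does not describe how Mosaic's deconvolution step works. Purely to illustrate the general idea of pooling cells and then computationally assigning each one back to its model of origin, the sketch below substitutes a simple stand-in technique: each pooled cell is labeled with the reference profile it correlates with best. The profiles, model names, and the `assign_cells` helper are hypothetical and are not Tahoe's method.

```python
import math

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def assign_cells(cells, references):
    """Label each pooled cell with the best-correlated reference model."""
    return [
        max(references, key=lambda name: pearson(cell, references[name]))
        for cell in cells
    ]

# Hypothetical marker-gene profiles for two models that were pooled together.
references = {
    "model_A": [9.0, 1.0, 1.0, 5.0],
    "model_B": [1.0, 8.0, 6.0, 1.0],
}

# Hypothetical cells measured from the pool, with their origin unknown.
pooled_cells = [
    [8.0, 2.0, 0.0, 4.0],
    [0.0, 7.0, 5.0, 2.0],
    [10.0, 0.0, 2.0, 6.0],
]

labels = assign_cells(pooled_cells, references)
print(labels)  # ['model_A', 'model_B', 'model_A']
```

Once every cell carries a model label, responses to a drug can be analyzed per model even though all the cells were processed and sequenced together, which is where the cost savings of multiplexing come from.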
Hybrid modeling approaches that incorporate biological priors have so far outperformed purely end-to-end systems, largely because existing datasets are insufficient to let models learn gene regulation from scratch. Goodarzi believes this collaboration changes that calculus.
By dramatically increasing diversity and statistical power, the Tahoe–Arc–Biohub dataset could help define new benchmarks and allow models to internalize fundamental principles of gene regulation—rather than memorizing narrow patterns.
A Step Closer to Clinical Impact
Cells remain profoundly complex, and accurately predicting patient outcomes is still beyond reach. But each large, open dataset narrows the gap between descriptive biology and predictive medicine.
With this unprecedented collaboration, Tahoe Therapeutics, Arc Institute, and Biohub are betting that open, large-scale perturbation data will be the catalyst that finally pushes virtual cells from promising research tools toward clinically meaningful systems—bringing AI-driven biological insight closer than ever to real patient impact.

