Selected Publications

Google Scholar | Semantic Scholar | ORCID

RNAGym: Large-scale Benchmarks for RNA Fitness and Structure Prediction

Preprint. Large-scale benchmarks to assess models for RNA fitness and structure prediction.

Evolutionary-Scale Enzymology Enables Biochemical Constant Prediction Across a Multi-Peaked Catalytic Landscape

Science, 2025. A microfluidic platform to measure catalytic constants for hundreds of adenylate kinase variants, enabling mapping of the sequence-catalysis landscape and the development of improved predictive models.

Protriever: End-to-End Differentiable Protein Homology Search for Fitness Prediction

ICML, 2025. An end-to-end differentiable framework that learns to retrieve relevant protein homologs while simultaneously training for downstream tasks, achieving state-of-the-art fitness prediction performance while running orders of magnitude faster than traditional MSA-based approaches.

Large-scale discovery, analysis, and design of protein energy landscapes

Preprint. A multiplexed experimental approach to analyze conformational fluctuations across thousands of protein domains, revealing hidden variations that affect protein cooperativity and function.

Multi-Scale Representation Learning for Protein Fitness Prediction

NeurIPS, 2024. A multimodal framework that integrates protein sequence, structure, and surface topology features to achieve state-of-the-art fitness prediction.

Predicting Promoter Variant Effects from Evolutionary Sequences

Preprint. A conditional autoregressive transformer model trained on 14.6 million mammalian promoter sequences that achieves state-of-the-art performance in predicting the effects of indels in human promoter regions.

Machine Learning for Functional Protein Design

Nature Biotech, 2024. A unifying framework that makes sense of the exploding diversity of machine learning approaches by classifying models according to their use of three core data modalities: sequences, structures, and functional labels.

ProteinGym: Large-Scale Benchmarks for Protein Fitness Prediction and Design

NeurIPS, 2023. Large-scale benchmarks to assess models for protein fitness prediction and design.

ProteinNPT: Improving Protein Property Prediction and Design with Non-Parametric Transformers

NeurIPS, 2023. A conditional semi-supervised pseudo-generative model for fitness prediction and design.

Learning from Prepandemic Data to Forecast Viral Escape

Nature, 2023. A computational framework to predict viral escape from pre-pandemic information only (evolutionary data and 3D structure).

DiscoBAX: Discovery of Optimal Intervention Sets in Genomic Experiment Design

ICML, 2023. A sample-efficient method for discovering intervention sets that are both diverse and effective at optimizing the function of interest.

TranceptEVE: Combining Family-specific and Family-agnostic Models of Protein Sequences for Improved Fitness Prediction

NeurIPS LMRL Workshop, 2022. A hybrid family-specific and family-agnostic model achieving state-of-the-art performance on protein fitness prediction and human variant annotation.

Tranception: Protein Fitness Prediction with Autoregressive Transformers and Inference-time Retrieval

ICML, 2022. A suite of autoregressive transformers with biological priors, augmented with inference-time retrieval, achieving state-of-the-art performance on protein fitness prediction.

RITA: a Study on Scaling Up Generative Protein Sequence Models

ICML WCB Workshop, 2022. The first paper investigating scaling laws in protein language modeling.

Disease Variant Prediction with Deep Generative Models of Evolutionary Data

Nature, 2021. Deep generative models (Bayesian VAEs) of evolutionary sequences to predict the effects of missense mutations in human proteins.

GeneDisco: A Benchmark for Experimental Design in Drug Discovery

ICLR, 2022. A benchmark suite for evaluating active learning algorithms for experimental design in drug discovery.

Improving Black-box Optimization in VAE Latent Space Using Decoder Uncertainty

NeurIPS, 2021. A framework that uses the epistemic uncertainty of the decoder of a VAE to guide the optimization of properties of high-dimensional structured objects (e.g., molecules) in latent space.

Improving Compute Efficacy Frontiers with SliceOut

Preprint. A memory-efficient dropout-inspired scheme to train large neural networks faster with no loss in accuracy.