Cryo-electron microscopy (cryo-EM) is a powerful technique for determining high-resolution 3D biomolecular structures from imaging data. As this technique can capture dynamic biomolecular complexes, 3D reconstruction methods are increasingly being developed to resolve this intrinsic structural heterogeneity. However, the absence of standardized benchmarks with ground truth structures and validation metrics limits the advancement of the field. Here, we propose CryoBench, a suite of datasets, metrics, and performance benchmarks for heterogeneous reconstruction in cryo-EM. We propose five datasets representing different sources of heterogeneity and degrees of difficulty. These include conformational heterogeneity generated from simple motions and random configurations of antibody complexes and from tens of thousands of structures sampled from a molecular dynamics simulation. We also design datasets containing compositional heterogeneity from mixtures of ribosome assembly states and 100 common complexes present in cells. We then perform a comprehensive analysis of state-of-the-art heterogeneous reconstruction tools including neural and non-neural methods and their sensitivity to noise, and propose new metrics for quantitative comparison of methods. We hope that this benchmark will be a foundational resource for analyzing existing methods and new algorithmic development in both the cryo-EM and machine learning communities.
Cryo-ET reveals the in situ architecture of the polar tube invasion apparatus from microsporidian parasites
Mahrukh Usmani, Nicolas Coudray, Margot Riggi, Rishwanth Raghu, Harshita Ramchandani, Daija Bobe, Mykhailo Kopylov, Ellen D Zhong, Janet H Iwasa, Damian Charles Ekiert, and Gira Bhabha
Microsporidia are divergent fungal pathogens that employ a harpoon-like apparatus called the polar tube (PT) to invade host cells. The PT architecture and its association with neighboring organelles remain poorly understood. Here, we use cryo-electron tomography to investigate the structural cell biology of the PT in dormant spores from the human-infecting microsporidian species, Encephalitozoon intestinalis. Segmentation and subtomogram averaging of the PT reveal at least four layers: two protein-based layers surrounded by a membrane, and filled with a dense core. Regularly spaced protein filaments form the structural skeleton of the PT. Combining cryo-electron tomography with cellular modeling, we propose a model for the 3-dimensional organization of the polaroplast, an organelle that is continuous with the membrane layer that envelops the PT. Our results reveal the ultrastructure of the microsporidian invasion apparatus in situ, laying the foundation for understanding infection mechanisms.
CryoDRGN-ET: deep reconstructing generative networks for visualizing dynamic biomolecules inside cells
Ramya Rangan*, Ryan Feathers*, Sagar Khavnekar, Adam Lerer, Jake Johnston, Ron Kelley, Martin Obr, Abhay Kotecha, and Ellen D. Zhong
Advances in cryo-electron tomography (cryo-ET) have produced new opportunities to visualize the structures of dynamic macromolecules in native cellular environments. While cryo-ET can reveal structures at molecular resolution, image processing algorithms remain a bottleneck in resolving the heterogeneity of biomolecular structures in situ. Here, we introduce cryoDRGN-ET for heterogeneous reconstruction of cryo-ET subtomograms. CryoDRGN-ET learns a deep generative model of three-dimensional density maps directly from subtomogram tilt-series images and can capture states diverse in both composition and conformation. We validate this approach by recovering the known translational states in Mycoplasma pneumoniae ribosomes in situ. We then perform cryo-ET on cryogenic focused ion beam–milled Saccharomyces cerevisiae cells. CryoDRGN-ET reveals the structural landscape of S. cerevisiae ribosomes during translation and captures continuous motions of fatty acid synthase complexes inside cells. This method is openly available in the cryoDRGN software.
Solving Inverse Problems in Protein Space Using Diffusion-Based Priors
Axel Levy, Eric R. Chan, Sara Fridovich-Keil, Frédéric Poitevin, Ellen D. Zhong, and Gordon Wetzstein
The interaction of a protein with its environment can be understood and controlled via its 3D structure. Experimental methods for protein structure determination, such as X-ray crystallography or cryogenic electron microscopy, shed light on biological processes but introduce challenging inverse problems. Learning-based approaches have emerged as accurate and efficient methods to solve these inverse problems for 3D structure determination, but are specialized for a predefined type of measurement. Here, we introduce a versatile framework to turn raw biophysical measurements of varying types into 3D atomic models. Our method combines a physics-based forward model of the measurement process with a pretrained generative model providing a task-agnostic, data-driven prior. Our method outperforms posterior sampling baselines on both linear and non-linear inverse problems. In particular, it is the first diffusion-based method for refining atomic models from cryo-EM density maps.
Accurate structure prediction of biomolecular interactions with AlphaFold 3
Josh Abramson, Jonas Adler, Jack Dunger, Richard Evans, Tim Green, Alexander Pritzel, Olaf Ronneberger, Lindsay Willmore, Andrew J. Ballard, Joshua Bambrick, Sebastian W. Bodenstein, David A. Evans, Chia-Chun Hung, Michael O’Neill, David Reiman, Kathryn Tunyasuvunakool, Zachary Wu, Akvilė Žemgulytė, Eirini Arvaniti, Charles Beattie, Ottavia Bertolli, Alex Bridgland, Alexey Cherepanov, Miles Congreve, Alexander I. Cowen-Rivers, Andrew Cowie, Michael Figurnov, Fabian B. Fuchs, Hannah Gladman, Rishub Jain, Yousuf A. Khan, Caroline M. R. Low, Kuba Perlin, Anna Potapenko, Pascal Savy, Sukhdeep Singh, Adrian Stecula, Ashok Thillaisundaram, Catherine Tong, Sergei Yakneen, Ellen D. Zhong, Michal Zielinski, Augustin Žídek, Victor Bapst, Pushmeet Kohli, Max Jaderberg, Demis Hassabis, and John M. Jumper
The introduction of AlphaFold 2 has spurred a revolution in modelling the structure of proteins and their interactions, enabling a huge range of applications in protein modelling and design. In this paper, we describe our AlphaFold 3 model with a substantially updated diffusion-based architecture, which is capable of joint structure prediction of complexes including proteins, nucleic acids, small molecules, ions, and modified residues. The new AlphaFold model demonstrates significantly improved accuracy over many previous specialised tools: far greater accuracy on protein-ligand interactions than state of the art docking tools, much higher accuracy on protein-nucleic acid interactions than nucleic-acid-specific predictors, and significantly higher antibody-antigen prediction accuracy than AlphaFold-Multimer v2.3. Together these results show that high accuracy modelling across biomolecular space is possible within a single unified deep learning framework.
Revealing biomolecular structure and motion with neural ab initio cryo-EM reconstruction
Axel Levy, Michal Grzadkowski, Frédéric Poitevin, Francesca Vallese, Oliver Biggs Clarke, Gordon Wetzstein, and Ellen D. Zhong
Proteins and other biomolecules form dynamic macromolecular machines that are tightly orchestrated to move, bind, and perform chemistry. Cryo-electron microscopy (cryo-EM) can access the intrinsic heterogeneity of these complexes and is therefore a key tool for understanding mechanism and function. However, 3D reconstruction of the resulting imaging data presents a challenging computational problem, especially without any starting information, a setting termed ab initio reconstruction. Here, we introduce a method, DRGN-AI, for ab initio heterogeneous cryo-EM reconstruction. With a two-step hybrid approach combining search and gradient-based optimization, DRGN-AI can reconstruct dynamic protein complexes from scratch without input poses or initial models. Using DRGN-AI, we reconstruct the compositional and conformational variability contained in a variety of benchmark datasets, process an unfiltered dataset of the DSL1/SNARE complex fully ab initio, and reveal a new “supercomplex” state of the human erythrocyte ankyrin-1 complex. With this expressive and scalable model for structure determination, we hope to unlock the full potential of cryo-EM as a high-throughput tool for structural biology and discovery.
2023
Conformational states of the microtubule nucleator, the γ-tubulin ring complex
Brianna Romer, Sophie M. Travis, Brian P. Mahon, Collin T. McManus, Philip D. Jeffrey, Nicolas Coudray, Rishwanth Raghu, Michael J. Rale, Ellen D. Zhong, Gira Bhabha, and Sabine Petry
Microtubules (MTs) perform essential functions in the cell, and it is critical that they are made at the correct cellular location and cell cycle stage. This nucleation process is catalyzed by the γ-tubulin ring complex (γ-TuRC), a cone-shaped protein complex composed of over 30 subunits. Despite recent insight into the structure of vertebrate γ-TuRC, which shows that its diameter is wider than that of a MT, and that it exhibits little of the symmetry expected for an ideal MT template, the question of how γ-TuRC achieves MT nucleation remains open. Here, we utilized single particle cryo-EM to identify two conformations of γ-TuRC. The helix composed of 14 γ-tubulins at the top of the γ-TuRC cone undergoes substantial deformation, which is predominantly driven by bending of the hinge between the GRIP1 and GRIP2 domains of the γ-tubulin complex proteins. However, surprisingly, this deformation does not remove the inherent asymmetry of γ-TuRC. To further investigate the role of γ-TuRC conformational change, we used cryo-electron tomography (cryo-ET) to obtain a 3D reconstruction of γ-TuRC bound to a nucleated MT, providing insight into the post-nucleation state. Rigid-body fitting of our cryo-EM structures into this reconstruction suggests that the MT lattice is nucleated by spokes 2 through 14 of the γ-tubulin helix, which entails spokes 13 and 14 becoming more structured than what is observed in apo γ-TuRC. Together, our results allow us to propose a model for conformational changes in γ-TuRC and how these may facilitate MT formation in a cell.
Time-resolved cryo-EM (TR-EM) analysis of substrate polyubiquitination by the RING E3 anaphase-promoting complex/cyclosome (APC/C)
Tatyana Bodrug, Kaeli A Welsh, Derek L Bolhuis, Ethan Paulakonis, Raquel C Martinez-Chacin, Bei Liu, Nicholas Pinkin, Thomas Bonacci, Liying Cui, Pengning Xu, Olivia Roscow, Sascha Josef Amann, Irina Grishkovskaya, Michael J Emanuele, Joseph S Harrison, Joshua P Steimel, Klaus M Hahn, Wei Zhang, Ellen D Zhong, David Haselbach, and Nicholas G Brown
Substrate polyubiquitination drives a myriad of cellular processes, including the cell cycle, apoptosis and immune responses. Polyubiquitination is highly dynamic, and obtaining mechanistic insight has thus far required artificially trapped structures to stabilize specific steps along the enzymatic process. So far, how any ubiquitin ligase builds a proteasomal degradation signal, which is canonically regarded as four or more ubiquitins, remains unclear. Here we present time-resolved cryogenic electron microscopy studies of the 1.2 MDa E3 ubiquitin ligase, known as the anaphase-promoting complex/cyclosome (APC/C), and its E2 co-enzymes (UBE2C/UBCH10 and UBE2S) during substrate polyubiquitination. Using cryoDRGN (Deep Reconstructing Generative Networks), a neural network-based approach, we reconstruct the conformational changes undergone by the human APC/C during polyubiquitination, directly visualize an active E3–E2 pair modifying its substrate, and identify unexpected interactions between multiple ubiquitins with parts of the APC/C machinery, including its coactivator CDH1. Together, we demonstrate how modification of substrates with nascent ubiquitin chains helps to potentiate processive substrate polyubiquitination, allowing us to model how a ubiquitin ligase builds a proteasomal degradation signal.
Here, using cryogenic electron microscopy and cryoDRGN, the authors delineate how the anaphase-promoting complex/cyclosome is reconfigured to interact with its cognate E2s and thus polyubiquitinate its target. Unexpectedly, multiple ubiquitin moieties are shown to interact with the anaphase-promoting complex/cyclosome machinery, including its activator Cdh1.
Conformational heterogeneity and probability distributions from single-particle cryo-electron microscopy
Wai Shing Tang, Ellen D. Zhong, Sonya M. Hanson, Erik H. Thiede, and Pilar Cossio
Single-particle cryo-electron microscopy (cryo-EM) is a technique that takes projection images of biomolecules frozen at cryogenic temperatures. A major advantage of this technique is its ability to image single biomolecules in heterogeneous conformations. While this poses a challenge for data analysis, recent algorithmic advances have enabled the recovery of heterogeneous conformations from the noisy imaging data. Here, we review methods for the reconstruction and heterogeneity analysis of cryo-EM images, ranging from linear-transformation-based methods to nonlinear deep generative models. We overview the dimensionality-reduction techniques used in heterogeneous 3D reconstruction methods and specify what information each method can infer from the data. Then, we review the methods that use cryo-EM images to estimate probability distributions over conformations in reduced subspaces or predefined by atomistic simulations. We conclude with the ongoing challenges for the cryo-EM community.
2022
Amortized Inference for Heterogeneous Reconstruction in Cryo-EM
Cryo-electron microscopy (cryo-EM) is an imaging modality that provides unique insights into the dynamics of proteins and other building blocks of life. The algorithmic challenge of jointly estimating the poses, 3D structure, and conformational heterogeneity of a biomolecule from millions of noisy and randomly oriented 2D projections in a computationally efficient manner, however, remains unsolved. Our method, cryoFIRE, performs ab initio heterogeneous reconstruction with unknown poses in an amortized framework, thereby avoiding the computationally expensive step of pose search while enabling the analysis of conformational heterogeneity. Poses and conformation are jointly estimated by an encoder while a physics-based decoder aggregates the images into an implicit neural representation of the conformational space. We show that our method can provide one order of magnitude speedup on datasets containing millions of images without any loss of accuracy. We validate that the joint estimation of poses and conformations can be amortized over the size of the dataset. For the first time, we prove that an amortized method can extract interpretable dynamic information from experimental datasets.
Latent Space Diffusion Models of Cryo-EM Structures
Karsten Kreis*, Tim Dockhorn*, Zihao Li, and Ellen D Zhong
In NeurIPS Workshop on Machine Learning for Structural Biology (MLSB), 2022
Cryo-electron microscopy (cryo-EM) is unique among tools in structural biology in its ability to image large, dynamic protein complexes. Key to this ability is image processing algorithms for heterogeneous cryo-EM reconstruction, including recent deep learning-based approaches. The state-of-the-art method cryoDRGN uses a Variational Autoencoder (VAE) framework to learn a continuous distribution of protein structures from single particle cryo-EM imaging data. While cryoDRGN can model complex structural motions, the Gaussian prior distribution of the VAE fails to match the aggregate approximate posterior, which prevents generative sampling of structures especially for multi-modal distributions (e.g. compositional heterogeneity). Here, we train a diffusion model as an expressive, learnable prior in the cryoDRGN framework. Our approach learns a high-quality generative model over molecular conformations directly from cryo-EM imaging data. We show the ability to sample from the model on two synthetic and two real datasets, where samples accurately follow the data distribution unlike samples from the VAE prior distribution. We also demonstrate how the diffusion model prior can be leveraged for fast latent space traversal and interpolation between states of interest. By learning an accurate model of the data distribution, our method unlocks tools in generative modeling, sampling, and distribution analysis for heterogeneous cryo-EM ensembles.
Deep generative modeling for volume reconstruction in cryo-electron microscopy
Advances in cryo-electron microscopy (cryo-EM) for high-resolution imaging of biomolecules in solution have provided new challenges and opportunities for algorithm development for 3D reconstruction. Next-generation volume reconstruction algorithms that combine generative modelling with end-to-end unsupervised deep learning techniques have shown promise, but many technical and theoretical hurdles remain, especially when applied to experimental cryo-EM images. In light of the proliferation of such methods, we propose here a critical review of recent advances in the field of deep generative modelling for cryo-EM reconstruction. The present review aims to (i) provide a unified statistical framework using terminology familiar to machine learning researchers with no specific background in cryo-EM, (ii) review the current methods in this framework, and (iii) outline outstanding bottlenecks and avenues for improvements in the field.
Machine Learning for Reconstructing Dynamic Protein Structures from Cryo-EM Images
Proteins and other biomolecules form dynamic macromolecular machines that carry out essential biological processes responsible for life. However, studying the mechanisms of these biomolecular complexes at relevant atomic-scale resolutions is an extraordinarily challenging task in structural biology. This thesis presents new algorithms that address the computational bottlenecks at the frontier of structure determination of dynamic biomolecular complexes via cryo-electron microscopy (cryo-EM).
In single particle cryo-EM, the central problem is to reconstruct the 3D structure of a target biomolecular complex from a set of noisy and randomly oriented 2D projection images, a challenging inverse problem especially when instances of the imaged biomolecular complex exhibit structural heterogeneity.
The main contribution of this thesis is a machine learning system, cryoDRGN, for reconstructing continuous distributions of biomolecular structures from cryo-EM images. Underpinning the cryoDRGN method is a deep generative model parameterized by a new neural representation of cryo-EM volumes and a learning algorithm to optimize this representation from unlabeled 2D cryo-EM images. Released as an open source software tool, cryoDRGN has been applied on real datasets to uncover heterogeneity in high resolution datasets, discover new conformations of large macromolecular machines and visualize continuous trajectories of their motion. This thesis also describes an extension, cryoDRGN2, for learning this model from unposed images, i.e. ab initio reconstruction. Finally, this thesis presents emerging directions in analyzing the learned manifold of cryo-EM structures and in incorporating atomic model priors into cryo-EM reconstruction.
Cryo-EM structure of the plant 26S proteasome
Susanne Kandolf, Irina Grishkovskaya, Katarina Belačić, Derek L Bolhuis, Sascha Amann, Brent Foster, Richard Imre, Karl Mechtler, Alexander Schleiffer, Hemant D Tagare, Ellen D Zhong, Anton Meinhard, Nicholas G Brown, and David Haselbach
Targeted proteolysis is a hallmark of life. It is especially important in long-lived cells that can be found in higher eukaryotes, like plants. This task is mainly fulfilled by the ubiquitin–proteasome system. Thus, proteolysis by the 26S proteasome is vital to development, immunity, and cell division. Although the yeast and animal proteasomes are well characterized, there is only limited information on the plant proteasome. We determined the first plant 26S proteasome structure from Spinacia oleracea by single-particle electron cryogenic microscopy at an overall resolution of 3.3 Å. We found an almost identical overall architecture of the spinach proteasome compared with the known structures from mammals and yeast. Nevertheless, we noticed a structural difference in the proteolytic active β1 subunit. Furthermore, we uncovered an unseen compression state by characterizing the proteasome’s conformational landscape. We suspect that this new conformation of the 20S core protease, in correlation with a partial opening of the unoccupied gate, may contribute to peptide release after proteolysis. Our data provide a structural basis for the plant proteasome, which is crucial for further studies.
Conformational landscape of the yeast SAGA complex as revealed by cryo-EM
Diana Vasyliuk, Joeseph Felt, Ellen D Zhong, Bonnie Berger, Joseph H Davis, and Calvin K Yip
Spt-Ada-Gcn5-Acetyltransferase (SAGA) is a conserved multi-subunit complex that activates RNA polymerase II-mediated transcription by acetylating and deubiquitinating nucleosomal histones and by recruiting TATA box binding protein (TBP) to DNA. The prototypical yeast Saccharomyces cerevisiae SAGA contains 19 subunits that are organized into Tra1, core, histone acetyltransferase, and deubiquitination modules. Recent cryo-electron microscopy studies have generated high-resolution structural information on the Tra1 and core modules of yeast SAGA. However, the two catalytical modules were poorly resolved due to conformational flexibility of the full assembly. Furthermore, the high sample requirement created a formidable barrier to further structural investigations of SAGA. Here, we report a workflow for isolating/stabilizing yeast SAGA and preparing cryo-EM specimens at low protein concentration using a graphene oxide support layer. With this procedure, we were able to determine a cryo-EM reconstruction of yeast SAGA at 3.1 Å resolution and examine its conformational landscape with the neural network-based algorithm cryoDRGN. Our analysis revealed that SAGA adopts a range of conformations with its HAT module and central core in different orientations relative to Tra1.
Uncovering structural ensembles from single-particle cryo-EM data using cryoDRGN
Laurel F Kinman*, Barrett M Powell*, Ellen D Zhong*+, Bonnie Berger+, and Joseph H Davis+
Single-particle cryogenic electron microscopy (cryo-EM) has emerged as a powerful technique to visualize the structural landscape sampled by a protein complex. However, algorithmic and computational bottlenecks in analyzing heterogeneous cryo-EM datasets have prevented the full realization of this potential. CryoDRGN is a machine learning system for heterogeneous cryo-EM reconstruction of proteins and protein complexes from single-particle cryo-EM data. Central to this approach is a deep generative model for heterogeneous cryo-EM density maps, which we empirically find is effective in modeling both discrete and continuous forms of structural variability. Once trained, cryoDRGN is capable of generating an arbitrary number of 3D density maps, and thus interpreting the resulting ensemble is a challenge. Here, we showcase interactive and automated processing approaches for analyzing cryoDRGN results. Specifically, we detail a step-by-step protocol for the analysis of an existing assembling 50S ribosome dataset, including preparation of inputs, network training and visualization of the resulting ensemble of density maps. Additionally, we describe and implement methods to comprehensively analyze and interpret the distribution of volumes with the assistance of an associated atomic model. This protocol is appropriate for structural biologists familiar with processing single-particle cryo-EM datasets and with moderate experience navigating Python and Jupyter notebooks. It requires 3–4 days to complete. CryoDRGN is open source software that is freely available.
2021
CryoDRGN2: Ab Initio Neural Reconstruction of 3D Protein Structures From Real Cryo-EM Images
Ellen D Zhong, Adam Lerer, Joseph H Davis, and Bonnie Berger
In International Conference on Computer Vision (ICCV), 2021
Protein structure determination from cryo-EM data requires reconstructing a 3D volume (or distribution of volumes) from many noisy and randomly oriented 2D projection images. While the standard homogeneous reconstruction task aims to recover a single static structure, recently-proposed neural and non-neural methods can reconstruct distributions of structures, thereby enabling the study of protein complexes that possess intrinsic structural or conformational heterogeneity. These heterogeneous reconstruction methods, however, require fixed image poses, which are typically estimated from an upstream homogeneous reconstruction and are not guaranteed to be accurate under highly heterogeneous conditions. In this work we describe cryoDRGN2, an ab initio reconstruction algorithm, which can jointly estimate image poses and learn a neural model of a distribution of 3D structures on real heterogeneous cryo-EM data. To achieve this, we adapt search algorithms from the traditional cryo-EM literature, and describe the optimizations and design choices required to make such a search procedure computationally tractable in the neural model setting. We show that cryoDRGN2 is robust to the high noise levels of real cryo-EM images, trains faster than earlier neural methods, and achieves state-of-the-art performance on real cryo-EM datasets.
CryoDRGN: reconstruction of heterogeneous cryo-EM structures using neural networks
Ellen D Zhong, Tristan Bepler, Bonnie Berger+, and Joseph H Davis+
Cryo-electron microscopy (cryo-EM) single-particle analysis has proven powerful in determining the structures of rigid macromolecules. However, many imaged protein complexes exhibit conformational and compositional heterogeneity that poses a major challenge to existing three-dimensional reconstruction methods. Here, we present cryoDRGN, an algorithm that leverages the representation power of deep neural networks to directly reconstruct continuous distributions of 3D density maps and map per-particle heterogeneity of single-particle cryo-EM datasets. Using cryoDRGN, we uncovered residual heterogeneity in high-resolution datasets of the 80S ribosome and the RAG complex, revealed a new structural state of the assembling 50S ribosome, and visualized large-scale continuous motions of a spliceosome complex. CryoDRGN contains interactive tools to visualize a dataset’s distribution of per-particle variability, generate density maps for exploratory analysis, extract particle subsets for use with other tools and generate trajectories to visualize molecular motions. CryoDRGN is open-source software freely available at http://cryodrgn.csail.mit.edu.
Structures of radial spokes and associated complexes important for ciliary motility
Miao Gui, Meisheng Ma, Erica Sze-Tu, Xiangli Wang, Fujiet Koh, Ellen D Zhong, Bonnie Berger, Joseph H Davis, Susan K Dutcher, Rui Zhang+, and Alan Brown+
In motile cilia, a mechanoregulatory network is responsible for converting the action of thousands of dynein motors bound to doublet microtubules into a single propulsive waveform. Here, we use two complementary cryo-EM strategies to determine structures of the major mechanoregulators that bind ciliary doublet microtubules in Chlamydomonas reinhardtii. We determine structures of isolated radial spoke RS1 and the microtubule-bound RS1, RS2 and the nexin−dynein regulatory complex (N-DRC). From these structures, we identify and build atomic models for 30 proteins, including 23 radial-spoke subunits. We reveal how mechanoregulatory complexes dock to doublet microtubules with regular 96-nm periodicity and communicate with one another. Additionally, we observe a direct and dynamically coupled association between RS2 and the dynein motor inner dynein arm subform c (IDAc), providing a molecular basis for the control of motor activity by mechanical signals. These structures advance our understanding of the role of mechanoregulation in defining the ciliary waveform.
Learning the language of viral evolution and escape
Brian Hie, Ellen D Zhong, Bonnie Berger+, and Bryan Bryson+
The ability for viruses to mutate and evade the human immune system and cause infection, called viral escape, remains an obstacle to antiviral and vaccine development. Understanding the complex rules that govern escape could inform therapeutic design. We modeled viral escape with machine learning algorithms originally developed for human natural language. We identified escape mutations as those that preserve viral infectivity but cause a virus to look different to the immune system, akin to word changes that preserve a sentence’s grammaticality but change its meaning. With this approach, language models of influenza hemagglutinin, HIV-1 envelope glycoprotein (HIV Env), and severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) Spike viral proteins can accurately predict structural escape patterns using sequence data alone. Our study represents a promising conceptual bridge between natural language and viral evolution.
2020
Learning mutational semantics
Brian Hie, Ellen D Zhong, Bryan Bryson, and Bonnie Berger
In Neural Information Processing Systems (NeurIPS), 2020
In many natural domains, changing a small part of an entity can transform its semantics; for example, a single word change can alter the meaning of a sentence, or a single amino acid change can mutate a viral protein to escape antiviral treatment or immunity. Although identifying such mutations can be desirable (for example, therapeutic design that anticipates avenues of viral escape), the rules governing semantic change are often hard to quantify. Here, we introduce the problem of identifying mutations with a large effect on semantics, but where valid mutations are under complex constraints (for example, English grammar or biological viability), which we refer to as constrained semantic change search (CSCS). We propose an unsupervised solution based on language models that simultaneously learn continuous latent representations. We report good empirical performance on CSCS of single-word mutations to news headlines, map a continuous semantic space of viral variation, and, notably, show unprecedented zero-shot prediction of single-residue escape mutations to key influenza and HIV proteins, suggesting a productive link between modeling natural language and pathogenic evolution.
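The search described above can be sketched as a ranking problem: each candidate mutation gets one score for how much it changes the learned representation and one for how plausible the language model finds it, and the two are combined. The rank-sum combination, the `beta` weight, and both function names below are illustrative assumptions rather than a statement of the authors' implementation; the input scores are assumed to come from some language model that is not shown here.

```python
def rank_descending(values):
    """Return 1-based ranks where the largest value gets rank 1."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def cscs_order(semantic_change, grammaticality, beta=1.0):
    """Order candidate mutations, best first (hypothetical helper).

    semantic_change: per-candidate distance between the embeddings of the
        mutated and original sequences (larger means more semantic change).
    grammaticality: per-candidate probability assigned by the language
        model (larger means more plausible).
    beta: assumed weight trading off the two rank terms.
    """
    if len(semantic_change) != len(grammaticality):
        raise ValueError("score lists must have the same length")
    sem_rank = rank_descending(semantic_change)
    gram_rank = rank_descending(grammaticality)
    # A low combined value means the candidate ranks well on both criteria.
    combined = [s + beta * g for s, g in zip(sem_rank, gram_rank)]
    return sorted(range(len(combined)), key=lambda i: combined[i])
```

Working with ranks rather than raw scores sidesteps the fact that embedding distances and probabilities live on different scales; candidates that score well on only one criterion fall behind those that do reasonably on both.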
Exploring generative atomic models in cryo-EM reconstruction
Ellen D Zhong, Adam Lerer, Joseph H Davis, and Bonnie Berger
In NeurIPS Workshop on Machine Learning for Structural Biology (MLSB), 2020
Cryo-EM reconstruction algorithms seek to determine a molecule’s 3D density map from a series of noisy, unlabeled 2D projection images captured with an electron microscope. Although reconstruction algorithms typically model the 3D volume as a generic function parameterized as a voxel array or neural network, the underlying atomic structure of the protein of interest places well-defined physical constraints on the reconstructed structure. In this work, we exploit prior information provided by an atomic model to reconstruct distributions of 3D structures from a cryo-EM dataset. We propose Cryofold, a generative model for a continuous distribution of 3D volumes based on a coarse-grained model of the protein’s atomic structure, with radial basis functions used to model atom locations and their physics-based constraints. Although the reconstruction objective is highly non-convex when formulated in terms of atomic coordinates (similar to the protein folding problem), we show that gradient descent-based methods can reconstruct a continuous distribution of atomic structures when initialized from a structure within the underlying distribution. This approach is a promising direction for integrating biophysical simulation, learned neural models, and experimental data for 3D protein structure determination.
RNA timestamps identify the age of single molecules in RNA sequencing
Samuel G Rodriques, Linlin M Chen, Sophia Liu, Ellen D Zhong, Joseph R Scherrer, Edward S Boyden+, and Fei Chen+
Current approaches to single-cell RNA sequencing (RNA-seq) provide only limited information about the dynamics of gene expression. Here we present RNA timestamps, a method for inferring the age of individual RNAs in RNA-seq data by exploiting RNA editing. To introduce timestamps, we tag RNA with a reporter motif consisting of multiple MS2 binding sites that recruit the adenosine deaminase ADAR2 fused to an MS2 capsid protein. ADAR2 binding to tagged RNA causes A-to-I edits to accumulate over time, allowing the age of the RNA to be inferred with hour-scale accuracy. By combining observations of multiple timestamped RNAs driven by the same promoter, we can determine when the promoter was active. We demonstrate that the system can infer the presence and timing of multiple past transcriptional events. Finally, we apply the method to cluster single cells according to the timing of past transcriptional activity. RNA timestamps will allow the incorporation of temporal information into RNA-seq workflows.
Reconstructing continuous distributions of 3D protein structure from cryo-EM images
Ellen D Zhong, Tristan Bepler, Joseph H Davis, and Bonnie Berger
In International Conference on Learning Representations (ICLR), 2020
Cryo-electron microscopy (cryo-EM) is a powerful technique for determining the structure of proteins and other macromolecular complexes at near-atomic resolution. In single particle cryo-EM, the central problem is to reconstruct the three-dimensional structure of a macromolecule from 10^4 to 10^7 noisy and randomly oriented two-dimensional projections. However, the imaged protein complexes may exhibit structural variability, which complicates reconstruction and is typically addressed using discrete clustering approaches that fail to capture the full range of protein dynamics. Here, we introduce a novel method for cryo-EM reconstruction that extends naturally to modeling continuous generative factors of structural heterogeneity. This method encodes structures in Fourier space using coordinate-based deep neural networks, and trains these networks from unlabeled 2D cryo-EM images by combining exact inference over image orientation with variational inference for structural heterogeneity. We demonstrate that the proposed method, termed cryoDRGN, can perform ab initio reconstruction of 3D protein complexes from simulated and real 2D cryo-EM image data. To our knowledge, cryoDRGN is the first neural network-based approach for cryo-EM reconstruction and the first end-to-end method for directly reconstructing continuous ensembles of protein structures from cryo-EM images.
2019
Explicitly disentangling image content from translation and rotation with spatial-VAE
Tristan Bepler, Ellen D Zhong, Kotaro Kelley, Edward Brignole, and Bonnie Berger
In Neural Information Processing Systems (NeurIPS), 2019
Given an image dataset, we are often interested in finding data generative factors that encode semantic content independently from pose variables such as rotation and translation. However, current disentanglement approaches do not impose any specific structure on the learned latent representations. We propose a method for explicitly disentangling image rotation and translation from other unstructured latent factors in a variational autoencoder (VAE) framework. By formulating the generative model as a function of the spatial coordinate, we make the reconstruction error differentiable with respect to latent translation and rotation parameters. This formulation allows us to train a neural network to perform approximate inference on these latent variables while explicitly constraining them to only represent rotation and translation. We demonstrate that this framework, termed spatial-VAE, effectively learns latent representations that disentangle image rotation and translation from content and improves reconstruction over standard VAEs on several benchmark datasets, including applications to modeling continuous 2-D views of proteins from single particle electron microscopy and galaxies in astronomical images.
2017
Lessons learned from comparing molecular dynamics engines on the SAMPL5 dataset
Michael R Shirts, Christoph Klein, Jason M Swails, Jian Yin, Michael K Gilson, David L Mobley, David A Case, and Ellen D Zhong
We describe our efforts to prepare common starting structures and models for the SAMPL5 blind prediction challenge. We generated the starting input files and single configuration potential energies for the host-guest in the SAMPL5 blind prediction challenge for the GROMACS, AMBER, LAMMPS, DESMOND and CHARMM molecular simulation programs. All conversions were fully automated from the originally prepared AMBER input files using a combination of the ParmEd and InterMol conversion programs. We find that the energy calculations for all molecular dynamics engines for this molecular set agree to better than 0.1% relative absolute energy for all energy components, and in most cases an order of magnitude better, when reasonable choices are made for different cutoff parameters. However, there are some surprising sources of statistically significant differences. Most importantly, different choices of Coulomb’s constant between programs are one of the largest sources of discrepancies in energies. We discuss the measures required to get good agreement in the energies for equivalent starting configurations between the simulation programs, and the energy differences that occur when simulations are run with program-specific default simulation parameter values. Finally, we discuss what was required to automate this conversion and comparison.
2014
Thermodynamics of coupled protein adsorption and stability using hybrid Monte Carlo simulations
A better understanding of changes in protein stability upon adsorption can improve the design of protein separation processes. In this study, we examine the coupling of the folding and the adsorption of a model protein, the B1 domain of streptococcal protein G, as a function of surface attraction using a hybrid Monte Carlo (HMC) approach with temperature replica exchange and umbrella sampling. In our HMC implementation, we are able to use a molecular dynamics (MD) time step that is an order of magnitude larger than in a traditional MD simulation protocol and observe a factor of 2 enhancement in the folding and unfolding rate. To demonstrate the convergence of our systems, we measure the travel of our order parameter, the fraction of native contacts, between folded and unfolded states throughout the length of our simulations. Thermodynamic quantities are extracted with minimum statistical variance using multistate reweighting between simulations at different temperatures and harmonic distance restraints from the surface. The resultant free energies, enthalpies, and entropies of the coupled unfolding and adsorption processes are in qualitative agreement with previous experimental and computational observations, including entropic stabilization of the adsorbed, folded state relative to the bulk on surfaces with low attraction.
2012
Areas of permanent shadow in Mercury’s south polar region ascertained by MESSENGER orbital imaging
Nancy L Chabot, Carolyn M Ernst, Brett W Denevi, John K Harmon, Scott L Murchie, David T Blewett, Sean C Solomon, and Ellen D Zhong
Radar-bright features near Mercury’s poles have been postulated to be deposits of water ice trapped in cold, permanently shadowed interiors of impact craters. From its orbit about Mercury, MESSENGER repeatedly imaged the planet’s south polar region over one Mercury solar day, providing a complete view of the terrain near the south pole and enabling the identification of areas of permanent shadow larger in horizontal extent than approximately 4 km. In Mercury’s south polar region, all radar-bright features correspond to areas of permanent shadow. Application of previous thermal models suggests that the radar-bright deposits in Mercury’s south polar cold traps are in locations consistent with a composition dominated by water ice provided that some manner of insulation, such as a thin layer of regolith, covers many of the deposits.