Research summary
Gene expression is the output of multiple processes that integrate transcriptional and post-transcriptional mechanisms. Transcription factors (TFs) bind specific sequence patterns in the DNA (motifs) to regulate transcriptional bursts of nearby or even distant genes. The fate of the resulting mRNAs is determined by post-transcriptional factors, such as RNA-binding proteins (RBPs), which bind the transcript in a similarly sequence-dependent manner and orchestrate the cascade of processes that leads to translation of the mRNA into the final protein product. Cis-regulatory elements (CREs) in these genomic sequences determine gene expression levels across time and space. A fundamental goal of genomics in the last decade has been to map the locations of CREs across tissues, cell types, and contexts to yield a mechanistic understanding of gene regulatory processes (ENCODE Project Consortium, 2012). Combined with recent advances in deep learning algorithms, these data sets are a rich resource for training deep genomic sequence-to-function (S2F) models (Avsec et al., 2021). These models take genomic DNA sequence as input and are trained to predict experimental measurements, such as gene expression, as output. S2F models are powerful tools for modern synthetic genomics because they allow us to: (1) query in silico how arbitrary genetic variants impact gene expression (Wang et al., 2021; Bohn et al., 2023), (2) gain biological insights into the sequence determinants of gene expression (Miraldi, Chen and Weirauch, 2021), and (3) design in silico regulatory elements with specific cellular properties (Sasse et al., 2023). The goal of our research is to develop accurate machine learning models that exploit large-scale gene expression and CRE datasets to learn how CREs encode information about gene expression (Sasse, Chikina and Mostafavi, 2024).
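To make the idea of querying variant effects in silico concrete, the following is a minimal sketch in Python. It does not use a trained S2F model: `toy_s2f_model` is a hypothetical stand-in that scores a sequence by its best match to a made-up 4-bp motif, and the motif weights, function names, and reference sequence are all assumptions chosen for illustration. The variant effect is computed as the difference between the prediction for the alternative and the reference sequence, which is one common way such queries are framed.

```python
# Toy stand-in for a trained S2F model: a position weight matrix (PWM) scan.
# The motif ("TGAC") and its weights are hypothetical, for illustration only.
BASES = "ACGT"

TOY_PWM = [
    {"A": -1.0, "C": -1.0, "G": -1.0, "T": 2.0},
    {"A": -1.0, "C": -1.0, "G": 2.0, "T": -1.0},
    {"A": 2.0, "C": -1.0, "G": -1.0, "T": -1.0},
    {"A": -1.0, "C": 2.0, "G": -1.0, "T": -1.0},
]


def toy_s2f_model(seq):
    """Return the best motif match score in seq (stand-in for predicted activity)."""
    width = len(TOY_PWM)
    scores = [
        sum(TOY_PWM[j][seq[i + j]] for j in range(width))
        for i in range(len(seq) - width + 1)
    ]
    return max(scores)


def variant_effect(model, ref_seq, pos, alt_base):
    """Predicted effect of a single-nucleotide variant: f(alt) - f(ref)."""
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(alt_seq) - model(ref_seq)


def in_silico_mutagenesis(model, ref_seq):
    """Score every possible single-base substitution in ref_seq."""
    effects = {}
    for pos, ref_base in enumerate(ref_seq):
        for alt in BASES:
            if alt != ref_base:
                effects[(pos, ref_base, alt)] = variant_effect(model, ref_seq, pos, alt)
    return effects


ref = "AATGACAA"  # hypothetical reference sequence containing the toy motif
effects = in_silico_mutagenesis(toy_s2f_model, ref)

# Positions where at least one substitution lowers the predicted activity.
sensitive_positions = sorted({pos for (pos, _, _), e in effects.items() if e < 0})
print(sensitive_positions)
```

In this toy setting the sensitive positions coincide with the motif, which mirrors how variant effect maps from a real model can point to the sequence determinants of its predictions.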
Well-trained sequence-to-function models can be used to design DNA and RNA sequences that perform specific functions, as required in many synthetic biology applications, e.g. gene therapies, production of biomolecules, mRNA-based therapies, and vaccines. Any trained model can serve as an oracle that, in combination with a generative process, can create artificial sequences with a predefined function (Linder et al., 2020; Vaishnav et al., 2022). In the simplest case, the process gradually optimizes sequences until they exhibit predefined features, for example cell-type-specific gene expression. Mechanistic multi-modal S2F models that are trained on various data modalities enable the design of sequences with diverse cell-type-specific phenotypes. For example, S2F models with a mechanistic understanding of transcriptional and post-transcriptional processes can directly help with designing mRNAs that have low degradation rates and high translation initiation rates, and that therefore persist in specific cell types for long periods and produce large quantities of protein, as required, for example, for mRNA therapies.
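The simplest case described above, gradually optimizing a sequence against an oracle, can be sketched as a greedy search over single-base substitutions. This is an illustrative assumption, not the method of the cited works: `toy_oracle` is a hypothetical specificity score (match to one arbitrary motif minus match to another), standing in for a trained S2F model that predicts activity in a target and an off-target cell type. The motif strings, start sequence, and target score are placeholders.

```python
BASES = "ACGT"


def best_match(seq, motif):
    """Highest number of matching bases between motif and any window of seq."""
    w = len(motif)
    return max(
        sum(a == b for a, b in zip(seq[i:i + w], motif))
        for i in range(len(seq) - w + 1)
    )


def toy_oracle(seq):
    """Hypothetical specificity score: target motif match minus off-target motif match."""
    return best_match(seq, "TGACTCA") - best_match(seq, "GATAAG")


def greedy_design(oracle, start_seq, target_score, max_rounds=50):
    """Apply the best single-base substitution per round until the target is met."""
    seq = start_seq
    history = [oracle(seq)]
    for _ in range(max_rounds):
        if history[-1] >= target_score:
            break
        # Enumerate every sequence that differs from seq at exactly one position.
        candidates = [
            seq[:i] + b + seq[i + 1:]
            for i in range(len(seq))
            for b in BASES
            if b != seq[i]
        ]
        best = max(candidates, key=oracle)
        if oracle(best) <= history[-1]:
            break  # local optimum: no single substitution improves the score
        seq = best
        history.append(oracle(seq))
    return seq, history


start = "GATAAGGATAAG"  # hypothetical starting sequence
designed, history = greedy_design(toy_oracle, start, target_score=4)
print(designed, history)
```

Greedy search can stall in local optima, which is one reason an oracle is often paired with a richer generative process; the sketch only illustrates the oracle-in-the-loop idea.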
a, Left, interpreting personal genomes requires a mechanistic understanding of the different layers of gene regulation and how intermediate processes (chromatin organization, epigenomic modifications, transcriptional regulation, post-transcriptional regulation and so on) are affected by genetic variation. Right, two approaches to genome interpretation, through statistical association and cell-type-specific sequence-to-function (S2F) models. b, Sequence-to-function models take as input genomic DNA and learn to predict its functional properties such as gene expression in a cell-type-dependent and cell-state-dependent manner. Once trained, these models can be used to predict the impact of arbitrary genetic variations (right) and to derive biological insights into the sequence grammar that determines context-dependent gene regulation (left). Ac, acetylation; Me, methylation; Me3, trimethylation; RBPs, RNA-binding proteins; TAD, topologically associating domain. From (Sasse et al., 2024).
References
Avsec, Z. et al. (2021) Effective gene expression prediction from sequence by integrating long-range interactions, Nature Methods, 18(10), pp. 1196-1203.