Ruprecht-Karls-Universität Heidelberg

Alexander Sasse
CZS SynGen Junior Research Group Leader

ZMBH - Carl-Zeiss-Stiftung Center SynGen
Im Neuenheimer Feld 345
69120 Heidelberg, Germany


office-sasse@zmbh.uni-heidelberg.de

Research summary


Gene expression is the output of multiple processes that integrate transcriptional and post-transcriptional mechanisms. Transcription factors (TFs) bind specific sequence patterns in the DNA (also known as motifs) to regulate transcriptional bursts of nearby or even distant genes. The fate of the resulting mRNAs is determined by post-transcriptional factors such as RNA-binding proteins (RBPs), which bind the transcript in a similarly sequence-dependent manner and orchestrate a cascade of processes that leads to translation of the mRNA into the final protein product. Cis-regulatory elements (CREs) in these genomic sequences determine gene expression levels across time and space.

A fundamental goal of genomics over the last decade has been to map the locations of CREs across tissues, cell types, and contexts to yield a mechanistic understanding of gene regulatory processes (ENCODE Project Consortium, 2012). In combination with recent advances in deep learning algorithms, these datasets represent a rich resource for training deep genomic sequence-to-function (S2F) models (Avsec et al., 2021). These models are trained to take genomic DNA sequence from across the genome as input and to predict experimental measurements such as gene expression as output. S2F models are powerful tools for modern synthetic genomics because they allow us to (1) query in silico how arbitrary genetic variants impact gene expression (Wang et al., 2021; Bohn et al., 2023), (2) gain biological insights into the sequence determinants of gene expression (Miraldi, Chen and Weirauch, 2021), and (3) design regulatory elements in silico with specific cellular properties (Sasse et al., 2023). The goal of our research is to develop accurate machine learning models that enable us to exploit large-scale gene expression and CRE datasets to learn how CREs encode information about gene expression (Sasse, Chikina and Mostafavi, 2024).
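To make the idea of an in-silico variant query concrete, the sketch below scores a reference sequence and a mutated copy with the same model and reports the difference. It is an illustration only: `toy_s2f_score`, the motif `TGCA`, and all function names are hypothetical stand-ins, and a simple motif-matching function takes the place of a trained deep S2F model.

```python
# Illustrative sketch only. A hypothetical motif-matching function stands in
# for a trained sequence-to-function (S2F) model; none of these names come
# from a real library or from the group's own code.

MOTIF = "TGCA"  # hypothetical regulatory motif (assumption for this example)
BASES = "ACGT"


def toy_s2f_score(seq: str) -> float:
    """Stand-in 'expression' prediction: best fraction of motif positions
    matched by any window of the sequence (0.0 to 1.0)."""
    best = 0
    for start in range(len(seq) - len(MOTIF) + 1):
        window = seq[start:start + len(MOTIF)]
        matches = sum(a == b for a, b in zip(window, MOTIF))
        best = max(best, matches)
    return best / len(MOTIF)


def variant_effect(ref_seq: str, pos: int, alt_base: str, model=toy_s2f_score) -> float:
    """Predicted effect of a single-base substitution:
    model(alternative sequence) - model(reference sequence)."""
    if alt_base not in BASES:
        raise ValueError(f"unexpected base: {alt_base!r}")
    alt_seq = ref_seq[:pos] + alt_base + ref_seq[pos + 1:]
    return model(alt_seq) - model(ref_seq)


def saturation_mutagenesis(ref_seq: str, model=toy_s2f_score) -> dict:
    """Score every possible single-base substitution in the sequence."""
    effects = {}
    for pos, ref_base in enumerate(ref_seq):
        for alt in BASES:
            if alt != ref_base:
                effects[(pos, alt)] = variant_effect(ref_seq, pos, alt, model)
    return effects


if __name__ == "__main__":
    reference = "AATGCAAA"
    # Disrupting the first motif position lowers the predicted score.
    print(variant_effect(reference, 2, "A"))
    print(len(saturation_mutagenesis(reference)))
```

With a real S2F model, `model` would be replaced by the trained network's prediction function, and the same reference-versus-alternative comparison would apply per cell type or condition.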

Well-trained sequence-to-function models can be used to design DNA and RNA sequences that perform specific functions, as required in many synthetic biology applications, e.g. gene therapies, production of biomolecules, mRNA-based therapies, and vaccines. Any trained model can serve as an oracle that, in combination with a generative process, can create artificial sequences with a predefined function (Linder et al., 2020; Vaishnav et al., 2022). In the simplest case, the process gradually optimizes sequences until they exhibit predefined features, for example cell-type-specific gene expression. Mechanistic multi-modal S2F models that are trained on various data modalities enable the design of sequences with diverse cell-type-specific phenotypes. For example, S2F models with a mechanistic understanding of transcriptional and post-transcriptional processes can directly help with designing mRNAs that have low degradation rates and high translation initiation rates and therefore persist in specific cell types for long periods to produce large quantities of protein, as required, for example, for mRNA therapies.
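The simplest form of oracle-guided design described above can be sketched as a greedy search over single-base substitutions. This is a minimal sketch under stated assumptions, not the method of the cited papers: `toy_oracle` is a hypothetical motif-matching score standing in for a trained S2F model, and `greedy_design`, `TARGET_MOTIF`, and the stopping threshold are all chosen for illustration.

```python
# Illustrative sketch only. A hypothetical oracle stands in for a trained
# S2F model; greedy single-base substitution is one simple optimization
# strategy, not the generative approach of any cited work.

BASES = "ACGT"
TARGET_MOTIF = "TGCA"  # hypothetical motif the toy oracle rewards (assumption)


def toy_oracle(seq: str) -> float:
    """Stand-in prediction: best fraction of motif positions matched by
    any window of the sequence (0.0 to 1.0)."""
    best = 0
    for start in range(len(seq) - len(TARGET_MOTIF) + 1):
        window = seq[start:start + len(TARGET_MOTIF)]
        best = max(best, sum(a == b for a, b in zip(window, TARGET_MOTIF)))
    return best / len(TARGET_MOTIF)


def greedy_design(start_seq: str, oracle=toy_oracle, target: float = 1.0,
                  max_steps: int = 50):
    """Repeatedly apply the single-base substitution that most improves the
    oracle score; stop at the target, at a local optimum, or after max_steps."""
    seq, score = start_seq, oracle(start_seq)
    for _ in range(max_steps):
        if score >= target:
            break
        best_seq, best_score = seq, score
        for pos in range(len(seq)):
            for base in BASES:
                if base == seq[pos]:
                    continue
                candidate = seq[:pos] + base + seq[pos + 1:]
                candidate_score = oracle(candidate)
                if candidate_score > best_score:
                    best_seq, best_score = candidate, candidate_score
        if best_score <= score:
            break  # no substitution improves the score: local optimum
        seq, score = best_seq, best_score
    return seq, score


if __name__ == "__main__":
    designed, final_score = greedy_design("AAAAAAAA")
    print(designed, final_score)
```

For cell-type-specific design, the oracle would instead return a score such as predicted expression in the target cell type minus predicted expression elsewhere; the search loop itself would stay the same.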

Summary image

a, Left, interpreting personal genomes requires a mechanistic understanding of the different layers of gene regulation and how intermediate processes (chromatin organization, epigenomic modifications, transcriptional regulation, post-transcriptional regulation and so on) are affected by genetic variation. Right, two approaches to genome interpretation: statistical association and cell-type-specific sequence-to-function (S2F) models. b, Sequence-to-function models take genomic DNA as input and learn to predict its functional properties, such as gene expression, in a cell-type-dependent and cell-state-dependent manner. Once trained, these models can be used to predict the impact of arbitrary genetic variations (right) and to derive biological insights into the sequence grammar that determines context-dependent gene regulation (left). Ac, acetylation; Me, methylation; Me3, trimethylation; RBPs, RNA-binding proteins; TAD, topologically associating domain. From Sasse, Chikina and Mostafavi (2024).

References

Avsec, Z. et al. (2021) Effective gene expression prediction from sequence by integrating long-range interactions, Nature methods, 18(10), pp. 1196-1203.

Bohn, E. et al. (2023) A curated census of pathogenic and likely pathogenic UTR variants and evaluation of deep learning models for variant effect prediction, Frontiers in molecular biosciences, 10, p. 1257550.

ENCODE Project Consortium (2012) An integrated encyclopedia of DNA elements in the human genome, Nature, 489(7414), pp. 57-74.

Linder, J. et al. (2020) A Generative Neural Network for Maximizing Fitness and Diversity of Synthetic DNA and Protein Sequences, Cell systems, 11(1), pp. 49-62.e16.

Miraldi, E.R., Chen, X. and Weirauch, M.T. (2021) Deciphering cis-regulatory grammar with deep learning, Nature genetics, pp. 266-268.

Sasse, A. et al. (2023) Benchmarking of deep neural networks for predicting personal gene expression from DNA sequence highlights shortcomings, Nature genetics, 55(12), pp. 2060-2064.

Sasse, A., Chikina, M. and Mostafavi, S. (2024) Unlocking gene regulation with sequence-to-function models, Nature methods, 21(8), pp. 1374-1377.

Vaishnav, E.D. et al. (2022) The evolution, evolvability and engineering of gene regulatory DNA, Nature, 603(7901), pp. 455-463.

Wang, Q.S. et al. (2021) Leveraging supervised learning for functionally informed fine-mapping of cis-eQTLs identifies an additional 20,913 putative causal eQTLs, Nature communications, 12(1), p. 3394.





