Researchers have developed a mathematical framework to analyze the genome and identify the signs of natural selection, unlocking the evolutionary history and future of non-coding DNA.
Researchers have developed a mathematical framework to analyze the genome and identify the signs of natural selection, unlocking the evolutionary history and future of non-coding DNA.
This framework acts as an "oracle" for predicting the evolution of gene regulation, according to a news release.
According to Aviv Regev, a professor of biology at the Massachusetts Institute of Technology and the study’s senior author, scientists can now use the model for their own evolutionary question or scenario and for other problems, like making sequences that control gene expression in desired ways.
He is also excited about the possibilities for machine-learning researchers interested in interpretability, who can ask their questions in reverse, to better understand the underlying biology.
"We now have an 'oracle' that can be queried to ask: What if we tried all possible mutations of this sequence?," Regev said, according to the release. "Or what new sequence should we design to give us a desired expression?
"Scientists can now use the model for their own evolutionary question or scenario, and for other problems like making sequences that control gene expression in desired ways," he added.
The team created a neural network model to predict gene expression. They trained it on a dataset generated by inserting millions of random non-coding DNA sequences into yeast, observing how each random sequence affected gene expression.
They focused on a subset of non-coding DNA sequences called promoters, which serve as binding sites for proteins that switch nearby genes on or off.
The team went on to test the model’s predictive abilities in a variety of ways, including using it to design promoters for synthetic biology applications, scouring scientific papers for fundamental evolutionary questions, and even feeding it a real-world population dataset from one existing study.
This framework is expected to have a major impact on the study of gene expression and the use of machine learning in the field of biology.