SocratesJedi/Creative Commons, cropped

Harvard team uses deep representation learning to study gene repression in proteins

Computational predictions for how a genetic variant will affect a protein's function are very important. For example, this can help determine whether a specific variant is causing a disease.


Marjorie Hecht
Sep 11, 2021

A team of scientists at Harvard University and MIT used a type of machine learning called deep representation learning to improve the understanding of gene expression regulation in proteins and create a model for biologists to better predict repression by proteins.

Their research is the subject of a July 6 research article in the Proceedings of the National Academy of Sciences (PNAS). 

Machine learning is a type of artificial intelligence that can automatically learn common patterns from large raw data sets and thus improve predictive results in scientific studies.

The team set out to use machine learning to analyze the repression function for tens of thousands of mutants of the E. coli bacteria protein LacI--43,669 variants, to be exact. LacI is used as a model for studying protein function in genetics, biochemistry, molecular biology and other fields.

Their goal was to develop a deep neural network for predicting the gene repression function of LacI when trained using their experimental data.

Gene repression involves turning off particular genes that are needed for cell function. The research team showed that fine-tuning deep representation learning could improve the prediction of repression function in genetic variants well beyond current approaches.

Gauging and guiding

Lead author Alexander S. Garruss said that predicting the functional effects of a genetic variant is helpful for two reasons, "to gauge and to guide": to gauge the accuracy of current understanding and to guide the focus of future experiments.

Some areas of proteins operate by strict rules, and others operate more loosely, but the interplay between them is largely unknown. 

"The more strict rules normally establish the specific function of the protein whereby the more lax rules allow biological systems a robustness to change," he said. 

Garruss was a co-author of a 2016 study (Taylor et al.) looking at aspects of LacI function, and the current study carries forward that work. 

"The LacI protein has two primary functions: the binding of DNA to regulate gene expression and the binding of a small molecule to switch genes on if the small molecule is present in the cell," he told Current Science Daily. "Our previous study focused on the specific properties of the LacI protein that govern which small molecule the protein binds."

"In that study, we looked at four different small molecules to investigate which genetic variants could be engineered to bind a new small molecule. The study highlighted key positions and mutations that could, in fact, change the LacI protein's small molecule specificity.

"The current study investigates the other function of LacI, the binding of DNA," Garruss explained. 

The new study investigated the positions and mutations that result in the loss of the ability to bind DNA. "Many genetic variants, previously known and many newly uncovered, completely destroyed this ability to bind DNA. Other genetic variants were well tolerated and led to no change in the DNA-binding function. We considered ~50,000 variants in total, to more comprehensively understand the observed LacI variant effect.

"We found several interesting features of the LacI protein by looking at the sequence-structure-function relationship," Garruss said. "Looking at the sequences with various functional values, we noticed intriguing patterns, such as when two variant sequences were tested separately, they functioned poorly. However, when the variants were combined into a single sequence, they compensated for each other's failings and the combined genetic variant functioned normally."

A new view of LacI's role

The study produced new findings about the role of LacI and its C-terminal domain. The C-terminus is the carboxyl end of an amino acid chain.

"Two key findings come to mind," Garruss said. "First, the C-terminal domain of LacI can regulate the binding of DNA across a long distance, and the best approach to understand LacI genetic variation is to use the shared wisdom of millions of other proteins."

He noted that the "C-terminal domain is often ignored when studying LacI because it had been shown to be non-essential. Our work shows when it is included, as it is in nature, that this domain can actually stabilize function but also introduces a liability."

He elaborated on this: "Dozens of just single amino acid changes in the C-terminal domain can destroy DNA binding function, a phenomenon still not completely understood.

"The second major finding is that a deep representation learning approach most accurately captures the LacI genetic variation effect. We compared state-of-the-art neural networks that utilized evolutionary histories, all-atom structural models, and sequence-only approaches, and found the representation learning approach to be superior. This approach uses tens of millions of other proteins to learn the general patterns and composition of protein sequences."

Further research

"We are excited that this domain could open new mysteries of LacI as well as opportunities to engineer LacI using the C-terminal domain exclusively."

Summing up the results of the study, Garruss said: "A major advantage of deep representation learning is that it is useful even for sequences that have never been observed in nature."

"When we push these proteins beyond their natural history, we require guidance from computers that have witnessed millions of natural proteins. The idea is that each natural protein contains wisdom about stability, domain organization and sequence composition.

"The purpose of deep representation learning is to find a way to encapsulate that wisdom numerically such that it can be used to predict the effect of novel genetic variants."

