Machine learning reveals unexpected genetic roots of cancers, autism and other disorders
In the decade since the genome was sequenced in 2003, scientists, engineers and doctors have struggled to answer an all-consuming question: Which DNA mutations cause disease?
A new computational technique developed at the University of Toronto may now be able to tell us.
A Canadian research team led by Professor Brendan Frey has developed the first method for ‘ranking’ genetic mutations based on how living cells ‘read’ DNA, revealing how likely any given alteration is to cause disease. They used their method to discover unexpected genetic determinants of autism, hereditary cancers and spinal muscular atrophy, a leading genetic cause of infant mortality.
Their findings appear in the December 18 issue of the leading journal and are already grabbing headlines.
Think of the human genome as a mysterious text, made up of three billion letters.
“Over the past decade, a huge amount of effort has been invested into searching for mutations in the genome that cause disease, without a rational approach to understanding why they cause disease,” said Frey. “This is because scientists didn’t have the means to understand the text of the genome and how mutations in it can change the meaning of that text.”
It’s a puzzle that Frey points out was captured by biologist Eric Lander of the Massachusetts Institute of Technology in a famous quote: “Genome. Bought the book. Hard to read.”
What was Frey’s approach? Scientists know that certain sections of the text, called exons, describe the proteins that are the building blocks of all living cells. What wasn’t appreciated until recently is that other sections, called introns, contain instructions for how to cut and paste exons together, determining which proteins will be produced. This ‘splicing’ process is a crucial step in the cell’s process of converting DNA into proteins, and its disruption is known to contribute to many diseases.
(Image at right: artist’s conception of how disease-causing genetic mutations reside within long chains of DNA; photo by Jessica Wilson.)
Most research into the genetic roots of disease has focused on mutations within exons, but increasingly scientists are finding that diseases can’t be explained by these mutations. Frey’s team took a completely different approach, examining changes to text that provides instructions for splicing, most of which is in introns.
Frey’s team used a new technology called ‘deep learning’ to teach a computer system to scan a piece of DNA, read the genetic instructions that specify how to splice together sections that code for proteins, and determine which proteins will be produced.
Unlike other machine learning methods, deep learning can make sense of incredibly complex relationships, such as those found in living systems in biology and medicine.
“The success of our project relied crucially on using the latest deep learning methods to analyze the most advanced experimental biology data,” said Frey, whose team included members from U of T’s Faculty of Applied Science & Engineering, Faculty of Medicine and the Terrence Donnelly Centre for Cellular and Biomolecular Research, as well as Microsoft Research and the Cold Spring Harbor Laboratory.
“My collaborators and our graduate students and postdoctoral fellows are world-leading experts in these areas.”
Once they had taught their system how to read the text of the genome, Frey’s team used it to search for mutations that cause splicing to go wrong. They found that their method correctly predicted 94 per cent of the genetic culprits behind well-studied diseases such as spinal muscular atrophy and colorectal cancer, but more importantly, made accurate predictions for mutations that had never been seen before.
They then launched a huge effort to tackle a condition with complex genetic underpinnings: autism spectrum disorder.
“With autism there are only a few dozen genes definitely known to be involved and these account for a small proportion of individuals with this condition,” said Frey.
In collaboration with Dr. Stephen Scherer, senior scientist and director of the University of Toronto McLaughlin Centre and The Centre for Applied Genomics at SickKids, Frey’s team compared mutations discovered in the whole genome sequences of children with autism, but not in controls. Following the traditional approach of studying protein-coding regions, they found no differences. However, when they used their deep learning system to rank mutations according to how much they change splicing, surprising patterns appeared.
“When we ranked mutations using our method, striking patterns emerged, revealing 39 novel genes having a potential role in autism susceptibility,” Frey said.
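For readers who think in code, the idea of ranking mutations by how much they change splicing can be sketched in a few lines of Python. This is only an illustration, not the team's system: the `Mutation` class, the two predicted splicing fields and every number below are hypothetical placeholders, and in practice the predictions would come from a trained model rather than being typed in by hand.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mutation:
    """A variant with hypothetical predicted splicing levels (0.0 to 1.0)."""
    name: str
    splicing_reference: float  # predicted exon inclusion without the mutation
    splicing_mutant: float     # predicted exon inclusion with the mutation


def splicing_change(m: Mutation) -> float:
    """Absolute predicted change in splicing caused by the mutation."""
    return abs(m.splicing_mutant - m.splicing_reference)


def rank_mutations(mutations: list[Mutation]) -> list[Mutation]:
    """Sort mutations from largest to smallest predicted splicing change."""
    return sorted(mutations, key=splicing_change, reverse=True)


# Made-up placeholder values for illustration, not data from the study.
candidates = [
    Mutation("variant_a", 0.90, 0.85),
    Mutation("variant_b", 0.80, 0.20),
    Mutation("variant_c", 0.50, 0.75),
]

ranked = rank_mutations(candidates)
for m in ranked:
    print(f"{m.name}: change = {splicing_change(m):.2f}")
```

In this toy example the variant with the largest predicted shift in splicing rises to the top of the list, which is the sense in which a ranking can point researchers toward the mutations most worth a closer look.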
And autism is just the beginning: this mutation indexing method is ready to be applied to any number of diseases, and even non-disease traits that differ between individuals.
Dr. Juan Valcárcel Juárez, a researcher with the Center for Genomic Regulation in Barcelona, Spain, who was not involved in this research, said: “In a way it is like having a language translator: it allows you to understand another language, even if full command of that language will require that you also study the underlying grammar. The work provides important information for personalized medicine, clearly a key component of future therapies.”