Sailfish allows for faster gene studies

Carl Kingsford, an associate professor in CMU's Lane Center for Computational Biology, and his team have developed Sailfish, a computational method that has sped up estimates of gene expression. (credit: Jonathan Carreon/Contributing Editor)

A single cell is home to an unfathomable number of molecular interactions. Trillions of cells work together to produce a well-oiled machine: the human body. Scientists, however, are fascinated by how the smallest deviation from the norm within a cell can have catastrophic consequences.

Comparing RNA expression in different cells can provide a great deal of insight into these minute changes and how they might affect the cell. Carl Kingsford, an associate professor at Carnegie Mellon’s Lane Center for Computational Biology, and his team have developed Sailfish, a new software tool that allows scientists to analyze massive amounts of RNA gene expression data in a matter of minutes, whereas previous methods took several hours. The tool, which lives up to the speedy fish it was named after, was described in a report published online in April 2014 in the journal Nature Biotechnology.

Proteins are the catalytic molecules responsible for what occurs inside a cell at any given time. Researchers want to learn how the presence of specific proteins can impact the characteristics of a cell and how it behaves. For example, if a protein is found to be expressed in a malfunctioning cell, like a cancerous cell, but is absent in a healthy cell, scientists can conclude that the protein is somehow correlated with the aberration.

It is impractical to physically collect all the proteins expressed in a cell and identify them because protein identification technologies are not robust. To avoid physically collecting and characterizing the proteins in a cell, scientists have developed the technique of RNA sequencing (RNA-seq).

RNA is an intermediate molecule between DNA and proteins. DNA is like a recipe book for building all the proteins in an organism. RNA can be considered a messenger molecule that carries the instructions in the DNA to the cell’s protein-building machinery. The RNA molecules present in a cell correspond to the proteins that the cell is actively making and expressing at a given time.

Scientists collect thousands of RNA molecules from inside a cell, ranging from a few to hundreds of bases in length. Then, they use robust sequencing technologies to determine the sequence of bases that make up the molecules. However, deciphering the RNA code to identify the protein it encodes is a challenge in its own right.

Current techniques in computational biology treat this as a string search problem, which involves comparing the RNA sequences to a large database of sequences and identifying the best match. While this has been a hot computational biology research topic in recent years, current methods take about six hours to identify 30 million to 100 million RNA sequences, with the exact time depending on how many sequences there are.

The time-intensive step is in mapping the large string of letters to its complementary sequence in the database. Kingsford explained that “what makes Sailfish so fast is to do away with this mapping step. Instead of matching the whole RNA sequence at one time, Sailfish first breaks down the input into all possible fragments of size k which are appropriately called k-mers.”
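The k-mer step Kingsford describes can be illustrated with a minimal Python sketch. This is only a toy example of the idea, not Sailfish's actual code; the function name `kmers` and the sample read are hypothetical.

```python
def kmers(sequence, k):
    """Return every length-k substring (k-mer) of `sequence`, left to right."""
    if k <= 0:
        raise ValueError("k must be positive")
    # A sequence of length n contains n - k + 1 overlapping k-mers;
    # if the sequence is shorter than k, the range is empty.
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


read = "AUGGCUAC"  # hypothetical RNA read, for illustration only
print(kmers(read, 3))
```

For this eight-base read and k = 3, the sketch produces six overlapping fragments: AUG, UGG, GGC, GCU, CUA, and UAC.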

Next, the program identifies all the different RNA sequences in the database in which each k-mer can be found. The database sequence that contains the most of the input's k-mers is taken as the match, which in turn identifies the protein that the sequence codes for.

Searching for these k-mers is less computationally intensive than searching for a whole sequence because each short k-mer can be looked up directly instead of being aligned against the database, so the task is no longer a string search problem.

The database can be constructed as a minimal perfect hash table that maps each k-mer in the database to all of the database sequences that contain that specific k-mer. In computer science, hash tables are efficient data structures for storing and looking up information.
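To make the lookup idea concrete, here is a rough Python sketch under some loudly stated assumptions: an ordinary Python dictionary stands in for the minimal perfect hash table, the two database sequences are made up, and picking the sequence with the most shared k-mers is a simplification of the matching described above rather than Sailfish's actual procedure. The names `build_index` and `best_match` are hypothetical.

```python
from collections import Counter, defaultdict


def kmers(sequence, k):
    """Return every length-k substring of `sequence`."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def build_index(database, k):
    """Map each k-mer to the set of database sequence names that contain it.

    A plain dict is used here as a stand-in for a minimal perfect hash table.
    """
    index = defaultdict(set)
    for name, seq in database.items():
        for kmer in kmers(seq, k):
            index[kmer].add(name)
    return index


def best_match(read, index, k):
    """Return the name of the sequence sharing the most k-mers with `read`."""
    hits = Counter()
    for kmer in kmers(read, k):
        # .get avoids inserting unseen k-mers into the defaultdict
        for name in index.get(kmer, ()):
            hits[name] += 1
    if not hits:
        return None  # no k-mer of the read occurs anywhere in the database
    return hits.most_common(1)[0][0]


# Made-up sequences for illustration only
DATABASE = {
    "geneA": "AUGGCUACGUAGCUAGGA",
    "geneB": "AUGCCCGGGAAAUUUCCC",
}
INDEX = build_index(DATABASE, k=4)
print(best_match("GCUACGUAGC", INDEX, k=4))
```

Each k-mer lookup is a single dictionary access, so the cost of matching a read grows with the number of k-mers in the read rather than with the length of every database sequence.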

Kingsford noted that breaking down large sequences into smaller k-mers is a technique commonly used in genome assembly. He added, “researchers had strayed away from applying this idea to RNA-seq because they thought breaking the sequence caused a loss of information.”

Instead, Kingsford and his team found that “despite losing some information, you gain in freedom.... In some cases, the analysis will be actually more accurate than older methods.”

Sailfish can be more accurate because it does not need to account for mismatches between the search sequence and the database sequence. DNA and RNA vary somewhat between individuals, so a whole sequence is unlikely to be a perfect match. Older mapping methods slow down considerably as more errors are allowed between the search and database sequences, and some sequences never find a match and are thrown out because they have too many mismatches. Kingsford explained that in the new method, k-mers that contain mismatches are simply not considered. The remaining k-mers, which have no mismatches, are sufficient to match the sequence to the appropriate database entry. Thus, no complete RNA sequence will be thrown out of the analysis simply because of a few mismatches.
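The point about mismatches can be seen in a small Python sketch. It is only a toy illustration under stated assumptions: the reference sequence and both reads are made up, and the helper `matching_kmers` is hypothetical rather than part of Sailfish.

```python
def kmers(sequence, k):
    """Return every length-k substring of `sequence`."""
    return [sequence[i:i + k] for i in range(len(sequence) - k + 1)]


def matching_kmers(read, reference, k):
    """Count the k-mers of `read` that occur exactly in `reference`."""
    reference_kmers = set(kmers(reference, k))
    return sum(1 for kmer in kmers(read, k) if kmer in reference_kmers)


REFERENCE = "AUGGCUACGUAGCUAGGA"  # made-up database sequence
exact_read = "GGCUACGUAGCU"       # copied straight from the reference
variant_read = "GGCUACGAAGCU"     # same read with one base changed (U to A)

# Whole-sequence exact matching fails for the variant read...
print(exact_read in REFERENCE, variant_read in REFERENCE)
# ...but most of its k-mers still match exactly.
print(matching_kmers(exact_read, REFERENCE, 4))
print(matching_kmers(variant_read, REFERENCE, 4))
```

With k = 4, all nine k-mers of the exact read are found in the reference. The single changed base spoils only the four k-mers that overlap it, so five of the variant read's nine k-mers still match and the read is not lost.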

Rob Patro, a post-doctoral research associate working under Kingsford, was responsible for the majority of the coding and testing of the software.

They collaborated with Stephen M. Mount, an associate professor in the University of Maryland’s Department of Cell Biology and Molecular Genetics and its Center for Bioinformatics and Computational Biology.

In November 2013, the group released their code and started a forum for users to discuss their experiences. Kingsford said that users were having “very positive experiences” and that “it is very rewarding to go from an idea to implementation and then to actual users so they can actually do what they want to do.” They spent the past six months working on their publication.

Previously, researchers were limited in their RNA sequencing studies because of the time required for such intensive computation. Kingsford hopes that this speedy software will “let researchers do much more exploratory analysis, test many more conditions getting millions and millions of RNA sequences and comparing them to even larger data sets.” Kingsford and his team will continue to improve their software, possibly enabling other scientists to make discoveries furthering the human understanding of the beautiful complexity of biology.