Dr. Mike Farkas: My name is Mike Farkas. I'm a senior research fellow at the Ocular Genomics Institute at the Massachusetts Eye and Ear Infirmary. Today, I'm going to talk about the preparation of high-quality RNA-Seq libraries for next-generation sequencing
Today, I'm going to talk about the importance of having accurate qualitative and quantitative RNA analyses prior to starting RNA-Seq library prep, and go over the best methods for determining this. I'll then apply these numbers, both the qualitative and the quantitative, to optimizing the RNA-Seq prep. And finally, I'm going to show a few empirical examples of using low-quality RNA and the quality of data that can be obtained from low-quality RNA
As I mentioned, it's important to know the quality prior to starting RNA-Seq library prep. Currently, there are two mainstream methods for determining RNA quality and quantity: Agilent's 2100 Bioanalyzer and the more traditional spectrophotometric methods
Ideally, the Bioanalyzer is a better choice for quantifying RNA because it's more accurate. It's important to note that the accuracy is really dependent upon quantitating your RNA within the quantitative range of the assay. For example, the RNA 6000 Nano chip requires the total RNA concentration to be between 5 and 500 nanograms per microliter
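As a rough illustration of the range check described here, the sketch below tests whether a measured concentration falls inside the 5 to 500 ng/µL window quoted for the RNA 6000 Nano chip. The function and constant names are hypothetical, and the 750 ng/µL value is made up to show an out-of-range case.

```python
# Quantitative range quoted in the talk for the RNA 6000 Nano chip, in ng/uL.
NANO_RANGE_NG_PER_UL = (5.0, 500.0)


def in_quantitative_range(conc_ng_per_ul, assay_range=NANO_RANGE_NG_PER_UL):
    """Return True if the concentration lies within the assay's quantitative range."""
    low, high = assay_range
    return low <= conc_ng_per_ul <= high


# 381 and 219 ng/uL are the concentrations quoted for the two example samples
# later in the talk; 750 ng/uL is a made-up out-of-range value.
for conc in (381.0, 219.0, 750.0):
    status = "within range" if in_quantitative_range(conc) else "outside range"
    print(f"{conc:.0f} ng/uL: {status}")
```

A sample that falls outside the window would presumably be diluted or concentrated and measured again before the number is trusted.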
To determine the quality of RNA, the RNA integrity number (RIN), which is part of the Bioanalyzer output, is much easier to interpret than the absorbance ratio at 260/280. And it is really easier to apply the RIN to optimizing the RNA-Seq library prep than it is to use the 260/280 ratio
So, high-quality RNA typically has a RIN above 8.0 and this yields better transcript coverage once the sequencing has been performed. Ideally, a RIN of 9 to 10 is preferred
So, low-quality RNA is characterized as having a RIN below 8.0, and while you can prepare libraries and sequence them, it can lead to significant three-prime bias, and it requires optimization of certain library prep methods, which we'll discuss
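The RIN cutoff described above can be written as a small helper. This is only a sketch: the function name is hypothetical, and because the talk says "above 8.0" and "below 8.0" without covering the boundary, treating exactly 8.0 as high quality is an assumption.

```python
RIN_HIGH_QUALITY_CUTOFF = 8.0  # threshold quoted in the talk


def classify_rin(rin):
    """Label a RIN as 'high' or 'low' quality using the 8.0 cutoff.

    Assumption: a RIN of exactly 8.0 is counted as high quality.
    """
    if not 1.0 <= rin <= 10.0:
        raise ValueError(f"RIN must be between 1 and 10, got {rin}")
    return "high" if rin >= RIN_HIGH_QUALITY_CUTOFF else "low"


# 6.9 and 6.4 are the RINs of the two example samples shown later in the talk.
for rin in (9.5, 6.9, 6.4):
    print(rin, classify_rin(rin))
```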
There are multiple RNA-Seq prep methods coming online. The more conventional method is an oligo-dT bead-based method, which uses chemical fragmentation and 15 cycles of PCR to produce a library. The less conventional method, which requires more user optimization, is the transposon-based method. This uses a Tn5 transposase to fragment the double-stranded cDNA and 5 cycles of PCR to complete the library
Particularly, for the transposon-based method, it requires a very good starting knowledge of the RNA quality and quantity for properly preparing the library
Briefly, I'm going to discuss the conventional RNA-Seq library prep method. As RNA-Seq becomes more mainstream, the kits are constantly being updated. So, where the starting amount used to be between 1 and 10 micrograms of total RNA, it's now as low as 100 nanograms of total RNA. And this prep is more robust for preparing libraries from low-quality RNA, those with a RIN of less than 8
Here's an example of a conventional library prep using low-quality RNA. As you can see in the top left-hand corner, the electropherogram from the RNA 6000 Nano chip shows that the RNA concentration is 381 nanograms per microliter. It's in the correct range. But the RIN is 6.9. So, it's fairly low-quality RNA. When we prepare the library and analyze the quality, using the DNA 1000 assay, we can see that the library appears to be of decent quality. It is within our size range and of high concentration. However, when we sequence the resulting library, we notice that there is significant three-prime bias, where the five-prime region of the transcript is underrepresented. Typically, depending on the assay, this isn't a problem. However, for RNA-Seq assays designed for quantitative purposes, it can lead to misrepresentation of the quantity of a transcript
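One simple way to put a number on the three-prime bias described here is to compare average read coverage at the two ends of a transcript. The sketch below is an illustrative metric, not the analysis used in the talk. It assumes a hypothetical list of per-base coverage values ordered from the five-prime to the three-prime end, and the 20 percent window is an arbitrary choice.

```python
def three_prime_bias(coverage, fraction=0.2):
    """Ratio of mean coverage in the 3' window to mean coverage in the 5' window.

    `coverage` is per-base read depth ordered 5' -> 3'. A value near 1.0
    suggests even coverage; values well above 1.0 suggest three-prime bias.
    """
    if not coverage:
        raise ValueError("coverage must not be empty")
    window = max(1, int(len(coverage) * fraction))
    five_prime = sum(coverage[:window]) / window
    three_prime = sum(coverage[-window:]) / window
    if five_prime == 0:
        # No reads at the 5' end: bias is unbounded if the 3' end has any coverage.
        return float("inf") if three_prime > 0 else float("nan")
    return three_prime / five_prime


even = [10] * 100               # made-up transcript with flat coverage
skewed = list(range(1, 101))    # made-up transcript with coverage rising toward 3'
print(three_prime_bias(even))
print(three_prime_bias(skewed))
```

With the made-up inputs above, the flat profile gives a ratio of 1.0 and the skewed profile gives a ratio well above 1.0, which is the pattern that distorts transcript quantification.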
The transposon-based RNA-Seq library prep method is optimized to use between 50 nanograms and 500 nanograms of total RNA. And as I mentioned, this is primarily a user-optimized method, where the user creates double-stranded cDNA
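The input window quoted here translates directly into a pipetting volume once the concentration is known. The following sketch shows that arithmetic with a hypothetical helper name. The 100 ng target is an arbitrary example, and 219 ng/µL is the concentration of the low-quality sample discussed later in the talk.

```python
# Input range quoted in the talk for the transposon-based prep, in ng of total RNA.
TRANSPOSON_INPUT_RANGE_NG = (50.0, 500.0)


def volume_for_input(target_ng, conc_ng_per_ul):
    """Return the volume (uL) of stock RNA needed to deliver `target_ng`."""
    low, high = TRANSPOSON_INPUT_RANGE_NG
    if not low <= target_ng <= high:
        raise ValueError(f"target of {target_ng} ng is outside {low}-{high} ng")
    if conc_ng_per_ul <= 0:
        raise ValueError("concentration must be positive")
    return target_ng / conc_ng_per_ul


volume = volume_for_input(100.0, 219.0)
print(f"{volume:.2f} uL of stock for 100 ng")
```

A volume well under a microliter, as in this example, is hard to pipette accurately, so in practice the stock would likely be diluted first. This is also why an accurate concentration measurement matters for this prep.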
In our laboratory, we have optimized the protocol to use anchored oligo-dT primers and select for the mRNA during the first strand synthesis of the cDNA. This allows us to use lower quantities of total RNA and ultimately leads to less loss during the prep relative to the conventional RNA-Seq library prep method. As I mentioned before, the fragmentation is performed by the transposon and this prep is more sensitive to low-quality RNA
Here's an example of a transposon-based library prep using low-quality RNA. Again, the electropherogram in the top middle shows an RNA concentration of 219 nanograms per microliter with an RNA integrity number of 6.4. If we follow the original protocol that was optimized in our lab without any further optimization, we notice, using a DNA High Sensitivity Assay, that our resulting library is overfragmented. It is too small. The peak is too narrow. And it results in very poor sequencing that affects the overall alignment and quality of the data for downstream analyses
However, on the right-hand side, we've reoptimized the protocol to use a higher starting RNA concentration and reduced the amount of transposon used for fragmentation. When we view this library using the DNA High Sensitivity Assay, we see a peak that is in our size range that we optimized the library for and the quality is sufficient for sequencing
Now, I'm going to discuss a few examples of sequencing that was performed in our lab with the goal of finding novel alternative splicing events
The transcriptome of the retina is very important in our laboratory for many reasons. We used the full characterization of the transcriptome for identifying novel alternative splicing events as well as using the full data set to filter exome data in identifying novel variants in patients with uncharacterized disease
So, to perform these sequencing experiments, we characterized the transcriptomes of three normal human retinas. The postmortem time from these tissues was quite significant, which led to low RNA quality with a RIN between 6 and 7. We used the transposon-based prep, where we increased the starting RNA concentration and decreased the transposon to reduce fragmentation. We sequenced the libraries on the HiSeq2000 and obtained 300 million reads. We identified novel alternative splicing events and novel genes. We further validated all of our findings using the SureSelect RNA capture method
To analyze our data, we developed an RNA sequencing pipeline called RUM. RUM uses Bowtie and Blat to map the RNA sequencing reads against the transcriptome and genome, resulting in a higher number of reads aligned. Between 90 and 95 percent of all RNA-Seq reads can be mapped to the transcriptome. The output of RUM provides data for transcripts, exons, and splice junctions
In our analyses, to identify all the novel alternative splicing events, we mapped our reads against all of the empirically determined annotation tracks. This was a database of over 1 million unique exons, and then we analyzed these reads for novel exons, exon skipping, and novel alternative three-prime and five-prime splice sites. And what we found is that there are approximately 20,000 to 30,000 of each of these events that have never been annotated before
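The idea of comparing observed splice junctions against an annotation can be sketched in a few lines. This is a deliberately simplified illustration and not how RUM works. Junctions are hypothetical (chromosome, intron start, intron end) tuples, strand is ignored, and the category names are invented for this example.

```python
def build_site_index(annotated_junctions):
    """Index annotated junctions and their individual splice sites."""
    junctions = set(annotated_junctions)
    starts = {(chrom, start) for chrom, start, _ in junctions}
    ends = {(chrom, end) for chrom, _, end in junctions}
    return junctions, starts, ends


def classify_junction(junction, index):
    """Place an observed junction into a coarse novelty category."""
    junctions, starts, ends = index
    chrom, start, end = junction
    if junction in junctions:
        return "annotated"
    known_sites = ((chrom, start) in starts) + ((chrom, end) in ends)
    if known_sites == 2:
        # Both sites are known but never paired: candidate exon skipping.
        return "novel_pairing_of_known_sites"
    if known_sites == 1:
        # One known site: candidate alternative splice site or novel exon edge.
        return "one_novel_splice_site"
    return "two_novel_splice_sites"


# Made-up annotation: two introns flanking one exon on chr1.
index = build_site_index([("chr1", 100, 200), ("chr1", 300, 400)])
for observed in [("chr1", 100, 200), ("chr1", 100, 400),
                 ("chr1", 100, 250), ("chr1", 150, 250)]:
    print(observed, classify_junction(observed, index))
```

A real analysis would also need strand information to tell three-prime from five-prime splice sites, plus read-count thresholds to separate genuine events from alignment noise.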
To briefly show a few important examples of novel alternative splicing events, particularly novel exons, the top picture shows a novel three-prime UTR, which turns out to be the major isoform of this particular gene. What we're looking at are the green bars. The green bars at the top represent the splice junctions with the count of the number of reads that have spanned that junction. So, by using the splice junction information with the red coverage plot information, we can identify this novel UTR
In the bottom picture--and this is really where the power of RUM is important--using the same green splice junction information, we can pick out a novel exon between two annotated exons. That is clearly the minor isoform. And we can detect that, even without the appropriate coverage plot data
We found that the novel feature was more abundant than its annotated counterpart in approximately 5 percent of the cases that we identified, and at least as abundant as the annotated counterpart in about 9 percent. So, by using RNA-Seq, we're able to show that the novel features that we're identifying potentially account for biologically relevant transcripts
Finally, we detected nearly 100 intergenic novel genes in the human retina. These genes were multiexonic. They spanned approximately 3 to 7 exons. The transcript size ranged from 900 to 10,000 base pairs. Many of them were alternatively spliced. However, most appeared to be non-coding. We believe these non-coding transcripts are likely to be lincRNAs, long, intergenic, non-coding RNAs. We validated 10 full-length transcripts and found multiple isoforms for each of the transcripts that we empirically tested
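The comparison behind these percentages can be illustrated with a short sketch. It assumes a hypothetical list of (novel read count, annotated read count) pairs, one per novel feature, and the counts below are made up purely to show the calculation.

```python
def abundance_summary(pairs):
    """Return the fractions of novel features that are more abundant than,
    and at least as abundant as, their annotated counterparts.

    `pairs` is a list of (novel_reads, annotated_reads) tuples.
    """
    if not pairs:
        raise ValueError("pairs must not be empty")
    total = len(pairs)
    more = sum(1 for novel, annotated in pairs if novel > annotated)
    at_least = sum(1 for novel, annotated in pairs if novel >= annotated)
    return more / total, at_least / total


# Made-up read counts for four novel features and their annotated counterparts.
toy_pairs = [(3, 120), (45, 45), (200, 15), (1, 60)]
more, at_least = abundance_summary(toy_pairs)
print(f"more abundant: {more:.0%}, at least as abundant: {at_least:.0%}")
```

The second fraction always includes the first, which is why the "at least as abundant" figure in the talk is the larger of the two.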
Finally, the validation of RNA-Seq data is often done in a low-throughput manner - designing primers across novel features, Sanger sequencing these, and determining if they're real. Technically, this is difficult. And the low-throughput nature of it doesn't make it representative for a whole RNA-Seq dataset
We used the SureSelect Targeted RNA Capture System to independently validate 15,000 novel events from our RNA dataset, which included exon skipping, novel exons, and the novel alternative three-prime and five-prime splice sites. We chose features with a read depth as low as one, and we sequenced these on the HiSeq2000.
Following sequencing, we reanalyzed our RNA-Seq data through RUM and we found that 99 percent of the novel events that we tried to validate were in fact validated. By looking at 15,000 of the novel events, which is approximately one-quarter of the novel events in our dataset, we feel that this is more representative for the whole RNA-Seq dataset and can be applied across the dataset. And it really shows the power of RNA-Seq to identify novel events and find those that are of very low frequency
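The figures quoted here can be turned into rough counts. The sketch below takes the stated numbers at face value (15,000 events tested, 99 percent validated, and the tested set being about one-quarter of the novel events), so the results are approximations implied by the talk rather than reported values.

```python
events_tested = 15_000       # novel events put through capture validation
percent_validated = 99       # validation rate quoted in the talk
tested_share_inverse = 4     # tested set is "approximately one-quarter" of the total

# Integer arithmetic avoids floating-point rounding in the counts.
validated = events_tested * percent_validated // 100
not_validated = events_tested - validated
implied_total = events_tested * tested_share_inverse

print(f"validated: about {validated:,}")
print(f"not validated: about {not_validated:,}")
print(f"implied novel events in dataset: about {implied_total:,}")
```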
So, in summary, while RNA quality and quantity are important for proper preparation of RNA-Seq libraries, it is not absolutely necessary to have the highest-quality RNA to prepare a high-quality RNA-Seq library. What it does require is accurate quantity measurements, so that the RNA-Seq library prep method can be optimized for the starting RNA
Thank you very much for listening to my talk. Any questions or comments can be sent to me by e-mail, and the protocols will be available on our website
Thank you