My name is Wilfred van IJcken, and I'm heading the genomics core facility at Erasmus Medical Center in the Netherlands. I'm sorry; a lot of Netherlands people around here in the talks. But today I will share with you some preliminary results we obtained by automating exome sequencing with the latest version, version four, all done for the ERF study.
And the ERF study is the Erasmus Rucphen Family study, which is quite a big cohort of people. Actually, it comprises five villages, and the largest one is called Rucphen; that's why it has its name. And it's an interesting population because it's a genetically isolated population in that region. In total, it's about 20,000 cases. And we are here today talking about 3,000 cases, and all these 3,000 people are descendants of only 20 people, so that's quite a lot of descendants there.
So, we have a lot of data about them. We have genealogical databases of church records going back even to 1650, so basically we know everything about where they came from, who their parents are, etc., and whether that is all correct or not. It seems not always to be the case, so you have to be careful with that.
And we have a very good cooperation with our Erasmus Medical Center healthcare professionals, and also with local healthcare professionals, so we get in all kinds of data about what they eat, what they do, if they're sick, if they get an X-ray; everything is in there, so we have an enormous wealth of information about these persons. And we have hundreds of different traits sampled for all these participants.
And our epidemiology department was very interested to do a next-generation association study on this cohort for complex traits. Well, you will probably all know this kind of research. What happens here is that we look at the extremes, hopefully both of them, the low and the high, and start comparing a specific quantitative trait to see if we can find any variants that have a higher frequency in one of these extremes of the distribution. So, basically, you're looking for individuals with rare variants, depicted with the red crosses over here, in the same gene, that are concentrated in one of the two extremes for a certain trait.
Well, because exome sequencing looks mainly at mutations that occur in the coding region, we expect a large effect on the phenotype as well. So, we thought it would be a good idea to start sequencing about 1,500 participants of this cohort.
Well, then the first question you have to answer is how do you do that? How do you select the exome here? We all know there are different companies out there selling different kits, and we already had some previous experience with the Agilent SureSelect version two. We were quite happy with it, so we were looking at that again. And at that time, we thought, well, it would be nice if you could improve even on that, so we contacted Agilent, and Agilent said they were busy working on a new version. So, we said, "Well, we are very interested; let's try this new version on this cohort."
That's not all, because the SureSelect platform is also quite flexible, so it's easy to add probes later on. You can automate it, which was a key thing for us as we had to do 1,500 exomes. The protocol is short, because we were a little bit under time pressure from our epidemiology department. They seem to want to have the results quite soon, so a 24-hour hybridization time is quite attractive.
But maybe the most important thing was the economics: you don't want to sequence a lot, so you want an efficient sequencing process. And for this new version, you only need four gigabase pairs of sequence to do this. So, those were the reasons why we chose SureSelect version four.
Well, then you have to set up quite a lot of infrastructure, because if you start processing 1,500 samples you can make mistakes, and it seems you will make mistakes quite often.
So, what happens is, we got a request from the epidemiology department. They want to have a correlation between the phenotype and the genotype. They have an ID attached to every sample, and they have isolated genomic DNA, so that was our starting point.
So, as a core facility, we take in all those samples. Then we do the exome selection and the sequencing. We align all the data, call all the variants, both SNPs and indels, and then we make reports that show whether the capture and the sequencing were successful.
And then, in the next phase, the epidemiology department will continue with the further analysis, looking at all the traits that we have covered and whether they correlate with the different variants that are found. And of course, they have to validate them in the end. And with that, we hope to find new genotype-phenotype relations.
One thing we realized was that if you start doing a large number of samples, you need some kind of system that tracks all of that. So, we have built our own LIMS, and everything is in there: everything we order, up to everything we deliver to our customers, we track in this LIMS. It means that with one click I can see where all my samples are and how far along they are. And that's quite easy, especially if you have dropouts; then you at least know where they are, and you don't start to miss samples and forget about them.
When we take samples in, we of course need correct information. We have some information about the PIs behind them, a sample name, and what type of sample it is. We take a volume, and, also important, we ask them to report the sex, which has already been determined, for instance, by SNP arrays.
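The intake record can be sketched as a small data structure; the field names here are illustrative assumptions, not our actual LIMS schema:

```python
from dataclasses import dataclass

# Hypothetical sketch of a per-sample intake record; field names are
# illustrative, not the actual LIMS schema.
@dataclass
class SampleIntake:
    sample_id: str      # the ID the epidemiology department attached
    pi: str             # principal investigator behind the sample
    sample_type: str    # e.g. "genomic DNA"
    volume_ul: float    # volume taken in
    reported_sex: str   # "M" or "F", pre-determined e.g. by SNP array

def sex_concordant(sample: SampleIntake, inferred_sex: str) -> bool:
    """Later check: does the sequencing-inferred sex match the reported sex?"""
    return sample.reported_sex == inferred_sex

s = SampleIntake("ERF-0001", "Epidemiology", "genomic DNA", 50.0, "F")
print(sex_concordant(s, "F"))  # True
```

Recording the array-determined sex up front is what makes the sex verification later in the talk possible.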
Well, then you have to set up your systems in the lab, so in the summer of 2011 we started buying and agreeing on getting new systems in. Well, you need shearing systems; we use Covaris for that. We use the TapeStation to do the quantification of all the material. We use the Bravo system to automate all the captures and all the sample prep. Then, we sequence on Illumina systems, and we have two HiSeqs and one MiSeq in our laboratory. And we also have a lot of storage and compute clusters.
I don't know if you have experience with Illumina software, but if you start indexing samples, for instance, it's quite dramatic if you want to automate that. So, what we have done is redesign that whole process. On the left side of this slide, you see that we have redesigned the whole demultiplexing software of Illumina and combined that with the alignments, for which we use BWA, and some QC reporting. All of that was written up in a recent publication by Bravo et al., so if you want the details you can look it up over there. Then we continue with Picard for coverage performance, GATK tools to call the SNPs and the indels, and in the end we annotate all those SNPs. So, for all SNPs we know in which gene and which exon they are, and what kind of mutation is in there.
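The chain of stages just described can be sketched as a list of commands. Tool names match the talk (BWA, Picard, GATK); the exact command lines, flags, and file names are illustrative assumptions, not our production configuration, and `annotate_vcf` is a hypothetical stand-in for the annotation step:

```python
# Hedged sketch of the analysis chain: alignment -> sorting -> on-target
# metrics -> SNP/indel calling -> annotation. Commands are illustrative.
def build_pipeline(sample: str, ref: str, targets: str) -> list[str]:
    sam, bam, vcf = f"{sample}.sam", f"{sample}.bam", f"{sample}.vcf"
    return [
        f"bwa mem {ref} {sample}.R1.fastq {sample}.R2.fastq > {sam}",
        f"picard SortSam INPUT={sam} OUTPUT={bam} SORT_ORDER=coordinate",
        f"picard CalculateHsMetrics INPUT={bam} OUTPUT={sample}.hs_metrics "
        f"BAIT_INTERVALS={targets} TARGET_INTERVALS={targets}",
        f"java -jar GenomeAnalysisTK.jar -T UnifiedGenotyper -R {ref} -I {bam} -o {vcf}",
        f"annotate_vcf {vcf}",  # hypothetical: adds gene, exon, and effect per SNP
    ]

for cmd in build_pipeline("ERF-0001", "hg19.fa", "v4_targets.interval_list"):
    print(cmd)
```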
And in the end, we deliver basically not too much. We deliver BAM files, a coverage report per sample just telling customers the quality of the capture and the sequencing, and annotated VCF files for all SNPs and indels.
So, the first thing when we received our instruments, somewhere in September/October of 2011, was to look at the automation; we wanted to validate it. What you see here, and it's almost off the screen there, on the left side, is that on the top we have version two manual, and on the bottom version two automated, so both still the previous exome capture version.
And then, I plotted the fraction of reads on target on the Y axis, and on the X axis the range around the target regions in base pairs, from zero up to 200 base pairs. What you can clearly see is that at a range of zero, on the left side, in the manual preparation a little bit less, let's say 75 percent, of all the reads are on target. And if you increase the range outside of the target, you see that up to, let's say, 85 percent of all your reads will be on target. If you go one panel down, to the bottom left side, you see the automated version, and there is already a clear increase in the fraction of reads on target at the zero range, because there above 80 percent of all the reads are on target. So, for us, this means that we get a higher percentage of reads on target if we use automation compared to manual sample preparation.
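The metric in these plots can be sketched as follows. This toy version counts a read as on target when it falls entirely within a target interval extended by the flanking range (real tools typically count any overlap), and the intervals are made-up numbers:

```python
# Fraction of reads on target, for a target list extended by `flank` bp
# on each side; reads and targets are (start, end) tuples. Toy data only.
def fraction_on_target(reads, targets, flank=0):
    def on_target(r):
        return any(t0 - flank <= r[0] and r[1] <= t1 + flank
                   for t0, t1 in targets)
    return sum(on_target(r) for r in reads) / len(reads)

targets = [(1000, 1200), (5000, 5300)]
reads = [(1010, 1110), (1190, 1290), (4000, 4100), (5100, 5200)]
print(fraction_on_target(reads, targets, flank=0))    # 0.5
print(fraction_on_target(reads, targets, flank=100))  # 0.75
```

Sweeping `flank` from 0 to 200 reproduces the shape of the curves on the slide: the fraction rises as the allowed range around the targets grows.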
On the right side you see plots detailing the coverage per base. What you can basically see is that, if you look at the median values for instance, they go up: for the same amount of sequencing, having more reads on target gives you a higher average coverage.
Here on this slide I show the comparison of version two to version four. On the top panels there's version two automated, and on the bottom panels version four automated. And what you can see is that for the reads on target there's basically not a big difference between version two and version four: about 80 percent of all the reads are on target, climbing up to approximately 90 percent.
But there is a big difference if you look at the panels on the right side, because if you look at the coverage there, you basically see that for version four, on the left side of the graph, we have fewer peaks with a very low coverage. So, the new design of version four eliminates a lot of target regions that were badly covered in version two. And that helps, because it means you get an easier, higher coverage on each patient.
Also in this comparison there is a different amount of sequencing: on the top panel we have used five gigabase pairs, while on the bottom panel we have used four gigabase pairs. And it's very nicely shown by the mode over here: the most frequent base in version four has a coverage of about 24 times, while in version two that was zero, because the most frequent base there was basically not covered.
So, then we wanted to see if we could also compare the captured exome data with some SNP array data. On this slide, I present some preliminary data, because we are not sure everything is completely right here yet, about the SNP comparison. If you compare version two to version four, we see that we have about 23,000 SNPs which are concordant, about 2,500 which are concordant between version two, version four and the array, but both version two and version four have additional SNPs which are specific to the different platforms, and only a minority, 61 and 78, is concordant with the array and not present in the other SureSelect version.
I have to say that of the 3,800 SNPs that are specific to the array, we think that about 3,700 will also have a match on the capture platforms. But we didn't look into the AA and BB calls enough yet in our SNP process, so we still have to account for that. Probably, we expect that in the end more than 98 percent of all the SNPs on the array are concordant with both platforms here.
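The three-way comparison boils down to set operations on the per-platform call sets; a minimal sketch with toy rs-IDs standing in for the real calls:

```python
# Toy call sets for version 2, version 4, and the SNP array; the rs-IDs
# and counts are made up, not the cohort's numbers.
v2 = {"rs1", "rs2", "rs3", "rs5"}
v4 = {"rs1", "rs2", "rs4", "rs5"}
array = {"rs1", "rs2", "rs3", "rs4", "rs6"}

all_three   = v2 & v4 & array    # concordant across both platforms and the array
array_only  = array - v2 - v4    # may still match once AA/BB calls are checked
v2_specific = (v2 & array) - v4  # concordant with the array, absent from v4
v4_specific = (v4 & array) - v2  # concordant with the array, absent from v2

print(sorted(all_three), sorted(array_only),
      sorted(v2_specific), sorted(v4_specific))
```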
Over here, I plotted the zygosity of the SNPs. On the top you see heterozygous and homozygous, and that's the array call, so the array says whether a SNP is heterozygous or homozygous. Then, on the Y axis I have the zygosity from version four, and on the X axis the zygosity from version two. And what you can clearly see is that if the array says it's heterozygous, most of the SNPs are indeed heterozygous on both platforms. There's basically only one SNP for version four that is called homozygous while the array says it is heterozygous.
And if you look at the homozygous SNPs, almost all of them will be in this small point over here; more than 80 percent are in there. And only a few that the array says are homozygous are called heterozygous by both capture platforms.
That's a very minor detail. We have to look into which SNPs those are and why that is the case, but we are quite sure that this is happening. It could be a capture defect; it could be an array defect. We don't know at the moment yet.
So, we were happy with the version four data, and also with the automation, so we started to run our cohort. I think we started somewhere in the beginning of October. On the left panel you see the first six plate layouts. In green is everything that was, in the end, successful, meaning we have enough data and everything went fine. We have three samples that, for whatever reason, didn't deliver the expected four gigabase pairs. But we also have some samples colored in red, which means we had a dropout somewhere during the process. And I have to say that almost all of those dropouts happened during the automation.
So, I think there is still some room for improvement here in the automation, but, nevertheless, you can do quite a lot of samples in a short amount of time, because one plate can easily be done in one week, so you can still do 96 exomes in a week here. And our current output of analyzed exomes is about 50 exomes a week. We have done that now for about 12 weeks in a row, so that's quite okay.
On the right side I show a sex verification plot, as we call it. We had the reported sex from the array data, and we also looked at the sex coming out of the capture data. And you can see that you can distinguish male and female very nicely. For some reason, some of the males tend to be, I would say, less male; not more female, but less male, and we have to look into that. But we also know that this has nothing to do with the amount of data generated, so it's not that samples with less sequencing data produced start to drop out; it's not correlated to the amount of data. And, well, it makes it very easy to see whether you indeed have a concordant sex or not.
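The underlying idea of such a sex check can be sketched like this: compare mean coverage on chromosome X targets to the autosomal mean, since females carry two X copies and males one. The cutoff and the numbers are illustrative assumptions, not the thresholds we actually use:

```python
# Infer sex from the X-to-autosome coverage ratio: roughly 1.0 for
# females, roughly 0.5 for males. Cutoff is an illustrative assumption.
def infer_sex(chrx_cov: float, autosome_cov: float, cutoff: float = 0.75) -> str:
    ratio = chrx_cov / autosome_cov
    return "F" if ratio >= cutoff else "M"

print(infer_sex(48.0, 50.0))  # F  (X coverage close to autosomal)
print(infer_sex(26.0, 50.0))  # M  (X coverage about half)
```

Comparing this inferred sex to the sex reported at intake immediately flags sample swaps, which is exactly what the verification plot is for.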
Over here, I looked at the number of SNPs we have called in the coding regions; that number is plotted on the Y axis, and on the X axis the amount of data we generated, in gigabase pairs. On the left side it's about five gigabase pairs, going up to 20 on the right side. You can clearly see that for every exome we generate around, let's say, 31,000 SNPs with a quality higher than 100 out of the GATK pipeline.
And you can also see that it doesn't make a lot of difference if you sequence more. You don't get many more SNPs in the coding regions, because, for example, we sequenced one exome further, almost up to 25 gigabase pairs, and still you don't reach 33,000 SNPs; so you have only a small gain in the number of SNPs in the coding regions. I think this nicely demonstrates that a lot of the regions are covered quite well in version four.
And over here, I'm showing the same plots, but for the SNPs outside of the coding regions. You can clearly see that there is a linear correlation, as expected of course, because if you sequence more you basically increase the coverage outside the target regions, and therefore you get more SNPs. About 60,000 different SNPs are detected even outside the exome; let's say about one-third of your SNPs are inside and two-thirds are still outside.
Okay, then we started to look at this SNP increase in the population, in the cohort. What I plotted over here on the left panel, with 500-something samples in this plot, is the number of unique SNPs in the population. And we end up with about one million SNPs in those 500 people. You can also see that in the beginning there is a very steep slope, which at some point turns into a linear increase. And I would be interested to know if anybody can tell me whether this is normal for a population, or whether it is not normal and says something about the genetic isolation that we expect.
And in the right panel, we show that about 400,000 to 500,000 of those unique SNPs are unique to only one person. But at the other end, over here, we still have about 50,000 SNPs which seem to belong to this cohort as a whole; they are present in all the different members of the cohort so far. And those are, of course, very interesting to see what they are doing here.
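Both panels come down to simple set bookkeeping over the per-sample call sets: the growing union gives the saturation curve, and the running intersection gives the SNPs present in every sample so far. A minimal sketch with toy data standing in for the ~500 exomes:

```python
# Saturation of unique SNP sites (left panel) and SNPs shared by all
# samples so far (right panel), over a list of per-sample call sets.
def saturation(samples):
    seen, shared, curve = set(), None, []
    for calls in samples:
        seen |= calls                    # union: cumulative unique SNPs
        shared = set(calls) if shared is None else shared & calls
        curve.append(len(seen))
    return curve, shared

samples = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "c", "e"}]
curve, shared = saturation(samples)
print(curve)   # [3, 4, 5]
print(shared)  # {'a'}
```

On real data, a curve that keeps rising linearly rather than flattening is what prompts the question about genetic isolation.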
Okay, and with this, I will draw some conclusions. I think the automation improved the percentage of reads on target. We still have some unexpected dropouts, so I think there's room for improvement there. And our current output is about 50 analyzed exomes per week.
We are quite happy with version four, because its improved capture design significantly reduces the low-covered capture targets and gives a slimmer coverage distribution, and it makes sex concordance information very easy. Well, we have to investigate a little bit more, but at least it looks like version four is equally or maybe even a little bit better concordant with the array.
And we were able to detect about 30,000 different SNPs inside the exome and about 60,000 outside. We also showed that if you sequence more you don't yield more SNPs in the coding regions, so that is quite nice. And about 90 percent of the SNPs are concordant with the array data.
Well, of course, we didn't do this all alone, so there are different people involved here. We have the Center for Biomics, which is my laboratory with a lot of people. We have the epidemiology department. We have clinical genetics. And of course Illumina, which we use for all the sequencing, and Agilent Technologies for the SureSelect capture and the other instruments.
And with that, I would like to end, but not before I state that we are going so fast that within a few weeks we will have done this cohort. So, we are open for new projects; if anybody knows of one, feel free to contact me. And then I will end by asking if there are any questions.