Yeah, a lot of things have already been said in the previous two presentations, so I could stop immediately and say that, basically, this works. But I think Andrew ended in a very nice way in terms of how we look at sequencing these days. It is basically balancing effort versus value, and value you can see both as scientific value and as financial value, costs.
When you consider going from genes to whole genomes, there is, of course, an increase in information content as you go from right to left. Required sequencing capacity, of course, also increases going from right to left. And that also means that if you consider a sequencing run as an entity, what can you do in one experiment? You can analyze many more samples when you just analyze genes than when you do whole genomes. This is all pretty obvious, and I don't have to tell you that.
But the point I want to make is that whatever approach you take, the limiting factor in most of our work is the information that we can project onto this genome information: how much information can we get out of it? Especially from a diagnostic perspective, a whole genome is not that much more interesting than just looking at the genes for the specific disease that you are studying and for which you want to reach a diagnosis. So, this is pretty constant across the methods.
And the same goes for the research value, and this is a point of debate and depends on what you're doing. Of course, there is less research value, less to be discovered, in sets of genes than in whole genomes, but we should also be aware that it's very difficult to get information out of whole genomes, and that's already much easier for exomes.
So, this is basically--and I don't say you have to be at any particular point on this spectrum, but these are just the considerations people have when deciding where to start their experiments, and I think you should be open-minded in doing that.
Of course, what is the challenge then? That is the scaling challenge, because if you go from whole genomes down to exomes or genes, you need some way of target enrichment. We heard about various solutions that are out there. On one end, that is exome enrichment; SureSelect is an excellent solution for doing that, as you already heard and as we will hear in subsequent presentations as well.
When you go to gene sets, custom sets, we heard some examples about that already. There are various solutions. You can use custom designs, and this goes back quite a while. When we started this work, we actually started using microarrays, because in-solution capture was not yet available, so I'll give you a bit of historical perspective on using that platform as well; of course, SureSelect in solution is a much more versatile approach. And I will talk a bit about the SureSelect exome panel that's also available from Agilent, and I have some results from that.
When we go to even smaller sets, there is a further challenge. I put a question mark here, although I already spoiled the surprise about what could work, and from the previous presentation it is also clear what type of solution could be used here.
And we first go to the scaling challenge. This is the traditional way we approached experiments, let's say, a year ago: you take a sample, you do an enrichment, and then you combine it in a sequencing run. There is a big challenge in doing that: when the target footprint becomes smaller and you want to analyze more and more samples, sample preparation and enrichment really become a bottleneck.
As Ole [sp] also already mentioned, there are solutions to this part. You can basically keep doing this, and we'll see examples of that as well. For exome enrichment, you can automate the whole procedure and have robots do the whole thing. You can think about multiplexing parts of these steps and develop protocols for combined sample preparation and enrichment, so that it becomes one step. Those are basically the solutions one has.
So, when we go to multiplexing, there are two ways of doing it. One is multiplex sequencing, which does not really change anything about the sample prep and enrichment procedures. But you can also add the indexes before enrichment, and Ole presented SureSelect version 2, where eight- to 16-plex multiplexing can basically be achieved pre-enrichment. I'll show you some of our work, which has also been published in two papers, where we showed that this approach also works when going up to 100 samples if you want to.
This is basically such a result. We used 96 samples with a target footprint of 300 kb in this specific case, and this is just the average base coverage over that 300 kb target footprint. These are 96 different samples, and you see that the average is about 50x coverage for every sample, and the samples do not deviate from each other by much more than twofold in coverage. So, in this case we basically re-sequenced a target footprint of 300 kb in 96 patients in one single experiment.
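As a rough sketch of that uniformity check, and not our actual analysis pipeline, something like the following Python would do; the sample names and coverage values here are made up for illustration.

# Toy sketch: check coverage uniformity across indexed samples after
# demultiplexing. Assumes the mean on-target coverage per sample has already
# been computed, e.g. by averaging per-base depth over the 300 kb footprint.
mean_coverage = {
    "sample_01": 52.4,  # hypothetical values
    "sample_02": 47.1,
    "sample_03": 61.8,
    # ... up to sample_96
}

values = list(mean_coverage.values())
print(f"samples: {len(values)}")
print(f"mean on-target coverage: {sum(values) / len(values):.1f}x")
print(f"fold spread (max/min): {max(values) / min(values):.2f}")  # near 2-fold or less indicates even pooling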
So, one criticism that always comes up when you start multiplexing these samples: "Aren't you starting to miss rare alleles? Don't you get competition of non-reference alleles against reference alleles?" because the reference alleles would hybridize better to the probes, which are based on the reference genome sequence. This is, I think, a clear illustration that there is no risk of that.
What you see here on the X axis is the frequency of an allele in the pool. We put in 96 samples, so we have 192 alleles in there. If only one sample was heterozygous, you would end up at about 0.5 percent in the pool. But because that sample also carries an index, we can split out all the information from that sample and calculate the allele frequency for that sample after the sequencing, and that's basically what's plotted on the Y axis.
You basically see that the allele frequency in the pool has no effect on the allele frequency we observe for the individual sample. So, there is no effect of multiplexing on losing non-reference alleles for your specific sample; there is no risk of that, at least not in this experiment. So I think that's fairly solid proof that you can safely do these things.
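As a rough sketch of that comparison, with invented read counts rather than the experimental data, the idea is simply to compare the non-reference allele fraction in the pooled reads with the fraction recovered for the individual indexed sample.

# Compare pooled versus per-sample non-reference allele fractions (toy numbers).
variants = [
    # (pool_alt_reads, pool_total_reads, sample_alt_reads, sample_total_reads)
    (55, 9800, 26, 51),     # a single heterozygous carrier among 96 samples
    (110, 10300, 49, 102),  # two carriers; one of them shown per-sample
]

for pool_alt, pool_tot, smp_alt, smp_tot in variants:
    pool_af = pool_alt / pool_tot
    sample_af = smp_alt / smp_tot
    # A heterozygote should still sit near 50% in its own demultiplexed data
    # even when the pooled frequency is well below 1%.
    print(f"pool AF = {pool_af:.2%}, per-sample AF = {sample_af:.1%}")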
Good. How did we use that? Well, we used it in a wide variety of experiments with smaller gene sets, combining 10 to 100 samples in one single experiment, enriching them, and basically doing the sequencing. One of those applications is to look at the X exome. In the clinic we have a lot of patients where you have the suspicion that it's an X-linked disorder. Traditionally, people do mapping, a linkage analysis, in these families to narrow down the region in which the causal gene could lie. That's laborious; you need big families, etc., so that's challenging. Of course, you can sequence the whole exome in these cases as well, but it's a bit of an overkill if you already know that it's X-linked, because the coding sequence on the X chromosome is just three million base pairs, so why sequence 50 million base pairs?
So, basically what we do is use the pedigree information. When you know these are affected males, you know that these females should be obligate carriers of the allele. For the affected sons, you expect only the disease allele to be present, so they appear as homozygotes in the sequencing data. The obligate carrier females, the mothers of these affected boys, should appear as heterozygous. And you can also include a healthy father, who by definition should carry the reference sequence at that same position. So, you have a lot of filtering information from your family, from your pedigree, and you basically assign the samples you sequence to these three different categories, so you can do a lot of filtering on your data.
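A minimal sketch of that segregation filter, assuming genotype calls per sample are already available; the encoding, names, and function are illustrative, not our actual pipeline.

# Sketch of the X-linked segregation filter described above (illustrative only).
# Genotypes per sample are encoded as "hom_alt", "het", or "hom_ref".
def segregates_x_linked(genotypes, affected_males, carrier_mothers, healthy_fathers):
    """Keep a variant only if it fits the expected X-linked inheritance pattern."""
    return (
        all(genotypes.get(s) == "hom_alt" for s in affected_males)    # affected boys carry only the disease allele
        and all(genotypes.get(s) == "het" for s in carrier_mothers)   # obligate carriers are heterozygous
        and all(genotypes.get(s) == "hom_ref" for s in healthy_fathers)  # healthy fathers are reference
    )

variant_genotypes = {"boy1": "hom_alt", "boy2": "hom_alt", "mother": "het", "father": "hom_ref"}
print(segregates_x_linked(variant_genotypes, ["boy1", "boy2"], ["mother"], ["father"]))  # True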
We collected quite a few of these families; in this case, 27. The second column on this slide shows, for each family, how many patients, obligate carriers, and normal individuals were present for whom we had DNA samples available, and this is the number of samples that we sequenced. So, from these 27 families, we sequenced 86 individuals, on average just over three individuals per pedigree.
Then we can start doing the filtering based on segregation in these families, and this is the number of candidate variants that you're left with per family, so that's not too many. Then we applied quite a few controls. These are all families from the Netherlands, so it is useful to filter against 1000 Genomes-type projects and 10,000-exome-type databases, but in-house databases are also very useful for filtering out really population-specific variation.
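The frequency filtering can be sketched in the same spirit; the thresholds, data structures, and function name below are illustrative assumptions, not the exact settings we used.

# Discard candidates that are common in public databases or recur in the
# in-house collection (illustrative thresholds and lookups).
def passes_population_filter(variant, public_af, inhouse_counts,
                             max_public_af=0.001, max_inhouse_count=2):
    key = (variant["chrom"], variant["pos"], variant["ref"], variant["alt"])
    if public_af.get(key, 0.0) > max_public_af:         # e.g. a 1000 Genomes-type frequency
        return False
    if inhouse_counts.get(key, 0) > max_inhouse_count:  # recurrent, population-specific variants
        return False
    return True

candidate = {"chrom": "X", "pos": 1234567, "ref": "C", "alt": "T"}
print(passes_population_filter(candidate, public_af={}, inhouse_counts={}))  # True: not seen before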
If we do that, then this is the list of candidates left, and you see that for most families we have just a single candidate left as the causal variant. And actually, for the ones in grey, which is about half of the families, that variant, or one of the variants if there are multiple candidates, really fits with the gene and the phenotype of the patient, so I would consider these almost solved in terms of genetics. Of course, the biology still needs follow-up work to confirm that. But just two enrichment experiments were done here, and we solved 13 or 14 different family cases of X-linked disease.
This is one example, just to show that these things work. It's a syndromal family with facial dysmorphologies and mental retardation, and it's a very large family in this case; we actually sequenced quite a few individuals in this one. These are the affected males, this is an obligate carrier female, and this is a non-carrier female, and you see these genotypes segregating very nicely.
The mutation was actually intronic in this specific case, but we do show that it affects proper splicing of the intron: on cDNA analysis you see skipping of exon 2, which puts the transcript out of frame and introduces a stop codon in the patients. The gene is histolesterase [sp], which is involved in brain development, etc., so this seems to fit with the phenotype that we observed.
Good. So, the choice of technique is important. We can do exomes very well, and I think targeted gene sets very well, but what about if we go down to individual genes? And when would we want to do that? Well, for example, these are probably the most sequenced genes in the whole world: the BRCA1 and BRCA2 genes, in relation to breast cancer. These are diagnostic genes, and they are sequenced a lot, and for most of these questions you just want to look at these two genes and nothing else. So exome sequencing doesn't make any sense for this question; it's a huge overkill, because you just want to know whether there are mutations in these genes. And you also want to know if there are no mutations in these genes, so completeness becomes very, very important when you want to sequence them.
And if you look very carefully at all the exome slides, also the ones presented in the commercial presentations, it's typically about 90 percent of the target that you cover at more than 10x. That's quite often the sort of criterion used for proper performance. For these genes, 90 percent is not good enough. You need much more.
So, what can we do? Well, we started on this a long time ago, and we first started with hybrid capture: let's see what we can get if we just use the traditional techniques, SureSelect using either arrays or in-solution probes. There are two things to consider: the platform, and the experimental steps involved. If you want to do hybrid capture, you have to make your sequencing library, do a hybridization, elute, and do the post-enrichment PCR. And I will also show you that you can take this post-enrichment PCR product not directly to sequencing, but do a second round of enrichment on the same platform, repeating the trick, and then go to sequencing.
When we started this off, we used Agilent SureSelect 244K arrays. We used a design where we targeted all coding exons of BRCA1 and 2, including 40 base pairs of intronic sequence. The good thing about using microarrays for capturing is that you can design capture probes on the plus and the minus strand, which can have some advantages: they cannot hybridize to each other in solution because they are physically attached to the array and spatially separated. We used a very dense tiling path [sp] in this one and followed this by SOLiD sequencing.
Well, first off, how does this perform when you try to capture just two genes by hybrid capture? This is not a surprise; I think everyone finds the same thing. The efficiency is very poor. These are five different samples, and we get approximately 5 percent of the sequencing reads on target. So, it does work, you can enrich, but the efficiency is rather poor: 95 percent of what you're sequencing is something else.
You can do a second round of enrichment: as I said, you take the elution product and put it back onto the same microarray. What we did was reuse the same microarray after washing it off, and then you can actually boost it to 60 to 70 percent on target. So, that does work, but it's a lot of work. You have to go through two rounds of hybridization, and it's a trick that has been demonstrated previously.
This is just to show that if you do this trick, the coverage of your target region is pretty much okay. You still get very good coverage, and you don't get much bias from the second round of enrichment and the second round of PCR, so that, I think, is the positive part of it.
This is a metrics slide that will come back a few more times, so let me explain it the first time. Basically, what we plot here is the performance of the technique that we used. The four different colors assume different numbers of mapped reads generated on your sequencing platform. Let's assume you generate 50,000, 100,000, 200,000, or 500,000 mapped reads and you apply an on-array single-round enrichment followed by SOLiD sequencing; what do you get for your target? Well, you see that 1x coverage is still pretty much okay, but you drop rapidly in terms of coverage, and that's because most of the reads are not on target. If you use the double-round enrichment, it gets better, but you see that if you want to achieve 20x coverage for most of the region, you still have to generate quite a bit of sequencing data.
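The way to read these curves is a simple back-of-envelope calculation; the numbers below are illustrative assumptions (a rough BRCA1/2 coding footprint, 50 bp reads, on-target fractions of the same order as on the slide), not the actual experimental values.

# Back-of-envelope relation between mapped read count and mean target coverage.
def mean_target_coverage(mapped_reads, on_target_fraction, read_length_bp, target_bp):
    return mapped_reads * on_target_fraction * read_length_bp / target_bp

target = 17_000  # order of magnitude of the BRCA1 + BRCA2 coding footprint (assumed)
for reads in (50_000, 100_000, 200_000, 500_000):
    single = mean_target_coverage(reads, 0.05, 50, target)   # ~5% on target, single round
    double = mean_target_coverage(reads, 0.65, 50, target)   # ~65% on target, double round
    print(f"{reads:>7} mapped reads: single-round ~{single:5.0f}x, double-round ~{double:5.0f}x")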
So, this is where we started working with Olink at that time. The technology that they were developing at that point was the Halo Selector technology. It has not been presented earlier today, but it is basically built on the same basis as the HaloPlex, which I will come to later as well. The good thing is that it's a single-step type of enrichment.
The experimental procedure is as follows. You first do a Selector enrichment, which involves digestion of your genomic DNA and hybridization to the Selector probes, which are these probes here. The difference with the HaloPlex, as you can see, is that there is no internal adapter: the Selector probe just brings the two ends of the restricted DNA together. So there is a hybridization and a ligation at one single point, and this is followed by a rolling-circle amplification, where you get high-molecular-weight DNA and really blow up all the circles.
Well, then you have high-molecular-weight DNA, and it's enriched, but you still have to be able to sequence it and introduce adapters. That's easily done by fragmentation and a standard library prep, but it's more work than the HaloPlex, obviously. So, that's what we used initially, and we sequenced this both on SOLiD and on the Ion Torrent PGM platform.
Well, this just shows the enrichment. We typically get more than 80 percent of the reads on target with this approach, so that works very well in terms of enrichment. In terms of sequencing it also performs reasonably well: if you want to get 20x coverage, you already achieve that with 100,000 to 200,000 reads.
That gets better when you go to Ion Torrent sequencing. You see the performance here is better, you need fewer reads, and that's just because the Ion Torrent reads are longer than what we generate with SOLiD sequencing.
So, the last step is to use the HaloPlex technology, which was being developed at that time. The difference there is that you have simultaneous enrichment and library preparation, so you don't have to fragment the high-molecular-weight DNA. You introduce this additional adapter in the middle, which is the starting point for the PCR primers that then introduce the SOLiD- or PGM-specific adapters; in this case, we focused completely on using the Ion Torrent platform. So it is reduced to one single step consisting of digestion of the genomic DNA, hybridization to this probe mix, followed by a ligation and a PCR, and then going directly to the PGM.
We started off using very small chips, and I think there is an advantage there because, in the end, they are rather cheap and rather fast, and our target footprint is rather small, so that combination fits very well. The read capacity of the 5500, which we used as well, is a huge overkill for these experiments.
We explored one more thing, which is bidirectional sequencing from the product that you generate. Of course, you can use paired-end sequencing on any platform, but on most platforms it doubles the amount of sequencing time, and quite often it increases the costs as well, because you basically have to sequence first 100 base pairs in one direction and then 100 base pairs in the other direction. If you can make sure that in the first sequencing round of 100 base pairs you are already sequencing from both ends of the product that you enriched, you speed up the process as well.
And that's basically what we explored here. The adapter in the middle part can be put in in two orientations, so with one orientation you get P1s such that the sequencing reads into this direction, and with the other orientation it would actually read the other direction of the same captured fragment. So, in theory, you can get bidirectional sequencing from just one reaction mixture.
However, there is a complication, because if you make mixes of these two oligos for the same target set, the adapters can hybridize to each other, and the probes can hybridize to each other as well. But if you write this out, you would actually only lose half of all your probes in the set, because by definition every molecule can hybridize to just one other molecule, and that can be the right one or the wrong one. So half of the probes will be consumed by self-hybridization; the other half still contribute to the circularization reaction.
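That back-of-envelope argument can be illustrated with a deliberately simplistic toy simulation, which just encodes the assumption that each probe molecule ends up hybridized either to its genomic target strand or to the complementary probe, with equal probability.

import random

random.seed(1)
n_sense_probes = 50_000

# Each sense probe is assumed to pair with exactly one complementary molecule:
# either a genomic target fragment (productive) or the antisense probe for the
# same target (lost to a probe-probe duplex), with equal probability.
lost = sum(1 for _ in range(n_sense_probes)
           if random.choice(("target_strand", "antisense_probe")) == "antisense_probe")

print(f"probes lost to probe-probe duplexes: {lost / n_sense_probes:.1%}")  # ~50%, matching the estimate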
So we thought, "Good, this could work, so let's give it a try." There is a potential risk in this approach, but we tested the adapter orientations. We first made separate mixes with the adapter in just one orientation or the other. This is the performance you get from one sequencing run, and you see that the performance is pretty much the same for the two different orientations; you see them all mapping in one direction.
Then, if you mix the two oligo sets together and do the assay again, this is basically what you get, and the performance is the same. The reason you see a bit of difference here in terms of coverage above 20x is that we sequenced less in this specific experiment. And you see from the percentage covered that it actually gives an excellent coverage performance, and we do get bidirectional sequencing. So, this saves time and potentially costs in the sequencing step.
So, how does that compare, again, in these metrics? If we again look at 20x coverage, we actually see excellent performance even when you go down to 100,000 reads, which is what you easily achieve on a PGM, even with the smaller chips that are available.
Here we move to the latest chemistry on the PGM, version two, providing longer reads, and basically we see that the performance becomes better, which is most visible at the higher coverages, where you get more and more bases highly covered. And actually, in these experiments, we found that we can cover 99.9 percent of the BRCA1 and 2 coding bases, so we are near completeness. This is an overview of basically all the different methods, going from the on-array single enrichment and double enrichment, to the Selector with SOLiD and Ion Torrent sequencing, and then to HaloPlex with Ion Torrent version one and version two chemistry. And I think you can appreciate the improvements in the efficiency of capturing these regions at very good coverage.
Of course, the question is: we can cover it now, but can we also identify variants? Well, there are not that many variants in the few samples that we continuously use in these testing experiments, but these are just some examples. Here we used the Selector technology followed by SOLiD, this is array data, and this is the HaloPlex followed by Ion Torrent. I think you should just look at this array-type track: you see the allele distributions at the SNPs, at all these different positions, very nicely with all three platforms. So there is actually almost full concordance between what we see on the different platforms. That was a biased view, just looking at known positions; this is an unbiased view. Here we just used our standard SNP-calling pipelines to identify variants in the different datasets, and that's shown in the first column. With hybrid capture we find six variants, with the Halo Selector eight, and with the HaloPlex five.
We did not independently verify these variants, but we looked at what is unique to each. Well, there is just one variant here unique to the other two that was missed by the HaloPlex, and that was an indel. It's not really missed by the HaloPlex itself; it's probably missed because of the difference in sequencing platform in combination with the bioinformatic tools we used, which we have not yet optimized for calling indels on Ion Torrent data. And of course, that's still an area where work has to be done.
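The concordance comparison itself boils down to set operations on variant keys; the calls below are placeholders just to show the bookkeeping, not the real six, eight, and five call sets.

# Variants keyed by (chrom, pos, ref, alt) for each enrichment/sequencing combination (placeholder data).
hybrid_capture = {("17", 41245466, "G", "A"), ("13", 32906729, "A", "C"), ("13", 32915005, "GA", "G")}
halo_selector = {("17", 41245466, "G", "A"), ("13", 32906729, "A", "C"), ("13", 32915005, "GA", "G"),
                 ("17", 41244000, "T", "C")}
haloplex = {("17", 41245466, "G", "A"), ("13", 32906729, "A", "C")}

print("concordant in all three:", len(hybrid_capture & halo_selector & haloplex))
print("unique to Halo Selector:", halo_selector - hybrid_capture - haloplex)
print("missed by HaloPlex only:", (hybrid_capture & halo_selector) - haloplex)  # e.g. the indel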
The Halo Selector finds three more that are missed by both other platforms, so the question is what those variants actually are. When we look at the tracks of the different datasets shown here, you see the Halo Selector here; this is really the raw data, and you see these groups of noise in the Halo Selector. Remember, this Halo Selector is not the product that is being used now; this is the rolling-circle approach.
And what we find here is that these are the restriction-site positions, which is basically where the circles are fused to each other. The rolling circle merges them into one long concatemer of sequences, which was then sheared and sequenced, so you generate hybrid-type sequencing reads that partially match and then run into the other part of the circle. Our mapping tools have not been optimized to deal with that type of hybrid read. So, I think with some bioinformatics you could get this noise out as well, but, on the other hand, this technology has been superseded by the HaloPlex technology, which does not have these artifacts because it has no rolling-circle amplification in it. So, we strongly believe that these three are actually false positives from the technology.
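One way such a bioinformatic cleanup could look, as a sketch only: flag reads whose alignments end in long soft-clips near known restriction-site positions, since those are candidates for junction-spanning hybrid reads. The BAM file name, site coordinates, and thresholds are hypothetical; this uses the pysam library.

import pysam

# Hypothetical inputs: a BAM of Halo Selector reads and the restriction-site
# positions used in the design.
restriction_sites = {("chr17", 41244500), ("chr13", 32906500)}  # placeholder junction positions
MIN_CLIP = 15  # minimum soft-clip length to treat a read as suspicious

with pysam.AlignmentFile("halo_selector.bam", "rb") as bam:
    for read in bam:
        if read.is_unmapped or not read.cigartuples:
            continue
        first_op, first_len = read.cigartuples[0]
        last_op, last_len = read.cigartuples[-1]
        clipped = ((first_op == 4 and first_len >= MIN_CLIP)
                   or (last_op == 4 and last_len >= MIN_CLIP))  # CIGAR op 4 = soft clip
        boundaries = [b for b in (read.reference_start, read.reference_end) if b is not None]
        near_site = any(chrom == read.reference_name and any(abs(pos - b) <= 5 for b in boundaries)
                        for chrom, pos in restriction_sites)
        if clipped and near_site:
            print("candidate junction-spanning read:", read.query_name)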
So, is that the end? It does work, I think, very well for capturing and targeting just two genes, and we also tried to do this for a slightly larger panel, which is of interest from a broader perspective for breast cancer: 21 genes, a published set of genes. A lot of genes involved in mismatch repair are included here as well. And basically you find the same performance if you set up this technology for larger sets of genes; we get beautiful coverage there as well, a very high enrichment efficiency with 94 or 95 percent of all the reads on target, and more than 95 percent of the bases covered more than 10 times at just 300 [unintelligible] coverage. So this works very well for other gene sets as well.
So, in summary, one part is the enrichment that you do, but the other part is, of course, the time that it requires, because you could say, "Well, two rounds on an array actually works very well too; you get very good coverage." But that takes you almost a week, and the HaloPlex now allows you to do this in one day. So HaloPlex is efficient for small targets; I think it's a unique technique for that. The combined enrichment and library prep is what makes that possible. It's a fast protocol, and the good thing is also that it's automatable, making it scalable in the future as well; these are all tube-based assays that can easily be automated.
So, let's go back to the first slide. We had these experiment-scaling challenges, and the good thing is that Agilent, together with Halo Genomics, provided solutions to this as well, in the form of the HaloPlex.
With that, we'd like to acknowledge the people who contributed to this work and the collaborators at Halo Genomics and at Ion Torrent. Thank you for your attention.