So, the background to this experiment is that there is a natural tendency for mutations to accumulate in stem cells. It's actually a pretty natural property of stem cells: because they are constantly renewing themselves, any mutation that occurs in a stem cell population will remain in that population, whereas mutations that occur in differentiated cells, at least in regenerating tissues, tend to be washed out. So, there's nothing special about it; it's just a mathematical property of stem cells.
So, there is this concept that leukemias are maintained by leukemic stem cells. And you can imagine that the initial set of mutations required to create a tumor occur in the stem cell population. And then, there might be additional mutations that occur in the leukemic blasts as the tumor develops, and there's a sort of micro-evolution going on in the bulk of the tumor.
But, in many cases, it may be that even if you treat the bulk of the tumor and get rid of it, you do not eradicate the leukemic stem cells that contain several of the mutations that gave rise to the original tumor. And then, after some period of time, the tumor might reappear when the additional mutations are reacquired by the tumor or by the stem cells.
So, in order to understand this process, we need to understand which mutations are important for leukemic stem cell development. Or, to put it more simply, we need to find out which mutations occur where in the differentiation lineage of these cells.
So, we're studying one particular leukemia-like disease called myelodysplastic syndrome (MDS). Maybe it should actually be called a pre-leukemia. It's a disease of the bone marrow that results in abnormal maturation of the stem and progenitor cells. And it's not really a tumor in the sense that you get abnormal growth; rather, you get abnormal maturation of the myeloid lineage, and you get various kinds of symptoms that are a consequence of that, such as anemia.
But, over time, patients who have been diagnosed with MDS tend to develop, in a large number of cases, a more acute leukemia, AML. So, that's the justification for studying this particular syndrome: it's a milder form, not really a full-blown leukemia, but it's a precursor and leads, in a very large number of cases, to a full-blown leukemia. And this has some advantages, in particular because the genomes of these cells are bound to be simpler than those of AML, which has a very complex genome with a lot of rearrangements and a very large number of mutations. So, there might be hope of finding more specific mutations in this syndrome.
This is sort of a cartoon picture of hematopoiesis. And the way we think of it is that there is an ultimate stem cell, the hematopoietic stem cell, that gives rise to the entire blood cell lineage. And then, there are various stages along the way to the fully differentiated cells. So, you can imagine now that there can be mutations that occur in the hematopoietic stem cells themselves. They will be inherited by the entire blood cell lineage; whereas mutations that occur in the more restricted stem cell will affect only part of the lineage. So, that's sort of the justification for this work.
So, a good thing is that a lot of mutations are known in these tumors. It's known that in a large number of MDS cases there are deletions, for example of half of chromosome 5, or deletions of chromosome 7. These are very common, as I said, and they can also be prognostic for this disease. And, of course, many of the common cancer mutations also occur in MDS, like p53, as you would expect.
Very recently, last fall, two groups showed independently that there are recurrent mutations specifically in spliceosome genes. So, these genes, and several others, tend to be mutated specifically in a large fraction of the cases of MDS, so there's a very particular and interesting fingerprint that probably has functional consequence.
But, the patients who have these recurrent mutations tend to also have these other mutations, so there is still the question of which one is the transforming event, the earliest mutation to occur, and which ones occur later in the differentiation of the tumor.
So, we're adopting a strategy to address this question where, in a first screening step, we take 20 patients and we sequence genes with known mutations in MDS from bulk bone marrow, so we will just find all the mutations that occur in these patients.
In a second step, we'll sort cells from these same patients into the cell populations that I just showed you, the different stem cells, the hematopoietic stem cell and the more committed stem cells. And then, we will look at the same mutations that we already know occurred in these patients.
And in the third screening step, we want to go down to the single-cell level to find out, where multiple mutations are found within a population, whether they actually occur in the same clone or in independent tumorigenic clones. So, that's the end goal, and today I will talk about some technology development that we've done to be able to achieve this.
There are, of course, a number of challenges. One is just the challenge of sequencing the exons of a relatively large number of genes in a relatively large number of samples. And I don't have to explain exactly how to do that because you've seen in several talks already that the HaloPlex is a good choice for these kinds of numbers. And we will sequence those samples on the HiSeq 2000.
The other challenge is to obtain sufficient genomic DNA from tiny cell populations. Although you can go down a little bit in the DNA requirement for the HaloPlex kit, you still need hundreds of thousands of cells to do this; whereas, unfortunately, the stem cell populations are exceedingly rare, so it's often the case that you can only obtain about 100 cells from one bone marrow aspirate for some of these populations. So, we need to be able to work with a hundred or so cells, which means we need to amplify, and we use a pre-amplification based on phi29 multiple displacement amplification.
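To make the scale of the input problem concrete, here is a rough sketch of the cell-count-to-DNA arithmetic. It assumes about 6.6 picograms of genomic DNA per diploid human cell, which is a commonly quoted approximation and not a figure from the talk; the function name is illustrative only.

```python
# Rough estimate of genomic DNA mass from a cell count.
# ASSUMPTION (not from the talk): ~6.6 pg of DNA per diploid human cell.
PG_DNA_PER_DIPLOID_CELL = 6.6


def genomic_dna_ng(n_cells, pg_per_cell=PG_DNA_PER_DIPLOID_CELL):
    """Return the approximate genomic DNA mass in nanograms."""
    return n_cells * pg_per_cell / 1000.0  # 1000 pg per ng


if __name__ == "__main__":
    for cells in (100, 200_000, 1_000_000):
        print(f"{cells:>9} cells ~ {genomic_dna_ng(cells):.2f} ng DNA")
```

Under that assumption, 100 cells hold well under a nanogram of DNA, while hundreds of thousands of cells hold on the order of a microgram, which is why a pre-amplification step is needed for the rare populations.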
The third is that we need to eliminate false-positive mutations that are due to the amplification step, which is a consequence of the fact that we need to amplify. And we take an approach in this experiment where we don't really focus on discovering new mutations. So, remember that we will sequence these patients from bulk bone marrow at first, and in those samples we don't have this problem that is due to the amplification. In the amplified samples of cell populations, we will just consider those mutations that we have already proven are present in the bone marrow.
And that leads to the fourth problem, which is that if we're going to do this, then in a bone marrow aspirate not all of it is going to be tumor. So, there can be sub-clones of the tumor that are present in only a fraction of the bone marrow, and our solution to that problem is to use very deep sequencing to be able to access very sparse populations.
I think I will not spend much time on this slide; you've seen it before. We use the HaloPlex, and it's pretty nice because it gives you, in a very small number of steps, a ready-made library for your Illumina sequencing.
So, some of the specific issues are that we need to amplify genomic DNA from about 100 cells, and the concern, then, is that you might have allelic dropout; that you lose alleles, basically. And even more so in the last stage, where we want to amplify genomic DNA from single cells. It's very well known that if you do GenomiPhi or any other whole genome amplification method from single cells, you will lose large portions of the genome. It could be 50 percent or 20 percent or something like that, but it's typically a very large loss.
So, the approach that we are taking is that we're doing microculture of the cells, so we're picking single cells into Terasaki plates. These are plates that are used by protein crystallography people to do large numbers of independent small crystallization reactions. So, you can do a kind of microculture in these tiny wells. Actually, you don't see how tiny they are, but they are really tiny. And even though these cells are not easy to culture on a large scale, it's pretty easy to make them divide three or four times; they will do that before they enter senescence. So, then we can get from one cell to about five or 10 cells, which is a big difference. Then, instead of starting with just one molecule, we'll start with five or 10 molecules, which should reduce the allelic dropout.
So, to see if this approach is viable, we set up this pilot experiment based on normal bone marrow and cell lines with known mutations, mixed in different ratios. So, we have 12 samples here that we analyzed by HaloPlex. We have the sort of basic control here, which is a cell line where we used one million cells and just extracted the DNA directly. And then, we have different cell lines with different known mutations at dilutions all the way down to one percent in normal bone marrow, also extracted in bulk, so that should be easy. And then, we have similar samples, but where we start with 100 cells only. These are these three samples here. And we have some other combinations.
And the sequencing strategy was to pool these 12 samples, which each target about 200 kilobases, by HaloPlex. So, the total sequence that we would like to obtain across all 12 samples is about two-and-a-half megabases. And for some reason, we put this on two lanes on the HiSeq, two times 100, and this is obviously overkill. The expected coverage was 25,000 fold. We observed about 10,000 fold because of some losses in mapping and so on. I should say we observed 10,000 fold as the minimum among the 12 samples, so the other ones were higher. But, it should show us very easily what the ultimate limit is.
And the way this data looks is as follows. You input the genes that you're interested in, and the HaloPlex design software will design for you what are called probe designs. We call them, maybe, targets, so each grey line here is a targeted region that should be covered by your design. And you'll see that some of them map to exons, like this one. Some of them are outside of exons, and I presume they are artifacts of the fact that you're limited by restriction enzymes, so you'll get some extra fragments here and there. Like here for example, it's extending maybe unnecessarily far outside this exon, but it's inherent in the design limitations of this method.
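The reasoning behind the microculture step can be sketched with a simple probability model. This assumes each starting copy of an allele is lost independently during amplification with the same probability, which is a simplification introduced here for illustration and not a model from the talk; the 50 percent per-copy figure is just the rough single-cell loss mentioned above.

```python
# Toy model of allelic dropout when amplifying from several template copies.
# ASSUMPTION: each copy of an allele drops out independently with the same
# probability, so the allele is lost only if every copy drops out.

def prob_allele_lost(per_copy_dropout, n_copies):
    """Probability that all n_copies of one allele fail to amplify."""
    if not 0.0 <= per_copy_dropout <= 1.0:
        raise ValueError("per_copy_dropout must be between 0 and 1")
    if n_copies < 1:
        raise ValueError("n_copies must be at least 1")
    return per_copy_dropout ** n_copies


if __name__ == "__main__":
    # One cell versus the five to ten cells obtained after microculture.
    for copies in (1, 5, 10):
        print(copies, prob_allele_lost(0.5, copies))
    # prints: 1 0.5 / 5 0.03125 / 10 0.0009765625
```

Under these assumptions, going from one copy to five copies per allele takes the loss from 50 percent to about 3 percent, which is the intuition for culturing the cells through a few divisions before amplifying.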
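The coverage figures quoted here can be checked with a little arithmetic. The sketch below uses only the numbers from the talk (12 samples, about 200 kilobases each, two lanes, 25,000-fold expected, 10,000-fold observed minimum); the function names are illustrative, and the implied yield is derived from those numbers rather than taken from any instrument specification.

```python
# Back-of-the-envelope coverage arithmetic for the pooled HaloPlex run.

def total_target_bases(n_samples, target_bases_per_sample):
    """Total number of targeted bases across all pooled samples."""
    return n_samples * target_bases_per_sample


def implied_yield_bases(total_target, fold_coverage):
    """Sequenced bases needed to reach a given mean fold coverage."""
    return total_target * fold_coverage


if __name__ == "__main__":
    target = total_target_bases(12, 200_000)
    print(target)                      # 2400000, about 2.4 megabases
    needed = implied_yield_bases(target, 25_000)
    print(needed)                      # 60000000000, about 60 gigabases
    print(needed // 2)                 # 30000000000 per lane on two lanes
    # Fraction of the expected depth actually observed in the worst sample.
    print(10_000 / 25_000)             # 0.4
```

So the expected 25,000-fold figure corresponds to roughly 60 gigabases over two lanes, and the worst sample retained about 40 percent of that expected depth.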
When you sequence this, you'll get data that looks like this. I don't think this has been shown maybe quite this way in previous slides, but one of the striking things about the HaloPlex is that all the reads align to exactly the same positions for each of the particular design targets, and this is again because of these restriction enzymes. So, it can look a bit funny when you look at that. But, if you look at it carefully, you'll see that all of the grey bars here are completely covered by reads.
So, of course, the first thing that you look at is whether the coverage is good. And if you look at just the simple question of whether your targets are covered by any reads or not, you'll see that we have about 97 percent of the targets being covered by reads, but actually 99 percent of the bases. So, the targets that drop out tend to be, for some reason, short targets, so they represent fewer bases than the average target.
There is quite a lot of variability. In this plot, which is a bit difficult to read here, I'm showing the difference in coverage between two samples divided by the sum of their coverage. So, one would mean that all of the reads are in just one of the two samples that we're comparing, and minus one would mean that all of the reads are in the other sample. And I'm comparing two identical samples of one million bulk cells each. So, it's pretty even, but there is some variability. Most of it is between 0.5 and minus 0.5, and there are some that are quite close to minus one.
This is including all the targets. So, I don't know if this particular dot, for example, is really on an exon or not. It might be just outside of an exon. And you should remember that we have here 10,000-fold coverage, so even a dot down here will be represented by hundreds, if not thousands, of reads, even in the minor sample.
So, what happens if you go for amplification? Actually very little. Here is the same data that was on the previous slide, and here is the same data, but for 100 cells amplified by GenomiPhi. And you'll see that a few more targets drop out and a few more bases drop out, but the difference is really minor using this metric.
And the coverage here is a little bit more uneven. I'm zooming in here on about 14 kilobases, but it's not continuous in the genome. It's just concatenating all the different targets that make up 14 kilobases. So, you'll see that the coverage is relatively even here along this 14 kilobase stretch, and the same 14 kilobase stretch is a little bit more biased in the amplification. So, that's what you get by amplifying by GenomiPhi. Still, very few targets are actually dropping out completely. There might be one or two that are dropping out here that didn't drop out there, so that's pretty good actually.
Then, we looked at the cell lines that had known mutations, both in the one million cell sample and in the 100 cell sample. These are four mutations that are known to cause cancer, so it's not all the known mutations in these particular cell lines. We haven't sequenced the whole genome; we just looked for the ones that are known to be involved in cancer here. And they're all detected with very high confidence. The quality scores max out at 3,070, which I've never seen before, in both the one million cell sample and the 100 cell sample. And these are all heterozygous mutations.
The next question is that, if we're going to do this, we need to be able to detect mutations that occur in only a fraction of the sample. So, we did this dilution series, and remember we had from 100 percent down to 1 percent of the mutated cell line. And what you see in this slightly complicated slide is two different mutations in p53 that occur in two different cell lines, both mixed into normal bone marrow at these percentages, right? So, each of them should be detected at the percentage that is indicated by the mixing here.
And what you see, if we start here, is that at 100 percent the mutations are heterozygous, so you should expect 50 percent of the reads to carry the mutation. And there is some difference between the two mutations, but it's fairly good. As you dilute, the percentages go down until you're down here at the 1 percent dilution. And actually, it seems that we are exaggerating the allele frequencies a little bit here, and I'm not sure exactly why that is. But, importantly, they are clearly detected above the background.
What you can see here is normal bone marrow, five different samples, with no mutated cell lines mixed in. So, this is the false-positive rate that occurs because of sequencing errors. And what we would like to see is that, let's say, the 1 percent dilution here should be very far away from the background, and it is; it's more than 10 times above the background. And again, the number of reads here is very large because we have this huge sequencing depth.
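The metric plotted on this slide is simple enough to write down. The following is a minimal sketch of it; the function name is made up for illustration, and returning None for a target with no reads in either sample is a choice made here, not something stated in the talk.

```python
# Coverage balance between two samples for one target:
# (a - b) / (a + b), ranging from -1 (all reads in b) to +1 (all reads in a).

def coverage_balance(cov_a, cov_b):
    """Normalized coverage difference for one target, or None if no reads."""
    total = cov_a + cov_b
    if total == 0:
        return None  # target not covered in either sample; metric undefined
    return (cov_a - cov_b) / total


if __name__ == "__main__":
    print(coverage_balance(100, 100))  # 0.0, perfectly even
    print(coverage_balance(300, 100))  # 0.5, threefold difference
    print(coverage_balance(0, 50))     # -1.0, reads only in the second sample
```

Note that a value of 0.5 already corresponds to a threefold difference in read count between the two samples, so "between 0.5 and minus 0.5" allows up to about threefold variation in either direction.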
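The expected read fractions in the dilution series follow from simple arithmetic. This sketch assumes the mutated cell line and the normal bone marrow cells contribute equal amounts of DNA per cell and that the mutation is heterozygous on a diploid background; real cell lines may violate both assumptions, which could be one reason observed frequencies differ. The function names are illustrative, and the 10,000-fold depth is the minimum coverage quoted earlier in the talk.

```python
# Expected variant allele fraction for a heterozygous mutation in a cell line
# diluted into normal cells.
# ASSUMPTIONS: equal DNA per cell in both populations, diploid locus,
# one mutated allele out of two in the cell line.

def expected_variant_allele_fraction(cell_line_fraction, mutant_alleles=1, ploidy=2):
    """Fraction of reads expected to carry the mutation."""
    return cell_line_fraction * mutant_alleles / ploidy


def expected_variant_reads(cell_line_fraction, depth):
    """Expected number of mutant reads at a given sequencing depth."""
    return expected_variant_allele_fraction(cell_line_fraction) * depth


if __name__ == "__main__":
    for fraction in (1.0, 0.5, 0.1, 0.01):
        vaf = expected_variant_allele_fraction(fraction)
        reads = expected_variant_reads(fraction, 10_000)
        print(f"{fraction:>5.0%} cell line -> {vaf:.3%} of reads, ~{reads:.0f} reads at 10,000x")
```

Under these assumptions, the 1 percent dilution corresponds to about 0.5 percent of reads, or roughly 50 mutant reads at 10,000-fold coverage, which is why very deep sequencing makes such sparse mutations countable.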
And finally, you can see that we can detect these mutations, of course, also in the 100 cell sample.
And the last thing I want to say is about the single cell exome sequencing, so we haven't actually gotten yet to the sequencing of that. But, I'll show you data that I think shows convincingly that we can manage allelic dropout even from single cells. So, what we do is that we culture them, as I said, in the Terasaki plates. It takes about 48 hours for them to divide three or four times. We usually get more than four cells. We rarely get more than 10 cells. They tend to stop dividing after that. But, starting with five to 10 cells, something like that, is certainly much better than starting with one cell.
And to test this, we designed qPCR probes or primers against genes on the X chromosome, and we tested this idea on male samples. So, there should be only one X chromosome in these samples, and therefore it's an easy method of detecting the presence or absence in each sample of these genes.
And we were actually pleasantly surprised to see that in every single case we do detect the gene. We tested 21 different genes; they're all listed here. The white bars here show negative controls; those are no-template controls. These are CT values, so higher bars mean lower concentrations. The black bars show fewer than 10 cells amplified and then quantified by qPCR, so the qPCR is run on the amplified DNA. And the grey bars show just bulk sample, where we just isolated DNA and ran it directly.
And I think you'll appreciate that in every single case the two cell samples behave exactly the same. There are, I think, two cases where the negative control failed, so you have one here and one here, but that just means that the PCR assay didn't work as it was designed to.
So, this, of course, doesn't show that we have no allelic dropout, but it, I think, shows that we can expect to have allelic dropout that is less than about 5 percent or so.
So, the conclusion of this pilot study is that we think we can work easily with 100 cells, and we can easily detect 1 percent mutations in normal populations. And we believe that we can also amplify single genomes without allelic dropout, and we're now sequencing those to see if we can also get good enrichment data out of them.
And that's it. I would like to thank Ule Marr [sp], who did most of the work, Rick Alalead [sp], who did some of the analysis, and Sten Eirik in Oxford, who worked with the leukemic samples. Thanks.