And today, I'd like to speak on disease gene identification by exome sequencing, with a particular focus on de novo mutations. Of course, as I said, it will be on exome sequencing, and I guess in this audience I really don't have to recapitulate what that really means.
This really is the first genome-wide mutation detection method, at least when it comes to single-nucleotide mutations. And probably the reason why I'm here is that we use SureSelect to do so, and almost all the data that I show you today come from version two, so the 50-megabase SureSelect exomes.
These are the sequencers that we use, so we certainly are not a genome center, not one of the big centers as you usually see them here in the States. We just have five sequencers, and we usually just use single-end fragment runs, so 50 base pair runs. We usually run eight exomes in a single SOLiD 4 run, and we also have two of the 5500xl sequencers where we run 12 exomes per run. And just this week a third machine was installed.
So, in total last year, we sequenced roughly a thousand exomes. And all of those exomes, or most of them at least, were exomes where we would expect a monogenic disease and we were just looking for one or two mutations that can really explain these diseases.
And one very straightforward and rather stringent filter that we usually apply is looking at the private non-synonymous mutations that we see in individual exomes. So, basically, after detecting all the variants that we see in an exome experiment, we filter down to the variants that are actually coding. From those, we filter down to those that affect the protein sequence, so the non-synonymous ones. And then, we filter against in-house databases and public databases to really bring it down to private variants in any given exome. And then, I'll show you some examples of how that can turn out in terms of finding disease genes.
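The filter cascade described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the group's actual pipeline: the `Variant` fields, the `NON_SYNONYMOUS` effect labels, and the `known_keys` set (standing in for the in-house and public databases) are all hypothetical.

```python
from dataclasses import dataclass

# Effect labels treated as protein-changing (an assumed, simplified vocabulary).
NON_SYNONYMOUS = {"missense", "nonsense", "frameshift", "splice_site"}


@dataclass(frozen=True)
class Variant:
    gene: str
    chrom: str
    pos: int
    ref: str
    alt: str
    is_coding: bool
    effect: str

    @property
    def key(self):
        # Identity used for database lookups: position plus alleles.
        return (self.chrom, self.pos, self.ref, self.alt)


def private_nonsynonymous(variants, known_keys):
    """Apply the three filters in order: coding -> non-synonymous -> private."""
    coding = [v for v in variants if v.is_coding]
    nonsyn = [v for v in coding if v.effect in NON_SYNONYMOUS]
    return [v for v in nonsyn if v.key not in known_keys]


# Toy data with placeholder gene names.
calls = [
    Variant("GENE_A", "1", 100, "C", "T", True, "missense"),
    Variant("GENE_B", "2", 200, "G", "A", True, "synonymous"),
    Variant("GENE_C", "3", 300, "A", "G", False, "intronic"),
    Variant("GENE_D", "4", 400, "T", "C", True, "missense"),
]
known = {("4", 400, "T", "C")}  # already seen in some database
private = private_nonsynonymous(calls, known)
print([v.gene for v in private])  # prints ['GENE_A']
```

Keeping the three steps as separate lists makes it easy to report how many variants survive each stage, which is how the numbers in the talk are presented.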
That's our, let's say, oldest example. That was actually our first SureSelect run, our first exome run at all, and there we decided to sequence four individuals that we predicted would have mutations in the same gene. That at least was the theory, and luckily we were right on that. So, those are three of the individuals that we've sequenced, all little babies that suffered from the so-called Schinzel-Giedion Syndrome, a rather rare syndrome--a really rare syndrome, actually, described first by Albert Schinzel and Andreas Giedion, and we collected 13 cases in our group.
And then, we sequenced the exomes, and here you see the same filters that I just showed you. So, bringing it down to the coding, then to the non-synonymous, and then to the private variants gives you something between 150 and 200 of these very rare alleles per individual case. And then, if the theory that you apply holds and the dysmorphology all fits, you might just look for one gene where all of them share a private mutation or a private variant. And in this case, that turned out beautifully, and we're just left with one gene, and that really is the gene explaining this phenotype.
And of course, we then went on and sequenced other cases with the same disease, a total of 13, and in 12 out of 13 we found mutations in exactly the same gene. We never observed these mutations in controls, but strikingly they all mapped to just an 11 base pair stretch in this gene, so to four amino acids, and most importantly, fitting the topic of today, all of them occurred de novo. So, we checked the parents by Sanger sequencing, and the mutations were never present in the parents.
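The overlap strategy just described, looking for a gene in which every patient carries a private variant, could be sketched as follows. This is a minimal illustration, not the analysis code used for the study; the function name, the `min_patients` parameter, and the placeholder gene names are assumptions.

```python
from collections import Counter


def shared_candidate_genes(private_genes_by_patient, min_patients=None):
    """Return genes with a private variant in at least `min_patients` patients.

    By default a gene must be hit in every patient.
    """
    if min_patients is None:
        min_patients = len(private_genes_by_patient)
    counts = Counter()
    for genes in private_genes_by_patient.values():
        counts.update(set(genes))  # count each gene once per patient
    return sorted(g for g, c in counts.items() if c >= min_patients)


# Toy data: each patient's set of genes carrying a private variant.
private_genes = {
    "patient_1": {"GENE_A", "GENE_X", "GENE_Y"},
    "patient_2": {"GENE_A", "GENE_Z"},
    "patient_3": {"GENE_A", "GENE_X"},
    "patient_4": {"GENE_A"},
}
print(shared_candidate_genes(private_genes))                  # prints ['GENE_A']
print(shared_candidate_genes(private_genes, min_patients=2))  # prints ['GENE_A', 'GENE_X']
```

Relaxing `min_patients` is one way to allow for a genetically heterogeneous disease, where not every patient is expected to carry a mutation in the same gene.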
In the meantime, we offer this gene test diagnostically just by Sanger sequencing, and we've already identified 10 new cases; all, again, clustered to these four amino acids, and all, again, were de novo mutations.
We went on and did that for several other syndromes. One other example, with a very similar strategy: again, we took multiple affected individuals all suffering from an equally serious disease called Bohring-Opitz Syndrome. And in that case, we previously did some math and said, "Okay, then, actually three patients could be enough to find one gene." Again, that only holds true if your theory and your clinics are really, really good. And in this case, we took three patients and filtered it down, again, to the private variants. Again, that's usually between 150 and 200. Actually, patient one had a different ethnic background, so that was a patient from Iraq. Probably our in-house data and the dbSNP database are simply not populated by this kind of data, and that's why we see a few more private variants in patient one.
Nevertheless, if we, again, apply the same setting and say, "Hey, they should all share a private mutation in the same gene," then we, again, bring it down to just one candidate gene; again, we validated all of that by Sanger sequencing, and, again, we checked the parental samples, and, again, all three were de novo mutations in a gene called ASXL1. Actually, this seems to be a bit more of a heterogeneous disease. Only half of the patients show mutations in this gene while the others do not.
Actually, we also see a very interesting pattern in terms of the genes that we now find for these very severe developmental diseases. This gene was already known as a cancer-driver gene, so lots of somatic mutations have been described in leukemias, and I think there's quite some overlap. If something goes wrong at the cellular level, you might develop a cancer or leukemia. And if the mutation just comes in early enough, so as a germline mutation, then probably even the development of the whole body can be really, let's say, messed up. So, there are obviously lots of parallels between developing cells and developing organisms. And one might even think of using this knowledge of all the cancer genes to also filter for crucial developmental genes, I think.
In this particular case, I believe, to select the right three patients to start with was really fundamental. And really the good clinics behind that really was fundamental to identify this gene this way.
All right. I think we certainly learned, and not only from this, that de novo mutations themselves seem to be really important, especially for these severe developmental diseases. And then, our next challenge was simply to see, "Hey, can't we do that in a more systematic way, to really study de novo mutations in affected individuals," and that's what we did. This is the publication by my dear colleague, Lisenka Vissers.
And then, we said, "Okay, why start only with the affected child? Shouldn't we just include the exomes of both parents?" And that's exactly what Lisenka did. Very similar numbers again. Up to here, you've seen these tables before. We've filtered down to the private non-synonymous variants in a given case. Again, it's 150 private variants on average. And then, we simply exclude all the variants that have been inherited. So, we check the parental exomes, exclude everything that's in there, and then you're just left with five private potential de novo mutations, and those are the candidates. You then systematically validate those, and we usually see a range of zero to two true de novo changes in an exome.
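The trio step described above, removing everything the child shares with either parent, comes down to a set difference. The sketch below assumes variants are represented as hypothetical `(chrom, pos, ref, alt)` tuples; it is an illustration, not the published pipeline, and the surviving candidates would still need validation as the speaker says.

```python
def candidate_de_novo(child_private, mother_variants, father_variants):
    """Return the child's private variants that are absent from both parents."""
    inherited = set(mother_variants) | set(father_variants)
    return sorted(set(child_private) - inherited)


# Toy data: variants keyed by (chrom, pos, ref, alt).
child = {("7", 1000, "G", "A"), ("12", 5000, "C", "T"), ("X", 300, "A", "G")}
mother = {("7", 1000, "G", "A")}
father = {("X", 300, "A", "G"), ("1", 42, "T", "C")}

print(candidate_de_novo(child, mother, father))  # prints [('12', 5000, 'C', 'T')]
```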
As a result of that, we set up a whole pipeline to do this systematically, and that certainly consists of read mapping and variant calling, and then the variant prioritization, which includes some filter steps really focusing on the coding non-synonymous changes and the exclusion of inherited variants. Then, a systematic validation, at this stage still by Sanger sequencing, and then we're left with de novo mutations.
And we've now done more than 100 trios, and we really see a range of just zero to four de novo mutations that we detect in a given exome. And that actually fits the expected human mutation rate very well. Everybody expects that we should carry something between 50 and 100 de novo mutations in our genome. And if we go to the exome, that is 1 to 1.5 percent of the genome, so that's really in the expected range. That brings down the complexity, even of an exome, dramatically, because it's only something between zero and four candidate genes that you might have to consider.
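The arithmetic behind this can be written out explicitly. The sketch below scales the genome-wide figures from the talk (50 to 100 de novo mutations, exome at 1 to 1.5 percent of the genome) and then, as an added assumption not made by the speaker, treats the count per exome as Poisson-distributed to see how plausible an observed range of zero to four is. It also assumes mutations are spread uniformly over the genome.

```python
from math import exp, factorial


def expected_exome_de_novo(genome_count, exome_fraction):
    """Scale a genome-wide de novo count to the exome, assuming uniform density."""
    return genome_count * exome_fraction


def poisson_cdf(k, mean):
    """P(X <= k) for a Poisson-distributed count with the given mean."""
    return sum(exp(-mean) * mean**i / factorial(i) for i in range(k + 1))


low = expected_exome_de_novo(50, 0.01)     # lower bound: about 0.5 per exome
high = expected_exome_de_novo(100, 0.015)  # upper bound: about 1.5 per exome
print(low, high)

# Under the Poisson assumption, probability of seeing at most 4 events
# in an exome when the mean is at the upper bound.
print(poisson_cdf(4, high))
```

With an expected value between roughly 0.5 and 1.5 per exome, observing zero to four de novo mutations per trio is what such a model would predict for the large majority of cases.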
In this pipeline, we implemented some more novel things. For example, every DNA that we now screen by exome sequencing is previously tested for a number of SNPs. And then, the final variant calling of that exome is compared to this SNP test, so it basically excludes sample swaps in your lab. And that's, of course, a beauty if you go genome-wide: you always have a fingerprint automatically coming with your experiment.
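A fingerprint comparison of this kind can be sketched as a simple genotype concordance between the SNP test and the exome calls. The SNP identifiers, the two-letter genotype strings, and any cut-off applied to the result are assumptions for illustration only.

```python
def genotype_concordance(panel_genotypes, exome_genotypes):
    """Fraction of SNPs typed in both assays whose genotypes agree.

    Genotypes are two-letter strings such as "AG"; allele order is ignored.
    """
    shared = [snp for snp in panel_genotypes if snp in exome_genotypes]
    if not shared:
        raise ValueError("no SNPs in common between panel and exome")
    matches = sum(
        sorted(panel_genotypes[snp]) == sorted(exome_genotypes[snp])
        for snp in shared
    )
    return matches / len(shared)


# Toy data with hypothetical SNP identifiers.
panel = {"snp1": "AG", "snp2": "CC", "snp3": "TT", "snp4": "GA"}
exome_same_sample = {"snp1": "GA", "snp2": "CC", "snp3": "TT", "snp4": "AG"}
exome_other_sample = {"snp1": "AA", "snp2": "CT", "snp3": "TT", "snp4": "GG"}

print(genotype_concordance(panel, exome_same_sample))   # prints 1.0
print(genotype_concordance(panel, exome_other_sample))  # prints 0.25
```

A concordance close to 1.0 supports that the exome belongs to the tested DNA; a clearly lower value would flag a possible sample swap.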
We worked a lot on the inheritance check, so also there, if you sequence a trio, you can nicely see whether that really fits. Again, that could help avoid sample swaps, but you can also detect whether that family really belongs together.
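One simple form of inheritance check is to test, site by site, whether the child's genotype can be built from one maternal and one paternal allele. The sketch below is an assumed, minimal version of such a check, not the group's implementation; genotypes are hypothetical 2-tuples of alleles.

```python
def is_mendelian_consistent(child, mother, father):
    """True if the child's two alleles can come one from each parent."""
    a, b = child
    return (a in mother and b in father) or (b in mother and a in father)


def consistency_rate(trio_genotypes):
    """Fraction of sites at which the trio is Mendelian-consistent."""
    checks = [is_mendelian_consistent(c, m, f) for c, m, f in trio_genotypes]
    return sum(checks) / len(checks)


# Toy data: (child, mother, father) genotypes at three sites.
sites = [
    (("A", "G"), ("A", "A"), ("G", "G")),  # consistent
    (("T", "T"), ("C", "T"), ("T", "T")),  # consistent
    (("C", "C"), ("C", "C"), ("T", "T")),  # inconsistent: father has no C
]
print(consistency_rate(sites))
```

A handful of inconsistent sites is expected from genotyping errors or true de novo events; a high fraction of inconsistent sites would suggest a sample mix-up or that the family does not belong together.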
I'll show one more slide on a tool that we call the BAM Miner [sp]. That really was fundamental in excluding inherited variants in a really systematic way. And in the meantime, Christian, who is also here, set up a GUI, a user interface, to do the analysis in a more sophisticated way.
What do we mean by a BAM Miner? Actually, it's best to just show you a picture. Usually, you consider this as being one of the patients where you find a private heterozygous mutation, and then you want to check that the mutation is not present in any other sample. And more particularly, you want to check that it's not present in mom or dad. And we see that sometimes the variant calling is, of course, imperfect, as all of you might know, but if you go back to the raw data, you can really dive into that. And just for these up to 200 private variants, we can easily check that in an automated fashion in the raw sequencing data, which are the BAM files in our case, and therefore it's called the BAM Miner.
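The idea of going back to the raw reads, rather than trusting the variant calls, can be illustrated with a small sketch. This is not the BAM Miner itself: the function names and the `min_alt_reads` threshold are hypothetical, and the read bases at the variant position are passed in directly as strings, whereas in practice they would be extracted from the BAM files with a BAM-reading library.

```python
from collections import Counter


def alt_support(pileup_bases, alt):
    """Return (reads supporting alt, total depth) for the bases at one position."""
    counts = Counter(base.upper() for base in pileup_bases)
    return counts[alt.upper()], sum(counts.values())


def seen_in_other_samples(pileups_by_sample, alt, min_alt_reads=2):
    """Map each sample with enough alt-supporting reads to (alt_reads, depth)."""
    hits = {}
    for sample, bases in pileups_by_sample.items():
        alt_reads, depth = alt_support(bases, alt)
        if alt_reads >= min_alt_reads:
            hits[sample] = (alt_reads, depth)
    return hits


# Toy data: bases observed at the child's variant position in each parent.
parents = {"mother": "GGGGGGGGGGGG", "father": "GGGAGGGAGG"}
print(seen_in_other_samples(parents, "A"))  # prints {'father': (2, 10)}
```

A variant with supporting reads in a parent would be treated as inherited (or at least not a clean de novo candidate) even if the variant caller did not report it in that parent.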
All right. So, after setting up this whole pipeline, the question, of course, was, "Can we do something more? Can we find something new?" And today, I want to share one new example with you where we sequenced trios for a rare disease called Baraitser-Winter Syndrome. One of the main characteristics of this syndrome is a severe lissencephaly; that means smooth brain, so really a severe defect in the brain. And actually, we sequenced three trios, and we did so in a collaboration with Bill Dobyns' group in Seattle. They sequenced one trio in parallel. At the beginning, we didn't even know that we were doing that in parallel, and we just teamed up afterwards. And we sequenced two other trios from independent patients.
And here is one example from one of the trios that we've sequenced. We found a de novo mutation, the only de novo mutation in this particular case, in one of the cytoplasmic actin genes, which actually is a very well-known gene. Probably all of you know that better if you need loading controls for RNA SAG or just expression experiments. So, that's one of the very well-known genes. And now we've found a de novo mutation, a highly conserved missense change, and, again, we validated what we saw in the exome data by Sanger sequencing and really could show that it is a de novo mutation in the affected child.
Here you see a summary. And quite strikingly, in trio one we found a de novo ACTB mutation, and in the second trio we found a de novo mutation in ACTG1. And this is the cumulative total: we screened 18 cases with this syndrome, and all share mutations in the same two genes. So, half of the patients had mutations in ACTB and the other half in ACTG1.
Of course, we looked into the phenotype. Those are some photographs of patients that have mutations in ACTB, and those are photographs of patients that have mutations in ACTG1. Actually, phenotypically, we can't see any difference, so we really think that it doesn't matter in which of those two genes you have a mutation; they always cause this phenotype. And one of the reasons might be that there's high homology between these two proteins. These two proteins differ by just four amino acids in total, and they're even highly conserved throughout evolution. So, they're important, highly conserved, but they're also alike and obviously have very similar function; and therefore, de novo mutations probably give such a similar phenotype.
Because those proteins are so well studied and well understood, we joined up with a third group and they did quite some beautiful functional follow-up in patient cell lines. They really looked into the actin structure. We certainly see that we have an increased accumulation of F-actin. It is reproducible between different cell lines that carry the same mutation; however, each mutation seems to give a slightly different pattern in the F-actin.
Okay, as a summary for this example, I think this was now really systematic: trio sequencing to find new disease genes for these rare disorders. We think that ACTB and ACTG1 are the only causative genes for Baraitser-Winter Syndrome, and obviously a dysfunction of these cytoplasmic actins is one of the causes of lissencephaly. And actually, this paper was accepted for publication and will be online by the end of this month.
All right. I would like to share one more example what happens if you bring accurate phenotyping and genotyping together, and that's basically the beauty of these unbiased genome-wide experiments. You could see these two photographs--and maybe you're surprised if I tell you that those are two independent boys. So, it's not the same boy at different ages; it's totally two independent boys. And actually, this boy was seen by the head of our department, Han Brunner, several years ago, and he always shows this photograph at different syndromology meetings and asks, "Hey, I think this is a very characteristic face and there are some other features with that. Has anybody else ever seen a boy that looks very similar? Maybe that simply is a new syndrome."
And several years later, one of our collaborators in Leuven in Belgium came up and said, "Hey, I have seen a boy that looks very, very similar," and so we brought those two together, and we truly think that this might be a new syndrome; as yet undefined, because for now we only know that there are two cases worldwide that seem to share very similar features, and some of them are listed below their photographs right now.
And again, we applied a trio approach and looked for de novo mutations in both boys. And in the first boy, who was seen in [unintelligible] originally, we found two de novo mutations in the exome, while in the other case from Belgium we found one de novo mutation in the exome. And quite strikingly, in patient one, we found a mutation in the PACS1 gene. And guess what? Also in the second case we found a de novo mutation in the PACS1 gene. So, they shared de novo mutations in the same gene. But, even more striking, they shared exactly the same mutation. And you might at least speculate that only this one mutation gives this very rare syndrome or this very rare disease. Maybe that's very specific to this one base pair.
That's basically just the overview again. They were de novo, as I said, and of course validated.
Okay, just a little bit more on de novo mutations. You might just ask, "Okay, it's de novo, so it occurs new, but when does it actually occur?" And we have one example. It's not yet really proven, but I still thought it's interesting to share with you. This is another syndrome, where we collaborated with Riitta Salonen in Finland, and she sent over DNA of this girl and both parents, and again we applied our trio approach. We found a de novo mutation. Here it's already validated by Sanger sequencing, so that's a heterozygous change in the MET3/K7 gene, and it was not present when we looked at Sanger sequencing in both parents. And it's highly conserved, and we think it could fit the phenotype.
And then, we looked back at the exome data in more detail, and here's the exome of the daughter. We found the mutation present in 48 percent of all reads. We had quite nice coverage at this spot. We then looked at her daddy; no reads with this kind of variant. But here we found three reads in the mother that also have exactly the same mutation. It is difficult to really prove that by independent experiments. And to follow up on that, we got additional tissues from the mom, but it might be that even in blood she had a very low-level mosaicism for this mutation. So, there might actually be a small recurrence risk, so it certainly will be important to also figure these kinds of scenarios out.
And I think we totally underestimated this, simply because with all the old-school technologies it was basically impossible to find this kind of low-level mosaicism. But, by doing systematic exome sequencing genome-wide, we're really able now to find those, and I think there will be much more coming up like this.
If you talk about timing of de novo mutations, Jim Lupski wrote in News and Views, and he said, okay, basically de novo mutations still can occur at all different stages, maybe even quite late. And there are some first examples like the Proteus Syndrome paper, which is highly interesting, where just somatic mutations give rise to a phenotype. And of course, that's going slightly in the tumor direction, but I think there are also non-tumor clinical phenotypes where exactly this happens, that a mutation occurs quite late as a somatic mutation. While we actually think that quite some of our de novos happen just in the germline cells, so either in the egg or the sperm cells that the parents deliver, so you better check carefully who's your daddy, and you better carefully check who's your mom; although, that might be quite difficult sometimes.
And maybe we have one of the first examples here, where one mother had a very low-level mosaicism even in blood DNA, so maybe that even happened earlier and already in the somatic state in mom.
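One way to reason about a handful of variant reads in a parent is to ask how likely they are to be sequencing errors alone. The sketch below computes a binomial tail probability; the talk reports three variant reads in the mother but gives neither the read depth nor an error rate, so the `depth` and `error_rate` values used here are purely hypothetical, and the independence of errors between reads is an assumption.

```python
from math import comb


def prob_at_least_k_errors(depth, k, error_rate):
    """P(at least k of `depth` reads show the variant base by error alone).

    Assumes independent reads with the same per-read probability of
    miscalling this specific base.
    """
    p_fewer = sum(
        comb(depth, i) * error_rate**i * (1 - error_rate) ** (depth - i)
        for i in range(k)
    )
    return 1 - p_fewer


# Hypothetical numbers: 100 reads in the mother, 3 carrying the variant,
# and a per-read error rate of 0.1 percent for this specific substitution.
p = prob_at_least_k_errors(depth=100, k=3, error_rate=0.001)
print(p)
```

A very small probability would argue against pure sequencing error and in favour of low-level mosaicism, although, as the speaker notes, independent confirmation is still needed.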
All right, just one more slide. I showed you now these rather severe syndromes, and I think one of the reasons for the prominent role of de novo mutations here is that all these diseases share a really severely reduced fecundity. So, most of these individuals will never have children themselves, so they will never pass on their mutations to the next generation, and we think that also applies to diseases like mental retardation, autism, schizophrenia, where we also have this kind of reduced fecundity. And that logically would mean that we need de novo mutations to still have the same prevalence of the disease.
But, we even see de novo mutations in diseases that are usually inherited. My dear colleague Kornelia Neveling had some examples for blindness. This is a well-known dominant blindness gene, and she found it--in a sporadic case, she found the de novo mutation, and probably previously people wouldn't really have looked at that. So, there are some diseases and some examples.
But, more strikingly, I think, she had this example in USH2A, for the combined deafness-blindness syndrome, Usher Syndrome, and she actually found a heterozygous mutation segregating throughout a family, and there was a sporadic affected case. And this case, in addition to the inherited variant, had one de novo mutation on the second allele of this gene. And I think we also should consider de novo mutations in such a scenario, or at least should not just forget about them.
All right. Of course, we want to go on. There are new technical developments to focus especially on de novo mutations at higher speed and larger scale. We just teamed up with Life Technologies to improve the sequencing, the throughput and the ease of the protocol. We now use SureSelect exomes and the new Wildfire technology to grow colonies directly on the FlowChips of a SOLiD. And actually, I did the first experiment, and that certainly boosted the throughput by at least a factor of four, so basically you have high-density colonies with up to one billion raw reads per lane now. And if you want to see more of that, Joe Beechem from Life has a poster on this technology, and I also have a shared poster with him. So, you could stop by the poster, and we can discuss more of these new developments.
Because we're interested in de novo mutations in particular, of course sometimes you can't predict whether it's base pair changes or other kinds of genomic variation. And therefore, as we are also originally a CNV group, we very much look into other kinds of variation, like CNVs, in the exome data. And by just normalizing the coverage across a number of exomes, we can detect CNVs quite accurately. For all positive controls that we've run so far, all deletions or duplications larger than three exons that were covered in the exome kit were picked up. And here the red line represents the normalized coverage, so CNV detection by exome sequencing, compared to the black line that comes from a CNV SNP microarray. And we see that the dynamic range of the exome coverage is already better than that of the arrays that we use, so that's quite promising.
Of course, we have to consider that we are biased by the exome approach itself. There are gene deserts of several megabases sometimes where we simply cannot detect CNVs at all. You might even think of spiking in some probes for those regions if you wish. But, I think a number of CNVs are certainly detectable by exome sequencing.
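Coverage-based CNV detection of the kind described here can be illustrated with a small sketch: normalize per-exon coverage against a reference, take log2 ratios, and report runs of at least three consecutive exons beyond a threshold. The normalization by total coverage, the `threshold` of 0.4, and the function names are assumptions for illustration, not the group's method; the sketch also assumes every exon has non-zero coverage.

```python
from math import log2


def normalized_log_ratios(sample_cov, reference_cov):
    """Per-exon log2 ratio after scaling each profile by its total coverage."""
    s_total, r_total = sum(sample_cov), sum(reference_cov)
    return [
        log2((s / s_total) / (r / r_total))
        for s, r in zip(sample_cov, reference_cov)
    ]


def call_cnv_segments(log_ratios, threshold=0.4, min_exons=3):
    """Return (first_exon, last_exon, kind) for runs of >= min_exons exons."""
    segments, start, kind = [], None, None
    # A trailing 0.0 acts as a sentinel that closes a run ending at the last exon.
    for i, lr in enumerate(list(log_ratios) + [0.0]):
        this = "dup" if lr > threshold else "del" if lr < -threshold else None
        if this != kind:
            if kind is not None and i - start >= min_exons:
                segments.append((start, i - 1, kind))
            start, kind = i, this
    return segments


# Toy data: exons 3 to 5 have half the reference coverage.
sample = [100, 100, 100, 50, 50, 50, 100, 100, 100, 100]
reference = [100] * 10
ratios = normalized_log_ratios(sample, reference)
print(call_cnv_segments(ratios))  # prints [(3, 5, 'del')]
```

Requiring several consecutive exons trades sensitivity for specificity, which matches the observation in the talk that events larger than three exons were reliably picked up.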
And I just want to finish with this slide. One of our ideas, especially for de novo mutations, is that the frequency of a Mendelian disease could very well correlate just with its mutational target size. And what I mean by that: I showed you the example of Schinzel-Giedion, a very rare disease, and probably just de novo mutations in an 11 base pair stretch cause this disorder. I showed you these two boys that shared an identical de novo mutation. Maybe it's such a rare disease because only this one base pair mutation can cause exactly the phenotype. And the larger the mutational target, so the more genes can cause more or less the same phenotype, the more frequent a disease might be. And then, consider intellectual disability or mental retardation as a disease that could be caused by mutations in, let's say, a thousand, maybe 3,000 genes. Now, we are talking about a really common disorder. And still, I would predict that at least a proportion of those can be caused by de novo mutations.
I think we ourselves were so convinced by these research results that we simply had to push that towards diagnostics, and I truly believe that we are just in the middle of a revolution. So, we ourselves launched diagnostic exome sequencing in September last year, and for now we're performing a pilot study on that for 500 samples on five different diseases. And those are all diseases where current diagnostics based on Sanger sequencing is really, really poor. And one very simple reason is that all these diseases are genetically very heterogeneous. So, it's intellectual disability, where we certainly run the trio approach to systematically detect de novo mutations, and all the other diseases that we start with for now are movement disorders, hereditary blindness, deafness, and metabolic diseases. And actually, the list is currently already growing, so some pitnay [sp] disorders are also included in that. And that's truly been offered diagnostically since September.
And in a broader sense, where I think we will go with exome sequencing in particular is that in the next five years we will simply know most of the Mendelian disease genes. So, if you just look at OMIM, I think there are now 3,000 genes in there, which basically is the work of a worldwide community over 30 years. And I think we'll simply find the next 10,000 or so disease genes in the next maybe even three years; maybe five is quite conservative on that.
Certainly, we need lots of functional genomics to really understand the biology behind the genetics; there's no doubt about that. My particular focus is on the syndromology, but I'd like to expand that to lethal phenotypes. So, that's what I'm studying at the moment looking into miscarriages that have serial malformations, and I think we will find mutations in genes that we will never see in live-born individuals.
And I certainly think that something like somatic mutations, not only in the cancer field, but also in clinical phenotypes, will be really important. In the long run, I think, going for single-cell or at least very low input amount approaches will be fundamental for that.
And for the diagnostics, I think we are getting much, much closer to a genotype-first approach, but we simply have to sequence either the whole genome or the whole exome as a first step in many, many cases, at least all the cases that come to departments of human genetics. And therefore, I think genetics itself will become a very prominent discipline in the medical field in general.
With this, I would like to thank you for your attention, and of course thank our great team in Nijmegen, all the guys that I cited here. Christian is also here, our lead mathematician, so if you have particular questions on that he's around. But, actually, it was a department-wide effort, and actually there are 300 people working in Nijmegen in our department.
Thanks a lot for your attention.