Why are mutants important in a population
Figure 1: The overwhelming majority of mutations have very small effects. This example of a possible distribution of deleterious mutational effects was obtained from DNA sequence polymorphism data from natural populations of two Drosophila species. The spike at includes all smaller effects, whereas effects are not shown if they induce structural damage equivalent to selection coefficients that are 'super-lethal' (see Loewe and Charlesworth for more details).
A single mutation can have a large effect, but in many cases, evolutionary change is based on the accumulation of many mutations with small effects. Mutational effects can be beneficial, harmful, or neutral, depending on their context or location. Most non-neutral mutations are deleterious. In general, the more base pairs that are affected by a mutation, the larger the effect of the mutation, and the larger the mutation's probability of being deleterious.
To better understand the impact of mutations, researchers have started to estimate distributions of mutational effects (DMEs) that quantify how many mutations occur with what effect on a given property of a biological system. In evolutionary studies, the property of interest is fitness, but in molecular systems biology, other emerging properties might also be of interest. It is extraordinarily difficult to obtain reliable information about DMEs, because the corresponding effects span many orders of magnitude, from lethal to neutral to advantageous; in addition, many confounding factors usually complicate these analyses.
To make things even more difficult, many mutations also interact with each other to alter their effects; this phenomenon is referred to as epistasis. Of course, much more work is needed in order to obtain more detailed information about DMEs, which are a fundamental property that governs the evolution of every biological system.
Many direct and indirect methods have been developed to help estimate rates of different types of mutations in various organisms. The main difficulty in estimating rates of mutation involves the fact that DNA changes are extremely rare events and can only be detected on a background of identical DNA.
Because biological systems are usually influenced by many factors, direct estimates of mutation rates are desirable. Direct estimates typically involve use of a known pedigree in which all descendants inherited a well-defined DNA sequence. To measure mutation rates using this method, one first needs to sequence many base pairs within this region of DNA from many individuals in the pedigree, counting all the observed mutations.
These observations are then combined with the number of generations that connect these individuals to compute the overall mutation rate (Haag-Liautard et al.). Such direct estimates should not be confused with substitution rates estimated over phylogenetic time spans. Mutation rates can vary within a genome and between genomes. Much more work is required before researchers can obtain more precise estimates of the frequencies of different mutations.
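The arithmetic behind a direct estimate can be sketched in a few lines. This is a minimal illustration, not the procedure of any cited study: the function name, its parameters, and the example numbers are all hypothetical, and it assumes the simplest case in which every surveyed site had one opportunity to mutate per generation.

```python
def direct_mutation_rate(n_mutations, n_sites, n_generations):
    """Per-site, per-generation mutation rate from pedigree counts.

    n_mutations   : new mutations observed across the surveyed DNA
    n_sites       : number of base pairs surveyed
    n_generations : total generations separating the sampled individuals
                    from the common ancestral sequence, summed over lineages
    """
    site_generations = n_sites * n_generations
    if site_generations <= 0:
        raise ValueError("n_sites and n_generations must be positive")
    return n_mutations / site_generations


# Hypothetical numbers chosen only to illustrate the calculation.
example_rate = direct_mutation_rate(n_mutations=12, n_sites=1_000_000, n_generations=1_200)
print(f"{example_rate:.2e} mutations per site per generation")
```

Because mutations are so rare, the denominator (sites times generations) has to be very large before the numerator is reliably above zero, which is why such studies require so much sequencing.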
The rise of high-throughput genomic sequencing methods nurtures the hope that we will be able to cultivate a more detailed and precise understanding of mutation rates.
Because mutation is one of the fundamental forces of evolution, such work will continue to be of paramount importance.
Drake, J. Rates of spontaneous mutation. Genetics.
Eyre-Walker, A. The distribution of fitness effects of new mutations. Nature Reviews Genetics 8.
Haag-Liautard, C. Direct estimation of per nucleotide and genomic deleterious mutation rates in Drosophila. Nature, 82-85.
Loewe, L. Inferring the distribution of mutational effects on fitness in Drosophila. Biology Letters 2.
Lynch, M. Perspective: Spontaneous deleterious mutation. Evolution 53.
Orr, H. The genetic theory of adaptation: A brief history. Nature Reviews Genetics 6.
Sandelin, A. Arrays of ultraconserved non-coding regions span the loci of key developmental genes in vertebrate genomes. BMC Genomics 5, 99.
If there are additional, undiscovered sites in the gene at which mutations are fatal when carried in combination with a known recessive lethal mutation, the purging effect of purifying selection on the known mutations will be under-estimated in our simulations, leading us to over-estimate the expected frequencies of the known mutations in simulations.
Therefore, our predictions are, if anything, an over-estimate of the expected allele frequency, and the discrepancy between the predicted and the observed frequencies is likely even larger than what we found. The other factor that we did not consider in simulations but that would reduce the expected allele frequencies is a subtle fitness decrease in heterozygotes, as has been documented in Drosophila, for example [44].
To evaluate potential fitness effects in heterozygotes when none had been documented in humans, we considered the phenotypic consequences of orthologous gene knockouts in mouse. For all eight, homozygote knockout mice presented phenotypes similar to those of affected humans, and heterozygotes showed a milder but detectable phenotype (S5 Table). The magnitude of the heterozygote effect of these mutations in humans is unclear, but the finding with knockout mice makes it plausible that there exists a very small fitness decrease in heterozygotes in humans as well, potentially not enough to have been recognized in clinical investigations but enough to have a marked impact on the allele frequencies of the disease mutations.
To investigate the population genetics of human disease, we focused on mutations that cause Mendelian, recessive disorders that lead to early death or completely impaired reproduction. We sought to understand to what extent the frequencies of these mutations fit the expectation based on a simple balance between the input of mutations and the purging by purifying selection, as well as how other mechanisms might affect these frequencies.
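As a point of reference for the "simple balance" between mutation and selection, the textbook deterministic result for a fully recessive allele in an infinite, randomly mating population is an equilibrium frequency of sqrt(u/s), which reduces to sqrt(u) for a lethal (s = 1). The sketch below only evaluates that formula. It is not the expectation used in this study, which comes from simulations under a demographic model, and the function name and the three example mutation rates are hypothetical.

```python
import math


def recessive_equilibrium_frequency(u, s=1.0):
    """Deterministic mutation-selection balance for a fully recessive allele.

    u : mutation rate toward the deleterious allele, per generation
    s : selection coefficient against homozygotes (1.0 means lethal)
    Assumes an infinite, randomly mating population (an idealization).
    """
    if u < 0 or not 0 < s <= 1:
        raise ValueError("need u >= 0 and 0 < s <= 1")
    return math.sqrt(u / s)


# Hypothetical per-site mutation rates spanning two orders of magnitude.
for u in (1e-9, 1e-8, 1e-7):
    q = recessive_equilibrium_frequency(u)
    print(f"u = {u:.0e}  ->  equilibrium q = {q:.2e}")
```

The square-root dependence is the reason why, under this idealized model, more mutable sites are expected to carry deleterious alleles at higher frequencies.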
Many studies implicitly or explicitly compare known disease allele frequencies to expectations from mutation-selection balance [5, 29-32]. In this study, we tested whether known recessive lethal disease alleles as a class fit these expectations. We found that, under a sensible demographic model for European population history with purifying selection only in homozygotes, the expectations fit the observed disease allele frequencies poorly: the mean empirical frequencies of disease alleles are substantially above expectation for all mutation types (although not significantly so for CpGti), and the fold increase in observed mean allele frequency relative to the expectation decreases with increasing mutation rate (Fig 2).
Furthermore, including possible effects of compound heterozygosity and a subtle fitness decrease in heterozygotes would only exacerbate the discrepancy. In principle, higher than expected disease allele frequencies could be explained by at least six non-mutually exclusive factors: (i) widespread errors in reporting the causal variants; (ii) misspecification of the demographic model; (iii) misspecification of the mutation rate; (iv) reproductive compensation; (v) overdominance of disease alleles; and (vi) low penetrance of disease mutations.
Because widespread mis-annotation of the causal variants in disease mutation databases had previously been reported [23, 45, 46], we tried to minimize the effect of such errors on our analyses by filtering out any case that lacked compelling evidence of association with a recessive lethal disease, reducing our initial set of mutations to a subset in which we had greater confidence (see Methods for details). We also explored the effects of having misspecified recent demographic history or the mutation rate.
Based on very large samples, it has been estimated that population growth in Europe was stronger than what we considered in our simulations [ 47 , 48 ].
When we considered higher growth rates, such that the current effective population size is up to fold larger than that of the rescaled Tennessen model, we observed an increase in the expected frequency of recessive disease alleles and a larger number of segregating sites (S4 Fig, columns A-E).
However, the impact of a larger growth rate is insufficient to explain the observed discrepancy: the allele frequencies observed in ExAC are still on average an order of magnitude larger than expected based on a model with a fold larger current effective population size than the one initially considered [25] (S4 Fig).
In turn, population substructure within Europe would only increase the number of homozygotes relative to what was modeled in our simulations, through the Wahlund effect [49], and expose more recessive alleles to selection, thus decreasing the expected allele frequencies and exacerbating the discrepancy that we report. Except for the mean mutation rate now set to 2. The observed mean frequency remains significantly above the predicted frequencies, and the qualitative conclusions are unchanged (S4 Fig, column F).
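As a numerical aside on the Wahlund effect mentioned above, the sketch below compares the homozygote frequency in a subdivided population with that in a single randomly mating population carrying the same average allele frequency. It assumes equally sized subpopulations that each mate at random internally. The function names and the two example frequencies are hypothetical and are not drawn from the study.

```python
def homozygotes_if_pooled(freqs):
    """Homozygote frequency if all subpopulations mated as one random-mating unit."""
    q_bar = sum(freqs) / len(freqs)
    return q_bar ** 2


def homozygotes_if_structured(freqs):
    """Average homozygote frequency when each equal-sized subpopulation mates
    at random internally (every subpopulation is in Hardy-Weinberg proportions)."""
    return sum(q * q for q in freqs) / len(freqs)


# Two hypothetical subpopulations with different disease-allele frequencies.
subpop_freqs = [0.001, 0.003]
pooled = homozygotes_if_pooled(subpop_freqs)
structured = homozygotes_if_structured(subpop_freqs)
print(f"pooled: {pooled:.2e}, structured: {structured:.2e}")
```

The excess of homozygotes under structure equals the variance in allele frequency among subpopulations, so it vanishes only when every subpopulation has the same frequency.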
Another factor to consider is that for disease phenotypes that are lethal very early on in life, there may be partial or complete reproductive compensation e. This phenomenon would decrease the fitness effects of the recessive lethal mutations and could therefore lead to an increase in the allele frequency in data relative to what we predict for a selection coefficient of 1.
There are no reasons, however, for this phenomenon to correlate with the mutation rate, as seen in Fig 2B. The other two factors, overdominance and low penetrance, are likely explanations for a subset of cases.
Regardless, it is known that disease mutations in this gene can complement one another [10, 11] and that modifier loci in other genes also influence their penetrance [11, 14]. Consistent with variable penetrance, Chen et al. These observations make it plausible that, in a subset of cases (particularly for CFTR), the high frequency of deleterious mutations associated with recessive, lethal diseases is due to genetic interactions that modify the penetrance of certain recessive disease mutations.
It is hard to assess the importance of this phenomenon in driving the general pattern that we observe, but two factors argue against it being a sufficient explanation for our findings at the level of single sites.
First, when we removed mutations in CFTR and 12 in DHCR7, the two genes that were outliers at the gene level (Fig 3; S4 Table) and for which there is evidence of incomplete penetrance [24], the discrepancy between observed and expected allele frequencies was barely affected (S5 Fig).
Moreover, there is no obvious reason why the degree of incomplete penetrance would vary systematically with the mutation rate of a site, as observed (Fig 2B). Instead, it seems plausible that there is an ascertainment bias in disease allele discovery and mutation identification [52, 54, 55].
Therefore, those mutations that have been identified to date are likely the ones that are segregating at higher frequencies in the population. Moreover, mutation-selection balance models predict that the frequency of a deleterious mutation should correlate with the mutation rate. Together, these considerations suggest that disease variants of a highly mutable class, such as CpGti, are more likely to have been mapped and that the mean frequency of mapped mutations will tend to be only slightly above that of all disease mutations in that class.
In contrast, less mutable disease mutations are less likely to have been discovered to date, and the mean frequency of the subset of mutations that have been identified may tend to be far above that of all mutations in that class. To quantify these effects, we modeled the ascertainment of disease mutations both analytically and in simulations. A large proportion of recessive Mendelian disease mutations were identified in inbred populations, likely because inbreeding leads to an excess of homozygotes compared to that expected under random mating, increasing the probability that a recessive mutation would be discovered as causing a disease.
Therefore, we modeled ascertainment in disease discovery in human populations with a plausible degree of inbreeding (see Methods). As expected, we found that for a given mutation type, the probability of ascertainment increases with the sample size of the putative disease ascertainment study (n_a) and the average inbreeding coefficient of the population under study (F_a); in addition, the average allele frequency of mutations that have been identified is always higher than that of all existing mutations, and the discrepancy decreases as the ascertainment probability increases (Table 1).
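The direction of this bias can be illustrated with a small calculation. The sketch assumes that a mutation is ascertained if at least one homozygote appears among n_a sampled individuals, with the homozygote probability under inbreeding taken as q^2(1 - F_a) + q F_a, the standard genotype frequency under inbreeding. The function names, the list of allele frequencies, and the parameter values are hypothetical; this is an illustration of the reasoning, not the authors' simulation.

```python
def p_ascertain(q, n_a, f_a):
    """Probability that at least one homozygote is seen among n_a individuals."""
    p_hom = q * q * (1 - f_a) + q * f_a
    return 1 - (1 - p_hom) ** n_a


def mean_frequency_all(freqs):
    """Mean allele frequency over every disease mutation, discovered or not."""
    return sum(freqs) / len(freqs)


def mean_frequency_ascertained(freqs, n_a, f_a):
    """Expected mean frequency among mutations that get discovered:
    each frequency is weighted by its own ascertainment probability."""
    weights = [p_ascertain(q, n_a, f_a) for q in freqs]
    total = sum(weights)
    if total == 0:
        raise ValueError("no mutation has a non-zero chance of ascertainment")
    return sum(q * w for q, w in zip(freqs, weights)) / total


# Hypothetical population frequencies for three sites of one mutation type.
site_freqs = [1e-5, 1e-4, 1e-3]
print(mean_frequency_all(site_freqs))
print(mean_frequency_ascertained(site_freqs, n_a=1_000, f_a=0.01))
print(mean_frequency_ascertained(site_freqs, n_a=10_000_000, f_a=0.01))
```

Because the weights increase with q, the ascertained mean exceeds the overall mean. As n_a grows, the weights approach 1 for every site and the two means converge, which is the pattern described in the text.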
Furthermore, comparison across different mutation types reveals that a higher mutation rate increases the probability of disease mutations being ascertained (Table 1 and S6 Fig) and decreases the magnitude of bias in the estimated allele frequency relative to the mutation class as a whole (Table 1).
In summary, among all the possible aforementioned explanations for the observed discrepancy between empirical and expected mean allele frequencies, the ascertainment bias hypothesis is the only one that also explains why it is more pronounced for less mutable mutation types (Fig 2B).
For a similar result derived from analytical modeling, see S6 Fig. Parameters for this step of the simulation correspond to plausible scenarios for human populations with widespread inbreeding e. The last column in the bottom panel shows the fold increase of the mean allele frequency observed in ExAC in relation to simulations based on the Tennessen et al. Mutation rates u (per bp, per generation) were obtained from a large human pedigree study [18].
One implication of this hypothesis is that there are numerous sites at which mutations cause recessive lethal diseases yet to be discovered, particularly at non-CpG sites.
More generally, this ascertainment bias complicates the interpretation of observed allele frequencies in terms of the selection pressures acting on disease alleles. Beyond this specific point, our study illustrates how the large sample sizes now made available to researchers in the context of projects like ExAC [ 23 ] can be used not only for direct discovery of disease variants, but also to test why disease alleles are segregating in the population and to understand at what frequencies we might expect to find them in the future.
In order to identify single nucleotide variants within the 42 genes associated with lethal, recessive Mendelian diseases (S1 Table), we initially relied on the ClinVar dataset [56] (accessed on June 3rd). We filtered out any variant that is an indel or a more complex copy number variant or that is ever classified as benign or likely benign in ClinVar (whether or not it is also classified as pathogenic or likely pathogenic).
By this approach, we obtained SNVs described as pathogenic or likely pathogenic. We considered effects in the absence of medical treatment, as we were interested in the selection pressures acting on the alleles over evolutionary time scales rather than in the last one or two generations, i. To evaluate the impact of treatment, we decreased s from 1 to 0 i. Because of the stochastic nature of the simulations, we repeated this pairwise comparison 10 times in order to get a range of expected increase in allele frequencies.
We observed only a minor increase in the mean allele frequency 2. This simulation procedure corresponds to a scenario in which there is an extremely effective treatment for all diseases for the past three generations, which is an overestimate of the effect and length of treatment for the disease set considered.
Variants with mention of incomplete penetrance i. Although these mutations were purportedly associated with completely recessive diseases, we sought to examine whether there would be possible, unreported effects in heterozygous carriers. To this end, we used the Mouse Genome Database (MGD) [57] (accessed July 29th) and were able to retrieve information for both homozygote and heterozygote mice for eight out of the 32 genes, all of which have a homologue in mice (S5 Table).
In addition to the information provided by ClinVar for each one of these variants, we considered the immediate sequence context of each SNV, to tailor the mutation rate estimate accordingly [ 18 ].
We focused our analyses on those individuals of Non-Finnish European descent, because they constitute the largest sample size from a single ancestry group. We note that some disease mutations, for instance those in ASPA, HEXA and SMPD1, are known to be especially prevalent in Ashkenazi Jewish populations, which could potentially bias our results if Ashkenazi Jewish individuals constituted a large portion of the sample we considered.
Daniel MacArthur, personal communication. From the initial mutations, we filtered out three that were homozygous in at least one individual in ExAC and 29 that had lower coverage, i. This approach left us with a set of mutations with a minimum coverage of 27x per sample and an average coverage of 69x per sample (S2 Table). For sites with non-zero sample frequencies, ExAC reported the number of non-Finnish European individuals that were sequenced, which was on average 32, individuals [23].
For the remaining sites, we did not have this information. We therefore assumed that the mean number of individuals covered for all sites was 32, and used this number to obtain sample frequencies from simulations, as explained below.
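One common way to turn a simulated population frequency into a sample frequency is to draw the 2n sampled chromosomes binomially. The sketch below shows that step under the assumption of independent draws. Whether the authors used exactly this scheme is not stated in this section, and the function name, the sample size, and the frequency are hypothetical.

```python
import random


def sample_allele_frequency(q, n_individuals, rng):
    """Binomially sample 2 * n_individuals chromosomes from a population
    in which the allele has frequency q, and return the sample frequency."""
    n_chromosomes = 2 * n_individuals
    count = sum(1 for _ in range(n_chromosomes) if rng.random() < q)
    return count / n_chromosomes


rng = random.Random(42)  # fixed seed so the example is repeatable
# Hypothetical values: a rare allele and a sample of 10,000 diploid individuals.
observed = sample_allele_frequency(q=2e-4, n_individuals=10_000, rng=rng)
print(f"sample frequency: {observed:.2e}")
```

For rare alleles the sample frequency is often exactly zero, so comparisons between simulated and observed data need to be made on sample frequencies and not on the underlying population frequencies.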
This subset excludes individuals with indications for testing because of known personal or family history of Mendelian diseases, infertility, and consanguinity. It therefore represents a more random (with regard to the presence of disease alleles), population-based survey. We focused our analysis of this dataset on the 76, individuals of self-reported Northern or Southern European ancestry.
We modeled the frequency of a deleterious allele in human populations by forward simulations based on a crude but plausible demographic model for human populations from Africa and Europe, inferred from exome data for African-Americans and European-Americans [ 25 ].
To this end, we used a program described in [ 1 ]. In brief, the demographic scenario consists of an Out-of-Africa demographic model, with changes in population size throughout the population history, including a severe bottleneck in Europeans following the split from the African population and a rapid, recent population growth in both populations [ 25 ].
As in Simons et al. The original demographic model was inferred using a mutation rate u of 2. To incorporate what is believed to be a more accurate mutation rate estimate, we rescaled all demographic and time parameters in the original Tennessen et al.
We refer to this model as the rescaled Tennessen model and rely on it throughout. Negative selection acting on a single bi-allelic site was modeled as in the analytic models. Allele frequencies follow a Wright-Fisher sampling scheme in each generation according to these viabilities, with migration rate and population sizes varying according to the demographic scenario considered.
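A single-site Wright-Fisher step with viability selection and recurrent mutation can be sketched as follows. This is a deliberately simplified illustration: it uses a constant population size and no migration, whereas the model described here varies both according to the demographic scenario. The function names, the dominance parameter `h`, and all numerical values are hypothetical.

```python
import random


def next_generation(q, n_diploid, u, s, h, rng):
    """One Wright-Fisher generation: viability selection, mutation, then drift.

    Viabilities are 1 (AA), 1 - h*s (Aa) and 1 - s (aa), where a is the
    deleterious allele at frequency q.
    """
    p = 1.0 - q
    w_bar = p * p + 2 * p * q * (1 - h * s) + q * q * (1 - s)
    if w_bar == 0.0:
        raise ValueError("mean fitness is zero; the population cannot reproduce")
    q_sel = (p * q * (1 - h * s) + q * q * (1 - s)) / w_bar
    q_mut = q_sel + u * (1 - q_sel)  # recurrent one-way mutation A -> a
    n_chromosomes = 2 * n_diploid
    count = sum(1 for _ in range(n_chromosomes) if rng.random() < q_mut)
    return count / n_chromosomes


def simulate(n_diploid, u, generations, s=1.0, h=0.0, q0=0.0, seed=0):
    """Final allele frequency after forward simulation at constant size."""
    rng = random.Random(seed)
    q = q0
    for _ in range(generations):
        q = next_generation(q, n_diploid, u, s, h, rng)
    return q


# Hypothetical small population and inflated mutation rate so the run is quick.
print(simulate(n_diploid=500, u=1e-4, generations=200, seed=1))
```

Setting `h` above zero adds a fitness cost in heterozygotes. In this sketch that cost removes deleterious alleles faster and lowers the expected frequency, in line with the argument made in the Discussion.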
Whenever a demographic event e. In contrast to Simons et al. However, recurrent mutations at a site are allowed, as in Simons et al. When implementing the simulations, we considered a mean mutation rate u of 1. While these four categories capture much of the variation in germline mutation rates across sites, a number of other factors e. A well-known source of heterogeneity in mutation rate within the CpGti class is methylation status, with a high transition rate seen only at methylated CpGs [ 21 ].
In our analyses, we tried to control for the methylation status of CpG sites by excluding sites located in CpG islands (CGIs), which tend not to be methylated [42].
We note that the CpGti estimate from [ 18 ] includes CGIs, and in that sense the average mutation rate that we are using for CpGti may be a very slight underestimate of the mean rate for transitions at methylated CpG sites. Unless otherwise noted, the expectation assumes fully recessive, lethal alleles with complete penetrance.
Notably, by calculating the expected frequency one site at a time, we are ignoring possible interaction between genes i. These assumptions are relaxed in two ways. Second, when considering the gene-level analysis (Fig 3), we implicitly allowed for compound heterozygosity between any pair of known lethal mutations [8].
For this analysis, we ran simulations for a total mutation rate U per gene that was calculated accounting for the heterogeneity and uncertainty in the mutation rate estimates as follows: (i) for each site j in a gene known to cause a recessive lethal disease and that passed our filtering criteria (S2 Table), we drew a mutation rate u_j from the lognormal distribution, as described above; (ii) we then took the sum of the u_j as the total mutation rate U; (iii) we then ran one replicate with U as the mutation parameter, and other parameters as specified for the site-level analysis.
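Steps (i) and (ii) can be sketched as below. The parameters of the lognormal distribution are not recoverable from this section, so the sketch assumes a hypothetical parameterization in which each site's draw has a given arithmetic mean and a log-scale standard deviation `sigma`. The function names and the example rates are likewise hypothetical.

```python
import math
import random


def draw_site_rate(mean_rate, sigma, rng):
    """Draw one per-site mutation rate from a lognormal distribution whose
    arithmetic mean equals mean_rate (assumed parameterization)."""
    mu = math.log(mean_rate) - 0.5 * sigma ** 2
    return rng.lognormvariate(mu, sigma)


def gene_mutation_rate(site_mean_rates, sigma, rng):
    """Total mutation rate U for a gene: the sum of per-site draws u_j."""
    return sum(draw_site_rate(m, sigma, rng) for m in site_mean_rates)


rng = random.Random(7)
# Hypothetical gene with three known lethal sites of differing mutability.
site_means = [1e-8, 2e-8, 1e-7]
U = gene_mutation_rate(site_means, sigma=0.5, rng=rng)
print(f"U for this replicate: {U:.2e}")
```

Each replicate draws a fresh U, so the spread of simulated gene-level frequencies reflects uncertainty in the mutation rates as well as genetic drift.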
Because the mutational target size considered in simulations comprises only those sites at which mutations are known to cause a lethal recessive disease, it is almost certainly an underestimate of the true mutation rate, potentially by a lot. We note further that by this approach, we are assuming that compound heterozygotes formed by any two lethal alleles have fitness zero, i.
Moreover, we are implicitly ignoring the possibility of complementation, which is somewhat justified by our focus on mutations with severe effects and complete penetrance (but see Discussion). To calculate the probability of ascertaining a recessive, lethal mutation, we assumed that all currently known disease mutations were identified in a putative ascertainment study of sample size n_a in a population with an inbreeding coefficient of F_a.
Under this model, we can estimate P_asc, the probability of ascertaining a disease mutation, as follows.
For a disease allele denoted a (with A the non-disease allele) at frequency q in the present population, if we randomly sample an individual with inbreeding coefficient F_a, the probabilities of the three genotypes are:
$$P(AA) = (1-q)^2 (1-F_a) + (1-q) F_a$$
$$P(Aa) = 2q(1-q)(1-F_a)$$
$$P(aa) = q^2 (1-F_a) + q F_a$$
Thus, if n_a unrelated individuals are surveyed, the probability of not seeing any homozygote for the disease allele (which is the same as the probability of not being ascertained in this set) is:
$$\left[1 - P(aa)\right]^{n_a} = \left[1 - q^2 (1-F_a) - q F_a\right]^{n_a}$$
Therefore, the probability of ascertainment is
$$P_{asc} = 1 - \left[1 - q^2 (1-F_a) - q F_a\right]^{n_a}$$
which is an increasing function of q (the population allele frequency), F_a (the inbreeding coefficient of the population under study), and n_a (the sample size of the putative ascertainment study) (S6 Fig).
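The ascertainment probability described above can be evaluated numerically. The sketch assumes the standard genotype probabilities under inbreeding, with P(aa) = q^2(1 - F_a) + q F_a. The function names and parameter values are hypothetical and serve only to check that the probability rises with q, F_a and n_a.

```python
def genotype_probabilities(q, f_a):
    """Genotype probabilities for an individual with inbreeding coefficient f_a,
    where q is the frequency of the disease allele a."""
    p = 1.0 - q
    return {
        "AA": p * p * (1 - f_a) + p * f_a,
        "Aa": 2 * p * q * (1 - f_a),
        "aa": q * q * (1 - f_a) + q * f_a,
    }


def ascertainment_probability(q, n_a, f_a):
    """Probability that at least one aa homozygote appears among n_a
    unrelated individuals."""
    p_hom = genotype_probabilities(q, f_a)["aa"]
    return 1 - (1 - p_hom) ** n_a


# Hypothetical parameter grid to show the direction of each effect.
for q in (1e-4, 1e-3):
    for f_a in (0.0, 0.05):
        for n_a in (1_000, 100_000):
            p_asc = ascertainment_probability(q, n_a, f_a)
            print(f"q={q:.0e} F_a={f_a:.2f} n_a={n_a:>7} P_asc={p_asc:.4f}")
```

With no inbreeding the homozygote probability is q^2, so rare alleles are almost never ascertained. Even modest inbreeding adds a term linear in q, which dominates when q is small.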
We also demonstrate the relationship between the probability of ascertainment and mutation rate using simulations of ascertainment bias implemented according to the following steps:
These simulations were meant to illustrate the likely impact of ascertainment bias, rather than to precisely describe the disease mutation identification process or to quantify the expected effect.
Notably, we performed these simulations for single sites, so the criteria for ascertainment in step 3 did not include the possibility of compound heterozygotes, despite the fact that an estimated However, this simulation framework could readily be extended in this direction and it would not change our qualitative conclusion.
Shown are the allele frequencies for 91 variants associated with lethal, recessive diseases, as estimated from 33, individuals of non-Finnish, European ancestry in the Exome Aggregation Consortium (ExAC) database [23] and 76, European-ancestry individuals from a genetic testing laboratory (Counsyl) [20] (see Methods). Points lie on the dashed blue line if the allele frequencies in Counsyl and ExAC are the same.
(A) Population mean allele frequency as a function of effective population size, under a model of constant population size. The X-axis range corresponds to the range of effective population size over time estimated in [25]. The red bar indicates the value of a constant population size at which the mean allele frequency is the same as in simulations, for an average mutation rate of 1. (B-C) The allele frequency distribution in grey is presented for 2 × 10^6 simulations based on (B) the complex demographic scenario inferred by Tennessen et al.
Shown in each case is the distribution of the deleterious allele frequencies in the population, generated from , simulations.
Citation: Carlin, J. Nature Education Knowledge 3(10).