Population genetics of rapid adaptation

My review on “Genetic draft, selective interference, and population genetics of rapid adaptation” in the Annual Review of Ecology, Evolution, and Systematics is finally out (not exactly final yet, some notational issues will be corrected). Sally Otto had asked me to write an accessible summary of the work published over the last 10-15 years on adaptation and selective interference. Some of this work was done by scientists with backgrounds in physics like myself. Owing to differences in notation and mathematical approaches, population geneticists sometimes struggled with these papers. Coming from physics and having worked in population genetics for 6 years, I have tried to synthesize this work in a streamlined and accessible fashion; let me know if it worked. To illustrate some of the ideas, I have put together a website with some Python scripts that simulate different scenarios discussed in the paper: http://webdav.tuebingen.mpg.de/interference/

Drift vs Draft
Classical population genetics emphasizes the competition between stochastic effects in reproduction (genetic drift) and deterministic forces such as selection. In idealized models, genetic drift stems from non-heritable randomness in offspring number. The width of this offspring number distribution is assumed to be (very) small compared to the population size, and the law of large numbers guarantees that many similar models converge to the same diffusion limit, where the strength of drift is inversely proportional to the population size. However, a different source of randomness is often much more important: random associations to genetic backgrounds of different fitness result in background selection, Hill-Robertson effects, and selective interference. While the effect of background fitness on allele frequencies might be weak in a single generation, associations to genetic backgrounds are (partly) heritable and the effects amplify over many generations. This amplification is multiplicative and the resulting differences in offspring number after several generations can be comparable to the population size. In other words, the effective offspring distributions after several generations are very skewed with long power-law tails. In fact, these distributions can be so broad that the variance diverges with the population size. In this case, no diffusion limit is possible and the statistical properties of drift and linked selection are fundamentally different.
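
To make the classical side of this comparison concrete, here is a minimal sketch of neutral Wright-Fisher resampling. It is not taken from the review or its companion scripts; the function names, the starting frequency x0, and the replicate count are illustrative assumptions. It checks that the variance of the one-generation frequency change is close to x(1-x)/N, i.e. that drift weakens in inverse proportion to the population size.

```python
import random


def wright_fisher_step(x, N, rng):
    """One neutral generation: each of N offspring carries the focal
    allele with probability x (binomial sampling)."""
    return sum(rng.random() < x for _ in range(N)) / N


def drift_variance(x, N, replicates, rng):
    """Empirical variance of the one-generation frequency change."""
    changes = [wright_fisher_step(x, N, rng) - x for _ in range(replicates)]
    mean = sum(changes) / replicates
    return sum((c - mean) ** 2 for c in changes) / (replicates - 1)


rng = random.Random(1)  # fixed seed so the run is reproducible
x0 = 0.3                # assumed starting frequency
results = {N: drift_variance(x0, N, 2000, rng) for N in (100, 1000)}
for N, var in results.items():
    # Compare the simulated variance with the diffusion-limit value x(1-x)/N.
    print(N, var, x0 * (1 - x0) / N)
```

This sketch covers only non-heritable randomness. It has no fitness differences between backgrounds, so it cannot produce the broad multi-generation offspring distributions that characterize draft.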

Asexual vs sexual
The effects of draft are strongest in asexual organisms where the entire chromosome stays linked forever. However, linked selection can also be substantial in facultatively sexual species such as plants, worms, yeasts or viruses (think influenza). As soon as there is the potential for the rapid expansion of a particular line (be it because of an intrinsic fitness advantage or favorable environmental conditions in a particular spot), the effective “many-generation” offspring distribution can become very broad and draft dominates over drift. In obligately sexual species, the effects of draft are confined to the chromosomal neighborhood, but linkage to alleles at different distances still gives rise to stochastic forces very different from classical genetic drift (rare tight linkage to a beneficial allele essentially sweeps one haplotype to fixation, loosely linked sweeps only bounce it around a little).

Recent Developments: Genealogical methods for rapid adaptation
Many successful population genetic methods have used the duality between Kimura’s diffusion models and the Kingman coalescent. This duality allows the efficient computation of statistics by considering the backward process of observed alleles, rather than the forward process of the entire population. Recent developments suggest that a similar duality exists for models dominated by draft: genealogies in these models share statistical properties with a particular coalescent process, known as the Bolthausen-Sznitman coalescent, that allows for multiple mergers. This coalescent process can predict a number of observable features in sequence data such as the site frequency spectrum, the time to the most recent common ancestor, etc. I briefly discuss these very recent results in the review.
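
The difference between the two genealogical processes can be sketched in a few lines. The following is a minimal illustration, not code from the review. It uses the standard rates: under the Kingman coalescent, b lineages merge pairwise at total rate b(b-1)/2, while under the Bolthausen-Sznitman coalescent any particular set of k out of b lineages merges at rate (k-2)!(b-k)!/(b-1)!, which gives a total rate of b-1 and a merger of size k with probability b/((b-1)k(k-1)). The function names, the sample size n, and the replicate count are assumptions, and each process is measured in its own time units, so the two mean times are not directly comparable.

```python
import random


def kingman_tmrca(n, rng):
    """Time to the most recent common ancestor with pairwise mergers only."""
    t, b = 0.0, n
    while b > 1:
        t += rng.expovariate(b * (b - 1) / 2)  # waiting time with b lineages
        b -= 1                                 # two lineages become one
    return t


def bsc_merger_size(b, rng):
    """Number of lineages joining the next merger in the
    Bolthausen-Sznitman coalescent: P(k) = b / ((b-1) k (k-1)), k = 2..b."""
    u = rng.random() * (b - 1) / b  # sum of 1/(k(k-1)) over k=2..b is (b-1)/b
    cumulative = 0.0
    for k in range(2, b + 1):
        cumulative += 1.0 / (k * (k - 1))
        if u < cumulative:
            return k
    return b  # guard against floating-point round-off


def bsc_tmrca(n, rng):
    """Time to the most recent common ancestor with multiple mergers."""
    t, b = 0.0, n
    while b > 1:
        t += rng.expovariate(b - 1)         # total merger rate with b lineages
        b -= bsc_merger_size(b, rng) - 1    # k lineages collapse into one
    return t


rng = random.Random(2)
n, replicates = 20, 5000
kingman_mean = sum(kingman_tmrca(n, rng) for _ in range(replicates)) / replicates
bsc_mean = sum(bsc_tmrca(n, rng) for _ in range(replicates)) / replicates
sizes = [bsc_merger_size(n, rng) for _ in range(replicates)]
print("Kingman mean T_MRCA:", kingman_mean, "expected:", 2 * (1 - 1 / n))
print("BSC mean T_MRCA (own time units):", bsc_mean)
print("fraction of first mergers involving more than two lineages:",
      sum(k > 2 for k in sizes) / replicates)
```

With 20 lineages, almost half of the first merger events in the Bolthausen-Sznitman case involve three or more lineages at once, which is the feature that reshapes the genealogies relative to the Kingman case.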

Why should we care?
You might say “Let’s just define an effective population size and pretend all linked selection is some sort of drift”. But many population genetic methods detect outliers above a random background. To detect outliers reliably, we need to understand the null distribution. The background has very different statistical properties when the dominant source of randomness is draft rather than drift, and using the wrong null model will reduce the power of the test and produce false positives. In other applications, one estimates values of parameters of simple models, and these models should capture the relevant population genetic processes. It is for example popular to estimate the history of the effective population size from the rate of coalescence in the past. In many cases, in particular for large populations under selection, this effective population size has very little to do with the actual population size. Instead, one estimates the rate of coalescence, which depends on the relative success of different lineages, which in turn depends on fitness, environmental fluctuations, and luck.

Deleterious effects of synonymous mutations in HIV

Fabio and Richard just published a paper in Journal of Virology about whether or not synonymous mutations in HIV are neutral. We know of course that many synonymous sites in HIV have important roles in regulation and RNA structure, but what about those sites at which we see high-frequency synonymous polymorphisms without known function? We focus on the env gene and in particular on the region containing the V loops that are under antibody attack by the immune system and hence change rapidly (V stands for variable). We investigate the dynamics of synonymous mutations in serial sequence data and show that most synonymous mutations in that region come with a fitness cost of the order of 0.002 / day. Nevertheless, they seem to hitch-hike to high frequency on linked adaptive variants.  These synonymous mutations interfere with folding of the HIV RNA molecule, reducing its replication efficiency.

Why is this important?

In studies of molecular evolution, synonymous mutations are often used as a neutral control to infer selection operating on the proteins. Our results add to the growing body of evidence that selection on synonymous sites needs to be accounted for when doing such analyses. This not only holds for viruses, but has recently been demonstrated in the fruit fly.

More importantly, we shed light on the functional relevance and evolutionary dynamics of RNA structures in the genome. A recent study compared RNA structures of diverged strains of HIV and found that the overall pattern of folding is conserved, while the exact base pairs are not. Our results can explain this observation: most RNA structures seem to be important to the virus, but not crucial (disrupting mutations have a small fitness effect). This results in occasional fixation of a disrupting mutation which is slowly restored by compensatory mutations or rearrangements. This way the molecular architecture of the RNA structure can change leaving the overall pattern in place.

How did we do it?

In longitudinal HIV sequence datasets, we observe new synonymous alleles rising in frequency in our sample up to 50% and higher (see sketch), but almost all of them disappear from our sample after one to two years (i.e., the ancestral allele is found again in all sequences). This suggests that these mutations are deleterious. We quantify this effect by measuring how long it takes for these mutations to disappear and how often they do not disappear. Comparing these observations to extensive computer simulations, we conclude that the majority of synonymous mutations are deleterious with small fitness effects (0.002 / day). Then, building on published RNA pairing data, we ask whether or not a mutation that destroys a base pair is pruned more often than any other mutation; that seems to be the case indeed.

Inferring HIV escape rates

We have a new preprint on the arXiv (here on Haldane’s sieve). This work is the result of a collaboration between us and Alan Perelson, LANL, and explores methods to estimate parameters of the HIV-immune system interaction from time-resolved sequence data. The focus of this paper is on early infection dominated by a few rapid substitutions that fix because they prevent or reduce recognition of infected cells by the immune system via cytotoxic T-lymphocytes (CTL). CTL escape is one of the fastest instances of evolution I have come across: 4-6 mutations spread within a few weeks. It happens in most HIV infections and is partly predictable based on the HLA genotype of the infected person. These substitutions are so rapid that clonal interference has to be modeled. Our method fits a reduced model of clonal interference to the typically very sparse data and thereby estimates the selection coefficients, aka escape rates.

Why do we want to know these numbers?
The number of viruses in the blood of an infected person peaks 2-3 weeks after infection and thereafter drops by 2-3 orders of magnitude. This drop is partly due to a response by the adaptive immune system. However, it has proved difficult to attribute this drop to specific parts of the immune response. The rates at which different mutations sweep through the population give us information about the pressure exerted by the T-cell clones that target the epitope containing this mutation.

How do we do it?
Early in infection, the viral population is large and selection is strong. In these conditions, recombination is of minor importance since most double/triple… mutants are more efficiently produced by recurrent mutation than by recombination. This implies that mutations accumulate sequentially, always on a background on which all previous mutations are already present. The time at which a novel mutation happens is tightly constrained by the trajectory of the preceding genotype. These constraints regularize the fitting problem to some degree, and the multi-locus fitting is more robust than single-locus fitting.
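
The nested structure described above can be illustrated with a small deterministic toy model. This is a sketch under stated assumptions, not the fitting method of the paper: genotype i carries the first i escape mutations, its growth advantage is the sum of the first i escape rates, and each genotype seeds the next one by mutation at rate mu. The escape rates, mu, the time step, and all function names are hypothetical values chosen for illustration. Because the model is deterministic, it lets multiple mutants exist at arbitrarily low frequency, which a finite population would not allow.

```python
def simulate_sequential_escape(escape_rates, mu=1e-5, days=100.0, dt=0.01):
    """Euler integration of nested genotype frequencies.
    Returns a list of (day, frequencies) recorded once per day."""
    L = len(escape_rates)
    fitness = [sum(escape_rates[:i]) for i in range(L + 1)]  # cumulative rates
    x = [1.0] + [0.0] * L                                    # start as wild type
    trajectory = [(0.0, list(x))]
    record_every = max(1, int(round(1.0 / dt)))
    for step in range(1, int(round(days / dt)) + 1):
        mean_fitness = sum(f * xi for f, xi in zip(fitness, x))
        new_x = []
        for i in range(L + 1):
            influx = mu * x[i - 1] if i > 0 else 0.0   # seeded by predecessor
            outflux = mu * x[i] if i < L else 0.0      # seeds the successor
            growth = x[i] * (fitness[i] - mean_fitness)
            new_x.append(x[i] + dt * (growth + influx - outflux))
        total = sum(new_x)
        x = [xi / total for xi in new_x]               # guard against drift in the sum
        if step % record_every == 0:
            trajectory.append((step * dt, list(x)))
    return trajectory


def mutation_frequencies(x):
    """Frequency of mutation j = all genotypes carrying at least j mutations."""
    return [sum(x[j:]) for j in range(1, len(x))]


def crossing_day(trajectory, j, level=0.5):
    """First recorded day on which mutation j (0-based) reaches the given level."""
    for day, x in trajectory:
        if mutation_frequencies(x)[j] >= level:
            return day
    return None


escape_rates = [0.4, 0.3, 0.2]  # hypothetical escape rates per day
trajectory = simulate_sequential_escape(escape_rates)
crossings = [crossing_day(trajectory, j) for j in range(len(escape_rates))]
print("day each mutation crosses 50%:", crossings)
```

Since every genotype that carries a later mutation also carries all earlier ones, the frequency of a later mutation can never exceed that of an earlier one. This ordering is the constraint that ties the trajectories together when fitting several loci at once.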

What do we learn about evolution in general?
In addition to the intrinsic interest in the HIV/CTL interaction, CTL escape is an ideal setting to study rapidly evolving populations. This evolution happens in its “natural” habitat and the selective pressure as well as the functional consequences of the observed molecular changes can be quantified via immunological data, protein structure, and replication assays. In addition, we have ample cross-sectional data (HIV sequences from many different patients) that allow us to look at the prevalence of the escape mutations and potential compensatory mutations. None of this is done in this paper, but studying HIV/immune-system coevolution is a fascinating showcase of rapid evolution.

Arxiv: Coalescence in sexual populations under selection

Update: the paper is now published.

A few days ago, I uploaded a revision of our recent manuscript (with Taylor Kessinger and Boris Shraiman) on genetic diversity in sexual populations under selection. I would like to elaborate a little bit on what I think is remarkable about our results.

Why is it important?
It is common these days to sequence multiple individuals from a population and analyze the genetic diversity in the sample to learn something about the demographic and evolutionary past. To infer the past from diversity data, we need to know how diversity depends on the parameters and processes we are interested in. This link typically comes from the analysis of simple models. The predominant framework used for this purpose is the neutral coalescent, which is often used as a null model to detect selection. This strategy (looking for outliers in a mostly neutral genome) seemed like a good one at the time when it was thought that the great majority of polymorphisms are neutral. If, however, the majority of polymorphisms are under some form of selection, we need a new null model to detect adaptations of particular interest that stand out from all the rest, which, while not neutral, have weak or fluctuating effects. Our manuscript aims at delivering such a null. In contrast to previous analyses that focussed on mutations with strong effects (background selection or hitch-hiking), we analyze a model where a large number of weakly selected polymorphisms generate fitness diversity in a sexual population. We find that the properties of neutral diversity smoothly interpolate between the neutral limit (drift dominated) and the limit of strong selection (draft dominated). The crossover between the two regimes happens when fitness differences between haplotypes are comparable to the inverse population size. The length of haplotypes (LD) and the diversity are self-consistently determined and depend on the fitness variance per map length, but only weakly on the population size. To determine where a population sits on this continuum between the neutral and the draft-dominated regime, it is informative to analyze the site frequency spectrum, which changes qualitatively between the regimes.

How did we address it?
In sexual populations, crossing over reshuffles alleles, which results in linkage equilibrium and independent histories of loci at large distances. The histories of tightly linked loci, however, remain correlated and very close loci behave as if they were asexual. These different degrees of linkage interact with selection in complicated ways. Our approach to this problem was to identify the length of blocks that behave more or less asexually over the time to the most recent common ancestor at the locus, calculate the fitness variation within those blocks, and map the problem to results for coalescence with selection in asexual populations.

The latter problem has been addressed by Oskar Hallatschek and myself. We showed that in asexual populations with substantial selected diversity, coalescence and genetic diversity are not described by the Kingman (standard neutral) coalescent, but resemble the Bolthausen-Sznitman coalescent (BSC), at least in the limit of large populations. Michael Desai, Aleksandra Walczak and Daniel Fisher published similar conclusions.

What’s next?
It is common to define an “effective population size”, Ne, via the distance between pairs of haplotypes and hope that a neutral model with this Ne explains other features of genetic diversity. This rarely works. Furthermore, Ne depends strongly on crossover rates, functional density (purifying selection), etc. The one quantity Ne is only weakly correlated with is the census population size. Our results link genetic diversity (I refuse to call it Ne) to parameters such as mutation rates, crossover rates, and effect distributions of mutations. The predictions should be applicable whenever there are many polymorphisms within a linkage block, which is likely the case in facultative outcrossers or low recombination regions of obligate outcrossers.

When analyzing resequencing data, it should be possible to use the polarized site frequency spectrum to determine whether diversity is dominated by drift or draft. In the draft regime, heterozygosity should be proportional to the square root of rho/mu s^2, where rho is the crossover rate, mu is the mutation rate, and s^2 is the average squared effect of mutations.
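
As a starting point for such an analysis, the polarized site frequency spectrum can be tabulated directly from a sample. The sketch below assumes the data are already polarized and coded as 0 (ancestral) and 1 (derived); the toy haplotypes and the function names are hypothetical. For comparison it also computes the standard neutral expectation, in which the number of sites with k derived copies is proportional to 1/k. It does not implement the draft-regime prediction.

```python
def site_frequency_spectrum(haplotypes):
    """Polarized SFS: entry k-1 counts sites with k derived copies,
    for k = 1 .. n-1 (fixed and absent classes are dropped)."""
    n = len(haplotypes)
    counts = [0] * (n + 1)
    for site in zip(*haplotypes):   # iterate over columns, i.e. sites
        counts[sum(site)] += 1
    return counts[1:n]


def neutral_expectation(n, n_segregating):
    """Expected SFS under the standard neutral coalescent (proportional
    to 1/k), scaled to the observed number of segregating sites."""
    weights = [1.0 / k for k in range(1, n)]
    total = sum(weights)
    return [n_segregating * w / total for w in weights]


# Hypothetical toy sample: 5 haplotypes (rows) by 6 sites (columns).
haplotypes = [
    [1, 0, 0, 1, 0, 1],
    [1, 0, 0, 0, 0, 1],
    [0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 1],
    [0, 0, 1, 0, 0, 0],
]
observed = site_frequency_spectrum(haplotypes)
expected = neutral_expectation(len(haplotypes), sum(observed))
for k, (obs, exp) in enumerate(zip(observed, expected), start=1):
    print(k, obs, round(exp, 2))
```

With real data, a systematic departure from the 1/k shape across the genome would be the signal to examine, although demography can also distort the spectrum and would need to be considered separately.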