Now out in eLife: Predicting evolution

Our paper (with Colin Russell and Boris Shraiman) on predicting evolution (see also this earlier post) now appeared in eLife. The method we developed predicts fitness of individuals, which in turn allows to predict likely progenitor sequences of future populations. Our method calculates a posterior distribution of fitness for each individual. The resulting ranking reliably predicts the ancestors of future populations of seasonal influenza. Since my previous post on this topic, we have developed a simple and robust approximation to the probabilistic model. This approximation is motivated by the intuitive insight that a very fit individual can develop into a growing clone.  Such an expanding subpopulation results in rapid branching in the tree. From this approximation, we developed a simple measure — the local branching index (LBI) — that it predicts future success almost as well as the full fitness inference algorithm. The LBI measures tree length in the neighborhood of each node in the tree, where neighborhood is defined via a weighing function that decreases exponentially with distance from the focal node. This is illustrated in the figure below. The only parameter of the LBI is the size of the relevant neighborhood, i.e., the radius of the shaded region in the illustration.

How the Local Branching Index (LBI) works: Consider the rooted tree on the left and the corresponding unrooted tree (the root is indicated by the black dot). The LBI of the node connecting the orange and red clade is the length of the tree in the shaded area. The more rapidly the tree is branching close to this node, the larger is the LBI.

Why does the LBI predict fitness?

The full fitness inference algorithm multiplies the propagators for each branch to obtain posterior distribution of fitness. The propagator of each additional downstream branch pushes the posterior towards higher fitness. The amount of this additional polarization increases with the length of the subtree of the branch. This polarizing effect of tree length is, however, forgotten over long times. The fitness propagators account for this loss of memory through equilibration to the stationary distribution of ancestral fitness. In a similar fashion, the LBI focusses on local tree length explicitly by exponential weighing with distance.

Connection to multiple-merger coalescents

Together with Oskar Hallatschek, we have shown earlier that the genealogical trees of rapidly adapting populations contain approximate multiple mergers and that the genealogies are asymptotically described by the Bolthausen-Sznitman Coalescent process. Brunet et al have linked these approximate multiple mergers to rare events when one individual happens to be substantially fitter than all others. It then generates a burst if offspring, which looks like a multiple merger in a reconstructed tree. The LBI measures the “burstiness” of the branching at a node. For a given number of offspring, the LBI is maximal for a star-like tree (i.e., a multiple merger), and minimal for a binary merger with no other branching happening for a long time.

Is evolution predictable?

In this preprint — a collaboration with Colin Russell and Boris Shraiman — we show that it is possible to predict which individual from a population is most closely related to future populations. To this end, we have developed a method that uses the branching pattern of genealogical trees to estimate which part of the tree contains the “fittest” sequences, where fit means rapidly multiplying. Those that multiply rapidly, are most likely to take over the population. We demonstrate the power of our method by predicting the evolution of seasonal influenza viruses.

How does it work?
Individuals adapt to a changing environment by accumulating beneficial mutations, while avoiding deleterious mutations. We model this process assuming that there are many such mutations which change fitness in small increments. Using this model, we calculate the probability that an individual that lived in the past at time t leaves n descendants in the present. This distributions depends critically on the fitness of the ancestral individual. We then extend this calculation to the probability of observing a certain branch in a genealogical tree reconstructed from a sample of sequences. A branch in a tree connects an individual A that lived at time tA and had fitness xA and with an individual B that lived at a later time tB with fitness xB as illustrated in the figure. B has descendants in the sample, otherwise the branch would not be part of the tree. Furthermore, all sampled descendants of A are also descendants of B, otherwise the connection between A and B would have branched between tA and tB. We call the mathematical object describing fitness evolution between A and B “branch propagator” and propagatordenote it by g(xB,tB|xA,tA). The joint probability distribution of fitness values of all nodes of the tree is given by a product of branch propagators. We then calculate the expected fitness of each node and use it to rank the sampled sequences. The top ranked sequence is our prediction for the sequence of the progenitor of the future population.

Why do we care?
flu_tree Being able to predict evolution could have immediate applications. The best example is the seasonal influenza vaccine, that needs to be updated frequently to keep up with the evolving virus. Vaccine strains are chosen among sampled virus strains, and the more closely this strain matches the future influenza virus population, the better the vaccine is going to be. Hence by predicting a likely progenitor of the future, our method could help to improve influenza vaccines. One of our predictions is shown in the figure, with the top ranked sequence marked by a black arrow. Influenza is not the only possible application. Since the algorithm only requires a reconstructed tree as input, it can be applied to other rapidly evolving pathogens or cancer cell populations. In addition, to being useful, the ability to predict also implies that the model captures an essential aspect of evolutionary dynamics: influenza evolution is to a substantial degree — enough to enable prediction — dependent on the accumulation of small effect mutations.

Comparison to other approaches
Given the importance of good influenza vaccines, there has been a number of previous efforts to anticipate influenza virus evolution, typically based on using patterns of molecular evolution from historical data. Along these lines, Luksza and Lässig have recently presented an explicit fitness model for influenza virus evolution that rewards mutations at positions known to convey antigenic novelty and penalizes likely deleterious mutations (+a few other things). By using molecular influenza specific signatures, this model is complementary to ours that uses only the tree reconstructed from nucleotide sequences. Interestingly, the two models do more or less equally well and combining different methods of prediction should result in more reliable results.

Population genetics of rapid adaptation

My review on “Genetic draft, selective interference, and population genetics of rapid adaptation” in Annual Reviews of Ecology, Evolution, and Systematics is finally out (not exactly final yet, some notational issues will be corrected). Sally Otto had asked me to write an accessible summary of the work published over the last 10-15 years on adaptation and selective interference.  Some of this work was done by scientists with backgrounds in physics like myself. Owing to differences in notation and mathematical approaches, population geneticists sometimes struggled with these papers. Coming from physics and having worked in population genetics for 6 years, I have tried to synthesize this work in a streamlined and accessible fashion — let me know if it worked. To illustrate some of the ideas, I have put together a website with some python scripts that simulate different scenarios discussed in the paper:

Drift vs Draft
Classical population genetics emphazises the competition between stochastic effects in reproduction (genetic drift) and deterministic forces such as selection. In idealized models, genetic drift stems from non-heritable randomness in offspring number. The width of this offspring number distribution is assumed (very) small compared to the population size and the law of large numbers garantees that many similar models converge to the same diffusion limit where the strength of drift is inversely proportional to the population size. However, a different source of randomness is often much more important: random associations to genetic backgrounds of different fitness result in background selection, Hill-Robertson effects, and selective interference. waveWhile the effect of background fitness on allele frequencies might be weak in a single generation, associations to genetic backgrounds are (partly) heritable and the effects amplify over many generations. This amplification is multiplicative and the resulting differences in offspring number after several generations can be comparable to the population size. In other words, the effective offspring distributions after several generations are very skewed with long power-law tails. In fact, these distributions can be so broad that the variance diverges with the population size. In this case, no diffusion limit is possible and the statistical properties of drift and and linked selection are fundamentally different.

Asexual vs sexual
The effects of draft are strongest in asexual organisms where the entire chromosome stays linked forever. However, linked selection can also be substantial in facultatively species such as plants, worms, yeasts or viruses (think influenza). As soon as there is the potential for the rapid expansion of a particular line (be it because of an intrinsic fitness advantage or favorable environmental conditions in a particular spot), the effective “many-generation” offspring distribution can become very broad and draft dominates over drift. In obligatly sexual species, the effects of draft are confined to the chromosomal neighborhood, but linkage to alleles at different distances still gives rise to stochastic forces very different from the classical genetic drift (rare tight linkage to a beneficial allele essentially sweeps one haplotype to fixation, loosely linked sweeps only bounce it around a little).

Recent Developments: Genealogical methods for rapid adaptation
Many successful population genetic methods have used the duality between Kimura’s diffusion models and the Kingman coalescent. This duality allows the efficient computation of statistics by considering the backward process of observed alleles, rather than the forward process of the entire population. Recent developments suggest that a similar duality exists for models dominated by draft: Genealogies in these models share statistical properties with a particular coalescent process known as Bolthausen-Sznitman coalescent that allows for multiple mergers. This coalescent process can predict a number of observable features in sequence data such as the site frequency spectra, the time to the most recent common ancestor, etc.  I briefly discuss these very recent results in the review.

Why should we care?
You might say “Let’s just define an effective population size and pretend all linked selection is some sort of drift”. But many population genetic methods detect outliers above a random background. To detect outliers reliably, we need to understand the null distribution. The background has very different statistical properties when the dominant source of randomness is draft rather than drift and using the wrong null model will reduce the power of the test and produce false positives. In other applications, one estimates values of parameters of simple models and these models better capture the relevant population genetic processes. It is for example popular to estimate the history of the effective population size from the rate of coalescence in the past. In many cases, in particular for large populations under selection, this effective population size has very little to do with the actual population size. Instead, one estimates the rate of coalescence which depends on the relative success of different lineages, which in turn depends on fitness, environmental fluctuations, and luck.

Arxiv: Coalescence in sexual populations under selection

Update: the paper is now published.

A few days ago, I uploaded a revision of our recent manuscript (with Taylor Kessinger and Boris Shraiman) on genetic diversity in sexual populations under selection. I would like to elaborate a little bit on what I think is remarkable about our results.

Why is it important?
It is common these days to sequence multiple individuals from a population and analyze the genetic diversity in the sample to learn something about demographic and evolutionary past. To infer the past from diversity data, we need to know how diversity depends on the parameters and processes we are interested in. This link typically comes from the analysis of simple models. The predominant framework used for this purpose is the neutral coalescent, which is often used as a null model to detect selection. This strategy — looking for outliers in a mostly neutral genome — seemed like a good strategy at the time when it was thought that the great majority of polymorphims are neutral. If, however, the majority of polymorphisms is under some form of selection, we need a new null model to detect adaptations of particular interest that stand out from all the rest that, while not neutral, has weak or fluctuating effects. Our manuscript aims at delivering such a null. In contrast to previous analysis that focussed on mutations with strong effects (background selection or hitch-hiking), we analyze a model where a large number of weakly selected polymorphisms generate fitness diversity in a sexual population. We find that the properties of neutral diversity smoothly interpolate between the neutral limit (drift dominated) and the limit of strong selection (draft dominated). The crossover between the two regimes happens when fitness difference between haplotypes are comparable to the inverse population size. The length of haplotypes (LD) and the diversity are self-consistently determined and depend on the fitness variance per maplength, but only weakly on the population size. To determine where a population sits on this continuum between neutral or draft dominated regime, it is informative to analyze the site frequency spectrum, which changes qualitatively between the regimes.

How did we address it?
In sexual populations, crossing over reshuffles alleles, which results in linkage equilibrium and independent histories of loci at large distances. The histories of tightly linked loci, however, remain correlated and very close loci behave as if they were asexual. These different degrees of linkage interact with selection in complicated ways. Our approach to this problem was to identify the length of blocks that behave more or less asexually over the time to the most recent common ancestor at the locus, calculate the fitness variation within those blocks that, and map the problem to results for coalescence with selection in asexual populations. Image

The latter problem has been addressed by Oskar Hallatschek and myself. We showed that in asexual populations with substantial selected diversity, coalescence and genetic diversity are not described by the Kingman (standard neutral) coalescent, but resemble the Bolthausen-Sznitman coalscent (BSC) — at least in the limit of large populations. Michael Desai, Aleksandra Walczak and Daniel Fisher published similar conclusions.

What’s next?
It is common to define an “effective population size”, Ne, via the distance between pairs of haplotypes and hope that a neutral model with this Ne explains other features of genetic diversity. This rarely works. Furthermore, Ne depends strongly on crossover rates, functional density (purifying selection), etc. The one quantity Ne is only weakly correlated with is the census population size. Our results link genetic diversity (I refuse to call it Ne) to parameters such as mutation rates, crossover rates, and effect distributions of mutations. The predictions should be applicable whenever there are many polymorphisms within a linkage block, which is likely the case in facultative outcrossers or low recombination regions of obligate outcrossers.

When analyzing resequencing data, it should be possible to use the polarized site frequency spectrum to determine whether diversity is dominated by drift or draft. In the draft regime, heterozygosity should be proportional to the square root of of rho/mu s^2, where rho is the crossover rates, mu is the mutation rate, and s^2 is the average squared effect of mutations.