Establishment and dynamics of the latent HIV-1 reservoir

Our new study (in collaboration with the group of Jan Albert) on HIV-1 evolution and turnover during suppressive anti-retroviral therapy has just come out in eLife. In this paper, we combined our previous data on HIV-1 evolution in plasma prior to therapy (Zanini et al, 2015) with HIV-1 DNA sequences from peripheral blood mononuclear cells (PBMCs) after many years of therapy. This combination of pre-therapy and on-therapy data from the same individuals allowed us to investigate the origin of integrated HIV-1 DNA and determine whether viral DNA in cells changes during therapy:

  • We find no evidence of replication/evolution during suppressive therapy
  • Even after 18 years of therapy, HIV-1 DNA looks very similar to the HIV-1 RNA from samples taken right before treatment
  • The HIV reservoir turns over fast in the absence of therapy. This turnover is dramatically slowed by therapy, suggesting that HIV-1 infection is a major contributor to T-cell death.

Our results are at odds with a recent study by Lorenzo-Redondo et al 2016. Using sequence data from HIV RNA at treatment initiation and HIV DNA 3 and 6 months into therapy, Lorenzo-Redondo et al estimated a very high rate of sequence evolution. The evolution of the root-to-tip distance predicted on the basis of their rate estimate is included in the graph below as a shaded area and is clearly incompatible with our results. In fact, the rate estimated by Lorenzo-Redondo et al is faster than the pre-therapy rate in the individuals we investigated.

Lorenzo-Redondo et al studied sequences from blood and lymph tissue, while we only had access to blood samples. This, however, is unlikely to explain the discrepancy: Lorenzo-Redondo et al estimate similar rates in PBMCs and lymph tissue. Furthermore, several studies, including Lorenzo-Redondo et al, estimate that HIV sequences from lymph and PBMCs mix on a time scale of a few months, such that PBMCs should be an accurate reporter. The rapid evolution inferred by Lorenzo-Redondo et al might be explained in part by the following factors:

  • The samples come from a six-month interval, which is much shorter than the coalescence time scale of HIV. With sequences from such short time intervals, rooting the phylogenetic tree to maximize the correlation between root-to-tip distance and sampling date can generate an exaggerated temporal signal.
  • With increasing time since the start of therapy, the HIV-1 DNA positive pool of cells will become dominated by long-lived cells which sample deeper into the history of the HIV infection prior to therapy. This could generate a spurious signal of “backward” evolution.

The graph below illustrates the latter. If HIV positive cells are a mix of short-lived (blue) and long-lived (red) cells, a sample taken at treatment start will be dominated by short-lived cells and virus that was replicating very recently. A few months into treatment, short-lived cells will be mostly HIV negative, while HIV positive cells tend to be long-lived cells that sample deeper into the history of the infection. This shift can generate a spurious signal of evolution.
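The shift described above can be sketched with a toy two-compartment model. This is only an illustration, not the analysis from the paper: the function name, the fraction of infections that produce long-lived cells, and both death rates are made-up assumptions, and the model assumes a steady state before therapy and no new infections on therapy.

```python
import math

def reservoir_composition(t_on_therapy, f_long=0.0005, d_short=0.5, d_long=0.001):
    """Toy model of HIV DNA positive cells t_on_therapy days after treatment start.

    All parameters are hypothetical: f_long is the fraction of infections that
    seed long-lived cells, d_short and d_long are death rates per day.
    Returns (fraction of long-lived cells, mean look-back time in days).
    """
    # Steady-state pool sizes before therapy (per unit infection rate),
    # followed by exponential decay once therapy blocks new infections.
    n_short = (1.0 - f_long) / d_short * math.exp(-d_short * t_on_therapy)
    n_long = f_long / d_long * math.exp(-d_long * t_on_therapy)
    frac_long = n_long / (n_short + n_long)
    # Within each class, the time between infection of a surviving cell and
    # treatment start is exponentially distributed with mean 1/d.
    mean_lookback = (n_short / d_short + n_long / d_long) / (n_short + n_long)
    return frac_long, mean_lookback

for t in (0, 90, 180):
    frac, lookback = reservoir_composition(t)
    print(f"day {t}: long-lived fraction {frac:.2f}, mean look-back {lookback:.0f} days")
```

With these made-up numbers the sample at treatment start is mostly short-lived cells, while a sample taken a few months later consists almost entirely of long-lived cells that were infected much earlier.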

While we cannot rule out that HIV does replicate in compartments that are missed by our sequencing of HIV from PBMCs, ongoing replication is not the dominant mechanism by which HIV DNA is maintained in circulating cells.



Mutation rates and fitness costs of HIV-1

The rates at which mutations arise and the effects these mutations have on phenotype and replication are key determinants of how populations change and adapt, but measuring them is often hard. While mutation rates in animals or plants can be obtained quite easily by sequencing parents and children, fitness effects are much more difficult to ascertain: only the most dramatic mutations have effects big enough to be measured over a few generations or to leave strong signals in genetic diversity.

In viruses like HIV-1, mutation rates and effects of mutations are more readily accessible since their generation times are short and their genomes are compact. However, these measurements are typically done in cell culture systems rather than in the natural environment, the infected host. In our new preprint, Fabio, Vadim, and I, together with our colleagues Johanna and Jan from Sweden, present estimates of mutation rates and fitness costs in vivo.

How did we do it?

We have previously presented longitudinal whole genome deep sequencing data from multiple patients (Zanini et al, 2016). At each position of the genome, we can observe the frequency of different mutations at different times during the course of infection. A subset of positions don’t seem to matter much for virus replication. We found that at those sites, mutations accumulate almost linearly: the rate of accumulation is the in vivo mutation rate. The estimates so obtained agree very well with cell culture estimates. The figure on the right summarizes these findings: the thickness of the arrows indicates the relative rates, and the overall rate is 1.2 × 10^-5 mutations per site and day.
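As a rough illustration of reading a rate off linear accumulation, the slope of a straight line through the origin can be fitted to mutation frequencies over time. This is a minimal sketch and not the estimation procedure from the preprint; the function name and the data points are made up.

```python
def accumulation_rate(times_days, frequencies):
    """Least-squares slope of a line through the origin: frequency = rate * time."""
    numerator = sum(t * x for t, x in zip(times_days, frequencies))
    denominator = sum(t * t for t in times_days)
    return numerator / denominator

# Hypothetical average frequencies of one mutation type at neutral sites.
times = [100, 500, 1000, 2000]            # days since infection
freqs = [1e-5 * t for t in times]         # made-up data, exactly linear
rate = accumulation_rate(times, freqs)
print(f"estimated rate: {rate:.2e} per site per day")
```

Forcing the line through the origin encodes the assumption that the infection started from a population without the mutation.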

At these approximately neutral sites, mutation accumulation is linear (at least over the few years we looked at it). At other sites, mutations arise in very much the same way, but they reduce the rate of virus replication and are hence weeded out. As a result, mutation frequencies don’t accumulate linearly but saturate. The time it takes to saturate and the level at which the frequencies saturate depend on the selection coefficient. We use this dependence to estimate the landscape of fitness costs at almost every site of the HIV-1 genome.
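The saturation can be illustrated with a simple deterministic mutation-selection model, dx/dt = mu - s*x, whose solution is x(t) = (mu/s)(1 - exp(-s*t)). This is a sketch under that assumption, not necessarily the exact model fitted in the preprint; the function names and parameter values are hypothetical.

```python
import math

def expected_frequency(t_days, mu, s):
    """Frequency of a deleterious mutation after t_days, starting from zero.

    mu is the mutation rate and s the fitness cost, both per day.
    """
    if s == 0:
        return mu * t_days                      # neutral site: linear accumulation
    return mu / s * (1.0 - math.exp(-s * t_days))

def cost_from_plateau(mu, plateau_frequency):
    """Fitness cost implied by the saturation level x = mu / s."""
    return mu / plateau_frequency

mu = 1e-5                                       # hypothetical rate per site per day
for s in (0.0, 0.001, 0.01):
    print(s, [expected_frequency(t, mu, s) for t in (100, 1000, 3000)])
```

In this model the frequency saturates at mu/s on a time scale of 1/s, so costly mutations plateau early and at low frequency, while nearly neutral ones keep accumulating.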

This graph shows a slightly smoothed landscape of fitness costs in units of 1/day separately for non-synonymous mutations (solid) and synonymous mutations (dashed) for the major genes of HIV-1 (colors). As expected, fitness costs of non-synonymous mutations are a lot larger than those of synonymous mutations (about 50% of non-synonymous mutations have costs of 10% or more). But subsets of synonymous mutations are also very costly, in particular in regions rich in RNA secondary structure at the 5′ end or in envelope.


Estimating fitness costs requires accurate estimates of mutation frequencies. The accuracy of the latter is limited by the small numbers of HIV genomes that enter the sequencing library, by amplification biases during PCR, and possibly by hitch-hiking effects that bring deleterious alleles to high frequencies. To nevertheless get reasonable estimates of fitness costs at individual sites, we used weighted averages of all sequenced samples that we had available. This is sensible, since the frequencies of deleterious mutations decorrelate rapidly such that different samples from the same patient are approximately independent. By combining multiple samples with weights proportional to the number of genomes contributing to each sample, we generate a meta sample that represents a much larger population. The individual samples are sequenced with an error rate below 0.002 per site, and the pooled sample then allows us to estimate frequencies far below this threshold.
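The pooling step amounts to a weighted average. A minimal sketch, with a hypothetical function name and made-up numbers:

```python
def pooled_frequency(frequencies, n_genomes):
    """Average per-sample frequencies with weights proportional to genome counts."""
    total_genomes = sum(n_genomes)
    weighted_sum = sum(x * n for x, n in zip(frequencies, n_genomes))
    return weighted_sum / total_genomes

# Hypothetical frequencies of one mutation in three samples from one patient,
# and the estimated number of HIV genomes that went into each library.
freqs = [0.0, 0.002, 0.001]
genomes = [1000, 3000, 6000]
print(pooled_frequency(freqs, genomes))
```

Samples with few input genomes contribute little, so a noisy zero or a noisy spike in a small sample does not dominate the pooled estimate.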

Why do we care?

We have previously shown that reversion to the consensus is a dominant force in HIV-1 evolution. These reversions are driven by the fitness costs of the mutations they undo. The landscape we determined will allow us to look more closely at the driving forces of reversion. Furthermore, the landscape can pinpoint regions of vulnerability and flag regions with unexpected conservation patterns for follow-up analysis.

On a more general note, fitness landscapes and the distribution of effect sizes of mutations are the most important parameters we need to know in order to decide what kind of population genetic model is appropriate. We have very little knowledge of what these distributions look like for any organism. Our work is one of the first examples where such a landscape has been determined in vivo on a genome-wide scale.

Many aspects of intrapatient HIV-1 evolution are predictable

Within HIV infected individuals, the human immune system tries to prevent virus replication while HIV continuously changes to avoid recognition by the immune system. The resulting evolution of the virus population has become a paradigmatic example of rapidly adapting populations. We just published a paper that provides unprecedented insight into intrapatient HIV evolution. This work is the product of an extremely enjoyable collaboration involving Fabio Zanini from our group here in Tuebingen and the group of Jan Albert at the Karolinska Institute in Stockholm.

Whole genome deep sequencing

Our aim in this study was to provide a comprehensive assessment of the evolutionary dynamics unfolding within the body of HIV infected people. We developed a strategy to sequence the entire virus genome such that even rare mutations are accurately represented in our data set. The impressive sample collections in Sweden and the generous participation of patients in this study allowed us to follow HIV evolution densely in multiple patients. We developed an interactive web application that allows users to explore HIV evolutionary dynamics and access the data in a convenient way.

What did we find?

Mutations occur at random, while selection for replication weeds out harmful mutations and amplifies useful ones. The common conception is that mutation rates are low and/or useful mutations are rare. Furthermore, biological reality is complicated and predicting what might be a useful mutation seems hopeless. In HIV, however, we find a high degree of reproducibility and predictability, indicating that “finding the right mutation” is not a rare, fortunate event for HIV but rather a fast and reliable mechanism of survival.

The predictability extends to single positions in the genome. More than 20% of sites that are globally unconserved (such as sites at which mutations are synonymous) are measurably diverse after a few years within each patient. This diversity grows continuously, with few signs of loss of diversity through genetic drift or hitch-hiking with beneficial mutations. This implies a large population that systematically explores sequence space. In contrast, at conserved sites, we observe next to no diversity, indicating efficient selection against deleterious variants.

We can not only predict where mutations accumulate because they are tolerated, but also where they spread because they help the virus. By looking specifically at sites where the virus population was initially different from the majority of known HIV sequences, we found that the virus has a strong tendency to come back to the global consensus state. 30% of all substitutions occur at the 5% of sites where the initial virus differed from this consensus and represent reversions. The tendency to revert to this global attractor is stronger at sites that are globally more conserved. Within the diversity of HIV-1, this attractor seems universal. The picture on the right shows the rate of evolution (divergence after 6 years) separately for sites that can revert and sites already in the consensus state. At the globally most conserved sites, about 50% of all non-consensus positions revert to consensus after 5 years, a roughly 1000-fold excess over evolution away from consensus. We also found that reversions happen not only soon after infection but all along, for many years.

What does this mean?

Our data are consistent with HIV as a large population that systematically explores a mostly universal fitness landscape and returns to a favoured state when possible. The reproducible patterns of evolution are only possible because HIV recombines extensively within patients. Without recombination, it would be much more difficult for the virus population to simultaneously revert and escape in different regions of the genome, as these mutations would interfere with each other as they spread. Sweeping adaptive mutations would wipe out diversity and the reproducible patterns of mutation accumulation. The reproducibility of minor variation further suggests that the fitness costs of individual mutations are similar among unrelated viruses and explains why inference of fitness landscapes of HIV from cross-sectional data is possible.


Recombination in HIV-1 and the “book” of genealogical trees

In our new preprint, we report whole genome deep sequencing of longitudinally sampled HIV-1 populations from multiple patients, effectively a movie of evolution at about 6-month resolution. This work was led by Fabio and is the product of a fantastic collaboration with the group of Jan Albert at the Karolinska Institute in Stockholm.

Among the many things we can study in detail using this data set, we looked at linkage and recombination. We find that linkage disequilibrium in chronic infection is typically limited to about 100 bp. Consistent with this lack of long-range linkage, the shapes of trees reconstructed from 400 bp reads vary greatly between different regions of the genome. 400 bp are often too short to construct well-supported phylogenetic trees. Nevertheless, the trees are instructive for illustrating diversity in the population. The figure below animates trees when moving through the genome from the 5′ to the 3′ end.

Every position in the genome has a unique genealogical tree, but through recombination the genealogical trees of two sites diverge as the distance between the sites increases. One way to picture this process is to think of a book in which each page shows the genealogy corresponding to a particular nucleotide. Skimming through the book results in a movie of gradually changing trees. We need diversity to resolve trees and can’t reconstruct a tree for an individual site, but the trees obtained from sliding 400 bp windows approximate this process.

Trees of longitudinally sampled sequences in various parts of the HIV-1 genome. Big circles correspond to common variants, small circles to rare variants. Early samples are shown in blue, followed by green, yellow and red.

Trees in different parts of the genome vary widely in shape and depth. This is consistent with extensive recombination. The scale of linkage, about 100 bp, is compatible with earlier estimates of the intrapatient recombination rate by us and Thomas Leitner and by Batorsky and colleagues.
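For readers who want to see what a linkage measurement looks like, here is a minimal sketch of the standard r² statistic for a pair of biallelic sites, computed from counts of reads that cover both sites. The function name and the counts are made up, and the preprint may use a different normalization.

```python
def linkage_r2(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic sites from counts of the four two-site haplotypes."""
    n = n_AB + n_Ab + n_aB + n_ab
    p_A = (n_AB + n_Ab) / n                 # frequency of allele A at site 1
    p_B = (n_AB + n_aB) / n                 # frequency of allele B at site 2
    D = n_AB / n - p_A * p_B                # linkage disequilibrium coefficient
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        return 0.0                          # one site is monomorphic: r^2 undefined
    return D * D / denominator

print(linkage_r2(50, 0, 0, 50))    # complete linkage
print(linkage_r2(25, 25, 25, 25))  # no linkage
```

Plotting this quantity against the distance between the two sites is one way to read off a linkage scale like the one quoted above.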

Deleterious effects of synonymous mutations in HIV

Fabio and Richard just published a paper in Journal of Virology about whether or not synonymous mutations in HIV are neutral. We know of course that many synonymous sites in HIV have important roles in regulation and RNA structure, but what about those sites at which we see high-frequency synonymous polymorphisms without known function? We focus on the env gene and in particular on the region containing the V loops that are under antibody attack by the immune system and hence change rapidly (V stands for variable). We investigate the dynamics of synonymous mutations in serial sequence data and show that most synonymous mutations in that region come with a fitness cost of the order of 0.002 / day. Nevertheless, they seem to hitch-hike to high frequency on linked adaptive variants.  These synonymous mutations interfere with folding of the HIV RNA molecule, reducing its replication efficiency.

Why is this important?

In studies of molecular evolution, synonymous mutations are often used as a neutral control to infer selection operating on proteins. Our results add to the growing body of evidence that selection on synonymous sites needs to be accounted for in such analyses. This not only holds for viruses, but has recently been demonstrated in the fruit fly.

More importantly, we shed light on the functional relevance and evolutionary dynamics of RNA structures in the genome. A recent study compared RNA structures of diverged strains of HIV and found that the overall pattern of folding is conserved, while the exact base pairs are not. Our results can explain this observation: most RNA structures seem to be important to the virus, but not crucial (disrupting mutations have a small fitness effect). This results in occasional fixation of a disrupting mutation which is slowly restored by compensatory mutations or rearrangements. This way the molecular architecture of the RNA structure can change leaving the overall pattern in place.

How did we do it?

In longitudinal HIV sequence datasets, we observe new synonymous alleles rising in frequency in our sample up to 50% and higher (see sketch), but almost all of them disappear from our sample after one to two years (i.e., the ancestral allele is found again in all sequences). This suggests that these mutations are deleterious. We quantify this effect by measuring how long it takes for these mutations to disappear and how often they do not disappear. Comparing these observations to extensive computer simulations, we conclude that the majority of synonymous mutations are deleterious with small fitness effects (0.002 / day). Then, building on published RNA pairing data, we ask whether or not a mutation that destroys a base pair is pruned more often than any other mutation; that seems to be the case indeed.

Inferring HIV escape rates

We have a new preprint on the arXiv (here on Haldane’s sieve). This work is the result of a collaboration between us and Alan Perelson, LANL, and explores methods to estimate parameters of the HIV-immune system interaction from time resolved sequence data. The focus of this paper is on early infection dominated by a few rapid substitutions that fix because they prevent or reduce recognition of infected cells by the immune system via cytotoxic T-lymphocytes (CTL). CTL escape is one of the fastest instances of evolution I have come across. 4-6 mutations spread within a few weeks. It happens in most HIV infections and is partly predictable based on the HLA genotype of the infected person. These substitutions are so rapid that clonal interference has to be modeled. Our method fits a reduced model of clonal interference to the typically very sparse data and thereby estimates the selection coefficients, aka escape rates.

Why do we want to know these numbers?
The number of viruses in the blood of an infected person peaks 2-3 weeks after infection and thereafter drops by 2-3 orders of magnitude. This drop is partly due to a response by the adaptive immune system. However, it has proved difficult to attribute this drop to specific parts of the immune response. The rates at which different mutations sweep through the population give us information about the pressure exerted by the T-cell clones that target the epitope containing this mutation.

How do we do it?
Early in infection, the viral population is large and selection is strong. In these conditions, recombination is of minor importance since most double/triple… mutants are more efficiently produced by recurrent mutation than by recombination. This implies that mutations accumulate sequentially, always on a background on which all previous mutations are already present. The time at which a novel mutation arises is tightly constrained by the trajectory of the preceding genotype. These constraints regularize the fitting problem to some degree, and the multi-locus fitting is more robust than single-locus fitting.
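For comparison, the single-locus estimate can be sketched as follows. It assumes that the escape variant follows a logistic sweep and uses its frequency at two time points; this is the simple baseline, not the multi-locus method of the paper, and the function name and the numbers are made up.

```python
import math

def single_locus_escape_rate(t1, x1, t2, x2):
    """Escape rate per day from two frequencies of an escape variant.

    Under logistic growth, log(x / (1 - x)) increases linearly in time with
    slope equal to the escape rate. Frequencies must lie strictly between 0 and 1.
    """
    logit1 = math.log(x1 / (1.0 - x1))
    logit2 = math.log(x2 / (1.0 - x2))
    return (logit2 - logit1) / (t2 - t1)

# Hypothetical data: the escape variant goes from 10% on day 20 to 90% on day 40.
print(single_locus_escape_rate(20, 0.1, 40, 0.9))
```

With sparse sampling, a variant is often at 0% at one time point and 100% at the next, where this estimate breaks down; that is one reason why the constraints from preceding genotypes are useful.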

What do we learn about evolution in general?
In addition to the intrinsic interest in the HIV/CTL interaction, CTL escape is an ideal setting to study rapidly evolving populations. This evolution happens in its “natural” habitat, and the selective pressure as well as the functional consequences of the observed molecular changes can be quantified via immunological data, protein structure, and replication assays. In addition, we have ample cross-sectional data (HIV sequences from many different patients) that allow us to look at the prevalence of the escape mutations and potential compensatory mutations. None of this is done in this paper, but HIV/immune-system coevolution is a fascinating showcase of rapid evolution.