Basic evolutionary model
We begin by considering data Y, consisting of aligned molecular sequences of length S from n taxa. We orientate these data such that we may write Y = (Y_1, ..., Y_S), where Y_s for s = 1, ..., S are the n homologous characters at each site s of the sequence alignment. To model this homology, we follow standard likelihood-based phylogenetic reconstruction practice [13] and assume the data arise from an underlying continuous-time Markov chain (CTMC) process [14] along an unobserved tree τ. The tree τ consists of a rooted, bifurcating topology that characterizes the relatedness among the taxa, the generally unknown historical times when lineages diverge in the topology and up to 2n − 2 rate parameters r_k that relate historical time and expected number of substitutions on each branch k. The CTMC process describes the relative rates at which different sequence characters change along the independent branches in τ. We restrict our attention in this paper to nucleotide substitution processes characterized by either the HKY85 [15] or GTR [16] infinitesimal rate matrices Λ and discrete-Gamma distributed across-site rate variation [17] with shape parameter α. However, our approach admits any standardly used CTMC for nucleotides, codons or amino acids.
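As a concrete illustration of the HKY85 rate matrix Λ mentioned above, the sketch below assembles the standard HKY85 structure (transitions scaled by κ, each rate weighted by the target base frequency) and normalizes it to one expected substitution per unit time. The particular values of κ and the base frequencies are hypothetical, not taken from the paper.

```python
import numpy as np

def hky85_rate_matrix(kappa, pi):
    """Build a normalized HKY85 infinitesimal rate matrix Lambda.

    Rows/columns are ordered A, C, G, T; kappa is the transition/
    transversion rate ratio and pi the stationary base frequencies.
    The matrix is rescaled so the expected substitution rate at
    stationarity equals 1.
    """
    pi = np.asarray(pi, dtype=float)
    transitions = {(0, 2), (2, 0), (1, 3), (3, 1)}  # A<->G, C<->T
    L = np.zeros((4, 4))
    for i in range(4):
        for j in range(4):
            if i != j:
                L[i, j] = (kappa if (i, j) in transitions else 1.0) * pi[j]
    np.fill_diagonal(L, -L.sum(axis=1))   # rows sum to zero
    L /= -np.dot(pi, np.diag(L))          # normalize expected rate to 1
    return L

# Illustrative parameter values (kappa and pi are not from the paper).
pi_hat = [0.3, 0.2, 0.2, 0.3]
Lam = hky85_rate_matrix(kappa=2.0, pi=pi_hat)
```

Each row of Λ sums to zero, as any infinitesimal generator must, and the normalization makes branch lengths interpretable as expected substitutions per site.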
Letting Φ = (Λ, α), we write the data sampling density of site s as P(Y_s | τ, Φ). Felsenstein's peeling/pruning algorithm [18] enables computationally efficient calculation of P(Y_s | τ, Φ). Assuming that sites are independent and identically distributed given (τ, Φ) yields the complete data likelihood
P(Y | τ, Φ) = ∏_{s=1}^{S} P(Y_s | τ, Φ).    (1)
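The likelihood factorization above can be sketched in a few lines: Felsenstein's pruning recursion computes each P(Y_s | τ, Φ), and independence across sites turns the product in Equation (1) into a sum of logs. For brevity this sketch assumes the simple JC69 substitution model rather than HKY85/GTR, and the tree, branch lengths and alignment are hypothetical.

```python
import math

def jc69_p(t):
    """JC69 transition probability matrix for branch length t
    (expected substitutions per site)."""
    e = math.exp(-4.0 * t / 3.0)
    same, diff = 0.25 + 0.75 * e, 0.25 - 0.25 * e
    return [[same if i == j else diff for j in range(4)] for i in range(4)]

def site_likelihood(tree, site):
    """Felsenstein pruning on a rooted tree.

    tree: ('leaf', taxon_index) or ('node', left, t_left, right, t_right).
    site: list mapping taxon index -> observed state in {0,1,2,3}.
    Returns P(site | tree) under JC69 with uniform root frequencies.
    """
    def partial(node):
        if node[0] == 'leaf':
            return [1.0 if s == site[node[1]] else 0.0 for s in range(4)]
        _, left, tl, right, tr = node
        Ll, Lr = partial(left), partial(right)
        Pl, Pr = jc69_p(tl), jc69_p(tr)
        return [sum(Pl[s][x] * Ll[x] for x in range(4)) *
                sum(Pr[s][x] * Lr[x] for x in range(4)) for s in range(4)]
    return sum(0.25 * p for p in partial(tree))

# Hypothetical three-taxon tree and a short alignment of iid sites Y_s.
tree = ('node', ('leaf', 0), 0.1,
        ('node', ('leaf', 1), 0.05, ('leaf', 2), 0.05), 0.05)
alignment = [[0, 0, 0], [0, 0, 2], [3, 3, 1]]  # site columns Y_s
log_lik = sum(math.log(site_likelihood(tree, site)) for site in alignment)
```

A useful sanity check is that the site likelihoods sum to one over all possible site patterns, confirming the recursion implements a proper probability model.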
Branch-specific rate variation
We take the position that variation in the rate of molecular evolution is widespread [5, 6], but, following Yoder and Yang [11], we assume that in any given tree there exist only a small number of rate changes. This contrasts with most previous Bayesian MCMC relaxed clock models that favor many small or smoothly changing events [3, 7, 19, 20]. In general, the numerous small changes arise as a modeling consequence, and are not necessarily data-driven. Apart from the induced smoothing, some structure remains quite useful; at certain time scales one expects rate changes to be heritable and to persist for some time down the subtree extending from the change-point.
Model parameterization
We introduce the RLC model that allows for sparse, possibly large-scale changes while maintaining spatial correlation along the tree. We start at the unobserved branch leading to the most recent common ancestor (MRCA) of the tree and define the composite rate ρ_MRCA = 1. Substitutions then occur on each branch k = 1, ..., 2n − 2 below the MRCA with normalized rate

r_k = ρ_pa(k) × ϕ_k / c(ρ),    (2)

where pa(k) refers to the parent branch above k, ϕ = (ϕ_1, ..., ϕ_{2n−2}) are branch-specific rate multipliers and c(·) is a normalization constraint that ensures that the r_k reflect the expected number of substitutions per unit time. This multiplicative structure on the composite ρ_k = ρ_pa(k) × ϕ_k builds up a hierarchy of rate multipliers descending towards the tree's tips.
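The multiplicative hierarchy can be sketched as a single preorder pass over the branches: each composite rate is its parent's composite rate times its own multiplier, starting from ρ_MRCA = 1. The toy tree and multiplier values below are hypothetical.

```python
def composite_rates(parent, phi):
    """Build rho_k = rho_pa(k) * phi_k down the tree (the numerator of
    Equation (2)), with rho_MRCA = 1.

    parent[k] is the parent branch of k, or None for branches hanging
    directly off the MRCA; branches are listed in preorder (parents
    before children).
    """
    rho = [0.0] * len(phi)
    for k in range(len(phi)):
        above = 1.0 if parent[k] is None else rho[parent[k]]
        rho[k] = above * phi[k]
    return rho

# Hypothetical 3-taxon tree: branches 0,1 hang off the MRCA; 2,3 descend
# from branch 1.  Local clocks start on branches 1 and 3 (phi != 1).
parent = [None, None, 1, 1]
phi = [1.0, 2.0, 1.0, 0.5]
rho = composite_rates(parent, phi)   # -> [1.0, 2.0, 2.0, 1.0]
```

Note how branch 2 inherits branch 1's doubled rate unchanged (ϕ_2 = 1), while branch 3 starts a new local clock that exactly cancels it.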
Allowing all elements in ϕ to vary independently leads to a completely non-clock-like model with, even worse, far too many free parameters for identifiability with the divergence times in τ. We avoid this problem by specifying a prior P(ϕ) on the rate multipliers. This prior specifies that only a random number K ∈ {0, ..., 2n − 2} of the ϕ_k differ from 1, such that the corresponding r_k do not inherit their ancestors' rate of change but instead mark the start of a new local clock, where a priori we believe K is small. In effect, we place non-negligible prior probability on K = 0, the state in which one rate rules them all. Further, with most r_k = r_pa(k), the prior binds absolute rates equal on branches incident to the same divergence points.
Bayesian stochastic search variable selection
To infer which branch-specific rates r_k do or do not inherit their ancestors' rate, we employ ideas from Bayesian stochastic search variable selection (BSSVS) [21]. BSSVS traditionally applies to model selection problems in a linear regression framework. In this framework, the statistician starts with a large number of potential predictors X_1, ..., X_P and asks which among these associate linearly with an N-dimensional outcome variable Y. For example, the full model becomes Y = [X_1, ..., X_P]β + ϵ, where β is a P-dimensional vector of regression coefficients and ϵ is an N-dimensional vector of normally distributed errors with mean 0. When β_p for p = 1, ..., P differs significantly from 0, X_p helps predict Y; otherwise X_p contributes little additional information and warrants removal from the model via forcing β_p = 0. Given potentially high correlation between the predictors, deterministic model search strategies tend not to find the optimal set of predictors unless one explores all possible subsets. This exploration is generally computationally impractical as there exist 2^P such subsets, and it completely fails for P > N.
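The 2^P explosion is easy to make concrete: exhaustive model selection must visit every subset of predictors, which is feasible only for very small P. The predictor names below are placeholders.

```python
from itertools import combinations

def all_subsets(predictors):
    """Enumerate every subset of predictors -- the model space that an
    exhaustive all-subsets search must visit (2^P models in total)."""
    subsets = []
    for k in range(len(predictors) + 1):
        subsets.extend(combinations(predictors, k))
    return subsets

small = all_subsets(['X1', 'X2', 'X3', 'X4'])   # 2^4 = 16 candidate models
# For P = 50 predictors the count is 2**50, already over 10**15 models --
# far beyond what any deterministic search can enumerate.
```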
Recent work in BSSVS [22, 23] efficiently performs the exploration in two steps. In the first step, the approach augments the model state-space with a set of P binary indicator variables δ = (δ_1, ..., δ_P) and imposes a prior P(β) on the regression coefficients that has expectation 0 and variance proportional to a P × P diagonal matrix with entries δ. If δ_p = 0, then the prior variance on β_p shrinks to 0 and enforces β_p = 0 in the posterior. In the second step, MCMC explores the joint space of (δ, β) simultaneously.
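The indicator-scaled prior can be sketched as follows: each coefficient is drawn with variance δ_p × v, so a zero indicator collapses the prior to a point mass at zero. The slab variance and seed are arbitrary illustrative choices.

```python
import random

def sample_beta_prior(delta, slab_var, rng=random):
    """Draw regression coefficients from a BSSVS-style prior:
    beta_p ~ Normal(0, delta_p * slab_var).  A zero indicator collapses
    the prior variance, forcing beta_p = 0 exactly."""
    return [rng.gauss(0.0, (d * slab_var) ** 0.5) for d in delta]

random.seed(1)
delta = [1, 0, 1, 0, 0]              # predictors 1 and 3 are "in" the model
beta = sample_beta_prior(delta, slab_var=4.0)
```

In the MCMC second step, kernels that flip an entry of δ and kernels that perturb the non-zero entries of β are alternated, so the chain wanders through both the model space and the coefficient space.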
To map BSSVS into the setting of rate variation, let δ_k be the binary indicator that a local clock starts along branch k, such that r_k ≠ r_pa(k). Conversely, when δ_k = 0, r_k = r_pa(k), implying that ϕ_k = 1. So, the rate multipliers ϕ play an analogous role to the regression coefficients in BSSVS. An important difference is that ϕ_k ∈ [0, ∞) and shrinks to 1, while β_p ∈ (−∞, ∞) and shrinks to 0, mandating alternative prior formulations.
Prior specification
To specify a prior distribution over δ = (δ_1, ..., δ_{2n−2}), we assume that each indicator acts a priori as an independent Bernoulli random variable (RV) with small success probability χ. The sum of these independent Bernoulli RVs yields a Binomial distribution over K = Σ_k δ_k. In the limit of small χ with λ = χ × (2n − 2) held fixed, this prior conveniently collapses to
P(K) ∝ e^{−λ} λ^K / K!,  K ∈ {0, ..., 2n − 2},    (3)
where λ is the prior expected number of rate changes along the tree τ. Choosing λ = log2, for example, sets 50% prior probability on the hypothesis of no rate changes.
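The truncated-Poisson prior and the λ = log 2 choice are easy to verify numerically: with λ = log 2, P(K = 0) = e^{−log 2} ≈ 1/2, and truncation at 2n − 2 barely perturbs this. The taxon count below is an arbitrary example.

```python
import math

def truncated_poisson_pmf(k, lam, k_max):
    """P(K = k) for a Poisson(lam) prior truncated to {0, ..., k_max},
    as in Equation (3) with k_max = 2n - 2."""
    norm = sum(lam ** j * math.exp(-lam) / math.factorial(j)
               for j in range(k_max + 1))
    return (lam ** k * math.exp(-lam) / math.factorial(k)) / norm

n_taxa = 20                      # illustrative taxon count
lam = math.log(2.0)              # lambda = log 2: ~50% prior mass on K = 0
p0 = truncated_poisson_pmf(0, lam, 2 * n_taxa - 2)
```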
Completing the RLC prior specification, we assume that all rate multipliers in ϕ are a priori independent and
(4)
When δ_k = 1, then a priori, ϕ_k has expectation 1 and variance ψ, following in the vein of [24]. However, in light of BSSVS, when δ_k = 0, the prior variance collapses to 0 and ϕ_k = 1.
Normalization
To translate between the expected number of substitutions b_k on branch k and real clock-time t_k, we let

b_k = r_k × μ × t_k,    (5)
where μ is the overall substitution rate. The keen eye may observe that, over the entire tree τ, the parameterization in Equation (5) again leads to more degrees-of-freedom than are identifiable. We solve this difficulty through a further normalization constraint c(·) on ρ. Recall that we wish to measure μ in terms of expected substitutions per unit real time, such that
μ = Σ_k b_k / Σ_k t_k.    (6)
To maintain this scaling, we sum Equation (5) over all branches and substitute the result into Equation (6). This eliminates the unknown μ and yields
Σ_k r_k t_k = Σ_k t_k,  so that  c(ρ) = Σ_k ρ_pa(k) ϕ_k t_k / Σ_k t_k.    (7)
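The normalization step can be sketched directly: divide the composite rates by the time-weighted average so that Σ_k r_k t_k = Σ_k t_k, which is what lets μ retain its substitutions-per-unit-time interpretation. The composite rates and branch times below are hypothetical.

```python
def normalize_rates(rho, times):
    """Apply the constraint c(rho) so that sum_k r_k t_k = sum_k t_k,
    keeping mu interpretable as expected substitutions per unit real
    time."""
    c = sum(p * t for p, t in zip(rho, times)) / sum(times)
    return [p / c for p in rho]

# Hypothetical composite rates and branch durations for a 4-branch tree.
rho = [1.0, 2.0, 2.0, 1.0]
times = [1.0, 0.4, 0.6, 0.6]
r = normalize_rates(rho, times)
check = sum(rk * tk for rk, tk in zip(r, times))   # equals sum(times)
```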
Posterior simulation
We take a Bayesian approach to data analysis and draw inference under the RLC model via MCMC. MCMC straightforwardly generates random draws with first-order dependence through the construction of a Markov chain that explores the posterior distribution. Via the Ergodic Theorem, simple tabulation of a chain realization {θ(1),...,θ(L)} can provide adequate empirical estimates. To generate a Markov chain using the Metropolis-Hastings algorithm [25, 26], one imagines starting at chain step ℓ in state θ(ℓ) and randomly proposing a new state θ* drawn from an arbitrary distribution with density q(·|θ(ℓ)). This arbitrary distribution is commonly called a "transition kernel". Finally the next chain step ℓ + 1 arrives in state
θ(ℓ+1) = θ* with probability min{1, [P(θ*|Y) / P(θ(ℓ)|Y)] × [q(θ(ℓ)|θ*) / q(θ*|θ(ℓ))]}, and θ(ℓ+1) = θ(ℓ) otherwise.    (8)
The first term in the acceptance probability above is the ratio of posterior densities and the term involving the transition kernel is the Hastings ratio. The beauty of the algorithm is that the posterior densities only appear as a ratio, so the intractable normalizing constant cancels out.
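A minimal, generic Metropolis-Hastings sketch of Equation (8) follows; the toy target (an unnormalized standard normal) and random-walk proposal are illustrative stand-ins, not the RLC posterior or kernels.

```python
import math
import random

def metropolis_hastings(log_post, proposal, theta0, steps, rng):
    """Minimal Metropolis-Hastings loop for Equation (8).

    proposal(theta, rng) returns (theta_star, log_hastings_ratio);
    log_post need only be known up to a normalizing constant, since it
    enters the acceptance probability only as a ratio.
    """
    chain, theta, lp = [], theta0, log_post(theta0)
    for _ in range(steps):
        theta_star, log_hr = proposal(theta, rng)
        lp_star = log_post(theta_star)
        if math.log(rng.random()) < lp_star - lp + log_hr:
            theta, lp = theta_star, lp_star     # accept the proposal
        chain.append(theta)                     # else stay at theta
    return chain

# Toy target: unnormalized N(0, 1); symmetric random-walk proposal,
# so the log Hastings ratio is 0.
rng = random.Random(3)
chain = metropolis_hastings(
    log_post=lambda x: -0.5 * x * x,
    proposal=lambda x, r: (x + r.uniform(-1.0, 1.0), 0.0),
    theta0=0.0, steps=50000, rng=rng)
mean = sum(chain) / len(chain)
var = sum((x - mean) ** 2 for x in chain) / len(chain)
```

Tabulating the chain recovers the target's moments, the empirical-estimate property the Ergodic Theorem guarantees.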
Transition kernels
We employ standard phylogenetic transition kernels via a Metropolis-within-Gibbs scheme, as implemented in BEAST [12], to travel through most dimensions in the RLC parameter space. What is unique to the RLC model are transition kernels to explore the rate multipliers ϕ and all possible local clock indicator configurations δ. Since ϕ_k ∈ [0, ∞), we propose new rates ϕ* component-wise, such that for a uniform randomly selected k with δ_k = 1,

ϕ_k* = ϕ_k × U,  U ~ Uniform[s_f, 1/s_f],    (9)

where 0 < s_f < 1 is a tuning constant and the Hastings ratio is 1/U [27].
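A sketch of this scale proposal follows; the uniform draw of U on [s_f, 1/s_f] is our reading of Equation (9), consistent with the stated tuning constant 0 < s_f < 1 and Hastings ratio 1/U, and the numeric values are illustrative.

```python
import random

def scale_proposal(phi_k, s_f, rng=random):
    """Propose phi_k* = U * phi_k with U ~ Uniform[s_f, 1/s_f].

    The interval is closed under inversion, so the reverse move is
    always possible and the Hastings ratio works out to 1/U.
    """
    u = rng.uniform(s_f, 1.0 / s_f)
    return u * phi_k, 1.0 / u

random.seed(11)
phi_star, hastings = scale_proposal(2.0, s_f=0.75)
```

Smaller s_f makes bolder proposals (lower acceptance, bigger moves); s_f near 1 makes timid ones, the usual mixing trade-off when tuning.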
Transition kernels on δ are more challenging. One natural way to construct a Markov chain on a bit-vector state space, such as δ, involves selecting one random element δ_k with uniform probability 1/(2n − 2) and swapping its state with probability 1 [14].
At first glance, the transition kernel density q(δ*|δ) = q(δ|δ*) = 1/(2n − 2) appears symmetric, leading to a Hastings ratio of 1. However, this view is flawed. One must recall that we introduced the indicators δ as a computational convenience. The number of different local clocks K over-shadows δ as our parameter of interest, upon which we place our truncated-Poisson prior P(K). The correct densities to calculate then become q(K*|K) and q(K|K*). Suppose the swapping event above generates 0 → 1, so that K* = K + 1. As K approaches 0, the transition kernel finds it more and more difficult to decrease K because the kernel is more likely initially to choose a 0 state for swapping. From this perspective, the kernel is definitely not symmetric in the interchange of K* and K. Assuming symmetry would lead to upwardly biased estimates for K < n − 1. The reverse bias occurs as K approaches 2n − 2 from below.
To determine q(K*|K), we identify that our kernel chooses a δ_k = 0 with probability (2n − 2 − K)/(2n − 2) and a δ_k = 1 with probability K/(2n − 2). Therefore, if K* = K + 1, q(K*|K) is the former probability and if K* = K − 1, q(K*|K) is the latter. Forming the Hastings ratio,

q(K|K*) / q(K*|K) = (K + 1) / (2n − 2 − K)  if K* = K + 1, and
q(K|K*) / q(K*|K) = (2n − 1 − K) / K  if K* = K − 1.    (10)
This derivation provides an important lesson for those new to MCMC implementation: the Hastings ratio may vary depending on the model parameterization; it is, therefore, necessary to calculate the ratio as a function of the same parameterization as the prior.
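The flip-kernel Hastings ratio in Equation (10) can be coded and checked empirically: flipping a uniformly chosen bit increases K with probability (2n − 2 − K)/(2n − 2), which simulation confirms. The branch count and seed below are arbitrary.

```python
import random

def flip_hastings_ratio(K, K_star, n_branches):
    """Hastings ratio q(K|K*) / q(K*|K) for the single-bit flip kernel
    on delta, expressed in terms of the number of local clocks K
    (Equation (10)); n_branches = 2n - 2."""
    if K_star == K + 1:
        return (K + 1) / (n_branches - K)
    if K_star == K - 1:
        return (n_branches - K + 1) / K
    raise ValueError("a flip move changes K by exactly 1")

# Empirical check of q(K* = K+1 | K): flip a uniformly chosen bit.
rng = random.Random(5)
n_branches, K = 10, 3
delta = [1] * K + [0] * (n_branches - K)
ups = sum(delta[rng.randrange(n_branches)] == 0 for _ in range(100000))
up_prob = ups / 100000.0   # approaches (n_branches - K)/n_branches = 0.7
```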
In cases where the swap event relaxes the prior variance on the rate multiplier ϕ_k, we simultaneously propose a new value for ϕ_k. We draw this value from the prior given in Equation (4).
Proposals involving changes to the tree topology are based on existing tree proposal moves in the BEAST software framework with a small modification to track the augmented data at the nodes [see Additional file 1].
Model selection
Statistical inference divides into two intertwined approaches: parameter estimation and model selection. For the former, parameter inference relies on empirical estimates of P(θ| Y) that we tabulate from the MCMC draws. Model selection often represents a more formidable task. The natural selection criterion in a Bayesian framework is the Bayes factor [28–30]. The Bayes factor B10 in favor of ℳ1 over ℳ0 is the ratio of the marginal likelihoods of ℳ1 and ℳ0,
B10 = P(Y|ℳ1) / P(Y|ℳ0),    (11)
and informs the phylogeneticist how she (he) should change her (his) prior belief P(ℳ1)/P(ℳ0) about the competing models in the face of the observed data. Involving the evaluation of two different normalizing constants, Bayes factors are often challenging to estimate.
By fortuitous construction, we side-step this computational limitation when estimating the Bayes factor in favor of a global clock (GC) model ℳGC over the RLC model ℳRLC. Model ℳGC occurs when K = 0 and is conveniently nested within model ℳRLC. Consequently, P(K = 0|ℳRLC) equals the prior probability of ℳGC, and P(K = 0|Y, ℳRLC) yields P(ℳGC|Y). Given this, a Bayes factor test of ℳGC only requires simulation under the RLC model. The Bayes factor in favor of a global clock is
BGC = P(Y|ℳGC) / P(Y|ℳRLC) = P(K = 0|Y, ℳRLC) / P(K = 0|ℳRLC).    (12)
To calculate the ratio of marginal likelihoods we need only an estimator of P(K =0|Y, ℳRLC). The Ergodic Theorem suggests that we let
P̂(K = 0|Y, ℳRLC) = (1/L) Σ_{ℓ=1}^{L} 1{K(ℓ) = 0},    (13)
where 1{·} is the indicator function. Occasionally this estimator becomes poor when P(K = 0|Y, ℳRLC) decreases below ϵ or increases above 1 − ϵ for ϵ ≈ 1/L. In such situations, there are alternatives that depend on MCMC chains generated under several different prior probabilities P(K = 0|ℳRLC) [31]. The Bayes factor then provides the mechanism to combine results from the multiple chains and to rescale back to a believable prior.
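Combining Equations (12) and (13) gives a one-line Bayes factor estimator: the posterior frequency of K = 0 in the chain divided by its prior probability. The chain of K draws below is hypothetical, not output from an actual analysis.

```python
def global_clock_bayes_factor(k_samples, prior_p0):
    """Estimate the Bayes factor in favor of the global clock via
    Equations (12) and (13): the ergodic estimate of P(K = 0 | Y)
    divided by the prior probability P(K = 0)."""
    post_p0 = sum(1 for k in k_samples if k == 0) / len(k_samples)
    return post_p0 / prior_p0

# Hypothetical MCMC draws of K; prior P(K = 0) = 0.5 (lambda = log 2).
k_samples = [0, 0, 1, 0, 2, 0, 0, 1, 0, 0]
bf = global_clock_bayes_factor(k_samples, prior_p0=0.5)   # 0.7 / 0.5 = 1.4
```

A real chain would of course be far longer, and when the posterior frequency of K = 0 nears 0 or 1 the multiple-prior strategy described above should replace this simple tabulation.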