### Basic evolutionary model

We begin by considering data **Y**, consisting of aligned molecular sequences of length *S* from *n* taxa. We orientate these data such that we may write **Y** = (**Y**_{1}, ..., **Y**_{S}), where **Y**_{s} for *s* = 1, ..., *S* are the *n* homologous characters at each site *s* of the sequence alignment. To model this homology, we follow standard likelihood-based phylogenetic reconstruction practice [13] and assume the data arise from an underlying continuous-time Markov chain (CTMC) process [14] along an unobserved tree *τ*. The tree *τ* consists of a rooted, bifurcating topology that characterizes the relatedness among the taxa, the generally unknown historical times when lineages diverge in the topology, and up to 2*n* − 2 rate parameters *r*_{k} that relate historical time and the expected number of substitutions on each branch *k*. The CTMC process describes the relative rates at which different sequence characters change along the independent branches in *τ*. We restrict our attention in this paper to nucleotide substitution processes characterized by either the HKY85 [15] or GTR [16] infinitesimal rate matrices **Λ** and discrete-Gamma distributed across-site rate variation [17] with shape parameter *α*. However, our approach admits any standardly used CTMC for nucleotides, codons or amino acids.

Letting **Φ** = (**Λ**, *α*), we write the data sampling density of site *s* as *P*(**Y**_{s}|**τ**, **Φ**). Felsenstein's peeling/pruning algorithm [18] enables computationally efficient calculation of *P*(**Y**_{s}|**τ**, **Φ**). Assuming that sites are independent and identically distributed given (**τ**, **Φ**) yields the complete data likelihood

P\left(\mathbf{Y} \mid \mathbf{\tau}, \mathbf{\Phi}\right) = {\displaystyle \prod_{s=1}^{S} P\left({\mathbf{Y}}_{s} \mid \mathbf{\tau}, \mathbf{\Phi}\right)}.

(1)
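As an illustrative sketch of the site-wise likelihood product in Equation (1) and the pruning recursion behind it, the fragment below evaluates a toy three-taxon star tree. It deliberately substitutes the simpler Jukes-Cantor model for the HKY85/GTR matrices used in the paper, and the topology, branch lengths and two-site alignment are invented for illustration only.

```python
import numpy as np

def jc_transition(t):
    """Jukes-Cantor transition probability matrix for a branch of length t
    (expected substitutions per site)."""
    same = 0.25 + 0.75 * np.exp(-4.0 * t / 3.0)
    diff = 0.25 - 0.25 * np.exp(-4.0 * t / 3.0)
    P = np.full((4, 4), diff)
    np.fill_diagonal(P, same)
    return P

def site_likelihood(tip_states, branch_lengths):
    """Pruning on a three-taxon star tree rooted at the MRCA: tip partial
    likelihoods are propagated to the root and combined under a uniform
    stationary distribution."""
    root_partial = np.ones(4)
    for state, t in zip(tip_states, branch_lengths):
        tip_partial = np.zeros(4)
        tip_partial[state] = 1.0
        root_partial *= jc_transition(t) @ tip_partial
    return float(0.25 * root_partial.sum())

# Sites are conditionally i.i.d., so the alignment likelihood is the
# product over sites (Equation 1); we work in log space for stability.
alignment = [(0, 0, 1), (2, 2, 2)]      # e.g. A,A,C and G,G,G at two sites
branches = [0.1, 0.1, 0.3]
log_lik = sum(np.log(site_likelihood(s, branches)) for s in alignment)
```

Summing the per-site likelihood over all 4³ possible tip patterns returns 1, a useful sanity check on the recursion.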

### Branch-specific rate variation

We take the view that variation in the rate of molecular evolution is widespread [5, 6], but, following Yoder and Yang [11], we assume that in any given tree there exist only a small number of rate changes. This contrasts with most previous Bayesian MCMC relaxed clock models that favor many small or smoothly changing events [3, 7, 19, 20]. In general, these numerous small changes arise as a modeling consequence and are not necessarily data-driven. Apart from the induced smoothing, some structure remains quite useful; at certain time scales one expects rate changes to be heritable and to persist for some time down the subtree extending from the change-point.

#### Model parameterization

We introduce the RLC model that allows for sparse, possibly large-scale changes while maintaining spatial correlation along the tree. We start at the unobserved branch leading to the most recent common ancestor (MRCA) of the tree and define the composite rate *ρ*_{MRCA} = 1. Substitutions then occur on each branch *k* = 1, ..., 2*n −* 2 below the MRCA with normalized rate

\begin{array}{rl}{r}_{k}&=c(\mathit{\rho})\times {\rho}_{k}\\ &=c(\mathit{\rho})\times {\rho}_{\text{pa}(k)}\times {\varphi}_{k},\end{array}

(2)

where pa(*k*) refers to the parent branch above *k*, *ϕ* = (*ϕ*_{1}, ..., *ϕ*_{2n-2}) are branch-specific rate multipliers and *c*(·) is a normalization constraint that ensures that the *r*_{k} reflect the expected number of substitutions per unit time. This multiplicative structure on the composite *ρ*_{k} = *ρ*_{pa(k)} × *ϕ*_{k} builds up a hierarchy of rate multipliers descending towards the tree's tips.
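The multiplicative build-up of composite rates can be sketched as a walk down the tree from the MRCA; the `parent` map and multiplier values below are hypothetical.

```python
def composite_rates(parent, phi):
    """Build composite rates rho_k = rho_pa(k) * phi_k by descending from
    the MRCA (where rho = 1).  parent[k] is the branch above k, or None
    for branches attached directly to the root."""
    rho = {}
    def resolve(k):
        if k not in rho:
            pa = parent[k]
            rho[k] = (1.0 if pa is None else resolve(pa)) * phi[k]
        return rho[k]
    for k in parent:
        resolve(k)
    return rho

# A four-taxon example: branches 0 and 1 hang off branch 2; branches 2
# and 3 attach directly to the root.
parent = {0: 2, 1: 2, 2: None, 3: None}
phi = {0: 1.0, 1: 2.0, 2: 0.5, 3: 1.0}
rho = composite_rates(parent, phi)
```

A change-point on branch 2 (*ϕ*₂ = 0.5) is inherited by its descendants 0 and 1, illustrating how a single multiplier starts a local clock for an entire subtree.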

Allowing all elements in *ϕ* to vary independently leads to a completely non-clock-like model with, even worse, far too many free parameters to remain identifiable with the divergence times in *τ*. We avoid this problem by specifying a prior *P*(*ϕ*) on the rate multipliers. This prior specifies that only a random number *K* ∈ {0, ..., 2*n* − 2} of the *ϕ*_{k} *≠* 1, such that the corresponding *r*_{k} do not inherit their ancestors' rate of change but instead mark the start of a new local clock, where *a priori* we believe *K* is small. In effect, we place non-negligible prior probability on *K* = 0, the state in which one rate rules them all. Further, with most *r*_{k} = *r*_{pa(k)}, the prior binds absolute rates equal on branches incident to the same divergence points.

#### Bayesian stochastic search variable selection

To infer which branch-specific rates *r*_{k} do or do not inherit their ancestors' rate, we employ ideas from Bayesian stochastic search variable selection (BSSVS) [21]. BSSVS traditionally applies to model selection problems in a linear regression framework. In this framework, the statistician starts with a large number of potential predictors **X**_{1}, ..., **X**_{P} and asks which among these associate linearly with an *N*-dimensional outcome variable **Y**. For example, the full model becomes **Y** = [**X**_{1}, ..., **X**_{P}]*β* + *ϵ*, where *β* is a *P*-dimensional vector of regression coefficients and *ϵ* is an *N*-dimensional vector of normally distributed errors with mean **0**. When *β*_{p} for *p* = 1, ..., *P* differs significantly from 0, **X**_{p} helps predict **Y**; otherwise **X**_{p} contributes little additional information and warrants removal from the model by forcing *β*_{p} *=* 0. Given potentially high correlation between the predictors, deterministic model search strategies tend not to find the optimal set of predictors unless one explores all possible subsets. This exploration is generally computationally impractical, as there exist 2^{P} such subsets, and it completely fails for *P* > *N*.

Recent work in BSSVS [22, 23] efficiently performs the exploration in two steps. In the first step, the approach augments the model state-space with a set of *P* binary indicator variables *δ* = (*δ*_{1}, ..., *δ*_{P}) and imposes a prior *P*(*β*) on the regression coefficients that has expectation **0** and variance proportional to a *P* × *P* diagonal matrix with its entries equal to *δ*. If *δ*_{p} = 0, then the prior variance on *β*_{p} shrinks to 0 and enforces *β*_{p} = 0 in the posterior. In the second step, MCMC explores the joint space of (*δ*, *β*) simultaneously.
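A minimal numerical sketch of this indicator-scaled prior (with an invented indicator vector and unit slab standard deviation): scaling the prior standard deviation by *δ* pins excluded coefficients to exactly zero.

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_coefficients(delta, slab_sd=1.0):
    """Draw regression coefficients under the indicator-scaled prior:
    the prior variance is slab_sd^2 * diag(delta), so beta_p is pinned
    to exactly 0 whenever delta_p = 0."""
    delta = np.asarray(delta, dtype=float)
    return rng.normal(0.0, slab_sd, size=delta.shape) * delta

# Hypothetical indicator configuration over P = 5 predictors.
delta = np.array([1, 0, 0, 1, 0])
beta = draw_coefficients(delta)   # entries 1, 2 and 4 are exactly 0
```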

To map BSSVS into the setting of rate variation, let *δ*_{k} be the binary indicator that a local clock starts along branch *k*, such that *r*_{k} ≠ *r*_{pa(k)}. Conversely, when *δ*_{k} = 0, *r*_{k} = *r*_{pa(k)}, implying that *ϕ*_{k} = 1. So, the rate multipliers *ϕ* play an analogous role to the regression coefficients in BSSVS. An important difference is that *ϕ*_{k} ∈ [0, ∞) and shrinks to 1, while *β*_{k} ∈ (−∞, ∞) and shrinks to 0, mandating alternative prior formulations.

#### Prior specification

To specify a prior distribution over *δ* = (*δ*_{1}, ..., *δ*_{2n-2}), we assume that each indicator acts *a priori* as an independent Bernoulli random variable (RV) with small success probability *χ*. The sum of independent Bernoulli RVs yields a Binomial distribution over their sum K={\displaystyle {\sum}_{k=1}^{2n-2}{\delta}_{k}}. In the limit of small *χ*, with λ = *χ* × (2*n* − 2) held fixed, this prior conveniently collapses to

K \sim \text{Truncated-Poisson}(\lambda),

(3)

where λ is the prior expected number of rate changes along the tree *τ*. Choosing λ = log2, for example, sets 50% prior probability on the hypothesis of no rate changes.
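The claim about λ = log 2 is easy to check numerically; the helper below and the choice *n* = 10 are for illustration only.

```python
import math

def truncated_poisson_pmf(k, lam, k_max):
    """Poisson(lam) pmf truncated to the support {0, ..., k_max}."""
    weights = [lam**j * math.exp(-lam) / math.factorial(j)
               for j in range(k_max + 1)]
    return weights[k] / sum(weights)

# With lambda = log 2, the untruncated P(K = 0) is exactly
# exp(-log 2) = 1/2; truncation at 2n - 2 changes this negligibly.
n = 10                                   # taxa -> 2n - 2 = 18 branches
p0 = truncated_poisson_pmf(0, math.log(2.0), 2 * n - 2)
```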

Completing the RLC prior specification, we assume that all rate multipliers in *ϕ* are *a priori* independent and

{\mathit{\varphi}}_{k} \sim \text{Gamma}\left(1/(\psi {\delta}_{k}),\ 1/(\psi {\delta}_{k})\right).

(4)

When *δ*_{k} = 1, then *a priori*, *ϕ*_{k} has expectation 1 and variance *ψ*, following in the vein of [24]. However, in light of BSSVS, when *δ*_{k} = 0, the prior variance collapses to 0 and *ϕ*_{k} = 1.

#### Normalization

To translate between the expected number of substitutions *b*_{k} on branch *k* and real clock-time *t*_{k},

{b}_{k}=\mu \times {r}_{k}\times {t}_{k},

(5)

where *μ* is the overall substitution rate. The keen eye may observe that, over the entire tree *τ*, the parameterization in Equation (5) again leads to more degrees-of-freedom than are identifiable. We solve this difficulty through a further normalization constraint *c*(·) on *ρ*. Recall that we wish to measure *μ* in terms of expected substitutions per unit real time, such that

\mu ={\displaystyle \sum _{k=1}^{2n-2}{b}_{k}}/{\displaystyle \sum _{k=1}^{2n-2}{t}_{k}.}

(6)

To maintain this scaling, we sum Equation (5) over all branches and substitute the result into Equation (6). This eliminates the unknown *μ* and yields

\begin{array}{c}{\displaystyle \sum _{k=1}^{2n-2}{r}_{k}{t}_{k}}=c(\mathit{\rho}){\displaystyle \sum _{k=1}^{2n-2}{\rho}_{k}{t}_{k}}={\displaystyle \sum _{k=1}^{2n-2}{t}_{k}},\\ c(\mathit{\rho})={\displaystyle \sum _{k}{t}_{k}}/{\displaystyle \sum _{k}{\rho}_{k}{t}_{k}}.\end{array}

(7)
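The normalization in Equation (7) can be verified directly; the rate and time values below are arbitrary.

```python
def normalize_rates(rho, t):
    """Compute c(rho) = sum_k t_k / sum_k rho_k t_k (Equation 7) and the
    normalized rates r_k = c(rho) * rho_k."""
    c = sum(t) / sum(rk * tk for rk, tk in zip(rho, t))
    return c, [c * rk for rk in rho]

# Arbitrary composite rates and branch durations for four branches.
rho = [1.0, 0.5, 3.0, 1.0]
t = [1.0, 2.0, 1.0, 0.5]
c, r = normalize_rates(rho, t)
# By construction, sum_k r_k t_k = sum_k t_k, eliminating the extra
# degree of freedom so that mu stays in substitutions per unit time.
```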

### Posterior simulation

We take a Bayesian approach to data analysis and draw inference under the RLC model via MCMC. MCMC straightforwardly generates random draws with first-order dependence through the construction of a Markov chain that explores the posterior distribution. Via the Ergodic Theorem, simple tabulation of a chain realization {*θ*^{(1)}, ..., *θ*^{(L)}} can provide adequate empirical estimates. To generate a Markov chain using the Metropolis-Hastings algorithm [25, 26], one imagines starting at chain step *ℓ* in state *θ*^{(ℓ)} and randomly proposing a new state *θ** drawn from an arbitrary distribution with density *q*(·|*θ*^{(ℓ)}). This arbitrary distribution is commonly called a "transition kernel". Finally, the next chain step *ℓ* + 1 arrives in state

{\mathit{\theta}}^{(\ell +1)}=\{\begin{array}{ll}{\mathit{\theta}}^{\star}\hfill & \text{with probability } \mathrm{min}\left\{1,\frac{P({\mathit{\theta}}^{\star}|\mathbf{Y})}{P({\mathit{\theta}}^{(\ell )}|\mathbf{Y})}\times \frac{q({\mathit{\theta}}^{(\ell )}|{\mathit{\theta}}^{\star})}{q({\mathit{\theta}}^{\star}|{\mathit{\theta}}^{(\ell )})}\right\},\hfill \\ {\mathit{\theta}}^{(\ell )}\hfill & \text{otherwise}.\hfill \end{array}

(8)

The first term in the acceptance probability above is the ratio of posterior densities, and the term involving the transition kernel is the Hastings ratio. The beauty of the algorithm is that the posterior densities only appear as a ratio, so that the intractable normalizing constant cancels out.
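A generic sketch of one Metropolis-Hastings update in the sense of Equation (8), targeting a standard normal with a symmetric random-walk proposal rather than the phylogenetic kernels themselves; note that the normalizing constant of the target never appears.

```python
import math
import random

def mh_step(theta, log_post, propose):
    """One Metropolis-Hastings update: propose theta* from q(.|theta) and
    accept with probability min{1, posterior ratio * Hastings ratio}."""
    theta_star, log_hastings = propose(theta)
    log_alpha = log_post(theta_star) - log_post(theta) + log_hastings
    accept = random.random() < math.exp(min(0.0, log_alpha))
    return theta_star if accept else theta

# Toy target: standard normal, known only up to a constant.
log_post = lambda x: -0.5 * x * x
# Symmetric uniform random-walk proposal, so the log Hastings ratio is 0.
propose = lambda x: (x + random.uniform(-1.0, 1.0), 0.0)

random.seed(2)
chain = [0.0]
for _ in range(20000):
    chain.append(mh_step(chain[-1], log_post, propose))
```

Tabulating the realized chain recovers the target's moments, mirroring the Ergodic Theorem argument in the text.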

#### Transition kernels

We employ standard phylogenetic transition kernels via a Metropolis-within-Gibbs scheme, as implemented in BEAST [12], to travel through most dimensions in the RLC parameter space. What is unique to the RLC model are the transition kernels to explore the rate multipliers *ϕ* and all possible local clock indicator configurations *δ*. Since *ϕ*_{k} ∈ [0, ∞), we propose new rates *ϕ** component-wise, such that for a uniformly randomly selected *k* with *δ*_{k} = 1,

\begin{array}{c}{\varphi}_{k}^{\star}=U{\varphi}_{k},\\ U \sim \text{Uniform}\left({s}_{f},\frac{1}{{s}_{f}}\right),\end{array}

(9)

where 0 < *s*_{f} < 1 is a tuning constant and the Hastings ratio is 1/*U* [27].
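A sketch of the component-wise multiplier proposal in Equation (9); the tuning value *s*_{f} = 0.75 is arbitrary.

```python
import random

def scale_move(phi_k, s_f=0.75):
    """Propose phi* = U * phi with U ~ Uniform(s_f, 1/s_f), as in
    Equation (9); the associated Hastings ratio is 1/U."""
    u = random.uniform(s_f, 1.0 / s_f)
    return u * phi_k, 1.0 / u

random.seed(3)
phi_star, hastings = scale_move(1.0)
# Starting from phi = 1, the proposal lands in [s_f, 1/s_f] and the
# Hastings ratio is the reciprocal of the realized scale factor.
```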

Transition kernels on *δ* are more challenging. One natural way to construct a Markov chain on a bit-vector state space, such as *δ*, involves selecting one random element *δ*_{k} with uniform probability 1/(2*n* − 2) and swapping its state, {\delta}_{k}^{\star}=1-{\delta}_{k}, with probability 1 [14].

At first glance, the transition kernel density *q*(*δ**|*δ*) = *q*(*δ*|*δ**) = 1/(2*n* − 2) appears symmetric, leading to a Hastings ratio of 1. However, this view is flawed. One must recall that we introduced the indicators *δ* as a computational convenience. The number of different local clocks *K* overshadows *δ* as our parameter of interest, upon which we place our truncated-Poisson prior *P*(*K*). The correct densities to calculate then become *q*(*K**|*K*) and *q*(*K*|*K**). Suppose the swapping event above generates 0 *→* 1, so that *K** = *K* + 1. As *K* approaches 0, the transition kernel finds it more and more difficult to decrease *K*, because the kernel is more likely initially to choose a 0 state for swapping. From this perspective, the kernel is definitely not symmetric in the interchange of *K** and *K*. Assuming symmetry would lead to upwardly biased estimates for *K* < *n* − 1. The reverse bias occurs as *K* approaches 2*n* − 2 from below.

To determine *q*(*K**|*K*), we identify that our kernel chooses a *δ*_{k} = 0 with probability (2*n* − 2 − *K*)/(2*n* − 2) and a *δ*_{k} = 1 with probability *K*/(2*n* − 2). Therefore, if *K** = *K* + 1, *q*(*K**|*K*) is the former probability, and if *K** = *K* − 1, *q*(*K**|*K*) is the latter. Forming the Hastings ratio,

\frac{q(K|{K}^{\star})}{q({K}^{\star}|K)}=\{\begin{array}{ll}\frac{K+1}{2n-2-K}\hfill & \text{if } {K}^{\star}=K+1,\hfill \\ \frac{2n-2-K+1}{K}\hfill & \text{if } {K}^{\star}=K-1.\hfill \end{array}

(10)

This derivation provides an important lesson for those new to MCMC implementation: the Hastings ratio may vary depending on the model parameterization; it is, therefore, necessary to calculate the ratio as a function of the same parameterization as the prior.
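Equation (10) translates directly into code; the function name and the choice *n* = 10 are illustrative.

```python
def bitflip_hastings_ratio(K, K_star, n):
    """Hastings ratio q(K|K*)/q(K*|K) for the single-indicator swap over
    2n - 2 branches (Equation 10)."""
    B = 2 * n - 2
    if K_star == K + 1:
        return (K + 1) / (B - K)
    if K_star == K - 1:
        return (B - K + 1) / K
    raise ValueError("swap moves change K by exactly 1")

# For n = 10 taxa (18 branches): adding the first local clock is easy to
# propose but hard to reverse, so the ratio corrects sharply downward.
ratio_up = bitflip_hastings_ratio(0, 1, 10)     # (0+1)/(18-0) = 1/18
ratio_down = bitflip_hastings_ratio(1, 0, 10)   # (18-1+1)/1  = 18
```

The up and down ratios for adjacent states multiply to 1, exactly the cancellation that restores detailed balance for the asymmetric swap kernel.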

In cases where the swap event relaxes the prior variance on the rate multiplier *ϕ*_{k}, we simultaneously propose a new value {\mathit{\varphi}}_{k}^{\star}\ne 1. We draw this value from the prior given in Equation (4).

Proposals involving changes to the tree topology are based on existing tree proposal moves in the BEAST software framework with a small modification to track the augmented data at the nodes [see Additional file 1].

#### Model selection

Statistical inference divides into two intertwined approaches: parameter estimation and model selection. For the former, parameter inference relies on empirical estimates of *P*(*θ*|**Y**) that we tabulate from the MCMC draws. Model selection often represents a more formidable task. The natural selection criterion in a Bayesian framework is the Bayes factor [28–30]. The Bayes factor *B*_{10} in favor of ℳ_{1} over ℳ_{0} is the ratio of the marginal likelihoods of ℳ_{1} and ℳ_{0},

{B}_{10}=\frac{P(\mathbf{Y}|{\mathcal{M}}_{1})}{P(\mathbf{Y}|{\mathcal{M}}_{0})}=\frac{P({\mathcal{M}}_{1}|\mathbf{Y})}{P({\mathcal{M}}_{0}|\mathbf{Y})}/\frac{P({\mathcal{M}}_{1})}{P({\mathcal{M}}_{0})},

(11)

and informs the phylogeneticist how she (he) should change her (his) prior belief *P*(ℳ_{1})/*P*(ℳ_{0}) about the competing models in the face of the observed data. Because they involve the evaluation of two different normalizing constants, Bayes factors are often challenging to estimate.

By fortuitous construction, we side-step this computational limitation when estimating the Bayes factor in favor of a global clock (GC) model ℳ_{GC} over the RLC model ℳ_{RLC}. Model ℳ_{GC} occurs when *K* = 0 and is conveniently nested within model ℳ_{RLC}. Consequently, *P*(*K* = 0|ℳ_{RLC}) equals the prior probability of ℳ_{GC}, and *P*(*K* = 0|**Y**,ℳ_{RLC}) yields *P*(ℳ_{GC}|**Y**). Given this, a Bayes factor test of ℳ_{GC} only requires simulation under the RLC model. The Bayes factor in favor of a global clock is then

{B}_{\text{GC}}=\frac{P(K=0|\mathbf{Y},{\mathcal{M}}_{\text{RLC}})}{1-P(K=0|\mathbf{Y},{\mathcal{M}}_{\text{RLC}})}{\left(\frac{P(K=0|{\mathcal{M}}_{\text{RLC}})}{1-P(K=0|{\mathcal{M}}_{\text{RLC}})}\right)}^{-1}.

(12)

To calculate the ratio of marginal likelihoods, we need only an estimator \hat{P} of *P*(*K* = 0|**Y**, ℳ_{RLC}). The Ergodic Theorem suggests that we let

\hat{P}=\frac{1}{L}{\displaystyle \sum _{\ell =1}^{L}1\left\{{K}^{(\ell )}=0\right\}},

(13)

where 1{·} is the indicator function. Occasionally \hat{P} becomes a poor estimator, when *P*(*K* = 0|**Y**,ℳ_{RLC}) decreases below ϵ or increases above 1 − ϵ for ϵ ≈ 1/*L*. In such situations, there are alternatives that depend on MCMC chains generated under several different prior probabilities *P*(*K* = 0|ℳ_{RLC}) [31]. The Bayes factor then provides the mechanism to combine results from the multiple chains and to rescale back to a believable prior.
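Equations (12) and (13) combine into a one-pass estimate from posterior draws of *K*; the toy chain below is fabricated to make the arithmetic transparent.

```python
def bayes_factor_gc(k_draws, prior_p0):
    """Estimate B_GC (Equation 12) from posterior draws of K: the
    posterior odds of K = 0, divided by the prior odds of K = 0."""
    p_hat = sum(1 for k in k_draws if k == 0) / len(k_draws)  # Eq. (13)
    posterior_odds = p_hat / (1.0 - p_hat)
    prior_odds = prior_p0 / (1.0 - prior_p0)
    return posterior_odds / prior_odds

# Toy chain: 75% of draws sit at K = 0 against a 50% prior (lambda = log 2),
# giving posterior odds 3 against prior odds 1.
draws = [0] * 75 + [1] * 20 + [2] * 5
bf = bayes_factor_gc(draws, prior_p0=0.5)
```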