A graph-theoretic view of gene regulation
We offer in this section a non-technical account of the linear framework as applied to gene regulation. The technical details are provided, along with references, in the section on ‘Calculating microstate probabilities at steady state’.
The framework starts with a labelled, directed graph consisting of a collection of vertices with directed edges between pairs of vertices and labels on the edges (Figure 1, bottom). The graphs considered here have only finitely many vertices and the edges always go between distinct vertices, so that there are no self-loops. It is further assumed that each graph is connected, which means that, given any two vertices, there is always a path of edges between them, ignoring edge directions. A connected graph is not in disjoint pieces.
The vertices of the graph correspond to microstates, or snapshots of DNA and its accompanying proteins. Figure 1 (top) shows the range of features that can potentially be found in a microstate, including TFs, transcriptional co-regulators, RNA polymerase, nucleosomes, chromatin remodelling enzymes, DNA looping, various forms of post-translational modification and DNA methylation. The directed edges correspond to transitions between microstates arising from biochemical reactions taking place on chromatin, such as the binding and unbinding of TFs or co-regulators or post-translational modification or demodification of proteins bound to DNA. Directed graphs of this kind are often found in the literature as qualitative summaries of the behaviour of regulatory mechanisms. Such cartoons can be given a rigorous mathematical basis through the methods introduced here.
The labels on the edges supply quantitative information in the form of effective rate constants for the corresponding transitions. Each label has units of inverse time, as in per second. The rate of some transitions, such as binding events, can depend on the concentration of components in solution around DNA. The labels can therefore be compound expressions involving component concentrations as well as kinetic parameters. In this way biochemical non-linearity is accommodated in the labels. An important feature of the framework is that the numerical values of the parameters do not have to be known in advance. They can be treated as symbols and many properties of the system can be calculated in symbolic form. This permits analysis without having to measure or estimate the actual values of the parameters.
The level of granularity used for the microstates, and the corresponding transitions, is a matter of choice. It can range from coarse-grained descriptions of open and closed chromatin to fine-grained descriptions of DNA sequence, individual nucleosomes and specific histone modifications. The choice depends on the context, the available experimental methods and data and the biological questions being asked. The graph constitutes a mathematical model of the system being studied and is best thought of not as a description of reality but as a precise statement of the assumptions being made about that reality – a hypothesis – from which rigorous deductions can be made and experiments proposed [43].
Because there is only one molecule of DNA, the dynamical behaviour of microstates has to be understood in terms of probabilities. If we imagine watching DNA over time, the microstates will fluctuate as transitions take place due to random molecular events, such as binding or unbinding of components. Let us denote the probability of the system being in microstate i at time t by u
i
(t). The following thought experiment may help to interpret this quantity. Imagine a large number of copies of the system being created in the identical starting condition at time 0, with the same initial microstate and the same protein components present in the surrounding solution at the same concentrations. As time progresses, the randomness of molecular events will cause the different copies of the system to diverge so that different microstates will be found in each system copy. The proportion of copies in which microstate i is found at time t is an approximation for u
i
(t) and this approximation becomes more accurate as the number of copies is increased. In other words, u
i
(t) measures how often microstate i will be found at time t, were it possible to repeatedly replay the system from its initial condition at time 0.
Probabilities can appear difficult to reason with but the graph-based framework offers a different way to think about them which may be more familiar. The vertices of the graph are regarded as chemical species with concentrations, the edges as chemical reactions and the labels as rate constants. Each reaction has only a single substrate and only a single product, like an isomerisation, so the graph describes a kind of one-dimensional chemistry. This macroscopic interpretation allows us to reason about concentrations and reactions but gives the same results as the microscopic interpretation in terms of probabilities and transitions. In other words, if we imagine placing concentrations of matter at each vertex and allowing the chemistry to work, then the change in concentrations over time is identical to the change in probabilities over time. The only thing we have to remember is that probabilities add up to 1 – the system must be in some microstate – so that the total concentration of matter at all vertices should be kept at 1. Because the reactions only move matter between vertices, and neither create nor destroy it, the total concentration remains the same over time (see Equation 2 below), so we only need to make it 1 to begin with.
It is easy to imagine that, no matter what initial concentrations of matter are distributed over the vertices, the one-dimensional chemistry will eventually reach a steady state, in which production and consumption of each species are in balance and the concentration of each species is unchanging. Such a steady state occurs no matter what the structure of the graph. In a general graph, the steady state can depend on the initial concentrations that were chosen at time 0, so that there is a memory of these initial conditions (see the section ‘Formation of an inherently bounded chromatin domain’). However, if the graph is strongly connected, such memory is lost and the steady state becomes independent of the initial conditions and depends only on the structure of the graph. A strongly connected graph is one in which any pair of vertices are connected, both ways, by a path of consecutive edges that all point in the same direction (Figure 2A). In effect, any two vertices can communicate with each other in both directions. Strong connectivity depends only on the edges and not on the labels.
A strongly connected graph can be arbitrarily large and complicated but its one-dimensional chemistry is particularly simple. The steady-state concentration of each species can be calculated in terms of the edge labels using certain sub-graphs called spanning trees (see Equation 7 below). Among other things, this shows that each microstate in a strongly connected graph has positive probability at steady state: if such a system is watched over time, each microstate will appear at steady state, even if that microstate had zero probability in the initial condition.
A general graph, which is not strongly connected, breaks up naturally into maximal strongly connected sub-graphs, or strongly connected components (SCCs) (Figure 2B). Once matter has left a SCC under one-dimensional chemistry, it can never return to it, for otherwise the SCC would not be maximal. Hence, matter eventually accumulates on those SCCs from which there is no escape, which are the terminal SCCs. If a microstate is not in a terminal SCC, its steady-state probability is zero: if the system is watched over time, such microstates never appear in the steady state, even if they had positive probability in the initial condition. For the microstates that do lie in terminal SCCs, their steady-state probability may or may not be zero depending on the initial conditions. For instance, if matter is only placed on the vertices of one terminal SCC, it will remain there forever and cannot escape into any other SCC, whose vertices will have zero probability at all times.
A system that reaches thermodynamic equilibrium always has a strongly connected graph. The property of detailed balance, which must always hold at equilibrium, requires that each edge in the graph has a corresponding reverse edge, so that strong connectivity is guaranteed. If the labels on a pair of reversible edges are a and b, then the ratio a/b is a thermodynamic quantity that depends only on the free energy difference between the two microstates (see Equation 6 below). The steady-state probabilities depend only on these thermodynamic ratios and can be calculated as products of the ratios along paths in the graph, without the need for any spanning trees (see Equation 5 below). This gives the same result as equilibrium statistical mechanics. In this way, the framework provides a generalisation of equilibrium statistical mechanics for gene-regulation systems that are far from equilibrium.
Constructing graphs to describe gene regulation
Linear framework graphs are constructed from labelled edges, which arise from two kinds of transitions, as listed below. The main restrictive assumptions concern the interplay between mechanisms taking place in solution around chromatin and those taking place on chromatin itself. The basic approach is to assume that these can be uncoupled from each other. More relaxed assumptions can be made, using the methods of [35], but at the expense of considerably increased complexity.
Binding transitions
These represent the binding of a component L to a microstate (Figure 3A). The label is a=k[ L], where k is an on-rate and [ L] is the free concentration of L. We follow the thermodynamic formalism and assume, first, that components are neither synthesised nor degraded over the timescale of interest so that their total amounts are conserved quantities and, second, that the depletion of L can be ignored, so that the binding of a single molecule of L does not appreciably change its free concentration, [ L]. In other words, [ L]≈L tot. Non-specific binding to DNA can significantly reduce the free concentration and if this is thought to jeopardise the no-depletion assumption, a more elaborate analysis is needed [36],[44].
Components can also engage in interactions such as oligomerisation. We again follow the thermodynamic formalism and assume that such reactions are fast compared to binding reactions on DNA, so that they have reached a rapid equilibrium. The label on the edge has the form a=k[ X], were k is an appropriate on-rate and X is the component form that binds to DNA (Figure 3B). [ X] can be calculated in terms of the concentrations of the underlying components using the rapid equilibrium assumption (Methods).
Non-binding transitions
These are transitions in which the edge label does not contain a concentration term. They can arise from several different types of biochemical process:
-
unbinding reactions, in which a component that had previously bound to form the source microstate unbinds, with the off-rate as the label (Figure 3C);
-
allosteric change, in which the conformational state of DNA, or of a component or complex in the microstate, is altered (Figure 3D);
-
three-dimensional chromatin conformation change, such as DNA looping, in which separate parts of a microstate, such as a distal enhancer and a proximal promoter, bind or unbind from each other (Figure 3E), with the respective rate constants as the labels;
-
nucleosome assembly or disassembly, with the nucleosomes treated, for example, as individual entities (Figure 3F), so that the labels are the aggregated overall rates of the assembly or disassembly pathway;
-
enzymatic activity, in which an enzyme, which is assumed to be already bound in the source microstate, undertakes a biochemical reaction that alters the microstate, such as post-translational modification or demodification of a histone, a co-regulator or a transcription factor (Figure 3G, H), or methylation or demethylation of DNA (Figure 3I, demethylation is not shown), with the enzyme catalytic rate as the label;
-
RNA polymerase activity, including transcription initiation, open complex formation, promoter clearance, elongation, pausing, etc.; Figure 3J shows elongation as a single step following initiation but this can be broken down to a finer granularity as required.
Numerical values for the parameters appearing in the labels can sometimes be estimated from experimental data [10],[12],[45]. One of the advantages of the framework is that calculations can be undertaken with symbolic parameters, without having to know numerical values in advance.
Calculating microstate probabilities at steady state
The mathematical details of the linear framework were developed in previous work [35]-[37], as reviewed in [38]. As this may not be familiar, and to keep this paper as self-contained as possible, the material is summarised here. Proofs of most of the assertions can be found in [37]. A graph of the kind constructed above, as in Figure 1, gives rise to a linear differential equation that describes how the probabilities of each microstate change in time. We first explain how this differential equation arises and then show how microstate probabilities can be calculated at steady state. The key formulas for the microstate probabilities are Equation 5 at equilibrium and Equation 7 away from equilibrium. We have italicised mathematical concepts that may be unfamiliar and have provided a glossary to explain these in the Methods.
Laplacian dynamics
Suppose we are given a graph G, as in Figure 4A, with vertices indexed 1,…,n. We typically use the index 1 for the reference microstate with no TFs bound and choose the order of the other microstates arbitrarily. The notation signifies the edge with label a from source vertex i to target vertex j. A dynamics can be imposed on G in two equivalent ways. In the macroscopic interpretation, the vertices are chemical species and the edges are chemical reactions, which convert source species to target species. The edge labels are rate constants for the corresponding reactions, assuming mass-action kinetics. Since each reaction is uni-molecular, with only one substrate and one product, this one-dimensional chemistry yields a linear dynamics (Figure 4A),
where x(t) is a column vector of species concentrations and is an n×n matrix whose entries are labels, which is called the Laplacian matrix of G.
Since the dynamics inter-converts between species and neither creates matter nor destroys it, the total concentration does not change over time. The dynamics therefore satisfies the conservation law
(2)
This corresponds to the columns of the Laplacian matrix adding up to 0 (Figure 4A), so that , where 1 signifies the all-ones column vector and t denotes the transpose operation, which turns a column vector into a row vector.
In the microscopic interpretation, the vertices are microstates, the edges are transitions between microstates and the labels are infinitesimal transition rates for the corresponding edges. This means that, if and Δ t is a time interval sufficiently small so that a Δ t<1, then the probability of taking the transition from state i to state j is approximately a Δ t and the approximation gets better as Δ t gets smaller (see Equation 15 in the glossary). This interpretation defines a continuous time, finite state Markov process. A Markov process gives rise to a master equation that describes how the microstate probabilities change over time. This master equation is identical to Equation 1, so that
where u
i
(t) is the probability of occurrence of microstate i at time t. The only difference with the macroscopic interpretation is that probabilities must always add up to 1, so that u tot=1 in Equation 2. Matrices of Laplacian type often arise when master equations are used but the underlying graph, from which the Laplacian can always be derived, has not been exploited as we do here.
Steady states
In the macroscopic interpretation, no matter what graph and what initial condition are chosen, the dynamics always reaches a steady state, x ∗, in which production and consumption of each species is exactly balanced, so that, d x ∗/d t=0. By Equation 1, x ∗ is in the kernel of the Laplacian matrix: .
A particularly important case arises when G is strongly connected (Figures 2A and 4B) because the kernel of the Laplacian is one dimensional:
In other words, there is a unique steady state, up to a scalar multiple. Given a basis vector for the kernel, , it then follows from Equations 2 and 3 that the steady-state probabilities are obtained by normalising the entries of ρ G to its total amount, , so that
Such a basis vector ρ G can be constructed in one of two ways, described next.
At thermodynamic equilibrium
If the graph represents a system that can reach thermodynamic equilibrium, then detailed balance must be satisfied [36]. This requires two conditions to hold. First, the graph must be reversible: if the graph has an edge , then it must also have a reverse edge, , corresponding to the same underlying biochemical reaction working in reverse. Note that reversible edges imply that the graph is strongly connected. Second, in any steady state, x ∗, any such pair of reversible edges must be independently at equilibrium, with the forward flux in balance with the reverse flux, irrespective of any other edges involving i and j. Setting the two fluxes to be in balance, it follows that .
To determine , choose any path of reversible edges from vertex 1 to vertex j,
and let to be the corresponding product of label ratios,
(5)
It follows from detailed balance that , so that x ∗=λ ρ G where . Hence, ρ G provides the required basis vector of , from which probabilities can be calculated using Equation 4. For this procedure to be consistent, must be independent of the chosen path from 1 to j. This is ensured by the cycle condition, which is a necessary consequence of detailed balance [36]. It is an important feature of being at thermodynamic equilibrium that history does not matter: any path to a microstate can be used to determine its equilibrium probability.
Equation 5 is equivalent to the thermodynamic formalism through van’t Hoff’s formula. If and , then, at thermodynamic equilibrium,
(6)
where Δ G is the free-energy difference between microstates j and i, R is the molar Boltzmann constant and T is the absolute temperature. The product of label ratios in Equation 5 is transformed, through the exponential function in Equation 6, into a sum of free energies, which determines the free energy of microstate j relative to that of the reference microstate 1. The denominator in Equation 4 is then the partition function of equilibrium statistical mechanics.
Thermodynamic equilibrium requires detailed balance but a graph can satisfy detailed balance without being at equilibrium. For instance, certain graph structures in which each edge is reversible, such as a sequence structure (Figure 5A) or, more generally, a tree structure (Figure 5B), always satisfy detailed balance (Methods). In such a graph the edges may involve dissipative mechanisms. However, although an edge is accompanied by a reverse edge , these edges may not arise from an underlying biochemical reaction operating reversibly but from two separate dissipative reactions, such as phosphorylation and dephosphorylation, each acting irreversibly. The ratio a/b would no longer have a thermodynamic interpretation in terms of a free energy difference, as in Equation 6.
Away from equilibrium
If the graph represents a system that is maintained away from thermodynamic equilibrium, then detailed balance may no longer hold. The graph may have irreversible edges and Equation 5 no longer works. If the graph is strongly connected, a basis vector of can be calculated by the matrix-tree theorem, a proof of which is given in the Appendix to [37]. This leads to the following procedure. Let Θ
j
(G) be the set of spanning trees of G that are rooted at microstate j. Informally, a tree is a sub-graph with no cycles, it is spanning if it reaches every vertex and it is rooted at vertex i if i has no outgoing edges in the tree. Figure 4B gives examples of rooted spanning trees. It is not difficult to see that a graph is strongly connected if, and only if, it has a spanning tree rooted at each vertex and that a spanning tree always has one less edge than the number of vertices in G.
For a strongly connected graph, may be calculated by multiplying together the labels on the edges of each spanning tree rooted at j and adding up these products over all such spanning trees:
(7)
Because a strongly connected graph has at least one spanning tree rooted at each vertex, each entry in the basis vector is positive, so that for each j. Hence, by Equation 4, each microstate has positive steady-state probability. The denominator in Equation 4 provides a non-equilibrium partition function.
Non-strongly connected graphs
Graphs arising in gene regulation may not always be strongly connected (see the section ‘Formation of an inherently bounded chromatin domain’ and Figure 6C). Steady-state probabilities for non-strongly connected graphs can be calculated by considering the SCCs of G (Figures 2B and 4C). The SCCs inherit connections from the underlying graph but these connections can never form a cycle, for otherwise the SCCs would collapse into each other. It is therefore possible to identify terminal SCCs, from which there are no outgoing connections. The terminal SCCs yield steady states in the following way.
Let T 1,…,T
t
denote the terminal SCCs. Each T
k
is by definition strongly connected, so that it has a basis vector , as given by Equation 7. We can now construct the vector ρ G,k that agrees with on those microstates that lie in T
k
and which is zero on all other microstates (Figure 4C). The vectors ρ G,k provide a basis for the kernel of the Laplacian of G:
(8)
The dimension of the kernel is then t, the number of terminal SCCs. Note that, if i is any microstate that is not in a terminal SCC, then for each basis vector ρ G,k.
The t basis vectors in are matched by t conservation laws. In contrast to Equation 2, which is the only conservation law when t=1, the additional conservation laws for t>1 depend on the structure of the graph. These additional laws can be algorithmically calculated from .
Any steady state x ∗ can be expressed as a linear combination of the basis vectors in Equation 8. If these vectors are normalised to their respective totals, then, in the resulting expression for x ∗,
(9)
the coefficients z 1,…,z
t
are the values taken by the t conservation laws.
Calculating gene expression
In the thermodynamic formalism, a rate of gene expression, g
i
, is assumed for each microstate i and the overall rate is taken to be proportional to the average over the steady-state microstate probabilities . This average is given by
(10)
The same procedure is used for the examples studied here but the linear framework can accommodate the irreversible dynamics of mRNA polymerase (initiation, open complex formation, promoter escape, elongation, pausing, etc.) [17],[49],[50], as shown in Figure 3J. The dynamics of mRNAs and proteins can also be coupled to gene regulation within a graph-theoretic formalism [41]. However, this leads to infinite graphs because the number of mRNA or protein molecules may be unlimited.
Having summarised the linear framework and shown how it generalises the thermodynamic formalism to non-equilibrium contexts, we now discuss three applications that demonstrate the framework’s scope.
Regulation of steroid-hormone responsive genes
Ong et al. have put forward a theoretical framework for gene induction [46], motivated by studies of steroid-hormone receptors [51]. They use ad hoc methods, which are independent of previous work on gene regulation. We show here how their analysis can be generalised and simplified within the linear framework.
Recent work on steroid-hormone sensitive genes has revealed new co-regulators, such as the Ubiquitin conjugating enzyme, Ubc9, indicating the existence of multiple steps in addition to hormone-receptor binding to DNA [46]. Despite this additional complexity, gene-regulation functions [16], which describe how rates of gene expression depend on hormone concentration, are well fitted to Michaelis–Menten style functions, or first-order Hill dose–response curves (FHDCs) in the language of Ong et al., who use their theoretical framework to derive conditions under which such FHDCs arise.
They consider a sequence of reversible reactions (Figure 5A), representing the behaviour of the promoter of a hormone-sensitive gene. Such a sequence graph always satisfies detailed balance (Methods). We consider the more general case of an arbitrary graph G of reversible edges that satisfies detailed balance. This might be, for instance, a tree graph (Figure 5B), which also always satisfies detailed balance (Methods). If a general graph satisfies detailed balance it may not necessarily reach thermodynamic equilibrium and the edges of G may involve dissipative mechanisms.
We assume that components R,U,Y 1,…,Y
m
are present and they can bind and unbind to form the microstates of G. Y 1,…,Y
m
are background components that can engage in protein–protein interactions among themselves, so that their concentrations can appear in labels of the form , where Φ is some function, as in Figure 3B. The no-depletion assumption allows free concentrations to be replaced by total concentrations, [ Y
i
]≈ Y i,tot, so that the labels in which Y 1,…,Y
m
occur are functions of rate constants and total amounts, or “constants”. R and U are titratable components, which, crucially, are assumed to bind at most once in each microstate. U corresponds to a co-regulator like Ubc9, which does not engage in protein–protein interactions, so that the corresponding label has the form k ′[ U] (Figure 3A). R corresponds to the steroid-hormone receptor, to which the steroid hormone S binds to form a complex RS, which then binds DNA (Figure 3B with S=L and R=M). The label on the corresponding edge has the form k ″[ R S] where
which is a FHDC as a function of [ S].
The main result is that, provided gene expression only occurs from microstates in which both R and U are bound, the average rate of gene expression, g([ S]), as given by Equation 10, is also a FHDC (Additional file 1A),
(11)
The constants M
G
and K
G
have clear interpretations in terms of G. M
G
is (evidently) the average rate of gene expression at saturation (i.e., when [ R S]=R tot). Less obviously, K
G
is K
R
multiplied by the saturation probability of those microstates in which R is not bound. Additional file 1A gives the details of the proof and shows how the formulas in Ong et al. emerge from Equation 11. It also discusses how Ong et al. show, for the special case of a sequence, that g([ S]) remains a FHDC even if the no-depletion assumption is dropped at a concentration limiting step. Ong et al. also address other issues, such as inhibitory reactions, which are not discussed here.
The framework introduced here generalises and clarifies the work of Ong et al., showing how formulas like Equation 11 can be rigorously proved irrespective of the complexity of the underlying graph. The interpretation of the parameters in Equation 11 is new but emerges easily from our analysis (Additional file 1A). However, because detailed balance is assumed, the consequences of being away from equilibrium remain hidden, as we will see subsequently.
Formation of an inherently bounded chromatin domain
Our next application is to a model of chromatin organisation, with no explicit gene regulation. Hathaway et al. recently showed how a bounded chromatin domain could be nucleated in vivo and stably inherited as a form of epigenetic memory [47]. To explain the dynamics of such domains, they developed a mathematical model based on a linear array of 257 nucleosomes [47],[48]. This model is readily translated into our framework. We considered nucleosome arrays with varying numbers of sites n. We placed the nucleation site at the right-hand end of our array (Figure 6A). This is essentially similar to the left-hand half of the array of 2n−1 nucleosomes (for n=129) considered by Hathaway et al. The microstates correspond to array marking patterns, of which there are 2n, while the edges correspond to mark nucleation, propagation and turnover (Figure 6A,B). Propagation and turnover were assumed uniform at all nucleosomes, at rates k+ and k_, respectively. However, nucleation was limited to the nucleation site at rate k+, so that some edges are not reversible. This irreversibility reflects the dissipative mechanism of histone marking and the non-equilibrium nature of the model. The graph does not satisfy detailed balance but is strongly connected.
Hathaway et al. used a Monte Carlo simulation to generate stochastically a succession of microstates, from which steady-state probabilities were estimated as the frequencies with which microstates appear. They found that, if k+/k_≤1.5, marking persisted in a stochastically fluctuating but inherently bounded domain near the nucleation site, reflecting what was found experimentally.
Monte Carlo simulation is an efficient method for studying very large graphs: an array of 257 nucleosome has a graph with approximately 1077 microstates. However, the linear framework provides mathematical access to the steady-state probabilities for any array size and this yields insights that are not easily found by simulation. For instance, the ratio k+/k_ appears as a convenience in the simulations [48]. However, for a nucleosome array of n sites, the spanning trees in the corresponding graph (Figure 6A) have 2n−1 edges, each of which is labelled k+ or k_. Dividing Equation 7 by , it is evident that the steady-state probabilities in Equation 4 depend only on the ratio k+/k_ and not on the individual rates. The importance of the ratio becomes readily apparent within our framework.
More significantly, Hathaway et al. proposed a modification to their model to explain the inherited stability of the domain after the nucleating stimulus was removed. They imposed a stabilisation of the nucleosome mark through a transition to a hypothetical new marked state, whose turnover was inhibited (Figure 6C, left). Each nucleosome can now be in one of three states and the graph has 3n microstates (Figure 6C, right, for n=2). Because turnover is prevented by the stabilised mark, the graph is no longer strongly connected. If nucleation is stopped, as was done in the simulation, then the resulting graph has two terminal SCCs, each consisting of a single extreme microstate, one in which the entire nucleosome array is unmarked and the other in which the entire array is stably marked. According to Equation 9, all other microstates have zero steady-state probability.
Which of the two extreme microstates is reached in a simulated trajectory depends on the microstate in which nucleation is stopped. If some nucleosome has become stably marked in that microstate, then it cannot become unmarked, so the trajectory can only reach the completely stably marked microstate. This is likely to happen once the inherently bounded domain is established, unless the stabilisation rate, k ∗, is so low that no stable mark has appeared. In their simulation, Hathaway et al. chose k ∗ to be low compared to propagation and turnover but not so low that stable marks had not appeared by the time nucleation was stopped. They concluded that the inherently bounded domain was stably maintained in the absence of the initial nucleating stimulus. Our analysis shows that this conclusion is incorrect. Once nucleation is stopped, the bounded domain becomes a transient phenomenon, which eventually expands to fill the whole array. It is conceivable that a bound on the domain size is maintained for sufficiently long to still be biologically relevant. But this places the stabilising rate k ∗ in a double bind: it must be sufficiently high so as to stabilise the domain, yet sufficiently low so as not to destroy its boundedness too quickly. Such fine-tuning of rate constants is inherently fragile and we think it is more likely that other mechanisms are at work to ensure stable inheritance of the inherently bounded domain.
Our framework allows these conclusions to be reached by elementary mathematical deductions, without the need for the numerical simulations undertaken by Hathaway et al.
Regulation of yeast PHO5
We now turn back to gene regulation and to one of the very few models in which a non-equilibrium mechanism has been rigorously analysed without assuming detailed balance. Pho5 is an acid phosphatase in Saccharomyces cerevisiae that is expressed under phosphate-starvation conditions. Kim and O’Shea undertook a quantitative analysis of PHO5 regulation by the transcription factor Pho4, using a construct detached from the phosphate-response pathway [52] (Figure 7A).
To calculate the PHO5 gene-regulation function, Kim and O’Shea constructed a stochastic master equation based on a graph of transitions between DNA states. They pointed out that the nucleosomal transitions were dissipative and in some cases irreversible under their assumptions, so that detailed balance could not be assumed. Accordingly, they determined steady-state probabilities using the Symbolic Math Toolbox in MATLAB.
Kim and O’Shea’s graph of transitions is readily translated into our linear framework (Figure 7B). They assumed that the binding of Pho4 saturates according to a Hill function, which can be accommodated in a similar way to Figure 3B. The non-binding reactions correspond to unbinding of Pho4 (Figure 3C), or to nucleosomal assembly or disassembly (Figure 3F). The graph is strongly connected, a point not mentioned by Kim and O’Shea, but as noted above for Equation 7, this ensures that the steadystate probability of each microstate is positive. They assumed that PHO5 is transcribed when there is no nucleosome occluding the TATA box, so that, in the average in Equation 10, g
i
=1 for the microstates 2, 3, 7, 8, 9 and 12 on the right in Figure 7B and g
i
=0 for those on the left. We used our own software written in the programming language Python to enumerate the spanning trees by a fast algorithm and then used the polynomial algebra capabilities of Mathematica to calculate the microstate probabilities and the gene-regulation function (Methods). This gave an identical result to Kim and O’Shea’s MATLAB calculation (H Kim, personal communication, January 2013). This strongly suggests that what can be done for the yeast PHO5 gene can be systematically undertaken for other genes with non-equilibrium features, with the solution now being understood explicitly through Equation 7, without recourse to MATLAB.
Having calculated the gene-regulation function using our framework, we sought to compare it to the experimental data acquired by Kim and O’Shea [52]. They used their synthetic construct (Figure 7A, with details in the caption) to measure the PHO5 gene-regulation function. In response to doxycycline, individual cells expressed Pho4-YFP, which was treated as the input to the gene-regulation function, and this induced the expression of CFP from the Pho4-responsive promoter in the construct. CFP was treated as the output as a proxy for Pho5. By using different doses of doxycycline to cover a range of Pho4-YFP expression levels, the gene-regulation function was assembled from single-cell measurements. Kim and O’Shea also measured the gene-regulation function of five other variant promoters, in which the low-affinity and high-affinity sites for Pho4 binding were either exchanged or removed.
Kim and O’Shea estimated the threshold and the maximum expression level of each variant by fitting their experimental data to a Hill function, whose Hill coefficient was found to be nearly 2 for all variants. They then fitted the estimated threshold and maximum values to the calculated gene-regulation function for each variant and found good agreement ([52], Figure 5). We were curious as to how well the gene-regulation function itself would fit the data. This is a more challenging question because the data are noisy and the gene-regulation function is very complicated (see below). To address this, we first smoothed the data. We then used numerical optimisation to find excellent quantitative fits to each variant individually (Figure 8, red curves) but could only undertake a manual fit to all variants collectively, which yielded the parameter values in Equation 16 (Methods). The collective fit was considerably poorer (Figure 8, black curves). While this broadly confirms Kim and O’Shea’s more coarse-grained analysis, it also suggests that the individual variants may exhibit more nuanced behaviours, which are better described by distinct parameter values.
History-dependent complexity away from equilibrium
Our analysis revealed further unexpected features of the PHO5 gene-regulation function. By Equation 7, each is a sum of distinct product terms (monomials) in the five edge labels (Figure 7B), of the form
Here, α is a positive integer, which records the number of spanning trees having that product of labels, and i 1,…,i 5 are non-negative integers. Because the graph has 12 microstates, each spanning tree has 11 edges, so that the total degree of each monomial is 11: i 1+i 2+i 3+i 4+i 5=11. By examination of the calculated formulas, the maximal degree of , in which the concentration of Pho4 appears, is 8. Considering only those monomials with this highest-order term, , the gene-regulation function looks like
(12)
The simplicity of these highest-order terms is deceptive, however. The numerator of Equation 12 has 261 distinct monomials while the denominator has 500 distinct monomials. Indeed, the graph in Figure 7B has 53,376 spanning trees in total. We see that the calculated PHO5 gene-regulation function is very complicated – the full details shown in Additional file 1C cover six pages – despite the model having only two binding sites and two nucleosomes. Because Kim and O’Shea did not provide the gene-regulation function in their original paper, these features are revealed here for the first time.
The linear framework allows us to understand this surprising explosion in complexity. At equilibrium, Equation 5 shows that any single path to a microstate can be used to calculate its steady-state probability. As a physicist would say, free energy at equilibrium is a function of the microstate, not of the route through which that microstate is reached. In marked contrast, away from equilibrium, Equation 7 shows that every spanning tree rooted at that microstate is required. In this case, all routes to the microstate become relevant and microstate probabilities depend in a more intricate way on the structure of the graph. Equation 7 takes care of the bookkeeping. The number of spanning trees increases very rapidly with the size of a graph: the complete undirected graph on n vertices (i.e., the graph in which there is an undirected edge between each pair of distinct vertices) has n n−2 spanning trees in total. This worse than exponential increase manifests itself in the complexity of the PHO5 gene-regulation function.
It is important to appreciate, however, that it is not the complexity or the size of a graph that is the dominant factor in explaining the complexity found here. If we imposed additional edges on the graph in Figure 7B so as to make all the edges reversible, this would only make the graph more complex. If we then imposed detailed balance, which restricts the values of the parameters, the equilibrium probabilities would be given by Equation 5 rather than Equation 7 and the gene-regulation function could be written down in a few lines. The complexity uncovered here depends crucially on being far from thermodynamic equilibrium.
Additional study of PHO5 has shown that nucleosomes decouple the threshold for PHO5 expression from its dynamic range [53]. However, this kind of behaviour can be recapitulated within the thermodynamic formalism [54]. This suggests that the full implications of non-equilibrium behaviour, as revealed by the complexity of the PHO5 gene-regulation function, have not yet been uncovered experimentally. To suggest experimental options, we need ways to decompose the complexity found in Additional file 1C and to attribute aspects of it to specific biochemical mechanisms. Approximation methods may help in particular cases [55] but new ideas are needed for addressing the complexity barrier systematically, to which we now turn.
Graph independence leads to reduced complexity
Gene regulation often takes a modular form, with repeated binding sites, reiterated motifs and multiple enhancers [56],[57]. The microstate probabilities and the resulting gene-regulation function could become extremely complicated, especially if the modules are operating far from equilibrium. There is, however, one context in which simplification may be expected. This occurs when modules operate independently of each other, so that whatever takes place within one module does not affect what takes place in any other module. For instance, developmental genes are often regulated by multiple enhancers, which sometimes appear to act independently of each other [58].
Within the thermodynamic formalism, independence of binding sites leads to multiplication of the corresponding partition functions (described after Equation 6). For instance, a transcription factor, T, binding to a single site on DNA has the partition function 1+K[ T], where K is the association constant for binding. Suppose that there are m repeated binding sites to which T binds and suppose that each site has the same association constant. If these bindings are independent of each other, then the partition function for the m-site system is obtained by simply multiplying the one-site partition function m times, to yield
On the other hand, if the sites are not independent, the partition function takes the more complicated form
where a 1,…,a
m
can be arbitrary numbers. Evidently, the partition function in Equation 13 is considerably less complex and easier to understand. In the light of this result for equilibrium systems, we wanted to find a generalisation in which the modules are no longer individual binding sites but are represented by potentially complex graphs, which may not be at thermodynamic equilibrium. Such modules might correspond, for instance, to independent enhancers.
We used the product graph construction to capture the concept of independence. Let G and H be any two graphs which represent two modules within a gene regulation system. We make no assumptions about the graphs, which do not have to be at equilibrium and do not have to be strongly connected. The product graph G×H is constructed as follows (Figure 9). It has vertices (i,j), where i is a vertex in G and j is a vertex in H. The vertices are enumerated lexicographically, so that (i,j)<(i ′,j ′) if either i<i ′ or i=i ′ and j<j ′. For each labelled edge in G and for every vertex j in H, the labelled edge is created in G×H. The retention of the same label a on these edges ensures that the transition from (i 1,j) to (i 2,j) occurs independently of j and always at the same rate, which captures the independence assumption. Similarly, for each labelled edge in H and for every vertex i in G, the labelled edge is created in G×H. These are the only edges in G×H.
If the modules represented by G and H are operating independently of each other, then the graph of the combined system is given by G×H. What can be said about the ρ G×H in terms of ρ G and ρ H? When G and H are both strongly connected, then G×H is also strongly connected and a basis vector in the kernel of the Laplacian is given by
(14)
This uses the Kronecker product of two vectors, x⊗y, defined by (x⊗y)(i,j)=x
i
y
j
(Figure 9). If either G or H are not strongly connected then G×H will not be strongly connected. A basis for the Laplacian kernel of G×H is then given by the Kronecker products ρ G,i⊗ρ H,j between each pair of basis vectors from each respective kernel. The precise product theorem is stated and proved in Additional file 1B.
In the example in Figure 9, the product theorem yields polynomials for the components of ρ G×H that have degree 3 in the labels. Since G×H is strongly connected, ρ G×H can also be calculated using the matrix-tree formula in Equation 7. The resulting polynomials must have degree 5 because G×H has six vertices. However, each of the polynomials from Equation 7 has the same scalar factor of degree 2, given by
which can be divided out to give the much simpler expressions in Figure 9. The basis vectors from the product theorem are substantially less complicated, both in degree and in the numbers of monomials, than those from Equation 7.
This product theorem is important because it shows that a system that is far from equilibrium may still have simple expressions for its microstate probabilities. What is required is that the system has independent modules within it. This suggests a starting point for addressing the complexity challenge identified above, as reviewed further in the Discussion below.