Many real-world systems can be described approximately with Markov chain models, which are particularly convenient for making statistical comparisons. Here we derive a simple computational method for comparing such models and inferring their parameters. The methods described below are demonstrated by Infr (https://github.com/Homo-carbonis/infr), a Common Lisp library for performing Bayesian inference over Markov chains.
We’re going to take a model of a system in the form of an ordinary differential equation, approximate it as a Markov chain using the finite difference method, and then use Bayes’ Theorem to compare it to a sequence of data points. This will allow us to infer the probability distribution of the model’s parameters and to compare two competing models to see which is better supported by the data.
If we want to infer the probability distribution of an unknown parameter β from an observed value y, we need two things: a model telling us the likelihood of y given β, P(y|β), and a prior distribution P(β), representing any prior knowledge we have of β. Bayes’ theorem then tells us $$P\left( \beta~|~y \right) = \frac{P\left( y|\beta \right)P(\beta)}{P(y)}$$ P(y) is known as the marginal distribution of y, given by ∫−∞+∞P(y|β)P(β) dβ; it normalises the joint posterior distribution P(β, y) for our particular model. If we are only interested in comparing relative probabilities of β, we can discard this term and use the relative posterior P(β | y) ∝ P(y|β)P(β). A simple method to estimate the marginal probability, so that we can compare different models, is presented under Marginal.
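As a concrete sketch, the relative posterior can be evaluated numerically on a grid and then normalised. The example below is a toy case invented for illustration (not from the text): a single observation y drawn from 𝒩(β, 1) with prior β ∼ 𝒩(0, 1).

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of a N(mu, sigma^2) distribution."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy example (all numbers made up): one observation y ~ N(beta, 1),
# prior beta ~ N(0, 1). Evaluate P(y | beta) P(beta) on a grid of beta values.
y = 1.5
betas = np.linspace(-5.0, 5.0, 1001)
unnormalised = normal_pdf(y, betas, 1.0) * normal_pdf(betas, 0.0, 1.0)

# Dividing by the marginal P(y), here approximated by a Riemann sum over the
# grid, turns the relative posterior into a proper density.
dbeta = betas[1] - betas[0]
posterior = unnormalised / (unnormalised.sum() * dbeta)
```

In this conjugate case the posterior is itself normal, with its mode halfway between the prior mean 0 and the observation 1.5.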
If we want to find the probability of β given a sequence of observations of a process, we need a model which gives us the probability of the next observation given those which preceded it.
From the first two observations we get:
$$P\left( \beta~|~y_{1},y_{0} \right) = \frac{P\left( y_{1}~|~\beta,y_{0} \right)P(\beta)}{P\left( y_{1}~|~y_{0} \right)}$$
This then becomes the prior probability for the next observation, so we have
$$P\left( \beta~|~y_{2},y_{1},y_{0} \right) = \frac{P\left( y_{2}~|~\beta,y_{1},y_{0} \right)P\left( \beta~|~y_{1},y_{0} \right)}{P\left( y_{2}~|~y_{1},y_{0} \right)}$$
We can continue chaining applications of Bayes’ Theorem like this to give a recursive definition
$$P\left( \beta~|~y_{n},y_{n - 1},\ldots y_{0} \right) = \frac{P\left( y_{n}~|~\beta,y_{n - 1},\ldots y_{0} \right)P\left( \beta~|~y_{n - 1},y_{n - 2},\ldots y_{0} \right)}{P\left( y_{n}~|~y_{n - 1},y_{n - 2}\ldots y_{0} \right)}$$
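The chain of updates can be sketched numerically by repeatedly reweighting a gridded posterior. The transition model here is a hypothetical stand-in (an AR(1)-style process, yi ∼ 𝒩(β yi−1, 0.5²)) and the observations are made up; only the update pattern matters.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Hypothetical stand-in process: y_i ~ N(beta * y_{i-1}, 0.5^2), so
# P(y_i | beta, y_{i-1}) is a normal density. Each observation turns the
# current posterior into the prior for the next update.
def update(posterior, betas, y_prev, y_next):
    # One application of Bayes' theorem on the grid; renormalising plays
    # the role of dividing by the marginal P(y_n | y_{n-1}, ..., y_0).
    weighted = normal_pdf(y_next, betas * y_prev, 0.5) * posterior
    return weighted / (weighted.sum() * (betas[1] - betas[0]))

betas = np.linspace(-2.0, 2.0, 801)
posterior = normal_pdf(betas, 0.0, 1.0)   # prior P(beta)
ys = [1.0, 0.8, 0.7, 0.55, 0.45]          # made-up observations
for y_prev, y_next in zip(ys, ys[1:]):
    posterior = update(posterior, betas, y_prev, y_next)
```

After each observation the grid holds the posterior given all data so far, exactly as in the recursion above.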
This recursion can be applied to any discrete process, but the integration involved in computing the marginal probabilities makes it impractical in general. To produce an equation we can solve, we restrict ourselves to a particular kind of process.
A Markov Chain is a stochastic process with the property:
P(xn | xn − 1, xn − 2, …x0) = P(xn | xn − 1)
That is, it is a chain of random variables, each of which depends only on its predecessor.
For our purposes Markov Chains have the very useful property P(xn, xn − 1, …x0) = P(xn|xn − 1)P(xn − 1 | xn − 2)…P(x1|x0)P(x0)
Proof
$$\begin{aligned} P\left( x_{1},x_{0} \right) & = P\left( x_{1}|x_{0} \right)\ P\left( x_{0} \right) \\ P\left( x_{n},x_{n - 1},\ldots x_{0} \right) & = P\left( x_{n}~|~x_{n - 1},x_{n - 2},\ldots x_{0} \right)\ P\left( x_{n - 1},x_{n - 2},\ldots x_{0} \right) \\ & = P\left( x_{n}|x_{n - 1} \right)\ P\left( x_{n - 1},x_{n - 2},\ldots x_{0} \right)\quad \text{by the Markov property.} \\ \\ \text{By induction this gives } \\ P\left( x_{n},x_{n - 1},\ldots x_{0} \right) & = P\left( x_{n}|x_{n - 1} \right)P\left( x_{n - 1}~|~x_{n - 2} \right)\ldots P\left( x_{1}|x_{0} \right)P\left( x_{0} \right) \end{aligned}$$
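The factorisation can be checked numerically. The chain below is a hypothetical Gaussian random walk, chosen only because its transition densities are easy to write down.

```python
import numpy as np

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Hypothetical chain: x_0 ~ N(0, 1), x_i ~ N(x_{i-1}, 1). The joint
# log-density is the initial term plus a sum of transition terms,
# mirroring the product form of the factorisation (in log space).
def log_joint(xs):
    """log P(x_n, ..., x_0) = log P(x_0) + sum_i log P(x_i | x_{i-1})."""
    total = log_normal_pdf(xs[0], 0.0, 1.0)
    for x_prev, x_next in zip(xs, xs[1:]):
        total += log_normal_pdf(x_next, x_prev, 1.0)
    return total
```

Working in log space avoids underflow when the chain is long, which matters for the likelihood products used below.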
For a Markov chain our posterior distribution becomes
$$\begin{aligned} P\left( \beta~|~y_{n},y_{n - 1},\ldots y_{0} \right) & = \frac{P\left( y_{n}~|~\beta,y_{n - 1} \right)P\left( y_{n - 1}~|~\beta,y_{n - 2} \right)\ldots P\left( y_{1}~|~\beta,y_{0} \right)P(\beta)}{P\left( y_{n},y_{n - 1},\ldots y_{0} \right)} \\ \text{or }\quad P\left( \beta~|~\mathrm{\mathbf{y}} \right) & = \frac{P(\beta)\prod_{i = 1}^{n}P\left( y_{i}|\beta,y_{i - 1} \right)}{P\left( \mathrm{\mathbf{y}} \right)} \end{aligned}$$
We begin with some experimental data; perhaps we have measured the membrane potential of a cell and have a plot like this
We have a model which we think may be a good fit for the observed process, in the form of a differential equation (the van der Pol oscillator): $$\frac{d^{2}x}{dt^{2}} - \beta\left( 1 - x^{2} \right)\frac{dx}{dt} + x = 0$$
The Finite Difference Method allows us to approximate a continuous-time system as a discrete-time system. If the function of the original system is sufficiently smooth, we can choose a time increment, h, small enough to approximate it with arbitrary accuracy.
We substitute: $$\frac{dx}{dt} \rightarrow \frac{x_{i + 1} - x_{i - 1}}{2h} \qquad\qquad \frac{d^{2}x}{dt^{2}} \rightarrow \frac{x_{i - 1} - 2x_{i} + x_{i + 1}}{h^{2}}$$
This gives us $$\frac{x_{i - 1} - 2x_{i} + x_{i + 1}}{h^{2}} - \beta\left( 1 - x_{i}^{2} \right)\frac{x_{i + 1} - x_{i - 1}}{2h} + x_{i} = 0$$ which we can solve for the next value: $$x_{i + 1} = f\left( x_{i},x_{i - 1} \right) = \frac{\left( 2 - h^{2} \right)x_{i} - \left( 1 + \frac{h}{2}\beta\left( 1 - x_{i}^{2} \right) \right)x_{i - 1}}{1 - \frac{h}{2}\beta\left( 1 - x_{i}^{2} \right)}$$ Adding a certain quantity of Gaussian noise to this difference equation we have xi + 1 = f(xi, xi − 1) + r, where r ∼ 𝒩(0, σ2)
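This update can be sketched in code. The step function f below is my own explicit rearrangement, substituting central differences into the ODE and solving for the next value; the parameter values are arbitrary.

```python
import numpy as np

def f(x_i, x_prev, beta, h):
    """Deterministic part of the update: the central-difference form of the
    van der Pol equation, solved for the next value of x."""
    a = beta * (1.0 - x_i ** 2)
    return (((2.0 - h ** 2) * x_i - (1.0 + 0.5 * h * a) * x_prev)
            / (1.0 - 0.5 * h * a))

def simulate(beta, sigma, h, n, x0=0.0, x1=0.01, seed=0):
    """Noisy chain: x_{i+1} = f(x_i, x_{i-1}) + r with r ~ N(0, sigma^2)."""
    rng = np.random.default_rng(seed)
    xs = [x0, x1]
    for _ in range(n - 2):
        xs.append(f(xs[-1], xs[-2], beta, h) + rng.normal(0.0, sigma))
    return np.array(xs)

xs = simulate(beta=3.0, sigma=0.001, h=0.01, n=300)
```

By construction f satisfies the central-difference equation exactly, and the added noise is what makes the chain stochastic.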
NB. This introduces a second parameter, σ2. The method can be applied to any number of parameters, but in the interest of simplicity we treat σ2 as a known constant here.
We now have something approaching a Markov chain. Each value depends on both xi and xi − 1, so it is not a simple Markov chain, but for a sufficiently small h this should not affect the result.
Say from previous experiments we have an idea that β is around 3. To reflect this, for our prior we choose a normal distribution 𝒩(3, 1).
Now we have everything we need to find the posterior distribution of β.
$$P\left( \beta~|~\mathrm{\mathbf{y}} \right) \propto P(\beta)\prod_{i = 1}^{n}P\left( y_{i}|\beta,y_{i - 1} \right)$$
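In code, this is a single pass over the data accumulating log-likelihood terms on a grid of β values. To keep the transition density short, the example below uses a hypothetical stand-in model (noisy logistic growth rather than the van der Pol chain), with synthetic data generated at a "true" β of 2 and the prior 𝒩(3, 1).

```python
import numpy as np

rng = np.random.default_rng(1)

def log_normal_pdf(x, mu, sigma):
    return -0.5 * ((x - mu) / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))

# Hypothetical stand-in chain: y_i = y_{i-1} + h*beta*y_{i-1}*(1 - y_{i-1}) + r,
# with r ~ N(0, sigma^2). Synthetic data generated with beta = 2.
h, sigma, beta_true = 0.1, 0.02, 2.0
ys = [0.1]
for _ in range(200):
    ys.append(ys[-1] + h * beta_true * ys[-1] * (1.0 - ys[-1])
              + rng.normal(0.0, sigma))
ys = np.array(ys)

# Relative log-posterior on a grid:
# log P(beta) + sum_i log P(y_i | beta, y_{i-1}), prior N(3, 1).
betas = np.linspace(0.0, 6.0, 601)
log_post = log_normal_pdf(betas, 3.0, 1.0)
for y_prev, y_next in zip(ys, ys[1:]):
    mean = y_prev + h * betas * y_prev * (1.0 - y_prev)
    log_post += log_normal_pdf(y_next, mean, sigma)

beta_hat = betas[np.argmax(log_post)]
```

With enough data the likelihood dominates and the posterior mode moves from the prior mean towards the value used to generate the data.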
We can see β is somewhere around 2 but there is still a great deal of uncertainty. We will need more data to know the value with greater accuracy.
The relative posterior probability allows us to compare values of β for the same model, but if we want to compare two different models we also need to know the marginal probability for each: P(y|ℳ) = ∫−∞+∞P(y|β, ℳ)P(β) dβ
This can’t be solved analytically except in the simplest cases. Numerical integration by quadrature is also impractical, since the integral runs from −∞ to +∞ and we may be considering several parameters, leading to integration in many dimensions.
The method described below uses the prior distribution to optimize the numerical integration. It is a simpler variant of the particle filter method, which uses only the prior distribution to generate samples.
To integrate a posterior function of β, which is approximated by another function of β, the prior, we draw a number of random samples from the prior distribution of β: β0, β1, …βm. We then divide the area under the function we are integrating into strips of width hi, such that there is approximately one sample in each strip. If our prior is a reasonably good fit for our posterior, the strips will be narrower near the mode, leading to much greater accuracy than quadrature, where the strips are all the same width. This also avoids the problem of not knowing what range to integrate over.
hi is proportional to 1/P(βi), so we have $$\begin{aligned} \int_{- \infty}^{+ \infty}P\left( \mathrm{\mathbf{y}}|\beta,\mathcal{M} \right)P(\beta)\ d\beta & \approx \sum_{i}P\left( \mathrm{\mathbf{y}}~|~\beta_{i} \right)P\left( \beta_{i} \right)\ \frac{K}{P\left( \beta_{i} \right)} \\ & \approx K\sum_{i}P\left( \mathrm{\mathbf{y}}~|~\beta_{i} \right) \end{aligned}$$
We can find the constant K from the fact that $$\begin{aligned} \sum_{i}h_{i} & \approx \beta_{\max} - \beta_{\min} \\ \text{so }\quad K\sum_{i}1/P\left( \beta_{i} \right) & \approx \beta_{\max} - \beta_{\min} \end{aligned}$$
This gives us the formula $$\begin{aligned} P\left( \mathrm{\mathbf{y}}|\mathcal{M} \right) & \approx \frac{\beta_{\max} - \beta_{\min}}{\sum_{i}\frac{1}{P\left( \beta_{i} \right)}}\sum_{i}P\left( \mathrm{\mathbf{y}}~|~\beta_{i} \right) \end{aligned}$$
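The estimator can be checked on a toy problem where the marginal is known exactly. Here (a made-up conjugate example, not the Markov-chain model) y ∼ 𝒩(β, 1) with prior β ∼ 𝒩(0, 1), so the exact marginal is P(y) = 𝒩(0, 2).

```python
import numpy as np

rng = np.random.default_rng(2)

def normal_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

# Toy conjugate case: y ~ N(beta, 1), prior beta ~ N(0, 1), so the exact
# marginal P(y) is a N(0, 2) density, which lets us verify the estimate.
y = 0.7
samples = rng.normal(0.0, 1.0, 20000)   # beta_i drawn from the prior

prior = normal_pdf(samples, 0.0, 1.0)
likelihood = normal_pdf(y, samples, 1.0)

# P(y) ~= (beta_max - beta_min) / sum(1/P(beta_i)) * sum(P(y | beta_i))
width = samples.max() - samples.min()
marginal = width / np.sum(1.0 / prior) * np.sum(likelihood)

exact = normal_pdf(y, 0.0, np.sqrt(2.0))
```

One caveat worth noting: the sum of 1/P(βi) is dominated by samples in the prior's tails, so the estimate is noisy and benefits from a large number of samples.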
Applying this to two different models we can find the Bayes’ Factor $$\frac{P\left( \mathrm{\mathbf{y}}|\mathcal{M_{1}} \right)}{P\left( \mathrm{\mathbf{y}}|\mathcal{M_{2}} \right)}$$ which allows us to make a direct comparison between the evidence for the two models.