Monte Carlo Markov Chain

August 24, 2020


A Monte Carlo Markov Chain (MCMC), more commonly called Markov Chain Monte Carlo, simulates a sequence of possible events in which the probability of each event depends only on the state attained in the previous event. MCMC methods have a wide array of applications, the most common of which is the approximation of probability distributions.

Let’s take a look at an example of Monte Carlo Markov Chains in action. Suppose we wanted to determine the probability of sunny (S) and rainy (R) weather.

We’re given the following conditional probabilities:

  • If it’s rainy today, there’s a 50% chance that it will be sunny tomorrow.
  • If it’s rainy today, there’s a 50% chance it will be raining tomorrow.
  • If it’s sunny today, there’s a 90% chance it will be sunny tomorrow.
  • If it’s sunny today, there’s a 10% chance it will be raining tomorrow.

Let’s assume we start out in the sunny state. We then do a Monte Carlo simulation: we generate a random number between 0 and 1, and if it happens to be below 0.9, it will be sunny tomorrow, and rainy otherwise. We do another Monte Carlo simulation for the following day; suppose this time the draw tells us it’s going to be rainy tomorrow. We repeat the process for n iterations.

The resulting sequence of states (e.g. S, S, R, S, ...) is known as a Markov Chain.

We count the number of sunny days and divide by the total number of days to determine the probability that it will be sunny. If the Markov Chain is long enough, we will obtain the same marginal probabilities even if the initial states differ. In this case, the probability that it will be sunny is 83.3%, and the probability that it will be rainy is 16.7%.
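Here’s a minimal sketch of that simulation in Python (labeling sunny as state 0 and rainy as state 1 is my own convention):

import numpy as np

rng = np.random.default_rng(42)

# Rows of T give the transition probabilities out of each state:
# state 0 = sunny (S), state 1 = rainy (R)
T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

n_days = 100_000
state = 0  # start in the sunny state
sunny_days = 0

for _ in range(n_days):
    # Monte Carlo step: sample tomorrow's state from the current row of T
    state = rng.choice(2, p=T[state])
    sunny_days += (state == 0)

print(sunny_days / n_days)  # approaches 5/6 = 0.833 for a long enough chain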

Let’s see how we might go about implementing a Monte Carlo Markov Chain in Python.

We begin by importing the following libraries:

import numpy as np
import scipy.stats as st
from matplotlib import pyplot as plt
import seaborn as sns
sns.set()

We can express the conditional probabilities from the previous example using a state machine and a corresponding transition matrix T, where each row holds the probabilities of moving out of a state: the first row is sunny (0.9 stay sunny, 0.1 turn rainy) and the second row is rainy (0.5 turn sunny, 0.5 stay rainy).

We then perform some linear algebra: the distribution over states after i days is the initial distribution p multiplied by the matrix power T^i.

T = np.array([[0.9, 0.1],
              [0.5, 0.5]])

# Random initial distribution over the two states, normalized to sum to 1
p = np.random.uniform(low=0, high=1, size=2)
p = p / np.sum(p)

# q[i] is the distribution over states after i steps: p @ T^i
q = np.zeros((100, 2))
for i in np.arange(0, 100):
    q[i, :] = np.dot(p, np.linalg.matrix_power(T, i))

Finally, we plot the results.

plt.plot(q)  
plt.xlabel('i')  
plt.legend(('S', 'R'))

As we can see, the probability that it will be sunny settles around 0.833. Similarly, the probability that it will be rainy converges towards 0.167.
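We can also confirm this limit analytically. The limiting probabilities form the stationary distribution pi satisfying pi T = pi, so, reusing the T defined above, a quick eigenvector computation recovers the same numbers:

# Stationary distribution: left eigenvector of T with eigenvalue 1
eigvals, eigvecs = np.linalg.eig(T.T)
stationary = np.real(eigvecs[:, np.argmax(np.isclose(eigvals, 1))])
stationary = stationary / stationary.sum()
print(stationary)  # [0.833..., 0.166...]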

Bayes Formula

Oftentimes, we want to know the probability of some event B given that another event A has occurred. This is expressed symbolically as P(B|A). If two events are not independent, then the probability of both occurring is given by:

P(A and B) = P(A) × P(B|A)

For example, suppose we are drawing two cards from a standard deck of 52 cards. Half of all cards in the deck are red and half are black. These events are not independent because the probability of the second draw depends on the first.

P(A) = P(black card on first draw) = 26/52 = 0.5

P(B|A) = P(black card on second draw | black card on first draw) = 25/51 ≈ 0.49

Using this information, we can calculate the probability of drawing two black cards in a row as:

P(A and B) = P(A) × P(B|A) = (26/52) × (25/51) ≈ 0.245
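As a quick sanity check, a few lines of Monte Carlo simulation reproduce this number:

import numpy as np

rng = np.random.default_rng(0)
trials = 100_000
both_black = 0

for _ in range(trials):
    # A deck of 26 black (1) and 26 red (0) cards; draw two without replacement
    deck = np.array([1] * 26 + [0] * 26)
    draw = rng.choice(deck, size=2, replace=False)
    both_black += draw.sum() == 2

print(both_black / trials)  # ≈ (26/52) × (25/51) ≈ 0.245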

Now, let’s suppose that instead, we wanted to develop a spam filter that categorizes an email as spam or not depending on the occurrence of some word. For example, if an email contains the word Viagra, we classify it as spam. If, on the other hand, an email contains the word money, then there’s an 80% chance that it’s spam.

According to Bayes’ theorem, the probability that an email is spam given that it contains a given word is:

P(spam | word) = P(word | spam) × P(spam) / P(word)

We might know the probability that an email is spam, as well as the probability that a word is contained in an email classified as spam. We do not, however, know the probability that a given word will be found in an email. This is where the Metropolis-Hastings algorithm comes into play.
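To make the formula concrete before we get there: if we did happen to know how often the word appears in non-spam email, the denominator would follow from the law of total probability. The numbers below are made up purely for illustration.

# Hypothetical numbers, purely for illustration
p_spam = 0.3              # P(spam)
p_word_given_spam = 0.8   # P(word | spam)
p_word_given_ham = 0.05   # P(word | not spam)

# Law of total probability gives the normalizing constant P(word)
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
print(p_word_given_spam * p_spam / p_word)  # ≈ 0.873 with these numbers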

Metropolis-Hastings Algorithm

The Metropolis-Hastings algorithm enables us to sample from the posterior distribution without knowing the normalizing constant. At a high level, it works as follows (a minimal code sketch follows this list):

  • Select a random initial value for theta.
  • Propose a new value of theta by perturbing the current one.
  • Compute the acceptance ratio rho = min(1, p(proposed) / p(current)), where p is any function proportional to the target distribution.
  • Accept the proposal with probability rho; otherwise, keep the current value.
  • Repeat for n iterations.

The acceptance criterion only looks at ratios of the target distribution, so the unknown normalizing constant in the denominator cancels out.
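In code, the high-level loop might look like the following minimal sketch, where target is any function proportional to the distribution we want to sample from (the function and parameter names are my own, not part of any library):

import numpy as np

def metropolis_hastings(target, theta0, sigma, niters, seed=None):
    # Draw samples proportional to `target` using a Gaussian random-walk
    # proposal; assumes target(theta0) > 0
    rng = np.random.default_rng(seed)
    theta = theta0
    samples = np.empty(niters)
    for i in range(niters):
        theta_p = theta + rng.normal(0, sigma)         # propose a new value
        rho = min(1, target(theta_p) / target(theta))  # acceptance ratio
        if rng.uniform() < rho:                        # accept with probability rho
            theta = theta_p
        samples[i] = theta
    return samples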

For you visual learners out there, let’s illustrate how the algorithm works with an example.

We begin by selecting a random initial value for theta.

Then, we propose a new value for theta.

We calculate rho, the ratio of the PDF at the proposed value of theta to the PDF at the current value of theta.

If rho is less than 1, then we set theta to the new value with probability rho. We do so by comparing rho with a sample u drawn from a uniform distribution between 0 and 1. If rho is greater than u, we accept the proposed value; otherwise, we reject it.

We try a different value for theta.

If rho is greater than 1, it will always be greater than or equal to the sample drawn from the uniform distribution. Thus, we accept the proposed value of theta.

We repeat the process for n iterations.

Since we automatically accept a proposed move when the target distribution is greater than the one at the current position, theta will tend to be in places where the target distribution is denser. However, if we only accepted values that were greater than the current position, we’d get stuck at one of the peaks. Therefore, we occasionally accept moves to lower density regions. This way, theta will be expected to bounce around in such a way as to approximate the density of the posterior distribution.

The sequence of steps is in fact a Markov Chain.

Let’s walk through how we’d go about implementing the Metropolis-Hastings algorithm in Python, but first, here’s a quick refresher on the different types of distributions involved.

Normal Distribution

In nature, random phenomena (e.g. IQ, height) tend to follow a normal distribution. A normal distribution has two parameters, mu and sigma. Varying mu shifts the bell curve, whereas varying sigma alters its width.

Image source: https://commons.wikimedia.org/wiki/File:Normal_Distribution_PDF.svg

Beta Distribution

Like a normal distribution, a beta distribution has two parameters. However, unlike a normal distribution, the shape of a beta distribution varies significantly based on the values of its parameters, alpha and beta.

Image source: https://commons.wikimedia.org/wiki/File:Beta_distribution_pdf.svg

Binomial Distribution

Unlike a normal distribution, whose domain can be a continuous quantity such as height, the domain of a binomial distribution is always the number of successes among a fixed number of discrete trials.

Image source: https://commons.wikimedia.org/wiki/File:Binomial_distribution_pmf_sl.svg
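A quick way to get a feel for these shapes is to plot each distribution with scipy.stats; the parameter values below are arbitrary.

import numpy as np
import scipy.stats as st
from matplotlib import pyplot as plt

fig, axes = plt.subplots(1, 3, figsize=(12, 3))

# Normal PDF: mu shifts the bell curve, sigma widens it
x = np.linspace(-5, 5, 200)
for mu, s in [(0, 1), (0, 2), (2, 1)]:
    axes[0].plot(x, st.norm(mu, s).pdf(x), label=f'mu={mu}, sigma={s}')

# Beta PDF: the shape changes drastically with alpha and beta
x = np.linspace(0.01, 0.99, 200)
for alpha, beta in [(0.5, 0.5), (2, 5), (10, 10)]:
    axes[1].plot(x, st.beta(alpha, beta).pdf(x), label=f'a={alpha}, b={beta}')

# Binomial PMF: defined only on the integers 0..n
k = np.arange(0, 21)
axes[2].bar(k, st.binom(20, 0.5).pmf(k))

for ax, title in zip(axes, ['Normal', 'Beta', 'Binomial']):
    ax.set_title(title)
axes[0].legend()
axes[1].legend()
plt.show()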

Now that we’ve familiarized ourselves with these concepts, we’re ready to delve into the code. We begin by initializing the observed data, the prior hyperparameters, and the sampler settings.

n = 100        # number of trials
h = 59         # number of observed successes
a = 10         # alpha hyperparameter of the Beta prior
b = 10         # beta hyperparameter of the Beta prior
sigma = 0.3    # standard deviation of the Gaussian proposal
theta = 0.1    # initial value of theta
niters = 10000 # number of Metropolis-Hastings iterations
thetas = np.linspace(0, 1, 200)  # grid for plotting the analytic curves
samples = np.zeros(niters+1)
samples[0] = theta

Next, we define a function that returns the product of the likelihood and prior (the unnormalized posterior) for a given value of theta.

def prob(theta):
    # Unnormalized posterior: likelihood times prior (zero outside [0, 1])
    if theta < 0 or theta > 1:
        return 0
    else:
        prior = st.beta(a, b).pdf(theta)
        likelihood = st.binom(n, theta).pmf(h)
        return likelihood * prior

We step through the algorithm, updating the value of theta based on the conditions described earlier.

for i in range(niters):  
    theta_p = theta + st.norm(0, sigma).rvs()  
    rho = min(1, prob(theta_p) / prob(theta))  
    u = np.random.uniform()  
    if u < rho:  
        # Accept proposal  
        theta = theta_p  
    else:  
        # Reject proposal  
        pass  
    samples[i+1] = theta
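As an optional sanity check (my own addition, not part of the original algorithm), we can estimate the acceptance rate after the fact: with a continuous proposal, consecutive samples differ essentially only when a proposal was accepted. Rates far from roughly 20-50% for a random-walk proposal often suggest retuning sigma.

# Approximate acceptance rate: fraction of steps where the chain actually moved
print(np.mean(samples[1:] != samples[:-1]))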

We define the likelihood, as well as the prior and posterior probability distributions. Because the Beta prior is conjugate to the binomial likelihood, the true posterior is available in closed form as Beta(h+a, n-h+b).

prior = st.beta(a, b).pdf(thetas)  
post = st.beta(h+a, n-h+b).pdf(thetas)  
likelihood = st.binom(n, thetas).pmf(h)

We visualize the posterior distribution obtained using the Metropolis-Hastings algorithm, discarding the first half of the samples as burn-in.

plt.figure(figsize=(12, 9))
plt.hist(samples[len(samples)//2:], 40, histtype='step', density=True, linewidth=1, label='Predicted Posterior')
plt.plot(thetas, n*likelihood, label='Likelihood (scaled)', c='green')  # scaled by n for visibility
plt.plot(thetas, prior, label='Prior', c='blue')
plt.plot(thetas, post, c='red', linestyle='--', alpha=0.5, label='True Posterior')
plt.xlim([0, 1])
plt.legend(loc='best')

As we can see, the Metropolis-Hastings algorithm does a good job of approximating the actual posterior distribution.
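As a quick numeric check, we can compare the sample-based posterior mean (after discarding the first half as burn-in) with the mean of the known Beta(h+a, n-h+b) posterior, which is (h+a)/(n+a+b):

burn = len(samples) // 2
print('MCMC estimate of posterior mean:', samples[burn:].mean())
print('True posterior mean:', (h + a) / (n + a + b))  # 69/120 = 0.575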

Conclusion

A Monte Carlo Markov Chain is a sequence of events, drawn from a set of probability distributions, that can be used to approximate another distribution. The Metropolis-Hastings algorithm makes use of Monte Carlo Markov Chains to approximate the posterior distribution when we know the likelihood and prior, but not the normalizing constant.


Written by Cory Maklin