### KL Divergence Python Example

As you progress in your career as a data scientist, you will inevitably come across the Kullback–Leibler (KL) divergence. We can think of the KL divergence as a distance-like measure (it is not a true metric, because it isn’t symmetric) that quantifies the difference between two probability distributions. One common scenario where this is useful is when we are working with a complex distribution. Rather than working with the distribution directly, we can make our life easier by using another distribution with well-known properties (e.g. a normal distribution) that does a decent job of describing the data. In other words, we can use the KL divergence to tell whether a Poisson distribution or a normal distribution is better at approximating the data. The KL divergence is also a key component of Gaussian Mixture Models and t-SNE.

For distributions P and Q of a **continuous random variable**, the Kullback-Leibler divergence is computed as an integral.
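Writing $p(x)$ and $q(x)$ for the probability densities of P and Q (notation assumed here), the standard definition reads:

$$
D_{KL}(P \parallel Q) = \int_{-\infty}^{\infty} p(x) \, \log \frac{p(x)}{q(x)} \, dx
$$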

On the other hand, if P and Q represent the probability distribution of a discrete random variable, the Kullback-Leibler divergence is calculated as a summation.
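Writing $P(x)$ and $Q(x)$ for the probability mass each distribution assigns to an outcome $x$ in the sample space $\mathcal{X}$ (symbols assumed here), the standard definition reads:

$$
D_{KL}(P \parallel Q) = \sum_{x \in \mathcal{X}} P(x) \, \log \frac{P(x)}{Q(x)}
$$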

### Python Code

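To start, we import the following libraries. Note that the TensorFlow code later in this post uses the TensorFlow 1.x API (sessions, placeholders and `tf.train` optimizers).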

```
import numpy as np
from scipy.stats import norm
from matplotlib import pyplot as plt
import tensorflow as tf
import seaborn as sns
sns.set()
```

Next, we define a function to calculate the KL divergence of two probability distributions. We need to make sure that we don’t include any probabilities equal to 0 because the log of 0 is negative infinity.

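```
def kl_divergence(p, q):
    # Terms where p == 0 contribute 0 to the sum (by convention, 0 * log(0) = 0)
    return np.sum(np.where(p != 0, p * np.log(p / q), 0))
```

With the grid defined below (spacing 0.001), this function returns approximately 500 for a normal distribution with a mean of 0 and a standard deviation of 2 compared against another with a mean of 2 and a standard deviation of 2. Note that the function sums density values without multiplying by the grid spacing, so the result is scaled by 1/0.001; the analytical KL divergence between these two distributions is 0.5.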

```
x = np.arange(-10, 10, 0.001)
p = norm.pdf(x, 0, 2)
q = norm.pdf(x, 2, 2)
```

```
plt.title('KL(P||Q) = %1.3f' % kl_divergence(p, q))
plt.plot(x, p)
plt.plot(x, q, c='red')
```
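As a sanity check on the number in the plot title, the sketch below compares a grid-based estimate with the closed-form KL divergence between two univariate Gaussians. The helper names (`normal_pdf`, `kl_gaussians`, `kl_on_grid`) are illustrative and not part of the original code. The only change relative to the plain sum is the multiplication by the grid spacing `dx`, which turns the sum into a Riemann approximation of the integral.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2); a NumPy stand-in for scipy.stats.norm.pdf."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_gaussians(mu1, sigma1, mu2, sigma2):
    """Closed-form KL(N(mu1, sigma1^2) || N(mu2, sigma2^2))."""
    return (np.log(sigma2 / sigma1)
            + (sigma1 ** 2 + (mu1 - mu2) ** 2) / (2.0 * sigma2 ** 2)
            - 0.5)

def kl_on_grid(p, q, dx):
    """Riemann-sum approximation of the KL integral on an evenly spaced grid."""
    mask = p > 0  # skip grid points where p is exactly zero
    return np.sum(p[mask] * np.log(p[mask] / q[mask])) * dx

dx = 0.001
x = np.arange(-10, 10, dx)
p = normal_pdf(x, 0, 2)
q = normal_pdf(x, 2, 2)

kl_numeric = kl_on_grid(p, q, dx)
kl_exact = kl_gaussians(0, 2, 2, 2)
print(kl_numeric, kl_exact)
```

Both values come out at about 0.5, which is the unscaled sum of roughly 500 multiplied by the spacing of 0.001.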

If we measure the KL divergence between the initial probability distribution and another distribution with a mean of 5 and a standard deviation of 4, we expect the KL divergence to be higher than in the previous example.

```
q = norm.pdf(x, 5, 4)
plt.title('KL(P||Q) = %1.3f' % kl_divergence(p, q))
plt.plot(x, p)
plt.plot(x, q, c='red')
```

It’s important to note that the KL divergence is not symmetrical. In other words, if we switch P for Q and vice versa, we get a different result.

```
q = norm.pdf(x, 5, 4)
# Arguments are swapped here, so this is KL(Q||P) and the title is labeled accordingly
plt.title('KL(Q||P) = %1.3f' % kl_divergence(q, p))
plt.plot(x, p)
plt.plot(x, q, c='red')
```
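The asymmetry can also be checked numerically. The sketch below is an illustration rather than the code used for the plots: it assumes a wider grid (-40 to 40 with spacing 0.01) so that both densities are almost entirely covered, and it multiplies the sum by the grid spacing so that the result approximates the integral. `normal_pdf` and `kl_on_grid` are hypothetical helper names.

```python
import numpy as np

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2); a NumPy stand-in for scipy.stats.norm.pdf."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def kl_on_grid(a, b, dx):
    """Approximate KL(A||B) from density values on an evenly spaced grid."""
    return np.sum(a * np.log(a / b)) * dx

dx = 0.01
x = np.arange(-40, 40, dx)   # assumed wider grid, covers both densities
p = normal_pdf(x, 0, 2)
q = normal_pdf(x, 5, 4)

kl_pq = kl_on_grid(p, q, dx)  # P as the reference distribution
kl_qp = kl_on_grid(q, p, dx)  # roles of P and Q swapped

print('KL(P||Q) = %1.3f' % kl_pq)
print('KL(Q||P) = %1.3f' % kl_qp)
```

Under these assumptions the two directions come out at roughly 1.10 and 3.93, which matches the closed-form values for these two Gaussians.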

The lower the KL divergence, the closer the two distributions are to one another. Therefore, as in the case of t-SNE and Gaussian Mixture Models, we can estimate the Gaussian parameters of one distribution by minimizing its KL divergence with respect to another.

#### Minimizing KL Divergence

Let’s see how we could go about minimizing the KL divergence between two probability distributions using gradient descent. To begin, we create a probability distribution with a known mean (0) and standard deviation (2), i.e. a variance of 4. Then, we create another distribution with random parameters.

```
x = np.arange(-10, 10, 0.001)
# Target distribution: mean 0, standard deviation 2
p_pdf = norm.pdf(x, 0, 2).reshape(1, -1)
np.random.seed(0)
random_mean = np.random.randint(10, size=1)
# A lower bound of 1 keeps the standard deviation strictly positive
random_sigma = np.random.randint(1, 10, size=1)
random_pdf = norm.pdf(x, random_mean, random_sigma).reshape(1, -1)
```

Given that we are using gradient descent, we need to select values for the hyperparameters (i.e. step size, number of iterations).

```
learning_rate = 0.001
epochs = 100
```

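Just like in `numpy`, in `tensorflow` we need to allocate memory for variables. For the variable `q`, we use the equation for a normal distribution given mu and sigma, only we exclude the part before the exponent since we’re normalizing the result. The code that follows assumes that `p` is a placeholder fed with the target distribution and that `mu` and `sigma` are trainable variables, with `sigma` holding the variance (hence the `np.sqrt` call when plotting).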
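To see why the constant in front of the exponential can be dropped, the NumPy sketch below (not the TensorFlow graph itself) normalizes the bare exponential by its sum and compares it with the full density normalized the same way. The parameter values and the name `unnormalized` are chosen for illustration, and `var` plays the role of sigma interpreted as a variance, which is an assumption based on the `np.sqrt(variance)` call in the plotting code.

```python
import numpy as np

x = np.arange(-10, 10, 0.001)
mu, var = 1.0, 4.0  # illustrative values, not taken from the original code

# Bare exponential: the 1 / sqrt(2 * pi * var) prefactor is omitted
unnormalized = np.exp(-np.square(x - mu) / (2.0 * var))
q = unnormalized / np.sum(unnormalized)

# Full normal density, normalized over the same grid
full_pdf = unnormalized / np.sqrt(2.0 * np.pi * var)
q_from_pdf = full_pdf / np.sum(full_pdf)

# The constant prefactor cancels in the normalization
print(np.allclose(q, q_from_pdf))
```

Because the prefactor does not depend on `x`, it appears in both the numerator and the normalizing sum and cancels out.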

Just like before, we define a function to compute the KL divergence that excludes probabilities equal to zero.

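```
# tf.equal builds an element-wise mask; in TensorFlow 1.x, `p == 0` is not
# an element-wise comparison by default
kl_divergence = tf.reduce_sum(
    tf.where(tf.equal(p, 0.0), tf.zeros_like(p), p * tf.log(p / q))
)
```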

Next, we initialize an instance of the `GradientDescentOptimizer` class and call the `minimize` method with the KL divergence function as an argument.

`optimizer = tf.train.GradientDescentOptimizer(learning_rate).minimize(kl_divergence)`

Only after running `tf.global_variables_initializer()` will the variables hold the values we set when we declared them (i.e. `tf.zeros`).

`init = tf.global_variables_initializer()`

All operations in TensorFlow 1.x must be run within a session. In the following code block, we minimize the KL divergence using gradient descent.

```
with tf.Session() as sess:
    sess.run(init)
    history = []
    means = []
    variances = []
    for i in range(epochs):
        # Feed the target distribution p_pdf defined earlier
        sess.run(optimizer, { p: p_pdf })
        if i % 10 == 0:
            # Record the loss and the current parameters every 10th epoch
            history.append(sess.run(kl_divergence, { p: p_pdf }))
            means.append(sess.run(mu)[0])
            variances.append(sess.run(sigma)[0][0])
    for mean, variance in zip(means, variances):
        q_pdf = norm.pdf(x, mean, np.sqrt(variance))
        plt.plot(x, q_pdf.reshape(-1, 1), c='red')
```

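```
plt.title('KL(P||Q) = %1.3f' % history[-1])
plt.plot(x, p_pdf.reshape(-1, 1), linewidth=3)
plt.show()
# KL divergence recorded every 10th epoch
plt.plot(history)
plt.show()
# No explicit sess.close() is needed: the `with` statement closes the session
```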

The resulting plots show the fitted distribution at different points during training, followed by the KL divergence as training progresses.
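Because the session-based code above depends on TensorFlow 1.x, here is a plain NumPy sketch of the same idea that runs without it. It is not the author's implementation: the gradients are derived by hand for a discretized Gaussian `q` parameterized by a mean and a variance, the target `p` is normalized to sum to 1 over the grid, and the starting point, step size and iteration count are assumptions picked for illustration.

```python
import numpy as np

x = np.arange(-10, 10, 0.01)

# Target: N(0, 2^2) evaluated on the grid and normalized to sum to 1
p = np.exp(-np.square(x) / (2.0 * 4.0))
p /= p.sum()

def q_dist(mu, var):
    """Discretized Gaussian on the grid, normalized to sum to 1."""
    w = np.exp(-np.square(x - mu) / (2.0 * var))
    return w / w.sum()

mu, var = 3.0, 1.0     # assumed starting point
learning_rate = 0.5    # assumed step size
steps = 2000           # assumed iteration count
history = []

for step in range(steps):
    q = q_dist(mu, var)
    history.append(np.sum(p * np.log(p / q)))

    # Hand-derived gradients of KL(p||q) for this parameterization:
    #   dKL/dmu  = -(E_p[x] - E_q[x]) / var
    #   dKL/dvar = -(E_p[(x - mu)^2] - E_q[(x - mu)^2]) / (2 * var^2)
    sq = np.square(x - mu)
    grad_mu = -(np.sum(p * x) - np.sum(q * x)) / var
    grad_var = -(np.sum(p * sq) - np.sum(q * sq)) / (2.0 * var ** 2)

    mu -= learning_rate * grad_mu
    var -= learning_rate * grad_var

print('mu = %.4f, var = %.4f, KL = %.6f' % (mu, var, history[-1]))
```

With these settings the parameters settle near a mean of 0 and a variance of 4, the values used to build `p`, and the recorded KL divergence falls towards zero.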