Probability is a topic we all learned in high-school maths. Since you are reading this blog, those concepts have probably gathered some rust. We will remove that rust and revisit them from the machine learning point of view.
One prerequisite: you should know how we get a probability of 1/2 from tossing a fair coin. If you know that, you are ready to learn Probability Mass Functions.
By the end of this blog you will know what probability is, what a probability distribution and a probability mass function are, how expectation and variance are defined, and why all of this matters for machine learning.
What is Probability?
Take an example: tossing a coin. It will give you either heads or tails.
How can you predict beforehand whether the next toss will be heads? You cannot know for sure. But you can quantify the likelihood.
You count how many times you tossed the coin, how many times you got heads, and how many times tails. Then you use the relative frequency of heads as your probability estimate for the next toss.
Suppose you toss a coin 100 times and get 64 heads and 36 tails. Your estimate is $P(H) = 0.64$. That does not mean the coin will land heads exactly 64 times in the next 100 tosses. Probability is a measure, not a guarantee. Toss it a billion times and the relative frequency will converge on the coin's true bias, which here we take to be 64%.
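To make the frequency idea concrete, here is a minimal sketch (with a hypothetical helper, estimate_p_heads, not part of the blog's later code) that simulates a coin whose true bias we set to 0.64 and watches the relative frequency settle as the number of tosses grows:

import random

random.seed(0)  # reproducible runs

def estimate_p_heads(true_p, n_tosses):
    # Relative frequency of heads in n_tosses flips of a coin with bias true_p
    heads = sum(1 for _ in range(n_tosses) if random.random() < true_p)
    return heads / n_tosses

# The estimate wobbles at small sample sizes and settles near 0.64 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, estimate_p_heads(0.64, n))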
Code: plotting the PMF for one coin
import random
import matplotlib.pyplot as plt
import seaborn as sns

def flip(p):
    # One flip of a coin that lands heads with probability p
    return 'H' if random.random() < p else 'T'

def pmf(P, N, M):
    # N = coins per sample, M = number of samples
    # Returns the head count of each of the M samples
    l = []
    for m in range(M):
        flips = [flip(P) for _ in range(N)]
        l.append(flips.count('H'))
    return l

P = 0.64  # biased coin
N = 1     # one coin per sample
M = 100   # 100 samples

sns.histplot(pmf(P, N, M), stat='probability', bins=2)
plt.ylabel('PMF')  # set after plotting, so seaborn's default label does not overwrite it
plt.show()

With one coin you get two bars: one for tails, one for heads. The heights are the relative frequencies, i.e. empirical probabilities.
Probability Distribution
A probability distribution is a function that gives the probabilities of all possible outcomes of an experiment.
Now suppose you toss 10 coins at once. How many heads do you expect? Instead of two outcomes you now have eleven: 0 heads through 10 heads. Toss the set 1000 times and count how often each outcome appears.
P = 0.64   # same biased coin
N = 10     # 10 coins per sample
M = 1000   # 1000 samples

sns.histplot(pmf(P, N, M), stat='probability', discrete=True)  # one bar per integer outcome
plt.ylabel('PMF')
plt.show()

You get a histogram with 11 bars. Each bar height is the probability of getting exactly that many heads in a single 10-coin toss. This shape is the probability distribution.
PMF: 10 coins, p = 0.64 (binomial distribution)
Probability Mass Function
A probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.
The key word is discrete. The number of heads is 0, 1, 2, ... 10. No fractions. A continuous random variable (like temperature) would need a different function. We will cover that in the normal distribution blog.
Two properties every PMF must satisfy:
1. Every probability is non-negative: $0 \le P(x) \le 1$ for each outcome $x$.
2. The probabilities of all outcomes sum to 1: $\sum_x P(x) = 1$.
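As a quick sanity check, here is a small sketch (reusing the pmf helper defined above) that builds the empirical PMF for the 10-coin experiment and verifies both properties:

import collections

P, N, M = 0.64, 10, 1000
counts = collections.Counter(pmf(P, N, M))
probs = {heads: count / M for heads, count in counts.items()}

print(all(0 <= p <= 1 for p in probs.values()))  # property 1: True
print(sum(probs.values()))                       # property 2: 1.0 (up to float rounding)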
Expected Value (Expectation)
The expected value of a random variable is its theoretical mean. To understand it, start from the ordinary mean.
Take the set $\lbrace 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 \rbrace$. The mean is $20/10 = 2$. Now rewrite it using frequencies:

$\bar{x} = 1 \cdot \frac{4}{10} + 2 \cdot \frac{2}{10} + 3 \cdot \frac{4}{10} = 2$

Each value multiplied by its relative frequency, which is just each value multiplied by its probability. So the mean is $E(X) = \sum_x x \cdot P(x)$. That is the definition of expected value.
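Here is the same calculation in code, assuming nothing beyond plain Python; both routes give the same answer:

values = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]

# Ordinary mean
plain_mean = sum(values) / len(values)

# Mean as sum of value * probability
probs = {v: values.count(v) / len(values) for v in set(values)}
weighted_mean = sum(v * p for v, p in probs.items())

print(plain_mean, weighted_mean)  # 2.0 2.0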
Computing E(X) from our data
import collections
import pandas as pd

P = 0.64
N = 10
M = 100000

l = pmf(P, N, M)
counter = collections.Counter(l)
df = pd.DataFrame(list(counter.items()), columns=['Heads', 'Count'])
df = df.sort_values('Heads').reset_index(drop=True)
df['Probability'] = df['Count'] / M
print(df)

If you sum all (Count × Heads) values you get about 640,000 (639,901 in one run). Divide by 100,000 trials and you get $\bar{x} \approx 6.4$.
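Equivalently, the expected value drops straight out of the table as $\sum x \cdot P(x)$. Continuing with the df built above, a quick check against the theoretical binomial mean $N \cdot p$ (a standard result, stated here just for comparison):

expectation = (df['Heads'] * df['Probability']).sum()
print(expectation)  # ≈ 6.4
print(N * P)        # 6.4, the theoretical E(X)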
Variance
The mean tells you where the distribution is centred. Variance tells you how spread out it is.
The obvious first attempt: sum the signed distances from the mean. The problem is that values below the mean give negative distances and values above give positive, and they cancel exactly. In fact, the signed deviations from the mean always sum to zero, for any dataset, so this measure of spread would be zero for every distribution.
Solution: square the distances before summing. This removes the sign and penalises large deviations more heavily. Then take the expectation (mean) of those squared distances:

$\mathrm{Var}(X) = E\big[(X - \mu)^2\big]$

Expand and simplify using $\mu = E(X)$ and the linearity of expectation:

$\mathrm{Var}(X) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$
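Continuing with the same df, a small sketch that computes the variance both ways and compares it with the theoretical binomial variance $N \cdot p \cdot (1 - p)$ (again stated only for comparison, not derived here):

mu = (df['Heads'] * df['Probability']).sum()

# Definition: E[(X - mu)^2]
var_def = ((df['Heads'] - mu) ** 2 * df['Probability']).sum()

# Shortcut: E(X^2) - mu^2
var_short = (df['Heads'] ** 2 * df['Probability']).sum() - mu ** 2

print(var_def, var_short)  # both ≈ 2.3
print(N * P * (1 - P))     # 2.304, the theoretical Var(X)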
Why does any of this matter for ML?
When you train a machine learning model, what are you actually doing? You are asking it to learn the probability distribution of your data.
If a model can predict the right expectation and variance for the 10-coin example with 1000 samples, it can generalise to 100,000 samples — because the distribution itself does not change. Only the noise around it shrinks.
The coin example has a special name: the Binomial Distribution. But the point here was not the binomial distribution itself. It was to understand what PMF, expectation, and variance mean — because these concepts appear everywhere in ML.
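For the curious, the binomial PMF has a closed form, $P(X = k) = \binom{N}{k} p^k (1 - p)^{N - k}$ (a standard result, not derived in this blog). A minimal sketch that tabulates it for our 10-coin example, so you can compare the theoretical bars with the simulated ones:

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k): probability of exactly k heads out of n coins with bias p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for k in range(11):
    print(k, round(binomial_pmf(k, 10, 0.64), 4))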
To read about continuous distributions, see the normal distribution blog.
Related: ROC curve and AUC from scratch
Questions? me@arshad-kazi.com