Probability is a topic we all learned in high-school maths. Since you are reading this blog, those concepts have probably gathered some rust. We will remove that rust and revisit them from the machine learning point of view.
One prerequisite: you should know how we get a probability of 1/2 from tossing a fair coin. If you know that, you are ready to learn Probability Mass Functions.
By the end of this blog you will know what probability is, what a probability distribution and a probability mass function are, how expectation and variance are defined, and why all of this matters for machine learning.
What is Probability?
Take an example: tossing a coin. It will give you either heads or tails.
How can you predict beforehand whether the next toss will be heads? You cannot know for sure. But you can quantify the likelihood.
You count how many times you tossed the coin, how many times you got heads, and how many times tails. Then you use the relative frequency of heads as your probability estimate for the next toss.
Suppose you toss a coin 100 times and get 64 heads and 36 tails. Your estimate is $P(H) = 0.64$. That does not mean the coin will land heads exactly 64 times in the next 100 tosses. Probability is a measure, not a guarantee. Toss it a billion times and the relative frequency will converge on the coin's true bias, which here we take to be 64%.
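To make the frequency idea concrete, here is a minimal sketch (with a hypothetical helper, estimate_p_heads, not part of the blog's later code) that simulates a coin whose true bias we set to 0.64 and watches the relative frequency settle as the number of tosses grows:

import random

random.seed(0)  # reproducible runs

def estimate_p_heads(true_p, n_tosses):
    # Relative frequency of heads in n_tosses flips of a coin with bias true_p
    heads = sum(1 for _ in range(n_tosses) if random.random() < true_p)
    return heads / n_tosses

# The estimate wobbles at small sample sizes and settles near 0.64 as n grows
for n in (100, 10_000, 1_000_000):
    print(n, estimate_p_heads(0.64, n))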
Code: plotting the PMF for one coin
import random
import matplotlib.pyplot as plt
import seaborn as sns

def flip(p):
    # One flip of a coin that lands heads with probability p
    return 'H' if random.random() < p else 'T'

def pmf(P, N, M):
    # N = coins per sample, M = number of samples
    # Returns the head count of each of the M samples
    l = []
    for m in range(M):
        flips = [flip(P) for _ in range(N)]
        l.append(flips.count('H'))
    return l

P = 0.64  # biased coin
N = 1     # one coin per sample
M = 100   # 100 samples

sns.histplot(pmf(P, N, M), stat='probability', bins=2)
plt.ylabel('PMF')  # set after plotting, so seaborn's default label does not overwrite it
plt.show()

With one coin you get two bars: one for tails, one for heads. The heights are the relative frequencies, i.e. empirical probabilities.
Probability Distribution
A probability distribution is a function that gives the probabilities of all possible outcomes of an experiment.
Now suppose you toss 10 coins at once. How many heads do you expect? Instead of two outcomes you now have eleven: 0 heads through 10 heads. Toss the set 1000 times and count how often each outcome appears.
P = 0.64   # same biased coin
N = 10     # 10 coins per sample
M = 1000   # 1000 samples

sns.histplot(pmf(P, N, M), stat='probability', discrete=True)  # one bar per integer outcome
plt.ylabel('PMF')
plt.show()

You get a histogram with 11 bars. Each bar height is the probability of getting exactly that many heads in a single 10-coin toss. This shape is the probability distribution.
PMF: 10 coins, p = 0.64 (binomial distribution)
Probability Mass Function
A probability mass function (PMF) is a function that gives the probability that a discrete random variable is exactly equal to some value.
The key word is discrete. The number of heads is 0, 1, 2, ... 10. No fractions. A continuous random variable (like temperature) would need a different function. We will cover that in the normal distribution blog.
Two properties every PMF must satisfy:
1. Every probability is non-negative: $0 \le P(x) \le 1$ for each outcome $x$.
2. The probabilities of all outcomes sum to 1: $\sum_x P(x) = 1$.
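As a quick sanity check, here is a small sketch (reusing the pmf helper defined above) that builds the empirical PMF for the 10-coin experiment and verifies both properties:

import collections

P, N, M = 0.64, 10, 1000
counts = collections.Counter(pmf(P, N, M))
probs = {heads: count / M for heads, count in counts.items()}

print(all(0 <= p <= 1 for p in probs.values()))  # property 1: True
print(sum(probs.values()))                       # property 2: 1.0 (up to float rounding)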
Expected Value (Expectation)
The expected value of a random variable is its theoretical mean. To understand it, start from the ordinary mean.
Take the set $\lbrace 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 \rbrace$. The mean is $20/10 = 2$. Now rewrite it using frequencies:

$\bar{x} = 1 \cdot \frac{4}{10} + 2 \cdot \frac{2}{10} + 3 \cdot \frac{4}{10} = 2$

Each value multiplied by its relative frequency, which is just each value multiplied by its probability. So the mean is $E(X) = \sum_x x \cdot P(x)$. That is the definition of expected value.
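Here is the same calculation in code, assuming nothing beyond plain Python; both routes give the same answer:

values = [1, 1, 1, 1, 2, 2, 3, 3, 3, 3]

# Ordinary mean
plain_mean = sum(values) / len(values)

# Mean as sum of value * probability
probs = {v: values.count(v) / len(values) for v in set(values)}
weighted_mean = sum(v * p for v, p in probs.items())

print(plain_mean, weighted_mean)  # 2.0 2.0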
Computing E(X) from our data
import collections
import pandas as pd

P = 0.64
N = 10
M = 100000

l = pmf(P, N, M)
counter = collections.Counter(l)
df = pd.DataFrame(list(counter.items()), columns=['Heads', 'Count'])
df = df.sort_values('Heads').reset_index(drop=True)
df['Probability'] = df['Count'] / M
print(df)

If you sum all (Count × Heads) values you get about 640,000 (639,901 in one run). Divide by 100,000 trials and you get $\bar{x} \approx 6.4$.
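Equivalently, the expected value drops straight out of the table as $\sum x \cdot P(x)$. Continuing with the df built above, a quick check against the theoretical binomial mean $N \cdot p$ (a standard result, stated here just for comparison):

expectation = (df['Heads'] * df['Probability']).sum()
print(expectation)  # ≈ 6.4
print(N * P)        # 6.4, the theoretical E(X)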
Variance
The mean tells you where the distribution is centred. Variance tells you how spread out it is.
The obvious first attempt: sum the signed distances from the mean. The problem is that values below the mean give negative distances and values above give positive, and they cancel exactly. In fact, the signed deviations from the mean always sum to zero, for any dataset, so this measure of spread would be zero for every distribution.
Solution: square the distances before summing. This removes the sign and penalises large deviations more heavily. Then take the expectation (mean) of those squared distances:

$\mathrm{Var}(X) = E\big[(X - \mu)^2\big]$

Expand and simplify using $\mu = E(X)$ and the linearity of expectation:

$\mathrm{Var}(X) = E(X^2 - 2\mu X + \mu^2) = E(X^2) - 2\mu E(X) + \mu^2 = E(X^2) - \mu^2$
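Continuing with the same df, a small sketch that computes the variance both ways and compares it with the theoretical binomial variance $N \cdot p \cdot (1 - p)$ (again stated only for comparison, not derived here):

mu = (df['Heads'] * df['Probability']).sum()

# Definition: E[(X - mu)^2]
var_def = ((df['Heads'] - mu) ** 2 * df['Probability']).sum()

# Shortcut: E(X^2) - mu^2
var_short = (df['Heads'] ** 2 * df['Probability']).sum() - mu ** 2

print(var_def, var_short)  # both ≈ 2.3
print(N * P * (1 - P))     # 2.304, the theoretical Var(X)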
Why does any of this matter for ML?
When you train a machine learning model, what are you actually doing? You are asking it to learn the probability distribution of your data.
If a model can predict the right expectation and variance for the 10-coin example with 1000 samples, it can generalise to 100,000 samples — because the distribution itself does not change. Only the noise around it shrinks.
The coin example has a special name: the Binomial Distribution. But the point here was not the binomial distribution itself. It was to understand what PMF, expectation, and variance mean — because these concepts appear everywhere in ML.
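For the curious, the binomial PMF has a closed form, $P(X = k) = \binom{N}{k} p^k (1 - p)^{N - k}$ (a standard result, not derived in this blog). A minimal sketch that tabulates it for our 10-coin example, so you can compare the theoretical bars with the simulated ones:

from math import comb

def binomial_pmf(k, n, p):
    # P(X = k): probability of exactly k heads out of n coins with bias p
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

for k in range(11):
    print(k, round(binomial_pmf(k, 10, 0.64), 4))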
To read about continuous distributions, see the normal distribution blog.
Related: ROC curve and AUC from scratch
Questions? me@arshad-kazi.com