Probability for Data Science
eBook  ›  Chapter 3 · Discrete Random Variables
Section 3.5

Common Discrete Random Variables

In the previous sections, we have conveyed three key concepts: one about the random variable, one about the PMF, and one about the mean. The next step is to introduce a few commonly used discrete random variables so that you have something concrete in your “toolbox.” As we have mentioned before, these predefined random variables should be studied from a synthesis perspective (sometimes called generative). The plan for this section is to introduce several models, derive their theoretical properties, and discuss examples.

Note that some extra effort will be required to understand the origins of the random variables. The origins of random variables are usually overlooked, but they are more important than the equations. For example, we will shortly discuss the Poisson random variable and its PMF \(p_X(k) = \frac{\lambda^ke^{-\lambda}}{k!}\). Why is the Poisson random variable defined in this way? If you know how the Poisson PMF was originally derived, you will understand the assumptions made during the derivation. Consequently, you will know why Poisson is a good model for internet traffic, recommendation scores, and image sensors for computer vision applications. You will also know under what situation the Poisson model will fail. Understanding the physics behind the probability models is the focus of this section.

3.5.1Bernoulli random variable

We start discussing the simplest random variable, namely the Bernoulli random variable. A Bernoulli random variable is a coin-flip random variable. The random variable has two states: either 1 or 0. The probability of getting 1 is \(p\), and the probability of getting 0 is \(1-p\). See Figure 3.21 for an illustration. Bernoulli random variables are useful for all kinds of binary-state events: coin flip (H or T), binary bit (1 or 0), true or false, yes or no, present or absent, Democrat or Republican, etc.

To make these notions more precise, we define a Bernoulli random variable as follows.

Definition 3.8

Let \(X\) be a Bernoulli random variable. Then, the PMF of \(X\) is

$$p_X(0) = 1-p, \quad\quad p_X(1) = p,$$

where \(0 < p < 1\) is called the Bernoulli parameter. We write $$X \sim \mathrm{Bernoulli}(p)$$ to say that \(X\) is drawn from a Bernoulli distribution with a parameter \(p\).

In this definition, the parameter \(p\) controls the probability of obtaining 1. In a coin-flip event, \(p\) is usually \(\tfrac{1}{2}\), meaning that the coin is fair. However, for biased coins \(p\) is not necessarily \(\tfrac{1}{2}\). For other situations such as binary bits (0 or 1), the probability of obtaining 1 could be very different from the probability of obtaining 0.

Figure 3.21
Figure 3.21. A Bernoulli random variable has two states with probability \(p\) and \(1-p\).

In MATLAB and Python, generating Bernoulli random variables can be done by calling the binomial random number generator np.random.binomial (Python) and binornd (MATLAB). When the parameter n is equal to 1, the binomial random variable is equivalent to a Bernoulli random variable. The MATLAB and Python codes to synthesize a Bernoulli random variable are shown below.

MATLAB
% MATLAB code to generate 1000 Bernoulli random variables
    p = 0.5;
    n = 1;
    X = binornd(n,p,[1000,1]);
    [num, ~] = hist(X, 10);
    bar(linspace(0,1,10), num,`FaceColor',[0.4, 0.4, 0.8]);
Python
# Python code to generate 1000 Bernoulli random variables
    import numpy as np
    import matplotlib.pyplot as plt
    p = 0.5
    n = 1
    X = np.random.binomial(n,p,size=1000)
    plt.hist(X,bins=`auto')

An alternative method in Python is to call stats.bernoulli.rvs to generate random Bernoulli numbers.

Python
# Python code to call scipy.stats library
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    p = 0.5
    X  = stats.bernoulli.rvs(p,size=1000)
    plt.hist(X,bins=`auto');

Properties of Bernoulli random variables

Let us now derive a few key statistical properties of a Bernoulli random variable.

Theorem 3.6

If \(X \sim \mathrm{Bernoulli}(p)\), then

$$\begin{aligned} \E[X] = p, \quad\quad \E[X^2]= p, \quad\quad \Var[X]= p(1-p). \end{aligned}$$

Proof. The expectation can be computed as

$$\begin{aligned} \E[X] &= (1)p_X(1) + (0)p_X(0) = (1)(p) + (0)(1-p) = p. \end{aligned}$$

The second moment is

$$\begin{aligned} \E[X^2] &= (1^2)(p) + (0^2)(1-p) = p. \end{aligned}$$

Therefore, the variance is

$$\begin{aligned} \Var[X] &= \E[X^2]-\mu^2 = p - p^2 = p(1-p). \end{aligned}$$

A useful property of the Python code is that we can construct an object rv. Then we can call rv's attributes to determine its mean, variance, etc.

Python
# Python code to generate a Bernoulli rv object
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    p = 0.5
    rv = stats.bernoulli(p)
    mean, var = rv.stats(moments=`mv')
    print(mean, var)

In both MATLAB and Python, we can plot the PMF of a Bernoulli random variable, such as the one shown in Figure 3.22. To do this in MATLAB, we call the function binopdf, with the evaluation points specified by x.

Figure 3.22
Figure 3.22. An example of a theoretical PMF (not the empirical histogram) plotted by MATLAB.
MATLAB
% MATLAB code to plot the PMF of a Bernoulli
    p = 0.3;
    x = [0,1];
    f = binopdf(x,1,p);
    stem(x, f, `bo', `LineWidth', 8);

In Python, we construct a random variable rv. With rv, we can call its PMF rv.pmf:

Python
# Python code to plot the PMF of a Bernoulli
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    p = 0.3
    rv = stats.bernoulli(p)
    x  = np.linspace(0, 1, 2)
    f  = rv.pmf(x)
    plt.plot(x, f, `bo', ms=10);
    plt.vlines(x, 0, f, colors=`b', lw=5, alpha=0.5);

When will a Bernoulli random variable have the maximum variance?

Let us take a look at the variance of the Bernoulli random variable. For any given \(p\), the variance is \(p(1-p)\). This is a quadratic equation. If we let \(V(p) = p(1-p)\), we can show that the maximum is attained at \(p = 1/2\). To see this, take the derivative of \(V(p)\) with respect to \(p\). This will give us \(\frac{d}{dp}V(p) = 1-2p\). Equating to zero yields \(1-2p = 0\), so \(p = 1/2\). We know that \(p = 1/2\) is a maximum and not a minimum point because the second-order derivative \(V''(p) = -2\), which is negative. Therefore \(V(p)\) is maximized at \(p = 1/2\). Now, since \(0 \le p \le 1\), we also know that \(V(0) = 0\) and \(V(1) = 0\). Therefore, the variance is minimized at \(p = 0\) and \(p = 1\). Figure 3.23 shows a graph of the variance.

Figure 3.23
Figure 3.23. The variance of a Bernoulli reaches maximum at \(p = 1/2\).

Does this result make sense? Why is the variance maximized at \(p = 1/2\)? If we think about this problem more carefully, we realize that a Bernoulli random variable represents a coin-flip experiment. If the coin is biased such that it always gives heads, on the one hand, it is certainly a bad coin. However, on the other hand, the variance is zero because there is nothing to vary; you will certainly get heads. The same situation happens if the coin is biased towards tails. However, if the coin is fair, i.e., \(p = 1/2\), then the variance is large because we only have a 50% chance of getting a head or a tail whenever we flip a coin. Nothing is certain in this case. Therefore, the maximum variance happening at \(p = 1/2\) matches our intuition.

Rademacher random variable

A slight variation of the Bernoulli random variable is the Rademacher random variable, which has two states: \(+1\) and \(-1\). The probability of getting \(+1\) or \(-1\) is each \(1/2\). Therefore, the PMF of a Rademacher random variable is

$$p_X(-1) = \frac{1}{2}, \quad\mbox{and}\quad p_X(+1) = \frac{1}{2}.$$
Practice Exercise 3.8

Show that if \(X\) is a Rademacher random variable then \((X+1)/2 \sim \text{Bernoulli}(1/2)\). Also show the converse: If \(Y \sim \text{Bernoulli}(1/2)\) then \(2Y-1\) is a Rademacher random variable.

Solution

Since \(X\) can either be \(+1\) or \(-1\), we show that if \(X = +1\) then \((X+1)/2 = 1\) and if \(X = -1\) then \((X+1)/2 = 0\). The probabilities of getting \(+1\) and \(-1\) are equal. Thus, the probabilities of getting \((X+1)/2 = 1\) and \(0\) are also equal. So the resulting random variable is Bernoulli(1/2). The other direction can be proved similarly.

Bernoulli in social networks: the Erdos-R\'enyi graph

The study of networks is a big branch of modern data science. It includes social networks, computer networks, traffic networks, etc. The history of network science is very long, but one of the most basic models of a network is the Erdos-R\'enyi graph, named after Paul Erdos and Alfr\'ed R\'enyi. The underlying probabilistic model of the Erdos-R\'enyi graph is the Bernoulli random variable.

To see how a graph can be constructed from a Bernoulli random variable, we first introduce the concept of a graph. A graph contains two elements: nodes and edges. For node \(i\) and node \(j\), we denote the edge connecting \(i\) and \(j\) as \(A_{ij}\). Therefore, if we have \(N\) nodes, then we can construct a matrix \(\mA\) of size \(N \times N\). We call this matrix the adjacency matrix. For example, the adjacency matrix

$$\mA = \begin{bmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 0 \\ 1 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}$$

will have edges for node pairs \((1,2)\), \((1,3)\), and \((3,4)\). Note that in this example we assume that the adjacency matrix is symmetric, meaning that the graph is undirected. The “1” in the adjacency matrix indicates there is an edge, and “0” indicates there is no edge. So \(\mA\) represents a binary graph.

The Erdos-R\'enyi graph model says that the probability of getting an edge is an independent Bernoulli random variable. That is

$$A_{ij} \sim \text{Bernoulli}(p),$$

for \(i < j\). If we model the graph in this way, then the parameter \(p\) will control the density of the graph. High values of \(p\) mean that there is a higher chance for an edge to be present.

To illustrate the idea of an Erdos-R\'enyi graph, we show in Figure 3.24 a graph of 40 nodes. The edges are randomly selected by flipping a Bernoulli random variable with parameter \(p = 0.3, 0.5, 0.7, 0.9\). As we can see in the figure, a small value of \(p\) gives a graph with very sparse connectivity, whereas a large value of \(p\) gives a very densely connected graph. The bottom row of Figure 3.24 shows the corresponding adjacency matrices. Here, a white pixel denotes “1” in the matrix and a black pixel denotes “0” in the matrix.

Figure 3.24. The Erdos-R\'enyi graph. [Top] The graphs. [Bottom] The adjacency matrices.

While Erdos-R\'enyi graphs are elementary, their variations can be realistic models of social networks. The stochastic block model is one such model. In a stochastic block model, nodes form small communities within a large network. For example, there are many majors in a university. Students within the same major tend to have more interactions than with students of another major. The stochastic block model achieves this goal by partitioning the nodes into communities. Within each community, the nodes can have a high degree of connectivity. Across different communities, the connectivity will be much lower. Figure 3.25 illustrates a network and the corresponding adjacency matrix. In this example, the network has three communities.

Figure 3.25. A stochastic block model containing three communities. [Left] The graph. [Right] The adjacency matrix.

In network analysis, one of the biggest problems is determining the community structure and recovering the underlying probabilities. The former task is about grouping the nodes into blocks. This is a nontrivial problem because in practice the nodes are never arranged nicely, as shown in Figure 3.25. For example, why should Alice be node 1 and Bob be node 2? Since we never know the correct ordering of the nodes, partitioning the nodes into blocks requires various estimation techniques such as clustering or iterative estimation. Recovering the underlying probability is also not easy. Given an adjacency matrix, why can we assume that the underlying network is a stochastic block model? Even if the model is correct, there will be imperfect grouping in the previous step. As such, estimating the underlying probability in the presence of these uncertainties would pose additional challenges.

Today, network analysis remains one of the hottest areas in data science. Its importance derives from its broad scope and impact. It can be used to analyze social networks, opinion polls, marketing, or even genome analysis. Nevertheless, the starting point of these advanced subjects is the Bernoulli random variable, the random variable of a coin flip!

3.5.2Binomial random variable

Suppose we flip the coin \(n\) times and count the number of heads. Since each coin flip is a random variable (Bernoulli), the sum is also a random variable. It turns out that this new random variable is the binomial random variable.

Definition 3.9

Let \(X\) be a binomial random variable. Then, the PMF of \(X\) is

$$p_X(k) = {n \choose k} p^k (1-p)^{n-k}, \quad\quad k = 0,1,\ldots,n,$$

where \(0 < p < 1\) is the binomial parameter, and \(n\) is the number of trials. We write $$X \sim \mathrm{Binomial}(n,p)$$ to say that \(X\) is drawn from a binomial distribution with a parameter \(p\) of size \(n\).

To understand the meaning of a binomial random variable, consider a simple experiment consisting of flipping a coin three times. We know that all possible cases are HHH, HHT, HTH, THH, TTH, THT, HTT and TTT. Now, suppose we define \(X\) = number of heads. We want to write down the probability mass function. Effectively, we ask: What is the probability of getting 0 head, one head, two heads, and three heads? We can, of course, count and get the answer right away for a fair coin. However, suppose the coin is unfair, i.e., the probability of getting a head is \(p\) whereas that of a tail is \(1-p\). The probability of getting each of the 8 cases is shown in Figure 3.26 below.

Figure 3.26
Figure 3.26. The probability of getting \(k\) heads out of \(n = 3\) coins.

Here are the detailed calculations. Let us start with \(X = 3\).

$$\begin{aligned} p_X(3) &= \Pb[\{\text{HHH}\}]\\ &= \Pb[\{\text{H}\} \cap \{\text{H}\} \cap \{\text{H}\}] \\ &\overset{(a)}{=} \Pb[\{\text{H}\}]\Pb[\{\text{H}\}]\Pb[\{\text{H}\}] \\ &\overset{(b)}{=} p^3, \end{aligned}$$

where \((a)\) holds because the three events are independent. (Recall that if \(A\) and \(B\) are independent then \(\Pb[A \cap B] = \Pb[A]\Pb[B]\).) \((b)\) holds because each \(\Pb[\{\text{H}\}] = p\) by definition. With exactly the same argument, we can show that \(p_X(0) = \Pb[\{\text{TTT}\}] = (1-p)^3\).

Now, let us look at \(p_X(2)\), i.e., 2 heads. This probability can be calculated as follows:

$$\begin{aligned} p_X(2) &= \Pb[\{\text{HHT}\} \cup \{\text{HTH}\} \cup \{\text{THH}\}] \\ &\overset{(c)}{=} \Pb[\{\text{HHT}\}] + \Pb[\{\text{HTH}\}] + \Pb[\{\text{THH}\}]\\ &\overset{(d)}{=} p^2(1-p) + p^2(1-p) + p^2(1-p) = 3p^2(1-p), \end{aligned}$$

where \((c)\) holds because the three events HHT, HTH and THH are disjoint in the sample space. Note that we are not using the independence argument in \((c)\) but the disjoint argument. We should not confuse the two. The step in \((d)\) uses independence, because each coin flip is independent.

The above calculation shows an interesting phenomenon: Although the three events HHT, HTH, and THH are different (in fact, disjoint), the number of heads in all the cases is the same. This happens because when counting the number of heads, the ordering of the heads and tails does not matter. So the same problem can be formulated as finding the number of combinations of \(\{\) 2 heads and 1 tail \(\}\), which in our case is \({3 \choose 2} = 3\).

To complete the story, let us also try \(p_X(1)\). This probability is

$$\begin{aligned} p_X(1) = \Pb[\{\text{TTH}\} \cup \{\text{HTT}\} \cup \{\text{THT}\}] = 3p(1-p)^2. \end{aligned}$$

Again, we see that the combination \({3 \choose 1} = 3\) appears in front of the \(p(1-p)^2\).

In general, the way to interpret the binomial random variable is to decouple the probabilities \(p\), \((1-p)\), and the number of combinations \({n \choose k}\):

$$p_X(k) = \underset{\textcolor{blue}{\text{number of combinations}}}{\underbrace{ \qquad {n \choose k} \qquad }} \underset{\textcolor{blue}{\text{prob getting $k$ H's}}}{\underbrace{\qquad p^k \qquad}} \underset{\textcolor{blue}{\text{prob getting $n-k$ T's}}}{\underbrace{ \qquad (1-p)^{n-k} \qquad}}.$$

The running index \(k\) should go with \(0, 1, \ldots, n\). It starts with 0 because there could be zero heads in the sample space. Furthermore, we note that in this definition, two parameters are driving a binomial random variable: the number of Bernoulli trials \(n\) and the underlying probability for each coin flip \(p\). As such, the notation for a binomial random variable is \(\text{Binomial}(n,p)\), with two arguments.

The histogram of a binomial random variable is shown in Figure 3.27(a). Here, we consider the example where \(n = 10\) and \(p = 0.5\). To generate the histogram, we use 5000 samples. In MATLAB and Python, generating binomial random variables as in Figure 3.27(a) can be done by calling binornd and np.random.binomial.

Figure 3.27. An example of a binomial distribution with \(n = 10\), \(p = 0.5\).
MATLAB
% MATLAB code to generate 5000 Binomial random variables
    p = 0.5;
    n = 10;
    X = binornd(n,p,[5000,1]);
    [num, ~] = hist(X, 10);
    bar( num,`FaceColor',[0.4, 0.4, 0.8]);
Python
# Python code to generate 5000 Binomial random variables
    import numpy as np
    import matplotlib.pyplot as plt
    p = 0.5
    n = 10
    X = np.random.binomial(n,p,size=5000)
    plt.hist(X,bins=`auto');

Generating the ideal PMF of a binomial random variable as shown in Figure 3.27(b) can be done by calling binopdf in MATLAB. In Python, we can define a random variable rv through stats.binom, and call the PMF using rv.pmf.

MATLAB
% MATLAB code to generate a binomial PMF
    p = 0.5;
    n = 10;
    x = 0:10;
    f = binopdf(x,n,p);
    stem(x, f, 'o', 'LineWidth', 8, 'Color', [0.8, 0.4, 0.4]);
Python
# Python code to generate a binomial PMF
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    p = 0.5
    n = 10
    rv = stats.binom(n,p)
    x  = np.arange(11)
    f  = rv.pmf(x)
    plt.plot(x, f, 'bo', ms=10);
    plt.vlines(x, 0, f, colors='b', lw=5, alpha=0.5);

The shape of the binomial PMF is shown in Figure 3.28. In this set of figures, we vary one of the two parameters \(n\) and \(p\) while keeping the other fixed. In Figure 3.28(a), we fix \(n = 60\) and plot three sets of \(p = 0.1, 0.5, 0.9\). For small \(p\) the PMF is skewed towards the left, and for large \(p\) the PMF is skewed toward the right. Figure 3.28(b) shows the PMF for a fixed \(p = 0.5\). As we increase \(n\), the centroid of the PMF moves towards the right. Thus we should expect the mean of a binomial random variable to increase with \(n\). Another interesting observation is that as \(n\) increases, the shape of the PMF approaches the Gaussian function (the bell-shaped curve). We will explain the reason for this when we discuss the Central Limit Theorem.

Figure 3.28. PMFs of a binomial random variable \(X \sim \text{Binomial}(n,p)\). (a) We assume that \(n = 60\). By varying the probability \(p\), we see that the PMF shifts from the left to the right, and the shape changes. (b) We assume that \(p = 0.5\). By varying the number of trials, the PMF shifts and the shape becomes more “bell-shaped.”

The expectation, second moment, and variance of a binomial random variable are summarized in Theorem def: moment properties of binomial.

Theorem 3.7

If \(X \sim \mathrm{Binomial}(n,p)\), then

$$\begin{aligned} \E[X] &= np, \\ \E[X^2]&= np(np+(1-p)), \\ \Var[X]&= np(1-p). \end{aligned}$$

We will prove that \(\E[X] = np\) from first principles. For \(\E[X^2]\) and \(\Var[X]\), we will skip the proofs here and will introduce a “shortcut” later.

Proof. Let us start with the definition.

$$\begin{aligned} \E[X] &= \sum_{k=0}^n k \cdot {n \choose k} p^k (1-p)^{n-k} \\ &= \sum_{k=0}^n k \cdot \frac{n!}{k! (n-k)!} p^k (1-p)^{n-k} \\ &= \underset{0}{\underbrace{0 \cdot \frac{n!}{0! (n-0)!} p^0 (1-p)^{n-0}}} + \sum_{k=1}^n k \cdot \frac{n!}{k! (n-k)!} p^k (1-p)^{n-k} \\ &= \sum_{k=1}^n \frac{n!}{\textcolor{red}{(k-1)}! (n-k)!} p^k (1-p)^{n-k}. \end{aligned}$$

Note that we have shifted the index from \(k = 0\) to \(k=1\). Now let us apply a trick:

$$\begin{aligned} \E[X] &= \sum_{k=1}^n \frac{n!}{\textcolor{red}{(k-1)}! (n-k)!} p^k (1-p)^{n-k} \\ &= \sum_{k=1}^n \frac{n!}{\textcolor{red}{(k-1)}! (n-\textcolor{blue}{k-1+1})!} p^k (1-p)^{n-k}. \end{aligned}$$

Using this trick, we can show that

$$\begin{aligned} &\sum_{k=1}^n \frac{n!}{\textcolor{red}{(k-1)}! (n-\textcolor{blue}{k-1+1})!} p^k (1-p)^{n-k}\\ &= \sum_{k=1}^n \frac{n!}{(k-1)! (\textcolor{blue}{(n-1)}-\textcolor{blue}{(k-1)})!} p^k (1-p)^{n-k}\\ &= \sum_{k=1}^n \frac{\textcolor{red}{n} \textcolor{red}{(n-1)!}}{(k-1)! (\textcolor{blue}{(n-1)}-\textcolor{blue}{(k-1)})!} p^k (1-p)^{n-k}\\ &= \textcolor{red}{np} \sum_{k=1}^n \frac{ (n-1)!}{(k-1)! (\textcolor{blue}{(n-1)}-\textcolor{blue}{(k-1)})!} p^{\textcolor{red}{k-1}} (1-p)^{n-k} \end{aligned}$$

With a simple substitution of \(\ell = k-1\), the above equation can be rewritten as

$$\begin{aligned} \E[X] &= \textcolor{red}{np} \cdot \sum_{\textcolor{red}{\ell=0}}^{\textcolor{red}{n-1}} \frac{ (n-1)!}{ \textcolor{red}{\ell} ! ((n-1)-\textcolor{red}{\ell})!} p^{\ell} (1-p)^{n-1-\ell}\\ &= np \cdot \underset{\textcolor{blue}{\text{summing PMF of Binomial}(n-1,p)}}{\underbrace{\sum_{\ell=0}^{n-1} {n-1 \choose \ell} p^{\ell} (1-p)^{n-1-\ell}}} = np. \end{aligned}$$

In MATLAB, the mean and variance of a binomial random variable can be found by calling the command binostat(n,p) (MATLAB).

In Python, the command is rv = stats.binom(n,p) followed by calling rv.stats.

MATLAB
% MATLAB code to compute the mean and var of a binomial rv
    p = 0.5;
    n = 10;
    [M,V] = binostat(n, p)
Python
# Python code to compute the mean and var of a binomial rv
    import scipy.stats as stats
    p = 0.5
    n = 10
    rv = stats.binom(n,p)
    M, V = rv.stats(moments=`mv')
    print(M, V)

An alternative view of the binomial random variable. As we discussed, the origin of a binomial random variable is the sum of a sequence of Bernoulli random variables. Because of this intrinsic definition, we can derive some useful results by exploiting this fact. To do so, let us define \(I_1,\ldots,I_n\) as a sequence of Bernoulli random variables with \(I_j \sim \mathrm{Bernoulli}(p)\) for all \(j = 1,\ldots,n\). Then the resulting variable

$$X = I_1 + I_2 + \cdots + I_n$$

is a binomial random variable of size \(n\) and parameter \(p\). Using this definition, we can compute the expectation as follows:

$$\begin{aligned} \E[X] &= \E[I_1 + I_2 + \cdots + I_n] \\ &\overset{(a)}{=} \E[I_1] + \E[I_2] + \cdots + \E[I_n] \\ &= p + p + \cdots + p \\ &= np. \end{aligned}$$

In this derivation, the step \((a)\) depends on a useful fact about expectation (which we have not yet proved): For any two random variables \(X\) and \(Y\), it holds that \(\E[X+Y] = \E[X] + \E[Y]\). Therefore, we can show that the expectation of \(X\) is \(np\). This line of argument not only simplifies the proof but also provides a good intuition of the expectation. If each coin flip has an expectation of \(\E[I_i] = p\), then the expectation of the sum should be simply \(n\) times \(p\), giving \(np\).

How about the variance? Again, we are going to use a very useful fact about variance: If two random variables \(X\) and \(Y\) are independent, then \(\Var[X+Y] = \Var[X] + \Var[Y]\). With this result, we can show that

$$\begin{aligned} \Var[X] &= \Var[I_1+\cdots+I_n] \\ &= \Var[I_1] + \cdots + \Var[I_n] \\ &= p(1-p) + \cdots + p(1-p) \\ &= np(1-p). \end{aligned}$$

Finally, using the fact that \(\Var[X] = \E[X^2] - \mu^2\), we can show that

$$\begin{aligned} \E[X^2] &= \Var[X] + \mu^2 \\ &= np(1-p) + (np)^2. \end{aligned}$$
Practice Exercise 3.9

Show that the binomial PMF sums to 1.

Solution

We use the binomial theorem to prove this result:

$$\begin{aligned} \sum_{k=0}^n p_X(k) = \sum_{k=0}^n {n \choose k} p^k (1-p)^{n-k} = (\textcolor{red}{p} + \textcolor{blue}{(1-p)})^n = 1. \end{aligned}$$

The CDF of the binomial random variable is not very informative. It is basically the cumulative sum of the PMF:

$$F_X(k) = \sum_{\ell=0}^{k} {n \choose \ell} p^{\ell}(1-p)^{n-\ell}.$$

The shapes of the PMF and the CDF is shown in Figure 3.29.

Figure 3.29. PMF and CDF of a binomial random variable \(X \sim \text{Binomial}(n,p)\).

In MATLAB, plotting the CDF of a binomial can be done by calling the function binocdf. You may also call f = binopdf(x,n,p), and define F = cumsum(f) as the cumulative sum of the PMF. In Python, the corresponding command is rv = stats.binom(n,p) followed by rv.cdf.

MATLAB
% MATLAB code to compute the mean and var of a binomial rv
    x = 0:10;
    p = 0.5;
    n = 10;
    F = binocdf(x,n,p);
    figure; stairs(x,F,`.-',`LineWidth',4,`MarkerSize',30);
Python
# Python code to compute the mean and var of a binomial rv
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    p = 0.5
    n = 10
    rv = stats.binom(n,p)
    x  = np.arange(11)
    F  = rv.cdf(x)
    plt.plot(x, F, 'bo', ms=10);
    plt.vlines(x, 0, F, colors='b', lw=5, alpha=0.5);

3.5.3Geometric random variable

In some applications, we are interested in trying a binary experiment until we succeed. For example, we may want to keep calling someone until the person picks up the call. In this case, the random variable can be defined as the outcome of many failures followed by a final success. This is called the geometric random variable.

Definition 3.10

Let \(X\) be a geometric random variable. Then, the PMF of \(X\) is

$$p_X(k) = (1-p)^{k-1}p, \quad\quad k = 1,2,\ldots,$$

where \(0 < p < 1\) is the geometric parameter. We write $$X \sim \mathrm{Geometric}(p)$$ to say that \(X\) is drawn from a geometric distribution with a parameter \(p\).

A geometric random variable is easy to understand. We define it as Bernoulli trials with \(k-1\) consecutive failures followed by one success. This can be seen from the definition:

$$p_X(k) = \underset{\textcolor{blue}{\text{$k-1$ failures}}}{\underbrace{ \qquad (1-p)^{k-1} \qquad}} \underset{\textcolor{blue}{\text{final success}}}{\underbrace{ \qquad p \qquad}}.$$

Note that in geometric random variables, there is no \({n \choose k}\) because we must have \(k-1\) consecutive failures before one success. There is no alternative combination of the sequence.

The histogram and PMF of a geometric random variable are illustrated in Figure 3.30. Here, we assume that \(p = 0.5\).

Figure 3.30. An example of a geometric distribution with \(p = 0.5\).

In MATLAB, generating geometric random variables can be done by calling the commands geornd. In Python, it is np.random.geometric.

MATLAB
% MATLAB code to generate 1000 geometric random variables
    p = 0.5;
    X = geornd(p,[5000,1]);
    [num, ~] = hist(X, 0:10);
    bar(0:10, num, `FaceColor',[0.4, 0.4, 0.8]);
Python
# Python code to generate 1000 geometric random variables
    import numpy as np
    import matplotlib.pyplot as plt
    p = 0.5
    X = np.random.geometric(p,size=1000)
    plt.hist(X,bins=`auto');

To generate the PMF plots, we use geopdf and rv = stats.geom.

MATLAB
% MATLAB code to generate geometric PMF
    p = 0.5; x = 0:10;
    f = geopdf(x,p);
    stem(x, f, `o', `LineWidth', 8, `Color', [0.8, 0.4, 0.4]);
Python
# Python code to generate 1000 geometric random variables
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    x  = np.arange(1,11)
    rv = stats.geom(p)
    f  = rv.pmf(x)
    plt.plot(x, f, `bo', ms=8, label=`geom pmf')
    plt.vlines(x, 0, f, colors=`b', lw=5, alpha=0.5)
Practice Exercise 3.10

Show that the geometric PMF sums to one.

Solution

We can apply infinite series to show the result:

$$\begin{aligned} \sum_{k=1}^{\infty}p_X(k) &= \sum_{k=1}^{\infty} (1-p)^{k-1}p \\ &= p \cdot \sum_{k=1}^{\infty} (1-p)^{k-1}, \qquad\qquad \ell = k-1 \\ &= p \cdot \sum_{\ell=0}^{\infty} (1-p)^{\ell} \\ &= p \cdot \frac{1}{1-(1-p)} = 1. \end{aligned}$$

It is interesting to compare the shape of the PMFs for various values of \(p\). In Figure 3.31 we show the PMFs. We vary the parameter \(p = 0.25, 0.5, 0.9\). For small \(p\), the PMF starts with a low value and decays at a slow speed. The opposite happens for a large \(p\), where the PMF starts with a high value and decays rapidly.

Furthermore, we can derive the following properties of the geometric random variable.

Theorem 3.8

If \(X \sim \mathrm{Geometric}(p)\), then

$$\begin{aligned} \E[X] &= \frac{1}{p}, \qquad \E[X^2]= \frac{2}{p^2}-\frac{1}{p}, \\ \Var[X]&= \frac{1-p}{p^2}. \end{aligned}$$
Figure 3.31. PMFs of a geometric random variable \(X \sim \text{Geometric}(p)\).

Proof. We will prove that the mean is \(1/p\) and leave the second moment and variance as an exercise.

$$\begin{aligned} \E[X] &= \sum_{k=1}^{\infty} k p(1-p)^{k-1} = p \left(\sum_{k=1}^{\infty} k (1-p)^{k-1}\right) \overset{(a)}{=} p \left( \frac{1}{(1-(1-p))^2}\right) = \frac{1}{p}, \end{aligned}$$

where \((a)\) follows from the infinite series identity in Chapter 1.

3.5.4Poisson random variable

In many physical systems, the arrivals of events are typically modeled as a Poisson random variable, e.g., photon arrivals, electron emissions, and telephone call arrivals. In social networks, the number of conversations per user can also be modeled as a Poisson. In e-commerce, the number of transactions per paying user is again modeled using a Poisson.

Definition 3.11

Let \(X\) be a Poisson random variable. Then, the PMF of \(X\) is

$$p_X(k) = \frac{\lambda^k}{k!}e^{-\lambda}, \quad\quad k = 0,1,2,\ldots,$$

where \(\lambda > 0\) is the Poisson rate. We write $$X \sim \mathrm{Poisson}(\lambda)$$ to say that \(X\) is drawn from a Poisson distribution with a parameter \(\lambda\).

In this definition, the parameter \(\lambda\) determines the rate of the arrival. The histogram and PMF of a Poisson random variable are illustrated in Figure 3.32. Here, we assume that \(\lambda = 1\).

Figure 3.32. An example of a Poisson distribution with \(\lambda = 1\).

The MATLAB code and Python code used to generate the histogram are shown below.

MATLAB
% MATLAB code to generate 5000 Poisson numbers
    lambda = 1;
    X = poissrnd(lambda,[5000,1]);
    [num, ~] = hist(X, 0:10);
    bar(0:10, num, `FaceColor',[0.4, 0.4, 0.8]);
Python
# Python code to generate 5000 Poisson random variables
    import numpy as np
    import matplotlib.pyplot as plt
    lambd = 1
    X = np.random.poisson(lambd,size=5000)
    plt.hist(X,bins=`auto');

For the PMF, in MATLAB we can call poisspdf, and in Python we can call rv.pmf with rv = stats.poisson.

MATLAB
% MATLAB code to plot the Poisson PMF
    lambda = 1;
    x = 0:10;
    f = poisspdf(x,lambda);
    stem(x, f, `o', `LineWidth', 8, `Color', [0.8, 0.4, 0.4]);
Python
# Python code to plot the Poisson PMF
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    x = np.arange(0,11)
    rv = stats.poisson(lambd)
    f  = rv.pmf(x)
    plt.plot(x, f, `bo', ms=8, label=`geom pmf')
    plt.vlines(x, 0, f, colors=`b', lw=5, alpha=0.5)

The shape of the Poisson PMF changes with \(\lambda\). As illustrated in Figure 3.33, \(p_X(k)\) is more concentrated at lower values for smaller \(\lambda\) and becomes spread out for larger \(\lambda\). Thus, we should expect that the mean and variance of a Poisson random variable will change together as a function of \(\lambda\). In the same figure, we show the CDF of a Poisson random variable. The CDF of a Poisson is

$$F_X(k) = \Pb[X \le k] = \sum_{\ell=0}^{k} \frac{\lambda^{\ell}}{\ell!}e^{-\lambda}.$$
Figure 3.33. A Poisson random variable using different \(\lambda\)'s. [Left] Probability mass function \(p_X(k)\). [Right] Cumulative distribution function \(F_X(k)\).
Example 3.20

Let \(X\) be a Poisson random variable with parameter \(\lambda\). Find \(\Pb[X > 4]\) and \(\Pb[X \le 5]\).

Solution

$$\begin{aligned} \Pb[X > 4] &= 1 - \Pb[X \le 4] = 1- \sum_{k=0}^4 \frac{\lambda^k}{k!}e^{-\lambda}, \\ \Pb[X \le 5] &= \sum_{k=0}^{5} \frac{\lambda^k}{k!}e^{-\lambda}. \end{aligned}$$
Practice Exercise 3.11

Show that the Poisson PMF sums to 1.

Solution

We use the exponential series to prove this result:

$$\begin{aligned} \sum_{k=0}^{\infty}p_X(k) = \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}e^{-\lambda} = e^{-\lambda} \cdot \underset{=e^{\lambda}}{\underbrace{ \; \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \;}} = 1. \end{aligned}$$

Poisson random variables in practice

(1) Computational photography. In computational photography, the Poisson random variable is one of the most widely used models for photon arrivals. The reason pertains to the origin of the Poisson random variable, which we will discuss shortly. When photons are emitted from the source, they travel through the medium as a sequence of independent events. During the integration period of the camera, the photons are accumulated to generate a voltage that is then translated to digital bits.

If we assume that the photon arrival rate is \(\alpha\) (photons per second), and suppose that the total amount of integration time is \(t\), then the average number of photons that the sensor can see is \(\alpha t\). Let \(X\) be the number of photons seen during the integration time. Then if we follow the Poisson model, we can write down the PMF of \(X\):

$$\Pb[X = k ] = \frac{(\alpha t)^k}{k!} e^{-\alpha t}.$$

Therefore, if a pixel is bright, meaning that \(\alpha\) is large, then \(X\) will have a higher likelihood of landing on a large number.

(2) Traffic model. The Poisson random variable can be used in many other problems. For example, we can use it to model the number of passengers on a bus or the number of spam phone calls. The required modification to Figure 3.34 is almost trivial: merely replace the photons with your favorite cartoons, e.g., a person or a phone, as shown in Figure 3.35. In the United States, shared-ride services such as Uber and Lyft need to model the vacant cars and the passengers. As long as they have an arrival rate and certain degrees of independence between events, the Poisson random variable will be a good model.

Figure 3.34
Figure 3.34. The Poisson random variable can be used to model photon arrivals.
Figure 3.35
Figure 3.35. The Poisson random variable can be used to model passenger arrivals and the number of phone calls, and can be used by Uber or Lyft to provide shared rides.

As you can see from these examples, the Poisson random variable has broad applicability. Before we continue our discussion of its applications, let us introduce a few concepts related to the Poisson random variable.

Properties of a Poisson random variable

We now derive the mean and variance of a Poisson random variable.

Theorem 3.9

If \(X \sim \mathrm{Poisson}(\lambda)\), then

$$\begin{aligned} \E[X] &= \lambda, \qquad \E[X^2]= \lambda + \lambda^2, \\ \Var[X]&= \lambda. \end{aligned}$$

Proof. Let us first prove the mean. It can be shown that

$$\begin{aligned} \E[X] = \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k}{k!}e^{-\lambda} &= \sum_{k=1}^{\infty} \frac{\lambda^k}{\textcolor{red}{(k-1)!}}e^{-\lambda} \\ &= \textcolor{red}{\lambda e^{-\lambda}} \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!} = \lambda e^{-\lambda} \sum_{\ell=0}^{\infty} \frac{\lambda^{\ell}}{\ell!} = \lambda e^{-\lambda} e^{\lambda} = \lambda. \end{aligned}$$

The second moment can be computed as

$$\begin{aligned} \E[X^2] &= \sum_{k=0}^{\infty} k^2 \cdot \frac{\lambda^k}{k!}e^{-\lambda} \\ &= \sum_{k=0}^{\infty} k \cdot \frac{\lambda^k}{\textcolor{red}{(k-1)!}}e^{-\lambda}\\ &= \sum_{k=0}^{\infty} (k \textcolor{blue}{- 1 + 1}) \cdot \frac{\lambda^k}{(k-1)!}e^{-\lambda} \\ &= \sum_{k=1}^{\infty} (k-1) \cdot \frac{\lambda^k}{(k-1)!}e^{-\lambda} + \sum_{k=1}^{\infty} \frac{\lambda^k}{(k-1)!}e^{-\lambda}\\ &= \lambda^2 \cdot \underset{=1}{\underbrace{\sum_{k=2}^{\infty} \frac{\lambda^{k-2} e^{-\lambda}}{(k-2)!}}} + \lambda \cdot \underset{=1}{\underbrace{\sum_{k=1}^{\infty} \frac{\lambda^{k-1} e^{-\lambda}}{(k-1)!}}}. \end{aligned}$$

The variance can be computed using \(\Var[X] = \E[X^2] - \mu^2\).

To compute the mean and variance of a Poisson random variable, we can call poisstat in MATLAB and rv.stats(moments=`mv') in Python.

MATLAB
% MATLAB code to compute Poisson statistics
    lambda = 1;
    [M,V] = poisstat(lambda);
Python
# Python code to compute Poisson statistics
    import scipy.stats as stats
    lambd = 1
    rv = stats.poisson(lambd)
    M, V = rv.stats(moments='mv')

The Poisson random variable is special in the sense that the mean and the variance are equal. That is, if the mean arrival number is higher, the variance is also higher. This is very different from some other random variables, e.g., the normal random variable where the mean and variance are independent. For certain engineering applications such as photography, this plays an important role in defining the signal-to-noise ratio. We will come back to this point later.

Origin of the Poisson random variable

We now address one of the most important questions about the Poisson random variable: Where does it come from? Answering this question is useful because the derivation process will reveal the underlying assumptions that lead to the Poisson PMF. When you change the problem setting, you will know when the Poisson PMF will hold and when the Poisson PMF will fail.

Our approach to addressing this problem is to consider the photon arrival process. (As we have shown, there is conceptually no difference if you replace the photons with pedestrians, passengers, or phone calls.) Our derivation follows the argument of J. Goodman, Statistical Optics, Section 3.7.2.

To begin with, we consider a photon arrival process. The total number of photons observed over an integration time \(t\) is defined as \(X(t)\). Because \(X(t)\) is a Poisson random variable, its arguments must be integers. The probability of observing \(X(t) = k\) is therefore \(\Pb[X(t) = k]\). Figure 3.36 illustrates the notations and concepts.

Figure 3.36
Figure 3.36. Notations for deriving the Poisson PMF.

We propose three hypotheses with the photon arrival process:

The significance of these three hypotheses is that if the underlying photon arrival process violates any of these assumptions, then the Poisson PMF will not hold. One example is the presence of scattering effects, where a photon has a certain probability of going off due to the scattering medium and a certain probability of coming back. In this case, the events will no longer be independent.

Assuming that these hypotheses hold, then at time \(t+\Delta t\), the probability of observing \(X(t+\Delta t) = k\) can be computed as

$$\begin{aligned} &\Pb[X(t+\Delta t) = k] \\ &= \Pb[X(t) = k] \cdot \underset{\textcolor{blue}{=\Pb[X(t+\Delta t) - X(t) = 0]}}{\underbrace{ (1-\lambda \Delta t) }} + \;\; \Pb[X(t) = k-1] \cdot \underset{\textcolor{blue}{=\Pb[X(t+\Delta t) - X(t) = 1]}}{\underbrace{ (\lambda \Delta t) }}\\ &= \Pb[X(t) = k] - \Pb[X(t) = k]\lambda \Delta t + \Pb[X(t) = k-1]\lambda \Delta t . \end{aligned}$$

By rearranging the terms we show that

$$\begin{aligned} \frac{\Pb[X(t+\Delta t) = k] - \Pb[X(t) = k]}{\Delta t} = \lambda\bigg(\Pb[X(t)=k-1] - \Pb[X(t)=k]\bigg). \end{aligned}$$

Setting the limit of \(\Delta t \rightarrow 0\), we arrive at an ordinary differential equation

$$\begin{aligned} \frac{d}{dt} \Pb[X(t) = k] = \lambda\bigg(\Pb[X(t)=k-1] - \Pb[X(t)=k]\bigg). \end{aligned}$$

We claim that the Poisson PMF, i.e.,

$$\begin{aligned} \Pb[X(t)=k] = \frac{(\lambda t)^{k}}{k!}e^{-\lambda t}, \end{aligned}$$

would solve this differential equation. To see this, we substitute the PMF into the equation. The left-hand side gives us

$$\begin{aligned} \frac{d}{dt} \Pb[X(t) = k] &= \frac{d}{dt} \left(\frac{(\lambda t)^{k}}{k!}e^{-\lambda t}\right)\\ &= \lambda k \frac{(\lambda t)^{k-1}}{k!}e^{-\lambda t} + (-\lambda) \frac{(\lambda t)^k}{k!}e^{-\lambda t}\\ &= \lambda \frac{(\lambda t)^{k-1}}{(k-1)!}e^{-\lambda t} - \lambda \frac{(\lambda t)^k}{k!}e^{-\lambda t}\\ &= \lambda \bigg( \Pb[X(t)=k-1] - \Pb[X(t)=k] \bigg), \end{aligned}$$

which is the right-hand side of the equation. To retrieve the basic form of Poisson, we can just set \(t = 1\) in the PMF so that

$$\begin{aligned} \Pb[X(1)=k] = \frac{\lambda^{k}}{k!}e^{-\lambda}. \end{aligned}$$
The origin of Poisson random variables
  • sep0em
  • We assume independent arrivals.
  • Probability of seeing one event is linear with the arrival rate.
  • Time interval is short enough so that you see either one event or no event.
  • Poisson is derived by solving a differential equation based on these assumptions.
  • Poisson becomes invalid when these assumptions are violated, e.g., in the case of scattering of photons due to turbid medium.

There is an alternative approach to deriving the Poisson PMF. The idea is to drive the parameter \(n\) in the binomial random variable to infinity while pushing \(p\) to zero. In this limit, the binomial PMF will converge to the Poisson PMF. We will discuss this shortly. However, we recommend the physics approach we have just described because it has a rich meaning and allows us to validate our assumptions.

Poisson approximation to binomial

We present one additional result about the Poisson random variable. The result shows that Poisson can be regarded as a limiting distribution of a binomial random variable.

Theorem 3.10

(Poisson approximation to binomial). For small \(p\) and large \(n\),

$$\begin{aligned} {n \choose k} p^k (1-p)^{n-k} \approx \frac{\lambda^k}{k!}e^{-\lambda}, \end{aligned}$$

where \(\lambda \bydef np\).

Before we prove the result, let us see how close the approximation can be. In Figure 3.37, we show a binomial distribution and a Poisson approximation. The closeness of the approximation can easily be seen.

Figure 3.37
Figure 3.37. Poisson approximation of binomial distribution.

In MATLAB, the code to approximate a binomial distribution with a Poisson formula is shown below. Here, we draw 10,000 random binomial numbers and plot their histogram. On top of the plot, we use poisspdf to compute the Poisson PMF. This gives us Figure 3.37. A similar set of commands can be called in Python.

MATLAB
% MATLAB code to approximate binomial using Poisson
    n = 1000; p = 0.05;
    X = binornd(n,p,[10000,1]);
    t = 0:100;
    [num,val] = hist(X,t);
    lambda = n*p;
    f_pois = poisspdf(t,lambda);
    bar(num/10000,`FaceColor',[0.9 0.9 0],`BarWidth',1); hold on;
    plot(f_pois, `LineWidth', 4);
Python
# Python code to approximate binomial using Poisson
    import numpy as np
    import matplotlib.pyplot as plt
    import scipy.stats as stats
    n = 1000; p = 0.05
    rv1 = stats.binom(n,p)
    X   = rv1.rvs(size=10000)
    plt.figure(1); plt.hist(X,bins=np.arange(0,100));
    rv2 = stats.poisson(n*p)
    f   = rv2.pmf(bin)
    plt.figure(2); plt.plot(f);

Proof. Let \(\lambda = np\). Then,

$$\begin{aligned} {n \choose k} p^k (1-p)^{n-k} &= \frac{n!}{k!(n-k)!} \left(\frac{\lambda}{n}\right)^k \left(1-\frac{\lambda}{n}\right)^{n-k}\\ &= \frac{\lambda^k}{k!} \frac{n(n-1)\cdots(n-k+1)}{n \cdot n \cdots n} \left(1-\frac{\lambda}{n}\right)^{n-k}\\ &= \frac{\lambda^k}{k!} \underset{\rightarrow 1 \;\mathrm{as}\; n \rightarrow \infty }{ \underbrace{(1)\left(1-\frac{1}{n}\right)\cdots\left(1-\frac{k-1}{n}\right)} } \underset{\rightarrow 1 \;\mathrm{as}\; n \rightarrow \infty}{ \underbrace{\left(1-\frac{\lambda}{n}\right)^{-k}} } \left(1-\frac{\lambda}{n}\right)^{n}\\ &= \frac{\lambda^k}{k!} \left(1-\frac{\lambda}{n}\right)^{n}. \end{aligned}$$

We claim that \(\left(1-\frac{\lambda}{n}\right)^{n} \rightarrow e^{-\lambda}\). This can be proved by noting that

$$\begin{aligned} \log(1+x) \approx x, \quad\quad x \ll 1. \end{aligned}$$

It then follows that \(\log \left(1-\frac{\lambda}{n}\right) \approx -\frac{\lambda}{n}\). Hence, \(\left(1-\frac{\lambda}{n}\right)^{n} \approx e^{-\lambda}\)

Example 3.21

Consider an optical communication system. The bit arrival rate is \(10^{9}\) bits/sec, and the probability of having one error bit is \(10^{-9}\). Suppose we want to find the probability of having five error bits in one second.

Let \(X\) be the number of error bits. In one second there are \(10^9\) bits. Since we do not know the location of these 5 bits, we have to enumerate all possibilities. This leads to a binomial distribution. Using the binomial distribution, we know that the probability of having \(k\) error bits is

$$\begin{aligned} \Pb[X = k] &= {n \choose k} p^k (1-p)^{n-k} \\ &= {10^9 \choose k} (10^{-9})^k (1-10^{-9})^{10^9-k}. \end{aligned}$$

This quantity is difficult to calculate in floating-point arithmetic.

Using the Poisson approximation to the binomial, we can see that the probability can be approximated by

$$\begin{aligned} \Pb[X = k] \approx \frac{\lambda ^k }{k!} e^{-\lambda}, \end{aligned}$$

where \(\lambda = np = 10^9 (10^{-9})= 1\). Setting \(k=5\) yields \(\Pb[X=5] \approx 0.003\).

Photon arrival statistics

The strong connection between Poisson statistics and physics makes the Poisson random variable a very good fit for many physical experiments. Here we demonstrate an application in modeling photon shot noise.

An image sensor is a photon-sensitive device which is used to detect incoming photons. In the simplest setting, we can model a pixel in the object plane as \(X_{m,n}\), for some 2D coordinate \([m,n] \in \R^2\). Written as an array, an \(M \times N\) image in the object plane can be visualized as

$$\mX = \text{object} = \begin{bmatrix} X_{1,1} & X_{1,2} & \cdots & X_{1,N}\\ \vdots & \vdots & \ddots & \vdots\\ X_{M,1} & X_{M,2} & \cdots & X_{M,N} \end{bmatrix}.$$

Without loss of generality, we assume that \(X_{m,n}\) is normalized so that \(0 \le X_{m,n} \le 1\) for every coordinate \([m,n]\). To model the brightness, we multiply \(X_{m,n}\) by a scalar \(\alpha > 0\). If a pixel \(\alpha X_{m,n}\) has a large value, then it is a bright pixel; conversely, if \(\alpha X_{m,n}\) has a small value, then it is a dark pixel. At a particular pixel location \([m,n] \in \R^2\), the observed pixel value \(Y_{m,n}\) is a random variable following the Poisson statistics. This situation is illustrated in Figure 3.38, where we see that an object-plane pixel will generate an observed pixel through the Poisson PMF. (The color of an image is often handled by a color filter array, which can be thought of as a wavelength selector that allows a specific wavelength to pass through.)

Figure 3.38
Figure 3.38. The image formation process is governed by the Poisson random variable. Given a pixel in the object plane \(X_{m,n}\), the observed pixel \(Y_{m,n}\) is a Poisson random variable with mean \(\alpha X_{m,n}\). Therefore, a brighter pixel will have a higher Poisson mean, whereas a darker pixel will have a lower Poisson mean.

Written as an array, the image is

$$\begin{aligned} \mY &= \text{observed image} \\ &= \text{Poisson}\bigg\{\alpha \mX \bigg\} \\ &= \begin{bmatrix} \text{Poisson}\{\alpha X_{1,1}\} & \text{Poisson}\{\alpha X_{1,2}\} & \cdots & \text{Poisson}\{\alpha X_{1,N}\}\\ \text{Poisson}\{\alpha X_{2,1}\} & \text{Poisson}\{\alpha X_{2,2}\} & \cdots & \text{Poisson}\{\alpha X_{2,N}\}\\ \vdots & \vdots & \ddots & \vdots\\ \text{Poisson}\{\alpha X_{M,1}\} & \text{Poisson}\{\alpha X_{M,2}\} & \cdots & \text{Poisson}\{\alpha X_{M,N}\} \end{bmatrix}. \end{aligned}$$

Here, by \(\text{Poisson}\{\alpha X_{m,n}\}\) we mean that \(Y_{m,n}\) is a random integer with probability mass

$$\begin{aligned} \Pb[Y_{m,n} = k] = \frac{[\alpha X_{m,n}]^k}{k!}e^{-\alpha X_{m,n}}. \end{aligned}$$

Note that this model implies that the images seen by our cameras are more or less an array of Poisson random variables. (We say “more or less” because of other sources of uncertainties such as read noise, dark current, etc.) Because the observed pixels \(Y_{m,n}\) are random variables, they fluctuate about the mean values, and hence they are noisy. We refer to this type of random fluctuation as the shot noise. The impact of the shot noise can be seen in Figure 3.39. Here, we vary the sensor gain level \(\alpha\). We see that for small \(\alpha\) the image is dark and has much random fluctuation. As \(\alpha\) increases, the image becomes brighter and the fluctuation becomes smaller.

Figure 3.39
Figure 3.39. Illustration of the Poisson random variable in photographing images. Here, \(\alpha\) denotes the gain level of the sensor: Larger \(\alpha\) means that there are more photons coming to the sensor.

In MATLAB, simulating the Poisson photon arrival process for an image requires the image-processing toolbox. The command to read an image is imread. Depending on the data type, the input array could be uint8 integers. To convert them to floating-point numbers between 0 and 1, we use the command im2double. Drawing Poisson measurements from the clean image is done using poissrnd. Finally, we can use imshow to display the image.

MATLAB
% MATLAB code to simulate a photon arrival process
    x0 = im2double(imread('cameraman.tif'));
    X  = poissrnd(10*x0);
    figure(1); imshow(x0, []);
    figure(2); imshow(X, []);

Similar commands can be found in Python with the help of the cv2 library. When reading an image, we call cv2.imread. The option 0 is used to read a gray-scale image; otherwise, we will have a 3-channel color image. The division /255 ensures that the input array ranges between 0 to 1. Generating the Poisson random numbers can be done using np.random.poisson, or by calling the statistics library with stats.poisson.rvs(10*x0). To display the images, we call plt.imshow, with the color map option set to cmap = 'gray'.

Python
# Python code to simulate a photon arrival process
    import numpy as np
    import matplotlib.pyplot as plt
    import cv2
    x0 = cv2.imread('./cameraman.tif', 0)/255
    plt.figure(1); plt.imshow(x0,cmap='gray');
    X  = np.random.poisson(10*x0)
    plt.figure(2); plt.imshow(X, cmap='gray');
Why study Poisson? What is shot noise?
  • sep0ex
  • The Poisson random variable is used to model photon arrivals.
  • Shot noise is the random fluctuation of the photon counts at the pixels. Shot noise is present even if you have an ideal sensor.

Signal-to-noise ratio of Poisson

Now let us answer a question we asked before. A Poisson random variable has a variance equal to the mean. Thus, if the scene is brighter, the variance will be larger. How come our simulation in Figure 3.39 shows that the fluctuation becomes smaller as the scene becomes brighter?

The answer to this question lies in the signal-to-noise ratio (SNR) of the Poisson random variable. The SNR of an image defines its quality. The higher the SNR, the better the image. The mathematical definition of SNR is the ratio between the signal power and the noise power. In our case, the SNR is

$$\begin{aligned} \text{SNR} &= \frac{\text{signal power}}{\text{noise power}} \\ &\bydef \frac{\E[Y]}{\sqrt{\Var[Y]}} \\ &\overset{(a)}{=} \frac{\lambda}{\sqrt{\lambda}} = \sqrt{\lambda}, \end{aligned}$$

where \(Y = Y_{m,n}\) is one of the observed pixels and \(\lambda = \alpha X_{m,n}\) is the corresponding object pixel. In this equation, the step \((a)\) uses the properties of the Poisson random variable \(Y\) where \(\E[Y] = \Var[Y] = \lambda\). The result \(\text{SNR} = \sqrt{\lambda}\) is very informative. It says that if the underlying mean photon flux (which is \(\lambda\)) increases, the SNR increases at a rate of \(\sqrt{\lambda}\). So, yes, the variance becomes larger when the scene is brighter. However, the gain in signal \(\E[Y]\) overrides the gain in noise \(\sqrt{\Var[Y]}\). As a result, the big fluctuation in bright images is compensated by the strong signal. Thus, to minimize the shot noise one has to use a longer exposure to increase the mean photon flux. When the scene is dark and the aperture is small, shot noise is unavoidable.

Poisson modeling is useful for describing the problem. However, the actual engineering question is that, given a noisy observation \(Y_{m,n}\), how would you reconstruct the clean image \(X_{m,n}\)? This is a very difficult inverse problem. The typical strategy is to exploit the spatial correlations between nearby pixels, e.g., usually smooth except along some sharp edges. Other information about the image, e.g., the likelihood of obtaining texture patterns, can also be leveraged. Modern image-processing methods are rich, ranging from classical filtering techniques to deep neural networks. Static images are easier to recover because we can often leverage multiple measurements of the same scene to boost the SNR. Dynamic scenes are substantially harder when we need to track the motion of any underlying objects. There are also newer image sensors with better photon sensitivity. The problem of imaging in the dark is an important research topic in computational imaging. New solutions are developed at the intersection of optics, signal processing, and machine learning.