The Binomial is one of the most famous and useful distributions in statistics. It is written $\text{Bin}(n, p)$ and has two parameters: $n$, any positive integer, and $p$, any real number in $[0, 1]$. Changing the parameters gives a different distribution, but it is still a Binomial — so $\text{Bin}(n, p)$ is really a whole family of distributions, one for each $(n, p)$ pair.
Saying $X \sim \text{Bin}(n, p)$ means $X$ is a random variable whose distribution is Binomial. There are three ways to understand what that distribution is: a story, a sum of indicator random variables, and a probability mass function (PMF). All three matter in this course.
We run $n$ independent trials. Each trial results in success or failure, with probability $p$ of success. $X$ is the number of successes.
The independence of the trials is crucial. The word "success" is generic: you can define success however you like (you can even swap the labels success and failure), as long as each trial yields exactly one of the two, never both.
The classic example is flipping a coin $n$ times and counting heads. But the setting is far more general than coin flips — any sequence of $n$ independent yes/no trials with a common success probability fits. The story is the most important view, because it tells us why we care about the Binomial; a distribution with no useful story is not worth studying.
The story immediately gives a second view, writing $X$ as a sum of indicators:
$$X = X_1 + X_2 + \cdots + X_n.$$
Each $X_j$ is an indicator random variable — it indicates whether trial $j$ succeeded (1) or failed (0). This equation says exactly what the story says: add 1 for every success, add 0 for every failure. That is just how counting works (to count to five, you add 1 five times). Breaking a complicated random variable into a sum of very simple 0/1 pieces is subtle but extremely useful.
The $X_j$ here are IID Bernoulli$(p)$, where IID stands for independent and identically distributed, a phrase used constantly in statistics. Independent: the trials are independent, so their indicators are too. Identically distributed: every $X_j$ is Bernoulli$(p)$, equal to 1 with probability $p$ and 0 with probability $1 - p$.
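The indicator view can be sketched as a small simulation. This is only an illustration: the function name `simulate_binomial`, the seed, and the parameters $n = 10$, $p = 0.3$ are assumptions chosen for the example, not part of the notes.

```python
import random

def simulate_binomial(n, p, rng):
    """Draw one Bin(n, p) value as a sum of n Bernoulli(p) indicators."""
    # Each indicator is 1 with probability p and 0 otherwise.
    indicators = [1 if rng.random() < p else 0 for _ in range(n)]
    return sum(indicators)

rng = random.Random(0)  # fixed seed so the run is reproducible
n, p = 10, 0.3          # illustrative parameters
draws = [simulate_binomial(n, p, rng) for _ in range(20_000)]

# Every draw counts successes among n trials, so it lies between 0 and n.
print(min(draws), max(draws))
# Overall fraction of successful trials; should be close to p, up to noise.
print(sum(draws) / (n * len(draws)))
```

Each draw is built exactly as the sum above describes: add 1 for every success and 0 for every failure.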
The third view writes down the probability mass function directly. For $X \sim \text{Bin}(n, p)$ with $q = 1 - p$:
$$P(X = k) = \binom{n}{k} p^k q^{\,n-k}, \qquad k \in \{0, 1, \ldots, n\}.$$
This is the probability of one specific arrangement of $k$ successes and $n - k$ failures, $p^k q^{\,n-k}$, times $\binom{n}{k}$, the number of ways to choose which trials are the successes.
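As a minimal sketch, the PMF just described can be evaluated with the standard library's `math.comb`. The function name `binomial_pmf` and the coin example are assumptions made for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0.0
    q = 1 - p
    # comb(n, k) arrangements, each with probability p^k * q^(n-k).
    return comb(n, k) * p**k * q**(n - k)

# Fair coin flipped 4 times: P(exactly 2 heads) = C(4, 2) / 2^4 = 6/16.
print(binomial_pmf(2, 4, 0.5))  # 0.375
```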
A very common confusion is to conflate a random variable with its distribution.
Many different random variables can share the same distribution. The indicators $X_1, \ldots, X_n$ are distinct random variables (they depend on different trials, and they are independent of one another), yet they all have the same Bernoulli$(p)$ distribution.
Think of the sample space $S$ in the "pebble world" picture: a finite collection of pebbles, each representing a possible outcome. A random variable is a function that assigns a number to each pebble.
The real sample space could be enormous, high-dimensional, or infinite — impossible to draw — but the idea is the same: start with an abstract space of outcomes, assign a numerical value to each one.
The expression $X = 7$ is not an equation to solve. It is notation for an event — the set of all outcomes (pebbles) that $X$ maps to 7. Since $\{X = 7\}$ is an event (a subset of $S$), it makes sense to talk about its probability. Likewise $\{X \le x\}$ is an event for any number $x$.
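To make the pebble-world picture concrete, here is a small sketch using two fair dice as the sample space. The dice are an assumption chosen for illustration (the notes do not use this example); the point is only that a random variable is a function on outcomes and $\{X = 7\}$ is a set of outcomes.

```python
from fractions import Fraction
from itertools import product

# Sample space: all 36 ordered outcomes of two fair dice (the "pebbles").
sample_space = list(product(range(1, 7), repeat=2))

def X(outcome):
    """A random variable: a function assigning a number to each outcome."""
    return outcome[0] + outcome[1]

# The event {X = 7} is the set of outcomes that X maps to 7.
event = {s for s in sample_space if X(s) == 7}

# With equally likely pebbles, P(X = 7) = |event| / |S|.
prob = Fraction(len(event), len(sample_space))
print(sorted(event))
print(prob)  # 1/6
```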
For any random variable $X$, define the cumulative distribution function (CDF):
$$F(x) = P(X \le x).$$
The reasoning: before the experiment you don't know $X$; afterward you observe a value, say $X = 7$. If $x = 9$, then since $7 \le 9$ the event $\{X \le x\}$ occurred. As a function of $x$, the probability of this event is the CDF.
The CDF is one way to describe a distribution, and in principle it determines all probabilities about $X$. Questions like "what is the probability $X$ lies between 1 and 3, or between 5 and 9?" can all be answered from $F$. The CDF's strength is generality: it is defined for any random variable, discrete or not.
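As a hedged illustration of answering interval questions from $F$, the sketch below builds the CDF of a $\text{Bin}(10, 1/2)$ random variable by summing its PMF. The parameters and the name `binomial_cdf` are assumptions for the example, and the endpoints of each interval are spelled out because "between" is ambiguous for a discrete variable.

```python
from fractions import Fraction
from math import comb

def binomial_cdf(x, n, p):
    """F(x) = P(X <= x) for X ~ Bin(n, p): add the PMF over all k <= x."""
    q = 1 - p
    return sum((comb(n, k) * p**k * q**(n - k)
                for k in range(n + 1) if k <= x), Fraction(0))

n, p = 10, Fraction(1, 2)   # illustrative parameters, not from the text

# P(1 < X <= 3) = F(3) - F(1)
print(binomial_cdf(3, n, p) - binomial_cdf(1, n, p))   # 165/1024
# P(5 <= X <= 9) = F(9) - F(4), since X is integer-valued
print(binomial_cdf(9, n, p) - binomial_cdf(4, n, p))   # 637/1024
```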
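The PMF is defined only for discrete random variables, so we first need that distinction. A random variable is discrete if its possible values can be listed as $a_1, a_2, \ldots$ (a finite or countably infinite list); a continuous random variable can take any real value in an interval.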
There are hybrids that are neither purely discrete nor continuous, but once you understand the two pure cases you can handle the hybrids. The course starts mostly with discrete, adds continuous later, and keeps using discrete throughout.
For a discrete random variable with possible values $a_1, a_2, \ldots$, the probability mass function (PMF) is the list of probabilities
$$p_j = P(X = a_j) \quad \text{for all } j.$$
To specify the PMF you must give every one of these probabilities. It is the "blueprint" for $X$: it says exactly how the randomness of $X$ is distributed. A valid PMF must satisfy two conditions:
$$p_j \ge 0 \text{ for all } j, \qquad \sum_j p_j = 1.$$
If the sum exceeded 1 it would be nonsensical; if it were less than 1, some possible value was left off the list — $X$ has to equal something. Conversely, any numbers $p_j$ satisfying these two conditions define a valid PMF.
For discrete random variables the PMF is usually easier to work with than the CDF; the CDF exists mainly because it generalizes to the non-discrete case. When a problem says "find the distribution," giving either the PMF (if discrete) or the CDF counts as a complete answer — they are equally valid, but the PMF is usually easier.
Check that the Binomial PMF is valid. Non-negativity is obvious, so the only thing to verify is that it sums to 1:
$$\sum_{k=0}^{n} \binom{n}{k} p^k q^{\,n-k} = (p + q)^n = 1^n = 1.$$
The sum is exactly the Binomial Theorem, which gives $(p + q)^n$. Since $q = 1 - p$, we have $p + q = 1$, so the sum is $1^n = 1$.
This is precisely why it is called the Binomial distribution — it is tied to the Binomial Theorem. If the sum had failed to equal 1, the PMF formula would have to be wrong, so the check is reassuring.
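The same validity check can be run numerically. This is a sketch under assumptions: the helper name `binomial_pmf_table` and the particular $(n, p)$ pairs are invented for illustration, and exact rational arithmetic is used so that "sums to 1" can be tested with equality rather than a tolerance.

```python
from fractions import Fraction
from math import comb

def binomial_pmf_table(n, p):
    """Return [P(X = 0), ..., P(X = n)] for X ~ Bin(n, p)."""
    q = 1 - p
    return [comb(n, k) * p**k * q**(n - k) for k in range(n + 1)]

# Exact fractions avoid floating-point rounding in the check.
for n, p in [(1, Fraction(1, 2)), (5, Fraction(1, 3)), (12, Fraction(7, 10))]:
    table = binomial_pmf_table(n, p)
    assert all(value >= 0 for value in table)   # non-negativity
    assert sum(table) == 1                      # sums to (p + q)^n = 1
print("all checks passed")
```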
Let $X \sim \text{Bin}(n, p)$ and $Y \sim \text{Bin}(m, p)$ be independent (note: the same $p$). Claim: $X + Y \sim \text{Bin}(n + m, p)$. We prove it three ways, one for each view of the Binomial.
$X$ and $Y$ are functions on the same sample space (pebble world). To add two functions with the same domain, evaluate both and add the values. So as long as random variables live on the same sample space $S$, it makes perfect sense to add them, multiply them, square them, cube them, exponentiate them — each operation yields a new random variable, computed outcome by outcome.
$X$ counts successes in $n$ trials; $Y$ counts successes in $m$ separate trials (separate because $X$ and $Y$ are independent). Together that is $n + m$ trials, each with success probability $p$. The total number of successes is the sum of the two counts, so $X + Y \sim \text{Bin}(n + m, p)$. No algebra needed.
This requires the same $p$ for both. It fails if, say, one had $p = 1/2$ and the other $p = 1/3$.
Write $X = X_1 + \cdots + X_n$ and $Y = Y_1 + \cdots + Y_m$, where all the $X_j$ and $Y_j$ are independent Bernoulli$(p)$. Then
$$X + Y = X_1 + \cdots + X_n + Y_1 + \cdots + Y_m,$$
which is a sum of $n + m$ IID Bernoulli$(p)$ random variables. A sum of $N$ IID Bernoulli$(p)$ is $\text{Bin}(N, p)$; here $N = n + m$, so $X + Y \sim \text{Bin}(n + m, p)$. (This easy argument would break if the variables were not independent.)
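The hard way: compute $P(X + Y = k)$ directly. The distribution of a sum of independent random variables is called a convolution, and, fittingly, "convolution" and "convoluted" sound alike. By wishful thinking, condition on $X$ (we could equally condition on $Y$). The Law of Total Probability gives
$$P(X + Y = k) = \sum_{j=0}^{k} P(X + Y = k \mid X = j)\, P(X = j).$$
We sum only up to $k$: since $X$ and $Y$ are non-negative, neither can exceed $k$ when $X + Y = k$. Given $X = j$, the event $\{X + Y = k\}$ becomes $\{Y = k - j\}$. By independence, conditioning on $X = j$ gives no information about $Y$, so $P(Y = k - j \mid X = j) = P(Y = k - j)$. Plugging in both Binomial PMFs:
$$P(X + Y = k) = \sum_{j=0}^{k} \binom{m}{k-j} p^{\,k-j} q^{\,m-k+j} \binom{n}{j} p^{\,j} q^{\,n-j}.$$
Any term with $j > n$ or $k - j > m$ vanishes, because a binomial coefficient is 0 when the bottom exceeds the top. Collect the powers of $p$ and $q$ (both independent of $j$, so they leave the sum):
$$P(X + Y = k) = p^{\,k} q^{\,m+n-k} \sum_{j=0}^{k} \binom{m}{k-j} \binom{n}{j}.$$
The remaining sum is Vandermonde's identity:
$$\sum_{j=0}^{k} \binom{m}{k-j} \binom{n}{j} = \binom{m+n}{k}.$$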
So $P(X + Y = k) = \binom{m+n}{k} p^{\,k} q^{\,m+n-k}$, which is exactly the $\text{Bin}(n + m, p)$ PMF.
This was far more work than the other two proofs — and we would have been stuck at the ugly sum without already knowing Vandermonde. As a bonus, the calculation is itself a second proof of Vandermonde's identity: if that sum had not equaled $\binom{m+n}{k}$, we would have a contradiction.
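The convolution calculation can be checked by brute force. The sketch below conditions on $X = j$ exactly as in the Law of Total Probability argument and compares the result with the $\text{Bin}(n + m, p)$ PMF; the function names and the values $n = 4$, $m = 6$, $p = 1/3$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def binomial_pmf(k, n, p):
    """P(X = k) for X ~ Bin(n, p); zero outside 0..n."""
    if k < 0 or k > n:
        return 0
    return comb(n, k) * p**k * (1 - p)**(n - k)

def convolve_binomials(k, n, m, p):
    """P(X + Y = k): sum over j of P(Y = k - j) * P(X = j)."""
    return sum(binomial_pmf(k - j, m, p) * binomial_pmf(j, n, p)
               for j in range(k + 1))

n, m, p = 4, 6, Fraction(1, 3)   # illustrative parameters
for k in range(n + m + 1):
    # The convolution sum should match the Bin(n + m, p) PMF exactly.
    assert convolve_binomials(k, n, m, p) == binomial_pmf(k, n + m, p)
    # Vandermonde's identity, the combinatorial core of the calculation.
    assert sum(comb(m, k - j) * comb(n, j) for j in range(k + 1)) == comb(m + n, k)
print("convolution matches Bin(n + m, p)")
```

A check on a few parameter values is not a proof, but it is a quick way to catch an algebra slip in a derivation like this one.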
A common mistake is to assume a distribution is Binomial when it is not. The two key Binomial assumptions are that the trials are independent and have the same success probability. If either fails, the Binomial story no longer applies, and the count is generally not Binomial.
Draw a five-card hand from a standard 52-card deck (all $\binom{52}{5}$ subsets equally likely). Find the distribution of $X = $ the number of aces in the hand.
This is discrete ($X$ is $0, 1, 2, 3,$ or $4$), so the PMF is the natural target. Listing the possible values first is good practice — it guards against PMFs that don't sum to 1 or that allow impossible values like 2.5 or 5 aces.
Think of each card as a trial, but the trials are not independent. If the first card is an ace, the second is less likely to be one; if the first four cards are all aces, the fifth is certainly not. Independence fails.
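Using the naive definition (equally likely hands):
$$P(X = k) = \frac{\binom{4}{k}\binom{48}{\,5-k}}{\binom{52}{5}}, \qquad k \in \{0, 1, 2, 3, 4\}.$$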
Choose $k$ of the 4 aces and $5 - k$ of the 48 non-aces, out of $\binom{52}{5}$ total hands. There is a pleasing pattern: $4 + 48 = 52$ in the "tops" and $k + (5 - k) = 5$ in the "bottoms."
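The counting argument above translates directly into code. This is a small sketch (the name `aces_pmf` is an assumption) that tabulates the five probabilities exactly and confirms they sum to 1.

```python
from fractions import Fraction
from math import comb

def aces_pmf(k):
    """P(exactly k aces in a 5-card hand), all C(52, 5) hands equally likely."""
    # Choose k of the 4 aces and 5 - k of the 48 non-aces.
    return Fraction(comb(4, k) * comb(48, 5 - k), comb(52, 5))

pmf = {k: aces_pmf(k) for k in range(5)}  # possible values 0, 1, 2, 3, 4
for k, prob in pmf.items():
    print(k, prob, float(prob))

assert sum(pmf.values()) == 1  # the five values account for every hand
```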
This PMF looks like Vandermonde, and it is exactly the structure of the elk problem (capture–recapture) from the homework: a population of elk, some tagged and some untagged; sample some; find the probability of exactly $k$ tagged ones. Here "tag" just means "is an ace" — four cards come pre-tagged as aces, 48 are untagged. The card problem and the elk problem are not merely similar, they are the same problem. Recognizing when a new problem is identical to one already solved is a central skill in this course.
Generalize: a jar holds $b$ black marbles and $w$ white marbles. Take a simple random sample of size $n$ (all subsets of that size equally likely). Let $X = $ number of white marbles in the sample. (Same problem as the aces and the elk, with "white/black" in place of "ace/non-ace" and "tagged/untagged.")
$$P(X = k) = \frac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}},$$
with constraints $0 \le k \le w$ and $0 \le n - k \le b$; by convention the formula is 0 when $k > w$ (or otherwise impossible).
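A minimal sketch of the general formula, with the constraints handled explicitly. The function name `hypergeometric_pmf` and the argument order are assumptions for illustration; the aces problem is recovered with $w = 4$, $b = 48$, $n = 5$.

```python
from fractions import Fraction
from math import comb

def hypergeometric_pmf(k, w, b, n):
    """P(X = k): k white marbles in a size-n sample from w white and b black."""
    # Outside 0 <= k <= w and 0 <= n - k <= b the event is impossible.
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return Fraction(0)
    return Fraction(comb(w, k) * comb(b, n - k), comb(w + b, n))

# Aces in a five-card hand: w = 4 aces, b = 48 non-aces, n = 5 cards.
print(hypergeometric_pmf(1, 4, 48, 5))
# Five aces cannot occur, so the probability is 0.
print(hypergeometric_pmf(5, 4, 48, 5))  # 0
```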
As with the Binomial, memorizing the PMF is far less useful than knowing the story — the story tells you when to apply the distribution.
The defining difference is replacement:
| Feature | Binomial | Hypergeometric |
|---|---|---|
| Sampling | With replacement | Without replacement |
| Trials | Independent | Not independent |
| Story | Count successes in $n$ independent trials | Count white marbles in a sample of size $n$ drawn without replacement |
| PMF | $\binom{n}{k} p^k q^{\,n-k}$ | $\dfrac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}}$ |
If you draw a marble, replace it, and repeat, the situation resets each time and the draws are independent — that gives a Binomial. Drawing without replacement makes the trials dependent, giving a Hypergeometric.
Suppose the jar holds about a billion marbles and we sample only 10. Picking the same marble twice is then extraordinarily unlikely, so sampling with and without replacement behave almost identically. Under such conditions the Hypergeometric is approximately Binomial, with $p = w/(w + b)$, the fraction of white marbles in the jar.
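The approximation can be seen numerically. The sketch below is an illustration under assumptions: the helper `max_gap`, the jar sizes, and the fixed white fraction of $1/4$ are all choices made for the example. It reports the largest difference between the two PMFs for a sample of size 10 as the jar grows.

```python
from math import comb

def hypergeometric_pmf(k, w, b, n):
    """P(k white marbles) when sampling n without replacement."""
    if k < 0 or k > w or n - k < 0 or n - k > b:
        return 0.0
    return comb(w, k) * comb(b, n - k) / comb(w + b, n)

def binomial_pmf(k, n, p):
    """P(k successes) in n independent trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def max_gap(total, n=10):
    """Largest |Hypergeometric - Binomial| PMF difference over k = 0..n."""
    w = total // 4              # keep the white fraction fixed at 1/4
    b = total - w
    p = w / total
    return max(abs(hypergeometric_pmf(k, w, b, n) - binomial_pmf(k, n, p))
               for k in range(n + 1))

for total in (20, 1_000, 1_000_000):
    print(total, max_gap(total))  # the gap shrinks as the jar grows
```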
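Verify validity. Non-negativity is clear. For the sum, the constant $\binom{w+b}{n}$ factors out and the numerator sum is Vandermonde again:
$$\sum_{k=0}^{n} \frac{\binom{w}{k}\binom{b}{\,n-k}}{\binom{w+b}{n}} = \frac{1}{\binom{w+b}{n}} \sum_{k=0}^{n} \binom{w}{k}\binom{b}{\,n-k} = \frac{\binom{w+b}{n}}{\binom{w+b}{n}} = 1.$$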
As before, this doubles as a third proof of Vandermonde's identity: had the sum not matched, we would have a contradiction.
A quick look at what CDFs $F(x) = P(X \le x)$ look like, to keep in mind for next time.
For a continuous random variable, $F$ is a smooth increasing curve, rising from 0 toward 1 with no jumps.
For a discrete random variable, $F$ is a step function with jumps at the possible values and flat stretches in between. For $X$ taking values $0, 1, 2$, the function jumps at 0, 1, and 2, stays flat between them, and levels off at 1 forever once $x \ge 2$.
In a plot, an open circle marks the lower end of each jump, reflecting that $F$ is defined with $\le$: at a jump point $F$ takes the higher value. In the discrete case these jumpy step functions make the PMF easier to use; in the continuous case the CDF is often the more convenient description.
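The step behaviour can be checked with a tiny example. The PMF below on the values $0, 1, 2$ is hypothetical, chosen only to illustrate the jumps; the notes do not specify the probabilities.

```python
from fractions import Fraction

# A hypothetical PMF on the values 0, 1, 2 (chosen only for illustration).
pmf = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def cdf(x):
    """F(x) = P(X <= x): add up the PMF over all values a <= x."""
    return sum((prob for a, prob in pmf.items() if a <= x), Fraction(0))

# F is flat between possible values and jumps exactly at 0, 1, and 2.
for x in (-1, 0, 0.5, 0.999, 1, 1.5, 2, 10):
    print(x, cdf(x))

# At a jump point F takes the higher value; the jump size is the PMF there.
print(cdf(1) - cdf(0.999) == pmf[1])  # True
```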