Discrete Probability Introduction

This notebook introduces the basic concepts of discrete probability distributions. Thinking in terms of probabilities is an important skill in analyzing data and interpreting statistical analyses.

It is inspired by Dr. Kennington's probability examples from Boise State University CS 597.

You can download the raw notebook.

Setup

This notebook requires the following Conda packages:

conda install r-nycflights13
In [1]:
library(tidyverse)
library(nycflights13)
options(repr.plot.height=4)
Loading tidyverse: ggplot2
Loading tidyverse: tibble
Loading tidyverse: tidyr
Loading tidyverse: readr
Loading tidyverse: purrr
Loading tidyverse: dplyr
Conflicts with tidy packages ---------------------------------------------------
filter(): dplyr, stats
lag():    dplyr, stats

The flights table from nycflights13 contains data on over 300,000 flights leaving New York City in 2013. We'll use it as our example in this worksheet.

In [2]:
head(flights)
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00

What Is a Probability?

Suppose we see a plane leaving the NYC area, and want to know which of the 3 New York airports (EWR, LGA, and JFK) it probably came from. If we know nothing other than ‘a plane left NYC’, then we can look at the relative frequency of flights from the airports: which airport produces the most flights?

We can do this by counting the number of flights from each airport. dplyr makes this easy with group_by and summarize:

In [3]:
origins = flights %>%
    group_by(origin) %>%
    summarize(count=n())
origins
origin count
EWR 120835
JFK 111279
LGA 104662

We assign the value to the variable origins, and then ask for origins on a new line to display the data we just computed. Assigning the result to a variable lets us make use of this data later!

Also, this data type is called a data frame. A data frame is like a little spreadsheet - it has named columns of data.

The %>% business is called a pipeline, and it is the standard way to process data with the tidyverse (or more specifically dplyr). It pipes the result of each operation into the next, until we finally have our results.
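As a rough sketch of what the pipe is doing (using the same flights data), the pipeline above is equivalent to nesting the function calls inside each other, which quickly becomes harder to read:

# piped: read top to bottom
origins = flights %>%
    group_by(origin) %>%
    summarize(count=n())

# nested: the same computation, read inside out
origins = summarize(group_by(flights, origin), count=n())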

It is often convenient to plot data like this, so we can see it visually:

In [4]:
ggplot(flights) +
    aes(x=origin) +
    geom_bar()

EWR (Newark) has the most departing flights. However, these numbers aren't very convenient - 120835 flights left EWR, but that is a little unwieldy if we want to estimate the chances of another flight coming from Newark.

Fortunately, there is a way we can make these numbers easier to deal with: make them sum to 1, so each value is the fraction of flights that left that airport. Let's do that by dividing each count by the total count:

In [5]:
origins$prob = origins$count / sum(origins$count)
origins
origin count prob
EWR 120835 0.3587993
JFK 111279 0.3304244
LGA 104662 0.3107763

This new column, prob, indicates the probability of a flight departing from the specified airport, given the observations that we have. We are making the assumption here that the flights leaving in 2013 are representative of flights leaving New York City in general, so that we can infer things about future flights from this data. We'll explore in more detail later when we can and can't make this kind of assumption.

A probability is a real number between 0 and 1 that expresses how likely something is. (There is a subtle difference between likelihood and probability, but for our current purposes that difference does not matter.)

Origin airport is an example of what we call a discrete value: it has one of a finite set of distinct values (in this case, just 3: EWR, JFK, and LGA). We also call the origin airport a variable: like variables in computer programs, it is one of the parameters that characterizes an observation (one of the flights). When we are trying to reason about the probability of the variable having different values, we call it a random variable: a variable that takes on random values.

Note: We often think of things in terms of random variables and probabilities even when we don't necessarily think that the way they are produced is actually random. Randomness just provides a convenient way for us to think about the uncertainty we have about our knowledge.

Our table forms a discrete probability distribution. A discrete probability distribution associates each possible value of a discrete variable with a probability of the variable having that value. Each probability must be in the range 0 to 1 (inclusive); in addition, all probabilities in the distribution must sum to 1. We can check this sum:

In [6]:
sum(origins$prob)
1

More formally, a probability distribution $P(x)$ over the values of a random variable $X$ is a function $P : X \to \mathbb{R}$ such that:

  1. $\forall x \in X. 0 \le P(x) \le 1$
  2. $\sum_{x \in X} P(x) = 1$

In R, the $ operator accesses a column of a data frame. It's a lot like . in Java or Python. Each column of a data frame is a vector, which is R-speak for an array. The sum function adds up the elements of a vector and returns the total.
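For instance, using the origins data frame we built above, we can pull a column out as a vector and sum it (a quick sketch, reusing values already shown above):

origins$count        # the count column as a vector: 120835 111279 104662
sum(origins$count)   # the total number of flights: 336776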

Notice that when we converted counts into probabilities above, we did not write a loop! In R, most operations are vectorized: when you apply them to vectors, they operate on the whole vector element-by-element. If we add two vectors, we get the pairwise sum:

In [7]:
c(1,2,3) + c(10, 20, 30)
  1. 11
  2. 22
  3. 33

In R, there is no such thing as a single value - a value is a vector of length 1. And when two vectors have different lengths, R will recycle the shorter one:

In [8]:
c(1,2,3) + 5
  1. 6
  2. 7
  3. 8

This can get us in trouble sometimes if we don't have our vectors quite straight. Fortunately, R warns us in the common error case where we have two multi-item vectors and the longer one's length is not a multiple of the shorter one's:

In [9]:
c(1,2,3) + c(1,2)
Warning message in c(1, 2, 3) + c(1, 2):
"longer object length is not a multiple of shorter object length"
  1. 2
  2. 4
  3. 4

Most R computations work over vectors. A basic rule of thumb is to never use loops. R has them, but you won't need them very often at all. Vectorized operations are much faster than manually looping, and are easier to write.
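To make the contrast concrete, here is a sketch of how we could have normalized the counts with an explicit loop, next to the vectorized one-liner we actually used above:

# loop version - works, but verbose and slower on large vectors
probs = numeric(length(origins$count))
for (i in seq_along(origins$count)) {
    probs[i] = origins$count[i] / sum(origins$count)
}

# vectorized version - the whole column at once, no loop
probs = origins$count / sum(origins$count)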

Names and Math

Our distribution above, over origin airports, is what is called a multinomial distribution. That is, it is a distribution over multiple discrete ‘categorical’ values (we'll see what categorical means in the next lesson).

The simplest kind of multinomial distribution is the binomial distribution: a distribution over two values $\mathsf{H}$ and $\mathsf{T}$. This can be parameterized with a single value $p \in [0,1]$ such that:

$$\begin{align*} P(\mathsf{H}) & = p \\ P(\mathsf{T}) & = 1-p \end{align*}$$

We use the $P(\dots)$ notation to indicate a probability.

It is easy to check that this distribution satisfies our two probability laws:

  1. Since $0 \le p \le 1$, both $p$ and $1-p$ are probabilities.
  2. $P(\mathsf{H}) + P(\mathsf{T}) = p + 1 - p = 1$.

I have called our two outcomes $\mathsf{H}$ and $\mathsf{T}$ because we often think of them as corresponding to the flip of a (possibly weighted) coin with two sides, heads and tails. A fair coin has $p=0.5$, so that both heads and tails are equally likely.

Let's see 20 flips of a fair coin (don't worry about the details of the flip function for now):

In [10]:
flip = function(n, p=0.5) {
    # draw n values from c('H', 'T') with replacement,
    # choosing 'H' with probability p and 'T' with probability 1-p
    sample(c('H', 'T'), n, replace=TRUE, prob=c(p, 1-p))
}
flip(20)
  1. 'T'
  2. 'H'
  3. 'H'
  4. 'H'
  5. 'T'
  6. 'H'
  7. 'H'
  8. 'T'
  9. 'H'
  10. 'T'
  11. 'T'
  12. 'T'
  13. 'T'
  14. 'T'
  15. 'H'
  16. 'T'
  17. 'H'
  18. 'H'
  19. 'T'
  20. 'T'

More generally, the binomial distribution is the probability of observing $k$ successes (in our case, heads) in $n$ flips (or trials). Let's count the successes in a series of 20 flips:

In [11]:
sum(flip(20) == 'H')
8

This will often sum to 10, but not always - it may be 7 or 9 or 11.

R Note: The == operator tests for equality, and like most other R operations, it is vectorized - it tests each element of the left-hand vector against the corresponding element of the right-hand vector; when the right-hand vector has length 1, it just reuses that element for all left-hand values. The result is a logical vector (TRUE and FALSE values); summing it counts the TRUE values, since TRUE is treated as 1 and FALSE as 0.
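For example, on a tiny hand-made vector (a sketch, not part of the flips above):

c('H', 'T', 'H') == 'H'        # TRUE FALSE TRUE
sum(c('H', 'T', 'H') == 'H')   # 2, since TRUE counts as 1 and FALSE as 0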

We can take advantage of another couple of R operations, : to generate sequences and sapply to apply a function to each element of a vector, to carry out 100 trials of 20 flips each and see how often we observe different counts:

In [12]:
repeated_sequences = sapply(1:100, function(t) {
    sum(flip(20) == 'H')
})
repeated_sequences
  1. 12
  2. 13
  3. 11
  4. 10
  5. 12
  6. 10
  7. 13
  8. 10
  9. 10
  10. 9
  11. 8
  12. 10
  13. 11
  14. 10
  15. 9
  16. 16
  17. 8
  18. 7
  19. 14
  20. 9
  21. 8
  22. 12
  23. 10
  24. 7
  25. 8
  26. 9
  27. 8
  28. 7
  29. 10
  30. 9
  31. 10
  32. 12
  33. 13
  34. 8
  35. 8
  36. 11
  37. 10
  38. 12
  39. 11
  40. 11
  41. 9
  42. 10
  43. 10
  44. 10
  45. 10
  46. 12
  47. 8
  48. 12
  49. 10
  50. 8
  51. 10
  52. 7
  53. 9
  54. 12
  55. 9
  56. 12
  57. 9
  58. 13
  59. 12
  60. 9
  61. 12
  62. 11
  63. 10
  64. 11
  65. 13
  66. 14
  67. 6
  68. 9
  69. 10
  70. 11
  71. 11
  72. 12
  73. 8
  74. 9
  75. 13
  76. 13
  77. 9
  78. 10
  79. 17
  80. 13
  81. 14
  82. 9
  83. 8
  84. 10
  85. 11
  86. 10
  87. 14
  88. 7
  89. 10
  90. 9
  91. 12
  92. 12
  93. 11
  94. 9
  95. 11
  96. 6
  97. 12
  98. 9
  99. 8
  100. 12

In R, we don't need a return statement - a function returns the value of its last expression. The sapply function takes a vector and a function f and returns a new vector that is the result of calling f(x) for each value x in the original vector.
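A smaller example may make sapply clearer; this sketch squares each element of the vector 1:5:

sapply(1:5, function(x) x^2)   # 1 4 9 16 25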

Let's plot these values:

In [13]:
ggplot(data.frame(k=repeated_sequences)) +
    aes(x=k) +
    geom_histogram(binwidth=1)

We can see that values close to 10 are the most common.

What if we have a weighted coin, so that $P(\mathsf{H}) = 0.7$?

In [14]:
repeated_sequences = sapply(1:100, function(t) {
    sum(flip(20, 0.7) == 'H')
})
ggplot(data.frame(k=repeated_sequences)) +
    aes(x=k) +
    geom_histogram(binwidth=1)

Now 13-16 are the most common values, which we would expect, since $0.7 \cdot 20 = 14$.

Now, we can directly compute the probability of observing $k$ successes in $n$ trials without needing to simulate all these trials. The probability $P(k|n,p)$ (read ‘the probability of $k$ given $n$ and $p$’) can be written:

$$P(k|n,p) = \binom{n}{k} p^k (1-p)^{n-k}$$

R has a built-in definition of this function called dbinom:

In [15]:
dbinom(14, 20, 0.7)
0.191638982753443
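As a sanity check, we can compute the same value directly from the formula, using R's built-in choose function for the binomial coefficient (a quick sketch):

choose(20, 14) * 0.7^14 * 0.3^6   # same value as dbinom(14, 20, 0.7)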

For fixed $n$ and $p$, this binomial distribution itself is a discrete distribution over the integers $0 \dots n$, and we can also visualize it:

In [16]:
ggplot(data.frame(k=0:20) %>% mutate(prob=dbinom(k, 20, 0.7))) +
    aes(x=k, y=prob) +
    geom_bar(stat='identity')

We can observe that our simulated flips above have the same basic shape. Randomness means that the simulation won't align perfectly, but on average it will be pretty close.

Joint Distributions

We have now seen how we can start to think about the distribution of a single random variable by counting; often, though, we care about more than one variable.

Let's look at the carrier airlines for our NYC flights:

In [17]:
flights %>%
    group_by(carrier) %>%
    summarize(count=n())
carriercount
9E 18460
AA 32729
AS 714
B6 54635
DL 48110
EV 54173
F9 685
FL 3260
HA 342
MQ 26397
OO 32
UA 58665
US 20536
VX 5162
WN 12275
YV 601

We now have a bunch of carriers, and we can convert this to a probability distribution to estimate the probability of a plane being from a particular airline.
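For example, a sketch of that conversion, mirroring what we did for origins earlier:

carriers = flights %>%
    group_by(carrier) %>%
    summarize(count=n()) %>%
    mutate(prob = count / sum(count))   # each carrier's share of all flights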

We can also start to think about airlines and airports together. Let's do a bit more R trickery! We can group by two variables:

In [18]:
origin_carrier_flights = flights %>%
    group_by(origin, carrier) %>%
    summarize(count=n()) %>%
    ungroup() %>%
    mutate(prob = count / sum(count))
head(origin_carrier_flights)
origin carrier count prob
EWR 9E 1268 0.003765114
EWR AA 3487 0.010354063
EWR AS 714 0.002120104
EWR B6 6557 0.019469915
EWR DL 4342 0.012892843
EWR EV 43939 0.130469511

R Note: this contains 2 new functions. mutate is the dplyr way of doing the normalization we did previously for origins; it lets us compute a new variable based on other variables in the data frame. ungroup removes the grouping data introduced by group_by, so that sum sums over the entire data frame.
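If we had skipped ungroup, the sum inside mutate would have been computed within each origin group, so the probabilities would sum to 1 per airport rather than over the whole table. A quick sketch of that behavior:

flights %>%
    group_by(origin, carrier) %>%
    summarize(count=n()) %>%                 # result is still grouped by origin
    mutate(prob = count / sum(count)) %>%    # without ungroup(), sum(count) is per origin
    summarize(total=sum(prob))               # each origin's probabilities sum to 1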

In [19]:
sum(origin_carrier_flights$prob)
1

This probability distribution is called a joint probability distribution: it is the probability of two variables simultaneously taking on the given values. We can write it $P(O, C)$: the probability of a specific origin and carrier. So $P(O=\mathsf{EWR}, C=\mathsf{AA}) \approx 0.010$.

It can be easier to visualize this in a more matrix-like form. The spread function lets us convert data in this form (‘tall’) into a ‘wide’ format; the select(-count) operation removes the count column from the data frame:

In [20]:
origin_carrier_wide = spread(origin_carrier_flights %>% select(-count), origin, prob, fill=0)
origin_carrier_wide
carrier EWR JFK LGA
9E 0.003765114 0.043503694 7.545074e-03
AA 0.010354063 0.040926313 4.590291e-02
AS 0.002120104 0.000000000 0.000000e+00
B6 0.019469915 0.124937644 1.782194e-02
DL 0.012892843 0.061468157 6.849360e-02
EV 0.130469511 0.004180820 2.620733e-02
F9 0.000000000 0.000000000 2.033993e-03
FL 0.000000000 0.000000000 9.680025e-03
HA 0.000000000 0.001015512 0.000000e+00
MQ 0.006758201 0.021358410 5.026486e-02
OO 0.000017816 0.000000000 7.720265e-05
UA 0.136847638 0.013462955 2.388531e-02
US 0.013079911 0.008893152 3.900515e-02
VX 0.004649975 0.010677721 0.000000e+00
WN 0.018374231 0.000000000 1.807433e-02
YV 0.000000000 0.000000000 1.784569e-03

We can then convert this into an R matrix:

In [21]:
origin_carrier_matrix = as.matrix(select(origin_carrier_wide, -carrier))
row.names(origin_carrier_matrix) = origin_carrier_wide$carrier
origin_carrier_matrix
EWR JFK LGA
9E 0.003765114 0.043503694 7.545074e-03
AA 0.010354063 0.040926313 4.590291e-02
AS 0.002120104 0.000000000 0.000000e+00
B6 0.019469915 0.124937644 1.782194e-02
DL 0.012892843 0.061468157 6.849360e-02
EV 0.130469511 0.004180820 2.620733e-02
F9 0.000000000 0.000000000 2.033993e-03
FL 0.000000000 0.000000000 9.680025e-03
HA 0.000000000 0.001015512 0.000000e+00
MQ 0.006758201 0.021358410 5.026486e-02
OO 0.000017816 0.000000000 7.720265e-05
UA 0.136847638 0.013462955 2.388531e-02
US 0.013079911 0.008893152 3.900515e-02
VX 0.004649975 0.010677721 0.000000e+00
WN 0.018374231 0.000000000 1.807433e-02
YV 0.000000000 0.000000000 1.784569e-03

This is our joint distribution: each cell contains the probability of a randomly selected flight being on the particular carrier and from the specified airport. We can check its sum again:

In [22]:
sum(origin_carrier_matrix)
1

Good - it still sums to 1, so we have a valid probability distribution.

Marginal Probabilities

One of the things we often want to do with a joint probability distribution is compute the marginal distributions of its variables. If we have a joint distribution $P(A,B)$, the marginal distribution is $P(A) = \sum_{b \in B} P(A, B=b)$. When our joint distribution is a matrix, R makes it very easy to compute the marginals:

In [23]:
rowSums(origin_carrier_matrix)
9E
0.0548138822243865
AA
0.097183290970853
AS
0.00212010357032568
B6
0.162229493788156
DL
0.142854597714802
EV
0.16085766206618
F9
0.00203399292111077
FL
0.00968002470484833
HA
0.00101551179418961
MQ
0.0783814761146875
OO
9.50186474095541e-05
UA
0.174195904696297
US
0.0609782169750814
VX
0.0153276955602537
WN
0.0364485592797587
YV
0.00178456897166069

This is the probability of a randomly selected flight being operated by the specified carrier.

In [24]:
colSums(origin_carrier_matrix)
EWR
0.358799320616671
JFK
0.330424377033993
LGA
0.310776302349336

If you compare these with the airport probabilities we estimated at the beginning, you should find them to be the same - a useful sanity check. But sometimes all we have is a joint distribution, and we need to extract the marginal distributions from it.
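We can check that directly; this sketch compares the column sums against the origins probabilities we computed at the start (the orders match, since both are alphabetical by airport):

all.equal(unname(colSums(origin_carrier_matrix)), origins$prob)   # should be TRUE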

Joint probability distributions are also symmetric - it makes no difference whether we write $P(O,C)$ or $P(C,O)$.

Conditional Probability

Another useful kind of probability we can derive from a joint distribution is the conditional probability. For example, the conditional probability $P(O|C)$ is the probability that an airplane left a particular airport given that we know it is operated by a given carrier.

The probability $P(O|C) = \frac{P(O,C)}{P(C)}$.

In [25]:
origin_carrier_cond = origin_carrier_matrix / rowSums(origin_carrier_matrix)
origin_carrier_cond
EWR JFK LGA
9E 0.06868906 0.79366197 0.1376490
AA 0.10654160 0.42112500 0.4723334
AS 1.00000000 0.00000000 0.0000000
B6 0.12001464 0.77012904 0.1098563
DL 0.09025151 0.43028476 0.4794637
EV 0.81108670 0.02599081 0.1629225
F9 0.00000000 0.00000000 1.0000000
FL 0.00000000 0.00000000 1.0000000
HA 0.00000000 1.00000000 0.0000000
MQ 0.08622192 0.27249309 0.6412850
OO 0.18750000 0.00000000 0.8125000
UA 0.78559618 0.07728629 0.1371175
US 0.21450136 0.14584145 0.6396572
VX 0.30337079 0.69662921 0.0000000
WN 0.50411405 0.00000000 0.4958859
YV 0.00000000 0.00000000 1.0000000

What did we just do? We divided a matrix by a vector that has as many entries as the matrix has rows. R recycles the vector down the columns, so every entry in the matrix is divided by the value corresponding to its row (there is a tiny made-up example of this below). Neat, huh? We can check that each row is a probability distribution:

In [26]:
rowSums(origin_carrier_cond)
9E
1
AA
1
AS
1
B6
1
DL
1
EV
1
F9
1
FL
1
HA
1
MQ
1
OO
1
UA
1
US
1
VX
1
WN
1
YV
1

Each row of our new matrix is a probability distribution over airports, given that we know the carrier of the airline. Cool! If we know that the flight is United (UA), then it is most likely from Newark (EWR) - $P(\mathsf{EWR}|\mathsf{UA}) = 0.78$.
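Since the divide-a-matrix-by-a-vector trick can be surprising, here is a minimal sketch on made-up numbers showing how the recycling works row by row:

m = matrix(1:6, nrow=2)   # 2 rows, 3 columns, filled column by column
m / c(1, 10)              # first row divided by 1, second row divided by 10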

Now, unlike joint probabilities, conditional probabilities are not symmetric: $P(O|C) \ne P(C|O)$. If we want to compute $P(C|O)$, we can use t to transpose our matrix and normalize rows to be distributions again:

In [27]:
carrier_origin_cond = t(origin_carrier_matrix) / colSums(origin_carrier_matrix)
carrier_origin_cond
9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV
EWR 0.01049365 0.02885753 0.005908884 0.05426408 0.0359333 0.36362809 0.000000000 0.00000000 0.000000000 0.01883560 4.965449e-05 0.38140439 0.03645467 0.01295982 0.05121033 0.000000000
JFK 0.13166006 0.12385985 0.000000000 0.37811267 0.1860279 0.01265288 0.000000000 0.00000000 0.003073356 0.06463933 0.000000e+00 0.04074444 0.02691433 0.03231517 0.00000000 0.000000000
LGA 0.02427815 0.14770404 0.000000000 0.05734651 0.2203952 0.08432860 0.006544878 0.03114789 0.000000000 0.16173970 2.484187e-04 0.07685693 0.12550878 0.00000000 0.05815864 0.005742294
In [28]:
rowSums(carrier_origin_cond)
EWR
1
JFK
1
LGA
1

If we want to visualize conditional probabilities, the easiest way is with a faceted plot. First let's convert our conditional distribution to a tall data frame:

In [29]:
carrier_origin_frame = as.data.frame(carrier_origin_cond)
carrier_origin_frame$origin = row.names(carrier_origin_cond)
carrier_origin_tall = gather(carrier_origin_frame, carrier, prob, -origin)
head(carrier_origin_tall)
origin carrier prob
EWR 9E 0.01049365
JFK 9E 0.13166006
LGA 9E 0.02427815
EWR AA 0.02885753
JFK AA 0.12385985
LGA AA 0.14770404
In [30]:
ggplot(carrier_origin_tall) +
    aes(x=carrier, y=prob) +
    geom_bar(stat='identity') +
    facet_wrap(~ origin) +
    theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))

Independence of Variables

Two variables are independent if $P(A,B) = P(A) P(B)$ - that is, we can compute the probability of $A$ and $B$ taking particular values at the same time by independently computing the probability of each value and multiplying them. What this means in practice is that knowing $A$ tells us nothing about $B$. We can see that our origin airport and carrier are not independent - observing either tells us quite a bit about the other.
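With our flight data we can see this numerically; if origin and carrier were independent, each joint probability would equal the product of the corresponding marginals, which it clearly doesn't (a quick sketch using the matrices from above):

p_carrier = rowSums(origin_carrier_matrix)   # marginal P(C)
p_origin = colSums(origin_carrier_matrix)    # marginal P(O)
p_carrier['UA'] * p_origin['EWR']    # about 0.06: what P(EWR, UA) would be under independence
origin_carrier_matrix['UA', 'EWR']   # about 0.14: the actual joint probability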

But let's go back to our binomial distribution: when flipping a coin, each flip is independent. Knowing I flipped heads tells me nothing about whether the next flip will be heads.

This is the key to making the binomial distribution formula work: the probability of flipping $\mathsf{HH}$ is $P(X_1=\mathsf{H},X_2=\mathsf{H}) = P(\mathsf{H})P(\mathsf{H}) = p^2$.

The same is true of rolling dice: the probability of a particular result from rolling two fair dice is the product of the individual die probabilities.

Bayes' Theorem

Remember that $P(A|B) \ne P(B|A)$? There is, however, a way that we can convert between these two probabilities!

$$P(A|B) = \frac{P(B|A)P(A)}{P(B)}$$

That is, with one conditional distribution and both marginal distributions, we can compute the other conditional distribution. To see why this is true, we can expand the definition of conditional probability:

$$\begin{align*} P(B|A) & = \frac{P(A,B)}{P(A)} \\ P(A,B) & = P(B|A)\,P(A) \\ P(A|B) & = \frac{P(A,B)}{P(B)} \\ & = \frac{P(B|A)\,P(A)}{P(B)} \end{align*}$$
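We can check this numerically with the matrices we built above; as a sketch, computing $P(C=\mathsf{UA}|O=\mathsf{EWR})$ via Bayes' theorem from $P(O|C)$ and the two marginals should match the entry we computed directly:

p_carrier = rowSums(origin_carrier_matrix)   # P(C)
p_origin = colSums(origin_carrier_matrix)    # P(O)
origin_carrier_cond['UA', 'EWR'] * p_carrier['UA'] / p_origin['EWR']   # Bayes' theorem: about 0.38
carrier_origin_cond['EWR', 'UA']                                       # direct computation: about 0.38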