{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Discrete Probability Introduction\n", "\n", "This notebook introduces the basic concepts of discrete probability distributions. Thinking in terms of probabilities is an important skill in analyzing data and interpreting statistical analyses.\n", "\n", "It is inspired by Dr. Kennington's probability examples from Boise State University CS 597." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Setup\n", "\n", "This notebook requires the following Conda packages:\n", "\n", " conda install r-nycflights13" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Loading tidyverse: ggplot2\n", "Loading tidyverse: tibble\n", "Loading tidyverse: tidyr\n", "Loading tidyverse: readr\n", "Loading tidyverse: purrr\n", "Loading tidyverse: dplyr\n", "Conflicts with tidy packages ---------------------------------------------------\n", "filter(): dplyr, stats\n", "lag(): dplyr, stats\n" ] } ], "source": [ "library(tidyverse)\n", "library(nycflights13)\n", "options(repr.plot.height=4)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `flights` table from `nycflights13` contains data on over 300,000 flights leaving New York City in 2013. We'll use it as our example in this worksheet." ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
yearmonthdaydep_timesched_dep_timedep_delayarr_timesched_arr_timearr_delaycarrierflighttailnumorigindestair_timedistancehourminutetime_hour
2013 1 1 517 515 2 830 819 11 UA 1545 N14228 EWR IAH 227 1400 5 15 2013-01-01 05:00:00
2013 1 1 533 529 4 850 830 20 UA 1714 N24211 LGA IAH 227 1416 5 29 2013-01-01 05:00:00
2013 1 1 542 540 2 923 850 33 AA 1141 N619AA JFK MIA 160 1089 5 40 2013-01-01 05:00:00
2013 1 1 544 545 -1 1004 1022 -18 B6 725 N804JB JFK BQN 183 1576 5 45 2013-01-01 05:00:00
2013 1 1 554 600 -6 812 837 -25 DL 461 N668DN LGA ATL 116 762 6 0 2013-01-01 06:00:00
2013 1 1 554 558 -4 740 728 12 UA 1696 N39463 EWR ORD 150 719 5 58 2013-01-01 05:00:00
\n" ], "text/latex": [ "\\begin{tabular}{r|lllllllllllllllllll}\n", " year & month & day & dep\\_time & sched\\_dep\\_time & dep\\_delay & arr\\_time & sched\\_arr\\_time & arr\\_delay & carrier & flight & tailnum & origin & dest & air\\_time & distance & hour & minute & time\\_hour\\\\\n", "\\hline\n", "\t 2013 & 1 & 1 & 517 & 515 & 2 & 830 & 819 & 11 & UA & 1545 & N14228 & EWR & IAH & 227 & 1400 & 5 & 15 & 2013-01-01 05:00:00\\\\\n", "\t 2013 & 1 & 1 & 533 & 529 & 4 & 850 & 830 & 20 & UA & 1714 & N24211 & LGA & IAH & 227 & 1416 & 5 & 29 & 2013-01-01 05:00:00\\\\\n", "\t 2013 & 1 & 1 & 542 & 540 & 2 & 923 & 850 & 33 & AA & 1141 & N619AA & JFK & MIA & 160 & 1089 & 5 & 40 & 2013-01-01 05:00:00\\\\\n", "\t 2013 & 1 & 1 & 544 & 545 & -1 & 1004 & 1022 & -18 & B6 & 725 & N804JB & JFK & BQN & 183 & 1576 & 5 & 45 & 2013-01-01 05:00:00\\\\\n", "\t 2013 & 1 & 1 & 554 & 600 & -6 & 812 & 837 & -25 & DL & 461 & N668DN & LGA & ATL & 116 & 762 & 6 & 0 & 2013-01-01 06:00:00\\\\\n", "\t 2013 & 1 & 1 & 554 & 558 & -4 & 740 & 728 & 12 & UA & 1696 & N39463 & EWR & ORD & 150 & 719 & 5 & 58 & 2013-01-01 05:00:00\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | \n", "|---|---|---|---|---|---|\n", "| 2013 | 1 | 1 | 517 | 515 | 2 | 830 | 819 | 11 | UA | 1545 | N14228 | EWR | IAH | 227 | 1400 | 5 | 15 | 2013-01-01 05:00:00 | \n", "| 2013 | 1 | 1 | 533 | 529 | 4 | 850 | 830 | 20 | UA | 1714 | N24211 | LGA | IAH | 227 | 1416 | 5 | 29 | 2013-01-01 05:00:00 | \n", "| 2013 | 1 | 1 | 542 | 540 | 2 | 923 | 850 | 33 | AA | 1141 | N619AA | JFK | MIA | 160 | 1089 | 5 | 40 | 2013-01-01 05:00:00 | \n", "| 2013 | 1 | 1 | 544 | 545 | -1 | 1004 | 1022 | -18 | B6 | 725 | N804JB | JFK | BQN | 183 | 1576 | 5 | 45 | 2013-01-01 05:00:00 | \n", "| 2013 | 1 | 1 | 554 | 600 | -6 | 812 | 837 | -25 | DL | 461 | 
N668DN | LGA | ATL | 116 | 762 | 6 | 0 | 2013-01-01 06:00:00 | \n", "| 2013 | 1 | 1 | 554 | 558 | -4 | 740 | 728 | 12 | UA | 1696 | N39463 | EWR | ORD | 150 | 719 | 5 | 58 | 2013-01-01 05:00:00 | \n", "\n", "\n" ], "text/plain": [ " year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time\n", "1 2013 1 1 517 515 2 830 819 \n", "2 2013 1 1 533 529 4 850 830 \n", "3 2013 1 1 542 540 2 923 850 \n", "4 2013 1 1 544 545 -1 1004 1022 \n", "5 2013 1 1 554 600 -6 812 837 \n", "6 2013 1 1 554 558 -4 740 728 \n", " arr_delay carrier flight tailnum origin dest air_time distance hour minute\n", "1 11 UA 1545 N14228 EWR IAH 227 1400 5 15 \n", "2 20 UA 1714 N24211 LGA IAH 227 1416 5 29 \n", "3 33 AA 1141 N619AA JFK MIA 160 1089 5 40 \n", "4 -18 B6 725 N804JB JFK BQN 183 1576 5 45 \n", "5 -25 DL 461 N668DN LGA ATL 116 762 6 0 \n", "6 12 UA 1696 N39463 EWR ORD 150 719 5 58 \n", " time_hour \n", "1 2013-01-01 05:00:00\n", "2 2013-01-01 05:00:00\n", "3 2013-01-01 05:00:00\n", "4 2013-01-01 05:00:00\n", "5 2013-01-01 06:00:00\n", "6 2013-01-01 05:00:00" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "head(flights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## What Is a Probability?\n", "\n", "Suppose we see a plane leaving the NYC area, and want to know which of the 3 New York airports (EWR, LGA, and JFK) it probably came from. If we know nothing other than ‘a plane left NYC’, then we can look at the _relative frequency_ of flights from the airports: which airport produces the most flights?\n", "\n", "We can do this by counting the number of flights from each airport. dplyr makes this easy with `group_by` and `summarize`:" ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
origincount
EWR 120835
JFK 111279
LGA 104662
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " origin & count\\\\\n", "\\hline\n", "\t EWR & 120835\\\\\n", "\t JFK & 111279\\\\\n", "\t LGA & 104662\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "origin | count | \n", "|---|---|---|\n", "| EWR | 120835 | \n", "| JFK | 111279 | \n", "| LGA | 104662 | \n", "\n", "\n" ], "text/plain": [ " origin count \n", "1 EWR 120835\n", "2 JFK 111279\n", "3 LGA 104662" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origins = flights %>%\n", " group_by(origin) %>%\n", " summarize(count=n())\n", "origins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We assign the result to the variable `origins`, and then write `origins` on a line by itself to display the data we just computed. Storing the result in a variable lets us reuse it later.\n", "\n", "This kind of object is called a _data frame_. A data frame is like a little spreadsheet - it has named columns of data.\n", "\n", "The `%>%` operator builds a _pipeline_, which is the standard way to process data with the `tidyverse` (or more specifically `dplyr`). It pipes the result of each operation into the next, until we have our final result."
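, "\n", "\n", "As a rough sketch of what the pipeline is doing: `x %>% f(y)` is shorthand for `f(x, y)`, so a chain of `%>%` steps is a set of nested function calls written in reading order. The next cell computes the same kind of count table both ways. The tiny `toy` data frame and the names `nested` and `piped` are made up for illustration; they are not part of the flights data." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "library(dplyr)\n", "\n", "# Hypothetical toy data, not taken from `flights`\n", "toy = data.frame(origin=c('EWR', 'JFK', 'EWR'))\n", "\n", "# Nested calls: read from the inside out\n", "nested = summarize(group_by(toy, origin), count=n())\n", "\n", "# The same computation as a pipeline: read from top to bottom\n", "piped = toy %>%\n", "    group_by(origin) %>%\n", "    summarize(count=n())\n", "\n", "# Compare the two results\n", "all.equal(nested, piped)"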
] }, { "cell_type": "markdown", "metadata": {}, "source": [ "\n", "It is often convenient to plot data like this, so we can see it visually:" ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": "iVBORw0KGgoAAAANSUhEUgAAA0gAAAHgCAMAAACo6b1DAAAAOVBMVEUAAAAzMzNNTU1ZWVlo\naGh8fHyMjIyampqnp6eysrK9vb3Hx8fQ0NDZ2dnh4eHp6enr6+vw8PD///8Yrk7HAAAACXBI\nWXMAABJ0AAASdAHeZh94AAAOw0lEQVR4nO3dgVLbSBaFYdELgcnsDIvf/2EXxRYIAcJy9+1z\ndfR/VYMhW5XiWP3H2JDscAJQbVB/AoADQgIaICSgAUICGiAkoAFCAhogJKABQgIaCAvpf3lk\n+lwisbO7jSGV89tX021ZfDy/JSQRdna3LaRLJ9ObMqurfL4lJBV2drcppHIipF1gZ3fbHpFm\neZT3DwgpG3Z2VxXS9BTp25DuRtf8voCLbSGtBcQjkho7u6sJaXqHkLJhZ3c3hvT5uRIhJcLO\n7m4Lqby/JaSE2NndTSHNclp/sYGQNNjZ3S0hlZ9+ooGfbBBjZ3cbQ7qFeuJMps8lEju7IyRH\n7OyOkByxsztCcsTO7gjJETu7IyRH7OwuR0j/cRN+3dYlOmChEu0kpBDh121dogMWKtFOQgoR\nft3WJTpgoRLtJKQQ4ddtXaIDFirRTkIKEX7d1iU6YKES7SSkEOHXbV2iAxYq0U5CChF+3dYl\nOmChEu0kpBDh121dogMWKtFOQgoRft3WJTpgoRLtJKQQ4ddtXaIDFirRTkIKEX7d1iU6YKES\n7SSkEOHXbV2iAxYq0U5CChF+3dYlOmChEu0kpBDh121dogMWKtFOQgoRft3WJTpgoRLtJKQQ\n4ddtXaIDFirRTkIKEX7d1iU6YKES7SSkEOHXbV2iAxYq0U5CChF+3dYlOmChEu0kpBDh121d\nogMWKtFOQgoRft3WJTpgoRLtJKQQ4ddtXaIDFirRTkIKEX7d1iU6YKES7SSkEOHXbV2iAxYq\n0U5CChF+3dYlOmChEu0kpBDh121dogMWKtFOQgoRft3WJTpgoRLt7BDSFdTnvjnhfQkxHpEa\nCv8DcF2iP6lDJdpJSCHCr9u6RAcsVKKdhBTi4Dt7IaQF9Xlo7uA7eyGkBfV5aO7gO3shpAX1\neWju4Dt7IaQF9Xlo7uA7eyGkBfV5aO7gO3shpAX1eWju4Dt7IaQF9Xlo7uA7eyGkBfV5aO7g\nO3shpAX1eWju4Dt7IaQF9Xlo7uA7eyGkBfV5aO7gO3shpAX1eWju4Dt7IaQF9Xlo7uA7eyGk\nBfV5aO7gO3shpAX1eWju4Dt7IaQF9Xlo7uA7eyGkBfV5aO7gO3shpAX1eWju4Dt7IaQF9Xlo\n7uA7eyGkBfV5aO7gO3shpAX1eWiOnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4I\nSYSdXghJhJ1eCEmEnV4ISYSdXpqHVM5vX11zS0js9NA6pEsnlzc/3RISO000DqmcCOk67PTS\nOKQTIV2JnV7UId2Nrvht1fdTc+z0ck0bEx6RGmKnF/UjEiGx0wIhibDTCyGJsNMLIYmw00tM\nSPxkw4/Y6aV5SLcgJHbuHSGJsNMLIYmw0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw\n0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw0w
shibDTCyGJ\nsNMLIYmw0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw0wsh\nibDTCyGJsNNLipCuoL6fmmOnly2HmUekhtjpJcUjEiGxc+8ISYSdXghJhJ1eCEmEnV4ISYSd\nXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmE\nnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJ\nhJ1eCEmEnV4ISYSdXghJhJ1eCEmEnV4ISYSdXghJhJ1eYkIqf0zvXG5PX9wSEjs9xIR0runy\n33RTPt8SEjtNxIU0j4WQPmGnl9iQyux9QvqAnV7CQnp7KvQW1Jch3Y2u+O3U91Nz7PSypY0b\nQrq84RHpE3Z6iX1Emt4jpE/Y6SUqpPLhXUL6hJ1eYkPiS7tvsdNLfEjrLzYQEjstxIb07U80\n8JMN7PQSFdImhMTOvSMkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnoh\nJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6qQhpuHw8/7uuhHQtdnq5\nNaQyzBDSduz0cmtIv2cd/Sak7djp5daQTu9f2tUjJHbuXUVI7RASO/euJqSnwnOkm7HTS0VI\nT7zYUIGdXipCKvWvMhASO01UhMSLDTXY6aUipF/DCyHdjJ1eKkJ6Lg/PhHQrdnqpCImfbKjB\nTi+EJMJOLxUhtUNI7Nw7QhJhp5eKkNp9aXcF9f3UHDu9bDnMPEdqiJ1eKh6Rzp4f/qrtiJDY\nuXvVIZ1ehuqSCImde1cfUoMfFSIkdu5dfUh/D/ybDTdgp5eKkN5ea3gipO3Y6aU+pFLdESGx\nc/cqQmqHkNi5d4Qkwk4vNSG9PN0Pw/1T/d9KIiR27l1FSM+Xf/ukVP+tJEJi595VhPQ4jH+x\n7/lheCSk7djppSKk6RuxfEP2Fuz0Qkgi7PRSERJf2tVgp5eKkHixoQY7vVSExMvfNdjppSak\nZgiJnXtHSCLs9FIT0q8/vzDc8xzpBuz0UhHS0/l174FX7W7BTi8VIZXhn/HmX76PdAt2eqkI\niW/I1mCnl4qQfg2PL+Nr4MMDIW3HTi8VIb19Q/ZfQtqOnV4qQpq+IVv//+1CSOzcu5qQmiEk\ndu4dIYmw0wshibDTCyGJsNMLIYmw0wshibDTCyGJsNMLIYmw00tQSGV0uT2t3BISOz1EhTS7\nKd/fEhI7TRCSCDu9xIRU5reE9BV2egkKaXqK9GNId6Mrfkf1/dQcO71c3cZp8yPSSkA8Ih39\nT2r1p9VczCPSVBMhfYudXghJhJ1eYkLiS7sfsdNLXEjXvdhASOy0EBPSjz/RwE82sNNLUEjb\nEBI7946QRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjp\nhZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY\n6YWQRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjphZBE2OmFkETY6YWQRNjpJUVI\nV1DfT82x08uWw8wjUkPs9JLiEYmQ2Ll3hCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4v\nhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJO\nL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTC\nTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL0EhlVfTbVl8PL8lJHZ6iAmpTG/Kx4+Xt4TE
ThOEJMJO\nLzEhTfWU93cJ6SN2egkOaXqK9G1Id6MrfjP1/dQcO71sKWNbSGsB8Yh09D+p1Z9Wc3GPSGX2\nDiF9wk4vYSGV+XuE9Ak7vUSFVN7fEtJX2OklKKTZy97rLzYQEjstxIRUfvqJBn6ygZ1eYkLa\niJDYuXeEJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJO\nL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTC\nTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOL4Qkwk4vhCTCTi+EJMJOLylC\nuoL6fmqOnV62HGYekRpip5cUj0iExM69IyQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6\nISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2\neiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQRdnohJBF2eiEkEXZ6ISQR\ndnohJBF2eiEkEXZ6ISQRdnohJBF2ehGFVF4REjt9aEIqb28IiZ0WCEmEnV4ISYSdXtQh3Y2a\n/b7ADigfkXrJ9LlEYmd3hOSInd0RkiN2dkdIjtjZHSE5Ymd3ESFt/8mGXjJ9LpHY2V1ISB+p\nJ85k+lwisbM7QnLEzu4IyRE7uyMkR+zsjpAcsbM7QnLEzu4IyRE7uyMkR+zsjpAcsbM7QnLE\nzu4IyRE7uyMkR+zsrkNIiRzlr72zU4iQfLBTiJB8sFOIkHywU+gIIQHhCAlogJCABggJaICQ\ngAYICWjAMKRyMf0je283b/+DmfK2+bTYbON9z/wKZlrpGNLHd8riNtf930L5YrOZL/eVTDud\nQ5r+8dfp34A9TEhuAz9fuvmVzYGQ9m8Zktu+0zeXjpBifQxp9t9BQnKbN/rq0uX6M8MxpA/P\nvOchub/YcDr/KW03cPkl3TSVkEKVD++WDzeZ7vpmPjwi5TperXzxHCnZyyruIc1f30l21zdz\nzBcbkn15QUi7tnwdxXHjiEckgeUT0rdbw29WfhOS1cbR5z8osr1y5BjS7EH/q5DS3PctzJ54\nnz9e3Jp4v6RlsTjLUMOQgP4ICWiAkIAGCAlogJCABggJaICQgAYICWiAkIAGCAlogJAsDMP3\nH6EH7nILhKTGXQ40QEj79Pw4DI/Pp/HR59/ycH4Men4Y7v87vnf+7/nXUJ7Un+dhENIuvZTh\nVXkZg3kYHv+kc/6195D+fEhJnRDSLj0ND6fTw5jJuZUxnb9ef+3l4T2kh5fT7yHLX9exR0i7\ndD+8fln3PNz/+RLudE7n8mvvX9qdeNmhH+7oXToHMkXz1XvvH6EH7uhdIqRsuKN3af6l3fjx\nV1/aTb+OHrijd2n+YsP48fj28muEJMEdvUvzl7/Hj796+Xv6dfTAHb1Ps2/Inqa34zdk/yYk\nDe5oM3znSIOQbAzDP+MTpUf153FMhGTj6fwU6Vn9eRwTIfn4fX953oT+CAlogJCABggJaICQ\ngAYICWiAkIAGCAlogJCABv4P3+Zx4WcrhlcAAAAASUVORK5CYII=", "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ggplot(flights) +\n", " aes(x=origin) +\n", " geom_bar()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "EWR (Newark) has the most departing flights. 
However, these numbers aren't very convenient - 120835 flights left EWR, but that is a little unwieldy if we want to estimate the chances of another flight coming from Newark.\n", "\n", "Fortunately, there is a way we can make these numbers easier to deal with: make them sum to 1, so each value is the fraction of flights that left that airport. Let's do that by dividing each count by the total count:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
origincountprob
EWR 120835 0.3587993
JFK 111279 0.3304244
LGA 104662 0.3107763
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " origin & count & prob\\\\\n", "\\hline\n", "\t EWR & 120835 & 0.3587993\\\\\n", "\t JFK & 111279 & 0.3304244\\\\\n", "\t LGA & 104662 & 0.3107763\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "origin | count | prob | \n", "|---|---|---|\n", "| EWR | 120835 | 0.3587993 | \n", "| JFK | 111279 | 0.3304244 | \n", "| LGA | 104662 | 0.3107763 | \n", "\n", "\n" ], "text/plain": [ " origin count prob \n", "1 EWR 120835 0.3587993\n", "2 JFK 111279 0.3304244\n", "3 LGA 104662 0.3107763" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origins$prob = origins$count / sum(origins$count)\n", "origins" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This new column, `prob`, indicates the _probability_ of a flight departing from the specified airport, given the observations that we have. We are making the assumption here that the flights leaving in 2013 are representative of flights leaving New York City in general, so that we can infer things about future flights from this data. We'll explore in more detail later when we can and can't make this kind of assumption.\n", "\n", "A _probability_ is a real number between 0 and 1 that expresses how likely something is. (There is a subtle difference between likelihood and probability, but for our current purposes that difference does not matter.)\n", "\n", "Origin airport is an example of what we call a _discrete_ value: it has one of a finite set of distinct values (in this case, just 3: EWR, JFK, and LGA). We also call the origin airport a _variable_: like variables in computer programs, it is one of the parameters that characterizes an _observation_ (one of the flights). 
When we are trying to reason about the probability of the variable having different values, we call it a _random variable_: a variable that takes on random values.\n", "\n", "> **Note:** We often think of things in terms of random variables and probabilities even when we don't necessarily think that the way they are produced is actually random. Randomness just provides a convenient way for us to think about the uncertainty we have about our knowledge.\n", "\n", "Our table forms a _discrete probability distribution_. A discrete probability distribution associates each possible value of a discrete variable with a _probability_ of the variable having that value. Each probability must be in the range 0 to 1 (inclusive); in addition, all probabilities in the distribution must sum to 1. We can check this sum:" ] }, { "cell_type": "code", "execution_count": 6, "metadata": {}, "outputs": [ { "data": { "text/html": [ "1" ], "text/latex": [ "1" ], "text/markdown": [ "1" ], "text/plain": [ "[1] 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sum(origins$prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More formally, a probability distribution $P(x)$ over the values of a random variable $X$ is a function $P: X \\to \\mathbb{R}$ such that:\n", "\n", "1. $\\forall x \\in X. 0 \\le P(x) \\le 1$\n", "2. $\\sum_{x \\in X} P(x) = 1$" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In R, the `$` operator accesses a column of a data frame. It's a lot like `.` in Java or Python. Each column of this data frame is a _vector_, which is R-speak for an _array_. The `sum` function adds up the elements of a vector and returns the total.\n", "\n", "Notice that when we converted counts into probabilities above, we did not write a loop! In R, most operations are _vectorized_: when you apply them to vectors, they operate on the whole vector element-by-element. 
If we take two vectors and add them, we get the pairwise sum:" ] }, { "cell_type": "code", "execution_count": 7, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 11
  2. \n", "\t
  3. 22
  4. \n", "\t
  5. 33
  6. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 11\n", "\\item 22\n", "\\item 33\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 11\n", "2. 22\n", "3. 33\n", "\n", "\n" ], "text/plain": [ "[1] 11 22 33" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "c(1,2,3) + c(10, 20, 30)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "R has no separate type for a single value - a single value is just a vector of length 1. When two vectors have different lengths, R will _recycle_ the shorter one, repeating its elements until it matches the length of the longer one:" ] }, { "cell_type": "code", "execution_count": 8, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 6
  2. \n", "\t
  3. 7
  4. \n", "\t
  5. 8
  6. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 6\n", "\\item 7\n", "\\item 8\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 6\n", "2. 7\n", "3. 8\n", "\n", "\n" ], "text/plain": [ "[1] 6 7 8" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "c(1,2,3) + 5" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Recycling can get us in trouble if our vectors are not the lengths we expect. Fortunately, R warns us in the common error case where we have two multi-item vectors and the length of the longer one is not a multiple of the length of the shorter one:" ] }, { "cell_type": "code", "execution_count": 9, "metadata": {}, "outputs": [ { "name": "stderr", "output_type": "stream", "text": [ "Warning message in c(1, 2, 3) + c(1, 2):\n", "\"longer object length is not a multiple of shorter object length\"" ] }, { "data": { "text/html": [ "
    \n", "\t
  1. 2
  2. \n", "\t
  3. 4
  4. \n", "\t
  5. 4
  6. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 2\n", "\\item 4\n", "\\item 4\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 2\n", "2. 4\n", "3. 4\n", "\n", "\n" ], "text/plain": [ "[1] 2 4 4" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "c(1,2,3) + c(1,2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Most R computations work over vectors. A basic rule of thumb is to **never use loops.** R has them, but you won't need them very often at all. Vectorized operations are usually much faster than manually looping, and are easier to write." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Names and Math\n", "\n", "Our distribution above, over origin airports, is what is called a _multinomial distribution_. That is, it is a distribution over multiple discrete ‘categorical’ values (we'll see what categorical means in the next lesson).\n", "\n", "The simplest kind of multinomial distribution is the _binomial distribution_: a distribution over two values $\\mathsf{H}$ and $\\mathsf{T}$ (for a single trial, this is also known as the _Bernoulli distribution_). This can be parameterized with a single value $p \\in [0,1]$ such that:\n", "\n", "$$\\begin{align*}\n", "P(\\mathsf{H}) & = p \\\\\n", "P(\\mathsf{T}) & = 1-p\n", "\\end{align*}$$\n", "\n", "We use the $P(\\dots)$ notation to indicate a probability.\n", "\n", "It is easy to check that this distribution satisfies our two probability laws:\n", "\n", "1. Since $0 \\le p \\le 1$, both $p$ and $1-p$ lie between 0 and 1, so both are valid probabilities.\n", "2. $P(\\mathsf{H}) + P(\\mathsf{T}) = p + 1 - p = 1$.\n", "\n", "I have called our two outcomes $\\mathsf{H}$ and $\\mathsf{T}$ because we often think of them as corresponding to the flip of a (possibly weighted) coin with two sides, _heads_ and _tails_. 
A fair coin has $p=0.5$, so that both heads and tails are equally likely.\n", "\n", "Let's see the flips of 20 fair coins (don't worry about the details of the `flip` function for now):" ] }, { "cell_type": "code", "execution_count": 10, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 'T'
  2. \n", "\t
  3. 'H'
  4. \n", "\t
  5. 'H'
  6. \n", "\t
  7. 'H'
  8. \n", "\t
  9. 'T'
  10. \n", "\t
  11. 'H'
  12. \n", "\t
  13. 'H'
  14. \n", "\t
  15. 'T'
  16. \n", "\t
  17. 'H'
  18. \n", "\t
  19. 'T'
  20. \n", "\t
  21. 'T'
  22. \n", "\t
  23. 'T'
  24. \n", "\t
  25. 'T'
  26. \n", "\t
  27. 'T'
  28. \n", "\t
  29. 'H'
  30. \n", "\t
  31. 'T'
  32. \n", "\t
  33. 'H'
  34. \n", "\t
  35. 'H'
  36. \n", "\t
  37. 'T'
  38. \n", "\t
  39. 'T'
  40. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 'T'\n", "\\item 'H'\n", "\\item 'H'\n", "\\item 'H'\n", "\\item 'T'\n", "\\item 'H'\n", "\\item 'H'\n", "\\item 'T'\n", "\\item 'H'\n", "\\item 'T'\n", "\\item 'T'\n", "\\item 'T'\n", "\\item 'T'\n", "\\item 'T'\n", "\\item 'H'\n", "\\item 'T'\n", "\\item 'H'\n", "\\item 'H'\n", "\\item 'T'\n", "\\item 'T'\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 'T'\n", "2. 'H'\n", "3. 'H'\n", "4. 'H'\n", "5. 'T'\n", "6. 'H'\n", "7. 'H'\n", "8. 'T'\n", "9. 'H'\n", "10. 'T'\n", "11. 'T'\n", "12. 'T'\n", "13. 'T'\n", "14. 'T'\n", "15. 'H'\n", "16. 'T'\n", "17. 'H'\n", "18. 'H'\n", "19. 'T'\n", "20. 'T'\n", "\n", "\n" ], "text/plain": [ " [1] \"T\" \"H\" \"H\" \"H\" \"T\" \"H\" \"H\" \"T\" \"H\" \"T\" \"T\" \"T\" \"T\" \"T\" \"H\" \"T\" \"H\" \"H\" \"T\"\n", "[20] \"T\"" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Flip a coin n times; p is the probability of heads ('H')\n", "flip = function(n, p=0.5) {\n", " sample(c('H', 'T'), n, replace=TRUE, prob=c(p, 1-p))\n", "}\n", "flip(20)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "More generally, the binomial distribution gives the probability of observing $k$ _successes_ (in our case, heads) in $n$ flips (or _trials_). Let's count the successes in a series of 20 flips:" ] }, { "cell_type": "code", "execution_count": 11, "metadata": {}, "outputs": [ { "data": { "text/html": [ "8" ], "text/latex": [ "8" ], "text/markdown": [ "8" ], "text/plain": [ "[1] 8" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sum(flip(20) == 'H')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The count will often be 10 or close to it, but not always - it may be 7 or 9 or 11.\n", "\n", "> **R Note:** The `==` operator tests for equality, and like most other R operations, it is vectorized - it tests each element of the left-hand vector against the corresponding element of the right-hand vector; when the right-hand vector has length 1, it just reuses that element for all left-hand values. 
The result is a _logical_ vector (`TRUE` and `FALSE` values); summing it counts the `TRUE` values.\n", "\n", "We can take advantage of another couple of R operations, `:` to generate sequences and `sapply` to apply a function to each element of a vector, to carry out 100 runs of 20 flips each and see how often each count of heads occurs:" ] }, { "cell_type": "code", "execution_count": 12, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
    \n", "\t
  1. 12
  2. \n", "\t
  3. 13
  4. \n", "\t
  5. 11
  6. \n", "\t
  7. 10
  8. \n", "\t
  9. 12
  10. \n", "\t
  11. 10
  12. \n", "\t
  13. 13
  14. \n", "\t
  15. 10
  16. \n", "\t
  17. 10
  18. \n", "\t
  19. 9
  20. \n", "\t
  21. 8
  22. \n", "\t
  23. 10
  24. \n", "\t
  25. 11
  26. \n", "\t
  27. 10
  28. \n", "\t
  29. 9
  30. \n", "\t
  31. 16
  32. \n", "\t
  33. 8
  34. \n", "\t
  35. 7
  36. \n", "\t
  37. 14
  38. \n", "\t
  39. 9
  40. \n", "\t
  41. 8
  42. \n", "\t
  43. 12
  44. \n", "\t
  45. 10
  46. \n", "\t
  47. 7
  48. \n", "\t
  49. 8
  50. \n", "\t
  51. 9
  52. \n", "\t
  53. 8
  54. \n", "\t
  55. 7
  56. \n", "\t
  57. 10
  58. \n", "\t
  59. 9
  60. \n", "\t
  61. 10
  62. \n", "\t
  63. 12
  64. \n", "\t
  65. 13
  66. \n", "\t
  67. 8
  68. \n", "\t
  69. 8
  70. \n", "\t
  71. 11
  72. \n", "\t
  73. 10
  74. \n", "\t
  75. 12
  76. \n", "\t
  77. 11
  78. \n", "\t
  79. 11
  80. \n", "\t
  81. 9
  82. \n", "\t
  83. 10
  84. \n", "\t
  85. 10
  86. \n", "\t
  87. 10
  88. \n", "\t
  89. 10
  90. \n", "\t
  91. 12
  92. \n", "\t
  93. 8
  94. \n", "\t
  95. 12
  96. \n", "\t
  97. 10
  98. \n", "\t
  99. 8
  100. \n", "\t
  101. 10
  102. \n", "\t
  103. 7
  104. \n", "\t
  105. 9
  106. \n", "\t
  107. 12
  108. \n", "\t
  109. 9
  110. \n", "\t
  111. 12
  112. \n", "\t
  113. 9
  114. \n", "\t
  115. 13
  116. \n", "\t
  117. 12
  118. \n", "\t
  119. 9
  120. \n", "\t
  121. 12
  122. \n", "\t
  123. 11
  124. \n", "\t
  125. 10
  126. \n", "\t
  127. 11
  128. \n", "\t
  129. 13
  130. \n", "\t
  131. 14
  132. \n", "\t
  133. 6
  134. \n", "\t
  135. 9
  136. \n", "\t
  137. 10
  138. \n", "\t
  139. 11
  140. \n", "\t
  141. 11
  142. \n", "\t
  143. 12
  144. \n", "\t
  145. 8
  146. \n", "\t
  147. 9
  148. \n", "\t
  149. 13
  150. \n", "\t
  151. 13
  152. \n", "\t
  153. 9
  154. \n", "\t
  155. 10
  156. \n", "\t
  157. 17
  158. \n", "\t
  159. 13
  160. \n", "\t
  161. 14
  162. \n", "\t
  163. 9
  164. \n", "\t
  165. 8
  166. \n", "\t
  167. 10
  168. \n", "\t
  169. 11
  170. \n", "\t
  171. 10
  172. \n", "\t
  173. 14
  174. \n", "\t
  175. 7
  176. \n", "\t
  177. 10
  178. \n", "\t
  179. 9
  180. \n", "\t
  181. 12
  182. \n", "\t
  183. 12
  184. \n", "\t
  185. 11
  186. \n", "\t
  187. 9
  188. \n", "\t
  189. 11
  190. \n", "\t
  191. 6
  192. \n", "\t
  193. 12
  194. \n", "\t
  195. 9
  196. \n", "\t
  197. 8
  198. \n", "\t
  199. 12
  200. \n", "
\n" ], "text/latex": [ "\\begin{enumerate*}\n", "\\item 12\n", "\\item 13\n", "\\item 11\n", "\\item 10\n", "\\item 12\n", "\\item 10\n", "\\item 13\n", "\\item 10\n", "\\item 10\n", "\\item 9\n", "\\item 8\n", "\\item 10\n", "\\item 11\n", "\\item 10\n", "\\item 9\n", "\\item 16\n", "\\item 8\n", "\\item 7\n", "\\item 14\n", "\\item 9\n", "\\item 8\n", "\\item 12\n", "\\item 10\n", "\\item 7\n", "\\item 8\n", "\\item 9\n", "\\item 8\n", "\\item 7\n", "\\item 10\n", "\\item 9\n", "\\item 10\n", "\\item 12\n", "\\item 13\n", "\\item 8\n", "\\item 8\n", "\\item 11\n", "\\item 10\n", "\\item 12\n", "\\item 11\n", "\\item 11\n", "\\item 9\n", "\\item 10\n", "\\item 10\n", "\\item 10\n", "\\item 10\n", "\\item 12\n", "\\item 8\n", "\\item 12\n", "\\item 10\n", "\\item 8\n", "\\item 10\n", "\\item 7\n", "\\item 9\n", "\\item 12\n", "\\item 9\n", "\\item 12\n", "\\item 9\n", "\\item 13\n", "\\item 12\n", "\\item 9\n", "\\item 12\n", "\\item 11\n", "\\item 10\n", "\\item 11\n", "\\item 13\n", "\\item 14\n", "\\item 6\n", "\\item 9\n", "\\item 10\n", "\\item 11\n", "\\item 11\n", "\\item 12\n", "\\item 8\n", "\\item 9\n", "\\item 13\n", "\\item 13\n", "\\item 9\n", "\\item 10\n", "\\item 17\n", "\\item 13\n", "\\item 14\n", "\\item 9\n", "\\item 8\n", "\\item 10\n", "\\item 11\n", "\\item 10\n", "\\item 14\n", "\\item 7\n", "\\item 10\n", "\\item 9\n", "\\item 12\n", "\\item 12\n", "\\item 11\n", "\\item 9\n", "\\item 11\n", "\\item 6\n", "\\item 12\n", "\\item 9\n", "\\item 8\n", "\\item 12\n", "\\end{enumerate*}\n" ], "text/markdown": [ "1. 12\n", "2. 13\n", "3. 11\n", "4. 10\n", "5. 12\n", "6. 10\n", "7. 13\n", "8. 10\n", "9. 10\n", "10. 9\n", "11. 8\n", "12. 10\n", "13. 11\n", "14. 10\n", "15. 9\n", "16. 16\n", "17. 8\n", "18. 7\n", "19. 14\n", "20. 9\n", "21. 8\n", "22. 12\n", "23. 10\n", "24. 7\n", "25. 8\n", "26. 9\n", "27. 8\n", "28. 7\n", "29. 10\n", "30. 9\n", "31. 10\n", "32. 12\n", "33. 13\n", "34. 8\n", "35. 8\n", "36. 11\n", "37. 10\n", "38. 12\n", "39. 
11\n", "40. 11\n", "41. 9\n", "42. 10\n", "43. 10\n", "44. 10\n", "45. 10\n", "46. 12\n", "47. 8\n", "48. 12\n", "49. 10\n", "50. 8\n", "51. 10\n", "52. 7\n", "53. 9\n", "54. 12\n", "55. 9\n", "56. 12\n", "57. 9\n", "58. 13\n", "59. 12\n", "60. 9\n", "61. 12\n", "62. 11\n", "63. 10\n", "64. 11\n", "65. 13\n", "66. 14\n", "67. 6\n", "68. 9\n", "69. 10\n", "70. 11\n", "71. 11\n", "72. 12\n", "73. 8\n", "74. 9\n", "75. 13\n", "76. 13\n", "77. 9\n", "78. 10\n", "79. 17\n", "80. 13\n", "81. 14\n", "82. 9\n", "83. 8\n", "84. 10\n", "85. 11\n", "86. 10\n", "87. 14\n", "88. 7\n", "89. 10\n", "90. 9\n", "91. 12\n", "92. 12\n", "93. 11\n", "94. 9\n", "95. 11\n", "96. 6\n", "97. 12\n", "98. 9\n", "99. 8\n", "100. 12\n", "\n", "\n" ], "text/plain": [ " [1] 12 13 11 10 12 10 13 10 10 9 8 10 11 10 9 16 8 7 14 9 8 12 10 7 8\n", " [26] 9 8 7 10 9 10 12 13 8 8 11 10 12 11 11 9 10 10 10 10 12 8 12 10 8\n", " [51] 10 7 9 12 9 12 9 13 12 9 12 11 10 11 13 14 6 9 10 11 11 12 8 9 13\n", " [76] 13 9 10 17 13 14 9 8 10 11 10 14 7 10 9 12 12 11 9 11 6 12 9 8 12" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "# Run 100 independent sequences of 20 flips and count the heads in each\n", "repeated_sequences = sapply(1:100, function(t) {\n", " sum(flip(20) == 'H')\n", "})\n", "repeated_sequences" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In R, we don't need an explicit `return` call here - a function returns the value of its last expression. 
The `sapply` function takes a vector and a function `f` and returns a new vector that is the result of calling `f(x)` for each value `x` in the original vector.\n", "\n", "Let's plot these values:" ] }, { "cell_type": "code", "execution_count": 13, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ggplot(data.frame(k=repeated_sequences)) +\n", " aes(x=k) +\n", " geom_histogram(binwidth=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can see that values close to 10 are the most common.\n", "\n", "What if we have a weighted coin, so that $P(\\mathsf{H}) = 0.7$?"
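, "\n", "\n",
"A weighted coin only changes the sampling probabilities. As a rough sketch (the name `flip_sketch` is hypothetical; the notebook's own `flip`, defined earlier, may be implemented differently), weighted flips could be drawn with base R's `sample`:\n",
"\n",
"```r\n",
"# Hypothetical stand-in for the notebook's `flip`; the real definition may differ.\n",
"flip_sketch = function(n, p=0.5) {\n",
"    # draw n outcomes: 'H' with probability p, 'T' with probability 1 - p\n",
"    sample(c('H', 'T'), n, replace=TRUE, prob=c(p, 1 - p))\n",
"}\n",
"\n",
"set.seed(1)  # only to make the sketch reproducible\n",
"sum(flip_sketch(20, 0.7) == 'H')  # number of heads in 20 weighted flips\n",
"```"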
] }, { "cell_type": "code", "execution_count": 14, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "repeated_sequences = sapply(1:100, function(t) {\n", " sum(flip(20, 0.7) == 'H')\n", "})\n", "ggplot(data.frame(k=repeated_sequences)) +\n", " aes(x=k) +\n", " geom_histogram(binwidth=1)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Now 13-16 are the most common values, which we would expect, since $0.7 \\cdot 20 = 14$.\n", "\n", "Now, we can directly compute the probability of observing $k$ successes in $n$ trials without needing to simulate all these trials. 
The probability $P(k|n,p)$ (read ‘the probability of $k$ given $n$ and $p$’) can be written:\n", "\n", "$$P(k|n,p) = {{n}\\choose{k}} p^k (1-p)^{(n-k)}$$\n", "\n", "R has a built-in definition of this function called `dbinom`; its arguments are $k$, $n$, and $p$, in that order:" ] }, { "cell_type": "code", "execution_count": 15, "metadata": {}, "outputs": [ { "data": { "text/html": [ "0.191638982753443" ], "text/latex": [ "0.191638982753443" ], "text/markdown": [ "0.191638982753443" ], "text/plain": [ "[1] 0.191639" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "dbinom(14, 20, 0.7)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For fixed $n$ and $p$, the binomial distribution is itself a discrete distribution over the integers $0 \\dots n$, and we can also visualize it:" ] }, { "cell_type": "code", "execution_count": 16, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ggplot(data.frame(k=0:20) %>% mutate(prob=dbinom(k, 20, 0.7))) +\n", " aes(x=k, y=prob) +\n", " geom_bar(stat='identity')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can observe that the histogram of our simulated flips above has the same basic shape. Because the simulation is random, it won't align perfectly, but on average it will be close." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Joint Distributions\n", "\n", "We have now seen how we can start to think about the distribution of a single random variable by counting; often, though, we care about more than one variable.\n", "\n", "Let's look at the carrier airlines for our NYC flights:" ] }, { "cell_type": "code", "execution_count": 17, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
carrier count
9E 18460
AA 32729
AS 714
B6 54635
DL 48110
EV 54173
F9 685
FL 3260
HA 342
MQ 26397
OO 32
UA 58665
US 20536
VX 5162
WN 12275
YV 601
\n" ], "text/latex": [ "\\begin{tabular}{r|ll}\n", " carrier & count\\\\\n", "\\hline\n", "\t 9E & 18460\\\\\n", "\t AA & 32729\\\\\n", "\t AS & 714\\\\\n", "\t B6 & 54635\\\\\n", "\t DL & 48110\\\\\n", "\t EV & 54173\\\\\n", "\t F9 & 685\\\\\n", "\t FL & 3260\\\\\n", "\t HA & 342\\\\\n", "\t MQ & 26397\\\\\n", "\t OO & 32\\\\\n", "\t UA & 58665\\\\\n", "\t US & 20536\\\\\n", "\t VX & 5162\\\\\n", "\t WN & 12275\\\\\n", "\t YV & 601\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "carrier | count | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 9E | 18460 | \n", "| AA | 32729 | \n", "| AS | 714 | \n", "| B6 | 54635 | \n", "| DL | 48110 | \n", "| EV | 54173 | \n", "| F9 | 685 | \n", "| FL | 3260 | \n", "| HA | 342 | \n", "| MQ | 26397 | \n", "| OO | 32 | \n", "| UA | 58665 | \n", "| US | 20536 | \n", "| VX | 5162 | \n", "| WN | 12275 | \n", "| YV | 601 | \n", "\n", "\n" ], "text/plain": [ " carrier count\n", "1 9E 18460\n", "2 AA 32729\n", "3 AS 714\n", "4 B6 54635\n", "5 DL 48110\n", "6 EV 54173\n", "7 F9 685\n", "8 FL 3260\n", "9 HA 342\n", "10 MQ 26397\n", "11 OO 32\n", "12 UA 58665\n", "13 US 20536\n", "14 VX 5162\n", "15 WN 12275\n", "16 YV 601" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "flights %>%\n", " group_by(carrier) %>%\n", " summarize(count=n())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now have a count for each carrier, and we can convert this to a probability distribution to estimate the probability of a flight being operated by a particular airline.\n", "\n", "We can also start to think about airlines _and_ origin airports. Let's do a bit more R trickery! We can group by two variables:" ] }, { "cell_type": "code", "execution_count": 18, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
origin carrier count prob
EWR 9E 1268 0.003765114
EWR AA 3487 0.010354063
EWR AS 714 0.002120104
EWR B6 6557 0.019469915
EWR DL 4342 0.012892843
EWR EV 43939 0.130469511
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " origin & carrier & count & prob\\\\\n", "\\hline\n", "\t EWR & 9E & 1268 & 0.003765114\\\\\n", "\t EWR & AA & 3487 & 0.010354063\\\\\n", "\t EWR & AS & 714 & 0.002120104\\\\\n", "\t EWR & B6 & 6557 & 0.019469915\\\\\n", "\t EWR & DL & 4342 & 0.012892843\\\\\n", "\t EWR & EV & 43939 & 0.130469511\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "origin | carrier | count | prob | \n", "|---|---|---|---|---|---|\n", "| EWR | 9E | 1268 | 0.003765114 | \n", "| EWR | AA | 3487 | 0.010354063 | \n", "| EWR | AS | 714 | 0.002120104 | \n", "| EWR | B6 | 6557 | 0.019469915 | \n", "| EWR | DL | 4342 | 0.012892843 | \n", "| EWR | EV | 43939 | 0.130469511 | \n", "\n", "\n" ], "text/plain": [ " origin carrier count prob \n", "1 EWR 9E 1268 0.003765114\n", "2 EWR AA 3487 0.010354063\n", "3 EWR AS 714 0.002120104\n", "4 EWR B6 6557 0.019469915\n", "5 EWR DL 4342 0.012892843\n", "6 EWR EV 43939 0.130469511" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origin_carrier_flights = flights %>%\n", " group_by(origin, carrier) %>%\n", " summarize(count=n()) %>%\n", " ungroup() %>%\n", " mutate(prob = count / sum(count))\n", "head(origin_carrier_flights)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "> **R Note:** this contains 2 new functions. `mutate` is the `dplyr` way of doing the normalization we did previously for origins; it lets us compute a new variable based on other variables in the data frame. `ungroup` removes the grouping data introduced by `group_by`, so that `sum` sums over the entire data frame." 
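, "\n", "\n",
"To see why `ungroup` matters, here is a minimal sketch on a made-up data frame (`toy` and its values are hypothetical, not taken from `nycflights13`). On a grouped data frame, `mutate` evaluates `sum(count)` within each group; after `summarize` on two grouping variables the result is still grouped by the first one (`origin`), which is why the pipeline above calls `ungroup` before normalizing:\n",
"\n",
"```r\n",
"library(dplyr)\n",
"\n",
"# Toy counts (hypothetical), already summarized per origin and carrier\n",
"toy = data.frame(\n",
"    origin = c('A', 'A', 'B', 'B'),\n",
"    carrier = c('X', 'Y', 'X', 'Y'),\n",
"    count = c(1, 3, 2, 4)\n",
")\n",
"\n",
"# Still grouped by origin: sum(count) is taken within each origin,\n",
"# so prob is 0.25, 0.75 for A and about 0.33, 0.67 for B\n",
"toy %>%\n",
"    group_by(origin) %>%\n",
"    mutate(prob = count / sum(count))\n",
"\n",
"# After ungroup(): sum(count) covers the whole data frame,\n",
"# so prob is 0.1, 0.3, 0.2, 0.4 and sums to 1 overall\n",
"toy %>%\n",
"    group_by(origin) %>%\n",
"    ungroup() %>%\n",
"    mutate(prob = count / sum(count))\n",
"```\n",
"\n",
"The first version normalizes within each origin, so each origin's probabilities sum to 1; the second normalizes over the whole table, which is what a single distribution over all (origin, carrier) pairs requires."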
] }, { "cell_type": "code", "execution_count": 19, "metadata": {}, "outputs": [ { "data": { "text/html": [ "1" ], "text/latex": [ "1" ], "text/markdown": [ "1" ], "text/plain": [ "[1] 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sum(origin_carrier_flights$prob)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This probability distribution is called a _joint probability distribution_: it is the probability of two variables simultaneously taking on the given values. We can write it $P(O, C)$: the probability of a specific origin and carrier. So $P(O=\\mathsf{EWR}, C=\\mathsf{AA}) \\approx 0.010$.\n", "\n", "It can be easier to visualize this in a more matrix-like form. The `spread` function lets us convert data in this form (‘tall’) into a ‘wide’ format; the `select(-count)` operation removes the `count` column from the data frame:" ] }, { "cell_type": "code", "execution_count": 20, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
carrier EWR JFK LGA
9E 0.003765114 0.043503694 7.545074e-03
AA 0.010354063 0.040926313 4.590291e-02
AS 0.002120104 0.000000000 0.000000e+00
B6 0.019469915 0.124937644 1.782194e-02
DL 0.012892843 0.061468157 6.849360e-02
EV 0.130469511 0.004180820 2.620733e-02
F9 0.000000000 0.000000000 2.033993e-03
FL 0.000000000 0.000000000 9.680025e-03
HA 0.000000000 0.001015512 0.000000e+00
MQ 0.006758201 0.021358410 5.026486e-02
OO 0.000017816 0.000000000 7.720265e-05
UA 0.136847638 0.013462955 2.388531e-02
US 0.013079911 0.008893152 3.900515e-02
VX 0.004649975 0.010677721 0.000000e+00
WN 0.018374231 0.000000000 1.807433e-02
YV 0.000000000 0.000000000 1.784569e-03
\n" ], "text/latex": [ "\\begin{tabular}{r|llll}\n", " carrier & EWR & JFK & LGA\\\\\n", "\\hline\n", "\t 9E & 0.003765114 & 0.043503694 & 7.545074e-03\\\\\n", "\t AA & 0.010354063 & 0.040926313 & 4.590291e-02\\\\\n", "\t AS & 0.002120104 & 0.000000000 & 0.000000e+00\\\\\n", "\t B6 & 0.019469915 & 0.124937644 & 1.782194e-02\\\\\n", "\t DL & 0.012892843 & 0.061468157 & 6.849360e-02\\\\\n", "\t EV & 0.130469511 & 0.004180820 & 2.620733e-02\\\\\n", "\t F9 & 0.000000000 & 0.000000000 & 2.033993e-03\\\\\n", "\t FL & 0.000000000 & 0.000000000 & 9.680025e-03\\\\\n", "\t HA & 0.000000000 & 0.001015512 & 0.000000e+00\\\\\n", "\t MQ & 0.006758201 & 0.021358410 & 5.026486e-02\\\\\n", "\t OO & 0.000017816 & 0.000000000 & 7.720265e-05\\\\\n", "\t UA & 0.136847638 & 0.013462955 & 2.388531e-02\\\\\n", "\t US & 0.013079911 & 0.008893152 & 3.900515e-02\\\\\n", "\t VX & 0.004649975 & 0.010677721 & 0.000000e+00\\\\\n", "\t WN & 0.018374231 & 0.000000000 & 1.807433e-02\\\\\n", "\t YV & 0.000000000 & 0.000000000 & 1.784569e-03\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "carrier | EWR | JFK | LGA | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 9E | 0.003765114 | 0.043503694 | 7.545074e-03 | \n", "| AA | 0.010354063 | 0.040926313 | 4.590291e-02 | \n", "| AS | 0.002120104 | 0.000000000 | 0.000000e+00 | \n", "| B6 | 0.019469915 | 0.124937644 | 1.782194e-02 | \n", "| DL | 0.012892843 | 0.061468157 | 6.849360e-02 | \n", "| EV | 0.130469511 | 0.004180820 | 2.620733e-02 | \n", "| F9 | 0.000000000 | 0.000000000 | 2.033993e-03 | \n", "| FL | 0.000000000 | 0.000000000 | 9.680025e-03 | \n", "| HA | 0.000000000 | 0.001015512 | 0.000000e+00 | \n", "| MQ | 0.006758201 | 0.021358410 | 5.026486e-02 | \n", "| OO | 0.000017816 | 0.000000000 | 7.720265e-05 | \n", "| UA | 0.136847638 | 0.013462955 | 2.388531e-02 | \n", "| US | 0.013079911 | 0.008893152 | 3.900515e-02 | \n", "| VX | 0.004649975 | 0.010677721 | 0.000000e+00 | \n", "| WN | 0.018374231 | 0.000000000 
| 1.807433e-02 | \n", "| YV | 0.000000000 | 0.000000000 | 1.784569e-03 | \n", "\n", "\n" ], "text/plain": [ " carrier EWR JFK LGA \n", "1 9E 0.003765114 0.043503694 7.545074e-03\n", "2 AA 0.010354063 0.040926313 4.590291e-02\n", "3 AS 0.002120104 0.000000000 0.000000e+00\n", "4 B6 0.019469915 0.124937644 1.782194e-02\n", "5 DL 0.012892843 0.061468157 6.849360e-02\n", "6 EV 0.130469511 0.004180820 2.620733e-02\n", "7 F9 0.000000000 0.000000000 2.033993e-03\n", "8 FL 0.000000000 0.000000000 9.680025e-03\n", "9 HA 0.000000000 0.001015512 0.000000e+00\n", "10 MQ 0.006758201 0.021358410 5.026486e-02\n", "11 OO 0.000017816 0.000000000 7.720265e-05\n", "12 UA 0.136847638 0.013462955 2.388531e-02\n", "13 US 0.013079911 0.008893152 3.900515e-02\n", "14 VX 0.004649975 0.010677721 0.000000e+00\n", "15 WN 0.018374231 0.000000000 1.807433e-02\n", "16 YV 0.000000000 0.000000000 1.784569e-03" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origin_carrier_wide = spread(origin_carrier_flights %>% select(-count), origin, prob, fill=0)\n", "origin_carrier_wide" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We can then convert this into an R _matrix_:" ] }, { "cell_type": "code", "execution_count": 21, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
EWRJFKLGA
9E0.003765114 0.043503694 7.545074e-03
AA0.010354063 0.040926313 4.590291e-02
AS0.002120104 0.000000000 0.000000e+00
B60.019469915 0.124937644 1.782194e-02
DL0.012892843 0.061468157 6.849360e-02
EV0.130469511 0.004180820 2.620733e-02
F90.000000000 0.000000000 2.033993e-03
FL0.000000000 0.000000000 9.680025e-03
HA0.000000000 0.001015512 0.000000e+00
MQ0.006758201 0.021358410 5.026486e-02
OO0.000017816 0.000000000 7.720265e-05
UA0.136847638 0.013462955 2.388531e-02
US0.013079911 0.008893152 3.900515e-02
VX0.004649975 0.010677721 0.000000e+00
WN0.018374231 0.000000000 1.807433e-02
YV0.000000000 0.000000000 1.784569e-03
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " & EWR & JFK & LGA\\\\\n", "\\hline\n", "\t9E & 0.003765114 & 0.043503694 & 7.545074e-03\\\\\n", "\tAA & 0.010354063 & 0.040926313 & 4.590291e-02\\\\\n", "\tAS & 0.002120104 & 0.000000000 & 0.000000e+00\\\\\n", "\tB6 & 0.019469915 & 0.124937644 & 1.782194e-02\\\\\n", "\tDL & 0.012892843 & 0.061468157 & 6.849360e-02\\\\\n", "\tEV & 0.130469511 & 0.004180820 & 2.620733e-02\\\\\n", "\tF9 & 0.000000000 & 0.000000000 & 2.033993e-03\\\\\n", "\tFL & 0.000000000 & 0.000000000 & 9.680025e-03\\\\\n", "\tHA & 0.000000000 & 0.001015512 & 0.000000e+00\\\\\n", "\tMQ & 0.006758201 & 0.021358410 & 5.026486e-02\\\\\n", "\tOO & 0.000017816 & 0.000000000 & 7.720265e-05\\\\\n", "\tUA & 0.136847638 & 0.013462955 & 2.388531e-02\\\\\n", "\tUS & 0.013079911 & 0.008893152 & 3.900515e-02\\\\\n", "\tVX & 0.004649975 & 0.010677721 & 0.000000e+00\\\\\n", "\tWN & 0.018374231 & 0.000000000 & 1.807433e-02\\\\\n", "\tYV & 0.000000000 & 0.000000000 & 1.784569e-03\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | EWR | JFK | LGA | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 9E | 0.003765114 | 0.043503694 | 7.545074e-03 | \n", "| AA | 0.010354063 | 0.040926313 | 4.590291e-02 | \n", "| AS | 0.002120104 | 0.000000000 | 0.000000e+00 | \n", "| B6 | 0.019469915 | 0.124937644 | 1.782194e-02 | \n", "| DL | 0.012892843 | 0.061468157 | 6.849360e-02 | \n", "| EV | 0.130469511 | 0.004180820 | 2.620733e-02 | \n", "| F9 | 0.000000000 | 0.000000000 | 2.033993e-03 | \n", "| FL | 0.000000000 | 0.000000000 | 9.680025e-03 | \n", "| HA | 0.000000000 | 0.001015512 | 0.000000e+00 | \n", "| MQ | 0.006758201 | 0.021358410 | 5.026486e-02 | \n", "| OO | 0.000017816 | 0.000000000 | 7.720265e-05 | \n", "| UA | 0.136847638 | 0.013462955 | 2.388531e-02 | \n", "| US | 0.013079911 | 0.008893152 | 3.900515e-02 | \n", "| VX | 0.004649975 | 0.010677721 | 0.000000e+00 | \n", "| WN | 0.018374231 | 0.000000000 | 1.807433e-02 | \n", "| YV | 
0.000000000 | 0.000000000 | 1.784569e-03 | \n", "\n", "\n" ], "text/plain": [ " EWR JFK LGA \n", "9E 0.003765114 0.043503694 7.545074e-03\n", "AA 0.010354063 0.040926313 4.590291e-02\n", "AS 0.002120104 0.000000000 0.000000e+00\n", "B6 0.019469915 0.124937644 1.782194e-02\n", "DL 0.012892843 0.061468157 6.849360e-02\n", "EV 0.130469511 0.004180820 2.620733e-02\n", "F9 0.000000000 0.000000000 2.033993e-03\n", "FL 0.000000000 0.000000000 9.680025e-03\n", "HA 0.000000000 0.001015512 0.000000e+00\n", "MQ 0.006758201 0.021358410 5.026486e-02\n", "OO 0.000017816 0.000000000 7.720265e-05\n", "UA 0.136847638 0.013462955 2.388531e-02\n", "US 0.013079911 0.008893152 3.900515e-02\n", "VX 0.004649975 0.010677721 0.000000e+00\n", "WN 0.018374231 0.000000000 1.807433e-02\n", "YV 0.000000000 0.000000000 1.784569e-03" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origin_carrier_matrix = as.matrix(select(origin_carrier_wide, -carrier))\n", "row.names(origin_carrier_matrix) = origin_carrier_wide$carrier\n", "origin_carrier_matrix" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This is our joint distribution: each cell contains the probability of a randomly selected flight being on the particular carrier _and_ from the specified airport. We can check its sum again:" ] }, { "cell_type": "code", "execution_count": 22, "metadata": {}, "outputs": [ { "data": { "text/html": [ "1" ], "text/latex": [ "1" ], "text/markdown": [ "1" ], "text/plain": [ "[1] 1" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "sum(origin_carrier_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Ok, that's better." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Marginal Probabilities\n", "\n", "One of the things we often want to do with a joint probability distribution is compute the _marginal distributions_ of its variables. 
If we have a joint distribution $P(A,B)$, the marginal distribution of $A$ is $P(A) = \\sum_{b} P(A, B=b)$: we sum the joint probabilities over every value $b$ that $B$ can take. When our joint distribution is a matrix, R makes it very easy to compute the marginals:" ] }, { "cell_type": "code", "execution_count": 23, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
9E
\n", "\t\t
0.0548138822243865
\n", "\t
AA
\n", "\t\t
0.097183290970853
\n", "\t
AS
\n", "\t\t
0.00212010357032568
\n", "\t
B6
\n", "\t\t
0.162229493788156
\n", "\t
DL
\n", "\t\t
0.142854597714802
\n", "\t
EV
\n", "\t\t
0.16085766206618
\n", "\t
F9
\n", "\t\t
0.00203399292111077
\n", "\t
FL
\n", "\t\t
0.00968002470484833
\n", "\t
HA
\n", "\t\t
0.00101551179418961
\n", "\t
MQ
\n", "\t\t
0.0783814761146875
\n", "\t
OO
\n", "\t\t
9.50186474095541e-05
\n", "\t
UA
\n", "\t\t
0.174195904696297
\n", "\t
US
\n", "\t\t
0.0609782169750814
\n", "\t
VX
\n", "\t\t
0.0153276955602537
\n", "\t
WN
\n", "\t\t
0.0364485592797587
\n", "\t
YV
\n", "\t\t
0.00178456897166069
\n", "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[9E] 0.0548138822243865\n", "\\item[AA] 0.097183290970853\n", "\\item[AS] 0.00212010357032568\n", "\\item[B6] 0.162229493788156\n", "\\item[DL] 0.142854597714802\n", "\\item[EV] 0.16085766206618\n", "\\item[F9] 0.00203399292111077\n", "\\item[FL] 0.00968002470484833\n", "\\item[HA] 0.00101551179418961\n", "\\item[MQ] 0.0783814761146875\n", "\\item[OO] 9.50186474095541e-05\n", "\\item[UA] 0.174195904696297\n", "\\item[US] 0.0609782169750814\n", "\\item[VX] 0.0153276955602537\n", "\\item[WN] 0.0364485592797587\n", "\\item[YV] 0.00178456897166069\n", "\\end{description*}\n" ], "text/markdown": [ "9E\n", ": 0.0548138822243865AA\n", ": 0.097183290970853AS\n", ": 0.00212010357032568B6\n", ": 0.162229493788156DL\n", ": 0.142854597714802EV\n", ": 0.16085766206618F9\n", ": 0.00203399292111077FL\n", ": 0.00968002470484833HA\n", ": 0.00101551179418961MQ\n", ": 0.0783814761146875OO\n", ": 9.50186474095541e-05UA\n", ": 0.174195904696297US\n", ": 0.0609782169750814VX\n", ": 0.0153276955602537WN\n", ": 0.0364485592797587YV\n", ": 0.00178456897166069\n", "\n" ], "text/plain": [ " 9E AA AS B6 DL EV \n", "5.481388e-02 9.718329e-02 2.120104e-03 1.622295e-01 1.428546e-01 1.608577e-01 \n", " F9 FL HA MQ OO UA \n", "2.033993e-03 9.680025e-03 1.015512e-03 7.838148e-02 9.501865e-05 1.741959e-01 \n", " US VX WN YV \n", "6.097822e-02 1.532770e-02 3.644856e-02 1.784569e-03 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rowSums(origin_carrier_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each entry is the probability that a randomly selected flight is operated by the specified carrier. The column sums give the other marginal, the distribution over origin airports:" ] }, { "cell_type": "code", "execution_count": 24, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
EWR
\n", "\t\t
0.358799320616671
\n", "\t
JFK
\n", "\t\t
0.330424377033993
\n", "\t
LGA
\n", "\t\t
0.310776302349336
\n", "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[EWR] 0.358799320616671\n", "\\item[JFK] 0.330424377033993\n", "\\item[LGA] 0.310776302349336\n", "\\end{description*}\n" ], "text/markdown": [ "EWR\n", ": 0.358799320616671JFK\n", ": 0.330424377033993LGA\n", ": 0.310776302349336\n", "\n" ], "text/plain": [ " EWR JFK LGA \n", "0.3587993 0.3304244 0.3107763 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "colSums(origin_carrier_matrix)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If you compare these airport marginals with the airport probabilities we estimated at the beginning, you should find them to be the same. This is a useful sanity check - they should be the same! Marginalizing matters most when the joint distribution is all we have and we need the distribution of a single variable from it.\n", "\n", "Joint probability distributions are also symmetric - it makes no difference whether we write $P(O,C)$ or $P(C,O)$." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Conditional Probability\n", "\n", "Another useful kind of probability we can derive from a joint distribution is the _conditional_ probability. For example, the conditional probability $P(O|C)$ is the probability that a flight left a particular airport **given that** we know it is operated by a given carrier.\n", "\n", "It is defined as $P(O|C) = \\frac{P(O,C)}{P(C)}$.\n" ] }, { "cell_type": "code", "execution_count": 25, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
EWRJFKLGA
9E0.068689060.793661970.1376490
AA0.106541600.421125000.4723334
AS1.000000000.000000000.0000000
B60.120014640.770129040.1098563
DL0.090251510.430284760.4794637
EV0.811086700.025990810.1629225
F90.000000000.000000001.0000000
FL0.000000000.000000001.0000000
HA0.000000001.000000000.0000000
MQ0.086221920.272493090.6412850
OO0.187500000.000000000.8125000
UA0.785596180.077286290.1371175
US0.214501360.145841450.6396572
VX0.303370790.696629210.0000000
WN0.504114050.000000000.4958859
YV0.000000000.000000001.0000000
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " & EWR & JFK & LGA\\\\\n", "\\hline\n", "\t9E & 0.06868906 & 0.79366197 & 0.1376490 \\\\\n", "\tAA & 0.10654160 & 0.42112500 & 0.4723334 \\\\\n", "\tAS & 1.00000000 & 0.00000000 & 0.0000000 \\\\\n", "\tB6 & 0.12001464 & 0.77012904 & 0.1098563 \\\\\n", "\tDL & 0.09025151 & 0.43028476 & 0.4794637 \\\\\n", "\tEV & 0.81108670 & 0.02599081 & 0.1629225 \\\\\n", "\tF9 & 0.00000000 & 0.00000000 & 1.0000000 \\\\\n", "\tFL & 0.00000000 & 0.00000000 & 1.0000000 \\\\\n", "\tHA & 0.00000000 & 1.00000000 & 0.0000000 \\\\\n", "\tMQ & 0.08622192 & 0.27249309 & 0.6412850 \\\\\n", "\tOO & 0.18750000 & 0.00000000 & 0.8125000 \\\\\n", "\tUA & 0.78559618 & 0.07728629 & 0.1371175 \\\\\n", "\tUS & 0.21450136 & 0.14584145 & 0.6396572 \\\\\n", "\tVX & 0.30337079 & 0.69662921 & 0.0000000 \\\\\n", "\tWN & 0.50411405 & 0.00000000 & 0.4958859 \\\\\n", "\tYV & 0.00000000 & 0.00000000 & 1.0000000 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | EWR | JFK | LGA | \n", "|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|\n", "| 9E | 0.06868906 | 0.79366197 | 0.1376490 | \n", "| AA | 0.10654160 | 0.42112500 | 0.4723334 | \n", "| AS | 1.00000000 | 0.00000000 | 0.0000000 | \n", "| B6 | 0.12001464 | 0.77012904 | 0.1098563 | \n", "| DL | 0.09025151 | 0.43028476 | 0.4794637 | \n", "| EV | 0.81108670 | 0.02599081 | 0.1629225 | \n", "| F9 | 0.00000000 | 0.00000000 | 1.0000000 | \n", "| FL | 0.00000000 | 0.00000000 | 1.0000000 | \n", "| HA | 0.00000000 | 1.00000000 | 0.0000000 | \n", "| MQ | 0.08622192 | 0.27249309 | 0.6412850 | \n", "| OO | 0.18750000 | 0.00000000 | 0.8125000 | \n", "| UA | 0.78559618 | 0.07728629 | 0.1371175 | \n", "| US | 0.21450136 | 0.14584145 | 0.6396572 | \n", "| VX | 0.30337079 | 0.69662921 | 0.0000000 | \n", "| WN | 0.50411405 | 0.00000000 | 0.4958859 | \n", "| YV | 0.00000000 | 0.00000000 | 1.0000000 | \n", "\n", "\n" ], "text/plain": [ " EWR JFK LGA \n", "9E 0.06868906 0.79366197 0.1376490\n", "AA 
0.10654160 0.42112500 0.4723334\n", "AS 1.00000000 0.00000000 0.0000000\n", "B6 0.12001464 0.77012904 0.1098563\n", "DL 0.09025151 0.43028476 0.4794637\n", "EV 0.81108670 0.02599081 0.1629225\n", "F9 0.00000000 0.00000000 1.0000000\n", "FL 0.00000000 0.00000000 1.0000000\n", "HA 0.00000000 1.00000000 0.0000000\n", "MQ 0.08622192 0.27249309 0.6412850\n", "OO 0.18750000 0.00000000 0.8125000\n", "UA 0.78559618 0.07728629 0.1371175\n", "US 0.21450136 0.14584145 0.6396572\n", "VX 0.30337079 0.69662921 0.0000000\n", "WN 0.50411405 0.00000000 0.4958859\n", "YV 0.00000000 0.00000000 1.0000000" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "origin_carrier_cond = origin_carrier_matrix / rowSums(origin_carrier_matrix)\n", "origin_carrier_cond" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "What did we just do? We divided a matrix by a vector that has as many entries as the matrix has rows. R stores matrices column by column and recycles the vector down each column, so every entry in the matrix is divided by the value corresponding to its row. Neat, huh? We can check that each row is a probability distribution:" ] }, { "cell_type": "code", "execution_count": 26, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
9E
\n", "\t\t
1
\n", "\t
AA
\n", "\t\t
1
\n", "\t
AS
\n", "\t\t
1
\n", "\t
B6
\n", "\t\t
1
\n", "\t
DL
\n", "\t\t
1
\n", "\t
EV
\n", "\t\t
1
\n", "\t
F9
\n", "\t\t
1
\n", "\t
FL
\n", "\t\t
1
\n", "\t
HA
\n", "\t\t
1
\n", "\t
MQ
\n", "\t\t
1
\n", "\t
OO
\n", "\t\t
1
\n", "\t
UA
\n", "\t\t
1
\n", "\t
US
\n", "\t\t
1
\n", "\t
VX
\n", "\t\t
1
\n", "\t
WN
\n", "\t\t
1
\n", "\t
YV
\n", "\t\t
1
\n", "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[9E] 1\n", "\\item[AA] 1\n", "\\item[AS] 1\n", "\\item[B6] 1\n", "\\item[DL] 1\n", "\\item[EV] 1\n", "\\item[F9] 1\n", "\\item[FL] 1\n", "\\item[HA] 1\n", "\\item[MQ] 1\n", "\\item[OO] 1\n", "\\item[UA] 1\n", "\\item[US] 1\n", "\\item[VX] 1\n", "\\item[WN] 1\n", "\\item[YV] 1\n", "\\end{description*}\n" ], "text/markdown": [ "9E\n", ": 1AA\n", ": 1AS\n", ": 1B6\n", ": 1DL\n", ": 1EV\n", ": 1F9\n", ": 1FL\n", ": 1HA\n", ": 1MQ\n", ": 1OO\n", ": 1UA\n", ": 1US\n", ": 1VX\n", ": 1WN\n", ": 1YV\n", ": 1\n", "\n" ], "text/plain": [ "9E AA AS B6 DL EV F9 FL HA MQ OO UA US VX WN YV \n", " 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rowSums(origin_carrier_cond)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Each row of our new matrix is a probability distribution over airports, given that we know the carrier operating the flight. Cool! If we know that the flight is United (UA), then it is most likely from Newark (EWR) - $P(\\mathsf{EWR}|\\mathsf{UA}) \\approx 0.79$.\n", "\n", "Now, unlike joint probabilities, conditional probabilities are _not_ symmetric: $P(O|C) \\ne P(C|O)$. If we want to compute $P(C|O)$, we can use `t` to _transpose_ our matrix and normalize rows to be distributions again:" ] }, { "cell_type": "code", "execution_count": 27, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\n", "
9EAAASB6DLEVF9FLHAMQOOUAUSVXWNYV
EWR0.01049365 0.02885753 0.005908884 0.05426408 0.0359333 0.36362809 0.000000000 0.00000000 0.000000000 0.01883560 4.965449e-050.38140439 0.03645467 0.01295982 0.05121033 0.000000000
JFK0.13166006 0.12385985 0.000000000 0.37811267 0.1860279 0.01265288 0.000000000 0.00000000 0.003073356 0.06463933 0.000000e+000.04074444 0.02691433 0.03231517 0.00000000 0.000000000
LGA0.02427815 0.14770404 0.000000000 0.05734651 0.2203952 0.08432860 0.006544878 0.03114789 0.000000000 0.16173970 2.484187e-040.07685693 0.12550878 0.00000000 0.05815864 0.005742294
\n" ], "text/latex": [ "\\begin{tabular}{r|llllllllllllllll}\n", " & 9E & AA & AS & B6 & DL & EV & F9 & FL & HA & MQ & OO & UA & US & VX & WN & YV\\\\\n", "\\hline\n", "\tEWR & 0.01049365 & 0.02885753 & 0.005908884 & 0.05426408 & 0.0359333 & 0.36362809 & 0.000000000 & 0.00000000 & 0.000000000 & 0.01883560 & 4.965449e-05 & 0.38140439 & 0.03645467 & 0.01295982 & 0.05121033 & 0.000000000 \\\\\n", "\tJFK & 0.13166006 & 0.12385985 & 0.000000000 & 0.37811267 & 0.1860279 & 0.01265288 & 0.000000000 & 0.00000000 & 0.003073356 & 0.06463933 & 0.000000e+00 & 0.04074444 & 0.02691433 & 0.03231517 & 0.00000000 & 0.000000000 \\\\\n", "\tLGA & 0.02427815 & 0.14770404 & 0.000000000 & 0.05734651 & 0.2203952 & 0.08432860 & 0.006544878 & 0.03114789 & 0.000000000 & 0.16173970 & 2.484187e-04 & 0.07685693 & 0.12550878 & 0.00000000 & 0.05815864 & 0.005742294 \\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "| | 9E | AA | AS | B6 | DL | EV | F9 | FL | HA | MQ | OO | UA | US | VX | WN | YV | \n", "|---|---|---|\n", "| EWR | 0.01049365 | 0.02885753 | 0.005908884 | 0.05426408 | 0.0359333 | 0.36362809 | 0.000000000 | 0.00000000 | 0.000000000 | 0.01883560 | 4.965449e-05 | 0.38140439 | 0.03645467 | 0.01295982 | 0.05121033 | 0.000000000 | \n", "| JFK | 0.13166006 | 0.12385985 | 0.000000000 | 0.37811267 | 0.1860279 | 0.01265288 | 0.000000000 | 0.00000000 | 0.003073356 | 0.06463933 | 0.000000e+00 | 0.04074444 | 0.02691433 | 0.03231517 | 0.00000000 | 0.000000000 | \n", "| LGA | 0.02427815 | 0.14770404 | 0.000000000 | 0.05734651 | 0.2203952 | 0.08432860 | 0.006544878 | 0.03114789 | 0.000000000 | 0.16173970 | 2.484187e-04 | 0.07685693 | 0.12550878 | 0.00000000 | 0.05815864 | 0.005742294 | \n", "\n", "\n" ], "text/plain": [ " 9E AA AS B6 DL EV \n", "EWR 0.01049365 0.02885753 0.005908884 0.05426408 0.0359333 0.36362809\n", "JFK 0.13166006 0.12385985 0.000000000 0.37811267 0.1860279 0.01265288\n", "LGA 0.02427815 0.14770404 0.000000000 0.05734651 0.2203952 0.08432860\n", " F9 FL HA MQ OO UA \n", 
"EWR 0.000000000 0.00000000 0.000000000 0.01883560 4.965449e-05 0.38140439\n", "JFK 0.000000000 0.00000000 0.003073356 0.06463933 0.000000e+00 0.04074444\n", "LGA 0.006544878 0.03114789 0.000000000 0.16173970 2.484187e-04 0.07685693\n", " US VX WN YV \n", "EWR 0.03645467 0.01295982 0.05121033 0.000000000\n", "JFK 0.02691433 0.03231517 0.00000000 0.000000000\n", "LGA 0.12550878 0.00000000 0.05815864 0.005742294" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "carrier_origin_cond = t(origin_carrier_matrix) / colSums(origin_carrier_matrix)\n", "carrier_origin_cond" ] }, { "cell_type": "code", "execution_count": 28, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
\n", "\t
EWR
\n", "\t\t
1
\n", "\t
JFK
\n", "\t\t
1
\n", "\t
LGA
\n", "\t\t
1
\n", "
\n" ], "text/latex": [ "\\begin{description*}\n", "\\item[EWR] 1\n", "\\item[JFK] 1\n", "\\item[LGA] 1\n", "\\end{description*}\n" ], "text/markdown": [ "EWR\n", ": 1JFK\n", ": 1LGA\n", ": 1\n", "\n" ], "text/plain": [ "EWR JFK LGA \n", " 1 1 1 " ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "rowSums(carrier_origin_cond)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we want to visualize conditional probabilities, the easiest way is with a _faceted_ plot. First let's convert our conditional distribution to a tall data frame:" ] }, { "cell_type": "code", "execution_count": 29, "metadata": {}, "outputs": [ { "data": { "text/html": [ "\n", "\n", "\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\t\n", "\n", "
origincarrierprob
EWR 9E 0.01049365
JFK 9E 0.13166006
LGA 9E 0.02427815
EWR AA 0.02885753
JFK AA 0.12385985
LGA AA 0.14770404
\n" ], "text/latex": [ "\\begin{tabular}{r|lll}\n", " origin & carrier & prob\\\\\n", "\\hline\n", "\t EWR & 9E & 0.01049365\\\\\n", "\t JFK & 9E & 0.13166006\\\\\n", "\t LGA & 9E & 0.02427815\\\\\n", "\t EWR & AA & 0.02885753\\\\\n", "\t JFK & AA & 0.12385985\\\\\n", "\t LGA & AA & 0.14770404\\\\\n", "\\end{tabular}\n" ], "text/markdown": [ "\n", "origin | carrier | prob | \n", "|---|---|---|---|---|---|\n", "| EWR | 9E | 0.01049365 | \n", "| JFK | 9E | 0.13166006 | \n", "| LGA | 9E | 0.02427815 | \n", "| EWR | AA | 0.02885753 | \n", "| JFK | AA | 0.12385985 | \n", "| LGA | AA | 0.14770404 | \n", "\n", "\n" ], "text/plain": [ " origin carrier prob \n", "1 EWR 9E 0.01049365\n", "2 JFK 9E 0.13166006\n", "3 LGA 9E 0.02427815\n", "4 EWR AA 0.02885753\n", "5 JFK AA 0.12385985\n", "6 LGA AA 0.14770404" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "carrier_origin_frame = as.data.frame(carrier_origin_cond)\n", "carrier_origin_frame$origin = row.names(carrier_origin_cond)\n", "carrier_origin_tall = gather(carrier_origin_frame, carrier, prob, -origin)\n", "head(carrier_origin_tall)" ] }, { "cell_type": "code", "execution_count": 30, "metadata": {}, "outputs": [ { "data": {}, "metadata": {}, "output_type": "display_data" }, { "data": { "image/png": 
"iVBORw0KGgoAAAANSUhEUgAAA0gAAAHgCAMAAACo6b1DAAAAPFBMVEUAAAAaGhozMzNNTU1Z\nWVloaGh8fHyMjIyampqnp6eysrK9vb3Hx8fQ0NDZ2dnh4eHp6enr6+vw8PD////GSW4mAAAA\nCXBIWXMAABJ0AAASdAHeZh94AAAVC0lEQVR4nO3dgXabSg4AULd+Tdu3u25V//+/btK0DuDB\nGCSCaa/O2QQirJmRucamednDWQiRjsPWExDiTwiQhCgIkIQoCJCEKAiQhCgIkIQoCJCEKAiQ\nhCiIFKTT3xt6oAenTgtAWhh6oAcnkPKhB3pwAikfeqAHJ5DyoQd6cAIpH3qgByeQ8qEHenAC\nKR96oAcnkPKhB3pwAikfeqAHJ5DyoQd6cNolpA+/4vTh1/7vb5fEu8a2J9GHSz9Og368Z2zW\ng7e1dp/5DTqwS0j9jQ+D7+/dx+0hdbZPG59F7z1wc+0fNn0t2SGk160Pv7+BdPny3rE5pN53\nkO4LkDoxhLSJo8eB9GsHpHuiD6nzP5A2cvRYkLZ5PdkjpN6n6y6kv/lmw+n1lfhvviJd2gDS\nXfGht/mh922DFm4P6bK92Xu7x4B06rRju9Ngp5C692m2aeEDQbp8ee94KEhbvC0BKR/bnkRu\nfw+/uyLdG8MPlpfv2/yD5KNB2vSW1XsPfP0istU/yu4RUufi3YL019y163y4ft0ffH+/2A7S\n252WQTdA2lvogR6cQMqHHujBCaR86IEenEDKhx7owQmkfOiBHpxAyoce6MFpJqTjc3R3QTo5\niV5CD2ZBOvb1HEF6CT3Qg1MG0tEV6WfogR6cEpCOvbd28Tvetpr7+0nfffTKPXigliR68BjT\nXDOdh/TxJSYfK8RfEvMgHc+uSL+21u3BA7XEFelGeiGkwX0HkEACaRGk1wAJpJ9bIC2EdB5s\nPtaiKtIggTQnDVL2aJBAinmQLr/Z0LnhABJIIMVMSOPxWIuqSIME0pw0SNmjQQIpQMofDRJI\nAVL+aJBACpDyR4MEUoCUPxokkAKk/NEggRQg5Y8GCaQAKX80SCDF3wnpn26kxwIJpAAJpIo0\nSCClxwIJpABpS0i189gyDRJI6bFAAilAAqkiDRJI6bFAAilAAqkiDRJI6bFAAilAAqkiDRJI\n6bFAAilAAqkiDRJI6bFAAilAAqkiDRJI6bFAAilAAqkiDdKjQvqneZKB9KBpkEDKFgMpQAqQ\nQKpIgwRSthhIAVKABFJFGiSQssVACpACJJAq0iCBlC0GUoAUIIFUkQYJpGwxkAKkAAmkijRI\nIGWLgRQgBUggVaRBqoJUHX1IaxYvrr3HeYjScEVaVMwVKVyRwls7kCrSIIGULQZSgBQggVSR\nBgmkbDGQAqQACaSKNEggZYuBFCAFSCBVpEECKVsMpAApQAKpIg0SSNliIAVIARJIFWmQQMoW\nAylACpBAqkiDBFK2GEgBUoAEUkUaJJCyxUAKkAIkkCrSIIGULQZSgBQggVSRBgmkbDGQAqQA\nCaSKNEggZYuBFCAFSCBVpEECKVsMpAApQAKpIg0SSNliIAVIARJIFWmQQMoWAylACpBAqkiD\nBFK2GEgBUoAEUkUaJJCyxUAKkAIkkCrSIIGULQZSgBQggVSRBgmkbDGQAqQACaSKNEggZYuB\nFCAFSCBVpEGaB+n4HK1tkEAC6X5Ix8uX/jZIIIEEUq4YSAFSLId07m9XTwuk2fPYMg1SAaSP\nLzH52JnRh7Rm8eLae5yHKI25kNxs+LW1uAeuSH9Qejmks7d2r1sggQQSSBVpkNy1yxYDKUAK\nkECqSIO07Dcbjp1tkEACaSak8aieFkiz57FlGiSQssVACpACJJAq0iCBlC0GUoAUIIFUkQYJ\npGwxkAKkAAmkijRIIGWLgRQgBUggVaRBAilbDKQAKUACqSINEkjZYiAFSAESSBVpkEDKFgMp\nQ
AqQQKpIgwRSthhIAVKABFJFGiSQssVACpACJJAq0iCBlC0GUoAUIIFUkQYJpGwxkAKkAAmk\nijRIIGWLgRQgBUggVaRBAilbDKQAKUACqSINEkjZYiAFSAESSBVpkEDKFgMpQAqQQKpIgwRS\nthhIAVKABFJFGiSQssVACpACJJAq0iCBlC0GUoAUIIFUkQapClJ19CGtWby49h7nIUrDFWlR\nMVekcEUKb+1AqkiDBFK2GEgBUoAEUkUaJJCyxUAKkAIkkCrSIIGULQZSgBQggVSRBgmkbDGQ\nAqQACaSKNEggZYuBFCAFSCBVpEECKVsMpAApQAKpIg0SSNliIAVIARJIFWmQQMoWAylACpBA\nqkiDBFK2GEgBUoAEUkUaJJCyxUAKkAIkkCrSIIGULQZSgBQggVSRBgmkbDGQAqQACaSKNEgg\nZYuBFIWQVn3eV02DlCwGUoAUIIFUkQYJpGwxkAKkAAmkijRIIGWLgRQgBUggVaRBAilbDKQA\nKUACqSINEkjZYiAFSDEK6cfXT4fDp39/gDSZBgmkGIP0/Xj4GcfvIE2lQQIpxiA9HZ6eCX1/\nOnwGaSoNEkgxBunwuvPjcO8Hp+ppgTR7HlumQRqB9Pnw+ulocEU6PkdrGySQQGpBOn9+fWs3\ncHT50t8GCSSQriEdugHSVBokkGI5pHN/u3paIM2ex5ZpkGb9g2wb0seXmHzszOg3dM3ixbX3\nOI8HilWf93eK+ZDcbPi5tbgHrkhXu3/cFen1Nxu+9n+zAaTWLkggxRik9m82DCB13+RVTwuk\n2fPYMg3SCKQvv3+z4cs4pK4jkEACqQHp9926G3fteo5AAgmkuyFdfpvh+Lp5dPs7QPq5BVIb\nUvut3Y2onhZIs+exZRok/xlFthhIAVKMQWrf/gaptQsSSDEKaW5UTwuk2fPYMg3S2H/Yd+9n\nI5BACpBiDNJx7uWpelogzZ7HlmmQRiB9e/p6720GkKog7fckAmkMUvs/owCptQsSSAESSBVp\nkNy1yxYDKUAKkECqSIN08x9kn/4FaToNEkgxBsmvCIE0Jw3S1F9a9Uurk2mQQIoxSP7SKkhz\n0iBN/aXVJ5Cm0iCBFGOQzl+evr28tXvyGWkyDRJIMQap9zci73l7Vz0tkGbPY8s0SCBli4EU\nIMUYpNlRPS2QZs9jyzRIIGWLgRQgBUggVaRBAilbDKQAKUACqSINEkjZYiAFSAESSBVpkEDK\nFgMpQAqQQKpIgwRSthhI8aiQ2sVAAikzzVXTIIGULQZSgBQggVSRBgmkbDGQAqQACaSKNEgg\nZYuBFCAFSCBVpEECKVsMpAApQAKpIg1SFaTq6PdgzeLFtRPzWHXNO4nSHmzUUFekRcVckcIV\nKby1A6kiPd6DmSczSMXTAmn2PLZMgwRSthhIAVKABFJFGiSQssVACpACJJAq0iCBlC0GUoAU\nIIFUkQYJpGwxkAKkAAmkijRIIGWLgRQgBUggVaRBAilbDKQAKUACqSINEkjZYiAFSAFSKaTM\nWQPS+OEggTRjHiCBBBJIIIGUKQZSgBQggVSRBgmkbDGQ4g+BlBsbpGQxkAKkAGlVSBOrAOlq\nF6TkmmsXNad4thhIAVKABFJFGiSQssVACpACJJAq0iCBlC0GUoAUIIFUkQYJpGwxkAKkAAmk\nijRIIGWLgRQgBUggVaRBAilbDKQAKUACqSINEkjZYiAFSAESSBVpkEDKFgMpQAqQQKpIgzQP\n0vE5Onsg/dwCCaR5kI49PkeQXrdAAikB6eiK9GsLJJAyVySQfm2BBFIFpI8vMfnYmdFf1JrF\nVyv8z+Qq5h29z5i5qNIebDS2K9KiYq5I4YoU3tqBVJEGCaRsMZACpAAJpIo0SCBli4EUIMU8\nSJffbDiCBBJIvd1ZkMZjclEz0yAVr3ndNEggZYuBFCAFSCBVpEECKVsMpAApQAKpIg0SSNli\nIAV
IARJIFWmQQMoWAynWgzTz0aVjTxQf7oKULAZSgBQggVSRBgmkbDGQAqQACaSKNEggZYuB\nFCAFSCBVpEECKVsMpNgppJnP1sTYICWLgRQgdVsA0rJiIAVI3RaAtKwYSAFStwUgLSsGUoDU\nbQFIy4qBFCB1WwDSsmIgBUjdFoC0rBhIAVK3BSAtKwZSgNRtAUjLioEUIHVbANKyYiAFSN0W\ngLSs2INCyp2Cc9MggZQtBlKA1G0BSMuKgRQgdVsA0rJiIAVI3RaAtKwYSAFStwUgLSsGUoDU\nbQFIy4qBFH8HpIk0SMliIEUC0kTHcqvIjT0sPpEGKVkMpACp24J3hpSb9UTxOWNni4EUIHVb\nANKyYiAFSN0WgLSsGEgBUrcFIC2bKUgBUrcFOUizoz+tuenCsbMzvXXonEUWr3nVBtbNY6Jj\nuVXkxp5Z7S1ckRbN1BUpXJG6LQBp2UxBCpC6LQBp2UxBCpC6LQBp2UxBCpC6LQBp2UxBCpC6\nLQBp2UxBCpC6LQBp2UxBCpC6LQBp2UxBCpC6LQBp2Uz3CWkiPXdskEDKzhSkfg9y8xjs56aZ\nG3tYfCINUnKmIPV7kJvHYD83zdzYw+ITaZCyUwEJpHgYSDMXNa+hN8cGCaS7+j+RBgkkkO7p\n/0QaJJBAuqf/E2mQQMo7AwkkkPYLqTbdPhwkkK6LgXT/2MNFTqRXgzRvWjMX1d4HCaS6sYeL\nnEiDBBJIrbGHi5xIgwRS/kM+SCCBBFJz7OEiJ9IggVQKaV6l3Dxq0+3DQQLpuhhI9489XORE\nGiSQQGqNPVzkRBokkEBqjT1c5EQaJJBAao09XORE+o+ENGdskEBqjj1c5EQaJJBAao09XORE\nGiSQQGqNPVzkRBokkEBqjT1c5EQaJJBAao09XOREGiSQQGqNPVzkRBqk7FRWg9R8MEjLFzlv\n7OEiJ9IggQRSa+zhIifSZZBy05r56Iknds7YIIHUHHu4yIk0SCCB1Bp7uMiJNEgggdQae7jI\niTRIIO0HUq74vLFvPlvbQcqlJ86amWuunQpIfyWk4dGzIB2fo7UNEkgg3Q/pePnS3wbpASBN\nrGLeSTTx6Bk9qJ1Hrvi8sWc2GCSQQGoVn9lgkEACqVV8ZoPzkD6+xORjhfhLInlFar9o3fvZ\n+PHSdx+9cg8eqCWJHjzGNNdMg5Q9GiSQAqT80SCBFCDljwYJpAApfzRIIMU8SJffZjh2tkEC\nCaSZkMbjsRZVkQYJpDlpkLJHgwRSgJQ/GiSQAqT80SCBFCDljwYJpAApfzRIIAVI+aNBAilA\nyh8NEkgBUv5okEAKkPJHgwRSlEG6xPA/8fu41/TMYo8yjy3TjzKPjXsAUqbYo8wDpM17AFKm\n2KPMA6TNewBSptijzAOkzXsAUqbYo8wDpM17UANJiL88QBKiIEASoiBAEqIgHgfScfqQxOH7\niHmL+iNbsNce5CEd+186ic6fHur9P8O8/g2VYQuOvQe9Hf0rmsNejXUpfezvDh/cmtrbYc1n\np1Gm/4CqHlz/dKQHI2MNW9DuwdvR53V70B5otAV77cFakI79Sb5l3zaHPWhNdPQM7e4PztBO\njcHuHVPrfD8eB8/e+GlU3IPBfJvFx8carLnVg27HrqfWm0muB6MDjbZgrz1YB9LzUMeR9CSk\nqyX2Fnl9eGes6x5c1b41tdcO3br4ta5uzUq3B5ruQXeg0R40x7pa83UPro5eqwc3BpqEtLce\nrADpdZQEpPPbKqYg9cfqDDo8q++Z2tsDW1Np745UKuhB73J26yRqjNVZydXJ2OjYKj24PdAd\nkM676kE9pB6B644N2jioM5zoBKTBWFOQJqbWmEXjB63TaJ0eHIePaZ5EjbFevzeX1urYKj2Y\nGGi0BXvtwbtckfqrHbzfbH6MvPrQOHL4yBVp+AJz59SuX4lufEDt/XA4ZEEPru+fHNtHt1+N\nx86NR
sfW6cHEQIMV774Hq39GetkYZCcr3ro51Dj6fkgTUxu+N27dO2rPoTVkpgfzWnD1eayz\n1ejB8FPlaj24OdBk7KsHBZBGXkuO/c27uzKvfcOx+t+vO3trasOjr9t3YwaVPZjfgu4dq4nv\ng5mt24PxgaZibz1Y8R9kx165Xi/Rw4t6+0FXa5gY63Lpb+7entrV61j/zFjwzC7rwfAxt3ev\nxhqueaQH7Zmt0IM7W7D3Hrz7bza80e9fdqfeL7T3G4+4eklcJOC87Ky5s/I50YPZLXjIHgxb\nsPseFEBqvatpTODYOWkGX94SzfQsSMuj9YHyOJzf6GPrejBMzD6JErFCD9r7zUWeG5m99GD9\n32wY7JZCan82GZ9q/+Crx7aqXL/xbhVuzXVpD2aeRPN6MDx25R4MJ/XW+8HRu+9BEaTXS3Vn\n1MEZ9bZbf0WafpkcmXNjt/lMjL1xHxYo6sGSV+P5PRivUNeDQQu6+7WQzq39wfxX7sEKkF4n\n0elmf7fxZVCnDtLYzYZbJW69NRkbqrQHxSfR6A2XG/sVPRisebD/vpDeoQf1kI7dH1/tzoU0\nuGa3ruHdBQ4Pvz6k+YP+brf29VuT8StcUQ+uTqL+oppvY3oz7qfHenB7P92D5pqXQ3r4HlTd\nbOgMO/FqPGjJW5m3r80XnPHxe3NpZyY62C5xfbqMzqmyBwta0H8xGcncfPGo78FgzcMr0shL\n3m57sM7t7xufD66ObDR0OaT+qudDaj/6krx3Sr+OXtaD5Ek0eOLnn0R1PbjxGen60J33YK1/\nRxoobr0XGHvc9UV74iGjY9/bwLeXyXadkXEmJ7asBwtacH1GNO4xjb1K9/dre9B4S3Tv43bW\ng/f7zYZ5r+f3DjHS75H3jf2Dr3abA/RLzJ7grd2SGOvByHumwbHr9+DWh/aqeIAePM7fbKiL\n65eizHO3xhO/ejSeeD1YtQd/HKT+m+Nj99vykqmHv3v0X5f14F168IdBur6/MuttdrPi7k6h\n65/oweo9+LMgLfmMOlGuos57RnEL9OB8Xw/+LEilsb8TqD704N4egCREQYAkREGAJERBgCRE\nQYAkREGAJERBgCREQYAkREGAJERBgCREQYC0/zh4ErcPz8H+A6QHCM+BEAUB0h7i++fD8evL\nxunz4XXrcPh2fLp8ff7Bjy+Hw5cfl4x45wBpB/HjeHiOz+fz/w4/4+sLl6fDl8vX52N+HvLp\n/Dsj3jlA2kF8fZZxeuHy6fDf8/nby9ZPTZev5/O/L1tfD//5/TPxzgHSDuLT4cfvze//+/fp\nFdL389vXl0Nesi9XrdefiXcOkHYQb7flnl7f2/3+0dvXw6GfEe8cur6DuNj4cvj0n/99B+kB\nQ9d3EJe3dq/351qQPl2eSJA2CV3fQXw9fP19i+F0/vHUgvRyyPm/hyeQNgpd30F8/31v++th\n7DPS6x3ywzeQNgpd30N8e74KfXm5GfflcHg6tSCdv/9MnUHaKHRdiIIASYiCAEmIggBJiIIA\nSYiCAEmIgvg/dKjfHucz/qkAAAAASUVORK5CYII=", "text/plain": [ "plot without title" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "ggplot(carrier_origin_tall) +\n", " aes(x=carrier, y=prob) +\n", " geom_bar(stat='identity') +\n", " facet_wrap(~ origin) +\n", " theme(axis.text.x=element_text(angle=90, hjust=1, vjust=0.5))" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Independence of Variables\n", "\n", "Two 
variables are _independent_ if $P(A,B) = P(A) P(B)$ - that is, we can compute the probability of $a$ and $b$ happening at the same time by independently computing the probabilities of $a$ and $b$, and multiplying them. What this means in practice is that knowing $A$ tells us nothing about $B$. We can see that our origin airport and carrier are not independent - observing either tells us quite a bit about the other.\n", "\n", "But let's go back to our binomial distribution: when flipping a coin, each flip is independent. Knowing I flipped heads tells me nothing about whether the next flip will be heads.\n", "\n", "This is the key to making the binomial distribution formula work: the probability of flipping $\\mathsf{HH}$ is $P(X_1=\\mathsf{H},X_2=\\mathsf{H}) = P(\\mathsf{H})P(\\mathsf{H})$.\n", "\n", "The same is true of rolling dice: the probability of any particular result from rolling two fair dice is the product of the individual die probabilities." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Bayes' Theorem\n", "\n", "Remember that $P(A|B) \\ne P(B|A)$? There is, however, a way that we can convert between these two probabilities!\n", "\n", "$$P(A|B) = \\frac{P(B|A)P(A)}{P(B)}$$\n", "\n", "That is, with one conditional distribution and both marginal distributions, we can compute the other conditional distribution. To see why this is true, we can expand the definition of conditional probability, solve it for the joint probability, and substitute:\n", "\n", "$$\\begin{align*}\n", "P(B|A) & = \\frac{P(A,B)}{P(A)} \\\\\n", "P(A,B) & = P(B|A) P(A) \\\\\n", "P(A|B) & = \\frac{P(A,B)}{P(B)} \\\\\n", "& = \\frac{P(B|A) P(A)}{P(B)}\n", "\\end{align*}$$" ] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.1" } }, "nbformat": 4, "nbformat_minor": 2 }