Assignment 2

Goals

This assignment will exercise the following skills:

  1. Creating and managing projects with git

  2. Loading data from files

  3. Aggregating data with dplyr

  4. Detecting product associations

As of Sep. 8, this assignment description is enough to get you started. I may add a few questions, but I will do so no later than Sep. 15.

Data Set

For this project, you will use the Last.FM data from HetRec 2011. This data set consists of several tab-separated value files, each of which has a different set of records:

artists.dat

Information about music artists.

tags.dat

Information about tags (mapping from tag IDs to tags).

user_artists.dat

The number of times each user has listened to each artist.

user_taggedartists.dat

Tags users have applied to artists; this file uses IDs, so you will need to join with the other files to make it readable.

You can read each of these files with the read_tsv() function from readr.
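For example, assuming the files have been unpacked into a data directory as described under Getting Started below (the variable names here are just illustrative), loading them might look like this:

library(readr)

# each file is tab-separated with a header row
artists = read_tsv('data/artists.dat')
tags = read_tsv('data/tags.dat')
plays = read_tsv('data/user_artists.dat')
tagged = read_tsv('data/user_taggedartists.dat')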

Getting Started

Create a new Git repository, both locally and on BitBucket (your BitBucket repository should be private), to store your work.

Unpack the files into the data directory, and add them with git lfs:

git lfs track '*.dat'
# 'git lfs track' records the pattern in .gitattributes, which must also be committed
git add .gitattributes data
You will submit this project by sharing your Git repository with me, and I need to be able to re-run your notebook.

Create a Jupyter notebook to contain your data analysis.

Do not commit the HTML export of your notebook.

Necessary Functions

For this assignment, you’ll need quite a few dplyr functions:

  1. inner_join to merge tables

  2. group_by and summarize to aggregate data

  3. arrange to sort data

And probably some more! We’ll be working some examples in class to help with this.

For example, to count the number of users playing each artist and the number of times the artist has been played, you can write:

# readr provides read_tsv, and dplyr provides the pipeline verbs used below
library(readr)
library(dplyr)

artists = read_tsv('data/artists.dat')
plays = read_tsv('data/user_artists.dat')

plays %>%
    group_by(artistID) %>%
    summarize(userCount=n(),               # number of users who have played this artist
              totalPlays=sum(weight)) %>%  # weight is a user's play count for an artist
    inner_join(select(artists, artistID=id, name=name))

Spend a bit of time thinking about what this code does; I think you will find it instructive for putting together dplyr pipelines.
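Continuing that example, appending arrange() to the pipeline sorts the artists by total plays, most-played first (a minimal sketch, assuming the tables and libraries loaded above):

plays %>%
    group_by(artistID) %>%
    summarize(userCount=n(),
              totalPlays=sum(weight)) %>%
    inner_join(select(artists, artistID=id, name=name)) %>%
    arrange(desc(totalPlays))    # desc() sorts in descending order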

Exploring the Data

Answer the following questions:

  1. Plot the distribution of play counts per artist

  2. Plot the distribution of unique users playing each artist

  3. Plot the distribution of play counts per user

  4. Plot the distribution of unique artists per user

  5. What is the mean artists-per-user? Users-per-artist? Plays per user/artist pair?

  6. What are the 10 artists with the most plays?

  7. What are the 10 artists with the most unique playing users?
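The assignment does not require a particular plotting package; as one possible starting point for question 1, here is a sketch that assumes you are using ggplot2 and the plays table loaded earlier:

library(dplyr)
library(ggplot2)

# total plays per artist, then a histogram of that distribution
plays %>%
    group_by(artistID) %>%
    summarize(totalPlays = sum(weight)) %>%
    ggplot(aes(x = totalPlays)) +
    geom_histogram() +
    scale_x_log10()    # a log scale often helps with heavily skewed count data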

Association Rules

One common problem in sales and media is identifying related products: if a user likes Nickelback, what other artists might we be able to recommend?

One way to do this is through association rules: look at other artists listened to by users who like Nickelback. We can think of a very simple association rule as a conditional probability. If we denote by \(A_u\) the set of artists played by user \(u\), we can compute the association between artists \(a\) and \(b\) by:

\[P(b \in A_u | a \in A_u)\]

You will probably find the identity \(P(X|Y) = P(X,Y)/P(Y)\) useful. We can estimate \(P(X,Y)\) from co-plays: the number of users who have listened to both artists in a pair.
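Concretely, if \(N\) is the total number of users, both probabilities can be estimated as fractions of \(N\), and the conditional probability reduces to a ratio of counts:

\[P(b \in A_u | a \in A_u) \approx \frac{\textrm{coplays}(a,b)/N}{\textrm{users}(a)/N} = \frac{\textrm{coplays}(a,b)}{\textrm{users}(a)}\]

where \(\textrm{coplays}(a,b)\) is the number of users who have played both artists and \(\textrm{users}(a)\) is the number of users who have played \(a\).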

QUESTION: What is the most commonly-played pair of artists?

Estimating the joint probabilities can be tricky. A naive solution computes the joint probability for every pair of items, which requires \(O(n^2)\) probability computations.

We can take advantage of two facts: artists that never appear together in any user's listening history have a co-play count of 0, and the user_artists.dat table (loaded as plays above) links each user to the artists they have played. A self-join on user ID therefore produces exactly the pairs that do co-occur:

coplays = plays %>%
    select(a1=artistID, userID) %>%
    inner_join(select(plays, a2=artistID, userID)) %>%  # self-join on userID
    filter(a1 != a2) %>%                                # drop artists paired with themselves
    group_by(a1, a2) %>%
    summarize(count = n())                              # number of users who played both

Now that we have our coplay counts, you can answer the following:

  1. What pair of artists has been co-played the most often?

  2. How many users have listened to both Nickelback and Britney Spears?

  3. What is the probability that a randomly-selected user has listened to both Nickelback and Britney Spears?

  4. Given that a user has listened to Nickelback, what is the probability that they have also listened to Britney Spears?

  5. Given that a user has listened to Aretha Franklin, what 10 artists are they most likely to have listened to?
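One possible starting point for these questions (a sketch, not the only way to structure it; the names artist_users, assoc, and their columns are just illustrative) is to join the co-play counts with per-artist user counts, estimating each conditional probability as a ratio of counts:

library(dplyr)

# number of users who have played each artist
artist_users = plays %>%
    group_by(artistID) %>%
    summarize(nUsers = n())

# P(a2 listened | a1 listened) is estimated by coplays(a1, a2) / users(a1)
assoc = coplays %>%
    inner_join(select(artist_users, a1=artistID, a1Users=nUsers)) %>%
    mutate(prob = count / a1Users)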

Extending Association Rules

Naive association rules have the problem that they tend to favor popular items. For a popular artist such as Katy Perry, many people listen to her no matter what else they listen to, so \(P(\textrm{katy perry}|X)\) is high no matter what \(X\) is. (Except that Katy Perry isn't actually in our data set, because the data set is old.)

What we can do instead is measure lift, which captures how positively coupled two items are: how much more likely a user is to listen to both of them than they would be if the two items were completely independent.

\[\textrm{lift(X,Y)} = \frac{P(X,Y)}{P(X)P(Y)}\]

There are other formulas we can use as well, but this one will get us started.
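In terms of the counts we already have, with \(N\) again the total number of users, this estimate works out to:

\[\textrm{lift}(X,Y) \approx \frac{\textrm{coplays}(X,Y)/N}{(\textrm{users}(X)/N)(\textrm{users}(Y)/N)} = \frac{N \cdot \textrm{coplays}(X,Y)}{\textrm{users}(X) \, \textrm{users}(Y)}\]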

  1. What 10 artists have the highest lift with respect to Aretha Franklin?

  2. What is the lift of Nickelback and Britney Spears?

  3. What is the lift of Britney Spears and Ozzy Osbourne?

Submitting

  1. Push your notebook & data to a repository on your BitBucket account

  2. Give my user (mdekstrand) read access to your repository

  3. Send me an e-mail with a link to your repository

The assignment is due on Monday, Sep. 25 by midnight.
