Output Analysis

This notebook contains our analysis of the recommender evaluation results.

It proceeds in a few steps:

  1. Setup, where we do the initial R configuration.
  2. Read Data, where we load the experiment data for analysis.
  3. Controlling for Profile Size, where we model accuracy as a function of user profile size in order to control for intrinsic profile difficulty.
  4. Gender, where we analyze the results by gender.
  5. Age, where we analyze the results by age.

Setup

Libraries:

In [1]:
library(MASS)
library(plyr)
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(modelr)
library(tibble)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from ‘package:MASS’:

    select

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

In [2]:
options(repr.plot.height=5)
options(repr.matrix.max.rows=10)
options(repr.matrix.max.columns=10)

Read Data

First, we need to read the user data from the underlying MovieLens data:

In [3]:
# users.dat is '::'-delimited; treating ':' as the delimiter and skipping the
# resulting empty columns (the '_' entries in col_types) handles that
users.meta = read_delim("data/ml-1m/users.dat", delim=":",
                   col_names=c("user", "gender", "age", "occupation", "zip"),
                   col_types="i_c_c_c_c") %>%
    mutate(gender=as.factor(gender),
           age=as.factor(age),
           occupation=as.factor(occupation))
users.meta
user  gender  age  occupation  zip
1     F       1    10          48067
2     M       56   16          70072
3     M       25   15          55117
4     M       45   7           02460
5     M       25   20          55455
⋮
6036  F       25   15          32603
6037  F       45   1           76006
6038  F       56   1           14706
6039  F       45   0           01060
6040  M       25   6           11106

Now, we want to get user profile statistics. In order to do that, we need each user's train profile; we can load that from the training data output by the cross-folding process.

For each partition, we need to load the train ratings (historical profile) for all the test users in that partition. So we will load the test data (to get the users) and the train data (to get the histories), then merge them.

In [4]:
train.ratings = ldply(1:5, function(part) {
    message("reading part ", part)
    test.fn = sprintf("build/ML-1M.out/part0%d.test.csv", part)
    train.fn = sprintf("build/ML-1M.out/part0%d.train.csv", part)
    test = suppressMessages(read_csv(test.fn, col_names=c("user", "item", "rating", "timestamp")))
    train = suppressMessages(read_csv(train.fn, col_names=c("user", "item", "rating", "timestamp")))
    # keep only the train (history) ratings of this partition's test users
    test %>%
        select(user) %>%
        distinct() %>%
        mutate(part=as.factor(part), proto=as.factor('UserPct')) %>%
        inner_join(train)
})
reading part 1
Joining, by = "user"
reading part 2
Joining, by = "user"
reading part 3
Joining, by = "user"
reading part 4
Joining, by = "user"
reading part 5
Joining, by = "user"

Quick summary just to see what this data looks like:

In [5]:
train.ratings %>%
    select(part, proto, rating) %>%
    summary()
 part           proto            rating     
 1:193768   UserPct:970009   Min.   :1.000  
 2:189980                    1st Qu.:3.000  
 3:181023                    Median :4.000  
 4:203058                    Mean   :3.578  
 5:202180                    3rd Qu.:4.000  
                             Max.   :5.000  

Now we want to compute per-user profile statistics:

In [6]:
user.stats = train.ratings %>%
    group_by(user) %>%
    summarize(nratings = n(), meanRating=mean(rating), ratingVar=var(rating))
user.stats
user  nratings  meanRating  ratingVar
1     48        4.208333    0.4663121
2     124       3.701613    1.0078023
3     46        3.847826    1.0207729
4     16        4.312500    0.7625000
5     193       3.150259    1.2741796
⋮
6036  883       3.300113    1.0470181
6037  197       3.705584    0.7802238
6038  15        4.000000    0.7142857
6039  118       3.872881    0.5392583
6040  336       3.568452    1.3982854

Join these statistics with our user metadata table, so that we have one table of user information:

In [7]:
users = users.meta %>% inner_join(user.stats)
users
Joining, by = "user"
user  gender  age  occupation  zip    nratings  meanRating  ratingVar
1     F       1    10          48067  48        4.208333    0.4663121
2     M       56   16          70072  124       3.701613    1.0078023
3     M       25   15          55117  46        3.847826    1.0207729
4     M       45   7           02460  16        4.312500    0.7625000
5     M       25   20          55455  193       3.150259    1.2741796
⋮
6036  F       25   15          32603  883       3.300113    1.0470181
6037  F       45   1           76006  197       3.705584    0.7802238
6038  F       56   1           14706  15        4.000000    0.7142857
6039  F       45   0           01060  118       3.872881    0.5392583
6040  M       25   6           11106  336       3.568452    1.3982854

Now that we have user information, we can read the per-user recommender evaluation results.

In [8]:
user.results.all = read_csv("build/movielens-user-results.csv", guess_max=10000) %>%
    rename(user=User) %>%
    mutate(DataSet=as.factor(DataSet), 
           Partition=as.factor(Partition),
           Algorithm=as.factor(Algorithm)) %>%
    inner_join(users)
head(user.results.all)
Parsed with column specification:
cols(
  DataSet = col_character(),
  Partition = col_integer(),
  Algorithm = col_character(),
  User = col_integer(),
  TestTime = col_double(),
  RMSE = col_double(),
  nDCG = col_double(),
  Rank = col_integer(),
  RecipRank = col_double(),
  AvgPrec = col_double(),
  `1R.nDCG` = col_character(),
  `1R.Rank` = col_character(),
  `1R.RecipRank` = col_character(),
  `1R.AvgPrec` = col_character()
)
Joining, by = "user"
DataSet  Partition  Algorithm  user  TestTime  RMSE  nDCG  Rank  RecipRank  AvgPrec  1R.Rank  1R.RecipRank  1R.AvgPrec  gender  age  occupation  zip  nratings  meanRating  ratingVar
ML-1M.UI 4 Pop-B 4011 0.019 NA 0 NA 0 0 NA NA NA M 25 17 85210 46 3.347826 2.009662
ML-1M.UI 3 Pop-B 4016 0.028 NA 0 NA 0 0 NA NA NA M 50 0 33901 673 3.007429 1.257385
ML-1M.UI 2 Pop-B 3999 0.007 NA 0 NA 0 0 NA NA NA M 35 17 30701 737 3.137042 1.284183
ML-1M.UI 4 Pop-B 4014 0.003 NA 0 NA 0 0 NA NA NA M 25 0 48316-5601 220 3.722727 1.050623
ML-1M.UI 3 Pop-B 4017 0.003 NA 0 NA 0 0 NA NA NA M 25 14 17972 34 3.529412 1.650624
ML-1M.UI 5 Pop-B 4024 0.025 NA 0 NA 0 0 NA NA NA M 25 5 45011 247 3.153846 1.520951

We will often want to use the standard-crossfold user results, so filter those down into a table:

In [9]:
user.results = user.results.all %>%
    filter(DataSet == 'ML-1M')

Controlling for Profile Size

A user's profile size affects how well we can predict for them, and we want to remove that effect before looking at demographic differences. We will focus for now on nDCG.

There are two ways we could model this. One is to pool all algorithms and predict each result's accuracy from e.g. nratings; the other is to aggregate by user to produce a per-user 'difficulty' score. We will do the latter, to avoid statistical non-independence problems (each user contributes one result per algorithm).
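
For reference, the pooled approach we are not using would look something like the following sketch (illustrative only; the within-user correlation in its residuals is exactly the problem we want to avoid):

# Pooled model over all per-user, per-algorithm results (illustration only);
# each user appears once per algorithm, so these rows are not independent
pooled.mod = lm(nDCG ~ log10(nratings) + Algorithm, data=user.results)
summary(pooled.mod)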

In [10]:
user.ndcg = user.results %>%
    group_by(user) %>%
    summarize(nDCG=mean(nDCG)) %>%
    inner_join(users)
user.ndcg
Joining, by = "user"
user  nDCG        gender  age  occupation  zip    nratings  meanRating  ratingVar
1     0.11137798  F       1    10          48067  48        4.208333    0.4663121
2     0.14952018  M       56   16          70072  124       3.701613    1.0078023
3     0.10849804  M       25   15          55117  46        3.847826    1.0207729
4     0.17493075  M       45   7           02460  16        4.312500    0.7625000
5     0.01488767  M       25   20          55455  193       3.150259    1.2741796
⋮
6036  0.03656636  F       25   15          32603  883       3.300113    1.0470181
6037  0.11469427  F       45   1           76006  197       3.705584    0.7802238
6038  0.08419201  F       56   1           14706  15        4.000000    0.7142857
6039  0.10125047  F       45   0           01060  118       3.872881    0.5392583
6040  0.04173309  M       25   6           11106  336       3.568452    1.3982854

Let's plot the data to see what it looks like:

In [11]:
ggplot(user.ndcg) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_rug()

Oof, that distribution is concentrated. Let's log the number of ratings, since log is often a good transform for counts:

In [12]:
ggplot(user.ndcg) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_rug() +
    scale_x_log10()

Looking better. Let's see how the difficulty (nDCG) is distributed (Q-Q plot):

In [13]:
ggplot(user.ndcg) +
    aes(sample=nDCG) +
    geom_qq()

That isn't quite normal; however, a log transform looks like it will be too aggressive. Let's try a square root:

In [14]:
ggplot(user.ndcg) +
    aes(sample=sqrt(nDCG)) +
    geom_qq()

That looks close to normal.
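
Since MASS is already loaded, we can corroborate the choice of transform with a quick Box-Cox scan (a sketch: boxcox requires a strictly positive response, so we add a small offset for the zero nDCG values; the offset choice is our assumption, not part of the original analysis):

# Box-Cox scan over candidate response transforms; a lambda near 0.5
# corresponds to a square root, a lambda near 0 to a log
boxcox(nDCG + 0.001 ~ log10(nratings), data=user.ndcg,
       lambda=seq(0, 1, by=0.05))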

Let's use log10 of nratings as our predictor, and the square root of nDCG as our dependent variable. But we will fit simpler models first, so we can see the improvement.

In [15]:
user.ndcg.mod.raw = lm(nDCG ~ nratings, data=user.ndcg)
summary(user.ndcg.mod.raw)
Call:
lm(formula = nDCG ~ nratings, data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.10111 -0.04940 -0.01361  0.03787  0.30570 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.016e-01  1.072e-03  94.740  < 2e-16 ***
nratings    -3.005e-05  4.273e-06  -7.033 2.24e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.064 on 6038 degrees of freedom
Multiple R-squared:  0.008126,	Adjusted R-squared:  0.007962 
F-statistic: 49.47 on 1 and 6038 DF,  p-value: 2.237e-12
In [16]:
user.ndcg.mod.log = lm(nDCG ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod.log)
Call:
lm(formula = nDCG ~ log10(nratings), data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.11099 -0.04913 -0.01338  0.03780  0.30033 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.132259   0.003605   36.69   <2e-16 ***
log10(nratings) -0.018086   0.001787  -10.12   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06373 on 6038 degrees of freedom
Multiple R-squared:  0.01668,	Adjusted R-squared:  0.01652 
F-statistic: 102.4 on 1 and 6038 DF,  p-value: < 2.2e-16
In [17]:
user.ndcg.mod = lm(sqrt(nDCG) ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod)
Call:
lm(formula = sqrt(nDCG) ~ log10(nratings), data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31853 -0.07545 -0.00390  0.07412  0.33689 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.357808   0.005965   59.98   <2e-16 ***
log10(nratings) -0.033397   0.002957  -11.29   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1054 on 6038 degrees of freedom
Multiple R-squared:  0.02069,	Adjusted R-squared:  0.02053 
F-statistic: 127.6 on 1 and 6038 DF,  p-value: < 2.2e-16
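
R-squared values are not directly comparable across different response transforms, so as a cross-check (a sketch, not part of the original notebook) we can compare the three models by the RMSE of their back-transformed predictions on the raw nDCG scale:

# RMSE of each model's predictions in raw nDCG units; the sqrt-response
# model's fitted values must be squared before comparing
user.ndcg %>%
    transmute(e.raw = nDCG - predict(user.ndcg.mod.raw),
              e.log = nDCG - predict(user.ndcg.mod.log),
              e.sqrt = nDCG - predict(user.ndcg.mod)^2) %>%
    summarize_all(funs(sqrt(mean(. ^ 2))))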

This is the best model of the three. Let's fill in our data with it. We compute residuals manually, because the linear model's residuals are in square-root space. For checking the model, we want:

$$\epsilon_i = \sqrt{\mathrm{nDCG}_i} - \hat{y}_i$$

However, we also want the error on the original nDCG scale, which is

$$e_i = \mathrm{nDCG}_i - \hat{y}_i^2$$
In [18]:
user.ndcg.preds = user.ndcg %>%
    add_predictions(user.ndcg.mod) %>%
    rename(predRoot=pred) %>%
    mutate(pred = predRoot * predRoot, resid=sqrt(nDCG) - predRoot, error=nDCG-pred)
head(user.ndcg.preds)
user  nDCG        gender  age  occupation  zip    nratings  meanRating  ratingVar  predRoot   pred        resid        error
1     0.11137798  F       1    10          48067  48        4.208333    0.4663121  0.3016593  0.09099832  0.03207413   0.02037967
2     0.14952018  M       56   16          70072  124       3.701613    1.0078023  0.2878935  0.08288268  0.09878487   0.06663750
3     0.10849804  M       25   15          55117  46        3.847826    1.0207729  0.3022766  0.09137112  0.02711384   0.01712692
4     0.17493075  M       45   7           02460  16        4.312500    0.7625000  0.3175939  0.10086587  0.10065336   0.07406488
5     0.01488767  M       25   20          55455  193       3.150259    1.2741796  0.2814767  0.07922913  -0.15946165  -0.06434146
6     0.06617258  F       50   9           55117  66        3.893939    0.7116550  0.2970403  0.08823295  -0.03980001  -0.02206038
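
As a quick sanity check (not in the original output), the manually computed root-space residuals should match the residuals the model itself reports:

# add_predictions preserves row order, so the manual sqrt-space residuals
# should equal lm()'s residual vector
all.equal(user.ndcg.preds$resid, unname(residuals(user.ndcg.mod)))
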
In [19]:
ggplot(user.ndcg.preds) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_smooth() +
    geom_line(mapping=aes(y=pred), color="red") +
    geom_rug() +
    scale_x_log10() + scale_y_sqrt()
`geom_smooth()` using method = 'gam'

We will be good citizens and plot our residuals.

In [20]:
ggplot(user.ndcg.preds) +
    aes(x=pred, y=resid) +
    geom_point() +
    geom_rug() +
    geom_hline(yintercept=0, linetype="dashed")

That is mostly noise, with the exception of the line along the bottom; our initial look at the outcome distribution suggested there is a chunk of users with an nDCG of 0, and they will cause funny business like this.
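
We can count that chunk directly (a quick check, not in the original output):

# number of users whose mean nDCG is exactly zero
sum(user.ndcg.preds$nDCG == 0)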

In [21]:
ggplot(user.ndcg.preds) +
    aes(sample=resid) +
    geom_qq()

Not bad! We don't strictly need normal residuals, since we are not trying to build an inferentially valid linear model, but it is a good idea to check anyway.

Gender

Let's analyze accuracy by gender.

In [22]:
# FIXME re-add rank.ndcg
gender.results = user.results %>%
    select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm, -gender) %>%
    group_by(Algorithm, gender, Metric) %>%
    summarize(value=mean(value)) %>%
    ungroup()
head(gender.results)
Algorithm  gender  Metric  value
II-B       F       MAP     0.1419284
II-B       F       MRR     0.2145984
II-B       F       nDCG    0.1643127
II-B       F       RMSE    NA
II-B       M       MAP     0.1679569
II-B       M       MRR     0.2622334
In [23]:
overall.results = user.results %>%
    select(Algorithm, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm) %>%
    group_by(Algorithm, Metric) %>%
    summarize(value=mean(value)) %>%
    ungroup() %>%
    mutate(gender = 'Any')
combined.results = rbind(gender.results, overall.results)
head(combined.results)
Algorithm  gender  Metric  value
II-B       F       MAP     0.1419284
II-B       F       MRR     0.2145984
II-B       F       nDCG    0.1643127
II-B       F       RMSE    NA
II-B       M       MAP     0.1679569
II-B       M       MRR     0.2622334
In [24]:
ggplot(combined.results) +
    aes(x=gender, y=value, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_wrap(~ Metric, scales="free")
Warning message:
“Removed 12 rows containing missing values (geom_bar).”

Controlled Effects

Now we want to control for the size of users' profiles. Since male users rate more movies than female users, and profile size affects accuracy, it is possible that the accuracy advantage for male users is really a profile-size effect.
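
We can check the premise directly (a quick look, not part of the original notebook output):

# profile-size distribution by gender: do male users rate more movies?
users %>%
    group_by(gender) %>%
    summarize(n=n(), median.nratings=median(nratings), mean.nratings=mean(nratings))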

We will control for that by using our linear model to predict per-user 'difficulty' from profile size, and then look at the remaining effect of gender on accuracy.

In [25]:
gender.corr.ndcg = user.results %>%
    select(user, Algorithm, gender, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(gender, Algorithm) %>%
    summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(gender.corr.ndcg)
Joining, by = "user"
gender  Algorithm  nDCG        Corr.nDCG
F       II-B       0.16431269  0.07754269
F       II-E       0.03050540  -0.05626460
F       Mean-E     0.03072369  -0.05604631
F       MF-B       0.04193736  -0.04483263
F       MF-E       0.02645664  -0.06031336
F       Pop-B      0.11204937  0.02527937
In [26]:
ggplot(gender.corr.ndcg) +
    aes(x=gender, y=Corr.nDCG, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge")

After controlling for the number of ratings, we still see increased accuracy for male users.
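
One way to check this formally, sketched here for a single algorithm ('II-B'; this check is hypothetical and not part of the original analysis):

# Welch t-test of size-corrected per-user nDCG by gender for one algorithm;
# one result per user here, so these observations are independent
gender.check = user.results %>%
    filter(Algorithm == 'II-B') %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred), by="user") %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG)
t.test(Corr.nDCG ~ gender, data=gender.check)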

Weird Gender Control Analysis

Michael tried this control for gender effects, but it was kind of weird and we don't think it tells us anything meaningful. So ignore it (but the code is still here for posterity).

In [27]:
gender.effects = user.results %>%
    select(gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -gender) %>%
    group_by(gender, Metric) %>%
    summarize(effect=mean(value, na.rm=TRUE)) %>%
    ungroup()
gender.effects
gender  Metric  effect
F       MAP     0.06188669
F       MRR     0.09174290
F       nDCG    0.08555278
F       RMSE    0.86783965
M       MAP     0.07300555
M       MRR     0.11038443
M       nDCG    0.10114069
M       RMSE    0.83569903
In [28]:
controlled.gender.results = user.results %>%
    select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm, -gender) %>%
    inner_join(gender.effects) %>%
    mutate(offset = value - effect) %>%
    select(-effect) %>%
    group_by(Algorithm, gender, Metric) %>%
    summarize(offset=mean(offset)) %>%
    ungroup()
head(controlled.gender.results)
Joining, by = c("gender", "Metric")
Algorithm  gender  Metric  offset
II-B       F       MAP     0.08004171
II-B       F       MRR     0.12285555
II-B       F       nDCG    0.07875991
II-B       F       RMSE    NA
II-B       M       MAP     0.09495133
II-B       M       MRR     0.15184898
In [29]:
ggplot(controlled.gender.results) +
    aes(x=gender, y=offset, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_wrap(~ Metric, scales="free")
Warning message:
“Removed 8 rows containing missing values (geom_bar).”

Age

We now want to examine accuracy as a function of age. We're basically going to do the same thing we did for gender.
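
Since this repeats the gender analysis, the size correction could be factored into a small helper (a sketch; corrected_ndcg is a hypothetical name, and the _at verbs assume dplyr >= 0.7):

# Hypothetical helper: size-corrected mean nDCG, grouped by any demographic column
corrected_ndcg = function(results, group) {
    results %>%
        select_at(c("user", "Algorithm", group, "nDCG")) %>%
        inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred), by="user") %>%
        mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
        group_by_at(c(group, "Algorithm")) %>%
        summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
}
# e.g.: corrected_ndcg(user.results, "age") would reproduce the analysis below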

Start by aggregating the results:

In [30]:
# FIXME Re-add rank nDCG
age.results = user.results %>%
    select(Algorithm, age, MAP=AvgPrec, MRR=RecipRank, nDCG) %>%
    gather("Metric", "value", -Algorithm, -age) %>%
    group_by(Algorithm, age, Metric) %>%
    summarize(value=mean(value))
age.results
Algorithm  age  Metric  value
II-B       1    MAP     0.1448092
II-B       1    MRR     0.2267504
II-B       1    nDCG    0.1804995
II-B       18   MAP     0.1654056
II-B       18   MRR     0.2635039
⋮
UU-E       50   MRR     0.007145532
UU-E       50   nDCG    0.022216754
UU-E       56   MAP     0.006705764
UU-E       56   MRR     0.006429232
UU-E       56   nDCG    0.021696886

And then we can plot them:

In [31]:
ggplot(age.results) +
    aes(x=age, y=value, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_grid(Metric ~ ., scales="free_y")

Controlling for Profile Size

Let's control for profile size again.

In [32]:
age.corr.ndcg = user.results %>%
    select(user, Algorithm, age, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(age, Algorithm) %>%
    summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(age.corr.ndcg)
Joining, by = "user"
age  Algorithm  nDCG        Corr.nDCG
1    II-B       0.18049953  0.09327252
1    II-E       0.02916638  -0.05806063
1    Mean-E     0.02325226  -0.06397475
1    MF-B       0.04930430  -0.03792271
1    MF-E       0.02870881  -0.05851821
1    Pop-B      0.11716021  0.02993319
In [33]:
ggplot(age.corr.ndcg) +
    aes(x=age, y=Corr.nDCG, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge")

The discrepancy in performance is robust to controlling for profile size.