Output Analysis

This notebook contains our analysis of the recommender evaluation results.

It proceeds in a few steps:

  1. Setup, where we do the initial R configuration.
  2. Read Data, where we load the experiment data for analysis.
  3. Controlling for Profile Size, where we model accuracy as a function of user profile size in order to control for intrinsic profile difficulty.
  4. Gender, where we analyze the results by gender.
  5. Age, where we analyze the results by age.

Setup

Libraries:

In [1]:
library(MASS)
library(plyr)
library(dplyr)
library(readr)
library(ggplot2)
library(tidyr)
library(modelr)
library(tibble)
Attaching package: ‘dplyr’

The following objects are masked from ‘package:plyr’:

    arrange, count, desc, failwith, id, mutate, rename, summarise,
    summarize

The following object is masked from ‘package:MASS’:

    select

The following objects are masked from ‘package:stats’:

    filter, lag

The following objects are masked from ‘package:base’:

    intersect, setdiff, setequal, union

In [2]:
options(repr.plot.height=5)
options(repr.matrix.max.rows=10)
options(repr.matrix.max.columns=10)

Read Data

First, we need to read the user data from the underlying MovieLens data:

In [3]:
# users.dat is '::'-delimited; treating ':' as the delimiter and skipping the
# resulting empty columns (the '_' entries in col_types) handles that
users.meta = read_delim("data/ml-1m/users.dat", delim=":",
                   col_names=c("user", "gender", "age", "occupation", "zip"),
                   col_types="i_c_c_c_c") %>%
    mutate(gender=as.factor(gender),
           age=as.factor(age),
           occupation=as.factor(occupation))
users.meta
user  gender  age  occupation  zip
1     F       1    10          48067
2     M       56   16          70072
3     M       25   15          55117
4     M       45   7           02460
5     M       25   20          55455
⋮
6036  F       25   15          32603
6037  F       45   1           76006
6038  F       56   1           14706
6039  F       45   0           01060
6040  M       25   6           11106

Now, we want to get user profile statistics. In order to do that, we need each user's train profile; we can load that from the training data output by the cross-folding process.

For each partition, we need to load the train ratings (historical profile) for all the test users in that partition. So we will load the test data (to get the users) and the train data (to get the histories), then merge them.

In [4]:
train.ratings = ldply(1:5, function(part) {
    message("reading part ", part)
    test.fn = sprintf("build/ML-1M.out/part0%d.test.csv", part)
    train.fn = sprintf("build/ML-1M.out/part0%d.train.csv", part)
    test = suppressMessages(read_csv(test.fn, col_names=c("user", "item", "rating", "timestamp")))
    train = suppressMessages(read_csv(train.fn, col_names=c("user", "item", "rating", "timestamp")))
    # keep only the train (history) ratings of this partition's test users
    test %>%
        select(user) %>%
        distinct() %>%
        mutate(part=as.factor(part), proto=as.factor('UserPct')) %>%
        inner_join(train)
})
reading part 1
Joining, by = "user"
reading part 2
Joining, by = "user"
reading part 3
Joining, by = "user"
reading part 4
Joining, by = "user"
reading part 5
Joining, by = "user"

Quick summary just to see what this data looks like:

In [5]:
train.ratings %>%
    select(part, proto, rating) %>%
    summary()
 part           proto            rating     
 1:193768   UserPct:970009   Min.   :1.000  
 2:189980                    1st Qu.:3.000  
 3:181023                    Median :4.000  
 4:203058                    Mean   :3.578  
 5:202180                    3rd Qu.:4.000  
                             Max.   :5.000  

Now we want to compute per-user profile statistics:

In [6]:
user.stats = train.ratings %>%
    group_by(user) %>%
    summarize(nratings = n(), meanRating=mean(rating), ratingVar=var(rating))
user.stats
user  nratings  meanRating  ratingVar
1     48        4.208333    0.4663121
2     124       3.701613    1.0078023
3     46        3.847826    1.0207729
4     16        4.312500    0.7625000
5     193       3.150259    1.2741796
⋮
6036  883       3.300113    1.0470181
6037  197       3.705584    0.7802238
6038  15        4.000000    0.7142857
6039  118       3.872881    0.5392583
6040  336       3.568452    1.3982854

Join these statistics with our user metadata table, so that we have one table of user information:

In [7]:
users = users.meta %>% inner_join(user.stats)
users
Joining, by = "user"
user  gender  age  occupation  zip    nratings  meanRating  ratingVar
1     F       1    10          48067  48        4.208333    0.4663121
2     M       56   16          70072  124       3.701613    1.0078023
3     M       25   15          55117  46        3.847826    1.0207729
4     M       45   7           02460  16        4.312500    0.7625000
5     M       25   20          55455  193       3.150259    1.2741796
⋮
6036  F       25   15          32603  883       3.300113    1.0470181
6037  F       45   1           76006  197       3.705584    0.7802238
6038  F       56   1           14706  15        4.000000    0.7142857
6039  F       45   0           01060  118       3.872881    0.5392583
6040  M       25   6           11106  336       3.568452    1.3982854

Now that we have user information, we can read the per-user recommender evaluation results.

In [8]:
user.results.all = read_csv("build/movielens-user-results.csv", guess_max=10000) %>%
    rename(user=User) %>%
    mutate(DataSet=as.factor(DataSet), 
           Partition=as.factor(Partition),
           Algorithm=as.factor(Algorithm)) %>%
    inner_join(users)
head(user.results.all)
Parsed with column specification:
cols(
  DataSet = col_character(),
  Partition = col_integer(),
  Algorithm = col_character(),
  User = col_integer(),
  TestTime = col_double(),
  RMSE = col_double(),
  nDCG = col_double(),
  Rank = col_integer(),
  RecipRank = col_double(),
  AvgPrec = col_double(),
  `1R.nDCG` = col_character(),
  `1R.Rank` = col_character(),
  `1R.RecipRank` = col_character(),
  `1R.AvgPrec` = col_character()
)
Joining, by = "user"
DataSet  Partition  Algorithm  user  TestTime  RMSE  nDCG  Rank  RecipRank  AvgPrec  1R.Rank  1R.RecipRank  1R.AvgPrec  gender  age  occupation  zip  nratings  meanRating  ratingVar
ML-1M.UI 4 Pop-B 4011 0.019 NA 0 NA 0 0 NA NA NA M 25 17 85210 46 3.347826 2.009662
ML-1M.UI 3 Pop-B 4016 0.028 NA 0 NA 0 0 NA NA NA M 50 0 33901 673 3.007429 1.257385
ML-1M.UI 2 Pop-B 3999 0.007 NA 0 NA 0 0 NA NA NA M 35 17 30701 737 3.137042 1.284183
ML-1M.UI 4 Pop-B 4014 0.003 NA 0 NA 0 0 NA NA NA M 25 0 48316-5601 220 3.722727 1.050623
ML-1M.UI 3 Pop-B 4017 0.003 NA 0 NA 0 0 NA NA NA M 25 14 17972 34 3.529412 1.650624
ML-1M.UI 5 Pop-B 4024 0.025 NA 0 NA 0 0 NA NA NA M 25 5 45011 247 3.153846 1.520951

We will often want to use the standard-crossfold user results, so filter those down into a table:

In [9]:
user.results = user.results.all %>%
    filter(DataSet == 'ML-1M')

Controlling for Profile Size

A user's profile size affects how well we can predict for them, and we want to remove that effect before looking at demographic differences. We will focus for now on nDCG.

There are two ways we could model this. One is to pool all algorithms and predict each result's accuracy from e.g. nratings; the other is to aggregate by user to produce a per-user 'difficulty' score. We will do the latter, to avoid statistical non-independence problems (each user contributes one result per algorithm).
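
For reference, the pooled approach we are not using would look something like the following sketch (illustrative only; the within-user correlation in its residuals is exactly the problem we want to avoid):

# Pooled model over all per-user, per-algorithm results (illustration only);
# each user appears once per algorithm, so these rows are not independent
pooled.mod = lm(nDCG ~ log10(nratings) + Algorithm, data=user.results)
summary(pooled.mod)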

In [10]:
user.ndcg = user.results %>%
    group_by(user) %>%
    summarize(nDCG=mean(nDCG)) %>%
    inner_join(users)
user.ndcg
Joining, by = "user"
user  nDCG        gender  age  occupation  zip    nratings  meanRating  ratingVar
1     0.11137798  F       1    10          48067  48        4.208333    0.4663121
2     0.14952018  M       56   16          70072  124       3.701613    1.0078023
3     0.10849804  M       25   15          55117  46        3.847826    1.0207729
4     0.17493075  M       45   7           02460  16        4.312500    0.7625000
5     0.01488767  M       25   20          55455  193       3.150259    1.2741796
⋮
6036  0.03656636  F       25   15          32603  883       3.300113    1.0470181
6037  0.11469427  F       45   1           76006  197       3.705584    0.7802238
6038  0.08419201  F       56   1           14706  15        4.000000    0.7142857
6039  0.10125047  F       45   0           01060  118       3.872881    0.5392583
6040  0.04173309  M       25   6           11106  336       3.568452    1.3982854

Let's plot the data to see what it looks like:

In [11]:
ggplot(user.ndcg) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_rug()

Oof, that distribution is concentrated. Let's log the number of ratings, since log is often a good transform for counts:

In [12]:
ggplot(user.ndcg) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_rug() +
    scale_x_log10()

Looking better. Let's see how the difficulty (nDCG) is distributed (Q-Q plot):

In [13]:
ggplot(user.ndcg) +
    aes(sample=nDCG) +
    geom_qq()

That isn't quite normal; however, a log transform looks like it will be too aggressive. Let's try a square root:

In [14]:
ggplot(user.ndcg) +
    aes(sample=sqrt(nDCG)) +
    geom_qq()

That looks close to normal.
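
Since MASS is already loaded, we can corroborate the choice of transform with a quick Box-Cox scan (a sketch: boxcox requires a strictly positive response, so we add a small offset for the zero nDCG values; the offset choice is our assumption, not part of the original analysis):

# Box-Cox scan over candidate response transforms; a lambda near 0.5
# corresponds to a square root, a lambda near 0 to a log
boxcox(nDCG + 0.001 ~ log10(nratings), data=user.ndcg,
       lambda=seq(0, 1, by=0.05))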

Let's use log10 of nratings as our predictor, and the square root of nDCG as our dependent variable. But we will fit simpler models first, so we can see the improvement.

In [15]:
user.ndcg.mod.raw = lm(nDCG ~ nratings, data=user.ndcg)
summary(user.ndcg.mod.raw)
Call:
lm(formula = nDCG ~ nratings, data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.10111 -0.04940 -0.01361  0.03787  0.30570 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  1.016e-01  1.072e-03  94.740  < 2e-16 ***
nratings    -3.005e-05  4.273e-06  -7.033 2.24e-12 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.064 on 6038 degrees of freedom
Multiple R-squared:  0.008126,	Adjusted R-squared:  0.007962 
F-statistic: 49.47 on 1 and 6038 DF,  p-value: 2.237e-12
In [16]:
user.ndcg.mod.log = lm(nDCG ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod.log)
Call:
lm(formula = nDCG ~ log10(nratings), data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.11099 -0.04913 -0.01338  0.03780  0.30033 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.132259   0.003605   36.69   <2e-16 ***
log10(nratings) -0.018086   0.001787  -10.12   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.06373 on 6038 degrees of freedom
Multiple R-squared:  0.01668,	Adjusted R-squared:  0.01652 
F-statistic: 102.4 on 1 and 6038 DF,  p-value: < 2.2e-16
In [17]:
user.ndcg.mod = lm(sqrt(nDCG) ~ log10(nratings), data=user.ndcg)
summary(user.ndcg.mod)
Call:
lm(formula = sqrt(nDCG) ~ log10(nratings), data = user.ndcg)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.31853 -0.07545 -0.00390  0.07412  0.33689 

Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
(Intercept)      0.357808   0.005965   59.98   <2e-16 ***
log10(nratings) -0.033397   0.002957  -11.29   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.1054 on 6038 degrees of freedom
Multiple R-squared:  0.02069,	Adjusted R-squared:  0.02053 
F-statistic: 127.6 on 1 and 6038 DF,  p-value: < 2.2e-16
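
R-squared values are not directly comparable across different response transforms, so as a cross-check (a sketch, not part of the original notebook) we can compare the three models by the RMSE of their back-transformed predictions on the raw nDCG scale:

# RMSE of each model's predictions in raw nDCG units; the sqrt-response
# model's fitted values must be squared before comparing
user.ndcg %>%
    transmute(e.raw = nDCG - predict(user.ndcg.mod.raw),
              e.log = nDCG - predict(user.ndcg.mod.log),
              e.sqrt = nDCG - predict(user.ndcg.mod)^2) %>%
    summarize_all(funs(sqrt(mean(. ^ 2))))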

This is the best model of the three. Let's fill in our data with it. We compute residuals manually, because the linear model's residuals are in square-root space. For checking the model, we want:

$$\epsilon_i = \sqrt{\mathrm{nDCG}_i} - \hat{y}_i$$

However, we also want the error on the original nDCG scale, which is

$$e_i = \mathrm{nDCG}_i - \hat{y}_i^2$$
In [18]:
user.ndcg.preds = user.ndcg %>%
    add_predictions(user.ndcg.mod) %>%
    rename(predRoot=pred) %>%
    mutate(pred = predRoot * predRoot, resid=sqrt(nDCG) - predRoot, error=nDCG-pred)
head(user.ndcg.preds)
user  nDCG        gender  age  occupation  zip    nratings  meanRating  ratingVar  predRoot   pred        resid        error
1     0.11137798  F       1    10          48067  48        4.208333    0.4663121  0.3016593  0.09099832  0.03207413   0.02037967
2     0.14952018  M       56   16          70072  124       3.701613    1.0078023  0.2878935  0.08288268  0.09878487   0.06663750
3     0.10849804  M       25   15          55117  46        3.847826    1.0207729  0.3022766  0.09137112  0.02711384   0.01712692
4     0.17493075  M       45   7           02460  16        4.312500    0.7625000  0.3175939  0.10086587  0.10065336   0.07406488
5     0.01488767  M       25   20          55455  193       3.150259    1.2741796  0.2814767  0.07922913  -0.15946165  -0.06434146
6     0.06617258  F       50   9           55117  66        3.893939    0.7116550  0.2970403  0.08823295  -0.03980001  -0.02206038
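
As a quick sanity check (not in the original output), the manually computed root-space residuals should match the residuals the model itself reports:

# add_predictions preserves row order, so the manual sqrt-space residuals
# should equal lm()'s residual vector
all.equal(user.ndcg.preds$resid, unname(residuals(user.ndcg.mod)))
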
In [19]:
ggplot(user.ndcg.preds) +
    aes(x=nratings, y=nDCG) +
    geom_point() + 
    geom_smooth() +
    geom_line(mapping=aes(y=pred), color="red") +
    geom_rug() +
    scale_x_log10() + scale_y_sqrt()
`geom_smooth()` using method = 'gam'

We will be good citizens and plot our residuals.

In [20]:
ggplot(user.ndcg.preds) +
    aes(x=pred, y=resid) +
    geom_point() +
    geom_rug() +
    geom_hline(yintercept=0, linetype="dashed")

That is mostly noise, with the exception of the line along the bottom; our initial look at the outcome distribution suggested there is a chunk of users with an nDCG of 0, and they will cause funny business like this.
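
We can count that chunk directly (a quick check, not in the original output):

# number of users whose mean nDCG is exactly zero
sum(user.ndcg.preds$nDCG == 0)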

In [21]:
ggplot(user.ndcg.preds) +
    aes(sample=resid) +
    geom_qq()

Not bad! We don't strictly need normal residuals, since we are not trying to build an inferentially valid linear model, but it is a good idea to check anyway.

Gender

Let's analyze accuracy by gender.

In [22]:
# FIXME re-add rank.ndcg
gender.results = user.results %>%
    select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm, -gender) %>%
    group_by(Algorithm, gender, Metric) %>%
    summarize(value=mean(value)) %>%
    ungroup()
head(gender.results)
Algorithm  gender  Metric  value
II-B       F       MAP     0.1419284
II-B       F       MRR     0.2145984
II-B       F       nDCG    0.1643127
II-B       F       RMSE    NA
II-B       M       MAP     0.1679569
II-B       M       MRR     0.2622334
In [23]:
overall.results = user.results %>%
    select(Algorithm, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm) %>%
    group_by(Algorithm, Metric) %>%
    summarize(value=mean(value)) %>%
    ungroup() %>%
    mutate(gender = 'Any')
combined.results = rbind(gender.results, overall.results)
head(combined.results)
Algorithm  gender  Metric  value
II-B       F       MAP     0.1419284
II-B       F       MRR     0.2145984
II-B       F       nDCG    0.1643127
II-B       F       RMSE    NA
II-B       M       MAP     0.1679569
II-B       M       MRR     0.2622334
In [24]:
ggplot(combined.results) +
    aes(x=gender, y=value, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_wrap(~ Metric, scales="free")
Warning message:
“Removed 12 rows containing missing values (geom_bar).”

Controlled Effects

Now we want to control for the size of users' profiles. Since male users rate more movies than female users, and profile size affects accuracy, it is possible that the accuracy advantage for male users is really a profile-size effect.
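
We can check the premise directly (a quick look, not part of the original notebook output):

# profile-size distribution by gender: do male users rate more movies?
users %>%
    group_by(gender) %>%
    summarize(n=n(), median.nratings=median(nratings), mean.nratings=mean(nratings))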

We will control for that by using our linear model to predict per-user 'difficulty' from profile size, and then look at the remaining effect of gender on accuracy.

In [25]:
gender.corr.ndcg = user.results %>%
    select(user, Algorithm, gender, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(gender, Algorithm) %>%
    summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(gender.corr.ndcg)
Joining, by = "user"
gender  Algorithm  nDCG        Corr.nDCG
F       II-B       0.16431269  0.07754269
F       II-E       0.03050540  -0.05626460
F       Mean-E     0.03072369  -0.05604631
F       MF-B       0.04193736  -0.04483263
F       MF-E       0.02645664  -0.06031336
F       Pop-B      0.11204937  0.02527937
In [26]:
ggplot(gender.corr.ndcg) +
    aes(x=gender, y=Corr.nDCG, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge")

After controlling for the number of ratings, we still see increased accuracy for male users.
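
One way to check this formally, sketched here for a single algorithm ('II-B'; this check is hypothetical and not part of the original analysis):

# Welch t-test of size-corrected per-user nDCG by gender for one algorithm;
# one result per user here, so these observations are independent
gender.check = user.results %>%
    filter(Algorithm == 'II-B') %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred), by="user") %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG)
t.test(Corr.nDCG ~ gender, data=gender.check)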

Weird Gender Control Analysis

Michael tried this control for gender effects, but it was kind of weird and we don't think it tells us anything meaningful. So ignore it (but the code is still here for posterity).

In [27]:
gender.effects = user.results %>%
    select(gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -gender) %>%
    group_by(gender, Metric) %>%
    summarize(effect=mean(value, na.rm=TRUE)) %>%
    ungroup()
gender.effects
gender  Metric  effect
F       MAP     0.06188669
F       MRR     0.09174290
F       nDCG    0.08555278
F       RMSE    0.86783965
M       MAP     0.07300555
M       MRR     0.11038443
M       nDCG    0.10114069
M       RMSE    0.83569903
In [28]:
controlled.gender.results = user.results %>%
    select(Algorithm, gender, MAP=AvgPrec, MRR=RecipRank, RMSE, nDCG) %>%
    gather("Metric", "value", -Algorithm, -gender) %>%
    inner_join(gender.effects) %>%
    mutate(offset = value - effect) %>%
    select(-effect) %>%
    group_by(Algorithm, gender, Metric) %>%
    summarize(offset=mean(offset)) %>%
    ungroup()
head(controlled.gender.results)
Joining, by = c("gender", "Metric")
Algorithm  gender  Metric  offset
II-B       F       MAP     0.08004171
II-B       F       MRR     0.12285555
II-B       F       nDCG    0.07875991
II-B       F       RMSE    NA
II-B       M       MAP     0.09495133
II-B       M       MRR     0.15184898
In [29]:
ggplot(controlled.gender.results) +
    aes(x=gender, y=offset, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_wrap(~ Metric, scales="free")
Warning message:
“Removed 8 rows containing missing values (geom_bar).”

Age

We now want to examine accuracy as a function of age. We're basically going to do the same thing we did for gender.
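
Since this repeats the gender analysis, the size correction could be factored into a small helper (a sketch; corrected_ndcg is a hypothetical name, and the _at verbs assume dplyr >= 0.7):

# Hypothetical helper: size-corrected mean nDCG, grouped by any demographic column
corrected_ndcg = function(results, group) {
    results %>%
        select_at(c("user", "Algorithm", group, "nDCG")) %>%
        inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred), by="user") %>%
        mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
        group_by_at(c(group, "Algorithm")) %>%
        summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
}
# e.g.: corrected_ndcg(user.results, "age") would reproduce the analysis below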

Start by aggregating the results:

In [30]:
# FIXME Re-add rank nDCG
age.results = user.results %>%
    select(Algorithm, age, MAP=AvgPrec, MRR=RecipRank, nDCG) %>%
    gather("Metric", "value", -Algorithm, -age) %>%
    group_by(Algorithm, age, Metric) %>%
    summarize(value=mean(value))
age.results
Algorithm  age  Metric  value
II-B       1    MAP     0.1448092
II-B       1    MRR     0.2267504
II-B       1    nDCG    0.1804995
II-B       18   MAP     0.1654056
II-B       18   MRR     0.2635039
⋮
UU-E       50   MRR     0.007145532
UU-E       50   nDCG    0.022216754
UU-E       56   MAP     0.006705764
UU-E       56   MRR     0.006429232
UU-E       56   nDCG    0.021696886

And then we can plot them:

In [31]:
ggplot(age.results) +
    aes(x=age, y=value, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge") +
    facet_grid(Metric ~ ., scales="free_y")

Controlling for Profile Size

Let's control for profile size again.

In [32]:
age.corr.ndcg = user.results %>%
    select(user, Algorithm, age, nDCG) %>%
    inner_join(select(user.ndcg.preds, user, Pred.nDCG=pred)) %>%
    mutate(Corr.nDCG = nDCG - Pred.nDCG) %>%
    group_by(age, Algorithm) %>%
    summarize(nDCG=mean(nDCG), Corr.nDCG=mean(Corr.nDCG))
head(age.corr.ndcg)
Joining, by = "user"
age  Algorithm  nDCG        Corr.nDCG
1    II-B       0.18049953  0.09327252
1    II-E       0.02916638  -0.05806063
1    Mean-E     0.02325226  -0.06397475
1    MF-B       0.04930430  -0.03792271
1    MF-E       0.02870881  -0.05851821
1    Pop-B      0.11716021  0.02993319
In [33]:
ggplot(age.corr.ndcg) +
    aes(x=age, y=Corr.nDCG, fill=Algorithm) +
    geom_bar(stat="identity", position="dodge")

The discrepancy in performance is robust to controlling for profile size.