
Problem Statement

The goal is to build a recommendation system using collaborative filtering, in which customers are recommended the beers they are most likely to buy, using a given dataset of roughly half a million beer reviews.

Goals of Assignment:

  • Choose only those beers that have at least N reviews; find a suitable N using EDA.
  • Clean the data and find average beer & user ratings.
  • Find the similarity between the first 10 beers & the first 10 users, and plot these similarity matrices.
  • Evaluate item-based & user-based collaborative filtering algorithms using ‘split’ and ‘cross-validation’ evaluation schemes.
  • Recommend the top 5 beers to users “cokes”, “genog” & “giblet”.
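
All of the code below assumes that a few standard packages are loaded and that the reviews have been read into a data frame named beer. A minimal setup sketch (the package list is inferred from the function calls in this report; beer_data.csv is a placeholder file name):

library(dplyr)          # data wrangling (group_by, summarise, distinct, ...)
library(ggplot2)        # plots (ggplot, qplot)
library(knitr)          # kable()
library(kableExtra)     # kable_styling()
library(recommenderlab) # realRatingMatrix, evaluationScheme, Recommender

# Placeholder file name -- replace with the actual path to the reviews CSV
beer <- read.csv("beer_data.csv", stringsAsFactors = TRUE)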


Dataset Preview

head(beer) %>% kable("html") %>%
  kable_styling()
| beer_beerid | review_profilename | review_overall |
|-------------|--------------------|----------------|
| 48215       | stcules            | 3.0            |
| 52159       | oline73            | 3.0            |
| 52159       | alpinebryant       | 3.0            |
| 52159       | rawthar            | 4.0            |
| 52159       | RangerClegg        | 3.5            |
| 58046       | mikedrinksbeer2    | 4.5            |

The dataset contains 3 attributes: beer ID, reviewer name & the overall rating given by the reviewer. In total, 475984 reviews are available.


Data Structure

str(beer)

## 'data.frame':    475984 obs. of  3 variables:
##  $ beer_beerid       : int  48215 52159 52159 52159 52159 58046 58046 58046 58046 58046 ...
##  $ review_profilename: Factor w/ 22498 levels "","0110x011",..: 19293 15474 675 16851 16798 13843 5629 2680 1473 15351 ...
##  $ review_overall    : num  3 3 3 4 3.5 4.5 4 4.5 4.5 4 ...


Data Summary

summary(beer)

##   beer_beerid         review_profilename review_overall
##  Min.   :    3   northyorksammy:  1846   Min.   :0.000  
##  1st Qu.: 1716   mikesgroove   :  1379   1st Qu.:3.500  
##  Median :13892   BuckeyeNation :  1338   Median :4.000  
##  Mean   :21661   Thorpe429     :  1072   Mean   :3.815  
##  3rd Qu.:39397   ChainGangGuy  :  1046   3rd Qu.:4.500  
##  Max.   :77317   NeroFiddled   :  1031   Max.   :5.000  
##                  (Other)       :468272


Missing Values

In total, 100 reviews have an empty review_profilename.

nrow(beer[(is.na(beer$review_profilename) | beer$review_profilename==""), ])

## [1] 100

Updating the beer dataset by removing rows with an empty review_profilename.

beer<-beer[!(beer$review_profilename==""), ]


Checking Duplicates

Checking for duplicates where both item and user repeat, i.e. the same user gave multiple ratings to the same beer.

beer[!(duplicated(beer[c("beer_beerid","review_profilename")]) | duplicated(beer[c("beer_beerid","review_profilename")], fromLast = TRUE)), ] %>% nrow()

## [1] 473055

beer %>% distinct(beer_beerid,review_profilename,.keep_all = TRUE) %>% nrow()

## [1] 474462

So there are 474462 distinct (beer, user) reviews.

Removing duplicates where both beer & user are repeated ensures that no two reviews from a single user for the same beer are counted or accumulated.

beer<-distinct(beer,beer_beerid,review_profilename,.keep_all = TRUE)
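
An alternative to keeping just one row per (beer, user) pair would be to average a user's repeated ratings for the same beer before deduplicating; a sketch of that variant (not used in the rest of this analysis):

beer_averaged <- beer %>%
  group_by(beer_beerid, review_profilename) %>%
  summarise(review_overall = mean(review_overall))  # one averaged rating per (beer, user) pair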


I. Data Preparation

1. Choose only those beers that have at least N number of reviews

Let's find all distinct beers & the total number of reviews each has received.

beer_reviews_count <- beer %>% group_by(beer_beerid) %>% summarise(total_beer_reviews=n())
dim(beer_reviews_count)

## [1] 40304     2

There are 40304 distinct beers in total.

Let's sort these beers by review count.

 beer_reviews_count[with(beer_reviews_count, order(total_beer_reviews,decreasing = TRUE)), ]

## # A tibble: 40,304 x 2
##    beer_beerid total_beer_reviews
##          <int>              <int>
##  1        2093                977
##  2         412                966
##  3        1904                902
##  4        1093                840
##  5          92                812
##  6        4083                798
##  7         276                788
##  8        7971                778
##  9          88                755
## 10        1013                750
## # ... with 40,294 more rows

summary(beer_reviews_count)

##   beer_beerid    total_beer_reviews
##  Min.   :    3   Min.   :  1.00    
##  1st Qu.:16880   1st Qu.:  1.00    
##  Median :37366   Median :  2.00    
##  Mean   :36973   Mean   : 11.77    
##  3rd Qu.:56234   3rd Qu.:  5.00    
##  Max.   :77317   Max.   :977.00

Beers have review counts in the range of 1 to 977; beer 2093 has the highest number of reviews (977).


Let's also count the number of distinct users.

beer %>% group_by(review_profilename) %>% summarise(total_user_reviews=n()) %>% nrow()

## [1] 22497

22497 distinct users


Let's now check which user has reviewed the maximum number of beers.

beer %>% group_by(review_profilename) %>% summarise(user_review_count=n()) %>% top_n(1)

## Selecting by user_review_count

## # A tibble: 1 x 2
##   review_profilename user_review_count
##               <fctr>             <int>
## 1     northyorksammy              1842

User northyorksammy has given the maximum number of reviews: 1842.


To find an ideal value of N, it is important that we choose beers having a large enough number of reviews to avoid the cold-start problem.

So let's further analyze beers by the number of reviews each one has received, starting with the beers that have only a single review.

qplot(beer_reviews_count$total_beer_reviews, geom = "histogram",xlab="Total Reviews")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

beer_reviews_count %>% subset(total_beer_reviews==1) %>% dim()

## [1] 18080     2

So 18080 beers, roughly 45% of all distinct beers, have only 1 review. Let's group by total reviews & find the count of beers for each review count.


Let's create a separate data frame with the frequency of each review count.

review_frequency<-beer_reviews_count %>% group_by(total_beer_reviews) %>% summarise(review_occurance=n())

head(review_frequency,25)

## # A tibble: 25 x 2
##    total_beer_reviews review_occurance
##                 <int>            <int>
##  1                  1            18080
##  2                  2             6183
##  3                  3             3079
##  4                  4             1895
##  5                  5             1350
##  6                  6              959
##  7                  7              785
##  8                  8              556
##  9                  9              478
## 10                 10              401
## # ... with 15 more rows

It seems that most beers get very few reviews. In fact, very few beers have the large number of reviews required for building a recommendation system based on collaborative filtering.
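
One way to pick N through EDA is to tabulate how many beers survive a few candidate cutoffs and what share of all reviews they still cover; a quick sketch (the candidate thresholds are illustrative):

sapply(c(10, 25, 50, 100), function(n) {
  kept <- beer_reviews_count %>% filter(total_beer_reviews >= n)
  c(threshold       = n,
    beers_kept      = nrow(kept),
    review_coverage = sum(kept$total_beer_reviews) / sum(beer_reviews_count$total_beer_reviews))
})

A table like this helps justify the N = 50 cutoff used in the next step.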


Filtering to beers having at least 50 reviews and users having reviewed at least 30 beers (the beer threshold is higher because there are far more distinct beers, 40304, than distinct users, 22497).

beer_reviews_count_subset<-subset(beer_reviews_count,beer_reviews_count$total_beer_reviews>=50)

Now we are left with the distinct beer IDs that have at least 50 reviews each.


Let's check the beer review frequency now.

ggplot(beer_reviews_count_subset,aes(x=total_beer_reviews)) + geom_bar()


Let's also filter the beer dataset to users who have reviewed at least 30 beers each.

user_reviews_count<- beer %>% group_by(review_profilename) %>% summarise(total_user_reviews=n())
user_reviews_count_subset<-subset(user_reviews_count,user_reviews_count$total_user_reviews>=30)
ggplot(user_reviews_count_subset,aes(x=total_user_reviews)) + geom_bar()


Now let's filter the original data by these beer and user IDs.

important_beers<-merge(beer,beer_reviews_count_subset,by.x="beer_beerid",by.y="beer_beerid")
important_beers<-merge(important_beers,user_reviews_count_subset,by.x="review_profilename",by.y="review_profilename")

summary(important_beers)

##       review_profilename  beer_beerid    review_overall total_beer_reviews
##  BuckeyeNation :   518   Min.   :    5   Min.   :1.00   Min.   : 50.0     
##  mikesgroove   :   505   1st Qu.: 1089   1st Qu.:3.50   1st Qu.: 97.0     
##  BEERchitect   :   460   Median : 4161   Median :4.00   Median :171.0     
##  northyorksammy:   455   Mean   :15774   Mean   :3.87   Mean   :233.8     
##  WesWes        :   455   3rd Qu.:29619   3rd Qu.:4.50   3rd Qu.:314.0     
##  TheManiacalOne:   440   Max.   :75086   Max.   :5.00   Max.   :977.0     
##  (Other)       :227255                                                    
##  total_user_reviews
##  Min.   :  30.0    
##  1st Qu.:  78.0    
##  Median : 155.0    
##  Mean   : 233.1    
##  3rd Qu.: 299.0    
##  Max.   :1842.0    
##

This subset has a substantial number of ratings for both the retained users & beers.
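
The same filtering can also be written with dplyr joins instead of merge(); a sketch that should produce an equivalent subset:

important_beers_alt <- beer %>%
  inner_join(beer_reviews_count_subset, by = "beer_beerid") %>%
  inner_join(user_reviews_count_subset, by = "review_profilename")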


2. Convert this data frame to a “realratingMatrix” before you build your collaborative filtering models

beers_rrmatrix <- as(important_beers[,c(1,2,3)], "realRatingMatrix")
class(beers_rrmatrix)

## [1] "realRatingMatrix"
## attr(,"package")
## [1] "recommenderlab"


Checking realRatingMatrix

Review counts for the first few users

head(rowCounts(beers_rrmatrix))

##  0110x011  05Harley 100floods  1759Girl  1fastz28  32hoss32
##        30        30        29        40        96        26


Review counts for the first few beers

head(colCounts(beers_rrmatrix))

##   5   6   7  10  14  15
##  92 190 147 181  51  45


Average ratings for the first few users

head(rowMeans(beers_rrmatrix))

##  0110x011  05Harley 100floods  1759Girl  1fastz28  32hoss32
##  4.333333  4.116667  4.155172  3.725000  3.875000  3.788462


Converting the matrix to a dataframe

beers_df <- as(beers_rrmatrix, "data.frame")
str(beers_df)

## 'data.frame':    230088 obs. of  3 variables:
##  $ user  : Factor w/ 3176 levels "0110x011","05Harley",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ item  : Factor w/ 2064 levels "10","100","1002",..: 2039 47 256 263 1165 1750 1925 1955 2007 24 ...
##  $ rating: num  4 4 4.5 4.5 4.5 4 5 4 5 4 ...

summary(beers_df)

##              user             item            rating    
##  BuckeyeNation :   518   2093   :   600   Min.   :1.00  
##  mikesgroove   :   505   412    :   589   1st Qu.:3.50  
##  BEERchitect   :   460   1904   :   574   Median :4.00  
##  northyorksammy:   455   1093   :   552   Mean   :3.87  
##  WesWes        :   455   4083   :   525   3rd Qu.:4.50  
##  TheManiacalOne:   440   1013   :   515   Max.   :5.00  
##  (Other)       :227255   (Other):226733


II. Data Exploration:

1. Determine how similar the first ten users are with each other and visualize it

How similar are the first ten users to each other?

similar_users <- similarity(beers_rrmatrix[1:10,],method = "cosine",which = "users")


Users Similarity matrix

kable(as.matrix(similar_users))
| | 0110x011 | 05Harley | 100floods | 1759Girl | 1fastz28 | 32hoss32 | 3Vandoo | 4000qtrap | 4DAloveofSTOUT | 51mmz0rz |
|---|---|---|---|---|---|---|---|---|---|---|
| 0110x011 | 0.0000000 | 1.0000000 | 1.0000000 | NA | 0.9986178 | NA | 1.0000000 | NA | 1.0000000 | 1.0000000 |
| 05Harley | 1.0000000 | 0.0000000 | 0.9982744 | 1.0000000 | 0.9801961 | NA | 1.0000000 | 1.0000000 | 0.9986898 | 0.8944272 |
| 100floods | 1.0000000 | 0.9982744 | 0.0000000 | 1.0000000 | 0.9986226 | NA | 1.0000000 | 0.8106792 | 0.9982744 | 0.9474403 |
| 1759Girl | NA | 1.0000000 | 1.0000000 | 0.0000000 | 0.9975295 | NA | 0.9970831 | 1.0000000 | 0.9970545 | 0.9727176 |
| 1fastz28 | 0.9986178 | 0.9801961 | 0.9986226 | 0.9975295 | 0.0000000 | 0.9959184 | 0.9938707 | 0.9977852 | 1.0000000 | 0.9626219 |
| 32hoss32 | NA | NA | NA | NA | 0.9959184 | 0.0000000 | NA | NA | NA | 1.0000000 |
| 3Vandoo | 1.0000000 | 1.0000000 | 1.0000000 | 0.9970831 | 0.9938707 | NA | 0.0000000 | 0.9998377 | NA | 1.0000000 |
| 4000qtrap | NA | 1.0000000 | 0.8106792 | 1.0000000 | 0.9977852 | NA | 0.9998377 | 0.0000000 | 0.9969278 | NA |
| 4DAloveofSTOUT | 1.0000000 | 0.9986898 | 0.9982744 | 0.9970545 | 1.0000000 | NA | NA | 0.9969278 | 0.0000000 | NA |
| 51mmz0rz | 1.0000000 | 0.8944272 | 0.9474403 | 0.9727176 | 0.9626219 | 1.0000000 | 1.0000000 | NA | NA | 0.0000000 |


Visualize similarity matrix

image(as.matrix(similar_users), main = "User similarity")
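
For reference, the cosine similarity used here is computed over the beers that a pair of users have both rated, which is why user pairs with no co-rated beers show NA in the table above. A manual sketch for one cell of the matrix (recommenderlab's internal handling of missing values may differ slightly):

m <- as(beers_rrmatrix[1:2, ], "matrix")        # first two users; NA where unrated
common <- !is.na(m[1, ]) & !is.na(m[2, ])       # beers rated by both users
x <- m[1, common]; y <- m[2, common]
sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))  # cosine similarity on co-rated beers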


2. Compute and visualize the similarity between the first 10 beers


How similar are the first ten beers to each other?

similar_beers <- similarity(beers_rrmatrix[,1:10],method = "cosine",which = "items")


Items(Beers) Similarity matrix

kable(as.matrix(similar_beers))
| | 5 | 6 | 7 | 10 | 14 | 15 | 17 | 19 | 30 | 31 |
|---|---|---|---|---|---|---|---|---|---|---|
| 5 | 0.0000000 | 0.9726805 | 0.9783900 | 0.9312075 | 0.9933630 | 0.9815302 | 0.9519402 | 0.9607689 | 0.9634705 | 0.9710609 |
| 6 | 0.9726805 | 0.0000000 | 0.9469754 | 0.9877854 | 0.9542911 | 0.9865850 | 0.9633901 | 0.9653313 | 0.9829763 | 0.9822110 |
| 7 | 0.9783900 | 0.9469754 | 0.0000000 | 0.9491052 | 0.8717101 | 0.9674071 | 0.9561002 | 0.9323216 | 0.9459190 | 0.9260756 |
| 10 | 0.9312075 | 0.9877854 | 0.9491052 | 0.0000000 | 0.9615401 | 0.9764387 | 0.9742800 | 0.9852062 | 0.9917181 | 0.9854066 |
| 14 | 0.9933630 | 0.9542911 | 0.8717101 | 0.9615401 | 0.0000000 | 0.9934646 | 0.9601002 | 0.9905018 | 0.9894903 | 0.9711294 |
| 15 | 0.9815302 | 0.9865850 | 0.9674071 | 0.9764387 | 0.9934646 | 0.0000000 | 0.9804572 | 0.9835829 | 0.9948577 | 0.9951374 |
| 17 | 0.9519402 | 0.9633901 | 0.9561002 | 0.9742800 | 0.9601002 | 0.9804572 | 0.0000000 | 0.9768746 | 0.9744582 | 0.9640672 |
| 19 | 0.9607689 | 0.9653313 | 0.9323216 | 0.9852062 | 0.9905018 | 0.9835829 | 0.9768746 | 0.0000000 | 0.9893860 | 0.9827195 |
| 30 | 0.9634705 | 0.9829763 | 0.9459190 | 0.9917181 | 0.9894903 | 0.9948577 | 0.9744582 | 0.9893860 | 0.0000000 | 0.9884008 |
| 31 | 0.9710609 | 0.9822110 | 0.9260756 | 0.9854066 | 0.9711294 | 0.9951374 | 0.9640672 | 0.9827195 | 0.9884008 | 0.0000000 |


Visualize similarity matrix

image(as.matrix(similar_beers), main = "Beer similarity")


3. What are the unique values of ratings?

beers_df %>% group_by(rating) %>% summarise(rating_frequency=n()) %>% nrow()

## [1] 9

So there are 9 distinct rating values; let's check the frequency of each.


beers_df %>% group_by(rating) %>% summarise(rating_frequency=n())

## # A tibble: 9 x 2
##   rating rating_frequency
##    <dbl>            <int>
## 1    1.0             1120
## 2    1.5             1343
## 3    2.0             4474
## 4    2.5             7012
## 5    3.0            21189
## 6    3.5            42543
## 7    4.0            88240
## 8    4.5            51086
## 9    5.0            13081

So ratings 4.0 & 4.5 are the most common, and 1.0 & 1.5 are the least common.


4. Visualize the rating values and notice:

(i) The average beer ratings

avg_beer_ratings<-beers_df %>% group_by(item) %>% summarise(average_rating=mean(rating))

ggplot(avg_beer_ratings,aes(x=average_rating)) + geom_histogram() + labs(x="Average Rating", y="Number of Beers") + scale_x_continuous(breaks=1:5)

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_beer_ratings$average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.258   3.656   3.870   3.806   4.043   4.636

So the average beer rating has mean = 3.81 & median = 3.87; the distribution is almost normal, slightly left skewed.


Also checking on original full dataset of beers

avg_beer_ratings_all<-beer %>% group_by(beer_beerid) %>% summarise(average_rating=mean(review_overall))
ggplot(avg_beer_ratings_all,aes(x=average_rating)) + geom_histogram() + labs(x="Average Rating", y="# of Beers")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_beer_ratings_all$average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   0.000   3.500   3.800   3.671   4.000   5.000

So over the full dataset the average beer rating has mean = 3.67 & median = 3.80, with an uneven distribution.


(ii) The average user ratings

avg_user_ratings<-beers_df %>% group_by(user) %>% summarise(average_rating=mean(rating))
ggplot(avg_user_ratings,aes(x=average_rating)) + geom_histogram() + labs(x="Average Rating", y="Number of Users")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_user_ratings$average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   2.833   3.750   3.901   3.875   4.023   4.457

So the average user rating has mean = 3.88 & median = 3.90; slightly left skewed with an uneven distribution.


(iii) The average number of ratings given to the beers

avg_beer_reviews<-important_beers %>% group_by(beer_beerid) %>% summarise(average_reviews=mean(total_beer_reviews))
ggplot(avg_beer_reviews,aes(x=average_reviews)) + geom_histogram() + labs(x="Number of Reviews per Beer", y="Number of Beers")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_beer_reviews$average_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    50.0    68.0    99.0   143.4   168.0   977.0

So on average each beer in the chosen subset gets ~143 reviews.


Also checking on original full dataset of beers

avg_user_ratings_all<-beer_reviews_count %>% group_by(beer_beerid) %>% summarise(average_rating=mean(total_beer_reviews))
ggplot(avg_user_ratings_all,aes(x=average_rating)) + geom_histogram() + labs(x="Number of Reviews per Beer", y="Number of Beers")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_user_ratings_all$average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    2.00   11.77    5.00  977.00

So on average each beer gets ~12 reviews.


(iv) The average number of ratings given by the users

avg_user_reviews<-important_beers %>% group_by(review_profilename) %>% summarise(average_reviews=mean(total_user_reviews))
ggplot(avg_user_reviews,aes(x=average_reviews)) + geom_histogram()

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_user_reviews$average_reviews)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    30.0    44.0    72.0   120.9   137.2  1842.0

So on average each user in the chosen subset gives ~121 reviews, but this distribution is very skewed.


Also checking on original full dataset of beers

avg_user_ratings_all<-user_reviews_count %>% group_by(review_profilename) %>% summarise(average_rating=mean(total_user_reviews))
ggplot(avg_user_ratings_all,aes(x=average_rating)) + geom_histogram() + labs(x="Number of Reviews per User", y="Number of Users")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(avg_user_ratings_all$average_rating)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    1.00    3.00   21.09   11.00 1842.00

So on average each user gives 21 reviews, but this distribution is very skewed.


Also Visualizing ratings with real rating matrix of beers

qplot(getRatings(beers_rrmatrix), binwidth = 1, main = "Histogram of ratings", xlab = "Rating")

summary(getRatings(beers_rrmatrix)) # slightly left skewed

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    3.50    4.00    3.87    4.50    5.00

qplot(getRatings(normalize(beers_rrmatrix, method = "Z-score")),main = "Histogram of normalized ratings", xlab = "Rating")

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

summary(getRatings(normalize(beers_rrmatrix, method = "Z-score"))) # seems better

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
## -6.5750 -0.5318  0.1276  0.0000  0.6984  3.2030
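
Z-score normalization centers and scales each user's ratings by that user's own mean and standard deviation, which removes individual rating bias before similarities are computed. A manual sketch for the first user (normalize() does this row-wise for the whole matrix):

u <- as(beers_rrmatrix[1, ], "matrix")[1, ]             # first user's ratings; NA where unrated
z <- (u - mean(u, na.rm = TRUE)) / sd(u, na.rm = TRUE)  # per-user Z-score
round(head(z[!is.na(z)]), 3)                            # first few normalized ratings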

qplot(rowCounts(beers_rrmatrix), binwidth = 10,
      main = "Beers rated per user", xlab = "# of beers rated", ylab = "# of users")

Most users rate only a small number of beers; very few users have rated a large number of beers.


III. Recommendation Models:

1. Divide your data into training and testing datasets, Experiment with ‘split’ and ‘cross-validation’ evaluation schemes


  1. Scheme1 with a 75/25 train/test split (no cross-validation), given = -1 (all-but-one) & goodRating = 4
scheme1 <- evaluationScheme(beers_rrmatrix, method = "split", train = .75,k = 1, given = -1, goodRating = 4)
scheme1

## Evaluation scheme using all-but-1 items
## Method: 'split' with 1 run(s).
## Training set proportion: 0.750
## Good ratings: >=4.000000
## Data set: 3176 x 2064 rating matrix of class 'realRatingMatrix' with 230088 ratings.
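
With given = -1, one rating per test user is held out (‘all-but-1’). The train/known/unknown pieces of a scheme can be inspected with getData if you want to reuse them directly (a quick sketch):

getData(scheme1, "train")    # ratings used to fit the models
getData(scheme1, "known")    # ratings shown to the recommender at prediction time
getData(scheme1, "unknown")  # held-out ratings used for scoring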


  2. Scheme2 using 5-fold cross-validation, given = -1 (all-but-one) & goodRating = 4
scheme2 <- evaluationScheme(beers_rrmatrix, method = "cross-validation",k = 5, given = -1, goodRating = 4)
scheme2

## Evaluation scheme using all-but-1 items
## Method: 'cross-validation' with 5 run(s).
## Good ratings: >=4.000000
## Data set: 3176 x 2064 rating matrix of class 'realRatingMatrix' with 230088 ratings.


2. Building IBCF and UBCF models with the hyperparameters below

algorithms <- list(
  "user-based CF" = list(name="UBCF", param=list(normalize = "Z-score",
                                                 method="Cosine",
                                                 nn=30)),
  "item-based CF" = list(name="IBCF", param=list(normalize = "Z-score")))


Evaluating algorithms & predicting next n beers

results1 <- evaluate(scheme1, algorithms, n=c(1, 3, 5, 10, 15, 20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.06sec/10.44sec]
## IBCF run fold/sample [model time/prediction time]
##   1  [21.42sec/0.74sec]

class(results1)

## [1] "evaluationResultList"
## attr(,"package")
## [1] "recommenderlab"

results2 <- evaluate(scheme2, algorithms, n=c(1, 3, 5, 10, 15, 20))

## UBCF run fold/sample [model time/prediction time]
##   1  [0.052sec/7.712sec]
##   2  [0.048sec/7.492sec]
##   3  [0.052sec/7.88sec]
##   4  [0.052sec/7.668sec]
##   5  [0.056sec/7.832sec]
## IBCF run fold/sample [model time/prediction time]
##   1  [25.072sec/0.504sec]
##   2  [24.016sec/0.476sec]
##   3  [23.268sec/0.492sec]
##   4  [37.936sec/0.488sec]
##   5  [36.488sec/0.48sec]

class(results2)

## [1] "evaluationResultList"
## attr(,"package")
## [1] "recommenderlab"

Note: Evaluating on scheme1 & scheme2 takes around 10 minutes on my system.


3. Compare the performance of the two models and suggest the one that should be deployed

Drawing ROC curve

plot(results1, annotate = 1:4, legend="topleft")

plot(results2, annotate = 1:4, legend="topleft")

So UBCF performs better than IBCF, especially at higher values of n.
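
The ROC curves compare the models on top-N hit rates. As an additional check, rating-prediction accuracy (RMSE/MAE) on scheme1's held-out ratings can be compared with calcPredictionAccuracy; a sketch using the same hyperparameters as above:

ubcf_model <- Recommender(getData(scheme1, "train"), method = "UBCF",
                          parameter = list(normalize = "Z-score", method = "Cosine", nn = 30))
ibcf_model <- Recommender(getData(scheme1, "train"), method = "IBCF",
                          parameter = list(normalize = "Z-score"))

ubcf_pred <- predict(ubcf_model, getData(scheme1, "known"), type = "ratings")
ibcf_pred <- predict(ibcf_model, getData(scheme1, "known"), type = "ratings")

rbind(UBCF = calcPredictionAccuracy(ubcf_pred, getData(scheme1, "unknown")),
      IBCF = calcPredictionAccuracy(ibcf_pred, getData(scheme1, "unknown")))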


4. Give the names of the top 5 beers that you would recommend to the users “cokes”, “genog” & “giblet”


Making Recommendations using UBCF

r <- Recommender(beers_rrmatrix, method = "UBCF")
r

## Recommender of type 'UBCF' for 'realRatingMatrix'
## learned using 3176 users.


recom_cokes <- predict(r, beers_rrmatrix['cokes'], n=5)
as(recom_cokes, "list")

## $cokes
## [1] "7971" "645"  "1346" "582"  "2041"


recom_genog <- predict(r, beers_rrmatrix['genog'], n=5)
as(recom_genog, "list")

## $genog
## [1] "57908" "1160"  "1093"  "1161"  "1445"


recom_giblet <- predict(r, beers_rrmatrix['giblet'], n=5)
as(recom_giblet, "list")

## $giblet
## [1] "19960" "4083"  "582"   "11757" "2041"