Building a Joke Recommender with Machine Learning

Why did the machine learning algorithm cross the road?


Introduction

Imagine: you are sitting in front of your laptop. Bored. You look up jokes to entertain yourself, but the only thing funny about them is how bad they are. The only solution? To take matters into your own hands by building yourself a joke recommendation system.

In this article, I’ll walk through how to create a joke recommendation system using the scikit-surprise library!

The scikit-surprise library is a Python library for building and testing recommender systems. You can read more about it in the official Surprise documentation.

Data Source

The Surprise library contains a few built-in datasets, including Dataset 2 from the archived Jester database, but here I’ll be using a subset of Dataset 1.

Dataset 1 includes 4.1 million ratings between -10 and 10 for 100 jokes from 73,421 users. The subset I used, jester dataset 1_1, covers 24,983 users who have each rated 36 or more jokes. You can find the full archived Jester database on the Jester project page.

Joke 82 from Jester Dataset 1

Setup

First things first, I pip install the Surprise library and import the Python libraries that I’ll need throughout the project:

%pip install scikit-surprise

import numpy as np
import pandas as pd
from collections import defaultdict
from surprise import Dataset, Reader, KNNWithMeans, accuracy
from surprise.model_selection import GridSearchCV
from surprise.model_selection import train_test_split
from surprise.model_selection import KFold

Then I load the data into a Pandas DataFrame.

Note: if you’re using Google Colab, I recommend uploading the dataset as a zip file and extracting it within the notebook.

df = pd.read_excel('/tmp/jester-data-1.xls', header=None)

Data Preprocessing

Let’s start by creating a column for user IDs and dropping the column that contains the number of jokes rated by each user:
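Here’s a minimal sketch of that step, assuming the raw layout of jester-data-1.xls, where column 0 holds each user’s rating count and columns 1 through 100 hold the ratings for jokes 1 through 100:

# drop the per-user rating count (column 0)
df = df.drop(columns=[0])

# add a user ID column, numbering users from 1
df.insert(0, "User ID", df.index + 1)

After dropping column 0, the remaining column labels (1 through 100) conveniently double as the joke IDs.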

Now we’ll reshape the data frame: rather than having a column for each joke, we’ll have columns for user ID, joke ID, and rating.
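A sketch of the reshaping, using pandas’ melt to go from wide format (one column per joke) to long format (one row per user-joke pair); the old column labels become the joke IDs:

# one row per (user, joke) pair
df = df.melt(id_vars="User ID", var_name="Joke ID", value_name="Rating")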

Next, I’ll remove the rows that contain null ratings, then sort the entries first by user ID and then by joke ID.

# dropping entries with a rating of 99.0 (within this dataset, 99.0
# corresponds to a null rating where the user did not rate that joke)
df = df[df["Rating"] != 99.0]

# sorting the entries
df = df.sort_values(by=['User ID', 'Joke ID'])
df = df.reset_index(drop=True)

Finally, I convert the data frame into a Surprise dataset, the data type the Surprise library requires when building a recommendation system.
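Here’s roughly what that conversion looks like: the Reader tells Surprise the rating scale (Jester ratings run from -10 to 10), and Dataset.load_from_df expects the columns in user, item, rating order:

reader = Reader(rating_scale=(-10, 10))
data = Dataset.load_from_df(df[["User ID", "Joke ID", "Rating"]], reader)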

Training the Model

Now we’re ready to train our recommendation system! We’ll use the k nearest neighbors with means algorithm, a basic collaborative filtering algorithm that takes into account the mean ratings of each user. Through collaborative filtering, items are recommended to a user based on what similar users thought of the items.

To start, I am going to determine the optimal algorithm parameters using GridSearchCV. Given a dictionary of parameters, this class tries all the possible combinations and returns the best parameters for a given accuracy measure. (A sketch of the search follows the list below.)

A similarity measure is used by many algorithms, including k nearest neighbors with means, to estimate a rating.

  • I’ve set user_based to False, which means that similarities will be computed between items instead of users.
  • For similarity measure options, I’m using mean squared difference, which computes the mean squared difference similarity between all pairs of items, and cosine, which computes the cosine similarity between all pairs of items.
  • Minimum support is the minimum number of common users for the similarity not to be zero. I’ve provided 3, 4, and 5 as options.
  • For performance measures, I’m using root mean squared error and mean absolute error.
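Here’s a sketch of that search using the Surprise parameter names ("msd" is mean squared difference); the number of cross-validation folds is an assumption:

param_grid = {
    "sim_options": {
        "name": ["msd", "cosine"],  # similarity measures to try
        "min_support": [3, 4, 5],   # minimum number of common users
        "user_based": [False],      # compute item-item similarities
    }
}

gs = GridSearchCV(KNNWithMeans, param_grid, measures=["rmse", "mae"], cv=3)
gs.fit(data)

# best RMSE score and the parameters that achieved it
print(gs.best_score["rmse"])
print(gs.best_params["rmse"])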

From our results, we see that according to root mean squared error, it is best to use cosine similarity as a similarity measure with a minimum support of 3.

Finally, I train the recommender using the best set of parameters.

Note: Surprise runs entirely on the CPU, so if you’re using Google Colab, the GPU hardware accelerator won’t speed up training; expect this step to take a while on the full dataset.

algo = gs.best_estimator['rmse']
trainset = data.build_full_trainset()
algo.fit(trainset)

Results

Now that we have our joke recommender, we can use it to generate rating predictions!

First, we’ll compare the recommender-generated ratings to the actual ratings for specific users and jokes.

  • For user 1 and joke 1, the actual rating is -7.82. Using the trained algorithm, we get a prediction of -3.43.
  • For user 24983 and joke 87, the actual rating given is 7.23. Using the trained algorithm, we get a prediction of 4.93.
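A sketch of how such point predictions can be generated with Surprise’s predict method, which takes the raw user and item IDs; passing the known rating as r_ui lets you compare it against the estimate:

# predict user 1's rating for joke 1 (r_ui is the known rating)
pred = algo.predict(uid=1, iid=1, r_ui=-7.82, verbose=True)
print(pred.est)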

From these two examples, it seems like our algorithm’s estimates point in the right direction but aren’t especially accurate numerically, so let’s rank our joke predictions by error:
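One way to build that ranking (an assumption about the exact approach, loosely following the KNN analysis example from the Surprise docs) is to predict every known rating and sort by absolute error:

# predict every rating the model was trained on
testset = trainset.build_testset()
predictions = algo.test(testset)

# each prediction is a named tuple: (uid, iid, r_ui, est, details)
pred_df = pd.DataFrame(predictions,
                       columns=["uid", "iid", "rui", "est", "details"])
pred_df["error"] = (pred_df["est"] - pred_df["rui"]).abs()

best = pred_df.sort_values("error").head(10)   # most accurate predictions
worst = pred_df.sort_values("error").tail(10)  # least accurate predictions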

Looking at the best predictions made by our algorithm, the recommender-generated ratings are very close to the actual ratings, with the best prediction made having an error of only 0.000008. On the other hand, the error peaks at 18.83 in the worst predictions made by our algorithm.

Finally, we can use our algorithm to generate recommendations for each user:
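One standard way to do this is the top-N pattern from the Surprise FAQ (which is what the defaultdict import is for): predict a rating for every user-joke pair missing from the training set, then keep each user’s five highest-rated jokes:

def get_top_n(predictions, n=5):
    # map each user ID to their n highest-estimated (joke ID, rating) pairs
    top_n = defaultdict(list)
    for uid, iid, true_r, est, _ in predictions:
        top_n[uid].append((iid, est))
    for uid, user_ratings in top_n.items():
        user_ratings.sort(key=lambda x: x[1], reverse=True)
        top_n[uid] = user_ratings[:n]
    return top_n

# predict ratings for all (user, joke) pairs absent from the training set
testset = trainset.build_anti_testset()
predictions = algo.test(testset)
top_n = get_top_n(predictions, n=5)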

These are the top 5 recommendations for each user. For many users, our joke recommender recommends joke number 89.

Joke 89 from Jester Dataset 1

References

In addition to the ones linked throughout this article, I wouldn’t have been able to complete this project without the help of these awesome examples and tutorials:

[1] Analytics Vidhya | Comprehensive Guide to build a Recommendation Engine from scratch (in Python) by Pulkit Sharma

[2] Surprise Documentation | Analysis of the KNNBasic algorithm by Nicolas Hug
