Worldreader Query Data Project

A Capstone Project for the Data Science and Big Data course at Universitat de Barcelona.



Query Data Project

Created by Cary Lewis, Aina Pascual, Patricia Araguz and Enrique Rodríguez
UB Data Science and Big Data Capstone Project

Thank you to Worldreader for granting us access to its data and for supporting our Capstone Project.

  1. Project Background
  2. Project Scope and Methodology
  3. Project Work
  4. Findings and Results
  5. Conclusions
  6. Possible Next Steps
  7. References





Project Background


Worldreader is a non-profit organization working to reduce illiteracy through its reading applications and sponsorship programs. The organization has a collection of over 40,000 books in more than 40 languages with the mission “to unlock the potential of millions of people through the use of digital books in places where access to reading material is very limited.”

The project team approached Worldreader because its large collection generates significant amounts of data, particularly through the Worldreader Open Library application for mobile phones.

Worldreader was interested in the proposal to examine its data and in particular wanted to focus on the queries from its mobile application. The project's aim was to examine the search query data, with Worldreader providing a dataset of search queries and the corresponding fields from its application. All data provided would be anonymized and would not include personal data in any form.



Project Scope and Methodology


The project scope was to analyse the queries users make on the feature-phone application, using clustering techniques to identify similar searches. The result would give Worldreader a better grasp of what its users are interested in reading, along with algorithms the organization could use to improve its search queries and results in the future.

The majority of the queries were short text searches, often just a few words, which required us to supplement the data. In order to use topic modeling techniques such as Latent Dirichlet Allocation (LDA) and Non-negative Matrix Factorization (NMF), we ran the queries through the Google Books API to pull book descriptions.

The models we used, LDA and NMF, are unsupervised techniques for topic discovery in large document collections. These algorithms discover the different topics represented in a set of documents and how much of each topic is present in a given document (or in the corpus).

Each algorithm has a different mathematical underpinning:

- LDA is a probabilistic generative model: it assumes each document is a mixture of topics drawn from Dirichlet priors and infers those topic mixtures from the observed words.
- NMF is a linear-algebra technique: it factorizes the document-term matrix into two non-negative matrices, minimizing the reconstruction error between their product and the original matrix.

Both algorithms take as input a bag-of-words matrix (i.e., each document represented as a row, with each column containing the count of a word in the corpus) and produce two smaller matrices:

- a document-to-topic matrix, giving the weight of each topic in each document; and
- a topic-to-word matrix, giving the weight of each word in each topic.

The topics derived by both models are output by assigning a numeric label to each topic and printing out its top words.

Neither NMF nor LDA can automatically determine the number of topics; it must be specified in advance.
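As a minimal sketch of how both models are typically run with scikit-learn (the toy corpus, variable names, and parameter values below are illustrative, not the exact ones from our notebook):

```python
from sklearn.decomposition import NMF, LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

documents = [                      # stand-ins for the fetched book descriptions
    "the new testament and the gospel of john",
    "the old testament and other bible stories",
    "learn english grammar for beginners",
    "an english vocabulary practice book",
    "short stories about love and romance",
    "a romance novel about first love",
]

# Bag-of-words matrix: one row per document, one column per corpus word
count_vec = CountVectorizer(stop_words="english")
counts = count_vec.fit_transform(documents)

# NMF is commonly fit on TF-IDF weights rather than raw counts
tfidf_vec = TfidfVectorizer(stop_words="english")
tfidf = tfidf_vec.fit_transform(documents)

k = 3  # number of topics must be chosen up front (the project settled on 15)
lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(counts)
nmf = NMF(n_components=k, random_state=0).fit(tfidf)

def print_topics(model, feature_names, n_words=5):
    """Assign a numeric label to each topic and print its top words."""
    for label, weights in enumerate(model.components_):
        top = [feature_names[i] for i in weights.argsort()[::-1][:n_words]]
        print(f"Topic {label}: {' '.join(top)}")

print_topics(lda, count_vec.get_feature_names_out())
print_topics(nmf, tfidf_vec.get_feature_names_out())
```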



Project Work



For an in-depth review of the project work and code, see the Capstone's Jupyter notebook.

Flow Chart of Data in Capstone Project

Worldreader provided our team with 6 CSV files consisting of over 3,000,000 queries and related information.

customer, country, url, query, created_at
157260,"KE","/Search/Results?Query=New+Testament&Language=","New Testament","2016-12-27 15:48:16.893"
157261,"PH","/Search/Results?Query=circles","circles","2016-11-12 18:14:11.933"
157261,"PH","/Search/Results?Query=japanese","japanese","2016-11-18 17:15:54.19"

Reviewing the query fields with Worldreader, we were told that the customer ID was not stable: sessions would break, so the ID did not give a clear picture of an individual user's queries.

We discussed with Worldreader whether it would be beneficial to focus on certain countries, and determined that it was not necessary.

Data Cleaning and Language Detection

After loading the data, we cleaned it and detected the language of each query.

Language Selection and Error Correction

We decided to work with the English queries, as this would let us use the models more effectively. Queries in all other languages were removed from our data set.

After selecting the English queries, we validated them and corrected misspellings within words.
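
As one way to implement this step (langdetect and TextBlob here are illustrative choices, not necessarily the libraries used in the notebook):

```python
from langdetect import DetectorFactory, detect
from textblob import TextBlob

DetectorFactory.seed = 0  # make language detection deterministic

def is_english(text: str) -> bool:
    """Detection can throw on very short or symbol-only queries."""
    try:
        return detect(text) == "en"
    except Exception:
        return False

# `queries` is the combined DataFrame from the loading sketch above
english = queries[queries["query"].astype(str).map(is_english)].copy()

# Correct misspellings, e.g. "new testement" -> "new testament"
english["query"] = english["query"].map(lambda q: str(TextBlob(q).correct()))
```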

Sampling and Descriptive Stats

We selected a random sample of 20,000 queries, as time and API limitations prevented us from running all of them. A sample of 20,000 queries gave us a 99% confidence level and was representative of all our query data. We could then use this random sample of corrected queries for comparison against the total query data.
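
As a rough check on that sample size, assuming the standard formula for estimating a proportion at a 99% confidence level with a margin of error of about 1% (the margin is our assumption here):

```python
z = 2.576           # z-score for a 99% confidence level
p = 0.5             # most conservative proportion assumption
e = 0.01            # assumed margin of error of +/- 1%
N = 3_000_000       # approximate number of queries in the full data set

n0 = z**2 * p * (1 - p) / e**2     # required sample size, infinite population
n = n0 / (1 + (n0 - 1) / N)        # finite-population correction
print(round(n0), round(n))         # ~16589 and ~16498: 20,000 covers both

# Drawing the sample itself (english: the filtered DataFrame from above):
# sample = english.sample(n=20_000, random_state=42)
```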


We generated a word cloud of the most-used terms and identified the 10 most frequent bigrams.



Figures: 10 Most Frequent Bigrams from Sample Data; 10 Most Frequent Bigrams from Total Query Data
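
These bigram counts can be reproduced with a vectorizer restricted to two-word sequences (a sketch; the notebook's exact tokenization may differ):

```python
from sklearn.feature_extraction.text import CountVectorizer

def top_bigrams(texts, n=10):
    """Return the n most frequent two-word sequences in the queries."""
    vec = CountVectorizer(ngram_range=(2, 2))
    counts = vec.fit_transform(texts)
    totals = counts.sum(axis=0).A1          # total count of each bigram
    top = totals.argsort()[::-1][:n]
    return [(vec.get_feature_names_out()[i], int(totals[i])) for i in top]

print(top_bigrams(["new testament", "new testament stories", "english grammar"]))
```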

Classification and Topic Modeling

We supplemented the sample of 20,000 queries with book descriptions retrieved from the Google Books API.
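
A sketch of the description lookup against the public volumes endpoint (error handling and rate limiting are omitted; the real run was constrained by time and API quotas):

```python
import requests

def book_description(query: str) -> str | None:
    """Fetch the description of the top Google Books match for a query."""
    resp = requests.get(
        "https://www.googleapis.com/books/v1/volumes",
        params={"q": query, "maxResults": 1},
        timeout=10,
    )
    resp.raise_for_status()
    items = resp.json().get("items", [])
    if not items:
        return None
    return items[0]["volumeInfo"].get("description")

print(book_description("new testament"))
```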

Defining Number of Topics

Since both LDA and NMF need the number of topics k as input, we used the perplexity measure and an iterative approach to define the most suitable number of topics for our data.

Low perplexity indicates that the probability distribution of the model is good at predicting the sample.

| K  | Perplexity    |
|----|---------------|
| 5  | 249.75001877  |
| 10 | 254.351936899 |
| 15 | 250.76157719  |
| 20 | 251.955603856 |
| 25 | 254.939880397 |

After reviewing the results, we determined that 15 was the best number of topics.
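
The iteration can be sketched as a loop over candidate values of k, scoring each fitted model on held-out data (using scikit-learn's perplexity; `counts` is the bag-of-words matrix from the earlier topic-model sketch, and the split is illustrative):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.model_selection import train_test_split

# `counts` is the bag-of-words matrix from the earlier topic-model sketch
train, held_out = train_test_split(counts, test_size=0.2, random_state=0)

for k in (5, 10, 15, 20, 25):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(train)
    # Lower perplexity means the model predicts unseen documents better
    print(k, lda.perplexity(held_out))
```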

We then prepared the files to generate the LDA and NMF models on the descriptions obtained for the sample via the Google Books API.

Findings and Results







The "complete train" topics for both LDA and NMF were more accurate than the "best match" topics, as the topics were more distinguishable from one another. In the best match for both models, some words appeared in more than one topic.

We compared the complete-train LDA and NMF classifications against the test results. The following figure shows the overlap between the two classifications. Based on these results, we determined the NMF complete train to be the more accurate model for our data at this time.

We then grouped the NMF complete-train topics into similar categories and assigned a Google Books category to each group. The Google categories did not reflect the NMF complete-train grouping when looking at the number of searches by title and category in the sample.

Conclusions



Possible Next Steps


Classification of queries based on user information (if provided)

References