Sparkify
Project Post
1. Project Definition
This project uses Spark to analyze user behavior data from the music app Sparkify. The code is at https://github.com/lijianhua040704/Sparkify
1.1 Project Overview
Sparkify is a music streaming app, and this mini dataset contains a Sparkify user behavior log. The JSON file records enough user actions; their behavior can be inferred from the pages they view and other operations. In the data, about 20% of users have churned, and they can be identified through the Cancellation Confirmation page.
1.2 Problem Statement
We use the mini dataset to tackle this problem. Through data exploration and analysis, we checked the user count, the numbers of male and female users, the numbers of paid and free users, and the page actions of churned versus non-churned users, and then computed the contrast in viewing frequency between the two groups. We chose the pages with the largest contrast as input features for our model, so that we can build a more accurate model for predicting which users are prone to churn.
1.3 Metrics
We choose three models: LogisticRegression, DecisionTreeClassifier, and GBTClassifier. Because churn is our label and the churn rate is only about 20%, the classes are imbalanced and accuracy alone is not a good metric, so we choose the F1 score as our metric.
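For reference, F1 is the harmonic mean of precision and recall, F1 = 2 · precision · recall / (precision + recall), so it rewards a model only when both quantities are reasonably high.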
2. Analysis
2.1 Data Loading and Exploration
The original dataset is 12GB, which is too large for preliminary analysis, so I used the small (128MB) dataset mini_sparkify_event_data.json to perform the data exploration.
Use df.printSchema() to understand the contents of this dataset.
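A minimal sketch of this loading step, assuming the mini JSON file sits in the working directory (the app name is my own choice):

from pyspark.sql import SparkSession

# Create (or reuse) a Spark session and load the mini event log.
spark = SparkSession.builder.appName("Sparkify").getOrCreate()
df = spark.read.json("mini_sparkify_event_data.json")

df.printSchema()

This prints the schema below: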
|-- artist: string (nullable = true)
|-- auth: string (nullable = true)
|-- firstName: string (nullable = true)
|-- gender: string (nullable = true)
|-- itemInSession: long (nullable = true)
|-- lastName: string (nullable = true)
|-- length: double (nullable = true)
|-- level: string (nullable = true)
|-- location: string (nullable = true)
|-- method: string (nullable = true)
|-- page: string (nullable = true)
|-- registration: long (nullable = true)
|-- sessionId: long (nullable = true)
|-- song: string (nullable = true)
|-- status: long (nullable = true)
|-- ts: long (nullable = true)
|-- userAgent: string (nullable = true)
|-- userId: string (nullable = true)
First, I sorted out the meaning of all the columns:
1. artist: singer of the track
2. auth: login status
3. firstName: user's first name
4. gender: user's gender (M/F)
5. itemInSession: sequence number of the action within a session
6. lastName: user's last name
7. length: length of the song in seconds
8. level: paid or free
9. location: user's location
10. method: HTTP method (PUT/GET)
11. page: the page the user visited
12. registration: timestamp of registration
13. sessionId: session identifier
14. song: song title
15. status: HTTP status (200/404, etc.)
16. ts: timestamp of the log event
17. userAgent: browser/device information
18. userId: user identifier
2.2 Data Preprocessing
I first dropped rows with null userId or sessionId, then checked for empty strings in the userId and sessionId columns and removed those records as well.
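A minimal sketch of this cleaning step, assuming the log has been loaded into df as above:

from pyspark.sql import functions as F

# Drop rows with null userId or sessionId.
df = df.dropna(how="any", subset=["userId", "sessionId"])

# userId is a string column, so it can also hold empty strings;
# sessionId is a long, so the null check above already covers it.
df = df.filter(F.col("userId") != "")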
2.3 Define Churn
In this project we use the Cancellation Confirmation page to judge whether a user has churned. For every user who reaches that page, we set the churn flag to True on all of that user's records.
+--------------------+
| page|
+--------------------+
| Cancel|
| Submit Downgrade|
| Thumbs Down|
| Home|
| Downgrade|
| Roll Advert|
| Logout|
| Save Settings|
|Cancellation Conf...|
| About|
| Settings|
| Add to Playlist|
| Add Friend|
| NextSong|
| Thumbs Up|
| Help|
| Upgrade|
| Error|
| Submit Upgrade|
+--------------------+
I then added the churn column to the dataset to mark churned users, so I can observe how the behavior of churned users differs from that of non-churned users.
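A sketch of this labeling step (column names are my own): flag the Cancellation Confirmation events, then spread the flag across each user's records with a window over userId.

from pyspark.sql import functions as F
from pyspark.sql import Window

# 1 for the cancellation event itself, 0 otherwise.
df = df.withColumn(
    "churn_event",
    F.when(F.col("page") == "Cancellation Confirmation", 1).otherwise(0))

# Propagate: if any of a user's events is a cancellation, mark them all.
user_window = Window.partitionBy("userId")
df = df.withColumn("churn", F.max("churn_event").over(user_window))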
The male-to-female ratio differs slightly between churned and non-churned users. Looking at the paid status of churned users, paid users account for a large proportion.
From the frequency ratios for churned vs. non-churned users below, we can see that About, Error, Submit Downgrade, Thumbs Up, Add Friend, Add to Playlist, and NextSong differ substantially between the two groups. These pages have frequency ratios above 1.5, so in theory they could all be chosen as input features; however, About and Error occur far less often than the others, so we do not choose those two. Submit Downgrade is also not considered in this project.
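One way this contrast might be computed, as a sketch (assuming df carries the churn flag from above; variable names are my own):

from pyspark.sql import functions as F

# Share of each group's events that land on each page.
totals = df.groupBy("churn").agg(F.count("*").alias("total"))
freq = (df.groupBy("churn", "page").count()
          .join(totals, "churn")
          .withColumn("freq", F.col("count") / F.col("total")))

# Ratio of churned frequency to non-churned frequency, per page.
ratio = (freq.groupBy("page")
             .pivot("churn", [0, 1])
             .agg(F.first("freq"))
             .withColumn("ratio", F.col("1") / F.col("0"))
             .orderBy(F.desc("ratio")))
ratio.show()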
3. Feature Engineering
We have already chosen four page-based items as input features: Thumbs Up, Add Friend, Add to Playlist, and NextSong (songs_quantity); gender is also included based on the analysis. We still need a few more features to train the model: how long a user has been registered may be a factor in churn, and total listening time is also included, because a user may hardly listen to songs even after being registered for a long time.
The following items are the final features for training (a sketch of how they might be built follows the list). Because these features are aggregated per userId, and we know from data exploration that there are 225 users, our final dataset contains 225 rows; the churn column is the label.
# Feature 1: songs_quantity
# Feature 2: regday
# Feature 3: gender
# Feature 4: thumbup
# Feature 5: addfriend
# Feature 6: add to playlist
# Feature 7: totallistentime
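A rough sketch of building these per-user aggregates and assembling them into a model input; the column names mirror the list above but are my own guesses, not necessarily those in the repository:

from pyspark.sql import functions as F
from pyspark.ml.feature import VectorAssembler

features_df = df.groupBy("userId").agg(
    F.sum((F.col("page") == "NextSong").cast("int")).alias("songs_quantity"),
    # ts and registration are in milliseconds; 86400000 ms per day.
    ((F.max("ts") - F.first("registration", ignorenulls=True))
        / F.lit(86400000.0)).alias("regday"),
    F.first((F.col("gender") == "M").cast("int"), ignorenulls=True).alias("gender"),
    F.sum((F.col("page") == "Thumbs Up").cast("int")).alias("thumbup"),
    F.sum((F.col("page") == "Add Friend").cast("int")).alias("addfriend"),
    F.sum((F.col("page") == "Add to Playlist").cast("int")).alias("addtoplaylist"),
    F.sum("length").alias("totallistentime"),
    F.max("churn").alias("label"),
).na.fill(0)

assembler = VectorAssembler(
    inputCols=["songs_quantity", "regday", "gender", "thumbup",
               "addfriend", "addtoplaylist", "totallistentime"],
    outputCol="features")
data = assembler.transform(features_df).select("features", "label")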
4. Modeling
The final dataset is split into training and validation sets, with 80% used for training. I use three types of model: LogisticRegression, DecisionTreeClassifier, and GBTClassifier. For each model, we try several parameter combinations to search for the best model.
For LogisticRegression, I put elasticNetParam and regParam in the paramGrid; for DecisionTreeClassifier, impurity and maxDepth; for GBTClassifier, maxIter and maxDepth. All three models use MulticlassClassificationEvaluator (with the F1 metric) as the evaluator, and numFolds is 3.
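As one example of this setup, a sketch of the LogisticRegression search; the grid values and the data variable from the previous sketch are my assumptions:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

train, validation = data.randomSplit([0.8, 0.2], seed=42)

lr = LogisticRegression(featuresCol="features", labelCol="label")
grid = (ParamGridBuilder()
        .addGrid(lr.elasticNetParam, [0.0, 0.5])
        .addGrid(lr.regParam, [0.01, 0.1])
        .build())
evaluator = MulticlassClassificationEvaluator(metricName="f1")

cv = CrossValidator(estimator=lr, estimatorParamMaps=grid,
                    evaluator=evaluator, numFolds=3)
cv_model = cv.fit(train)
print("validation F1:", evaluator.evaluate(cv_model.transform(validation)))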
For LogisticRegression and GBTClassifier, it takes more time to find the best model; DecisionTreeClassifier is faster.
5. Results
For this mini dataset, the three models' results differ only slightly; DecisionTreeClassifier and GBTClassifier look somewhat better than LogisticRegression. From the best model we can see that regday and addtoplaylist (features 2 and 6 in the list above) carry large weights, so they are the important features: we should take measures aimed at new users, and recommend songs users like so that they add them to playlists. If we ran the whole dataset and searched more parameter combinations, we could get more accurate results.