Music Box Churn Prediction and Recommendation System

https://github.com/zhou100/MusicBoxChurn

The purpose of this project is to use music box log data, including songs played, user profiles, and search activities, to predict user churn. The project builds a demo for analyzing large record-level user data and user behavior. A recommendation system based on collaborative filtering is also built and reaches an RMSE of 0.567.

Churn is defined as inactivity in a fixed 14-day window. It is predicted from each user's behavior pattern in the 30-day window preceding that 14-day window.
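
As a minimal sketch of how such labels could be constructed in PySpark (the file path, the example dates, and the column names `uid` and `date` are assumptions for illustration, not the project's actual schema):

```python
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("churn_labels").getOrCreate()

# Hypothetical schema: one row per event, with a user id and an event date.
logs = spark.read.parquet("data/event_logs.parquet")  # columns: uid, date

# Feature window: 30 days up to feature_end; label window: the following 14 days.
feature_end = "2017-04-28"  # last day of the feature window (example date)
label_end = "2017-05-12"    # feature_end + 14 days

# Users with any activity inside the 14-day label window.
active_in_label = (
    logs.filter((F.col("date") > feature_end) & (F.col("date") <= label_end))
        .select("uid").distinct()
        .withColumn("active", F.lit(1))
)

# Users seen up to the feature window get churn=1 if absent from the label window.
labels = (
    logs.filter(F.col("date") <= feature_end)
        .select("uid").distinct()
        .join(active_in_label, on="uid", how="left")
        .withColumn("churn", F.when(F.col("active").isNull(), 1).otherwise(0))
        .drop("active")
)
```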

Churn prediction is an important operations problem. Effectively identifying users who might churn, and acting early, is crucial to keeping the product's core users. To that end, features are generated from the log file, including the frequency and recency of events, total playing time, the number of songs fully played, and other playing behaviors such as the mean and standard deviation of song playing time.
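
Continuing the sketch above (reusing `logs` and `feature_end`), the aggregation could look roughly like this; the `event`, `play_time`, and `song_length` columns and the 90% "fully played" threshold are assumptions:

```python
import pyspark.sql.functions as F

# Hypothetical play events: rows where the event type marks a song play.
plays = logs.filter(F.col("event") == "P")

features = (
    plays.filter(F.col("date") <= feature_end)
         .groupBy("uid")
         .agg(
             F.count("*").alias("freq_30d"),  # event frequency in the window
             # recency: days since the user's last event in the window
             F.datediff(F.lit(feature_end), F.max("date")).alias("recency"),
             F.sum("play_time").alias("total_play_time"),
             F.mean("play_time").alias("mean_play_time"),
             F.stddev("play_time").alias("std_play_time"),
             # count a song as "fully played" if >= 90% of its length was played
             F.sum((F.col("play_time") >= 0.9 * F.col("song_length")).cast("int"))
              .alias("songs_fully_played"),
         )
)
```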

Machine learning models such as random forest, gradient boosting, and logistic regression are trained on the training data and evaluated on the test set. The final churn prediction accuracy is around 0.92 on the training set and 0.91 on the test set, roughly a 14% relative improvement over the 80% baseline.
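
A minimal scikit-learn sketch of this model comparison, assuming the engineered features and churn labels have been joined and collected into a pandas DataFrame `df` (the hyperparameters shown are illustrative defaults, not the project's tuned values):

```python
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression

# Assumes df holds the engineered features plus a binary `churn` label.
X = df.drop(columns=["uid", "churn"])
y = df["churn"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

models = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "gradient_boosting": GradientBoostingClassifier(random_state=42),
}

for name, model in models.items():
    model.fit(X_train, y_train)
    print(name,
          "train acc:", model.score(X_train, y_train),
          "test acc:", model.score(X_test, y_test))
```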

The variables that explain most of the variation are recency within the 30-day window, total playing time in the most recent two weeks, and event frequency in the most recent weeks.
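
One way such rankings can be obtained is from the tree models' built-in importance scores, e.g. continuing the sketch above:

```python
import pandas as pd

# Rank features by impurity-based importance from the fitted random forest.
rf = models["random_forest"]
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False).head(10))
```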

The result provides important insight into which measures the product operations team should keep monitoring. The accurate predictions can also suggest the right time window (for example, a two-week decrease in activity frequency) in which to intervene and influence a user, in order to retain more of the music box's current users.

The recommender system is based on collaborative filtering, a commonly used approach. The RMSE on the test data is 0.567, substantially lower than the benchmark performance of RMSE ≈ 1.2-1.5. Sample song recommendations for given user ids are displayed.
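
A minimal sketch of collaborative filtering with Spark's ALS implementation, assuming a ratings table with `uid`, `song_id`, and a `rating` derived from play behavior (the path, column names, and hyperparameters are assumptions):

```python
from pyspark.ml.evaluation import RegressionEvaluator
from pyspark.ml.recommendation import ALS

ratings = spark.read.parquet("data/ratings.parquet")  # uid, song_id, rating
train, test = ratings.randomSplit([0.8, 0.2], seed=42)

als = ALS(
    userCol="uid", itemCol="song_id", ratingCol="rating",
    rank=10, maxIter=10, regParam=0.1,
    coldStartStrategy="drop",  # drop NaN predictions for unseen users/items
)
model = als.fit(train)

# Evaluate RMSE on the held-out test set.
predictions = model.transform(test)
rmse = RegressionEvaluator(
    metricName="rmse", labelCol="rating", predictionCol="prediction"
).evaluate(predictions)
print("test RMSE:", rmse)

# Top-5 song recommendations per user.
model.recommendForAllUsers(5).show(truncate=False)
```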

The major challenge of this project is the sheer size of the data: it is hard to load the entire dataset into a pandas DataFrame on a local computer. This project builds basic tools and practices for analyzing large, user-record-level data that would be impossible to process directly on a single machine.

The processing code is written in PySpark to make use of Spark's big data technology when such infrastructure is available. In this application, user-id-level downsampling (randomly selecting 10% of users) is applied so that the data processing and modeling can be accomplished on a personal computer, while remaining scalable on a Spark cluster on Google Cloud or other cloud computing infrastructure.
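
A sketch of that user-level downsampling, reusing the hypothetical `logs` DataFrame from above. Sampling distinct user ids rather than individual rows keeps each sampled user's event history complete, which is what the behavioral features require:

```python
# Keep all events for a random 10% of users.
users = logs.select("uid").distinct()
sampled_users = users.sample(withReplacement=False, fraction=0.10, seed=42)
logs_small = logs.join(sampled_users, on="uid", how="inner")
```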