A pay TV provider has seen customer losses rise over the past few months but doesn't know exactly why. They want to use their customers' data to see whether any patterns explain why customers are leaving.
Can we predict this behavior and identify which characteristics make someone more likely to cancel their service?
A dataset was built from anonymized customer data, containing client information (gender, age, zip code, ...), account information (payment method, contract age, ...), package information (package tier, pay-per-view add-ons, ...) and some engineered features (number of calls to help center, periods in debt, ...). The target column indicates whether a client left that month.
Data was split into training and test sets. A few models were tested before choosing a Random Forest classifier to identify churning clients. Because the classes are imbalanced, an oversampling technique was also applied to the training set in some tests. A grid search was performed to select the top-performing hyperparameters based on test accuracy, precision, recall and F1-score.
With the random forest, we also identified the top contributing factors for client churn using its feature importances.
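As a rough sketch of how such a ranking can be read off a fitted forest, the example below uses Scikit-learn's `feature_importances_` attribute. The data is synthetic and the column names are hypothetical, loosely modeled on the feature groups described earlier; they are not the project's real columns or results.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Hypothetical column names, for illustration only.
feature_names = [
    "contract_age",
    "help_center_calls",
    "periods_in_debt",
    "package_tier",
    "ppv_addons",
    "age",
]

# Synthetic stand-in data with one column per hypothetical feature.
X, y = make_classification(
    n_samples=1000,
    n_features=len(feature_names),
    n_informative=4,
    n_redundant=0,
    weights=[0.9, 0.1],
    random_state=0,
)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)

# Impurity-based importances, normalized by Scikit-learn to sum to 1.
importances = pd.Series(
    forest.feature_importances_, index=feature_names
).sort_values(ascending=False)

print(importances.head(3))
```

Impurity-based importances tend to favor features with many distinct values, so a cross-check with `sklearn.inspection.permutation_importance` on held-out data can be worthwhile.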
SQL was used at first for queries, and Python was used to explore and clean the data, engineer features and train the model. Modeling was done using Scikit-learn, with Imblearn for SMOTE.
If you'd like to learn more about my projects or work together, feel free to reach out! You can also connect with me on LinkedIn.