A pay TV provider wants to understand why it has been losing customers lately. The goal is to use the available customer data to uncover patterns that could explain the cancellation trend. The question then arises: can we employ data science to predict this churn behavior and determine the attributes that make a customer more likely to terminate their service?
To tackle the problem of customer churn, we first used our access to anonymized customer data to assemble a broad, inclusive dataset. This dataset covered a range of key customer touchpoints, giving us a better understanding of the customers.
We began with customer demographic information, including gender, age, and geographical data such as zip code. Next, we examined account-specific information: payment method (credit card, automatic bank transfer, digital wallet, etc.) and contract specifics, such as the age of the contract. This information helped us understand the financial and contractual aspects of the relationship between the customer and the service provider.
We also incorporated details about the package the customer subscribed to. This involved assessing the tier of the package (e.g., basic, premium), the presence of any add-on services (such as pay-per-view channels), whether the customer had recently upgraded or downgraded their package, and the number of calls the customer made to the support center. These signals helped us gauge customer engagement and satisfaction with the provided services.
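To make the feature set concrete, here is a minimal sketch of how such touchpoints might be assembled and encoded in pandas. The column names and sample values are illustrative assumptions, not the provider's actual schema.

```python
import pandas as pd

# Hypothetical sample of the kinds of features described above;
# the column names and values are illustrative, not real data.
customers = pd.DataFrame({
    "gender": ["F", "M", "F"],
    "age": [34, 52, 45],
    "zip_code": ["10001", "94107", "60614"],
    "payment_method": ["credit_card", "bank_transfer", "digital_wallet"],
    "contract_months": [12, 36, 6],
    "package_tier": ["basic", "premium", "basic"],
    "has_ppv_addon": [False, True, False],
    "recent_downgrade": [False, False, True],
    "support_calls_90d": [1, 0, 5],
    "churned": [0, 0, 1],  # target label
})

# One-hot encode the categorical touchpoints so they can feed a model.
features = pd.get_dummies(
    customers.drop(columns=["churned"]),
    columns=["gender", "payment_method", "package_tier"],
)
print(features.shape)
```

One-hot encoding keeps the categorical fields (payment method, package tier) usable by tree-based models without imposing an artificial ordering on them.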
By creating this comprehensive dataset, we were in a position to construct a robust predictive model. This model was designed to identify customers at high risk of churning and highlight the key factors contributing to this risk. Such insights can guide targeted intervention strategies, helping to retain valuable customers and improve overall service quality.
Our project's methodology entailed a combination of several tools and languages to efficiently process, analyze, and model the data.
First, we used SQL to perform the initial data extraction, then transitioned to Python for the remainder of the pipeline: data cleaning, exploratory data analysis (EDA), and feature engineering. During these phases, we resolved inconsistencies and errors in the data, identified outliers, and generated insights about the distribution of and relationships among the various features.
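The cleaning and outlier-detection steps above can be sketched as follows. The DataFrame and its columns are stand-in assumptions for illustration, not the actual pipeline.

```python
import pandas as pd

# Toy frame standing in for the extracted data; values are made up.
df = pd.DataFrame({
    "monthly_charge": [45.0, 45.0, None, 900.0, 52.5],
    "contract_months": [12, 12, 24, 6, 36],
})

# Drop exact duplicate rows introduced by the extraction step.
df = df.drop_duplicates()

# Fill missing numeric values with the column median.
df["monthly_charge"] = df["monthly_charge"].fillna(df["monthly_charge"].median())

# Flag outliers with a simple interquartile-range (IQR) rule.
q1, q3 = df["monthly_charge"].quantile([0.25, 0.75])
iqr = q3 - q1
outliers = df[(df["monthly_charge"] < q1 - 1.5 * iqr) |
              (df["monthly_charge"] > q3 + 1.5 * iqr)]
print(len(outliers))  # the 900.0 charge is flagged
```

Whether to drop, cap, or keep flagged outliers depends on the feature; a very high charge may be a data error or a genuinely premium account.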
We explored several algorithms using the Scikit-learn library, ultimately settling on a Random Forest Classifier, both to classify customers and to identify which features contribute most to the model's predictions.
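A minimal sketch of this modeling step, using synthetic stand-in data since the real customer features are not public:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the churn dataset; in the real project the
# columns would be tenure, support calls, package tier, and so on.
X, y = make_classification(n_samples=500, n_features=6,
                           n_informative=3, random_state=42)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

clf = RandomForestClassifier(n_estimators=200, random_state=42)
clf.fit(X_train, y_train)
print(f"test accuracy: {clf.score(X_test, y_test):.2f}")

# Rank features by the model's impurity-based importances.
ranked = sorted(enumerate(clf.feature_importances_),
                key=lambda t: t[1], reverse=True)
for idx, importance in ranked[:3]:
    print(f"feature {idx}: {importance:.3f}")
```

The `feature_importances_` attribute is what surfaces the churn drivers; for correlated features, permutation importance is a more robust alternative.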
If you'd like to learn more about my projects or work together, feel free to reach out! You can also connect with me on LinkedIn.