Wednesday, May 20, 2020

Exploratory Data Analysis – A Key Step in Machine Learning


This post emphasizes the role of Exploratory Data Analysis in solving business problems with Machine Learning and Artificial Intelligence, using a detailed case study walkthrough.

A 360° data mindset : In this information-driven age, we need a 360° view of the extraordinary volume of data now available – historic, current and predictive – so that the right data can be extracted to make better business decisions.

Exploratory Data Analysis (EDA) is an observational approach to understanding the characteristics of the data. EDA is essential for a well-defined and structured data science project, and it should be performed before any machine learning modelling phase. It helps in identifying patterns and developing hypotheses.

Case Study : A medium-sized bikes & cycling accessories manufacturing consultancy is keen on growing its business. We’ll help them analyze their customer and transaction data to optimize their marketing strategy.

Preliminary Data Exploration – Identify ways to improve the quality of data

Environment and Code Readiness 
  • Create a Jupyter Notebook hosted on Azure
  • Import the pandas package to read and write Excel data
  • Import matplotlib & seaborn for data visualization
  • Upload the Customer data into the Azure Notebook path
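The setup steps above can be sketched roughly as follows. This is a minimal sketch, not the notebook used for the case study: the helper names `load_sheet` and `first_look` are hypothetical, the column names in the stand-in frame are assumptions, and the workbook path and sheet name would be whatever was uploaded to the notebook.

```python
import pandas as pd


def load_sheet(path, sheet_name):
    """Read one worksheet of an Excel workbook into a DataFrame."""
    return pd.read_excel(path, sheet_name=sheet_name)


def first_look(df):
    """Return one row per column: dtype, non-null count, distinct values."""
    return pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "non_null": df.notna().sum(),
        "distinct": df.nunique(),
    })


# Tiny stand-in frame (hypothetical columns) so the sketch runs without the workbook.
sample = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "gender": ["F", "Male", None],
})
summary = first_look(sample)
print(summary)
```

With the real file in place, `first_look(load_sheet(path, sheet_name))` would give the same overview for each sheet.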


Let’s organize the analysis below by data quality dimension and summarize it in a table


Identify Missing Values
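One way to count missing values per column is sketched below. The helper `missing_report` and the sample column names are assumptions for illustration; the real customer sheet would be passed in instead of the stand-in frame.

```python
import pandas as pd


def missing_report(df):
    """List columns that contain missing values, with counts and percentages."""
    counts = df.isna().sum()
    report = pd.DataFrame({
        "missing": counts,
        "percent": (counts / len(df) * 100).round(1),
    })
    # Keep only columns with at least one gap, worst first.
    return report[report["missing"] > 0].sort_values("missing", ascending=False)


# Stand-in data with hypothetical column names.
sample = pd.DataFrame({
    "last_name": ["Lee", None, "Ng", "Roy"],
    "job_title": [None, None, "Analyst", "Nurse"],
    "tenure": [5, 3, 8, 1],
})
report = missing_report(sample)
print(report)
```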


A column can be dropped if it has no relevance to the analysis
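Dropping an irrelevant column could look like the sketch below. The column name `default` is only an assumed example of a column with no analytical value; which columns to drop depends on the actual data.

```python
import pandas as pd


def drop_irrelevant(df, columns):
    """Return a copy of df without the listed columns (absent names are skipped)."""
    return df.drop(columns=columns, errors="ignore")


# Stand-in data; "default" is a hypothetical column holding unusable values.
sample = pd.DataFrame({
    "customer_id": [1, 2],
    "default": ["x#1", "??"],
    "gender": ["F", "M"],
})
trimmed = drop_irrelevant(sample, ["default"])
print(trimmed.columns.tolist())
```

Returning a new frame rather than dropping in place keeps the raw data available for the quality summary.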


Gender data has to be consistent: each value should be either Male or Female
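A minimal sketch of making the gender values consistent is shown below. The spelling variants in `GENDER_MAP` are assumptions; in practice they would come from inspecting `value_counts()` on the real column. Values that cannot be mapped are left missing so they can be reviewed rather than guessed.

```python
import pandas as pd

# Assumed spelling variants; extend after inspecting the real column.
GENDER_MAP = {
    "f": "Female", "femal": "Female", "female": "Female",
    "m": "Male", "male": "Male",
}


def normalize_gender(series):
    """Map known spelling variants to Male/Female; unknown values become NaN."""
    cleaned = series.astype(str).str.strip().str.lower()
    return cleaned.map(GENDER_MAP)


# Stand-in values; "U" represents a hypothetical unknown entry.
sample = pd.Series(["F", "Femal", "Male", "M", "U", " female "])
normalized = normalize_gender(sample)
print(normalized.value_counts(dropna=False))
```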


Check the validity of the transactions data: the product first sold date is stored as a float and has to be converted into date-time format
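The conversion might be done as in the sketch below. It assumes the floats are Excel serial day numbers, which is a common reason a date column arrives as a float but is not confirmed by the data here, and it uses `product_first_sold_date` as an assumed column name.

```python
import pandas as pd

# Excel's day-zero for serial dates (valid for dates after February 1900).
EXCEL_EPOCH = "1899-12-30"


def excel_serial_to_datetime(series):
    """Convert Excel serial day numbers (floats) to pandas datetimes."""
    return pd.to_datetime(series, unit="D", origin=EXCEL_EPOCH)


# Stand-in transactions data with a hypothetical column name.
sample = pd.DataFrame({"product_first_sold_date": [43831.0, 41245.0, None]})
sample["product_first_sold_date"] = excel_serial_to_datetime(
    sample["product_first_sold_date"]
)
print(sample)
```

If the floats turn out to be something else, such as Unix timestamps, the `unit` and `origin` arguments would need to change accordingly.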


Repeat the same checks for the other data sets

Here is the Data Quality Analysis Summary

Data Exploration, Model Development and Interpretation : Understanding the data distributions, feature engineering, data transformations, modelling, results interpretation and reporting.
Customer Age & Gender Distribution : Female customers outnumber male customers; the recommendation is to target new customers between 30 and 60 years old.
Calculate the age of each customer from the date of birth before plotting the graph.
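The age calculation can be sketched as below. The column names `DOB` and `gender`, the helper `age_in_years`, the reference date (the date of this post) and the age band edges are all assumptions for illustration. The resulting table is what a bar chart of age band by gender would be drawn from.

```python
import pandas as pd


def age_in_years(dob, as_of):
    """Completed years between each date of birth and the reference date."""
    dob = pd.to_datetime(dob)
    # True where the birthday has already occurred in the reference year.
    had_birthday = (dob.dt.month < as_of.month) | (
        (dob.dt.month == as_of.month) & (dob.dt.day <= as_of.day)
    )
    return as_of.year - dob.dt.year - (~had_birthday).astype(int)


as_of = pd.Timestamp("2020-05-20")  # assumed reference date

# Stand-in customer data with hypothetical column names.
sample = pd.DataFrame({
    "DOB": ["1990-05-21", "1990-05-20", "1955-01-01", "2000-12-31"],
    "gender": ["Female", "Male", "Female", "Female"],
})
sample["age"] = age_in_years(sample["DOB"], as_of)

# Left-closed bands: [0, 30), [30, 60), [60, 120).
sample["age_band"] = pd.cut(
    sample["age"],
    bins=[0, 30, 60, 120],
    right=False,
    labels=["under 30", "30 to 59", "60 and over"],
)
distribution = pd.crosstab(sample["age_band"], sample["gender"])
print(distribution)
```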

Mass Customers are the largest group in the Wealth Segment
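A count per category, as used for the wealth segment chart, could be produced along these lines. The helper `category_share`, the column name `wealth_segment` and the segment labels other than Mass Customer are assumptions. The same helper would work for any other categorical column.

```python
import pandas as pd


def category_share(df, column):
    """Count rows per category and express each count as a percentage."""
    counts = df[column].value_counts()
    return pd.DataFrame({
        "customers": counts,
        "percent": (counts / counts.sum() * 100).round(1),
    })


# Stand-in data with a hypothetical column name and labels.
sample = pd.DataFrame({
    "wealth_segment": [
        "Mass Customer", "Mass Customer", "Affluent Customer",
        "High Net Worth", "Mass Customer",
    ]
})
shares = category_share(sample, "wealth_segment")
print(shares)
```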


New customers come mainly from the Manufacturing and Finance industries


Customer cars owned data


Visualizations & Interactive Dashboard : These help us highlight key findings and convey ideas more succinctly. The dashboards below were built in Power BI Desktop. A walk-through of building dashboards in Power BI is out of scope for this blog.




Conclusion : Exploratory Data Analysis is a key process in Machine Learning / Data Science projects. The main pillars of EDA are data cleaning, data preparation, data exploration, and data visualization.