EDA 101: Explore, Discover, Analyze (Part-2)

Welcome to the second part of the 5-part blog series “EDA 101: Explore, Discover, Analyze”. The aim of this series is to provide a comprehensive guide to Exploratory Data Analysis (EDA) and its various techniques. In the first part, I introduced the concept of EDA and explained why it is an important step in the data analysis process. In this second part, we will dive deeper into the methods of EDA and explore how to use them to gain insight into the underlying structure and relationships within the data.

We will take a closer look at some of the key methods used in EDA, including univariate analysis, bivariate analysis, multivariate analysis, data visualization, outlier analysis, and missing data analysis. I will explain each method in detail and provide practical examples to help you better understand the concepts. Whether you are a beginner or an experienced data analyst, this blog series will provide valuable information to help you in your data analysis journey. So, let’s get started!

In this post, we will look at:
II. Methods of Exploratory Data Analysis (EDA)
A. Univariate Analysis
B. Bivariate Analysis
C. Multivariate Analysis
D. Data Visualization
E. Outlier Analysis
F. Missing Data Analysis

II. Methods of Exploratory Data Analysis (EDA)

A. Univariate Analysis:

Univariate analysis involves analyzing the distribution of individual variables in the data, one variable at a time. This method provides important insights into the central tendency, dispersion, and shape of the data distribution. Common measures used in univariate analysis include the mean, median, mode, range, and standard deviation. Univariate analysis is often performed using visualizations such as histograms and box plots, which help in understanding the distribution of the data and identifying outliers.

Example: a pie chart showing the proportion of the world's population by gender. (Here, the only variable we are analyzing is “Gender”.)
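To make the summary measures above concrete, here is a minimal sketch using Python's built-in statistics module. The `summarize` helper and the `scores` list are my own made-up names and data for illustration, not part of any real dataset.

```python
import statistics

def summarize(values):
    """Return common univariate summary measures for a list of numbers."""
    return {
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "range": max(values) - min(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (n - 1)
    }

# Hypothetical exam scores, made up for illustration.
scores = [2, 4, 4, 4, 5, 5, 7, 9]
summary = summarize(scores)

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Comparing the mean with the median is a quick first check on the shape of the distribution: when the two differ noticeably, the data is likely skewed or contains extreme values.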

B. Bivariate Analysis:

Bivariate analysis explores the relationship between two variables in the data. This method helps in understanding how one variable changes as the other changes. Common techniques used in bivariate analysis include scatter plots, correlation analysis, and regression analysis. Scatter plots visualize the relationship between two variables, correlation analysis measures the strength and direction of a linear relationship, and regression analysis fits a model that describes how one variable changes with the other.

Example: comparing the price and sales of each product to see which products are retaining customers. (Here, the two variables are price and number of sales.)
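As a rough sketch of correlation and simple linear regression, the snippet below computes the Pearson correlation coefficient and a least-squares line by hand. The function names and the price and sales figures are hypothetical, chosen so the numbers are easy to follow.

```python
import math

def _centered_sums(xs, ys):
    """Return the means and the centered sums of products and squares."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return mean_x, mean_y, sxy, sxx, syy

def pearson_r(xs, ys):
    """Pearson correlation: strength and direction of a linear relationship."""
    _, _, sxy, sxx, syy = _centered_sums(xs, ys)
    return sxy / math.sqrt(sxx * syy)

def fit_line(xs, ys):
    """Least-squares line y = slope * x + intercept."""
    mean_x, mean_y, sxy, sxx, _ = _centered_sums(xs, ys)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Hypothetical data: product price and units sold (made up for illustration).
prices = [1, 2, 3, 4]
units_sold = [8, 6, 4, 2]

r = pearson_r(prices, units_sold)
slope, intercept = fit_line(prices, units_sold)
print(f"r = {r:.2f}, units_sold = {slope:.2f} * price + {intercept:.2f}")
```

A value of r close to -1 or +1 suggests a strong linear relationship, while a value near 0 suggests little or no linear relationship. Keep in mind that correlation alone does not show that one variable causes the other.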

C. Multivariate Analysis:

Multivariate analysis involves analyzing relationships among three or more variables at once. This method is used to identify patterns in the data and to understand how variables interact with each other. Common techniques used in multivariate analysis include clustering and dimensionality reduction methods such as principal component analysis (PCA). Clustering is used to group similar data points, while PCA reduces the complexity of the data by transforming it into a lower-dimensional space that keeps as much of the variance as possible.
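To give a feel for what PCA does, here is a small pure-Python sketch that finds only the first principal component using power iteration on the covariance matrix. This is a teaching sketch under my own assumptions: the function name and the data are hypothetical, and in practice you would more likely reach for a library such as scikit-learn.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (direction, explained_ratio) for the first principal component.

    Assumes at least two rows and that the data is not constant.
    """
    n = len(rows)
    d = len(rows[0])

    # Center each column on its mean.
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in rows]

    # Sample covariance matrix (d x d).
    cov = [
        [sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(d)]
        for i in range(d)
    ]

    # Power iteration: repeatedly multiply by cov and renormalize.
    # Caveat: this fixed start vector fails if it happens to be exactly
    # orthogonal to the leading direction; a random start avoids that.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]

    # Variance captured along v, as a share of the total variance.
    variance_along_v = sum(
        v[i] * sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)
    )
    total_variance = sum(cov[i][i] for i in range(d))
    return v, variance_along_v / total_variance

# Hypothetical data: the second column is twice the first, the third is constant.
rows = [(1, 2, 5), (2, 4, 5), (3, 6, 5), (4, 8, 5)]
direction, explained = first_principal_component(rows)
print(direction, explained)
```

Because the second column here is an exact multiple of the first and the third never changes, a single component captures essentially all of the variance, which is the situation where reducing three variables to one loses almost nothing.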

D. Data Visualization:

Data visualization is an important aspect of EDA as it helps in presenting data in a visual format, making it easier to understand patterns and relationships. Common data visualization techniques include histograms, bar charts, line charts, and scatter plots. It is important to choose the right visualization technique based on the type of data and the relationship being analyzed.

Example: the illustration above shows how data can be visualized. When a dataset is very large, ranging from tens of thousands to millions of rows, it becomes hard to learn anything from the raw values, which is why we use data visualization. For instance, a bar chart can be used to compare the sales of different products.

E. Outlier Analysis:

Outlier analysis involves identifying and analyzing extreme values in the data. Outliers can have a significant impact on statistical results and must be dealt with carefully. Common techniques used in outlier analysis include the Z-score method and the Tukey method. The Z-score method calculates how many standard deviations a data point is from the mean and flags points beyond a chosen threshold (3 is a common choice). The Tukey method uses the interquartile range (IQR = Q3 - Q1) and defines outliers as values that fall below Q1 - 1.5 * IQR or above Q3 + 1.5 * IQR.

Example: identifying and removing extreme values in a stock price dataset.
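Both rules can be sketched in a few lines with Python's statistics module. The function names, the threshold values, and the price list below are my own assumptions for illustration. Quartiles can be computed in several ways, and here I assume the "inclusive" method of `statistics.quantiles`.

```python
import statistics

def zscore_outliers(values, threshold=3.0):
    """Flag values more than `threshold` standard deviations from the mean."""
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)  # sample standard deviation
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]

def tukey_outliers(values, k=1.5):
    """Flag values outside [Q1 - k * IQR, Q3 + k * IQR]."""
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < lower or v > upper]

# Hypothetical daily closing prices with one extreme value (made up).
prices = [10, 12, 11, 13, 12, 11, 50]

print(zscore_outliers(prices))                 # default threshold of 3
print(zscore_outliers(prices, threshold=2.0))  # looser threshold
print(tukey_outliers(prices))
```

On a sample this small, the extreme value inflates the standard deviation so much that the Z-score rule with a threshold of 3 flags nothing, while a threshold of 2 and the Tukey rule both catch the value 50. This is one reason the IQR-based rule is often preferred for small or skewed datasets.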

F. Missing Data Analysis:

Missing data analysis involves identifying and handling missing data in the dataset. Missing data can impact the accuracy of the analysis, and it is important to deal with it properly. Common techniques used in missing data analysis include imputation and exclusion. Imputation involves replacing missing data with estimated values, while exclusion involves removing observations with missing data.

Example: in the table mentioned above, a few students missed some exams, leading to empty (null) cells. Based on our general understanding, we can fill these blank cells with 0s. Handling missing values often depends on judgment and knowledge of the data as much as on formal rules.
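Here is a minimal sketch of exclusion and two simple forms of imputation, with missing values represented as `None`. The helper names and the scores are hypothetical and made up for illustration.

```python
import statistics

def drop_missing(values):
    """Exclusion: keep only the observed (non-missing) values."""
    return [v for v in values if v is not None]

def impute_constant(values, fill=0):
    """Imputation: replace each missing value with a fixed constant."""
    return [fill if v is None else v for v in values]

def impute_mean(values):
    """Imputation: replace each missing value with the mean of observed values."""
    observed_mean = statistics.mean(drop_missing(values))
    return [observed_mean if v is None else v for v in values]

# Hypothetical exam scores; None marks an exam the student missed.
scores = [80, None, 70, 90, None]

print(drop_missing(scores))     # [80, 70, 90]
print(impute_constant(scores))  # [80, 0, 70, 90, 0]
print(impute_mean(scores))      # [80, 80, 70, 90, 80]
```

Which option is appropriate depends on what a missing value means. Filling with 0 pulls the average down sharply, filling with the mean keeps the average unchanged but understates the spread, and exclusion shrinks the dataset.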

In conclusion, Exploratory Data Analysis is a crucial step in the data analysis process and provides valuable insights into your data. By understanding and utilizing the various methods of EDA, you can gain a deeper understanding of your data, identify patterns and relationships, and make more informed decisions.

I hope you are enjoying the “EDA 101: Explore, Discover, Analyze” blog series. Stay tuned for upcoming blogs where I’ll delve even deeper into the world of Exploratory Data Analysis and show you how to apply these methods to real-world data.
If you have any questions or would like to share your own experiences with Exploratory Data Analysis, feel free to reach out on Twitter @lokstwt. I’d love to hear from you! Peace ✌🏾.
