At last, the most exciting part of the EDA 101 series: data visualization. If you haven't checked out the previous parts of this blog series, I'd recommend reading them first and then coming back here. If you already know enough data analysis and want to do some data visualization but don't know where to start, then this blog is for you. Through this blog I'll try to answer questions like:
How does data visualization help represent data efficiently?
Which is the best library to get started with data visualization?
How do I choose which visualization would be best for my data?
Things we are going to cover in this part of our 5-part blog series are:
Why Data Visualization is important ???🧐
What is Data Visualization in the first place? Data Visualization is the process of presenting data in a graphical or visual format. It is an essential tool for analyzing, interpreting, and communicating data effectively. Here are some reasons why data visualization is important:
Easy to understand: Visualizing data makes it easy to understand and interpret complex information quickly. You can stare at a tabular dataset for hours and still not get many insights, while a simple visualization immediately shows how the variables are related to each other.
Identify patterns and trends: Data visualization allows us to identify patterns and trends that might not be apparent in raw data. You'll often spot them the moment you see the graph or plot, and you can make your decision accordingly.
Make informed decisions: Visualizations help us make informed decisions based on data-driven insights.
Better communication: Visualizations help us communicate complex data to non-experts and stakeholders more effectively. Showing them the cleaned dataset alone and explaining it row by row will overwhelm them, and you'll end up losing that pitch, and we do not want that.
Improve productivity: By using visualizations, we can save time and improve productivity by quickly identifying key insights and deciding what to do next.
Types of Data Visualizations !! 📈
There are many different types of data visualizations, each with its strengths and weaknesses. Some of the most common types of data visualization used include:
Line charts: Used to show trends in data over time.
Bar charts: Used to compare categories or groups of data.
Scatter plots: Used to show the relationship between two variables.
Pie charts: Used to show the relative proportions of different categories.
Heatmaps: Used to show patterns or trends in large datasets.
Box plots: Used to show the distribution of data and identify outliers.
Histograms: Used to show the distribution of data and identify gaps or clusters.
Bubble charts: Used to show three dimensions of data (x, y, and size).
Don't worry about what each plot looks like; we'll look into that soon! This is just to give you an idea of the types we mostly use and the purpose behind each plot.
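Bubble charts are the one type in this list that does not get its own example later, so here is a minimal sketch. It assumes that a Seaborn scatter plot with the size parameter is an acceptable way to draw a bubble chart; the cities data and its column names are made up purely for illustration.

```python
import pandas as pd
import seaborn as sns

# Small made-up dataset (illustrative values, not real measurements)
cities = pd.DataFrame({
    "avg_income": [32, 45, 51, 60, 72],
    "avg_rent": [9, 14, 15, 21, 27],
    "population": [120, 450, 300, 900, 650],
})

# x and y place each bubble; size= maps a third variable to the marker size,
# and sizes= sets the smallest and largest marker area
ax = sns.scatterplot(
    data=cities,
    x="avg_income",
    y="avg_rent",
    size="population",
    sizes=(50, 500),
    legend=False,
)
ax.set(xlabel="Average income", ylabel="Average rent")
```

In a script, import matplotlib.pyplot as plt and call plt.show() afterwards to display the figure.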
Choosing the right visualization for your data🤔
Choosing the right visualization for your data is very important to effectively communicate insights and trends. Here are some questions you should ask yourself when choosing a visualization:
What is the purpose of the visualization? Are you trying to compare data, show trends, or identify outliers? For example, a scatter plot is useful for showing the relationship between two variables, while a box plot is useful for identifying outliers in a dataset.
What type of data do you have? Is it continuous or categorical? Is it one-dimensional or multi-dimensional? For example, if you have categorical data, a bar chart or pie chart might be more appropriate, while if you have continuous data, a line chart or histogram might be more useful.
What story do you want to tell with the data? What insights do you want to communicate to your audience? For example, if you want to highlight the distribution of data, a box plot or histogram might be more appropriate, while if you want to show the relationship between two variables, a scatter plot might be more useful.
I'll give a deeper intuition on how to choose a visualization in another blog soon, so stay tuned for that. Till then, these are the basic questions and charts you need to keep in mind while presenting your data.
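The "what type of data do you have?" question can be roughly turned into code. The sketch below is only a rule of thumb, not a definitive method: suggest_chart is a hypothetical helper written for this post, and the orders data is made up. It looks at whether each column is numeric and suggests a starting chart from the ones listed above.

```python
import pandas as pd
from pandas.api.types import is_numeric_dtype


def suggest_chart(df, x, y=None):
    """Hypothetical helper: suggest a starting chart from the column types."""
    x_is_numeric = is_numeric_dtype(df[x])
    if y is None:
        # One variable: show its distribution or its category counts
        return "histogram" if x_is_numeric else "bar chart (counts)"
    y_is_numeric = is_numeric_dtype(df[y])
    if x_is_numeric and y_is_numeric:
        return "scatter plot"
    if x_is_numeric != y_is_numeric:
        # One categorical and one numeric variable
        return "box plot"
    return "heatmap of counts"


# Made-up data for illustration
orders = pd.DataFrame({
    "day": ["Thu", "Fri", "Sat", "Sun"],
    "total_bill": [10.5, 20.0, 35.2, 18.4],
    "tip": [1.0, 3.0, 5.0, 2.5],
})

print(suggest_chart(orders, "total_bill"))          # histogram
print(suggest_chart(orders, "total_bill", "tip"))   # scatter plot
print(suggest_chart(orders, "day", "total_bill"))   # box plot
```

Treat the result as a first guess; the purpose and the story you want to tell still decide the final chart.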
What's the best python package to start with !!! 🤗
I've been asked this question by many beginners, so here is my answer. The most widely used libraries are Matplotlib, Seaborn and Plotly. They were released one after another, and each newer one improved on some things, but there is a catch: each library does some things better than the others. For example, Plotly can produce striking, interactive plots, but it doesn't offer the flexibility that Seaborn and Matplotlib have. So you'll eventually need to get familiar with all of them and pick one based on the situation. My suggestion would be to start with Seaborn, which is flexible, widely used, and produces really attractive plots.
Using Seaborn 📊
Seaborn is a library in Python that can help you create informative and attractive visualizations. It is built on top of another popular visualization library called Matplotlib. Seaborn provides a higher-level interface to Matplotlib, which makes it easier to create common types of plots, such as scatter plots, line plots, and histograms. By using Seaborn, you can quickly create high-quality visualizations that can help you gain insights from your data.
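Because Seaborn sits on top of Matplotlib, its plotting functions draw onto ordinary Matplotlib Axes objects, so you can keep using Matplotlib methods to fine-tune the result. Here is a minimal sketch of that idea; the study data is made up for illustration.

```python
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Made-up data for illustration
study = pd.DataFrame({
    "hours": [1, 2, 3, 4, 5],
    "score": [52, 58, 65, 70, 78],
})

# Create the figure and axes with Matplotlib...
fig, ax = plt.subplots(figsize=(6, 4))

# ...let Seaborn draw onto those axes...
sns.scatterplot(data=study, x="hours", y="score", ax=ax)

# ...and customize the same axes with plain Matplotlib calls
ax.set_title("Score vs. hours studied")
ax.set_xlabel("Hours studied")
ax.grid(True)
```

Call plt.show() at the end of a script to display the figure.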
Installing and Importing Seaborn
Like any other Python library, Seaborn can be installed into your local Python environment using pip:
pip install seaborn
Once the installation is complete, you can import the library into your Python script with:
import seaborn as sns
Basic plots with Seaborn
Line plot
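A line plot is a graph that displays information as a series of data points connected by straight line segments. It is commonly used to represent time-series data or data that changes over time. The example below uses the tips dataset and plots the tip against the total bill; where several rows share the same x value, Seaborn draws their mean with a shaded confidence band by default. Look at the example below: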
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
tips = sns.load_dataset("tips")
# plot
sns.lineplot(x="total_bill", y="tip", data=tips)
plt.show()
Scatter Plot
A scatter plot is a graph that displays information as a collection of points on a two-dimensional coordinate system. It is commonly used to represent the relationship between two variables. Look at the example below:
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
iris = sns.load_dataset("iris")
# plot
sns.scatterplot(x="sepal_length", y="petal_length", data=iris)
plt.show()
Bar Plot
A bar plot is a graph that displays the relationship between a categorical variable and a numerical variable, with one bar per category. It is commonly used to compare a value across categories; with counts on the value axis, it also shows the frequency of each category. Look at the example below:
import seaborn as sns
import matplotlib.pyplot as plt
# Creating sample data
job_titles = ['Manager', 'Engineer', 'Analyst', 'Developer']
avg_salary = [100000, 80000, 75000, 70000]
# plot
ax = sns.barplot(x=job_titles, y=avg_salary)
ax.set(xlabel='Job Titles', ylabel='Average Salary')
plt.show()
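When you want the bars to show how often each category occurs, you don't have to count the rows yourself. A minimal sketch, assuming Seaborn's countplot is a good fit for this; the staff data is made up for illustration.

```python
import pandas as pd
import seaborn as sns

# Made-up data: one row per employee
staff = pd.DataFrame({
    "job_title": ["Engineer", "Analyst", "Engineer", "Manager", "Engineer", "Analyst"],
})

# countplot counts the rows in each category and draws one bar per category
ax = sns.countplot(data=staff, x="job_title")
ax.set(xlabel="Job title", ylabel="Number of employees")
```

Here the tallest bar belongs to Engineer, which appears three times in the data.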
Histogram
A histogram is a graph that displays the distribution of a set of continuous data. It is commonly used to represent the frequency distribution of data. Look at the example below:
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
tips = sns.load_dataset("tips")
# plot
sns.histplot(data=tips, x="total_bill", kde=True)
plt.show()
Box Plot
A box plot is a graph that displays the distribution of a set of continuous data through their quartiles. It is commonly used to represent the distribution of data and identify outliers. Look at the example below:
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
tips = sns.load_dataset("tips")
# plot
sns.boxplot(data=tips, x="day", y="total_bill")
plt.show()
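To see what the box and the outlier points actually represent, the numbers behind a box plot can be computed by hand. This is a sketch using the common 1.5 × IQR rule, which is also the default whisker setting in Seaborn's boxplot; the bills array is made up for illustration.

```python
import numpy as np

# Made-up bill amounts, with one unusually large value
bills = np.array([8, 10, 12, 13, 14, 15, 16, 18, 20, 48])

# The box spans the first to third quartile, with a line at the median
q1, median, q3 = np.percentile(bills, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = bills[(bills < lower_fence) | (bills > upper_fence)]

print(q1, median, q3)   # 12.25 14.5 17.5
print(outliers)         # [48]
```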
Violin Plot
A violin plot is a type of data visualization that combines the features of a box plot and a kernel density plot. It is used to display the distribution of a continuous variable across different categories or groups. Look at the example below:
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
titanic = sns.load_dataset('titanic')
# plot
sns.violinplot(x='sex', y='age', data=titanic)
plt.show()
Heatmap
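A heatmap is a graphical representation of data where the individual values contained in a matrix are represented as colours. It is commonly used to represent the correlation between variables. The example below pivots the flights dataset into a month-by-year matrix of passenger counts and colours each cell by its value. Look at the example below: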
import seaborn as sns
import matplotlib.pyplot as plt
# load dataset
flights = sns.load_dataset("flights")
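# create a month x year matrix of passenger counts
# (keyword arguments are required by pivot in pandas 2.0 and later)
flights = flights.pivot(index="month", columns="year", values="passengers")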
# plot
sns.heatmap(flights, cmap="YlGnBu")
plt.show()
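Since heatmaps are so often used for correlations, here is a minimal sketch of that use case as well. The data is randomly generated for illustration (the column names and the relationships between them are assumptions made up for this example), and the correlation matrix comes from the pandas corr method.

```python
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: weight is built to depend on height, noise is unrelated
rng = np.random.default_rng(0)
height = rng.normal(170, 10, 200)
people = pd.DataFrame({
    "height": height,
    "weight": 0.9 * height + rng.normal(0, 5, 200),
    "noise": rng.normal(0, 1, 200),
})

# Pairwise correlations between the numeric columns
corr = people.corr()

# annot=True writes each value into its cell; vmin/vmax fix the colour scale to [-1, 1]
ax = sns.heatmap(corr, annot=True, fmt=".2f", cmap="YlGnBu", vmin=-1, vmax=1)
```

The height and weight cell should come out strongly positive, while the noise row stays close to zero.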
Conclusion
So, that's it for this blog. In this blog we answered the most frequently asked questions about data visualization, compared different Python libraries, and saw how easily data visualization can be done using Seaborn. In the upcoming blog, I'll go deeper into Seaborn, where we'll look at how to make our graphs much more informative and elegant.
I hope you are enjoying the “EDA 101: Explore, Discover, Analyze” blog series. Stay tuned for upcoming blogs where I’ll delve even deeper into the world of Exploratory Data Analysis and show you how to apply these methods to real-world data.
If you have any questions or would like to share your own experiences with Exploratory Data Analysis, feel free to reach out on Twitter @lokstwt. I’d love to hear from you and you can support me by buying me a coffee! Peace ✌🏾.