If you haven’t read the first two blogs yet, make sure to check out part 1 and part 2 first.
In part 2, we discussed the different types of data analysis and the methods of EDA, and explored how to use those techniques to gain insight into the underlying structure and relationships within the data. In this part, we’ll look into two important topics from the table of contents laid out in part 1 of the series:
III. Understanding the Dataset (how we can use pandas to get a better understanding of the dataset)
IV. Data Cleaning and Preprocessing (how to use pandas to handle all the questions that come to mind while preparing the data)
Understanding the Dataset
Let's start by installing pandas:
pip install pandas  # run this in a terminal to install pandas
Now that the pandas library is installed, you can import it into your Python file or Jupyter notebook with:
import pandas as pd  # 'pd' is the conventional alias for pandas, saving a little typing
After importing the library, the next step is to load the dataset using pandas methods, which provide powerful tools for data manipulation and analysis. If the data is stored in a CSV file, it can be loaded into a pandas DataFrame using the read_csv() function:
import pandas as pd
# Load the data
# if CSV file
df = pd.read_csv('data.csv')
# if Excel file; the sheet_name parameter is optional when the workbook has only one sheet
df = pd.read_excel('data.xlsx', sheet_name='Sheet1')
# if JSON file
df = pd.read_json('data.json')
# if the data lives in a SQL database
import sqlite3
conn = sqlite3.connect('database.db')
df = pd.read_sql('SELECT * FROM table_name', conn)  # 'table_name' is a placeholder
Datasets come in many other formats as well, such as .xlsx, .json, and .db, denoting Excel, JavaScript Object Notation, and SQL formats respectively. As shown in the code block above, pandas provides functions for importing these too, so use them accordingly.
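Real-world files are rarely perfectly tidy, so it also helps to know a few common read_csv() parameters. Here's a minimal sketch; the file name and the column names are hypothetical placeholders:
df = pd.read_csv(
    'data.csv',
    sep=',',                    # column delimiter; use ';' or '\t' for other files
    na_values=['NA', '?'],      # extra strings to treat as missing values
    parse_dates=['date'],       # parse this column as datetime
    usecols=['date', 'sales'],  # load only the columns you need
)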
Now that the dataset is imported, and before we start analyzing it, we should take a look at the data and try to understand it. The following methods give us a quick overview:
# Display the first 5 rows of the dataset
df.head()
# Display the last 5 rows of the dataset
df.tail()
# Display the dimensions of the dataset
df.shape
# Display the summary of the dataset
df.info()
# Display the descriptive statistics of the dataset
df.describe()
# Display the correlation between numerical variables
# (numeric_only=True skips non-numeric columns in recent pandas versions)
df.corr(numeric_only=True)
# Display the frequency distribution of a categorical column
df['column_name'].value_counts()
# Display the number of unique values for each variable
df.nunique()
# Display the number of missing values for each variable
df.isnull().sum()
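To see a few of these in action, here's a minimal sketch on a toy dataset; the cities and sales figures below are made up purely for illustration:
import pandas as pd
# a tiny made-up dataset, purely for illustration
df = pd.DataFrame({
    'city': ['Delhi', 'Mumbai', 'Delhi', None],
    'sales': [250, 300, 250, 410],
})
print(df.shape)                   # (4, 2)
print(df['city'].value_counts())  # Delhi appears twice, Mumbai once
print(df.nunique())               # city: 2, sales: 3
print(df.isnull().sum())          # city has 1 missing value, sales has 0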
You can learn more about each method on my GitHub @ Enhance-EDA, where I’ve explained all of them in detail with the help of a real-world example. Make sure to check that out, and don't forget to star the repo :-)
Data Cleaning and Preprocessing
Now that we know how to understand our dataset better, it's time to sculpt a clean dataset out of the existing one so we can draw better conclusions from it. Depending on the situation, getting to that clean dataset might mean removing duplicates, deriving a few new variables, changing the data types of variables, filling empty cells or dropping those records entirely, and identifying and correcting errors in the dataset. The following methods help with data cleaning and preprocessing:
# Remove duplicates
df.drop_duplicates(inplace=True)
# Replace missing values with the column mean (assigning the result avoids
# the chained-assignment pitfalls of inplace=True in recent pandas versions)
df['var'] = df['var'].fillna(df['var'].mean())
# Drop rows with missing values (pass axis=1 to drop columns instead);
# this returns a new DataFrame, so assign the result
df = df.dropna()
# Renaming columns (returns a new DataFrame, so assign the result)
df = df.rename(columns={'old_name': 'new_name'})
# Dropping columns
df = df.drop('column_name', axis=1)
# Changing data types (e.g. 'int64', 'float64', 'category')
df = df.astype({'column_name': 'new_data_type'})
# Removing unwanted characters (regex=False treats the pattern as a literal string)
df['column_name'] = df['column_name'].str.replace('unwanted_character', '', regex=False)
# Removing outliers
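# lower_limit and upper_limit are placeholders; one common way to pick them
# (an assumption here, not prescribed by pandas) is the 1.5*IQR rule:
q1 = df['column_name'].quantile(0.25)
q3 = df['column_name'].quantile(0.75)
iqr = q3 - q1
lower_limit = q1 - 1.5 * iqr
upper_limit = q3 + 1.5 * iqr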
df = df[(df['column_name'] >= lower_limit) & (df['column_name'] <= upper_limit)]
# Handling inconsistent data
df['column_name'] = df['column_name'].replace({'inconsistent_value': 'consistent_value'})
# Standardizing data (z-score: subtract the mean, divide by the standard deviation)
df['column_name'] = (df['column_name'] - df['column_name'].mean()) / df['column_name'].std()
# Encoding categorical variables
pd.get_dummies(df['column_name'])
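# get_dummies returns a new DataFrame of 0/1 columns; a common pattern
# (one option, not the only one) is to join it back onto the original:
dummies = pd.get_dummies(df['column_name'], prefix='column_name')
df = pd.concat([df.drop('column_name', axis=1), dummies], axis=1)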
# Combine two dataframes based on a common column
merged_df = pd.merge(df1, df2, on='id')
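# note: merge performs an inner join by default, keeping only the rows
# whose 'id' appears in both df1 and df2 (pass how='left' etc. to change this)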
# Create a summary table that aggregates values based on categories
pivot_table = df.pivot_table(index='category', values='value', aggfunc='mean')
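# e.g. if category is ['a', 'a', 'b'] and value is [1, 3, 10], the pivot
# table above holds the group means: a -> 2.0, b -> 10.0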
# Apply a function to a dataframe (apply passes each column to the function;
# doubling a Series doubles every element in it)
def double(x):
    return x * 2
doubled_df = df.apply(double)
# Randomly sample rows from a dataframe
sampled_df = df.sample(n=100)
# Sort a dataframe based on a column or multiple columns
sorted_df = df.sort_values(by='column_name')
# Set a column as the index of a dataframe (assign the result to keep it)
df = df.set_index('column_name')
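To tie several of these steps together, here's a short end-to-end sketch; every column name and value below is made up purely for illustration:
import pandas as pd
# made-up raw data with a duplicate row, messy text, and missing values
raw = pd.DataFrame({
    'id': [1, 2, 2, 3],
    'city': ['delhi', 'Delhi ', 'Delhi ', 'Mumbai'],
    'sales': [250.0, None, None, 410.0],
})
clean = (
    raw.drop_duplicates()  # remove the repeated row for id 2
       .assign(
           city=lambda d: d['city'].str.strip().str.title(),      # tidy the text
           sales=lambda d: d['sales'].fillna(d['sales'].mean()),  # impute the mean
       )
       .set_index('id')    # use id as the index
)
print(clean)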
These are the most commonly used methods for cleaning and preprocessing a dataset. It is important to at least know about them, so that when you get stuck in some situation in the future, you'll have an idea of what can be done in that situation with pandas. Once you know these methods, there is the whole internet to dive deeper into each one. Or you can check out the one-stop destination for getting started with EDA:
→ https://github.com/lokeshwarlakhi/Enhance-EDA
In conclusion, EDA is an essential step in data analysis that helps in better decision-making by identifying patterns, relationships, and anomalies in the data. You can start by using these methods on Kaggle datasets such as sales data, the Titanic dataset, and many more; analyze the data and note your observations for now. In upcoming blogs we’ll look into how we can visualize those observations from the clean dataset. So, stay tuned! ;-)
I hope you are enjoying the “EDA 101: Explore, Discover, Analyze” blog series. Stay tuned for upcoming blogs where I’ll delve even deeper into the world of Exploratory Data Analysis and show you how to apply these methods to real-world data.
If you have any questions or would like to share your own experiences with Exploratory Data Analysis, feel free to reach out on Twitter @lokstwt. I’d love to hear from you and you can support me by buying me a coffee! Peace ✌🏾.