What Is Exploratory Data Analysis? | Data Preparation Guide 2024

Exploratory data analysis (EDA) examines and visualizes data to understand its main features, identify patterns, spot anomalies, and test hypotheses. This helps summarize the data and uncover insights before applying more advanced data analysis techniques.

Market analysis with exploratory data analysis

Now perform exploratory data analysis on market analysis data. You start by importing all necessary modules.

Figure 3: Importing necessary modules

Then you read the data in as a pandas dataframe.

Figure 4: Market Analysis Data

The dataset is not formatted correctly. The first two rows do not contain the actual column names, just arbitrary values.

Importing data

When importing your data, skip the first two rows to get past the malformed rows. This ensures that your column names are filled in correctly.

Figure 5: Importing market analysis data
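
Since the original screenshot is not available, the import step can be sketched roughly as follows. The in-memory CSV and its column names are hypothetical stand-ins for the survey file; only the idea of skipping the first two rows comes from the text.

```python
import io

import pandas as pd

# Hypothetical stand-in for the survey file: two junk rows, then the real header.
raw_csv = io.StringIO(
    "banking marketing,,\n"
    "customer id and age,,salary\n"
    "customerid,age,salary\n"
    "1,58,100000\n"
    "2,44,60000\n"
)

# skiprows=2 discards the two malformed rows, so the third row becomes the header.
df = pd.read_csv(raw_csv, skiprows=2)
print(df.columns.tolist())
```

With a real file you would pass its path instead of the in-memory buffer.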

The dataset is now imported correctly. The column names are in the correct order, and you have dropped the arbitrary data.

The above data was collected through a survey. It gives information about the survey respondents, such as their occupation, salary, whether they have taken a loan, age, etc. You will use exploratory data analysis to find patterns in this data and correlations between columns. You will also perform basic data cleaning steps.

Data cleaning

The next step is cleaning the data. Drop the customer ID column because it is just the row number indexed from 1. Also split the ‘jobedu’ column into two: one for the job and one for the education field. After splitting, you can drop the ‘jobedu’ column as it is no longer needed.

Figure 6: Cleaning market analysis data
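
A minimal sketch of these cleaning steps is shown below. The column name 'customerid', the comma separator inside 'jobedu', and the sample rows are assumptions made for illustration; only 'jobedu' is named in the text.

```python
import pandas as pd

# Hypothetical rows; 'customerid' and the comma separator are assumptions.
df = pd.DataFrame({
    "customerid": [1, 2, 3],
    "jobedu": ["management,tertiary", "technician,secondary", "retired,primary"],
})

# Drop the ID column, which only repeats the row number.
df = df.drop(columns="customerid")

# Split 'jobedu' into a job column and an education column.
df[["job", "education"]] = df["jobedu"].str.split(",", n=1, expand=True)

# The combined column is no longer needed once it has been split.
df = df.drop(columns="jobedu")
print(df)
```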

This is what the dataset looks like now.

Figure 7: Market Analysis Data

Missing values

The data has some missing values in its columns. There are three main categories of missing values:

MCAR (Missing Completely at Random): These values are missing purely at random and do not depend on any other values.

MAR (Missing at Random): Whether these values are missing depends on other features in the data.

MNAR (Missing Not at Random): There is a specific reason why these values are missing.

Let’s look at the columns that have missing values.

Figure 8: Missing values
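
One common way to count missing values per column in pandas is sketched here. The small frame is made-up data, not the survey itself.

```python
import numpy as np
import pandas as pd

# Made-up rows with a few gaps, standing in for the survey data.
df = pd.DataFrame({
    "age": [58, np.nan, 33, 47],
    "month": ["may", "may", None, "jun"],
    "response": ["no", "yes", None, "no"],
})

# isnull() marks each missing cell; sum() counts them column by column.
missing_counts = df.isnull().sum()
print(missing_counts)
```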

You cannot recover the missing age values, so drop all rows without an age value.

Figure 9: Missing age values
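
Dropping the rows that lack an age could look like the sketch below, assuming the column is called 'age'. The sample values are invented.

```python
import numpy as np
import pandas as pd

# Invented rows; the second one has no age.
df = pd.DataFrame({
    "age": [58, np.nan, 33],
    "salary": [100000, 60000, 120000],
})

# Keep only the rows where 'age' is present, then renumber the index.
df = df.dropna(subset=["age"]).reset_index(drop=True)
print(len(df))
```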

In the month column, you can fill in the missing values with the most common month. Take the mode of the month column to get the most common value, then use the fillna function to put it in place of the missing values.

Figure 10: Fill in missing monthly values
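
A small sketch of the mode-based fill, assuming a 'month' column of strings. The values are illustrative only.

```python
import pandas as pd

# Illustrative month values with one gap.
df = pd.DataFrame({"month": ["may", "jun", "may", None, "jul"]})

# mode() ignores missing values and returns a Series; take its first entry.
month_mode = df["month"].mode()[0]

# Replace each missing month with the most common one.
df["month"] = df["month"].fillna(month_mode)
print(df["month"].tolist())
```

Note that mode() can return several values when there is a tie; taking the first entry is a simple convention, not the only reasonable choice.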

Check to see the number of missing values in your data.

Figure 11: Missing values

Now only the response column has missing values. You cannot fill these in: if the respondent did not give a response, you cannot auto-generate one, so you drop these rows.

Figure 12: Drop Missing response values
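
As a rough sketch, the rows with no response can be dropped and the result re-checked in one go. The column name 'response' follows the figure caption; the rows are made up.

```python
import pandas as pd

# Made-up rows; the second one has no response.
df = pd.DataFrame({
    "age": [58, 44, 33],
    "response": ["no", None, "yes"],
})

# Remove the rows whose response is missing.
df = df.dropna(subset=["response"])

# Total number of missing cells left across all columns.
remaining = int(df.isnull().sum().sum())
print(remaining)
```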

Finally, the data is clean. You can now start to find the outliers.

Handling of outliers

There are two types of outliers in data:

Univariate outliers: These are data points whose values lie outside the expected range when only a single variable is considered.

Multivariate outliers: These outliers depend on the relationship between two variables. A value may lie within the expected range when its variable is plotted alone, yet lie far from the expected value when that variable is plotted against another.
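
The text does not name a rule for deciding what counts as "outside the expected range". As one illustration, the sketch below uses the widely used 1.5 × IQR convention for a single variable. Both the rule and the sample balances are assumptions, not the article's method.

```python
import pandas as pd

# Invented balances; the last value is far from the rest.
balance = pd.Series([10, 12, 11, 13, 12, 100])

# Interquartile range: the spread of the middle half of the data.
q1 = balance.quantile(0.25)
q3 = balance.quantile(0.75)
iqr = q3 - q1

# Flag values more than 1.5 * IQR below Q1 or above Q3 (a convention, not a law).
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = balance[(balance < lower) | (balance > upper)]
print(outliers.tolist())
```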

Univariate analysis

Now consider the different jobs you have data on. Plotting the job column as a bar graph, sorted in ascending order of the number of people in each job, shows the most common jobs among the respondents. Normalize the counts so that they are expressed as proportions and are comparable.

Figure 13: Draw the number of people performing a certain job
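
The normalized, sorted counts behind such a bar graph might be computed as in this sketch. The job labels are invented, and the plotting call is left as a comment because it needs matplotlib installed.

```python
import pandas as pd

# Invented job labels for six respondents.
df = pd.DataFrame({
    "job": ["management", "technician", "management",
            "retired", "management", "technician"],
})

# normalize=True turns counts into proportions; sort ascending for the plot.
job_share = df["job"].value_counts(normalize=True).sort_values()
print(job_share)

# With matplotlib available, this would draw the horizontal bar chart:
# job_share.plot.barh()
```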

Continuing, draw a pie chart to compare the educational qualifications of the people in the survey. Almost half of the people have only secondary school education, and one fourth have a tertiary education.

Figure 14: The plot of the educational qualification of people

Bivariate analysis

Bivariate analysis is of three main types:

1. Numerical-Numerical Analysis

When both of the variables being compared hold numerical data, the analysis is called numerical-numerical analysis. You can use scatter plots, pair plots, and correlation matrices to compare two numeric columns.

Scatter plot

A scatter plot represents each data point in the graph. It shows how the data in one column varies with the corresponding data points in another column. For example, draw a scatter plot of individuals’ salaries against their bank balances, and another of balance against age.

Figure 15: Draw a scatter diagram of Salary vs. Balance

Looking at the above plot, you can say that regardless of individual salary, the average bank balance ranges from 0 to 25,000. The majority of people have a bank balance below 40k.

Figure 16: Draw a scatter plot of Balance vs Age

From the above graph you can deduce that the average balance is around 25,000 regardless of age. Taken together, the two plots suggest that the average balance is about the same regardless of age and salary.

Pair Plot

Pair plots are used to compare several variables simultaneously. They plot a scatter diagram of all input variables against each other, which helps save space and allows us to compare multiple variables at once. Let’s draw the pair plot for salary, balance and age.

Figure 17: Plotting a pair plot

The figures below show the pair plots for salary, balance and age. Each variable is plotted against the other on both the x- and y-axis.

Figure 18: Pair diagrams of salary, balance and age

Correlation Matrix

A correlation matrix shows the correlation between different variables. The correlation coefficient measures how strongly two variables are related. The table below shows the correlation between salary, age and balance. It helps you judge how a change in one variable is associated with a change in the other.

Figure 19: Correlation matrix between salary, balance and age
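
A correlation matrix for these three columns can be computed with the pandas corr method, as sketched below. The numbers are invented, so the coefficients they produce say nothing about the real survey.

```python
import pandas as pd

# Invented values for five respondents.
df = pd.DataFrame({
    "salary": [20000, 50000, 60000, 100000, 120000],
    "balance": [500, 1500, 1200, 4000, 3500],
    "age": [25, 41, 33, 58, 46],
})

# corr() returns pairwise Pearson coefficients by default.
corr = df[["salary", "balance", "age"]].corr()
print(corr)
```

The matrix is symmetric with ones on the diagonal, since each variable correlates perfectly with itself.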

The above matrix tells you that balance and salary have a relatively high correlation coefficient, while age and salary have a lower one.

2. Numerical-Categorical Analysis

When one variable is of numeric type, and another is a categorical variable, you perform numeric-categorical analysis.

You can use the groupby function to arrange the data into groups. Rows that have the same value in a particular column are grouped together. This way you can see how a numeric column behaves within each category, for example by taking the average of each group.

Figure 20: Grouping of response in relation to salary

The above values tell you the average salary of the people who answered yes or no in the response column.

You can also find the median salary of the people who responded yes and no in the survey.

Figure 21: Median salary grouped by response
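
The grouped mean and median might be computed as in this sketch, assuming columns named 'response' and 'salary'. The salaries are invented.

```python
import pandas as pd

# Invented salaries for five respondents.
df = pd.DataFrame({
    "response": ["no", "no", "no", "yes", "yes"],
    "salary": [20000, 60000, 70000, 50000, 100000],
})

# Group rows by their response, then summarise salary within each group.
mean_salary = df.groupby("response")["salary"].mean()
median_salary = df.groupby("response")["salary"].median()
print(mean_salary)
print(median_salary)
```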

You can also plot the box plot of response vs salary. A boxplot will show you the range of values that fall under a certain category.

Figure 22: Boxplot of response in relation to salary

The above plot tells you that the salary range of people who said no to the survey is between 20k and 70k with a median salary of 60k, while the salary range of people who answered yes is between 50k and 100k, also with a median salary of 60k.

3. Categorical-Categorical Analysis

When both variables contain categorical data, you perform categorical-categorical analysis. First, convert the categorical response column into a numeric column with 1 corresponding to a positive response and 0 corresponding to a negative response.

Figure 23: Changing from categorical to numerical values
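
One simple way to encode the response column is sketched here. The new column name 'response_flag' is a hypothetical choice, and the sketch assumes the responses are stored as the strings "yes" and "no".

```python
import pandas as pd

# Invented responses.
df = pd.DataFrame({"response": ["yes", "no", "no", "yes"]})

# True/False for "is this a yes?", converted to 1/0.
df["response_flag"] = (df["response"] == "yes").astype(int)
print(df["response_flag"].tolist())
```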

Now compare the marital status of people with the response rate. The figure below shows, for each marital status, the proportion of people who answered yes to the survey.

Figure 24: Response rate by marital status

Also compare loan status with the response rate.

Figure 25: Response rate by loan status
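
Once the response is encoded as 1 and 0, its mean within a group is the share of yes answers in that group. The sketch below applies this to marital status and loan status. The column names 'marital', 'loan' and 'response_flag' are assumptions, and the rows are invented, so the rates shown are not survey results.

```python
import pandas as pd

# Invented rows; response_flag is 1 for yes and 0 for no.
df = pd.DataFrame({
    "marital": ["married", "single", "married", "single", "divorced", "married"],
    "loan": ["yes", "no", "no", "no", "yes", "yes"],
    "response_flag": [0, 1, 1, 1, 0, 0],
})

# The mean of a 0/1 column is the proportion of ones in each group.
rate_by_marital = df.groupby("marital")["response_flag"].mean()
rate_by_loan = df.groupby("loan")["response_flag"].mean()
print(rate_by_marital)
print(rate_by_loan)
```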

You can conclude that people who have taken out a loan are more likely to respond no to the survey.
