## Exploratory Data Analysis (EDA)

• Develop an understanding of your data.
• EDA is a process in which we ask questions by exploring the data
• It's fundamentally a creative process

## Definitions

• Variable: A quantity, quality, or property that you can measure
• Value: The state of a variable when you measure it
• Observation: A set of measurements: several values, each associated with a different variable
• Tabular Data: A set of values, each associated with a variable and an observation. Tabular data is tidy if:
  • Each value is placed in its own cell
  • Each variable is in its own column
  • Each observation is in its own row
• Variation:
  • The tendency of the values of a variable to change from measurement to measurement
  • Each measurement includes a small amount of error that varies from measurement to measurement
  • The best way to understand the pattern of variation is to visualize the distribution of the variable's values
• Types of Variables:
  • Categorical: Takes one of a small set of values (factors or character vectors)
  • Continuous: Can take any of an infinite set of ordered values (numbers or date-times)
• Other values:
  • Unusual Values:
    • Outliers are observations that are unusual (data points that don't seem to fit the pattern)
    • Two options, the same as for missing values:
      • Drop the entire row containing the unusual value
      • Replace the unusual value using imputation (mean, median, KNN)
  • Missing Values: Two options
    • Drop the entire row containing the missing value
    • Replace the missing value using imputation (mean, median, KNN)
• Covariation:
  • Variation describes the behaviour within a variable, whereas covariation is the tendency of the values of two or more variables to vary together in a related way
  • The best way to spot covariation is to visualize the relationship between two or more variables
  • How to visualize it depends on the types of variables involved:
    • Categorical vs Continuous (frequency polygon, density plot, or box plot)
    • Categorical vs Categorical
    • Continuous vs Continuous (scatter plot)
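
The definitions above can be sketched in R. This is a minimal illustration, not a definitive recipe. It assumes the `diamonds` dataset that ships with ggplot2; the cutoffs used to flag unusual `y` values (below 3 or above 20), the choice of median imputation, and all object names are illustrative assumptions. The notes do not name a plot for the categorical vs categorical case, so `geom_count()` is used here as one common option.

```r
library(ggplot2)
library(dplyr)

# Variation: distribution of a categorical variable (bar chart)
p_cut <- ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Variation: distribution of a continuous variable (histogram)
p_carat <- ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)

# Unusual values, option 1: drop the entire row.
# The 3 and 20 cutoffs are assumptions chosen for illustration.
diamonds_dropped <- diamonds %>%
  filter(between(y, 3, 20))

# Unusual values, option 2: mark them as missing, then impute.
diamonds_na <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

# Median imputation (mean or KNN would be alternatives)
diamonds_imputed <- diamonds_na %>%
  mutate(y = ifelse(is.na(y), median(y, na.rm = TRUE), y))

# Covariation: categorical vs continuous (box plot and frequency polygon)
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()
p_freq <- ggplot(diamonds, aes(x = price, colour = cut)) +
  geom_freqpoly(binwidth = 500)

# Covariation: categorical vs categorical (point size shows the count)
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# Covariation: continuous vs continuous (scatter plot)
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)
```

Printing any of the plot objects (for example `p_box`) in an interactive session draws it. Comparing `nrow(diamonds_dropped)` with `nrow(diamonds)` shows how many rows the drop option discards.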

## Hands On EDA

• Load the tidyverse (which includes ggplot2)
• ggplot2 provides a dataset called `diamonds`
• Inspect the `diamonds` data and read its documentation with `?diamonds` to find the basic information and what the variables are
• Exercise: Try finding patterns between color and cut, and between color and clarity
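
A possible starting point for these steps, assuming the tidyverse is installed. The object names are illustrative, and the tile plot is only one of several ways to look at two categorical variables; treat it as a sketch to build on rather than the answer to the exercise.

```r
library(tidyverse)  # attaches ggplot2, dplyr, and related packages

# diamonds ships with ggplot2; glimpse() lists each variable and its type.
# In an interactive session, ?diamonds opens the documentation.
glimpse(diamonds)

# Per-variable summaries (ranges for numbers, counts for factors)
summary(diamonds)

# Count how many diamonds fall into each color/cut combination
color_cut <- diamonds %>%
  count(color, cut)

# Tile plot: fill intensity shows the count for each combination
p_color_cut <- ggplot(color_cut, aes(x = color, y = cut, fill = n)) +
  geom_tile()

# Same idea for color vs clarity
color_clarity <- diamonds %>%
  count(color, clarity)

p_color_clarity <- ggplot(color_clarity, aes(x = color, y = clarity, fill = n)) +
  geom_tile()
```

Print `p_color_cut` and `p_color_clarity` to view the plots, then look for combinations that are noticeably more or less common than their neighbours.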
