## Exploratory Data Analysis (EDA)

- Develop an understanding of your data.
- EDA is the process of asking questions by exploring data.
- It is fundamentally a creative process.

## Definitions

- Variable: A quantity, quality, or property that you can measure.
- Value: The state of a variable when you measure it.
- Observation: A set of measurements, that is, several values, each associated with a different variable.
- Tabular data: A set of values, each associated with a variable and an observation. Tabular data is tidy if:
  - Each value is placed in its own cell
  - Each variable is in its own column
  - Each observation is in its own row
- Variation:
  - The tendency of the values of a variable to change from measurement to measurement.
  - Each measurement includes a small amount of error that varies from measurement to measurement.
  - The best way to understand the pattern of variation is to visualize the distribution of the variable's values.
- Types of variables:
  - Categorical: Takes one of a small set of values (factors or character vectors).
  - Continuous: Can take any of an infinite set of ordered values (numbers or date-times).
- Other values:
  - Unusual values:
    - Outliers are observations that are unusual: data points that don't seem to fit the pattern.
    - Two options, the same as for missing values:
      - Drop the entire row containing the unusual value.
      - Replace the unusual value by imputation (mean, median, KNN).
  - Missing values: two options
    - Drop the entire row containing the missing value.
    - Replace the missing value by imputation (mean, median, KNN).
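
The two options above can be sketched on the `diamonds` dataset that ships with ggplot2. This is a minimal illustration, not a recommended recipe: the cut-offs of 3 and 20 for the width `y` are an assumption chosen only to flag a handful of implausible values, and median imputation stands in for any of the imputation methods listed.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

# Assumption: widths (y, in mm) below 3 or above 20 count as unusual.
# These thresholds are illustrative only.

# Option 1: drop every row that contains an unusual value
dropped <- diamonds %>%
  filter(!(y < 3 | y > 20))

# Option 2: mark unusual values as missing, then impute with the median
flagged <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y))

imputed <- flagged %>%
  mutate(y = ifelse(is.na(y), median(y, na.rm = TRUE), y))
```

Dropping rows also discards the other, possibly valid, measurements in those rows, whereas flagging and imputing keeps every observation at the cost of introducing estimated values.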

- Covariation:
  - Variation describes the behaviour within a variable, whereas covariation is the tendency of the values of two or more variables to vary together in a related way.
  - The best way to spot covariation is to visualize the relationship between two or more variables.
  - How to visualize it depends on the types of variables involved:
    - Categorical vs. continuous (frequency, density, or box plot)
    - Categorical vs. categorical
    - Continuous vs. continuous (scatter plot)
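
As a rough sketch of the three cases above, the following uses the `diamonds` dataset from ggplot2. The choice of variables (`cut`, `color`, `carat`, `price`), the `binwidth`, and the use of `geom_count()` for the categorical pair are assumptions made for illustration; other geoms would work as well.

```r
library(ggplot2)

# Categorical vs. continuous: box plot of price for each cut
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Categorical vs. continuous: density-scaled frequency polygons,
# so that groups of different sizes can be compared
p_freq <- ggplot(diamonds, aes(x = price, colour = cut)) +
  geom_freqpoly(aes(y = after_stat(density)), binwidth = 500)

# Categorical vs. categorical: point size shows how many
# observations fall in each combination
p_count <- ggplot(diamonds, aes(x = cut, y = color)) +
  geom_count()

# Continuous vs. continuous: scatter plot, with transparency
# to reduce overplotting
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)

# Print a plot object to draw it, for example:
p_box
```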

## Hands On EDA

- Load the tidyverse and ggplot2 libraries (loading the tidyverse also attaches ggplot2)
- ggplot2 includes a dataset called
`diamonds`

- Inspect the diamonds data: find its basic information and what the variables are by reading the documentation
`?diamonds`
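
A first look at the data might go as follows. This is only a sketch: the variables picked for the distribution plots (`cut` and `carat`) and the `binwidth` of 0.5 are arbitrary choices for illustration.

```r
library(tidyverse)  # attaches ggplot2, dplyr and friends

# Column names, types, and the first few values
glimpse(diamonds)

# Variation in a categorical variable: counts per level, and a bar chart
cut_counts <- diamonds %>% count(cut)
cut_counts

ggplot(diamonds, aes(x = cut)) +
  geom_bar()

# Variation in a continuous variable: histogram
ggplot(diamonds, aes(x = carat)) +
  geom_histogram(binwidth = 0.5)
```

Trying several values of `binwidth` is worthwhile, since different bin widths can reveal different patterns in the same variable.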

- Exercise: Try finding patterns between color and cut, and between color and clarity
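
One possible starting point for the exercise, assuming a tile plot of counts is an acceptable way to compare two categorical variables (the names `color_cut` and `color_clarity` are just illustrative):

```r
library(tidyverse)

# Count the observations in each color/cut combination
color_cut <- diamonds %>% count(color, cut)

ggplot(color_cut, aes(x = color, y = cut)) +
  geom_tile(aes(fill = n))

# Same idea for color and clarity
color_clarity <- diamonds %>% count(color, clarity)

ggplot(color_clarity, aes(x = color, y = clarity)) +
  geom_tile(aes(fill = n))
```

Raw counts are dominated by the most common levels, so comparing proportions within each color may make the patterns easier to see.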