DataScience Classroomnotes 15/Feb/2022

Exploratory Data Analysis (EDA)

  • The goal of EDA is to develop an understanding of your data.
  • EDA is a process in which we ask questions and answer them by exploring the data.
  • It is fundamentally a creative process.


  • Variable: A quantity, quality, or property that you can measure.
  • Value: The state of a variable when you measure it.
  • Observation: A set of measurements, that is, several values, each associated with a different variable.
  • Tabular Data: A set of values, each associated with a variable and an observation. Tabular data is tidy if:
    • Each value is placed in its own cell
    • Each variable is in its own column
    • Each observation is in its own row
  • Variation:
    • The tendency of the values of a variable to change from measurement to measurement
    • Each measurement includes a small amount of error that varies from measurement to measurement
    • The best way to understand the pattern of variation is to visualize the distribution of the variable’s values
  • Types of Variables:
    • Categorical: It takes one of a small set of values (in R, factors or character vectors)
    • Continuous: It can take any of an infinite set of ordered values (numbers or date-times)
  • Other values:
  • Unusual Values:
    • Outliers are unusual observations (data points that don’t seem to fit the pattern)
    • There are two options, the same as for missing values:
    • Drop the entire row containing the unusual value
    • Replace the unusual value by imputation (mean, median, KNN)
  • Missing Values: There are two options:
    • Drop the entire row containing the missing value
    • Replace the missing value by imputation (mean, median, KNN)
  • Variation describes the behaviour within a variable, whereas covariation is the tendency of the values of two or more variables to vary together in a related way
  • The best way to spot covariation is to visualize the relationship between two or more variables
  • How to visualize it depends on the types of the variables involved:
    • Categorical vs Continuous (frequency polygon, density plot, or box plot)
    • Categorical vs Categorical
    • Continuous vs Continuous (scatter plot)
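
Two of the ideas above, handling unusual values and plotting covariation, can be sketched in R with the diamonds dataset used in the hands-on part. This is a minimal sketch, not the method taught in class: the 3 to 20 mm range for the width y is an assumed threshold chosen for illustration, and the median is just one of the imputation choices listed above.

```r
library(dplyr)
library(ggplot2)   # also provides the diamonds dataset

# Assumption: a width y (in mm) outside 3-20 counts as unusual.
unusual <- diamonds %>% filter(y < 3 | y > 20)

# Option 1: drop the entire row that holds an unusual value.
dropped <- diamonds %>% filter(between(y, 3, 20))

# Option 2: mark the unusual value as missing, then impute it
# with the median of the remaining values.
imputed <- diamonds %>%
  mutate(y = ifelse(y < 3 | y > 20, NA, y),
         y = ifelse(is.na(y), median(y, na.rm = TRUE), y))

# Categorical vs continuous: box plot of price for each cut.
p_box <- ggplot(diamonds, aes(x = cut, y = price)) +
  geom_boxplot()

# Continuous vs continuous: scatter plot of carat against price.
p_scatter <- ggplot(diamonds, aes(x = carat, y = price)) +
  geom_point(alpha = 0.1)
```

Printing `p_box` or `p_scatter` at the console draws the plot. Dropping rows discards the other, possibly valid, measurements in those rows, which is why imputation is often preferred when only one value is suspect.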

Hands On EDA

  • Load the tidyverse library (it includes ggplot2)
  • ggplot2 ships with a dataset called diamonds
  • Inspect the diamonds data to get the basic information, and find out what the variables are by reading the documentation with ?diamonds
  • Exercise: Try finding patterns between color and cut, and between color and clarity
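
One possible starting point for the steps and the exercise above is sketched below. It is an assumption about how to approach the exercise rather than the class solution: it counts each combination of two categorical variables with `count()` and shows the counts as a heat map with `geom_tile()`.

```r
library(tidyverse)  # attaches ggplot2, dplyr and related packages

# Basic information: column names, types and the first few values.
glimpse(diamonds)
# ?diamonds         # opens the documentation in an interactive session

# Count how often each color/cut combination occurs.
color_cut <- diamonds %>% count(color, cut)

# Heat map of the counts: darker or lighter tiles reveal which
# combinations are common and which are rare.
p_color_cut <- ggplot(color_cut, aes(x = color, y = cut, fill = n)) +
  geom_tile()

# The same approach for color and clarity.
color_clarity <- diamonds %>% count(color, clarity)
p_color_clarity <- ggplot(color_clarity,
                          aes(x = color, y = clarity, fill = n)) +
  geom_tile()
```

Printing `p_color_cut` or `p_color_clarity` draws the heat map. Raw counts are dominated by the most frequent categories, so comparing proportions within each color may make the patterns easier to read.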

