DataScience Classroomnotes 15/Feb/2022

Exploratory Data Analysis (EDA)

  • Develop an understanding of your data.
  • EDA is a process in which we ask questions by exploring the data
  • It’s fundamentally a creative process

Definitions

  • Variable: A quantity, quality, or property that you can measure
  • Value: The state of a variable when you measure it
  • Observation: A set of measurements, that is, several values, each associated with a different variable
  • Tabular Data: A set of values, each associated with a variable and an observation. Tabular data is tidy if:
    • Each value is placed in its own cell
    • Each variable is in its own column
    • Each observation is in its own row
  • Variation:
    • The tendency of the values of a variable to change from measurement to measurement
    • Each measurement includes a small amount of error that varies from measurement to measurement
    • The best way to understand the pattern of variation is to visualize the distribution of the variable’s values
  • Types of Variables:
    • Categorical: takes one of a small set of values (factors or character vectors)
    • Continuous: can take any of an infinite set of ordered values (numbers or datetimes)
  • Other values:
    • Unusual Values:
      • Outliers are observations that are unusual (data points that don’t seem to fit the pattern)
      • Two options, the same as for missing values:
        • Drop the entire row containing the unusual value
        • Replace the unusual value by imputation (mean, median, KNN)
    • Missing Values: two options
      • Drop the entire row containing the missing value
      • Replace the missing value by imputation (mean, median, KNN)
  • Covariation:
    • Variation describes the behaviour within a variable, whereas covariation is the tendency of the values of two or more variables to vary together in a related way
    • The best way to spot covariation is to visualize the relationship between two or more variables
    • How to visualize it depends on the types of variables involved:
      • Categorical vs Continuous (frequency/density/box plot)
      • Categorical vs Categorical
      • Continuous vs Continuous (scatter plot)
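
The two options for missing and unusual values can be sketched in a few lines. This is a minimal illustration only: it uses plain Python rather than the R used in the hands-on section, the rows are made-up toy values, and the helper names (`drop_missing`, `impute_median`, `mark_unusual_as_missing`) and the "price above 10000" rule are hypothetical choices, not something these notes prescribe.

```python
from statistics import median

# Toy observations (made-up values for illustration); None marks a missing value.
rows = [
    {"carat": 0.2, "price": 300},
    {"carat": 0.3, "price": None},
    {"carat": 0.3, "price": 340},
    {"carat": 0.4, "price": 360},
    {"carat": 0.3, "price": 99999},  # deliberately unusual value
]

def mark_unusual_as_missing(rows, column, is_unusual):
    """Turn unusual values into missing ones so the same two options apply."""
    return [
        {**row, column: None}
        if row[column] is not None and is_unusual(row[column])
        else dict(row)
        for row in rows
    ]

def drop_missing(rows, column):
    """Option 1: drop every row whose value in `column` is missing."""
    return [row for row in rows if row[column] is not None]

def impute_median(rows, column):
    """Option 2: replace missing values with the median of the observed ones."""
    observed = [row[column] for row in rows if row[column] is not None]
    fill = median(observed)
    return [
        {**row, column: fill if row[column] is None else row[column]}
        for row in rows
    ]

# Hypothetical rule: treat any price above 10000 as unusual in this toy data.
cleaned = mark_unusual_as_missing(rows, "price", lambda value: value > 10000)
dropped = drop_missing(cleaned, "price")
imputed = impute_median(cleaned, "price")

print(len(dropped), "rows kept after dropping")
print([row["price"] for row in imputed])
```

Dropping is simpler but discards the other measurements in the row. Imputation keeps every row at the cost of inventing a value, so which option fits depends on how much data you can afford to lose.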

Hands On EDA

  • Load the tidyverse and ggplot2 libraries
  • ggplot2 ships with a dataset called diamonds
  • Inspect the diamonds data to find its basic information and what the variables are by reading the documentation: ?diamonds
  • Exercise: Try finding patterns between color and cut, and between color and clarity
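
One possible starting point for a categorical vs categorical comparison is to count how often each combination of values occurs. The sketch below assumes that approach, since the notes do not name a plot for this case. It uses plain Python rather than R, and a handful of hypothetical sample rows rather than the real diamonds dataset; `count_pairs` is an illustrative helper name.

```python
from collections import Counter

# Hypothetical sample rows for illustration; not taken from the diamonds dataset.
sample = [
    {"cut": "Ideal", "color": "E"},
    {"cut": "Ideal", "color": "E"},
    {"cut": "Premium", "color": "E"},
    {"cut": "Good", "color": "J"},
    {"cut": "Ideal", "color": "J"},
]

def count_pairs(rows, first, second):
    """Count how many rows share each (first, second) combination of values."""
    return Counter((row[first], row[second]) for row in rows)

pair_counts = count_pairs(sample, "color", "cut")

# Print one line per combination, sorted by color and then cut.
for (color, cut), n in sorted(pair_counts.items()):
    print(f"{color:<3}{cut:<10}{n}")
```

Combinations with unexpectedly high or low counts are the ones to look at more closely. The same function works for color and clarity by passing a different column name.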

