DataScience Classroomnotes 09/Feb/2022

Data Preprocessing

  • Packages:
      • recipes
      • rsample
  • Real-world data is generally:
      • Incomplete
      • Inconsistent
      • Inaccurate (contains errors or outliers)
      • Often lacking specific attributes
  • We need to split the data set into train and test sets
  • Train/Test Split
      • Must be done before preprocessing, so that nothing learned from the test data leaks into the model
      • Common splits: 70%/30%, 80%/20%
      • Test data should be used only once, for the final evaluation
      • This ensures an objective measurement of the model's accuracy
  • Feature Engineering
      • Creating new features from existing features
      • Uses domain knowledge
      • Can be used to improve model performance
  • Common Preprocessing Steps
      • Missing Values
      • Vectorization
      • Feature Scaling
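
A minimal sketch of the split-first workflow described above, using rsample on the built-in mtcars data. The data set, the seed, and the column names hp_scaled and power_to_weight are assumptions chosen for illustration; they are not part of the class material.

```r
library(rsample)

set.seed(123)  # make the random split reproducible

# 80%/20% split of the built-in mtcars data
car_split <- initial_split(mtcars, prop = 0.8)
train <- training(car_split)
test  <- testing(car_split)

# Preprocessing parameters are estimated on the training set only
hp_mean <- mean(train$hp)
hp_sd   <- sd(train$hp)

# The same parameters are then applied unchanged to both sets
train$hp_scaled <- (train$hp - hp_mean) / hp_sd
test$hp_scaled  <- (test$hp - hp_mean) / hp_sd

# Feature engineering: a new feature derived from two existing ones
train$power_to_weight <- train$hp / train$wt
test$power_to_weight  <- test$hp / test$wt
```

Because the mean and standard deviation come from the training rows alone, the test rows stay untouched until the final evaluation.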

Missing Values

  • Row Deletion: If only a small fraction of the rows contain missing values, those rows can usually be deleted safely
  • Back-fill or forward-fill
  • Imputation:
      • Mean
      • Median
      • kNN
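
The options above can be sketched on a small made-up data frame. The values and the names df, day, and temp are illustrative assumptions; fill() comes from the tidyr package.

```r
library(tidyr)

df <- data.frame(
  day  = 1:6,
  temp = c(20, NA, 22, 30, NA, 24)
)

# Row deletion: keep only the rows without missing values
deleted <- na.omit(df)

# Forward-fill carries the last seen value down; back-fill pulls the next value up
ffill <- fill(df, temp, .direction = "down")
bfill <- fill(df, temp, .direction = "up")

# Mean imputation: replace each NA with the mean of the observed values
mean_imp <- df
mean_imp$temp[is.na(mean_imp$temp)] <- mean(df$temp, na.rm = TRUE)

# Median imputation: same idea, using the median instead
median_imp <- df
median_imp$temp[is.na(median_imp$temp)] <- median(df$temp, na.rm = TRUE)
```

kNN imputation is not shown here; in the recipes package it is available as step_impute_knn().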


  • Encoding categorical data
      • As integers
      • As dummy variables (One-Hot Encoding)
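
A small base R sketch of both encodings. The example vector is made up for illustration.

```r
gender <- factor(c("masculine", "feminine", "masculine", "feminine", "masculine"))

# Integer encoding: each value is replaced by the position of its level
# (levels are sorted alphabetically: feminine = 1, masculine = 2)
as_int <- as.integer(gender)

# One-hot encoding: one 0/1 column per level
# ("- 1" removes the intercept so that no level is dropped)
one_hot <- model.matrix(~ gender - 1)
```

Integer codes imply an ordering between categories, which is one reason dummy variables are often preferred for unordered data.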

Feature Scaling

  • Normalization: Scaling variables to have values between 0 and 1
  • Standardization: Transforming data to have a mean of 0 and a standard deviation of 1
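
Both transformations written out in base R on a made-up vector, as a sketch of the definitions above:

```r
x <- c(2, 4, 6, 8, 10)

# Normalization (min-max scaling): (x - min) / (max - min)
normalized <- (x - min(x)) / (max(x) - min(x))

# Standardization (z-score): (x - mean) / standard deviation
standardized <- (x - mean(x)) / sd(x)
```

When a train/test split is in place, min, max, mean, and standard deviation would be computed on the training data and reused for the test data.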


  • Downsampling or upsampling
  • Collapsing rarely occurring cases into one case called “Other”
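
A base R sketch of both ideas on made-up labels. The threshold of 2 occurrences and the names species, collapsed, and down are assumptions for illustration.

```r
set.seed(42)  # make the random sampling reproducible

species <- c(rep("Human", 6), rep("Droid", 4), "Wookiee", "Ewok")

# Collapse every level seen fewer than 2 times into "Other"
counts    <- table(species)
rare      <- names(counts)[counts < 2]
collapsed <- ifelse(species %in% rare, "Other", species)

# Downsampling: keep as many rows per class as the smallest class has
df    <- data.frame(id = seq_along(collapsed), class = collapsed)
n_min <- min(table(df$class))
keep  <- unlist(lapply(
  split(df$id, df$class),
  function(ids) ids[sample.int(length(ids), n_min)]
))
down <- df[keep, ]
```

Upsampling works the other way round: rows of the smaller classes are sampled with replacement until they match the largest class.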

Data Pre-Processing in R

  • Let's take the starwars data and experiment with some of the data preprocessing steps
  • We will work with the height, mass, and gender variables in the starwars dataset
  • We need to split the data into train and test sets, which we can easily do using rsample
  • Let's install tidymodels first: install.packages("tidymodels")
  • Now let's split the data
  • Let's add a feature called BMI
  • Removing missing values
  • Imputing missing values with the mean for mass and BMI
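
The code for these steps is not included in the notes, so the following is only a sketch of how they might look with tidymodels. The 80/20 proportion, the seed, the helper add_bmi(), and the BMI formula (mass in kg divided by the square of height in metres, with starwars heights given in cm) are assumptions.

```r
library(tidymodels)  # attaches dplyr, rsample, and recipes, among others

set.seed(2022)

# Keep only the variables used in these notes
sw <- dplyr::starwars %>% select(height, mass, gender)

# Split the data before any preprocessing
sw_split <- initial_split(sw, prop = 0.8)
sw_train <- training(sw_split)
sw_test  <- testing(sw_split)

# Add BMI as a new feature (hypothetical helper, applied to both sets)
add_bmi <- function(df) mutate(df, BMI = mass / (height / 100)^2)
sw_train <- add_bmi(sw_train)
sw_test  <- add_bmi(sw_test)

# Option 1: remove rows that contain missing values
sw_train_complete <- na.omit(sw_train)

# Option 2: impute mass and BMI with their training-set means
rec <- recipe(~ ., data = sw_train) %>%
  step_impute_mean(mass, BMI)

rec_prepped   <- prep(rec, training = sw_train)       # means are learned here
train_imputed <- bake(rec_prepped, new_data = NULL)    # processed training set
test_imputed  <- bake(rec_prepped, new_data = sw_test) # same means reused
```

Here prep() estimates the means from the training data only, and bake() applies them to whichever data set is passed in, which keeps the test set out of the estimation step.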

