Data Preprocessing
- Packages:
- recipes
- rsample
- Real-world data is generally
- Incomplete
- Inconsistent
- Inaccurate (contains errors or outliers)
- Often lacking specific attributes
- We need to split the data set into train and test sets
- Train/Test Split
- Must be done before preprocessing
- Common splits: 70%/30%, 80%/20%
- Test data should be used only once
- This ensures an objective measurement of the model's accuracy
- Feature Engineering
- Creating new features from existing features
- Uses domain knowledge
- Can improve model performance
- Common Preprocessing Steps
- Missing Values
- Vectorization
- Feature Scaling
Missing Values
- Row Deletion: If only a small fraction of rows contain missing values, we can usually delete those rows safely
- Back-fill or forward-fill
- Imputation:
- Mean
- Median
- kNN
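The options above can be sketched in base R. This is a minimal illustration on made-up data: the `df` values and the `forward_fill()` helper are assumptions for the example, not part of any package, and kNN imputation is left out. In a train/test setting the mean and median would normally be computed on the training data only.

```r
# Toy data with gaps (values are made up for illustration)
df <- data.frame(
  day  = 1:6,
  temp = c(20.5, NA, 22.0, NA, 25.0, 21.5)
)

# Row deletion: drop every row that contains a missing value
df_deleted <- na.omit(df)

# Forward fill: carry the last observed value forward (hypothetical helper)
forward_fill <- function(x) {
  # Position of the most recent non-missing value (0 if none seen yet)
  idx <- cummax(seq_along(x) * !is.na(x))
  # Leading missing values have nothing to carry forward, so keep them NA
  idx[idx == 0] <- NA
  x[idx]
}
df$temp_ffill <- forward_fill(df$temp)

# Mean and median imputation: replace NA with a summary of the observed values
df$temp_mean   <- ifelse(is.na(df$temp), mean(df$temp, na.rm = TRUE), df$temp)
df$temp_median <- ifelse(is.na(df$temp), median(df$temp, na.rm = TRUE), df$temp)
```

Back-fill works the same way in the opposite direction, for example by reversing the vector, forward-filling, and reversing it again.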
Vectorization
- Encoding Categorical data
- As Integers
- As dummy variables (One-Hot Encoding)
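Both encodings can be sketched in base R. The `gender` vector below is made-up toy data, and `model.matrix()` is just one of several ways to build dummy variables.

```r
# Toy categorical variable (values are made up for illustration)
gender <- c("masculine", "feminine", "masculine", "feminine", "masculine")
gender_factor <- factor(gender)

# Integer encoding: factor levels are sorted, so feminine = 1, masculine = 2
gender_int <- as.integer(gender_factor)

# One-hot encoding: one 0/1 column per level.
# "- 1" removes the intercept so that every level gets its own column.
gender_dummies <- model.matrix(~ gender_factor - 1)
```

Integer codes imply an ordering between categories, so dummy variables are usually the safer choice for unordered categories.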
Feature Scaling
- Normalization: Scaling variables to have values between 0 and 1
- Standardization: Transforming data to have a mean of 0 and a standard deviation of 1
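A minimal base R sketch of both transformations, using a made-up numeric vector `x`:

```r
# Toy numeric variable (values are made up for illustration)
x <- c(150, 160, 170, 180, 200)

# Normalization (min-max scaling): values end up between 0 and 1
x_norm <- (x - min(x)) / (max(x) - min(x))

# Standardization (z-score): mean 0 and standard deviation 1
x_std <- (x - mean(x)) / sd(x)

# scale() performs the same standardization and returns a one-column matrix
x_std_builtin <- as.numeric(scale(x))
```

When a train/test split is involved, the minimum, maximum, mean, and standard deviation would normally be computed on the training data and then reused for the test data.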
Others
- Downsampling or upsampling
- Collapsing rarely occurring categories into a single category called “Other”
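Both ideas can be sketched in base R. The data, the rarity threshold of 2, and the variable names below are assumptions chosen for illustration. The recipes package also offers `step_other()` for collapsing rare levels.

```r
set.seed(1)  # make the random downsampling reproducible

# Collapsing rare categories: anything seen fewer than 2 times becomes "Other"
species <- c(rep("Human", 6), rep("Droid", 4), "Wookiee", "Ewok")
counts  <- table(species)
rare    <- names(counts)[counts < 2]
species_collapsed <- ifelse(species %in% rare, "Other", species)

# Downsampling: shrink every class to the size of the smallest class
df <- data.frame(id = 1:10, class = c(rep("yes", 3), rep("no", 7)))
n_min <- min(table(df$class))
rows_by_class <- split(seq_len(nrow(df)), df$class)
keep <- unlist(lapply(rows_by_class, function(idx) {
  # sample.int() on the length avoids the sample() pitfall for length-1 vectors
  idx[sample.int(length(idx), n_min)]
}))
df_down <- df[keep, ]
```

Upsampling is the mirror image: sample the smaller classes with replacement until they match the largest class.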
Data Preprocessing in R
- Let's use the starwars data (from dplyr) to experiment with some of the data preprocessing steps
- Let's work on the height, mass, and gender variables in the starwars dataset
- We need to split the data into train and test sets, which is easy to do with rsample.
- Let's install tidymodels first
install.packages("tidymodels")
- Now let's split the data
- Let's add a feature called BMI
- Removing missing values:
- Impute missing values in mass and BMI with the mean
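The steps above can be sketched as follows. This is one possible version, not a definitive one: the 80%/20% proportion, the seed, the `add_bmi()` helper, and the object names are assumptions, and `step_impute_mean()` assumes a reasonably recent version of recipes.

```r
library(dplyr)    # starwars dataset, select(), mutate()
library(rsample)  # initial_split(), training(), testing()
library(recipes)  # recipe(), step_impute_mean(), prep(), bake()

set.seed(42)  # make the random split reproducible

# Keep only the variables discussed above
sw <- starwars %>% select(height, mass, gender)

# Train/test split (80%/20%), done before any preprocessing
sw_split <- initial_split(sw, prop = 0.8)
sw_train <- training(sw_split)
sw_test  <- testing(sw_split)

# Feature engineering: BMI = mass (kg) / height (m)^2; height is stored in cm
add_bmi <- function(df) mutate(df, BMI = mass / (height / 100)^2)
sw_train <- add_bmi(sw_train)
sw_test  <- add_bmi(sw_test)

# Option 1: remove rows that contain missing values
sw_train_complete <- na.omit(sw_train)

# Option 2: impute mass and BMI with their means
rec <- recipe(~ ., data = sw_train) %>%
  step_impute_mean(mass, BMI)

# prep() estimates the means from the training data only
rec_prepped <- prep(rec, training = sw_train)

# bake() applies those training means to both sets
sw_train_imputed <- bake(rec_prepped, new_data = NULL)
sw_test_imputed  <- bake(rec_prepped, new_data = sw_test)
```

Estimating the means on the training set and reusing them for the test set keeps information from the test data out of the preprocessing, which is the reason for splitting first.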