DataScience Classroomnotes 11/Feb/2022

Data Preprocessing using recipes

  • In the previous session we took the Star Wars data set and performed the following steps:
  • Created a new feature, BMI
  • Dealt with missing values
    • Removed the rows with missing values in columns where they are infrequent
    • Imputed the mean for the remaining missing values
  • Created dummy variables (one-hot encoding) for categorical values
  • Standardized/normalized the values
  • We will now achieve all of the above data preprocessing steps using the recipes package from tidymodels. Refer Here for the references
  • We have done the following in R
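library(dplyr)    # provides starwars, select() and %>%
library(rsample)
library(skimr)
library(recipes)
# Keep only the columns needed for this exercise
data <- starwars %>%
  select(height, mass, gender)
# Split into training and testing sets (random 75/25 split by default)
data_split <- initial_split(data)
training_data <- training(data_split)
testing_data <- testing(data_split)
# Define the preprocessing steps and estimate them on the training data
starwars_recipe <- training_data %>%
  recipe() %>%
  # height is recorded in cm, so convert to metres for BMI (kg/m^2)
  step_mutate(BMI = mass / (height / 100)^2) %>%
  step_naomit(height, gender) %>%       # drop rows with missing height or gender
  step_impute_mean(mass, BMI) %>%       # fill missing mass and BMI with the training mean
  step_dummy(gender) %>%                # dummy-encode gender
  step_normalize(everything()) %>%      # centre and scale every column
  prep()

# Print the recipe summary, then extract the preprocessed training data
starwars_recipe
juice(starwars_recipe)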
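
  • To see what these steps compute, here is a minimal base R sketch of the same operations done by hand on a made-up three-row data frame. The values and the names toy and toy_scaled are illustrative assumptions, not taken from the starwars data, and this is not how recipes implements the steps internally; it only approximates their effect.

```r
# Toy data (made-up values for illustration only)
toy <- data.frame(
  height = c(150, 200, 100),   # cm
  mass   = c(45, NA, 30),      # kg, one missing value
  gender = factor(c("feminine", "masculine", "masculine"))
)

# New feature: BMI = kg / m^2 (height converted from cm to metres)
toy$BMI <- toy$mass / (toy$height / 100)^2   # 20, NA, 30

# Mean imputation: replace each NA with the mean of the observed values
toy$mass[is.na(toy$mass)] <- mean(toy$mass, na.rm = TRUE)   # fills in 37.5
toy$BMI[is.na(toy$BMI)]   <- mean(toy$BMI, na.rm = TRUE)    # fills in 25

# Dummy encoding: one 0/1 column for the non-reference level
toy$gender_masculine <- as.numeric(toy$gender == "masculine")
toy$gender <- NULL

# Normalization: subtract the column mean and divide by the column sd
toy_scaled <- as.data.frame(scale(toy))
toy_scaled
```

  • After scaling, every column has mean 0 and standard deviation 1. The recipe does the same, except that it stores the means and standard deviations learned from the training data so they can be reused on new data.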

  • Let's create a data preprocessing pipeline using recipes on the IPL dataset Refer Here
  • Our data will contain the following columns
  • over
  • ball
  • dismissal_kind => only rows where it is not blank
  • fielder => encoded as 0 (no fielder) or 1 (fielder present)
  • Steps:
  • Read the CSV data into an R object
  • Select the above-mentioned columns and keep only the rows where dismissal_kind is not blank
  • Split the data into training and testing sets
  • Encode fielder as 0 or 1 and dummy-encode dismissal_kind
  • Normalize
# Load libraries
library(tidyverse)
library(tidymodels)   # includes rsample and recipes
library(skimr)
# Load the data, keep the required columns, and drop rows without a dismissal
# (read_csv reads blank fields as NA by default)
data <- read_csv('deliveries.csv') %>%
  select(over, ball, dismissal_kind, fielder) %>%
  filter(!is.na(dismissal_kind), dismissal_kind != "")
skim(data)
# Split the data into training and testing sets
data_split <- initial_split(data)
training_data <- training(data_split)
testing_data <- testing(data_split)
skim(training_data)
# Encode fielder as 0 (missing) or 1 (present)
training_data <- training_data %>%
  mutate(fielder = ifelse(is.na(fielder), 0, 1))
skim(training_data)
# Dummy-encode dismissal_kind and normalize all columns using recipes
ipl_dismissal_recipe <- training_data %>%
  recipe() %>%
  step_dummy(dismissal_kind) %>%
  step_normalize(everything()) %>%
  prep()
ipl_dismissal_recipe
preprocessed_data <- juice(ipl_dismissal_recipe)
preprocessed_data
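
  • The code above encodes fielder on the training data only. A minimal sketch of how the same preprocessing could be carried over to the testing data is given here, assuming a small made-up stand-in for deliveries.csv. The tibbles train and test, the helper encode_fielder, and all values are hypothetical. bake() applies the statistics learned from the training data to new rows.

```r
library(dplyr)
library(recipes)

# Hypothetical stand-in for a few rows of deliveries.csv (values made up)
train <- tibble(
  over = c(1, 5, 12, 19),
  ball = c(2, 6, 3, 1),
  dismissal_kind = c("caught", "bowled", "caught", "run out"),
  fielder = c("Fielder A", NA, "Fielder B", "Fielder C")
)
test <- tibble(
  over = c(7, 15),
  ball = c(4, 5),
  dismissal_kind = c("bowled", "caught"),
  fielder = c(NA, "Fielder D")
)

# Hypothetical helper: encode fielder as 0 (missing) or 1 (present)
encode_fielder <- function(df) {
  df %>% mutate(fielder = ifelse(is.na(fielder), 0, 1))
}
train <- encode_fielder(train)
test  <- encode_fielder(test)

# Estimate the dummy levels, means and standard deviations on the training rows
rec <- recipe(train) %>%
  step_dummy(dismissal_kind) %>%
  step_normalize(all_numeric()) %>%
  prep()

# Apply the trained steps to the testing rows
baked_test <- bake(rec, new_data = test)
baked_test
```

  • Applying the same helper to both splits keeps the columns consistent, and the testing rows are scaled with the training means and standard deviations rather than their own.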
