DataScience Classroomnotes 10/Feb/2022

Data Preprocessing in R (Contd)

  • In the last session we covered the preprocessing steps up to handling missing values
  • Use the iris data set to demonstrate encoding a categorical variable with unique integers
  • When you assign a number to every category you impose an order that does not exist in the data, so let's see if we can use one-hot encoding instead
Species => setosa, versicolor, virginica

Create new variables (setosa is the case where both are 0)
Species_versicolor
Species_virginica
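The two encodings discussed above can be sketched in base R as follows. This is a minimal illustration on the built-in iris data, not necessarily the code used in the session; the names species_int and iris_encoded are assumptions made for the example.

```r
# Integer encoding: the factor levels become 1, 2, 3,
# which suggests an order (setosa < versicolor < virginica) that is not real.
species_int <- as.integer(iris$Species)
table(iris$Species, species_int)

# Indicator (one-hot style) encoding: one 0/1 column per non-reference level.
# setosa is the reference level, i.e. the rows where both new columns are 0.
iris_encoded <- iris
iris_encoded$Species_versicolor <- ifelse(iris$Species == "versicolor", 1, 0)
iris_encoded$Species_virginica  <- ifelse(iris$Species == "virginica", 1, 0)
iris_encoded$Species <- NULL  # drop the original categorical column

head(iris_encoded)
```

Keeping only two indicator columns for three species avoids a redundant column; a third Species_setosa column could be added in the same way if a full one-hot representation is preferred.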


library(dplyr)
library(tidyr)
library(skimr)
starwars
View(starwars)
skim(starwars)
data <- starwars %>%
  select(height, mass, gender)
data
library(rsample)

data_split <- initial_split(data)
training_data <- training(data_split)
testing_data <- testing(data_split)

# Add a BMI feature (height is recorded in cm, so convert to metres first)
training_data <- training_data %>%
  mutate(BMI = mass / (height / 100)^2)
skim(training_data)

# Removing Missing Data
training_data <- training_data %>%
  drop_na(height, gender)


# Impute the remaining missing values with the column mean
training_data <- training_data %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass)) %>%
  mutate(BMI = ifelse(is.na(BMI), mean(BMI, na.rm = TRUE), BMI))

skim(training_data)

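# Encode gender as a 0/1 indicator column, then drop the original
data_tr_encoded <- training_data %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender)

# Reusable function: centre on the mean and scale by the standard deviation
standardize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}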

data_tr_imputed_encoded_normalize <- data_tr_encoded %>%
  mutate_all(standardize)

skim(data_tr_imputed_encoded_normalize)
  • Now we have encoded the categorical values and standardized every feature
  • Execute the above steps in your local environment, then try to do all of this as one statement using pipes => data processing pipeline
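Before assembling the pipeline, it may help to check what the standardize helper does on its own. The sketch below uses a small made-up vector x (an assumption for illustration, not data from the session) and also shows why the missing values are imputed before standardizing.

```r
# Same helper as in the notes: centre on the mean, scale by the standard deviation
standardize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}

x <- c(2, 4, 6, 8)   # made-up values for illustration
z <- standardize(x)

mean(z)   # approximately 0
sd(z)     # approximately 1

# A single NA makes mean() and sd() return NA, so every value becomes NA.
# This is why imputation comes before standardization in the steps above.
standardize(c(1, NA, 3))
```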
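library(dplyr)
library(tidyr)
library(rsample)
library(skimr)

# Same three columns as before
data <- starwars %>%
  select(height, mass, gender)

# Reusable function: centre on the mean and scale by the standard deviation
standardize <- function(feature) {
  (feature - mean(feature)) / sd(feature)
}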

data_split <- initial_split(data)
data_train <- training(data_split)
data_test <- testing(data_split)
transformed_data_train <- data_train %>%
  mutate(BMI = mass / (height / 100)^2) %>%   # height is in cm
  drop_na(height, gender) %>%
  mutate(mass = ifelse(is.na(mass), mean(mass, na.rm = TRUE), mass),
         BMI = ifelse(is.na(BMI), mean(BMI, na.rm = TRUE), BMI)) %>%
  mutate(gender_masculine = ifelse(gender == "masculine", 1, 0)) %>%
  select(-gender) %>%
  mutate_all(standardize)

skim(transformed_data_train)
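The pipeline above only transforms the training split; data_test is created but not processed. One common convention, offered here as an assumption rather than something covered in the session, is to reuse the mean and standard deviation estimated on the training data when transforming the testing data. A minimal base R sketch with made-up vectors (train_mass and test_mass are hypothetical stand-ins for one column of each split):

```r
# Made-up values standing in for one column of the training and testing splits
train_mass <- c(60, 80, NA, 100)
test_mass  <- c(70, NA)

# Statistics are estimated on the training data only
train_mean    <- mean(train_mass, na.rm = TRUE)
train_imputed <- ifelse(is.na(train_mass), train_mean, train_mass)
train_sd      <- sd(train_imputed)

# The same training statistics are reused for the testing data
test_imputed <- ifelse(is.na(test_mass), train_mean, test_mass)
test_scaled  <- (test_imputed - train_mean) / train_sd

test_scaled
```

Reusing the training statistics keeps the testing data from influencing the transformation, so both splits end up on the same scale.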

