DataScience Classroomnotes 26/Feb/2022

Classification

  • Predict a class for a given instance based on a set of features
    • Whether or not an email is spam
    • Whether a handwritten digit is 0,1,2,3,4,5,6,7,8,9
  • Binary Classification:
    • Classify an instance into one of two possible classes (email is spam or not spam)
  • Multiclass Classification:
    • Classify an instance into more than two possible classes (which of the digits 0,1,2,3,4,5,6,7,8,9 a handwritten digit is)

Accuracy

  • Accuracy = Number of correct predictions / Total number of predictions
  • This is the most commonly used metric for evaluating classification models
  • It can be a misleading metric and doesn't tell the full story on its own.
  • Accuracy: Not Enough!
  • Suppose we build a classifier to predict whether a patient has a rare, fatal disease like cancer, assuming 0.1% of the population is affected by the disease
    • Positive Case: Patient has the disease
    • Negative Case: Patient doesn't have the disease
  • If we reply "no disease" irrespective of the data (tests, lab results, etc.), we are 99.9% accurate, yet we never identify a single affected patient (see the sketch below).
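
To make this concrete, here is a minimal R sketch with simulated data (assumed for illustration, not from the original notes):

# An "always negative" classifier on a disease affecting 0.1% of the population
set.seed(1)
n <- 100000
actual    <- sample(c("disease", "no disease"), n, replace = TRUE, prob = c(0.001, 0.999))
predicted <- rep("no disease", n)   # reply "no disease" irrespective of the data

mean(predicted == actual)           # accuracy ~ 0.999
sum(actual == "disease")            # ~100 sick patients, none of whom were detected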

Confusion Matrix

  • A table where
    • Rows represent actual classes
    • Columns represent predicted classes
    • Each entry is the number of instances with the corresponding actual and predicted classes
                        Predicted Positive     Predicted Negative
      Actual Positive   True Positive (TP)     False Negative (FN)
      Actual Negative   False Positive (FP)    True Negative (TN)
  • Accuracy: (TP + TN) / (TP + TN + FP + FN)
    • How often is the classifier correct overall?
  • Precision: TP / (TP + FP)
    • When the classifier predicts positive, how often is it correct?
  • Recall: TP / (TP + FN)
    • How often are the actual positive instances correctly classified as positive? (A worked example follows.)
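
A small worked example in R, with assumed counts (illustrative only, not from the notes):

# Assumed counts for a 100-instance test set
TP <- 40; FN <- 10
FP <- 5;  TN <- 45

(TP + TN) / (TP + TN + FP + FN)  # accuracy  = 0.85
TP / (TP + FP)                   # precision ~ 0.89
TP / (TP + FN)                   # recall    = 0.80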

F1 Score

  • Combines precision and recall into a single metric
    F1 = 2 × (Precision × Recall) / (Precision + Recall)
  • Interpreted as the harmonic mean of precision and recall
  • Its value will be between 0 (worst) and 1 (best)
  • High only if both precision and recall are high
  • Example (verified in the sketch below):
  • precision: 0.5, recall: 0.5, F1 = 0.5
  • precision: 1.0, recall: 0.2, F1 ≈ 0.33 (a plain average would misleadingly give 0.6)
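
A minimal R sketch of the formula, checking the two examples above:

# F1 as the harmonic mean of precision and recall
f1_score <- function(precision, recall) {
  2 * (precision * recall) / (precision + recall)
}

f1_score(0.5, 0.5)  # 0.5
f1_score(1.0, 0.2)  # 0.333... -- dragged down by the low recall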

Let's try classification using the Titanic dataset


R-Notebook

Importing libraries

library(tidyverse)
library(tidymodels)
library(skimr)
library(corrr)

# install package if not present 
# install.packages('titanic')
library(titanic)

Let's use the titanic_train dataset from the titanic package

# Let's split titanic_train and build the model

data <- titanic_train

set.seed(123)  # make the train/test split reproducible
data_split <- initial_split(data)
train <- training(data_split)
test <- testing(data_split)

skimr::skim(train)

Build a recipe

data_rec <- recipe(Survived ~ ., train) %>%
  step_mutate(Survived = ifelse(Survived == 0, "Died", "Survived")) %>%
  step_string2factor(Survived) %>%
  step_rm(PassengerId, Name, Ticket, Cabin) %>%   # drop identifier / high-cardinality columns
  step_impute_mean(Age) %>%                       # fill missing ages (step_meanimpute in older recipes)
  step_dummy(all_nominal(), -all_outcomes()) %>%  # one-hot encode categorical predictors
  step_zv(all_predictors()) %>%                   # drop zero-variance predictors
  step_center(all_predictors(), -all_nominal()) %>%
  step_scale(all_predictors(), -all_nominal())


Prepping the recipe

data_prep <- data_rec %>%
  prep()

data_prep
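
To see what the recipe actually produced, the processed training data can be inspected by baking the prepped recipe (a quick optional check, not in the original notes):

# Peek at the transformed training set produced by the recipe
bake(data_prep, new_data = NULL) %>%
  glimpse()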

Build a fitted model

fitted_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(Survived ~ ., data = bake(data_prep, train))
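
As an optional check (not in the original notes), the coefficients of the underlying glm can be inspected with broom's tidy(), which tidymodels loads:

# Inspect the logistic regression coefficients (log-odds scale)
tidy(fitted_model)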

Predict using the fitted model


predictions <- fitted_model %>%
  predict(new_data = bake(data_prep, test)) %>%
  bind_cols(
    bake(data_prep, test) %>%
      select(Survived)
  )

predictions
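
Beyond hard class labels, class probabilities can be obtained as well (an optional extension, not in the original notes), which is useful for ROC curves or custom thresholds:

# Predicted class probabilities instead of hard labels
fitted_model %>%
  predict(new_data = bake(data_prep, test), type = "prob")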

Create a confusion matrix

predictions %>%
  conf_mat(Survived, .pred_class)
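
As a small optional extra, yardstick's conf_mat objects support autoplot for a quick visualization:

# Visualize the confusion matrix as a heatmap
predictions %>%
  conf_mat(Survived, .pred_class) %>%
  autoplot(type = "heatmap")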

Metrics

predictions %>%
  metrics(Survived, .pred_class)
predictions %>%
  precision(Survived, .pred_class)
predictions %>%
  recall(Survived, .pred_class)
predictions %>%
  f_meas(Survived, .pred_class)
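
One caveat worth noting: yardstick treats the first factor level ("Died" here, since levels sort alphabetically) as the positive class by default. To score "Survived" as the positive class instead, pass event_level:

# Score "Survived" (the second factor level) as the positive class
predictions %>%
  f_meas(Survived, .pred_class, event_level = "second")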
