## Classification

• Predict a class for a given instance based on a set of features, e.g.:
  • Whether or not an email is spam
  • Which digit (0–9) a handwritten image shows
• Binary classification:
  • Classify an instance into one of two possible classes (e.g. email spam or not spam)
• Multiclass classification:
  • Classify an instance into more than two possible classes (e.g. which of the digits 0–9 a handwritten image shows)

## Accuracy

• Accuracy = number of correct predictions / total number of predictions
• The most commonly used metric for evaluating classification models
• It can be a misleading metric and doesn't tell the full story on its own
• Accuracy: not enough!
• Suppose we build a classifier to predict whether a patient has a rare, fatal disease such as cancer, and assume 0.1% of the population is affected
• A classifier that always answers "no cancer", irrespective of the data (tests, lab results, etc.), is 99.9% accurate
• Positive case: the patient has the disease
• Negative case: the patient doesn't have the disease
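To see this concretely, a minimal sketch in base R. The 1,000-patient population and the always-negative classifier are illustrative assumptions, not part of any real dataset:

```r
# Hypothetical population of 1,000 patients, 0.1% with the disease
actual <- c("Disease", rep("Healthy", 999))

# A useless classifier that always answers "no disease"
predicted <- rep("Healthy", 1000)

# Accuracy = correct predictions / total predictions
mean(predicted == actual)  # 0.999 -- yet not a single case is detected
```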

## Confusion Matrix

• A table where:
  • Rows represent actual classes
  • Columns represent predicted classes
  • Each entry is the number of instances with the corresponding actual and predicted classes
• Accuracy: (TP + TN) / (TP + TN + FP + FN)
  • How often is the classifier correct overall?
• Precision: TP / (TP + FP)
  • When it predicts positive, how often is the classifier correct?
• Recall: TP / (TP + FN)
  • How often are the actual positive instances classified as positive?
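Plugging some counts into these formulas (the numbers below are made up purely for illustration):

```r
# Hypothetical confusion-matrix counts
TP <- 40; FP <- 10; FN <- 20; TN <- 930

(TP + TN) / (TP + TN + FP + FN)  # accuracy  = 0.97
TP / (TP + FP)                   # precision = 0.8
TP / (TP + FN)                   # recall    ~ 0.667
```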

## F1 Score

• Combines precision and recall into a single metric
• It is the harmonic mean of precision and recall: F1 = 2 × (precision × recall) / (precision + recall)
• Its value is between 0 (worst) and 1 (best)
• High only if both precision and recall are high
• Examples:
  • precision 0.5, recall 0.5 → F1 = 0.5
  • precision 1.0, recall 0.2 → F1 ≈ 0.33
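These examples can be checked with the harmonic-mean formula; a small helper (the function name `f1` is just for illustration):

```r
# F1 is the harmonic mean of precision and recall
f1 <- function(precision, recall) {
  2 * precision * recall / (precision + recall)
}

f1(0.5, 0.5)  # 0.5
f1(1.0, 0.2)  # ~0.333: dragged down by the low recall
```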

## Let's try classification using the titanic dataset

## Importing libraries

```r
library(tidyverse)
library(tidymodels)
library(skimr)
library(corrr)

# install the titanic package if not present
# install.packages('titanic')
library(titanic)
```

## Let's use the titanic dataset

```r
# Split titanic_train into training and test sets
set.seed(123)  # added for a reproducible split

data <- titanic_train
data_split <- initial_split(data)
train <- training(data_split)
test <- testing(data_split)

skimr::skim(train)
```

### Build a recipe

```r
data_rec <- recipe(Survived ~ ., train) %>%
  step_mutate(Survived = ifelse(Survived == 0, "Died", "Survived")) %>%
  step_string2factor(Survived) %>%
  step_rm(PassengerId, Name, Ticket, Cabin) %>%
  # step_meanimpute() was renamed step_impute_mean() in recipes >= 0.1.16
  step_impute_mean(Age) %>%
  step_dummy(all_nominal(), -all_outcomes()) %>%
  step_zv(all_predictors()) %>%
  step_center(all_predictors(), -all_nominal()) %>%
  step_scale(all_predictors(), -all_nominal())
```

## Prepping the recipe

```r
data_prep <- data_rec %>%
  prep()

data_prep
```

## Build a fitted model

```r
fitted_model <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification") %>%
  fit(Survived ~ ., data = bake(data_prep, train))
```

## Predict using the fitted model

```r
predictions <- fitted_model %>%
  predict(new_data = bake(data_prep, test)) %>%
  bind_cols(
    bake(data_prep, test) %>%
      select(Survived)
  )

predictions
```

## Create a confusion matrix

```r
predictions %>%
  conf_mat(Survived, .pred_class)
```

## Metrics

```r
predictions %>%
  metrics(Survived, .pred_class)
```

```r
predictions %>%
  precision(Survived, .pred_class)
```

```r
predictions %>%
  recall(Survived, .pred_class)
```

```r
predictions %>%
  f_meas(Survived, .pred_class)
```
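The four metric calls above can also be bundled into one; assuming yardstick's `metric_set()` helper, a sketch that reuses the `predictions` tibble built earlier:

```r
# Bundle several yardstick metrics into a single callable
class_metrics <- metric_set(accuracy, precision, recall, f_meas)

# One call computes all four metrics at once
predictions %>%
  class_metrics(truth = Survived, estimate = .pred_class)
```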

## About continuous learner

devops & cloud enthusiastic learner