DataScience Classroomnotes 31/Dec/2021

Data Transformation with dplyr

  • Ensure tidyverse is installed
install.packages('tidyverse')
  • This class focuses on how to use the dplyr package, a core member of the tidyverse
  • Let's load some packages as prerequisites
library(tidyverse)
install.packages('nycflights13')
library(nycflights13)
  • Refer to the nycflights13 package documentation for information about the package
  • Let's explore the flights dataset. This data frame contains the 336776 flights that departed from New York City in 2013.
  • Tibbles are data frames, slightly tweaked to work better with the tidyverse.
  • When a tibble is printed, the abbreviations under the column names describe the type of each variable:
  • int => integer
  • dbl => double (real number)
  • chr => character vector or string
  • dttm => date-time (a date plus a time)
  • lgl => logical vector => TRUE or FALSE
  • fct => factor (shown as fctr in older tibble versions)
  • date => date
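  • The abbreviations can be checked on a small tibble. This is a minimal sketch: types_demo and its column names are made up for illustration and are not part of nycflights13.

```r
library(tibble)

# Hypothetical tibble with one column per type listed above
types_demo <- tibble(
  an_int = 1:2,
  a_dbl  = c(1.5, 2.5),
  a_chr  = c("a", "b"),
  a_dttm = as.POSIXct(c("2013-01-01 05:15:00", "2013-01-01 05:29:00"), tz = "UTC"),
  a_lgl  = c(TRUE, FALSE),
  a_fct  = factor(c("low", "high")),
  a_date = as.Date(c("2013-01-01", "2013-01-02"))
)

# Printing a tibble shows the type abbreviation under each column name
print(types_demo)

# type_sum() returns the same abbreviation for a single column
type_abbr <- vapply(types_demo, type_sum, character(1))
print(type_abbr)
```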
  • Basics of dplyr: dplyr functions let you solve the majority of data manipulation challenges
  • Pick observations by their values => filter()
  • Reorder the rows => arrange()
  • Pick variables by their names => select()
  • Create new variables with functions of existing variables => mutate()
  • Filter Rows with filter()
  • We can filter with the comparison operators >, >=, <, <=, != and ==
  • Create four objects holding the flights in each quarter using filter()
q1 <- filter(flights, month<=3)
q2 <- filter(flights, month>=4 & month<=6)
q3 <- filter(flights, month %in% 7:9)
q4 <- filter(flights, month %in% 10:12)
print(q1)
print(q2)
print(q3)
print(q4)
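  • The quarter filters above use two styles, & and %in%. The sketch below, on a hypothetical one-column tibble called months_demo, suggests that these styles and dplyr's between() pick the same rows, and shows != as well.

```r
library(dplyr)

# Hypothetical stand-in for the month column of flights
months_demo <- tibble(month = 1:12)

# Three ways to express "second quarter"
q2_and     <- filter(months_demo, month >= 4 & month <= 6)
q2_in      <- filter(months_demo, month %in% 4:6)
q2_between <- filter(months_demo, between(month, 4, 6))

# != keeps every row except the matching ones
not_january <- filter(months_demo, month != 1)

print(q2_and)
print(not_january)
```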
  • Let's find the flights where the arrival delay is 0
on_time_arrival <- filter(flights, arr_delay == 0)
print(on_time_arrival)

# On time departures
on_time_departure <- filter(flights, dep_delay == 0)
print(on_time_departure)

# On-time arrival and on-time departure
on_time <- filter(flights, dep_delay == 0, arr_delay==0)
print(on_time)
  • Find the flights where both the arrival delay and the departure delay are greater than 120 minutes
gt_two_hours <- filter(flights, dep_delay > 120, arr_delay > 120)
  • Find the flights where the arrival delay is not less than 120 or the departure delay is not less than 120
cond_two_hours <- filter(flights, !(dep_delay < 120) | !(arr_delay  < 120))

  • Try to write the above expression using & (and)
cond_two_hours_1 <- filter(flights, (dep_delay < 120 & arr_delay  < 120))
  • Exercise: The row counts of cond_two_hours and cond_two_hours_1 do not match. Try to find the reason.
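  • One possible way to investigate the exercise is to try both conditions on a tiny tibble. The sketch below uses made-up data (delays_demo is hypothetical, not taken from flights). It relies on De Morgan's law, !a | !b is the same as !(a & b), and on the fact that filter() drops rows where the condition evaluates to NA.

```r
library(dplyr)

# Hypothetical delays in minutes, including missing values
delays_demo <- tibble(
  dep_delay = c(130, 10, 130, NA, 10),
  arr_delay = c(125, 5, NA, 125, NA)
)

# "or" form, as in cond_two_hours
or_form <- filter(delays_demo, !(dep_delay < 120) | !(arr_delay < 120))

# "and" form without an outer negation, as in cond_two_hours_1
and_no_not <- filter(delays_demo, dep_delay < 120 & arr_delay < 120)

# "and" form with the outer negation required by De Morgan's law
and_with_not <- filter(delays_demo, !(dep_delay < 120 & arr_delay < 120))

print(nrow(or_form))
print(nrow(and_no_not))
print(nrow(and_with_not))
```

  • In this sketch the negated & form returns the same rows as the | form, while the form without the negation returns the opposite set. The last row matches neither, because its condition is NA.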

  • Arrange rows with arrange(): filter() selects rows, whereas arrange() changes their order. The order is ascending by default; wrap a column in desc() for descending order

arrange(flights, desc(dep_delay))
arrange(flights, desc(arr_delay))
arrange(flights, dep_delay)
arrange(flights, arr_delay)
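  • Two details are easier to see on a small example: extra columns passed to arrange() break ties, and missing values are sorted to the end in both directions. The tibble sort_demo below is hypothetical.

```r
library(dplyr)

# Hypothetical data with a tie and a missing departure delay
sort_demo <- tibble(
  carrier   = c("AA", "UA", "AA", "DL"),
  dep_delay = c(15, NA, -3, 15),
  arr_delay = c(20, 5, -10, 2)
)

# Ascending and descending order; NA ends up last in both
asc_order  <- arrange(sort_demo, dep_delay)
desc_order <- arrange(sort_demo, desc(dep_delay))

# Ties in dep_delay are broken by ascending arr_delay
tie_break <- arrange(sort_demo, desc(dep_delay), arr_delay)

print(asc_order)
print(desc_order)
print(tie_break)
```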
  • Select columns with select()
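  • A minimal sketch of select() on flights, assuming dplyr and nycflights13 are installed. The object names by_name, by_range and dropped are chosen here for illustration.

```r
library(dplyr)
library(nycflights13)

# Pick columns by name
by_name <- select(flights, year, month, day)

# Pick every column from year to day (inclusive)
by_range <- select(flights, year:day)

# Keep everything except the columns from year to day
dropped <- select(flights, -(year:day))

print(by_name)
print(names(dropped))
```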
  • select() has helper functions for matching column names:
  • starts_with()
  • ends_with()
  • contains()
  • matches()
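  • The helpers can be tried on flights as follows. This is a sketch under the assumption that dplyr and nycflights13 are installed; the patterns and object names are only examples.

```r
library(dplyr)
library(nycflights13)

# Columns whose names begin with "dep"
dep_cols <- select(flights, starts_with("dep"))

# Columns whose names end with "delay"
delay_cols <- select(flights, ends_with("delay"))

# Columns whose names contain "time" anywhere
time_cols <- select(flights, contains("time"))

# Columns whose names match a regular expression
dep_arr_cols <- select(flights, matches("^(dep|arr)_"))

print(names(dep_cols))
print(names(delay_cols))
print(names(time_cols))
print(names(dep_arr_cols))
```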
  • Add new variables with mutate(): it adds new columns that are functions of existing columns
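  • A minimal sketch of mutate() on flights. The new column names gain and speed are chosen here for illustration; speed assumes distance is in miles and air_time in minutes, as described in the nycflights13 documentation.

```r
library(dplyr)
library(nycflights13)

# Work on a narrower tibble so the new columns are visible when printed
flights_sml <- select(flights, year:day, ends_with("delay"), distance, air_time)

# New columns are appended after the existing ones
flights_new <- mutate(flights_sml,
  gain  = dep_delay - arr_delay,     # minutes made up in the air
  speed = distance / air_time * 60   # miles per hour
)

print(flights_new)
```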