DataScience Classroomnotes 06/Jan/2022

Understanding the who dataset

  • The who dataset
    Preview
  • First three letters of the each columen denote whether the column contains new or old cases of TB
  • The next two letters describe type of TB
  • rel => relapse
  • ep => extrapulmonary TB
  • sn => Pulmonary TB diagnosed by a pulmonary smear (smear negative)
  • sp => Pulmonary TB diagnosed by a pulmonary smear (smear positive)
  • The sixth letter gives the gender of the TB patients (m, f)
  • The remaining numbers gives the age group. The dataset is divided into several age groups
  • 014 => 0-14 years old
  • 1524 => 15-24 years old
  • 2534 => 25-34 years old
  • 3544
  • 4554
  • 5564
  • 65 => 65 or older
  • We need to make a minor fix to the columns with name newrel as it is inconsistent, so lets make it new_rel
who2 <- who1 %>%
  mutate(
    variant=stringr::str_replace(variant, "newrel", "new_rel"))
who2
  • Now lets split the variant into new, type and sexage
    Preview
  • Now lets remove the columns, iso2 and iso3 and new as they are redundant
    Preview
  • Now lets seperate sex and age
who5 <- who4 %>%
  separate(sexage, c("sex", "age"), sep=1)
print(who5)
  • Now our dataset is tidy
  • Lets compare the original dataset
    Preview
    Preview
  • We have done the steps from who to who5 one step at time
who6 <- who %>%
  gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
  mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
  separate(code, c("new", "variant", "sexage"), sep="_") %>%
  select(-new, -iso2, -iso3) %>%
  separate(sexage, c("sex", "age"), sep=1) %>%
  mutate(age = stringr::str_replace(age, "014", "0014")) %>%
  mutate(age = stringr::str_replace(age, "65", "65118")) %>%
  separate(age, c("starting_age", "end_age"), sep = 2, convert = TRUE)

Relational Data with dplyr

  • It’s rare that a data analysis involves only a single table of data.
  • Typically we will have many tables of data, we need combine them to answer the questions
  • Relations are always defined between pairs of tables
  • To work with relational data we need verbs that work with pairs of tables, There are three families of verbs designed to work with relational data
  • Mutating Joins: which add new variables to one data frame from matching observations in other
  • Filtering Joins: which filter observations from one data frame based on whether or not they match an observation in other table
  • Set Operations, which treat observations as if they were set elements
  • Dplyr is easier to us than SQL
  • Now execute the following
library(tidyverse)
library(nycflights13)

airlines

airports

planes

weather

flights
  • airlines: lets us lookup the full carrier name
    Preview
  • airports: Gives the information about each airport identified by faa airport code
    Preview
  • planes: gives you information about each plane identified by tailnum
    Preview
  • weather: gives the weather about each NYC airport for each hour
    Preview
  • flights: information about flights arrived and departure at each NYC airport
    Preview

  • Keys:

  • Primary Key: This uniquely identifies and observation in its owntable planes$tailnum
  • Foreign Key: Uniquely identifies an observation in another table flights$tailnum
  • Count the number of planes to identify if it is primary or foreign key
    Preview

Mutating Joins

  • Now lets select some variables from flights
    Preview
  • Now lets try to join airlines and flights2
    Preview
  • Explore the following
flights2 %>% left_join(weather)
flights2 %>% left_join(planes, by="tailnum")

flights2 %>% left_join(airports, c("dest"="faa"))

flights2 %>% left_join(airports, c("origin"="faa"))

Join Types

  • Basic illustration
    Preview

  • Lets try to create two data frames
    Preview

  • inner_join
    Preview
  • left_join
    Preview
  • full_join: also referred as outer-join
    Preview
  • right_join:
    Preview

Leave a Comment

This site uses Akismet to reduce spam. Learn how your comment data is processed.

About continuous learner

devops & cloud enthusiastic learner