Understanding the who dataset
- The who dataset
- First three letters of the each columen denote whether the column contains new or old cases of TB
- The next two letters describe type of TB
- rel => relapse
- ep => extrapulmonary TB
- sn => Pulmonary TB diagnosed by a pulmonary smear (smear negative)
- sp => Pulmonary TB diagnosed by a pulmonary smear (smear positive)
- The sixth letter gives the gender of the TB patients (m, f)
- The remaining numbers gives the age group. The dataset is divided into several age groups
- 014 => 0-14 years old
- 1524 => 15-24 years old
- 2534 => 25-34 years old
- 3544
- 4554
- 5564
- 65 => 65 or older
- We need to make a minor fix to the columns with name
newrel
as it is inconsistent, so lets make it new_rel
who2 <- who1 %>%
mutate(
variant=stringr::str_replace(variant, "newrel", "new_rel"))
who2
- Now lets split the variant into new, type and sexage
- Now lets remove the columns, iso2 and iso3 and new as they are redundant
- Now lets seperate sex and age
who5 <- who4 %>%
separate(sexage, c("sex", "age"), sep=1)
print(who5)
- Now our dataset is tidy
- Lets compare the original dataset
- We have done the steps from who to who5 one step at time
who6 <- who %>%
gather(code, value, new_sp_m014:newrel_f65, na.rm = TRUE) %>%
mutate(code = stringr::str_replace(code, "newrel", "new_rel")) %>%
separate(code, c("new", "variant", "sexage"), sep="_") %>%
select(-new, -iso2, -iso3) %>%
separate(sexage, c("sex", "age"), sep=1) %>%
mutate(age = stringr::str_replace(age, "014", "0014")) %>%
mutate(age = stringr::str_replace(age, "65", "65118")) %>%
separate(age, c("starting_age", "end_age"), sep = 2, convert = TRUE)
Relational Data with dplyr
- It’s rare that a data analysis involves only a single table of data.
- Typically we will have many tables of data, we need combine them to answer the questions
- Relations are always defined between pairs of tables
- To work with relational data we need verbs that work with pairs of tables, There are three families of verbs designed to work with relational data
- Mutating Joins: which add new variables to one data frame from matching observations in other
- Filtering Joins: which filter observations from one data frame based on whether or not they match an observation in other table
- Set Operations, which treat observations as if they were set elements
- Dplyr is easier to us than SQL
- Now execute the following
library(tidyverse)
library(nycflights13)
airlines
airports
planes
weather
flights
- airlines: lets us lookup the full carrier name
- airports: Gives the information about each airport identified by faa airport code
- planes: gives you information about each plane identified by tailnum
- weather: gives the weather about each NYC airport for each hour
-
flights: information about flights arrived and departure at each NYC airport
-
Keys:
- Primary Key: This uniquely identifies and observation in its owntable
planes$tailnum
- Foreign Key: Uniquely identifies an observation in another table
flights$tailnum
- Count the number of planes to identify if it is primary or foreign key
Mutating Joins
- Now lets select some variables from flights
- Now lets try to join airlines and flights2
- Explore the following
flights2 %>% left_join(weather)
flights2 %>% left_join(planes, by="tailnum")
flights2 %>% left_join(airports, c("dest"="faa"))
flights2 %>% left_join(airports, c("origin"="faa"))
Join Types
-
Basic illustration
-
Lets try to create two data frames
- inner_join
- left_join
- full_join: also referred as outer-join
- right_join: