DataScience Classroomnotes 05/Jan/2022

Tidy Data

  • Certain Tables
  • There are three interrelated rules that make a dataset tidy:
  • Each variable must have its own column
  • Each observation must have its own row
  • Each value must have its own cell
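
As a small illustration of why these rules matter, the sketch below uses table1, one of the example tables that ships with tidyr and already follows all three rules. It assumes tidyr and dplyr are installed; the rate column and the per-10,000 scaling are illustrative choices, not something taken from the notes.

```r
library(tidyr)  # ships the example tables (table1, table2, ...)
library(dplyr)  # mutate() and the %>% pipe

# table1 is tidy: one column per variable, one row per country-year
rates <- table1 %>%
  mutate(rate = cases / population * 10000)  # cases per 10,000 people (illustrative)

print(rates)
```

Because every variable sits in its own column, the derived variable needs only a single mutate() call.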

Spreading and Gathering

  • The first step is always to figure out what the variables and observations are.
  • Sometimes this is easy; in other cases it is difficult
  • Typically a dataset will suffer from one of the following problems:
  • One variable might be spread across multiple columns
  • One observation might be scattered across multiple rows
  • To fix these problems, tidyr has two important functions:
  • gather()
  • spread()

Gathering

  • A common problem is a dataset where some of the column names are not names of variables but values of a variable
  • In table4a, the column names 1999 and 2000 represent values of the year variable
  • Each row represents two observations
  • To tidy a dataset like this, we need to gather those columns into a new pair of variables
  • The name of the variable whose values form the column names: let's call it the key; here it is year
  • The name of the variable whose values are spread over the cells: let's call it the value; here it is cases
  • Now let's apply the same approach to table4b, which holds the population
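
The gathering steps above can be sketched as follows, assuming tidyr and dplyr are installed. The result names tidy4a and tidy4b are illustrative. Note that gather() still works in current tidyr but is documented as superseded by pivot_longer(); this sketch keeps gather() to match the notes.

```r
library(tidyr)  # gather() and the table4a / table4b example tables
library(dplyr)  # the %>% pipe

# Column names 1999 and 2000 are not syntactic R names, so they need backticks
tidy4a <- table4a %>%
  gather(`1999`, `2000`, key = "year", value = "cases")

tidy4b <- table4b %>%
  gather(`1999`, `2000`, key = "year", value = "population")

print(tidy4a)
print(tidy4b)
```

Each original row becomes two rows, one per year. The new year column is character, because it is built from column names.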

Spreading

  • Spreading is the opposite of gathering. We use it when an observation is scattered across multiple rows
  • View table2. An observation is a country in a year, but each observation is spread across two rows
  • Solution: use the spread() function
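
A minimal sketch of the spread() call for table2, assuming tidyr and dplyr are installed. In table2 the type column holds the variable names (cases, population) and the count column holds their values; the result name tidy2 is illustrative.

```r
library(tidyr)  # spread() and the table2 example table
library(dplyr)  # the %>% pipe

# key   = column whose values become the new column names
# value = column whose values fill those new columns
tidy2 <- table2 %>%
  spread(key = type, value = count)

print(tidy2)
```

The two rows per country-year collapse into one row with separate cases and population columns.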

  • Consider the following simple data

library(tidyverse)  # provides tibble(), spread() and gather()

stocks <- tibble(
  year = c(2015, 2015, 2016, 2016),
  half = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)
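
One thing this small dataset can show is that spread() and gather() are not perfectly symmetrical. The sketch below, assuming tibble, tidyr and dplyr are installed, spreads stocks by year and then gathers it back; the names wide and round_trip are illustrative.

```r
library(tibble)
library(tidyr)
library(dplyr)

stocks <- tibble(
  year = c(2015, 2015, 2016, 2016),
  half = c(1, 2, 1, 2),
  return = c(1.88, 0.59, 0.92, 0.17)
)

# Spread: one column per year, one row per half
wide <- stocks %>%
  spread(key = year, value = return)

# Gather back: the year values now come from column names
round_trip <- wide %>%
  gather(`2015`, `2016`, key = "year", value = "return")

print(wide)
print(round_trip)
```

After the round trip the column order changes and year is a character column instead of a number, since column names carry no type information. Passing convert = TRUE to gather() asks it to guess a more suitable type.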

Separate

  • separate() pulls apart one column into multiple columns
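
A minimal sketch of separate(), assuming tidyr and dplyr are installed and using tidyr's table3 example table, whose rate column stores values such as "745/19987071". The result name tidy3 is illustrative.

```r
library(tidyr)  # separate() and the table3 example table
library(dplyr)  # the %>% pipe

tidy3 <- table3 %>%
  separate(
    rate,
    into = c("cases", "population"),  # names of the new columns
    sep = "/",                        # split at the slash
    convert = TRUE                    # turn the character pieces into numbers
  )

print(tidy3)
```

Without convert = TRUE the new columns stay character, because they are cut out of a character column.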

Unite

  • unite() is the inverse of separate(): it combines multiple columns into a single column
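
A minimal sketch of unite(), assuming tidyr and dplyr are installed and using tidyr's table5 example table, which stores the year split into century and year columns. The new column name full_year and the result name tidy5 are illustrative.

```r
library(tidyr)  # unite() and the table5 example table
library(dplyr)  # the %>% pipe

tidy5 <- table5 %>%
  unite(full_year, century, year, sep = "")  # default sep is "_", so override it

print(tidy5)
```

The united column is character; it can be converted afterwards, for example with as.integer(), if a numeric year is needed.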

Sample Activity

  • The who dataset records tuberculosis (TB) cases broken down by year, country, age, gender and diagnosis method
library(tidyverse)  # provides the who dataset (tidyr) and glimpse() (dplyr)
who
glimpse(who)
  • Let's gather the columns from new_sp_m014 to newrel_f65 and drop the NA values

who1 <- who %>%
  gather(
    new_sp_m014:newrel_f65,
    key = "variant",
    value = "cases",
    na.rm = TRUE
  )
print(who1)
  • Now let's try to understand what the dataset contains and carry out further analysis
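
One possible next step, sketched under assumptions: the variant codes appear to follow the pattern new_<type>_<sex><age> (for example new_sp_m014), except that the newrel codes lack the second underscore. The sketch below, assuming tidyr and dplyr are installed, repairs that inconsistency and then splits the code with separate(). The names who2, new, type, sexage, sex and age are illustrative.

```r
library(tidyr)
library(dplyr)

who1 <- who %>%
  gather(new_sp_m014:newrel_f65, key = "variant", value = "cases", na.rm = TRUE)

who2 <- who1 %>%
  # make "newrel_..." consistent with the other codes: "new_rel_..."
  mutate(variant = sub("newrel", "new_rel", variant)) %>%
  # split at underscores: prefix, diagnosis type, combined sex and age
  separate(variant, into = c("new", "type", "sexage"), sep = "_") %>%
  # split after the first character: "m014" becomes "m" and "014"
  separate(sexage, into = c("sex", "age"), sep = 1)

# Count the rows for each diagnosis type and sex
who2 %>% count(type, sex)
```

An integer sep tells separate() to split at a character position instead of at a separator string.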

