DataScience Classroomnotes 30/Dec/2021

Utilities in R

  • Consider the following functions
  • rep(): repeats a value or a vector a given number of times
  • seq(): generates a regular sequence of numbers
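  • A minimal sketch of rep() and seq() in use. The variable names and input values are made up for illustration; the expected results are shown as comments.

```r
# rep(): repeat a value or a whole vector
r1 <- rep(5, times = 3)            # 5 5 5
r2 <- rep(c("a", "b"), times = 2)  # "a" "b" "a" "b"
r3 <- rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"

# seq(): regular sequences, by step size or by desired length
s1 <- seq(from = 1, to = 10, by = 2)         # 1 3 5 7 9
s2 <- seq(from = 0, to = 1, length.out = 5)  # 0.00 0.25 0.50 0.75 1.00

print(r1); print(r2); print(r3)
print(s1); print(s2)
```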
  • Exercise: Create a sequence seq1 from 1 to 500 in steps of 3 and a second sequence seq2 from 1500 down to 1000 in steps of -7, then calculate the sum of both sequences
seq1 <- seq(from=1, to=500, by=3)
seq2 <- seq(from=1500, to=1000, by=-7)
sum(seq1, seq2)
  • To search text data we use regular expressions. R provides the following functions
  • grepl() => returns TRUE if the pattern is found in the corresponding character string
  • grep() => returns a vector of indices of the strings that contain the pattern
  • Exercise:
emails <- c("qt@gmail.com", "qt@qt.com", "qt@live.in", "qt@qt.org", "qt@qt.edu")
  • Now let's search for email ids containing “com”
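  • A small sketch of this search using the emails vector defined above. The result variable names are chosen for illustration only.

```r
emails <- c("qt@gmail.com", "qt@qt.com", "qt@live.in", "qt@qt.org", "qt@qt.edu")

# grepl() gives one TRUE/FALSE per element
has_com <- grepl("com", emails)   # TRUE TRUE FALSE FALSE FALSE

# grep() gives the positions of the matching elements
com_idx <- grep("com", emails)    # 1 2

# value = TRUE returns the matching strings instead of their positions
com_values <- grep("com", emails, value = TRUE)

print(has_com)
print(com_idx)
print(com_values)
```

  • Note that "com" matches anywhere in the string, so an address such as "comet@qt.org" would match as well.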
  • Now print all the email ids that contain the “com” pattern
print(emails[grep("com", emails)])

# using pipes
library(magrittr)
myindex <- function(n, vec) {
  vec[n]
}

grep("com", emails) %>% myindex(vec=emails)
  • Regular expressions
  • ^ => beginning of the string
  • $ => end of the string
  • .* => matches any character, zero or more times
  • Let's write a regular expression based filter that keeps only valid email ids
emails <- c("qt@gmail.com", "qt@qt.com", "qt@live.in", "qtqt.org", "qt@qtedu")
# print only valid email ids (pattern: local part, "@", domain, ".", letters-only suffix)
print(emails[grep("^[a-zA-Z0-9_.-]+@[a-zA-Z0-9]+\\.[a-zA-Z]+$", emails)])
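  • To see what ^, $ and .* do in isolation, here is a small sketch on a made-up vector of words (the words and variable names are illustrative assumptions, not part of the exercise).

```r
words <- c("data", "database", "metadata", "date")

starts_with_data <- grepl("^data", words)   # TRUE TRUE FALSE FALSE
ends_with_data   <- grepl("data$", words)   # TRUE FALSE TRUE FALSE

# starts with "d", ends with "e", anything (or nothing) in between
d_to_e <- grepl("^d.*e$", words)            # FALSE TRUE FALSE TRUE

print(starts_with_data)
print(ends_with_data)
print(d_to_e)
```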
  • Sometimes we need to replace text that matches a pattern
  • Consider the emails below and try to replace .edu with .education
emails <- c("qt@gmail.com", "qt@test.edu", "qt@live.in", "qt@qt.org", "qt@qt.edu")
  • For this kind of task we have two functions
  • sub(): replaces only the first match in each string
  • gsub(): replaces all matches in each string
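  • One possible solution to the .edu exercise, as a sketch. The pattern "\\.edu$" (a literal dot followed by edu at the end of the string) and the variable names are my own choices; other patterns would work too.

```r
emails <- c("qt@gmail.com", "qt@test.edu", "qt@live.in", "qt@qt.org", "qt@qt.edu")

# Each address ends in at most one ".edu", so sub() is enough here
updated <- sub("\\.edu$", ".education", emails)
print(updated)

# The difference between sub() and gsub() shows up with repeated matches
first_only <- sub("t", "T", "qt@test.edu")    # "qT@test.edu"
all_hits   <- gsub("t", "T", "qt@test.edu")   # "qT@TesT.edu"
print(first_only)
print(all_hits)
```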
  • Date Time formats
  • %Y: 4 digit year
  • %y: 2 digit year
  • %m: 2-digit month
  • %d: 2-digit day of month
  • %A: weekday (Monday)
  • %a: abbreviated weekday (Mon)
  • %B: month (March)
  • %b: abbreviated month (Mar)
  • Samples
str1 <- "August 15, 1947"

date1 <- as.Date(str1, format="%B %d, %Y")
print(date1)

str2 <- "2012-27-05"
date2 <- as.Date(str2, format="%Y-%d-%m")
print(date2)
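  • The same format codes also work in the other direction: format() turns a Date back into text. A short sketch, reusing the two dates from the samples above (the output variable names are illustrative).

```r
date1 <- as.Date("1947-08-15")
date2 <- as.Date("2012-05-27")

uk_style   <- format(date1, "%d/%m/%Y")   # "15/08/1947"
short_year <- format(date1, "%y")         # "47"
print(uk_style)
print(short_year)

# Name-based codes such as %A and %B depend on the session locale
print(format(date1, "%A, %d %B %Y"))

# Dates can be subtracted; the result is a difference in days
days_between <- as.numeric(date2 - date1)
print(days_between)
```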
  • Time formats
  • %H: hours as decimal number (00-23)
  • %I: hours as decimal number (01-12)
  • %M: minutes as decimal number
  • %S: seconds as decimal number
  • %T: shorthand notation for “%H:%M:%S”
  • %p: AM/PM indicator
  • Run ?strptime to see the full list of format codes
  • Look at the following sample
str3 <- "April 2, 11 hours:09 minutes:45 seconds:30 pm"
time3 <- as.POSIXct(str3, format="%B %d, %y hours:%I minutes:%M seconds:%S %p")
print(time3)
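
  • A further sketch with a more conventional timestamp layout, using %T and an explicit time zone. The timestamp, the tz = "UTC" setting and the variable names are assumptions made for illustration; without tz, R uses the local time zone.

```r
time4 <- as.POSIXct("2021-12-30 14:05:09", format = "%Y-%m-%d %T", tz = "UTC")
print(time4)

hour_min <- format(time4, "%H:%M")   # "14:05"
print(hour_min)

# 12-hour clock; the AM/PM text depends on the session locale
print(format(time4, "%I:%M:%S %p"))

# POSIXct values count seconds, so adding 3600 moves one hour ahead
one_hour_later <- time4 + 3600
print(one_hour_later)
```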
