Data Science Classroom Series – 08/Dec/2021

Thousand-Foot Overview of the Data Science Pipeline

  • Data Science is a field of study that focuses on extracting knowledge from data

  • To do this, a pipeline is created, which is made up of several phases

  • There is an acronym from Hilary Mason and Chris Wiggins that describes the data science pipeline => O.S.E.M.N

  • OSEMN Pipeline

    • O => Obtaining data
    • S => Scrubbing/cleaning data
    • E => Exploring/Visualizing data that will allow us to find trends and patterns
    • M => Modeling our data, which gives us predictive power
    • N => iNterpreting your data
  • Interesting example w.r.t. predictive power:

    • Walmart was able to predict that they would sell out of strawberry Pop-Tarts during the hurricane season in one of their stores. Through data mining, their historical records showed that the most popular item sold before a hurricane was Pop-Tarts.
  • Obtaining your data:

    • As a data scientist you cannot do anything without having data
    • As a rule of thumb, you must identify all of your available data sets (internal/external databases). This data should be in a usable format
    • Skills required:
      • Database Management: SQL (MySQL, Postgres), NoSQL (MongoDB)
      • Querying Relational Databases
      • Retrieving Unstructured Data: text, audio, video, documents
      • Distributed Storage: Hadoop, Apache Spark, etc.
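
As an illustration of querying a relational database, here is a minimal sketch using Python's built-in sqlite3 module. The `sales` table, its columns, and the rows in it are made up for this example; a real project would connect to its own MySQL or Postgres database instead of an in-memory one.

```python
import sqlite3


def load_sales(conn):
    """Return (item, total_units) rows, best-selling item first."""
    query = """
        SELECT item, SUM(units) AS total_units
        FROM sales
        GROUP BY item
        ORDER BY total_units DESC, item ASC
    """
    return conn.execute(query).fetchall()


# Hypothetical in-memory database standing in for a real internal data source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (item TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales (item, units) VALUES (?, ?)",
    [("pop-tarts", 120), ("water", 90), ("pop-tarts", 80), ("batteries", 40)],
)
conn.commit()

rows = load_sales(conn)
for item, total_units in rows:
    print(item, total_units)
```

The query does the aggregation inside the database, so only the summarized rows are pulled into Python.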
  • Scrubbing/Clean Your Data:

    • This is the most time-consuming stage and requires the most effort
    • Examining Data:
      • identify errors
      • identify missing values
      • identify corrupted records
    • Cleaning of Data:
      • replace, fill in, or discard missing data
    • Skills Required:
      • Scripting Languages: R, Python
      • Data Wrangling Tools: Python (pandas), R
      • Distributed Processing: Hadoop, MapReduce/Spark
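
The examine-then-clean steps above can be sketched in plain Python as follows. This is only one possible policy, chosen for illustration: corrupted values are thrown away and missing values are filled with the mean of the known ones. The `clean_records` helper and the sample records are hypothetical; in practice a library such as pandas is the usual tool for this.

```python
from statistics import mean


def clean_records(records, field):
    """Discard records whose value cannot be parsed; fill missing values with the mean."""
    parsed = []
    for rec in records:
        raw = rec.get(field)
        if raw is None or raw == "":
            # Missing value: keep the record and fill it in later.
            parsed.append({**rec, field: None})
            continue
        try:
            parsed.append({**rec, field: float(raw)})
        except (TypeError, ValueError):
            # Corrupted value: throw the record away.
            continue
    known = [r[field] for r in parsed if r[field] is not None]
    fill = mean(known) if known else None
    return [{**r, field: fill if r[field] is None else r[field]} for r in parsed]


# Made-up raw data: one missing value ("") and one corrupted value ("n/a").
raw = [
    {"store": "A", "units": "10"},
    {"store": "B", "units": ""},
    {"store": "C", "units": "n/a"},
    {"store": "D", "units": "30"},
]
cleaned = clean_records(raw, "units")
for rec in cleaned:
    print(rec)
```

Whether to fill or discard depends on the data set; filling with the mean keeps the row count up but can hide real variation.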
  • Exploring (Exploratory Data Analysis)

    • We try to understand what patterns and values our data has
    • We will be using different visualizations and statistical tests to back up our findings
    • Skills Required:
      • Python: NumPy, Matplotlib, pandas, SciPy
      • R: ggplot2, dplyr
      • Inferential Statistics
      • Experimental Design
      • Data Visualizations
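
To make the idea of backing up a pattern with statistics concrete, here is a small sketch that computes summary statistics and a Pearson correlation using only the standard library. The `temps` and `sales` numbers are invented for illustration, and the `summarize` and `pearson` helpers are hypothetical names; NumPy, pandas, or SciPy would normally provide these.

```python
from math import sqrt
from statistics import mean, median, stdev


def summarize(values):
    """Basic descriptive statistics for one numeric column."""
    return {"mean": mean(values), "median": median(values), "stdev": stdev(values)}


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up observations: daily temperature and units sold.
temps = [20, 22, 25, 27, 30]
sales = [110, 118, 131, 140, 152]

summary = summarize(temps)
r = pearson(temps, sales)
print(summary)
print("correlation:", round(r, 3))
```

A correlation close to 1 suggests the two columns rise together, which is the kind of trend a scatter plot would then be used to confirm visually.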
  • Modeling (Machine Learning)

    • Models are general rules in a statistical sense
    • We will be using machine learning algorithms to improve our predictive power
    • Skills Required:
      • Machine Learning: Supervised/Unsupervised algorithms
      • Evaluation Methods
      • Machine Learning Libraries: Python (scikit-learn) / R (caret)
      • Linear Algebra & Multivariate Calculus
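
As a minimal sketch of supervised learning plus an evaluation method, the following fits a straight line by ordinary least squares and scores it with mean squared error on held-out data. Simple linear regression is used here only because it is the smallest possible example; the data points and the helper names (`fit_line`, `predict`, `mean_squared_error`) are assumptions for illustration, and a library such as scikit-learn would be used in practice.

```python
from statistics import mean


def fit_line(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    mx, my = mean(xs), mean(ys)
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sum(
        (x - mx) ** 2 for x in xs
    )
    intercept = my - slope * mx
    return slope, intercept


def predict(model, x):
    """Predict y for a single x using a fitted (slope, intercept) pair."""
    slope, intercept = model
    return slope * x + intercept


def mean_squared_error(model, xs, ys):
    """Average squared difference between predictions and true values."""
    return mean((predict(model, x) - y) ** 2 for x, y in zip(xs, ys))


# Made-up data that follows y = 2x + 1, split into training and test sets.
train_x, train_y = [1, 2, 3, 4], [3, 5, 7, 9]
test_x, test_y = [5, 6], [11, 13]

model = fit_line(train_x, train_y)
mse = mean_squared_error(model, test_x, test_y)
print("slope, intercept:", model)
print("test MSE:", mse)
```

Evaluating on data the model has not seen is what separates predictive power from simply memorizing the training set.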
  • Interpreting (Data Storytelling)

    • Identify business insights
    • Visualize your findings accordingly
    • Skills Required:
      • Domain Knowledge
      • Data Visualization Tools: Matplotlib, ggplot2, Seaborn, Tableau, D3, etc.




