DataScience Classroomnotes 05/Feb/2022

Machine Learning Approaches

  • Traditionally divided into three categories:
    • Supervised Learning
    • Unsupervised Learning
    • Reinforcement Learning

Supervised Learning

  • The training data is labelled
  • For each instance in the training data, the right answers or results (labels) are given
  • Regression and Classification are typical supervised learning tasks
  • Focus: Predict the Future
  • Some of the most widely used supervised learning algorithms:
    • Linear Regression
    • Logistic Regression
    • Support Vector Machines (SVMs)
    • Decision Trees and Random Forests
    • K-Nearest Neighbors
    • Neural Networks
  • Regression:
    • Predict a numerical target value from a set of features (called predictors)
    • Examples: predict a car's price (from mileage, age, brand, etc.) or a house's price (from age, builder, etc.)
  • Classification:
    • Predict a class
    • Example: predict whether a new email is spam or not
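
The two supervised tasks above can be sketched in a few lines of plain Python. This is only an illustration, not a method from the class: the car and email numbers are made up, the regression uses ordinary least squares on a single feature, and the classifier is a 1-nearest-neighbor rule (a simple case of K-Nearest Neighbors from the list above). The names fit_line, predict_price and classify are hypothetical helpers.

```python
# Regression: hypothetical toy data, car age in years -> price in thousands.
ages = [1.0, 2.0, 3.0, 4.0, 5.0]
prices = [20.0, 18.0, 16.0, 14.0, 12.0]


def fit_line(xs, ys):
    """Ordinary least squares for one feature; returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict_price(model, age):
    """Apply the fitted line to a new, unseen car age."""
    slope, intercept = model
    return slope * age + intercept


# Classification: hypothetical labelled emails,
# features are (number of links, number of exclamation marks).
emails = [((0, 0), "ham"), ((1, 0), "ham"), ((7, 5), "spam"), ((9, 8), "spam")]


def classify(training, features):
    """Return the label of the closest training instance (1-nearest neighbor)."""
    def squared_distance(a, b):
        return sum((p - q) ** 2 for p, q in zip(a, b))

    nearest = min(training, key=lambda item: squared_distance(item[0], features))
    return nearest[1]


car_model = fit_line(ages, prices)
print("Predicted price for a 6-year-old car:", predict_price(car_model, 6.0))
print("Label for an email with 8 links and 6 '!':", classify(emails, (8, 6)))
```

In both cases the labels (prices, spam/ham) are part of the training data, which is what makes the task supervised.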

Unsupervised Learning

  • The training data is unlabelled
  • Instances in the training data are not presented with the right answers or outcomes
  • Typical tasks: Clustering, Dimensionality Reduction, Anomaly Detection
  • Clustering:
    • Organize the instances into meaningful subgroups (clusters) without prior knowledge of their group memberships
    • Example: let marketers discover groups based on shared interests, in order to market to each group differently
  • Dimensionality Reduction:
    • Data often has high dimensionality (a large number of features). The goal is to simplify the data by reducing the number of dimensions without losing too much information
    • One way is to merge several correlated features into one
    • Can improve the computational (and sometimes the predictive) performance of ML algorithms, and reduces storage requirements
  • Anomaly Detection:
    • The system is trained on normal instances; when it sees a new instance, it can tell whether it looks normal or is likely an anomaly
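
As a rough sketch of two of the unsupervised tasks above, the snippet below clusters unlabelled values with a plain k-means loop and flags anomalies with a z-score rule. Everything here is an assumption made for illustration: the spend and sensor numbers are invented, the starting centroids are hand-picked, and the 3-standard-deviation threshold is just a common rule of thumb. kmeans_1d, fit_normal_profile and is_anomaly are hypothetical names.

```python
import statistics


def kmeans_1d(values, centroids, iterations=10):
    """Plain k-means on 1-D data with caller-supplied starting centroids."""
    centroids = list(centroids)
    clusters = []
    for _ in range(iterations):
        # Assignment step: put each value in the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for v in values:
            nearest = min(range(len(centroids)), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            statistics.mean(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters


def fit_normal_profile(normal_values):
    """'Train' on normal instances only: remember their mean and spread."""
    return statistics.mean(normal_values), statistics.stdev(normal_values)


def is_anomaly(profile, value, threshold=3.0):
    """Flag a new instance that lies more than `threshold` std devs from the mean."""
    mean, std = profile
    return abs(value - mean) / std > threshold


# Hypothetical monthly spend of six customers; no group labels are given.
spend = [10.0, 12.0, 11.0, 60.0, 62.0, 61.0]
centroids, clusters = kmeans_1d(spend, centroids=[0.0, 100.0])
print("Cluster centers:", centroids)

# Hypothetical sensor readings that are all known to be normal.
profile = fit_normal_profile([20.0, 21.0, 19.0, 20.5, 19.5])
print("20.4 anomalous?", is_anomaly(profile, 20.4))
print("30.0 anomalous?", is_anomaly(profile, 30.0))
```

Note that neither function is ever told the right answer: the groups and the notion of "normal" are both inferred from the data itself.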

ML Approaches

  • Supervised Learning: Labelled data, direct feedback, predict outcome/future
  • Unsupervised Learning: No labels, no feedback, find hidden structures in data

Generalization

  • The ultimate objective of ML is for the model to perform well on new data; this ability is referred to as generalization.
  • How can we tell whether an ML model is generalizing well?
  • Split the data into two sets:
    • Training set: for training the model
    • Test set: for evaluating the generalization performance of the model
  • Evaluate performance on both the training data and the test data; a model that does well on the training data but much worse on the test data is not generalizing well
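
The split-and-evaluate procedure above could look roughly like this. It is a minimal sketch under stated assumptions: the data is synthetic (a noisy line), the 80/20 split ratio and the seeds are arbitrary choices, the model is a one-feature least-squares line, and mean squared error is used as the performance measure. train_test_split here is a hand-written hypothetical helper, not a library function.

```python
import random


def train_test_split(rows, test_fraction=0.2, seed=42):
    """Shuffle a copy of the rows and split it into (train, test)."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def fit_line(points):
    """Ordinary least squares on (x, y) pairs; returns (slope, intercept)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def mean_squared_error(model, points):
    """Average squared difference between predictions and true values."""
    slope, intercept = model
    return sum((slope * x + intercept - y) ** 2 for x, y in points) / len(points)


# Synthetic data (an assumption for illustration): y = 3x + 1 plus small noise.
rng = random.Random(0)
data = [(float(x), 3.0 * x + 1.0 + rng.uniform(-0.5, 0.5)) for x in range(20)]

train, test = train_test_split(data)
model = fit_line(train)                      # the model only ever sees the training set
train_mse = mean_squared_error(model, train)
test_mse = mean_squared_error(model, test)   # the test set stands in for "new data"

print("train size:", len(train), "test size:", len(test))
print("train MSE:", train_mse)
print("test MSE:", test_mse)
```

If the test error were much larger than the training error, that would be a sign the model is not generalizing well.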

Data Challenges

  • Insufficient Quantity of Training Data: Most ML algorithms require lots of data (e.g. image/speech recognition with Deep Learning)
  • Non-representative Training Data: In order to generalize well, it is crucial that your training data be representative of the new cases
  • Poor-Quality Data: Training data full of errors, outliers, and noise makes it harder to detect the underlying patterns, so the model is less likely to perform well
  • Irrelevant Features: An ML system can only learn if the training data contains enough relevant features and not too many irrelevant ones
  • Feature Engineering: Coming up with a good set of features to train on
    • Feature Selection
    • Feature Extraction
    • Create new features by gathering data
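
To make feature selection and feature extraction a bit more concrete, here is a small sketch. It rests on several assumptions that are not from the class: the car data is invented, selection keeps features whose absolute Pearson correlation with the target is at least 0.5 (an arbitrary cutoff), and extraction simply averages two standardized, correlated features into one (a simple stand-in for techniques such as PCA). All function and column names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length columns."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def select_features(features, target, min_abs_corr=0.5):
    """Feature selection: keep columns that correlate with the target."""
    return [
        name for name, column in features.items()
        if abs(pearson(column, target)) >= min_abs_corr
    ]


def standardize(column):
    """Rescale a column to zero mean and unit standard deviation."""
    mean = sum(column) / len(column)
    std = math.sqrt(sum((v - mean) ** 2 for v in column) / len(column))
    return [(v - mean) / std for v in column]


def merge_features(col_a, col_b):
    """Feature extraction: merge two correlated columns into a single one."""
    return [(a + b) / 2 for a, b in zip(standardize(col_a), standardize(col_b))]


# Hypothetical car data; paint_code is meant to be an irrelevant feature.
features = {
    "mileage_km": [10000.0, 30000.0, 50000.0, 70000.0, 90000.0],
    "age_years": [1.0, 2.0, 4.0, 5.0, 7.0],
    "paint_code": [3.0, 1.0, 2.0, 1.0, 3.0],
}
price = [20.0, 18.0, 15.0, 13.0, 10.0]

selected = select_features(features, price)
wear = merge_features(features["mileage_km"], features["age_years"])

print("Selected features:", selected)
print("Merged 'wear' feature:", wear)
```

Here the irrelevant column is dropped, and mileage and age (which move together) are replaced by a single combined feature.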
  • Next Steps:
    • Algorithmic challenges
