Data Science Classroom Series – 27/Nov/2021

Assumptions of Linear Regression

  • So far in Regression we have learnt
    • Correlation
    • Causation
    • Simple LR model
    • Multivariate LR model
    • Geometrical representation
    • SST, SSR, SSE
    • OLS
    • R-Squared
    • Adjusted R-Squared
    • F-test
  • Assumptions
    • Linearity
    • No endogeneity
    • Normality and homoscedasticity
    • No auto correlation
    • No multicollinearity
  • The biggest mistake you can make is to perform a regression that violates one of the above assumptions

Linearity

  • Linear regression is the simplest non-trivial relationship
ŷ = b0 + b1x1 + b2x2 + ... + bkxk          (estimated/sample equation)

y = β0 + β1x1 + β2x2 + ... + βkxk + ε      (population equation)
  • Representation
    • Linear regression is suitable
    • Linear regression is not suitable
  • Even in cases where a straight line is not suitable, we can fix the problem by (see the sketch after this list)
    • Running a non-linear regression
    • Applying a log transformation
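  • A minimal sketch of the log-transformation fix, assuming numpy and statsmodels are available; the exponential toy data and variable names are illustrative assumptions, not part of these notes

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data where y grows exponentially with x, so a straight line fits poorly
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 100)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.1, size=x.size)

# Fit on the raw scale, then on the log scale where the relationship becomes linear
raw = sm.OLS(y, sm.add_constant(x)).fit()
logged = sm.OLS(np.log(y), sm.add_constant(x)).fit()

print("R-squared, raw scale:", raw.rsquared)
print("R-squared, log scale:", logged.rsquared)
```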

No endogeneity

  • The covariance between the independent variables X and the error term ε should be zero: σ(X, ε) = 0
  • If an independent variable and the error term are correlated, this is a serious problem
  • This can happen due to omitted variable bias (you forget to include a relevant variable)
Y correlated with X  => y is explained by x
Y correlated with X* => y is also explained by the omitted variable x*
X correlated with X* => x and x* are somewhat correlated
Anything you do not include in the model ends up in the error term
X correlated with ε  => so x and ε end up correlated
  • Example: Price and Size Correlation of Flats in Hyderabad
    • Price = F(Size) (size in sqft)
    • Regression equation => y = 1134278 - 132100x
    • Why does bigger real estate look cheaper?
    • In this data we have samples of real estate in both developed and developing areas around Hyderabad
    • In this analysis we have not included whether a flat is in developed or developing Hyderabad
    • If we include this factor in the regression analysis, the formula becomes (a simulated illustration follows this example)
    y = 520365 + 78210 * x_size + 7126579 * x_developed
    x_developed = 1 if the flat is in a developed area, 0 if it is in a developing area
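  • A simulated sketch of omitted variable bias, assuming numpy and statsmodels; the coefficients and sample sizes below are assumptions chosen for the demo and only loosely echo the numbers quoted above

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical simulation: flats in developed areas are smaller on average but far
# more expensive, so a regression that omits the "developed" dummy biases the size coefficient
rng = np.random.default_rng(42)
n = 500
developed = rng.integers(0, 2, n)                      # 1 = developed area, 0 = developing
size = rng.normal(1400, 200, n) - 500 * developed      # developed-area flats are smaller
price = 500_000 + 100 * size + 2_000_000 * developed + rng.normal(0, 50_000, n)

biased = sm.OLS(price, sm.add_constant(size)).fit()    # omits the relevant variable
full = sm.OLS(price, sm.add_constant(np.column_stack([size, developed]))).fit()

print(biased.params)   # size coefficient comes out negative, like the puzzling result above
print(full.params)     # size coefficient returns to roughly the true +100 per sqft
```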
    

Normality and homoscedasticity

  • Expressed as ε ~ N(0, σ²)
  • Zero mean
  • Homoscedasticity => the error term has equal variance across observations
  • To understand this better, let's take an example of daily expenditure on meals
    • Middle class (low variability)
      • They cook rice or vegetables
      • Some days in the week, probably biryani
    • Rich class (high variability)
      • They spend on lavish 5-star restaurants
      • Some days they cook at home
  • To reduce heteroscedasticity we can (see the sketch below)
    • remove outliers
    • perform a log transformation
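  • A sketch of detecting heteroscedasticity with the Breusch-Pagan test from statsmodels, and retesting after a log transformation; the income/spend data is simulated purely for illustration

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

# Hypothetical data: the spread of daily meal spend grows with income (heteroscedastic)
rng = np.random.default_rng(7)
income = rng.uniform(20_000, 200_000, 300)
spend = 200 + 0.01 * income + rng.normal(0, 0.002 * income)   # error spread grows with income

X = sm.add_constant(income)
model = sm.OLS(spend, X).fit()

# Breusch-Pagan test: a small p-value indicates heteroscedasticity
print("p-value, raw spend:", het_breuschpagan(model.resid, X)[1])

# One fix from the notes: log-transform the dependent variable and retest
log_model = sm.OLS(np.log(spend), X).fit()
print("p-value, log spend:", het_breuschpagan(log_model.resid, X)[1])
```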

No auto correlation (Serial Correlation)

  • Expressed as σ(ε_i, ε_j) = 0, i.e. the errors of different observations are uncorrelated
  • Let's take an example of time-series data
    • Stock
      • changes every day
      • same underlying asset
    • Factors:
      • GDP
      • Tax Rate
      • Political Events
  • In the stock market, especially in developing countries, we have a day-of-the-week effect (high returns on Fridays, low returns on Mondays)
  • We have taken some factors into consideration, but external patterns still show up in the errors
  • To detect autocorrelation, plot the residuals and look for patterns (a detection sketch follows the solutions below)
  • Solution:
      • No remedy within OLS: avoid the linear regression model for serially correlated data
    • Alternatives:
      • Autoregressive model
      • Moving Average Model
      • Autoregressive moving average model
      • Autoregressive integrated moving average model
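  • A minimal detection sketch using the Durbin-Watson statistic from statsmodels; the AR(1) error series below is simulated and purely illustrative

```python
import numpy as np
import statsmodels.api as sm
from statsmodels.stats.stattools import durbin_watson

# Hypothetical daily series whose errors follow an AR(1) process (serially correlated)
rng = np.random.default_rng(3)
n = 250
t = np.arange(n)
errors = np.zeros(n)
for i in range(1, n):
    errors[i] = 0.7 * errors[i - 1] + rng.normal(0, 1)
y = 0.5 + 0.02 * t + errors

model = sm.OLS(y, sm.add_constant(t)).fit()

# Durbin-Watson statistic: values near 2 mean no autocorrelation,
# values near 0 (positive) or 4 (negative) signal serial correlation
print(durbin_watson(model.resid))
```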

No multicollinearity

  • The correlation between any two independent variables should not be (close to) 1
ρ(x_i, x_j) ≉ 1
  • Let's assume we have two independent variables (a, b)
a = 2 + 5 * b
b = (a - 2) / 5
ρ(a, b) = 1 (perfect multicollinearity)
  • In the above case, use only one of the two independent variables in the regression
  • Let's assume we have two independent variables (c, d)
ρ(c, d) = 0.9 (imperfect multicollinearity)
  • Example: Consider two bars in the same locality
    • In your locality people consume only beer
    • We need to do the regression analysis for market share
    Market share = F(P_half_pint_A, P_pint_A, P_pint_B)

    • Fix: drop one of the two collinear prices so your analysis stays significant (a VIF sketch follows this example)
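  • A sketch of quantifying multicollinearity with variance inflation factors (VIF) from statsmodels; the beer-price columns are simulated and their names are assumptions

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Hypothetical prices: half pint of A, pint of A, pint of B; the first two move
# almost in lockstep, so their VIFs explode
rng = np.random.default_rng(5)
n = 200
pint_a = rng.normal(5.0, 0.5, n)
half_pint_a = 0.5 * pint_a + rng.normal(0, 0.01, n)   # roughly half the pint price
pint_b = rng.normal(5.5, 0.5, n)

X = pd.DataFrame({"half_pint_a": half_pint_a, "pint_a": pint_a, "pint_b": pint_b})
X.insert(0, "const", 1.0)   # VIF calculations need the constant column included

for i, col in enumerate(X.columns):
    if col == "const":
        continue
    print(col, variance_inflation_factor(X.values, i))
# Fix from the notes: drop one of the two collinear price columns and refit
```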

Dealing with Categorical Data

  • When we have categorical variables and we need to perform regression, we create dummy variables to substitute for the categorical variables

  • Refer Here for the data set containing GPA and SAT Score

  • Now after the regression analysis, our summary gives the coefficients used in the formulas below

  • Formula with attendance

y = 0.8651 + 0.0013* SAT Score + 0.1184 * (dummy attendance)

# attended more than 75%
y = 0.9835 + 0.0013* SAT Score
# not attended more than 75%
y = 0.8651 + 0.0013* SAT Score


  • Now let's take the following example (both models are reproduced in the sketch after these examples)
JOHN => SAT Score 1700 & his attendance is less than 75% => 3.07 (predicted value)
Jennifer => SAT Score 1685 & has attended more than 75% of the classes => 3.17 (predicted value)

  • Formula without attendance considered
y = 0.275+ 0.0017 * SAT Score
  • Example:
JOHN => SAT Score 1700 => 3.16
Jennifer => SAT Score 1685 => 3.13
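  • A small sketch reproducing the two predictions with a pandas dummy variable; the column names and DataFrame layout are assumptions, while the coefficients are the ones quoted above

```python
import pandas as pd

# Reproduce the GPA predictions above using a 0/1 dummy for attendance
df = pd.DataFrame({
    "Name": ["John", "Jennifer"],
    "SAT": [1700, 1685],
    "Attendance": ["No", "Yes"],   # attended more than 75% of classes?
})
df["attended"] = (df["Attendance"] == "Yes").astype(int)   # dummy variable

# Model with the attendance dummy (coefficients from the summary above)
df["gpa_with_attendance"] = 0.8651 + 0.0013 * df["SAT"] + 0.1184 * df["attended"]

# Model without attendance considered
df["gpa_without_attendance"] = 0.275 + 0.0017 * df["SAT"]

print(df[["Name", "gpa_with_attendance", "gpa_without_attendance"]])
# Close to the values quoted above: ~3.07 / ~3.17 with attendance, ~3.16 / ~3.13 without
```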
