## Assumptions of Linear Regression

• So far in Regression we have learnt
• Correlation
• Causation
• Simple LR model
• Multivariate LR model
• Geometrical representation
• SST,SSR,SSE
• OLS
• R-Squared
• F-test
• Assumptions
• Linearity
• No endogeneity
• Normality and homoscedasticity
• No autocorrelation
• No multicollinearity
• The biggest mistake you can make is to perform a regression that violates one of the above assumptions

## Linearity

• Linear regression is the simplest non-trivial relationship
``````y = b0 + b1x1 + b2x2 + ... + bkxk

y = β0 + β1x1 + β2x2 + ... + βkxk + ε
``````
• Representation
• Linear regression is suitable when the data follows a linear pattern
• Linear regression is not suitable when the pattern is non-linear
• Even in the not-suitable cases we can fix this by
• Running a non-linear regression
• Log transformation
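The log-transformation fix above can be sketched with made-up data: when y grows exponentially in x, a straight line fits log(y) well even though it fits y poorly.

``````python
import numpy as np

# Minimal sketch (invented data): y = 2 * e^(0.5x) with small multiplicative
# noise is non-linear in x, but log(y) is linear in x.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = 2.0 * np.exp(0.5 * x) * rng.lognormal(0.0, 0.05, x.size)

# Fit log(y) = b0 + b1 * x; the recovered slope is close to the true 0.5
b1, b0 = np.polyfit(x, np.log(y), deg=1)
print(round(b1, 2), round(b0, 2))
``````

On the log scale the recovered slope is close to 0.5 and the intercept close to log(2), matching the generating process.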

## No endogeneity

• Expressed as zero covariance between X and the error ε
``````σ_Xε = 0
``````
• If the independent variables and the error are correlated, this is a serious problem
• This can happen due to omitted variable bias (you forget to include a relevant variable)
``````Y correlated with X  => y is explained by x
Y correlated with X* => y is explained by the omitted variable x*
X correlated with X* => X and X* are somewhat correlated
Anything you do not include ends up in the error
X correlated with ε  => X and ε are somewhat correlated
``````
• Example: Price and Size Correlation of Flats in Hyderabad
• Price = F(Size) (size in sqft)
• Regression equation => `y = 1134278 - 132100x `
• Why is bigger real estate cheaper?
• In this data we have samples of real estate in developed as well as developing areas around Hyderabad
• In this analysis we have not included whether a property is in developed or developing Hyderabad
• If we now include this factor in the regression analysis, the formula becomes
``````y = 520365 + 78210 * x_size + 7126579 * x_developed

x_developed = 1 in a developed city
            = 0 if developing
``````
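The bias from omitting the developed/developing dummy can be reproduced in a toy simulation (all numbers invented, not the actual Hyderabad data set): flats in developed areas are smaller but far more expensive, so leaving out the dummy flips the sign of the size coefficient.

``````python
import numpy as np

# Toy omitted-variable-bias simulation (made-up numbers for illustration)
rng = np.random.default_rng(1)
n = 500
developed = rng.integers(0, 2, n).astype(float)        # 1 = developed, 0 = developing
size = 2000 - 800 * developed + rng.normal(0, 100, n)  # sqft: smaller in developed areas
price = 100 * size + 5_000_000 * developed + rng.normal(0, 10_000, n)

# Omitting the dummy biases the size slope negative ("bigger is cheaper")
b_biased = np.polyfit(size, price, deg=1)[0]

# Including the dummy recovers the true size coefficient (~100 per sqft)
X = np.column_stack([np.ones(n), size, developed])
b_full, *_ = np.linalg.lstsq(X, price, rcond=None)
print(b_biased < 0, round(b_full[1]))
``````

The simple regression reports a negative price-per-sqft purely because the omitted dummy is correlated with both price and size.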

## Normality and homoscedasticity

• Expressed as `ε ~ N (0, σ^2 )`
• Zero mean
• Homoscedasticity => equal variance of the error across observations
• To understand this better, let's take an example of daily expenditure on meals
• Middle class (Low Variability)
• They cook rice or vegetables
• Some days in the week probably biryani
• Rich class (High Variability)
• They spend on lavish 5 star restaurants
• Some days they cook at home
• To solve the heteroscedasticity we can
• remove outlier
• perform log transformation
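The meals example and the log-transformation remedy can be sketched numerically (all figures invented): when spending noise scales with income, residuals on the raw scale fan out, while residuals on the log scale have roughly constant spread.

``````python
import numpy as np

# Toy heteroscedasticity sketch: multiplicative noise makes spread grow
# with income on the raw scale but stay constant on the log scale.
rng = np.random.default_rng(2)
income = np.linspace(20_000, 200_000, 300)
spend = 0.01 * income * rng.lognormal(0.0, 0.4, income.size)

# Raw-scale residuals: spread of the richer half dwarfs the poorer half
b_raw = np.polyfit(income, spend, 1)
resid_raw = spend - np.polyval(b_raw, income)
ratio_raw = resid_raw[150:].std() / resid_raw[:150].std()

# Log-log residuals: spread is roughly equal in both halves
b_log = np.polyfit(np.log(income), np.log(spend), 1)
resid_log = np.log(spend) - np.polyval(b_log, np.log(income))
ratio_log = resid_log[150:].std() / resid_log[:150].std()

print(round(ratio_raw, 1), round(ratio_log, 1))
``````

The raw-scale ratio is well above 1 (heteroscedastic) while the log-scale ratio is near 1, which is exactly why the log transformation is listed as a fix.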

## No auto correlation (Serial Correlation)

• Expressed as
``````σ_εiεj = 0 (for i ≠ j)
``````
• Lets take an example of time-series data
• Stock
• changes every day
• same underlying asset
• Factors:
• GDP
• Tax Rate
• Political Events
• In the stock market, especially in developing countries, we have a day-of-the-week effect (high returns on Fridays, low returns on Mondays)
• We have taken some factors into consideration but there are some external patterns happening
• To detect autocorrelation, use plots of the residuals to find patterns
• Solution:
• No Remedy: Avoid the linear regression model
• Alternatives:
• Autoregressive model
• Moving Average Model
• Autoregressive moving average model
• Autoregressive integrated moving average model
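Besides eyeballing residual plots, serial correlation can be checked numerically; a common summary is the Durbin-Watson statistic (not named in the notes, added here as a sketch): values near 2 indicate no autocorrelation, values near 0 indicate strong positive autocorrelation.

``````python
import numpy as np

# Durbin-Watson statistic on simulated residual series
def durbin_watson(resid):
    # DW = sum of squared successive differences / sum of squared residuals
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
white = rng.normal(0.0, 1.0, 500)       # independent residuals

ar1 = np.zeros(500)                     # AR(1) residuals with rho = 0.9
for t in range(1, 500):
    ar1[t] = 0.9 * ar1[t - 1] + rng.normal(0.0, 1.0)

print(round(durbin_watson(white), 2), round(durbin_watson(ar1), 2))
``````

The independent series scores close to 2, while the AR(1) series scores far below 2, flagging the serial correlation that motivates the AR/MA family of models listed above.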

## No multicollinearity

• The correlation between the independent variables should not be close to 1
``````ρ_xixj ≉ 1
``````
• Lets assume we have two independent variables (a,b)
``````a = 2 + 5 * b
b = (a - 2) / 5
ρ_ab = 1 (perfect multicollinearity)
``````
• In the above case use only one of the independent variables in the regression
• Let's assume we have two independent variables (c, d)
``````ρ_cd = 0.9 (imperfect multicollinearity)
``````
• Example: Consider two bars in the same locality
• In your locality people consume only beer
• We need to do the regression analysis for market share
``````Market share = F(P_halfpint_A, P_pint_A, P_pint_B)
``````
• Fix: Drop one of the two correlated prices for your analysis to be significant
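The beer-price example can be checked with the correlation matrix and a hand-rolled variance inflation factor (VIF, an addition not named in the notes); all prices below are invented for illustration.

``````python
import numpy as np

# Detecting multicollinearity: correlation matrix + VIF (invented data)
rng = np.random.default_rng(4)
n = 200
half_pint_a = rng.normal(50, 5, n)
pint_a = 2 * half_pint_a + rng.normal(0, 0.5, n)  # almost exactly 2x the half pint
pint_b = rng.normal(100, 10, n)                   # unrelated competitor price

X = np.column_stack([half_pint_a, pint_a, pint_b])

def vif(X, j):
    # VIF_j = 1 / (1 - R^2) from regressing column j on the other columns
    others = np.delete(X, j, axis=1)
    A = np.column_stack([np.ones(len(X)), others])
    coef, *_ = np.linalg.lstsq(A, X[:, j], rcond=None)
    resid = X[:, j] - A @ coef
    r2 = 1 - resid.var() / X[:, j].var()
    return 1 / (1 - r2)

print(round(np.corrcoef(X, rowvar=False)[0, 1], 3))  # near 1 for the two A prices
print(round(vif(X, 0)), round(vif(X, 2)))            # huge VIF vs ~1
``````

The two prices of bar A are almost perfectly correlated and show an enormous VIF, while bar B's price has a VIF near 1, which is why the fix is to drop one of the two correlated prices.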

## Dealing with Categorical Data

• When we have categorical variables and need to perform regression, we create dummy variables to substitute for the categorical variables

• Refer Here for the data set containing GPA and SAT Score

• Now after regression analysis we get the summary
• Formula with attendance considered

``````y = 0.8651 + 0.0013 * SAT Score + 0.1184 * (dummy attendance)

# attended more than 75%
y = 0.9835 + 0.0013 * SAT Score
# not attended more than 75%
y = 0.8651 + 0.0013 * SAT Score
``````
• Now let's take the following example
``````John     => SAT Score 1700 & attendance less than 75%              => 3.07 (predicted GPA)
Jennifer => SAT Score 1685 & attended more than 75% of the classes => 3.17 (predicted GPA)
``````
• Formula without attendance considered
``````y = 0.275+ 0.0017 * SAT Score
``````
• Example:
``````John     => SAT Score 1700 => 3.16
Jennifer => SAT Score 1685 => 3.13
``````
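The with-attendance predictions above follow directly from the quoted coefficients; a quick sketch (the helper name is made up) shows that the dummy simply shifts the intercept by 0.1184:

``````python
# Reproducing the notes' dummy-variable predictions from the quoted coefficients
def predicted_gpa(sat_score, attended_over_75pct):
    dummy = 1 if attended_over_75pct else 0
    return 0.8651 + 0.0013 * sat_score + 0.1184 * dummy

john = predicted_gpa(1700, False)      # ≈ 3.075 (the notes report 3.07)
jennifer = predicted_gpa(1685, True)   # ≈ 3.174 (the notes report 3.17)
print(john, jennifer)
``````

Note that Jennifer outranks John once attendance is included, while the attendance-free model ranks them the other way round.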
