Assumptions of Linear Regression
 So far in regression we have learned
 Correlation
 Causation
 Simple LR model
 Multivariate LR model
 Geometrical representation
 SST, SSR, SSE
 OLS
 R-squared
 Adjusted R-squared
 F-test
 Assumptions
 Linearity
 No endogeneity
 Normality and homoscedasticity
 No autocorrelation
 No multicollinearity
 The biggest mistake you can make is to perform a regression that violates one of the above assumptions
Linearity
 Linear regression is the simplest non-trivial relationship
ŷ = b0 + b1x1 + b2x2 + ... + bkxk (estimated model)
y = β0 + β1x1 + β2x2 + ... + βkxk + ε (population model)
 Representation
 Linear regression is suitable
 Linear regression is not suitable
 Even in the cases where it is not suitable, we can fix the problem by
 Running a non-linear regression
 Applying a log transformation
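As a sketch of the log-transformation fix, here is a minimal example on simulated exponential data (all numbers are hypothetical, not from the notes): a straight line fits the raw data poorly, but after taking log(y) the relationship becomes linear and the fit improves sharply.

```python
import numpy as np

# Hypothetical data: y grows multiplicatively with x, so a straight
# line is a bad fit but log(y) vs x is linear.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.05, size=x.size)

# Fit y = b0 + b1*x directly, and log(y) = c0 + c1*x after the transform.
b1, b0 = np.polyfit(x, y, 1)
c1, c0 = np.polyfit(x, np.log(y), 1)

def r_squared(actual, predicted):
    sse = np.sum((actual - predicted) ** 2)   # unexplained variability
    sst = np.sum((actual - actual.mean()) ** 2)  # total variability
    return 1 - sse / sst

r2_linear = r_squared(y, b0 + b1 * x)
r2_log = r_squared(np.log(y), c0 + c1 * x)
print(r2_linear, r2_log)  # the log-transformed model fits far better
```

The recovered slope `c1` is close to the true growth rate 0.5 used in the simulation, which is why the log transform is the standard fix for exponential-looking scatter plots.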
No endogeneity
 The covariance between the independent variables and the error term should be zero
σ_{Xε} = 0
 If the independent variables and the error term are correlated, this is a serious problem
 This can happen due to omitted variable bias (you forgot to include a relevant variable)
Y correlated with X => y is explained by x
Y correlated with X* => y is also explained by the omitted variable x*
X correlated with X* => x and x* are somewhat correlated
Anything you do not include in the model ends up in the error term, so
X correlated with ε => x and ε are somewhat correlated
 Example: Price and Size Correlation of Flats in Hyderabad
 Price = F(Size) (size in sqft)
 Regression equation =>
y = 1134278 - 132100x
 Why is bigger real estate cheaper?
 In this data set we have samples of real estate from both developed and developing areas around Hyderabad
 In this analysis we have not included whether a flat is in developed or developing Hyderabad
 If we now include this factor in the regression analysis, the formula becomes
y = 520365 + 78210 * size + 7126579 * developed
developed = 1 if in a developed city, 0 if developing
Normality and homoscedasticity
 Expressed as
ε ~ N (0, σ^2 )
 Zero mean
 Homoscedasticity => the errors have equal variance
 To understand this better, let's take the example of daily expenditure on meals
 Middle class (Low Variability)
 They cook rice or vegetables
 Some days in the week, probably biryani
 Rich class (High Variability)
 They spend on lavish 5-star restaurants
 Some days they cook at home
 To solve heteroscedasticity we can
 remove outliers
 perform a log transformation
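A minimal sketch of the log-transformation fix for heteroscedasticity, on simulated spending data (numbers hypothetical): the noise scales with income, so the raw residuals fan out, while after taking logs the residual spread is roughly equal across the range.

```python
import numpy as np

# Hypothetical meal-spending data: variability rises with income
# (the rich are erratic spenders), i.e. heteroscedasticity.
rng = np.random.default_rng(2)
income = np.linspace(10, 100, 200)
spend = income * rng.lognormal(0, 0.4, 200)  # noise scales with income

def residual_spread(x, y):
    # fit a line, then compare residual std. dev. in the two halves of x
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    half = x.size // 2
    return resid[:half].std(), resid[half:].std()

low_raw, high_raw = residual_spread(income, spend)
low_log, high_log = residual_spread(np.log(income), np.log(spend))

print(high_raw / low_raw)  # well above 1: unequal variance
print(high_log / low_log)  # near 1: roughly equal variance after logs
```

Comparing the residual spread in the low-income and high-income halves is a crude stand-in for a formal heteroscedasticity test, but it makes the effect of the transformation visible.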
No auto correlation (Serial Correlation)
 Expressed as
σ_{εi εj} = 0 (for i ≠ j)
 Let's take an example of time-series data
 Stock
 changes every day
 same underlying asset
 Factors:
 GDP
 Tax Rate
 Political Events
 In the stock market, especially in developing countries, we have a day-of-the-week effect (high returns on Fridays, low returns on Mondays)
 We have taken some factors into consideration, but there are still external patterns we have not modelled
 To detect autocorrelation, use residual plots to look for patterns
 Solution:
 No Remedy: Avoid the linear regression model
 Alternatives:
 Autoregressive model
 Moving Average Model
 Autoregressive moving average model
 Autoregressive integrated moving average model
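Besides plots, a common numeric check is the Durbin-Watson statistic (values near 2 mean no autocorrelation; values well below 2 suggest positive serial correlation). Here is a minimal sketch on simulated errors: independent noise versus AR(1) errors where each error carries over 80% of the previous one.

```python
import numpy as np

def durbin_watson(resid):
    # Durbin-Watson statistic: sum of squared successive differences
    # divided by the sum of squared residuals; ~2 means no autocorrelation.
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
white = rng.normal(0, 1, 500)  # independent errors

# AR(1) errors: each error inherits 80% of the previous one.
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal(0, 1)

print(durbin_watson(white))  # near 2
print(durbin_watson(ar))     # well below 2: positive serial correlation
```

When the statistic is far from 2, the alternatives listed above (AR, MA, ARMA, ARIMA) are the usual way forward.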
No multicollinearity
 The correlation between the independent variables should not be close to 1
ρ_{xi xj} ≉ 1
 Let's assume we have two independent variables (a, b)
a = 2 + 5 * b
b = (a - 2) / 5
ρ_{ab} = 1 (perfect multicollinearity)
 In the above case, use only one of the independent variables in the regression
 Let's assume we have two independent variables (c, d)
ρ_{cd} = 0.9 (imperfect multicollinearity)
 Example: Consider two bars in the same locality
 In your locality people consume only beer
 We need to do the regression analysis for market share
Market share = F(P_{1/2 pint A}, P_{pint A}, P_{pint B})
 Fix: drop one of the two correlated variables for your analysis to be significant
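The two cases above can be checked in code before fitting: compute the pairwise correlation between candidate predictors and drop one of any near-duplicated pair. A minimal sketch on simulated variables reusing the a = 2 + 5·b relationship (the noisy variable c is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
b = rng.normal(0, 1, 100)
a = 2 + 5 * b                     # perfect collinearity with b
c = b + rng.normal(0, 0.15, 100)  # imperfect (high) collinearity with b

rho_ab = np.corrcoef(a, b)[0, 1]
rho_cb = np.corrcoef(c, b)[0, 1]
print(rho_ab)  # 1: keep only one of a, b
print(rho_cb)  # close to 1: still worth dropping one of c, b
```

A linear transformation of a variable always has correlation exactly 1 with it, which is why `a` must never enter a regression alongside `b`.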
Dealing with Categorical Data

When we have categorical variables and need to perform regression, we create dummy variables to substitute for the categorical variables

Refer Here for the data set containing GPA and SAT Score

Now, after the regression analysis, our summary gives the following formulas

Formula with attendance
y = 0.8651 + 0.0013* SAT Score + 0.1184 * (dummy attendance)
# attended more than 75%
y = 0.9835 + 0.0013* SAT Score
# not attended more than 75%
y = 0.8651 + 0.0013* SAT Score
 Now let's take the following example
JOHN => SAT Score 1700 & he has less than 75% attendance => 3.07 (predicted value)
Jennifer => SAT Score 1685 & Has attended more than 75% of the classes => 3.17 (predicted value)
 Formula without attendance considered
y = 0.275+ 0.0017 * SAT Score
 Example:
JOHN => SAT Score 1700 => 3.16
Jennifer => SAT Score 1685 => 3.13
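The two formulas above can be applied directly in code (coefficients copied from the notes; the function names are my own):

```python
# Predicted GPA with the attendance dummy and without it,
# using the coefficients from the regression summaries above.
def gpa_with_attendance(sat, attended):
    # attended is the dummy: 1 if attendance > 75%, else 0
    return 0.8651 + 0.0013 * sat + 0.1184 * attended

def gpa_without_attendance(sat):
    return 0.275 + 0.0017 * sat

john = gpa_with_attendance(1700, attended=0)
jennifer = gpa_with_attendance(1685, attended=1)
print(john)      # 3.0751, the ≈3.07 from the notes
print(jennifer)  # 3.174, the ≈3.17 from the notes
print(gpa_without_attendance(1700))  # 3.165: John without the dummy
```

Note how dropping the attendance dummy reverses the ranking: John overtakes Jennifer, which is exactly the point of including the categorical variable.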