Assumptions of Linear Regression
 So far in regression we have learned
 Correlation
 Causation
 Simple LR model
 Multivariate LR model
 Geometrical representation
 SST, SSR, SSE
 OLS
 R-squared
 Adjusted R-squared
 F-test
 Assumptions
 Linearity
 No endogeneity
 Normality and homoscedasticity
 No autocorrelation
 No multicollinearity
 The biggest mistake you can make is to perform a regression that violates one of the above assumptions
Linearity
 Linear regression is the simplest non-trivial relationship
ŷ = b0 + b1x1 + b2x2 + ... + bkxk (estimated model)
y = β0 + β1x1 + β2x2 + ... + βkxk + ε (population model)
 Representation
 Linear regression is suitable
 Linear regression is not suitable
 Even in the cases where it is not suitable, we can fix the problem by
 Running a non-linear regression
 Applying a log transformation
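As a sketch of the log-transformation fix, here is a minimal example on simulated exponential data (all numbers are hypothetical, not from the notes): a straight line fits the raw data poorly, but after taking log(y) the relationship becomes linear and the fit improves sharply.

```python
import numpy as np

# Hypothetical data: y grows multiplicatively with x, so a straight
# line is a bad fit but log(y) vs x is linear.
rng = np.random.default_rng(0)
x = np.linspace(1, 10, 50)
y = np.exp(0.5 * x) * rng.lognormal(0, 0.05, size=x.size)

# Fit y = b0 + b1*x directly, and log(y) = c0 + c1*x after the transform.
b1, b0 = np.polyfit(x, y, 1)
c1, c0 = np.polyfit(x, np.log(y), 1)

def r_squared(actual, predicted):
    sse = np.sum((actual - predicted) ** 2)   # unexplained variability
    sst = np.sum((actual - actual.mean()) ** 2)  # total variability
    return 1 - sse / sst

r2_linear = r_squared(y, b0 + b1 * x)
r2_log = r_squared(np.log(y), c0 + c1 * x)
print(r2_linear, r2_log)  # the log-transformed model fits far better
```

The recovered slope `c1` is close to the true growth rate 0.5 used in the simulation, which is why the log transform is the standard fix for exponential-looking scatter plots.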
No endogeneity
 The covariance between the independent variables and the error term should be zero
σ_{Xε} = 0
 If the independent variables and the error term are correlated, this is a serious problem
 This can happen due to omitted variable bias (you forgot to include a relevant variable)
Y correlated with X => y is explained by x
Y correlated with X* => y is also explained by the omitted variable x*
X correlated with X* => x and x* are somewhat correlated
Anything you do not include in the model ends up in the error term, so
X correlated with ε => x and ε are somewhat correlated
 Example: Price and Size Correlation of Flats in Hyderabad
 Price = F(Size) (size in sqft)
 Regression equation =>
y = 1134278 - 132100x
 Why is bigger real estate cheaper?
 In this data set we have samples of real estate from both developed and developing areas around Hyderabad
 In this analysis we have not included whether a flat is in developed or developing Hyderabad
 If we now include this factor in the regression analysis, the formula becomes
y = 520365 + 78210 * size + 7126579 * developed
developed = 1 if in a developed city, 0 if developing
Normality and homoscedasticity
 Expressed as
ε ~ N (0, σ^2 )
 Zero mean
 Homoscedasticity => the errors have equal variance
 To understand this better, let's take the example of daily expenditure on meals
 Middle class (Low Variability)
 They cook rice or vegetables
 Some days in the week, probably biryani
 Rich class (High Variability)
 They spend on lavish 5-star restaurants
 Some days they cook at home
 To solve heteroscedasticity we can
 remove outliers
 perform a log transformation
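A minimal sketch of the log-transformation fix for heteroscedasticity, on simulated spending data (numbers hypothetical): the noise scales with income, so the raw residuals fan out, while after taking logs the residual spread is roughly equal across the range.

```python
import numpy as np

# Hypothetical meal-spending data: variability rises with income
# (the rich are erratic spenders), i.e. heteroscedasticity.
rng = np.random.default_rng(2)
income = np.linspace(10, 100, 200)
spend = income * rng.lognormal(0, 0.4, 200)  # noise scales with income

def residual_spread(x, y):
    # fit a line, then compare residual std. dev. in the two halves of x
    b1, b0 = np.polyfit(x, y, 1)
    resid = y - (b0 + b1 * x)
    half = x.size // 2
    return resid[:half].std(), resid[half:].std()

low_raw, high_raw = residual_spread(income, spend)
low_log, high_log = residual_spread(np.log(income), np.log(spend))

print(high_raw / low_raw)  # well above 1: unequal variance
print(high_log / low_log)  # near 1: roughly equal variance after logs
```

Comparing the residual spread in the low-income and high-income halves is a crude stand-in for a formal heteroscedasticity test, but it makes the effect of the transformation visible.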
No auto correlation (Serial Correlation)
 Expressed as
σ_{εi εj} = 0 (for i ≠ j)
 Let's take an example of time-series data
 Stock
 changes every day
 same underlying asset
 Factors:
 GDP
 Tax Rate
 Political Events
 In the stock market, especially in developing countries, we have a day-of-the-week effect (high returns on Fridays, low returns on Mondays)
 We have taken some factors into consideration, but there are still external patterns we have not modelled
 To detect autocorrelation, use residual plots to look for patterns
 Solution:
 No Remedy: Avoid the linear regression model
 Alternatives:
 Autoregressive model
 Moving Average Model
 Autoregressive moving average model
 Autoregressive integrated moving average model
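Besides plots, a common numeric check is the Durbin-Watson statistic (values near 2 mean no autocorrelation; values well below 2 suggest positive serial correlation). Here is a minimal sketch on simulated errors: independent noise versus AR(1) errors where each error carries over 80% of the previous one.

```python
import numpy as np

def durbin_watson(resid):
    # Durbin-Watson statistic: sum of squared successive differences
    # divided by the sum of squared residuals; ~2 means no autocorrelation.
    return np.sum(np.diff(resid) ** 2) / np.sum(resid ** 2)

rng = np.random.default_rng(3)
white = rng.normal(0, 1, 500)  # independent errors

# AR(1) errors: each error inherits 80% of the previous one.
ar = np.zeros(500)
for t in range(1, 500):
    ar[t] = 0.8 * ar[t - 1] + rng.normal(0, 1)

print(durbin_watson(white))  # near 2
print(durbin_watson(ar))     # well below 2: positive serial correlation
```

When the statistic is far from 2, the alternatives listed above (AR, MA, ARMA, ARIMA) are the usual way forward.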
No multicollinearity
 The correlation between the independent variables should not be close to 1
ρ_{xi xj} ≉ 1
 Let's assume we have two independent variables (a, b)
a = 2 + 5 * b
b = (a - 2) / 5
ρ_{ab} = 1 (perfect multicollinearity)
 In the above case, use only one of the independent variables in the regression
 Let's assume we have two independent variables (c, d)
ρ_{cd} = 0.9 (imperfect multicollinearity)
 Example: Consider two bars in the same locality
 In your locality people consume only beer
 We need to do the regression analysis for market share
Market share = F(P_{1/2 pint A}, P_{pint A}, P_{pint B})
 Fix: drop one of the two correlated variables for your analysis to be significant
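The two cases above can be checked in code before fitting: compute the pairwise correlation between candidate predictors and drop one of any near-duplicated pair. A minimal sketch on simulated variables reusing the a = 2 + 5·b relationship (the noisy variable c is hypothetical):

```python
import numpy as np

rng = np.random.default_rng(4)
b = rng.normal(0, 1, 100)
a = 2 + 5 * b                     # perfect collinearity with b
c = b + rng.normal(0, 0.15, 100)  # imperfect (high) collinearity with b

rho_ab = np.corrcoef(a, b)[0, 1]
rho_cb = np.corrcoef(c, b)[0, 1]
print(rho_ab)  # 1: keep only one of a, b
print(rho_cb)  # close to 1: still worth dropping one of c, b
```

A linear transformation of a variable always has correlation exactly 1 with it, which is why `a` must never enter a regression alongside `b`.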
Dealing with Categorical Data

When we have categorical variables and need to perform regression, we create dummy variables to substitute for the categorical variables

Refer Here for the data set containing GPA and SAT Score

Now, after the regression analysis, our summary gives the following formulas

Formula with attendance
y = 0.8651 + 0.0013* SAT Score + 0.1184 * (dummy attendance)
# attended more than 75%
y = 0.9835 + 0.0013* SAT Score
# not attended more than 75%
y = 0.8651 + 0.0013* SAT Score
 Now let's take the following example
JOHN => SAT Score 1700 & he has less than 75% attendance => 3.07 (predicted value)
Jennifer => SAT Score 1685 & Has attended more than 75% of the classes => 3.17 (predicted value)
 Formula without attendance considered
y = 0.275+ 0.0017 * SAT Score
 Example:
JOHN => SAT Score 1700 => 3.16
Jennifer => SAT Score 1685 => 3.13
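The two formulas above can be applied directly in code (coefficients copied from the notes; the function names are my own):

```python
# Predicted GPA with the attendance dummy and without it,
# using the coefficients from the regression summaries above.
def gpa_with_attendance(sat, attended):
    # attended is the dummy: 1 if attendance > 75%, else 0
    return 0.8651 + 0.0013 * sat + 0.1184 * attended

def gpa_without_attendance(sat):
    return 0.275 + 0.0017 * sat

john = gpa_with_attendance(1700, attended=0)
jennifer = gpa_with_attendance(1685, attended=1)
print(john)      # 3.0751, the ≈3.07 from the notes
print(jennifer)  # 3.174, the ≈3.17 from the notes
print(gpa_without_attendance(1700))  # 3.165: John without the dummy
```

Note how dropping the attendance dummy reverses the ranking: John overtakes Jennifer, which is exactly the point of including the categorical variable.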