楚新元 | All in R

Welcome to R Square

异方差:诊断与处理

楚新元 / 2021-09-06


对于模型,如果出现 \(Var(u_{i}) = σ_{i}^2\),即对于不同的样本点,随机误差项的方差不再是常数,而互不相同,则认为出现了异方差性(Heteroscedasticity)。古扎拉蒂《经济计量学精要》表 9-2 给出了 523 个工人的工资(每小时,美元)、教育(受教育年限)、经验(工龄) 等数据。本例只考虑工资与教育和经验的关系。

载入数据

options(digits = 4)
library(readxl)
library(dplyr)
library(kableExtra)
data = read_xls(
  "data/Table9_2.xls", 
  skip = 5, 
  n_max = 523
)
data %>% 
  select(Wage, Educ, Exper) %>% 
  slice(1:20) %>%  # 此处只展示前 20 个样本
  kable() %>% 
   kable_styling(
     bootstrap_options = "striped",
     font_size = 14
  )
Wage Educ Exper
5.10 8 21
4.95 9 42
6.67 12 1
4.00 12 4
7.50 12 17
13.07 13 9
4.45 10 27
19.47 12 9
13.28 16 11
8.75 12 9
11.35 12 17
11.50 12 19
6.50 8 27
6.25 9 30
19.98 9 29
7.30 12 37
8.00 7 44
22.20 12 26
3.65 11 16
20.55 12 33

建立回归模型

model = lm(Wage ~ Educ + Exper, data = data)
summary(model)
## 
## Call:
## lm(formula = Wage ~ Educ + Exper, data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -8.27  -2.85  -0.59   2.01  36.15 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -4.5245     1.2393   -3.65  0.00029 ***
## Educ          0.9130     0.0822   11.11  < 2e-16 ***
## Exper         0.0968     0.0177    5.46  7.2e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.62 on 520 degrees of freedom
## Multiple R-squared:  0.195,	Adjusted R-squared:  0.192 
## F-statistic:   63 on 2 and 520 DF,  p-value: <2e-16

$$ \widehat{Wage} = -4.5245 + 0.913 * Educ + 0.0968 * Exper $$

异方差检验

残差的图形检验

最直观的方法是作模型残差平方 \(e_{i}^2\) 关于每一个变量的散点图,可能只有某一个变量与 \(e_{i}^2\) 的散点图表现出异方差特征。比较便捷的方法是用 \(e_{i}^2\)\(Y_{i}\) 的估计值 \(\hat{Y_{i}}\) 作散点图。

resid = residuals(model)
plot(
  resid^2 ~ fitted(model), 
  xlab = expression(widehat(Wage)),
  ylab = expression(resid^2)
)

从图上可以看出残差平方与 \(\widehat{Wage}\) 之间存在某种线性关系,这些点并不是随机分布的,表明模型可能存在异方差。

White 异方差检验

# 建立辅助回归模型
model_aux = lm(
  I(resid^2) ~ Educ + Exper + I(Educ^2) + I(Exper^2) + I(Educ * Exper), 
  data = data
)
summary(model_aux)
## 
## Call:
## lm(formula = I(resid^2) ~ Educ + Exper + I(Educ^2) + I(Exper^2) + 
##     I(Educ * Exper), data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -42.3  -17.3  -10.4    1.1 1276.7 
## 
## Coefficients:
##                 Estimate Std. Error t value Pr(>|t|)
## (Intercept)      14.3830    71.3473    0.20     0.84
## Educ             -1.1833     9.1380   -0.13     0.90
## Exper            -1.4011     1.9121   -0.73     0.46
## I(Educ^2)         0.1686     0.3007    0.56     0.58
## I(Exper^2)        0.0271     0.0210    1.29     0.20
## I(Educ * Exper)   0.0222     0.1041    0.21     0.83
## 
## Residual standard error: 65.1 on 517 degrees of freedom
## Multiple R-squared:  0.0215,	Adjusted R-squared:  0.012 
## F-statistic: 2.27 on 5 and 517 DF,  p-value: 0.0465

# 直接调用 lmtest 包进行检验
library(lmtest)
bptest(
  model, 
  ~ Educ + Exper + I(Educ^2) + I(Exper^2) + I(Educ * Exper),
  data = data
)
## 
## 	studentized Breusch-Pagan test
## 
## data:  model
## BP = 11, df = 5, p-value = 0.05

在 5% 的置信水平下,辅助模型的 F 统计量的 p 值小于 0.05,说明整个模型显著,拒绝同方差原假设,White 异方差检验结果表明原模型存在异方差。

Breusch-Pagan 检验

bptest(model)
## 
## 	studentized Breusch-Pagan test
## 
## data:  model
## BP = 8.7, df = 2, p-value = 0.01

检验统计量的 p 值依然小于 0.05,Breusch-Pagan 检验结果也表明原模型存在异方差。

方差齐性检验

n = nrow(data)
m = ceiling(n / 2)
o = order(data$Wage)  # order把顺序号计算出来
resid1 = model$residuals[o[1:m]]
resid2 = model$residuals[o[-1:-m]]
var.test(resid1,resid2)
## 
## 	F test to compare two variances
## 
## data:  resid1 and resid2
## F = 0.26, num df = 261, denom df = 260, p-value <2e-16
## alternative hypothesis: true ratio of variances is not equal to 1
## 95 percent confidence interval:
##  0.2059 0.3350
## sample estimates:
## ratio of variances 
##             0.2626

方差齐性检验结果表明,前后两个部分方差比值等于 1 的概率非常小,因此拒绝方差比值为 1 的原假设,即原模型存在异方差。

异方差的修正(注意 R 的 weights 参数设置)

利用原模型计算 1 / |残差| 作为权重来修正

w = abs(model$residuals)
model_w = lm(
  I(Wage / w) ~ 0 + I(1 / w) + I(Educ / w) + I(Exper / w),
  data = data
)
summary(model_w)
## 
## Call:
## lm(formula = I(Wage/w) ~ 0 + I(1/w) + I(Educ/w) + I(Exper/w), 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##  -1.26  -0.99  -0.95   1.01   1.34 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## I(1/w)     -4.44207    0.09530   -46.6   <2e-16 ***
## I(Educ/w)   0.90621    0.00664   136.6   <2e-16 ***
## I(Exper/w)  0.09505    0.00177    53.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.998 on 520 degrees of freedom
## Multiple R-squared:  0.999,	Adjusted R-squared:  0.999 
## F-statistic: 1.74e+05 on 3 and 520 DF,  p-value: <2e-16

# model_w 模型的等价形式
model_glm = glm(
  Wage ~ Educ + Exper,
  weights = 1 / w ^ 2,
  data = data
)
summary(model_glm)
## 
## Call:
## glm(formula = Wage ~ Educ + Exper, data = data, weights = 1/w^2)
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -4.44207    0.09530   -46.6   <2e-16 ***
## Educ         0.90621    0.00664   136.6   <2e-16 ***
## Exper        0.09505    0.00177    53.7   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.9966)
## 
##     Null deviance: 19100.41  on 522  degrees of freedom
## Residual deviance:   518.23  on 520  degrees of freedom
## AIC: 2283
## 
## Number of Fisher Scoring iterations: 2

$$ \widehat{Wage} = -4.4421 + 0.9062 * Educ + 0.0951 * Exper $$

误差方法与 \(x_{i}\) 成比例:平方根变换

w = sqrt(data$Educ)  # 这里以 Educ 变量为例
model_x_sqrt = lm(
  I(Wage / w) ~ 0 + I(1 / w) + I(Educ / w) + I(Exper / w),
  data = data
)
summary(model_x_sqrt)
## 
## Call:
## lm(formula = I(Wage/w) ~ 0 + I(1/w) + I(Educ/w) + I(Exper/w), 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -2.262 -0.858 -0.183  0.567  9.653 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## I(1/w)      -2.6456     1.0769   -2.46    0.014 *  
## I(Educ/w)    0.7814     0.0718   10.89  < 2e-16 ***
## I(Exper/w)   0.0877     0.0164    5.36  1.3e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.26 on 520 degrees of freedom
## Multiple R-squared:  0.804,	Adjusted R-squared:  0.802 
## F-statistic:  709 on 3 and 520 DF,  p-value: <2e-16

$$ \widehat{Wage} = -2.6456 + 0.7814 * Educ + 0.0877 * Exper $$

误差方法与 \(x_{i}^2\) 成比例

w = data$Educ  # 这里以 Educ 变量为例
model_x = lm(
  I(Wage / w) ~ 0 + I(1 / w) + I(Educ / w) + I(Exper / w),
  data = data
)
summary(model_x)
## 
## Call:
## lm(formula = I(Wage/w) ~ 0 + I(1/w) + I(Educ/w) + I(Exper/w), 
##     data = data)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -0.651 -0.250 -0.056  0.167  2.582 
## 
## Coefficients:
##            Estimate Std. Error t value Pr(>|t|)    
## I(1/w)       0.0903     0.7622    0.12     0.91    
## I(Educ/w)    0.5854     0.0513   11.42  < 2e-16 ***
## I(Exper/w)   0.0709     0.0138    5.13  4.2e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.354 on 520 degrees of freedom
## Multiple R-squared:  0.804,	Adjusted R-squared:  0.803 
## F-statistic:  710 on 3 and 520 DF,  p-value: <2e-16

$$ \widehat{Wage} = 0.0903 + 0.5854 * Educ + 0.0709 * Exper $$

其实本例中模型解释变量有多个,在这种情形下,可以 \(\hat{Y_{i}}\) 作为变换变量,而不是某一个解释变量,因为 \(\hat{Y_{i}}\)\(X_{i}\) 的线性组合。

重新设定模型

通过对数变换压缩了变量的度量规模,因而有利于消除异方差。

model2 = lm(log(Wage) ~ log(Educ) + log(Exper), data = data)
summary(model2)
## 
## Call:
## lm(formula = log(Wage) ~ log(Educ) + log(Exper), data = data)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.1125 -0.3321  0.0195  0.3192  2.0636 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  -0.7946     0.2592   -3.07   0.0023 ** 
## log(Educ)     0.9573     0.0917   10.44  < 2e-16 ***
## log(Exper)    0.1662     0.0247    6.73  4.5e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.47 on 520 degrees of freedom
## Multiple R-squared:  0.194,	Adjusted R-squared:  0.191 
## F-statistic: 62.5 on 2 and 520 DF,  p-value: <2e-16

$$ \widehat{lnWage} = -0.7946 + 0.9573 * lnEduc + 0.1662 * lnExper $$

对数变换后仍然要进行异方差检验,检验步骤同上,此处不再赘述。