Домашнее задание 5.

В пятой домашке мы выясним, от чего зависит количество рабочих часов у замужних женщин. Для этого будем использовать набор данных Workinghours из пакета Ecdat. Прежде чем переходить к упражнениям, посмотрите описание переменных в справке :)

Упражнение 0.

Запустите следующие чанки кода :)

Для чего нужен первый из них?

library(tidyverse) # обработка данных, графики...
library(skimr) # описательные статистики
library(rio) # импорт фантастического количества форматов данных
library(broom) # метла превращает результаты оценивания моделей в таблички
library(GGally) # больше готовых графиков
library(sjPlot) # ещё больше графиков
library(lmtest) # диагностика линейных моделей
library(Ecdat) # много-много разных наборов данных
library(sjstats) # удобные мелкие функции для работы с моделями
library(sandwich) # оценка Var для гетероскедастичности
library(AER) # работа с инструментальными переменными

Упражнение 1.

Бросьте взгляд на данные Workinghours и посмотрите описательные статистики по ним.

work <- Workinghours
skim(work)

## Skim summary statistics
##  n obs: 3382 
##  n variables: 12 
## 
## Variable type: factor 
##    variable missing complete    n n_unique
##  occupation       0     3382 3382        4
##                             top_counts ordered
##  oth: 1314, swc: 1021, mp: 962, fr: 85   FALSE
## 
## Variable type: integer 
##   variable missing complete    n    mean     sd   p0 p25  p50     p75 p100
##        age       0     3382 3382   36.81  11.35   18  28   34   44      64
##    child13       0     3382 3382    0.56   0.86    0   0    0    1       5
##    child17       0     3382 3382    0.21   0.51    0   0    0    0       6
##     child5       0     3382 3382    0.51   0.76    0   0    0    1       4
##  education       0     3382 3382   12.55   2.42    0  12   12   14      17
##      hours       0     3382 3382 1135.48 892.59    0   0 1304 1944    5840
##     income       0     3382 3382  296.9  287.99 -139 146  247  368.75 7220
##   mortgage       0     3382 3382    0.53   0.5     0   0    1    1       1
##   nonwhite       0     3382 3382    0.3    0.46    0   0    0    1       1
##      owned       0     3382 3382    0.68   0.47    0   0    1    1       1
##      unemp       0     3382 3382    5.64   2.3     1   4    5    7      30
##      hist
##  ▃▇▇▆▂▂▂▂
##  ▇▃▁▂▁▁▁▁
##  ▇▂▁▁▁▁▁▁
##  ▇▃▁▁▁▁▁▁
##  ▁▁▁▁▂▇▃▃
##  ▇▃▇▁▁▁▁▁
##  ▇▁▁▁▁▁▁▁
##  ▇▁▁▁▁▁▁▇
##  ▇▁▁▁▁▁▁▃
##  ▃▁▁▁▁▁▁▇
##  ▅▇▂▁▁▁▁▁

glimpse(work)

## Observations: 3,382
## Variables: 12
## $ hours      <int> 2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0, 1432...
## $ income     <int> 350, 241, 160, 80, 456, 390, 181, 726, -5, 78, 195,...
## $ age        <int> 26, 29, 33, 20, 33, 22, 41, 31, 33, 30, 41, 23, 32,...
## $ education  <int> 12, 8, 10, 9, 12, 12, 9, 16, 12, 11, 12, 14, 16, 12...
## $ child5     <int> 0, 0, 0, 2, 0, 2, 0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 1, ...
## $ child13    <int> 1, 1, 2, 0, 2, 0, 0, 1, 3, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ child17    <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nonwhite   <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
## $ owned      <int> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, ...
## $ mortgage   <int> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...
## $ occupation <fct> swcc, other, swcc, other, swcc, other, swcc, mp, fr...
## $ unemp      <int> 7, 4, 7, 7, 7, 7, 7, 3, 4, 5, 7, 3, 7, 12, 12, 7, 7...

Все ли переменные имеют правильный тип?

Исправьте недоразумения по переменным, отвечающим за расу, владение домом и наличие ипотеки, и посмотрите, что в новых данных work_fct все типы указаны правильно.

work_fct <- work %>%
  mutate_at(vars(nonwhite, owned, mortgage), factor)
glimpse(work_fct)

## Observations: 3,382
## Variables: 12
## $ hours      <int> 2000, 390, 1900, 0, 3177, 0, 0, 1040, 2040, 0, 1432...
## $ income     <int> 350, 241, 160, 80, 456, 390, 181, 726, -5, 78, 195,...
## $ age        <int> 26, 29, 33, 20, 33, 22, 41, 31, 33, 30, 41, 23, 32,...
## $ education  <int> 12, 8, 10, 9, 12, 12, 9, 16, 12, 11, 12, 14, 16, 12...
## $ child5     <int> 0, 0, 0, 2, 0, 2, 0, 2, 0, 1, 0, 0, 0, 0, 1, 0, 1, ...
## $ child13    <int> 1, 1, 2, 0, 2, 0, 0, 1, 3, 1, 1, 0, 0, 0, 1, 0, 0, ...
## $ child17    <int> 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nonwhite   <fct> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, ...
## $ owned      <fct> 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 1, 0, 1, 1, 1, 1, 1, ...
## $ mortgage   <fct> 1, 1, 0, 1, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 1, 1, 1, ...
## $ occupation <fct> swcc, other, swcc, other, swcc, other, swcc, mp, fr...
## $ unemp      <int> 7, 4, 7, 7, 7, 7, 7, 3, 4, 5, 7, 3, 7, 12, 12, 7, 7...

Упражнение 2.

Постройте праную регрессию количества рабочих часов женщины на доход её семьи. Для этого и следующих заданий используйте исправленные данные work_fct.

model_r <- lm(data = work_fct, hours ~ income)
summary(model_r)

## 
## Call:
## lm(formula = hours ~ income, data = work_fct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1216.2 -1052.5   170.8   799.4  4759.4 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1216.5210    21.9633   55.39  < 2e-16 ***
## income        -0.2730     0.0531   -5.14  2.9e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 889.3 on 3380 degrees of freedom
## Multiple R-squared:  0.007757,   Adjusted R-squared:  0.007463 
## F-statistic: 26.42 on 1 and 3380 DF,  p-value: 2.9e-07

Какое p-значение у коэффициента при переменной income?
Найдите значение F-статистики для проверки гипотезы об адекватности модели в целом.
Упражнение 3.

Выведите отдельно информацию о:

коэффициентах модели;
общих характеристиках модели, умещающихся в одну строку;
коэффициентах модели с учётом поправки на гетероскедастичность.

tidy(model_r)

##          term    estimate   std.error statistic      p.value
## 1 (Intercept) 1216.521003 21.96333121 55.388729 0.000000e+00
## 2      income   -0.272962  0.05310292 -5.140244 2.899547e-07

glance(model_r)

##     r.squared adj.r.squared    sigma statistic      p.value df    logLik
## 1 0.007756559   0.007462995 889.2528  26.42211 2.899547e-07  2 -27762.92
##        AIC      BIC   deviance df.residual
## 1 55531.84 55550.22 2672804553        3380

coeftest(model_r, vcov. = vcovHC)

## 
## t test of coefficients:
## 
##                Estimate  Std. Error t value  Pr(>|t|)    
## (Intercept) 1216.521003   21.057223 57.7721 < 2.2e-16 ***
## income        -0.272962    0.047919 -5.6963 1.329e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Упражнение 4.

Визуализируйте линию регрессии для модели model_r.

ggplot(data = work_fct, aes(x = income, y = hours)) +
  geom_point(alpha = 0.2) +
  stat_smooth(method = 'lm')

Попробуйте поменять значение параметра alpha. За что он отвечает?
Упражнение 5.

Постройте модель с новыми регрессорами. Помимо доходов семьи income, включите возраст age, количество лет обучения education, индикатор для расы nonwhite и для наличия ипотеки mortgage.

model_ur <- lm(data = work_fct, hours ~ income + nonwhite + mortgage + education + age)
summary(model_ur)

## 
## Call:
## lm(formula = hours ~ income + nonwhite + mortgage + education + 
##     age, data = work_fct)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1558.4  -841.2   130.3   727.2  5022.0 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 467.89859  105.80318   4.422 1.01e-05 ***
## income       -0.43824    0.05628  -7.787 9.08e-15 ***
## nonwhite1    63.25313   33.57579   1.884   0.0597 .  
## mortgage1   206.66484   30.91475   6.685 2.69e-11 ***
## education    73.31487    6.64494  11.033  < 2e-16 ***
## age          -6.80271    1.37024  -4.965 7.23e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 858 on 3376 degrees of freedom
## Multiple R-squared:  0.07745,    Adjusted R-squared:  0.07608 
## F-statistic: 56.68 on 5 and 3376 DF,  p-value: < 2.2e-16

Каким получился \(R^2\) в новой модели?
Все ли коэффициенты оказались значимыми на 5%-ом уровне?
Упражнение 6.

Постройте 90%-ые доверительные интервалы для всех коэффициентов модели model_ur с учётом поправки на гетероскедастичность и визуализируйте результат.

Подсказка: чтобы поправка на гетероскедастичность отразилась на графике, нужно указать тип корректировки в аргументе se.

coefci(model_ur, vcov. = vcovHC, type = 'HC3', level = 0.9)

##                     5 %        95 %
## (Intercept) 295.6462428 640.1509452
## income       -0.5768899  -0.2995844
## nonwhite1     8.7036122 117.8026569
## mortgage1   154.1895857 259.1400983
## education    62.5075195  84.1222278
## age          -9.0873490  -4.5180794

plot_model(model_ur, ci.lvl = 0.9, se = 'HC3')

Упражнение 7.

Проведите тест Вальда для моделей model_r и model_ur, используя корректировку на гетероскедастичность.

waldtest(model_r, model_ur, vcov = vcovHC)

## Wald test
## 
## Model 1: hours ~ income
## Model 2: hours ~ income + nonwhite + mortgage + education + age
##   Res.Df Df      F    Pr(>F)    
## 1   3380                        
## 2   3376  4 64.991 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Какую модель следует предпочесть?
Упражнение 8.

С помощью регрессий постройте доверительные интервалы для математического ожидания дохода семьи и для разности математических ожиданий доходов семей с ипотекой и без.

model_mu <- lm(data = work, income ~ 1)
coefci(model_mu)

##                2.5 %  97.5 %
## (Intercept) 287.1878 306.607

model_diff <- lm(data = work, income ~ mortgage)
coefci(model_diff)

##                2.5 %   97.5 %
## (Intercept) 211.9712 239.4514
## mortgage    115.9618 153.7875

Упражнение 9.

С помощью регрессии проверьте, что наличие ипотеки и раса не влияют на доход семьи.

model_2anova <- lm(data = work, income ~ mortgage * nonwhite)
summary(model_2anova)

## 
## Call:
## lm(formula = income ~ mortgage * nonwhite, data = work)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -462.3 -129.0  -44.3   63.7 6959.0 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        260.957      8.726  29.906  < 2e-16 ***
## mortgage           128.378     11.448  11.214  < 2e-16 ***
## nonwhite           -93.969     14.248  -6.595 4.91e-11 ***
## mortgage:nonwhite  -34.002     21.152  -1.607    0.108    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 275.7 on 3378 degrees of freedom
## Multiple R-squared:  0.08462,    Adjusted R-squared:  0.08381 
## F-statistic: 104.1 on 3 and 3378 DF,  p-value: < 2.2e-16

anova_stats(model_2anova)

## # A tibble: 4 x 12
##   term     df  sumsq meansq statistic p.value  etasq partial.etasq omegasq
##   <chr> <dbl>  <dbl>  <dbl>     <dbl>   <dbl>  <dbl>         <dbl>   <dbl>
## 1 mort…     1 1.53e7 1.53e7    202.     0      0.055         0.056   0.054
## 2 nonw…     1 8.20e6 8.20e6    108.     0      0.029         0.031   0.029
## 3 mort…     1 1.96e5 1.96e5      2.58   0.108  0.001         0.001   0    
## 4 Resi…  3378 2.57e8 7.60e4     NA     NA     NA            NA      NA    
## # ... with 3 more variables: partial.omegasq <dbl>, cohens.f <dbl>,
## #   power <dbl>

Упражнение 10.

Для упражнений про инструментальные переменные будем использовать встроенный набор данных CollegeDistance из пакета AER. Посмотрите на эти данные и их описательные статистики.

data("CollegeDistance")
college <- CollegeDistance
skim(college)

## Skim summary statistics
##  n obs: 4739 
##  n variables: 14 
## 
## Variable type: factor 
##   variable missing complete    n n_unique
##  ethnicity       0     4739 4739        3
##   fcollege       0     4739 4739        2
##     gender       0     4739 4739        2
##       home       0     4739 4739        2
##     income       0     4739 4739        2
##   mcollege       0     4739 4739        2
##     region       0     4739 4739        2
##      urban       0     4739 4739        2
##                            top_counts ordered
##  oth: 3050, his: 903, afa: 786, NA: 0   FALSE
##             no: 3753, yes: 986, NA: 0   FALSE
##           fem: 2600, mal: 2139, NA: 0   FALSE
##             yes: 3887, no: 852, NA: 0   FALSE
##           low: 3374, hig: 1365, NA: 0   FALSE
##             no: 4088, yes: 651, NA: 0   FALSE
##            oth: 3796, wes: 943, NA: 0   FALSE
##            no: 3635, yes: 1104, NA: 0   FALSE
## 
## Variable type: numeric 
##   variable missing complete    n  mean   sd    p0   p25   p50   p75  p100
##   distance       0     4739 4739  1.8  2.3   0     0.4   1     2.5  20   
##  education       0     4739 4739 13.81 1.79 12    12    13    16    18   
##      score       0     4739 4739 50.89 8.7  28.95 43.92 51.19 57.77 72.81
##    tuition       0     4739 4739  0.81 0.34  0.26  0.48  0.82  1.13  1.4 
##      unemp       0     4739 4739  7.6  2.76  1.4   5.9   7.1   8.9  24.9 
##       wage       0     4739 4739  9.5  1.34  6.59  8.85  9.68 10.15 12.96
##      hist
##  ▇▂▁▁▁▁▁▁
##  ▇▃▂▂▁▃▁▁
##  ▁▅▆▇▇▇▃▁
##  ▅▅▂▆▅▂▇▂
##  ▂▇▆▂▁▁▁▁
##  ▂▃▅▃▇▁▂▁

glimpse(college)

## Observations: 4,739
## Variables: 14
## $ gender    <fct> male, female, male, male, female, male, female, fema...
## $ ethnicity <fct> other, other, other, afam, other, other, other, othe...
## $ score     <dbl> 39.15, 48.87, 48.74, 40.40, 40.48, 54.71, 56.07, 54....
## $ fcollege  <fct> yes, no, no, no, no, no, no, no, yes, no, no, no, no...
## $ mcollege  <fct> no, no, no, no, no, no, no, no, no, no, no, yes, no,...
## $ home      <fct> yes, yes, yes, yes, no, yes, yes, yes, yes, yes, yes...
## $ urban     <fct> yes, yes, yes, yes, yes, yes, no, no, yes, yes, yes,...
## $ unemp     <dbl> 6.2, 6.2, 6.2, 6.2, 5.6, 5.6, 7.2, 7.2, 5.9, 5.9, 5....
## $ wage      <dbl> 8.09, 8.09, 8.09, 8.09, 8.09, 8.09, 8.85, 8.85, 8.09...
## $ distance  <dbl> 0.2, 0.2, 0.2, 0.2, 0.4, 0.4, 0.4, 0.4, 3.0, 3.0, 3....
## $ tuition   <dbl> 0.88915, 0.88915, 0.88915, 0.88915, 0.88915, 0.88915...
## $ education <dbl> 12, 12, 12, 12, 13, 12, 13, 15, 13, 15, 12, 14, 15, ...
## $ income    <fct> high, low, low, low, low, low, low, low, low, low, h...
## $ region    <fct> other, other, other, other, other, other, other, oth...

Постройте обычную регрессию зарплаты wage на количество лет обучения education, этническую принадлежность ethnicity, пол gender и доход семьи income.

model_ols <- lm(data = college, wage ~ education + ethnicity + gender + income)
summary(model_ols)

## 
## Call:
## lm(formula = wage ~ education + ethnicity + gender + income, 
##     data = college)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2106 -0.8545  0.1908  0.7356  3.7829 
## 
## Coefficients:
##                    Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.726192   0.154450  62.973  < 2e-16 ***
## education         -0.003627   0.011007  -0.330  0.74177    
## ethnicityafam     -0.568147   0.053366 -10.646  < 2e-16 ***
## ethnicityhispanic -0.441342   0.050590  -8.724  < 2e-16 ***
## genderfemale      -0.056957   0.038587  -1.476  0.13999    
## incomehigh         0.117962   0.044015   2.680  0.00739 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.319 on 4733 degrees of freedom
## Multiple R-squared:  0.0372, Adjusted R-squared:  0.03618 
## F-statistic: 36.57 on 5 and 4733 DF,  p-value: < 2.2e-16

Что не так в этой модели?
Упражнение 11.

Постройте 2МНК регрессию, используя в качестве инструмента для переменной education расстояние до колледжа distance.

model_iv <- ivreg(data = college,  wage ~ education + ethnicity + gender + income | distance + ethnicity + gender + income)
summary(model_iv, diagnostics = TRUE)

## 
## Call:
## ivreg(formula = wage ~ education + ethnicity + gender + income | 
##     distance + ethnicity + gender + income, data = college)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -3.2515 -0.8784  0.1678  0.7580  3.8559 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        9.00464    1.76427   5.104 3.46e-07 ***
## education          0.04922    0.12920   0.381    0.703    
## ethnicityafam     -0.54849    0.07178  -7.641 2.59e-14 ***
## ethnicityhispanic -0.43331    0.05436  -7.971 1.95e-15 ***
## genderfemale      -0.05789    0.03875  -1.494    0.135    
## incomehigh         0.07460    0.11446   0.652    0.515    
## 
## Diagnostic tests:
##                   df1  df2 statistic  p-value    
## Weak instruments    1 4733    34.775 3.96e-09 ***
## Wu-Hausman          1 4732     0.169    0.681    
## Sargan              0   NA        NA       NA    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.322 on 4733 degrees of freedom
## Multiple R-Squared: 0.03251, Adjusted R-squared: 0.03149 
## Wald test:  36.4 on 5 and 4733 DF,  p-value: < 2.2e-16

Как можно проинтерпретировать коэффициент при переменной education?

Домашнее задание 5.

Июнь, 14, 2018