Regression table example
06_regression_example.Rmd
Setup
We start by loading a few helpful libraries.
# Load libraries
library(tidyverse, warn.conflicts = FALSE) # Data wrangling/ manipulation
library(arsenal) # Package with good table creation functions
library(tidycoRe) # Internal package for creating tables
library(flextable) # For formatting tables for presentation
library(broom) # For cleaning model results
library(glue) # For concatenating strings
Linear regression table
Using the demo dataset included in the tidycoRe package, we’re going to build a linear regression to predict BMI with the following predictors: age, gender, race, ethnicity, smoking status, drinking status, history of cancer, history of CVD, history of hypertension, history of hyperlipidemia, and history of diabetes.
lm2 <- lm(
bmi ~ age + female + white + hispanic + shx_smoke + shx_drink + hx_cancer +
hx_cvd + hx_htn + hx_hld + hx_diabetes,
data = demo_data
)
And here’s a summary of the results:
summary(lm2)
#>
#> Call:
#> lm(formula = bmi ~ age + female + white + hispanic + shx_smoke +
#> shx_drink + hx_cancer + hx_cvd + hx_htn + hx_hld + hx_diabetes,
#> data = demo_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.2699 -4.6741 -0.8572 3.7478 22.7053
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 35.69377 2.66691 13.384 < 2e-16 ***
#> age -0.11628 0.04650 -2.501 0.013424 *
#> femaleFemale 0.03388 1.10011 0.031 0.975473
#> whiteWhite 3.61611 1.39484 2.592 0.010428 *
#> hispanicYes 3.72449 2.80140 1.330 0.185609
#> shx_smokeFormer smoker -0.21779 1.18552 -0.184 0.854478
#> shx_smokeCurrent smoker -0.52670 1.76314 -0.299 0.765544
#> shx_drink<1 drink per week -1.00385 1.50030 -0.669 0.504413
#> shx_drink1-3 drinks per week -4.14258 1.65234 -2.507 0.013190 *
#> shx_drink4-6 drinks per week -3.05412 1.77738 -1.718 0.087709 .
#> shx_drink1-2 drinks per day -5.03727 1.97711 -2.548 0.011801 *
#> shx_drink>2 drinks per day 0.67882 3.06979 0.221 0.825279
#> hx_cancerYes -2.07835 4.61225 -0.451 0.652888
#> hx_cvdYes -0.26151 1.81825 -0.144 0.885824
#> hx_htnYes 4.19313 1.22987 3.409 0.000828 ***
#> hx_hldYes -1.07512 1.43113 -0.751 0.453632
#> hx_diabetesYes 1.86206 1.65515 1.125 0.262302
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.954 on 157 degrees of freedom
#> (26 observations deleted due to missingness)
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.1308
#> F-statistic: 2.627 on 16 and 157 DF, p-value: 0.001143
If we plan on generating multiple models using the same formula, we can save the variables as vectors and call on them using the “formulize” function from the “arsenal” package. We can see that these results are the same as the the previous model.
LHS <- "bmi"
RHS <- c("age",
"female",
"white",
"hispanic",
"shx_smoke",
"shx_drink",
"hx_cancer",
"hx_cvd",
"hx_htn",
"hx_hld",
"hx_diabetes")
lm2_0 <- lm(
formulize(LHS, RHS),
data = demo_data
)
summary(lm2_0)
#>
#> Call:
#> lm(formula = formulize(LHS, RHS), data = demo_data)
#>
#> Residuals:
#> Min 1Q Median 3Q Max
#> -15.2699 -4.6741 -0.8572 3.7478 22.7053
#>
#> Coefficients:
#> Estimate Std. Error t value Pr(>|t|)
#> (Intercept) 35.69377 2.66691 13.384 < 2e-16 ***
#> age -0.11628 0.04650 -2.501 0.013424 *
#> femaleFemale 0.03388 1.10011 0.031 0.975473
#> whiteWhite 3.61611 1.39484 2.592 0.010428 *
#> hispanicYes 3.72449 2.80140 1.330 0.185609
#> shx_smokeFormer smoker -0.21779 1.18552 -0.184 0.854478
#> shx_smokeCurrent smoker -0.52670 1.76314 -0.299 0.765544
#> shx_drink<1 drink per week -1.00385 1.50030 -0.669 0.504413
#> shx_drink1-3 drinks per week -4.14258 1.65234 -2.507 0.013190 *
#> shx_drink4-6 drinks per week -3.05412 1.77738 -1.718 0.087709 .
#> shx_drink1-2 drinks per day -5.03727 1.97711 -2.548 0.011801 *
#> shx_drink>2 drinks per day 0.67882 3.06979 0.221 0.825279
#> hx_cancerYes -2.07835 4.61225 -0.451 0.652888
#> hx_cvdYes -0.26151 1.81825 -0.144 0.885824
#> hx_htnYes 4.19313 1.22987 3.409 0.000828 ***
#> hx_hldYes -1.07512 1.43113 -0.751 0.453632
#> hx_diabetesYes 1.86206 1.65515 1.125 0.262302
#> ---
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#>
#> Residual standard error: 6.954 on 157 degrees of freedom
#> (26 observations deleted due to missingness)
#> Multiple R-squared: 0.2112, Adjusted R-squared: 0.1308
#> F-statistic: 2.627 on 16 and 157 DF, p-value: 0.001143
We generally want to report the coefficients and corresponding CIs and p-values from our regression. We can extract these using the “tidy” function from the “broom” package.
tidy(lm2, conf.int = TRUE)
#> # A tibble: 17 × 7
#> term estim…¹ std.e…² stati…³ p.value conf.…⁴ conf.…⁵
#> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 (Intercept) 35.7 2.67 13.4 9.66e-28 30.4 41.0
#> 2 age -0.116 0.0465 -2.50 1.34e- 2 -0.208 -0.0244
#> 3 femaleFemale 0.0339 1.10 0.0308 9.75e- 1 -2.14 2.21
#> 4 whiteWhite 3.62 1.39 2.59 1.04e- 2 0.861 6.37
#> 5 hispanicYes 3.72 2.80 1.33 1.86e- 1 -1.81 9.26
#> 6 shx_smokeFormer smoker -0.218 1.19 -0.184 8.54e- 1 -2.56 2.12
#> 7 shx_smokeCurrent smoker -0.527 1.76 -0.299 7.66e- 1 -4.01 2.96
#> 8 shx_drink<1 drink per week -1.00 1.50 -0.669 5.04e- 1 -3.97 1.96
#> 9 shx_drink1-3 drinks per week -4.14 1.65 -2.51 1.32e- 2 -7.41 -0.879
#> 10 shx_drink4-6 drinks per week -3.05 1.78 -1.72 8.77e- 2 -6.56 0.457
#> 11 shx_drink1-2 drinks per day -5.04 1.98 -2.55 1.18e- 2 -8.94 -1.13
#> 12 shx_drink>2 drinks per day 0.679 3.07 0.221 8.25e- 1 -5.38 6.74
#> 13 hx_cancerYes -2.08 4.61 -0.451 6.53e- 1 -11.2 7.03
#> 14 hx_cvdYes -0.262 1.82 -0.144 8.86e- 1 -3.85 3.33
#> 15 hx_htnYes 4.19 1.23 3.41 8.28e- 4 1.76 6.62
#> 16 hx_hldYes -1.08 1.43 -0.751 4.54e- 1 -3.90 1.75
#> 17 hx_diabetesYes 1.86 1.66 1.13 2.62e- 1 -1.41 5.13
#> # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
#> # ⁴conf.low, ⁵conf.high
Since tidy returns a data frame object, we can clean this up using dplyr functions as we normally would.
lm2_tidy <- tidy(lm2, conf.int = TRUE) %>%
mutate(
p.value = format.pval(p.value, digits = 2, eps = 0.001), # format p-value column
across(where(is.numeric), ~roundx(.x, 2)), # Round numeric columns to 2 digits
b_ci = glue("{estimate} ({conf.low}, {conf.high})") # Concatenate rounded coefficient plus lower and upper CI bounds
) %>%
select(term, b_ci, p.value) # Select columns to keep
lm2_tidy
#> # A tibble: 17 × 3
#> term b_ci p.value
#> <chr> <glue> <chr>
#> 1 (Intercept) 35.69 ( 30.43, 40.96) <0.001
#> 2 age -0.12 ( -0.21, -0.02) 0.013
#> 3 femaleFemale 0.03 ( -2.14, 2.21) 0.975
#> 4 whiteWhite 3.62 ( 0.86, 6.37) 0.010
#> 5 hispanicYes 3.72 ( -1.81, 9.26) 0.186
#> 6 shx_smokeFormer smoker -0.22 ( -2.56, 2.12) 0.854
#> 7 shx_smokeCurrent smoker -0.53 ( -4.01, 2.96) 0.766
#> 8 shx_drink<1 drink per week -1.00 ( -3.97, 1.96) 0.504
#> 9 shx_drink1-3 drinks per week -4.14 ( -7.41, -0.88) 0.013
#> 10 shx_drink4-6 drinks per week -3.05 ( -6.56, 0.46) 0.088
#> 11 shx_drink1-2 drinks per day -5.04 ( -8.94, -1.13) 0.012
#> 12 shx_drink>2 drinks per day 0.68 ( -5.38, 6.74) 0.825
#> 13 hx_cancerYes -2.08 (-11.19, 7.03) 0.653
#> 14 hx_cvdYes -0.26 ( -3.85, 3.33) 0.886
#> 15 hx_htnYes 4.19 ( 1.76, 6.62) <0.001
#> 16 hx_hldYes -1.08 ( -3.90, 1.75) 0.454
#> 17 hx_diabetesYes 1.86 ( -1.41, 5.13) 0.262
If we want goodness of fit statistics, we can use the “glance” function from the broom package.
lm2_glance <- glance(lm2) %>%
mutate(
across(where(is.numeric), ~roundx(.x, 2))
)
lm2_glance
#> # A tibble: 1 × 12
#> r.squ…¹ adj.r…² sigma stati…³ p.value df logLik AIC BIC devia…⁴ df.re…⁵
#> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr> <chr>
#> 1 0.21 0.13 6.95 2.63 0.00 16.00 -575.… 1186… 1243… 7592.88 157.00
#> # … with 1 more variable: nobs <chr>, and abbreviated variable names
#> # ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual
These results are still in plain-text in our console window. Let’s use the tidycoRe function, format_basictable, to convert this data frame to a flextable object, apply some default formats, and print.
format_basictable(lm2_tidy)
term |
b_ci |
p.value |
(Intercept) |
35.69 ( 30.43, 40.96) |
<0.001 |
age |
-0.12 ( -0.21, -0.02) |
0.013 |
femaleFemale |
0.03 ( -2.14, 2.21) |
0.975 |
whiteWhite |
3.62 ( 0.86, 6.37) |
0.010 |
hispanicYes |
3.72 ( -1.81, 9.26) |
0.186 |
shx_smokeFormer smoker |
-0.22 ( -2.56, 2.12) |
0.854 |
shx_smokeCurrent smoker |
-0.53 ( -4.01, 2.96) |
0.766 |
shx_drink<1 drink per week |
-1.00 ( -3.97, 1.96) |
0.504 |
shx_drink1-3 drinks per week |
-4.14 ( -7.41, -0.88) |
0.013 |
shx_drink4-6 drinks per week |
-3.05 ( -6.56, 0.46) |
0.088 |
shx_drink1-2 drinks per day |
-5.04 ( -8.94, -1.13) |
0.012 |
shx_drink>2 drinks per day |
0.68 ( -5.38, 6.74) |
0.825 |
hx_cancerYes |
-2.08 (-11.19, 7.03) |
0.653 |
hx_cvdYes |
-0.26 ( -3.85, 3.33) |
0.886 |
hx_htnYes |
4.19 ( 1.76, 6.62) |
<0.001 |
hx_hldYes |
-1.08 ( -3.90, 1.75) |
0.454 |
hx_diabetesYes |
1.86 ( -1.41, 5.13) |
0.262 |
After converting to a flextable object, we can further tweak its appearance by applying functions from the flextable package. As an example, let’s rename the columns.
format_basictable(lm2_tidy) %>%
set_header_labels(
term = "Parameter",
b_ci = "\U03B2 (95% CI)",
p.value = "p"
)
Parameter |
β (95% CI) |
p |
(Intercept) |
35.69 ( 30.43, 40.96) |
<0.001 |
age |
-0.12 ( -0.21, -0.02) |
0.013 |
femaleFemale |
0.03 ( -2.14, 2.21) |
0.975 |
whiteWhite |
3.62 ( 0.86, 6.37) |
0.010 |
hispanicYes |
3.72 ( -1.81, 9.26) |
0.186 |
shx_smokeFormer smoker |
-0.22 ( -2.56, 2.12) |
0.854 |
shx_smokeCurrent smoker |
-0.53 ( -4.01, 2.96) |
0.766 |
shx_drink<1 drink per week |
-1.00 ( -3.97, 1.96) |
0.504 |
shx_drink1-3 drinks per week |
-4.14 ( -7.41, -0.88) |
0.013 |
shx_drink4-6 drinks per week |
-3.05 ( -6.56, 0.46) |
0.088 |
shx_drink1-2 drinks per day |
-5.04 ( -8.94, -1.13) |
0.012 |
shx_drink>2 drinks per day |
0.68 ( -5.38, 6.74) |
0.825 |
hx_cancerYes |
-2.08 (-11.19, 7.03) |
0.653 |
hx_cvdYes |
-0.26 ( -3.85, 3.33) |
0.886 |
hx_htnYes |
4.19 ( 1.76, 6.62) |
<0.001 |
hx_hldYes |
-1.08 ( -3.90, 1.75) |
0.454 |
hx_diabetesYes |
1.86 ( -1.41, 5.13) |
0.262 |
We can use the same process for the goodness of fit table.
format_basictable(lm2_glance)
r.squared |
adj.r.squared |
sigma |
statistic |
p.value |
df |
logLik |
AIC |
BIC |
deviance |
df.residual |
nobs |
0.21 |
0.13 |
6.95 |
2.63 |
0.00 |
16.00 |
-575.40 |
1186.80 |
1243.66 |
7592.88 |
157.00 |
174.00 |
And change the column names with flextable::set_header_labels and add outer borders.
format_basictable(lm2_glance, borders = TRUE) %>%
set_header_labels(
r.squared = "R\U00B2",
adj.r.squared = "Adjusted R\U00B2",
sigma = "Standard error residuals",
statistic = "Test statistic",
p.value = "p",
df = "Degrees of Freedom",
logLik = "Log-likelihood",
deviance = "Deviance",
df.residual = "Degrees of Freedom residuals",
nobs = "# observations"
)
R² |
Adjusted R² |
Standard error residuals |
Test statistic |
p |
Degrees of Freedom |
Log-likelihood |
AIC |
BIC |
Deviance |
Degrees of Freedom residuals |
# observations |
0.21 |
0.13 |
6.95 |
2.63 |
0.00 |
16.00 |
-575.40 |
1186.80 |
1243.66 |
7592.88 |
157.00 |
174.00 |