Regression table example • tidycoRe

Setup

We start by loading a few helpful libraries.

# Load libraries
library(tidyverse, warn.conflicts = FALSE) # Data wrangling/ manipulation
library(arsenal)   # Package with good table creation functions
library(tidycoRe)  # Internal package for creating tables
library(flextable) # For formatting tables for presentation
library(broom)     # For cleaning model results
library(glue)      # For concatenating strings

Linear regression table

Using the demo dataset included in the tidycoRe package, we’re going to build a linear regression to predict BMI with the following predictors: age, gender, race, ethnicity, smoking status, drinking status, history of cancer, history of CVD, history of hypertension, history of hyperlipidemia, and history of diabetes.

lm2 <- lm(
  bmi ~ age + female + white + hispanic + shx_smoke + shx_drink + hx_cancer + 
        hx_cvd + hx_htn + hx_hld + hx_diabetes, 
  data = demo_data
  )

And here’s a summary of the results:

summary(lm2)
#> 
#> Call:
#> lm(formula = bmi ~ age + female + white + hispanic + shx_smoke + 
#>     shx_drink + hx_cancer + hx_cvd + hx_htn + hx_hld + hx_diabetes, 
#>     data = demo_data)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.2699  -4.6741  -0.8572   3.7478  22.7053 
#> 
#> Coefficients:
#>                              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                  35.69377    2.66691  13.384  < 2e-16 ***
#> age                          -0.11628    0.04650  -2.501 0.013424 *  
#> femaleFemale                  0.03388    1.10011   0.031 0.975473    
#> whiteWhite                    3.61611    1.39484   2.592 0.010428 *  
#> hispanicYes                   3.72449    2.80140   1.330 0.185609    
#> shx_smokeFormer smoker       -0.21779    1.18552  -0.184 0.854478    
#> shx_smokeCurrent smoker      -0.52670    1.76314  -0.299 0.765544    
#> shx_drink<1 drink per week   -1.00385    1.50030  -0.669 0.504413    
#> shx_drink1-3 drinks per week -4.14258    1.65234  -2.507 0.013190 *  
#> shx_drink4-6 drinks per week -3.05412    1.77738  -1.718 0.087709 .  
#> shx_drink1-2 drinks per day  -5.03727    1.97711  -2.548 0.011801 *  
#> shx_drink>2 drinks per day    0.67882    3.06979   0.221 0.825279    
#> hx_cancerYes                 -2.07835    4.61225  -0.451 0.652888    
#> hx_cvdYes                    -0.26151    1.81825  -0.144 0.885824    
#> hx_htnYes                     4.19313    1.22987   3.409 0.000828 ***
#> hx_hldYes                    -1.07512    1.43113  -0.751 0.453632    
#> hx_diabetesYes                1.86206    1.65515   1.125 0.262302    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.954 on 157 degrees of freedom
#>   (26 observations deleted due to missingness)
#> Multiple R-squared:  0.2112, Adjusted R-squared:  0.1308 
#> F-statistic: 2.627 on 16 and 157 DF,  p-value: 0.001143

If we plan on generating multiple models using the same formula, we can save the variables as vectors and call on them using the “formulize” function from the “arsenal” package. We can see that these results are the same as the the previous model.

LHS <- "bmi"
RHS <- c("age", 
         "female", 
         "white", 
         "hispanic", 
         "shx_smoke", 
         "shx_drink", 
         "hx_cancer", 
         "hx_cvd", 
         "hx_htn", 
         "hx_hld", 
         "hx_diabetes")

lm2_0 <- lm(
  formulize(LHS, RHS),
  data = demo_data
)

summary(lm2_0)
#> 
#> Call:
#> lm(formula = formulize(LHS, RHS), data = demo_data)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -15.2699  -4.6741  -0.8572   3.7478  22.7053 
#> 
#> Coefficients:
#>                              Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)                  35.69377    2.66691  13.384  < 2e-16 ***
#> age                          -0.11628    0.04650  -2.501 0.013424 *  
#> femaleFemale                  0.03388    1.10011   0.031 0.975473    
#> whiteWhite                    3.61611    1.39484   2.592 0.010428 *  
#> hispanicYes                   3.72449    2.80140   1.330 0.185609    
#> shx_smokeFormer smoker       -0.21779    1.18552  -0.184 0.854478    
#> shx_smokeCurrent smoker      -0.52670    1.76314  -0.299 0.765544    
#> shx_drink<1 drink per week   -1.00385    1.50030  -0.669 0.504413    
#> shx_drink1-3 drinks per week -4.14258    1.65234  -2.507 0.013190 *  
#> shx_drink4-6 drinks per week -3.05412    1.77738  -1.718 0.087709 .  
#> shx_drink1-2 drinks per day  -5.03727    1.97711  -2.548 0.011801 *  
#> shx_drink>2 drinks per day    0.67882    3.06979   0.221 0.825279    
#> hx_cancerYes                 -2.07835    4.61225  -0.451 0.652888    
#> hx_cvdYes                    -0.26151    1.81825  -0.144 0.885824    
#> hx_htnYes                     4.19313    1.22987   3.409 0.000828 ***
#> hx_hldYes                    -1.07512    1.43113  -0.751 0.453632    
#> hx_diabetesYes                1.86206    1.65515   1.125 0.262302    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 6.954 on 157 degrees of freedom
#>   (26 observations deleted due to missingness)
#> Multiple R-squared:  0.2112, Adjusted R-squared:  0.1308 
#> F-statistic: 2.627 on 16 and 157 DF,  p-value: 0.001143

We generally want to report the coefficients and corresponding CIs and p-values from our regression. We can extract these using the “tidy” function from the “broom” package.

tidy(lm2, conf.int = TRUE)
#> # A tibble: 17 × 7
#>    term                         estim…¹ std.e…² stati…³  p.value conf.…⁴ conf.…⁵
#>    <chr>                          <dbl>   <dbl>   <dbl>    <dbl>   <dbl>   <dbl>
#>  1 (Intercept)                  35.7     2.67   13.4    9.66e-28  30.4   41.0   
#>  2 age                          -0.116   0.0465 -2.50   1.34e- 2  -0.208 -0.0244
#>  3 femaleFemale                  0.0339  1.10    0.0308 9.75e- 1  -2.14   2.21  
#>  4 whiteWhite                    3.62    1.39    2.59   1.04e- 2   0.861  6.37  
#>  5 hispanicYes                   3.72    2.80    1.33   1.86e- 1  -1.81   9.26  
#>  6 shx_smokeFormer smoker       -0.218   1.19   -0.184  8.54e- 1  -2.56   2.12  
#>  7 shx_smokeCurrent smoker      -0.527   1.76   -0.299  7.66e- 1  -4.01   2.96  
#>  8 shx_drink<1 drink per week   -1.00    1.50   -0.669  5.04e- 1  -3.97   1.96  
#>  9 shx_drink1-3 drinks per week -4.14    1.65   -2.51   1.32e- 2  -7.41  -0.879 
#> 10 shx_drink4-6 drinks per week -3.05    1.78   -1.72   8.77e- 2  -6.56   0.457 
#> 11 shx_drink1-2 drinks per day  -5.04    1.98   -2.55   1.18e- 2  -8.94  -1.13  
#> 12 shx_drink>2 drinks per day    0.679   3.07    0.221  8.25e- 1  -5.38   6.74  
#> 13 hx_cancerYes                 -2.08    4.61   -0.451  6.53e- 1 -11.2    7.03  
#> 14 hx_cvdYes                    -0.262   1.82   -0.144  8.86e- 1  -3.85   3.33  
#> 15 hx_htnYes                     4.19    1.23    3.41   8.28e- 4   1.76   6.62  
#> 16 hx_hldYes                    -1.08    1.43   -0.751  4.54e- 1  -3.90   1.75  
#> 17 hx_diabetesYes                1.86    1.66    1.13   2.62e- 1  -1.41   5.13  
#> # … with abbreviated variable names ¹estimate, ²std.error, ³statistic,
#> #   ⁴conf.low, ⁵conf.high

Since tidy returns a data frame object, we can clean this up using dplyr functions as we normally would.

lm2_tidy  <- tidy(lm2, conf.int = TRUE) %>% 
  mutate(
    p.value = format.pval(p.value, digits = 2, eps = 0.001), # format p-value column
    across(where(is.numeric), ~roundx(.x, 2)), # Round numeric columns to 2 digits
    b_ci = glue("{estimate} ({conf.low}, {conf.high})") # Concatenate rounded coefficient plus lower and upper CI bounds
    ) %>% 
  select(term, b_ci, p.value) # Select columns to keep

lm2_tidy
#> # A tibble: 17 × 3
#>    term                         b_ci                  p.value
#>    <chr>                        <glue>                <chr>  
#>  1 (Intercept)                  35.69 ( 30.43, 40.96) <0.001 
#>  2 age                          -0.12 ( -0.21, -0.02) 0.013  
#>  3 femaleFemale                  0.03 ( -2.14,  2.21) 0.975  
#>  4 whiteWhite                    3.62 (  0.86,  6.37) 0.010  
#>  5 hispanicYes                   3.72 ( -1.81,  9.26) 0.186  
#>  6 shx_smokeFormer smoker       -0.22 ( -2.56,  2.12) 0.854  
#>  7 shx_smokeCurrent smoker      -0.53 ( -4.01,  2.96) 0.766  
#>  8 shx_drink<1 drink per week   -1.00 ( -3.97,  1.96) 0.504  
#>  9 shx_drink1-3 drinks per week -4.14 ( -7.41, -0.88) 0.013  
#> 10 shx_drink4-6 drinks per week -3.05 ( -6.56,  0.46) 0.088  
#> 11 shx_drink1-2 drinks per day  -5.04 ( -8.94, -1.13) 0.012  
#> 12 shx_drink>2 drinks per day    0.68 ( -5.38,  6.74) 0.825  
#> 13 hx_cancerYes                 -2.08 (-11.19,  7.03) 0.653  
#> 14 hx_cvdYes                    -0.26 ( -3.85,  3.33) 0.886  
#> 15 hx_htnYes                     4.19 (  1.76,  6.62) <0.001 
#> 16 hx_hldYes                    -1.08 ( -3.90,  1.75) 0.454  
#> 17 hx_diabetesYes                1.86 ( -1.41,  5.13) 0.262

If we want goodness of fit statistics, we can use the “glance” function from the broom package.

lm2_glance  <- glance(lm2) %>% 
  mutate(
    across(where(is.numeric), ~roundx(.x, 2))
    )

lm2_glance
#> # A tibble: 1 × 12
#>   r.squ…¹ adj.r…² sigma stati…³ p.value df    logLik AIC   BIC   devia…⁴ df.re…⁵
#>   <chr>   <chr>   <chr> <chr>   <chr>   <chr> <chr>  <chr> <chr> <chr>   <chr>  
#> 1 0.21    0.13    6.95  2.63    0.00    16.00 -575.… 1186… 1243… 7592.88 157.00 
#> # … with 1 more variable: nobs <chr>, and abbreviated variable names
#> #   ¹r.squared, ²adj.r.squared, ³statistic, ⁴deviance, ⁵df.residual

These results are still in plain-text in our console window. Let’s use the tidycoRe function, format_basictable, to convert this data frame to a flextable object, apply some default formats, and print.

format_basictable(lm2_tidy)

term	b_ci	p.value
(Intercept)	35.69 ( 30.43, 40.96)	<0.001
age	-0.12 ( -0.21, -0.02)	0.013
femaleFemale	0.03 ( -2.14, 2.21)	0.975
whiteWhite	3.62 ( 0.86, 6.37)	0.010
hispanicYes	3.72 ( -1.81, 9.26)	0.186
shx_smokeFormer smoker	-0.22 ( -2.56, 2.12)	0.854
shx_smokeCurrent smoker	-0.53 ( -4.01, 2.96)	0.766
shx_drink<1 drink per week	-1.00 ( -3.97, 1.96)	0.504
shx_drink1-3 drinks per week	-4.14 ( -7.41, -0.88)	0.013
shx_drink4-6 drinks per week	-3.05 ( -6.56, 0.46)	0.088
shx_drink1-2 drinks per day	-5.04 ( -8.94, -1.13)	0.012
shx_drink>2 drinks per day	0.68 ( -5.38, 6.74)	0.825
hx_cancerYes	-2.08 (-11.19, 7.03)	0.653
hx_cvdYes	-0.26 ( -3.85, 3.33)	0.886
hx_htnYes	4.19 ( 1.76, 6.62)	<0.001
hx_hldYes	-1.08 ( -3.90, 1.75)	0.454
hx_diabetesYes	1.86 ( -1.41, 5.13)	0.262

After converting to a flextable object, we can further tweak its appearance by applying functions from the flextable package. As an example, let’s rename the columns.

format_basictable(lm2_tidy) %>% 
  set_header_labels(
    term = "Parameter",
    b_ci = "\U03B2 (95% CI)",
    p.value = "p"
  )

Parameter	β (95% CI)	p
(Intercept)	35.69 ( 30.43, 40.96)	<0.001
age	-0.12 ( -0.21, -0.02)	0.013
femaleFemale	0.03 ( -2.14, 2.21)	0.975
whiteWhite	3.62 ( 0.86, 6.37)	0.010
hispanicYes	3.72 ( -1.81, 9.26)	0.186
shx_smokeFormer smoker	-0.22 ( -2.56, 2.12)	0.854
shx_smokeCurrent smoker	-0.53 ( -4.01, 2.96)	0.766
shx_drink<1 drink per week	-1.00 ( -3.97, 1.96)	0.504
shx_drink1-3 drinks per week	-4.14 ( -7.41, -0.88)	0.013
shx_drink4-6 drinks per week	-3.05 ( -6.56, 0.46)	0.088
shx_drink1-2 drinks per day	-5.04 ( -8.94, -1.13)	0.012
shx_drink>2 drinks per day	0.68 ( -5.38, 6.74)	0.825
hx_cancerYes	-2.08 (-11.19, 7.03)	0.653
hx_cvdYes	-0.26 ( -3.85, 3.33)	0.886
hx_htnYes	4.19 ( 1.76, 6.62)	<0.001
hx_hldYes	-1.08 ( -3.90, 1.75)	0.454
hx_diabetesYes	1.86 ( -1.41, 5.13)	0.262

We can use the same process for the goodness of fit table.

format_basictable(lm2_glance)

r.squared	adj.r.squared	sigma	statistic	p.value	df	logLik	AIC	BIC	deviance	df.residual	nobs
0.21	0.13	6.95	2.63	0.00	16.00	-575.40	1186.80	1243.66	7592.88	157.00	174.00

And change the column names with flextable::set_header_labels and add outer borders.

format_basictable(lm2_glance, borders = TRUE) %>% 
  set_header_labels(
    r.squared     = "R\U00B2",
    adj.r.squared = "Adjusted R\U00B2",
    sigma         = "Standard error residuals",
    statistic     = "Test statistic",
    p.value       = "p",
    df            = "Degrees of Freedom",
    logLik        = "Log-likelihood",
    deviance      = "Deviance",
    df.residual   = "Degrees of Freedom residuals",
    nobs          = "# observations"
  )

R²	Adjusted R²	Standard error residuals	Test statistic	p	Degrees of Freedom	Log-likelihood	AIC	BIC	Deviance	Degrees of Freedom residuals	# observations
0.21	0.13	6.95	2.63	0.00	16.00	-575.40	1186.80	1243.66	7592.88	157.00	174.00