
Get started

To get started, let’s load the package into our working session:
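The load call itself is not shown here. As a minimal sketch, assuming the package is installed under the name TidycoRe, and assuming that read_docx() and docx_extract_all_tbls() come from docxtractr and %>% from magrittr, the session could be prepared as follows. The helper attach_if_installed() is hypothetical and not part of TidycoRe.

```r
# Assumption: the package is installed under the name "TidycoRe"; docxtractr and
# magrittr are assumed to provide read_docx()/docx_extract_all_tbls() and %>%.
pkgs <- c("TidycoRe", "docxtractr", "magrittr")

# Hypothetical helper: attach every package that is installed and report the rest.
attach_if_installed <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  for (p in pkgs[installed]) {
    library(p, character.only = TRUE)
  }
  if (any(!installed)) {
    message("Not installed: ", paste(pkgs[!installed], collapse = ", "))
  }
  invisible(installed)
}

status <- attach_if_installed(pkgs)
```

If everything is already installed, a plain library() call per package does the same job.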

If you have not yet installed it from our private GitHub repository, please see the README on the package home page.

Comparing docx tables

TidycoRe can extract tables from docx files, pull their summary information, and compare the data between them. This example requires the splitstackshape package, so please install it before running the code.

Note: For an accurate comparison, the tables need to have similar headings and summary statistics.
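One way to check the note above before comparing is to look at the header cells of both tables. The sketch below is only an illustration: same_headings() is a hypothetical helper, not a TidycoRe function, the toy tables are made up, and treating headings as "similar" when they match after dropping parenthesised parts such as (N=50) is an assumption.

```r
# Hypothetical helper (not part of TidycoRe): TRUE when two extracted tables
# have the same number of columns and the same header cells, ignoring case,
# surrounding whitespace, and anything in parentheses such as "(N=50)".
same_headings <- function(old, new, header_rows = 1) {
  norm <- function(x) {
    tolower(trimws(gsub("\\(.*?\\)", "", as.character(x), perl = TRUE)))
  }
  if (ncol(old) != ncol(new)) {
    return(FALSE)
  }
  head_cells <- function(tbl) {
    rows <- tbl[seq_len(header_rows), , drop = FALSE]
    unlist(lapply(rows, norm), use.names = FALSE)
  }
  identical(head_cells(old), head_cells(new))
}

# Made-up tables shaped like the output of extraction without a guessed header.
old_tbl <- data.frame(V1 = c("", "Age, mean"),
                      V2 = c("Drug A (N=50)", "41.2"),
                      V3 = c("Placebo (N=48)", "40.7"))
new_tbl <- data.frame(V1 = c("", "Age, mean"),
                      V2 = c("Drug A (N=52)", "41.9"),
                      V3 = c("Placebo (N=49)", "40.1"))
other_tbl <- new_tbl
other_tbl$V3[1] <- "Drug B (N=49)"

same_headings(old_tbl, new_tbl)    # headings differ only in the N= part
same_headings(old_tbl, other_tbl)  # third column has a different heading
```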

The example below uses functionality that was created for comparing an older and a newer PV report, although here we use an example table.

First we pull all the tables from the document.

# Read the first document and extract every table as a data frame.
new_doc <- read_docx("example_docx_tables.docx")
new_doc_tables <- docx_extract_all_tbls(new_doc, guess_header = FALSE, preserve = TRUE)

# Repeat for the second document.
new_doc2 <- read_docx("example_docx_tables2.docx")
new_doc_tables2 <- docx_extract_all_tbls(new_doc2, guess_header = FALSE, preserve = TRUE)
#> NOTE: header=FALSE but table has a marked header row in the Word document

To extract a single table, index into the list by position. Valid indices run from 1 to the number of tables in the document.

new_doc_tables[[1]] %>% 
  format_basictable()
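When a document holds several tables, it can help to see how many there are and how large each one is before picking an index. The sketch below uses a made-up list of two toy data frames as a stand-in for the extracted tables, so it runs without a docx file.

```r
# Stand-in for the list of extracted tables: two made-up toy tables.
tables <- list(
  data.frame(V1 = c("", "Age"), V2 = c("Drug A", "41.2")),
  data.frame(V1 = c("", "Sex"), V2 = c("Drug A", "20 (40%)"))
)

# Number of tables available, i.e. the largest valid index.
n_tables <- length(tables)

# Rows and columns of each table, one row per table.
dims <- t(vapply(tables, dim, integer(2)))
colnames(dims) <- c("rows", "cols")

# Pick the second table by position.
second <- tables[[2]]
```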

Next, we specify the row where the data starts in each table (or set of tables) that we want to compare. In the example table above, the first row holds the headings, so the data starts on row 2.

row_data_start <- 2
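Conceptually, the data start row separates the heading rows from the rows that hold values. The sketch below illustrates that split on a made-up table. split_at_data_start() is a hypothetical helper, and how TidycoRe uses the argument internally is not shown in this document.

```r
# Hypothetical helper (not part of TidycoRe): split a table into the heading
# rows above `row_data_start` and the data rows from `row_data_start` onward.
split_at_data_start <- function(tbl, row_data_start) {
  stopifnot(row_data_start >= 1, row_data_start <= nrow(tbl))
  list(
    header = tbl[seq_len(row_data_start - 1), , drop = FALSE],
    body   = tbl[seq(row_data_start, nrow(tbl)), , drop = FALSE]
  )
}

# Made-up table: one heading row followed by two data rows.
tbl <- data.frame(V1 = c("", "Age, mean", "Female"),
                  V2 = c("Drug A (N=50)", "41.2", "20 (40%)"))

parts <- split_at_data_start(tbl, row_data_start = 2)
parts$header  # the single heading row
parts$body    # the rows that hold summary values
```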

Lastly, we call compare_clean_tables() from the package to extract and merge all the summary statistics from the tables. In the example below, we compare the first table of the older report with the first table of the newer report (new_doc_tables is passed as old and new_doc_tables2 as new). The result has the following columns:

  • V1 is a description of the variable that the stat and value belong to
  • variable_grouping assigns each stat to the correct variable in V1 (especially useful for categorical variables)
  • name is the column heading (from the table shown earlier) that the stat and value belong to
  • value.x is the value from the new table
  • stat gives the position of the value within its cell. For example, A (B%) holds two values: Value1 is A and Value2 is B.
  • summary describes the pattern from which the stat was extracted. For example, N=Value1 indicates a count and Value1 (Value2%) indicates a count (percent).
  • value.y is the value from the old table
  • diff is the difference value.x - value.y (new minus old)
compare_clean_tables(
  old = new_doc_tables[[1]],
  new = new_doc_tables2[[1]],
  row_data_start = row_data_start
) %>%
  head(10) %>%         # limit the output to the first 10 rows
  format_basictable()  # format the table for display
#>  Extracting table info...
#>  Variable groupings identified.
#>  All white space removed.
#>  Data reformated.
#>  All patterns identified.
#> New names:
#>  N=Value1/Value1 (Value2%)/Value1 summaries extracted
#>  Summary extraction complete.
#>  `Old` table information extracted.
#>  Variable groupings identified.
#>  All white space removed.
#>  Data reformated.
#>  All patterns identified.
#> New names:
#>  N=Value1/Value1 (Value2%)/Value1 summaries extracted
#>  Summary extraction complete.
#>  `New` table information extracted.
#>  Merging and comparing info from both tables...
#>  Merging complete.
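To make the stat, summary, value.x, value.y, and diff columns more concrete, the sketch below reproduces the idea in base R on made-up cells. It is not the package's implementation: extract_values() and cell_pattern() are hypothetical helpers, and the regular expression is an assumption about what counts as a number.

```r
# Assumed number pattern: optional minus sign, digits, optional decimals.
num_rx <- "-?[0-9]+\\.?[0-9]*"

# Hypothetical helper: the numbers in a cell, in order of appearance.
extract_values <- function(cell) {
  as.numeric(regmatches(cell, gregexpr(num_rx, cell))[[1]])
}

# Hypothetical helper: replace each number with Value1, Value2, ... so that
# "20 (40%)" becomes the pattern label "Value1 (Value2%)".
cell_pattern <- function(cell) {
  m <- gregexpr(num_rx, cell)
  n <- length(regmatches(cell, m)[[1]])
  if (n == 0) {
    return(cell)
  }
  regmatches(cell, m) <- list(paste0("Value", seq_len(n)))
  cell
}

# Made-up cells for one variable and one column heading.
new_cell <- "22 (44%)"
old_cell <- "20 (40%)"

# Long format: one row per value, keyed by variable, heading, and position.
to_long <- function(cell) {
  vals <- extract_values(cell)
  data.frame(V1 = "Female", name = "Drug A",
             stat = paste0("Value", seq_along(vals)),
             summary = cell_pattern(cell),
             value = vals)
}

# merge() adds the suffixes .x (first table, new) and .y (second table, old).
cmp <- merge(to_long(new_cell), to_long(old_cell),
             by = c("V1", "name", "stat", "summary"))
cmp$diff <- cmp$value.x - cmp$value.y
cmp
```

Keying the merge on the variable, heading, position, and pattern is what lets each new value line up with its old counterpart, which is why similar headings and summary statistics matter.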

If you only want to clean a docx table, that is, pull the summary information from a single extracted table data frame without comparing it, you can run clean_table():

new_doc_tables[[1]] %>% 
  clean_table() %>%
  head(10) %>% 
  format_basictable()
#>  Variable groupings identified.
#>  All white space removed.
#>  Data reformated.
#>  All patterns identified.
#> New names:
#>  N=Value1/Value1 (Value2%)/Value1 summaries extracted
#>  Summary extraction complete.