TableCleaneR provides small, focused helpers to collapse duplicate rows defined by key columns in tibbles/data frames, in a way that is safe for downstream analysis:
- Key combinations are made unique by collapsing rows sharing the same keys.
- Selected columns can be concatenated to preserve information (e.g. sources, comments).
- Other divergent non-key fields are replaced with typed NA to avoid silently mixing inconsistent values.
- Companion functions help audit which groups require concatenation and subset collapsed outputs accordingly.
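To make "typed NA" concrete, here is a minimal base R sketch. The helper typed_na() is hypothetical and only illustrates the idea; whether TableCleaneR builds its missing values this way is an assumption.

```r
# Hypothetical helper (not part of TableCleaneR): indexing with NA_integer_
# returns a single missing value of the same type as the input vector.
typed_na <- function(x) x[NA_integer_]

class(typed_na(c("POS", "NEG")))  # "character"
class(typed_na(c(10, 12)))        # "numeric"
class(typed_na(c(TRUE, FALSE)))   # "logical"

# A typed NA keeps a column's type stable when it sits next to real values.
combined <- c(typed_na(c(10, 12)), 5)
is.double(combined)               # TRUE
```

Keeping the type matters because a divergent numeric column then stays numeric after collapsing, so later arithmetic or joins do not need a type conversion.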
Installation
Development version (GitHub)
# install.packages("pak")
pak::pak("besibo/TableCleaneR")
(Alternatively)
# install.packages("devtools")
devtools::install_github("besibo/TableCleaneR")
Core functions
TableCleaneR currently exposes four functions:
- collapse_by_keys(): collapse duplicates to one row per key group; concatenate .concat columns when needed; replace other divergent columns with NA.
- collapse_candidates(): return the original input rows belonging to groups where concatenation would be needed for .concat.
- collapse_concat_only(): return the collapsed output restricted to groups where concatenation actually occurred.
- collapse_no_concat(): return the collapsed output restricted to groups where no concatenation was needed for .concat.
Quick start
library(TableCleaneR)
library(dplyr)
#>
#> Attaching package: 'dplyr'
#> The following objects are masked from 'package:stats':
#>
#>     filter, lag
#> The following objects are masked from 'package:base':
#>
#>     intersect, setdiff, setequal, union
library(tibble)
df <- tibble(
  exam.num_collec = c(1, 1, 1, 2, 2),
  mat.matrice = c("SER", "SER", "SER", "PLAS", "PLAS"),
  spe.denomination = c(
    "E. coli",
    "E. coli",
    "E. coli",
    "S. aureus",
    "S. aureus"
  ),
  result = c("POS", "NEG", "POS", "NEG", "NEG"), # divergent (non-concat) for key=1
  value = c(10, 10, 12, 5, 5), # divergent (non-concat) for key=1
  commentaire = c("first", NA, "repeat", "ok", "ok"), # concat target
  source_info = c("labA", "labA", "labB", "labC", NA), # concat target
  flag = c(TRUE, TRUE, TRUE, FALSE, FALSE) # constant within groups
)
df
#> # A tibble: 5 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               1 SER         E. coli          POS       10 first
#> 2               1 SER         E. coli          NEG       10 <NA>
#> 3               1 SER         E. coli          POS       12 repeat
#> 4               2 PLAS        S. aureus        NEG        5 ok
#> 5               2 PLAS        S. aureus        NEG        5 ok
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
1) Collapse duplicates safely
collapsed <- df %>%
  collapse_by_keys(
    exam.num_collec,
    mat.matrice,
    spe.denomination,
    .concat = c(commentaire, source_info),
    sep = " | ",
    na_rm = TRUE,
    warn = TRUE
  )
#> Warning: Divergent columns replaced by NA: result, value
collapsed
#> # A tibble: 2 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               1 SER         E. coli          <NA>      NA first | repeat
#> 2               2 PLAS        S. aureus        NEG        5 ok
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
Expected behaviour (for the key group (1, "SER", "E. coli")):
- commentaire becomes "first | repeat" (distinct non-NA values concatenated),
- source_info becomes "labA | labB",
- result and value become NA (divergent and not selected for concatenation),
- flag stays TRUE (constant within the group).
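The rules listed above can be reproduced by hand for this one key group with plain dplyr. This is only a sketch of the described behaviour, not the package's implementation; the helpers concat_distinct() and constant_or_na() are hypothetical names introduced for illustration.

```r
library(dplyr)
library(tibble)

# The three input rows of key group (1, "SER", "E. coli"), keys omitted.
group1 <- tibble(
  result = c("POS", "NEG", "POS"),
  value = c(10, 10, 12),
  commentaire = c("first", NA, "repeat"),
  source_info = c("labA", "labA", "labB"),
  flag = c(TRUE, TRUE, TRUE)
)

# Hypothetical helper: paste the distinct non-NA values together.
concat_distinct <- function(x, sep = " | ") {
  vals <- unique(x[!is.na(x)])
  if (length(vals) == 0) NA_character_ else paste(vals, collapse = sep)
}

# Hypothetical helper: keep a constant value, otherwise return a typed NA.
constant_or_na <- function(x) if (n_distinct(x) == 1) x[1] else x[NA_integer_]

expected <- group1 %>%
  summarise(
    across(c(commentaire, source_info), concat_distinct),
    across(c(result, value, flag), constant_or_na)
  )
expected
```

The single resulting row has commentaire "first | repeat", source_info "labA | labB", a missing result and value, and flag TRUE, matching the list above.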
2) Audit: which original rows would require concatenation?
candidates <- df %>%
  collapse_candidates(
    exam.num_collec,
    mat.matrice,
    spe.denomination,
    .concat = c(commentaire, source_info),
    na_rm = TRUE
  )
candidates
#> # A tibble: 3 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               1 SER         E. coli          POS       10 first
#> 2               1 SER         E. coli          NEG       10 <NA>
#> 3               1 SER         E. coli          POS       12 repeat
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
This returns the input rows for key groups where at least one .concat column has more than one distinct value.
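The same audit criterion can be sketched with plain dplyr on a reduced copy of the example data. This is an approximation of the rule as described, assuming NA values are ignored when counting distinct values (as na_rm = TRUE suggests); the single key column is a simplification and not part of the package.

```r
library(dplyr)
library(tibble)

# Reduced copy of the example data: one key column and the two concat targets.
toy <- tibble(
  key = c(1, 1, 1, 2, 2),
  commentaire = c("first", NA, "repeat", "ok", "ok"),
  source_info = c("labA", "labA", "labB", "labC", NA)
)

# Keep every input row of a key group in which at least one concat target
# holds more than one distinct non-NA value.
audit <- toy %>%
  group_by(key) %>%
  filter(
    n_distinct(commentaire, na.rm = TRUE) > 1 |
      n_distinct(source_info, na.rm = TRUE) > 1
  ) %>%
  ungroup()
audit
```

Only the three rows of key 1 remain. Key 2 is dropped because "ok" is repeated and "labC" is its only non-missing source.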
3) Keep only collapsed groups where concatenation occurred
only_concat <- df %>%
  collapse_concat_only(
    exam.num_collec,
    mat.matrice,
    spe.denomination,
    .concat = c(commentaire, source_info),
    sep = " | ",
    na_rm = TRUE,
    warn = FALSE
  )
only_concat
#> # A tibble: 1 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               1 SER         E. coli          <NA>      NA first | repeat
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
4) Keep only collapsed groups where no concatenation was needed
no_concat <- df %>%
  collapse_no_concat(
    exam.num_collec,
    mat.matrice,
    spe.denomination,
    .concat = c(commentaire, source_info),
    na_rm = TRUE,
    warn = FALSE
  )
no_concat
#> # A tibble: 1 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               2 PLAS        S. aureus        NEG        5 ok
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
Tidyselect support
Column selection for .concat uses tidyselect (e.g. starts_with(), contains(), matches(), where()):
df %>%
  collapse_by_keys(
    exam.num_collec,
    mat.matrice,
    spe.denomination,
    .concat = starts_with("source"),
    sep = " | "
  )
#> Warning: Divergent columns replaced by NA: result, value, commentaire
#> # A tibble: 2 × 8
#>   exam.num_collec mat.matrice spe.denomination result value commentaire
#>             <dbl> <chr>       <chr>            <chr>  <dbl> <chr>
#> 1               1 SER         E. coli          <NA>      NA <NA>
#> 2               2 PLAS        S. aureus        NEG        5 ok
#> # ℹ 2 more variables: source_info <chr>, flag <lgl>
Notes
- .concat columns return a character result when concatenation is needed.
- For non-.concat columns, divergent groups are replaced by typed NA (to reduce accidental mixing of inconsistent values).
- All functions are designed for use in tidyverse pipelines; selection arguments are captured using tidyeval/tidyselect (no premature evaluation).
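As an illustration of the last note, the sketch below shows one common way a function can capture a tidyselect expression without evaluating it early. The function concat_names() is hypothetical and not part of TableCleaneR; it assumes the rlang and tidyselect packages, which are installed alongside dplyr.

```r
library(dplyr)
library(tibble)

# Hypothetical function: capture the selection unevaluated with enquo(),
# then resolve it against the data with tidyselect::eval_select().
concat_names <- function(data, .concat) {
  pos <- tidyselect::eval_select(rlang::enquo(.concat), data)
  names(pos)
}

toy <- tibble(id = 1, source_info = "labA", source_url = "x", flag = TRUE)

concat_names(toy, starts_with("source"))  # "source_info" "source_url"
concat_names(toy, c(flag, id))            # "flag" "id"
```

Because the expression is only resolved inside the function, bare column names and helpers such as starts_with() both work, and a misspelled column name fails at selection time with a tidyselect error.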