Map column names to cell type groups using patterns — map_cell

Maps each name in a character vector of column names to a cell type group by matching it against user-supplied patterns, treated either as regular expressions or as exact strings.

Usage

map_cell_groups(
  column_names,
  mapping_rules,
  default_group = "Unknown",
  verbose = FALSE,
  use_regex = TRUE
)

Arguments

column_names: A character vector of column names to be mapped to cell type groups.
mapping_rules: A named list where names are the target cell type groups and values are character vectors of patterns that identify columns belonging to each group.
default_group: A single string used as the group name for columns that match no pattern. Default is "Unknown".
verbose: Logical. If TRUE, prints messages reporting which columns were mapped to which groups and which columns remained unmapped. Default is FALSE.
use_regex: Logical. If TRUE (default), treats patterns in mapping_rules as regular expressions. If FALSE, performs exact string matching.

Value

A character vector of the same length as column_names, giving the cell type group assigned to each column name. Columns that match no pattern are assigned default_group.

Details

The function iterates through mapping_rules in list order and tests each column name against the patterns of each cell type group. A column name that matches patterns from more than one group is assigned to the group that appears first in the list, so the order of mapping_rules matters.

When use_regex = TRUE, patterns are treated as regular expressions and matching is case-insensitive. When use_regex = FALSE, exact string matching is performed, which is case-sensitive.
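
The matching behaviour described above can be illustrated with a minimal sketch. This is not the package source: sketch_map_groups is a hypothetical name, and the sketch assumes that regex matching is unanchored (base R grepl with ignore.case = TRUE) and that exact matching is a plain string comparison. The verbose argument is omitted.

```r
# Hypothetical re-implementation for illustration only; not the package code.
sketch_map_groups <- function(column_names, mapping_rules,
                              default_group = "Unknown", use_regex = TRUE) {
  # Every column starts in the default group and is marked as unassigned.
  result <- rep(default_group, length(column_names))
  assigned <- rep(FALSE, length(column_names))

  # Walk the groups in list order so that earlier groups take precedence.
  for (group in names(mapping_rules)) {
    hit <- rep(FALSE, length(column_names))
    for (pattern in mapping_rules[[group]]) {
      matched <- if (use_regex) {
        # Assumption: unanchored, case-insensitive regular expression match.
        grepl(pattern, column_names, ignore.case = TRUE)
      } else {
        # Exact, case-sensitive string comparison.
        column_names == pattern
      }
      hit <- hit | matched
    }
    # Only columns not claimed by an earlier group are assigned here.
    new_hit <- hit & !assigned
    result[new_hit] <- group
    assigned <- assigned | new_hit
  }
  result
}

# Under these assumptions, "CD8_T_cell" matches both patterns,
# and the group listed first is the one assigned.
sketch_map_groups("CD8_T_cell", list(T_cell = "T_cell", CD8 = "CD8"))
sketch_map_groups("CD8_T_cell", list(CD8 = "CD8", T_cell = "T_cell"))

# Case sensitivity differs between the two modes.
sketch_map_groups("t_cell", list(T_cell = "T_cell"), use_regex = TRUE)
sketch_map_groups("t_cell", list(T_cell = "T_cell"), use_regex = FALSE)
```

In this sketch the first call yields "T_cell" and the second "CD8", which shows why the order of groups in the list matters when patterns overlap. Whether map_cell_groups itself anchors its patterns is not stated here, so overlapping patterns are worth checking against real output.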

Examples

# Example column names from a dataset
cols <- c("CD8_T_cell", "T_cell", "B_cell",
          "NK_cell", "Monocyte", "Unknown_cell")

# Define simple mapping rules for cell types
mapping <- list(
  "T_cell" = c("CD8_T_cell", "T_cell"),
  "B_cell" = c("B_cell"),
  "NK_cell" = c("NK_cell"),
  "Monocyte" = c("Monocyte")
)

# Map column names to cell types using exact matching
cell_types <- map_cell_groups(cols, mapping, use_regex = FALSE)
print(data.frame(column = cols, cell_type = cell_types))
#>         column cell_type
#> 1   CD8_T_cell    T_cell
#> 2       T_cell    T_cell
#> 3       B_cell    B_cell
#> 4      NK_cell   NK_cell
#> 5     Monocyte  Monocyte
#> 6 Unknown_cell   Unknown

# Define mapping rules using regex patterns
mapping_regex <- list(
  "T_cell" = c("CD8.*", "T_cell"),
  "B_cell" = c("B[-_]?cell"),
  "NK_cell" = c("NK[-_]?cell"),
  "Monocyte" = c("Mono.*")
)

# Map using regex patterns (default)
cell_types_regex <- map_cell_groups(cols, mapping_regex)
print(data.frame(column = cols, cell_type = cell_types_regex))
#>         column cell_type
#> 1   CD8_T_cell    T_cell
#> 2       T_cell    T_cell
#> 3       B_cell    B_cell
#> 4      NK_cell   NK_cell
#> 5     Monocyte  Monocyte
#> 6 Unknown_cell   Unknown
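
Because the result is a plain character vector aligned with column_names, it can be passed directly to base R grouping helpers. The sketch below copies the group vector from the output shown above instead of calling map_cell_groups, so that it runs on its own; the variable names are illustrative.

```r
# Column names and their mapped groups, copied from the example output above.
cols <- c("CD8_T_cell", "T_cell", "B_cell",
          "NK_cell", "Monocyte", "Unknown_cell")
cell_types <- c("T_cell", "T_cell", "B_cell",
                "NK_cell", "Monocyte", "Unknown")

# List the column names that belong to each group.
columns_by_group <- split(cols, cell_types)
columns_by_group$T_cell

# Count how many columns fell into each group.
group_counts <- table(cell_types)
group_counts[["T_cell"]]

# Collect the columns that did not match any rule.
unmapped <- cols[cell_types == "Unknown"]
unmapped
```

Checking the unmapped columns this way is a quick complement to verbose = TRUE when the mapping rules are still being refined.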