ciss_vae.utils.matrix.create_missingness_prop_matrix
- create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)
Create a missingness proportion matrix summarizing feature-level missingness per sample.
Computes the proportion of missing values for each feature within each sample, optionally aggregating repeated measurements (e.g., feature_t1, feature_t2). Can also accept an explicit column_mapping from base feature → list of columns.
- Parameters:
data (pandas.DataFrame or numpy.ndarray) – Input dataset (coercible to DataFrame).
index_col (str or None, optional) – Column to use as the sample index in the output metadata; it is not scored.
cols_ignore (list[str] or None, optional) – Columns to exclude from scoring (e.g., IDs, non-features).
na_values (list[Any] or None, optional) – Extra values to treat as missing (in addition to NaN/None/±Inf).
repeat_feature_names (list[str] or None, optional) – Base feature names that have repeated timepoints to be aggregated. Columns are matched by regex pattern:
    - if timepoint_prefix is provided: ^<feat>_<prefix>\d+$
    - else: ^<feat>_\d+$
timepoint_prefix (str or None, optional) – Optional prefix that appears before the timepoint integer, e.g., t to match feat_t1.
nonint_timepoint (bool, optional) – If true, any text after ‘_’ counts as the timepoint (e.g., Baseline).
column_mapping (dict[str, list[str]] or None, optional) – Explicit mapping { base_feature: [col1, col2, …] } to aggregate. Takes precedence.
loose (bool) – If true, will match any text starting with the base feature names in repeat_feature_names.
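The matching and aggregation rules described by these parameters can be sketched in plain pandas. This is a minimal illustration, not the library's implementation: the helper names match_repeat_columns and missingness_proportions are hypothetical, the order in which loose, nonint_timepoint, and timepoint_prefix are checked is an assumption, and the loose pattern is assumed to be a simple "starts with" match.

```python
import re

import numpy as np
import pandas as pd


def match_repeat_columns(columns, feat, timepoint_prefix=None,
                         nonint_timepoint=False, loose=False):
    """Hypothetical helper: list the columns that belong to base feature `feat`."""
    base = re.escape(feat)
    if loose:
        pattern = rf"^{base}"          # assumed: any column starting with the base name
    elif nonint_timepoint:
        pattern = rf"^{base}_.+$"      # any text after the underscore
    elif timepoint_prefix is not None:
        pattern = rf"^{base}_{re.escape(timepoint_prefix)}\d+$"
    else:
        pattern = rf"^{base}_\d+$"
    return [c for c in columns if re.search(pattern, str(c))]


def missingness_proportions(df, repeat_feature_names=(), na_values=(), **match_kwargs):
    """Sketch: per-sample proportion of missing values for each base feature."""
    # NaN/None, +/-Inf, and any extra sentinel values all count as missing.
    missing = df.isna() | df.isin([np.inf, -np.inf]) | df.isin(list(na_values))
    groups, used = {}, set()
    for feat in repeat_feature_names:
        cols = match_repeat_columns(df.columns, feat, **match_kwargs)
        if cols:
            groups[feat] = cols
            used.update(cols)
    # Columns not claimed by a repeated feature are scored on their own.
    for col in df.columns:
        if col not in used:
            groups[col] = [col]
    props = pd.DataFrame(
        {name: missing[cols].mean(axis=1) for name, cols in groups.items()}
    )
    return props, groups


df = pd.DataFrame({
    "age": [34.0, np.nan, 51.0],
    "bp_t1": [120.0, np.nan, -999.0],   # -999.0 is an illustrative sentinel
    "bp_t2": [118.0, 125.0, np.nan],
})
props, groups = missingness_proportions(
    df, repeat_feature_names=["bp"], na_values=[-999.0], timepoint_prefix="t"
)
print(props)   # one row per sample, one column per base feature
print(groups)  # base feature -> contributing columns
```

In this toy frame, bp_t1 and bp_t2 collapse into a single bp column whose value is the fraction of the two timepoints that are missing for each sample, while age is scored on its own.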
- Returns:
MissingnessMatrix with:
    - data: (n_samples, n_features) matrix of missingness proportions
    - feature_columns_map: mapping of base features → contributing columns
    - to_dataframe() to view as DataFrame
- Return type: