ciss_vae.utils.matrix.create_missingness_prop_matrix

create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)

Create a missingness proportion matrix summarizing feature-level missingness per sample.

Computes the proportion of missing values for each feature within each sample, optionally aggregating repeated measurements (e.g., feature_t1, feature_t2). Can also accept an explicit column_mapping from base feature → list of columns.
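The arithmetic described above can be illustrated with a minimal sketch in plain Python. This is not the library's implementation: the helper names is_missing and missingness_proportions, the sample rows, and the -999 sentinel are all hypothetical, chosen only to show how a repeated feature yields a fractional proportion while a single-column feature yields 0 or 1.

```python
import math


def is_missing(value, na_values=()):
    """Hypothetical helper: None, NaN, +/-Inf, and extra sentinels count as missing."""
    if value is None or value in na_values:
        return True
    return isinstance(value, float) and (math.isnan(value) or math.isinf(value))


def missingness_proportions(rows, column_mapping, na_values=()):
    """Hypothetical helper: per-sample share of missing columns for each base feature.

    rows           -- list of dicts, one per sample
    column_mapping -- {base_feature: [col1, col2, ...]}
    """
    result = []
    for row in rows:
        props = {}
        for feature, cols in column_mapping.items():
            n_missing = sum(is_missing(row.get(c), na_values) for c in cols)
            props[feature] = n_missing / len(cols)
        result.append(props)
    return result


# Illustrative data only: "bp" is measured at three timepoints, "age" once.
rows = [
    {"age": 41.0, "bp_t1": 120.0, "bp_t2": float("nan"), "bp_t3": None},
    {"age": None, "bp_t1": 118.0, "bp_t2": 121.0, "bp_t3": -999},
]
mapping = {"age": ["age"], "bp": ["bp_t1", "bp_t2", "bp_t3"]}

props = missingness_proportions(rows, mapping, na_values=(-999,))
# Sample 0: age is present (0.0); two of three bp columns are missing (2/3).
# Sample 1: age is missing (1.0); only the -999 sentinel is missing for bp (1/3).
```

The result has one entry per sample and one value per base feature, which mirrors the (n_samples, n_features) shape described under Returns.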

Parameters:

data (pandas.DataFrame or numpy.ndarray) – Input dataset (coercible to DataFrame).
index_col (str or None, optional) – Optional column to use as sample index in the output metadata (not scored).
cols_ignore (list[str] or None, optional) – Columns to exclude from scoring (e.g., IDs, non-features).
na_values (list[Any] or None, optional) – Extra values to treat as missing (in addition to NaN/None/±Inf).
repeat_feature_names (list[str] or None, optional) – Base feature names that have repeated timepoints to be aggregated. Columns are matched by regex pattern:
    - if timepoint_prefix is provided: ^<feat>_<prefix>\d+$
    - otherwise: ^<feat>_\d+$
timepoint_prefix (str or None, optional) – Optional prefix that appears before the timepoint integer, e.g., t to match feat_t1.
nonint_timepoint (bool, optional) – If True, any text after the underscore counts as a timepoint (e.g., Baseline), not only integers.
column_mapping (dict[str, list[str]] or None, optional) – Explicit mapping { base_feature: [col1, col2, …] } to aggregate. Takes precedence over pattern-based matching via repeat_feature_names.
loose (bool) – If True, matches any column whose name starts with one of the base feature names in repeat_feature_names.
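The column-matching rules for repeat_feature_names can be sketched with the standard re module. The two integer-timepoint patterns are taken from the parameter descriptions above; the patterns used for nonint_timepoint and loose, the order in which the flags are checked, and the helper name match_timepoint_columns are assumptions made for illustration, not the library's actual code.

```python
import re


def match_timepoint_columns(columns, feat, timepoint_prefix=None,
                            nonint_timepoint=False, loose=False):
    """Hypothetical helper: return the columns that would aggregate under `feat`."""
    feat_re = re.escape(feat)
    if loose:
        # Assumed pattern: any column name that starts with the base feature name.
        pattern = rf"^{feat_re}.*$"
    elif nonint_timepoint:
        # Assumed pattern: any non-empty text after the underscore.
        pattern = rf"^{feat_re}_.+$"
    elif timepoint_prefix is not None:
        # Documented pattern: ^<feat>_<prefix>\d+$
        pattern = rf"^{feat_re}_{re.escape(timepoint_prefix)}\d+$"
    else:
        # Documented pattern: ^<feat>_\d+$
        pattern = rf"^{feat_re}_\d+$"
    return [c for c in columns if re.match(pattern, c)]


# Illustrative column names only.
columns = ["id", "bp_1", "bp_2", "bp_t1", "bp_t2", "bp_Baseline", "bpm", "weight"]

default_cols = match_timepoint_columns(columns, "bp")
prefixed_cols = match_timepoint_columns(columns, "bp", timepoint_prefix="t")
nonint_cols = match_timepoint_columns(columns, "bp", nonint_timepoint=True)
loose_cols = match_timepoint_columns(columns, "bp", loose=True)
```

Under these assumptions, loose matching also picks up unrelated columns such as bpm, so an explicit column_mapping is likely the safer choice when base feature names share prefixes.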

Returns:

MissingnessMatrix with:
    - data: (n_samples, n_features) matrix of missingness proportions
    - feature_columns_map: mapping of base features → contributing columns
    - to_dataframe(): view the matrix as a DataFrame

Return type:

MissingnessMatrix