ciss_vae.utils.matrix.create_missingness_prop_matrix

create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)

Create a missingness proportion matrix summarizing feature-level missingness per sample.

Computes the proportion of missing values for each feature within each sample, optionally aggregating repeated measurements (e.g., feature_t1, feature_t2) under a single base feature. It also accepts an explicit column_mapping from each base feature → list of columns.
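To make "proportion of missing values per feature per sample" concrete, here is a minimal standard-library sketch of the calculation for a single sample. It is not the library's implementation: the helper names is_missing and missingness_proportions, the example row, and the -999 sentinel are all hypothetical, and the missing-value rule (None, NaN, ±Inf, plus any extra na_values) is taken from the parameter descriptions on this page.

```python
import math

def is_missing(value, na_values=()):
    """Hypothetical helper: None, NaN, +/-Inf, or a listed sentinel counts as missing."""
    if value is None:
        return True
    if isinstance(value, float) and (math.isnan(value) or math.isinf(value)):
        return True
    return value in na_values

def missingness_proportions(row, feature_columns_map, na_values=()):
    """Hypothetical helper: fraction of each base feature's columns missing in one row."""
    return {
        feature: sum(is_missing(row[col], na_values) for col in cols) / len(cols)
        for feature, cols in feature_columns_map.items()
    }

# One sample with a single-column feature and a feature measured at four timepoints.
row = {"age": 54.0, "bp_t1": 120.0, "bp_t2": float("nan"), "bp_t3": None, "bp_t4": -999}

# Base feature -> contributing columns (same shape as column_mapping).
mapping = {"age": ["age"], "bp": ["bp_t1", "bp_t2", "bp_t3", "bp_t4"]}

props = missingness_proportions(row, mapping, na_values=(-999,))
print(props)  # {'age': 0.0, 'bp': 0.75}
```

A feature backed by one column can only score 0.0 or 1.0, while an aggregated feature takes intermediate values: here three of the four bp columns are missing, giving 0.75.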

Parameters:
  • data (pandas.DataFrame or numpy.ndarray) – Input dataset (coercible to DataFrame).

  • index_col (str or None, optional) – Column to use as the sample index in the output metadata; it is not scored.

  • cols_ignore (list[str] or None, optional) – Columns to exclude from scoring (e.g., IDs, non-features).

  • na_values (list[Any] or None, optional) – Extra values to treat as missing (in addition to NaN/None/±Inf).

  • repeat_feature_names (list[str] or None, optional) – Base feature names that have repeated timepoints to be aggregated. Columns are matched by regex pattern:
      - if timepoint_prefix is provided: ^<feat>_<prefix>\d+$
      - otherwise: ^<feat>_\d+$

  • timepoint_prefix (str or None, optional) – Optional prefix that appears before the timepoint integer, e.g., t to match feat_t1.

  • nonint_timepoint (bool, optional) – If True, any text after ‘_’ counts as a timepoint (e.g., Baseline), not only integers.

  • column_mapping (dict[str, list[str]] or None, optional) – Explicit mapping { base_feature: [col1, col2, …] } to aggregate. Takes precedence.

  • loose (bool, optional) – If True, matches any column whose name starts with one of the base feature names in repeat_feature_names.
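The two regex patterns documented for repeat_feature_names can be tried out with the standard re module. This is only a sketch of the pattern matching, not the library's code: match_repeat_columns and the column names are hypothetical, escaping the feature name and prefix with re.escape is a choice made here, and the nonint_timepoint and loose modes are not covered.

```python
import re

def match_repeat_columns(columns, feature, timepoint_prefix=None):
    """Hypothetical helper: select columns matching ^<feat>_<prefix>\\d+$ or ^<feat>_\\d+$."""
    # Escaping is an assumption of this sketch so that names are matched literally.
    prefix = re.escape(timepoint_prefix) if timepoint_prefix else ""
    pattern = re.compile(rf"^{re.escape(feature)}_{prefix}\d+$")
    return [col for col in columns if pattern.match(col)]

columns = ["id", "bp_t1", "bp_t2", "bp_3", "bp_baseline", "bpm_t1"]

# With timepoint_prefix="t", only <feat>_t<integer> columns match.
with_prefix = match_repeat_columns(columns, "bp", timepoint_prefix="t")
print(with_prefix)  # ['bp_t1', 'bp_t2']

# Without a prefix, only <feat>_<integer> columns match.
without_prefix = match_repeat_columns(columns, "bp")
print(without_prefix)  # ['bp_3']
```

Because both patterns are anchored and require digits, bp_baseline and bpm_t1 match neither of them.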

Returns:

MissingnessMatrix with:
  • data: (n_samples, n_features) matrix of missingness proportions
  • feature_columns_map: mapping of base features → contributing columns
  • to_dataframe(): method to view the matrix as a DataFrame

Return type:

MissingnessMatrix