ciss_vae.utils package

Submodules

ciss_vae.utils.clustering module

run_cissvae takes the dataset as input and (optionally) clusters samples on missingness before running the full ciss_vae model.

cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)[source]

Cluster samples based on their missingness patterns using KMeans or Leiden.

When n_clusters is None, performs Leiden clustering on a graph constructed from the binary missingness mask of the dataset. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph using Jaccard similarity; otherwise, constructs a standard kNN graph with Jaccard weights. Returns both the cluster labels and an optional silhouette score.
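The binary missingness mask and the Jaccard distance mentioned above can be illustrated with a small standard-library sketch. The helper names `missingness_mask` and `jaccard_distance` are hypothetical and are not part of ciss_vae; this shows the idea, not the library's implementation.

```python
import math

def missingness_mask(rows):
    """Return a 0/1 mask with 1 wherever a value is missing (None or NaN)."""
    return [
        [1 if v is None or (isinstance(v, float) and math.isnan(v)) else 0 for v in row]
        for row in rows
    ]

def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (0.0 when both are all zeros)."""
    intersection = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - intersection / union

# Three samples: the first two share a missingness pattern, the third is its complement.
rows = [
    [1.0, None, 3.0, None],
    [2.0, None, 1.0, None],
    [None, 4.0, None, 2.0],
]
mask = missingness_mask(rows)
dist = [[jaccard_distance(a, b) for b in mask] for a in mask]
```

Samples with identical missingness patterns end up at distance 0 and samples with disjoint patterns at distance 1, which is what makes the mask a usable input for graph construction.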

Parameters:

data (pandas.DataFrame) – Input dataset with potential missing values, shape (n_samples, n_features). Non-numeric columns should be excluded or specified in cols_ignore.
cols_ignore (list[str] or None, optional) – Column names to exclude from the missingness pattern clustering. Typically includes identifiers or static metadata columns. Defaults to None.
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering on the binary missingness mask instead. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors used when constructing the kNN/SNN graph for Leiden clustering. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual neighbor overlap weighted by Jaccard similarity. If False, uses a standard kNN graph weighted by Jaccard distance. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering; higher values yield more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility in KMeans and Leiden algorithms. Defaults to 42.
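To make the roles of k_neighbors and use_snn more concrete, here is a minimal sketch of one common SNN construction: each pair of samples gets an edge weighted by the Jaccard overlap of their k-nearest-neighbor sets. The functions `knn_sets` and `snn_edges` are hypothetical, and the exact weighting used by ciss_vae may differ.

```python
def knn_sets(dist, k):
    """For each row of a square distance matrix, return the set of its k closest other points."""
    neighbor_sets = []
    for i, row in enumerate(dist):
        # Sort candidate neighbors by distance, breaking ties by index.
        others = sorted((j for j in range(len(row)) if j != i), key=lambda j: (row[j], j))
        neighbor_sets.append(set(others[:k]))
    return neighbor_sets

def snn_edges(neighbor_sets):
    """Weight each pair by the Jaccard overlap of their neighborhoods (each point included in its own)."""
    edges = {}
    n = len(neighbor_sets)
    for i in range(n):
        for j in range(i + 1, n):
            a = neighbor_sets[i] | {i}
            b = neighbor_sets[j] | {j}
            weight = len(a & b) / len(a | b)
            if weight > 0:
                edges[(i, j)] = weight
    return edges

# Toy 1-D data with two well-separated pairs of points.
points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
edges = snn_edges(knn_sets(dist, k=1))
```

Only pairs that share neighbors keep an edge, so the resulting graph has two disconnected components that a community-detection step could recover as clusters.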

Returns:

Tuple (labels, silhouette):

- labels (numpy.ndarray): Cluster assignments of length n_samples.
- silhouette (float or None): Silhouette score computed using Jaccard distance on the binary missingness mask; None if undefined.
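For readers unfamiliar with the silhouette score, the sketch below computes it from a precomputed distance matrix and returns None when it is undefined. The function `silhouette` is a hypothetical stand-in written for illustration; it follows the usual definition (mean of (b - a) / max(a, b) over samples) and is not the library's code.

```python
def silhouette(dist, labels):
    """Mean silhouette over samples; None unless 2 <= n_clusters <= n_samples - 1."""
    clusters = sorted(set(labels))
    n = len(labels)
    if len(clusters) < 2 or len(clusters) >= n:
        return None
    scores = []
    for i in range(n):
        same = [dist[i][j] for j in range(n) if j != i and labels[j] == labels[i]]
        if not same:
            # A singleton cluster contributes a score of 0 by convention.
            scores.append(0.0)
            continue
        a = sum(same) / len(same)  # mean distance to own cluster
        b = min(                   # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / n

points = [0.0, 1.0, 10.0, 11.0]
dist = [[abs(p - q) for q in points] for p in points]
score = silhouette(dist, [0, 0, 1, 1])
single = silhouette(dist, [0, 0, 0, 0])  # only one cluster, so the score is undefined
```

Scores close to 1 indicate compact, well-separated clusters; the None case is why callers should check the result before formatting it.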

Return type:

tuple[numpy.ndarray, float or None]

Example:

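>>> import numpy as np
>>> from ciss_vae.utils.clustering import cluster_on_missing
>>> # data is a pandas.DataFrame containing missing values (see Parameters)
>>> labels, silh = cluster_on_missing(data, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.408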

cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)[source]

Cluster samples based on their per-feature missingness proportions using KMeans or Leiden.

When n_clusters is None, performs Leiden clustering on a graph constructed from the missingness proportion matrix. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph with Jaccard-based or metric-based similarity; otherwise uses a standard kNN graph. Returns both the cluster labels and an optional silhouette score.

Parameters:

prop_matrix (pandas.DataFrame or numpy.ndarray) – Matrix of missingness proportions, shape (n_samples, n_features). Each entry is the fraction of missing values for one feature within one sample. Values must lie in [0, 1].
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering instead. Defaults to None.
seed (int or None, optional) – Random seed for KMeans initialization and Leiden reproducibility. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors for kNN/SNN graph construction used by Leiden. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual or weighted neighbor overlap. If False, uses standard kNN. Defaults to True.
snn_mutual (bool, optional) – If True, retains only mutual nearest neighbors when building the SNN graph. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter controlling cluster granularity in Leiden. Higher values produce more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
metric (str, optional) – Distance metric used for kNN graph construction and silhouette calculation. Recommended options are "euclidean" or "cosine". Defaults to "euclidean".
scale_features (bool, optional) – Whether to standardize features (zero mean, unit variance) prior to clustering. Recommended when feature scales differ widely. Defaults to False.
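The effect of the metric and scale_features options can be seen on a tiny proportion matrix. The sketch below uses hypothetical helpers (`euclidean`, `cosine_distance`, `standardize`) written with the standard library only; it assumes column-wise standardization with the population standard deviation, which the source does not specify.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two rows."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; zero-norm rows are treated as maximally distant in this sketch."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 1.0
    return 1.0 - sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)

def standardize(matrix):
    """Column-wise zero mean and unit variance; constant columns map to 0."""
    n = len(matrix)
    columns = list(zip(*matrix))
    means = [sum(col) / n for col in columns]
    stds = [math.sqrt(sum((v - m) ** 2 for v in col) / n) for col, m in zip(columns, means)]
    return [
        [(v - m) / s if s > 0 else 0.0 for v, m, s in zip(row, means, stds)]
        for row in matrix
    ]

# Two samples with the same missingness profile at different overall levels.
prop = [[0.2, 0.4], [0.4, 0.8]]
d_euclid = euclidean(prop[0], prop[1])
d_cosine = cosine_distance(prop[0], prop[1])
scaled = standardize(prop)
```

Cosine distance ignores the overall level of missingness and compares only the profile across features, while Euclidean distance is sensitive to both, which is one reason to choose between them deliberately.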

Returns:

Tuple (labels, silhouette):

- labels (numpy.ndarray): Cluster assignments of length n_samples.
- silhouette (float or None): Silhouette score based on the same metric; None if undefined (e.g., single cluster).

Return type:

tuple[numpy.ndarray, float or None]

Example:

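>>> import numpy as np
>>> from ciss_vae.utils.clustering import cluster_on_missing_prop
>>> # prop_matrix holds per-sample missingness proportions in [0, 1] (see Parameters)
>>> labels, silh = cluster_on_missing_prop(prop_matrix, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2, 3])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.421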

ciss_vae.utils.helpers module

plot_vae_architecture(model, title=None, color_shared='skyblue', color_unshared='lightcoral', color_latent='gold', color_input='lightgreen', color_output='lightgreen', figsize=(16, 8), return_fig=False, fontsize_layer=12, fontsize_section=14, fontsize_title=16)[source]

Plots a horizontal schematic of the VAE architecture, showing shared and cluster-specific layers.

Parameters:

model (nn.Module) – An instance of CISSVAE model to visualize
title (str, optional) – Title of the plot, defaults to None
color_shared (str, optional) – Color for shared hidden layers, defaults to "skyblue"
color_unshared (str, optional) – Color for unshared hidden layers, defaults to "lightcoral"
color_latent (str, optional) – Color for latent layer, defaults to "gold"
color_input (str, optional) – Color for input layer, defaults to "lightgreen"
color_output (str, optional) – Color for output layer, defaults to "lightgreen"
figsize (tuple, optional) – Size of the matplotlib figure, defaults to (16, 8)
return_fig (bool, optional) – Whether to return the figure object instead of displaying, defaults to False
fontsize_layer (int, optional) – Font size of layer blocks, defaults to 12
fontsize_section (int, optional) – Font size of encoder/decoder labels, defaults to 14
fontsize_title (int, optional) – Font size of title, defaults to 16

Returns:

Matplotlib figure object if return_fig is True, otherwise None

Return type:

matplotlib.figure.Figure or None

get_imputed_df(model, data_loader, device='cpu')[source]

Given a trained model and a DataLoader for the ClusterDataset, return the imputed dataset as a pandas DataFrame.

Reconstructs missing values using the trained VAE model and returns the complete dataset with original scaling restored and validation entries replaced with true values.

Parameters:

model (CISSVAE) – Trained CISSVAE model (should be in eval() mode)
data_loader (torch.utils.data.DataLoader) – DataLoader for the original ClusterDataset
device (str, optional) – Device to run computations on, defaults to "cpu"

Returns:

DataFrame containing imputed (unscaled) data with original row ordering

Return type:

pandas.DataFrame
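The idea behind get_imputed_df, keeping true values where they exist and filling the gaps with rescaled reconstructions, can be sketched for a single row as follows. The function `fill_missing` is hypothetical, and z-score scaling (value = scaled * std + mean) is an assumption made for illustration; the source does not state which scaling ciss_vae uses.

```python
def fill_missing(observed_row, recon_scaled_row, means, stds):
    """Keep observed values; replace missing ones (None) with unscaled reconstructions."""
    return [
        obs if obs is not None else recon * std + mean
        for obs, recon, mean, std in zip(observed_row, recon_scaled_row, means, stds)
    ]

observed = [1.0, None]       # second feature is missing
recon_scaled = [0.1, -0.5]   # model output on the scaled axis (made-up numbers)
means = [2.0, 10.0]          # assumed per-feature scaling statistics
stds = [1.0, 4.0]

imputed = fill_missing(observed, recon_scaled, means, stds)
```

The observed entry is returned untouched, while the missing entry is mapped back to the original scale before being filled in.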

get_imputed(model, data_loader, device='cpu')[source]

Returns a ClusterDataset where originally missing values have been replaced with model reconstructions.

Processes the dataset through the trained VAE model to reconstruct missing values, including validation-masked entries. The returned dataset maintains the same structure as the original but with missing values filled in.

Parameters:

model (nn.Module) – Trained VAE model
data_loader (torch.utils.data.DataLoader) – DataLoader for the original ClusterDataset
device (str, optional) – Torch device for computations, defaults to "cpu"

Returns:

ClusterDataset with reconstructed values filled in at originally missing positions

Return type:

ClusterDataset

compute_val_mse(model, dataset, device='cpu', auto_fix_binary=False, eps=1e-07, debug=False)[source]

evaluate_imputation(imputed_df, df_complete, df_missing, activation_groups=None)[source]

Evaluate CISSVAE performance by comparing the imputed dataset against the true complete dataset.

Supports mixed data types:

- continuous → MSE
- binary → BCE-style squared error
- categorical → classification error

Returns the overall error and a detailed comparison DataFrame.

Parameters:

imputed_df (pandas.DataFrame) – An imputed version of df_missing.
df_complete (pandas.DataFrame) – A complete dataset with no missingness.
df_missing (pandas.DataFrame) – A version of df_complete with induced missingness.
activation_groups (dict[str, list[int]]) – Dictionary mapping feature types to column indices. Expected format:

{
    "continuous": [int, …],
    "binary": [int, …],
    "<categorical_name>": [int, …],
    …
}

Each key defines a feature group, and values are lists of column indices corresponding to that group. Categorical variables must be represented as grouped indices (e.g., one-hot encoded columns belonging to the same variable).
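As a rough illustration of how an activation_groups dictionary could drive per-type scoring, the sketch below scores only the positions that are missing in the masked data. The function `group_errors` is hypothetical: it uses plain squared error for the "continuous" and "binary" groups and argmax mismatch for every other (categorical) group, which may differ in detail from what evaluate_imputation computes.

```python
def group_errors(imputed, complete, missing, activation_groups):
    """Per-group error on positions that are None in `missing`."""
    def argmax(values):
        return max(range(len(values)), key=values.__getitem__)

    errors = {}
    for name, cols in activation_groups.items():
        if name in ("continuous", "binary"):
            squared = [
                (imputed[i][c] - complete[i][c]) ** 2
                for i, row in enumerate(missing) for c in cols if row[c] is None
            ]
            errors[name] = sum(squared) / len(squared) if squared else None
        else:
            # Categorical group: compare the predicted and true one-hot positions.
            rows = [i for i, row in enumerate(missing) if any(row[c] is None for c in cols)]
            wrong = [
                argmax([imputed[i][c] for c in cols]) != argmax([complete[i][c] for c in cols])
                for i in rows
            ]
            errors[name] = sum(wrong) / len(wrong) if wrong else None
    return errors

# Column 0 is continuous, column 1 is binary, columns 2-3 one-hot encode "color".
activation_groups = {"continuous": [0], "binary": [1], "color": [2, 3]}
complete = [[1.0, 1, 1, 0], [2.0, 0, 0, 1]]
missing = [[None, 1, 1, 0], [2.0, None, None, None]]
imputed = [[1.5, 1, 1, 0], [2.0, 0.2, 0.7, 0.3]]

errors = group_errors(imputed, complete, missing, activation_groups)
```

Grouping the one-hot columns under a single key is what lets the categorical variable be scored once per sample rather than once per column.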

ciss_vae.utils.loss module

loss_function(cluster, mask, recon_x, x, activation_groups, mu, logvar, beta=0.001, return_components=False, imputable_mask=None, device='cpu', debug=False)[source]

loss_function_nomask(cluster, recon_x, x, activation_groups, mu, logvar, beta=0.001, return_components=False, imputable_mask=None, device='cpu', debug=False)[source]
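The source gives only the signatures of these loss functions. As general background on what arguments named mask, mu, logvar and beta conventionally denote in a VAE objective, the sketch below computes a masked reconstruction error plus a beta-weighted KL term. It is a generic beta-VAE illustration with hypothetical function names and should not be read as the behavior of ciss_vae's loss_function.

```python
import math

def masked_mse(recon_x, x, mask):
    """Mean squared error over entries where mask == 1 (observed entries)."""
    squared = [(r - t) ** 2 for r, t, m in zip(recon_x, x, mask) if m]
    return sum(squared) / len(squared) if squared else 0.0

def kl_divergence(mu, logvar):
    """KL divergence between N(mu, exp(logvar)) and the standard normal prior."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))

def beta_vae_loss(recon_x, x, mask, mu, logvar, beta=0.001):
    """Reconstruction error plus a beta-weighted KL regularizer."""
    return masked_mse(recon_x, x, mask) + beta * kl_divergence(mu, logvar)

# The second entry is masked out, so only the first contributes to the reconstruction term.
loss = beta_vae_loss(
    recon_x=[1.0, 5.0], x=[0.0, 9.0], mask=[1, 0], mu=[1.0], logvar=[0.0]
)
```

A small beta, such as the default 0.001 in the signatures above, keeps the KL term from dominating the reconstruction error.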

ciss_vae.utils.matrix module

class MissingnessMatrix(data, feature_columns_map, feature_names, sample_names=None)[source]

Bases: object

A matrix with missingness proportions and metadata.

property shape: Return (n_samples, n_features).

to_dataframe()[source]

Convert to pandas DataFrame with preserved names.

Return type:

DataFrame

to_numpy(dtype=None, copy=False)[source]

Return the underlying NumPy array (optionally cast/copied).

Return type:

ndarray

head()[source]

create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)[source]

Create a missingness proportion matrix summarizing feature-level missingness per sample.

Computes the proportion of missing values for each feature within each sample, optionally aggregating repeated measurements (e.g., feature_t1, feature_t2). Can also accept an explicit column_mapping from base feature → list of columns.

Parameters:

data (pandas.DataFrame or numpy.ndarray) – Input dataset (coercible to DataFrame).
index_col (str or None, optional) – Optional column to use as sample index in the output metadata (not scored).
cols_ignore (list[str] or None, optional) – Columns to exclude from scoring (e.g., IDs, non-features).
na_values (list[Any] or None, optional) – Extra values to treat as missing (in addition to NaN/None/±Inf).
repeat_feature_names (list[str] or None, optional) – Base feature names that have repeated timepoints to be aggregated. Columns are matched by regex pattern:

- if timepoint_prefix is provided: ^<feat>_<prefix>\d+$
- else: ^<feat>_\d+$
timepoint_prefix (str or None, optional) – Optional prefix that appears before the timepoint integer, e.g., t to match feat_t1.
nonint_timepoint (bool, optional) – If True, any text after '_' counts as the timepoint (e.g., Baseline).
column_mapping (dict[str, list[str]] or None, optional) – Explicit mapping { base_feature: [col1, col2, …] } to aggregate. Takes precedence.
loose (bool, optional) – If True, matches any column whose name starts with one of the base feature names in repeat_feature_names. Defaults to False.

Returns:

MissingnessMatrix with:

- data: (n_samples, n_features) matrix of missingness proportions
- feature_columns_map: mapping of base features → contributing columns
- to_dataframe() to view as DataFrame

Return type:

MissingnessMatrix
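The column-matching rule and the per-sample proportion described for create_missingness_prop_matrix can be sketched with the standard library. The helpers `repeat_columns` and `missing_proportion` are hypothetical; they mirror the documented pattern ^<feat>_<prefix>\d+$ for one base feature and one sample, not the library's internals.

```python
import math
import re

def is_missing(value):
    """Treat None and NaN as missing."""
    return value is None or (isinstance(value, float) and math.isnan(value))

def repeat_columns(columns, feat, timepoint_prefix=None):
    """Select columns named <feat>_<prefix><int>, e.g. bp_t1, bp_t2."""
    prefix = re.escape(timepoint_prefix) if timepoint_prefix else ""
    pattern = re.compile(rf"^{re.escape(feat)}_{prefix}\d+$")
    return [c for c in columns if pattern.match(c)]

def missing_proportion(record, cols):
    """Fraction of the given columns that are missing in one sample."""
    return sum(is_missing(record[c]) for c in cols) / len(cols)

columns = ["id", "bp_t1", "bp_t2", "bp_t3", "bp_total", "age"]
record = {"id": "s1", "bp_t1": 120.0, "bp_t2": None,
          "bp_t3": float("nan"), "bp_total": 3, "age": 40}

bp_cols = repeat_columns(columns, "bp", timepoint_prefix="t")
prop = missing_proportion(record, bp_cols)
```

Note that bp_total is not matched, because the pattern requires digits after the prefix; a name like that would need loose matching or an explicit column_mapping entry.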

Module contents

loss_function(cluster, mask, recon_x, x, activation_groups, mu, logvar, beta=0.001, return_components=False, imputable_mask=None, device='cpu', debug=False)[source]

plot_vae_architecture(model, title=None, color_shared='skyblue', color_unshared='lightcoral', color_latent='gold', color_input='lightgreen', color_output='lightgreen', figsize=(16, 8), return_fig=False, fontsize_layer=12, fontsize_section=14, fontsize_title=16)[source]


compute_val_mse(model, dataset, device='cpu', auto_fix_binary=False, eps=1e-07, debug=False)[source]

get_imputed(model, data_loader, device='cpu')[source]


get_imputed_df(model, data_loader, device='cpu')[source]


create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)[source]


cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)[source]


cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)[source]
