ciss_vae.utils package
Submodules
ciss_vae.utils.clustering module
run_cissvae takes the dataset as input and (optionally) clusters on missingness before running the full ciss_vae model.
- cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)[source]
Cluster samples based on their missingness patterns using KMeans or Leiden.
When n_clusters is None, performs Leiden clustering on a graph constructed from the binary missingness mask of the dataset. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph using Jaccard similarity; otherwise, constructs a standard kNN graph with Jaccard weights. Returns both the cluster labels and an optional silhouette score.
- Parameters:
data (pandas.DataFrame) – Input dataset with potential missing values, shape (n_samples, n_features). Non-numeric columns should be excluded or specified in cols_ignore.
cols_ignore (list[str] or None, optional) – Column names to exclude from the missingness pattern clustering. Typically includes identifiers or static metadata columns. Defaults to None.
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering on the binary missingness mask instead. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors used when constructing the kNN/SNN graph for Leiden clustering. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual neighbor overlap weighted by Jaccard similarity. If False, uses a standard kNN graph weighted by Jaccard distance. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering; higher values yield more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility in KMeans and Leiden algorithms. Defaults to 42.
- Returns:
Tuple (labels, silhouette):
- labels (numpy.ndarray): Cluster assignments of length n_samples.
- silhouette (float or None): Silhouette score computed using Jaccard distance on the binary missingness mask; None if undefined.
- Return type:
tuple[numpy.ndarray, float or None]
Example:
>>> labels, silh = cluster_on_missing(data, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.408
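The Jaccard distance on a binary missingness mask, which the description above relies on, can be illustrated with a small standalone sketch. This is not the library's implementation: the helper names missing_mask and jaccard_distance are hypothetical, the data is made up, and the convention for two fully observed rows (distance 0) is an assumption.

```python
import math

def missing_mask(rows):
    """Return a binary mask: 1 where a value is missing (None or NaN), else 0."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [[1 if is_missing(v) else 0 for v in row] for row in rows]

def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors: 1 - |a AND b| / |a OR b|."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    # Assumed convention: two rows with nothing missing count as identical.
    return 0.0 if either == 0 else 1.0 - both / either

# Toy data (hypothetical): rows 0 and 1 share a missingness pattern, row 2 is its complement.
rows = [
    [1.0, None, 3.0, None],
    [2.0, None, 1.5, float("nan")],
    [None, 4.0, None, 0.5],
]
mask = missing_mask(rows)
d01 = jaccard_distance(mask[0], mask[1])  # identical patterns
d02 = jaccard_distance(mask[0], mask[2])  # disjoint patterns
```

Rows with the same missingness pattern end up at distance 0 regardless of their observed values, which is why clustering on the mask groups samples by what is missing rather than by what was measured.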
- cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)[source]
Cluster samples based on their per-feature missingness proportions using KMeans or Leiden.
When n_clusters is None, performs Leiden clustering on a graph constructed from the missingness proportion matrix. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph with Jaccard-based or metric-based similarity; otherwise uses a standard kNN graph. Returns both the cluster labels and an optional silhouette score.
- Parameters:
prop_matrix (pandas.DataFrame or numpy.ndarray) – Matrix of missingness proportions, shape (n_samples, n_features). Each entry represents the fraction of missing values for a feature within each sample. Values must lie in [0, 1].
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering instead. Defaults to None.
seed (int or None, optional) – Random seed for KMeans initialization and Leiden reproducibility. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors for kNN/SNN graph construction used by Leiden. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual or weighted neighbor overlap. If False, uses a standard kNN graph. Defaults to True.
snn_mutual (bool, optional) – If True, retains only mutual nearest neighbors when building the SNN graph. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter controlling cluster granularity in Leiden. Higher values produce more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
metric (str, optional) – Distance metric used for kNN graph construction and silhouette calculation. Recommended options are "euclidean" or "cosine". Defaults to "euclidean".
scale_features (bool, optional) – Whether to standardize features (zero mean, unit variance) prior to clustering. Recommended when feature scales differ widely. Defaults to False.
- Returns:
Tuple (labels, silhouette):
- labels (numpy.ndarray): Cluster assignments of length n_samples.
- silhouette (float or None): Silhouette score based on the same metric; None if undefined (e.g., single cluster).
- Return type:
tuple[numpy.ndarray, float or None]
Example:
>>> labels, silh = cluster_on_missing_prop(prop_matrix, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2, 3])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.421
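To make the use_snn and snn_mutual options more concrete, here is a minimal pure-Python sketch of one common way to build a shared-nearest-neighbor graph. It is an illustration under stated assumptions, not the package's code: the helpers knn_indices and snn_edges are hypothetical, the edge weight is assumed to be the Jaccard overlap of the two neighbor sets, and the toy proportion matrix is invented.

```python
import math

def knn_indices(points, k):
    """For each point, return the set of indices of its k nearest neighbors (Euclidean)."""
    neighbors = []
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        neighbors.append({j for _, j in dists[:k]})
    return neighbors

def snn_edges(points, k, mutual=True):
    """Weight each linked pair by the Jaccard overlap of their neighbor sets (assumed weighting)."""
    nbrs = knn_indices(points, k)
    edges = {}
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            if mutual:
                linked = j in nbrs[i] and i in nbrs[j]  # keep only mutual neighbors
            else:
                linked = j in nbrs[i] or i in nbrs[j]
            if not linked:
                continue
            shared = len(nbrs[i] & nbrs[j])
            union = len(nbrs[i] | nbrs[j])
            if shared:
                edges[(i, j)] = shared / union
    return edges

# Toy missingness proportions (hypothetical): two well-separated groups of three samples.
props = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (1.0, 1.0), (0.9, 1.0), (1.0, 0.9),
]
edges = snn_edges(props, k=2, mutual=True)
```

In this toy case the graph has edges only within each group of three samples, so a community-detection step such as Leiden would have two obvious communities to recover.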
ciss_vae.utils.helpers module
- plot_vae_architecture(model, title=None, color_shared='skyblue', color_unshared='lightcoral', color_latent='gold', color_input='lightgreen', color_output='lightgreen', figsize=(16, 8), return_fig=False, fontsize_layer=12, fontsize_section=14, fontsize_title=16)[source]
Plots a horizontal schematic of the VAE architecture, showing shared and cluster-specific layers.
- Parameters:
model (nn.Module) – An instance of CISSVAE model to visualize
title (str, optional) – Title of the plot, defaults to None
color_shared (str, optional) – Color for shared hidden layers, defaults to “skyblue”
color_unshared (str, optional) – Color for unshared hidden layers, defaults to “lightcoral”
color_latent (str, optional) – Color for latent layer, defaults to “gold”
color_input (str, optional) – Color for input layer, defaults to “lightgreen”
color_output (str, optional) – Color for output layer, defaults to “lightgreen”
figsize (tuple, optional) – Size of the matplotlib figure, defaults to (16, 8)
return_fig (bool, optional) – Whether to return the figure object instead of displaying, defaults to False
fontsize_layer (int, optional) – Font size of layer blocks, defaults to 12
fontsize_section (int, optional) – Font size of encoder/decoder labels, defaults to 14
fontsize_title (int, optional) – Font size of title, defaults to 16
- Returns:
Matplotlib figure object if return_fig is True, otherwise None
- Return type:
matplotlib.figure.Figure or None
- get_imputed_df(model, data_loader, device='cpu')[source]
Given a trained model and a cluster dataset object, get the imputed dataset as a pandas DataFrame.
Reconstructs missing values using the trained VAE model and returns the complete dataset with original scaling restored and validation entries replaced with true values.
- Parameters:
model (CISSVAE) – Trained CISSVAE model (should be in eval() mode)
data_loader (torch.utils.data.DataLoader) – DataLoader for the original ClusterDataset
device (str, optional) – Device to run computations on, defaults to “cpu”
- Returns:
DataFrame containing imputed (unscaled) data with original row ordering
- Return type:
pandas.DataFrame
- get_imputed(model, data_loader, device='cpu')[source]
Returns a ClusterDataset where originally missing values have been replaced with model reconstructions.
Processes the dataset through the trained VAE model to reconstruct missing values, including validation-masked entries. The returned dataset maintains the same structure as the original but with missing values filled in.
- Parameters:
model (nn.Module) – Trained VAE model
data_loader (torch.utils.data.DataLoader) – DataLoader for the original ClusterDataset
device (str, optional) – Torch device for computations, defaults to “cpu”
- Returns:
ClusterDataset with reconstructed values filled in at originally missing positions
- Return type:
ClusterDataset
- compute_val_mse(model, dataset, device='cpu', auto_fix_binary=False, eps=1e-07, debug=False)[source]
- evaluate_imputation(imputed_df, df_complete, df_missing, activation_groups=None)[source]
Test CISSVAE performance by evaluating imputed dataset vs true complete dataset.
Supports mixed data types:
- continuous → MSE
- binary → BCE-style squared error
- categorical → classification error
Returns overall error and detailed comparison dataframe.
- Parameters:
imputed_df (pd.DataFrame) – An imputed version of df_missing.
df_complete (pd.DataFrame) – A complete dataset with no missingness.
df_missing (pd.DataFrame) – A version of df_complete with induced missingness.
activation_groups (dict[str, list[int]]) – Dictionary mapping feature types to column indices. Expected format:
{
"continuous": [int, …],
"binary": [int, …],
"<categorical_name>": [int, …],
…
}
Each key defines a feature group, and values are lists of column indices corresponding to that group. Categorical variables must be represented as grouped indices (e.g., one-hot encoded columns belonging to the same variable).
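As an illustration of the expected activation_groups format, the sketch below builds such a dictionary from column names. The helper build_activation_groups, the column names, and the "<prefix>_<level>" naming rule for one-hot columns are all assumptions made for this example; they are not part of the package.

```python
def build_activation_groups(columns, binary_cols, categorical_prefixes):
    """Map each column index to "continuous", "binary", or a categorical group name.

    Assumes one-hot columns are named "<prefix>_<level>" (a convention chosen
    for this sketch only).
    """
    groups = {"continuous": [], "binary": []}
    for idx, name in enumerate(columns):
        prefix = next(
            (p for p in categorical_prefixes if name.startswith(p + "_")), None
        )
        if prefix is not None:
            # All one-hot columns of the same variable share one group.
            groups.setdefault(prefix, []).append(idx)
        elif name in binary_cols:
            groups["binary"].append(idx)
        else:
            groups["continuous"].append(idx)
    return groups

# Hypothetical column layout: two continuous, one binary, one 3-level categorical.
columns = ["age", "bmi", "smoker", "site_A", "site_B", "site_C"]
activation_groups = build_activation_groups(
    columns, binary_cols={"smoker"}, categorical_prefixes=["site"]
)
```

Here the three site_* columns end up under a single "site" key, matching the requirement that a categorical variable be represented by grouped indices.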
ciss_vae.utils.loss module
ciss_vae.utils.matrix module
- class MissingnessMatrix(data, feature_columns_map, feature_names, sample_names=None)[source]
Bases: object
A matrix with missingness proportions and metadata.
- property shape
Return (n_samples, n_features).
- create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)[source]
Create a missingness proportion matrix summarizing feature-level missingness per sample.
Computes the proportion of missing values for each feature within each sample, optionally aggregating repeated measurements (e.g., feature_t1, feature_t2). Can also accept an explicit column_mapping from base feature → list of columns.
- Parameters:
data (pandas.DataFrame or numpy.ndarray) – Input dataset (coercible to DataFrame).
index_col (str or None, optional) – Optional column to use as sample index in the output metadata (not scored).
cols_ignore (list[str] or None, optional) – Columns to exclude from scoring (e.g., IDs, non-features).
na_values (list[Any] or None, optional) – Extra values to treat as missing (in addition to NaN/None/±Inf).
repeat_feature_names (list[str] or None, optional) – Base feature names that have repeated timepoints to be aggregated. Columns are matched by regex pattern:
- if timepoint_prefix is provided: ^<feat>_<prefix>\d+$
- else: ^<feat>_\d+$
timepoint_prefix (str or None, optional) – Optional prefix that appears before the timepoint integer, e.g., t to match feat_t1.
nonint_timepoint (bool, optional) – If true, any text after ‘_’ will count as a timepoint (e.g., Baseline).
column_mapping (dict[str, list[str]] or None, optional) – Explicit mapping { base_feature: [col1, col2, …] } to aggregate. Takes precedence.
loose (bool) – If true, will match any text starting with the base feature names in repeat_feature_names.
- Returns:
MissingnessMatrix with:
- data: (n_samples, n_features) matrix of missingness proportions
- feature_columns_map: mapping of base features → contributing columns
- to_dataframe() to view as DataFrame
- Return type:
MissingnessMatrix
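The column-matching patterns and the per-sample proportion described above can be sketched with the standard library alone. This is a simplified illustration, not the function's actual code: match_repeat_columns and missing_proportion are hypothetical helpers, and the column names and values are invented.

```python
import math
import re

def match_repeat_columns(columns, feat, timepoint_prefix=None):
    """Select columns matching ^<feat>_<prefix>\\d+$ (or ^<feat>_\\d+$ without a prefix)."""
    prefix = re.escape(timepoint_prefix) if timepoint_prefix else ""
    pattern = re.compile(rf"^{re.escape(feat)}_{prefix}\d+$")
    return [c for c in columns if pattern.match(c)]

def missing_proportion(row, cols):
    """Fraction of the given columns that are missing (None, NaN, or ±Inf) in one sample."""
    def is_missing(v):
        return v is None or (
            isinstance(v, float) and (math.isnan(v) or math.isinf(v))
        )
    return sum(is_missing(row[c]) for c in cols) / len(cols)

# Hypothetical columns: "bp_total" must not be treated as a timepoint of "bp".
columns = ["id", "bp_t1", "bp_t2", "bp_t3", "bp_total", "age"]
bp_cols = match_repeat_columns(columns, "bp", timepoint_prefix="t")

row = {"id": 7, "bp_t1": 120.0, "bp_t2": None, "bp_t3": float("nan"),
       "bp_total": 1.0, "age": 54}
prop = missing_proportion(row, bp_cols)  # two of three timepoints are missing
```

Anchoring the pattern with ^ and $ and requiring digits after the prefix is what keeps a column such as bp_total out of the bp group; the nonint_timepoint and loose options relax that strictness.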
Module contents
- loss_function(cluster, mask, recon_x, x, activation_groups, mu, logvar, beta=0.001, return_components=False, imputable_mask=None, device='cpu', debug=False)[source]
- plot_vae_architecture(model, title=None, color_shared='skyblue', color_unshared='lightcoral', color_latent='gold', color_input='lightgreen', color_output='lightgreen', figsize=(16, 8), return_fig=False, fontsize_layer=12, fontsize_section=14, fontsize_title=16)[source]
- compute_val_mse(model, dataset, device='cpu', auto_fix_binary=False, eps=1e-07, debug=False)[source]
- get_imputed(model, data_loader, device='cpu')[source]
- get_imputed_df(model, data_loader, device='cpu')[source]
- create_missingness_prop_matrix(data, index_col=None, cols_ignore=None, na_values=None, repeat_feature_names=None, timepoint_prefix=None, nonint_timepoint=False, column_mapping=None, loose=False)[source]
- cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)[source]
- cluster_on_missing_prop(prop_matrix, *, n_clusters=None, seed=None, k_neighbors=15, use_snn=True, snn_mutual=True, leiden_resolution=0.5, leiden_objective='CPM', metric='euclidean', scale_features=False)[source]