ciss_vae.training.run_cissvae.run_cissvae
- run_cissvae(data, val_proportion=0.1, replacement_value=0.0, columns_ignore=None, print_dataset=True, imputable_matrix=None, binary_feature_mask=None, categorical_column_map=None, clusters=None, n_clusters=None, k_neighbors=15, leiden_resolution=0.5, leiden_objective='CPM', seed=42, missingness_proportion_matrix=None, scale_features=False, hidden_dims=[150, 120, 60], latent_dim=15, layer_order_enc=['unshared', 'unshared', 'unshared'], layer_order_dec=['shared', 'shared', 'shared'], latent_shared=False, output_shared=False, batch_size=4000, return_model=True, epochs=500, initial_lr=0.01, decay_factor=0.999, weight_decay=0.001, beta=0.001, device=None, max_loops=100, patience=2, epochs_per_loop=None, initial_lr_refit=None, decay_factor_refit=None, beta_refit=None, verbose=False, return_clusters=False, return_silhouettes=False, return_history=False, return_dataset=False, debug=False)
End-to-end pipeline for Clustering-Informed Shared-Structure Variational Autoencoder (CISS-VAE).
This workflow prepares data (validation masking, optional feature/biomarker clustering inputs), optionally infers sample clusters, trains the VAE, and performs iterative impute–refit loops with early stopping. Returns the final imputed dataset and, optionally, the trained model and auxiliary artifacts.
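A minimal usage sketch, assuming the import path implied by the qualified name above and a toy DataFrame invented purely for illustration. The helper names make_toy_data and impute_toy_data are hypothetical, and the small epochs and max_loops values are chosen only to keep a trial run short. The import is deferred into the function so that the toy-data part runs even without the package installed.

```python
import numpy as np
import pandas as pd


def make_toy_data(n_samples=200, n_features=6, missing_rate=0.2, seed=42):
    """Build a random numeric DataFrame with NaNs scattered at random (illustrative only)."""
    rng = np.random.default_rng(seed)
    values = rng.normal(size=(n_samples, n_features))
    values[rng.random(values.shape) < missing_rate] = np.nan
    return pd.DataFrame(values, columns=[f"x{i}" for i in range(n_features)])


def impute_toy_data(df):
    """Run the pipeline on df; requires the ciss_vae package to be installed."""
    from ciss_vae.training.run_cissvae import run_cissvae

    # return_model defaults to True, so the documented return order
    # yields (imputed_dataset, model) for this call.
    imputed, model = run_cissvae(
        df,
        n_clusters=2,         # KMeans with two sample clusters
        epochs=50,            # shorter than the default 500, for a quick run
        max_loops=5,          # cap on impute-refit loops
        print_dataset=False,
        seed=42,
    )
    return imputed, model


toy = make_toy_data()
print(toy.isna().mean().round(2))  # per-column fraction of missing entries
```

Calling impute_toy_data(toy) would then run the full pipeline on the toy matrix.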
- Parameters:
data (pandas.DataFrame or numpy.ndarray or torch.Tensor) – Input matrix with potential missing values, shape (n_samples, n_features).
val_proportion (float or collections.abc.Sequence or collections.abc.Mapping or pandas.Series, optional) – Per-cluster fraction of non-missing entries to mask for validation. May be a single float (global), a per-cluster sequence, or mapping. Defaults to 0.1.
replacement_value (float, optional) – Value used to fill masked validation entries in the training tensor. Does not affect the separate validation target tensor. Defaults to 0.0.
columns_ignore (list[str or int] or None, optional) – Columns to exclude from validation masking (names if data is a DataFrame, otherwise integer indices). Defaults to None.
print_dataset (bool, optional) – If True, prints dataset summary/statistics during setup. Defaults to True.
imputable_matrix (pandas.DataFrame or numpy.ndarray or torch.Tensor or None, optional) – Optional binary mask with the same shape as data indicating which entries are eligible for imputation. Use 1 to allow imputation and 0 to exclude from imputation. Defaults to None.
binary_feature_mask (list[bool] or numpy.ndarray) – 1D boolean vector of length n_features indicating which columns are binary. Used during dataset construction to derive activation_groups. Columns belonging to categorical dummy variables must also be marked as True.
categorical_column_map (dict[str, list[str or int]]) – Optional dictionary mapping original categorical variable names to their corresponding dummy-variable columns. Example: {"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}. These columns are grouped together in activation_groups and treated as categorical variables during loss computation and imputation. All listed columns must also be marked as True in binary_feature_mask.
clusters (array-like or None, optional) – Precomputed cluster labels for samples (length n_samples). If None, clustering may be performed depending on n_clusters and Leiden settings. Defaults to None.
n_clusters (int or None, optional) – If provided, performs KMeans with n_clusters. If None and clusters is also None, Leiden-based clustering is used. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors for the Leiden KNN graph construction. Defaults to 15.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden clustering. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
missingness_proportion_matrix (pandas.DataFrame or numpy.ndarray or None, optional) – Optional matrix for biomarker/feature clustering where each entry is the per-sample proportion of missingness for each feature. If provided, can guide clustering on missingness patterns. Defaults to None.
scale_features (bool, optional) – If True, standardizes features for proportion-matrix-based clustering. Defaults to False.
hidden_dims (list[int], optional) – Encoder/decoder hidden layer sizes (mirrored architecture). Defaults to [150, 120, 60].
latent_dim (int, optional) – Dimensionality of the latent space. Defaults to 15.
layer_order_enc (list[str], optional) – Per-layer specification for encoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["unshared", "unshared", "unshared"].
layer_order_dec (list[str], optional) – Per-layer specification for decoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["shared", "shared", "shared"].
latent_shared (bool, optional) – If True, shares latent layer parameters across clusters. Defaults to False.
output_shared (bool, optional) – If True, shares final output layer across clusters. Defaults to False.
batch_size (int, optional) – Batch size for training. Defaults to 4000.
return_model (bool, optional) – If True, include the trained VAE model in the return tuple. Defaults to True.
epochs (int, optional) – Number of epochs in the initial training phase. Defaults to 500.
initial_lr (float, optional) – Initial learning rate for the optimizer. Defaults to 0.01.
decay_factor (float, optional) – Multiplicative LR decay applied per epoch (lr *= decay_factor). Defaults to 0.999.
beta (float, optional) – Weight of the KL-divergence term in the VAE loss for initial training. Defaults to 0.001.
device (str or torch.device or None, optional) – Compute device, e.g., "cpu" or "cuda". If None, selected automatically. Defaults to None.
max_loops (int, optional) – Maximum number of impute–refit loops. Defaults to 100.
patience (int, optional) – Early stopping patience counted in loops without improvement. Defaults to 2.
epochs_per_loop (int or None, optional) – Number of epochs per refit loop. If None, reuses epochs. Defaults to None.
initial_lr_refit (float or None, optional) – Learning rate for refit loops. If None, uses initial_lr. Defaults to None.
decay_factor_refit (float or None, optional) – LR decay factor during refit loops. If None, uses decay_factor. Defaults to None.
beta_refit (float or None, optional) – KL weight used in refit loops. If None, uses beta. Defaults to None.
verbose (bool, optional) – If True, prints progress and diagnostics. Defaults to False.
return_clusters (bool, optional) – If True, include sample cluster labels in the return tuple. Defaults to False.
return_silhouettes (bool, optional) – If True, include clustering silhouette score(s) in the return tuple. Defaults to False.
return_history (bool, optional) – If True, include concatenated training/refit history (e.g., losses, metrics). Defaults to False.
return_dataset (bool, optional) – If True, include the constructed/processed ClusterDataset object. Defaults to False.
debug (bool, optional) – If True, enables additional checks/logging for troubleshooting. Defaults to False.
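The learning-rate decay and the refit fallbacks described above can be illustrated with a short sketch. This is not the library's code: lr_at_epoch and resolve_refit_settings are hypothetical helpers written from the parameter descriptions, and the sketch assumes the decay is applied once after each completed epoch.

```python
def lr_at_epoch(initial_lr, decay_factor, epoch):
    """Learning rate after `epoch` applications of lr *= decay_factor."""
    return initial_lr * decay_factor ** epoch


def resolve_refit_settings(epochs, initial_lr, decay_factor, beta,
                           epochs_per_loop=None, initial_lr_refit=None,
                           decay_factor_refit=None, beta_refit=None):
    """Apply the documented fallbacks: a refit value of None inherits the initial-phase value."""
    return {
        "epochs_per_loop": epochs if epochs_per_loop is None else epochs_per_loop,
        "initial_lr_refit": initial_lr if initial_lr_refit is None else initial_lr_refit,
        "decay_factor_refit": decay_factor if decay_factor_refit is None else decay_factor_refit,
        "beta_refit": beta if beta_refit is None else beta_refit,
    }


# With the documented defaults, every refit setting inherits its initial-phase value.
defaults = resolve_refit_settings(epochs=500, initial_lr=0.01, decay_factor=0.999, beta=0.001)
print(defaults)

# Learning rate at the end of the default 500-epoch initial phase (roughly 60% of initial_lr).
final_lr = lr_at_epoch(0.01, 0.999, 500)
print(f"{final_lr:.5f}")
```

With a decay factor this close to 1, the learning rate shrinks slowly, so leaving the refit values at None repeats the same gentle schedule in every loop unless they are overridden.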
- Returns:
The imputed dataset, optionally followed by model, clusters, silhouette_scores, history, and/or the ClusterDataset, depending on the return flags. The order is: (imputed_dataset[, model][, clusters][, silhouette_scores][, history][, dataset]). Because return_model defaults to True, a call with default flags returns (imputed_dataset, model).
- Return type:
pandas.DataFrame or tuple[pandas.DataFrame[, CISSVAE][, numpy.ndarray or pandas.Series][, float or dict][, pandas.DataFrame][, ClusterDataset]]
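Since the shape of the return value depends on several flags, a small helper can make the unpacking order explicit. The sketch below is illustrative only: expected_return_fields and RETURN_ORDER are hypothetical names, not part of ciss_vae, and the ordering is taken from the documented return order.

```python
# (flag name, returned field) pairs in the documented return order.
RETURN_ORDER = [
    ("return_model", "model"),
    ("return_clusters", "clusters"),
    ("return_silhouettes", "silhouette_scores"),
    ("return_history", "history"),
    ("return_dataset", "dataset"),
]


def expected_return_fields(return_model=True, return_clusters=False,
                           return_silhouettes=False, return_history=False,
                           return_dataset=False):
    """List the fields a call should return, in order, for the given flags."""
    flags = {
        "return_model": return_model,
        "return_clusters": return_clusters,
        "return_silhouettes": return_silhouettes,
        "return_history": return_history,
        "return_dataset": return_dataset,
    }
    # The imputed dataset always comes first; optional fields follow in fixed order.
    return ["imputed_dataset"] + [name for flag, name in RETURN_ORDER if flags[flag]]


print(expected_return_fields())
print(expected_return_fields(return_model=False, return_history=True))
```

When every flag is False, the documented return type is a bare pandas.DataFrame, not a one-element tuple, so code that unpacks the result should account for that case.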