ciss_vae.training.run_cissvae.run_cissvae

run_cissvae(data, val_proportion=0.1, replacement_value=0.0, columns_ignore=None, print_dataset=True, imputable_matrix=None, binary_feature_mask=None, categorical_column_map=None, clusters=None, n_clusters=None, k_neighbors=15, leiden_resolution=0.5, leiden_objective='CPM', seed=42, missingness_proportion_matrix=None, scale_features=False, hidden_dims=[150, 120, 60], latent_dim=15, layer_order_enc=['unshared', 'unshared', 'unshared'], layer_order_dec=['shared', 'shared', 'shared'], latent_shared=False, output_shared=False, batch_size=4000, return_model=True, epochs=500, initial_lr=0.01, decay_factor=0.999, weight_decay=0.001, beta=0.001, device=None, max_loops=100, patience=2, epochs_per_loop=None, initial_lr_refit=None, decay_factor_refit=None, beta_refit=None, verbose=False, return_clusters=False, return_silhouettes=False, return_history=False, return_dataset=False, debug=False)

End-to-end pipeline for Clustering-Informed Shared-Structure Variational Autoencoder (CISS-VAE).

This workflow prepares data (validation masking, optional feature/biomarker clustering inputs), optionally infers sample clusters, trains the VAE, and performs iterative impute–refit loops with early stopping. Returns the final imputed dataset and, optionally, the trained model and auxiliary artifacts.
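The early-stopping behavior of the impute–refit loops can be illustrated with a schematic loop. This is a minimal sketch, not the library's implementation: `run_refit_loops` and the `val_losses` sequence are hypothetical stand-ins for "refit the model, then score the masked validation entries", and the exact patience rule (stop once `patience` consecutive loops fail to improve) is an assumption.

```python
def run_refit_loops(val_losses, max_loops=100, patience=2):
    """Schematic early stopping over impute-refit loops.

    `val_losses` is a hypothetical sequence standing in for the validation
    loss measured after each refit loop. Returns a tuple
    (best_loop_index, best_loss, loops_run).
    """
    best_loss = float("inf")
    best_loop = None
    loops_without_improvement = 0
    loops_run = 0

    for loop, loss in enumerate(val_losses):
        if loop >= max_loops:
            break  # hard cap on the number of loops
        loops_run += 1
        if loss < best_loss:
            # Improvement: remember it and reset the patience counter.
            best_loss, best_loop = loss, loop
            loops_without_improvement = 0
        else:
            loops_without_improvement += 1
            if loops_without_improvement >= patience:
                break  # assumed rule: stop after `patience` stale loops
    return best_loop, best_loss, loops_run


# Two stale loops in a row (0.66, 0.67) stop the run before 0.50 is reached.
best_loop, best_loss, loops_run = run_refit_loops(
    [0.90, 0.70, 0.65, 0.66, 0.67, 0.50]
)
```

Under these assumptions, a small `patience` keeps runs short but can stop before a later improvement, which is the trade-off to weigh when raising it.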

Parameters:
  • data (pandas.DataFrame or numpy.ndarray or torch.Tensor) – Input matrix with potential missing values, shape (n_samples, n_features).

  • val_proportion (float or collections.abc.Sequence or collections.abc.Mapping or pandas.Series, optional) – Fraction of non-missing entries to mask for validation. May be a single float applied to every cluster, a per-cluster sequence, or a per-cluster mapping. Defaults to 0.1.

  • replacement_value (float, optional) – Value used to fill masked validation entries in the training tensor. Does not affect the separate validation target tensor. Defaults to 0.0.

  • columns_ignore (list[str or int] or None, optional) – Columns to exclude from validation masking (names if data is a DataFrame, otherwise integer indices). Defaults to None.

  • print_dataset (bool, optional) – If True, prints dataset summary/statistics during setup. Defaults to True.

  • imputable_matrix (pandas.DataFrame or numpy.ndarray or torch.Tensor or None, optional) – Optional binary mask with the same shape as data indicating which entries are eligible for imputation. Use 1 to allow imputation and 0 to exclude from imputation. Defaults to None.

  • binary_feature_mask (list[bool] or numpy.ndarray or None, optional) – 1D boolean vector of length n_features indicating which columns are binary. Used during dataset construction to derive activation_groups. Columns belonging to categorical dummy variables must also be marked as True. Defaults to None.

  • categorical_column_map (dict[str, list[str or int]] or None, optional) –

    Optional dictionary mapping original categorical variable names to their corresponding dummy-variable columns. Example:

    {"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}

    These columns are grouped together in activation_groups and treated as categorical variables during loss computation and imputation. All listed columns must also be marked as True in binary_feature_mask. Defaults to None.

  • clusters (array-like or None, optional) – Precomputed cluster labels for samples (length n_samples). If None, clustering may be performed depending on n_clusters and Leiden settings. Defaults to None.

  • n_clusters (int or None, optional) – If provided, performs KMeans with n_clusters. If None and clusters is also None, Leiden-based clustering is used. Defaults to None.

  • k_neighbors (int, optional) – Number of nearest neighbors for the Leiden KNN graph construction. Defaults to 15.

  • leiden_resolution (float, optional) – Resolution parameter for Leiden clustering. Defaults to 0.5.

  • leiden_objective (str, optional) – Objective function for Leiden clustering. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

  • missingness_proportion_matrix (pandas.DataFrame or numpy.ndarray or None, optional) – Optional matrix for biomarker/feature clustering where each entry is the per-sample proportion of missingness for each feature. If provided, can guide clustering on missingness patterns. Defaults to None.

  • scale_features (bool, optional) – If True, standardizes features for proportion-matrix-based clustering. Defaults to False.

  • hidden_dims (list[int], optional) – Encoder/decoder hidden layer sizes (mirrored architecture). Defaults to [150, 120, 60].

  • latent_dim (int, optional) – Dimensionality of the latent space. Defaults to 15.

  • layer_order_enc (list[str], optional) – Per-layer specification for encoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["unshared", "unshared", "unshared"].

  • layer_order_dec (list[str], optional) – Per-layer specification for decoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["shared", "shared", "shared"].

  • latent_shared (bool, optional) – If True, shares latent layer parameters across clusters. Defaults to False.

  • output_shared (bool, optional) – If True, shares final output layer across clusters. Defaults to False.

  • batch_size (int, optional) – Batch size for training. Defaults to 4000.

  • return_model (bool, optional) – If True, include the trained VAE model in the return tuple. Defaults to True.

  • epochs (int, optional) – Number of epochs in the initial training phase. Defaults to 500.

  • initial_lr (float, optional) – Initial learning rate for the optimizer. Defaults to 0.01.

  • decay_factor (float, optional) – Multiplicative LR decay applied per epoch (lr *= decay_factor). Defaults to 0.999.

  • weight_decay (float, optional) – Weight decay (L2 penalty) for the optimizer. Defaults to 0.001.

  • beta (float, optional) – Weight of the KL-divergence term in the VAE loss for initial training. Defaults to 0.001.

  • device (str or torch.device or None, optional) – Compute device, e.g., "cpu" or "cuda". If None, selected automatically. Defaults to None.

  • max_loops (int, optional) – Maximum number of impute–refit loops. Defaults to 100.

  • patience (int, optional) – Early stopping patience counted in loops without improvement. Defaults to 2.

  • epochs_per_loop (int or None, optional) – Number of epochs per refit loop. If None, reuses epochs. Defaults to None.

  • initial_lr_refit (float or None, optional) – Learning rate for refit loops. If None, uses initial_lr. Defaults to None.

  • decay_factor_refit (float or None, optional) – LR decay factor during refit loops. If None, uses decay_factor. Defaults to None.

  • beta_refit (float or None, optional) – KL weight used in refit loops. If None, uses beta. Defaults to None.

  • verbose (bool, optional) – If True, prints progress and diagnostics. Defaults to False.

  • return_clusters (bool, optional) – If True, include sample cluster labels in the return tuple. Defaults to False.

  • return_silhouettes (bool, optional) – If True, include clustering silhouette score(s) in the return tuple. Defaults to False.

  • return_history (bool, optional) – If True, include concatenated training/refit history (e.g., losses, metrics). Defaults to False.

  • return_dataset (bool, optional) – If True, include the constructed/processed ClusterDataset object. Defaults to False.

  • debug (bool, optional) – If True, enables additional checks/logging for troubleshooting. Defaults to False.
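
The requirement that every dummy column listed in categorical_column_map is also True in binary_feature_mask is easy to get wrong by hand. The following is a minimal sketch of building both arguments together; the helper `build_feature_metadata` and the column names `age` and `smoker` are hypothetical and not part of the package.

```python
def build_feature_metadata(columns, binary_columns, categorical_groups):
    """Return (binary_feature_mask, categorical_column_map) for `columns`.

    columns            -- ordered column names of the input data
    binary_columns     -- names of stand-alone binary (0/1) columns
    categorical_groups -- dict: original variable name -> its dummy columns
    """
    known = set(columns)
    dummy_columns = set()
    for name, dummies in categorical_groups.items():
        missing = [col for col in dummies if col not in known]
        if missing:
            raise ValueError(f"{name}: columns not found in data: {missing}")
        dummy_columns.update(dummies)

    # Dummy columns are flagged True alongside the ordinary binary columns.
    flagged = set(binary_columns) | dummy_columns
    binary_feature_mask = [col in flagged for col in columns]
    categorical_column_map = {k: list(v) for k, v in categorical_groups.items()}
    return binary_feature_mask, categorical_column_map


columns = ["age", "smoker", "C1b1", "C1b2", "C2b1", "C2b2"]
mask, cat_map = build_feature_metadata(
    columns,
    binary_columns=["smoker"],
    categorical_groups={"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]},
)
```

Here `mask` has one entry per column, in column order, and `cat_map` can be passed as categorical_column_map unchanged.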

Returns:

Always returns the imputed dataset. Depending on flags, also returns: model, clusters, silhouette_scores, history, and/or the ClusterDataset. The order is: (imputed_dataset[, model][, clusters][, silhouette_scores][, history][, dataset]). Because return_model defaults to True, a call with default flags returns (imputed_dataset, model).
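
Because the tuple is positional and its length depends on the flags, it can help to label the outputs by name. The helper below is a hypothetical convenience, not part of the package; it assumes only the ordering documented above and that a bare dataset (not a tuple) comes back when every return_* flag is False.

```python
def unpack_result(result, return_model=True, return_clusters=False,
                  return_silhouettes=False, return_history=False,
                  return_dataset=False):
    """Label the positional outputs of run_cissvae by name.

    Pass the same return_* flags that were given to run_cissvae.
    """
    flags = [
        ("model", return_model),
        ("clusters", return_clusters),
        ("silhouette_scores", return_silhouettes),
        ("history", return_history),
        ("dataset", return_dataset),
    ]
    names = ["imputed_dataset"] + [name for name, enabled in flags if enabled]

    # Assumption: with every flag off, the bare imputed dataset is returned.
    if len(names) == 1:
        return {"imputed_dataset": result}
    if len(result) != len(names):
        raise ValueError(
            f"expected {len(names)} outputs {names}, got {len(result)}"
        )
    return dict(zip(names, result))


# Strings stand in for the real objects to show the labelling only.
labelled = unpack_result(("imputed", "vae", "hist"), return_history=True)
```

Accessing `labelled["history"]` then stays correct even if other flags are toggled later.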

Return type:

pandas.DataFrame or tuple[pandas.DataFrame[, CISSVAE][, numpy.ndarray or pandas.Series][, float or dict][, pandas.DataFrame][, ClusterDataset]]
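
A call might be organized as follows. This is a sketch under stated assumptions: the import path is taken from the qualified name at the top of this page, the helpers `build_settings` and `impute` are hypothetical, and the specific values (three clusters, two hidden layers, 50 epochs) are illustrative choices rather than recommendations.

```python
def build_settings(**overrides):
    """Assemble keyword arguments for run_cissvae (illustrative values)."""
    settings = dict(
        val_proportion=0.1,
        n_clusters=3,                 # hypothetical choice; selects KMeans
        hidden_dims=[64, 32],
        layer_order_enc=["unshared", "unshared"],
        layer_order_dec=["shared", "shared"],
        latent_dim=8,
        epochs=50,
        max_loops=10,
        patience=2,
        print_dataset=False,
        return_model=True,
        return_history=True,
        seed=42,
    )
    settings.update(overrides)

    # The layer-order lists are documented to match hidden_dims in length.
    n_layers = len(settings["hidden_dims"])
    for key in ("layer_order_enc", "layer_order_dec"):
        if len(settings[key]) != n_layers:
            raise ValueError(f"{key} must have {n_layers} entries")
    return settings


def impute(data, **overrides):
    """Run the pipeline on `data` and return whatever run_cissvae returns."""
    # Imported lazily so the settings helper works without the package.
    from ciss_vae.training.run_cissvae import run_cissvae

    return run_cissvae(data, **build_settings(**overrides))


settings = build_settings()
```

With the flags in `build_settings`, the documented ordering implies that `impute(data)` unpacks as `imputed, model, history`.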