ciss_vae.training package

Submodules

ciss_vae.training.autotune module

Optuna-based hyperparameter tuning for CISS-VAE. This module defines:

  • SearchSpace: a structured container describing tunable/fixed hyperparameters.

  • autotune(): runs Optuna trials that train CISSVAE models and selects the best trial by validation MSE, then retrains a final model with the best settings.

class SearchSpace(num_hidden_layers=(1, 4), hidden_dims=[64, 512], latent_dim=[10, 100], latent_shared=[True, False], output_shared=[True, False], lr=(0.0001, 0.001), decay_factor=(0.9, 0.999), weight_decay=0.001, beta=0.01, num_epochs=1000, batch_size=64, num_shared_encode=[0, 1, 3], num_shared_decode=[0, 1, 3], encoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], decoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], refit_patience=2, refit_loops=100, epochs_per_loop=1000, reset_lr_refit=[True, False])[source]

Bases: object

Defines tunable and fixed hyperparameter ranges for the Optuna search.

Parameters are specified as:

  • scalar: fixed value (e.g., latent_dim=16)

  • list: categorical choice (e.g., hidden_dims=[64, 128, 256])

  • tuple: range (min, max) for suggest_int or suggest_float
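The scalar/list/tuple convention can be sketched as a small dispatch function. This is a minimal illustration, not the library's implementation: resolve_spec and FakeTrial are hypothetical names, and FakeTrial only mimics the suggest_int, suggest_float, and suggest_categorical methods of an Optuna trial so the example runs without Optuna installed. The rule that integer bounds produce an integer range is an assumption.

```python
import random


class FakeTrial:
    """Hypothetical stand-in for an Optuna trial (same three method names)."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def suggest_int(self, name, low, high):
        return self._rng.randint(low, high)

    def suggest_float(self, name, low, high):
        return self._rng.uniform(low, high)

    def suggest_categorical(self, name, choices):
        return self._rng.choice(list(choices))


def resolve_spec(trial, name, spec):
    """Turn one search-space entry into a concrete value for this trial."""
    if isinstance(spec, tuple):
        low, high = spec
        # Assumption: integer bounds mean an integer range, otherwise a float range.
        if isinstance(low, int) and isinstance(high, int):
            return trial.suggest_int(name, low, high)
        return trial.suggest_float(name, low, high)
    if isinstance(spec, list):
        return trial.suggest_categorical(name, spec)
    return spec  # scalar: fixed value, nothing to tune


trial = FakeTrial(seed=0)
space = {
    "num_hidden_layers": (1, 4),     # tuple -> integer range
    "lr": (1e-4, 1e-3),              # tuple -> float range
    "latent_shared": [True, False],  # list -> categorical choice
    "beta": 0.01,                    # scalar -> fixed
}
params = {name: resolve_spec(trial, name, spec) for name, spec in space.items()}
```

With a real Optuna trial in place of FakeTrial, the same function would be called inside the objective, once per hyperparameter.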

Parameters:
  • num_hidden_layers (int or list[int] or tuple[int, int], optional) – Number of encoder/decoder hidden layers, defaults to (1, 4)

  • hidden_dims (int or list[int] or tuple[int, int], optional) – Hidden dimension specification - int for repeated per layer, list for per-layer choices, tuple for range, defaults to [64, 512]

  • latent_dim (int or list[int] or tuple[int, int], optional) – Latent dimension size, list of categorical choices, or range, defaults to [10, 100]

  • latent_shared (bool or list[bool], optional) – Whether latent space is shared across clusters, defaults to [True, False]

  • output_shared (bool or list[bool], optional) – Whether output layer is shared across clusters, defaults to [True, False]

  • lr (float or tuple[float, float], optional) – Initial learning rate or range, defaults to (1e-4, 1e-3)

  • decay_factor (float or tuple[float, float], optional) – Learning rate exponential decay factor or range, defaults to (0.9, 0.999)

  • beta (float or tuple[float, float], optional) – KL divergence weight or range, defaults to 0.01

  • num_epochs (int or tuple[int, int], optional) – Number of epochs for initial training, defaults to 1000

  • batch_size (int or tuple[int, int], optional) – Mini-batch size, defaults to 64

  • num_shared_encode (list[int], optional) – Candidate counts of shared encoder layers, defaults to [0, 1, 3]

  • num_shared_decode (list[int], optional) – Candidate counts of shared decoder layers, defaults to [0, 1, 3]

  • encoder_shared_placement (list[str], optional) – Strategy for arranging shared vs unshared layers in encoder, defaults to [“at_end”, “at_start”, “alternating”, “random”]

  • decoder_shared_placement (list[str], optional) – Strategy for arranging shared vs unshared layers in decoder, defaults to [“at_end”, “at_start”, “alternating”, “random”]

  • refit_patience (int or tuple[int, int], optional) – Early-stop patience for refit loops, defaults to 2

  • refit_loops (int or tuple[int, int], optional) – Maximum number of refit loops, defaults to 100

  • epochs_per_loop (int or tuple[int, int], optional) – Number of epochs per refit loop, defaults to 1000

  • reset_lr_refit (bool or list[bool], optional) – Whether to reset learning rate before refit, defaults to [True, False]
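The shared-layer count and placement options above can be pictured as producing a per-layer list of "shared"/"unshared" labels. The sketch below is only one plausible reading: build_layer_order is a hypothetical helper, the interpretation of "alternating" (shared layers on even positions first) is an assumption, and so is clamping the shared count to the number of layers.

```python
import random


def build_layer_order(n_layers, n_shared, placement, seed=0):
    """Return a list of 'shared'/'unshared' labels, one per hidden layer."""
    # Assumption: a shared count larger than the layer count is clamped.
    n_shared = min(n_shared, n_layers)
    if placement == "at_start":
        shared_idx = range(n_shared)
    elif placement == "at_end":
        shared_idx = range(n_layers - n_shared, n_layers)
    elif placement == "alternating":
        # Assumption: fill even positions first, then odd ones if needed.
        candidates = list(range(0, n_layers, 2)) + list(range(1, n_layers, 2))
        shared_idx = candidates[:n_shared]
    elif placement == "random":
        shared_idx = random.Random(seed).sample(range(n_layers), n_shared)
    else:
        raise ValueError(f"unknown placement: {placement!r}")
    shared = set(shared_idx)
    return ["shared" if i in shared else "unshared" for i in range(n_layers)]


order_end = build_layer_order(3, 1, "at_end")
order_alt = build_layer_order(4, 2, "alternating")
```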

save(file_path)[source]

Save this search space to a JSON file.

Parameters:
  • file_path (str) – Path to save file.

classmethod load(file_path)[source]

Load a search space from a JSON file and return a new instance.

Parameters:
  • file_path (str) – Path to saved SearchSpace.

autotune(search_space, train_dataset, save_model_path=None, save_search_space_path=None, n_trials=20, study_name='vae_autotune', device_preference='cuda', optuna_dashboard_db=None, load_if_exists=True, seed=42, verbose=False, show_progress=False, constant_layer_size=False, evaluate_all_orders=False, max_exhaustive_orders=100, return_history=False, n_jobs=1, debug=False)[source]

Optuna-based hyperparameter search for the CISSVAE model.

Runs initial training followed by impute-refit loops per trial, selecting the trial with the lowest total imputation error (MSE + BCE + categorical CE). The best model is then retrained with optimal hyperparameters and returned along with the imputed dataset.
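To make the selection criterion concrete, the following sketch computes a combined error of the form MSE + BCE + categorical CE over held-out entries. It is an illustration under stated assumptions: the function names are hypothetical, the three terms are averaged per feature type and summed without weights, and the library's actual reduction may differ.

```python
import math


def mse(pairs):
    """Mean squared error; each pair is (prediction, true value)."""
    return sum((pred - true) ** 2 for pred, true in pairs) / len(pairs)


def bce(pairs, eps=1e-12):
    """Binary cross-entropy; each pair is (predicted probability, 0/1 label)."""
    total = 0.0
    for p, y in pairs:
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(pairs)


def categorical_ce(pairs, eps=1e-12):
    """Cross-entropy; each pair is (class probabilities, index of true class)."""
    return sum(-math.log(max(probs[k], eps)) for probs, k in pairs) / len(pairs)


def total_imputation_error(numeric, binary, categorical):
    # Assumption: unweighted sum of the three per-type means.
    return mse(numeric) + bce(binary) + categorical_ce(categorical)


score = total_imputation_error(
    numeric=[(1.0, 1.0), (2.0, 4.0)],   # continuous features
    binary=[(0.5, 1)],                  # binary features
    categorical=[([0.25, 0.75], 1)],    # one dummy-coded categorical variable
)
```

A trial with a lower score would be preferred under this criterion.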

ciss_vae.training.run_cissvae module

End-to-end pipeline for preparing data, optionally clustering samples, training the CISS-VAE model, and performing iterative imputation.

Handles validation masking, feature-type resolution (via activation groups), optional clustering on missingness patterns, and model training with impute–refit loops.

run_cissvae(data, val_proportion=0.1, replacement_value=0.0, columns_ignore=None, print_dataset=True, imputable_matrix=None, binary_feature_mask=None, categorical_column_map=None, clusters=None, n_clusters=None, k_neighbors=15, leiden_resolution=0.5, leiden_objective='CPM', seed=42, missingness_proportion_matrix=None, scale_features=False, hidden_dims=[150, 120, 60], latent_dim=15, layer_order_enc=['unshared', 'unshared', 'unshared'], layer_order_dec=['shared', 'shared', 'shared'], latent_shared=False, output_shared=False, batch_size=4000, return_model=True, epochs=500, initial_lr=0.01, decay_factor=0.999, weight_decay=0.001, beta=0.001, device=None, max_loops=100, patience=2, epochs_per_loop=None, initial_lr_refit=None, decay_factor_refit=None, beta_refit=None, verbose=False, return_clusters=False, return_silhouettes=False, return_history=False, return_dataset=False, debug=False)[source]

End-to-end pipeline for Clustering-Informed Shared-Structure Variational Autoencoder (CISS-VAE).

This workflow prepares data (validation masking, optional feature/biomarker clustering inputs), optionally infers sample clusters, trains the VAE, and performs iterative impute–refit loops with early stopping. Returns the final imputed dataset and, optionally, the trained model and auxiliary artifacts.
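The validation-masking step mentioned above can be sketched in plain Python. This is a simplified illustration, not the package's code: mask_for_validation is a hypothetical helper, missing values are represented as None, and a single global proportion is used rather than per-cluster proportions. It hides a fraction of the observed entries from the training copy (filling them with a replacement value) while keeping their true values in a separate validation target.

```python
import random


def mask_for_validation(rows, val_proportion=0.1, replacement_value=0.0, seed=42):
    """Hold out a fraction of observed entries; None marks a missing value."""
    rng = random.Random(seed)
    observed = [(i, j) for i, row in enumerate(rows)
                for j, value in enumerate(row) if value is not None]
    n_val = round(val_proportion * len(observed))
    held_out = set(rng.sample(observed, n_val))

    train = [list(row) for row in rows]
    val_target = [[None] * len(row) for row in rows]
    for i, j in held_out:
        val_target[i][j] = rows[i][j]     # true value kept for scoring
        train[i][j] = replacement_value   # hidden from the model during training
    return train, val_target, held_out


rows = [
    [1.0, None, 3.0],
    [4.0, 5.0, None],
    [7.0, 8.0, 9.0],
    [None, 11.0, 12.0],
]
train, val_target, held_out = mask_for_validation(rows, val_proportion=1 / 3)
```

Entries that were missing in the input stay missing in both outputs here; only originally observed entries can be held out.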

Parameters:
  • data (pandas.DataFrame or numpy.ndarray or torch.Tensor) – Input matrix with potential missing values, shape (n_samples, n_features).

  • val_proportion (float or collections.abc.Sequence or collections.abc.Mapping or pandas.Series, optional) – Per-cluster fraction of non-missing entries to mask for validation. May be a single float (global), a per-cluster sequence, or mapping. Defaults to 0.1.

  • replacement_value (float, optional) – Value used to fill masked validation entries in the training tensor. Does not affect the separate validation target tensor. Defaults to 0.0.

  • columns_ignore (list[str or int] or None, optional) – Columns to exclude from validation masking (names if data is a DataFrame, otherwise integer indices). Defaults to None.

  • print_dataset (bool, optional) – If True, prints dataset summary/statistics during setup. Defaults to True.

  • imputable_matrix (pandas.DataFrame or numpy.ndarray or torch.Tensor or None, optional) – Optional binary mask with the same shape as data indicating which entries are eligible for imputation. Use 1 to allow imputation and 0 to exclude from imputation. Defaults to None.

  • binary_feature_mask (list[bool] or numpy.ndarray or None, optional) – 1D boolean vector of length n_features indicating which columns are binary. Used during dataset construction to derive activation_groups. Columns belonging to categorical dummy variables must also be marked as True. Defaults to None.

  • categorical_column_map (dict[str, list[str or int]] or None, optional) –

    Optional dictionary mapping original categorical variable names to their corresponding dummy-variable columns. Example:

    {"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}

    These columns are grouped together in activation_groups and treated as categorical variables during loss computation and imputation. All listed columns must also be marked as True in binary_feature_mask. Defaults to None.

  • clusters (array-like or None, optional) – Precomputed cluster labels for samples (length n_samples). If None, clustering may be performed depending on n_clusters and Leiden settings. Defaults to None.

  • n_clusters (int or None, optional) – If provided, performs KMeans with n_clusters. If None and clusters is also None, Leiden-based clustering is used. Defaults to None.

  • k_neighbors (int, optional) – Number of nearest neighbors for the Leiden KNN graph construction. Defaults to 15.

  • leiden_resolution (float, optional) – Resolution parameter for Leiden clustering. Defaults to 0.5.

  • leiden_objective (str, optional) – Objective function for Leiden clustering. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".

  • seed (int, optional) – Random seed for reproducibility. Defaults to 42.

  • missingness_proportion_matrix (pandas.DataFrame or numpy.ndarray or None, optional) – Optional matrix for biomarker/feature clustering where each entry is the per-sample proportion of missingness for each feature. If provided, can guide clustering on missingness patterns. Defaults to None.

  • scale_features (bool, optional) – If True, standardizes features for proportion-matrix-based clustering. Defaults to False.

  • hidden_dims (list[int], optional) – Encoder/decoder hidden layer sizes (mirrored architecture). Defaults to [150, 120, 60].

  • latent_dim (int, optional) – Dimensionality of the latent space. Defaults to 15.

  • layer_order_enc (list[str], optional) – Per-layer specification for encoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["unshared", "unshared", "unshared"].

  • layer_order_dec (list[str], optional) – Per-layer specification for decoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["shared", "shared", "shared"].

  • latent_shared (bool, optional) – If True, shares latent layer parameters across clusters. Defaults to False.

  • output_shared (bool, optional) – If True, shares final output layer across clusters. Defaults to False.

  • batch_size (int, optional) – Batch size for training. Defaults to 4000.

  • return_model (bool, optional) – If True, include the trained VAE model in the return tuple. Defaults to True.

  • epochs (int, optional) – Number of epochs in the initial training phase. Defaults to 500.

  • initial_lr (float, optional) – Initial learning rate for the optimizer. Defaults to 0.01.

  • decay_factor (float, optional) – Multiplicative LR decay applied per epoch (lr *= decay_factor). Defaults to 0.999.

  • beta (float, optional) – Weight of the KL-divergence term in the VAE loss for initial training. Defaults to 0.001.

  • device (str or torch.device or None, optional) – Compute device, e.g., "cpu" or "cuda". If None, selected automatically. Defaults to None.

  • max_loops (int, optional) – Maximum number of impute–refit loops. Defaults to 100.

  • patience (int, optional) – Early stopping patience counted in loops without improvement. Defaults to 2.

  • epochs_per_loop (int or None, optional) – Number of epochs per refit loop. If None, reuses epochs. Defaults to None.

  • initial_lr_refit (float or None, optional) – Learning rate for refit loops. If None, uses initial_lr. Defaults to None.

  • decay_factor_refit (float or None, optional) – LR decay factor during refit loops. If None, uses decay_factor. Defaults to None.

  • beta_refit (float or None, optional) – KL weight used in refit loops. If None, uses beta. Defaults to None.

  • verbose (bool, optional) – If True, prints progress and diagnostics. Defaults to False.

  • return_clusters (bool, optional) – If True, include sample cluster labels in the return tuple. Defaults to False.

  • return_silhouettes (bool, optional) – If True, include clustering silhouette score(s) in the return tuple. Defaults to False.

  • return_history (bool, optional) – If True, include concatenated training/refit history (e.g., losses, metrics). Defaults to False.

  • return_dataset (bool, optional) – If True, include the constructed/processed ClusterDataset object. Defaults to False.

  • debug (bool, optional) – If True, enables additional checks/logging for troubleshooting. Defaults to False.
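Several parameters above fall back to other settings when left as None. The sketch below restates those documented rules as two small functions. Both function names are hypothetical, and giving precomputed clusters precedence over n_clusters when both are supplied is an assumption.

```python
def choose_clustering(clusters=None, n_clusters=None):
    """Pick the clustering route implied by the clusters/n_clusters arguments."""
    if clusters is not None:
        return "precomputed"   # assumption: takes precedence over n_clusters
    if n_clusters is not None:
        return "kmeans"
    return "leiden"


def resolve_refit_settings(epochs=500, initial_lr=0.01, decay_factor=0.999,
                           beta=0.001, epochs_per_loop=None,
                           initial_lr_refit=None, decay_factor_refit=None,
                           beta_refit=None):
    """Fill None-valued refit settings from the initial-training settings."""
    def pick(value, fallback):
        return fallback if value is None else value

    return {
        "epochs_per_loop": pick(epochs_per_loop, epochs),
        "initial_lr_refit": pick(initial_lr_refit, initial_lr),
        "decay_factor_refit": pick(decay_factor_refit, decay_factor),
        "beta_refit": pick(beta_refit, beta),
    }


defaults = resolve_refit_settings()
custom = resolve_refit_settings(epochs_per_loop=50, beta_refit=0.01)
```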

Returns:

Always returns the imputed dataset. Depending on flags, may also return: model, clusters, silhouette_scores, history, and/or the ClusterDataset. The order is: (imputed_dataset[, model][, clusters][, silhouette_scores][, history][, dataset]). Because return_model defaults to True, a call with default flags includes the model unless return_model=False is passed.

Return type:

pandas.DataFrame or tuple[pandas.DataFrame[, CISSVAE][, numpy.ndarray or pandas.Series][, float or dict][, pandas.DataFrame][, ClusterDataset]]
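Because the shape of the return value depends on five flags, it can help to see the ordering rule spelled out. The sketch below is a hypothetical helper (assemble_return is not part of the package) that follows the documented order and uses placeholder strings in place of real objects.

```python
def assemble_return(imputed, model, clusters, silhouettes, history, dataset, *,
                    return_model=True, return_clusters=False,
                    return_silhouettes=False, return_history=False,
                    return_dataset=False):
    """Collect outputs in the documented order, skipping disabled flags."""
    out = [imputed]
    if return_model:
        out.append(model)
    if return_clusters:
        out.append(clusters)
    if return_silhouettes:
        out.append(silhouettes)
    if return_history:
        out.append(history)
    if return_dataset:
        out.append(dataset)
    # A single item is returned bare, matching the "DataFrame or tuple" type.
    return out[0] if len(out) == 1 else tuple(out)


parts = ("imputed", "model", "clusters", "silhouettes", "history", "dataset")
default_result = assemble_return(*parts)
only_imputed = assemble_return(*parts, return_model=False)
```

When unpacking the result of a call, the number of targets on the left-hand side has to match the number of enabled flags plus one.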

ciss_vae.training.train_initial module

train_vae_initial(model, train_loader, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, return_history=False, progress_callback=None, weight_decay=0.001, seed=42)[source]

Train a VAE on masked data with validation monitoring for the initial training phase.

Performs the initial training of a CISSVAE model on data with missing values using masked loss computation. Tracks training loss and validation MSE across epochs, with optional progress reporting and learning rate decay.
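A masked loss of the kind described here can be sketched as follows. This is a generic VAE formulation, not the package's exact loss: masked_vae_loss is a hypothetical name, reconstruction error is squared error averaged over observed entries only, and the KL term is the standard closed form for a diagonal Gaussian posterior against a standard normal prior, weighted by beta.

```python
import math


def masked_vae_loss(x, x_hat, observed, mu, logvar, beta=0.1):
    """Reconstruction error on observed entries plus beta-weighted KL term."""
    # Only entries flagged as observed contribute to the reconstruction term.
    sq_err = [(a - b) ** 2 for a, b, m in zip(x, x_hat, observed) if m]
    recon = sum(sq_err) / max(len(sq_err), 1)
    # KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions.
    kl = -0.5 * sum(1.0 + lv - m ** 2 - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon + beta * kl


x = [1.0, 0.0, 3.0]               # middle entry is a placeholder for a missing value
x_hat = [1.5, 9.0, 3.0]           # model reconstruction
observed = [True, False, True]    # mask: the missing entry is ignored

loss_no_kl = masked_vae_loss(x, x_hat, observed, mu=[0.0, 0.0], logvar=[0.0, 0.0])
loss_with_kl = masked_vae_loss(x, x_hat, observed, mu=[1.0], logvar=[0.0], beta=0.1)
```

The large error on the masked middle entry has no effect on either value, which is the point of masking.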

Parameters:
  • model (torch.nn.Module) – CISSVAE or compatible VAE model that implements forward(x, cluster_id)

  • train_loader (torch.utils.data.DataLoader) – DataLoader built on ClusterDataset containing validation data

  • epochs (int, optional) – Number of training epochs, defaults to 10

  • initial_lr (float, optional) – Starting learning rate for Adam optimizer, defaults to 0.01

  • decay_factor (float, optional) – Exponential learning rate decay factor applied per epoch, defaults to 0.999

  • beta (float, optional) – Weight coefficient for KL divergence term in VAE loss, defaults to 0.1

  • device (str, optional) – Device for training computations (“cpu” or “cuda”), defaults to “cpu”

  • verbose (bool, optional) – Whether to print per-epoch training metrics, defaults to False

  • return_history (bool, optional) – Whether to return training history DataFrame along with model, defaults to False

  • progress_callback (callable, optional) – Optional callback function for progress reporting, defaults to None

Returns:

Trained model, or tuple of (model, history_dataframe) if return_history=True

Return type:

torch.nn.Module or tuple[torch.nn.Module, pandas.DataFrame]

Raises:

ValueError – If the dataset does not contain a ‘val_data’ attribute for validation
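The per-epoch exponential decay controlled by initial_lr and decay_factor can be traced with a few lines of plain Python. This is a sketch of the arithmetic only (lr_schedule is a hypothetical helper); it does not use the optimizer or scheduler objects the library relies on.

```python
def lr_schedule(initial_lr=0.01, decay_factor=0.999, epochs=10):
    """Return the learning rate used in each epoch and the rate after the last one."""
    lr = initial_lr
    rates = []
    for _ in range(epochs):
        rates.append(lr)       # rate in effect during this epoch
        lr *= decay_factor     # multiplicative decay applied once per epoch
    return rates, lr


rates, final_lr = lr_schedule(initial_lr=0.01, decay_factor=0.999, epochs=10)
```

After t epochs the rate equals initial_lr * decay_factor ** t, so 500 epochs starting from 0.01 with a factor of 0.999 end at roughly 0.0061.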

ciss_vae.training.train_refit module

train_vae_refit(model, imputed_data, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, progress_callback=None, weight_decay=0.001, seed=42)[source]

Train the VAE model on imputed data without masking for one refit iteration.

Performs training on the complete imputed dataset.

Parameters:
  • model (torch.nn.Module) – VAE model to train

  • imputed_data (torch.utils.data.DataLoader) – DataLoader containing imputed dataset with complete values

  • epochs (int, optional) – Number of training epochs, defaults to 10

  • initial_lr (float, optional) – Initial learning rate for the optimizer, defaults to 0.01

  • decay_factor (float, optional) – Exponential decay factor for learning rate scheduler, defaults to 0.999

  • beta (float, optional) – Weight for KL divergence term in loss function, defaults to 0.1

  • device (str, optional) – Device to run training on, defaults to “cpu”

  • verbose (bool, optional) – Whether to print training progress information, defaults to False

  • progress_callback (callable, optional) – Optional callback function to report epoch progress, defaults to None

Returns:

Trained model with updated final learning rate

Return type:

torch.nn.Module

impute_and_refit_loop(model, train_loader, max_loops=10, patience=2, epochs_per_loop=5, initial_lr=None, decay_factor=0.999, weight_decay=0.001, beta=0.1, device='cpu', verbose=False, batch_size=4000, progress_epoch=None, seed=42)[source]

Iterative impute-refit loop with validation MSE early stopping.

Performs alternating cycles of imputation (filling missing values with model predictions) and refitting (training on the complete imputed data). Uses early stopping based on validation MSE to prevent overfitting and selects the best performing model.
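The imputation half of each cycle amounts to keeping observed values and filling the gaps with model predictions. A minimal sketch, assuming missing values are represented as None and predictions are available for every cell (impute_missing is a hypothetical helper):

```python
def impute_missing(rows, predictions):
    """Keep observed values; take the model's prediction where a value is None."""
    return [
        [pred if value is None else value for value, pred in zip(row, pred_row)]
        for row, pred_row in zip(rows, predictions)
    ]


rows = [[1.0, None], [None, 4.0]]
predictions = [[0.9, 2.1], [3.2, 4.4]]
completed = impute_missing(rows, predictions)
```

The completed matrix is what the refit half of the cycle would then train on.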

Parameters:
  • model (torch.nn.Module) – Trained VAE model to start the impute-refit process

  • train_loader (torch.utils.data.DataLoader) – DataLoader for the original training dataset with missing values

  • max_loops (int, optional) – Maximum number of impute-refit cycles to perform, defaults to 10

  • patience (int, optional) – Number of loops to wait for improvement before early stopping, defaults to 2

  • epochs_per_loop (int, optional) – Number of training epochs per refit cycle, defaults to 5

  • initial_lr (float, optional) – Learning rate for refit training, uses model’s final LR if None, defaults to None

  • decay_factor (float, optional) – Exponential decay factor for learning rate, defaults to 0.999

  • beta (float, optional) – Weight for KL divergence term in loss function, defaults to 0.1

  • device (str, optional) – Device to run computations on, defaults to “cpu”

  • verbose (bool, optional) – Whether to print detailed progress information, defaults to False

  • batch_size (int, optional) – Batch size for refit training, defaults to 4000

  • progress_epoch (callable, optional) – Optional callback function to report epoch progress, defaults to None

Returns:

Tuple containing (imputed_dataframe, best_model, best_dataset, refit_history_dataframe).

Columns of refit_history_dataframe:

  • epoch (int) : cumulative epoch counter (continues from initial)

  • train_loss (float) : NaN (not tracked during refit here)

  • val_mse (float) : validation MSE after each refit loop

  • lr (float) : learning rate after each refit loop

  • phase (str) : {“refit_init”, “refit_loop”}

  • loop (int) : 0 for baseline (pre-refit), then 1..k per loop

Return type:

tuple[pandas.DataFrame, torch.nn.Module, ClusterDataset, pandas.DataFrame]
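The early-stopping control flow described for this function can be sketched independently of any model. In the illustration below, impute_refit_with_patience, refit_once, and validate are hypothetical names; the two callables stand in for one refit cycle and for computing validation MSE. Whether the library stops when the count of non-improving loops reaches patience, as done here, is an assumption.

```python
def impute_refit_with_patience(validate, refit_once, max_loops=10, patience=2):
    """Run refit cycles until validation MSE stops improving."""
    best_mse = validate()            # loop 0: baseline before any refit
    best_loop = 0
    history = [(0, best_mse)]        # (loop, val_mse) pairs
    stale = 0
    for loop in range(1, max_loops + 1):
        refit_once()
        mse = validate()
        history.append((loop, mse))
        if mse < best_mse:
            best_mse, best_loop, stale = mse, loop, 0
        else:
            stale += 1
            if stale >= patience:    # no improvement for `patience` loops
                break
    return best_loop, best_mse, history


# Scripted validation scores: baseline, then one value per refit loop.
scores = iter([1.0, 0.8, 0.9, 0.95, 0.7])
best_loop, best_mse, history = impute_refit_with_patience(
    validate=lambda: next(scores),
    refit_once=lambda: None,         # placeholder for an actual refit step
    max_loops=10,
    patience=2,
)
```

In this run the loop stops after the third cycle and reports the first cycle as best, so the later score of 0.7 is never reached. A real implementation would also keep a copy of the model state from the best loop.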

Module contents

The package-level namespace exposes the following names; each is documented under its submodule above:

  • train_vae_initial (ciss_vae.training.train_initial)

  • train_vae_refit (ciss_vae.training.train_refit)

  • run_autotune (same signature and description as autotune in ciss_vae.training.autotune)

  • SearchSpace (ciss_vae.training.autotune)

  • run_cissvae (ciss_vae.training.run_cissvae)