ciss_vae.training package
Submodules
ciss_vae.training.autotune module
Optuna-based hyperparameter tuning for CISS-VAE.
This module defines:
- SearchSpace: a structured container describing tunable/fixed hyperparameters.
- autotune(): runs Optuna trials that train CISSVAE models and selects the best trial
by validation MSE, then retrains a final model with the best settings.
- class SearchSpace(num_hidden_layers=(1, 4), hidden_dims=[64, 512], latent_dim=[10, 100], latent_shared=[True, False], output_shared=[True, False], lr=(0.0001, 0.001), decay_factor=(0.9, 0.999), weight_decay=0.001, beta=0.01, num_epochs=1000, batch_size=64, num_shared_encode=[0, 1, 3], num_shared_decode=[0, 1, 3], encoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], decoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], refit_patience=2, refit_loops=100, epochs_per_loop=1000, reset_lr_refit=[True, False])[source]
Bases: object
Defines tunable and fixed hyperparameter ranges for the Optuna search.
Parameters are specified as:
- scalar: fixed value (e.g., latent_dim=16)
- list: categorical choice (e.g., hidden_dims=[64, 128, 256])
- tuple: range (min, max) for suggest_int or suggest_float
- Parameters:
num_hidden_layers (int or list[int] or tuple[int, int], optional) – Number of encoder/decoder hidden layers, defaults to (1, 4)
hidden_dims (int or list[int] or tuple[int, int], optional) – Hidden dimension specification - int for repeated per layer, list for per-layer choices, tuple for range, defaults to [64, 512]
latent_dim (int or tuple[int, int], optional) – Latent dimension size or range, defaults to [10, 100]
latent_shared (bool or list[bool], optional) – Whether latent space is shared across clusters, defaults to [True, False]
output_shared (bool or list[bool], optional) – Whether output layer is shared across clusters, defaults to [True, False]
lr (float or tuple[float, float], optional) – Initial learning rate or range, defaults to (1e-4, 1e-3)
decay_factor (float or tuple[float, float], optional) – Learning rate exponential decay factor or range, defaults to (0.9, 0.999)
beta (float or tuple[float, float], optional) – KL divergence weight or range, defaults to 0.01
num_epochs (int or tuple[int, int], optional) – Number of epochs for initial training, defaults to 1000
batch_size (int or tuple[int, int], optional) – Mini-batch size, defaults to 64
num_shared_encode (list[int], optional) – Candidate counts of shared encoder layers, defaults to [0, 1, 3]
num_shared_decode (list[int], optional) – Candidate counts of shared decoder layers, defaults to [0, 1, 3]
encoder_shared_placement (list[str], optional) – Strategy for arranging shared vs unshared layers in encoder, defaults to [“at_end”, “at_start”, “alternating”, “random”]
decoder_shared_placement (list[str], optional) – Strategy for arranging shared vs unshared layers in decoder, defaults to [“at_end”, “at_start”, “alternating”, “random”]
refit_patience (int or tuple[int, int], optional) – Early-stop patience for refit loops, defaults to 2
refit_loops (int or tuple[int, int], optional) – Maximum number of refit loops, defaults to 100
epochs_per_loop (int or tuple[int, int], optional) – Number of epochs per refit loop, defaults to 1000
reset_lr_refit (bool or list[bool], optional) – Whether to reset learning rate before refit, defaults to [True, False]
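The scalar / list / tuple convention described for SearchSpace can be illustrated with a small standalone sketch. It does not use Optuna or ciss_vae; the helper name sample_spec is hypothetical, and the standard-library random module stands in for the trial object, so treat it as an approximation of the convention rather than the package's implementation.

```python
import random


def sample_spec(spec, rng):
    """Draw one value from a scalar / list / tuple specification (illustrative only)."""
    if isinstance(spec, tuple):
        low, high = spec
        # Integer bounds -> integer range (both ends inclusive, as with suggest_int);
        # otherwise a float range (as with suggest_float).
        if isinstance(low, int) and isinstance(high, int):
            return rng.randint(low, high)
        return rng.uniform(low, high)
    if isinstance(spec, list):
        # list: categorical choice
        return rng.choice(spec)
    # scalar: fixed value, never tuned
    return spec


rng = random.Random(42)
space = {
    "num_hidden_layers": (1, 4),     # tuple of ints -> integer range
    "lr": (1e-4, 1e-3),              # tuple of floats -> float range
    "latent_shared": [True, False],  # list -> categorical
    "beta": 0.01,                    # scalar -> fixed
}
params = {name: sample_spec(spec, rng) for name, spec in space.items()}
```

In this sketch only the three tunable entries vary between draws; beta always stays at 0.01.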
- autotune(search_space, train_dataset, save_model_path=None, save_search_space_path=None, n_trials=20, study_name='vae_autotune', device_preference='cuda', optuna_dashboard_db=None, load_if_exists=True, seed=42, verbose=False, show_progress=False, constant_layer_size=False, evaluate_all_orders=False, max_exhaustive_orders=100, return_history=False, n_jobs=1, debug=False)[source]
Optuna-based hyperparameter search for the CISSVAE model.
Runs initial training followed by impute-refit loops per trial, selecting the trial with the lowest total imputation error (MSE + BCE + categorical CE). The best model is then retrained with the optimal hyperparameters and returned along with the imputed dataset.
ciss_vae.training.run_cissvae module
End-to-end pipeline for preparing data, optionally clustering samples, training the CISS-VAE model, and performing iterative imputation.
Handles validation masking, feature-type resolution (via activation groups), optional clustering on missingness patterns, and model training with impute–refit loops.
- run_cissvae(data, val_proportion=0.1, replacement_value=0.0, columns_ignore=None, print_dataset=True, imputable_matrix=None, binary_feature_mask=None, categorical_column_map=None, clusters=None, n_clusters=None, k_neighbors=15, leiden_resolution=0.5, leiden_objective='CPM', seed=42, missingness_proportion_matrix=None, scale_features=False, hidden_dims=[150, 120, 60], latent_dim=15, layer_order_enc=['unshared', 'unshared', 'unshared'], layer_order_dec=['shared', 'shared', 'shared'], latent_shared=False, output_shared=False, batch_size=4000, return_model=True, epochs=500, initial_lr=0.01, decay_factor=0.999, weight_decay=0.001, beta=0.001, device=None, max_loops=100, patience=2, epochs_per_loop=None, initial_lr_refit=None, decay_factor_refit=None, beta_refit=None, verbose=False, return_clusters=False, return_silhouettes=False, return_history=False, return_dataset=False, debug=False)[source]
End-to-end pipeline for Clustering-Informed Shared-Structure Variational Autoencoder (CISS-VAE).
This workflow prepares data (validation masking, optional feature/biomarker clustering inputs), optionally infers sample clusters, trains the VAE, and performs iterative impute–refit loops with early stopping. Returns the final imputed dataset and, optionally, the trained model and auxiliary artifacts.
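The validation masking step mentioned above can be sketched in plain Python. This is a simplified illustration, not the package's code: it handles only a single global val_proportion (no per-cluster values), represents missing entries as NaN, and the helper name mask_for_validation is hypothetical.

```python
import math
import random


def mask_for_validation(rows, val_proportion=0.1, replacement_value=0.0, seed=42):
    """Hold out a fraction of the observed entries for validation (illustrative only)."""
    rng = random.Random(seed)
    # Only non-missing entries are candidates for masking.
    observed = [(i, j) for i, row in enumerate(rows)
                for j, v in enumerate(row) if not math.isnan(v)]
    n_val = round(val_proportion * len(observed))
    held_out = set(rng.sample(observed, n_val))
    # Training copy: held-out entries are overwritten with replacement_value.
    train = [[replacement_value if (i, j) in held_out else v
              for j, v in enumerate(row)] for i, row in enumerate(rows)]
    # Validation target: keeps the true values only at the held-out positions.
    val = [[v if (i, j) in held_out else math.nan
            for j, v in enumerate(row)] for i, row in enumerate(rows)]
    return train, val, held_out


nan = math.nan
data = [
    [1.0, nan, 3.0, 4.0],
    [5.0, 6.0, nan, 8.0],
    [9.0, 10.0, 11.0, 12.0],
]
train, val, held_out = mask_for_validation(data, val_proportion=0.2)
```

Here 10 entries are observed, so 2 of them are held out. Originally missing entries stay NaN in the training copy, which is an assumption of this sketch.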
- Parameters:
data (pandas.DataFrame or numpy.ndarray or torch.Tensor) – Input matrix with potential missing values, shape (n_samples, n_features).
val_proportion (float or collections.abc.Sequence or collections.abc.Mapping or pandas.Series, optional) – Per-cluster fraction of non-missing entries to mask for validation. May be a single float (global), a per-cluster sequence, or mapping. Defaults to 0.1.
replacement_value (float, optional) – Value used to fill masked validation entries in the training tensor. Does not affect the separate validation target tensor. Defaults to 0.0.
columns_ignore (list[str or int] or None, optional) – Columns to exclude from validation masking (names if data is a DataFrame, otherwise integer indices). Defaults to None.
print_dataset (bool, optional) – If True, prints dataset summary/statistics during setup. Defaults to True.
imputable_matrix (pandas.DataFrame or numpy.ndarray or torch.Tensor or None, optional) – Optional binary mask with the same shape as data indicating which entries are eligible for imputation. Use 1 to allow imputation and 0 to exclude from imputation. Defaults to None.
binary_feature_mask (list[bool] or numpy.ndarray) – 1D boolean vector of length n_features indicating which columns are binary. Used during dataset construction to derive activation_groups. Columns belonging to categorical dummy variables must also be marked as True.
categorical_column_map (dict[str, list[str or int]]) – Optional dictionary mapping original categorical variable names to their corresponding dummy-variable columns. Example: {"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}. These columns are grouped together in activation_groups and treated as categorical variables during loss computation and imputation. All listed columns must also be marked as True in binary_feature_mask.
clusters (array-like or None, optional) – Precomputed cluster labels for samples (length n_samples). If None, clustering may be performed depending on n_clusters and Leiden settings. Defaults to None.
n_clusters (int or None, optional) – If provided, performs KMeans with n_clusters. If None and clusters is also None, Leiden-based clustering is used. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors for the Leiden KNN graph construction. Defaults to 15.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden clustering. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility. Defaults to 42.
missingness_proportion_matrix (pandas.DataFrame or numpy.ndarray or None, optional) – Optional matrix for biomarker/feature clustering where each entry is the per-sample proportion of missingness for each feature. If provided, can guide clustering on missingness patterns. Defaults to None.
scale_features (bool, optional) – If True, standardizes features for proportion-matrix-based clustering. Defaults to False.
hidden_dims (list[int], optional) – Encoder/decoder hidden layer sizes (mirrored architecture). Defaults to [150, 120, 60].
latent_dim (int, optional) – Dimensionality of the latent space. Defaults to 15.
layer_order_enc (list[str], optional) – Per-layer specification for encoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["unshared", "unshared", "unshared"].
layer_order_dec (list[str], optional) – Per-layer specification for decoder blocks; values are "shared" or "unshared". Length should match hidden_dims. Defaults to ["shared", "shared", "shared"].
latent_shared (bool, optional) – If True, shares latent layer parameters across clusters. Defaults to False.
output_shared (bool, optional) – If True, shares final output layer across clusters. Defaults to False.
batch_size (int, optional) – Batch size for training. Defaults to 4000.
return_model (bool, optional) – If True, include the trained VAE model in the return tuple. Defaults to True.
epochs (int, optional) – Number of epochs in the initial training phase. Defaults to 500.
initial_lr (float, optional) – Initial learning rate for the optimizer. Defaults to 0.01.
decay_factor (float, optional) – Multiplicative LR decay applied per epoch (lr *= decay_factor). Defaults to 0.999.
beta (float, optional) – Weight of the KL-divergence term in the VAE loss for initial training. Defaults to 0.001.
device (str or torch.device or None, optional) – Compute device, e.g., "cpu" or "cuda". If None, selected automatically. Defaults to None.
max_loops (int, optional) – Maximum number of impute–refit loops. Defaults to 100.
patience (int, optional) – Early stopping patience counted in loops without improvement. Defaults to 2.
epochs_per_loop (int or None, optional) – Number of epochs per refit loop. If None, reuses epochs. Defaults to None.
initial_lr_refit (float or None, optional) – Learning rate for refit loops. If None, uses initial_lr. Defaults to None.
decay_factor_refit (float or None, optional) – LR decay factor during refit loops. If None, uses decay_factor. Defaults to None.
beta_refit (float or None, optional) – KL weight used in refit loops. If None, uses beta. Defaults to None.
verbose (bool, optional) – If True, prints progress and diagnostics. Defaults to False.
return_clusters (bool, optional) – If True, include sample cluster labels in the return tuple. Defaults to False.
return_silhouettes (bool, optional) – If True, include clustering silhouette score(s) in the return tuple. Defaults to False.
return_history (bool, optional) – If True, include concatenated training/refit history (e.g., losses, metrics). Defaults to False.
return_dataset (bool, optional) – If True, include the constructed/processed ClusterDataset object. Defaults to False.
debug (bool, optional) – If True, enables additional checks/logging for troubleshooting. Defaults to False.
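The requirement that every dummy column listed in categorical_column_map is also marked True in binary_feature_mask can be checked before calling run_cissvae. The sketch below is a hedged illustration: the column names "age" and "smoker" are made up, and the consistency check is a suggested pre-flight step, not something the documentation says the package runs in this form.

```python
# Hypothetical column layout: one continuous column, one plain binary column,
# and two categorical variables expanded into dummy columns.
columns = ["age", "smoker", "C1b1", "C1b2", "C2b1", "C2b2"]
categorical_column_map = {"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}
binary_columns = {"smoker"}

# Every dummy column belonging to a categorical variable counts as binary too.
dummy_columns = {c for cols in categorical_column_map.values() for c in cols}
binary_feature_mask = [c in binary_columns or c in dummy_columns for c in columns]

# Pre-flight check: dummy columns that were left unmarked in the mask.
unmarked = [c for c in dummy_columns if not binary_feature_mask[columns.index(c)]]
```

An empty unmarked list means the mask and the map agree. Both objects would then be passed as the binary_feature_mask and categorical_column_map arguments.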
- Returns:
By default returns the imputed dataset. Depending on flags, may also return: model, clusters, silhouette_scores, history, and/or the ClusterDataset. The order is: (imputed_dataset[, model][, clusters][, silhouette_scores][, history][, dataset]).
- Return type:
pandas.DataFrame or tuple[pandas.DataFrame[, CISSVAE][, numpy.ndarray or pandas.Series][, float or dict][, pandas.DataFrame][, ClusterDataset]]
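Because the return value changes shape with the return_* flags, it can help to work out the expected tuple layout before unpacking. The helper below is a hypothetical sketch that assumes each enabled flag appends its element in the documented order. It mirrors the flag defaults from the signature and does not call the package.

```python
def return_fields(return_model=True, return_clusters=False,
                  return_silhouettes=False, return_history=False,
                  return_dataset=False):
    """List the elements expected in the run_cissvae result, in documented order."""
    fields = ["imputed_dataset"]
    if return_model:
        fields.append("model")
    if return_clusters:
        fields.append("clusters")
    if return_silhouettes:
        fields.append("silhouette_scores")
    if return_history:
        fields.append("history")
    if return_dataset:
        fields.append("dataset")
    return fields


default_layout = return_fields()
full_layout = return_fields(return_clusters=True, return_silhouettes=True,
                            return_history=True, return_dataset=True)
```

With the signature defaults (return_model=True), this sketch predicts a two-element result: the imputed dataset followed by the model.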
ciss_vae.training.train_initial module
- train_vae_initial(model, train_loader, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, return_history=False, progress_callback=None, weight_decay=0.001, seed=42)[source]
Train a VAE on masked data with validation monitoring for initial training phase.
Performs the initial training of a CISSVAE model on data with missing values using masked loss computation. Tracks training loss and validation MSE across epochs, with optional progress reporting and learning rate decay.
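The idea of a masked loss with a beta-weighted KL term can be sketched in plain Python. This is an assumption-laden illustration: it covers continuous features only (MSE over observed entries), uses the standard closed-form KL divergence between a diagonal Gaussian and a unit Gaussian, and the function names are hypothetical. The package's actual loss may differ, for example in how binary and categorical features are handled.

```python
import math


def masked_mse(recon, target, mask):
    """Mean squared error over entries where mask is True (observed values)."""
    squared = [(r - t) ** 2 for r, t, m in zip(recon, target, mask) if m]
    return sum(squared) / len(squared) if squared else 0.0


def gaussian_kl(mu, logvar):
    """KL( N(mu, exp(logvar)) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))


def vae_loss(recon, target, mask, mu, logvar, beta=0.1):
    """Masked reconstruction error plus beta-weighted KL term."""
    return masked_mse(recon, target, mask) + beta * gaussian_kl(mu, logvar)


# The second entry is treated as missing, so it does not contribute to the loss.
loss = vae_loss(
    recon=[1.0, 2.0, 3.0],
    target=[1.5, 0.0, 3.0],
    mask=[True, False, True],
    mu=[0.0, 0.0],
    logvar=[0.0, 0.0],
    beta=0.1,
)
```

A larger beta pulls the latent distribution harder toward the prior at the cost of reconstruction accuracy.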
- Parameters:
model (torch.nn.Module) – CISSVAE or compatible VAE model that implements forward(x, cluster_id)
train_loader (torch.utils.data.DataLoader) – DataLoader built on ClusterDataset containing validation data
epochs (int, optional) – Number of training epochs, defaults to 10
initial_lr (float, optional) – Starting learning rate for Adam optimizer, defaults to 0.01
decay_factor (float, optional) – Exponential learning rate decay factor applied per epoch, defaults to 0.999
beta (float, optional) – Weight coefficient for KL divergence term in VAE loss, defaults to 0.1
device (str, optional) – Device for training computations (“cpu” or “cuda”), defaults to “cpu”
verbose (bool, optional) – Whether to print per-epoch training metrics, defaults to False
return_history (bool, optional) – Whether to return training history DataFrame along with model, defaults to False
progress_callback (callable, optional) – Optional callback function for progress reporting, defaults to None
- Returns:
Trained model, or tuple of (model, history_dataframe) if return_history=True
- Return type:
torch.nn.Module or tuple[torch.nn.Module, pandas.DataFrame]
- Raises:
ValueError – If dataset does not contain ‘val_data’ attribute for validation
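The exponential decay controlled by initial_lr and decay_factor can be tabulated with a few lines of Python. The sketch assumes the rate is multiplied by decay_factor once after each epoch, so epoch t (counting from 0) trains at initial_lr * decay_factor ** t. Whether the package applies the step before or after the first epoch is not stated here, and lr_schedule is a hypothetical helper.

```python
def lr_schedule(initial_lr=0.01, decay_factor=0.999, epochs=10):
    """Learning rate used in each epoch under per-epoch multiplicative decay."""
    rates = []
    lr = initial_lr
    for _ in range(epochs):
        rates.append(lr)
        lr *= decay_factor  # decay applied once per epoch
    return rates


rates = lr_schedule(initial_lr=0.01, decay_factor=0.999, epochs=10)
final_rate = rates[-1]
```

With decay_factor=0.999 the rate falls by only about 0.1% per epoch, so the decay becomes noticeable mainly over hundreds of epochs.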
ciss_vae.training.train_refit module
- train_vae_refit(model, imputed_data, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, progress_callback=None, weight_decay=0.001, seed=42)[source]
Train the VAE model on imputed data without masking for one refit iteration.
Performs training on the complete imputed dataset.
- Parameters:
model (torch.nn.Module) – VAE model to train
imputed_data (torch.utils.data.DataLoader) – DataLoader containing imputed dataset with complete values
epochs (int, optional) – Number of training epochs, defaults to 10
initial_lr (float, optional) – Initial learning rate for the optimizer, defaults to 0.01
decay_factor (float, optional) – Exponential decay factor for learning rate scheduler, defaults to 0.999
beta (float, optional) – Weight for KL divergence term in loss function, defaults to 0.1
device (str, optional) – Device to run training on, defaults to “cpu”
verbose (bool, optional) – Whether to print training progress information, defaults to False
progress_callback (callable, optional) – Optional callback function to report epoch progress, defaults to None
- Returns:
Trained model with updated final learning rate
- Return type:
torch.nn.Module
- impute_and_refit_loop(model, train_loader, max_loops=10, patience=2, epochs_per_loop=5, initial_lr=None, decay_factor=0.999, weight_decay=0.001, beta=0.1, device='cpu', verbose=False, batch_size=4000, progress_epoch=None, seed=42)[source]
Iterative impute-refit loop with validation MSE early stopping.
Performs alternating cycles of imputation (filling missing values with model predictions) and refitting (training on the complete imputed data). Uses early stopping based on validation MSE to prevent overfitting and selects the best performing model.
- Parameters:
model (torch.nn.Module) – Trained VAE model to start the impute-refit process
train_loader (torch.utils.data.DataLoader) – DataLoader for the original training dataset with missing values
max_loops (int, optional) – Maximum number of impute-refit cycles to perform, defaults to 10
patience (int, optional) – Number of loops to wait for improvement before early stopping, defaults to 2
epochs_per_loop (int, optional) – Number of training epochs per refit cycle, defaults to 5
initial_lr (float, optional) – Learning rate for refit training, uses model’s final LR if None, defaults to None
decay_factor (float, optional) – Exponential decay factor for learning rate, defaults to 0.999
beta (float, optional) – Weight for KL divergence term in loss function, defaults to 0.1
device (str, optional) – Device to run computations on, defaults to “cpu”
verbose (bool, optional) – Whether to print detailed progress information, defaults to False
batch_size (int, optional) – Batch size for refit training, defaults to 4000
progress_epoch (callable, optional) – Optional callback function to report epoch progress, defaults to None
- Returns:
Tuple containing (imputed_dataframe, best_model, best_dataset, refit_history_dataframe).
Columns of refit_history_dataframe:
epoch (int) : cumulative epoch counter (continues from initial)
train_loss (float) : NaN (not tracked during refit here)
val_mse (float) : validation MSE after each refit loop
lr (float) : learning rate after each refit loop
phase (str) : {“refit_init”, “refit_loop”}
loop (int) : 0 for baseline (pre-refit), then 1..k per loop
- Return type:
tuple[pandas.DataFrame, torch.nn.Module, ClusterDataset, pandas.DataFrame]
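The early-stopping rule of the impute-refit loop can be sketched independently of any model. In the hypothetical helper below, a precomputed list of validation MSE values stands in for the real impute and refit steps, and loop 0 denotes the baseline before any refit, matching the loop column described above. It illustrates the patience logic only, not the package's implementation.

```python
def refit_with_patience(val_mse_per_loop, baseline_mse, max_loops=10, patience=2):
    """Return (best_loop, best_mse, loops_run) under patience-based early stopping."""
    best_loop, best_mse = 0, baseline_mse  # loop 0 is the pre-refit baseline
    stale = 0       # consecutive loops without improvement
    loops_run = 0
    for loop, mse in enumerate(val_mse_per_loop[:max_loops], start=1):
        loops_run = loop
        if mse < best_mse:
            best_loop, best_mse, stale = loop, mse, 0
        else:
            stale += 1
            if stale >= patience:
                break  # stop after `patience` loops without improvement
    return best_loop, best_mse, loops_run


# Loops 3 and 4 fail to beat loop 2, so the run stops before loop 5 is tried.
result = refit_with_patience([0.8, 0.7, 0.75, 0.72, 0.6], baseline_mse=1.0)
```

In this example the result is (2, 0.7, 4): the state from loop 2 would be kept as the best model, even though a later loop might have improved on it.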
Module contents
- train_vae_initial(model, train_loader, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, return_history=False, progress_callback=None, weight_decay=0.001, seed=42)[source]
- train_vae_refit(model, imputed_data, epochs=10, initial_lr=0.01, decay_factor=0.999, beta=0.1, device='cpu', verbose=False, *, progress_callback=None, weight_decay=0.001, seed=42)[source]
- run_autotune(search_space, train_dataset, save_model_path=None, save_search_space_path=None, n_trials=20, study_name='vae_autotune', device_preference='cuda', optuna_dashboard_db=None, load_if_exists=True, seed=42, verbose=False, show_progress=False, constant_layer_size=False, evaluate_all_orders=False, max_exhaustive_orders=100, return_history=False, n_jobs=1, debug=False)
Optuna-based hyperparameter search for the CISSVAE model.
Runs initial training followed by impute-refit loops per trial, selecting the trial with the lowest total imputation error (MSE + BCE + categorical CE). The best model is then retrained with the optimal hyperparameters and returned along with the imputed dataset.
- class SearchSpace(num_hidden_layers=(1, 4), hidden_dims=[64, 512], latent_dim=[10, 100], latent_shared=[True, False], output_shared=[True, False], lr=(0.0001, 0.001), decay_factor=(0.9, 0.999), weight_decay=0.001, beta=0.01, num_epochs=1000, batch_size=64, num_shared_encode=[0, 1, 3], num_shared_decode=[0, 1, 3], encoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], decoder_shared_placement=['at_end', 'at_start', 'alternating', 'random'], refit_patience=2, refit_loops=100, epochs_per_loop=1000, reset_lr_refit=[True, False])[source]
- run_cissvae(data, val_proportion=0.1, replacement_value=0.0, columns_ignore=None, print_dataset=True, imputable_matrix=None, binary_feature_mask=None, categorical_column_map=None, clusters=None, n_clusters=None, k_neighbors=15, leiden_resolution=0.5, leiden_objective='CPM', seed=42, missingness_proportion_matrix=None, scale_features=False, hidden_dims=[150, 120, 60], latent_dim=15, layer_order_enc=['unshared', 'unshared', 'unshared'], layer_order_dec=['shared', 'shared', 'shared'], latent_shared=False, output_shared=False, batch_size=4000, return_model=True, epochs=500, initial_lr=0.01, decay_factor=0.999, weight_decay=0.001, beta=0.001, device=None, max_loops=100, patience=2, epochs_per_loop=None, initial_lr_refit=None, decay_factor_refit=None, beta_refit=None, verbose=False, return_clusters=False, return_silhouettes=False, return_history=False, return_dataset=False, debug=False)[source]