ciss_vae.classes.cluster_dataset.ClusterDataset

class ClusterDataset(*args: Any, **kwargs: Any)

Bases: Dataset

Dataset that handles cluster-wise masking and normalization for VAE training.

1. Optionally holds out a validation subset per cluster from observed (non-NaN) entries according to val_proportion.
2. Combines original missingness with validation-held-out entries.
3. Normalizes observed values column-wise (mean/std), keeps masks for NaNs, and replaces NaNs (including held-out values) with replacement_value.

Parameters:

data (pandas.DataFrame | numpy.ndarray | torch.Tensor) – Input matrix of shape (n_samples, n_features). May contain NaNs.
cluster_labels (array-like or None) – Cluster assignment per sample (length n_samples). If None, all rows are assigned to a single cluster 0.
val_proportion (float | collections.abc.Sequence | collections.abc.Mapping | pandas.Series) –
Per-cluster fraction of non-missing entries to hold out for validation. Accepted forms:
- float in [0, 1]: same fraction for all clusters
- sequence (length = number of clusters): aligned to sorted(unique(cluster_labels))
- mapping (e.g. {cluster_id: fraction}) covering all clusters
- pandas.Series indexed by cluster IDs covering all clusters
replacement_value (float) – Value used to fill missing and held-out entries after masking.
columns_ignore (list[str | int] or None) – Columns to exclude from validation masking. Use column names for DataFrame and indices otherwise.
imputable (pandas.DataFrame | numpy.ndarray | torch.Tensor) – Matrix indicating which entries are eligible for imputation (1 = impute, 0 = exclude from imputation). Must have the same shape as data.
binary_feature_mask (list[bool] | numpy.ndarray) – Boolean vector of length n_features indicating binary columns. Used to construct activation_groups. Categorical dummy columns must also be marked as True.
categorical_column_map (dict[str, list[str | int]] or None) –
Optional mapping from original categorical variable names to their corresponding dummy-variable columns. Example:
```
{"C1": ["C1b1", "C1b2"], "C2": ["C2b1", "C2b2"]}
```
These columns are grouped together in activation_groups and treated as categorical variables. All listed columns must also be marked as True in binary_feature_mask.

Variables:

raw_data (torch.FloatTensor) – Original data converted to float tensor (NaNs preserved).
data (torch.FloatTensor) – Normalized data with NaNs replaced by replacement_value.
masks (torch.BoolTensor) – Boolean mask where True indicates observed (non-NaN) entries before replacement.
val_data (torch.FloatTensor) – Tensor containing only validation-held-out values (others are NaN).
cluster_labels (torch.LongTensor) – Cluster ID for each row.
indices (torch.LongTensor) – Original row indices (from DataFrame index or arange for arrays/tensors).
feature_names (list[str]) – Column names (from DataFrame) or synthetic names (V1, V2, …).
n_clusters (int) – Number of unique clusters.
shape (tuple[int, int]) – Shape of self.data as (n_samples, n_features).
binary_feature_mask (numpy.ndarray) – Boolean mask indicating binary features.
activation_groups (dict) –
Mapping of feature groups to column indices. Structure:
```
{
    "continuous": [int, ...],
    "binary": [int, ...],
    "<categorical_name>": [int, ...],
    ...
}
```
- "continuous": indices of continuous-valued features
- "binary": indices of binary features
- Each additional key corresponds to a grouped categorical variable
This structure is used for loss computation, imputation, and validation logic.
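
The grouping described above can be illustrated with a minimal sketch. The helper name `build_activation_groups` is hypothetical and not part of ciss_vae; the sketch also assumes that dummy columns claimed by a categorical variable are listed only under that variable's key and not repeated under "binary", which the documentation does not state.

```python
def build_activation_groups(feature_names, binary_feature_mask, categorical_column_map=None):
    """Hypothetical helper: group column indices by feature type."""
    name_to_idx = {name: i for i, name in enumerate(feature_names)}
    categorical_column_map = categorical_column_map or {}

    groups = {"continuous": [], "binary": []}
    grouped = set()

    # Each categorical variable becomes its own group of dummy-column indices.
    for cat_name, dummy_cols in categorical_column_map.items():
        idxs = [name_to_idx[c] if isinstance(c, str) else int(c) for c in dummy_cols]
        not_binary = [i for i in idxs if not binary_feature_mask[i]]
        if not_binary:
            raise ValueError(
                f"Columns {not_binary} of {cat_name!r} must be True in binary_feature_mask"
            )
        groups[cat_name] = idxs
        grouped.update(idxs)

    # Remaining columns are split by the binary mask (assumption: no overlap
    # between categorical groups and the plain "binary" group).
    for i, is_binary in enumerate(binary_feature_mask):
        if i in grouped:
            continue
        groups["binary" if is_binary else "continuous"].append(i)
    return groups


groups = build_activation_groups(
    ["age", "sex", "C1b1", "C1b2"],
    [False, True, True, True],
    {"C1": ["C1b1", "C1b2"]},
)
print(groups)
```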

Raises:

TypeError – If data or cluster_labels are invalid types, or if val_proportion is not a supported type.
ValueError – If any proportion is outside [0, 1], or if cluster coverage is incomplete, or sequence lengths do not match number of clusters.
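
As a rough illustration of how the accepted val_proportion forms and the errors above fit together, the sketch below resolves any supported form into a per-cluster dictionary. The function `resolve_val_proportion` is a hypothetical name, not the library's internal code. A pandas.Series is not handled directly here; it is assumed to be converted first with `Series.to_dict()`.

```python
from collections.abc import Mapping, Sequence
from numbers import Real


def resolve_val_proportion(val_proportion, cluster_labels):
    """Hypothetical helper: map each cluster ID to its hold-out fraction."""
    clusters = sorted(set(cluster_labels))

    if isinstance(val_proportion, Real) and not isinstance(val_proportion, bool):
        # Single float: same fraction for every cluster.
        resolved = {c: float(val_proportion) for c in clusters}
    elif isinstance(val_proportion, Mapping):
        # Mapping: must cover every cluster.
        missing = [c for c in clusters if c not in val_proportion]
        if missing:
            raise ValueError(f"val_proportion is missing clusters: {missing}")
        resolved = {c: float(val_proportion[c]) for c in clusters}
    elif isinstance(val_proportion, Sequence) and not isinstance(val_proportion, (str, bytes)):
        # Sequence: aligned to the sorted unique cluster IDs.
        if len(val_proportion) != len(clusters):
            raise ValueError(
                f"Expected {len(clusters)} proportions, got {len(val_proportion)}"
            )
        resolved = {c: float(p) for c, p in zip(clusters, val_proportion)}
    else:
        raise TypeError(f"Unsupported val_proportion type: {type(val_proportion).__name__}")

    for c, p in resolved.items():
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"Proportion for cluster {c} is outside [0, 1]: {p}")
    return resolved


print(resolve_val_proportion([0.1, 0.3], [1, 0, 1]))
```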

Note

- Normalization uses column-wise mean and standard deviation computed from observed values after validation masking.
- Zero standard deviations are replaced with 1 to avoid division by zero.
- Feature types are resolved into activation_groups and used throughout training, loss computation, and imputation.
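
The normalization rule in this note can be sketched in plain Python, without torch, to make the arithmetic explicit. The function `normalize_columns` is a hypothetical name. The sketch assumes a population standard deviation (divide by n) and treats a column with no observed values as mean 0, std 1; the documentation does not specify either choice.

```python
import math


def normalize_columns(rows, replacement_value=0.0):
    """Hypothetical sketch: column-wise z-scoring over observed (non-NaN) entries."""
    n_features = len(rows[0])
    means, stds = [], []
    for j in range(n_features):
        observed = [r[j] for r in rows if not math.isnan(r[j])]
        if observed:
            mean = sum(observed) / len(observed)
            # Assumption: population variance (ddof = 0).
            var = sum((x - mean) ** 2 for x in observed) / len(observed)
            std = math.sqrt(var)
        else:
            mean, std = 0.0, 1.0  # assumption for fully missing columns
        if std == 0.0:
            std = 1.0  # avoid division by zero, as stated in the note
        means.append(mean)
        stds.append(std)

    # True marks an observed entry, mirroring the masks attribute.
    mask = [[not math.isnan(x) for x in r] for r in rows]
    normalized = [
        [replacement_value if math.isnan(x) else (x - means[j]) / stds[j]
         for j, x in enumerate(r)]
        for r in rows
    ]
    return normalized, mask, means, stds


rows = [[1.0, 5.0], [3.0, math.nan], [math.nan, 5.0]]
normalized, mask, means, stds = normalize_columns(rows)
print(normalized)
```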

__init__(data, cluster_labels, val_proportion=0.1, replacement_value=0, columns_ignore=None, imputable=None, val_seed=42, binary_feature_mask=None, categorical_column_map=None)

Build the dataset, apply per-cluster validation masking, and normalize.

Steps:
1. Convert inputs to tensors; preserve indices/column names if a DataFrame.
2. Resolve per-cluster validation proportions from val_proportion.
3. For each cluster and feature, randomly mark the requested fraction of observed entries as validation targets.
4. Create val_data (validation targets only) and training data where validation entries are set to NaN.
5. Compute per-feature mean/std over non-NaN entries in data and apply normalization; then replace remaining NaNs with replacement_value.
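
Steps 3 and 4 can be illustrated with a small dependency-free sketch. The function `hold_out_validation` is hypothetical and works on lists of floats, not tensors. It assumes the number of held-out entries is rounded down and that columns_ignore holds integer column indices; the library's actual rounding and sampling order may differ.

```python
import math
import random


def hold_out_validation(rows, cluster_labels, proportions, columns_ignore=(), seed=42):
    """Hypothetical sketch: per-cluster, per-feature validation hold-out."""
    rng = random.Random(seed)
    n_features = len(rows[0])
    train = [list(r) for r in rows]                    # copy; input is not mutated
    val = [[math.nan] * n_features for _ in rows]      # validation targets only
    ignored = set(columns_ignore)

    for cluster in sorted(set(cluster_labels)):
        members = [i for i, c in enumerate(cluster_labels) if c == cluster]
        for j in range(n_features):
            if j in ignored:
                continue
            # Only observed (non-NaN) entries are candidates for hold-out.
            observed = [i for i in members if not math.isnan(rows[i][j])]
            n_val = int(len(observed) * proportions[cluster])  # assumption: round down
            for i in rng.sample(observed, n_val):
                val[i][j] = rows[i][j]
                train[i][j] = math.nan
    return train, val


rows = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0]]
train, val = hold_out_validation(
    rows, [0, 0, 1, 1], {0: 0.5, 1: 0.0}, columns_ignore=[1], seed=0
)
```

Here cluster 0 gives up one of its two observed values in column 0, cluster 1 gives up none, and column 1 is left untouched because it is ignored.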

Parameters:

data (pandas.DataFrame or numpy.ndarray or torch.Tensor) – Input matrix, shape (n_samples, n_features). May contain NaNs
cluster_labels (array-like or None) – Cluster assignment per sample (length n_samples). If None, all rows are assigned to a single cluster 0
val_proportion (float or collections.abc.Sequence or collections.abc.Mapping or pandas.Series, optional) – Per-cluster fraction of non-missing entries to hold out for validation, defaults to 0.1
replacement_value (float, optional) – Value to fill missing/held-out entries in self.data after masking, defaults to 0
columns_ignore (list[str or int] or None, optional) – Columns to exclude from validation masking (names for DataFrame, indices otherwise), defaults to None
imputable (pandas.DataFrame | numpy.ndarray | torch.Tensor, optional) – Matrix indicating which entries are eligible for imputation (1 = impute, 0 = exclude from imputation). Must have the same shape as data, (n_samples, n_features); defaults to None
val_seed (int, optional) – Seed for the random number generator that selects validation entries, defaults to 42
binary_feature_mask (list[bool] or numpy.ndarray, optional) – 1D boolean vector of length n_features; True marks a binary column. Defaults to None
categorical_column_map (dict[str, list[str | int]] or None, optional) – Mapping from original categorical variable names to their dummy-variable columns. Requires binary_feature_mask to be set. Defaults to None

Methods

`__init__`(data, cluster_labels[, ...])	Build the dataset, apply per-cluster validation masking, and normalize.
`copy`()	Creates a deep copy of the ClusterDataset instance containing all attributes.
`get_activation_groups`([exclude_ignored])	Return activation groups, optionally excluding ignored columns.

get_activation_groups(exclude_ignored=False)

Return activation groups, optionally excluding ignored columns.

Parameters: exclude_ignored (bool) – If True, removes columns listed in columns_ignore.
Returns: Activation groups, with ignored columns removed when exclude_ignored is True.
Return type: dict
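
The filtering behaviour can be pictured with the sketch below. `filter_activation_groups` is a hypothetical stand-alone function, not the method itself. It assumes ignored columns are given as integer indices (DataFrame column names would need to be resolved to indices first) and that a group emptied by filtering is kept with an empty list; the documentation does not confirm either point.

```python
def filter_activation_groups(activation_groups, ignored_indices, exclude_ignored=False):
    """Hypothetical sketch of the exclude_ignored filtering."""
    if not exclude_ignored:
        # Return a copy so callers cannot mutate the original groups.
        return {name: list(idxs) for name, idxs in activation_groups.items()}
    ignored = set(ignored_indices)
    return {
        name: [i for i in idxs if i not in ignored]
        for name, idxs in activation_groups.items()
    }


groups = {"continuous": [0, 3], "binary": [1], "C1": [2, 4]}
filtered = filter_activation_groups(groups, ignored_indices=[3, 4], exclude_ignored=True)
unfiltered = filter_activation_groups(groups, ignored_indices=[3, 4])
print(filtered)
```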

copy()

Creates a deep copy of the ClusterDataset instance containing all attributes.

Returns: Deep copy of the dataset
Return type: ClusterDataset