Handling binary data columns

The CISS-VAE model can handle binary and categorical variables in addition to continuous ones. Categorical variables must be represented as binary dummy variables (for example with pandas get_dummies), and the dummy variables must be linked back to their original categorical column by passing a dictionary via categorical_column_map.
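As a minimal sketch of that encoding step, the snippet below one-hot encodes a toy categorical column with pandas and builds a matching dictionary. The column names and data are made up for illustration. Restoring NaN for rows whose category is missing is an assumption about how you would want missingness preserved, since get_dummies otherwise encodes a missing category as all zeros.

```python
import numpy as np
import pandas as pd

# Toy data: one continuous column and one categorical column with a missing value.
df = pd.DataFrame({
    "age": [34.0, 51.0, 29.0, 42.0],
    "color": ["red", "blue", None, "red"],
})

# One-hot encode the categorical column; use floats so NaN can be stored.
dummies = pd.get_dummies(df["color"], prefix="color", dtype=float)

# get_dummies encodes a missing category as all zeros, so restore NaN for those rows.
dummies.loc[df["color"].isna(), :] = np.nan

X_encoded = pd.concat([df.drop(columns="color"), dummies], axis=1)

# Map the original categorical column to its dummy columns.
categorical_column_map = {"color": list(dummies.columns)}

print(X_encoded)
print(categorical_column_map)
```

The resulting dictionary has the same shape as the categorical_column_map used later in this page: one key per original column, with a list of dummy column names as the value.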

When imputing binary data, the model applies a sigmoid activation function at the end of the forward pass, so each imputed value is the probability that the entry should be 1. To obtain 0/1 values, the user needs to threshold these probabilities after running the model.
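To make the probability interpretation concrete, here is a small NumPy sketch of a sigmoid followed by a cutoff. It is an illustration only, not the library's internal code; the logits are hypothetical and the 0.5 cutoff is one common choice.

```python
import numpy as np

def sigmoid(z):
    """Map real-valued logits to probabilities in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical raw outputs (logits) for one binary column.
logits = np.array([-2.0, 0.0, 1.5])
probs = sigmoid(logits)

# Convert probabilities to 0/1 labels with a strict 0.5 cutoff.
labels = (probs > 0.5).astype(int)

print(probs)
print(labels)
```

A logit of 0 maps to exactly 0.5, which the strict comparison assigns to class 0.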

Because some datasets have both binary and continuous variables, you can include a binary variable mask (boolean vector) to tell the model which variables are binary so it acts accordingly.

Example dataset

The example dataset below has both binary and continuous variables in it:

import pandas as pd
import numpy as np

np.random.seed(42)

n_rows = 100
prop_mask = 0.3

X = pd.DataFrame({
    "feat1": np.random.choice(np.arange(1, n_rows + 1), size=n_rows, replace=True),
    "feat2": np.random.choice(np.arange(1, 6), size=n_rows, replace=True),
    "feat3": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    "feat4": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    "feat5": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    ## now add some binary features
    "bf1": np.random.binomial(1, 0.25, size=n_rows),
    "bf2": np.random.binomial(1, 0.5, size=n_rows),
    "bf3": np.random.binomial(1, 0.75, size=n_rows),
    "bf4": np.random.binomial(1, 0.33, size=n_rows),
    "bf5": np.random.binomial(1, 0.66, size=n_rows),
})

X_raw = X.copy()

for col in X.columns[1:]:  # skip feat1
    idx = np.where(~X[col].isna())[0]  # indices of non-NA entries
    n_mask = int(np.ceil(len(idx) * prop_mask))
    if n_mask > 0:
        mask_idx = np.random.choice(idx, size=n_mask, replace=False)
        X.loc[mask_idx, col] = np.nan

print(f"X matrix:\n{X}")


## choosing random clusters for now 
clusters = np.random.choice([1, 2, 3], size=n_rows, replace=True)

X matrix:
    feat1  feat2  feat3  feat4  feat5  bf1  bf2  bf3  bf4  bf5
    52    2.0    6.0    4.0    2.0  NaN  NaN  NaN  NaN  0.0
    93    1.0    7.0    NaN    2.0  NaN  1.0  1.0  1.0  1.0
    15    4.0    NaN    4.0    2.0  0.0  0.0  1.0  0.0  NaN
    72    4.0    3.0    NaN    1.0  NaN  1.0  NaN  1.0  NaN
    61    NaN    1.0    NaN    1.0  0.0  NaN  NaN  NaN  0.0
..    ...    ...    ...    ...    ...  ...  ...  ...  ...  ...
   85    NaN    NaN    6.0    2.0  0.0  0.0  1.0  NaN  1.0
   80    1.0    NaN    7.0    2.0  0.0  NaN  1.0  NaN  1.0
   82    2.0    4.0    NaN    1.0  0.0  NaN  1.0  NaN  NaN
   53    4.0    3.0    7.0    7.0  NaN  0.0  0.0  0.0  1.0
   24    4.0    1.0    3.0    NaN  0.0  1.0  1.0  1.0  0.0

[100 rows x 10 columns]

Preparing Binary Vector

The binary vector binary_feature_mask is of length p for an n x p data matrix and is True for binary columns and False for continuous columns.

binary_vector = [False, False, False, False, False, True, True, True, True, True]
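For wider datasets, writing the vector by hand is error-prone. One possible approach, sketched below, is to flag a column as binary when all of its observed values are 0 or 1. The helper infer_binary_mask is hypothetical and not part of ciss_vae, and the result should be reviewed, since a continuous column that happens to contain only 0s and 1s would also be flagged.

```python
import numpy as np
import pandas as pd

def infer_binary_mask(df):
    """Return a list of booleans, True where a column's observed values are all 0 or 1."""
    mask = []
    for col in df.columns:
        observed = df[col].dropna()
        mask.append(bool(len(observed) > 0 and observed.isin([0, 1]).all()))
    return mask

# Small stand-in for a dataset with one continuous and two binary columns.
demo = pd.DataFrame({
    "feat1": [3.0, 7.0, np.nan, 5.0],
    "bf1": [0.0, 1.0, np.nan, 1.0],
    "bf2": [1.0, np.nan, 0.0, 0.0],
})

binary_vector_demo = infer_binary_mask(demo)
print(binary_vector_demo)
```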

Using `run_cissvae()` with binary matrix

Pass the binary vector to the run_cissvae() function using the binary_feature_mask argument. Note: binary_feature_mask must have one entry for every column of the data, including any columns listed in columns_ignore.

import ciss_vae

from ciss_vae.training.run_cissvae import run_cissvae
from ciss_vae.utils.helpers import plot_vae_architecture
print(ciss_vae.__file__)

imputed_data, vae, ds, history = run_cissvae(data = X,
## Dataset params
    columns_ignore = X.columns[0], ## columns to ignore when selecting validation dataset (and clustering if you do not provide clusters). For example, demographic columns with no missingness.
    clusters = clusters,
    print_dataset = False,
    binary_feature_mask = binary_vector,
## VAE model params
    hidden_dims = [150, 120, 60], ## Dimensions of hidden layers, in order. One number per layer. 
    latent_dim = 15, ## Dimensions of latent embedding
    layer_order_enc = ["unshared", "unshared", "unshared"], ## order of shared vs unshared layers for encode (can use u or s instead of unshared, shared)
    layer_order_dec=["shared", "shared",  "shared"],  ## order of shared vs unshared layers for decode
    latent_shared=False, 
    output_shared=False, 
    batch_size = 4000, ## batch size for data loader
    return_model = True, ## if true, outputs imputed dataset and model, otherwise just outputs imputed dataset. Set to true to return model for `plot_vae_architecture`

## Initial Training params
    epochs = 5, ## default 

## Other params
    return_history = True, ## if true, will return training MSE history as pandas dataframe
    return_dataset=True
)

print(f"The successfully imputed dataset:\n{imputed_data.head(3)}\n\n")

\\VPensBST\BstShared\Biostatistics\Danielle\Repos\CISS_VAE\CISS-VAE-python\src\ciss_vae\__init__.py
The successfully imputed dataset:
   feat1  feat2    feat3     feat4  feat5       bf1       bf2       bf3  \
0   52.0    2.0  6.00000  4.000000    2.0  0.302485  0.404743  0.636127   
1   93.0    1.0  7.00000  4.189552    2.0  0.110679  1.000000  1.000000   
2   15.0    4.0  3.44211  4.000000    2.0  0.000000  0.000000  1.000000   

        bf4      bf5  
0  0.443553  0.00000  
1  1.000000  1.00000  
2  0.000000  0.57127  

Now that we have an imputed dataset, we need to convert the probabilities to true binary values.

bf_cols = [col for col in imputed_data.columns if col.startswith('bf')]
imputed_data[bf_cols] = (imputed_data[bf_cols] > 0.5).astype(int)

print(f"The successfully imputed dataset:\n{imputed_data.head(3)}\n\n")

The successfully imputed dataset:
   feat1  feat2    feat3     feat4  feat5  bf1  bf2  bf3  bf4  bf5
0   52.0    2.0  6.00000  4.000000    2.0    0    0    1    0    0
1   93.0    1.0  7.00000  4.189552    2.0    0    1    1    1    1
2   15.0    4.0  3.44211  4.000000    2.0    0    0    1    0    1
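Because the example keeps a copy of the data from before masking (X_raw), the thresholded values can be compared with the truth on the cells that were masked. The sketch below shows the idea on small stand-in frames; truth, masked, and imputed are hypothetical names playing the roles of X_raw, X, and imputed_data, and the probabilities are made up.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: true values, masked input, and model output.
truth = pd.DataFrame({"bf1": [0, 1, 1, 0], "bf2": [1, 1, 0, 0]})
masked = truth.astype(float)
masked.loc[[1, 3], "bf1"] = np.nan
masked.loc[[0], "bf2"] = np.nan

# Hypothetical imputed probabilities (observed cells keep their 0/1 values).
imputed = masked.copy()
imputed.loc[[1, 3], "bf1"] = [0.8, 0.6]
imputed.loc[[0], "bf2"] = [0.9]

labels = (imputed > 0.5).astype(int)

# Score only the cells that were missing in the input.
was_missing = masked.isna()
correct = (labels == truth) & was_missing
accuracy = correct.to_numpy().sum() / was_missing.to_numpy().sum()
print(f"Accuracy on masked binary cells: {accuracy:.2f}")
```

Scoring only the masked cells avoids inflating the accuracy with observed values, which pass through unchanged.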

print(f"History \n{history}")

History 
    epoch  train_loss   train_mse   train_bce  train_ce  imputation_error  \
     0    6.945113  347.701630  346.802612       0.0        320.539062   
     1    6.810024  340.168030  340.817749       0.0        344.425415   
     2    6.712993  339.692535  331.558289       0.0        345.298767   
     3    6.597640  334.835938  324.817261       0.0        334.807343   
     4    6.465297  325.737579  320.526825       0.0        337.141022   
     4         NaN         NaN         NaN       NaN        347.181915   
     9         NaN         NaN         NaN       NaN        271.728943   
     0    5.920253  281.688995  309.766754       0.0               NaN   
     1    6.260298  318.909790  306.554749       0.0               NaN   
     2    5.865682  278.205566  307.714355       0.0               NaN   
    3    5.694291  265.898407  302.624420       0.0               NaN   
    4    5.658046  259.660370  305.226379       0.0               NaN   
   14         NaN         NaN         NaN       NaN        377.688629   
    0    5.380131  233.666336  303.406342       0.0               NaN   
    1    9.380676  620.381714  315.344940       0.0               NaN   
    2    5.336485  226.872391  305.304321       0.0               NaN   
    3    5.540195  241.228027  311.292175       0.0               NaN   
    4    5.593123  246.618195  310.927979       0.0               NaN   
   19         NaN         NaN         NaN       NaN        370.147614   
    0    5.478112  237.671204  307.676208       0.0               NaN   
    1    5.940436  279.735870  311.661987       0.0               NaN   
    2    5.405255  234.999435  302.320129       0.0               NaN   
    3    5.558175  249.377594  302.137451       0.0               NaN   
    4    4.837627  180.613281  298.911987       0.0               NaN   

       val_mse   val_bce  val_ce        lr           phase  loop  
 319.861664  0.677406     0.0  0.010000             NaN   NaN  
 343.775421  0.649993     0.0  0.009990             NaN   NaN  
 344.682617  0.616162     0.0  0.009980             NaN   NaN  
 334.233856  0.573492     0.0  0.009970             NaN   NaN  
 336.607391  0.533642     0.0  0.009960             NaN   NaN  
 346.653839  0.528070     0.0  0.009950           refit   0.0  
 271.148407  0.580540     0.0  0.009900           refit   1.0  
        NaN       NaN     NaN  0.009950  refit_training   NaN  
        NaN       NaN     NaN  0.009940  refit_training   NaN  
        NaN       NaN     NaN  0.009930  refit_training   NaN  
       NaN       NaN     NaN  0.009920  refit_training   NaN  
       NaN       NaN     NaN  0.009910  refit_training   NaN  
377.069336  0.619285     0.0  0.009851           refit   2.0  
       NaN       NaN     NaN  0.009900  refit_training   NaN  
       NaN       NaN     NaN  0.009891  refit_training   NaN  
       NaN       NaN     NaN  0.009881  refit_training   NaN  
       NaN       NaN     NaN  0.009871  refit_training   NaN  
       NaN       NaN     NaN  0.009861  refit_training   NaN  
369.551239  0.596373     0.0  0.009802           refit   3.0  
       NaN       NaN     NaN  0.009851  refit_training   NaN  
       NaN       NaN     NaN  0.009841  refit_training   NaN  
       NaN       NaN     NaN  0.009831  refit_training   NaN  
       NaN       NaN     NaN  0.009822  refit_training   NaN  
       NaN       NaN     NaN  0.009812  refit_training   NaN  

As always, the VAE architecture can be plotted.

plot_vae_architecture(model = vae,
                        title = None, ## Set title of plot
                        ## Colors below are default
                        color_shared = "skyblue", 
                        color_unshared ="lightcoral",
                        color_latent = "gold",
                        color_input = "lightgreen",
                        color_output = "lightgreen",
                        figsize=(16, 8),
                        return_fig = False)

[Figure: VAE architecture plot produced by plot_vae_architecture]

Using Binary Feature Mask with Autotune

To use a binary_feature_mask with autotune(), pass it via the binary_feature_mask parameter when initializing the ClusterDataset object.

from ciss_vae.classes.cluster_dataset import ClusterDataset
from ciss_vae.training.autotune import autotune, SearchSpace
cd = ClusterDataset(
    X, cluster_labels = clusters, binary_feature_mask=binary_vector
)

ss = SearchSpace(
    num_hidden_layers = [1, 2],
    hidden_dims = [6, 16, 32],
    latent_dim=10,
    latent_shared=True,
    output_shared = True,
    lr = 0.01,
    decay_factor=0.999,
    num_epochs = 100,
    num_shared_encode = 1,
    num_shared_decode = 1,
    epochs_per_loop=100,
    reset_lr_refit=False

)
autotune(search_space = ss, train_dataset = cd, optuna_dashboard_db =  "sqlite:///optuna_study_test_binary.db", debug = False)

[Warning] CUDA requested but not available. Falling back to CPU.

[I 2026-04-01 09:56:56,124] Using an existing study with name 'vae_autotune' instead of creating a new one.
\\VPensBST\BstShared\Biostatistics\Danielle\Repos\CISS_VAE\CISS-VAE-python\src\ciss_vae\training\autotune.py:652: ExperimentalWarning: optuna.study.study.Study.set_metric_names is experimental (supported from v3.2.0). The interface can change in the future.
  study.set_metric_names(["Total Imputation Error"])

Starting Optuna optimization with 20 trials...

[I 2026-04-01 09:57:11,614] Trial 23 finished with value: {'Total Imputation Error': 280.91204833984375} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:57:25,087] Trial 24 finished with value: {'Total Imputation Error': 213.1970672607422} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:57:38,223] Trial 25 finished with value: {'Total Imputation Error': 313.8942565917969} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:57:48,507] Trial 26 finished with value: {'Total Imputation Error': 378.0359191894531} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_end'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:57:58,126] Trial 27 finished with value: {'Total Imputation Error': 291.3521423339844} and parameters: {'num_hidden_layers': 1, 'hidden_dim_0': 32, 'encoder_shared_placement': 'at_start', 'decoder_shared_placement': 'at_start'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:58:08,416] Trial 28 finished with value: {'Total Imputation Error': 270.27435302734375} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'alternating', 'decoder_shared_placement': 'at_start'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:58:19,106] Trial 29 finished with value: {'Total Imputation Error': 256.9208068847656} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 16, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_end'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:58:30,824] Trial 30 finished with value: {'Total Imputation Error': 278.3486633300781} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'random', 'decoder_shared_placement': 'at_end'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:58:41,303] Trial 31 finished with value: {'Total Imputation Error': 318.2501220703125} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 6, 'hidden_dim_1': 6, 'encoder_shared_placement': 'at_start', 'decoder_shared_placement': 'random'}. Best is trial 21 with value: 202.40597534179688.
[I 2026-04-01 09:58:55,522] Trial 32 finished with value: {'Total Imputation Error': 193.67373657226562} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 32 with value: 193.67373657226562.
[I 2026-04-01 09:59:06,457] Trial 33 finished with value: {'Total Imputation Error': 323.4795837402344} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 32 with value: 193.67373657226562.
[I 2026-04-01 09:59:30,637] Trial 34 finished with value: {'Total Imputation Error': 148.54522705078125} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 09:59:41,609] Trial 35 finished with value: {'Total Imputation Error': 252.63011169433594} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 09:59:52,709] Trial 36 finished with value: {'Total Imputation Error': 261.8629150390625} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:03,830] Trial 37 finished with value: {'Total Imputation Error': 402.05194091796875} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 6, 'hidden_dim_1': 16, 'encoder_shared_placement': 'random', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:10,947] Trial 38 finished with value: {'Total Imputation Error': 293.0616760253906} and parameters: {'num_hidden_layers': 1, 'hidden_dim_0': 32, 'encoder_shared_placement': 'alternating', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:19,768] Trial 39 finished with value: {'Total Imputation Error': 280.6060485839844} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 32, 'hidden_dim_1': 16, 'encoder_shared_placement': 'at_start', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:29,228] Trial 40 finished with value: {'Total Imputation Error': 208.51077270507812} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 16, 'hidden_dim_1': 32, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:37,311] Trial 41 finished with value: {'Total Imputation Error': 421.8951416015625} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 6, 'hidden_dim_1': 16, 'encoder_shared_placement': 'random', 'decoder_shared_placement': 'random'}. Best is trial 34 with value: 148.54522705078125.
[I 2026-04-01 10:00:44,687] Trial 42 finished with value: {'Total Imputation Error': 282.4975280761719} and parameters: {'num_hidden_layers': 2, 'hidden_dim_0': 16, 'hidden_dim_1': 32, 'encoder_shared_placement': 'at_end', 'decoder_shared_placement': 'at_start'}. Best is trial 34 with value: 148.54522705078125.

Optimization complete. Best trial: 34 (Total Imputation Error: 148.545227)
Training final model with best parameters...
Final model training complete.

(    feat1     feat2     feat3     feat4     feat5       bf1       bf2  \
  52.0  2.000000  6.000000  4.000000  2.000000  0.603871  0.159842   
  93.0  1.000000  7.000000 -0.439109  2.000000  0.000002  1.000000   
  15.0  4.000000  3.687330  4.000000  2.000000  0.000000  0.000000   
  72.0  4.000000  3.000000  3.249043  1.000000  0.004010  1.000000   
  61.0  2.255670  1.000000  2.768236  1.000000  0.000000  0.018236   
 ..    ...       ...       ...       ...       ...       ...       ...   
 85.0  0.962307  3.369231  6.000000  2.000000  0.000000  0.000000   
 80.0  1.000000  3.238206  7.000000  2.000000  0.000000  0.000018   
 82.0  2.000000  4.000000  6.498518  1.000000  0.000000  0.000099   
 53.0  4.000000  3.000000  7.000000  7.000000  0.000504  0.000000   
 24.0  4.000000  1.000000  3.000000  2.768066  0.000000  1.000000   
 
          bf3           bf4       bf5  
 0.517981  2.188087e-02  0.000000  
 1.000000  1.000000e+00  1.000000  
 1.000000  0.000000e+00  0.966626  
 1.000000  1.000000e+00  0.999973  
 0.761495  1.825549e-01  0.000000  
 ..       ...           ...       ...  
1.000000  8.229457e-10  1.000000  
1.000000  5.805393e-11  1.000000  
1.000000  4.715938e-10  0.994726  
0.000000  0.000000e+00  1.000000  
1.000000  1.000000e+00  0.000000  
 
 [100 rows x 10 columns],
 CISSVAE(input_dim=10, latent_dim=10, latent_shared=True, output_shared=True,num_clusters=3)
 Encoder Layers:
   [0] UNSHARED 10 → 32
   [1] SHARED   32 → 16
 
 Latent Layer:
   SHARED    16 → 10
 
 Decoder Layers:
   [0] SHARED   10 → 16
   [1] UNSHARED 16 → 32
 
 Final Output Layer:
    SHARED  32 → 10,
 <optuna.study.study.Study at 0x2af125bf1c0>,
     trial_number  imputation_error  num_hidden_layers  hidden_dim_0  \
            0        471.357300                  2             6   
            1        281.187469                  2            32   
            2        335.337006                  2             6   
            3        294.416748                  1             6   
            4        208.593140                  2            32   
            5        290.728394                  2            16   
            6        352.566742                  2            16   
            7        316.873352                  2            16   
            8        342.178864                  2             6   
            9        355.935516                  2             6   
          10        322.344818                  1            32   
          11        325.406799                  1            32   
          12        353.754150                  2            32   
          13        376.329926                  2            32   
          14        317.611389                  1            32   
          15        319.031525                  2            32   
          16        384.458893                  2            32   
          17        204.753601                  2            32   
          18        313.878571                  1            32   
          19        248.689560                  2            32   
          20        359.694733                  2            16   
          21        202.405975                  2            32   
          22               NaN                  2            32   
          23        280.912048                  2            32   
          24        213.197067                  2            32   
          25        313.894257                  2            32   
          26        378.035919                  2            32   
          27        291.352142                  1            32   
          28        270.274353                  2            32   
          29        256.920807                  2            16   
          30        278.348663                  2            32   
          31        318.250122                  2             6   
          32        193.673737                  2            32   
          33        323.479584                  2            32   
          34        148.545227                  2            32   
          35        252.630112                  2            32   
          36        261.862915                  2            32   
          37        402.051941                  2             6   
          38        293.061676                  1            32   
          39        280.606049                  2            32   
          40        208.510773                  2            16   
          41        421.895142                  2             6   
          42        282.497528                  2            16   
 
     hidden_dim_1 encoder_shared_placement decoder_shared_placement  \
         32.0                   random                   at_end   
         32.0                 at_start              alternating   
         16.0                   random                 at_start   
          NaN              alternating                   random   
         16.0                   at_end                   at_end   
          6.0                   random                   random   
         16.0                   random              alternating   
         16.0                 at_start                   at_end   
          6.0                 at_start              alternating   
          6.0                 at_start                 at_start   
         NaN                   at_end                   at_end   
         NaN                   at_end              alternating   
        32.0                   at_end                   at_end   
        32.0              alternating              alternating   
         NaN                   at_end                   at_end   
        16.0                 at_start              alternating   
        32.0                   at_end                   random   
        16.0                 at_start                 at_start   
         NaN              alternating                 at_start   
        16.0                   at_end                 at_start   
        16.0                 at_start                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                   at_end   
         NaN                 at_start                 at_start   
        16.0              alternating                 at_start   
        16.0                   at_end                   at_end   
        16.0                   random                   at_end   
         6.0                 at_start                   random   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   at_end                 at_start   
        16.0                   random                 at_start   
         NaN              alternating                 at_start   
        16.0                 at_start                 at_start   
        32.0                   at_end                 at_start   
        16.0                   random                   random   
        32.0                   at_end                 at_start   
 
    layer_order_enc_used layer_order_dec_used  
                 U,S                  U,S  
                 S,U                  S,U  
                 S,U                  S,U  
                   S                    S  
                 U,S                  U,S  
                 S,U                  S,U  
                 U,S                  S,U  
                 S,U                  U,S  
                 S,U                  S,U  
                 S,U                  S,U  
                  S                    S  
                  S                    S  
                U,S                  U,S  
                S,U                  S,U  
                  S                    S  
                S,U                  S,U  
                U,S                  U,S  
                S,U                  S,U  
                  S                    S  
                U,S                  S,U  
                S,U                  S,U  
                U,S                  S,U  
               None                 None  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  U,S  
                  S                    S  
                S,U                  S,U  
                U,S                  U,S  
                S,U                  U,S  
                S,U                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                U,S                  S,U  
                  S                    S  
                S,U                  S,U  
                U,S                  S,U  
                U,S                  U,S  
                U,S                  S,U  )

Handling Categorical Data Columns

When using one-hot encoding to handle categorical data, all dummy variables of a category must be held out together: if one dummy variable from a given category is added to the validation dataset for a row, all dummy variables of that category must be added. To achieve this, create a categorical_column_map dictionary whose keys are the original categorical column names and whose values are lists of the corresponding dummy variable column names. Note: when using categorical_column_map, the binary_feature_mask must also be given.

To illustrate this, we will use a high hold-out proportion of 0.5 (val_proportion = 0.5), which is higher than one would typically use in practice but makes the effect of categorical_column_map easy to see.

import pandas as pd
import numpy as np

np.random.seed(42)

n_rows = 100
prop_mask = 0.30 

X = pd.DataFrame({
    "feat1": np.random.choice(np.arange(1, n_rows + 1), size=n_rows, replace=True),
    "feat2": np.random.choice(np.arange(1, 6), size=n_rows, replace=True),
    "feat3": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    "feat4": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    "feat5": np.random.choice(np.arange(1, 8), size=n_rows, replace=True),
    ## now add some categorical and binary features
    "c11": np.random.binomial(1, 0.25, size=n_rows),
    "c12": np.random.binomial(1, 0.5, size=n_rows),
    "c21": np.random.binomial(1, 0.75, size=n_rows),
    "c22": np.random.binomial(1, 0.33, size=n_rows),
    "b1": np.random.binomial(1, 0.66, size=n_rows),
})

X_raw = X.copy()

## define categorical groups
categorical_groups = {
    "c1": ["c11", "c12"],
    "c2": ["c21", "c22"],
}

grouped_cols = {col for group in categorical_groups.values() for col in group}
independent_cols = [col for col in X.columns[1:] if col not in grouped_cols]  # skip feat1

## mask independent columns
for col in independent_cols:
    idx = np.where(~X[col].isna())[0]
    n_mask = int(np.ceil(len(idx) * prop_mask))
    if n_mask > 0:
        mask_idx = np.random.choice(idx, size=n_mask, replace=False)
        X.loc[mask_idx, col] = np.nan

## mask grouped categoricals
for group_name, cols in categorical_groups.items():
    ## rows eligible only if ALL dummy cols are observed
    group_data = X[cols]
    eligible = group_data.notna().all(axis=1)
    idx = np.where(eligible)[0]

    n_mask = int(np.ceil(len(idx) * prop_mask))
    if n_mask > 0:
        mask_idx = np.random.choice(idx, size=n_mask, replace=False)
        X.loc[mask_idx, cols] = np.nan

print(f"X matrix:\n{X}")


## choosing random clusters for now 
clusters = np.random.choice([1, 2, 3], size=n_rows, replace=True)

binary_feature_mask = [False, False, False, False, False, True, True, True, True, True]

## create categorical column map

ccm = {
    'c1': ['c11', 'c12'],
    'c2': ['c21', 'c22']
}

cd = ClusterDataset(
    X, cluster_labels = clusters, binary_feature_mask=binary_feature_mask, categorical_column_map = ccm, val_proportion = 0.5
)

X matrix:
    feat1  feat2  feat3  feat4  feat5  c11  c12  c21  c22   b1
    52    2.0    6.0    4.0    2.0  NaN  NaN  NaN  NaN  NaN
    93    1.0    7.0    NaN    2.0  0.0  1.0  1.0  1.0  NaN
    15    4.0    NaN    4.0    2.0  0.0  0.0  1.0  0.0  1.0
    72    4.0    3.0    NaN    1.0  1.0  1.0  NaN  NaN  NaN
    61    NaN    1.0    NaN    1.0  NaN  NaN  NaN  NaN  0.0
..    ...    ...    ...    ...    ...  ...  ...  ...  ...  ...
   85    NaN    NaN    6.0    2.0  0.0  0.0  1.0  0.0  1.0
   80    1.0    NaN    7.0    2.0  NaN  NaN  1.0  1.0  1.0
   82    2.0    4.0    NaN    1.0  NaN  NaN  1.0  1.0  1.0
   53    4.0    3.0    7.0    7.0  0.0  0.0  0.0  0.0  NaN
   24    4.0    1.0    3.0    NaN  0.0  1.0  1.0  1.0  0.0

[100 rows x 10 columns]
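A quick sanity check on inputs like these can catch mismatches between the mask and the map early. The helper below is a hypothetical sketch, not part of ciss_vae. It assumes that every dummy column named in the map should be flagged True in binary_feature_mask, as is the case in the example above.

```python
def check_dummy_columns(columns, binary_feature_mask, categorical_column_map):
    """Raise ValueError if a mapped dummy column is missing or not flagged as binary."""
    if len(binary_feature_mask) != len(columns):
        raise ValueError("binary_feature_mask needs one entry per column")
    is_binary = dict(zip(columns, binary_feature_mask))
    for category, dummy_cols in categorical_column_map.items():
        for col in dummy_cols:
            if col not in is_binary:
                raise ValueError(f"{col!r} (from {category!r}) is not a column")
            if not is_binary[col]:
                raise ValueError(f"{col!r} (from {category!r}) is not flagged as binary")
    return True

# Shortened version of the example's column layout.
columns = ["feat1", "c11", "c12", "c21", "c22", "b1"]
mask = [False, True, True, True, True, True]
ccm_demo = {"c1": ["c11", "c12"], "c2": ["c21", "c22"]}

print(check_dummy_columns(columns, mask, ccm_demo))
```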

As we can see from the validation dataset printed below, c11 and c12 are masked together and c21 and c22 are masked together.

# Convert val_data tensor to numpy
val_data_np = cd.val_data.cpu().numpy()

# Reconstruct DataFrame with original column names
val_df = pd.DataFrame(val_data_np, columns=cd.feature_names)

# Print nicely
print(val_df.head(20))

    feat1  feat2  feat3  feat4  feat5  c11  c12  c21  c22   b1
  52.0    2.0    6.0    4.0    2.0  NaN  NaN  NaN  NaN  NaN
  93.0    1.0    7.0    NaN    NaN  0.0  1.0  1.0  1.0  NaN
   NaN    NaN    NaN    4.0    NaN  NaN  NaN  1.0  0.0  1.0
  72.0    NaN    3.0    NaN    NaN  1.0  1.0  NaN  NaN  NaN
   NaN    NaN    NaN    NaN    1.0  NaN  NaN  NaN  NaN  0.0
   NaN    NaN    5.0    NaN    1.0  0.0  1.0  NaN  NaN  NaN
  83.0    1.0    NaN    NaN    NaN  NaN  NaN  NaN  NaN  NaN
   NaN    NaN    NaN    3.0    NaN  NaN  NaN  1.0  1.0  NaN
  75.0    NaN    NaN    NaN    NaN  NaN  NaN  1.0  0.0  0.0
  75.0    1.0    NaN    NaN    5.0  0.0  1.0  1.0  0.0  NaN
 88.0    NaN    2.0    2.0    NaN  NaN  NaN  1.0  0.0  NaN
  NaN    NaN    NaN    6.0    NaN  NaN  NaN  1.0  1.0  NaN
 24.0    1.0    NaN    NaN    NaN  0.0  1.0  NaN  NaN  1.0
  3.0    NaN    6.0    NaN    NaN  0.0  1.0  NaN  NaN  1.0
 22.0    NaN    NaN    6.0    2.0  0.0  1.0  1.0  0.0  NaN
 53.0    3.0    NaN    3.0    1.0  NaN  NaN  NaN  NaN  1.0
  2.0    NaN    NaN    NaN    NaN  0.0  0.0  NaN  NaN  1.0
 88.0    3.0    1.0    NaN    4.0  NaN  NaN  1.0  0.0  NaN
  NaN    NaN    3.0    4.0    NaN  NaN  NaN  NaN  NaN  NaN
 38.0    NaN    NaN    1.0    NaN  NaN  NaN  1.0  1.0  NaN
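Rather than inspecting the rows by eye, the grouping can be verified programmatically. The sketch below checks, for each category, that its dummy columns are either all missing or all present in every row. groups_masked_together is a hypothetical helper and val_demo is a made-up frame; with the tutorial objects you would pass val_df and ccm instead.

```python
import numpy as np
import pandas as pd

def groups_masked_together(df, categorical_column_map):
    """Return {category: bool}, True if its dummy columns are always all-NaN or all-observed."""
    result = {}
    for category, cols in categorical_column_map.items():
        n_missing = df[cols].isna().sum(axis=1)
        result[category] = bool(n_missing.isin([0, len(cols)]).all())
    return result

val_demo = pd.DataFrame({
    "c11": [np.nan, 0.0, 1.0],
    "c12": [np.nan, 1.0, 1.0],
    "c21": [1.0, np.nan, np.nan],
    "c22": [0.0, np.nan, 1.0],  # last row breaks the pairing on purpose
})

check = groups_masked_together(val_demo, {"c1": ["c11", "c12"], "c2": ["c21", "c22"]})
print(check)
```

In this toy frame, c1 passes and c2 fails because of the deliberately inconsistent last row.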