ciss_vae.utils.clustering.cluster_on_missing

cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)

Cluster samples based on their missingness patterns using KMeans or Leiden.

When n_clusters is an integer, runs KMeans with that many clusters. When n_clusters is None, performs Leiden clustering on a graph constructed from the binary missingness mask of the dataset. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph using Jaccard similarity; otherwise, constructs a standard kNN graph with Jaccard weights. Returns the cluster labels together with a silhouette score, which is None when the score is undefined.
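The binary missingness mask and the Jaccard comparison between samples can be illustrated with a minimal sketch. The helper names missingness_mask, jaccard_distance_matrix and knn_indices are hypothetical and are not part of ciss_vae; the library's internal implementation may differ, and treating two fully observed rows as identical is an assumption made here.

```python
import numpy as np
import pandas as pd

def missingness_mask(df, cols_ignore=None):
    """Boolean array that is True wherever a value is missing (hypothetical helper)."""
    ignored = set(cols_ignore or [])
    cols = [c for c in df.columns if c not in ignored]
    return df[cols].isna().to_numpy()

def jaccard_distance_matrix(mask):
    """Pairwise Jaccard distance between the boolean rows of the mask."""
    m = mask.astype(np.int64)
    inter = m @ m.T                      # shared missing columns per pair
    counts = m.sum(axis=1)               # missing columns per row
    union = counts[:, None] + counts[None, :] - inter
    # Assumption: two rows with no missing values at all count as identical.
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim

def knn_indices(dist, k):
    """Indices of the k nearest other rows for each row."""
    d = dist.copy()
    np.fill_diagonal(d, np.inf)          # exclude each row from its own neighbors
    return np.argsort(d, axis=1, kind="stable")[:, :k]

df = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "a": [np.nan, np.nan, 1.0, 2.0],
    "b": [np.nan, 0.5, 1.0, np.nan],
    "c": [3.0, 4.0, np.nan, np.nan],
})
mask = missingness_mask(df, cols_ignore=["id"])
dist = jaccard_distance_matrix(mask)
neighbors = knn_indices(dist, k=1)
```

A kNN or SNN graph for Leiden would then be built from neighbor lists like these; that step is omitted here because it depends on the graph library in use.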

Parameters:

data (pandas.DataFrame) – Input dataset with potential missing values, shape (n_samples, n_features). Non-numeric columns should be excluded or specified in cols_ignore.
cols_ignore (list[str] or None, optional) – Column names to exclude from the missingness pattern clustering. Typically includes identifiers or static metadata columns. Defaults to None.
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering on the binary missingness mask instead. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors used when constructing the kNN/SNN graph for Leiden clustering. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual neighbor overlap weighted by Jaccard similarity. If False, uses a standard kNN graph with edges weighted by Jaccard distance. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering; higher values yield more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility in KMeans and Leiden algorithms. Defaults to 42.
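
As a rough illustration of the KMeans path selected by n_clusters, the sketch below fits scikit-learn's KMeans directly on a 0/1 missingness mask. That the function feeds the mask to KMeans in exactly this way is an assumption; the toy DataFrame and its column names are invented for the example.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy data: rows 0-2 are missing w and x, rows 3-5 are missing y and z.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(6, 4)), columns=["w", "x", "y", "z"])
df.loc[0:2, ["w", "x"]] = np.nan
df.loc[3:5, ["y", "z"]] = np.nan

# Cluster on the missingness pattern only, not on the observed values.
mask = df.isna().to_numpy().astype(float)
km = KMeans(n_clusters=2, n_init=10, random_state=42)
labels = km.fit_predict(mask)
```

Rows that share a missingness pattern end up in the same cluster; which cluster receives label 0 is arbitrary.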

Returns:

Tuple (labels, silhouette):

labels (numpy.ndarray) – Cluster assignments of length n_samples.
silhouette (float or None) – Silhouette score computed using Jaccard distance on the binary missingness mask; None if undefined.
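
A silhouette score over Jaccard distances can be sketched with scikit-learn as follows. The helper jaccard_silhouette is hypothetical, and the rule used here for "undefined" (fewer than two clusters, or as many clusters as samples) is an assumption based on when scikit-learn's silhouette_score refuses to compute a value.

```python
import numpy as np
from sklearn.metrics import pairwise_distances, silhouette_score

def jaccard_silhouette(mask, labels):
    """Silhouette score on a boolean mask using Jaccard distance, or None."""
    n_labels = len(np.unique(labels))
    # The silhouette needs at least 2 clusters and fewer clusters than samples.
    if n_labels < 2 or n_labels >= len(labels):
        return None
    dist = pairwise_distances(mask.astype(bool), metric="jaccard")
    return float(silhouette_score(dist, labels, metric="precomputed"))

# Two groups of samples with disjoint missingness patterns.
mask = np.array([
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 1],
], dtype=bool)
labels = np.array([0, 0, 0, 1, 1, 1])

score = jaccard_silhouette(mask, labels)
single = jaccard_silhouette(mask, np.zeros(6, dtype=int))  # one cluster only
```

Because the score can be None, callers may want to check for that before formatting it as a float.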

Return type:

tuple[numpy.ndarray, float or None]

Example:

>>> import numpy as np
>>> from ciss_vae.utils.clustering import cluster_on_missing
>>> labels, silh = cluster_on_missing(data, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.408