ciss_vae.utils.clustering.cluster_on_missing
- cluster_on_missing(data, cols_ignore=None, n_clusters=None, k_neighbors=15, use_snn=True, leiden_resolution=0.5, leiden_objective='CPM', seed=42)
Cluster samples based on their missingness patterns using KMeans or Leiden.
When n_clusters is None, performs Leiden clustering on a graph constructed from the binary missingness mask of the dataset. If use_snn=True, builds a shared-nearest-neighbor (SNN) graph using Jaccard similarity; otherwise, constructs a standard kNN graph with Jaccard weights. Returns both the cluster labels and an optional silhouette score.
- Parameters:
data (pandas.DataFrame) – Input dataset with potential missing values, shape (n_samples, n_features). Non-numeric columns should be excluded or specified in cols_ignore.
cols_ignore (list[str] or None, optional) – Column names to exclude from the missingness pattern clustering. Typically includes identifiers or static metadata columns. Defaults to None.
n_clusters (int or None, optional) – Number of clusters for KMeans. If None, uses Leiden clustering on the binary missingness mask instead. Defaults to None.
k_neighbors (int, optional) – Number of nearest neighbors used when constructing the kNN/SNN graph for Leiden clustering. Defaults to 15.
use_snn (bool, optional) – If True, constructs a shared-nearest-neighbor (SNN) graph using mutual neighbor overlap weighted by Jaccard similarity. If False, uses a standard kNN graph weighted by Jaccard distance. Defaults to True.
leiden_resolution (float, optional) – Resolution parameter for Leiden clustering; higher values yield more clusters. Defaults to 0.5.
leiden_objective (str, optional) – Objective function for Leiden optimization. One of {"CPM", "RB", "Modularity"}. Defaults to "CPM".
seed (int, optional) – Random seed for reproducibility in KMeans and Leiden algorithms. Defaults to 42.
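The graph construction described by k_neighbors and use_snn can be illustrated with a minimal standard-library sketch. It is not the library's implementation: the helper names (missingness_mask, jaccard_similarity, knn, snn_weight), the tie-breaking rule, and the convention that two rows with no missing values count as identical are all assumptions made for illustration.

```python
import math

def missingness_mask(rows):
    """Binary mask: 1 where a value is missing (None or NaN), else 0."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))
    return [[1 if is_missing(v) else 0 for v in row] for row in rows]

def jaccard_similarity(a, b):
    """Jaccard similarity of two binary vectors.

    Assumption: two all-zero vectors (no missing values) are treated as identical.
    """
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 1.0 if union == 0 else inter / union

def knn(mask, k):
    """For each row, the set of indices of its k most similar other rows."""
    neighbors = []
    for i, row in enumerate(mask):
        sims = [(jaccard_similarity(row, other), j)
                for j, other in enumerate(mask) if j != i]
        # Highest similarity first; ties broken by lower index (an assumption).
        sims.sort(key=lambda t: (-t[0], t[1]))
        neighbors.append({j for _, j in sims[:k]})
    return neighbors

def snn_weight(neighbors, i, j):
    """Hypothetical SNN edge weight: Jaccard overlap of the two neighbor sets,
    with each sample counted as a member of its own neighborhood."""
    a = neighbors[i] | {i}
    b = neighbors[j] | {j}
    return len(a & b) / len(a | b)

# Four samples with two distinct missingness patterns.
rows = [
    [1.0, None, 3.0, None],
    [2.0, None, 1.0, None],
    [None, 5.0, None, 2.0],
    [None, 4.0, None, 7.0],
]
mask = missingness_mask(rows)
neighbors = knn(mask, k=1)
same_pattern = snn_weight(neighbors, 0, 1)   # samples sharing a pattern
diff_pattern = snn_weight(neighbors, 0, 2)   # samples with opposite patterns
```

In this toy input, samples 0 and 1 share every neighbor and get the maximum SNN weight, while samples 0 and 2 share none, which is the separation a community-detection step such as Leiden would then exploit.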
- Returns:
Tuple (labels, silhouette):
  - labels (numpy.ndarray): Cluster assignments of length n_samples.
  - silhouette (float or None): Silhouette score computed using Jaccard distance on the binary missingness mask; None if undefined.
- Return type:
tuple[numpy.ndarray, float or None]
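To make the silhouette return value concrete, here is a small pure-Python sketch of a mean silhouette computed from Jaccard distances between rows of a binary missingness mask. The functions jaccard_distance and silhouette are hypothetical helpers, not part of ciss_vae, and the conditions under which None is returned (fewer than two clusters, or as many clusters as samples) are an assumption about what "undefined" means.

```python
def jaccard_distance(a, b):
    """Jaccard distance between two binary vectors (0.0 if both are all zeros)."""
    inter = sum(1 for x, y in zip(a, b) if x and y)
    union = sum(1 for x, y in zip(a, b) if x or y)
    return 0.0 if union == 0 else 1.0 - inter / union

def silhouette(mask, labels):
    """Mean silhouette over all samples, or None when it is undefined.

    Assumption: undefined means fewer than 2 clusters, or one cluster per sample.
    """
    clusters = sorted(set(labels))
    n = len(mask)
    if len(clusters) < 2 or len(clusters) >= n:
        return None
    scores = []
    for i in range(n):
        # Mean distance from sample i to every cluster, excluding i itself.
        mean_dist = {}
        for c in clusters:
            members = [j for j in range(n) if labels[j] == c and j != i]
            if members:
                total = sum(jaccard_distance(mask[i], mask[j]) for j in members)
                mean_dist[c] = total / len(members)
        own = labels[i]
        if own not in mean_dist:
            # Singleton cluster: silhouette is conventionally 0.
            scores.append(0.0)
            continue
        a = mean_dist[own]                                      # cohesion
        b = min(d for c, d in mean_dist.items() if c != own)    # separation
        denom = max(a, b)
        scores.append(0.0 if denom == 0 else (b - a) / denom)
    return sum(scores) / n

mask = [[0, 1, 0, 1], [0, 1, 0, 1], [1, 0, 1, 0], [1, 0, 1, 0]]
well_separated = silhouette(mask, [0, 0, 1, 1])
single_cluster = silhouette(mask, [0, 0, 0, 0])
```

With two perfectly separated missingness patterns the score reaches its maximum, and with a single cluster the sketch returns None, mirroring the optional return value described above.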
Example:
>>> import numpy as np
>>> from ciss_vae.utils.clustering import cluster_on_missing
>>> labels, silh = cluster_on_missing(data, n_clusters=None, use_snn=True)
>>> np.unique(labels)
array([0, 1, 2])
>>> print(f"Silhouette: {silh:.3f}")
Silhouette: 0.408
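Because the silhouette can be None, formatting it directly with a float format specifier would raise a TypeError in that case. A small guard, sketched here with a hypothetical helper name (format_silhouette is not part of the library), avoids that:

```python
def format_silhouette(silh):
    """Format a silhouette score that may be None."""
    if silh is None:
        return "Silhouette: undefined"
    return f"Silhouette: {silh:.3f}"

defined = format_silhouette(0.408)
undefined = format_silhouette(None)
```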