KMeans

class pycave.clustering.KMeans(num_clusters=1, *, init_strategy='kmeans++', convergence_tolerance=0.0001, batch_size=None, trainer_params=None)[source]

Bases: lightkit.estimator.configurable.ConfigurableBaseEstimator[pycave.clustering.kmeans.model.KMeansModel], lightkit.estimator.mixins.TransformerMixin[Union[numpy.ndarray, torch.Tensor], torch.Tensor], lightkit.estimator.mixins.PredictorMixin[Union[numpy.ndarray, torch.Tensor], torch.Tensor]

Model for clustering data into a predefined number of clusters. More information on K-means clustering is available on Wikipedia.

See also

KMeansModel

PyTorch module for the K-Means model.

KMeansModelConfig

Configuration class for a K-Means model.

Parameters
  • num_clusters (int) -- The number of clusters.

  • init_strategy (Literal['random', 'kmeans++']) -- The strategy for initializing centroids.

  • convergence_tolerance (float) -- Training runs until the Frobenius norm of the change in the cluster centroids between consecutive iterations falls below this threshold. Before the comparison, the tolerance is multiplied by the average variance of the features.

  • batch_size (Optional[int]) -- The batch size to use when fitting the model. If not provided, the full data will be used as a single batch. Set this if the full data does not fit into memory.

  • trainer_params (Optional[Dict[str, Any]]) --

    Parameters to pass to the PyTorch Lightning trainer when it is initialized. By default, the estimator disables various stdout logs unless PyCave is configured to do verbose logging. Checkpointing and logging are disabled regardless of the log level. This estimator further sets the following overridable defaults:

    • max_epochs=300

Note

The number of epochs passed to the initializer only defines the number of optimization epochs. Prior to that, initialization is run, which may perform additional iterations through the data.
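
As a rough illustration of the convergence_tolerance check described under Parameters, the following plain-Python sketch compares the Frobenius norm of the centroid shift against the tolerance multiplied by the average feature variance. It is not PyCave's implementation: the helper name has_converged and the use of the population variance are assumptions made for this example.

```python
import math
from statistics import pvariance


def has_converged(old_centroids, new_centroids, data, convergence_tolerance=1e-4):
    """Hypothetical helper: True if the centroids moved less than the scaled tolerance."""
    # Frobenius norm of the difference between the two centroid matrices.
    shift = math.sqrt(
        sum(
            (new_value - old_value) ** 2
            for old_row, new_row in zip(old_centroids, new_centroids)
            for old_value, new_value in zip(old_row, new_row)
        )
    )
    # Average of the per-feature variances (population variance is an assumption).
    feature_variances = [pvariance(column) for column in zip(*data)]
    average_variance = sum(feature_variances) / len(feature_variances)
    # The tolerance is multiplied by the average variance before comparing.
    return shift < convergence_tolerance * average_variance
```

Scaling the tolerance by the feature variance makes the stopping rule less sensitive to the units of the data: rescaling all features also rescales the threshold.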

Methods

fit

Fits the KMeans model on the provided data by running Lloyd's algorithm.

predict

Predicts the closest cluster for each item provided.

score

Computes the average inertia of all the provided datapoints.

score_samples

Computes the inertia for each of the provided datapoints.

transform

Transforms the provided data into the cluster-distance space.
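
To make the method summaries above more concrete, the sketch below shows one iteration of Lloyd's algorithm together with nearest-centroid prediction and a distance-based transform, written in plain Python. The function names (centroid_distances, nearest_centroid, lloyd_step) are hypothetical, and both the use of Euclidean rather than squared distances and the handling of empty clusters are assumptions; PyCave's own methods operate on NumPy arrays or PyTorch tensors and may differ in detail.

```python
import math


def centroid_distances(data, centroids):
    """Map each point to its Euclidean distances to all centroids."""
    return [[math.dist(point, centroid) for centroid in centroids] for point in data]


def nearest_centroid(data, centroids):
    """Return, for each point, the index of the closest centroid."""
    return [
        min(range(len(row)), key=row.__getitem__)
        for row in centroid_distances(data, centroids)
    ]


def lloyd_step(data, centroids):
    """One Lloyd iteration: assign points to centroids, then recompute the means."""
    labels = nearest_centroid(data, centroids)
    updated = []
    for index, old_centroid in enumerate(centroids):
        members = [point for point, label in zip(data, labels) if label == index]
        if not members:
            # Assumption: a cluster without members keeps its previous centroid.
            updated.append(list(old_centroid))
            continue
        updated.append([sum(column) / len(members) for column in zip(*members)])
    return updated
```

Repeating lloyd_step until the centroids stop moving is the essence of fitting; the transform output has one column per cluster, which is why it can be read as a "cluster-distance space".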

Inherited Methods

clone

Clones the estimator without copying any fitted attributes.

fit_predict

Fits the estimator using the provided data and subsequently predicts the labels for the data using the fitted estimator.

fit_transform

Fits the estimator using the provided data and subsequently transforms the data using the fitted estimator.

get_params

Returns the estimator's parameters as passed to the initializer.

load

Loads the estimator and (if available) the fitted model.

load_attributes

Loads the fitted attributes that are stored at the fitted path.

load_parameters

Initializes this estimator by loading its parameters.

save

Saves the estimator to the provided directory.

save_attributes

Saves the fitted attributes of this estimator.

save_parameters

Saves the parameters of this estimator.

set_params

Sets the provided values on the estimator.

trainer

Returns the trainer as configured by the estimator.

Attributes

persistent_attributes

Returns the list of fitted attributes that ought to be saved and loaded.

model_

The fitted PyTorch module with all estimated parameters.

converged_

A boolean indicating whether the model converged during training.

num_iter_

The number of iterations the model was fitted for, excluding initialization.

inertia_

The mean squared distance of all datapoints to their closest cluster centers.
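
The relationship between the per-sample inertia and the mean inertia can be sketched in a few lines of plain Python. This is an illustration of the definitions given on this page, not PyCave's code; the function names per_sample_inertia and mean_inertia are hypothetical.

```python
def per_sample_inertia(data, centroids):
    """Squared Euclidean distance of each point to its closest centroid."""
    return [
        min(
            sum((value - center) ** 2 for value, center in zip(point, centroid))
            for centroid in centroids
        )
        for point in data
    ]


def mean_inertia(data, centroids):
    """Average of the per-sample inertia over all points."""
    values = per_sample_inertia(data, centroids)
    return sum(values) / len(values)
```

Under these definitions, a lower mean inertia indicates that points lie closer to their assigned centroids.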