cassiopy.mixture.SkewTMixture

The SkewTMixture class models a mixture of skew-t distributions. It provides methods for fitting the model, predicting cluster labels and posterior probabilities, evaluating clustering quality, and saving or loading a fitted model.

class cassiopy.mixture.SkewTMixture(n_cluster: int, init_method='gmm', parametre=None, n_init_gmm=6)

Bases: object


A mixture model for clustering using the skewed Student’s t-distribution.

Parameters:
n_cluster : int

Number of clusters.

init_method : str, default='gmm'

Initialization method. Options: 'gmm', 'kmeans', 'params'.

parametre : dict, optional

Dictionary of initial parameters, used when init_method='params'.

n_init_gmm : int, default=6

Number of initializations for the GMM.

Notes

For more information, refer to the Skew-t Mixture Models documentation.

Examples

>>> import numpy as np
>>> from cassiopy.mixture import SkewTMixture
>>> X = np.array([[5, 3], [5, 7], [5, 1], [20, 3], [20, 7], [20, 1]])
>>> model = SkewTMixture(n_cluster=2, init_method='kmeans')
>>> model.fit(X, max_iter=100, tol=1e-4)
>>> model.mean
array([[ 5.        ,  1.40735413],
       [20.00000058,  0.66644041]])
>>> model.predict_proba(np.array([[0, 0], [22, 5]]))
array([[0.5, 0.5],
       [0.5, 0.5]])
>>> model.save('model.h5')
>>> model.load('Models_folder/model.h5')
>>> model.predict(np.array([[0, 0], [22, 5]]))
array([0., 0.])

Attributes:
mean : ndarray

Cluster means.

sigma : ndarray

Cluster standard deviations.

dl : ndarray

Degrees of freedom for each cluster.

lamb : ndarray

Skewness parameters for each cluster.

alpha : ndarray

Cluster weights.

tik : ndarray

Posterior probabilities.

logli : list

Log-likelihood values during training.

Methods

ARI(y_true, y_pred)

Compute the Adjusted Rand Index (ARI).

BIC(X)

Calculate the Bayesian Information Criterion (BIC) for the model.

HARTIGAN(X)

Calculate Hartigan's index for the model.

IUS(X[, bins])

Calculate the Shannon Uniformity Index (Indice d'Uniformité de Shannon, IUS) for the model.

KL(X[, bins])

Calculate the Kullback-Leibler divergence for the model.

L2(X[, bins, penalty_weight])

Calculate the L2 distance between the empirical distribution of data points in the uniform cluster and a uniform distribution.

ST(X, w, mu, s, nu, la)

Compute the posterior probabilities for the SkewT mixture model.

chi2(X[, bins])

Calculate the Chi-squared statistic for the model.

confusion_matrix(y_true, y_pred)

Calculate the confusion matrix.

fit(X[, tol, max_iter, verbose])

Fit the SkewT mixture model to the data.

initialisation_gmm(X)

Initialize the model parameters using a Gaussian Mixture Model (GMM).

initialisation_kmeans(X)

Initialize the parameters for the SkewMM algorithm using K-means.

initialisation_random(X)

Initialize the parameters randomly for the SkewMM algorithm.

load(filename)

Load matrices from a given file.

logt_expr(eta2, nu)

Compute the logarithm of the t-distribution expression.

predict(X)

Predict the cluster labels for the data.

predict_proba(X)

Predict the posterior probabilities for the data.

save(filename)

Save the model to a file.

AIC

AIC(X, penalty_weight=0.1)

ARI(y_true, y_pred)

Compute the Adjusted Rand Index (ARI).

Parameters:
y_true : array-like

The true labels.

y_pred : array-like

The predicted labels.

Returns:
ari : float

The Adjusted Rand Index (ARI) score.

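As an illustration of what the score measures, the following is a minimal, dependency-free sketch of the textbook Adjusted Rand Index computed from the contingency table of two labelings. The function name `adjusted_rand_index` is hypothetical and this is not the library's implementation, which may handle edge cases differently.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(y_true, y_pred):
    """Textbook ARI from the contingency table (illustrative sketch only)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two samples: treat as perfect agreement

    # Pair counts within contingency cells, true clusters, predicted clusters.
    cells = Counter(zip(y_true, y_pred))
    sum_cells = sum(comb(c, 2) for c in cells.values())
    sum_rows = sum(comb(c, 2) for c in Counter(y_true).values())
    sum_cols = sum(comb(c, 2) for c in Counter(y_pred).values())

    expected = sum_rows * sum_cols / total_pairs
    max_index = 0.5 * (sum_rows + sum_cols)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. a single cluster on both sides
    return (sum_cells - expected) / (max_index - expected)
```

The score is invariant to a renaming of the cluster labels, so `[0, 0, 1, 1]` and `[1, 1, 0, 0]` agree perfectly.
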
BIC(X)

Calculate the Bayesian Information Criterion (BIC) for the model.

Parameters:
X : array-like

The input data.

Returns:
bic : float

The BIC value.

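For orientation, the textbook definition is BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of samples and L the maximized likelihood. The sketch below applies that formula to made-up numbers; the helper name `bic`, the parameter counts and the log-likelihood values are all assumptions for illustration, and the sign convention or parameter count used by the library may differ.

```python
import math


def bic(log_likelihood, n_params, n_samples):
    """Textbook BIC: lower values indicate a better trade-off (sketch only)."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


# Hypothetical candidates: n_cluster -> (log-likelihood, number of parameters).
candidates = {1: (-520.0, 4), 2: (-450.0, 9), 3: (-447.0, 14)}
n_samples = 200

scores = {k: bic(ll, p, n_samples) for k, (ll, p) in candidates.items()}
best_k = min(scores, key=scores.get)  # smallest BIC wins under this convention
```

With these invented numbers the third cluster improves the likelihood too little to pay for its extra parameters.
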
HARTIGAN(X)

Calculate Hartigan’s index for the model. This index measures the compactness of clusters by summing the squared distances of points from their cluster centers.

Parameters:
X : array-like, shape (n_samples, n_features)

The input data.

Returns:
W : float

The Hartigan’s index value.
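The description above corresponds to a within-cluster sum of squared distances. A minimal sketch, assuming hard cluster labels and explicitly supplied centers (the function name `within_cluster_ss` is hypothetical, and the library may derive labels and centers from the fitted model instead):

```python
def within_cluster_ss(X, labels, centers):
    """Sum of squared Euclidean distances of points to their cluster center."""
    total = 0.0
    for point, label in zip(X, labels):
        center = centers[label]
        total += sum((p - c) ** 2 for p, c in zip(point, center))
    return total


# Data from the class example, split into its two obvious groups.
X = [[5, 3], [5, 7], [5, 1], [20, 3], [20, 7], [20, 1]]
labels = [0, 0, 0, 1, 1, 1]
# Assumption: centers are plain arithmetic means of each group.
centers = [[5.0, 11 / 3], [20.0, 11 / 3]]
W = within_cluster_ss(X, labels, centers)
```

Smaller values indicate more compact clusters; the value always decreases as clusters are added, so it is usually compared across models rather than read in isolation.
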

IUS(X, bins=3)

Calculate the Shannon Uniformity Index (Indice d’Uniformité de Shannon, IUS) for the model. This index measures the uniformity of the distribution of data points across clusters.

Parameters:
X : ndarray of shape (n_samples, n_features)

The input data.

bins : int, default=3

Number of bins for the histogram.

Returns:
ius : float

The IUS value.

KL(X, bins=3)

Calculate the Kullback-Leibler divergence for the model. This index measures the divergence between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:
X : ndarray of shape (n_samples, n_features)

The input data.

bins : int, default=3

Number of bins for the histogram.

Returns:
kl_div : float

The Kullback-Leibler divergence value.
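To make the quantity concrete, here is a small sketch of the Kullback-Leibler divergence between histogram proportions and a uniform distribution over the same bins, KL(p || u) = sum_i p_i ln(p_i * B). It assumes the histogram counts are already available; the function name `kl_to_uniform` is hypothetical, and how the library bins multi-dimensional data is not specified here.

```python
import math


def kl_to_uniform(counts):
    """KL divergence from histogram proportions to the uniform distribution."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    n_bins = len(counts)
    kl = 0.0
    for c in counts:
        if c > 0:  # empty bins contribute zero (0 * log 0 := 0)
            p = c / total
            kl += p * math.log(p * n_bins)
    return kl
```

The result is zero for perfectly even counts and reaches its maximum, ln(B), when all points fall in a single bin.
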

L2(X, bins=10, penalty_weight=0.1)

Calculate the L2 distance between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:
X : ndarray of shape (n_samples, n_features)

The input data.

bins : int, default=10

Number of bins for the histogram.

penalty_weight : float, default=0.1

Weight for the penalty term based on cluster sizes.

Returns:
l2_distance : float

The L2 distance value.
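The distance itself can be sketched for one-dimensional data as follows. This is an assumption-laden illustration: the helper names `histogram` and `l2_to_uniform` are hypothetical, equal-width bins over the data range are assumed, and the penalty term controlled by `penalty_weight` is left out because its exact form is not described above.

```python
import math


def histogram(values, bins):
    """Equal-width histogram counts over [min(values), max(values)]."""
    lo, hi = min(values), max(values)
    counts = [0] * bins
    if hi == lo:  # all values identical: put everything in the first bin
        counts[0] = len(values)
        return counts
    width = (hi - lo) / bins
    for v in values:
        # The maximum value is clamped into the last bin.
        counts[min(int((v - lo) / width), bins - 1)] += 1
    return counts


def l2_to_uniform(values, bins=10):
    """Euclidean distance between bin proportions and the uniform 1/bins."""
    counts = histogram(values, bins)
    n = len(values)
    return math.sqrt(sum((c / n - 1.0 / bins) ** 2 for c in counts))
```

Evenly spread values give a distance of zero, and the distance grows as the points concentrate in fewer bins.
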

ST(X, w, mu, s, nu, la)

Compute the posterior probabilities for the SkewT mixture model.

Parameters:
X : ndarray

Input data.

w : ndarray

Cluster weights.

mu : ndarray

Cluster means.

s : ndarray

Cluster standard deviations.

nu : ndarray

Degrees of freedom for each cluster.

la : ndarray

Skewness parameters for each cluster.

Returns:
Z : ndarray

Posterior probabilities for each cluster.
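In any finite mixture, the posterior probability of cluster k for a sample is its weighted density divided by the sum of the weighted densities over all clusters. The sketch below shows only that normalization step, working in log space for numerical stability. It assumes the per-cluster log-densities have already been evaluated; the skew-t density itself is not reproduced here, and the function name `posterior_probabilities` is hypothetical.

```python
import math


def posterior_probabilities(weights, log_densities):
    """Normalize weighted component densities into posterior probabilities.

    weights[k] is the (strictly positive) weight of cluster k and
    log_densities[i][k] is the log-density of sample i under cluster k.
    """
    posteriors = []
    for row in log_densities:
        log_joint = [math.log(w) + ld for w, ld in zip(weights, row)]
        shift = max(log_joint)  # log-sum-exp trick avoids underflow
        unnormalized = [math.exp(v - shift) for v in log_joint]
        norm = sum(unnormalized)
        posteriors.append([u / norm for u in unnormalized])
    return posteriors
```

When a sample is equally likely under every cluster, the posterior reduces to the cluster weights.
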

chi2(X, bins=3)

Calculate the Chi-squared statistic for the model. This index measures the goodness of fit between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:
X : ndarray of shape (n_samples, n_features)

The input data.

bins : int, default=3

Number of bins for the histogram.

Returns:
chi2_stat : float

The Chi-squared statistic value.

p_value : float

The p-value associated with the Chi-squared statistic.
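As a rough illustration of the statistic, the sketch below computes Pearson's chi-squared value for histogram counts against equal expected counts. It covers the statistic only; the function name `chi2_uniform` is hypothetical, and the binning and degrees of freedom used by the library are not specified here.

```python
def chi2_uniform(counts):
    """Pearson chi-squared statistic of counts against a uniform expectation."""
    total = sum(counts)
    if total == 0:
        raise ValueError("histogram is empty")
    expected = total / len(counts)  # same expected count in every bin
    return sum((c - expected) ** 2 / expected for c in counts)
```

A p-value can then be obtained from the chi-squared distribution, for example with `scipy.stats.chisquare`, which returns both the statistic and the p-value for a set of observed counts.
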

confusion_matrix(y_true, y_pred)

Calculate the confusion matrix.

Parameters:
y_true : array-like

The true labels.

y_pred : array-like

The predicted labels.

Returns:
matrix : array-like

The confusion matrix. The last cluster corresponds to the uniform cluster.
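A confusion matrix simply counts how often each true label is paired with each predicted label. A minimal sketch, assuming integer labels starting at 0 (the function name `count_confusion` is hypothetical, and the library's row and column ordering may differ):

```python
def count_confusion(y_true, y_pred):
    """Return matrix[i][j] = number of samples with true label i predicted as j."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    size = max(max(y_true), max(y_pred)) + 1
    matrix = [[0] * size for _ in range(size)]
    for t, p in zip(y_true, y_pred):
        matrix[t][p] += 1
    return matrix
```

Because cluster indices are arbitrary, a good clustering may show its counts off the diagonal; rows and columns usually need to be matched before the matrix is read as an accuracy table.
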

fit(X, tol=1e-06, max_iter=200, verbose=0)

Fit the SkewT mixture model to the data.

Parameters:
X : array-like, shape (n_samples, n_features)

The input data.

tol : float, default=1e-6

Tolerance for convergence.

max_iter : int, default=200

Maximum number of iterations for the EM algorithm.

verbose : int, default=0

Verbosity level.
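To show how `tol` and `max_iter` typically interact in an EM loop, here is a small sketch that uses a one-dimensional Gaussian mixture in place of the skew-t mixture. The Gaussian model is a deliberate simplification, the stopping rule (change in log-likelihood below `tol`) is an assumption about how convergence is measured, and the function names are hypothetical.

```python
import math


def normal_pdf(x, mu, sigma):
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))


def fit_em(x, mu, sigma, weights, tol=1e-6, max_iter=200):
    """EM for a 1-D Gaussian mixture; returns parameters and log-likelihood trace."""
    mu, sigma, weights = list(mu), list(sigma), list(weights)
    k, n = len(mu), len(x)
    trace = []
    for _ in range(max_iter):
        # E-step: posterior probabilities and log-likelihood at current parameters.
        resp, loglik = [], 0.0
        for xi in x:
            joint = [weights[j] * normal_pdf(xi, mu[j], sigma[j]) for j in range(k)]
            total = sum(joint)
            loglik += math.log(total)
            resp.append([v / total for v in joint])
        trace.append(loglik)
        # Stop once the log-likelihood improvement falls below tol.
        if len(trace) > 1 and abs(trace[-1] - trace[-2]) < tol:
            break
        # M-step: re-estimate weights, means and standard deviations.
        for j in range(k):
            nj = sum(r[j] for r in resp)
            weights[j] = nj / n
            mu[j] = sum(r[j] * xi for r, xi in zip(resp, x)) / nj
            var = sum(r[j] * (xi - mu[j]) ** 2 for r, xi in zip(resp, x)) / nj
            sigma[j] = math.sqrt(max(var, 1e-12))  # guard against collapse
    return weights, mu, sigma, trace


data = [4.0, 5.0, 6.0, 19.0, 20.0, 21.0]
weights, means, sigmas, trace = fit_em(data, [0.0, 25.0], [5.0, 5.0], [0.5, 0.5])
```

A looser `tol` stops earlier at the cost of precision, while `max_iter` caps the run time when the log-likelihood keeps creeping upward.
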

initialisation_gmm(X)

Initialize the model parameters using a Gaussian Mixture Model (GMM).

Parameters:
X : array-like of shape (n_samples, n_features)

Input data matrix.

Returns:
w : ndarray of shape (n_clusters,)

Cluster weights.

mu : ndarray of shape (n_clusters, n_features)

Cluster means.

s : ndarray of shape (n_clusters, n_features)

Cluster standard deviations.

nu : ndarray of shape (n_clusters, n_features)

Degrees of freedom for each cluster.

la : ndarray of shape (n_clusters, n_features)

Skewness parameters for each cluster.

initialisation_kmeans(X)

Initialize the parameters for the SkewMM algorithm using K-means.

Parameters:
X : array-like of shape (n_samples, n_features)

The input data matrix.

default_n_init : int, default='auto'

The number of times the K-means algorithm will be run with different centroid seeds.

Returns:
dict

A dictionary containing the initialized parameters:

w : ndarray of shape (n_clusters,)

Cluster weights.

mu : ndarray of shape (n_clusters, n_features)

Cluster means.

s : ndarray of shape (n_clusters, n_features)

Cluster standard deviations.

nu : ndarray of shape (n_clusters, n_features)

Degrees of freedom for each cluster.

la : ndarray of shape (n_clusters, n_features)

Skewness parameters for each cluster.
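One plausible way to turn a hard K-means partition into starting values is to take each cluster's share of the samples as its weight and its per-feature mean and standard deviation as location and scale. The sketch below does that for w, mu and s only; the starting values for nu and la are not described above, so they are left out. The function name `params_from_labels` is hypothetical, and the labels are assumed to come from any K-means implementation (for example scikit-learn's `KMeans`).

```python
import math


def params_from_labels(X, labels, n_clusters):
    """Derive weights, means and standard deviations from hard cluster labels."""
    n = len(X)
    w, mu, s = [], [], []
    for k in range(n_clusters):
        members = [x for x, label in zip(X, labels) if label == k]
        if not members:
            raise ValueError(f"cluster {k} is empty")
        columns = list(zip(*members))  # one tuple per feature
        mean = [sum(col) / len(members) for col in columns]
        std = [
            math.sqrt(sum((v - m) ** 2 for v in col) / len(members))
            for col, m in zip(columns, mean)
        ]
        w.append(len(members) / n)
        mu.append(mean)
        s.append(std)
    return {"w": w, "mu": mu, "s": s}


X = [[5, 3], [5, 7], [5, 1], [20, 3], [20, 7], [20, 1]]
init = params_from_labels(X, [0, 0, 0, 1, 1, 1], n_clusters=2)
```

Note that a feature with no spread inside a cluster, such as the first column here, yields a standard deviation of zero, which a real initializer would need to guard against.
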

initialisation_random(X)

Initialize the parameters randomly for the SkewMM algorithm.

Parameters:
X : array-like of shape (n_samples, n_features)

Input data matrix.

Returns:
w : ndarray of shape (n_clusters,)

Cluster weights.

mu : ndarray of shape (n_clusters, n_features)

Cluster means.

s : ndarray of shape (n_clusters, n_features)

Cluster standard deviations.

nu : ndarray of shape (n_clusters, n_features)

Degrees of freedom for each cluster.

la : ndarray of shape (n_clusters, n_features)

Skewness parameters for each cluster.

load(filename: str)

Load matrices from a given file.

Parameters:
filename : str

The path to the file containing the matrices.

logt_expr(eta2, nu)

Compute the logarithm of the t-distribution expression.

Parameters:
eta2 : ndarray

Squared standardized residuals.

nu : ndarray

Degrees of freedom.

Returns:
logt_val : ndarray

Logarithm of the t-distribution expression.
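For reference, the log-density of a standard Student's t-distribution can be written in terms of the squared standardized residual eta2 and the degrees of freedom nu. The scalar sketch below evaluates that standard formula; whether the library's expression includes the same normalizing constants is an assumption, and the function name `log_t_density` is hypothetical.

```python
import math


def log_t_density(eta2, nu):
    """Log-density of a standard Student's t at a point with squared residual eta2."""
    return (
        math.lgamma((nu + 1.0) / 2.0)
        - math.lgamma(nu / 2.0)
        - 0.5 * math.log(math.pi * nu)
        - 0.5 * (nu + 1.0) * math.log1p(eta2 / nu)
    )
```

With nu = 1 this reduces to the Cauchy distribution, which gives a convenient sanity check: the density at the center is 1/pi.
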

predict(X)

Predict the cluster labels for the data.

Parameters:
X : array-like

The input data.

Returns:
labels : array-like

The predicted cluster labels.

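Hard labels are commonly obtained by assigning each sample to the cluster with the highest posterior probability. A minimal sketch of that rule, assuming the posterior matrix has one row per sample and one column per cluster (the function name `labels_from_proba` is hypothetical, and the library's tie-breaking may differ):

```python
def labels_from_proba(proba):
    """Return the index of the most probable cluster for each row."""
    # Ties are resolved in favor of the lowest cluster index.
    return [max(range(len(row)), key=row.__getitem__) for row in proba]
```

For a sample with posterior `[0.5, 0.5]`, this rule returns cluster 0.
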
predict_proba(X)

Predict the posterior probabilities for the data.

Parameters:
X : array-like, shape (n_samples, n_features)

The input data.

Returns:
proba : ndarray, shape (n_samples, n_clusters)

The posterior probabilities for each cluster.

save(filename: str)

Save the model to a file.

Parameters:
filename : str

The name of the file.