cassiopy.mixture.SkewTMixture

The SkewTMixture class models a mixture of skew-t distributions. It provides methods for fitting the model, predicting cluster labels and posterior probabilities, and computing model-selection and clustering-quality criteria.

class cassiopy.mixture.SkewTMixture(n_cluster: int, init_method='gmm', parametre=None, n_init_gmm=6)

Bases: object

A mixture model for clustering using the skewed Student’s t-distribution.

Parameters:

- n_cluster (int): Number of clusters.
- init_method (str, default='gmm'): Initialization method. Options: 'gmm', 'kmeans', 'params'.
- parametre (dict, optional): Dictionary of initial parameters, used when init_method='params'.
- n_init_gmm (int, default=6): Number of initializations for the GMM.

Notes

For more information, refer to the Skew-t Mixture Models documentation.

Examples

>>> import numpy as np
>>> from cassiopy.mixture import SkewTMixture
>>> X = np.array([[5, 3], [5, 7], [5, 1], [20, 3], [20, 7], [20, 1]])
>>> model = SkewTMixture(n_cluster=2, init_method='kmeans')
>>> model.fit(X, max_iter=100, tol=1e-4)
>>> model.mean
array([[ 5.        ,  1.40735413],
       [20.00000058,  0.66644041]])
>>> model.predict_proba(np.array([[0, 0], [22, 5]]))
array([[0.5, 0.5],
       [0.5, 0.5]])
>>> model.save('model.h5')
>>> model.load('Models_folder/model.h5')
>>> model.predict(np.array([[0, 0], [22, 5]]))
array([0., 0.])

Attributes:

mean (ndarray): Cluster means.
sigma (ndarray): Cluster standard deviations.
dl (ndarray): Degrees of freedom for each cluster.
lamb (ndarray): Skewness parameters for each cluster.
alpha (ndarray): Cluster weights.
tik (ndarray): Posterior probabilities.
logli (list): Log-likelihood values during training.

Methods

`ARI`(y_true, y_pred)	Compute the Adjusted Rand Index (ARI).
`BIC`(X)	Calculate the Bayesian Information Criterion (BIC) for the model.
`HARTIGAN`(X)	Calculate Hartigan's index for the model.
`IUS`(X[, bins])	Calculate the Shannon Uniformity Index (Indice d'Uniformité de Shannon, IUS) for the model.
`KL`(X[, bins])	Calculate the Kullback-Leibler divergence for the model.
`L2`(X[, bins, penalty_weight])	Calculate the L2 distance between the empirical distribution of data points in the uniform cluster and a uniform distribution.
`ST`(X, w, mu, s, nu, la)	Compute the posterior probabilities for the SkewT mixture model.
`chi2`(X[, bins])	Calculate the Chi-squared statistic for the model.
`confusion_matrix`(y_true, y_pred)	Calculate the confusion matrix.
`fit`(X[, tol, max_iter, verbose])	Fit the SkewT mixture model to the data.
`initialisation_gmm`(X)	Initialize the parameters using a Gaussian Mixture Model (GMM).
`initialisation_kmeans`(X)	Initialize the parameters for the SkewMM algorithm using the K-means initialization method.
`initialisation_random`(X)	Initialize the parameters randomly for the SkewMM algorithm.
`load`(filename)	Load matrices from a given file.
`logt_expr`(eta2, nu)	Compute the logarithm of the t-distribution expression.
`predict`(X)	Predict the cluster labels for the data.
`predict_proba`(X)	Predict the posterior probabilities for the data.
`save`(filename)	Save the model to a file.

AIC(X, penalty_weight=0.1)

ARI(y_true, y_pred)

Compute the Adjusted Rand Index (ARI).

Parameters:

- y_true (array-like): The true labels.
- y_pred (array-like): The predicted labels.

Returns:

ari (float): The Adjusted Rand Index (ARI) score.
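
To illustrate what the score measures, the following pure-Python sketch computes the Adjusted Rand Index from pair counts. The function name `adjusted_rand_index` is hypothetical, and this is not necessarily how `ARI` is implemented in cassiopy.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(y_true, y_pred):
    """Adjusted Rand Index computed from pair counts (illustrative sketch)."""
    n = len(y_true)
    if n < 2:
        # No pairs to compare; treat as perfect agreement by convention.
        return 1.0

    # Sizes of each (true label, predicted label) cell and of each marginal.
    contingency = Counter(zip(y_true, y_pred))
    true_sizes = Counter(y_true)
    pred_sizes = Counter(y_pred)

    # Number of sample pairs that fall in the same cell / same group.
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in true_sizes.values())
    sum_pred = sum(comb(c, 2) for c in pred_sizes.values())

    expected = sum_true * sum_pred / comb(n, 2)
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        # Degenerate case, e.g. a single cluster on both sides.
        return 1.0
    return (sum_cells - expected) / (max_index - expected)


# The score ignores label names: swapping 0 and 1 is still a perfect match.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# Splitting one true cluster in two lowers the score.
partial = adjusted_rand_index([0, 0, 1, 1], [0, 0, 1, 2])
```

A value of 1 means identical partitions, and values near 0 mean agreement no better than chance.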

BIC(X)

Calculate the Bayesian Information Criterion (BIC) for the model.

Parameters:

- X (array-like): The input data.

Returns:

bic (float): The BIC value.
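
A minimal sketch of the standard BIC formula, k·ln(n) − 2·ln(L), is given below. The helper names, the sign convention (lower is better), and the free-parameter count are assumptions. The count is derived only from the parameter shapes documented on this page (means, standard deviations, degrees of freedom, and skewness of shape (n_clusters, n_features), plus the weights), and cassiopy may count parameters differently.

```python
from math import log


def skew_t_free_parameters(n_clusters, n_features):
    """Assumed count: four (n_clusters, n_features) arrays plus free weights."""
    per_feature = 4 * n_clusters * n_features  # mu, s, nu, la
    free_weights = n_clusters - 1              # weights sum to 1
    return per_feature + free_weights


def bic(log_likelihood, n_params, n_samples):
    """Standard BIC: penalise the log-likelihood by model size."""
    return n_params * log(n_samples) - 2.0 * log_likelihood


# Hypothetical numbers: 2 clusters, 2 features, 6 samples, log-likelihood -20.
k = skew_t_free_parameters(n_clusters=2, n_features=2)
value = bic(log_likelihood=-20.0, n_params=k, n_samples=6)
```

Under this convention, comparing `value` across candidate numbers of clusters favours the model with the smallest BIC.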

HARTIGAN(X)

Calculate Hartigan's index for the model. This index measures the compactness of clusters by summing the squared distances of points from their cluster centers.

Parameters:

- X (array-like, shape (n_samples, n_features)): The input data.

Returns:

W (float): The Hartigan's index value.
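
The quantity described above, a sum of squared distances of points from their cluster centers, can be sketched as follows. The function name, the explicit `labels` and `centers` arguments, and the absence of any scaling are assumptions made for illustration; the library method takes only `X` and may transform the sum further.

```python
def within_cluster_sum_of_squares(X, labels, centers):
    """Sum of squared Euclidean distances from each point to its center."""
    total = 0.0
    for point, label in zip(X, labels):
        center = centers[label]
        total += sum((x - c) ** 2 for x, c in zip(point, center))
    return total


# Data from the class example, split into its two obvious groups.
X = [[5, 3], [5, 7], [5, 1], [20, 3], [20, 7], [20, 1]]
labels = [0, 0, 0, 1, 1, 1]
# Centers taken as the per-cluster means of the points above.
centers = [[5.0, 11.0 / 3.0], [20.0, 11.0 / 3.0]]

w = within_cluster_sum_of_squares(X, labels, centers)
```

Smaller values indicate tighter clusters for a fixed number of clusters.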

IUS(X, bins=3)

Calculate the Shannon Uniformity Index (Indice d'Uniformité de Shannon, IUS) for the model. This index measures the uniformity of the distribution of data points across clusters.

Parameters:

- X (ndarray of shape (n_samples, n_features)): The input data.
- bins (int, default=3): Number of bins for the histogram.

Returns:

ius (float): The Shannon Uniformity Index (IUS) value.

KL(X, bins=3)

Calculate the Kullback-Leibler divergence for the model. This index measures the divergence between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:

- X (ndarray of shape (n_samples, n_features)): The input data.
- bins (int, default=3): Number of bins for the histogram.

Returns:

kl_div (float): The Kullback-Leibler divergence value.
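
As a sketch of the underlying computation, the function below evaluates the Kullback-Leibler divergence between a histogram and a uniform distribution over the same bins. It assumes the histogram counts are already available as a flat list; the name `kl_to_uniform` is hypothetical and how cassiopy bins multi-dimensional data is not specified here.

```python
from math import log


def kl_to_uniform(counts):
    """KL(p || u) for empirical bin frequencies p and a uniform u."""
    total = sum(counts)
    uniform = 1.0 / len(counts)
    divergence = 0.0
    for count in counts:
        p = count / total
        if p > 0.0:  # empty bins contribute 0 (0 * log 0 is taken as 0)
            divergence += p * log(p / uniform)
    return divergence


flat = kl_to_uniform([2, 2, 2])    # perfectly uniform histogram
peaked = kl_to_uniform([6, 0, 0])  # all mass in one of three bins
```

The divergence is 0 for a perfectly uniform histogram and reaches log(bins) when all points fall in a single bin.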

L2(X, bins=10, penalty_weight=0.1)

Calculate the L2 distance between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:

- X (ndarray of shape (n_samples, n_features)): The input data.
- bins (int, default=10): Number of bins for the histogram.
- penalty_weight (float, default=0.1): Weight for the penalty term based on cluster sizes.

Returns:

l2_distance (float): The L2 distance value.

ST(X, w, mu, s, nu, la)

Compute the posterior probabilities for the SkewT mixture model.

Parameters:

- X (ndarray): Input data.
- w (ndarray): Cluster weights.
- mu (ndarray): Cluster means.
- s (ndarray): Cluster standard deviations.
- nu (ndarray): Degrees of freedom for each cluster.
- la (ndarray): Skewness parameters for each cluster.

Returns:

Z (ndarray): Posterior probabilities for each cluster.
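
Posterior probabilities in a mixture model follow from Bayes' rule: each cluster's weight times its density at a point, normalised over all clusters. The sketch below shows only that normalisation step and assumes the per-cluster densities have already been evaluated; the function name and the `densities` argument are hypothetical and do not reproduce the skew-t density itself.

```python
def posterior_probabilities(weights, densities):
    """Normalise weight * density per sample.

    weights:   list of K cluster weights.
    densities: list of rows, one per sample, each with K density values.
    """
    posteriors = []
    for row in densities:
        weighted = [w * d for w, d in zip(weights, row)]
        total = sum(weighted)
        posteriors.append([value / total for value in weighted])
    return posteriors


# Two equally weighted clusters, two samples with made-up density values.
Z = posterior_probabilities([0.5, 0.5], [[0.2, 0.2], [0.1, 0.3]])
```

Each row of the result sums to 1, and equal densities under equal weights give equal posteriors.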

chi2(X, bins=3)

Calculate the Chi-squared statistic for the model. This index measures the goodness of fit between the empirical distribution of data points in the uniform cluster and a uniform distribution.

Parameters:

- X (ndarray of shape (n_samples, n_features)): The input data.
- bins (int, default=3): Number of bins for the histogram.

Returns:

chi2_stat (float): The Chi-squared statistic value.
p_value (float): The p-value associated with the Chi-squared statistic.
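
The statistic part of this test can be sketched as Pearson's chi-squared statistic against equal expected counts. The sketch assumes flat histogram counts as input and uses a hypothetical function name; it does not compute the p-value, for which SciPy's `scipy.stats.chi2.sf(statistic, df)` with `df = bins - 1` is one option.

```python
def chi2_uniform_statistic(counts):
    """Pearson chi-squared statistic versus a uniform expectation."""
    total = sum(counts)
    expected = total / len(counts)  # same expected count in every bin
    return sum((count - expected) ** 2 / expected for count in counts)


uniform_stat = chi2_uniform_statistic([2, 2, 2])  # matches expectation exactly
skewed_stat = chi2_uniform_statistic([4, 1, 1])   # one over-full bin
```

Larger values indicate a stronger departure from uniformity.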

confusion_matrix(y_true, y_pred)

Calculate the confusion matrix.

Parameters:

- y_true (array-like): The true labels.
- y_pred (array-like): The predicted labels.

Returns:

matrix (array-like): The confusion matrix. The last cluster corresponds to the uniform cluster.

fit(X, tol=1e-06, max_iter=200, verbose=0)

Fit the SkewT mixture model to the data.

Parameters:

- X (array-like, shape (n_samples, n_features)): The input data.
- tol (float, default=1e-6): Tolerance for convergence.
- max_iter (int, default=200): Maximum number of iterations for the EM algorithm.
- verbose (int, default=0): Verbosity level.
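
The interplay of `tol` and `max_iter` can be sketched as a generic stopping rule: iterate until the log-likelihood changes by less than `tol`, or until `max_iter` iterations have run. This is an assumption about the convergence test, which may instead be relative or based on parameter changes. `run_until_converged` and `make_toy_step` are hypothetical helpers, and the toy step stands in for a real EM iteration.

```python
def run_until_converged(step, tol=1e-6, max_iter=200):
    """Call `step()` until the log-likelihood change drops below `tol`."""
    history = []
    for _ in range(max_iter):
        history.append(step())  # one iteration, returns the log-likelihood
        if len(history) > 1 and abs(history[-1] - history[-2]) < tol:
            break
    return history


def make_toy_step():
    """Stand-in for an EM iteration: the gap to 0 halves on each call."""
    state = {"k": 0}

    def step():
        value = -10.0 * 0.5 ** state["k"]
        state["k"] += 1
        return value

    return step


history = run_until_converged(make_toy_step())
capped = run_until_converged(make_toy_step(), max_iter=5)
```

The returned list plays the same role as the `logli` attribute, which records log-likelihood values during training.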

initialisation_gmm(X)

Initialize the parameters using a Gaussian Mixture Model (GMM).

Parameters:

- X (array-like of shape (n_samples, n_features)): Input data matrix.

Returns:

w (ndarray of shape (n_clusters,)): Cluster weights.
mu (ndarray of shape (n_clusters, n_features)): Cluster means.
s (ndarray of shape (n_clusters, n_features)): Cluster standard deviations.
nu (ndarray of shape (n_clusters, n_features)): Degrees of freedom for each cluster.
la (ndarray of shape (n_clusters, n_features)): Skewness parameters for each cluster.

initialisation_kmeans(X)

Initialize the parameters for the SkewMM algorithm using the K-means initialization method.

Parameters:

- X (array-like of shape (n_samples, n_features)): The input data matrix.
- default_n_init (int, default='auto'): The number of times the K-means algorithm is run with different centroid seeds.

Returns:

dict (dict): A dictionary containing the initialized parameters:
  - w (ndarray of shape (n_clusters,)): Cluster weights.
  - mu (ndarray of shape (n_clusters, n_features)): Cluster means.
  - s (ndarray of shape (n_clusters, n_features)): Cluster standard deviations.
  - nu (ndarray of shape (n_clusters, n_features)): Degrees of freedom for each cluster.
  - la (ndarray of shape (n_clusters, n_features)): Skewness parameters for each cluster.

initialisation_random(X)

Initialize the parameters randomly for the SkewMM algorithm.

Parameters:

- X (array-like of shape (n_samples, n_features)): Input data matrix.

Returns:

w (ndarray of shape (n_clusters,)): Cluster weights.
mu (ndarray of shape (n_clusters, n_features)): Cluster means.
s (ndarray of shape (n_clusters, n_features)): Cluster standard deviations.
nu (ndarray of shape (n_clusters, n_features)): Degrees of freedom for each cluster.
la (ndarray of shape (n_clusters, n_features)): Skewness parameters for each cluster.

load(filename: str)

Load matrices from a given file.

Parameters:

- filename (str): The path to the file containing the matrices.

logt_expr(eta2, nu)

Compute the logarithm of the t-distribution expression.

Parameters:

- eta2 (ndarray): Squared standardized residuals.
- nu (ndarray): Degrees of freedom.

Returns:

logt_val (ndarray): Logarithm of the t-distribution expression.
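
For orientation, the sketch below evaluates the log-density of a standard Student's t-distribution from a squared standardized residual and the degrees of freedom. It assumes that "t-distribution expression" refers to this log-density, which the source does not confirm; the scalar function `log_student_t` is hypothetical, and the library method operates on arrays.

```python
from math import lgamma, log, pi


def log_student_t(eta2, nu):
    """Log-density of a standard Student's t at a point with square `eta2`."""
    return (
        lgamma((nu + 1.0) / 2.0)   # log Gamma((nu + 1) / 2)
        - lgamma(nu / 2.0)         # log Gamma(nu / 2)
        - 0.5 * log(nu * pi)       # normalising constant
        - (nu + 1.0) / 2.0 * log(1.0 + eta2 / nu)  # heavy-tailed kernel
    )


# With nu = 1 the t-distribution is the Cauchy distribution,
# whose density is 1 / (pi * (1 + x^2)).
cauchy_at_zero = log_student_t(0.0, 1.0)
cauchy_at_one = log_student_t(1.0, 1.0)
```

Working with `lgamma` in log space avoids overflow in the Gamma function for large degrees of freedom.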

predict(X)

Predict the cluster labels for the data.

Parameters:

- X (array-like): The input data.

Returns:

labels (array-like): The predicted cluster labels.
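
A common way to turn posterior probabilities into hard labels is to pick the most probable cluster for each sample. The sketch below assumes `predict` behaves this way, which the source does not state; `labels_from_posteriors` is a hypothetical helper.

```python
def labels_from_posteriors(posteriors):
    """Return the index of the largest posterior in each row."""
    return [
        max(range(len(row)), key=row.__getitem__)  # ties go to the first index
        for row in posteriors
    ]


labels = labels_from_posteriors([[0.9, 0.1], [0.25, 0.75], [0.5, 0.5]])
```

With tied posteriors, as in the last row, this sketch returns the lowest cluster index.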

predict_proba(X)

Predict the posterior probabilities for the data.

Parameters:

- X (array-like, shape (n_samples, n_features)): The input data.

Returns:

proba (ndarray, shape (n_samples, n_clusters)): The posterior probabilities for each cluster.

save(filename: str)

Save the model to a file.

Parameters:

- filename (str): The name of the file.