Initial commit
This commit is contained in:
90
venv/lib/python3.7/site-packages/nltk/cluster/__init__.py
Normal file
90
venv/lib/python3.7/site-packages/nltk/cluster/__init__.py
Normal file
@@ -0,0 +1,90 @@
|
||||
# Natural Language Toolkit: Clusterers
|
||||
#
|
||||
# Copyright (C) 2001-2019 NLTK Project
|
||||
# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
|
||||
# URL: <http://nltk.org/>
|
||||
# For license information, see LICENSE.TXT
|
||||
|
||||
"""
|
||||
This module contains a number of basic clustering algorithms. Clustering
|
||||
describes the task of discovering groups of similar items with a large
|
||||
collection. It is also describe as unsupervised machine learning, as the data
|
||||
from which it learns is unannotated with class information, as is the case for
|
||||
supervised learning. Annotated data is difficult and expensive to obtain in
|
||||
the quantities required for the majority of supervised learning algorithms.
|
||||
This problem, the knowledge acquisition bottleneck, is common to most natural
|
||||
language processing tasks, thus fueling the need for quality unsupervised
|
||||
approaches.
|
||||
|
||||
This module contains a k-means clusterer, E-M clusterer and a group average
|
||||
agglomerative clusterer (GAAC). All these clusterers involve finding good
|
||||
cluster groupings for a set of vectors in multi-dimensional space.
|
||||
|
||||
The K-means clusterer starts with k arbitrary chosen means then allocates each
|
||||
vector to the cluster with the closest mean. It then recalculates the means of
|
||||
each cluster as the centroid of the vectors in the cluster. This process
|
||||
repeats until the cluster memberships stabilise. This is a hill-climbing
|
||||
algorithm which may converge to a local maximum. Hence the clustering is
|
||||
often repeated with random initial means and the most commonly occurring
|
||||
output means are chosen.
|
||||
|
||||
The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
|
||||
It then iteratively merges pairs of clusters which have the closest centroids.
|
||||
This continues until there is only one cluster. The order of merges gives rise
|
||||
to a dendrogram - a tree with the earlier merges lower than later merges. The
|
||||
membership of a given number of clusters *c*, *1 <= c <= N*, can be found by
|
||||
cutting the dendrogram at depth *c*.
|
||||
|
||||
The Gaussian EM clusterer models the vectors as being produced by a mixture
|
||||
of k Gaussian sources. The parameters of these sources (prior probability,
|
||||
mean and covariance matrix) are then found to maximise the likelihood of the
|
||||
given data. This is done with the expectation maximisation algorithm. It
|
||||
starts with k arbitrarily chosen means, priors and covariance matrices. It
|
||||
then calculates the membership probabilities for each vector in each of the
|
||||
clusters - this is the 'E' step. The cluster parameters are then updated in
|
||||
the 'M' step using the maximum likelihood estimate from the cluster membership
|
||||
probabilities. This process continues until the likelihood of the data does
|
||||
not significantly increase.
|
||||
|
||||
They all extend the ClusterI interface which defines common operations
|
||||
available with each clusterer. These operations include.
|
||||
- cluster: clusters a sequence of vectors
|
||||
- classify: assign a vector to a cluster
|
||||
- classification_probdist: give the probability distribution over cluster memberships
|
||||
|
||||
The current existing classifiers also extend cluster.VectorSpace, an
|
||||
abstract class which allows for singular value decomposition (SVD) and vector
|
||||
normalisation. SVD is used to reduce the dimensionality of the vector space in
|
||||
such a manner as to preserve as much of the variation as possible, by
|
||||
reparameterising the axes in order of variability and discarding all bar the
|
||||
first d dimensions. Normalisation ensures that vectors fall in the unit
|
||||
hypersphere.
|
||||
|
||||
Usage example (see also demo())::
|
||||
from nltk import cluster
|
||||
from nltk.cluster import euclidean_distance
|
||||
from numpy import array
|
||||
|
||||
vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
|
||||
|
||||
# initialise the clusterer (will also assign the vectors to clusters)
|
||||
clusterer = cluster.KMeansClusterer(2, euclidean_distance)
|
||||
clusterer.cluster(vectors, True)
|
||||
|
||||
# classify a new vector
|
||||
print(clusterer.classify(array([3, 3])))
|
||||
|
||||
Note that the vectors must use numpy array-like
|
||||
objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
|
||||
efficiency when required.
|
||||
"""
|
||||
|
||||
from nltk.cluster.util import (
|
||||
VectorSpaceClusterer,
|
||||
Dendrogram,
|
||||
euclidean_distance,
|
||||
cosine_distance,
|
||||
)
|
||||
from nltk.cluster.kmeans import KMeansClusterer
|
||||
from nltk.cluster.gaac import GAAClusterer
|
||||
from nltk.cluster.em import EMClusterer
|
||||
Reference in New Issue
Block a user