Initial commit

2019-10-20 13:16:49 +02:00
commit 233066caf4
2099 changed files with 360824 additions and 0 deletions
--- a/venv/lib/python3.7/site-packages/nltk/cluster/__init__.py
+++ b/venv/lib/python3.7/site-packages/nltk/cluster/__init__.py
@@ -0,0 +1,90 @@
+# Natural Language Toolkit: Clusterers
+#
+# Copyright (C) 2001-2019 NLTK Project
+# Author: Trevor Cohn <tacohn@cs.mu.oz.au>
+# URL: <http://nltk.org/>
+# For license information, see LICENSE.TXT
+
+"""
+This module contains a number of basic clustering algorithms. Clustering
+describes the task of discovering groups of similar items within a large
+collection. It is also described as unsupervised machine learning, as the data
+from which it learns is not annotated with class information, unlike the data
+used in supervised learning. Annotated data is difficult and expensive to
+obtain in the quantities required for the majority of supervised learning
+algorithms.
+This problem, the knowledge acquisition bottleneck, is common to most natural
+language processing tasks, thus fueling the need for quality unsupervised
+approaches.
+
+This module contains a k-means clusterer, an E-M clusterer and a group average
+agglomerative clusterer (GAAC). All these clusterers involve finding good
+cluster groupings for a set of vectors in multi-dimensional space.
+
+The K-means clusterer starts with k arbitrarily chosen means, then allocates
+each vector to the cluster with the closest mean. It then recalculates the
+mean of each cluster as the centroid of the vectors in the cluster. This
+process repeats until the cluster memberships stabilise. This is a
+hill-climbing algorithm which may converge to a local optimum. Hence the
+clustering is often repeated with random initial means and the most commonly
+occurring output means are chosen.
+
+The GAAC clusterer starts with each of the *N* vectors as singleton clusters.
+It then iteratively merges pairs of clusters which have the closest centroids.
+This continues until there is only one cluster. The order of merges gives rise
+to a dendrogram - a tree with the earlier merges lower than later merges. The
+membership of a given number of clusters *c*, *1 <= c <= N*, can be found by
+cutting the dendrogram at depth *c*.
+
+The Gaussian EM clusterer models the vectors as being produced by a mixture
+of k Gaussian sources. The parameters of these sources (prior probability,
+mean and covariance matrix) are then found to maximise the likelihood of the
+given data. This is done with the expectation maximisation algorithm. It
+starts with k arbitrarily chosen means, priors and covariance matrices. It
+then calculates the membership probabilities for each vector in each of the
+clusters - this is the 'E' step. The cluster parameters are then updated in
+the 'M' step using the maximum likelihood estimate from the cluster membership
+probabilities. This process continues until the likelihood of the data does
+not significantly increase.
+
+All three clusterers extend the ClusterI interface, which defines the common
+operations available with each clusterer. These operations include:
+   - cluster: cluster a sequence of vectors
+   - classify: assign a vector to a cluster
+   - classification_probdist: give the probability distribution over cluster memberships
+
+The existing clusterers also extend cluster.VectorSpaceClusterer, an
+abstract class which allows for singular value decomposition (SVD) and vector
+normalisation. SVD is used to reduce the dimensionality of the vector space in
+such a manner as to preserve as much of the variation as possible, by
+reparameterising the axes in order of variability and discarding all bar the
+first d dimensions. Normalisation scales each vector to unit length, so that
+all vectors lie on the unit hypersphere.
+
+Usage example (see also demo())::
+    from nltk import cluster
+    from nltk.cluster import euclidean_distance
+    from numpy import array
+
+    vectors = [array(f) for f in [[3, 3], [1, 2], [4, 2], [4, 0]]]
+
+    # initialise the clusterer
+    clusterer = cluster.KMeansClusterer(2, euclidean_distance)
+    # cluster the vectors; True also assigns each vector to a cluster
+    clusterer.cluster(vectors, True)
+
+    # classify a new vector
+    print(clusterer.classify(array([3, 3])))
+
+Note that the vectors must use numpy array-like
+objects. nltk_contrib.unimelb.tacohn.SparseArrays may be used for
+efficiency when required.
+"""
+
+from nltk.cluster.util import (
+    VectorSpaceClusterer,
+    Dendrogram,
+    euclidean_distance,
+    cosine_distance,
+)
+from nltk.cluster.kmeans import KMeansClusterer
+from nltk.cluster.gaac import GAAClusterer
+from nltk.cluster.em import EMClusterer
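
Reviewer note: the k-means loop described in the module docstring above (allocate each vector to the closest mean, recompute each mean as a centroid, repeat until memberships stabilise) can be sketched in plain Python as follows. This is an illustrative sketch only, not NLTK's implementation: the helper names `euclidean`, `centroid` and `kmeans`, the `max_iter` cap, the choice of initial means and the handling of empty clusters are all assumptions made for this example.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def centroid(points):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(points)
    return [sum(coords) / n for coords in zip(*points)]


def kmeans(vectors, initial_means, max_iter=100):
    """Return (means, assignments) once cluster memberships stop changing.

    Hypothetical helper for illustration; max_iter is an assumed safety cap.
    """
    means = [list(m) for m in initial_means]  # do not mutate the caller's list
    assignments = None
    for _ in range(max_iter):
        # Allocate each vector to the cluster with the closest mean.
        new_assignments = [
            min(range(len(means)), key=lambda i: euclidean(v, means[i]))
            for v in vectors
        ]
        if new_assignments == assignments:
            break  # memberships have stabilised
        assignments = new_assignments
        # Recalculate each mean as the centroid of its member vectors.
        for i in range(len(means)):
            members = [v for v, a in zip(vectors, assignments) if a == i]
            if members:  # assumption: an empty cluster keeps its old mean
                means[i] = centroid(members)
    return means, assignments


# Same four vectors as the usage example in the docstring.
vectors = [[3, 3], [1, 2], [4, 2], [4, 0]]
means, assignments = kmeans(vectors, initial_means=[[1, 2], [4, 2]])
print(assignments)  # cluster index for each input vector
print(means)        # final cluster means
```

Because the result depends on the initial means, a different starting choice may settle on a different grouping, which is why the docstring suggests repeating the clustering with random initial means.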