Clustering text documents using k-means

This example shows how scikit-learn can be used to cluster documents by topics using a bag-of-words approach. It uses a scipy.sparse matrix to store the features instead of standard numpy arrays. Two feature extraction methods can be used in this example: TfidfVectorizer uses an in-memory vocabulary (a Python dict) to map the most frequent words to feature indices and hence compute a sparse word occurrence frequency matrix. The word frequencies are then reweighted using the Inverse Document Frequency (IDF) vector collected feature-wise over the corpus.
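A minimal sketch of this pipeline, assuming a small subset of the 20 newsgroups corpus (the category names and k-means settings below are illustrative choices, not taken from the example itself):

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Load a small, assumed subset of the 20 newsgroups corpus
categories = ['sci.space', 'rec.sport.baseball']
data = fetch_20newsgroups(subset='train', categories=categories)

# TF-IDF features are stored as a scipy.sparse matrix
vectorizer = TfidfVectorizer(max_df=0.5, min_df=2, stop_words='english')
X = vectorizer.fit_transform(data.data)

# Cluster the documents with k-means on the sparse matrix
km = KMeans(n_clusters=len(categories), random_state=42)
km.fit(X)
print(km.labels_[:10])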

neighbors.NearestNeighbors()

class sklearn.neighbors.NearestNeighbors(n_neighbors=5, radius=1.0, algorithm='auto', leaf_size=30, metric='minkowski', p=2, metric_params=None, n_jobs=1, **kwargs) [source]

Unsupervised learner for implementing neighbor searches. Read more in the User Guide.

Parameters:
n_neighbors : int, optional (default = 5)
    Number of neighbors to use by default for kneighbors queries.
radius : float, optional (default = 1.0)
    Range of parameter space to use by default for radius_neighbors queries.
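A small usage sketch of NearestNeighbors on toy data (the query point and parameter values are arbitrary illustrations):

import numpy as np
from sklearn.neighbors import NearestNeighbors

X = np.array([[0., 0.], [1., 0.], [0., 1.], [5., 5.]])

# Fit an unsupervised neighbor searcher
nn = NearestNeighbors(n_neighbors=2, radius=1.5)
nn.fit(X)

# kneighbors: distances and indices of the 2 closest training points
dist, ind = nn.kneighbors([[0.5, 0.5]])

# radius_neighbors: all training points within radius 1.5 of the query
rdist, rind = nn.radius_neighbors([[0.5, 0.5]])
print(ind, rind)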

sklearn.feature_extraction.image.extract_patches_2d()

sklearn.feature_extraction.image.extract_patches_2d(image, patch_size, max_patches=None, random_state=None) [source]

Reshape a 2D image into a collection of patches. The resulting patches are allocated in a dedicated array. Read more in the User Guide.

Parameters:
image : array, shape = (image_height, image_width) or (image_height, image_width, n_channels)
    The original image data. For color images, the last dimension specifies the channel: an RGB image would have n_channels=3.
patch_size : tuple of ints (patch_height, patch_width)
    The dimensions of one patch.
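A short sketch showing the resulting patch array shapes on a toy image (the image and patch size are made up for illustration):

import numpy as np
from sklearn.feature_extraction.image import extract_patches_2d

# A toy 8x8 single-channel "image"
image = np.arange(64, dtype=float).reshape(8, 8)

# All 4x4 patches: (8 - 4 + 1) ** 2 = 25 of them, in a dedicated array
patches = extract_patches_2d(image, patch_size=(4, 4))
print(patches.shape)   # (25, 4, 4)

# A random sample of patches instead of all of them
some = extract_patches_2d(image, (4, 4), max_patches=5, random_state=0)
print(some.shape)      # (5, 4, 4)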

kernel_approximation.SkewedChi2Sampler()

class sklearn.kernel_approximation.SkewedChi2Sampler(skewedness=1.0, n_components=100, random_state=None) [source]

Approximates the feature map of the "skewed chi-squared" kernel by Monte Carlo approximation of its Fourier transform. Read more in the User Guide.

Parameters:
skewedness : float
    "Skewedness" parameter of the kernel. Needs to be cross-validated.
n_components : int
    Number of Monte Carlo samples per original feature. Equals the dimensionality of the computed feature space.
random_state : int, RandomState instance or None, optional (default=None)
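A minimal sketch of the sampler feeding a linear classifier (the toy data, skewedness value, and number of components are arbitrary choices, not recommendations):

import numpy as np
from sklearn.kernel_approximation import SkewedChi2Sampler
from sklearn.linear_model import SGDClassifier

# Toy non-negative features; the skewed chi-squared kernel expects X > -skewedness
rng = np.random.RandomState(0)
X = rng.rand(20, 5)
y = np.repeat([0, 1], 10)

# Monte Carlo approximation of the kernel feature map
sampler = SkewedChi2Sampler(skewedness=0.5, n_components=50, random_state=0)
X_features = sampler.fit_transform(X)

# Train a linear model on the approximate feature space
clf = SGDClassifier(random_state=0).fit(X_features, y)
print(clf.score(X_features, y))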

svm.OneClassSVM()

class sklearn.svm.OneClassSVM(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, nu=0.5, shrinking=True, cache_size=200, verbose=False, max_iter=-1, random_state=None) [source]

Unsupervised Outlier Detection. Estimate the support of a high-dimensional distribution. The implementation is based on libsvm. Read more in the User Guide.

Parameters:
kernel : string, optional (default='rbf')
    Specifies the kernel type to be used in the algorithm. It must be one of 'linear', 'poly', 'rbf', 'sigmoid', 'precomputed' or a callable. If none is given, 'rbf' will be used.
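A brief sketch of the one-class SVM as an outlier detector on synthetic 2D data (the nu and gamma values below are illustrative):

import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.RandomState(42)
X_train = 0.3 * rng.randn(100, 2)                          # inliers around the origin
X_test = np.r_[0.3 * rng.randn(10, 2),                     # more inliers
               rng.uniform(low=-4, high=4, size=(10, 2))]  # scattered outliers

# nu upper-bounds the fraction of training errors and lower-bounds the fraction of support vectors
clf = OneClassSVM(kernel='rbf', nu=0.1, gamma=0.1)
clf.fit(X_train)

# predict() returns +1 for points inside the learned support, -1 for outliers
print(clf.predict(X_test))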

Outlier detection with several methods.

When the amount of contamination is known, this example illustrates three different ways of performing Novelty and Outlier Detection:

- based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the One-Class SVM in that case;
- using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data is strongly non-Gaussian, i.e. with two well-separated clusters;
- using the Isolation Forest algorithm, which is based on random forests and hence better adapted to large-dimensional settings.
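A compact sketch of the three detectors side by side on synthetic two-cluster data (the data generation and contamination level are assumptions for illustration, not the example's exact setup):

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.svm import OneClassSVM
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(0)
X = np.r_[0.3 * rng.randn(100, 2) + 2,     # one blob
          0.3 * rng.randn(100, 2) - 2]     # a second, well-separated blob
outliers_fraction = 0.1                    # assumed known amount of contamination

classifiers = {
    "Robust covariance": EllipticEnvelope(contamination=outliers_fraction),
    "One-Class SVM": OneClassSVM(nu=outliers_fraction, gamma=0.1),
    "Isolation Forest": IsolationForest(contamination=outliers_fraction,
                                        random_state=rng),
}
for name, clf in classifiers.items():
    clf.fit(X)
    y_pred = clf.predict(X)                # +1 inlier, -1 outlier
    print(name, (y_pred == -1).sum(), "points flagged as outliers")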

sklearn.metrics.confusion_matrix()

sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None) [source]

Compute a confusion matrix to evaluate the accuracy of a classification. By definition, a confusion matrix C is such that C_{i,j} is equal to the number of observations known to be in group i but predicted to be in group j. Thus in binary classification, the count of true negatives is C_{0,0}, false negatives is C_{1,0}, true positives is C_{1,1} and false positives is C_{0,1}. Read more in the User Guide.

Parameters:
y_true : array, shape = [n_samples]
    Ground truth (correct) target values.
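A worked example with the binary counts spelled out (toy labels chosen so each cell of the matrix is easy to check by hand):

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 1]
y_pred = [0, 1, 0, 1, 1, 0]

C = confusion_matrix(y_true, y_pred)
print(C)
# [[2 1]
#  [1 2]]
# C[0, 0] = 2 true negatives,  C[0, 1] = 1 false positive,
# C[1, 0] = 1 false negative,  C[1, 1] = 2 true positives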

Compare BIRCH and MiniBatchKMeans

This example compares the timing of Birch (with and without the global clustering step) and MiniBatchKMeans on a synthetic dataset having 100,000 samples and 2 features generated using make_blobs. If n_clusters is set to None, the data is reduced from 100,000 samples to a set of 158 clusters. This can be viewed as a preprocessing step before the final (global) clustering step that further reduces these 158 clusters to 100 clusters. The example's output reports the time taken by each of the three variants.
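A scaled-down timing sketch of the same comparison (fewer samples than the example's 100,000 so it runs quickly; the printed timings will of course differ):

from time import time
from sklearn.datasets import make_blobs
from sklearn.cluster import Birch, MiniBatchKMeans

# Synthetic blobs, smaller than in the example
X, _ = make_blobs(n_samples=10000, n_features=2, centers=100, random_state=0)

for name, model in [
        ("Birch without global clustering", Birch(n_clusters=None)),
        ("Birch with global clustering", Birch(n_clusters=100)),
        ("MiniBatchKMeans", MiniBatchKMeans(n_clusters=100, random_state=0)),
]:
    t0 = time()
    model.fit(X)
    print("%s took %.2f s" % (name, time() - t0))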

Gaussian process classification on iris dataset

This example illustrates the predicted probability of GPC for an isotropic and anisotropic RBF kernel on a two-dimensional version of the iris dataset. The anisotropic RBF kernel obtains a slightly higher log-marginal-likelihood by assigning different length-scales to the two feature dimensions. The example begins with the following imports:

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn import datasets
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF
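A sketch of how the fitting step might continue (the relevant imports are repeated so the snippet is self-contained; the kernel hyperparameters shown are just starting values and the example's plotting code is omitted):

from sklearn import datasets
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

iris = datasets.load_iris()
X = iris.data[:, :2]          # two-dimensional version of the dataset
y = iris.target

# Isotropic RBF: one shared length-scale; anisotropic: one length-scale per feature
gpc_iso = GaussianProcessClassifier(kernel=1.0 * RBF([1.0])).fit(X, y)
gpc_aniso = GaussianProcessClassifier(kernel=1.0 * RBF([1.0, 1.0])).fit(X, y)

print("isotropic LML:   %.3f" % gpc_iso.log_marginal_likelihood(gpc_iso.kernel_.theta))
print("anisotropic LML: %.3f" % gpc_aniso.log_marginal_likelihood(gpc_aniso.kernel_.theta))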

cluster.SpectralClustering()

class sklearn.cluster.SpectralClustering(n_clusters=8, eigen_solver=None, random_state=None, n_init=10, gamma=1.0, affinity='rbf', n_neighbors=10, eigen_tol=0.0, assign_labels='kmeans', degree=3, coef0=1, kernel_params=None, n_jobs=1) [source]

Apply clustering to a projection of the normalized Laplacian. In practice Spectral Clustering is very useful when the structure of the individual clusters is highly non-convex, or more generally when a measure of the center and spread of the cluster is not a suitable description of the complete cluster, for instance when clusters are nested circles on the 2D plane.
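A short sketch on the nested-circles case mentioned above (the dataset parameters and the nearest-neighbors affinity choice are illustrative):

from sklearn.datasets import make_circles
from sklearn.cluster import SpectralClustering

# Two nested circles: a center-and-spread description of the clusters fails here
X, y = make_circles(n_samples=300, factor=0.5, noise=0.05, random_state=0)

# A nearest-neighbors affinity graph handles this non-convex structure well
sc = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                        n_neighbors=10, random_state=0)
labels = sc.fit_predict(X)
print(labels[:20])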