A demo of the mean-shift clustering algorithm

Reference: Dorin Comaniciu and Peter Meer, "Mean Shift: A robust approach toward feature space analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002. pp. 603-619.

    print(__doc__)

    import numpy as np

    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets.samples_generator import make_blobs

    # Generate sample data
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

    # Compute clustering with MeanShift
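The snippet breaks off at the clustering step. A minimal continuation of the snippet above, in the usual pattern for this estimator (the quantile, n_samples and bin_seeding settings here are illustrative choices, not prescribed values), might look like:

    # Estimate the kernel bandwidth from the data, then run MeanShift
    bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_

    # Each distinct label corresponds to one discovered mode/cluster
    n_clusters_ = len(np.unique(labels))
    print("number of estimated clusters : %d" % n_clusters_)

Setting bin_seeding=True seeds the mode search from a coarse grid over the data rather than from every point, which speeds up the algorithm on large samples.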

A demo of structured Ward hierarchical clustering on a raccoon face image

Compute the segmentation of a 2D image with Ward hierarchical clustering. The clustering is spatially constrained in order for each segmented region to be in one piece.

    # Author: Vincent Michel, 2010
    #         Alexandre Gramfort, 2011
    # License: BSD 3 clause

    print(__doc__)

    import time as time

    import numpy as np
    import scipy as sp
    import matplotlib.pyplot as plt

    from sklearn.feature_extraction.image import grid_to_graph
    from sklearn.cluster import AgglomerativeClustering
    from sklearn.uti
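The listing stops mid-import. To illustrate the spatial constraint the text describes, here is a minimal, self-contained sketch on a small random image (the image, the number of clusters and the other settings are illustrative stand-ins for the raccoon-face demo):

    import numpy as np
    from sklearn.feature_extraction.image import grid_to_graph
    from sklearn.cluster import AgglomerativeClustering

    # A small fake grayscale "image"; the real demo uses a raccoon face photo
    rng = np.random.RandomState(0)
    img = rng.rand(50, 50)
    X = img.reshape(-1, 1)                 # one feature (intensity) per pixel

    # Connectivity graph linking each pixel to its grid neighbours
    connectivity = grid_to_graph(*img.shape)

    ward = AgglomerativeClustering(n_clusters=15, linkage="ward",
                                   connectivity=connectivity)
    label = ward.fit_predict(X).reshape(img.shape)

The connectivity matrix only allows merges between neighbouring pixels, which is what keeps each segmented region in one piece.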

A demo of K-Means clustering on the handwritten digits data

In this example we compare the various initialization strategies for K-means in terms of runtime and quality of the results. As the ground truth is known here, we also apply different cluster quality metrics to judge the goodness of fit of the cluster labels to the ground truth.

Cluster quality metrics evaluated (see Clustering performance evaluation for definitions and discussions of the metrics):

    Shorthand   full name
    homo        homogeneity score
    compl       completeness score
    v-meas      V measure
    ARI         adjusted Rand index
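As a rough sketch of how such a comparison could be set up (the metric subset, n_init and output formatting below are illustrative, not the full benchmark from the example):

    from time import time

    from sklearn import metrics
    from sklearn.cluster import KMeans
    from sklearn.datasets import load_digits
    from sklearn.preprocessing import scale

    digits = load_digits()
    data = scale(digits.data)
    labels = digits.target
    n_digits = len(set(labels))

    # Compare two initialization strategies on runtime and label agreement
    for init in ("k-means++", "random"):
        km = KMeans(init=init, n_clusters=n_digits, n_init=10)
        t0 = time()
        km.fit(data)
        print("%-10s  %.2fs  homo=%.3f  compl=%.3f  ARI=%.3f"
              % (init, time() - t0,
                 metrics.homogeneity_score(labels, km.labels_),
                 metrics.completeness_score(labels, km.labels_),
                 metrics.adjusted_rand_score(labels, km.labels_)))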

6. Strategies to scale computationally

For some applications the amount of examples, features (or both) and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

6.1. Scaling with instances using out-of-core learning

Out-of-core (or "external memory") learning is a technique used to learn from data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this
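The core ingredient of such a system is an estimator supporting incremental learning via partial_fit, fed by a reader that streams minibatches from disk or the network. A minimal sketch, with a synthetic generator standing in for the streaming reader (the batch sizes and the model are illustrative):

    import numpy as np
    from sklearn.linear_model import SGDClassifier

    def iter_minibatches(n_batches=100, batch_size=500, n_features=20, seed=0):
        """Stand-in for a reader that streams minibatches from disk or the network."""
        rng = np.random.RandomState(seed)
        for _ in range(n_batches):
            X = rng.rand(batch_size, n_features)
            y = (X[:, 0] > 0.5).astype(int)
            yield X, y

    clf = SGDClassifier()
    all_classes = np.array([0, 1])   # every class must be declared on the first call

    for X_batch, y_batch in iter_minibatches():
        # Only one minibatch is held in RAM at any time
        clf.partial_fit(X_batch, y_batch, classes=all_classes)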

5. Dataset loading utilities

The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section. To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that
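For instance, a small sketch contrasting a built-in toy loader with a synthetic generator whose n_samples, n_features and feature informativeness are under your control (the parameter values are arbitrary):

    from sklearn.datasets import load_iris, make_classification

    # A small built-in toy dataset
    iris = load_iris()
    print(iris.data.shape, iris.target.shape)        # (150, 4) (150,)

    # Synthetic data with controlled size and feature informativeness
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=5, n_redundant=2,
                               random_state=0)
    print(X.shape, y.shape)                          # (1000, 20) (1000,)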

4.8. Transforming the prediction target

4.8.1. Label binarization

LabelBinarizer is a utility class to help create a label indicator matrix from a list of multi-class labels:

>>> from sklearn import preprocessing
>>> lb = preprocessing.LabelBinarizer()
>>> lb.fit([1, 2, 6, 4, 2])
LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False)
>>> lb.classes_
array([1, 2, 4, 6])
>>> lb.transform([1, 6])
array([[1, 0, 0, 0],
       [0, 0, 0, 1]])

For multiple labels per instance, use MultiLabelBinarizer:
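A minimal sketch of the multilabel case (the label tuples below are made up for illustration):

>>> lb = preprocessing.MultiLabelBinarizer()
>>> lb.fit_transform([(1, 2), (3,)])
array([[1, 1, 0],
       [0, 0, 1]])
>>> lb.classes_
array([1, 2, 3])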

4.7. Pairwise metrics, Affinities and Kernels

The sklearn.metrics.pairwise submodule implements utilities to evaluate pairwise distances or affinity of sets of samples. This module contains both distance metrics and kernels. A brief summary is given on the two here. Distance metrics are functions d(a, b) such that d(a, b) < d(a, c) if objects a and b are considered "more similar" than objects a and c. Two objects exactly alike would have a distance of zero. One of the most popular examples is Euclidean distance. To be a "true" metric,
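As a small illustration of the two families, here is a sketch that computes a distance matrix and a kernel matrix on the same toy data (the points and the gamma value are arbitrary):

    import numpy as np
    from sklearn.metrics.pairwise import euclidean_distances, rbf_kernel

    X = np.array([[0.0, 1.0], [1.0, 1.0], [2.0, 0.0]])

    # Distance metric: identical rows give 0, more dissimilar rows larger values
    print(euclidean_distances(X, X))

    # Kernel (similarity): larger values for more similar rows, 1.0 on the diagonal
    print(rbf_kernel(X, X, gamma=0.5))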

4.6. Kernel Approximation

This submodule contains functions that approximate the feature mappings that correspond to certain kernels, as they are used for example in support vector machines (see Support Vector Machines). The following feature functions perform non-linear transformations of the input, which can serve as a basis for linear classification or other algorithms. The advantage of using approximate explicit feature maps compared to the kernel trick, which makes use of feature maps implicitly, is that explicit
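The passage is cut off above. As a sketch of the pattern it describes, an approximate explicit feature map can be fitted and then fed to an ordinary linear model (the dataset, gamma and n_components below are illustrative choices):

    from sklearn.datasets import load_digits
    from sklearn.kernel_approximation import RBFSampler
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import make_pipeline

    digits = load_digits()
    X, y = digits.data, digits.target

    # Explicit (approximate) feature map for an RBF kernel, followed by a
    # linear classifier trained in the transformed space
    clf = make_pipeline(RBFSampler(gamma=0.01, n_components=300, random_state=0),
                        SGDClassifier(max_iter=1000, tol=1e-3))
    clf.fit(X, y)
    print(clf.score(X, y))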

4.5. Random Projection

The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of the data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. This module implements two types of unstructured random matrix: Gaussian random matrix and sparse random matrix. The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset.
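A minimal sketch of both pieces of the module, the Johnson-Lindenstrauss bound and a Gaussian random projection (the sample counts and the eps value are arbitrary):

    import numpy as np
    from sklearn.random_projection import (GaussianRandomProjection,
                                           johnson_lindenstrauss_min_dim)

    # Minimum number of components suggested by the Johnson-Lindenstrauss lemma
    # for 1000 samples at 10% distortion
    print(johnson_lindenstrauss_min_dim(n_samples=1000, eps=0.1))

    rng = np.random.RandomState(0)
    X = rng.rand(100, 10000)

    # n_components='auto' picks the dimension from the lemma for the given eps
    transformer = GaussianRandomProjection(eps=0.1, random_state=0)
    X_new = transformer.fit_transform(X)
    print(X_new.shape)        # fewer features than the original 10000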

4.4. Unsupervised dimensionality reduction

If your number of features is high, it may be useful to reduce it with an unsupervised step prior to supervised steps. Many of the Unsupervised learning methods implement a transform method that can be used to reduce the dimensionality. Below we discuss two specific examples of this pattern that are heavily used.

Pipelining: the unsupervised data reduction and the supervised estimator can be chained in one step. See Pipeline: chaining estimators.

4.4.1. PCA: principal component analysis decom
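Returning to the pipelining note above, a minimal sketch chaining PCA with a supervised estimator (the dataset, the number of components and the classifier are illustrative):

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import Pipeline

    X, y = load_digits(return_X_y=True)

    # Unsupervised reduction to 30 components, then a supervised classifier
    pipe = Pipeline([("reduce_dim", PCA(n_components=30)),
                     ("clf", LogisticRegression(max_iter=1000))])
    pipe.fit(X, y)
    print(pipe.score(X, y))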