Working With Text Data

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics. In this section we will see how to: load the file contents and the categories; extract feature vectors suitable for machine learning; train a linear model to perform categorization; and use a grid search strategy to find a good configuration of both the feature extraction components and the classifier.
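A minimal sketch of that workflow, assuming the standard fetch_20newsgroups loader; the two-category subset and the grid values below are chosen for illustration, not taken from the original tutorial:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Load the file contents and the categories for a small subset of topics.
categories = ['alt.atheism', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)

# Chain feature extraction and a linear classifier in one pipeline.
clf = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(loss='hinge')),
])

# Grid-search over both the vectorizer and the classifier settings.
params = {
    'vect__ngram_range': [(1, 1), (1, 2)],
    'clf__alpha': (1e-2, 1e-3),
}
search = GridSearchCV(clf, params, n_jobs=-1)
search.fit(train.data, train.target)
print(search.best_params_, search.best_score_)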

Visualization of MLP weights on MNIST

Sometimes looking at the learned coefficients of a neural network can provide insight into the learning behavior. For example, if the weights look unstructured, maybe some were not used at all, or if very large coefficients exist, maybe regularization was too low or the learning rate too high. This example shows how to plot some of the first-layer weights in an MLPClassifier trained on the MNIST dataset. The input data consists of 28x28-pixel handwritten digits, leading to 784 features in the dataset.
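A hedged sketch of the idea, assuming MNIST is fetched via fetch_openml as 'mnist_784'; the small hidden layer and short training run are chosen for brevity, not taken from the original example:

import matplotlib.pyplot as plt
from sklearn.datasets import fetch_openml
from sklearn.neural_network import MLPClassifier

X, y = fetch_openml('mnist_784', version=1, return_X_y=True, as_frame=False)
X = X / 255.0  # scale pixel values to [0, 1]

# A brief training run; expect a convergence warning with max_iter this low.
mlp = MLPClassifier(hidden_layer_sizes=(50,), max_iter=10, random_state=0)
mlp.fit(X, y)

# coefs_[0] has shape (784, n_hidden); each column is one hidden unit's
# incoming weights, reshaped back to the 28x28 input grid for display.
fig, axes = plt.subplots(4, 4)
vmin, vmax = mlp.coefs_[0].min(), mlp.coefs_[0].max()
for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):
    ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray,
               vmin=0.5 * vmin, vmax=0.5 * vmax)
    ax.set_xticks(())
    ax.set_yticks(())
plt.show()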

Visualizing the stock market structure

This example employs several unsupervised learning techniques to extract the stock market structure from variations in historical quotes. The quantity that we use is the daily variation in quote price: quotes that are linked tend to cofluctuate during a day. Learning a graph structure: we use sparse inverse covariance estimation to find which quotes are correlated conditionally on the others. Specifically, sparse inverse covariance gives us a graph, that is, a list of connections. For each symbol, the symbols it is connected to are those that are useful in explaining its fluctuations.
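A sketch of the graph-learning step under stated assumptions: a random array stands in for the real (n_days, n_symbols) matrix of daily variations, and GraphicalLassoCV from sklearn.covariance estimates the sparse precision matrix whose nonzero off-diagonal entries are the graph edges:

import numpy as np
from sklearn.covariance import GraphicalLassoCV

rng = np.random.RandomState(0)
variation = rng.randn(200, 10)  # stand-in for real close-minus-open quote data

# Standardize each series, then fit a sparse inverse covariance model.
X = variation / variation.std(axis=0)
model = GraphicalLassoCV()
model.fit(X)

# Nonzero entries of the precision matrix are conditional dependencies.
edges = np.abs(np.triu(model.precision_, k=1)) > 0.02
print('number of edges:', edges.sum())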

Various Agglomerative Clustering on a 2D embedding of digits

An illustration of various linkage options for agglomerative clustering on a 2D embedding of the digits dataset. The goal of this example is to show intuitively how the metrics behave, not to find good clusters for the digits; this is why the example works on a 2D embedding. What this example shows us is the "rich getting richer" behavior of agglomerative clustering, which tends to create uneven cluster sizes. This behavior is especially pronounced for the average linkage strategy, which ends up with a couple of singleton clusters.
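A minimal sketch, assuming a spectral embedding of the digits data stands in for the 2D embedding used by the example; printing per-cluster counts makes the uneven sizes visible without the plots:

import numpy as np
from sklearn import datasets, manifold
from sklearn.cluster import AgglomerativeClustering

digits = datasets.load_digits()
X_2d = manifold.SpectralEmbedding(n_components=2).fit_transform(digits.data)

for linkage in ('ward', 'average', 'complete', 'single'):
    clustering = AgglomerativeClustering(linkage=linkage, n_clusters=10)
    labels = clustering.fit_predict(X_2d)
    # Uneven counts reveal the "rich getting richer" behavior.
    print(linkage, np.bincount(labels))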

Vector Quantization Example

Face, a 1024 x 768 size image of a raccoon face, is used here to illustrate how k-means can be used for vector quantization.

print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import numpy as np
import scipy as sp
import matplotlib.pyplot as plt

from sklearn import cluster
from sklearn.utils.testing import SkipTest
from sklearn.utils.fixes import sp_version

if sp_version < (0, 12):
    raise SkipTest("Skipping because SciPy version earlier than 0.12.0 "
                   "does not include the scipy.misc.face() image.")
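The quantization step that follows in the example can be sketched as below, assuming a SciPy version that still ships scipy.misc.face (it moved to scipy.datasets in SciPy 1.10): every pixel is assigned to one of n_clusters gray levels and replaced by its cluster center:

import numpy as np
from scipy import misc  # use scipy.datasets.face on SciPy >= 1.10
from sklearn import cluster

face = misc.face(gray=True)
n_clusters = 5

# One sample per pixel intensity; cluster the gray levels with k-means.
X = face.reshape((-1, 1))
k_means = cluster.KMeans(n_clusters=n_clusters, n_init=4)
k_means.fit(X)

# Replace every pixel by the gray level of its cluster center.
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
face_compressed = np.choose(labels, values).reshape(face.shape)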

Varying regularization in Multi-layer Perceptron

A comparison of different values for the regularization parameter "alpha" on synthetic datasets. The plot shows that different alphas yield different decision functions. Alpha is a parameter for the regularization term, aka penalty term, that combats overfitting by constraining the size of the weights. Increasing alpha may fix high variance (a sign of overfitting) by encouraging smaller weights, resulting in a decision boundary with less curvature. Similarly, decreasing alpha may fix high bias (a sign of underfitting) by encouraging larger weights, potentially resulting in a more complicated decision boundary.
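A minimal sketch of the comparison, assuming a make_moons dataset stands in for the example's synthetic datasets; test accuracy is printed in place of the decision-boundary plots:

import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Sweep alpha over several orders of magnitude; larger values shrink
# the weights and smooth the decision boundary.
for alpha in np.logspace(-5, 3, 5):
    clf = MLPClassifier(alpha=alpha, max_iter=2000, random_state=0)
    clf.fit(X_train, y_train)
    print('alpha=%.0e  test accuracy=%.2f' % (alpha, clf.score(X_test, y_test)))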

Unsupervised learning

Clustering: grouping observations together. The problem solved in clustering: given the iris dataset, if we knew that there were 3 types of iris but did not have access to a taxonomist to label them, we could try a clustering task: split the observations into well-separated groups called clusters. K-means clustering: note that there exist a lot of different clustering criteria and associated algorithms; the simplest clustering algorithm is K-means.

>>> from sklearn import cluster, datasets
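Continuing the snippet in the tutorial's doctest style, as a hedged sketch; the fitted-estimator repr shown is how recent scikit-learn versions display it:

>>> iris = datasets.load_iris()
>>> X_iris = iris.data
>>> k_means = cluster.KMeans(n_clusters=3)
>>> k_means.fit(X_iris)
KMeans(n_clusters=3)
>>> labels = k_means.labels_[::10]  # cluster assignment of every 10th observation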

Using FunctionTransformer to select columns

Shows how to use a function transformer in a pipeline. If you know your dataset's first principal component is irrelevant for a classification task, you can use the FunctionTransformer to select all but the first column of the PCA-transformed data.

import matplotlib.pyplot as plt
import numpy as np

from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer
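The truncated example can be sketched end-to-end as follows; all_but_first_column is a helper named here for illustration, and the synthetic data replaces whatever the original example generated:

import numpy as np
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer

def all_but_first_column(X):
    # Drop the first column (the first principal component after PCA).
    return X[:, 1:]

X, y = make_classification(random_state=0)
pipeline = make_pipeline(PCA(), FunctionTransformer(all_but_first_column))
X_transformed = pipeline.fit_transform(X, y)
print(X_transformed.shape)  # one fewer column than the PCA output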

Underfitting vs. Overfitting

This example demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate, which is a part of the cosine function. In addition, the samples from the real function and the approximations of different models are displayed. The models have polynomial features of different degrees. We can see that a linear function (polynomial with degree 1) is not sufficient to fit the training samples.
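A minimal sketch under stated assumptions: samples are drawn from a noisy cosine, and cross-validated mean squared error stands in for the plots; degree 1 underfits while degree 15 overfits:

import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(0)
X = np.sort(rng.rand(30))[:, np.newaxis]
y = np.cos(1.5 * np.pi * X).ravel() + rng.randn(30) * 0.1

# Fit polynomials of increasing degree and compare held-out error.
for degree in (1, 4, 15):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    scores = cross_val_score(model, X, y,
                             scoring='neg_mean_squared_error', cv=10)
    print('degree %2d: cross-validated MSE = %.4f' % (degree, -scores.mean()))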

Univariate Feature Selection

An example showing univariate feature selection. Noisy (non-informative) features are added to the iris data and univariate feature selection is applied. For each feature, we plot the p-values for the univariate feature selection and the corresponding weights of an SVM. We can see that univariate feature selection selects the informative features and that these have larger SVM weights. In the total set of features, only the first 4 are significant, and we can see that they have the highest score with univariate feature selection.
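A hedged sketch of the selection step, assuming 20 uniform-noise columns appended to the four iris features (the exact count in the original example may differ); SelectKBest with the ANOVA F-test scores each feature independently:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

iris = load_iris()
rng = np.random.RandomState(0)

# Append 20 uninformative noise columns to the 4 real features.
X = np.hstack([iris.data, rng.uniform(size=(len(iris.data), 20))])
y = iris.target

selector = SelectKBest(f_classif, k=4)
selector.fit(X, y)
print(selector.scores_[:4])            # informative features: high F-scores
print(selector.scores_[4:].max())      # noise features score much lower
print(selector.get_support(indices=True))  # indices of the selected features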