Scaling the regularization parameter for SVCs

The following example illustrates the effect of scaling the regularization parameter when using Support Vector Machines for classification. For SVC classification, we are interested in a risk minimization for the equation:

C \sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i) + \Omega(w)

where C is used to set the amount of regularization, \mathcal{L} is a loss function of our samples and our model parameters, and \Omega is a penalty function of our model parameters. If we consider the loss function to be the individual error per sample, then the data-fit term, or the sum of the error for each sample, will increase as we add more samples.
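A minimal sketch of the idea (not the gallery script itself; the l1-penalized LinearSVC and the sample sizes below are illustrative choices): as the training set grows, the summed loss grows with it, so dividing C by the number of samples keeps the effective amount of regularization comparable:

from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=600, n_features=20, random_state=0)

for n in (100, 300, 600):                                  # growing training-set sizes
    for label, C in (("raw", 1.0), ("scaled", 1.0 / n)):   # C divided by n_samples
        clf = LinearSVC(C=C, penalty="l1", dual=False, random_state=0)
        score = cross_val_score(clf, X[:n], y[:n], cv=3).mean()
        print("n=%d  C(%s)=%.4g  accuracy=%.3f" % (n, label, C, score))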

Scalability of Approximate Nearest Neighbors

This example studies the scalability profile of approximate 10-neighbors queries using LSHForest with n_estimators=20 and n_candidates=200 when varying the number of samples in the dataset. The first plot demonstrates the relationship between query time and index size of LSHForest. Query time is compared with the brute force method of exact nearest neighbor search for the same index sizes. Brute force queries have a very predictable linear scalability with the index size (full scan). The LSHForest index has a sub-linear scalability profile, but can be slower than brute force on small datasets.
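A minimal sketch of such a comparison, assuming an older scikit-learn release (<= 0.19) in which sklearn.neighbors.LSHForest is still available (it was later deprecated and removed); the dataset size is illustrative:

import time
from sklearn.datasets import make_blobs
from sklearn.neighbors import LSHForest, NearestNeighbors

X, _ = make_blobs(n_samples=10000, n_features=50, random_state=0)
queries = X[:10]

lshf = LSHForest(n_estimators=20, n_candidates=200, random_state=0).fit(X)
brute = NearestNeighbors(n_neighbors=10, algorithm='brute').fit(X)

t0 = time.time()
lshf.kneighbors(queries, n_neighbors=10)        # approximate 10-neighbors queries
t_approx = time.time() - t0

t0 = time.time()
brute.kneighbors(queries)                       # exact queries via full scan
t_exact = time.time() - t0

print("approximate: %.4fs   exact: %.4fs" % (t_approx, t_exact))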

Sample pipeline for text feature extraction and evaluation

The dataset used in this example is the 20 newsgroups dataset, which will be automatically downloaded, then cached and reused for the document classification example. You can adjust the number of categories by giving their names to the dataset loader, or by setting them to None to get all 20 of them. Here is the beginning of a sample output of a run on a quad-core machine:

Loading 20 newsgroups dataset for categories:
['alt.atheism', 'talk.religion.misc']
1427 documents
2 categories
Performing grid search...
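A condensed sketch of the pipeline being evaluated; the parameter grid below is a small illustrative subset, not the full grid from the example:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

categories = ['alt.atheism', 'talk.religion.misc']
data = fetch_20newsgroups(subset='train', categories=categories)

pipeline = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier()),
])
parameters = {
    'vect__ngram_range': [(1, 1), (1, 2)],      # unigrams vs. unigrams + bigrams
    'clf__alpha': (1e-4, 1e-5),
}
grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(data.data, data.target)
print("Best score: %0.3f" % grid_search.best_score_)
print("Best parameters:", grid_search.best_params_)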

Robust vs Empirical covariance estimate

The usual covariance maximum likelihood estimate is very sensitive to the presence of outliers in the data set. In such a case, it would be better to use a robust estimator of covariance to guarantee that the estimation is resistant to "erroneous" observations in the data set. The Minimum Covariance Determinant estimator is a robust, high-breakdown point estimator of covariance (i.e. it can be used to estimate the covariance matrix of highly contaminated datasets, up to (n_samples - n_features - 1) / 2 outliers).
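A minimal sketch comparing the two estimators on contaminated data (the contamination scheme below is illustrative):

import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)
n_samples, n_features, n_outliers = 100, 2, 20

X = rng.randn(n_samples, n_features)
X[-n_outliers:] = rng.uniform(low=5, high=10, size=(n_outliers, n_features))

emp = EmpiricalCovariance().fit(X)      # classical MLE, pulled towards the outliers
mcd = MinCovDet(random_state=0).fit(X)  # robust, high-breakdown estimate

print("Empirical covariance:\n", emp.covariance_)
print("MCD covariance:\n", mcd.covariance_)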

Robust linear model estimation using RANSAC

In this example we see how to robustly fit a linear model to faulty data using the RANSAC algorithm.

Out:
Estimated coefficients (true, normal, RANSAC):
82.1903908408 [ 54.17236387] [ 82.08533159]

import numpy as np
from matplotlib import pyplot as plt

from sklearn import linear_model, datasets

n_samples = 1000
n_outliers = 50

X, y, coef = datasets.make_regression(n_samples=n_samples, n_features=1,
                                      n_informative=1, noise=10,
                                      coef=True, random_state=0)
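A sketch of how the fit might continue from the snippet above (the outlier injection and the default RANSAC settings are illustrative, not necessarily the exact values used in the full script):

# Corrupt a subset of the data with outliers
np.random.seed(0)
X[:n_outliers] = 3 + 0.5 * np.random.normal(size=(n_outliers, 1))
y[:n_outliers] = -3 + 10 * np.random.normal(size=n_outliers)

# Ordinary least squares on all data (affected by the outliers)
model = linear_model.LinearRegression()
model.fit(X, y)

# Robust fit with RANSAC: only the consensus set of inliers drives the model
model_ransac = linear_model.RANSACRegressor()
model_ransac.fit(X, y)
inlier_mask = model_ransac.inlier_mask_

print(coef, model.coef_, model_ransac.estimator_.coef_)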

Robust linear estimator fitting

Here a sine function is fit with a polynomial of order 3, for values close to zero. Robust fitting is demoed in different situations: no measurement errors, only modelling errors (fitting a sine with a polynomial); measurement errors in X; measurement errors in y. The median absolute deviation to non-corrupt new data is used to judge the quality of the prediction. What we can see is that RANSAC is good for strong outliers in the y direction, while TheilSen is good for small outliers, both in direction X and y, but has a break point above which it performs worse than OLS.
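A minimal sketch of this setting, assuming a cubic feature expansion via PolynomialFeatures; the data generation and corruption below are simplified relative to the full example:

import numpy as np
from sklearn.linear_model import LinearRegression, RANSACRegressor, TheilSenRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
x = rng.normal(size=400)
y = np.sin(x)
X = x[:, np.newaxis]
y[-20:] += 6 * rng.normal(size=20)              # strong outliers in the y direction

x_plot = np.linspace(-2, 2, 100)
X_plot = x_plot[:, np.newaxis]

for name, estimator in [("OLS", LinearRegression()),
                        ("Theil-Sen", TheilSenRegressor(random_state=42)),
                        ("RANSAC", RANSACRegressor(random_state=42))]:
    model = make_pipeline(PolynomialFeatures(3), estimator)
    model.fit(X, y)
    # median absolute deviation of the prediction error on non-corrupt data
    mad = np.median(np.abs(model.predict(X_plot) - np.sin(x_plot)))
    print("%-10s MAD to non-corrupt data: %.3f" % (name, mad))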

Robust Scaling on Toy Data

Making sure that each feature has approximately the same scale can be a crucial preprocessing step. However, when data contains outliers, StandardScaler can often be misled. In such cases, it is better to use a scaler that is robust against outliers. Here, we demonstrate this on a toy dataset, where one single datapoint is a large outlier.

Out:
Testset accuracy using standard scaler: 0.545
Testset accuracy using robust scaler: 0.705

from __future__ import print_function
print(__doc__)
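A minimal sketch of the comparison; the classifier, the synthetic blobs and the injected outlier are illustrative stand-ins for the toy data used in the example:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

X, y = make_blobs(n_samples=200, centers=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
X_train[0] = [1000., 1000.]                     # a single extreme outlier in the training set

for name, scaler in [("standard", StandardScaler()), ("robust", RobustScaler())]:
    clf = make_pipeline(scaler, LogisticRegression())
    clf.fit(X_train, y_train)
    print("Testset accuracy using %s scaler: %.3f" % (name, clf.score(X_test, y_test)))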

Recursive feature elimination with cross-validation

A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.

Out:
Optimal number of features : 3

print(__doc__)

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)
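A sketch of how the selection step continues from the imports above: a linear-kernel SVC is ranked by RFECV with stratified 2-fold cross-validation (the CV choice is illustrative, and grid_scores_ is the attribute name in the scikit-learn release this example dates from):

svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

# Plot number of features selected vs. cross-validation score
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()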

Robust covariance estimation and Mahalanobis distances relevance

An example to show covariance estimation with the Mahalanobis distances on Gaussian distributed data. For Gaussian distributed data, the distance of an observation x_i to the mode of the distribution can be computed using its Mahalanobis distance:

d_{(\mu, \Sigma)}(x_i)^2 = (x_i - \mu)^{T} \Sigma^{-1} (x_i - \mu)

where \mu and \Sigma are the location and the covariance of the underlying Gaussian distribution. In practice, \mu and \Sigma are replaced by some estimates. The usual covariance maximum likelihood estimate is very sensitive to the presence of outliers in the data set, and so are the Mahalanobis distances computed from it.
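A minimal sketch of this comparison; the sample sizes and the contamination below are illustrative:

import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)
X = rng.randn(125, 2)
X[-25:] = rng.randn(25, 2) * 5 + 10             # outlying observations

mle = EmpiricalCovariance().fit(X)
robust = MinCovDet(random_state=0).fit(X)

# squared Mahalanobis distances of each observation under the two fits
d_mle = mle.mahalanobis(X)
d_robust = robust.mahalanobis(X)
print("mean squared distance of outliers (MLE): %.1f" % d_mle[-25:].mean())
print("mean squared distance of outliers (MCD): %.1f" % d_robust[-25:].mean())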

Restricted Boltzmann Machine features for digit classification

For greyscale image data where pixel values can be interpreted as degrees of blackness on a white background, like handwritten digit recognition, the Bernoulli Restricted Boltzmann Machine model (BernoulliRBM) can perform effective non-linear feature extraction. In order to learn good latent representations from a small dataset, we artificially generate more labeled data by perturbing the training data with linear shifts of 1 pixel in each direction. This example shows how to build a classification pipeline with a BernoulliRBM feature extractor and a LogisticRegression classifier.
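A compact sketch of such a pipeline on the digits dataset; the hyperparameters below are illustrative rather than the tuned values used in the full example:

from sklearn.datasets import load_digits
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import BernoulliRBM
from sklearn.pipeline import Pipeline

digits = load_digits()
X = digits.data / 16.0                          # scale pixel values to [0, 1]
y = digits.target

rbm = BernoulliRBM(n_components=100, learning_rate=0.06, n_iter=10, random_state=0)
logistic = LogisticRegression(C=6000.0)
classifier = Pipeline([('rbm', rbm), ('logistic', logistic)])
classifier.fit(X, y)
print("Training accuracy: %.3f" % classifier.score(X, y))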