Pipelining

The PCA does an unsupervised dimensionality reduction, while the logistic regression does the prediction. We use a GridSearchCV to set the dimensionality of the PCA print(__doc__) # Code source: Ga Varoquaux # Modified for documentation by Jaques Grobler # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt from sklearn import linear_model, decomposition, datasets from sklearn.pipeline import Pipeline from sklearn.model_selection import GridSearchCV logistic = linear

preprocessing.LabelBinarizer()

class sklearn.preprocessing.LabelBinarizer(neg_label=0, pos_label=1, sparse_output=False) [source] Binarize labels in a one-vs-all fashion Several regression and binary classification algorithms are available in the scikit. A simple way to extend these algorithms to the multi-class classification case is to use the so-called one-vs-all scheme. At learning time, this simply consists in learning one regressor or binary classifier per class. In doing so, one needs to convert multi-class labels

Plot classification probability

Plot the classification probability for different classifiers. We use a 3 class dataset, and we classify it with a Support Vector classifier, L1 and L2 penalized logistic regression with either a One-Vs-Rest or multinomial setting, and Gaussian process classification. The logistic regression is not a multiclass classifier out of the box. As a result it can identify only the first class. Out: classif_rate for GPC : 82.666667 classif_rate for L2 logistic (OvR) : 76.666667 classif_rate for

Comparison of F-test and mutual information

This example illustrates the differences between univariate F-test statistics and mutual information. We consider 3 features x_1, x_2, x_3 distributed uniformly over [0, 1], the target depends on them as follows: y = x_1 + sin(6 * pi * x_2) + 0.1 * N(0, 1), that is the third features is completely irrelevant. The code below plots the dependency of y against individual x_i and normalized values of univariate F-tests statistics and mutual information. As F-test captures only linear dependency, i

Agglomerative clustering with and without structure

This example shows the effect of imposing a connectivity graph to capture local structure in the data. The graph is simply the graph of 20 nearest neighbors. Two consequences of imposing a connectivity can be seen. First clustering with a connectivity matrix is much faster. Second, when using a connectivity matrix, average and complete linkage are unstable and tend to create a few clusters that grow very quickly. Indeed, average and complete linkage fight this percolation behavior by consideri

linear_model.HuberRegressor()

class sklearn.linear_model.HuberRegressor(epsilon=1.35, max_iter=100, alpha=0.0001, warm_start=False, fit_intercept=True, tol=1e-05) [source] Linear regression model that is robust to outliers. The Huber Regressor optimizes the squared loss for the samples where |(y - X'w) / sigma| < epsilon and the absolute loss for the samples where |(y - X'w) / sigma| > epsilon, where w and sigma are parameters to be optimized. The parameter sigma makes sure that if y is scaled up or down by a cert

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

This is an example of applying Non-negative Matrix Factorization and Latent Dirichlet Allocation on a corpus of documents and extract additive models of the topic structure of the corpus. The output is a list of topics, each represented as a list of terms (weights are not shown). The default parameters (n_samples / n_features / n_topics) should make the example runnable in a couple of tens of seconds. You can try to increase the dimensions of the problem, but be aware that the time complexity

preprocessing.RobustScaler()

class sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True) [source] Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature (or each sample, depen

sklearn.datasets.make_sparse_uncorrelated()

sklearn.datasets.make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=None) [source] Generate a random regression problem with sparse uncorrelated design This dataset is described in Celeux et al [1]. as: X ~ N(0, 1) y(X) = X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3] Only the first 4 features are informative. The remaining features are useless. Read more in the User Guide. Parameters: n_samples : int, optional (default=100) The number of samples. n_features : int,

Plot the decision surface of a decision tree on the iris dataset

Plot the decision surface of a decision tree trained on pairs of features of the iris dataset. See decision tree for more information on the estimator. For each pair of iris features, the decision tree learns decision boundaries made of combinations of simple thresholding rules inferred from the training samples. print(__doc__) import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import load_iris from sklearn.tree import DecisionTreeClassifier # Parameters n_classes