Automatic Relevance Determination Regression

Fit a regression model with Automatic Relevance Determination (ARD). See Bayesian Ridge Regression for more information on the regressor. Compared to the OLS (ordinary least squares) estimator, the coefficient weights are slightly shifted toward zeros, which stabilises them. The histogram of the estimated weights is very peaked, as a sparsity-inducing prior is implied on the weights. The estimation of the model is done by iteratively maximizing the marginal log-likelihood of the observations.
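
A minimal sketch of the comparison described above, assuming synthetic data in which only a few coefficients are truly non-zero (the dataset shape and noise level are illustrative choices, not taken from the example):

import numpy as np
from sklearn.linear_model import ARDRegression, LinearRegression

rng = np.random.RandomState(0)
n_samples, n_features = 100, 50
X = rng.randn(n_samples, n_features)
w = np.zeros(n_features)
w[:5] = rng.randn(5)                      # only a few informative weights
y = np.dot(X, w) + 0.1 * rng.randn(n_samples)

ard = ARDRegression().fit(X, y)           # sparsity-inducing prior on weights
ols = LinearRegression().fit(X, y)        # OLS baseline for comparison
print(ard.coef_[:10])                     # most ARD weights shrink toward zero
print(ols.coef_[:10])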

Gaussian Mixture Model Ellipsoids

Plot the confidence ellipsoids of a mixture of two Gaussians obtained with Expectation Maximisation (GaussianMixture class) and Variational Inference (BayesianGaussianMixture class models with a Dirichlet process prior). Both models have access to five components with which to fit the data. Note that the Expectation Maximisation model will necessarily use all five components while the Variational Inference model will effectively only use as many as are needed for a good fit. Here we can see that the Expectation Maximisation model splits some components arbitrarily, while the Variational Inference model adapts its effective number of components to the data automatically.
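
A minimal sketch of fitting both models, assuming toy two-blob data of our own making; with five available components, the VI model's weights_ show which components it effectively switched off:

import numpy as np
from sklearn.mixture import GaussianMixture, BayesianGaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(200, 2), rng.randn(200, 2) + [4, 4]])  # two blobs

em = GaussianMixture(n_components=5, covariance_type='full').fit(X)
vi = BayesianGaussianMixture(
    n_components=5,
    weight_concentration_prior_type='dirichlet_process').fit(X)

print(em.weights_)  # EM spreads mass over all five components
print(vi.weights_)  # VI drives unneeded components' weights toward zero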

gaussian_process.GaussianProcessRegressor()

class sklearn.gaussian_process.GaussianProcessRegressor(kernel=None, alpha=1e-10, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, normalize_y=False, copy_X_train=True, random_state=None) [source] Gaussian process regression (GPR). The implementation is based on Algorithm 2.1 of Gaussian Processes for Machine Learning (GPML) by Rasmussen and Williams. In addition to the standard scikit-learn estimator API, GaussianProcessRegressor allows prediction without prior fitting (based on the GP prior).
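
A minimal sketch of basic usage, assuming an RBF kernel and toy sine data; note the prediction drawn before fitting, which comes from the GP prior:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 10, 20).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=1e-10)
y_prior, std_prior = gpr.predict(X, return_std=True)  # from the GP prior, no fit needed
gpr.fit(X, y)                                         # tunes hyperparameters on the log-marginal-likelihood
y_post, std_post = gpr.predict(X, return_std=True)    # posterior mean and pointwise std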

Gaussian process regression on Mauna Loa CO2 data.

This example is based on Section 5.4.3 of "Gaussian Processes for Machine Learning" [RW2006]. It illustrates an example of complex kernel engineering and hyperparameter optimization using gradient ascent on the log-marginal-likelihood. The data consists of the monthly average atmospheric CO2 concentrations (in parts per million by volume, ppmv) collected at the Mauna Loa Observatory in Hawaii between 1958 and 1997. The objective is to model the CO2 concentration as a function of time t.
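
A sketch of the kind of composite kernel such an example engineers: a long-term trend term, a seasonal periodic term, and a noise term. The hyperparameter values below are placeholders rather than the tuned ones, and X_years / y_ppmv are hypothetical names for the data arrays:

from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import (RBF, ExpSineSquared,
                                              WhiteKernel, ConstantKernel)

k_trend = ConstantKernel(50.0) * RBF(length_scale=50.0)    # smooth long-term rise
k_seasonal = RBF(length_scale=100.0) * ExpSineSquared(
    length_scale=1.0, periodicity=1.0)                     # yearly cycle
k_noise = WhiteKernel(noise_level=0.1)                     # measurement noise

kernel = k_trend + k_seasonal + k_noise
gpr = GaussianProcessRegressor(kernel=kernel, normalize_y=True)
# gpr.fit(X_years, y_ppmv) would optimize all kernel hyperparameters by
# gradient ascent on the log-marginal-likelihood (L-BFGS by default).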

cluster.AgglomerativeClustering()

class sklearn.cluster.AgglomerativeClustering(n_clusters=2, affinity='euclidean', memory=Memory(cachedir=None), connectivity=None, compute_full_tree='auto', linkage='ward', pooling_func=<function mean>) [source] Agglomerative Clustering. Recursively merges the pair of clusters that minimally increases a given linkage distance. Read more in the User Guide. Parameters: n_clusters : int, default=2 The number of clusters to find. connectivity : array-like or callable, optional Connectivity matrix. Defines for each sample the neighboring samples following a given structure of the data.
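
A minimal sketch of usage, assuming toy two-blob data:

import numpy as np
from sklearn.cluster import AgglomerativeClustering

rng = np.random.RandomState(0)
X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + [5, 5]])  # two blobs

model = AgglomerativeClustering(n_clusters=2, linkage='ward')
labels = model.fit_predict(X)  # recursive merges until two clusters remain
print(labels[:10])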

neural_network.MLPClassifier()

class sklearn.neural_network.MLPClassifier(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08) [source] Multi-layer Perceptron classifier. This model optimizes the log-loss function using LBFGS or stochastic gradient descent.
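
A minimal sketch of fitting the classifier, assuming a synthetic dataset; the hidden-layer size and max_iter are illustrative choices:

from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=200, random_state=0)
clf = MLPClassifier(hidden_layer_sizes=(100,), activation='relu',
                    solver='adam', max_iter=500, random_state=0)
clf.fit(X, y)            # minimizes log-loss with the Adam solver
print(clf.score(X, y))   # mean accuracy on the training data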

sklearn.cluster.ward_tree()

sklearn.cluster.ward_tree(X, connectivity=None, n_clusters=None, return_distance=False) [source] Ward clustering based on a feature matrix. Recursively merges the pair of clusters that minimally increases within-cluster variance. The inertia matrix uses a Heapq-based representation. This is the structured version, which takes into account some topological structure between samples. Read more in the User Guide. Parameters: X : array, shape (n_samples, n_features) Feature matrix representing n_samples samples to be clustered.
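
A minimal sketch of calling ward_tree directly (in practice most users go through AgglomerativeClustering, which wraps it); the random feature matrix is an illustrative assumption:

import numpy as np
from sklearn.cluster import ward_tree

rng = np.random.RandomState(0)
X = rng.randn(20, 3)  # 20 samples, 3 features

children, n_components, n_leaves, parents = ward_tree(X)
print(children[:5])   # each row: the pair of clusters merged at that step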

sklearn.feature_selection.f_regression()

sklearn.feature_selection.f_regression(X, y, center=True) [source] Univariate linear regression tests. Quick linear model for testing the effect of a single regressor, sequentially for many regressors. This is done in 2 steps: 1. The cross correlation between each regressor and the target is computed, that is, ((X[:, i] - mean(X[:, i])) * (y - mean_y)) / (std(X[:, i]) * std(y)). 2. It is converted to an F score, then to a p-value. Read more in the User Guide. Parameters: X : {array-like, sparse matrix} of shape (n_samples, n_features) The set of regressors that will be tested sequentially.
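
A minimal sketch, assuming a target that depends on only the first of three regressors, so its F score dominates:

import numpy as np
from sklearn.feature_selection import f_regression

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
y = 2.0 * X[:, 0] + 0.1 * rng.randn(100)  # y depends on feature 0 only

F, pvalues = f_regression(X, y, center=True)
print(F)        # F score is largest for feature 0
print(pvalues)  # and its p-value correspondingly smallest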

Compare cross decomposition methods

Simple usage of various cross decomposition algorithms:
- PLSCanonical
- PLSRegression, with multivariate response, a.k.a. PLS2
- PLSRegression, with univariate response, a.k.a. PLS1
- CCA
Given two multivariate covarying two-dimensional datasets, X and Y, PLS extracts the "directions of covariance", i.e. the components of each dataset that explain the most shared variance between both datasets. This is apparent on the scatterplot matrix display: components 1 in dataset X and dataset Y are maximally correlated.
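
A minimal sketch of PLSCanonical on two datasets built from a shared latent signal (the data generation is an illustrative assumption mirroring the setup described above):

import numpy as np
from sklearn.cross_decomposition import PLSCanonical

rng = np.random.RandomState(0)
latent = rng.randn(500, 2)                  # shared latent signal
X = np.hstack([latent, rng.randn(500, 2)])  # view 1: latent + noise dims
Y = np.hstack([latent, rng.randn(500, 2)])  # view 2: latent + noise dims

pls = PLSCanonical(n_components=2).fit(X, Y)
X_scores, Y_scores = pls.transform(X, Y)
# Corresponding score columns are maximally correlated across the datasets.
print(np.corrcoef(X_scores[:, 0], Y_scores[:, 0])[0, 1])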

sklearn.datasets.fetch_20newsgroups()

sklearn.datasets.fetch_20newsgroups(data_home=None, subset='train', categories=None, shuffle=True, random_state=42, remove=(), download_if_missing=True) [source] Load the filenames and data from the 20 newsgroups dataset. Read more in the User Guide. Parameters: subset : 'train' or 'test', 'all', optional Select the dataset to load: 'train' for the training set, 'test' for the test set, 'all' for both, with shuffled ordering. data_home : optional, default: None Specify a download and cache folder for the datasets.
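
A minimal sketch of loading a two-category subset (sci.space and rec.autos are real newsgroup labels, chosen here just for illustration); the first call downloads the data when it is missing:

from sklearn.datasets import fetch_20newsgroups

train = fetch_20newsgroups(subset='train',
                           categories=['sci.space', 'rec.autos'],
                           remove=('headers', 'footers', 'quotes'))
print(len(train.data))      # number of documents loaded
print(train.target_names)   # category names in label order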