Train error vs Test error

Illustration of how the performance of an estimator on unseen data (test data) is not the same as its performance on training data. As the regularization increases, the performance on the training set decreases, while the performance on the test set is optimal within a range of values of the regularization parameter. The example uses an Elastic-Net regression model, and performance is measured using the explained variance, a.k.a. R^2.
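Below is a minimal sketch of the idea, not the original example's code: it fits an Elastic-Net model over a grid of regularization strengths on synthetic data (the data, grid, and variable names are illustrative) and records R^2 on both splits.

    import numpy as np
    from sklearn.datasets import make_regression
    from sklearn.linear_model import ElasticNet
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=200, n_features=50, noise=10.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    alphas = np.logspace(-4, 1, 30)
    train_scores, test_scores = [], []
    for alpha in alphas:
        model = ElasticNet(alpha=alpha, l1_ratio=0.5).fit(X_train, y_train)
        train_scores.append(model.score(X_train, y_train))  # R^2 on train
        test_scores.append(model.score(X_test, y_test))     # R^2 on test

    best_alpha = alphas[np.argmax(test_scores)]  # test score peaks at an intermediate alpha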

sklearn.metrics.classification_report()

sklearn.metrics.classification_report(y_true, y_pred, labels=None, target_names=None, sample_weight=None, digits=2)

Build a text report showing the main classification metrics. Read more in the User Guide.

Parameters:
    y_true : 1d array-like, or label indicator array / sparse matrix
        Ground truth (correct) target values.
    y_pred : 1d array-like, or label indicator array / sparse matrix
        Estimated targets as returned by a classifier.
    labels : array, shape = [n_labels]
        Optional list of label indices to include in the report.
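A minimal usage sketch on toy labels (the data below is illustrative):

    from sklearn.metrics import classification_report

    y_true = [0, 1, 2, 2, 0]
    y_pred = [0, 0, 2, 2, 1]
    # per-class precision, recall, F1-score, and support, printed as text
    print(classification_report(y_true, y_pred,
                                target_names=['class 0', 'class 1', 'class 2']))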

feature_extraction.text.TfidfVectorizer()

class sklearn.feature_extraction.text.TfidfVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer=u'word', stop_words=None, token_pattern=u'(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.int64'>, norm=u'l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Convert a collection of raw documents to a matrix of TF-IDF features.
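A minimal sketch on three toy documents (corpus and variable names are illustrative); get_feature_names matches the release documented here and was replaced by get_feature_names_out in later releases:

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["the cat sat on the mat",
            "the dog sat on the log",
            "cats and dogs"]
    vectorizer = TfidfVectorizer(stop_words='english')
    X = vectorizer.fit_transform(docs)     # sparse matrix, shape (n_docs, n_terms)
    print(vectorizer.get_feature_names())  # learned vocabulary
    print(X.shape)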

gaussian_process.kernels.PairwiseKernel()

class sklearn.gaussian_process.kernels.PairwiseKernel(gamma=1.0, gamma_bounds=(1e-05, 100000.0), metric='linear', pairwise_kernels_kwargs=None)

Wrapper for kernels in sklearn.metrics.pairwise. A thin wrapper around the functionality of the kernels in sklearn.metrics.pairwise. Note: evaluation of eval_gradient is not analytic but numeric, and all kernels support only isotropic distances. The parameter gamma is considered to be a hyperparameter and may be optimized. The other kernel parameters are set directly at initialization and are kept fixed.
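A minimal sketch, using the iris dataset as a stand-in problem: wrap the pairwise 'rbf' metric and let the Gaussian process optimize gamma.

    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import PairwiseKernel

    X, y = load_iris(return_X_y=True)
    kernel = PairwiseKernel(gamma=1.0, metric='rbf')  # gamma is tunable; other params stay fixed
    gpc = GaussianProcessClassifier(kernel=kernel).fit(X, y)
    print(gpc.kernel_)  # gamma after hyperparameter optimization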

Manifold learning on handwritten digits

An illustration of various embeddings on the digits dataset. The RandomTreesEmbedding, from the sklearn.ensemble module, is not technically a manifold embedding method, as it learns a high-dimensional representation on which we apply a dimensionality reduction method. However, it is often useful to cast a dataset into a representation in which the classes are linearly separable. t-SNE will be initialized with the embedding that is generated by PCA in this example, which is not the default setting.
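A minimal sketch of one embedding from the example: t-SNE on the digits data with PCA initialization (parameter values here are illustrative, not the example's exact settings).

    from sklearn.datasets import load_digits
    from sklearn.manifold import TSNE

    digits = load_digits()
    X, y = digits.data, digits.target

    # init='pca' replaces the default random initialization
    X_2d = TSNE(n_components=2, init='pca', random_state=0).fit_transform(X)
    print(X_2d.shape)  # (n_samples, 2)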

decomposition.LatentDirichletAllocation()

class sklearn.decomposition.LatentDirichletAllocation(n_topics=10, doc_topic_prior=None, topic_word_prior=None, learning_method=None, learning_decay=0.7, learning_offset=10.0, max_iter=10, batch_size=128, evaluate_every=-1, total_samples=1000000.0, perp_tol=0.1, mean_change_tol=0.001, max_doc_update_iter=100, n_jobs=1, verbose=0, random_state=None)

Latent Dirichlet Allocation with online variational Bayes algorithm. New in version 0.17. Read more in the User Guide.

Parameters:
    n_topics : int, optional (default=10)
        Number of topics.
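A minimal sketch on a toy corpus (documents and names are illustrative). It uses the n_topics parameter shown in the signature above; later scikit-learn releases rename it to n_components.

    from sklearn.decomposition import LatentDirichletAllocation
    from sklearn.feature_extraction.text import CountVectorizer

    docs = ["apples and oranges", "oranges and bananas",
            "cars and trucks", "trucks and buses"]
    counts = CountVectorizer().fit_transform(docs)  # term-count matrix

    lda = LatentDirichletAllocation(n_topics=2, learning_method='online',
                                    random_state=0)
    doc_topics = lda.fit_transform(counts)  # shape (n_docs, n_topics)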

2.3. Clustering

Clustering of unlabeled data can be performed with the module sklearn.cluster. Each clustering algorithm comes in two variants: a class that implements the fit method to learn the clusters on train data, and a function that, given train data, returns an array of integer labels corresponding to the different clusters. For the class, the labels over the training data can be found in the labels_ attribute. A sketch of both variants follows below.

Input data

One important thing to note is that the algorithms implemented in this module can take different kinds of matrix as input.
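A minimal sketch of the two variants on toy points (the data is illustrative), using KMeans as the class and k_means as the function:

    import numpy as np
    from sklearn.cluster import KMeans, k_means

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [10, 2], [10, 4], [10, 0]])

    km = KMeans(n_clusters=2, random_state=0).fit(X)
    print(km.labels_)  # class variant: integer labels stored on the estimator

    centers, labels, inertia = k_means(X, n_clusters=2, random_state=0)
    print(labels)      # function variant: labels returned directly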

sklearn.metrics.average_precision_score()

sklearn.metrics.average_precision_score(y_true, y_score, average='macro', sample_weight=None)

Compute average precision (AP) from prediction scores. This score corresponds to the area under the precision-recall curve. Note: this implementation is restricted to the binary classification task or multilabel classification task. Read more in the User Guide.

Parameters:
    y_true : array, shape = [n_samples] or [n_samples, n_classes]
        True binary labels in binary label indicators.
    y_score : array, shape = [n_samples] or [n_samples, n_classes]
        Target scores, can either be probability estimates of the positive class, confidence values, or non-thresholded measure of decisions.
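A minimal sketch on a toy binary problem (labels and scores are illustrative):

    import numpy as np
    from sklearn.metrics import average_precision_score

    y_true = np.array([0, 0, 1, 1])
    y_scores = np.array([0.1, 0.4, 0.35, 0.8])
    # area under the precision-recall curve for these scores
    print(average_precision_score(y_true, y_scores))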

Plot multi-class SGD on the iris dataset

Plot the decision surface of multi-class SGD on the iris dataset. The hyperplanes corresponding to the three one-versus-all (OVA) classifiers are represented by the dashed lines.

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets
    from sklearn.linear_model import SGDClassifier

    # import some data to play with
    iris = datasets.load_iris()
    X = iris.data[:, :2]  # we only take the first two features. We could
                          # avoid this ugly slicing by using a two-dim dataset
    y = iris.target

gaussian_process.GaussianProcessClassifier()

class sklearn.gaussian_process.GaussianProcessClassifier(kernel=None, optimizer='fmin_l_bfgs_b', n_restarts_optimizer=0, max_iter_predict=100, warm_start=False, copy_X_train=True, random_state=None, multi_class='one_vs_rest', n_jobs=1)

Gaussian process classification (GPC) based on Laplace approximation. The implementation is based on Algorithm 3.1, 3.2, and 5.1 of Gaussian Processes for Machine Learning (GPML) by Rasmussen and Williams. Internally, the Laplace approximation is used for approximating the non-Gaussian posterior by a Gaussian.
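A minimal sketch on the iris dataset, assuming an RBF kernel (the choice of kernel and data here is illustrative); multi-class problems are handled one-vs-rest by default.

    from sklearn.datasets import load_iris
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF

    X, y = load_iris(return_X_y=True)
    gpc = GaussianProcessClassifier(kernel=1.0 * RBF(1.0),
                                    random_state=0).fit(X, y)
    print(gpc.score(X, y))           # mean accuracy on the training data
    print(gpc.predict_proba(X[:2]))  # class probabilities for two samples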