Precision-Recall

Example of Precision-Recall metric to evaluate classifier output quality. In information retrieval, precision is a measure of result relevancy, while recall is a measure of how many truly relevant results are returned. A high area under the curve represents both high recall and high precision, where high precision relates to a low false positive rate, and high recall relates to a low false negative rate. High scores for both show that the classifier is returning accurate results (high precisio

Discrete versus Real AdaBoost

This example is based on Figure 10.2 from Hastie et al 2009 [1] and illustrates the difference in performance between the discrete SAMME [2] boosting algorithm and real SAMME.R boosting algorithm. Both algorithms are evaluated on a binary classification task where the target Y is a non-linear function of 10 input features. Discrete SAMME AdaBoost adapts based on errors in predicted class labels whereas real SAMME.R uses the predicted class probabilities. [1] T. Hastie, R. Tibshirani and J. Fri

4.1. Pipeline and FeatureUnion

4.1.1. Pipeline: chaining estimators Pipeline can be used to chain multiple estimators into one. This is useful as there is often a fixed sequence of steps in processing the data, for example feature selection, normalization and classification. Pipeline serves two purposes here: Convenience: You only have to call fit and predict once on your data to fit a whole sequence of estimators. Joint parameter selection: You can grid search over parameters of all estimators in the pipeline at once. A

3.2. Tuning the hyper-parameters of an estimator

Hyper-parameters are parameters that are not directly learnt within estimators. In scikit-learn they are passed as arguments to the constructor of the estimator classes. Typical examples include C, kernel and gamma for Support Vector Classifier, alpha for Lasso, etc. It is possible and recommended to search the hyper-parameter space for the best Cross-validation: evaluating estimator performance score. Any parameter provided when constructing an estimator may be optimized in this manner. Speci

GMM covariances

Demonstration of several covariances types for Gaussian mixture models. See Gaussian mixture models for more information on the estimator. Although GMM are often used for clustering, we can compare the obtained clusters with the actual classes from the dataset. We initialize the means of the Gaussians with the means of the classes from the training set to make this comparison valid. We plot predicted labels on both training and held out test data using a variety of GMM covariance types on the

decomposition.ProjectedGradientNMF()

class sklearn.decomposition.ProjectedGradientNMF(*args, **kwargs) [source] Non-Negative Matrix Factorization (NMF) Find two non-negative matrices (W, H) whose product approximates the non- negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction. The objective function is: 0.5 * ||X - WH||_Fro^2 + alpha * l1_ratio * ||vec(W)||_1 + alpha * l1_ratio * ||vec(H)||_1 + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2 + 0.5 * alph

feature_selection.RFECV()

class sklearn.feature_selection.RFECV(estimator, step=1, cv=None, scoring=None, verbose=0, n_jobs=1) [source] Feature ranking with recursive feature elimination and cross-validated selection of the best number of features. Read more in the User Guide. Parameters: estimator : object A supervised learning estimator with a fit method that updates a coef_ attribute that holds the fitted parameters. Important features must correspond to high absolute values in the coef_ array. For instance, th

Scalability of Approximate Nearest Neighbors

This example studies the scalability profile of approximate 10-neighbors queries using the LSHForest with n_estimators=20 and n_candidates=200 when varying the number of samples in the dataset. The first plot demonstrates the relationship between query time and index size of LSHForest. Query time is compared with the brute force method in exact nearest neighbor search for the same index sizes. The brute force queries have a very predictable linear scalability with the index (full scan). LSHFor

tree.DecisionTreeClassifier()

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False) [source] A decision tree classifier. Read more in the User Guide. Parameters: criterion : string, optional (default=?gini?) The function to measure the quality of a split. Supported criteria are ?gini? for the

Adjustment for chance in clustering performance evaluation

The following plots demonstrate the impact of the number of clusters and number of samples on various clustering performance evaluation metrics. Non-adjusted measures such as the V-Measure show a dependency between the number of clusters and the number of samples: the mean V-Measure of random labeling increases significantly as the number of clusters is closer to the total number of samples used to compute the measure. Adjusted for chance measure such as ARI display some random variations cent