Receiver Operating Characteristic with cross validation

Example of Receiver Operating Characteristic (ROC) metric to evaluate classifier output quality using cross-validation. ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis. This means that the top left corner of the plot is the "ideal" point - a false positive rate of zero, and a true positive rate of one. This is not very realistic, but it does mean that a larger area under the curve (AUC) is usually better. The "steepness" of ROC curves is also important, since it is ideal to maximize the true positive rate while minimizing the false positive rate.
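
A minimal sketch of the idea, not the full example: compute a ROC curve and AUC on each fold of a cross-validation loop. The iris dataset restricted to two classes, the fold count and the linear SVC are illustrative choices.

import numpy as np
from sklearn import datasets, svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

X, y = datasets.load_iris(return_X_y=True)
X, y = X[y != 2], y[y != 2]          # keep two classes so the ROC is binary

cv = StratifiedKFold(n_splits=6)
classifier = svm.SVC(kernel='linear', probability=True, random_state=0)

for train, test in cv.split(X, y):
    # fit on the training fold, score the held-out fold
    probas_ = classifier.fit(X[train], y[train]).predict_proba(X[test])
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    print("fold AUC: %0.2f" % auc(fpr, tpr))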

Explicit feature map approximation for RBF kernels

An example illustrating the approximation of the feature map of an RBF kernel. It shows how to use RBFSampler and Nystroem to approximate the feature map of an RBF kernel for classification with an SVM on the digits dataset. Results using a linear SVM in the original space, a linear SVM using the approximate mappings and using a kernelized SVM are compared. Timings and accuracy for varying amounts of Monte Carlo samplings (in the case of RBFSampler, which uses random Fourier features) and different sized subsets of the training set (for Nystroem) for the approximate mapping are shown.
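
A condensed sketch of the pattern, assuming the digits dataset and arbitrary gamma / n_components values chosen for illustration: map the data with RBFSampler or Nystroem and fit a linear SVM on the approximate feature space.

from sklearn.datasets import load_digits
from sklearn.kernel_approximation import RBFSampler, Nystroem
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import train_test_split

X, y = load_digits(return_X_y=True)
X = X / 16.                           # scale pixel values to [0, 1]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for name, sampler in [("RBFSampler", RBFSampler(gamma=0.2, n_components=300, random_state=1)),
                      ("Nystroem", Nystroem(gamma=0.2, n_components=300, random_state=1))]:
    # approximate kernel map followed by a linear SVM
    clf = make_pipeline(sampler, LinearSVC())
    clf.fit(X_train, y_train)
    print(name, "accuracy:", clf.score(X_test, y_test))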

sklearn.metrics.accuracy_score()

sklearn.metrics.accuracy_score(y_true, y_pred, normalize=True, sample_weight=None) [source]
Accuracy classification score. In multilabel classification, this function computes subset accuracy: the set of labels predicted for a sample must exactly match the corresponding set of labels in y_true. Read more in the User Guide.
Parameters:
y_true : 1d array-like, or label indicator array / sparse matrix
    Ground truth (correct) labels.
y_pred : 1d array-like, or label indicator array / sparse matrix
    Predicted labels, as returned by a classifier.
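
A short usage sketch; the label values below are illustrative.

import numpy as np
from sklearn.metrics import accuracy_score

y_true = [0, 1, 2, 3]
y_pred = [0, 2, 1, 3]
print(accuracy_score(y_true, y_pred))                   # 0.5, fraction of exact matches
print(accuracy_score(y_true, y_pred, normalize=False))  # 2, count of correct samples

# multilabel case: a sample counts only if its full label set matches
print(accuracy_score(np.array([[0, 1], [1, 1]]), np.ones((2, 2))))  # 0.5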

sklearn.datasets.make_spd_matrix()

sklearn.datasets.make_spd_matrix(n_dim, random_state=None) [source]
Generate a random symmetric, positive-definite matrix. Read more in the User Guide.
Parameters:
n_dim : int
    The matrix dimension.
random_state : int, RandomState instance or None, optional (default=None)
    If int, random_state is the seed used by the random number generator; if RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
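
A quick sketch that draws a matrix and checks the advertised properties; the dimension and seed are arbitrary.

import numpy as np
from sklearn.datasets import make_spd_matrix

A = make_spd_matrix(n_dim=4, random_state=42)
print(A.shape)                              # (4, 4)
print(np.allclose(A, A.T))                  # symmetric
print((np.linalg.eigvalsh(A) > 0).all())    # all eigenvalues positive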

cross_validation.LabelShuffleSplit()

Warning: DEPRECATED
class sklearn.cross_validation.LabelShuffleSplit(labels, n_iter=5, test_size=0.2, train_size=None, random_state=None) [source]
Shuffle-Labels-Out cross-validation iterator.
Deprecated since version 0.18: This module will be removed in 0.20. Use sklearn.model_selection.GroupShuffleSplit instead.
Provides randomized train/test indices to split data according to a third-party provided label. This label information can be used to encode arbitrary domain specific stratifications of the samples as integers.
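
A sketch of the recommended replacement, sklearn.model_selection.GroupShuffleSplit; the toy group labels below are illustrative (e.g. one group per subject). The same group never appears in both the train and the test split.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(8).reshape(8, 1)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])

gss = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in gss.split(X, y, groups=groups):
    print("train groups:", set(groups[train_idx]), "test groups:", set(groups[test_idx]))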

Out-of-core classification of text documents

This is an example showing how scikit-learn can be used for classification using an out-of-core approach: learning from data that doesn't fit into main memory. We make use of an online classifier, i.e., one that supports the partial_fit method, that will be fed with batches of examples. To guarantee that the feature space remains the same over time we leverage a HashingVectorizer that will project each example into the same feature space. This is especially useful in the case of text classification, where new features (words) may appear in each batch.
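
A minimal sketch of the pattern rather than the full example: hash each mini-batch into a fixed feature space and feed it to an online classifier via partial_fit. The two toy batches, the SGDClassifier choice and the n_features value are illustrative.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

vectorizer = HashingVectorizer(n_features=2 ** 18)   # stateless: same feature space for every batch
clf = SGDClassifier()
all_classes = np.array([0, 1])                       # classes must be known up front for partial_fit

batches = [
    (["spam spam spam", "free offer now"], [1, 1]),
    (["meeting at noon", "see you tomorrow"], [0, 0]),
]
for texts, labels in batches:
    X_batch = vectorizer.transform(texts)
    clf.partial_fit(X_batch, labels, classes=all_classes)

print(clf.predict(vectorizer.transform(["free spam offer"])))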

manifold.Isomap()

class sklearn.manifold.Isomap(n_neighbors=5, n_components=2, eigen_solver='auto', tol=0, max_iter=None, path_method='auto', neighbors_algorithm='auto', n_jobs=1) [source]
Isomap Embedding: non-linear dimensionality reduction through Isometric Mapping. Read more in the User Guide.
Parameters:
n_neighbors : integer
    Number of neighbors to consider for each point.
n_components : integer
    Number of coordinates for the manifold.
eigen_solver : ['auto' | 'arpack' | 'dense']
    'auto' : attempt to choose the most efficient solver for the given problem.
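
A short usage sketch, assuming the digits dataset purely for illustration: embed the data into two coordinates.

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, _ = load_digits(return_X_y=True)
embedding = Isomap(n_neighbors=5, n_components=2)
X_2d = embedding.fit_transform(X)
print(X_2d.shape)   # (1797, 2)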

1.6. Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.
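
A tiny sketch of the supervised classification flavor on toy data: the prediction for a query point is derived from the labels of its closest training samples.

from sklearn.neighbors import KNeighborsClassifier

X = [[0], [1], [2], [3]]
y = [0, 0, 1, 1]
neigh = KNeighborsClassifier(n_neighbors=3)
neigh.fit(X, y)
print(neigh.predict([[1.1]]))        # label voted by the 3 nearest training samples
print(neigh.predict_proba([[0.9]]))  # per-class vote fractions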

cross_validation.ShuffleSplit()

Warning: DEPRECATED
class sklearn.cross_validation.ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None) [source]
Random permutation cross-validation iterator.
Deprecated since version 0.18: This module will be removed in 0.20. Use sklearn.model_selection.ShuffleSplit instead.
Yields indices to split data into training and test sets.
Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
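
A sketch of the recommended replacement, sklearn.model_selection.ShuffleSplit, on toy data:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(5, 2)
y = np.array([0, 1, 0, 1, 0])
ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    print("TRAIN:", train_index, "TEST:", test_index)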

Recursive feature elimination with cross-validation

A recursive feature elimination example with automatic tuning of the number of features selected with cross-validation.
Out: Optimal number of features : 3

print(__doc__)

import matplotlib.pyplot as plt
from sklearn.svm import SVC
from sklearn.model_selection import StratifiedKFold
from sklearn.feature_selection import RFECV
from sklearn.datasets import make_classification

# Build a classification task using 3 informative features
X, y = make_classification(n_samples=1000, n_features=25, n_informative=3,
                           n_redundant=2, n_repeated=0, n_classes=8,
                           n_clusters_per_class=1, random_state=0)
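
A sketch of how the example continues: wrap a linear SVC in RFECV, fit, and plot the cross-validated score against the number of selected features. The grid_scores_ attribute is the one exposed in this generation of scikit-learn; newer releases report the scores differently.

# Fit RFECV with a linear SVM and plot score vs. number of features
svc = SVC(kernel="linear")
rfecv = RFECV(estimator=svc, step=1, cv=StratifiedKFold(2), scoring='accuracy')
rfecv.fit(X, y)
print("Optimal number of features : %d" % rfecv.n_features_)

plt.figure()
plt.xlabel("Number of features selected")
plt.ylabel("Cross validation score (nb of correct classifications)")
plt.plot(range(1, len(rfecv.grid_scores_) + 1), rfecv.grid_scores_)
plt.show()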