cross_validation.ShuffleSplit()

Warning DEPRECATED class sklearn.cross_validation.ShuffleSplit(n, n_iter=10, test_size=0.1, train_size=None, random_state=None) [source] Random permutation cross-validation iterator. Deprecated since version 0.18: This module will be removed in 0.20. Use sklearn.model_selection.ShuffleSplit instead. Yields indices to split data into training and test sets. Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.
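
A minimal sketch of the recommended replacement from sklearn.model_selection; the toy data and split parameters are illustrative:

import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(20).reshape(10, 2)  # toy data, 10 samples

# 5 random train/test splits, holding out 20% of the samples each time
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
for train_idx, test_idx in ss.split(X):
    print("train:", train_idx, "test:", test_idx)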

sklearn.datasets.clear_data_home()

sklearn.datasets.clear_data_home(data_home=None) [source] Delete all the content of the data home cache.
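
A minimal usage sketch; calling it with data_home=None clears the default scikit-learn data cache, while the explicit path below is illustrative:

from sklearn.datasets import clear_data_home

# Remove everything cached under the default data home directory
clear_data_home()

# Or clear a specific cache directory (path is illustrative)
clear_data_home(data_home="/tmp/sklearn_data")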

Gradient Boosting regularization

Illustration of the effect of different regularization strategies for Gradient Boosting. The example is taken from Hastie et al. 2009. The loss function used is binomial deviance. Regularization via shrinkage (learning_rate < 1.0) improves performance considerably. In combination with shrinkage, stochastic gradient boosting (subsample < 1.0) can produce more accurate models by reducing the variance via bagging. Subsampling without shrinkage usually does poorly. Another strategy to reduce the variance is to subsample the features, analogous to the random splits in Random Forests (via the max_features parameter).
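
A small sketch of the strategies described above, assuming GradientBoostingClassifier on the Hastie et al. synthetic dataset; the parameter values are illustrative, not the ones used in the example:

from sklearn.datasets import make_hastie_10_2
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split

X, y = make_hastie_10_2(n_samples=4000, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=1)

for params in ({"learning_rate": 1.0, "subsample": 1.0},   # no regularization
               {"learning_rate": 0.1, "subsample": 1.0},   # shrinkage only
               {"learning_rate": 0.1, "subsample": 0.5}):  # shrinkage + subsampling
    clf = GradientBoostingClassifier(n_estimators=200, max_depth=2,
                                     random_state=0, **params)
    clf.fit(X_train, y_train)
    print(params, "test accuracy:", clf.score(X_test, y_test))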

1.7. Gaussian Processes

Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems. The advantages of Gaussian processes are: The prediction interpolates the observations (at least for regular kernels). The prediction is probabilistic (Gaussian), so that one can compute empirical confidence intervals and decide based on those whether one should refit (online fitting, adaptive fitting) the prediction in some region of interest. Versatile: different kernels can be specified; common kernels are provided, but it is also possible to specify custom kernels.
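
A minimal regression sketch using GaussianProcessRegressor with an RBF kernel; the data and kernel choice are illustrative:

import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

X = np.linspace(0, 5, 20).reshape(-1, 1)
y = np.sin(X).ravel()

gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), random_state=0)
gpr.fit(X, y)

# Probabilistic prediction: mean and standard deviation at new points,
# from which confidence intervals can be computed
X_new = np.array([[1.5], [4.2]])
mean, std = gpr.predict(X_new, return_std=True)
print(mean, std)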

1.6. Nearest Neighbors

sklearn.neighbors provides functionality for unsupervised and supervised neighbors-based learning methods. Unsupervised nearest neighbors is the foundation of many other learning methods, notably manifold learning and spectral clustering. Supervised neighbors-based learning comes in two flavors: classification for data with discrete labels, and regression for data with continuous labels. The principle behind nearest neighbor methods is to find a predefined number of training samples closest in distance to the new point, and predict the label from these.
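
A brief sketch of the two supervised flavors on toy one-dimensional data; the values are illustrative:

from sklearn.neighbors import KNeighborsClassifier, KNeighborsRegressor

X = [[0], [1], [2], [3]]
y_class = [0, 0, 1, 1]
y_reg = [0.0, 0.1, 1.2, 1.3]

# Classification: label predicted from the 2 nearest training samples
clf = KNeighborsClassifier(n_neighbors=2).fit(X, y_class)
print(clf.predict([[1.4]]))

# Regression: average of the targets of the 2 nearest training samples
reg = KNeighborsRegressor(n_neighbors=2).fit(X, y_reg)
print(reg.predict([[1.4]]))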

Scaling the regularization parameter for SVCs

The following example illustrates the effect of scaling the regularization parameter when using Support Vector Machines for classification. For SVC classification, we are interested in a risk minimization for the equation

C \sum_{i=1}^{n} \mathcal{L}(f(x_i), y_i) + \Omega(w)

where C is used to set the amount of regularization, \mathcal{L} is a loss function of our samples and our model parameters, and \Omega is a penalty function of our model parameters. If we consider the loss function to be the individual error per sample, then the data-fit term, or the sum of the error for each sample, will increase as we add more samples.
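
Because the data-fit term sums over samples, a fixed C corresponds to a different effective amount of regularization as the training-set size changes. A minimal sketch, assuming an l1-penalized LinearSVC on synthetic data, that simply compares a fixed C with a C scaled by the number of samples; the scaling choice here is illustrative, not the example's recommendation:

from sklearn.datasets import make_classification
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Compare a fixed C with a C scaled by the number of training samples
for C in (1.0, 1.0 * len(X)):
    clf = LinearSVC(C=C, penalty="l1", dual=False, random_state=0)
    scores = cross_val_score(clf, X, y, cv=5)
    print("C=%g mean accuracy: %.3f" % (C, scores.mean()))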

sklearn.metrics.roc_auc_score()

sklearn.metrics.roc_auc_score(y_true, y_score, average='macro', sample_weight=None) [source] Compute Area Under the Curve (AUC) from prediction scores. Note: this implementation is restricted to the binary classification task or multilabel classification task in label indicator format. Read more in the User Guide. Parameters: y_true : array, shape = [n_samples] or [n_samples, n_classes] True binary labels in binary label indicators. y_score : array, shape = [n_samples] or [n_samples, n_classes] Target scores, which can either be probability estimates of the positive class, confidence values, or non-thresholded measures of decisions (as returned by decision_function on some classifiers).
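
A minimal usage sketch with binary labels and scores; the numbers are illustrative:

import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])  # e.g. predict_proba(...)[:, 1]

# AUC of the ROC curve for these scores
print(roc_auc_score(y_true, y_score))  # 0.75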

1.11. Ensemble methods

The goal of ensemble methods is to combine the predictions of several base estimators built with a given learning algorithm in order to improve generalizability / robustness over a single estimator. Two families of ensemble methods are usually distinguished: In averaging methods, the driving principle is to build several estimators independently and then to average their predictions. On average, the combined estimator is usually better than any of the single base estimators because its variance is reduced. By contrast, in boosting methods, base estimators are built sequentially and one tries to reduce the bias of the combined estimator.
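
A short sketch contrasting the two families on toy data, taking RandomForestClassifier as an averaging method and AdaBoostClassifier as a boosting method; the dataset and parameters are illustrative:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)

# Averaging: independent trees whose predictions are combined
forest = RandomForestClassifier(n_estimators=100, random_state=0)
# Boosting: estimators built sequentially to reduce bias
boost = AdaBoostClassifier(n_estimators=100, random_state=0)

for name, est in (("random forest", forest), ("adaboost", boost)):
    print(name, cross_val_score(est, X, y, cv=5).mean())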

manifold.Isomap()

class sklearn.manifold.Isomap(n_neighbors=5, n_components=2, eigen_solver='auto', tol=0, max_iter=None, path_method='auto', neighbors_algorithm='auto', n_jobs=1) [source] Isomap Embedding. Non-linear dimensionality reduction through Isometric Mapping. Read more in the User Guide. Parameters: n_neighbors : integer Number of neighbors to consider for each point. n_components : integer Number of coordinates for the manifold. eigen_solver : ['auto' | 'arpack' | 'dense'] 'auto' : Attempt to choose the most efficient solver for the given problem.
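
A minimal embedding sketch, assuming the digits dataset purely as illustrative input:

from sklearn.datasets import load_digits
from sklearn.manifold import Isomap

X, _ = load_digits(return_X_y=True)

# Reduce the 64-dimensional digit images to 2 coordinates on the manifold
embedding = Isomap(n_neighbors=5, n_components=2)
X_2d = embedding.fit_transform(X)
print(X_2d.shape)  # (1797, 2)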

sklearn.preprocessing.robust_scale()

sklearn.preprocessing.robust_scale(X, axis=0, with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True) [source] Standardize a dataset along any axis. Center to the median and component-wise scale according to the interquartile range. Read more in the User Guide. Parameters: X : array-like The data to center and scale. axis : int (0 by default) Axis used to compute the medians and IQR along. If 0, independently scale each feature, otherwise (if 1) scale each sample.
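
A minimal sketch on a small column containing an outlier; the values are illustrative:

import numpy as np
from sklearn.preprocessing import robust_scale

X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [100.0]])  # 100.0 is an outlier

# Each column is centered on its median and scaled by its IQR, so the
# extreme value does not inflate the scale the way it would with mean/std scaling
print(robust_scale(X, axis=0))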