sklearn.utils.resample()

sklearn.utils.resample(*arrays, **options)

Resample arrays or sparse matrices in a consistent way. The default strategy implements one step of the bootstrapping procedure.

Parameters:
*arrays : sequence of indexable data-structures
    Indexable data-structures can be arrays, lists, dataframes or scipy sparse matrices with consistent first dimension.
replace : boolean, True by default
    Implements resampling with replacement. If False, this will implement (sliced) random permutations.
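
A minimal usage sketch of one bootstrap step; the toy arrays below are illustrative, not taken from the entry above.

import numpy as np
from sklearn.utils import resample

X = np.array([[1., 0.], [2., 1.], [0., 0.]])
y = np.array([0, 1, 2])

# Resample X and y consistently, with replacement (the default).
X_boot, y_boot = resample(X, y, random_state=0)

# With replace=False the call returns a (sliced) random permutation instead.
X_perm, y_perm = resample(X, y, replace=False, n_samples=2, random_state=0)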

Pixel importances with a parallel forest of trees

This example shows the use of forests of trees to evaluate the importance of the pixels in an image classification task (faces). The hotter the pixel, the more important. The code below also illustrates how the construction and the computation of the predictions can be parallelized within multiple jobs.

Out:
Fitting ExtraTreesClassifier on faces data with 1 cores...
done in 3.341s

print(__doc__)
from time import time
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_
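
The snippet above is cut off. As a hedged sketch of the same idea, the following fits a parallel ExtraTreesClassifier on synthetic 8 x 8 "images" and reshapes feature_importances_ back into a pixel grid; the data and shapes here are placeholders, not the faces dataset used in the original example.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier

rng = np.random.RandomState(0)
X = rng.rand(200, 8 * 8)          # 200 fake 8x8 images, flattened
y = rng.randint(0, 2, size=200)   # fake labels

# n_jobs controls how many cores build the trees in parallel.
forest = ExtraTreesClassifier(n_estimators=100, n_jobs=2, random_state=0)
forest.fit(X, y)

# Reshape importances back onto the image grid; "hotter" pixels matter more.
importances = forest.feature_importances_.reshape(8, 8)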

sklearn.preprocessing.add_dummy_feature()

sklearn.preprocessing.add_dummy_feature(X, value=1.0)

Augment dataset with an additional dummy feature. This is useful for fitting an intercept term with implementations which cannot otherwise fit it directly.

Parameters:
X : {array-like, sparse matrix}, shape [n_samples, n_features]
    Data.
value : float
    Value to use for the dummy feature.

Returns:
X : {array, sparse matrix}, shape [n_samples, n_features + 1]
    Same data with dummy feature added as first column.
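
A short sketch of the call: a constant column (value 1.0 by default) is prepended so an intercept can be folded into the weight vector. The small input matrix is an illustrative choice.

from sklearn.preprocessing import add_dummy_feature

X = [[0, 1], [1, 0]]
X_aug = add_dummy_feature(X)
# X_aug is [[1., 0., 1.],
#           [1., 1., 0.]] -- the dummy feature is the first column.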

Partial Dependence Plots

Partial dependence plots show the dependence between the target function [2] and a set of 'target' features, marginalizing over the values of all other features (the complement features). Due to the limits of human perception, the size of the target feature set must be small (usually, one or two), thus the target features are usually chosen among the most important features (see feature_importances_). This example shows how to obtain partial dependence plots from a GradientBoostingRegressor trained on the California housing dataset.
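
A minimal sketch, not the full example: the plotting helper has moved between scikit-learn versions, and the call below assumes a release that provides sklearn.inspection.PartialDependenceDisplay; the synthetic data stands in for the dataset used in the original example.

import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.inspection import PartialDependenceDisplay

rng = np.random.RandomState(0)
X = rng.rand(200, 4)
y = X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=200)

est = GradientBoostingRegressor(random_state=0).fit(X, y)

# One-way partial dependence for features 0 and 1, plus a two-way plot (0, 1).
PartialDependenceDisplay.from_estimator(est, X, features=[0, 1, (0, 1)])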

model_selection.GroupShuffleSplit()

class sklearn.model_selection.GroupShuffleSplit(n_splits=5, test_size=0.2, train_size=None, random_state=None)

Shuffle-Group(s)-Out cross-validation iterator. Provides randomized train/test indices to split data according to a third-party provided group. This group information can be used to encode arbitrary domain specific stratifications of the samples as integers. For instance, the groups could be the year of collection of the samples and thus allow for cross-validation against time-based splits.
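
A small sketch of the behaviour: samples sharing a group label never appear on both sides of a split. The toy arrays and group labels below are illustrative.

import numpy as np
from sklearn.model_selection import GroupShuffleSplit

X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 1, 1, 0, 1, 0, 1])
groups = np.array([1, 1, 2, 2, 3, 3, 4, 4])   # e.g. year of collection

gss = GroupShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_idx, test_idx in gss.split(X, y, groups=groups):
    # No group label is shared between the train and test indices.
    assert not set(groups[train_idx]) & set(groups[test_idx])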

sklearn.datasets.make_sparse_uncorrelated()

sklearn.datasets.make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=None)

Generate a random regression problem with sparse uncorrelated design. This dataset is described in Celeux et al. [1] as:

    X ~ N(0, 1)
    y(X) = X[:, 0] + 2 * X[:, 1] - 2 * X[:, 2] - 1.5 * X[:, 3]

Only the first 4 features are informative. The remaining features are useless. Read more in the User Guide.

Parameters:
n_samples : int, optional (default=100)
    The number of samples.
n_features : int, optional (default=10)
    The number of features.
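
A one-call usage sketch; the shapes follow the signature shown above.

from sklearn.datasets import make_sparse_uncorrelated

X, y = make_sparse_uncorrelated(n_samples=100, n_features=10, random_state=0)
# X.shape == (100, 10); only the first four columns influence y.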

sklearn.metrics.auc()

sklearn.metrics.auc(x, y, reorder=False)

Compute Area Under the Curve (AUC) using the trapezoidal rule. This is a general function, given points on a curve. For computing the area under the ROC-curve, see roc_auc_score.

Parameters:
x : array, shape = [n]
    x coordinates.
y : array, shape = [n]
    y coordinates.
reorder : boolean, optional (default=False)
    If True, assume that the curve is ascending in the case of ties, as for an ROC curve. If the curve is non-ascending, the result will be wrong.
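
A small sketch of the general trapezoidal AUC on a toy ROC curve; the labels and scores below are illustrative.

import numpy as np
from sklearn.metrics import auc, roc_curve

y_true = np.array([0, 0, 1, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8])

# Build an ROC curve, then integrate it with the trapezoidal rule.
fpr, tpr, _ = roc_curve(y_true, y_score)
area = auc(fpr, tpr)   # 0.75 for this toy example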

Topic extraction with Non-negative Matrix Factorization and Latent Dirichlet Allocation

This is an example of applying Non-negative Matrix Factorization and Latent Dirichlet Allocation to a corpus of documents to extract additive models of the topic structure of the corpus. The output is a list of topics, each represented as a list of terms (weights are not shown). The default parameters (n_samples / n_features / n_topics) should make the example runnable in a couple of tens of seconds. You can try to increase the dimensions of the problem, but be aware that the time complexity is polynomial in NMF.
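
A condensed sketch of the NMF half of the pipeline: vectorize a tiny corpus with tf-idf, fit NMF, and list the top terms per topic. The corpus and parameter values here are placeholders, not those of the original example, and get_feature_names_out assumes a recent scikit-learn release.

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "the stock market fell today",
    "investors worry about the market",
]

tfidf = TfidfVectorizer(stop_words="english")
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0).fit(X)
terms = tfidf.get_feature_names_out()   # get_feature_names() on old versions
for topic_idx, topic in enumerate(nmf.components_):
    top = [terms[i] for i in topic.argsort()[:-4:-1]]   # three strongest terms
    print("Topic %d: %s" % (topic_idx, " ".join(top)))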

preprocessing.RobustScaler()

class sklearn.preprocessing.RobustScaler(with_centering=True, with_scaling=True, quantile_range=(25.0, 75.0), copy=True)

Scale features using statistics that are robust to outliers. This Scaler removes the median and scales the data according to the quantile range (defaults to IQR: Interquartile Range). The IQR is the range between the 1st quartile (25th quantile) and the 3rd quartile (75th quantile). Centering and scaling happen independently on each feature (or each sample, depending on the axis argument) by computing the relevant statistics on the samples in the training set.
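
A short sketch of the effect: the median is removed and the data is scaled by the IQR, so a single extreme value barely shifts the result. The toy column with one outlier is illustrative.

import numpy as np
from sklearn.preprocessing import RobustScaler

X = np.array([[1.], [2.], [3.], [4.], [1000.]])   # one outlier

scaler = RobustScaler(quantile_range=(25.0, 75.0))
X_scaled = scaler.fit_transform(X)
# scaler.center_ holds the per-feature median, scaler.scale_ the IQR.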

Online learning of a dictionary of parts of faces

This example uses a large dataset of faces to learn a set of 20 x 20 image patches that constitute faces. From the programming standpoint, it is interesting because it shows how to use the online API of scikit-learn to process a very large dataset by chunks. The way we proceed is that we load an image at a time and randomly extract 50 patches from this image. Once we have accumulated 500 of these patches (using 10 images), we run the partial_fit method of the online KMeans object, MiniBatchKMeans.
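
A schematic sketch of that online loop on random images instead of the faces dataset: extract 20 x 20 patches, buffer them, and call partial_fit on a MiniBatchKMeans once enough patches have accumulated. The image sizes, cluster count and loop bounds here are illustrative.

import numpy as np
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.image import extract_patches_2d

rng = np.random.RandomState(0)
kmeans = MiniBatchKMeans(n_clusters=20, random_state=0)

buffer = []
for _ in range(10):                                  # stand-in for 10 images
    image = rng.rand(64, 64)
    patches = extract_patches_2d(image, (20, 20), max_patches=50,
                                 random_state=rng)
    buffer.append(patches.reshape(len(patches), -1))
    if sum(len(b) for b in buffer) >= 500:
        kmeans.partial_fit(np.concatenate(buffer))   # one online update
        buffer = []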