4.3. Preprocessing data

The sklearn.preprocessing package provides several common utility functions and transformer classes to change raw feature vectors into a representation that is more suitable for the downstream estimators.

4.3.1. Standardization, or mean removal and variance scaling

Standardization of datasets is a common requirement for many machine learning estimators implemented in scikit-learn; they might behave badly if the individual features do not more or less look like standard normally distributed data.
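
As a minimal sketch (not taken from the section itself), standardization can be applied with preprocessing.StandardScaler, which learns the per-feature mean and standard deviation on the training data and reuses them on any later data:

    # Minimal standardization sketch: zero mean, unit variance per feature.
    import numpy as np
    from sklearn.preprocessing import StandardScaler

    X = np.array([[1.0, -1.0,  2.0],
                  [2.0,  0.0,  0.0],
                  [0.0,  1.0, -1.0]])

    scaler = StandardScaler().fit(X)   # learn per-feature mean and std on training data
    X_scaled = scaler.transform(X)     # apply the same transformation

    print(X_scaled.mean(axis=0))       # approximately zero for each feature
    print(X_scaled.std(axis=0))        # approximately one for each feature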

svm.SVR()

class sklearn.svm.SVR(kernel='rbf', degree=3, gamma='auto', coef0=0.0, tol=0.001, C=1.0, epsilon=0.1, shrinking=True, cache_size=200, verbose=False, max_iter=-1)

Epsilon-Support Vector Regression. The free parameters in the model are C and epsilon. The implementation is based on libsvm. Read more in the User Guide.

Parameters:

C : float, optional (default=1.0)
    Penalty parameter C of the error term.

epsilon : float, optional (default=0.1)
    Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
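
A short usage sketch of SVR on a synthetic one-dimensional regression problem (the data and parameter values here are illustrative, not from the reference page):

    # Fit an epsilon-SVR on a noisy sine curve; C and epsilon are the free parameters.
    import numpy as np
    from sklearn.svm import SVR

    rng = np.random.RandomState(0)
    X = np.sort(5 * rng.rand(40, 1), axis=0)        # 40 samples, 1 feature
    y = np.sin(X).ravel() + 0.1 * rng.randn(40)     # noisy sine target

    model = SVR(kernel='rbf', C=1.0, epsilon=0.1)
    model.fit(X, y)
    print(model.predict(X[:5]))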

Comparing different clustering algorithms on toy datasets

This example aims at showing characteristics of different clustering algorithms on datasets that are "interesting" but still in 2D. The last dataset is an example of a "null" situation for clustering: the data is homogeneous, and there is no good clustering. While these examples give some intuition about the algorithms, this intuition might not apply to very high dimensional data. The results could be improved by tweaking the parameters for each clustering strategy, for instance setting the number of clusters for the methods that require it.
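
A condensed sketch in the spirit of the example, run on a single toy dataset rather than the full grid of datasets and algorithms (dataset size and parameters below are illustrative):

    # Compare two clustering strategies on one 2D toy dataset.
    from sklearn import cluster, datasets
    from sklearn.preprocessing import StandardScaler

    X, _ = datasets.make_moons(n_samples=500, noise=0.05, random_state=0)
    X = StandardScaler().fit_transform(X)           # normalize for easier parameter selection

    kmeans = cluster.KMeans(n_clusters=2, random_state=0).fit(X)
    dbscan = cluster.DBSCAN(eps=0.3).fit(X)

    print(kmeans.labels_[:10])   # KMeans tends to split the moons with a straight boundary
    print(dbscan.labels_[:10])   # DBSCAN can follow the non-convex shapes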

3.4. Model persistence

After training a scikit-learn model, it is desirable to have a way to persist the model for future use without having to retrain. The following section gives you an example of how to persist a model with pickle. We'll also review a few security and maintainability issues when working with pickle serialization.

3.4.1. Persistence example

It is possible to save a model in scikit-learn by using Python's built-in persistence model, namely pickle:

>>> from sklearn import svm
>>> from sklearn import datasets
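
A minimal, self-contained sketch of the workflow described above, assuming the usual pickle round-trip on a small fitted classifier (the security caveats about unpickling untrusted data still apply):

    # Serialize a fitted model to bytes and restore it later.
    import pickle
    from sklearn import svm, datasets

    iris = datasets.load_iris()
    clf = svm.SVC().fit(iris.data, iris.target)

    s = pickle.dumps(clf)                 # serialize the fitted model to a byte string
    clf2 = pickle.loads(s)                # restore it (same library versions assumed)
    print(clf2.predict(iris.data[:1]))

For models carrying large numpy arrays, joblib.dump / joblib.load are usually more efficient than plain pickle, at the cost of only pickling to disk rather than to a string.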

sklearn.utils.shuffle()

sklearn.utils.shuffle(*arrays, **options)

Shuffle arrays or sparse matrices in a consistent way. This is a convenience alias to resample(*arrays, replace=False) to do random permutations of the collections.

Parameters:

*arrays : sequence of indexable data-structures
    Indexable data-structures can be arrays, lists, dataframes or scipy sparse matrices with consistent first dimension.

random_state : int or RandomState instance
    Control the shuffling for reproducible behavior.

n_samples : int, None by default
    Number of samples to generate. If left to None this is automatically set to the first dimension of the arrays.
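
A brief usage sketch, assuming two aligned arrays that should be permuted with the same row order:

    # Shuffle X and y consistently so rows stay paired with their labels.
    import numpy as np
    from sklearn.utils import shuffle

    X = np.arange(10).reshape(5, 2)
    y = np.array([0, 1, 2, 3, 4])

    X_shuf, y_shuf = shuffle(X, y, random_state=0)  # same permutation applied to both
    print(X_shuf)
    print(y_shuf)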

decomposition.KernelPCA()

class sklearn.decomposition.KernelPCA(n_components=None, kernel='linear', gamma=None, degree=3, coef0=1, kernel_params=None, alpha=1.0, fit_inverse_transform=False, eigen_solver='auto', tol=0, max_iter=None, remove_zero_eig=False, random_state=None, copy_X=True, n_jobs=1)

Kernel Principal Component Analysis (KPCA): non-linear dimensionality reduction through the use of kernels (see Pairwise metrics, Affinities and Kernels). Read more in the User Guide.

Parameters:

n_components : int, default=None
    Number of components. If None, all non-zero components are kept.
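
A small sketch of KernelPCA on the classic concentric-circles toy data (parameter values such as gamma=10 are illustrative choices, not defaults from the signature above):

    # Non-linear dimensionality reduction with an RBF-kernel KernelPCA.
    from sklearn.datasets import make_circles
    from sklearn.decomposition import KernelPCA

    X, y = make_circles(n_samples=400, factor=0.3, noise=0.05, random_state=0)

    kpca = KernelPCA(n_components=2, kernel='rbf', gamma=10, fit_inverse_transform=True)
    X_kpca = kpca.fit_transform(X)           # circles become (nearly) linearly separable
    X_back = kpca.inverse_transform(X_kpca)  # approximate reconstruction in input space
    print(X_kpca.shape, X_back.shape)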

linear_model.PassiveAggressiveRegressor()

class sklearn.linear_model.PassiveAggressiveRegressor(C=1.0, fit_intercept=True, n_iter=5, shuffle=True, verbose=0, loss='epsilon_insensitive', epsilon=0.1, random_state=None, warm_start=False)

Passive Aggressive Regressor. Read more in the User Guide.

Parameters:

C : float
    Maximum step size (regularization). Defaults to 1.0.

epsilon : float
    If the difference between the current prediction and the correct label is below this threshold, the model is not updated.

fit_intercept : bool
    Whether the intercept should be estimated or not. If False, the data is assumed to be already centered. Defaults to True.
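
A minimal fitting sketch on synthetic regression data (only a subset of parameters is shown; iteration-related defaults such as n_iter / max_iter differ across scikit-learn versions):

    # Fit a Passive Aggressive regressor on a toy regression problem.
    from sklearn.datasets import make_regression
    from sklearn.linear_model import PassiveAggressiveRegressor

    X, y = make_regression(n_samples=200, n_features=10, noise=1.0, random_state=0)

    reg = PassiveAggressiveRegressor(C=1.0, epsilon=0.1, random_state=0)
    reg.fit(X, y)
    print(reg.coef_[:3], reg.intercept_)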

Model Complexity Influence

Demonstrate how model complexity influences both prediction accuracy and computational performance. The dataset is the Boston Housing dataset (resp. 20 Newsgroups) for regression (resp. classification). For each class of models we vary the model complexity through the choice of relevant model parameters and measure the influence on both computational performance (latency) and predictive power (MSE or Hamming Loss).

print(__doc__)

# Author: Eustache Diemert <eustache@diemert.fr>
# License: BSD 3 clause
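
A hypothetical, heavily condensed version of the benchmark idea: vary one complexity parameter, then record test error and prediction latency per setting. Synthetic data and GradientBoostingRegressor are used as stand-ins for the datasets and model classes of the actual example:

    # Vary a complexity knob and measure predictive power (MSE) and prediction latency.
    import time
    from sklearn.datasets import make_regression
    from sklearn.ensemble import GradientBoostingRegressor
    from sklearn.metrics import mean_squared_error
    from sklearn.model_selection import train_test_split

    X, y = make_regression(n_samples=2000, n_features=20, noise=5.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    for n_estimators in (10, 50, 100, 300):          # the complexity knob for this model class
        model = GradientBoostingRegressor(n_estimators=n_estimators, random_state=0)
        model.fit(X_train, y_train)
        start = time.time()
        pred = model.predict(X_test)
        latency = time.time() - start
        print(n_estimators, mean_squared_error(y_test, pred), latency)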

Visualization of MLP weights on MNIST

Sometimes looking at the learned coefficients of a neural network can provide insight into the learning behavior. For example, if weights look unstructured, maybe some were not used at all, or if very large coefficients exist, maybe regularization was too low or the learning rate too high. This example shows how to plot some of the first-layer weights in an MLPClassifier trained on the MNIST dataset. The input data consists of 28x28 pixel handwritten digits, leading to 784 features in the dataset.
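
A rough sketch of the plotting idea, using random stand-in data shaped like MNIST (784 features) so it runs without downloading the dataset; only the visualization of mlp.coefs_[0] mirrors the example:

    # Plot each hidden unit's 784 first-layer weights as a 28x28 image.
    import matplotlib.pyplot as plt
    import numpy as np
    from sklearn.neural_network import MLPClassifier

    rng = np.random.RandomState(0)
    X = rng.rand(200, 784)                      # stand-in for 28x28 pixel digits
    y = rng.randint(0, 10, size=200)

    mlp = MLPClassifier(hidden_layer_sizes=(16,), max_iter=20, random_state=0)
    mlp.fit(X, y)

    fig, axes = plt.subplots(2, 4)
    for coef, ax in zip(mlp.coefs_[0].T, axes.ravel()):   # one weight vector per hidden unit
        ax.matshow(coef.reshape(28, 28), cmap=plt.cm.gray)
        ax.set_xticks(())
        ax.set_yticks(())
    plt.show()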

Segmenting the picture of a raccoon face in regions

This example uses spectral clustering on a graph created from voxel-to-voxel difference on an image to break this image into multiple partly-homogeneous regions. This procedure (spectral clustering on an image) is an efficient approximate solution for finding normalized graph cuts. There are two options to assign labels: with "kmeans", spectral clustering will cluster samples in the embedding space using a kmeans algorithm, whereas "discrete" will iteratively search for the closest partition space to the embedding space.
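
A rough sketch of the procedure on a small synthetic image instead of the raccoon face (the affinity construction via img_to_graph and the exponential rescaling follow the usual pattern; the image, size, and n_clusters are illustrative):

    # Spectral clustering on a graph built from pixel-to-pixel differences of an image.
    import numpy as np
    from sklearn.feature_extraction import image
    from sklearn.cluster import spectral_clustering

    img = np.zeros((50, 50))
    img[10:30, 10:30] = 1.0                                # a bright square region
    img += 0.05 * np.random.RandomState(0).randn(*img.shape)

    graph = image.img_to_graph(img)                        # edges weighted by pixel differences
    graph.data = np.exp(-graph.data / graph.data.std())    # turn differences into affinities

    labels = spectral_clustering(graph, n_clusters=2, assign_labels='kmeans', random_state=0)
    print(labels.reshape(img.shape)[:5, :5])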