gaussian_process.kernels.Exponentiation()

class sklearn.gaussian_process.kernels.Exponentiation(kernel, exponent) [source]

Exponentiate a kernel by a given exponent. The resulting kernel is defined as k_exp(X, Y) = k(X, Y) ** exponent.

New in version 0.18.

Parameters:
    kernel : Kernel object
        The base kernel.
    exponent : float
        The exponent for the base kernel.

Methods:
    clone_with_theta(theta) : Returns a clone of self with given hyperparameters theta.
    diag(X) : Returns the diagonal of the kernel k(X, X).
    get_params([deep]) : Get parameters for this estimator.
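
A minimal usage sketch (not part of the original page), assuming an RBF base kernel, showing that Exponentiation(kernel, exponent) matches raising the base kernel matrix elementwise to that power:

import numpy as np
from sklearn.gaussian_process.kernels import RBF, Exponentiation

base = RBF(length_scale=1.0)
k_exp = Exponentiation(base, exponent=2)    # equivalent to writing base ** 2

X = np.array([[0.0], [1.0], [2.0]])
print(k_exp(X))                             # k(X, X) ** 2
print(np.allclose(k_exp(X), base(X) ** 2))  # True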

Various Agglomerative Clustering on a 2D embedding of digits

An illustration of various linkage options for agglomerative clustering on a 2D embedding of the digits dataset. The goal of this example is to show intuitively how the metrics behave, and not to find good clusters for the digits; this is why the example works on a 2D embedding. What this example shows us is the "rich getting richer" behavior of agglomerative clustering, which tends to create uneven cluster sizes. This behavior is especially pronounced for the average linkage strategy, which ends up with a couple of singleton clusters.
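
A condensed sketch of the comparison, assuming a SpectralEmbedding of the digits data as the 2D embedding (the full example also plots each clustering):

import numpy as np
from sklearn import datasets, manifold
from sklearn.cluster import AgglomerativeClustering

digits = datasets.load_digits()
X_2d = manifold.SpectralEmbedding(n_components=2).fit_transform(digits.data)

for linkage in ("ward", "average", "complete"):
    labels = AgglomerativeClustering(linkage=linkage, n_clusters=10).fit_predict(X_2d)
    # Uneven cluster sizes expose the "rich getting richer" behavior.
    print(linkage, sorted(np.bincount(labels), reverse=True))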

Robust vs Empirical covariance estimate

The usual covariance maximum likelihood estimate is very sensitive to the presence of outliers in the data set. In such a case, it would be better to use a robust estimator of covariance to guarantee that the estimation is resistant to "erroneous" observations in the data set. Minimum Covariance Determinant Estimator: the Minimum Covariance Determinant estimator is a robust, high-breakdown point (i.e. it can be used to estimate the covariance matrix of highly contaminated datasets, up to (n_samples - n_features - 1) / 2 outliers) estimator of covariance.
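
A small sketch, assuming synthetic Gaussian data with a fraction of injected outliers, contrasting the empirical estimator with the robust MinCovDet estimator:

import numpy as np
from sklearn.covariance import EmpiricalCovariance, MinCovDet

rng = np.random.RandomState(0)
X = rng.randn(100, 2)
X[:10] = rng.uniform(low=5, high=10, size=(10, 2))  # contaminate 10% of the samples

emp = EmpiricalCovariance().fit(X)
mcd = MinCovDet(random_state=0).fit(X)

print("Empirical covariance:\n", emp.covariance_)   # pulled toward the outliers
print("Robust (MCD) covariance:\n", mcd.covariance_)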

sklearn.metrics.hinge_loss()

sklearn.metrics.hinge_loss(y_true, pred_decision, labels=None, sample_weight=None) [source]

Average hinge loss (non-regularized). In the binary case, assuming labels in y_true are encoded with +1 and -1, when a prediction mistake is made, margin = y_true * pred_decision is always negative (since the signs disagree), implying 1 - margin is always greater than 1. The cumulated hinge loss is therefore an upper bound on the number of mistakes made by the classifier. In the multiclass case, the function expects that either all the labels are included in y_true, or an optional labels argument is provided which contains all the labels.
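
A minimal sketch of the binary case described above, using a linear SVM's decision_function as pred_decision (the toy data and query points are arbitrary, for illustration only):

from sklearn.svm import LinearSVC
from sklearn.metrics import hinge_loss

X = [[0], [1], [2], [3]]
y = [-1, -1, 1, 1]
est = LinearSVC(random_state=0).fit(X, y)

pred_decision = est.decision_function([[-2], [3], [0.5]])
print(hinge_loss([-1, 1, 1], pred_decision))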

multiclass.OutputCodeClassifier()

class sklearn.multiclass.OutputCodeClassifier(estimator, code_size=1.5, random_state=None, n_jobs=1) [source]

(Error-Correcting) Output-Code multiclass strategy. Output-code based strategies consist in representing each class with a binary code (an array of 0s and 1s). At fitting time, one binary classifier per bit in the code book is fitted. At prediction time, the classifiers are used to project new points into the class space, and the class closest to the points is chosen. The main advantage of these strategies is that the number of classifiers used can be controlled by the user, either for compressing the model (0 < code_size < 1) or for making the model more robust to errors (code_size > 1).
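
A short usage sketch, assuming the iris dataset and a random-forest base estimator (any classifier exposing decision_function or predict_proba should work):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.multiclass import OutputCodeClassifier

X, y = load_iris(return_X_y=True)
clf = OutputCodeClassifier(RandomForestClassifier(random_state=0),
                           code_size=2, random_state=0)
print(clf.fit(X, y).predict(X)[:10])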

sklearn.datasets.make_gaussian_quantiles()

sklearn.datasets.make_gaussian_quantiles(mean=None, cov=1.0, n_samples=100, n_features=2, n_classes=3, shuffle=True, random_state=None) [source]

Generate isotropic Gaussian samples and label them by quantile. This classification dataset is constructed by taking a multi-dimensional standard normal distribution and defining classes separated by nested concentric multi-dimensional spheres such that roughly equal numbers of samples are in each class (quantiles of the distribution). Read more in the User Guide.
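
A quick sketch generating the nested-sphere dataset described above and checking the class balance:

import numpy as np
from sklearn.datasets import make_gaussian_quantiles

X, y = make_gaussian_quantiles(n_samples=300, n_features=2,
                               n_classes=3, random_state=0)
print(X.shape, y.shape)  # (300, 2) (300,)
print(np.bincount(y))    # roughly 100 samples per class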

Parameter estimation using grid search with cross-validation

This example shows how a classifier is optimized by cross-validation, which is done using the sklearn.model_selection.GridSearchCV object on a development set that comprises only half of the available labeled data. The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step. More details on tools available for model selection can be found in the sections on Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estimator.
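
A condensed sketch of that workflow, assuming the digits dataset and an SVC with a small parameter grid (the full example also reports per-fold scores and a classification report):

from sklearn.datasets import load_digits
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

X, y = load_digits(return_X_y=True)
X_dev, X_eval, y_dev, y_eval = train_test_split(X, y, test_size=0.5, random_state=0)

param_grid = {"C": [1, 10, 100], "gamma": [1e-3, 1e-4]}
search = GridSearchCV(SVC(), param_grid, cv=5)
search.fit(X_dev, y_dev)  # model selection on the development half only

print("Best parameters:", search.best_params_)
print("Evaluation-set score:", search.score(X_eval, y_eval))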

neural_network.MLPRegressor()

class sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(100, ), activation='relu', solver='adam', alpha=0.0001, batch_size='auto', learning_rate='constant', learning_rate_init=0.001, power_t=0.5, max_iter=200, shuffle=True, random_state=None, tol=0.0001, verbose=False, warm_start=False, momentum=0.9, nesterovs_momentum=True, early_stopping=False, validation_fraction=0.1, beta_1=0.9, beta_2=0.999, epsilon=1e-08) [source]

Multi-layer Perceptron regressor. This model optimizes the squared loss using LBFGS or stochastic gradient descent.
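
A minimal sketch, assuming a noisy one-dimensional sine target, of fitting MLPRegressor (max_iter is raised here so the default 'adam' solver converges on this toy problem):

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

reg = MLPRegressor(hidden_layer_sizes=(50,), max_iter=2000, random_state=0)
reg.fit(X, y)
print(reg.predict([[0.0], [1.5]]))  # should be close to sin(0.0) and sin(1.5)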

Plot multinomial and One-vs-Rest Logistic Regression

Plot the decision surface of multinomial and One-vs-Rest Logistic Regression. The hyperplanes corresponding to the three One-vs-Rest (OVR) classifiers are represented by the dashed lines.

Out:
    training score : 0.995 (multinomial)
    training score : 0.976 (ovr)

print(__doc__)
# Authors: Tom Dupre la Tour <tom.dupre-la-tour@m4x.org>
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression
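
A trimmed sketch of the fits behind those scores, assuming three Gaussian blobs as in the example and a LogisticRegression that accepts the multi_class parameter (it is deprecated in recent scikit-learn releases); the decision-surface and hyperplane plotting is omitted:

from sklearn.datasets import make_blobs
from sklearn.linear_model import LogisticRegression

X, y = make_blobs(n_samples=300, centers=3, random_state=40)

for multi_class in ("multinomial", "ovr"):
    clf = LogisticRegression(solver="sag", max_iter=100,
                             multi_class=multi_class, random_state=42)
    clf.fit(X, y)
    print("training score : %.3f (%s)" % (clf.score(X, y), multi_class))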

Multi-class AdaBoosted Decision Trees

This example reproduces Figure 1 of Zhu et al. [1] and shows how boosting can improve prediction accuracy on a multi-class problem. The classification dataset is constructed by taking a ten-dimensional standard normal distribution and defining three classes separated by nested concentric ten-dimensional spheres such that roughly equal numbers of samples are in each class (quantiles of the distribution). The performance of the SAMME and SAMME.R [1] algorithms is compared. SAMME.R uses the probability estimates to update the additive model, while SAMME uses the classifications only.
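
A reduced sketch of the comparison, assuming make_gaussian_quantiles for the nested-sphere data and an AdaBoostClassifier that still accepts the algorithm parameter (it is deprecated in recent scikit-learn releases); the full example also tracks staged errors over boosting iterations:

from sklearn.datasets import make_gaussian_quantiles
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_gaussian_quantiles(n_samples=3000, n_features=10,
                               n_classes=3, random_state=1)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)

for algorithm in ("SAMME", "SAMME.R"):
    clf = AdaBoostClassifier(DecisionTreeClassifier(max_depth=2),
                             n_estimators=100, algorithm=algorithm)
    clf.fit(X_train, y_train)
    print(algorithm, "test accuracy:", clf.score(X_test, y_test))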