ensemble.RandomTreesEmbedding()

class sklearn.ensemble.RandomTreesEmbedding(n_estimators=10, max_depth=5, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_leaf_nodes=None, min_impurity_split=1e-07, sparse_output=True, n_jobs=1, random_state=None, verbose=0, warm_start=False)

An ensemble of totally random trees. An unsupervised transformation of a dataset to a high-dimensional sparse representation. A datapoint is coded according to which leaf of each tree it is sorted into. Using a one-hot encoding of the leaves, this leads to a binary coding with as many ones as there are trees in the forest.
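A minimal sketch of how the transformer might be used; the toy data and hyperparameter values are illustrative, not from the reference:

    import numpy as np
    from sklearn.ensemble import RandomTreesEmbedding

    # Four toy points in 2-D; each tree sorts every point into one of its leaves.
    X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
    embedder = RandomTreesEmbedding(n_estimators=10, max_depth=3, random_state=0)

    # One-hot leaf coding: one column per leaf, one nonzero entry per tree.
    X_sparse = embedder.fit_transform(X)
    print(X_sparse.shape)  # (4, total number of leaves across the 10 trees)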

qda.QDA()

Warning: DEPRECATED.

class sklearn.qda.QDA(priors=None, reg_param=0.0, store_covariances=False, tol=0.0001)

Alias for sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis.

Deprecated since version 0.17: This class will be removed in 0.19. Use sklearn.discriminant_analysis.QuadraticDiscriminantAnalysis instead.

Methods:
- decision_function(X): Apply decision function to an array of samples.
- fit(X, y[, store_covariances, tol]): Fit the model according to the given training data and parameters.
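Since the alias is deprecated, new code should import the class from its current location. A minimal sketch, mirroring the usual toy example:

    import numpy as np
    from sklearn.discriminant_analysis import QuadraticDiscriminantAnalysis

    X = np.array([[-1, -1], [-2, -1], [-3, -2], [1, 1], [2, 1], [3, 2]])
    y = np.array([1, 1, 1, 2, 2, 2])

    clf = QuadraticDiscriminantAnalysis(priors=None, reg_param=0.0)
    clf.fit(X, y)
    print(clf.predict([[-0.8, -1]]))  # predicted class label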

feature_extraction.FeatureHasher()

class sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type='dict', dtype=numpy.float64, non_negative=False)

Implements feature hashing, aka the hashing trick. This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3. Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done.
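A small illustration of the hashing trick, with n_features shrunk to 16 so the resulting matrix is easy to inspect (the feature dicts are made up for the example):

    from sklearn.feature_extraction import FeatureHasher

    hasher = FeatureHasher(n_features=16, input_type='dict')
    D = [{'dog': 1, 'cat': 2, 'elephant': 4},
         {'dog': 2, 'run': 5}]

    # Each dict becomes a sparse row; each feature's column is its hash value.
    X = hasher.transform(D)
    print(X.toarray())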

3.1. Cross-validation

Learning the parameters of a prediction function and testing it on the same data is a methodological mistake: a model that would just repeat the labels of the samples that it has just seen would have a perfect score but would fail to predict anything useful on yet-unseen data. This situation is called overfitting. To avoid it, it is common practice when performing a (supervised) machine learning experiment to hold out part of the available data as a test set X_test, y_test. Note that the word "experiment" is not intended to denote academic use only, because even in commercial settings machine learning usually starts experimentally.
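For instance, holding out 40% of the iris data as a test set might look like the following sketch (the estimator and split size are arbitrary choices, assuming scikit-learn >= 0.18 for the model_selection module):

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.svm import SVC

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.4, random_state=0)

    clf = SVC(kernel='linear', C=1).fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on the held-out test set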

Libsvm GUI

A simple graphical frontend for Libsvm, mainly intended for didactic purposes. You can create data points by point-and-click and visualize the decision region induced by different kernels and parameter settings. To create positive examples, click the left mouse button; to create negative examples, click the right button. If all examples are from the same class, the demo uses a one-class SVM.

The example script begins:

    from __future__ import division, print_function
    print(__doc__)
    # Author: Peter Prettenhofer <peter.prettenho
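The one-class fallback mentioned above can be sketched outside the GUI as well; this is a stand-in illustration with made-up data, not code from the demo itself:

    import numpy as np
    from sklearn.svm import OneClassSVM

    # Twenty points, all from the "same class": learn a boundary around them.
    rng = np.random.RandomState(0)
    X = rng.randn(20, 2)

    clf = OneClassSVM(kernel='rbf', gamma=0.5, nu=0.1).fit(X)
    print(clf.predict(X))  # +1 for inliers, -1 for points outside the region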

Sparse recovery

Given a small number of observations, we want to recover which features of X are relevant to explain y. For this, sparse linear models can outperform standard statistical tests if the true model is sparse, i.e. if only a small fraction of the features are relevant. As detailed in the compressive sensing notes, the ability of L1-based approaches to identify the relevant variables depends on the sparsity of the ground truth, the number of samples, the number of features, and the conditioning of the design matrix.
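As a toy illustration of L1-based recovery (the Lasso and its alpha here are stand-ins chosen for the sketch, not the estimator used in the example itself):

    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.RandomState(0)
    n_samples, n_features = 50, 200
    X = rng.randn(n_samples, n_features)

    # Sparse ground truth: only the first 5 of 200 features are relevant.
    coef = np.zeros(n_features)
    coef[:5] = rng.randn(5)
    y = X.dot(coef) + 0.01 * rng.randn(n_samples)

    lasso = Lasso(alpha=0.05).fit(X, y)
    print(np.flatnonzero(lasso.coef_))  # indices the L1 penalty kept nonzero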

sklearn.datasets.make_checkerboard()

sklearn.datasets.make_checkerboard(shape, n_clusters, noise=0.0, minval=10, maxval=100, shuffle=True, random_state=None)

Generate an array with block checkerboard structure for biclustering. Read more in the User Guide.

Parameters:
- shape : iterable (n_rows, n_cols). The shape of the result.
- n_clusters : integer or iterable (n_row_clusters, n_column_clusters). The number of row and column clusters.
- noise : float, optional (default=0.0). The standard deviation of the Gaussian noise.
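A minimal call, with shapes and noise picked arbitrarily for illustration:

    from sklearn.datasets import make_checkerboard

    # 300 x 300 array with 4 row clusters crossed with 3 column clusters.
    data, rows, cols = make_checkerboard(
        shape=(300, 300), n_clusters=(4, 3), noise=10,
        shuffle=True, random_state=0)

    print(data.shape)  # (300, 300)
    print(rows.shape)  # one boolean row-indicator per bicluster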

linear_model.RANSACRegressor()

class sklearn.linear_model.RANSACRegressor(base_estimator=None, min_samples=None, residual_threshold=None, is_data_valid=None, is_model_valid=None, max_trials=100, stop_n_inliers=inf, stop_score=inf, stop_probability=0.99, residual_metric=None, loss='absolute_loss', random_state=None)

RANSAC (RANdom SAmple Consensus) algorithm. RANSAC is an iterative algorithm for the robust estimation of parameters from a subset of inliers from the complete data set. More information can be found in the general documentation of linear models.
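A sketch of robust line fitting with synthetic outliers (the data and the choice of LinearRegression as base estimator are invented for the example):

    import numpy as np
    from sklearn.linear_model import LinearRegression, RANSACRegressor

    rng = np.random.RandomState(0)
    X = rng.randn(100, 1)
    y = 3.0 * X.ravel() + 1.0 + 0.1 * rng.randn(100)
    y[:10] += 20 * rng.randn(10)  # corrupt 10 samples with gross outliers

    ransac = RANSACRegressor(LinearRegression(), random_state=0).fit(X, y)
    print(ransac.estimator_.coef_, ransac.estimator_.intercept_)
    print(ransac.inlier_mask_.sum(), "samples kept as inliers")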

2.2. Manifold learning

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

2.2.1. Introduction

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show the inherent structure of the data, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.
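As a concrete sketch, one of the manifold learners (Isomap is an arbitrary choice here) can unroll a 3-D S-curve into two dimensions:

    from sklearn.datasets import make_s_curve
    from sklearn.manifold import Isomap

    # 1000 points sampled from a 2-D manifold embedded in 3-D space.
    X, color = make_s_curve(n_samples=1000, random_state=0)

    embedding = Isomap(n_neighbors=10, n_components=2).fit_transform(X)
    print(embedding.shape)  # (1000, 2): the flattened representation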

Working With Text Data

The goal of this guide is to explore some of the main scikit-learn tools on a single practical task: analysing a collection of text documents (newsgroups posts) on twenty different topics. In this section we will see how to:

- load the file contents and the categories
- extract feature vectors suitable for machine learning
- train a linear model to perform categorization
- use a grid search strategy to find a good configuration of both the feature extraction components and the classifier

A rough sketch of these steps follows.
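The categories, vectorizer, and classifier below are illustrative choices, not the tutorial's exact configuration; fetch_20newsgroups downloads the data on first use:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier
    from sklearn.pipeline import Pipeline

    categories = ['alt.atheism', 'sci.space']
    train = fetch_20newsgroups(subset='train', categories=categories)
    test = fetch_20newsgroups(subset='test', categories=categories)

    # Feature extraction and linear classifier chained into one estimator.
    text_clf = Pipeline([('tfidf', TfidfVectorizer()),
                         ('clf', SGDClassifier(random_state=0))])
    text_clf.fit(train.data, train.target)
    print(text_clf.score(test.data, test.target))  # accuracy on held-out posts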