6. Strategies to scale computationally

For some applications the number of examples, the number of features (or both), and/or the speed at which they need to be processed are challenging for traditional approaches. In these cases scikit-learn has a number of options you can consider to make your system scale.

6.1. Scaling with instances using out-of-core learning

Out-of-core (or "external memory") learning is a technique used to learn from data that cannot fit in a computer's main memory (RAM). Here is a sketch of a system designed to achieve this.
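A minimal illustrative sketch of such a system (the toy stream_batches generator and all parameter values below are assumptions, not from the guide): a stateless HashingVectorizer extracts features from each mini-batch, and an SGDClassifier learns incrementally via partial_fit, so no more than one batch ever has to sit in RAM.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless feature extraction: no vocabulary has to be held in memory.
vectorizer = HashingVectorizer(n_features=2**18)
clf = SGDClassifier()

all_classes = np.array([0, 1])  # partial_fit needs the full label set up front

def stream_batches():
    # Stand-in for a real out-of-core data source (files, database, socket, ...).
    yield ["spam spam spam", "ham and eggs"], [1, 0]
    yield ["more spam", "just ham"], [1, 0]

for texts, labels in stream_batches():
    X_batch = vectorizer.transform(texts)        # featurize one mini-batch
    clf.partial_fit(X_batch, labels, classes=all_classes)  # incremental update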

sklearn.metrics.pairwise.paired_cosine_distances()

sklearn.metrics.pairwise.paired_cosine_distances(X, Y) [source]

Computes the paired cosine distances between X and Y. Read more in the User Guide.

Parameters:
X : array-like, shape (n_samples, n_features)
Y : array-like, shape (n_samples, n_features)

Returns:
distances : ndarray, shape (n_samples,)

Notes
The cosine distance is equivalent to half the squared Euclidean distance if each sample is normalized to unit norm.
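A small usage example (the input arrays are illustrative, and the exact array formatting of the output depends on the NumPy version): the first pair of rows is orthogonal, so its cosine distance is 1; the second pair is identical, so its distance is 0.

>>> import numpy as np
>>> from sklearn.metrics.pairwise import paired_cosine_distances
>>> X = np.array([[1.0, 0.0], [1.0, 1.0]])
>>> Y = np.array([[0.0, 1.0], [1.0, 1.0]])
>>> paired_cosine_distances(X, Y)
array([ 1.,  0.])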

scikit-learn Tutorials

An introduction to machine learning with scikit-learn
- Machine learning: the problem setting
- Loading an example dataset
- Learning and predicting
- Model persistence
- Conventions

A tutorial on statistical-learning for scientific data processing
- Statistical learning: the setting and the estimator object in scikit-learn
- Supervised learning: predicting an output variable from high-dimensional observations
- Model selection: choosing estimators and their parameters
- Unsupervised learning: seeking representations of the data

Feature selection using SelectFromModel and LassoCV

Use the SelectFromModel meta-transformer along with Lasso to select the best couple of features from the Boston dataset.

# Author: Manoj Kumar <mks542@nyu.edu>
# License: BSD 3 clause

print(__doc__)

import matplotlib.pyplot as plt
import numpy as np

from sklearn.datasets import load_boston
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# Load the boston dataset.
boston = load_boston()
X, y = boston['data'], boston['target']

# We use the base estimator LassoCV since the L1 norm promotes sparsity of features.
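As a self-contained sketch of the same idea (load_diabetes and the max_features shortcut are illustrative substitutions, not the original example, which instead raises the threshold in a loop): LassoCV ranks features by its coefficients and SelectFromModel keeps only the two strongest.

import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LassoCV

# load_diabetes stands in for the Boston data so the sketch runs on its own.
X, y = load_diabetes(return_X_y=True)

# Keep only the two features with the largest absolute Lasso coefficients.
sfm = SelectFromModel(LassoCV(), max_features=2, threshold=-np.inf)
X_selected = sfm.fit_transform(X, y)

print(sfm.get_support())   # boolean mask of the selected columns
print(X_selected.shape)    # (n_samples, 2)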

Blind source separation using FastICA

An example of estimating sources from noisy data. Independent component analysis (ICA) is used to estimate sources given noisy measurements. Imagine 3 instruments playing simultaneously and 3 microphones recording the mixed signals. ICA is used to recover the sources, i.e. what is played by each instrument. Importantly, PCA fails at recovering our instruments since the related signals reflect non-Gaussian processes.

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from scipy import signal
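A condensed sketch of the same idea (the signal-generation details are illustrative, not the original script): mix three non-Gaussian sources and ask FastICA to unmix them. The recovered components match the originals only up to permutation and scaling, which is inherent to ICA.

import numpy as np
from scipy import signal
from sklearn.decomposition import FastICA

rng = np.random.RandomState(0)
t = np.linspace(0, 8, 2000)

# Three "instruments": sinusoid, square wave and sawtooth, plus a little noise.
S = np.c_[np.sin(2 * t), np.sign(np.sin(3 * t)), signal.sawtooth(2 * np.pi * t)]
S += 0.2 * rng.normal(size=S.shape)
S /= S.std(axis=0)

# Each "microphone" records a different weighted mix of the instruments.
A = np.array([[1.0, 1.0, 1.0], [0.5, 2.0, 1.0], [1.5, 1.0, 2.0]])
X = S @ A.T

# Recover the independent sources from the mixed recordings.
ica = FastICA(n_components=3, random_state=0)
S_estimated = ica.fit_transform(X)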

Concatenating multiple feature extraction methods

In many real-world examples, there are many ways to extract features from a dataset. Often it is beneficial to combine several methods to obtain good performance. This example shows how to use FeatureUnion to combine features obtained by PCA and univariate selection. Combining features using this transformer has the benefit that it allows cross-validation and grid searches over the whole process. The combination used in this example is not particularly helpful on this dataset and is only used to illustrate the usage of FeatureUnion.
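A minimal sketch of this pattern (the dataset and parameter grid are illustrative choices rather than the original example's exact settings): PCA components and the best univariate features are concatenated by FeatureUnion, and the whole pipeline is grid-searched end to end.

from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Concatenate PCA components with the top univariate features.
combined_features = FeatureUnion([
    ("pca", PCA(n_components=2)),
    ("univ_select", SelectKBest(k=1)),
])

pipeline = Pipeline([("features", combined_features), ("svm", SVC(kernel="linear"))])

# Grid-search over the feature extractors and the classifier together.
param_grid = {
    "features__pca__n_components": [1, 2, 3],
    "features__univ_select__k": [1, 2],
    "svm__C": [0.1, 1, 10],
}
grid_search = GridSearchCV(pipeline, param_grid=param_grid, cv=5)
grid_search.fit(X, y)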

Principal components analysis

These figures aid in illustrating how a point cloud can be very flat in one direction, which is where PCA comes in to choose a direction that is not flat.

print(__doc__)

# Authors: Gael Varoquaux
#          Jaques Grobler
#          Kevin Hughes
# License: BSD 3 clause

from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Create the data
e = np.exp(1)
np.random.seed(4)

def pdf(x):
    return
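A compact sketch of the underlying point (the data generation here is a simplified stand-in for the example's density-based sampling): generate a 3D cloud that is nearly flat along one axis and check that PCA assigns that direction almost no variance.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.RandomState(4)

# A cloud that is wide in x and y but very thin ("flat") in z.
X = np.empty((500, 3))
X[:, 0] = rng.normal(scale=4.0, size=500)
X[:, 1] = rng.normal(scale=2.0, size=500)
X[:, 2] = rng.normal(scale=0.1, size=500)

pca = PCA(n_components=3)
pca.fit(X)

# The last component captures almost no variance: that is the flat direction.
print(pca.explained_variance_ratio_)
print(pca.components_[-1])  # roughly aligned with the z axis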

The Iris Dataset

This data set consists of 3 different types of irises' (Setosa, Versicolour, and Virginica) petal and sepal length, stored in a 150x4 numpy.ndarray. The rows are the samples and the columns are: Sepal Length, Sepal Width, Petal Length and Petal Width. The plot below uses the first two features. See here for more information on this dataset.

print(__doc__)

# Code source: Gaël Varoquaux
# Modified for documentation by Jaques Grobler
# License: BSD 3 clause

import matplotlib.pyplot as plt
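A shortened sketch of the plot the example describes (the plotting details are illustrative, not the original script): scatter the first two features, coloured by species.

import matplotlib.pyplot as plt
from sklearn import datasets

iris = datasets.load_iris()
X = iris.data[:, :2]   # Sepal Length and Sepal Width only
y = iris.target

plt.figure(figsize=(6, 4))
plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Set1, edgecolor="k")
plt.xlabel("Sepal length")
plt.ylabel("Sepal width")
plt.title("Iris dataset: first two features")
plt.show()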

Decision Tree Regression with AdaBoost

A decision tree is boosted using the AdaBoost.R2 [1] algorithm on a 1D sinusoidal dataset with a small amount of Gaussian noise. 299 boosts (300 decision trees) are compared with a single decision tree regressor. As the number of boosts is increased, the regressor can fit more detail.

[1] Drucker, "Improving Regressors using Boosting Techniques", 1997.

print(__doc__)

# Author: Noel Dawe <noel.dawe@gmail.com>
#
# License: BSD 3 clause

# importing necessary libraries
import numpy as np
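A compact sketch of the comparison (the hyperparameters below mirror the spirit of the example but are illustrative): fit a single shallow tree and an AdaBoost.R2 ensemble of the same trees on noisy sine data, then predict on a dense grid to compare how much detail each captures.

import numpy as np
from sklearn.ensemble import AdaBoostRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(1)

# 1D sinusoidal target with Gaussian noise.
X = np.sort(5 * rng.rand(80, 1), axis=0)
y = np.sin(X).ravel() + 0.1 * rng.normal(size=80)

# A single shallow tree versus a boosted ensemble of the same trees.
single_tree = DecisionTreeRegressor(max_depth=4).fit(X, y)
boosted = AdaBoostRegressor(
    DecisionTreeRegressor(max_depth=4), n_estimators=300, random_state=rng
).fit(X, y)

X_test = np.linspace(0, 5, 500)[:, np.newaxis]
y_single = single_tree.predict(X_test)
y_boosted = boosted.predict(X_test)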

sklearn.pipeline.make_pipeline()

sklearn.pipeline.make_pipeline(*steps) [source]

Construct a Pipeline from the given estimators. This is a shorthand for the Pipeline constructor; it does not require, and does not permit, naming the estimators. Instead, their names will be set to the lowercase of their types automatically.

Returns:
p : Pipeline

Examples

>>> from sklearn.naive_bayes import GaussianNB
>>> from sklearn.preprocessing import StandardScaler
>>> make_pipeline(StandardScaler(), GaussianNB())
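To make the shorthand concrete (a small illustrative comparison, not part of the original reference page): the call above builds the same object as an explicit Pipeline whose step names are the lowercase class names.

>>> from sklearn.pipeline import Pipeline, make_pipeline
>>> from sklearn.preprocessing import StandardScaler
>>> from sklearn.naive_bayes import GaussianNB
>>> pipe = make_pipeline(StandardScaler(), GaussianNB())
>>> [name for name, _ in pipe.steps]
['standardscaler', 'gaussiannb']
>>> # Equivalent explicit construction with hand-written step names:
>>> pipe2 = Pipeline([("standardscaler", StandardScaler()), ("gaussiannb", GaussianNB())])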