Feature Union with Heterogeneous Data Sources

Datasets can often contain components of that require different feature extraction and processing pipelines. This scenario might occur when: Your dataset consists of heterogeneous data types (e.g. raster images and text captions) Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines. This example demonstrates how to use sklearn.feature_extraction.FeatureUnion on a dataset containing different types of features. We use the 20-newsgroups data

Feature transformations with ensembles of trees

Transform your features into a higher dimensional, sparse space. Then train a linear model on these features. First fit an ensemble of trees (totally random trees, a random forest, or gradient boosted trees) on the training set. Then each leaf of each tree in the ensemble is assigned a fixed arbitrary feature index in a new feature space. These leaf indices are then encoded in a one-hot fashion. Each sample goes through the decisions of each tree of the ensemble and ends up in one leaf per tre

Feature selection using SelectFromModel and LassoCV

Use SelectFromModel meta-transformer along with Lasso to select the best couple of features from the Boston dataset. # Author: Manoj Kumar <mks542@nyu.edu> # License: BSD 3 clause print(__doc__) import matplotlib.pyplot as plt import numpy as np from sklearn.datasets import load_boston from sklearn.feature_selection import SelectFromModel from sklearn.linear_model import LassoCV # Load the boston dataset. boston = load_boston() X, y = boston['data'], boston['target'] # We use th

Feature importances with forests of trees

This examples shows the use of forests of trees to evaluate the importance of features on an artificial classification task. The red bars are the feature importances of the forest, along with their inter-trees variability. As expected, the plot suggests that 3 features are informative, while the remaining are not. Out: Feature ranking: 1. feature 1 (0.295902) 2. feature 2 (0.208351) 3. feature 0 (0.177632) 4. feature 3 (0.047121) 5. feature 6 (0.046303) 6. feature 8 (0.046013) 7. feature

Feature agglomeration vs. univariate selection

This example compares 2 dimensionality reduction strategies: univariate feature selection with Anova feature agglomeration with Ward hierarchical clustering Both methods are compared in a regression problem using a BayesianRidge as supervised estimator. # Author: Alexandre Gramfort <alexandre.gramfort@inria.fr> # License: BSD 3 clause print(__doc__) import shutil import tempfile import numpy as np import matplotlib.pyplot as plt from scipy import linalg, ndimage from sklearn.featur

Feature agglomeration

These images how similar features are merged together using feature agglomeration. print(__doc__) # Code source: Ga Varoquaux # Modified for documentation by Jaques Grobler # License: BSD 3 clause import numpy as np import matplotlib.pyplot as plt from sklearn import datasets, cluster from sklearn.feature_extraction.image import grid_to_graph digits = datasets.load_digits() images = digits.images X = np.reshape(images, (len(images), -1)) connectivity = grid_to_graph(*images[0].shape)

FastICA on 2D point clouds

This example illustrates visually in the feature space a comparison by results using two different component analysis techniques. Independent component analysis (ICA) vs Principal component analysis (PCA). Representing ICA in the feature space gives the view of ?geometric ICA?: ICA is an algorithm that finds directions in the feature space corresponding to projections with high non-Gaussianity. These directions need not be orthogonal in the original feature space, but they are orthogonal in th

Faces recognition example using eigenfaces and SVMs

The dataset used in this example is a preprocessed excerpt of the ?Labeled Faces in the Wild?, aka LFW: http://vis-www.cs.umass.edu/lfw/lfw-funneled.tgz (233MB) Expected results for the top 5 most represented people in the dataset: Ariel Sharon 0.67 0.92 0.77 13 Colin Powell 0.75 0.78 0.76 60 Donald Rumsfeld 0.78 0.67 0.72 27 George W Bush 0.86 0.86 0.86 146 Gerhard Schroeder 0.76 0.76 0.76 25 Hugo Chavez 0.67 0.67 0.67 15 Tony Blair 0.81 0.69 0.75 36 avg / total 0.80 0.80 0.80

Faces dataset decompositions

This example applies to The Olivetti faces dataset different unsupervised matrix decomposition (dimension reduction) methods from the module sklearn.decomposition (see the documentation chapter Decomposing signals in components (matrix factorization problems)) . print(__doc__) # Authors: Vlad Niculae, Alexandre Gramfort # License: BSD 3 clause import logging from time import time from numpy.random import RandomState import matplotlib.pyplot as plt from sklearn.datasets import fetch_olivett

Face completion with a multi-output estimators

This example shows the use of multi-output estimator to complete images. The goal is to predict the lower half of a face given its upper half. The first column of images shows true faces. The next columns illustrate how extremely randomized trees, k nearest neighbors, linear regression and ridge regression complete the lower half of those faces. print(__doc__) import numpy as np import matplotlib.pyplot as plt from sklearn.datasets import fetch_olivetti_faces from sklearn.utils.validatio