feature_selection.GenericUnivariateSelect()

class sklearn.feature_selection.GenericUnivariateSelect(score_func=f_classif, mode='percentile', param=1e-05) [source] Univariate feature selector with configurable strategy. Read more in the User Guide. Parameters: score_func : callable Function taking two arrays X and y, and returning a pair of arrays (scores, pvalues). For modes 'percentile' or 'k_best' it can return a single array scores. mode : {'percentile', 'k_best', 'fpr', 'fdr', 'fwe'} Feature selection mode. param : float or int depending on the feature selection mode Parameter of the corresponding mode.
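
A minimal usage sketch (toy data; the chi2 score function and the mode='k_best', param=2 choices are illustrative, not defaults):

from sklearn.datasets import load_iris
from sklearn.feature_selection import GenericUnivariateSelect, chi2

X, y = load_iris(return_X_y=True)
# Keep the 2 features with the best chi-squared scores.
selector = GenericUnivariateSelect(score_func=chi2, mode='k_best', param=2)
X_new = selector.fit_transform(X, y)
print(X.shape, '->', X_new.shape)  # (150, 4) -> (150, 2)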

feature_selection.RFE()

class sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0) [source] Feature ranking with recursive feature elimination. Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
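
A minimal sketch of RFE; the logistic-regression estimator and synthetic data are illustrative assumptions:

from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, n_features=10, n_informative=3, random_state=0)
# Drop step=1 feature per iteration until n_features_to_select remain.
rfe = RFE(estimator=LogisticRegression(), n_features_to_select=3, step=1)
rfe.fit(X, y)
print(rfe.support_)   # boolean mask of the selected features
print(rfe.ranking_)   # selected features have rank 1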

feature_extraction.text.TfidfVectorizer()

class sklearn.feature_extraction.text.TfidfVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer=u'word', stop_words=None, token_pattern=u'(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>, norm=u'l2', use_idf=True, smooth_idf=True, sublinear_tf=False) [source] Convert a collection of raw documents to a matrix of TF-IDF features. Equivalent to CountVectorizer followed by TfidfTransformer.
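
A minimal sketch on an invented two-document corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ['the cat sat on the mat', 'the dog ate my homework']
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)      # sparse matrix, one row per document
print(sorted(vectorizer.vocabulary_))   # tokens learned from the corpus
print(X.shape)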

feature_extraction.text.TfidfTransformer()

class sklearn.feature_extraction.text.TfidfTransformer(norm=u'l2', use_idf=True, smooth_idf=True, sublinear_tf=False) [source] Transform a count matrix to a normalized tf or tf-idf representation. Tf means term-frequency while tf-idf means term-frequency times inverse document-frequency. This is a common term weighting scheme in information retrieval that has also found good use in document classification. The goal of using tf-idf instead of the raw frequencies of occurrence of a token in a given document is to scale down the impact of tokens that occur very frequently in a given corpus and that are hence empirically less informative than features that occur in a small fraction of the training corpus.
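
A minimal sketch chaining CountVectorizer into TfidfTransformer on an invented corpus:

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

docs = ['the cat sat', 'the dog sat', 'the dog barked']
counts = CountVectorizer().fit_transform(docs)  # raw term counts
tfidf = TfidfTransformer(norm='l2', use_idf=True).fit_transform(counts)
print(tfidf.toarray())  # each row is an l2-normalized tf-idf vector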

feature_extraction.text.HashingVectorizer()

class sklearn.feature_extraction.text.HashingVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer=u'word', n_features=1048576, binary=False, norm=u'l2', non_negative=False, dtype=<type 'numpy.float64'>) [source] Convert a collection of text documents to a matrix of token occurrences. It turns a collection of text documents into a scipy.sparse matrix holding token occurrence counts (or binary occurrence information), possibly normalized as token frequencies if norm='l1' or projected on the euclidean unit sphere if norm='l2'.
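
A minimal sketch on a toy corpus; n_features=2 ** 10 is an illustrative choice, much smaller than the default:

from sklearn.feature_extraction.text import HashingVectorizer

docs = ['the cat sat on the mat', 'the dog ate my homework']
# Stateless: no vocabulary is learned or stored, so no fit is required.
vectorizer = HashingVectorizer(n_features=2 ** 10)
X = vectorizer.transform(docs)
print(X.shape)  # (2, 1024), sparse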

feature_extraction.text.CountVectorizer()

class sklearn.feature_extraction.text.CountVectorizer(input=u'content', encoding=u'utf-8', decode_error=u'strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, stop_words=None, token_pattern=u'(?u)\\b\\w\\w+\\b', ngram_range=(1, 1), analyzer=u'word', max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<type 'numpy.int64'>) [source] Convert a collection of text documents to a matrix of token counts. This implementation produces a sparse representation of the counts.
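
A minimal sketch on an invented corpus:

from sklearn.feature_extraction.text import CountVectorizer

docs = ['the cat sat on the mat', 'the cat ate the fish']
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(docs)  # sparse matrix of token counts
print(vectorizer.vocabulary_)       # token -> column index
print(X.toarray())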

feature_extraction.image.PatchExtractor()

class sklearn.feature_extraction.image.PatchExtractor(patch_size=None, max_patches=None, random_state=None) [source] Extracts patches from a collection of images. Read more in the User Guide. Parameters: patch_size : tuple of ints (patch_height, patch_width) The dimensions of one patch. max_patches : integer or float, optional (default is None) The maximum number of patches per image to extract. If max_patches is a float in (0, 1), it is taken to mean a proportion of the total number of patches.
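
A minimal sketch using random arrays in place of real images (shapes and parameter values are illustrative):

import numpy as np
from sklearn.feature_extraction.image import PatchExtractor

# Fake batch of four 32x32 grayscale images.
images = np.random.RandomState(0).rand(4, 32, 32)
extractor = PatchExtractor(patch_size=(8, 8), max_patches=10, random_state=0)
patches = extractor.transform(images)
print(patches.shape)  # (40, 8, 8): max_patches patches per image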

feature_extraction.FeatureHasher()

class sklearn.feature_extraction.FeatureHasher(n_features=1048576, input_type='dict', dtype=<type 'numpy.float64'>, non_negative=False) [source] Implements feature hashing, aka the hashing trick. This class turns sequences of symbolic feature names (strings) into scipy.sparse matrices, using a hash function to compute the matrix column corresponding to a name. The hash function employed is the signed 32-bit version of Murmurhash3. Feature names of type byte string are used as-is. Unicode strings are converted to UTF-8 first, but no Unicode normalization is done.
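
A minimal sketch with invented feature dicts (input_type='dict' is the default shown above; n_features=10 is kept small only for display):

from sklearn.feature_extraction import FeatureHasher

samples = [{'dog': 1, 'cat': 2, 'elephant': 4}, {'dog': 2, 'run': 5}]
hasher = FeatureHasher(n_features=10)
X = hasher.transform(samples)  # scipy.sparse matrix with 10 columns
print(X.toarray())             # signed hashing: entries may be negative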

feature_extraction.DictVectorizer()

class sklearn.feature_extraction.DictVectorizer(dtype=<type 'numpy.float64'>, separator='=', sparse=True, sort=True) [source] Transforms lists of feature-value mappings to vectors. This transformer turns lists of mappings (dict-like objects) of feature names to feature values into Numpy arrays or scipy.sparse matrices for use with scikit-learn estimators. When feature values are strings, this transformer will do a binary one-hot (aka one-of-K) coding: one boolean-valued feature is constructed for each of the possible string values that the feature can take on.
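
A minimal sketch with invented records, showing the one-hot coding of string values:

from sklearn.feature_extraction import DictVectorizer

samples = [{'city': 'London', 'temperature': 12.0},
           {'city': 'Paris', 'temperature': 18.0}]
vec = DictVectorizer(sparse=False)
X = vec.fit_transform(samples)
print(vec.feature_names_)  # ['city=London', 'city=Paris', 'temperature']
print(X)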

FeatureHasher and DictVectorizer Comparison

Compares FeatureHasher and DictVectorizer by using both to vectorize text documents. The example demonstrates syntax and speed only; it doesn't actually do anything useful with the extracted vectors. See the example scripts {document_classification_20newsgroups,clustering}.py for actual learning on text documents. A discrepancy between the number of terms reported for DictVectorizer and for FeatureHasher is to be expected due to hash collisions. # Author: Lars Buitinck # License: BSD 3 clause
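
A condensed sketch of the contrast (invented dicts; the actual example script times both vectorizers on real news text):

from sklearn.feature_extraction import DictVectorizer, FeatureHasher

samples = [{'foo': 1, 'bar': 2}, {'foo': 3, 'baz': 1}]
X_dict = DictVectorizer().fit_transform(samples)         # exact: stores a vocabulary
X_hash = FeatureHasher(n_features=8).transform(samples)  # stateless: hashes names, may collide
print(X_dict.shape, X_hash.shape)  # (2, 3) vs (2, 8): column layouts differ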