Classification of text documents

This example shows how scikit-learn can be used to classify documents by topic using a bag-of-words approach. It uses a scipy.sparse matrix to store the features instead of standard numpy arrays. The dataset used in this example is the 20 newsgroups dataset, which should be downloaded from http://mlcomp.org (free registration required): http://mlcomp.org/datasets/379. Once downloaded, unzip the archive somewhere on your filesystem, for instance in: % mkdir -p ~/data/
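
A minimal sketch of this workflow, using scikit-learn's built-in fetch_20newsgroups loader in place of the mlcomp archive; the loader, the two-category subset, and SGDClassifier are illustrative substitutes, not the exact setup of the original example:

    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import SGDClassifier

    # Two categories keep the sketch small; any subset of the 20 works.
    categories = ['alt.atheism', 'sci.space']
    train = fetch_20newsgroups(subset='train', categories=categories)
    test = fetch_20newsgroups(subset='test', categories=categories)

    # The vectorizer returns a scipy.sparse matrix, not a dense numpy
    # array: each post uses only a tiny fraction of the vocabulary.
    vectorizer = TfidfVectorizer()
    X_train = vectorizer.fit_transform(train.data)
    X_test = vectorizer.transform(test.data)

    clf = SGDClassifier(random_state=0).fit(X_train, train.target)
    print(clf.score(X_test, test.target))  # mean accuracy on held-out posts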

Choosing the right estimator

Often the hardest part of solving a machine learning problem is finding the right estimator for the job. Different estimators are better suited to different types of data and different problems. The flowchart below is designed to give users a rough guide on which estimators to try on their data. Click on any estimator in the chart to see its documentation.

calibration.CalibratedClassifierCV()

class sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method='sigmoid', cv=3) [source] Probability calibration with isotonic regression or sigmoid. With this class, the base_estimator is fit on the train set of the cross-validation generator and the test set is used for calibration. The probabilities for each of the folds are then averaged for prediction. If cv='prefit' is passed to __init__, it is assumed that base_estimator has already been fitted, and all data is used for calibration.
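
A usage sketch of both modes described above; the GaussianNB base estimator and the synthetic data are illustrative assumptions, and the call matches the (older) signature documented here, where the first parameter is base_estimator:

    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.datasets import make_classification
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = make_classification(n_samples=1000, random_state=0)
    X_train, X_calib, y_train, y_calib = train_test_split(X, y, random_state=0)

    # Default mode: the base estimator is fit on each CV train split and
    # calibrated on the matching test split; fold probabilities are averaged.
    calibrated = CalibratedClassifierCV(GaussianNB(), method='sigmoid', cv=3)
    calibrated.fit(X_train, y_train)
    print(calibrated.predict_proba(X_calib[:3]))

    # 'prefit' mode: the base estimator is assumed already fitted, and all
    # data passed to fit() is used for calibration only.
    base = GaussianNB().fit(X_train, y_train)
    CalibratedClassifierCV(base, cv='prefit').fit(X_calib, y_calib)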

Blind source separation using FastICA

An example of estimating sources from noisy data. Independent component analysis (ICA) is used to estimate sources given noisy measurements. Imagine 3 instruments playing simultaneously and 3 microphones recording the mixed signals. ICA is used to recover the sources, i.e. what is played by each instrument. Importantly, PCA fails at recovering our instruments since the related signals reflect non-Gaussian processes. The excerpt cuts off at the example's opening lines:

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import signal
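
A self-contained sketch of how the example proceeds from there (imports restated so the sketch runs on its own); the signal shapes and mixing matrix follow the published example but should be read as illustrative:

    import numpy as np
    import matplotlib.pyplot as plt
    from scipy import signal
    from sklearn.decomposition import FastICA, PCA

    # Three source signals: a sine, a square wave, and a sawtooth
    np.random.seed(0)
    time = np.linspace(0, 8, 2000)
    s1 = np.sin(2 * time)
    s2 = np.sign(np.sin(3 * time))
    s3 = signal.sawtooth(2 * np.pi * time)

    S = np.c_[s1, s2, s3]
    S += 0.2 * np.random.normal(size=S.shape)  # add measurement noise
    S /= S.std(axis=0)                         # standardize

    # Mix the sources: each "microphone" records a weighted sum
    A = np.array([[1.0, 1.0, 1.0], [0.5, 2.0, 1.0], [1.5, 1.0, 2.0]])
    X = np.dot(S, A.T)

    # ICA recovers the independent sources; PCA only finds orthogonal
    # directions of maximal variance, which keeps the instruments mixed
    S_ica = FastICA(n_components=3, random_state=0).fit_transform(X)
    S_pca = PCA(n_components=3).fit_transform(X)

    for sig, title in [(S, 'True sources'), (X, 'Observations'),
                       (S_ica, 'ICA recovered'), (S_pca, 'PCA recovered')]:
        plt.figure()
        plt.plot(sig)
        plt.title(title)
    plt.show()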

Biclustering documents with the Spectral Co-clustering algorithm

This example demonstrates the Spectral Co-clustering algorithm on the twenty newsgroups dataset. The 'comp.os.ms-windows.misc' category is excluded because it contains many posts containing nothing but data. The TF-IDF vectorized posts form a word frequency matrix, which is then biclustered using Dhillon's Spectral Co-Clustering algorithm. The resulting document-word biclusters indicate subsets of words used more often in those subsets of documents. For a few of the best biclusters, the most common document categories and the most important words are printed.
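
A minimal sketch of this pipeline; the category subset, vectorizer settings, and cluster count are illustrative assumptions, not the parameters of the original example:

    from sklearn.cluster import SpectralCoclustering
    from sklearn.datasets import fetch_20newsgroups
    from sklearn.feature_extraction.text import TfidfVectorizer

    categories = ['rec.autos', 'sci.med', 'talk.politics.guns']
    posts = fetch_20newsgroups(subset='train', categories=categories)

    # Rows are documents, columns are words: the matrix being biclustered
    vectorizer = TfidfVectorizer(stop_words='english', min_df=5)
    X = vectorizer.fit_transform(posts.data)

    model = SpectralCoclustering(n_clusters=3, random_state=0)
    model.fit(X)

    # Each bicluster pairs a subset of documents with a subset of words
    doc_idx, word_idx = model.get_indices(0)
    terms = vectorizer.get_feature_names_out()  # get_feature_names on old releases
    print([terms[i] for i in word_idx[:10]])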

base.TransformerMixin

class sklearn.base.TransformerMixin [source] Mixin class for all transformers in scikit-learn. Methods fit_transform(X[, y]) Fit to data, then transform it. __init__() x.__init__(...) initializes x; see help(type(x)) for signature fit_transform(X, y=None, **fit_params) [source] Fit to data, then transform it. Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X. Parameters: X : numpy array of shape [n_samples, n_features] Training set.
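
The practical point of the mixin is that defining fit and transform yields fit_transform for free. A hypothetical toy transformer illustrating this (MeanCenterer is not a scikit-learn class):

    import numpy as np
    from sklearn.base import BaseEstimator, TransformerMixin

    class MeanCenterer(BaseEstimator, TransformerMixin):
        """Subtract the per-feature mean learned during fit."""

        def fit(self, X, y=None):
            self.mean_ = np.asarray(X).mean(axis=0)
            return self

        def transform(self, X):
            return np.asarray(X) - self.mean_

    X = np.array([[1.0, 2.0], [3.0, 4.0]])
    print(MeanCenterer().fit_transform(X))  # fit_transform comes from the mixin
    # [[-1. -1.]
    #  [ 1.  1.]]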

Bayesian Ridge Regression

Computes a Bayesian Ridge Regression on a synthetic dataset. See Bayesian Ridge Regression for more information on the regressor. Compared to the OLS (ordinary least squares) estimator, the coefficient weights are slightly shifted toward zero, which stabilises them. As the prior on the weights is Gaussian, the histogram of the estimated weights is Gaussian. The estimation of the model is done by iteratively maximizing the marginal log-likelihood of the observations.
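
The excerpt cuts off at the example's code; a hedged, self-contained sketch of such a fit, where the data sizes, sparse true weights, and noise level are illustrative choices rather than the published example's exact settings:

    import numpy as np
    from sklearn.linear_model import BayesianRidge, LinearRegression

    np.random.seed(0)
    n_samples, n_features = 100, 100
    X = np.random.randn(n_samples, n_features)
    w = np.zeros(n_features)
    w[:10] = np.random.randn(10)  # only ten informative weights
    y = X.dot(w) + 0.5 * np.random.randn(n_samples)

    # compute_score=True records the marginal log-likelihood per iteration
    clf = BayesianRidge(compute_score=True).fit(X, y)
    ols = LinearRegression().fit(X, y)

    # Bayesian weights are shrunk toward zero relative to OLS
    print(np.abs(clf.coef_).mean(), np.abs(ols.coef_).mean())
    print(clf.scores_[-1])  # final marginal log-likelihood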

base.RegressorMixin

class sklearn.base.RegressorMixin [source] Mixin class for all regression estimators in scikit-learn. Methods score(X, y[, sample_weight]) Returns the coefficient of determination R^2 of the prediction. __init__() x.__init__(...) initializes x; see help(type(x)) for signature score(X, y, sample_weight=None) [source] Returns the coefficient of determination R^2 of the prediction. The coefficient R^2 is defined as (1 - u/v), where u is the residual sum of squares ((y_true - y_pred) ** 2).sum() and v is the total sum of squares ((y_true - y_true.mean()) ** 2).sum().
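
A quick check of score against the definition above; the regressor and data are illustrative:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    X = np.array([[1.0], [2.0], [3.0], [4.0]])
    y = np.array([1.0, 2.5, 2.5, 4.0])

    reg = LinearRegression().fit(X, y)
    y_pred = reg.predict(X)

    u = ((y - y_pred) ** 2).sum()    # residual sum of squares
    v = ((y - y.mean()) ** 2).sum()  # total sum of squares
    print(1 - u / v)                 # R^2 by the formula
    print(reg.score(X, y))           # same value via RegressorMixin.score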

base.ClusterMixin

class sklearn.base.ClusterMixin [source] Mixin class for all cluster estimators in scikit-learn. Methods fit_predict(X[, y]) Performs clustering on X and returns cluster labels. __init__() x.__init__(...) initializes x; see help(type(x)) for signature fit_predict(X, y=None) [source] Performs clustering on X and returns cluster labels. Parameters: X : ndarray, shape (n_samples, n_features) Input data. Returns: y : ndarray, shape (n_samples,) Cluster labels.
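
Since clustering estimators such as KMeans inherit this mixin, labels come from a single call; the estimator and toy data here are illustrative:

    import numpy as np
    from sklearn.cluster import KMeans

    X = np.array([[0.0, 0.0], [0.1, 0.2], [5.0, 5.0], [5.2, 4.9]])
    labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    print(labels)  # one cluster label per sample, e.g. [0 0 1 1]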

base.ClassifierMixin

class sklearn.base.ClassifierMixin [source] Mixin class for all classifiers in scikit-learn. Methods score(X, y[, sample_weight]) Returns the mean accuracy on the given test data and labels. __init__() x.__init__(...) initializes x; see help(type(x)) for signature score(X, y, sample_weight=None) [source] Returns the mean accuracy on the given test data and labels. In multi-label classification, this is the subset accuracy, which is a harsh metric since you require for each sample that each label set be correctly predicted.
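
A quick illustration that score is plain mean accuracy; the estimator and toy data are illustrative:

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    X = np.array([[0.0], [1.0], [2.0], [3.0]])
    y = np.array([0, 0, 1, 1])

    clf = KNeighborsClassifier(n_neighbors=1).fit(X, y)
    print(clf.score(X, y))               # 1.0: all four samples correct
    print((clf.predict(X) == y).mean())  # the same computation by hand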