Classification of text documents using sparse features

This is an example showing how scikit-learn can be used to classify documents by topic using a bag-of-words approach. The example uses a scipy.sparse matrix to store the features and demonstrates various classifiers that can efficiently handle sparse matrices. The dataset is the 20 newsgroups dataset; it is automatically downloaded and then cached. The bar plot indicates the accuracy, training time (normalized) and test time (normalized) of each classifier.
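A minimal sketch of the pipeline described above, with a single linear classifier standing in for the example's full benchmark; the two categories chosen here are illustrative:

from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn import metrics

# Load a subset of the 20 newsgroups data (downloaded and cached on first use)
categories = ['alt.atheism', 'sci.space']
train = fetch_20newsgroups(subset='train', categories=categories)
test = fetch_20newsgroups(subset='test', categories=categories)

# Bag-of-words features, stored as a scipy.sparse matrix
vectorizer = TfidfVectorizer(sublinear_tf=True, max_df=0.5, stop_words='english')
X_train = vectorizer.fit_transform(train.data)
X_test = vectorizer.transform(test.data)

# Any classifier that accepts sparse input works here; LinearSVC is one choice
clf = LinearSVC()
clf.fit(X_train, train.target)
pred = clf.predict(X_test)
print(metrics.accuracy_score(test.target, pred))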

sklearn.datasets.load_files()

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0) [source]

Load text files with categories as subfolder names. Individual samples are assumed to be files stored in a two-level folder structure such as the following:

    container_folder/
        category_1_folder/
            file_1.txt file_2.txt ... file_42.txt
        category_2_folder/
            file_43.txt file_44.txt ...

The folder names are used as supervised signal label names.
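A short sketch of calling load_files; the container folder name below is hypothetical and assumed to follow the layout above:

from sklearn.datasets import load_files

# 'txt_sentoken' is a hypothetical container folder: one subfolder per
# category, one text file per sample
dataset = load_files('txt_sentoken', encoding='utf-8', decode_error='replace')
print(dataset.target_names)             # folder names become the class labels
print(len(dataset.data), dataset.target[:10])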

sklearn.cluster.k_means()

sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False) [source]

K-means clustering algorithm. Read more in the User Guide.

Parameters:
X : array-like or sparse matrix, shape (n_samples, n_features)
    The observations to cluster.
n_clusters : int
    The number of clusters to form as well as the number of centroids to generate.
max_iter : int, optional, default 300
    Maximum number of iterations of the k-means algorithm to run.
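A minimal sketch of the functional interface, on synthetic data; the blob parameters are illustrative:

from sklearn.cluster import k_means
from sklearn.datasets import make_blobs

# Synthetic data with three well-separated groups
X, _ = make_blobs(n_samples=300, centers=3, random_state=0)

# The functional interface returns the centroids, the label of each sample,
# and the final value of the inertia criterion
centroids, labels, inertia = k_means(X, n_clusters=3, random_state=0)
print(centroids.shape, inertia)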

Illustration of Gaussian process classification on the XOR dataset

This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results because the class boundaries are linear and coincide with the coordinate axes. In general, however, stationary kernels often obtain better results.
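A reduced sketch of the comparison, assuming settings that loosely follow the example (the squared DotProduct kernel is the non-stationary choice):

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, DotProduct

# XOR data: the label flips with the sign pattern of the two coordinates
rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)

# Fit one GPC per kernel and compare the fitted log-marginal likelihoods
for kernel in (RBF(length_scale=1.0), DotProduct(sigma_0=1.0) ** 2):
    clf = GaussianProcessClassifier(kernel=kernel).fit(X, y)
    print(kernel, clf.log_marginal_likelihood(clf.kernel_.theta))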

2.4. Biclustering

Biclustering can be performed with the module sklearn.cluster.bicluster. Biclustering algorithms simultaneously cluster the rows and columns of a data matrix. These clusters of rows and columns are known as biclusters. Each determines a submatrix of the original data matrix with some desired properties. For instance, given a matrix of shape (10, 10), one possible bicluster with three rows and two columns induces a submatrix of shape (3, 2):

>>> import numpy as np
>>> data = np.arange(100).reshape(10, 10)
>>> rows = np.array([0, 2, 3])[:, np.newaxis]
>>> columns = np.array([1, 2])
>>> data[rows, columns]
array([[ 1,  2],
       [21, 22],
       [31, 32]])
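A minimal sketch of running one of the module's estimators on data with a known bicluster structure, using the module path named above (newer releases expose the same class from sklearn.cluster):

from sklearn.cluster.bicluster import SpectralCoclustering
from sklearn.datasets import make_biclusters
from sklearn.metrics import consensus_score

# Generate a matrix with an implanted block bicluster structure
data, rows, columns = make_biclusters(shape=(300, 300), n_clusters=5,
                                      noise=5, random_state=0)

model = SpectralCoclustering(n_clusters=5, random_state=0)
model.fit(data)

# Compare the found biclusters against the ones used to build the data
print(consensus_score(model.biclusters_, (rows, columns)))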

sklearn.metrics.confusion_matrix()

sklearn.metrics.confusion_matrix(y_true, y_pred, labels=None, sample_weight=None) [source]

Compute confusion matrix to evaluate the accuracy of a classification. By definition a confusion matrix C is such that C_{i, j} is equal to the number of observations known to be in group i but predicted to be in group j. Thus in binary classification, the count of true negatives is C_{0, 0}, false negatives is C_{1, 0}, true positives is C_{1, 1} and false positives is C_{0, 1}. Read more in the User Guide.

Parameters:
y_true : array, shape = [n_samples]
    Ground truth (correct) target values.
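A small sketch for the binary case, showing the count layout described above:

from sklearn.metrics import confusion_matrix

y_true = [0, 0, 1, 1, 1, 0]
y_pred = [0, 1, 1, 1, 0, 0]

C = confusion_matrix(y_true, y_pred)
print(C)

# In the binary case the four counts can be unpacked in this order:
tn, fp, fn, tp = C.ravel()
print(tn, fp, fn, tp)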

Outlier detection with several methods.

When the amount of contamination is known, this example illustrates three different ways of performing Novelty and Outlier Detection:

- based on a robust estimator of covariance, which assumes that the data are Gaussian distributed and performs better than the One-Class SVM in that case;
- using the One-Class SVM and its ability to capture the shape of the data set, hence performing better when the data is strongly non-Gaussian, i.e. with two well-separated clusters;
- using the Isolation Forest algorithm, which is based on random forests and hence more adapted to large-dimensional settings.
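A minimal sketch of fitting all three estimators on the same contaminated data; the mixture and parameter values are illustrative, not those of the example:

import numpy as np
from sklearn.covariance import EllipticEnvelope
from sklearn.ensemble import IsolationForest
from sklearn.svm import OneClassSVM

# 100 tight inliers plus 20 uniformly scattered outliers
rng = np.random.RandomState(42)
X = np.r_[0.3 * rng.randn(100, 2), rng.uniform(low=-4, high=4, size=(20, 2))]
outliers_fraction = 20 / 120.0  # known amount of contamination

# For the One-Class SVM, nu plays the role of the contamination parameter
classifiers = {
    'Robust covariance': EllipticEnvelope(contamination=outliers_fraction),
    'One-Class SVM': OneClassSVM(nu=outliers_fraction, kernel='rbf', gamma=0.1),
    'Isolation Forest': IsolationForest(contamination=outliers_fraction,
                                        random_state=rng),
}
for name, clf in classifiers.items():
    clf.fit(X)
    y_pred = clf.predict(X)  # +1 for inliers, -1 for outliers
    print(name, (y_pred == -1).sum(), 'flagged outliers')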

sklearn.datasets.fetch_rcv1()

sklearn.datasets.fetch_rcv1(data_home=None, subset='all', download_if_missing=True, random_state=None, shuffle=False) [source]

Load the RCV1 multilabel dataset, downloading it if necessary. Version: RCV1-v2, vectors, full sets, topics multilabels.

Classes: 103
Samples total: 804414
Dimensionality: 47236
Features: real, between 0 and 1

Read more in the User Guide. New in version 0.17.

Parameters:
data_home : string, optional
    Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
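A minimal sketch of loading the training subset; note the first call triggers a large download, after which the local cache is used:

from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1(subset='train')

print(rcv1.data.shape)      # sparse CSR matrix of real-valued features
print(rcv1.target.shape)    # sparse indicator matrix over the 103 topic labels
print(rcv1.target_names[:5])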

sklearn.ensemble.RandomForestRegressor()

class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False) [source]

A random forest regressor. A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and to control over-fitting.
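A minimal sketch of fitting the estimator on synthetic regression data; the dataset parameters are illustrative:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=500, n_features=10, random_state=0)

# n_estimators is the main knob; 10 matches the default in the signature above
reg = RandomForestRegressor(n_estimators=10, random_state=0)
reg.fit(X, y)
print(reg.predict(X[:3]))
print(reg.score(X, y))  # R^2 on the training data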

sklearn.metrics.silhouette_score()

sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds) [source]

Compute the mean Silhouette Coefficient of all samples. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels is 2 <= n_labels <= n_samples - 1.
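A minimal sketch of scoring a clustering; the data and cluster count are illustrative:

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=1)

# Values near +1 indicate dense, well-separated clusters
labels = KMeans(n_clusters=4, random_state=1).fit_predict(X)
print(silhouette_score(X, labels, metric='euclidean'))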