Scalability of Approximate Nearest Neighbors

This example studies the scalability profile of approximate 10-neighbor queries using LSHForest with n_estimators=20 and n_candidates=200 while varying the number of samples in the dataset. The first plot shows the relationship between query time and index size for LSHForest. Query time is compared against brute-force exact nearest neighbor search for the same index sizes. Brute-force queries scale predictably and linearly with the index size (full scan).
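
A minimal timing sketch of that comparison, assuming a scikit-learn version that still ships sklearn.neighbors.LSHForest; the index size, dimensionality, and number of queries are illustrative, not the benchmark settings:

import time
import numpy as np
from sklearn.neighbors import LSHForest, NearestNeighbors

rng = np.random.RandomState(42)
X = rng.randn(10000, 100)        # indexed samples
queries = rng.randn(10, 100)     # query points

lshf = LSHForest(n_estimators=20, n_candidates=200, n_neighbors=10).fit(X)
brute = NearestNeighbors(n_neighbors=10, algorithm='brute').fit(X)

t0 = time.time()
lshf.kneighbors(queries, return_distance=False)
print("LSHForest query time:   %.3fs" % (time.time() - t0))

t0 = time.time()
brute.kneighbors(queries, return_distance=False)
print("brute force query time: %.3fs" % (time.time() - t0))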

tree.DecisionTreeClassifier()

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_split=1e-07, class_weight=None, presort=False)

A decision tree classifier. Read more in the User Guide.

Parameters:

criterion : string, optional (default='gini')
    The function to measure the quality of a split. Supported criteria are 'gini' for the Gini impurity and 'entropy' for the information gain.
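
A brief usage sketch (the dataset and hyperparameters are illustrative): fit a tree on the iris data and report its training accuracy.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
clf = DecisionTreeClassifier(criterion='gini', max_depth=3, random_state=0)
clf.fit(iris.data, iris.target)
print(clf.score(iris.data, iris.target))   # mean training accuracy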

linear_model.RidgeClassifier()

class sklearn.linear_model.RidgeClassifier(alpha=1.0, fit_intercept=True, normalize=False, copy_X=True, max_iter=None, tol=0.001, class_weight=None, solver='auto', random_state=None)

Classifier using Ridge regression. Read more in the User Guide.

Parameters:

alpha : float
    Regularization strength; must be a positive float. Regularization improves the conditioning of the problem and reduces the variance of the estimates. Larger values specify stronger regularization.
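
A brief usage sketch (the dataset and alpha value are illustrative): RidgeClassifier maps binary targets to {-1, 1}, solves a ridge regression problem, and predicts the sign.

from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import RidgeClassifier

X, y = load_breast_cancer(return_X_y=True)
clf = RidgeClassifier(alpha=1.0).fit(X, y)
print(clf.score(X, y))   # mean training accuracy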

Hierarchical clustering

This example builds a swiss roll dataset and runs hierarchical clustering on the points' positions. For more information, see Hierarchical clustering. In a first step, the hierarchical clustering is performed without connectivity constraints and is based solely on distance; in a second step, the clustering is restricted to the k-Nearest Neighbors graph, i.e., it is a hierarchical clustering with a structure prior. Some of the clusters learned without connectivity constraints do not respect the structure of the swiss roll and extend across different folds of the manifold.
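
A condensed sketch of the two steps (sample count, number of clusters, and number of neighbors are illustrative): agglomerative clustering on a swiss roll, first without and then with a k-NN connectivity constraint.

from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_swiss_roll
from sklearn.neighbors import kneighbors_graph

X, _ = make_swiss_roll(n_samples=1500, noise=0.05)

# Step 1: unstructured clustering, based solely on distance.
unstructured = AgglomerativeClustering(n_clusters=6, linkage='ward').fit(X)

# Step 2: clustering restricted to the k-Nearest Neighbors graph.
connectivity = kneighbors_graph(X, n_neighbors=10, include_self=False)
structured = AgglomerativeClustering(n_clusters=6, linkage='ward',
                                     connectivity=connectivity).fit(X)

print(unstructured.labels_[:10])
print(structured.labels_[:10])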

sklearn.metrics.log_loss()

sklearn.metrics.log_loss(y_true, y_pred, eps=1e-15, normalize=True, sample_weight=None, labels=None)

Log loss, aka logistic loss or cross-entropy loss. This is the loss function used in (multinomial) logistic regression and extensions of it such as neural networks, defined as the negative log-likelihood of the true labels given a probabilistic classifier's predictions. The log loss is only defined for two or more labels. For a single sample with true label yt in {0, 1} and estimated probability yp that yt = 1, the log loss is

    -log P(yt | yp) = -(yt log(yp) + (1 - yt) log(1 - yp))
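
A tiny worked example (labels and probabilities are made up): log_loss takes the true labels and one row of predicted class probabilities per sample.

from sklearn.metrics import log_loss

y_true = [0, 1, 1, 0]
y_prob = [[0.9, 0.1],   # [P(class 0), P(class 1)] for each sample
          [0.2, 0.8],
          [0.3, 0.7],
          [0.6, 0.4]]
print(log_loss(y_true, y_prob))   # smaller is better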

covariance.ShrunkCovariance()

class sklearn.covariance.ShrunkCovariance(store_precision=True, assume_centered=False, shrinkage=0.1)

Covariance estimator with shrinkage. Read more in the User Guide.

Parameters:

store_precision : boolean, default True
    Specify if the estimated precision is stored.

shrinkage : float, 0 <= shrinkage <= 1, default 0.1
    Coefficient in the convex combination used for the computation of the shrunk estimate.

assume_centered : boolean, default False
    If True, data are not centered before computation.
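
A minimal usage sketch (the toy data and shrinkage value are illustrative): the fitted covariance is a convex combination of the empirical covariance and a scaled identity matrix.

import numpy as np
from sklearn.covariance import ShrunkCovariance

rng = np.random.RandomState(0)
X = rng.randn(200, 5)

cov = ShrunkCovariance(shrinkage=0.1).fit(X)
print(cov.covariance_.shape)   # (5, 5)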

sklearn.datasets.load_files()

sklearn.datasets.load_files(container_path, description=None, categories=None, load_content=True, shuffle=True, encoding=None, decode_error='strict', random_state=0)

Load text files with categories as subfolder names. Individual samples are assumed to be files stored in a two-level folder structure such as the following:

container_folder/
    category_1_folder/
        file_1.txt file_2.txt ... file_42.txt
    category_2_folder/
        file_43.txt file_44.txt ...

The folder names are used as supervised signal label names.
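
A short usage sketch; the directory name "txt_sentoken" is hypothetical and stands for any folder laid out as category subfolders of text files.

from sklearn.datasets import load_files

dataset = load_files('txt_sentoken', encoding='utf-8',
                     shuffle=True, random_state=42)
print(dataset.target_names)   # category (subfolder) names
print(len(dataset.data))      # number of documents loaded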

sklearn.cluster.k_means()

sklearn.cluster.k_means(X, n_clusters, init='k-means++', precompute_distances='auto', n_init=10, max_iter=300, verbose=False, tol=0.0001, random_state=None, copy_x=True, n_jobs=1, algorithm='auto', return_n_iter=False)

K-means clustering algorithm. Read more in the User Guide.

Parameters:

X : array-like or sparse matrix, shape (n_samples, n_features)
    The observations to cluster.

n_clusters : int
    The number of clusters to form as well as the number of centroids to generate.

max_iter : int, optional, default 300
    Maximum number of iterations of the k-means algorithm to run.
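
A minimal usage sketch (the blob data are illustrative): the functional k_means interface returns the centroids, the labels, and the final inertia.

from sklearn.cluster import k_means
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)
centroids, labels, inertia = k_means(X, n_clusters=3, random_state=0)
print(centroids.shape, labels.shape, inertia)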

Illustration of Gaussian process classification on the XOR dataset

This example illustrates GPC on XOR data. Compared are a stationary, isotropic kernel (RBF) and a non-stationary kernel (DotProduct). On this particular dataset, the DotProduct kernel obtains considerably better results because the class boundaries are linear and coincide with the coordinate axes. In general, stationary kernels often obtain better results.

print(__doc__)

# Authors: Jan Hendrik Metzen <jhm@informatik.uni-bremen.de>
#
# License: BSD 3 clause

import numpy as np
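
A condensed sketch of the comparison (sample count and kernel hyperparameters are illustrative, and no plotting is done here): fit GPC on XOR-labelled data with an RBF kernel and with a DotProduct kernel.

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF, DotProduct

rng = np.random.RandomState(0)
X = rng.randn(200, 2)
y = np.logical_xor(X[:, 0] > 0, X[:, 1] > 0)   # XOR class labels

for kernel in (1.0 * RBF(length_scale=1.0),
               1.0 * DotProduct(sigma_0=1.0) ** 2):
    clf = GaussianProcessClassifier(kernel=kernel).fit(X, y)
    print(kernel, clf.score(X, y))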

2.4. Biclustering

Biclustering can be performed with the module sklearn.cluster.bicluster. Biclustering algorithms simultaneously cluster the rows and columns of a data matrix. These clusters of rows and columns are known as biclusters. Each determines a submatrix of the original data matrix with some desired properties. For instance, given a matrix of shape (10, 10), one possible bicluster with three rows and two columns induces a submatrix of shape (3, 2):

>>> import numpy as np
>>> data = np.arange(100).reshape(10, 10)
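
Continuing that sketch (the row and column choices are illustrative): selecting three rows and two columns of the 10 x 10 matrix yields the (3, 2) submatrix induced by the bicluster.

>>> rows = np.array([0, 2, 3])        # row indices of the bicluster
>>> columns = np.array([1, 2])        # column indices of the bicluster
>>> data[np.ix_(rows, columns)].shape
(3, 2)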