5. Dataset loading utilities

The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section. To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the 'real world'.
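
A minimal sketch of the three kinds of loaders mentioned above: a bundled toy dataset, a synthetic generator with controlled feature informativeness, and (commented out) a fetcher that downloads a larger real-world dataset on first use. The shapes in the comments are illustrative.

from sklearn.datasets import load_iris, make_classification

# Small toy dataset shipped with scikit-learn
iris = load_iris()
print(iris.data.shape, iris.target.shape)         # (150, 4) (150,)

# Synthetic data with a chosen number of samples, features and informative features
X, y = make_classification(n_samples=500, n_features=20,
                           n_informative=5, random_state=0)
print(X.shape)                                    # (500, 20)

# Larger dataset, downloaded and cached on the first call:
# from sklearn.datasets import fetch_20newsgroups
# news = fetch_20newsgroups(subset='train')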

sklearn.cluster.affinity_propagation()

sklearn.cluster.affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200, damping=0.5, copy=True, verbose=False, return_n_iter=False) [source]

Perform Affinity Propagation Clustering of data. Read more in the User Guide.

Parameters:
S : array-like, shape (n_samples, n_samples)
    Matrix of similarities between points.
preference : array-like, shape (n_samples,) or float, optional
    Preferences for each point. Points with larger values of preferences are more likely to be chosen as exemplars.
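
A short usage sketch of the functional interface, using the negated squared Euclidean distances as the similarity matrix S (a common, but not mandatory, choice):

import numpy as np
from sklearn.cluster import affinity_propagation
from sklearn.metrics import euclidean_distances

X = np.array([[1, 2], [1, 4], [1, 0],
              [4, 2], [4, 4], [4, 0]], dtype=float)

# Similarities: negated squared distances between samples
S = -euclidean_distances(X, squared=True)

# With the default preference (the median of S) the algorithm chooses
# the number of exemplars, i.e. of clusters, on its own.
cluster_centers_indices, labels = affinity_propagation(S)
print(cluster_centers_indices, labels)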

1.14. Semi-Supervised

Semi-supervised learning is a situation in which some of the samples in your training data are not labeled. The semi-supervised estimators in sklearn.semi_supervised are able to make use of this additional unlabeled data to better capture the shape of the underlying data distribution and generalize better to new samples. These algorithms can perform well when we have a very small amount of labeled points and a large amount of unlabeled points. Unlabeled entries in y: it is important to assign an identifier to the unlabeled points along with the labeled data when training the model; the identifier used by this implementation is the integer value -1.
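
A minimal sketch of the labeling convention, assuming LabelPropagation as the estimator and the integer -1 as the marker for unlabeled entries:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

X = np.array([[1.0], [1.1], [0.9], [5.0], [5.1], [4.9]])
y = np.array([0, -1, -1, 1, -1, -1])     # -1 marks the unlabeled samples

model = LabelPropagation()
model.fit(X, y)                          # labeled and unlabeled samples are fitted together

print(model.transduction_)               # labels inferred for all training samples
print(model.predict([[1.05], [5.05]]))   # predictions for new samples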

naive_bayes.MultinomialNB()

class sklearn.naive_bayes.MultinomialNB(alpha=1.0, fit_prior=True, class_prior=None) [source]

Naive Bayes classifier for multinomial models. The multinomial Naive Bayes classifier is suitable for classification with discrete features (e.g., word counts for text classification). The multinomial distribution normally requires integer feature counts. However, in practice, fractional counts such as tf-idf may also work. Read more in the User Guide.

Parameters:
alpha : float, optional (default=1.0)
    Additive (Laplace/Lidstone) smoothing parameter (0 for no smoothing).
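
A small usage sketch with integer count features, the typical input for this classifier; the random counts stand in for, e.g., word counts of six documents:

import numpy as np
from sklearn.naive_bayes import MultinomialNB

rng = np.random.RandomState(1)
X = rng.randint(5, size=(6, 100))       # six "documents", 100 count features each
y = np.array([1, 2, 3, 4, 5, 6])

clf = MultinomialNB(alpha=1.0)          # alpha: additive smoothing parameter
clf.fit(X, y)
print(clf.predict(X[2:3]))              # [3]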

Plotting Learning Curves

On the left side the learning curve of a naive Bayes classifier is shown for the digits dataset. Note that the training score and the cross-validation score are both not very good at the end. However, the shape of the curve can be found very often for more complex datasets: the training score is very high at the beginning and decreases, and the cross-validation score is very low at the beginning and increases. On the right side we see the learning curve of an SVM with RBF kernel. We can see clearly that the training score is still around the maximum and the validation score could be increased with more training samples.
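
Such curves can be computed with sklearn.model_selection.learning_curve; a hedged sketch for the naive Bayes case, with the plotting itself omitted:

import numpy as np
from sklearn.datasets import load_digits
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import learning_curve

X, y = load_digits(return_X_y=True)

# Cross-validated training and validation scores for increasing training-set sizes
train_sizes, train_scores, valid_scores = learning_curve(
    GaussianNB(), X, y, train_sizes=np.linspace(0.1, 1.0, 5), cv=5)

print(train_sizes)                      # number of training samples per point of the curve
print(train_scores.mean(axis=1))        # mean training score per size
print(valid_scores.mean(axis=1))        # mean cross-validation score per size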

Theil-Sen Regression

Computes a Theil-Sen Regression on a synthetic dataset. See Theil-Sen estimator: generalized-median-based estimator for more information on the regressor. Compared to the OLS (ordinary least squares) estimator, the Theil-Sen estimator is robust against outliers. It has a breakdown point of about 29.3% in the case of a simple linear regression, which means that it can tolerate arbitrarily corrupted data (outliers) of up to 29.3% in the two-dimensional case. The estimation of the model is done by calculating the slopes and intercepts of a subpopulation of all possible combinations of subsample points.
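
A minimal sketch contrasting OLS and Theil-Sen on a small synthetic dataset in which a fraction of the targets is grossly corrupted (the data and the 10% outlier fraction are illustrative, well below the 29.3% breakdown point):

import numpy as np
from sklearn.linear_model import LinearRegression, TheilSenRegressor

rng = np.random.RandomState(42)
X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 2.0 * X.ravel() + 1.0 + rng.normal(scale=0.2, size=50)
y[:5] += 30.0                                  # corrupt 10% of the targets

ols = LinearRegression().fit(X, y)
theil_sen = TheilSenRegressor(random_state=42).fit(X, y)

print("OLS slope:      ", ols.coef_[0])        # distorted by the outliers
print("Theil-Sen slope:", theil_sen.coef_[0])  # stays close to the true slope of 2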

linear_model.BayesianRidge()

class sklearn.linear_model.BayesianRidge(n_iter=300, tol=0.001, alpha_1=1e-06, alpha_2=1e-06, lambda_1=1e-06, lambda_2=1e-06, compute_score=False, fit_intercept=True, normalize=False, copy_X=True, verbose=False) [source]

Bayesian ridge regression. Fit a Bayesian ridge model and optimize the regularization parameters lambda (precision of the weights) and alpha (precision of the noise). Read more in the User Guide.

Parameters:
n_iter : int, optional
    Maximum number of iterations. Default is 300.
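
A short usage sketch with the default hyperpriors, on synthetic data generated from a known weight vector:

import numpy as np
from sklearn.linear_model import BayesianRidge

rng = np.random.RandomState(0)
X = rng.randn(100, 3)
true_w = np.array([1.0, 0.0, -2.0])
y = X.dot(true_w) + rng.normal(scale=0.1, size=100)

reg = BayesianRidge(compute_score=True)   # also track the marginal log-likelihood
reg.fit(X, y)

print(reg.coef_)       # estimated weights, close to true_w
print(reg.alpha_)      # estimated noise precision
print(reg.lambda_)     # estimated weight precision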

Probabilistic predictions with Gaussian process classification

This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparameters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the hyperparameters corresponding to the maximum log-marginal-likelihood (LML). While the hyperparameters chosen by optimizing the LML have a considerably larger LML, they perform slightly worse according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class probabilities at the class boundaries (which is good) but have predicted probabilities close to 0.5 far away from the class boundaries (which is bad).
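
A hedged sketch of the kind of comparison behind such figures: one GPC with fixed, arbitrarily chosen hyperparameters (optimizer=None) and one whose hyperparameters are found by maximizing the LML during fit; the one-dimensional data is purely illustrative:

import numpy as np
from sklearn.gaussian_process import GaussianProcessClassifier
from sklearn.gaussian_process.kernels import RBF

rng = np.random.RandomState(0)
X = rng.uniform(0, 5, 50).reshape(-1, 1)
y = (X.ravel() > 2.5).astype(int)

# Hyperparameters kept fixed at their initial values
gp_fixed = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0),
                                     optimizer=None).fit(X, y)
# Hyperparameters optimized by maximizing the log-marginal-likelihood
gp_opt = GaussianProcessClassifier(kernel=1.0 * RBF(length_scale=1.0)).fit(X, y)

print(gp_fixed.log_marginal_likelihood(gp_fixed.kernel_.theta))
print(gp_opt.log_marginal_likelihood(gp_opt.kernel_.theta))

# Predicted class probabilities near and far from the class boundary
print(gp_opt.predict_proba(np.array([[1.0], [2.5], [4.0]])))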

Automatic Relevance Determination Regression

Fit a regression model with Automatic Relevance Determination (ARD). See Bayesian Ridge Regression for more information on the closely related regressor. Compared to the OLS (ordinary least squares) estimator, the coefficient weights are slightly shifted toward zeros, which stabilises them. The histogram of the estimated weights is very peaked, as a sparsity-inducing prior is implied on the weights. The estimation of the model is done by iteratively maximizing the marginal log-likelihood of the observations.
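
A minimal usage sketch on synthetic data in which only a few of the features are informative, so most of the estimated weights are shrunk toward zero:

import numpy as np
from sklearn.linear_model import ARDRegression

rng = np.random.RandomState(0)
X = rng.randn(100, 20)
true_w = np.zeros(20)
true_w[:3] = [5.0, -3.0, 2.0]            # only three informative features
y = X.dot(true_w) + rng.normal(scale=0.1, size=100)

reg = ARDRegression(compute_score=True)
reg.fit(X, y)
print(np.round(reg.coef_, 2))            # weights of the irrelevant features end up near 0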

sklearn.model_selection.permutation_test_score()

sklearn.model_selection.permutation_test_score(estimator, X, y, groups=None, cv=None, n_permutations=100, n_jobs=1, random_state=0, verbose=0, scoring=None) [source]

Evaluate the significance of a cross-validated score with permutations. Read more in the User Guide.

Parameters:
estimator : estimator object implementing 'fit'
    The object to use to fit the data.
X : array-like of shape at least 2D
    The data to fit.
y : array-like
    The target variable to try to predict in the case of supervised learning.
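
A short sketch of assessing the significance of a cross-validated score, assuming a linear SVC on the iris dataset purely for illustration:

from sklearn.datasets import load_iris
from sklearn.model_selection import permutation_test_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Score on the true labels, scores on permuted labels, and the resulting p-value
score, permutation_scores, pvalue = permutation_test_score(
    SVC(kernel='linear'), X, y, cv=5, n_permutations=100, random_state=0)

print(score)
print(permutation_scores.mean())
print(pvalue)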