Underfitting vs. Overfitting

This example demonstrates the problems of underfitting and overfitting and how we can use linear regression with polynomial features to approximate nonlinear functions. The plot shows the function that we want to approximate, which is a part of the cosine function. In addition, the samples from the real function and the approximations of different models are displayed. The models have polynomial features of different degrees. We can see that a linear function (polynomial with degree 1) is not sufficient to fit the training samples; this is called underfitting.
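A minimal sketch of this setup (the sample count, noise level, and degree values below are illustrative choices, not prescribed by the example): fit a PolynomialFeatures + LinearRegression pipeline at several degrees and compare cross-validated errors.

import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(0)
n_samples = 30
X = np.sort(rng.rand(n_samples))[:, np.newaxis]
y = np.cos(1.5 * np.pi * X.ravel()) + rng.randn(n_samples) * 0.1  # noisy cosine samples

for degree in (1, 4, 15):  # underfit, reasonable fit, overfit
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    mse = -cross_val_score(model, X, y, scoring="neg_mean_squared_error", cv=10).mean()
    print("degree %2d: CV MSE = %.4f" % (degree, mse))

A degree-1 model should score poorly on held-out folds (underfitting), a moderate degree should score well, and a very high degree should score poorly again (overfitting).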

sklearn.model_selection.learning_curve()

sklearn.model_selection.learning_curve(estimator, X, y, groups=None, train_sizes=array([ 0.1, 0.33, 0.55, 0.78, 1. ]), cv=None, scoring=None, exploit_incremental_learning=False, n_jobs=1, pre_dispatch='all', verbose=0)

Learning curve. Determines cross-validated training and test scores for different training set sizes. A cross-validation generator splits the whole dataset k times in training and test data. Subsets of the training set with varying sizes will be used to train the estimator, and a score for each training subset size and the test set will be computed. Afterwards, the scores will be averaged over all k runs for each training subset size.
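A brief usage sketch (the GaussianNB classifier and the digits data are arbitrary stand-ins for any estimator and dataset):

import numpy as np
from sklearn.datasets import load_digits
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB

X, y = load_digits(return_X_y=True)
train_sizes, train_scores, test_scores = learning_curve(
    GaussianNB(), X, y, cv=5, train_sizes=np.linspace(0.1, 1.0, 5))
print(train_sizes)                # absolute training-set sizes used
print(train_scores.mean(axis=1))  # training score, averaged over the folds
print(test_scores.mean(axis=1))   # test score, averaged over the folds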

2.6. Covariance estimation

Many statistical problems require at some point the estimation of a population's covariance matrix, which can be seen as an estimation of the data set's scatter plot shape. Most of the time, such an estimation has to be done on a sample whose properties (size, structure, homogeneity) have a large influence on the estimation's quality. The sklearn.covariance package aims at providing tools affording an accurate estimation of a population's covariance matrix under various settings. We assume that the observations are independent and identically distributed (i.i.d.).
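As a small illustration (the synthetic two-feature Gaussian data below is an assumption for demonstration), the basic estimator in the package recovers the sample covariance:

import numpy as np
from sklearn.covariance import EmpiricalCovariance

rng = np.random.RandomState(0)
true_cov = np.array([[1.0, 0.3], [0.3, 0.5]])
X = rng.multivariate_normal(mean=[0.0, 0.0], cov=true_cov, size=500)

est = EmpiricalCovariance().fit(X)
print(est.covariance_)  # should be close to true_cov for this i.i.d. sample

Shrunk estimators such as sklearn.covariance.LedoitWolf expose the same fit/covariance_ interface and are preferable when the sample is small relative to the dimension.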

Nested versus non-nested cross-validation

This example compares non-nested and nested cross-validation strategies on a classifier of the iris data set. Nested cross-validation (CV) is often used to train a model in which hyperparameters also need to be optimized. Nested CV estimates the generalization error of the underlying model and its (hyper)parameter search. Choosing the parameters that maximize non-nested CV biases the model to the dataset, yielding an overly-optimistic score. Model selection without nested CV uses the same data to tune model parameters and evaluate model performance.
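A condensed sketch of the comparison (the SVC classifier and the parameter grid are illustrative; the original example repeats this over many random trials):

from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)
param_grid = {"C": [1, 10, 100], "gamma": [0.01, 0.1]}

inner_cv = KFold(n_splits=4, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=4, shuffle=True, random_state=0)

clf = GridSearchCV(SVC(), param_grid=param_grid, cv=inner_cv)

# Non-nested: the same data tune the hyperparameters and report the score.
non_nested_score = clf.fit(X, y).best_score_

# Nested: an outer CV loop scores the whole search on held-out data.
nested_score = cross_val_score(clf, X, y, cv=outer_cv).mean()

print(non_nested_score, nested_score)  # the non-nested score is typically optimistic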

sklearn.preprocessing.scale()

sklearn.preprocessing.scale(X, axis=0, with_mean=True, with_std=True, copy=True)

Standardize a dataset along any axis. Center to the mean and component-wise scale to unit variance. Read more in the User Guide.

Parameters:
X : {array-like, sparse matrix} The data to center and scale.
axis : int (0 by default) Axis used to compute the means and standard deviations along. If 0, independently standardize each feature, otherwise (if 1) standardize each sample.
with_mean : boolean, True by default. If True, center the data before scaling.
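For instance (the 3x3 matrix below is an arbitrary toy input):

import numpy as np
from sklearn.preprocessing import scale

X = np.array([[1.0, -1.0, 2.0],
              [2.0,  0.0, 0.0],
              [0.0,  1.0, -1.0]])

X_scaled = scale(X)            # axis=0: standardize each feature (column)
print(X_scaled.mean(axis=0))   # approximately [0, 0, 0]
print(X_scaled.std(axis=0))    # [1, 1, 1]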

1.3. Kernel ridge regression

Kernel ridge regression (KRR) [M2012] combines Ridge Regression (linear least squares with l2-norm regularization) with the kernel trick. It thus learns a linear function in the space induced by the respective kernel and the data. For non-linear kernels, this corresponds to a non-linear function in the original space. The form of the model learned by KernelRidge is identical to support vector regression (SVR). However, different loss functions are used: KRR uses squared error loss while support vector regression uses epsilon-insensitive loss, both combined with l2 regularization.
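A minimal fitting sketch (the RBF kernel, the alpha and gamma values, and the noisy sine data are illustrative choices):

import numpy as np
from sklearn.kernel_ridge import KernelRidge

rng = np.random.RandomState(0)
X = 5 * rng.rand(100, 1)
y = np.sin(X).ravel() + 0.1 * rng.randn(100)  # noisy sine target

krr = KernelRidge(kernel="rbf", alpha=1.0, gamma=0.5)
krr.fit(X, y)
print(krr.predict(X[:5]))  # a non-linear fit in the original input space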

sklearn.datasets.fetch_rcv1()

sklearn.datasets.fetch_rcv1(data_home=None, subset='all', download_if_missing=True, random_state=None, shuffle=False)

Load the RCV1 multilabel dataset, downloading it if necessary. Version: RCV1-v2, vectors, full sets, topics multilabels.

Classes: 103
Samples total: 804414
Dimensionality: 47236
Features: real, between 0 and 1

Read more in the User Guide. New in version 0.17.

Parameters:
data_home : string, optional. Specify another download and cache folder for the datasets. By default all scikit-learn data is stored in '~/scikit_learn_data' subfolders.
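Typical usage (note the first call downloads the full dataset, which is large, and caches it under data_home):

from sklearn.datasets import fetch_rcv1

rcv1 = fetch_rcv1(subset='train')  # or subset='test' / 'all'
print(rcv1.data.shape)             # sparse CSR matrix: (n_samples, 47236)
print(rcv1.target.shape)           # multilabel indicator matrix: (n_samples, 103)
print(rcv1.target_names[:5])       # topic codes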

sklearn.ensemble.RandomForestRegressor()

class sklearn.ensemble.RandomForestRegressor(n_estimators=10, criterion='mse', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_split=1e-07, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False)

A random forest regressor. A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.
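A short sketch on synthetic data (make_regression and all parameter values here are illustrative):

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

X, y = make_regression(n_samples=200, n_features=4, noise=0.5, random_state=0)

reg = RandomForestRegressor(n_estimators=100, oob_score=True, random_state=0)
reg.fit(X, y)
print(reg.oob_score_)            # out-of-bag R^2 estimate (requires bootstrap=True)
print(reg.feature_importances_)  # impurity-based importance per feature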

sklearn.metrics.silhouette_score()

sklearn.metrics.silhouette_score(X, labels, metric='euclidean', sample_size=None, random_state=None, **kwds)

Compute the mean Silhouette Coefficient of all samples. The Silhouette Coefficient is calculated using the mean intra-cluster distance (a) and the mean nearest-cluster distance (b) for each sample. The Silhouette Coefficient for a sample is (b - a) / max(a, b). To clarify, b is the distance between a sample and the nearest cluster that the sample is not a part of. Note that the Silhouette Coefficient is only defined if the number of labels satisfies 2 <= n_labels <= n_samples - 1.
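For example (KMeans on synthetic blobs; both are arbitrary choices, the score works with labels from any clustering):

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

X, _ = make_blobs(n_samples=300, centers=4, random_state=0)
labels = KMeans(n_clusters=4, random_state=0).fit_predict(X)

# Ranges over [-1, 1]; higher means denser, better-separated clusters.
print(silhouette_score(X, labels))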

sklearn.datasets.load_svmlight_file()

sklearn.datasets.load_svmlight_file(f, n_features=None, dtype=numpy.float64, multilabel=False, zero_based='auto', query_id=False)

Load datasets in the svmlight / libsvm format into a sparse CSR matrix. This is a text-based format, with one sample per line. It does not store zero-valued features, hence it is suitable for sparse datasets. The first element of each line can be used to store a target variable to predict. This format is used as the default format for both the svmlight and the libsvm command line programs.
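A round-trip sketch (the file name tmp.svm and the toy matrix are illustrative):

import numpy as np
from sklearn.datasets import dump_svmlight_file, load_svmlight_file

X = np.array([[0.0, 1.0],
              [2.0, 0.0]])
y = np.array([0, 1])

dump_svmlight_file(X, y, 'tmp.svm', zero_based=True)  # write the svmlight text format
X_loaded, y_loaded = load_svmlight_file('tmp.svm')    # X_loaded is a sparse CSR matrix
print(X_loaded.toarray())  # zero entries were not stored in the file, but are restored here
print(y_loaded)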