sklearn.feature_selection.chi2()

sklearn.feature_selection.chi2(X, y) [source]

Compute chi-squared stats between each non-negative feature and class. This score can be used to select the n_features features with the highest values of the chi-squared test statistic from X, which must contain only non-negative features such as booleans or frequencies (e.g., term counts in document classification), relative to the classes. Recall that the chi-square test measures dependence between stochastic variables, so using this function "weeds out" the features that are most likely to be independent of class and therefore irrelevant for classification.
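A minimal sketch of how chi2 is typically paired with SelectKBest; the toy term-count matrix below is made up for illustration:

    import numpy as np
    from sklearn.feature_selection import SelectKBest, chi2

    # Toy term-count matrix: 6 documents x 4 features, two classes
    X = np.array([[3, 0, 1, 0],
                  [2, 0, 2, 1],
                  [3, 1, 0, 0],
                  [0, 4, 0, 2],
                  [0, 3, 1, 3],
                  [1, 5, 0, 2]])
    y = np.array([0, 0, 0, 1, 1, 1])

    chi2_scores, p_values = chi2(X, y)                       # one score / p-value per feature
    X_reduced = SelectKBest(chi2, k=2).fit_transform(X, y)   # keep the 2 highest-scoring features
    print(chi2_scores)
    print(X_reduced.shape)    # (6, 2)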

SVM Exercise

A tutorial exercise for using different SVM kernels. This exercise is used in the "Using kernels" part of the "Supervised learning: predicting an output variable from high-dimensional observations" section of "A tutorial on statistical-learning for scientific data processing".

    print(__doc__)

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn import datasets, svm

    iris = datasets.load_iris()
    X = iris.data
    y = iris.target

    # Keep two classes and only the first two features
    X = X[y != 0, :2]
    y = y[y != 0]

    n_sample = len(X)
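The exercise goes on to fit an SVC with several kernels and compare them. A minimal continuation of the snippet above (the gamma value is an illustrative choice, not part of the snippet shown here):

    # Fit an SVC with each kernel on the two-class, two-feature subset
    for kernel in ('linear', 'rbf', 'poly'):
        clf = svm.SVC(kernel=kernel, gamma=10)
        clf.fit(X, y)

        plt.figure()
        plt.title(kernel)
        plt.scatter(X[:, 0], X[:, 1], c=y, zorder=10, cmap=plt.cm.Paired)
    plt.show()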

sklearn.datasets.make_moons()

sklearn.datasets.make_moons(n_samples=100, shuffle=True, noise=None, random_state=None) [source]

Make two interleaving half circles. A simple toy dataset to visualize clustering and classification algorithms. Read more in the User Guide.

Parameters:
n_samples : int, optional (default=100) The total number of points generated.
shuffle : bool, optional (default=True) Whether to shuffle the samples.
noise : double or None (default=None) Standard deviation of Gaussian noise added to the data.
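A short usage sketch, with the sample count and noise level chosen only for illustration:

    import matplotlib.pyplot as plt
    from sklearn.datasets import make_moons

    # Two interleaving half circles with a little Gaussian noise
    X, y = make_moons(n_samples=200, noise=0.1, random_state=0)

    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=plt.cm.Paired)
    plt.title("make_moons(n_samples=200, noise=0.1)")
    plt.show()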

sklearn.datasets.load_mlcomp()

sklearn.datasets.load_mlcomp(name_or_id, set_='raw', mlcomp_root=None, **kwargs) [source]

Load a dataset as downloaded from http://mlcomp.org

Parameters:
name_or_id : the integer id or the string name metadata of the MLComp dataset to load
set_ : select the portion to load: 'train', 'test' or 'raw'
mlcomp_root : the filesystem path to the root folder where MLComp datasets are stored; if mlcomp_root is None, the MLCOMP_DATASETS_HOME environment variable is looked up instead.
**kwargs : domain-specific kwargs to be passed to the dataset loader.
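A minimal usage sketch, assuming the relevant MLComp archive has already been downloaded and unpacked locally; the dataset name '20news-18828' and the local path are placeholders, not values prescribed by the function:

    import os
    from sklearn.datasets import load_mlcomp

    # Either set the MLCOMP_DATASETS_HOME environment variable,
    # or pass mlcomp_root explicitly as below.
    mlcomp_root = os.path.expanduser("~/mlcomp_datasets")

    news_train = load_mlcomp('20news-18828', set_='train', mlcomp_root=mlcomp_root)
    print(len(news_train.filenames))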

sklearn.datasets.load_sample_image()

sklearn.datasets.load_sample_image(image_name) [source]

Load the numpy array of a single sample image.

Parameters:
image_name : {'china.jpg', 'flower.jpg'} The name of the sample image loaded.

Returns:
img : 3D array The image as a numpy array: height x width x color.

Examples

>>> from sklearn.datasets import load_sample_image
>>> china = load_sample_image('china.jpg')
>>> china.dtype
dtype('uint8')
>>> china.shape
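A short sketch of one common use, displaying a sample image with matplotlib:

    import matplotlib.pyplot as plt
    from sklearn.datasets import load_sample_image

    flower = load_sample_image('flower.jpg')   # uint8 array, height x width x 3
    plt.imshow(flower)
    plt.axis('off')
    plt.show()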

grid_search.ParameterGrid()

Warning: DEPRECATED

class sklearn.grid_search.ParameterGrid(param_grid) [source]

Grid of parameters with a discrete number of values for each. Deprecated since version 0.18: This module will be removed in 0.20. Use sklearn.model_selection.ParameterGrid instead. Can be used to iterate over parameter value combinations with the Python built-in function iter. Read more in the User Guide.

Parameters:
param_grid : dict of string to sequence, or sequence of such The parameter grid to explore, as a dictionary mapping estimator parameters to sequences of allowed values.
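Since this class is deprecated, a minimal sketch using the sklearn.model_selection.ParameterGrid replacement named in the warning; the grid values are illustrative:

    from sklearn.model_selection import ParameterGrid

    param_grid = {'kernel': ['linear', 'rbf'], 'C': [1, 10]}
    grid = ParameterGrid(param_grid)

    # Iterating yields one dict per parameter combination
    for params in grid:
        print(params)
    # e.g. {'C': 1, 'kernel': 'linear'}, {'C': 1, 'kernel': 'rbf'}, ...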

A demo of the mean-shift clustering algorithm

Reference: Dorin Comaniciu and Peter Meer, "Mean Shift: A robust approach toward feature space analysis". IEEE Transactions on Pattern Analysis and Machine Intelligence. 2002. pp. 603-619.

    print(__doc__)

    import numpy as np
    from sklearn.cluster import MeanShift, estimate_bandwidth
    from sklearn.datasets.samples_generator import make_blobs

    # Generate sample data
    centers = [[1, 1], [-1, -1], [1, -1]]
    X, _ = make_blobs(n_samples=10000, centers=centers, cluster_std=0.6)

    # Compute clustering with MeanShift
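The demo continues by estimating a bandwidth and fitting MeanShift on the sample data; a minimal continuation, where the quantile and n_samples values are typical choices shown here as assumptions:

    bandwidth = estimate_bandwidth(X, quantile=0.2, n_samples=500)

    ms = MeanShift(bandwidth=bandwidth, bin_seeding=True)
    ms.fit(X)
    labels = ms.labels_
    cluster_centers = ms.cluster_centers_

    n_clusters_ = len(np.unique(labels))
    print("number of estimated clusters : %d" % n_clusters_)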

sklearn.calibration.calibration_curve()

sklearn.calibration.calibration_curve(y_true, y_prob, normalize=False, n_bins=5) [source]

Compute true and predicted probabilities for a calibration curve. Read more in the User Guide.

Parameters:
y_true : array, shape (n_samples,) True targets.
y_prob : array, shape (n_samples,) Probabilities of the positive class.
normalize : bool, optional, default=False Whether y_prob needs to be normalized into the [0, 1] interval, i.e. is not a proper probability. If True, the smallest value in y_prob is mapped onto 0 and the largest one onto 1.
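A minimal sketch with made-up labels and probabilities, just to show the shape of the inputs and outputs:

    import numpy as np
    from sklearn.calibration import calibration_curve

    # Toy true labels and predicted probabilities of the positive class
    y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1])
    y_prob = np.array([0.1, 0.2, 0.3, 0.4, 0.65, 0.7, 0.8, 0.9, 1.0])

    prob_true, prob_pred = calibration_curve(y_true, y_prob, n_bins=3)
    print(prob_true)   # fraction of positives in each bin
    print(prob_pred)   # mean predicted probability in each bin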

sklearn.datasets.make_friedman2()

sklearn.datasets.make_friedman2(n_samples=100, noise=0.0, random_state=None) [source]

Generate the "Friedman #2" regression problem. This dataset is described in Friedman [1] and Breiman [2]. Inputs X are 4 independent features uniformly distributed on the intervals:

0 <= X[:, 0] <= 100,
40 * pi <= X[:, 1] <= 560 * pi,
0 <= X[:, 2] <= 1,
1 <= X[:, 3] <= 11.

The output y is created according to the formula:

y(X) = (X[:, 0] ** 2 + (X[:, 1] * X[:, 2] - 1 / (X[:, 1] * X[:, 3])) ** 2) ** 0.5 + noise * N(0, 1).
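A short usage sketch; the sample count and noise level are illustrative:

    from sklearn.datasets import make_friedman2

    # 100 samples of the 4-feature Friedman #2 problem with a little output noise
    X, y = make_friedman2(n_samples=100, noise=0.1, random_state=0)
    print(X.shape)   # (100, 4)
    print(y.shape)   # (100,)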

Plotting Validation Curves

In this plot you can see the training scores and validation scores of an SVM for different values of the kernel parameter gamma. For very low values of gamma, both the training score and the validation score are low: this is called underfitting. Medium values of gamma result in high values for both scores, i.e. the classifier performs fairly well. If gamma is too high, the classifier will overfit, which means that the training score is good but the validation score is poor.
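A minimal sketch of how such curves can be computed with validation_curve; the SVC-on-digits setup and the gamma range below are assumptions chosen to mirror the description, not the exact code behind the plot:

    import numpy as np
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_digits
    from sklearn.model_selection import validation_curve
    from sklearn.svm import SVC

    X, y = load_digits(return_X_y=True)
    param_range = np.logspace(-6, -1, 5)

    train_scores, valid_scores = validation_curve(
        SVC(), X, y, param_name="gamma", param_range=param_range, cv=5)

    # Average over the cross-validation folds and plot both curves
    plt.semilogx(param_range, train_scores.mean(axis=1), label="Training score")
    plt.semilogx(param_range, valid_scores.mean(axis=1), label="Cross-validation score")
    plt.xlabel("gamma")
    plt.ylabel("Score")
    plt.legend(loc="best")
    plt.show()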