5. Dataset loading utilities

The sklearn.datasets package embeds some small toy datasets as introduced in the Getting Started section. To evaluate the impact of the scale of the dataset (n_samples and n_features) while controlling the statistical properties of the data (typically the correlation and informativeness of the features), it is also possible to generate synthetic data. This package also features helpers to fetch larger datasets commonly used by the machine learning community to benchmark algorithms on data that comes from the "real world".
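A minimal sketch of both options (load_iris and make_classification are standard sklearn.datasets helpers; the parameter values below are purely illustrative):

    # Illustrative sketch: loading a bundled toy dataset and generating a
    # synthetic one with controlled statistical properties.
    from sklearn.datasets import load_iris, make_classification

    # Small toy dataset shipped with scikit-learn
    iris = load_iris()
    print(iris.data.shape, iris.target.shape)        # (150, 4) (150,)

    # Synthetic data: control n_samples, n_features and how many features
    # actually carry class information.
    X, y = make_classification(n_samples=1000, n_features=20,
                               n_informative=5, n_redundant=2,
                               random_state=0)
    print(X.shape, y.shape)                          # (1000, 20) (1000,)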

sklearn.cluster.affinity_propagation()

sklearn.cluster.affinity_propagation(S, preference=None, convergence_iter=15, max_iter=200, damping=0.5, copy=True, verbose=False, return_n_iter=False) [source]

Perform Affinity Propagation clustering of data. Read more in the User Guide.

Parameters:
S : array-like, shape (n_samples, n_samples)
    Matrix of similarities between points.
preference : array-like, shape (n_samples,) or float, optional
    Preferences for each point. Points with larger values of preferences are more likely to be chosen as exemplars.
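A minimal usage sketch of the function form (the data and preference value are illustrative; negative squared Euclidean distance is a common choice for the similarity matrix S):

    # Affinity propagation on a precomputed similarity matrix.
    import numpy as np
    from sklearn.cluster import affinity_propagation
    from sklearn.metrics import euclidean_distances

    X = np.array([[1, 2], [1, 4], [1, 0],
                  [4, 2], [4, 4], [4, 0]], dtype=float)
    S = -euclidean_distances(X, squared=True)   # similarities between points

    cluster_centers_indices, labels = affinity_propagation(S, preference=-10)
    print(cluster_centers_indices, labels)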

Feature Union with Heterogeneous Data Sources

Datasets can often contain components that require different feature extraction and processing pipelines. This scenario might occur when:

- Your dataset consists of heterogeneous data types (e.g. raster images and text captions).
- Your dataset is stored in a Pandas DataFrame and different columns require different processing pipelines.

This example demonstrates how to use sklearn.pipeline.FeatureUnion on a dataset containing different types of features, as sketched below. We use the 20 newsgroups dataset and compute standard bag-of-words features for the subject line and body in separate pipelines, as well as ad hoc features on the body.
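A hedged sketch of the general pattern (the gallery example additionally uses custom transformers to pick different fields out of each record; the vectorizers, toy documents and labels below are purely illustrative):

    # FeatureUnion concatenates the outputs of several transformers fit on
    # the same input, and the result feeds a downstream estimator.
    from sklearn.pipeline import FeatureUnion, Pipeline
    from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
    from sklearn.svm import LinearSVC

    union = FeatureUnion(transformer_list=[
        ("word_tfidf", TfidfVectorizer()),                              # word weights
        ("char_counts", CountVectorizer(analyzer="char_wb",
                                        ngram_range=(2, 3))),           # character n-grams
    ])

    pipeline = Pipeline([
        ("features", union),
        ("clf", LinearSVC()),
    ])

    docs = ["free money now", "meeting at noon", "win a prize today", "lunch tomorrow?"]
    labels = [1, 0, 1, 0]
    pipeline.fit(docs, labels)
    print(pipeline.predict(["prize money"]))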

Vector Quantization Example

Face, a 1024 x 768 size image of a raccoon face, is used here to illustrate how k-means is used for vector quantization.

    print(__doc__)

    # Code source: Gaël Varoquaux
    # Modified for documentation by Jaques Grobler
    # License: BSD 3 clause

    import numpy as np
    import scipy as sp
    import matplotlib.pyplot as plt

    from sklearn import cluster
    from sklearn.utils.testing import SkipTest
    from sklearn.utils.fixes import sp_version

    if sp_version < (0, 12):
        raise SkipTest("Skipping because SciPy versions earlier than 0.12.0 "
                       "are not supported by this example.")
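Since the example is only excerpted above, here is a self-contained sketch of the quantization step itself (it assumes a SciPy version that still exposes scipy.misc.face; newer releases move it to scipy.datasets.face, and n_clusters=5 is just an illustrative choice):

    # Vector quantization of the gray levels of the raccoon face image.
    import numpy as np
    import scipy as sp
    from sklearn import cluster

    face = sp.misc.face(gray=True)            # grayscale raccoon face image
    X = face.reshape((-1, 1))                 # one sample per pixel intensity

    # Quantize the 256 gray levels down to 5 representative values.
    k_means = cluster.KMeans(n_clusters=5, n_init=4, random_state=0)
    k_means.fit(X)
    values = k_means.cluster_centers_.squeeze()
    labels = k_means.labels_

    # Rebuild the image from the quantized values.
    face_compressed = np.choose(labels, values)
    face_compressed.shape = face.shape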

sklearn.datasets.fetch_kddcup99()

sklearn.datasets.fetch_kddcup99(subset=None, shuffle=False, random_state=None, percent10=True, download_if_missing=True) [source]

Load and return the KDD Cup '99 dataset (classification). The KDD Cup '99 dataset was created by processing the tcpdump portions of the 1998 DARPA Intrusion Detection System (IDS) Evaluation dataset, created by MIT Lincoln Lab [1]. The artificial data was generated using a closed network and hand-injected attacks to produce a large number of different types of attacks with normal activity in the background.
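An illustrative call (the 'http' subset and the other arguments are just one possible configuration; the data is downloaded on first use):

    # Fetch the 10 percent 'http' subset of KDD Cup '99 and inspect it.
    from sklearn.datasets import fetch_kddcup99

    kddcup = fetch_kddcup99(subset='http', percent10=True,
                            shuffle=True, random_state=0)
    print(kddcup.data.shape)      # samples x features for the http subset
    print(kddcup.target[:5])      # attack labels, e.g. b'normal.'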

Compare Stochastic learning strategies for MLPClassifier

This example visualizes some training loss curves for different stochastic learning strategies, including SGD and Adam. Because of time constraints, we use several small datasets, for which L-BFGS might be more suitable. The general trend shown in these examples seems to carry over to larger datasets, however. Note that those results can be highly dependent on the value of learning_rate_init.

Out (abridged):

    learning on dataset iris
    training: constant learning-rate
    Training set score: 0.980000
    ...
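A condensed sketch of the comparison (the particular solver settings and max_iter value are illustrative, not the exact grids used in the example):

    # Fit MLPClassifier with different stochastic learning strategies on a
    # small dataset and plot the recorded loss curves.
    import matplotlib.pyplot as plt
    from sklearn.datasets import load_iris
    from sklearn.neural_network import MLPClassifier
    from sklearn.preprocessing import MinMaxScaler

    X, y = load_iris(return_X_y=True)
    X = MinMaxScaler().fit_transform(X)

    params = {
        "constant SGD":   dict(solver="sgd", learning_rate="constant",
                               learning_rate_init=0.2, momentum=0),
        "SGD + momentum": dict(solver="sgd", learning_rate="constant",
                               learning_rate_init=0.2, momentum=0.9),
        "adam":           dict(solver="adam", learning_rate_init=0.01),
    }

    for label, kwargs in params.items():
        clf = MLPClassifier(max_iter=400, random_state=0, **kwargs)
        clf.fit(X, y)
        print(label, "training set score:", clf.score(X, y))
        plt.plot(clf.loss_curve_, label=label)

    plt.legend()
    plt.show()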

Visualizing the stock market structure

This example employs several unsupervised learning techniques to extract the stock market structure from variations in historical quotes. The quantity that we use is the daily variation in quote price: quotes that are linked tend to cofluctuate during a day.

Learning a graph structure: we use sparse inverse covariance estimation to find which quotes are correlated conditionally on the others. Specifically, sparse inverse covariance gives us a graph, that is, a list of connections. For each symbol, the symbols that it is connected to are those useful in explaining its fluctuations.
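A minimal sketch of the graph-learning step on synthetic stand-in data (the real example downloads historical quotes; GraphicalLassoCV was named GraphLassoCV in older scikit-learn releases):

    # Sparse inverse covariance estimation on stand-in "daily variations".
    import numpy as np
    from sklearn.covariance import GraphicalLassoCV

    rng = np.random.RandomState(0)
    n_days, n_symbols = 200, 10
    variations = rng.randn(n_days, n_symbols)       # stand-in for daily variations

    X = variations / variations.std(axis=0)         # standardize each series
    edge_model = GraphicalLassoCV()
    edge_model.fit(X)

    # Non-zero off-diagonal entries of the precision matrix are the edges of
    # the conditional-dependence graph between symbols.
    precision = edge_model.precision_
    print(np.sum(np.abs(np.triu(precision, k=1)) > 1e-2), "edges")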

Plotting Learning Curves

On the left side the learning curve of a naive Bayes classifier is shown for the digits dataset. Note that the training score and the cross-validation score are both not very good at the end. However, this shape of curve is found very often on more complex datasets: the training score is very high at the beginning and decreases, while the cross-validation score is very low at the beginning and increases. On the right side we see the learning curve of an SVM with RBF kernel. We can see clearly that the training score is still around the maximum and the validation score could be increased with more training samples.
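A sketch of computing the underlying data for such a plot with sklearn.model_selection.learning_curve (the estimator, cv setting and train_sizes grid are illustrative):

    # Training and cross-validation scores as a function of training set size.
    import numpy as np
    from sklearn.datasets import load_digits
    from sklearn.model_selection import learning_curve
    from sklearn.naive_bayes import GaussianNB

    X, y = load_digits(return_X_y=True)
    train_sizes, train_scores, valid_scores = learning_curve(
        GaussianNB(), X, y, cv=5,
        train_sizes=np.linspace(0.1, 1.0, 5))

    print(train_sizes)
    print(train_scores.mean(axis=1))   # mean training score per size
    print(valid_scores.mean(axis=1))   # mean cross-validation score per size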

Probabilistic predictions with Gaussian process classification

This example illustrates the predicted probability of GPC for an RBF kernel with different choices of the hyperparameters. The first figure shows the predicted probability of GPC with arbitrarily chosen hyperparameters and with the hyperparameters corresponding to the maximum log-marginal-likelihood (LML). While the hyperparameters chosen by optimizing the LML have a considerably larger LML, they perform slightly worse according to the log-loss on test data. The figure shows that this is because they exhibit a steep change of the class probabilities at the class boundaries (which is good), but have predicted probabilities close to 0.5 far away from the class boundaries (which is bad).
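A sketch of the underlying comparison (the one-dimensional toy data and the RBF length scale are illustrative; the pattern is a fixed kernel versus one whose hyperparameters are tuned by maximizing the LML):

    # Compare a fixed-hyperparameter RBF kernel with an LML-optimized one.
    import numpy as np
    from sklearn.gaussian_process import GaussianProcessClassifier
    from sklearn.gaussian_process.kernels import RBF
    from sklearn.metrics import log_loss

    rng = np.random.RandomState(0)
    X = rng.uniform(0, 5, 100)[:, np.newaxis]
    y = (X[:, 0] > 2.5).astype(int)

    # Fixed hyperparameters: no optimization of the kernel.
    gp_fixed = GaussianProcessClassifier(kernel=RBF(length_scale=1.0),
                                         optimizer=None).fit(X, y)
    # Hyperparameters chosen by maximizing the log-marginal-likelihood.
    gp_opt = GaussianProcessClassifier(kernel=RBF(length_scale=1.0)).fit(X, y)

    for name, gp in [("fixed", gp_fixed), ("optimized", gp_opt)]:
        print(name,
              "LML:", gp.log_marginal_likelihood(gp.kernel_.theta),
              "log-loss:", log_loss(y, gp.predict_proba(X)[:, 1]))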

sklearn.decomposition.NMF()

class sklearn.decomposition.NMF(n_components=None, init=None, solver='cd', tol=0.0001, max_iter=200, random_state=None, alpha=0.0, l1_ratio=0.0, verbose=0, shuffle=False, nls_max_iter=2000, sparseness=None, beta=1, eta=0.1) [source]

Non-Negative Matrix Factorization (NMF). Find two non-negative matrices (W, H) whose product approximates the non-negative matrix X. This factorization can be used for example for dimensionality reduction, source separation or topic extraction. The objective function is:

    0.5 * ||X - WH||_Fro^2
    + alpha * l1_ratio * ||vec(W)||_1
    + alpha * l1_ratio * ||vec(H)||_1
    + 0.5 * alpha * (1 - l1_ratio) * ||W||_Fro^2
    + 0.5 * alpha * (1 - l1_ratio) * ||H||_Fro^2
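A minimal usage sketch (the matrix shape, n_components and init choice are illustrative):

    # Factor a non-negative matrix X into activations W and dictionary H.
    import numpy as np
    from sklearn.decomposition import NMF

    X = np.abs(np.random.RandomState(0).randn(6, 4))   # non-negative data

    model = NMF(n_components=2, init='random', random_state=0)
    W = model.fit_transform(X)          # (6, 2) activations
    H = model.components_               # (2, 4) dictionary
    print(model.reconstruction_err_)    # Frobenius norm of X - WH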