2.7. Novelty and Outlier Detection

Many applications require being able to decide whether a new observation belongs to the same distribution as existing observations (it is an inlier), or should be considered as different (it is an outlier). Often, this ability is used to clean real data sets. Two important distinctions must be made:

novelty detection: The training data is not polluted by outliers, and we are interested in detecting anomalies in new observations.

outlier detection: The training data contains outliers, and we need to fit the central mode of the training data, ignoring the deviant observations.
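A minimal sketch of novelty detection with a one-class SVM (OneClassSVM is part of scikit-learn; the training data and the nu and gamma values here are arbitrary illustrations):

    import numpy as np
    from sklearn.svm import OneClassSVM

    rng = np.random.RandomState(42)
    # Novelty detection: the training set is assumed to be free of outliers.
    X_train = 0.3 * rng.randn(100, 2)

    clf = OneClassSVM(nu=0.1, kernel="rbf", gamma=0.1)
    clf.fit(X_train)

    # New observations: predict() returns +1 for inliers, -1 for outliers.
    X_new = np.array([[0.0, 0.1], [4.0, 4.0]])
    print(clf.predict(X_new))  # e.g. [ 1 -1]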

base.ClusterMixin

class sklearn.base.ClusterMixin [source]

Mixin class for all cluster estimators in scikit-learn.

Methods

fit_predict(X[, y]) : Performs clustering on X and returns cluster labels.

fit_predict(X, y=None) [source]
Performs clustering on X and returns cluster labels.
Parameters:
X : ndarray, shape (n_samples, n_features)
    Input data.
Returns:
y : ndarray, shape (n_samples,)
    Cluster labels.
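A minimal sketch of how the mixin is typically used (the TemplateClusterer class and its one-line clustering rule are hypothetical; inheriting from ClusterMixin is what provides fit_predict, which calls fit and returns the labels_ attribute):

    import numpy as np
    from sklearn.base import BaseEstimator, ClusterMixin

    class TemplateClusterer(BaseEstimator, ClusterMixin):
        """Toy clusterer: assigns cluster 0 or 1 by the sign of feature 0."""
        def fit(self, X, y=None):
            X = np.asarray(X)
            # fit() must set labels_; ClusterMixin.fit_predict relies on it.
            self.labels_ = (X[:, 0] > 0).astype(int)
            return self

    X = np.array([[-1.0, 2.0], [3.0, 0.5], [-0.2, 1.1]])
    print(TemplateClusterer().fit_predict(X))  # [0 1 0]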

Kernel Density Estimate of Species Distributions

This shows an example of a neighbors-based query (in particular a kernel density estimate) on geospatial data, using a Ball Tree built upon the Haversine distance metric, i.e. distances over points in latitude/longitude. The dataset is provided by Phillips et al. (2006). If available, the example uses basemap to plot the coast lines and national boundaries of South America. This example does not perform any learning over the data (see Species distribution modeling for an example of classification based on the attributes in this dataset).
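A minimal sketch of the core query (KernelDensity, the 'haversine' metric, and the 'ball_tree' algorithm are real scikit-learn options; the coordinates and bandwidth here are invented for illustration):

    import numpy as np
    from sklearn.neighbors import KernelDensity

    # A few (latitude, longitude) observations in degrees (toy data).
    latlon_deg = np.array([[-10.0, -55.0], [-10.5, -54.8], [-22.9, -43.2]])

    # The haversine metric expects [latitude, longitude] in radians.
    X = np.radians(latlon_deg)

    kde = KernelDensity(bandwidth=0.04, metric="haversine",
                        kernel="gaussian", algorithm="ball_tree")
    kde.fit(X)

    # score_samples returns log-density; exponentiate for the density itself.
    query = np.radians(np.array([[-10.2, -54.9]]))
    print(np.exp(kde.score_samples(query)))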

The Johnson-Lindenstrauss bound for embedding with random projections

The Johnson-Lindenstrauss lemma states that any high-dimensional dataset can be randomly projected into a lower-dimensional Euclidean space while controlling the distortion in the pairwise distances.

Theoretical bounds

The distortion introduced by a random projection p is asserted by the fact that p defines an eps-embedding with good probability, as defined by:

    (1 - eps) ||u - v||^2 < ||p(u) - p(v)||^2 < (1 + eps) ||u - v||^2

where u and v are any rows taken from a dataset of shape [n_samples, n_features] and p is a projection by a random Gaussian N(0, 1) matrix of shape [n_components, n_features].
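A minimal sketch of evaluating and using the bound (johnson_lindenstrauss_min_dim and GaussianRandomProjection are real scikit-learn utilities; the sample counts and n_components below are arbitrary):

    import numpy as np
    from sklearn.random_projection import (GaussianRandomProjection,
                                           johnson_lindenstrauss_min_dim)

    # Minimum number of components guaranteeing an eps-embedding of
    # 10000 samples; the bound depends only on n_samples and eps.
    print(johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.5))
    print(johnson_lindenstrauss_min_dim(n_samples=10000, eps=0.1))

    # Project toy data with a random Gaussian matrix and spot-check that
    # pairwise distances are roughly preserved.
    rng = np.random.RandomState(0)
    X = rng.rand(100, 10000)
    X_proj = GaussianRandomProjection(n_components=300,
                                      random_state=0).fit_transform(X)
    d_orig = np.linalg.norm(X[0] - X[1])
    d_proj = np.linalg.norm(X_proj[0] - X_proj[1])
    print(d_proj / d_orig)  # close to 1 when the embedding is faithful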

cluster.Birch()

class sklearn.cluster.Birch(threshold=0.5, branching_factor=50, n_clusters=3, compute_labels=True, copy=True) [source]

Implements the Birch clustering algorithm. Every new sample is inserted into the root of the Clustering Feature Tree. It is then merged with the subcluster that has the centroid closest to the new sample. This is done recursively until it reaches the subcluster at the leaf of the tree that has the closest centroid. Read more in the User Guide.

Parameters:
threshold : float, default 0.5
    The radius of the subcluster obtained by merging a new sample and the closest subcluster should be less than the threshold; otherwise a new subcluster is started.
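A minimal usage sketch (the two-blob toy data is invented; Birch and its parameters are as documented above):

    import numpy as np
    from sklearn.cluster import Birch

    # Toy data: two well-separated blobs.
    rng = np.random.RandomState(0)
    X = np.vstack([rng.randn(50, 2), rng.randn(50, 2) + 10])

    # Ask Birch for two final clusters; threshold controls subcluster radius.
    brc = Birch(threshold=0.5, n_clusters=2)
    labels = brc.fit_predict(X)
    print(np.bincount(labels))  # roughly [50 50]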

gaussian_process.GaussianProcess()

Warning: DEPRECATED

class sklearn.gaussian_process.GaussianProcess(*args, **kwargs) [source]

The legacy Gaussian Process model class.

Deprecated since version 0.18: This class will be removed in 0.20. Use the GaussianProcessRegressor instead. Read more in the User Guide.

Parameters:
regr : string or callable, optional
    A regression function returning an array of outputs of the linear regression functional basis. The number of observations n_samples should be greater than the size p of this basis.
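Since the class is deprecated, a minimal sketch of the replacement named above (GaussianProcessRegressor and the RBF kernel are real scikit-learn objects; the 1-D toy data is invented):

    import numpy as np
    from sklearn.gaussian_process import GaussianProcessRegressor
    from sklearn.gaussian_process.kernels import RBF

    # Toy 1-D regression problem.
    X = np.array([[1.0], [3.0], [5.0], [6.0]])
    y = np.sin(X).ravel()

    gpr = GaussianProcessRegressor(kernel=RBF(length_scale=1.0),
                                   random_state=0).fit(X, y)

    # Predict with uncertainty estimates at new points.
    y_mean, y_std = gpr.predict(np.array([[2.0], [4.0]]), return_std=True)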

sklearn.linear_model.orthogonal_mp_gram()

sklearn.linear_model.orthogonal_mp_gram(Gram, Xy, n_nonzero_coefs=None, tol=None, norms_squared=None, copy_Gram=True, copy_Xy=True, return_path=False, return_n_iter=False) [source]

Gram Orthogonal Matching Pursuit (OMP). Solves n_targets Orthogonal Matching Pursuit problems using only the Gram matrix X.T * X and the product X.T * y. Read more in the User Guide.

Parameters:
Gram : array, shape (n_features, n_features)
    Gram matrix of the input data: X.T * X
Xy : array, shape (n_features,) or (n_features, n_targets)
    Input targets multiplied by X: X.T * y
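A minimal sketch of the precompute-then-solve pattern (the sparse toy signal is invented; the Gram matrix and X.T * y are built exactly as the parameter descriptions above specify):

    import numpy as np
    from sklearn.linear_model import orthogonal_mp_gram

    # Build a toy problem whose solution is sparse: y = X @ w_true.
    rng = np.random.RandomState(0)
    X = rng.randn(100, 20)
    w_true = np.zeros(20)
    w_true[[2, 7]] = [1.5, -2.0]
    y = X @ w_true

    # Precompute the Gram matrix and X.T * y, as the function expects.
    coef = orthogonal_mp_gram(X.T @ X, X.T @ y, n_nonzero_coefs=2)
    print(np.nonzero(coef)[0])  # recovers indices [2 7]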

linear_model.LogisticRegressionCV()

class sklearn.linear_model.LogisticRegressionCV(Cs=10, fit_intercept=True, cv=None, dual=False, penalty='l2', scoring=None, solver='lbfgs', tol=0.0001, max_iter=100, class_weight=None, n_jobs=1, verbose=0, refit=True, intercept_scaling=1.0, multi_class='ovr', random_state=None) [source]

Logistic Regression CV (aka logit, MaxEnt) classifier. This class implements logistic regression using the liblinear, newton-cg, sag or lbfgs optimizer. The newton-cg, sag and lbfgs solvers support only L2 regularization with primal formulation.
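A minimal usage sketch (the iris dataset is just a convenient demo; the parameters shown are documented above, and max_iter is raised only to help convergence on this data):

    from sklearn.datasets import load_iris
    from sklearn.linear_model import LogisticRegressionCV

    X, y = load_iris(return_X_y=True)

    # Search a grid of 10 C values by cross-validation, refit on the best.
    clf = LogisticRegressionCV(Cs=10, cv=5, solver='lbfgs',
                               multi_class='ovr', max_iter=1000)
    clf.fit(X, y)
    print(clf.C_)  # best regularization strength found per class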

feature_selection.RFE()

class sklearn.feature_selection.RFE(estimator, n_features_to_select=None, step=1, verbose=0) [source]

Feature ranking with recursive feature elimination. Given an external estimator that assigns weights to features (e.g., the coefficients of a linear model), the goal of recursive feature elimination (RFE) is to select features by recursively considering smaller and smaller sets of features. First, the estimator is trained on the initial set of features and weights are assigned to each one of them. Then, features whose absolute weights are the smallest are pruned from the current set of features. That procedure is recursively repeated on the pruned set until the desired number of features to select is eventually reached.
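A minimal usage sketch (the synthetic data is invented; a linear SVM is used because RFE needs an estimator that exposes per-feature weights):

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.svm import SVC

    # 10 features, of which 3 are informative (synthetic data).
    X, y = make_classification(n_samples=200, n_features=10,
                               n_informative=3, n_redundant=0,
                               random_state=0)

    selector = RFE(estimator=SVC(kernel="linear"), n_features_to_select=3)
    selector.fit(X, y)
    print(selector.support_)   # boolean mask of selected features
    print(selector.ranking_)   # rank 1 marks the selected features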

sklearn.datasets.fetch_mldata()

sklearn.datasets.fetch_mldata(dataname, target_name='label', data_name='data', transpose_data=True, data_home=None) [source]

Fetch an mldata.org data set. If the file does not exist yet, it is downloaded from mldata.org.

mldata.org does not have an enforced convention for storing data or naming the columns in a data set. The default behavior of this function works well with the most common cases:

1) data values are stored in the column 'data', and target values in the column 'label'
2) alternatively, the first column stores target values, and the second data values
3) the data array is stored as n_features x n_samples, and thus needs to be transposed to match the sklearn standard
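A minimal usage sketch following this 0.18-era API (the dataset name is just a common example, and the call requires network access to mldata.org on first use):

    from sklearn.datasets import fetch_mldata

    # Downloads to data_home (~/scikit_learn_data by default) on the first
    # call, then reuses the cached copy.
    mnist = fetch_mldata('MNIST original')
    print(mnist.data.shape)    # (70000, 784)
    print(mnist.target.shape)  # (70000,)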