A tutorial on statistical-learning for scientific data processing

Statistical learning. Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences face is rapidly growing. The problems it tackles range from building a prediction function linking different observations, to classifying observations, to learning the structure of an unlabeled dataset. This tutorial will explore statistical learning: the use of machine learning techniques with the goal of statistical inference, that is, drawing conclusions on the data at hand.
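As a minimal sketch of the workflow described above (the iris dataset and SVC classifier here are illustrative choices, not prescribed by the tutorial):

```python
# Fit a prediction function on labeled observations, then classify new ones.
from sklearn import datasets
from sklearn.svm import SVC

iris = datasets.load_iris()
X, y = iris.data, iris.target

clf = SVC()                    # an estimator that learns from data
clf.fit(X[:-10], y[:-10])      # learn a prediction function
print(clf.predict(X[-10:]))    # draw conclusions about unseen observations
```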

sklearn.linear_model.lars_path()

sklearn.linear_model.lars_path(X, y, Xy=None, Gram=None, max_iter=500, alpha_min=0, method='lar', copy_X=True, eps=2.2204460492503131e-16, copy_Gram=True, verbose=0, return_path=True, return_n_iter=False, positive=False) [source] Compute the Least Angle Regression or Lasso path using the LARS algorithm [1]. The optimization objective for the case method='lasso' is: (1 / (2 * n_samples)) * ||y - Xw||^2_2 + alpha * ||w||_1. In the case of method='lars', the objective function is only known in the form of an implicit equation (see discussion in [1]).
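A short usage sketch (the diabetes dataset is just a convenient regression example):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import lars_path

X, y = load_diabetes(return_X_y=True)

# Compute the full Lasso regularization path with the LARS algorithm;
# coefs has one column of coefficients per knot along the path.
alphas, active, coefs = lars_path(X, y, method='lasso')
print(alphas.shape, coefs.shape)
```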

svm.NuSVR()

class sklearn.svm.NuSVR(nu=0.5, C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, tol=0.001, cache_size=200, verbose=False, max_iter=-1) [source] Nu Support Vector Regression. Similar to NuSVC, for regression; it uses a parameter nu to control the number of support vectors. However, unlike NuSVC, where nu replaces C, here nu replaces the parameter epsilon of epsilon-SVR. The implementation is based on libsvm. Read more in the User Guide. Parameters: C : float, optional (default=1.0) Penalty parameter C of the error term.
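A minimal sketch on synthetic data (the random data is purely illustrative):

```python
import numpy as np
from sklearn.svm import NuSVR

rng = np.random.RandomState(0)
X = rng.randn(40, 5)
y = rng.randn(40)

# nu controls the number of support vectors,
# playing the role that epsilon plays in epsilon-SVR.
regr = NuSVR(nu=0.5, C=1.0, kernel='rbf')
regr.fit(X, y)
print(regr.predict(X[:3]))
```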

model_selection.ShuffleSplit()

class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None) [source] Random permutation cross-validator. Yields indices to split data into training and test sets. Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets. Read more in the User Guide. Parameters: n_splits : int (default 10) Number of re-shuffling & splitting iterations.
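A minimal sketch showing the yielded indices (toy data, illustrative only):

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

X = np.arange(10).reshape(5, 2)

ss = ShuffleSplit(n_splits=3, test_size=0.25, random_state=0)
for train_index, test_index in ss.split(X):
    # Each iteration is an independent random train/test split;
    # unlike k-fold strategies, test sets may overlap across iterations.
    print("TRAIN:", train_index, "TEST:", test_index)
```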

sklearn.datasets.load_svmlight_files()

sklearn.datasets.load_svmlight_files(files, n_features=None, dtype=<class 'numpy.float64'>, multilabel=False, zero_based='auto', query_id=False) [source] Load a dataset from multiple files in SVMlight format. This function is equivalent to mapping load_svmlight_file over a list of files, except that the results are concatenated into a single, flat list and the sample vectors are constrained to all have the same number of features. In case the file contains a pairwise preference constraint (known as 'qid' in the svmlight format), these are ignored unless the query_id parameter is set to True.
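A hedged sketch, assuming two svmlight-format files exist at the hypothetical paths below:

```python
from sklearn.datasets import load_svmlight_files

# The file names are placeholders; substitute real svmlight-format files.
X_train, y_train, X_test, y_test = load_svmlight_files(
    ("train.svmlight", "test.svmlight"))

# Both matrices are constrained to the same feature space.
assert X_train.shape[1] == X_test.shape[1]
```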

exceptions.ChangedBehaviorWarning

class sklearn.exceptions.ChangedBehaviorWarning [source] Warning class used to notify the user of a change in behavior. Changed in version 0.18: Moved from sklearn.base.
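Code that wants such notices to be impossible to miss during testing can escalate them to errors; a minimal sketch using the standard warnings module:

```python
import warnings
from sklearn.exceptions import ChangedBehaviorWarning

# Turn behavior-change notices into hard errors so they cannot go unnoticed.
warnings.simplefilter('error', ChangedBehaviorWarning)
```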

Robust linear estimator fitting

Here a sine function is fit with a polynomial of order 3, for values close to zero. Robust fitting is demoed in different situations:

- No measurement errors, only modelling errors (fitting a sine with a polynomial)
- Measurement errors in X
- Measurement errors in y

The median absolute deviation to non-corrupt new data is used to judge the quality of the prediction. What we can see is that:

- RANSAC is good for strong outliers in the y direction
- TheilSen is good for small outliers, both in direction X and y, but has a break point above which it performs worse than OLS

A condensed sketch of this comparison follows.
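The corruption scheme below is a simplified stand-in for the example's three situations, covering only outliers in y:

```python
import numpy as np
from sklearn.linear_model import (LinearRegression, RANSACRegressor,
                                  TheilSenRegressor)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)
x = rng.normal(size=400)
y = np.sin(x)
X = x[:, np.newaxis]

# Simulate strong measurement errors (outliers) in y.
y[-20:] += 10 * rng.normal(size=20)

for name, base in [("OLS", LinearRegression()),
                   ("RANSAC", RANSACRegressor(random_state=0)),
                   ("Theil-Sen", TheilSenRegressor(random_state=0))]:
    model = make_pipeline(PolynomialFeatures(3), base)
    model.fit(X, y)
    # Median absolute deviation against non-corrupt targets
    # judges prediction quality, as in the example text.
    mad = np.median(np.abs(model.predict(X) - np.sin(x)))
    print(name, mad)
```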

sklearn.metrics.brier_score_loss()

sklearn.metrics.brier_score_loss(y_true, y_prob, sample_weight=None, pos_label=None) [source] Compute the Brier score. The smaller the Brier score, the better; hence the naming with 'loss'. Across all items in a set of N predictions, the Brier score measures the mean squared difference between (1) the predicted probability assigned to the possible outcomes for item i, and (2) the actual outcome. Therefore, the lower the Brier score is for a set of predictions, the better the predictions are calibrated.
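A worked micro-example, with toy numbers chosen so the arithmetic is easy to check by hand:

```python
import numpy as np
from sklearn.metrics import brier_score_loss

y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.8, 0.3])

# Mean squared difference between predicted probability and outcome:
# ((0.1)^2 + (0.1)^2 + (0.2)^2 + (0.3)^2) / 4 = 0.0375
print(brier_score_loss(y_true, y_prob))
```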

grid_search.RandomizedSearchCV()

Warning: DEPRECATED class sklearn.grid_search.RandomizedSearchCV(estimator, param_distributions, n_iter=10, scoring=None, fit_params=None, n_jobs=1, iid=True, refit=True, cv=None, verbose=0, pre_dispatch='2*n_jobs', random_state=None, error_score='raise') [source] Randomized search on hyper-parameters. Deprecated since version 0.18: This module will be removed in 0.20. Use sklearn.model_selection.RandomizedSearchCV instead. RandomizedSearchCV implements a 'fit' and a 'score' method. It also implements 'predict', 'predict_proba', 'decision_function', 'transform' and 'inverse_transform' if they are implemented in the estimator used.
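Per the deprecation notice, a usage sketch should target the sklearn.model_selection replacement; this one samples C for a logistic regression from a continuous distribution (the estimator and distribution are illustrative choices):

```python
import scipy.stats as stats
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

X, y = load_iris(return_X_y=True)

# Sample candidate values of C from an exponential distribution
# rather than enumerating a fixed grid.
param_dist = {'C': stats.expon(scale=100)}
search = RandomizedSearchCV(LogisticRegression(), param_dist,
                            n_iter=10, random_state=0)
search.fit(X, y)
print(search.best_params_)
```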

ensemble.AdaBoostClassifier()

class sklearn.ensemble.AdaBoostClassifier(base_estimator=None, n_estimators=50, learning_rate=1.0, algorithm='SAMME.R', random_state=None) [source] An AdaBoost classifier. An AdaBoost [1] classifier is a meta-estimator that begins by fitting a classifier on the original dataset and then fits additional copies of the classifier on the same dataset, but where the weights of incorrectly classified instances are adjusted such that subsequent classifiers focus more on difficult cases. This class implements the algorithm known as AdaBoost-SAMME [2].
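A minimal sketch on the iris dataset (the dataset and scoring setup are illustrative):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)

# Sequentially fitted copies of the base estimator reweight
# misclassified samples so later stages focus on hard cases.
clf = AdaBoostClassifier(n_estimators=50, random_state=0)
print(cross_val_score(clf, X, y, cv=5).mean())
```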