A tutorial on statistical-learning for scientific data processing

Statistical learning

Machine learning is a technique of growing importance, as the size of the datasets that experimental sciences are facing is rapidly growing. The problems it tackles range from building a prediction function linking different observations, to classifying observations, or learning the structure in an unlabeled dataset. This tutorial will explore statistical learning, the use of machine learning techniques with the goal of statistical inference: drawing conclusions on the data at

model_selection.ShuffleSplit()

class sklearn.model_selection.ShuffleSplit(n_splits=10, test_size=0.1, train_size=None, random_state=None)

Random permutation cross-validator.

Yields indices to split data into training and test sets.

Note: contrary to other cross-validation strategies, random splits do not guarantee that all folds will be different, although this is still very likely for sizeable datasets.

Read more in the User Guide.

Parameters:

n_splits : int (default 10)
    Number of re-shuffling & splitting
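A minimal usage sketch of the cross-validator described above. The toy array and the chosen values of n_splits, test_size and random_state are assumptions made for illustration only; they are not taken from the entry.

```python
import numpy as np
from sklearn.model_selection import ShuffleSplit

# Toy data: 10 samples with 2 features each (illustrative values only).
X = np.arange(20).reshape(10, 2)

# Five independent random splits, each holding out 20% of the samples.
# random_state is fixed only to make the sketch reproducible.
splitter = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)

# split() yields (train_indices, test_indices) pairs.
splits = list(splitter.split(X))
for train_idx, test_idx in splits:
    print("train:", train_idx, "test:", test_idx)
```

Because every split is drawn independently, the same sample may land in the test set of several splits, which is the behavior the note above warns about.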

sklearn.linear_model.logistic_regression_path()

sklearn.linear_model.logistic_regression_path(X, y, pos_class=None, Cs=10, fit_intercept=True, max_iter=100, tol=0.0001, verbose=0, solver='lbfgs', coef=None, copy=False, class_weight=None, dual=False, penalty='l2', intercept_scaling=1.0, multi_class='ovr', random_state=None, check_input=True, max_squared_sum=None, sample_weight=None)

Compute a Logistic Regression model for a list of regularization parameters.

This is an implementation that uses the result of the previous model to
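The idea of fitting one model per regularization value while reusing the previous solution can be sketched with the public LogisticRegression estimator and its warm_start option. This is not a call to logistic_regression_path itself, which may not be available in every scikit-learn release; the synthetic dataset and the grid of C values are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic binary classification problem (illustrative only).
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

# Assumed grid of inverse regularization strengths, strongest penalty first.
Cs = np.logspace(-3, 2, 6)

# warm_start=True makes each fit start from the coefficients of the last one.
clf = LogisticRegression(solver="lbfgs", max_iter=1000, warm_start=True)

coefs = []
for C in Cs:
    clf.set_params(C=C)
    clf.fit(X, y)
    coefs.append(clf.coef_.ravel().copy())

coefs = np.array(coefs)                     # shape: (len(Cs), n_features)
coef_norms = np.linalg.norm(coefs, axis=1)  # one norm per value of C
print(coef_norms)
```

Ordering the grid from strong to weak regularization means each fit starts from a nearby solution, which is the usual motivation for computing a path this way.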

Robust linear estimator fitting

Here a sine function is fit with a polynomial of order 3, for values close to zero.

Robust fitting is demonstrated in different situations:

- No measurement errors, only modelling errors (fitting a sine with a polynomial)
- Measurement errors in X
- Measurement errors in y

The median absolute deviation to non-corrupt new data is used to judge the quality of the prediction.

What we can see is that:

- RANSAC is good for strong outliers in the y direction
- TheilSen is good for small outliers, both in direction X
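The setup described in this entry can be sketched as follows for a single situation, strong outliers in y. The sample sizes, the outlier value of 10.0, and the random seed are assumptions chosen for illustration, not values from the entry.

```python
import numpy as np
from sklearn.linear_model import (LinearRegression, RANSACRegressor,
                                  TheilSenRegressor)
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.RandomState(42)

# Training data: a sine sampled at values close to zero.
X = rng.normal(size=400)[:, np.newaxis]
y = np.sin(X).ravel()

# Non-corrupt new data, used only to judge the predictions.
X_test = rng.normal(size=200)[:, np.newaxis]
y_test = np.sin(X_test).ravel()

# Assumed corruption: every 50th training target becomes a strong outlier.
y_corrupt = y.copy()
y_corrupt[::50] = 10.0

estimators = {
    "OLS": LinearRegression(),
    "RANSAC": RANSACRegressor(random_state=42),
    "TheilSen": TheilSenRegressor(random_state=42),
}

# Median absolute deviation between predictions and the clean test targets.
mad = {}
for name, estimator in estimators.items():
    # Polynomial of order 3; the intercept is handled by the estimator.
    model = make_pipeline(PolynomialFeatures(3, include_bias=False), estimator)
    model.fit(X, y_corrupt)
    mad[name] = np.median(np.abs(model.predict(X_test) - y_test))

for name, value in mad.items():
    print(f"{name}: {value:.4f}")
```

Ordinary least squares is included only as a non-robust baseline for comparison; corrupting X instead of y would follow the same pattern.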