Statistical learning

Datasets

Scikit-learn deals with learning information from one or more datasets that are represented as 2D arrays. They can be understood as a list of multi-dimensional observations. We say that the first axis of these arrays is the samples axis, while the second is the features axis.

A simple example shipped with the scikit: iris dataset

1
2
3
4
5
>>> from sklearn import datasets
>>> iris = datasets.load_iris()
>>> data = iris.data
>>> data.shape
(150, 4)

It is made of 150 observations of irises, each described by 4 features: their sepal and petal length and width, as detailed in iris.DESCR.

When the data is not initially in the (n_samples, n_features) shape, it needs to be preprocessed in order to be used by scikit-learn.

An example of reshaping data would be the digits dataset

../../_images/sphx_glr_plot_digits_last_image_001.png

The digits dataset is made of 1797 8x8 images of hand-written digits

1
2
3
4
5
6
>>> digits = datasets.load_digits()
>>> digits.images.shape
(1797, 8, 8)
>>> import matplotlib.pyplot as plt
>>> plt.imshow(digits.images[-1], cmap=plt.cm.gray_r)
<matplotlib.image.AxesImage object at ...>

To use this dataset with the scikit, we transform each 8x8 image into a feature vector of length 64

1
>>> data = digits.images.reshape((digits.images.shape[0], -1))

Estimators objects

Fitting data: the main API implemented by scikit-learn is that of the estimator. An estimator is any object that learns from data; it may be a classification, regression or clustering algorithm or a transformer that extracts/filters useful features from raw data.

All estimator objects expose a fit method that takes a dataset (usually a 2-d array):

1
>>> estimator.fit(data)

Estimator parameters: All the parameters of an estimator can be set when it is instantiated or by modifying the corresponding attribute:

1
2
3
>>> estimator = Estimator(param1=1, param2=2)
>>> estimator.param1
1

Estimated parameters: When data is fitted with an estimator, parameters are estimated from the data at hand. All the estimated parameters are attributes of the estimator object ending by an underscore:

1
>>> estimator.estimated_param_
doc_scikit_learn
2025-01-10 15:47:30
Comments
Leave a Comment

Please login to continue.