This example shows how a classifier is optimized by cross-validation, using the sklearn.model_selection.GridSearchCV object on a development set that comprises only half of the available labeled data.
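As a condensed sketch of that setup (the full script appears further below, and the variable names here are only illustrative), the data is split in half and the grid search is fit on the development half only:

from sklearn import datasets
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# Half of the labeled data is held back as the evaluation set.
X_dev, X_eval, y_dev, y_eval = train_test_split(
    X, y, test_size=0.5, random_state=0)

# The grid search only ever sees the development half.
search = GridSearchCV(SVC(),
                      {'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                       'C': [1, 10, 100, 1000]},
                      cv=5)
search.fit(X_dev, y_dev)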
The performance of the selected hyper-parameters and trained model is then measured on a dedicated evaluation set that was not used during the model selection step.
More details on tools available for model selection can be found in the sections on Cross-validation: evaluating estimator performance and Tuning the hyper-parameters of an estimator.
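For instance, the cross-validation tools referenced there can also be used on their own. A minimal sketch, assuming the same digits data as in the script below, of estimating the performance of one fixed hyper-parameter setting with sklearn.model_selection.cross_val_score:

from sklearn import datasets
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

digits = datasets.load_digits()
X = digits.images.reshape((len(digits.images), -1))
y = digits.target

# 5-fold cross-validated score for a single candidate; GridSearchCV
# automates this evaluation over a whole grid of candidates.
scores = cross_val_score(SVC(kernel='rbf', gamma=1e-3, C=10), X, y, cv=5)
print(scores.mean(), scores.std())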
from __future__ import print_function

from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.model_selection import GridSearchCV
from sklearn.metrics import classification_report
from sklearn.svm import SVC

print(__doc__)

# Loading the Digits dataset
digits = datasets.load_digits()

# To apply a classifier on this data, we need to flatten the images, to
# turn the data into a (samples, features) matrix:
n_samples = len(digits.images)
X = digits.images.reshape((n_samples, -1))
y = digits.target

# Split the dataset in two equal parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0)

# Set the parameters by cross-validation
tuned_parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4],
                     'C': [1, 10, 100, 1000]},
                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]

scores = ['precision', 'recall']

for score in scores:
    print("# Tuning hyper-parameters for %s" % score)
    print()

    clf = GridSearchCV(SVC(C=1), tuned_parameters, cv=5,
                       scoring='%s_macro' % score)
    clf.fit(X_train, y_train)

    print("Best parameters set found on development set:")
    print()
    print(clf.best_params_)
    print()
    print("Grid scores on development set:")
    print()
    means = clf.cv_results_['mean_test_score']
    stds = clf.cv_results_['std_test_score']
    for mean, std, params in zip(means, stds, clf.cv_results_['params']):
        print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))
    print()

    print("Detailed classification report:")
    print()
    print("The model is trained on the full development set.")
    print("The scores are computed on the full evaluation set.")
    print()
    y_true, y_pred = y_test, clf.predict(X_test)
    print(classification_report(y_true, y_pred))
    print()

# Note the problem is too easy: the hyperparameter plateau is too flat and the
# output model is the same for precision and recall with ties in quality.
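A point the script relies on implicitly: with the default refit=True, GridSearchCV refits the best parameter combination on the whole development set, so the fitted search object can be used directly as a classifier. A short follow-up sketch, continuing from the clf, X_train, y_train, X_test and y_test defined in the script above:

# Best mean cross-validated score on the development set and the winning parameters
print(clf.best_score_)
print(clf.best_params_)

# With refit=True (the default), the best parameter setting has been refit on
# the whole development set; that refit estimator is available directly:
best_svc = clf.best_estimator_

# clf.predict and clf.score delegate to this refit estimator, so the held-out
# evaluation can also go through the search object itself (clf.score uses the
# scoring passed to GridSearchCV, here the macro-averaged metric):
print(clf.score(X_test, y_test))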
Total running time of the script: (0 minutes 0.000 seconds)
Download Python source code: grid_search_digits.py
Download IPython notebook: grid_search_digits.ipynb