This is a supervised learning example using a random forest. The distinctive part of this example, in contrast to the previous one (Random Forest Example), is how the data is split: here we apply a more extensive validation of the model using K-fold cross-validation. In this validation approach, the model is trained and tested K times, with different training and testing data each time. With K = 5, we end up with 5 results from 5 models. To report the results, the average or the median of the performance measures is usually selected to represent the outcome of the experiment. The main benefit of K-fold cross-validation is that it reduces the chance of overfitting: each model is trained on a different combination of the data, so the reported performance is less likely to be tied to one particular training set.
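As a quick illustration of the averaging idea, the short sketch below uses scikit-learn's cross_val_score helper to train and score a model once per fold and then report the mean. This is a standalone sketch, not the demo's own code; the model parameters mirror the ones described later in this section.

from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# One accuracy score per fold: 5 models, 5 results.
scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
print("Per-fold accuracy:", scores)
print("Mean accuracy:", scores.mean())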
#Demo4
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# Supervised Learning - RandomForest Classification
# RandomForest Classification_Kfold

%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
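The excerpt only shows the imports. Based on the description further down, the data, cross-validation splitter, and model are presumably prepared along these lines; this is a reconstruction, the exact two-feature selection is an assumption, and the variable names X, y, kf, and randomForestModel are taken from the loop below.

from sklearn.model_selection import RepeatedKFold

# Load the breast cancer data set; keep only two features
# (assumed here to be the first two columns).
breastCancer = datasets.load_breast_cancer()
X = breastCancer.data[:, :2]
y = breastCancer.target

# 5-fold cross-validation, repeated once.
kf = RepeatedKFold(n_splits=5, n_repeats=1, random_state=None)

# Random forest with 100 trees, a maximum depth of 2, and a random state of 0.
randomForestModel = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)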
#Applying the training and testing of the KFold
for train_index, test_index in kf.split(X):
    print("Train:", train_index, "Validation:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    randomForestModel.fit(X_train, y_train)
    y_pred = randomForestModel.predict(X_test)
    print(classification_report(y_test, y_pred))
This script uses several machine learning libraries (numpy, matplotlib, pandas, and sklearn) to classify breast cancer data with a random forest classifier. The breast cancer data is loaded from the sklearn datasets library, and only two features from the data set are used for the analysis. The script splits the data into training and testing sets using the train_test_split method from sklearn. It then uses the RepeatedKFold class from sklearn.model_selection to perform 5-fold cross-validation, repeated once; RepeatedKFold generates the indices that split the data into training and test sets for each fold. The random forest model is created with the RandomForestClassifier class from sklearn.ensemble, with 100 trees, a maximum depth of 2, and a random state of 0. For each fold, the script uses the generated indices to split the data into a training set and a test set, fits the random forest model to the training data, predicts the labels for the test data, and prints out a classification report to evaluate the model's performance on that fold. The classification report includes metrics such as precision, recall, f1-score, and support for each label.
In the script, the RepeatedKFold class from sklearn.model_selection is used to perform k-fold cross-validation. The script specifies 5-fold cross-validation, which means that the data is split into 5 equal-sized folds. For each fold, one of the five parts is used as the test set, while the other four parts are used as the training set. The script initializes a RepeatedKFold object called 'kf' with the following parameters:
n_splits: 5, which means that the data is split into 5 folds.
n_repeats: 1, which means that the k-fold procedure is carried out once.
random_state: None, which means that the shuffling is not seeded, so the generated folds can differ from run to run.
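Written out, the initialization described above would look like this (a reconstruction, since the constructor call itself is not shown in the excerpt). Fixing random_state to an integer instead of None makes the folds reproducible across runs:

from sklearn.model_selection import RepeatedKFold

kf = RepeatedKFold(n_splits=5, n_repeats=1, random_state=None)

# With a fixed seed, the same folds come back on every run:
kf_reproducible = RepeatedKFold(n_splits=5, n_repeats=1, random_state=42)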
The script then uses the split() method of the 'kf' object to generate the indices that split the data into training and test sets. The split() method returns an iterator yielding pairs of index arrays corresponding to the training and test sets. For each fold, the script uses these indices to split the data into a training set and a test set, fits the random forest model to the training data, predicts the labels for the test data, and prints out a classification report to evaluate the model's performance on that fold.
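To see concretely what split() yields, here is a small standalone example on toy data (not part of the demo):

import numpy as np
from sklearn.model_selection import KFold

X_toy = np.arange(20).reshape(10, 2)  # 10 samples with 2 features each
kf_toy = KFold(n_splits=5)

for fold, (train_index, test_index) in enumerate(kf_toy.split(X_toy), start=1):
    # Each fold holds out 2 of the 10 samples as the test set.
    print(f"Fold {fold}: train={train_index} test={test_index}")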
The main goal of cross-validation is to evaluate the performance of a machine learning model on unseen data. By using k-fold cross-validation, the script can use all of the available data for both training and testing, which helps reduce the variance of the performance estimate and gives a better idea of how well the model will perform on unseen data. The script uses 5-fold cross-validation, which is a widely used choice. In 5-fold cross-validation, the data is divided into five equal-sized parts or "folds". The model is trained on four of the five parts and tested on the remaining part, and this process is repeated five times, with each of the five parts used as the test set exactly once. The script uses the classification_report method to evaluate the model's performance on each fold. The classification report calculates several performance metrics, such as precision, recall, f1-score, and support for each label. These metrics provide insight into how well the model can predict the class labels.
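If a single summary number is preferred over five separate reports, one common pattern (an extension of the demo, reusing X, y, kf, and randomForestModel from the sketch above) is to collect one metric per fold and average it:

from sklearn.metrics import accuracy_score

fold_accuracies = []
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    randomForestModel.fit(X_train, y_train)
    y_pred = randomForestModel.predict(X_test)
    fold_accuracies.append(accuracy_score(y_test, y_pred))

# Report the mean accuracy across the 5 folds.
print("Mean accuracy:", np.mean(fold_accuracies))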
The output of this code is the indices of the training set, the indices of the testing set (also called the validation set), and the performance measurements (the classification report) for each of the 5 folds.