R Scripts


The code snippet below applies supervised learning with a random forest classifier to the breast cancer dataset. The code is divided into seven main steps. The first step loads the packages used in the rest of the snippet; for example, we use the RandomForestClassifier implementation from the sklearn package. The second step loads the dataset for this experiment. We use the publicly available Breast Cancer dataset, which has 569 records and 30 features. The target class is a binary value representing the diagnosis as "Malignant" or "Benign". The third step is feature selection: choosing which of the 30 available features to use. Feature selection is a large pre-processing topic that is outside the scope of this example. To keep things simple, we choose the first two features by specifying the column range "[:, 0:2]". In more advanced examples, we would generally keep the best-performing features and discard the noisy ones. Step 4 prepares the split of the dataset for model validation. We split the data equally into training and testing sets using the ready-made function "train_test_split". Step 5 uses the training set from Step 4 to train the random forest model. Training speed depends on the size of the training set and on model parameters such as the number of trees. In Step 6, we evaluate the predictive power of the random forest model using the testing set. Finally, in Step 7, we measure the quality of the model's predictions using well-known metrics such as precision, recall, and F1-score (f-measure).
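The remark about keeping the best-performing features can be illustrated with a minimal sketch. It ranks all 30 features by the impurity-based importances of a random forest fitted on a training split. This is only one of many possible selection strategies and is not part of the original script; the names `ranker`, `order`, and `top_k`, and the cut-off of five features, are hypothetical choices made for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Load all 30 features of the breast cancer dataset.
data = load_breast_cancer()
X_all, y = data.data, data.target

# Rank features on the training half only, so the test half stays unseen.
X_train, X_test, y_train, y_test = train_test_split(
    X_all, y, test_size=0.50, random_state=42
)

# n_estimators and random_state are illustrative values, not tuned ones.
ranker = RandomForestClassifier(n_estimators=100, random_state=0)
ranker.fit(X_train, y_train)

# Sort feature indices from most to least important.
order = np.argsort(ranker.feature_importances_)[::-1]

top_k = 5  # hypothetical cut-off, chosen only for this example
for idx in order[:top_k]:
    print(f"{data.feature_names[idx]}: {ranker.feature_importances_[idx]:.3f}")

# Keep only the top-ranked columns for later modelling.
X_selected = X_all[:, order[:top_k]]
```

Impurity-based importances are quick to compute but can be biased toward features with many distinct values, so a ranking like this is best treated as a starting point rather than a final selection.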

Python code
#M. S. Rakha, Ph.D.
#Post-Doctoral - Queen's University
# Supervised Learning - RandomForest Classification

#Step 1: Loading packages
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

# The random_state arguments below keep results consistent every time you run the script.

#Step 2: Loading the dataset.
breastCancer = datasets.load_breast_cancer()

#Step 3: Selecting the features to use.
X = breastCancer.data[:, 0:2]
y = breastCancer.target

#Step 4: Splitting the dataset into training and testing sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)
variableNames = breastCancer.feature_names

#Step 5: Fitting the Random Forest model using the training set.
randomForestModel = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
randomForestModel.fit(X_train, y_train)

#Step 6: Testing the trained model using the test dataset.
y_pred = randomForestModel.predict(X_test)

#Step 7: Printing out the accuracy measurements
print(classification_report(y_test, y_pred))

Below are the accuracy measurements printed by the "classification_report" function:
              precision    recall  f1-score   support

           0       0.89      0.82      0.85        98
           1       0.91      0.95      0.93       187

    accuracy                           0.90       285
   macro avg       0.90      0.88      0.89       285
weighted avg       0.90      0.90      0.90       285

The "0" row is for the "Malignant" class, while the "1" row is for the "Benign" class.
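To show where the per-class numbers in such a report come from, the sketch below rebuilds the same pipeline and derives precision and recall for the "Malignant" class (label 0) directly from a confusion matrix. It is an illustrative addition rather than part of the original script; the variable names (`model`, `cm`, `tp`, `fn`, `fp`) are assumptions made for this example.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Same setup as the script above: first two features, 50/50 split.
data = load_breast_cancer()
X, y = data.data[:, 0:2], data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42
)

model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
# (0 = malignant, 1 = benign).
cm = confusion_matrix(y_test, y_pred, labels=[0, 1])
print(cm)

# Counts seen from the point of view of the malignant class (label 0).
tp = cm[0, 0]  # malignant cases predicted as malignant
fn = cm[0, 1]  # malignant cases predicted as benign
fp = cm[1, 0]  # benign cases predicted as malignant

precision_0 = tp / (tp + fp)  # share of malignant predictions that are right
recall_0 = tp / (tp + fn)     # share of malignant cases that are found
print(f"precision (class 0): {precision_0:.2f}")
print(f"recall (class 0): {recall_0:.2f}")
```

In a diagnostic setting, recall for the malignant class is often the figure to watch most closely, since each false negative is a malignant case predicted as benign.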

M. S. Rakha, Ph.D.
Queen's University

Topic Tags

Artificial Intelligence, Machine Learning

