R Scripts


This example illustrates the extra analysis that a random forest can provide for data scientists. Beyond classifying, a fitted forest can rank the input features by importance. The scikit-learn feature_importances_ attribute used here measures how much each feature reduces impurity across the trees, rather than the error caused by dropping the feature. This script loads the breast cancer dataset from sklearn and uses RandomForestClassifier to train a model on it. It then tests the model on the test set and prints the classification report. After that, it prints the feature ranking and the feature names, and plots the feature importances in a bar chart. The model is trained on the first ten features; the target is the breast cancer diagnosis (malignant or benign).

python code
#Demo7 - part2
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# Supervised Learning - Random Forest
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier

breastCancer = datasets.load_breast_cancer()


# Use only the first ten features
X = breastCancer.data[:, 0:10]
y = breastCancer.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)


varriableNames= breastCancer.feature_names

randomForestModel = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
randomForestModel.fit(X_train, y_train)

y_pred = randomForestModel.predict(X_test)

print(classification_report(y_test, y_pred))

importances = randomForestModel.feature_importances_
# Spread of each feature's importance across the individual trees
std = np.std([tree.feature_importances_ for tree in randomForestModel.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

# Print the feature ranking
print("Feature ranking:")

for f in range(X.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))

# Print the feature names so the indices above can be matched to names
print(varriableNames)

# Plot the feature importances of the forest
plt.figure()
plt.title("Feature importances")
plt.bar(range(X.shape[1]), importances[indices],
        color="r", yerr=std[indices], align="center")
plt.xticks(range(X.shape[1]), indices)
plt.xlim([-1, X.shape[1]])
plt.show()


The script starts by importing the necessary libraries, including numpy, matplotlib, pandas, and sklearn. It then loads the breast cancer dataset, keeps the first ten features, and splits the data into training and test sets. Next, it creates a RandomForestClassifier with 100 estimators and a maximum depth of 2, and trains it on the training set. The trained model makes predictions on the test set, and the classification report is printed to evaluate the model's performance.

The script then prints the importance of each of the ten features, ranked from highest to lowest. These importances indicate which features contribute most to classifying the diagnosis. They are also visualized in a bar chart, where each bar's height represents the feature's importance and the error bar shows how much that importance varies across the individual trees. The x-axis labels are feature indices, not names.

Finally, the script prints the feature names of the full dataset (all 30 of them). Only the first ten apply here, and they are used to match a feature index in the ranking and the chart to a feature name. Overall, the script shows how a random forest can classify the breast cancer dataset and indicate which features matter most in that classification.
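Matching indices to names by hand is error-prone, so the lookup can also be done in code. The following is a minimal sketch, not part of the original script: it refits the same model and pairs each ranked index with its name. The names `featureNames`, `order`, and `rankedFeatures` are assumptions introduced for illustration.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

breastCancer = datasets.load_breast_cancer()
X = breastCancer.data[:, 0:10]   # first ten features, as in the script
y = breastCancer.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Hypothetical helper names: keep only the names of the ten features in use
featureNames = breastCancer.feature_names[:10]
order = np.argsort(model.feature_importances_)[::-1]   # most important first

# Pair each ranked index with its name and importance
rankedFeatures = [(str(featureNames[i]), float(model.feature_importances_[i]))
                  for i in order]

for rank, (name, score) in enumerate(rankedFeatures, start=1):
    print("%d. %s (%f)" % (rank, name, score))
```

The same `featureNames[order]` array could be passed to plt.xticks in place of the indices if named labels on the chart are preferred.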

The script uses the train_test_split function from sklearn to split the data into training and test sets. This is a common practice in machine learning for evaluating a model on unseen data. In this case, the test set is made up of 50% of the data, and the random_state parameter is set to 42, so the split is the same every time the script is run. The script does not set a global NumPy seed; reproducibility comes from the random_state arguments passed to train_test_split (42) and to RandomForestClassifier (0).

To evaluate the model, the script prints the classification report from sklearn.metrics, which gives precision, recall, and f1-score for each class. It also imports confusion_matrix but never calls it. A confusion matrix would complement the report by summarizing the number of true positives, true negatives, false positives, and false negatives.

Overall, this script demonstrates a typical machine learning workflow: the data is loaded, split into training and test sets, used to train a model, and the model is then evaluated on unseen data. It also demonstrates how to extract feature importances from the fitted model.
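As a hedged illustration of the two points above, the sketch below repeats the split to check that a fixed random_state reproduces it, then computes a confusion matrix for the same model. It is not part of the original script, and the names `splitIsReproducible` and `cm` are assumptions introduced here. In this dataset, class 0 is malignant and class 1 is benign, so the "positive" class in the unpacked counts is benign.

```python
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X = data.data[:, 0:10]
y = data.target

# Split twice with the same random_state; the results should be identical
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42)
X_train2, X_test2, y_train2, y_test2 = train_test_split(
    X, y, test_size=0.50, random_state=42)
splitIsReproducible = bool((X_test == X_test2).all() and (y_test == y_test2).all())
print("Same split on both runs:", splitIsReproducible)

model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_test, y_pred)
print(cm)

# For a binary problem, ravel() yields the counts with class 1 as positive
tn, fp, fn, tp = cm.ravel()
print("TN=%d FP=%d FN=%d TP=%d" % (tn, fp, fn, tp))
```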

The output of this snippet:

              precision    recall  f1-score   support

           0       0.92      0.85      0.88        98
           1       0.92      0.96      0.94       187

    accuracy                           0.92       285
   macro avg       0.92      0.90      0.91       285
weighted avg       0.92      0.92      0.92       285

Feature ranking:
1. feature 7 (0.327613)
2. feature 6 (0.197932)
3. feature 2 (0.187159)
4. feature 0 (0.104715)
5. feature 3 (0.102147)
6. feature 5 (0.039644)
7. feature 1 (0.026285)
8. feature 9 (0.008671)
9. feature 4 (0.005309)
10. feature 8 (0.000525)

['mean radius' 'mean texture' 'mean perimeter' 'mean area'
'mean smoothness' 'mean compactness' 'mean concavity'
'mean concave points' 'mean symmetry' 'mean fractal dimension'
'radius error' 'texture error' 'perimeter error' 'area error'
'smoothness error' 'compactness error' 'concavity error'
'concave points error' 'symmetry error' 'fractal dimension error'
'worst radius' 'worst texture' 'worst perimeter' 'worst area'
'worst smoothness' 'worst compactness' 'worst concavity'
'worst concave points' 'worst symmetry' 'worst fractal dimension']
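The ranking above comes from impurity-based importances computed on the training data. As a cross-check, scikit-learn also offers permutation importance, which shuffles one feature at a time on held-out data and measures the drop in score. The sketch below is an addition, not part of the original script, and the choice of n_repeats=10 and the names `result` and `permOrder` are assumptions. The two rankings need not agree exactly.

```python
import numpy as np
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

data = datasets.load_breast_cancer()
X = data.data[:, 0:10]
y = data.target
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42)

model = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
model.fit(X_train, y_train)

# Shuffle each feature 10 times on the test set and record the accuracy drop
result = permutation_importance(model, X_test, y_test,
                                n_repeats=10, random_state=0)

permOrder = np.argsort(result.importances_mean)[::-1]   # largest drop first
names = data.feature_names[:10]

print("Permutation ranking:")
for rank, i in enumerate(permOrder, start=1):
    print("%d. %s (%f +/- %f)" % (rank, names[i],
                                  result.importances_mean[i],
                                  result.importances_std[i]))
```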

Figure: RandomForestImportant.png (bar chart of the feature importances)

M. S. Rakha, Ph.D.
Queen's University


Topic Tags

Artificial Intelligence, Machine Learning, Python
