
R script for RandomForest with Cross-validation and Sampling

Sat Oct 18, 2014 5:28 am

This R script reads a data set from a CSV file called "Data.csv" and loads the randomForest, ROSE, caret, and pROC packages. It then performs 10-fold cross-validation by randomly assigning each row to one of k = 10 folds. In each iteration, the script splits the data into a training set (the nine remaining folds) and a test set (the held-out fold). It balances the class distribution of the training set with caret's downSample and upSample functions (a ROSE-based oversampling step is included but commented out), fits a random forest to the training data with the cforest function from the party package, and makes predictions on the test data. It then calculates performance metrics such as per-class precision and recall and the area under the ROC curve.

It also calculates variable importance with the varImp function. After each iteration, the script accumulates the variable importance and ROC values, and after the loop it averages them over all folds. Finally, it writes the variable importance to a CSV file called "CondsaveTemptable.csv". In short, this script builds a random-forest classifier with 10-fold cross-validation in R, handles the class-imbalance problem by resampling the training data, and reports precision/recall, variable importance, and the ROC area for each fold. To run it on your own data, you will need to adjust the file path, the column indices, and the class labels.
Code:

setwd("D:/newFolder/")

data <- read.csv("Data.csv", header = TRUE)

# Load the required packages
library(randomForest)
library(ROSE)    # roc.curve and (optional) ROSE resampling
library(caret)   # downSample, upSample, confusionMatrix, varImp
library(pROC)
library(party)   # cforest





k <- 10  # number of folds

# Randomly assign each row to one of the k folds
id <- sample(1:k, nrow(data), replace = TRUE)
list <- 1:k

trainingset <- data.frame()
# Optional progress bar (requires the plyr package):
# progress.bar <- create_progress_bar("text")
# progress.bar$init(k)

# Running totals of per-fold metrics, averaged after the loop
PrecisionClassOne <- 0
RecallClassOne <- 0
PrecisionClassTwo <- 0
RecallClassTwo <- 0


for (i in 1:k){
 
  trainingset <- subset(data, id %in% list[-i])
  # Optional: oversample the minority class with ROSE instead of caret
  # trainingset <- ROSE(class ~ ., data = trainingset, N = length(trainingset$class))$data

  # NOTE: the column indices (predictors in 1:22, class label in 23) depend
  # on your data, so you may need to change them!
  # downSample already balances the classes, so the upSample call that
  # follows has no further effect here.
  trainingset <- downSample(trainingset[, 1:22], as.factor(trainingset[, 23]),
                            list = FALSE, yname = "class")
  trainingset <- upSample(trainingset[, 1:22], as.factor(trainingset[, 23]),
                          list = FALSE, yname = "class")
  # print(trainingset[, 23])
  testset <- subset(data, id %in% c(i))
 

  #which(sapply(testset,  class) != sapply(trainingset,  class))
   
 
 
  # Fit a conditional-inference random forest on the balanced training set
  library(party)
  cf1 <- cforest(class ~ ., data = trainingset,
                 control = cforest_unbiased(mtry = 2, ntree = 100))
 
  print("perform predictions on test data...")
 
   
   
 
  predictions <- predict(cf1, newdata=testset)

 
 
  # Per-class metrics: byClass[1] = Sensitivity (recall),
  # byClass[3] = Pos Pred Value (precision)
  metrics <- confusionMatrix(predictions, testset$class, positive = '1')
  ClassOne <- metrics$byClass

  metrics2 <- confusionMatrix(predictions, testset$class, positive = '2')
  ClassTwo <- metrics2$byClass

  # Accumulate running totals; averaged over k after the loop
  PrecisionClassOne <- ClassOne[3] + PrecisionClassOne
  RecallClassOne    <- ClassOne[1] + RecallClassOne
  PrecisionClassTwo <- ClassTwo[3] + PrecisionClassTwo
  RecallClassTwo    <- ClassTwo[1] + RecallClassTwo
 
 
  # Area under the ROC curve for this fold (roc.curve is from ROSE)
  rocValue <- roc.curve(testset$class, predictions,
                        main = "ROC curve \n (Half circle depleted data)")

  importToSave <- varImp(cf1)
  # varImp(cf1, conditional = TRUE)
  # plot(varImp(cf1), top = 20)

  # Collect per-fold variable importance (columns) and AUC (rows)
  if (i > 1) {
    saveTemp    <- cbind(saveTemp, importToSave)
    saveROCtemp <- rbind(saveROCtemp, rocValue$auc)
  } else {
    saveTemp    <- importToSave
    saveROCtemp <- rocValue$auc
  }
 
 
}
PrecisionClassOne <- PrecisionClassOne / k
RecallClassOne <- RecallClassOne / k
PrecisionClassTwo <- PrecisionClassTwo / k
RecallClassTwo <- RecallClassTwo / k
print("Class One Precision / Recall")
print(PrecisionClassOne)
print(RecallClassOne)
print("Class Two (Re-open) Precision / Recall")
print(PrecisionClassTwo)
print(RecallClassTwo)


### Saving the variable importance

write.table ( saveTemp,
              file = "CondsaveTemptable.csv",
              append = FALSE,
              quote = TRUE,
              sep = ",",
              col.names = TRUE,
              row.names = TRUE);

# Mean importance of each variable across the k folds
meanImportance <- rowMeans(saveTemp)
max(saveTemp)
min(meanImportance)
write.table(meanImportance,
            file = "CondsaveTemptableMeans.csv",
            append = FALSE,
            quote = TRUE,
            sep = ",",
            col.names = TRUE,
            row.names = TRUE)



### Saving the per-fold ROC values

write.table ( saveROCtemp,
              file = "CondsaveROCTemptable.csv",
              append = FALSE,
              quote = TRUE,
              sep = ",",
              col.names = TRUE,
              row.names = TRUE);



# Box plot of variable importance across folds (requires reshape2 and ggplot2)
library(reshape2)
library(ggplot2)
newSaveTemp <- t(saveTemp)        # folds as rows, variables as columns
melted <- melt(newSaveTemp)       # long format: Var1 = fold, Var2 = variable
ggplot(melted, aes(x = Var2, y = value)) +
  geom_boxplot() +
  labs(x = "Variable", y = "Importance")





This R script trains a random forest model on a given data set and evaluates it with 10-fold cross-validation, balancing the classes inside each fold.

The script first reads a data set from a CSV file called "Data.csv" and loads the necessary packages (randomForest, ROSE, caret, and pROC). It then sets the number of folds for the cross-validation (k = 10).

It then randomly assigns each row of the data to a fold, and for each iteration of the cross-validation:

    1. It creates a training set from the rows assigned to the other folds and a test set from the rows assigned to the current fold.
    2. It balances the class distribution of the training set using caret's downSample and upSample functions; a ROSE-based oversampling step is available but commented out.
    3. It fits a random forest model to the training data with the cforest function from the party package and makes predictions on the test data.
    4. It calculates performance metrics such as per-class precision and recall and the area under the ROC curve.
    5. It calculates variable importance with the varImp function.
After each iteration, the script accumulates the per-fold variable importance and AUC values. After the loop, it writes the raw importance to "CondsaveTemptable.csv", the per-variable means across folds to "CondsaveTemptableMeans.csv", and the per-fold AUC values to "CondsaveROCTemptable.csv".

By balancing the classes inside each cross-validation fold rather than once on the full data set, the script avoids leaking information from the held-out folds into the resampling step and obtains a more realistic estimate of the model's performance.
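As an aside, caret can also fold the resampling and the class balancing into a single call via `trainControl(sampling = "down")`, which applies the down-sampling inside each fold automatically. A minimal sketch, assuming the same data frame `data` with a factor column `class` (the `train` call is commented out because it needs your data loaded):

```r
library(caret)

# 10-fold CV with down-sampling applied inside each resampling iteration
ctrl <- trainControl(method = "cv", number = 10, sampling = "down")

# method = "cforest" wraps party::cforest; mtry is fixed via tuneGrid
# fit <- train(class ~ ., data = data, method = "cforest",
#              trControl = ctrl, tuneGrid = data.frame(mtry = 2))
# fit$resample   # per-fold accuracy and kappa
```

This reproduces the structure of the hand-rolled loop above in far less code, at the cost of less direct control over the per-fold metrics that get collected.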



Re: R script for RandomForest with Cross-validation and Sampling

Sat Nov 28, 2015 4:57 am

updated
