
R script for RandomForest with Cross-validation and Sampling

Sat Oct 18, 2014 5:28 am

This R script reads a data set from a CSV file called "Data.csv" and loads the randomForest, ROSE, caret, and pROC packages. It then performs 10-fold cross-validation by randomly assigning each row to one of k = 10 folds. In each iteration, the script splits the data into a training set (the nine remaining folds) and a test set (the held-out fold). It balances the class distribution of the training set with caret's downSample and upSample functions (a ROSE-based oversampling step is included but commented out), fits a random forest to the training data with the cforest function from the party package, and makes predictions on the test data. It then calculates performance metrics such as per-class precision and recall and the area under the ROC curve.

It also calculates variable importance with the varImp function. After each iteration, the script accumulates the variable importance and ROC values, and after the loop it averages them over all folds. Finally, it writes the variable importance to a CSV file called "CondsaveTemptable.csv". In short, this script builds a random-forest classifier with 10-fold cross-validation in R, handles the class-imbalance problem by resampling the training data, and reports precision/recall, variable importance, and the ROC area for each fold. To run it on your own data, you will need to adjust the file path, the column indices, and the class labels.
Code:

setwd("D:/newFolder/")

data <- read.csv("Data.csv", header = TRUE)

# Load the required packages
library(randomForest)
library(ROSE)    # roc.curve and (optional) ROSE resampling
library(caret)   # downSample, upSample, confusionMatrix, varImp
library(pROC)
library(party)   # cforest





k <- 10  # number of folds

# Randomly assign each row to one of the k folds
id <- sample(1:k, nrow(data), replace = TRUE)
list <- 1:k

trainingset <- data.frame()
# Optional progress bar (requires the plyr package):
# progress.bar <- create_progress_bar("text")
# progress.bar$init(k)

# Running totals of per-fold metrics, averaged after the loop
PrecisionClassOne <- 0
RecallClassOne <- 0
PrecisionClassTwo <- 0
RecallClassTwo <- 0


for (i in 1:k){
 
  trainingset <- subset(data, id %in% list[-i])
  # Optional: oversample the minority class with ROSE instead of caret
  # trainingset <- ROSE(class ~ ., data = trainingset, N = length(trainingset$class))$data

  # NOTE: the column indices (predictors in 1:22, class label in 23) depend
  # on your data, so you may need to change them!
  # downSample already balances the classes, so the upSample call that
  # follows has no further effect here.
  trainingset <- downSample(trainingset[, 1:22], as.factor(trainingset[, 23]),
                            list = FALSE, yname = "class")
  trainingset <- upSample(trainingset[, 1:22], as.factor(trainingset[, 23]),
                          list = FALSE, yname = "class")
  # print(trainingset[, 23])
  testset <- subset(data, id %in% c(i))
 

  #which(sapply(testset,  class) != sapply(trainingset,  class))
   
 
 
  # Fit a conditional-inference random forest on the balanced training set
  library(party)
  cf1 <- cforest(class ~ ., data = trainingset,
                 control = cforest_unbiased(mtry = 2, ntree = 100))
 
  print("perform predictions on test data...")
 
   
   
 
  predictions <- predict(cf1, newdata=testset)

 
 
  # Per-class metrics: byClass[1] = Sensitivity (recall),
  # byClass[3] = Pos Pred Value (precision)
  metrics <- confusionMatrix(predictions, testset$class, positive = '1')
  ClassOne <- metrics$byClass

  metrics2 <- confusionMatrix(predictions, testset$class, positive = '2')
  ClassTwo <- metrics2$byClass

  # Accumulate running totals; averaged over k after the loop
  PrecisionClassOne <- ClassOne[3] + PrecisionClassOne
  RecallClassOne    <- ClassOne[1] + RecallClassOne
  PrecisionClassTwo <- ClassTwo[3] + PrecisionClassTwo
  RecallClassTwo    <- ClassTwo[1] + RecallClassTwo
 
 
  # Area under the ROC curve for this fold (roc.curve is from ROSE)
  rocValue <- roc.curve(testset$class, predictions,
                        main = "ROC curve \n (Half circle depleted data)")

  importToSave <- varImp(cf1)
  # varImp(cf1, conditional = TRUE)
  # plot(varImp(cf1), top = 20)

  # Collect per-fold variable importance (columns) and AUC (rows)
  if (i > 1) {
    saveTemp    <- cbind(saveTemp, importToSave)
    saveROCtemp <- rbind(saveROCtemp, rocValue$auc)
  } else {
    saveTemp    <- importToSave
    saveROCtemp <- rocValue$auc
  }
 
 
}
PrecisionClassOne <- PrecisionClassOne / k
RecallClassOne <- RecallClassOne / k
PrecisionClassTwo <- PrecisionClassTwo / k
RecallClassTwo <- RecallClassTwo / k
print("Class One Precision / Recall")
print(PrecisionClassOne)
print(RecallClassOne)
print("Class Two (Re-open) Precision / Recall")
print(PrecisionClassTwo)
print(RecallClassTwo)


### Saving the variable importance

write.table ( saveTemp,
              file = "CondsaveTemptable.csv",
              append = FALSE,
              quote = TRUE,
              sep = ",",
              col.names = TRUE,
              row.names = TRUE);

# Mean importance of each variable across the k folds
meanImportance <- rowMeans(saveTemp)
max(saveTemp)
min(meanImportance)
write.table(meanImportance,
            file = "CondsaveTemptableMeans.csv",
            append = FALSE,
            quote = TRUE,
            sep = ",",
            col.names = TRUE,
            row.names = TRUE)



### Saving the per-fold ROC values

write.table ( saveROCtemp,
              file = "CondsaveROCTemptable.csv",
              append = FALSE,
              quote = TRUE,
              sep = ",",
              col.names = TRUE,
              row.names = TRUE);



# Box plot of variable importance across folds (requires reshape2 and ggplot2)
library(reshape2)
library(ggplot2)
newSaveTemp <- t(saveTemp)        # folds as rows, variables as columns
melted <- melt(newSaveTemp)       # long format: Var1 = fold, Var2 = variable
ggplot(melted, aes(x = Var2, y = value)) +
  geom_boxplot() +
  labs(x = "Variable", y = "Importance")





This R script trains a random forest model on a given data set and evaluates it with 10-fold cross-validation, balancing the classes inside each fold.

The script first reads a data set from a CSV file called "Data.csv" and loads the necessary packages (randomForest, ROSE, caret, and pROC). It then sets the number of folds for the cross-validation (k = 10).

It then randomly assigns each row of the data to a fold, and for each iteration of the cross-validation:

    1. It creates a training set from the rows assigned to the other folds and a test set from the rows assigned to the current fold.
    2. It balances the class distribution of the training set using caret's downSample and upSample functions; a ROSE-based oversampling step is available but commented out.
    3. It fits a random forest model to the training data with the cforest function from the party package and makes predictions on the test data.
    4. It calculates performance metrics such as per-class precision and recall and the area under the ROC curve.
    5. It calculates variable importance with the varImp function.
After each iteration, the script accumulates the per-fold variable importance and AUC values. After the loop, it writes the raw importance to "CondsaveTemptable.csv", the per-variable means across folds to "CondsaveTemptableMeans.csv", and the per-fold AUC values to "CondsaveROCTemptable.csv".

By balancing the classes inside each cross-validation fold rather than once on the full data set, the script avoids leaking information from the held-out folds into the resampling step and obtains a more realistic estimate of the model's performance.
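As an aside, caret can also fold the resampling and the class balancing into a single call via `trainControl(sampling = "down")`, which applies the down-sampling inside each fold automatically. A minimal sketch, assuming the same data frame `data` with a factor column `class` (the `train` call is commented out because it needs your data loaded):

```r
library(caret)

# 10-fold CV with down-sampling applied inside each resampling iteration
ctrl <- trainControl(method = "cv", number = 10, sampling = "down")

# method = "cforest" wraps party::cforest; mtry is fixed via tuneGrid
# fit <- train(class ~ ., data = data, method = "cforest",
#              trControl = ctrl, tuneGrid = data.frame(mtry = 2))
# fit$resample   # per-fold accuracy and kappa
```

This reproduces the structure of the hand-rolled loop above in far less code, at the cost of less direct control over the per-fold metrics that get collected.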



Re: R script for RandomForest with Cross-validation and Sampling

Sat Nov 28, 2015 4:57 am

updated
