Feature selection: Statistical and Recursive examples

DrRakha

This example illustrates feature selection using two techniques. The first is univariate selection, in which a statistical test is applied to each feature and the dependent variable (class). The second is recursive feature elimination (RFE), in which features are dropped recursively and the model is refitted after each removal.

python code

#https://jupyter.org/try
#Demo7 - part1
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University 
#  
# Feature Selection #1: univariate selection (chi-squared test)
import numpy as np
from sklearn import datasets
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2


#Univariate Selection
#Statistical tests can be used to select those features that have the strongest relationship with the output variable.
# example below uses the chi squared (chi^2) statistical test for non-negative features
np.random.seed(5)
breastCancer = datasets.load_breast_cancer()

list(breastCancer.target_names)

# Use all 30 features of the dataset
X = breastCancer.data
y = breastCancer.target



# feature selection: keep the 4 features with the highest chi-squared scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)
# summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)
features = fit.transform(X)
# summarize selected features
print(features[0:5,:])



## Recursive Feature Elimination
import numpy as np
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE
np.random.seed(5)
breastCancer = datasets.load_breast_cancer()

list(breastCancer.target_names)

X = breastCancer.data
y = breastCancer.target


X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.50, random_state=42)

# Number of training samples
X_train[:,0].size

variableNames = breastCancer.feature_names
  
#Feature selection
#Recursive feature elimination (RFE) is a feature selection method that fits a model 
#and removes the weakest feature (or features) until the specified number of features is reached. 
#Features are ranked by the model’s coef_ or feature_importances_

randomForestModel = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)
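# n_features_to_select must be passed by keyword in recent scikit-learn releases
rfe = RFE(randomForestModel, n_features_to_select=3)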
fit = rfe.fit(X, y)
print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)

This script uses two feature selection methods, univariate selection and recursive feature elimination (RFE), to select informative features from the breast cancer dataset.

First, the script uses univariate selection, specifically the chi-squared test, to select the four features that have the strongest relationship with the output variable (breast cancer diagnosis). It creates a SelectKBest object from sklearn.feature_selection and passes it chi2 as the scoring function. After fitting, the script prints the chi-squared score of all 30 features and then the first five rows of the four selected columns.

Next, the script uses RFE to select three features. It creates an RFE object from sklearn.feature_selection and passes it a RandomForestClassifier as the estimator. RFE then removes features one at a time, refitting the RandomForestClassifier after each removal to re-evaluate the importance of the remaining features, until it reaches the specified number of features (3 in this case). The script then prints the number of selected features, a boolean mask marking the selected features, and the ranking of all features (rank 1 means selected). Note that the script creates a train/test split but fits both selectors on the full dataset, so it demonstrates how the features are selected rather than how much the selection changes model performance.

Univariate feature selection methods consider the relationship of each feature with the response variable independently; in other words, they score each feature individually and ignore interactions between features. The chi-squared test is a statistical test that compares the observed frequencies of a categorical variable with the frequencies that would be expected if the feature and the class were independent. Its output is a score for each feature: the higher the score, the stronger the evidence that the feature and the output variable are related. The scikit-learn chi2 function requires non-negative feature values, which it treats as frequencies. SelectKBest from sklearn.feature_selection then keeps the K features with the highest scores.
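The score array printed by the script is easier to read when each score is paired with its feature name. The following is a minimal sketch of that idea, not part of the original script; the variable names (selector, top_idx, top_names, selected_names) are chosen here for illustration.

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, chi2

data = load_breast_cancer()
X, y = data.data, data.target

# Same selector as in the script: chi-squared scores, keep the best 4
selector = SelectKBest(score_func=chi2, k=4).fit(X, y)

# Indices of the 4 highest-scoring features, best first
top_idx = np.argsort(selector.scores_)[::-1][:4]
top_names = [str(data.feature_names[i]) for i in top_idx]

for i in top_idx:
    print(f"{data.feature_names[i]:<20s} chi2 = {selector.scores_[i]:.1f}")

# get_support() returns the boolean mask of the columns that transform() keeps
selected_names = [str(n) for n in data.feature_names[selector.get_support()]]
```

Going by the scores listed in the output of this example, the four highest values belong to worst area, mean area, area error, and worst perimeter. All four are size-related measurements with large numeric ranges, which is worth keeping in mind because the chi-squared score depends on the scale of the feature values.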

Recursive feature elimination (RFE), by contrast, uses a model to judge the features. It fits the model, removes the weakest feature (or features), and repeats until the specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attribute. This script uses a RandomForestClassifier as the estimator, which exposes feature_importances_ after fitting. Because the model is refitted after every removal, the importance of the remaining features is re-evaluated at each step.
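The boolean mask and ranking printed by RFE can be mapped back to feature names in the same way. This is a small sketch under the same settings as the script (100 trees, max_depth=2, three features to keep); the names forest, selected, and eliminated_first are illustrative and not from the original code, and the features that end up selected may differ between scikit-learn versions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

data = load_breast_cancer()
X, y = data.data, data.target

forest = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# step=1 removes one feature per iteration, so 30 features need 27 removals
rfe = RFE(estimator=forest, n_features_to_select=3, step=1).fit(X, y)

# support_ is True for the kept features; ranking_ is 1 for them
selected = [str(n) for n in data.feature_names[rfe.support_]]
print("Selected:", selected)

# The feature with the largest rank was eliminated first
eliminated_first = str(data.feature_names[rfe.ranking_.argmax()])
print("Eliminated first:", eliminated_first)
```

With step=1, the ranks run from 1 (selected) up to 28 (dropped first), which matches the range of values in the ranking array shown in the output.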

Both univariate selection and RFE can be used to select informative features from a dataset before training a model. Using these methods, you can reduce the dimensionality of the dataset, which can help improve the model's performance and make the results easier to interpret. Reducing the dimensionality can also reduce the risk of overfitting, a common problem in machine learning. Overfitting occurs when a model is complex enough to fit the noise in the training data rather than the underlying pattern, which leads to poor performance on new, unseen data. By keeping only the most informative features, you simplify the model and lower the chance of overfitting.
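Whether selection actually helps on a given dataset is best checked on held-out data. The sketch below is one possible way to do that and is not part of the original script: it wraps the selector and the classifier in a scikit-learn Pipeline so that the features are chosen inside each training fold only, and compares the result with a model trained on all features. The choice of 5 folds and the variable names are assumptions made for illustration.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline

X, y = load_breast_cancer(return_X_y=True)

# Baseline: random forest on all 30 features
all_features = RandomForestClassifier(n_estimators=100, random_state=0)

# Selection + model in one pipeline: the selector is refitted on each
# training fold, so the held-out fold never influences the choice of features
four_features = Pipeline([
    ("select", SelectKBest(score_func=chi2, k=4)),
    ("model", RandomForestClassifier(n_estimators=100, random_state=0)),
])

scores_all = cross_val_score(all_features, X, y, cv=5)
scores_four = cross_val_score(four_features, X, y, cv=5)

print("All features : mean accuracy = %.3f" % scores_all.mean())
print("Best 4 (chi2): mean accuracy = %.3f" % scores_four.mean())
```

Fewer features do not guarantee a higher score; the comparison only shows how much accuracy is kept or lost when the model is restricted to the selected columns.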

Another benefit of feature selection is that it can also help improve the model's interpretability. When working with large datasets with many features, it can be difficult to understand the relationships between the features and the response variable. By selecting only the most informative features, you can help to make the relationships between the features and the response variable clear.

In summary, feature selection is an important step in the machine learning workflow that can help improve the model's performance and interpretability. By using univariate selection and RFE, you can select the most informative features from the dataset, reduce dimensionality, reduce the risk of overfitting, and better understand the relationships between the features and the response variable.

The output of this example is:

Code:

[2.661e+02 9.390e+01 2.011e+03 5.399e+04 1.499e-01 5.403e+00 1.971e+01
 1.054e+01 2.574e-01 7.431e-05 3.468e+01 9.794e-03 2.506e+02 8.759e+03
 3.266e-03 6.138e-01 1.045e+00 3.052e-01 8.036e-05 6.371e-03 4.917e+02
 1.744e+02 3.665e+03 1.126e+05 3.974e-01 1.931e+01 3.952e+01 1.349e+01
 1.299e+00 2.315e-01]
[[1001.    153.4   184.6  2019.  ]
 [1326.     74.08  158.8  1956.  ]
 [1203.     94.03  152.5  1709.  ]
 [ 386.1    27.23   98.87  567.7 ]
 [1297.     94.44  152.2  1575.  ]]
Num Features: 3
Selected Features: [False False False False False False False False False False False False
 False False False False False False False False False False  True  True
 False False False  True False False]
Feature Ranking: [ 9 13  7  4 28 14  5  3 22 21 18 19 16  8 25 27 20 23 24 26  2 10  1  1
 15 11  6  1 12 17]
