
### Feature selection: Statistical and Recursive examples

This example illustrates feature selection using two techniques. The first is Univariate Selection, in which a statistical test is applied to each feature against the dependent variable (class). The second is Recursive Feature Elimination (RFE), a recursive method that repeatedly drops the weakest features and measures performance.

Python code:

```python
# https://jupyter.org/try
# Demo7 - part1
# M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
#
# Feature Selection
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest
from sklearn.feature_selection import chi2
from sklearn.feature_selection import RFE

# 1) Univariate Selection
# Statistical tests can be used to select those features that have the
# strongest relationship with the output variable. The example below uses
# the chi-squared (chi^2) statistical test for non-negative features.
np.random.seed(5)

breastCancer = datasets.load_breast_cancer()
list(breastCancer.target_names)

X = breastCancer.data
y = breastCancer.target

# Feature extraction: keep the 4 features with the highest chi^2 scores
test = SelectKBest(score_func=chi2, k=4)
fit = test.fit(X, y)

# Summarize scores
np.set_printoptions(precision=3)
print(fit.scores_)

features = fit.transform(X)
# Summarize selected features
print(features[0:5, :])

# 2) Recursive Feature Elimination
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.50, random_state=42)
variableNames = breastCancer.feature_names

# Recursive feature elimination (RFE) fits a model and removes the weakest
# feature (or features) until the specified number of features is reached.
# Features are ranked by the model's coef_ or feature_importances_.
randomForestModel = RandomForestClassifier(
    n_estimators=100, max_depth=2, random_state=0)
rfe = RFE(randomForestModel, n_features_to_select=3)
fit = rfe.fit(X, y)

print("Num Features: %d" % fit.n_features_)
print("Selected Features: %s" % fit.support_)
print("Feature Ranking: %s" % fit.ranking_)
```

This script applies two feature selection methods, Univariate Selection and Recursive Feature Elimination (RFE), to select the most informative features from the breast cancer dataset.

First, the script uses Univariate Selection with the chi-squared test to select the four features that have the strongest relationship with the output variable (the breast cancer diagnosis). It creates a SelectKBest object from sklearn.feature_selection, passing chi2 as the scoring function. SelectKBest then returns the top 4 features ranked by their chi-squared scores, which are printed along with the scores.

Next, the script uses Recursive Feature Elimination (RFE) to select the three most informative features. It uses the RFE class from sklearn.feature_selection with a RandomForestClassifier as the estimator. RFE recursively removes features from the dataset, using the classifier to evaluate the importance of each feature, until the specified number of features (3 in this case) remains. The script then prints the number of features, the selected-feature mask, and the feature ranking.

Together, these steps demonstrate how feature selection can isolate the most informative features in a dataset and improve the performance of machine learning models.

Univariate feature selection methods consider the relationship between each feature and the response variable independently, one feature at a time. The chi-squared test is a statistical test that can identify the features with the strongest relationship with the output variable: it compares the observed frequencies of a categorical variable to the frequencies expected if the feature and the output were independent, and produces a score representing the strength of that relationship. The SelectKBest class from sklearn.feature_selection then selects the top K features based on these chi-squared scores.
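The mechanics above can be seen on a tiny example. Here is a minimal sketch using a small synthetic matrix (the data is made up for illustration, not from the breast cancer dataset): feature values are summed per class, compared against the frequencies expected under independence, and the k highest-scoring columns are kept.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2

# Tiny synthetic dataset: 4 samples, 3 non-negative features, 2 classes.
# Feature 0 clearly separates the classes; feature 1 is nearly constant.
X = np.array([[1, 20, 3],
              [2, 22, 1],
              [9, 21, 2],
              [8, 19, 4]])
y = np.array([0, 0, 1, 1])

selector = SelectKBest(score_func=chi2, k=2)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # one chi^2 score per feature
print(selector.get_support())  # boolean mask: which 2 features were kept
print(X_new.shape)             # (4, 2)
```

Here feature 0 gets by far the highest score because its per-class totals deviate most from the independence expectation, so it survives the selection along with feature 2.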

On the other hand, Recursive Feature Elimination (RFE) is a feature selection method that fits a model and removes the weakest feature (or features) until the specified number of features is reached. Features are ranked by the model's coef_ or feature_importances_ attribute. This script uses a RandomForestClassifier as the estimator, whose feature importances are used to evaluate each feature; RFE then recursively removes the least important feature until the specified number of features remains.
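The elimination loop can be sketched compactly. This is a minimal, self-contained version of the RFE step above (using a smaller forest than the original script so it runs quickly; the dataset and `step=1` elimination schedule are assumptions for the sketch):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

estimator = RandomForestClassifier(n_estimators=25, max_depth=2,
                                   random_state=0)

# Eliminate one feature per iteration until only 3 remain; at each step
# the refit estimator's feature_importances_ decide which feature is weakest.
rfe = RFE(estimator, n_features_to_select=3, step=1)
rfe.fit(X, y)

print(rfe.n_features_)     # 3
print(rfe.support_.sum())  # 3 True entries in the selection mask
print(rfe.ranking_.min())  # selected features all have rank 1
```

Features eliminated later get lower (better) ranks, so `ranking_` doubles as an ordering of feature importance across the whole elimination process.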

Both Univariate Selection and RFE are powerful feature selection methods that can be used to select the most informative features from a dataset and improve the performance of machine learning models. Using these methods, you can reduce the dimensionality of the dataset, which can improve the model's performance and make the results easier to interpret. Reducing dimensionality also helps prevent overfitting, a common problem in machine learning. Overfitting occurs when a model is too complex and fits the noise in the training data rather than the underlying pattern, which leads to poor performance on new, unseen data. By selecting only the most informative features, you simplify the model and reduce the chance of overfitting.
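One way to check the dimensionality-reduction claim empirically is to compare cross-validated accuracy on all 30 features against the top-4 chi-squared features. This is a hedged sketch, not part of the original script; the exact scores will depend on the classifier and fold split, and the reduced set is not guaranteed to win:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)
clf = RandomForestClassifier(n_estimators=100, max_depth=2, random_state=0)

# Baseline: 5-fold CV accuracy on all 30 features
all_score = cross_val_score(clf, X, y, cv=5).mean()

# Reduced: same classifier on only the top-4 chi^2 features
X_top = SelectKBest(chi2, k=4).fit_transform(X, y)
top_score = cross_val_score(clf, X_top, y, cv=5).mean()

print(f"all 30 features: {all_score:.3f}, top 4 features: {top_score:.3f}")
```

If the two scores are close, the 4-feature model is preferable: it is cheaper to train, easier to interpret, and less prone to fitting noise.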

Another benefit of feature selection is that it can also help improve the model's interpretability. When working with large datasets with many features, it can be difficult to understand the relationships between the features and the response variable. By selecting only the most informative features, you can help to make the relationships between the features and the response variable clear.

In summary, feature selection is an important step in the machine learning workflow that can help improve the model's performance and interpretability. By using univariate selection and RFE methods, you can select the most informative features from the dataset, reduce dimensionality, prevent overfitting and help to understand the relationships between features and response variables.

The output of this example is:

```
[2.661e+02 9.390e+01 2.011e+03 5.399e+04 1.499e-01 5.403e+00 1.971e+01
 1.054e+01 2.574e-01 7.431e-05 3.468e+01 9.794e-03 2.506e+02 8.759e+03
 3.266e-03 6.138e-01 1.045e+00 3.052e-01 8.036e-05 6.371e-03 4.917e+02
 1.744e+02 3.665e+03 1.126e+05 3.974e-01 1.931e+01 3.952e+01 1.349e+01
 1.299e+00 2.315e-01]
[[1001.    153.4   184.6  2019.  ]
 [1326.     74.08  158.8  1956.  ]
 [1203.     94.03  152.5  1709.  ]
 [ 386.1    27.23   98.87  567.7 ]
 [1297.     94.44  152.2  1575.  ]]
Num Features: 3
Selected Features: [False False False False False False False False False False False False
 False False False False False False False False False False  True  True
 False False False  True False False]
Feature Ranking: [ 9 13  7  4 28 14  5  3 22 21 18 19 16  8 25 27 20 23 24 26  2 10  1  1
 15 11  6  1 12 17]
```

_________________
M. S. Rakha, Ph.D.
Queen's University



### Topic Tags

Machine Learning, Python, Artificial Intelligence