In this example, we apply unsupervised learning using k-means clustering, running the KMeans algorithm on the breast cancer dataset from sklearn. The script uses several libraries (numpy, matplotlib, pandas, and sklearn) to perform the clustering analysis. The data is loaded from sklearn.datasets, and only the first two features (mean radius and mean texture) are used.
The script creates a KMeans model with two clusters, init='random' (random initial centroids), and n_init=10, which means the algorithm is run 10 times from different starting centroids and the run with the lowest inertia is kept. It then fits the model to the data and plots the cluster assignments next to the ground-truth classes.
Finally, the script calls classification_report from sklearn.metrics to compare the cluster labels from the k-means model with the true labels, and prints precision, recall, f1-score, and support for each label. It's worth noting that two features are unlikely to capture everything the breast cancer dataset has to offer, so this is not a thorough way to analyze it. It's also worth noting that KMeans is not the best approach when class labels are available, as they are for this dataset.
python code
#https://jupyter.org/try
#Demo5
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# UnSupervised Learning - Clustering Kmeans
# Kmeans Clustering
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report
np.random.seed(5)
breastCancer = datasets.load_breast_cancer()
list(breastCancer.target_names)
X = breastCancer.data[:, 0:2]
y = pd.DataFrame(breastCancer.target)
variableNames = breastCancer.feature_names
#first ten records
X[0:10,]
# Build the KMeans model
n_clusters = 2 # The number of clusters
init = 'random' # Initial centroids are picked at random from the data
n_init = 10 # Number of runs with different initial centroids (best run is kept)
clusteringKMeans = KMeans(n_clusters=n_clusters, init=init, n_init=n_init)
clusteringKMeans.fit(X)
##Plotting the model output
breastCancer_df = pd.DataFrame(breastCancer.data)
breastCancer_df = breastCancer_df.iloc[:, 0:2] # keep the first two columns (mean radius, mean texture)
breastCancer_df.columns = ['meanRadius','meanTexture']
y.columns = ["Targets"]
color_theme = np.array(['red','darkgreen'])
plt.subplot(1,2,1)
plt.scatter(x=breastCancer_df.meanRadius, y=breastCancer_df.meanTexture,c=color_theme[breastCancer.target],s=50)
plt.title('Ground Truth Classification')
plt.subplot(1,2,2)
plt.scatter(x=breastCancer_df.meanRadius, y=breastCancer_df.meanTexture,c=color_theme[clusteringKMeans.labels_],s=50)
plt.title('K-Means Clustering')
# Evaluate the model
# Note: cluster ids (0/1) are arbitrary and may be swapped relative to the class encoding
print(classification_report(y,clusteringKMeans.labels_))
In general, using k-means for classification tasks is not the best approach because it is an unsupervised learning technique used for clustering. It groups similar data points together without considering the actual class labels, so it is better suited to discovering hidden patterns in unlabeled data. This script applies KMeans to the breast cancer dataset, which is normally treated as a supervised classification problem. The script tries to separate the data into 2 clusters (hoping they match malignant and benign) based on only two features (mean radius and mean texture). This is problematic because these two features may not be enough to separate the classes accurately, and KMeans is not designed for supervised classification tasks.
Furthermore, the script evaluates the KMeans model with classification_report, which is questionable in this context because it compares the cluster ids with the true labels. KMeans doesn't predict class labels; it only groups data points together, and the ids 0 and 1 it assigns are arbitrary. The report is therefore meaningful only when the cluster ids happen to line up with the class encoding, or are mapped to it first. If the goal is to predict the diagnosis, a supervised learning algorithm like Random Forest, Decision Tree, Neural Network, Logistic Regression, etc., should be used instead. These algorithms are designed for classification: they learn from labeled data and make predictions.
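One way to evaluate a clustering against known labels is sketched below. It is a minimal sketch, not part of the original script: the use of `random_state=5` (in place of `np.random.seed(5)`), the majority-vote mapping, and the variable names `cluster_ids` and `aligned` are assumptions made for illustration. It reports the adjusted Rand index, which does not depend on how clusters are numbered, and then maps each cluster to the majority class of its members before calling classification_report.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score, classification_report

data = datasets.load_breast_cancer()
X = data.data[:, 0:2]          # mean radius, mean texture (as in the script)
y = data.target                # 0 = malignant, 1 = benign

# Assumption: random_state is used here for reproducibility
km = KMeans(n_clusters=2, init='random', n_init=10, random_state=5)
cluster_ids = km.fit_predict(X)

# Score that is invariant to the numbering of the clusters
ari = adjusted_rand_score(y, cluster_ids)

# Map each cluster id to the most common true class among its members
aligned = np.empty_like(cluster_ids)
for cluster in np.unique(cluster_ids):
    members = cluster_ids == cluster
    aligned[members] = np.bincount(y[members]).argmax()

print(f"Adjusted Rand index: {ari:.3f}")
print(classification_report(y, aligned))
```

The mapping step uses the true labels, so the resulting report describes how well the clusters agree with the classes; it is not a prediction score for unseen data.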
Another thing to consider is that the script uses only two of the thirty features provided in the breast cancer dataset. This discards most of the available information and may limit how well the clusters match the classes. To get better results, it's recommended to use more features, ideally all of them, and to standardize the features first, because k-means relies on distances and the features are on very different scales.
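A rough way to check the effect of the feature choice is sketched below. This is an illustration under assumptions rather than the original script: the helper name `cluster_ari`, the use of StandardScaler, and `random_state=5` are all choices made here. It clusters once on the first two features and once on all thirty, and compares each result with the true labels using the adjusted Rand index.

```python
from sklearn import datasets
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score
from sklearn.preprocessing import StandardScaler

data = datasets.load_breast_cancer()
y = data.target

def cluster_ari(X):
    """Standardize X, run 2-cluster k-means, return agreement with y (hypothetical helper)."""
    X_scaled = StandardScaler().fit_transform(X)
    labels = KMeans(n_clusters=2, n_init=10, random_state=5).fit_predict(X_scaled)
    return adjusted_rand_score(y, labels)

ari_two = cluster_ari(data.data[:, 0:2])   # mean radius, mean texture only
ari_all = cluster_ari(data.data)           # all 30 features

print(f"ARI with 2 features:  {ari_two:.3f}")
print(f"ARI with 30 features: {ari_all:.3f}")
```

Both runs are standardized here so that the comparison isolates the number of features; the original script clusters the raw values.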
Additionally, the script shown above does not split the data at all: it fits KMeans on all 569 samples and scores the result on those same samples. That is common for clustering, but once a supervised model is used, part of the data should be held out for testing (for example with train_test_split), or even better, cross-validation should be used to make sure that the model generalizes well to new unseen data.
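Cross-validation for a supervised baseline can be sketched as follows. This is a minimal example under assumptions, not something the original script does: the choice of logistic regression, five stratified folds, and `random_state=5` are picked here only for illustration.

```python
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = datasets.load_breast_cancer(return_X_y=True)   # all 30 features

# Scaling lives inside the pipeline so it is re-fit on each training fold
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))

# Stratified folds keep the malignant/benign ratio similar in every fold
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=5)
scores = cross_val_score(model, X, y, cv=cv, scoring="accuracy")

print("Accuracy per fold:", scores.round(3))
print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Each sample is used for testing exactly once, so the mean score is a less optimistic estimate than scoring on the training data.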
Finally, the breast cancer dataset poses a binary classification task: classify each tumor as malignant or benign. A KMeans model with 2 clusters never sees those class labels; it only groups similar data points together, so there is no guarantee that the two clusters correspond to the two diagnoses. When the goal is to predict the label, a supervised learning algorithm such as Random Forest or Logistic Regression is the appropriate choice, since it learns from labeled data and predicts the class labels directly.
Below is the classification report comparing the K-means cluster assignments with the original labels of the dataset.
- Code:
              precision    recall  f1-score   support

           0       0.76      0.82      0.78       212
           1       0.89      0.84      0.86       357

    accuracy                           0.83       569
   macro avg       0.82      0.83      0.82       569
weighted avg       0.84      0.83      0.83       569
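To make the summary rows of the report easier to read, the snippet below recomputes three of them from the per-class rows. It is only a sanity check on the arithmetic: the inputs are the rounded two-decimal values copied from the report, so the results can differ from the printed ones in the last digit.

```python
# Per-class values copied from the report above (already rounded)
support = {0: 212, 1: 357}
precision = {0: 0.76, 1: 0.89}
recall = {0: 0.82, 1: 0.84}

total = sum(support.values())

# Accuracy equals the support-weighted mean of the per-class recalls
accuracy = sum(recall[c] * support[c] for c in support) / total

# Macro average: plain mean over classes, ignoring class sizes
macro_precision = sum(precision.values()) / len(precision)

# Weighted average: mean over classes, weighted by support
weighted_precision = sum(precision[c] * support[c] for c in support) / total

print(f"accuracy           ~ {accuracy:.3f}")
print(f"macro precision    ~ {macro_precision:.3f}")
print(f"weighted precision ~ {weighted_precision:.3f}")
```

The weighted average sits closer to the class-1 value because class 1 has more samples (357 versus 212).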
The following picture shows the two clusters (left: the labeled data; right: the output of K-means).
[Image: Image.png]