Kmeans Clustering - Unsupervised Learning

Mon Oct 28, 2019 6:11 am

In this example, we apply unsupervised learning using K-means clustering. We run the K-means algorithm on the breast cancer dataset from sklearn, using only the first two features (mean radius and mean texture).
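For intuition, the core K-means loop (often called Lloyd's iteration) can be sketched in a few lines of NumPy. This is only an illustrative sketch, not sklearn's actual implementation: the function name `lloyd_kmeans`, its parameters, and the synthetic toy data are all assumptions made for the example.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=100, seed=0):
    """Hypothetical helper: assign points to the nearest centroid, then recompute centroids."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Distance from every point to every centroid, shape (n_samples, k)
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        new_centroids = centroids.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                new_centroids[j] = members.mean(axis=0)
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Two well-separated synthetic blobs (toy data, not the breast cancer dataset)
rng = np.random.default_rng(42)
toy = np.vstack([rng.normal(0.0, 0.5, size=(50, 2)),
                 rng.normal(5.0, 0.5, size=(50, 2))])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_centroids)
```

A single run like this can get stuck in a poor local optimum on harder data, which is why library implementations typically repeat the procedure from several random starts and keep the best result.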

python code
#M. S. Rakha, Ph.D.
# Post-Doctoral - Queen's University
# UnSupervised Learning - Clustering Kmeans
# Kmeans Clustering
%matplotlib inline
import numpy as np
import matplotlib.pyplot as plt

from mpl_toolkits.mplot3d import Axes3D
import pandas as pd
from sklearn.cluster import KMeans
from sklearn import datasets
from sklearn.preprocessing import scale
import sklearn.metrics as sm
from sklearn.metrics import confusion_matrix,classification_report

breastCancer = datasets.load_breast_cancer()


X = breastCancer.data[:, 0:2]

y = pd.DataFrame(breastCancer.target)
variableNames = breastCancer.feature_names

# First ten records
print(X[:10])

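# Building your K-means model
n_clusters = 2 # The number of clusters
init = 'random' # Initial centroids are chosen at random from the data
n_init = 10 # Number of runs with different initial centroids; the best run is kept
clusteringKMeans = KMeans(n_clusters=n_clusters, init=init, n_init=n_init)
clusteringKMeans.fit(X) # Fit the model so that labels_ is available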

##Plotting the model output
breastCancer_df = pd.DataFrame(breastCancer.data)
breastCancer_df = breastCancer_df.iloc[:, 0:2] # first two columns (mean radius, mean texture)
breastCancer_df.columns = ['meanRadius','meanTexture']
y.columns = ["Targets"]

color_theme = np.array(['red','darkgreen'])
plt.figure(figsize=(12, 5))

plt.subplot(1, 2, 1) # Left panel: ground truth
plt.scatter(x=breastCancer_df.meanRadius, y=breastCancer_df.meanTexture,c=color_theme[breastCancer.target],s=50)
plt.title('Ground Truth Classification')

plt.subplot(1, 2, 2) # Right panel: K-means output
plt.scatter(x=breastCancer_df.meanRadius, y=breastCancer_df.meanTexture,c=color_theme[clusteringKMeans.labels_],s=50)
plt.title('K-Means Clustering')
plt.show()

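#Evaluate the model
# K-means cluster IDs are arbitrary, so they may be swapped relative to the target labels
print(classification_report(breastCancer.target, clusteringKMeans.labels_))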


Below is the performance of K-means clustering compared with the ground-truth labels of the dataset.

              precision    recall  f1-score   support

           0       0.76      0.82      0.78       212
           1       0.89      0.84      0.86       357

    accuracy                           0.83       569
   macro avg       0.82      0.83      0.82       569
weighted avg       0.84      0.83      0.83       569
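
One caveat when reading a report like this: K-means assigns cluster IDs arbitrarily, so cluster 0 does not necessarily correspond to target class 0. A minimal sketch of one way to line the two up before scoring is shown below; the helper name `align_binary_clusters` and the toy labels are made up for illustration, and the approach only covers the two-cluster case.

```python
import numpy as np

def align_binary_clusters(y_true, cluster_ids):
    """Hypothetical helper: flip 0/1 cluster IDs if that agrees better with y_true."""
    y_true = np.asarray(y_true).ravel()
    cluster_ids = np.asarray(cluster_ids).ravel()
    flipped = 1 - cluster_ids
    keep_acc = np.mean(cluster_ids == y_true)  # agreement with IDs as given
    flip_acc = np.mean(flipped == y_true)      # agreement with IDs swapped
    return cluster_ids if keep_acc >= flip_acc else flipped

# Toy example (made-up labels): mostly correct grouping, but with opposite IDs
toy_truth = np.array([0, 0, 1, 1, 1])
toy_clusters = np.array([1, 1, 0, 0, 1])
aligned = align_binary_clusters(toy_truth, toy_clusters)
print(aligned)  # [0 0 1 1 0]
```

With the objects from the code above, something like `align_binary_clusters(breastCancer.target, clusteringKMeans.labels_)` could be passed to `classification_report` in place of the raw `labels_`.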

The following picture presents the two clusters (left: the labeled data; right: the output of K-means).
