DM Lecture08 12 Clustering


Transcript of DM Lecture08 12 Clustering

Page 1: DM Lecture08 12 Clustering


Prof. M-T. Kechadi, School of Computer Science & Informatics, UCD

Page 2: DM Lecture08 12 Clustering


Learning Outcomes

Understand Clustering Analysis

Types of Data in Clustering Analysis

Major Clustering Methods

Partitioning Methods

Hierarchical Methods

Page 3: DM Lecture08 12 Clustering


Clustering Analysis

Cluster: a collection of data objects that are similar to one another within the same cluster and dissimilar to the objects in other clusters

Clustering analysis: grouping a set of data objects into clusters

Clustering is unsupervised classification: there are no predefined classes

Typical applications:

Stand-alone tool to get insight into the data distribution

Pre-processing step for other algorithms

Page 4: DM Lecture08 12 Clustering


Clustering Applications

Pattern Recognition

Spatial Data Analysis

create thematic maps in GIS by clustering feature spaces

detect spatial clusters and explain them in spatial data mining

Image Processing

Economic Science

WWW

Document classification

Cluster weblog data to discover groups of similar access patterns

Page 5: DM Lecture08 12 Clustering


Page 6: DM Lecture08 12 Clustering


Page 7: DM Lecture08 12 Clustering


Page 8: DM Lecture08 12 Clustering


Page 9: DM Lecture08 12 Clustering


Types of data in clustering analysis

Interval-scaled variables

Binary variables

Nominal, ordinal, and ratio variables

Variables of mixed types

Page 10: DM Lecture08 12 Clustering


Interval-valued variables

Standardise the data:

Calculate the mean absolute deviation
$$s_f = \frac{1}{n}\left(|x_{1f} - m_f| + |x_{2f} - m_f| + \dots + |x_{nf} - m_f|\right)$$
where
$$m_f = \frac{1}{n}\left(x_{1f} + x_{2f} + \dots + x_{nf}\right)$$

Calculate the standardised measurement (z-score)
$$z_{if} = \frac{x_{if} - m_f}{s_f}$$

The mean absolute deviation is more robust than the standard deviation
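A minimal Python sketch of this standardisation (the `standardise` helper is assumed for illustration, not part of the lecture):

```python
# Standardise one interval-scaled variable f using the mean absolute deviation.
def standardise(values):
    n = len(values)
    m_f = sum(values) / n                          # mean of variable f
    s_f = sum(abs(x - m_f) for x in values) / n    # mean absolute deviation
    return [(x - m_f) / s_f for x in values]       # z-scores

print(standardise([2.0, 4.0, 4.0, 4.0, 6.0, 8.0]))
```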

Page 11: DM Lecture08 12 Clustering


Similarity & Dissimilarity Between Objects

Distance is normally used to measure the similarity or dissimilarity between two data objects.

Some popular ones include:

Minkowski distance:
$$d(i,j) = \left(|x_{i1} - x_{j1}|^q + |x_{i2} - x_{j2}|^q + \dots + |x_{ip} - x_{jp}|^q\right)^{1/q}$$
where $i = (x_{i1}, x_{i2}, \dots, x_{ip})$ and $j = (x_{j1}, x_{j2}, \dots, x_{jp})$ are two p-dimensional data objects, and q is a positive integer

Manhattan distance (q = 1):
$$d(i,j) = |x_{i1} - x_{j1}| + |x_{i2} - x_{j2}| + \dots + |x_{ip} - x_{jp}|$$

Page 12: DM Lecture08 12 Clustering


Similarity & Dissimilarity Between Objects

Euclidean distance (q = 2):
$$d(i,j) = \sqrt{|x_{i1} - x_{j1}|^2 + |x_{i2} - x_{j2}|^2 + \dots + |x_{ip} - x_{jp}|^2}$$

Properties

d(i,j) ≥ 0

d(i,i) = 0

d(i,j) = d(j,i)

d(i,j) ≤ d(i,k) + d(k,j)

One can also use a weighted distance, the parametric Pearson product-moment correlation, or other dissimilarity measures
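A minimal Python sketch of these distances (the `minkowski` helper is assumed for illustration; q = 1 gives the Manhattan distance and q = 2 the Euclidean distance):

```python
# Minkowski distance between two p-dimensional data objects.
def minkowski(x, y, q):
    return sum(abs(a - b) ** q for a, b in zip(x, y)) ** (1.0 / q)

i, j = [1.0, 3.0, 5.0], [4.0, 7.0, 5.0]
print(minkowski(i, j, 1))  # Manhattan distance: 7.0
print(minkowski(i, j, 2))  # Euclidean distance: 5.0
```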

Page 13: DM Lecture08 12 Clustering


Binary Variables

A contingency table for binary data:

                  Object j
                  1        0        sum
Object i    1     a        b        a + b
            0     c        d        c + d
            sum   a + c    b + d    p

Simple matching coefficient (invariant, if the binary variable is symmetric):
$$d(i,j) = \frac{b + c}{a + b + c + d}$$

Jaccard coefficient (non-invariant if the binary variable is asymmetric):
$$d(i,j) = \frac{b + c}{a + b + c}$$
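A minimal Python sketch of both coefficients (the `binary_dissimilarity` helper is assumed for illustration):

```python
# Dissimilarity between two binary vectors via the contingency counts a, b, c, d.
def binary_dissimilarity(x, y, symmetric=True):
    a = sum(1 for u, v in zip(x, y) if u == 1 and v == 1)
    b = sum(1 for u, v in zip(x, y) if u == 1 and v == 0)
    c = sum(1 for u, v in zip(x, y) if u == 0 and v == 1)
    d = sum(1 for u, v in zip(x, y) if u == 0 and v == 0)
    if symmetric:                     # simple matching coefficient
        return (b + c) / (a + b + c + d)
    return (b + c) / (a + b + c)      # Jaccard coefficient

print(binary_dissimilarity([1, 0, 1, 0], [1, 1, 0, 0], symmetric=False))  # 2/3
```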

Page 14: DM Lecture08 12 Clustering


Dissimilarity between Binary Variables

Example

gender is a symmetric attribute

the remaining attributes are asymmetric binary

let the values Y and P be set to 1, and the value N be set to 0

Page 15: DM Lecture08 12 Clustering


Nominal Variables

Categorical variables can take more than 2 states, e.g., red, yellow, blue, green

Method 1: simple matching
$$d(i,j) = \frac{p - m}{p}$$
where m is the number of matches and p is the total number of variables

Method 2: use a large number of binary variables, creating a new binary variable for each of the M nominal states
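A minimal Python sketch of Method 1 (the `nominal_dissimilarity` helper is assumed for illustration):

```python
# Simple-matching dissimilarity for nominal variables: d(i, j) = (p - m) / p.
def nominal_dissimilarity(x, y):
    p = len(x)                                   # total number of variables
    m = sum(1 for a, b in zip(x, y) if a == b)   # number of matches
    return (p - m) / p

print(nominal_dissimilarity(["red", "blue", "green"], ["red", "yellow", "green"]))  # 1/3
```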

Page 16: DM Lecture08 12 Clustering


Ordinal Variables

An ordinal variable can be discrete or continuous; the order is important, e.g., rank

It can be treated like an interval-scaled variable:

replace $x_{if}$ by its rank $r_{if} \in \{1, \dots, M_f\}$

map the range of each variable onto [0, 1] by replacing the i-th object in the f-th variable by
$$z_{if} = \frac{r_{if} - 1}{M_f - 1}$$

compute the dissimilarity using methods for interval-scaled variables
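A minimal Python sketch of this rank mapping (the `ordinal_to_interval` helper and the example states are assumed for illustration):

```python
# Map the ranks of an ordinal variable onto [0, 1] via z_if = (r_if - 1) / (M_f - 1).
def ordinal_to_interval(values, ordered_states):
    m_f = len(ordered_states)
    rank = {state: r + 1 for r, state in enumerate(ordered_states)}  # r_if in {1..M_f}
    return [(rank[v] - 1) / (m_f - 1) for v in values]

print(ordinal_to_interval(["silver", "gold", "bronze"],
                          ["bronze", "silver", "gold"]))  # [0.5, 1.0, 0.0]
```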

Page 17: DM Lecture08 12 Clustering


Ratio-Scaled Variables

Ratio-scaled variable: a positive measurement on a nonlinear scale, approximately an exponential scale, e.g., $Ae^{Bt}$ or $Ae^{-Bt}$

Methods:

treat them like interval-scaled variables (not a good choice!)

apply a logarithmic transformation, $y_{if} = \log(x_{if})$

treat them as continuous ordinal data and treat their ranks as interval-scaled

Page 18: DM Lecture08 12 Clustering


Variables of Mixed Types

A database may contain all six types of variables: symmetric binary, asymmetric binary, nominal, ordinal, interval and ratio

One may use a weighted formula to combine their effects:
$$d(i,j) = \frac{\sum_{f=1}^{p} \delta_{ij}^{(f)}\, d_{ij}^{(f)}}{\sum_{f=1}^{p} \delta_{ij}^{(f)}}$$

If f is binary or nominal: $d_{ij}^{(f)} = 0$ if $x_{if} = x_{jf}$, and $d_{ij}^{(f)} = 1$ otherwise

If f is interval-based: use the normalised distance

If f is ordinal or ratio-scaled: compute the ranks $r_{if}$, set $z_{if} = \frac{r_{if} - 1}{M_f - 1}$, and treat $z_{if}$ as interval-scaled
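A minimal Python sketch of the weighted combination (the `mixed_dissimilarity` helper and the example contributions are assumed for illustration):

```python
# Weighted combination of per-variable dissimilarities d_ij^(f) with indicators delta_ij^(f).
def mixed_dissimilarity(contributions):
    """contributions: list of (delta, d) pairs, one per variable f."""
    num = sum(delta * d for delta, d in contributions)
    den = sum(delta for delta, _ in contributions)
    return num / den if den else 0.0

# One nominal mismatch (delta=1, d=1), one normalised interval distance (0.25),
# and one asymmetric-binary 0/0 match that is ignored (delta=0).
print(mixed_dissimilarity([(1, 1.0), (1, 0.25), (0, 0.0)]))  # 0.625
```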

Page 19: DM Lecture08 12 Clustering


Major Clustering Techniques

Partitioning algorithms: construct various partitions and then evaluate them by some criterion

Hierarchical algorithms: create a hierarchical decomposition of the set of data (or objects) using some criterion

Density-based: based on connectivity and density functions

Grid-based: based on a multiple-level granularity structure

Model-based: a model is hypothesised for each of the clusters, and the idea is to find the best fit of the data to the given model

Page 20: DM Lecture08 12 Clustering


Partitioning Algorithms: Basic Concept

Partitioning method

Partition a database D of n objects into a set of k clusters

Problem

Given k , find a partition of k clusters that optimises the chosen

partitioning criterion

Remarks

Global optimal: exhaustively enumerate all partitions (!!!)

Heuristic methods: k-means and k-medoids algorithms

k-means : Each cluster is represented by the centre of the cluster

k-medoids or PAM (Partitioning Around Medoids): Each cluster is represented by one of the objects in the cluster

Page 21: DM Lecture08 12 Clustering


Page 22: DM Lecture08 12 Clustering


The K-Means Clustering Method

Example

Page 23: DM Lecture08 12 Clustering


K-Means Method

Strengths

Relatively efficient: O(tkn), where n, k, and t are the number of objects, the number of clusters, and the number of iterations respectively. Normally, k, t << n

Often terminates at a local optimum; the global optimum may be found using techniques such as genetic algorithms

Weaknesses

Applicable only when the mean is defined; what about categorical data?

Need to specify k, the number of clusters, in advance

Unable to handle noisy data and outliers

Not suitable for discovering clusters with non-convex shapes
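For reference, a minimal k-means sketch (assumed illustration, not the lecture's own code), showing the assignment and update steps that drive the O(tkn) cost:

```python
import random

def kmeans(points, k, iterations=100):
    centres = random.sample(points, k)                 # each cluster is represented by its centre
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:                               # assignment step: nearest centre
            idx = min(range(k),
                      key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])))
            clusters[idx].append(p)
        for c, members in enumerate(clusters):         # update step: mean of each cluster
            if members:
                centres[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return centres, clusters

centres, clusters = kmeans([(1, 1), (1, 2), (8, 8), (9, 8)], k=2)
print(centres)
```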

Page 24: DM Lecture08 12 Clustering


Page 25: DM Lecture08 12 Clustering


The K-Medoids Clustering Method

Find representative objects, called medoids, in clusters

PAM (Partitioning Around Medoids)

Start from an initial set of medoids and iteratively replace one of the medoids by one of the non-medoids if it improves the total distance of the resulting clustering

PAM works effectively for small data sets, but does not scale well for large data sets

Focusing + spatial data structure

Page 26: DM Lecture08 12 Clustering


PAM (Partitioning Around Medoids)

Use real objects to represent the clusters:

1. Select k representative objects arbitrarily
2. For each pair of a non-selected object h and a selected object i, calculate the total swapping cost TC_ih
3. For each pair of i and h, if TC_ih < 0, i is replaced by h; then assign each non-selected object to the most similar representative object
4. Repeat steps 2-3 until there is no change
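A minimal PAM-style sketch (assumed illustration, not the lecture's own code): a medoid i is swapped for a non-medoid h whenever the swap lowers the total distance of the clustering.

```python
def total_cost(points, medoids, dist):
    # each point contributes its distance to the closest medoid
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(points, k, dist):
    medoids = list(points[:k])                     # arbitrary initial medoids
    improved = True
    while improved:
        improved = False
        for h in points:
            if h in medoids:
                continue
            for i in list(medoids):
                candidate = [h if m == i else m for m in medoids]
                # a negative swapping cost means the candidate clustering is cheaper
                if total_cost(points, candidate, dist) < total_cost(points, medoids, dist):
                    medoids, improved = candidate, True
    return medoids

manhattan = lambda p, q: sum(abs(a - b) for a, b in zip(p, q))
print(pam([(1, 1), (1, 2), (8, 8), (9, 8), (9, 9)], k=2, dist=manhattan))
```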

Page 27: DM Lecture08 12 Clustering


PAM Clustering: total swapping cost $TC_{ih} = \sum_{j} C_{jih}$

(Figure: the four reassignment cases for a non-selected object j when medoid i is swapped with h, relative to another medoid t.)

Page 28: DM Lecture08 12 Clustering


Hierarchical Clustering

Use the distance matrix as the clustering criterion. This method does not require the number of clusters k as an input, but needs a termination condition

(Figure: objects a, b, c, d, e are merged step by step into {a, b}, {d, e}, {c, d, e} and finally {a, b, c, d, e}; read left to right (Step 0 to Step 4) this is agglomerative (AGNES), read right to left it is divisive (DIANA).)

Page 29: DM Lecture08 12 Clustering


AGNES (Agglomerative Nesting)

Use the single-link method and the dissimilarity matrix

Merge nodes that have the least dissimilarity

Continue in a non-descending fashion

Eventually all nodes belong to the same cluster
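A minimal single-link agglomerative sketch (assumed illustration, not the lecture's own code): at each step the two clusters with the least dissimilarity are merged.

```python
def single_link(cluster_a, cluster_b, dist):
    # single link: cluster dissimilarity = smallest pairwise distance
    return min(dist(p, q) for p in cluster_a for q in cluster_b)

def agnes(points, dist, k=1):
    clusters = [[p] for p in points]               # start from singleton clusters
    while len(clusters) > k:
        pairs = [(single_link(clusters[i], clusters[j], dist), i, j)
                 for i in range(len(clusters)) for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)                       # merge the least-dissimilar pair
        clusters[i] += clusters.pop(j)
    return clusters

euclid = lambda p, q: sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5
print(agnes([(0, 0), (0, 1), (5, 5), (5, 6)], euclid, k=2))
```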

Page 30: DM Lecture08 12 Clustering


A Dendrogram Shows How the Clusters are Merged Hierarchically

Decompose the data objects into several levels of nested partitioning (a tree of clusters), called a dendrogram

A clustering of the data objects is obtained by cutting the dendrogram at the desired level; then each connected component forms a cluster

Page 31: DM Lecture08 12 Clustering


DIANA (Divisive Analysis)

Inverse order of AGNES

Eventually each node forms a cluster on its

own

Page 32: DM Lecture08 12 Clustering


Hierarchical Clustering Methods

Major weaknesses of agglomerative clustering methods:

they do not scale well: time complexity of at least $O(n^2)$, where n is the total number of objects

they can never undo what was done previously

Integration of hierarchical with distance-based clustering:

BIRCH: uses a CF-tree and incrementally adjusts the quality of sub-clusters

CURE: selects well-scattered points from the cluster and then shrinks them towards the centre of the cluster by a specified fraction

Page 33: DM Lecture08 12 Clustering


Clustering Feature Vector

Clustering Feature: CF = (N, LS, SS)

N: number of data points

LS: $\sum_{i=1}^{N} X_i$, the linear sum of the points in the cluster

SS: $\sum_{i=1}^{N} X_i^2$, the square sum of the points in the cluster

Example: for the five points (3,4), (2,6), (4,5), (4,7), (3,8),
CF = (5, (16,30), (54,190))
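A minimal Python sketch that reproduces this CF vector (the `clustering_feature` helper is assumed for illustration):

```python
# Compute the clustering feature CF = (N, LS, SS) of a set of points.
def clustering_feature(points):
    n = len(points)
    ls = tuple(sum(dim) for dim in zip(*points))                  # linear sum per dimension
    ss = tuple(sum(x * x for x in dim) for dim in zip(*points))   # square sum per dimension
    return n, ls, ss

print(clustering_feature([(3, 4), (2, 6), (4, 5), (4, 7), (3, 8)]))
# (5, (16, 30), (54, 190))
```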

Page 34: DM Lecture08 12 Clustering


CF Tree

(Figure: a CF tree with branching factor B = 6 and leaf capacity L = 7; the root and non-leaf nodes hold entries CF_1, CF_2, ... with pointers to child nodes, and the leaf nodes hold CF entries chained together by prev/next pointers.)

Page 35: DM Lecture08 12 Clustering


CURE (Clustering Using Representatives)

The main idea of CURE

Stops the creation of a cluster hierarchy if a level consists of k clusters

Uses multiple representative points to evaluate the distance between clusters; adjusts well to arbitrarily shaped clusters and avoids the single-link effect

Drawbacks of square-error-based clustering methods

Consider only one point as representative of a cluster

Good only for clusters that are convex-shaped and of similar size and density, and only if k can be reasonably estimated

Page 36: DM Lecture08 12 Clustering


Cure: The Algorithm

Draw a random sample s

Partition the sample into p partitions, each of size s/p

Partially cluster each partition into s/pq clusters

Eliminate outliers: by random sampling; if a cluster grows too slowly, eliminate it

Cluster the partial clusters

Label the data on disk

Page 37: DM Lecture08 12 Clustering


Data Partitioning and Clustering

(Figure: a sample of s = 50 points is split into p = 2 partitions of s/p = 25 points each, and each partition is partially clustered into s/pq = 5 clusters.)

Page 38: DM Lecture08 12 Clustering


Cure: Shrinking Representative Points

Shrink the multiple representative points towards the gravity centre by a fraction α

Multiple representatives capture the shape of the cluster


Page 39: DM Lecture08 12 Clustering


Clustering Categorical Data: ROCK

ROCK: Robust Clustering using linKs

Use links to measure similarity/proximity; not distance-based

Computational complexity: $O(n^2 + n\, m_m m_a + n^2 \log n)$

Basic ideas

Similarity function and neighbours:
$$Sim(T_1, T_2) = \frac{|T_1 \cap T_2|}{|T_1 \cup T_2|}$$

Let $T_1$ = {1,2,3} and $T_2$ = {3,4,5}:
$$Sim(T_1, T_2) = \frac{|\{3\}|}{|\{1,2,3,4,5\}|} = \frac{1}{5} = 0.2$$
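A minimal Python sketch of this similarity and the neighbour test (the `sim` and `neighbours` helpers and the threshold name `theta` are assumed for illustration):

```python
def sim(t1, t2):
    t1, t2 = set(t1), set(t2)
    return len(t1 & t2) / len(t1 | t2)        # |T1 ∩ T2| / |T1 ∪ T2|

def neighbours(transactions, theta):
    # two transactions are neighbours if their similarity is at least theta
    return [(a, b) for i, a in enumerate(transactions)
            for b in transactions[i + 1:] if sim(a, b) >= theta]

print(sim({1, 2, 3}, {3, 4, 5}))                           # 0.2
print(neighbours([{1, 2, 3}, {3, 4, 5}, {1, 2, 4}], 0.4))  # [({1, 2, 3}, {1, 2, 4})]
```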

Page 40: DM Lecture08 12 Clustering


Rock Algorithm

Links: the number of common neighbours for the two points

Algorithm

Draw a random sample of the data

Cluster the data using the link-based agglomerative approach

Label the data on disk

Example: for the transactions
{1,2,3}, {1,2,4}, {1,2,5}, {1,3,4}, {1,3,5},
{1,4,5}, {2,3,4}, {2,3,5}, {2,4,5}, {3,4,5}
e.g., link({1,2,3}, {1,2,4}) = 3

Page 41: DM Lecture08 12 Clustering


Example

Consider a system where documents may contain the keywords {book, water, sun, sand, swim, read}. Suppose there are 4 documents, where the 1st contains the word {book}, the 2nd contains {water, sun, sand, swim}, the 3rd contains {water, sun, swim, read}, and the 4th contains {read, sand}.

Use the ROCK clustering algorithm to cluster these documents.

Page 42: DM Lecture08 12 Clustering


Density-Based Clustering Methods

Clustering based on density (a local cluster criterion), such as density-connected points

Major features:

Discover clusters of arbitrary shape

Handle noise

One scan

Need density parameters as a termination condition

Page 43: DM Lecture08 12 Clustering


Density-Based Clustering: Background

Two parameters:

Eps: maximum radius of the neighbourhood

MinPts: minimum number of points in an Eps-neighbourhood of that point

$N_{Eps}(p) = \{q \in D \mid dist(p, q) \le Eps\}$

Directly density-reachable: a point p is directly density-reachable from a point q with regard to Eps and MinPts if

1) p belongs to $N_{Eps}(q)$

2) core point condition: $|N_{Eps}(q)| \ge MinPts$

(Figure: p lies in the Eps-neighbourhood of a core point q, with MinPts = 5 and Eps = 1 cm.)
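A minimal Python sketch of these definitions (the helper names and the example data are assumed for illustration):

```python
def eps_neighbourhood(db, p, eps, dist):
    return [q for q in db if dist(p, q) <= eps]           # N_Eps(p)

def directly_density_reachable(p, q, db, eps, min_pts, dist):
    n_eps_q = eps_neighbourhood(db, q, eps, dist)
    # p must lie in N_Eps(q) and q must satisfy the core point condition
    return p in n_eps_q and len(n_eps_q) >= min_pts

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
db = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.5), (5, 5)]
print(directly_density_reachable((0, 1), (0.5, 0.5), db,
                                 eps=1.0, min_pts=5, dist=euclid))  # True
```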

Page 44: DM Lecture08 12 Clustering


Density-Based Clustering: Background (II)

Density-reachable:

A point p is density-reachable from a point q with regard to Eps and MinPts if there is a chain of points $p_1, \dots, p_n$ with $p_1 = q$ and $p_n = p$ such that $p_{i+1}$ is directly density-reachable from $p_i$

Density-connected:

A point p is density-connected to a point q with regard to Eps and MinPts if there is a point o such that both p and q are density-reachable from o with regard to Eps and MinPts

(Figures: a chain of points from q to p illustrating density-reachability, and a point o from which both p and q are density-reachable.)

Page 45: DM Lecture08 12 Clustering


DBSCAN: Density-Based Spatial Clustering of Applications with Noise

Relies on a density-based notion of cluster: a cluster is defined as a maximal set of density-connected points

Discovers clusters of arbitrary shape in spatial databases with noise

(Figure: core, border, and outlier points for Eps = 1 cm and MinPts = 5.)

Page 46: DM Lecture08 12 Clustering


DBSCAN: The Algorithm

Arbitrarily select a point p

Retrieve all points density-reachable from p with regard to Eps and MinPts

If p is a core point, a cluster is formed

If p is a border point, no points are density-reachable from p, and DBSCAN visits the next point of the database

Continue the process until all of the points have been processed
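A minimal DBSCAN-style sketch (assumed illustration, not the lecture's own code): a cluster is grown from each unvisited core point, and points that are never density-reachable from a core point end up labelled as noise (-1).

```python
def region_query(db, p, eps, dist):
    return [q for q in db if dist(p, q) <= eps]            # N_Eps(p)

def dbscan(db, eps, min_pts, dist):
    labels = {}                                             # point -> cluster id, -1 = noise
    cluster_id = 0
    for p in db:
        if p in labels:
            continue
        if len(region_query(db, p, eps, dist)) < min_pts:
            labels[p] = -1                                  # border point or noise for now
            continue
        cluster_id += 1                                     # p is a core point: start a cluster
        labels[p] = cluster_id
        seeds = region_query(db, p, eps, dist)
        while seeds:
            q = seeds.pop()
            if labels.get(q) == -1:
                labels[q] = cluster_id                      # reclaim a border point
            if q in labels:
                continue
            labels[q] = cluster_id
            q_neighbours = region_query(db, q, eps, dist)
            if len(q_neighbours) >= min_pts:                # q is also a core point: expand
                seeds.extend(q_neighbours)
    return labels

euclid = lambda a, b: sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
print(dbscan([(0, 0), (0, 1), (1, 0), (1, 1), (8, 8)], eps=1.5, min_pts=3, dist=euclid))
```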