Ch6 - Phan Cum
Uploaded by nguyen-gia-tri
-
7/30/2019 Ch6 - Phan Cum
Chapter 6: CLUSTERING
DATA MINING
-
WHAT IS DATA CLUSTERING?
How would you naturally group the following objects?
-
[Figure: the same objects grouped in different ways: school employees, the Simpson family, males, females]
-
WHAT IS SIMILARITY?
-
A cluster is a set of objects.
o Elements within the same cluster are similar to one another.
o Elements in different clusters are less similar to each other than elements within the same cluster.
Data clustering is a form of unsupervised learning, in which the training samples are unlabeled. Its goal is to find representative samples, or to group similar data (according to some evaluation criterion) into clusters.
-
SOME APPLICATIONS OF CLUSTERING
Economics: finding countries with similar economies, or companies with comparable economic strength. Cluster analysis can also help marketers discover groups of customers with the same shopping habits.
Biology: classifying plants, animals, and gene samples with similar functions.
Medicine: detecting groups of patients with the same clinical symptoms.
Web: grouping and classifying documents on the Web.
...
-
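WHAT MAKES A GOOD CLUSTERING?
A good clustering method produces high-quality clusters.
Clustering quality depends on:
o the similarity measure
o the method used to carry out the clustering
Clustering quality is also judged by how well the result uncovers hidden patterns.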
-
REQUIREMENTS FOR CLUSTERING
Scales well and stays efficient on large, high-dimensional databases
Handles different types of data
Discovers clusters of arbitrary shape
Needs minimal domain knowledge to set the input parameters
Copes with noisy data
Is insensitive to the order of the input data
Is easy to understand and easy to use
-
Outliers are objects that do not belong to any cluster, or that fall in clusters with very few members.
In some applications the goal is precisely to detect the outliers, that is, the objects left unclustered (outlier analysis).
[Figure: a cluster and the outliers around it]
-
CLASSIFICATION OF CLUSTERING TECHNIQUES
Partitioning techniques
o Each cluster contains at least one object.
o Each object belongs to exactly one cluster.
Two representative algorithms: K-Means (1967) and K-Medoids (1987)
Other algorithms: PAM (1987), CLARA (1990), CLARANS (1994)
-
Hierarchical techniques
Arrange the given data set into a tree-shaped structure; this cluster hierarchy is built recursively.
There are two common approaches:
o Merging groups, usually called the Bottom-Up approach
o Splitting groups, usually called the Top-Down approach
Representative hierarchical clustering algorithms: AGNES, DIANA, BIRCH, CURE (1998), CHAMELEON (1999)
-
Density-based techniques
A cluster is a dense region of points, separated from other high-density regions by regions of low density.
Used when clusters are irregular in shape or intertwined, and when noise and outliers are present.
Representative algorithms:
o DBSCAN: Ester, et al. (KDD'96)
o OPTICS: Ankerst, et al. (SIGMOD'99)
o DENCLUE: Hinneburg & D. Keim (KDD'98)
o CLIQUE: Agrawal, et al. (SIGMOD'98)
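The slide names DBSCAN without describing how it works. As an illustration of the "dense region separated by sparse regions" idea, here is a minimal sketch of the textbook DBSCAN procedure in plain Python. It is not code from the lecture; the parameter names `eps` and `min_pts` and the sample points are assumptions chosen for the example.

```python
import math

NOISE = -1  # label used here for points that end up in no cluster


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or NOISE."""
    labels = [None] * len(points)

    def neighbors(i):
        # Indices of all points within eps of point i (including i itself).
        return [j for j, q in enumerate(points)
                if math.dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        nb = neighbors(i)
        if len(nb) < min_pts:          # not dense enough: provisional noise
            labels[i] = NOISE
            continue
        cluster += 1                   # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in nb if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:     # former noise becomes a border point
                labels[j] = cluster
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nb_j = neighbors(j)
            if len(nb_j) >= min_pts:   # j is also a core point: keep expanding
                queue.extend(nb_j)
    return labels


# Hypothetical data: two dense groups and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (5, 5),
          (10, 10), (10, 11), (11, 10), (11, 11)]
labels = dbscan(points, eps=1.5, min_pts=3)
print(labels)
```

The isolated point receives the noise label, which matches the slide's remark that density-based methods are suited to data containing noise and outliers.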
-
Grid-based techniques
[Figure: high-dimensional clustering through subspaces. Two grids, for the subspaces (salary, age) and (vacation, age): salary in units of 10,000 and vacation in weeks, each on a 0 to 7 scale, against age from 20 to 60. A third panel relates vacation and age over ages 30 to 50, with the annotation "= 3".]
-
Representative grid-based clustering algorithms: CLIQUE (SIGMOD'98), STING, WaveCluster
Model-based techniques
Model-based clustering tries to fit the data to mathematical models.
Representative model-based clustering algorithms: EM, Autoclass, Denclue, Cobweb
-
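Data structures
Data matrix: n tuples/objects (rows) by p attributes/dimensions (columns)
$$
\begin{bmatrix}
x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
\vdots &        & \vdots &        & \vdots \\
x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
\vdots &        & \vdots &        & \vdots \\
x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
\end{bmatrix}
$$
Dissimilarity matrix, also called distance matrix: objects by objects
$$
\begin{bmatrix}
0      &        &        &   \\
d(2,1) & 0      &        &   \\
d(3,1) & d(3,2) & 0      &   \\
\vdots & \vdots & \vdots &   \\
d(n,1) & d(n,2) & \cdots & 0
\end{bmatrix}
$$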
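To make the relationship between the two structures concrete, the sketch below builds a dissimilarity matrix from a small data matrix. It assumes Euclidean distance and uses a made-up 4-object, 2-attribute data matrix; neither choice comes from the slide.

```python
import math


def dissimilarity_matrix(data):
    """Return the n x n matrix d where d[i][j] is the Euclidean distance
    between rows i and j of the data matrix."""
    n = len(data)
    d = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i):            # only the lower triangle is computed
            dist = math.dist(data[i], data[j])
            d[i][j] = d[j][i] = dist  # the matrix is symmetric
    return d


# Hypothetical data matrix: 4 objects (rows), 2 attributes (columns).
data = [(0, 0), (3, 4), (6, 8), (0, 1)]
d = dissimilarity_matrix(data)
for row in d:
    print([round(v, 2) for v in row])
```

Because the matrix is symmetric with a zero diagonal, storing only the lower triangle, as on the slide, loses no information.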
-
Similarity measures in clustering
Dissimilarity/similarity metric: the dissimilarity or similarity between two objects i and j is expressed by a distance function that satisfies the properties of a metric:
o d(i, j) ≥ 0 (non-negativity)
o d(i, i) = 0 (identity)
o d(i, j) = d(j, i) (symmetry)
o d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)
Distance functions are defined differently depending on the type of data (interval-scaled, boolean, categorical, ordinal, ratio-scaled).
Weights can be combined with the distance functions, depending on the application and the meaning of the data:
$$
D(x_i, x_j) = \sum_{l=1}^{p} w_l \, d(x_{il}, x_{jl}), \qquad \sum_{l=1}^{p} w_l = 1, \quad w_l \ge 0.
$$
(see Chapter 4, slide DM-04)
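The weighted distance can be sketched as follows. The per-attribute distance `d` defaults to the absolute difference, which is an assumption suited to interval-scaled attributes only; the function name and the sample values are illustrative.

```python
def weighted_distance(x_i, x_j, weights, d=lambda a, b: abs(a - b)):
    """Weighted sum of per-attribute distances between objects x_i and x_j.

    The weights must be non-negative and sum to 1, as the formula requires.
    """
    if not (len(x_i) == len(x_j) == len(weights)):
        raise ValueError("objects and weights must have the same length")
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * d(a, b) for w, a, b in zip(weights, x_i, x_j))


# Hypothetical objects with two attributes, weighted equally.
x_i = (1.0, 10.0)
x_j = (4.0, 14.0)
result = weighted_distance(x_i, x_j, weights=(0.5, 0.5))
print(result)  # 0.5 * 3 + 0.5 * 4
```

With these inputs the per-attribute distances are 3 and 4, so equal weights give 3.5. Raising the weight of one attribute makes that attribute dominate the comparison.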
-
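SOME REPRESENTATIVE CLUSTERING ALGORITHMS
K-MEANS (MacQueen 1967)
1. Choose the number of clusters k.
2. Initialize k centers for the k clusters (chosen at random).
3. Assign the N objects to the k clusters according to the k centers (an object belongs to cluster i if center i is the closest center to it).
4. Recompute the centers of the k clusters, assuming the current assignment is correct.
5. If every object is already in the cluster whose center is closest to it, stop; otherwise go back to step 3.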
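The five steps above can be sketched in plain Python as follows. This is a minimal illustration that assumes Euclidean distance and draws the initial centers from the data points; the function name, the `seed` and `max_iter` parameters, and the sample data are assumptions, not part of the slides.

```python
import math
import random


def k_means(points, k, max_iter=100, seed=0):
    """Return (centers, assignment) following steps 1 to 5."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)            # step 2: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 3: each object goes to the cluster with the nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c]))
            for p in points
        ]
        # Step 5: stop once no object changes cluster.
        if new_assignment == assignment:
            break
        assignment = new_assignment
        # Step 4: each center becomes the mean of its cluster's objects.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:  # an empty cluster keeps its previous center
                centers[c] = tuple(sum(v) / len(members)
                                   for v in zip(*members))
    return centers, assignment


# Hypothetical data: two well-separated groups of three points each.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (9, 8), (8, 9)]
centers, assignment = k_means(points, k=2)
print(centers)
print(assignment)
```

Cluster labels are arbitrary: which group is called 0 and which is called 1 depends on the random initial centers.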
-
[Figure sequence, steps 1 to 5: k-means with Euclidean distance on a plot with both axes from 0 to 5, showing the three cluster centers k1, k2, k3 at each step.]
-
Another illustration
[Figure: four scatter plots, each with both axes from 0 to 10.]
-
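In the algorithm above, the center of each cluster is the element represented by the arithmetic mean of the vectors of the cluster's objects.
ADVANTAGES AND DISADVANTAGES OF K-MEANS
Advantages:
+ Relatively fast. The complexity of the algorithm is O(tkn), where:
- n: number of objects in the data space.
- k: number of clusters to partition into.
- t: number of iterations (t is usually quite small compared with n).
+ K-Means suits clusters that are roughly spherical.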
-
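Disadvantages:
+ Does not guarantee a global optimum, and the output depends heavily on the choice of the k initial points.
The algorithm therefore has to be rerun with several different initializations to obtain a good enough result.
+ The number of clusters must be fixed in advance.
+ The true number of clusters in the data space is hard to determine.
Different values of k therefore have to be tried.
+ Clusters with complex shapes are hard to detect, especially non-convex clusters.
+ Cannot handle noise and outliers.
+ Applicable only when a centroid can be computed.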
-
SOME VARIANTS OF K-MEANS
K-MODES (Huang 1998), EM (Lauritzen 1995)
THE K-MEDOIDS ALGORITHM (Kaufman, Rousseeuw 1987)
Difference from K-MEANS:
The center of each cluster is the element of the cluster for which the total distance from the cluster's points to it is smallest.
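The difference between the two kinds of center is easiest to see on a cluster that contains an outlier. The sketch below, assuming Euclidean distance and a made-up five-point cluster, compares the mean used by K-Means with the medoid used by K-Medoids.

```python
import math


def mean_point(points):
    """K-Means center: the coordinate-wise arithmetic mean."""
    return tuple(sum(v) / len(points) for v in zip(*points))


def medoid(points):
    """K-Medoids center: the member with the smallest total distance
    to all members of the cluster."""
    return min(points, key=lambda c: sum(math.dist(c, p) for p in points))


# Hypothetical cluster: four nearby points and one outlier at (20, 20).
cluster = [(1, 1), (2, 1), (1, 2), (2, 2), (20, 20)]
print(mean_point(cluster))  # pulled toward the outlier
print(medoid(cluster))      # stays on an actual data point near the group
```

The mean is dragged away from the four nearby points by the single outlier, while the medoid remains one of those points.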
-
K-MEDOIDS
-
Randomly choose 3 elements as the initial centers.
-
Assign the elements closest to center i to cluster Ci.
-
For each cluster, recompute the center as the point whose total distance to the other points in the cluster is smallest.
-
Reassign each point to the cluster of its nearest center.
-
Repeat the steps above until the cluster centers no longer change.
Assign the elements closest to center i to cluster Ci.
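The procedure walked through on these slides (pick random centers, assign points, recompute each center as the member with the smallest total distance, repeat until the centers stop changing) can be sketched as below. This is the simple alternating scheme shown here, not the swap-based PAM algorithm; Euclidean distance, the function name, `seed`, `max_iter`, and the sample data are assumptions.

```python
import math
import random


def k_medoids(points, k, max_iter=100, seed=0):
    """Return (medoids, clusters) using alternating assign/update steps."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)          # random initial centers
    clusters = []
    for _ in range(max_iter):
        # Assign every point to the cluster of its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, medoids[c]))
            clusters[nearest].append(p)
        # New medoid: the member with the smallest total distance to the
        # other members (an empty cluster keeps its previous medoid).
        new_medoids = [
            min(members, key=lambda c: sum(math.dist(c, q) for q in members))
            if members else medoids[i]
            for i, members in enumerate(clusters)
        ]
        if new_medoids == medoids:           # centers unchanged: converged
            break
        medoids = new_medoids
    return medoids, clusters


# Hypothetical data: two well-separated groups of four points each.
points = [(1, 1), (1, 2), (2, 1), (2, 2), (8, 8), (8, 9), (9, 8), (9, 9)]
medoids, clusters = k_medoids(points, k=2)
print(medoids)
print(clusters)
```

Unlike K-Means centers, every returned medoid is one of the input points, so the method only needs pairwise distances and not the ability to average objects.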
-
Advantages and disadvantages of the algorithm:
Advantage: K-Medoids copes with noise and outliers.
Disadvantage: K-Medoids is efficient only when the data set is not too large, because its complexity is O(k(n-k)^2 t).
Where:
n: number of points in the data space.
k: number of clusters to partition into.
t: number of iterations; t is quite small compared with n.
Some variants of K-Medoids:
PAM (Partitioning Around Medoids), CLARA (Clustering LARge Applications; Kaufman & Rousseeuw, 1990), CLARANS (Clustering Large Applications based on RANdomized Search; Ng and Han, 1994)
-
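HIERARCHICAL CLUSTERING
Uses the distance matrix as the clustering criterion. This method does not require the number of clusters (k) to be specified; a stopping condition is specified instead.
[Figure: objects a, b, c, d, e. Agglomerative clustering (AGNES), bottom-up, runs from step 0 to step 4 and merges the objects into ab, de, cde, and finally abcde. Divisive clustering (DIANA), top-down, runs the same steps in reverse, from step 4 back to step 0.]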
-
THE AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHM
Idea:
Initially, each object is its own cluster (with N objects there are N clusters, each containing one object).
Then merge the pair of clusters with the smallest distance between them.
This merging is repeated until all clusters have been merged into a single cluster.
The distance between two clusters can be one of three kinds:
o Single-linkage clustering
o Complete-linkage clustering
o Average-linkage clustering
-
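Single-linkage clustering (also called the connectedness or minimum method): the distance between two clusters is the shortest distance between an object of one cluster and an object of the other.
Complete-linkage clustering (also called the diameter or maximum method): the distance between two clusters is the largest distance between an object of one cluster and an object of the other.
Average-linkage clustering: the distance between two clusters is the average distance over all pairs of objects, one from each cluster.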
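The three definitions translate directly into code. The sketch below assumes Euclidean distance between objects and uses two made-up clusters of two points each.

```python
import math
from itertools import product


def pair_distances(cluster_a, cluster_b):
    """Distances between every object of one cluster and every object
    of the other."""
    return [math.dist(p, q) for p, q in product(cluster_a, cluster_b)]


def single_linkage(cluster_a, cluster_b):
    return min(pair_distances(cluster_a, cluster_b))    # closest pair


def complete_linkage(cluster_a, cluster_b):
    return max(pair_distances(cluster_a, cluster_b))    # farthest pair


def average_linkage(cluster_a, cluster_b):
    dists = pair_distances(cluster_a, cluster_b)
    return sum(dists) / len(dists)                      # mean over all pairs


# Hypothetical clusters.
a = [(0, 0), (0, 1)]
b = [(3, 0), (4, 0)]
print(single_linkage(a, b), complete_linkage(a, b), average_linkage(a, b))
```

For any two clusters the single-linkage distance is at most the average-linkage distance, which in turn is at most the complete-linkage distance.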
-
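Bottom-Up (agglomerative): initialization
Start from the distance matrix (upper triangle; the five objects were shown as pictures on the slide):
$$
\begin{bmatrix}
0 & 8 & 8 & 7 & 7 \\
  & 0 & 2 & 4 & 4 \\
  &   & 0 & 3 & 3 \\
  &   &   & 0 & 1 \\
  &   &   &   & 0
\end{bmatrix}
$$
Two example entries are called out: D( , ) = 8 and D( , ) = 1.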
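As a worked example, the sketch below runs agglomerative merging on the five-object distance matrix from this slide, mirrored into a full symmetric matrix. The slide does not say which linkage it uses, so the linkage is a parameter here (`min` for single linkage, `max` for complete linkage); the object indices 0 to 4 and the function name are assumptions.

```python
def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up; return a list of (cluster_a, cluster_b,
    distance) tuples, one per merge, in the order they happen."""
    clusters = [[i] for i in range(len(dist))]   # every object starts alone
    merges = []
    while len(clusters) > 1:
        best = None
        # Consider every possible pair of clusters ...
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j]
                            for i in clusters[a] for j in clusters[b])
                # ... and keep the pair with the smallest distance.
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((clusters[a], clusters[b], d))
        merged = clusters[a] + clusters[b]
        clusters = [c for idx, c in enumerate(clusters)
                    if idx not in (a, b)] + [merged]
    return merges


# Distance matrix from the slide, written out symmetrically.
D = [
    [0, 8, 8, 7, 7],
    [8, 0, 2, 4, 4],
    [8, 2, 0, 3, 3],
    [7, 4, 3, 0, 1],
    [7, 4, 3, 1, 0],
]
for left, right, d in agglomerative(D, linkage=min):
    print(left, "+", right, "at distance", d)
```

With single linkage the merges happen at distances 1, 2, 3, and 7: the last two objects join first (the entry D = 1), and the first object, which is at distance 7 or 8 from everything else, joins last.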
-
Bottom-Up (agglomerative): start with each item as its own cluster, then merge step by step until a single cluster remains.
Consider all possible merges of clusters.
Choose the best one.
-
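ADVANTAGES / DISADVANTAGES
Advantages:
+ The number of clusters does not have to be specified.
+ Agglomerative clustering is intuitive.
Disadvantages:
+ Complexity in the best case: O(n^2), where n is the number of objects being clustered.
+ The clustering result is fairly subjective.
+ The algorithm behaves like a heuristic.
+ Not usable on large databases.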
-
FURTHER READING
The Top Ten Algorithms in Data Mining, Xindong Wu, Vipin Kumar
Principles of Data Mining, Max Bramer
Slides, Lecture Notes for Chapters 8 and 9: www.cse.msu.edu/~ptan/
www.cs.bu.edu/fac/gkollios/ada05/.../lect26-05.pdf, lect27-05.pdf, lect28-05.pdf
-
Thank you for your attention!