Ch6 - Phan Cum

  • 7/30/2019 Ch6 - Phan Cum

    Chapter 6: CLUSTERING

    DATA MINING

    WHAT IS DATA CLUSTERING?

    How would you naturally group the following objects?

    Possible groupings: school employees, the Simpson family, males, females.

    WHAT IS SIMILARITY?

    A cluster is a set of objects.

    o Elements of the same cluster are similar to one another.
    o Elements in different clusters are less similar to each other than elements within the same cluster.

    Data clustering is a form of unsupervised learning, in which the training samples carry no labels. Its goal is to find representative samples, or to group data that are similar under some evaluation criterion into clusters.

    SOME APPLICATIONS OF CLUSTERING

    Economics: finding countries with similar economies, or companies with comparable economic strength. Cluster analysis can also help marketers discover groups of customers with the same shopping habits.

    Biology: classifying plants, animals, and gene samples with similar functions.

    Medicine: detecting groups of patients with the same clinical symptoms.

    Web: clustering and classifying documents on the Web.

    ..

    WHAT IS GOOD CLUSTERING?

    A good clustering method produces high-quality clusters.

    The quality of a clustering depends on:

    o the similarity measure
    o the method used to carry it out

    Clustering quality is also judged by how well the result reveals hidden patterns.

    REQUIREMENTS OF CLUSTERING

    o Works with, and stays efficient on, large databases with many dimensions
    o Handles different data types
    o Discovers clusters of arbitrary shape
    o Needs minimal knowledge to determine the input parameters
    o Copes with noisy data
    o Is insensitive to the order of the input data
    o Is easy to understand and easy to use

    Outliers are objects that do not belong to any cluster, or that fall in clusters with too few elements.

    Some applications are specifically interested in detecting outliers, that is, objects left unclustered (outlier analysis).

    [Figure: a cluster and its outliers]

    CLASSIFICATION OF CLUSTERING TECHNIQUES

    Partitioning techniques

    o Each cluster contains at least one object.
    o Each object belongs to exactly one cluster.

    Two representative algorithms: K-Means (1967) and K-Medoids (1987)

    Other algorithms: PAM (1987), CLARA (1990), CLARANS (1994)

    Hierarchical techniques

    The given data set is arranged into a tree-shaped structure; this cluster hierarchy is built recursively.

    Two common approaches:

    o Merging groups, usually called the Bottom-Up approach
    o Splitting groups, usually called the Top-Down approach

    Representative hierarchical clustering algorithms: AGNES, DIANA, BIRCH, CURE (1998), CHAMELEON (1999)

    [Figure: two diagrams over the points p1, p2, p3, p4]

    Density-based techniques

    A cluster is a dense region of points, separated from other high-density regions by regions of low density. These techniques are used when clusters are irregular or intertwined, and when noise and outliers are present.

    Representative algorithms:

    o DBSCAN: Ester et al. (KDD'96)
    o OPTICS: Ankerst et al. (SIGMOD'99)
    o DENCLUE: Hinneburg & D. Keim (KDD'98)
    o CLIQUE: Agrawal et al. (SIGMOD'98)
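    The slides name DBSCAN without describing its steps. The block below is a simplified sketch of the usual DBSCAN idea (grow a cluster from points that have enough close neighbours, and label everything else as noise). The parameter names `eps` and `min_pts` and the sample points are assumptions made for illustration, not values from the slides.

```python
import math

def dbscan(points, eps, min_pts):
    """Label each point with a cluster index, or -1 for noise (simplified sketch)."""
    labels = [None] * len(points)          # None means "not visited yet"

    def neighbors(i):
        # All points within distance eps of point i (including i itself).
        return [j for j in range(len(points))
                if math.dist(points[i], points[j]) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:           # not dense enough: provisional noise
            labels[i] = -1
            continue
        cluster += 1                       # i is a core point: start a new cluster
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:            # noise reachable from a core point
                labels[j] = cluster        # becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            nj = neighbors(j)
            if len(nj) >= min_pts:         # j is itself a core point: keep growing
                queue.extend(nj)
    return labels

# Hypothetical data: two dense groups and one isolated point.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0),
          (50.0, 50.0)]
labels = dbscan(points, eps=2.0, min_pts=3)
print(labels)
```

    With these assumed settings the two dense groups form two clusters and the isolated point is reported as noise, which matches the outlier notion introduced earlier.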

    Grid-based techniques

    [Figure: high-dimensional clustering in the subspaces (salary, age) and (vacation, age). Two grid plots show salary (10,000) and vacation (week), each on a 0 to 7 scale, against age from 20 to 60; a third panel plots vacation against age with the range 30 to 50 marked and a threshold of 3.]

    Representative grid-based clustering algorithms: CLIQUE (SIGMOD'98), STING, WaveCluster

    Model-based techniques

    Model-based clustering tries to fit the data to mathematical models.

    Representative model-based clustering algorithms: EM, Autoclass, Denclue, Cobweb

    Data structures

    Data matrix: n tuples/objects (rows) by p attributes/dimensions (columns)

    \[
    \begin{bmatrix}
    x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\
    \vdots &        & \vdots &        & \vdots \\
    x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\
    \vdots &        & \vdots &        & \vdots \\
    x_{n1} & \cdots & x_{nf} & \cdots & x_{np}
    \end{bmatrix}
    \]

    Dissimilarity matrix, or distance matrix: objects by objects

    \[
    \begin{bmatrix}
    0 \\
    d(2,1) & 0 \\
    d(3,1) & d(3,2) & 0 \\
    \vdots & \vdots & \vdots \\
    d(n,1) & d(n,2) & \cdots & \cdots & 0
    \end{bmatrix}
    \]
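    As a small illustration of the two structures, the sketch below builds a lower-triangular dissimilarity matrix from a data matrix. The sample values and the choice of Euclidean distance for d(i, j) are assumptions; the slides do not fix a particular distance here.

```python
import math

def dissimilarity_matrix(data):
    """Return the lower-triangular dissimilarity matrix of an n x p data matrix.

    Row i holds d(i, 0), ..., d(i, i); the diagonal is 0 and the upper
    triangle is left implicit because d(i, j) = d(j, i).
    """
    rows = []
    for i in range(len(data)):
        rows.append([math.dist(data[i], data[j]) for j in range(i + 1)])
    return rows

# Hypothetical data matrix: 4 objects (rows) described by 2 attributes (columns).
data_matrix = [
    [0.0, 0.0],
    [3.0, 4.0],
    [6.0, 8.0],
    [0.0, 1.0],
]

d = dissimilarity_matrix(data_matrix)
for row in d:
    print(row)
```

    Storing only the lower triangle mirrors the layout of the matrix on the slide and halves the memory needed for n objects.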

    Similarity measures in clustering

    Dissimilarity/similarity metric: the dissimilarity or similarity between two objects i and j is expressed by a distance function that satisfies the properties of a metric:

    o d(i, j) ≥ 0 (non-negativity)
    o d(i, i) = 0 (identity)
    o d(i, j) = d(j, i) (symmetry)
    o d(i, j) ≤ d(i, h) + d(h, j) (triangle inequality)

    Distance functions are defined differently for different data types (interval-scaled, boolean, categorical, ordinal, ratio-scaled).

    Weights can be combined with the distance functions, depending on the application and the semantics of the data:

    \[
    D(x_i, x_j) = \sum_{l=1}^{p} w_l \, d(x_{il}, x_{jl}), \qquad \sum_{l=1}^{p} w_l = 1, \quad w_l \ge 0.
    \]

    (See Chapter 4, slides DM-04.)
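    The weighted distance can be sketched as follows. The per-attribute distance d is taken to be the absolute difference, which is an assumption (the slides leave d unspecified), and the objects and weights are made-up values.

```python
def weighted_distance(x_i, x_j, weights):
    """D(x_i, x_j) = sum over l of w_l * d(x_il, x_jl), with d assumed to be |x_il - x_jl|."""
    if any(w < 0 for w in weights) or abs(sum(weights) - 1.0) > 1e-9:
        raise ValueError("weights must be non-negative and sum to 1")
    return sum(w * abs(a - b) for w, a, b in zip(weights, x_i, x_j))

# Hypothetical objects with p = 3 attributes, and hypothetical weights.
x1, x2, x3 = (1.0, 2.0, 3.0), (2.0, 4.0, 3.0), (0.0, 0.0, 0.0)
w = (0.5, 0.25, 0.25)

d12 = weighted_distance(x1, x2, w)
d13 = weighted_distance(x1, x3, w)
d23 = weighted_distance(x2, x3, w)

# Spot-check the metric properties on these three objects.
print(d12 >= 0)                              # non-negativity
print(weighted_distance(x1, x1, w) == 0)     # d(i, i) = 0
print(d12 == weighted_distance(x2, x1, w))   # symmetry
print(d23 <= d12 + d13)                      # triangle inequality
```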

    SOME REPRESENTATIVE CLUSTERING ALGORITHMS

    K-MEANS (MacQueen 1967)

    1. Choose the number of clusters k.
    2. Initialize k centers, one per cluster (chosen at random).
    3. Assign the N objects to the k clusters: an object belongs to cluster i if the center of cluster i is the center nearest to it.
    4. Recompute the centers of the k clusters, taking the assignment above as correct.
    5. If every object is already in the cluster whose center is nearest to it, stop; otherwise go back to step 3.
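    The five steps can be sketched in plain Python as below. This is a minimal illustration, not a tuned implementation: the function name, the `max_iter` and `seed` parameters, the handling of empty clusters, and the sample points are all assumptions added for the example. Euclidean distance is used, as in the illustrations on the slides.

```python
import math
import random

def k_means(points, k, max_iter=100, seed=0):
    """Cluster `points` (tuples of floats) into k groups following steps 1 to 5."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)              # step 2: random initial centers
    assignment = None
    for _ in range(max_iter):
        # Step 3: assign every object to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: math.dist(p, centers[c])) for p in points
        ]
        if new_assignment == assignment:         # step 5: nothing moved, so stop
            break
        assignment = new_assignment
        # Step 4: recompute each center as the mean vector of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:                          # an empty cluster keeps its old center
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return centers, assignment

# Hypothetical data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
centers, labels = k_means(points, k=2)           # step 1: k = 2
print(centers)
print(labels)
```

    Which group receives label 0 depends on the random initial centers, so only the grouping itself should be relied on.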

    [Figures: k-means illustrated in five steps (Step 1 to Step 5) on a scatter plot with both axes running from 0 to 5 and three centers k1, k2, k3. Algorithm: k-means; distance metric: Euclidean distance.]

    Another illustration

    [Figure: four scatter-plot panels, each with both axes running from 0 to 10.]

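    In the algorithm above, the center of each cluster is the vector obtained by averaging the vectors of the objects in that cluster.

    ADVANTAGES AND DISADVANTAGES OF THE K-MEANS ALGORITHM

    Advantages:

    + Relatively fast. The complexity of the algorithm is O(tkn), where:
      - n: number of objects in the data space
      - k: number of clusters to partition into
      - t: number of iterations (t is usually quite small compared with n)
    + K-Means suits clusters that are roughly spherical.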

    Disadvantages:

    + It does not guarantee a global optimum, and the output depends heavily on the choice of the k initial points. The algorithm therefore has to be rerun with several different initializations to get a good result.
    + The number of clusters must be fixed in advance.
    + The true number of clusters in the data is hard to determine, so different values of k have to be tried.
    + It has difficulty detecting clusters of complex shape, especially non-convex ones.
    + It cannot handle noise and outliers.
    + It applies only when a centroid can be computed.

    SOME VARIANTS OF K-MEANS

    o K-MODES (Huang 1998), EM (Lauritzen 1995)

    THE K-MEDOID ALGORITHM (Kaufman, Rousseeuw 1987)

    Difference from K-MEANS:

    The center of each cluster is the element of the cluster for which the total distance from the cluster's points to that center is smallest.
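    A one-dimensional example makes the difference concrete. The values below are invented for illustration: a single outlier drags the mean (the K-Means center) far from the bulk of the data, while the medoid stays on an actual, central member of the cluster.

```python
def mean(values):
    """Arithmetic mean, the center used by K-Means."""
    return sum(values) / len(values)

def medoid(values):
    """Member of `values` with the smallest total distance to all members."""
    return min(values, key=lambda m: sum(abs(m - v) for v in values))

cluster = [1.0, 2.0, 3.0, 4.0, 100.0]   # hypothetical cluster with one outlier

print(mean(cluster))     # pulled toward the outlier
print(medoid(cluster))   # remains one of the central members
```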

    K-MEDOIDS

    1. Choose 3 elements at random as centers.
    2. Assign the elements nearest to center i to form cluster Ci.
    3. For each cluster, re-select as center the point whose total distance to the points in the cluster is smallest.
    4. Reassign each point to the cluster of its nearest center.
    5. Repeat the steps above until the cluster centers no longer change.
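    The steps above can be sketched as follows. This follows the simple alternate-and-reassign procedure shown on the slides rather than the swap-based PAM search; the function name, `max_iter`, `seed`, and the sample points are assumptions for illustration, and k is 2 here instead of the 3 used in the slide figures.

```python
import math
import random

def k_medoids(points, k, max_iter=100, seed=0):
    """Alternate between assigning points and re-selecting medoids until stable."""
    rng = random.Random(seed)
    medoids = rng.sample(points, k)                  # step 1: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        # Steps 2 and 4: put each point in the cluster of its nearest medoid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: math.dist(p, medoids[c]))
            clusters[nearest].append(p)
        # Step 3: in each cluster, pick the member with the smallest total distance.
        new_medoids = []
        for c in range(k):
            if clusters[c]:
                best = min(clusters[c],
                           key=lambda m: sum(math.dist(m, q) for q in clusters[c]))
                new_medoids.append(best)
            else:
                new_medoids.append(medoids[c])       # keep the old medoid if empty
        if new_medoids == medoids:                   # step 5: centers unchanged, stop
            break
        medoids = new_medoids
    return medoids, clusters

# Hypothetical data: two well-separated groups in the plane.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
medoids, clusters = k_medoids(points, k=2)
print(medoids)
print(clusters)
```

    Unlike a K-Means center, every returned medoid is one of the input points.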

    Advantages and disadvantages of the algorithm:

    Advantage: K-Medoids copes with noise and outliers.

    Disadvantage: K-Medoids is efficient only when the data set is not too large, because its complexity is O(k(n-k)^2 t), where:

    - n: number of points in the data space
    - k: number of clusters to partition into
    - t: number of iterations (t is quite small compared with n)

    Some variants of K-Medoids:

    PAM (Partitioning Around Medoids), CLARA (Clustering LARge Applications; Kaufman & Rousseeuw, 1990), CLARANS (Clustering Large Applications based upon RANdomized Search; Ng and Han, 1994)

    HIERARCHICAL CLUSTERING

    Uses the distance matrix as the clustering criterion. This method does not require the number of clusters (k) to be declared; it requires a stopping condition instead.

    [Figure: objects a, b, c, d, e. Agglomerative (AGNES), bottom-up, Step 0 to Step 4: a and b merge into ab, d and e into de, c joins de to form cde, and ab joins cde to form abcde. Divisive (DIANA), top-down, runs the same steps in reverse, from Step 4 back to Step 0.]

    AGGLOMERATIVE HIERARCHICAL CLUSTERING ALGORITHM

    Idea:

    Initially, each cluster is a single object (with N objects there are N clusters, each containing one object).

    Then repeatedly merge the pair of clusters with the smallest distance. The merging continues until all clusters have been merged into a single cluster.

    The distance between two clusters can be one of the following three kinds:

    o Single-linkage clustering
    o Complete-linkage clustering
    o Average-linkage clustering

    Single-linkage clustering (also called the connectedness or minimum method): the distance between two clusters is the shortest distance between an object of one cluster and an object of the other.

    Complete-linkage clustering (also called the diameter or maximum method): the distance between two clusters is the largest distance between an object of one cluster and an object of the other.

    Average-linkage clustering: the distance between two clusters is the average distance between the objects of one cluster and the objects of the other.
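    The three definitions translate directly into code. In this sketch the function names, the one-dimensional sample clusters, and the absolute-difference distance are assumptions chosen to keep the arithmetic easy to follow.

```python
from itertools import product

def single_linkage(a, b, dist):
    """Shortest distance between a member of a and a member of b."""
    return min(dist(p, q) for p, q in product(a, b))

def complete_linkage(a, b, dist):
    """Largest distance between a member of a and a member of b."""
    return max(dist(p, q) for p, q in product(a, b))

def average_linkage(a, b, dist):
    """Average distance over all cross-cluster pairs."""
    return sum(dist(p, q) for p, q in product(a, b)) / (len(a) * len(b))

# Hypothetical 1-D clusters; the cross-cluster distances are 4, 8, 3 and 7.
cluster_a, cluster_b = [1, 2], [5, 9]
gap = lambda p, q: abs(p - q)

print(single_linkage(cluster_a, cluster_b, gap))
print(complete_linkage(cluster_a, cluster_b, gap))
print(average_linkage(cluster_a, cluster_b, gap))
```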

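    Bottom-Up (agglomerative): start

    Start from the distance matrix:

    \[
    \begin{bmatrix}
    0 & 8 & 8 & 7 & 7 \\
      & 0 & 2 & 4 & 4 \\
      &   & 0 & 3 & 3 \\
      &   &   & 0 & 1 \\
      &   &   &   & 0
    \end{bmatrix}
    \]

    [Figure callouts: D( , ) = 8 and D( , ) = 1, where the arguments are pictured objects]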
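    The bottom-up procedure can be traced on the distance matrix from this slide. In the sketch below the five pictured objects are replaced by the indices 0 to 4 and the upper-triangular matrix is mirrored into a full symmetric one; both are assumptions made so the example can run. The `linkage` argument takes `min` for single linkage or `max` for complete linkage.

```python
def agglomerative(dist, linkage=min):
    """Merge clusters bottom-up until one remains; return the merge history."""
    clusters = [[i] for i in range(len(dist))]     # every object starts alone
    history = []
    while len(clusters) > 1:
        best = None
        # Consider every possible merge and keep the closest pair.
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        history.append((sorted(clusters[a] + clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]    # merge b into a
        del clusters[b]
    return history

# Distance matrix from the slide, mirrored to be symmetric.
D = [
    [0, 8, 8, 7, 7],
    [8, 0, 2, 4, 4],
    [8, 2, 0, 3, 3],
    [7, 4, 3, 0, 1],
    [7, 4, 3, 1, 0],
]

merges = agglomerative(D)                  # single linkage
for members, distance in merges:
    print(members, distance)
```

    Passing `linkage=max` changes only the distances at which the larger clusters merge on this matrix, not the order of the merges.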

    Bottom-Up (agglomerative): initially each item is its own cluster; clusters are then merged step by step until a single cluster remains. At each step:

    o Consider all possible merges.
    o Choose the best one.

    ADVANTAGES / DISADVANTAGES

    o No need to specify the number of clusters
    o Agglomerative clustering is intuitive
    o Complexity in the best case: O(n^2), where n is the number of objects being clustered
    o The clustering result is QUITE SUBJECTIVE
    o The algorithm behaves like a heuristic
    o Not usable for large databases


    Thank you for your attention!