LECTIA 3

5
TEHNICI DATA MINING 2007 Lector dr.DANIELA JOITA UNIVERSITATEA TITU MAIORESCU LECTIA 3 Clustering Kmeans clustering; Clustering aglomerativ Clustering diviziv

description

n

Transcript of LECTIA 3

  • TEHNICIDATAMINING 2007

    Lectordr.DANIELAJOITAUNIVERSITATEATITUMAIORESCU

    LECTIA3Clustering Kmeansclustering;

    ClusteringaglomerativClusteringdiviziv

  • TEHNICIDATAMINING 2007

    Lectordr.DANIELAJOITAUNIVERSITATEATITUMAIORESCU

    Cap4.ClusterizareClusterizareaesteprocesuldedivizareauneibazededateingrupedeinregistrarisimilarastfelincatmembriiaceleasigrupesafiecatsepoatedeapropriatiunuldealtul,iargrupurilesuntcatsepoatededepartateuneledecelelalte.Inclusterizarenuexistaniciomtdedatepreclasificatesinusefaceniciodistinctieintrevariabileleindepsiceledependente.Var(atributele)infctdecaresefaceoperatiadeclusterizaresenumescvardeintrare,iardimensiuneaproblemeiestedatadenrvardeintrare.Ppcaavemnvariabiledeintrare: nXXX ,...,, 21 .Atuncifiecareinregcecontinecateovaloareptcelenvar( nn xXxXxX === ,...,, 2211 )reprezintaunpunct ( )nxxx ,...,, 21 inspatialndimensional.CelmaidesfolositalgdeclusterizareestealgKmeans(alclusterizariicukclusterefolosindmediaaritmetica).AlgKmeans(MacQueen1967)Fiekfixat,k=nrclustere.

    1. Sealeglaintamplarekpuncte(inregistrari)cafiindcentereleinitialealecelorkclustere.(MacQueenpropunealegereaprimelorkinreg)

    2. Ptfiecareinregdeterminacelmaiapropriatcentrusiatribuieinregistrariiclusterulasociatcentrului.

    3. Ptfiecarecluster,calculeazamediainregdincluster.Mutacentralclusteruluiinpctcorespmediei.

    4. Repetapasii2si3panacandseobtineconvergentaadicapanacandnrdereatribuirialeclusterelorestemaimicdecatovaloaredata.

    Obs.Algfunctioneazanumaiptatribuitecuvalorinumerice.Oposibilafctdistantcaresadescrienotiuneadecelmaiapropriatestefctdistantaeuclidianaintredouapct: ( )nxxxX ,...,, 21= si

    ( )nyyyY ,...,, 21= : ( ) ( )2211 ...),( nn yxyxYXd ++= .Exemplu:Sappcan=2adicaavemdouaattributedeintrare,atunciputemreprezentapunctele(inregistrarile)inplan:

    0

    1

    2

    3

    4

    5

    0 1 2 3 4 5 6 7 8

  • TEHNICIDATAMINING 2007

    Lectordr.DANIELAJOITAUNIVERSITATEATITUMAIORESCU

    Dacaalegemk=3,AlegemlaintamplarecentreleclusterelorpuncteleA(1,1),B(3,1),C(7,1).Grupampuncteleinclustereastfel:cluster1:punctele(1,1),(0,2),(2,2)cluster2:punctele(3,1),(4,1),(4,3),(5,3)cluster3:punctele(6,4),(6.5,3.5),(7,1).Calculammediileclusterelorsirecentram.

    cluster1:x=(1+0+2)/3=1,y=(1+2+2)/3=5/3cluster2:x=(3+4+4+5)/4=4,y=(1+1+3+3)/4=2cluster3:x=(7+6+6.5)/3=6.5,y=(1+4+3.5)/3Serecalcdistanteledelafiecarepunctlanoilecentre.Cumdistminimenuduclareatribuirialeclusterelor,seobtineconvergenta.Deasemenea,alegandcele3centreinitialealtfel,clusterelerezultatevorfialtele.Deexemplu,Cluster1:acelasicentruCluster2:acelasicentruCluster3:x=(2+3+4+7+4+5+6+6.5)/8=4.68,y=(2+1+1+3+3+4+3.5+1)/8=2.31

    2)131.2()68.43()2,1( 22

  • TEHNICIDATAMINING 2007

    Lectordr.DANIELAJOITAUNIVERSITATEATITUMAIORESCU

    Exempludedendograma:

    Metodeaglomerative:Producoseriedepartitiialeinreg: 11,..., PPP nn unden=nrtotaldeinreg

    }}{},...,{},{{ 21 nn iiiP = nmt(nclustere)...

    },...,,{ 211 niiiP = 1clusterLafiecarepasmetodaunsetecatedouaclusteremaiexactdouaclusterecaresuntcelemaiapropriate:AlegemR,Sa.i. ),(min),(

    ,srDSRD

    clusteresr= .UnimRsiS.

    Diferentadintremetodeaparedatoritamodurilordiferitedeamasuradistintreclustere.

    1. ClusterizareculegsimplaDistintreclustere=celmaiscurtdrumintreclustere

    }/),(min{),( sclusterulinjsirclusterulinesteijidsrD =

    2. ClusterizareculegcompletaDistintreclustere=distantadintrecelemaidepartateinregdinclustere

    }/),(max{),( sclusterulinjsirclusterulinesteijidsrD =

    3. Clusterizareculegmedie

    D(r,s)=media{d(i,j)/iinclusterulr,jsuntinclusteruls}=)()(

    ),(scardrcard

    jid

    4. Clusterizareculegmediedegruprsisseunesca.i.dupauniredistmediedininteriorulfiecaruiclustersafieminima.Ppcanoulclusterformatprinunirealuircusestet.Atunci:D(r,s)=media{d(i,j)/i,jsuntinclusterultformatprinfuzionarealuircus}

    5. ClusterizareculegaturaWard:Dist=crestereainsumapatrateloreroriiESSdupafuzionareacelordouaclustereintrunulsingur.SealegpasisuccesivicaresaminimizezecrestereainESSlafiecarepas.

    MetodeaglomerativeMetodeaglomerative Metodedivizive

  • TEHNICIDATAMINING 2007

    Lectordr.DANIELAJOITAUNIVERSITATEATITUMAIORESCU

    X=mt,n=nrelemaleluiX

    =

    =n

    ii mediaxXESS

    1

    2)(

    D(r,s)=ESS(t=clusterobtprinfuzionarealuirsis)(ESS(r)+ESS(s))Metodedivizive:Algoritm:Initialtoateinregsuntintrunsingurcluster.Sedecideunpragptdist.Secalcdistdintreorice2inregsisedetperecheacuceamaimaredist.Distmaxsecomparacupragulptdist.Dacadistmax>pragptdistatuncigrupulseimparteindoua.Cele2inregsepunin

    clusterediferiteiarcelelaltepctesepuninclusterulcelmaiapropriat.Siserepetaprocedeul.

    Dacadistmax