Clustering with Spark-MapReduce (Vichy’14)

Tugdual Sarazin, Mustapha Lebbah, Hanene Azzag
tugdual.sarazin@altic.org, @TugdualSarazin

Page 1:

Clustering with Spark-MapReduce

Vichy’14

Tugdual Sarazin, Mustapha Lebbah, Hanene Azzag

tugdual.sarazin@altic.org, @TugdualSarazin

Page 2:

● Business Intelligence
● Data integration
● Open Source
● Big Data and machine learning

● Computer science laboratory of Paris-Nord University
● A3 team: Machine learning and applications
● Supervisors: M. Lebbah, H. Azzag

CIFRE

Page 3:

MapReduce

Page 4:

Machine learning with Hadoop MapReduce

Iter. 1: HDFS read, then HDFS write
Iter. 2: HDFS read, then HDFS write
. . .

Page 5:

Machine learning with Spark

Iter. 1: HDFS read, then RAM write
Iter. 2: RAM read, then RAM write
. . .
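The difference between the two diagrams can be sketched in plain Scala, without Spark. This is only an illustration under stated assumptions: `loadFromStorage`, `storageReads`, and the sample values are hypothetical stand-ins for an HDFS read and are not part of the slides.

```scala
object CachingSketch {
  // Hypothetical counter of how many times the "storage" was read.
  var storageReads = 0

  // Hypothetical stand-in for an expensive HDFS read.
  def loadFromStorage(): Vector[Double] = {
    storageReads += 1
    Vector(1.0, 2.0, 3.0, 4.0)
  }

  // Hadoop MapReduce style: each iteration reads its input from storage again.
  def readsWithoutCache(iterations: Int): Int = {
    storageReads = 0
    var total = 0.0
    for (_ <- 1 to iterations) total += loadFromStorage().sum
    storageReads
  }

  // Spark style: read once, keep the data in memory, reuse it in every iteration.
  def readsWithCache(iterations: Int): Int = {
    storageReads = 0
    val inMemory = loadFromStorage()
    var total = 0.0
    for (_ <- 1 to iterations) total += inMemory.sum
    storageReads
  }
}
```

In Spark itself, keeping an RDD in memory between iterations is usually requested with `cache()` or `persist()` on the RDD; the slides do not show whether the authors call it explicitly.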

Page 6:

Clustering (machine learning)

Page 7:

Types of clustering

● Hierarchical clustering

● Density clustering (e.g. DBSCAN)

● Centroid clustering (e.g. k-means)

Page 8:

K-means

val data = spark.textFile("hdfs://...").map(parsePoint)
val centroids = Array(
  Point(randX(), randY()),
  Point(randX(), randY()))

[Figure: data points plotted on x and y axes]

Page 9:

K-means - compute the distance to each prototype

[Figure: data points plotted on x and y axes]

closestCentroid(p, centroids)
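The slides call `closestCentroid` without showing its body. A minimal sketch of what such a helper might look like follows, assuming a 2-D `Point` case class and squared Euclidean distance; both are assumptions, since the slides define neither.

```scala
object ClosestCentroidSketch {
  // Assumed 2-D point type; the slides do not show its definition.
  final case class Point(x: Double, y: Double) {
    def squaredDistance(other: Point): Double = {
      val dx = x - other.x
      val dy = y - other.y
      dx * dx + dy * dy
    }
  }

  // Returns the index of the centroid nearest to p (squared Euclidean distance).
  def closestCentroid(p: Point, centroids: Array[Point]): Int = {
    require(centroids.nonEmpty, "at least one centroid is needed")
    var best = 0
    for (i <- 1 until centroids.length)
      if (p.squaredDistance(centroids(i)) < p.squaredDistance(centroids(best))) best = i
    best
  }
}
```

Comparing squared distances avoids a square root and gives the same nearest centroid, because the square root is monotonic.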

Page 10:

K-means - assignments

[Figure: data points plotted on x and y axes]

closestCentroid(p, centroids)

Page 11:

K-means - assignments

[Figure: data points plotted on x and y axes]

closestCentroid(p, centroids)

Page 12:

K-means - Map (assignments)

[Figure: data points plotted on x and y axes]

val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))

Page 13:

K-means - Reduce (update prototypes)

[Figure: data points plotted on x and y axes]

val pointStats = closest.reduceByKey {
  case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
}
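To see what the map and reduce steps produce, here is a sketch that runs one assignment pass on a local `Seq` instead of an RDD. The `Point` type, its `+` operator, and the local `reduceByKey` helper (built on `groupBy`) are assumptions made so the example runs without Spark; they only mimic the Spark calls on the slides.

```scala
object MapReduceStepSketch {
  // Assumed 2-D point type with the vector addition the reduce step relies on.
  final case class Point(x: Double, y: Double) {
    def +(o: Point): Point = Point(x + o.x, y + o.y)
    def squaredDistance(o: Point): Double = {
      val dx = x - o.x
      val dy = y - o.y
      dx * dx + dy * dy
    }
  }

  // Index of the nearest centroid (assumed implementation).
  def closestCentroid(p: Point, centroids: Array[Point]): Int = {
    var best = 0
    for (i <- 1 until centroids.length)
      if (p.squaredDistance(centroids(i)) < p.squaredDistance(centroids(best))) best = i
    best
  }

  // Local stand-in for Spark's reduceByKey: group pairs by key, then combine values with f.
  def reduceByKey[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
    pairs.groupBy(_._1).map { case (k, kvs) => k -> kvs.map(_._2).reduce(f) }

  // One pass: map each point to (centroid id, (point, 1)), then sum points and counts per id.
  def step(data: Seq[Point], centroids: Array[Point]): Map[Int, (Point, Int)] = {
    val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))
    reduceByKey(closest) { case ((p1, n1), (p2, n2)) => (p1 + p2, n1 + n2) }
  }
}
```

Each entry of the result holds the coordinate sum and the number of points assigned to one centroid, which is all that is needed to compute the new centroid as a mean.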

Page 14:

K-means - Iteration 1

[Figure: data points plotted on x and y axes]

val pointStats = closest.reduceByKey {
  case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
}
// collect() brings the per-cluster sums back to the driver, where centroids lives
pointStats.collect().foreach { case (id, value) =>
  centroids(id) = value._1 / value._2
}

Page 15:

K-means - Iteration 2

[Figure: data points plotted on x and y axes]

for (i <- 1 until 10) {
  val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))
  val pointStats = closest.reduceByKey {
    case ((p1, sum1), (p2, sum2)) => (p1 + p2, sum1 + sum2)
  }
  // collect() brings the per-cluster sums back to the driver, where centroids lives
  pointStats.collect().foreach { case (id, value) =>
    centroids(id) = value._1 / value._2
  }
}
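The whole loop can be tried on a small in-memory dataset. The sketch below is a local, Spark-free approximation of the code on the slides: `Point`, its operators, `closestCentroid`, and the `kMeans` wrapper are assumed names, and `groupBy` plus `reduce` stand in for `reduceByKey`.

```scala
object KMeansLoopSketch {
  // Assumed 2-D point type with the operators the slide code uses (+ and /).
  final case class Point(x: Double, y: Double) {
    def +(o: Point): Point = Point(x + o.x, y + o.y)
    def /(n: Int): Point = Point(x / n, y / n)
    def squaredDistance(o: Point): Double = {
      val dx = x - o.x
      val dy = y - o.y
      dx * dx + dy * dy
    }
  }

  // Index of the nearest centroid (assumed implementation).
  def closestCentroid(p: Point, centroids: Array[Point]): Int = {
    var best = 0
    for (i <- 1 until centroids.length)
      if (p.squaredDistance(centroids(i)) < p.squaredDistance(centroids(best))) best = i
    best
  }

  // Runs a fixed number of k-means iterations and returns the final centroids.
  def kMeans(data: Seq[Point], initial: Array[Point], iterations: Int): Array[Point] = {
    val centroids = initial.clone() // do not mutate the caller's array
    for (_ <- 1 to iterations) {
      // Map: tag every point with the id of its nearest centroid.
      val closest = data.map(p => (closestCentroid(p, centroids), (p, 1)))
      // Reduce: sum points and counts per centroid id.
      val pointStats = closest.groupBy(_._1).map { case (id, kvs) =>
        id -> kvs.map(_._2).reduce((a, b) => (a._1 + b._1, a._2 + b._2))
      }
      // Update: new centroid = mean of its points; empty clusters keep their position.
      pointStats.foreach { case (id, (sum, count)) => centroids(id) = sum / count }
    }
    centroids
  }
}
```

A fixed iteration count mirrors the slide's `for` loop; a production version would more likely stop once the centroids move less than some tolerance.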

Page 16:

Self-Organizing Map

Page 17:

SOM benefits

● Better clustering performance than k-means
● Linear complexity
● Topological visualization

Page 18:

SOM MapReduce - iteration

Page 19:

Speed up

● 100 million observations
● SOM map: 10 x 10

Page 20:

Bitm: Biclustering Topological Map

[Figure panels: BiClustering, Raw Data]

Page 21:

Bitm: speedup

● 2 million observations
● SOM map: 5 x 5

Page 22:

Future work

● Improve performance
● Feature selection
● Increase the number of cores
    ○ Grid 5000
    ○ Google Compute Engine

Page 23:

Thank you! Questions?