Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB...

51
Distributed Data Management Summer Semester 2013 TU Kaiserslautern Dr.Ing. Sebas4an Michel [email protected] Distributed Data Management, SoSe 2013, S. Michel 1

Transcript of Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB...

Page 1: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Distributed  Data  Management  Summer  Semester  2013  

TU  Kaiserslautern  

Dr.-­‐Ing.  Sebas4an  Michel    

[email protected]­‐saarland.de  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   1  

Page 2: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

MOTIVATION  AND  OVERVIEW  Lecture  1  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   2  

Page 3: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Distributed  Data  Management  

•  What  does  “distributed”  mean?  

•  And  why  would  we  want/need  to  do  things  in  a  distributed  way?  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   3  

Page 4: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Reason:  Federated  Data  •  Data  is  per  se  hosted  at  different  sites  

•  Autonomy  of  sites  •  Maintained  by  diff.  organiza4ons  •  Mashups  over  such  independent      sources  

•  Linked  Open  Data  (LOD)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   4  

Page 5: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Reason:  Sensor  Data  

•  Data  originates  at  different  sensors  •  Spread  across  the  world  •  Health  data  from  mobile  devices  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   5  

Con$nuous  queries!  

Page 6: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Distributed  Data  Management,  SoSe  2013,  S.  Michel   6  

IP Bytes in kB

192.168.1.7 31kB

192.168.1.3 23kB

192.168.1.4 12kB

IP Bytes in kB

192.168.1.8 81kB

192.168.1.3 33kB

192.168.1.1 12kB

IP Bytes in kB

192.168.1.4 53kB

192.168.1.3 21kB

192.168.1.1 9kB

IP Bytes in kB

192.168.1.1 29kB

192.168.1.4 28kB

192.168.1.5 12kB

E.g.  find  clients  that  cause    high  network  traffic.  

Reason:  Network  Monitoring  

Page 7: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Reason:  Individuals  as  Providers/Consumers  

•  Don’t  want  single  operator  with  global  knowledge  -­‐>  be]er  decentralized?  

•  Distributed  search  engines  •  Data  on  mobile  phones  •  Peer-­‐to-­‐Peer  (P2P)  systems  •  Distributed  social  networks  •  Leveraging  idle  resources  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   7  

Page 8: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Example:  SETI@Home  

•  Distributed  Compu4ng  •  Donate  idle  4me  of  your  personal  computer  

•  Analyze  extraterrestrial  radio  signals  when  screensaver  is  running  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   8  

Page 9: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Distributed  Data  Management,  SoSe  2013,  S.  Michel   9  

Example:  P2P  Systems:  Napster  

File  Download  File  Dow

nload  

•   Central  server  (index)  •   Client  soeware  sends  informa4on  about  users‘  contents  to  server.  •   User  send  queries  to  server  •   Server  responds  with  IP  of  users    that  store  matching  files.  à  Peer-­‐to-­‐Peer  file  sharing!  

•   Developed  in  1998.  •   First  P2P  file-­‐sharing  system    

Pirate-­‐to-­‐Pirate?    

Page 10: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Example:  Self  Organiza4on  &  Message  Flooding  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   10  

TTL  3  

TTL  3  

TTL  2  

TTL  2  

TTL  2  TTL  1  

TTL  0  TTL  1  

TTL  1  

TTL  0  

Page 11: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Example:  Structured  Overlay  Networks  

•  Logarithmic  cost  with        rou4ng  tables          (not  shown  here)  •  Self  organizing  

•  Will  see  later  twice:  NoSQL  KeyValue  stores  and  P2P  Systems  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   11  

p1  

p8  

p14  

p21  

p32  p38  

p42  

p48  

p51  

p56  

k10  

k24  

k30  k38  

k54  

Page 12: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Reason:  Size  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   12  

Page 13: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Showcase  Scenario  

•  Assume  you  got  10  TB  data  on  disk  •  Now,  do  some  analysis  of  it  

•  With  a  100MB/s  disk,  reading  alone  takes  – 100000  seconds  – 1666  minutes  – 27  hours  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   13  

Page 14: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Huge  Amounts  of  Data  

•  Google:  – Billions  of  Websites  (around  50  billion,  Spring  2013)  – TBs  of  data  

•  Twi]er:  – 100s  million  tweets  per  day  

•  Cern’s  LHC  – 25  Petabytes  of  data  per  year  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   14  

Page 15: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Huge  Amounts  of  Data(2)  

•  Megaupload  – 28  PB  of  data  

•  AT&T  (US  Telecomm.  Provider)  –   30  PB  of  data  through  its  networks  each  day  

•  Facebook  – 100  PB  Hadoop  cluster  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   15  

h]p://en.wikipedia.org/wiki/Petabyte  

Page 16: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Need  to  do  something  about  it  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   16  h]p://flickr.com/photos/jurvetson/157722937/  

h]p://www.google.com/about/datacenter  

Page 17: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Scale-­‐Out  vs.  Scale-­‐Up  •  Scale-­‐Out  (Many  Servers-­‐>  Distributed)  

•  As  opposed  to  Scale-­‐Up  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   17  

Page 18: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Scale-­‐Out  •  Common  technique  is  scale-­‐out  

– Many  machines  – Amazon’s  EC2  cloud,  around  400,  000  machines  

•  Commodity  machines  (many  but  not  individually  super  fast)  

•  Failures  happen  virtually  at  any  4me.  

•  Electricity  is  an  issue  (par4cularly  for  cooling)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   18  

h]p://huanliu.wordpress.com/2012/03/13/amazon-­‐data-­‐center-­‐size/  

Page 19: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Hardware  Failures  •  Lots  of  machines  (commodity  hardware)  à  failure  is  not  excep4on  but  very  common  

•  P[machine  fails  today]  =  1/365  •  n  machines:  P[failure  of  at  least  1  machine]  =  

                                                                         1-­‐(1-­‐P[machine  fails  today])^n    –  for  n=1:                        0.0027  –  for  n=10:              0.02706  –  for  n=100:        0.239  –  for  n=1000:    0.9356  –  for  n=10  000:      ~  1.0  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   19  

Page 20: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Failure  Handling  &  Recovery  

•  Hardware  failures  happen  virtually  at  any  4me  •  Algorithms/Infrastructures  have  to  compensate  that  

•  Replica4on  of  data,  logging  of  state,  also  redundancy  in  task  execu4on  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   20  

Page 21: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Cost  Numbers  (=>Complex  Cost  Model)  •  L1  cache  reference  0.5  ns  •  L2  cache  reference  7  ns  •  Main  memory  reference  100  ns  •  Compress  1K  bytes  with  Zippy  10,000  ns  •  Send  2K  bytes  over  1  Gbps  network  20,000  ns  •  Read  1  MB  sequen4ally  from  memory  250,000  ns  •  Round  trip  within  same  datacenter  500,000  ns  •  Disk  seek  10,000,000  ns  •  Read  1  MB  sequen4ally  from  network  10,000,000  ns  •  Read  1  MB  sequen4ally  from  disk  30,000,000  ns  •  Send  packet  CA-­‐>Netherlands-­‐>CA  150,000,000  ns  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   21  Numbers  source:  Jeff  Dean    

1ns  =  10^-­‐6  ms  

Page 22: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Map  Reduce  •  “Novel”  compu4ng  paradigm  introduced  by  Google  in  2004.  

•  Have  many  machines  in  a  data  center.  •  Don’t  want  to  care  about  impl.  details  like  data  placement,  failure  handling,  cost  models.  

•  Abstract  computa4on  to  two  basic  func4ons:  

•  Think  “func4onal  programming”  with  map  and  fold  (reduce),  but    –  Distributed  and  –  Large  scale  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   22  

Jeffrey  Dean,  Sanjay  Ghemawat:  MapReduce:  Simplified  Data  Processing  on  Large  Clusters.  OSDI  2004:  137-­‐150  

Page 23: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Map  Reduce:  Example  Map  +  Count  

•  Line  1  – “One  ring  to  rule  them  all,  one  ring  to  find  them,  

•  Line  2    – “One  ring  to  bring  them  all  and  in  the  darkness  bind  them.”  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   23  

Page 24: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Map  Line  to  Terms  and  Counts  

{"one"=>["1",  "1"],  "ring"=>["1",  "1"],    "to"=>["1",  "1"],    "rule"=>["1"],    "them"=>["1",  "1"],  "all"=>["1"],    "find"=>["1"]}  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   24  

{"one"=>["1"],    "ring"=>["1"],    "to"=>["1"],    "bring"=>["1"],    "them"=>["1",  "1"],    "all"=>["1"],    "and"=>["1"],    "in"=>["1"],    "the"=>["1"],    "darkness"=>["1"],    "bind"=>["1"]}  

Line  1  

Line  2  

Page 25: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Group  by  Term  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   25  

{"one"=>["1",  "1"],  "ring"=>["1",  "1"],  ….  

{"one"=>["1"],    "ring"=>["1"],  …  

{"one"=>[["1”,”1”],[“1”]],    "ring"=>[["1”,”1”],[“1”]],  …  

Page 26: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Sum  Up  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   26  

{"one"=>[["1”,”1”],[“1”]],    "ring"=>[["1”,”1”],[“1”]],  …  

{"one"=>[“3”],    "ring"=>[“3”],  …  

Page 27: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Applica4on:  Compu4ng  PageRank  

•  Link  analysis  model  proposed  by  Brin&Page  •  Compute  authority  scores  •  In  terms  of:  

–  incoming  links  (weights)  from  other  pages  

•  “Random  surfer  model”  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   27  

S.  Brin  &  L.  Page.  The  anatomy  of  a  large-­‐scale  hypertextual  web  search  engine.  In  WWW  Conf.  1998.  

Page 28: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

New  Requirements  

•  Map  Reduce  is  one  prominent  example  that  novel  businesses  have  new  requirements.  

•  Going  away  from  tradi4onal  RDBMS.  

•  Addressing  huge  data  volumes,  processed  in  mul4ple,  distributed  (wide  spread)  data  centers.  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   28  

Page 29: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

New  Requirements  (Cont’d)  

•  Massive  amounts  of  unstructured  (text)  data  •  Processed  oeen  in  batches  (with  MapReduce).  

•  Huge  graphs  like  Facebook’s  friendship  graph  •  Oeen  enough  to  store  (key,  value)  pairs  

•  No  need  for  RDBMS  overhead  •  Oeen  wanted:  open  source  or  at  least  not  bound  to  par4cular  commercial  product  (vendor).  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   29  

Page 30: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Wish  List    •  Data  should  always  be  consistent  

•  Provided  service  should  be  always  quickly  responding  to  requests  

•  Data  can  be  (is)  distributed  across  many  machines  (par44ons)  

•  Even  if  some  machines  fail,  the  system  should  be  up  and  running  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   30  

Page 31: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

CAP  Theorem  (Brewer's  Theorem)  

•  System  cannot  provide  all  3  proper4es  at  the  same  4me:  – Consistency  – Availability  – Par44on  Tolerance  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   31  

C   A  

P  C+P   A+P  

C+A  

h]p://lpd.epfl.ch/sgilbert/pubs/BrewersConjecture-­‐SigAct.pdf  

Page 32: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

With Huge Data Sets ….

•  Par44on  tolerance  is  strictly  required  

•  That  leaves  trading  off  consistency  and  availability  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   32  

Page 33: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Best  effort:  BASE  

•  Basically  Available  •  Soe  State  •  Eventual  Consistency  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   33  

see  h]p://www.allthingsdistributed.com/2007/12/eventually_consistent.html  W.  Vogels.  Eventually  Consistent.  ACM  Queue  vol.  6,  no.  6,  December  2008.  

Page 34: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

The  NoSQL  “Movement”  

•  No  one-­‐size-­‐fits-­‐all    •  Not  only  SQL  (not  necessarily  “no”  SQL  at  all)  •  for  group  of  non-­‐tradi4onal  DBMS  (not  rela4onal,  oeen  no  SQL),  for  different    purposes  – key  value  stores  – graph  databases    – document  stores  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   34  

Page 35: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Example:  Key  Value  Stores  

•  Like  Apache  Cassandra,  Amazon’s  Dynamo,  Riak  •  Handling  of  (K,V)  pairs  

•  Consistent  hashing  of  values  to  nodes  based  on  their  keys  

•  Simple  CRUD  opera4ons  (create,  read,  update,  delete)      (no  SQL,  or  at  least  not  full)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   35  

Page 36: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Cri4cisms  

•  Some  DB  folks  say  “Map  Reduce  is  a  major  step  backward”.  

•  And  NoSQL  is  too  basic  and  will  end  up  re-­‐inven4ng  DB  standards  (once  they  need  it).  

•  Will  ask  in  a  few  weeks:  What  do  you  think?  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   36  

Page 37: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Cloud  Compu4ng  

•  On  demand  hardware  –  rent  your  compu4ng  machinery    – virtualiza4on  

•  Google  App  engine,  Amazon  AWS,  Microsoe  Azure  –  Infrastructure  as  a  Service  (IaaS)  – Pla�orm  as  a  Service  (PaaS)  – Soeware  as  a  Service  (SaaS)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   37  

Page 38: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Cloud  Compu4ng  (Cont’d)  •  Promises  “no”  startup  cost  for  own  business  in  terms  of  hardware  you  need  to  buy  

•  Scalability:  Just  rent  more  when  you  need  them    •  And  return  them  when  there  is  no  demand  •  Prominent  showcase:  Animoto,  in  Amazon’s  EC2.  From  50  to  3,500  machines  in  few  days.  

•  But  also  problema4c:  –  fully  dependent  on  a  vendors  hardware/service  –  sensi4ve  data  (all  your  data)  is  with  vendor,  maybe  stored  in  a  diff  country  (likely)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   38  

Page 39: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Dynamic  Big  Data  

•  Scalable,  con4nuous  processing  of  massive  data  streams  

•  Twi]er’s  Storm,  Yahoo!  (now  Apache)  S4  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   39  

h]p://storm-­‐project.net/  

Page 40: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Last  but  not  least:  Fallacies  of  Distributed  Compu4ng  

1.  The  network  is  reliable    2.  Latency  is  zero    3.  Bandwidth  is  infinite  4.  The  network  is  secure    5.  Topology  doesn't  change    6.  There  is  one  administrator    7.  Transport  cost  is  zero    8.  The  network  is  homogeneous  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   40  

source:  Peter  Deutsch  and  others  at  Sun  

Page 41: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

LECTURE:  CONTENT  &  REGULATIONS  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   41  

Page 42: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

What  you  will  learn  in  this  Lecture  •  Most  of  the  lecture  is  on  processing  big  data  

– Map  Reduce,  NoSQL,  Cloud  compu4ng  •  Will  operate  on  state  of  the  art  research  results  and  tools  

•  Middle  way  between  pure  systems/tools  discussion  and  learning  how  to  build  algorithms  on  top  of  them  (see  Joins  over  MR,  n-­‐grams,  etc.)  

•  But  also  basic  (important)  techniques,  like  consistent  hashing,  PageRank,  Bloom  filters  

•  Very  relevant  stuff.  Think  “CV”  ;)  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   42  

Page 43: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

•  We  will  cri4cally  discuss  techniques  (philosophies).  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   43  

Page 44: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Prerequisites  •  Successfully  a]ended  informa4on  systems  or  database  lectures.  

•  Prac4cal  exercises  require  solid  Java  skills  

•  Work  with  systems/tools  requires  will  to  dive  into  APIs  and  installa4on  procedures  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   44  

Page 45: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

•  VL  1  (18.  April):  Mo4va4on,  Regula4ons,  Big  Data  •  VL  2  (25.  April):  Map  Reduce  1  •  VL  3  (02.  May):  Map  Reduce  2  •  No  Lecture  (09.  Mai)    (Himmelfahrt,  Ascension)  •  VL  4  (16.  May):  NoSQL  1  •  VL  5  (23.  May):  NoSQL  2  •  No  Lecture  (30.  Mai)      (Fronleichnam,  Corpus  Chris4)  •  VL  7  (06.  June):  Cloud  Compu4ng  •  VL  8  (13.  June):  Stream  Processing  •  VL  9  (20.  June)  :  Distributed  RDBMS  1  •  VL  10  (27.  June):  Distributed  RDBMS  2  •  VL  11  (04.  July):  Peer  to  Peer  Systems  •  VL  12  (11.  July):  Open  Topic  1  •  VL  13  (18.  July):  Last  Lecture  /  Oral  exams    

Distributed  Data  Management,  SoSe  2013,  S.  Michel   45  

Schedule  of  Lectures  (Topics  TentaWve)  

Page 46: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Lecturer  and  TA  •  Lecturer  :  Sebas4an  Michel  (Uni  Saarland)  

– smichel  (at)  mmci.uni-­‐saarland.de  – Building  E  1.7,  Room  309  (Uni  Saarland)  – Phone:  0681  302  70803  – or  be]er,  catch  me  aeer  lecture!    

•  TA:  Johannes  Schildgen  – schildgen  (at)  cs.uni-­‐kl.de  – Room:  36/340  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   46  

Page 47: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Organiza4on  &  Regula4ons  

•  Lecture:  – Thursday  – 11:45  -­‐  13:15          –   Room  48-­‐379    

•  Exercise:  – Tuesday  (bi-­‐weekly)  – 15:30  -­‐  17:00    – Room  52-­‐203  – First  session:  May  7th.  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   47  

Page 48: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Lecture  Organiza4on  

•  New  Lecture  (almost  all  slides  are  new).  

•  On  topics  that  are  oeen  brand  new.  

•  Later  topics  are  s4ll  tenta4ve.  

•  Please  provide  feedback.  E.g.,  too  slow  /  too  fast?    Important  topics  you  want  to  be  addressed?  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   48  

Page 49: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Exercises  

•  Assignment  sheet,  every  two  weeks  •  Sheet  +  TA  session  by  Johannes  Schildgen  •  Mixture  of:  

– Prac4cal:  Implementa4on  (e.g.,  Map  Reduce)  – Prac4cal:  Algorithms  on  “paper”  – Theory:  Where  appropriate  (show  that  …)  – Brief  Essay:  Explain  the  difference  of  x  and  y  (short  summary)  

•  Ac4ve  par4cipa4on  wanted!  J  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   49  

Page 50: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Exam  

•  Oral  Exam  at  the  end  of  semester/early  in  semester  break.  

•  Around  20min  •  Topics  captured  announced  few  (1-­‐2)  weeks  before  exams  

•  We  assume  you  ac4vely  par4cipated  in  the  exercises.  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   50  

Page 51: Distributed*DataManagement · Distributed*DataManagement,*SoSe*2013,*S.*Michel* 6 IP Bytes in kB 192.168.1.7 31kB 192.168.1.3 23kB 192.168.1.4 12kB IP Bytes in kB 192.168.1.8 81kB

Registra4on    

•  Please  register  by  email  to  – Sebas4an  Michel  and  Johannes  Schildgen  – Use  subject  prefix:  [ddm13]  – With  content:  

•  Your  name  •  Matricula4on  number  

•  In  par4cular  to  receive  announcements/news  

Distributed  Data  Management,  SoSe  2013,  S.  Michel   51