Improving System Performance of Spark and Hadoop HDFS with Alluxio …

  • date post

    09-May-2018
  • Improving System Performance of Spark and Hadoop HDFS with Alluxio

    Alluxio PMC, Maintainer

    2017/03/[email protected], Hadoop Summit 2017

  • Alluxio 1.4

    Alluxio + Spark DataFrame/RDD

    Alluxio + HDFS, SLA

  • BIG DATA ECOSYSTEM Yesterday

  • BIG DATA ECOSYSTEM Today

  • BIG DATA ECOSYSTEM Issue

  • BIG DATA ECOSYSTEM With Alluxio

  • Alluxio

    Alluxio (memory-centric)

    Alluxio

  • Alluxio

    December 2012: Alluxio (Tachyon) 0.1.0

    January 2017: Alluxio 1.4

  • Alluxio

    Since April 2013: more than 400 contributors from over 100 organizations, including Alluxio, IBM, Intel, Red Hat, UC Berkeley, and Yahoo

    Popular Open Source Projects Growth

  • PASALab

  • INDUSTRY ADOPTION

  • Alluxio

    Master-Worker

    Master

    Worker: MEM, SSD, HDD

    Client: Master, Worker

    Under File System

    [Figure: node 1, node 2, node 3; Master; Client; Worker1, Worker2, Worker3, each with MEM, SSD, HDD; Under File System]

  • Alluxio

    Inode Tree

    Inode

    Inode id/

    [Figure: tree with root /; directories Dir0/, Dir1/, Dir2/; files File0, File1]

    name : File1

    List : Block0, Block1, ...

    checkPointPath : hdfs://xxx:yyy/zzz

    ...
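The inode tree on this slide can be sketched in a few lines of Java. This is only an illustration of the idea, not Alluxio's Master code: the class names (`InodeTreeSketch`, `InodeDirectory`, `InodeFile`) and the `lookup` helper are hypothetical, and placing File1 under Dir0 is an assumption, since the figure's edges did not survive extraction. The file inode carries the three fields shown on the slide: a name, a block list, and a checkPointPath.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Illustrative inode tree; a sketch, not Alluxio's actual Master code. */
final class InodeTreeSketch {
    /** Metadata shared by files and directories: an id and a name. */
    static abstract class Inode {
        final long id;
        final String name;

        Inode(long id, String name) {
            this.id = id;
            this.name = name;
        }
    }

    /** A directory maps child names to child inodes. */
    static final class InodeDirectory extends Inode {
        final Map<String, Inode> children = new LinkedHashMap<>();

        InodeDirectory(long id, String name) {
            super(id, name);
        }
    }

    /** A file holds its block list and, once set, its checkpoint path. */
    static final class InodeFile extends Inode {
        final List<String> blocks = new ArrayList<>();
        String checkPointPath; // null until set

        InodeFile(long id, String name) {
            super(id, name);
        }
    }

    private long nextId = 0;
    final InodeDirectory root = new InodeDirectory(nextId++, "");

    InodeDirectory mkdir(InodeDirectory parent, String name) {
        InodeDirectory dir = new InodeDirectory(nextId++, name);
        parent.children.put(name, dir);
        return dir;
    }

    InodeFile createFile(InodeDirectory parent, String name) {
        InodeFile file = new InodeFile(nextId++, name);
        parent.children.put(name, file);
        return file;
    }

    /** Resolves an absolute path such as "/Dir0/File1"; returns null if absent. */
    Inode lookup(String path) {
        Inode current = root;
        for (String part : path.split("/")) {
            if (part.isEmpty()) {
                continue; // skip the empty segment before the leading slash
            }
            if (!(current instanceof InodeDirectory)) {
                return null; // cannot descend into a file
            }
            current = ((InodeDirectory) current).children.get(part);
            if (current == null) {
                return null;
            }
        }
        return current;
    }

    public static void main(String[] args) {
        InodeTreeSketch tree = new InodeTreeSketch();
        InodeDirectory dir0 = tree.mkdir(tree.root, "Dir0");
        tree.mkdir(tree.root, "Dir1");
        InodeFile file1 = tree.createFile(dir0, "File1"); // assumed placement
        file1.blocks.add("Block0");
        file1.blocks.add("Block1");
        file1.checkPointPath = "hdfs://xxx:yyy/zzz"; // placeholder path from the slide
        InodeFile found = (InodeFile) tree.lookup("/Dir0/File1");
        System.out.println(found.name + " " + found.blocks + " " + found.checkPointPath);
    }
}
```

Keeping the block list and checkpoint path on the file inode means a path lookup alone is enough to tell a client which blocks to fetch and where the persisted copy lives.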

  • Alluxio

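    ReadType

    CACHE_PROMOTE: if the data is in a Worker, move it to that Worker's top tier; if it is not in the local Worker, add a copy to the local Worker

    CACHE: if the data is not in the local Worker, add a copy to the local Worker

    NO_CACHE: read only; no copy is created

    WriteType

    CACHE_THROUGH: write synchronously to a Worker and to the under file system

    MUST_CACHE: write to a Worker only

    THROUGH: write synchronously to the under file system only, not to a Worker

    ASYNC_THROUGH: write synchronously to a Worker and asynchronously to the under file system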
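The trade-off between the types above is easier to compare as a table of flags. The sketch below models the documented semantics with two small enums; it does not use the real Alluxio client classes, and the field and method names (`writesToWorker`, `writesToUfs`, `ufsWriteIsAsync`, `persistedOnReturn`, `cachesInLocalWorker`, `promotesToTopTier`) are hypothetical names chosen for illustration.

```java
/** Models read/write type semantics; hypothetical, not the Alluxio client enums. */
final class IoTypeSketch {
    enum WriteType {
        MUST_CACHE(true, false, false),
        CACHE_THROUGH(true, true, false),
        THROUGH(false, true, false),
        ASYNC_THROUGH(true, true, true);

        final boolean writesToWorker;
        final boolean writesToUfs;
        final boolean ufsWriteIsAsync;

        WriteType(boolean writesToWorker, boolean writesToUfs, boolean ufsWriteIsAsync) {
            this.writesToWorker = writesToWorker;
            this.writesToUfs = writesToUfs;
            this.ufsWriteIsAsync = ufsWriteIsAsync;
        }

        /** True if the data is already in the under file system when the write returns. */
        boolean persistedOnReturn() {
            return writesToUfs && !ufsWriteIsAsync;
        }
    }

    enum ReadType {
        NO_CACHE(false, false),
        CACHE(true, false),
        CACHE_PROMOTE(true, true);

        final boolean cachesInLocalWorker;
        final boolean promotesToTopTier;

        ReadType(boolean cachesInLocalWorker, boolean promotesToTopTier) {
            this.cachesInLocalWorker = cachesInLocalWorker;
            this.promotesToTopTier = promotesToTopTier;
        }
    }

    public static void main(String[] args) {
        for (WriteType w : WriteType.values()) {
            System.out.printf("%-14s worker=%-5b ufs=%-5b persistedOnReturn=%b%n",
                    w, w.writesToWorker, w.writesToUfs, w.persistedOnReturn());
        }
        for (ReadType r : ReadType.values()) {
            System.out.printf("%-14s cache=%-5b promote=%b%n",
                    r, r.cachesInLocalWorker, r.promotesToTopTier);
        }
    }
}
```

Read this way, MUST_CACHE and ASYNC_THROUGH give memory-speed writes but no durable copy at the moment the write returns, while CACHE_THROUGH and THROUGH pay the under-file-system latency up front.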

  • Alluxio

    Master: ZooKeeper, Master

    Journal : EditLog + Image

    Worker Master

    Checkpoint & Lineage
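The "Journal : EditLog + Image" line on this slide describes a common recovery scheme: a snapshot (image) plus a log of the edits made since that snapshot. The sketch below shows the general technique only; Alluxio's real journal format and classes are different, and every name here (`JournalSketch`, `Edit`, `checkpoint`, `recover`) is hypothetical. The namespace is reduced to a set of paths to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

/** Image-plus-edit-log recovery in miniature; not Alluxio's journal implementation. */
final class JournalSketch {
    enum Op { CREATE, DELETE }

    /** One journaled namespace change. */
    static final class Edit {
        final Op op;
        final String path;

        Edit(Op op, String path) {
            this.op = op;
            this.path = path;
        }
    }

    private Set<String> image = new TreeSet<>();          // last checkpointed namespace
    private final List<Edit> editLog = new ArrayList<>(); // edits made since the image
    private final Set<String> live = new TreeSet<>();     // in-memory state of the running master

    void create(String path) {
        editLog.add(new Edit(Op.CREATE, path)); // journal first, then apply
        live.add(path);
    }

    void delete(String path) {
        editLog.add(new Edit(Op.DELETE, path));
        live.remove(path);
    }

    /** Writes a fresh image and discards the edits it already covers. */
    void checkpoint() {
        image = new TreeSet<>(live);
        editLog.clear();
    }

    /** Rebuilds the state a restarted master would see: image, then replayed edits. */
    Set<String> recover() {
        Set<String> state = new TreeSet<>(image);
        for (Edit e : editLog) {
            if (e.op == Op.CREATE) {
                state.add(e.path);
            } else {
                state.remove(e.path);
            }
        }
        return state;
    }

    int pendingEdits() {
        return editLog.size();
    }

    Set<String> liveState() {
        return new TreeSet<>(live);
    }

    public static void main(String[] args) {
        JournalSketch journal = new JournalSketch();
        journal.create("/a");
        journal.create("/b");
        journal.checkpoint();
        journal.delete("/a");
        journal.create("/c");
        System.out.println("recovered: " + journal.recover()
                + ", pending edits: " + journal.pendingEdits());
    }
}
```

Checkpointing bounds recovery time: only the edits logged after the latest image need to be replayed.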

  • bin/alluxio fs [command]

    cat

    chmod

    copyFromLocal

    copyToLocal

    fileInfo

    ls

    mkdir

    mv

    rm

    touch

    mount

    unmount

    Path: alluxio://:/

    Wildcard *: `bin/alluxio fs rm /data/2014*`

  • API

    Java API (Alluxio)

    import alluxio.AlluxioURI;
    import alluxio.client.file.FileInStream;
    import alluxio.client.file.FileOutStream;
    import alluxio.client.file.FileSystem;

    // Write a file
    FileSystem fs = FileSystem.Factory.get();
    AlluxioURI path = new AlluxioURI("/myFile");
    FileOutStream out = fs.createFile(path);
    out.write(...);
    out.close();

    // Read the same file back
    FileInStream in = fs.openFile(path);
    in.read(...);
    in.close();

    Java API Doc: http://alluxio.org/documentation/master/api/java/

    Hadoop FileSystem API: MapReduce, Spark; alluxio:// in place of hdfs://

  • Alluxio-FUSE

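    Mounts Alluxio on Linux as a local file system; built on Linux libfuse

    $ alluxio-fuse.sh mount

    [Figure: `cat /tmp/alluxio-file` in userspace calls open, read, lseek, write through glibc; in the kernel, VFS dispatches to FUSE (alongside NFS, Ext4, ...); FUSE hands the request back to userspace, where glibc and libfuse pass it to Alluxio]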

  • API

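    Alluxio key-value API

    import alluxio.AlluxioURI;
    import alluxio.client.keyvalue.KeyValueStoreReader;
    import alluxio.client.keyvalue.KeyValueStoreWriter;
    import alluxio.client.keyvalue.KeyValueSystem;

    KeyValueSystem kvs = KeyValueSystem.Factory.create();

    // Create a store and write key-value pairs (keys and values are byte arrays)
    KeyValueStoreWriter writer = kvs.createStore(new AlluxioURI("alluxio://path/my-kvstore"));
    writer.put("100".getBytes(), "foo".getBytes());
    writer.put("200".getBytes(), "bar".getBytes());
    writer.close();

    // Open the completed store and read from it
    KeyValueStoreReader reader = kvs.openStore(new AlluxioURI("alluxio://path/my-kvstore"));
    reader.get("100".getBytes());
    reader.get("300".getBytes()); // null
    reader.close();

    [Figure: Alluxio KV Store; batch put; get K1 returns V1 (foo)]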

  • StorageTier: SSD

    StorageDir

    Alluxio Worker: StorageTier, StorageDir

    Allocator (chooses the StorageDir for a new block): GreedyAllocator, MaxFreeAllocator, RoundRobinAllocator

    Evictor (chooses the blocks to evict from a StorageDir): GreedyEvictor, LRUEvictor, LRFUEvictor, PartialLRUEvictor
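To make the Allocator and Evictor roles concrete, here is a minimal sketch of one policy of each kind: a max-free allocation rule and a least-recently-used eviction rule. It only mirrors the ideas behind MaxFreeAllocator and LRUEvictor; it is not their implementation, and the names `TierSketch`, `allocateMaxFree`, and `evictLru`, as well as this simplified `StorageDir`, are hypothetical.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Max-free allocation and LRU eviction in miniature; not Alluxio's Worker code. */
final class TierSketch {
    static final class StorageDir {
        final String path;
        final long capacity;
        long used;
        // Access-ordered map: iteration runs from least to most recently used block.
        final LinkedHashMap<Long, Long> blocks = new LinkedHashMap<>(16, 0.75f, true);

        StorageDir(String path, long capacity) {
            this.path = path;
            this.capacity = capacity;
        }

        long free() {
            return capacity - used;
        }

        void add(long blockId, long size) {
            blocks.put(blockId, size);
            used += size;
        }

        /** Records an access so the block becomes the most recently used. */
        void touch(long blockId) {
            blocks.get(blockId);
        }
    }

    /** Max-free style allocation: pick the directory with the most free space. */
    static StorageDir allocateMaxFree(List<StorageDir> dirs) {
        StorageDir best = dirs.get(0);
        for (StorageDir dir : dirs) {
            if (dir.free() > best.free()) {
                best = dir;
            }
        }
        return best;
    }

    /** LRU style eviction: drop least recently used blocks until bytesNeeded are free. */
    static List<Long> evictLru(StorageDir dir, long bytesNeeded) {
        List<Long> evicted = new ArrayList<>();
        Iterator<Map.Entry<Long, Long>> it = dir.blocks.entrySet().iterator();
        while (dir.free() < bytesNeeded && it.hasNext()) {
            Map.Entry<Long, Long> entry = it.next();
            dir.used -= entry.getValue();
            evicted.add(entry.getKey());
            it.remove();
        }
        return evicted;
    }

    public static void main(String[] args) {
        StorageDir ssd = new StorageDir("/mnt/ssd", 100); // hypothetical path and sizes
        ssd.add(1L, 40);
        ssd.add(2L, 40);
        ssd.touch(1L); // block 2 is now the least recently used
        System.out.println("evicted: " + evictLru(ssd, 50) + ", free: " + ssd.free());
    }
}
```

Separating the two decisions lets a Worker mix policies, for example spreading new blocks evenly while still evicting by recency.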

  • Alluxio: Spark, Hadoop MapReduce, Flink, H2O, Impala

    hdfs://ip:port/xxx -> alluxio://ip:port/xxx

    Zeppelin
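The hdfs:// to alluxio:// rewrite on this slide amounts to swapping the scheme and authority while keeping the path. A small helper can do that for existing job configurations. This is a sketch under assumptions: `SchemeSwapSketch` and `toAlluxio` are hypothetical names, the host `alluxio-master` is a placeholder, and 19998 is used as the Alluxio master port.

```java
import java.net.URI;
import java.net.URISyntaxException;

/** Rewrites hdfs:// URIs to alluxio:// URIs; an illustrative helper, not part of Alluxio. */
final class SchemeSwapSketch {
    /** Keeps the path and replaces scheme, host, and port with the Alluxio master's. */
    static String toAlluxio(String hdfsUri, String alluxioHost, int alluxioPort) {
        URI in = URI.create(hdfsUri);
        if (!"hdfs".equalsIgnoreCase(in.getScheme())) {
            throw new IllegalArgumentException("not an hdfs:// URI: " + hdfsUri);
        }
        try {
            return new URI("alluxio", null, alluxioHost, alluxioPort, in.getPath(), null, null)
                    .toString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical hosts and path, for illustration only.
        System.out.println(toAlluxio("hdfs://namenode:9000/data/2014/part-0",
                "alluxio-master", 19998));
    }
}
```

Because the path is unchanged, a job that read `hdfs://.../data/2014/part-0` needs no other code change once the HDFS directory is mounted at the same path in Alluxio.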

  • Alluxio

    POSIX permissions (rwx)

  • Web

    Master WebUI

    Worker WebUI

  • Co-located compute and data with memory-speed access to data

    Virtualized different storage systems under a unified namespace

    Scale-out architecture

    File system API, software only

  • Unification: new workflows across any data in any storage system

    Performance: orders of magnitude improvement in run time

    Flexibility: choice in compute and storage; grow each independently, buy only what is needed

  • Alluxio 1.4

    Alluxio API

    Alluxio 1.4.0: UFS API

    400

  • Alluxio 1.4

    REST: a REST interface to the Alluxio native Java API, for clients not written in Java

  • Alluxio 1.4

    Packet Streaming (Alluxio 1.4.0)

    2IO

  • Alluxio 1.4

    Apache Hive (contributed by PASALab): running Apache Hive with Alluxio (http://www.alluxio.org/docs/master/en/Running-Hive-with-Alluxio.html)

    YARN: Alluxio on YARN

  • Alluxio 1.4

    Alluxio Master MapReduce

    1

    Alluxio

  • Spark 2.0.0 + Alluxio 1.2.0

    Single worker: Amazon r3.2xlarge, 61 GB MEM, 8-core CPU

    Comparisons:

    Alluxio

    Spark Storage Level: MEMORY_ONLY

    Spark Storage Level: MEMORY_ONLY_SER

    Spark Storage Level: DISK_ONLY

  • READING CACHED DATAFRAME (PARQUET)

    [Figure: Time [seconds] (0 to 250) against DataFrame Size [GB] (0 to 50); series: Alluxio (textFile), DISK_ONLY, MEMORY_ONLY_SER, MEMORY_ONLY]

  • READ 50 GB DATAFRAME (SSD)

    [Figure: Time [seconds] (0 to 250); bars: Alluxio, No Alluxio]

  • READ 50 GB DATAFRAME (S3)

    [Figure: Time [seconds] (0 to 1750); bars: Alluxio, No Alluxio]

    10x average speedup, 17x peak speedup

  • Alluxio + HDFS, SLA

    Alluxio + HDFS, 10

    SLA: service-level agreement

    1002

  • 1

    1. Alluxio

    2. Alluxio, Alluxio, 10

    IO

  • 2

    1. Alluxio

    2. Alluxio, I/O, IO, CPU

    3. Alluxio, 1

    I/O, CPU

  • 3

    1. Alluxio

    2. CPU

    Alluxio

    CPU, I/O

  • 4

    1. Alluxio, I/O

    2. Alluxio, CPU

  • Alluxio + Hadoop/Spark; Alluxio + Spark

    Alluxio

    Alluxio

  • Alluxio

    http://alluxio.org/documentation/master/cn/index.html

  • The End & Thank you!

    http://alluxio.org/

    [email protected]
