Yuanwei Lu 1,2, Guo Chen 3, Bojie Li 1,2, Kun Tan 4, Yongqiang Xiong 2, Peng Cheng 2, Jiansong Zhang 2, Enhong Chen 1, Thomas Moscibroda 5

1 USTC, 2 Microsoft Research, 3 Hunan University, 4 Huawei Technologies, 5 Microsoft Azure

• Low and stable base latency: <10μs

• High throughput with low CPU overhead

• e.g. half of the traffic in an online service datacenter is RDMA [1]

• Lots of parallel paths between any pair of nodes

• Hot spots cause high latency and low throughput

• Not robust to failures

[1] Guo, Chuanxiong, et al. "RDMA over Commodity Ethernet at Scale." Proceedings of the 2016 ACM SIGCOMM Conference. ACM, 2016.

TCP/IP: Multi-path TCP

RDMA: Multi-path transport?

[1] Kalia, Anuj, Michael Kaminsky, and David G. Andersen. "Design Guidelines for High Performance RDMA Systems." 2016 USENIX Annual Technical Conference (USENIX ATC 16). 2016.

The MP-RDMA design must add as little hardware memory footprint as possible compared with existing single-path RDMA

#1: How to achieve congestion-aware traffic split without per-path states at the NIC?

#2: How to track out-of-order packets with a small on-chip memory footprint?

#3: How to provide message ordering without hurting performance?

Challenge #1: How to achieve congestion-aware traffic split without per-path states at the NIC?

• Tracking the transmission and congestion states on every path

• Insight: window-based protocol with ACK clocking

[Figure: animation of the congestion window, with ECN marks shown next to packets 1 and 2. Frames (cWnd / Inflight / aWnd): 6 / 0 / 6 → 6 / 6 / 0 → 5.5 / 5 / 0.5 → 5 / 4 / 1 → 5 / 5 / 0 → 5 / 3 / 2 → 5 / 5 / 0]
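The window frames above can be replayed with a small sketch. It keeps one shared window for all paths, takes aWnd = cWnd − Inflight (which holds in every frame), and shrinks cWnd by 0.5 per ECN-marked ACK (6 → 5.5 → 5 in the frames). Two things are assumptions rather than slide content: the `gain` parameter for window growth on unmarked ACKs (the frames show no growth, so it defaults to 0) and the reading of "ACK clocking" as sending the next packet on the path the ACK returned on. The class and method names are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Window:
    """One congestion window shared by all paths (no per-path state)."""
    cwnd: float
    inflight: int = 0
    gain: float = 0.0  # hypothetical growth factor; the frames show none

    @property
    def awnd(self) -> float:
        # Holds in every frame of the animation: aWnd = cWnd - Inflight.
        return self.cwnd - self.inflight

    def on_send(self) -> None:
        self.inflight += 1

    def on_ack(self, path: int, ecn: bool) -> List[int]:
        """Process one ACK; return the paths on which to send new packets."""
        self.inflight -= 1
        if ecn:
            # ECN-marked ACK: shrink by half a packet, as in the frames.
            self.cwnd = max(1.0, self.cwnd - 0.5)
        else:
            self.cwnd += self.gain / self.cwnd
        # Assumed ACK clocking: reuse the path this ACK came back on.
        sends = []
        while self.awnd >= 1:
            self.on_send()
            sends.append(path)
        return sends
```

Under these assumptions a congested path needs two marked ACKs to clock out one new packet, while an uncongested path clocks out one packet per ACK, so traffic shifts away from congestion without any per-path counters.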

Challenge #2: How to track out-of-order packets with a small on-chip memory footprint?

• Metadata is needed to track out-of-order packets

• The metadata can be large under certain scenarios

• e.g. link downgrade: 1.2 KB bitmap

• e.g. PFC pause: infinite

[Figure: animation of packets 1 to 10 sent over good paths and bad paths, with ∆ = 3. Frames (snd_ooh / snd_ool): 0 / 0 → 4 / 1 → 6 / 3. In the last frame, packets 1 and 2 are marked "Drop!".]
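The frames above suggest that the sender keeps only two numbers regardless of how many paths exist: snd_ooh, the highest sequence number acknowledged so far, and snd_ool = snd_ooh − ∆ (4 → 1 and 6 → 3 with ∆ = 3). A minimal sketch of that bookkeeping follows. The clamp of snd_ool at 0 and the interpretation of "Drop!" as "an ACK below snd_ool does not clock out a new packet on its path" are assumptions, and the class name is hypothetical.

```python
class OutOfOrderTracker:
    """Sender-side out-of-order window kept in two integers."""

    def __init__(self, delta: int) -> None:
        self.delta = delta
        self.snd_ooh = 0  # highest sequence number acknowledged so far

    @property
    def snd_ool(self) -> int:
        # Lower edge of the tolerated window; clamped at 0 (assumption).
        return max(0, self.snd_ooh - self.delta)

    def on_ack(self, psn: int) -> bool:
        """Return True if the path that carried this packet may be reused."""
        self.snd_ooh = max(self.snd_ooh, psn)
        # A packet that lags more than delta behind came over a slow path.
        return psn >= self.snd_ool


tracker = OutOfOrderTracker(delta=3)
for psn in [4, 6, 1, 2, 5]:
    reuse = tracker.on_ack(psn)
    print(psn, tracker.snd_ooh, tracker.snd_ool, "reuse" if reuse else "drop")
```

Because slow paths stop receiving new packets, the degree of reordering stays bounded by ∆, which in turn bounds the receiver-side tracking state.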

Challenge #3: How to provide message ordering without hurting application performance?

• Problem: the application may assume in-order message delivery

• Naïve solution: stop and wait, which degrades throughput

[Figure: Message 1 is WRITE 1 (data) and Message 2 is WRITE 2 (metadata); the application polls on the metadata.]
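The hazard behind this challenge can be shown with a toy model. This is not RDMA verbs code: memory is a dictionary, the two writes are plain strings, and the function name is hypothetical. The application polls the metadata flag and reads the data as soon as the flag is set.

```python
def poll_and_read(arrival_order):
    """Return what the application reads once it sees the metadata flag."""
    memory = {"data": None, "meta_ready": False}
    for write in arrival_order:
        if write == "WRITE 1":      # Message 1: the data itself
            memory["data"] = "payload"
        elif write == "WRITE 2":    # Message 2: metadata announcing the data
            memory["meta_ready"] = True
        if memory["meta_ready"]:    # application polls on the metadata
            return memory["data"]
    return None


print(poll_and_read(["WRITE 1", "WRITE 2"]))  # in-order arrival
print(poll_and_read(["WRITE 2", "WRITE 1"]))  # metadata overtakes the data
```

In the second call the flag is visible before the data has landed, so the application reads an empty buffer. That is the case a multi-path transport has to rule out.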

• A synchronise message is guaranteed to arrive after all previous messages

• A synchronise message simply waits ∆t after the previous messages

• ∆t = α · ∆ / (cwnd / RTT), large enough to avoid out-of-order arrival
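As a worked example, the delay rule can be read as ∆t = α · ∆ / (cwnd / RTT), where cwnd / RTT approximates the sending rate in packets per unit time. This reading of the formula is an assumed reconstruction, and the function name and all numeric values below are hypothetical, chosen only to show the arithmetic.

```python
def sync_delay(alpha: float, delta: int, cwnd: float, rtt: float) -> float:
    """Wait time before a synchronise message, in the same unit as rtt."""
    sending_rate = cwnd / rtt  # packets per unit time
    # Time needed to send alpha * delta packets at the current rate.
    return alpha * delta / sending_rate


# Hypothetical values: delta = 32 packets, cwnd = 64 packets, RTT = 100 us.
wait_us = sync_delay(alpha=1.0, delta=32, cwnd=64, rtt=100.0)
print(f"wait {wait_us:.1f} us before sending the synchronise message")
```

With these inputs the wait is half an RTT, much shorter than a full stop-and-wait round trip, and it shrinks further as the window grows.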

• Altera Stratix V D5 FPGA board with two 40 Gb Ethernet ports

• 14 elements, ~2K lines of OpenCL code

• Dependency: 9 cycles; clock frequency: ~206 MHz

• Dependency: 3 cycles; clock frequency: ~196 MHz

[1] Li, Bojie, et al. "ClickNP: Highly flexible and High-performance Network Processing with Reconfigurable Hardware." ACM SIGCOMM 2016 Conference.

Conclusion: MP-RDMA can be implemented in hardware with high performance
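One way to relate the dependency and clock figures to this conclusion is a back-of-the-envelope packet-rate bound. It assumes that "dependency" is the number of cycles that must separate consecutive packets in a pipeline stage, which the slide does not state. Framing overhead on the wire is ignored, and the function names are hypothetical.

```python
def packet_rate_mpps(clock_mhz: float, dependency_cycles: int) -> float:
    """Upper bound on packets per second (in millions) for one stage."""
    return clock_mhz / dependency_cycles


def min_packet_bytes_for_line_rate(rate_mpps: float, link_gbps: float = 40.0) -> float:
    """Smallest packet size at which the stage still fills the link."""
    bits_per_packet = link_gbps * 1e9 / (rate_mpps * 1e6)
    return bits_per_packet / 8


for clock_mhz, cycles in [(206, 9), (196, 3)]:
    rate = packet_rate_mpps(clock_mhz, cycles)
    size = min_packet_bytes_for_line_rate(rate)
    print(f"{clock_mhz} MHz / {cycles} cycles: {rate:.1f} Mpps, "
          f"fills 40 Gb/s for packets of at least {size:.0f} bytes")
```

Under this reading, both stages keep a 40 Gb/s port busy for packets of a few hundred bytes or more.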

• 47.05% higher application throughput vs. DCQCN

• Under loss: >3x throughput improvement

• No per-path states

• Out-of-order aware path selection

• Synchronise mechanism to ensure message ordering