Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC...

14
Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors Brian Zimmer 1 , Yunsup Lee 1 , Alberto Puggelli 1 , Jaehwa Kwak 1 , Ruzica Jevtic 1 , Ben Keller 1 , Stevo Bailey 1 , Milovan Blagojevic 1,2 , Pi-Feng Chiu 1 , Hanh-Phuc Le 1 , Po-Hung Chen 1 , Nicholas Sutardja 1 , Rimas Avizienis 1 , Andrew Waterman 1 , Brian Richards 1 , Philippe Flatresse 2 , Elad Alon 1 , Krste Asanovic ́ 1 , and Borivoje Nikolic ́ 1 1 University of California, Berkeley, USA 2 STMicroelectronics, Crolles, France 6/30/2015, RISC-V Workshop

Transcript of Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC...

Page 1: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Brian Zimmer1, Yunsup Lee1, Alberto Puggelli1, Jaehwa Kwak1, Ruzica Jevtic1, Ben Keller1, Stevo Bailey1, Milovan Blagojevic1,2,

Pi-Feng Chiu1, Hanh-Phuc Le1, Po-Hung Chen1, Nicholas Sutardja1, Rimas Avizienis1, Andrew Waterman1, Brian Richards1,

Philippe Flatresse2, Elad Alon1, Krste Asanovic ́1, and Borivoje Nikolic ́1

1University of California, Berkeley, USA 2STMicroelectronics, Crolles, France

6/30/2015, RISC-V Workshop

Page 2: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Motivation•  Dynamic voltage and frequency scaling (DVFS)

maximizes energy efficiency

Time

PerformanceRequirement

Fixed voltage

DVFS voltage

•  Off-chip conversionSlow mode transitionsLimited voltage domainsCostly off-chip components

•  On-chip conversionFast transitionsMany domainsNo off-chip components

✖  ✖  ✖  

✔  ✔  ✔  

2  

Page 3: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Project Goals

•  Fine-grained DVFS•  High conversion efficiency

Energy-efficient

•  Entirely on-chip converter•  Low area overheadlow-cost

•  Runs Linux•  Realistic digital loadprocessor

3  

Page 4: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Integrated Voltage RegulatorsMethod Efficiency

(0.5V output)Off-chip components

Area overhead

Linear regulator (eg. Toprak-Deniz ISSCC2014)

<50%

Buck converter (eg. Kurd ISSCC2014)

?%-90%

Interleaved switched-capacitor (eg. Kim ISSCC20151, Clerc ISSCC20152)

40%-65%1

65%-82%2

Simultaneous switched-capacitor with adaptive clock (Proposed)

85%

4  

Page 5: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Videal VoutVlow

VhighVideal VoutVlow

Vhigh

Proposed Solution•  Switch all converters simultaneously to avoid

charge sharing

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

Traditional Interleaving Simultaneous Switching

Charge sharing losses

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

1.0V1.8V

24 switched-capacitorDC-DC unit cells

(0.19 mm2)

DC-DC controller

Adaptive clock generator

Vout...

togg

le

+

FSMVref

clk

Async. FIFO/Level shiftersbetween domains

2Ghzref

CORE (1.19 mm2)

UNCORE

16KB ScalarInst. Cache

(Custom 8T SRAM Macros)

32KB Shared Data Cache

(Custom 8T SRAM Macros)

8KB VectorInst. Cache

(Custom 8T SRAM Macros)

To/from off-chip FPGA FSB and DRAM

Digital IO pads to wire-bonded chip-on-board

To off-chip oscilloscope

1.0V

RISC-V scalar core Vector acceleratorVector Issue Unit

Vector Memory Unit

Crossbar

ScalarRF FPUint

Arbiter

...

...

int int int int int

(16KB Vector RF uses eight custom 8T SRAM macros)

...Shared Func. Units64-bit Integer MultiplierSingle-Precision FMADouble-Precision FMA

No charge sharing!

•  Clock frequency adapts to track the voltage ripple5  

Page 6: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

CORE (1.19 mm2)Chip Architecture

6  

Page 7: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Floorplan•  ST 28nm FDSOI•  Die area: 2.37mm2

•  Core area: 1.19mm2

•  Converter area: 0.19mm2 (16%)

•  MOS+MOM (to M3) density: 11fF/µm2

7  

Core

Page 8: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Reconfigurable SC Converters

•  Simple lower bound control

1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)

Unit Cell Topologies

Lower-bound Controller

Operational Waveform

1V

1V

1V

1.8V

1V

1.8V

1V

1V

1V

1V

1V

1V

Vout VoutVoutVout

+FSM

2GHz comparator

VrefVout toggle

toggleVref

Vout

Φ1Φ1 Φ1 Φ1

Φ2Φ2Φ2

Φ2

Φ2

Φ2

Φ2 Φ1

Φ1

Φ1

Φ1off

off

Φ2

Φ2 Φ1

Φ1

Φ1

Φ1

off

Φ2off

off

Φ1

Φ1

Φ1

Φ1off

off

Φ2

Φ2

Φ2

Φ2

off

off

off

off

Efficiency ImprovementsInterleaving Rippling (this approach)

Videal VoutVlow

Vhigh VoutVidealVlow

Vhigh

clkAdaptive clock negates losses by increasing the average clock frequency

caused by charge sharing with interleaved converters

clk

No losses�V loss

�V

Vlow

Vin

Vlow

VinVhigh

Vlow

Vin VinVhigh

toggletoggle

•  Four output voltages, single unit cell

1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)

Unit Cell Topologies

Lower-bound Controller

Operational Waveform

1V

1V

1V

1.8V

1V

1.8V

1V

1V

1V

1V

1V

1V

Vout VoutVoutVout

+FSM

2GHz comparator

VrefVout toggle

toggleVref

Vout

Φ1Φ1 Φ1 Φ1

Φ2Φ2Φ2

Φ2

Φ2

Φ2

Φ2 Φ1

Φ1

Φ1

Φ1off

off

Φ2

Φ2 Φ1

Φ1

Φ1

Φ1

off

Φ2off

off

Φ1

Φ1

Φ1

Φ1off

off

Φ2

Φ2

Φ2

Φ2

off

off

off

off

Efficiency ImprovementsInterleaving Rippling (this approach)

Videal VoutVlow

Vhigh VoutVidealVlow

Vhigh

clkAdaptive clock negates losses by increasing the average clock frequency

caused by charge sharing with interleaved converters

clk

No losses�V loss

�V

Vlow

Vin

Vlow

VinVhigh

Vlow

Vin VinVhigh

toggletoggle

•  Simultaneous switching

1V Mode 1.8V 1/2 Mode 1V 2/3 Mode 1V 1/2 Mode(~1V) (~0.9V) (~0.67V) (~0.5V)

1V1V

1V

1.8V1V

1.8V

1V1V

1V

1V1V

1V

Vout

Vout

Vout

Vout

Φ2

Φ2

Φ2

Φ2 Φ1

Φ1

Φ1

Φ1off

offΦ2

Φ2 Φ1

Φ1

Φ1

Φ1

off

Φ2off

offΦ1

Φ1

Φ1

Φ1off

offΦ2

Φ2

Φ2Φ2

off

off

offoff

8  

Page 9: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Self-Adjusting Clock Generator

•  Replica tracks critical path with voltage ripple•  Controller quantizes clock edge

Block Diagram (Tunable Replica Path)

DLL2Ghz input, 16 phases

ControllerSelect closest edge

TunableReplica

Path

Vout .........

...

...

......

Adaptive system clock

Vout

9  

Page 10: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Test setup

•  Host: Zedboard (ARM+FPGA, network accessible)

•  Motherboard: Programmable supplies•  Daughterboard: Chip-on-board

Conf.  1  

Conf.  2  

Conf.  3  

10  

Page 11: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Vout Measurements

Fast converter mode transitions enable extremely fine-grain DVFS algorithms

11  

Page 12: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

GFLOPS/W

28nm FDSOI offers high energy efficiency and 4 converter modes cover wide operating range

•  Double-precision matrix-multiplication used as energy-efficiency metric

•  GFLOPS/W inverse of energy/operation

12  

Page 13: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Conclusion

1.  Fine-grained and wide-range DVFS–  20ns transitions–  1V to 0.45V

2.  Entirely on-chip voltage conversion–  Simultaneous switched-capacitor

3.  High efficiency–  80% across wide voltage range

4.  Extreme energy efficiency–  26 GFLOPS/W with on-chip conversion

Energy-efficient 28nm processor featuring:

13  

Page 14: Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC ...riscv.org/wp-content/uploads/2015/06/riscv-raven... · Raven3: 28nm RISC-V Vector Processor with On-Chip DC/DC Convertors

Acknowledgements

•  Funding: BWRC members, ASPIRE members, DARPA PERFECT Award Number HR0011-12-2-0016, Intel ARO, AMD, GRC, Marie Curie FP7, NSF GRFP, NVIDIA Fellowship

•  Fabrication donation by STMicroelectronics.

•  Tom Burd, James Dunn, Olivier Thomas, and Andrei Vladimirescu

14