Designingenergyefficient’ microprocessor:Howtofight ...Designingenergyefficient’...

Designing energy-‐‑efficient microprocessor: How to fight

process variations Ruzica Jevtic

Berkeley Wireless Research Center

Moore’s law

10

!""#$%&'()*'+ ,--.TransistorsTransistors

Per DiePer Die

101099

10101010

,./!,./!.0,!.0,!

0101 ,1,1

2-2-2-2- 2-2/2-2/2-,2/2-,2/

32/4'5#"6$&&"#32/4'5#"6$&&"#72/4'5#"6$&&"#72/4'5#"6$&&"#

5$89:;<5$89:;<=''=''5#"6$&&"#5#"6$&&"#5$89:;<5$89:;<='='>>'5#"6$&&"#>>'5#"6$&&"#

5$89:;<5$89:;<== >>>'5#"6$&&"#>>>'5#"6$&&"#5$89:;<5$89:;<== 7'5#"6$&&"#7'5#"6$&&"#

>9)8:;<>9)8:;<4'4'5#"6$&&"#5#"6$&&"#101088

101077

101066

101055

101044

1010332--22--2

>9)8:;<>9)8:;<4'4',, 5#"6$&&"#5#"6$&&"#

0?0?7?7?

/7?/7?,./?,./?

0!0!

0/!0/!7!7!

/7!/7!,./!,./!

0,2!0,2!

0/?0/?

0@

7--77--71010

101022

101011

101000

2--22--21965 Data (Moore)1965 Data (Moore)

MicroprocessorMicroprocessorMemoryMemory

19601960 19651965 19701970 19751975 19801980 19851985 19901990 19951995 20002000 20052005 20102010

Source: IntelSource: IntelGraph from S.Chou, ISSCC’2005

!""#$%&'A)*')8B'6"&9

,-

13

!"#$%&'()*+,-./0+12

34

)*01/-5(627()8920.:;(<:/-10000100001010

Nominal feature sizeNominal feature size

mm

10001000

100100

11

0.10.1

nmnm130nm130nm90nm90nm

70nm70nm50nm50nm

Gate LengthGate Length 65nm65nm45nm45nm

32nm32nm22nm22nm

0.7X every 2 years

180nm180nm250nm250nm

3=Source: Intel, IEDM presentations

10100.010.01

50nm50nm35nm35nm

19701970 19801980 19901990 20002000 20102010 20202020

~30nm~30nm

)8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D

Scaling consequences: •  Process variations •  Power has become critical!

(c) Ruzica Jevtic 2012

Scaling consequences

•  Fabrication: - Litography with liquids

- Use of difraction…

•  Consequences: - Random dopant fluctuations - Hot-carrier trapping…

•  Additional steps: - Optical Proximity Check

•  Design rules: - No corners - One direction lines…

J. Hartmann, ISSCC 2007 (c) Ruzica Jevtic 2012

11

!"#$%&'(%!)&'*+,"-./0&$-1

23

+,"-./0&$-1*!)&4,'/10000100001010

Nominal feature size scaling

mm

10001000

100100

11

0.10.1

nmnm

130nm130nm90nm90nm

65nm65nm



193nm193nm

22

10100.010.0119701970 19801980 19901990 20002000 20102010 20202020

45nm45nm32nm32nm

22nm22nm EUV 13nmEUV 13nm

567*8 9#)-'.4./1*.:*"-#*:;";0#*<:.0#=#0>?

Litography

•  Litography issues: 193nm wavelength for

lines as small as 30nm!

12

!"#$%&'()(*+,-./0,-1+2&3-4/0+-,.3215(6,(7.,-21"+-.&.+&3

193nm lightMask

193nm light

Lightintensity

Lightintensity

89

!"#$%&'()(*+,-./0,-1+2&3-4

1CD kNA

:(62(&;(.Presently: 193 nm (ArF excimer laser)(Distant?) future: EUV

<*62(&;(.NA =.n;0*Maximum n is 1 in airPresently: ~0 92-1 35 min 1

1930.25 50

0 92nmCD k nm

NA

8>

Presently: 0.92-1.35Immersion

?(;"),@.!-20*A0*+.ABPresently: 0.35 – 0.4Theoretical limit: 0.25

0.92NA

45nm technology beyond resolution limit


Process variations

•  Process corners: typical, fast and slow •  All corners coexist on the same wafer •  If we design for the worst case, it is too pessimistic! •  Need to know after fab chip features observation circuits

B. Nikolic, TCAS-‐‑I, 2011


Moore’s law

10

!""#$%&'()*'+ ,--.TransistorsTransistors

Per DiePer Die

101099

10101010

,./!,./!.0,!.0,!

0101 ,1,1

2-2-2-2- 2-2/2-2/2-,2/2-,2/

32/4'5#"6$&&"#32/4'5#"6$&&"#72/4'5#"6$&&"#72/4'5#"6$&&"#

5$89:;<5$89:;<=''=''5#"6$&&"#5#"6$&&"#5$89:;<5$89:;<='='>>'5#"6$&&"#>>'5#"6$&&"#

5$89:;<5$89:;<== >>>'5#"6$&&"#>>>'5#"6$&&"#5$89:;<5$89:;<== 7'5#"6$&&"#7'5#"6$&&"#

>9)8:;<>9)8:;<4'4'5#"6$&&"#5#"6$&&"#101088

101077

101066

101055

101044

1010332--22--2

>9)8:;<>9)8:;<4'4',, 5#"6$&&"#5#"6$&&"#

0?0?7?7?

/7?/7?,./?,./?

0!0!

0/!0/!7!7!

/7!/7!,./!,./!

0,2!0,2!

0/?0/?

0@

7--77--71010

101022

101011

101000

2--22--21965 Data (Moore)1965 Data (Moore)

MicroprocessorMicroprocessorMemoryMemory

19601960 19651965 19701970 19751975 19801980 19851985 19901990 19951995 20002000 20052005 20102010

Source: IntelSource: IntelGraph from S.Chou, ISSCC’2005

!""#$%&'A)*')8B'6"&9

,-

13

!"#$%&'()*+,-./0+12

34

)*01/-5(627()8920.:;(<:/-10000100001010

Nominal feature sizeNominal feature size

mm

10001000

100100

11

0.10.1

nmnm130nm130nm90nm90nm

70nm70nm50nm50nm

Gate LengthGate Length 65nm65nm45nm45nm

32nm32nm22nm22nm

0.7X every 2 years


3=Source: Intel, IEDM presentations

10100.010.01

50nm50nm35nm35nm

19701970 19801980 19901990 20002000 20102010 20202020

~30nm~30nm

)8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D

Scaling consequences: •  Process variations •  Power has become critical!


Scaling issues: Power

•  Operating voltage scaled to avoid device breakdown •  Not scaled enough to keep the performance boosting up

•  Power has become a critical design constraint

3

Frequency Trends in Intel's Microprocessors

10000Pentium 4 Core2

!"#$%#&'(

10

100

1000

requ

ency

[MH

z]

808680286

386DX

486DX 486DX4

PentiumPentium Pro

Pentium II

Pentium MMX

Pentium III

ItaniumItanium II

Core2

)

0.1

1

1970 1975 1980 1985 1990 1995 2000 2005

Fr

4004

8008

8080

8088 Has been doublingevery 2 years, but is now slowing down

*+,#"-./00/123/+&

Power Trends in Intel's Microprocessors

1000

Has been > doublingevery 2 years

10

100

Pow

er [W

]

808680286 486DX

Pentium

Pentium Pro

Pentium II

Pentium IIIPentium 4

ItaniumItanium II Core 2

4

Has to stay ~constant

0.1

1

1970 1975 1980 1985 1990 1995 2000 2005

4004

8008 80808088

386DX


Power issues •  Power became an issue before variations: - battery life for mobile devices

- power limits performance! - performance is what sells the product!

•  Power classification: - Dynamic power (60%)

- Static power (30%) exp(Vdd)

P = sw ! f !Vdd2 !C


Power-‐‑performance trade-‐‑offs •  The best way to reduce power (both leakage and

active) is to reduce the power supply

•  How to maintain throughput under reduced supply?

•  Introducing more parallelism/pipelining •  Dynamic voltage scaling with variable throughput

Energy

Performance (c) Ruzica Jevtic 2012

Overview

•  Dynamic voltage and frequency scaling

•  Control voltage through DC-DC converters •  Conclusions

•  Error detection and correction circuits


Impact of Dynamic Variations

MSFF D

CLK

Q MSFF

CLK

VCC Droop

CLK

D

D

Setup Time Timing with VCC Droop

Timing with Nominal VCC

Timing Guardband

Ø Guardbands required to ensure correct operation within the presence of dynamic variations

K. Bowman, CMOS Emerging Tech. Workshop, 2009 (c) Ruzica Jevtic 2012

Timing-‐‑Error Detection Error-‐‑Detection Sequential (EDS)

D

CLK

Q

ERROR

MSFF

LATCH

MSFF

CLK

VCC Droop

CLK

D

ERROR

Data Arrives Late

Error Detected

[1] P. Franco, et al., VLSI Test Symp., 1994. [2] M. Nicolaidis, VLSI Test Symp., 1999. [3] D. Ernst, et al., MICRO, 2003.

RAZOR


Recovery mechanism DAS et al.: A SELF-TUNING DVS PROCESSOR USING DELAY-ERROR DETECTION AND CORRECTION 795

Fig. 3. Distributed pipeline recovery mechanism.

corrupting the state of the shadow latch. Delay buffers are re-quired to be inserted in those paths which fail to meet this min-imum path delay constraint imposed by the shadow latch. Theinsertion of delay buffers incurs power overhead because ofthe extra capacitance added. A large shadow latch samplingdelay requires a greater number of delay buffers to be inserted,thereby increasing the power overhead. However, a small sam-pling delay implies that the voltage difference between the pointof first failure and the point where shadow latch fails is less and,thus, reduces the voltage margin available through Razor timingspeculation. Hence, the shadow latch sampling delay representsthe tradeoff between power overhead due to delay buffers andthe voltage margin available for Razor subcritical mode of op-eration. Using suitable clock chopping techniques, the durationof the positive phase of the propagated clock can be configuredas required so as to exploit the above tradeoff.

A key point to note is the fact that the hold constraint im-posed by the shadow latch only limits the maximum durationof the positive clock phase and has no bearing upon the clockfrequency. Thus, a “Razor”-ed pipeline can still be operated atany frequency as required as long as the positive clock phase issufficient to meet the minimum path delay constraint. In our de-sign, for a sampling delay of 3.0 ns which is approximately halfthe cycle time at 140 MHz, it was required to add 2388 delaybuffers to satisfy the short path constraint on 207 RFFs (7.4%of the total number of flip-flops). The power overhead due tothese buffers was less than 3% of the nominal chip power.

Correct pipeline state is recovered in the event of a timingerror by engaging a distributed pipeline recovery mechanism, asdescribed in [1], which is based on a counter-flow pipeline archi-tecture [9]. The primary requirement of the recovery mechanismis to prevent corrupt state being committed to storage in memoryor the register file before being validated by Razor. In [1], wehave discussed two possible ways in which this can be achieved.A centralized pipeline recovery mechanism uses thesignal as a global clock-gating signal to stall the pipeline fora single cycle while the errant flip-flop recovers correct state.This incurs only a one-cycle recovery penalty but imposes sig-nificant timing restrictions on the signal which needsto be distributed through the entire chip in less than one cycle.In contrast, the distributed pipeline recovery mechanism placesnegligible restrictions on the cycle time at the expense of ex-tending recovery over several cycles.

Fig. 3 conceptually illustrates the working principle of thedistributed pipeline recovery mechanism. When a Razor erroroccurs, two actions are taken. First, the computation in the stage

following the errant stage is nullified by a “bubble” signal whichindicates to the next and subsequent stages that the pipeline slotis invalid. Second, a backward propagating flush train is trig-gered by asserting the stage identifier (ID) of the failing stage.In the following cycle, the correct value from the Razor shadowlatch data is injected back into the pipeline, allowing the errantinstruction to continue with its correct inputs. In addition, theflush train begins propagating the ID of the failing stage in theopposite direction of instructions. At each stage, the flush traininserts a bubble in the corresponding pipeline stage as well as inthe immediately preceding stage. (Two stages must be nullifiedbecause the main pipeline appears to move twice as fast rela-tive to the flush train.) When the flush ID reaches the start of thepipeline, the flush control logic restarts the pipeline at the in-struction following the errant instruction. In the event that mul-tiple stages experience errors in the same cycle, all will initiaterecovery but only the Razor error closest to write-back (WB)will complete. Earlier recoveries will be flushed by later ones.

III. TRANSISTOR-LEVEL DESIGN OF THE RFF

Fig. 4 shows the transistor level circuit schematic of the RFF.In the absence of a timing error, the RFF behaves as a standardpositive edge triggered flip-flop. The error comparator is a semi-dynamic XOR gate which evaluates when the data latched by theslave differs from that of the shadow in the negative clock phase.The error comparator shares its dynamic node with themetastability detector which evaluates in the positive phase ofthe clock when the slave output could become metastable. Thus,the RFF signal is flagged when either the metastabilitydetector or the comparator evaluate.

This, in turn, evaluates the dynamic gate to generate thesignal by ORing together the error signals of indi-

vidual RFFs (Fig. 5), in the negative clock phase. Thesignal incurs significant routing and gate capacitance as it isrouted to every flip-flop in the pipeline stage and needs to bedriven by strong drivers. For an RFF, the serves tooverwrite the master with the shadow latch data. Hence, theslave gets the correct data at the next positive edge.

The needs to be latched at the output of the dynamicOR gate so that it retains state during the next positive phase(recovery cycle) during which it disables the shadow latch toprotect state. In addition, the also disables all regular,non-“Razor”-ed flip-flops in the pipeline stage to preserve thestate that was latched in the errant cycle. This is required tomaintain the temporal consistency of all flip-flops in the pipelinestage. The stack of three pMOS transistors in the shadow latch

•  Correct data restored in the following clock cycle •  All previous pipeline stages have to be flushed •  Program counter resumes at the next instruction


Razor – issues

Good idea but has a lot of issues: -  Duty cycle constraints the shortest path -  Metastability at the output of the main FF -  The longest paths changed if a latch is added to FF

D Q

D Q

Short path

Long path D Q

Shadow Latch

CLK

ERROR!


Improvements

ü  Lower clock power ü  Data metastability solved -‐‑ Additonal buffers needed for short paths

ü  No added buffers for short paths ü  Data metastability detected w/o

additional detector - Detection window tuned in prefab

D Q

*

*

*

*

EN

ERROR

PG RISING

PG FALLING

LATCH

ERROR

ARM’10 Intel’09


Tunable Replica Circuit (TRC)

EDS

Calibration Bits

error

Logic stages: Ø  Inverter Ø NAND Ø NOR Ø  Pass gates Ø Repeated interconnects

J. Tschanz, et al., Symp. VLSI Circuits, 2009. (c) Ruzica Jevtic 2012

TRC – cont’d

1132009 Symposium on VLSI Circuits Digest of Technical Papers

•  Not so accurate as EDS, but no interfering with the longest path •  If properly tuned, no recovery mechanism needed


Observation circuits -‐‑ summary Advantages:

ü Reduce margins for process variations ü EDS:

•  enable instruction recovery •  detect errors in pipeline stages

ü  TRC: •  capture clock-to-data delay per pipeline stage •  have low design overhead

Disadvantages: -  EDS:

•  Adding buffers for short paths •  Metastability issues •  The longest path is affected by additional circuits

-  TRC: •  Cannot detect local dynamic variations •  Requires margin between TRC & the longest path •  Requires post-silicon calibration (c) Ruzica Jevtic 2012

Overview


•  Control voltage through DC-DC converters

•  Conclusions



DVFS

5

Typical MPEG IDCT Histogram

9

DesiredThroughput

Compute-intensive andlow-latency processes

Processor Usage Model

g p

Maximum Processor Speed

10

timeSystem Idle

Background andhigh-latency processes

• Maximize Peak Throughput• Minimize Average Energy/operation

System Optimizations:BurdISSCC’00Burd, ISSCC’00


DVFS – cont’d

7

1.0

LK

CMOS Circuits Track Over VDD

InverterRingOscRegFileSRAM

0.5

orm

aliz

ed m

ax. f

C

13

0VT 2VT 3VT 4VT

No

VDD

Delay tracks within +/- 10%BurdISSCC’00

Vary !CLK,VDD1 2 Dynamically adapt

Dynamic Voltage Scaling (DVS)

Vary !CLK,VDD

Delivered

Throughput

1 2 Dynamically adapt

14

time

• Dynamically scale energy/operation with throughput.• Always minimize speed minimize average energy/operation.• Extend battery life up to 10x with the exact same hardware!

BurdISSCC’00

Burd, ISSCC’00


Traditional DVFS

•  Traditional DVFS use a ring oscillator for frequency detection (Xscale, PowerPC, Pentium M, …)

•  Impossible to use nowadays: ring oscillator frequency change does not reflect the cpu frequency change

•  Different paths have different behavior with Vdd change

Counter

+ -‐‑ Fcpu_av

DC-‐‑DC

uP Ring Oscillator

... fcpu

Burd, ISSCC’00


Resilient DVFS

•  Resilient DVFS use TRCs and EDS as freq. indicator •  Two options: TRC+EDS or EDS with recovery

Fcpu_av DC-‐‑DC

uP

CLK Cntrl

EDS with recovery Slow down

J. Tschanz, et al., Symp. VLSI Circuits, 2009.

Slow down

Max. path delay Speed up

EDS TRC


RAVEN microprocessor

•  Implemented in the newest technology: 28nm! (apple i5/i7 cores in 32nm, altera chips in 28nm)

•  Mobile applications: battery life important!

•  Manycore architecture: exploiting parallelism! •  Error detection circuits for observation:

improvement over ARM architecture! •  Unconventional DVFS scheme


Motivation -‐‑ DVFS •  Single core

E

Perf

" Two cores

E

Perf

" Performance dictated by the slower core " In a conventional DVFS synchronous system


Goal •  Manycore

E

Perf

" Make sure to operate at the optimal energy-performance point

Instead of operating here

Would like to operate here

10x


Per-Core Supply/Clock Control

PLL

EDS

EDS

EDS

EDS

FIFO

FIFO FIFO

FIFO

DC-‐‑DC DC-‐‑DC

DC-‐‑DC DC-‐‑DC


DVFS on manycore processor

•  Allow the ripple at the output and track the voltage through clock generation for beler energy-‐‑efficiency

Control Clock

Generator

fdesired

DC-‐‑DC

Tracking

TRC

EDS uP

Delay

Energy


Overview



•  Conclusions

•  Control voltage through DC-DC converters


DC-‐‑DC converters

Two phases: 1. Loading the energy

from the battery 2. Transferring loaded

energy to the output


!" !" #$%# $"%&'( )"*+(&,-"% ./0 .-!!1 &()"'02)"$ %3&)*+"$4*252*&)/0 $*4$* */(6"0)"0% 7878

.9:; 7; <=> 2 7#8 ?@AB4CDEF %* $*4$* GDFHAI@AI =FC <J> 9@? DBAI=@9DF=K E=HALDIM?;

&&; %* */(6"0)"0 !/%% 2(2!1%&% 2($ /5)&N&O2)&/(

&F DICAI @D =GP9AHA @PA DB@9M=K @I=CADLL JA@EAAF BDEAI CAF4?9@Q =FC AL!G9AFGQR 9F @P9? ?AG@9DF EA E9KK !I?@ =F=KQSA @PA DBAI4=@9DF =FC KD?? MAGP=F9?M? DL %* GDFHAI@AI?; )P9? =F=KQ?9? E9KKKA=C @D CA?9:F ATU=@9DF? LDI ?E9@GP9F: LIATUAFGQ =FC ?E9@GPE9C@P @P=@ M9F9M9SA KD??A? 9F = :9HAF @AGPFDKD:Q =FC BDEAICAF?9@Q; %9FGA = ?9F:KA4@DBDKD:Q %* GDFHAI@AI 9? DFKQ AL!G9AF@EPAF :AFAI=@9F: =F DU@BU@ HDK@=:A E9@P9F = K9M9@AC I=F:AR @P9??AG@9DF =K?D CA?GI9JA? = ?9MBKA CA?9:F ?@I=@A:Q LDI AF=JK9F: IA4GDF!:UI=JKA @DBDKD:9A? =? EAKK =? BIAC9G@9F: @PA DHAI=KK AL!4G9AFGQ HAI?U? DU@BU@ HDK@=:A;

&% '(!)#"*+, +- # .#/($! .0 0+,1!)"!)

&F DICAI @D AKUG9C=@A @PA VAQ KD?? MAGP=F9?M? @P=@ E9KK ?A@ @PA@I=CADLL JA@EAAF GDFHAI@AI AL!G9AFGQ =FC =IA= <9;A;R BDEAI CAF4?9@Q>R EA E9KK JA:9F JQ AW=M9F9F: @PA DBAI=@9DF DL @PA 7#8 ?@AB4CDEF GDFHAI@AI ?PDEF 9F .9:; 7<=>; %E9@GPAC4G=B=G9@DI $*4$*GDFHAI@AI? @QB9G=KKQ DBAI=@A 9F @ED BP=?A?R A=GP DL EP9GP 9CA=KKQP=? XYZ CU@Q GQGKA; 3P9KA 9@ 9? BD??9JKA @D DBAI=@A %* $*4$*GDFHAI@AI? =@ = !WAC ?E9@GP9F: LIATUAFGQ =FC U?A = H=I9=JKA CU@QGQGKA @D =C[U?@ @PA DU@BU@ 9MBAC=FGA DL @PA GDFHAI@AI \7Y]R \78]R=? E9KK JA ?PDEF K=@AI 9F DUI =F=KQ?9?R M=W9MUM AL!G9AFGQ G=FDFKQ JA =GP9AHAC JQ DB@9M9S9F: ?E9@GP9F: LIATUAFGQ =FC DBAI4=@9F: E9@P XYZ CU@Q GQGKA @D M=W9M9SA @PA GP=I:A @I=F?LAI CU@QGQGKA;39@P @PA XYZ CU@Q GQGKA DBAI=@9DF ?PDEF 9F .9:; 7R CUI9F:

BP=?A R @PA "Q9F: G=B=G9@DI 9? GDFFAG@AC JA@EAAF @PA9FBU@ FDCA 69 =FC @PA DU@BU@ FDCA 6D; )PA GP=I:A CI=EF LIDM69 @PDU:P GP=I:A? @P9? G=B=G9@DI UB =FC "DE? @D @PA KD=C;&F BP=?A R 9? GDFFAG@AC JA@EAAF 6D =FC '($R =FC @PU?@PA GP=I:A BIAH9DU?KQ ?@DIAC DF @PA "Q9F: G=B=G9@DI 9? @I=F?4LAIIAC @D @PA DU@BU@; %9FGA @PA ?E9@GP9F: GQGKA 9? @QB9G=KKQ MUGP?M=KKAI @P=F @PA GP=I:A^C9?GP=I:A @9MA GDF?@=F@ <EP9GP 9? ?A@JQ >R @PA I=MB I=@A DL @PA HDK@=:A =GID?? @PA G=B=G9@DI 9?IAK=@9HAKQ GDF?@=F@R =FC PAFGA @PA KD=C G=F JA =BBIDW9M=@AC =?= GUIIAF@ ?DUIGA; 2? E9KK JA CA@=9KAC K=@AIR 9F DICAI @D M=W9M9SAAL!G9AFGQ 9@ 9? CA?9I=JKA @D U@9K9SA =KK =H=9K=JKA G=B=G9@=FGAE9@P9F@PA GDFHAI@AI 9@?AKL; )PAIALDIAR EA E9KK =??UMA @P=@ @PAIA 9? FDAWBK9G9@ DU@BU@ !K@AI9F: G=B=G9@DIR EP9GP 9F @PA G=?A DL @PA ?9MBKA%* GDFHAI@AI CA?GI9JAC ?D L=IR M=VA? @PA BA=V4@D4BA=V HDK@=:AI9BBKA =GID?? @PA G=B=G9@DI =FC @PA GDFHAI@AI_? DU@BU@ ATU=KR =??PDEF 9F .9:; 7<J>; )P9? HDK@=:A I9BBKA P=? = C9IAG@ 9MBK9G=@9DF DF@PA KD??`=FC PAFGA @PA =GP9AH=JKA AL!G9AFGQ`DL @PA GDFHAI@AI;

2% 3+44 &,#$54*4

)PA HDK@=:A I9BBKA =GID?? @PA G=B=G9@DI? ?G=KA? E9@P @PA KD=CGUIIAF@R =FC =? E9KK JA CA?GI9JAC ?PDI@KQR E9KK @PAIALDIA =BBA=I

=? = LDIM DL ?AI9A? KD?? ?9M9K=I @D @PA ?E9@GP GDFCUG@9DF KD??A?;&F =CC9@9DFR =FQ %* GDFHAI@AI E9KK =K?D P=HA ?PUF@ KD??A? @P=@=IA 9FCABAFCAF@ DL @PA KD=C GUIIAF@R 9FGKUC9F: :=@A =FC JD@@DMBK=@A G=B=G9@DI ?E9@GP9F: KD??A?; (D@A @P=@ @PA GDF@IDK G9IGU9@IQLDI =F %* GDFHAI@AI E9KK =K?D GDF@I9JU@A @D ?PUF@ KD??R JU@ E9KKJA FA:KAG@AC PAIA ?9FGA @P9? KD?? 9? @QB9G=KKQ ?M=KK 9F GDMB=I49?DF @D @PA D@PAI KD??A? =@ @PA BDEAI KAHAK? @P=@ =IA @PA LDGU?DL @P9? B=BAI; )PA?A KD??A? G=F JA MDCAKAC =? ?PDEF 9F .9:; aREPAIA @PA ?AI9A? KD??A? =IA IABIA?AF@AC JQ =F ATU9H=KAF@ DU@BU@IA?9?@=FGA \8b]R \8c]R @PA ?PUF@ KD??A? JQ @PA B=I=KKAK IA?9?@DIR =FC @PA @I=F?LDIMAI IABIA?AF@? @PA 9CA=K HDK@=:A GDFHAI?9DF

I=@9D;&F DICAI @D AKUG9C=@A @PA IAK=@9DF?P9B JA@EAAF HDK@=:A I9BBKA

=GID?? @PA G=B=G9@DI =FC KD??R 9@ 9? 9MBDI@=F@ @D IAG=KK @P=@ MD?@LUKKQ 9F@A:I=@AC ?E9@GPAC4G=B=G9@DI GDFHAI@AI? E9KK JA CAK9HAI9F:BDEAI @D ?QFGPIDFDU? C9:9@=K G9IGU9@IQ; )PA BAILDIM=FGA DL G9I4GU9@? 9F ?QFGPIDFDU? C9:9@=K ?Q?@AM? 9? CA@AIM9FAC JQ @PA DBAI4=@9F: LIATUAFGQR EP9GP 9F @UIF 9? ?A@ JQ @PA M9F9MUM =HAI=:AHDK@=:A DHAI = GKDGV BAI9DC; %9FGA @PA GKDGV BAI9DC DL MD?@ C9:49@=K G9IGU9@? E9KK JA ?PDI@ 9F GDMB=I9?DF @D @PA %* GDFHAI@AI_??E9@GP9F: BAI9DCR @PA BAILDIM=FGA DL @PA?A G9IGU9@? 9? @QB9G=KKQ?9MBKQ ?A@ JQ @PA M9F9MUM HDK@=:A DL @PA ?UBBKQ I=9K \87];&F @P9? G=?AR @PA AL!G9AFGQ DL @PA GDFHAI@AI ?PDUKC JA G=KGUK=@ACIAK=@9HA @D @PA BDEAI @P=@ EDUKC P=HA JAAF GDF?UMAC JQ @PAKD=C 9L 9@ E=? GDF?@=F@KQ DBAI=@9F: =@ AW=G@KQ \87]; &F D@PAIEDIC?R @PA 9CA=K BDEAI GDF?UMAC JQ @PA KD=C 9?#

<8>

EPAIA ; +DEAHAIR CUA @D @PA HDK@=:A I9BBKA LIDM@PA GDFHAI@AIR =??UM9F: @P=@ @P9? I9BBKA 9? IAK=@9HAKQ ?M=KK GDM4B=IAC @D @PA FDM9F=K HDK@=:AR @PA =HAI=:A BDEAI C9??9B=@AC JQ@PA KD=C DHAI DFA ?E9@GP9F: GQGKA DL @PA GDFHAI@AI 9? =BBIDW94M=@AKQ#

<7>

EPAIA 9? @PA DU@BU@ HDK@=:A I9BBKA <CUA @D @PA DBAI=@9DF DL@PA GDFHAI@AI> =FC ;2K@PDU:P 9? 9FCAAC C9??9B=@AC JQ @PA KD=CR =FQ BDEAI

GDF?UMAC JAQDFC ?PDUKC JA GDUF@AC =? KD?? ?9FGA @P9?=CC9@9DF=K BDEAI CDA? FD@ GDF@I9JU@A @D =F 9FGIA=?A 9F BAILDI4M=FGA; &F DICAI @D TU=F@9LQ @P9? KD??R EA FAAC @D G=KGUK=@A=FC d =? ?PDEF 9F .9:; 7<J>R LDI @PA 7#8 GDFHAI@AI GDF?9C4AIAC PAIAR 9? KDEAI @P=F @PA 9CA=K DU@BU@ HDK@=:A JQ

#

<b>

Switched Capacitor DC-‐‑DC Converter

•  Advantages: - Fully integrated on a chip - Smaller area (no inductive components) - Large power density •  Disadvantages:

- Discrete output voltage - Low efficiency

ñ

Vin

Cfly

Φ1

Φ1 Φ2

Φ2

Φ2

Vout

Vout

Φ1 Vout Vin


0.43V

0.55V

0.78V

0.96V

Vin [V] ra)o Vout [V]

1 1/2 0.5

1 2/3 0.67

1 1/3 Too small

H-‐‑P. Le, ISSCC 2011

Discrete points solution

1.8 1/2 0.9

Range[V] 0.45 – 0.55

0.55 – 0.71

Too small

0.79 – 0.96


Efficiency solution

Efficiency can be improved by more than 10%!

Loss conventional our approach


DVFS details

Vin Vout

Energy

Delay

Vref

Vout=1/2 VinVout=2/3 Vin ...

Configuration

Fcpu_av

Clock gen.Clock gen.

DC-‐DC

TRC

...EDSEDS

uP

Tracking

Vo

clk

•  Set the ripple size at the optimal energy-‐‑delay point


Overview


•  Dynamic voltage and frequency scaling •  Control voltage through DC-DC converters

•  Conclusions


Summary •  Process variations introduce difficulties in circuit

design •  Power is a critical design constraint

•  Observation circuits needed in order to avoid too conservative design decisions (EDS and TRC)

•  Multicore/manycore architectures open spatial

dimension for energy optimization through DVFS

•  Fine granularity V-F control enabled through performance observation and DC-DC converters


Acknowledgement

Marie Curie FP7 People program


Designingenergyefficient’ microprocessor:Howtofight ...Designingenergyefficient’...

Documents

Transcript of Designingenergyefficient’ microprocessor:Howtofight ...Designingenergyefficient’...