Designingenergyefficient’ microprocessor:Howtofight ...Designingenergyefficient’...
Transcript of Designingenergyefficient’ microprocessor:Howtofight ...Designingenergyefficient’...
Designing energy-‐‑efficient microprocessor: How to fight
process variations Ruzica Jevtic
Berkeley Wireless Research Center
Moore’s law
10
!""#$%&'()*'+ ,--.TransistorsTransistors
Per DiePer Die
101099
10101010
,./!,./!.0,!.0,!
0101 ,1,1
2-2-2-2- 2-2/2-2/2-,2/2-,2/
32/4'5#"6$&&"#32/4'5#"6$&&"#72/4'5#"6$&&"#72/4'5#"6$&&"#
5$89:;<5$89:;<=''=''5#"6$&&"#5#"6$&&"#5$89:;<5$89:;<='='>>'5#"6$&&"#>>'5#"6$&&"#
5$89:;<5$89:;<== >>>'5#"6$&&"#>>>'5#"6$&&"#5$89:;<5$89:;<== 7'5#"6$&&"#7'5#"6$&&"#
>9)8:;<>9)8:;<4'4'5#"6$&&"#5#"6$&&"#101088
101077
101066
101055
101044
1010332--22--2
>9)8:;<>9)8:;<4'4',, 5#"6$&&"#5#"6$&&"#
0?0?7?7?
/7?/7?,./?,./?
0!0!
0/!0/!7!7!
/7!/7!,./!,./!
0,2!0,2!
0/?0/?
0@
7--77--71010
101022
101011
101000
2--22--21965 Data (Moore)1965 Data (Moore)
MicroprocessorMicroprocessorMemoryMemory
19601960 19651965 19701970 19751975 19801980 19851985 19901990 19951995 20002000 20052005 20102010
Source: IntelSource: IntelGraph from S.Chou, ISSCC’2005
!""#$%&'A)*')8B'6"&9
,-
13
!"#$%&'()*+,-./0+12
34
)*01/-5(627()8920.:;(<:/-10000100001010
Nominal feature sizeNominal feature size
mm
10001000
100100
11
0.10.1
nmnm130nm130nm90nm90nm
70nm70nm50nm50nm
Gate LengthGate Length 65nm65nm45nm45nm
32nm32nm22nm22nm
0.7X every 2 years
180nm180nm250nm250nm
3=Source: Intel, IEDM presentations
10100.010.01
50nm50nm35nm35nm
19701970 19801980 19901990 20002000 20102010 20202020
~30nm~30nm
)8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D
Scaling consequences: • Process variations • Power has become critical!
(c) Ruzica Jevtic 2012
Scaling consequences
• Fabrication: - Litography with liquids
- Use of difraction…
• Consequences: - Random dopant fluctuations - Hot-carrier trapping…
• Additional steps: - Optical Proximity Check
• Design rules: - No corners - One direction lines…
J. Hartmann, ISSCC 2007 (c) Ruzica Jevtic 2012
11
!"#$%&'(%!)&'*+,"-./0&$-1
23
+,"-./0&$-1*!)&4,'/10000100001010
Nominal feature size scaling
mm
10001000
100100
11
0.10.1
nmnm
130nm130nm90nm90nm
65nm65nm
180nm180nm250nm250nm
365nm365nm248nm248nm
193nm193nm
22
10100.010.0119701970 19801980 19901990 20002000 20102010 20202020
45nm45nm32nm32nm
22nm22nm EUV 13nmEUV 13nm
567*8 9#)-'.4./1*.:*"-#*:;";0#*<:.0#=#0>?
Litography
• Litography issues: 193nm wavelength for
lines as small as 30nm!
12
!"#$%&'()(*+,-./0,-1+2&3-4/0+-,.3215(6,(7.,-21"+-.&.+&3
193nm lightMask
193nm light
Lightintensity
Lightintensity
89
!"#$%&'()(*+,-./0,-1+2&3-4
1CD kNA
:(62(&;(.Presently: 193 nm (ArF excimer laser)(Distant?) future: EUV
<*62(&;(.NA =.n;0*Maximum n is 1 in airPresently: ~0 92-1 35 min 1
1930.25 50
0 92nmCD k nm
NA
8>
Presently: 0.92-1.35Immersion
?(;"),@.!-20*A0*+.ABPresently: 0.35 – 0.4Theoretical limit: 0.25
0.92NA
45nm technology beyond resolution limit
(c) Ruzica Jevtic 2012
Process variations
• Process corners: typical, fast and slow • All corners coexist on the same wafer • If we design for the worst case, it is too pessimistic! • Need to know after fab chip features observation circuits
B. Nikolic, TCAS-‐‑I, 2011
(c) Ruzica Jevtic 2012
Moore’s law
10
!""#$%&'()*'+ ,--.TransistorsTransistors
Per DiePer Die
101099
10101010
,./!,./!.0,!.0,!
0101 ,1,1
2-2-2-2- 2-2/2-2/2-,2/2-,2/
32/4'5#"6$&&"#32/4'5#"6$&&"#72/4'5#"6$&&"#72/4'5#"6$&&"#
5$89:;<5$89:;<=''=''5#"6$&&"#5#"6$&&"#5$89:;<5$89:;<='='>>'5#"6$&&"#>>'5#"6$&&"#
5$89:;<5$89:;<== >>>'5#"6$&&"#>>>'5#"6$&&"#5$89:;<5$89:;<== 7'5#"6$&&"#7'5#"6$&&"#
>9)8:;<>9)8:;<4'4'5#"6$&&"#5#"6$&&"#101088
101077
101066
101055
101044
1010332--22--2
>9)8:;<>9)8:;<4'4',, 5#"6$&&"#5#"6$&&"#
0?0?7?7?
/7?/7?,./?,./?
0!0!
0/!0/!7!7!
/7!/7!,./!,./!
0,2!0,2!
0/?0/?
0@
7--77--71010
101022
101011
101000
2--22--21965 Data (Moore)1965 Data (Moore)
MicroprocessorMicroprocessorMemoryMemory
19601960 19651965 19701970 19751975 19801980 19851985 19901990 19951995 20002000 20052005 20102010
Source: IntelSource: IntelGraph from S.Chou, ISSCC’2005
!""#$%&'A)*')8B'6"&9
,-
13
!"#$%&'()*+,-./0+12
34
)*01/-5(627()8920.:;(<:/-10000100001010
Nominal feature sizeNominal feature size
mm
10001000
100100
11
0.10.1
nmnm130nm130nm90nm90nm
70nm70nm50nm50nm
Gate LengthGate Length 65nm65nm45nm45nm
32nm32nm22nm22nm
0.7X every 2 years
180nm180nm250nm250nm
3=Source: Intel, IEDM presentations
10100.010.01
50nm50nm35nm35nm
19701970 19801980 19901990 20002000 20102010 20202020
~30nm~30nm
)8920.:;(>:/-(;-1>/8(?(1+@01:;(A-:/B*-(20C-(:A/-*(331@D
Scaling consequences: • Process variations • Power has become critical!
(c) Ruzica Jevtic 2012
Scaling issues: Power
• Operating voltage scaled to avoid device breakdown • Not scaled enough to keep the performance boosting up
• Power has become a critical design constraint
3
Frequency Trends in Intel's Microprocessors
10000Pentium 4 Core2
!"#$%#&'(
10
100
1000
requ
ency
[MH
z]
808680286
386DX
486DX 486DX4
PentiumPentium Pro
Pentium II
Pentium MMX
Pentium III
ItaniumItanium II
Core2
)
0.1
1
1970 1975 1980 1985 1990 1995 2000 2005
Fr
4004
8008
8080
8088 Has been doublingevery 2 years, but is now slowing down
*+,#"-./00/123/+&
Power Trends in Intel's Microprocessors
1000
Has been > doublingevery 2 years
10
100
Pow
er [W
]
808680286 486DX
Pentium
Pentium Pro
Pentium II
Pentium IIIPentium 4
ItaniumItanium II Core 2
4
Has to stay ~constant
0.1
1
1970 1975 1980 1985 1990 1995 2000 2005
4004
8008 80808088
386DX
(c) Ruzica Jevtic 2012
Power issues • Power became an issue before variations: - battery life for mobile devices
- power limits performance! - performance is what sells the product!
• Power classification: - Dynamic power (60%)
- Static power (30%) exp(Vdd)
P = sw ! f !Vdd2 !C
(c) Ruzica Jevtic 2012
Power-‐‑performance trade-‐‑offs • The best way to reduce power (both leakage and
active) is to reduce the power supply
• How to maintain throughput under reduced supply?
• Introducing more parallelism/pipelining • Dynamic voltage scaling with variable throughput
Energy
Performance (c) Ruzica Jevtic 2012
Overview
• Dynamic voltage and frequency scaling
• Control voltage through DC-DC converters • Conclusions
• Error detection and correction circuits
(c) Ruzica Jevtic 2012
Impact of Dynamic Variations
MSFF D
CLK
Q MSFF
CLK
VCC Droop
CLK
D
D
Setup Time Timing with VCC Droop
Timing with Nominal VCC
Timing Guardband
Ø Guardbands required to ensure correct operation within the presence of dynamic variations
K. Bowman, CMOS Emerging Tech. Workshop, 2009 (c) Ruzica Jevtic 2012
Timing-‐‑Error Detection Error-‐‑Detection Sequential (EDS)
D
CLK
Q
ERROR
MSFF
LATCH
MSFF
CLK
VCC Droop
CLK
D
ERROR
Data Arrives Late
Error Detected
[1] P. Franco, et al., VLSI Test Symp., 1994. [2] M. Nicolaidis, VLSI Test Symp., 1999. [3] D. Ernst, et al., MICRO, 2003.
RAZOR
(c) Ruzica Jevtic 2012
Recovery mechanism DAS et al.: A SELF-TUNING DVS PROCESSOR USING DELAY-ERROR DETECTION AND CORRECTION 795
Fig. 3. Distributed pipeline recovery mechanism.
corrupting the state of the shadow latch. Delay buffers are re-quired to be inserted in those paths which fail to meet this min-imum path delay constraint imposed by the shadow latch. Theinsertion of delay buffers incurs power overhead because ofthe extra capacitance added. A large shadow latch samplingdelay requires a greater number of delay buffers to be inserted,thereby increasing the power overhead. However, a small sam-pling delay implies that the voltage difference between the pointof first failure and the point where shadow latch fails is less and,thus, reduces the voltage margin available through Razor timingspeculation. Hence, the shadow latch sampling delay representsthe tradeoff between power overhead due to delay buffers andthe voltage margin available for Razor subcritical mode of op-eration. Using suitable clock chopping techniques, the durationof the positive phase of the propagated clock can be configuredas required so as to exploit the above tradeoff.
A key point to note is the fact that the hold constraint im-posed by the shadow latch only limits the maximum durationof the positive clock phase and has no bearing upon the clockfrequency. Thus, a “Razor”-ed pipeline can still be operated atany frequency as required as long as the positive clock phase issufficient to meet the minimum path delay constraint. In our de-sign, for a sampling delay of 3.0 ns which is approximately halfthe cycle time at 140 MHz, it was required to add 2388 delaybuffers to satisfy the short path constraint on 207 RFFs (7.4%of the total number of flip-flops). The power overhead due tothese buffers was less than 3% of the nominal chip power.
Correct pipeline state is recovered in the event of a timingerror by engaging a distributed pipeline recovery mechanism, asdescribed in [1], which is based on a counter-flow pipeline archi-tecture [9]. The primary requirement of the recovery mechanismis to prevent corrupt state being committed to storage in memoryor the register file before being validated by Razor. In [1], wehave discussed two possible ways in which this can be achieved.A centralized pipeline recovery mechanism uses thesignal as a global clock-gating signal to stall the pipeline fora single cycle while the errant flip-flop recovers correct state.This incurs only a one-cycle recovery penalty but imposes sig-nificant timing restrictions on the signal which needsto be distributed through the entire chip in less than one cycle.In contrast, the distributed pipeline recovery mechanism placesnegligible restrictions on the cycle time at the expense of ex-tending recovery over several cycles.
Fig. 3 conceptually illustrates the working principle of thedistributed pipeline recovery mechanism. When a Razor erroroccurs, two actions are taken. First, the computation in the stage
following the errant stage is nullified by a “bubble” signal whichindicates to the next and subsequent stages that the pipeline slotis invalid. Second, a backward propagating flush train is trig-gered by asserting the stage identifier (ID) of the failing stage.In the following cycle, the correct value from the Razor shadowlatch data is injected back into the pipeline, allowing the errantinstruction to continue with its correct inputs. In addition, theflush train begins propagating the ID of the failing stage in theopposite direction of instructions. At each stage, the flush traininserts a bubble in the corresponding pipeline stage as well as inthe immediately preceding stage. (Two stages must be nullifiedbecause the main pipeline appears to move twice as fast rela-tive to the flush train.) When the flush ID reaches the start of thepipeline, the flush control logic restarts the pipeline at the in-struction following the errant instruction. In the event that mul-tiple stages experience errors in the same cycle, all will initiaterecovery but only the Razor error closest to write-back (WB)will complete. Earlier recoveries will be flushed by later ones.
III. TRANSISTOR-LEVEL DESIGN OF THE RFF
Fig. 4 shows the transistor level circuit schematic of the RFF.In the absence of a timing error, the RFF behaves as a standardpositive edge triggered flip-flop. The error comparator is a semi-dynamic XOR gate which evaluates when the data latched by theslave differs from that of the shadow in the negative clock phase.The error comparator shares its dynamic node with themetastability detector which evaluates in the positive phase ofthe clock when the slave output could become metastable. Thus,the RFF signal is flagged when either the metastabilitydetector or the comparator evaluate.
This, in turn, evaluates the dynamic gate to generate thesignal by ORing together the error signals of indi-
vidual RFFs (Fig. 5), in the negative clock phase. Thesignal incurs significant routing and gate capacitance as it isrouted to every flip-flop in the pipeline stage and needs to bedriven by strong drivers. For an RFF, the serves tooverwrite the master with the shadow latch data. Hence, theslave gets the correct data at the next positive edge.
The needs to be latched at the output of the dynamicOR gate so that it retains state during the next positive phase(recovery cycle) during which it disables the shadow latch toprotect state. In addition, the also disables all regular,non-“Razor”-ed flip-flops in the pipeline stage to preserve thestate that was latched in the errant cycle. This is required tomaintain the temporal consistency of all flip-flops in the pipelinestage. The stack of three pMOS transistors in the shadow latch
• Correct data restored in the following clock cycle • All previous pipeline stages have to be flushed • Program counter resumes at the next instruction
(c) Ruzica Jevtic 2012
Razor – issues
Good idea but has a lot of issues: - Duty cycle constraints the shortest path - Metastability at the output of the main FF - The longest paths changed if a latch is added to FF
D Q
D Q
Short path
Long path D Q
Shadow Latch
CLK
ERROR!
(c) Ruzica Jevtic 2012
Improvements
ü Lower clock power ü Data metastability solved -‐‑ Additonal buffers needed for short paths
ü No added buffers for short paths ü Data metastability detected w/o
additional detector - Detection window tuned in prefab
D Q
*
*
*
*
EN
ERROR
PG RISING
PG FALLING
LATCH
ERROR
ARM’10 Intel’09
(c) Ruzica Jevtic 2012
Tunable Replica Circuit (TRC)
EDS
Calibration Bits
error
Logic stages: Ø Inverter Ø NAND Ø NOR Ø Pass gates Ø Repeated interconnects
J. Tschanz, et al., Symp. VLSI Circuits, 2009. (c) Ruzica Jevtic 2012
TRC – cont’d
1132009 Symposium on VLSI Circuits Digest of Technical Papers
• Not so accurate as EDS, but no interfering with the longest path • If properly tuned, no recovery mechanism needed
(c) Ruzica Jevtic 2012
Observation circuits -‐‑ summary Advantages:
ü Reduce margins for process variations ü EDS:
• enable instruction recovery • detect errors in pipeline stages
ü TRC: • capture clock-to-data delay per pipeline stage • have low design overhead
Disadvantages: - EDS:
• Adding buffers for short paths • Metastability issues • The longest path is affected by additional circuits
- TRC: • Cannot detect local dynamic variations • Requires margin between TRC & the longest path • Requires post-silicon calibration (c) Ruzica Jevtic 2012
Overview
• Error detection and correction circuits
• Control voltage through DC-DC converters
• Conclusions
• Dynamic voltage and frequency scaling
(c) Ruzica Jevtic 2012
DVFS
5
Typical MPEG IDCT Histogram
9
DesiredThroughput
Compute-intensive andlow-latency processes
Processor Usage Model
g p
Maximum Processor Speed
10
timeSystem Idle
Background andhigh-latency processes
• Maximize Peak Throughput• Minimize Average Energy/operation
System Optimizations:BurdISSCC’00Burd, ISSCC’00
(c) Ruzica Jevtic 2012
DVFS – cont’d
7
1.0
LK
CMOS Circuits Track Over VDD
InverterRingOscRegFileSRAM
0.5
orm
aliz
ed m
ax. f
C
13
0VT 2VT 3VT 4VT
No
VDD
Delay tracks within +/- 10%BurdISSCC’00
Vary !CLK,VDD1 2 Dynamically adapt
Dynamic Voltage Scaling (DVS)
Vary !CLK,VDD
Delivered
Throughput
1 2 Dynamically adapt
14
time
• Dynamically scale energy/operation with throughput.• Always minimize speed minimize average energy/operation.• Extend battery life up to 10x with the exact same hardware!
BurdISSCC’00
Burd, ISSCC’00
(c) Ruzica Jevtic 2012
Traditional DVFS
• Traditional DVFS use a ring oscillator for frequency detection (Xscale, PowerPC, Pentium M, …)
• Impossible to use nowadays: ring oscillator frequency change does not reflect the cpu frequency change
• Different paths have different behavior with Vdd change
Counter
+ -‐‑ Fcpu_av
DC-‐‑DC
uP Ring Oscillator
... fcpu
Burd, ISSCC’00
(c) Ruzica Jevtic 2012
Resilient DVFS
• Resilient DVFS use TRCs and EDS as freq. indicator • Two options: TRC+EDS or EDS with recovery
Fcpu_av DC-‐‑DC
uP
CLK Cntrl
EDS with recovery Slow down
J. Tschanz, et al., Symp. VLSI Circuits, 2009.
Slow down
Max. path delay Speed up
EDS TRC
(c) Ruzica Jevtic 2012
RAVEN microprocessor
• Implemented in the newest technology: 28nm! (apple i5/i7 cores in 32nm, altera chips in 28nm)
• Mobile applications: battery life important!
• Manycore architecture: exploiting parallelism! • Error detection circuits for observation:
improvement over ARM architecture! • Unconventional DVFS scheme
(c) Ruzica Jevtic 2012
Motivation -‐‑ DVFS • Single core
E
Perf
" Two cores
E
Perf
" Performance dictated by the slower core " In a conventional DVFS synchronous system
(c) Ruzica Jevtic 2012
Goal • Manycore
E
Perf
" Make sure to operate at the optimal energy-performance point
Instead of operating here
Would like to operate here
10x
(c) Ruzica Jevtic 2012
Per-Core Supply/Clock Control
PLL
EDS
EDS
EDS
EDS
FIFO
FIFO FIFO
FIFO
DC-‐‑DC DC-‐‑DC
DC-‐‑DC DC-‐‑DC
(c) Ruzica Jevtic 2012
DVFS on manycore processor
• Allow the ripple at the output and track the voltage through clock generation for beler energy-‐‑efficiency
Control Clock
Generator
fdesired
DC-‐‑DC
Tracking
TRC
EDS uP
Delay
Energy
(c) Ruzica Jevtic 2012
Overview
• Error detection and correction circuits
• Dynamic voltage and frequency scaling
• Conclusions
• Control voltage through DC-DC converters
(c) Ruzica Jevtic 2012
DC-‐‑DC converters
Two phases: 1. Loading the energy
from the battery 2. Transferring loaded
energy to the output
(c) Ruzica Jevtic 2012
!" !" #$%# $"%&'( )"*+(&,-"% ./0 .-!!1 &()"'02)"$ %3&)*+"$4*252*&)/0 $*4$* */(6"0)"0% 7878
.9:; 7; <=> 2 7#8 ?@AB4CDEF %* $*4$* GDFHAI@AI =FC <J> 9@? DBAI=@9DF=K E=HALDIM?;
&&; %* */(6"0)"0 !/%% 2(2!1%&% 2($ /5)&N&O2)&/(
&F DICAI @D =GP9AHA @PA DB@9M=K @I=CADLL JA@EAAF BDEAI CAF4?9@Q =FC AL!G9AFGQR 9F @P9? ?AG@9DF EA E9KK !I?@ =F=KQSA @PA DBAI4=@9DF =FC KD?? MAGP=F9?M? DL %* GDFHAI@AI?; )P9? =F=KQ?9? E9KKKA=C @D CA?9:F ATU=@9DF? LDI ?E9@GP9F: LIATUAFGQ =FC ?E9@GPE9C@P @P=@ M9F9M9SA KD??A? 9F = :9HAF @AGPFDKD:Q =FC BDEAICAF?9@Q; %9FGA = ?9F:KA4@DBDKD:Q %* GDFHAI@AI 9? DFKQ AL!G9AF@EPAF :AFAI=@9F: =F DU@BU@ HDK@=:A E9@P9F = K9M9@AC I=F:AR @P9??AG@9DF =K?D CA?GI9JA? = ?9MBKA CA?9:F ?@I=@A:Q LDI AF=JK9F: IA4GDF!:UI=JKA @DBDKD:9A? =? EAKK =? BIAC9G@9F: @PA DHAI=KK AL!4G9AFGQ HAI?U? DU@BU@ HDK@=:A;
&% '(!)#"*+, +- # .#/($! .0 0+,1!)"!)
&F DICAI @D AKUG9C=@A @PA VAQ KD?? MAGP=F9?M? @P=@ E9KK ?A@ @PA@I=CADLL JA@EAAF GDFHAI@AI AL!G9AFGQ =FC =IA= <9;A;R BDEAI CAF4?9@Q>R EA E9KK JA:9F JQ AW=M9F9F: @PA DBAI=@9DF DL @PA 7#8 ?@AB4CDEF GDFHAI@AI ?PDEF 9F .9:; 7<=>; %E9@GPAC4G=B=G9@DI $*4$*GDFHAI@AI? @QB9G=KKQ DBAI=@A 9F @ED BP=?A?R A=GP DL EP9GP 9CA=KKQP=? XYZ CU@Q GQGKA; 3P9KA 9@ 9? BD??9JKA @D DBAI=@A %* $*4$*GDFHAI@AI? =@ = !WAC ?E9@GP9F: LIATUAFGQ =FC U?A = H=I9=JKA CU@QGQGKA @D =C[U?@ @PA DU@BU@ 9MBAC=FGA DL @PA GDFHAI@AI \7Y]R \78]R=? E9KK JA ?PDEF K=@AI 9F DUI =F=KQ?9?R M=W9MUM AL!G9AFGQ G=FDFKQ JA =GP9AHAC JQ DB@9M9S9F: ?E9@GP9F: LIATUAFGQ =FC DBAI4=@9F: E9@P XYZ CU@Q GQGKA @D M=W9M9SA @PA GP=I:A @I=F?LAI CU@QGQGKA;39@P @PA XYZ CU@Q GQGKA DBAI=@9DF ?PDEF 9F .9:; 7R CUI9F:
BP=?A R @PA "Q9F: G=B=G9@DI 9? GDFFAG@AC JA@EAAF @PA9FBU@ FDCA 69 =FC @PA DU@BU@ FDCA 6D; )PA GP=I:A CI=EF LIDM69 @PDU:P GP=I:A? @P9? G=B=G9@DI UB =FC "DE? @D @PA KD=C;&F BP=?A R 9? GDFFAG@AC JA@EAAF 6D =FC '($R =FC @PU?@PA GP=I:A BIAH9DU?KQ ?@DIAC DF @PA "Q9F: G=B=G9@DI 9? @I=F?4LAIIAC @D @PA DU@BU@; %9FGA @PA ?E9@GP9F: GQGKA 9? @QB9G=KKQ MUGP?M=KKAI @P=F @PA GP=I:A^C9?GP=I:A @9MA GDF?@=F@ <EP9GP 9? ?A@JQ >R @PA I=MB I=@A DL @PA HDK@=:A =GID?? @PA G=B=G9@DI 9?IAK=@9HAKQ GDF?@=F@R =FC PAFGA @PA KD=C G=F JA =BBIDW9M=@AC =?= GUIIAF@ ?DUIGA; 2? E9KK JA CA@=9KAC K=@AIR 9F DICAI @D M=W9M9SAAL!G9AFGQ 9@ 9? CA?9I=JKA @D U@9K9SA =KK =H=9K=JKA G=B=G9@=FGAE9@P9F@PA GDFHAI@AI 9@?AKL; )PAIALDIAR EA E9KK =??UMA @P=@ @PAIA 9? FDAWBK9G9@ DU@BU@ !K@AI9F: G=B=G9@DIR EP9GP 9F @PA G=?A DL @PA ?9MBKA%* GDFHAI@AI CA?GI9JAC ?D L=IR M=VA? @PA BA=V4@D4BA=V HDK@=:AI9BBKA =GID?? @PA G=B=G9@DI =FC @PA GDFHAI@AI_? DU@BU@ ATU=KR =??PDEF 9F .9:; 7<J>; )P9? HDK@=:A I9BBKA P=? = C9IAG@ 9MBK9G=@9DF DF@PA KD??`=FC PAFGA @PA =GP9AH=JKA AL!G9AFGQ`DL @PA GDFHAI@AI;
2% 3+44 &,#$54*4
)PA HDK@=:A I9BBKA =GID?? @PA G=B=G9@DI? ?G=KA? E9@P @PA KD=CGUIIAF@R =FC =? E9KK JA CA?GI9JAC ?PDI@KQR E9KK @PAIALDIA =BBA=I
=? = LDIM DL ?AI9A? KD?? ?9M9K=I @D @PA ?E9@GP GDFCUG@9DF KD??A?;&F =CC9@9DFR =FQ %* GDFHAI@AI E9KK =K?D P=HA ?PUF@ KD??A? @P=@=IA 9FCABAFCAF@ DL @PA KD=C GUIIAF@R 9FGKUC9F: :=@A =FC JD@@DMBK=@A G=B=G9@DI ?E9@GP9F: KD??A?; (D@A @P=@ @PA GDF@IDK G9IGU9@IQLDI =F %* GDFHAI@AI E9KK =K?D GDF@I9JU@A @D ?PUF@ KD??R JU@ E9KKJA FA:KAG@AC PAIA ?9FGA @P9? KD?? 9? @QB9G=KKQ ?M=KK 9F GDMB=I49?DF @D @PA D@PAI KD??A? =@ @PA BDEAI KAHAK? @P=@ =IA @PA LDGU?DL @P9? B=BAI; )PA?A KD??A? G=F JA MDCAKAC =? ?PDEF 9F .9:; aREPAIA @PA ?AI9A? KD??A? =IA IABIA?AF@AC JQ =F ATU9H=KAF@ DU@BU@IA?9?@=FGA \8b]R \8c]R @PA ?PUF@ KD??A? JQ @PA B=I=KKAK IA?9?@DIR =FC @PA @I=F?LDIMAI IABIA?AF@? @PA 9CA=K HDK@=:A GDFHAI?9DF
I=@9D;&F DICAI @D AKUG9C=@A @PA IAK=@9DF?P9B JA@EAAF HDK@=:A I9BBKA
=GID?? @PA G=B=G9@DI =FC KD??R 9@ 9? 9MBDI@=F@ @D IAG=KK @P=@ MD?@LUKKQ 9F@A:I=@AC ?E9@GPAC4G=B=G9@DI GDFHAI@AI? E9KK JA CAK9HAI9F:BDEAI @D ?QFGPIDFDU? C9:9@=K G9IGU9@IQ; )PA BAILDIM=FGA DL G9I4GU9@? 9F ?QFGPIDFDU? C9:9@=K ?Q?@AM? 9? CA@AIM9FAC JQ @PA DBAI4=@9F: LIATUAFGQR EP9GP 9F @UIF 9? ?A@ JQ @PA M9F9MUM =HAI=:AHDK@=:A DHAI = GKDGV BAI9DC; %9FGA @PA GKDGV BAI9DC DL MD?@ C9:49@=K G9IGU9@? E9KK JA ?PDI@ 9F GDMB=I9?DF @D @PA %* GDFHAI@AI_??E9@GP9F: BAI9DCR @PA BAILDIM=FGA DL @PA?A G9IGU9@? 9? @QB9G=KKQ?9MBKQ ?A@ JQ @PA M9F9MUM HDK@=:A DL @PA ?UBBKQ I=9K \87];&F @P9? G=?AR @PA AL!G9AFGQ DL @PA GDFHAI@AI ?PDUKC JA G=KGUK=@ACIAK=@9HA @D @PA BDEAI @P=@ EDUKC P=HA JAAF GDF?UMAC JQ @PAKD=C 9L 9@ E=? GDF?@=F@KQ DBAI=@9F: =@ AW=G@KQ \87]; &F D@PAIEDIC?R @PA 9CA=K BDEAI GDF?UMAC JQ @PA KD=C 9?#
<8>
EPAIA ; +DEAHAIR CUA @D @PA HDK@=:A I9BBKA LIDM@PA GDFHAI@AIR =??UM9F: @P=@ @P9? I9BBKA 9? IAK=@9HAKQ ?M=KK GDM4B=IAC @D @PA FDM9F=K HDK@=:AR @PA =HAI=:A BDEAI C9??9B=@AC JQ@PA KD=C DHAI DFA ?E9@GP9F: GQGKA DL @PA GDFHAI@AI 9? =BBIDW94M=@AKQ#
<7>
EPAIA 9? @PA DU@BU@ HDK@=:A I9BBKA <CUA @D @PA DBAI=@9DF DL@PA GDFHAI@AI> =FC ;2K@PDU:P 9? 9FCAAC C9??9B=@AC JQ @PA KD=CR =FQ BDEAI
GDF?UMAC JAQDFC ?PDUKC JA GDUF@AC =? KD?? ?9FGA @P9?=CC9@9DF=K BDEAI CDA? FD@ GDF@I9JU@A @D =F 9FGIA=?A 9F BAILDI4M=FGA; &F DICAI @D TU=F@9LQ @P9? KD??R EA FAAC @D G=KGUK=@A=FC d =? ?PDEF 9F .9:; 7<J>R LDI @PA 7#8 GDFHAI@AI GDF?9C4AIAC PAIAR 9? KDEAI @P=F @PA 9CA=K DU@BU@ HDK@=:A JQ
#
<b>
Switched Capacitor DC-‐‑DC Converter
• Advantages: - Fully integrated on a chip - Smaller area (no inductive components) - Large power density • Disadvantages:
- Discrete output voltage - Low efficiency
ñ
Vin
Cfly
Φ1
Φ1 Φ2
Φ2
Φ2
Vout
Vout
Φ1 Vout Vin
(c) Ruzica Jevtic 2012
0.43V
0.55V
0.78V
0.96V
Vin [V] ra)o Vout [V]
1 1/2 0.5
1 2/3 0.67
1 1/3 Too small
H-‐‑P. Le, ISSCC 2011
Discrete points solution
1.8 1/2 0.9
Range[V] 0.45 – 0.55
0.55 – 0.71
Too small
0.79 – 0.96
(c) Ruzica Jevtic 2012
Efficiency solution
Efficiency can be improved by more than 10%!
Loss conventional our approach
(c) Ruzica Jevtic 2012
DVFS details
Vin Vout
Energy
Delay
Vref
Vout=1/2 VinVout=2/3 Vin ...
Configuration
Fcpu_av
Clock gen.Clock gen.
DC-‐DC
TRC
...EDSEDS
uP
Tracking
Vo
clk
• Set the ripple size at the optimal energy-‐‑delay point
(c) Ruzica Jevtic 2012
Overview
• Error detection and correction circuits
• Dynamic voltage and frequency scaling • Control voltage through DC-DC converters
• Conclusions
(c) Ruzica Jevtic 2012
Summary • Process variations introduce difficulties in circuit
design • Power is a critical design constraint
• Observation circuits needed in order to avoid too conservative design decisions (EDS and TRC)
• Multicore/manycore architectures open spatial
dimension for energy optimization through DVFS
• Fine granularity V-F control enabled through performance observation and DC-DC converters
(c) Ruzica Jevtic 2012
Acknowledgement
Marie Curie FP7 People program
(c) Ruzica Jevtic 2012