ACM International Conference on Computing …...Power and Clock Gating Modelling in Coarse Grained...

1
* [ ) ( _ ) ( ) ( iON LR actors i i ON i T LR Over Ext LR P LR P i Technological parameters The targeted technology decides the ratio between the two power consumption terms. As transistors get smaller the contribution of the static contribute gets larger and not negligible anymore. Parameters Specification Functional parameters Input data impose the behaviour of the reconfigurable design and the activation time of LRs. A in1 D in2 F in3 E G B Sbox_1 Sbox_2 out1 C Sbox_0 ID net = 3 LUT α β γ 1 1 0 Power Estimation Model. Power and Clock Gating Modelling in Coarse Grained Reconfigurable Systems ACM International Conference on Computing Frontiers 2016 The research leading to these results has received funding from RPCT (L.R. 7/2007, CRP-18324) project. Abstract In the context of Coarse Grained Reconfigurable systems we propose a way to model power and clock gating costs based on the parameters of the baseline CGR system. The proposed flow guides designers towards optimal implementations, saving designer effort and time. The model is assessed by adopting a reconfigurable core for FFT targeting an ASIC 90 nm technology. Multi-Dataflow Composer Tool The Multi-Dataflow Composer (MDC) tool provides automatic datapath merging starting from different Dataflow Descriptions (DDs). It identifies the minimum set of Functional Units (FUs) always active together, called Logic Regions (LRs), and blindly applies to all of them: CG: by inserting a CG cell for each LR. PG: by inserting all the necessary PG logic. sleep transistors to switch on/off the power supply. isolation cells to avoid the transmission of spurious signals in input to the normally-on cells. state retention logic to maintain the internal state of the gated region. Coarse Grained Reconfigurable (CGR) Systems CGR systems offer high-performance and flexibility. However their efficient design is complex and they may require large power consumption due to the fact that not all the resources are involved in the computation. A in1 D in2 F in3 E G B Sbox_1 Sbox_2 out1 C Sbox_0 ID net=1 LUT Coarse Grained Reconfigurable Platform A B C D E C A F C G Dataflow Descriptions N:1 α β γ α β γ A in1 D in2 F in3 E G B Sbox_1 Sbox_2 out1 C LR1 LR2: Always ON LR LR3 LR4 LR5 LR4 RTNs RTNs RTNs RTNs RTNs RTNs ISOL ISOL ISOL ISOL Sbox_0 ISOL VDD SW VDD SW VDD SW ISOL VDD SW ID net Power Controller CG_en1 CG_en4 CG_en5 CG_en3 clock To save both dynamic and static power, Clock Gating (CG) and Power Gating (PG) can be respectively used. While all the main commercial synthesizers apply automatic CG, they do not provide automatic PG methodologies. 0 0 0 Methodology Assessment on ASIC Histograms report, for each LR, the estimated and real power consumption variation after the application of power and clock gating methodologies. Francesca Palumbo University of Sassari - PolComIng Information Engineering Unit [email protected] Tiziana Fanni, Carlo Sau, Paolo Meloni and Luigi Raffo University of Cagliari - Dept. of Electrical and Electronics Eng. (tiziana.fanni, carlo.sau, paolo.meloni, luigi)@diee.unica.it ID_net Net Sbox_ 0 Sbox_ 1 Sbox_ 2 LRs 1 α 0 0 0 LR1, LR2, LR4 2 β X X 1 LR2, LR3 3 γ 1 1 0 LR2, LR4, LR5 Architectural parameters LRs composition imposes the number of gates and their typology (combinatorial and sequential cells), determining the region basic consumption. State in1 out clk_LR3 feedback Comb Comb F G clk_LR3 in2 LR3 ] ) [ ) ( _ ) ( ) ( LR actors i i ON i LR Over Ext LR P LR P i Power Gating model. Clock Gating model. NO YES NO YES NO YES NO YES For each PDi i<N+1 AREA THRESHOLDING ESTIMATE PG ESTIMATE CG Is Saving % Positive Is Saving % Positive Select PDi to be Implement with PG Select PDi to be Implement with CG START i=0 Increment i STOP Is Above? 1 2 3 P ON (LRi): power consumption inside considered region. Ext_Over(LRi): power consumption due to the additional logic inserted outside the region. Term not present in the static estimation model of clock gated designs! APPLICATION # KERNEL #LRs FFT 4 8 host processor FFT coprocessor INTERCONNECTION LAYER A CGR coprocessor for image processing, involving eight computational kernels, has been assembled with MDC. Ti ON : percentage activation time. #reg: number of sequential elements. Area%: LR’s percentage area wrt total area. #rtn: number of retention cells. #iso(e): estimated number of isolation cells. #iso(r): real number of isolation cells. LRs Ti ON #reg Area% #rtn #iso(e) #riso(r) LR1 0.33 1024 62.44 1024 1024 1024 LR2 0.67 512 0.46 3 1112 1032 LR3 0.37 256 15.84 256 512 512 LR4 0.04 0 0.41 0 2324 2306 LR5 0.21 0 0.25 0 1948 1926 LR6 0.42 0 0.43 0 1756 1740 LR7 0.58 128 7.93 128 256 256 LR8 0.96 512 1.36 512 5259 4934 Power Estimation Flow Base PG_full CG_full PG_1: area_th 5% PG_2_ area_th 10% All LRs ON All LRs power gated All LRs clock gated PG LRs: 1, 3, 7 CG LRs: 2, 8 PG LRs: 1, 3 CG LRs: 2, 8,7 To assess the efficiency of the proposed model we compare different designs: Histograms report results for the different designs. PG_1 has the best tradeoff between power saving and area overhead. Full_CG is saves larger dynamic power but has less benefit in terms of total power consumption. Advantage of the Proposed Approach The proposed approach requires uniquely one synthesis and n different simulations. Without the proposed models, the identification of the most convenient PDs to be switched off: 1) to implement one power gated and one clock gated design for each PD, and 2) to compare them with respect to the baseline CGR. Retention Cells Isolation Cells Power Controller Clock Gating Cells Combinational Logic Sequential Logic Standard Cost Overhead http://sites.unica.it/rpct/

Transcript of ACM International Conference on Computing …...Power and Clock Gating Modelling in Coarse Grained...

Page 1: ACM International Conference on Computing …...Power and Clock Gating Modelling in Coarse Grained Reconfigurable Systems ACM International Conference on Computing Frontiers 2016 The

]*)(*)([

]*)(*)([

]*)*#(*)*#([

*]/#)#(#*)()*#()([

)(_)()(

iOFFOFFiONON

iOFFOFFiONON

iOFFOFFiONON

iON

LRactors

iiONi

TCGPTCGP

TContrPTContrP

TisoISOPTisoISOP

TregrtnregregPrtnRCPcmbP

LROverExtLRPLRP

i

Technological parameters

The targeted technology decides

the ratio between the two power

consumption terms. As transistors

get smaller the contribution of

the static contribute gets larger

and not negligible anymore.

Parameters Specification

Functional parameters

Input data impose the behaviour of

the reconfigurable design and the

activation time of LRs.

Ain1

Din2

Fin3

E

G

B Sbo

x_1

Sbo

x_2

out1C

Sbo

x_0

ID net = 3

LUT

α

β

γ

1 1

0

Power Estimation Model.

Power and Clock Gating Modelling in

Coarse Grained Reconfigurable Systems

ACM International Conference on Computing Frontiers 2016

The research leading to these results has received funding from RPCT (L.R. 7/2007, CRP-18324) project.

AbstractIn the context of Coarse Grained Reconfigurable systems we propose a way to model power and clock gating costs based on the parameters of the baseline CGR system. The proposed flow

guides designers towards optimal implementations, saving designer effort and time. The model is assessed by adopting a reconfigurable core for FFT targeting an ASIC 90 nm technology.

Multi-Dataflow Composer Tool

The Multi-Dataflow Composer (MDC) tool

provides automatic datapath merging starting

from different Dataflow Descriptions (DDs). It

identifies the minimum set of Functional Units (FUs)

always active together, called Logic Regions

(LRs), and blindly applies to all of them:

CG: by inserting a CG cell for each LR.

PG: by inserting all the necessary PG logic.

• sleep transistors to switch on/off the power

supply.

• isolation cells to avoid the transmission of

spurious signals in input to the normally-on

cells.

• state retention logic to maintain the internal

state of the gated region.

Coarse Grained Reconfigurable (CGR) Systems

CGR systems offer high-performance and flexibility. However

their efficient design is complex and they may require large

power consumption due to the fact that not all the resources are

involved in the computation.

Ain1

Din2

Fin3

E

G

B Sbo

x_1

Sbo

x_2

out1C

Sbo

x_0

ID net=1

LUT

Coarse Grained Reconfigurable Platform

A B C

D E C

A

FCG

Dataflow Descriptions

N:1

α

β

γ

α

β

γ

Ain1

Din2

Fin3

E

G

B Sbo

x_1

Sbo

x_2

out1C

LR1

LR2: Always ON LR

LR3

LR4

LR5

LR4RTNs

RTNs

RTNs

RTNs

RTNs

RTNsISOL

ISOL

ISOLISOL

Sbo

x_0

ISOL

VDD

SW

VDD

SW

VDD

SW

ISOL

VDD

SW

ID net

PowerController

CG_en1

CG_en4

CG_en5

CG_en3

clock

To save both dynamic and static power, Clock Gating (CG) and

Power Gating (PG) can be respectively used. While all the main

commercial synthesizers apply automatic CG, they do not

provide automatic PG methodologies.

0 0

0

Methodology Assessment on ASIC

Histograms report, for each LR, the estimated and real power consumption variation after

the application of power and clock gating methodologies.

Francesca Palumbo

University of Sassari - PolComIng Information Engineering Unit

[email protected]

Tiziana Fanni, Carlo Sau, Paolo Meloni and Luigi Raffo

University of Cagliari - Dept. of Electrical and Electronics Eng.

(tiziana.fanni, carlo.sau, paolo.meloni, luigi)@diee.unica.it

ID_net Net Sbox_

0

Sbox_

1

Sbox_

2

LRs

1 α 0 0 0 LR1, LR2, LR4

2 β X X 1 LR2, LR3

3 γ 1 1 0 LR2, LR4, LR5

Architectural parameters

LRs composition imposes the number

of gates and their typology

(combinatorial and sequential cells),

determining the

region basic

consumption.

Statein1

out

clk_LR3

feedback

CombComb

FG

clk_LR3 in2

LR3

]*)(*)([

]*)()([

)(_)()(

iOFFOFFiONON

iON

LRactors

iiONi

TCGPTCGP

TregPcmbP

LROverExtLRPLRP

i

Power Gating model.

Clock Gating model.

NO

YES

NO

YES

NO

YES

NO

YES

For each PDi

i<N+1

AREA THRESHOLDING

ESTIMATE PG ESTIMATE CG

Is Saving % Positive

Is Saving % Positive

Select PDi to be Implement with PG

Select PDi to be Implement with CG

START i=0

Increment i

STOP

Is Above?

1

2 3

PON(LRi): power consumption inside considered region.

Ext_Over(LRi): power consumption due to the

additional logic inserted outside the region.

Term not present in the

static estimation model of

clock gated designs!

APPLICATION # KERNEL #LRs

FFT 4 8

hostprocessor

FFTcoprocessor

INTERCONNECTION LAYER

A CGR coprocessor

for image processing,

involving eight

computational

kernels, has been

assembled with

MDC.

TiON: percentage activation time.

#reg: number of sequential elements.

Area%: LR’s percentage area wrt total area.

#rtn: number of retention cells.

#iso(e): estimated number of isolation cells.

#iso(r): real number of isolation cells.

LRs TiON #reg Area% #rtn #iso(e) #riso(r)

LR1 0.33 1024 62.44 1024 1024 1024

LR2 0.67 512 0.46 3 1112 1032

LR3 0.37 256 15.84 256 512 512

LR4 0.04 0 0.41 0 2324 2306

LR5 0.21 0 0.25 0 1948 1926

LR6 0.42 0 0.43 0 1756 1740

LR7 0.58 128 7.93 128 256 256

LR8 0.96 512 1.36 512 5259 4934

Power Estimation Flow

Base PG_full CG_full PG_1: area_th 5% PG_2_ area_th 10%

All LRs ON All LRs power gated

All LRs clock gated PG LRs: 1, 3, 7 CG LRs: 2, 8 PG LRs: 1, 3 CG LRs: 2, 8,7

To assess the efficiency of the proposed model we compare different designs:

Histograms report results for the different

designs. PG_1 has the best tradeoff

between power saving and area

overhead. Full_CG is saves larger

dynamic power but has less benefit in

terms of total power consumption.

Advantage of the Proposed ApproachThe proposed approach requires uniquely one synthesis and n different simulations.

Without the proposed models, the identification of the most convenient PDs to be switched

off: 1) to implement one power gated and one clock gated design for each PD, and 2) to

compare them with respect to the baseline CGR.

Retention Cells

Isolation Cells

Power Controller

Clock Gating Cells

Combinational

Logic

Sequential

Logic

Standard Cost Overhead

http://sites.unica.it/rpct/