Design Space Exploration
2. The Hardware/Software Boundary

Prof. dr. ir. Dirk Stroobandt

Academic year 2004-2005

Dirk Stroobandt: Ontwerpmethodologie van Complexe Systemen, 2004-2005

Contents (part 1)

Introduction to embedded systems, System-on-Chip, and platform-based design

System specification techniques

Design space exploration
– Performance metrics
– The hardware/software boundary
  • Framework for architecture exploration
  • High-level transformations
  • Hardware/software partitioning
  • Exploration tools
– Prototypes, emulation, and simulation

[Figure: Design flow. Boxes shown: system specification, architecture exploration, platform design, hardware/software partitioning, component selection, communication, hardware design, high-level synthesis, logic design, physical design, analog design, software compilation, interface synthesis (between HW and SW), and testing. The label "Simulation and verification" appears twice alongside the flow.]


Framework for architecture exploration

• Given: initial specification
– Description of the system behavior at the algorithm level
– A set of requirements (speed, area, power, …)

• Required: system architecture specification
– Definition of architectural components and interfaces
– Initial partitioning of the algorithmic description into segments, each to be implemented by a different architectural component
– A floor plan of the system architecture
– Budgets for time, area, and power for the architectural components

Iterative approach

• Guided by performance analysis

How do you find good solutions heuristically?

• Types of hardware resources
– Resources inherent to the algorithm
  • Functional units (multipliers, adders, …)
  • Reasonably easy to estimate at system level
– Resources for implementation overhead
  • Everything else (control logic, buses, MUXes, registers, wires)
  • Not (or hardly) possible to estimate at system level (certainly not for dedicated hardware)

• Solution: a taxonomy based on the potential to exploit specific properties that generate minimal implementation overhead

Examples of properties with little implementation overhead

• Locality of computations
– High locality = computations are isolated and form a strongly interconnected subgraph
– Far more data flows between nodes inside the cluster than to the outside

• Regularity
– Repeated computation of certain function patterns
– Reuse of computation elements

• Short temporal dependencies
– Preferably locality in time: avoid global buses


Preparatory tasks

• Task-level concurrency management: which tasks should the final system contain?

• High-level transformations: transformations that are outside the scope of traditional compilers

Task-level concurrency management

• Granularity: size of tasks (e.g. in instructions)
• Readable specifications and efficient implementations may require different task structures.

This leads to granularity changes.

Merging of tasks

• Reduced overhead of context switches
• More global optimization of machine code
• Reduced overhead for inter-process/task communication

Splitting of tasks

• No blocking of resources while waiting for input
• More flexibility for scheduling, possibly an improved result

Merging and splitting of tasks

• The most appropriate task graph granularity depends upon the context, so both merging and splitting may be required.
• Merging and splitting of tasks should be done automatically, depending upon the context.

High-level optimizations

• To (potentially) improve the efficiency of embedded software and hardware

Floating-point to fixed-point conversion

• Pros
– Lower cost
– Faster
– Lower power consumption
– Sufficient SNR, if properly scaled
– Suitable for portable applications

• Cons
– Decreased dynamic range
– Finite word-length effects, unless properly scaled
  • Overflow and excessive quantization noise
– Extra programming effort

© Ki-Il Kum, et al. (Seoul National University): A Floating-point To Fixed-point C Converter For Fixed-point Digital Signal Processors, 2nd SUIF Workshop, 1996

Fixed-Point Data Format

• Floating-point vs. fixed-point
– Both have an exponent and a mantissa
– Floating-point
  • automatic computation and update of each exponent at run-time
– Fixed-point
  • implicit exponent
  • determined off-line

• Integer vs. fixed-point

[Figure: the same word with sign bit S shown (a) as an integer and (b) as a fixed-point number with a hypothetical binary point. In the example the integer word length is IWL=3; the bits after the binary point form the fractional word length (FWL).]

© Ki-Il Kum, et al
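The implicit exponent can be illustrated with a minimal C sketch. It assumes a 16-bit word with one sign bit, so that FWL = 15 - IWL; the helper names `fwl_of`, `to_fixed`, and `to_double` are hypothetical and are not taken from the cited converter.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: 16-bit word = 1 sign bit + iwl integer bits + fwl fractional bits. */
static int fwl_of(int iwl) {
    assert(iwl >= 0 && iwl <= 15);
    return 15 - iwl;
}

/* Quantize a real value. The exponent (-fwl) is implicit: it is chosen
   off-line and never stored in the word itself. */
static int16_t to_fixed(double value, int iwl) {
    double scaled = value * (double)(1L << fwl_of(iwl));
    scaled += (scaled >= 0.0) ? 0.5 : -0.5;       /* round to nearest */
    if (scaled >= 32767.0) return INT16_MAX;      /* saturate on overflow */
    if (scaled <= -32768.0) return INT16_MIN;
    return (int16_t)scaled;                       /* cast truncates toward zero */
}

/* Interpret the raw word again by applying the implicit exponent. */
static double to_double(int16_t raw, int iwl) {
    return (double)raw / (double)(1L << fwl_of(iwl));
}
```

With IWL=3, `to_fixed(2.5, 3)` stores 2.5 × 2^12, and a value such as 100.0, which does not fit in three integer bits, saturates. Saturation is a design choice of this sketch; wrap-around is the other common option.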

Assignment and Addition/Subtraction

• Assume y = x, with x (IWL=2) and y (IWL=3): x is shifted right by one bit (x>>1) before it is stored in y.

• Let result = x + y: the IWLs are equalized first (x>>1), then the aligned words are added.

[Figure: bit layouts of x, x>>1, y, and result, each with sign bit s.]

© Ki-Il Kum, et al
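The alignment step (x>>1 when y has the larger IWL) can be sketched in C as follows. This is an illustration under assumptions, not the cited converter's generated code: the helper `fixed_add` is hypothetical, words are 16 bits with one sign bit, and right shifts of negative values are assumed to be arithmetic, which C leaves implementation-defined.

```c
#include <assert.h>
#include <stdint.h>

/* Add x (integer word length iwl_x) and y (iwl_y). The operand with the
   smaller IWL is shifted right until both binary points line up. */
static int16_t fixed_add(int16_t x, int iwl_x, int16_t y, int iwl_y,
                         int *iwl_result) {
    assert(iwl_x >= 0 && iwl_x <= 15 && iwl_y >= 0 && iwl_y <= 15);
    int iwl = (iwl_x > iwl_y) ? iwl_x : iwl_y;
    int32_t x_aligned = (int32_t)x >> (iwl - iwl_x);  /* drops low fraction bits */
    int32_t y_aligned = (int32_t)y >> (iwl - iwl_y);
    *iwl_result = iwl;
    /* The sum can still overflow 16 bits; a real converter would have to
       widen the result IWL or saturate. This sketch does neither. */
    return (int16_t)(x_aligned + y_aligned);
}
```

For example, 1.5 stored with IWL=2 is the raw word 12288 and 2.0 stored with IWL=3 is 8192; the call returns 14336 with IWL=3, which is 3.5.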

Development Procedure

[Figure: development flow with the boxes floating-point C program, range estimation C program, execution, range estimator, manual specification, IWL information, floating-point to fixed-point C program converter, and fixed-point C program.]

© Ki-Il Kum, et al
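How range information could be turned into an IWL can be sketched in a few lines of C. The transcript does not say which rule the cited range estimator applies, so this sketch assumes plain peak tracking over values logged during a profiling run; `estimate_iwl` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>

/* Return the smallest IWL such that every observed |sample| < 2^IWL. */
static int estimate_iwl(const double *samples, size_t count) {
    assert(samples != NULL && count > 0);
    double max_abs = 0.0;
    for (size_t i = 0; i < count; i++) {
        double a = (samples[i] < 0.0) ? -samples[i] : samples[i];
        if (a > max_abs) max_abs = a;
    }
    int iwl = 0;
    double limit = 1.0;            /* 2^iwl */
    while (limit <= max_abs) {     /* grow until the peak fits */
        limit *= 2.0;
        iwl++;
    }
    return iwl;
}
```

A peak of 2.7 yields IWL=2, since 2.7 < 2^2. The result is only as good as the input data used for profiling, which is one reason a manual specification path is useful.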

Performance comparison: machine cycles

Benchmark               | Fixed-point (16b) | Fixed-point (32b) | Floating-point
Fourth-order IIR filter | 215               | not shown         | 2980
ADPCM                   | 26718             | 61401             | 125249

Performance comparison: SNR

[Figure: SNR in dB (axis 0 to 25) for ADPCM on cases A to D, with bars for fixed-point (16b), fixed-point (32b), and floating-point. The bar values are not recoverable from the transcript.]

© Ki-Il Kum, et al

Fridge

• RWTH Aachen, commercialized by Synopsys as part of the CoCentric tool suite.
• Uses the type definition features of C++ to define the types Fixed and fixed.
• Using types in declarations: fixed a, *b, c[8]
• Defining types in assignments: a = fixed(5,4,wt,*b)

The slide annotates the arguments of fixed(5,4,wt,*b): word length (5), fractional word length (4), wrap-around (w), and truncation (t).

Other work on the topic

• Fridge (RWTH Aachen), commercialized by Synopsys
• Some support in Simulink (MATLAB toolbox)
• Hundreds of papers on the topic

Simple loop transformations: Loop permutation

• Array p[j][k]
• Row major order
• Column major order

[Figure: memory layout of p[j][k]. In row major order all elements of j=0 (k=0, k=1, k=2, …) are stored first, followed by j=1 and j=2. In column major order all elements of k=0 (j=0, j=1, …) are stored first, followed by k=1 and k=2.]

Simple loop transformations: Loop permutation

• For row major order

Two loops, assuming row major order (C):

Poor cache behavior:

    for (k=0; k<=m; k++)
      for (j=0; j<=n; j++)
        p[j][k] = ...

Good cache behavior:

    for (j=0; j<=n; j++)
      for (k=0; k<=m; k++)
        p[j][k] = ...
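The two loop orders compute the same result and differ only in the order in which memory is touched. The following C sketch makes this concrete; the array sizes and function names are assumptions chosen for illustration.

```c
#include <assert.h>

enum { ROWS = 64, COLS = 48 };   /* hypothetical sizes */

/* Fill the array so that every element has a distinct value. */
static void fill(int p[ROWS][COLS]) {
    for (int j = 0; j < ROWS; j++)
        for (int k = 0; k < COLS; k++)
            p[j][k] = j * COLS + k;
}

/* Inner loop runs over k: consecutive addresses in a row-major C array. */
static long sum_row_order(int p[ROWS][COLS]) {
    long sum = 0;
    for (int j = 0; j < ROWS; j++)
        for (int k = 0; k < COLS; k++)
            sum += p[j][k];
    return sum;
}

/* Inner loop runs over j: each access jumps COLS elements ahead. */
static long sum_column_order(int p[ROWS][COLS]) {
    long sum = 0;
    for (int k = 0; k < COLS; k++)
        for (int j = 0; j < ROWS; j++)
            sum += p[j][k];
    return sum;
}
```

Timing the two functions on a large array is the usual way to observe the cache effect; the size of the difference depends on the cache organization and is not claimed here.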

Loop fusion, loop fission

Separate loops (after fission):

    for (j=0; j<=n; j++)
      p[j] = ... ;
    for (j=0; j<=n; j++)
      p[j] = p[j] + ...

Loops small enough to allow zero-overhead loops.

Single loop (after fusion):

    for (j=0; j<=n; j++) {
      p[j] = ... ;
      p[j] = p[j] + ...
    }

Better locality for access to p. Better chances for parallel execution.

Which of the two versions is best?
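A small C sketch shows that both forms compute the same array; the loop bodies (`2*j` and `+ 1`) are placeholders invented for the example, since the slide leaves them open.

```c
#include <assert.h>
#include <stddef.h>

/* Fission form: two passes over p. */
static void two_loops(int *p, size_t n) {
    assert(p != NULL);
    for (size_t j = 0; j < n; j++)
        p[j] = (int)j * 2;
    for (size_t j = 0; j < n; j++)
        p[j] = p[j] + 1;
}

/* Fusion form: one pass, each p[j] is reused while it is still hot. */
static void fused_loop(int *p, size_t n) {
    assert(p != NULL);
    for (size_t j = 0; j < n; j++) {
        p[j] = (int)j * 2;
        p[j] = p[j] + 1;
    }
}
```

Fusion is only legal when the second loop's iteration j does not depend on later iterations of the first loop, which holds here because each statement touches p[j] only.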

Loop unrolling

Original loop:

    for (j=0; j<=n; j++)
      p[j] = ... ;

Unrolled loop (factor = 2):

    for (j=0; j<=n; j+=2) {
      p[j] = ... ;
      p[j+1] = ...
    }

• Fewer branches per execution of the loop.
• More opportunities for optimizations.
• Tradeoff between code size and improvement.
• Extreme case: completely unrolled loop (no branch).
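Unrolling by a factor of 2 as written on the slide touches p[j+1] in every iteration, which only stays within bounds when the number of elements is even. A minimal C sketch with an explicit remainder step follows; the loop body `j*j` is a placeholder assumed for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reference version: one element per iteration. */
static void fill_plain(int *p, size_t count) {
    assert(p != NULL);
    for (size_t j = 0; j < count; j++)
        p[j] = (int)(j * j);
}

/* Unrolled by 2: half the branches, plus a remainder step for odd counts. */
static void fill_unrolled(int *p, size_t count) {
    assert(p != NULL);
    size_t j = 0;
    for (; j + 1 < count; j += 2) {
        p[j]     = (int)(j * j);
        p[j + 1] = (int)((j + 1) * (j + 1));
    }
    if (j < count)                 /* leftover element when count is odd */
        p[j] = (int)(j * j);
}
```

The remainder step costs a little code size, which is part of the tradeoff named in the list above.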

Loop tiling/loop blocking: original version

    for (i=1; i<=N; i++)
      for (k=1; k<=N; k++) {
        r = X[i][k]; /* to be allocated to a register */
        for (j=1; j<=N; j++)
          Z[i][j] += r * Y[k][j];
      }

Information in the cache for Y and Z is never reused if N is large or the cache is small (2 N³ references for Z and Y + N² references for X).

[Figure: access patterns of the three arrays, with axes labelled i++, j++, and k++.]

Loop tiling/loop blocking: tiled version

    for (kk=1; kk<=N; kk+=B)
      for (jj=1; jj<=N; jj+=B)
        for (i=1; i<=N; i++)
          for (k=kk; k<=min(kk+B-1,N); k++) {
            r = X[i][k]; /* to be allocated to a register */
            for (j=jj; j<=min(jj+B-1,N); j++)
              Z[i][j] += r * Y[k][j];
          }

Reuse factor of B for Z and Y; O(N³/B) accesses to main memory. The same elements are used for the next iteration of i.

[Figure: access patterns of the three arrays within a tile, with tile origins jj and kk and axes labelled i++, j++, and k++.]
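The tiled loop nest can be checked against the original one with a compact C sketch. It uses 0-based indices, as is idiomatic in C, rather than the 1-based indices on the slides. The sizes N=10 and B=4 are assumptions, picked so that the last tile is partial and the min() bound matters.

```c
#include <assert.h>

enum { N = 10, B = 4 };   /* hypothetical matrix size and tile size */

static int min_int(int a, int b) { return (a < b) ? a : b; }

/* Original loop nest: Z += X * Y. The caller must zero Z first. */
static void matmul_plain(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            int r = X[i][k];               /* candidate for a register */
            for (int j = 0; j < N; j++)
                Z[i][j] += r * Y[k][j];
        }
}

/* Tiled loop nest: works on B x B blocks of Y that can stay in the cache. */
static void matmul_tiled(int X[N][N], int Y[N][N], int Z[N][N]) {
    for (int kk = 0; kk < N; kk += B)
        for (int jj = 0; jj < N; jj += B)
            for (int i = 0; i < N; i++)
                for (int k = kk; k < min_int(kk + B, N); k++) {
                    int r = X[i][k];
                    for (int j = jj; j < min_int(jj + B, N); j++)
                        Z[i][j] += r * Y[k][j];
                }
}
```

Both functions perform the same additions in a different order, so for integer data the results are identical. A good value for B depends on the cache size and is usually found by measurement.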

High-level transformations

Example: Separation of margin handling

[Figure: a region processed with many if-statements for margin checking is split into an inner region processed without checking (efficient) plus only a few margin elements that are processed separately.]
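Separation of margin handling can be illustrated on a one-dimensional three-point filter. The filter itself, its edge rule (repeat the border element), and the function names are assumptions made for this sketch; the slide only shows the idea as a picture.

```c
#include <assert.h>

/* Checked version: two boundary tests for every element. */
static void smooth_checked(const int *in, int *out, int n) {
    assert(n >= 2);
    for (int i = 0; i < n; i++) {
        int left  = (i > 0)     ? in[i - 1] : in[i];
        int right = (i < n - 1) ? in[i + 1] : in[i];
        out[i] = left + in[i] + right;
    }
}

/* Split version: margins handled once, inner loop free of checks. */
static void smooth_split(const int *in, int *out, int n) {
    assert(n >= 2);
    out[0] = in[0] + in[0] + in[1];                   /* left margin */
    for (int i = 1; i < n - 1; i++)                   /* no checking */
        out[i] = in[i - 1] + in[i] + in[i + 1];
    out[n - 1] = in[n - 2] + in[n - 1] + in[n - 1];   /* right margin */
}
```

For the input {1, 2, 3, 4, 5} both versions produce {4, 6, 9, 12, 14}; the split version does so without a conditional in its main loop.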

Loop nest splitting: MPEG-4 full search motion estimation

Original loop nest:

    for (z=0; z<20; z++)
      for (x=0; x<36; x++) {x1=4*x;
        for (y=0; y<49; y++) {y1=4*y;
          for (k=0; k<9; k++) {x2=x1+k-4;
            for (l=0; l<9; l++) {y2=y1+l-4;
              for (i=0; i<4; i++) {x3=x1+i; x4=x2+i;
                for (j=0; j<4; j++) {y3=y1+j; y4=y2+j;
                  if (x3<0 || 35<x3 || y3<0 || 48<y3)
                    then_block_1; else else_block_1;
                  if (x4<0 || 35<x4 || y4<0 || 48<y4)
                    then_block_2; else else_block_2;
      }}}}}}

After loop nest splitting:

    for (z=0; z<20; z++)
      for (x=0; x<36; x++) {x1=4*x;
        for (y=0; y<49; y++)
          if (x>=10 || y>=14)
            for (; y<49; y++)
              for (k=0; k<9; k++)
                for (l=0; l<9; l++)
                  for (i=0; i<4; i++)
                    for (j=0; j<4; j++) {
                      then_block_1; then_block_2}
          else {y1=4*y;
            for (k=0; k<9; k++) {x2=x1+k-4;
              for (l=0; l<9; l++) {y2=y1+l-4;
                for (i=0; i<4; i++) {x3=x1+i; x4=x2+i;
                  for (j=0; j<4; j++) {y3=y1+j; y4=y2+j;
                    if (0 || 35<x3 || 0 || 48<y3)
                      then_block_1; else else_block_1;
                    if (x4<0 || 35<x4 || y4<0 || 48<y4)
                      then_block_2; else else_block_2;
      }}}}}}

Method: analysis of polyhedral domains, selection with a genetic algorithm.

[H. Falk et al., Inf 12, UniDo, 2002]

Results for loop nest splitting: execution times

[Figure: bar chart of relative execution times (axis 0% to 110%) for the benchmarks Cavity, Motion Estimation, and QSDPCM. The bar values are not recoverable from the transcript.]

[H. Falk et al., Inf 12, UniDo, 2002]

Results for loop nest splitting: code sizes

[Figure: bar chart of relative code sizes (axis 0% to 200%) for the benchmarks Cavity, Motion Estimation, and QSDPCM. The bar values are not recoverable from the transcript.]

[Falk, 2002]

Array folding

[Figures: initial arrays; unfolded arrays; inter-array folding; intra-array folding. The diagrams are not recoverable from the transcript.]

Function inlining: advantages and limitations

Source:

    Function sq(c:integer) return:integer;
    begin
      return c*c
    end;
    ....
    a=sq(b);
    ....

Code with inlining:

    ....
    LD R1,b;
    MUL R1,R1,R1;
    ST R1,a
    ....

Code with branching:

    push PC;
    push b;
    BRA sq;
    pull R1;
    mul R1,R1,R1;
    pull R2;
    push R1;
    BRA (R2)+1;
    pull R1;
    ST R1,a;

Advantage: low calling overhead

Limitations:
• Not all functions are candidates.
• Code size explosion.
• Requires manual identification using the ‘inline’ qualifier.

Goal:
• Controlled code size
• Automatic identification of suitable functions.
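In C, the manual identification mentioned under the limitations looks like the following sketch. The function `sum_of_squares` is a made-up caller; note that `inline` is only a hint, and whether the call is actually replaced by a multiply is decided by the compiler and its optimization settings.

```c
#include <assert.h>

/* Manually marked candidate: small body, cheap to duplicate. */
static inline int sq(int c) {
    return c * c;
}

/* Each call site of sq may be expanded in place, which removes the
   call/return overhead but copies the body into every caller. */
static int sum_of_squares(int a, int b) {
    assert(a > -1000 && a < 1000 && b > -1000 && b < 1000);  /* keep products small */
    return sq(a) + sq(b);
}
```

The code size cost grows with the number of call sites, which is why selecting functions under a size limit becomes an optimization problem in its own right.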

Results for GSM speech and channel encoder: #calls, #cycles (TI ‘C62xx)

[Figure: number of calls and number of cycles relative to no inlining (axis 0 to 100) as a function of the code size limit L [%] from 100 to 150 in steps of 5. The curve values are not recoverable from the transcript.]

33% speedup for 25% increase in code size.
The number of cycles is not a monotonically decreasing function of the code size!

Inline vectors computed by B&B algorithm

size limit (%) | inline vector (functions 1-26)
100            | 00000000000000000000000000
105            | 00100000001100001110111111
110            | 10111001011100001111111111
115            | 10110000000001001000111001
120            | 10110100101000100110111101
125            | 10110000001010000100111101
130            | 00110000000010100100111000
135            | 10110010001110101110111101
140            | 10111011111110101111111111
145            | 10110110101010100110111101
150            | 10110110000010110110111101

Major changes for each new size limit. Difficult to generate manually.

References:
• J. Teich, E. Zitzler, S.S. Bhattacharyya: 3D Exploration of Software Schedules for DSP Algorithms, CODES’99
• R. Leupers, P. Marwedel: Function Inlining under Code Size Constraints for Embedded Processors, ICCAD, 1999


Hardware/software partitioning

Functionality to be implemented in software or in hardware?

No need to consider special purpose hardware in the long run?

Maybe correct for fixed functionality, but wrong in general, since “By the time MPEG-n can be implemented in software, MPEG-n+1 has been invented” [de Man]

The decision is based on hardware/software partitioning, a special case of hardware/software codesign.

Codesign Tool (COOL) as an example of HW/SW partitioning

Inputs to COOL:

1. Target technology
• Available HW platforms
• Multiprocessors OK, but all of the same type
• ASIC: synthesizable (technology library)

2. Design constraints
• Throughput, latency, memory size, area

3. Required behavior
• Hierarchical task graphs
• Behavior specified in VHDL
• Communication edges and timing edges

Hardware/software codesign: approach

[Figure: a specification is mapped onto processor P1, processor P2, and hardware.]

[Niemann, Hardware/Software Co-Design for Data Flow Dominated Embedded Systems, Kluwer Academic Publishers, 1998 (Comprehensive mathematical model)]

Steps of the COOL partitioning algorithm

1. Translation of the behavior into an internal graph model

2. Translation of the behavior of each node from VHDL into C

3. Compilation
• All C programs are compiled for the target processor
• Computation of the resulting program size
• Estimation of the resulting execution time (simulation input data might be required)

4. Synthesis of hardware components: for each leaf node, application-specific hardware is synthesized. High-level synthesis must be sufficiently fast!

5. Flattening of the hierarchy
• The granularity used by the designer is maintained.
• Cost and performance information is added to the nodes.
• Precise information required for partitioning is pre-computed.

6. Generating and solving a mathematical model of the optimization problem
• Integer programming (IP) model for optimization
• Optimal with respect to the cost function (which approximates communication time)

7. Iterative improvements: adjacent nodes mapped to the same hardware component are now merged.

8. Interface synthesis: after partitioning, the glue logic required for interfacing processors, application-specific hardware, and memories is created.

Integer programming models

Ingredients:

1. Cost function

2. Constraints

Both involve linear expressions of integer variables from a set $X$.

Cost function: $C = \sum_{x_i \in X} a_i x_i$, with $a_i \in \mathbb{R}$, $x_i \in \mathbb{N}$   (1)

Constraints: $\forall j \in J: \sum_{x_i \in X} b_{i,j} x_i \ge c_j$, with $b_{i,j}, c_j \in \mathbb{R}$   (2)

Def.: The problem of minimizing (1) subject to the constraints (2) is called an integer programming (IP) problem.

If all $x_i$ are constrained to be either 0 or 1, the IP problem is said to be a 0/1 integer programming problem.

Example

$C = 5 x_1 + 6 x_2 + 4 x_3$

$x_1 + x_2 + x_3 \ge 2$

$x_1, x_2, x_3 \in \{0, 1\}$

Optimal: at least two variables must be 1, so the two cheapest are chosen: $x_1 = x_3 = 1$, $x_2 = 0$, giving $C = 9$.
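For a 0/1 problem with only three variables, all eight assignments can simply be enumerated. The following C sketch does this for the example, assuming the coefficients 5, 6, 4 and the constraint x1 + x2 + x3 ≥ 2; the `Solution` type and the function name are hypothetical. Realistic IP instances need a proper solver, not enumeration.

```c
#include <assert.h>
#include <limits.h>

typedef struct { int x1, x2, x3, cost; } Solution;

/* Enumerate all 2^3 assignments and keep the cheapest feasible one. */
static Solution solve_example(void) {
    Solution best = {0, 0, 0, INT_MAX};
    for (int mask = 0; mask < 8; mask++) {
        int x1 = mask & 1;
        int x2 = (mask >> 1) & 1;
        int x3 = (mask >> 2) & 1;
        if (x1 + x2 + x3 < 2)               /* constraint violated */
            continue;
        int cost = 5 * x1 + 6 * x2 + 4 * x3;
        if (cost < best.cost) {
            best.x1 = x1; best.x2 = x2; best.x3 = x3; best.cost = cost;
        }
    }
    assert(best.cost != INT_MAX);           /* the example is feasible */
    return best;
}
```

The search returns x1 = 1, x2 = 0, x3 = 1 with cost 9.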

Remarks on integer programming

Maximizing the cost function can be done by minimizing C' = -C.

Integer programming is NP-complete.

In practice, running times can increase exponentially with the size of the problem, but problems with some thousands of variables can still be solved with commercial solvers, depending on the size and structure of the problem.

IP models can be a good starting point for modeling, even if heuristics have to be used in the end to solve them.

An IP model for HW/SW partitioning

Notation:
• Index set I denotes task graph nodes.
• Index set L denotes task graph node types, e.g. square root, DCT, or FFT.
• Index set KH denotes hardware component types, e.g. hardware components for the DCT or the FFT.
• Index set J denotes hardware component instances.
• Index set KP denotes processors. All processors are assumed to be of the same type.

An IP model for HW/SW partitioning

• $X_{i,k}$ = 1 if node $v_i$ is mapped to hardware component type $k \in KH$, and 0 otherwise.

• $Y_{i,k}$ = 1 if node $v_i$ is mapped to processor $k \in KP$, and 0 otherwise.

• $NY_{l,k}$ = 1 if at least one node of type $l$ is mapped to processor $k \in KP$, and 0 otherwise.

• $T$ is a mapping from task graph nodes to their types: $T: I \to L$

The cost function accumulates the cost of hardware units:
C = cost(processors) + cost(memories) + cost(application-specific hardware)

Constraints

Operation assignment constraints:

$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$

All task graph nodes have to be mapped either to software or to hardware.

Variables are assumed to be integers.

Additional constraints to guarantee that they are either 0 or 1:

$\forall i \in I: \forall k \in KH: X_{i,k} \le 1$

$\forall i \in I: \forall k \in KP: Y_{i,k} \le 1$

Operation assignment constraints (2)

$\forall l \in L, \forall i: T(v_i) = c_l, \forall k \in KP: NY_{l,k} \ge Y_{i,k}$

For all types $l$ of operations and for all nodes $i$ of this type: if $i$ is mapped to some processor $k$, then that processor must implement the functionality of $l$.

Decision variables must also be 0/1 variables:

$\forall l \in L, \forall k \in KP: NY_{l,k} \le 1$

Resource & design constraints

• $\forall k \in KH$: the cost (area) used for components of that type is calculated as the sum of the costs of the components of that type. This cost should not exceed its maximum.

• $\forall k \in KP$: the cost for the associated data storage area should not exceed its maximum.

• $\forall k \in KP$: the cost for storing instructions should not exceed its maximum.

• The total cost of HW components ($\sum$ over $k \in KH$) should not exceed its maximum.

• The total cost of data memories ($\sum$ over $k \in KP$) should not exceed its maximum.

• The total cost of instruction memories ($\sum$ over $k \in KP$) should not exceed its maximum.

Scheduling

[Figure: a task graph with nodes v1 to v11, FIR1 and FIR2 blocks, and edges e3 and e4, mapped to processor p1 and ASIC h1. Three timelines show the ordering alternatives: v7 and v8 on p1, e3 and e4 on communication channel c1, and v3 and v4 for FIR2 on h1.]

Scheduling / precedence constraints

• For all nodes $v_{i1}$ and $v_{i2}$ that are potentially mapped to the same processor or hardware component instance, introduce a binary decision variable $b_{i1,i2}$ with
  $b_{i1,i2}$ = 1 if $v_{i1}$ is executed before $v_{i2}$, and 0 otherwise.
  Define constraints of the type
  (end time of $v_{i1}$) ≤ (start time of $v_{i2}$) if $b_{i1,i2}$ = 1, and
  (end time of $v_{i2}$) ≤ (start time of $v_{i1}$) if $b_{i1,i2}$ = 0.

• Ensure that the schedule for executing operations is consistent with the precedence constraints in the task graph.
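The conditional constraints above are not linear as stated. One standard way to express them in an IP model is the big-M technique, sketched below. The transcript does not say how COOL linearizes them, so this is an assumption; the symbols $s_i$ (start time), $e_i$ (end time), and $M$ (a constant larger than any possible schedule length) are introduced here for illustration.

```latex
\begin{align}
  e_{i1} &\le s_{i2} + M \,(1 - b_{i1,i2}) \\
  e_{i2} &\le s_{i1} + M \, b_{i1,i2} \\
  b_{i1,i2} &\in \{0, 1\}
\end{align}
```

With $b_{i1,i2} = 1$ the first inequality becomes $e_{i1} \le s_{i2}$ and the second is trivially satisfied; with $b_{i1,i2} = 0$ the roles are swapped.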

Other constraints

Timing constraints: these constraints can be used to guarantee that certain time constraints are met.

Some less important constraints are omitted.

Example

HW types H1, H2, and H3 with costs of 20, 25, and 30.

Processors of type P.

Tasks T1 to T5.

Execution times:

T | H1 | H2 | H3 | P
1 | 20 |    |    | 100
2 |    | 20 |    | 100
3 |    |    | 12 | 10
4 |    |    | 12 | 10
5 | 20 |    |    | 100

Operation assignment constraints (1)

General form:

$\forall i \in I: \sum_{k \in KH} X_{i,k} + \sum_{k \in KP} Y_{i,k} = 1$

For the example:

$X_{1,1} + Y_{1,1} = 1$ (task 1 mapped to H1 or to P)
$X_{2,2} + Y_{2,1} = 1$
$X_{3,3} + Y_{3,1} = 1$
$X_{4,3} + Y_{4,1} = 1$
$X_{5,1} + Y_{5,1} = 1$

Operation assignment constraints (2)

Assume the types of the tasks are l = 1, 2, 3, 3, and 1.

$\forall l \in L, \forall i: T(v_i) = c_l, \forall k \in KP: NY_{l,k} \ge Y_{i,k}$

For example, $NY_{3,1} \ge Y_{4,1}$: functionality 3 has to be implemented on the processor if node 4 is mapped to it.

Other equations

Time constraints lead to: application-specific hardware is required for time constraints under 100 time units.

Cost function:
C = 20 #(H1) + 25 #(H2) + 30 #(H3) + cost(processor) + cost(memory)

Result

For a time constraint of 100 time units and cost(P) < cost(H3):

Solution (educated guessing):
T1 → H1
T2 → H2
T3 → P
T4 → P
T5 → H1
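The educated guess can be cross-checked by enumerating all 32 mappings. The C sketch below makes several simplifying assumptions that are not in the slides: execution is fully sequential, so the time constraint applies to the sum of all execution times; one instance per used HW type suffices; memory cost is ignored; and the processor cost is passed in as a parameter (any value below 30 satisfies cost(P) < cost(H3)).

```c
#include <assert.h>
#include <limits.h>

enum { TASKS = 5, HW_TYPES = 3 };

/* Data from the example table: T1,T5 -> H1, T2 -> H2, T3,T4 -> H3. */
static const int hw_type[TASKS] = {0, 1, 2, 2, 0};
static const int hw_time[TASKS] = {20, 20, 12, 12, 20};
static const int sw_time[TASKS] = {100, 100, 10, 10, 100};
static const int hw_cost[HW_TYPES] = {20, 25, 30};

/* Returns a bit mask (bit i set = task i+1 runs on P), or -1 if no
   mapping meets the deadline. The cost of the winner goes to *best_cost. */
static int best_partition(int processor_cost, int deadline, int *best_cost) {
    assert(best_cost != 0);
    int best_mask = -1;
    *best_cost = INT_MAX;
    for (int mask = 0; mask < (1 << TASKS); mask++) {
        int time = 0, cost = 0;
        int used[HW_TYPES] = {0, 0, 0};
        for (int i = 0; i < TASKS; i++) {
            if ((mask >> i) & 1) {
                time += sw_time[i];
            } else {
                time += hw_time[i];
                used[hw_type[i]] = 1;
            }
        }
        if (time > deadline)
            continue;                       /* time constraint violated */
        for (int k = 0; k < HW_TYPES; k++)
            if (used[k]) cost += hw_cost[k];
        if (mask != 0)
            cost += processor_cost;         /* processor only if used */
        if (cost < *best_cost) {
            *best_cost = cost;
            best_mask = mask;
        }
    }
    return best_mask;
}
```

With a hypothetical processor cost of 10 and a deadline of 100, the search selects T3 and T4 for the processor and T1, T2, T5 for hardware, which matches the solution above.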

Separation of scheduling and partitioning

Combined scheduling/partitioning is very complex. Heuristic:

1. Compute an estimated schedule.

2. Perform partitioning for the estimated schedule.

3. Perform final scheduling.

4. If the final schedule does not meet the time constraint, go to 1 using a reduced overall timing constraint.

[Figure: two iterations on a time axis t. Iteration 1 shows the specification, the approximate execution time, and the actual execution time. Iteration 2 shows the new specification, the approximate execution time, and the actual execution time.]

Application example

Audio lab (mixer, fader, echo, equalizer, balance units)

• Slow SPARC processor
• 1 µ ASIC library
• Allowable delay of 22.675 µs (~ 44.1 kHz)

[Figure: target architecture with SPARC processor, ASIC (Compass, 1 µ), and external memory.]

Outdated technology; just a proof of concept.

Design space for audio lab

• Everything in software: 72.9 µs, 0 λ²
• Everything in hardware: 3.06 µs, 457.9×10⁶ λ²
• Lowest cost for the given sample rate: 18.6 µs, 78.4×10⁶ λ²

Final remarks

COOL approach:

• Shows that a formal model of hardware/software codesign is beneficial; IP modeling can lead to a useful implementation even if an optimal result is available only for small designs.

Other approaches for HW/SW partitioning:

• Starting with everything mapped to hardware; gradually moving to software as long as the timing constraint is met.

• Starting with everything mapped to software; gradually moving to hardware until the timing constraint is met.

• Binary search.