7/30/2019 Ss Rk2r05a21
ERM PAPER CSE318
HARDWARE COMPILER (System Software)

Submitted to: Ms. Ramanpreet Kaur Lamba

Submitted by: Rahul Abhishek
B.Tech (CSE) + MBA
Roll no. RK2R05A21
Regd. no. 11006646
ACKNOWLEDGMENT

I have the honor to express my gratitude to Ms. Ramanpreet Kaur Lamba for her timely help, support, encouragement, valuable comments, suggestions, and many innovative ideas in carrying out this project. It is a great opportunity for me to gain from her experience.

I also owe an overwhelming debt to my classmates for their constant encouragement and support throughout the execution of this project.

Above all, I consider it a blessing of the Almighty God that I have been able to complete this project. His contribution towards every aspect of our lives would always be supreme and unmatched.
Rahul Abhishek
ABSTRACT

This term paper presents the co-development of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner.

The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.
CONTENTS

1. INTRODUCTION
2. Software and Hardware Compilers Compared
3. The Structure of the Compiler
4. Syntax Directed Translation
5. STRUCTURE OF HARDWARE COMPILER
6. Global Optimization
7. Silicon Code Generation
   7.1. Compiling the Tree Matcher
   7.2. Code Generation
   7.3. Local Optimizations for Speed
8. Peephole Optimization
9. Work-In-Progress: Global Speed Optimization
10. Evaluation of the Silicon Instruction Set
11. Working of Hardware Compiler
12. CONCLUSION
13. REFERENCES
1. Introduction

The direct generation of hardware from a program (more precisely: the automatic translation of a program specified in a programming language into a digital circuit) has been a topic of interest for a long time. So far it appeared to be a rather uneconomical method, of largely academic but hardly practical interest. It has therefore not been pursued with vigour and consequently remained an idealist's dream. It is only through the recent advances of semiconductor technology that new interest in the topic has re-emerged, as it has become possible to be rather wasteful with abundantly available resources. In particular, the advent of programmable components (devices representing circuits that are easily and rapidly reconfigurable) has brought the idea closer to practical realization by a large measure.
Hardware and software have traditionally been considered as distinct entities with little in common in their design process. Focusing on the logical design phase, perhaps the most significant difference is that programs are predominantly regarded as ordered sets of statements to be interpreted sequentially, one after the other, whereas circuits (again simplifying the matter) are viewed as sets of subcircuits operating concurrently ("in parallel"), the same activities recurring in each clock cycle forever. Programs "run"; circuits are static.

Another reason for considering hardware and software design as different is that in the latter case compilation ends the design process (if one is willing to ignore "debugging"), whereas in the former compilation merely produces an abstract circuit. This is to be followed by the difficult, costly, and tedious phase of mapping the abstract circuit onto a physical medium, taking into account the physical properties of the selected parts, propagation delays, fanout, and other details.
With the progressing introduction of hardware description languages, however, it has become evident that the two subjects have several traits in common. Hardware description languages let specifications of circuits assume textual form like programs, replacing the traditional circuit diagrams by texts. This development may be regarded as the analogy of the replacement of flowcharts by program texts several decades ago. Its main promoter has been the rapidly growing complexity of programs, exhibiting the limitations of diagrams spreading over many pages. In the light of textual specifications, variables in programs have their counterparts in circuits in the form of clocked registers. The counterparts of expressions are combinational circuits of gates. The fact that programs mostly operate on numbers, whereas circuits feature binary signals, is of no further significance. It is well understood how to represent numbers in terms of arrays of binary signals (bits), and how to implement arithmetic operations by combinational circuits.
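The last remark can be made concrete: addition on numbers represented as bit arrays reduces to a network of gates. A minimal sketch in Python (the bit-list encoding and function names are illustrative, not from the text):

```python
# Sketch: an n-bit ripple-carry adder built only from the Boolean
# operations AND, OR, XOR, mirroring how integer addition can be
# implemented by a purely combinational circuit.

def full_adder(a, b, cin):
    """One-bit full adder expressed as gates."""
    s = a ^ b ^ cin                   # sum bit
    cout = (a & b) | (cin & (a ^ b))  # carry-out
    return s, cout

def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists combinationally."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```

For example, ripple_add([1, 0, 1], [1, 1, 0]) adds 5 and 3 (little-endian), yielding sum bits [0, 0, 0] with carry-out 1, i.e. 8.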
Direct compilation of hardware is actually fairly straightforward, at least if one is willing to ignore aspects of economy (circuit simplification). We formulate this essay in the form of a tutorial. As a result, we recognize some principal limitations of the translation process, beyond which it may still be applicable, but not realistic. Thereby we perceive more clearly what is better left to software. We also recognize an area where implementation in hardware is beneficial, even necessary: parallelism. Hopefully we end up obtaining a better understanding of the fact that hardware and software design share several important aspects, such as structuring and modularization, that may well be expressed in a common notation. Subsequently we will proceed step by step, in each step introducing another programming concept expressed by a programming language construct.
2. Software and Hardware Compilers Compared
Hardware compilers produce a hardware implementation from some specification of the hardware. There are, however, a number of types of hardware compilers which start from different sources and compile into different targets.
Architectural synthesis tools create a microarchitecture from a program-like description of the machine's function. These systems jointly design the microcode to implement the given program and the datapath on which the microcode is to be executed; adding a bus to the datapath, for example, can allow the microcode to be redesigned to reduce the number of cycles required to execute a critical operation. In general, architectural compilers that accept a fully general program input synthesize relatively conventional machine architectures. For this reason, architectural compilers display a number of relatively direct applications of programming-language compilation techniques.
Architectural compilers design a machine's structure, paying relatively little attention to the physical design of transistors and wires on the chip. Silicon compilers concentrate on the chip's physical design. While these programs may take a high-level description of the chip's function, they map that description onto a restricted architecture and do little architectural optimization. Silicon compilers assemble a chip from blocks that implement the architecture's functional units. They must create custom versions of the function blocks, place the blocks, and route wires between them to implement the architecture. Perhaps the most ambitious silicon compiler project is Cathedral, a family of compilers that create digital filter chips directly from the filter specifications (pass band, ripple, etc.).
Register-transfer compilers, in contrast, implement an architecture: they take as input a description of the machine's memory elements and the data computation and transfer between those memory elements; they produce an optimized version of the hardware that implements the architecture. The target of a register-transfer compiler is a set of logic gates and memory elements; its job is to compile the most efficient hardware for the architecture. A register-transfer compiler can be used as the back-end of an architectural compiler; furthermore, many interesting chips (protocols, microcontrollers, etc.) are best described directly as register-transfer machines. Even though register-transfer descriptions bear less resemblance to programming languages than do the source languages to architectural compilers, we can still apply a number of programming-language compiler techniques to register-transfer compilation.
3. The Structure of the Compiler
Our hardware compiler performs a series of translations to transform the FSM description into a hardware implementation. Figure 3 shows the compiler's structure. The compiler represents the design as a graph; successive compilation steps transform the graph to increase the level of detail in the design and to perform optimizations.

The first step, syntax-directed translation, translates the source description into two directed acyclic graphs (DAGs): Gc for the combinational functions, and Gm for the connections between the combinational functions and the memory elements. Global optimization transforms Gc into a Gc' that computes the same functions but is smaller and faster than Gc. Gc' and Gm are composed into a single intermediate form Gi, analogous to the intermediate code generated by a programming-language compiler. The silicon code generator transforms the intermediate form into hardware implemented in the available logic gates and registers. This implementation is cleaned up by a peephole optimizer; the result can be given to a layout program to create a final chip design.
4. Syntax Directed Translation

Translation of the source into Boolean gates plus registers is syntax-directed, much as in programming-language compilers, but with some differences because the target is a purely structural description of the hardware. The source description is parsed into a pure abstract syntax tree (AST). Leaf nodes in the tree represent either signals or memory elements, which consist of an input signal and an output signal. Subtrees describe the combinational logic functions. Translation turns the AST into the graphs Gc and Gm. Simple Boolean functions of signals are directly translatable; if statements (and by implication, switch statements) also represent Boolean functions: if (cond) z = a; else z = b; becomes the function z = (cond & a) | (!cond & b). Syntax-directed translation is a good implementation strategy because global optimization methods for combinational logic are good enough to wipe out a wide variety of inefficiencies introduced by a naive translation of the source into Boolean functions. The simplest way to translate the source into logic is to reproduce the AST directly in the structure of the combinational function graph Gc, but this creates functions that are too fragmented for good global optimization.

We use syntax-directed partitioning after translation to simplify the logic into fewer and larger functions: collapsing all subexpression trees into one large function, for example, takes much less CPU time than using general global optimization functions to simplify the subexpressions. This strategy is akin to simple programming-language compiler optimizations like in-line function expansion.
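The if-statement rule above can be sketched as a tiny syntax-directed translator. A minimal sketch, assuming a tuple-based AST encoding invented here for illustration (a real compiler's AST is richer):

```python
# Sketch: "if (cond) z = a; else z = b;" becomes the Boolean function
# z = (cond & a) | (!cond & b), produced by walking the AST.

def translate(node):
    """Translate a tiny tuple AST into a Boolean-expression string."""
    op = node[0]
    if op == "sig":                       # leaf: a named signal
        return node[1]
    if op == "if":                        # ("if", cond, then_sig, else_sig)
        c = translate(node[1])
        return "(%s & %s) | (!%s & %s)" % (c, translate(node[2]),
                                           c, translate(node[3]))
    raise ValueError(op)

ast = ("if", ("sig", "cond"), ("sig", "a"), ("sig", "b"))
```

Here translate(ast) yields the string "(cond & a) | (!cond & b)", exactly the multiplexer function named in the text.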
The flow is: FSM description, syntax-directed translation into combinational logic and memory elements, global optimization, intermediate form, silicon code generation, peephole optimization, automatic layout, and finally the mask for manufacture.

Figure 3: How to compile register-transfer descriptions into layout.
5. STRUCTURE OF HARDWARE COMPILER

The structure of the hardware compiler is the pipeline of Figure 3: syntax-directed translation, global optimization, silicon code generation, peephole optimization, and automatic layout.

6. Global Optimization

As in programming-language compilation, global optimization tries to reduce the resources required to implement the machine. The resources in this case are the combinational logic functions: we want to compute the machine's outputs and next state using the smallest, fastest possible functions. Combinational logic optimization combines minimization, which eliminates obvious redundancies in the combinational logic (analogous to dead-code elimination) and simplifies the logic (analogous to strength reduction and constant folding), and common subexpression elimination, which reduces the size of the logic functions. Some methods solve both the minimization and common subexpression elimination problems, but these methods can become stuck in local minima. We use other global optimization techniques, very different from programming-language compiler methods, that take advantage of the restrictions of the Boolean domain. We reduce the combinational logic to a canonical sum-of-products form and use modern extensions [8] to the classical Quine-McCluskey optimization procedure. We eliminate common subexpressions using a technique [9] tuned to the canonical sum-of-products representation. Compiler-oriented methods have also been used with success by the IBM Logic Synthesis System.

Because Boolean logic is a relatively simple theory, our global optimizer can deduce many legal transformations of the combinational functions and explore a large fraction of the total design space. Global optimization commonly reduces total circuit area by 15%. Global logic optimization is similar to machine-independent code optimization in programming-language compilers in that it is performed on an intermediate form, in this case networks of Boolean functions in sum-of-products form. As in programming-language compilation, many optimizations can be performed on this implementation-independent representation; but, as in programming languages, some optimizations cannot be identified until the target, in this case the hardware gates and registers available, is known. For example, if the available hardware includes a register that provides its output in both true and complemented form, we can use that complemented form directly rather than compute it with an external inverter. Global optimization algorithms for Boolean functions are not good at finding these implementation-dependent optimizations; the code-generation phase transforms the intermediate form into efficient hardware.
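One of the sum-of-products simplifications discussed above, absorption, can be sketched in a few lines. This is a minimal sketch, not the full Quine-McCluskey procedure, and the frozenset-of-literals encoding is an assumption made here for illustration:

```python
# Sketch: in a sum-of-products form, a product term that is a superset of
# another term is redundant (absorption), e.g. a&b + a&b&c reduces to a&b.

def absorb(terms):
    """terms: list of frozensets of literals; drop absorbed products."""
    kept = []
    for t in terms:
        # t is redundant if some strictly smaller term already covers it
        if not any(other < t for other in terms if other != t):
            kept.append(t)
    return kept
```

For example, absorb([frozenset("ab"), frozenset("abc")]) keeps only the term {a, b}, since a&b already covers every minterm of a&b&c.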
7. Silicon Code Generation
The portion of our system that makes the most innovative use of compiler methods is the silicon code generator. Our circuits are implemented using a library of registers and gates that implement a subset of the Boolean functions: NAND, NOR, AND-OR-INVERT (AOI), etc. This library forms the silicon instruction set; it is an exact analog of a machine's instruction set. The silicon code generator's job is to translate the intermediate form produced by the global optimizer into the best silicon code: networks of the gates and registers available in the library. The first approach to silicon code generation was ad hoc, reminiscent of the "do what you have to do" era of code generation. Another approach that attracted interest used a rule-based expert system [19]. But silicon code generation is so similar to tree-matching methods for retargetable code generation that we use twig [2] [30], a tree manipulator used for constructing code generators for programming-language compilers, to create the DAGON silicon code generator [21]. DAGON can implement hardware optimized for speed, area, or some function of both.

Figure 4: The silicon code generation process.

Figure 4 shows how silicon code generation works. For each new set of silicon instructions we must generate a tree-matching program. Twig compiles a set of patterns and actions that describe the code transformations into a tree matcher for that library. The code generation phase of a hardware compilation run starts from the intermediate form Gi that includes both the Boolean functions for the combinational logic and the memory elements. That graph is parsed and broken into a forest of trees which are fed to the code generator.
As we will describe later, code generation can be performed iteratively to optimize the silicon
code for speed. At the end of code generation the intermediate form has been compiled into
an equivalent piece of hardware built from the available library components.
7.1 Compiling the Tree Matcher
Code generation finds the best mapping from the intermediate form onto the object code by
tree matching. We want an efficient implementation of the matching program, but we also
want to be able to matching automaton from the code patterns and the actions.The patterns supplied to twig describe either direct mappings of the intermediate form onto
the target code or may describe a mapping that contains a simple local optimization. Local
optimizations may be thought of as clever code sequences that are not natural matches of the
instruction set.
A twig pattern statement has three parts. The first is the pat tern itself, writ ten in a form like
that used by parser generators. A pattern can be used to describe a class of simple pattern
instances by parameterizing the pattern. For instance, five patterns for AOIxxx actually
describe sixty-four pattern instances from an A01444 to an AOI211. Many of these patterns
are symmetric-an A01114 is equivalent to, and would be output as, an AO1411.
The second part of a pattern statement is a formula that defines the cost of applying the
pattern. Separately we supply twig with functions for adding, computing, and comparing
costs. The tree matcher computes the cost for a tree using the user-supplied routine uses these
costs to choose by dynamic programming the minimum cost match over the entire tree.
The third part of a pattern statement is either a rewrite command or an action routine. Some
matches cause the input circuit to be transformed and reprocessed to find new matches, while
others cause immediate actions. DAGONS action is always to write out the code chosen by
the match. The patterns, actions, and cost functions are compiled by twig into a tree-matching
automaton.
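The three parts of a pattern statement might be modelled as follows. The dictionary encoding, the cost numbers, and the gate name are assumptions made for illustration; twig's actual pattern language looks quite different:

```python
# Sketch: a pattern statement has (1) a tree shape to match, (2) a cost
# formula, and (3) an action that emits the matched gate.

nand2_pattern = {
    "pattern": ("NOT", ("AND", "x", "y")),                  # shape to match
    "cost": lambda subtree_costs: 4 + sum(subtree_costs),   # assumed area units
    "action": lambda inputs: "ND2(%s)" % ", ".join(inputs), # emit the gate
}

def apply_pattern(p, subtree_costs, inputs):
    """Evaluate a pattern's cost and run its action on the matched inputs."""
    return p["cost"](subtree_costs), p["action"](inputs)
```

With subtree costs [1, 2] and inputs a, b, this pattern yields cost 7 and the emitted code "ND2(a, b)".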
Figure 5: Partitioning the intermediate form DAG into trees
The automaton finds the optimum match for a tree in two phases: first, finding all the patterns that match subtrees; next, finding the set of matches that cover the tree at the lowest cost. We will describe these steps in detail below. We have written 60 parameterized patterns to describe the AT&T Cell Library. These patterns expand into over seven hundred pattern instances that represent permutations of our hardware library of approximately 165 gates and a few thousand unique local optimizations. The modularity afforded by describing the code generator as patterns, actions, and cost functions makes it easy to port our code generator to a new hardware library, just as code generators can be ported to new instruction sets. To date we have created code generators for six different hardware libraries.
Figure 6: An intermediate form tree and all its possible matches.
7.2 Code Generation
The intermediate form Gi is a DAG, so it must be broken into a forest of trees before being
processed by the tree matcher produced by twig. Figure 5 shows an intermediate form DAG
partitioned into three trees; the Xs show where the DAG has been cut. The partitioned DAG
is sent one tree at a time to the tree matcher. It first determines all possible matches for each
vertex u. The matching automaton finds all possible matches using the Aho-Corasick
algorithm[l], which works in time O(TREESIZE xMAX-MATCHES): the size of the tree
times the maximum number of matches possible at any node. Figure 6 shows a tree and all its
matches. Each node matched a pattern corresponding to either a two-input NAND gate (ND2)
or an inverter (INV); in addition, the subtree formed by the top three ranks of nodes matcheda larger pattern corresponding to an A0121 gate. The optimum code is generated by finding
the least-cost match of patterns that covers the tree from among all the possible matches. The
tree matcher generated by twig finds the optimum match by dynamic programming (41.
Optimum tree matching can be solved by dynamic programming because the best match for
the root of a tree is a function of the best matches for its subtrees. Figure 7 shows how the
pattern-matching automaton finds the optimum match for our example tree. The tree matcher
starts from the leaf nodes and successively matches larger subtrees. The nodes at the bottom
of the tree each have only one match; the number shown with the name of the best match is
the total cost of that subtree. The root node has two matches: the INV match gives a total tree
cost of 13; the A012 1 match (which covers several previously matched nodes) gives a tree of
cost 7. The A0121 match gives the smallest total cost, so the automaton labels the node withA012 1. This example is made simple because we tried to match only a few instructions;
matching even this simple DAG against our full instruction set would give four to five matches per node. Pure tree matching would not be able to handle two important cases: matching non-tree patterns, like exclusive-or gates; and matching patterns across tree boundaries. We handle these cases by associating attributes that describe global information (fan-out, inverse, and signal) with each vertex in the tree. Inverse tells if the inverted sense of a signal is available somewhere in the circuit. This is very useful because a number of area-reducing optimizations can often be made if the inverted sense of a signal can be used in place of the signal itself. Signal gives the name of a signal, which can be used with inverse to represent a non-tree pattern such as an exclusive-or. Fan-out is used to determine timing costs. Because patterns are matched against a forest of trees rather than the original DAG, DAGON may not find the globally optimal match. Finding an optimal DAG matching (DAG covering) is an NP-complete problem [31]. However, the combination of tree matching and peephole optimizations has proven to be an effective method for silicon code generation. The MIS system tried extending tree matching to DAG covering [17], but the results were disappointing. MIS now uses tree matching for code generation.
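The bottom-up dynamic programming described above can be sketched for a Figure 6-style tree, with ND2, INV, and one larger AOI21 pattern. The gate costs and the tuple tree encoding are invented for illustration:

```python
# Sketch: best(node) is the cheapest cover of the subtree at node,
# computed bottom-up from the cheapest covers of each pattern's inputs.

COSTS = {"INV": 1, "ND2": 2, "AOI21": 3}   # assumed area costs per gate

def best(node):
    """Minimum cost to cover the subtree rooted at node with library gates."""
    if isinstance(node, str):              # a primary input: costs nothing
        return 0
    op, choices = node[0], []
    if op == "nand":
        choices.append(COSTS["ND2"] + best(node[1]) + best(node[2]))
    elif op == "inv":
        choices.append(COSTS["INV"] + best(node[1]))
        n = node[1]
        # larger pattern: INV(NAND(NAND(a, b), INV(c))) computes !(a&b | c),
        # i.e. a single AOI21 gate covering the whole subtree at once
        if (not isinstance(n, str) and n[0] == "nand"
                and not isinstance(n[1], str) and n[1][0] == "nand"
                and not isinstance(n[2], str) and n[2][0] == "inv"):
            a, b, c = n[1][1], n[1][2], n[2][1]
            choices.append(COSTS["AOI21"] + best(a) + best(b) + best(c))
    return min(choices)
```

On the tree ("inv", ("nand", ("nand", "a", "b"), ("inv", "c"))) the naive cover costs 6 (two NANDs and two inverters) while the single AOI21 match costs 3, so the matcher picks AOI21, mirroring how the larger pattern wins in the Figure 7 example.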
Figure 7: Finding an optimal match.
7.3 Local Optimizations for Speed
While not all industrial designs have to run at high speed, there are always circuits that have
to run very fast. Because path delay is a global property of the hardware it can be hard fordesigners to optimize large machines for speed. Furthermore, reducing the delay of a piece of
hardware almost always costs additional area. We want to meet the designers speed
requirement at a minimum area cost. As Figure 4 shows, code generation can be applied
iteratively to jointly optimize for speed and area. The first pass through the tree matcher
produces hardware strictly optimized for area. A static delay analyzer finds parts of the
hardware that must be speeded up and produces speed constraints. The code generated by the
previous pass is translated back into the intermediate form and sent through the tree matcher
along with the constraints. The constraints cause the matcher to abort matches that cannot run
at the required speed. Because many Boolean functions are represented by both low-
speed/low-power and high-speed/high-power gates in the library, the matcher often need only
substitute the high-speed version of a gate to meet the speed constraint, but a constraint canforce the entire tree to be rematched. By this procedure we have been able to routinely double
the speed of circuits with an area penalty of only 10%. This is such a significant improvement
that DAGON is being used for timing optimization of manually designed hardware. Speed
optimization is sufficiently hard that DAGON has been able to similarly improve even these
hand-crafted designs. of DAGs, not trees. We are presently investigating techniques for
height minimization a DAG.
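The substitute-fast-gates-until-the-path-meets-timing idea can be caricatured with a one-path toy model. All library numbers here are invented: each gate exists in a slow/low-power and a fast/high-power version, and the loop swaps in fast versions until the chain meets the delay target:

```python
# Toy sketch of the speed/area trade: start with the cheapest (slow) gates
# and substitute high-speed versions only until the path delay is met.

SLOW = {"area": 1, "delay": 3}   # assumed low-power gate
FAST = {"area": 2, "delay": 1}   # assumed high-speed gate

def speed_up(path_len, max_delay):
    """Return (#fast gates used, total area, total delay) for a gate chain."""
    gates = [SLOW] * path_len
    i = 0
    while sum(g["delay"] for g in gates) > max_delay and i < path_len:
        gates[i] = FAST              # swap in the high-speed version
        i += 1
    area = sum(g["area"] for g in gates)
    delay = sum(g["delay"] for g in gates)
    return i, area, delay
```

For a chain of 4 gates and a delay budget of 8, two substitutions suffice: the result is 2 fast gates, area 6, delay 8, a modest area penalty for a large delay reduction, in the spirit of the 10%-area/2x-speed figure quoted above.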
8. Peephole Optimization
Inverters are the bane of a silicon code generator. Because trees are matched independently, it is common for a signal to be inverted, due to a local optimization, in two or more different trees when a single inverter would do the job. We clean up these and other problems with a peephole optimization phase that minimizes the number of inverters used. As in programming-language compilers, the peephole optimizer looks at the circuit through a small window; but this window reveals a portion of the circuit graph, not a portion of code text as in an object-code peephole optimizer. Peephole optimization reduces combinational logic area by as much as 10%.
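The inverter clean-up can be sketched as follows. The triple-based netlist encoding and the net names are assumptions made for illustration:

```python
# Sketch: if several trees each instantiated an inverter for the same
# signal, keep the first one and rewire the later outputs to it.

def dedupe_inverters(netlist):
    """netlist: list of (gate, input, output); merge INV gates sharing input."""
    seen = {}            # inverted signal -> canonical inverter output net
    rewire, kept = {}, []
    for gate, src, out in netlist:
        if gate == "INV":
            if src in seen:
                rewire[out] = seen[src]   # reuse the earlier inverter
                continue                  # drop the redundant gate
            seen[src] = out
        kept.append((gate, src, out))
    return kept, rewire
```

Given two inverters of the same signal a driving nets n1 and n2, the pass keeps the first inverter and records that n2 should be rewired to n1.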
9. Work-In-Progress: Global Speed Optimization
While circuit timing problems related to very local circuit properties, such as high capacitive
loading, are well solved by the techniques described in the previous sections, not all timing
problems in digital hardware can be solved by a single application of local optimizations.
Some timing problems are due to having too many gates in a particular path through the
hardware. To solve this problem the hardware must often be drastically restructured. As local
optimizations using tree matching are relatively inexpensive (linear time), our current
approach is to iteratively apply the local optimizations
to achieve more global transformations. This procedure is both aesthetically and practically
unsatisfying.
We would prefer to make hardware transformations that more quickly result in hardware which meets timing requirements. Tree height reduction algorithms [10], developed in the context of reducing the height of arithmetic expression trees to speed their execution on a parallel processor, suggest an approach to this problem. While this method, and its successors, has promise, we have found two limitations in its application to hardware. First, when compiling to a fixed architecture, resources are free up to the point of saturation and infinitely expensive after that. Thus it makes sense to reduce tree height until all processors are saturated. When designing hardware, however, we are interested in speeding up the hardware to some limit specified by the user. We have no fixed resource limit in silicon code generation to provide a lower bound on tree height, but maximally reducing all trees would create hardware much larger than necessary. Second, we are concerned with minimizing the height of DAGs, not trees. We are presently investigating techniques for height minimization of a DAG.
10. Evaluation of the Silicon Instruction Set
Highly portable compilers for fixed architectures encouraged the evaluation of instruction
sets to determine the most cost-effective set of instructions: instruction sets that gave the
fastest compiled code at the minimum cost in hardware. While a larger silicon instruction set
does not have the unpleasant side effect of increasing the complexity of the implementation, a
larger silicon instruction set does require more human resources to develop and maintain. The
cost of developing and maintaining the instruction set must be weighed against the
improvement in the speed and area of hardware that can be compiled. The availability of a
highly portable silicon code generator made possible a long overdue experiment, that of
evaluating the impact of the silicon instruction set on the quality of the final hardware.
We evaluated five different silicon instruction sets ranging in complexity from 2 to 125 gates.
Our results on the twenty standard benchmark examples show a monotonic, though exponentially decreasing, reduction in chip area, ranging from 26% (7 instructions) to 45%
(125 instructions). These results were further verified against a number of industrial designs.
If we compare chip area with size of object code, these results simply show that with the
addition of more and more powerful instructions the size of object code is reduced. What was
less predictable was the knee of the area reduction per number of instructions curve. Beyond
an instruction set size of 37 instructions, which gave an area reduction of 40%, adding
instructions gave little area reduction: an additional 88 instructions only reduced area by an
additional 2%. A more surprising result was that all silicon instruction sets showed
comparable speed.
11. Working of hardware compiler

11.1. Variable Declarations

All variables are introduced by their declaration. We use the notation

VAR x, y, z: Type

In the corresponding circuit, the variables are represented by registers holding their values. The output carries the name of the variable represented by the register, and the enable signal
determines when a register is to accept a new input. We postulate that all registers be clocked by the same clock signal; that is, all derived circuits will be synchronous circuits. Therefore, clock signals will subsequently not be drawn and are assumed to be implicit.
11.2. Assignment Statements

An assignment is expressed by the statement

y := f(x)

where y is a variable, x is (a set of) variables, and f is an expression in x. The assignment statement is translated into the circuit according to Fig. 1, with the function f resulting in a combinational circuit (i.e. a circuit containing neither feedback loops nor registers). e stands for enable; this signal is active when the assignment is to occur.

Given a time t when the register(s) have received a value, the combinational circuit f yields the new value f(x) after a time span pd called the propagation delay of f. This requires the next clock tick to occur no earlier than time t + pd. In other words, the propagation delay determines the clock frequency f < 1/pd. The concept of the synchronous circuit dictates that the frequency is chosen according to the circuit f with the largest propagation delay among all function circuits.
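The rule that the slowest combinational block limits the clock, f < 1/pd, can be checked with a one-liner. The delay values in the example are illustrative:

```python
# Sketch: the clock period must cover the largest propagation delay
# among all combinational blocks, so f_clk < 1 / max(pd).

def max_clock_frequency(prop_delays):
    """prop_delays: propagation delay (seconds) of each combinational block."""
    return 1.0 / max(prop_delays)
```

For blocks with delays of 2 ns, 5 ns, and 3 ns, the 5 ns block limits the clock to at most 200 MHz.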
The simplest types of variables have only 2 possible values, say 0 and 1. They are said to be of type BIT (or Boolean). However, we may just as well consider composite types, that is, variables consisting of several bits. A type Byte, for example, might consist of 8 bits, and the same holds for a type Integer. The translation of arithmetic functions (addition, subtraction, comparison, multiplication) into corresponding combinational circuits is well understood. Although these may be of considerable complexity, they are still purely combinational circuits.
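The behaviour of such an enabled register can be mimicked in software. The following is a minimal sketch (the `Register` class and the function `f` are illustrative assumptions, not part of the paper's notation): a register accepts a new value on a clock tick only when its enable signal is active.

```python
# Sketch of a clocked register with an enable signal, driven by a
# combinational function f, modelling the assignment y := f(x).
class Register:
    def __init__(self, value=0):
        self.value = value          # current output of the register

    def clock(self, data, enable):
        # On a clock tick the register accepts a new input only
        # when its enable signal is active.
        if enable:
            self.value = data

def f(x):
    # Example combinational circuit: no feedback, no internal state.
    return x + 1

# Simulate two clock ticks of the assignment y := f(x).
x = Register(5)
y = Register()
y.clock(f(x.value), enable=True)    # assignment occurs: y becomes 6
y.clock(f(99), enable=False)        # enable inactive: y keeps its value
print(y.value)                      # 6
```

The propagation delay has no software analogue here; in hardware it bounds the clock frequency as described above.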
11.3. Parallel Composition
We denote parallel composition of two statements S0 and S1 by
S0, S1
The meaning is that the two component statements are executed concurrently. The characteristic of parallel composition is that the affected registers feature the same enable signal. Translation into a circuit is straightforward, as shown in Fig. 1 (right side) for the example
x := f(x, y), y := g(x, y)
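Because both registers share one enable signal, both latch in the same cycle, and each combinational circuit sees the old values of x and y. A small sketch (the bodies of f and g are arbitrary assumptions):

```python
# Sketch of the parallel composition x := f(x, y), y := g(x, y):
# both registers update simultaneously from the *old* values.
def step(x, y):
    fx = x + y      # example combinational circuit f
    gy = x - y      # example combinational circuit g
    return fx, gy   # both registers latch in the same clock cycle

x, y = 3, 1
x, y = step(x, y)
print(x, y)         # 4 2
```

Note that a sequential reading (first x := f(x, y), then y := g(x, y) using the new x) would give a different result; the simultaneous latch is exactly what the shared enable signal guarantees.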
11.4. Sequential Composition
Traditionally, sequential composition of two statements S0 and S1 is denoted by
S0; S1
The sequencing operator ";" signifies that S1 is to be executed after completion of S0. This necessitates a sequencing mechanism. The example of the three assignments
y := f(x); z := g(y); x := h(z)
is translated into the circuit shown in Fig. 2, with enable signals e corresponding to the statements.
The upper line contains the combinational circuits and registers corresponding to the
assignments. The lower line contains the sequencing machinery assuring the proper
sequential execution of the three assignments. With each statement is associated an individual
enable signal e. It determines when the assignment is to occur, that is, when the register
holding the respective variable is to be enabled. The sequencing part is a "one-hot" state
machine. The following table illustrates the signal values before and after each clock cycle.
e0 e1 e2 e3 | x            y      z
1  0  0  0  | x0           ?      ?
0  1  0  0  | x0           f(x0)  ?
0  0  1  0  | x0           f(x0)  g(f(x0))
0  0  0  1  | h(g(f(x0)))  f(x0)  g(f(x0))
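The table above can be reproduced by a small simulation of the one-hot sequencer (the concrete bodies of f, g, and h are illustrative assumptions): a single enable token circulates, and in each cycle exactly one assignment fires.

```python
# Sketch of the one-hot sequencer for y := f(x); z := g(y); x := h(z).
def f(x): return x + 1
def g(y): return y * 2
def h(z): return z - 3

x, y, z = 5, None, None
enables = [1, 0, 0, 0]          # one-hot state: e0 active initially
for _ in range(3):
    if enables[0]: y = f(x)     # cycle 0: y := f(x)
    elif enables[1]: z = g(y)   # cycle 1: z := g(y)
    elif enables[2]: x = h(z)   # cycle 2: x := h(z)
    enables = [0] + enables[:3] # shift the token: e_i -> e_(i+1)
print(x, y, z)                  # 9 6 12
```

After three cycles the token sits at e3 and the registers hold h(g(f(x0))), f(x0), and g(f(x0)), exactly as in the table.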
The preceding example is a special case in the sense that each variable is assigned only a
single value, namely y is assigned f(x) in cycle 0, z is assigned g(y) in cycle 1, and x is
assigned h(z) in cycle 2. The generalized case is reflected in the following short example:
x := f(x); x := g(x)
which is transformed into the circuit shown in Fig. 3. Since x receives two values, namely
f(x) in cycle 0 and g(x) in cycle 1, the register holding x needs to be preceded by a
multiplexer.
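The role of that multiplexer can be sketched as follows (the mux model and the bodies of f and g are assumptions for illustration): both combinational circuits are physically present and compute in every cycle, and the select line decides which result the register latches.

```python
# Sketch of the multiplexer in front of the register for
# x := f(x); x := g(x).
def f(x): return x + 10
def g(x): return x * 3

def mux(sel, a, b):
    # 2-to-1 multiplexer: sel == 0 passes a, sel == 1 passes b.
    return b if sel else a

x = 1
x = mux(0, f(x), g(x))   # cycle 0: register latches f(x) = 11
x = mux(1, f(x), g(x))   # cycle 1: register latches g(x) = 33
print(x)                 # 33
```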
11.5. Conditional Composition
This is expressed by the statement
IF b THEN S END
where b is a Boolean variable and S a statement. The circuit shown on the left side of Fig. 4 is derived from it, with S standing for y := f(x). The only difference from Fig. 1 is the derivation of the associated sequencing machinery yielding the enable signal for y.
It now emerges clearly that the sequencing machinery, and it alone, directly reflects the control statements, whereas the data parts of circuits reflect assignments and expressions.
The conditional statement is easily generalized to its following form:
IF b0 THEN S0 ELSE S1 END
In its corresponding circuit on the right side of Fig. 4 we show the sequencing part only with
the enable signals corresponding to the various statements. At any time, at most one of the
statements S is active after being triggered by e0.
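At the sequencing level, the conditional simply gates the incoming enable with the condition. A minimal sketch (the helper `branch_enables` is a hypothetical name, not from the paper):

```python
# Sketch of IF b THEN S0 ELSE S1 END at the sequencing level:
# the incoming enable e0 is gated by b into two mutually
# exclusive statement enables.
def branch_enables(e0, b):
    # At most one enable is active, and only when the construct
    # itself has been triggered by e0.
    return (e0 and b, e0 and not b)   # (enable for S0, enable for S1)

print(branch_enables(True, True))    # (True, False)
print(branch_enables(True, False))   # (False, True)
print(branch_enables(False, True))   # (False, False)
```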
11.6. Repetitive Composition
We consider repetitive constructs traditionally expressed as repeat and while statements.
Since the control structure of a program statement is reflected by the sequencing machinery
only, we may omit showing associated data assignments, and instead let each statement be
represented by its associated enable signal only.
We consider the statements
REPEAT S UNTIL b and WHILE b DO S END
with b again standing for a Boolean variable. They translate into the circuits shown in Fig. 5, where e stands for the enable signal of statement S.
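The essential difference between the two circuits is where the condition is tested relative to the body's enable pulse. A small sketch, counting enable pulses in a simulation (function names and bodies are assumptions):

```python
# Sketch of REPEAT S UNTIL b versus WHILE b DO S END:
# the repeat loop tests the condition after the body,
# the while loop tests it before.
def repeat_until(state, body, cond):
    pulses = 0
    while True:
        state = body(state); pulses += 1   # S always runs at least once
        if cond(state):
            return state, pulses

def while_do(state, body, cond):
    pulses = 0
    while cond(state):
        state = body(state); pulses += 1   # S may run zero times
    return state, pulses

inc = lambda s: s + 1
print(repeat_until(5, inc, lambda s: s >= 3))  # (6, 1): one pulse anyway
print(while_do(5, inc, lambda s: s < 3))       # (5, 0): zero pulses
```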
11.7. Selective Composition
The selection of one statement out of several is expressed by the case statement, whose
corresponding sequencing circuitry containing a decoder is shown in Fig. 6.
CASE k OF
0: S0 | 1: S1 | 2: S2 | ... | n: Sn
END
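The decoder mentioned above activates exactly one statement enable from the value of k; a one-line sketch (the function `decode` is an illustrative name):

```python
# Sketch of the decoder in the case statement's sequencing circuitry:
# the selector value k drives exactly one of n enable lines high.
def decode(k, n):
    # One-hot decoder: n output lines, line k active.
    return [1 if i == k else 0 for i in range(n)]

print(decode(2, 4))   # [0, 0, 1, 0]
```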
11.8. Preliminary Discussion
At this point we realize that arbitrary programs consisting of assignments, conditional, and repetitive statements can be transformed into circuits according to fixed rules, and therefore automatically.
However, it is also to be feared that the complexity of the resulting circuits may quickly surpass reasonable bounds. After all, every expression occurring in a program results in an individual combinational circuit; every add symbol yields an adder, every multiply symbol turns into a combinational multiplier, potentially with hundreds or thousands of gates. We agree that for the present this scheme is hardly practical except for toy examples. Cynics will remark, however, that soon the major problem will no longer be the economical use of components, but rather finding good use for the hundreds of millions of transistors on a chip.
In spite of such warnings, we propose to look at ways to reduce the projected hardware explosion. The best solution is to share subcircuits among various parts. This, of course, is equivalent to the idea of the subroutine in software. We must therefore find a way to translate subroutines and subroutine calls. We emphasize that the driving motivation is the sharing of circuits, not the reduction of program text.
Therefore, a facility for declaring subroutines and textually substituting their calls by their bodies, that is, handling them like macros, must be rejected. This is not to deny the usefulness of such a feature, as it is for example provided in the language Lola, where the calls may be parameterized [2, 3]. But with its expansion of the circuit for each call, it contributes to the circuit explosion rather than avoiding it.
11.9. Subroutines
The circuits obtained so far are basically state machines. Let every subroutine translate into
such a state machine. A subroutine call then corresponds to the suspension of the calling
machine, the activation of the called machine, and, upon its completion, a resumption of the
suspended activity of the caller.
A first measure to implement subroutines is to provide every register in the sequencing part with a common enable signal. This allows a given machine to be suspended and resumed. The second measure is to provide a stack (a last-in first-out store) of identifications of suspended machines, typically numbers from 0 to some limit n. We propose the following implementation, which is to be regarded as a fixed base part of all generated circuits, analogous to a runtime subroutine package in software. The extended circuit is a push-down machine.
The stack of machine numbers (analogous to return addresses) consists of a (static) RAM of m words, each of n bits, and an up/down counter generating the RAM addresses. Each of the n bits in a word corresponds to one of the circuits representing a subroutine. One of them has the value 1, identifying the circuit to be reactivated. Hence, the (latched) readout directly specifies the values of the enable signals of the n circuits. This is shown in Fig. 7. The stack is operated by the push signal, which increments the counter and thereafter writes the applied input to the RAM, and by the pop signal, which decrements the counter.
The push signal is activated whenever a state register belonging to a call statement becomes active. Since a subroutine may contain several calls itself, these register outputs must be ORed; a better solution would be a bus. The pop signal is activated when the last state of a circuit representing a subroutine is reached. An obvious enhancement, with the purpose of enlarging the number of possible subcircuits without unduly expanding the RAM's width, is to place an encoder at the input and a decoder at the output of the RAM. A width of n bits then allows for 2^n subroutines.
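The stack mechanism can be sketched as a software analogue (the `MachineStack` class is a hypothetical model of the RAM plus up/down counter, not the paper's circuit): push saves the caller's one-hot identification word, pop reads it back and restores the corresponding enables.

```python
# Sketch of the subroutine stack: a RAM of one-hot words and an
# up/down counter generating the RAM addresses.
class MachineStack:
    def __init__(self, depth, n):
        self.ram = [[0] * n for _ in range(depth)]  # depth words of n bits
        self.counter = 0                            # up/down address counter

    def push(self, onehot):
        # Increment the counter, then write the suspended machine's id.
        self.counter += 1
        self.ram[self.counter] = onehot

    def pop(self):
        # Read the enables of the machine to reactivate, then decrement.
        word = self.ram[self.counter]
        self.counter -= 1
        return word

stack = MachineStack(depth=8, n=4)
stack.push([1, 0, 0, 0])      # machine 0 suspends to call a subroutine
stack.push([0, 0, 1, 0])      # machine 2 suspends to call another
print(stack.pop())            # [0, 0, 1, 0]: reactivate machine 2 first
print(stack.pop())            # [1, 0, 0, 0]: then machine 0
```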
The scheme presented so far excludes recursive procedures. The reason is that every subcircuit must have at most one suspension point, that is, it can have been activated at most once. If recursive procedures are to be included, a scheme allowing for several suspension points must be found. This implies that the stack entries must contain not only an identification of the suspended subcircuit, but also of the suspension point of the call in question. This can be achieved in a manner similar to storing the circuit identification, either as a bit set or as an encoded value. The respective state re-enable signals must then be properly qualified, further complicating the circuit. We shall not pursue this topic any further.
12. Conclusion
We have found that a number of techniques and algorithms developed for software compilers for fixed architectures are useful in hardware compilation. We add the caveat that, while programming-language compilation techniques for fixed architectures can often give us a good start on many problems in hardware compilation, it is important to remember that language-to-hardware compilation is harder precisely because the hardware architecture is not fixed. A good software compiler must optimize its code for its target architecture. A good hardware compiler must be able to optimize its code for a target architecture, but must also explore resource tradeoffs across a potentially wide range of such targets. Nevertheless, we hope that hardware and programming-language compilation techniques become even more closely related as we discover more general ways to compile software descriptions into custom hardware. For the moment we continue to experiment with and extend our hardware compiler technology to make it more general, while still retaining its ability to create industrial-quality designs.
It is now evident that subroutines, and more so procedures, introduce a significant degree of complexity into a circuit. It is indeed highly questionable whether it is worthwhile to consider their implementation in hardware. It comes as no surprise that state machines,
implementing assignments, conditional and repetitive statements, but not subroutines, play
such a dominant role in sequential circuit design. The principal concern in this area is to
make optimal use of the implemented facilities. This is achieved if most of the subcircuits
yield values contributing to the process most of the time. The simplest way to achieve this
goal is to introduce as few sequential steps as possible. It is the underlying principle in
computer architectures, where (almost) the same steps are performed for each instruction.
The steps are typically the subcycles of an instruction interpretation. Hardware acts as what is
known in software as an interpretive system.
The strength of hardware is the possibility of concurrent operation of subcircuits. This is also a topic in software design. But we must be aware of the fact that concurrency is only possible by having concurrently operating circuits to support this concept. Much work on parallel computing in software actually ends up implementing quasi-concurrency, that is, pretending concurrent execution and, in fact, hiding the underlying sequentiality. This leads us to contend that any scheme of direct hardware compilation may well omit the concept of subroutines, but must include the facility of specifying concurrent, parallel statements. Such a hardware programming language may indeed be the best way to let the programmer specify parallel statements. We call this fine-grained parallelism.
Coarse-grained concurrency may well be left to conventional programming languages, where parallel processes interact infrequently, and where they are generated and deleted at arbitrary but distant intervals. This claim is amply supported by the fact that fine-grained parallelism has mostly been left to be introduced by compilers, under the hood, so to say.
The reason for this is that compilers may be tuned to particular architectures, and can therefore make use of their target computer's specific characteristics. In this light, the consideration of a common language for hardware and software specification has a certain merit.
It may reveal the inherent difference in the designers' goals. As C. Thacker expressed it
succinctly: "Programming (software) is about finding the best sequential algorithm to solve a
problem, and implementing it efficiently. The hardware designer, on the other hand, tries to
bring as much parallelism to bear on the problem as possible, in order to improve
performance". In other words, a good circuit is one where most gates contribute to the result
in every clock cycle. Exploiting parallelism is not an optional luxury, but a necessity.
We end by repeating that hardware compilation has gained interest in practice primarily because of the recent advent of large-scale programmable devices. These can be configured on the fly, and hence be used to directly represent circuits generated by a hardware compiler. It is therefore quite conceivable that parts of a program are compiled into instruction sequences for a conventional processor, and other parts into circuits to be loaded onto programmable gate arrays. Although both are specified in the same language, the engineer will observe different criteria for good design in the two areas.
13. References
1. A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, June 1975.
2. A. Aho and M. Ganapathi. Efficient tree pattern matching: an aid to code generation. In Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming Languages, pages 334-340, January 1985.
3. A. Aho and S. C. Johnson. Code generation for a one register machine. Journal of the ACM, 23(3):502-510, 1976.
4. A. Aho and S. C. Johnson. Optimal code generation for expression trees. Journal of the ACM, 23(3):488-501, 1976.
5. The Yorktown Silicon Compiler. In Proceedings, International Conference on Circuits and Systems, pages 391-394, IEEE Circuits and Systems Society, June 1985.
6. R. Brent, D. Kuck, and K. Maruyama. The parallel evaluation of arithmetic expressions without division. IEEE Transactions on Computers, pages 532-534, May.
7. J. Darringer, W. Joyner, C. L. Berman, and L. Trevillyan. Logic synthesis through local transformations. IBM Journal of Research and Development, 25(4):272-280, 1981.
8. H. DeMan, J. Rabaey, P. Six, and L. Claesen. Cathedral-II: a silicon compiler for digital signal processing. IEEE Design & Test, 3(6):13-25, 1984.
9. I. Page. Constructing hardware-software systems from a single description. Journal of VLSI Signal Processing, 12:87-107, 1996.
10. N. Wirth. Digital Circuit Design. Springer-Verlag, Heidelberg, 1995.
11. www.lola.ethz.ch
12. Y. N. Patt, et al. One Billion Transistors.