7/30/2019 Ss Rk2r05a21
ERM PAPER CSE318
HARDWARE COMPILER (System Software)

Submitted to: Ms. Ramanpreet Kaur Lamba

Submitted by: Rahul Abhishek
B.Tech (CSE) + MBA
Roll no. RK2R05A21
Regd. no. 11006646
ACKNOWLEDGMENT

I have the honor to express my gratitude to Ms. Ramanpreet Kaur Lamba for her timely help, support, encouragement, valuable comments, suggestions, and many innovative ideas in carrying out this project. It is a great opportunity for me to gain from her experience.

I also owe an overwhelming debt to my classmates for their constant encouragement and support throughout the execution of this project.

Above all, I consider it a blessing of the Almighty God that I have been able to complete this project. His contribution towards every aspect of our lives would always be supreme and unmatched.
Rahul Abhishek
ABSTRACT

This term paper presents the co-development of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner.

The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.
CONTENTS

1. INTRODUCTION
2. Software and Hardware Compilers Compared
3. The Structure of the Compiler
4. Syntax Directed Translation
5. STRUCTURE OF HARDWARE COMPILER
6. Global Optimization
7. Silicon Code Generation
   7.1. Compiling the Tree Matcher
   7.2. Code Generation
   7.3. Local Optimizations for Speed
8. Peephole Optimization
9. Work-In-Progress: Global Speed Optimization
10. Evaluation of the Silicon Instruction Set
11. Working of Hardware Compiler
12. CONCLUSION
13. REFERENCES
1. Introduction

The direct generation of hardware from a program (more precisely: the automatic translation of a program specified in a programming language into a digital circuit) has been a topic of interest for a long time. So far it appeared to be a rather uneconomical method, of largely academic but hardly practical interest. It has therefore not been pursued with vigour and consequently remained an idealist's dream. It is only through the recent advances of semiconductor technology that new interest in the topic has re-emerged, as it has become possible to be rather wasteful with abundantly available resources. In particular, the advent of programmable components (devices representing circuits that are easily and rapidly reconfigurable) has brought the idea closer to practical realization by a large measure.
Hardware and software have traditionally been considered as distinct entities with little in common in their design process. Focusing on the logical design phase, perhaps the most significant difference is that programs are predominantly regarded as ordered sets of statements to be interpreted sequentially, one after the other, whereas circuits (again simplifying the matter) are viewed as sets of subcircuits operating concurrently ("in parallel"), the same activities recurring in each clock cycle forever. Programs "run"; circuits are static.

Another reason for considering hardware and software design as different is that in the latter case compilation ends the design process (if one is willing to ignore "debugging"), whereas in the former compilation merely produces an abstract circuit. This is to be followed by the difficult, costly, and tedious phase of mapping the abstract circuit onto a physical medium, taking into account the physical properties of the selected parts, propagation delays, fanout, and other details.
With the progressing introduction of hardware description languages, however, it has become evident that the two subjects have several traits in common. Hardware description languages let specifications of circuits assume textual form like programs, replacing the traditional circuit diagrams by texts. This development may be regarded as the analogy of the replacement of flowcharts by program texts several decades ago. Its main promoter has been the rapidly growing complexity of programs, exhibiting the limitations of diagrams spreading over many pages. In the light of textual specifications, variables in programs have their counterparts in circuits in the form of clocked registers. The counterparts of expressions are combinational circuits of gates. The fact that programs mostly operate on numbers, whereas circuits feature binary signals, is of no further significance. It is well understood how to represent numbers in terms of arrays of binary signals (bits), and how to implement arithmetic operations by combinational circuits.
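The last remark can be made concrete: addition on numbers represented as bit arrays reduces to a network of gates. A minimal sketch in Python (the bit-list encoding and function names are illustrative, not from the text):

```python
# Sketch: an n-bit ripple-carry adder built only from the Boolean
# operations AND, OR, XOR, mirroring how integer addition can be
# implemented by a purely combinational circuit.

def full_adder(a, b, cin):
    """One-bit full adder expressed as gates."""
    s = a ^ b ^ cin                   # sum bit
    cout = (a & b) | (cin & (a ^ b))  # carry-out
    return s, cout

def ripple_add(x_bits, y_bits):
    """Add two equal-length little-endian bit lists combinationally."""
    carry, out = 0, []
    for a, b in zip(x_bits, y_bits):
        s, carry = full_adder(a, b, carry)
        out.append(s)
    return out, carry
```

For example, ripple_add([1, 0, 1], [1, 1, 0]) adds 5 and 3 (little-endian), yielding sum bits [0, 0, 0] with carry-out 1, i.e. 8.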
Direct compilation of hardware is actually fairly straightforward, at least if one is willing to ignore aspects of economy (circuit simplification). We formulate this essay in the form of a tutorial. As a result, we recognize some principal limitations of the translation process, beyond which it may still be applicable, but not realistic. Thereby we perceive more clearly what is better left to software. We also recognize an area where implementation in hardware is beneficial, even necessary: parallelism. Hopefully we end up obtaining a better understanding of the fact that hardware and software design share several important aspects, such as structuring and modularization, that may well be expressed in a common notation. Subsequently we will proceed step by step, in each step introducing another programming concept expressed by a programming language construct.
2. Software and Hardware Compilers Compared
Hardware compilers produce a hardware implementation from some specification of the hardware. There are, however, a number of types of hardware compilers which start from different sources and compile into different targets.
Architectural synthesis tools create a microarchitecture from a program-like description of the machine's function. These systems jointly design the microcode to implement the given program and the datapath on which the microcode is to be executed; adding a bus to the datapath, for example, can allow the microcode to be redesigned to reduce the number of cycles required to execute a critical operation. In general, architectural compilers that accept a fully general program input synthesize relatively conventional machine architectures. For this reason, architectural compilers display a number of relatively direct applications of programming-language compilation techniques.
Architectural compilers design a machine's structure, paying relatively little attention to the physical design of transistors and wires on the chip. Silicon compilers concentrate on the chip's physical design. While these programs may take a high-level description of the chip's function, they map that description onto a restricted architecture and do little architectural optimization. Silicon compilers assemble a chip from blocks that implement the architecture's functional units. They must create custom versions of the function blocks, place the blocks, and route wires between them to implement the architecture. Perhaps the most ambitious silicon compiler project is Cathedral, a family of compilers that create digital filter chips directly from the filter specifications (pass band, ripple, etc.).
Register-transfer compilers, in contrast, implement an architecture: they take as input a description of the machine's memory elements and the data computation and transfer between those memory elements; they produce an optimized version of the hardware that implements the architecture. The target of a register-transfer compiler is a set of logic gates and memory elements; its job is to compile the most efficient hardware for the architecture. A register-transfer compiler can be used as the back-end of an architectural compiler; furthermore, many interesting chips (protocols, microcontrollers, etc.) are best described directly as register-transfer machines. Even though register-transfer descriptions bear less resemblance to programming languages than do the source languages to architectural compilers, we can still apply a number of programming-language compiler techniques to register-transfer compilation.
3. The Structure of the Compiler
Our hardware compiler performs a series of translations to transform the FSM description into a hardware implementation. Figure 3 shows the compiler's structure. The compiler represents the design as a graph; successive compilation steps transform the graph to increase the level of detail in the design and to perform optimizations.

The first step, syntax-directed translation, translates the source description into two directed acyclic graphs (DAGs): Gc for the combinational functions, and Gm for the connections between the combinational functions and the memory elements. Global optimization transforms Gc into a Gc' that computes the same functions but is smaller and faster than Gc. Gc' and Gm are composed into a single intermediate form Gi, analogous to the intermediate code generated by a programming-language compiler. The silicon code generator transforms the intermediate form into hardware implemented in the available logic gates and registers. This implementation is cleaned up by a peephole optimizer; the result can be given to a layout program to create a final chip design.
4. Syntax Directed Translation

Translation of the source into Boolean gates plus registers is syntax-directed, much as in programming-language compilers, but with some differences because the target is a purely structural description of the hardware. The source description is parsed into a pure abstract syntax tree (AST). Leaf nodes in the tree represent either signals or memory elements, which consist of an input signal and an output signal. Subtrees describe the combinational logic functions. Translation turns the AST into the graphs Gc and Gm. Simple Boolean functions of signals are directly translatable; if statements (and by implication, switch statements) also represent Boolean functions: if (cond) z = a; else z = b; becomes the function z = (cond & a) | (!cond & b). Syntax-directed translation is a good implementation strategy because global optimization methods for combinational logic are good enough to wipe out a wide variety of inefficiencies introduced by a naive translation of the source into Boolean functions. The simplest way to translate the source into logic is to reproduce the AST directly in the structure of the combinational function graph Gc, but this creates functions that are too fragmented for good global optimization.

We use syntax-directed partitioning after translation to simplify the logic into fewer and larger functions: collapsing all subexpression trees into one large function, for example, takes much less CPU time than using general global optimization functions to simplify the subexpressions. This strategy is akin to simple programming-language compiler optimizations like in-line function expansion.
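The if-statement rule above can be sketched as a tiny syntax-directed translator. A minimal sketch, assuming a tuple-based AST encoding invented here for illustration (a real compiler's AST is richer):

```python
# Sketch: "if (cond) z = a; else z = b;" becomes the Boolean function
# z = (cond & a) | (!cond & b), produced by walking the AST.

def translate(node):
    """Translate a tiny tuple AST into a Boolean-expression string."""
    op = node[0]
    if op == "sig":                       # leaf: a named signal
        return node[1]
    if op == "if":                        # ("if", cond, then_sig, else_sig)
        c = translate(node[1])
        return "(%s & %s) | (!%s & %s)" % (c, translate(node[2]),
                                           c, translate(node[3]))
    raise ValueError(op)

ast = ("if", ("sig", "cond"), ("sig", "a"), ("sig", "b"))
```

Here translate(ast) yields the string "(cond & a) | (!cond & b)", exactly the multiplexer function named in the text.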
The flow is: FSM description, syntax-directed translation into combinational logic and memory elements, global optimization, intermediate form, silicon code generation, peephole optimization, automatic layout, and finally the mask for manufacture.

Figure 3: How to compile register-transfer descriptions into layout.
5. STRUCTURE OF HARDWARE COMPILER

The structure of the hardware compiler is the pipeline of Figure 3: syntax-directed translation, global optimization, silicon code generation, peephole optimization, and automatic layout.

6. Global Optimization

As in programming-language compilation, global optimization tries to reduce the resources required to implement the machine. The resources in this case are the combinational logic functions: we want to compute the machine's outputs and next state using the smallest, fastest possible functions. Combinational logic optimization combines minimization, which eliminates obvious redundancies in the combinational logic (analogous to dead-code elimination) and simplifies the logic (analogous to strength reduction and constant folding), and common subexpression elimination, which reduces the size of the logic functions. Some methods solve both the minimization and common subexpression elimination problems, but these methods can become stuck in local minima. We use other global optimization techniques, very different from programming-language compiler methods, that take advantage of the restrictions of the Boolean domain. We reduce the combinational logic to a canonical sum-of-products form and use modern extensions [8] to the classical Quine-McCluskey optimization procedure. We eliminate common subexpressions using a technique [9] tuned to the canonical sum-of-products representation. Compiler-oriented methods have also been used with success by the IBM Logic Synthesis System.

Because Boolean logic is a relatively simple theory, our global optimizer can deduce many legal transformations of the combinational functions and explore a large fraction of the total design space. Global optimization commonly reduces total circuit area by 15%. Global logic optimization is similar to machine-independent code optimization in programming-language compilers in that it is performed on an intermediate form, in this case networks of Boolean functions in sum-of-products form. As in programming-language compilation, many optimizations can be performed on this implementation-independent representation; but, as in programming languages, some optimizations cannot be identified until the target, in this case the hardware gates and registers available, is known. For example, if the available hardware includes a register that provides its output in both true and complemented form, we can use that complemented form directly rather than compute it with an external inverter. Global optimization algorithms for Boolean functions are not good at finding these implementation-dependent optimizations; the code-generation phase transforms the intermediate form into efficient hardware.
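One of the sum-of-products simplifications discussed above, absorption, can be sketched in a few lines. This is a minimal sketch, not the full Quine-McCluskey procedure, and the frozenset-of-literals encoding is an assumption made here for illustration:

```python
# Sketch: in a sum-of-products form, a product term that is a superset of
# another term is redundant (absorption), e.g. a&b + a&b&c reduces to a&b.

def absorb(terms):
    """terms: list of frozensets of literals; drop absorbed products."""
    kept = []
    for t in terms:
        # t is redundant if some strictly smaller term already covers it
        if not any(other < t for other in terms if other != t):
            kept.append(t)
    return kept
```

For example, absorb([frozenset("ab"), frozenset("abc")]) keeps only the term {a, b}, since a&b already covers every minterm of a&b&c.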
7. Silicon Code Generation
The portion of our system that makes the most innovative use of compiler methods is the silicon code generator. Our circuits are implemented using a library of registers and gates that implement a subset of the Boolean functions: NAND, NOR, AND-OR-INVERT (AOI), etc. This library forms the silicon instruction set; it is an exact analog of a machine's instruction set. The silicon code generator's job is to translate the intermediate form produced by the global optimizer into the best silicon code: networks of the gates and registers available in the library. The first approach to silicon code generation was ad hoc, reminiscent of the "do what you have to do" era of code generation. Another approach that attracted interest used a rule-based expert system [19]. But silicon code generation is so similar to tree-matching methods for retargetable code generation that we use twig [2] [30], a tree manipulator used for constructing code generators for programming-language compilers, to create the DAGON silicon code generator [21]. DAGON can implement hardware optimized for speed, area, or some function of both.

Figure 4: The silicon code generation process.

Figure 4 shows how silicon code generation works. For each new set of silicon instructions we must generate a tree-matching program. Twig compiles a set of patterns and actions that describe the code transformations into a tree matcher for that library. The code generation phase of a hardware compilation run starts from the intermediate form Gi that includes both the Boolean functions for the combinational logic and the memory elements. That graph is parsed and broken into a forest of trees which are fed to the code generator.
As we will describe later, code generation can be performed iteratively to optimize the silicon
code for speed. At the end of code generation the intermediate form has been compiled into
an equivalent piece of hardware built from the available library components.
7.1 Compiling the Tree Matcher
Code generation finds the best mapping from the intermediate form onto the object code by
tree matching. We want an efficient implementation of the matching program, but we also
want to be able to matching automaton from the code patterns and the actions.The patterns supplied to twig describe either direct mappings of the intermediate form onto
the target code or may describe a mapping that contains a simple local optimization. Local
optimizations may be thought of as clever code sequences that are not natural matches of the
instruction set.
A twig pattern statement has three parts. The first is the pat tern itself, writ ten in a form like
that used by parser generators. A pattern can be used to describe a class of simple pattern
instances by parameterizing the pattern. For instance, five patterns for AOIxxx actually
describe sixty-four pattern instances from an A01444 to an AOI211. Many of these patterns
are symmetric-an A01114 is equivalent to, and would be output as, an AO1411.
The second part of a pattern statement is a formula that defines the cost of applying the
pattern. Separately we supply twig with functions for adding, computing, and comparing
costs. The tree matcher computes the cost for a tree using the user-supplied routine uses these
costs to choose by dynamic programming the minimum cost match over the entire tree.
The third part of a pattern statement is either a rewrite command or an action routine. Some
matches cause the input circuit to be transformed and reprocessed to find new matches, while
others cause immediate actions. DAGONS action is always to write out the code chosen by
the match. The patterns, actions, and cost functions are compiled by twig into a tree-matching
automaton.
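The three parts of a pattern statement might be modelled as follows. The dictionary encoding, the cost numbers, and the gate name are assumptions made for illustration; twig's actual pattern language looks quite different:

```python
# Sketch: a pattern statement has (1) a tree shape to match, (2) a cost
# formula, and (3) an action that emits the matched gate.

nand2_pattern = {
    "pattern": ("NOT", ("AND", "x", "y")),                  # shape to match
    "cost": lambda subtree_costs: 4 + sum(subtree_costs),   # assumed area units
    "action": lambda inputs: "ND2(%s)" % ", ".join(inputs), # emit the gate
}

def apply_pattern(p, subtree_costs, inputs):
    """Evaluate a pattern's cost and run its action on the matched inputs."""
    return p["cost"](subtree_costs), p["action"](inputs)
```

With subtree costs [1, 2] and inputs a, b, this pattern yields cost 7 and the emitted code "ND2(a, b)".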
Figure 5: Partitioning the intermediate form DAG into trees
The automaton finds the optimum match for a tree in two phases: first, finding all the patterns that match subtrees; next, finding the set of matches that cover the tree at the lowest cost. We will describe these steps in detail below. We have written 60 parameterized patterns to describe the AT&T Cell Library. These patterns expand into over seven hundred pattern instances that represent permutations of our hardware library of approximately 165 gates and a few thousand unique local optimizations. The modularity afforded by describing the code generator as patterns, actions, and cost functions makes it easy to port our code generator to a new hardware library, just as code generators can be ported to new instruction sets. To date we have created code generators for six different hardware libraries.
Figure 6: An intermediate form tree and all its possible matches.
7.2 Code Generation
The intermediate form Gi is a DAG, so it must be broken into a forest of trees before being
processed by the tree matcher produced by twig. Figure 5 shows an intermediate form DAG
partitioned into three trees; the Xs show where the DAG has been cut. The partitioned DAG
is sent one tree at a time to the tree matcher. It first determines all possible matches for each
vertex u. The matching automaton finds all possible matches using the Aho-Corasick
algorithm[l], which works in time O(TREESIZE xMAX-MATCHES): the size of the tree
times the maximum number of matches possible at any node. Figure 6 shows a tree and all its
matches. Each node matched a pattern corresponding to either a two-input NAND gate (ND2)
or an inverter (INV); in addition, the subtree formed by the top three ranks of nodes matcheda larger pattern corresponding to an A0121 gate. The optimum code is generated by finding
the least-cost match of patterns that covers the tree from among all the possible matches. The
tree matcher generated by twig finds the optimum match by dynamic programming (41.
Optimum tree matching can be solved by dynamic programming because the best match for
the root of a tree is a function of the best matches for its subtrees. Figure 7 shows how the
pattern-matching automaton finds the optimum match for our example tree. The tree matcher
starts from the leaf nodes and successively matches larger subtrees. The nodes at the bottom
of the tree each have only one match; the number shown with the name of the best match is
the total cost of that subtree. The root node has two matches: the INV match gives a total tree
cost of 13; the A012 1 match (which covers several previously matched nodes) gives a tree of
cost 7. The A0121 match gives the smallest total cost, so the automaton labels the node withA012 1. This example is made simple because we tried to match only a few instructions;
matching even this simple DAG against our full instruction set would give four to five matches per node. Pure tree matching would not be able to handle two important cases: matching non-tree patterns, like exclusive-or gates; and matching patterns across tree boundaries. We handle these cases by associating attributes that describe global information (fan-out, inverse, and signal) with each vertex in the tree. Inverse tells if the inverted sense of a signal is available somewhere in the circuit. This is very useful because a number of area-reducing optimizations can often be made if the inverted sense of a signal can be used in place of the signal itself. Signal gives the name of a signal, which can be used with inverse to represent a non-tree pattern such as an exclusive-or. Fan-out is used to determine timing costs. Because patterns are matched against a forest of trees rather than the original DAG, DAGON may not find the globally optimal match. Finding an optimal DAG matching (DAG covering) is an NP-complete problem [31]. However, the combination of tree matching and peephole optimizations has proven to be an effective method for silicon code generation. The MIS system tried extending tree matching to DAG covering [17], but the results were disappointing. MIS now uses tree matching for code generation.
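The bottom-up dynamic programming described above can be sketched for a Figure 6-style tree, with ND2, INV, and one larger AOI21 pattern. The gate costs and the tuple tree encoding are invented for illustration:

```python
# Sketch: best(node) is the cheapest cover of the subtree at node,
# computed bottom-up from the cheapest covers of each pattern's inputs.

COSTS = {"INV": 1, "ND2": 2, "AOI21": 3}   # assumed area costs per gate

def best(node):
    """Minimum cost to cover the subtree rooted at node with library gates."""
    if isinstance(node, str):              # a primary input: costs nothing
        return 0
    op, choices = node[0], []
    if op == "nand":
        choices.append(COSTS["ND2"] + best(node[1]) + best(node[2]))
    elif op == "inv":
        choices.append(COSTS["INV"] + best(node[1]))
        n = node[1]
        # larger pattern: INV(NAND(NAND(a, b), INV(c))) computes !(a&b | c),
        # i.e. a single AOI21 gate covering the whole subtree at once
        if (not isinstance(n, str) and n[0] == "nand"
                and not isinstance(n[1], str) and n[1][0] == "nand"
                and not isinstance(n[2], str) and n[2][0] == "inv"):
            a, b, c = n[1][1], n[1][2], n[2][1]
            choices.append(COSTS["AOI21"] + best(a) + best(b) + best(c))
    return min(choices)
```

On the tree ("inv", ("nand", ("nand", "a", "b"), ("inv", "c"))) the naive cover costs 6 (two NANDs and two inverters) while the single AOI21 match costs 3, so the matcher picks AOI21, mirroring how the larger pattern wins in the Figure 7 example.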
Figure 7: Finding an optimal match.
7.3 Local Optimizations for Speed
While not all industrial designs have to run at high speed, there are always circuits that have
to run very fast. Because path delay is a global property of the hardware it can be hard fordesigners to optimize large machines for speed. Furthermore, reducing the delay of a piece of
hardware almost always costs additional area. We want to meet the designers speed
requirement at a minimum area cost. As Figure 4 shows, code generation can be applied
iteratively to jointly optimize for speed and area. The first pass through the tree matcher
produces hardware strictly optimized for area. A static delay analyzer finds parts of the
hardware that must be speeded up and produces speed constraints. The code generated by the
previous pass is translated back into the intermediate form and sent through the tree matcher
along with the constraints. The constraints cause the matcher to abort matches that cannot run
at the required speed. Because many Boolean functions are represented by both low-
speed/low-power and high-speed/high-power gates in the library, the matcher often need only
substitute the high-speed version of a gate to meet the speed constraint, but a constraint canforce the entire tree to be rematched. By this procedure we have been able to routinely double
the speed of circuits with an area penalty of only 10%. This is such a significant improvement
that DAGON is being used for timing optimization of manually designed hardware. Speed
optimization is sufficiently hard that DAGON has been able to similarly improve even these
hand-crafted designs. of DAGs, not trees. We are presently investigating techniques for
height minimization a DAG.
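The substitute-fast-gates-until-the-path-meets-timing idea can be caricatured with a one-path toy model. All library numbers here are invented: each gate exists in a slow/low-power and a fast/high-power version, and the loop swaps in fast versions until the chain meets the delay target:

```python
# Toy sketch of the speed/area trade: start with the cheapest (slow) gates
# and substitute high-speed versions only until the path delay is met.

SLOW = {"area": 1, "delay": 3}   # assumed low-power gate
FAST = {"area": 2, "delay": 1}   # assumed high-speed gate

def speed_up(path_len, max_delay):
    """Return (#fast gates used, total area, total delay) for a gate chain."""
    gates = [SLOW] * path_len
    i = 0
    while sum(g["delay"] for g in gates) > max_delay and i < path_len:
        gates[i] = FAST              # swap in the high-speed version
        i += 1
    area = sum(g["area"] for g in gates)
    delay = sum(g["delay"] for g in gates)
    return i, area, delay
```

For a chain of 4 gates and a delay budget of 8, two substitutions suffice: the result is 2 fast gates, area 6, delay 8, a modest area penalty for a large delay reduction, in the spirit of the 10%-area/2x-speed figure quoted above.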
8. Peephole Optimization
Inverters are the bane of a silicon code generator. Because trees are matched independently, it is common for a signal to be inverted, due to a local optimization, in two or more different trees when a single inverter would do the job. We clean up these and other problems with a peephole optimization phase that minimizes the number of inverters used. As in programming-language compilers, the peephole optimizer looks at the circuit through a small window; but this window reveals a portion of the circuit graph, not a portion of code text as in an object-code peephole optimizer. Peephole optimization reduces combinational logic area by as much as 10%.
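The inverter clean-up can be sketched as follows. The triple-based netlist encoding and the net names are assumptions made for illustration:

```python
# Sketch: if several trees each instantiated an inverter for the same
# signal, keep the first one and rewire the later outputs to it.

def dedupe_inverters(netlist):
    """netlist: list of (gate, input, output); merge INV gates sharing input."""
    seen = {}            # inverted signal -> canonical inverter output net
    rewire, kept = {}, []
    for gate, src, out in netlist:
        if gate == "INV":
            if src in seen:
                rewire[out] = seen[src]   # reuse the earlier inverter
                continue                  # drop the redundant gate
            seen[src] = out
        kept.append((gate, src, out))
    return kept, rewire
```

Given two inverters of the same signal a driving nets n1 and n2, the pass keeps the first inverter and records that n2 should be rewired to n1.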
9. Work-In-Progress: Global Speed Optimization
While circuit timing problems related to very local circuit properties, such as high capacitive
loading, are well solved by the techniques described in the previous sections, not all timing
problems in digital hardware can be solved by a single application of local optimizations.
Some timing problems are due to having too many gates in a particular path through the
hardware. To solve this problem the hardware must often be drastically restructured. As local
optimizations using tree matching are relatively inexpensive (linear time), our current
approach is to iteratively apply the local optimizations
to achieve more global transformations. This procedure is both aesthetically and practically
unsatisfying.
We would prefer to make hardware transformations that more quickly result in hardware which meets timing requirements. Tree height reduction algorithms [10], developed in the context of reducing the height of arithmetic expression trees to speed their execution on a parallel processor, suggest an approach to this problem. While this method, and its successors, has promise, we have found two limitations in its application to hardware. First, when compiling to a fixed architecture, resources are free up to the point of saturation and infinitely expensive after that. Thus it makes sense to reduce tree height until all processors are saturated. When designing hardware, however, we are interested in speeding up the hardware to some limit specified by the user. We have no fixed resource limit in silicon code generation to provide a lower bound on tree height, but maximally reducing all trees would create hardware much larger than necessary. Second, we are concerned with minimizing the height of DAGs, not trees. We are presently investigating techniques for height minimization of a DAG.
10. Evaluation of the Silicon Instruction Set
Highly portable compilers for fixed architectures encouraged the evaluation of instruction
sets to determine the most cost-effective set of instructions: instruction sets that gave the
fastest compiled code at the minimum cost in hardware. While a larger silicon instruction set
does not have the unpleasant side effect of increasing the complexity of the implementation, a
larger silicon instruction set does require more human resources to develop and maintain. The
cost of developing and maintaining the instruction set must be weighed against the
improvement in the speed and area of hardware that can be compiled. The availability of a
highly portable silicon code generator made possible a long overdue experiment, that of
evaluating the impact of the silicon instruction set on the quality of the final hardware.
We evaluated five different silicon instruction sets ranging in complexity from 2 to 125 gates.
Our results on the twenty standard benchmark examples show a monotonic, though exponentially decreasing, reduction in chip area, ranging from 26% (7 instructions) to 45%
(125 instructions). These results were further verified against a number of industrial designs.
If we compare chip area with size of object code, these results simply show that with the
addition of more and more powerful instructions the size of object code is reduced. What was
less predictable was the knee of the area reduction per number of instructions curve. Beyond
an instruction set size of 37 instructions, which gave an area reduction of 40%, adding
instructions gave little area reduction: an additional 88 instructions only reduced area by an
additional 2%. A more surprising result was that all silicon instruction sets showed
comparable speed.
11. Working of hardware compiler

11.1. Variable Declarations

All variables are introduced by their declaration. We use the notation

VAR x, y, z: Type

In the corresponding circuit, the variables are represented by registers holding their values. The output carries the name of the variable represented by the register, and the enable signal
determines when a register is to accept a new input. We postulate that all registers be clocked by the same clock signal; that is, all derived circuits will be synchronous circuits. Therefore, clock signals will subsequently not be drawn and are assumed to be implicit.
11.2. Assignment Statements

An assignment is expressed by the statement

y := f(x)

where y is a variable, x is (a set of) variables, and f is an expression in x. The assignment statement is translated into the circuit according to Fig. 1, with the function f resulting in a combinational circuit (i.e. a circuit containing neither feedback loops nor registers). e stands for enable; this signal is active when the assignment is to occur.

Given a time t when the register(s) have received a value, the combinational circuit f yields the new value f(x) after a time span pd called the propagation delay of f. This requires the next clock tick to occur no earlier than time t + pd. In other words, the propagation delay determines the clock frequency f < 1/pd. The concept of the synchronous circuit dictates that the frequency is chosen according to the circuit f with the largest propagation delay among all function circuits.
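The rule that the slowest combinational block limits the clock, f < 1/pd, can be checked with a one-liner. The delay values in the example are illustrative:

```python
# Sketch: the clock period must cover the largest propagation delay
# among all combinational blocks, so f_clk < 1 / max(pd).

def max_clock_frequency(prop_delays):
    """prop_delays: propagation delay (seconds) of each combinational block."""
    return 1.0 / max(prop_delays)
```

For blocks with delays of 2 ns, 5 ns, and 3 ns, the 5 ns block limits the clock to at most 200 MHz.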
The simplest types of variables have only 2 possible values, say 0 and 1. They are said to be of type BIT (or Boolean). However, we may just as well consider composite types, that is, variables consisting of several bits. A type Byte, for example, might consist of 8 bits, and the same holds for a type Integer. The translation of arithmetic functions (addition, subtraction, comparison, multiplication) into corresponding combinational circuits is well understood. Although these may be of considerable complexity, they are still purely combinational circuits.
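The behaviour of such an enabled register can be mimicked in software. The following is a minimal sketch (the `Register` class and the function `f` are illustrative assumptions, not part of the paper's notation): a register accepts a new value on a clock tick only when its enable signal is active.

```python
# Sketch of a clocked register with an enable signal, driven by a
# combinational function f, modelling the assignment y := f(x).
class Register:
    def __init__(self, value=0):
        self.value = value          # current output of the register

    def clock(self, data, enable):
        # On a clock tick the register accepts a new input only
        # when its enable signal is active.
        if enable:
            self.value = data

def f(x):
    # Example combinational circuit: no feedback, no internal state.
    return x + 1

# Simulate two clock ticks of the assignment y := f(x).
x = Register(5)
y = Register()
y.clock(f(x.value), enable=True)    # assignment occurs: y becomes 6
y.clock(f(99), enable=False)        # enable inactive: y keeps its value
print(y.value)                      # 6
```

The propagation delay has no software analogue here; in hardware it bounds the clock frequency as described above.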
11.3. Parallel Composition
We denote parallel composition of two statements S0 and S1 by
S0, S1
The meaning is that the two component statements are executed concurrently. The characteristic of parallel composition is that the affected registers feature the same enable signal. Translation into a circuit is straightforward, as shown in Fig. 1 (right side) for the example
x := f(x, y), y := g(x, y)
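Because both registers share one enable signal, both latch in the same cycle, and each combinational circuit sees the old values of x and y. A small sketch (the bodies of f and g are arbitrary assumptions):

```python
# Sketch of the parallel composition x := f(x, y), y := g(x, y):
# both registers update simultaneously from the *old* values.
def step(x, y):
    fx = x + y      # example combinational circuit f
    gy = x - y      # example combinational circuit g
    return fx, gy   # both registers latch in the same clock cycle

x, y = 3, 1
x, y = step(x, y)
print(x, y)         # 4 2
```

Note that a sequential reading (first x := f(x, y), then y := g(x, y) using the new x) would give a different result; the simultaneous latch is exactly what the shared enable signal guarantees.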
11.4. Sequential Composition
Traditionally, sequential composition of two statements S0 and S1 is denoted by
S0; S1
The sequencing operator ";" signifies that S1 is to be executed after completion of S0. This necessitates a sequencing mechanism. The example of the three assignments
y := f(x); z := g(y); x := h(z)
is translated into the circuit shown in Fig. 2, with enable signals e corresponding to the statements.
The upper line contains the combinational circuits and registers corresponding to the
assignments. The lower line contains the sequencing machinery assuring the proper
sequential execution of the three assignments. With each statement is associated an individual
enable signal e. It determines when the assignment is to occur, that is, when the register
holding the respective variable is to be enabled. The sequencing part is a "one-hot" state
machine. The following table illustrates the signal values before and after each clock cycle.
e0 e1 e2 e3 | x            y      z
1  0  0  0  | x0           ?      ?
0  1  0  0  | x0           f(x0)  ?
0  0  1  0  | x0           f(x0)  g(f(x0))
0  0  0  1  | h(g(f(x0)))  f(x0)  g(f(x0))
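The table above can be reproduced by a small simulation of the one-hot sequencer (the concrete bodies of f, g, and h are illustrative assumptions): a single enable token circulates, and in each cycle exactly one assignment fires.

```python
# Sketch of the one-hot sequencer for y := f(x); z := g(y); x := h(z).
def f(x): return x + 1
def g(y): return y * 2
def h(z): return z - 3

x, y, z = 5, None, None
enables = [1, 0, 0, 0]          # one-hot state: e0 active initially
for _ in range(3):
    if enables[0]: y = f(x)     # cycle 0: y := f(x)
    elif enables[1]: z = g(y)   # cycle 1: z := g(y)
    elif enables[2]: x = h(z)   # cycle 2: x := h(z)
    enables = [0] + enables[:3] # shift the token: e_i -> e_(i+1)
print(x, y, z)                  # 9 6 12
```

After three cycles the token sits at e3 and the registers hold h(g(f(x0))), f(x0), and g(f(x0)), exactly as in the table.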
The preceding example is a special case in the sense that each variable is assigned only a
single value, namely y is assigned f(x) in cycle 0, z is assigned g(y) in cycle 1, and x is
assigned h(z) in cycle 2. The generalized case is reflected in the following short example:
x := f(x); x := g(x)
which is transformed into the circuit shown in Fig. 3. Since x receives two values, namely
f(x) in cycle 0 and g(x) in cycle 1, the register holding x needs to be preceded by a
multiplexer.
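The role of that multiplexer can be sketched as follows (the mux model and the bodies of f and g are assumptions for illustration): both combinational circuits are physically present and compute in every cycle, and the select line decides which result the register latches.

```python
# Sketch of the multiplexer in front of the register for
# x := f(x); x := g(x).
def f(x): return x + 10
def g(x): return x * 3

def mux(sel, a, b):
    # 2-to-1 multiplexer: sel == 0 passes a, sel == 1 passes b.
    return b if sel else a

x = 1
x = mux(0, f(x), g(x))   # cycle 0: register latches f(x) = 11
x = mux(1, f(x), g(x))   # cycle 1: register latches g(x) = 33
print(x)                 # 33
```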
11.5. Conditional Composition
This is expressed by the statement
IF b THEN S END
where b is a Boolean variable and S a statement. The circuit shown on the left side of Fig. 4 is derived from it, with S standing for y := f(x). The only difference from Fig. 1 is the derivation of the associated sequencing machinery yielding the enable signal for y.
It now emerges clearly that the sequencing machinery, and it alone, directly reflects the control statements, whereas the data parts of circuits reflect assignments and expressions.
The conditional statement is easily generalized to its following form:
IF b0 THEN S0 ELSE S1 END
In its corresponding circuit on the right side of Fig. 4 we show the sequencing part only with
the enable signals corresponding to the various statements. At any time, at most one of the
statements S is active after being triggered by e0.
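At the sequencing level, the conditional simply gates the incoming enable with the condition. A minimal sketch (the helper `branch_enables` is a hypothetical name, not from the paper):

```python
# Sketch of IF b THEN S0 ELSE S1 END at the sequencing level:
# the incoming enable e0 is gated by b into two mutually
# exclusive statement enables.
def branch_enables(e0, b):
    # At most one enable is active, and only when the construct
    # itself has been triggered by e0.
    return (e0 and b, e0 and not b)   # (enable for S0, enable for S1)

print(branch_enables(True, True))    # (True, False)
print(branch_enables(True, False))   # (False, True)
print(branch_enables(False, True))   # (False, False)
```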
11.6. Repetitive Composition
We consider repetitive constructs traditionally expressed as repeat and while statements.
Since the control structure of a program statement is reflected by the sequencing machinery
only, we may omit showing associated data assignments, and instead let each statement be
represented by its associated enable signal only.
We consider the statements
REPEAT S UNTIL b and WHILE b DO S END
with b again standing for a Boolean variable. They translate into the circuits shown in Fig. 5, where e stands for the enable signal of statement S.
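The essential difference between the two circuits is where the condition is tested relative to the body's enable pulse. A small sketch, counting enable pulses in a simulation (function names and bodies are assumptions):

```python
# Sketch of REPEAT S UNTIL b versus WHILE b DO S END:
# the repeat loop tests the condition after the body,
# the while loop tests it before.
def repeat_until(state, body, cond):
    pulses = 0
    while True:
        state = body(state); pulses += 1   # S always runs at least once
        if cond(state):
            return state, pulses

def while_do(state, body, cond):
    pulses = 0
    while cond(state):
        state = body(state); pulses += 1   # S may run zero times
    return state, pulses

inc = lambda s: s + 1
print(repeat_until(5, inc, lambda s: s >= 3))  # (6, 1): one pulse anyway
print(while_do(5, inc, lambda s: s < 3))       # (5, 0): zero pulses
```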
11.7. Selective Composition
The selection of one statement out of several is expressed by the case statement, whose
corresponding sequencing circuitry containing a decoder is shown in Fig. 6.
CASE k OF
0: S0 | 1: S1 | 2: S2 | ... | n: Sn
END
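The decoder mentioned above activates exactly one statement enable from the value of k; a one-line sketch (the function `decode` is an illustrative name):

```python
# Sketch of the decoder in the case statement's sequencing circuitry:
# the selector value k drives exactly one of n enable lines high.
def decode(k, n):
    # One-hot decoder: n output lines, line k active.
    return [1 if i == k else 0 for i in range(n)]

print(decode(2, 4))   # [0, 0, 1, 0]
```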
11.8. Preliminary Discussion
At this point we realize that arbitrary programs consisting of assignments, conditional, and repetitive statements can be transformed into circuits according to fixed rules, and therefore automatically.
However, it is also to be feared that the complexity of the resulting circuits may quickly surpass reasonable bounds. After all, every expression occurring in a program results in an individual combinational circuit; every add symbol yields an adder, every multiply symbol turns into a combinational multiplier, potentially with hundreds or thousands of gates. We agree that for the present this scheme is hardly practical except for toy examples. Cynics will remark, however, that soon the major problem will no longer be the economical use of components, but rather finding good use for the hundreds of millions of transistors on a chip.
In spite of such warnings, we propose to look at ways to reduce the projected hardware explosion. The best solution is to share subcircuits among various parts. This, of course, is equivalent to the idea of the subroutine in software. We must therefore find a way to translate subroutines and subroutine calls. We emphasize that the driving motivation is the sharing of circuits, not the reduction of program text.
Therefore, a facility for declaring subroutines and textually substituting their calls by their bodies, that is, handling them like macros, must be rejected. This is not to deny the usefulness of such a feature, as it is for example provided in the language Lola, where the calls may be parameterized [2, 3]. But with its expansion of the circuit for each call, it contributes to the circuit explosion rather than avoiding it.
11.9. Subroutines
The circuits obtained so far are basically state machines. Let every subroutine translate into
such a state machine. A subroutine call then corresponds to the suspension of the calling
machine, the activation of the called machine, and, upon its completion, a resumption of the
suspended activity of the caller.
A first measure to implement subroutines is to provide every register in the sequencing part with a common enable signal. This allows a given machine to be suspended and resumed. The second measure is to provide a stack (a last-in first-out store) of identifications of suspended machines, typically numbers from 0 to some limit n. We propose the following implementation, which is to be regarded as a fixed base part of all generated circuits, analogous to a runtime subroutine package in software. The extended circuit is a push-down machine.
The stack of machine numbers (analogous to return addresses) consists of a (static) RAM of m words, each of n bits, and an up/down counter generating the RAM addresses. Each of the n bits in a word corresponds to one of the circuits representing a subroutine. One of them has the value 1, identifying the circuit to be reactivated. Hence, the (latched) readout directly specifies the values of the enable signals of the n circuits. This is shown in Fig. 7. The stack is operated by the push signal, which increments the counter and thereafter writes the applied input to the RAM, and by the pop signal, which decrements the counter.
The push signal is activated whenever a state register belonging to a call statement becomes active. Since a subroutine may contain several calls itself, these register outputs must be ORed; a better solution would be a bus. The pop signal is activated when the last state of a circuit representing a subroutine is reached. An obvious enhancement, with the purpose of enlarging the number of possible subcircuits without unduly expanding the RAM's width, is to place an encoder at the input and a decoder at the output of the RAM. A width of n bits then allows for 2^n subroutines.
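The stack mechanism can be sketched as a software analogue (the `MachineStack` class is a hypothetical model of the RAM plus up/down counter, not the paper's circuit): push saves the caller's one-hot identification word, pop reads it back and restores the corresponding enables.

```python
# Sketch of the subroutine stack: a RAM of one-hot words and an
# up/down counter generating the RAM addresses.
class MachineStack:
    def __init__(self, depth, n):
        self.ram = [[0] * n for _ in range(depth)]  # depth words of n bits
        self.counter = 0                            # up/down address counter

    def push(self, onehot):
        # Increment the counter, then write the suspended machine's id.
        self.counter += 1
        self.ram[self.counter] = onehot

    def pop(self):
        # Read the enables of the machine to reactivate, then decrement.
        word = self.ram[self.counter]
        self.counter -= 1
        return word

stack = MachineStack(depth=8, n=4)
stack.push([1, 0, 0, 0])      # machine 0 suspends to call a subroutine
stack.push([0, 0, 1, 0])      # machine 2 suspends to call another
print(stack.pop())            # [0, 0, 1, 0]: reactivate machine 2 first
print(stack.pop())            # [1, 0, 0, 0]: then machine 0
```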
The scheme presented so far excludes recursive procedures. The reason is that every subcircuit must have at most one suspension point, that is, it can have been activated at most once. If recursive procedures are to be included, a scheme allowing for several suspension points must be found. This implies that the stack entries must contain not only an identification of the suspended subcircuit, but also of the suspension point of the call in question. This can be achieved in a manner similar to storing the circuit identification, either as a bit set or as an encoded value. The respective state re-enable signals must then be properly qualified, further complicating the circuit. We shall not pursue this topic any further.
12. Conclusion
We have found that a number of techniques and algorithms developed for software compilers for fixed architectures are useful in hardware compilation. We add the caveat that, while programming-language compilation techniques for fixed architectures can often give us a good start on many problems in hardware compilation, it is important to remember that language-to-hardware compilation is harder precisely because the hardware architecture is not fixed. A good software compiler must optimize its code for its target architecture. A good hardware compiler must be able to optimize its code for a target architecture, but must also explore resource tradeoffs across a potentially wide range of such targets. Nevertheless, we hope that hardware and programming-language compilation techniques become even more closely related as we discover more general ways to compile software descriptions into custom hardware. For the moment we continue to experiment with and extend our hardware compiler technology to make it more general, while still retaining its ability to create industrial-quality designs.
It is now evident that subroutines, and more so procedures, introduce a significant degree of complexity into a circuit. It is indeed highly questionable whether it is worthwhile to consider their implementation in hardware. It comes as no surprise that state machines,
implementing assignments, conditional and repetitive statements, but not subroutines, play
such a dominant role in sequential circuit design. The principal concern in this area is to
make optimal use of the implemented facilities. This is achieved if most of the subcircuits
yield values contributing to the process most of the time. The simplest way to achieve this
goal is to introduce as few sequential steps as possible. It is the underlying principle in
computer architectures, where (almost) the same steps are performed for each instruction.
The steps are typically the subcycles of an instruction interpretation. Hardware acts as what is
known in software as an interpretive system.
The strength of hardware is the possibility of concurrent operation of subcircuits. This is also a topic in software design. But we must be aware of the fact that concurrency is only possible by having concurrently operating circuits to support this concept. Much work on parallel computing in software actually ends up implementing quasi-concurrency, that is, pretending concurrent execution and, in fact, hiding the underlying sequentiality. This leads us to contend that any scheme of direct hardware compilation may well omit the concept of subroutines, but must include the facility of specifying concurrent, parallel statements. Such a hardware programming language may indeed be the best way to let the programmer specify parallel statements. We call this fine-grained parallelism.
Coarse-grained concurrency may well be left to conventional programming languages, where parallel processes interact infrequently, and where they are generated and deleted at arbitrary but distant intervals. This claim is amply supported by the fact that fine-grained parallelism has mostly been left to be introduced by compilers, under the hood, so to say.
The reason for this is that compilers may be tuned to particular architectures, and can therefore make use of their target computer's specific characteristics. In this light, the consideration of a common language for hardware and software specification has a certain merit.
It may reveal the inherent difference in the designers' goals. As C. Thacker expressed it
succinctly: "Programming (software) is about finding the best sequential algorithm to solve a
problem, and implementing it efficiently. The hardware designer, on the other hand, tries to
bring as much parallelism to bear on the problem as possible, in order to improve
performance". In other words, a good circuit is one where most gates contribute to the result
in every clock cycle. Exploiting parallelism is not an optional luxury, but a necessity.
We end by repeating that hardware compilation has gained interest in practice primarily because of the recent advent of large-scale programmable devices. These can be configured on the fly, and hence be used to directly represent circuits generated by a hardware compiler. It is therefore quite conceivable that parts of a program are compiled into instruction sequences for a conventional processor, and other parts into circuits to be loaded onto programmable gate arrays. Although both are specified in the same language, the engineer will observe different criteria for good design in the two areas.
13. References
1. A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, June 1975.
2. A. Aho and M. Ganapathi. Efficient tree pattern matching: an aid to code generation. In Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming Languages, pages 334-340, January 1985.
3. A. Aho and S. C. Johnson. Code generation for a one register machine. Journal of the ACM, 23(3):502-510, 1976.
4. A. Aho and S. C. Johnson. Optimal code generation for expression trees. Journal of the ACM, 23(3):488-501, 1976.
5. The Yorktown Silicon Compiler. In Proceedings, International Conference on Circuits and Systems, pages 391-394, IEEE Circuits and Systems Society, June 1985.
6. R. Brent, D. Kuck, and K. Maruyama. The parallel evaluation of arithmetic expressions without division. IEEE Transactions on Computers, pages 532-534, May.
7. J. Darringer, W. Joyner, C. L. Berman, and L. Trevillyan. Logic synthesis through local transformations. IBM Journal of Research and Development, 25(4):272-280, 1981.
8. H. DeMan, J. Rabaey, P. Six, and L. Claesen. Cathedral-II: a silicon compiler for digital signal processing. IEEE Design & Test, 3(6):13-25, 1984.
9. I. Page. Constructing hardware-software systems from a single description. Journal of VLSI Signal Processing, 12:87-107, 1996.
10. N. Wirth. Digital Circuit Design. Springer-Verlag, Heidelberg, 1995.
11. www.lola.ethz.ch
12. Y. N. Patt, et al. One Billion Transistors.