
TERM PAPER
CSE318: System Software

HARDWARE COMPILER

Submitted to: Ms. Ramanpreet Kaur Lamba

Submitted by: Rahul Abhishek
B.Tech (CSE) + MBA
Roll no.: RK2r05A21
Regd. no.: 11006646


    ACKNOWLEDGMENT

I have the honor to express my gratitude to Ms. Ramanpreet Kaur Lamba for her timely help, support, encouragement, valuable comments, suggestions, and many innovative ideas in carrying out this project. It is a great opportunity for me to gain from her experience.

I also owe an overwhelming debt to my classmates for their constant encouragement and support throughout the execution of this project.

Above all, I consider it a blessing of the Almighty God that I have been able to complete this project. His contribution towards every aspect of our lives will always be supreme and unmatched.

    Rahul Abhishek


    ABSTRACT

This term paper presents the co-development of the instruction set, the hardware, and the compiler for the Vector IRAM media processor. A vector architecture is used to exploit the data parallelism of multimedia programs, which allows the use of highly modular hardware and enables implementations that combine high performance, low power consumption, and reduced design complexity. It also leads to a compiler model that is efficient both in terms of performance and executable code size. The memory system for the vector processor is implemented using embedded DRAM technology, which provides high bandwidth in an integrated, cost-effective manner.

The hardware and the compiler for this architecture make complementary contributions to the efficiency of the overall system. This paper explores the interactions and tradeoffs between them, as well as the enhancements to a vector architecture necessary for multimedia processing. We also describe how the architecture, design, and compiler features come together in a prototype system-on-a-chip, able to execute 3.2 billion operations per second per watt.


    CONTENTS

1. INTRODUCTION
2. Software and Hardware Compilers Compared
3. The Structure of the Compiler
4. Syntax Directed Translation
5. STRUCTURE OF HARDWARE COMPILER
6. Global Optimization
7. Silicon Code Generation
7.1. Compiling the Tree Matcher
7.2. Code Generation
7.3. Local Optimizations for Speed
8. Peephole Optimization
9. Work-In-Progress: Global Speed Optimization
10. Evaluation of the Silicon Instruction Set
11. Working of hardware compiler
12. CONCLUSION


13. REFERENCES

1. Introduction

The direct generation of hardware from a program, more precisely the automatic translation of a program specified in a programming language into a digital circuit, has been a topic of interest for a long time. So far it has appeared to be a rather uneconomical method of largely academic but hardly practical interest. It has therefore not been pursued with vigour and has consequently remained an idealist's dream. It is only through the recent advances of semiconductor technology that new interest in the topic has re-emerged, as it has become possible to be rather wasteful with abundantly available resources. In particular, the advent of programmable components, devices representing circuits that are easily and rapidly reconfigurable, has brought the idea much closer to practical realization.

Hardware and software have traditionally been considered as distinct entities with little in common in their design process. Focusing on the logical design phase, perhaps the most significant difference is that programs are predominantly regarded as ordered sets of statements to be interpreted sequentially, one after the other, whereas circuits, again simplifying the matter, are viewed as sets of subcircuits operating concurrently ("in parallel"), with the same activities recurring in each clock cycle forever. Programs "run"; circuits are static.

Another reason for considering hardware and software design as different is that in the latter case compilation ends the design process (if one is willing to ignore "debugging"), whereas in the former compilation merely produces an abstract circuit. This is to be followed by the difficult, costly, and tedious phase of mapping the abstract circuit onto a physical medium, taking into account the physical properties of the selected parts, propagation delays, fan-out, and other details.

With the progressing introduction of hardware description languages, however, it has become evident that the two subjects have several traits in common. Hardware description languages let specifications of circuits assume textual form, like programs, replacing the traditional circuit diagrams by texts. This development may be regarded as the analogy of the replacement of flowcharts by program texts several decades ago. Its main promoter has been the rapidly growing complexity of programs, exhibiting the limitations of diagrams spreading over many pages. In the light of textual specifications, variables in programs have their counterparts in circuits in the form of clocked registers. The counterparts of expressions are combinational circuits of gates. The fact that programs mostly operate on numbers, whereas circuits feature binary signals, is of no further significance. It is well understood how to represent numbers in terms of arrays of binary signals (bits), and how to implement arithmetic operations by combinational circuits.


Direct compilation of hardware is actually fairly straightforward, at least if one is willing to ignore aspects of economy (circuit simplification). We follow this approach and formulate this essay in the form of a tutorial. As a result, we recognize some principal limitations of the translation process, beyond which it may still be applicable but is no longer realistic. Thereby we perceive more clearly what is better left to software. We also recognize an area where implementation by hardware is beneficial, even necessary: parallelism. Hopefully we end up with a better understanding of the fact that hardware and software design share several important aspects, such as structuring and modularization, that may well be expressed in a common notation. Subsequently we will proceed step by step, in each step introducing another programming concept expressed by a programming language construct.

    2. Software and Hardware Compilers Compared

Hardware compilers produce a hardware implementation from some specification of the hardware. There are, however, a number of types of hardware compilers which start from different sources and compile into different targets.

Architectural synthesis tools create a microarchitecture from a program-like description of the machine's function. These systems jointly design the microcode to implement the given program and the datapath on which the microcode is to be executed; adding a bus to the datapath, for example, can allow the microcode to be redesigned to reduce the number of cycles required to execute a critical operation. In general, architectural compilers that accept a fully general program input synthesize relatively conventional machine architectures. For this reason, architectural compilers display a number of relatively direct applications of programming-language compilation techniques.

Architectural compilers design a machine's structure, paying relatively little attention to the physical design of transistors and wires on the chip. Silicon compilers concentrate on the chip's physical design. While these programs may take a high-level description of the chip's function, they map that description onto a restricted architecture and do little architectural optimization. Silicon compilers assemble a chip from blocks that implement the architecture's functional units. They must create custom versions of the function blocks, place the blocks, and route wires between them to implement the architecture. Perhaps the most ambitious silicon compiler project is Cathedral, a family of compilers that create digital filter chips directly from the filter specifications (pass band, ripple, etc.).

Register-transfer compilers, in contrast, implement an architecture: they take as input a description of the machine's memory elements and of the data computation and transfer between those memory elements, and they produce an optimized version of the hardware that implements the architecture. The target of a register-transfer compiler is a set of logic gates and memory elements; its job is to compile the most efficient hardware for the architecture. A register-transfer compiler can be used as the back end of an architectural compiler; furthermore, many interesting chips (protocols, microcontrollers, etc.) are best described directly as register-transfer machines. Even though register-transfer descriptions bear less resemblance to programming languages than do the source languages of architectural compilers, we can still apply a number of programming-language compiler techniques to register-transfer compilation.

    3. The Structure of the Compiler


Our hardware compiler performs a series of translations to transform the FSM description into a hardware implementation. Figure 3 shows the compiler's structure. The compiler represents the design as a graph; successive compilation steps transform the graph to increase the level of detail in the design and to perform optimizations.

The first step, syntax-directed translation, translates the source description into two directed acyclic graphs (DAGs): Gc for the combinational functions, and Gm for the connections between the combinational functions and the memory elements. Global optimization transforms Gc into a graph Gc' that computes the same functions but is smaller and faster than Gc. Gc' and Gm are composed into a single intermediate form Gi, analogous to the intermediate code generated by a programming-language compiler. The silicon code generator transforms the intermediate form into hardware implemented in the available logic gates and registers. This implementation is cleaned up by a peephole optimizer; the result can be given to a layout program to create a final chip design.
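To make the graph terminology concrete, the following small Python sketch (assumed names and representation, not the authors' actual data structures) models Gc as a mapping from each combinational output to an expression tree, Gm as a mapping from each register output to the signal feeding its input, and Gi as the composition of the two.

from dataclasses import dataclass

# Expression trees are nested tuples such as ("and", a, b) or ("not", a);
# a plain string names a primary input or a register output.

@dataclass
class Design:
    gc: dict   # combinational functions: output signal -> expression tree
    gm: dict   # memory elements: register output -> name of its input signal

def compose(design):
    """Compose Gc and Gm into a single intermediate form Gi."""
    return {"comb": design.gc, "regs": design.gm}

# Example: a one-bit toggle FSM; next_q = not q, and register q is fed by next_q.
design = Design(gc={"next_q": ("not", "q")}, gm={"q": "next_q"})
print(compose(design))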

    4. Syntax Directed Translation

Translation of the source into Boolean gates plus registers is syntax-directed, much as in programming-language compilers, but with some differences because the target is a purely structural description of the hardware. The source description is parsed into a pure abstract syntax tree (AST). Leaf nodes in the tree represent either signals or memory elements, which consist of an input signal and an output signal. Subtrees describe the combinational logic functions. Translation turns the AST into the graphs Gc and Gm. Simple Boolean functions of signals are directly translatable; if statements (and by implication, switch statements) also represent Boolean functions: if (cond) z=a; else z=b; becomes the function z = (cond & a) | (!cond & b). Syntax-directed translation is a good implementation strategy because global optimization methods for combinational logic are good enough to wipe out a wide variety of inefficiencies introduced by a naive translation of the source into Boolean functions. The simplest way to translate the source into logic is to reproduce the AST directly in the structure of the combinational function graph Gc, creating functions that are too fragmented for good global optimization.

We use syntax-directed partitioning after translation to simplify the logic into fewer and larger functions: collapsing all subexpression trees into one large function, for example, takes much less CPU time than using general global optimization routines to simplify the subexpressions. This strategy is akin to simple programming-language compiler optimizations like in-line function expansion.
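As an illustration of this translation rule, the following small Python sketch (an illustrative reconstruction, not the compiler's actual code) turns an if statement of the form if (cond) z=a; else z=b; into the expression tree z = (cond & a) | (!cond & b) and checks that it behaves like the original statement for every input combination.

from itertools import product

def translate_if(cond, then_sig, else_sig):
    """Syntax-directed translation of 'if (cond) z = then; else z = else;'."""
    return ("or", ("and", cond, then_sig), ("and", ("not", cond), else_sig))

def evaluate(expr, env):
    """Evaluate an expression tree over 0/1 signal values."""
    if isinstance(expr, str):
        return env[expr]
    op, *args = expr
    vals = [evaluate(a, env) for a in args]
    if op == "and": return vals[0] & vals[1]
    if op == "or":  return vals[0] | vals[1]
    if op == "not": return 1 - vals[0]
    raise ValueError(op)

z = translate_if("cond", "a", "b")
for c, a, b in product((0, 1), repeat=3):
    env = {"cond": c, "a": a, "b": b}
    assert evaluate(z, env) == (a if c else b)   # behaves like the if statement
print("translation checked for all 8 input combinations")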


5. STRUCTURE OF HARDWARE COMPILER

[Figure 3: How to compile register-transfer descriptions into layout. Blocks shown: FSM, syntax-directed translation, combinational logic, memory elements, global optimization, intermediate form, silicon code generation, peephole optimization, automatic layout, mask for manufacture.]


6. Global Optimization

As in programming-language compilation, global optimization tries to reduce the resources required to implement the machine. The resources in this case are the combinational logic functions: we want to compute the machine's outputs and next state using the smallest, fastest possible functions. Combinational logic optimization combines minimization, which eliminates obvious redundancies in the combinational logic (analogous to dead-code elimination) and simplifies the logic (analogous to strength reduction and constant folding), with common subexpression elimination, which reduces the size of the logic functions. Compiler-oriented methods have been used with success by the IBM Logic Synthesis System to solve both the minimization and common subexpression elimination problems, but these methods can become stuck in local minima. We use other global optimization techniques, very different from programming-language compiler methods, that take advantage of the restrictions of the Boolean domain. We reduce the combinational logic to a canonical sum-of-products form and use modern extensions [8] to the classical Quine-McCluskey optimization procedures. We eliminate common subexpressions using a technique [9] tuned to the canonical sum-of-products representation.

Because Boolean logic is a relatively simple theory, our global optimizer can deduce many legal transformations of the combinational functions and explore a large fraction of the total design space. Global optimization commonly reduces total circuit area by 15%. Global logic optimization is similar to machine-independent code optimization in programming-language compilers in that it is performed on an intermediate form, in this case networks of Boolean functions in sum-of-products form. As in programming-language compilation, many optimizations can be performed on this implementation-independent representation; but, as in programming languages, some optimizations cannot be identified until the target, in this case the hardware gates and registers available, is known. For example, if the available hardware includes a register that provides its output in both true and complemented form, we can use that complemented form directly rather than compute it with an external inverter. Global optimization algorithms for Boolean functions are not good at finding these implementation-dependent optimizations; the code-generation phase transforms the intermediate form into efficient hardware.
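The following short Python sketch (a textbook illustration, not the optimizer cited above) shows the core Quine-McCluskey step referred to here: product terms of a sum-of-products form that differ in exactly one bit are merged repeatedly, and the terms that can no longer be merged are the prime implicants of the function.

def merge(a, b):
    """Merge two terms (strings over '0', '1', '-') differing in exactly one bit."""
    diff = [i for i in range(len(a)) if a[i] != b[i]]
    if len(diff) == 1 and '-' not in (a[diff[0]], b[diff[0]]):
        i = diff[0]
        return a[:i] + '-' + a[i + 1:]
    return None

def prime_implicants(minterms, nbits):
    terms = {format(m, f'0{nbits}b') for m in minterms}
    primes = set()
    while terms:
        merged, used = set(), set()
        for a in terms:
            for b in terms:
                m = merge(a, b)
                if m:
                    merged.add(m)
                    used.update((a, b))
        primes |= terms - used          # terms that merged with nothing are prime
        terms = merged
    return primes

# f(a, b, c) with minterms 0, 1, 2, 5, 6, 7 yields six prime implicants,
# e.g. '00-' (a'b'), '0-0' (a'c'), '-01' (b'c), and so on.
print(prime_implicants({0, 1, 2, 5, 6, 7}, 3))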

    7. Silicon Code Generation

The portion of our system that makes the most innovative use of compiler methods is the silicon code generator. Our circuits are implemented with a library of registers and gates that implement a subset of the Boolean functions: NAND, NOR, AND-OR-INVERT (AOI), etc. This library forms the silicon instruction set; it is an exact analog of a machine's instruction set. The silicon code generator's job is to translate the intermediate form produced by the global optimizer into the best silicon code: networks of the gates and registers available in the library. The first approach to silicon code generation was ad hoc, reminiscent of the "do what you have to do" era of code generation. Another approach that attracted interest used a rule-based expert system [19]. But silicon code generation is so similar to tree-matching methods for retargetable code generation that we use twig [2][30], a tree manipulator used for constructing code generators for programming-language compilers, to create the DAGON silicon code generator [21]. DAGON can implement hardware optimized for speed, area, or some function of both.

[Figure 4: The silicon code generation process.]

Figure 4 shows how silicon code generation works. For each new set of silicon instructions we must generate a tree-matching program. Twig compiles a set of patterns and actions that describe the code transformations into a tree matcher for that library. The code-generation phase of a hardware compilation run starts from the intermediate form Gi, which includes both the Boolean functions for the combinational logic and the memory elements. That graph is parsed and broken into a forest of trees which are fed to the code generator.


    As we will describe later, code generation can be performed iteratively to optimize the silicon

    code for speed. At the end of code generation the intermediate form has been compiled into

    an equivalent piece of hardware built from the available library components.

    7.1 Compiling the Tree Matcher

Code generation finds the best mapping from the intermediate form onto the object code by tree matching. We want an efficient implementation of the matching program, but we also want to be able to generate the matching automaton from the code patterns and the actions. The patterns supplied to twig either describe direct mappings of the intermediate form onto the target code or describe a mapping that contains a simple local optimization. Local optimizations may be thought of as clever code sequences that are not natural matches of the instruction set.

A twig pattern statement has three parts. The first is the pattern itself, written in a form like that used by parser generators. A pattern can be used to describe a class of simple pattern instances by parameterizing the pattern. For instance, five patterns for AOIxxx actually describe sixty-four pattern instances, from an AOI444 to an AOI211. Many of these patterns are symmetric: an AOI114 is equivalent to, and would be output as, an AOI411.

The second part of a pattern statement is a formula that defines the cost of applying the pattern. Separately we supply twig with functions for adding, computing, and comparing costs. The tree matcher computes the cost for a tree using the user-supplied routines and uses these costs to choose, by dynamic programming, the minimum-cost match over the entire tree.

The third part of a pattern statement is either a rewrite command or an action routine. Some matches cause the input circuit to be transformed and reprocessed to find new matches, while others cause immediate actions. DAGON's action is always to write out the code chosen by the match. The patterns, actions, and cost functions are compiled by twig into a tree-matching automaton.
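A rough sketch of the three parts of a pattern statement, written in Python rather than twig's actual input language (the names and cost below are invented for illustration):

from dataclasses import dataclass
from typing import Callable

@dataclass
class Pattern:
    name: str                        # the library gate this pattern emits
    tree: tuple                      # the pattern itself, with "*" wildcards
    cost: Callable[[], float]        # formula giving the cost of applying it
    action: Callable[[tuple], str]   # rewrite command or action routine

# A two-input NAND pattern: it matches NOT(AND(x, y)) and emits an ND2 gate.
nd2 = Pattern(
    name="ND2",
    tree=("not", ("and", "*", "*")),
    cost=lambda: 4.0,                    # e.g. the area of the ND2 cell
    action=lambda subtree: "ND2",
)
print(nd2.name, nd2.cost())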


    Figure 5: Partitioning the intermediate form DAG into trees

The automaton finds the optimum match for a tree in two phases: first, finding all the patterns that match subtrees; second, finding the set of matches that cover the tree at the lowest cost. We will describe these steps in detail below. We have written 60 parameterised patterns to describe the AT&T Cell Library. These patterns expand into over seven hundred pattern instances that represent permutations of our hardware library of approximately 165 gates and a few thousand unique local optimisations. The modularity afforded by describing the code generator as patterns, actions, and cost functions makes it easy to port our code generator to a new hardware library, just as code generators can be ported to new instruction sets. To date we have created code generators for six different hardware libraries.


    Figure 6: An intermediate form tree and all its possible matches.

    7.2 Code Generation

The intermediate form Gi is a DAG, so it must be broken into a forest of trees before being processed by the tree matcher produced by twig. Figure 5 shows an intermediate form DAG partitioned into three trees; the Xs show where the DAG has been cut. The partitioned DAG is sent one tree at a time to the tree matcher. It first determines all possible matches for each vertex. The matching automaton finds all possible matches using the Aho-Corasick algorithm [1], which works in time O(TREESIZE × MAXMATCHES): the size of the tree times the maximum number of matches possible at any node. Figure 6 shows a tree and all its matches. Each node matched a pattern corresponding to either a two-input NAND gate (ND2) or an inverter (INV); in addition, the subtree formed by the top three ranks of nodes matched a larger pattern corresponding to an AOI21 gate. The optimum code is generated by finding the least-cost match of patterns that covers the tree from among all the possible matches. The tree matcher generated by twig finds the optimum match by dynamic programming [4].

Optimum tree matching can be solved by dynamic programming because the best match for the root of a tree is a function of the best matches for its subtrees. Figure 7 shows how the pattern-matching automaton finds the optimum match for our example tree. The tree matcher starts from the leaf nodes and successively matches larger subtrees. The nodes at the bottom of the tree each have only one match; the number shown with the name of the best match is the total cost of that subtree. The root node has two matches: the INV match gives a total tree cost of 13; the AOI21 match (which covers several previously matched nodes) gives a tree of cost 7. The AOI21 match gives the smallest total cost, so the automaton labels the node with AOI21. This example is made simple because we tried to match only a few instructions;


matching even this simple DAG against our full instruction set would give four to five matches per node. Pure tree matching would not be able to handle two important cases: matching non-tree patterns, like exclusive-or gates, and matching patterns across tree boundaries. We handle these cases by associating attributes that describe global information (fan-out, inverse, and signal) with each vertex in the tree. Inverse tells whether the inverted sense of a signal is available somewhere in the circuit. This is very useful because a number of area-reducing optimizations can often be made if the inverted sense of a signal can be used in place of the signal itself. Signal gives the name of a signal, which can be used with inverse to represent a non-tree pattern such as an exclusive-or. Fan-out is used to determine timing costs. Because patterns are matched against a forest of trees rather than the original DAG, DAGON may not find the globally optimal match. Finding an optimal DAG matching (DAG covering) is an NP-complete [31] problem. However, the combination of tree matching and peephole optimizations has proven to be an effective method for silicon code generation. The MIS system tried extending tree matching to DAG covering [17], but the results were disappointing. MIS now uses tree matching for code generation.
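A compact Python sketch of this bottom-up dynamic programming (the three-gate pattern library and its costs are invented for illustration; the real matcher is generated by twig):

import math

# Expression trees are nested tuples such as ("not", x) or ("and", x, y);
# leaves are strings naming input signals.  Patterns use "*" as a wildcard.
PATTERNS = [
    ("INV",   1, ("not", "*")),
    ("ND2",   2, ("not", ("and", "*", "*"))),
    ("AOI21", 3, ("not", ("or", ("and", "*", "*"), "*"))),
]

def match(pattern, tree, holes):
    """True if pattern matches the top of tree; wildcard subtrees go into holes."""
    if pattern == "*":
        holes.append(tree)
        return True
    if isinstance(tree, str) or isinstance(pattern, str):
        return False
    if pattern[0] != tree[0] or len(pattern) != len(tree):
        return False
    return all(match(p, t, holes) for p, t in zip(pattern[1:], tree[1:]))

def best_cover(tree):
    """Minimum cost of implementing tree, plus the gates chosen (bottom-up DP)."""
    if isinstance(tree, str):          # a primary input costs nothing
        return 0, []
    best_cost, best_gates = math.inf, []
    for name, cost, pattern in PATTERNS:
        holes = []
        if match(pattern, tree, holes):
            sub = [best_cover(h) for h in holes]
            total = cost + sum(c for c, _ in sub)
            if total < best_cost:
                best_cost = total
                best_gates = [name] + [g for _, gs in sub for g in gs]
    return best_cost, best_gates

# not((a and b) or c): the INV pattern matches the root too, but its leftover
# or-subtree cannot be covered by this tiny library, so the AOI21 match wins.
print(best_cover(("not", ("or", ("and", "a", "b"), "c"))))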



    Figure 7: Finding an optimal match.

    7.3 Local Optimizations for Speed

While not all industrial designs have to run at high speed, there are always circuits that have to run very fast. Because path delay is a global property of the hardware, it can be hard for designers to optimize large machines for speed. Furthermore, reducing the delay of a piece of hardware almost always costs additional area. We want to meet the designer's speed requirement at a minimum area cost. As Figure 4 shows, code generation can be applied iteratively to jointly optimize for speed and area. The first pass through the tree matcher produces hardware strictly optimized for area. A static delay analyzer finds parts of the hardware that must be speeded up and produces speed constraints. The code generated by the previous pass is translated back into the intermediate form and sent through the tree matcher along with the constraints. The constraints cause the matcher to abort matches that cannot run at the required speed. Because many Boolean functions are represented by both low-speed/low-power and high-speed/high-power gates in the library, the matcher often needs only to substitute the high-speed version of a gate to meet the speed constraint, but a constraint can force the entire tree to be rematched. By this procedure we have been able to routinely double the speed of circuits with an area penalty of only 10%. This is such a significant improvement that DAGON is being used for timing optimization of manually designed hardware. Speed optimization is sufficiently hard that DAGON has been able to similarly improve even these hand-crafted designs.
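A toy, runnable sketch of this iterate-until-fast-enough idea (the cell names, areas, and delays are invented; the real flow re-runs the tree matcher under constraints rather than editing a path directly):

# Each logic function is available as (area, delay) in a low-power ("lp")
# and a high-speed ("hs") version.
LIBRARY = {
    "nand": {"lp": (2, 1.0), "hs": (3, 0.6)},
    "inv":  {"lp": (1, 0.5), "hs": (2, 0.3)},
}

def optimize(path, required_delay):
    """path is a list of gate types forming the critical path."""
    choice = ["lp"] * len(path)                       # area-optimal first pass
    def delay(): return sum(LIBRARY[g][c][1] for g, c in zip(path, choice))
    def area():  return sum(LIBRARY[g][c][0] for g, c in zip(path, choice))
    while delay() > required_delay:
        # Upgrade the slowest low-power gate still on the path; give up if none left.
        candidates = [i for i, c in enumerate(choice) if c == "lp"]
        if not candidates:
            break
        i = max(candidates, key=lambda i: LIBRARY[path[i]]["lp"][1])
        choice[i] = "hs"
    return choice, delay(), area()

print(optimize(["nand", "inv", "nand", "nand"], required_delay=2.5))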

    8. Peephole Optimization

Inverters are the bane of a silicon code generator. Because trees are matched independently, it is common for a signal to be inverted, due to a local optimization, in two or more different trees when a single inverter would do the job. We clean up these and other problems with a peephole optimization phase that minimizes the number of inverters used. As in programming-language compilers, the peephole optimizer looks at the circuit through a small window; but this window reveals a portion of the circuit graph, not a portion of code text as in an object-code peephole optimizer. Peephole optimization reduces combinational logic area by as much as 10%.
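A small runnable sketch of this clean-up (a simplified peephole pass, not the production optimizer): if several inverters are driven by the same signal, keep one and rewire the consumers of the duplicates.

def merge_duplicate_inverters(netlist):
    """netlist: list of cells like {"gate": "INV", "in": ["a"], "out": "n1"}."""
    seen = {}           # inverted source signal -> output of the inverter we keep
    rename = {}         # duplicate inverter output -> kept inverter output
    kept = []
    for cell in netlist:
        if cell["gate"] == "INV":
            src = cell["in"][0]
            if src in seen:
                rename[cell["out"]] = seen[src]     # drop this duplicate
                continue
            seen[src] = cell["out"]
        kept.append(cell)
    # Rewire consumers of the removed inverters.
    for cell in kept:
        cell["in"] = [rename.get(s, s) for s in cell["in"]]
    return kept

netlist = [
    {"gate": "INV", "in": ["a"], "out": "n1"},
    {"gate": "INV", "in": ["a"], "out": "n2"},
    {"gate": "ND2", "in": ["n1", "b"], "out": "x"},
    {"gate": "ND2", "in": ["n2", "c"], "out": "y"},
]
print(merge_duplicate_inverters(netlist))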

    9. Work-In-Progress: Global Speed Optimization

    While circuit timing problems related to very local circuit properties, such as high capacitive

    loading, are well solved by the techniques described in the previous sections, not all timing

    problems in digital hardware can be solved by a single application of local optimizations.

    Some timing problems are due to having too many gates in a particular path through the

    hardware. To solve this problem the hardware must often be drastically restructured. As local

    optimizations using tree matching are relatively inexpensive (linear time), our current

    approach is to iteratively apply the local optimizations

    to achieve more global transformations. This procedure is both aesthetically and practically

    unsatisfying.


We would prefer to make hardware transformations that more quickly result in hardware which meets timing requirements. Tree height reduction algorithms [10], developed in the context of reducing the height of arithmetic expression trees to speed their execution on a parallel processor, suggest an approach to this problem. While this method, and its successors, has promise, we have found two limitations in its application to hardware. First, when compiling to a fixed architecture, resources are free up to the point of saturation and infinitely expensive after that. Thus it makes sense to reduce tree height until all processors are saturated. When designing hardware, however, we are interested in speeding up the hardware to some limit specified by the user. We have no fixed resource limit in silicon code generation to provide a lower bound on tree height, but maximally reducing all trees would create hardware much larger than necessary. Secondly, we are concerned with minimizing the height of DAGs, not trees. We are presently investigating techniques for height minimization of a DAG.

    10. Evaluation of the Silicon Instruction Set

    Highly portable compilers for fixed architectures encouraged the evaluation of instruction

    sets to determine the most cost-effective set of instructions: instruction sets that gave the

    fastest compiled code at the minimum cost in hardware. While a larger silicon instruction set

    does not have the unpleasant side effect of increasing the complexity of the implementation, a

    larger silicon instruction set does require more human resources to develop and maintain. The

    cost of developing and maintaining the instruction set must be weighed against the

    improvement in the speed and area of hardware that can be compiled. The availability of a

    highly portable silicon code generator made possible a long overdue experiment, that of

    evaluating the impact of the silicon instruction set on the quality of the final hardware.

    We evaluated five different silicon instruction sets ranging in complexity from 2 to 125 gates.

Our results on the twenty standard benchmark examples show a monotonic, though exponentially decreasing, reduction in chip area, ranging from 26% (7 instructions) to 45%

    (125 instructions). These results were further verified against a number of industrial designs.

    If we compare chip area with size of object code, these results simply show that with the

    addition of more and more powerful instructions the size of object code is reduced. What was

    less predictable was the knee of the area reduction per number of instructions curve. Beyond

    an instruction set size of 37 instructions, which gave an area reduction of 40%, adding

    instructions gave little area reduction: an additional 88 instructions only reduced area by an

    additional 2%. A more surprising result was that all silicon instruction sets showed

    comparable speed.

11. Working of the hardware compiler

    11.1. Variable Declarations

    All variables are introduced by their declaration. We use the notation

    VAR x, y, z: Type

In the corresponding circuit, the variables are represented by registers holding their values. The output carries the name of the variable represented by the register, and the enable signal


determines when a register is to accept a new input. We postulate that all registers be clocked by the same clock signal, that is, all derived circuits will be synchronous circuits. Therefore, clock signals will subsequently not be drawn and are assumed to be implicit.

    11.2. Assignment Statements

    An assignment is expressed by the statement

    y := f(x)

where y is a variable, x is (a set of) variables, and f is an expression in x. The assignment statement is translated into the circuit according to Fig. 1, with the function f resulting in a combinational circuit (i.e. a circuit containing neither feedback loops nor registers). Here e stands for enable; this signal is active when the assignment is to occur.

Given a time t when the register(s) have received a value, the combinational circuit f yields the new value f(x) after a time span pd, called the propagation delay of f. This requires the next clock tick to occur no earlier than time t + pd. In other words, the propagation delay bounds the clock frequency: f < 1/pd. The concept of the synchronous circuit dictates that the frequency is chosen according to the circuit with the largest propagation delay among all function circuits.

The simplest types of variables have only 2 possible values, say 0 and 1. They are said to be of type BIT (or Boolean). However, we may just as well also consider composite types, that is, variables consisting of several bits. A type Byte, for example, might consist of 8 bits, and the same holds for a type Integer. The translation of arithmetic functions (addition, subtraction, comparison, multiplication) into corresponding combinational circuits is well understood. Although they may be of considerable complexity, they are still purely combinational circuits.
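For instance, a propagation delay of pd = 5 ns bounds the clock frequency to f < 1/(5 ns) = 200 MHz. The following minimal Python model (not from the paper) mimics the circuit derived for y := f(x): a combinational function feeds a clocked register that accepts a new value only when its enable signal is active.

class Register:
    def __init__(self, value=0):
        self.value = value          # register output, named after the variable
        self._next = value

    def drive(self, data, enable):  # combinational input and enable, before the tick
        self._next = data if enable else self.value

    def tick(self):                 # the common clock edge shared by all registers
        self.value = self._next

def f(x):                           # some combinational function of x
    return (x + 1) & 0xFF

x = Register(7)
y = Register(0)
y.drive(f(x.value), enable=1)       # y := f(x) is enabled this cycle
x.tick(); y.tick()
print(y.value)                      # -> 8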

    11.3. Parallel Composition

    We denote parallel composition of two statements S0 and S1 by

    S0, S1

The meaning is that the two component statements are executed concurrently. The characteristic of parallel composition is that the affected registers feature the same enable signal. Translation into a circuit is straightforward, as shown in Fig. 1 (right side) for the example

x := f(x, y), y := g(x, y)


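A small runnable sketch (not from the paper) of this semantics: because both registers share one enable, both right-hand sides are evaluated from the old register values before either register is updated.

def f(x, y): return x + y
def g(x, y): return x - y

def step(x, y, enable=1):
    """One clock cycle of the parallel composition x := f(x, y), y := g(x, y)."""
    if not enable:
        return x, y
    # Both right-hand sides use the current register outputs; only then are
    # both registers updated, on the same clock edge.
    return f(x, y), g(x, y)

print(step(5, 3))   # -> (8, 2), not (8, 5): y is computed from the old x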

    11.4. Sequential Composition

Traditionally, sequential composition of two statements S0 and S1 is denoted by

S0; S1

The sequencing operator ";" signifies that S1 is to be executed after completion of S0. This necessitates a sequencing mechanism. The example of the three assignments

y := f(x); z := g(y); x := h(z)

is translated into the circuit shown in Fig. 2, with enable signals e corresponding to the statements.

The upper line contains the combinational circuits and registers corresponding to the assignments. The lower line contains the sequencing machinery assuring the proper sequential execution of the three assignments. With each statement is associated an individual enable signal e. It determines when the assignment is to occur, that is, when the register holding the respective variable is to be enabled. The sequencing part is a "one-hot" state machine. The following table illustrates the signal values before and after each clock cycle.

e0 e1 e2 e3   x             y       z
1  0  0  0    x0            ?       ?
0  1  0  0    x0            f(x0)   ?
0  0  1  0    x0            f(x0)   g(f(x0))
0  0  0  1    h(g(f(x0)))   f(x0)   g(f(x0))

    The preceding example is a special case in the sense that each variable is assigned only a

    single value, namely y is assigned f(x) in cycle 0, z is assigned g(y) in cycle 1, and x is

    assigned h(z) in cycle 2. The generalized case is reflected in the following short example:


    x := f(x); x := g(x)

    which is transformed into the circuit shown in Fig. 3. Since x receives two values, namely

    f(x) in cycle 0 and g(x) in cycle 1, the register holding x needs to be preceded by a

    multiplexer.
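The following runnable Python sketch (with arbitrary functions f, g, and h) simulates the one-hot sequencing machinery for y := f(x); z := g(y); x := h(z) and reproduces the pattern of the table above.

def f(x): return x + 1
def g(y): return 2 * y
def h(z): return z - 3

def run(x0, cycles=3):
    x, y, z = x0, None, None
    e = [1, 0, 0, 0]                     # one-hot state: e0 active initially
    print("e =", e, "x =", x, "y =", y, "z =", z)
    for _ in range(cycles):
        nx, ny, nz = x, y, z             # registers keep their values by default
        if e[0]: ny = f(x)               # cycle 0: y := f(x)
        if e[1]: nz = g(y)               # cycle 1: z := g(y)
        if e[2]: nx = h(z)               # cycle 2: x := h(z)
        x, y, z = nx, ny, nz
        e = [0] + e[:-1]                 # shift the one-hot token
        print("e =", e, "x =", x, "y =", y, "z =", z)

run(10)   # y = f(x0), then z = g(f(x0)), then x = h(g(f(x0))), as in the table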

    11.5. Conditional Composition

    This is expressed by the statement


    IF b THEN S END

    where b is a Boolean variable and S a statement. The circuit shown in Fig. 4 on the left side is

    derived

    from it with S standing for y := f(x). The only difference to Fig. 1 is the derivation of the

    associated sequencing machinery yielding the enable signal for y.

    It now emerges clearly that the sequencing machinery, and it alone, directly reflects the

    control statements, whereas the data parts of circuits reflect assignments and expressions.

    The conditional statement is easily generalized to its following form:

    IF b0 THEN S0 ELSE S1 END

    In its corresponding circuit on the right side of Fig. 4 we show the sequencing part only with

    the enable signals corresponding to the various statements. At any time, at most one of the

    statements S is active after being triggered by e0.

    11.6. Repetitive Composition

    We consider repetitive constructs traditionally expressed as repeat and while statements.

    Since the control structure of a program statement is reflected by the sequencing machinery

    only, we may omit showing associated data assignments, and instead let each statement be

    represented by its associated enable signal only.

    We consider the statements

REPEAT S UNTIL b and WHILE b DO S END

with b again standing for a Boolean variable. They translate into the circuits shown in Fig. 5, where e stands for the enable signal of statement S.


    11.7. Selective Composition

    The selection of one statement out of several is expressed by the case statement, whose

    corresponding sequencing circuitry containing a decoder is shown in Fig. 6.

    CASE k OF

    0: S0 | 1: S1 | 2: S2 | ... | n: Sn

    END

    11.8. Preliminary Discussion


    At this point we realize that arbitrary programs consisting of assignments, conditional and

    repeated statements can be transformed into circuits according to fixed rules, and therefore

    automatically.

However, it is also to be feared that the complexity of the resulting circuits may quickly surpass reasonable bounds. After all, every expression occurring in a program results in an individual combinational circuit; every add symbol yields an adder, and every multiply symbol turns into a combinational multiplier, potentially with hundreds or thousands of gates. We agree that for the present time this scheme is hardly practical except for toy examples. Cynics will remark, however, that soon the major problem will no longer be the economical use of components, but rather to find good use of the hundreds of millions of transistors on a chip.

In spite of such warnings, we propose to look at ways to reduce the projected hardware explosion. The best solution is to share subcircuits among various parts. This, of course, is equivalent to the idea of the subroutine in software. We must therefore find a way to translate subroutines and subroutine calls. We emphasize that the driving motivation is the sharing of circuits, and not the reduction of program text.

Therefore, a facility for declaring subroutines and textually substituting their calls by their bodies, that is, handling them like macros, must be rejected. This is not to deny the usefulness of such a feature, as it is for example provided in the language Lola, where the calls may be parameterized [2, 3]. But with its expansion of the circuit for each call it contributes to the circuit explosion rather than being a facility to avoid it.

    11.9. Subroutines

    The circuits obtained so far are basically state machines. Let every subroutine translate into

    such a state machine. A subroutine call then corresponds to the suspension of the calling

    machine, the activation of the called machine, and, upon its completion, a resumption of the

    suspended activity of the caller.

A first measure to implement subroutines is to provide every register in the sequencing part with a common enable signal. This allows a given machine to be suspended and resumed. The second measure is to provide a stack (a first-in last-out store) of identifications of suspended machines, typically numbers from 0 to some limit n. We propose the following implementation, which is to be regarded as a fixed base part of all circuits generated, analogous to a runtime subroutine package in software. The extended circuit is a pushdown machine.

The stack of machine numbers (analogous to return addresses) consists of a (static) RAM of m words, each of n bits, and an up/down counter generating the RAM addresses. Each of the n bits in a word corresponds to one of the circuits representing a subroutine. One of them has the value 1, identifying the circuit to be reactivated. Hence, the (latched) readout directly specifies the values of the enable signals of the n circuits. This is shown in Fig. 7. The stack is operated by the push signal, which increments the counter and thereafter writes the applied input to the RAM, and the pop signal, which decrements the counter.
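A runnable Python sketch of this stack (the class and method names are invented): each entry is a one-hot word identifying the machine to reactivate, the up/down counter supplies the RAM address, push is issued on a call, and pop on return.

class SubroutineStack:
    def __init__(self, words, machines):
        self.ram = [[0] * machines for _ in range(words)]   # m words of n bits
        self.counter = 0                                     # up/down address counter

    def push(self, caller_id):
        """On a call: remember which machine to re-enable on return."""
        self.counter += 1
        word = [0] * len(self.ram[0])
        word[caller_id] = 1                                  # one-hot identification
        self.ram[self.counter] = word

    def pop(self):
        """On return: the latched readout re-enables the suspended machine."""
        word = self.ram[self.counter]
        self.counter -= 1
        return word

stack = SubroutineStack(words=8, machines=4)
stack.push(caller_id=0)          # machine 0 calls a subroutine (say, machine 2)
stack.push(caller_id=2)          # machine 2 itself calls another subroutine
print(stack.pop())               # -> [0, 0, 1, 0]: resume machine 2
print(stack.pop())               # -> [1, 0, 0, 0]: resume machine 0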


The push signal is activated whenever a state register belonging to a call statement becomes active. Since a subroutine may contain several calls itself, these register outputs must be ORed. A better solution would be a bus. The pop signal is activated when the last state of a circuit representing a subroutine is reached. An obvious enhancement, with the purpose of enlarging the number of possible subcircuits without undue expansion of the RAM's width, is to place an encoder at the input and a decoder at the output of the RAM. A width of n bits then allows for 2^n subroutines.

The scheme presented so far excludes recursive procedures. The reason is that every subcircuit must have at most one suspension point, that is, it can have been activated at most once. If recursive procedures are to be included, a scheme allowing for several suspension points must be found. This implies that the stack entries must contain not only an identification of the suspended subcircuit, but also of the suspension point of the call in question. This can be achieved in a manner similar to storing the circuit identification, either as a bit set or an encoded value. The respective state re-enable signals must then be properly qualified, further complicating the circuit. We shall not pursue this topic any further.


12. Conclusion

We have found that a number of techniques and algorithms developed for software compilers for fixed architectures are useful in hardware compilation. We add the caveat that while programming-language compilation techniques for fixed architectures can often give us a good start on many problems in hardware compilation, it is important to remember that language-to-hardware compilations are harder precisely because the hardware architecture is not fixed. A good software compiler must optimize its code for its target. A good hardware compiler must be able to optimize its code for a target architecture, but must also explore resource tradeoffs across a potentially wide range of such targets. Nevertheless, we hope that hardware and programming-language compilation techniques become even more closely related as we discover more general ways to compile software descriptions into custom hardware. For the moment we continue to experiment with and extend our hardware compiler technology to make it more general, while still retaining its ability to create industrial-quality designs.

    It is now evident that subroutines, and more so procedures, introduce a significant degree of

complexity into a circuit. It is indeed highly questionable whether it is worthwhile

    considering their implementation in hardware. It comes as no surprise that state machines,

    implementing assignments, conditional and repetitive statements, but not subroutines, play

    such a dominant role in sequential circuit design. The principal concern in this area is to

    make optimal use of the implemented facilities. This is achieved if most of the subcircuits

    yield values contributing to the process most of the time. The simplest way to achieve this

    goal is to introduce as few sequential steps as possible. It is the underlying principle in

    computer architectures, where (almost) the same steps are performed for each instruction.

    The steps are typically the subcycles of an instruction interpretation. Hardware acts as what is

    known in software as an interpretive system.

The strength of hardware is the possibility of concurrent operation of subcircuits. This is also a topic in software design. But we must be aware of the fact that concurrency is only possible by having concurrently operating circuits to support this concept. Much work on parallel computing in software actually ends up implementing quasi-concurrency, that is, pretending concurrent execution and, in fact, hiding the underlying sequentiality. This leads us to contend that any scheme of direct hardware compilation may well omit the concept of subroutines, but must include the facility of specifying concurrent, parallel statements. Such a hardware programming language may indeed be the best way to let the programmer specify parallel statements. We call this fine-grained parallelism.

Coarse-grained concurrency may well be left to conventional programming languages, where parallel processes interact infrequently, and where they are generated and deleted at arbitrary but distant intervals. This claim is amply supported by the fact that fine-grained parallelism has mostly been left to be introduced by compilers, under the hood, so to speak.

    The reason for this is that compilers may be tuned to particular architectures, and they can

    therefore make use of their target computer's specific characteristics. In this light, the

consideration of a common language for hardware and software specification has a certain merit.

    It may reveal the inherent difference in the designers' goals. As C. Thacker expressed it

    succinctly: "Programming (software) is about finding the best sequential algorithm to solve a

    problem, and implementing it efficiently. The hardware designer, on the other hand, tries to

    bring as much parallelism to bear on the problem as possible, in order to improve

    performance". In other words, a good circuit is one where most gates contribute to the result

    in every clock cycle. Exploiting parallelism is not an optional luxury, but a necessity.

We end by repeating that hardware compilation has gained interest in practice primarily because of the recent advent of large-scale programmable devices. These can be configured


on the fly, and hence be used to directly represent circuits generated through a hardware

    compiler. It is therefore quite conceivable that parts of a program are compiled into

    instruction sequences for a conventional processor, and other parts into circuits to be loaded

    onto programmable gate arrays. Although specified in the same language, the engineer will

    observe different criteria for good design in the two areas.


    13. References

1. A. Aho and M. Corasick. Efficient string matching: an aid to bibliographic search. Communications of the ACM, 18(6):333-340, June 1975.

2. A. Aho and M. Ganapathi. Efficient tree pattern matching: an aid to code generation. In Conference Record of the Twelfth Annual ACM Symposium on Principles of Programming Languages, pages 334-340, January 1985.

3. A. Aho and S. C. Johnson. Code generation for a one-register machine. Journal of the ACM, 23(3):502-510, 1976.

4. A. Aho and S. C. Johnson. Optimal code generation for expression trees. Journal of the ACM, 23(3):488-501, 1976.

5. The Yorktown Silicon Compiler. In Proceedings, International Conference on Circuits and Systems, pages 391-394. IEEE Circuits and Systems Society, June 1985.

6. R. Brent, D. Kuck, and K. Maruyama. The parallel evaluation of arithmetic expressions without division. IEEE Transactions on Computers, pages 532-534, May 1973.

7. J. Darringer, W. Joyner, C. L. Berman, and L. Trevillyan. Logic synthesis through local transformations. IBM Journal of Research and Development, 25(4):272-280, 1981.

8. H. De Man, J. Rabaey, P. Six, and L. Claesen. Cathedral-II: a silicon compiler for digital signal processing. IEEE Design & Test, 3(6):13-25, 1984.

9. I. Page. Constructing hardware-software systems from a single description. Journal of VLSI Signal Processing, 12:87-107, 1996.

10. N. Wirth. Digital Circuit Design. Springer-Verlag, Heidelberg, 1995.

11. www.Lola.ethz.ch

12. Y. N. Patt, et al. One Billion Transistors.