
Compiler Strategies for

Transport Triggered Architectures

Johan Janssen


Compiler Strategies for

Transport Triggered Architectures

Dissertation

for obtaining the degree of doctor
at the Technische Universiteit Delft,

by authority of the Rector Magnificus, prof.ir. K.F. Wakker,
chairman of the Board for Doctorates,

to be defended in public
on Monday, 17 September 2001 at 16:00

by

Johannes Antonius Andreas Jozef JANSSEN

electrical engineer,
born in Wamel


This dissertation has been approved by the promotors:
Prof.dr.ir. A.J. van de Goor
Prof.dr. H. Corporaal

Composition of the doctoral committee:

Rector Magnificus, chairman
Prof.dr.ir. A.J. van de Goor, Technische Universiteit Delft, promotor
Prof.dr. H. Corporaal, T.U. Eindhoven / IMEC, promotor
Prof.dr.ir. E.F. Deprettere, Universiteit Leiden
Prof.dr.ir. Th. Krol, Universiteit Twente
Prof.dr.ir. R.H.J.M. Otten, Technische Universiteit Eindhoven
Prof.dr.ir. H.J. Sips, Technische Universiteit Delft
Dr. C. Eisenbeis, INRIA, Rocquencourt

Published and distributed by: DUP Science

DUP Science is an imprint of
Delft University Press
P.O. Box 98
2600 MG Delft
The Netherlands
Telephone: +31 15 27 85 678
Telefax: +31 15 27 85 706
E-mail: [email protected]

ISBN 90-407-2209-9

Keywords: Compilers, Instruction Scheduling, Register Assignment

Advanced School for Computing and Imaging

This work was carried out in the ASCI graduate school.
ASCI dissertation series number 69.

Copyright © 2001 by Johan Janssen

All rights reserved. No part of the material protected by this copyright notice may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording or by any information storage and retrieval system, without written permission from the publisher: Delft University Press.

Printed in The Netherlands


This dissertation is dedicated to the loving memory of my mother


Acknowledgements

This Ph.D. thesis is the result of my research at the Computer Engineering group of the Electrical Engineering department of the Delft University of Technology.

First, I would like to express my gratitude to Henk Corporaal, my supervisor, for his everlasting support, valuable comments and for the numerous discussions we had. In addition, I want to thank prof. Ad van de Goor for being my promotor and for giving me the opportunity to perform my research in his group.

Secondly, I thank the reviewers of this thesis, Andrea Cilio, Henjo Schot and my sister Willemien Korfage, for their valuable comments on (parts of) the first drafts of this thesis. I thank Walter Groeneveld for his contribution to the formulation of the "stellingen" (propositions). Furthermore, I would like to thank my fellow Ph.D. students within the MOVE project: Marnix Arnold, Jeroen Hordijk, Steven Roos and especially Jan Hoogerbrugge for their work on the TTA compiler. I would like to thank the system administrators Jean-Paul van der Jagt, Tobias Nijweide and Bert Meijs for providing an excellent working computer environment. In addition, I would also like to thank all my former colleagues and students of the Computer Engineering group for the enjoyable working environment.

Finally, I would like to thank my family, friends and TNO colleagues for their support and encouragement.

Johan Janssen
Delft, July 2001


Contents

Acknowledgements vii

1 Introduction 1
  1.1 Instruction-Level Parallelism 2
    1.1.1 ILP Architecture Arena 3
    1.1.2 Architectural Trade-off 6
  1.2 Research Goals 9
  1.3 Thesis Outline 11

2 TTAs: An Overview 13
  2.1 From VLIW to TTA 13
  2.2 Transport Triggered Architectures 15
    2.2.1 TTA Instruction Format 16
    2.2.2 Function Units 17
    2.2.3 Register Files 18
    2.2.4 Immediates 18
    2.2.5 Move Buses 18
    2.2.6 Sockets 19
    2.2.7 Control Flow and Conditional Execution 19
    2.2.8 Software Bypassing 20
    2.2.9 Operand Sharing 21

3 Compiler Overview 23
  3.1 Front-end 24
  3.2 Back-end Infrastructure 26
    3.2.1 Reading and Writing 26
    3.2.2 Control Flow Analysis 27
    3.2.3 Data Flow Analysis 30
    3.2.4 Data Dependence Analysis 32
    3.2.5 Loop Unrolling, Function Inlining and Grafting 35
  3.3 Register Assignment 35
    3.3.1 Graph Coloring 36
    3.3.2 Spilling 40
    3.3.3 State Preserving Code 41
    3.3.4 TTA vs. OTA 43
  3.4 Instruction Scheduling 43
    3.4.1 List Scheduling 44
    3.4.2 Resource Assignment 45
    3.4.3 Local Scheduling 46
    3.4.4 Global Scheduling 50
    3.4.5 Software Pipelining 54

4 Evaluation Methodology 59
  4.1 Benchmark Suite 59
  4.2 TTA Processor Suite 60
    4.2.1 Space Walking 60
    4.2.2 Selected TTA Processors 63
  4.3 Scheduling Scopes 65
  4.4 Exploitable ILP 67

5 The Phase Ordering Problem 69
  5.1 Early Register Assignment 70
    5.1.1 ILP and Early Register Assignment 70
    5.1.2 Dependence-Conscious Register Assignment Strategies 72
    5.1.3 Dependence-Conscious Early Register Assignment for TTAs 78
    5.1.4 Discussion, Experiments and Evaluation 81
  5.2 Late Register Assignment 83
    5.2.1 ILP and Late Register Assignment 84
    5.2.2 Register-Sensitive Instruction Scheduling Strategies 86
    5.2.3 Register-Sensitive Instruction Scheduling for TTAs 88
    5.2.4 Experiments and Evaluation 90
  5.3 Integrated Register Assignment 91
    5.3.1 Interleaved Register Assignment 92
    5.3.2 Integrated Instruction Scheduling and Register Assignment 93
  5.4 Conclusion 96

6 Integrated Assignment and Local Scheduling 99
  6.1 Resource Assignment and Phase Integration 100
  6.2 Register Resource Vectors 101
  6.3 The Interference Register Set 105
  6.4 Spilling 109
    6.4.1 Integrated Spilling 110
    6.4.2 Updating Data Flow and Data Dependence Relations 110
    6.4.3 Scheduling Issues 112
    6.4.4 Peephole Optimizations 117
  6.5 State Preserving Code 118
    6.5.1 Generation of Callee-Saved Code 118
    6.5.2 Generation of Caller-Saved Code 120
  6.6 Experiments and Evaluation 121
    6.6.1 Register Selection 122
    6.6.2 Operation Selection 123
    6.6.3 Basic Block Selection 124
    6.6.4 Early vs. Integrated Assignment 126
  6.7 Conclusions 130

7 Integrated Assignment and Global Scheduling 131
  7.1 The Interference Register Set 131
    7.1.1 Importing a Use 132
    7.1.2 Importing a Definition 133
  7.2 Importing Operations 136
  7.3 Example 137
  7.4 Spilling 140
  7.5 State Preserving Code 141
  7.6 Experiments and Evaluation 143
    7.6.1 Region Selection 143
    7.6.2 Global Spill Cost Heuristic 144
    7.6.3 Early vs. Integrated Assignment 145
  7.7 Conclusions 150

8 Integrated Assignment and Software Pipelining 151
  8.1 Register Pressure 152
  8.2 Register Assignment and Software Pipelining 156
  8.3 Integrated Assignment and Modulo Scheduling 158
    8.3.1 The Interference Register Set 159
    8.3.2 Spilling 160
  8.4 Experiments and Evaluation 161
    8.4.1 Spilling or Increasing the II 161
    8.4.2 Early vs. Integrated Assignment 162
  8.5 Conclusions 164

9 The Partitioned Register File 167
  9.1 Register Files 168
    9.1.1 Silicon Area 169
    9.1.2 Access Time 170
    9.1.3 Power Consumption 171
    9.1.4 Partitioned Register Files 171
  9.2 Early Assignment and Partitioned Register Files 174
    9.2.1 Simple Distribution Methods 175
    9.2.2 Advanced Distribution Methods 177
    9.2.3 Equal Area Compiling 179
  9.3 Late Assignment and Partitioned Register Files 180
  9.4 Integrated Assignment and Partitioned Register Files 183
    9.4.1 Local Heuristics 184
    9.4.2 A Global Heuristic 186
  9.5 Conclusions 189

10 Summary and Future Research 191
  10.1 Summary 191
  10.2 Contributions 195
  10.3 Proposed Research Directions 196

A Integrated Assignment Benchmark Results 203

B Partitioned Register File Benchmark Results 211
  B.1 Early Assignment 211
  B.2 Integrated Assignment 213

Glossary 217

Bibliography 223

Samenvatting 237

Curriculum Vitae 241


1 Introduction

Today, microprocessors can be found in virtually every electronic device. Not only workstations and PCs contain microprocessors; they can also be found in equipment for daily use such as television sets, mobile phones, PDAs (Personal Digital Assistants), microwaves and cars, or in specialized devices such as the automatic pilot in an aircraft, robots and medical instrumentation. More than 100 million microprocessors for general-purpose computers (PCs, workstations, mainframes, etc.) are sold annually. This is, however, only the tip of the iceberg. Over two billion microprocessors are estimated to be sold annually for embedded applications [Leh00]. The embedded microprocessor market is growing, according to Dataquest, from $7.5 billion in 1998 to $26 billion by 2002 [FBF+00]. Furthermore, the amount of embedded software produced exceeds the general-purpose software produced by a factor of five [EZ97].

The performance of microprocessors is increasing rapidly. This increase is driven by the demand to execute ever larger and more complex applications. Various architectures are used to deliver the requested performance, such as CISC (Complex Instruction Set Computer) and RISC (Reduced Instruction Set Computer) processor architectures. At this moment, we are in the middle of the instruction-level parallelism (ILP) era. The power of ILP processing lies in the ability to execute multiple operations in parallel. It should be obvious that this potentially results in large performance improvements. Various ILP architectures like superscalar architectures, VLIWs (Very Long Instruction Word architectures) and TTAs (Transport Triggered Architectures) have been proposed and implemented. Unfortunately, the ability in hardware to execute multiple operations in parallel, by adding extra resources, does not always lead to a performance increase. Numerous constraints prevent the efficient usage of the resources of ILP processors. To efficiently utilize the available resources, the order of operations in the program code should be rearranged. This process is called instruction scheduling. Register assignment manages the available high-speed on-chip memory elements called registers. These registers are used for holding temporary values produced by the operations. The order in which register assignment and instruction scheduling are applied plays an important role in the exploitation of ILP. An efficient register assignment may hinder an efficient reordering of the operations. In addition, efficient instruction scheduling may result in an inefficient use of the registers. In this dissertation, this phase ordering problem is discussed, solutions are proposed and results are given.

The research is performed within the MOVE project at the Delft University of Technology. The MOVE project aims at bringing instruction-level parallelism within the reach of application specific processors (ASPs) in a flexible, scalable and cost-effective way. These processors are designed for a specialized task and can be found in all kinds of equipment like TVs, cars, copiers, cameras, etc. To achieve these goals, a new processor architecture is proposed and described. This new architecture is called the Transport Triggered Architecture, or in short TTA [Cor98]. Several processors using this new architecture have been designed and implemented [CvdA93, AHC96, TNO99, VLW00]. The performance of TTA processors highly depends on the quality of the compiler. Therefore, to fully exploit the available ILP provided by TTA processors, research is performed to develop new compiler techniques and strategies to enhance instruction-level parallelism.

In this chapter, the concept of instruction-level parallelism is introduced in Section 1.1. The research goals of this thesis are formulated in Section 1.2. An overview of the remaining chapters of this thesis is given in Section 1.3.

1.1 Instruction-Level Parallelism

Instruction-Level Parallelism (ILP) is the family of processor architectures and compiler techniques that enhances performance by executing multiple operations in parallel. The processors provide the resources to execute operations in parallel. The architecture of an ILP processor allows simultaneous access to the duplicated resources, which improves performance. The question of how much ILP is available in programs is addressed in a number of articles [JW89, Wal91, LW92, TGH92, LW97]. Studies to measure the maximum available ILP have critical shortcomings, however. First, many of these studies assume the presence of infinite processor resources and assume perfect program behavior predictors. In this case, the upper limit is too optimistic. Secondly, these studies do not consider modern or future techniques to enhance ILP. This results in a too pessimistic upper limit. Therefore, these studies are of limited value.


[Figure: a central processing unit containing an instruction fetch unit, an instruction decode unit and five function units (FU-1 to FU-5), connected to a register file, an instruction memory and a data memory.]

Figure 1.1: General organization of ILP architectures.

Estimates of the maximum available ILP range between 2 [JW89] and 1000 [LW97]. One should keep in mind that the exploitable ILP highly depends on the application. Scientific programs have inherently more ILP than control-intensive programs. To discover and to increase the ILP in programs, compiler technology is used. Discovering and exploiting ILP in programs will be key to future increases in microprocessor performance [SCD+97].

Various architectures used to exploit ILP are described in Section 1.1.1. The architectural trade-off is discussed in Section 1.1.2.

1.1.1 ILP Architecture Arena

The general organization of ILP architectures is shown in Figure 1.1. The instruction fetch unit reads the instructions from the instruction memory. These instructions are decoded and sent to the function units (FUs). The central processing unit (CPU) shown in the figure contains five FUs. The FUs perform the actual computation, such as additions, multiplications, etc. One of the FUs in the figure is able to perform load and store operations on the data memory. Temporary data is stored in registers. These registers are grouped into the register file (RF). The FUs exchange data via this shared RF.

The main reason for the enormous research interest in ILP architectures nowadays is that more silicon space is available than a RISC processor requires. This allows the duplication of FUs and data paths. Having duplicated FUs means that multiple operations can be executed simultaneously. The data path transports data between the various resources in a processor. More FUs result in the need for a larger data path. The registers in an RF can be accessed through a limited number of ports. Increasing the number of ports, and thus increasing the data path to and from the RF, enables the exploitation of more ILP.

In [RF93] an excellent overview of the dynamic history of ILP architectures is given. Although the importance of ILP was already recognized in the early fifties [Wil51], and ILP processors were built in the eighties [BYA93, RYYT89, SS93], it took until the nineties for ILP to become a key technology for microprocessor performance. Rau and Fisher [RF93] classify ILP architectures into three categories: sequential architectures, dependence architectures, and independence architectures.

Sequential architectures

Sequential architectures execute programs that contain no explicit information regarding dependences between operations. The programs for these architectures consist of a sequential operation stream. It is the responsibility of the hardware to detect dependences between operations, and to rearrange the operation order to achieve fast computation; this is called dynamic scheduling. An operation starts to execute if it does not depend on an operation currently being executed and if the resources needed for the operation are free.

Implementations of sequential ILP architectures are known as superscalars [Joh91]. Superscalars exploit the ILP of a program in hardware; this requires extra logic to detect ILP and to dispatch operations to the FUs. Superscalars differ in their issue width (i.e., the maximum number of operations that can be executed simultaneously), and in the complexity of their instruction scheduler. Simple superscalars, like the Alpha 21064, issue operations in the same order as they appear in the program, while others, like the PowerPC 601 and the Pentium 4 [Int00], allow instructions to be issued out-of-order [SW94]. A disadvantage of superscalars is their limited scalability; for example, increasing the issue width often results in a completely new and much more complex design [PJS96].

Superscalars do not require a compiler to exploit ILP, since there is no way to explicitly communicate information regarding ILP from the compiler to the hardware. However, many ILP compiler techniques may be beneficial to enhance superscalar performance [SCD+97, STK98, Wol99].

Dependence architectures

Dependence architectures execute programs consisting of operations and information about the data dependences between operations. The programmer or compiler adds this information to the program, which releases the hardware from detecting these dependences. The responsibility of the hardware is to detect operations that are ready for execution and to find free resources.

Data-flow processors [GS95] are representatives of this class. The operations of these processors contain a list of all data dependent successor operations. When an operation finishes execution, a copy of its result is created for each of its successor operations. As soon as all the input operands of an operation and the required resources are available, the operation is executed. Since the operands of an operation are implicitly specified by its predecessors, the operands do not have to be specified.

Independence architectures

Programs for independence architectures contain, besides the operations, information about independences between the operations. The compiler is responsible for identifying parallelism in a program. It communicates this information to the hardware by specifying which operations are independent of each other.

An example of an independence architecture is the Horizon architecture [TS88]. The compiler encodes an integer H into each operation. This integer tells the hardware that the next H operations in the operation stream are data independent of the current operation. This releases the hardware from detecting data independence; however, the hardware is still responsible for assigning resources to the operations.
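
As a rough illustration of this encoding (a sketch only; the operation stream, register names and H values below are invented for the example and are not taken from the Horizon literature), each operation carries the number of immediately following operations that are guaranteed not to depend on it:

    /* Hypothetical Horizon-style stream: h is the compiler's promise that
       the next h operations are data independent of the current one. */
    #include <stdio.h>

    struct op { const char *text; int h; };

    int main(void) {
        struct op stream[] = {
            { "ld  r1, a",      2 },  /* ld r2 and add do not use r1       */
            { "ld  r2, b",      1 },  /* add does not use r2               */
            { "add r3, r4, r5", 1 },  /* mul does not touch r3, r4 or r5   */
            { "mul r6, r1, r2", 0 },  /* uses r1 and r2; end of the stream */
        };
        for (unsigned i = 0; i < sizeof stream / sizeof stream[0]; i++)
            printf("%-16s H=%d\n", stream[i].text, stream[i].h);
        return 0;
    }

The hardware may then issue any of the next H operations together with the current one without performing its own dependence check, but it must still assign FUs and other resources to them.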

The Explicitly Parallel Instruction Computing (EPIC) architecture [ACM+98] is also classified as an independence architecture. The instructions of an EPIC architecture contain multiple operations. Each instruction includes a template, which indicates whether the operations in the instruction are independent. The template also indicates whether the instruction can be executed in parallel with neighboring instructions. An example of the EPIC architecture is the IA-64 instruction set architecture [IA99], developed jointly by Intel and Hewlett-Packard. An actual implementation is Intel's Itanium processor [Abe00].

Another independence architecture is the Very Long Instruction Word (VLIW) architecture [BYA93, DT93, L+93, SS93, GNAB92, HA99, Kla00]. A VLIW compiler not only releases the hardware from detecting independences, but it also assigns the FUs to the operations. A VLIW program specifies on which FU each operation should be executed, and when each operation should be issued. In the context of VLIW architectures, it is important to distinguish between operations and instructions. An operation is a unit of computation, such as an addition, memory load or branch. An instruction consists of multiple operations. The operations in an instruction are issued simultaneously.
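
To make the operation/instruction distinction concrete, the sketch below models a VLIW instruction as a fixed bundle of operation slots. The four-slot width and the operation fields are illustrative assumptions, not the format of any particular VLIW.

    #include <stdint.h>

    /* A hypothetical 4-issue VLIW: one instruction = four operation slots,
       all issued in the same cycle. */
    enum opcode { NOP, ADD, MUL, LOAD, STORE, BRANCH };

    typedef struct {
        enum opcode op;               /* unit of computation (add, load, ...) */
        uint8_t     dst, src1, src2;  /* register operands                    */
    } operation;

    typedef struct {
        operation slot[4];            /* one slot per FU, chosen by the compiler;
                                         slots that cannot be filled hold NOPs  */
    } vliw_instruction;

The compiler fills each slot with an independent operation for the corresponding FU, or with a no-op when none is available; this is the no-op padding referred to later in the discussion of code density.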

The compiler plays an important role in enhancing the performance of VLIW processors. In fact, the compiler decides in which order operations are executed, subject to the data dependence and resource constraints. To accomplish this, the compiler uses a detailed description of the processor. It must know exactly how many operations can be executed in parallel. This is in contrast with the previously discussed approaches, where the hardware does the assignment of FUs to operations. Research in the area of exploiting ILP with the use of compilers is still ongoing as a result of the growing interest in VLIW processors.


[Figure: a central processing unit containing an instruction fetch unit, an instruction decode unit, three function units (FU-1 to FU-3) and two register files (RF-1, RF-2) connected by an interconnection network, together with an instruction memory and a data memory.]

Figure 1.2: General organization of a TTA.

Examples of VLIW implementations are Cydrome's Cydra 5 [RYYT89], Multiflow's TRACE 14/300 [SS93], Philips' TriMedia processor [SRD96], the TMS320C6201 DSP processor of Texas Instruments [TI99], BOPS's ManArray [PV00] and Transmeta's Crusoe processor [Kla00].

The Transport Triggered Architecture (TTA) is also classified as an independence architecture. TTAs resemble VLIWs; however, where VLIWs specify operations in an instruction, TTAs also specify the transports between FUs and RFs. This gives the compiler an even larger responsibility, since now not only the FUs and registers need to be assigned, but also the transport resources. The general organization of a TTA is given in Figure 1.2. It differs from other ILP architectures (see Figure 1.1) in the sense that not all FUs require a direct connection to the RF. The FUs exchange data via the interconnection network instead of using a shared RF. This reduces the complexity of the data path considerably, as will be shown in Section 2.1.

The difference between the three classes of ILP architectures is the division of the responsibility for ILP exploitation between the hardware and the compiler. Figure 1.3 summarizes the responsibilities for each type of ILP architecture.

1.1.2 Architectural Trade-off

The most visible ILP processors are general-purpose processors, which can be found in workstations and PCs. For these processors, the superscalar architecture is the current technology of choice. Superscalars have advantages compared to VLIWs and TTAs because they provide binary compatibility, which allows existing applications to run on new machines with a different level of ILP without recompiling. Binary compatibility is an important issue for users who want to upgrade their hardware without buying new application software. The dynamic scheduling ability of superscalars can adapt the order of execution in situations that were unknown at compile time. It can handle operations with variable latencies, such as load operations with potential cache misses.


[Figure: classification of ILP architectures by the division of responsibilities between compiler and hardware. For sequential architectures (superscalar) the compiler only produces the application code via the front-end; determining dependences, determining independences, binding function units and binding data paths are all done by the hardware. For dependence architectures (data-flow) the compiler also determines the dependences. For independence architectures the compiler in addition determines the independences (Horizon, EPIC), also binds the function units (VLIW), or also binds the data paths (TTA). Executing the code always remains the hardware's responsibility.]

Figure 1.3: ILP architecture classification.

Furthermore, it can reorder memory references whose independence could not be determined at compile time. It does this by comparing their effective addresses. On the other hand, superscalars have a limited view of the operations they can reorder; typically 4 to 64 operations can be considered. The hardware needed to perform the data independence tests and the assignment of the resources largely limits the scalability of superscalars. The number of transistors required to implement this level of intelligence is substantial, and the time it takes to do this work adds a significant overhead to the pipeline. The enormous design effort required to increase the issue width of a superscalar increases the time-to-market. Furthermore, the increasingly complex design needed to make these processors puts a question mark over how many companies can afford to design them.


Data-flow and Horizon processor architectures are academic research projects and did not find widespread application in the microprocessor market. Processors with an EPIC architecture are just starting to emerge. Initially they will replace superscalars in high-end PCs and workstations. Their hardware complexity lies between that of superscalars and VLIWs. Compiler techniques developed for VLIWs and TTAs are also useful for EPIC-based processors.

For Application Specific Processors (ASPs), time-to-market is an important issue. For many companies, it is of vital importance to introduce high quality ASPs at a low cost and within a short time frame. This is the promise of VLIWs and TTAs: by removing complexity from the hardware, simple processors are created that increase performance far more easily than superscalars. Simple hardware allows clock speeds to increase more aggressively than is possible with today's complex superscalars, and more FUs can easily be added to exploit the parallelism existing in applications. TTAs are even more scalable than VLIWs since they do not require that each FU has its own private connection to the RF. VLIWs and TTAs can exploit large amounts of ILP with relatively simple control logic. This not only results in less silicon, but also reduces the power consumption. The compiler for these architectures can reorder the operations within a larger scope, normally tens or hundreds of operations. This gives a major performance benefit compared to superscalars. Unfortunately, VLIWs and TTAs do not provide binary compatibility. Changing the issue width requires recompiling the application. Several methods have been proposed to remove this limitation [Rau93, CS95]; however, no widely accepted solution has been found yet¹. For companies that develop ASPs, binary compatibility is not an issue, because they usually own the application code and can recompile it for another processor. VLIWs and TTAs are not able to adapt their instruction scheduling strategy to unpredictable run-time situations. When, for instance, a cache miss is encountered and the following instruction needs the data, the processor is locked until the data is available. Furthermore, the instruction scheduler is often hindered by ambiguous memory references. Compile-time analysis can help to alleviate this problem. The instructions of a VLIW or a TTA specify which n operations must be executed in parallel. However, it is very unlikely that n independent operations can always be found. The empty places in the instructions are then filled with no-ops. This lowers the code density compared with superscalars. Efficient encoding and compression techniques [CBLM96, L+93, RYYT89] can be used to solve this problem. Although the market for processors in embedded systems is less visible than the market for superscalars, the embedded market is much larger. For these reasons, new compiler techniques for VLIWs and TTAs attract a lot of interest in academic and industrial research.

¹ The Java [GJS96] platform solves this problem by using a virtual machine. This extra software layer decreases performance and does not yet support the exploitation of ILP. Research has to prove whether ILP can be exploited efficiently by the Java platform [EA97, GV97]. Code morphing [Kla00], as introduced in Transmeta's Crusoe processor, translates x86 instructions into VLIW instructions during execution. This software layer implements a virtual machine on a VLIW processor.


1.2 Research Goals

The research described in this dissertation is performed within the MOVE project. This project aims at the automatic generation of application specific processors and their compilers. In this project, a new architecture is developed: the Transport Triggered Architecture (TTA). TTAs are very well suited for ASPs since they provide scalability and flexibility. Since TTAs fall into the category of independence architectures, the compiler is responsible for detecting and exploiting ILP. Two of the most important code generation phases for ILP processors in general, and TTAs in particular, are register assignment and instruction scheduling [HP90]. Applying these two phases separately may have major performance drawbacks. This is especially true for applications for which registers are a critical resource, and for processors with a small register set.

In this dissertation, problems related to the interaction between register assignment and instruction scheduling are analyzed and new methods are researched. This includes the following topics:

Evaluation of the phase ordering problem
An evaluation of the phase ordering problem of instruction scheduling and register assignment is given. The instruction scheduler is responsible for creating a legal reordering of operations such that the execution time of a program is reduced and the semantics of the program are preserved. The task of the register allocator is to assign the program's variables to the registers of the processor. Instruction scheduling can be done either before or after register assignment. Consider the example in Figure 1.4, which shows three possible scenarios for scheduling the code fragment of Figure 1.4b, assuming a 2-issue processor with three registers. When register assignment is carried out before instruction scheduling (Figure 1.4c), the selection of registers may limit the possibilities to reorder the operations. This has a negative impact on the application's performance. On the other hand, when scheduling precedes register assignment, more variables become live simultaneously. For the scheduled code in Figure 1.4d no legal assignment can be found with three registers. This means that register assignment has to introduce extra operations, so-called spill code, to read and write values from memory. This lengthens the program and increases the execution time. The code example of Figure 1.4e shows the generated code when register constraints and instruction scheduling freedom are considered simultaneously. This approach results in the fastest executing code. In general, the more simultaneously issued operations, the more registers are potentially required. Therefore, it seems advantageous to address register assignment and instruction scheduling simultaneously in order to maximize ILP and to manage the registers efficiently. In this thesis, phase orderings are evaluated and TTA-related issues are researched. In addition, solutions proposed in the literature are discussed.


a) Example loop:

    for (i = 0; i < 100; i++) {
        a[i] = a[i]/3 + 10 + a[i]*b[i];
        b[i] = b[i]*3 + 2;
    }

b) Code without loop control:

    ld  v1
    div v2, v1, #3
    add v3, v2, #10
    ld  v4
    mul v5, v1, v4
    add v6, v5, v3
    st  v6
    mul v7, v4, #3
    add v8, v7, #2
    st  v8

[The remaining panels of the figure show the corresponding code for a 2-issue processor with three registers: c) register assignment before instruction scheduling, d) register assignment after instruction scheduling, which requires spill code², and e) integrated register assignment and instruction scheduling.]

Figure 1.4: Motivating example: phase ordering problem.

² In Figure 1.4d, the live ranges of the variables v4 and v7 are each replaced by two short live ranges. The live ranges with index a originate from the live range of variable v4, while the live ranges with index b originate from the live range of v7.


Integrated register assignment and local scheduling
A new algorithm, which integrates register assignment and local scheduling, is developed. Local scheduling is one of the simplest instruction scheduling techniques. It exploits the ILP within basic blocks. A basic block is a sequence of consecutive statements in which the flow of control enters at the beginning and always leaves at the end. It will be shown that the introduced algorithm can gracefully handle situations in which insufficient registers are available. It also inserts store and reload code around procedure calls to preserve the state of the program; the state of the program consists of the contents of the registers. The results of the newly introduced method are evaluated and compared with the results of early assignment methods, within the same compiler.
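
As a small illustration of the basic block notion (the C fragment below is an invented example; the block boundaries follow the definition above):

    int f(int a, int b) {
        int x = a + b;      /* B1: control enters here at the top...        */
        int y = x * 2;      /* ...and cannot jump into the middle of B1     */
        if (y > 10)         /* B1 ends: control may leave along two paths   */
            y = y - a;      /* B2: executed only when the branch is taken   */
        return y + x;       /* B3: the join point starts a new basic block  */
    }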

Integrated register assignment and global scheduling
Normally, a basic block consists of a small number of operations. This gives few opportunities to exploit ILP. Therefore, it is beneficial to exploit the ILP across several basic blocks. This increases the register requirements. We research the effectiveness of integrated assignment using a global scheduler, which exploits ILP without being restricted to basic block boundaries.

Integrated register assignment and software pipelining
A powerful and efficient scheduling technique for exploiting ILP in loops is software pipelining. It results in high performance, but increases the register requirements. When this scheduling technique runs out of registers, the compiler is faced with a severe problem. We present a solution based on the integrated register assignment and instruction scheduling technique we developed.

Efficient code generation in the context of partitioned RFs
A practical implementation of high-performance ILP architectures is constrained by the difficulty of building a large multi-ported RF. A solution is to partition the RF into smaller RFs while keeping the total number of registers and ports equal. The advantages and disadvantages of partitioning RFs are discussed. In this dissertation, compiler techniques are proposed to generate code for TTAs containing partitioned RFs. Solutions for separated register assignment and instruction scheduling, as well as for an integrated approach, are presented and the results of experiments are given.

1.3 Thesis Outline

This dissertation is organized as follows. Chapter 2 describes the concept of TTAs; starting from the VLIW concept, this new class of processors is derived. In Chapter 3, the compiler framework is discussed. This includes a description of instruction scheduling and register assignment. A thorough knowledge of compiler techniques is necessary in order to comprehend the new methods. Chapter 4 discusses the environment for the experiments, including the TTA configurations and benchmarks used. The relationship between instruction scheduling and register assignment is described in Chapter 5. It evaluates techniques from the literature that tackle the phase ordering problem. The newly developed method, which fully integrates register assignment and instruction scheduling in a single phase, is described in the following three chapters. Chapter 6 describes the simplest case: register assignment is integrated with a local scheduler. A local scheduler can exploit only a modest amount of ILP; therefore, in Chapter 7, a global scheduling method is used. In Chapter 8, the application of integrated register assignment in combination with software pipelining, an even more aggressive scheduling technique, is discussed. This scheduling technique can only be applied to loops. When the amount of exploitable ILP increases, the register pressure also increases. Consequently, more registers are read and written simultaneously, which requires a large multi-ported RF. However, RFs with a high number of ports are difficult to realize. In Chapter 9, solutions are proposed to solve this problem. Two solutions are implemented: one for a phase ordering in which register assignment precedes instruction scheduling, and one in which both phases are fully integrated. The last chapter concludes this dissertation: the findings are summarized and suggestions for further research directions are proposed.


2 TTAs: An Overview

Increasing computing power is the subject of many research programs. The need for more powerful processors not only originates from general-purpose computing, but also from dedicated applications. To satisfy this need, computer researchers develop new computer architectures and architectural features. The most well-known ILP (Instruction-Level Parallel) computer architectures are superscalars and VLIW (Very Long Instruction Word) processors. Corporaal [Cor98] developed a new computer architecture, called the Transport Triggered Architecture (TTA). The research presented in this dissertation is done in the context of this architecture. Knowledge of the TTA concept is necessary to fully understand all aspects of the research presented in the remainder of this dissertation.

The TTA concept evolved from the VLIW concept. The evolution from VLIW to TTA is described in Section 2.1. The TTA's characteristics are described in Section 2.2.

2.1 From VLIW to TTA

TTAs resemble VLIW architectures; both exploit ILP at compile time. An example of the data path of a VLIW is given in Figure 2.1. This processor contains five function units (FUs) connected through a 15-ported register file (RF). An operation, performed by an FU, usually reads two values (operands), manipulates them and produces a single result. Consequently, the RF of a VLIW with K FUs must have 3K ports: 2K read ports and K write ports.

The data transport bandwidth in a VLIW is proportional to K. As was observed in [Cor98], VLIWs are designed for the worst case; however, it is very unlikely that all FUs are busy simultaneously and that the data bandwidth is utilized for the full 100%.


[Figure: five function units (FU-1 to FU-5), each with its own read and write connections to a single shared register file.]

Figure 2.1: Register file connectivity within a VLIW.

Some VLIWs have more FUs than the instruction size permits [SRD96]. For these VLIWs it is not even possible to keep all FUs busy, assuming single-cycle pipelined FUs. Even when all FUs are busy, the data path is not likely to be fully used due to operations that require only one source operand or do not produce a result. The addition of a bypassing network to a VLIW may result in better performance but decreases the utilization of the data path even further. The complexity of a fully connected bypassing network grows quadratically with the number of FUs [Cor98].

Due to the enormous effort towards the exploitation of more and more ILP, a processor architecture must be designed for scalability. However, the scalability of a VLIW is limited by the rapidly increasing complexity of the required data path, especially as its RF and bypass circuit become complex [Cor98]. This may increase the cycle time of the processor and hence degrade the performance. Furthermore, area and power consumption can become a bottleneck, which can make a processor too expensive and unsuited for specific applications.

Because the data path of a VLIW is rarely utilized for the full 100%, it seems logical to share the transport capacity among the FUs. This not only improves the utilization of the data path, but also decreases the number of ports on the RF. To accomplish this, an extra level of control is needed to ensure that no two transports use the same connection at the same time. This task is the responsibility of the compiler. The generated instructions have to specify transports instead of operations, hence the name of this new architecture: Transport Triggered Architecture.


[Figure: five function units (FU-1 to FU-5) and a register file, each attached through sockets to the buses of an interconnection network; dots mark the bus connections of each socket.]

Figure 2.2: Block diagram of a TTA.

2.2 Transport Triggered Architectures

Unlike VLIWs, TTAs do not require that each FU has its own private connection to the RF. An example of the computing core of a TTA is given in Figure 2.2. An FU is connected to the RF by means of an interconnection network, which contains buses and sockets. A socket can be viewed as a gateway, which is able to pass one data item per cycle. The inputs and outputs of the FUs and RFs are connected to input and output sockets, respectively. It is not necessary that all buses are connected to a socket. In the figure, the dots indicate to which buses a socket is connected.

FUs can be designed separately, pipelined independently and can have an arbitrary number of inputs and outputs. Examples of standard FUs are:

• Instruction Fetch Unit: Reads the instructions from memory and controls the flow of the program (jumps and calls).
• Integer Unit: Performs integer operations (add, subtract, etc.).
• Floating-point Unit: Performs floating-point operations (add, subtract, etc.).
• Logic Unit: Performs logical operations (and, or, xor, etc.).
• Load/Store Unit: Reads data from and writes data to external memory.

For some applications it is profitable to have FUs dedicated to a specific task, for instance a Multiply-Add unit or an RS232 interface [AHC96]. These Special Function Units (SFUs) can also easily be integrated within the processor and exploited by the compiler.

The design space of TTAs is enormous. The number and type of FUs and RFs, and the capacity of the interconnection network, can be changed easily. Performance gains can be achieved by adding FUs, pipelining FUs, increasing the number of buses, or changing the RFs. Due to these qualities TTAs are extremely useful for Application Specific Processors (ASPs).


[Figure: a TTA instruction consists of slot 1 to slot N; each slot holds a guard-id, a source-id and a destination-id. A flag bit in the source field selects between a source register-id (0) and a short immediate (1).]

Figure 2.3: General instruction format of a TTA.

For more details on TTAs and prototype realizations, the reader is referred to [CM91, Cor93, CvdA93, AHC96, Cor98, TNO99].

TTAs are composed of various highly regular building blocks. These building blocks can be customized to the needs of an application. In the remainder of this section, a general description of these building blocks is given. In addition, the instruction format and the TTA-specific properties that allow the exploitation of ILP to a greater extent are discussed.

2.2.1 TTA Instruction Format

TTAs mirror the traditional programming model. Traditional architectures are programmed by specifying operations; the data transports between FUs and RFs are implicitly triggered by executing the operations. Therefore, traditional architectures are called Operation Triggered Architectures (OTAs). TTAs are programmed by specifying the data transports; as a side effect, the operations are performed. Programming TTAs shows much resemblance with programming VLIWs. Instead of packing operations into a single instruction, like VLIWs, TTAs pack multiple transports into a single instruction.

The general instruction format of a TTA is shown in Figure 2.3. Each slot of the TTA instruction directly controls a bus. A data transport, also called a move, describes a data transfer between the source and destination specified by the source-id and destination-id of a slot. The source- and destination-ids specify registers. Examples of registers are general-purpose registers, operand registers, trigger registers and result registers. The source-id does not always specify a register; it can also contain a small integer. A 1-bit flag is added to the source-id to specify whether the source-id specifies a register or contains a short immediate. The guard-id is used to support conditional or guarded execution. It specifies a Boolean expression; the outcome of this expression determines whether a move is executed or not. In a two-stage instruction pipeline, the FUs do not have a result register. For these TTAs, the destination-id specifies the FU output instead of a register.
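
As an illustration, the sketch below models one instruction slot as a C structure. The field names follow the description above, but the field widths (a 3-bit guard-id, 8-bit source and destination ids) and the bus count are arbitrary assumptions made for this sketch, not the MOVE project's actual encoding.

    #include <stdint.h>

    /* Hypothetical encoding of one TTA instruction slot
       (field widths are assumptions made for this sketch). */
    typedef struct {
        uint16_t guard_id   : 3;  /* selects a Boolean guard expression       */
        uint16_t src_is_imm : 1;  /* 0: source-id is a register-id,
                                     1: source-id holds a short immediate     */
        uint16_t source_id  : 8;  /* register-id or short immediate value     */
        uint8_t  destination_id;  /* register-id, e.g. an FU trigger register */
    } tta_slot;

    /* One instruction bundles one move per bus; N_BUSES is an assumption. */
    #define N_BUSES 4
    typedef struct {
        tta_slot slot[N_BUSES];
    } tta_instruction;

A slot whose guard expression evaluates to false simply suppresses its move, so the operation behind it is not triggered.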


2.2.2 Function Units

The function units or FUs are the components of a TTA that perform the computations or communicate with the outside world. Tasks performed by the FUs are, for example, additions, multiplications, and loads and stores to memory.

An FU contains one or more input and output registers. These registers are subdivided into three types: operand registers, result registers and a trigger register. Executing an operation with n (source) operands consists of moving n - 1 operands to the operand registers and one operand to the trigger register of the FU. A trigger register is a special kind of operand register, because moving an operand to a trigger register starts an FU operation. The results of the operation can be read from its result registers. As an example, we show how a RISC-type add instruction translates into three transports:

add r3,r2,r1;  ⇒  r1 → add.o; r2 → add.t; add.r → r3;

First, the values of r1 and r2 are moved from the RF to the operand and trigger registers add.o and add.t of the FU. After a delay (depending on the latency of the adder), the result is moved from the result register add.r to r3 in the RF. An FU may support various operations. The type of operation is specified by its trigger register: writing to a specific trigger register of an FU starts the corresponding operation.

Operations that have a latency longer than a single cycle are subject to pipelining. There are several alternatives for pipelining the FUs. The two alternatives that are currently supported are hybrid pipelines and virtual-time latching (VTL) pipelines.

A hybrid pipeline ensures that data in the pipeline is never overwritten. This implies that when a result is not read from the result register, another operation that is started one cycle later will not overwrite the old result. A pipeline-full exception is raised when an attempt is made to write to a trigger register of a hybrid-pipelined FU while it cannot accept an operation because the pipeline is full. On the other hand, the processor locks when a read attempt is made from a result register that contains no valid data. The lock is released when the result register receives a valid value from a previous pipeline stage. This pipelining discipline is used in the MOVE32INT [CvdA93] processor and the Phoenix processor [CL95].

Operations executing on a VTL-pipelined FU proceed unconditionally from one pipeline stage to the next (except for data and instruction cache misses and exceptions). This means that the value of a result register can be overwritten even when the old value was never read. How long the old value remains available depends on how soon another operation triggers the FU. It is the responsibility of the compiler to ensure that no data is unintentionally overwritten.

The experiments performed in [Hoo96] show that both pipeline alternatives have comparable performance. VTL-pipelined FUs are easier to implement in silicon and are less likely to increase the cycle time due to complex control logic [Cor98]. In the remainder of this thesis, all FUs have VTL pipelines. For a detailed evaluation of various pipeline disciplines the reader is referred to [Cor98].

2.2.3 Register Files

Registers can be seen as small, first level, fastest accessible components of the memory hierarchy. Registers bridge the gap between main memory speed and the rate at which FUs can process data. The values of variables are stored in registers to speed up the execution time. Registers are grouped into register files (RFs). An RF has a number of ports through which the registers are accessible. The RF in Figure 2.2 has two read ports and two write ports. Two register reads and two register writes can be performed simultaneously. The more ports an RF has, the more freedom in access patterns; however, increasing the number of ports increases the chip area [CDN95], the access time [FJC95, Cor98], and the power consumption [ZK97]. The current TTA framework provides three types of RFs: RFs for integer registers (32 bits), RFs for floating-point registers (64 bits) and RFs for Boolean registers (1 bit). However, no fundamental restrictions exist for using an arbitrary number of bits for the registers, or for using an arbitrary number of RFs.

2.2.4 Immediates

Not all values originate from registers. For example, values that are fixed throughout the program can be coded into the instructions. As already discussed in Section 2.2.1, small immediates can be coded in the source-id of a move. The size of a short immediate is usually less than or equal to 8 bits.

Immediates that do not fit into the source-id of a move are handled differently. To support these long immediates the compiler must encode them in the instructions. This can be done by adding one or more immediate fields to the instruction format. When an instruction is fetched from the instruction memory, the immediates stored in the immediate fields are placed in special registers. These registers are called immediate registers. The immediates can now be transported by specifying the id of the immediate register in the source-id of the move. Other implementations to handle long immediates are also feasible, see [Jan01].

2.2.5 Move Buses

The communication between FUs and RFs is done over the interconnection network. This network consists of a set of sockets and buses. A bus, also denoted as move bus, is controlled by a slot of the TTA instruction. The implementation of a move bus not only provides the necessary data transport capability, but it also performs the distribution of the control signals. The control signals include the source and destination register-ids and the signals for locking, guarding and exceptions.

A fully connected interconnection network simplifies the code generation task. From the hardware point of view, a fully connected interconnection network may result in a high bus load, which may affect the cycle time. For ASPs, the interconnection network should represent the communication requirements for the executed application(s) under its performance constraints. Paths in the interconnection network that are heavily used should be provided, while less frequently used paths can be removed1.

2.2.6 Sockets

The interface between the interconnection network and the FUs and RFs is provided by input and output sockets, see Figure 2.2. Sockets primarily consist of a comparator, input multiplexers (for the input sockets) and output demultiplexers (for the output sockets). The socket checks the register-id that is supplied on the bus to determine whether the specified register is accessible through it. When the register is accessible, the data is passed in the desired direction.

2.2.7 Control Flow and Conditional Execution

To execute high-level language statements, such as while and if-then-else statements, the processor must be able to change the flow of control conditionally. This is usually done by changing the contents of the program counter. In a TTA, the flow of control can be changed by directly writing a value to the program counter. The program counter is accessible for writing through the jump register. A jump is simply performed by writing the address of the target of the jump into the jump register. Depending on the instruction pipeline, the jump can have one or more delay slots.

A jump operation usually executes under a certain condition. When this condition is met, the jump is carried out, otherwise normal execution continues. TTAs support conditional execution by means of guarded or predicated execution. A Boolean expression is associated with each move. The move takes place only when this expression evaluates to true. The Boolean expression, also called guard expression, is constructed out of Boolean values that are stored in the Boolean RF. These values are defined by compare operations. The following C-code fragment:

if (a > b)
    c = a;

label:

1 The current compiler requires that there is at least a (single) connection between the RF and the FU registers.


translates with guarded execution into:

r1 → gt.o; r2 → gt.t;        /* b1 = a > b */
gt.r → b1;
!b1: label → jump;           /* if ( b1 == false ) goto label */
r1 → r3;                     /* c = a */

label: ...

where the variables a, b and c are respectively mapped onto the registers r1, r2 and r3. The operation gt performs the greater-than operation and generates a Boolean value. Boolean register b1 holds the Boolean value and guards the jump operation. The notation !b1: indicates that the operation following this guard expression is only executed when b1 evaluates to false.

Conditional execution can also be used to prevent the insertion of jumps into the code. Consider the following example:

if (a > b)
    c = a;

else
    c = b;

With conditional execution this code fragment translates into:

r1 → gt.o; r2 → gt.t;             /* b1 = a > b */
gt.r → b1;
b1: r1 → r3; !b1: r2 → r3;        /* if ( b1 == true ) c = a else c = b */

where the variables a, b and c are respectively mapped onto the registers r1, r2 and r3. The notation b1: indicates that the operation following this guard expression is only executed when b1 evaluates to true.

2.2.8 Software Bypassing

Software bypassing is one of the advantages of TTAs over more traditional architectures. Using software bypassing the compiler can eliminate the need for some RF accesses, see for example the following code fragment:

r1 → add.o; r2 → add.t;    /* r3 = r1 + r2 */
add.r → r3;
r3 → sub.o; r4 → sub.t;    /* r5 = r3 - r4 */
sub.r → r5;

Software bypassing allows the two flow dependent moves (add.r → r3 and r3 → sub.o) to be scheduled in the same instruction, provided that the result of the addition is directly written to the operand of the subtraction. This optimization shortens the execution time and saves one RF read, which reduces the RF-port requirements. The scheduled version when applying software bypassing is:


r1 → add.o; r2 → add.t;
add.r → r3; add.r → sub.o; r4 → sub.t;
sub.r → r5;

When the value of r3 is not needed anymore, the defining move add.r → r3 can be removed by dead-result move elimination: the result of the addition is directly bypassed to the subtraction. This results in the following code:

r1 → add.o; r2 → add.t;
add.r → sub.o; r4 → sub.t;
sub.r → r5;

Dead-result elimination not only reduces the register requirements, but also saves a data transport and an RF write access. The freed resources can be used by other moves, which may lead to higher performance.

2.2.9 Operand Sharing

Since moves are handled individually by the compiler, it is also possible to share an operand move between multiple operations. This is illustrated in the following example, which shows two additions with a common operand (e.g., r1).

r1 → add1.o; r2 → add1.t;    /* r3 = r1 + r2 */
add1.r → r3;
r1 → add2.o; r4 → add2.t;    /* r5 = r1 + r4 */
add2.r → r5;

When both additions are executed on the same FU, the second operand move r1 → add2.o can be eliminated because the value of r1 is already present in the operand register of the FU. This results in the following code:

r1 → add1.o; r2 → add1.t;
add1.r → r3; r4 → add2.t;
add2.r → r5;

The optimized version saves a move and an RF access. This optimization can only be applied when the following requirements are met: (1) the value of the common operand is the same, (2) the operations execute on the same FU, (3) the common operand is provided to the FU via the same operand register, and (4) the operand register is not changed by other intervening operations.


3 Compiler Overview

A compiler is used to translate a program into executable machine code.

Especially for VLIW and TTA based processors, the compiler plays an important role, because it assigns the resources to the operations and the operations to the instructions. In the remainder of this dissertation, new compiler technologies are described to increase the amount of exploitable instruction-level parallelism (ILP). A thorough knowledge of the internals of the TTA compiler is essential in order to understand the work presented in this dissertation.

This chapter describes the compiler infrastructure to generate parallel TTA code. The developed infrastructure is shown in Figure 3.1. The compiler accepts applications written in a High Level Language (HLL) like C, C++ or Fortran. It translates these applications into an intermediate representation, the sequential TTA code. The sequential TTA code is simulated with, as input, a data set that is representative of future runs of the application. The simulation is used for: (1) verification of the produced code, (2) obtaining application statistics and (3) obtaining profiling information.

The research presented in this thesis focuses on the TTA compiler back-end. It reads the sequential TTA code, the architectural description and the profiling information. It translates the sequential TTA code to parallel (i.e. scheduled) TTA code for the TTA processor specified in the architectural description. When available, profiling information is used to optimize the scheduling process. The parallel code simulator is used to verify the generated code and to evaluate the results.

In the remainder of this chapter, the front-end and the back-end of the TTA compiler are discussed in more detail. Section 3.1 discusses front-end issues relevant for this thesis. The back-end infrastructure is the focus of Section 3.2.


Figure 3.1: Information flow in the TTA compiler. The compiler front-end translates the application (C/C++/Fortran) into sequential TTA code, the sequential code simulator produces profiling information, and the compiler back-end combines the sequential TTA code, the architecture description and the profiling information into parallel TTA code, which is checked with the parallel code simulator.

It describes code restructuring and analysis techniques, necessary to comprehend the methods introduced in this thesis. The two most important phases of the compiler back-end are register assignment and instruction scheduling. In Section 3.3, the basic principles of the popular graph coloring register allocator are given. Finally, Section 3.4 describes instruction scheduling techniques that are used throughout this thesis.

3.1 Front-end

The job of the compiler front-end is to translate the application code written in a HLL into sequential TTA code. In order to be assured of good code quality, good HLL compatibility and a well debugged compiler, the freely available GNU C compiler [Sta94] of the Free Software Foundation is used as the front-end of the TTA compiler. The ported compiler transforms programs coded in the programming languages ANSI C, C++ or Fortran 77 to sequential TTA code. Other compilers can also be used as a front-end. More recently, the SUIF C [WFW+95] compiler has been ported to produce sequential TTA code [CC97]. The SUIF C compiler gives more control over optimizations and provides the ability to combine the benefits of exploiting both coarse- and fine-grained parallelism.

The complete operation repertoire of a generic TTA processor is listed in Table 3.1. Note that no mnemonic is needed for a register copy. A specific TTA processor does not have to support all operations; this is indicated in the table. When an operation is not supported by the TTA, the front-end replaces it by other supported operations, or it generates a call to a library function.


Table 3.1: Operation set of the TTA compiler.

Operation type               Mnemonic                       Optional
Integer add and subtract     add, sub                       No
Integer multiply and divide  mul, div, divu, mod, modu      Yes
Word load/store              ld, st                         No
Sub-word load/store          ldb, ldh, stb, sth             Yes
Integer compare              eq, gt, gtu                    No
Shift                        shl, shr, shru                 No
Logical                      and, ior, xor                  No
Sign-extend                  sxbh, sxbw, sxhw               Yes
Sub-word insert/extract      insb, insh, extb, exth         Yes
Floating-point arithmetic    addf, subf, negf, mulf, divf   Yes
Floating-point load/store    ldd, lds, std, sts             Yes
Floating-point compare       eqf, gtf                       Yes
Type conversions             f2i, f2u, i2f, u2f             Yes
Register copy                (none)                         No

To hold temporary values, variables are used. The front-end uses 128 32-bit integer variables to hold the integer values. When a floating-point RF is available in the target TTA, 128 64-bit floating-point variables are used for storing the floating-point values. When the target TTA does not contain a floating-point RF, the compiler front-end will use integer variables for holding floating-point values. The large number of variable names prevents the front-end from mapping variables onto main memory. The actual register assignment for the target TTA processor is done by the back-end. The front-end uses a single Boolean variable to support conditional execution of jumps. Only jumps are guarded in the sequential TTA code. All moves contain at least one variable or one immediate. The front-end does not apply TTA specific optimizations such as software bypassing.

To support procedure calls, a call register is used. Writing to the call register is similar to writing to the jump register (i.e. the program counter), with the difference that the address of the next instruction is placed in the return address register. The value of the return address register is used to resume execution when the called procedure finishes execution. The front-end does not insert state preserving code around procedure calls, because register assignment has not been performed yet. For interfacing with the operating system a trap register is used. The value that is written to it indicates the requested (4.3BSD) system call. Passing information between procedures is accomplished with the use of special variables. These variables are listed below:

1. Variables v1 and v2 are used for the stack and frame pointer respectively. These two variables have the aliases sp and fp respectively.


2. Results are returned via v0 (for integers) and vf0 (for floating-point numbers). We will use rv and fv as aliases for v0 and vf0 respectively.

3. The integer arguments are passed via variables v3..v6 and the floating-point arguments are passed via vf1..vf4. The remaining parameters are passed via the stack.

3.2 Back-end Infrastructure

The back-end exploits the ILP of an application and maps it onto the TTA processor that is specified in the architecture description. Figure 3.2 shows the TTA compiler back-end. Transforming the sequential TTA code to correct parallel TTA code requires that the semantics of the application are preserved and the architectural description is respected. In this section, the infrastructure of the back-end is presented. The back-end infrastructure provides the following functionality: (1) I/O components that perform conversions between internal data structures and external files (architecture description reader, sequential TTA code reader, profiling information reader, parallel TTA code writer), (2) analysis functions (control flow analysis, data flow analysis, data dependence analysis) and (3) code transformation functions (function inlining, loop unrolling, grafting and controlled node splitting).

3.2.1 Reading and Writing

The sequential TTA code of the program generated by the front-end is read by the back-end and transformed into an internal representation. The following definitions describe the elements of this representation:

Definition 3.1 A program P describes the behavior of an application. It consists of a set of procedures.

Definition 3.2 A procedure P is a code abstraction element of a program. Each procedure implements a specific task. Each procedure consists of a set of basic blocks.

Definition 3.3 A basic block b is a sequence of consecutive instructions in which the flow of control always enters at the beginning and always leaves at the end. A basic block consists of a set of operations.

Definition 3.4 An operation o describes the computation to be performed on an FU. An operation consists of a set of moves.

Definition 3.5 A move m describes data transports between hardware components.

The internal representation of a program P is annotated with profiling information, which consists of the execution counts of the basic blocks and control flow edges. When profiling information has been generated by the sequential simulator, it is read from a file. Otherwise, it is generated based upon the loop nesting in the procedures.

Figure 3.2: Structure of the TTA compiler back-end. The internal data structures are surrounded by the architecture description reader, the sequential TTA code reader, the profiling information reader, the analysis phases (control flow, data flow and data dependence analysis), the code transformations (function inlining, loop unrolling, grafting and controlled node splitting), register assignment, instruction scheduling, and the parallel TTA code writer.

The machine description file contains the description of the target TTA processor. This information is also stored into an internal representation. The back-end generates parallel code using this internal representation. The generated parallel TTA code is written to a file.

3.2.2 Control Flow Analysis

Conditional branches determine the order in which basic blocks are executed. A Control Flow Graph (CFG) makes these execution paths visible to the compiler.

Definition 3.6 The control flow graph CFG of a procedure P is a triple (B, CE, s) where (B, CE) is a finite directed graph, with B the set of basic blocks and CE the set of control flow edges. From the initial basic block s ∈ B there is a path to every basic block of the graph.

Control flow analysis (CFA) uses the CFG to compute the relations between basic blocks; successive compiler phases can use this information for optimizing the program. CFA computes the dominator and post-dominator relations between basic blocks, identifies loops, and determines the nesting of the loops [ASU85].

Dominator information is used to identify loops and to determine whether code motion between basic blocks requires code duplication.

Definition 3.7 A basic block bi dominates basic block bj if every path from the initial node of the CFG to bj goes through bi. More formally: bi doms bj.

Note that a basic block dominates itself. Dominator information is computed by solving control flow equations [ASU85].
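As an illustration only, the iterative solution of these control flow equations can be sketched as follows in Python; the CFG representation (sets of predecessor blocks) and the function name compute_dominators are assumptions made for this example and do not come from the TTA compiler.

def compute_dominators(blocks, preds, entry):
    # dom[b] = set of basic blocks that dominate b.
    # Start from the most pessimistic assumption (every block dominates b)
    # and iterate until a fixed point is reached.
    dom = {b: set(blocks) for b in blocks}
    dom[entry] = {entry}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            if b == entry:
                continue
            # b is dominated by itself and by every block that dominates
            # all predecessors of b.
            if preds[b]:
                new = {b} | set.intersection(*(dom[p] for p in preds[b]))
            else:
                new = {b}
            if new != dom[b]:
                dom[b] = new
                changed = True
    return dom

For any CFG this computes, for instance, that the initial basic block dominates every basic block reachable from it.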

Several compiler optimizations require as input a reducible CFG. Many definitions for reducible CFGs have been proposed. The one adopted here is given in [ASU85] and is based on the partitioning of the control flow edges into two disjoint sets:

1. The set of back edges BE consists of all control flow edges whose heads dominate their tails.

2. The set of forward edges FE consists of all control flow edges that are not back edges, thus FE = CE − BE.

The definition of a reducible flow graph is as follows:

Definition 3.8 A CFG = (B, CE, s) is reducible if and only if its subgraph CFG′ = (B, FE, s) is acyclic and every basic block b ∈ B can be reached from the initial basic block s.

The CFG of Figure 3.3a is reducible since CFG′ = (B, FE, s) is acyclic, see Figure 3.3b. The CFG of Figure 3.4a is irreducible. The set of back edges is empty, because neither basic block a nor basic block b dominates the other. FE is equal to {(s, a), (s, b), (a, b), (b, a)}, and CFG′ = (B, FE, s) is not acyclic.

Many compiler optimizations such as data flow analysis, loop transformations, the exploitation of ILP, and memory disambiguation are simpler, more efficient, or only applicable when the control flow graph of the program is reducible. To overcome this limitation, irreducible CFGs are transformed to reducible CFGs. In the past, some methods were given to solve this problem [CM69, Hec77, ASU85]. Most methods for converting an irreducible CFG are based on a technique called node splitting. The principle of node splitting is illustrated in Figure 3.4; basic block a of the CFG is split. The CFG is converted into a reducible CFG at the cost of code duplication.

Unfortunately, existing methods have the problem that the resulting code size, after converting an irreducible CFG, can grow uncontrollably.


Figure 3.3: a) A reducible control flow graph CFG = (B, CE, s); b) the corresponding acyclic graph CFG′ = (B, FE, s).

In [JC96] we reported an average increase in code size of 236% for procedures with an irreducible CFG. In the same article, we described a new method for transforming irreducible CFGs to reducible CFGs, called Controlled Node Splitting (CNS). This method minimizes the number of copies by duplicating only basic blocks with specific properties, and resulted in an average increase in code size of only 3%. As we have proven in [JC97a], this method is optimal in the sense that it only needs a minimal number of copies to make a CFG reducible. After applying CNS the resulting CFG contains only natural loops. Natural loops have one header and may have multiple back edges and exit edges.
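The reducibility test of Definition 3.8 can be made concrete with a small sketch: partition the edges using dominator information and check that the forward-edge graph is acyclic. The Python fragment below reuses the hypothetical compute_dominators sketch from Section 3.2.2 and only illustrates this test, not controlled node splitting itself; it assumes every basic block is reachable from the entry block s.

def is_reducible(blocks, succs, preds, entry):
    dom = compute_dominators(blocks, preds, entry)
    # An edge (b, t) is a back edge iff its head t dominates its tail b;
    # the remaining edges form the forward edges FE = CE - BE.
    fwd = {b: [t for t in succs[b] if t not in dom[b]] for b in blocks}
    # CFG' = (B, FE, s) must be acyclic: depth-first search for a cycle.
    WHITE, GREY, BLACK = 0, 1, 2
    color = {b: WHITE for b in blocks}

    def has_cycle(b):
        color[b] = GREY
        for nxt in fwd[b]:
            if color[nxt] == GREY:
                return True
            if color[nxt] == WHITE and has_cycle(nxt):
                return True
        color[b] = BLACK
        return False

    return not any(has_cycle(b) for b in blocks if color[b] == WHITE)

Applied to the CFG of Figure 3.4a, with edges (s,a), (s,b), (a,b) and (b,a), no edge is a back edge and the forward-edge graph contains the cycle between a and b, so the CFG is reported as irreducible.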

Figure 3.4: a) An irreducible CFG; b) the reducible CFG after applying node splitting to basic block a (the copy is labeled a′).


3.2.3 Data Flow Analysis

Data Flow Analysis or DFA can be described as the process of ascertaining and collecting information on how a program manipulates data. There are several levels for this analysis. The ones used in the TTA compiler back-end are the analysis at the basic block level and at the procedure level. First, some terms are defined to describe the problem more formally.

Definition 3.9 A variable v is a place holder for temporary values.

Definition 3.10 A variable v is defined at a point in a program when a value is assigned to it.

Definition 3.11 A variable v is used at a point in a program when its value is referenced in an expression.

Definition 3.12 A variable v is said to be live at a point Q in a program if it has been defined earlier and will be used later. The set live(Q) consists of all variables that are live at point Q.

Definition 3.13 The live range lr(v) of a variable v is the execution range between its definitions and uses.

Definition-use chains (du-chains) are added between the definitions of a variable and its reachable uses. These du-chains are used for memory disambiguation and register assignment.

Definition 3.14 The du-chain(v) of a variable v is a directed graph (Ndu, Edu) that connects the moves that define variable v to the moves that use v. The nodes and edges of a du-chain(v) are defined as:

Ndu = NDef(v) ∪ NUse(v)
Edu = {(ndef, nuse) | ndef ≺ nuse, ndef ∈ NDef(v) ∧ nuse ∈ NUse(v)}

where NDef(v) is the set of moves that define variable v and NUse(v) is the set of moves that use variable v. The notation ndef ≺ nuse means that there exists an execution order from ndef to nuse that does not redefine v.

The du-chains are constructed by applying a standard iterative data flow algorithm [ASU85]. Figure 3.5a shows the du-chains of the variables v1 and v2. In this figure def vi denotes a definition of a variable vi and use vi denotes a use of a variable vi.

Renaming is a transformation that may increase the scheduling freedom. The naming of the variables as shown in Figure 3.5a prevents, for example, the instruction scheduler from reordering the definition and the use of variable v1 in basic block B. Reordering these operations would result in incorrect program execution. This ordering constraint, or dependence, is caused by the re-use of variable v1. Renaming [CF87, Muc97] removes this type of dependence by splitting up the du-chain. The newly created du-chains are assigned to new variables. In Figure 3.5b, the second live-range of variable v1 is renamed to variable v3. This removes the false dependences in the basic blocks B and C.

Figure 3.5: Data flow analysis. a) Du-chains of v1 and v2. b) Renaming. c) The live sets of figure b:

Basic block   liveDef     liveUse     liveIn      liveOut
A             {v1,v2}     ∅           ∅           {v1,v2}
B             {v3}        {v1}        {v1,v2}     {v2,v3}
C             {v3}        {v1}        {v1,v2}     {v2,v3}
D             ∅           {v2,v3}     {v2,v3}     ∅

Live-Variable Analysis computes whether a variable is live in a particular basic block. For each basic block the sets liveUse(b) and liveDef(b) are computed. The set liveUse(b) is defined as the set of variables that are used in basic block b before they are defined in basic block b. The set liveDef(b) is defined as the set of variables that are defined in basic block b before they are used in this basic block. The set of variables that are live on entry and exit of a basic block can now be computed with:

liveIn(b) = (liveOut(b) − liveDef(b)) ∪ liveUse(b)    (3.1)

and

liveOut(b) = ⋃_{b′ ∈ Succ(b)} liveIn(b′)    (3.2)

where Succ(b) is the set of all successor basic blocks of basic block b in the CFG. More formally:

Succ(b) = {b′ ∈ B | (b, b′) ∈ CE}    (3.3)


Algorithms to solve these equations can be found in [ASU85, Muc97]. In Figure 3.5c the result of applying the live-variable analyses to the program of Figure 3.5b is given. Live information is used by the scheduler to test for off-liveness during speculative execution, and by the register allocator to determine the live ranges of the variables.
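As a concrete illustration of equations (3.1)-(3.3), the sketch below solves them by simple round-robin iteration. It is not the back-end's implementation; the data layout (per-block liveUse/liveDef sets and a successor map) is assumed for this example, using the diamond-shaped CFG and the sets of Figure 3.5.

def live_variable_analysis(blocks, succ, live_use, live_def):
    # Iterate equations (3.1) and (3.2) until a fixed point is reached.
    live_in = {b: set() for b in blocks}
    live_out = {b: set() for b in blocks}
    changed = True
    while changed:
        changed = False
        for b in blocks:
            # (3.2): liveOut(b) is the union of liveIn(b') over all successors b'.
            out = set().union(*(live_in[s] for s in succ[b]))
            # (3.1): liveIn(b) = (liveOut(b) - liveDef(b)) | liveUse(b)
            inn = (out - live_def[b]) | live_use[b]
            if out != live_out[b] or inn != live_in[b]:
                live_out[b], live_in[b] = out, inn
                changed = True
    return live_in, live_out

# The example of Figure 3.5b (A branches to B and C, which both join in D):
blocks = ['A', 'B', 'C', 'D']
succ = {'A': ['B', 'C'], 'B': ['D'], 'C': ['D'], 'D': []}
live_def = {'A': {'v1', 'v2'}, 'B': {'v3'}, 'C': {'v3'}, 'D': set()}
live_use = {'A': set(), 'B': {'v1'}, 'C': {'v1'}, 'D': {'v2', 'v3'}}
live_in, live_out = live_variable_analysis(blocks, succ, live_use, live_def)
# live_in['B'] == {'v1', 'v2'} and live_out['B'] == {'v2', 'v3'},
# which matches the live sets listed in Figure 3.5c.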

3.2.4 Data Dependence Analysis

Data Dependence Analysis (DDA) is a vital tool in instruction scheduling: it determines the ordering (also known as data dependence) relations between operations that must be satisfied for the code to execute correctly. The set of relations is represented by a directed graph, called the data dependence graph (DDG). For each procedure a DDG is built. The scheduler uses the DDG to exploit the available ILP without violating the data dependences.

Definition 3.15 A DDG = (NDDG, EDDG) is a finite directed graph, with NDDG the collection of operations and EDDG the collection of data dependence edges. An edge leading from node ni to node nj indicates that ni must be executed before nj.

Data dependences between operations indicate accesses to the same location. The set of data dependences EDDG can be divided into three subsets:

• Memory data dependences are caused by accesses to the same memory location. The set of memory edges, EMem, is the set of edges that represent memory data dependences.

• Register data dependences are caused by accesses to the same register. The set of register edges, EReg, is the set of edges that represent register data dependences. These dependences arise when the variables are mapped onto the registers of the target machine.

• Variable data dependences are caused by accesses to the same variable. The set of variable edges, EVar, is the set of edges that represent variable data dependences. These dependences arise when the variables are not yet mapped onto the registers.

A data dependence between two operations o1 and o2 arises from the flow of data between both operations. Assuming that o1 occurs before o2, the types of data dependences that constrain the execution order are:

• Flow dependence: the value of a memory location, register or variable defined by o1 may be used by o2. This dependence is denoted as o1 δf o2.

• Anti dependence: the value of a memory location, register or variable used by o1 may be redefined by o2. This is denoted as o1 δa o2.

• Output dependence: the value of a memory location, register or variable defined by o1 may be redefined by o2. This is denoted as o1 δo o2.


for (i = 0; i < 100; i++, j++)
    x[i] = a * x[i] + b / y[j];

o1  ld  v3, 0(v1)
o2  mul v4, v7, v3
o3  ld  v3, 0(v2)
o4  div v5, v8, v3
o5  add v6, v4, v5
o6  st  v6, 0(v1)

a) Source code.    b) Code of loop body.

c) The DDG of the loop body (nodes o1 .. o6).    d) The dependence relations:

Flow dependences:    o1 δf o2,  o3 δf o4,  o2 δf o5,  o4 δf o5,  o5 δf o6
Anti dependences:    o2 δa o3,  o1 δa o6
Output dependences:  o1 δo o3

Figure 3.6: Example of dependence relations.

For a correct execution of the application, all three data dependence types must be respected during instruction scheduling. Unfortunately, data dependences may hinder the exploitation of ILP. Avoiding or eliminating data dependences may increase the exploitable ILP. Flow dependences cannot be eliminated and are also known as true dependences. Anti and output dependences are known as false dependences. They are caused by name conflicts and can be eliminated by renaming [HP90].
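To make the three dependence types concrete, the following Python sketch derives the variable data dependences of a straight-line operation sequence from hypothetical per-operation def/use sets. It covers only EVar; memory dependences such as o1 δa o6 require the memory reference disambiguation discussed at the end of this section.

def variable_dependences(ops):
    # ops: list of (name, defs, uses) tuples in program order.
    flow, anti, output = [], [], []
    last_def = {}        # variable -> operation holding its most recent definition
    uses_since = {}      # variable -> operations that used it since that definition
    for name, defs, uses in ops:
        for v in uses:
            if v in last_def:
                flow.append((last_def[v], name))      # flow: definition reaches this use
            uses_since.setdefault(v, []).append(name)
        for v in defs:
            for user in uses_since.get(v, []):
                if user != name:
                    anti.append((user, name))         # anti: use before redefinition
            if v in last_def:
                output.append((last_def[v], name))    # output: definition before redefinition
            last_def[v] = name
            uses_since[v] = []
    return flow, anti, output

# The loop body of Figure 3.6b, with only the variable accesses modeled:
ops = [('o1', {'v3'}, {'v1'}),
       ('o2', {'v4'}, {'v7', 'v3'}),
       ('o3', {'v3'}, {'v2'}),
       ('o4', {'v5'}, {'v8', 'v3'}),
       ('o5', {'v6'}, {'v4', 'v5'}),
       ('o6', set(), {'v6', 'v1'})]
flow, anti, output = variable_dependences(ops)
# flow:   (o1,o2), (o3,o4), (o2,o5), (o4,o5), (o5,o6)
# anti:   (o2,o3)
# output: (o1,o3)
# i.e. exactly the set EVar listed below.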

Figure 3.6b shows the RISC style code of the loop of Figure 3.6a. The corresponding DDG is given in Figure 3.6c and the data dependence relations with their types are listed in Figure 3.6d. The data dependences can be categorized in the following sets:

EVar = {o1 δf o2, o3 δf o4, o2 δf o5, o4 δf o5, o5 δf o6, o2 δa o3, o1 δo o3}
EMem = {o1 δa o6}
EReg = ∅

Note that the set EReg is empty because the registers are not yet assigned to the variables.


In the TTA compiler back-end, the nodes of the DDG represent moves instead of operations; edges represent data dependences between moves. The moves that make up an operation must be executed in a specific order. The operand move must be executed before or at the same time as the trigger move. The result move must be executed after the trigger move. To guarantee that the execution order of the moves is not violated, extra TTA specific dependence edges, so-called intra operation edges, are added to the DDG.

• Trigger-result dependence: guarantees that the trigger move of an operation is always scheduled in an earlier instruction than the result move of the same operation. This relation is denoted with δtr.

• Operand-trigger dependence: guarantees that the operand move of an operation is never scheduled in a later instruction than the trigger move. This relation is denoted with δot.

A delay is associated with each dependence edge. This delay is added to the data dependence relation in the form of o1 δtype_delay o2, where type represents the type of the data dependence. The delay indicates that operation o2 should be scheduled at least delay instructions after the instruction of o1. For TTAs the delay of a flow dependence caused by a register or variable is always zero (δf_0), because values can be defined and used in the same instruction by using software bypassing. This does not apply to flow dependences caused by accesses to the same memory location. Their delay is set to one cycle (δf_1) in order to ensure that the correct value is read from memory. In the TTA processor model, it is assumed that when a read and a write to the same location occur in the same instruction, the read will get the previous value. Consequently, anti dependences have a delay of zero (δa_0). The delay of an output dependence is one instruction (δo_1), because multiple writes to the same register or memory location in the same instruction are undefined. The delay associated with the operand-trigger dependence is at least zero. In the remainder of this thesis, it is assumed that this delay is always equal to zero (δot_0). The delay between the trigger and the result is equal to the latency of the FU that will execute the operation (δtr_delay FU).
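The delay rules above can be collected in one small table. The Python fragment below merely restates them for reference; the dictionary keys and the fu_latency parameter are names chosen for this sketch, not identifiers from the compiler.

# Delay (in instructions) attached to a DDG edge, per dependence type:
def edge_delay(dep_type, fu_latency=None):
    delays = {
        'flow_register':   0,  # bypassing allows define and use in the same instruction
        'flow_memory':     1,  # the memory value must be written one instruction earlier
        'anti':            0,  # a read in the same instruction sees the previous value
        'output':          1,  # two writes in the same instruction are undefined
        'operand_trigger': 0,  # operand move no later than the trigger move
    }
    if dep_type == 'trigger_result':
        return fu_latency      # latency of the FU that executes the operation
    return delays[dep_type]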

The dependence edges of the DDG impose a partial order in which the moves must be executed. A path in the DDG is called a dependence path. The longest dependence path is called the critical path. The dependences restrict the reordering of the moves. As already mentioned, reducing the number of dependence edges results in a larger scheduling freedom and may result in faster executing code. Methods for reducing the number of data dependences are register renaming, as discussed in Section 3.2.3, and memory reference disambiguation.

Memory reference disambiguation is used for proving the independence between two memory references. It attempts to answer the question: given two memory references m1 and m2, could they possibly refer to the same memory location? The simplest memory reference disambiguator always answers this question with yes. This preserves program semantics, but limits the exploitation of ILP. A more accurate memory reference disambiguator improves the ability for ILP exploitation by removing spurious data dependences. When the memory disambiguator cannot guarantee that two memory references never refer to the same location, a dependence is assumed.

3.2.5 Loop Unrolling, Function Inlining and Grafting

A method to reduce the overhead of executing loops and to improve the effectiveness of other optimizations, such as the exploitation of ILP, is loop unrolling. Loop unrolling replaces the body of a loop by several copies of the loop body and adjusts the loop-control code accordingly. This enlarges the loop body and decreases the number of iterations of the loop. A larger loop body allows a more aggressive exploitation of ILP, which benefits performance. To take full advantage of loop unrolling in the context of exploiting ILP, variable renaming is applied to the replicated loop bodies. This increases the register pressure.

Function inlining replaces calls to procedures with copies of their bodies. This transformation removes barriers between procedures and increases the scope of optimizations, such as common subexpression elimination and constant propagation. Because the optimization scope is enlarged, more ILP can be exploited. This also has consequences for register assignment: more variables are live simultaneously and thus the register pressure increases. Another advantage is the reduction of overhead around procedure calls. A detailed analysis concerning procedure inlining can be found in [HC89]. They claim that 59% of the procedure calls can be eliminated with a modest code expansion (17%).

Basic blocks are often too small to contain sufficient ILP. Grafting is a technique that increases the size of basic blocks by duplicating join-point basic blocks1. In Figure 3.7 the join-point basic block D is duplicated. After duplication, the basic blocks B and D, and C and D′, can be merged into larger basic blocks.

1 Grafting is closely related to tail duplication. Grafting duplicates only join-point basic blocks, while tail duplication copies the complete tree below the join-point basic block.

Figure 3.7: Grafting and merging of basic blocks: the join-point basic block D is duplicated into D and D′, after which B and D, and C and D′, are merged into the larger basic blocks B+D and C+D′.

3.3 Register Assignment

Register assignment is the problem of finding a mapping of program variables to the registers of the target machine, while preserving the semantics of the program. A valid solution is to place each variable in a different register. However, usually the number of variables exceeds the number of registers. Fortunately, not all variables are live simultaneously. This means that we can map multiple variables to a single register, as long as the corresponding live ranges do not overlap. A proper register assignment is a mapping such that no register is assigned to any two variables that are live simultaneously. When no proper register assignment can be found, some variables must be mapped onto memory locations.

In the literature, the term register allocation is often used when register assignment is meant. Register allocation deals with the determination of the type and number of resources, while register assignment is the assignment to register instances [MLD92]. Register assignment can be applied to expressions, basic blocks (local register assignment), procedures (global register assignment), or collections of procedures (inter-procedural register assignment). Aggressive techniques to exploit ILP operate on whole (or even beyond) procedures. Therefore, we believe that at least global register assignment is required to support the exploitation of ILP. In the remainder of this thesis, register assignment means global register assignment, unless stated otherwise.

3.3.1 Graph Coloring

Graph coloring [BCKT89, Bri92, BCT94, CK91, CAC+81, Cha82, CH90, GL95, GSS89, Mue92] is the most popular method to assign registers. Graph coloring as originally proposed by Chaitin [CAC+81, Cha82] performs register allocation and assignment at the same time. An interference graph is constructed to find an efficient mapping. The interference graph consists of a node for each variable and an edge between two nodes if and only if the two variables associated with the nodes are live simultaneously for the given operation order. When variables are live simultaneously, we say that these variables interfere. Two variables interfere if one of them is live at a definition point of the other [CAC+81].

Definition 3.16 An interference graph IG = (Nvar, Einterf) is a finite undirected graph, with Nvar the set of variables and Einterf the set of interference edges. Einterf is defined as Einterf = {(vi, vj) | vi, vj ∈ Nvar ∧ vi ≠ vj ∧ vj ∈ interf(vi)} where interf(vi) = {v ∈ live(Q) | Q ∈ NDef(vi)}.

The problem of register assignment can be described as the problem of finding a proper node coloring for the interference graph with k colors, where k is the number of available machine registers. In a proper node coloring, no adjacent nodes have the same color (register). A graph is said to be k-colorable when a proper coloring can be found. In Figure 3.8a a program and its corresponding live ranges are given. Figure 3.8b shows the interference graph. Variables v2 and v4 can be mapped onto the same register, but v2 and v1 cannot. A proper register assignment for the program segment of Figure 3.8a requires at least three registers.

Figure 3.8: Graph coloring example. a) Program and live ranges of the variables v1..v4. b) The corresponding interference graph.
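Definition 3.16 translates almost directly into code. The sketch below builds the interference edges from a list of definition points paired with the set of variables live at each point; this representation, and the live sets used in the example, are simplifications assumed for illustration only, not the allocator's data structures.

def build_interference_graph(definition_points):
    # definition_points: list of (v, live_at_Q) pairs, one per definition point Q of v,
    # where live_at_Q is the set live(Q).
    # Two variables interfere if one is live at a definition point of the other.
    edges = set()
    for v, live_at_q in definition_points:
        for w in live_at_q:
            if w != v:
                edges.add(frozenset((v, w)))
    return edges

# A possible reading of the program of Figure 3.8a (live sets assumed):
defs = [('v1', set()),
        ('v2', {'v1'}),
        ('v3', {'v1', 'v2'}),
        ('v4', {'v1', 'v3'})]
ig = build_interference_graph(defs)
# frozenset({'v2', 'v4'}) is not in ig, so v2 and v4 may share a register,
# while v1 interferes with v2, v3 and v4; three registers suffice in total.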

The first global register allocator, based on graph coloring, was implemented at IBM Yorktown [CAC+81] by Chaitin et al. The problem of finding a proper node coloring is known to be NP-complete [Kar72]. Therefore, this register allocator, also denoted as the Yorktown Allocator, uses a fast and simple heuristic. It relies on the following graph theoretic property:

Theorem 3.17 Given a graph G = (N, E) and a node n ∈ N such that degree(n) < k, then G is k-colorable if and only if G − n is k-colorable. The degree of a node is defined as the number of its neighbors.

This theorem states that there will always be at least one color left for n, no matter how the reduced graph G − n is colored. When a graph is not k-colorable, some program variables must be placed in a non-register resource. This resource is usually an off-chip read/write memory. This creates a definite speed penalty, so variables must be chosen carefully for these locations.

The Yorktown Allocator uses Theorem 3.17 to simplify the interference graph by repeatedly removing nodes with a degree less than k until the graph is empty, or until only nodes with a degree greater than or equal to k are left. If the reduced graph is empty, then finding a k-coloring of the interference graph is reduced to finding a k-coloring of the empty graph. The nodes are then re-inserted into the graph in the reverse order in which they were removed. Each node is given a color distinct from its neighbors. Since each node had a degree less than k when removed, each node is guaranteed to be colorable when it is re-inserted. If the reduced graph was not empty, i.e., all nodes of the reduced graph have a degree larger than or equal to k, then the graph cannot be colored with k colors and hence the number of registers is insufficient to hold all variables. In the Yorktown Allocator, this situation is solved by marking an uncolorable node for spilling and removing this node. The process of simplifying and marking nodes for spilling continues until the graph is empty. Afterwards the variables of the marked uncolorable nodes are spilled; a store to main memory is inserted after each definition of the variable and a load from main memory is inserted before each use. Because the inserted spill code changes the register requirements, the entire process of building, simplifying and spilling is repeated until no further spilling is needed. The resulting interference graph is guaranteed to be k-colorable. At this time registers are assigned to variables.

Algorithm 3.1 OPTIMISTICREGISTERALLOCATOR(P)
    spilling = TRUE
    WHILE spilling DO
        IG = BUILD(P)
        SPILLCOSTS(P)
        SIMPLIFY(IG)
        SELECT(IG)
        spilling = INSERTSPILLCODE(P, IG)
    ENDWHILE
    ASSIGNREGISTERS(P, IG)
    GENERATESTATEPRESERVINGCODE(P)

Briggs et al. [BCKT89] developed an improvement to the Yorktown Allocator; this allocator is called the Optimistic Allocator. It also removes uncolorable nodes from the graph, without marking them for spilling, as if their degree were less than k. It optimistically hopes a color will be available during the coloring phase. For nodes with degree greater than or equal to k it is possible to find a color when two or more non-interfering neighbors have received the same color. Since more nodes can be colored, less spill code is required. When all k colors have been assigned to the neighbors of a node, the node is marked for spilling. Then, just as in the Yorktown Allocator, the variables associated with the marked nodes are placed in memory. In [BCT94], it was reported that the Optimistic Allocator inserts 32% less spill code than the Yorktown Allocator does. Spill code may increase the execution time of a program; therefore it should be avoided when possible.

The register allocator used in the TTA compiler back-end is based on the Optimistic Allocator, see Algorithm 3.1. The phases of this allocator are:

BUILD This phase constructs the interference graph of a procedure P.

SPILLCOSTS In preparation for coloring, a spill cost estimate is computed for every live range. The spill cost per live range reflects the expected increase in execution time when spilling the variable associated with the live range.


Table 3.2: Mapping special variables to registers.

integer:          variable   rv   sp   fp   v3   v4   v5   v6
                  register   r0   r1   r2   r3   r4   r5   r6

floating-point:   variable   fv   fv1  fv2  fv3  fv4
                  register   f0   f1   f2   f3   f4

SIMPLIFY This phase removes nodes with degree < k from the interference graph. Whenever it discovers that all remaining nodes have degree ≥ k, it chooses a spill candidate. This node is also removed from the graph, hoping a color will be available in spite of its high degree.

SELECT Colors are selected for nodes. The nodes are reinserted in the interference graph in the reverse order in which they were removed, and given a color distinct from their neighbors. Whenever it discovers that it has no color available for some node, it leaves the node uncolored and continues with the next node.

INSERTSPILLCODE Spill code is inserted for the live ranges of all uncolored nodes. If no spill code is required, this phase sets the variable spilling to false, otherwise to true. When spilling is required, the register allocator starts again with the first phase in an attempt to color the modified code with k colors.

ASSIGNREGISTERS The registers (colors) are assigned to the variables.

GENERATESTATEPRESERVINGCODE Generates the code required to ensure correct execution around procedure calls.
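For illustration, the SIMPLIFY and SELECT phases can be sketched in a few lines of Python. This is a simplified stand-in for the allocator described above: spill candidates are chosen arbitrarily instead of by the SPILLCOSTS heuristic, and spill code insertion is not shown.

def simplify_and_select(nodes, edges, k):
    # Build adjacency sets of the interference graph.
    adj = {n: set() for n in nodes}
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    # SIMPLIFY: repeatedly remove a node with degree < k; if only nodes with
    # degree >= k remain, optimistically remove a spill candidate anyway.
    work = {n: set(adj[n]) for n in nodes}
    stack = []
    while work:
        n = next((x for x in work if len(work[x]) < k), None)
        if n is None:
            n = next(iter(work))          # spill candidate (no cost heuristic here)
        stack.append(n)
        for m in work[n]:
            work[m].discard(n)
        del work[n]

    # SELECT: reinsert the nodes in reverse order and give each a color that is
    # distinct from its already colored neighbors; if no color is free, leave
    # the node uncolored so that spill code can be inserted for it later.
    colors, uncolored = {}, []
    while stack:
        n = stack.pop()
        used = {colors[m] for m in adj[n] if m in colors}
        free = [c for c in range(k) if c not in used]
        if free:
            colors[n] = free[0]
        else:
            uncolored.append(n)
    return colors, uncolored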

The preceding discussion assumes that each variable can be mapped on an arbitrary register. However, some variables are mapped on specific registers; an example is the stack pointer, see Section 3.1. These variables and the registers they are mapped on are listed in Table 3.2. The variables that must be mapped on a special register are assigned first. The compiler front-end guarantees that these registers can be assigned correctly and thus do not overlap.

An efficient register assignment is a crucial problem in modern microprocessors. The increasing gap between the internal clock and memory latency requires that variables are kept in registers and that spilling is avoided. Furthermore, reads and writes to memory consume more power than a register access. Unfortunately, an optimal coloring of the interference graph does not necessarily correlate with good machine utilization. This will be demonstrated in Chapter 5.


a) Original operation sequence:

v3 → sub.o; #10 → sub.t;
sub.r → v1;

b) Spill code sequence:

v3 → sub.o; #10 → sub.t;
sub.r → v1;
fp → add.o; #offset → add.t;
add.r → v2;
v2 → st.o; v1 → st.t;

Figure 3.9: Spilling.

3.3.2 Spilling

A processor has a limited number of registers. When not enough registers are available to hold all variables, some of them are spilled to memory. Spilling is required when the register pressure, at some point in the program, is higher than the number of available registers. The register pressure at a point in a program is the number of variables that could reside in registers.

A spilled variable is written to a memory location. Each procedure claims memory locations where it can store the spilled values [ASU85]. This is accomplished by using the frame pointer fp. The address of the frame pointer is supplied by the calling procedure. The memory address of a spilled variable is an offset from the frame pointer. The basic TTA template does not support loads and stores with offsets, therefore additions are inserted to compute the memory address2. The operation sequences of spilling and reloading of a variable v1 are given in Figure 3.9 and Figure 3.10.

Figure 3.9a shows the TTA code for a subtraction. The result of the subtraction is stored in variable v1. When insufficient registers are available, the value of variable v1 must be spilled to memory. To accomplish this, two operations (an addition and a store operation) are inserted in the code. This is shown in Figure 3.9b. The addition computes the address of the memory location in which the value of v1 will be stored. It adds an offset to the frame pointer fp. The store operation stores the value of v1 in memory.

A similar situation arises when a value must be reloaded from memory. Assume that the variable v1 in Figure 3.10a is spilled to memory. In order to retrieve its value, reload code (an addition and a load operation) is inserted in the code. This is shown in Figure 3.10b. The addition computes the memory address in the same way as for spilling. The load operation reads the value from memory and writes it to variable v1.

Spilling splits a long live range into multiple shorter live ranges. It is not permitted to spill these newly created live ranges; spilling these variables would result in an infinite loop in the register allocator. Short live ranges in the original code may also lead to superfluous spill code insertion. Spilling these short live ranges does not reduce the register pressure, because the spill and reload code itself also requires registers. Precautions are taken to avoid the spilling of these short live ranges as much as possible.

2 When the offset is zero no additions are needed.


a) Original operation sequence:

v1 → mul.o; #5 → mul.t;
mul.r → v2;

b) Reload code sequence:

fp → add.o; #offset → add.t;
add.r → v3;
v3 → ld.t;
ld.r → v1;
v1 → mul.o; #5 → mul.t;
mul.r → v2;

Figure 3.10: Reloading.

A number of heuristics that attempt to minimize the impact of spill code insertion when using graph coloring [Cha82, BGM+89] have been published. The spill heuristic used in the register allocator of the TTA compiler back-end is based on the work of Chaitin. However, two changes were made. First of all, we do not use the nesting of loops as an estimate for the execution frequency, but instead profiling information is used. This results in more accurate estimates. Secondly, Chaitin's heuristic assumes that an operation executes in a single machine cycle. However, loading data from memory may take considerably more time than storing data to memory. Our cost function considers this difference:

Cspill(v) = ( Σ_{nuse ∈ NUse(v)} (L(add) + L(ld)) · f(nuse)
            + Σ_{ndef ∈ NDef(v)} (L(add) + L(st)) · f(ndef) ) / degree(v)    (3.4)

where v is a variable that is a candidate for spilling, and L(add), L(ld), L(st) are respectively the latencies for an addition, a load and a store operation. The expressions f(nuse) and f(ndef) are the frequencies of respectively the uses and definitions of v, and degree(v) is the degree of the node v. The heuristic selects the variable v with the lowest cost Cspill(v) for spilling.
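Equation (3.4) can be written down directly; the sketch below assumes hypothetical move sets, profiled frequencies and latency values, and only illustrates the cost function used by the SPILLCOSTS phase.

def spill_cost(uses, defs, freq, latency, degree):
    # uses, defs: the moves in NUse(v) and NDef(v) of the candidate v
    # freq[m]:    profiled execution frequency f(m) of move m
    # latency:    latencies L(add), L(ld) and L(st) of the address computation,
    #             the reload and the store
    # degree:     degree(v) of the node in the interference graph
    reload_part = sum((latency['add'] + latency['ld']) * freq[m] for m in uses)
    store_part = sum((latency['add'] + latency['st']) * freq[m] for m in defs)
    return (reload_part + store_part) / degree

# The allocator marks for spilling the candidate with the lowest spill_cost value.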

3.3.3 State Preserving Code

When a program executes a procedure call statement, the state of the calling procedure (caller) may be destroyed by the called procedure (callee). The state consists of the values residing in registers and the value of the program counter. A convention can be adopted that specifies which registers may be overwritten by a called procedure; however, this practice removes resources that otherwise could be well used. An alternative convention saves the needed values in memory before they are overwritten. When control is returned to the calling procedure, the state is restored from memory and execution can resume. Using this alternative, two choices exist: (1) the caller saves the values before the transfer of execution and restores them upon return, or (2) the callee saves them immediately upon entrance and restores them before exit.

Figure 3.11: Heuristic for caller-saved and callee-saved registers. The CFG contains the basic blocks A through D, annotated with their execution frequencies; variable v1 is live across the frequently executed call in basic block B, and variable v2 is live across the call in basic block C.

To generate high quality code, compilers divide the set of machine registers in caller-saved (Rcaller-saved) and callee-saved registers (Rcallee-saved). Caller-saved registers are stored and reloaded by the caller and callee-saved registers are stored and reloaded by the callee. Note that only the registers referenced in the procedure need to be saved.

The task of the register allocator is to decide to which set, the caller- or callee-saved set, the variable must be assigned in order to obtain fast code. Consider the CFG in Figure 3.11. The number between parentheses in each basic block represents its execution frequency. Variable v1 is live in the basic blocks A and B. Since the call in basic block B is executed more frequently than the entry and exit basic blocks (A and D), it is more advantageous to map this variable onto a callee-saved register. Consequently, variable v1 is saved and restored in respectively basic block A and D. For variable v2, it is more advantageous to assign it to a caller-saved register, because basic blocks A and D are more frequently executed than basic block C. The caller-saved store and restore code are inserted respectively before and after the call.

The TTA compiler back-end divides the registers of the target TTA equally into the two sets. To decide to which set a variable will be assigned, caller-saved and callee-saved costs are computed.

Ccaller-saved(v) = Σ_{call ∈ lr(v)} (2 · L(add) + L(ld) + L(st)) · f(call)    (3.5)

Ccallee-saved(v) = (2 · L(add) + L(ld) + L(st)) · f(entry basic block)    (3.6)

where lr(v) is the live range of variable v, f(call) the execution frequency ofthe call and f(entry basic block) the execution frequency of the first basic blockof the procedure. The register allocator tries to map a variable in the set with


The register allocator tries to map a variable to the set with the lowest cost. When no registers of that set are available, a register from the other set is chosen.
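A minimal sketch of this decision, applying Equations 3.5 and 3.6 to the situation of Figure 3.11 with assumed latencies (the helper functions are illustrative, not the allocator's real interface):

    # Choosing between the caller-saved and callee-saved set (Eqs. 3.5 and 3.6).
    L_ADD, L_LD, L_ST = 1, 2, 2                     # assumed latencies
    SAVE_AND_RESTORE = 2 * L_ADD + L_LD + L_ST      # cost of one save/restore pair

    def caller_saved_cost(call_freqs):
        """call_freqs: execution frequencies of the calls inside the live range."""
        return sum(SAVE_AND_RESTORE * f for f in call_freqs)

    def callee_saved_cost(entry_freq):
        """entry_freq: execution frequency of the procedure's entry basic block."""
        return SAVE_AND_RESTORE * entry_freq

    # Variable v1: live across the call in B (frequency 90), entry frequency 10.
    if callee_saved_cost(10) < caller_saved_cost([90]):
        print("assign v1 to a callee-saved register")    # chosen here
    else:
        print("assign v1 to a caller-saved register")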

3.3.4 TTA vs. OTA

TTAs have an advantage over most OTAs (Operation Triggered Architectures) in the context of register assignment. The live ranges of TTAs are finer grained. OTAs require that the register, which holds the result of an operation, is in use at the moment the operation starts executing, see Figure 3.12a [Cor98, page 268]. TTAs, however, do not have this limitation; a register is only in use at the moment it is defined, see Figure 3.12b. This may result in a lower register requirement.

Figure 3.12: Live ranges of TTAs and OTAs. a) Live range of an OTA. b) Live range of a TTA.

3.4 Instruction Scheduling

The aim of instruction scheduling is to reorder operations and pack them into a minimum number of instructions, in order to reduce the execution time of an application, while preserving the semantics of the application and respecting the resource constraints. To ensure correct semantics of the produced schedule, the dependences between operations should be respected. These dependences are given by the edges of the DDG. Furthermore, operations that are executed in parallel should not use the same hardware resources. A TTA instruction scheduler attempts to pack moves, instead of operations, into a minimum number of instructions. A TTA instruction scheduler is more complex than an OTA instruction scheduler. In this section, various instruction scheduling scopes and techniques in the context of TTAs are discussed. The presented algorithms are based on the work of Hoogerbrugge [Hoo96].


3.4.1 List Scheduling

Finding the optimal schedule is an NP-complete problem [GJ79]. A simple and efficient heuristic to schedule the code is list scheduling. It is the most popular technique for instruction scheduling nowadays because it produces optimal results most of the time and has a short compilation time [BR91, BEH91b, Ell86, Fis81, HA99, HMC+93]. It comes as no surprise that the instruction scheduler in the TTA compiler back-end is also based on list scheduling. List scheduling repeatedly assigns an operation to an instruction without backtracking or lookahead. List scheduling schedules the operations in topological order. This order is determined by the DDG. An operation is scheduled when all its predecessors in the DDG within its scheduling scope have been scheduled. Such an operation is said to be ready and is a member of the ready set. Selecting operations from the ready set guarantees that scheduling will never block, because there are only upwards exposed constraints. There are two variants of list scheduling: instruction-based and operation-based list scheduling.

• Instruction-based list scheduling tries to place as many operations as possible in the current instruction, while respecting data dependence and resource constraints. When no more operations can be placed in the current instruction, it proceeds to the next instruction.

• Operation-based list scheduling repeatedly selects a ready operation and places it in the first instruction where dependence and resource constraints are satisfied. In contrast with instruction-based list scheduling, this method does not create a schedule by filling one instruction at a time. Operation-based list scheduling is more general than instruction-based list scheduling because the number of operations in its ready set is larger: ready(instruction-based) ⊆ ready(operation-based). Consequently, operation-based list scheduling has more freedom in selecting operations for scheduling and can potentially achieve a higher performance.

Instruction-based list scheduling is more popular than operation-based list scheduling due to its lower engineering complexity. Unfortunately, instruction-based list scheduling is less suitable for TTAs because of the following three reasons [Hoo96]:

• Inefficient hardware usage. Scheduling the operand and trigger move individually can result in a schedule in which they are scheduled far away from each other. In the interjacent instructions no other operations can use the operand register of the FU.

• Deadlocks. Scheduling moves individually can result in, for example, a situation where the trigger move of operation o1 cannot be scheduled, because it depends on an operation o2 that needs the same FU, and o2 cannot be scheduled because operation o1 has occupied the operand register.


• Pipelining problems. When VTL pipelined FUs are used, the results must be read in time before they are overwritten by successive operations scheduled on the same FU. When it is impossible to schedule the result move in time due to dependence or resource constraints, the operation must be unscheduled. This requires a backtracking scheduler, which increases the compilation time and the engineering complexity of the scheduler.

Because of the above-mentioned problems, the TTA scheduler uses the operation-based list scheduling technique and schedules the moves of an operation in one indivisible step.

3.4.2 Resource Assignment

Resource parallelism in a processor exists in two forms [JW89]. The first is the ability of a processor to issue multiple instructions simultaneously. This is determined by the degree to which resources are duplicated in the processor. The second is the ability to overlap the execution of multiple operations, caused by the degree of pipelining of the FUs. Both forms of resource parallelism affect the extent to which a processor can exploit the ILP of an application.

Resource assignment assigns resources to operations. It is the responsibility of the scheduler to assign FUs to operations; buses and sockets to moves; and to decide whether an immediate is encoded in the source field of a move, or stored in an immediate field. In general, resource assignment in a TTA compiler is more complex than in an OTA compiler, because a TTA compiler has to assign more resources. Consequently, more resources have to be checked for conflicts.

The TTA compiler back-end uses a first-fit assignment algorithm to assign FUs, long immediates and sockets. The order of examination is determined by the order in which the resources are specified in the machine description file. Resources can have overlapping functionality; for instance, an FU can support a subset of the operations of another FU in addition to its own operation set. To obtain the best results, resources with the most specialized functionality should be selected first. It is the responsibility of the designer to specify this order in the machine description file.

Move buses are assigned in a two-step process [HC96]. A first-fit algorithm is used for finding a free move bus. When the interconnection network is fully connected and no move bus is found, it is guaranteed that no move bus is available. However, if the interconnection network is irregular, then reshuffling the already made move bus allocation in the same instruction can lead to a valid allocation. A bipartite matching algorithm [DA93] is used for finding a valid move bus allocation. After scheduling, the actual move bus assignment is carried out.
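The sketch below illustrates the first-fit idea for FU assignment only; the machine-description data structure is a hypothetical stand-in, and the bus reshuffling by bipartite matching is omitted.

    # First-fit selection of an FU, examined in machine-description order
    # (more specialized FUs should therefore be listed first).
    machine_fus = [
        {"name": "mul_fu", "ops": {"mul"},               "busy": set()},
        {"name": "alu0",   "ops": {"add", "sub", "mul"}, "busy": set()},
        {"name": "alu1",   "ops": {"add", "sub"},        "busy": set()},
    ]

    def first_fit_fu(op_type, cycle):
        """Return the first FU that supports op_type and is free in this cycle."""
        for fu in machine_fus:
            if op_type in fu["ops"] and cycle not in fu["busy"]:
                fu["busy"].add(cycle)
                return fu["name"]
        return None        # no FU available; the scheduler must try another cycle

    print(first_fit_fu("mul", 0))   # -> mul_fu (the specialized unit is tried first)
    print(first_fit_fu("add", 0))   # -> alu0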


Algorithm 3.2 SCHEDULEBASICBLOCK(b, EDDG)

ready = {o ∈ b | ¬∃(oi, o) ∈ EDDG, oi ∈ b}
S = ∅
WHILE S ≠ b DO
    o = SELECTOPERATION(ready)
    IF ISCOPY(o) THEN
        SCHEDULECOPY(o)
    ELSE IF ISPROCEDURECALL(o) THEN
        SCHEDULEPROCEDURECALL(o)
    ELSE IF ISJUMP(o) THEN
        SCHEDULEJUMP(o)
    ELSE
        SCHEDULEOPERATION(o)
    ENDIF
    S = S ∪ {o}
    ready = {o | o ∈ b − S ∧ ∀(oi, o) ∈ EDDG, oi ∈ S}
ENDWHILE

3.4.3 Local Scheduling

There are several hierarchical levels or scopes at which instruction scheduling can be applied. The scheduling scope of a local or basic block scheduler consists of a single basic block. Scheduling decisions made in one basic block have no effect on scheduling decisions made in other basic blocks. Due to the limited size of basic blocks, typically 5 or 6 operations [JW89], the amount of ILP that can be exploited is modest. However, many scheduling techniques that exploit ILP in a larger scope use the principles of basic block scheduling.

Algorithm 3.2 shows the steps to create a schedule S of a basic block b. Operations for scheduling are selected from the set of ready operations. This process is repeated until all operations are scheduled. The order in which the operations are selected for scheduling has a large performance impact. Intuitively, operations along the critical path of the DDG should be scheduled first, since they are the primary bottleneck. The operations in the TTA compiler back-end are ordered with this observation in mind. The slack-based priority heuristic is used for computing the priorities of the operations [Hoo96].

slack(oi) = alap(oi) − asap(oi) (3.7)

where asap() is the as-soon-as-possible limit while respecting the data dependences. This limit represents the earliest instruction where operation oi can be scheduled.

asap(oi) = { max{asap(oj) + delay(oj, oi)}   if ∃(oj, oi) ∈ EDDG
           { 0                               otherwise                    (3.8)


where delay(oj, oi) is the delay associated with the data dependence between the operations oj and oi. The function alap() computes the as-late-as-possible limit; it represents the latest instruction where operation oi can be scheduled without increasing the critical path length Lmax(b) of basic block b.

alap(oi) = { min{alap(oj) − delay(oi, oj)}   if ∃(oi, oj) ∈ EDDG
           { Lmax(b)                         otherwise                    (3.9)

The scheduler selects the operation from basic block b that minimizes slack(o) and whose predecessors in the DDG are scheduled. The following priority function is used for ordering the operations:

priorityslack(oi ∈ ready) = 1 − slack(oi) / Lmax(b)    (3.10)

Each time an operation is scheduled, the priorities are recomputed.

The basic block scheduling algorithm distinguishes four types of operations: copies, procedure calls, jumps and data operations. Copy, jump and call operations are relatively easy to schedule. Each consists of a single move. Scheduling a data operation, such as an addition or a subtraction, is more difficult. These operations consist of multiple moves. For each move, resources have to be found. The moves of an operation o are scheduled with Algorithm 3.3. The earliest instruction in which an attempt is made to schedule a transport mi of an operation o is computed with:

EARLIESTINSN(mi) = max{ insn(mj) + delay(mj, mi) | mj ∈ pred(mi) }    (3.11)

where pred(mi) = {mj | (mj, mi) ∈ EDDG, mj ∈ S} and insn(mj) represents the instruction in which the transport mj is scheduled. When pred(mi) = ∅ then EARLIESTINSN(mi) = 0. Scheduling attempts in earlier instructions will always fail because of the data dependence constraints. To obtain a compact schedule, the instruction counter i is incremented from the lower bound to the upper bound of the trigger move (see Algorithm 3.3). The upper bound is computed with:

LATESTINSN(mt) = LASTINSN(S) + NUMBEROFOPERANDMOVES(o) + 1 (3.12)

where LASTINSN(S) returns the number of instructions in the current schedule and NUMBEROFOPERANDMOVES(o) gives the number of operand moves of operation o. The resources in the instructions between the last instruction of the current schedule and the upper bound are all free. Consequently, it is always possible to find free resources for the operand and trigger moves.
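A small sketch of the resulting scheduling window for a trigger move, following Equations 3.11 and 3.12 (the move names, delays and schedule contents are hypothetical):

    # Scheduling window for a trigger move (Eqs. 3.11 and 3.12).
    scheduled = {"m_a": 3, "m_b": 5}     # instruction of each scheduled predecessor
    delay = {"m_a": 1, "m_b": 2}         # dependence delay towards the trigger move

    def earliest_insn(preds):
        """Equation 3.11: first instruction in which scheduling is attempted."""
        return max((scheduled[m] + delay[m] for m in preds), default=0)

    def latest_insn(last_insn_of_schedule, num_operand_moves):
        """Equation 3.12: beyond this bound free resources are guaranteed."""
        return last_insn_of_schedule + num_operand_moves + 1

    lo = earliest_insn(["m_a", "m_b"])                                 # -> 7
    hi = latest_insn(last_insn_of_schedule=8, num_operand_moves=1)     # -> 10
    print(list(range(lo, hi + 1)))       # instructions tried for the trigger move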

The first resource that is assigned to an operation is the FU. Assigning an FU to an operation differs from assigning move buses and sockets, in the sense that an FU is assigned to all moves of an operation, while move buses and sockets are assigned to each individual move.


Algorithm 3.3 SCHEDULEOPERATION(o)

FOR i = EARLIESTINSN(TRIGGER(o)) TO LATESTINSN(TRIGGER(o)) DO
    FOR EACH fu ∈ FUset DO
        IF OPERATIONTYPE(o) ∈ OPERATIONSET(fu) THEN
            ASSIGNFU(o, fu)
            IF SCHEDULETRIGGERMOVE(mt ∈ o, fu, i) THEN
                IF SCHEDULEOPERANDMOVES(o, fu, i) THEN
                    IF SCHEDULERESULTMOVE(mr ∈ o, fu, i) THEN
                        return TRUE
                    ENDIF
                ENDIF
            ENDIF
            RELEASERESOURCES(o)
        ENDIF
    ENDFOR
ENDFOR
return FALSE

Algorithm 3.4 SCHEDULETRIGGERMOVE(mt, fu, i)

IF ISTRIGGERED(fu, i) THEN
    return FALSE
ELSE IF ¬ ASSIGNTRANSPORTRESOURCES(mt, i) THEN
    return FALSE
ENDIF
return TRUE

After an FU is found on which the operation can be executed, the trigger, operand and result moves are scheduled. When scheduling of an operand and/or trigger move fails, the already assigned resources are released.

Algorithm 3.4 shows the steps necessary to schedule a trigger move mt in instruction i. The algorithm first checks whether another operation already performed a write to the fu's trigger register in instruction i. It is not allowed to write more than once in the same instruction to the same FU register. Scheduling of mt succeeds when legal assignments can be found for the transport resources (sockets and move buses).

The operand moves3 are scheduled using Algorithm 3.5. The first instruction in which an attempt is made to schedule an operand move mo is equal to the instruction in which the trigger move is scheduled. If scheduling in this instruction fails, earlier instructions are tried.

3Complex operations can have more than a single operand. For example, addmul has two operands and one trigger: (mt + mo1) · mo2.


Algorithm 3.5 SCHEDULEOPERANDMOVES(o, fu, i)

FOR EACH mo ∈ o DO
    FOR i′ = i DOWNTO max(i − otfreedom, EARLIESTINSN(mo)) DO
        IF ISOPERANDREGISTEROCCUPIED(fu, i′) THEN
            return FALSE
        ELSE IF ASSIGNTRANSPORTRESOURCES(mo, i′) THEN
            return TRUE
        ENDIF
    ENDFOR
ENDFOR
return FALSE

The earliest instruction in which the algorithm tries to schedule the operand move is bounded by the data dependence constraints of mo and the parameter otfreedom. This parameter ensures that the trigger and operand move are scheduled close to each other. This prevents inefficient hardware usage (see Section 3.4.1). A typical value for otfreedom is three4. Scheduling fails if mo overwrites an operand register which is in use by another operation, or when no free transport resources (sockets and buses) can be found.

Algorithm 3.6 is used for scheduling the result moves. The first legal instruction in which the result move can be scheduled is equal to the instruction of the trigger move plus the latency of the operation. The upper bound is equal to the lower bound incremented with the constant trfreedom. This constant prevents the result move from being scheduled too far from the trigger move. A typical value for trfreedom is three5. To prevent incorrect resource assignment, three checks are required: (1) the trigger and result moves of all operations executing on fu have to be scheduled in FIFO order, (2) the number of operations in the pipeline may not exceed the capacity of the pipeline, and (3) on VTL pipelined FUs, collisions have to be prevented. The last check prevents the contents of pipeline stages from being unintentionally overwritten. If all these checks are passed, it is checked that no dependence constraints (output dependences) are violated: a result move cannot be scheduled earlier than its dependence constraints allow. The result move is scheduled successfully in instruction i′ when legal assignments are found for the transport resources (sockets and move buses).

It should be stated that the presented algorithms give just a glimpse of the complexity of the issues involved in instruction scheduling for TTAs. Other issues exist, such as scheduling across procedure calls: all moves of an operation should be scheduled either before or after the call. In addition, the exploitation of the TTA-specific optimizations is done during scheduling.

4A value of zero would restrict the operand-trigger scheduling freedom completely. Experiments indicate that this may result in a performance loss of 7% [Hoo96].

5A value of zero would restrict the trigger-result scheduling freedom completely. Experiments indicate that this may result in a performance loss of 3% [Hoo96].


Algorithm 3.6 SCHEDULERESULTMOVE(mr , fu, i)

FOR i′ = i + LATENCY(fu) TO i + LATENCY(fu) + trfreedom DO
    IF NOTINFIFOORDER(fu, i′) THEN
        return FALSE
    ELSE IF PIPELINEOVERFLOW(fu, i′) THEN
        return FALSE
    ELSE IF ISVTLPIPELINE(fu) ∧ COLLISION(fu, i′) THEN
        return FALSE
    ELSE IF i′ < EARLIESTINSN(mr) THEN
        continue
    ELSE IF ASSIGNTRANSPORTRESOURCES(mr, i′) THEN
        return TRUE
    ENDIF
ENDFOR
return FALSE

3.4.4 Global Scheduling

As already stated in the previous section, the size of a basic block is limited: typically 5 or 6 operations for non-numeric code. As a result, the amount of ILP that can be exploited by a basic block scheduler is limited. To increase the amount of exploitable ILP, the scheduling scope should be larger than a single basic block. Schedulers that exploit ILP across multiple basic blocks are called extended basic block schedulers or global schedulers. The scheduling scope of these schedulers consists of an acyclic CFG. The inter basic block ILP is exploited by moving operations between basic blocks belonging to the same scheduling scope. This may require speculative execution and code duplication. In the literature, various extended basic block scheduling scopes have been introduced [BR91, Fis81, HMC+93, MLC+92, Mah96]. Regions [BR91] should potentially give, due to their generality, the best performance. Regions correspond to the body of a natural loop. Note that this body may contain arbitrarily nested conditional statements. Figure 3.13a shows the region hierarchy of a program. The region scheduler used in the TTA compiler back-end is inspired by the work of Bernstein [BR91] and has extensions for multi-way branching and predicated execution.

Basic blocks of a region are scheduled in topological order; a basic block is scheduled after all its predecessors have been scheduled. Region scheduling enlarges the exploitable ILP by moving operations over basic block boundaries. This is illustrated in Figure 3.13b. First, all operations of destination basic block b are scheduled using the basic block scheduler. Afterwards, the remaining operations in the sequential code of the same region are examined, to find operations from other basic blocks that can be scheduled in b, see Algorithm 3.7.


Figure 3.13: Regions. a) Nesting of regions. b) Code motion in a region.

The process of scheduling an operation into another basic block is called importing operations. When an operation o is selected for importing, it is removed from its source basic block b′ and added to the destination basic block b. To ensure correct semantics, code duplication might be necessary. Consider Figure 3.13b: moving an operation from basic block b′ to basic block b requires the insertion of duplicates in the basic blocks bD in order to preserve the semantics of the program. Duplication of an operation is required when basic block b does not dominate b′. The algorithm to compute the set of duplication basic blocks D can be found in [Hoo96]; D includes the destination basic block b. To simplify scheduling, no control paths are allowed between the elements of D. This is known as the single copy on a path rule (SCP) described in [BCK91].

Algorithm 3.8 is used for importing an operation o in a basic block b. The function BB(o) returns the (source) basic block of operation o. Note that some of the duplication basic blocks D might already be scheduled. In the present implementation, the imported operations are not permitted to enlarge these basic blocks. Importing also fails when insufficient resources are available in one of the already scheduled duplication basic blocks.

The code motion from basic block b′ to b in Figure 3.13b is always profitable because all execution paths starting at b go through b′. This is not the case for the other two duplication basic blocks.


Algorithm 3.7 SCHEDULEREGION(Region, EDDG)

is scheduled = ∅
FOR EACH b ∈ Region IN TOPOLOGICAL ORDER DO
    SCHEDULEBASICBLOCK(b, EDDG)
    is scheduled = is scheduled ∪ {b}
    WHILE NOT EACH REACHABLE OPERATION o TRIED DO
        TRYTOIMPORTOPERATION(b, o)
    ENDWHILE
ENDFOR

Algorithm 3.8 TRYTOIMPORTOPERATION(b, o)

b′ = BB(o)
D = COMPUTEDUPLICATIONSET(b, b′)
FOR EACH b′′ ∈ D DO
    IF b′′ ∈ is scheduled THEN
        IF ¬ TRYTOSCHEDULEOPERATION(b′′, o) THEN
            RELEASERESOURCES(D, o)
            return
        ENDIF
    ENDIF
    b′′ = b′′ ∪ {o}
ENDFOR
b′ = b′ − {o}

If the outcome of the branch in one of these duplication basic blocks is not in the direction of basic block b′, then the operation imported from basic block b′ is executed, but its result is never used. Code motions into basic blocks containing jumps that might not branch in the direction of b′ are said to be speculative.

Some operations are not allowed to be speculatively executed. Speculatively imported operations that produce exceptions, overwrite registers or overwrite memory locations may change the state of a program when the outcome of the branch is not in the direction of b′. This may lead to incorrect program execution. To eliminate these restrictions, operations can be guarded6, see Figure 3.14. The copy operation is guarded by a guard expression that is computed by combining the guards of the branches over which the operation is imported. If the operation were not guarded and the jump branched to basic block C, the imported operation would unintentionally overwrite register r3 and would change the outcome of the addition. This is also known as off-liveness.

6Operations that produce exceptions are never speculatively executed.


Figure 3.14: Speculative code motion with guarding. a) Original code fragment. b) Code fragment after importing.

Figure 3.15: Scheduling over basic block boundaries.

Instead of moving whole operations across basic block boundaries, it can also be advantageous to import individual moves. The TTA region scheduler tries to schedule individual moves across basic block boundaries in the following situations, see Figure 3.15:

• When the trigger move of an operation is scheduled in basic block C and the operand move cannot be placed in the same or an earlier instruction of C, the scheduler will try to schedule the operand move in the predecessor basic blocks A and B.

• When the result move cannot be placed in an instruction in C, the scheduler tries to place the result move in the successor basic blocks D and E.

To prevent inefficient hardware usage due to a large number of instructions between the moves of an operation, individual moves are only imported into direct successor or predecessor basic blocks.

The performance of an extended basic block scheduler depends on the order in which ready operations are tried for importing. The priority function used for the TTA region scheduler [Hoo96] is given by:

priority(o) = Pr(b′|b) · (1 − slack(o, b′) / Lmax(b′))    (3.13)

where o is the candidate operation for the code motion from basic block b′ to basic block b and Pr(b′|b) is the probability that b′ will be executed after b. A large probability makes it very likely that the imported operation will indeed be executed and hence the used resources are spent well.


The parameter slack(o, b′) gives the slack or criticality of operation o in basic block b′. This parameter is normalized with Lmax(b′), the critical path length of b′, to prevent operations from small basic blocks from being prioritized over operations from larger basic blocks. The operation with the highest priority is selected for importing. This process is repeated until all ready operations have been tried.

3.4.5 Software Pipelining

As already mentioned in Section 3.2.5, loop unrolling enables the exploitation of ILP of successive loop iterations. However, replicating the loop body increases the code size. Furthermore, prior to scheduling, one has to decide how many times to unroll the loop. Software pipelining is a technique that potentially achieves the same performance as infinite loop unrolling with only a modest code size expansion.

Software pipelining has received widespread attention in academic and industrial research [Lam88, Rau94, LVAG95]. This scheduling technique exploits ILP of loops by overlapping the execution of successive iterations. During the execution of a non-pipelined loop, the first iteration is started and executed to its completion. The second iteration is then initiated and executed until completion, etc. In contrast, for a software pipelined loop, the second iteration of the original loop is allowed to start before the first iteration is completed. The interval at which iterations of the software pipelined loop are started is called the initiation interval (II), which is expressed in a number of cycles. The goal of software pipelining is to find a schedule with the shortest possible initiation interval. A number of methods for constructing software pipelined loops have been published; the most widely used is iterative modulo scheduling [Rau94]. The algorithm chosen for the research presented in this thesis is also based on this method.

Iterative modulo scheduling first computes a lower bound on the II called the minimum initiation interval (MII). Then, it tries to schedule the loop in MII instructions. When scheduling fails, the II is increased until a valid schedule is constructed that fits. In order to reduce the number of iterations required to generate a valid schedule, the MII should be estimated as precisely as possible. Resource and dependence constraints are used for this estimation. For more information on computing the MII the reader is referred to the literature [Rau94].
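As a hedged illustration of the lower bounds commonly used in the modulo scheduling literature (not necessarily the exact estimator implemented in this back-end), the sketch below combines a resource-constrained bound (ResMII) and a recurrence-constrained bound (RecMII); all numbers are invented.

    # Standard MII lower bounds from the literature (illustrative only).
    import math

    def res_mii(op_counts, fu_counts):
        """op_counts: operations per FU class in the loop body;
        fu_counts: number of available FUs per class."""
        return max(math.ceil(op_counts[c] / fu_counts[c]) for c in op_counts)

    def rec_mii(recurrences):
        """recurrences: (total delay, total distance) of each cycle in the DDG."""
        return max(math.ceil(d / dist) for d, dist in recurrences)

    # Example: 4 memory operations on 2 load/store FUs, 3 ALU operations on
    # 3 integer FUs, and one recurrence with delay 3 spanning 2 iterations.
    mii = max(res_mii({"mem": 4, "alu": 3}, {"mem": 2, "alu": 3}),
              rec_mii([(3, 2)]))
    print(mii)    # -> 2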

Basic block and region scheduling are hindered by resource and dependence constraints between operations in the same iteration. Software pipelining also has to respect dependences between operations of different iterations. This is accomplished by adding to the DDG so-called inter-iteration data dependences, denoted as oi δ^i_{delay,distance} oj. This data dependence edge states that oj should be executed at least delay cycles after oi in the distance-th previous iteration.

Figure 3.16a shows the RISC-style code of the loop body that executes the computation b[n..1] = 17 * a[1..n].


    L1: add r1, r1, #4
        ld  r2, (r1)
        mul r4, r2, #17
        sub r3, r3, #4
        st  r4, (r3)
        bgz r3, L1

a) Example loop. b) DDG of the loop (flow and inter-iteration dependence edges, not reproduced here).

    Prologue:
        add r1, r1, #4
        ld  r2, (r1)
        add r1, r1, #4    mul r4, r2, #17
        ld  r2, (r1)      sub r3, r3, #4
    Kernel:
    L1: add r1, r1, #4    mul r4, r2, #17    st  r4, (r3)
        ld  r2, (r1)      sub r3, r3, #4     bgz r3, L1
    Epilogue:
        mul r4, r2, #17   st  r4, (r3)
        sub r3, r3, #4
        st  r4, (r3)

c) Software pipeline with II = 2.

Figure 3.16: Example software pipelining.

For reasons of clarity, OTA code is used in the example instead of TTA code. The DDG with inter-iteration dependences is given in Figure 3.16b. Scheduling the loop with a basic block scheduler results in 5 cycles per loop iteration, assuming a latency of two for the multiplier, a single cycle latency for all other operations and infinite resources. Its software pipelined counterpart in Figure 3.16c starts a new iteration every second cycle.

The schedule produced by a software pipelining algorithm consists of three different phases: the prologue, the kernel and the epilogue. The pipeline is started by the prologue, then the kernel is executed (the loop-body of the schedule), and finally the epilogue drains the pipeline. For high iteration counts, the kernel mainly determines the total execution time.

The software pipelining algorithm (see Algorithms 3.9 and 3.10) is based on the algorithm for operation-based list scheduling for basic blocks.


Algorithm 3.9 SOFTWAREPIPELINING(b)

II = COMPUTEMII(b)
budget = budget ratio · |b|
WHILE ¬ ITERATIVEMODULOSCHEDULING(b, II, budget, Huff) ∧
      ¬ ITERATIVEMODULOSCHEDULING(b, II, budget, Rau) DO
    II = II + 1
ENDWHILE

The main differences are:

• In contrast to basic block scheduling, resource conflicts can arise not only in the same instruction i, but also in all instructions i + II · k, ∀ k > 0. Therefore, when an operation uses a resource in instruction i, the state of this resource in instruction i mod II is updated. Checking whether a resource can be used in instruction i means checking the availability of this resource in instruction i mod II (a sketch of such a modulo reservation table is given after this list).

• In local and global scheduling, the first instruction in which an attempt is made to schedule an operation oi is computed with Equation 3.11. This computation expects that all predecessors are already scheduled. However, because of the recurrences in the DDG it is not always possible to select an operation for scheduling whose predecessors are all scheduled. Furthermore, the inter-iteration dependences must also be included.

EarliestInsn(oi) = max{ insn(oj) + delay(oj, oi) − II · distance(oj, oi) | oj ∈ pred*(oi) }

where pred*(oi) is the set of scheduled predecessors of operation oi and max(∅) = 0. The latest instruction in which an operation oi is tried is equal to EarliestInsn(oi) + II − 1. Searching beyond this boundary is useless because if there is a resource conflict in instruction i, then there is also a resource conflict in instruction i + k · II.

• Because an operation can be scheduled before all its predecessors are scheduled, a dependence constraint can be violated. When an operation oi is scheduled, the following condition is checked:

insn(oj) < insn(oi) + delay(oi, oj) − II · distance(oi, oj) ∀ oj ∈ succ(oi)

where succ(oi) = {oj | (oi, oj) ∈ EDDG, oj ∈ S}. When this expression evaluates to true, iterative modulo scheduling corrects the partial schedule by unscheduling all operations that conflict with operation oi.

• As already observed, it is not always possible to generate a correct schedule in II instructions. To detect the inability to schedule the loop within the given II, a scheduling budget is provided. This budget equals budget ratio · |b|, where |b| is the number of operations in the software pipelined loop and budget ratio is an adjustable parameter (a typical value is 4.5).


Algorithm 3.10 ITERATIVEMODULOSCHEDULING(b, II , budget, heuristic)

S = ∅
WHILE S ≠ b ∧ budget > 0 DO
    o = SELECTOPERATION(b − S, heuristic)
    IF ISCOPY(o) THEN
        IF ¬ SCHEDULECOPY(o) THEN
            return FALSE
        ENDIF
    ELSE IF ISJUMP(o) THEN
        IF ¬ SCHEDULEJUMP(o) THEN
            return FALSE
        ENDIF
    ELSE
        IF ¬ SCHEDULEOPERATION(o) THEN
            return FALSE
        ENDIF
    ENDIF
    S = S ∪ {o}
    budget = budget − 1
    IF ¬ DEPENDENCESCORRECT(S, o) THEN
        UNSCHEDULEALLCONFLICTINGOPERATIONS(S, o)
    ENDIF
ENDWHILE
return TRUE

The scheduling budget is decremented each time an operation is scheduled. Scheduling fails when the budget becomes negative. The larger the value of budget ratio, the harder the scheduler tries to find a valid schedule.

• The order in which the operations are selected for scheduling influences the efficiency of the generated schedule. Because operations from different iterations are executed at the same time, Huff [Huf93] modified the slack-based priority heuristic of Equation 3.10 by replacing the delay of a DDG edge by the length of that edge (delay(oi, oj) − II · distance(oi, oj)). Another popular priority heuristic is the height-based priority function proposed by Rau [Rau96]. This heuristic gives a higher priority to operations with a large height. The height of an operation is defined as the length of the longest path in the DDG from the operation to a stop pseudo operation that is dependent on all operations in the loop. Hoogerbrugge [Hoo96] evaluated both priority functions. Neither appeared to be clearly better than the other.


As proposed by Hoogerbrugge, both heuristics are tried in order to generate a valid schedule in II instructions.
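The sketch below illustrates the modulo resource check referred to in the first item of the list above: a resource used in instruction i is recorded and tested at slot i mod II. The class and resource names are hypothetical.

    # A minimal modulo reservation table.
    class ModuloReservationTable:
        def __init__(self, ii, resources):
            self.ii = ii
            self.slots = {r: [False] * ii for r in resources}

        def is_free(self, resource, insn):
            return not self.slots[resource][insn % self.ii]

        def reserve(self, resource, insn):
            if not self.is_free(resource, insn):
                return False                 # conflicts with another iteration
            self.slots[resource][insn % self.ii] = True
            return True

    mrt = ModuloReservationTable(ii=2, resources=["ld/st FU", "bus0"])
    print(mrt.reserve("ld/st FU", 1))   # True: slot 1 mod 2 was free
    print(mrt.reserve("ld/st FU", 3))   # False: 3 mod 2 collides with instruction 1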

A drawback of modulo scheduling is that it can only handle single basic block loops. To overcome this problem, if-conversion [WHB92] is used. This method combines the branches of an if-then-else construction into a single basic block with the use of predicates or guards. This method is also applied in the TTA compiler back-end. Software pipelining can only be performed on the innermost loops; the remaining parts of the code are therefore handled by the region scheduler.
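As a generic illustration of if-conversion (modelled here with ordinary Python constructs rather than the guarded TTA moves the compiler actually emits), both branches of an if-then-else are turned into straight-line code whose writes are selected by a predicate:

    # If-conversion sketch: a branch replaced by a predicate guarding the writes.
    def original(a, b):
        if a > 0:                 # conditional branch
            x = b + 1
        else:
            x = b - 1
        return x

    def if_converted(a, b):
        p = a > 0                 # the predicate (guard) replaces the branch
        x_then = b + 1            # computed under guard p
        x_else = b - 1            # computed under guard !p
        x = x_then if p else x_else   # guarded write selects the live value
        return x

    assert original(3, 10) == if_converted(3, 10)
    assert original(-3, 10) == if_converted(-3, 10)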


4 Evaluation Methodology

To evaluate the quality of the introduced algorithms and compiler strategies, an experimental framework is defined. This experimental framework consists of 30 benchmark applications, two TTA processors and a measurement methodology. Measurements play an important role in evaluating compiler techniques. Analysis of the results may lead to various improvements and to a better understanding of the proposed methods.

The application benchmark suite is described in Section 4.1. This suite consists of workstation-type applications, benchmarks from the SPECint95 benchmark suite [Spe96], benchmarks from the MediaBench suite [LPMS97], and DSP (Digital Signal Processing) benchmarks. In Section 4.2, the TTA processor configurations used in the experiments are described. To give an impression of the quality of the compiler in combination with the selected TTA processors, some performance metrics are provided. Section 4.3 evaluates the three scheduling scopes discussed in Section 3.4. In Section 4.4, the achieved amount of instruction-level parallelism (ILP) is given when the benchmarks are compiled for the selected TTA processors.

4.1 Benchmark Suite

Benchmark applications are used for the evaluation of the developed algorithms. The benchmarks are selected from various application areas in order to prevent the developed algorithms from being tailored towards a specific application or application area. Only real-world applications are taken; synthetic benchmarks such as Dhrystone or the Livermore loops are not considered. The benchmark suite consists of:


• Workstation-type benchmarks. Most of these benchmarks will never be considered to be mapped onto an ASP (Application Specific Processor). These benchmarks give a good indication of the quality of the compiler algorithms when used in combination with general-purpose processors.

• Benchmarks of the SPECint95 suite [Spe96]. The SPECint95 suite is an internationally recognized benchmark suite that represents a typical workload. The amount of ILP is expected to vary. For example, 132.ijpeg is expected to have a high degree of ILP because the algorithms used in this benchmark have a parallel nature. The benchmarks 147.vortex and 099.go are likely to have a low degree of ILP, because a database program and a game, respectively, are usually dominated by control intensive code1.

• DSP benchmarks obtained from [Emb95]. These benchmarks contain small loops, which are executed frequently. The algorithms used within these benchmarks are representative algorithms that are especially suitable for implementation within an ASP. A high degree of ILP is expected.

• Benchmarks from the MediaBench suite [LPMS97]. These applications apply DSP-like algorithms. In addition, they also contain control intensive code. These applications are candidates for implementation in ASPs.

The benchmarks and their characteristics are listed in Table 4.1. The table lists for each benchmark the following characteristics: (1) a short description of the benchmark, (2) the number of static operations including the library code, which gives an indication of the code size, and (3) the number of dynamic (executed) operations. This last number depends highly on the input data sets taken for each benchmark.

Code efficiency is measured by the execution time of an application. When averaging the results of the measurements, it is assumed that each benchmark is equally important (independent of the static or dynamic code size of the benchmark). The presented results in the remainder of this dissertation are the (unweighted) arithmetic means of the individual measurements.

4.2 TTA Processor Suite

In this section, the TTA processors used for evaluating the developed methods are described. The TTA processor selection method is described in Section 4.2.1. The selected TTA processors are described in detail in Section 4.2.2.

4.2.1 Space Walking

Designing an Application Specific Processor (ASP) consists of finding a proper set of resources for the given application or application domain.

1A game like Go usually contains more task parallelism than instruction-level parallelism.


Table 4.1: Benchmark characteristics.

Benchmark        Description                  #static oper.   #dynamic oper.
Workstation-type
a68              68K assembler                19646           2805K
bison            Parser generator             18962           4902K
cpp              C preprocessor               15833           1960K
crypt            Encryption                   4253            5875K
diff             File compare                 21802           29M
expand           Tab expansion                4895            29M
flex             Scanner generator            19567           12M
gawk             Language interpreter         36157           42M
gzip             File compression             14539           108M
od               Octal dump                   7315            21M
sed              Stream editor                17532           46M
sort             Sort lines                   7908            81M
uniq             Report repeated lines        5368            27M
virtex           Text formatting              41789           50M
wc               Word count                   4481            7192K
SPECint95
099.go           Plays the game of Go         133K            236G
124.m88ksim      Microprocessor simulator     29K             73G
129.compress     Data compression             7051            45G
132.ijpeg        JPEG encoder                 39K             92G
147.vortex       Object-oriented database     122K            72G
DSP-type
instf            Frequency tracking           1978            3140K
mulaw            Speech compression           1397            330K
radproc          Doppler radar processing     1903            29M
rfast            Fast convolution using FFT   1946            3098K
rtpse            Spectrum analysis            1936            2090K
MediaBench
djpeg            JPEG decoder                 26K             5568K
rawcaudio        Audio encoder                3486            8144K
mpeg2decode      MPEG2 decoder                13336           168M
mpeg2encode      MPEG2 encoder                20887           1662M
unepic           Wavelet decoder              9538            7484K


Figure 4.1: Curve generated by the synthesis tool for benchmark sort (execution time in ns versus cost).

The ASP design space is very large. Manual exploration is tedious and error-prone. Methods to automate design space exploration are described in [FFD96, HC95]. In [FFD96], a system is described which automatically designs VLIW architectures for a given application. The design methodology proposed in [HC95] designs ASPs based on TTAs. This synthesis tool is used for selecting the TTA configurations for this thesis. A popular term to refer to these techniques is space walking.

Selecting a proper TTA configuration for a given application is a trade-off between parameters such as performance, chip area, power consumption, code size, etc. These objectives are in conflict and their relative importance depends on the application. It is hard to select a TTA processor whose parameters are close to the desired requirements in a single step. The used synthesis tool performs a quantitative analysis of many design points. The result of this tool when applied to the sort application is shown in Figure 4.1. Each point on the cost-performance curve is a 2-tuple (texec, cost) and corresponds to a particular TTA processor. The execution time texec is the estimated time in ns to run the application and is the product of the cycle count and the cycle time. The compiler determines the cycle count. The cycle time is computed using a cycle time model. The parameter cost represents the realization cost. This parameter is expressed in units of a 32-bit integer function unit2. The design points on the curve are called Pareto points [dM94].

2For a more detailed description of the hardware cost model and the cycle time model the reader is referred to [CH95, Hoo96]. In [Arn01] a new, more accurate model is presented.


Figure 4.2: Space walking curves (execution time in ns versus cost for TTArealistic). a) Space walking curve for 132.ijpeg. b) Space walking curve for crypt.

A configuration is a Pareto point if it is realizable and there are no other realizable configurations that are both faster and cheaper. In other words, the curve gives the lowest cost for a given performance or, conversely, the best performance for a given cost.
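A small sketch of how Pareto points can be filtered from a set of (execution time, cost) design points; the example points are made up.

    # Keep a configuration only if no other configuration is at least as fast
    # and at least as cheap (i.e. it is not dominated).
    def pareto_points(points):
        return [p for p in points
                if not any(q != p and q[0] <= p[0] and q[1] <= p[1] for q in points)]

    designs = [(5.0e9, 100), (4.2e9, 150), (4.2e9, 180), (3.9e9, 400), (6.0e9, 300)]
    print(pareto_points(designs))
    # -> the non-dominated points (5.0e9, 100), (4.2e9, 150) and (3.9e9, 400)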

4.2.2 Selected TTA Processors

The synthesis tool, as described in the previous section, is applied to select a realistic TTA processor (TTArealistic) for the experiments. The space walking curves of various benchmarks are used to make this selection; the results of two of them are shown in Figure 4.2. The selected TTA processor should combine good performance with reasonable cost. Processors that comply with this requirement can be found in the knee of the curves, as indicated by the arrows in Figure 4.2.

Table 4.2: Function unit characteristics.

FU name               Latency   Pipelined   Operations
load/store FU         2         y           ld, ldb, ldh, ldd, lds, st, stb, sth, std, sts
integer FU            1         -           add, sub, eq, gt, gtu, shl, shr, shru, and, ior, xor, sxbh, sxbw, sxhw
integer multiply FU   3         y           mul
integer divide FU     8         n           div, divu, mod, modu
floating-point FU     3         y           addf, subf, negf, mulf, eqf, gtf, f2i, f2u, i2f, u2f, divf


Table 4.3: Supported guard expressions; bx and by are Boolean registers.

Simple expressions   bx, by, !bx, !by
And expressions      bx·by, !bx·by, bx·!by, !bx·!by
Or expressions       bx+by, !bx+by, bx+!by, !bx+!by

Table 4.4: Benchmark TTA configurations.

Resource                                 TTArealistic   TTAideal
# move buses                             8              64
# load/store FUs                         2              16
# integer FUs                            3              24
# integer multiply FUs                   1              8
# integer divide FUs                     1              8
# floating-point FUs                     1              8
immediates         # short (8 bits)      8              64
                   # long (32 bits)      2              32
integer RF         # registers           n              n
                   # read ports          4              32
                   # write ports         4              32
floating-point RF  # registers           48             512
                   # read ports          2              32
                   # write ports         1              32
Boolean RF         # registers           4              32
                   # write ports         2              32

Besides a realistic TTA processor, a second TTA processor is added to the processor benchmark suite. This second TTA processor, named TTAideal, has many resources of each type. Because compilation for this processor is hardly hindered by resource constraints, it is well suited to evaluate the potential performance of new algorithms. Note that TTAideal does not correspond to the TTA configurations at the far right of the space walking curves. The enormous number of connections to the move buses results in a large cycle time and hence the execution time increases significantly. In this respect, the TTAideal configuration would be in the upper right corner. In the remainder of this thesis, the impact of the cycle time is ignored (unless stated explicitly) because we are mainly interested in the quality of the generated code.

The function unit (FU) characteristics of the TTArealistic and TTAideal processors are listed in Table 4.2. The pipelined FUs use the virtual time latching (VTL) pipeline discipline as discussed in Section 2.2.2. Both TTA processors support guarding. Each move bus is guarded by a guard expression. When this expression evaluates to true the associated transport is executed, otherwise the transport is squashed.


Hoogerbrugge [Hoo96] claims that a guard expression size of more than two does not give much performance gain. Therefore, in our experiments we use an expression size of two. The available guard expressions are listed in Table 4.3.

The TTA templates assume a jump latency of two cycles. This results in one delay slot. The interconnection network is fully connected. The memory system is assumed to be perfect; cache or TLB misses are not taken into consideration. Note that many embedded processors have local memories (instead of caches) and no virtual memory. The parameters of the two TTA processors are listed in Table 4.4. The number of read ports of the Boolean register file is not listed because Booleans are read implicitly by guards. The number of integer registers n is varied during the experiments between 10 and 512.

4.3 Scheduling Scopes

In Section 3.4, three scheduling scopes were discussed: (1) basic block scheduling, (2) region scheduling and (3) software pipelining. These scheduling scopes play an important role in this dissertation. In this section, the effect of the scheduling scope on the performance is measured. Figure 4.3 gives the speedup relative to basic block scheduling when using the TTAideal template with 512 registers. As can be seen, region scheduling results in a large improvement, on average 135%. These results clearly demonstrate that exploiting ILP across basic block boundaries is beneficial.

Figure 4.3: Performance gains of region scheduling and software pipelining relative to basic block scheduling (performance gain in % per benchmark).


Table 4.5: The software pipeline ratio.

Benchmark        Software pipeline ratio
crypt            0.840
flex             0.148
gzip             0.170
od               0.437
sort             0.172
virtex           0.109
124.m88ksim      0.207
132.ijpeg        0.506
instf            0.735
mulaw            0.995
radproc          0.246
rfast            0.583
rtpse            0.147
djpeg            0.425
mpeg2decode      0.588
mpeg2encode      0.193
unepic           0.229

The average performance gain of software pipelining is even larger: 145%. However, this difference is mainly caused by the benchmark mulaw, which resulted in a speedup of 373%. The execution time of this benchmark is dominated by a single loop, which is very suitable for software pipelining. When the benchmark mulaw is ignored, software pipelining performs only slightly better than region scheduling. In [Hoo96], various reasons are mentioned to explain this modest improvement. The two most important are: (1) without software pipelining, the loops are unrolled, which is already quite effective, and (2) only a fraction of the loops can be software pipelined3. Table 4.5 shows the software pipeline ratio for various benchmarks. The software pipeline ratio represents the fraction of the execution time spent in software pipelined loops. Only benchmarks with a software pipeline ratio higher than 10% are listed. In addition, a third reason for the modest improvement of software pipelining is identified. Software pipelining generates prologue and epilogue basic blocks. In the current implementation, local scheduling is applied to these basic blocks. Better results can be achieved when this code is scheduled together with the code that surrounds the original loop. To achieve high performance, a best-of-both-worlds strategy seems profitable. Such a strategy would generate two schedules per loop: one using region scheduling and one using software pipelining. The one with the highest performance should be incorporated in the total schedule.

3As can be observed, many benchmarks give the same results for region scheduling and software pipelining, which indicates that no loops, or only a small fraction, could be software pipelined.
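A minimal sketch of the best-of-both-worlds strategy suggested above: schedule each inner loop with both techniques and keep the faster result. The scheduler and cycle-estimate functions are hypothetical placeholders, not the compiler's actual interfaces.

    # Pick, per loop, the better of region scheduling and software pipelining.
    def schedule_loop(loop, region_schedule, software_pipeline, estimated_cycles):
        candidates = [region_schedule(loop)]
        pipelined = software_pipeline(loop)
        if pipelined is not None:            # software pipelining may not apply
            candidates.append(pipelined)
        return min(candidates, key=estimated_cycles)

    # Trivial stand-ins: pipelining wins when it applies.
    best = schedule_loop(
        loop="inner loop",
        region_schedule=lambda l: {"kind": "region", "cycles": 100},
        software_pipeline=lambda l: {"kind": "pipelined", "cycles": 60},
        estimated_cycles=lambda s: s["cycles"],
    )
    print(best["kind"])                      # -> pipelined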


Figure 4.4: Exploited ILP for the TTAideal and TTArealistic processors (ILP per benchmark).

4.4 Exploitable ILP

In the previous section, we saw that region scheduling and software pipelining generate code with a much higher performance than basic block scheduling. Consequently, the amount of exploited ILP also increases. The practically exploitable ILP of an application depends on the nature of the application (control-intensive, DSP or multimedia) and the used processor. Studies [JW89, Wal91, LW92, TGH92, LW97] to measure the maximum available ILP assume the presence of infinite processor resources and perfect predictors to predict the behavior of a program. In this section, the exploitable ILP of the benchmark applications is measured using both TTA processors and the region scheduler. To allow the exploitation of large amounts of ILP, no register assignment is carried out. Consequently, the code does not contain spill and state preserving code. Although the TTAideal processor is not realistic for practical use, its (hardware) ability to exploit large amounts of ILP gives an indication of the performance of the TTA compiler. The exploited ILP varies between 1.5 and 6.9. Figure 4.4 gives the results for the individual benchmarks. As can be seen, the DSP-type benchmarks and benchmarks from the MediaBench suite have a higher degree of ILP than the more control intensive workstation-type applications. Because the TTAideal processor contains more resources, the exploited ILP is larger than for the TTArealistic processor.


5 The Phase Ordering Problem

Modern optimizing compilers consist of several optimization phases. An important research topic in compiler design is to find the optimal phase ordering. In this thesis, we focus on the phase ordering of the two most important phases in ILP compilers: register assignment and instruction scheduling [HP90]. Both phases have received widespread attention in academic and industrial research [ASU85, BR91, Bri92, CH90, CAC+81, CH95, Ell86, Fis81, GL95, GS90, HHR95, Hoo96, Lam98, ME92, NP95, Pin93, Rau94, WM95]. The interaction between these two phases is becoming increasingly important, since the number of simultaneously executed operations increases due to advances in silicon and compiler technology. Executing more operations simultaneously results in a higher register pressure.

An important question to answer is when, during compilation, register assignment should take place. In one sense, one would like register assignment to be done very late in the compilation process. This approach maintains the myth of unlimited registers until after traditional optimizations, such as common subexpression elimination, copy propagation, and dead code removal. These optimizations increase or reduce the number of required registers by creating or removing variables, or by changing the live ranges of variables. Since register assignment has not yet been applied, it is valid to create variables and to alter their live ranges. If registers are assigned before one or more of these optimizations, assignment and spilling decisions are based on a poor estimate of the register usage.

So far, the timing of register assignment with respect to traditional compiler optimizations has been discussed. How does the inclusion of an instruction scheduling phase affect the optimal placement of register assignment?


While scheduling itself will not create variables, it will most certainly alter the live ranges of variables by changing the relative order of operations. Applying register assignment first limits the instruction scheduler's ability to reorder operations. Applying scheduling first most likely results in schedules that require more registers than available. The interaction between register assignment and instruction scheduling has its impact on the produced code; decisions made by one phase can have negative effects on the other. The order in which these two phases should be applied is a point of dispute.

This chapter gives an overview of several strategies towards the phase ordering of register assignment and instruction scheduling. Section 5.1 describes and evaluates methods that apply register assignment before instruction scheduling. Approaches in which instruction scheduling precedes register assignment are evaluated in Section 5.2. A third strategy, which integrates register assignment and instruction scheduling into a single phase, is discussed in Section 5.3. Finally, Section 5.4 evaluates the strategies and gives the direction in which research should go.

5.1 Early Register Assignment

We speak of early register assignment or post-pass scheduling when register assignment precedes instruction scheduling. This results in an efficient register assignment; i.e., few variables are spilled to memory. However, the register allocator is likely to assign the same register to variables that are referenced by unrelated operations. The re-use of registers introduces new (false) dependence constraints in the data dependence graph (DDG), making instruction scheduling more restricted.

Historically, the merit of early assignment was that processors offered little exploitable ILP and contained few registers. So, whereas there was much to be lost by poor register assignment, there was little to be gained by good instruction scheduling. Today, however, modern microprocessors contain many registers and provide opportunities to exploit ILP.

In Section 5.1.1, the limitations of early register assignment in relation to ILP exploitation are discussed. Section 5.1.2 evaluates solutions in literature that try to alleviate this problem. In Section 5.1.3, a practical implementation as used in the TTA compiler back-end is discussed. Section 5.1.4 summarizes the presented methods and gives an evaluation of the implemented method.

5.1.1 ILP and Early Register Assignment

The selection of registers in an early assignment register allocator may limit the possibilities to reorder instructions, due to the extra dependences that are introduced by the re-use of registers. These extra dependences are called false dependences. A false dependence connects two dependence paths in the DDG; false dependences can increase the critical path length and the execution time of a program.


o1  ld  v1
o2  div v2, v1, #3
o3  add v3, v2, #10
o4  ld  v4
o5  mul v5, v1, v4
o6  add v6, v5, v3
o7  st  v6
o8  mul v7, v4, #3
o9  add v8, v7, #2
o10 st  v8

a) Code fragment.  b) The interference graph.  c) DDG with a false dependence.

Figure 5.1: Early assignment example.


Let's look again at the example of Figure 1.4. The corresponding code fragment with operation numbers is shown in Figure 5.1a. The interference graph associated with the code fragment is given in Figure 5.1b. It can be colored with three colors in various ways. Figure 5.1c shows the DDG. Without false dependences the DDG has a critical path of five cycles, under the assumption that each operation takes one cycle. Assuming infinite resources, this code fragment can therefore be scheduled in five instructions. When the variables v1, v5, v6, v7 and v8 are mapped onto register r1, variables v2 and v3 onto r2, and v4 onto register r3, as shown in Figure 1.4, it is no longer possible to schedule the code fragment in five instructions. This is caused by the false dependence introduced by the register allocator; see Figure 5.1c.


The critical path of the DDG increases to seven cycles, assuming that a read and a write of a register can occur in the same cycle. A mapping of the form v1 and v5 onto register r1; v2, v3 and v6 onto r2; and v4, v7 and v8 onto r3 would not result in a longer schedule, since this register assignment does not increase the critical path of the DDG. From the register allocator's point of view, both register assignments are equally good; however, the latter assignment results in faster executing code.
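The effect of such a register assignment on the schedule length can be made concrete with a small sketch. The following Python fragment is illustrative only (it is not part of the TTA back-end); it computes the longest path through the DDG of Figure 5.1c, once without false dependences and once with the anti dependence o7 to o8 that follows from mapping v6 and v7 onto the same register, reproducing the five and seven cycle figures mentioned above.

# Minimal longest-path sketch over the DDG of Figure 5.1 (unit-latency operations).
# A flow dependence delays the successor by one cycle; a false (anti) dependence
# delays it by zero cycles, since a register can be read and written in the same cycle.
from functools import lru_cache

flow = {  # successor lists of the flow dependences in Figure 5.1c
    'o1': ['o2', 'o5'], 'o2': ['o3'], 'o3': ['o6'], 'o4': ['o5', 'o8'],
    'o5': ['o6'], 'o6': ['o7'], 'o7': [], 'o8': ['o9'], 'o9': ['o10'], 'o10': [],
}

def schedule_length(false_deps=()):
    preds = {op: [] for op in flow}
    for op, succs in flow.items():
        for s in succs:
            preds[s].append((op, 1))          # flow dependence: latency 1
    for src, dst in false_deps:
        preds[dst].append((src, 0))           # false dependence: latency 0

    @lru_cache(maxsize=None)
    def start(op):                            # earliest issue cycle of op
        return max((start(p) + lat for p, lat in preds[op]), default=0)

    return max(start(op) for op in flow) + 1  # the last operation occupies one cycle

print(schedule_length())                      # 5 cycles: no false dependences
print(schedule_length([('o7', 'o8')]))        # 7 cycles: o8 must wait for o7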

When not enough registers are available to hold all variables that are live simultaneously, some variables are spilled to memory. This is done before instruction scheduling. The extra inserted instructions can be scheduled in the same way as other instructions. The same idea applies to the code that is responsible for saving the state of a program around procedure calls. The program is analyzed and the necessary code is inserted in the unscheduled (sequential) code. The described method is referred to as global strictly early assignment because the registers are assigned to variables for a complete procedure prior to instruction scheduling.

5.1.2 Dependence-Conscious Register Assignment Strategies

Most methods for global early register assignment are based on the work of Chaitin [CAC+81, Cha82]; a number of improvements were published later [Bri92]. These methods are based on graph coloring and assume that operations do not move relative to one another. However, in the presence of instruction scheduling this assumption is wrong. This observation is the basis of a series of papers that try to extract information from the unscheduled program about operations that may move relative to one another. In the following, approaches are discussed that try to avoid the introduction of false dependences in an early assignment register allocator, thus preserving more ILP enhancing possibilities for the instruction scheduler.

Round-robin Register Selection [HG83, GWC88, BEH91a]
In an attempt to efficiently allocate variables to registers, most register allocators select the first available register for a variable. Registers with a low index are selected first and are re-used more frequently than registers with a high index. As a result, false dependences are primarily associated with low indexed registers. Balancing the variables more equally across all registers, using a round-robin approach, very likely reduces the number of false dependences. This observation is made in several papers [HG83, BEH91a] and is considered to be better than a first-fit approach.

However, a round-robin selection policy of registers does not explicitly take into consideration how false dependences are introduced. As a result, it can add false dependences that were not present when using a first-fit approach. As observed in [GWC88], no selection policy is uniformly (i.e., for large and small register sets) superior to others in balancing the length of merged dependence paths in the DDG.


Another disadvantage of a round-robin assignment, not observed in [HG83, GWC88, BEH91a], is the impact of a round-robin selection policy on the amount of callee-saved code. When more registers are used, more callee-saved code is required. When there are more variables than registers, a round-robin selection policy uses all registers. A first-fit approach, however, allocates the same set of variables to a much smaller set of registers. In these cases, a first-fit selection policy is preferable.
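The difference between the two selection policies can be illustrated with a small sketch. This is a simplification: real allocators choose among the colors that are legal for an interference-graph node, not from a flat free list, and the register names are assumptions.

# First-fit versus round-robin register selection (illustrative only).
class RegisterSelector:
    def __init__(self, registers):
        self.registers = list(registers)
        self.cursor = 0                      # rotating start point for round-robin

    def first_fit(self, free):
        # Always return the lowest-indexed free register, so r0, r1, ... are re-used heavily.
        return next((r for r in self.registers if r in free), None)

    def round_robin(self, free):
        # Start searching just past the register handed out last time, spreading
        # variables over the whole register file.
        n = len(self.registers)
        for i in range(n):
            r = self.registers[(self.cursor + i) % n]
            if r in free:
                self.cursor = (self.registers.index(r) + 1) % n
                return r
        return None

sel = RegisterSelector(['r0', 'r1', 'r2', 'r3'])
free = {'r0', 'r1', 'r2', 'r3'}
print(sel.first_fit(free))      # r0
print(sel.round_robin(free))    # r0
print(sel.round_robin(free))    # r1, even though r0 is still free

Note that the round-robin variant touches more registers for the same set of variables, which is exactly the callee-saved-code drawback observed above.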

DAG-Driven Register Allocation [GWC88]
Goodman and Hsu [GWC88] introduced a register assignment method that uses the data dependence graph (denoted as DDG or DAG) of each individual basic block to avoid the introduction of false dependences. Their strategy uses the width and height of the DDG. The width of a DDG is defined as the maximum number of mutually independent nodes that need a destination register. The height of a DDG is defined as the length of the longest path (i.e., the critical path). The left-edge algorithm is used for assigning registers. When insufficient registers are available, the register allocator reduces the width of the DDG until it is smaller than or equal to the number of registers by reusing registers. While the width is reduced, the height may increase, since each re-use of registers may merge two independent paths in the DDG into one. This decreases the exploitable ILP and results in a longer schedule.

To minimize the increase in height of the DDG, the register allocator tries to select registers in a manner such that only redundant false dependences are introduced. A false dependence is redundant if the ordering between the operations is already enforced by other dependences. When there are no redundant false dependences, the DAG-driven register allocator tries to minimize the growth of the height by giving priority to the merging of short paths.
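The notion of a redundant false dependence can be expressed directly in terms of reachability in the DDG. The sketch below is illustrative; the edge-list representation is an assumption, not the data structure of [GWC88].

# A false dependence a -> b is redundant if b is already reachable from a in the
# DDG: the ordering is then enforced by existing dependences and the scheduler
# loses no freedom.
def reachable(ddg, src, dst):
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        if node == dst:
            return True
        if node not in seen:
            seen.add(node)
            stack.extend(ddg.get(node, ()))
    return False

def is_redundant_false_dep(ddg, a, b):
    return reachable(ddg, a, b)

ddg = {'o1': ['o2'], 'o2': ['o3'], 'o3': []}
print(is_redundant_false_dep(ddg, 'o1', 'o3'))   # True: already implied by o1 -> o2 -> o3
print(is_redundant_false_dep(ddg, 'o3', 'o1'))   # False: would constrain the scheduler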

The method as proposed by Goodman and Hsu is only able to allocate registers in straight-line code (i.e., a single basic block). DAG-driven allocation does not consider false dependences between operations of different basic blocks, which makes this method less suitable in combination with extended basic block schedulers. Furthermore, constraints imposed by, for example, a limited set of FUs are not considered when reducing the width of the DDG. These constraints would probably already result in a reduction of the width of the DDG and thus of the number of required registers. The reported results are based on a limited set of benchmarks. It is shown that DAG-driven register allocation outperforms local strictly early assignment.

Register Allocation with Instruction Scheduling: a New Approach [Pin93]
In [Pin93] an early assignment method is proposed that preserves the property that no false dependences are introduced. Therefore, all options for parallelism are kept open for the instruction scheduler. The method is based on the Yorktown Allocator [Cha82]. Instead of using an interference graph, a parallel interference graph is used for graph coloring. This interference graph is the union of the traditional interference graph IG = (Nvar, Einterf) and the false dependence graph Gf = (NDDG, Ef).


The false dependence graph Gf is generated by analyzing the data dependence graph DDG.

Definition 5.1 The graph Gt = (NDDG, Et) is a finite undirected graph with NDDG the set of nodes of the DDG. The set Et is defined as the set of edges of the transitive closure C(DDG) after removal of the direction of the edges.

Definition 5.2 For a given basic block the false dependence graph is defined as the undirected graph Gf = (NDDG, Ef). The set Ef is defined as Ef = {(ni, nj) | ni, nj ∈ NDDG ∧ ni ≠ nj ∧ (ni, nj) ∉ Et}.

Observe that the variables v ∈ Nvar are defined by nodes n ∈ NDDG. Thus a node n ∈ NDDG may correspond to a defining operation and to a variable. This relation is used to construct the parallel interference graph.

Definition 5.3 The parallel interference graph IGpar = (Nvar, Epar) is a finite undirected graph, with Nvar the set of variables and Epar the set of parallel interference edges: Epar = Einterf ∪ Efdp. The set of false dependence prevention edges Efdp is defined as Efdp = {(vi, vj) | (nref(vi), nref(vj)) ∈ Ef ∧ vi, vj ∈ Nvar ∧ nref(vi), nref(vj) ∈ NDDG}, where nref(v) is a node that references variable v.

Note that the sets Einterf and Efdp may overlap. Due to the sequentiality of the input code, Einterf may contain some false dependence prevention edges.
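Definitions 5.1 to 5.3 translate almost directly into code. The following sketch is illustrative only; the graph representations and the variable-to-node map refs are assumptions, not the data structures used by Pinter or by the TTA back-end.

from itertools import combinations

def transitive_closure(ddg):
    # ddg: successor lists; returns, per node, the set of nodes reachable from it
    closure = {n: set(ddg[n]) for n in ddg}
    changed = True
    while changed:
        changed = False
        for n in closure:
            extra = set()
            for m in closure[n]:
                extra |= closure[m]
            if not extra <= closure[n]:
                closure[n] |= extra
                changed = True
    return closure

def false_dependence_graph(ddg):
    # Ef (Definition 5.2): all unordered pairs of DDG nodes that are unrelated in
    # the transitive closure; these are the pairs the scheduler might reorder.
    closure = transitive_closure(ddg)
    et = {frozenset((a, b)) for a in closure for b in closure[a]}   # undirected Et
    return {frozenset(p) for p in combinations(ddg, 2) if frozenset(p) not in et}

def parallel_interference_graph(interference_edges, ef, refs):
    # Epar (Definition 5.3): ordinary interferences plus false dependence
    # prevention edges; refs maps each variable to the nodes referencing it.
    efdp = set()
    for vi in refs:
        for vj in refs:
            if vi == vj:
                continue
            if any(frozenset((ni, nj)) in ef
                   for ni in refs[vi] for nj in refs[vj] if ni != nj):
                efdp.add(frozenset((vi, vj)))
    return {frozenset(e) for e in interference_edges} | efdp

ddg = {'o1': ['o2'], 'o2': [], 'o3': []}                  # o3 is independent of o1 and o2
refs = {'v1': {'o1', 'o2'}, 'v2': {'o2'}, 'v3': {'o3'}}
ef = false_dependence_graph(ddg)
print(sorted(tuple(sorted(e)) for e in ef))               # [('o1', 'o3'), ('o2', 'o3')]
print(len(parallel_interference_graph([('v1', 'v2')], ef, refs)))   # 3 edges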

As proven in [Pin93], an optimal coloring of the parallel interference graph provides an optimal register assignment and preserves the property that no false dependences are introduced. However, the number of edges is increased and the probability of finding a legal coloring is reduced. When no valid coloring is found, heuristics are used for trading off parallel scheduling (i.e., the introduction of a false dependence) against spilling. A solution proposed in [Pin93] is to add weights to the edges of the parallel interference graph that distinguish between those edges that preserve parallelism (Efdp) and those that prevent spills (Einterf). The weights should reflect the importance of violating the interference. No examples of how to compute such weights are given in that article.

Figure 5.2a shows the false dependence graph Gf of the code fragment of Figure 5.1. As can be seen, the false dependence graph Gf contains many more edges than the data dependence graph DDG of Figure 5.1c. The edges Ef and the interference graph of Figure 5.1b are used to construct the parallel interference graph IGpar. This graph is given in Figure 5.2b. At least four registers are required in order to color this interference graph. Adding a false dependence between the operations o2 and o5, such that the variables v1 and v5 can be mapped onto the same register, reduces the minimal number of required registers to three without increasing the critical path in the DDG.

The method is presented in the context of a basic block; extensions for register assignment and instruction scheduling over multiple basic blocks are provided. However, the constructed parallel interference graph may contain many interference edges that never restrict parallelism.


a) False dependence graph Gf.  b) Parallel interference graph IGpar.

Figure 5.2: Construction of the parallel interference graph of Figure 5.1.

This occurs when references of the involved variables are far apart in the code and thus will never be scheduled in parallel. The large number of interference edges makes coloring difficult. No results are presented to support Pinter's claims that this method improves the quality of the produced code.

Dependence-Conscious Global Register Allocation [AEBK94]
The early register assignment method proposed by Ambrosch, Ertl, Beer and Krall [AEBK94] is based upon the Optimistic Allocator [Bri92]. In conventional graph coloring [Cha82, Bri92], the interference graph is computed from totally ordered code. This ordering may cause some interference edges that need not be valid for the final schedule. To avoid this problem, Ambrosch et al. compute the interference edges of a basic block b from its DDG. The notions of "before" and "after" in a totally ordered basic block are replaced by the data dependence relations, which form a partial ordering. Into the DDG of b a top node ⊤ is inserted that represents the definition point of all variables that are referenced in b and are elements of the set liveIn(b). Similarly, a bottom node ⊥ is inserted that is the use point of all variables referenced in b that are members of liveOut(b). The set NDef(v,b) is defined as the set of nodes that define variable v in basic block b and NUse(v,b) is defined as the set of nodes that use variable v in basic block b. The constructed interference graph is denoted as the minimal interference graph IGmin. This graph only contains interference edges that are present in all possible code orderings.

Definition 5.4 The minimal interference graph IGmin(b) = (Nvar, Emin) of a basic block b is a finite undirected graph, with Nvar the set of variables and Emin

the minimal set of interference edges: Emin = {(vi, vj) | vi, vj ∈ Nvar ∧ vi ≠ vj ∧ ∃(ndef(vi), nuse(vj)) ∈ C(DDG) ∧ ∃(ndef(vj), nuse(vi)) ∈ C(DDG)}, where ndef(v) ∈ NDef(v,b) and nuse(v) ∈ NUse(v,b).

Figure 5.3: The minimal interference graph of Figure 5.1.

To construct the minimal interference graph for a complete procedure, all graphs IGmin(b) of all basic blocks are combined into a single interference graph. Conventional data flow analysis is used for computing the global interferences. In contrast with the conventional interference graph, this graph contains fewer edges because it is not bound to a preordered operation sequence.
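A sketch of Definition 5.4 for a single basic block is given below. It is illustrative only: the handling of live-in and live-out variables via the ⊤ and ⊥ nodes is omitted, and the def/use maps are assumed inputs. Applied to the DDG of Figure 5.1, it yields the two edges discussed next.

# Two variables belong to Emin only if a definition of each reaches a use of the
# other through the DDG, i.e. they overlap in every legal code ordering.
from itertools import combinations

def transitive_closure(ddg):
    def reach(n, seen):
        for m in ddg.get(n, ()):
            if m not in seen:
                seen.add(m)
                reach(m, seen)
        return seen
    return {n: reach(n, set()) for n in ddg}

def minimal_interference_edges(ddg, defs, uses):
    closure = transitive_closure(ddg)
    edges = set()
    for vi, vj in combinations(defs, 2):
        fwd = any(u in closure[d] for d in defs[vi] for u in uses.get(vj, ()))
        bwd = any(u in closure[d] for d in defs[vj] for u in uses.get(vi, ()))
        if fwd and bwd:
            edges.add(frozenset((vi, vj)))
    return edges

ddg = {'o1': ['o2', 'o5'], 'o2': ['o3'], 'o3': ['o6'], 'o4': ['o5', 'o8'],
       'o5': ['o6'], 'o6': ['o7'], 'o7': [], 'o8': ['o9'], 'o9': ['o10'], 'o10': []}
defs = {'v1': {'o1'}, 'v2': {'o2'}, 'v3': {'o3'}, 'v4': {'o4'},
        'v5': {'o5'}, 'v6': {'o6'}, 'v7': {'o8'}, 'v8': {'o9'}}
uses = {'v1': {'o2', 'o5'}, 'v2': {'o3'}, 'v3': {'o6'}, 'v4': {'o5', 'o8'},
        'v5': {'o6'}, 'v6': {'o7'}, 'v7': {'o9'}, 'v8': {'o10'}}
print(minimal_interference_edges(ddg, defs, uses))
# -> two edges, {v1, v4} and {v3, v5}, matching the edge count of Figure 5.3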

The minimal interference graph IGmin of Figure 5.1 is shown in Figure 5.3. This graph only contains two interference edges. Both interference edges will be present in any possible code ordering. This graph shows that the code fragment of Figure 5.1 only requires two registers. However, to accomplish this, a false dependence must be added between o10 and o1. This results in a critical path of eight cycles. Note further that in any schedule variable v2 interferes with v1 or v5.

During coloring, the register selection algorithm is made aware of the false dependences it can introduce. The absence of an edge in the minimal interference graph indicates that coloring a pair of nodes with the same color might introduce a false dependence. This only hurts performance when the introduced false dependence is not redundant. The register selection algorithm avoids introducing non-redundant false dependences, if possible. If this is not possible, false dependences are introduced that connect only short paths in order to minimize the increase of the dependence paths in the DDG. The introduction of a false dependence results in changes of the minimal interference graph. Consequently, each time a false dependence is introduced, the minimal interference graph must be recomputed.

Only preliminary results are published, based on two benchmark programs. The results show that the number of interference edges is reduced by 7%–24% and the number of false dependences by 46%–100%. The impact on the execution time of the benchmarks is not listed.


A Scheduler-Sensitive Global Register Allocator [NP93]
Norris and Pollock [NP93] present an approach to build a Scheduler-Sensitive Global register allocator (SSG) based upon Briggs' Optimistic Allocator [Bri92]. The main difference lies in building the interference graph. Norris and Pollock build the interference graph from the data dependence graphs (DDGs) of each individual basic block, rather than from the ordering of the intermediate code. The latter ordering is usually the coincidental result of some earlier compiler phase. The interference graph IGSSG(b) reflects whether two variables interfere given any legitimate code ordering. As a result, IGSSG(b) contains many more interference edges than IGmin(b).

Definition 5.5 The interference graph IGSSG(b) = (Nvar, ESSG) of a basic block b is a finite undirected graph, with Nvar the set of variables and ESSG a set of interference edges: ESSG = EA ∪ EB ∪ EC where:

EA = {(vi, vj) | vi ∈ liveDef(b), vj ∈ liveIn(b) ∧ vj ∈ liveOut(b)}
EB = {(vi, vj) | vi ∈ liveDef(b), vj ∈ liveIn(b) ∧ vj ∉ liveOut(b), ∃(nuse(vj), ndef(vi)) ∉ ET}
EC = {(vi, vj) | vi, vj ∈ liveDef(b), nk ∈ Ndef(vj), nl ∈ Ndef(vi), k < l ∧ ∃(nuse(vj), ndef(vi)) ∉ ET}

where ET is the set of edges of the transitive closure of the DDG, and k < l indicates that node nk precedes node nl in the preordered operation sequence.

The interference graph IGSSG for the code fragment of Figure 5.1 is in this case equal to IGpar and is shown in Figure 5.2b. Both graphs are equal because in this particular situation the set EA is empty.

Although the interference graph now reflects the maximum freedom of code reordering per basic block, the increased number of interferences makes the register allocator's task more difficult, as it will be less able to color the larger interference graph. Norris and Pollock propose to add extra DDG edges in two steps to reduce the number of interferences. First, they add DDG edges prior to building the interference graph. To identify the basic blocks whose DDGs require additional DDG edges, the register requirements per basic block are estimated. In this step, only scheduler-sensitive edges are added to the DDG. An edge is scheduler-sensitive if the schedule generated by the instruction scheduler would contain the edge anyway. This can be the result of other dependences or resource constraints. Adding only scheduler-sensitive edges may not be sufficient to create a colorable graph. To handle these situations, additional DDG edges are added during register assignment in a second step. When no legal coloring can be found, the node of the interference graph is selected with the greatest number of interferences that can be eliminated by adding false dependence edges to the DDG. If there are not enough possibilities to eliminate interferences so that the node will be colorable, no DDG edges are added and the node with the least spill costs is selected for spilling.


[Basic block with liveIn = {v1} and liveOut = {v1}; it contains an operation that uses v1 and defines v2, followed by an operation that uses v2 and defines v1.]

Figure 5.4: Non-interference that results in an interference edge in ESSG.

Their experiments show a significant improvement over global strictly early register assignment for the Livermore loops.

Examination of the presented algorithm shows that the construction of IGSSG is too conservative. The set ESSG contains more edges (and thus interferences) than necessary. This is illustrated in Figure 5.4. Variable v1 is live on entry and exit of the basic block, but will never interfere with variable v2. The set ESSG contains, however, the interference edge (v1, v2), because this edge is a member of the set EA.

Another disadvantage of this method is that only the false dependences within basic blocks are considered. As a result, a region scheduler will be constrained by inter-basic-block false dependences. To extend this method for region scheduling, one should compute IGSSG for a region instead of a basic block. This, however, results in many more interference edges, which makes coloring hard.

5.1.3 Dependence-Conscious Early Register Assignment for TTAs

The previous paragraphs showed that avoiding false dependences, or placing them where they cannot hurt performance, is an important technique to enhance performance. The TTA instruction scheduler operates on regions. Consequently, methods that have the ability to avoid potential false dependences between operations in the same basic block, and between operations in different basic blocks, are required to exploit a large amount of ILP. The early register allocator used in this thesis is based on the work of Pinter [Pin93].


[CFG with basic blocks A through H; the variables v1 and v2 are defined in different basic blocks.]

Figure 5.5: CFG of code fragment.

Instead of the Yorktown Allocator [Cha82], the Optimistic Allocator [Bri92] is used because of its better performance.

Pinter's parallel interference graph may represent interferences that almost never restrict parallelism. This occurs when references of the involved variables are far apart in the code and thus will never be scheduled in parallel. This is illustrated in Figure 5.5, which shows the CFG of a code fragment. A false dependence between the definition of variable v1 and the definition of variable v2 only restricts ILP when both operations are imported into basic block A. However, this is very unlikely because of resource constraints and dependences on other operations. Pinter's parallel interference graph reflects this dependence. This makes coloring hard and may even result in spilling, or in the introduction of more restrictive false dependences.

Based on this observation, the method to build the false dependence graph is slightly modified [Hoo96]. The newly constructed graph is called the forward false dependence graph Gff = (NDDG, Eff). To avoid too many false dependence edges, only potential forward false dependences are recorded. A forward false dependence is a false dependence between operations between which a control flow path exists. Gff is defined as:

Definition 5.6 The forward false dependence graph Gff(P) = (NDDG, Eff) of a procedure P is a finite undirected graph, with NDDG the set of operations and Eff the set of forward false dependence edges: Eff = Ef − {(ni, nj) | bb(ni) ¬doms bb(nj) ∧ bb(nj) ¬doms bb(ni) ∧ ni, nj ∈ NDDG}, where bb(n) is the basic block of operation n and doms gives the dominance relation between basic blocks.
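The dominance filter of Definition 5.6 is cheap to apply. The sketch below is illustrative only; bb() and the dominance relation are assumed to be available from earlier analyses, and the toy CFG is modelled loosely after Figure 5.5.

# Keep a potential false dependence only when one of the two basic blocks
# dominates the other, i.e. a control flow path exists between the operations.
def forward_false_deps(ef, bb, dominates):
    return {(ni, nj) for ni, nj in ef
            if dominates(bb[ni], bb[nj]) or dominates(bb[nj], bb[ni])}

# Toy CFG: block A dominates B and C, but B and C lie on different control
# flow paths and do not dominate each other.
dom_pairs = {('A', 'A'), ('A', 'B'), ('A', 'C'), ('B', 'B'), ('C', 'C')}
dominates = lambda x, y: (x, y) in dom_pairs
bb = {'def_v1': 'B', 'def_v2': 'C', 'use_v3': 'A'}
ef = {('def_v1', 'def_v2'), ('use_v3', 'def_v2')}
print(forward_false_deps(ef, bb, dominates))
# -> {('use_v3', 'def_v2')}: the edge between blocks B and C is dropped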


The forward parallel interference graph is constructed by combining the edges of Eff and Einterf in the same way as Pinter suggests.

Definition 5.7 The forward parallel interference graph IGfpar(P) = (Nvar, Efpar) of a procedure P is a finite undirected graph, with Nvar the set of variables and Efpar the set of forward parallel interference edges: Efpar = Einterf ∪ Effdp. The set of forward false dependence prevention edges Effdp is defined as Effdp = {(vi, vj) | (nref(vi), nref(vj)) ∈ Eff ∧ vi, vj ∈ Nvar ∧ nref(vi), nref(vj) ∈ NDDG}.

The forward parallel interference graph IGfpar for the code fragment of Figure 5.1 is in this case equal to IGpar, because the false dependences do not cross basic block boundaries. However, the forward parallel interference graph for a complete procedure reflects fewer potential false dependences than the original parallel interference graph as proposed by Pinter. Consequently, the forward parallel interference graph is easier to color. The following observation also justifies the choice to consider only forward false dependences. To achieve high performance, the operation selection heuristics of the instruction scheduler will favor operations from control flow paths with a high execution probability. Because most branches are biased towards one direction [PSM97], operations will be selected from the same control flow path. Therefore, false dependences between operations in the same control flow path hurt the attainable performance more than false dependences between operations in different control flow paths.

When insufficient registers are available, the register allocator has to decide whether to spill a variable or to add a false dependence. The method proposed in [Hoo96] always chooses the latter if possible. In order to decide which false dependence to introduce, each interference edge is augmented with a weight. This weight reflects the possible negative effect on performance when two variables vi and vj are assigned to the same register.

Wffdp(vi, vj) = max( f(mref(vi)) · Pr(mdef(vj) | mref(vi)) / dist(mref(vi), mdef(vj)) )    (5.1)

The move mref(vi) uses or defines variable vi and the move mdef(vj) defines variable vj. The weight is proportional to the execution frequency f of mref(vi) and to the probability that mdef(vj) will be executed after mref(vi). The weight is inversely proportional to the number of moves (the distance) between mref(vi) and mdef(vj) in the sequential code. Because there can be multiple combinations of potential false dependences between two variables, the maximum weight over all combinations is taken. A high weight indicates that avoiding the associated false dependence is important and will probably result in a higher performance of the generated schedule. When the register allocator cannot find a proper node coloring, it introduces the false dependence with the lowest weight.
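A sketch of how the weight of Equation 5.1 steers the allocator is given below. It is illustrative only; the profile frequency f, the transition probability Pr and the move distance are assumed to be supplied by earlier analyses.

# Weight of a potential false dependence between vi and vj (Equation 5.1):
# for every combination of a move referencing vi and a move defining vj we
# compute freq * prob / distance, and take the maximum.
def w_ffdp(combinations_):
    return max(freq * prob / dist for freq, prob, dist in combinations_)

def choose_false_dependence(candidates):
    # candidates: {(vi, vj): [(freq, prob, dist), ...]}
    # The allocator introduces the false dependence with the *lowest* weight,
    # i.e. the one least likely to hurt the generated schedule.
    return min(candidates, key=lambda pair: w_ffdp(candidates[pair]))

candidates = {
    ('v1', 'v5'): [(100, 0.9, 2)],     # hot and nearby: weight 45.0, avoid
    ('v2', 'v7'): [(10, 0.1, 40)],     # cold and far apart: weight 0.025, acceptable
}
print(choose_false_dependence(candidates))   # ('v2', 'v7')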

In the remainder of this thesis, the method described in this section is referred to as DCEA or Dependence-Conscious Early Assignment.


5.1.4 Discussion, Experiments and Evaluation

In the previous sections, various dependence-conscious register assignment strategies were described. The round-robin approach distributes the registers in a round-robin fashion over the variables, hoping that false dependences are not introduced. This is not a very constructive method. All other discussed methods use the DDG to identify whether a particular assignment will result in a false dependence. Goodman and Hsu [GWC88] assign registers with the use of a left-edge algorithm, without introducing false dependences. When insufficient registers are available, false dependences are added in such a way that the impact on the total schedule is minimized. This method can only be applied to straight-line code. The method proposed by Norris and Pollock [NP93] is too conservative, the method proposed by Ambrosch et al. [AEBK94] is computationally intensive, and the method of Pinter [Pin93] results in graphs that are hard to color (the same holds for the method proposed by Norris and Pollock). The method as used by Hoogerbrugge [Hoo96] does not have these problems; however, it may ignore important false dependences.

The approaches of Norris and Pollock, Ambrosch et al., Pinter and Hoogerbrugge are closely related. All are based on graph coloring. The following relations are identified, assuming all methods operate on the same scheduling scope.

Emin ⊆ Einterf ⊆ Efpar ⊆ Epar    (5.2)
Effdp ⊆ Efdp    (5.3)

The graph IGSSG(b) does not consider false dependences between operations in different basic blocks. When we restrict the scheduling scope to a basic block, the following relation is identified.

Emin ⊆ Einterf ⊆ Efpar = Epar ⊆ ESSG    (5.4)

It should be noted that ESSG is the largest set of edges. It even contains edges that are not interference edges. Observe that the sets Emin and Einterf are not equal. The set Emin only contains interferences that are present in all possible code orderings, while Einterf can also contain additional edges from the set Efdp. Note further that edges present in Emin can also be present in Efdp. For example, the edge (v1, v4) in the parallel interference graph of Figure 5.2b originates from the set Efdp (and Einterf) and is also present in the minimal interference graph of Figure 5.3.

Experiments are performed to evaluate the importance of dependence-conscious early register assignment. Strictly early assignment is compared with DCEA using the region scheduler. Global register assignment is used, i.e., registers are assigned for a complete procedure. For all applications in the benchmark suite (see Section 4.1) the performance gain is measured. The results of these measurements for both target TTAs are shown in Figure 5.6.


[Bar chart: performance gain (%) of DCEA over strictly early assignment for the TTAideal and TTArealistic processors, for 10 to 512 integer registers.]

Figure 5.6: Speedup of DCEA compared with strictly early assignment.

The number of integer registers is varied and is listed along the x-axis. The speedup of DCEA is listed along the y-axis. The results clearly show that DCEA outperforms strictly early assignment. The performance gain decreases when the number of registers decreases. When a large number of registers is available, all potential false dependences can be avoided and thus the impact of DCEA is high. When registers are scarce, it is no longer possible to avoid all potential false dependences, and DCEA has to decide which false dependences to avoid or to introduce. Consequently, the false dependences introduced by DCEA result in a smaller performance gain.

It is also interesting to note that the performance gain for the TTAideal processor is larger than the performance gain for the TTArealistic processor. Because strictly early assignment introduces many false dependences, the instruction scheduler is not capable of exploiting the large number of resources provided by the TTAideal processor. As a result, the performance of the TTAideal processor is only slightly higher than the performance of the TTArealistic processor when using strictly early assignment. On the other hand, DCEA leaves many code reordering possibilities to the instruction scheduler. Especially when a large number of registers is available, the instruction scheduler is able to use as many resources of the TTAideal processor as needed. Consequently, the performance difference between the TTAideal and the TTArealistic processor is substantial when using DCEA. Both effects explain the larger performance gain of the TTAideal processor when we compare DCEA with strictly early assignment.

Based on the above discussion, one would expect that using fewer registers would result in a smaller performance gain. However, Figure 5.6 shows a minimum around 14 registers. This counterintuitive effect can be explained by the observation that for some benchmarks, when using strictly early assignment, the performance decreases significantly when the number of registers drops below 14. This is caused by the introduction of a significant amount of spill code.


The short live ranges introduced by spilling cause extra false dependences. Both early assignment methods are faced with this problem. However, DCEA is still capable of avoiding false dependences and of keeping the performance degradation limited.

The performance gain of a dependence-conscious early register assignment method is caused by its ability to identify potential false dependences. However, it is difficult to predict which false dependences are really important. It may happen that a false dependence between two operations is avoided that did not limit the available ILP. In these situations, avoiding a different false dependence would have been more profitable.

An additional problem with early assignment approaches in the context of TTAs is their inability to take advantage of software bypassing and dead-result move elimination. These techniques can eliminate the need for some register file accesses; see for example the following code fragment:

r1 → add.o; r2 → add.t;
add.r → r3;
r3 → sub.o; r4 → sub.t;
sub.r → r5

The scheduled version may turn out not to use register r3, because the result of the addition is bypassed to the subtraction and never used again.

r1 → add.o; r2 → add.t;
add.r → sub.o; r4 → sub.t;
sub.r → r5

From [Hoo96] it is known that more than 35% of the register file accesses are eliminated. The registers assigned to these variables could be used, when this was known in advance, to avoid spilling or to avoid the introduction of false dependences. An early assignment register allocator has, however, no idea which register references will be eliminated by the scheduler. Whereas software bypassing decreases the register pressure in the scheduled code, the early assignment method cannot exploit this advantage.

5.2 Late Register Assignment

When register assignment is performed after instruction scheduling, i.e., late register assignment or pre-pass scheduling, the scheduler, uninhibited by false dependences, can generate an efficient schedule. However, the instruction scheduler, in its attempt to reorder instructions to maximize ILP, may lengthen the live ranges of values and thus increase the contention for registers. If not enough registers are provided by the target processor, data is written to memory, introducing spill code which itself also requires registers. The increase in ILP can be nullified by the amount of spill code.


o1:  ld  v1             o4:  ld  v4
o2:  div v2, v1, #3     o8:  mul v7, v4, #3
o3:  add v3, v2, #10    o5:  mul v5, v1, v4
o6:  add v6, v5, v3     o9:  add v8, v7, #2
o7:  st  v6             o10: st  v8

a) Schedule for a 2-issue processor.


b) Associated interference graph.

Figure 5.7: Late assignment example.

In Section 5.2.1, the limitations of late register assignment in relation to ILP are described. Section 5.2.2 discusses various approaches proposed in literature to solve the problems related to ILP and late assignment. Section 5.2.3 presents the late register allocator as implemented in the TTA compiler back-end and Section 5.2.4 gives an evaluation of late assignment in the context of TTAs.

5.2.1 ILP and Late Register Assignment

Performing instruction scheduling prior to register assignment has the advantage that the available ILP can be exploited without constraints imposed by register assignment. However, applying late assignment may result in a large register pressure, since multiple variables become live simultaneously.

Let's return to the example of Figure 1.4b. Figure 5.7a shows the scheduled version of the code fragment for a 2-issue processor, when instruction scheduling is uninhibited by register assignment. The code fragment executes in five cycles. The associated interference graph is shown in Figure 5.7b. As can easily be seen, at least four registers are required to color the graph. The scheduler has reordered the code in such a way that the register pressure is increased compared to the sequential code of Figure 1.4b. Note that switching the operations o5 and o8 results in a register pressure of three. From the instruction scheduler's point of view, both schedules are equally good; however, the latter results in a better overall schedule when only three registers are available.
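The register pressure argument can be checked mechanically. The sketch below is illustrative only; it is not the compiler's liveness analysis, it ignores values that are live across block boundaries, and the def/use encoding of the schedules is an assumption. It counts how many variables are live across each cycle boundary, assuming a register can be read and written in the same cycle.

# Maximum register pressure of a basic block schedule.  Each cycle is a list of
# (defs, uses) pairs, one pair per operation issued in that cycle.
def max_pressure(schedule):
    first_def, last_use = {}, {}
    for cycle, ops in enumerate(schedule):
        for defs, uses in ops:
            for v in uses:
                last_use[v] = cycle
            for v in defs:
                first_def.setdefault(v, cycle)
    # A variable occupies a register on every boundary between its definition
    # cycle and its last use (read and write may share a cycle).
    return max(sum(1 for v in first_def
                   if first_def[v] <= c < last_use.get(v, first_def[v]))
               for c in range(len(schedule) - 1))

# The 2-issue schedule of Figure 5.7a ...
fig_57a = [
    [({'v1'}, set()), ({'v4'}, set())],          # o1, o4
    [({'v2'}, {'v1'}), ({'v7'}, {'v4'})],        # o2, o8
    [({'v3'}, {'v2'}), ({'v5'}, {'v1', 'v4'})],  # o3, o5
    [({'v6'}, {'v3', 'v5'}), ({'v8'}, {'v7'})],  # o6, o9
    [(set(), {'v6'}), (set(), {'v8'})],          # o7, o10
]
# ... and the variant with o5 and o8 interchanged.
fig_57a_swapped = [
    [({'v1'}, set()), ({'v4'}, set())],          # o1, o4
    [({'v2'}, {'v1'}), ({'v5'}, {'v1', 'v4'})],  # o2, o5
    [({'v3'}, {'v2'}), ({'v7'}, {'v4'})],        # o3, o8
    [({'v6'}, {'v3', 'v5'}), ({'v8'}, {'v7'})],  # o6, o9
    [(set(), {'v6'}), (set(), {'v8'})],          # o7, o10
]
print(max_pressure(fig_57a))          # 4
print(max_pressure(fig_57a_swapped))  # 3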

When only three registers are available, the schedule shown in Figure 5.7a requires spill code.


ld  r1            ld  r3a
st  r3a
div r2, r1, #3    mul r3b, r3a, #3
st  r3b
ld  r3a
add r2, r2, #10   mul r1, r1, r3a
ld  r3b
add r1, r1, r2    add r2, r3b, #2
st  r1            st  r2

a) Schedule with inserted spill code.

ld  r1            ld  r3a
st  r3a
div r2, r1, #3    mul r3b, r3a, #3
st  r3b           ld  r3a
add r2, r2, #10   mul r1, r1, r3a
ld  r3b
add r1, r1, r2    add r2, r3b, #2
st  r1            st  r2

b) Rescheduled code.

Figure 5.8: Late assignment and spilling.

The spill code generated by the register allocator is inserted in the already scheduled code, as shown in Figure 5.8a. Inserting new instructions into the compacted code could violate the constraints under which the code was originally scheduled (this certainly holds for TTAs, as we will see later). Rescheduling is usually applied to efficiently integrate the spill code within the schedule, see Figure 5.8b. However, rescheduling may rearrange the code completely. The false dependences introduced by the late register allocator may restrict the new code reordering. This may lead to less efficient schedules. Instead of adding spill code to the already scheduled code, Sweany and Beaty [SB90] proposed to add the spill code to the original unscheduled code without assigning registers to variables. The resulting code is scheduled again and may result in efficient code since the scheduler is never hindered by false dependences. Rescheduling, however, does not guarantee that the newly scheduled code is colorable. Consequently, additional spill code is inserted. This process is repeated until a legal register assignment is found.
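The overall structure of this iterative strategy can be sketched as follows. This is hypothetical driver code: schedule, allocate_registers and insert_spill_code stand in for the real compiler phases and are not part of the back-end's actual interface.

# Sweany-and-Beaty-style late assignment: spill code is always added to the
# unscheduled code, which is then scheduled from scratch, so the scheduler is
# never hindered by false dependences or by previously compacted code.
def late_assignment(sequential_code, num_registers,
                    schedule, allocate_registers, insert_spill_code):
    while True:
        scheduled = schedule(sequential_code)
        assignment, spilled_vars = allocate_registers(scheduled, num_registers)
        if not spilled_vars:
            return scheduled, assignment            # legal coloring found
        sequential_code = insert_spill_code(sequential_code, spilled_vars)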

Sweany and Beaty [SB90] also observed that the insertion of state preserving code is difficult in late assignment approaches. This causes a phase ordering problem because, until after register assignment, it is not known how many registers need to be saved and restored. The method as proposed for inserting spill code does not apply, because scheduling of state preserving code can lead to other register usage patterns. This may result in the need for more, fewer or different state preserving code.


The solution they propose is to consider all registers callee-saved. After register assignment, new entry and exit basic blocks are added to each procedure. The necessary callee-saved code is inserted in the newly created basic blocks and is scheduled using a local scheduler.

5.2.2 Register-Sensitive Instruction Scheduling Strategies

The simplest late assignment approach is to apply instruction scheduling and register assignment in two separate phases without any form of communication between the two phases; this is called strictly late assignment. This implementation is quite straightforward, but the instruction scheduler may produce schedules that require more registers than available. The approaches discussed in this section add extra heuristics to the instruction scheduler in order to reduce the register pressure.

Integrated Prepass Scheduling [GWC88]
Goodman and Hsu presented a method, called integrated prepass scheduling (IPS), which performs late register assignment but attempts to restrict the number of concurrently live local variables by giving each basic block a register limit. The register limit places an upper bound on the number of live local variables, thus limiting the amount of ILP that the local instruction scheduler may exploit. The instruction-based list scheduler selects operations to exploit ILP, unless the number of live local variables is greater than or equal to the given limit. The scheduler then tries to schedule operations that reduce the number of simultaneously live local variables. By keeping track of the number of available registers, the scheduler can choose the appropriate scheduling technique to produce a better code sequence. The initial number of available registers per basic block is determined by the total number of registers minus the number of global registers live-on-entry of the basic block. The method combines two scheduling techniques, one to exploit ILP and the other to minimize register usage, into a single phase. After scheduling, a local register allocator assigns registers to the variables within the basic block; spills to memory are inserted if the limit could not be met. The proposed method assumes that global variables are already assigned to registers. Furthermore, no attempt is made to schedule the inserted spill code efficiently.
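The core of IPS is the switch between the two selection heuristics. The sketch below is illustrative only; the ready set, the priority function and the live-variable bookkeeping are assumptions, not Goodman and Hsu's implementation.

# Pick the next operation for the current cycle.  As long as the basic block's
# register limit is not reached, schedule for ILP; once it is reached, prefer an
# operation whose operands die, i.e. one that reduces the number of live locals.
def pick_operation(ready, live_locals, register_limit, ilp_priority, kills_last_use):
    if len(live_locals) < register_limit:
        return max(ready, key=ilp_priority)
    releasing = [op for op in ready if kills_last_use(op, live_locals)]
    if releasing:
        return max(releasing, key=ilp_priority)
    return max(ready, key=ilp_priority)        # nothing reduces pressure; fall back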

The reported results are based on a limited set of benchmarks (the first twelve Livermore loops) and showed improvements in the order of 15% for 10 registers compared to strictly late assignment. The method heavily depends on the availability in the ready set of operations that can decrease the number of simultaneously live registers. Goodman also observed that late assignment often resulted in register spilling. Therefore, the scheduled programs had significantly larger sizes than programs produced with an early assignment approach. This makes late assignment less suitable for application specific processors, since code size is often a critical design parameter.


A variation of Integrated Prepass Scheduling [BEH91a]
The method proposed by Bradlee, Eggers and Henry is a variant of Goodman and Hsu's Integrated Prepass Scheduling (IPS) [GWC88]. The main improvement is the use of a global register allocator. To model reserved registers for important global variables, Bradlee's IPS sets the local register limit for each basic block to the maximum number of available registers. This limit is reduced by the number of unique global variables referenced within the basic block, instead of assigning global variables prior to running IPS as done in [GWC88]. Instruction scheduling is applied per basic block. It tries to generate code within the local register limit. After scheduling, the variables in the scheduled code are assigned to registers by Chaitin's global register allocator [Cha82]. A local post-pass scheduler is invoked after register assignment to ensure that spill code is scheduled as well as possible. The reported results showed that for 32 registers this variation of IPS produces code that is on average 13% faster than strictly early assignment.

Like the method of Goodman, this method also applies instruction scheduling per basic block. The reported results show that the proposed method produces code that is on average faster than a strictly early assignment method. However, the results are based on a comparison with a non-dependence-conscious early register allocator.

The (α, β)-Combined Heuristic [MPSR95]
Motwani et al. propose a heuristic that combines controlling register pressure with instruction-level parallelism considerations. Prior to scheduling, an ordering of the operations is determined. The priority function that determines the ordering consists of two parts: (1) the schedule rank γS, for which the priority function of any good list scheduler can be chosen, and (2) the register rank γR, defined as:

γR(oi) = min over oj ∈ succv(oi) of max{ Lpath(oi, oj), Tpath(oi, oj) / |FUset| },  ∀ oi ∈ NDDG    (5.5)

where Lpath(oi, oj) represents the distance in the DDG between the operations oi and oj, Tpath(oi, oj) is defined as the total path length in the DDG of all paths from oi to oj, succv(oi) is the set of all successors of oi in the DDG that read a variable v which is referenced by oi, and |FUset| represents the number of function units. The register rank is zero when succv(oi) = ∅. This priority function favors operations that use variables with short live ranges. Scheduling these uses close to their definition reduces the register pressure.

The combined rank function γ = α · γS + β · γR orders the operations into a list in increasing order of rank. The algorithm is tuned with the parameters α and β, which obey the equality α + β = 1. Without worrying about the register bound, a greedy local list scheduling algorithm uses the ordered list to obtain a schedule. Afterwards, the code is checked for over-use of registers and spill code is inserted.
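A sketch of the combined rank is given below; it is illustrative only, and γS and γR are assumed to have been computed per operation beforehand, γR as in Equation 5.5.

# Order operations by the combined rank gamma = alpha * gamma_s + beta * gamma_r,
# with alpha + beta = 1; a greedy list scheduler then consumes the list in order.
def combined_order(ops, gamma_s, gamma_r, alpha, beta):
    assert abs(alpha + beta - 1.0) < 1e-9
    return sorted(ops, key=lambda op: alpha * gamma_s[op] + beta * gamma_r[op])

ops = ['o1', 'o2', 'o3']
gamma_s = {'o1': 1.0, 'o2': 2.0, 'o3': 3.0}   # ILP-oriented list scheduling rank
gamma_r = {'o1': 4.0, 'o2': 0.0, 'o3': 1.0}   # register rank of Equation 5.5
print(combined_order(ops, gamma_s, gamma_r, alpha=1.0, beta=0.0))  # ['o1', 'o2', 'o3']
print(combined_order(ops, gamma_s, gamma_r, alpha=0.5, beta=0.5))  # ['o2', 'o3', 'o1']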


r1 → add1.o; r2 → add1.t;
r4 → add2.o; r5 → add2.t;
add1.r → r3;
add2.r → r6;

r1 → add1.o; r2 → add1.t;
r4 → add2.o; r5 → add2.t;
...
add1.r → r3;
add2.r → r6;

a) Original schedule. b) Schedule with inserted instruction.

Figure 5.9: Instruction insertion problem.

Instead of verifying the algorithm with real programs, Motwani verifies the algorithm with randomly generated DDGs of unrealistically large basic blocks (100 operations). The experiments showed that the (α, β)-combined heuristic outperforms strictly late assignment by 16% and strictly early assignment by 4% on average when 16 registers are available.

In contrast to IPS, this method does not switch abruptly between selection priorities, but uses a smoother transition to decide whether reducing register pressure is more important than increasing ILP. However, the proposed method accomplishes this by assigning a static priority to the operations prior to scheduling; it does not adapt its operation selection criteria during scheduling. Since the results were obtained using synthetic benchmarks, they cannot be straightforwardly extrapolated to real programs.

5.2.3 Register-Sensitive Instruction Scheduling for TTAs

In the context of TTAs, early assignment has the drawback that it is unable to exploit the registers saved by software bypassing and dead-result move elimination. Late assignment, however, can exploit these TTA-specific optimizations. Since fewer variables exist in the scheduled code, it is more likely that a legal register assignment is found. This results in an interesting observation. Late assignment for TTAs results in more spill code because of the increase of the live spans, but on the other hand it reduces the number of spills due to the ability to exploit the benefits of software bypassing and dead-result move elimination.

Unfortunately, late assignment has additional problems in the context of TTAs. The first problem is called the instruction insertion problem [Cor98]. Because operation latencies are visible, the insertion of instructions into already scheduled code may violate the constraints under which the code was originally scheduled. This is shown in Figure 5.9. Assume both additions are executed on the same VTL pipelined FU, which has a latency of two cycles. Inserting an instruction as is done in Figure 5.9b has as a consequence that both operations produce the same result. This occurs because the result register in the FU is overwritten by the second operation before the first operation could read it. Note that this problem does not arise when hybrid pipelined FUs are used.

Independent of the type of FU pipelining, there is another problem related to inserting spill code; this problem is illustrated in Figure 5.10a.


r1 → ld.t;
v1 → add.o;
ld.r → r3;

r1 → ld.t;
spill address → ld.t;
ld.r → add.o;
ld.r → r3;

a) Schedule prior to spilling. b) Schedule after spilling.

Figure 5.10: Spill code insertion for TTAs.

Suppose variable v1 must be spilled. In this case, a load operation is required to read the value of v1 from memory. However, simply inserting a load operation will result in an invalid schedule, see Figure 5.10b. The operation is inserted in the middle of another operation. This disturbs the pipeline of the FU; the original load will receive a value from an incorrect memory location. Note that this problem can be alleviated when both load operations are executed on different FUs. This requires, however, a TTA that has at least one free FU that can perform a load operation.

The last problem related to the insertion of spill code is rescheduling. In contrast to conventional processors, where after register assignment and spilling all variables reside in registers or memory, in TTAs variables can also be hidden by software bypassing and dead-result move elimination. The hidden variables might reappear in the rescheduled code; however, they are not mapped onto a register. Rescheduling might also have the opposite effect: variables that were mapped onto registers become hidden because of software bypassing and dead-result move elimination. A solution to this problem is to use the approach described in [SB90]. After register assignment, spill code is inserted in the original unscheduled code. To the unscheduled code, no software bypassing and dead-result move elimination optimizations are applied. Consequently, the code with the inserted spill code can be scheduled in the same way as the original code. This process is repeated until no spill code is required. Note that schedulers with large scheduling scopes (extended basic block schedulers) rearrange code drastically, and due to the insertion of spill code the schedule will not resemble the first version. Spill code inserted in the early iterations might be void in the eventual schedule. Another drawback is the long compile time needed to generate the schedule. Despite the disadvantages mentioned above, the proposed method is implemented in the TTA compiler back-end because it seems to be the only valid approach to generate code in the context of late assignment.

Inserting state preserving code poses another problem. As was observed in [SB90], the insertion of caller- and callee-saved code is in itself a phase ordering problem. The solution proposed by Sweany is to save all used registers on entry and on exit of a procedure by adding extra basic blocks. Thus, all registers are callee-saved. This approach is also applicable to the TTA late assignment approach. Note, however, that this may lead to inefficient schedules, since it is no longer possible to make a trade-off between caller- and callee-saved registers.


Furthermore, the extra inserted basic blocks are scheduled by a local scheduler. This leads to a reduction of the exploitable ILP.

To prevent the insertion of too much spill code, we tried to limit the greediness of the scheduler. In the previous section, register sensitive scheduling methods were described that keep track of the number of available registers. The scheduler switches between a selection heuristic that exploits ILP and a heuristic that favors operations that reduce the register pressure. Key to these approaches is an accurate estimate of the registers required when selecting an operation for scheduling. This can easily be done in an instruction-based list scheduler; however, for an operation-based list scheduler, as used in our compiler back-end, this is problematic. Each instruction in the partial schedule has its own register limit. Because it is not known in which instruction an operation will be scheduled, it is also not known which register limit to use. In our experiments, we favored operations that free registers when it turned out that scheduling the most critical operation in the ready set would increase the register pressure above a certain threshold. Unfortunately, on average no performance improvements were found. This has various reasons: (1) it is not possible to make a good register pressure estimate, (2) the ready set is small, which reduces the probability of finding a register pressure reducing operation, (3) changing the priority function, which favors operations in the critical path, has a negative impact on performance, and (4) floaters. Floaters are operations that do not depend on other operations in a basic block and, when unhindered by false dependences, float to the top of a basic block. When these floaters define variables, they increase the register pressure.

5.2.4 Experiments and Evaluation

In the previous sections, various register-sensitive instruction scheduling strategies were described. These strategies try to limit the register pressure when scheduling. The methods known from literature are all based on instruction-based list scheduling. Scheduling for TTAs, however, requires operation-based list scheduling. Our experiments showed that applying the methods for instruction-based list scheduling in a TTA instruction scheduler did not result in any improvement.

Most of the methods known from literature can only operate on single basic blocks. Our experiments indicate that DCEA outperforms late assignment by 5% on average when using a local scheduler. However, in order to exploit larger amounts of ILP, larger scheduling scopes should be considered as well. The performance gain of DCEA over late assignment, when using region scheduling, is shown in Figure 5.11. The performance penalty of late assignment is large. The scheduler in a late assignment strategy is not constrained by false dependences. It imports as many operations as permitted by other resource constraints. This results in a huge number of variables that are simultaneously live.


[Bar chart: performance gain (%) of DCEA over late assignment for the TTAideal and TTArealistic processors, for 10 to 512 integer registers.]

Figure 5.11: Speedup of DCEA compared with late assignment in the contextof region scheduling.

When not enough registers are available, this results in a large amount of spill code. The performance difference between the two benchmark TTA processors can be explained as follows. For the TTArealistic processor, due to resource constraints other than registers, the scheduler cannot apply importing as aggressively as for the TTAideal processor. Consequently, the schedules generated for a TTAideal processor contain more simultaneously live variables. This results in a higher register pressure and thus more spill code is required, which reduces the performance. Despite our efforts to find heuristics to improve the performance of late assignment, early assignment still generates faster executing code. This behavior of late assignment was also observed in [BSBC95], where it was stated that there are times when spilling dramatically increases the execution time well beyond any scheduling gain obtained by late register assignment.

5.3 Integrated Register Assignment

As we have seen in the previous sections, the division of instruction scheduling and register assignment into separate phases can affect the performance of these tasks and thus the quality of the generated code. Both discussed approaches, early and late assignment, have problems. In effect, the lack of communication and cooperation between the instruction scheduler and the register allocator can result in code that contains excess register spills and/or a lower degree of instruction-level parallelism than possible. Improved performance in one phase can deteriorate the performance of the other phase, possibly resulting in poorer overall performance.

In this section, various approaches from literature are discussed that address integrated register assignment and instruction scheduling.



These approaches can be divided into three classes. The first class invokes register assignment and instruction scheduling multiple times for a complete procedure. These methods are discussed in Section 5.3.1. The second class uses integer linear programming to address the phase ordering problem [EGS95, CCK97]. We chose not to discuss them because these methods tend to have long compilation times. Their practical use is limited to straight-line code or small loops only. The third class truly integrates both phases into a single phase; these approaches are discussed in Section 5.3.2.

5.3.1 Interleaved Register Assignment

Interleaved register assignment and instruction scheduling strategies apply both phases multiple times to get correct estimates of the constraints imposed by one phase on the other. Only a few approaches are known which apply this strategy. This is probably caused by the long compilation times required for these methods.

Register Allocation with Schedule Estimates [BEH91a]

The strategy proposed by Bradlee et al. [BEH91a], called RASE (Register Allocation with Schedule Estimates), consists of three steps. The first step performs local early register assignment multiple times, followed by local instruction scheduling with a varying number of registers. A schedule cost estimate is computed for each number of registers. This schedule cost estimate is the estimated number of cycles required to execute the basic block while remaining within a certain register limit. In the second step, a global register allocator (based on the work of Chaitin [Cha82]) partitions the register set for each basic block into two sets. One set is used for global variables and the other set is the basic block's register limit. Based on the spill costs and the schedule cost estimate, the global register allocator determines the appropriate balance between these two competing needs for registers. The third step schedules each basic block and assigns registers to variables within the basic block's register limit in the same way as suggested in [GWC88]. Spill code is inserted when insufficient registers are available for the local live ranges. The reported results showed that code produced by RASE is on average 8% faster than strictly early assignment for Intel's i860. A drawback of RASE is that it can only be applied to basic block scheduling, because otherwise the register limit per basic block loses its meaning.

Combining Register Assignment Interference Graphs [BSBC95]

Brasier et al. [BSBC95] describe in their paper a framework, called CRAIG (Combining Register Assignment Interference Graphs), that combines register assignment and instruction scheduling to tackle the phase ordering problem. CRAIG first performs late assignment. The generated schedule is not hindered by register assignment and exploits all available ILP. The register allocator is invoked to compute the late interference graph. The generated schedule is accepted when no spilling is required.



Otherwise, CRAIG constructs an early interference graph for the original unscheduled code. This graph generally has fewer interference edges than the late interference graph. When spill code is needed to color the early interference graph, CRAIG inserts spill code and invokes the scheduler, based upon the assumption that this is the best that can be done under the circumstances. Otherwise, it assumes that it is likely that false dependences have been added by the register allocator, and thus, that the resulting schedule can be improved. CRAIG will attempt to reclaim some of this lost efficiency by removing as many of the false dependences as possible, up to the point where spilling is needed. By adding edges to the early interference graph that are found exclusively in the late interference graph, CRAIG creates interference between those values which the scheduler forced to be in different registers. If they were mapped onto the same register in the early interference graph, then a false dependence that potentially inhibits a more efficient schedule is identified and removed.

The method is applied to a limited set of benchmarks. When registers are scarce, the results showed an average performance increase of 6.7% compared to strictly early assignment, and 3.9% compared to strictly late assignment. The described method uses a random approach towards selecting edges that can remove false dependences. It is observed that more accurate selection criteria must be found to increase the performance. The proposed algorithm stops removing false dependences when spill code is required. As a result, it can occur that not all false dependences that prevent the generation of an efficient late assignment schedule are removed. This may result in a schedule in which false dependences that were not recorded originally restrict parallelism, because the scheduling order of the operations differs from the order that was used to compute the late interference graph.

5.3.2 Integrated Instruction Scheduling and Register Assignment

There are obvious pros and cons to doing register assignment early or late. Because register assignment and instruction scheduling are antagonistic, it seems profitable to merge both phases into a single step. In the past, the integration of instruction scheduling and register assignment has been considered too complicated [BEH91a]. However, due to the ongoing research to exploit more and more ILP, the register pressure will increase and registers should be assigned in such a way that the exploitation of ILP is not hindered.

A Unified Resource Allocator [BGS93]

The URSA (Unified ReSource Allocator) presented by Berson et al. [BGS93] unifies the problems of allocating registers and function units. This technique operates on the DDG of the program. The purpose of the URSA is to modify the DDG in such a way that its resource requirements cannot exceed the capacity of the target machine. Therefore, it is only concerned with the allocation of resources, and not their actual assignment.



The first phase carries out the measurement of resource requirements and identifies regions with excess requirements. A so-called re-use directed acyclic graph (DAG) is constructed for every resource. These DAGs are used to determine the maximum number of resources of a specific type needed to obtain an optimal schedule. The second phase applies transformations to the DDG that reduce the requirements to levels supported by the target machine. These transformations add sequential dependence edges to the DDG that remove excess resource requirements. The transformation that is best with respect to the combination of minimizing the critical path and reducing excess requirements is selected and applied. These extra edges introduce sequentiality, i.e. they reduce the exploitable ILP. When it is not possible to reduce the register requirements by adding sequential dependence edges, spill code is introduced. Resource assignment and instruction scheduling follow the DDG transformations. The assignment phase is also responsible for handling any excessive requirements that were not identified by URSA's heuristics. URSA requires a large number of representations to expose the availability of resources. Furthermore, no experimental results are presented to give an indication of the method's effectiveness.

The URSA is defined for local scheduling. In [BGS94] resource spackling, an extension of the URSA which also supports global code motion, is presented. Resource requirement measurements are used for finding areas where resources are either under- or over-utilized, called resource holes and excessive sets, respectively. Conditions for code motion are established to increase the resource utilization in the resource holes and to decrease the resource requirements in excessive sets. These conditions are applicable to both local and global code motion. The results are, however, disappointing: the improvement of global over local instruction scheduling is on average 5.5%, while other approaches have shown much larger improvements (e.g. 135%, see Section 4.3).

Integrated Register Assignment in the Bulldog Compiler [Ell86]

The approach described by Ellis [Ell86] integrates register assignment and trace scheduling. Registers are assigned to variables by an instruction-based list scheduler as it produces code for a trace. Since trace scheduling starts scheduling on the crucial traces first, the trace scheduler, which uses a pool of registers, takes as many registers from the pool as it requires. When the trace is scheduled, the register locations of the variables are recorded at every entry and exit of the trace. Later traces adjoining the exits and entries are advised to use these locations. Traces are scheduled as independent entities; therefore it is not always possible to keep a variable in the same register in all traces. To guarantee correct execution, repair code is required.

Ellis showed that trace scheduling makes it hard to manage registers effectively; a register written to an operand of an operation must be considered occupied until the operation has written its result. When this restriction is not respected, the code will execute incorrectly when an operation with a multi-cycle latency is bisected by joining or splitting traces.



This restriction implies that the live ranges of the operands and results of the same operation always interfere when the operation is scheduled for execution on an FU with multiple pipeline stages (i.e. the latency is larger than one). When the pipeline is fully utilized, 2(d−1) extra registers are required for an FU pipeline of d stages. This really becomes a bottleneck on processors that can execute several multi-cycle operations in parallel, which is not uncommon.
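As a small worked example (the number of FUs is hypothetical): for an FU pipeline of d = 3 stages, 2(d − 1) = 4 extra registers are needed when the pipeline is kept full; on a processor with, say, four such FUs operating in parallel, this already ties up 16 registers.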

The method as proposed by Ellis does not include the insertion of state preserving code and spill code. However, a remark is made that an operation-based list scheduler probably outperforms the instruction-based list scheduler in the context of spilling. The operation-based list scheduler can look back into the already generated schedule and can schedule spill code as early as possible. The instruction-based list scheduler is always constrained to schedule newly generated code after or in the current instruction being scheduled.

No heuristics were presented to prevent the scheduler from being too greedy. Consequently, it will easily over-utilize the available registers. Another potential performance bottleneck is the amount of inserted repair code. No measurements are provided to evaluate the performance of this approach.

Trace Scheduling as a Global Register Allocation Framework [FR91]

Freudenberger and Ruttenberg [FR91] observe that registers are often the most critical instruction scheduling resource. To manage them well, they describe how global register assignment is integrated into trace scheduling in the MultiFlow compiler [L+93, SS93]. The scheduler drives the register assignment process to place the variables referenced within the heavily-used traces in registers. The article does not discuss the assignment of registers to variables inside the traces1, but merely presents the communication required to keep an assignment consistent between traces. Since traces have multiple entry and exit points, repair code is inserted to obtain correct programs. When other, less crucial, traces hook up to this trace, extra analysis is needed to check whether a variable is allocated to the same register in all traces. When this is not true, extra code is inserted to make the necessary corrections.

Freudenberger and Ruttenberg observed that repair code for register assignment purposes alone already contributed 5% to the total operation count. They compared their results to two other processors (with other compilers) by counting the executed operations. It is shown that the proposed method is competitive with both other approaches. However, comparing compilers from other vendors on other architectures is difficult; results cannot be generalized without listing the algorithms used by the other compilers.

Instruction Scheduling for TriMedia [HA99]

In [HA99], Hoogerbrugge and Augusteijn describe the compiler for the TriMedia VLIW mediaprocessor family. The operation-based list scheduler operates on decision trees. In practice, decision trees are often too small to contain sufficient ILP, especially in control intensive applications. Grafting is used to remove decision tree boundaries by duplicating join-point basic blocks.

1The assignment inside traces is based on the work of Ellis [Ell86]



In order to support speculative execution, guards are assigned to operations when required. The register allocator is split into two parts: a global and a local register allocator. To support this division, the registers are divided into local and global registers. The global variables are assigned using a graph coloring based algorithm prior to instruction scheduling. A local integrated register allocator assigns the variables local to a decision tree. A register is assigned to a variable as soon as the definition is scheduled. Since the scheduler uses decision trees as a scheduling scope, all live ranges are tree-shaped (a single definition per live range) and definitions are scheduled before their uses are scheduled. This greatly simplifies integrated local register assignment.

To keep the register pressure under control, heuristics are used. Hoogerbrugge introduces the notion of floater operations. A floater operation has either none or a single predecessor (which is also a floater) in the DDG and its result is used only once. These operations are called floaters because they tend to float to the top of the decision tree when the list scheduler schedules them as soon as possible. This results in long live ranges and thus in an increased register pressure. Therefore, floaters are handled differently in the TriMedia scheduler. First, they are not placed in the ready set. When a non-floater is scheduled, its preceding floaters, if any, are scheduled as close as possible before it. Unfortunately, no results are given to verify the impact of this idea on performance.
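The floater definition above is simple enough to express directly. The sketch below is a hypothetical illustration of how such operations could be recognized in a DDG; the representation (predecessor lists and a result-use count per node) is an assumption made for this sketch.

#include <cstddef>
#include <vector>

// An operation is a floater if it has no predecessor, or a single predecessor
// that is itself a floater, and its result is used only once.
struct DDGNode {
    std::vector<int> preds;  // indices of predecessor operations
    int resultUses = 0;      // number of uses of this operation's result
};

std::vector<bool> findFloaters(const std::vector<DDGNode>& ddg) {
    std::vector<bool> isFloater(ddg.size(), false);
    bool changed = true;
    while (changed) {  // iterate until the classification stabilizes
        changed = false;
        for (std::size_t i = 0; i < ddg.size(); ++i) {
            if (isFloater[i] || ddg[i].resultUses != 1) continue;
            const auto& preds = ddg[i].preds;
            bool floater = preds.empty() ||
                           (preds.size() == 1 && isFloater[preds[0]]);
            if (floater) {
                isFloater[i] = true;
                changed = true;
            }
        }
    }
    return isFloater;
}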

When the register allocator runs out of registers, variables are spilled. Because a register is required between the operation and the actual spill and reload operations, a few registers are reserved. Without these registers the scheduler might get stuck. The insertion of state preserving code is not required since the compiler does not support procedure calls.

The proposed method divides the register set into three sets: global registers, local registers and spill registers. This can lead to an inefficient usage of registers: one set might be under-utilized while another is over-utilized. This will, however, not be a severe problem for the TriMedia processor family since it contains 128 registers. For processors with a smaller number of registers, it is expected to become a problem. Global register assignment is performed prior to instruction scheduling; this may lead to the introduction of false dependences, which limits the performance. Unfortunately, no comparison is made with a conventional approach, such as early assignment, to evaluate the presented method.

5.4 Conclusion

Register assignment and instruction scheduling are antagonistic phases in compilers that exploit ILP. The phase executed first may hinder the other. In theory, if there is an infinite number of resources, early, late and integrated assignment generate code with the same performance.



However, for register accesses to be fast, the size of the register file should be limited. Hence, the question is: how to use a limited set of registers efficiently? This problem was addressed in this chapter. We discussed the problems related to early and late register assignment and TTA related issues. In summary:

• The assignment of registers to variables prior to instruction scheduling may limit the possibilities to reorder operations because of false dependences introduced with the re-use of registers. Recently, some work has been done on the interaction between instruction scheduling and register assignment. In order to avoid the introduction of false dependences, the register allocator is made aware of the code motions the instruction scheduler wants to perform.

• Instruction scheduling uninhibited by constraints imposed by register assignment leads to efficient schedules. Unfortunately, it may also increase the span of live ranges, which leads to excessive spilling. An efficient schedule can lose its achieved degree of ILP when spill code is inserted afterwards. Heuristics are introduced that limit the greediness of the instruction scheduler.

• Early assignment cannot exploit the software bypass and dead-result move elimination advantage of TTAs. Consequently, the resulting code has a lower efficiency caused by wasted registers.

• The insertion of spill code in TTA code when using late assignment is much more complicated than for conventional architectures. This is caused by the software bypassing and dead-result move elimination capability. Furthermore, because operation latencies are visible, the insertion of instructions in already scheduled code may violate the constraints under which the code originally was scheduled.

Today, the problems related to early and late assignment hinder the generation of high performance code. At the same time, ILP compiler techniques are advancing and the available silicon space increases. Both advances allow the execution of an increasing number of operations in parallel to boost performance. The more simultaneously issued operations, the more registers are potentially required. Thus, register assignment and instruction scheduling must be addressed simultaneously in order to maximize ILP.




6 Integrated Assignment and Local Scheduling

The goal of register assignment is to map the variables of a program as efficiently as possible to the set of registers of a processor, to obtain fast programs and to minimize the number of executed memory accesses.

The task of the instruction scheduler is to order the instructions in such a way that the execution time of a program is minimized. Both the instruction scheduler and the register allocator have the same goal: minimizing the execution time of a program. However, decisions made by one phase can deteriorate the overall performance because they put too many constraints on the other phase.

This chapter describes a global register assignment method integrated within a local operation-based list scheduler. To the best of our knowledge, no integrated approach towards global register assignment and instruction scheduling exists that uses an operation-based list scheduler. The method is described in the context of a basic block scheduler. The next two chapters will discuss extensions of this method for two more aggressive scheduling techniques: region scheduling and software pipelining.

To make a new register assignment approach applicable for use in production compilers, it should incorporate all aspects of register assignment, including spilling and the insertion of state preserving code. Unlike many other researched methods, our integrated method incorporates all these aspects. Therefore, we are able to compile any ANSI C/C++ program, including the SPECint95 benchmarks.

This chapter is structured as follows. Section 6.1 discusses issues related to resource assignment and instruction scheduling. The register assigned to a particular variable is selected from a set of free registers. An important data structure to compute this set is described in Section 6.2. The definition of the set itself is given in Section 6.3.




When insufficient registers are available to hold all variables, spill code is inserted. The insertion of spill code is discussed in Section 6.4. In contrast to other approaches [HA99], our method allows procedure calls. The insertion of code to preserve the state of the program across procedure calls is described in Section 6.5. In Section 6.6, experiments are described to evaluate the new method. The developed integrated assignment method is implemented in the same compiler as the early and late assignment methods. This gives the opportunity to make a fair comparison with early and late assignment. Finally, Section 6.7 states the conclusions.

6.1 Resource Assignment and Phase Integration

The goal of integrating instruction scheduling and register assignment is to generate more efficient code by letting the two phases interact. This complicates register assignment, because variables that were not live simultaneously before a scheduling step can be simultaneously live after this step, and vice versa. In other words, the live relations between the variables can change over time during instruction scheduling. To get insight into the complexity of applying register assignment and instruction scheduling simultaneously, the impact of the assignment of other resources, such as buses and FUs, is compared with the assignment of registers.

To make correct resource assignments it is necessary to collect information about previous assignments. This is accomplished with the use of so-called resource vectors. For example, to record the assignment of buses to moves, a bus availability vector is created for each instruction. When a move m is scheduled in instruction i, the bus assigned to m is set as occupied in the bus availability vector associated with i. Before scheduling another move in instruction i, the scheduler checks the bus availability vector for available buses. The same kind of administration is used for sockets. In terms of register assignment, the live ranges of buses and sockets always span a single instruction.
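As a minimal illustration of this bookkeeping (the class and member names below are hypothetical and not part of the actual back-end), a bus availability vector can be modelled as one occupancy flag per bus per instruction:

#include <cstddef>
#include <vector>

// Sketch of a per-instruction bus availability vector: occupied_[i][b] is
// true when move bus b is already taken in instruction i.
class BusAvailability {
public:
    BusAvailability(std::size_t numInstructions, std::size_t numBuses)
        : occupied_(numInstructions, std::vector<bool>(numBuses, false)) {}

    bool isFree(std::size_t insn, std::size_t bus) const {
        return !occupied_[insn][bus];
    }

    // Reserve the bus for a move in instruction insn; returns false when the
    // bus was already occupied.
    bool reserve(std::size_t insn, std::size_t bus) {
        if (occupied_[insn][bus]) return false;
        occupied_[insn][bus] = true;
        return true;
    }

private:
    std::vector<std::vector<bool>> occupied_;
};

The same structure, with a lifetime of a single instruction per entry, would serve for sockets; FUs would occupy a span of instructions equal to their latency.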

Assigning an FU to an operation involves checking the availability of this FU and, when the FU is selected, updating the appropriate resource vectors1. In contrast to buses and sockets, the resource vectors of an FU span multiple instructions, ranging from the instruction where the operand move is scheduled to the instruction in which the result move is scheduled. Typically, the number of spanned instructions is equal to the latency of the FU performing the operation. In other words, the live range of an FU is equal to the latency of the FU.

Variables are in general referenced by many operations. Once a variable is assigned to a register, this register cannot be used for storing other values until its content is killed. In contrast to other resources, the live range of a register spans many instructions and may even cross basic block and region boundaries.

1The precise administration of the availability of FUs is outside the scope of this thesis. The interested reader is referred to [Hoo96].



Therefore, registers can be considered a global kind of resource, reserved and released at different points of the program. All other resources can be considered local resources, since they are reserved and released within a single or a few instructions. The larger scope of register assignment makes registers the hardest allocatable resource.

The question arises when to assign registers to variables. One approach is to assign a register to a variable v when all references to v are scheduled. At this point, all instructions spanned by the live range are known. The main reason not to choose this approach is the problem of inserting spill code in already scheduled code. Therefore, a method that avoids the insertion of spill code in already scheduled instructions is chosen. The fundamental idea of our approach is as follows:

A register r is assigned to a variable v as soon as a move m is scheduled that refers to v.

The complete live range of a variable is checked for a common available register before any of the references to this variable are scheduled.

Algorithm 6.1 assigns the transport resources (buses, sockets and registers). A valid transport resource combination for a move m in instruction i consists of a source socket si, a move bus mb and a destination socket di, which are not already in use (in other words, available) in this instruction. This combination should form a data path from the source register (FU or RF) to the destination register (FU or RF) in the TTA processor.

6.2 Register Resource Vectors

A register is assigned to a variable when the first move referring to this variable is scheduled. Bookkeeping is necessary to guarantee that variables with overlapping live ranges are not mapped onto the same register. Just as a bus availability vector is created for each instruction, a Register Resource Vector (RRV) is associated with each instruction.

Definition 6.1 The Register Resource Vector RRV(i) is defined as the set of registers that are in use at instruction i.

An example is given in Figure 6.1. Note that a register can be re-defined by another operation in the same instruction as where it was last used.

When the live range of a variable v spans multiple basic blocks, the register r mapped onto v is added to the RRVs of all instructions in the spanned basic blocks. Observe that a basic block has no instructions when it is not yet selected for scheduling, and therefore the usage of r cannot be recorded properly. This leads to incorrect assignments. To solve this problem, a single instruction, and thus a single RRV, is initially added to each basic block. This enables the correct recording of the assignments.



Algorithm 6.1 ASSIGNTRANSPORTRESOURCES(m, i)

src = SOURCE(m)
dst = DESTINATION(m)
FOR EACH si ∈ AVAILABLESRCSOCKETS(src, i) DO
    FOR EACH di ∈ AVAILABLEDSTSOCKETS(dst, i) ∧ di ≠ si DO
        FOR EACH mb ∈ AVAILABLEMOVEBUSES(si, di, i) DO
            IF ISVARIABLE(src) THEN
                rsrc = SELECTSRCREGISTER(m, i)
                IF rsrc = ∅ THEN
                    continue
                ENDIF
            ENDIF
            IF ISVARIABLE(dst) THEN
                rdst = SELECTDSTREGISTER(m, i)
                IF rdst = ∅ THEN
                    continue
                ENDIF
            ENDIF
            IF ISVARIABLE(src) THEN
                ASSIGNREGISTER(src, rsrc)
            ENDIF
            IF ISVARIABLE(dst) THEN
                ASSIGNREGISTER(dst, rdst)
            ENDIF
            ASSIGNSOURCESOCKET(m, si)
            ASSIGNDESTINATIONSOCKET(m, di)
            ASSIGNMOVEBUS(m, mb)
            return TRUE
        ENDFOR
    ENDFOR
ENDFOR
return FALSE

[Figure 6.1: Usage of RRVs. Instructions 0 to 2 of a basic block with a definition add.r → r3 and a later use r3 → sub.o; the RRV of each instruction spanned by the live range contains r3.]



[Figure 6.2: The impact on RRVs when enlarging a basic block. a) Initial schedule: a scheduled definition add.r → r3 with r3 recorded in the RRVs. b) Adding two instructions: the contents of the last RRV (r3) are copied into the RRVs of the newly added instructions.]

During scheduling of a basic block b, new instructions are added to it when an operation cannot be scheduled in the currently available instructions. The RRVs associated with these added instructions are initialized with the information recorded in the RRV of the last instruction of b. This is valid because a register is available again in the instruction in which it is killed (the last use of the variable onto which the register is mapped) and the live information of the newly added instructions is identical to the live information of the last instruction of b. An example, which illustrates the addition of instructions, is given in Figure 6.2. Figure 6.2a shows a basic block with a scheduled definition of register r3. In Figure 6.2b, the basic block is enlarged with two instructions. The contents of the last RRV are copied, hence r3 is also set to be unavailable in the newly added instructions.
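A possible, simplified data structure for this administration is sketched below; the names and the representation of a register as a small integer are assumptions of this sketch, not the implementation used in the compiler.

#include <cstddef>
#include <set>
#include <vector>

// Sketch of the RRVs of one basic block: rrv_[i] is the set of registers in
// use at instruction i (Definition 6.1).
class RRVTable {
public:
    explicit RRVTable(std::size_t numInstructions) : rrv_(numInstructions) {}

    // Record that register r is in use in the instructions [first, last].
    void markInUse(int r, std::size_t first, std::size_t last) {
        for (std::size_t i = first; i <= last && i < rrv_.size(); ++i)
            rrv_[i].insert(r);
    }

    // Release register r in the instructions [first, last], e.g. after the
    // scheduler has refined the live range.
    void release(int r, std::size_t first, std::size_t last) {
        for (std::size_t i = first; i <= last && i < rrv_.size(); ++i)
            rrv_[i].erase(r);
    }

    bool inUse(int r, std::size_t i) const { return rrv_[i].count(r) != 0; }

    // When the basic block is enlarged, the newly added instructions copy the
    // contents of the last RRV, as in Figure 6.2.
    void appendInstructions(std::size_t count) {
        const std::set<int> last = rrv_.empty() ? std::set<int>{} : rrv_.back();
        rrv_.insert(rrv_.end(), count, last);
    }

private:
    std::vector<std::set<int>> rrv_;
};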

When a register is selected and assigned to a variable v, the RRVs must be updated to guarantee that subsequent assignments are legal and do not interfere with this assignment. Each time a move referring to variable v is scheduled, more information about the size of its live range is known. As a result, the RRV information can be refined.

A register is assigned to v when the first move referring to it is scheduled. Consequently, all the other moves referring to v already have a register assigned to them before they are scheduled. When such a move, for example a definition, is scheduled, the register associated with it can be set as available again in the instructions before this definition. This is illustrated in Figure 6.3. Figure 6.3a shows the situation before the definition is scheduled. In Figure 6.3b the definition is scheduled and the appropriate RRVs are updated. There is, however, one exception. When a use of the register was already scheduled in an earlier or the same instruction as the definition, the register can only be released in the instructions between this use and the definition.

The RRVs are also updated when a use is scheduled to which a register was already assigned. Figures 6.4b to 6.4e show various scenarios when scheduling the code fragment of Figure 6.4a; it is assumed that r3 is not in liveOut. Figure 6.4b shows the situation when the definition and the first use are scheduled. Since it is not known in which instruction the second use is going to be scheduled, the worst case is assumed and r3 is unavailable in all instructions of the basic block.



[Figure 6.3: Updating RRVs after scheduling a definition. a) RRVs prior to scheduling the definition: r3 is in use in all instructions 0 to 3. b) RRVs when the definition add.r → r3 is scheduled: r3 becomes available again in the instructions before the definition.]

Figure 6.4c shows the contents of the RRVs when the second use is also scheduled. More information about the live range is known and the RRVs are updated as shown in the figure. Since operation-based list scheduling is used, the second use can be scheduled in an earlier instruction than the first use. In this situation, register r3 can be released for re-use in the second instruction. This is shown in Figure 6.4d. Not all live ranges are local to a basic block. Assume, in our example, that the second use of r3 is located in a successor basic block. The RRVs associated with the instructions of the successor basic block must all contain register r3. When the second use is scheduled, the RRVs in the successor basic block can be updated.

[Figure 6.4: Updating RRVs after scheduling a use. a) Code fragment: add.r → r3; r3 → sub.o; r3 → mul.o. b) Definition and first use scheduled. c) Second use scheduled after the first use. d) Second use scheduled before the first use. e) Variable live across basic block boundaries.]



[Figure 6.5: Updating RRVs when applying software bypassing and dead-result move elimination. a) Initial schedule: the definition add.r → r3 is scheduled and r3 occupies the RRVs. b) Software bypassing and dead-result move elimination: the result of add is forwarded directly to sub, the variable disappears and r3 is removed from the RRVs.]

This situation is depicted in Figure 6.4e. Because register r3 is live until the end of the original basic block, it is not released in its last RRVs.

TTAs have a property that makes integrated register assignment even more attractive: the ability to forward data directly from the output of one FU to the input of another or the same FU. How this advantage is exploited in our integrated assignment method is shown in Figure 6.5. Figure 6.5a gives the situation where a definition of register r3 is scheduled. Figure 6.5b shows the schedule and its RRVs when the variable is software bypassed and this use ends the live range (i.e. dead-result move elimination can be applied). The variable disappears from the code; therefore register r3 is also removed from the RRVs and can be used again for another variable. The register can only be removed from the RRVs when all uses are scheduled in the same instruction as their definition.
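The release of a register after complete software bypassing can be expressed compactly. The sketch below assumes the RRVs of the basic block are represented as one set of register numbers per instruction; this representation and the function name are assumptions of this illustration.

#include <cstddef>
#include <set>
#include <vector>

// If every use of the variable is scheduled in the same instruction as its
// definition, the value is fully bypassed, the result move is dead, and the
// register reserved for it can be removed from all RRVs of the basic block.
void releaseIfFullyBypassed(std::vector<std::set<int>>& rrvs, int reg,
                            std::size_t defInsn,
                            const std::vector<std::size_t>& useInsns) {
    for (std::size_t use : useInsns)
        if (use != defInsn) return;  // at least one use is not bypassed
    for (std::set<int>& rrv : rrvs)
        rrv.erase(reg);
}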

6.3 The Interference Register Set

The interference register set of a variable v contains the registers that are mapped onto variables that may interfere with v under any legitimate schedule. This set is used for the selection of a register for variable v. For each basic block b in the live range of v, an interference register set is constructed.

The basic blocks spanned by the live range of variable v can be partitioned into two sets.

• BIO(v), variable v is live on entry and exit of these basic blocks, but it is not referenced. This set is constructed using2:

BIO(v) = { b ∈ B | v ∈ liveIn(b) ∧ v ∈ liveOut(b) ∧ v ∉ liveUse(b) }   (6.1)

All variables live in these basic blocks interfere with variable v. The interference register set RIO(v, b) for a basic block b ∈ BIO(v) is simply computed with:

RIO(v, b) = ⋃_{i ∈ b} RRV(i),   b ∈ BIO(v)   (6.2)

2Note that due to the definitions of the sets liveIn, liveOut, liveDef and liveUse it is not necessary to include v ∉ liveDef(b) in the construction of this set.

[Figure 6.6: RRV based register interference. a) Unscheduled TTA code of basic block A: v4(r2) → add.o; v1(r1) → add.t; add.r → v3; v3 → mul.o; v0(r0) → mul.t; mul.r → v1(r1); v1(r1) → v2(r0); v5(r4) → sub.o; #4 → sub.t; sub.r → v6(r5). b) The unscheduled basic block A with its RRV (r0..r2, r4, r5); liveIn = {v0, v1, v4, v5}, liveOut = {v1, v2, v6}. c) The DDG of basic block A.]

• BDU(v), these basic blocks have references to variable v. This set is described with:

BDU(v) = { b ∈ B | v ∈ liveDef(b) ∨ v ∈ liveUse(b) }   (6.3)

The set of interfering registers of a basic block b ∈ BDU(v) is denoted with RDU(v, b).

Computing the interference register set RDU(v, b) is much more complicated than computing RIO(v, b). A conservative approach is simply to include all registers in the RRVs of all instructions of the basic blocks b ∈ BDU(v). Figure 6.6 gives an example. Figure 6.6a shows the operations of a basic block A. The notation v4(r2) means that register r2 is mapped onto variable v4. Figure 6.6b gives the (empty) schedule and the RRV. According to the RRV information, variable v3 cannot be mapped onto any of the registers r0, r1, r2, r4 and r5. However, careful examination of the code shows that registers r1 and r2 never interfere with variable v3, independent of the generated schedule. Consequently, the information in the RRV is too conservative.

To efficiently exploit the available registers, a more accurate estimation is required. The DDG's partial ordering within basic blocks is used for constructing the interference set. To capture all interference types, four non-interference sets are defined per basic block. In the formulas, all references to variables are made by operations belonging to basic block b, i.e., nuse(v), nuse(vi), ndef(v) and ndef(vi) are contained in b.

Definition 6.2 The non-interference set VBelow(v, b) contains all variables vi whose live range starts after the end of the live range of v in basic block b. The ordering of the live ranges is not the result of the ordering in the sequential intermediate code, but is the result of the dependence relations in the DDG. The set VBelow(v, b) is constructed with:

VBelow(v, b) = { vi ∈ liveDef(b) | (nuse(v), ndef(vi)) ∈ ET ∀ nuse(v) ∈ b }   (6.4)

where ET is the set of edges of the transitive closure of the DDG.

A similar situation arises when v and vi change roles.

Definition 6.3 The non-interference set VAbove(v, b) contains all variables vi whose live range ends before the live range of v starts in basic block b. More formally:

VAbove(v, b) = { vi ∈ live¬Out(b) | (nuse(vi), ndef(v)) ∈ ET ∀ nuse(vi) ∈ b }   (6.5)

where live¬Out(b) = (liveDef(b) ∪ liveIn(b)) − liveOut(b).

Things become more complex when the live range of vi is loop carried, i.e., the variable vi is live at entry of basic block b and it is redefined within b.

Definition 6.4 The non-interference set VAround(v, b) contains all variables that do not interfere with v in basic block b, and that are live on entry and are redefined in basic block b. This set is constructed with:

VAround(v, b) = { vi ∈ liveLoop(b) | (nuse(v), ndef(vi)) ∈ ET ∀ nuse(v) ∈ b
                  ∧ (nuse(vi), ndef(v)) ∈ ET ∀ nuse(vi) ∈ K(vi, b) }   (6.6)

where liveLoop(b) = liveIn(b) ∩ liveDef(b) and K(v, b) is the set of uses of v that will be executed before the definition of v in basic block b:

K(v, b) = { n ∈ NUse(v) | (ndef(v), nuse(v)) ∉ ET, ndef(v), nuse(v) ∈ b }   (6.7)

A similar situation occurs when the roles of v and vi are interchanged.

Definition 6.5 The non-interference set VBetween(v, b) contains all variables that do not interfere with v in basic block b, when v is live on entry and exit of b. More formally:

VBetween(v, b) = { vi ∈ liveLocal(b) | (nuse(vi), ndef(v)) ∈ ET ∀ nuse(vi) ∈ b
                   ∧ (nuse(v), ndef(vi)) ∈ ET ∀ nuse(v) ∈ K(v, b) }   (6.8)

where liveLocal(b) = liveDef(b) − liveOut(b).



[Figure 6.7: Partial schedule of basic block A after scheduling the subtraction (r4 → sub.o; #4 → sub.t; sub.r → r5). The RRVs of the two instructions contain r0..r2 and r0..r2, r5.]

The non-interference set of a basic block b ∈ BDU(v) can now be computed with the four non-interference sets of Equations 6.4, 6.5, 6.6 and 6.8:

Vnon-interf(v, b) = VBelow(v, b) ∪ VAbove(v, b) ∪ VAround(v, b) ∪ VBetween(v, b) (6.9)

The set of interfering registers in b ∈ BDU (v) can now be determined with:

RDU(v, b) = { r(vi) | vi ∈ live(b) − Vnon-interf(v, b) − v }   (6.10)

where r(vi) returns the register mapped onto variable vi. When no register is assigned to vi then r(vi) = ∅.

Figure 6.6c shows the portion of the DDG of the basic block of Figure 6.6a. The figure only shows a small part of the DDG of the complete procedure, which explains the dangling edges. The following sets are now constructed: VBelow(v3, A) = {v2}, VAbove(v3, A) = {v4}, VAround(v3, A) = {v1} and VBetween(v3, A) = ∅. According to Equation 6.9, the set of non-interfering variables becomes Vnon-interf(v3, A) = {v1, v2, v4}. Since live(A) = {v0, v1, v2, v3, v4, v5, v6}, the set of interfering registers becomes RDU(v3, A) = {r0, r4, r5}. The registers in the set {r0, r4, r5} may interfere with variable v3 under any legitimate schedule.
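To make the construction more concrete, the sketch below implements Equation 6.4 and, given the combined non-interference set, Equation 6.10 for a single basic block. The data layout (integer node identifiers, maps from variable names to nodes, a register map) is an assumption of this sketch, and the other three sets of Equations 6.5, 6.6 and 6.8 would be computed along the same lines.

#include <map>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Sketch of Equations 6.4 and 6.10 for one basic block. ET is the edge set of
// the transitive closure of the DDG, with nodes identified by integers.
struct BlockInfo {
    std::set<std::pair<int, int>> ET;               // transitive closure edges
    std::map<std::string, std::vector<int>> uses;   // uses of a variable in b
    std::map<std::string, int> def;                 // definition node in b
    std::set<std::string> liveDef;                  // variables defined in b
    std::set<std::string> live;                     // variables live in b
    std::map<std::string, int> reg;                 // register mapped onto a variable
};

// VBelow(v, b): variables defined in b whose definition comes after every use
// of v according to the transitive closure (Equation 6.4).
std::set<std::string> vBelow(const BlockInfo& b, const std::string& v) {
    std::set<std::string> result;
    std::vector<int> usesOfV;
    if (b.uses.count(v)) usesOfV = b.uses.at(v);
    for (const std::string& vi : b.liveDef) {
        if (vi == v || b.def.count(vi) == 0) continue;
        bool allUsesBefore = true;
        for (int u : usesOfV)
            if (b.ET.count({u, b.def.at(vi)}) == 0) { allUsesBefore = false; break; }
        if (allUsesBefore) result.insert(vi);
    }
    return result;
}

// RDU(v, b): the registers of the live variables that are not proven
// non-interfering with v (Equation 6.10).
std::set<int> rDU(const BlockInfo& b, const std::string& v,
                  const std::set<std::string>& nonInterfering) {
    std::set<int> result;
    for (const std::string& vi : b.live) {
        if (vi == v || nonInterfering.count(vi)) continue;
        auto it = b.reg.find(vi);
        if (it != b.reg.end()) result.insert(it->second);
    }
    return result;
}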

For all basic blocks in the set BDU(v), the interference register set RDU(v, b) is computed using Equation 6.10. These basic blocks are not scheduled yet. There is, however, one exception: the currently scheduled basic block b∗. The set RDU(v, b∗) indeed contains all registers that might possibly interfere with v prior to scheduling any of the operations of b∗. However, when the scheduler has already assigned some operations to instructions, some of the registers do not interfere anymore. This is illustrated in Figure 6.7, which shows the schedule of our running example (see Figure 6.6) after scheduling the subtraction. As can be seen, variable v3 can never interfere anymore with register r4, because the definition of v3 will always be scheduled after the use of r4. When the information in the RRVs is combined with RDU(v, b∗), a more accurate interference set can be constructed. The exact construction of this set depends on whether a definition or a use of a variable v is scheduled.

• When a definition n of variable v is scheduled in instruction icur, only the RRVs of instruction icur until the last instruction of basic block b∗ need to be checked for a free register:

RRRV(v, n) = ⋃_{i = icur}^{LastInsn(bb(n))} RRV(i)   (6.11)

• A similar approach is used for constructing the interference register set when a use n of variable v is scheduled in instruction icur:

RRRV(v, n) = ⋃_{i = 0}^{LastUseInsn(n, v)} RRV(i)   (6.12)

where

LastUseInsn(n, v) = LastInsn(bb(n))   if NUse(v) − n ≠ ∅
                    LastInsn(bb(n))   if v ∈ liveOut(bb(n))
                    icur − 1          otherwise                    (6.13)

Combining the sets RRRV(v, n) and RDU(v, b∗) results in an instruction-precise register interference set of variable v, valid for any possible code ordering of the remaining unscheduled operations in basic block b∗. This set, RCur(v, n), is computed with:

RCur(v, n) = RRRV (v, n) ∩ RDU (v, bb(n)) (6.14)

The first scheduled move in our running example (see Figures 6.6 and 6.7) referring to v3 is a definition (i.e., add.r → v3). The first instruction in which it can be scheduled is instruction 1. As a result, RRRV(v3, add.r → v3) = {r0, r1, r2, r5} and RCur(v3, add.r → v3) = {r0, r5}. The complete set of interfering registers can now be computed with:

RInterfere(v, n) = ( ⋃_{b ∈ BIO(v)} RIO(v, b) ) ∪ RCur(v, n) ∪ ( ⋃_{b ∈ BDU(v) − bb(n)} RDU(v, b) )   (6.15)
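Expressed as code, Equation 6.15 is a plain set union, and the register actually assigned to v is any register of the file that is not in the resulting set. The function and parameter names below are hypothetical, and the three input sets are assumed to have been computed as described above.

#include <set>

// Sketch of Equation 6.15: the complete interference set is the union of the
// RIO sets, RCur, and the RDU sets of the other basic blocks in the live range.
std::set<int> rInterfere(const std::set<int>& rIOUnion,
                         const std::set<int>& rCur,
                         const std::set<int>& rDUUnion) {
    std::set<int> result = rIOUnion;
    result.insert(rCur.begin(), rCur.end());
    result.insert(rDUUnion.begin(), rDUUnion.end());
    return result;
}

// A register for v is any member of the register file that does not occur in
// the interference set; when none is left, the live range has to be spilled.
int selectRegister(const std::set<int>& registerFile,
                   const std::set<int>& interfering) {
    for (int r : registerFile)
        if (interfering.count(r) == 0) return r;
    return -1;  // no free register: spilling is required (Section 6.4)
}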

6.4 Spilling

In this section, issues related to spilling in the context of our integrated assignment method are discussed. Late and early assignment insert spill code in either completely scheduled or completely unscheduled code. Integrated assignment has to insert spill and reload code in partly scheduled code. In Section 6.4.1, a solution to this problem is presented.



Adding operations changes the data dependence relations and the data flow relations. This issue is addressed in Section 6.4.2. Scheduling of on-the-fly inserted spill code has complications. These complications are identified in Section 6.4.3 and their solutions are presented.

6.4.1 Integrated Spilling

The problem of inserting spill code in partly scheduled code seems to be similar to the problem of inserting spill code in completely scheduled code. Extra operations must be squeezed into scheduled instructions. As already discussed in Section 5.2.3, this requires rescheduling in order to generate correct code. Rescheduling may lead to changes in the register requirements; non-interfering live ranges in the original scheduled code may interfere in the rescheduled code, or variables that were software bypassed are not software bypassed anymore and require registers. These effects result in an iteration of register assignment, spill code insertion and rescheduling steps. It is unclear how to apply this strategy in the context of integrated assignment, because it is not desirable to restart scheduling when it is discovered during scheduling that spilling is required.

For these reasons, it was decided not to insert spill code in already scheduled code. Instead, spill code is only inserted in code that is still to be scheduled. This strategy fits very well in our integrated assignment approach, because a register is assigned to a variable when the first reference to this variable is being scheduled. As a result, all references are located in unscheduled basic blocks. This avoids the insertion of code in already scheduled code and is therefore easier to implement.

The principle of our approach is illustrated in the example of Figure 6.8. Figure 6.8a shows the CFG of a procedure. Assume that the basic blocks A and B are already scheduled and basic block C is being scheduled. The shaded parts indicate which code is already scheduled. In this example, it is assumed that no more free registers are available in basic block D. Consequently, no register can be found for the live range of variable v2. Integrated assignment detects this situation when it tries to schedule the definition of v2 in basic block C. As illustrated in Figure 6.8b, spill and reload code is inserted in the unscheduled code of basic blocks C and D, respectively. For reasons of clarity, the code for the address calculations is omitted.

6.4.2 Updating Data Flow and Data Dependence Relations

Spill code insertion changes the data dependence and data flow relations. This information was originally computed before scheduling; now, during scheduling, the necessary updates must be made to ensure correct code generation. New live ranges are created to hold the memory addresses and the values to be spilled and reloaded.



[Figure 6.8: Integrated spilling. a) Before spilling: a CFG with basic blocks A, B, C and D, where v2 is defined in C and used in D. b) After spilling: the definition in C becomes def v2'; store v2', and the use in D becomes load v2''; use v2''.]

To maintain the fully renaming property, and thus a large scheduling freedom, new and unique variable names are associated with these live ranges. In the following, the changes to the DDG when spilling a variable v are described in detail.

• A store operation stdefi is inserted just after each operation ndefi(v) ∈ NDef(v). A unique index i is given to each definition of v in NDef(v). A data dependence edge, of the flow dependence type, is added between the definition ndefi(v) and the associated stdefi:

NDDG = NDDG ∪ { stdefi }
EDDG = EDDG ∪ { (ndefi(v) δf0 stdefi) | stdefi, ndefi(v) ∈ NDDG }

An addition adddefi is inserted just before each inserted stdefi. This addition computes the memory address of the location where the spilled variable is stored. The DDG is updated with:

NDDG = NDDG ∪ { adddefi }
EDDG = EDDG ∪ { (adddefi δf0 stdefi) | adddefi, stdefi ∈ NDDG }

• A load operation ldusej is inserted just before each operation nusej(v) ∈ NUse(v). The index j distinguishes the various uses of v in NUse(v). A data dependence edge is added between the load and the related consumer:

NDDG = NDDG ∪ { ldusej }
EDDG = EDDG ∪ { (ldusej δf0 nusej(v)) | ldusej, nusej(v) ∈ NDDG }



An addition addusej is inserted just before each inserted ldusej. This addition computes the memory address of the location from which the spilled variable should be reloaded.

NDDG = NDDG ∪ { addusej }
EDDG = EDDG ∪ { (addusej δf0 ldusej) | addusej, ldusej ∈ NDDG }

• Two types of memory data dependence edges are added between the inserted store and load operations. The first edge prevents a value from being read from memory before it is written. The flow dependence edge between ndefi(v) and nusej(v) is replaced with a memory flow dependence edge between stdefi and ldusej:

EDDG = EDDG ∪ { (stdefi δf1 ldusej) | (ndefi(v) δf0 nusej(v)) ∈ EDDG } − { (ndefi(v) δf0 nusej(v)) }

The second memory dependence edge prevents a value from being written to memory before it is read. This edge replaces the anti dependence edge between nusej(v) and ndefi(v). Such an edge only exists when variable v was loop carried.

EDDG = EDDG ∪ { (ldusej δa1 stdefi) | (nusej(v) δa0 ndefi(v)) ∈ EDDG } − { (nusej(v) δa0 ndefi(v)) }

The sets with live information are also updated. Variable v is removed from all the sets liveIn, liveOut, liveDef and liveUse. The new variables, created to hold the temporary values, are added to the liveDef sets of their basic blocks. In addition, for each new live range a new du-chain is created3.

After the insertion of spill code, the operation sequences for spilling and reloading are stand-alone pieces of code. That is, they are no longer directly connected by a du-chain and the data is transported via memory.
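A condensed sketch of these updates for one definition/use pair is given below. The graph representation and the string dependence types ("flow", "memflow") are assumptions of this sketch and merely stand for the δ edges written above; the loop-carried anti-dependence case is handled analogously.

#include <string>
#include <vector>

struct Node { int id; std::string kind; };              // e.g. "add", "st", "ld"
struct DepEdge { int from, to; std::string type; };

struct DDG {
    std::vector<Node> nodes;
    std::vector<DepEdge> edges;
    int addNode(const std::string& kind) {
        nodes.push_back({static_cast<int>(nodes.size()), kind});
        return nodes.back().id;
    }
    void addEdge(int from, int to, const std::string& type) {
        edges.push_back({from, to, type});
    }
};

// Spill the value flowing from definition nDef to use nUse: insert an address
// computation and a store after the definition, an address computation and a
// reload before the use, and replace the original flow dependence by a memory
// flow dependence between the store and the load.
void spillDefUsePair(DDG& ddg, int nDef, int nUse) {
    int addDef = ddg.addNode("add");   // address of the spill slot
    int st     = ddg.addNode("st");    // store of the spilled value
    ddg.addEdge(nDef, st, "flow");
    ddg.addEdge(addDef, st, "flow");

    int addUse = ddg.addNode("add");   // address of the spill slot
    int ld     = ddg.addNode("ld");    // reload of the spilled value
    ddg.addEdge(addUse, ld, "flow");
    ddg.addEdge(ld, nUse, "flow");

    // Replace the original flow dependence nDef -> nUse by a memory flow
    // dependence st -> ld.
    for (DepEdge& e : ddg.edges)
        if (e.from == nDef && e.to == nUse && e.type == "flow") {
            e = {st, ld, "memflow"};
            break;
        }
}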

6.4.3 Scheduling Issues

Normally, an operation is scheduled in the first instruction where its data dependence and resource constraints are met. The used basic block scheduling method guarantees that there always exists an instruction in which both the data dependence and the local resource constraints (FUs, buses and sockets) can be fulfilled. When necessary, new (empty) instructions are created and added to the basic block.

3The additions use the frame-pointer (fp) to compute the address of a memory location. In order to be complete, du-chains and data dependence edges between the definition of the frame-pointer and its uses by the additions must be inserted. For reasons of clarity, they are omitted in the above discussion.



[Figure 6.9: Direct vs. postponed spilling when scheduling the transport #20 → v1. a) Initial schedule: the RRVs of instructions 0 and 1 contain r0, r1 .. rn; the RRV of instruction 2 contains only r0 and r2. b) Direct spilling: spill code (fp → add.o; #offset → add.t; add.r → st.o; #20 → st.t) is inserted and the basic block grows to four instructions. c) Postponed spilling: the transport is scheduled in instruction 2, where a register is still free, and no spill code is inserted.]

For global resources, such as registers, it is not guaranteed that the resource constraints can be met, because variables can cross basic block boundaries.

Two strategies were explored when the inability to schedule an operation in a particular instruction is caused by register shortage:

• Direct spilling: Spill code is generated in the first instruction where all data dependence and all resource constraints in an instruction can be met, except registers.

• Postponed spilling: The scheduler tries to schedule the transport in later instructions, hoping that in one of these instructions more registers become available. When no extra registers become available, the first strategy is used as a fallback strategy.

Figure 6.9 shows the impact of both strategies. Transport #20 → v1 must be scheduled in the basic block given in Figure 6.9a. Direct spilling inserts spill code because all resource and dependence constraints, except registers, are met in instruction 1. This is shown in Figure 6.9b. When postponed spilling is used (see Figure 6.9c), the scheduler discovers that a register becomes available when the transport is scheduled in instruction 2. Consequently, no spill code is inserted.

Preliminary experiments indicated that postponed spilling results in higher performance. Despite the engineering complexities, this strategy is chosen for implementation in our integrated assignment approach.
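The postponed strategy amounts to a simple search over later instructions with direct spilling as a fallback. The sketch below illustrates this policy; the scheduler queries are passed in as callbacks because their real implementations are part of the back-end, and all names are hypothetical.

#include <cstddef>
#include <functional>

// Sketch of postponed spilling: try the first instruction that satisfies all
// constraints except registers and every later instruction of the basic
// block; only when no register becomes available, fall back to direct
// spilling in the first feasible instruction.
void schedulePostponed(
    std::size_t firstFeasible, std::size_t lastInsn,
    const std::function<bool(std::size_t)>& otherConstraintsMet,
    const std::function<bool(std::size_t)>& registerAvailable,
    const std::function<void(std::size_t)>& scheduleAt,
    const std::function<void(std::size_t)>& insertSpillCodeAt) {
    for (std::size_t i = firstFeasible; i <= lastInsn; ++i) {
        if (!otherConstraintsMet(i)) continue;
        if (registerAvailable(i)) {
            scheduleAt(i);                 // a register became free: no spill code
            return;
        }
    }
    insertSpillCodeAt(firstFeasible);      // fallback: direct spilling
}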



[Figure 6.10: Impact of spilling and instruction scheduling on register pressure. a) Original DDG: register pressure = 3. b) DDG with spill code (st, ld1 and ld2 inserted): register pressure = 2. c) Scheduled code with ld1 and ld2 in the same instruction: register pressure = 3.]

The idea behind spilling is to reduce the register pressure by replacing a long live range with a number of short live ranges. However, in some situations spilling increases the register pressure. For example, when inserting a store operation, the original live range is replaced with two simultaneously live, short live ranges: one for the memory address calculation and one for the value to be stored in memory. This is a problem when insufficient registers are available to hold these short live ranges, a condition that is very probable since spilling is due to register shortage. Other register pressure problems are related to instruction scheduling. The instruction scheduler may decide to schedule the operations required for spilling far apart. This also increases register pressure; see for instance Figure 6.10. In Figure 6.10a the original graph is shown; this program requires three registers. When only two registers are available, spill code is introduced as shown in Figure 6.10b. The resulting code requires only two registers. Because instruction scheduling techniques tend to schedule operations as early as possible, it can happen that operation ld2 is scheduled in the same instruction as operation ld1, see Figure 6.10c. In this situation, the register requirement is still three, and spilling did not help at all.

In [LVA96] this problem is attacked by scheduling the operation whose variable is spilled and the spill code itself close together in a single scheduling step. However, this does not guarantee that no deadlock situation can arise, since registers are still required for the introduced short live ranges.

In [CLM+95, HA99], it is suggested to use a limited set of reserved registers for these newly created live ranges. This method has three major drawbacks:



fp → add.o; #offset → add.t;
add.r → ld.t;
ld.r → use;

Figure 6.11: Software bypassed reload code.

(1) variables are spilled to memory, although some registers were available, (2) false dependences are introduced that could be avoided if the complete set of registers was available, and (3) it introduces false dependences between spill code because the newly created live ranges are only mapped onto a small set of registers. Consequently, reserving registers for the short live ranges introduced by spilling results in inefficient register usage.

A better solution is to exploit the software bypassing property of TTAs and dead-result move elimination. The newly created live ranges are directly transported from FU to FU; they disappear completely from the code. Because no registers are required, it is guaranteed that this method always converges. The corresponding TTA code for reload code is given in Figure 6.11. The result of the addition is bypassed to the load, and the result of the load is bypassed to the operation that uses the reloaded value.

To ensure that the code can be scheduled without the need for reserved registers, the address calculation, the load or store, and the operation that requires spilling must be scheduled in such a way that all variables are bypassed when required. This cannot be guaranteed when the involved operations are scheduled in individual scheduling steps. For example, assume the result move of the addition performing the memory address calculation is scheduled in instruction i. The trigger of the load should also be scheduled in this instruction. However, when not enough move buses or FUs are available, the trigger will be scheduled in instruction i + 1 or higher, and software bypassing cannot be applied.

To guarantee software bypassing, the address calculation, the load or store, and the operation that requires spilling are scheduled in a single (atomic) scheduling step. To achieve this, the scheduler recognizes spill code. It selects stand-alone pieces of spill or reload code as if they were a single operation. The scheduler uses backtracking to ensure software bypassing. The scheduler is not always required to software bypass variables. When, for example, spilling was caused by an assignment in another basic block, some registers may still be available in the currently scheduled basic block. In these situations, the scheduler is allowed to use these available registers. This decreases the number of backtracking steps. Scheduling of spill code is a complex engineering challenge, especially when another variable, defined or used by the operation that originally required spilling, also requires spilling.
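
The following Python sketch illustrates the idea of treating a reload group (address add, load, and the using operation) as one atomic unit: a start instruction is only accepted when every move of the group fits, so that all intermediate values can be bypassed; otherwise the whole group is retried one instruction later. The bus-capacity model and the fixed single-cycle offsets are simplifying assumptions of this sketch, not properties of the actual scheduler.

    def schedule_reload_group(bus_free):
        """Place the moves of a reload group (add, ld, use) at offsets 0, 1, 2
        from a common start instruction; `bus_free` holds the number of free
        move buses per instruction.  Either the whole group is placed (so
        bypassing is possible) or nothing is placed and a later start is tried."""
        group_offsets = (0, 1, 2)                     # add, ld, use (single-cycle FUs)
        for start in range(len(bus_free)):
            needed = {}
            for off in group_offsets:
                needed[start + off] = needed.get(start + off, 0) + 1
            if all(i < len(bus_free) and bus_free[i] >= n
                   for i, n in needed.items()):
                for i, n in needed.items():           # commit the atomic group
                    bus_free[i] -= n
                return start
            # otherwise nothing was committed, so backtracking is trivial here
        return None                                   # group does not fit at all

    print(schedule_reload_group([0, 1, 1, 1, 1]))     # -> 1: group occupies 1..3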

The most general spill code data dependence graph is shown in Figure 6.12; intra-operation edges are omitted. It shows the worst case situation for an operation op with n operands and m results. In practice, most operations have only one or two operands and one result; furthermore, it is not likely that all



[Figure 6.12 shows, for an operation op with n operands and m results, the spill code, the reload code and the operation itself in a single data dependence graph.]

Figure 6.12: Largest recognizable spill/reload code sequence.



[Figure 6.13 shows three panels: a) Unscheduled code with spill and reload code, b) Optimization with software bypassing, c) Optimization with register usage, where register r1 holds the value between definition and use.]

Figure 6.13: Definition-use peephole optimization.

operands and results need spill/reload code. When loads and/or stores with address offsets are supported, the graph complexity reduces substantially.

6.4.4 Peephole Optimizations

Up to now, the general spilling process has been outlined, but there are cases where some of the added operations turn out to be superfluous. Integrated assignment considers several particular cases (a small sketch of these checks follows the list):

• When a definition and a use of a variable v are scheduled in nearby instructions, and v is spilled, the reload code can be omitted when the use can be scheduled in the same instruction as the definition. The value can be software bypassed directly to the use, without reloading the value from memory. An example is given in Figure 6.13. Figure 6.13a shows the code containing spill and reload code. The reload code (shaded in the figure) can be left out in the generated schedule. This schedule is shown in Figure 6.13b; the result of the definition (def) is spilled to memory and directly software bypassed to the use.

When a register is available between the definition and the use, this register can be used for holding the value generated by the definition. Again, the reload code can be omitted. This is shown in Figure 6.13c. Register r1 is used for temporary storage. The discarding of the superfluous operations is carried out when the use and its associated spill code are scheduled. Only at this point, it is known in which instructions the definition and the use are scheduled, and whether any registers are available.

• When two uses of the same variable are scheduled in nearby instructions, the reload code of one of the uses can be omitted. This is illustrated in Figure 6.14. Figure 6.14a shows a code fragment which reloads the same value from memory twice. This code can be optimized by removing the second reload (shaded in the figure) and replacing it with a short live range from the first load operation to the second use. Figure 6.14b shows the resulting schedule. The discarding of the superfluous operations is carried out when the second use and its associated spill code are scheduled.
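
A minimal Python sketch of the two checks on a toy model where instructions are identified by their index; the distance threshold for "nearby" uses and the boolean arguments are assumptions of this sketch, not the compiler's actual criteria.

    def reload_is_superfluous(def_insn, use_insn, reg_free_between=False,
                              prev_reload_insn=None, max_distance=1):
        """Return True when the reload feeding `use_insn` can be discarded."""
        if use_insn == def_insn:            # def-use: bypass the spilled value
            return True
        if reg_free_between:                # def-use: keep the value in a register
            return True
        if (prev_reload_insn is not None    # use-use: reuse the earlier reload
                and use_insn - prev_reload_insn <= max_distance):
            return True
        return False

    print(reload_is_superfluous(def_insn=4, use_insn=4))                      # True
    print(reload_is_superfluous(def_insn=2, use_insn=8, prev_reload_insn=7))  # True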



[Figure 6.14 shows two panels: a) Unscheduled code in which the same value is reloaded from memory twice, b) Optimization with register usage, where register r1 carries the value from the first reload to the second use.]

Figure 6.14: Use-use peephole optimization.

6.5 State Preserving Code

When a procedure invokes another procedure, parameters are passed from the calling procedure to the called procedure, and on return from the called procedure to the calling procedure. These parameters are located in the variables v0..v6 for integer values, and vf0..vf4 for floating-point values, as defined by the front-end (Section 3.1). Precautions have to be taken to ensure that the contents of these variables are not altered by register assignment. Therefore, integrated assignment assigns the correct registers to these variables prior to scheduling, just as in the graph coloring approach. These registers, however, are not dedicated to parameter passing exclusively4. They can also be used for holding other variables, as long as their live ranges do not interfere.

An invoked (called) procedure normally changes the contents of the registers that are in use by the calling procedure. To save the contents of these registers, state preserving code must be inserted. In the remainder of this section, methods to generate caller- and callee-saved code, in the context of integrated assignment, are discussed. Section 6.5.1 discusses the generation of callee-saved code and Section 6.5.2 describes the approach used for generating caller-saved code.

6.5.1 Generation of Callee-saved Code

The convention used in our compiler dedicates the upper half of the register set to callee-saved registers. It is the responsibility of the called procedure to save these registers when it is called. Callee-saved code is inserted in the entry and exit basic blocks of a procedure. The registers saved and restored are those which are referenced in the called procedure and are a member of the callee-saved register set. In the context of integrated assignment, two methods for inserting callee-saved code are developed:

4The registers sp and fp can never be used by another variable, since they are live in the complete procedure.



• Create an extra hierarchy level by adding extra entry and exit basic blocks in the same way as Sweany and Beaty [SB90] propose for late assignment. These added basic blocks are dedicated to hold callee-saved code solely. This approach has the drawback that the callee-saved code is not scheduled with the code of the original entry and exit basic blocks: this will likely result in a small performance loss.

• To overcome the limitation of the previous method, the callee-saved code must be inserted within the original entry and exit basic blocks. To generate legal code, the following steps should be performed (a sketch of these steps is given after this list):

1. Schedule all basic blocks, except the entry and exit basic blocks.

2. Map all remaining, not yet assigned variables whose references are located in the not yet scheduled entry and exit basic blocks onto registers, thus effectively applying early assignment to these basic blocks. When no register can be found, no spill code is generated, in the hope that integrated assignment can find a free register during scheduling.

3. Insert save-code in the entry basic block, and restore-code in the exit basic blocks, for each referenced callee-saved register in this procedure. To ensure that the operations are scheduled in the correct order, extra data dependency edges must be added between the callee-saved code and all not yet scheduled references to the callee-saved registers.

4. Set all never referenced callee-saved registers as used in the RRVs of the entry and exit basic blocks. For these registers, no callee-saved code is generated. This prevents the use of these registers by the yet to be scheduled operations, and thus avoids the insertion of callee-saved code in already scheduled code.

5. Schedule the operations in the entry and exit basic blocks.

The drawback of this method is that the variables that were not yet mapped onto registers can only be mapped onto the caller-saved registers and the saved callee-saved registers5. This generally does not impose severe problems; when there are many registers, the probability of finding a register is high, since at least all caller-saved registers are available. When there are only a few registers, all callee-saved registers are used in the other basic blocks and thus all registers are available.
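
The sketch below summarizes the five steps as a small planning function in Python; the flat data model (block names and register-name sets) is an assumption of the sketch, and step 2, the actual save/restore operations and the dependence edges of step 3 are only indicated by comments.

    def callee_saved_plan(blocks, entry, exits, callee_saved, referenced):
        # Step 1: schedule all basic blocks except the entry and exit blocks.
        order = [b for b in blocks if b != entry and b not in exits]
        # Step 2 (not modelled): early-assign the remaining entry/exit variables,
        # without generating spill code.
        # Step 3: save/restore code for every referenced callee-saved register
        # (plus extra dependence edges towards the remaining references).
        save_restore = sorted(callee_saved & referenced)
        # Step 4: mark never referenced callee-saved registers as used in the
        # RRVs of the entry and exit blocks, so late references cannot claim them.
        blocked_in_rrvs = sorted(callee_saved - referenced)
        # Step 5: schedule the entry and exit blocks last.
        order += [entry] + list(exits)
        return order, save_restore, blocked_in_rrvs

    print(callee_saved_plan(["entry", "B1", "B2", "exit"], "entry", ["exit"],
                            callee_saved={"r10", "r11", "r12"},
                            referenced={"r10"}))
    # -> (['B1', 'B2', 'entry', 'exit'], ['r10'], ['r11', 'r12'])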

Both methods avoid the problem of inserting callee-saved code in already scheduled basic blocks. The second method potentially results in a larger amount of exploitable ILP. It places the callee-saved code and the code of the original entry and exit basic blocks into the same basic block. Based on this

5When no register can be found at all, the variable is of course spilled.



[Figure 6.15 shows four panels: a) Unscheduled code, b) Partial schedule with per-instruction RRVs, c) Inserted caller-saved code for register r7, d) Generated schedule.]

Figure 6.15: On-the-fly caller-saved code generation.

observation and preliminary experiments, the second approach is selected and incorporated within the TTA compiler back-end.

6.5.2 Generation of Caller-Saved Code

It is the responsibility of the register allocator to save and restore caller-saved registers around procedure calls with the use of caller-saved code. In an integrated assignment approach, this code cannot be generated before scheduling the procedure. At that point in time, no registers are assigned yet, and thus it is unknown which registers are alive across the procedure calls. Instead, the caller-saved code must be generated at the moment the procedure call operation is selected for scheduling. The following steps show how this problem is solved in our integrated assignment approach; a sketch of these steps is given after the list.

1. First, the variables that are live across the procedure call are identified by using live-variable information. Because other operations are already scheduled, it is very likely that some of these variables are already mapped onto a register. For the unassigned variables, the integrated assignment method has the freedom to map them onto either the caller- or the callee-saved register set. The same heuristic as in early assignment is used to determine the best register set (see Section 3.3.3). When a caller-saved register is selected, the variable is mapped onto this register although no reference to this variable is scheduled yet. An example is given in Figure 6.15. Figure 6.15a shows the unscheduled code and Figure 6.15b shows the produced schedule so far. The next operation to be scheduled is the procedure call foo → pc. Prior to scheduling the procedure call, variable v2, which is live across the procedure call, is mapped onto the caller-saved register r7.

2. Caller-saved code is inserted for all variables that are mapped onto a caller-saved register and are live across the procedure call. In the sequential code, save-code is inserted before the procedure call, and consists of a store and an add operation. Restore-code, a load and an addition, is inserted after the procedure call in the sequential code. The additions are required to compute the memory locations. An incremental live-variable algorithm is used for generating the new live-variable information. Extra data dependences are inserted to prevent that the loads of the restore code are scheduled before the procedure call. Figure 6.15c shows the unscheduled (sequential) code with the inserted state preserving code for register r7.

3. The operations of the inserted save-code are scheduled in the same manner as all other operations.

4. When all save-code operations are scheduled, the procedure call is scheduled. This implies that it is no longer allowed to assign a caller-saved register to an unassigned variable that is live across this procedure call. Without this restriction, the required extra save-code must be inserted in already scheduled code. As discussed previously, this is problematic for TTAs. To prevent such an assignment, all caller-saved registers are set as occupied in the RRV of the instruction of the procedure call. This is illustrated in Figure 6.15d assuming a TTA with 20 registers. All caller-saved registers (r0 - r9) are set as occupied.

5. In the last step, the operations of the restore-code are scheduled. The generated schedule is shown in Figure 6.15d.
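
A minimal Python sketch of steps 1, 2 and 4 on a toy data model (variable and register names in plain sets); the register-set heuristic, the actual save/restore operations and the scheduling of that code (steps 3 and 5) are deliberately left out, and all names are assumptions of the sketch.

    def handle_procedure_call(live_across, assignment, caller_saved, call_rrv):
        # Step 1: map unassigned variables that are live across the call; the
        # caller-/callee-saved cost heuristic is replaced here by "first free
        # caller-saved register".
        for v in sorted(live_across):
            if v not in assignment:
                free = sorted(caller_saved - set(assignment.values()))
                assignment[v] = free[0] if free else "callee-saved-or-spill"
        # Step 2: save/restore code is needed for every caller-saved register
        # that holds a variable which is live across the call.
        to_save = sorted(assignment[v] for v in live_across
                         if assignment[v] in caller_saved)
        # Step 4: mark all caller-saved registers as occupied in the RRV of the
        # call instruction, so that no later assignment forces save code into
        # already scheduled code.
        call_rrv |= caller_saved
        return to_save

    rrv = {"fp"}
    print(handle_procedure_call({"v2"}, {}, {"r7", "r8", "r9"}, rrv), sorted(rrv))
    # -> ['r7'] ['fp', 'r7', 'r8', 'r9']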

6.6 Experiments and Evaluation

In this section, the performance of the proposed register assignment method is measured and evaluated. The target TTAs used in the experiments are instances of the TTAideal and the TTArealistic processors (see Section 4.2.2). The instances of each processor only differ in the number of integer registers. The largest model supports 512 registers. The smallest model still requires 10 registers. The reason is twofold: from the set of registers, seven registers are reserved by the compiler front-end as special registers (for the stack pointer, frame pointer and parameter passing). In addition to these seven registers, early assignment requires at least three registers for spilling6. Integrated assignment can always find a solution when the target TTA contains at least seven registers, because no reserved registers are needed for spilling. For the experiments, a single RF for each register type (integer, floating-point and Boolean) is used. In Chapter 9, this restriction is lifted.

To measure the performance of the proposed integrated assignment

6These three extra registers are required to hold the memory addresses and the results of the reload operations in case all three operands of, for example, a multiply-add are spilled.



method, the benchmarks are first compiled to sequential move code and simulated with representative data sets. The sequential code is scheduled for the target architecture with the use of profiling information. The last step combines the information from the parallel code with the profiling information in order to compute, for example, the cycle count. The performance numbers are obtained by averaging the speedups over all benchmarks.

Since both instruction scheduling and register assignment are NP-complete problems, heuristics are used for guiding the instruction scheduling and register assignment process. The heuristics are defined at three hierarchical levels:

• Register selection (Section 6.6.1)
• Operation selection (Section 6.6.2)
• Basic block selection (Section 6.6.3)

To evaluate the heuristics in a structured manner, they are evaluated independently. We are aware of the fact that the heuristics are not independent. However, evaluating all combinations of heuristics is too time-consuming. Furthermore, it is not to be expected that a set of heuristics exists which outperforms the others in all situations. In the final section, the performance of integrated assignment is compared with the best approach found in Chapter 5.

6.6.1 Register Selection

When an operation n that refers to a not yet assigned variable v is scheduled, a non-interfere register set is constructed for v. Variable v can be mapped onto any member of this set. This set is defined as:

RNon-interfere(v, n) = R − RInterfere(v, n) (6.16)

where R is the set of registers of the target processor of the required type (integer, floating-point, etc.). The set RInterfere(v, n) is computed with Equation 6.15.

Normally, this set has more than one element. The question arises which register to select. Integrated assignment has three choices: (1) mapping the variable onto a caller-saved register, (2) mapping the variable onto a callee-saved register, or (3) spilling the variable to memory. The cost functions for each of these three choices are given in Equations 3.5 (caller-saved cost), 3.6 (callee-saved cost) and 3.4 (spill cost). We propose to choose the category with the lowest cost. When the spill cost is the lowest, the variable is spilled to memory even when registers are available. When one of the other two categories has the lowest cost, a register is selected from the associated part of the register set. When no register in such a register set can be found, the category with the second lowest cost is chosen. If this also fails, no register can be found and the variable is spilled to memory.
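
The selection rule can be written down compactly; the Python sketch below uses toy cost values and register sets and stands in for the cost functions of Equations 3.4-3.6, which are not reproduced here.

    def select_register(caller_cost, callee_cost, spill_cost,
                        free_caller, free_callee):
        """Pick the cheapest category; fall back to the next category when the
        chosen register set has no free register, and spill as a last resort."""
        options = sorted([(spill_cost, "spill", frozenset()),
                          (caller_cost, "caller-saved", frozenset(free_caller)),
                          (callee_cost, "callee-saved", frozenset(free_callee))])
        for cost, category, free in options:
            if category == "spill":
                return "spill", None          # spill even if registers are free
            if free:
                return category, min(free)    # any register from that set
        return "spill", None                  # no free register in any category

    print(select_register(caller_cost=4, callee_cost=6, spill_cost=9,
                          free_caller={"r3", "r5"}, free_callee={"r12"}))
    # -> ('caller-saved', 'r3')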



6.6.2 Operation Selection

As discussed in Section 3.4.1, operations are selected for scheduling when they become a member of the ready set. A heuristic is used to select an operation from this set. The basic block scheduler uses the priorityslack heuristic as defined in Equation 3.10, which favors operations in the critical path. However, when registers are scarce it may be profitable to use a modified operation selection heuristic, which decreases the register pressure [GWC88]. This may reduce the amount of spill code and hence result in an improved performance. The goal of such a heuristic is to favor operations in the ready set that will decrease the number of live ranges, i.e. select operations that end the live range of variables. This set of operations is denoted as O⊥ ⊆ ready. Three heuristics are proposed that increase the priorities of the operations in the set O⊥. The priorities of all other operations (ready − O⊥) remain unchanged.

• Step:

  prioritystep(o ∈ O⊥) = priorityslack(o) + δ   if |Rfree| < α
  prioritystep(o ∈ O⊥) = priorityslack(o)       if |Rfree| ≥ α

where Rfree is the set of available registers in the last instruction of the currently scheduled basic block. This priority function increases the priority of operations in O⊥ when |Rfree| drops below a certain threshold α.

• Linear:

  prioritylinear(o ∈ O⊥) = priorityslack(o) · (1 + δ · (α − |Rfree|)/α)   if |Rfree| < α
  prioritylinear(o ∈ O⊥) = priorityslack(o)                               if |Rfree| ≥ α

where α is the register limit that determines when the priority should be increased. The parameter δ determines how strongly the number of available registers |Rfree| influences the priority.

• Exponential:

  priorityexponential(o ∈ O⊥) = priorityslack(o) · (1 + δ · e^(−|Rfree|/α))

This priority scheme favors operations in O⊥ using an exponential function. The parameters α and δ determine the impact of the number of available registers |Rfree| on the priority.

The impact on the priority of all three heuristics is shown in Figure 6.16. As can be clearly seen, the heuristics have more influence when the number of available registers decreases. A direct implementation of the three functions is sketched below.
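
The sketch writes the three priority functions out in Python; the parameter values in the example call are arbitrary and priorityslack(o) is passed in as a plain number.

    import math

    def priority_step(slack, free_regs, alpha, delta):
        return slack + (delta if free_regs < alpha else 0.0)

    def priority_linear(slack, free_regs, alpha, delta):
        if free_regs >= alpha:
            return slack
        return slack * (1.0 + delta * (alpha - free_regs) / alpha)

    def priority_exponential(slack, free_regs, alpha, delta):
        return slack * (1.0 + delta * math.exp(-free_regs / alpha))

    for f in (priority_step, priority_linear, priority_exponential):
        print(f.__name__, f(10.0, free_regs=2, alpha=4, delta=0.5))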

For all three heuristics, a large number of experiments are performed while varying the parameter values. Unfortunately, the experiments showed that changing the priority function has little impact on the resulting performance.



[Figure 6.16 plots the weight applied to the priority (ranging from 1 to 1 + δ) against the number of available registers |Rfree| for the step, linear and exponential heuristics, with the threshold α marked on the horizontal axis.]

Figure 6.16: The priority as a function of the number of available registers for the three heuristics.

Only when registers are extremely scarce is a change in the priority scheme profitable. However, the performance gain is very small (in the order of 0.1%). Changing the operation priority scheme when a sufficient amount of registers is available may even hurt performance. Analysis of this unexpected behavior resulted in the following observations:

• The size of the ready set is usually small. Since O⊥ is a subset of ready, this set is even smaller. Consequently, only a few operations are subject to a change in priority.

• A change in the priority when registers were scarce did not result in an increased performance, because the operations that were selected due to the new heuristic were selected anyway with the slack priority function. The same operations were selected independent of the priority function.

• Changing the priority hurts performance when operations that are not in the critical path are selected first. This was especially true when the priority was increased when sufficient registers (> 3) were available.

The above observations result in the conclusion that the slack priority heuristic, when using integrated assignment, not only favors critical operations but also does a fairly good job in controlling the register pressure.

6.6.3 Basic Block Selection

In all experiments, the procedures of the benchmark programs are scheduled independently. Decisions made in one procedure do not influence decisions in other procedures. The next hierarchical level in a basic block scheduler is the basic block. When basic block scheduling and register assignment are done in separate phases, the order in which the basic blocks are scheduled has no influence on the resulting code. The basic blocks are scheduled as independent pieces of code. When, however, instruction scheduling and register assignment



[Figure 6.17 plots the performance gain (%) against the number of registers (10 to 512) for the TTAideal and the TTArealistic.]

Figure 6.17: Speedup of the BBfreq heuristic compared to the BBtop heuristic.

are integrated into a single phase, the scheduling order of the basic blocks does matter. Register assignment decisions made during instruction scheduling in one basic block can influence scheduling/assignment decisions in other basic blocks, since the live ranges of variables cross basic block boundaries. References in basic blocks that are scheduled early in the process can choose from all registers, while references in basic blocks that are scheduled later on can only pick registers from a smaller set of available registers. Register assignment for these later references is more likely to be hindered by a shortage of registers. The introduction of false dependences and spill code might be necessary.

In this section, two basic block priority functions are proposed and evaluated. Profiling information is used for determining the execution frequencies of the basic blocks. The evaluated heuristics are listed below; a sketch of both orderings follows the list.

BBtop Basic blocks are selected for scheduling in topological order; a basic block is selected when all its predecessor basic blocks are scheduled.

BBfreq The basic blocks are ordered according to their execution frequency. Basic blocks that are executed more frequently are scheduled first. Consequently, the set of free registers is larger in the most frequently executed basic blocks. This results in a large scheduling freedom, since registers do not hinder the construction of an efficient schedule in these code parts.

When multiple basic blocks satisfy the selection criterion, the selection is done randomly. The assumptions concerning entry and exit basic blocks as discussed in Section 6.5.1 are respected.
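
A Python sketch of the two orderings on a toy CFG; blocks are (name, predecessor set, execution frequency) tuples, the random tie-break is replaced by a deterministic choice, and the special handling of entry and exit basic blocks is left out.

    def bb_top(blocks):
        """Topological order: a block is selectable once all its predecessors
        have been scheduled (assumes an acyclic toy CFG)."""
        scheduled, order = set(), []
        while len(order) < len(blocks):
            ready = [name for name, preds, _ in blocks
                     if name not in scheduled and preds <= scheduled]
            order.append(ready[0])
            scheduled.add(ready[0])
        return order

    def bb_freq(blocks):
        """Frequency order: the most frequently executed blocks come first and
        therefore see the largest set of free registers."""
        return [name for name, _, freq in sorted(blocks, key=lambda b: -b[2])]

    cfg = [("entry", set(), 1), ("loop", {"entry"}, 100), ("exit", {"loop"}, 1)]
    print(bb_top(cfg))    # ['entry', 'loop', 'exit']
    print(bb_freq(cfg))   # ['loop', 'entry', 'exit']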

Figure 6.17 compares the performance of both methods. The speedup numbers are obtained by averaging the speedups over all benchmarks. As can be seen from the results, the basic block scheduling heuristic has a large impact on the performance of integrated assignment. When registers are not a critical resource, the methods give approximately the same results. However, when



[Figure 6.18 plots the performance gain (%) against the number of registers (10 to 512) for the TTAideal and the TTArealistic.]

Figure 6.18: Speedup of integrated assignment compared to DCEA.

registers become scarce, the execution frequency conscious heuristic outperforms the topological approach by more than 10%.

The results show a larger performance gain for the TTArealistic than for the TTAideal. This effect is caused by the fact that resources like FUs and buses are more limited for the TTArealistic. When applying the BBtop heuristic with a limited set of registers, spill code is inserted in code with a high execution frequency, such as loops. These extra operations require extra resources such as FUs. When these resources are limited, the operations cannot be executed in parallel; hence the size of the scheduled basic blocks is increased. When the resources are unlimited, this increase will be smaller or non-existent. When applying the BBfreq heuristic, spill code is inserted in code with a low execution frequency. Because the TTArealistic has a limited set of resources, these basic blocks also enlarge. However, this does not hurt performance as much as the BBtop heuristic does. For the TTAideal there are a large number of resources and hence the spill code can be scheduled in parallel when other dependences permit this. Consequently, the performance penalty when applying the BBtop heuristic is less pronounced.

It is interesting to note that the performance gain for 10 registers is lower than for 12 registers for the TTArealistic. This effect can be explained by the fact that when a large amount of spill code is needed, it will be placed in basic blocks with a high frequency count anyway.

6.6.4 Early vs. Integrated Assignment

To evaluate the introduced integrated assignment method, we compare its results, for each of the benchmarks described in Section 4.1, with the results of DCEA (Dependence-Conscious Early Assignment), the best register assignment method of Chapter 5. The averages of these measurements for both target TTAs are shown in Figure 6.18 (the results of the individual benchmarks



[Figure 6.19 shows four panels: a) a code fragment of four transports (add.r → v1; v1 → st.t; sub.r → v2; v2 → mul.t), b) the DDG with a false dependence when v1 and v2 are both assigned to register r7, c) the early assignment schedule, d) the integrated assignment schedule.]

Figure 6.19: False dependence effect when registers are scarce.

can be found in Appendix A). The number of registers in each target TTA is placed along the x-axis. The speedup of integrated assignment compared with DCEA is listed along the y-axis. The average performance gain varies between 0% for 512 registers and 21.3% for 10 registers.

As can be observed from the figure, the speedup is substantial for low register counts. As already demonstrated in Section 5.1, DCEA successfully tries to avoid false dependences; however, when registers become a critical resource it becomes difficult to avoid false dependences. This is illustrated in Figure 6.19. Figure 6.19a shows a small code fragment consisting of four transports. When sufficient registers are available, DCEA will prevent a false dependence and assigns different registers to v1 and v2. However, when registers are scarce this is no longer guaranteed. To prevent spilling of other variables, or in an attempt to prevent other false dependences, DCEA may assign the same register to both variables v1 and v2. This results in a false dependence in the DDG as depicted in Figure 6.19b. Although software bypassing and dead-result move elimination can, and often will, be applied during scheduling, the false dependence in the DDG prevents an optimal schedule. The resulting schedule is given in Figure 6.19c. The optimal schedule is given in Figure 6.19d. With integrated assignment it is possible to generate the optimal schedule even when registers are scarce, because it can re-use the registers freed due to software bypassing and dead-result move elimination.

The direction of a false dependence added by an early assignment approach, such as DCEA, is determined by the operation order in the sequential code. This is depicted in Figure 6.20. The sequential code is shown in Figure 6.20a. When the variables v1 and v2 are mapped onto the same register, a false dependence is added to the DDG. There are two possibilities, δa1 and δa2, as shown in Figure 6.20b. Only one of them needs to be added. Adding false dependence δa1 results in a critical path of 5 instructions, while adding δa2 results



[Figure 6.20 shows two panels: a) the code fragment

#100 → ld.t
ld.r → v3
v3 → add.o  #4 → add.t
add.r → v1;
v1 → st.o;  #100 → st.t
v3 → sub.o  #4 → sub.t
sub.r → v2;
v2 → mul.o  #7 → mul.t
mul.r → v4;
v4 → st.o;  #104 → st.t

and b) the corresponding DDG with the two possible false dependences δa1 and δa2.]

Figure 6.20: Direction of false dependences.

in 4 instructions in the critical path, assuming single cycle latencies. Apparently, false dependence δa2 is preferable. Unfortunately, early assignment will add δa1 due to the operation order in the sequential code. A solution could be to take this observation into consideration when adding false dependences prior to scheduling. However, during scheduling the critical path may change due to scheduling decisions, and thus the effect of the selected false dependence may turn out not to be advantageous. In other words, the order of operations and the introduced false dependences hinder out-of-order scheduling. Because integrated assignment assigns registers during scheduling, it allows reordering of the operations and can adapt to new situations caused by scheduling decisions.

Another effect, which emerges when registers become scarce, is the interaction between register assignment and instruction scheduling. The scheduler selects operations for scheduling according to a priority function, which favors the operations in critical paths. Because integrated assignment assigns a register to a variable when the first operation that refers to this variable is scheduled, it respects the operation ordering of the instruction scheduler. In other words, for operations in the critical path it is more likely to find a free register. Early assignment does not interact with the instruction scheduler and may introduce false dependences and spill code in the critical path.

The results in Figure 6.18 show that the performance gains for the TTAideal are larger than the performance gains for the TTArealistic. When applying early assignment for the TTArealistic, false dependences can be concealed by a shortage of resources such as FUs and buses. In other words, some operations are



[Figure 6.21 plots the relative cycle count increase (%) against the number of registers (10 to 256) for the benchmarks crypt, 132.ijpeg, djpeg, expand and mulaw.]

Figure 6.21: Cycle count increase relative to the TTAideal with 512 registers while applying DCEA.

scheduled sequentially anyway due to a resource conflict, independent of the presence of a false dependence. This is not the case for the TTAideal; a false dependence is not hidden by a resource shortage since a large number of resources are provided. Because a false dependence introduced by early assignment when scheduling for the TTAideal is more visible than when scheduling for the TTArealistic, the performance gain for the TTAideal is larger.

As can be seen in the tables in Appendix A, the performance gains of the benchmarks differ significantly. The benchmarks crypt, 132.ijpeg and djpeg have large performance gains, while the benchmarks expand and mulaw do not show any improvement. Figure 6.21 shows the cycle count increase of DCEA relative to the TTAideal with 512 registers for the five mentioned benchmarks. The relative cycle count for the benchmarks crypt, 132.ijpeg and djpeg increases when the number of registers decreases. These benchmarks require a large number of registers, and thus opportunities are present to allow integrated assignment to improve performance. The relative cycle count for the benchmarks expand and mulaw is not influenced when the number of registers is reduced. These benchmarks only require 10 registers and thus integrated assignment cannot improve their performance.

In some situations, a small negative performance effect can also be observed, as shown in the tables in Appendix A. This is mainly caused by the register selection heuristics. Both methods, DCEA and integrated assignment, use heuristics to determine whether to assign a caller-saved or callee-saved register, or to spill the variable. In some situations, this choice leads to a negative performance impact. The effect is larger for the TTArealistic than for the TTAideal because the inserted spill code requires extra resources. When resources are limited, the impact is larger. However, on average, integrated assignment outperforms DCEA.



6.7 Conclusions

Based on the observations of Chapter 5, an integrated assignment method is developed in combination with a basic block scheduler. In this chapter, the general principle of this innovative method is presented. A method is developed which constructs, for a variable v, the set of all registers onto which v can be mapped, while part of the code is already scheduled and part of the variables are already assigned to registers. This set contains instruction-precise information about available registers. The introduced method does not add false dependences prior to scheduling. During scheduling/assignment, false dependences are only created when there are not enough registers. In contrast to early assignment methods, the number of required registers is reduced by exploiting software bypassing in combination with dead-result move elimination. In order to be complete, specific issues related to the insertion of spill code and the insertion of caller-saved and callee-saved code are addressed. Furthermore, several heuristics are presented and evaluated.

The method is compared with the best early assignment approach found in Chapter 5. All methods are implemented within the same compiler, which makes a fair comparison possible. The experiments showed performance gains of up to 100%. Especially when registers were scarce, integrated assignment outperformed DCEA. Since we compared our method with the best approach implemented in Chapter 5, we conclude that integrated assignment also outperforms late assignment and strictly early assignment.

Although we discussed the new method in the context of TTAs, we believe that it is also applicable to superscalars and VLIWs. A problem is, however, that these architectures do not suppress unnecessary register write-backs and thus spill code cannot be scheduled without the use of registers. To overcome this problem, one can reserve a small set of registers for spilling. Alternatively, an extension for superscalar processors, as proposed in [LG95], could alleviate this problem. Lozano and Gao describe a hardware scheme that avoids the commits of variables that are only live in the reorder buffer. This is similar to software bypassing, with the advantage that dependent instructions do not have to be scheduled in the same cycle in order to avoid commits, which relaxes the scheduling process and may result in even larger speedups.


7 Integrated Assignment and Global Scheduling

The amount of exploitable ILP in basic blocks is limited. To justify the duplication cost of FUs and data paths in ILP processors, the ILP between operations of different basic blocks should also be exploited. In general, more exploitable ILP increases the number of simultaneously live variables in the schedule produced. Consequently, the register pressure increases and registers should be assigned with more care. In this chapter, an extension of integrated assignment is proposed, which fully integrates register assignment and region scheduling.

This chapter is organized as follows. In Section 7.1, the construction of the interference register set is discussed when operations are imported into basic blocks. The algorithm for importing operations is presented in Section 7.2. Section 7.3 presents an example of the proposed method. When integrated assignment runs out of registers, a decision has to be made whether to insert spill code or to schedule the code less aggressively. This is discussed in Section 7.4. The insertion of state preserving code in a region scheduler and its consequences are described in Section 7.5. In Section 7.6, various register selection heuristics are presented to increase the performance of the generated code. The last section evaluates the proposed techniques and states the conclusions.

7.1 The Interference Register Set

An extended basic block scheduler increases the exploitable ILP by moving operations over basic block boundaries. For reasons discussed in Section 3.4.4, importing operations may result in code duplication. Importing and




[Figure 7.1 shows the live ranges of the variables v1-v4 in the duplication basic blocks bD, the intermediate basic blocks bI and the source basic block b′, before and after importing the moves of the addition.]

Figure 7.1: Stretching and shrinking of live ranges when importing all the moves of the addition.

duplication changes the live ranges of the variables referenced by the imported operation. This is shown in Figure 7.1. Importing the addition of basic block b′ to the basic blocks bD results in a shorter live range for the variable v2, while the live range of v3 is stretched. The live range of v1 does not change, because it is required by the copy operation in the source basic block b′.

Importing operations changes the number of basic blocks spanned by the referenced live ranges. When a live range is stretched, the register allocator must check additional basic blocks for a legal assignment. The opposite holds for live ranges that shrink. The consequences of shrunk and stretched live ranges on the computation of the interference register set are discussed in Section 7.1.1 and Section 7.1.2, respectively.

7.1.1 Importing a Use

Importing an operation n, which uses a variable v, may result in a shorter live range for v. For a legal register assignment, the live-variable information must be updated. The imported operation n is removed from basic block b′; therefore, the sets liveUse(b′) and liveDef(b′) are recomputed. Operation n is added to the duplication basic blocks bD ∈ D. This has the following consequence for the liveUse sets of these basic blocks.



liveUse(bD) = liveUse(bD) ∪ {v}   if v ∉ liveDef(bD)
liveUse(bD) = liveUse(bD)         otherwise
∀ bD ∈ D     (7.1)

The basic blocks on all paths from the duplication basic blocks (bD ∈ D) to the source basic block (b′) are called intermediate basic blocks (bI ∈ I).

I = {b ∈ B | bD ⇝ b ∧ b ⇝ b′, bD ∈ D}     (7.2)

where b ⇝ b′ means that there is a control flow path within the region from b to b′. The live-variable information in these intermediate basic blocks may also change when operation n is imported. It is tempting to simply remove the variable v from the sets liveIn and liveOut of all basic blocks bI ∈ I. Unfortunately, sometimes the intermediate basic blocks have references to v, or v is an element of the set liveIn of one of the successors (excluding b′ and the intermediate basic blocks) of the intermediate basic blocks. In these situations, the live range shrinks only partly. The new live-variable information in the intermediate basic blocks can be computed with Equations 3.1 and 3.2. Note that only the live-variable information of the basic blocks b ∈ {b′} ∪ I ∪ D changes. Only for these basic blocks do the equations have to be solved.

Additional steps are required when variable v is already mapped onto a register r. The RRVs that are not spanned anymore by the new, shorter live range incorrectly indicate that r is already used and cannot be assigned to another variable. This hinders an efficient register assignment. The RRVs of the intermediate basic blocks bI ∈ I, the source basic block b′ and the duplication basic blocks bD ∈ D must be updated. Register r is removed from all RRVs of a basic block b ∈ {b′} ∪ I if v ∉ liveIn(b) ∧ v ∉ liveDef(b). The RRV information in the not yet scheduled duplication basic blocks remains unchanged because before and after importing, v is live in these basic blocks. The RRV information in the already scheduled duplication basic blocks is updated in the same way as if the use was scheduled with local scheduling.

The new live-variable and RRV information is computed prior to importing. This information reflects the situation as if operation n referring to v is imported in the duplication basic blocks bD1. Consequently, the same method for constructing the interference register set as in basic block scheduling can be used (see Equation 6.15).
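
A small Python sketch of this bookkeeping on a toy block representation (dicts of sets plus a list of per-instruction RRV sets): Equation 7.1 is applied to the duplication blocks, and register r is dropped from the RRVs of the source and intermediate blocks that the shortened live range no longer spans. The data model and the omission of the full data-flow recomputation are assumptions of the sketch.

    def import_use(v, reg, dup_blocks, intermediate_blocks, source_block):
        for bd in dup_blocks:                         # Equation 7.1
            if v not in bd["liveDef"]:
                bd["liveUse"].add(v)
        # liveIn/liveOut of b' and the intermediate blocks would now be
        # recomputed with the data-flow equations (Equations 3.1 and 3.2).
        if reg is not None:                           # v already has a register
            for b in intermediate_blocks + [source_block]:
                if v not in b["liveIn"] and v not in b["liveDef"]:
                    for rrv in b["rrvs"]:             # live range no longer spans b
                        rrv.discard(reg)

    src = {"liveUse": set(), "liveDef": set(), "liveIn": set(), "rrvs": [{"r3"}]}
    dup = {"liveUse": set(), "liveDef": set(), "liveIn": {"v1"}, "rrvs": [{"r3"}]}
    import_use("v1", "r3", [dup], [], src)
    print(dup["liveUse"], src["rrvs"])                # {'v1'} [set()]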

7.1.2 Importing a Definition

Importing an operation n, defining a variable v, stretches the live range of v. The basic blocks spanned by the new live range consist of the basic blocks of the original live range, the intermediate basic blocks and the duplication basic blocks. Importing operation n changes the live-variable information. To

1When the live range does not change, neither the live-variable nor the RRV information needs to be updated.



produce a legal register assignment, the sets liveUse(b′) and liveDef(b′) are recomputed. The set liveDef of the duplication basic blocks changes also:

liveDef(bD) = liveDef(bD) ∪ {v}   if v ∉ liveUse(bD)
liveDef(bD) = liveDef(bD)         otherwise
∀ bD ∈ D     (7.3)

In addition, the liveIn and liveOut sets of the basic blocks b ∈ {b′} ∪ I ∪ D change. These sets can be computed with Equations 3.1 and 3.2. However, there is a simpler method to compute these new live sets. For the source basic block b′ only the set liveIn changes:

liveIn(b′) = liveIn(b′) ∪ {v}     (7.4)

The duplication basic blocks bD require only an update of the set liveOut:

liveOut(bD) = liveOut(bD) ∪ {v} ∀ bD ∈ D (7.5)

Because the intermediate basic blocks bI ∈ I do not contain any references to v (otherwise importing operation n would be illegal2), their live information can simply be computed with:

liveIn(bI) = liveIn(bI) ∪ {v}   ∀ bI ∈ I     (7.6)
liveOut(bI) = liveOut(bI) ∪ {v}   ∀ bI ∈ I     (7.7)

The new live-variable information is computed prior to importing. It reflects the situation as if the reference to v is imported in the duplication basic blocks.
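
The simplified updates of Equations 7.3-7.7 translate directly into a few set operations; the Python sketch below uses the same toy block representation as before and is only an illustration of the bookkeeping.

    def import_definition(v, source_block, dup_blocks, intermediate_blocks):
        for bd in dup_blocks:
            if v not in bd["liveUse"]:
                bd["liveDef"].add(v)                  # Equation 7.3
            bd["liveOut"].add(v)                      # Equation 7.5
        source_block["liveIn"].add(v)                 # Equation 7.4
        for bi in intermediate_blocks:
            bi["liveIn"].add(v)                       # Equation 7.6
            bi["liveOut"].add(v)                      # Equation 7.7

    src = {"liveIn": set()}
    dup = {"liveUse": set(), "liveDef": set(), "liveOut": set()}
    mid = {"liveIn": set(), "liveOut": set()}
    import_definition("v3", src, [dup], [mid])
    print(src["liveIn"], dup["liveDef"], mid["liveOut"])   # {'v3'} {'v3'} {'v3'}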

The construction of the interference register set as given in Equation 6.15 assumes that all references to variable v are located in not yet scheduled basic blocks. When operations are imported, they can also be added to already scheduled duplication basic blocks3. Let us denote this set of scheduled duplication basic blocks D+ ⊆ D. The ordering of the operations in these basic blocks is completely known. This allows us to make a more accurate register availability estimation. The interference register set can now be computed with an equation similar to Equation 6.11, where icur is replaced with the earliest instruction EarliestInsn(b, n) in which the duplicated definition can be scheduled. Dependence constraints are used for computing the earliest instruction.

R+RRV(b, n) = ⋃_{i = EarliestInsn(b,n)}^{LastInsn(b)} RRV(i)     (7.8)

2Operation n would not be selected for importing, because of a false dependence between the operation referring to v in one of the intermediate basic blocks and operation n.

3This is caused by the fact that guarded expressions, in the current implementation, are only computed for already scheduled duplication basic blocks. When, for example, a variable is off-live, importing may fail because one of the duplication basic blocks is not yet scheduled. When all duplication basic blocks are scheduled or are being scheduled, the guarded expressions can be computed and importing may succeed. As a result, the operation is imported into already scheduled basic blocks.



[Figure 7.2 shows a CFG with basic blocks A-D and the live ranges of v1 and v2 before and after importing the definition of v1 from block D into block A; since both variables are assigned to r1, the stretched live range leads to a conflicting assignment of r1 in block C.]

Figure 7.2: Register conflict when operations are imported.

The interference register set for a variable v defined by an imported operation n can now be computed with:

RInterfere(v, n) = ⋃_{b ∈ BIO(v)} RIO(v, b)  ∪  ⋃_{b ∈ BDU(v)−D+} RDU(v, b)  ∪  ⋃_{b ∈ D+} R+RRV(b, n)     (7.9)

Note that due to importing, I ⊆ BIO and D ⊆ BDU.

A refinement can be made when computing the interference register sets for the basic blocks b ∈ D+. It can happen that the register requirements of the earliest instruction where the definition can be scheduled, in combination with the register requirements of the other parts of the live range, exceed the number of available registers. When the definition is scheduled in a later instruction, the register pressure may be reduced. Therefore, the definition is annotated with the earliest instruction in which the register requirements are still met. The scheduler uses this information to determine the earliest instruction in which a scheduling attempt is made.
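
Equations 7.8 and 7.9 amount to a few set unions; the Python sketch below operates on a toy model in which each already scheduled duplication block is a list of per-instruction RRV sets, and the contributions RIO and RDU of the other basic blocks are passed in as precomputed register sets (an assumption of the sketch).

    def r_rrv_plus(block_rrvs, earliest):
        """Equation 7.8: union of the RRVs from the earliest feasible
        instruction up to the last instruction of the block."""
        interfering = set()
        for rrv in block_rrvs[earliest:]:
            interfering |= rrv
        return interfering

    def interference_for_imported_def(r_io_sets, r_du_sets, scheduled_dups):
        """Equation 7.9 for a variable defined by an imported operation."""
        result = set()
        for regs in r_io_sets:                        # blocks where v is live in/out
            result |= regs
        for regs in r_du_sets:                        # unscheduled blocks with def/use
            result |= regs
        for block_rrvs, earliest in scheduled_dups:   # blocks in D+
            result |= r_rrv_plus(block_rrvs, earliest)
        return result

    print(sorted(interference_for_imported_def(
        r_io_sets=[{"r3"}], r_du_sets=[{"r2"}],
        scheduled_dups=[([{"r1"}, {"r1", "r4"}], 1)])))
    # -> ['r1', 'r2', 'r3', 'r4']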

The previous discussion assumes that variable v is not yet mapped onto a register. This is, however, not always true. When a register was already assigned before importing, the situation can arise that the assigned register is not free in the stretched live range. This is illustrated in Figure 7.2. Assume that the definition of r1 in basic block D is imported into basic block A. As a consequence, in basic block C, two variables are live simultaneously that are assigned to the same register. This will result in incorrect program execution. Consequently, importing is illegal. However, it may turn out that another register can be used



Algorithm 7.1 TRYTOIMPORTOPERATION(b, o)

    b′ = BB(o)
    D = COMPUTEDUPLICATIONSET(b, b′)
    UPDATEUSEINFORMATION(b′, D, o)
    UNASSIGNDEFINITION(o)
    UPDATEDEFINFORMATION(b′, D, o)
    FOR EACH b′′ ∈ D DO
        IF b′′ is scheduled THEN
            IF ¬ TRYTOSCHEDULEOPERATION(b′′, o) THEN
                RELEASERESOURCES(D, o)
                RESTOREDEFINFORMATION(b′, D, o)
                REASSIGNDEFINITION(o)
                RESTOREUSEINFORMATION(b′, D, o)
                return
            ENDIF
        ENDIF
        b′′ = b′′ ∪ {o}
    ENDFOR
    b′ = b′ − {o}

for the new stretched live range. Assigning this register to variable v2 allows the code motion. Therefore, prior to computing the interference register set, the assignment of the definition is undone, and the same procedure is followed as if the variable was not assigned. When it turns out that insufficient resources are available to successfully import the operation, importing fails and the original assignment is redone.

7.2 Importing Operations

The implementation of integrated assignment into the region scheduler requires some changes to Algorithm 3.8, as shown in Algorithm 7.1. Prior to importing, the live-variable and RRV information is changed, as if the operation was imported, using the functions UPDATEUSEINFORMATION and UPDATEDEFINFORMATION. As discussed in the previous section, an earlier assignment of the definition may hinder the algorithm in finding a free register. Therefore, the assignment of the definition is undone with the function UNASSIGNDEFINITION. When all information is updated, the scheduler attempts to import operation o in the duplication basic blocks D. After an operation has been imported successfully, the RRV information is updated. When, however, due to some resource constraint this is not feasible, all scheduling decisions made so far for this operation, and the changed live-variable and RRV information, must be restored to their original state.

When a variable has definitions or uses outside the scheduled region, the RRVs in the live range that are outside the region are also updated. This is in contrast with the approach chosen by [FR91], where special data structures are needed to distribute the register assignment information from one trace to another. In our approach, the register assignment information is implicitly distributed to other regions.

7.3 Example

An extensive example is used to demonstrate the operation of the proposed method. We start this example with the CFG given in Figure 7.3. It shows the schedule before the addition in basic block E is imported into the duplication basic blocks A and B. It is assumed that basic block A is currently being scheduled and basic block B is already scheduled. Due to earlier scheduling steps, both basic block A and B contain three instructions (for reasons of clarity, the already scheduled operations are not shown). The basic blocks are annotated with live-variable and RRV information. Variable v1 is already mapped onto register r3. This assignment is reflected in the RRVs of all shown basic blocks. Because the basic blocks A and B are already scheduled, and v1 is live on

[Figure 7.3 shows the CFG with basic blocks A-E, annotated with liveIn/liveOut sets and per-instruction RRVs; variable v1 is already mapped onto register r3.]

Figure 7.3: CFG prior to importing.



[Figure 7.4 shows the same CFG with the live-variable information and RRVs updated as if the moves v1(r3) → add.o and v2 → add.t were imported into the basic blocks A and B; register r3 is removed from the RRVs of the basic blocks D and E.]

Figure 7.4: CFG as if the operand and trigger move of the addition were imported.

entry of both basic blocks, all their RRVs contain register r3. The other basic blocks (C, D and E) are not scheduled yet. They have only one RRV, with register r3 set as unavailable.

Importing the addition results in a shorter live range for variable v1. Because basic block C contains a reference to v1, the live range cannot shrink completely. The live range of variable v2 does not shrink at all, because basic block E contains another use of v2. Figure 7.4 shows the situation as if the moves v1(r3) → add.o and v2 → add.t were imported. In addition to the changes in the live-variable information, some RRVs are also changed. Because v1 is not live anymore in the basic blocks D and E, register r3 is removed from their RRVs.

Importing the transport add.r → v3(r2) stretches the live range of variable v3. Note that v3 was already assigned to register r2. This register, however, is also used in basic block D by variable v5, which makes importing illegal. To allow importing, the assignment of v3 is undone. Figure 7.5 shows the result of unassigning r2, and the new live-variable and RRV information, as if the move add.r → v3 is imported.

In the last step, the addition and its duplicate are scheduled. First, the trigger and operand moves are scheduled. During scheduling, register r1 is assigned to variable v2.



Figure 7.5: CFG as if the result move of the addition is imported.

This register is added to the RRVs spanned by v2’s new live range. Because of this assignment, and the previously recorded register assignment information in the RRVs, the interference register set for v3 becomes {r1, r2, r3}. In the example, r4 is selected and assigned to variable v3. The resulting scheduled code, including the updated information in the RRVs, is shown in Figure 7.6.


Figure 7.6: CFG after importing.


7.4 Spilling

Importing fails when, during importing, it is discovered that insufficient resources, including registers, are available for the operation to be scheduled and its duplicates. When a shortage of registers is the cause of the failure, the insertion of spill code can solve this problem. Both types of references, uses and definitions, have their own specifics in relation to spilling.

The decision whether to spill a variable that is defined by an operation that is to be imported is based on the following observations:

• The live range of a definition (result move) is lengthened when the operation is imported, and hence the number of interferences increases. When a register can be found for the original live range but not for the stretched live range, it is unclear whether spilling will be advantageous.

• Generating spill code for each definition for which no register can be found results in a similar problem as with late assignment: importing is applied too aggressively. This results in a large amount of spill code. The inserted spill code also needs extra resources, which leads to inefficient code.

• When other operations are imported, registers might become available and another attempt can be made.

Based on the above observations, it was decided not to insert spill code when no register can be found for an imported definition. Instead, importing is rejected. However, when another operation is successfully imported, another attempt is made to import the operation since the register pressure might be reduced.

Importing an operation may result in shorter live ranges for the variables used by it. In general, a shorter live range has less interference with other live ranges. Therefore, the probability of finding a register for this shorter live range is larger than for the original, longer live range. In our approach, spill code is inserted when no register can be found for the new live range. The motivation to insert spill code instead of rejecting importing is the following: (1) because no register could be found for the shorter live range, it is very unlikely that a register can be found for the original live range, and thus the variable would be spilled anyhow; (2) it might be possible that all inserted spill operations, including the operation itself, can be imported because the register shortage problem was located elsewhere in the code. Therefore, when importing fails due to insufficient registers for the uses, the necessary reload code is inserted in the source basic block b′ just before the operation. In subsequent scheduling steps, an attempt is made to import the individual operations of the reload code sequence. This can be problematic, because extra registers are required for the new live ranges that are local to the reload code sequence. In Section 6.4.3, it has been shown that the problem of extra registers for scheduling reload code sequences can be overcome by scheduling the complete reload code sequence in a single step. Unfortunately, importing a complete reload code sequence is complex and has a high probability of failure, because it is very unlikely that the complete reload code sequence fits into the available instructions of all duplication basic blocks. Our choice to import the operations of the reload code sequence individually (and thus to assign registers to the live ranges introduced by spilling) has as a consequence that incomplete reload code sequences may remain in source basic block b′. Therefore, the algorithm for scheduling reload code sequences is modified in such a way that it can also schedule the incomplete reload code sequences that are left in the source basic blocks.

In general, importing spill code and reload code is handled in the same way as importing other operations. The incomplete spill code and reload code sequences that are left in the source basic blocks are scheduled as if they were a single operation, i.e., software bypassing and dead-result move elimination are enforced when no registers are available.
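The different treatment of definitions and uses can be summarized in a small decision routine. The sketch below is a simplified illustration under the assumptions stated in the comments; insert_reload and the pending retry list are hypothetical helpers, not the compiler's actual interfaces.

def handle_register_shortage(op, failed_ref, source_block, pending, insert_reload):
    # failed_ref is 'def' when no register was found for the stretched
    # definition, or 'use' when none was found for a shortened use.
    if failed_ref == "def":
        # Do not spill a definition: reject the import and try again after
        # the next successful import, when registers may have been freed.
        pending.append(op)
        return "rejected"

    # A use whose shortened live range still has no register is spilled:
    # reload code is inserted just before the operation in the source
    # block; its operations become import candidates of their own.
    insert_reload(op, source_block)
    return "spilled"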

7.5 State Preserving Code

The insertion of callee-saved code is normally applied when all variables have been mapped onto registers; only at that point is it known which registers need to be saved. This is problematic for an integrated assignment approach. In Section 6.5.1, a method has been described in the context of basic block scheduling, which inserts callee-saved code in the exit and entry basic blocks of a procedure when all other basic blocks have been scheduled. A similar approach is used for region scheduling: callee-saved code is inserted in the entry and exit basic blocks when the outermost region is selected for scheduling. Because regions are scheduled from inner to outer, all other regions are already scheduled, and the variables referenced in these regions are already mapped onto registers. Similar to the approach presented in Section 6.5.1, an early assignment step is applied to map the variables in the outermost region onto callee-saved registers. Afterwards, the callee-saved code is inserted. This approach allows importing of the callee-saved code inserted in the exit basic blocks, which might improve performance.

Procedure calls are never subject to importing, because it is unlikely to find empty instructions for the call delay slots in all already scheduled duplication basic blocks. Scheduled duplication basic blocks only contain empty instructions when the delay of an operation is longer than a single cycle. However, inserting a procedure call between the trigger and the result of an operation results in incorrect program behavior. Inserting extra empty instructions is also not an option, for the same reasons as discussed in Section 5.2.1.

Since procedure calls are never imported, it seems logical to use the same approach for scheduling procedure calls as in basic block scheduling. Unfortunately, this leads to less parallel code than possible.



a) Code example. b) Inserted caller-saved code.


c) Imported caller-saved code.

Figure 7.7: Importing procedure calls.

This is shown in Figure 7.7. Figure 7.7a shows a code fragment in which it is assumed that basic block A is already scheduled, and all operations that could be imported have been imported. The next basic block to be scheduled is B. Selecting the procedure call for scheduling results in the generation of caller-saved code in the same way as described in Section 6.5.2. The generated caller-saved code is now scheduled in basic block B. This is shown in Figure 7.7b. However, the generated operations could have been candidates for importing, which could increase performance. The problem is that the caller-saved code is generated too late.

Selecting procedure calls for importing without actually importing them solves this problem. When the region scheduler selects a procedure call for importing, caller-saved code is generated in the basic block of the procedure call. This code is not yet scheduled and subject to importing. Consequently, the generated caller-saved code can be imported. This is shown in Figure 7.7c.

Due to the importing of other operations, other live ranges can become live across not yet scheduled procedure calls. This does not create extra problems, since each time an operation is successfully imported, all other operations for which importing failed, including procedure calls, are tried again. When a procedure call is selected again for scheduling, caller-saved code is generated for these new live ranges too. In other words, caller-saved code is generated incrementally.

7.6 Experiments and Evaluation

In this section, the proposed integrated assignment method is evaluated in combination with region scheduling. In order to optimize the performance, various heuristics are proposed and evaluated. In Section 7.6.1 the selection order of regions is addressed. So far, the decision to spill a variable v was based on the ratio of the spill cost, caller-saved cost and callee-saved cost. In Section 7.6.2 a new heuristic is proposed, which also takes the spill cost of the variables interfering with v into account. In the last section, the performance of integrated assignment is compared with DCEA, the best register assignment approach found in Chapter 5.

7.6.1 Region Selection

Region scheduling traditionally starts with scheduling the innermost regions first. When assigning registers for these inner regions, all registers are available and spilling is less likely to occur. However, the innermost regions are not always the most heavily executed regions. Another approach is to schedule the most frequently executed regions first. Consequently, spill code is pushed to the less frequently executed regions, which should potentially result in better performance. In this section, two region selection heuristics are compared (a small ordering sketch follows the list):

1. Rtop: A region is selected for scheduling when all its inner regions are scheduled.

2. Rfreq: The most frequently executed regions are scheduled first.

Note that within a region, the basic blocks are always scheduled in topological order.
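The two orderings can be sketched as follows; the region representation (a nesting 'depth' and an execution frequency 'freq' per region) is assumed here for illustration only and does not correspond to the scheduler's actual data structures.

def region_order(regions, heuristic="Rtop"):
    # Rtop: schedule inner regions first, i.e. a region is selected only
    # after all regions nested inside it; for a proper nesting this
    # corresponds to decreasing nesting depth.
    if heuristic == "Rtop":
        return sorted(regions, key=lambda r: -r["depth"])
    # Rfreq: most frequently executed regions first.
    return sorted(regions, key=lambda r: -r["freq"])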

The results of our experiments indicated that the Rfreq heuristic only slightly outperforms the Rtop heuristic, by 0.5% on average. This seems surprisingly low, because a similar heuristic for local scheduling resulted in much higher performance gains (Section 6.6.3). Fortunately, this difference can easily be explained: many of the inner regions are also the most frequently executed regions. Therefore, the cycle count of many benchmarks is the same, independent of the region selection method.


7.6.2 Global Spill Cost Heuristic

The approach used so far to decide whether to spill a variable v only takes the costs associated with v into account. It ignores the spill costs of the variables that are simultaneously live with v. A variable v referenced by an operation that is scheduled at an early stage in the scheduling process has a high probability of being mapped onto a register. Variables referenced by operations that are scheduled at a later stage are often spilled to memory. However, it may turn out that the spill cost associated with these variables is higher than the spill cost of v, and hence spilling them may have a negative impact on performance.

The above discussion leads to the conclusion that the spill costs of variables that are simultaneously live with v should also be taken into account. Therefore, a global spill cost heuristic (GSC) is proposed. At each point (instruction) in the live range of v, the set of interfering variables is computed. When the number of not yet assigned variables with a higher spill cost than v exceeds the number of available registers at some point in v's live range, it may be advantageous to spill v. Unfortunately, the begin and end points of many live ranges are unknown, because the references to the variables, including v, are not yet scheduled. Therefore, it is impossible to accurately compute the set of interfering variables at each point in the live range of v. To tackle this problem, the definition of a point is changed from instruction to basic block. Only the basic blocks in which v is live on entry and exit are taken into consideration. The set of not yet assigned variables with a higher spill cost than v in a basic block b can now be computed with:

liveSpill(v, b) = {vi ∈ livebb(b) − {v} | Cspill(vi) > Cspill(v) ∧ r(vi) = ∅}    (7.10)

where livebb(b) = liveIn(b) ∪ liveDef(b) is the set of all variables live within basic block b. The function r(v) returns the register that is mapped onto variable v; it returns the empty set when no register is assigned to v. The set of available registers in a basic block b is computed with:

RAvail(b) = R − ⋃_{i=0}^{LastInsn(b)} RRV(i)    (7.11)

The global spill cost heuristic can now be described as: a variable v is spilled to memory when |liveSpill(v, b)| > |RAvail(b)| for any basic block b in the live range of v.
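A direct transcription of Equations 7.10 and 7.11 and the resulting spill test is given below. The plain dictionaries and sets are only meant to illustrate the computation; they do not correspond to the compiler's data structures.

def live_spill(v, b, live_in, live_def, spill_cost, assignment):
    # Equation 7.10: not yet assigned variables in block b whose spill cost
    # exceeds that of v.
    live_bb = live_in[b] | live_def[b]
    return {u for u in live_bb - {v}
            if spill_cost[u] > spill_cost[v] and u not in assignment}

def r_avail(b, registers, rrv):
    # Equation 7.11: registers not recorded in any RRV of block b;
    # rrv[b] is assumed to be a list with one register set per instruction.
    used = set().union(*rrv[b]) if rrv[b] else set()
    return registers - used

def gsc_spill(v, blocks, live_in, live_def, spill_cost, assignment, registers, rrv):
    # GSC heuristic: spill v when, in some block of its live range, the
    # more expensive unassigned variables outnumber the free registers.
    return any(len(live_spill(v, b, live_in, live_def, spill_cost, assignment))
               > len(r_avail(b, registers, rrv))
               for b in blocks)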

During the experiments, it was also noticed that importing operations works too greedily. As a result, too many live ranges were stretched and no more registers were available in the intermediate basic blocks I. This leads to excessive spill code in these basic blocks. Fortunately, the GSC heuristic also controls the register pressure in these basic blocks.



Figure 7.8: Performance impact of the GSC heuristic.

An operation defining a variable v is not imported when the GSC heuristic evaluates to true in any of the basic blocks spanned by the new (stretched) live range. It is of course not necessary to apply the above heuristic when, due to importing, the number of live ranges in the intermediate basic blocks remains constant or decreases. Operations that do not define a variable are never restricted from importing, because importing these operations does not increase the register pressure. The performance impact of applying the GSC heuristic is shown in Figure 7.8. Applying this heuristic is indeed beneficial.

One could argue that the GSC heuristic is too conservative. It ignores the fact that scheduling rearranges the code, which also changes the interference relations. Furthermore, the set liveSpill(v, b) is too conservative: it contains all variables referenced within a basic block b, while it is very unlikely that all these variables are live simultaneously, and thus the register pressure is probably lower. In addition, software bypassing and dead-result move elimination may remove variables from the liveSpill(v, b) set. In order to take these effects into account, it seems a good idea to relax the GSC heuristic. Therefore, we propose to use the GSC heuristic only when the number of available registers RAvail(b) drops below a certain threshold n. The new heuristics are denoted as GSCn. Our experiments indicated that n = 5 resulted, on average, in the highest performance. The performance gains when applying the GSC5 heuristic compared to the GSC heuristic are shown in Figure 7.9.
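The relaxed variant only differs in an extra test on the number of available registers; the helpers live_spill and r_avail are those from the previous sketch, and the argument layout is again only illustrative.

def gsc_n_spill(v, blocks, n, live_in, live_def, spill_cost,
                assignment, registers, rrv):
    # GSCn: apply the GSC test only in blocks where fewer than n registers
    # are still available (n = 5 performed best on average).
    for b in blocks:
        avail = r_avail(b, registers, rrv)
        if len(avail) >= n:
            continue
        if len(live_spill(v, b, live_in, live_def, spill_cost, assignment)) > len(avail):
            return True
    return False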

7.6.3 Early vs. Integrated Assignment

The key question is: how does the introduced method compare, in terms of cycle counts, to the best register allocator of Chapter 5? Figure 7.10 shows this performance comparison when all benchmarks are scheduled using a region scheduler. The TTAideal and TTArealistic templates are used with a varying number of registers.



Figure 7.9: Performance impact of the GSC5 heuristic.


Figure 7.10: Speedup of integrated assignment compared to DCEA using region scheduling and the TTAideal and TTArealistic templates.

As can be seen, integrated assignment outperforms DCEA. Especially when registers are scarce, the performance improvement is large. The results of the individual benchmarks can be found in Appendix A.

The TTAideal template is used to show the performance increase of the method without it being hindered by a shortage of resources other than registers. Consequently, a large amount of ILP could be exploited, which resulted in a high register pressure. A method such as integrated assignment, which efficiently exploits the available registers, gives better results when the register pressure is high. In practice, the TTAideal template will never be used because of its cost and high cycle time; however, it shows the potential of the method. A more practical TTA template is the TTArealistic. The resources of this template are more restricted and hence less ILP can be exploited. As shown in Figure 7.10, the performance gain is still substantial, although it is lower than when using the TTAideal template.



a) CFG. b) CFG after applying integrated assignment.


c) CFG after applying DCEA.

Figure 7.11: Too aggressive false dependence avoidance.

As already observed in Section 6.6.4, various reasons exist to explain the performance differences between applying DCEA and integrated assignment for local scheduling. The same reasons can be applied to region scheduling. However, the performance gains are larger. Because region scheduling exploits a larger amount of ILP, the register pressure increases. A method that uses the available registers more efficiently can achieve a higher performance.

In addition to the reasons mentioned in Section 6.6.4, another effect is observed. DCEA only considers forward false dependences, see Section 5.1.3. This is a valid assumption, since this restriction avoids the addition of many edges to the interference graph and almost never hinders efficient code generation. However, sometimes these neglected potential false dependences are important and hinder the generation of efficient code. This is the case for the benchmarks 132.ijpeg and mpeg2encode, where DCEA shows a small performance loss even when sufficient registers (512) are available. This effect can also work the other way around. Consider the CFG of Figure 7.11a. All operations in basic block A are already scheduled and the scheduler is ready to import operations into basic block A. Figure 7.11b shows the completely scheduled CFG when integrated assignment is used. This method was so successful in importing operations that no operations remain in basic block C. As a result, this basic block is removed from the CFG. This seems advantageous; however, this is not always true. When basic block C is removed from the CFG, it is impossible to import operations from basic block D into basic block B, although sufficient resources are available in basic block B. During code examination, we discovered that DCEA sometimes performed better in these situations. Since DCEA only considers forward false dependences, it may map variables v8 and v9 onto the same register, for example r8. This results in a false dependence between the addition of basic block B and the copy of basic block C. Because of this dependence, it is no longer possible to import the copy operation and hence basic block C is not removed from the CFG. This has the advantage that it is now possible to import the subtraction of basic block D into basic block B and its duplication basic block C. The resulting scheduled CFG is shown in Figure 7.11c. Assuming the execution frequencies given within parentheses, the CFG will execute in 470 cycles when applying integrated assignment and in 400 cycles when using DCEA. Summarizing, although integrated assignment is good at avoiding false dependences, it may result in a performance loss. The negative performance of the benchmark expand is caused by this effect. Future solutions are to postpone the deletion of empty basic blocks or to (re)insert empty basic blocks.

The reduction of the required callee-saved code also contributes to the small performance gain when sufficient registers are available. Because integrated assignment uses the available registers more efficiently, it requires less callee-saved code, which results in a small performance gain.

As discussed in the previous section, the GSC5 heuristic is used to limit the aggressiveness. In some situations, this leads to a negative performance gain, because the aggressiveness is limited too much. As an example, consider the wc benchmark. When registers are scarce, a negative performance gain resulted for the TTAideal template. When no heuristic is used to limit the aggressiveness, a speedup can be observed (see Appendix A). However, the use of the GSC5 heuristic resulted, on average, in the best results.

We were also curious how severely reducing the number of registers hurts performance. This is shown in Figure 7.12; it shows the cycle count increase relative to a configuration with 512 registers, averaged over all benchmarks. For TTAs with 32 registers, integrated assignment already outperforms DCEA. When the number of registers is reduced, the performance gap grows to more than 60%.

TTAs were originally proposed for application-specific purposes; therefore it is also interesting to know how many registers can be saved while the performance stays constant. Fewer registers result in a smaller chip area and in faster access to the register file, both of which are important for application-specific processors. In Table 7.1, a comparison is made between integrated assignment and DCEA to find out what these savings are.



a) TTAideal. b) TTArealistic.

Figure 7.12: Relative cycle count increase of integrated assignment and DCEA.

We see that integrated assignment needs substantially fewer registers than DCEA and is, therefore, the best solution for embedded systems.

Besides the number of registers required for a certain performance, the code size of the application is also an important factor for embedded systems. Memories on these systems are usually limited. Performance optimizations that increase the code size to a large extent are not suitable for these systems. Figure 7.13 shows the impact on the code size when using integrated assignment compared to DCEA. This figure clearly demonstrates that integrated assignment reduces the code size. This comes as no surprise, because instruction scheduling can be done more efficiently and thus the number of instructions required to pack the transports is reduced.

Table 7.1: Register requirements of integrated assignment and DCEA with equal performance.

TTAideal                              TTArealistic
Integrated    DCEA    increase [%]    Integrated    DCEA    increase [%]
assignment                            assignment
10            20      100%            10            16      60%
12            25      108%            12            18      50%
14            29      107%            14            22      57%
16            32      100%            16            26      63%
24            48      100%            18            29      61%
24            >64     -               20            32      60%
                                      28            48      71%
                                      32            >48     -



Figure 7.13: Increase in code size of integrated assignment compared to DCEA using the TTAideal and TTArealistic templates.

7.7 Conclusions

In this chapter, integrated assignment as presented in Chapter 6 is extended to a larger scheduling scope. The impact of importing operations is discussed. The principle of importing operations is used in many scheduling scopes, such as regions, traces, superblocks, hyperblocks and decision trees. Although the discussion focuses on a region scheduler, the discussed techniques can also be applied in combination with alternative scheduling scopes.

Besides the impact of stretching and shrinking the live ranges of variables referenced by imported operations, spilling and the insertion of state preserving code have also been presented. Heuristics were proposed and evaluated. Limiting the import greediness is one of the major concerns when applying integrated assignment. When importing is applied aggressively, the register pressure may increase, which in turn may result in a large amount of spill code and thus hurts performance. On the other hand, when importing is restricted to a great extent, the amount of exploited ILP decreases, which also hurts performance. The use of the global spill cost heuristic helps to control the import greediness, and it can be tuned to achieve the best performance.

Integrated assignment in combination with a region scheduler outperforms all other evaluated late and early assignment methods. It needs substantially fewer registers than other methods to achieve the same performance. This results in a saving of silicon area. Furthermore, the code size decreases in comparison with DCEA.


8 Integrated Assignment and Software Pipelining

Software pipelining is a powerful and efficient scheduling technique to exploit instruction level parallelism (ILP) in loops. In this chapter, integrated assignment in the context of software pipelining is discussed. Software pipelining schedules loops such that each iteration in the produced schedule consists of operations chosen from different iterations in the sequential code. The generation of an optimal resource-constrained schedule for loops is known to be NP-complete [Lam88, GJ79]. Heuristics are used for guiding the construction of the schedules in polynomial time.

The disadvantage of aggressive scheduling techniques, such as software pipelining, is their resulting high register pressure. To obtain a valid schedule, the compiler must be able to match the register requirements with the available registers of the target processor. When more registers are needed, some additional actions are required, such as the insertion of spill code.

In this chapter, a new technique is proposed, which integrates register assignment and iterative modulo scheduling. In Section 8.1, the relation between register pressure and software pipelining is described. In Section 8.2, related approaches are discussed that perform register assignment in a separate step. The proposed technique is presented in Section 8.3. Section 8.4 gives the experimental results and evaluates the method. The conclusions are stated in Section 8.5.



a) DDG of loop-body. b) Live ranges in software pipelined loop.

Figure 8.1: Live ranges of variables in a software pipelined loop.

8.1 Register Pressure

The register pressure of a software pipelined loop is equal to the number of simultaneously live variables in the loop. The problem of assigning registers for software pipelined loops is somewhat different from that for other code, because variables that were live in one iteration of the original loop can be live in multiple iterations of the software pipelined version. This section explores some characteristics of the live ranges of variables in software pipelined loops. Two types of variables can be distinguished:

• Loop-invariant variables are live in the loop, but are never modified during loop execution.

• Loop-variant variables are modified in each iteration of the loop. The modified value can be used by an operation in the same or in another iteration.

The loop in Figure 8.1a has three loop-variants v2, v3, and v4, and one loop-invariant v1. When an isolated iteration of the original loop is considered, the live ranges of v2 and v3 never interfere and can be mapped onto the same register. As a result, three registers are required to hold the variables: two registers to hold the loop-variant variables and one for the loop-invariant variable v1. In software pipelined loops, live ranges that did not interfere in the original loop can interfere in the scheduled code. This is illustrated in Figure 8.1b. The live range of variable v2 overlaps with the live range of variable v3 of the previous iteration. To obtain a valid schedule, they have to be mapped onto distinct registers. The register pressure is thus increased.
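The effect can be made concrete by counting, for every cycle of the kernel, how many copies of each live range overlap it. The following sketch uses hypothetical (start, end) cycles in the flat schedule of one iteration; it illustrates the counting argument only and is not the compiler's analysis.

def modulo_register_pressure(live_ranges, ii):
    # live_ranges maps a variable to its (start, end) cycle in one
    # iteration, end exclusive; every cycle t of the flat range corresponds
    # to one live copy at kernel cycle t mod II.
    pressure = [0] * ii
    for start, end in live_ranges.values():
        for t in range(start, end):
            pressure[t % ii] += 1
    return max(pressure) if pressure else 0

# With II = 2, two ranges that do not overlap within one iteration can
# still overlap across iterations (hypothetical cycle numbers):
print(modulo_register_pressure({"v2": (0, 2), "v3": (2, 4)}, 2))   # prints 2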

Another aspect of software pipelining is the presence of long-living variables. The live ranges of these variables are longer than the initiation interval II. Unless special measures are taken, this results in illegal schedules, since new values are generated before previous ones are used. The variable v4 in Figure 8.1b is such a long-living variable: it overlaps with itself in a previous iteration. To deal with this problem, some form of renaming must be applied so that successive definitions of v4 use distinct registers. This renaming can be performed at compile time or in hardware.

One approach to apply renaming at compile time is modulo variable expansion [Lam98]. This technique schedules the loop prior to register assignment. During scheduling, the inter-iteration false dependences between references to the same variable are ignored. The resulting incorrect schedule is unrolled to ensure that each live range fits in the II. After unrolling, variable renaming is applied to correct the schedule. The minimum degree of unrolling (Kmin) is determined by the longest live range of all loop-variants in the kernel and the II. Kmin can be calculated as

Kmin = max_{∀v} ⌈ L(v) / II ⌉    (8.1)

where L(v) represents the length of the live range of v.

A rotating register file [BYA93, RYYT89] is a hardware solution to the problem of long-living variables. Dedicated hardware solves the problem without replicating code: it renames different instantiations of a loop-variant at execution time. A rotating register file rotates the registers every loop iteration; register ri becomes ri+1 and rn becomes r0. This allows a value generated by an operation in one iteration to co-exist with the values generated by the same operation in previous and subsequent iterations.

Both modulo variable expansion and rotating register files have their drawbacks. Modulo variable expansion increases the static code size and implies late assignment. As discussed in Chapter 5, late assignment is not the best choice for compiling programs for TTAs; therefore this method is rejected. Rotating register files require dedicated hardware, which may affect the cycle time. In our opinion, the extra hardware costs are better spent on more general-purpose registers, because then the complete program can benefit.

In [Hoo96], another technique is proposed to deal with long-living variables: delay lines. Delay lines are the software counterpart of rotating register files. They are implemented as a series of copy operations and are inserted before instruction scheduling and register assignment. The number of delay lines per variable is determined by ⌊L(v)/MII⌋. Because the code is not yet scheduled, the precise length of the live ranges is unknown; the length of a live range is therefore defined as the longest path in the DDG from the definition to the consumer of the variable. Figure 8.2 shows the effects of the insertion of a delay line. The live range of variable v4 is split in two (v4 and v5). As a result, five registers are required to hold all variables of the software pipelined loop. We use delay lines to solve the problem of long-living variables.
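For concreteness, Equation 8.1 and the delay-line count can be computed as follows; the numbers in the usage line are made up for illustration.

from math import ceil, floor

def k_min(live_range_lengths, ii):
    # Equation 8.1: minimum unroll factor for modulo variable expansion.
    return max(ceil(length / ii) for length in live_range_lengths)

def delay_line_copies(length, mii):
    # Number of copy operations in the delay line of one variable:
    # floor(L(v) / MII), with the length taken from the DDG as described above.
    return floor(length / mii)

# A loop-variant that stays live for 5 cycles with II = MII = 2 forces an
# unroll factor of 3, or 2 extra copies when delay lines are used instead.
print(k_min([5, 2], 2), delay_line_copies(5, 2))   # prints 3 2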



Figure 8.2: Live ranges when using delay lines.

All three of the mentioned methods for handling long-living variables lead to an increased register pressure. When more registers are required than are available, the register allocator fails to find a solution. To obtain a schedule that respects the register constraints, additional actions are required. In the remainder of this section, methods to decrease the register pressure in software pipelined loops are discussed.

Consider the loop in Figure 8.3a. For reasons of clarity, RISC-style code is used. This loop executes in five cycles per iteration, assuming an issue width of three, a latency of two for the multiply, and a single-cycle latency for the other operations. Three registers are needed when it is assumed that the variables v1 and v3 are loop carried. The software pipelined version of this loop, including its register assignment, is given in Figure 8.3b. This loop executes in two cycles per iteration; however, the register pressure is increased to four. The result registers of the load and the multiply interfere in the software pipelined version, while this was not the case in the original loop. When the II is increased by one, again only three registers are needed. The generated schedule with an II = 3 is given in Figure 8.3c. Reasons why the register pressure decreases when the II increases are:

• Fewer live ranges interfere, since fewer iterations of the original loop overlap.

• The number of long-living variables decreases. As a consequence, the number of delay lines decreases and thus fewer registers are required.

The principle of trading performance for a lower register pressure is generally applicable to software pipelining; however, this process cannot be extended indefinitely.


L1: add v1, v1, #4
    ld  v2, (v1)
    mul v4, v2, #17
    sub v3, v3, #4
    st  v4, (v3)
    bgz v3, L1

a) Loop body.

Prologue
        add r1, r1, #4
        ld  r2, (r1)
        add r1, r1, #4     mul r4, r2, #17
        ld  r2, (r1)       sub r3, r3, #4
Kernel
L1:     add r1, r1, #4     mul r4, r2, #17     st  r4, (r3)
        ld  r2, (r1)       sub r3, r3, #4      bgz r3, L1
Epilogue
        mul r4, r2, #17    st  r4, (r3)
        sub r3, r3, #4
        st  r4, (r3)

b) Software pipeline with II = 2, Rpress = 4.

Prologue
        add r1, r1, #4
        ld  r2, (r1)
        mul r2, r2, #17
Kernel
L1:     add r1, r1, #4     sub r3, r3, #4
        ld  r2, (r1)       st  r2, (r3)
        mul r2, r2, #17    bgz r3, L1
Epilogue
        sub r3, r3, #4
        st  r2, (r3)

c) Software pipeline with II = 3, Rpress = 3

Figure 8.3: Increasing the II to reduce register pressure.

The loop in the previous example always requires at least three registers. Reasons why increasing the II does not reduce the register pressure indefinitely are:

• The live range of a loop-invariant is always II instructions long. Independent of the schedule, loop-invariants require one register each.

• Each loop-body has a minimal register requirement. Note that this register requirement can drop to zero for TTAs when software bypassing and dead-result move elimination are applied.

The second alternative to reduce the register pressure is spilling. However, the addition of spill code may increase the II, which reduces performance as well. Adding spill code to software pipelined loops is discussed in detail in Section 8.3.2. The last method to reduce the register pressure is to split the loop into two or more loops. Smaller loops tend to use fewer registers. Loop splitting and other loop restructuring techniques in relation to software pipelining are outside the scope of this thesis and are not investigated any further.


L1: #4 → add.o; v1 → add.t
    add.r → v1
    v1 → ld.t
    ld.r → v2
    #17 → mul.o; v2 → mul.t
    mul.r → v4
    #4 → sub.o; v3 → sub.t
    sub.r → v3
    v3 → st.o; v4 → st.t
    #0 → gt.o; v3 → gt.t
    gt.r → b0
    b0: jump L1

a) TTA code of Figure 8.3a.

Prologue
    #4 → add.o; r1 → add.t
    add.r → r1; add.r → ld.t
    #4 → add.o; r1 → add.t; ld.r → mul.t; #17 → mul.o
    add.r → r1; add.r → ld.t; #4 → sub.o; r3 → sub.t
Kernel
L1: #4 → add.o; r1 → add.t; ld.r → mul.t; #17 → mul.o; mul.r → st.t; sub.r → r3; sub.r → st.o; sub.r → gt.t; #0 → gt.o; b0: jump L1
    add.r → r1; add.r → ld.t; #4 → sub.o; r3 → sub.t; gt.r → b0
Epilogue
    ld.r → mul.t; #17 → mul.o; mul.r → st.t; sub.r → st.o
    #4 → sub.o; r3 → sub.t
    mul.r → st.t; sub.r → st.o

b) Software pipelined TTA code.

Figure 8.4: Software pipelined TTA code with II = 2, Rpress = 2.

8.2 Register Assignment and Software Pipelining

When register assignment is applied prior to software pipelining, the constraints added by register assignment prevent an efficient schedule. In Figure 8.4a, the loop of Figure 8.3a is translated into TTA code. It requires only three registers to hold all variables, because v2 and v4 can be mapped onto the same register. This assignment, however, results in a false dependence between the result moves of the load and the multiply. When local or global scheduling is used, this false dependence does not hinder the generation of an efficient schedule. However, when the loop is software pipelined, it increases the length of the kernel to three instructions in order to avoid interference between v2 and v4.

In Section 5.1, techniques to prevent the creation of false dependences are presented. The TTA compiler back-end uses the forward parallel interference graph (IGfpar), see Section 5.1.3. Applying this method straightforwardly to software pipelining does indeed prevent the creation of false dependences; however, only the intra-iteration false dependences are considered. The register assignment of Figure 8.4a respects all intra-iteration false dependences; it is, however, the inter-iteration false dependence between the result moves of the load and the multiply that results in an II of three. To identify potential inter-iteration false dependences, the loops that are to be software pipelined are pre-software pipelined before register assignment, without considering resource constraints [Hoo96]. The resulting schedule is analyzed to see which code motions were carried out. The resulting interferences are added to the forward parallel interference graph in the same manner as described in Section 5.1.3. When all added interference edges are respected during early register assignment, the scheduler has more freedom to generate the optimal throughput schedule.
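The essence of this pre-software pipelining step is that two live ranges also interfere when one of them overlaps a copy of the other shifted by a multiple of II. The sketch below derives such inter-iteration interference edges from hypothetical (start, end) cycles of a resource-unconstrained pre-schedule; it only illustrates the idea and is not the actual IGfpar construction.

def inter_iteration_interferences(pre_schedule, ii):
    # pre_schedule maps a variable to its (start, end) cycle, end exclusive,
    # in the flat pre-software pipelined schedule of one iteration.
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]

    edges = set()
    names = list(pre_schedule)
    for i, v in enumerate(names):
        for w in names[i + 1:]:
            sv, ev = pre_schedule[v]
            sw, ew = pre_schedule[w]
            span = max(ev, ew) - min(sv, sw)
            # Try every iteration distance that could still give an overlap.
            for k in range(-(span // ii) - 1, span // ii + 2):
                if overlaps((sv, ev), (sw + k * ii, ew + k * ii)):
                    edges.add((v, w))
                    break
    return edges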

The example in Figure 8.4 requires four registers in the sequential code to avoid all false dependences. Figure 8.4b shows the resulting maximal throughput schedule. Careful examination of the code shows that only two registers are actually used in the parallel code. Although four registers are required to generate this schedule, the number of actually required registers has dropped to two, due to the ability of TTAs to software bypass values.

Spilling in the context of early assignment in combination with software pipelining is not a severe problem. Spill code generated by the register allocator is scheduled in the same way as the other operations, but of course may increase the II.

In contrast to early assignment, late assignment can exploit the software bypass property of TTAs, which reduces the number of required registers. Since scheduling is done before register assignment, the false dependences created by the register allocator do not hinder the generation of efficient code. Most published work on software pipelining assumes late assignment. Applying software pipelining first can, however, lead to loops with a high register pressure. Various heuristics have been proposed to decrease the register pressure.

Slack Scheduling [Huf93] schedules some operations late and other operations early, with the aim of reducing the register requirements while achieving maximum execution rate. The algorithm uses a heuristic that integrates recurrence constraints and critical-path considerations in order to decide when each operation is scheduled.

In [ED95b], a method is proposed that tries to reduce the register pressure when the complete loop is scheduled. After scheduling, some operations are moved to an earlier or later instruction without changing the throughput of the schedule. Moving operations in a smart fashion can reduce the register requirements.

Hypernode Reduction Modulo Scheduling (HRMS) [LVAG95] is a heuristic strategy that tries to shorten loop-variant live ranges without sacrificing performance. The main part of HRMS is the ordering strategy, which orders the nodes before scheduling them. This prevents direct predecessors and successors of a node from both being scheduled before the node itself; thus, only predecessors or successors of a node can be scheduled before the node itself, but not both. During scheduling, a node is scheduled as soon as possible if its predecessors have been scheduled previously, or as late as possible if its successors have been scheduled previously. The main drawback of HRMS is that it does not take into account that nodes are more critical in the scheduling process if they belong to a more critical path.

Swing Modulo Scheduling [LGAV96] considers latencies to decide how critical the nodes are. It gives the highest priority to operations on the most critical path. To reduce the register requirements, each node is placed as close as possible to its predecessors and successors.

Applying software pipelining with the intention of achieving high throughput, without considering register requirements, may lead to impractical solutions. Most modulo scheduling approaches use heuristics to produce near-optimal schedules with reduced register requirements. All mentioned methods have in common that it is not guaranteed that the number of required registers is less than or equal to the number of available registers. None of the discussed methods inserts spill code. Inserting spill code into an already scheduled software pipelined loop, without affecting the whole schedule, is very difficult and in many situations impossible. The option used in the Cydra 5 compiler [DT93] is to start over again with an increased II. If after several attempts this also fails, conventional techniques, such as basic block scheduling, are used for scheduling the loop.

In [LVA96], an approach is presented that does insert spill code. The code is generated in an iterative way. First, a software pipelined schedule is generated that tries to minimize the register pressure; then the assignment is done using graph coloring [Cha82, BCKT89]. When no valid assignment is found, spill code is inserted in the original code and the complete loop is rescheduled. Rescheduling is necessary, since the added load/store operations might not fit in the schedule.

Since the register requirements of a software pipelined loop are highly dependent on the way the loop is generated, it seems beneficial to apply register assignment and instruction scheduling of software pipelined loops in a single phase. Methods based on integer linear programming have been proposed [NG93] to generate optimal schedules with register constraints; however, the number of computations is very large.

8.3 Integrated Assignment and Modulo Scheduling

In Chapters 6 and 7, we investigated integrated register assignment and instruction scheduling for basic block scheduling and region scheduling. This section describes how this method can be used in combination with software pipelining. In Section 8.3.1, the construction of the interference register set is discussed. Section 8.3.2 describes the instruments we have to decrease the register pressure when the number of registers is too small to hold all variables.


8.3.1 The Interference Register Set

Integrated assignment in combination with software pipelining builds upon iterative modulo scheduling as presented in Section 3.4.5. The set of used registers per instruction is recorded using Register Resource Vectors (RRVs). Just as for all other resources, register conflicts can arise not only in the same instruction i, but also in all instructions i + II · k ∀ k > 0. When an operation in a software pipelined loop is being scheduled, registers are assigned to the variables to which it refers. Checking whether a register is free for variable v implies checking the RRVs of all instructions that are spanned by the live range of v. The part of the live range outside the software pipelined loop is checked in the same way as for local scheduling (see Section 6.3). In addition, all instructions, thus II instructions, of the software pipelined loop are checked. This seems very conservative; however, it is needed because in a software pipelined loop there is no way to tell where the other definitions or uses will be scheduled. Normally, the definition will be scheduled before the use, but with software pipelining this is not guaranteed. Therefore, the interference register set of the software pipelined loop currently being scheduled cannot be computed as accurately as is done for local and region scheduling. The interference register set for a variable v that is first referenced in a software pipelined loop is constructed with:

RInterfere(v) = ( ⋃_{b ∈ BIO(v) ∪ BSwp} RIO(v, b) )  ∪  ( ⋃_{b ∈ BDU(v) − BSwp} RDU(v, b) )    (8.2)

where BSwp is the set of basic blocks that belong to software pipeline-able loops, including the currently scheduled basic block. Because the complete live range of variable v is included, no extra analysis or repair code is needed to integrate the software pipelined loop into the total schedule. Note that the inclusion of the RRVs of all instructions in the II limits the re-use of registers. This is not a problem in early and late assignment; however, these approaches have their own problems, as discussed previously. Only when software bypassing and dead-result move elimination are applied can integrated assignment re-use a register.

When the interference register set is constructed, the register allocator picks a register from the set of non-interfering registers RNon-interfere(v) = R − RInterfere(v), assigns the selected register to the variable, and updates all associated RRVs. If an operation is scheduled that has already assigned variables, then, to keep the bookkeeping precise, the RRVs of the loop are updated. In the context of software pipelining, this is only possible when all uses and definitions of the variable in the loop are scheduled, since only then is it precisely known in which instructions the variable is live.
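The conservative check described above amounts to collecting the RRVs of the spanned instructions outside the loop together with the RRVs of all II kernel instructions. A minimal sketch, with plain register sets standing in for the real RRVs:

def non_interfering_registers(outside_rrvs, kernel_rrvs, registers):
    # outside_rrvs: RRVs of the instructions outside the loop spanned by the
    # live range of v; kernel_rrvs: RRVs of all II instructions of the loop.
    # Every one of the kernel RRVs is included, because it is not yet known
    # in which kernel instruction the other references of v will end up.
    interfering = set()
    for rrv in list(outside_rrvs) + list(kernel_rrvs):
        interfering |= rrv
    return registers - interfering   # the set R_Non-interfere(v)

A register picked from the returned set would then be recorded in all of these RRVs, as described above.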


Algorithm 8.1 SOFTWAREPIPELINING(b)

II = COMPUTEMII(b)
budget = budget ratio · |b|
n = 0
WHILE ¬ITERATIVEMODULOSCHEDULING(b, II, budget, Huff) ∧
      ¬ITERATIVEMODULOSCHEDULING(b, II, budget, Rau) DO
    IF n > Threshold ∧ FAILEDDUEREGISTERCONSTRAINT(b) THEN
        var = SELECTVARIABLETOSPILL(b)
        GENERATESPILLCODE(var)
        n = 0
        II = RECOMPUTEII(b)
        budget = budget ratio · |b|
    ELSE
        II = II + 1
        n = n + 1
    ENDIF
ENDWHILE

8.3.2 Spilling

The same method as described in Section 6.4.2 can be applied when inserting spill code. However, the DDG of a software pipelined loop has one peculiarity compared to the DDG of other code: inter-iteration dependences. If applicable, the inter-iteration dependence between the definition and the use of the spilled variable is replaced by an inter-iteration dependence between the store and the load operations. To prevent a value from being read from memory before it is written, the associated delay is set to one instruction. Just as discussed in Section 6.4.2, a second data dependence edge between the load and the store operation is added to prevent a value from being written to memory before it is read. In the same way as in local scheduling, the complete spill code and reload code sequences are scheduled in a single scheduling step by using software bypassing and dead-result move elimination. This has the advantage that no extra registers are required for the short live ranges created by spilling.
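The bookkeeping on the DDG can be pictured as follows. The edge representation (source, destination, delay, iteration distance) and the delay/distance values chosen for the second edge are assumptions made for this sketch; only the store-to-load delay of one instruction is taken from the text above.

def add_spill_edges(ddg_edges, store_op, load_op, distance):
    # Replace the inter-iteration dependence between the definition and the
    # use of the spilled variable by a dependence between store and load:
    # the value must be in memory one instruction before it is reloaded.
    ddg_edges.append((store_op, load_op, 1, distance))
    # Second edge, as in Section 6.4.2, so the value is not overwritten in
    # memory before it has been read; modelled here as an edge from the
    # load to the store of the next iteration (assumed delay 0, distance 1).
    ddg_edges.append((load_op, store_op, 0, 1))
    return ddg_edges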

Besides spilling, there is another alternative to decrease the register pressure: increasing the II (see Section 8.1). This may result in a better schedule. Note that the introduction of spill code has as a side effect that the II may increase as well. These combined effects may result in an even larger decrease of the register requirements. In [LVA96], it is even claimed that the introduction of spill code is preferable to increasing the II. In order to investigate this claim, we propose a combination of adding spill code and increasing the II.

The proposed algorithm (see Algorithm 8.1) first performs integrated register assignment and iterative modulo scheduling without spill code. When this fails, the II is increased. If the II has been increased by more instructions than a predefined threshold and still no valid schedule is found, the algorithm considers spilling. If an operation could not be scheduled due to a register shortage, all scheduling and assignment decisions are made void, as if no operation in the loop had been scheduled. Next, the algorithm selects a live range for spilling using the heuristic of Equation 3.4. All live ranges are considered, including those that received a register in the previous attempt. The insertion of load and store operations changes the resource requirements and the precedence constraints; trying to obtain a valid schedule within the original II, which is probably lower than the new one, is futile. Therefore a new II is computed and the process is repeated until a solution is found.

8.4 Experiments and Evaluation

In this section, integrated assignment in combination with software pipelining is evaluated. Not all loops are suitable for software pipelining. As discussed in Section 3.4.5, software pipelining can only be performed on some of the innermost loops. The remaining parts of the code are scheduled by the region scheduler.

In Section 8.4.1, the impact of changing the Threshold (see Algorithm 8.1) is discussed. In Section 8.4.2, the integrated assignment approach is compared with DCEA.

8.4.1 Spilling or Increasing the II

As discussed in Section 8.3.2, there are two options to reduce the register pressure: (1) the insertion of spill code and (2) increasing the II. Algorithm 8.1 switches between increasing the II and the insertion of spill code. When the Threshold is set to zero, the algorithm inserts spill code at the moment it discovers that insufficient registers are available to generate a schedule in II instructions. When Threshold is, for example, set to five, the algorithm will first try to reduce the register pressure by increasing the II. When this fails after five attempts, the algorithm decides to insert spill code. In the experiments, we varied the Threshold between 0 and 10. When sufficient registers were available, all experiments gave of course the same results. When registers were scarce, the Threshold only influenced the performance of a few benchmarks. On average, no significant performance improvement was found. One reason for this behavior is that the inserted spill code can often be scheduled without increasing the II. This leads to the conclusion that most benchmarks are insensitive to a change in the value of Threshold and that spilling is the best way to reduce the register pressure.


Figure 8.5: Speedup of integrated assignment compared to DCEA using software pipelining and the TTAideal and TTArealistic templates.

8.4.2 Early vs. Integrated Assignment

It is interesting to know how integrated assignment in combination with software pipelining performs compared to DCEA. The results of Chapter 6 and Chapter 7 indicate that the performance gain of integrated assignment increases when the register pressure increases. Software pipelining increases the register pressure and thus we expected a high performance gain.

The achieved performance gain for both TTA templates is shown in Figure 8.5. As was expected, integrated assignment outperforms DCEA. When comparing the performance gain of software pipelining with the performance gain of region scheduling in Figure 7.10, an increase in performance gain can be observed. This indicates that integrated assignment indeed performs better when the register pressure is high.

The individual results of all benchmarks are listed in Appendix A. These results show that in some cases an extremely high performance gain was achieved. For example, the rfast benchmark has a speedup of 400% when using the TTAideal template with 10 registers. Part of the speedup can be explained by the reasons mentioned in the previous chapters. Furthermore, software pipelining increases the register pressure, which makes things even worse for an early assignment approach. There are, however, other reasons.

When early assignment precedes software pipelining, false dependences may hinder the generation of an optimal schedule. Not only false dependences between live ranges in the same iteration, but also false dependences between live ranges in other iterations may reduce the throughput of the loop. Besides the intra-iteration false dependences, DCEA also takes the inter-iteration false dependences into consideration. However, when all false dependences (intra and inter) are taken into consideration, the resulting interference graph has many interference edges. It is hard to determine which false dependences are really important. Therefore, it is necessary to prune the interference graph. This is accomplished by pre-software pipelining the loop without considering resource constraints and false dependences [Hoo96]. The pre-schedule is analyzed and interference edges are added to the graph IGfpar. This graph contains many more interference edges than without software pipelining. DCEA has to decide which false dependences should be avoided, and which ones can be introduced when the number of registers is limited. Unfortunately, the IGfpar is constructed under the assumption of an unbounded number of resources. When resources are limited, and thus the II increases, other false dependences might be more important. This effect seems to be an important reason for the large performance gain when registers are scarce.

Another reason for the performance difference is the influence of the delay lines. Delay lines require extra registers when an early assignment approach is applied. However, in the resulting code, many of the delay lines disappear because of software bypassing and dead-result move elimination. Unfortunately, an early assignment approach cannot benefit from this reduction of register requirements. Even worse, it may insert spill code for the delay lines. Integrated assignment does not have this problem because it can benefit from the TTA-specific properties.

The results, shown in Figure 8.5, suggest that integrated assignment performs extremely well when registers are scarce. However, it is DCEA that performs extremely badly in these situations. This can be concluded from Figure 8.6. This figure shows the cycle count increase relative to a configuration with 512 registers. Only benchmarks with a software pipeline ratio higher than 10% are included to compute the average (see Section 4.3). The figure shows that the cycle count increase for DCEA is 187.7%, while for integrated assignment it is limited to only 36.6% when using the TTAideal template with 10 registers. For individual benchmarks this difference is even larger. For the same TTA, the benchmark rfast has a cycle count increase of 553.4% for DCEA, while for integrated assignment it is limited to 30.7%. It is also interesting to note that the cycle count increase of the TTArealistic is larger than that of the TTAideal. Because the TTAideal has many resources, it can more easily integrate spill code into the schedule without a large performance loss.

Some performance losses can also be observed when comparing integrated assignment with DCEA (see Appendix A). As stated in [Hoo96], the performance of software pipelining heavily depends on the selection order of the operations. As proposed by Hoogerbrugge, two heuristics that define the operation selection order are tried in order to generate a valid schedule in II instructions. Because of the false dependences in early assignment, these heuristics generate different orderings for early assignment than for integrated assignment. The ordering determines whether a schedule fits into II instructions. Another ordering may result in a smaller or larger II. This effect causes the


a) TTAideal. b) TTArealistic.

Figure 8.6: Average cycle count increase of integrated assignment and DCEA. Only benchmarks with a software pipeline ratio higher than 10% are included.

negative performance gain of integrated assignment for the benchmarks instf, radproc and rfast even when sufficient registers (512) are available. Analysis showed that these benchmarks are dominated by a small set of loops that can be software pipelined. An increase of the II by one instruction already contributes to a performance loss of a few percent.

The presented algorithm has a potential performance flaw that did not show up in the results. Because all instructions in the loop are taken into account when computing the interference register set, it is not possible to use the same register for two different live ranges. Software bypassing and dead-result move elimination may have obscured this effect. Furthermore, the loops considered were relatively small. When larger loops are taken into consideration, which results in a larger kernel, this effect may become more visible. Therefore, in the future the algorithm should be extended in order to allow the reuse of registers in software pipelined loops with a large II.

8.5 Conclusions

In this chapter, the relation between register assignment and software pipelining is discussed. It was shown that software pipelining increases the register pressure. Based on the work described in Chapters 6 and 7, we proposed a method that integrates register assignment and software pipelining. We also addressed the problem of register shortage. We have shown that the newly developed method can handle this by either increasing the II, or by inserting spill code. The experiments showed that increasing the II in some situations slightly improves the performance. On average, however, the impact of increasing the II is practically non-existent. Spilling is the better choice.

The performance of the proposed method is compared with DCEA. The experiments showed that integrated assignment outperformed DCEA. DCEA suffers from the increased register pressure in software pipelined loops. Its inability to exploit software bypassing and dead-result move elimination, and the method used to predict false dependences, contribute heavily to its performance penalty. The impact of DCEA on the cycle count increase leads to the conclusion that the combination of DCEA and software pipelining is not a good choice when registers are scarce.


9 The Partitioned Register File

The general-purpose registers of a microprocessor are located in the register file (RF). These registers can be accessed through a limited number of RF-ports. Reading N values in parallel requires N read ports on the RF. An architecture that exploits instruction-level parallelism (ILP) should be able to read and write multiple register values concurrently; ILP architectures therefore require multi-ported register files. Many ILP based processors have a single multi-ported register file that is commonly known as a monolithic multi-ported register file. Unfortunately, a monolithic multi-ported RF increases chip area [CDN95], causes extra delay [WRP92] and boosts power consumption [ZK98]. Therefore, an alternative RF organization is needed in order to exploit large amounts of ILP.

This chapter is organized as follows. Section 9.1 describes the problems of a monolithic multi-ported RF: chip area, access time and power consumption are addressed. To overcome these problems, an alternative RF organization is discussed: the partitioned RF. Although a partitioned RF has hardware advantages, from a code generation point of view it introduces some complications. To avoid large performance penalties, the compiler has to take the partitioning into consideration. In particular, the register allocator must be adapted; it has to distribute the variables over the partitioned RF. In Section 9.2, partitioning in combination with early assignment is discussed. The consequences for late assignment are discussed in Section 9.3, and how integrated assignment handles partitioned RFs is discussed in Section 9.4. In addition, related work is addressed in these sections. Finally, Section 9.5 concludes this chapter.


9.1 Register Files

To increase performance, modern microprocessors execute many operations in parallel. To allow the exploitation of a high degree of ILP, RFs with a large number of registers are needed. For example, the Itanium processor and the TM1000 contain 128 integer registers each [Abe00, HA99]. Parallel execution of multiple operations also requires simultaneous reading and writing of multiple values from and to the RF. This is done by means of the RF-ports. In our formulation, each RF-port is either a read port or a write port and hence operands compete for RF-port usage within the type of access (read operations conflict with read operations for accessing read ports, and likewise for write operations).

VLIWs require 3K RF-ports assuming K FUs (see Section 2.1). The inputs and outputs of each FU have a connection to the RF. From the compiler's point of view, this interconnection is beneficial because it provides an orthogonal programming model. As a result, a standard register assignment technique can be used, which in turn produces high performance code. Unfortunately, the high number of RF-ports needed for VLIWs is a major problem when the number of FUs increases. E.g., a VLIW with 5 FUs requires 15 RF-ports. Even if such a large multi-ported RF could be built, it is likely to introduce performance degradation because of an increased overall cycle time [Jol91]. This degradation may greatly affect the benefit of the complete interconnection to a single RF and reduces the theoretical performance achievable with many FUs.

TTAs do not have a direct relation between the number of FUs and the number of RF-ports. This allows the design of microprocessors with a high number of FUs, while the number of RF-ports is moderate. However, the number of RF-ports must be large enough to support the available ILP. Another factor, which reduces the pressure on the RF-ports, is the ability to software bypass results directly from the producing FU to the consuming FU. This saves the use of a read port. When dead-result move elimination can be applied, the use of one write port and a register is also saved. In [HC94], it is shown that TTAs can reduce the RF-port requirements by 50% or more compared to VLIWs.

It is to be expected that the ongoing research in compiler and microprocessor technology will make it feasible to execute 16 [LW97, RJSS97, SCD+97] or more operations in parallel. Although the number of RF-ports for a TTA is smaller than for a VLIW (a VLIW would require 48 RF-ports to achieve the same performance), the number of RF-ports must still be substantial to exploit the increase in computing power.

Silicon area, access time and power consumption increase when the number of ports and registers of an RF increases. In this section, models from the literature are presented for silicon area (Section 9.1.1), access time (Section 9.1.2) and power consumption (Section 9.1.3). We do not claim an exhaustive analysis, which would be beyond the scope of this thesis. In Section 9.1.4 an alternative RF architecture is presented. This RF architecture solves the problems of a large monolithic RF.


Figure 9.1: Implementation of a 2-ported RF (only one bit has been drawn).

9.1.1 Silicon Area

The area of a multi-ported RF is largely dependent on the number of registers, the width of the registers (i.e. the number of bits per register), the number of RF-ports and the routing (wiring) area. A register in an RF consists of a number of cells. Figure 9.1 [vdG98, JC95] shows a 1-bit, 4-transistor SRAM cell with two ports. An RF with 24 32-bit registers will contain 32 of those cells in a row (forming one word or register) and 24 registers in a column.

In the worst case, each cell must be able to drive all the read ports simultaneously. To accomplish this, the area of the two inverters must be increased to drive the increased load. In [CDN95], it is claimed that their area grows linearly with the number of registers. Another factor that influences the area of an RF is routing. According to [CDN94, CDN95, Cor98], the area for routing grows quadratically with the number of RF-ports. In [CDN94] a large number of published RF designs is studied. This study and the study of Corporaal [Cor98] indicate that the silicon area is usually dominated by the routing area. Even for a limited number of RF-ports the cell layout is almost completely occupied by bitlines and wordlines. The area model used in this thesis is based on the model of [CDN94]:

AreaRF ∼ RRF · (1 + δa · Nports)²    (9.1)

where RRF is the number of registers, Nports the number of RF-ports, and δa the percentage dimensional increase determined by each RF-port. According to [CDN95], δa is in the range 0.05 to 0.20 for most of the considered designs. This value depends on the number of wiring levels. Note that the model does not give an absolute area measure, but the relation between various RF configurations.
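
As a worked illustration of Equation 9.1, the sketch below compares the area of a monolithic RF with partitioned configurations in which the registers and RF-ports are split evenly over the partitions (δa = 0.1); the ratios it prints match the numbers quoted at the end of Section 9.1.4.

    def rf_area(registers, ports, delta_a=0.1):
        # relative area of a single RF according to Equation 9.1 (no absolute units)
        return registers * (1.0 + delta_a * ports) ** 2

    def partitioned_area(total_registers, total_ports, partitions, delta_a=0.1):
        # total area when registers and RF-ports are divided evenly over the partitions
        return partitions * rf_area(total_registers // partitions,
                                    total_ports // partitions, delta_a)

    mono = rf_area(32, 8)              # monolithic RF: 32 registers, 8 ports
    two  = partitioned_area(32, 8, 2)  # 2 partitions of 16 registers and 4 ports
    four = partitioned_area(32, 8, 4)  # 4 partitions of 8 registers and 2 ports
    print(round(mono / two, 2), round(mono / four, 2))   # 1.65 and 2.25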


Figure 9.2: Area of various monolithic RF configurations relative to the area ofa monolithic 8-ported RF with 32 registers assuming δa = 0.1.

Figure 9.2 shows the relative area as a function of the number of RF-ports and registers, assuming δa = 0.1.

9.1.2 Access Time

In current microprocessor designs a trade-off has to be made between size, bandwidth and speed of the RF [CDN94]. Simultaneous access to all RF-ports requires appropriate sizing of the drivers. This results in increased cell dimensions because the inverters must be sized to drive all the lines in all situations. The line length also increases, which results in an increased propagation delay. The access time of an RF can, according to [CDN95] and [Cor98], be modeled as a linear function of the number of RF-ports. From the results presented in [FJC95], it follows that the access time is also proportional to the number of registers. Based on these observations we model the relative access time increase for a multi-ported RF as:

Taccess ∼ 1 + ρr · RRF + ρp · Nports (9.2)


where ρr is the percentage cost increase per register and ρp gives the percentagecost increase per RF-port. In [CDN95] a value of 0.1 for ρp is suggested. Farkaset al. [FJC95] observed that the access time of an RF is far more affected by adoubling of RF-ports than a doubling of the number of registers. Based on theirresults we use ρr = 0.01.
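
With the same even split of registers and RF-ports over the partitions as used for the area model, Equation 9.2 with ρr = 0.01 and ρp = 0.1 reproduces the access-time ratios quoted at the end of Section 9.1.4; a small check:

    def rf_access_time(registers, ports, rho_r=0.01, rho_p=0.1):
        # relative access time of a single RF according to Equation 9.2
        return 1.0 + rho_r * registers + rho_p * ports

    mono = rf_access_time(32, 8)   # monolithic RF: 1 + 0.32 + 0.80 = 2.12
    two  = rf_access_time(16, 4)   # one partition of the two-way split
    four = rf_access_time(8, 2)    # one partition of the four-way split
    print(round(mono / two, 2), round(mono / four, 2))   # 1.36 and 1.66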

In future microprocessor designs, the number of registers and RF-ports will grow. The above observations about the relation between access time and RF-ports make the RF a fundamental design issue for large TTA, VLIW and superscalar processors [Cor98, CDN95, FJC95, EFK+98, CGVT00].

9.1.3 Power Consumption

The last discussed parameter is power consumption. When taking into account the growth of both the storage size and the number of ports, it is to be expected that the power portion of a multi-ported RF will grow in the future. In [ZK98], the authors have described and analyzed the energy complexity of multi-ported RFs. Their study showed that the average access energy is proportional to the number of RF-ports and to the number of registers.

ERF ∼ (c1 + c2 · RRF )(c3 + c4 · Nports) (9.3)

The constants c1, c2, c3 and c4 can be estimated from the figures in [ZK98] for 0.5 µm CMOS technology. The result is shown in Figure 9.3. Furthermore, in [ZK00] it was shown that the power consumption of an RF in a superscalar or VLIW processor can be described as P ∼ (IW)^1.8, where IW is the issue width. In a properly designed superscalar or VLIW processor, an increase in issue width has as a consequence that the number of RF-ports increases as well as the number of registers. According to Zyuban and Kogge, the RF is among the components with the highest energy growth when the issue width increases. This leads to the conclusion that an increase in issue width (and thus exploitable ILP) results in an almost quadratic increase in power consumption. Zyuban and Kogge [ZK98, ZK00] conclude that none of the known circuit techniques solves the problem of rapid RF power growth for microprocessors with increasing ILP. Aggressive technology scaling does not solve the problem either. They suggest developing new alternatives for the monolithic RF.
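
A corresponding sketch of the access energy model of Equation 9.3; the constants are left as parameters because the thesis only estimates them from the figures in [ZK98] and does not list numerical values here.

    def rf_access_energy(registers, ports, c1, c2, c3, c4):
        # average access energy according to Equation 9.3 (proportionality only);
        # both factors shrink when an RF partition holds fewer registers and ports
        return (c1 + c2 * registers) * (c3 + c4 * ports)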

9.1.4 Partitioned Register Files

When designing microprocessors that can exploit a large amount of ILP, an alternative for the large monolithic RF must be found. A frequently used technique is to split the set of registers into two banks (RFs), one for the integer and the other for the floating-point registers. However, this is only a partial solution since the number of ports and registers for each RF is still large. In addition to dividing register types over different RFs, one can also choose to split the monolithic RF per type into small RFs. This results in either a distributed or


Figure 9.3: Average access energy per instruction.

a partitioned register file configuration [Lee92]; these configurations are shownin Figure 9.4.

In a distributed RF configuration each FU, or cluster of FUs, has direct access to one RF only. Access to other RFs requires register copy operations or complex bypassing logic. This architecture, also known as a multicluster architecture [FCJV97], has been used in the Multiflow TRACE architecture [SS93], the vector processor CM-5 of Thinking Machines and DEC's Alpha 21264. The Alpha 21264 has two integer FU clusters, each with a separate RF. It was decided


a) Distributed RF. b) Partitioned RF.

Figure 9.4: Dividing an RF.


to partition the RF because it was indeed in the critical timing path [Gwe96]. In the VLIW family TMS320C6000 the idea of a distributed RF is also used. The devices of this family are based on the VelociTI core, which has two clusters, each with a local RF and four FUs.

The idea of a distributed RF is also applied in the Multiscalar architecture [SBV95]. The key idea behind this architecture is to split one wide-issue processing unit into multiple narrow-issue processing units. Each narrow-issue unit consists of FUs and an RF. The trace processor [RJSS97] is based on the Multiscalar idea as well. Like Multiscalar, trace processors are proposed to overcome the architectural limitations on ILP of superscalar processors. Trace processors exploit the characteristics of traces. A trace processor consists of multiple processing elements that each have the organization of a small superscalar processor. Trace data characteristics (local versus global values) suggest a hierarchical RF implementation: a local RF per trace for holding values produced and consumed solely within a trace, and a global RF for holding values that are live between traces. The local RFs can be regarded as a distributed RF, while the global RF is used to copy the results between processing elements.

As shown by Capitanio [CDN93], compilation for a distributed RF configuration is difficult and may result in a large performance loss (up to 75%). Note further that the register copy operations between RFs may require extra ports on the RFs. These ports are not shown in Figure 9.4a.

A different approach is to partition the RF as shown in Figure 9.4b. Although a partitioned RF provides less connectivity between FUs and registers when compared to a monolithic RF, each FU still has direct access (for reads and writes) to any register. Therefore, no extra register copies are required. However, performance losses are still to be expected because the limited number of ports per RF may lead to RF-port conflicts, i.e., two or more accesses needing the same port in the same cycle.

Partitioned RFs have been used for many years in vector processors. Each vector register in a vector processor can be regarded as a separate RF with only a limited number (often one) of read and write ports [Lee92]. Partitioning can also be compared to the use of multiple banks; e.g. a cache split into two banks, each having only one port, can handle two accesses at a time if these accesses are not to the same bank. In both cases, the hardware detects and handles the access conflicts. The Motorola 56000 and the NEC 77016 allow multiple concurrent accesses to different on-chip memory banks. To exploit this feature, applications are hand-written. In [SM95], a compiler technique is described that can generate code for these types of architectures. This technique uses graph coloring to assign registers as well as memory references to memory banks during late register assignment. Besides register conflict edges, memory bank conflict edges and edges that reflect architecturally imposed constraints are also added. Associated with each edge is a cost. Simulated annealing is used to minimize the total cost. The total cost is defined as the sum of the costs of all edges whose constraints are not met.


a) Monolithic RF (32 registers). b) 2-partitioned RF (2x16 registers). c) 4-partitioned RF (4x8 registers).

Figure 9.5: Register file configurations.

In this chapter, we propose the use of a partitioned RF, for which accessconflicts are handled (i.e. avoided) by the compiler. A partitioned RF fits verywell in the TTA template. No extra hardware facilities are required. Examplesof some partitions are given in Figure 9.5.

Figure 9.6 shows the area of three RF configurations, a monolithic RF, one with two partitions, and one with four partitions, in units of an 8-ported monolithic RF with 32 registers (i.e. relative to the configuration of Figure 9.5a). Each configuration has a total of 32 registers. The registers and RF-ports are equally divided between the new partitions. From the figure, it can be clearly seen that the area occupied by a partitioned RF is significantly smaller than the area of the monolithic RF.

In the remainder of this chapter, the RF configurations of Figure 9.5 are used. Based on the observations made in this section, the following conclusions can be drawn:

• A monolithic 8-ported RF with 32 registers has an area 1.65 times the area of its two times partitioned counterpart, and an area 2.25 times the area of its four times partitioned counterpart.

• A monolithic 8-ported RF with 32 registers has an access time that is 1.36 and 1.66 times the access time of its two and four times partitioned counterparts, respectively.

• The average access energy per instruction for a monolithic RF is 1.53 and 1.96 times the average access energy of its two and four times partitioned counterparts, respectively.

Partitioning the RF has advantages: less chip area, lower access times and a lower power consumption. However, partitioning also has a drawback. In a single monolithic RF, each register can be accessed via each RF-port. In a partitioned RF, this is no longer the case. The consequences for code generation are discussed in the remainder of this chapter.

9.2 Early Assignment and Partitioned Register Files

Partitioning the RF does not induce any changes to the TTA post-pass scheduler. Based on the index of a register, it selects an RF and an RF-port of the


Figure 9.6: Area of various 32 register RF configurations relative to the area of a monolithic 8-ported RF with 32 registers. Model based on Equation 9.1 with δa = 0.1.

correct RF partition. The register allocator is responsible for assigning registers to variables and thus determines in which RF partition a variable will reside. Distribution of the variables over the RFs should be done with care in order to reduce the performance penalty imposed by the partitioning. In this section, four distribution methods in combination with early assignment are described and evaluated. The first two use a simple distribution scheme, while the latter two are based on more advanced heuristics.

9.2.1 Simple Distribution Methods

Let us assume that all registers fit into a single address space; e.g. if there are four RFs with 8 registers each, then we assume that we can address each register with an address between 0 and 31. The problem is how to distribute the register addresses over the RFs. Two distribution methods are considered: the vertical and the horizontal distribution, see Figure 9.7. The vertical distribution assigns the register addresses 0..7 to RF-1, 8..15 to RF-2 and so on. The horizontal distribution assigns the register addresses in a round robin fashion to the RFs as shown in Figure 9.7b.
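
The two simple distributions amount to the following mapping from a register address in the single address space to an RF and a local register index, sketched here for the four RFs of eight registers of Figure 9.7:

    def vertical(address, rf_size=8):
        # consecutive addresses fill one RF completely before the next one is used
        return address // rf_size, address % rf_size      # (RF index, local register)

    def horizontal(address, num_rfs=4):
        # addresses are dealt to the RFs in a round robin fashion
        return address % num_rfs, address // num_rfs

    # register address 8 resides in RF-2 under the vertical distribution,
    # but in RF-1 under the horizontal distribution (cf. Figure 9.7)
    print(vertical(8), horizontal(8))                      # (1, 0) and (0, 2)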

First, experiments were conducted with the vertical distribution. The TTArealistic template as described in Section 4.2 is used in combination with the RF configurations of Figure 9.5. In addition, a two and a four times partitioning of an RF with 512 registers and four read and four write ports was included in the experiments. Figure 9.8 shows the performance decrease of the two and four times partitionings relative to the performance of the monolithic RFs when using region scheduling. The performance loss for a partitioning in four parts is quite


RF-1: 0..7    RF-2: 8..15    RF-3: 16..23    RF-4: 24..31

a) Vertical distribution.

RF-1: 0, 4, 8, ..., 28    RF-2: 1, 5, 9, ..., 29    RF-3: 2, 6, 10, ..., 30    RF-4: 3, 7, 11, ..., 31

b) Horizontal distribution.

Figure 9.7: Distributions of register addresses over a four times partitioned RF.

large, especially for the 4x128 partitioning. This is because the register allocator picks the first available register, and therefore registers with a low index in the register address space are picked relatively often. These registers are, however, located in the first RF and must share that RF's small number of ports. This results in many access conflicts during the scheduling process. It is interesting to note that the relative performance penalty is lower when the partitioned RF contains 32 instead of 512 registers. For these smaller RFs, the register allocator has to use multiple partitions in order to avoid spilling, while for the RF with 512 registers a single partition may be sufficient. However, the number of RF-ports


Figure 9.8: Results of the vertical and horizontal distribution for various RF partitionings.


on a single partition is small and, consequently, the number of RF-port conflicts increases, which results in a larger performance penalty.

The horizontal distribution distributes the register addresses such that the number of accesses to the RFs is more uniformly distributed. The relative performance loss when applying this distribution is also shown in Figure 9.8. The results show that the horizontal distribution performs surprisingly well, and much better than the vertical distribution. The vertical distribution resulted in an average performance loss of more than 35.6% for the 4x128 partitioning, while the horizontal distribution limits the performance loss to only 3.8%. The results per benchmark are given in Appendix B.

9.2.2 Advanced Distribution Methods

Neither the vertical nor the horizontal distribution considers the impact of simultaneous accesses to an RF. When transports are independent, the scheduler may schedule them in the same instruction. However, the scheduler cannot schedule them in the same instruction if there are more transports from or to the same RF than there are ports on that RF. The scheduler has to delay one or more transports, which may reduce the performance. In the remainder of this section, two strategies are proposed that try to avoid RF-port conflicts by mapping variables whose transports can be scheduled in the same instruction onto different RFs.

Definition 9.1 The RF-port interference graph IGRF-ports = (Nvar, ERF-port) is a finite undirected graph, with Nvar the set of variables and ERF-port the set of potential RF-port conflicts. These edges are augmented with a weight WRF-port, which reflects the importance of avoiding a particular RF-port conflict.

The two proposed strategies differ in the way IGRF-ports is constructed and theweights associated with the edges.

To select a register ri for a variable vi, the information from the forward parallel interference graph IGfpar and the RF-port interference graph IGRF-ports is used. A register is selected which respects all interferences in both graphs. When no such register is available, the register allocator selects a register that minimizes the following function:

CTotal(ri, vi) = W1 · Σ{vj ∈ Nvar ∧ r(vj) = ri} Wffdp(vi, vj)
               + W2 · Σ{vj ∈ Nvar ∧ r(vj) ∈ RF(ri)} WRF-port(vi, vj)
               + W3 · Cstate preserving(ri, vi)                        (9.4)

where

Cstate preserving(ri, vi) = { Ccaller-saved(vi) : ri ∈ Rcaller-saved
                            { Ccallee-saved(vi) : ri ∈ Rcallee-saved


and RF(ri) is the register file of register ri. The constants W1, W2 and W3 represent the weight of each of the three costs. These weights can be used to tune the algorithm. Initially, they are equal. When CTotal ≠ 0, the register allocator has to decide whether to spill a variable, to introduce false dependences or to introduce RF-port conflicts. In our implementation the register allocator always tries to solve this problem by introducing a false dependence, an RF-port conflict or a combination of both.
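
A sketch of the selection cost of Equation 9.4 for one candidate register, assuming the interference information is available in a small illustrative container; the attribute names are not the compiler's actual data structures.

    def register_cost(r, v, ctx, W1=1.0, W2=1.0, W3=1.0):
        # W1 term: false-dependence weights with variables already mapped onto r
        c_false = sum(ctx.w_ffdp.get((v, vj), 0.0)
                      for vj, reg in ctx.register_of.items() if reg == r)
        # W2 term: RF-port conflict weights with variables whose register lies
        # in the same RF partition as r
        c_port = sum(ctx.w_rfport.get((v, vj), 0.0)
                     for vj, reg in ctx.register_of.items()
                     if ctx.rf_of(reg) == ctx.rf_of(r))
        # W3 term: caller- versus callee-saved cost of r for this variable
        c_state = (ctx.caller_saved_cost[v] if r in ctx.caller_saved_registers
                   else ctx.callee_saved_cost[v])
        return W1 * c_false + W2 * c_port + W3 * c_state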

Data Independence Heuristic

The RF-port interference graph, when using the data independence heuristic, isconstructed by inspecting all pairs of independent transports in the DDG. Thefollowing edges are added to the set ERF-port:

• (vi, vj) : if one transport defines vi and the other defines vj .

• (vi, vj) : if one transport uses vi and the other uses vj .

The weight WRF-port is calculated in the same manner as is done for the edgesin the forward false dependence graph, see Equation 5.1.

The results when using this heuristic were disappointing. On average, the performance loss due to RF partitioning increased compared to the horizontal distribution. The larger performance loss is most likely caused by the fact that the data independence heuristic avoids many potential RF-port conflicts that would not have shown up in the generated schedule anyway. Because the weights WRF-port are included in the register selection heuristic, some unwanted false dependences were introduced. Decreasing the weight W2 to one tenth of the other weights did result in a small performance improvement; however, the results of the horizontal distribution could not be met. The results per benchmark are shown in Appendix B.

Intra-Operation Heuristic

The results of the data independence heuristic show that the number of edges in ERF-port should be reduced. The intra-operation heuristic tries to avoid multiple reads within a single operation from the same port-limited RF. We will illustrate this problem with an example: suppose an RF has only one read port. An operation usually needs two operands. If both operands are located in the same RF then it is impossible for the scheduler to schedule the two transports in the same instruction. If, however, the operands are located in different RFs, their transports can be scheduled in the same instruction and hence the performance improves. This heuristic constructs the IGRF-ports by inspecting all operations with multiple operands. For each combination of operand variables an edge is added to ERF-port. However, an edge is only added when the live range has multiple uses. This restriction is added because it is very likely that a live range with a single use is eliminated from the schedule by dead-result move



Figure 9.9: Performance comparison of the horizontal distribution and the intra-operation heuristic for the 4x8 partitioning.

elimination and thus never causes an RF-port conflict1. To each edge (vi, vj) a weight WRF-port is added. This weight is equal to the execution frequency of the operation's basic block.
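
A sketch of how the intra-operation heuristic could populate ERF-port, assuming each operation object exposes its operand variables and the execution frequency of its basic block; the attribute names are illustrative, and accumulating the weight over multiple operations that share the same variable pair is an assumption of the sketch.

    from itertools import combinations

    def intra_operation_edges(operations, uses_of):
        edges = {}                                   # (vi, vj) -> weight WRF-port
        for op in operations:
            # single-use live ranges are likely removed by dead-result move
            # elimination, so they are not considered
            operands = [v for v in op.operand_variables if len(uses_of[v]) > 1]
            for vi, vj in combinations(operands, 2):
                key = tuple(sorted((vi, vj)))
                edges[key] = edges.get(key, 0.0) + op.basic_block_frequency
        return edges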

The intra-operation heuristic turned out to be better than the data independence heuristic. On average, the results of the intra-operation heuristic are equal to the results of the horizontal distribution. However, per benchmark the results vary, see Appendix B. Figure 9.9 shows the performance differences per benchmark for the 4x8 RF partitioning. A positive value in this graph indicates that the horizontal distribution performs better. These results show that the performance of the horizontal distribution can be improved, but that no heuristic was found that on average resulted in a better performance.

9.2.3 Equal Area Compiling

One could argue that the area saved by RF partitioning can be used for other resources. Let us investigate what happens when the saved area is spent on extra registers. Table 9.1 lists the new equal area partitionings of a monolithic RF with 32 registers, four read ports, and four write ports. In the experiments, the horizontal distribution was used; with this heuristic the best results were obtained. Figure 9.10 shows the performance penalties of these new partitionings. From the figure we conclude that with equal area compiling the performance losses caused by RF partitioning can be turned into a performance gain. For some

1Note that the inspected pairs in the intra-operation heuristic are a subset of the inspected pairs in the data independence heuristic.


Table 9.1: RF configurations.

RF configuration   Partitions   RRF per partition   Nports(read) per partition   Nports(write) per partition
2x26               2            26                  2                            2
4x18               4            18                  1                            1


Figure 9.10: Results of equal area compiling.

benchmarks the performance gain is large, for example mpeg2encode. Because more registers are available, less spill code is required and thus the performance increases. Other benchmarks show disappointing results. Because the RFs in the new RF configurations contain more registers than the original RFs, potentially more RF-port conflicts can arise. This may result in a performance penalty. A register assignment strategy that divides the variables evenly across the RFs, such as the horizontal distribution, may reduce the impact of having more registers behind a small number of RF-ports. The results of many benchmarks are hardly affected by adding extra registers. These benchmarks do not require more than 32 registers. Adding more registers will not help to gain performance. For these benchmarks it is more profitable to spend the saved area on other resources like FUs, a larger cache, etc.

9.3 Late Assignment and Partitioned Register Files

In Chapter 5, we discussed why late register assignment is not an attractive solution for TTAs. In this section, the consequences of applying late assignment


Phase ordering A: RF allocator → instruction scheduler → register allocator
Phase ordering B: instruction scheduler / RF allocator → register allocator
Phase ordering C: instruction scheduler → RF allocator → register allocator
Phase ordering D: instruction scheduler → register allocator / RF allocator
Phase ordering E: instruction scheduler → register allocator → RF allocator

Figure 9.11: Late assignment phase orderings in the context of a partitioned RF.

in combination with a partitioned RF are discussed. Allocating variables to RF partitions can be done in the instruction scheduling phase, the register assignment phase or in a separate phase. All phase orderings are given in Figure 9.11 and are briefly addressed2.

Phase ordering A: Early RF assignment
When assigning variables to RFs before the instruction scheduling and register assignment phases, both phases are constrained by the RF assignment. Due to RF-port conflicts, the instruction scheduler has to delay some operations. This results in less efficient code. The freedom of the register allocator is also restricted; it can only map a variable onto a register in the assigned RF. Depending on the partitioning algorithm, some RFs are over-utilized (there are too many variables for the available registers), while others have a surplus of registers. For the variables in an over-utilized RF, the post-pass register allocator has two spill options: (1) spilling to memory or (2) spilling to an under-utilized RF with the use of copy operations. The latter is only possible when enough RF-ports are available. Both approaches degrade the performance.

Phase ordering B: Integrated scheduling and RF assignment
In this phase ordering, the instruction scheduler maps variables to RFs. The RFs are assigned in such a manner that the performance is maximized from the instruction scheduler's point of view. The scheduler assumes that all the

2For early assignment, several phase orderings can also be considered; however, most of them are senseless since the register assignment determines the RF.


RFs are infinitely large. The RF assignment may hinder the register allocator; it can only map a variable onto a register in the RF assigned by the scheduler. In the worst case, the scheduler assigns all variables to one RF, while the other ones remain empty. Typically, the variables will be scattered over all RFs; however, the distribution of the variables over the RFs may not be optimal for register assignment. As in phase ordering A, the register allocator has to insert spill code or extra copy operations.

Phase ordering C: In between RF assignment
In this phase ordering, the instruction scheduler assumes a single infinitely large RF. Consequently, it will likely generate a high performance schedule. It is, however, not guaranteed that this schedule is correct when using a partitioned RF. The RF assignment phase assigns the variables to RFs and inserts, when needed, extra instructions to correct the schedule. The register allocator is responsible for allocating the variables to the registers of the RFs.

In the literature, two approaches are described that perform RF partitioning after instruction scheduling. In [CDN93], the RF allocator partitions the operations in already scheduled VLIW instructions into a number of substreams equal to the number of RFs. The proposed method partitions the operations in such a way that it minimizes the amount of communication between the substreams. It is not very likely that a clean partitioning can be found, i.e., one without any data movement between the substreams. When communication is necessary, data movement operations are inserted and the code is locally rescheduled. This method only addresses the distribution of variables across RFs. It does not address register assignment; it is assumed that each RF has sufficient registers to hold all variables. Furthermore, the method is only applied to straight-line code (loop bodies). The reported performance losses range from 4% to 40% for two partitions and from 15% to 75% for four partitions.

In [CND95], a hypergraph-based coloring method is proposed to model competition among variables for RF-ports on a VLIW with a partitioned RF. The nodes in such a hypergraph represent multichains. A multichain is a collection of scheduled operations that define or use the same variable (similar to du-chains without a direction of the edges). The hyperedges between the nodes represent the interferences among multichains. Two types of hyperedges are defined: RF-port conflicts for reading and RF-port conflicts for writing a variable. Each hyperedge connects sets of multichains which compete for an RF read/write port in the same instruction. Graph coloring is used to assign multichains to RFs. The colors represent the RFs. A hyperedge should not connect more nodes with the same color than there are RF-ports. A legal coloring cannot always be found. In these situations, operations are moved to other instructions and copy operations are inserted. The presented results show that only a small performance degradation has to be accepted, lower than 5%. However, no assumption about the size of the RFs is made. This may lead, as in [CDN93], to an uneven distribution of variables over the RFs and may even lead to unnecessary spilling when the RFs have a limited number of registers.


Phase ordering D: Integrated register and RF assignment
In this approach, the register allocator is responsible for assigning RFs to variables. Prior to register assignment the instruction scheduler places the operations or transports in the instructions while respecting the total number of RF-ports. However, the scheduler does not assign RF-ports to operations; this is the responsibility of the register allocator. The register allocator cannot assign registers in such a way that more writes or reads simultaneously access the same RF than there are write or read ports, respectively. It is not likely that this restriction is always respected by the instruction scheduler, and thus rescheduling is required. Furthermore, the register allocator has to insert spill code when needed.

Phase ordering E: Late RF assignment
In this phase ordering both instruction scheduling and register assignment assume a single RF. During RF assignment, the registers are mapped onto RFs. The assignment of the registers combined with the already generated schedule most likely results in RF-port conflicts. When no legal RF assignment can be found, some operations must be delayed and rescheduled to reduce the RF-port conflicts.

All mentioned phase orderings have their advantages and drawbacks. Since not much research has been done in this area, it is hard to say which phase ordering outperforms the others. Although good results are reported in [CND95], the size of the RFs was not taken into consideration. Furthermore, all methods require rescheduling of already scheduled code. As was observed in Section 5.2, this is complex in the context of TTAs.

9.4 Integrated Assignment and Partitioned Register Files

The separation of register assignment, instruction scheduling and RF assignment into different phases has its drawbacks. Decisions that seem to be advantageous in an earlier phase can have a negative effect on the total performance. In early assignment, the register allocator has no idea which assignments will result in RF-port conflicts in the instruction scheduling phase; only assumptions about potential conflicts can be taken into consideration. An integrated approach has more knowledge about RF-port availability per instruction. Because the capacity of the RF is taken into consideration as well, the RFs are not over-utilized as with late assignment. This reduces the amount of spill code. Furthermore, no rescheduling is required.

Despite these advantages, almost no research has been done in the area of integrated assignment and partitioned RFs. To the best of our knowledge, only the Bulldog compiler [Ell86] addresses integrated assignment in the context of an instruction-based list scheduler for clustered VLIWs. The proposed method


a) Problematic operation:
   add r6, r2, r4

b) Solution using copy insertion:
   move r2, r3
   add r6, r3, r4

Figure 9.12: Inserting a copy operation. It is assumed that the register set is divided over two RFs, one with the even registers and one with the odd registers.

selects an RF which (1) has free registers, (2) can be reached from the producing FU and (3) has the fewest RF-port accesses in the instruction in which the new access will be scheduled. No attempt is made to include possibly conflicting future references (that is, references later in the schedule) in the selection method. Unfortunately, no measurements are provided to evaluate the performance decrease due to a partitioned RF. Ellis also addressed a typical OTA-related problem in relation to partitioned RFs. Many operations require two operands, which have to be read in the same instruction. When both operands are located in the same RF, which has only one RF-port, the operation cannot be executed. Inserting a copy operation into the code can solve this problem. An example is given in Figure 9.12. Figure 9.12a shows the original operation. This operation cannot be scheduled since both operands must be read simultaneously from the same RF, which has a single read port. The principle of copy insertion is demonstrated in Figure 9.12b. Note that this is not a problem for TTAs, since the operands do not have to be read in the same instruction.

In Chapters 6, 7 and 8, we introduced a new integrated assignment approach. In the following, we extend our integrated assignment approach in such a way that it can efficiently handle a partitioned RF.

9.4.1 Local Heuristics

Recall that our integrated approach assigns a register to a variable when the operation that first references this variable is scheduled. At that moment, it also decides in which RF the variable will reside. Algorithm 6.1 is changed to Algorithm 9.1 in order to support the use of partitioned RFs3. The main difference is that the functions SELECTSRCREGISTER and SELECTDSTREGISTER now have to select a register from the RF that is connected to socket si or di, respectively. This new algorithm implements a first-fit approach. The RF whose sockets are in the front of the socket list is tried first and thus will be used more extensively than the other RFs. Only when no sockets are available, or when

3During region scheduling, registers are unassigned to facilitate importing. Because some of the accesses of the associated variable are already scheduled and bound to an RF, the register allocator has to assign a register from the same RF as the original register. This is accomplished by adding extra information to the move that indicates whether the choice of RFs is free or not. This information is used in the functions AVAILABLEORDEREDSRCSOCKETS and AVAILABLEORDEREDDSTSOCKETS, which only return the sockets of the correct RF.


Algorithm 9.1 ASSIGNTRANSPORTRESOURCES(m, i)

src = SOURCE(m)
dst = DESTINATION(m)
FOR EACH si ∈ AVAILABLEORDEREDSRCSOCKETS(src, i) DO
    FOR EACH di ∈ AVAILABLEORDEREDDSTSOCKETS(dst, i) ∧ di ≠ si DO
        FOR EACH mb ∈ AVAILABLEMOVEBUSES(si, di, i) DO
            IF ISVARIABLE(src) THEN
                rsrc = SELECTSRCREGISTER(m, i, RFCONNECTEDTO(si))
                IF rsrc = ∅ THEN
                    continue
                ENDIF
            ENDIF
            IF ISVARIABLE(dst) THEN
                rdst = SELECTDSTREGISTER(m, i, RFCONNECTEDTO(di))
                IF rdst = ∅ THEN
                    continue
                ENDIF
            ENDIF
            IF ISVARIABLE(src) THEN
                ASSIGNREGISTER(src, rsrc)
            ENDIF
            IF ISVARIABLE(dst) THEN
                ASSIGNREGISTER(dst, rdst)
            ENDIF
            ASSIGNSOURCESOCKET(m, si)
            ASSIGNDESTINATIONSOCKET(m, di)
            ASSIGNMOVEBUS(m, mb)
            return TRUE
        ENDFOR
    ENDFOR
ENDFOR
return FALSE

no register is available, is another RF selected. The results of this straightforward application of RF partitioning are shown in Figure 9.13. Because the RFs whose sockets are in the front of the list are picked relatively often, many RF-port conflicts occur. This results in a large performance penalty.

In an attempt to distribute the RF-port accesses more evenly across the RFs, we implemented a best-fit strategy. A register is selected that resides in the RF with the fewest accesses in the instruction i where this new access will be scheduled. The RF sockets are sorted such that the sockets connected to the RFs with the fewest accesses in instruction i are in front of the list. The results of this strategy are also given in Figure 9.13. When using two partitions



Figure 9.13: The results of the three local RF partitioning strategies: first-fit, best-fit and round robin.

the results improved somewhat. However, when using RFs with a single read port and a single write port, the best-fit strategy produces exactly the same results as the first-fit approach. This is caused by the fact that in both strategies only those RFs can be selected that have no reference at all in the instruction in which the reference is scheduled. The orderings of the RFs with no reference are the same for both the best-fit and the first-fit approach. This is also the weakness of this approach: the RF that is in front of the original list is selected relatively often, even when other RFs with no references are available. As a result, other, not yet scheduled references of the live range of the already assigned variable have a higher probability of RF-port conflicts with other references to the first RF.

A better strategy is to use another RF each time a register is required. Thisstrategy is similar to the horizontal distribution as used in early assignment.We implemented a round robin strategy. This is accomplished by rotating thesockets in the socket list each time an attempt is made to assign a register to avariable. The results of the round robin strategy are shown in Figure 9.13. Ascan be seen, this method resulted in the smallest performance penalty.
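
A sketch of the round robin idea on top of Algorithm 9.1: the socket list handed to the algorithm is rotated each time an attempt is made to assign a register, so that successive assignments prefer a different RF partition. The deque-based rotation and the socket names are only an illustration.

    from collections import deque

    class RotatingSocketOrder:
        def __init__(self, rf_sockets):
            self.sockets = deque(rf_sockets)

        def ordered_sockets(self):
            order = list(self.sockets)   # order in which the RFs are tried
            self.sockets.rotate(-1)      # the next assignment starts at the next RF
            return order

    order = RotatingSocketOrder(["RF-1.read", "RF-2.read", "RF-3.read", "RF-4.read"])
    print(order.ordered_sockets())       # RF-1 first
    print(order.ordered_sockets())       # RF-2 first on the next attempt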

9.4.2 A Global Heuristic

Like Ellis [Ell86], the local strategies only consider the RF-port usage in the instruction in which the operation or transport that requires an RF-port is scheduled. However, RF-port assignments must be planned globally, since a variable is accessed in other instructions too. Therefore, all other accesses to a variable and all potential RF-port conflicts in its live range should also be taken into consideration.

For example, consider the sequential code in Figure 9.14 and assume one RF write port per RF.


v1(r1) → add.o; v2(r2) → add.t;
add.r → v3;
v4(r4) → sub.o; #4 → sub.t;
sub.r → v5(r5);

Figure 9.14: Potential RF-port conflicts in sequential code.

v1 → add.o; v2 → add.t;
add.r → v3;
v4 → sub.o; #4 → sub.t;
sub.r → v5;
v8 → v9;
v6 → mul.o; #10 → mul.t;
mul.r → v7;
v5 → st.o; v7 → st.t;

a) Unscheduled TTA code.

[Panels b) partial schedule 1, c) partial schedule 2 and d) the final schedule show the corresponding instruction schedules; they cannot be reproduced in text form.]

Figure 9.15: Global RF-port conflicts.

The transports defining the variables v3 and v5 both require an RF write port. Variable v5 is already mapped onto r5, and thus onto an RF. In an efficient schedule, both accesses may be placed in the same instruction and thus compete for an RF-port. When the result move of the addition is scheduled first, the register allocator should avoid mapping variable v3 onto the same RF as variable v5.

Besides RF-port conflicts between not yet scheduled operations, conflicts can also arise between not yet scheduled and already scheduled operations. Figure 9.15b shows the partial schedule after scheduling the first three operations of the sequential code of Figure 9.15a, under the assumption that the RFs have only one RF-port for reading. When scheduling the result of the multiply, a register, and thus an RF, has to be selected; in this case register r5 is selected (see Figure 9.15c). The selection of the RF has an impact on the scheduling of the trigger move of the store operation. This is shown in Figure 9.15d; when register r5 is in the same RF as register r1, an RF-port conflict arises between the read access of the copy and the read access of the trigger move of the store operation. Thus, a local decision has an impact on a global scale.

Based on these observations it seems advantageous to avoid local as well as global RF-port conflicts. To accomplish this, an RF-port conflict table is built each time integrated assignment seeks a register for a variable v. This table has as many entries as there are partitions.


[Figure: bar chart of the performance penalty (%) of the global strategy for the RF configurations 2x256, 4x128, 2x16 and 4x8.]

Figure 9.16: The results of the global RF partitioning strategy.

When scheduling a transport m referring to v, the code is scanned for potential RF-port conflicts. Potential conflicts are identified by inspecting the DDG for operations with already assigned registers and with the same access type (read or write)4 that could be scheduled in the same instruction as m. For each potential conflict mi, the table is updated with an RF-port conflict cost; the updated table entry is the entry that corresponds to the RF of the conflicting access mi. Besides recording conflicts for the transport m to be scheduled, potential RF-port conflicts with all other references to variable v are also recorded by traversing the du-chain. It was decided to only consider the RF-port conflicts in the basic blocks containing the references. Including all RF-port conflicts is too time consuming; with region scheduling, all operations that could be imported would also have to be considered each time an assignment is made. Furthermore, most of the potential RF-port conflicts with operations of other basic blocks will never occur in practice. In our experiments the RF-port conflict cost equals the expected execution frequency of transport mi. The RFs are ordered with respect to their total RF-port conflict costs, and the RF with the lowest total cost is selected first. The round-robin strategy is used as a fallback when two or more RFs have the same conflict cost.
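
The global heuristic reduces to a small cost table indexed by partition. The sketch below is a schematic C++ rendering of that idea; how the potential conflicts and their execution frequencies are extracted from the DDG and the du-chains is hidden behind a hypothetical PotentialConflict record.

    #include <algorithm>
    #include <cstddef>
    #include <vector>

    // One potential RF-port conflict found by scanning the DDG and the du-chain
    // of variable v: the RF of the already assigned conflicting access and the
    // expected execution frequency of that transport (used as the conflict cost).
    struct PotentialConflict {
      std::size_t rf;
      double frequency;
    };

    // Returns the RFs ordered by increasing total conflict cost. 'roundRobinStart'
    // supplies the tie-breaking order, so RFs with equal (e.g. zero) cost are
    // still used in turn.
    std::vector<std::size_t> orderRFsByConflictCost(
        std::size_t numPartitions,
        const std::vector<PotentialConflict>& conflicts,
        std::size_t roundRobinStart) {
      std::vector<double> cost(numPartitions, 0.0);
      for (const PotentialConflict& c : conflicts)
        cost[c.rf] += c.frequency;                           // accumulate per-partition cost

      std::vector<std::size_t> order(numPartitions);
      for (std::size_t i = 0; i < numPartitions; ++i)
        order[i] = (roundRobinStart + i) % numPartitions;    // round-robin fallback order

      std::stable_sort(order.begin(), order.end(),
                       [&](std::size_t a, std::size_t b) { return cost[a] < cost[b]; });
      return order;                                          // cheapest RF is tried first
    }

When all costs are zero, the ordering degenerates into the round-robin order, which matches the fallback described above.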

The results of the global strategy are shown in Figure 9.16. They are close to those of the round-robin strategy. The general conclusion is that an ad-hoc approach such as round-robin is as good as a more sophisticated strategy. It should be noted that in both cases the performance loss caused by RF partitioning is small.

To our surprise, some of the results show an improvement when the RF is partitioned. For example, the benchmarks gzip and mulaw show a small performance gain (see Appendix B). This effect can be explained by the same observation as was made in Section 7.6.3. When the region scheduler is successful in importing operations, a basic block can be removed (see Figure 7.11). However, the removal of this basic block may have an overall negative effect. When using a partitioned RF, an RF-port conflict may hinder the importing of operations and thus prevent the basic block from being removed.

4 No shared read/write ports.

9.5 Conclusions

In this chapter, the various effects of a single monolithic RF in the context of ILP processors are evaluated. Models are presented for the area, access time and power consumption. These models show that it is advantageous to partition the RF, because the total area, access time and power consumption are reduced. However, partitioning may increase the number of executed cycles due to RF-port conflicts. We have shown that RF partitioning, when applied to TTAs, does not have to result in a large performance loss. For both early assignment and integrated assignment, the average performance loss can be kept below 5% with simple heuristics. We currently do not see a good method for using a partitioned RF in combination with late assignment. The performance results shown in this chapter are much better than the results reported for distributed RFs in [CDN93]; the reported performance losses for distributed RFs ranged from 4% to 40% for two partitions and from 15% to 75% for four partitions. The reason is twofold. First, partitioned RFs do not require extra copy operations between registers of different clusters. Second, in TTAs software bypassing and dead-result move elimination can be applied. This reduces the number of RF accesses and thus the number of RF-port conflicts.

A partitioned RF requires less area. If we compensate for this by adding extra registers to the partitioned RF, the performance losses caused by partitioning have been shown to disappear; the use of a partitioned RF may even result in a small performance gain. If the register access time constrains the achievable cycle time of a processor, and is therefore critical in the overall performance of the system, then the performance loss due to the partitioning completely vanishes.

Finally, we summarize the advantages of using a partitioned RF versus a monolithic RF: (1) a partitioned RF is easier to realize due to the low complexity of the components, (2) the approach is scalable: the RFs can be duplicated to fit the architecture requirements, (3) the design can be made with available low-cost SRAM cells, (4) the required chip area may be reduced, (5) the access time of an RF with fewer ports is smaller, and (6) the power dissipation is reduced. Since the trend is towards the exploitation of more and more ILP, it is likely that the RF becomes a larger bottleneck. Our approach offers good performance as register (bandwidth) requirements double and redouble from current needs.


10 Summary and Future Research

In this dissertation we considered and resolved issues associated with register assignment, instruction scheduling, and partitioned RFs. Our overall investigation and achievements are summarized in Section 10.1. The contributions that stem from the described work are listed in Section 10.2. Section 10.3 discusses future research directions in the area of register assignment, ILP exploitation, and TTA research.

10.1 Summary

More powerful microprocessors and higher quality compilers are required to keep up with the ever-increasing complexity of embedded applications. In addition, the power consumption should be kept to a minimum, because many new devices are battery powered. Microprocessor research at the Delft University of Technology resulted in the development of the Transport Triggered Architecture (TTA) [Cor98]. This new architecture promises to fulfill the above-mentioned requirements.

Chapter 2 discussed the main characteristics of TTA based processors. TTAs combine flexibility, modularity, and scalability. They can be tuned for any application or application area. The application's complexity, combined with a reduced time-to-market, makes a software based approach to embedded systems particularly appealing today. Therefore, a high-level language compiler environment, based on aggressive instruction-level parallel (ILP) compiler technology, is a prerequisite. The compiler infrastructure used was discussed in Chapter 3. The benchmark suite used for the experiments consists of two TTA processors and 30 benchmark applications, including SPECint95 and multimedia benchmarks. The characteristics of these benchmarks were presented in Chapter 4.


The compiler is divided into a front-end and a back-end. The front-end translates a high-level language into an intermediate format. The back-end uses this intermediate format to generate parallel code. The main phases of the back-end are register assignment and instruction scheduling. The register allocator assigns registers to variables, and the instruction scheduler parallelizes the code. Applying register assignment first limits the scheduler's ability to reorder operations. Applying scheduling first results in schedules that require more registers than available. The interaction between register assignment and instruction scheduling has its impact on the produced code; decisions made by one phase can have negative effects on the other.

In this dissertation, various register assignment and instruction scheduling phase ordering strategies are investigated. To the best of our knowledge, it is the first time that three alternative strategies, early, late and integrated assignment, are compared and investigated in so much detail. In Chapter 5, approaches from the literature towards the interaction of register assignment and instruction scheduling are discussed. The low performance of a strictly early assignment approach originates from the false dependences it introduces; they hinder the generation of efficient schedules, because they increase the critical path. Most approaches that try to avoid the introduction of false dependences use the data dependence graph (DDG) to make the register allocator dependence-conscious [GWC88, Pin93, NP93, AEBK94]. A dependence-conscious approach is also implemented in the TTA compiler back-end. The results show a significant improvement compared to strictly early assignment. This is the first indication that register assignment and instruction scheduling must co-operate to achieve high performance code. Because register assignment is applied before instruction scheduling, the avoidance of false dependences is based on educated guesses. Consequently, the avoided false dependences may turn out not to be critical, while the ones that could not be avoided increase the critical path. Furthermore, the TTA specific optimizations, such as software bypassing and dead-result move elimination, result in a lower register pressure after scheduling than before scheduling. Unfortunately, an early assignment approach cannot exploit these optimizations. Both observations show that there is still room for improvement.

Another strategy is late assignment: instruction scheduling is performed before register assignment. This strategy has the advantage that the instruction scheduler is not hindered by false dependences and that the TTA specific optimizations are visible to the register allocator. This seems advantageous; however, the instruction scheduler, in its attempt to reorder instructions to maximize ILP, may lengthen the live ranges of values and thus increase the contention for registers. If not enough registers are provided by the target processor, data is written to memory, introducing spill code, which itself also requires registers. The increase in ILP can be nullified by the amount of spill code. This problem is identified in the literature [BSBC95]. To reduce the impact of the inserted spill code, rescheduling is usually applied. To lower the register pressure, so-called register-sensitive scheduling algorithms have been proposed [GWC88, BEH91a, MPSR95] that limit the scheduling aggressiveness. An estimate of the register requirements is used for switching between an instruction scheduler that maximizes ILP and a scheduling method that reduces the register pressure. A register-sensitive instruction scheduler is implemented in the TTA compiler back-end. The performance gain relative to strictly late assignment was disappointing. The main reason was that the number of register pressure reducing operations in the ready set is limited or non-existent. Furthermore, it was shown that the insertion of spill code, when using late assignment, is problematic for TTAs. Instead of adding spill code to the already scheduled code, spill code was added to the original code, and the resulting code was then scheduled again. Our experiments indicate that for TTAs dependence-conscious early register assignment outperforms late assignment.

To address the problems of early and late assignment, a new register assignment method is developed: integrated assignment. The register assignment is performed during instruction scheduling; a register is assigned to a variable as soon as an operation is scheduled that refers to this variable. In Chapter 6, this idea has been presented in the context of basic block scheduling. To record the availability of registers in instructions, so-called Register Resource Vectors (RRVs) are introduced. With the use of these RRVs, dataflow information, and the DDG, the set of available registers for a variable is computed. In order to be complete, methods to insert spill code and state preserving code during scheduling are presented as well. The insertion of spill code results in new short live ranges that also require registers; however, it was precisely a shortage of this resource that caused the spilling. To solve this problem the TTA specific properties are exploited. The scheduler is changed in such a manner that it is guaranteed that these new short live ranges, when needed, are software bypassed and that dead-result move elimination removes them from the scheduled code. The results of this new approach showed that improvements, compared to a dependence-conscious early register assignment approach, of up to 100%, with an average of 20%, can be obtained when registers are scarce. Of course, when there are plenty of registers this new method should (and does) give the same performance as the early and late assignment methods.
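
A Register Resource Vector can be thought of as a per-instruction bit vector of occupied registers; a register is available for a live range if it is free in every instruction the live range spans. The C++ sketch below follows that reading; the bitset representation and the [first, last] interval approximation of a live range are assumptions made for the sketch, not the exact Chapter 6 formulation.

    #include <bitset>
    #include <cstddef>
    #include <optional>
    #include <vector>

    constexpr std::size_t kMaxRegs = 64;              // assumed RF size for the sketch
    using RRV = std::bitset<kMaxRegs>;                // bit r set: register r is occupied

    // Registers that are free in every instruction the live range spans
    // (approximated here as a [first, last] instruction interval).
    std::optional<unsigned> pickRegister(const std::vector<RRV>& rrvPerInstruction,
                                         std::size_t first, std::size_t last) {
      RRV occupied;                                   // union of the RRVs over the range
      for (std::size_t i = first; i <= last && i < rrvPerInstruction.size(); ++i)
        occupied |= rrvPerInstruction[i];
      for (unsigned r = 0; r < kMaxRegs; ++r)
        if (!occupied.test(r)) return r;              // first register free everywhere
      return std::nullopt;                            // no register available: spill
    }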

The amount of exploitable ILP in basic blocks is limited. To justify the duplication cost of FUs and data paths in ILP processors, the ILP between operations of different basic blocks should also be exploited. In Chapter 7, it is shown how integrated register assignment can be applied in extended basic block schedulers. In the experiments, a region scheduler was used. Since operations can be moved from one basic block to another, the exploitable ILP increases, as well as the register pressure. During the experiments, it was discovered that limiting the aggressiveness of the instruction scheduler might result in a performance increase. In order to limit the aggressiveness of the instruction scheduler, new heuristics are proposed. The results showed that our integrated assignment method, in combination with a region scheduler, performs very well.


[Figure: relative cycle count increase (%) plotted against the number of registers (10 to 512) for late assignment, strictly early assignment, dependence-conscious early assignment (DCEA) and integrated assignment.]

Figure 10.1: Relative cycle count increase of late assignment, strictly early assignment, dependence-conscious early assignment and integrated assignment for region scheduling and the TTAideal template.

On average we obtained a 40% performance increase, compared to dependence-conscious early assignment, when registers are scarce. Individual benchmarks show performance gains of up to 140%. The performance gain is even larger than for basic block scheduling.

To support our hypothesis that our approach is especially suitable for applications with a high register pressure, we also integrated our method with software pipelining. It is known that this instruction scheduling method increases the register pressure significantly and produces high performance code. In Chapter 8, issues related to software pipelining in the context of integrated assignment are discussed. The results of the experiments showed that our hypothesis was correct. The performance gain was even larger than for region scheduling. Normally, a high register pressure occurs in code with a large amount of exploitable ILP. This brings us to the conclusion that integrated assignment is especially suitable for high performance applications.

The performance improvement of integrated assignment over all other evaluated register assignment methods is given in Figure 10.1. This figure shows the cycle count increase for the TTAideal template of late assignment, strictly early assignment, dependence-conscious early assignment and integrated assignment relative to a TTAideal processor with 512 registers. Regions are used as the scheduling scope. The figure shows that the developed method is superior to all other methods. However, it must be noted that integrated assignment has a much higher engineering complexity than assignment methods based on graph coloring. This is mainly caused by the complexity of scheduling spill code.


Exploiting a high degree of ILP requires highly concurrent register access, yet it is difficult to provide fast, parallel, and global access to a single register file (RF). The RF represents a key component in the design of a fast data path, since the conflicting requirements in terms of short access time, a large number of registers and a large I/O data bandwidth can quickly make the RF a limiting factor for performance. Integrated register assignment requires fewer registers to achieve the same high quality code as early and late assignment. Consequently, the RF can be smaller; this results in a reduction in silicon area, power consumption and access time. In Chapter 9, it was observed that besides the number of registers, the number of RF-ports largely determines silicon area, power consumption and access time. It was also observed that a partitioned RF with the same number of registers and RF-ports is smaller, consumes less power and is faster than a single monolithic RF. Therefore, it was proposed to use several small RFs instead of one single monolithic RF. New methods are developed to make early and integrated assignment suitable for the partitioned RF. Because of the fewer ports per RF, the probability of RF-port congestion per RF increases. Heuristics are developed to minimize the impact of RF-port conflicts. Experiments showed that only a small loss in performance, due to access conflicts, has to be accepted when partitioning the RF. When the area saved by the partitioning is spent on extra registers, even a performance increase can be observed. This increase is even larger when the reduced access time is taken into account.

10.2 Contributions

The major contributions of this thesis are summarized below.

• For the first time, mathematical formulations are developed for various dependence-conscious early assignment methods found in the literature. In addition, the relations between these approaches are mathematically formulated.

• A late assignment method in the context of TTAs is developed. The problems related to the insertion of operations in already scheduled TTA code are addressed.

• A new approach towards the integration of register assignment and basic block scheduling is developed [JC97b]. It is shown that the introduced algorithm can gracefully handle the insertion of spill code and state preserving code. New heuristics are proposed and evaluated. It is demonstrated that this method outperforms all other evaluated phase ordering strategies.

• The developed integrated assignment method is extended to region scheduling in order to exploit a larger amount of ILP [JC98, CJA00]. Heuristics are developed to limit the aggressiveness of the instruction scheduler in order to increase the performance. It is shown that this method outperforms all other evaluated phase ordering strategies. It is very effective in exploiting ILP when the register pressure is high.

• In order to demonstrate the effectiveness of integrated assignment, it is applied in combination with software pipelining [JCK98], a very aggressive instruction scheduling method. The experiments showed that the developed method is indeed very suitable for code with a large register pressure.

• Insight has been gained into the delay, area and power consumption problems of multi-ported register files. New RF configurations are proposed to overcome these problems. In addition, new methods have been developed to minimize the performance penalty when early assignment is applied to TTA processors with a partitioned RF [JC95].

• It is shown that the concept of partitioned RFs can easily be added to integrated assignment. Experiments show that the performance penalty due to RF conflicts can be limited.

Another contribution related to the research presented in this dissertation is Controlled Node Splitting, a new technique to restructure the control flow graph in order to facilitate the exploitation of ILP. This work resulted in a conference and a journal paper [JC96, JC97a].

10.3 Proposed Research Directions

The work described in this dissertation can, in our opinion, be continued in the following directions.

Early Assignment Improvements

Because of the low complexity of graph coloring in contrast to integrated assignment, it is interesting to investigate how early assignment can be further improved. Based on our experience with early assignment algorithms and their results, the following possible improvements were identified.

1. Better interference graphs. During the analysis of the various dependence-conscious early assignment strategies, it was observed that none of the parallel interference graphs makes a clear distinction between real interference edges and false dependence prevention edges. By real interference edges we mean the interferences that will always be present, independent of the generated schedule. These edges are denoted with Emin. By false dependence prevention edges we refer to the interferences that might be present in a schedule; these are denoted as Efalse = Efdp − Emin. The new interference graph, with a clear distinction between both types of edges, can now be constructed as Gdc = (Nvar, Emin, Efalse); a sketch of such a structure is given after this list. Note that this approach is a combination of the work of Ambrosch [AEBK94] and Pinter [Pin93], see Section 5.1. Future research may lead to heuristics that use both types of interference edges to make a better trade-off between spilling and the addition of false dependences.

2. Parallel interference graph pruning. The parallel interference graph tends to have many interference edges, especially when the scheduling scope is extended beyond basic block boundaries. Methods have been developed to prune the graph, for example by only including forward false dependences, see Section 5.1.3. This may, however, sometimes lead to an unnecessary performance loss. Other pruning methods can be developed. One can think of including more scheduling decisions, like the impossibility of executing operations from various basic blocks in parallel due to a restriction on speculative execution. Because not all edges are important, methods should be developed to identify only the important ones. For example, a pre-scheduler could help to identify false dependences that should be avoided.

3. False dependence-conscious operation reordering. The direction of a false dependence added by register assignment depends on the operation order in the sequential code. In order to reduce the impact on the critical path, one could research the impact of a false dependence-conscious operation reordering technique.

4. False dependence elimination during scheduling. In an early assignment approach, software bypassing and dead-result move elimination can eliminate false dependences. For example, consider Figure 6.19c. The false dependence that hindered the parallel execution of the operations is hidden by software bypassing and dead-result move elimination, because the live ranges are no longer visible in the resulting code. When both the addition and the store are scheduled, the false dependence can be removed from the DDG. This allows an early assignment scheduler to generate parallel code as shown in Figure 6.19d.
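
Item 1 above asks for an interference graph that keeps the real interference edges (Emin) apart from the false dependence prevention edges (Efalse). A minimal C++ sketch of such a structure, with all names hypothetical:

    #include <cstddef>
    #include <utility>
    #include <vector>

    // G_dc = (N_var, E_min, E_false): nodes are variables, and the two edge sets
    // are kept separate so a coloring heuristic can treat "always interferes"
    // differently from "interferes only under some schedules".
    struct DependenceConsciousGraph {
      std::size_t numVariables = 0;
      std::vector<std::pair<std::size_t, std::size_t>> realEdges;    // E_min
      std::vector<std::pair<std::size_t, std::size_t>> falseEdges;   // E_false = E_fdp - E_min

      void addRealInterference(std::size_t a, std::size_t b) { realEdges.emplace_back(a, b); }
      void addFalsePrevention(std::size_t a, std::size_t b)  { falseEdges.emplace_back(a, b); }

      // A possible trade-off hook: real edges must be respected, false edges may
      // be dropped at the price of an extra false dependence in the schedule.
      std::size_t mandatoryDegree(std::size_t v) const {
        std::size_t d = 0;
        for (const auto& e : realEdges) d += (e.first == v || e.second == v);
        return d;
      }
    };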

General Register Assignment Issues

Besides improvements in early assignment, other register assignment topics also require further research. These topics are:

1. Live range splitting. All discussed register assignment methods spill the complete live range of a variable. When a live range consists of many definitions and uses, one could decide to spill only part of the live range. Hence, the variable resides for some time in memory and the remaining time in a register. Because fewer reloads are necessary, a performance increase is expected. The main problem to be solved is where in the code to switch between a register and a memory location; a sketch of this transformation is given after this list.

2. Development of better heuristics. The presented experiments indicate that improvements can be obtained with the use of heuristics. However, some of the heuristics do not have the same impact on all benchmarks (for example the GSC5 heuristic). Analysis of these benchmarks may result in a top-level heuristic that decides which heuristic or which combination of heuristics to use. This heuristic can be different for different parts of the code. Another research direction can be the development of a framework that automatically evaluates various heuristics and, based on the results, selects the best heuristic. This framework can, for example, be used to decide whether to include a loop's software pipelined code or its region scheduled code in the total schedule.

3. Predicate analysis. The current register assignment implementations ignore predicate or guarded execution analysis to reduce the register pressure. The major reason to ignore this optimization was that guarding is not used extensively. However, ILP enhancing techniques exist which heavily depend on predicates. In order to fully benefit from these techniques, research should be conducted to allow non-interfering variables that overlap in time to share a common register. A good starting point for this research is [ED95a].

4. Basic block insertion. The used region scheduler removes a basic block when all of its operations are imported into other basic blocks. However, as discussed in Section 7.6.3, this is not always advantageous, especially when in an if-then-else construction the outcome of the jump is strongly biased towards one branch and the basic block in the other branch is removed. As a result, operations below the join point of the two branches cannot be imported into the first branch, because this would result in a violation of the single copy on a path rule. Research should be conducted to decide when to remove basic blocks, or when to add basic blocks in order to increase the exploitable ILP.

5. Creating extra temporary registers. Besides the registers in the RF, the registers in the FUs can also be used for storing variables. For example, one could decide to perform an addition with zero to create a placeholder for a variable with a short live range.

6. Single ported RFs. In this thesis, the research was limited to RFs with separate read and write ports. Combining these two functions in one RF-port may reduce the cost. The TTA compiler can already handle RFs with multiple RF-ports that combine reading and writing. However, how to compile for an RF with a single RF-port is still an open issue. This is especially a problem for copy operations, because copies require a read and a write in the same cycle. A possible solution is to replace the copy with an addition that adds zero to the operand of the copy.


#100 → mul.t; v3 → mul.o;
mul.r → v1;
#17 → add.o; v1 → add.t;
add.r → v2;
#200 → st.t; v1 → st.o;

a) Code fragment.

#100 → mul.t; v3 → mul.o;
mul.r → add.t; #17 → add.o; mul.r → st.o;
#200 → st.t; add.r → v2;

b) Resulting schedule.

Figure 10.2: Optimistic scheduling.


7. Duplicated variables. When RF partitioning is used and performance deteriorates due to RF-port conflicts, it could be an option to store a variable in multiple RFs [BS91]. This increases the probability that the value of a variable can be read from an RF in a certain cycle and thus reduces the number of RF-port conflicts. Care must be taken to ensure that all duplicates of the variable remain consistent with each other.

8. FUs with RF run-outs. The idea of run-outs [Cor98] has the advantage that results may stay longer in FUs. Consequently, more values can be software bypassed. This may result in a reduction of the required number of RF-ports and registers.

9. Optimistically assuming software bypassing and dead-result move elimination. Even integrated assignment may superfluously spill variables. For example, consider the code fragment in Figure 10.2a. Assume the scheduler selects the operations for scheduling in the order mul, add, st. This sequence requires two registers, because the variables v1 and v2 are simultaneously live when the result of the addition is scheduled. When only one register is available, spill code is inserted. However, as shown in Figure 10.2b, the total schedule only requires a single register. The problem is that only after scheduling of the store does it become clear that only a single register is required. A solution to this problem is to optimistically assume that software bypassing will free a register and let the scheduler continue scheduling. When after a number of scheduling steps it becomes evident that no registers become available, backtracking must be used to generate correct code.

10. Alternative phase orderings. Besides early, late and integrated assignment strategies, one can also research alternative phase orderings.


For example, one could research a phase ordering in which the most frequently executed region is scheduled first, after which registers are assigned to the variables referenced in this region. Because some variables are removed from the code due to software bypassing and dead-result move elimination, more registers might be available for variables that are live across this region but are never referenced in it. The same method can be applied to the other regions as well. Other approaches are known from the literature that divide the register set into local and global registers [BEH91a, HA99]. Local registers are assigned to variables that are only referenced in scheduling scopes without loops (basic blocks or trees), while the global registers are assigned to variables that are live across the boundaries of the used scheduling scope. For the assignment of both register sets a different register assignment strategy can be used. In [Mes01] an interesting approach is presented in the context of constraint-based scheduling. Analysis and assignment steps are iteratively applied to prune the scheduling search space. Eventually, a schedule is generated, if feasible, that complies with the given constraints.
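
Item 1 of this list, live range splitting, can be pictured as inserting a store after one reference and a reload before a later one, so that only the middle part of the live range resides in memory. The toy C++ sketch below shows the transformation on a list of pseudo-instructions; all names are hypothetical and the hard part, choosing the split points, is exactly the open question stated above.

    #include <cstddef>
    #include <string>
    #include <vector>

    struct Instr { std::string text; };

    // Split the live range of 'var' between positions 'lastUseBefore' and
    // 'firstUseAfter': the value lives in a register up to the store, in memory
    // in between, and in a register again after the reload.
    std::vector<Instr> splitLiveRange(std::vector<Instr> code,
                                      const std::string& var,
                                      std::size_t lastUseBefore,
                                      std::size_t firstUseAfter) {
      // Insert the reload first so the earlier index stays valid.
      code.insert(code.begin() + static_cast<std::ptrdiff_t>(firstUseAfter),
                  {"reload " + var + " <- [spill slot]"});
      code.insert(code.begin() + static_cast<std::ptrdiff_t>(lastUseBefore) + 1,
                  {"store  [spill slot] <- " + var});
      return code;
    }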

General TTA Related Topics

There are many other topics related to the research presented in this dissertation. Some major areas for future research are listed below:

1. Clustered architectures. Advances in silicon technology allow the number of FUs and data paths to be increased in order to increase performance. However, the enormous number of connections to the move buses, and the length of these buses, may result in a large cycle time, and hence the execution time may increase significantly. This problem can be tackled with the use of a clustered architecture. Figure 10.3 shows a TTA processor with four clusters. Because the number of connections to a single bus and its length are reduced, the cycle time is reduced in comparison to a single cluster architecture. Communication between clusters is achieved by inter-cluster RFs. Although clustered architectures reduce the cycle time, it is hard to exploit the increased exploitable ILP. Therefore, new compiler technologies have to be developed to exploit ILP in the context of clustered TTAs [RCL01].

2. Enhancing exploitable ILP. Although the methods presented in this dissertation already exploit a reasonable amount of ILP, methods exist that may increase the exploited ILP even further. For example, if-conversion can be applied more aggressively. Other examples are source code transformations for loops. In addition, research should also be devoted to developing new ILP enhancing techniques.

3. Compilation for caches and local memories. Research should be directed towards compilation techniques for caches and local memories.


[Figure: a TTA processor with four clusters containing FU-1 to FU-16, local register files RF-1 to RF-4, and inter-cluster register files RF1-2, RF2-3 and RF3-4.]

Figure 10.3: Example of a clustered TTA.

Caches are a very suitable technique to increase performance. This performance increase can be improved even further when the compiler is made cache aware. In addition, many embedded applications benefit from multiple local memories. To make use of these memories new compiler techniques must be developed.

4. Code density. A drawback of VLIWs and TTAs is their code size. Empty slots in an instruction are filled with no-ops. In embedded systems, code density is an important issue. Therefore, techniques have to be developed and evaluated that solve this problem.

5. Reduced RF-FU connectivity. The current implementation of the TTA back-end requires that there is always a path between any FU-port and the RF(s). This constraint originates from the early assignment strategy, because the register allocator is unaware of the connectivity between RFs and FUs, and of the FU on which an operation will be scheduled. When using integrated assignment this restriction can be relaxed. When no direct path exists between the RF that holds the required value and the FU that requires it, extra copy operations can be inserted during scheduling; a sketch of this idea is given after this list.


6. Programming tools. The amount of exploitable ILP not only depends on the compiler, but also on the programmer. Tools should be developed that inform the programmer how the application should be improved at source code level.
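
Item 5 of this list, reduced RF-FU connectivity, amounts to a routing check at schedule time. The C++ sketch below illustrates the idea under strong simplifying assumptions (a boolean connectivity matrix, hypothetical type and function names, and the assumption that a register-to-register copy can always be inserted); it does not reflect the back-end's actual data structures.

    #include <cstddef>
    #include <vector>

    // Hypothetical connectivity matrix: connected[rf][fu] is true when a move bus
    // path exists between register file 'rf' and function unit 'fu'.
    struct Connectivity {
      std::vector<std::vector<bool>> connected;
      bool direct(std::size_t rf, std::size_t fu) const { return connected[rf][fu]; }
    };

    struct Transport {
      std::size_t srcRF = 0, dstFU = 0;
      bool viaCopy = false;           // set when an extra RF-to-RF copy is needed
      std::size_t copyRF = 0;
    };

    // If the RF holding the value cannot reach the FU directly, route the value
    // through the first RF that can: one extra register-to-register copy
    // (assuming, for the sketch, that such a copy is always possible).
    bool routeOperand(const Connectivity& c, Transport& t) {
      if (c.direct(t.srcRF, t.dstFU)) return true;     // normal case: no copy needed
      for (std::size_t rf = 0; rf < c.connected.size(); ++rf) {
        if (rf != t.srcRF && c.direct(rf, t.dstFU)) {
          t.viaCopy = true;
          t.copyRF = rf;
          return true;
        }
      }
      return false;                                    // no route at all
    }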


A Integrated Assignment Benchmark Results

In Sections 6.6.4, 7.6.3 and 8.4.2 the average performance gains of integrated assignment compared to dependence-conscious early register assignment are given. In this appendix the individual benchmark results of these experiments are listed:

• Table A.1: Integrated assignment versus dependence-conscious early assignment, using basic block scheduling and the TTAideal template.

• Table A.2: Integrated assignment versus dependence-conscious early assignment, using basic block scheduling and the TTArealistic template.

• Table A.3: Integrated assignment versus dependence-conscious early assignment, using region scheduling and the TTAideal template.

• Table A.4: Integrated assignment versus dependence-conscious early assignment, using region scheduling and the TTArealistic template.

• Table A.5: Integrated assignment versus dependence-conscious early assignment, using software pipelining and the TTAideal template.

• Table A.6: Integrated assignment versus dependence-conscious early assignment, using software pipelining and the TTArealistic template.


Table A.1: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using basic block scheduling and the TTAideal template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68             0.0   0.0  -0.1  -0.1  -0.1  -0.1  -0.1  -0.1   0.0   0.0   0.0  -0.2   0.1   0.8   1.9
bison           0.0   0.0   0.0  -0.1  -0.1   0.0   0.0  -0.1   0.1   0.3   0.6   0.5   0.7   1.6   1.9
cpp             0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.0   0.1   0.5   1.3   1.5   2.7   6.2
crypt           0.0   0.0   0.0   6.2  16.9  18.1   8.4  19.7  17.8  21.8  37.5  73.2  91.2  85.4  89.7
diff            0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.2   0.3   0.7   0.8   0.9   2.1   3.5   4.7
expand          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
flex            0.0   0.0  -0.1  -0.1  -0.1   0.0   0.1   0.3   0.9   1.1   1.8   2.1   3.3   4.9   6.8
gawk            0.4   0.4   0.4   0.3   0.3   0.3   0.3   0.3   0.3   0.3   0.0   0.2  -0.1  -0.3   0.6
gzip            0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.2   0.6
od              0.0   0.0  -0.1  -0.1  -0.2  -0.1   0.0  -0.2   1.0   1.5   3.0   4.1   3.9   3.4   5.3
sed             0.1   0.9  -2.2  -0.6   0.0   0.9   1.4   1.4   1.0   1.5   2.0   3.0   3.6   5.6   7.7
sort            0.1   0.1   0.1   0.6   1.0   1.0   2.1   2.6   6.3   7.2   9.2  16.4  25.5  35.9  40.5
uniq            0.1   0.1   0.1  -0.1  -0.1  -0.2  -0.3   0.3   0.4   0.4   0.0   3.0   5.0   3.9  12.7
virtex          0.1   0.1   0.1   0.1   0.1   0.0   0.0   0.4   0.5   0.5   0.9   1.5   2.0   3.7   6.5
wc              0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
099.go          0.0   0.0  -0.1  -0.1   0.0  -0.1   0.0  -0.5  -0.5  -0.5  -0.3   0.6   1.3   1.1   1.0
124.m88ksim     0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.1  -0.1   0.1   0.5   5.8
129.compress    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.3   2.6   4.0   4.4   4.0
132.ijpeg       0.0   0.1   0.3   6.4   8.2   9.0  13.0  14.6  21.2  28.7  35.9  52.1  71.4  83.8 105.5
147.vortex      0.0   0.0   0.0   0.0   0.0   0.0   0.0  -0.1  -0.4  -0.4  -0.2   0.6   0.7   0.7   1.2
instf           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.2   1.1   5.1   6.9   6.3   7.5  11.9  11.5
mulaw           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
radproc         0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.7   4.4   8.3   7.1   2.4   7.4  20.9  30.2
rfast           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.8   3.5  15.7  21.1  18.7  22.4  35.2  33.9
rtpse           0.1   0.1   0.1   0.1   0.1   0.1   0.1   0.6   3.9  10.1  21.7  18.5  19.2  28.0  28.1
djpeg           0.0   0.3   0.7  11.1  14.7  17.9  21.9  24.2  29.2  35.7  42.9  42.8  45.9  86.7 116.1
rawcaudio       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  -1.2  40.3
mpeg2decode     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.2   0.3   1.8
mpeg2encode     0.0   0.0   0.0  18.0  20.2  23.6  27.0  30.4  31.7  32.5  34.2  35.8  37.9  40.7  46.6
unepic          0.0   0.0   0.0   0.1   0.2   2.6   2.6   2.0   2.0   5.9   6.9  15.6  21.5  27.7  27.4
average         0.0   0.1   0.0   1.4   2.0   2.4   2.6   3.3   4.2   5.9   7.8  10.1  12.6  16.4  21.3


Table A.2: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using basic block scheduling and the TTArealistic template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68            -0.1  -0.1  -0.2  -0.3  -0.3  -0.3  -0.3  -0.3  -0.2  -0.2  -0.2  -0.3   0.1   0.8   1.9
bison          -0.1  -0.1  -0.1  -0.1  -0.2  -0.2  -0.2  -0.3  -0.1   0.2   0.4   0.2   0.5   1.4   1.7
cpp            -0.1  -0.1  -0.1  -0.1   0.0  -0.1   0.0  -0.1   0.0   0.1   0.6   1.3   1.3   2.8   6.0
crypt          -0.1   3.4   4.6   6.8  12.7   9.8   9.4   8.0  12.0  13.0   3.1  32.3  37.5  50.3  42.8
diff            0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.2   0.2   0.7   0.8   0.9   2.1   3.6   4.7
expand          0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
flex            0.0   0.0  -0.1  -0.2  -0.3  -0.3  -0.3  -0.1   0.4   0.6   1.4   1.5   2.5   4.1   6.0
gawk            0.5   0.4   0.5   0.8   0.6   0.6   0.5   0.4   0.3   0.3   0.0   0.3  -0.2  -0.7   0.2
gzip            0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.1   0.1   0.4
od              0.0   0.0  -0.2  -0.3  -0.5  -0.4  -0.4  -0.8   0.2   0.7   2.3   3.0   3.2   2.8   4.7
sed            -0.1  -0.2  -3.4  -2.9  -2.2  -1.3  -0.9  -1.0  -1.2  -0.7  -0.4   0.3   0.6   2.5   4.0
sort            0.0   0.1   0.3   0.7   1.5   1.5   2.3   2.8   6.5   6.1   8.1  14.6  21.4  29.4  31.8
uniq            0.1   0.1   0.1  -0.3   0.1  -0.1   0.0   0.4   0.5   0.5   0.0   3.0   4.3   3.2  11.7
virtex          0.0   0.0   0.0   0.1   0.2   0.1   0.0   0.0   0.1   0.2   0.5   1.0   1.7   3.4   5.9
wc              0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
099.go          0.0  -0.1  -0.3  -0.7  -0.8  -0.9  -0.9  -1.4  -1.4  -1.5  -1.7  -0.9  -0.6  -0.1   0.7
124.m88ksim     0.3   0.3   0.3   0.3   0.3   0.3   0.1   0.2   0.1   0.2   0.0  -0.1  -0.4   0.3   5.0
129.compress    0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.3   2.5   4.2   2.6   2.3
132.ijpeg      -0.3   0.5  -0.2   4.0   4.1   3.9   5.2   6.6   9.9  14.4  15.1  28.5  41.5  40.9  55.9
147.vortex      0.0   0.0  -0.1  -0.1  -0.1  -0.1  -0.2  -0.3  -0.5  -0.6  -0.4   0.4   0.5   0.6   1.1
instf           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.2   1.1   4.6   5.8   7.0   9.5  10.0   8.5
mulaw           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
radproc         0.0   0.0   0.0  -0.1  -0.1  -0.2  -0.2   0.4   3.4   5.2   5.3   4.7   4.7  22.2  21.2
rfast           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.7   3.1  13.0  15.9  19.1  25.5  26.8  21.5
rtpse           0.1   0.1   0.1   0.0   0.1   0.0   0.0   0.5   3.4   8.1  17.5  19.8  21.2  22.2  20.2
djpeg          -0.1  -0.2  -0.2   5.9   9.8  10.0  13.2  15.8  18.4  20.8  22.1  19.7  22.6  52.0  55.9
rawcaudio       0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  -1.2  40.3
mpeg2decode     0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.1   0.0   0.2   1.8
mpeg2encode     0.0  -0.3   0.1   8.2  10.0  12.6  14.6  17.5  18.5  19.3  20.0  21.5  23.2  25.9  28.2
unepic         -0.3  -0.3  -0.3  -0.2  -0.1   0.6   0.6  -2.3  -4.0  -1.5  -2.7   3.7   8.8  13.6  12.0
average         0.0   0.1   0.0   0.7   1.2   1.2   1.4   1.6   2.4   3.4   3.8   6.1   7.9  10.7  13.2


Table A.3: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using region scheduling and the TTAideal template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68             1.5   1.5   1.5   1.5   1.4   1.5   1.9   2.0   1.9   1.6   2.2   2.9   2.9   2.5   6.9
bison           2.0   2.2   2.0   1.8   2.0   2.5   2.7   3.3   2.4   4.2   5.1   5.6   6.1   8.7   8.5
cpp             2.4   2.3   2.2   1.6   1.5   1.9   2.0   1.8   1.5   1.5  -1.8   0.7   2.3   4.4  14.1
crypt           0.2   1.3   1.9   4.7  10.0  13.9  21.2  31.1  29.3  14.8  32.3  54.8  56.5  89.9  94.7
diff            0.5   0.5   1.2  13.0  13.0  12.8  11.6  12.5  13.9  13.8  13.6  15.3  16.4  16.4  19.5
expand         -2.7  -1.6   0.5  -0.3  -0.5   0.1   1.0  -1.2   0.7  -9.7  -4.3  -4.5  -3.6  -3.3  -9.1
flex            1.4   1.4   1.5   1.4   1.4   0.4   0.7   2.5   2.3   6.0   3.6   6.6   8.0  11.1  14.5
gawk            1.9   1.9   1.2   0.6   0.7   1.0   1.1   0.9   0.8   2.1   2.5   2.5   1.6   0.6   2.1
gzip            3.3   3.4   3.6   2.2   2.9   2.9   3.5   6.0   5.9   5.9   5.9   6.9   6.0   6.7  27.9
od              1.0   1.5   1.4   2.6  -0.1  -1.1   2.0   3.7   5.2  14.3   9.9  16.3  30.3  32.1  30.9
sed             1.6   4.5   5.4  10.1  11.9  13.5  13.5  12.4  13.9  14.3  14.9  18.1  15.6  18.7  21.3
sort            1.8   0.6   1.4   1.0  -0.7   1.2   5.9  12.5  15.3  23.8  28.7  59.7  55.7  69.1  81.4
uniq            0.3   0.0   1.1   3.1   2.3   3.1   9.4  10.9  10.0   9.1  12.3  10.6  14.4  17.3  23.3
virtex          2.5   2.5   2.3   2.7   3.3   3.3   3.5   4.6   6.6   7.1   8.6  10.7  11.8  14.1  15.6
wc              0.0   0.0   0.0   9.2   9.2   9.2   9.2   9.2   9.2   8.7   8.2   7.9   7.4 -11.9  -6.5
099.go          2.0   2.0   2.2   2.4   2.5   2.8   3.7   5.7   5.7   6.1   7.6   8.6   9.0  10.3   8.4
124.m88ksim     1.6   1.7   1.8   3.0   2.9   2.9   2.6   1.8   2.0   2.6   2.6   2.5   2.8   4.6  14.8
129.compress    0.0   0.0   0.0   4.4   4.4   4.4   6.1   6.1   6.1   6.1   8.8  28.9  29.3  29.9  31.4
132.ijpeg       6.8  13.7  18.7  34.0  44.5  49.8  51.1  65.1  72.9  78.7  84.1 104.9  96.1 131.5 140.0
147.vortex      0.4   0.4   0.2   0.7   0.6   1.0   0.8   1.5   1.3   1.6   2.8   4.1   4.7   4.9   6.0
instf           0.9   1.8   2.1   5.5   7.9  10.8   9.9  14.6  20.7  22.8  25.7  35.7  65.3  73.4  79.5
mulaw           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
radproc         0.3   1.4   2.1   1.6   2.0   2.0   5.9   7.7  17.4  19.4  17.9  25.2  44.6  39.0  57.8
rfast           2.5   3.1   3.9   5.4   3.7   9.1  15.5  24.8  24.2  27.5  36.0  38.0  63.8  80.2 102.6
rtpse           1.1   2.9   3.1   3.2   3.0   2.7   2.7   3.3   5.0  15.8  18.4  31.3  38.9  43.9  66.0
djpeg           6.7  19.2  28.3  53.7  61.8  67.7  64.3  69.7  80.0  98.6  90.7  92.0 113.8 144.6 128.0
rawcaudio       0.5   0.5   1.0   1.0   2.6   2.6   2.6   2.6   6.0   6.0   5.5   1.1  80.2  44.3  57.5
mpeg2decode     1.1   1.2   1.2   1.4   1.6   1.8   1.5   1.8   1.8   2.3   2.1   2.3   5.6   7.1  42.2
mpeg2encode     8.3  49.8  83.9  77.8  86.4  84.8  89.6  86.6  82.4  70.7  71.7  78.3  83.3  84.5  88.8
unepic          0.2   0.4   1.1   0.1   1.3   9.8  11.4  16.6  12.9  13.7  15.8  23.1  31.1  42.8  46.9
average         1.7   4.0   5.9   8.3   9.4  10.6  11.9  14.0  15.2  16.3  17.7  23.0  30.0  33.9  40.5


Table A.4: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using region scheduling and the TTArealistic template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68             0.7   0.7   0.7   1.1   1.2   1.1   1.7   1.6   1.8   1.4   0.9   1.3   2.2   2.3   3.3
bison           1.5   1.8   1.5   1.5   1.7   1.7   2.4   2.8   3.1   4.0   4.0   4.4   5.5   8.1   4.0
cpp             1.0   1.2   1.2   0.9   1.1   0.5   0.2   0.3  -0.5   0.6   2.2   2.5   3.0   5.7  11.7
crypt           0.0   0.8   1.6   5.7   8.2   9.6   4.5   6.5  -2.8  12.5  17.8   5.1   8.3  29.9  29.1
diff            0.4   0.4   1.1  11.1  10.1   9.9   9.9  11.5  12.8  12.6  12.3  12.8  13.3  13.3  18.1
expand          0.7   2.3   2.8   0.9   3.4   4.5   2.9   2.7   3.2   1.9  -9.1  -6.7  -8.8  -8.0 -11.9
flex            1.6   1.6   1.8   0.3  -0.3   0.8   0.4   2.4   2.1   4.6   2.1   5.3   6.6  10.1  12.1
gawk            1.9   1.9   1.8   1.9   1.9   1.9   1.9   1.8   1.4   2.9   3.4   3.4   2.1   0.9   2.2
gzip            4.0   4.1   4.5  -0.7   1.5   3.4   3.3   6.0   5.8   5.4   5.2   6.5   6.8   3.8  25.2
od              0.7   1.2   0.8   1.2  -0.8  -1.5   1.7   2.6   2.4  10.1   6.2   9.5  21.0  24.0  23.3
sed            -0.5   2.0   0.6   4.2   5.1   6.2   6.9   5.9   6.4   5.8   6.4   8.3   6.6   6.7  10.6
sort            0.4   2.6   3.0   2.7   2.9   1.9   1.9   7.2   7.3  15.1  19.6  45.2  42.2  51.8  57.7
uniq           -0.8   1.6   0.1   0.7   0.7   1.9   7.3   8.7   8.0   7.8  10.4  11.4  11.1  12.1  21.5
virtex          1.6   1.1   1.3   1.6   1.9   1.8   1.5   2.2   4.1   4.1   4.5   6.6   6.1   9.8  11.0
wc              0.0   0.0   0.0  11.9  11.9  11.9  11.9  11.9  11.9  11.9   9.3  11.1  12.8   5.4  10.3
099.go          1.3   1.2   1.3   1.2   1.2   1.1   1.5   3.0   3.3   2.1   2.2   3.3   5.0   6.5   4.4
124.m88ksim     0.5   0.5   0.4   1.0   0.9   1.3  -0.3  -0.2   1.3   1.7   2.0   2.0   2.1   4.1  13.6
129.compress    0.1   0.1   0.0   4.2   4.3   4.2   5.9   5.9   5.9   6.6   8.8  27.2  28.0  29.0  30.8
132.ijpeg       0.8   0.5   2.5  14.5  18.1  17.8  18.7  21.5  24.3  25.0  28.0  30.7  30.1  36.3  35.5
147.vortex      0.4   0.6   0.8   0.5   0.4   0.8   0.8   0.8   1.1   1.4   1.9   2.9   3.8   4.0  -1.2
instf           0.0   0.1  -2.5  -3.2  -1.1  -1.7  -1.7  -0.6   2.6   4.8  10.6  25.1  43.2  40.2  45.9
mulaw           0.0   0.0   0.0   2.4   2.4   2.9   3.9   3.9   3.9   3.9   3.9   3.9   3.9   3.9   2.9
radproc        -0.1  -0.6  -0.8  -0.5  -0.4   0.1   2.3   4.6  11.5   9.0  10.7  22.9  30.4  30.1  38.6
rfast          -0.1  -0.3  -3.1  -4.2  -1.9  -2.4  -0.2   0.6   1.5   3.2  13.5  29.6  48.9  40.1  57.1
rtpse           0.2   0.0  -0.1   0.2   0.1  -0.1   0.1   2.3   5.5  10.8  12.1  27.3  26.2  29.2  41.4
djpeg           0.3  -1.7   4.6  17.8  20.9  27.8  26.0  30.1  39.0  43.9  41.7  43.0  62.6  78.3  54.0
rawcaudio       0.7   0.7   0.7   0.8   4.7   4.7   4.8   4.8   6.0   5.9   3.7  -2.6  68.4  37.0  56.7
mpeg2decode     0.1   0.1   0.1  -1.3  -1.2  -1.1  -1.3  -1.0  -1.0  -0.6  -0.7   0.9   5.2   4.7  32.2
mpeg2encode     0.5   9.1  43.3  51.3  51.2  53.9  49.9  51.1  48.9  46.4  46.9  45.7  49.6  48.7  54.8
unepic         -0.3  -0.3   0.2  -1.6  -1.0   4.9   4.4   7.4   4.6   4.8   5.6  10.7  10.7  16.9  17.9
average         0.6   1.1   2.3   4.3   5.0   5.7   5.8   6.9   7.5   9.0   9.5  13.3  18.6  19.5  23.8


Table A.5: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using software pipelining and the TTAideal template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68             1.4   1.5   1.5   1.5   1.4   1.6   1.9   2.0   2.4   2.2   2.5   2.9   2.9   2.7   7.0
bison           1.8   2.0   1.8   1.9   2.1   2.6   2.6   3.2   2.5   4.0   5.8   5.6   5.9   6.6   7.0
cpp             2.4   2.3   2.2   1.6   1.5   1.9   1.9   1.7   1.4   1.5  -2.0   0.6   2.3   4.3  13.6
crypt           0.2   0.2   0.2   8.8  23.6  27.7  13.8  25.4  40.9  39.3  47.1  88.1 142.7 121.1 119.1
diff            0.4   0.4   1.2  12.9  12.9  12.7  11.5  12.8  16.8  19.5  19.5  19.0  23.1  28.5  39.7
expand         -2.7  -1.6   0.5  -0.3  -0.5   0.1   1.0  -1.2   0.7  -9.7  -4.3  -4.5  -3.6  -3.3  -9.1
flex            1.4   1.5   1.6   1.1   0.9   0.5   0.7   0.7   1.1   3.9   3.0   5.0   7.3   8.5  12.0
gawk            1.9   1.9   1.2   0.6   0.7   1.0   1.1   0.9   0.8   2.1   2.5   2.5   1.6   0.6   2.1
gzip            6.3   6.4   6.6   2.4   2.8   3.3   3.7   6.5   6.3   6.5  12.0  12.5  11.8  21.4  41.6
od              0.6   1.2   0.9   2.2   0.4  -0.4   1.6   3.8   4.3   7.3   6.5   9.2  12.4  10.2  21.2
sed             1.6   4.3   5.4  10.0  11.8  13.1  13.1  12.0  13.4  14.4  15.2  17.3  14.8  18.2  20.6
sort            1.9   0.7   1.3   0.5  -1.6   0.7   7.7  10.3  14.7  24.6  30.7  55.0  57.4  77.4  92.5
uniq            0.3  -0.3   1.1   2.4   2.4   2.4   8.9   9.6   9.4   8.9  11.7  11.3  13.5  16.7  21.6
virtex          2.6   2.6   2.4   2.7   3.4   3.8   4.1   5.0   6.6   6.8   9.1  10.3  13.6  15.0  15.8
wc              0.0   0.0   0.0   9.2   9.2   9.2   9.2   9.2   9.2   8.7   8.2   7.9   7.4 -11.9  -6.5
099.go          2.0   2.0   2.2   2.3   2.5   2.7   3.7   5.6   5.7   5.8   7.6   7.6   9.0   9.8   7.2
124.m88ksim     4.5   4.6   4.7   4.9   4.6   4.5   3.4   3.9   3.6   4.1   3.7   3.2   4.3  -0.2  -1.7
129.compress    0.0   0.0   0.0   4.4   4.4   4.4   6.1   6.1   6.1   6.2   8.8  28.6  29.1  29.7  30.9
132.ijpeg       2.3   2.5   8.4  74.6  76.7  84.6  86.0  99.2 130.2 154.4 165.6 201.5 240.0 205.5 194.7
147.vortex      0.2   0.3   0.3   0.2   0.0   0.1   0.0   0.4   0.8   1.2   2.3   2.5   3.0   4.9   3.7
instf           0.0   0.0   0.0  66.1 100.0 100.9 110.8 115.5 124.0 123.6 121.7 124.7 155.6 170.3 187.0
mulaw           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0  12.4  12.4  49.7
radproc         0.4   0.6   0.6   9.1  13.0  18.7  20.1  30.6  39.2  45.9  50.9  53.2  48.7  54.4  53.5
rfast           0.0   0.0   0.0 159.3 240.1 242.7 266.2 274.0 298.3 291.3 281.3 282.2 342.3 379.5 400.0
rtpse           0.2   0.2   0.2   3.1   8.6   9.0  14.9  32.5  37.2  33.7  32.2  43.5  41.3  60.7  71.6
djpeg           1.4   1.9   8.2 132.1 167.5 195.5 197.4 230.8 244.9 248.7 261.3 277.9 268.0 173.0 172.8
rawcaudio       0.5   0.5   1.0   1.0   2.6   2.6   2.6   2.6   6.0   6.0   5.5   1.1  80.2  44.3  57.5
mpeg2decode     1.2   1.2   1.2   1.5   1.8   1.9   3.3   3.5   4.3   6.2   9.1   9.9  12.8  38.2  94.7
mpeg2encode     0.0  37.2 102.4  97.5 106.6 100.6  93.8 104.1 100.5 105.4 115.0 135.6 137.9 121.0 136.4
unepic          0.3   0.5   2.1  13.3  16.1  23.6  23.9  34.2  38.0  38.9  40.0  60.4  80.8 105.7 105.8
average         1.1   2.5   5.3  20.9  27.2  29.1  30.5  34.8  39.0  40.4  42.4  49.2  59.3  57.5  65.4


Table A.6: Speedup in percentages of integrated assignment compared to dependence-conscious early assignment, using software pipelining and the TTArealistic template.

                                     Number of registers
Benchmark       512    64    48    32    30    28    26    24    22    20    18    16    14    12    10
a68             0.6   0.6   0.6   1.1   1.2   1.1   1.7   1.6   1.8   1.4   0.9   1.3   2.2   2.5   3.3
bison           1.5   1.8   1.3   1.2   1.6   1.5   2.2   2.7   2.7   3.8   4.5   4.3   5.5   5.7   1.9
cpp             1.0   1.2   1.2   0.9   1.0   0.4   0.2   0.3  -0.4   0.6   2.1   2.4   3.0   5.4  11.2
crypt           0.0   4.4   4.0   8.3  17.0  15.9 -13.4 -10.9   2.4  -4.5  -2.3   6.2  22.2  26.9  23.1
diff            0.4   0.4   1.1  10.8   9.8   9.6   9.6  10.9  13.5  15.6  15.2  13.9  15.8  19.0  30.1
expand          0.7   2.3   2.8   0.9   3.4   4.5   2.9   2.7   3.2   1.9  -9.1  -6.7  -8.8  -8.0 -11.9
flex            1.0   1.1   1.3   0.6   0.0   0.2  -0.2   0.0   0.5   3.1   1.7   3.8   5.5   6.2   9.9
gawk            1.9   1.9   1.8   1.9   1.9   1.9   1.9   1.8   1.4   2.9   3.4   3.4   2.1   0.9   2.2
gzip            4.6   4.6   4.8   1.9   4.4   4.2   3.4   4.9   5.5   5.3   9.8  10.2  11.6  18.8  28.1
od              0.5   0.7   1.0   0.6   0.6   0.0   1.5   3.1   3.9   7.5   5.0   8.1   8.5   7.0  16.5
sed            -0.5   0.7   0.0   3.8   5.0   5.5   6.3   5.2   6.3   5.6   6.5   8.1   6.4   7.0  11.0
sort            0.4   1.5   1.5   2.5   2.1   1.3   3.3   4.1   7.1  15.6  21.8  40.8  40.1  56.1  52.5
uniq           -0.8  -1.0  -1.3  -0.4  -0.1   1.3   6.6   7.6   6.6   7.5   9.7  11.9  11.3  13.1  21.6
virtex          1.7   1.2   1.6   1.6   1.7   2.1   1.7   2.5   3.6   3.5   4.4   6.3   7.3  10.7  10.2
wc              0.0   0.0   0.0  11.9  11.9  11.9  11.9  11.9  11.9  11.9   9.3  11.1  12.8   5.4  10.3
099.go          1.3   1.2   1.3   1.2   1.2   1.1   1.5   3.0   3.3   2.1   2.2   3.3   5.0   6.5   4.4
124.m88ksim     2.9   3.0   2.8   3.2   3.1   2.9   2.1   2.2   3.0   4.0   3.7   2.8   3.3  -0.6  -5.7
129.compress    0.0   0.0   0.0   4.3   4.3   4.3   5.9   5.9   6.2   6.9   9.0  27.5  28.2  27.5  31.4
132.ijpeg      -0.2   1.1   1.8   6.4   4.6  11.2  31.3  51.9  59.0  53.2  53.0  65.2  65.7  56.2  73.2
147.vortex      0.3   0.5   0.4   0.2  -0.1   0.0   0.3   0.1   0.6   0.9   1.8   1.8   1.8   4.6  -3.0
instf          -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -2.9  -2.2  -1.3  -2.2  -1.4   2.1   6.3  34.8  41.8
mulaw           0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0   0.0
radproc        -2.3  -2.7  -2.4  -2.6  -2.4  -2.3  -1.8   0.5   5.2   5.6   5.0   7.3   9.9  25.3  31.1
rfast          -5.4  -5.4  -5.4  -5.4  -5.4  -5.4  -5.4  -4.7  -2.3  -3.9  -2.2   3.6  11.9  62.3  73.4
rtpse          -0.6  -0.6  -0.6  -0.5  -0.7  -0.7  -0.9   0.4   3.2   7.5  10.5  27.5  23.9  29.8  29.7
djpeg           0.3   0.4   4.6  10.4   5.4  20.1  27.7  36.0  58.7  81.0  69.9  79.6  82.2  66.1  69.4
rawcaudio       0.7   0.7   0.7   0.8   4.7   4.7   4.8   4.8   6.0   5.9   3.7  -2.6  68.4  37.0  56.7
mpeg2decode     0.1   0.1   0.2   0.1   0.1   0.2   0.1   0.4   0.8   2.2   3.6   5.8   8.6  19.1  66.7
mpeg2encode    -1.1   7.5  42.4  65.3  64.1  62.9  60.3  57.1  61.2  67.3  62.4  72.7  71.4  63.8  71.4
unepic         -0.4  -0.4   0.3  -1.2  -0.8  11.3  10.3   7.2   6.3   6.3  11.8  19.8  27.5  49.0  45.3
average         0.2   0.8   2.2   4.2   4.6   5.6   5.8   7.0   9.3  10.6  10.5  14.7  18.7  21.9  26.9


B Partitioned Register File Benchmark Results

In this appendix the individual benchmark results of the partitioned register file experiments are listed. The numbers indicate the performance loss in percentages relative to a monolithic register file with four read and four write ports. Two monolithic register files are used in the experiments: one with 512 registers and one with 32 registers. In Section B.1 the results of early assignment are listed. Section B.2 lists the results of integrated assignment.
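As a reading aid, the standard definition of a relative performance loss (stated here for convenience, not quoted from this appendix) is

    loss (%) = (T_partitioned / T_monolithic - 1) * 100

where T is the cycle count of a benchmark on the given register file configuration. Under this reading, a negative entry means that the partitioned register file was in fact marginally faster than the monolithic baseline.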

B.1 Early Assignment

This section shows the impact of register file partitioning on the individual benchmark results when using dependence-conscious early assignment. The results listed in Table B.1 clearly show that the chosen distribution has a large impact on the performance. More sophisticated partitioning methods, as shown in Table B.2, did not result in a further improvement. It should be noted, however, that the overall impact of register file partitioning is small.


Table B.1: Performance loss in percentages of the vertical and horizontal distribution when the register file of the TTArealistic template is partitioned.

Benchmark  Vertical distribution (2x256 4x128 2x16 4x8) | Horizontal distribution (2x256 4x128 2x16 4x8)
a68 0.7 20.9 0.4 10.6 | 0.3 1.4 0.2 1.6
bison 3.3 28.0 0.9 10.2 | 1.0 2.7 0.6 2.2
cpp 1.2 11.0 1.2 6.3 | 0.4 0.4 0.5 0.7
crypt 26.0 63.9 15.1 26.6 | 0.3 8.2 1.6 12.2
diff 4.1 44.6 2.2 16.4 | 1.2 7.4 0.2 6.8
expand 8.6 41.4 4.8 13.5 | 3.9 10.5 3.8 6.4
flex 3.8 24.5 2.6 11.6 | 0.3 3.6 0.6 4.7
gawk 0.4 4.2 0.0 3.2 | 0.3 0.1 0.2 0.9
gzip -3.8 20.4 2.1 9.0 | 0.2 1.0 1.0 3.5
od 14.6 67.2 10.1 22.0 | 3.9 7.9 1.2 7.0
sed 4.0 26.0 2.4 13.5 | 0.9 5.1 0.5 5.6
sort 8.2 51.9 3.8 20.9 | 0.3 2.4 0.7 5.7
uniq 6.6 45.6 5.6 22.8 | 0.9 5.8 0.3 6.8
virtex 1.9 14.7 0.8 6.6 | 0.1 2.0 -0.3 1.7
wc 0.0 13.0 0.1 -1.0 | 0.0 0.3 0.1 -0.6
099.go 2.1 18.0 1.7 7.7 | 0.2 2.1 0.1 1.4
124.m88ksim 2.1 18.9 2.0 9.6 | 0.5 2.6 0.4 1.5
129.compress 1.6 12.7 0.3 5.5 | 0.3 3.0 0.0 1.6
132.ijpeg 17.8 84.1 5.1 18.8 | 1.8 8.2 2.0 8.2
147.vortex 1.4 33.2 0.6 16.8 | -0.2 0.9 0.1 1.4
instf 1.8 24.7 2.7 5.2 | 0.0 -0.5 2.0 3.6
mulaw 0.5 2.4 0.5 2.4 | 0.0 0.0 0.0 0.0
radproc 2.3 28.4 1.0 7.9 | 0.2 1.1 0.0 2.5
rfast 2.1 34.3 0.0 -0.6 | 0.6 1.8 0.0 0.2
rtpse 2.4 29.5 0.8 1.2 | 0.1 1.1 0.4 2.2
djpeg 18.8 92.9 5.8 22.9 | 1.9 9.6 1.5 8.5
rawcaudio 3.9 42.7 0.6 7.4 | 0.4 6.5 0.3 4.2
mpeg2decode 1.8 13.7 3.7 2.8 | -1.2 1.9 2.7 1.2
mpeg2encode 35.0 119.4 4.7 17.7 | 3.7 14.1 2.7 10.7
unepic 7.1 35.5 1.9 7.6 | 0.6 2.9 1.2 3.6
average 6.0 35.6 2.8 10.8 | 0.8 3.8 0.8 3.9


Table B.2: Performance loss in percentages of the data independence heuristic and the intra-operation heuristic when the register file of the TTArealistic template is partitioned.

Benchmark  Data independence heuristic (2x256 4x128 2x16 4x8) | Intra-operation heuristic (2x256 4x128 2x16 4x8)
a68 0.3 1.7 0.4 1.2 | 0.3 1.2 0.3 1.4
bison 1.4 3.1 0.8 3.8 | 0.3 2.4 0.9 2.9
cpp 0.1 2.7 0.2 3.4 | 1.6 2.1 1.5 0.5
crypt 12.8 28.8 4.1 13.2 | 0.6 6.1 1.7 11.5
diff 1.6 10.2 -1.9 6.7 | 0.3 4.4 0.3 6.7
expand 2.6 13.7 5.4 14.1 | 4.7 12.3 3.3 6.0
flex 0.3 3.2 0.8 4.5 | 0.4 2.9 0.6 3.8
gawk 0.2 0.8 0.3 0.9 | 0.0 -0.1 -0.2 0.5
gzip -3.0 0.9 1.2 6.9 | 0.2 1.5 0.1 2.1
od 6.6 6.9 3.1 10.1 | 3.5 7.8 0.7 5.2
sed 2.8 6.3 2.4 6.2 | 0.8 4.2 0.6 5.7
sort 2.8 11.2 2.5 9.2 | 2.3 2.6 0.9 6.0
uniq 2.6 15.5 2.0 13.3 | 0.4 6.9 -0.1 9.0
virtex 1.1 3.4 1.1 2.8 | 0.5 2.0 0.2 2.3
wc 0.0 3.6 -8.5 -7.1 | 0.0 0.3 0.1 -0.6
099.go 0.8 1.9 2.6 3.2 | 0.2 2.1 0.1 1.4
124.m88ksim 0.4 2.8 -0.5 1.0 | 0.5 2.1 0.2 2.2
129.compress 0.3 0.8 -2.1 0.1 | 0.3 3.0 0.3 1.6
132.ijpeg 5.1 13.0 4.1 10.8 | 2.1 7.9 2.3 8.0
147.vortex 0.3 0.6 0.0 1.0 | -0.1 0.7 -0.2 1.1
instf 0.1 1.2 1.1 0.8 | -0.6 -1.6 -0.5 4.8
mulaw -2.8 -3.3 -2.8 -3.3 | 0.0 0.0 0.0 0.0
radproc 0.0 2.1 1.0 2.6 | 0.5 0.5 0.1 2.5
rfast 0.4 2.0 -1.4 0.6 | 0.5 1.6 0.0 2.3
rtpse -0.1 1.5 -0.3 1.6 | 0.0 1.6 0.4 2.6
djpeg 2.8 12.5 2.1 12.0 | 1.9 7.2 2.7 9.3
rawcaudio 0.5 1.2 0.8 9.3 | 0.4 4.1 -0.1 3.7
mpeg2decode 1.5 1.5 3.1 2.4 | 0.1 1.3 1.4 3.6
mpeg2encode 6.6 18.4 4.3 11.4 | 8.0 11.7 3.8 11.1
unepic 1.6 3.0 7.5 6.2 | 1.1 2.7 0.8 3.5
average 1.7 5.7 1.1 5.0 | 1.0 3.4 0.7 4.0

B.2 Integrated Assignment

This section shows the impact of register file partitioning on the individual benchmark results when using integrated assignment. The results for the local heuristics (first-fit, best-fit and round robin) are shown in Table B.3. The results for the global heuristic are shown in Table B.4. The tables show that the round robin strategy performs best.
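To make the three local heuristics concrete, the sketch below shows one plausible way such register-file selectors can be implemented. It is an illustration only: the class and function names are invented here, and these are not necessarily the exact selection rules used in the experiments. First-fit takes the first register file that still has a free register, best-fit is interpreted here as the candidate file with the fewest free registers left (keeping emptier files available for later variables), and round robin cycles over the files to spread variables evenly.

    # Illustrative register-file selection heuristics (hypothetical code,
    # not the compiler's implementation): first-fit, best-fit, round robin.
    from itertools import cycle

    class RegisterFile:
        def __init__(self, name, size):
            self.name = name      # e.g. "RF0"
            self.size = size      # number of registers in this file
            self.used = 0         # registers already claimed

        def has_free_register(self):
            return self.used < self.size

        def allocate(self):
            # Claim the next free register and return (file name, register index).
            assert self.has_free_register()
            self.used += 1
            return self.name, self.used - 1

    def first_fit(files):
        # First register file that still has a free register.
        return next(f for f in files if f.has_free_register())

    def best_fit(files):
        # Among the files with a free register, take the one that is already
        # fullest, so that emptier files remain available for later variables.
        candidates = [f for f in files if f.has_free_register()]
        return max(candidates, key=lambda f: f.used)

    def make_round_robin(files):
        # Cycle over the register files, skipping full ones, to spread variables.
        order = cycle(files)
        def select():
            if not any(f.has_free_register() for f in files):
                return None
            for f in order:
                if f.has_free_register():
                    return f
        return select

    if __name__ == "__main__":
        files = [RegisterFile("RF%d" % i, size=8) for i in range(4)]  # a 4x8 partitioning
        select = make_round_robin(files)
        for v in range(6):
            print("variable", v, "->", select().allocate())

With a 4x8 partitioning the round-robin selector hands out RF0, RF1, RF2, RF3, RF0, RF1, ..., which matches the intuition that it balances the register pressure over the files, consistent with the observation above that round robin performs best.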


Table B.3: Performance loss in percentages of the local heuristics: first-fit, best-fit and round robin, when the register file of the TTArealistic template is partitioned.

Benchmark  first-fit heuristic (2x256 4x128 2x16 4x8) | best-fit heuristic (2x256 4x128 2x16 4x8) | round robin heuristic (2x256 4x128 2x16 4x8)
a68 3.4 11.2 3.6 6.6 | 3.3 11.2 3.5 6.6 | 0.0 0.5 -0.2 0.6
bison 5.4 26.7 2.3 9.7 | 5.4 26.7 2.3 9.7 | 1.5 2.4 1.1 1.1
cpp 1.5 10.6 0.8 5.1 | 1.4 10.6 0.7 5.1 | 0.1 1.7 -0.7 -0.7
crypt 1.6 25.2 1.2 3.2 | 1.4 25.2 0.8 3.2 | 0.1 1.9 0.0 1.9
diff 4.6 30.0 2.7 21.5 | 3.8 30.0 2.5 21.5 | 1.1 3.2 0.3 2.7
expand 10.5 28.9 8.3 17.2 | 10.5 28.9 8.9 17.2 | 0.6 6.6 0.3 4.1
flex 4.0 17.6 2.2 8.6 | 2.7 17.6 1.5 8.6 | 1.0 2.7 1.5 2.3
gawk 1.2 4.1 1.0 4.3 | 1.2 4.1 1.0 4.3 | -0.8 0.6 -0.8 1.0
gzip 0.5 18.2 -0.8 4.0 | 0.8 18.2 0.4 4.0 | -0.3 -0.1 -1.5 -0.6
od 12.9 55.7 10.5 26.3 | 8.9 55.7 6.7 26.3 | 1.7 2.7 -1.0 2.3
sed 6.1 24.0 2.3 10.2 | 4.6 24.0 1.9 10.2 | 0.9 3.2 -0.1 3.1
sort 7.9 25.0 5.6 14.2 | 5.8 25.0 4.2 14.2 | 1.2 4.9 1.0 6.7
uniq 5.4 24.0 4.5 12.7 | 5.3 24.0 3.8 12.7 | -1.1 4.5 -0.6 2.7
virtex 2.2 14.0 0.9 4.8 | 2.2 14.0 0.9 4.8 | 0.4 1.8 0.1 2.7
wc 0.0 27.4 0.0 18.0 | 0.0 27.4 0.0 18.0 | 0.0 0.8 0.0 -0.2
099.go 1.4 11.8 4.2 8.9 | 1.4 11.8 3.4 8.9 | 0.2 2.3 1.2 5.0
124.m88ksim 1.3 12.9 0.5 9.7 | 1.3 12.9 0.5 9.7 | 0.2 4.2 0.6 1.2
129.compress 1.3 5.3 1.6 5.8 | 0.3 5.3 0.3 5.8 | 0.0 2.0 0.6 1.9
132.ijpeg 13.3 53.3 2.5 15.3 | 9.2 53.3 1.9 15.3 | 1.3 6.3 -0.3 3.8
147.vortex 7.4 19.0 6.9 8.6 | 7.3 19.0 6.9 8.6 | 0.2 1.1 -0.1 0.7
instf 1.8 12.8 1.1 11.6 | 0.3 12.8 0.2 11.6 | 0.0 0.2 0.1 -1.0
mulaw 0.0 -1.4 0.0 0.0 | 0.0 -1.4 0.0 0.0 | 0.0 0.0 -1.4 -1.4
radproc 1.1 9.2 1.0 5.8 | 0.7 9.2 0.5 5.8 | 0.1 2.7 -0.6 1.9
rfast 2.2 12.3 0.5 7.9 | 0.8 12.3 0.6 7.9 | -0.1 1.8 0.0 0.4
rtpse 2.4 14.7 1.2 8.4 | 0.9 14.7 0.4 8.4 | 0.1 1.8 0.1 1.4
djpeg 13.3 52.1 3.3 13.1 | 10.1 52.1 2.3 13.1 | 2.5 6.4 -0.2 6.0
rawcaudio 4.0 39.0 0.6 10.7 | 4.0 39.0 0.6 10.7 | 1.0 7.4 1.0 8.1
mpeg2decode 1.7 5.8 2.8 2.2 | 1.5 5.8 2.7 2.2 | 1.4 0.4 1.4 0.6
mpeg2encode 24.7 80.3 3.6 15.7 | 22.8 80.3 3.9 15.7 | 6.5 14.0 1.3 12.3
unepic 5.6 16.0 0.6 5.2 | 4.1 16.0 -0.6 5.2 | 0.3 1.5 0.3 2.2
average 5.0 22.9 2.5 9.8 | 4.1 22.9 2.1 9.8 | 0.7 3.0 0.1 2.4


Table B.4: Performance loss in percentages of the global heuristic when the register file of the TTArealistic template is partitioned.

Benchmark  RF partitions: 2x256 4x128 2x16 4x8
a68 -0.2 0.5 -0.2 0.4
bison 1.5 2.9 0.7 1.9
cpp -0.8 0.3 0.3 0.4
crypt 0.2 3.2 0.3 2.7
diff 0.9 2.1 0.5 2.1
expand 0.6 6.4 2.5 4.2
flex 1.0 3.1 0.4 3.5
gawk -0.1 1.3 -0.1 1.5
gzip -0.4 3.7 -0.8 1.1
od -0.2 5.4 -0.8 3.7
sed 1.6 4.1 -0.1 3.6
sort 1.5 3.5 1.1 5.3
uniq -1.0 2.7 -1.3 4.5
virtex 0.4 1.7 0.6 1.9
wc 0.0 0.8 0.0 0.6
099.go 0.2 2.2 1.5 5.5
124.m88ksim 0.3 3.6 0.7 3.8
129.compress 0.0 1.9 0.0 1.0
132.ijpeg 1.8 5.7 -0.6 4.5
147.vortex 0.3 0.6 0.4 0.6
instf 0.0 -0.2 0.1 2.9
mulaw 0.0 0.0 -1.4 -1.4
radproc 0.8 0.2 0.4 1.0
rfast 0.3 0.8 0.1 1.0
rtpse 0.3 3.9 0.2 2.6
djpeg 1.4 7.1 0.6 2.3
rawcaudio 0.8 5.3 0.5 4.8
mpeg2decode 1.3 2.5 1.4 1.1
mpeg2encode 5.0 12.2 1.9 15.0
unepic 0.3 2.2 0.7 1.8
average 0.6 3.0 0.3 2.8


Glossary

This glossary gives an overview of the symbols and abbreviations used throughout this thesis.

Symbols

Definition point of all variables v referenced in basic block b and v ∈ liveIn(b)
⊥  Use point of all variables v referenced in basic block b and v ∈ liveOut(b)
|S|  Cardinality (size) of the set S
γR  Register rank
γS  Schedule rank
δa  Anti dependence
δa  Percentage dimensional increase per RF-port
δf  Flow dependence
δidelay,distance  Inter-iteration data dependence edge with a minimal delay of delay instructions and a minimal distance of distance iterations
δo  Output dependence
δot  Operand-trigger dependence
δtr  Trigger-result dependence
δtypedelay  Data dependence edge of type type with a minimal delay of delay instructions
ρp  Relative access time increase per RF-port
ρr  Relative access time increase per register
AreaRF  Relative area of a register file
b  A basic block
b∗  Currently scheduled basic block
bb(n)  Basic block of operation n
B  Set of basic blocks
BDU(v)  Set of basic blocks containing references to variable v
BIO(v)  Set of basic blocks in which variable v is live on exit and entry
BE  Set of back edges
C(G)  Transitive closure of a graph G
Ccallee-saved(v)  Callee-saved cost of variable v
Ccaller-saved(v)  Caller-saved cost of variable v
Cspill(v)  Spill cost of variable v
Cstate preserving(v, r)  State preserving cost when register r is mapped onto variable v
CE  Set of control flow edges
degree(n)  Number of neighbor nodes of a node n
delay(oj, oi)  Data dependence delay between the operations oj and oi
bi doms bj  Dominance relation between basic blocks bi and bj
D  Set of duplication basic blocks
Edu  Set of du-chain edges
EDDG  Set of DDG edges
Ef  Set of false dependence edges
Efdp  Set of false dependence prevention edges
Eff  Set of forward false dependence edges
Effdp  Set of forward false dependence prevention edges
Efpar  Set of forward parallel interference edges
Einterf  Set of interference edges
EMem  Set of memory data dependence edges
Emin  Set of minimal interference edges
Epar  Set of parallel interference edges
EReg  Set of register data dependence edges
ERF  Average access energy per instruction of an RF
ERF-ports  Set of RF-port interference edges
ESSG  Set of interference edges in IGSSG
Et  Set of undirected edges of the transitive closure of the DDG
ET  Set of edges of the transitive closure of the DDG
EVar  Set of variable data dependence edges
f()  Execution frequency
fu  A function unit
FE  Set of forward edges
FUset  Set of function units
Gf  False dependence graph
Gff  Forward false dependence graph
Gt  Transitive closure of the DDG with undirected edges
i  An instruction
icur  Instruction in which a move is currently being scheduled
interf(v)  Set of variables interfering with variable v
I  Set of intermediate basic blocks
IG  Interference graph
IGfpar  Forward parallel interference graph
IGmin  Minimal interference graph
IGpar  Parallel interference graph
IGRF-ports  RF-port interference graph
IGSSG  Scheduler-sensitive global interference graph
lr(v)  Live range of variable v
liveDef(b)  Set of variables defined in basic block b before they are used in b
liveIn(b)  Set of live variables at the entry of a basic block b
liveOut(b)  Set of live variables at the exit of a basic block b
liveUse(b)  Set of variables used in basic block b before they are defined in b
live(Q)  Set of variables live at a point Q
L(o)  Latency of operation o
Lpath(oi, oj)  Distance in the DDG between the operations oi and oj
L(v)  Length of the live range of variable v
m  A move (transport)
mdef(v)  Move defining variable v
mi  Move with index i
mo  An operand move
mr  A result move
mref(v)  Move referring to variable v
mt  A trigger move
n  A node in a graph
ndef(v)  Node in the DDG that defines variable v
ni  Node with index i
nref(v)  Node in the DDG that references variable v
nuse(v)  Node in the DDG that uses variable v
Ndu  Nodes of a du-chain
NDef(v)  Set of moves that define variable v
NDef(v,b)  Set of moves that define variable v in basic block b
NDDG  Set of nodes in a DDG
NUse(v)  Set of moves that use variable v
NUse(v,b)  Set of moves that use variable v in basic block b
Nports  Number of RF-ports
Nvar  Set of variables
o  An operation
oi  Operation with index i
otfreedom  Upperbound of the operand-trigger scheduling distance
O⊥  Set of ready operations that end a live range
pred(m)  Set of predecessor moves of m in the DDG
pred∗(o)  Set of scheduled predecessor operations of o in the DDG
P  A program
P  A procedure
Pr(b′|b)  Probability that b′ will be executed after b
r  A register
ready  Set of ready operations
ri  Register with index i
r(v)  Returns register r mapped onto variable v
R  Set of registers in the target TTA
Rcallee-saved  Set of callee-saved registers
Rcaller-saved  Set of caller-saved registers
RCur(v, n)  Instruction precise interfering register set for variable v in a partly scheduled basic block bb(n)
RDU(v, b)  Set of interfering registers of variable v in a basic block b ∈ BDU(v)
Rfree  Set of available registers in the last instruction in the currently scheduled basic block
RInterfere(v, n)  Set of interfering registers for a variable v
RIO(v, b)  Set of interfering registers of variable v in a basic block b ∈ BIO(v)
RNon-interfere(v, n)  The set of non-interfering registers for a variable v
RRF  Number of registers in a register file
RRRV(v, n)  Set of interfering registers based on RRV information
RF(ri)  Register file of register ri
succv(oi)  Set of successor operations of oi in the DDG that read variable v
S  A schedule
texec  Execution time
trfreedom  Upperbound of the trigger-result scheduling distance
Taccess  Relative access time of a multi-ported register file
Tpath(oi, oj)  Total path length in the DDG of all paths from oi to oj
v  A variable
vi  Variable with index i
VAbove(b, v)  Set of non-interfering variables whose live range ends before the live range of v starts in basic block b
VAround(b, v)  Set of non-interfering variables that are live at entry and exit of basic block b, are defined and used within b, and do not interfere with v
VBelow(b, v)  Set of non-interfering variables whose live range starts after the end of the live range of v in basic block b
VBetween(b, v)  Set of non-interfering variables defined after all uses of v, and whose live range ends before the definition of v
Vnon-interf(b, v)  Set of variables that do not interfere with v in basic block b under any legitimate schedule
Wffdp(vi, vj)  Weight associated with a forward false dependence prevention edge

Abbreviations

ALU  Arithmetic Logic Unit
ANSI  American National Standards Institute
ASP  Application Specific Processor
BOP  Billions of Operations Per Second, Inc.
BSD  Berkeley Software Distribution
CFA  Control Flow Analysis
CFG  Control Flow Graph
CISC  Complex Instruction Set Computer
CMOS  Complementary Metal Oxide Semiconductor
CNS  Controlled Node Splitting
CPU  Central Processing Unit
CRAIG  Combining Register Assignment Interference Graphs
DAG  Directed Acyclic Graph
DCEA  Dependence Conscious Early Assignment
DDA  Data Dependence Analysis
DDG  Data Dependence Graph
DEC  Digital Equipment Corporation
DFA  Data Flow Analysis
DSP  Digital Signal Processing
du-chain  Definition-use chain
EPIC  Explicitly Parallel Instruction Computing
FDPG  False Dependence Prevention Graph
FFT  Fast Fourier Transform
FIFO  First In First Out
FU  Function Unit
GSC  Global Spill Cost heuristic
HLL  High Level Language
HRMS  Hypernode Reduction Modulo Scheduling
IBM  International Business Machines
IG  Interference Graph
II  Initiation Interval
ILP  Instruction-Level Parallelism
IO  Input/Output
IPS  Integrated Prepass Scheduling
IW  Issue Width
JPEG  Joint Photographic Experts Group
MII  Minimum Initiation Interval
MPEG  Moving Pictures Expert Group
NEC  Nippon Electric Company
NP  Nondeterministic-Polynomial
OTA  Operation Triggered Architecture
PC  Personal Computer
PDA  Personal Digital Assistant
RASE  Register Allocation with Schedule Estimates
RF  Register File
RISC  Reduced Instruction Set Computer
RRV  Register Resource Vector
SCP  Single Copy on a Path rule
SFU  Special Function Unit
SPEC  Standard Performance Evaluation Corporation
SRAM  Static Random Access Memory
TLB  Translation-Lookaside Buffer
TTA  Transport Triggered Architecture
TV  TeleVision
URSA  Unified ReSource Allocator
VLIW  Very Long Instruction Word architecture or processor
VTL  Virtual-time latching


Bibliography

Abe00. Aberdeen Group, Inc. Intel’s Itanium: Who Benefits from Early Adop-tion, Aberdeen Group, Inc., Boston, November 2000.

ACM+98. D.I. August, D.A. Connors, S.A. Mahlke, J.W. Sias, K.M. Crozier, B-C. Cheng, P.R. Eaton, Q.B. Olaniran, and W-M.W. Hwu. IntegratedPredicated and Speculative Execution in the IMPACT EPIC Archi-tecture. In Proceedings of the 25th Annual International Symposium onComputer Architecture, pages 227–237, Barcelona, Spain, June 1998.

AEBK94. W. Ambrosch, M.A. Ertl, F. Beer, and A. Krall. Dependence-Conscious Global Register Allocation. In Programming Languagesand System Architectures, volume 782 of Lecture Notes in ComputerScience, pages 125–136, Zurich, Switzerland, March 1994.

AHC96. H. Adriani, F. Harmsze, and H. Corporaal. The Utilization of aFully Configurable Microprocessor Development Environment forRapid VHDL Prototyping and Implementation of ’C’-Based Algo-rithms. In Proceedings of ProRisc Workshop, Mierlo, The Netherlands,November 1996.

Arn01. M.L.C.H. Arnold. Instruction Set Extensions for Embedded Proces-sors. PhD thesis, Delft University of Technology, The Netherlands,March 2001.

ASU85. A.V. Aho, R. Sethi, and J.D. Ullman. Compilers: Principles, Techniquesand Tools. Addison-Wesley Series in Computer Science. Addison-Wesley Publishing Company, Reading, Massachusetts, 1985.


BCK91. D. Bernstein, D. Cohen, and H. Krawczyk. Code Duplication: AnAssist for Global Instruction Scheduling. In Proceedings of the 24thAnnual International Symposium on Microarchitecture, pages 103–113,Albuquerque, New Mexico, November 1991.

BCKT89. P. Briggs, K. D. Cooper, K. Kennedy, and L. Torczon. ColoringHeuristics for Register Allocation. In Proceedings of the ACM SIG-PLAN’89 Conference on Programming Language Design and Implemen-tation, pages 275–284, Portland, Oregon, June 1989.

BCT94. P. Briggs, K.D. Cooper, and L. Torczon. Improvements to GraphColoring Register Allocation. ACM Transactions on ProgrammingLanguages and Systems, 16(3):428–455, May 1994.

BEH91a. D.G. Bradlee, S.J. Eggers, and R.R. Henry. Integrating Register Al-location and Instruction Scheduling for RISCs. In Proceedings of the4th International Conference on Architectural Support for ProgrammingLanguages and Operating System, pages 122–131, Santa Clara, Cali-fornia, April 1991.

BEH91b. D.G. Bradlee, S.J. Eggers, and R.R. Henry. The Marion System forRetargetable Instruction Scheduling. In Proceedings of the ACM SIG-PLAN’91 Conference on Programming Language Design and Implemen-tation, pages 229–240, Toronto, Canada, June 1991.

BGM+89. D. Bernstein, M.C. Golumbic, Y. Mansour, R.Y. Pinter, D.Q. Goldin,H. Krawczyk, and I. Nahshon. Spill Code Minimization Tech-niques for Optimizing Compilers. In Proceedings of the ACM SIG-PLAN’89 Conference on Programming Language Design and Implemen-tation, pages 258–263, Portland, Oregon, June 1989.

BGS93. D. Berson, R. Gupta, and M.L. Soffa. URSA: A Unified ReSourceAllocator for Registers and Functional Units in VLIW Architec-tures. In Proceedings of the Conference on Architectures and Compila-tion Techniques for Fine and Medium Grain Parallelism, pages 243–254,Orlando, Florida, January 1993.

BGS94. D. Berson, R. Gupta, and M.L. Soffa. Resource Spackling: A Frame-work for Integrating Register Allocation in Local and Global Sched-ulers. In Proceedings of the 1994 Conference on Parallel Architecturesand Compilation Techniques, pages 135–146, Montreal, Canada, Au-gust 1994.

BR91. D. Bernstein and M. Rodeh. Global Instruction Scheduling for Su-perscalar Machines. In Proceedings of ACM SIGPLAN’91 Conferenceon Programming Language Design and Implementation, pages 241–255,Toronto, Canada, June 1991.

Bri92. P. Briggs. Register Allocation via Graph Coloring. PhD thesis, Rice University, April 1992.

BS91. M. Breternitz and J.P. Shen. Implementation Optimization Tech-niques for Architecture Synthesis of Application-Specific Proces-sors. In Proceedings of the 24th Annual International Symposium onMi-croarchitecture, pages 114–123, Albuquerque, New Mexico, Novem-ber 1991.

BSBC95. T.S. Brasier, P.H. Sweany, S.J. Beaty, and S. Carr. CRAIG: A Practi-cal Framework for Combining Instruction Scheduling and RegisterAssignment. In Proceedings of the 1995 Conference on Parallel Archi-tectures and Compilation Techniques, pages 11–18, Limassol, Cyprus,June 1995.

BYA93. G.R. Beck, D.W.L. Yen, and T.L. Anderson. The Cydra 5 Minisuper-computer: Architecture and Implementation. The Journal of Super-computing, 7(1/2):143–180, May 1993.

CAC+81. G.J. Chaitin, M.A. Auslander, A.K. Chandra, J. Cocke, M.E. Hop-kins, and P.W. Markstein. Register Allocation via Coloring. Com-puter Languages, 6(1):47–57, January 1981.

CBLM96. T.M. Conte, S. Banerjia, S.Y. Larin, and K.N. Menezes. InstructionFetch Mechanisms for VLIW Architectures with Compressed En-codings. In Proceedings of the 29th Annual International Symposium onMicroarchitecture, pages 201–211, Paris, France, December 1996.

CC97. A. G. M. Cilio and H. Corporaal. Efficient Code Generation forASIPs with Different Word Sizes. In Proceedings of the third AnnualConference of the Advanced School for Computing and Imaging, pages144–150, Heijen, The Netherlands, June 1997.

CCK97. C.M. Chang, C.M. Chen, and C.T. King. Using Integer Linear Pro-gramming for Instruction Scheduling and Register Allocation inMulti-issue Processors. Computers andMathematics with Applications,34(9):1–14, 1997.

CDN93. A. Capitanio, N. Dutt, and A. Nicolau. Partitioned Register Filesfor VLIWs: A Preliminary Analysis of Tradeoffs. In Proceedings ofthe 25th Annual International Symposium on Microarchitecture, pages292–300, Portland, Oregon, December 1993.

CDN94. A. Capitanio, N. Dutt, and A. Nicolau. Toward Register Allocationfor Multiple Register File VLIW Architectures. Technical Report TR6-94, University of California, Irvine, Department of Informationand Computer Science, February 1994.

CDN95. A. Capitanio, N. Dutt, and A. Nicolau. Architectural Tradeoff Analysis of Partitioned VLIWs. Technical Report TR 14-94, University of California, Irvine, Department of Information and Computer Science, April 1995.

CF87. R. Cytron and J. Ferrante. What’s in a Name? or The Value ofRenaming for Parallelism Detection and Storage Allocation. InProceedings of the 1987 International Conference on Parallel Processing,pages 19–27, August 1987.

CGVT00. J-L. Cruz, A. Gonzalez, M. Valero, and N.P. Topham. Multiple-banked Register File Architectures. In Proceedings of the 27th An-nual International Symposium on Computer Architecture, pages 316–325, Vancouver, Canada, June 2000.

CH90. F.C. Chow and J.L. Hennessy. The Priority-Based Coloring Ap-proach to Register Allocation. ACM Transactions on ProgrammingLanguages and Systems, 12(4):501–526, October 1990.

CH95. H. Corporaal and J. Hoogerbrugge. Code Generation for TransportTriggered Architectures. In Gert Goossens and Peter Marwedel,editors, Code Generation for Embedded Processors. Kluwer AcademicPublishers, 1995.

Cha82. G.J. Chaitin. Register Allocation & Spilling via Graph Coloring.In Proceedings of the ACM SIGPLAN’82 Symposium on Compiler Con-struction, pages 98–105, Boston, Massachusetts, June 1982.

CJA00. H. Corporaal, J.A.A.J. Janssen, and M.L.C.H. Arnold. Computationin the context of Transport Triggered Architectures. InternationalJournal of Parallel Programming, 28(4):401–427, August 2000.

CK91. D. Callahan and B. Koblenz. Register Allocation via HierarchicalGraph Coloring. In Proceedings of the ACM SIGPLAN’91 Conferenceon Programming Language Design and Implementation, pages 192–203,Toronto, Canada, June 1991.

CL95. H. Corporaal and R. Lamberts. TTA Processor Synthesis. In Proceed-ings of the first Annual Conference of the Advanced School for Computingand Imaging, pages 18–27, Heijen, The Netherlands, May 1995.

CLM+95. P. P. Chang, D. M. Lavery, S. A. Mahlke, W. Y. Chen, and W. W.Hwu. The Importance of Prepass Code Scheduling for Superscalarand Superpipelined Processors. IEEE Transactions on Computers,44(3):353–370, March 1995.

CM69. J. Cocke and R. E. Miller. Some Analysis Techniques for OptimizingComputer Programs. In Proceedings of the 2nd Hawaii Conference onSystem Sciences, pages 143–146, Honolulu, Hawaii, January 1969.

CM91. H. Corporaal and J.M. Mulder. MOVE: A Framework for High-Performance Processor Design. In Supercomputing’91 Proceedings,pages 692–701, Albuquerque, New Mexico, November 1991.


CND95. A. Capitanio, A. Nicolau, and N. Dutt. A Hypergraph-BasedModelfor Port Allocation on Multiple-Register-File VLIW Architectures.International Journal of Parallel Programming, 23(6):499–513, Decem-ber 1995.

Cor93. H. Corporaal. Evaluating Transport Triggered Architectures forScalar Applications. Microprocessing and Microprogramming, 38:45–52, September 1993.

Cor98. H. Corporaal. Microprocessor Architectures; from VLIW to TTA. JohnWiley & Sons, 1998.

CS95. T.M. Conte and S.W. Sathaye. Dynamic Rescheduling: A Techniquefor Object Code Compatibility in VLIW Architectures. In Proceed-ings of the 28th Annual International Symposium on Microarchitecture,pages 208–218, Ann Arbor, Michigan, November 1995.

CvdA93. H. Corporaal and P. van der Arend. MOVE32INT, a Sea of Gates Re-alization of a High Performance Transport Triggered Architecture.Microprocessing and Microprogramming, 38:53–60, September 1993.

DA93. A. Dolan and J. Aldous. Networks and Algorithms: An IntroductoryApproach. John Wiley & Sons, 1993.

dM94. G. de Micheli. Synthesis and Optimization of Digital Circuits.McGraw-Hill, 1994.

DT93. J. Dehnert and R. Towle. Compiling for the Cydra 5. The Journal ofSupercomputing, 7(1/2):181–228, May 1993.

EA97. K. Ebcioglu and E. R. Altman. DAISY: Dynamic Compilation for100% Architectural Compatibility. In Proceedings of the 24th AnnualInternational Symposium on Computer Architecture, pages 26–37, Den-ver, Colorado, June 1997.

ED95a. A.E. Eichenberger and E.S. Davidson. Register Allocation for Predi-cated Code. In Proceedings of the 28th Annual International Symposiumon Microarchitecture, pages 180–191, Ann Arbor, Michigan, Novem-ber 1995.

ED95b. A.E. Eichenberger and E.S. Davidson. Stage Scheduling: A Tech-nique to Reduce Register Requirements of a Modulo Schedule. InProceedings of the 28th Annual International Symposium on Microarchi-tecture, pages 338–349, Ann Arbor, Michigan, November 1995.

EFK+98. K. Ebcioglu, J. Fritts, S. Kosonocky, M. Gschwind, E. R. Altman,K. Kailas, and T. Bright. An Eight-Issue Tree-VLIW Processor forDynamic Binary Translation. In Proceedings of the International Con-ference on Computer Design, pages 488–495, Austin, Texas, October1998.


EGS95. C. Eisenbeis, F. Gasperoni, and U. Schwiegelshohn. Allocating Reg-isters in Multiple Instruction-Issuing Processors. In Proceedings ofthe 1995 Conference on Parallel Architectures and Compilation Tech-niques, pages 290–293, June 1995.

Ell86. J.R. Ellis. Bulldog: A Compiler for VLIW Architectures. ACM DoctoralDissertation Awards. MIT Press, Cambridge, Massachusetts, 1986.

Emb95. P.M. Embree. C Language Algorithms for Real-Time DSP. Prentice-Hall, 1995.

EZ97. Embedded Software Onderzoek in Nederland, Analyse en resul-taten. Ministerie van Economische Zaken, August 1997.

FBF+00. P. Faraboschi, G. Brown, J.A. Fisher, G. Desoli, and F. Homewood.LX: A Technology Platform for Customizable VLIW EmbeddedProcessing. In Proceedings of the 27th Annual International Sympo-sium on Computer Architecture, pages 203–213, Vancouver, Canada,June 2000.

FCJV97. K.I. Farkas, P. Chow, N.P. Jouppi, and Z. Vranesic. The Multiclus-ter Architecture: Reducing Cycle Time Through Partitioning. InProceedings of the 30th Annual International Symposium on Microar-chitecture, pages 149–159, Research Triangle Park, North Carolina,December 1997.

FFD96. J.A. Fisher, P. Faraboschi, and G. Desoli. Custom-Fit Processors:Letting Applications Define Architectures. In Proceedings of the 29thAnnual International Symposium on Microarchitecture, pages 324–335,Paris, France, December 1996.

Fis81. J.A. Fisher. Trace scheduling: A technique for global microcodecompaction. IEEE Transactions on Computers, 30(7):478–490, July1981.

FJC95. K.I. Farkas, N.P. Jouppi, and P. Chow. Register File Design Con-siderations in Dynamically Scheduled Processors. Technical ReportWRL 95/10, Digital Western Research Laboratory, Palo Alto, Cali-fornia, November 1995.

FR91. S.M. Freudenberger and J.C. Ruttenberg. Phase Ordering of Reg-ister Allocation and Instruction Scheduling. In Code Generation -Concepts, Tools, Techniques, pages 146–172, 1991.

GJ79. M.R. Garey and D.S. Johnson. Computers and Intractability: A Guideto the Theory of NP-Completeness. W.H. Freeman and Company, NewYork, 1979.

GJS96. J. Gosling, B. Joy, and G. Steele. The Java Language Specification. Addison-Wesley Publishing Company, Reading, Massachusetts, 1996.

GL95. K.J. Gough and J. Ledermann. Register Allocation in the GardensPoint Compilers. In Proceedings 18th Australian Computer ScienceConference, Adelaide, Australia, January 1995.

GNAB92. Jeffrey Gray, Andrew Naylor, Arthur Abnous, and NaderBagherzadeh. VIPER: A 25-MHz, 100 MIPS Peak VLIW Micro-processor. Technical Report ICS-TR-92-78, University of Califor-nia, Irvine, Department of Information and Computer Science, July1992.

GS90. R. Gupta and M.L. Soffa. Region Scheduling: An approach for De-tecting and Redistributing Parallelism. IEEE Transactions on SoftwareEngineering, 16(4):421–431, April 1990.

GS95. J.R. Gurd and D.F. Snelling. Self-Regulation of Workload in theManchester Data-Flow Computer. In Proceedings of the 28th AnnualInternational Symposium on Microarchitecture, pages 135–145, AnnArbor, Michigan, November 1995.

GSS89. R. Gupta, M.L. Soffa, and T. Steele. Registers Allocation via CliqueSeparators. In Proceedings of the ACMSIGPLAN’89 Conference on Pro-gramming Language Design and Implementation, pages 264–274, Port-land, Oregon, July 1989.

GV97. C. J. Glossner and S. Vassiliadis. The Delft-Java Engine. LectureNotes in Computer Science, 1300:766–770, August 1997.

GWC88. J.R. Goodman and W. Wei-Chung. Code Scheduling and RegisterAllocation in Large Basic Blocks. In Proceedings of the InternationalConference on Supercomputing, pages 442–452, St. Malo, France, July1988.

Gwe96. L. Gwennap. Digital 21264 Sets New Standard. Microprocessor Re-port, 10(14):11–16, October 1996.

HA99. J. Hoogerbrugge and L. Augesteijn. Instruction Scheduling for Tri-Media. Journal of Instruction-Level Parallelism (http://www.jilp.org/vol1), February 1999.

HC89. W.W. Hwu and P. P. Chang. Inline Function Expansion for Compil-ing C Programs. In Proceedings of the ACM SIGPLAN’89 Conferenceon Programming Language Design and Implementation, pages 246–257,Portland, Oregon, June 1989.

HC94. J. Hoogerbrugge and H. Corporaal. Register File Port Require-ments of Transport Triggered Architectures. In Proceedings of the27th Annual International Symposium onMicroarchitecture, pages 191–195, San Jose, California, November 1994.


HC95. J. Hoogerbrugge and H. Corporaal. Automatic Synthesis of Trans-port Triggered Architectures. In Proceedings of the first Annual Con-ference of the Advanced School for Computing and Imaging, pages 79–88,Heijen, The Netherlands, May 1995.

HC96. J. Hoogerbrugge andH. Corporaal. Resource Assignment in a Com-piler for Transport Triggered Architectures. In Proceedings of the sec-ond Annual Conference of the Advanced School for Computing and Imag-ing, Lommel, Belgium, June 1996.

Hec77. M.S. Hecht. Flow Analysis of Computer Programs. Programming Lan-guages Series. Elsevier, North-Holland, 1977.

HG83. J.L. Hennessy and T. Gross. Postpass Code Optimization of PipelineConstraints. ACM Transactions on Programming Languages and Sys-tems, 5(3):422–448, July 1983.

HHR95. R.E. Hank, W-M.W. Hwu, and B.R. Rau. Region-Based Compila-tion: An Introduction and Motivation. In Proceedings of the 28thAnnual International Symposium on Microarchitecture, pages 158–168,Ann Arbor, Michigan, November 1995.

HMC+93. W.W. Hwu, S.A. Mahlke, W.Y. Chen, P.P. Chang, N.J. Warter, R.A.Bringmann, R.G. Ouellette, R.E. Hank, T. Kiyohara, G.E. Haab, J.G.Holm, and D.M. Lavery. The Superblock: An Effective Techniquefor VLIW and Superscalar Compilation. The Journal of Supercomput-ing, 7(1/2):229–248, May 1993.

Hoo96. J. Hoogerbrugge. Code Generation for Transport Triggered Architec-tures. PhD thesis, Delft University of Technology, The Netherlands,February 1996.

HP90. J.L. Hennessy and D.A. Patterson. Computer Architecture, a Quantita-tive Approach. Morgan Kaufmann publishers, SanMateo, California,1990.

Huf93. R.A. Huff. Lifetime-Sensitive Modulo Scheduling. In Proceedingsof the ACM SIGPLAN’93 Conference on Programming Language De-sign and Implementation, pages 258–267, Albuquerque, NewMexico,June 1993.

IA99. IA-64 Architecture Innovations. 1999 Intel Developer Forum, JointWhitepaper between Intel & HP, February 1999.

Int00. Intel Corporation. Inside the NetBurstTM Micro-Architecture of theIntel Pentium 4 Processor, Intel Corporation, Santa Clara 2000.

Jan01. I. Janssen. Enhancing the Move Framework. Master’s thesis, DelftUniversity of Technology, The Netherlands, May 2001.

JC95. J.A.A.J. Janssen and H. Corporaal. Partitioned Register Files for TTAs. In Proceedings of the 28th Annual International Symposium on Microarchitecture, pages 303–312, Ann Arbor, Michigan, November 1995.

JC96. J.A.A.J. Janssen and H. Corporaal. Controlled Node Splitting. InProceedings of the 6th International Conference on Compiler Construc-tion, pages 44–58, Linkoping, Sweden, April 1996.

JC97a. J.A.A.J. Janssen and H. Corporaal. Making Graphs Reducible withControlled Node Splitting. ACM Transactions on Programming Lan-guages and Systems, 19(6):1031–1052, November 1997.

JC97b. J.A.A.J. Janssen and H. Corporaal. Registers On Demand: Inte-grated Register Allocation and Instruction Scheduling . In Proceed-ings of the third Annual Conference of the Advanced School for Comput-ing and Imaging, pages 61–67, Heijen, The Netherlands, June 1997.

JC98. J.A.A.J. Janssen and H. Corporaal. Registers On Demand, an Inte-grated Region Scheduler and Register Allocator. In Poster Proceed-ings of the 7th International Conference on Compiler Construction, pages44–51, Lisbon, Portugal, April 1998.

JCK98. J.A.A.J. Janssen, H. Corporaal, and I. Karkowski. Integrated Regis-ter Allocation and Software Pipelining. In Proceedings of the fourthAnnual Conference of the Advanced School for Computing and Imaging,pages 80–86, Lommel, Belgium, June 1998.

Joh91. W.M. Johnson. Superscalar Microprocessor Design. Prentice Hall,1991.

Jol91. R.D. Jolly. A 9-ns, 1.4-Gigabyte/s, 17-Ported CMOS Register File.IEEE Journal of Solid-State Circuits, 26(10):1407–1412, October 1991.

JW89. N.P. Jouppi and David W. Wall. Available Instruction-Level Paral-lelism for Superscalar and Superpipelined Machines. In Proceedingsof the 3th International Conference on Architectural Support for Program-ming Languages and Operating System, pages 272–282, Boston, Mas-sachusetts, April 1989.

Kar72. R.M. Karp. Reducibility among Combinatorial Problems. In Pro-ceedings of a Symposium on the Complexity of Computer Computations,pages 85–103, Yorktown Heights, New York, March 1972.

Kla00. A. Klaiber. The Technology Behind CrusoeTM Processors. TransmetaCorporation, Santa Clara, January 2000.

L+93. P.G. Lowney et al. The Multiflow Trace Scheduling Compiler. TheJournal of Supercomputing, 7(1/2):51–142, May 1993.

Lam88. M. S. Lam. Software Pipelining: an Effective Scheduling Technique for VLIW Machines. In Proceedings of ACM SIGPLAN’88 Conference on Programming Language Design and Implementation, pages 318–328, Atlanta, Georgia, June 1988.

Lam98. M. S. Lam. A Systolic Array Optimizing Compiler. The Kluwer In-ternational Series in Engineering and Computer Science. KluwerAcademic Publishers, Norwell, Massachusetts, 1998.

Lee92. C.G. Lee. Code Optimizers and Register Organizations for Vector Archi-tectures. PhD thesis, University of California, Berkeley, May 1992.

Leh00. R. Lehrbaum. Embedded systems news: Soc it to ’ya! Linux Journal,pages 36–37, August 2000.

LG95. L.A. Lozano and G.R. Gao. Exploiting Short-Lived Variables in Su-perscalar Processors. In Proceedings of the 28th Annual InternationalSymposium on Microarchitecture, pages 292–302, Ann Arbor, Michi-gan, December 1995.

LGAV96. J. Llosa, A. Gonzalez, E. Ayguade, and M. Valero. Swing ModuloScheduling: A Lifetime-Sensitive Approach. In Proceedings of the1996 Conference on Parallel Architectures and Compilation Techniques,pages 80–86, Boston, Massachusetts, October 1996.

LPMS97. C. Lee, M. Potkonjak, and W.H. Mangione-Smith. MediaBench: ATool for Evaluating and Synthesizing Multimedia and Communi-cation Systems. In Proceedings of the 30th Annual International Sym-posium on Microarchitecture, pages 330–335, Research Triangle Park,North Carolina, December 1997.

LVA96. J. Llosa, M. Valero, and E. Ayguade. Heuristics for Register-Constrained Software Pipelining. In Proceedings of the 29th AnnualInternational Symposium on Microarchitecture, pages 250–261, Paris,France, December 1996.

LVAG95. J. Llosa, M. Valero, E. Ayguade, and A. Gonzalez. Hypernode Re-duction Modulo Scheduling. In Proceedings of the 28th Annual Inter-national Symposium on Microarchitecture, pages 350–360, Ann Arbor,Michigan, November 1995.

LW92. M. S. Lam and R. P. Wilson. Limits of Control Flow on Parallelism.In Proceedings of the 19th Annual International Symposium on ComputerArchitecture, pages 46–57, Gold Coast, Australia, May 1992.

LW97. H. Liao and A. Wolfe. Available Parallelism in Video Applications.In Proceedings of the 30th Annual International Symposium on Microar-chitecture, pages 321–329, Research Triangle Park, North Carolina,December 1997.

Mah96. S.A. Mahlke. Exploiting Instruction Level Parallelism in the Presence of Conditional Branches. PhD thesis, University of Illinois, Urbana-Champaign, 1996.

ME92. S. Moon and K. Ebcioglu. An Efficient Resource Constrained GlobalScheduling Technique for Superscalar and VLIWprocessors. In Pro-ceedings of the 25th Annual International Symposium on Microarchitec-ture, pages 55–71, Portland, Oregon, December 1992.

Mes01. B.Mesman. DSPCode Generation. PhD thesis, EindhovenUniversityof Technology, The Netherlands, May 2001.

MLC+92. S.A. Mahlke, D.C. Lin, W.Y. Chen, R.E. Hank, and R.A. Bringmann.Effective Compiler Support for Predicated Execution Using the Hy-perblock. In Proceedings of the 25th Annual International Sympo-sium on Microarchitecture, pages 45–54, Portland, Oregon, December1992.

MLD92. P. Michel, U. Lauther, and P. Duzy, editors. The Synthesis Approach toDigital System Design. The Kluwer International Series in Engineer-ing and Computer Science. Kluwer Academic Publishers, Norwell,Massachusetts, 1992.

MPSR95. R. Motwani, K. Palem, V. Sarkar, and S. Reyen. Combined Instruc-tion Scheduling and Register Allocation. Technical Report 698, NewYork University, Department of Computer Science, July 1995.

Muc97. S.S. Muchnick. Advanced Compiler Design & Implementation. MorganKaufmann Publishers, San Francisco, California, 1997.

Mue92. F. Mueller. Register Allocation by Graph Coloring: A Review. Tech-nical Report FALL 1992, Florida State University, Department ofComputer Science, 1992.

NG93. Q. Ning and G.R. Gao. A Novel Framework of Registers Allocationfor Software Pipelining. In conference record 20th ACM Symposium onPrinciples of Programming Languages, pages 29–42, Charleston, SouthCarolina, 1993.

NP93. C. Norris and L.L. Pollock. A Scheduler-Sensitive Global RegisterAllocator. In Supercomputing’93 Proceedings, pages 804–813, Port-land, Oregon, November 1993.

NP95. C. Norris and L.L. Pollock. An Experimental Study of Several Coop-erative Register Allocation and Instruction Scheduling Strategies.In Proceedings of the 28th Annual International Symposium on Microar-chitecture, pages 169–179, Ann Arbor, Michigan, November 1995.

Pin93. S.S. Pinter. Register Allocation with Instruction Scheduling: A NewApproach. In Proceedings of the ACM SIGPLAN’94 Conference on Pro-gramming Language Design and Implementation, pages 248–257, Albu-querque, New Mexico, June 1993.


PJS96. S. Palacharla, N.P. Jouppi, and J.E. Smith. Quantifying the Com-plexity of Superscalar Processors. Technical Report CS-TR-96-1328,University ofWisconsin, Madison, Computer Sciences Department,November 1996.

PSM97. S. Park, S. Shim, and S-M Moon. Evaluation of Scheduling Tech-niques on a SPARC-Based VLIW Testbed. In Proceedings of the 30thAnnual International Symposium on Microarchitecture, pages 104–113,Research Triangle Park, North Carolina, December 1997.

PV00. G.G. Pechanek and S. Vassiliadis. The ManArray Embedded Pro-cessor Architecture. In Proceedings of the 26th Euromicro Conference:”Informatics: inventing the future, pages 384–355, Maastricht, TheNetherlands, September 2000.

Rau93. B.R. Rau. Dynamically Scheduled VLIW Processors. In Proceed-ings of the 26th Annual International Symposium on Microarchitecture,pages 80–92, Austin, Texas, December 1993.

Rau94. B.R. Rau. Iterative Modulo Scheduling: An algorithm For SoftwarePipelining Loops. In Proceedings of the 27th Annual International Sym-posium on Microarchitecture, pages 63–74, San Jose, California, De-cember 1994.

Rau96. B.R. Rau. Iterative Modulo Scheduling. International Journal of Par-allel Programming, 24(1):3–64, February 1996.

RCL01. S. Roos, H. Corporaal, and R. Lamberts. Clustering on the Move.Submitted for publication, 2001.

RF93. B.R. Rau and J.A. Fisher. Instruction-Level Parallel Processing:History, Overview and Perspective. The Journal of Supercomputing,7(1/2):9–50, May 1993.

RJSS97. E.R. Rotenberg, Q. Jacobson, Y. Sazeides, and J. Smith. Trace Pro-cessors. In Proceedings of the 30th Annual International Symposiumon Microarchitecture, pages 138–148, Research Triangle Park, NorthCarolina, December 1997.

RYYT89. B.R. Rau, D.W.L. Yen, W. Yen, and R.A. Towle. The Cydra 5 Depart-mental Supercomputer. Design Philosophies, Decisions and Trade-offs. IEEE Computer, 22(1):12–35, January 1989.

SB90. P. Sweany and S. Beaty. Post-Compaction Register Assignment in aRetargetable Compiler. In Proceedings of the 23th Annual InternationalSymposium on Microarchitecture, pages 107–116, Orlando, Florida,December 1990.

SBV95. G.S. Sohi, S.E. Breach, and T.N. Vijaykumar. Multiscalar processors. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pages 414–425, Santa Margherita Ligure, Italy, June 1995.

SCD+97. M. Schlansker, T.M. Conte, J. Dehnert, K. Ebcioglu, J.Z. Fang, andC. Thompson. Compilers for Instruction-Level Parallelism. IEEEComputer, 30(12):63–69, December 1997.

SM95. A. Sudarsanam and S. Malik. Memory Bank and Register Alloca-tion in Software Synthesis for ASIPs. In Proceedings of the IEEE In-ternational Conference on Computer Aided Design, pages 388–392, LosAlamitos, California, November 1995.

Spe96. Spec95. The Standard Performance Evaluation Corporation.(http://www.specbench.org), December 1996.

SRD96. G. A. Slavenburg, S Rathnam, and H. Dijkstra. The TriMedia TM-1 PCI VLIW Mediaprocessor. In Hot Chips 8, Stanford, California,August 1996.

SS93. M. Schuette and J. Shen. Instruction-Level Experimental Evaluationof the Multiflow TRACE 14/300 VLIW Computer. The Journal ofSupercomputing, 7(1/2):249–271, May 1993.

Sta94. R.M. Stallman. Using and Porting GNU CC. Free Software Founda-tion, Cambridge, Massachusetts, 1994.

STK98. E. Stumpel, M. Thies, and U. Kastens. VLIW Compilation Tech-niques for Superscalar Architectures. In Poster Proceedings of the 7thInternational Conference on Compiler Construction, pages 234–248, Lis-bon, Portugal, April 1998.

SW94. J.E. Smith and S. Weiss. PowerPC 601 and Alpha 21064: A Tale ofTwo RISCs. IEEE Computer, 14(3):46–58, June 1994.

TGH92. K.B. Theobald, G.R. Gao, and L.J. Hendren. On the Limits of Pro-gram Parallelism and its Smootability. In Proceedings of the 25th An-nual International Symposium on Microarchitecture, pages 10–19, Port-land, Oregon, December 1992.

TI99. Texas Instruments. TMS320C6000 CPU and Instruction Set ReferenceGuide, Texas Instruments, Dallas, March 1999.

TNO99. TNO Physics and Electronics Laboratory. µLog-x Low Power Dat-alogger ASIC family for complete datalogging system solutions. TNOPhysics and Electronics Laboratory, The Hague, The Netherlands,1999.

TS88. M.R. Thistle and B.J. Smith. A Processor Architecture for Horizon.In Supercomputing’88 Proceedings, pages 35–41, Orlando, Florida,November 1988.

vdG98. A. J. van de Goor. Testing Semiconductor Memories: Theory and Practice. ComTex Publishing, Gouda, The Netherlands, 1998.

VLW00. S. Virtanen, J. Lilius, and T. Westerlund. A Processor Architec-ture for the TACO Protocol Processor Development Framework.In Proceedings of the 18th IEEE NorChip Conference, Turku, Finland,November 2000.

Wal91. D.W. Wall. Limits of Instruction-Level Parallelism. In Proceedings ofthe 4th International Conference on Architectural Support for Program-ming Languages and Operating System, pages 176–188, Santa Clara,California, April 1991.

WFW+95. R. Wilson, R. Franch, C. Wilson, S. Amarasinghe, J. Ander-son, S. Tjiang, S.-W. Liao, C.-W. Tseng, M. Hall, M. Lam,and J. Hennessy. An Overview of the SUIF Compiler System.(http://suif.stanford.edu/suif/suif.html), 1995.

WHB92. N.J. Warter, G.E. Haab, and J.W. Bockhaus. Enhanced ModuloScheduling for Loops with Conditional Branches. In Proceedings ofthe 25th Annual International Symposium on Microarchitecture, pages170–179, Portland, Oregon, December 1992.

Wil51. M.V.Wilkes. The Best Way to Design an Automatic CalculatingMa-chine. In Proceedings of the Manchester University Computer InauguralConference, pages 16–18, Manchester, England, July 1951.

WM95. R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley Seriesin Computer Science. Addison-Wesley Publishing Company, Read-ing, Massachusetts, 1995.

Wol99. A. Wolfe. Analysis: VLIW chips will depend on smarter software.EE Times Online, May 1999.

WRP92. T. Wada, S. Rajan, and S.A. Przybylski. An Analytical Access TimeModel for On-chip Cache Memories. IEEE Journal of Solid-State Cir-cuits, 27(8):1147–1156, August 1992.

ZK97. V. Zyuban and P. Kogge. The Energy Complexity of Register Files.Technical Report TR 97-20, University of Notre Dame, Notre Dame,Indiana, CSE Department, December 1997.

ZK98. V. Zyuban and P. Kogge. The Energy Complexity of RegisterFiles. In International Symposium on Low Power Electronics and De-sign, pages 305–310, Monterey, California, August 1998.

ZK00. V. Zyuban and P. Kogge. Optimization of High-Performance Su-perscalar Architectures for Energy Efficiency. In International Sym-posium on Low Power Electronics and Design, pages 84–89, Rapallo,Italy, July 2000.


Samenvatting (Summary)

Compiler Strategies for Transport Triggered Architectures

Johan Janssen

This thesis presents compiler strategies that make efficient use of the hardware available in processors. The research uses processors based on the Transport Triggered Architecture (TTA). This architecture, developed at Delft University of Technology, is particularly well suited for application-specific processors. The TTA hardware is simple and modular in design, which makes it easy to duplicate components in order to increase the available computational power. The compiler is responsible for assigning hardware to operations and for determining the execution order of the operations. Performance can be increased by executing operations in parallel. The amount of exploitable instruction-level parallelism (ILP) is limited by the amount of hardware and by the required ordering of the operations.

Generating an execution order for operations is called instruction scheduling. In this phase operations are assigned to instructions, and all hardware components except registers are assigned to the operations. The registers are often assigned in a separate phase. The registers hold the values of the variables in an application. Assigning a register to a variable implicitly means assigning a register to all operations that use this variable. An extra complication arises when there are not enough registers to hold all variables. In that case the values of the variables that have not been assigned a register are written to memory and, when needed, read back again. Registers are considered the most difficult hardware components to assign.




When instruction scheduling and register assignment are performed as separate phases, decisions taken in one phase influence the other phase. This dissertation evaluates several strategies. The first strategy, early register assignment, performs register assignment first, followed by instruction scheduling. Because the same register can be assigned to different variables, so-called false dependences arise. These false dependences restrict the instruction scheduler in mapping the ILP of the program onto the ILP of the processor as efficiently as possible. Several strategies exist that try to avoid this by taking potential false dependences into account during register assignment. However, when not enough registers are available, the introduction of some false dependences is unavoidable. For TTAs, early register assignment has a further disadvantage. Because TTAs can transport data directly from one function unit (FU) to another FU, it is not always necessary to store this data in a register. This, however, is only known after instruction scheduling. Early register assignment cannot exploit this optimization and therefore introduces more false dependences and memory transfers than strictly necessary.
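As a purely illustrative sketch (the fragment and its names are invented, not taken from the thesis), the C function below uses the local variable r1 in the role of a machine register that early register assignment has given to two unrelated values; the reuse forces an ordering between computations that are otherwise independent.

    int false_dependence(int a, int b, int c, int d, int e)
    {
        int r1, x, y;

        r1 = a + b;    /* first value defined in "register" r1                */
        x  = r1 * 2;   /* last use of the first value                         */
        r1 = c + d;    /* reuse of r1: this write must wait for the read      */
                       /* above (an anti dependence), although the two        */
                       /* computations are otherwise independent              */
        y  = r1 - e;
        return x + y;
    }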

The second strategy, late register assignment, performs instruction scheduling first, followed by register assignment. Its advantage is that instruction scheduling is not hampered by false dependences. Moreover, during register assignment it is known which transports actually need a register. The disadvantage of late register assignment is that operations are scheduled earlier than strictly necessary. As a consequence, the number of variables that need a register at the same time increases. When there are not enough registers, some values have to be written to memory. This is problematic because it is almost impossible to insert additional operations into parallel TTA code. This dissertation proposes a solution to this problem.

Because both early and late register assignment have problems, this dissertation proposes a method that integrates the two phases. The integrated register assignment method is first applied in combination with a simple instruction scheduler. Besides correctly assigning registers to variables, techniques are presented that insert load and store operations into the schedule during instruction scheduling, for example to compensate for the lack of registers. To validate the method, it is extended with a more aggressive instruction scheduling method that moves operations between basic blocks in order to increase performance. The method is also successfully applied in combination with software pipelining. Software pipelining moves operations across loop iteration boundaries in order to increase the ILP between loop iterations. Several heuristics are proposed and evaluated to improve performance further. For all three instruction scheduling strategies, integrated assignment performs better than all other evaluated register assignment strategies. Especially when the number of available registers is limited, performance increases considerably.
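The core idea can be illustrated with a deliberately small, self-contained C program; it is a toy sketch of the principle only, not the thesis's algorithm, and the lifetimes, register count, and array sizes are invented. While placing results cycle by cycle, the "scheduler" also reserves a register for each result and counts a spill whenever none is free, so the register decision is part of the scheduling decision rather than a separate later phase.

    #include <stdio.h>

    #define NUM_OPS  6
    #define NUM_REGS 2   /* deliberately too few registers */

    int main(void)
    {
        int live[NUM_OPS] = { 4, 4, 4, 1, 1, 1 };  /* cycles each result stays live */
        int release[16]   = { 0 };                 /* registers released per cycle  */
        int free_regs = NUM_REGS;
        int spills = 0;

        for (int cycle = 0; cycle < NUM_OPS; cycle++) {
            free_regs += release[cycle];           /* registers freed this cycle    */
            if (free_regs > 0) {
                free_regs--;                       /* reserve a register now ...    */
                release[cycle + live[cycle]]++;    /* ... and return it when the    */
            } else {                               /* result's lifetime ends        */
                spills++;                          /* no register free: the         */
            }                                      /* scheduler inserts spill code  */
        }
        printf("spilled results: %d\n", spills);   /* prints 2 for this toy input   */
        return 0;
    }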

The registers are grouped in a register file (RF). Exploiting ILP requires simultaneous access to this RF through RF ports. It is, however, expensive to build a large RF with a large number of RF ports. This dissertation proposes an alternative RF architecture: the partitioned RF. In this architecture the RF is divided into a number of smaller RFs. It turns out that, although the same total number of registers and RF ports is available, the total chip area, the access time, and the energy consumption all decrease. However, it is no longer possible to reach an arbitrary register through an arbitrary RF port. To prevent this from causing a large performance degradation, a number of compiler strategies are presented and evaluated. With simple heuristics it is possible to limit the performance degradation to less than 5%. When the saved chip area is used for additional registers and the shorter access time is taken into account, performance even increases.
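The constraint that not every RF port can reach every register is what the compiler must work around. The C function below is a small, self-contained sketch of one plausible heuristic (the names, the bitmask encoding, and the "least loaded" preference are invented for this illustration; the thesis evaluates its own strategies): among the register files reachable from the ports a value is transported over, pick the one with the most free registers.

    /* reachable_mask: bit i set if register file i can be reached through the
       ports used for this value; free_regs[i]: free registers left in RF i.   */
    int pick_register_file(unsigned reachable_mask, const int *free_regs, int num_rfs)
    {
        int best = -1;
        for (int rf = 0; rf < num_rfs; rf++) {
            if (!(reachable_mask & (1u << rf)))
                continue;                          /* RF not reachable          */
            if (free_regs[rf] == 0)
                continue;                          /* RF already full           */
            if (best < 0 || free_regs[rf] > free_regs[best])
                best = rf;                         /* prefer the least loaded   */
        }
        return best;                               /* -1: the value must spill  */
    }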




Curriculum Vitae

Johan Janssen was born in Wamel on the 5th of March, 1968. After finishing the HAVO at the Pax Christi College in Druten, he started studying in Arnhem in 1985. In 1989 he received his B.S. degree in Technical Computer Science from the Hogeschool Arnhem. Subsequently, he studied Electrical Engineering at the Delft University of Technology, where he joined the Information Theory group of Prof.dr.ir. J. Biemond in 1992. From 1991 until 1992 he was a senior graduate student instructor in analogue electronics. In 1993 he graduated cum laude on a master thesis entitled “Calculating Rate-Distortion Functions of Size-Constrained Reproduction Alphabets”. Thereafter, from 1994 to 1998, he was a PhD student in Prof.dr.ir. A.J. van de Goor’s Computer Engineering group at the department of Electrical Engineering of the Delft University of Technology, where he performed the research described in this dissertation. In early 1998 he joined TNO’s Physics and Electronics Laboratory in The Hague, where he is involved in embedded system design and research on networked intelligent devices.

