
Simulating Computer Architectures


Henk Muller


Simulating Computer Architectures

ACADEMISCH PROEFSCHRIFT

ter verkrijging van de graad van doctor aan de Universiteit van Amsterdam, op gezag van de Rector Magnificus

prof. dr. P.W.M. de Meijer, in het openbaar te verdedigen in de Aula der Universiteit

(Oude Lutherse Kerk, ingang Singel 411, hoek Spui), op vrijdag 26 februari 1993 te 11:30 uur.

door

Hendrik Lambertus Muller

geboren te Amsterdam

Amsterdam, 1993.


Promotor: prof. dr. L.O. Hertzberger (Universiteit van Amsterdam)

Faculteit: Wiskunde en Informatica

ISBN 90-800769-4-5. © 1993 Henk Muller. All rights reserved.

Cover design: Martine Bloem.

Miranda™ is a trademark of Research Software Ltd. UNIX™ is a trademark of Bell Laboratories.

Printed at Febodruk, Enschede, Holland.


Contents

Acknowledgements  vi

1 Designing computer architectures  1
  1.1 The design process  2
  1.2 Evaluating computer architectures  4
  1.3 Measuring the performance of an architecture  5
      1.3.1 Building a performance model  6
      1.3.2 Simulating architectures  8
  1.4 An overview of the rest of this thesis  9

I Simulation Tools  11

2 Simulating architectures  13
  2.1 Simulations: a functional description  13
      2.1.1 Demand driven simulation  16
      2.1.2 Continuous time simulation  17
      2.1.3 Discrete time simulation  18
      2.1.4 Event driven discrete time simulator  20
      2.1.5 Summarising the simulation algorithms  22
  2.2 Existing simulation systems  26
      2.2.1 General purpose languages  27
      2.2.2 General purpose simulation languages  28
      2.2.3 Hardware simulation languages  29
      2.2.4 Architecture Evaluation tools  32
      2.2.5 Discussion  36

3 The simulation environment  37
  3.1 Pearl: The simulation language  38
      3.1.1 Objects computations  40
      3.1.2 Communication  42
      3.1.3 The clock  44
      3.1.4 Initialising the architecture  44
      3.1.5 An example program  45
      3.1.6 Comparing Pearl to other languages  47
  3.2 The Pearl kernel  48
      3.2.1 The run time support system  48


      3.2.2 Statistics  50
  3.3 The layer above Pearl  54
      3.3.1 Memory  55
      3.3.2 Cache  55
      3.3.3 Processor  56
      3.3.4 Other library models  58
  3.4 Interfacing Oyster to the outside world  58
  3.5 The status of the current Oyster implementation  60

4 Simulating applications  61
  4.1 Full emulation of the application  62
  4.2 Application derived address traces  64
      4.2.1 Off-line generated address traces  64
      4.2.2 The MiG simulator  67
  4.3 Stochastically generated address traces  75
      4.3.1 Instruction access locality  77
      4.3.2 Data access locality  78
      4.3.3 Processor parameters  78
      4.3.4 Discussion of the stochastical trace generator  79
  4.4 Discussion  80

II Case studies  83

5 Simulating PRISMA’s communication architecture  85
  5.1 The PRISMA architecture and implementation  86
  5.2 The simulation model  90
  5.3 Measurements and Results  91
      5.3.1 Verification  92
      5.3.2 Technology update  93
      5.3.3 Adding the allocation processor  94
      5.3.4 Adding the message processor  95
  5.4 Discussion  96

6 Simulating the Futurebus  99
  6.1 Introduction to the Futurebus cache consistency  101
      6.1.1 Consistency in flat architectures  101
      6.1.2 Consistency in hierarchical architectures  103
      6.1.3 Splitting transactions  104
      6.1.4 Discrepancies between the simulator and the real Futurebus  105
  6.2 The simulation model  106
      6.2.1 The application simulator  107
      6.2.2 The buses  108
      6.2.3 The caches  108
      6.2.4 The shared memory  109
      6.2.5 The topology  109


  6.3 Validation  109
      6.3.1 Single processor, one-level cache validation  110
      6.3.2 Single processor, two-level cache validation  110
  6.4 Varying and tuning the cache parameters  112
      6.4.1 Tuning the associativity  112
      6.4.2 Varying the line size  113
  6.5 Varying the topology  114
      6.5.1 Measuring with a UNIX workload  115
      6.5.2 Measuring with a parallel workload  118
  6.6 Discussion and conclusions  120

7 A Futurebus performance model  123
  7.1 The performance model of a flat system  125
      7.1.1 The performance of cache architectures  125
      7.1.2 Modelling the average number of waiting processors  126
      7.1.3 Modelling the miss rate (m)  127
      7.1.4 Validation of the complete model with experimental data  128
  7.2 Hierarchical architectures  130
      7.2.1 The number of transactions at each level  131
      7.2.2 The miss rates of the caches at the various levels  132
      7.2.3 Putting it together, the performance of multi level hierarchies  138
  7.3 Discussion  139

III Conclusions  141

8 Conclusions: evaluating Oyster  143

Bibliography  145

Index  151

Nederlandse samenvatting  153


Acknowledgements

Many persons have been involved in one way or another in the work described in this thesis, scientists (thanks for the discussions and cooperation) and non-scientists (thanks for the mostly invisible support that eases my work). Li Liangliang, Ina van der Velde, Sjaak Koot, Frank Stoots, Maarten Carels, Rob Milikowski, Rob van Twist, Eddy Odijk, Fred Robert, Loek Nijman, Theodosis Papathanassiadis, Pierre America, Geert van der Heijden, Huib Eggenhuisen, Dolf Starreveld, Wim Mooij, Sun Chengzheng, Ben Hulshof, Henk van Essen, Benno Overeinder, Hans Oerlemans, Ewout Brandsma, Marcel Beemster, Adriaan Ligtenberg, Marius Schoorel, Marnix Vlot, Gert Poletiek, Gert Jan Stil, Martijn de Lange, Esther Rijken, Arthur Veen, Toto van Inge, Juul van der Spek and Laura Lotty, from the research groups and support staffs of the University of Amsterdam, Philips Research Laboratories in Eindhoven, ACE Amsterdam, and Parallel computing BV, thank you all.

Some persons had a special role: my family laid the foundation for this thesis, Bob Hertzberger has guided me to it, Wim Vree, Theun Bruins, Henk Sips and Ad van de Goor were members of the promotion committee, Pieter Hartel and Rutger Hofman made valuable contributions to Chapters 7 and 2, and Koen Langendoen was co-researcher/author of (parts of) Chapters 6 and 4. Thank you as well.

The work was financially supported by Philips Research Laboratories Eindhoven and the SPIN (Stimuleringsprojectteam Informaticaonderzoek).

Henk Muller, February 1993


Chapter 1

Designing computer architectures

Computer architects are in a constant struggle to develop better computers. Each time a new technology becomes available, old computers are redesigned to make full use of this new technology’s potential, and new ideas are developed aiming to outperform all other computers at once. As can be seen today, a great deal of this research has been very successful: the potential of computers has increased dramatically. Computers became faster (every 3 years the speed of computers doubled), memories became larger (every 3 years the size of memories quadrupled), while the physical size of the computer shrank (computers with the same functionality became a factor of 4 smaller every 3 years).

The increasing potential of computer technology confronts computer architects with problems: the complexity of today’s computer systems is such that the design process has to be structured and automated. For this reason the top down design method is adopted, a design method that has proven its value in many engineering disciplines. Top down design, elaborated on in Section 1.1, allows for hierarchical decomposition and leads to a structured design. Computer programs have been developed that aid the designer in using this design method, by taking over many trivial but tedious tasks.

During the design process, the designer needs regular feedback about the consequences of the design decisions. This feedback should tell the designer whether the requirements are still met, so that fundamental design errors are discovered at an early stage of the design. It is one of the major tasks of the design environment to provide this feedback to the designer. Four aspects have to be considered: the functionality, the performance (speed), the price and the physical properties. These aspects are discussed in more detail in Section 1.2.

In the scope of this thesis, we are only interested in estimating the performance of computer designs. There exist two methods to estimate the performance, detailed in Section 1.3. With the development of an analytical model, one can capture the performance in a set of mathematical formulas. The other way is to develop a simulation model of the architecture: the designer programs another computer so that it behaves as the architecture under investigation. Using this program, the performance of the architecture can be measured. The rest of this thesis (outlined in Section 1.4) deals with the simulation of computer architectures in order to obtain performance figures.


1.1 The design process

When designing top down, the designer starts with a global idea and refines the idea step by step ([Goor89, Lewin85], a strategy also known as stepwise refinement): the designer splits the global design into several smaller subdesigns, which are developed separately. This process is repeated, so gradually all details are added to the design. Finally the designer has a “full design”: an unambiguous recipe for what the implementation should look like; implementing such a full design is a purely technical affair. The top down design method is used in many engineering disciplines. Computer architects, airplane constructors, software engineers and architects of buildings all use the same principles. Here the principles are illustrated in the context of the design of a building.

Suppose that an architect has to design an office building for a company with 500 staff members on a site of 4000 m². In the first stage, the architect has to take some fundamental decisions on the structure of the building: will it be tall and high, or low and broad. The second stage will be the design of the skeleton of the building (a general layout). Gradually, the locations of the elevators, staircases and corridors are added, and later on the precise positioning of the rooms. The furnishing of the rooms is detailed in one of the last phases. In parallel with the room placement and furnishing, the structure of the floors and walls (the thickness of the concrete, the placement of the rods) can be decided on. The final design delivered by an architect is the “specifications”: the recipe for the builder to erect the building.

Because of the top down design strategy, the subdesigns are independent of each other. The furnishing of the corridors, rooms and toilets may be performed independently. This independence is one of the greatest merits of top down design: it allows architects to concentrate on specific aspects, and allows multiple architects to work on the same building. One should be aware that aspects can never be separated completely; at some point, various subdesigns are related to each other. Sometimes this relation is obvious (for example the connections between rooms, corridors and elevators), but sometimes the relation is not clear: although the positioning of rooms is decided separately per floor, toilets should not be placed above meeting rooms because of the noise.

A straightforward top down design might lead directly to the envisaged result, but unfortunately many things can go wrong. Suppose for example that while furnishing the rooms, the architect discovers that the walls are 5 centimetres too short to fit a standard sized cabinet. At that moment, the architect has to find a solution, for example by reconsidering the location of the rooms. A more disastrous example: during the design of the foundation, it may turn out that the soil is too soft to carry the building. In that case the design has to be revised completely, in order to prevent the tower of Pisa from being rebuilt. This shows that the architect might have to go back to an earlier stage of the design, and modify that stage because the chosen solution did not meet the requirements. After redesigning that earlier stage, all later stages have to be reconsidered as well: when the rooms are situated differently, the air-conditioning and the electrical power supplies have to be redesigned.


[Figure omitted: a design environment in which four transformation tools and three evaluation tools connect successive design stages, from the initial Idea to the Full Design, all under the control of the Architect.]

Figure 1.1: The design environment surrounded by architect and designs.

The larger the number of steps backward, the larger the total delay in the design: taking the right decision at once is one of the characteristics of a good architect.

There is no essential difference when designing computers. The computer architect starts with a global idea about the computer, in terms of hardware (for example processing elements, memories, and interconnections) and software (for example applications, programming model and execution models). This idea is then gradually developed towards a final implementation, which consists of chips mounted on boards, peripherals and a power supply, plus a bundle of software: compiler, operating system and application program.

Let us consider only one part of the design, the development of one of the processing elements on a single chip (note that in this example the chip is designed completely separately from the other hardware and software components, which is fortunately not the case for real designs). In this example, four steps are distinguished in the development of a chip [Horbst87]. In the first step, the chip is defined in terms of registers, interconnected with combinatorial logic, a level of detail referred to as the register transfer level. In the second step, the logic is further detailed in terms of basic gates performing nand and nor functions on boolean values. In the third step, the gates are specified in terms of transistors that can be open or closed, a level referred to as the switch level. The full design (the unambiguous recipe) consists of the mask specification of the chip. It tells exactly how a silicon wafer should be processed in order to produce the chip. Any chip manufacturer with the “know how” can produce this design.

The first computers were entirely designed by hand, but designing larger and more complex computers by hand is an impossible task. As an example, consider that someone has to design the final stage of a chip (the mask level) in full detail by hand. At present, a chip has at least 20 different masks, each of the masks consisting of about 5 million rectangles. The total mask definition thus consists of at least 100 million rectangles. Drawing these by hand is clearly an impossible task. When chips are to be integrated in a larger system, the designer has to consider the requirements of the rest of the system as well. For these reasons design environments were developed to assist the computer architect in the design process [Weste85, Rubin87].

An example design environment is shown in Figure 1.1. This environment consists of seven tools in total: four transformation tools and three evaluation tools.


The transformation tools help the architect transform a higher level design into a lower level design. Especially at the lower levels of the design, very good tools exist that automatically generate the mask layout from the gate or transistor level description. The transformation tools relieve the architect from thinking about obnoxious details, and prevent errors in the design: when the tool functions correctly, a gate design will be translated into an equivalent mask design, which is not guaranteed if the architect does it by hand.

The evaluation tools aid the architect in judging the quality of the design. With the evaluation tools, the architect can substantiate or falsify claims about the proposed architecture. When a choice between several alternatives has to be made, the architect does not have to rely purely on intuition and rules of thumb (as was the case 15 years ago), but can use a tool to verify some of the intuitions. Note that the design process is still controlled by the architect; the tools only aid the architect.

In this thesis the transformation tools are not considered; we are only interested in the evaluation of architectures.

1.2 Evaluating computer architectures

The quality of a design, which has to be measured by the evaluation tools, is influenced by many factors. When evaluating a computer architecture, there are four important aspects to judge: the functionality, the price, the physical properties (size, power dissipation) and the speed of the architecture [Hennessy90].

The most fundamental demand on a computer architecture is the functionality. Consider for example a simple pocket calculator, with the basic arithmetic operations and a range of −10^99 to 10^99. The calculator should be able to add, subtract, multiply and divide numbers in this range correctly: 3.5 + 1.2 = 4.7, any other answer is incorrect. A computer that does not function correctly is almost worthless; almost, because many computers are sold with a few design bugs in them. Because of the importance of the functionality, many tools have been developed that can be used to evaluate the functionality. The most popular way is to provide the possibility to test the computer architecture with some input values. The architect can compare the output of the simulator with the expected output, and can so reduce the number of errors in the architecture. Testing cannot give a decisive positive answer: if all test runs performed well, there might still be a bug in the design. Total correctness can be accomplished by tools working with formal specifications and transformation rules: computers developed with these tools work according to their specifications. The formal specification tools are getting more powerful, and gaining popularity.

The price is not only interesting from the user’s point of view (the dollars paid for the machine), but also from the architect’s point of view. The price as seen by the architect is fully determined by the number of transistors, pins, and chips, and by the type of technology used. It is of interest because a comparison between architectures is only fair when the prices are roughly equal: it is not hard to make an ultimately performing computer using unlimited resources.


Transistors can be used only once for the same price; it is the task of the architect to decide whether to use the transistors for a cache, for a register bank, or for a floating point multiplier. Some tools exist that estimate the cost of a design, but most of the time the cost estimation is left to the intuition of the designer.

The physical properties, such as the size or the power consumption, are important in certain application areas: computers for television sets should be small, office computers should not have extreme power demands, but for a supercomputing centre neither the size nor the power dissipation is of interest. Some tools exist that dare to make a prediction of the size and other physical properties, but as with the price, it is left to the intuition of the architect most of the time.

The last criterion mentioned in the list above is the speed of the computer. Many application areas (most notably the real time applications) place constraints on the speed: a real time video decoder should process a frame in 40 ms. There is no need for a faster computer: it does not matter whether the video frame is decoded in 40 ms, 10 ms or 5 µs. For other application domains, a faster computer is better: a computer may need hours to predict the weather, and reducing this time allows for better models to be developed. It is of utmost importance for the architect to ensure that the minimal performance requirements are met: the video decoder simply will not work when it needs 41 ms, and a weather predicting computer needing 48 hours to predict the weather of the next day has no value either. Many tools exist that provide the architect with an estimation of the performance that can be expected from the architecture. Unfortunately, most of these tools consider the elements of the architecture separately, which does not provide a performance figure for the architecture as a whole.

These four aspects may be measured separately, but the architect should consider the figures in combination and in relation to the requirements. When the architect is designing a supercomputer, the performance and functionality are important, while the price and physical properties are less relevant; when designing a computer for a washing machine, it is the other way around. In this thesis we concentrate on tools to evaluate the performance of computer architectures; the other aspects are simply ignored.

1.3 Measuring the performance of an architecture

In the example of the video decoder above, the second is used as the performance measure: 0.040 s is good, 0.041 s is too slow. The execution time is the simplest performance measure, and incorporates every detail of the computer system: the application software, the system software including the quality of the compiler, and the hardware. For the user this is the essential measure, it tells how long a certain task takes to complete. For the architect it is not always desirable to have a measure that takes everything into account, because the architect might want to abstract from some aspects, in order to evaluate ideas separately from specific application programs or technologies.

Two commonly made abstractions are the use of benchmarks to abstract from the application, and the use of MIPS rates to abstract from the technology.


Using these abstractions is a dangerous practice because it implicitly assumes that the performance of an architecture is composed of several orthogonal components: in this case MIPS rates and application programs. It is simple to falsify this assumption: there is no “average instruction”. Even on the same processor, two programs might run with a completely different MIPS rate (because the two programs use a different subset of the instructions, perform different amounts of loads or have a different cache hit rate). For the same reason, the performance figures measured with a benchmark program cannot be generalised to performance figures of application programs: although computer A might execute a benchmark 10% faster than computer B, other applications may run 50% faster on computer B. The use of other performance characterisations such as MFLOPS or SPECmarks has the same drawback.

It is, however, not by definition incorrect to use MIPS rates or benchmark programs. When comparing identical processors under the same load, implemented with different technology, the MIPS rate is a reliable measure. For a special purpose architecture, it is correct to use the intended application as benchmark. And when one wants to have a performance figure of a general purpose machine (like the SUN on which I am typing this text, LaTeXing the text, and running the experiments), the use of a number of programs as a benchmark cannot be avoided. But the architect should always be aware that MIPS rates, benchmark ratings and other abstractions have a limited value, and that the user is eventually only interested in the bare speed, in seconds.

One can measure the performance of a computer system by using a (high resolution) clock, but in the scope of this thesis, only designs of computer systems are considered. To predict the performance of designs, two methods are commonly used: with the help of a performance model of the architecture, or with the help of a simulation model of the architecture [Jain91, Sauer81]. Both ways require the construction of a model, but the natures of these models have nothing in common. A performance model is a set of formulas that relate the performance to the parameters of the architecture (for example the cache size and bus speeds); the simulation model is an executable program that behaves as the architecture, and the performance figure is obtained by measuring the time needed in the simulation to perform a task. Both methods are elaborated on below.

1.3.1 Building a performance model

A performance model is a set of equations that relate the parameters of the architecture (as for example the basic cycle time of the processor, and the network delay) to its performance. As a simple example of a performance model, the instruction rate of a processor can be modelled. Assume that the processor is running with a clock with cycle time C and that the processor needs i clock cycles per instruction on average. The processor is connected to zero delay memory.


The instruction rate I is defined by:

I = 1 / (iC)

This is a very simple model, so let us extend it with pipelined instruction accesses. The processor addresses the instruction, and expects the answer after p clock ticks (p < i). If there is no answer at that moment, the processor is stalled to wait for the result. The instruction rate is then defined by:

I = 1 / (iC)                      if pC > M
I = 1 / (C(⌈M/C⌉ - p + i))        if pC ≤ M

where ⌈x⌉ denotes the smallest integer greater than or equal to x and M denotes the response time of the memory. This formula is an abstraction of the reality; many aspects of the architecture are omitted: data accesses, possible caches, and different instruction types, for example. In general, the modeller chooses a level at which to define the model. This can either be a low level (with many detailed parameters), or a high level (with a few parameters). A low level model is rather complex and has many parameters, but the parameters are often quite well defined (clock cycle time, pipeline length). A high level model is much simpler (fewer parameters), but the meaning of the parameters is often quite complex, like the miss rate of a cache, which depends on the application behaviour and the cache parameters. The modeller has to compromise between these two: a level of detail has to be chosen.
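Such a model can be exercised directly as a function. The following Miranda fragment is a sketch added here purely for illustration (the names upint and instrate do not occur in the thesis); it assumes the standard Miranda function entier, which rounds a number down to an integer:

> || Illustrative sketch of the pipelined performance model above.
> || upint x is the smallest integer greater than or equal to x.
> upint :: num -> num
> upint x = entier x,     if entier x = x
>         = entier x + 1, otherwise
>
> || c = cycle time C, i = cycles per instruction, p = pipeline ticks,
> || m = memory response time M
> instrate :: num -> num -> num -> num -> num
> instrate c i p m = 1 / (i * c),                     if p * c > m
>                  = 1 / (c * (upint (m/c) - p + i)), otherwise

For example, instrate 1 4 2 5 evaluates to 1/7: the memory needs ⌈5/1⌉ = 5 cycles, only 2 of which overlap with the pipeline, so every instruction costs 4 + 3 cycles.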

Because the model is only an abstraction of the reality, a performance model should be validated [MacDougall87]. Validating performance models is a hard task that is easily skipped. There are two popular ways to validate performance models. Firstly, the model can be analysed under extreme circumstances by calculating the limits, for example for C → ∞, C ↓ 0, M → ∞, and M ↓ 0. The resulting models (in this case 1/(iC) (which equals 0), 1/M, 1/M (which equals 0), and 1/(iC) respectively) are simpler, and should be validated in turn (in this case, the four limits are clearly correct). Secondly, the model can be calibrated by setting the architecture parameters to values of an existing architecture, and by comparing the predicted performance with the performance of the built computer. Neither of these two methods guarantees that the model is correct, but one can surely check the sanity of the model.
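As an illustration of the first method (this little derivation is added here and is not spelled out in the text), the least obvious limit, C ↓ 0, follows from the second case of the model above:

1 / (C(⌈M/C⌉ - p + i))  →  1 / (C · M/C)  =  1 / M     as C ↓ 0,

since C⌈M/C⌉ → M while C(i - p) → 0.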

Performance models suffer from three serious drawbacks. Firstly, the parameters of the architecture are most of the time too abstract. In the example above, only the cycle time and the memory delay are used, both well defined and exact values. But the example omits the calculation of the number of seconds that an application needs to complete. This requires a multiplication by the number of instructions of the application. But when “the application” is a parameter of the model, the modeller should thus provide a function that calculates the number of instructions of an application program, which is pretty hard. When it comes to, for example, the dynamic usage of a pipeline, modelling gets really complex.

The second drawback is related to the first one, but concerns the topology of the architecture.


A slight change in the architecture (as for example the introduction of some extra delay) can be captured easily in the model most of the time, without having to start the modelling all over again. But changing the topology of the architecture, for example by placing an extra bus between the processor and the I/O, might require a complete reconstruction of the performance model. This is because the topology of the architecture is implicit in the structure of the model; it is not a parameter of the performance model.

Thirdly, a performance model always makes statistical assumptions about the dynamic behaviour of the architecture. In the example above, it is assumed that the processor needs i cycles per instruction on average, but nothing is stated about how the average was measured. The performance model is a static model of the architecture.

All three drawbacks can be alleviated by making a simulation model of an architecture.

1.3.2 Simulating architectures

A simulation model of an architecture is a program that can be executed on a computer (called the simulation platform). The program makes the simulation platform behave as the architecture under study. The first architecture, for which a performance model was constructed in the previous section, can be simulated with the following model:

repeat
    execute an instruction
    wait iC seconds
until program ready

When incorporating the delay of the memory for instruction accesses, the simulation model extends to:

Processor:                          Memory:
repeat                              repeat
    address memory                      wait for address
    wait pC seconds                     wait M seconds
    wait for memory value               pass data
    execute an instruction          forever
    wait (i - p)C seconds
until program ready

The left code fragment simulates the processor, the code at the right hand side simulates the memory. The wait statements wait until a certain amount of time has passed. Because the simulation model should abstract from the speed of the computer executing the simulator, the simulator works in its own time framework, called the virtual time [Jefferson85] (as opposed to the real time, which is the time of the real world). When running the program above, the virtual time will typically run slower than the real time: the execution of the loop will take microseconds of real time, while the virtual time is only increased by nanoseconds.


When simulating a slowly evolving process, such as the weather, the virtual time runs faster than the real time (one is predicting the future). It is also possible to bind the real time to the virtual time, which is for example the case in flight simulators, where the pilot is trained in real time.
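As a minimal sketch of this idea (added here for illustration; none of the names below appear in the thesis), virtual time can be represented as a plain number that is threaded through the simulation, so that a wait statement merely adds to it, independently of how long the host computer takes to evaluate the expression:

> vtime == num                    || virtual time, in (virtual) seconds
>
> || 'Waiting' for d seconds only advances the virtual clock.
> wait :: num -> vtime -> vtime
> wait d t = t + d
>
> || One iteration of the simple processor loop above, with cycle time c
> || and i cycles per instruction.
> iteration :: num -> num -> vtime -> vtime
> iteration c i t = wait (i * c) t

Iterating such a function and dividing the number of iterations by the resulting virtual time gives exactly the measurement described next.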

The instruction rate of the processor can be measured using this simulation model by counting the number of iterations of the processor, and by dividing it by the virtual time needed for these iterations. Note that the three drawbacks of performance models are indeed absent in the simulator. The application is simulated as well, the topology is explicitly in the simulation model, and the dynamic behaviour can be accounted for: the instruction timings (i - p)C may depend on the executed instruction.

Like a performance model, a simulation model can be implemented at any level of abstraction. The lower the level of abstraction, the more details are accounted for, the more effort is needed to construct the simulator, and the higher the accuracy. In contrast with performance models, lower level simulation models need considerably more run time than high level models. A difference of a factor of a million is no exception.

Like performance models, simulation models need to be validated. Besides a validation of the performance figures coming from the simulator, the functional behaviour of the simulator should be validated as well. As an example, an architecture designed to generate all prime numbers should output the list 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, ... The same methods are used as for the validation of performance models: test runs are made to study the output of the simulator, and the simulator can be stressed by setting the parameters to extreme values.

In contrast with a performance model there is one significant drawback in using a simulator: a simulator provides a performance figure for one specific setting of the architecture parameters; by experimenting with one or more parameters, one can plot the relation between the architecture parameters and the performance. A performance model provides an analytical relation between the architecture parameters and the performance, which has more value. A simulation can thus never replace a performance model, nor can a performance model ever replace a simulation.

1.4 An overview of the rest of this thesis

This thesis deals with the simulation of architectures in order to get performance figures. The rest of the thesis is split into three parts, presenting the tools, two case studies and the conclusions respectively.

The principles of simulators are explored in more detail in Chapter 2. Firstly, four simulation algorithms are specified in a functional language (Miranda), and compared with each other. The comparison shows that the four algorithms have increasing efficiency and increasing power, but are increasingly error-prone as well. Secondly, a comparison of existing (architecture) simulation systems is presented. This comparison concludes with a list of requirements for architecture simulation systems. In Chapter 3 a description of “Oyster” is given, a tool developed at the University of Amsterdam to simulate and evaluate the hardware of an architecture.


Simulating the hardware is only one part of architecture evaluation. Integrated with it, the software has to be simulated as well. Chapter 4 discusses three ways to simulate applications: by means of a complete emulation of the software, by abstracting from the execution of the application by using an address trace of a real application, or by abstracting from the trace of a real application by using a synthetic trace.

Oyster is an experimental simulation system that has been used in evaluating four architectures: the simulation of the G-Line [Milikowski91, Hendriksen90] and G-Hinge [Milikowski92, Gijsen92] architectures, the simulation of the communication architecture of the PRISMA machine [Apers90, Muller90], and the simulation of the Futurebus cache coherency protocols [Futurebus89, Langendoen91, Muller92b]. The last two of these evaluation studies are presented in Part II. In 1989, the communication architecture of the PRISMA machine was simulated, in order to find the performance bottlenecks, and to analyse the benefits of introducing special hardware to overcome these bottlenecks. This experiment is described in Chapter 5: the full PRISMA architecture is simulated, and both the application and the underlying hardware are simulated at the appropriate levels. The PRISMA machine has been built, providing a reference point for the simulation. The second case study is presented in Chapter 6. In that experiment, the Futurebus cache coherency protocols are studied in the context of shared memory architectures with hierarchical caches. The Futurebus is an IEEE bus standard that defines, amongst others, a cache coherency protocol. The simulations were performed in order to find out for how many nodes an extra level in a hierarchy of busses pays off. During the analysis of the simulation results, a performance model of hierarchical architectures based on the Futurebus cache coherency scheme has been developed, which is presented in Chapter 7. This performance model can be used to validate the simulation results and to extrapolate the results to larger architectures and more levels.

The last part of this thesis comprises a concluding chapter about the usability of a simulation system, and especially an evaluation of the design choices made in Oyster. The two case studies cover different topics of computer architecture, while experiments performed by others cover a different level of abstraction. This illustrates that Oyster is useful for more than one class of architectures, and for more than one level of detail.


Part I

Simulation Tools


Chapter 2

Simulating architectures

Simulations are used in many disciplines. In the physical, economic and social sciences, for example, researchers sometimes rely on simulations to verify theories, or to predict future developments. Computer scientists are involved in simulations from two points of view: on the one hand they develop simulation systems, and on the other hand they use simulation systems to solve their own problems.

One of the problems in computer science is the simulation of new computer architectures in order to validate claims about the correctness and the performance of these architectures. If a simulation shows that the architecture is not correct, or does not meet the performance requirements, the architect can correct the architecture, or redesign part of it to improve the performance.

Many tools for the simulation of computer architectures have been developed over the past decades. Some of these tools are targeted at simulation in a specific domain (for example DSPs or cache architectures), while others are general purpose simulators that can be used for any architecture. Despite this rich variety of simulators, they are all based on the same simulation principles. In Section 2.1 a functional description of these basic simulation principles is presented. Section 2.2 gives an overview of the tools that are in use to simulate computer architectures. The overview is necessarily incomplete (since as many simulation tools exist as there are computer scientists), so the tools are classified, and some representative examples from each category are presented. The chapter ends with a discussion of the positive and negative aspects of the various simulation systems.

2.1 Simulations: a functional description

The functional paradigm is useful for describing algorithms. A functional description is concise, and it allows for easy reasoning about the correctness, deadlock freedom and efficiency of the algorithm [Kelly89, Sijtsma89]. In this section four flavours of simulation algorithms are described using Miranda, a functional language described in [Turner90]. The four algorithms have increasing efficiency and expressiveness, as is discussed in Section 2.1.5.

Throughout the section, the Flip-Flop is used as a running example. The Flip-Flop is one of the elementary circuits that can be used for data storage.


[Circuit diagram omitted: two cross-coupled nand gates with inputs Set and Reset and outputs Q and Qb.]

    Set  Reset |  Q   Qb
     1    1    |  Q   Qb
     0    1    |  1   0
     1    0    |  0   1
     0    0    |  1   1

Figure 2.1: The Flip-Flop circuit and truth table.

The schematics of the Flip-Flop are depicted in Figure 2.1, together with its truth table. Although a Flip-Flop is not complex at all (and is not a typical example of architecture simulation), it possesses all the hard problems: it contains a loop, it contains state, and it will not stabilise unless used in the right way. The Flip-Flop has two inputs labeled Set and Reset, and two outputs, Q and Qb. For a stable situation both Set and Reset are kept high, giving a high signal on one of the outputs and a low signal on the other: Set = Reset = 1 ⇒ Q = ¬Qb. Driving Set low for a while causes Q to become high regardless of its previous state. Driving Reset low for a while causes Q to become low (and Qb high). Driving Set or Reset low for a very small period of time brings the Flip-Flop into an oscillating state for an indefinite period of time: it causes a short pulse to start racing through the two nand gates.

In Miranda (see Figure 2.2 for a short explanation of how to read Miranda programs), we represent the high and low signals by H and L respectively, while X stands for an undefined signal (which can be either high or low). All following program fragments use the following definition of the type threestate and the definition of the nand characteristics:

> timestamp == num                 || The time is stored as an integer
> threestate ::= X | L | H         || undefined, low, high
>
> nandfun :: threestate -> threestate -> threestate
>
> nandfun H H = L                  || The only way to get 'Low' out of a nand
> nandfun L x = H                  || 'Low' on left port forces a high output
> nandfun x L = H                  || 'Low' on right port forces a high output
> nandfun x y = X                  || all other inputs result in undefined

The function nandfun takes two values of the type threestate and produces a value of the type threestate. The function nandfun applied to two High values (H H) results in a Low output (L), a Low value on the first or the second parameter results in a High output, while all other inputs (for example H X) result in undefined, X.
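A few test expressions (hypothetical names, added here only as an illustration) make this concrete:

> || Hypothetical test values illustrating nandfun:
> t1 = nandfun H H       || evaluates to L: both inputs are high
> t2 = nandfun L X       || evaluates to H: a low input forces the output high
> t3 = nandfun H X       || evaluates to X: the output cannot be decided yet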

In all Miranda fragments below the Flip-Flop is simulated, under the assumption that the nand gates introduce a fixed delay of 3 time steps.


With the help of a few examples, the basic notations of Miranda are explained in this intermezzo. Refer to [Bird88] for a full introduction to functional programming, and to [Turner90] for the definition of Miranda.
The Miranda function to compute the factorial of a number is defined by:

fac n = 1, if n = 0
      = n * fac (n-1), otherwise

which means that the factorial of n is defined as 1 if n equals 0, or n times the factorial of n-1 otherwise. The arguments of a function are separated by spaces, both when defining and applying functions; the brackets around n-1 are necessary only because function application has a higher priority than subtraction. In the following example the same function is defined using pattern matching instead of the if. Also a where-clause is used to define a local function:

fac 0 = 1
fac n = n * fac nminus1
        where nminus1 = n-1

The function nminus1 (which has no parameter in this example) is defined in the scope of fac: it is only visible inside this function definition.
The list is a standard Miranda data structure to store an ordered collection of items of the same type. A list is constructed with colons, and terminated with [] (the nil element):

2:3:5:7:[]

is the list with the first 4 prime numbers ([2,3,5,7] is an alternative notation for the same list). An exclamation mark is used to select an element:

[2,3,5,7]!2

yields 5 (counting starts at 0).

To construct lists, one can use a list comprehension:

[ func x | x <- somelist ]

This notation is analogous to the mathematical notation {f a : a ∈ S}. The function func is applied to each element of the list somelist, and the output elements are placed in a new list.
The programmer can freely design other data structures. The declaration

record ::= SomeRec num num char

declares a type record and a constructor SomeRec with three fields, two numbers and a character. The expression SomeRec 1 4 'h' creates an instance of record. Alternatives are specified with a bar:

threestate ::= High
             | Low
             | Undef

declares a type threestate which is either a High, Low or Undef. A list-type is specified with brackets: [threestate] denotes a list of threestate. Type synonyms are declared with an '==':

state == [threestate]

declares the type state that is a synonym for [threestate].
Functions are typed using the ->. A function mapping a number to a character has the type num -> char. A function mapping two characters onto one number has the type char -> char -> num (which should be read as a function mapping a single character to a function mapping a character onto a number). The type of functions is declared in Miranda using a double colon: the type declaration of the function fac would be

fac :: num -> num

By default the types of functions are inferred from the program context.

Figure 2.2: Intermezzo: a short introduction to Miranda


[Timing diagram omitted: the waveforms of Set, Reset, Q and Qb over the time interval 0 to 40.]

Figure 2.3: The inputs and outputs of the examples, the gray parts denote an undefined signal.

In the example programs, the Set and Reset wires are driven by the two clock signals that are depicted in Figure 2.3, together with the expected output signals on Q and Qb.

2.1.1 Demand driven simulation

The most elementary way to describe the circuit is by defining the recursion equations as done in Figure 2.4. q, qb, set and reset represent the state of the four wires Q, Qb, Set and Reset. The state on the wires at a moment t in time is defined in terms of the output of the nand gate at time t, while the nand-function at time t depends on the state at the input wires at time t - delay. When the simulator is asked for the state of Q at time 40, all states of Q and Qb are recursively calculated until the moment the simulator comes to a well defined state (when one of the input signals is low, the output of the nand is fixed regardless of its previous state), or until time zero, when the signal is X (by definition).

> wire == timestamp -> threestate
>
> nand1 :: wire -> wire -> timestamp -> threestate
>
> delay = 3
> nand1 x y t = nandfun (x tbefore) (y tbefore), if t > 0
>             = X, otherwise
>               where
>               tbefore = t - delay
>
> q, qb, set, reset :: wire          || Definition of the wires
>
> q  = nand1 set qb                  || Define upper nand
> qb = nand1 reset q                 || Define lower nand
>
> set t   = L, if t mod 28 < 7       || This defines a clock:
>         = H, otherwise             || 7 ticks low, 21 ticks high
> reset t = set (t+14)               || reset is shifted set.

Figure 2.4: The source code in Miranda for a demand driven simulator.


In this example, the calculation of q 40 requires the values of set and qb of three steps earlier (the delay of the nand gate): the value of set 37 (which is H) and the value of qb 37. To calculate this, the value of reset 34 (H) and the value of q 34 are required. This depends on the value of set 31, which is L; consequently q 34 equals H, qb 37 equals L, and q 40 thus equals H, which is the answer.

This simulation scheme has three drawbacks. Firstly, it outputs only the state at a given moment in time. This implies that it is not possible to answer the question “What is the first time that Q will become high after time T?” directly; only an exhaustive search can provide the answer (see the sketch below). Secondly, the recursive calculation places a huge demand on the memory, since a whole stack of calculations is built before they are evaluated. Thirdly, in a more complex circuit, where Q is used in more than one place, Q will be recalculated each time, leading to an exponential time consumption. All three drawbacks can be relieved by reversing the order of computations, thus by starting with the state at time zero, and by processing forward in time. By memorising the old states in (lazy) lists, states in the past can be referred to. There are two ways to maintain these lists: with implicit timing information, giving a continuous time simulation, or with explicit time stamps, resulting in a discrete time simulation. Both methods are shown in the next sections.
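Such an exhaustive search could look like the following sketch (added here for illustration; firsthigh is not part of the thesis): it simply probes the wire at successive time steps, and it does not terminate if the wire never becomes high.

> || Hypothetical helper: the first time at or after t at which wire w is high.
> firsthigh :: wire -> timestamp -> timestamp
> firsthigh w t = t,                 if w t = H
>               = firsthigh w (t+1), otherwise

For instance, firsthigh q 20 would evaluate q 20, q 21, ... in turn, recomputing the whole history of the circuit for every probe.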

2.1.2 Continuous time simulation

A simulation of the Flip-Flop with an implicit continuous time increment is described in [Vree89]. The wires are represented by infinite lists (streams) of states. The ith element of such a list represents the value of the wire at time i·Δt, where Δt is the time increment. The components are modelled by functions working on these lists, as synchronous processes [Kahn74]. The nand-processes of the Flip-Flop thus take two lists of states as input parameters and produce a list of states as outputs. The source code for a continuous simulation of the Flip-Flop is shown in Figure 2.5.

The nand function nand2 consumes states from two input streams, and produces the output stream with the (earlier defined) function nandfun. The nand function starts with three undefined states on the output list (X:X:X), to model a delay of three time steps. As before, the four wires are named q, qb, set and reset, and are connected by means of two nand gates. By providing two input lists for the Set and Reset, the program computes the output values on the streams Q and Qb:

> q  = [X,X,X,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,L,L,L,L,L,L,L]
> qb = [X,X,X,X,X,X,L,L,L,L,L,L,L,L,L,L,L,H,H,H,H,H,H,H,H,H,H]

Although the infinite lists q and qb are mutually dependent, the algorithm does not deadlock because the start elements of the lists are defined (X:X:X). In terms of the productivity theory of [Sijtsma89], the function nand2 is +3-productive, so a (cyclic) network of nand2 functions is productive (which means that the network will keep producing elements as long as input is provided on the input lists).


> nand2, nand2' :: [threestate] -> [threestate] -> [threestate]
>
> nand2' x  []         = []                              || terminate
> nand2' [] x          = []                              || terminate
> nand2' (x:xs) (y:ys) = nandfun x y : nand2' xs ys      || apply nand
> nand2  xs ys         = X:X:X: nand2' xs ys             || delay 3
>
> q, qb, set, reset :: [threestate]
>
> q  = nand2 set qb                  || Define the Flip-Flop
> qb = nand2 reset q                 || circuit
>
> set   = [L,L,L,L,L,L,L,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H,H]
> reset = [H,H,H,H,H,H,H,H,H,H,H,H,H,H,L,L,L,L,L,L,L,H,H,H,H,H,H,H]

Figure 2.5: The Flip-Flop in a continuous time simulator.

An essential property of continuous time simulators is that the time step is constant: all processes are synchronous and the lists of states are produced synchronously, implying that all components of the simulation use the same time step. This type of simulator is used in simulations where the state changes continuously in time, but where the state is necessarily modelled with small discrete steps. An example given in [Vree89] is the simulation of water heights; electrical currents or the positions of planets in a solar system can be simulated with the same technique. Most of these problems do not have discrete state values, like the X, L and H used before, but a real value, estimated by a floating point number.
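
To make this concrete, the fragment below sketches such a fixed-step simulation of a real-valued state, here an RC circuit charging towards its input voltage. It is written in Haskell (close to the Miranda used elsewhere in this chapter), and the component and the constants (capacitor, rc, dt, the 5 V step) are purely illustrative assumptions, not part of the simulators discussed above.

> dt :: Double
> dt = 0.01                                -- the fixed time step
>
> -- One component: map the input-voltage stream to the output-voltage stream,
> -- advancing the state by v' := v' + dt/rc * (v - v') at every tick.
> capacitor :: Double -> Double -> [Double] -> [Double]
> capacitor _  _  []     = []
> capacitor rc v0 (v:vs) = v0 : capacitor rc (v0 + dt / rc * (v - v0)) vs
>
> vIn, vOut :: [Double]
> vIn  = replicate 100 0.0 ++ repeat 5.0   -- a step from 0 V to 5 V at tick 100
> vOut = capacitor 0.5 0.0 vIn             -- the simulated output wire
>
> main :: IO ()
> main = print (take 5 (drop 100 vOut))    -- a few samples just after the step

As in Figure 2.5, the component is a synchronous function from an input stream to an output stream, and the ith element of every stream is the value of that wire at time i·Δt.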

In computer architecture, continuous time simulations are only applied at the lowest levels, where electrical currents are modelled. Modelling digital circuits or any higher level description of an architecture with a continuous simulation model results in a waste of computing power, since it is not necessary to recalculate the state of the wires continuously: stable parts of the architecture need not be simulated. This feature is provided by a discrete time simulator.

2.1.3 Discrete time simulation

To define a discrete time simulation, streams are again used to model the state of a particular wire in the circuit. In contrast with the continuous simulator, the functions do not operate synchronously on these streams, but asynchronously. The processes may consume an element from one input list without consuming an element from the other input lists. Because the streams modelling the states of the wires thus proceed at different speeds, the time is not implicitly available, and must be made an explicit part of the stream.

In Figure 2.6 the Flip-Flop is specified with a discrete time model. The stream modelling a wire consists of data structures of the form Until T V. The meaning of this data structure (called a tuple for short) is that the wire will be in state V


> state ::= Until timestamp threestate        || The state of a wire
>
> nand3' :: [state] -> [state] -> [state]
> nand3  :: [state] -> [state] -> [state]
>
> delay x = x + 3                              || The delay of a nand
>
> nand3' [] ys = []                            || Terminate
> nand3' xs [] = []                            || Terminate
> nand3' xs ys = Until (delay yt) value : nand3' xs y,  if yt < xt
>              = Until (delay xt) value : nand3' x ys,  if xt < yt
>              = Until (delay xt) value : nand3' x y,   if xt = yt
>                where
>                (Until xt xv):x = xs
>                (Until yt yv):y = ys
>                value = nandfun xv yv
>
> nand3 xs ys = Until (delay 0) X : nand3' xs ys
>
> q, qb, set, reset :: [state]
>
> q     = nand3 set qb                         || The upper nand
> qb    = nand3 reset q                        || The lower nand
> set   = [Until 7 L,  Until 28 H ]            || A clock period
> reset = [Until 14 H, Until 21 L, Until 28 H ]

Figure 2.6: The source code for a discrete time simulator.

until time T; at time T the state changes instantaneously into the value described by the next tuple. By definition, the time stamps of the tuples in a stream are increasing: the time proceeds forward. The nand function operating on these streams generates an output tuple for the state up to the lowest time stamp in its input streams. Since all other streams have a higher time stamp, the states of all these streams are defined by their first tuple. By adding a constant to the time stamp of the output tuple, the delay of the nand-gate is modelled. In contrast with the continuous time simulator, the lists consist of the changes only: the length of the lists does not depend on the granularity of the time. This implementation is slightly different from the implementations of [Bevan86] and [Joosten89], who have to pass a state parameter around, since their tuples are defined as From T V.

The correctness of the discrete time simulator is not evident. To prove it, it has to be shown that the simulator does not deadlock, generates correct outputs, and makes progress in time. The simulator will not deadlock because of the following invariant: the consumption of a tuple on an input stream will always result in a new tuple on the output stream. By using the theory of [Sijtsma89] again, the function nand3 is +1-productive, so the simulator will not deadlock.


The simulator generates correct output values as long as all streams are ordered on their time stamp. It is easily proven from the definition of nand3' that when the time stamps on the input lists are increasing, the output lists are ordered as well; hence, all streams are ordered. Because a non-zero delay is added to the time stamp, the simulator makes progress in virtual time.

The invariant “each consumed tuple produces an output tuple” is also the weakness of the approach. Due to the loop in the definition of the Flip-Flop, tuples with increasing time stamps but identical state will start racing around Q and Qb, while the state of the circuit does not change. This is best observed in the calculated value of q (which is plotted in Figure 2.3, page 16):

> q = [Until 3 X,  Until 6 H,  Until 9 H,  Until 10 H, Until 12 H,
>      Until 15 H, Until 16 H, Until 18 H, Until 20 H, Until 21 L,
>      Until 22 L, Until 24 L, Until 26 L, Until 27 L, Until 28 L,
>      Until 30 L, Until 31 L]

Although Q is high from tick 3 until tick 20, there are seven tuples stating that Q is still high at ticks 6, 9, 10, 12, 15, 16, and 18. The tuple at tick 12 is caused by the tuple at tick 6: Until 6 H in q causes a tuple Until 9 L on qb, which in turn causes the tuple Until 12 H on q. In the same way, the tuple at tick 9 causes the tuple at 15, and the tuple at tick 10 causes the tuple at 16.
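
The redundant tuples are harmless in the sense that they still describe the same waveform. The short Haskell sketch below (the type names mirror the Miranda definitions above, but the code itself is only an illustration) expands such a change-list into one value per tick, so that it can be compared directly with the per-tick output of the continuous time simulator of Section 2.1.2.

> data ThreeState = X | L | H              deriving (Eq, Show)
> data State      = Until Int ThreeState   deriving (Show)
>
> -- Expand a change-list into one value per tick, starting at tick 0:
> -- each tuple 'Until t v' contributes the value v for every tick before t.
> sample :: [State] -> [ThreeState]
> sample = go 0
>   where
>     go _   []                 = []
>     go now (Until t v : rest) = replicate (t - now) v ++ go t rest
>
> main :: IO ()
> main = print (sample [Until 3 X, Until 6 H, Until 9 H])
>   -- prints [X,X,X,H,H,H,H,H,H]; applied to the q above it reproduces
>   -- the per-tick q of the continuous time simulator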

The seemingly obvious solution of deleting tuples with an identical state is incorrect: it violates the invariant, and leads to an immediate deadlock of the program. Some of the redundant tuples can be avoided by relaxing the invariant: in the case of a nand, a low signal on one input fixes the output to a high value regardless of the other input (which may even be undefined), so tuples on that other input may be ignored. When the nand is reprogrammed to ignore tuples on one channel as long as the other channel is low, far fewer tuples are generated. Still, the stable Flip-Flop has tuples running around, since both Set and Reset are high in the stable situation, requiring a meaningless tuple to float through Q and Qb.

The meaningless tuples floating around are essentially caused by the distributed nature of the discrete time simulator: the two nands and the wires operate completely autonomously, which makes an empty tuple essential for the progress of the simulation (the problem is identical to that in the distributed simulation algorithm of [Chandy79], where NULL-messages are used to keep a distributed simulator running). The problem can be solved by centralising the solution, giving rise to a simulator known as an event-driven simulator.

2.1.4 Event driven discrete time simulator

With respect to the previous simulator, two major changes are implemented: an explicit state of all wires is maintained, and there is a global list of things to happen in the future, the so-called events. The events lead to new states, and possibly to new events. The Miranda source of the event driven simulator is listed in Figure 2.7.

The wires are numbered by small positive integers, in this example 0 for Set, 1 for Reset, 2 for Q, and 3 for Qb. The circuit is defined by the definitions of recalculate, which tell that wire 2 (Q) is calculated as a nand on wires 0 (Set) and 3


> wirenumber == num
> state == [ threestate ]               || Used as an array of threestates
> event ::= At timestamp wirenumber threestate
>
> recalculate 2 = nand4 0 3             || Wire 2, q = nand set qb
> recalculate 3 = nand4 1 2             || Wire 3, qb = nand reset q
> dependencies 0 = [2]                  || Wire 0 used in def of wire 2
> dependencies 1 = [3]                  || Wire 1 used in def of wire 3
> dependencies 2 = [3]                  || Wire 2 used in def of wire 3
> dependencies 3 = [2]                  || Wire 3 used in def of wire 2
>
> update :: state -> wirenumber -> threestate -> state
> update st i val = take i st ++ val : drop (i+1) st
>
> simulate :: [event] -> state -> [event]
> simulate [] st = []                             || Termination
> simulate (e:es) st
>   = simulate es st, if st!wire = what           || IGNORE
>   = e : simulate newes newst, otherwise         || Process event
>     where
>     (At time wire what) = e
>     newes = merge es (sort more)                || New event list
>     newst = update st wire what                 || New state
>     more = [ mkevent out | out <- dependencies wire ]
>     mkevent wire = (recalculate wire) time wire newst
>
> nand4 :: wirenumber -> wirenumber ->
>          timestamp -> wirenumber -> state -> event
> nand4 x y time wire st = At (time+3) wire (nandfun (st!x) (st!y))
>
> inputs = [ At 7 0 L, At 14 0 H, At 21 1 L, At 28 1 H ]
> main = simulate inputs [H,H,X,X]

Figure 2.7: The source code for an event driven discrete time simulator.

(Qb). An event is represented by a three-tuple At time which what. The tuple tells at what time, which wire will get what value. The list of events is sorted on increasing time stamp, so the event with the lowest time is handled first. The time of the lowest event is also the current time of the simulation, and since the time should increase, newly generated events should have a time stamp larger than or equal to the current time. This is also known as the causality condition: an event can have consequences for the future, but no consequences for the past. The function simulate traverses the list of events recursively, each time producing a new state based on the old state and the consumed event.

An event may change the state of the circuit, and when the state of one of the wires is changed, all components connected to that state are required to recalculate


their output value, generating new events for their output wires. The new events are merged into the old event list before calling the simulator to consume the rest of the events. Note that in this example program the circuit is specified twice: in the definition of recalculate and in the definition of dependencies. The latter tells which wires have to be recalculated when the state of a specific wire changes; it is in fact the inverse of the former and can be derived automatically, but both are defined for the sake of simplicity.
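
As an illustration of that inversion, the sketch below derives the dependency table from a wiring table. It is written in Haskell, and the wiring table is an assumed data structure standing in for the information that is spread over the Miranda definitions of recalculate above.

> type Wire = Int
>
> -- An assumed wiring table: each output wire with the input wires it uses.
> wiring :: [(Wire, [Wire])]
> wiring = [ (2, [0, 3])      -- wire 2: q  = nand set qb
>          , (3, [1, 2]) ]    -- wire 3: qb = nand reset q
>
> -- A change on wire w must trigger every output that uses w as an input.
> dependencies :: Wire -> [Wire]
> dependencies w = [ out | (out, ins) <- wiring, w `elem` ins ]
>
> main :: IO ()
> main = print (map dependencies [0, 1, 2, 3])   -- [[2],[3],[3],[2]]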

An event that does not cause any change in the state may be ignored, as is done on the line marked IGNORE. This optimisation is not allowed in the discrete time simulator of the previous section, because it would cause a deadlock of that algorithm. The event driven simulator does not deadlock because all events are managed centrally: there are still future events to continue the calculation. Without this optimisation, the event driven simulator would have the same poor performance as the previous discrete time simulator.

2.1.5 Summarising the simulation algorithms

The four algorithms presented above describe the basic principles of simulation. The second algorithm is the one used in all continuous time simulations, while the fourth algorithm is in use for all sequential discrete time simulations. Discrete simulations are sometimes parallelised by using the third algorithm, because it has a distributed nature, but other algorithms can be parallelised as well [Overeinder91, Misra86]. The algorithms differ in three aspects: the way the time is used and represented, the way the delay is introduced, and the efficiency.

Time representation

In algorithm one, the time is an explicit parameter in the definition of a wire (q 5 or qb 350). In the second algorithm, the time of a tuple is implicitly defined by the index number of the state in the list times Δt, the fixed time step. Algorithm three has an explicit time stamp related to each state in the list of old states (Until 12 H), and the fourth algorithm has an explicit time stamp related to each event (At 7 0 L). The last two algorithms thus have explicit time stamps; they are useful for problems with non-constant time values. The second algorithm has implicit time stamps; it is useful for problems with a fixed time step. The first algorithm uses an explicit time parameter in the calculation, but it is not useful for simulations, for the many reasons explained before.

Delay representation

The representation of the delay is far more interesting, since the expressive power of the four algorithms increases with respect to this aspect. In the example of the Flip-Flop a constant delay is introduced by the nand gate: in each example the nand gate delays the output signal by 3 clock ticks. In realistic circuits it is, however, not uncommon that the delay is not fixed, but depends on the state of the circuit. Figure 2.8, for example, shows more realistically how a simple


[Waveforms of Figure 2.8: Input and Output against Time, 0 to 80.]

Figure 2.8: A simple buffer circuit (Output := Input), that needs more time to get a signal high, than to bring it to low again. Consequently, pulses are shortened, while a short pulse disappears.

buffer behaves. In this example, the buffer needs 5 time steps to get a signal high, and only 1 time step to bring the signal back to low again. A pulse entering the buffer will result in a shorter pulse on the output of the buffer.

The first algorithm computes in a demand driven fashion. Consequently, the delay has to be known before the state is calculated. This implies that the delay can only depend on the time in that part of the program, and not on the state. The buffer of Figure 2.8 can thus not be modelled using this algorithm.

The second algorithm introduces the delay as a fixed number of X's at the start of the output list. It is possible to make the delay dependent on the state of the wire, by checking if the wire contains a swap from L to H (or the other way around), and by generating a variable number of output elements in response (extra output elements for more delay, fewer output elements for a shorter delay). This trick is applied in the following Miranda fragment for a buffer that delays the high signal by 2 ticks, and the low signal by 1 tick:

> bufferL (L:x) = L: bufferL x
> bufferL (H:x) = L:L:bufferH x
> bufferH (H:x) = H: bufferH x
> bufferH (L:x) = bufferL x
> buffer (L:x) = X: bufferL x
> buffer (H:x) = X:X:bufferH x

This simulator does not deadlock, regardless of the fact that bufferH sometimes consumes a value without producing one. Because bufferL produces an extra element before bufferH is called, buffer is productive (also in the formal sense, as can be proven with [Sijtsma89]).

In the third algorithm, the function delay adds the constant 3 to the time. This function may be changed to an arbitrary function as long as delay satisfies two properties that follow directly from the correctness proof of the algorithm (progress in virtual time should be guaranteed, and the streams should be ordered):

Progress:        delay t > t
List ordering:   tx > ty  ⇒  delay tx > delay ty

The first requirement states that the simulator should make progress in time: the time stamp of a produced output tuple should be larger than the time stamp of


[Waveforms of Figure 2.9: Input and Output against Time, 0 to 80.]

Figure 2.9: The buffer circuit, simulated naively and incorrectly.

the consumed input tuple. When this requirement is not met, the progress of the simulator is not guaranteed, since the simulator may start running backwards (it violates the causality condition). The second property (ordering of the lists, effectively requiring strict monotonicity of delay) guarantees that if the input list of time stamps is ordered, the output list of time stamps is ordered also. An unordered list of output tuples has no meaning. It is hard to create a function delay that depends on the state of the circuit and that meets these properties. As an example, the following definition of delay is incorrect:

> delay x H = x+4
> delay x L = x+1

because delay 10 H ≮ delay 11 L; applying this delay to a short pulse would result in an output list that is unordered, which has no meaning. When the state is incorporated in the delay function, the history of the state should be taken into account to prevent this type of accident.
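
To make the violation concrete, the small Haskell check below evaluates both delays for the short pulse (H at tick 10 followed by L at tick 11) and tests the list-ordering requirement; the Level type and the helper ordered are illustrative assumptions.

> data Level = L | H deriving (Eq, Show)
>
> -- The state-dependent delay from the fragment above.
> delay :: Int -> Level -> Int
> delay t H = t + 4
> delay t L = t + 1
>
> -- The list-ordering requirement: time stamps must be strictly increasing.
> ordered :: [Int] -> Bool
> ordered ts = and (zipWith (<) ts (tail ts))
>
> main :: IO ()
> main = print (ordered [delay 10 H, delay 11 L])
>   -- delay 10 H = 14 and delay 11 L = 12, so this prints False:
>   -- the output tuples for the short pulse appear out of order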

In the fourth algorithm, the only constraint is that an event generated at time t should have a time stamp greater than or equal to t (the causality condition). There are no other constraints. This means that this algorithm allows two events, generated by two subsequent events on one input, to overtake each other on the output. This is sometimes the intended result, but most of the time the result is disastrous. The buffer of Figure 2.8, modelled in a naive way, will react to a short pulse with two events. The events representing the short input pulse of Figure 2.9 are ‘At 70 input H’ and ‘At 71 input L’; the output events generated are respectively ‘At 75 output H’ and ‘At 72 output L’. Because the events are ordered on their time stamp, the simulator will first execute the last event (at time 72 the output signal is set low), while the first event is executed at virtual time 75: the output signal is set high at that time, which is obviously incorrect: the signal should stay low, as in Figure 2.8.

Although the event list provides the most flexible solution in terms of the delay, it is also the most dangerous implementation: it offers a designer the possibility to shoot himself in the foot. Not only does it pass assignments around, it even contains assignments to be made somewhere in the future. Functional programmers advocate that it is hard to reason about assignments; here it is illustrated that it is even harder to reason about future assignments.

An event list is only useful if certain restrictions are applied to it, so that common causality rules are not violated. One such restriction might be that one


[Figure 2.10, left: a Flip-Flop built from three-input nand gates, with the wires Set, Reset, Q and Qb; its Miranda definition is given below.]

> nand1 x y z t
>   = nf (x tb) (y tb) (z tb), if t > 0
>   = X, otherwise
>     where
>     tb = t - delay
>
> q  = nand1 set qb qb
> qb = nand1 reset q q

Figure 2.10: A circuit that will lead to exponential costs when simulated demand driven.

does not place the assignment with a new value in the event list, but an assignment with an implicit calculation of the new value according to the state at that moment. This is less error prone, but less elegant as well. The programmer has to keep track of the history explicitly, and has to program the proper reactions explicitly (as was done for the previous algorithms).

The efficiency

The third major difference concerns the efficiency of the algorithms. Algorithm one is hopelessly inefficient (both in space and in time). In the worst case, algorithm one needs an execution time that is exponential in the number of components and the length of the virtual time (as is demonstrated in Figure 2.10: construct a Flip-Flop with three-input nand gates and connect the second and the third input of the upper nand to Qb, and the first and the second input of the lower nand to Q; the values of q and qb will be calculated twice, recursively, leading to an exponential time behaviour).

The second algorithm is efficient for continuous simulations. The time requirements are linear in the length of the simulation run, and inversely proportional to the time step Δt. A discrete simulation can be performed using a continuous simulator (Δt should be set to the greatest common divisor of all delays in the circuit), but it is pretty inefficient, since there are many unnecessary recomputations.

The third algorithm deals better with discrete time problems because the time step is not fixed. Unfortunately, inactive parts of the circuit still need to be recomputed to prevent the simulator from deadlocking. The worst case time behaviour is identical to that of the second algorithm.

The fourth algorithm avoids all these inefficiencies: the execution time is linear in the number of executed events, that is, the number of changes of the state in the circuit. For this reason the event driven algorithm is widely used for discrete time simulations.

Discussion

Architecture simulations almost always use a discrete time scale. Only low level simulations, such as simulations at the electronic level, use a continuous time scale.


An event driven simulator is the most efficient, which is why architecture simulators always use this algorithm.

The algorithms above only simulate an architecture. From the results of the simulation one can derive conclusions about the correctness and the performance of the architecture. The correctness is checked by exhaustively checking all input patterns in every state, and verifying that the output is correct. Note that in the example above only two of the cases are checked: a Set when Q was low and a Reset when Q was high. A Set on a high Q has not been checked, nor a Reset on a low Q, nor a Set quickly followed by a Reset, nor a short Set or Reset pulse. A full check of the correctness using a simulator is possible, but for all but trivial circuits prohibitively expensive: it requires the whole space of states and inputs to be checked.

However, our primary interest is not in the correctness, but in the performance of architectures. The interesting output from a performance point of view is the time needed for a Set or Reset pulse to be effectuated, and, more importantly, the average time the Flip-Flop was in the state high or low, and how many Set's and Reset's were issued. All these can be determined by postprocessing the outputs of the simulator. Exactly which quantities are to be measured mostly depends on the design: for a cache one is interested in the hit-rate, for a processor one might be interested in the instruction rate. But it is possible to measure general aspects (the time distributions, the critical parts and the absolute timings of actions, for example), as is shown in more detail in Chapter 3.
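
As a small illustration of such postprocessing, the Haskell sketch below computes how many ticks a wire spent in a given state and how often it switched into that state; the types mirror the Until-streams of Section 2.1.3, and the shortened example list for q is illustrative only.

> data ThreeState = X | L | H              deriving (Eq, Show)
> data State      = Until Int ThreeState   deriving (Show)
>
> -- Total number of ticks (counted from tick 0) that a wire spends in state s:
> -- the tuple 'Until t v' covers the ticks from the previous change up to t.
> timeIn :: ThreeState -> [State] -> Int
> timeIn s = go 0
>   where
>     go _   []                = 0
>     go now (Until t v : rest)
>       | v == s               = (t - now) + go t rest
>       | otherwise            = go t rest
>
> -- Number of transitions into state s (adjacent tuples with a changed value).
> transitionsTo :: ThreeState -> [State] -> Int
> transitionsTo s changes =
>   length [ () | (Until _ a, Until _ b) <- zip changes (tail changes)
>               , a /= s, b == s ]
>
> main :: IO ()
> main = do
>   let q = [Until 3 X, Until 6 H, Until 20 H, Until 21 L, Until 31 L]
>   print (timeIn H q)          -- 17: high from tick 3 up to tick 20
>   print (transitionsTo H q)   -- 1: one Set actually took effect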

2.2 Existing simulation systems

The rest of this chapter gives a brief overview of the tools that are used to simulate and evaluate computer architectures. We restrict ourselves to discrete time simulations. Since there are hundreds of simulation tools, it is not feasible to present an exhaustive list. Instead, the tools are classified, and some interesting tools of each class are highlighted. The following four classes are distinguished:

1. Ordinary programming languages. These languages are used to build many ad-hoc simulation tools. Typically, these simulators are developed to answer one specific question. Section 2.2.1 gives an overview of how general purpose programming languages can be used to construct a simulator.

2. Simulation languages. The drawback of using a general purpose language is that one has to implement the basic simulation algorithm each time a simulator is implemented. A simulation language has built-in simulation primitives; since their implementation is hidden in the language, the use of a simulation language is more convenient and less error prone than the use of a general purpose language. Section 2.2.2 gives a short description of SIMULA, the most important member of this class.

3. Hardware simulation languages. Many features of simulation languages are not needed for the simulation of hardware, while some features that might be


handy for hardware simulations are not supported by simulation languages. Section 2.2.3 discusses the simulators that are particularly designed to simulate digital hardware.

4. The architecture evaluation tools. The hardware simulators suffer from two drawbacks. Firstly, most hardware simulators aim purely at checking the functionality of a circuit, not at the performance. Secondly, the hardware simulators are directed at any digital design: computers, but also telephone exchange networks or the electronics of washing machines. The architecture evaluation tools are developed to evaluate specific aspects of computer architectures; Section 2.2.4 discusses some of these tools.

This chapter ends with a discussion of the advantages and disadvantages of the various approaches in the context sketched in the introductory chapter: the evaluation of the performance of computer architectures. Although the last class of tools delivers the evaluation results the architect is interested in, it turns out that these tools are too restrictive in their use, since they cover aspects of computer architectures separately. The general purpose simulation languages can cover all aspects at once, but they do not offer the right level of abstraction.

2.2.1 General purpose languages

Imperative languages

The imperative programming style is still the most popular programming style, also for the construction of simulators. On the one hand, ad-hoc simulation problems are solved by using a language like C or Pascal; on the other hand, imperative languages are used to implement more general simulators: VHDL (Section 2.2.3) is implemented in Pascal, while AWB (Section 2.2.4) and Oyster (Chapter 3), for instance, are implemented in C.

The implementation of a discrete time simulation algorithm in an imperative language is straightforward. The state of the simulator is kept in a set of global variables, and an event list manager is written that maintains the list of ordered events. The semantics of the events are specified by the programmer. One can implement delayed assignments (as was the case in Section 2.1.4), delayed function calls where a function F is to be called at time T with parameters P (an event is then the tuple <T, F, P>), or very problem specific events (for the simulation of a message passing network, for example, one can define events with the meaning that at time T a message will arrive at A from B). When a general purpose language is used, the programmer is totally responsible for the structure and consistency of the event list.

Object oriented languages

When the program is implemented in an imperative language, the state is kept in global variables. Most of the state is however managed locally, and it is good programming style to distribute the state over the code handling the data. This


is facilitated by object oriented programming techniques. C++ (a language that convinced many programmers of the benefits of object oriented programming) is widely used to program ad-hoc simulators. The entities of the real world are modelled by objects, with their own control. The threads of control of the various objects do not interfere unless explicitly stated, a model closely resembling the real world. The perfect fit between object oriented programming and simulations is no coincidence: the first language to raise the concepts of object orientedness was SIMULA ’67 [Dahl66], a simulation language.

In comparison with the use of an imperative language to implement a simulator, the use of an object oriented language solves at least one problem: the structuring of a model in terms of objects is more convenient than structuring a simulator in C or Pascal. Still, not everything is solved. POOL has a message passing system that can be nicely used to model interactions in the real world, but it is hard to integrate the virtual time with it. The same holds for many other object oriented languages.

Other languages

Other languages can be used for simulations as well. In Section 2.1 the algorithms were presented in Miranda. Translating these algorithms into some other functional language is trivial. One can also use Prolog or any other type of language, but the problems remain the same: for each implementation, one has to reimplement the simulation algorithm. Using a general purpose language provides full flexibility, at the cost of implementing everything from scratch each time.

2.2.2 General purpose simulation languages

Most simulation languages are extensions of existing general purpose languages. A simulation language offers three major extra features:

1. A simulation language has a structure that allows for a natural mapping of entities and activities of the real world into the language.

2. The kernel of the simulation algorithm, as sketched in Section 2.1.4, is an integral part of a simulation language.

3. A simulation language has support for activities that are frequently encountered in the real world: for example for the handling of queues, and for the calculation of statistical figures like standard deviations.

There are a number of simulation languages in use. Only one of them is discussed: SIMULA [Dahl66, Birtwistle73]. Being the first simulation language (SIMULA ’67 was designed in the late sixties as an extension of ALGOL ’60), and the first object oriented language, SIMULA contains all the interesting aspects of simulation languages.

For modelling entities of the real world, SIMULA introduced the class concept. A class is a blueprint of how an entity in the real world behaves, and consists of data


declarations and code that implement the behaviour. An object is an instance of a class; the object simulates the modelled real world entity. By creating multiple instances of a class, multiple objects are created, simulating several entities of the same type. Nowadays, objects are well known abstractions in object oriented languages for implementing any type of program.

SIMULA objects run independently of each other but can synchronise with the virtual clock. The virtual clock is a counter counting from zero (the start of the simulation) onwards, maintaining the current time. An object that has to wait for a coming period calls a primitive to stop the execution of the object until a certain time has been reached. The virtual clock implements the event list, but in a less general way than an ordinary event list. The only event supported is “wake an object”; nothing is said about what has to happen in the object, that is a decision private to the object based on the state of the system at that moment. An object that needs the old state has to keep track of the old state explicitly.

The standard support available is threefold. Firstly, queues are supported by a standard class. With inheritance, each object can be made queueable. Queues are frequently needed in for example client-server problems (how many servers are needed to serve the clients within a reasonable time), but are useful in many other applications as well. Secondly, SIMULA supports random number generation, to generate a load for a problem. Depending on the purpose, different distributions can be chosen (normal, uniform, etcetera). Although it is quite trivial to encode a random generator, it is hard to debug it: a bug in a random generator is almost surely not recognised, and can lead to disastrously incorrect simulation results. Thirdly, functions are built into the language to facilitate statistical calculations.

SIMULA is not perfect. The data of objects is not private; even worse, the basic communication primitive is the assignment to a variable within the destination object. Modern object oriented languages use a procedure call or message passing mechanism to pass data from one object to another: there is a clear separation between computations and communications. Another subtlety is that SIMULA carries too many features, sometimes inherited from ALGOL ’60 (call by name), and sometimes looking like a kind of experimentation, like all the different ways to (de-)activate objects. Despite these disadvantages, SIMULA shows the three important principles of simulation languages given above that drastically ease the life of the programmer.

Specific simulation languages have been developed that restrict the simulation domain. These restrictions imply on the one hand that more extensive libraries for a specific domain can be developed, and on the other hand that more specific statistics about the simulation can be maintained. These two extensions are visible in the next two classes of languages: those specialised in hardware simulation, and those specialised in architecture evaluation.

2.2.3 Hardware simulation languages

When simulating hardware using an ordinary simulation language, one quickly discovers that it would be useful to extend the simulation language: on the one


hand with extensive libraries to model the standard hardware components (registers, ALUs, caches), and on the other hand with some features in the language itself to model, for example, digital signals. These changes make the simulator less general, but particularly useful for the simulation of hardware. Many research groups and companies have developed hardware simulation languages. In this section some representatives are briefly discussed: the HDL family (the most widely used hardware simulation language), ART-DACO (a simulator based on data abstraction and concurrency) and INSIST (an interactive simulator). These simulators show the most interesting aspects of hardware simulation languages.

VHDL

A number of popular tools exist that are used for the specification of hardware: Dacapo [Dacapo89] and VHDL [IEEE88] (based on Pascal), or Verilog [Verilog89] (based on C and Pascal). Although the syntax and details of these languages differ, the (imperative) base language was in all cases extended with the same constructs: “wires” and the notion of time. Below, a short description of the whole set of languages is given without going into the details of one specific language: HDL is discussed, ignoring the differences between HHDL (the Helix version of HDL), Dacapo, Verilog and VHDL.

The most important extension is that HDL uses two types of variables: nets (wires) and variables. Variables are for programming, nets are for the implementation. The operator ‘+’ works on variables, the procedure ‘ADD’ works on nets. Both ADD and ‘+’ add two integers, but ADD is a real implementation (with a time delay), while ‘+’ is an abstraction without delay. Variables can be assigned to by using a Pascal ‘:=’, while nets are assigned to by using an ‘ASSIGN’ statement. This last statement has an optional ‘DELAY x’ part, allowing the assignment to be postponed. Apart from these (and many other) syntactical changes, the HDL environment has built-in libraries for frequently used components like the trivial and, nand, nor and not gates, but also for complex components such as J-K Flip-Flops, counters or ALUs.

Nets and variables coexist to facilitate a top-down design method. A high level design is described by using Pascal variables, for-loops and other high level constructs, which are gradually replaced by nets, ASSIGN statements and calls to library functions for adding integers. A design that is fully composed of nets and the corresponding operations is finally translated into silicon. All design stages (thus the high level description, the low level description and all intermediate forms) can be simulated using the HDL simulator. The designer can thus check whether the circuit is functioning correctly during the refinement process.

HDL is implemented using an event driven simulator. The model is very close to the description of Section 2.1.4: the state of the wires is stored in the global state, and an ASSIGN statement of the form

ASSIGN value TO wire DELAY dt

inserts a tuple At (T+dt) wire value in the event list. To react to a change on a wire (to wait for an event), the designer uses a statement of the form


UPON load=HI CHK load DO ...

that waits until the condition on the left side is true, whereupon the statement after the DO is executed. Multiple conditions may be supplied to branch on multiple conditions.

The ASSIGN statement allows different values to be assigned at different points in the future, which could lead to the surprising results sketched in Figure 2.9. This can be circumvented by redefining the event list manager to delete the later ASSIGN when an earlier one is placed in the list, but this is not elegant, because the second assignment might have given the intended result. In comparison with SIMULA, the event list of HDL (and of many other simulators: Verilog, N.2 [Endot87]) is a mess of postponed assignments, while an event in a SIMULA implementation only starts a process, which must explicitly assign, and explicitly remember the old state.

ART-DACO

ART-DACO is an experimental simulation system aiming at the construction of high level abstractions of circuits prior to design [Krishnakumar87]. ART-DACO (Architecture Research Tool using Data Abstraction and COncurrency) allows the designer to answer so-called “what-if” questions, for example: “What if the width of the bus is doubled?” As the name explains, ART-DACO models are constructed around abstract data types. The various parts of the model run concurrently, but totally synchronously: a continuous time simulation is implemented (Section 2.1.2). It seems that this implementation decision was made because of the limited application domain; only signal processing architectures are considered.

Despite the continuous time model of ART-DACO, there are two nice aspects to it. Firstly, the use of abstract data types is an enormous step forward. Secondly, an ART-DACO design consists of separate behavioural and topological interconnection descriptions. These two are not merged, as is the case in for example SIMULA. Because they are separate, it is easy to modify the topology of the architecture.

INSIST

INSIST (INteractive Simulation In SmallTalk) is a simulation system developed at Philips research laboratories in Sunnyvale [Meulen87]. INSIST is built on top of SmallTalk (an object oriented language with a graphics interface), and has a SmallTalk-like appearance: interactive, objects, and graphics. For simulation purposes, INSIST supports two important features: the virtual time and a library of standard components. INSIST uses a discrete event simulator to implement the virtual time, but the details of the simulation algorithm are hidden inside INSIST. The designer is only allowed to install a delay on an output pin of a module, effectively delaying the signal coming out of the module. INSIST has a large library of standard components including amongst others nand gates, logic analysers, PLA's and a microprocessor core.

INSIST is completely different from other hardware design environments. The interface is totally window oriented and since INSIST is interactive, there is no


edit-compile-run cycle, but an edit-watch cycle: changes can be made in a “live” circuit. During the simulation run, one can place probes, logic analysers, or other taps in the circuit. This is not only a nice feature during debugging, it is also very instructive in explaining how circuits work.

INSIST is not restricted to the simulation of (digital) hardware. The “wires” can carry any signal: high, low, or undefined for traditional digital hardware, but one can also send functions (defining a waveform) or other objects like strings over the wires. The weak point of INSIST is the performance. SmallTalk is an interpreted language (the easiest way to implement dynamic interactive languages), and hence slower than C or Pascal. Although INSIST is well suited for designing a circuit, large evaluation runs cannot be made with it.

Other hardware simulators

There are many other hardware simulators (like Palladio [Brown83]), but they do not have other interesting features. All hardware simulators have the same drawback: the architecture is simulated from the perspective of debugging the circuit, of evaluating the functionality. None of the simulators helps in analysing or evaluating the performance of an architecture. For many designs, the performance is not immediately important (coffee vending machines), or the performance is easily determined using static analysis tools, as is the case with for example DSP architectures. But as explained in the introductory chapter, the performance of computers is important and is in general not easily determined statically. The simulators below simulate architectures in order to evaluate the performance.

2.2.4 Architecture Evaluation tools

The architecture evaluation tools are characterised by their input domain (computer architecture designs) and their purpose: to evaluate performance related aspects of architectures. The tools mostly evaluate only one single aspect, but do a good job on that particular aspect. Although all tools are internally based on some kind of simulator, the simulation strategy is not always visible to the programmer. Dinero for example accepts a stream of memory references and calculates the hit rates of the caches without bothering the designer with events or virtual clocks.

Five tools are discussed here: four tools that concentrate on a single aspect (ISPS, PARET, AWB, and dinero, evaluating the instruction set, the parallelism, the storage hierarchy, and the cache aspects respectively) and one tool that produces more general evaluation figures: SADAN.

ISPS

The goal of ISPS (Instruction Set Processor Specification [Barbacci81], implemented in the N.2 system [Endot87] as well) is that both the software (assembler and compiler) and the hardware (the chip) are generated from the ISPS specification of a processor. A processor is specified in ISPS at register transfer level: in terms of registers and the interconnections between them. One small part of the ISPS system is the evaluator, a


simulator that can emulate a benchmark program and collect a list of statistics about the machine level interface. When performance figures of a specific architecture are required, a postprocessor is needed. Caching or multiprocessing aspects are not covered by ISPS. The execution of the ISPS specification does not maintain a virtual time; only the dependencies in the (concurrent) description are taken care of.

ISPS is oriented towards the design of processors, and explicitly towards the consequences of the design of the instruction set. Within this field, the designer can obtain good results (ISPS has been used in the design of the PDP-11), but ISPS itself does not suffice to easily obtain information about, for example, the hit rate of a cache in the architecture.

PARET

The PARET environment [Nichols88] has been developed to predict the performance of parallel applications on distributed memory MIMD machines. The architecture and the application are specified by means of directed graphs. The nodes of the architecture graph represent the processing elements of the architecture, and the communication channels are represented by the arcs. The application graph is formed by program fragments, interconnected by data flow arcs. The program fragments have explicit annotations specifying the speed of the code. PARET does not need a detailed implementation of the application; a precise description of the timing information and the synchronisation suffices. Running a detailed implementation under PARET is possible, but the programmer remains responsible for assigning delays to program fragments.

Because the application itself is not simulated, but only the timings of the basic blocks, PARET runs fast in comparison with a full simulation of the program. A drawback of this method is that the timings specified by the designer are only approximations. Although the quality of the approximations can be improved with the help of an implementation and some profiling tools, one does not get a real application trace. In real cases, the application's behaviour might depend on the speed of the processor's network, a situation that cannot be modelled using PARET (see Section 4.2 for a discussion on this topic).

AWB

AWB, the Architecture Work Bench [Bray89, Mulder87] developed at Stanford (where the MIPS processors originated as well), is used to evaluate the design of the storage hierarchy. The input of an AWB simulation run consists of an application program, written in a high level language, the input data of the application, and a high level description of the architecture: cache sizes, the size of the register banks and so on. It is assumed that the processor has a standard instruction set and that all instructions are executed in a single cycle.

AWB offers some nice features. Firstly, it simulates the real application extremely fast. This is accomplished by compiling the application to the host architecture, executing the program on the host, tracing the order in which basic blocks are executed, and by simulating the target architecture with this “basic block trace”.


This reduces the simulation time without losing accuracy: the real application program is still simulated. Secondly, AWB allows the effects of adding registers, caches or other features to be compared. It is the responsibility of the designer to keep the costs of the architecture constant; the AWB estimates the merits of the various solutions. This makes it feasible to optimise the performance of a core processor design towards the execution of a specific application: one application might benefit from a large instruction cache, while another application might benefit from a large register bank.

The AWB provides a compiler, including an optimiser and a register allocator. The motivation is that it is not valid to compare programs compiled with different compilers (because the performance effects of using another compiler might obscure the performance effects introduced by the architectures), but this limits the AWB to one class of processors: the collection of RISC processors for which the current optimiser generates code. Processors with multiple pipelines (one of the extensions proposed in [Bray89]) will need another compiler, but it will be hard to write a compiler that generates optimal code for processors with any type of (multiple) pipelines; the current generation of compilers can generate optimised code for specific pipelined designs only.

The AWB is different from the other tools in this section because the application program, compiler generated code, processor and memory hierarchy are simulated. Like the other tools, AWB cannot be extended to simulate more exotic or parallel architectures easily.

Dinero

Dinero [Hill89] is a very simple but rather effective cache evaluation tool. The input of dinero consists of an address trace, and a set of cache parameters specifying amongst others the size, the associativity and the fetch policy (prefetching, demand fetching). Dinero outputs a number of statistics such as the hit and miss rates, and the traffic ratio. Dinero fully simulates the cache behaviour, but it does not have a notion of time; the simulator just reads the address trace and applies it to the cache sequentially. This is the reason that the output consists of ratios only. Although the capabilities of dinero are minimal, dinero can be of great help when the question “what is an economical cache size for this address trace” is to be answered. Notice that there are many similarities with AWB. AWB has more capabilities, since it can compile and execute programs; in contrast, dinero has more parameters for the cache, regarding for example the prefetch strategy.

Dinero is a typical example of a very small, very specialised tool. It aims at only one aspect (cache evaluation), and is fully flexible on that particular part. All other aspects are totally ignored.

SADAN

SADAN [Delgado88] is an architecture simulation language aiming at the evaluation of computer architectures. The description language used is an extension of


Pascal, with concurrency, primitives for event driven simulation and communication, and statistical features.

SADAN distinguishes between different types of communication. SADAN supports normal, open collector and three-state wires, queues, and networks. Wires, queues and networks all have their own primitive operations. Wires are assigned to (with an optional delay), and have a wait on primitive to wait for a change on a wire. For queues and networks, send() and receive() primitives are built into the language, again with an optional delay. The difference between a queue and a network is that a network has an addressing scheme (many to many), while a queue does not have an addressing scheme (many to one). Concurrency is introduced via the concept of a ‘module’, a piece of code (declared as if it were a procedure) that is not called as a normal procedure, but that runs concurrently with the caller. SADAN allows the declaration of stats, variables in which statistical data is maintained and displayed at the end of the simulation run in any desired form.

SADAN is an interesting approach: it is general purpose and it is directed towards the evaluation of architectures. The queue communication type allows for easy modelling of client-server communication, and the stat variables ease the extraction of statistical information. Despite this, a few points in SADAN are arguable: the language has (too) many communication features, and in SADAN the specification of the behaviour and the topology of the architecture are intermixed. In comparison with the tools above, SADAN lacks the high level descriptions of, for example, caches or application programs.

Combining the architecture evaluation tools

The systems sketched above typically tackle some of the aspects of architecture evaluation, but none of them is successful in tackling all aspects at once. As an example, one might consider the simulation of an application running on a shared memory machine equipped with consistent caches (the case study sketched in Chapter 6). The problem is related to dinero (caches), AWB (storage hierarchy), PARET (multiprocessing systems, although PARET is oriented towards distributed memory) and ISPS (the execution of an application). Integrating these four is hard, because the results of all four depend on each other: the execution order of the application depends on the synchronisations, and the synchronisations depend on the cache hit rates and on the traffic density. ISPS can produce a trace, and dinero can interpret a trace, but ISPS and dinero cannot be integrated because there are mutual dependencies (see Section 4.2.2 for a detailed discussion of this topic).

The only system that is capable of handling the whole architecture is SADAN, but when this architecture is to be specified in SADAN, the behaviour of caches, busses, processors and so on has to be specified from scratch. Dinero, AWB and ISPS already know how a cache works, and what the parameters and relevant measures of a cache are.

The contribution of all the architecture evaluation tools compared to the hardware simulation languages is that the evaluation tools provide statistical output about the efficiency and performance of the architecture, whereas the hardware


simulation languages mostly stress the correctness of the implemented architecture.

2.2.5 Discussion

In this section, a number of languages and simulation systems have been discussed, ranging from simulators written in an ordinary imperative language to specialised tools, for example for cache evaluation. None of the systems above is perfect in the context of evaluating architectures. The architecture evaluation tools, especially SADAN, come closest, but they also miss some features. The requirements for an architecture simulation tool are summarised here.

Object orientedness seems to be the best paradigm to formulate an architecture. It is easier to map real world entities onto objects than to map these entities onto (flat) imperative code, because one does not want to intermix the descriptions of the various components. Furthermore, the behavioural description of the components and the topology of the architecture should be specified separately, to ease a change of the topology of the architecture without a modification of the behaviour of the components. It is also favourable to separate the communications and the computations, both from a software engineering point of view and because it gives a clearer mapping of the architecture onto the model.

The simulator is implemented most efficiently by using an event driven simulation algorithm. Because an event list is easily used incorrectly, the usage of the event list should be restricted, as in SIMULA (which supports only one event type) or in many of the architecture evaluation tools (where the event list is not visible at all). These restrictions hamper the designer only at those places where errors are easily made, so the restrictions protect the designer against pitfalls.

To facilitate rapid prototyping of architectures, the designer should have a set of models of standard components at hand. These models prevent, for instance, that a cache has to be remodelled each time one is needed; each model is designed only once so that it maintains all relevant statistics about the component, and this guarantees that the models are correct (as opposed to models made by the designer).

Last but not least, too many simulators are not useful because they are not open to the rest of the world. A tool like dinero for example is freely available in source form, for everyone who likes it. This allows users to slightly modify it to their own wishes or to integrate it with other packages. When a closed simulator (not available in source form and without an interface to the outside world) does not meet the requirements of the user, it is an invitation to the user to develop his own simulation tool in a language like C, which leads to a reimplementation of the wheel.

As illustrated, there are no complete systems available that integrate all requirements stated above, so we have developed such a system; it is described in the next chapters. In this system there is room for the specialisation of an architecture evaluation tool, the freedom to implement any (architecture) simulation aspect, and the freedom to integrate it with other simulators (such as hardware design languages).


Chapter 3

The simulation environment

In this chapter Oyster is presented, a simulation environment that should stimulate computer architects to evaluate the performance of their architectures during an early stage of the design. Oyster allows the whole architecture to be simulated at once, to reduce the risk that effects are overlooked during a separate evaluation of different parts of the architecture. When simulating large architectures, the architect is easily lost in the details. For this reason, Oyster accepts both high and low level descriptions, so that each component can be modelled at the appropriate level of detail. To facilitate an integrated simulation of high and low level models, Oyster is organised in a layered fashion, as is depicted in Figure 3.1. The high level abstractions are implemented on a common lower level, so the whole architecture can be simulated in an integrated fashion (in contrast with the existing architecture evaluation tools). Evaluation figures are emitted from all levels of Oyster. The lower level maintains general figures about the architecture, while the higher levels maintain evaluation figures that are specific to the types of component modelled.

Oyster is organised around an object oriented simulation language, called Pearl[1]. It is object oriented, so that there is a close relation between the entities of the real world and the objects of the language, and it is a simulation language, so that the user is provided with the basic primitives for a discrete event simulator. A brief description of Pearl is given in Section 3.1.

The level below Pearl is the simulation kernel, which implements the basic primitives for Pearl (such as the virtual clock) and a number of architecture independent statistical measurements (such as an analysis of the utilisation of the components). In Section 3.2 the functionality of the simulation kernel is presented.

At the level above Pearl, the frequently used elements of architectures are predefined, a kind of library. The library elements have "parameters" to tailor them to a particular architecture, where the word parameter may be a bit misleading, because, for example, the semantics of a processor's instruction set are also a parameter. The library models maintain evaluation figures specific to the modelled component, like the miss rate for a cache, or the instruction rate for a processor. A description of the library elements currently implemented in Oyster is presented in Section 3.3.

[1] Pearl stands for Pure Evaluation and Architecture Research Language and should not be confused with three other Pearls: [Deering79], [Martin81] and [Cherry88].


[Figure 3.1: The structure of Oyster. The architect and other simulators interface with a stack of four layers: Standard Modules, Object Oriented Simulation, Discrete Time Simulation Kernel, and the Simulation Platform (stock hardware). The interfaces between successive layers are the parameter definitions, Pearl, C, and machine language.]

The three levels of Oyster are mapped onto the respective lower levels by compilers. For each type of component at the standard layer a "compiler" is available that maps the parameters of the component onto a Pearl module modelling the component. The Pearl compiler translates Pearl into ordinary C code, and the C-compiler is used as target compiler towards the hardware platform. Oyster is thus totally machine independent: as long as a C-compiler is available, Oyster will run.

There are many similarities between Oyster and the tools described in Section 2.2. The top level of Oyster provides a type of interface found in the architecture evaluation tools described in Section 2.2.4. The evaluation output of the modules predefined at this level is specific to the modelled components: a cache model will report the miss rate, a processor model the instruction rate. The difference between the existing architecture evaluation tools and Oyster's standard modules is that the standard modules of Oyster are implemented as part of a larger system, which means that they allow for a broader range of experiments. The simulation language of Oyster, Pearl, is neither a real general purpose simulation language (Section 2.2.2), nor a hardware simulation language (Section 2.2.3). Pearl is not suitable for general purpose simulations because the language does not support any dynamic behaviour (a restriction imposed on Pearl to ease the extraction of evaluation information about architectures in general), and Pearl is no Hardware Description Language since it abstracts from "wires" but supports (queueable) messages instead. The differences and similarities are discussed throughout the chapter.

Oyster is open to the outside world. The lowest level of Oyster provides two interfaces to the outside world: one for calling C-functions from the Pearl program, and one to interface to other simulators. These interfaces are briefly discussed in Section 3.4. This chapter concludes with a description of the current status of the Oyster implementation and a comparison with some existing tools.

3.1 Pearl : The simulation language

Pearl is the name of the simulation language developed for Oyster. The language design has been influenced greatly by three languages: POOL [America89a], C [Kernighan78] and SIMULA [Birtwistle73]. The computational model of objects running in parallel and communicating with messages originates from POOL, the statement structure and the syntax of C have been used, while the object oriented simulation model originally stems from SIMULA. More than 95% of the features of the POOL, C and SIMULA languages are not incorporated in Pearl, to keep it simple: because Pearl is a vehicle used to do research on architectures (not a research object on its own), it is important that Pearl is simple and implementable.

[Figure 3.2: The compilation path of a Pearl program. The class definitions (plus extra C-functions) are compiled via C into an executable; the loader reads the topology and array bounds and generates the object instances, which run on the processor together with the program input.]

A running Pearl program consists of a collection of objects communicating via messages. Each object simulates the behaviour of a component of the architecture. The interactions between the components are simulated by messages between the objects. As in other object oriented languages, the behaviour of an object is described by the class of the object, and an object is said to be an instance of a class. A class can have more than one instance; these objects share the code (they all simulate the same type of component), but they have their own private data, stack and program state. Messages are the only communication medium between objects. Shared variables are not available. Because the send and receive of a message are explicit, objects can synchronise by sending a message and waiting for a reply.

Because of its simulation nature, Pearl is equipped with a virtual clock. During the run of the Pearl program, the clock always points to the current time. Like real time clocks, it never goes backward in time, but unlike ordinary clocks, it does not flow continuously. Objects can at any time decide to wait for the virtual clock to pass a certain time. Only when the simulation kernel detects that all objects are waiting for the virtual clock to advance is the clock advanced to the first time in the future at which some object will become active. When no object will ever become active in the future anymore and all objects are waiting, the program is said to be terminated.

The compilation path of a Pearl program is shown in Figure 3.2. The compiler compiles the class definitions via the C-compiler to an executable, and the loader reads the topology description and generates the object instances, which run on the processor. Unlike other object oriented languages, Pearl does not allow for the run time creation of objects: all objects are created at load-time. This eases the extraction of statistical information and leads to faster execution; as will be seen later on, other aspects are fixed at load time as well. In the coming subsections the basic principles of Pearl are explained: objects and the typing scheme, messages, the virtual clock, and the program startup. This section ends with an example Pearl program and a brief comparison with related languages.


3.1.1 Objects computations

An ordinary imperative programming style is used to specify a class. A class description consists of the declarations of variables, functions, and the body (the main code executed after the creation of an object). Body and functions are both composed of lists of statements that are executed sequentially.
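As a minimal sketch of this layout, written in the style of the examples of Section 3.1.5 (the class name, variable and method below are hypothetical, and the message primitives used here are explained in Section 3.1.2), the declarations come first, followed by the functions, and finally the body:

    class counter
    count : integer = 0

    increment : (by:integer) -> integer
    {
        count = count + by ;
        reply( count ) ;
    }

    {
        while( true ) {
            block( any ) ;    /* the body: serve increment messages forever */
        }
    }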

There is one important difference between Pearl and conventional object oriented languages: the typing scheme. Conventional object oriented languages have a highly orthogonal typing scheme. Integers are objects, arrays are objects, models of cars are objects, and even classes may be objects. The strong point of this description is the uniformity in the language description, and the resulting simplicity and elegance in the language definition. The weak point of the uniformity is that the implementation of objects is expensive: if each integer were really represented as an object, programs would run orders of magnitude slower than necessary. The compilers for these languages therefore recognise special cases (transparently to the programmer), for example integers and records, to be represented efficiently, while other entities are treated as real objects. This method can be extended to allow the efficient execution of user defined objects as well, either implicitly or explicitly.

In Pearl, data and objects are distinguished. Data is privately owned by objects; data can be freely copied to other objects, but data can never be shared between objects. Objects are defined in classes, and are handled by reference. Two or more objects can thus maintain references to the same object. Data is used to model the (run time) contents of registers, memories or even busses in the architecture, while objects model the hardware components themselves.

Pearl defines four basic types of data: booleans, arbitrary (but fixed) length integers, floating point numbers and strings. The normal arithmetic and bit operations are supported for integers. An important feature for architecture simulations is the ability to work with integers of any width. This is supported, but the width is fixed at compile time. Once an integer variable is chosen to be, for example, 66 bits wide, it will not stretch if a result does not fit in 66 bits; instead the computations continue modulo 2^66 (73786976294838206464). Booleans and one-bit integers are different entities: a boolean is the output of a relational operator and is needed as the conditional for the control flow statements, while a one bit integer has the value '1' or '0', on which one can perform arithmetic operations. A boolean can only be used with the boolean conjunction, disjunction and negation. Floating point numbers are included in Pearl to ease the extraction of statistical measurements. Although floating point entities are almost never used for the modelling process, the calculation of standard deviations, averages or correlations is more convenient when floating point numbers are supported. The reason to include strings is comparable: they are not necessary for the simulation itself, but they are essential for performing I/O. There are no string operations in Pearl; strings can only be passed in assignments or function calls, and are compile time constants.

Data is structured with the help of the struct and array constructors. Structures (called records or tuples in other languages) are used to organise a collection of data of different types; the types of these elements are fixed at compile time. Arrays are used for storing multiple items of the same type. The length of an array is fixed at load time. Pearl does not support more complex data types like lists, sets, unions or (higher order) functions. These types are indispensable for any self respecting general purpose language (because they are needed in writing numerical libraries or compilers), but for architecture simulations one does not need them, so they are not in Pearl. If one needs them anyhow, there is always an escape to the level below Pearl, C, which supports all of them.

As said, data is privately owned, which means that integers, floats, arrays and structs are passed by value. It is not possible to pass, for example, an array by reference. If one wants to modify an array, the array should be returned back to the caller. This call by value scheme is used everywhere in Pearl: in messages between objects, and in function calls inside objects. Again, this is a feature that is undesirable in many general purpose languages, but it does not harm architecture simulations. Since passing large data-structures occurs seldom, it is not even necessary to implement a call by reference optimisation in the Pearl compiler.

Contrary to data structures, objects (constructed with a class definition) are handled by reference. Each object in the architecture has a unique object identifier, defined during the startup of the program, which is for example used when specifying the destination of a message. Like all entities in Pearl, object identifiers are typed; an object identifier has a type that is denoted by the class name.

Many object oriented languages support the concept of "inheritance": a class can be defined in terms of another class, plus some new features. Inheritance is introduced for two purposes: for code reusability and as a mechanism for the creation of generic classes. SIMULA '67 for example [Birtwistle73] defines a class "List" which can be inherited by any object that should be queueable: if a class Y inherits the properties of List, then the actions applicable to lists are applicable to Y also, which means that instances of Y can be queued and dequeued. Inheritance is a very powerful programming mechanism, but it is also rather complex in its implementation. For Pearl, a full inheritance mechanism was considered overkill, and it was expected to complicate the development of the Pearl compiler.

Instead of inheritance, Pearl supports a simpler concept known as subtyping [America89a]. A subtyping relation between two classes exists if the first class supports at least every feature supported by the second class. The first class may support more features, but this is not necessary. Subtypes can be used to define an abstract interface. As an example from the field of computer design, consider the definition of two classes, a static memory and a cache. As seen from the outside world, both the memory and the cache have the same properties: one can store data in them, and can retrieve it later on. The behaviour is quite different in the sense that a cache stores its data in a background memory, and is in general much faster than the memory. A cache is thus a subtype of a memory. Because a cache supports more operations than a memory, for example Flush and Invalidate, a memory is not a subtype of a cache. The consequence of the subtype relation between the cache and the memory is that wherever a memory is required in the architecture, the designer may place either a memory or a cache without violating the typing system. The type-checker accepts a subtype at any place where its supertype is required. The subtyping relation is explicitly annotated by the designer; the compiler just checks whether the relation is true, a check which is very easy to implement.
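To make this concrete: assuming a processor class that declares its store as "m : memory" (rather than "c : cache" as in Figure 3.4), both of the following hypothetical topologies type-check, because cache is annotated as a subtype of memory (the topology notation is the one of Figure 3.4):

    architecture_plain() {
        proc: processor( mem )       /* a memory where a memory is required     */
        mem:  memory( )
    }

    architecture_cached() {
        proc:  processor( cache )    /* a cache is accepted at the same place,  */
        cache: cache( mem )          /* since cache is a subtype of memory      */
        mem:   memory( )
    }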

Page 49: Simulating Computer Architectures - Semantic Scholar · Simulating Computer Architectures Nice picture is missing! HenkMuller. i Simulating Computer Architectures ... Marnix Vlot,

42 Chapter 3. The simulation environment

3.1.2 Communication

The communication between objects is message based. A message consists of four parts: the source of the message, the destination of the message, the name of a function inside the destination object (called the method of the message), and a number of parameters. A message is composed and sent in one single action, as in the example below:

destination !! method( param1, param2, param3 ) ;

This statement creates a message and places it in the queue of the object identified by destination. Method should be a function of the class of destination, with, in this example, three parameters. When the destination object decides to handle the message, the message is retrieved from the queue, and method is called inside the object with the parameters param1, param2 and param3. As explained before, the parameters are sent by value.

At the receiving side, the object explicitly decides when to handle a message by executing a block statement. By default the messages are passed in FIFO order to the receiver, but it is possible to select a message directed to a specific method. The statement

block( method1, method2 ) ;

blocks until a message for either method1 or method2 arrives, whichever comes first. When the message arrives, the corresponding method is called, and after the return of the method, the statement after the block is executed. The keyword any may be used as a shortcut for all methods, while a block without parameters, block(), just blocks forever.
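As a sketch of this selective receipt (the buffer class below is hypothetical, written in the style of the examples of Section 3.1.5; the reply primitive used here is explained below), a one-place buffer can alternate between accepting a write and the matching read by blocking on one specific method at a time instead of on any:

    class buffer
    value : integer

    write : (v:integer) -> integer
    {
        value = v ;
        reply( 0 ) ;
    }

    read : (port:integer) -> integer
    {
        reply( value ) ;
    }

    {
        while( true ) {
            block( write ) ;   /* accept only a write ...            */
            block( read ) ;    /* ... then accept only the next read */
        }
    }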

The two primitives described above send and receive asynchronous messages. Pearl also has a set of synchronous send primitives. During a synchronous send, the sending object waits for an answer from the destination object: a synchronisation is thus accomplished between the sending and receiving object, while additionally data is transported in two directions. The synchronous send is denoted using a single exclamation mark:

x = a ! method( param ) ;

while the reply is sent back from within the receiving method with the statement:

reply( somevalue ) ;

In Pearl, the synchronous send primitives are defined in terms of asynchronous sends. Although this introduces some typing problems, it is worthwhile because it allows the designer to invisibly split the sending or the receiving side of a synchronous message exchange into two asynchronous parts. In this way it is for example possible to model an arbiter that receives synchronous requests, but replies to these requests in an arbitrary order (while normally the requests would be answered in some FIFO or LIFO order). It is also possible to send a synchronous request to, for example, a memory, while in the meantime doing some other computations. The typing troubles have been solved pragmatically, as is shown in the remainder of this section.

The reply( value ) of a synchronous method that should return a value of the type type is syntactic sugar for

source !! reply_type( value )

The keyword source denotes the source-object of the message currently being answered; value should be an expression of type type. A synchronous send of the form 'x = a!method( param )' (with x of the type type and method replying a value of the type type) is syntactic sugar for:

x = ( a!!method( param ) ; block( reply_type ) )

where "(s;b)" executes s, and results in the value of b. The block primitive results in the value that is returned by the executed function. For this reason, Pearl functions result in two values: a reply-value (which is sent to the originating object) and a return-value, which is passed as return value inside the object (as the expression result of either the function call or the block). The method reply_type (in the class of the sending object) is defined by the compiler in the following way:

reply_type : type ( value : type )
{
    return( value ) ;    /* Return only, no reply */
}

It takes the replied value (which was sent as the parameter of the reply message), and returns it to the local caller, so it is returned as the result of the block. Below is an example that shows how the reply of a synchronous send can be postponed:

x : has_reply_integer                  /* See the text below for an */
                                       /* explanation of the type.  */

method : ( a : integer ) -> integer    /* Method should    */
{                                      /* reply an integer */
    x = source ;                       /* Don't reply yet, */
}                                      /* but remind source */

...
x !! reply_integer( 10 ) ;             /* send the reply */
...

The source of the message is saved (in the variable x), so the reply message can be sent later on, from another part of the code in this class.

This scheme to define synchronous sends poses a number of typing problems: the types of source, x and block need to be defined. Source can be the reference to any object sending such a message, but all objects sending a synchronous message have one method in common: the method reply_type. Pearl defines the type of source as has_reply_type, a set of classes defined in the language with exactly one method each: reply_type. Such a class tells exactly what the purpose of source is, namely to send a reply of the appropriate type. In the example above, x is thus declared by the Pearl programmer to be of the type has_reply_integer. The type of a block is determined by the return-type(s) of the method(s) which it blocks on; note that the return types of these methods should be equal.

The feature for splitting synchronous primitives should be used with care: when the designer makes an error in splitting either side of the synchronous send, and sends or awaits one reply too few or too many, the object will fall into a deadlock, or it will find all its subsequent synchronous transfers preempted by the superfluous reply message. These bugs cannot be located by the compiler, so it is hard to trace them. This is also the reason that conventional object oriented languages do not support such a feature: one has to choose between either synchronous sends or asynchronous sends at both sides.

3.1.3 The clock

The clock determines the execution order of the objects. The clock starts at time 0, and increments in discrete steps. An object that has to simulate that a certain amount of time passes tells the clock that it does not want to be scheduled before that time has passed. Waiting for 10 ticks is written as:

blockt( 10 ) ;

Any expression is allowed, as long as it results in a non-negative integer.

There are two orthogonal ways in Pearl to suspend an object: block for a message, and block for the clock to pass a certain time. These two may be combined: wait for both the clock and a message at the same moment, whichever comes first[2]. This is denoted by passing a list of methods to the blockt:

blockt( 10, method1, method2 ) ;

waits for 10 clock ticks, or for a message for method1 or method2.

Pearl computations and message passing execute in zero virtual time. From an object's viewpoint, the clock only advances when the object is blocked: thus in a block, a blockt, or a synchronous send (which is translated to a block). From the clock's viewpoint, the clock is only advanced when all objects are blocked; it then picks the first point in the future at which a blockt expires. When all objects are stopped on a block, the simulation terminates.
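A sketch of the combined wait (the class and method names are hypothetical, in the style of the examples of Section 3.1.5): a device that models 100 ticks of work, but can be woken up earlier by an asynchronous interrupt message.

    class device

    interrupt : (level:integer) -> integer
    {
        return( level ) ;          /* asynchronous message: no reply needed */
    }

    {
        while( true ) {
            blockt( 100, interrupt ) ;   /* busy for 100 ticks, or until an   */
                                         /* interrupt arrives, whichever      */
                                         /* comes first                       */
        }
    }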

3.1.4 Initialising the architecture

The architecture is initialised at load time, by specifying which objects should be created, and how these objects should be initialised. The specification of the architecture consists of a hierarchical definition in which all objects to be created are specified together with the initial values for their instance variables. The topology of the architecture under study is defined after the compilation: a feature which allows the topology of the architecture to be changed without recompiling the Pearl program. As said, it is not possible to create objects at run time, nor is it allowed to change the communication pattern. This means that the object communication graph is fixed after the start of the simulation.

[2] Another way would be to wait for both a message and the clock, but this can be modelled using the other three primitives.

The object graph is fixed during run time to ease the development of strategies to evaluate an architecture, as presented in Section 3.2.2. Evaluations on a fixed object graph are easier than evaluations on an object graph that changes. The restriction is a valid one because an architecture is a static entity: processors do not arrive suddenly, nor do bus connections suddenly change. Even fault tolerant systems with spare components are static in the sense that the components are already there, but are disabled for the time being. Human intervention (an extension of the architecture) cannot be modelled; the designer who insists on this feature should extend the architecture beforehand.

3.1.5 An example program

To clarify the way an architecture is modelled in Pearl, Figure 3.3 contains an example architecture with a processor, a cache, and a memory. The objects are denoted by the boxes, the messages by the arrows between the boxes. The time increases from top to bottom.

[Figure 3.3: Simulation of an example architecture (the corresponding source program is listed in Figure 3.4). Message timeline between Proc, Cache and Mem: T=10 Proc->Cache fetch word #1414; T=15 Cache->Mem fetch word #1414; T=30 Mem->Cache data 31415; T=35 Cache->Proc data 31415; T=45 Proc->Cache fetch word #1414; T=50 Cache->Proc data 31415.]

In this example, the processor is an object which does two accesses to the same address (1414). After the first access, the cache does a lookup of the data, waits 5 clock ticks to model the time needed for the lookup, and decides to query the memory for that word, since it was not in the cache. The memory gets the request and needs 15 clock ticks to fetch the data. The data (31415) is sent to the cache, which stores the data and sends it back to the processor after a small delay (5 ticks). The second access of the processor goes to the same address and is handled by the cache: after the lookup, the cache immediately sends the data back to the processor. In this example, the processor is waiting for the reply of the cache during the time intervals 10..35 and 45..50. The memory is waiting for a request from the cache before time 15 and after time 30.

The architecture sketched in this example does not exhibit any concurrency: all steps are executed sequentially.


class processor
c : cache

{
    blockt( 10 ) ;
    c ! fetch( 1414 ) ;
    blockt( 10 ) ;
    c ! fetch( 1414 ) ;
}

class memory
memval : [1024*1024] integer

fetch : (addr:integer) -> integer
{
    blockt( 15 ) ;
    reply( memval[ addr ] ) ;
}

{
    while( true ) {
        block( any ) ;
    }
}

class cache
mem : memory
cachedaddress : integer = -1
cached : integer

fetch : (addr:integer) -> integer
{
    blockt( 5 ) ;
    if( cachedaddress != addr ) {
        cachedaddress = addr ;
        cached = mem!fetch(addr) ;
        blockt( 5 ) ;
    } ;
    reply( cached ) ;
}

{
    while( true ) {
        block( any ) ;
    }
}

(a) Source listing of the three classes of the example.

architecture() {
    proc:  processor( cache )
    cache: cache( mem )
    mem:   memory( )
}

(b) Architecture topology of Figure 3.3.

architecture2() {
    procA:  processor( cacheA )
    procB:  processor( cacheB )
    cacheA: cache( mem )
    cacheB: cache( mem )
    mem:    memory( )
}

(c) Architecture topology of Figure 3.5.

Figure 3.4: a,b,c: The Pearl sources of the example simulations.

In Figure 3.5 the example architecture is extended with a second processor and cache. There is only one memory, which is shared by the two processors and caches. The memory is not dual ported: it handles one request at a time. Note that the sources of the three classes are not changed; there are only extra cache and processor instances.

Cache B tries to access the memory at time 15, when the latter is handling the request of cache A. The memory blocks this call and handles it after the other request has been finished. The answer from the memory is thus delayed 15 ticks. The consequence is that processor B gets the data 15 ticks later than expected.


[Figure 3.5: The previous example with an extra processor and cache; "fw" means fetch word, "dr" means data reply. A-side: 10 fw ProcA->CacheA, 15 fw CacheA->Mem, 30 dr Mem->CacheA, 35 dr CacheA->ProcA, 45 fw ProcA->CacheA, 50 dr CacheA->ProcA. B-side: 10 fw ProcB->CacheB, 15 fw CacheB->Mem, 45 dr Mem->CacheB, 50 dr CacheB->ProcB, 60 fw ProcB->CacheB, 65 dr CacheB->ProcB.]

Processor B has been waiting between ticks 10 and 50 for the answer of the cache, due to the memory contention.

In these examples, the objects are used to model real hardware activity: caches, processors and the memory. An object modelling hardware need not necessarily be a realistic model. It is sometimes helpful to perform experiments on hypothetical architectures, to determine lower and upper limits. One might for example simulate a machine with one centralised load-balancing unit with thousands of connections, without any contention. Although such an architecture cannot be built, it gives an upper bound on what can be achieved using a centralised solution.

Objects may even be used for non hardware components. When simulating a parallel machine, for example, it is convenient to have an object hanging around to which all objects can pass information about the utilisation of the system. This object is not a part of the architecture, but is just used to gather statistical information about the various nodes. These objects in general do not synchronise with the clock, and are servers which are always ready to receive information.
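A sketch of such a bookkeeping object (the class, variables and method below are hypothetical): because it never calls blockt, it consumes no virtual time, and it is always ready to receive a report message.

    class monitor
    reports : integer = 0
    totalwork : integer = 0

    report : (work:integer) -> integer
    {
        reports = reports + 1 ;
        totalwork = totalwork + work ;
        return( 0 ) ;              /* asynchronous: no reply is sent */
    }

    {
        while( true ) {
            block( report ) ;
        }
    }

A node would report asynchronously with, for example, monitor !! report( 25 ) ;.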

3.1.6 Comparing Pearl to other languages

Pearl implements a discrete event simulator, as described in Section 2.1.4, but with restrictions on the usage of the event-list. The only event defined is: "wake object X at time t". The event "wake object X at time now" is generated if a message arrives at object X for which object X was waiting. The event "wake object X at time x" can only be generated by object X itself, and only if it starts sleeping from that moment on. There are thus neither "future assignments" in the event list (see Section 2.1.5), nor any "future messages". Both assignments and message delivery are always executed in the present; delayed assignments must be programmed explicitly. This makes it harder to make mistakes with the event list.

Comparing Pearl to SIMULA, the most important differences are the restrictive communication mechanism and the data hiding concept, both of which are also present in POOL. Data hiding is essential for neat programming, and it allows the communication and synchronisation between objects to be traced. As will be shown in the next section, the Pearl kernel traces these synchronisations and presents an analysis to the designer.

POOL was the direct inspiration for Pearl. The first attempt of the author to simulate hardware was done using POOL, but it turned out to be impossible to use POOL conveniently as a simulation language without building a clock in POOL.


Extending POOL with a clock is not the way to go, because the objectives of POOL and Pearl are different. POOL implements parallelism with a nice programming paradigm; Pearl does not aim at parallelism (in the first instance). Adding a clock to POOL would make that version of POOL totally useless for parallel programming (maintaining a distributed global clock is a research field on its own), and, the other way around, lots of features in POOL are not needed at all for architecture simulation. Besides, changing the POOL compiler would have been a major effort because of the complexity of POOL and its compiler. Therefore we chose to define a new language, with a clock, but without the baggage borne by a language suited for the implementation of general purpose parallel programs.

In comparison with C, Pearl does not have pointers and their associated arithmetic. It is the author's opinion that pointers are not essential for simulations. Clear extensions to C are the object orientedness, the virtual clock and the message passing system. Pearl inherited the syntax of C. C++ and Pearl differ in many aspects; the most important difference is on the implementation side. The Pearl implementation gathers statistics about the objects, which is not possible when using C++, because one has to work below the object layer, thus deep in the C++ implementation.

Most concepts of Pearl sketched above are not new. They have been used in other languages (POOL, SIMULA, C, and others), and have proven to be useful. The unique point of Pearl is that only a few concepts are actually in the language, leading to a simple, clearly structured language. In the next section it is shown that, due to the restrictiveness of Pearl, the layer below Pearl can analyse certain performance aspects.

3.2 The Pearl kernel

The Pearl kernel implements the simulation primitives of Pearl, and maintains the statistics of the running Pearl program. The run time support part of the kernel, elaborated on in Section 3.2.1, takes care of object creation, the handling of messages and the actual scheduling of objects (Figure 3.6). The statistics part of the kernel (described in Section 3.2.2) collects all kinds of statistics and presents these to the user.

3.2.1 The run time support system

The run time support system is formed by the object creator, the scheduler, and the message handler. The object creator instantiates the objects and initialises the variables of the objects with the proper values and object identifiers. For this purpose, the creator reads a topology file, containing object definitions and initialisations. After the initialisation is completed, all objects are placed in the ready queue of the scheduler, and the scheduler starts executing objects. An executing object runs until it blocks, either because it needs a (not yet arrived) message, or because it has to wait for the clock: the scheduler is non-preemptive. In case the object needs a message (it stops on a block primitive), the object is set in the state Wait-for-message.


[Figure 3.6: Relation between the kernel, the objects, and the designer. The object creator reads the topology and the class definitions and creates the objects; the scheduler and the message handler execute the objects and exchange their messages; the statistics collector presents the evaluation output to the architect.]

If the object stops to wait for the clock, the object is set in the state Wait-for-clock, and it is placed in the future-list, the list of objects that have to be scheduled at a future virtual time. Objects blocking on the clock and a method at once are set in the state Wait-for-clock-or-message, and are placed in the future-list as well. As soon as the ready list of objects is exhausted, the scheduler picks the first object of the future list to be executed: the future list is Pearl's variant of the event list. The message handler takes care that objects receiving a message are placed in the ready queue. Summarising, the objects are in one of the following five states:

Running The object is executing sequential code; it is doing computations. There is only one object at a time in this state. The object is neither in the ready queue, nor in the future list. Depending on the reason the object is descheduled, it is transferred to one of the Wait states.

Ready The object is not executing yet, but it is ready for execution. All objects start in this state. The object is in the ready queue, not in the future list.

Wait-for-message The object is waiting for a message. The object is neither in the future list, nor in the ready queue.

Wait-for-clock The object is waiting for the clock to pass a certain time. The object is somewhere in the future list.

Wait-for-clock-or-message The object is in the future list, but the message handler may move the object to the ready queue in case the proper message arrives.

When both the ready queue and the future list are exhausted, and all objects are thus in the state Wait-for-message, the scheduler stops the simulation (no object will ever become ready anymore). Most of the time this is because all objects fall into a block without parameters, which is placed by the compiler at the end of the body (remember that the number of objects is fixed, so there is no object termination).


Object    Busy   Idle
ProcA      31%    69%
ProcB      31%    69%
CacheA     23%    77%
CacheB     23%    77%
Mem        46%    54%

Figure 3.7: The utilisation output of the example program of Figure 3.5.

But when the program falls into a deadlock (objects are sending messages to each other, but none of them is able to reply), all objects are also in the Wait-for-message state, resulting in termination of the program as well. This silent termination is undesirable because a deadlock is caused either by a programming error, or by a design flaw in the architecture. A future version of Oyster should recognise deadlocks, and respond as appropriate.

3.2.2 Statistics

The second task of the Pearl kernel is to collect simulation statistics to aid in evaluating the design. Every interaction between the Pearl code and the run time support is signaled to the collector to update the statistics. Currently five measures are maintained: the analysis of the utilisation, contention, time distribution, call graph, and bandwidth. At the end of the simulation run, all statistics are emitted, together with the statistical information maintained by the Pearl model itself.

Utilisation analysis

The utilisation analysis measures how long objects were idle and busy. To compute this, the virtual time spent in the three Wait-states described previously is accounted for each object (note that virtual time cannot be spent in one of the first two states because the clock only advances during Wait-states).

When an object is in the state Wait-for-message, the object cannot do anything until the message arrives. This means that the component is idle. All time spent in the state Wait-for-message is therefore accounted as idle-time of the object.

When an object is in the state Wait-for-clock, the object models that work is being done; hence the object is busy. The time spent in the state Wait-for-clock is accumulated as busy-time of that object.
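As a sketch of how these two rules apply to a typical component (the class below is hypothetical, in the style of the examples of Section 3.1.5): the blockt inside the method is accounted as busy time, the block in the body as idle time.

    class disk

    access : (sector:integer) -> integer
    {
        blockt( 200 ) ;      /* Wait-for-clock: accounted as busy time   */
        reply( sector ) ;
    }

    {
        while( true ) {
            block( any ) ;   /* Wait-for-message: accounted as idle time */
        }
    }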

The last state, Wait-for-clock-or-message, poses some problems: depending on how the blockt with time and methods is used, the time spent in this state should be accounted as busy or as idle time. Suppose that the time of the blockt places a time-out on the receipt of a message. In that case the time should be accounted as idle-time, because the object is inactive. But when the method of the blockt is used as an interrupt, while the time was modelling activity, the time should be accounted as busy time. For the utilisation analysis it is assumed that objects awoken by a message were idle, while objects awoken by the clock were busy.


Mem:   # Messages    0      1
       idle         54%
       busy         23%    23%

...

Figure 3.8: The contention analysis output of the example program of Figure 3.5.

This heuristic is not foolproof, but since the time spent in this state is usually minimal, it only rarely causes problems.

As an example output of the utilisation analysis, Figure 3.7 shows what the utilisation of the components of the architecture depicted in Figure 3.5 would be. Processor A, for instance, only models work during its two blockt(10) periods (20 of the 65 simulated ticks), which yields the 31% busy figure. Since the simulation run is very short, it is not possible to draw any conclusions from this table, but in longer and more complex cases, the designer can get an idea of where the bottlenecks are. Highly utilised components are either a bottleneck in the system or simply well used, while lowly utilised components can be deleted from the architecture if their functionality can be taken over by some other component. It is up to the designer to draw conclusions from the figures; Oyster only signals possible troubles.

Contention analysis

The utilisation analysis distinguishes two states: busy and idle. When the number of messages waiting for the object is taken into account, it is possible to gain insight into whether the object is a central bottleneck in the architecture: long queues of messages suggest that the component has (too) many waiting clients. For the contention analysis, the accounting of busy and idle time is differentiated according to the queue length, as is demonstrated by the output of the example program depicted in Figure 3.8. In total the memory was 54% idle and 46% busy. During 23% of the time, the memory was busy with one waiting message; during 23% of the time, the memory was busy without waiting messages. In realistic outputs, components with heavy contention are recognised because a large fraction of time is spent with many messages waiting for them. The figures coming out of this analysis can also be used to construct or verify a queueing model for this component of the architecture.

Profiling analysis

The simplest way of analysing software performance is to make a profile of the program, as for example obtained with the UNIX prof command. Prof counts the number of calls to each function of the program, and accounts the time spent in the various functions. This method can be applied to evaluate architectures specified in Pearl as well: for each method of a class, the number of invocations of the method and the average time spent in the method are maintained, leading to an analysis of where the time was spent inside objects. Those parts may need optimisation.


[Figure 3.9: An example graph representing an executing Pearl program. Five columns of nodes represent the objects Proc A, Cache A, Memory, Cache B and Proc B; virtual time runs from 0 to 60 ticks from top to bottom. Horizontal and slanted arrows represent message transfers, vertical arrows represent objects waiting for the clock; a thick line marks a critical path.]

The utilisation of each object is thus further split on the basis of the function in which the time is spent. The nice point in comparison with ordinary software profiling is that Pearl works in a virtual time frame: maintaining the profile does not disturb the virtual time, it only takes real time. This means that this measure, implemented in Pearl, does not introduce any uncertainties.

Call graph analysis

The call graph analysis is the most sophisticated analysis currently in Oyster. It has a close resemblance to the call graph analysers found in conventional software environments (like gprof under UNIX [Graham82]) and to the critical path analysis found in electrical simulators (as for example [Wallace88]); the call graph analyser is used to find the critical paths in the architecture. This is implemented by analysing each idle period of an object. At the end of an idle period, the Pearl kernel traces back along which path in the call graph the object was awoken. If the objects along this (critical) path run faster, the final message comes earlier, resulting in a shorter idle time. This means that the objects "responsible" for the idle time of a certain object can be identified. The call graph of the example program of Figure 3.5 is depicted in Figure 3.9. The virtual time proceeds from top to bottom, according to the time scale at the left hand side. The nodes in the graph represent active objects, the horizontal and slanted arrows represent message transfers, and the vertical arrows represent objects waiting for the clock.

Processor B for example is idle between ticks 10 and 50. The thick line between the dots at 10 and 50 shows the critical path for this idle period. It passes cache B twice (2 times 5 ticks), and the memory once (for 15 ticks).


Mem      Self     66% (-)      Busy time of memory
         ProcA    22% (66%)    22% of total time, 66% of idle time
         CacheA   11% (33%)

ProcB    Mem      46% (66%)    Memory was bottleneck for processor B
         Self     31% (-)      Busy time of processor B
         CacheB   23% (33%)    Cache B is responsible for one third

CacheB   Mem      46% (60%)    Memory was bottleneck for cache B
         ProcB    31% (40%)
         Self     23% (-)      Busy time of cache B

...

Figure 3.10: Output of the call graph analysis. The text at the right is not produced by Oyster, but is an explanation by the author.

A further 15 ticks are caused because the memory is handling the message from the A-side of the architecture; these are accounted to the memory object as well. The idleness of processor B between ticks 10 and 50 is thus "caused" by cache B for 10 ticks, and by the memory for 30 ticks. The idle period of cache B between 15 and 45 is entirely caused by the memory, while the idle period of cache B between ticks 50 and 60 is caused by processor B. The output of this analysis for the memory, cache B and processor B is presented in Figure 3.10. The memory was busy for 66% of the time; it was waiting for processor A for 22% of the time, and for cache A for 11% of the time. Likewise, processor B is busy for 31%; the remaining 69% idle time is caused by the memory (for two thirds) and the cache (for one third). In real runs of architectures, this trace information can be used to determine where the bottlenecks of specific parts are.

Maintaining this analysis is considerably more expensive than the others: the costs of this analysis are quadratic in the number of objects, while the other measures can be maintained in constant time (per event). Critical path analysers, as found in hardware design environments, are much faster, but they only present a static analysis of the critical path. The analysis presented here calculates an "average" critical path that takes the dynamic behaviour of application and architecture into account.

Average bandwidth

The messages sent in a Pearl program are used to model communication between the various components. The Pearl kernel counts the total length of the messages exchanged between the various objects, in order to compute the bandwidth requirements between the objects. This bandwidth estimation is helpful when deciding how the components should be distributed over chips and boards. Figure 3.11 shows the bandwidth requirements for cache A of the example architecture. The unit of the bandwidth is "bits per tick", since the Pearl clock runs in ticks. Note that only the average bandwidth is computed; peak bandwidth cannot be computed since Pearl messages are sent in zero time.


ProcA to CacheA    0.98 bpt
CacheA to ProcA    0.98 bpt
Mem to CacheA      0.49 bpt
CacheA to Mem      0.49 bpt

...

Figure 3.11: The bandwidth requirements between cache A and the other components.

Pearl level statistics

To aid the Pearl programmer in maintaining the statistics of objects (for example the cache hit rate, or the average time between two reads in a memory), each class is equipped with a special method. This method, called statistics, is implicitly added to the method list of each block statement. At the end of the simulation run, all objects receive a statistics message, whereupon the object calculates and prints the maintained statistics. To ease the calculation of the statistical values, a standard library is provided for the calculation of averages, the standard deviation, and so on.
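A sketch of a class maintaining its own statistics, based on the cache of Figure 3.4 (the counter variables are hypothetical, the exact signature of the statistics method is not specified here and is assumed, and printf is used as in the processor examples of Section 3.3.3):

    class cache
    mem : memory
    cachedaddress : integer = -1
    cached : integer
    accesses : integer = 0
    misses : integer = 0

    fetch : (addr:integer) -> integer
    {
        blockt( 5 ) ;
        accesses = accesses + 1 ;
        if( cachedaddress != addr ) {
            misses = misses + 1 ;
            cachedaddress = addr ;
            cached = mem!fetch(addr) ;
            blockt( 5 ) ;
        } ;
        reply( cached ) ;
    }

    statistics : (dummy:integer) -> integer
    {
        printf( "cache: %d misses out of %d accesses\n", misses, accesses ) ;
        return( 0 ) ;
    }

    {
        while( true ) {
            block( any ) ;
        }
    }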

Interpreting the statistics

When the analyses above are applied to an architecture of 100 objects, the designer would be buried under a few thousand lines of statistical output. Many of these lines are uninteresting beforehand: the working of the reset circuit, for example, is well known: it is inactive. The designer can therefore suppress statistical messages with a control file. To ease the interpretation of the remaining statistics, the output is sorted where possible. Components with long queues, totally idle components and totally busy components are given at the top of the list. Although the sorting heuristics are very ad hoc, they are at least better than a random (alphabetic) ordering.

3.3 The layer above Pearl

The previous sections presented the simulation language and the simulation kernel. Pearl is convenient for running architecture simulations, but in many cases the designer could use a higher level of abstraction. Reprogramming a cache each time is a waste of time, and is more error prone than using a "standard cache". For this purpose, Oyster is equipped with a library of components which are frequently used. Currently the library consists of generic memory, cache, and processor models.

The library modules are parametrised so that a whole class of components is covered. The memory module can be configured to simulate dynamic and static memories, the cache module is configured with the line size, set size, and replacement algorithm, while the processor is parametrised with the instruction set and the addressing modes. Because of this high degree of parametrisation, the word "library" is somewhat misleading. The models are actually generated by small programs, or generators. Such a generator emits, given the parameters of the component, a Pearl class, which is further translated by the Pearl compiler.

This strategy for modelling an architecture is comparable to the construction of a compiler using the "yacc" and "lex" tools. Yacc generates the parser of the compiler (given the syntax description), lex generates the lexical analyser part of the compiler, while the rest of the compiler is supplied in C. Some parts are generated, the rest is hand-work (note that the current generation of compiler tools is more sophisticated and can construct type-checkers and code generators as well). In Oyster, the models of the processor, memory and cache are generated given the parameters of the model; the other components (communication processors for example) are specified in Pearl directly. Like lex and yacc, the high level descriptions might contain parts of Pearl code to specify, for example, the replacement algorithm of a cache.

3.3.1 Memory

The simplest standard component is the memory. A memory is used to store and retrieve data in arbitrary order (Random Access Memory). Different types of memories exist, which behave differently in how the accesses are granted. Besides the trivial differences in access time, Oyster supports dynamic and static RAM. The access time of static RAM is equal for each access. The access time of dynamic RAM depends on the time since the previous access, and may also depend on the previously accessed address, since accesses to the same page might be optimised.

For static RAMs, the designer specifies the time needed to read data from the memory, and the time needed to write data to the memory. For dynamic RAMs, the designer should also provide the time needed for the RAM to recover from the previous access and, if the RAM has a page mode, also the page size. For both static and dynamic RAM there are parameters for the size of the RAM, and the width (type) of the elements stored.

The memory module implements only one statistical measurement: the number of successive reads to the same page. The rest of the interesting statistical measurements (the number of reads, the number of writes, the idle time) are already measured by the Pearl kernel. The memory model currently simulates the memory contents in an array, hoping that the underlying operating system will have enough swap space to store it. The memory model can be changed so that only the used parts of the array are simulated, but this space optimisation is not necessary at the moment, and can be implemented invisibly to the architect later on.

3.3.2 Cache

The cache is slightly more complex than the memory. In comparison with the memory, the cache has extra parameters for the line size, the number of lines per set (1 for direct mapped caches), and the update and replacement policies. For the replacement algorithm the designer can choose between pseudo random (the simplest), FIFO, and LRU, or supply a piece of Pearl code implementing the replacement algorithm. This last feature allows the designer to experiment with, for example, compiler controlled replacement. The cache model counts the average miss rate of the cache, and the average miss penalty. The output is less sophisticated than the output of tools like dinero and the AWB, but it is just a matter of time to enhance these outputs.

3.3.3 Processor

The processor model can simulate a wide range of processors. The processor model is driven by emulating a program at instruction level. The processor is specified by the following parameters:

- the available memory resources: the register bank, the main memory, the instruction memory (may be the same), special registers, and so on,

- the instructions,

- the addressing modes,

- the clock speed.

The memory resource definitions are specified by their name and type. An example specification might look like:

r            : memory    /* Register bank          */
mem          : memory    /* Data memory            */
instructions : memory    /* Instruction memory     */
IObus        : IO        /* Bus for performing I/O */

Note that only the interfaces are defined. The register bank itself should be declared later on as a (very fast) static memory with 16 entries.

The addressing modes are defined orthogonally on top of the memory resources above. For each addressing mode allowed, the syntax is given, followed by the address calculation functions for this mode. The keyword default is used to indicate that the normal computation method is used; otherwise two pieces of Pearl code should be provided, for loading and for storing a value. As an example, the following addressing modes can be specified:

_            default    /* Immediate addressing        */
r[_]         default    /* Direct register addressing  */
mem[r[_]]    default    /* Indirect memory addressing  */
mem[r[_]+_]  default    /* Offset memory addressing    */

The brackets denote an indirection, the '_' denotes a constant integer, and the '+' is used to add values. Using register 0 is thus denoted by r[0], and addressing a memory location at offset 8 from register 7 is denoted by mem[r[7]+8].


The instructions are specified by a five-tuple: the name of the instruction, the number of clock cycles needed, the number of bytes needed in the program space, the number and type (read, write) of operands, and the semantics of the instruction. The first four are straightforward. The semantics are either specified by the keyword default (in which case the instruction name should be a name known to Oyster, like "add"; code is then generated for the addition of two integers), or alternatively one may specify a fragment of Pearl code that tells how the instruction behaves. With this, full flexibility is provided to model non-standard instructions. As an example, consider the instruction specifications:

    add     4  1   wrr   default
    mult    4  32  wrr   default
    square  4  32  wr    { o0 = i0 * i0 ; }
    input   4  2   wr    { o0 = IObus ! input( i0 ) ; }
    output  4  2   rr    { IObus ! output( i1, i0 ) ; }
    printi  0  0   r     { printf( "%d %08x\n", i0, i0 ); }

The first five instructions need 4 bytes of program space. The add takes 1 cycle; the mult and square take 32 cycles. The add has three operands, one write and two read, for the destination and the two sources; the action is the default: add. The output instruction has two read operands, for the address and the value to be written on the IObus (which had been declared as an external memory). The Pearl variables i0, i1, ... are the input parameters of the instruction; o0, o1, ... are the output parameters. The printi instruction is an example of a rather curious instruction: it takes neither program space nor instruction cycles, and it prints the operand in two formats. This instruction is handy to get timing or debugging output without disturbing the behaviour of the architecture (the instruction can be placed anywhere in the program for free).

A generic assembler is capable of assembling programs written in the processor's assembly language into an intermediate format that is interpreted by the processor model during the simulation run; see Section 4.1 for more details. An example program for this architecture might be:

    input   r[0],13         ; get r[0] from IObus
    printi  r[0]            ; print r[0] on the screen
    add     r[1],r[0],r[0]  ; r[1] = 2 * r[0]
    add     r[2],r[1],r[0]  ; r[2] = 3 * r[0]
    mult    r[3],r[1],r[2]  ; r[3] = 6 * r[0]^2
    output  r[3],18         ; place the result on IObus
    printi  r[3]            ; print the result on the screen

The program gets an integer from the IObus, multiplies it by 2 and by 3, multiplies these two results, and outputs the product on the IObus. Meanwhile, both the input and output values are printed on the screen for debugging.

Load/store architectures, which only allow memory operations in load and store instructions, can be implemented easily: as long as the assembly code does not contain any memory reference outside the load and store instructions, a load/store processor is emulated. It is not yet possible to model processors with multiple pipelines and interlocks using this model. Furthermore, the current processor model has only a rudimentary collector of statistical information. It counts the instruction rate (a unit of performance that is questionable, as mentioned in the introduction), and the (dynamic) distribution of the instructions used. In the future, it will be extended to measure the usage of the addressing modes and the branch distance distribution as well.

3.3.4 Other library models

Arbitrary models can be added to the library. The library as it is now has been used in the experiments described in the coming chapters. Some extra models can be developed, for example a processor for the simulation of applications using MiG (Section 4.2.2) or a stochastical application as defined in Section 4.3. One of the future activities on Oyster will be to extend the library to cover a wider range of components, and to improve the generality of the existing library elements.

3.4 Interfacing Oyster to the outside world

Oyster has two interfaces to the outside world: an interface for calling C-functions from Pearl, and an interface for coupling VLSI simulators to the Pearl kernel.

The C-function interface allows the standard C library (mathematical functions, random generators, the UNIX system calls) to be used from within Pearl. Furthermore, one can code parts of the simulator in C, for example those parts where dynamic memory management is needed and which are not conveniently encoded in Pearl. This interface introduces some holes in the typing scheme and execution model of Pearl programs: when the programmer does not give the correct typing of a C-function, the results are quite unpredictable; and if the programmer uses C-functions to implement another communication medium using global variables, the Pearl kernel will not detect this. These holes are the price paid for opening Pearl to the outside world.
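
A typical use of this interface is a small helper like the one below, which draws an exponentially distributed delay using only the standard C library; a Pearl object could call it, for example, to generate stochastic interarrival times. The function is a hedged illustration: the Pearl-side type declaration that must accompany it is not shown, and the name is not part of Oyster.

    #include <math.h>
    #include <stdlib.h>

    /* Draw an exponentially distributed delay with the given mean. */
    double exponential_delay(double mean)
    {
        double u = (rand() + 1.0) / ((double)RAND_MAX + 2.0);  /* u in (0,1) */
        return -mean * log(u);
    }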

The other interface allows a VLSI simulator (or any other simulator) to be coupled to the Pearl kernel. The interface is defined so that, from the viewpoint of the Pearl objects, the VLSI simulator behaves as one single object; the VLSI simulator does not see any Pearl object, it only receives the stimuli coming from the objects. The VLSI simulator is thus encapsulated in a Pearl object, as depicted in Figure 3.12. Messages entering this "VLSI object" are translated into input stimuli directed to the VLSI simulator, and the outputs of the VLSI simulator are translated into messages coming from the VLSI object.

The difficult point of this interface is to synchronise the virtual clocks of both simulators. The Pearl kernel maintains a virtual clock, but the VLSI simulator maintains a clock as well. Additionally, many types of VLSI simulators exist, with different rules for how the clock is maintained. For this discussion three types of simulators are distinguished:


[Figure omitted: a Pearl processor and a Pearl memory, with the cache between them simulated by the encapsulated external VLSI simulator.]

Figure 3.12: Interfacing Oyster with an external VLSI simulator.

1. The continuous time simulators (Section 2.1.2). An example of this class is the simulator used to run the last check on a chip, where the circuit is simulated at the electrical level.

2. The discrete event simulators (Section 2.1.4). For these simulators a circuit is modelled in terms of switches and sometimes capacitors and resistors. Only discrete voltages are assigned to the wires, for example 0, 1, Undefined and High-impedance.

3. The clocked simulators. These simulators assume that the circuit is clocked (an electrical clock, not to be confused with the virtual clock of a simulator) with a frequency that is so low that all transistors in the circuit can stabilise each clock cycle. These simulators iteratively update the values of the wires until the circuit stabilises, whereupon the clock of the circuit is advanced (see amongst others [Bryant84]). This type of simulator is a kind of continuous time simulator, but in contrast with real continuous simulators, the time step Δt is bound to the clock frequency of the circuit (this eases the interfacing).

The continuous simulators are interfaced easily: Oyster does all the computations that are to be done at time 0, whereupon the VLSI simulator is allowed to simulate until either the time is reached at which one of the Pearl objects has to be scheduled, or one of the outputs of the low level simulator changes its value. In the latter case, a message is generated, which will almost certainly wake some Pearl objects. Note that there should be a conversion from the continuous voltage levels of the VLSI simulator to discrete values for Oyster, otherwise every 10⁻⁷ Volt difference at an output pin will result in a message, leading to a disastrous flood of messages.
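
A minimal sketch of such a conversion is given below: the continuous voltage is mapped onto a discrete level with some hysteresis, and only a change of level is reported to the Pearl side. The thresholds are illustrative (roughly TTL-like) and the function is an assumption, not part of Oyster.

    enum level { LOW, HIGH };

    /* Quantise a voltage; return 1 when the discrete level changed, so that
     * the caller generates exactly one message per level change. */
    static int discretise(double volts, enum level *previous)
    {
        enum level now = *previous;

        if (volts > 2.0)                /* clearly high                       */
            now = HIGH;
        else if (volts < 0.8)           /* clearly low                        */
            now = LOW;
        /* between 0.8 V and 2.0 V the previous level is kept (hysteresis)    */

        if (now != *previous) {
            *previous = now;
            return 1;                   /* the caller should send a message   */
        }
        return 0;                       /* no message needed                  */
    }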

Interfacing a discrete event simulator requires that the event lists of the VLSI and Pearl simulators are effectively merged (a problem similar to the synchronisation of the nodes of a parallel simulation, as for example in [Chandy79]). By running the two simulators as co-routines, and by passing control to the other co-routine each time the virtual clock would advance over the first event of the other simulator, the simulators are kept synchronised. This can only be implemented if the VLSI simulator is open, in the sense that it can be made to execute single events; only few simulators are that open.
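
One way to organise this co-routine handover is sketched below: whichever simulator holds the earliest pending event executes it, so the merged event list is processed in virtual-time order. The two hooks are hypothetical; they stand for "peek at the first event" and "execute one event" in the Pearl kernel and in a sufficiently open VLSI simulator.

    #include <math.h>

    struct sim;                                /* opaque simulator handle   */
    double next_event_time(struct sim *);      /* HUGE_VAL when list empty  */
    void   execute_next_event(struct sim *);   /* may post events elsewhere */

    void run_coupled(struct sim *pearl, struct sim *vlsi)
    {
        for (;;) {
            double tp = next_event_time(pearl);
            double tv = next_event_time(vlsi);

            if (tp == HUGE_VAL && tv == HUGE_VAL)
                break;                          /* both event lists are empty      */
            if (tp <= tv)
                execute_next_event(pearl);      /* may generate stimuli for vlsi   */
            else
                execute_next_event(vlsi);       /* may generate messages for pearl */
        }
    }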

Coupling the clocked simulators is the easiest. The VLSI object repeatedly blocks for one clock cycle and then calls the VLSI simulator to stabilise the circuit. This simple scheme works because a clocked simulator does not maintain its own virtual time; it only recognises clock phases.


The MULGA package ([Weste81], with a clocked simulator) has actually been interfaced with Oyster, and some simple experiments were performed with it. A communication processor design of [Mooij89] could be placed in a network of communication processors specified in Pearl, so the Pearl network could drive the communication processor. At that level it is rather easy to test many input patterns for the communication processor. At present it is being investigated whether it is worthwhile to link Oyster to the Cadence tools.

3.5 The status of the current Oyster implementation

The first ideas of Pearl were developed in the summer of 1988. The first prototype compiler that was implemented could not do anything with data structures, but could pass messages, synchronise with the clock, and implemented some of the evaluation measures. From the autumn of 1988 till the summer of 1989, a new version of the Pearl compiler was developed by two master's degree students [Overeinder89, Stil89], which has been upgraded since then to the version that is described in this thesis. Since that date various experiments have been performed with the Pearl compiler, the layers above it, and the kernel below. The evaluation strategies mentioned in Section 3.2 were partially inspired by the experiments, and were designed and implemented in parallel with the experiments.

It is hard to compare Pearl and Oyster with any of the hardware design environments, because the goals are different: Oyster aims at performance evaluation, while the hardware design environments aim at implementation and design. An HDL design can be translated into a mask level design (which is not possible with Oyster), whereas the cache hit rate in a multiprocessor system cannot be estimated using an HDL; a comparison makes no sense.

Pearl combines features from simulation languages and architecture evaluation tools. The simplicity of the language allowed us to experiment with features at language level (as for example the possibility to split synchronous sends invisibly) and below the language level (as for example with the evaluation figures and the interfaces to the outside world). The library models on top of Pearl ease the evaluation of architectures. Because Oyster has a layered structure, all problems can be attacked at the appropriate level. Integration is no problem, since the top layers are built onto a common lower layer. Oyster is extensible as well: currently only three top level models and five evaluation strategies are implemented, but an extension with other high level models is trivial and invisible to the other layers.

The layering and generalisation provided by Oyster are not for free: the performance of Oyster itself is lower than the performance of comparable evaluation tools. A tool such as dinero, for example, can trace 20000 cache accesses per second, while Oyster is stuck at 5000. When comparing these figures, one should not forget that Pearl maintains a virtual time during this simulation, while dinero only calculates the time independent measures. The performance benefits are inherent to the specialisation of the evaluation tools. The evaluation measures maintained by, for example, the AWB are also better than the evaluation measures of Oyster, but it is just a matter of manpower to enhance the models and measures of Oyster.


Chapter 4

Simulating applications

Simulating the hardware of an architecture is one aspect of architecture evaluation. Simulating the software that is proposed to run on the hardware is just as important, for the following three reasons. Firstly, the software places a load on the hardware: the hardware cannot be simulated without this load. Secondly, neither the application program nor the system software (such as the compiler and operating system) can be developed without simulating it on the hardware: design decisions in the software are based on the expected performance behaviour of the hardware, and these design considerations should be verified. Thirdly, the tradeoff between software and hardware is one of the main issues of computer architecture, and it is not possible to make a fair tradeoff without an evaluation of both. The software and hardware should thus be simulated in an integrated way.

In this chapter three methods are presented to simulate an application program on simulated hardware. The methods are presented in order of increasing abstraction from the application: in the first method the software is fully interpreted, the second method uses address traces of applications, while synthetic traces are used for the last method.

The most basic way to simulate the software on the hardware is to emulate the application at the level of individual processor instructions. This technique, known as emulation, is elaborated on in Section 4.1. It can be used to simulate every detail of the application and the architecture, including I/O, pipelining and interrupt handling. The drawback of a full emulation is that it places large demands on the simulation platform, in terms of processing speed and memory capacity.

The simulation speed can be improved by abstracting from the processor instructions, for example by using an address trace extracted from the application: the sequence of addresses referenced by the program. The architect is not bothered with the details of the full simulation, but the architect cannot experiment with I/O or detailed tricks in the processor. Generality is traded in for speed and simplicity. Section 4.2 discusses five methods to extract address traces: four methods to extract the trace off-line, and one to extract it on-line. The difference between off-line and on-line extracted traces is that on-line traces are constructed incorporating feedback from the simulator concerning the timings, so that a realistic execution path can be chosen. For experiments with parallel applications, on-line feedback, as provided by the fifth method, is therefore inevitable; sequential programs can be simulated off-line.


[Figure omitted: the processor model exchanges data references and data/I/O with a data memory model, and instruction references and instructions with an instruction memory model.]

Figure 4.1: Full emulation of a program. For clarity, the data and instructions are placed in separate memories, but these can be combined into one memory module.

The demands on the simulation platform can be further reduced by using a stochastically generated address trace, elaborated on in Section 4.3. This method abstracts most rigorously from a real application: an address trace is produced (suitable for driving the memory system of an architecture) based on statistical data about the behaviour of a real program. This method is the least accurate, but it is the most flexible one. It also places the lowest demands on the simulation platform.

Section 4.4 presents a comparison of the three algorithms, where the applicability, the resource requirements and the accuracy are discussed. The two simulation methods presented in Section 4.1 and Section 4.3 have been used for the experiments described in Part II of this thesis; the MiG (described in Section 4.2) is used for the applications simulated in [Langendoen92a] and [Hofman93].

4.1 Full emulation of the application

The most natural way to simulate the application program is to emulate the application at the conventional machine level. During such an emulation, all instructions that would have been executed in the real world are simulated on the architecture under study. The instructions are fetched from the instruction memory (as depicted in Figure 4.1), and decoded and emulated by the processor model. The processor obtains the data from the data memory. The processor can either use memory mapped I/O or perform the I/O directly.

With such a detailed simulator, the architect can validate the machine language and I/O interface of the architecture. Furthermore, it provides a platform for the development of assemblers, compilers, libraries and operating systems. The software can be debugged by running the code on the simulated hardware, and the performance of the total architecture (hardware and software) can be measured and analysed. The designer can determine static measures (such as the program length or the number of functions), and dynamic measures, for example the average jump distance or the memory utilisation. Because of the level of detail, the outcomes of the simulation are reliable, and can be used to make justified tradeoffs.

An example system capable of emulating application programs at this level of detail is ISPS [Barbacci81].


[Figure omitted: an assembly store feeds the instruction stream to the processor model; the processor model exchanges data references and data/I/O with a data memory model, and instruction references and fake instructions with an instruction memory model.]

Figure 4.2: An emulation as provided by Oyster. The instructions are stored in a separate 'store' (invisibly), but fetches to the instruction stream are still simulated.

The user specifies the processor at register transfer level, specifies the instruction set coding, and ISPS builds a simulator for the processor, and an assembler that translates the application program to machine code. The simulator interprets the machine code program, so all details are accounted for: pipelines in the implementation, self-modifying code and so on.

The drawback of an emulation at this level of detail is twofold. Firstly, a full emulation requires an enormous amount of processing power on the simulation platform. When, for example, a parallel application has to be emulated, all processor cycles and the whole memory are to be simulated on the simulation platform. This places a huge demand on both the processing power and the memory capacity of the simulation platform. Secondly, a full emulation at conventional machine level requires the development of a compiler and assembler (which is taken care of by ISPS), and the development of a model for the processor including all details. In a first design stage, one might want to evaluate without the burden of developing the models in full detail.

Both drawbacks can be relieved a little: Oyster provides a model for a "processor", together with a generic assembler (like ISPS does), described in Section 3.3.3. In contrast with ISPS, the simulation is performed at assembly level. The architect has to give a definition of the assembly language, which is used by Oyster to generate a simulation model for the processor that can execute assembly programs. The architect does not have to design the coding scheme for the instructions and addressing modes, nor any internal details of the processor. It is enough to specify the addressing modes and the instruction set in terms of the timings, lengths, semantics and number of operands. Although the application still needs to be translated to assembly (a compiler is inevitable), Oyster takes care of the instruction coding and decoding, which is faster in both run time and design time. The way Oyster simulates the assembly program is depicted in Figure 4.2: the assembly program is executed from a separate instruction store (invisibly to the architect), while the instruction fetches are simulated from the instruction memory, although the contents of this memory are in this case totally irrelevant (compare to Figure 4.1). The data is fetched from the data memory, and is used to drive the program.

The price paid for a faster and more convenient level of simulation is that certain interesting details cannot be modelled anymore. The introduction of pipelines in Oyster's processor model is under investigation, but at present all specific processor implementation tricks need to be implemented completely by hand, in Pearl.


[Figure omitted: the application, executing on the host processor, delivers an address trace to the processor model, which drives the data and instruction memory models with fake data and fake instructions; a dashed arrow returns on-line feedback to the application.]

Figure 4.3: Trace generation. The dashed arrow represents the feedback of on-line generated traces (presented in Section 4.2.2).

Despite these restrictions, studies not directly related to the precise execution model, such as caching or I/O studies, can be performed conveniently using Oyster. This is demonstrated in Chapter 5, where tradeoffs between hardware and software are made for the communication architecture of a distributed memory machine.

4.2 Application derived address traces

Instead of emulating the application program, the application can just be executed on a host computer to extract an address and instruction trace from the executing binary. An address trace can be as simple as a sequence of referenced addresses, but it can be annotated with tags about the type of reference (user or supervisor mode, instruction or data), with the instruction currently being executed, or even a time stamp. The architecture simulator can then use this trace as a substitute for the references coming from a real processor, as is depicted in Figure 4.3. The processor model gets the trace of the application, and uses it to drive the memories. Neither the referenced data nor the referenced instructions are of interest; only the reference pattern is interesting (compare Figure 4.3 with Figures 4.1 and 4.2).
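
As a concrete (but assumed) example of such an annotated trace, each reference could be stored in a record like the following; the tags correspond to the annotations mentioned above, and the time stamp is optional.

    /* One record of an annotated address trace (layout is illustrative). */
    struct trace_record {
        unsigned long address;     /* the referenced address             */
        unsigned char is_data;     /* 1: data reference, 0: instruction  */
        unsigned char is_write;    /* 1: write, 0: read                  */
        unsigned char supervisor;  /* 1: supervisor mode, 0: user mode   */
        unsigned long timestamp;   /* optional time stamp                */
    };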

The conventional way to capture an address trace is a one-way process: the trace is sent to the simulator, which uses the trace. We call this off-line tracing. In contrast, it is also possible to generate traces on-line, in which case the application program runs with feedback from the simulator, as visualised by the dashed arrow in Figure 4.3. This feedback is used by the application program running on the host machine to make time-dependent decisions, for example the scheduling decisions in a parallel application. Section 4.2.1 first discusses how to extract off-line traces, while in Section 4.2.2 a method will be presented to capture traces on-line. This method can be used to extract traces from real parallel programs that can be used for the simulation of a parallel architecture.

4.2.1 Off-line generated address traces

To extract a trace from a running program, one has to intervene at some level in the implementation: in the hardware, the microcode, the machine code or one of the higher levels. All four are discussed below. Low level intervention generally poses technical problems (specialised hardware or knowledge is needed), while high level interventions are less accurate and run slowly.


Again, the tradeoff between the accuracy (level of detail) and the effort needed to fetch the trace has to be made.

Hardware intervention. To get a real time address trace, hardware monitors are often used. A bus-spy or some other hardware device snoops all the traffic on the memory bus, and stores the references for later usage [Hercksen82]. A specialised hardware device is the best way to fetch traces, because these traces give an exact representation of the application program running on that particular machine, without disturbing the execution in any way.

Microcode intervention. Hardware monitors have two serious drawbacks: they are rather inflexible in their usage (special hardware needs to be installed in the computer, nullifying the warranty), and they cannot record the internal state of the processor. Because of these drawbacks, software solutions have been sought, running on (or in) the processor. The most elegant approach is probably ATUM [Agarwal86]. ATUM (Address Tracing Using Microcode) intervenes at the lowest software level: the microcode. The microcode of the VAX has been modified so that all address references are captured. Since the microcode program has been extended, ATUM runs slower than a normal VAX, so the timing information cannot be part of the trace. In comparison with hardware monitoring, the internal state of the processor is available to ATUM. ATUM is transparent to the user, like the hardware intervention.

Conventional machine level intervention. Modifying the microcode of a processor requires a thorough knowledge of the processor's architecture, and requires the microcode to be downloadable to the processor. To get around these troubles, the trace can be retrieved at machine code level by using the trace mode present in many processors (for example the mc68020 [Motorola85], but also the PDP11 [Digital72] or the mc88000). The trace bit causes the interrupt handler to be called between the execution of every two user instructions; the interrupt handler is programmed to maintain the address trace. To find the data references of an instruction, the interrupt handler decodes the instruction, and extracts (and evaluates) the addressing modes in it. This is a hard task because it requires a complete decoder for the instruction set. ATUM does not need that because the microcode decodes the machine code. Intervention at conventional machine level is for this reason also slower in execution than ATUM. A positive point of intervention at the conventional machine level is that it can be used for many types of processors, while ATUM runs on a VAX only.

Assembly level intervention (CAT). The three methods above are transparent to the assembly level programmer. Although the program runs slower, the programmer is not bothered with the trace. A fourth method is visible to the assembly programmer, but still invisible to the application programmer: the compiler generates extra code that extracts the trace at run time.


In contrast with the previous methods, an intervention at assembly level takes place at the user level: the system software or hardware need not be modified; only the user's assembly code is annotated with function calls that maintain the address trace. Any compiler can be extended to generate these extra function calls [Gupta89, Langendoen92b]. In the remainder of this section one such extension, called CAT (Compiled Address Tracing), built around the FCG code generator [Langendoen92c], is described in more detail because it is used in Section 4.2.2.

The FCG code generator is a back-end for the FAST compiler [Hartel91]. FAST and FCG together translate a program written in a functional language into C. Together with a run time support system that is entirely written in C, the functional programs can be executed with reasonably good performance on any computer platform equipped with a C-compiler. The FCG back-end uses C as the target language; no use is made of advanced control structures (such as a for, a while or a function call), nor of data structures. Because C is used as a "portable assembler", there is a direct relation between the C-statements and the underlying assembly statements. Because of this direct mapping, the compiler knows exactly where the loads and stores in the underlying assembly are generated¹. At all these places, the FCG back-end inserts a function call (at C-level) that appends the referenced address to the address trace at run time.
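
The two calls that CAT inserts, trace_load() and trace_instr() (they appear in Figure 4.4), can be as simple as the buffered sketch below. The buffer size, the file output and the exact signatures are assumptions; the real CAT code may record instruction and data references differently.

    #include <stdio.h>

    #define TRACE_BUF 4096

    static unsigned long trace_buffer[TRACE_BUF];
    static int           trace_filled;
    static FILE         *trace_file;            /* opened during start-up */

    static void flush_trace(void)
    {
        fwrite(trace_buffer, sizeof trace_buffer[0], trace_filled, trace_file);
        trace_filled = 0;
    }

    /* Record one data reference (called where the compiler placed a load). */
    void trace_load(void *address)
    {
        trace_buffer[trace_filled++] = (unsigned long)address;
        if (trace_filled == TRACE_BUF)
            flush_trace();
    }

    /* Record the fetch of 'count' word-sized instructions from 'address'. */
    void trace_instr(void *address, int count)
    {
        int i;

        for (i = 0; i < count; i++) {
            trace_buffer[trace_filled++] = (unsigned long)address + 4UL * i;
            if (trace_filled == TRACE_BUF)
                flush_trace();
        }
    }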

The insertion of the function calls that trace the instruction references is taken care of by a small analysis program. It translates the C-source to assembler (with the C-compiler), locates the basic blocks of code, counts their lengths, and annotates the C-source with function calls for tracing the instruction references. This backward annotation process is possible because the basic blocks are unambiguously defined at C-level with the help of C-labels and C-gotos. The application program is thus annotated with trace function calls invisibly to the programmer.

The run time support code of FCG is written in real C, and cannot be annotated automatically. It is annotated by hand, by compiling the run time support code to assembly, examining the generated assembly code, and placing function calls tracing the execution in the source code. This was a tedious and time consuming process, but it had to be done only once. FCG does not need an operating system (FCG is stand alone), so annotating the compiled code and the run time support suffices to cover all software components.

As an example of how the annotation works out, consider the function definition at the top left hand side of Figure 4.4. It has been compiled on a SPARC architecture (the top-right box contains the assembly code) and is annotated (bottom box) with calls to trace_load() and trace_instr() (for tracing the load from an address and the fetch and execute of a number of instructions, respectively). The function getanitem fetches the return and load instructions, and performs a load. The execution trace of the function is thus: "load getanitem, load getanitem+1, load ptr+3" (the memory is assumed to be 32 bits wide, instructions are assumed to fit in one word, and the variable ptr, the return value, and the return address are passed in registers; in this case the load is performed in the delay slot of the return, which explains the order).

¹CAT assumes that the C-compiler has a sensible register allocator, and enough general purpose registers to store the register-variables.


C-function:

    getanitem( ptr )
    int *ptr ;
    {
        return ptr[3] ;
    }

Assembly (generated by gcc):

            .global getanitem
    getanitem:
            retl
            ld    [%o0+12],%o0

Annotated C-function:

    getanitem( ptr )
    int *ptr ;
    {
        trace_instr( getanitem, 2 );  /* fetch 2 instructions */
        trace_load( &ptr[3] );        /* data reference */
        return ptr[3] ;
    }

Figure 4.4: A function definition, the assembly and the annotated version.


Note that neither in the run time support nor in the generated code do the trace calls disturb the execution path. The program runs a bit slower, so the precise timing information is lost, but the generated address trace precisely resembles the trace that would be executed on a real system.

The trace as extracted by CAT contains no information about the length or the type of the instructions. It is assumed that all instructions have the same length and execute in the same time. It is also assumed that loads and stores take effect immediately following the instruction. In current pipelined RISC designs loads and stores are queued, while instructions are prefetched. This implies that the extracted traces are slightly disordered, depending on the actual processor type. This slight disordering has a negligible performance effect on most memory subsystems (although the disordering is essential for the performance of the processor itself). Note that the numbers of instructions in the basic blocks are counted on the host architecture. If one wants to simulate the execution on another type of processor, the programs should be recompiled for this architecture, so the instruction count is corrected for that specific processor's instruction set. The number of registers is not important since the compiler assumes that enough registers are available.

4.2.2 MiG: on-line simulation of parallel functional programs

As explained before, off-line generated traces cannot react to feedback from the simulator. This seriously hampers the tracing of parallel programs, since their execution path depends on the timings of the execution.


[Figure omitted: the generator (application, run time support, concurrency control) emits a parallel address trace to the simulator (scheduler and architecture simulator), which returns synchronisation and timing information.]

Figure 4.5: Schematic overview of the MiG.

An off-line extracted trace of a parallel application is only valid for the architecture the trace was measured on. A system with a slightly different architecture (another cache size, another interconnection topology, another number of processors) might have a totally different execution path. The order in which locks are fetched or the order in which memory blocks are allocated can result in a different schedule, or in a different memory allocation pattern, both leading to a (radically) different address trace.

MiG (Memory simulation Integrated with Graph reduction [Muller92a]) is a simulator capable of tracing parallel applications on-line, thus with feedback from the generated trace towards the parallel program. MiG simulates the execution of an application written in a high level language on a shared memory multiprocessor machine, without bothering the user or the simulation platform with a low level of detail. The tasks of a parallel application execute concurrently under the MiG, and emit a parallel address trace that is interpreted by the architecture simulator. The timing information of the architecture simulator is used to control the concurrent execution of the parallel program, as sketched in Figure 4.5. The precise structure of the architecture simulator is not of interest for MiG. The architecture simulator only has to synchronise the n address streams extracted from the application program according to a shared memory model. The shared memory model may be of any complexity: a single bus, caches, switches, or virtual shared memory.

As opposed to the emulation of a parallel program at assembly level, or other approaches to the simulation of parallel programs (for example [Hagersten91]), MiG does not synchronise after each instruction, but only after the execution of a large number of instructions. Because of the overhead (context switches, tests) involved in the synchronisation, synchronising at each instruction is inefficient; MiG allows large programs to be simulated on a parallel machine without that overhead. The places where MiG needs to synchronise, and the way MiG is implemented, are the subject of the remainder of this section.

MiG is constructed around the parallel functional language implementation described in [Barendregt92]. The application program is written in a functional language, with parallel tasks explicitly denoted by a fork-join annotation.


The application is compiled using FAST, FCG and CAT (page 66) to incorporate code for the extraction of the address trace. The execution model of the FCG implementation consists of a hierarchy of tasks, with a stack per processor and a private heap per task. Both the heap and the stack are allocated in the shared memory. Besides using its private data, a task may read the data of all of its ancestor tasks. Interaction between tasks is accomplished by the join-part of the fork-join. After the termination of a task, the heap of the terminated task contains the result of the task. The father task of a family of terminated jobs coalesces the heaps of the terminated children with its own heap, whereupon the father task may combine the results, since they are all in the private heap of the father. The run time support code has functions that are called by the tasks to collect the garbage in their heaps. A copying garbage collector copies all referenced (non garbage) cells to a fresh space (the so-called "to-space"), whereupon the old heaps are entirely free. To save memory, the FCG execution model has only one to-space, so only one task at a time may perform a garbage collect.

The advantage of the execution model of this compiler and run time support is that tasks have well defined interaction points with other tasks. In between these interaction points are sequential blocks of code that run completely independently of other tasks. Since a sequential block is independent, there is no need to synchronise the MiG generator and the architecture simulator during the execution of a sequential block; they only need to synchronise after the execution of a sequential block. A sequential block ends where a task:

1. forks one or more new tasks,

2. allocates or frees a block of shared memory,

3. performs a garbage collect, or

4. terminates.

At all of these four "interaction" points, the task has to synchronise with the activities on other processors because centralised data is used. At the first interaction point, the new jobs are placed in a global list of ready tasks (a task pool); at the second point, a free block of memory is searched for in a global list of free memory blocks; at the third point the single global to-space is used; and at the fourth point the task needs to flag the parent task (it also ends the execution trace of the current task). None of the interaction points is located in the compiler generated code; they are all in the run time support, where they are protected by means of semaphores with P (down, grab) and V (up, release) operations. The code executed between the V of one interaction point and the P of the next interaction point is thus a sequential block. The code between the P and the V (the critical code itself) is protected by the semaphore for mutually exclusive access, hence it is of the same nature as ordinary code. This means that the critical code again consists of a sequential block (or multiple sequential blocks when nested semaphores are present). Hence all code, except for the code within the P or the V, is part of a sequential block. The simulated execution of 4 parallel tasks is shown in the following example.


An example simulation under MiG

In the example, it is supposed that four tasks are running, all performing P and V operations on a binary semaphore once in a while. In the beginning, all tasks are allowed to execute the first sequential block, up to the first P-operation, one after the other. Since the executions are independent, the order of execution is irrelevant. During the execution, an address trace is built of all accesses made to the memory. The address traces are buffered and passed to the architecture simulator. This leads to the situation visualised in Figure 4.6 (the boxes denote the address traces that were emitted, while a 'P' marks that at that point in the trace the task will execute a P).

[Figure omitted: four horizontal bars, one per task, showing the address traces emitted up to each task's first P-operation.]

Figure 4.6: The generated address traces until the first P-operations.

The architecture can now be simulated up to the place where the trace of task 1 ends. The simulator interprets the four traces in an interleaved fashion, although the address traces were generated one after the other. When the last access of task 1 has been simulated, task 1 may execute the P operation. After completion of the P (task 1 is the first one to enter the semaphore so it will be granted access immediately), task 1 is allowed to execute the next sequential block. This execution results in a new address trace, leading to the situation sketched in Figure 4.7 (a hatched block denotes a part of the address trace that has been simulated by the memory; the accesses in the hatched blocks are thus globally ordered).

[Figure omitted: the four traces again, with the parts already simulated by the memory shown hatched; task 1 has passed its P and emitted a new trace ending in a V, while the other tasks still wait at their P-operations.]

Figure 4.7: The address traces are (partly) consumed. Task 1 continued.

Again, the address traces are fed into the architecture simulator, until the next interaction point is encountered. In this example, the trace of task 0 will be exhausted first, and task 0 is allowed to try the P operation. Task 0 will fail to pass the semaphore, since it is already in use by task 1 (remember that the V of task 1 has not been executed yet).


This failure is modelled by a busy wait on the semaphore. The address trace that results from the test and the jump (the implementation of the semaphore, see also page 73) is generated, and this trace is passed to the memory simulator. The P is repeated many times (denoted by the run of repeated P's in Figure 4.8) until the trace of task 1 is completely simulated and task 1 executes the V operation. Task 0 will then grab the semaphore, and both task 0 and task 1 can execute a new sequential block, as is shown in Figure 4.8. Again, the order in which these blocks of code are executed is not relevant.

[Figure omitted: the traces continued; task 0 shows a long run of repeated P's (the busy wait) before grabbing the semaphore, and both task 0 and task 1 have emitted new sequential blocks.]

Figure 4.8: Task 0 got the semaphore and executed the critical section.

Note that the sequential blocks of the tasks always complete execution: there are no interrupts, timers or exceptions other than aborting exceptions like a division by zero or other programming errors.

The above example is based on tasks. It is easy to extend it to processors: a processor runs tasks one after the other with the help of a non-preemptive scheduler. A task runs till termination, or until it forks and has to wait for the results (fork-join). After termination of a task, another task is fetched from the global task pool to run on the processor, or a parent task is resumed. The scheduler itself is treated as ordinary code: it is (like the other critical parts) protected with semaphores, so it executes following the same mechanism as above. Because the scheduler bases its decisions on tasks that are synchronised according to a real parallel execution, a realistic execution path is traced.

Restrictions on applications simulated with MiG

The model used to describe a task (sequential blocks with semaphore operations in between) uses the following essential properties of a sequential block:

• Once a sequential block is started, the state of the program at that point fully determines the execution path of the block. This implies that if shared data is being read, this data should be read-only to all other sequential blocks executing on other processors (hence a P-operation cannot be part of a sequential block).

• The execution of the sequential block does not alter the execution order of sequential blocks executing concurrently in other tasks. This implies that if shared data is overwritten, it may not be used in any other sequential block running at that moment (hence a V-operation cannot be part of a sequential block).


Code updating the shared variable:

    ...
    P( &exclusive ) ;
    if( bound < optimum ) {
        optimum = bound ;
    }
    V( &exclusive ) ;
    ...

Code asynchronously reading the shared variable:

    ...
    P( &exclusive ) ;
    ...
    V( &exclusive ) ;
    ...
    if( optimum < local ) {
        ...
    } else {
        ...
    }

Figure 4.9: Sample code for an optimisation problem (an atomic update is assumed).

The execution model of the FCG back-end and run time support conforms to these properties, but many other parallel programs and parallel language implementations conform to them as well. Only two classes of parallel programs are ruled out: programs that asynchronously read shared data, and programs that asynchronously write shared data.

Performing asynchronous reads is common practice in optimisation problems, where multiple tasks search through a space, reading a global optimum value asynchronously to prune the search space. Updates to this optimum variable are protected by means of a semaphore. A code fragment of such a program is depicted in Figure 4.9. The first fragment is the code updating the shared variable, while the second fragment asynchronously reads the variable. The path chosen in the if of the reading fragment is determined at the moment the variable optimum is read. The value of optimum can be modified at any place before or after the if; consequently the code is not a sequential block.

It is hard to give an example of a correct program that asynchronously writes data to shared variables: such a write will lead to disaster most of the time. Only if semaphores are constructed with an asynchronous write, as is the case with Dekker's semaphore [BenAri82], can one write a correct program that cannot be simulated using sequential blocks, but this implementation is rather uncommon.

Only programs in which both read and write access to shared variables is protected by means of semaphores (or monitors, which can be rewritten to semaphores) can be simulated using sequential blocks. Parallel C programs are in general not implemented this way, but many implementations of object oriented and declarative programming languages allow for a MiG-like simulation.

Implementing pseudo parallel execution

The pseudo parallel execution is implemented with the help of so-called lightweight processes under Sun-UNIX (originally called threads [Rashid86]). A lightweight process is one thread of control in a normal UNIX process, with a private stack, program counter and register set, but with an address space that is shared with the other threads in the UNIX process.


Each processor in the architecture is simulated by one thread, which precisely resembles the execution model of a shared memory multiprocessor machine.

The architecture is simulated by one extra thread, the "simulation thread". This thread takes care of the synchronisation between all processes, and ensures that none of the other threads gets too far ahead. If one of the threads modelling the processors encounters a P or V operation, control is passed to the simulation thread. The simulation thread calls the architecture simulator, which interprets the address traces (for example with the Futurebus simulator described in Chapter 6) until one of the address traces is exhausted, and then passes control to the thread of the processor whose address trace is exhausted. This thread resumes execution, generates a new part of the address trace, and passes control back to the simulation thread again.
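
The core of the simulation thread is therefore a simple loop; the two helper functions below are hypothetical names for "interpret the buffered traces until one runs dry" and "resume the corresponding processor thread", not the actual MiG interface.

    int  interpret_traces_until_one_is_empty(void);  /* returns a processor id,
                                                        or -1 when all tasks
                                                        have terminated       */
    void resume_processor_thread(int processor);     /* blocks until it yields */

    void simulation_thread(void)
    {
        for (;;) {
            int processor = interpret_traces_until_one_is_empty();
            if (processor < 0)
                break;                            /* nothing left to simulate  */
            resume_processor_thread(processor);   /* it emits a new trace part */
        }
    }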

Implementing P and V

During the P and V operations the threads executing the program code synchronise with the simulator thread. The V operation does so by:

1. Stopping the current thread and starting the simulator thread (which will interpret the address trace).

2. At the moment the simulator has finished the trace, the simulator's virtual time is brought up to the moment of the V operation. The synchronisation is thus accomplished, so the V operation can free the semaphore and continue, whereupon the next sequential block is executed and traced.

The P operation works similarly, but may end with a “busy-wait” loop:

1. Stop the current thread and start the simulation thread.

2. When control is passed back, P tries to grab the semaphore. The grab will cause some instruction fetches, a load, and possibly a store of the semaphore. If the grab succeeds, the thread continues with the sequential block after the P; otherwise, step 1 is repeated.

The busy wait loop terminates when the semaphore is free. Since the loop contains references to the memory, other threads can continue and progress is guaranteed. As long as a fetched semaphore is freed somewhere in the future, P-operations never deadlock.

A straightforward implementation maps the P operation onto an indivisible Test-And-Set instruction: in one indivisible bus cycle, the semaphore value is fetched, tested and overwritten. In an architecture with distributed caches and a MOESI protocol [Sweazy86] (like the Futurebus [Futurebus89] or SCI based systems [SCI92]), the TAS-operation requires the cache line containing the semaphore to be fetched exclusively. This implies that if two processors in a large architecture are waiting for a busy semaphore, the cache line is transported continuously between the caches, wasting valuable bandwidth (the performance of a blocked P is not interesting, but it should not slow down the rest of the machine by burdening the communication hardware).



A better way to implement the P is to test the semaphore value until it is free before executing a Test-And-Set (one could call this implementation a Test-And-Test-And-Set [Raina91, Rudolph84]). First a test is done on a shared cache line; only when it succeeds is the line fetched exclusively and a TAS applied. If the TAS fails, the first test is repeated. MiG generates the address trace according to this last variant, because it scales better, while the other version tends to generate high memory traffic in certain implementations, resulting in unpredictable performance behaviour.
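
For reference, the busy-wait variant whose trace MiG generates corresponds to a spin lock like the sketch below. It is written with a GCC atomic builtin standing in for the indivisible Test-And-Set bus operation, so it is an anachronistic illustration of the technique rather than the code used in the thesis.

    typedef volatile int semaphore;

    void P(semaphore *s)                     /* grab */
    {
        for (;;) {
            while (*s != 0)                  /* test: reads a shared cache line */
                ;                            /* busy wait                       */
            if (__sync_lock_test_and_set(s, 1) == 0)
                return;                      /* TAS succeeded: semaphore held   */
            /* another processor was faster: fall back to the cheap test        */
        }
    }

    void V(semaphore *s)                     /* release */
    {
        __sync_lock_release(s);              /* store 0, releasing the semaphore */
    }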

Idle processors

During the run of the applications it can be expected that some of the processors are idle and run an idle loop. The idle loop should typically be kept in the local cache, otherwise it wastes bus bandwidth (note that the performance of the idle loop itself is not important). When the idle loop is in the local cache there is no reason to simulate the execution of the idle loop anymore. Omitting the simulation of the idle loop leads to a significant increase in the simulation speed when not all processors are used all the time, and no accuracy is lost. Another positive consequence of treating the idle loop as a special case is that MiG can distinguish between computations (the work of the application), semaphore waits (the bottlenecks in the language implementation), the critical run time support code (enclosed between semaphores) and idle loops (too large a number of processors or too little work). The distribution of the time over these four phases is valuable profile information: the degree of parallelisation, the critical section overhead and the synchronisation overhead can be inferred from it.
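
A sketch of how such a profile can be accumulated is given below; the phase names and the accounting function are illustrative, not MiG's actual bookkeeping.

    /* The four phases MiG distinguishes. */
    enum phase { COMPUTING, SEMAPHORE_WAIT, CRITICAL_SECTION, IDLE, NPHASES };

    static double time_in_phase[NPHASES];

    /* Charge 'cycles' of one processor's virtual time to the given phase. */
    void account(enum phase p, double cycles)
    {
        time_in_phase[p] += cycles;
    }

The degree of parallelisation then follows from the share of COMPUTING time, and the critical section and synchronisation overheads from the other two busy phases.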

Accuracy and performance

The speed of applications simulated with MiG depends totally on the level of detail at which the architecture is simulated. With the most elementary architecture, consisting of one hypothetical memory with unit response time, MiG programs run about a factor 3 to 5 slower than normally executed programs. But when a complex architecture is simulated at bus cycle level, for example one of the hierarchical cache architectures presented in Chapter 6, the simulator runs a factor 3000 slower than real execution, a slowdown entirely caused by the detailed simulation of the memory hierarchy.

If one is only interested in debugging the compiler, the run time support or the parallel application, MiG with a unit-time memory works fine. Since it performs a real parallel execution of the program, bugs can be hunted on a deterministic platform (in contrast to real machines), with the possibility to insert "hardware probes" for analysing the performance or for hunting bugs. When cache hit rates or bus contention are to be monitored, MiG has to be equipped with a full blown memory hierarchy simulator, reducing the performance. The accuracy of the simulator is traded against the performance.


The shortcoming of MiG is that it only works with a shared memory model. MiG is currently only implemented for a functional language, but there is no reason why it should not work for other parallel programs obeying the restrictions listed on page 71. Examples of the use of MiG can be found in [Langendoen92a, Muller92a, Hofman93].

4.3 Stochastically generated address traces

Both the extraction of address traces and the emulation of a program suffer from high computational and memory demands. Furthermore, the traces extracted from real applications are not flexible, in the sense that experiments with "a bit more locality" or "another type of processor with a denser instruction coding" cannot be performed using these traces. For these experiments, the architect can use synthetic address traces. A synthetic trace is a "random" sequence of addresses that is used as a substitute for a real address trace. The word random is quoted, because a synthetic trace should exhibit temporal and spatial locality [Denning72] like real address traces.

Synthetic address traces can be tuned to the wishes of the architect, and can be generated without huge memory demands. Their use has one major drawback: the results obtained with a synthetic trace are less reliable than the results obtained with a real address trace. Real world address traces give an exact description of the execution of an application on a real world machine, whereas the correspondence between a synthetic address trace and a real world trace is doubtful. This drawback should be obviated by a systematic validation, and by experimentation with the sensitivity of the parameters. The use of a non-validated, non-calibrated synthetic address trace makes no sense.

A synthetic address trace may abstract arbitrarily far from a real world address trace. [Thiebaut92], for example, observed that the curve plotting the number of unique words versus the number of allocated words has two knees. A synthetic address trace can be generated on the basis of these two knees, which has absolutely no relation with the notion of a processor or an application, although this trace may certainly exhibit some kind of locality. At the opposite end, one can try to make a stochastical model of a processor running a program, by simulating the fetch, decode and execute cycle of a real processor and generating fake instruction and data references in a manner that looks like a real address trace. This is for example done by [Archibald86] for the generation of an address trace of data references.

In this section, a stochastic trace generator is described that stays close to the model of a real processor (like [Archibald86]), and that generates an address trace that is supposed to stem from a multiprocessor shared memory machine. Besides the ordinary locality within the trace of one processor, the stochastic trace generator also takes care of the locality between the traces of the various processors. The address traces coming from a parallel machine running highly independent, non-migrating jobs do not interfere with each other; when jobs migrate, this is visible in the executed trace because addresses referenced at one processor now are accessed later on by another processor.


Figure 4.10: The structure of a stochastic address trace generator (a scheduler distributes jobs from a global pool over the processors, each of which emits an address trace).

On a multiprocessor machine running a truly parallel program, the jobs communicate with each other via shared memory, which shows up in the address traces because (almost) identical addresses are accessed on multiple processors at the same moment, leading to collisions on these memory addresses.

The stochastic trace generator (sketched in Figure 4.10) makes it possible to model shared memory programs with any degree of interaction. Only the scheduling algorithm is fixed: jobs can be scheduled on any processor of the architecture. The scheduler maintains a global pool of jobs, which are scheduled in a FIFO manner over all processors. Each processor generates data and instruction references for the scheduled job until it is time to schedule another job.

The number and types of jobs are parameters of the stochastic generator. When modelling a machine running a parallel application, there is typically a large number of identical jobs that frequently access a shared data space. When modelling a UNIX machine, a large number of different jobs are running that do not use shared data, but only have shared text segments. This is supported in the stochastic trace generator by the definition of a number of job-classes, and a specification of how many jobs of which class are active on the machine.

The parameters of a job-class define the behaviour of a job (a job is an instance of a job-class), for example how long the job executes before it deschedules, its locality behaviour, and so on. These parameters are described in more detail in the remainder of this section: Section 4.3.1 gives a detailed description of the parameters specifying the instruction locality, and Section 4.3.2 presents the parameters that define the locality in the data space.
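To make the structure of Figure 4.10 concrete, the following C sketch shows one way the scheduler could hand out jobs from a FIFO pool; it is an illustration only, with invented names (job, pool_get, emit_reference) and a fixed reference quantum, not the generator's actual implementation.

    #include <stddef.h>

    #define NPROCS 4                       /* number of simulated processors   */

    typedef struct job {
        int         job_class;             /* index into the job-class table   */
        unsigned    pc, sp;                /* program counter, stack pointer   */
        struct job *next;                  /* link in the FIFO pool            */
    } job;

    static job *pool_head, *pool_tail;     /* global FIFO pool of jobs         */

    static job *pool_get(void)             /* take the oldest job (the pool is */
    {                                      /* assumed to hold enough jobs)     */
        job *j = pool_head;
        pool_head = j->next;
        if (pool_head == NULL) pool_tail = NULL;
        return j;
    }

    static void pool_put(job *j)           /* append a descheduled job         */
    {
        j->next = NULL;
        if (pool_tail) pool_tail->next = j; else pool_head = j;
        pool_tail = j;
    }

    /* Defined by the locality models of Sections 4.3.1 and 4.3.2: emit one
     * instruction reference (and possibly a data reference) for this job.     */
    extern void emit_reference(int cpu, job *j);

    /* One scheduling round: every processor runs one job for a quantum of
     * 'run_length' references and then returns it to the global pool.         */
    void schedule_round(unsigned run_length)
    {
        for (int cpu = 0; cpu < NPROCS; cpu++) {
            job *j = pool_get();
            for (unsigned i = 0; i < run_length; i++)
                emit_reference(cpu, j);
            pool_put(j);
        }
    }

In the real generator the quantum is a job-class parameter (how long a job executes before it deschedules) rather than a fixed argument.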

A job is allocated in shared memory, and has segments of physical addresses where the job accesses its instructions and data. A job has two private segments, where the stack and the private data reside, and two segments that are shared with the other jobs of the same class, where the shared data and the (read-only) program code reside. Figure 4.11 shows an example memory layout with three jobs: two instances of job class p and one instance of job class q. The segments of the jobs are aligned on page boundaries, and are placed contiguously in the shared address space.

The actual generation of the address streams of the jobs is taken care of by the processors of Figure 4.10. According to the parameters of the job-class and the physical addresses of the segments of the job, an address trace is generated for a registerless, zero-cycle processor.


Figure 4.11: Example job structure: two job classes (p and q), three jobs (0, 1, 2) and the memory layout (per job a private stack and a private data segment, and per job class a shared data and a shared text segment).

According to the rules described in Section 4.3.3, this address trace is then tailored to one specific type of processor. This makes it possible to generate traces that would have been extracted from processors with various clock speeds, numbers of registers, or instruction types. Section 4.3.4 finally presents a short reflection on the pros and cons of using stochastically generated traces.

4.3.1 Instruction access locality

Each job has a program counter, pointing to the address where the last instruction was fetched. By default, the program counter is incremented to the next instruction, but occasionally a jump is executed. Two types of jumps are distinguished, each with its own probability (a sketch of the resulting address generator follows the two descriptions below):

Local jumps. A local jump instruction transfers the control flow of the program to another place in the program, but in the neighbourhood of the current instruction, normally within the body of a function. It turns out that jump distances are reasonably normally distributed, so the jump target is randomly selected from a normal distribution. Since backward jumps and forward jumps (stemming from loops and if-then-else's, respectively) have different probabilities, the model has four local jump parameters: the average distance and the probability of both the backward and the forward jump.

Global jumps. Instructions that transfer control to another subroutine perform a non-local jump. Typically, there is a restricted set of destinations for these jumps: the entry points of the functions. To model this, the entry points are spread uniformly over the text space. An exponential distribution is used to select an entry point when a global jump is executed. This gives a typical non-local jump behaviour: all places in the address space are accessed, but some places have a higher chance than others.

Page 85: Simulating Computer Architectures - Semantic Scholar · Simulating Computer Architectures Nice picture is missing! HenkMuller. i Simulating Computer Architectures ... Marnix Vlot,

78 Chapter 4. Simulating applications

The size of the text section, the number of entry points, the slope of the exponential distribution, and the probability of a global jump are the parameters defining the non-local jump behaviour.
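The following C sketch illustrates how these parameters could drive the generation of instruction addresses; the structure and function names, the random samplers, and the choice of standard deviation are my own assumptions, and clamping the program counter to the text segment is omitted.

    #include <stdlib.h>
    #include <math.h>

    /* Illustrative samplers: exponential with a given slope, and normal via
     * Box-Muller.  drand48() returns a uniform number in [0,1).               */
    double exp_sample(double slope)
    { return -log(1.0 - drand48()) / slope; }

    double norm_sample(double mean, double sdev)
    {
        double u1 = 1.0 - drand48(), u2 = drand48();
        return mean + sdev * sqrt(-2.0 * log(u1)) * cos(2.0 * M_PI * u2);
    }

    typedef struct {                 /* instruction-locality parameters        */
        double   p_fwd, p_bwd;       /* probability of a forward/backward jump */
        double   d_fwd, d_bwd;       /* average forward/backward jump distance */
        double   p_call;             /* probability of a global jump           */
        double   slope;              /* slope of the exponential entry choice  */
        unsigned text_size, entries; /* text segment size, number of entries   */
    } ijump_params;

    /* Next instruction address of a job, given its current program counter.   */
    unsigned next_pc(unsigned pc, const ijump_params *p)
    {
        double r = drand48();
        if (r < p->p_call) {                         /* global jump to an entry */
            unsigned e = (unsigned)exp_sample(p->slope) % p->entries;
            return (p->text_size / p->entries) * e;  /* entries spread uniformly*/
        }
        r -= p->p_call;
        if (r < p->p_fwd)                            /* forward local jump      */
            return pc + 1 + (unsigned)fabs(norm_sample(p->d_fwd, p->d_fwd / 2));
        r -= p->p_fwd;
        if (r < p->p_bwd)                            /* backward local jump     */
            return pc - (unsigned)fabs(norm_sample(p->d_bwd, p->d_bwd / 2));
        return pc + 1;                               /* sequential execution    */
    }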

4.3.2 Data access locality

For each data segment, the probability of a load and of a store to the segment is defined as a parameter of the job-class. The locality constraints are modelled per segment, as sketched after the list below:

The stack segment. Each job has a stack pointer, pointing into its stack. References are (normally distributed) around the stack pointer. The stack pointer is updated when a non-local jump is made. The distribution parameter and the average update size are parameters of the job-class.

The private data segment. A pointer is maintained in each job, pointing to the most recently referenced data element. The next data reference is at a random offset from the previous access, selected from a normal distribution. The width of this distribution is a parameter of the job-class.

The shared data segment. Identical to the private data segment, but with its own parameter for the distribution.
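A corresponding C sketch for the data references, again with invented names and reusing the norm_sample() helper from the previous sketch; the per-segment base addresses and the case in which an instruction makes no data reference at all are left out.

    #include <stdlib.h>

    extern double norm_sample(double mean, double sdev);  /* sketch in 4.3.1    */

    typedef struct {                       /* data-locality parameters          */
        double p_stack, p_priv, p_shared;  /* access probability per segment    */
        double w_stack, w_priv, w_shared;  /* width of the normal distribution  */
    } data_params;

    typedef struct {                       /* per-job data-access state         */
        unsigned sp;                       /* stack pointer                     */
        unsigned priv_ptr, shared_ptr;     /* last private/shared data access   */
    } data_state;

    /* One data reference: choose a segment, then move near the previous
     * access (or around the stack pointer) by a normally distributed offset.   */
    unsigned next_data_ref(data_state *s, const data_params *p)
    {
        double r = drand48();
        if (r < p->p_stack)
            return s->sp + (int)norm_sample(0.0, p->w_stack);
        if (r < p->p_stack + p->p_priv)
            return s->priv_ptr += (int)norm_sample(0.0, p->w_priv);
        return s->shared_ptr += (int)norm_sample(0.0, p->w_shared);
    }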

4.3.3 Processor parameters

The parameters above are the parameters that depend on the application, running on some abstract, registerless processor. Orthogonal to these application parameters, the essential parameters of the actual processor are specified. With these parameters one can model processors that need more instructions, have a higher clock speed, or have more registers.

Speed. The instruction fetches of the application trace are fed to the memory layer at the MIPS rate of the specific processor. Note that the MIPS rate influences the data reference rate as well.

Instruction size, instruction power. A CISC needs fewer instructions than a RISC to perform the same task, and its instructions are coded more densely as well. Hence, the average jump distance and the text size of an application have to be scaled accordingly. Therefore the processor model includes two scaling parameters: instruction power and instruction density. These cannot be merged into one parameter because they scale differently: the density operates on all instructions, whereas the instruction power only affects the arithmetic instructions, because the absolute number of jump instructions does not vary between RISCs and CISCs.

Register usage. The use of register windows, large register banks, and compiler optimisations leads to reduced data traffic. Especially the access frequency to the stack segment is greatly reduced. The processor model accounts for this effect by masking off some of the stack references from the application trace; they are simply discarded. The register-usage parameter specifies the percentage of masked references.

The simulator also allows traces to be generated for machines with different types of processors (where shared data remains shared, but the text segments of the different processor types are separated and jobs have a preference for a certain type of processor), but this feature has never been used.
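The sketch below indicates how the raw trace of the registerless, zero-cycle processor could be tailored with these processor parameters; the names and the exact scaling rules are mine, and the generator's own tailoring rules may differ in detail.

    #include <stdlib.h>

    typedef struct {                /* processor parameters of Section 4.3.3    */
        double mips;                /* instruction rate, millions per second    */
        double density;             /* instruction coding density (CISC < 1)    */
        double power;               /* fraction of arithmetic fetches kept      */
        double reg_mask;            /* fraction of stack references masked off  */
    } cpu_params;

    /* Time stamp (in microseconds) at which the n-th kept instruction is
     * issued; the data references of that instruction get the same time stamp. */
    double issue_time(unsigned long n, const cpu_params *c)
    { return (double)n / c->mips; }

    /* Denser coding shrinks the text segment and all jump distances.           */
    unsigned scale_text_addr(unsigned a, const cpu_params *c)
    { return (unsigned)(a * c->density); }

    /* A more powerful (CISC) instruction set drops some arithmetic fetches;
     * jump instructions are never dropped.                                     */
    int keep_arith_fetch(const cpu_params *c)
    { return drand48() < c->power; }

    /* Register usage: mask off a fraction of the stack references.             */
    int keep_stack_ref(const cpu_params *c)
    { return drand48() >= c->reg_mask; }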

4.3.4 Discussion of the stochastic trace generator

A stochastically generated trace is intended to be comparable to the address trace of a real multiprocessor program. The generation of the stochastic trace does not place extreme demands on the processing capacity or memory of the simulation platform. To generate a realistic trace, four address spaces are simulated, each with its own typical behaviour: a program space with sequential execution and jumps, a data and a shared data space where accesses are in the neighbourhood of the previous reference, and a stack space with references around the top of the stack. The simplest access behaviour is simulated for the three data spaces; the behaviour of the text segment is a bit more complex than that of the other segments (by simulating function calls), because the largest number of references are typically instruction references, and all programs have a structure that favours certain functions over others. The data sections could be enhanced likewise, with LRU chains of accesses as in [Archibald86], or with access patterns for arrays and matrices, but this is considered less important, because fewer references are made to the data segments, and because array or matrix access structures are not typical for all programs.

We are more concerned about the locality of the shared data. Real applications can be (and are) programmed to take advantage of keeping data local, a feature which is not accounted for in the stochastic trace generator. The only way to simulate this effect with the current version of the trace generator is to increase the size of the shared space, so that there is less chance of collisions (note that increasing the locality parameter does not help, because it does not reduce the chance of a collision), but this is not completely correct, since a larger access space also gives a different temporal locality inside the trace of one process. Another problem is that real programs may use “locks” in the shared memory: places that are polled and overwritten in read-modify-write cycles to guarantee mutual exclusion. Although the polling and updating of these locks may increase the traffic, large effects are not expected, because a program hanging on locks will not scale anyway because of the software contention.

As said in the introduction of this chapter, the use of a stochastic trace generator is absolutely non-trivial. A careful validation should be performed before the results can be trusted. The validation of a synthetic trace is hard in comparison with the validation of, for example, an emulated application. By verifying the output of the emulator, the structural correctness of the emulation can be checked, while the timings of the emulation can be verified step by step.


Validating a synthetic trace is hard because the property being modelled (locality) cannot be measured or calculated, since there is no well defined unit for it yet (although attempts have been made to define a unit of locality, for example [Bunt84]). The best option for validating a synthetic trace is to apply the trace in a number of experiments with well known results, and to compare the outcomes of these experiments. When the trace will be used to experiment with the hit rates of caches, the trace should be validated by applying it to a number of cache architectures with known hit rates. If the trace cannot be validated because there are no relevant results on that topic, one should take care to determine the sensitivity of the parameters, so that at least the error margins can be derived. A standard recipe for validating a stochastically generated trace does not exist, but a careful validation is indispensable.

4.4 Discussion

The methods described above for simulating application programs all have their drawbacks and advantages. Depending on the type of architecture and application under study and the type of questions that are to be answered, some strategies for simulating the application are preferable to others.

The most important difference between the various methods is the amount of feedback from the hardware simulation to the software. The more feedback, the more the hardware and software simulations are integrated, and the higher the accuracy of the simulation. Off-line generated address traces and stochastically generated address traces ignore all feedback from the architecture. These methods are thus only applicable to architectures where the program controls the architecture, but where the execution path of the program does not depend on the architecture. On-line generated address traces can react to synchronisations in the architecture, but cannot react to data produced by the architecture. Only emulated programs react correctly to the data coming from the hardware (for example I/O). Any type of interface between the hardware and the software can be simulated when emulating a program: microcode, special instructions at machine code level, memory mapped I/O, interrupts; every detail can be simulated and evaluated.

A direct consequence of the absence of feedback in some methods is that they are not suitable for simulating parallel applications. Off-line generated address traces represent a realistic execution path of a parallel application on one specific architecture, but these traces cannot be used for the simulation of another architecture (with more nodes or larger caches), because the execution trace depends on the architecture. A trace representing a parallel application can be generated using a stochastic generation method. Because the stochastic model is based on jobs and a scheduler, the generator can cope with varying numbers of processors, and because of the high degree of parametrisation, any type of workload can be modelled. On-line extracted address traces (as produced by the MiG) represent an exact execution path of an application running on a shared memory machine. Since the MiG simulates a real application, one is restricted to the sharing and scaling behaviour of real applications (in contrast to a stochastic workload, where one can perform experiments with a bit more or less sharing). Parallel applications on both shared and distributed memory machines can be simulated by fully emulating the program at instruction level. There are no restrictions on the type of parallel program, because all details are taken into account.

The three methods work at different levels of abstraction, and consequently have different accuracies. Stochastically generated address traces are the least accurate. Address traces extracted from running programs become more reliable as the level of intervention gets lower: the most reliable address traces are extracted using special hardware. Emulation of programs is a rather reliable method; again, the lower the level of interpretation, the higher the reliability.

The resource requirements are proportional to the accuracy. A full emulation is the most accurate and needs the most processing power and memory. Off-line generated traces require a large disk to store the traces, MiG needs the full application memory to be simulated, while the stochastic address trace generators have neither high computational nor high memory demands, and are the least accurate. The designer has to choose at what level of detail the application is to be simulated: at a low level, which is a tedious job but results in highly reliable performance figures, or at a high level, which is the fast way to get somewhat less reliable figures.

The methods described in this chapter have all been used to simulate application programs. In Chapter 5 an experiment is presented where the application has been emulated at assembly level. The problem under study required a trade-off between hardware and software, and required interfacing between the software and some specialised I/O: a network of communication processors. None of the tracing methods would have sufficed for this study. In Chapter 6, an address trace is needed for the evaluation of caching hierarchies in a shared memory multiprocessor machine. A stochastically generated trace is used because of the freedom in tuning the parameters. It would have been possible to use on-line generated address traces (as produced by the MiG), but a tunable application was preferable. The MiG has been used in a study of memory management algorithms, as presented in [Langendoen92a], and for a study of scheduling strategies for hierarchical caching architectures [Hofman93].


Part II

Case studies


Chapter 5

Simulating PRISMA’scommunication architecturey

An interesting field in which to study the evaluation of computer architecture designs is the development of communication architectures for distributed memory multiprocessor systems. Looking at existing distributed memory MIMD machines, we see that most machines implement the packet transport and routing layer in specialised hardware. The exact nature of the hardware varies: cut-through routing, deadlock freedom and (fixed) packet size are some of the design decisions influencing the ease of programming and the performance of the network. On top of this networking hardware, one or more layers of software (operating system, compiler) provide a suitable level of abstraction to the user. The hardware and software layers together form the communication architecture as seen by the user. The important issues in the design of such a communication architecture are whether the network functions correctly, and whether the performance of the network, its latency and bandwidth, is satisfactory under varying circumstances (a few long messages, many short messages).

Three recent hardware designs in this field are the Torus Routing Chip (or TRC, [Dally86]), the Adaptive Routing Chip (the ARC, [Mooij89]), and the DOOM Communication Processor (or CP, [Annot87]). These three communication processors have all been analysed, simulated and built. The analysis was made to prove the correctness of the network (deadlock freedom) or to analyse the network's performance behaviour (queuing models). The simulation was used to observe the network performance under conditions that cannot be analysed. Eventually all three communication processors have been realised: the ARC and the TRC as single chips, while the DOOM CP was first implemented using TTL logic and is currently being integrated onto a single chip. When the (breadboard) DOOM CP was actually used, it turned out that the software overhead had been underestimated: sending a message from A to B took 350 μs, while only 30 μs are needed in the network. The reason that this was not foreseen during the design phase was that both the analysis and the simulation of the design only took the hardware level into account; the abstractions above packet level running outside the communication processor have not been simulated or analysed.



Figure 5.1: The architecture of a PRISMA node (disk, Ethernet board, processor, memory and CP on a local bus, with links to the other nodes).

It can be expected that both the ARC and the TRC will suffer from comparable problems; they are both extremely fast communication processors, but to exploit this speed for general-purpose messages an extra layer of software or hardware is necessary, which will reduce the performance.

To get a better impression of where the loss of performance in the PRISMA machine came from, the experiment described in this chapter was performed. The experiment involves the evaluation of the communication part of the PRISMA architecture (PRISMA stands for PaRallel Inference and Storage MAchine [Apers90]; the machine is also called POOMA –Parallel Object Oriented MAchine– or DOOM –Decentralised Object Oriented Machine– [Bronnenberg87]), which is based on the DOOM CP. Three variants of the PRISMA architecture are simulated in order to evaluate the communication performance. Section 5.1 presents a brief description of the PRISMA architecture, and shows where the bottleneck in the current implementation is (see [Vlot90, Apers90] and [Bronnenberg87] for a concise description of the architecture). The architecture has been modelled using Oyster, as described in Section 5.2. In fact, the simulation of this architecture was the first experiment performed with Oyster, so only a few of the features described in Chapter 3 have actually been used. In Section 5.3 the result of a validation of the simulation model is shown, and the simulated performances of the three alternative communication architectures are presented. The chapter ends with a discussion of the simulation results.

5.1 The PRISMA architecture and implementation

PRISMA is a distributed memory architecture, where the nodes are interconnected by a packet switching network. A node of the machine is essentially composed of a data processor, a memory and a Communication Processor (CP). Some nodes are equipped with a hard disk (for storing permanent data) or an Ethernet board (for communication with the host machine). The relevant parts of a PRISMA node are sketched in Figure 5.1.

A 100 node prototype of the machine has been built. In this prototype, each node has 16 Mbyte of memory, an MC68020 data processor (equipped with a memory management unit, instruction cache and floating point coprocessor), and a prototype version of the communication processor. Every other node has a 300 Mbyte SCSI disk, and one out of five nodes is equipped with a self-contained Ethernet board. As mentioned, the Ethernet is used for host communication, not for communication inside the machine. Both the disk and the Ethernet board are not relevant for the communication performance, and are not taken into account in the study presented here.


Figure 5.2: The communication layers in the POOL implementation: the POOL program, the run time support and the message layer in software on each node, on top of the communication processors in hardware.

In the prototype machine, the data processor and the communication processor are connected by a (slow) VME interface. Via a memory mapped interface, the data processor (DP) reads and writes packets from and to the CP. There is no DMA (Direct Memory Access) in a node; the data processor transports the data to and from the communication processor. The CP implements a deadlock-free and starvation-free packet switching network [Annot87]. A packet consists of 240 data bits, preceded by a 16-bit header containing the 12-bit identification of the destination node and four administration bits. The CPs are interconnected by serial links through which the packets are routed to the destination node using a store and forward protocol. The interconnection topology is not fixed (any topology is allowed); it is stored in the routing tables of the communication processors during the bootstrap.
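For illustration, the packet format just described can be pictured as the following C structure; the field names are my own, not taken from [Annot87], and the bit-field layout is compiler dependent, so this is a sketch rather than a wire-format definition.

    #include <stdint.h>

    /* 16-bit header + 240 data bits = 256 bits = eight 32-bit words, which is
     * exactly what the data processor writes to (or reads from) the CP.        */
    typedef struct {
        uint16_t dest  : 12;       /* identification of the destination node    */
        uint16_t admin : 4;        /* administration bits                       */
        uint8_t  data[30];         /* 240 data bits of payload                  */
    } cp_packet;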

The PRISMA machine is programmed in POOL [America89b], a Parallel Object Oriented Language. A running POOL program consists of objects that communicate and synchronise by sending messages to each other. In general, neither the destination of a message nor its size is known at compile time. Several POOL implementations have been made [Spek90, Beemster90]. For the experiments on the communication architecture, the latter one has been used. It translates POOL into C and uses a C-compiler to generate the assembly code for a specific processor. The run time support for the compiled code is entirely written in C as well. It consists of the library functions for POOL, but also of those functions that are normally found in an operating system, such as the I/O, memory management and communication primitives. Below the run time support, a message layer interfaces to the communication processor.

A message that is sent from one POOL object in the program to another thus passes the levels shown in Figure 5.2. At the top level, the POOL code is executing and sending/receiving POOL messages. A simple POOL message is just a few words of data, but in general a POOL message consists of a graph of objects. The level below is the POOL run time support system. It extracts the destination node of the message and flattens the POOL message, the graph, into a stream of bytes.


    Object x:
       ...
       BODY
          DO
             i := y ! pingpong(3);
          OD
       YDOB

    Object y:
       ...
       METHOD pingpong( a : Int ) : Int
       BEGIN
          RESULT a * a ;
       END pingpong

       %% Default body

Figure 5.3: The POOL program used.

These byte streams are handled by the message layer, which converts the byte stream into packets and puts them into the communication network. At the receiving side, the packets are glued together, restructured into a POOL message, and finally delivered at the destination object.

The interface between the message layer and the communication processor is a polling, packet oriented interface. A packet is written to, or read from, the CP by transferring eight 32-bit words; the first word contains the 16-bit header and the first 16 data bits, the other 7 words contain the remaining 224 data bits. If only a part of a packet is used, empty words need to be written to the CP. In the same way, partly empty packets must be read completely from the communication processor. The interface between the message layer and the run time support layer consists of six functions, three for sending and three for receiving messages. At the sender side, the processor first calls a 'start_of_message' function. The data is sent by (repeatedly) calling a function that sends a block of bytes. At the end of a message an 'end_of_message' function is called, which completes the last packet with empty data words. At the receiving side, the run time support layer polls for a message by calling a function 'is_there_a_message'. When a message is available, (multiple) blocks of memory are filled with the data portions of the message. When the message is complete, an 'end_of_receive' function is called, which flushes the rest of the packet. The interface between the POOL compiler and the run time support consists of functions that generically send and receive a graph of POOL objects. These functions know the structure of a POOL graph; they get a pointer to the root of the graph, after which they are able to pack the graph into a byte stream.
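The names start_of_message, end_of_message, is_there_a_message and end_of_receive are quoted from the text above; the remaining prototypes and the two helper routines below are my own sketch of how the run time support could use this six-function interface.

    /* Prototypes of the message-layer interface; the byte-block functions are
     * not named in the text, so send_bytes()/receive_bytes() are assumptions.  */
    extern void start_of_message(int dest_node);
    extern void send_bytes(const char *buf, int len);
    extern void end_of_message(void);        /* pads the last packet            */
    extern int  is_there_a_message(void);
    extern int  receive_bytes(char *buf, int len);
    extern void end_of_receive(void);        /* flushes the rest of the packet  */

    /* Sending a flattened POOL message (a stream of bytes) to another node.    */
    void send_flat_message(int dest, const char *bytes, int len)
    {
        start_of_message(dest);
        send_bytes(bytes, len);              /* may be called block by block    */
        end_of_message();
    }

    /* Polling receive, as performed by the run time support.                   */
    int poll_and_receive(char *buf, int maxlen)
    {
        if (!is_there_a_message())
            return 0;
        int got = receive_bytes(buf, maxlen);
        end_of_receive();
        return got;
    }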

An example POOL program is shown in Figure 5.3. It consists of two objects X and Y that are located on different nodes. Object X continuously sends synchronous messages to object Y (denoted by the exclamation mark in y!pingpong(...)), which are immediately answered by Y (Y returns a*a as result). This program is trivially short and is used to measure one important performance parameter of the implementation: the latency introduced by the message handling. The latency is defined as the time needed to completely execute a synchronous send to (a rendez-vous with) an object on another node. Referring to Figure 5.2, the message starts at the X at the top left-hand side, travels through the communication part at the bottom to the Y (top right-hand side), and back to X again.


Figure 5.4: Schematic drawing of the data processor and CP activities in a node while running the POOL ping-pong program of Figure 5.3 (DP-41 executes object X and the message protocol, the CPs transport the packets, DP-42 executes the message protocol and object Y; time scale 0–700 microseconds). X is located on node 41, Y on node 42.

In the prototype POOL implementation, the synchronous send takes 700 μs.

This delay is only partially caused by the communication processor; the major reason for it is the interface between the CP and the data processor, and the software layers running on the data processor. A rough calculation (as sketched in Figure 5.4, where the message transport between X and Y is depicted) shows that 80% of the time is spent while the communication processors are idle and the data processor is working hard to send the message. This causes serious problems when exploiting fine or medium grain parallelism, since frequent communications and synchronisations then take more overhead than the computations in between. For an efficient parallel implementation of POOL, the data processor overhead should be reduced.

For this reason, it could be fruitful to introduce special interface hardware between the data processor and the communication processor that relieves the data processor from constructing messages: a so-called message processor. This message processor could take care of tasks like packet assembly and memory transfers. The lowest level of the POOL run time support (which implements a message passing layer) is then implemented in this specialised message processor.

The use of such a message processor in the architecture could possibly improve the throughput (the other important performance characteristic of the network) as well.

Figure 5.5: Schematic drawing of the various activities in a node while sending a large message from node 13 to node 11, without result (DP-13 executes POOL code and the send side of the message protocol, the CPs transport the packets, DP-11 executes the receive side of the message protocol and the receiving POOL code; time scale 0–700 microseconds). The throughput is bounded by both CP and DP.


Figure 5.6: The Oyster model of a PRISMA node: the processor, I-cache and memory on the local bus, and the CP consisting of a central buffer store, four input and four output machines for the links, special input and output machines (SI, SO) and an input and output queue towards the data processor. Every box and circle represents a Pearl object.

The peak throughput of the network of the current prototype machine is as high as 500 Kbyte/second per link. The throughput is also measured at language level (by sending one very large message), so it incorporates the performance effects of all software layers. It turns out, however, that the throughput is limited by both the communication network and the software layers, as they run in parallel (depicted in Figure 5.5). It is thus not expected that an improvement in the interface will give a dramatic improvement in throughput. However, since the data processor currently needs all its capacity to send a message, the overhead for the data processor when sending large messages can be reduced greatly by adding a message processor, freeing the data processor for computational tasks. In the experiments described in this chapter, the latency of the network is used as the optimisation criterion. The consequences for the throughput are discussed at the end of the chapter.

5.2 The simulation model

A two-node version of the PRISMA machine has been modelled using Oyster. A single node is modelled with four objects (Figure 5.6): the memory, a small instruction cache, the communication processor and the data processor. The memory, cache and processor models are taken from the standard library discussed in Section 3.3. The memory is a dynamic RAM with a size of 65536 words, just enough for these experiments. The instruction cache is a direct mapped cache, with one word per line and 512 lines.

The communication processor is not in the standard library. It is modelled with thirteen Pearl objects, as drawn in the dashed box in Figure 5.6. The objects forming the CP are literally as described in [Annot87]: the central store manages the internal buffer space, four input machines receive data from the neighbouring communication processors, four output machines send data to these neighbouring nodes, two special input and output machines (marked SI and SO) handle the packets coming from and going to this node, and an input queue and an output queue decouple the data processor.

The algorithms used in the model are partly different from the algorithms used in the hardware version of the CP, but their external behaviour is the same. In the hardware CP, the central buffer store consists of four queues, containing references to the packets that should be sent over the four links. A rather efficient but complicated protocol ensures that each packet travels over exactly one link. For the sake of simplicity, this has been modelled as a single queue; the FIFO ordering and timings of the original queuing scheme are preserved. The input and output machines are implemented radically differently from the hardware implementation. In hardware, the input and output machines run a polling protocol. During a simulation run, the polls would have to be emulated, leading to a waste of simulation time. The protocol is therefore simulated as an event driven protocol, again with preservation of the external behaviour.

The data processor is a three-address, register oriented processor, approximately with the capabilities of an MC68020. The instruction set is enriched with a printf instruction, which takes neither execution time nor space in the program (no instruction fetches are simulated for it), but which is very practical for debugging the simulation models, the compiler and the run time support software. The instruction set is further enriched with a readtimer and a printtimer instruction to get extra timing information from the simulator.

As the program for measuring the latency, the POOL program sketched in Figure 5.3 (page 88) is used. The POOL compiler translates the POOL program into C, and the GNU C-compiler [Stallman88] is used to translate the C code and the POOL run time support into the assembly code for the processor. The GNU compiler is used because of its portability. Targeting the compiler to the processor in this architecture is rather easy, since the processor is highly orthogonal and has absolutely no nasty implementation-caused non-orthogonalities. Although such a processor is hard to build, the performance characteristics of a real-world processor are not very different; it is only more work to construct a compiler and simulator for it.

Since the POOL program, run time support and message handling library are compiled to assembly code, and simulated together with the hardware (emulated at assembly level as discussed in Section 4.1), the full message trajectory is simulated, including the software overhead. This makes it possible to study trade-offs between hardware and software. Note that some parts of the architecture are not modelled. The floating point unit, memory mapping (the virtual to physical address translation, which causes interesting problems when passing physical references to the upcoming message processor), and possible cache coherency problems are not taken into account. These parts do not influence the performance; they only introduce complexity in the design and are thus not interesting for this evaluation study.

5.3 Measurements and results

The model described above has first been verified by checking it against an available implementation of the machine. All parameters of the simulator were set to the values of the hardware prototype; the performance results of the simulator should then match the experimental results measured on the prototype machine. After this verification, the parameters are upgraded to values which are reasonable for the state-of-the-art technology of 1989 (the year this experiment was performed), because the prototype machine was built using outdated technology. In an up-to-date machine, the data processor is faster than the currently used MC68020 and the entire node fits on a single board, so the slow VME bus is eliminated from the machine. Below, these two machines are referred to as '1985-technology' and '1989-technology'. Note that the architecture itself is not different; only the timing parameters of the model are changed. The performance figures of this upgraded model serve as a reference point for evaluating the benefits of adding a message processor.

In the first step, an output message processor is added to the architecture; it performs DMA from the memory to the network. In the second step, an input message processor is added, which handles the messages coming from the communication processor and destined for the data processor. The input message processor has to allocate memory for storing the incoming messages. Since the memory is managed by the data processor, this gives rise to a complex interface between the data processor and the message processor. To circumvent this interfacing problem, a special purpose allocation processor is incorporated in the architecture before the message processors are added. The allocation processor manages the memory for both the message processor and the data processor, and has a clear, well defined interface. The allocation processor is not essential for the performance, but it eases the interfacing and the total design. A major performance improvement is expected from the addition of the message processor.

5.3.1 Verification

The model has been roughly verified by simulating the prototype hardware exactly. To do so, the values of the parameters of the prototype need to be found. For the simplest components, these values are not hard to find. The speeds of the links of the CP and of its internal state machines are known, and consequently all internal delays of the CP are known exactly. The memory wait states, cache parameters and bus latency are also well documented.

The situation with the data processor is worse. The model is not an exact copy of an MC68020, but a highly orthogonal three-address machine. By restricting the number of registers, by tuning the allowed addressing modes to look like those of the MC68020, and by teaching the compiler to use these addressing modes, the model already looks like an MC68020 with respect to the number of memory references and computations. To make the speed of the modelled processor and the MC68020 comparable, the timings of the various instructions are set to the instruction timings of the MC68020. By setting the clock speed to the value used in the prototype machine, the processor executes around 3 million instructions per second, which is also the speed obtained by the MC68020 in the prototype machine. It is clear that this model is still radically different from an MC68020, so MIPS measurements have little relevance, but since the instruction sets have roughly equal expressiveness, the modelled processor can do about the same work as an MC68020 in the same time.


Figure 5.7: Clock frequency (16.6–33.3 MHz) versus latency (550–625 μs) (a), and clock frequency versus MIPS (3.0–4.5) (b).

Most important is that the basic architectural differences (such as the three-operand instructions) do not have big performance consequences, and that there are no architectural differences that do have consequences for the performance (like separate instruction and data spaces, a data cache, or a special pipelined implementation).

When running the simulator with the parameters sketched above, it predicted a rendez-vous time of 690 μs. Considering the measured value of 700 μs on the real prototype machine, the error is no more than 1.5%. This error is so small that apparently the errors made in the modelling process compensate each other. We did our best to eliminate all possible sources of systematic error, which is important because trade-offs between various hardware and software solutions are made in the coming pages.

5.3.2 Technology update

Upgrading the internal parameters of the CP to a more comfortable internal clock speed and external link speed, and upgrading the speed of the internal algorithms, results in a latency of 655 μs. Reducing the access time of the communication processor's I/O queues from 1 μs (introduced by the VME bus) to 50 ns gives a message rendez-vous time of 625 μs (this improvement could have been expected, since 32 accesses are each accelerated by about 1 μs).

The next step is to increase the clock frequency of the data processor. In Figure 5.7a the latency is plotted against the processor frequency. The figure shows that the step from 25 to 33 MHz does not give much improvement in latency, an indication that other system components are more and more becoming the bottleneck. The MIPS curve (plotted in Figure 5.7b) does not increase fast enough: in a first order approximation, the MIPS rate should equal the clock frequency divided by the clocks per instruction, which comes to 5.2 MIPS for a 33.3 MHz clock, while only 4.3 MIPS is reached. This is apparently caused by the slow memory. Figures from Oyster show that the memory system was utilised 92% of the time during the rendez-vous when the data processor was set to a clock frequency of 33.3 MHz. Furthermore, the small instruction cache did not give a sensible hit rate: about 40%, of which the major part was caused by the idle loop of the POOL run time support system. After increasing the cache size by a factor 8 (which is not unreasonable as a technology upgrade), the latency improves to 450 μs. The memory is still utilised 80% of the time. This is explained by the fact that a part of the idle loop refers to global variables in the memory, causing many accesses (remember that the cache is an instruction cache). The performance could be further improved by extending the system with a data cache, but this would require a coherency protocol between the data cache and the I/O devices: the data processor, message processor and allocation processor should keep the same view of the memory.

The idle loop also causes problems in the simulator: idle processors consume CPU time on the simulation platform (all idle instructions are emulated), and Oyster cannot detect that the processor and memory are actually idle: according to Oyster, the processor is busy executing the idle loop. The idle figures of Oyster should thus be interpreted carefully.

5.3.3 Adding the allocation processor

The allocation processor handles memory allocation requests. As mentioned, it is not introduced to increase the performance, but because it makes the design of the message processor simpler. The interface of this allocation processor, as seen from the data processor, exactly resembles the C library malloc() and free() calls. Memory is allocated by asking the allocation processor for a certain number of bytes; a pointer to the start of the block is returned. The interface is memory mapped. A block of memory is freed by passing the address of its first word to the allocation processor. The allocation processor maintains multiple linked lists of free blocks: an array of registers points to the lists of blocks of 4..512 bytes, and one register points to the list of free blocks larger than 512 bytes. A block that is freed is linked into the list of blocks of that length. A block request is handled by unlinking the first block of the corresponding list. Blocks are thus only created, never glued together. The allocation processor could be extended with capabilities for coalescing free blocks during idle periods, but that is a separate research issue.
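The bookkeeping of the allocation processor can be sketched in C as segregated free lists; the names are mine, the real device is memory mapped hardware, and for simplicity the sketch passes the block size to the free routine and omits splitting a larger block when a list is empty (the case that triggers the bus release discussed in the next paragraph).

    #include <stddef.h>

    #define MAXSMALL 512                   /* largest "small" block size        */
    #define NLISTS  (MAXSMALL / 4)         /* one list per multiple of 4 bytes  */

    typedef struct blk { struct blk *next; } blk;

    static blk *freelist[NLISTS + 1];      /* freelist[NLISTS]: blocks > 512 B  */

    static int list_index(size_t size)
    { return size > MAXSMALL ? NLISTS : (int)((size - 1) / 4); }

    /* Free a block: link it into the list for blocks of this length.           */
    void ap_free(void *p, size_t size)
    {
        blk *b = p;
        int  i = list_index(size);
        b->next = freelist[i];
        freelist[i] = b;
    }

    /* Allocate a block: unlink the first block of the matching list.           */
    void *ap_malloc(size_t size)
    {
        int  i = list_index(size);
        blk *b = freelist[i];
        if (b != NULL)
            freelist[i] = b->next;
        return b;                          /* NULL: a larger block must be cut  */
    }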

There are some issues involved here that pose nasty implementation problems when building real hardware. For example, when the data processor asks for a block of memory which is not yet available (the allocation processor encounters an empty list of free blocks), the data processor should release the bus so that the allocation processor can access the memory to cut a longer block into two parts. This bus release is not simulated, since it is only an implementation difficulty and has almost no influence on the performance; the performance is bounded by the speed of the memory and the allocation processor itself. Only if the bus switch between the data processor and the allocation processor would take considerably more time than a memory allocation, and if it happened very frequently, would a simulation of the bus release be necessary.


Because the allocation processor is dedicated to allocating memory, and parallelism is introduced (memory management activities are carried out in parallel with the execution of the program), the performance of the architecture improves a bit: the latency of a message decreases by about 40 μs. This speedup is a nice side effect, but it is questionable whether the additional hardware pays off: the allocation processor is only needed to interface the message processor.

5.3.4 Adding the message processor

The message processor implements the message layer of Figure 5.2 in hardware. The interface as seen from the run time support is identical, but 99% of the functions' code is now implemented in hardware. The functions for start-message, end-message, send-data and so on are all written in two or three assembly instructions; these assembly instructions pass the function parameters to the message processor via a memory mapped interface.

In Figure 5.8 a sketch is given of the complete model of the architecture, including the allocation processor, the input message processor (input MP) and the output message processor (output MP). The input and output message processors are themselves dual ported: one port is dedicated to transferring data to and from the communication processor, the other to communicating with the memory and getting commands from the processor. The message processor has four tasks: it transports data from one bus to the other, it generates the headers required by the communication processor at the start of each packet, it removes these headers when retrieving a packet from the network, and it completes half-empty packets with empty words. The input message processor additionally unravels the various streams coming from all nodes of the machine into separate byte streams.
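As an illustration of the first two tasks, the following C sketch shows an output message processor cutting a byte stream from memory into CP packets, generating a header for each packet and padding the last one; it reuses the cp_packet layout sketched in Section 5.1, and cp_write_packet() and the other names are my own assumptions.

    #include <stddef.h>
    #include <stdint.h>
    #include <string.h>

    extern void cp_write_packet(const cp_packet *p);   /* eight 32-bit words    */

    void omp_send(int dest, const uint8_t *mem, size_t len)
    {
        while (len > 0) {
            cp_packet p;
            size_t n = (len < sizeof p.data) ? len : sizeof p.data;

            p.dest  = dest;                            /* generate the header   */
            p.admin = 0;
            memcpy(p.data, mem, n);                    /* transport from memory */
            memset(p.data + n, 0, sizeof p.data - n);  /* pad half-empty packet */
            cp_write_packet(&p);

            mem += n;
            len -= n;
        }
    }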

Figure 5.8: The model of the enhanced PRISMA architecture: the node of Figure 5.6 extended with the allocation processor and the input and output message processors between the bus and the CP queues.

The result of adding the output side of the message processor (from DP to CP) is that the rendez-vous time is reduced by another 80 μs, resulting in an overall latency of 325 μs. When sending larger messages, the benefits are higher, since the message processor is dedicated to transporting data from memory to the CP over two busses, while the data processor has to fetch and subsequently send the data over the same bus.

Placing the input side of the message processor in the stream from network to data processor saves less, only 25 μs, bringing the latency to 300 μs.


    Clock frequency              16.7 MHz   20 MHz    25 MHz    33.3 MHz
    Clock cycle time             60 ns      50 ns     40 ns     30 ns
    Measured on prototype        700 μs
    Simulator, 1985-technology   690 μs
    Simulator, 1989-technology   615 μs     575 μs    550 μs    533 μs
    With MP, small cache         425 μs     400 μs    380 μs    375 μs
    With MP, large cache         360 μs     330 μs    310 μs    300 μs

Table 5.1: Latency of the prototype, the first (validation) model, the 1989-technology model with a small processor cache, the architecture with message processor and a small cache, and the same architecture with a larger cache. The four columns represent four clock speeds of the data processor.

In more complex POOL programs, where multiple objects are communicating, messages from different nodes interleave. In that case a larger gain can be expected; further simulations are necessary to gain insight into the behaviour at higher loads. For the simplest case, more than 100 μs is already gained.

5.4 Discussion

Table 5.1 summarises the latency of a single synchronous send in POOL as measured on five implementations. The top row shows the performance of the prototype machine; the other rows show the simulation results of, respectively, the validation model, a machine implemented using 1989 technology (Figure 5.6), the architecture with message and allocation processor (Figure 5.8) equipped with a small cache (to observe the influence of the cache), and the architecture with message processor and a large instruction cache. The columns of the table show the latencies for four clock frequencies of the data processor. The latency of a rendez-vous can thus be reduced from 690 μs (for the original machine) to 300 μs (33.3 MHz data processor, with message processor and a large cache). Of this improvement, 75 μs is due to a technology upgrade of the memory, cache, CP and bus, 60-80 μs is saved by a higher clock speed of the data processor, 160-180 μs by the special extra hardware, and 75 μs by the bigger instruction cache.

Instead of measuring the latency in microseconds, it might be better to calculate the number of instructions that could have been executed by the data processor, since this gives an indication of the minimal grain size of the program. In 615 μs a 16.7 MHz (3 MIPS) processor can execute about 1850 instructions. If the 615 μs is overhead spent at the data processor, a parallel application with a grain size of 1850 instructions will lose a factor 2 because of overhead. In 300 μs a 33.3 MHz (5 MIPS) processor can execute around 1500 instructions. Although the latency is reduced by more than a factor 2 (615 to 300 μs), the grain is only reduced by 20% (1850 to 1500 instructions). Despite the reduction of the latency, the minimally required grain size of the application decreases only slightly, due to the higher clock frequency of the data processor. Note, however, that without the message processor, programs would need a minimal grain size of 5 MIPS × 533 μs ≈ 2650 instructions to run efficiently on a 33.3 MHz system, which is a 50% larger grain.

In the above calculations it is assumed that the latency is purely overhead at the data processor. In the original architecture, the communication processor does not add significantly to the latency (as visualised in Figure 5.4), so the whole latency is indeed overhead of the data processor. In the architecture with message processor, the memory limits the performance (since the message processor tries to retrieve words faster than the memory can supply them), implying that the data processor cannot do any computational work during the transfer of messages. The latency is thus still pure overhead time of the data processor.

The bandwidth to the memory also limits the throughput of the communication architecture. Large messages are handled very efficiently by the message processor, but unfortunately both the data processor and the message processor are then blocked by the memory. By adding a data cache to the architecture, the data processor could be decoupled, but this is troublesome because there would have to be a coherency protocol between the data cache, the message processor and the allocation processor, since they need to have the same view of the memory.

The experimentation with the simulator gave us insight into the behaviour of PRISMA's communication architecture. The simulator makes it possible to search for an optimal balance between hardware and software. It is our opinion that implementing more layers of the software in hardware (for example the flattening procedure that maps a graph of POOL objects onto a stream of bytes) makes no sense. Besides the fact that a hardware realisation would take all flexibility out of the implementation, it would also result in a horrendous interface between the compiler and the hardware, since this layer of the implementation has the knowledge of the structure of the graph. Such an interface is not an attractive alternative.


Chapter 6

Simulating the Futurebus coherent caching scheme†

One of the ways to build parallel computers is to use a distributed memory paradigm, where all computing nodes have their own local memory, and nodes communicate via some interconnection network. The PRISMA architecture, simulated in the previous chapter, is a typical distributed memory machine. One of the disadvantages of a distributed memory machine is the delay introduced in sending data from one processor to another. Another way to construct a parallel machine is to use a shared memory. In the most extreme form of a shared memory machine, all processors are directly connected to one large memory. The memory stores the (shared) program code and data. Since the data can be accessed directly by all processors, there is no delay in communication, but this machine suffers from contention on the memory: when many processors attempt to access the memory, only one can be serviced, and the others have to wait.

In early large shared memory machines, like the NYU Ultracomputer [Gottlieb83], this problem was alleviated by splitting the memory in modules, and connecting the processors and the memory modules with a switch, to route memory accesses of the processors to the right memory module. A disadvantage of such a switch is the related costs, and the extra delay between the processor and the memory. The costs of a cheap switch (an Omega-switch) connecting n processors and n memory modules is O(n log n), while it has a delay of O(log n). An expensive switch (crossbar) has constant delay, but costs O(n²). The problem of using a switch is that one either suffers from high (quadratic) costs, or high (logarithmic) delay.

Another way to attack the contention on the shared memory is to use caching techniques. A cache [Smith82] is a fast and small memory that stores the data that is frequently used by the processor. Caches can service, say, 95% of the accesses to the memory directly, while the other 5% are passed to the slower memory to be answered after some delay. In a single-processor machine, a cache is used to match the speed of the fast processor and the slow memory. In a shared memory machine, the cache also significantly reduces the traffic to the memory. Since only 5% of the accesses are routed to the memory, more processors can be connected to the shared memory without running into contention problems. The current generation of top-of-the-line minicomputers, for example from Data General and SUN, are shared memory machines with caches.

†This chapter is based on [Langendoen91, Muller92b].

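To make the traffic argument concrete, here is a rough back-of-the-envelope sketch. Only the 95%/5% split comes from the text; the function name, the reference rate and the no-overlap assumption are invented for the illustration.

    def memory_traffic(n_processors, refs_per_sec, miss_rate):
        """References per second that actually reach the shared memory
        (the remaining accesses are serviced by the per-processor caches)."""
        return n_processors * refs_per_sec * miss_rate

    # With a 95% hit rate only 5% of the references reach the memory, so roughly
    # 1 / 0.05 = 20 cached processors generate the same memory traffic as a
    # single processor without a cache (assuming 20 million references/s each).
    print(memory_traffic(n_processors=20, refs_per_sec=20e6, miss_rate=0.05))
    print(memory_traffic(n_processors=1,  refs_per_sec=20e6, miss_rate=1.0))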

Caching in a single processor machine is simple: because the cache is between the memory and the processor, it intercepts all accesses routed to the memory. Because the machine language program cannot see the cache (it behaves exactly like a memory, only faster) the cache is said to be transparent. Because the cache always contains an up to date image of the memory, the cache is said to be consistent. In a shared memory machine, other processors may modify the memory, leading to a problem known as cache consistency: suppose that processor X has data A in its cache, and processor Y modifies data A in the memory, then processor X still has the old data in its cache; the cached data is inconsistent. The crudest way to solve this consistency problem is to shift the problem to the machine language level: updates to the memory are only allowed if the data is not in any other cache. This solution is not transparent: the machine language program has to be aware that data is cached. Since it is not a good idea to bother the application programmer with the consistency problem, this solution is only viable if the assembler, operating system or compiler can enforce the consistency transparently.

Consistency can also be enforced with a hardware protocol. On a parallel machine with a single broadcast medium like a bus, the protocol consists simply of a broadcast sending the new data to all caches. All caches must then invalidate or update the old value. This is for example the way the Motorola mc88200 cache chip [Motorola88] (used in amongst others the Data General computers) works. For architectures without a single broadcasting medium, all other copies of cached data have to be found when an update operation is performed. It is for example possible to search the data with the help of "directories" that point where shared data has gone [Chaiken90]. There are more alternatives; see for example [Stenstrom90] for a survey on this topic. Combining broadcasts and directories, caching shared memory architectures can be constructed that use multiple (broadcasting) bus segments with directories in between, as is for example implemented in the Data Diffusion Machine [Warren88]. The Futurebus+¹ [Futurebus89] defines an industry standard bus that supports this type of hierarchical architectures.

In this chapter, a study on the performance of hierarchical cache architectures that can be built on top of the Futurebus is presented. With fixed technological parameters (such as the bus speed), the performance of architectures with varying cache parameters (sizes) and bus topologies is measured using a simulator. The performance results are obtained under the load of a stochastical application (see Chapter 4 for a description). Other studies are mostly based on the use of execution traces, but these are only valid for one specific application running on one specific architecture. Because the stochastical simulation model is less accurate than a model based on address traces, the stochastical model is validated against results from real address traces (Section 6.3).

The simulator is firstly used to measure the influence of two cache parameters, the line size and the associativity². The results are presented in Section 6.4.

¹Note, in the sequel the '+' is omitted for readability.
²Throughout this thesis the term line is used to denote the basic entity stored in the cache. The lines (with size 2^l bytes) are grouped into sets, with 2^a lines per set (the associativity). There are 2^s sets in the cache; the total cache size is 2^(l+a+s) bytes. The set is selected by a hash function (typically just the selection of address bits l..l+s-1), while the line within a set is selected associatively.


In contrast with many other studies (for example [Przybylski90] and [Hennessy90]), the studies towards line size and associativity are performed in the context of hierarchical multiprocessor architectures, as is for example also the case in [Baer88]. Because of invalidates and sharing, the optimal values for these parameters for single and multi processor systems are not necessarily identical. After determination of reasonable settings for these parameters, a study on the influence of the bus topology on the performance is presented in Section 6.5. All architectures with up to 32 processors and regular topologies are compared: a flat bus with 2, 3, 4, ..., 32 processors, two level systems with 2*2, 2*3, ..., 2*16, ..., 16*2 processors, and three and four level hierarchies have been simulated. As performance measure, the relative MIPS rate of the processors is used, indicating how much performance is lost due to contention problems. Only the contention issue is studied; the parallelism and the overhead of the application are not considered at all, although this causes serious problems when parallelising real world programs.

The results of the simulations of all hierarchical architectures are further analysed in the next chapter, where a performance model for hierarchical cache architectures is presented.

6.1 Introduction to the Futurebus cache consistency

The Futurebus is the new industry standard bus, intended to be the successor of the VME bus. The Futurebus is defined in several layers [Futurebus89]: the wires and their electrical behaviour (timings, glitches, voltages, live insertion), the arbitration order (preferences, masters, slaves), the basic protocols for data transport (reading and writing), a cache consistency protocol for systems with multiple caches and an arbitrary number of bus segments, and a message protocol for bulk data transport. The definition of the bus is independent of specific processor implementations or specific technologies: as an example, the width of the data bus is by default 64 bits, but widths of 32, 128 or 256 bits are also allowed. The only layer that is of interest in this performance study is the layer defining the cache consistency protocol: the layers above it are ignored, while the timings of the lower layers are a parameter of the simulator. This section gives a brief description of the cache consistency layer; it should suffice for readers unfamiliar with hierarchical caches to follow this and the next chapter. The section ends with an overview of the aspects that are not simulated but which are actually part of the Futurebus definition.

6.1.1 Consistency in flat architectures

The Futurebus cache consistency protocol is MOESI-like [Sweazy86]. Each cached line of data is in one of the following three states: Exclusive, Shared or Invalid. An Exclusive line of data may be read and written, a Shared line of data is read-only, and an Invalid line of data is not present (the same as a cache miss). In a flat
architecture, with one single bus interconnecting all caches and the memory, the invariant holds that if some data is Exclusive in one cache, the other caches do not have a Shared or Exclusive copy; shared data may be freely distributed over all caches.

Processors, caches and the memory communicate with so-called transactions. For flat architectures, four types of transactions suffice: read shared, write, read modified, and invalidate. These transactions respectively read a line of shared data (read-only), write a line of data back to the memory, read a line of data with the intention to write, and invalidate a cache line. A cache receiving a transaction from the processor for a cache line with a given state changes the state of the line according to the following state transition table:

  Old        Transaction      New        Transaction
  state      from processor   state      to memory
  Invalid    read shared      Shared     read shared
  Invalid    read modified    Exclusive  read modified
  Shared     read shared      Shared     -
  Shared     read modified    Exclusive  read modified (!)
  Shared     invalidate       Invalid    -
  Exclusive  read shared      Exclusive  -
  Exclusive  read modified    Exclusive  -
  Exclusive  invalidate       Invalid    write

The transaction marked with the '(!)' is not necessary to obtain the data (since the memory contents of the line are already in the cache), but signals the other caches to destroy their shared copy, to enforce the invariant that exclusive data is in only one cache. The caches thus also react to the transactions that are snooped from the memory bus according to the following state transition table:

  Old        Transaction      New      Intervention
  state      on memory bus    state    on memory bus
  Invalid    any              Invalid  none
  Shared     read shared      Shared   none
  Shared     read modified    Invalid  none
  Exclusive  read shared      Shared   Supply data
  Exclusive  read modified    Invalid  Supply data

The last two state transitions of the table (an Exclusive cache line snooping a transaction) show how modified data is kept consistent: if the data is needed by another cache, the data is written on the memory bus, preventing the use of old data from the memory. As an example, consider the four caches in Figure 6.1. One cache (B) has the data Exclusively, while the other three do not have the data. In the case that cache A reads the data (either modified, or shared), cache B intervenes in the read and supplies the data, preventing the memory from supplying outdated data.
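These two tables translate directly into a small state machine. The sketch below is written for this edition in Python (the thesis's simulator itself is written in Pearl); the class and function names are invented, and only the transitions listed in the tables above are encoded.

    # Flat-bus cache consistency: the two state transition tables above.
    INVALID, SHARED, EXCLUSIVE = "Invalid", "Shared", "Exclusive"

    # (old state, transaction from processor) -> (new state, transaction to memory bus)
    PROCESSOR_SIDE = {
        (INVALID,   "read shared"):   (SHARED,    "read shared"),
        (INVALID,   "read modified"): (EXCLUSIVE, "read modified"),
        (SHARED,    "read shared"):   (SHARED,    None),
        (SHARED,    "read modified"): (EXCLUSIVE, "read modified"),  # the '(!)' case
        (SHARED,    "invalidate"):    (INVALID,   None),
        (EXCLUSIVE, "read shared"):   (EXCLUSIVE, None),
        (EXCLUSIVE, "read modified"): (EXCLUSIVE, None),
        (EXCLUSIVE, "invalidate"):    (INVALID,   "write"),
    }

    # (old state, transaction snooped on the memory bus) -> (new state, intervention)
    SNOOP_SIDE = {
        (SHARED,    "read shared"):   (SHARED,  None),
        (SHARED,    "read modified"): (INVALID, None),
        (EXCLUSIVE, "read shared"):   (SHARED,  "supply data"),
        (EXCLUSIVE, "read modified"): (INVALID, "supply data"),
    }

    class Line:
        """The consistency state of one cached line."""
        def __init__(self):
            self.state = INVALID

        def from_processor(self, transaction):
            """Apply a processor request; return the resulting bus transaction."""
            self.state, to_bus = PROCESSOR_SIDE[(self.state, transaction)]
            return to_bus

        def snoop(self, transaction):
            """React to a transaction snooped on the memory bus; an Invalid
            line (first row of the table) ignores any transaction."""
            entry = SNOOP_SIDE.get((self.state, transaction))
            if entry is None:
                return None
            self.state, intervention = entry
            return intervention

    # The Figure 6.1 scenario: cache B is Exclusive, cache A reads the data.
    a, b = Line(), Line()
    b.state = EXCLUSIVE
    on_bus = a.from_processor("read shared")   # A misses and issues a bus transaction
    print(on_bus, b.snoop(on_bus))             # read shared supply data
    print(a.state, b.state)                    # Shared Shared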


[Figure 6.1: An example of an intervening cache. Cache B holds the data Exclusive and the other caches hold it Invalid; (1) cache A issues a read, (2) cache B supplies the data instead of the memory.]

[Figure 6.2: An example hierarchical architecture, with two level 0 caches. Four processors, each with its own cache, are connected via two processor-side buses to two level 0 caches, which share the memory bus with the memory.]

6.1.2 Consistency in hierarchical architectures

So far, only architectures with a flat bus have been considered. Hierarchical architectures, with more busses, operate almost identically. An example architecture is sketched in Figure 6.2. It consists of four processors with four caches, and a global memory, interconnected by three busses, and two level 0 caches. Towards the processor side a level 0 cache acts as a memory. Data requests coming from the processor caches are handled by the level 0 cache by passing the data request to the memory bus, and passing the answer of the memory to the bus at the processor side. Towards the memory side, the level 0 cache acts as an ordinary cache. All transactions on the memory bus are snooped by the level 0 cache, and it intervenes if possible inconsistencies are detected. To be able to detect these inconsistencies, each level 0 cache maintains a full administration of all lines of data kept in its processors' caches. The following invariant is maintained:

If a cache has an Exclusive copy of a line, all level 0 caches towards the memory have marked this cache line as Exclusive, and all other level 0 caches have marked this line as Invalid.

If a cache has a Shared copy of a line, all level 0 caches towards the memory have marked this cache line as Shared; any other cache in the architecture may have a Shared copy of this line.

The state transition table of a level 0 cache is almost identical to that of a cache. The major difference is that requests are passed through the level 0 cache, leading to extra transactions. Furthermore, all transition tables are extended to react to invalidates snooped on the memory bus: Shared data needs to be invalidated in that case.


[Figure 6.3: An example of a split answer. Processor caches A and B (Invalid) share bus X with level 0 cache E (Invalid); processor caches C (Invalid) and D (Exclusive) share bus Y with level 0 cache F (Exclusive); E, F and the memory share the memory bus. The numbered transactions (1)-(5) trace a request from A through E, the memory bus, F and D, and the response back.]

As an example, the snoop on a Shared line of data is handled according to the following transition table (the rest of the table is replaced with dots):

  Old      Transaction      New      Intervention   Transaction
  state    on memory bus    state    on memory bus  on processor side
  ...
  Shared   read shared      Shared   -              -
  Shared   read modified    Invalid  -              invalidate
  Shared   invalidate       Invalid  -              invalidate
  ...

The level 0 caches need not necessarily be equipped with cache memory. In that case they act as pure directories that guide transactions to the right bus segments. In the study presented here, we always use real caches.

6.1.3 Splitting transactions

A level 0 cache cannot answer transactions immediately because it might have to consult the bus at the other side. It would be a waste of bus bandwidth to hold the bus(es) while waiting for the answer. Therefore the Futurebus definition allows transactions to be split. A transaction that cannot be resolved immediately gets a split-answer, meaning that the transaction is being dealt with. The bus is then released and can be used for other transactions. The eventual answer to the transaction is then transferred in a separate bus cycle, called a response transaction.

Figure 6.3 shows an example of how splits are used. Suppose that cache A exclusively needs the data that is Exclusive in cache D. Cache A issues a read modified on bus X (1). Because cache E cannot answer with the data, it splits the transaction, and passes the read modified request to the memory bus (2). The memory could supply (outdated) data, but cache F intervenes and issues a split, because it has administrated that the data is kept (exclusively) somewhere above it. It issues the read modified to bus Y (3), where cache D immediately replies with the data, while invalidating its own copy. The data is then passed through cache F (4), which marks the line as Invalid, and cache E (5), which marks the line as Exclusive, to
cache A, marking the line as Exclusive. The request for exclusive data is serviced with five transactions on three buses: three times a read modified and two times a response read modified. Note that while cache F and cache D are executing transaction (3), cache E is free to handle other transactions for cache B.

The information about outstanding requests (for which an answer is expected) and outstanding responses (for which an answer needs to be given) is administrated for each cache line, enlarging the state of a cache line, and enlarging the state transition table. In total there are around 8000 states, but the majority of them are illegal, in the sense that these states cannot occur.

In the case sketched in Figure 6.3, only one processor was requesting the Exclusive data. In a multiprocessing environment it is very well possible that while the level 0 cache is handling a transaction for one processor, another processor might start a transaction on the same cache line. In the best case, both transactions are read shared transactions, causing the level 0 cache to split the second transaction, and to send the response (when the result of the first transaction is back) to both requesting caches at the same time.

In a worse case, both transactions are read modified transactions. In this case, the level 0 cache will ask the cache performing the second request to shut up as long as the first transaction is in progress. When the read modified transaction of the first cache is finished, the second cache is allowed to repeat the read modified transaction.

In the worst case, a read modified request is issued on the same piece of data by two distinct processors at about the same time. The processor caches both initiate a read modified request down the tree, resulting in an invalidate request upwards from the memory bus towards all caches in the system having a copy. Somewhere in the architecture, the upgoing invalidate of the first, and the downgoing read modified of the second processor will meet, giving rise to a conflict. The request which was closest to completion (the invalidate going upwards in this case) has preference, which causes the second read modified to be suspended. The second transaction is resumed as soon as the first transaction is completed. In these cases the ability to split a transaction is not only important to improve the performance, but also essential in preventing deadlocks.

The protocol gets really hairy when neighbouring caches start referencing either the line selected for a replacement, or the line that caused a replacement. In these cases, utmost care is taken to define the protocol in such a way that deadlock is avoided, the invariants are maintained, and still a sensible action is taken. Implementing these nasty features accounts for about 80% of the work.

6.1.4 Discrepancies between the simulator and the real Futurebus

Above only a short description is given of the part of the Futurebus used in the simulator. The details about the Futurebus specification can be found in [Futurebus89]. The most important deviations are:

- The Futurebus defines an Unmodified exclusive state, for data that is exclusive, but not (yet) modified. The simulator only implements the states Invalid, Shared and Exclusive (together with the attributes for pending responses and requests), because it is expected that the Unmodified state is not used very frequently (if used at all).


- Above, six transactions were defined; the Futurebus has 15 more transactions for efficient implementation of I/O (and virtual memory) and the implementation of locked parts of memory. We expect that for computation intensive applications, 99% of the transactions are covered by the six sketched above.

- For the timings of the lower levels of the Futurebus, one set of parameters is used in all experiments. This means that the performance figures are only relevant for that specific implementation of the Futurebus. Despite this, we expect that the general trends are valid for other implementations as well.

Especially the implementation of I/O could place other demands on the architecture. Therefore only applications are considered that do not perform bulk I/O; the amount of computation should be significantly higher than the amount of I/O.

6.2 The simulation model

The structure of the simulator is depicted in Figure 6.4. The upper two levels of the picture (the P's and the scheduler) model the application program. The address trace produced by the application is interpreted by the memory hierarchy. Buses, caches, and the shared memory are simulated at bus transaction level, allowing an exact simulation of the Futurebus cache consistency protocol. The simulator is entirely written in Pearl: memories, caches and busses are Pearl objects, communicating by messages, one message for each transaction. The boxes in Figure 6.4 resemble Pearl objects, the lines connecting the boxes denote the paths where messages may pass.

[Figure 6.4: The structure of the simulator. The boxes are Pearl objects: a scheduler drives twelve processors P (the application simulation); each processor has a level 2 (processor) cache; groups of three level 2 caches share a level 2 bus with a level 1 cache, pairs of level 1 caches share a level 1 bus with a level 0 cache, and the two level 0 caches share the level 0 (memory) bus with the shared memory.]


6.2.1 The application simulator

A stochastical application, as discussed in Section 4.3 (on page 75), is used as load for the memory hierarchy. The stochastical application generates an address trace, which is interpreted by the memory hierarchy at bus cycle level. The reason to use a stochastical application is that the execution or emulation of a real application would be too expensive, while real world address traces cannot be modified to model a different type of application with a bit more or less sharing. The inaccuracy of the stochastical application is addressed by a validation, presented in Section 6.3.
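A much-simplified impression of such a generator is sketched below (written for this edition, not taken from the Pearl sources): references are drawn from text, private data, stack and shared segments with fixed probabilities and some spatial locality. All segment sizes, probabilities and the locality parameter are invented placeholders; the real workloads derive their parameters from measured programs and from the literature.

    import random

    # Invented segment layout: (base address, size in bytes, access probability).
    SEGMENTS = {
        "text":    (0x00000000, 256 * 1024, 0.55),
        "private": (0x10000000, 512 * 1024, 0.25),
        "stack":   (0x20000000,  64 * 1024, 0.15),
        "shared":  (0x30000000,  10 * 1024, 0.05),
    }

    def stochastic_trace(n_refs, locality=0.9, seed=0):
        """Yield n_refs (address, is_write) pairs with simple spatial locality."""
        rng = random.Random(seed)
        last = {name: base for name, (base, size, p) in SEGMENTS.items()}
        names = list(SEGMENTS)
        weights = [SEGMENTS[n][2] for n in names]
        for _ in range(n_refs):
            name = rng.choices(names, weights)[0]
            base, size, p = SEGMENTS[name]
            if rng.random() < locality:              # mostly sequential word accesses
                addr = base + (last[name] - base + 4) % size
            else:                                     # occasional random jump
                addr = base + rng.randrange(0, size, 4)
            last[name] = addr
            is_write = name != "text" and rng.random() < 0.3
            yield addr, is_write

    for addr, is_write in stochastic_trace(5):
        print(hex(addr), "write" if is_write else "read")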

The experiments presented in this chapter were performed with five different stochastical applications (or workloads), modelling a collection of programs typically found on a large UNIX machine, and four flavours of a parallel application:

UNIX: In this workload, 8 classes of jobs are defined; each class represents a program found on a UNIX machine (a C-compiler, an editor, TeX). The static parameters of these programs (like the number of instructions, functions, sizes of the various segments) have been taken from real world systems. The dynamic parameters, like the average number of executed jumps and their distances, have been taken from the literature [Hennessy90] and from profile runs.

The UNIX workload consists of 50 jobs (6-7 of each class). The jobs all have their own private data and stack segments, and do not use any shared data. All jobs of the same class share their text segments: there is only one TeX-image and cc-image.

The kernel is simulated by a small process, running on each processor with both shared text and data segments.

par: The parallel workload models a fine grained parallel program. The program has one shared text segment, one shared data segment and multiple private data and stack segments. The majority of accesses is directed to the stack, text and private data; only a small part of the accesses is directed towards the shared data. The parameters of the text, stack, and private data segments are an "average" of the parameters found in the UNIX job-mix; the values represent a reasonably large program. The hard part in the definition of par is the communication behaviour. Therefore a segment with a size of 10 Kbyte is chosen, with an average access rate of once every 10 instructions. Using these parameters, the results for the write broadcast ratio are comparable to the results reported in [Eggers89]. Since the choice for "one access every 10 instructions in a 10K block" is fairly arbitrary, three other parallel programs are defined with a lower access rate and/or a larger shared space:

par-large: This workload is the same as par except that the shared data space is one order of magnitude larger (100 Kbyte); the access rate is kept the same. This application has a reduced chance of a collision in the shared space.

par-less: This workload is the same as par except that the frequency of accesses to the shared space is 5 times lower: one access in 50 instructions. This gives less traffic on the bus for the shared pages.


par-less-large: This is the combination of the larger space of par-large and the lower access frequency of par-less. Of the four parallel workloads, this one will generate the lowest traffic over the bus.

The jobs are executed on a single cycle processor with a 20 Mhz basic clock, 4 byte instructions, and a register usage and instruction power comparable with a state of the art 88000 processor running at 20 MIPS.

6.2.2 The buses

The bus is modelled at bus transaction level, which means that the electrical behaviour of the bus is not taken into account. The bus handles requests on a "first come first served" basis, and broadcasts transactions to all caches connected to this bus. A bus segment is parametrised with three latencies: the time needed for arbitration in the case of one requester, the time needed for arbitration with multiple requesters, and the time needed to transfer a cache line over the bus. In all simulation runs, these parameters are fixed to 110 ns, 150 ns, and 80 ns respectively.
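Written out, the timing of one bus segment is just the sum of an arbitration latency and the line transfer time; the little function below is illustrative (its name and the no-overlap assumption are mine, only the three latencies come from the text).

    def bus_occupancy_ns(contended):
        """Time one transaction occupies a bus segment: arbitration (110 ns with a
        single requester, 150 ns with multiple requesters) plus the transfer of
        one cache line (80 ns), assuming the two phases do not overlap."""
        return (150 if contended else 110) + 80

    # A crude upper bound on line transfers per second over one saturated segment:
    print(1e9 / bus_occupancy_ns(contended=True))   # about 4.3 million per second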

The bus implementation in Pearl caused a slight problem. Pearl lacked the concept of broadcasting, forcing the programmer to broadcast the data explicitly to all clients. For this reason Pearl has been extended with an asynchronous send to an array of objects. Since the addition is almost purely syntactical sugar, and is completely handled by the front end, it does not require any change in the simulation kernel.

6.2.3 The caches

The caches implement the cache consistency protocol. For each line in a cache, the state is maintained. A cache receives all bus transactions from both sides, and runs them through a state machine implementing the cache consistency protocol. The full state transition table is quite large, but in practice only a small fraction of the table is used: specifying 252 entries out of the 8620 suffices to run all simulations. The transition table usage is discussed further on page 121.

The Futurebus protocol does not dictate which cache line to replace when a set is full, nor what actions to take during the replacement. Therefore the most simple replacement algorithm is chosen: pseudo random. Although LRU or FIFO replacement performs a bit better, large errors are not expected in the results when using a random replacement algorithm.

The caches are parametrised with their associativity, the line size and the cache size. The caches respond within the bus cycle time. For simplicity of the model, all caches and busses use the same line size. Since the Futurebus cannot handle different line sizes (the definition dictates 64 byte lines), this is a valid restriction. Furthermore, only complete lines are transported. The performance deviation in the case of a copy back of a half line (if only half of a line is dirty) will be small because of the large setup time compared to the data transmission speed.
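To make the cache parameters concrete, the sketch below decomposes an address into set index and tag for a cache with 2^s sets, 2^a lines per set and 2^l byte lines (the geometry described in the footnote of Section 6.1), and replaces a pseudo-random line on a miss. It is an illustration written for this edition, not the simulator's code.

    import random

    class CacheGeometry:
        """A cache with 2**s sets, 2**a lines per set (the associativity) and
        2**l byte lines; the total size is 2**(l + a + s) bytes."""
        def __init__(self, l, a, s):
            self.l, self.a, self.s = l, a, s
            # One list of tags per set; None marks an empty (Invalid) line.
            self.sets = [[None] * (1 << a) for _ in range(1 << s)]

        def split(self, address):
            """The set is selected by address bits l..l+s-1; the line within the
            set is found associatively by comparing tags."""
            set_index = (address >> self.l) & ((1 << self.s) - 1)
            tag = address >> (self.l + self.s)
            return set_index, tag

        def lookup(self, address):
            """Return True on a hit; on a miss, replace a pseudo-random line."""
            set_index, tag = self.split(address)
            lines = self.sets[set_index]
            if tag in lines:
                return True
            lines[random.randrange(len(lines))] = tag   # pseudo-random replacement
            return False

    # A 64 Kbyte, 2-way associative cache with 64 byte lines: l=6, a=1, s=9.
    cache = CacheGeometry(l=6, a=1, s=9)
    print((1 << (6 + 1 + 9)) // 1024, "Kbyte")          # 64 Kbyte
    print(cache.lookup(0x1234), cache.lookup(0x1234))   # False (miss), True (hit)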


6.2.4 The shared memory

Because the stochastical application abstracts from the memory contents, the memory does not maintain any administration. The memory simply acknowledges all bus cycles which had not been intervened by a cache. The memory is parametrised with the response time.

6.2.5 The topology

The topology of the whole architecture is specified using the Pearl feature for defining the interconnections between objects. Figure 6.4 shows how the buses and caches are numbered throughout this and the next chapter: buses and caches are numbered according to their level, and counting starts at zero at the memory side. When referring to other buses or caches in the hierarchy relative to a specific cache, downwards is used to denote caches closer to the memory, and upwards for the ones closer to the processors. To complete the terminology, a node is a processing element with its processor cache (the cache closest to the processor); a cluster is a collection of nodes, the bus segment connecting these nodes, and the next level cache (the cluster cache). A supercluster is a set of clusters with the interconnecting bus and cache. The architecture in Figure 6.4 thus consists of 12 nodes, 4 clusters and 2 superclusters. The number of caches connected to a bus at a certain level is called the branching factor of that level: the branching factors of the architecture in Figure 6.4 are 2, 2 and 3 (for levels 0, 1 and 2 respectively). This architecture is therefore also described as a 2*2*3 topology.
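The terminology boils down to a little arithmetic over the branching factors; the helper below is purely illustrative (the name is invented), checked against the 2*2*3 topology of Figure 6.4.

    from math import prod

    def describe(branching):
        """branching[i] is the branching factor of level i, memory side first."""
        nodes = prod(branching)                    # one processor per top-level cache
        clusters = prod(branching[:-1])            # one cluster per cluster cache
        superclusters = prod(branching[:-2]) if len(branching) > 2 else None
        return nodes, clusters, superclusters

    print(describe([2, 2, 3]))   # (12, 4, 2): 12 nodes, 4 clusters, 2 superclusters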

6.3 Validation

The stochastical application model, described in Section 4.3, introduces inaccuracies. Especially the locality in data and text references is a potentially weak point. To verify the stochastical model, the results of simulation runs are compared with the results of experiments described in the literature. These published results have been derived from real world address traces, and cover important aspects of the hierarchical multiprocessor model:

- [Hennessy90, Hill87]: Simulates a trace derived from a UNIX environment on a uniprocessor system with one single level cache. The cache size and associativity are varied.

- [Bugge90]: Simulates a SINTRAN III trace on a uniprocessor system with a two-level cache hierarchy. The cache size and associativity are varied.

An aspect that is not verified is the locality in using the shared space. The parameters for the shared space are calibrated so that the write broadcast ratio is comparable to [Eggers89], but this is only a "one-point calibration". For this reason, four flavours of the parallel workload have been used in the experiments.


[Figure 6.5: Miss rates measured by [Hill87] (circles) and simulated miss rates (dots), for cache sizes from 1 to 128 Kb and miss rates between 1% and 20%; left: a direct mapped cache, right: a 2-way associative cache.]

For an effective comparison, our model resembles the models from the literature as closely as possible. A single level cache and a two level cache can be simulated easily, and the line sizes and cache sizes can be adjusted to the proper values. It is harder to find the correct workload. Both experiments are performed with a UNIX workload with 10 jobs, which most closely matches the workload used in both articles. Although UNIX is not SINTRAN III, they both are large multi-programming environments.

6.3.1 Single processor, one-level cache validation

This experiment is described in [Hennessy90, Hill87]. It is a trace driven simulation of a VAX processor with a single cache, running a multi-programming workload. The cache size and associativity are variable. The study of [Hill87] has decomposed the cache miss rate into three fractions, but here only the total miss rates are considered.

The data in Figure 6.5 shows that the stochastical simulation model compares well to the measurements of [Hill87]. The miss rates for small caches are a bit too high, and the miss rates for larger caches are a bit too low. For caches with increased associativity the miss rates show similar results, with a slightly higher deviation for large cache sizes. The deviation is neither surprising, nor alarming.

6.3.2 Single processor, two-level cache validation

This experiment considers a uniprocessor with a small fast level 1 cache, followed by a large level 0 cache. [Bugge90] reports the miss rates of the level 0 cache with various parameter settings of line size, associativity, and total cache size. The level 1 cache is fixed as a 128 Kb direct-mapped cache with 16 byte lines. To deal with the effects of cold start misses in the address trace, that paper contains three miss rates: a worst, best, and estimated case. The worst case assumes that cold start misses are indeed real misses, whereas the best case counts those misses as hits. The estimated miss rate simply ignores cold start misses (neither a miss, nor a hit).


                            [Bugge90]                   Simulation
  Size  Associativity   Worst  Estimated  Best          model
  1M    2-way            22.6     22.2    22.1           20.9
        4-way            18.6     18.2    18.1           17.9
        8-way            16.4     16.0    15.9           15.5
  2M    2-way            11.6     10.7    10.6           12.0
        4-way             9.0      8.1     8.0            9.4
        8-way             8.0      7.1     7.0            9.4
  4M    2-way             7.3      5.6     5.5            7.3
        4-way             6.0      4.2     4.2            6.2
        8-way             5.9      4.0     4.0            5.8
  8M    2-way             5.0      2.0     1.9            5.9
        4-way             4.6      1.3     1.3            5.4
        8-way             4.4      1.0     1.0            5.3

  Table 6.1: Miss rate of the level 0 cache (in %); [Bugge90] and model values


Since our memory hierarchy model is limited to equal line sizes in all levels of the hierarchy, and [Bugge90] uses a 16 byte line for the level 1 cache, only the miss rates for the cases with a 16 byte line size have been compared (note that this is different from the Futurebus standard of 64 bytes). To reduce the effects of cold start misses, the miss rates in Table 6.1 are counted over the last 30% of the synthesised address trace. The trace stems from a mix of UNIX jobs and contains approximately 4.6 * 10^7 references to the level 1 cache. Since this cache has a miss rate of 3%, the level 0 cache gets 1.4 * 10^6 accesses.

The results in Table 6.1 show that the miss rates of the stochastical simulation model are close to the figures reported in [Bugge90], although our miss rates are systematically close to Bugge's worst case. The simulated figures of the 8 Mbyte level 0 cache are inaccurate because the cache did not reach a steady state before the end of the simulation run. The 8 Mbyte figures of Bugge are also distorted by cold start effects, as can be seen from the relatively large difference between the worst and best miss rates. Our optimistic 1 Mbyte results are presumably caused by the usage of shared text segments in the UNIX mix. We observed that the miss rates, especially those for large caches, are quite sensitive to the exact configuration of the workload (number of applications, context switch rate). The simulated miss rates of the caches up to 4 Mbyte, however, show the same trend as the miss rates of Bugge: the influence of the cache size is larger than the effects of the associativity. Increasing the associativity from 2 to 4 to 8 results in a miss rate that is 20% respectively 10% lower, while doubling the cache size gives about 50% fewer misses. Because the miss rate in a large second level cache is completely dictated by long-term effects, this validation indicates that the long term behaviour of the model is similar to the long term behaviour of Bugge's application programs.


  level 0          2-way level 1 cache           4-way level 1 cache
  associativity    MIPS   missrate  bus-util     MIPS   missrate  bus-util
   4               13.4   3.2%      65%          13.6   3.0%      64%
   8               13.3   3.3%      66%          13.8   3.0%      63%
  16               13.5   3.2%      65%          13.7   3.0%      64%
  32               13.5   3.2%      65%          13.7   3.0%      63%
  64               13.5   3.1%      65%          13.7   3.0%      64%

  Table 6.2: MIPS rates, miss rates and bus utilisations for architectures where the associativity of the caches is varied.

6.4 Varying and tuning the cache parameters

The Futurebus simulator has been used to study the performance effects of different multiprocessor topologies (as presented in Section 6.5). Before these experiments were performed, reasonable parameter values for the associativity and line size of the caches were determined. First the optimal associativity of the caches with a fixed line size of 64 bytes (the Futurebus standard) is determined. Given this associativity the best line size is measured (although the Futurebus fixes the line size to 64 bytes, it is interesting to find the optimal line size).

6.4.1 Tuning the associativity

The associativity of a cache has a well-known and important effect on the miss rate; see for example [Hennessy90]. The question addressed here is what associativity should be used for lower level caches in a hierarchical architecture. A common line of reasoning is that low-level caches should be at least large enough to hold all lines of the caches upward in the hierarchy (called the inclusion property, [Baer88]). Otherwise high level caches will have to compete with neighbouring caches for space in the low-level cache, which causes the low-level caches to frequently invalidate upward copies to service requests that hit a full set. The performance of high-level caches will decrease because they have to invalidate useful lines, which results in a low hit-rate. In [Baer88] it is proved that to enforce the inclusion property, the associativity of a low-level cache memory should be at least the sum of the associativities of its upward connected caches. Then one set in the low-level cache can hold all lines in the processor caches which fall into that set. Since [Baer88] does not quantify the performance effects of obeying the inclusion property, an experiment with various associativities around the inclusion value has been performed.

The experiment uses an architecture with 2 clusters of four nodes each (a picture of it can be found in, for example, Figure 6.7a on page 115). The level 1 caches have a size of 64 Kbyte, and are 2 or 4-way associative. The line size is 64 bytes, according to the Futurebus specification. The size of the level 0 caches is fixed at 2 Mbytes, while the associativity ranges from 4 to 64. To satisfy the inclusion property, a minimal associativity of 8 respectively 16 is required (4*2, 4*4).
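The inclusion bound itself is a one-liner; the helper below (not from the thesis) just checks the two configurations used here.

    def min_inclusion_associativity(upward_associativities):
        """[Baer88]: a low-level cache needs an associativity of at least the
        sum of the associativities of the caches directly above it."""
        return sum(upward_associativities)

    print(min_inclusion_associativity([2, 2, 2, 2]))   # four 2-way level 1 caches -> 8
    print(min_inclusion_associativity([4, 4, 4, 4]))   # four 4-way level 1 caches -> 16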


[Figure 6.6: Processing power (dots, in MIPS) and miss rates (+) for line sizes between 32 and 1024 bytes, for the UNIX workload (left figure) and a parallel workload (right figure).]

The results of the simulations, in Table 6.2, show that the performance is hardly influenced by the associativity of the level 0 cache. This unexpected result is probably caused by the size of the level 0 cache: the large number of sets effectively increases the associativity of the level 0 cache. Possibly colliding lines between different level 1 caches usually fall into different level 0 sets because of the spatial locality in the applications. Higher associativity does decrease the miss rate of the level 0 cache (as observed in Section 6.3.2, and in many other studies [Przybylski90]) because colliding direct mapped lines can coexist in an associative cache, but the overall performance is dictated by the miss rates of the level 1 caches. Experiments with other cache sizes showed that an associativity violating the inclusion property might indeed influence the performance because of upward invalidations, so in the forthcoming experiments the lowest associativity satisfying the inclusion property has been used. Using a higher associativity does not give a significant performance improvement.

6.4.2 Varying the line size

To find the "optimal" line size, the same architecture as in the previous experiment has been simulated (2 clusters of four nodes) with varying line sizes. The associativity of the level 1 caches is set to 2, and the level 0 caches have an associativity of 8. The line size is varied between 32 and 1024 bytes. The results for the UNIX and par-workload are shown in Figure 6.6.

The data in Figure 6.6 shows that the line size has different effects on the system performance and the level 0 cache miss rate. For a UNIX mix of programs, the optimal performance is reached with a 128 byte line, whereas a 512 byte line yields the lowest miss rate. Apparently, the decreased miss rate does not outweigh the increased miss penalty. The parallel application results are similar to the UNIX mix, only the optimal performance is reached with a smaller line size of 64 bytes. This optimum depends on the way the shared data is used. As stated in [Eggers89],
applications can be specifically coded for a certain line size. The 64 byte line size of the Futurebus seems a reasonably good choice.

6.5 Varying the topology

The possibility to build a leveled architecture gives an extra degree of freedom to the architecture designer. A 12 node architecture, for example, can be designed in 8 different ways (a 12 node flat topology, two level topologies with 2*6, 3*4, 4*3 and 6*2 nodes, and three level topologies with 2*2*3, 2*3*2 and 3*2*2 nodes). It is not clear beforehand what topology will give the best performance. On the one hand, adding a level to the hierarchy of busses increases the performance because of a higher traffic locality and the resulting decrease of the bus contention. A substantial amount of bus traffic will be between neighbouring caches on one bus, and does not slow down the whole system. On the other hand, adding levels in the hierarchy increases the miss penalty; a cache miss causes more levels in the hierarchy to be traversed, with overhead at each level. Another disadvantage of more levels is the decrease of sharing, which results in less effective usage of the cache. So using more bus levels does not necessarily imply a higher performance. There will be an optimal number of levels, with a balance between the increasing miss penalty, decreasing traffic, and decreased sharing, although we expect the last factor to be less important.
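The count of eight designs can be reproduced by enumerating all ordered factorisations of the node count into branching factors of at least two (a single factor being the flat bus). The function below is illustrative only, not part of the thesis tooling.

    def topologies(n, min_branch=2):
        """All regular bus topologies for n nodes: ordered lists of branching
        factors, each at least min_branch; a single factor is a flat bus."""
        result = [[n]] if n >= min_branch else []
        for f in range(min_branch, n // min_branch + 1):
            if n % f == 0:
                result += [[f] + rest for rest in topologies(n // f, min_branch)]
        return result

    print(len(topologies(12)))   # 8, as in the text
    print(topologies(12))        # [[12], [2, 6], [2, 2, 3], [2, 3, 2], [3, 4], ...]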

To find the best performing topology for each number of processors, all possible homogeneous topologies with one, two, three and four levels of busses are exhaustively simulated, with the five workloads defined in Section 6.2.1. Because the costs of the architecture should stay roughly equal, the total size of all caches in the architecture is kept constant: 1 Mbyte per processor (because it is hard to quantify the costs of an extra bus level, these costs are ignored). From the performance figures conclusions can be derived about the optimal number of levels and nodes for classes of applications.

The simplest multi-processor configuration is the shared bus, a one-level hierarchy. All processors have a cache of equal size, and are connected to a bus, to which the shared memory is also connected. Next comes the two-level hierarchy. For a system with 8 nodes there are two possible regular bus topologies, sketched in Figure 6.7. The left hierarchy has two clusters of four nodes, while the right hierarchy has four clusters with two nodes each. With the simulator, the performance of both hierarchies is measured; the highest value of the two is the best performance which can be obtained from a two-level architecture with 8 nodes. Note that only regular (homogeneous) architectures are considered: each processor in the architecture should have the same view on the rest of the system. An example of a non regular architecture consists of a cluster of two nodes, a cluster of five nodes, and a single processor. Architectures with bus segments of width one are not simulated either: two-level architectures with 8 clusters of 1 node, or 1 cluster of 8 nodes, are skipped since they will always be outperformed by a single level architecture with 8 nodes (because of the increased miss penalty).

The case for three-level architectures is similar to the two-level architectures.


[Figure 6.7: The two 2-level hierarchies with 8 nodes; (a) 2 clusters of four nodes each, (b) 4 clusters of two nodes each.]

The smallest three level machine has 8 nodes (2*2*2). The four-level architectures start at 16 nodes. In total 135 topologies have been simulated, for all five workloads.

The associativity of the processor caches is set to 2, while all lower level caches have the lowest associativity not violating the inclusion property. The total cache memory for each architecture is fixed to 1 Mbyte per processor. Since the size of the cache should be a power of 2, this leads to only a few possible candidates for the cache sizes. In the architectures sketched in Figure 6.7, the cache sizes boil down to 2M level 0 and 512K level 1 caches in Figure 6.7a, and 1M level 0 and 512K level 1 caches in the right hand architecture. Note that these topologies have exactly 1 Mbyte cache per processor. For the topology with 14 processors consisting of 2 clusters with 7 processors, the cache size is only 805 Kbyte per processor. Since it is hard to correct the performance figures for this slight deviation, and since the deviations were never more than 200K, these errors were not taken into account: the effects of bus saturation are much larger than the effects caused by slightly fewer cache bytes. Running some of the experiments with larger caches showed that the differences are not significant.

6.5.1 Measuring with a UNIX workload

For 1, 2, 3 and 4-level hierarchies, the following simulations are performed: for each number of nodes n (0 < n ≤ 32), all topologies with exactly n nodes have been evaluated. Note that some architectures do not exist; for prime n, no 2-, 3-, or 4-level architecture exists; 25 cannot be factorised into three factors greater than one, and so on. We evaluate the architectures according to their speed, measured in MIPS, millions of instructions per second. Then, for each n, the best performing topology is selected, and the corresponding speed is plotted in the performance graph. As an example, the MIPS rates of all two-level architectures with 24 nodes are depicted in the table below:

  Nodes          24
  Cluster size    2    3    4    6    8   12
  # Clusters     12    8    6    4    3    2
  MIPS rate     127  158  178  194  204  178
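The selection of the best topology per node count then amounts to taking the maximum over such a table; a trivial illustration with the 24-node figures above (the dictionary encoding is mine):

    # MIPS rates of all two-level topologies with 24 nodes, keyed by
    # (# clusters, cluster size), taken from the table above.
    two_level_24 = {(12, 2): 127, (8, 3): 158, (6, 4): 178,
                    (4, 6): 194, (3, 8): 204, (2, 12): 178}

    best = max(two_level_24, key=two_level_24.get)
    print(best, two_level_24[best])   # (3, 8) 204: three clusters of 8 processors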


The column with 3 clusters of 8 processors performs best: a two level 24 node architecture has a maximal performance of 204 MIPS, which is the point plotted in Figure 6.8. However, Table 6.3 demonstrates that an architecture with 21 nodes performs even better. In these tables the branching factors of the best performing topologies for the two and three level architectures are given, together with the performance. The entries marked with an asterisk denote hierarchies with a lower performance than an optimal hierarchy with fewer processors.

The best 20 processor system with 2 levels has a lower performance than the best eighteen processor system with two levels. Observe that none of the optimal two-level architectures has more than three clusters, while few actually have three clusters; the memory buses of the architectures are saturated with two clusters. For the three-level hierarchies, there are only five optimal architectures, all having 2 superclusters of (almost always) 2 clusters each. There is a clear preference for architectures with few (2, at most 3) clusters on the memory bus, with many (6 or 7) level 1 caches on the level 1 bus.

The performance graphs of the architectures are presented in Figure 6.8. The speed of all architectures is presented in two ways: by the total number of instructions executed per second, and normalised, by dividing it by the number of processors. The last form is useful to observe the effective use of the processors; the first form gives an indication of how fast the architecture actually is.

The lower graph is the normalised performance, for one-, two-, three- and four-level hierarchies. One can see in Figure 6.8b that at nine processors, the bus of a flat single level architecture is saturated and the performance declines. The upper graph (Figure 6.8a) shows this saturation effect clearly: the total performance stabilises, and declines slowly. From the same figure one can see that a two-level hierarchy has maximal performance with 21 processors, a three-level hierarchy with 20 processors, and the four-level hierarchy with 24 processors.

  # Nodes        4    6    8    9   10   12   14   15   16   18    20   21    22    24    25    26    27    28    30    32
  Cluster size   2    3    4    3    5    6    7    5    8    6     5    7    11     8     5    13     9     7    10     8
  # Clusters     2    2    2    3    2    2    2    3    2    3     4    3     2     3     5     2     3     4     3     4
  MIPS rate     63   91  120  128  144  166  179  189  190  207  204*  216  184*  204*  187*  171*  207*  190*  189*  176*

  All two level hierarchies

  # Nodes             8   12   16   18   20    24    27    28    30    32
  Cluster size        2    3    4    3    5     6     3     7     5     8
  SuperCluster size   2    2    2    3    2     2     3     2     3     2
  # SuperClusters     2    2    2    2    2     2     3     2     2     2
  MIPS rate         105  153  188  203  222  218*  202*  218*  216*  208*

  All three level hierarchies

  Table 6.3: Maximal performances and optimal topologies under the UNIX workload.


[Figure 6.8: The performance of the architectures under a UNIX workload, for 1-, 2-, 3- and 4-level hierarchies with up to 32 processors; (a) the total performance in MIPS (# processors * normalised performance), (b) the normalised performance in MIPS per processor.]


Interesting points in the graphs are the places where the lines of the one- and two-level hierarchy cross (at 8 processors) and where the lines of the two- and three-level hierarchies cross (at 18). Those are the places where an extra level in the hierarchy outperforms a flatter hierarchy. This can be seen most dramatically in Figure 6.8a: the two-level hierarchy starts with a slightly lower performance than single level bus systems, but it scales well up to 16 processors. The three-level hierarchy scales two nodes further, but does not give a substantial improvement. Four levels of hierarchy do not pay off for systems with a UNIX-like workload.


  # Nodes        4    6    8    9   10   12   14   15   16   18   20   21   22   24   25   26   27   28   30   32
  Cluster size   2    3    4    3    5    4    7    5    4    6    5    7   11    8    5   13    9    7   10    8
  # Clusters     2    2    2    3    2    3    2    3    4    3    4    3    2    3    5    2    3    4    3    4
  MIPS rate     40   51   57   58   58   64   58   67   62   72   66   75   61   76   67   62   78   72   81   73

  All two level hierarchies

  # Nodes             8   12   16   18   20   24   27   28   30   32
  Cluster size        2    3    4    3    5    4    3    7    5    8
  SuperCluster size   2    2    2    3    2    3    3    2    3    2
  # SuperClusters     2    2    2    2    2    2    3    2    2    2
  MIPS rate          52   66   76   75   84   88   81   95   96   96

  All three level hierarchies

  Table 6.4: Maximal performances and optimal topologies under the parallel workload.

There is an interesting anomaly in Figure 6.8. The graphs of the two- and three-level hierarchies cross a second time, at 27 nodes: a two level hierarchy with 27 nodes performs better than a three level hierarchy. This is caused by the fact that 27 can only be factorised in three levels as 3 * 3 * 3 nodes, which is unlucky (because of the preference for a few large clusters).

6.5.2 Measuring with a parallel workload

The branching factors of the optimal architectures for the par-workload, for all two and three-level hierarchies, are listed in Table 6.4.

Still, the optimal two-level architectures never have more than 3 clusters. As opposed to the UNIX workload, the architectures with 3 clusters now outperform the architectures with 2 clusters. This is demonstrated in the architectures with 12 processors: for the par-workload, a system with 3 clusters of 4 nodes is optimal, while the UNIX applications give preference to a system with 2 clusters of 6 nodes (see the first table of Section 6.5.1). Another interesting case is the architecture with 16 processors: for the UNIX system two clusters of 8 processors are optimal, while in the par-workload both 2*8 and 4*4 perform worse than a 3*5 architecture with only 15 nodes.

The table with branching factors for the par-less application is not shown; the values in it lie between those of the architectures for the UNIX workload and the par-workload.

The performance graphs of the par- and par-less-workloads are given in Figure 6.9. Only the figures with the total performance of the system are shown, not the performance per processor. Note that the vertical scale of the graphs differs. Because of the higher miss rates of the caches, and the extra bus transfers due to the sharing of some data, the performance of the par loaded systems is lower than the performance of the par-less loaded systems.


[Figure 6.9: The total performance in MIPS of the architectures under a parallel workload, for 1-, 2-, 3- and 4-level hierarchies with up to 32 processors; the top figure is the par-workload, the bottom figure is the par-less-workload.]


The basic form of the graphs is the same, and also comparable to the graph of the UNIX workload: up to a certain number of processors the performance is increasing, until the busses of the system are saturated. For the par-workload, a four-level system does not give a substantial improvement over a three-level system in the range of processors that was monitored. In contrast with the UNIX workload, three- and two-level hierarchies clearly outperform the flat hierarchy. This is explained by the fact that the parallel programs suffered more severely from bus contention than the UNIX programs, hence the trade off between the increased miss penalty and the decreased bus contention (discussed at the start of

The figures for the par-large and par-less-large workloads (not shown) are similar to the figures of the par- and par-less-workloads, respectively. This implies that enlarging the shared space by an order of magnitude does not have a big performance effect (a few per cent). The reason behind this remarkable fact seems to be that there are three influences with opposing effects on the performance. The locality of the shared accesses is reduced, so the effect of caching is smaller; this has a negative influence on the performance. Another negative performance effect is caused by the reduced sharing between caches due to the lower locality. Apparently this is cancelled by the positive performance effect caused by the reduction of traffic due to fewer invalidation requests. Further studies have to quantify these effects.

6.6 Discussion and conclusions

The simulation results show that a line size of 64 bytes, as chosen by the Futurebus, gives good performance for both the UNIX and parallel applications. Higher associativity of (large) caches down in the hierarchy (closer to memory) will improve their hit rate, but it only has an effect on the overall performance when the lower busses are saturated. In general the size of these low-level caches is more important than the associativity, as is also the case in caches of single processor systems. The associativity of caches deeper in the hierarchy should not be lower than the degrees of associativity of the caches higher in the hierarchy, because that might cause serious performance loss.

For an increasing number of nodes, more levels in the hierarchy are needed to make efficient use of the processing power. The number of nodes at which more levels in the hierarchy become preferable strongly depends on the behaviour of the application. A typical coarse grained UNIX application, which does not communicate via shared memory, runs well on a single level hierarchy with 1-7 nodes. For 8-18 nodes, a two-level bus hierarchy is favourable, while a three-level bus hierarchy works better for 20 nodes. Systems with more than 20 nodes have a lower performance, because of bus saturation at all levels. Note that all these statements are only valid for the used combination of technological parameters.

For the parallel applications, from one up to 4 nodes the single level hierarchy performs best, for 6 up to 10 nodes a two-level hierarchy wins, for 12 up to 28 nodes a three-level hierarchy performs best, while at 32 nodes a four-level hierarchy wins. More than 32 nodes might be useful for these applications, but these architectures were not simulated. The parallel applications communicate so heavily that the performance of a 16 node system is only one quarter of the maximum performance (Figure 6.9; 16 nodes run at 80 MIPS instead of the 16 × 20 = 320 MIPS that could be reached maximally).

For the UNIX application it came out that all optimally performing architectures have long busses close to the processor (many caches and processors), while the busses close to the memory should be short (only two caches). This is the case for two, three and four level architectures. For the parallel applications, the memory bus may sometimes have three caches, as can be seen in the first table of Section 6.5.2.


This observation can be explained by the cache hit rates at the various levels of the system. The level 1 caches have a high hit rate (around 95%), so many caches can cooperate on one bus. The level 0 caches have lower hit rates (60%), so fewer of them can be placed on a single bus. See the next chapter for a more detailed analysis.

Both the parallel and the UNIX workload have a limit on the performance gain, the maximal attainable parallelism. This limit is not caused by a lack of active tasks; the workload has enough tasks. The performance is limited by the communication requirements of the tasks: in Figure 6.8, the limit on the speedup for the UNIX workload is around a factor 11 (220 MIPS) for the given set of technological parameters. These communication requirements are inherent to the fact that the Futurebus architectures, although shared memory, are in some sense distributed memory architectures, connected by a tree-like network. The difference with real distributed memory architectures is that the nodes communicate implicitly (caused by a cache miss), while distributed memory machines have explicit communication.

The points where the various (n-level) architectures saturate, and the level at which the optimal performance over all levels is reached, depend on the workload and the architectural parameters, which is illustrated by the difference between the graphs in Figure 6.8 and Figure 6.9. The amount of sharing between tasks, the relative speed of the processors, caches and busses, and the total size of the caches influence the scaling of the graphs and the question of how many levels in the hierarchy are to be used for optimal performance.

The graphs from our simulations have some similarities with the graphs given in [Vernon89], even though a radically different method is used. We use simulation, while [Vernon89] developed an analytical model of regular hierarchical cache architectures: given a set of detailed parameters, such as the cache hit rate and the degree of sharing at the various levels, this model calculates the expected performance. The most notable similarities between the results are (of course) the saturation effects, but also the preference for long level 1 busses. But since the total cache sizes are kept constant during our experiments, our figures for higher level architectures show a lower performance than those reported in [Vernon89]. In the next chapter it is shown how a less general, but much simpler, performance model for regular hierarchical cache architectures is constructed. It "predicts" the optimums in the performance graphs (the places where the performance stabilises), given the simulation results of some small architectures. This performance model is heavily based on the simulation results, since the simulator shows which effects can be neglected, and which parameters are significant.

During the simulation runs we did not only measure performance parameters like the miss rates and bus utilisation, but also the usage of the Futurebus cache coherency protocol. The protocol has been implemented as a state machine, and the measurements show that 90% of the transitions are executed in just 17 states. As is to be expected, these states correspond to cache hits on reads/writes and snooping hits. Only 38 states cover 99%, while the remaining 1% requires another 214 states. These 252 states, out of a total of 8620, sufficed to run all simulation experiments described in this paper, which issued a total of 3·10^9 transactions. Unfortunately we had already specified over 300 states before trying to run the simulator. The definition of the bus contains a lot of small optimisations to save some bus cycles.


Many of these optimisations were rarely executed, or never at all. We expect that only systems with rapidly changing shared data will exercise these optimisations. We therefore question the value of these optimisations, since they do increase the complexity without a clear performance benefit.


Chapter 7

A Futurebus performance model

In the previous chapter the performance figures of all regular hierarchical cache architectures based on the Futurebus were presented. The performance figures (depicted in Figures 6.8 and 6.9 on pages 117 and 119) were derived by simulating the architectures under a stochastic load. In this chapter a performance model is presented that predicts the performance of the various topologies, without the need for an exhaustive simulation of all topologies. Based on the performance figures of a few topologies, the parameters of the model can be derived, which allows the performance model to calculate the performance figures of the other topologies.

[Vernon89] developed a model for hierarchical cache architectures that are structured in a regular tree. Given the miss rates at the various levels of the hierarchy, the timings of the various transactions of the hierarchy, and the branching factors of the hierarchy, the model described in [Vernon89] is capable of predicting the performance of the architecture based on the calculated bus contentions. A weak point of the model is the level of detail of the parameter specification: the hit rates of the caches at the various levels of the architecture are an input to the model. To alleviate this, the miss rate specification is split into several independent components. Unfortunately, the relation between these components and the topology of the architecture is not clear (as is shown later on, there is a relation between the miss rate and the topology).

[Jog90] studies a flat system with caches based on the model of [Vernon89], but has extended it with other features, like the inclusion of I/O effects. [Jog90] applied the model to the flat architecture after measuring many parameters on a real architecture (with hardware analysers) and after measuring some others with the help of a simulator.

In [Agarwal89] a performance model is presented that accurately predicts the miss rate of a cache in a single processor architecture (the step from miss rates to performance is straightforward, see for example [Hennessy90]). The model needs a description of the cache (the associativity, size and so on), and a specification of the application behaviour (in terms of the number of uniquely referenced blocks, and the access rate) to calculate the miss rate.

It would be nice if the models of [Vernon89] and [Agarwal89] could be combined, so that a performance figure is predicted based on the parameters of the architecture, the cache, and the application.


n    The number of processors.

k    The number of levels in the hierarchy. The busses and caches count from 0 (closest to the memory) up to k−1. Bus k is the connection between the processor and its cache. Cache "−1" is the memory, a "cache" with a hit rate of 100%.

bi   The branching factors: the number of caches on each level of the bus. The branching factor at the processor level, bk, equals 1; \prod_{i=0}^{k-1} b_i = n. The topology of an architecture is sometimes denoted by b0*b1*b2*..., the first number thus denotes the branching factor of the memory bus.

ni   The number of processors above a single level i bus. nk = 1, n0 = n, n_i = \prod_{j=i}^{k-1} b_j.

Bi   The time (in seconds) needed for the bus at level i to complete one transaction. This includes the arbitration overhead and the time needed to send data. Since multiple transactions can be completed within one bus cycle, an average arbitration time is taken. Bk defines the time needed for the first level cache to answer requests of the processor. The processor's cycle time equals Bk. This is a valid assumption, since a faster processor will always need cache stalls anyway, and a slower processor is running in an overkill environment.

m    The miss rate of a cache, 0 ≤ m ≤ 1. A subscripted m, mi, refers to the miss rate of the level i cache. m_{−1}, the miss rate of the memory, equals 0 (by definition).

h    The hit rate of a cache, 0 ≤ h ≤ 1. hi is the hit rate of the level i cache. Note that hi + mi = 1, and that h_{−1} equals 1.

d    The average number of data accesses per instruction access. An application dependent parameter.

P    Denotes the relative performance, defined as the performance of a multiprocessor architecture divided by the performance of a single processor architecture. The performance is measured as the instruction rate of the processor, where it is assumed that the instruction rate is 1/Bk in the optimal case. Note that this speedup does not take any changes in the application program into account: it is assumed that the same application is run on 1- or 13-processor architectures.

[Figure: an example 2*4 architecture; eight processors (P), each with a level 1 cache (C), form two clusters of four, and two level 0 caches connect the clusters to the memory. Example parameters: n = 8, k = 2, b2 = 1, b1 = 4, b0 = 2, d = 0.3, B2 = 50·10^-9, B1 = 170·10^-9, B0 = 170·10^-9, m1 = 0.02, m0 = 0.50, m_{−1} = 0.]

Figure 7.1: The definition of the parameters and an example 2*4 architecture.


This combination is not feasible because of the complexity of the models, and because the model of [Agarwal89] cannot be applied directly in a multiprocessor environment.

The model presented in this chapter takes both the miss rate and the bus contention into account, but it uses a (drastically) simplified approach. The miss rate functions of this model have three parameters that must be determined experimentally (these parameters depend on the cache size and associativity, and on the application, but they are essentially independent of the structure of the architecture), while the bus contention model takes the miss rates and the bus speeds at the various levels as input parameters. The model itself is simpler than the others (and less accurate), but it incorporates the consequences of the topology for the cache miss rates.

Like any other performance model, our model relies heavily on statistical assumptions about the architecture and the application. Throughout the whole chapter, it is assumed that the application reacts monotonically to changes in the architecture. The application should scale well with the cache parameters (such as size and associativity), without discontinuities or anomalies, and the application program should exhibit an unlimited amount of parallelism, so that more processing nodes in the architecture give a linear speedup.

In the rest of this chapter the performance model for hierarchical cache architectures is constructed piecewise. The abbreviations and parameters are defined in Figure 7.1, together with an example architecture. A model for single level bus systems is constructed in Section 7.1. A model for hierarchical architectures with an arbitrary number of levels is presented in Section 7.2. This model abstracts even further from reality, which has consequences for the predictive value of the model. Despite this, the model gives promising results.

7.1 The performance model of a flat system

As a first stage, a performance model of a flat system (k = 1) is constructed. In a flat system, transactions cannot split, so the transactions are always answered directly by either the memory or a neighbouring cache. Additionally, a read modified transaction can lead to an invalidate of the data in another cache. The rest of this section shows a stepwise construction of the performance model, followed by a comparison with the simulation results of the previous chapter.

7.1.1 The performance of cache architectures

The average time needed for an access to the cache is given in [Hennessy90]. On a hit, the cache provides the data after a delay of the cache cycle time. On a cache miss, the cache provides the data only after querying the shared memory. The delay of the level below the cache is commonly known as the "miss penalty". The average time needed to access the cache can be defined by:

T_{\mathrm{average}} = h \cdot \text{cache cycle time} + m \cdot \text{miss penalty} \qquad (7.1)

Page 133: Simulating Computer Architectures - Semantic Scholar · Simulating Computer Architectures Nice picture is missing! HenkMuller. i Simulating Computer Architectures ... Marnix Vlot,

126 Chapter 7. A Futurebus performance model

For the model it is assumed that the processor and cache are closely coupled: the cache can supply an instruction and data to the processor within the same clock cycle. When data or instruction accesses miss, they are handled sequentially by the shared bus below. The cycle time of the cache is therefore B1 (the processor clock), and the miss penalty accounts for both the data and instruction accesses of the processor, 1 + d accesses on average. The hit rate equals 1 − m, so the average time for a processor in our model to access the cache equals:

T = (1 - m)B_1 + m(d + 1) \cdot \text{miss penalty} \qquad (7.2)

Since speed is the reciprocal of time, an n processor system will thus run at a speed of n/T instructions per second. The optimal speed of a single processor system is 1/B1; dividing these two gives the relative performance of an n processor architecture:

P = \frac{n B_1}{(1 - m)B_1 + m(d + 1) \cdot \text{miss penalty}} \qquad (7.3)

Note that this model of the relative performance does not include possible loss due to software contention; it only takes the caching effects into account.

In a multiprocessor system, the miss penalty depends on the bus contention. If there are no collisions on the shared bus, the miss penalty equals B0, the time needed to access the bus. But in the case of a conflict on the bus, the processor waits for the other processors. This results in a miss penalty that equals QB0 + B0, where Q equals the average number of processors waiting for the bus when a processor tries to access the bus. The relative performance is therefore given by:

P = \frac{n B_1}{(1 - m)B_1 + m(d + 1) B_0 (Q + 1)} \qquad (7.4)

Two of the parameters in this formula, Q and m, depend on the number of processors: Q because more processors will result in a longer queue of waiting processors, and m because processors may compete for the shared data. Q and m are estimated in the following two subsections.

7.1.2 Modelling the average number of waiting processors

Q, the average number of processors waiting for the bus when a processor tries to access the bus, is modelled with queueing theory. The queueing model of a flat bus architecture is straightforward: the only relevant quantity is the number of processors waiting for the bus to become free. The states of the model are denoted by Pi, where i is the number of processors waiting for the bus. P0 denotes the state that the bus is idle, Pn denotes the state that all processors want to access the bus, so all processors are idling (note that the system can be in the state Pn, but that a processor will never find the system in this state when accessing the bus, since all processors are idle). The Markov chain describing this model is:


[Markov chain: P0 --nλ--> P1 --(n−1)λ--> P2 --(n−2)λ--> P3 --> ... --> P_{n−1} --λ--> Pn, where every backward transition from P_{i+1} to P_i is taken at the service rate μ.]

μ is the service rate of the bus, which equals 1/B0; λ is the access rate of a single processor. The processor speed is 1/B1, but since only a fraction m of the accesses is passed on to the memory bus, λ equals m/B1. In contrast with an ordinary M/M/1 model [Jain91], the access rates decrease along the chain, from nλ to 0, because a processor that has posted a request has to wait for the answer before it can post the next request. The system can be solved to calculate pi, the probability that the system is in state Pi:

p_i = \begin{cases}
  p_n \left(\dfrac{\mu}{\lambda}\right)^{n-i} \dfrac{1}{(n-i)!} & 0 \le i < n \\[2ex]
  1 \Bigg/ \displaystyle\sum_{j=0}^{n} \left(\frac{\mu}{\lambda}\right)^{n-j} \frac{1}{(n-j)!} & i = n
\end{cases} \qquad (7.5)

The solution cannot be presented in a closed form because of the decreasing access rates. The average number of processors waiting when a processor tries to access the shared bus, Q, is now calculated by weighting the probabilities with the queue lengths for all states Pi, 0 ≤ i < n (note that n is excluded because no processor can perform an access in state Pn).

Q = \frac{0 p_0 + 1 p_1 + 2 p_2 + \cdots + (n-1) p_{n-1}}{p_0 + p_1 + p_2 + \cdots + p_{n-1}} = \frac{\sum_{i=0}^{n-1} i\, p_i}{1 - p_n} \qquad (7.6)
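For concreteness, equations 7.5 and 7.6 can be computed directly. The following Python sketch is purely illustrative (the function name and interface are ours, not part of Oyster); it takes n, λ and μ and returns Q:

```python
from math import factorial

def waiting_processors(n, lam, mu):
    """Average number of waiting processors Q (equation 7.6).

    n   : number of processors
    lam : access rate of a single processor (m / B1)
    mu  : service rate of the bus (1 / B0)
    """
    # Unnormalised state probabilities p_i ~ (mu/lam)^(n-i) / (n-i)!  (equation 7.5)
    weights = [(mu / lam) ** (n - i) / factorial(n - i) for i in range(n + 1)]
    total = sum(weights)
    p = [w / total for w in weights]
    # Weight the queue lengths with the probabilities, excluding state P_n
    return sum(i * p[i] for i in range(n)) / (1.0 - p[n])
```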

7.1.3 Modelling the miss rate (m)

The miss rate depends on the number of processors because of the presence of shared data. Data that is shared between two or more processors is invalidated in all neighbouring caches when the data is updated in one cache; this invalidate leads to a cache miss if one of the other processors tries to access the (now invalidated) data. Only writable shared data can be invalidated for this reason; private data or read only shared data (such as the program instructions) is never invalidated. The miss rate is defined as follows (H and M denote the absolute numbers of hits and misses, a subscript w indicates the writable shared data, a subscript p indicates private or read only data):

m = 1 - \frac{H_p + H_w c_n}{H_p + M_p + H_w + M_w} \qquad (7.7)

cn is a parameter related to the fraction of data invalidated due to conflicts on the writable shared data: cn = 0 implies that all accesses to writable shared data will find the data invalidated; 1 − cn equals the fraction of the accesses to the writable shared data that find the data invalidated because one of the other processors in the system uses the data.


The denominator of equation 7.7 does not depend on the sharing or the number of processors: the sum of hits and misses is constant. The definition of m can be rewritten to:

m = m_p - h_w c_n \qquad (7.8)

where mp is the maximal miss rate (when all accesses to writable shared data miss, cn = 0), and hw is the maximal hit rate on the writable shared data. If writable shared data is never invalidated by neighbouring caches, cn equals 1, and the miss rate equals mp − hw. (Note that we assumed that Hp and Hw are independent of cn, which is not completely true: invalidates on the writable shared data free space for extra private data, since the data is stored in the same cache, hence a higher Hp; an effect that is ignored in the sequel.)

The dependency of cn on n can be estimated by considering the distribution of the writable shared data over the caches of the architecture. Assume that a fraction u of the writable shared data is in the local cache when running on a single processor; because of the limited cache size, not all data can be cached. In a two processor system, a subset of the writable shared data causes conflicts between the two processors. Under the assumption that both processors operate independently, the conflict involves a fraction u^2 of the writable shared data. The total fraction of data stored in both caches is thus 2u − u^2. Since it is divided over two caches, both caches store a fraction (2u − u^2)/2 of the shared data. In comparison with a single processor system, the processors lose a factor ((2u − u^2)/2)/u = 1 − u/2. Generalising this formula to n processors gives a definition of cn of:

c_n = \frac{1 - (1 - u)^n}{n u} \qquad (7.9)

The miss rate is thus defined by:

m = m_p - h_w \frac{1 - (1 - u)^n}{n u} \qquad (7.10)

where mp and hw are the application dependent parameters described before, and u defines the fraction of the shared data that is stored in the cache.

Applications that use shared data infrequently will have a small u. When u ≪ 1/n, the miss rate can be approximated by mp − hw + hw u(n − 1)/2. For this class of applications the miss rate depends linearly on the number of processors. This linear behaviour can be recognised clearly in Figure 7.2, which shows the miss rates of the UNIX workload. The solid lines are the performance figures coming from the simulator (of Chapter 6), the dashed lines are the outcomes of the model of equation 7.10, with the parameters fitted with the Levenberg-Marquardt algorithm [Press86]. Figure 7.3 shows the curves for the parallel application. Both fits have a low error: χ² = 1·10^-6 and 7·10^-6 respectively; the model is quite close.
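As an illustration, the miss rate model of equation 7.10 and the fitting step can be sketched in Python. The measurement arrays below are hypothetical stand-ins for simulator output, and SciPy's curve_fit is used here as one readily available implementation of the Levenberg-Marquardt algorithm (this is our sketch, not the tooling used for the thesis):

```python
import numpy as np
from scipy.optimize import curve_fit

def miss_rate(n, m_p, h_w, u):
    """Miss rate of the processor caches as a function of the
    number of processors n (equation 7.10)."""
    return m_p - h_w * (1.0 - (1.0 - u) ** n) / (n * u)

# Hypothetical measurements: (number of processors, simulated miss rate)
n_obs = np.array([1, 2, 4, 8, 16, 32])
m_obs = np.array([0.017, 0.018, 0.019, 0.021, 0.025, 0.031])

# Without bounds, curve_fit defaults to the Levenberg-Marquardt algorithm
(m_p, h_w, u), _ = curve_fit(miss_rate, n_obs, m_obs, p0=[0.03, 0.01, 0.1])
```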

7.1.4 Validation of the complete model with experimental data

Above, the miss rate, the queue length, and the general performance model for a flat bus architecture are given.


[Figure: miss rate (0.016-0.032) versus the number of processors (5-30); measured and modelled miss rate curves.]

Figure 7.2: The measured and fitted miss rates for the UNIX workload.

[Figure: miss rate (0-0.09) versus the number of processors (5-30); measured and modelled miss rate curves.]

Figure 7.3: The measured and fitted miss rates for the parallel workload.

Putting them together (with the fitted parameters of the UNIX and the parallel workloads) results in the performance graphs depicted in Figure 7.4. The solid lines are the results as measured by the simulator, the dashed lines are the results of the performance model. The model fits the experimental data for the UNIX workload well (χ² = 8·10^-3).
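A minimal sketch of this combination, reusing the waiting_processors and miss_rate helpers sketched earlier (the function name and parameter list are ours):

```python
def relative_performance_flat(n, B0, B1, d, m_p, h_w, u):
    """Relative performance P of an n-processor flat bus system (equation 7.4),
    combining the miss rate model (7.10) and the queueing model (7.5/7.6)."""
    m = miss_rate(n, m_p, h_w, u)                        # equation 7.10
    Q = waiting_processors(n, lam=m / B1, mu=1.0 / B0)   # equation 7.6
    miss_penalty = B0 * (Q + 1.0)                        # wait Q slots, then use the bus
    return n * B1 / ((1.0 - m) * B1 + m * (d + 1.0) * miss_penalty)
```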

The results for the parallel workload fit less well (χ² = 2·10^-2). The knee of the graph is at the right point, but the heights of both the knee and the tail of the graph are less correct. This may be explained by two deviations between the model and the simulator. Firstly, the value for the bus speed is assumed to be independent of n. This is not the case, because the simulator implements an optimisation by which arbitration on an idle bus is faster than on a loaded bus.


[Figure: two graphs of relative performance (scales 0-7 and 0-2.5) versus the number of processors (0-30), comparing the simulation results with the flat bus model.]

Figure 7.4: The modelled performance of a flat system for the UNIX and parallel applications; the solid lines are the values of the simulator, the dashed lines are the modelled performance.

Secondly, the use of queueing theory assumes an exponential distribution of λ and μ. This is not the case: λ can only take a few discrete values (because of the clock speed of the processor), while μ is determined by the time needed for a single bus arbitration and memory answer, or the time for an arbitration and a cache intervention. We verified this explanation by rewriting the simulator to generate exponentially distributed accesses for λ and μ and with a constant arbitration time; the performance figures coming from that simulator are much closer to the performance model.

7.2 Hierarchical architectures

The behaviour of hierarchical architectures is essentially different from the behaviour of flat architectures because of the presence of split transactions. A transaction that cannot be answered immediately is split, and the response is sent back later (see Section 6.1.3 on page 104).


As a consequence, the bus is addressed twice for these transactions, leading to both a higher bus contention and a larger miss penalty, because the transaction is arbitrated twice.

For a single bus, the number of processors waiting for the bus is determined with the help of queueing theory, as demonstrated in the previous section. For multiple busses, it is more difficult to use queueing theory, because the request rates λ0..λ_{k−1} on the various busses depend on the queue lengths of the surrounding busses: λ0 depends on the load of bus 1, while λ1 depends on the speed of bus 0. In general, Qi depends on λi, λi depends on Q_{i+1} and Q_{i−1}, so Q_{i+1} and Q_{i−1} depend on λ_{i+1} and λ_{i−1} respectively, which in turn depend on Qi. This cyclic dependency between the various queueing models is hard to resolve [Jain91]. For this reason, another approach is used for multiple level hierarchies. The model is based on the assumption that the speed of the architecture is dictated by the speed of the most heavily loaded bus. To find this bus, an approximation is made of the number of bus transactions that can be expected on each of the busses, whereupon a multiplication with the speed of the busses shows how much time is needed by each of the busses. The total system runs as slowly as the slowest bus, which thus gives a performance estimate.

7.2.1 The number of transactions at each level

To count the number of transactions on the various busses, three kinds of transactions are distinguished: reads, responses and cache flushes. When a transaction encounters a miss in a cache at a certain level, the transaction is propagated to lower levels in the hierarchy, and induces a response transaction (the induction of invalidate transactions is ignored). The number of transactions is given by the following formulas for the number of reads R (both shared and modified reads), responses R' and flushes F. Each processor is supposed to issue "#Accesses" reads and writes, and to induce "#Flushes" cache flushes at the top level:

R_i = \begin{cases} \#\text{Accesses} & i = k \\ b_i m_i R_{i+1} & i < k \end{cases} \qquad (7.11)

R'_i = \begin{cases} 0 & i = k \\ m_{i-1} R_i & i < k \end{cases} \qquad (7.12)

F_i = \begin{cases} \#\text{Flushes} & i = k \\ b_i F_{i+1} & i < k \end{cases} \qquad (7.13)

The total number of transactions at level i, Ti, is:

T_i = R_i + R'_i + F_i \qquad (7.14)

Which can be combined into:

T_i = \begin{cases} \#\text{Flushes} + \#\text{Accesses} & i = k \\[1ex] \#\text{Flushes} \prod_{j=i}^{k-1} b_j \;+\; \#\text{Accesses} \prod_{j=i}^{k-1} b_j \,(1 + m_{i-1}) \prod_{j=i}^{k-1} m_j & i < k \end{cases} \qquad (7.15)

which defines the number of transactions per bus in the hierarchy.


Each bus at level i in the hierarchy takes Bi Ti seconds to execute the Ti transactions on that bus. The time needed for the whole system to complete all transactions is limited by the slowest bus; it equals:

\max_{j=0}^{k} B_j T_j \qquad (7.16)

The time taken by a single processor to complete the same number of transactions is Bk n(#Flushes + #Accesses). The relative performance is thus given by:

P = \frac{B_k\, n\, (\#\text{Flushes} + \#\text{Accesses})}{\max_{j=0}^{k} B_j T_j} \qquad (7.17)

Notice that when equation 7.15 is substituted in equation 7.17, the absolute number of flushes and accesses is not important (it disappears); only the ratio between the flushes and the other accesses appears in equation 7.17 (the time is a function of Ti, which depends on the sum of accesses and flushes). Since flushes always traverse through the cache hierarchy, they may cause traffic jams at the level 0 bus in large multiprocessors when the number of flushes per processor is of the same order as the number of accesses times m0 m1 m2 · · ·. If the number of flushes can be neglected, the number of accesses disappears completely from the definition of P.
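The transaction counts and the resulting relative performance can be written down compactly. The sketch below is an illustrative Python rendering of equations 7.11-7.17 (not part of the simulator); list indices play the role of the level subscripts:

```python
from math import prod

def hierarchy_performance(b, m, B, accesses=1.0, flushes=0.0):
    """Relative performance P of a k-level hierarchy (equations 7.11-7.17).

    b : branching factors b[0]..b[k-1]   (b[0] belongs to the memory bus)
    m : miss rates m[0]..m[k-1] of the caches at each level
    B : bus transaction times B[0]..B[k] (B[k] is the processor cycle time)
    """
    k = len(b)
    n = prod(b)                                        # number of processors
    T = [0.0] * (k + 1)
    T[k] = flushes + accesses                          # processor level
    for i in range(k - 1, -1, -1):                     # equation 7.15
        m_below = m[i - 1] if i > 0 else 0.0           # m_{-1} = 0 (the memory)
        reads = accesses * prod(b[i:]) * prod(m[i:])   # R_i
        responses = m_below * reads                    # R'_i
        T[i] = flushes * prod(b[i:]) + reads + responses
    slowest = max(B[j] * T[j] for j in range(k + 1))   # equation 7.16
    return B[k] * n * (flushes + accesses) / slowest   # equation 7.17

# Example call: the 2*4 architecture of Figure 7.1 (k = 2)
# hierarchy_performance([2, 4], [0.50, 0.02], [170e-9, 170e-9, 50e-9])
```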

Invalidate or read modified transactions travelling upwards to the processor caches (to invalidate shared data or to fetch exclusive data) are not accounted for in this model, but the effect of these transactions (a decreased hit rate at the processor caches in multiprocessors) is accounted for, as is shown next.

7.2.2 The miss rates of the caches at the various levels

As is the case in the flat model, the miss rate of the caches in a hierarchical architecture depends on the number of processors of the architecture. The miss rate of the top level caches (level k − 1) can be modelled in exactly the same way as the caches of a flat system; the contents of the level k − 1 caches are independent of the topology of the architecture, only the total number of processors is relevant. Equation 7.10 can be applied to calculate the miss rate of the top level caches.

A cache deeper in the hierarchy has a miss rate that depends on both the number of processors and the exact topology of the architecture. Consider the cache at level k − 2. The hit rate of this cache depends on both b_{k−1} and b_{k−2}: if b_{k−2} is increased, there are more neighbouring caches (hence more neighbouring processors), so the chance of a conflict on the exclusive data increases (resulting in a higher miss rate). When b_{k−1} increases, the number of processors within the cluster increases, which leads to a higher degree of sharing in the cluster (resulting in a lower miss rate of the level k − 2 cache). These two effects can be seen clearly in Figure 7.5, where the simulated miss rates of the level 0 caches in two level topologies are plotted versus the number of processors in the architecture (since there is in general more than one topology for a given number of processors, there are multiple points for each number of processors). The lines in Figure 7.5 connect the miss rates of architectures with equal cluster size (the dashed lines), and the miss rates of architectures with the same number of clusters (the solid lines).


[Figure: miss rate of the level 0 caches (0.55-0.9) versus the number of processors (0-30); dashed lines connect topologies with equal cluster size (2, 3, 4, ...), solid lines connect topologies with an equal number of clusters (2, 3, 4, ...).]

Figure 7.5: The miss rates of the level 0 caches in all two level topologies with up to 32 processors (the UNIX benchmark). [Note: The miss rate is in the order of 0.5 to 0.9. In the best case, the level 0 cache hits only 56% of the accesses. The miss rate of the level 0 cache can be improved by moving cache memory from the level 1 cache to the level 0 cache, but this does not automatically lead to a higher performance, because the hit rate of the level 1 cache is decreased.]

As explained above, the solid lines decrease: enlarging the cluster size leads to a lower miss rate in the level 0 cache; the dashed lines increase: a higher number of clusters leads to a higher miss rate. Below, the effects of larger clusters and of a larger number of clusters are quantified for the case of a two level hierarchy. How this model is applied to hierarchies with more levels is shown at the end of this section.

Enlarging the size of the cluster

At first, the branching factor of the level 0 bus, b0, is kept fixed, and the relation between m and b1 is studied. The level 0 caches contain the data that is needed by the b1 processors in the cluster. Part of this data is shared between the processors, part of the data is private to one of the processors. When the number of processors in a cluster is enlarged (b1 is enlarged), more private data is stored in the level 0 cache (assuming that each processor needs a certain amount of private data). But because the cache size scales proportionally to b1 (remember that the cache size per processor is kept constant), and the shared data is stored only once, there is room in the cache to store extra data. Figure 7.6 shows how two processors can fill a cache with shared and private data, while a four processor system has "free" space in the cache, which is used to keep extra data.

The effect of a larger cache is quantified as follows: assume that the level 0 cache of size S0 of a single processor architecture stores a fraction s of (potentially) shared data (sS0 bytes shared, (1 − s)S0 bytes of private data).


[Figure: cache contents for a cluster of two processors (a shared part plus two private parts filling the cache) and for a cluster of four processors (a shared part, four private parts, and remaining "Free" space).]

Figure 7.6: Usage of the cache in a cluster with two and four processors.

For a level 0 cache of two processors, of size 2S0, only sS0 + 2(1 − s)S0 = 2S0 − sS0 bytes are used; the remaining sS0 bytes are "free", and are effectively extra cache space. For clusters of size b1, the extra cache space equals b1S0 − (sS0 + b1(1 − s)S0) = (b1 − 1)sS0. Programs that rely heavily on shared data have s ≈ 1; for these programs the gain in cache size is enormous: a factor b1 − 1. For application programs that do not share data, s equals zero, hence an increased cluster size does not lead to extra cache space.

The extra cache space is used to store extra data for all processors, but it is unpredictable whether shared or private data will be stored. Here it is assumed that the ratio between shared data and private data is constant. The cache of size b1S0 is thus split into a shared part (of size sS0f, where f is the gain per processor) and b1 private parts (with sizes (1 − s)S0f). Because the sum of the private and shared parts equals the cache size, f can be calculated: b1S0 = sS0f + b1(1 − s)S0f ⇒ f = b1/(b1(1 − s) + s). Each processor uses the shared part and one private part; the total size in use by each of the b1 processors is therefore:

S = s S_0 f + (1 - s) S_0 f = S_0 f = \frac{S_0 b_1}{s + b_1(1 - s)} \qquad (7.18)

The relation between the cache size and the miss rate is given in [Przybylski90, Agarwal89]. That gives an exact description, but it is too complex to be used for this model, so we stripped it down to its most essential component:

m(S) = x_0 + x_1 \left(1 - e^{-x_2 / S}\right) \qquad (7.19)

where x0, x1 and x2 are parameters that represent application and cache characteristics (such as the line size), and S is the cache size. The miss rate for a larger number of processors in a cluster is thus given by equations 7.18 and 7.19:

m(b_1) = x_0 + x_1 \left(1 - e^{-x_2 \frac{s + b_1(1 - s)}{b_1 S_0}}\right) \qquad (7.20)

which can be simplified to:

m(b_1) = x_a + x_s \left(1 - e^{-x_e / b_1}\right) \qquad (7.21)

xa represents the asymptote of the miss rate (for an arbitrarily large cluster there will always remain misses); xs and xe define the slope of the curve and depend, amongst other things, on the size and associativity of the cache and on the sharing of the application.


The miss rate of the level 0 cache of a cluster with exactly one processor is 1 (by definition, because the level 1 cache above will already catch all possible hits), which allows xs to be expressed in terms of xe and xa:

1 = x_a + x_s \left(1 - e^{-x_e}\right) \;\Rightarrow\; x_s = \frac{1 - x_a}{1 - e^{-x_e}} \qquad (7.22)

So there are only two free parameters: xa and xs (or xe).

Enlarging the number of clusters

When more clusters are put on the same bus, the asymptote xa of the miss rate increases because of the higher fraction of exclusively shared data that is competed for. When the clusters are made arbitrarily large, the level 0 caches become arbitrarily large as well, so all data would be in the caches. But the writable shared data can be in only one cache. It is assumed that the writable shared data is equally distributed over all clusters: each level 0 cache has a fraction 1/b0 of the writable shared data. The asymptotes of the curves are thus related to b0:

x_a = x'_a - \frac{x''_a}{b_0} \qquad (7.23)

Note that this behaviour can be derived from the definition of cn in equation 7.9: when u equals 1, cn equals 1/n, indicating a loss that is inversely proportional to the number of clusters.

It is possible to fit a new xs (or xe) for each number of clusters, but it turns out that one gets fairly good results by keeping xs constant and recalculating xe according to equation 7.22. This means that the miss rate is defined in terms of:

x'a   the (asymptotic) miss rate for many large clusters, b0 → ∞, b1 → ∞.

x''a  the misses in x'a that are caused by conflicts on shared data; x'a − x''a/2 equals m0 of an architecture with two large clusters, b0 = 2, b1 → ∞.

xs    a parameter that quantifies the influence of enlarging b1; xs should be fitted to experimental results.

Figure 7.7 shows the modelled miss rates for the UNIX workload. The points represent the measured miss rates, the lines are the modelled miss rates. There are some deviations, but it is shown later on that their influence on the final performance is small. The miss rates of the level 0 caches of the parallel application are shown in Figure 7.8.
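A sketch of the resulting level 0 miss rate model, combining equations 7.21-7.23 (the function name and parameter names are ours; the fitted xs must exceed 1 − xa for the model to be well defined):

```python
import math

def level0_miss_rate(b0, b1, xa_inf, xa_shared, xs):
    """Miss rate of a level 0 cache in a b0*b1 topology (equations 7.21-7.23).

    xa_inf    : x'_a,  asymptotic miss rate for arbitrarily many large clusters
    xa_shared : x''_a, the part of x'_a caused by conflicts on shared data
    xs        : slope parameter, fitted on simulation results
    """
    xa = xa_inf - xa_shared / b0                   # equation 7.23
    # m(1) = 1 by definition, which fixes xe in terms of xa and xs (equation 7.22)
    xe = -math.log(1.0 - (1.0 - xa) / xs)
    return xa + xs * (1.0 - math.exp(-xe / b1))    # equation 7.21
```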

The miss rates for k > 2

The model above only considered miss rates for two level hierarchies, but the extension to three or more level hierarchies is straightforward. Consider the miss rate of a cache in a three level architecture with branching factors b0..b2.


[Figure: measured (points) and modelled (lines) miss rates, 0.55-0.90, versus the number of processors (0-30).]

Figure 7.7: Measured and modelled miss rates of the level 0 caches in all two level topologies with up to 32 processors (the UNIX benchmark).

[Figure: measured (points) and modelled (lines) miss rates, 0.50-0.95, versus the number of processors (0-30).]

Figure 7.8: Measured and modelled miss rates of the level 0 caches in all two level topologies with up to 32 processors (the par benchmark).

The miss rate of the level 2 caches (the processor caches), m2, only depends on the number of processors. Therefore m2 is modelled as m0 of a flat architecture with b0 b1 b2 processors, according to equation 7.10. The miss rate of a cache at level 1 (the middle level), m1, can be modelled as if the cache were at level 0 of a two level architecture with b'0 = b0 b1 and b'1 = b2. The exact structure of the topology below the cache is not relevant; only the number of caches beside the cache (b0 b1) and the number of processors inside a cluster (b2) are relevant.

The miss rate of the caches at level 0 depends not only on the number of processors above these caches, but also on the way the caches are structured. Consider two extreme three level architectures, a 2*2*8 architecture and a 2*8*2 architecture (in both cases 32 processors, in both cases two super clusters with 16 processors).


[Figure: a four level hierarchy (branching factors b3..b0, miss rates m3..m0 above the memory); dashed boxes group adjacent levels so that the hierarchy is reduced to two levels.]

Figure 7.9: Modelling the level 1 caches of a four level hierarchy. The dashed boxes are considered as bus segments, so the hierarchy is reduced to two levels.

The data that is stored in the level 0 caches is equal for both systems (the amount of cache memory is identical, and in both cases the 16 left hand processors compete with the 16 right hand processors for the exclusively owned data via the level 0 caches). So the absolute number of misses passing through the level 0 caches of the 2*2*8 and 2*8*2 architectures is identical. But in the first architecture, the number of accesses to the level 0 cache will be lower than in the second architecture, because m1 of a 2*2*8 architecture is lower than m1 of a 2*8*2 architecture. The same absolute number of misses, stemming from more accesses, implies a lower miss rate at level 0. Since the absolute number of misses coming from the level 2 caches is also equal for both architectures, we may state that the combined miss rate of the caches at levels 1 and 0, m1 m0, is the same for both architectures.

More generally, the product of the miss rates m1 m0 follows the model for the miss rates of two level caches for an architecture with b0 clusters of size b1 b2. Since m1 could already be modelled, m0 can be calculated. This procedure is generalised to k levels by downward induction:

m_{k−1} follows the model for flat bus systems presented in equation 7.10. It depends on n only.

m_i follows from the values of m_{i+1}..m_{k−2} and the model for two level caches: \prod_{j=i}^{k-2} m_j is modelled with a two level hierarchy with b'_0 = \prod_{j=0}^{i} b_j and b'_1 = \prod_{j=i+1}^{k-1} b_j. The miss rates are thus calculated top down:

    m_{k−2} is modelled using b'_0 = \prod_{j=0}^{k-2} b_j and b'_1 = b_{k−1},

    m_{k−3} m_{k−2} is modelled with b'_0 = \prod_{j=0}^{k-3} b_j and b'_1 = b_{k−1} b_{k−2},

    m_{k−4} m_{k−3} m_{k−2} is modelled with b'_0 = \prod_{j=0}^{k-4} b_j and b'_1 = b_{k−1} b_{k−2} b_{k−3},

    etcetera.

As an example, Figure 7.9 shows how a four level architecture is mapped onto a two level architecture to model m1m2m3: a 3*2*2*3 architecture is mapped onto a 6*6 two level architecture. Since m3 and m2 have been modelled before, m1 can be calculated.
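The downward induction can be expressed compactly. In the sketch below (an illustration, not the thesis' implementation), flat_model and two_level_model stand for the fitted models of equation 7.10 and of equations 7.21-7.23:

```python
from math import prod

def hierarchy_miss_rates(b, flat_model, two_level_model):
    """Miss rates m[0..k-1] of a k-level hierarchy with branching factors
    b[0..k-1], computed by downward induction on the two level model.

    flat_model(n)           -> miss rate of the processor caches (equation 7.10)
    two_level_model(b0, b1) -> level 0 miss rate of a b0*b1 topology (eq. 7.21-7.23)
    """
    k = len(b)
    n = prod(b)
    m = [0.0] * k
    m[k - 1] = flat_model(n)                 # processor caches: the flat model
    for i in range(k - 2, -1, -1):
        # prod(m[i..k-2]) is modelled as the level 0 miss rate of a b'0 * b'1 topology
        b0_prime = prod(b[:i + 1])
        b1_prime = prod(b[i + 1:])
        combined = two_level_model(b0_prime, b1_prime)
        m[i] = combined / prod(m[i + 1:k - 1])   # divide out the already known rates
    return m
```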


[Figure: relative performance (0-14) versus the number of processors (0-30); simulation results and model predictions.]

# Nodes       04  06  08  09  10  12  14  15  16  18  20  21  22  24  25  26  27  28  30  32
Cluster size   2   3   4   3   5   6   7   5   4   6   5   7  11   8   5  13   9   7  10   8
# Clusters     2   2   2   3   2   2   2   3   4   3   4   3   2   3   5   2   3   4   3   4
MIPS rate     61  92 123 138 153 184 215 230 246 265 242 257 221 248 219 209 240 221 231 212

Figure 7.10: The simulated and modelled relative performance figures; the table also shows the best performing topologies.

7.2.3 Putting it together, the performance of multi level hierarchies

The number of accesses divided by the speed of the slowest bus indicates how many words per second can be transferred through the total architecture. To calculate the number of instructions per second, the performance figures are divided by (d + 1). This results in the graphs shown in Figure 7.10. The graph shows both the simulated performance and the modelled performance; the table shows the best performing topologies as they follow from the model, and can be compared with the simulator output of Table 6.3. The similarities are striking: the shapes of the graphs are almost identical, and both the performance model and the simulator show that a few large clusters perform best.
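Combining the sketches of the previous sections gives a small end-to-end estimate for a single topology. All numerical parameters below are hypothetical illustrations, not the fitted values of this chapter, and the helpers are those sketched earlier:

```python
# Hypothetical fitted parameters, for illustration only.
flat = lambda n: miss_rate(n, m_p=0.03, h_w=0.01, u=0.05)              # eq. 7.10
two_level = lambda b0, b1: level0_miss_rate(b0, b1, 0.55, 0.20, 0.60)  # eq. 7.21-7.23

b = [2, 3, 4]                       # a 2*3*4 topology: 24 processors
B = [170e-9, 170e-9, 50e-9, 50e-9]  # assumed bus times B0..B3 (B3 = processor cycle)
m = hierarchy_miss_rates(b, flat, two_level)
P = hierarchy_performance(b, m, B)  # relative performance of the topology
```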

The performance as predicted by the model is too high. This has two reasons. Firstly, the bus gets saturated gradually in a real architecture, whereas in the model the bus saturates suddenly. The error is largest at the critical point, where the performance stabilises (this is the reason why the performance model marks 18 as the best performing topology, while the simulator came to 21). Secondly, an average value was used for the bus speed parameters B0, B1 and B2. Unfortunately the bus transaction time depends on the type of the transaction (misses are handled faster than hits, as is correctly modelled by for example [Vernon89], because no data has to be transferred), and the simulator implements an optimisation that allows multiple transactions issued by the same cache to be handled within one arbitration phase. This last optimisation is effective when large clusters are operating on a heavily loaded low level bus (because the cache waits a long time for the low level bus, and thus has a long queue of waiting requests), reducing the average access time.


This is the reason that the modelled performance decreases for more processors (caused by the higher miss rates), while the simulation results almost stabilise (because the cheaper bus cycles outweigh the higher miss rates).

7.3 Discussion

The performance models presented here are valid under the assumption that the application is well behaved: increased cache sizes should consistently lead to lower miss rates. Furthermore, the application program is assumed to scale perfectly, without a change in its caching behaviour: the cache hit rate of a program running on a large number of processors should not be different from the cache hit rate on a small system (except for the usage of shared data). Although these assumptions limit the applicability of the model (real applications have a limit on their parallelism), the model gives reasonably good results.

The model can be simplified to an (intuitively clear) rule of thumb. The performance is limited by the bus with the highest load, so an optimally balanced system has the same utilisation for all busses. Considering that:

- the number of transactions at each level is proportional to the miss rate of the cache above,

- the speed at which transactions are generated is proportional to the speed of the bus one level above, and

- the speed at which transactions are serviced is proportional to the speed of this bus level,

a bus at level i has the same utilisation as the bus at level i + 1 when bi mi B_{i+1} = Bi. Hence, the maximal number of caches at level i is estimated by:

b_i^{max} \approx \frac{B_i}{B_{i+1}} \cdot \frac{1}{m_i} \qquad (7.24)

Still, mi depends on the topology of the architecture, as discussed on page 137. Because mi tends to be small for the caches close to the processors, bi should be high for the levels close to the processors, whereas for small i, close to the memory, mi is high, so bi should be low.
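As a quick check of equation 7.24 against the example parameters of Figure 7.1 (a sketch with our own function name): with B0 = B1 = 170 ns and m0 = 0.5 the bound evaluates to about two level 0 caches on the memory bus, which matches the short memory busses found in the simulations.

```python
def max_branching(B_i, B_above, m_i):
    """Upper bound on the number of caches on a level i bus (equation 7.24)."""
    return (B_i / B_above) / m_i

# Memory bus of the Figure 7.1 example: B0 = B1 = 170 ns, m0 = 0.5
print(max_branching(170e-9, 170e-9, 0.5))   # -> 2.0: at most two level 0 caches
```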

From the definition of b_i^{max} it can be concluded that it is important to reduce the miss rate of the caches, not only to increase the performance of a processor (as is the case for single processor systems) but also to increase the scalability of the design, because the maximal branching factors are inversely proportional to the miss rate. If the miss rate of the processor cache of an architecture can be reduced from 3% to 2% (an increase in hit rate from 97% to 98%), 50% more processors can be used in the architecture to run the application efficiently. Such an increase in hit rate is not cost effective for a single processor architecture, but it may be cost effective in a multiprocessor system because of the contention effects.


The miss rate can be reduced by adapting the hardware (more cache memory, another distribution of the cache memory over the levels, another line size, other associativities) or by improving the locality of the software (by rewriting the application, the compiler and the run time support).

Because the performance is not only influenced by the miss rate but also by the miss penalty, the optimal performance is reached with bi < b_i^{max}. The miss penalty increases when busses are better utilised, so this simple model gives an upper bound for the number of caches on the various bus levels.

The performance models presented in this chapter could have been used to drastically reduce the number of simulation runs performed to obtain the performance graphs presented at the end of the previous chapter. Only a few simulation runs are necessary; the performance model can be used to interpolate or extrapolate the results. If needed, extra simulation runs can be made to get detailed performance figures for specific architectures. Combining the figures of performance models and simulations is an efficient strategy to obtain reliable performance figures.


Part III

Conclusions


Chapter 8

Conclusions: evaluating Oyster

In the introductory chapter, two methods have been presented to estimate the performance of computer architectures: the designer can construct a performance model of the architecture (which defines a relation between the parameters of the architecture and the performance), or the designer can construct a simulator of the architecture, to measure the performance. Oyster is an environment that allows architectures to be simulated. The architect specifies the architecture at the appropriate level of detail, whereupon simulation runs can be made in order to measure the expected performance of the architecture. Oyster facilitates the integrated simulation of high and low level models due to its layered structure: the higher level models are implemented on top of the lower layer.

Oyster essentially measures the performance in terms of the number of seconds needed by the architecture (and application) to complete a task. To ease an analysis of the performance of the architecture, the performance figures are also split up. In the lower levels of Oyster the performance figures of all components of the architecture are maintained separately, as well as the dependencies between the various components. The top levels of Oyster measure performance figures relevant for specific types of components, like for example the hit rate of a cache.

In contrast with many other simulation systems, Oyster is based on a philosophy of simplicity, flexibility and openness. The simulation language has only a few constructs, Oyster allows any type of computer architecture to be modelled at any level of detail, and Oyster is easily extended with other simulators or with extra features that are needed by the architect. During each experiment, Oyster is improved by adding or deleting features. Oyster will never be finished; due to new experiences gained while simulating architectures, Oyster will keep changing.

The usefulness of Oyster has been tested in several case studies: the simulation of a VLIW machine for graph reduction [Milikowski92, Gijsen92, Hendriksen90], the simulated execution of functional programs on shared memory architectures [Langendoen92a, Hofman93], a study of the communication architecture of the PRISMA machine (Chapter 5), and the simulation of hierarchical architectures based on the Futurebus (Chapter 6).

The experiments with the simulated execution of functional programs have resulted in a technique to simulate parallel applications with little overhead (MiG).


This simulation technique is not restricted to applications written in a functional language: any parallel programming language that does not perform asynchronous operations on shared data can be simulated using the MiG. The MiG is currently integrated with Oyster, so that parallel programs can be hooked into Oyster.

The simulator of the PRISMA architecture has been developed to study alternative implementations of the communication architecture. Because this requires a trade-off between hardware and software, the software has been emulated at instruction level on the simulated hardware. The processor, cache and memory were simulated using high level models; the (non standard) communication hardware is simulated at a lower level. The experimentation with the PRISMA architecture clearly showed some deficiencies of Oyster. As an example, the decision to bind array bounds at load time instead of compile time, and the addition of the critical path analyser, were directly inspired by this experiment.

The Futurebus simulator could not be developed with standard models. Because the cache consistency protocol of the Futurebus has to be used, the standard cache model does not suffice. For that reason, a standard cache model was extended to a "Futurebus" cache. This Futurebus cache can be placed in the standard library, but its usefulness as a standard component is questionable: it is not expected that a Futurebus cache will be used frequently, so it would become part of the dead weight around Oyster that only confuses the designer. For the same reason, the stochastic processor is not a part of Oyster.

This last study also shows that the value of the simulation results can be enhanced by creating a performance model. The simulator is used to measure essential parameters of application and architecture, whereas the performance model gives an analytical relation between these parameters and the performance. This analytical relation can be used to interpolate or extrapolate the performance figures without the need for exhaustive simulations.
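
As an illustration of this interplay, the sketch below plugs parameters that a simulator could measure (miss rate, access times, and so on) into a simple, well-known analytical relation for the average time per instruction, and then extrapolates to a lower miss rate without a new simulation run. The relation and the parameter values are illustrative assumptions; they are not the Futurebus performance model developed in this thesis.

    /*
     * Combining simulation and an analytical model (illustration only).
     */
    #include <stdio.h>

    /* Parameters that would be measured with the simulator. */
    struct measured {
        double refs_per_instr;  /* memory references per instruction   */
        double miss_rate;       /* cache miss rate                     */
        double hit_time;        /* cache hit time (seconds)            */
        double miss_penalty;    /* time to fetch a line from memory    */
        double cpi_cpu;         /* cycles per instruction, ignoring    */
                                /* the memory hierarchy                */
        double cycle_time;      /* processor cycle time (seconds)      */
    };

    /* Analytical relation: predicted time per instruction. */
    static double time_per_instr(const struct measured *m)
    {
        double amat = m->hit_time + m->miss_rate * m->miss_penalty;
        return m->cpi_cpu * m->cycle_time + m->refs_per_instr * amat;
    }

    int main(void)
    {
        /* Figures of this kind come out of a simulation run. */
        struct measured m = { 0.3, 0.05, 20e-9, 200e-9, 1.5, 20e-9 };

        printf("predicted: %g s per instruction\n", time_per_instr(&m));

        /* Extrapolate: what would a lower miss rate buy, without
         * running another exhaustive simulation? */
        m.miss_rate = 0.02;
        printf("with 2%% miss rate: %g s per instruction\n",
               time_per_instr(&m));
        return 0;
    }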

For all the case studies a validation was performed to check the correctness of the models.

A tool like Oyster can be of great help when evaluating the performance of computer architectures. Tools of this type aid in the systematic design of computer architectures. The architect should, however, always keep in mind that performance is only one aspect of the quality of a computer architecture, and that the user is eventually interested in other aspects (such as functionality or price) as well.

Bibliography

[Agarwal86] Anant Agarwal, Richard L. Sites and Mark Horowitz, “ATUM: A new technique for capturing address traces using microcode”, Proceedings 13th Annual Symposium on Computer Architecture, pp 119-127, June 1986.

[Agarwal89] Anant Agarwal, Mark Horowitz, and John Hennessy, “An analytical cache model”, ACM Transactions on Computer Systems, Vol 7, No 2, pp 184-215, May 1989.

[America89a] Pierre America and Jan Rutten, “A parallel object-oriented language: design and semantic foundations”, Ph.D. thesis, Free University, Amsterdam, May 17, 1989.

[America89b] P.H.M. America, “P0350: Language definition of POOL–X”, PRISMA document number 0350, Philips Research Laboratories Eindhoven, November 1989.

[Annot87] J.K. Annot and R. van Twist, “A novel deadlock free and starvation free packet switching communication processor”, Proceedings of PARLE ’87, Veldhoven, The Netherlands, pp 68-85, June 1987.

[Apers90] P. Apers, L.O. Hertzberger, B.J.A. Hulshof, A.C.M. Oerlemans and M. Kersten, “PRISMA: A Platform for experiments with parallelism”, P. America (ed), Proceedings of the PRISMA workshop on parallel database systems, Noordwijk, The Netherlands, September 24-26, Springer Verlag LNCS 503, pp 169-180, 1991.

[Archibald86] James Archibald and Jean-loup Baer, “Cache Coherence Protocols: Evaluation Using a Multiprocessor Simulation Model”, ACM Transactions on Computer Systems, Vol 4, No 4, pp 273-298, November 1986.

[Baer88] J. Baer and W. Wang, “On the inclusion properties for multi-level cache hierarchies”, Proceedings of the 15th Annual International Symposium on Computer Architecture, pp 73-80, 1988.

[Barbacci81] Mario R. Barbacci, “Instruction set processor specifications (ISPS): The notation and its applications”, IEEE Transactions on Computers, Vol C-30, No 1, pp 24-40, January 1981.

[Barendregt92] H.P. Barendregt, M. Beemster, P.H. Hartel, L.O. Hertzberger, R.F.H. Hofman, K.G. Langendoen, L.L. Li, R. Milikowski, J.C. Mulder and W.G. Vree, “Programming parallel reduction machines”, Technical report CS-92-05, Computer science department, University of Amsterdam, June 1992.

[Beemster90] M. Beemster, “Back end aspects of the portable POOL implementation”, P. America (ed), Proceedings of the PRISMA workshop on parallel database systems, Noordwijk, The Netherlands, September 24-26, Springer Verlag LNCS 503, pp 193-228, 1991.

[BenAri82] M. Ben-Ari, “Principles of concurrent programming”, Prentice Hall, ISBN 0-13-701078-8, 1982.

[Bevan86] D.I. Bevan, “The implementation of an event-driven logic simulator in a functional style”, ESPRIT 415B document nr 043, August 1986.

[Bird88] R.S. Bird and P.L. Wadler, “Introduction to functional programming”, Prentice Hall, New York, 1988.

[Birtwistle73] G.M. Birtwistle, O.-J. Dahl, B. Myhrhaug, and K. Nygaard, “SIMULA begin”, 1973.

[Bray89] B. Bray, K. Cuderman, M. Flynn and A. Zimmerman, “The computer architect’s workbench”, Proceedings of IFIP ’89 on VLSI and CAD Tools, 1989.

[Bronnenberg87] W.J.H.J. Bronnenberg, L. Nijman, E.A.M. Odijk, and R.A.H. van Twist, “DOOM: a decentralized object oriented machine”, IEEE Micro, Vol 7, No 5, pp 52-69, October 1987.

[Brown83] Harold Brown, Christopher Tong, and Gordon Foyster, “Palladio: An Exploratory Environment for Circuit Design”, IEEE Computer, pp 41-56, December 1983.

[Bryant84] R.E. Bryant, “A Switch level model and simulator for MOS systems”, IEEE Transactions on Computers, Vol C-33, No 2, pp 160-177, February 1984.

[Bugge90] H.O. Bugge, E.H. Kristiansen and B.O. Bakka, “Trace-driven simulations for a two-level cache design in open bus systems”, Proceedings of the 17th Annual Int. Symposium on Computer Architecture, pp 250-259, 1990.

[Bunt84] Richard B. Bunt and Jennifer M. Murphy, “The measurement of locality and the behaviour of programs”, The Computer Journal, Vol 27, No 3, pp 238-245, 1984.

[Chaiken90] David Chaiken, Craig Fields, Kiyoshi Kurihara and Anant Agarwal, “Directory based cache coherence in large scale multiprocessors”, IEEE Computer, pp 49-58, June 1990.

[Chandy79] K. Mani Chandy and Jayadev Misra, “Distributed Simulation: A Case Study in Design and Verification of Distributed Programs”, IEEE Transactions on Software Engineering, Vol SE-5, No 5, pp 440-452, September 1979.

[Cherry88] James J. Cherry, “Pearl: a CMOS timing analyzer”, Proceedings 25th ACM/IEEE Design Automation Conference, 1988.

[Dacapo89] Private communication with a sales manager of Dosis Germany, 1989.

[Dahl66] Ole-Johan Dahl and Kristen Nygaard, “SIMULA—an ALGOL-based simulation language”, Communications of the ACM, Vol 9, No 9, pp 671-678, September 1966.

[Dally86] William J. Dally and Charles L. Seitz, “The torus routing chip”, Distributed Computing 1, pp 187-196, 1986.

[Deering79] Michael Deering, Joseph Faletti and Robert Wilensky, “PEARL – A Package for Efficient Access to Representations in LISP”, Proceedings of the seventh international joint conference on Artificial Intelligence, IJCAI-81, August 24-28, Vancouver, Canada, 1981.

[Delgado88] Jose C. Delgado, “Designing computer architectures with SADAN”, Microprocessing and Microprogramming 22, pp 205-216, North Holland, 1988.

[Denning72] Peter J. Denning, “On modelling program behaviour”, AFIPS conference proceedings, Vol 40, Spring Joint Computer Conference, Atlantic City, New Jersey, pp 937-943, May 16-18, 1972.

[Digital72] “pdp11/45 processor handbook”, Digital Equipment Corporation, 1972.

[Eggers89] S.J. Eggers and R.H. Katz, “Evaluating the performance of four snooping cache coherency protocols”, Proceedings of the 16th Annual International Symposium on Computer Architecture, pp 2-15, 1989.

[Endot87] “N.2 Introduction and Tutorial”, Endot Inc., 11001 Cedar Ave., Cleveland, Ohio 44106, 1987.

[Futurebus89] “Futurebus Logical Layer Specifications, Draft 8.1.1”, P896.1 Working Group of the IEEE Computer Society, December 1990.

[Gijsen92] Nienke Gijsen, “Simulation of a micro parallel lazy graph reducer: the G-Hinge”, Master Thesis, Computer science department, University of Amsterdam, December 1992.

[Goor89] A.J. van de Goor, “Computer architecture and design”, Addison Wesley, ISBN 0-201-18241-6, 1989.

[Gottlieb83] A. Gottlieb, R. Grishman, C.P. Kruskal, K.P. McAuliffe, L. Rudolph and M. Snir, “The NYU Ultracomputer—designing an MIMD shared memory parallel computer”, IEEE Transactions on Computers, Vol C-32, No 2, pp 175-189, 1983.

[Graham82] S.L. Graham, P.B. Kessler, M.K. McKusick, “gprof: A Call Graph Execution Profiler”, Proceedings of the SIGPLAN ’82 Symposium on Compiler Construction, SIGPLAN Notices, Vol 17, No 6, pp 120-126, June 1982.

[Gupta89] Saurabh Gupta and Rami Melhem, “A software tool for the automatic generation of memory traces for shared memory multiprocessor systems”, Proceedings of the 22nd Annual Simulation Symposium, Tampa, Florida, pp 93-104, March 1989.

[Hagersten91] Erik Hagersten, “A Simulated MIMD Running Gcycles of Real Programs”, First European Workshop on Performance Modelling and Evaluation of Parallel Computer Systems, 1991.

[Hartel91] Pieter Hartel, Hugh Glaser, and John Wild, “Compilation of functional languages using flow graph analysis”, Technical report 91-03, Computer science department, University of Southampton, January 1991.

[Hendriksen90] Michel Hendriksen, “Simulation of a Sequential Graph Reduction Machine”, Master Thesis, Computer science department, University of Amsterdam, August 31, 1990.

[Hennessy90] J.L. Hennessy and D.A. Patterson, “Computer architecture: a quantitative approach”, Morgan Kaufmann Publishers, Palo Alto, California, ISBN 1-55860-069-8, 1990.

[Hercksen82] Uwe Hercksen, Rainer Klar, Wolfgang Kleinoder, Franz Kneißl, “Measuring simultaneous events in a multiprocessor system”, ACM Sigmetrics, Performance Evaluation Review, Vol 11, No 4, pp 77-88, September 1982.

[Hill87] M.D. Hill, “Aspects of Cache Memory and Instruction Buffer Performance”, Ph.D. Thesis, Univ. of California at Berkeley, Computer Science Division, Tech. Rep. UCB/CSD 87/381, November 1987.

[Hill89] “Dinero III cache simulator”, on-line documentation version 3.3, [email protected].

[Hofman93] R.F.H. Hofman, “Scheduling and Grain Size Control”, Ph.D. thesis, Computer science department, University of Amsterdam, 1993.

[Horbst87] E. Horbst, C. Muller-Schloer, H. Schwartzel, “Design of VLSI Circuits”, Springer Verlag, ISBN 0-387-17663-2, 1987.

[IEEE88] “IEEE Standard VHDL Language reference manual”, 1988.

[Jain91] Raj Jain, “The art of computer systems performance analysis”, Morgan Kaufmann Publishers, Palo Alto, California, ISBN 0-471-50336-3, 1991.

[Jefferson85] David R. Jefferson, “Virtual time”, ACM Transactions on Programming Languages and Systems, Vol 7, No 3, pp 404-425, July 1985.

[Jog90] Rajeev Jog, Philip L. Vitale, James R. Callister, “Performance evaluation of a commercial cache-coherent shared memory multiprocessor”, Proceedings 1990 ACM SIGMETRICS, Boulder, Colorado, May 22-25, Performance Evaluation Review, Vol 18, No 1, pp 173-182, 1990.

[Joosten89] S. Joosten, “The use of functional programs in software development”, Ph.D. Thesis, Twente University, The Netherlands, ISBN 90-9002729-7, April 4, 1989.

[Kahn74] Gilles Kahn, “The semantics of a simple language for parallel programming”, Information Processing 74, North Holland, pp 471-475, 1974.

[Kelly89] Paul Kelly, “Functional programming for loosely-coupled multiprocessors”, Pitman Publishing, ISBN 0-273-08804-1, 1989.

[Kernighan78] Brian W. Kernighan and Dennis M. Ritchie, “The C Programming Language”, Prentice Hall, ISBN 0-13-110163-3, 1978.

[Krishnakumar87] A.S. Krishnakumar, “ART-DACO: Architectural research tool using data abstraction and concurrency”, Proceedings of the International Conference on Computer Design, pp 18-21, October 1987.

[Langendoen91] K.G. Langendoen, H.L. Muller and L.O. Hertzberger, “Evaluation of Futurebus hierarchical caching”, Proceedings of PARLE 91, June 10-13, Veldhoven, Springer Verlag LNCS 505, pp 52-68, 1991.

[Langendoen92a] K.G. Langendoen, H.L. Muller, W.G. Vree, “Memory Management for parallel tasks in shared memory”, Y. Bekkers and J. Cohen (eds), Memory Management, proceedings of the international workshop IWMM 92, St Malo, France, LNCS 637, pp 165-178, September 16-18, 1992.

[Langendoen92b] K.G. Langendoen and D.J. Agterkamp, “Cache Behaviour of Lazy Functional Programs”, H. Kuchen and R. Loogen (eds), 4th International Workshop on the Parallel Implementation of Functional Languages, Aachen, Germany, Aachener Informatik-Berichte 92-19, RWTH Aachen, Fachgruppe Informatik, pp 125-138, September 1992.

[Langendoen92c] K.G. Langendoen and P.H. Hartel, “FCG: a code generator for lazy functional languages”, in Compiler Construction (CC), Springer Verlag LNCS 641, U. Kastens and P. Pfahler (eds), Proceedings of the International Workshop on Compiler Construction, Paderborn, Germany, pp 278-296, October 1992.

[Lewin85] Douglas Lewin, “Design of Logic Systems”, ISBN 0-442-30606-7, T.J. Press Ltd, Padstow, Cornwall, 1985.

[MacDougall87] M.H. MacDougall, “Simulating Computer Systems, techniques and tools”, ISBN 0-262-13229-X, MIT Press, Massachusetts, 1987.

[Martin81] T. Martin, “PEARL at the age of three”, Proceedings Fourth International Conference on Software Engineering, September 17-19, München, Germany, 1979.

[Meulen87] Pieter S. van der Meulen, “INSIST: Interactive Simulation in SmallTalk”, Proceedings of OOPSLA ’87, pp 366-376, October 4-8, 1987.

[Milikowski91] R. Milikowski and W.G. Vree, “The G-Line, a distributed processor for graph reduction”, Proceedings of PARLE 91, June 10-13, Veldhoven, Springer Verlag LNCS 505, pp 119-136, 1991.

[Milikowski92] R. Milikowski, “A description of the G-hinge”, Technical report CS-92-xx, Computer science department, University of Amsterdam, May 1992.

[Misra86] Jayadev Misra, “Distributed discrete-event simulation”, Computing Surveys, Vol 18, No 1, pp 39-65, March 1986.

[Mooij89] W.G.P. Mooij, “Packet Switching Communication Networks for Multiprocessor Systems”, Ph.D. Thesis, Computer science department, University of Amsterdam, December 1989.

[Motorola85] “mc68020 User’s manual”, ISBN 0-13-566860-3, Prentice Hall Inc, Englewood Cliffs, 1985.

[Motorola88] “mc88200 Cache/Memory management unit User’s manual”, Motorola Inc, 1988.

[Mulder87] Johannes M. Mulder, “Tradeoffs in Processor-Architecture and Data-Buffer Design”, Ph.D. Thesis, Stanford Technical Report CSL-TR-87-345, December 1987.

[Muller90] H.L. Muller, “Evaluation of a communication architecture by means of simulation”, in P. America (ed), Proceedings of the PRISMA workshop on parallel database systems, Noordwijk, The Netherlands, September 24-26, Springer Verlag LNCS 503, pp 275-293, 1991.

[Muller92a] H.L. Muller, K.G. Langendoen and L.O. Hertzberger, “MiG: Simulating parallel functional programs on hierarchical cache architectures”, Technical report CS-92-04, Computer science department, University of Amsterdam, June 1992.

[Muller92b] H.L. Muller and L.O. Hertzberger, “Evaluating all regular topologies of hierarchical cache architectures based on the Futurebus”, in Christoph Eck et al (eds), Proceedings of Open Bus Systems ’92, Zurich, October 13-15, VITA, ISBN 90-72577-11-6, pp 193-199, 1992.

[Nichols88] Kathleen M. Nichols and John T. Edmark, “Modelling multi computer systems with PARET”, IEEE Computer, pp 39-48, May 1988.

[Overeinder89] B.J. Overeinder, “De implementatie van een compiler voor de object georienteerde taal Pearl en de literatuur studie van load balance algorithmen voor parallelle simulaties”, Master thesis, Computer science department, University of Amsterdam, November 1989.

[Overeinder91] Benno Overeinder, Bob Hertzberger, and Peter Sloot, “Parallel discrete-event simulation”, Third workshop on design and realisation of computer systems, Eindhoven, pp 19-30, ISBN 90-6144-995-2, May 1991.

[Press86] William H. Press, Brian P. Flannery, Saul A. Teukolsky and William T. Vetterling, “Numerical recipes”, Cambridge University Press, ISBN 0-512-30811-9, 1986.

[Przybylski90] Steven A. Przybylski, “Cache and memory hierarchy design: a performance directed approach”, Morgan Kaufmann Publishers, Palo Alto, California, ISBN 1-55860-136-8, 1990.

[Raina91] Sanjay Raina and David H.D. Warren, “Traffic patterns in a scalable multiprocessor through Transputer emulation”, Technical report 91-23, Computer science department, University of Bristol, October 1991.

[Rashid86] Richard F. Rashid, “Threads of a new system”, UNIX Review 8, pp 37-49, August 1986.

[Rubin87] Steven M. Rubin, “Computer Aids for VLSI Design”, ISBN 0-201-05824-3, Addison-Wesley, 1987.

[Rudolph84] L. Rudolph and Z. Segall, “Dynamic decentralised Cache Schemes for MIMD parallel processors”, Proceedings of the 11th Annual International Symposium on Computer Architecture, pp 340-347, June 1984.

[Sauer81] Charles H. Sauer and K. Mani Chandy, “Computer systems performance modeling”, Prentice-Hall, ISBN 0-13-165175-7, 1981.

[SCI92] “Scalable Coherent Interface standard (SCI)”, IEEE 1596, IEEE Computer Society, New York, 1992.

[Sijtsma89] Ben A. Sijtsma, “On the Productivity of Recursive List Definitions”, ACM Transactions on Programming Languages and Systems, Vol 11, No 4, pp 633-649, October 1989.

[Smith82] Alan Jay Smith, “Cache Memories”, Computing Surveys, Vol 14, No 3, pp 473-530, 1982.

[Spek90] J. vd Spek, “Back end aspects of the portable POOL implementation”, P. America (ed), Proceedings of the PRISMA workshop on parallel database systems, Noordwijk, The Netherlands, September 24-26, Springer Verlag LNCS 503, pp 309-344, 1991.

[Stallman88] R.M. Stallman, “Using and Porting the GNU C-compiler”, Free Software Foundation Inc, Massachusetts, October 1988.

[Stenstrom90] Per Stenstrom, “A survey of cache coherence schemes for multiprocessors”, IEEE Computer (special issue on cache coherency), pp 12-24, June 1990.

[Stil89] J.G. Stil, “De implementatie van een compiler voor de simulatietaal Pearl en een studie naar synchronisatiemechanismen voor gedistribueerde simulatie”, Master thesis, Computer science department, University of Amsterdam, August 1989.

[Stroustrup87] Bjarne Stroustrup, “The C++ programming language”, Addison Wesley, ISBN 0-201-12078-X, 1987.

[Sweazy86] Paul Sweazy and Alan Jay Smith, “A class of compatible cache consistency protocols and their support by the IEEE Futurebus”, Proceedings of the 13th Annual International Symposium on Computer Architecture, pp 414-423, June 1986.

[Szymanski85] Thomas G. Szymanski and Christopher J. Van Wyk, “Goalie: A Space Efficient System for VLSI Artwork Analysis”, IEEE Design and Test, pp 64-72, June 1985.

[Thiebaut92] Dominique Thiebaut, Joel L. Wolf, Harold S. Stone, “Synthetic traces for trace-driven simulation of cache memories”, IEEE Transactions on Computers, Vol 41, No 4, pp 388-410, April 1992.

[Turner90] “Miranda System Manual”, Research Software Limited, 23 St Augustines Road, Canterbury, Kent CT1 1XP, England, April 1990.

[Verilog89] “Verilog reference manual”, Gateway Design Automation Corporation, Lowell, Massachusetts, September 1989.

[Vernon89] Mary K. Vernon, Rajeev Jog and Gurindar S. Sohi, “Performance Analysis of Hierarchical Cache-Consistent Multiprocessors”, Performance Evaluation 9, 4, pp 287-302, July 1989.

[Vlot90] M. Vlot, “The POOMA architecture”, P. America (ed), Proceedings of the PRISMA workshop on parallel database systems, Noordwijk, The Netherlands, September 24-26, Springer Verlag LNCS 503, pp 365-395, 1991.

[Vree89] W.G. Vree, “Design considerations for a parallel reduction machine”, Ph.D. thesis, University of Amsterdam, December 1989.

[Wallace88] D.E. Wallace and C.H. Sequin, “ATV: an abstract timing verifier”, 25th ACM/IEEE Design Automation Conference, 1988.

[Warren88] David H.D. Warren, Seif Haridi, “Data Diffusion Machine – A scalable shared virtual memory multiprocessor”, Proceedings of the international conference on fifth generation computer systems 1988, ICOT, pp 943-952, 1988.

[Weste81] N. Weste, “MULGA – An interactive symbolic layout system for the design of integrated circuits”, The Bell System Technical Journal, Vol 60, No 6, 1981.

[Weste85] Neil Weste and Kamran Eshraghian, “Principles of CMOS VLSI Design, A Systems Perspective”, Addison-Wesley, ISBN 0-201-08222-5, 1985.

Index

address trace, 64
    extraction, 64
    MiG, 67
    off line, 64
    on line, 64, 67
    synthetic, 75
ART-DACO, 31
AWB, 33, 34

benchmark, 5–6
branching factors, 124

cache, see also Futurebus
causality condition, 21, 24
continuous time simulator, 17
correctness
    architecture, 4, 13, 26, 36
    simulation algorithm, 13, 19, 23
critical path, 52

Dacapo, 30
demand driven simulator, 16
dinero, 34, 60
discrete time simulator, 18
distributed simulation, 20, 22, 59

efficiency of simulator, 25, 60
emulation
    assembly level, 63
    machine level, 62
evaluation, 4–5, 50–54
    of Oyster, 143
    tools, 4, 32
event driven simulator, 20
Exclusive line, 101

functional language, 13–26
Futurebus
    associativity, 112
    cache consistency, 101
    intervene, 102
    line size, 113
    simulation model, 106
    snoop, 102
    split, 104
    topology, 114

hardware simulation, 29

imperative language, 27
INSIST, 31
Invalid line, 101
ISPS, 32, 63

MFLOPS, 6
MiG, 67
MIPS, 6, 92
Miranda, 15
miss rate
    flat architectures, 127
    multiple level architectures, 135
    two level architectures, 132

object oriented, 27, 39
Oyster, 37–60
    analysis
        bandwidth, 53
        call graph, 52
        contention, 51
        profiling, 51
        utilisation, 50
    assembler, 57, 63
    cache, 55
    efficiency, 60
    interface, 58
    library, 54
    memory, 55
    Pearl level analysis, 54
    processor, 56, 63
    VLSI simulations, 58

Palladio, 32
PARET, 33
Pearl, 38–48
    clock, 44, 48
    communication, 42, 108
    kernel, 48
    objects, 40
    subtyping, 41
    typing, 40
performance model, 6–8
    flat architectures, 125
    hierarchical architectures, 130
    of Futurebus, 123–140
POOL, 28, 47, 87, 88
price, 4, 114
principles of simulation, 13

SADAN, 34, 35
sequential block, 69
Shared line, 101
SIMULA, 28, 39
simulation languages, 28
SmallTalk, 31
SPECmark, 6
speedup, 124

Test-And-Set, 73
top down design, 2–3
trace, see address trace

Unmodified line, 105

validation
    Futurebus model, 128, 138
    Futurebus simulator, 109
    performance models, 7
    PRISMA simulator, 92
    simulation models, 9
    synthetic trace, 75, 79

Verilog, 30
VHDL, 30
virtual time, 8, 22, 73
    in Pearl, 44, 48
    in SIMULA, 29

Summary (Nederlandse samenvatting)

The task of a computer architect is to design better and faster computers. The development and application of new techniques has in recent years resulted in faster computers that have more memory, are smaller and use less power than their predecessors. Because computers have become ever faster, and have ever more memory, it has become much harder to design computers.

One of the aspects (addressed in this thesis) is that the computer architect wants to know, already during the design, whether the computer will later meet the stated requirements. This problem also occurs in “ordinary” architecture: when an architect designs a bridge, the architect will want to know, before construction starts, whether it is, for example, strong enough. If it already becomes apparent at this stage that the bridge will not live up to expectations, the design can be adjusted to avoid mistakes. A sturdier bridge deck, an extra support, or a thicker foundation may offer a solution. It is important to find this kind of error at an early stage of the design, because it can then still be repaired easily.

To return to the design of computers: the architect of a computer wants to know whether the functionality is sufficient, and whether the computer is not too expensive, too slow or too large. (Depending on the application area these points may or may not matter: the computer in a washing machine must above all be small and cheap, the computers of NASA above all fast.) If it already becomes clear at an early stage that the computer will be too slow or too expensive, something can be done about it right away.

This thesis describes a tool that helps the computer architect determine the speed (the time a program needs to be executed) of a computer that has been designed but not yet built. The speed of an existing computer is easy to measure: with a stopwatch. Determining the speed of a computer that has not yet been built requires other techniques. There are two common approaches: constructing an analytical model of the computer and constructing a simulator.

An analytical model of the computer consists of a series of formulas that relate the speed of the computer to certain (technology dependent) quantities, such as the speed of a single switching element of the computer. The disadvantages of analytical models are that they are laborious to develop and hard to adapt.

Building a simulation means that a computer (the simulator) is programmed in such a way that it behaves like the designed computer. A well-known example is the flight simulator, a device in which pilots are trained to fly without a single drop of kerosene being needed. The computer of a flight simulator calculates what would happen to the aircraft in reality if the pilot were sitting in it. Simulating computers works in just the same way: the simulator is programmed so that the behaviour of the designed computer is imitated in detail. It is then relatively straightforward to determine the speed of the designed computer.

Over the years many simulators have been developed. The drawback of many of these simulators is that they are very generally applicable, but offer little support for a specific problem. This thesis describes the design of a simulation environment for computer architectures, called “Oyster”. Oyster is only suitable for the simulation of computer architectures, and precisely because of that Oyster can help in making an analysis of the architecture.

Oyster has been used for a number of experiments, two of which are described in this thesis: the simulation of the PRISMA architecture (a machine designed for the parallel processing of large databases) and the simulation of the Futurebus, a communication medium for parallel computers.

The PRISMA machine is equipped with multiple processors that each have their own piece of memory: a so-called “distributed memory multiprocessor”. The processors can exchange data over a high speed communication network. Because the processors themselves are very fast, the communication network forms a bottleneck: before the data can be sent over the network, it has to be packed into a message, which can then be sent over the network in pieces. In the current implementation of the PRISMA machine, packing and unpacking the data takes a lot of time. With the help of Oyster it can be investigated how this problem can be solved.

The Futurebus is (among other things) suitable for building a so-called “shared memory multiprocessor”: all processors operate on one and the same memory in which all data is stored. As a result, the processors no longer have to communicate explicitly over a network. A disadvantage of this kind of machine is that only one processor can use the memory at a time. To prevent all processors from requesting access to the memory at the same time, frequently used data is stored close to the processor in so-called “cache memories”. (In effect it thus becomes a kind of distributed memory machine again.) With the help of Oyster it can be investigated whether the use of caches is an adequate answer.

In both cases Oyster has indeed proven to be a useful tool for evaluating the architecture.