Automatic Generation of Multi-core Stressmarks


Page 1: Automatic Generation of Multi-core Stressmarks …lib.ugent.be/.../418/423/RUG01-001418423_2010_0001_AC.pdfWouter Kampmann, Lieven Lemiengre Automatic Generation of Multi-core Stressmarks

Wouter Kampmann, Lieven Lemiengre

Automatic Generation of Multi-core Stressmarks

Academic year 2009-2010
Faculty of Engineering, Chair: prof. dr. ir. Jan Van Campenhout
Department of Electronics and Information Systems

Master's dissertation submitted in partial fulfilment of the requirements for the degree of
Master of Science in Engineering: Computer Science

Supervisor: Stijn Polfliet
Promotor: prof. dr. ir. Lieven Eeckhout


Automatic generation of multi-core stressmarks
Wouter Kampmann and Lieven Lemiengre

Supervisor(s): Lieven Eeckhout, Stijn Polfliet

Abstract— This article describes a framework for the development of platform-portable stressmarks. Estimating the practical power and thermal characteristics of a processor is vital to evaluate power and thermal management strategies, to examine hotspots that may damage the processor or reduce the chip's lifetime, and to dimension cooling solutions and power circuitry. The proposed framework makes it possible to automatically generate optimized stressmarks for almost any platform.

Keywords— stressmark, platform-independent, synthetic benchmark, portable, power dissipation

I. INTRODUCTION

In the past few years, it has become apparent that the power and thermal characteristics of a processor have become a first-class design constraint. For a long time, the maximum processor power consumption increased by a factor of about two every four years [4], [3]. This trend could not continue, and it came to a halt around 2002, when the industry hit the power wall. Power consumption and thermal dissipation could not increase any further, and controlling the power and thermal characteristics became a primary concern, requiring attention at every stage of the microprocessor design flow.

Conventional benchmarks can be used to estimate the power and thermal characteristics of a typical workload. However, they are unsuitable for estimating the maximum power and operating temperature characteristics [1]. It is important to analyze the practical worst-case behavior of a processor.

The worst-case maximum power consumption and temperature can be used to develop power and thermal management strategies. Another application is dimensioning the thermal package and the power supply circuitry for the processor.

The current practice in industry is to develop hand-crafted stressmarks. These stressmarks are developed by specialists with very detailed knowledge of the microprocessor architecture. It is a tedious and time-consuming job, and the resulting stressmark is processor-specific, so the work has to be repeated whenever the micro-architecture is modified.

We developed a framework to automate the creation of stressmarks. We based our work on the StressMarker framework, which could automatically generate stressmarks for the Alpha 21264 microprocessor architecture [1]. The key idea is to generate synthetic benchmarks based on an abstract workload description; a machine learning algorithm then optimizes this workload description to induce certain thermal or power characteristics.

Our contributions:
• Our aim is to make the framework platform-portable, which we achieve by generating the synthetic benchmarks purely in C language constructs. The resulting C program is compiled for the target platform. This means that our framework is unaware of the underlying platform; the platform-specific details are filled in by the compiler. This allows the framework to generate stressmarks for a very wide range of systems. We verified the results for MIPS and x86-64 targets.
• A stressmark is described by a number of abstract parameters, each determining an aspect of either the target platform or the workload of the stressmark. We minimized the number of platform-specific parameters and specialized this workload model for generating stressmarks. We also extended the workload model to support generating multi-threaded stressmarks. The result is a lean workload model, specialized for stressmarks, that uses fewer parameters than the StressMarker framework [1], 30 instead of 40, while offering more functionality.

II. FRAMEWORK WORKFLOW

The framework workflow consists of four steps. We start with a workload description, which is transformed into a C stressmark. The C stressmark is compiled and then executed on a test platform. The measurements are fed into the machine learning algorithm, which generates an optimized workload, and the cycle is complete. As machine learning algorithm we use a genetic search algorithm.

Fig. 1. Framework workflow: abstract workload model → synthetic benchmark → measurements (SESC / HPC) → optimization.
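As a miniature illustration of this cycle (a sketch with our own names, not the framework's code; in the real framework, fitness is a power or IPC measurement of a compiled stressmark), a mutate-and-select loop in C might look like this:

```c
#include <stdlib.h>

/* Toy fitness function standing in for a measurement: it peaks at
   param == 42. In the framework, fitness would be the measured power
   or IPC of the stressmark generated from the workload parameters. */
static int fitness(int param) {
    int d = param - 42;
    return -(d * d);
}

/* Evolve a single integer parameter: each generation tries a handful
   of small random mutations and keeps the fittest candidate. */
static int evolve(int best, int generations) {
    for (int g = 0; g < generations; g++) {
        for (int i = 0; i < 8; i++) {
            int candidate = best + (rand() % 11) - 5; /* mutation in [-5, 5] */
            if (fitness(candidate) > fitness(best))
                best = candidate;                     /* selection */
        }
    }
    return best;
}
```

A real genetic search additionally keeps a population and applies crossover; chapter 4 of the thesis covers the actual optimizer.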

III. DEVELOPING THE WORKLOAD MODEL

Stressmarks are described by two kinds of parameters: platform parameters and workload parameters. The platform parameters are the number of hardware threads and the size of a cache line; these can easily be defined for any platform. The second kind describes the workload of the stressmark, and it is these parameters that the machine learning algorithm optimizes. They were chosen based on the research by Joshi et al. [1] and describe a set of hardware-independent workload characteristics. We tried to minimize the number of parameters, because fewer parameters result in a smaller search space, allowing the machine learning algorithm to work more efficiently.

A. Workload Parameters

The workload model consists of four major parts: the instruction mix, the minimal dependency between instructions, the data and instruction footprint, and the memory striding behavior.


A.1 Instruction Mix

This part consists of a high-level distribution of the proportions of arithmetic, memory, and branch instructions. For each instruction type, a more specialized distribution is then defined. For arithmetic instructions, it is the distribution of each arithmetic operation, defined by its datatype and numeric operation. Memory instructions are characterized as loads or stores, working in shared or thread-local memory. For branch instructions, their branch behavior is defined.

A.2 Minimum Dependency Distance

The dependency distance is the number of instructions between two dependent instructions. Since we only work with out-of-order processors, we only considered RAW (read-after-write) dependencies. Instruction dependencies limit the instruction-level parallelism, so this parameter is essentially a measure of the ILP.

A.3 Data and Instruction Footprint

The footprints are the numbers of unique data and instruction addresses referenced while running the stressmark. Their size affects the stress on the memory subsystem, particularly the caches.

A.4 Memory Striding Behavior

Memory instructions in the stressmark may exhibit some dynamic behavior. Some memory instructions read from or write to the same address every time they are executed. Others walk through memory, using a different address on every execution. We use data stream strides to model this behavior.

IV. GENERATING C BENCHMARKS

We want the framework to be platform-portable, meaning that, given the platform-dependent parameters, it should be capable of generating stressmarks for almost any platform, without knowing the instruction set or register set. To achieve this, we use the low-level programming language C instead of assembler to express the stressmark. Once the stressmark is compiled for the target platform, we obtain an executable stressmark.

One of the inherent difficulties of this approach is the optimizing behavior of the compiler. Compilers are made to remove redundant code, hoist loop invariants, etc., to increase the program's performance. However, when a stressmark is compiled, some compiler optimizations could change characteristics reflecting the workload model; these optimizations are undesirable. Unfortunately, we cannot wholly disable optimization, as we still rely on intelligent instruction selection and register allocation for efficiency and correctness.

We addressed the compiler optimization problem with two approaches. First, we dumbed the compiler down to a minimum level of optimization, using a predefined optimization level tweaked with special compiler flags. We then designed the structure of a stressmark to make it immune to the remaining optimizations of the dumbed-down compiler.

The resulting method is capable of generating effective stressmarks in C. The only significant drawback of our technique is the heavy register usage needed to maintain the minimum dependency distance: if the platform does not have enough hardware registers, there is a risk of register spilling.
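One common idiom for making a kernel immune to dead-code elimination, shown here as an illustration rather than the generator's actual output, is to chain the arithmetic through a live accumulator and finally store it to a volatile sink:

```c
/* Writing the final result to a volatile object gives the loop an
   observable side effect, so its arithmetic cannot be removed. */
volatile int sink;

static void stress_kernel(int iterations) {
    int acc = 1;
    for (int i = 0; i < iterations; i++) {
        acc = acc * 7 + 3; /* RAW-dependent arithmetic, not loop-invariant */
        acc ^= i;          /* also depends on the loop counter */
    }
    sink = acc;            /* observable side effect keeps the work alive */
}
```

Because every statement feeds the value that is ultimately stored to `sink`, even an optimizing compiler has to emit the loop body.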

V. RESULTS

A. Test Platforms

We used two platforms to verify the effectiveness of our framework. First, we set up a simulated SMP MIPS platform on which we optimize for maximal power usage. The other platform is a real-world system: an Intel Core 2 Quad processor that we optimized for maximum IPC. Unfortunately, on the latter the optimization gets stuck at a local maximum (IPC ≈ 3) because it does not use any memory instructions.

Fig. 2. Results: power (W) versus generation on the MIPS platform (top) and IPC versus generation on x86-64 (bottom).

VI. CONCLUSION

We showed that we can generate effective multi-threaded stressmarks using C as the implementation language. Because of this, our framework for automated stressmark generation is platform-portable.

REFERENCES

[1] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated microprocessor stressmark generation. In HPCA [2], pages 229–239.

[2] 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.

[3] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, Q1 2001.

[4] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202–210, 2005.


Preface

During the past nine months, we have been looking tremendously forward to the completion of this document you are now holding. We hope you enjoy reading it as much as we enjoyed the experience of preparing and writing it.

In the first chapter, we introduce the concept of stressmarks and briefly discuss some important trends and events motivating the subject of our master thesis.

The second chapter contains an overview of the workload model we defined, with a description of the different workload parameters it contains. We discuss the consequences of the design choices we made and how they relate to the characteristics of a generated stressmark.

In the third chapter, the main component of our framework, the stressmark generator, is explained. We take a closer look at how the workload model is transformed into an executable synthetic benchmark.

Chapter four explains how we employed a genetic algorithm to turn synthetic benchmarks into stressmarks, optimizing for an output characteristic such as power usage or the number of instructions per cycle.

The fifth chapter combines the components described in the foregoing chapters, giving a high-level overview of the entire StressmarkRunner framework. The two platforms we set up are discussed in preparation for the next chapter.

In the sixth chapter, we elaborately discuss the different results we obtained from running the framework on our two target platforms, and how we verified the correctness and performance of our genetic algorithm and stressmark generator component.

Chapter seven concludes this document by providing a final overview and some closing remarks on the work we did.


Acknowledgements

Producing a master thesis can be quite a daunting task. We would therefore like to thank all those who have supported and guided us throughout this endeavor, especially our supervisors, professor Lieven Eeckhout and Stijn Polfliet.

Wouter Kampmann and Lieven Lemiengre, May 2010


Usage restrictions

“The authors give permission to make this master dissertation available for consultation and to copy parts of this master dissertation for personal use. In the case of any other use, the limitations of the copyright have to be respected, in particular with regard to the obligation to state expressly the source when quoting results from this master dissertation.”

Wouter Kampmann and Lieven Lemiengre, May 2010


Contents

Preface
Usage restrictions
Acronyms

1 Introduction
1.1 Before the Power Wall
1.2 Hitting the Power Wall
1.3 Consequences
1.4 Stressmarks

2 The Workload Model
2.1 Stressmarks
2.2 Workload Parameters
2.3 Workload Summary
2.4 Discussion

3 Synthetic Benchmarks in C
3.1 Introduction
3.2 Language and Compiler Requirements
3.3 Exploring the Optimization Behavior of the Compiler
3.4 Interesting C Constructs
3.5 Forming the Stressmark
3.6 Remarks
3.7 Stressmark Generation

4 Stressmark Optimization
4.1 Introduction
4.2 Genetic Search Algorithm
4.3 Meta Algorithm

5 The Stressmark Runner Framework
5.1 Design Considerations
5.2 Platform Setups
5.3 Framework Architecture

6 Results
6.1 Number of SESC Instructions
6.2 Exploration of Search Space
6.3 GA Results
6.4 GA Efficiency
6.5 Theoretical Maximum

7 Conclusion

Bibliography
List of Figures
List of Tables


Acronyms

ACID Atomicity, Consistency, Isolation, Durability

ALU Arithmetic Logic Unit

API Application Programming Interface

BTB Branch Target Buffer

IDE Integrated Development Environment

ILP Instruction Level Parallelism

MDD Minimum Dependency Distance

RAW Read after Write

SIMD Single Instruction Multiple Data

WAR Write after Read

WAW Write after Write

XML Extensible Markup Language

YAML YAML Ain’t Markup Language


Chapter 1

Introduction

For a long time, the maximum power consumption of processors increased by a factor of a little more than two every four years. This evolution lasted about fifteen years, from 1986 until 2002. Around 2002 things changed: processor manufacturers hit the power wall, and ever since, the power consumption of processors has only marginally increased.

1.1 Before the Power Wall

Before we look at the consequences of the power wall, let us look at the period predating it. For about fifteen years, processor manufacturers were able to increase the single-threaded performance of their processors by about 50% every year [11]. How did they achieve this?

1. Moore's law: the number of transistors that can be placed inexpensively in an integrated circuit doubles about every two years [13].

2. RISC processors: using simple instructions that are easy to pipeline.

3. Out-of-order execution: mining ILP in single-threaded workloads by employing speculative execution.

4. Increasing the clock frequency: requiring deeper pipelines.

1.2 Hitting the Power Wall

This evolution came to an end around 2002. You could say that for a long time power was free and transistors were expensive, but now the situation had turned around [8]. What happened?


Figure 1.1: SPECint performance over the years (image source: [11]).

1. A higher clock frequency means more power consumption and heat dissipation, and cooling solutions can only take you so far: their cost increases exponentially with the thermal dissipation [9].

2. ILP wall: serial performance can be improved by mining more ILP, but this requires more speculative execution. The law of diminishing returns applies; at some point the hardware cost to mine more ILP starts increasing exponentially, which is not beneficial for the performance per watt.

3. Speed of light: it takes many clock pulses to transport data from one side of the chip to the other.

1.3 Consequences

For the past few years, the power consumption and the clock frequencies have basically stayed the same.

However, Moore's law still applies, so transistor budgets are still increasing at the same speed. To improve performance, processor manufacturers are now focusing on multi-core processors. Looking at the immediate future, we also see that more and more functionality is being integrated into the CPU. For example, Intel and AMD are going to launch a CPU with a GPU integrated on the die next year, memory controllers are already integrated on most CPUs, and PCI Express controllers are also being integrated on some Intel processors.


Figure 1.2: Power wall, frequency wall and ILP wall (image source: [14]).

It is clear that the power and thermal characteristics of a processor are becoming increasingly important. They have become a first-class design constraint for high-performance processors and should be considered at every stage of the microprocessor design flow.

1.4 Stressmarks

Conventional benchmarks can be used to estimate the power and thermal characteristics of a typical workload. However, they are unsuitable for estimating the maximum power and thermal characteristics [12]. There is actually an increasing disparity between the maximum power consumption and the power consumed while running more typical applications [9]. This growing difference presents the system designer with a difficult problem: the system should be designed to ensure the processor does not exceed the specified maximum operating temperature, even if those circumstances are extremely rare.


The worst-case behavior of a processor can be used for a number of applications:

1. Developing power and thermal management strategies for the processor. The processor could reduce its frequency if it is in danger of overheating; this can reduce the cost of the cooling solution. Recently, the opposite strategy has become possible as well: a multi-core processor using only one core could over-clock that core to improve single-threaded performance while the other cores are idle [2].

2. Finding hotspots: hotspots are small regions on a chip that dissipate a large amount of power for a short time. This localized overheating can reduce the lifetime of a chip, cause timing problems, degrade circuit performance, and even cause chip failure.

3. The worst-case behavior can also be used to dimension the cooling solution and the power supply circuitry for the processor.

Current practice in the industry is to develop hand-crafted stressmarks. These stressmarks are developed by specialists with very detailed knowledge of the microprocessor architecture. It is a very tedious and time-consuming job. Moreover, the resulting stressmark is processor-specific, so the work will have to be repeated if the processor architecture is changed.

We believe the process of generating stressmarks can be automated. Our work is based on the StressMarker framework [12], which could automatically generate stressmarks for the Alpha 21264. The key idea is to generate synthetic benchmarks based on an abstract workload description. A machine learning algorithm then optimizes this workload to induce certain thermal or power characteristics.

We have created a similar framework with a few new features:

• Our framework is platform-portable. We achieve this by generating synthetic benchmarks purely in C language constructs.

• We developed a workload model that is specialized for stressmark generation.

• We can generate multi-threaded stressmarks that communicate through memory to stress cache coherence protocols and inter-cache communication.


Chapter 2

The Workload Model

2.1 Stressmarks

Stressmarks are a kind of synthetic benchmark especially constructed to stress a specific part of the processor under test. In our case, these benchmarks are optimized using a machine learning algorithm to induce certain power or thermal characteristics.

A stressmark is described by a number of abstract parameters, each determining an aspect of either the target platform or the workload of the stressmark. The synthetic benchmark generator processes these parameters and generates the stressmark's code, written in C language constructs.

The first type of parameters describes the target platform, determining key properties of the processor the stressmark is being developed for. We opted to keep the number of parameters to a minimum in our framework, so we pick only the cache line size and the number of hardware threads as platform parameters. Note that these two generally apply to any processor.
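As an aside (our illustration; the framework takes these values as input rather than detecting them), on a Linux/glibc system both platform parameters can typically be queried at run time:

```c
#include <unistd.h>

/* Number of hardware threads currently online. */
static long hw_threads(void) {
    return sysconf(_SC_NPROCESSORS_ONLN);
}

/* L1 data cache line size in bytes. Some kernels report 0 here,
   so fall back to the common value of 64 bytes in that case. */
static long cache_line_bytes(void) {
    long size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
    return size > 0 ? size : 64;
}
```

Both `sysconf` names are glibc extensions, so a portable deployment would still accept the two values as explicit configuration.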

The second type of parameters describes the workload of the stressmark. These are the parameters that the machine learning algorithm optimizes in order to create an efficient stressmark. We have chosen them based on prior research by Joshi et al. [12], introduced some simplifications, and extended the parameters to support multi-threaded stressmarks. In the next section, we discuss these workload parameters in detail.

2.2 Workload Parameters

The workload model consists of four major parts: the instruction mix, the minimal dependency between instructions, the data and instruction footprint, and the memory striding behavior.


2.2.1 Instruction Mix

We start by defining a distribution describing the proportion of arithmetic, memory, and branch instructions. For each of these general instruction types, we then define another distribution, determining the relative frequencies of instructions of more specific subtypes (e.g. integer addition, double multiplication, etc. for the arithmetic instructions).

Arithmetic Instructions

Arithmetic instructions take two operand registers, perform some operation on them, and store the result in a register. These instructions are characterized by an operation and the data type of the operands. Our framework supports integer, single-precision, and double-precision floating point as data types. The supported operations are addition, multiplication, and division.

The relative frequencies of all arithmetic instructions that should be used in the stressmark are combined in an arithmetic instruction profile. These instructions should stress the ALUs responsible for arithmetic calculations in the processor.

Table 2.1: Example of an arithmetic instruction profile

Datatype   Operation   Relative frequency
Integer    Add         20%
Integer    Mul         20%
Integer    Div         20%
Double     Add         20%
Double     Mul         10%
Double     Div         10%
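The generator can draw concrete instructions from such a profile by weighted sampling over the relative frequencies. A minimal sketch (the function name is ours):

```c
/* Given relative frequencies freq[0..n-1] and a random value r in
   [0, sum of freq), return the index of the selected entry. With the
   profile above, index 0 would be "Integer Add", and so on. */
static int pick_entry(const int *freq, int n, int r) {
    int acc = 0;
    for (int i = 0; i < n; i++) {
        acc += freq[i];
        if (r < acc)
            return i; /* r falls inside entry i's frequency band */
    }
    return n - 1; /* only reached if r is out of range */
}
```

Repeating this draw once per generated instruction reproduces the profile's proportions in the emitted code.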

Memory Instructions

A memory instruction takes an address and either reads from that address and writes its value to a register, or writes the register value to the address. The address used by the memory instruction can point to a shared portion of memory used by all threads in the system, or to a local portion of memory specifically allocated for the thread. Consequently, we can define four kinds of memory operations in total: shared loads, shared stores, local loads, and local stores.


Stores and loads stress the ALUs responsible for memory operations, the store/load buffers, and the memory system, including the caches. If a memory operation uses shared memory, the instruction may cause extra inter-cache traffic; the amount of traffic depends on the cache coherence protocol. If there is a lot of contention between processors, this will cause stalls, negatively affecting the processor's throughput.

The relative frequencies of all memory instructions are combined in the memory instruction profile. The memory access pattern is not determined by this profile.

Table 2.2: Example of a memory instruction profile

Operation   Shared?   Relative frequency
Load        Yes       10%
Store       Yes       10%
Load        No        40%
Store       No        40%

Branch Instructions

There are two kinds of branch instructions: conditional and unconditional branches. Using branch instructions, we want to stress the ALU responsible for branch processing, the branch predictor, and other associated hardware structures (e.g. the BTB).

To stress the branching-related logic, we need to control the predictability of a branch. We can do this using the branch transition rate [10]: the number of times a branch switches between taken and untaken, divided by the number of times the branch is executed.

For example, a branch transition rate of 100% means that the branch constantly alternates between taken and untaken. Branch transition rates that are very high (90-100%) or very low (0-10%) give a highly predictable branch. If the branch transition rate is between 30 and 70 percent, the branching behavior is harder to predict.

In the workload model, we use the cumulative distribution of the inverse branch transition rate.


Table 2.3: Example of a branch transition rate distribution

Inverse branch transition rate   Relative frequency
1                                70%
2                                20%
4                                 5%
8                                 5%
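Measured over a concrete execution trace, the transition rate can be computed as follows (a sketch of ours; 1 marks a taken execution, 0 an untaken one):

```c
/* Branch transition rate in percent: the number of taken/untaken
   switches in the trace divided by the number of executions. */
static int transition_rate_pct(const int *taken, int n) {
    int switches = 0;
    for (int i = 1; i < n; i++)
        if (taken[i] != taken[i - 1])
            switches++;
    return 100 * switches / n;
}
```

An always-taken branch scores 0%, while a strictly alternating branch approaches 100% as the trace grows.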

2.2.2 Dependencies Between Instructions

If an instruction has to wait for the result of a previous instruction, there is a dependency. The dependency distance is the number of instructions between two dependent instructions. There are three kinds of instruction dependencies: WAW (write after write), RAW (read after write), and WAR (write after read). Since out-of-order processors can eliminate WAW and WAR dependencies, we will only consider RAW dependencies.

Instruction dependencies limit the instruction-level parallelism (ILP). In a stressmark, we want the ILP to be large enough to fully occupy all ALUs in the processor. We therefore define a minimum RAW distance in the workload model.
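As a hand-written miniature (ours, not generator output), the following C fragment keeps a minimum RAW distance of two: three independent dependency chains are interleaved, so at least two instructions separate each value's definition from its first use:

```c
/* Three interleaved chains; between the definition of r0 and its first
   use, the two independent instructions defining r1 and r2 execute,
   giving an out-of-order core parallel work to issue. */
static int mdd2_chain(int a, int b, int c) {
    int r0 = a + 1;
    int r1 = b + 2;  /* independent of r0 */
    int r2 = c + 3;  /* independent of r0 and r1 */
    int r3 = r0 * 2; /* first use of r0: RAW distance 2 */
    int r4 = r1 * 2; /* first use of r1: RAW distance 2 */
    int r5 = r2 * 2; /* first use of r2: RAW distance 2 */
    return r3 + r4 + r5;
}
```

This also illustrates why the technique is register-hungry: a minimum distance of d needs d + 1 live chains at once.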

2.2.3 Data and Instruction Footprint

The footprints are the numbers of unique data and instruction addresses referenced while running the stressmark. Their size affects the stress on the memory subsystem, particularly the caches.

The size of the instruction footprint determines whether the stressmark fits into the L1 instruction cache. The data footprint is twofold, defining the size of the global memory region on the one hand, and of the private memory region allocated for every thread on the other. The size of the global memory determines the contention probability when multiple threads read from or write to it.

In the workload model, we define the three parameters discussed above: the number of

instructions, the size of the global memory, and the size of the thread-local memory.

Table 2.4: Example of data and instruction footprints

Instruction footprint 400 instructions

Shared data footprint 32kB

Thread-local data footprint 512kB


2.2.4 Memory Striding Behavior

Memory instructions in the stressmark may exhibit some dynamic behavior. Some memory

instructions read from or write to the same address every time they are executed. Other

memory instructions walk through the memory, reading from a different address every time

they are executed. We will use data stream strides to model this behavior.

Memory instructions that walk through the memory do this with a constant step size, the

stream stride. The size of the stream stride is a multiple of the size of a cache line (defined in

the platform-dependent parameters). Every memory instruction is assigned a stream stride.

In the workload model, we define a distribution of stream strides. Memory instructions that

constantly read from the same address have a stream stride of size zero. Instructions with a

stream stride greater than zero will cycle through a part of the memory defined in the data

footprint.

Table 2.5: Example of a stream stride distribution

Stride value Relative frequency

0 80%

1 10%

2 2%

4 6%

8 2%

2.3 Workload Summary

In total, there are 30 variables in the workload model. While developing the workload model,

we tried to minimize the number of parameters, keeping only those relevant to developing

stressmarks. A workload model containing fewer parameters results in a smaller search space,

allowing the machine learning algorithm to work more effectively.

Each parameter is designed to stress a specific part of the processor. Although the parameters

may interact with each other, each one targets a single goal, and we avoid such overlaps as much as possible. For example, we could support more arithmetic operations (shifts or


subtractions), but we decided against this because we assume that the provided operations suffice to optimally stress all ALUs, rendering additional operations superfluous. If this assumption turns out to be false, our framework is set up so that new operations can be added with relative ease.

While developing the workload model, we also had to take into consideration that it eventually

will be converted into a stressmark. To be able to efficiently generate the stressmark code,

we defined some extra restrictions on some parameters. For example, the inverse branch

transition rate is restricted to powers of two. We will explain this more in detail in the next

chapter.

Table 2.6: Workload summary

Category # parameters

Instruction mix 3

Arithmetic instruction distribution 9

Memory instruction distribution 5

Branch transition rate 4

Inter instruction dependency 1

Footprint 3

Stream stride distribution 5

2.4 Discussion

2.4.1 Differences With Prior Work

In previous work by Joshi et al. [12], stressmarks were created based on a workload

model made to create synthetic equivalents of real benchmarks. This workload model was

composed of 40 parameters and had no support for multi-threaded workloads, nor was it

platform-portable. Our own workload model is based on this, but we would like to highlight

a few key differences:

- In the original stressmark paper, the dependency distance was a cumulative distribution. To create stressmarks it suffices to define a minimum dependency distance.

- We do not define a basic block size in our model.

- Because our benchmark is platform-portable, we cannot make assumptions about the latency of arithmetic operations. We define three operations: addition, multiplication, and division, for three datatypes (integers, floats, and doubles).


- We reduced the number of parameters to describe the stream stride distribution and the branch transition rate distribution, because we found that this relatively coarse granularity suffices for the generation of stressmarks.

- We added support for multi-threaded stressmarks by introducing operations on shared memory, affecting stressmark performance due to coherency.

2.4.2 Platform Portability

In a perfect world, we would be able to produce a completely platform-independent stressmark

that could be used to stress any given processor to its absolute limit. To approach this limit

in practice however, a stressmark needs to optimally stress as many processor components as

possible, and often exploit the specifics of the platform it is running on. It is clear that this

unfortunately renders stressmarks platform-dependent by their very nature.

Because full-fledged platform independence is not possible, our goal becomes maximal plat-

form portability; we design our framework making sure that it can generate stressmarks for

a wide range of platforms, and that the adoption of new platforms is relatively easy.

Our workload model is an excellent starting point in the achievement of this ambition, since

its abstract parameters can be applied to most platforms and new platforms can often be

easily supported by adding a few new parameters.

The next step is to generate the executable stressmark without breaking platform portabil-

ity. To achieve this, a widely supported low-level programming language, in our case the C

language, is used instead of assembler to express the stressmark (as in previous work). Using

a compiler for the target platform, we then obtain the executable stressmark. The framework

itself therefore does not know the instruction set or register set; it only needs a compiler sup-

porting the platform. The wide availability of compilers for different platforms thus ensures

platform portability.

We learnt however that our approach is not without limitations, as the following example

shows. To fully stress a modern processor we would have to support SIMD instructions in

the workload model. Supporting SIMD instructions is however problematic because there

is no standardized way to express them in C. We will explore a couple of possible solutions

further in this document, but unfortunately it is not possible to implement them in a way

that platform portability is absolutely guaranteed.


2.4.3 Branch Predictability

The performance characteristics of the branch predictor component, such as power usage and

heat production, are determined by the number of branch instructions on the one hand and

the misprediction rate on the other; it is therefore of crucial importance that our framework

can influence these two properties through the workload model.

For the number of branch instructions, this is trivial since this property directly corresponds to

the workload parameter determining the instruction footprint. For the branch predictability

however, there is no such workload parameter since it is not possible to accurately generate code with a specific branch misprediction rate. We will however show that the branch misprediction rate can be controlled indirectly by setting the branch transition rate. The latter

is indeed a workload parameter as it is perfectly possible to generate synthetic code with a

given branch transition rate.

In the paper "Branch Transition Rate: A New Metric for Improved Branch Classification Analysis", Haungs, Sallee, and Farrens [10] found that the branch misprediction rate for global as well as local branch predictors is determined by their transition rate and taken rate (i.e. the fraction of times the branch is taken). Due to design decisions about the workings of our stressmark generator, the taken rate of our branches is a fixed 50%, and the transition rate is 100%, 50%, 25%, or 12.5%.

Using the following graphs from Haungs et al., it can be deduced that the corresponding misprediction rates range from the lowest values (<5%, white) to the highest (>45%, black) with

two evenly spread intermediates.

Figure 2.1: Miss rates of local (left) and global (right) branch predictors for different classes

of branches, identified by transition rate and taken rate.


On the axes of these graphs, class 0 corresponds to 0-5%, class 1 to 5-10%, ..., class 4 to

20-25%, class 5 to 25%-75%, class 6 to 75%-80%, ..., class 9 to 90%-95%, and class 10 to

95%-100%. The values for our stressmarks with a fixed branch transition rate of 50% are

therefore located in the middle column, which contains widely varying values.

2.4.4 Transformation Aspects

We now draw attention to the various aspects of the transformation of the workload model into

the executable stressmark, which is performed by the stressmark generator. First, we need

to distinguish clearly between the workload model itself, which is the input of the stressmark

generator, and the effective workload. The latter contains the actual workload parameter

values of the generated stressmark when it is executed.

Randomized vs. Deterministic Stressmark Generation

Although the workload model defines the key characteristics of the stressmark that needs to be

generated, there are some aspects that are not explicitly determined by it; notable examples

are the taken branch rates discussed in the previous section, and the order of instructions.

During stressmark generation, these undefined aspects can in general either be determined at

random, or by reasonable design choices. If determined by design choice, the transformation

process will always produce the same stressmark for a given workload, but the number of

stressmarks that can possibly be created is reduced, and it cannot necessarily be guaranteed

that the design choice always produces the best stressmark possible. If chosen randomly,

the transformation process is no longer deterministic and a single workload model can then

generate different stressmarks during sequential runs of the stressmark generator.

Although we initially opted to determine some of the undefined aspects randomly, it became

clear later on that this was the wrong choice. The effective workloads produced by different

stressmarks based on the same workload model varied too much, causing the search algorithm

described further in this document to function inefficiently, since a given workload model no

longer corresponded to a single fitness value. Concluding that a deterministic transformation was indeed necessary, we eventually switched and tuned our design choices for the best performance. We also compared our search results against theoretical maxima to safeguard the efficacy of the framework.

Mapping Between Workload Model and Effective Workload

Note first that different workload models may sometimes result in the same stressmark and

therefore the same effective workload. For example, the instruction footprint may be 50 instructions while the number of memory operations is only 1%. In this case the framework will

generate zero memory instructions, which is of course the same stressmark as the one created

for the same workload model, but with 0% memory operations instead of 1%. Moreover,

since there are no memory instructions at all, additional memory parameters in the workload

model become irrelevant (i.e. data footprint, stride distribution, reads/writes, shared/non-

shared). Workload models differing only in these parameters will once again result in the

same stressmark.

It is now also clear that the effective workload is not always consistent with the workload

model. This is not only caused by duplicate mappings as illustrated by the example above, but

also by certain complexities within the stressmark generator algorithm. These are described

in the next chapter.

Multi-threading Aspects

The multi-threaded stressmarks are created from a single workload. This means that every

core will be running the same synthetic benchmark. The interaction between threads will

happen at the memory level. There are two situations of contention between threads. First,

threads may be competing over cache line ownership; in this case we stress the cache coherency

mechanisms. Second, contention may happen if threads compete for cache memory; this will

be the case if the size of the global memory combined with all the thread-local memory is

bigger than the total cache size.

We did not implement synchronization primitives such as mutexes in the stressmarks. These

primitives will typically cause the processor to stall for a while. Stalling is undesired behavior

for stressmarks.


Chapter 3

Synthetic Benchmarks in C

3.1 Introduction

In the previous chapter we described the workload model, a collection of program character-

istics that describe a stressmark. In this chapter we will examine how the workload model

can be transformed into an executable stressmark.

We want the framework to be platform-portable, meaning that it should be capable of gener-

ating stressmarks for almost any platform given the platform-dependent parameters without

knowing the instruction set or register set of the platform. To achieve this, we use

a low-level programming language instead of assembler to express the stressmark. Once the

stressmark is compiled for the target platform, we obtain an executable stressmark.

This is why our framework doesn’t have to know the platform it is testing; it only needs a

compiler that supports it.

3.2 Language and Compiler Requirements

We will start with defining the criteria for the low-level programming language. First, compi-

lation should be static; interpreted or JIT compiled languages will not do. Second, it must be

possible to express the various workload properties in the language constructs. And finally, a

high quality compiler should be available for almost any platform.

As low-level programming language we therefore chose the C programming language. It is so

well supported that it is the de facto standard among the low-level programming languages.

Virtually every platform has a C compiler and most have a highly optimized one.


3.2.1 Alternatives

An alternative language to C could be Fortran; it is less supported but it fits all the other

criteria perfectly. We opted for C over Fortran mainly because we have more experience

with it.

Note that in the end, the language choice may not matter all that much, since compilers like

the GNU Compiler Collection support many languages and use the same backend for every

language, making it quite unlikely that using another language will yield significantly better

or worse results. In fact the low-level language used is nothing more than an interface to

control the backend of the compiler for platform-specific code generation.

An alternative approach could be to skip the compiler frontend altogether and directly im-

plement the stressmark in the intermediate representation (IR) of the compiler. We could for

example use GCC’s GIMPLE/TUPLES or LLVM’s IR. Both compiler frameworks support a

lot of platforms, but not all of them. Some specialized embedded processors (such as Trident media processors or Microchip PIC processors) only have commercial compilers.

Using the compiler’s IR to express the stressmark would improve our control over the form

and properties of the stressmark. Implementing the stressmark in C instead requires significant effort to make sure that the critical stressmark properties are preserved after compilation, but taking this extra effort yields a significant portability advantage.

3.2.2 Expressing the Stressmark in C

Expressing the stressmark in C is quite simple. The language constructs allow us to easily

express all the behavior we want. In the following example 3.1 we illustrate the general form

of a stressmark. Beware that this is a very naive implementation of a stressmark, intended only as an example.

Before we start running the stressmark, we need to perform some initialization. The initial-

ization contains some variable declarations and the memory allocation.

The next part is the stressmark loop, consisting of a start block and the stressmark body. In

the start-block the next iteration of the stressmark body is prepared, and it is checked whether

the stressmark has finished. The stressmark body contains the actual behavior conform to

the workload model. It is important to note that there are no loops inside the stressmark

body.

Finally, in the finalization routine we free the allocated memory.


Figure 3.1: Global stressmark structure.

3.2.3 Compiling the C Stressmark

The role of the compiler in our framework is to fill in the platform-specific details. The com-

piler should perform optimal instruction selection and register allocation for the underlying

platform. At the same time, the compiler may not change the execution properties of the

stressmark as they are expressed in the low-level programming language.

The example may look like a perfectly working stressmark but after compilation, the result is

very disappointing. If we compile this example, we notice that the variables v1 and v2 have

disappeared. The branch operation and the arithmetic operations have been eliminated as

well. This is because they are functionally redundant; they do not contribute to the result

of the function, nor do they generate any effect. We also notice that the stride calculation

checks for division-by-zero, which is not needed since memSize will always be greater than

zero.

Listing 3.1: Compilation result with -O1


stride3 = (stride3 + 12) % memSize;
  400548:  addiu  v0,s0,12
  40054c:  div    zero,v0,s2
  400550:  bnez   s2,40055c <stressmark+0x4c>
  400554:  nop
  400558:  break  0x7
  40055c:  mfhi   s0
if (i-- == 0) break;
  400560:  addiu  s1,s1,-1
  400564:  beq    s1,v1,400578 <stressmark+0x68>
  400568:  sll    v0,s0,0x2
v2 = v3 + v2;              // arithmetic instruction
if (i & 2) v1 = v1 * v3;   // branch instruction
memory[stride3] = v3;      // memory instruction
  40056c:  addu   v0,v0,a0
  400570:  j      400548 <stressmark+0x38>
  400574:  sw     s3,0(v0)

This brings us to the disadvantages of using a low-level programming language as imple-

mentation target for stressmarks. While the compiler is very good at converting a program

into machine code, it will also optimize the program. For normal applications, optimizing is

of course beneficial as it eliminates unnecessary operations without changing the functional

behavior. However, when a stressmark is compiled, some optimizations could change char-

acteristics reflecting the workload model; these optimizations are undesired. Unfortunately,

we cannot wholly disable optimization as we still rely on intelligent instruction selection and

register allocation for efficiency and correctness.

We want to stress that the optimization tradeoff is very tricky to get right. If the generated

code is not optimized, the result is inefficient as the available registers and instructions are

not optimally utilized. If the compiler performs too much optimization, critical parts of the

stressmark may be optimized away, changing the stressmark behavior and jeopardizing its

conformity to the workload model.

In the remainder of this chapter, we will mainly focus on how to tune the compiler and

the structure of the stressmark to generate correct executable stressmarks. Before analyz-

ing the optimization countermeasures, we define which optimization behavior is required, or

acceptable.


3.2.4 Compiler Requirements

The compiler is required to perform two tasks: instruction selection and register allocation.

The workload is only correctly expressed in the executable stressmark if the arithmetic and

memory operations use registers. Stack operations and register spilling should be avoided as

much as possible.

The compiler is allowed to do some instruction rescheduling. The workload does not define

the instruction ordering; only the minimum dependency distance is defined. If the compiler

performs some instruction rescheduling, the minimum dependency distance could possibly

change. Such optimizations are not considered harmful. The compiler may have good enough

knowledge about the latencies of instructions to be able to reschedule instructions without

causing a slowdown. Instruction rescheduling across blocks is however not acceptable.

It is also the responsibility of the compiler to optimize the address calculation for memory

instructions.

3.3 Exploring the Optimization Behavior of the Compiler

We addressed the compiler optimization problem using two approaches. First of all, we

dumbed the compiler down to the minimum level of optimization using a predefined opti-

mization level, tweaked with special compiler flags. We then designed the structure of a

stressmark and made it immune to the remaining optimizations of the dumbed-down com-

piler.

From this point on, our results depend on the compiler and platform used. We use GCC 4.4 for verifying the x86-64 target and GCC 3.4 for verifying the SESC/MIPS target.

3.3.1 Configuring the Compiler

GCC has five optimization levels: -O0, -O1, -O2, -O3, and -Os. The lowest optimization level, -O0, is not useful since it does not perform any register allocation; from -O1 onwards it does. We fine-tuned the -O1 profile a bit more using flags to disable some loop optimizations.

Table 3.1: Used compiler flags

GCC 3.4 -fno-loop-optimize -mno-check-zero-division -fnew-ra -fno-if-conversion -fno-if-conversion2

GCC 4.4 -fno-tree-loop-optimize -fno-if-conversion -fno-if-conversion2


Through extensive trial and error, we found a combination of flags that suffices to reliably compile stressmarks. The counter-optimization methods used in the next part rely on these options.

This is a fragile part of our framework; if a new compiler is used, the user will have to configure

that compiler to fit our stressmark method. If the compiler cannot be configured correctly,

there are two solutions. One way is to change how stressmarks are generated by the framework

based on the behavior of the new compiler. The other solution is to change the framework

component that generates C into a component that generates assembler, thus giving up on

platform portability.

If you can configure the compiler correctly, you can generate stressmarks almost without any

customization. On top of that, GCC already supports many compilation targets and these

require no changes at all.

3.3.2 Analyzing Compiler Optimizations

In this section we will investigate the optimizations of the dumbed-down compiler and their

effects on the quality of the stressmark.

Redundant Code Elimination

Redundant operations are operations that do not contribute to the result of the function or

do not produce an effect, such as writing to memory.

Redundant functions

Table 3.2: Redundant function

C:
    void function() {
        int i, v1 = 2, v2 = 3, v3 = 7, v4 = 5;
        for (i = 0; i < 100; i++) {
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
    }

Assembler (MIPS):
    completely eliminated

The function in this example only contains dead code. Even the dumbed-down compiler will

completely eliminate this function.


Table 3.3: Alternatives

Alternative 1:
    int function() {
        int i, v1 = 2, v2 = 3, v3 = 7, v4 = 5;
        for (i = 0; i < 100; i++) {
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
        return v1 + v2 + v3;
    }

Alternative 2:
    void function() {
        volatile int effect;
        int i, v1 = 2, v2 = 3, v3 = 7, v4 = 5;
        for (i = 0; i < 100; i++) {
            v1 = v3 * v4;
            v2 = v2 - v4;
            v1 = v2 / v4;
        }
        effect = v1 + v2 + v3;
    }

There are two ways to prevent the function from being completely eliminated. In the first

alternative we make the return value dependent on the variables that are used. In the second

alternative we use a volatile variable to generate an effect. A volatile variable is always stored in memory, never in a register. Writing the sum of the used variables to memory makes it visible to other threads, thus preventing the variables from being optimized away.

Redundant operations Now that the function is not completely optimized away, we turn

our attention to the loop inside the function that represents a stressmark body.

Table 3.4: Redundant operations

C:
    for (i = 0; i < 100; i++) {
        v1 = v3 * v4;
        v2 = v2 - v4;
        v1 = v2 / v4;
    }

Assembler (MIPS):
    40051c:  move   v1,zero
    400520:  subu   a2,a2,a0               -> v2 = v2 - v4;
    400524:  addiu  v1,v1,1
    400528:  slti   v0,v1,100
    40052c:  bnez   v0,400520 <main+0x10>
    400530:  div    zero,a2,a0             -> v1 = v2 / v4;


After compilation, we can see that the compiler has eliminated one of the operations in the

loop body. The first and the third operation write to the same variable while this variable is

never read between these two writes. In other words, the first operation is redundant. This op-

timization behavior has significant ramifications for implementing the minimum dependency

distance.

Listing 3.2: Dependency distance

    ...
    v1 = v1 * v2;   // v1 Write
    v2 = v2 / v3;
    v3 = v3 + v4;
    v4 = v4 * v5;
    v5 = v4 / v1;   // v1 Read
    ...

The minimum dependency distance is defined as the minimum RAW distance. In the above

example the RAW distance is four. To achieve this distance without redundant operations,

we had to use five variables. If we were to increase the minimum dependency distance, the

required number of variables would increase proportionally.

Since we want all the variables to be stored in registers for optimal execution, a large minimal

RAW distance will require a lot of hardware registers. If there are not enough hardware

registers available, this will cause register spills. Not only is this undesired behavior, it is

also impossible for the framework to detect this happening. This is a limitation caused by

the use of a low-level programming language. If we implemented the benchmark directly

in assembler, it would be a lot easier to achieve very large RAW distances with only a few

hardware registers. In fact, we would probably drop the minimum dependency distance from

the workload model.

In table 3.5 we look at redundancy elimination in combination with conditional instructions, commonly known as "partial redundancy elimination". After compilation, the superfluous conditional expression is completely eliminated by the dumbed-down compiler. This

means that the redundancy removal even works across blocks within the loop.

Conclusion This is by far the most annoying optimization and to our knowledge there is

no way to prevent the compiler from applying it by tweaking the compiler flags. Whenever

we enable register allocation, the compiler will try to eliminate the most obvious unnecessary

instructions.


Table 3.5: Redundant blocks

C:
    for (i = 0; i < 100; i++) {
        if (i == 10) v1 = v3 * v4;
        v1 = v3 * v4;
        v2 = v2 - v4;
        v1 = v2 / v4;
    }

Assembler (MIPS):
    40051c:  move   v1,zero
    400520:  subu   a2,a2,a0               -> v2 = v2 - v4;
    400524:  addiu  v1,v1,1
    400528:  slti   v0,v1,100
    40052c:  bnez   v0,400520 <main+0x10>
    400530:  div    zero,a2,a0             -> v1 = v2 / v4;

Loop Invariants

The body of the stressmark is placed inside a loop. A typical compiler optimization is to

hoist loop invariants out of the loop. To avoid this optimization through the structure

of the stressmark, we would have to make sure that all variables within the loop are de-

pendent on a previous iteration of the loop. This would add a lot of complexity to the

stressmark generator. Fortunately we were able to avoid this by setting the compiler flag "-fno-tree-loop-optimize".

Table 3.6: Loop invariants

C:
    for (i = 0; i < 100; i++) {
        v1 = v3 * v4;
        v2 = v3 + v4;
    }

Assembler (MIPS):
    400518:  move   a0,zero
    40051c:  mult   a2,a1                  -> v1 = v3 * v4;
    400520:  addiu  a0,a0,1
    400524:  slti   v0,a0,100
    400528:  bnez   v0,40051c <main+0xc>
    40052c:  addu   v1,a2,a1               -> v2 = v3 + v4;

In the example (table 3.6) both instructions are loop-invariant. However if we look at the

compilation result, we can see that they are still inside the loop. The multiplication (v1 = v3 * v4) is placed at the beginning of the loop. The addition (v2 = v3 + v4) is placed in the

branch delay slot of the branch instruction.

We conclude that we needn’t worry about loop invariants in the body of the stressmark.

Constant Folding and Propagation

These optimizations are present in the dumbed-down compiler, but because of the "-fno-tree-loop-optimize" flag, they do not work on literals declared outside the loop.

These optimizations come in handy because they optimize some of the address calculation for

memory operations inside the stressmark loop. However, there are some cases where these

optimizations may cause trouble when combined with algebraic simplifications. Listing 3.3 is

a reduced problem case we encountered while testing the framework.

Listing 3.3: Constant folding and propagation

    [...]
    v1 = v2 / v2;   // v1 = 1 -> eliminated (algebraic simplification)
    [...]
    v3 = v1 * v1;   // v3 = v1 = 1 -> eliminated (constant folding and prop.)
    v5 = v1 + v1;   // v5 = 2 -> eliminated (constant folding and prop.)
    [...]
    v6 = v3 / v5;   // v6 = v3 >> 1 -> peephole optimization
    [...]

To avoid these optimizations, we came up with a few extra rules for generating the stressmark.

- All variables are initialized with different values.
- Instructions with two operands use two different registers as operands.

Code Layout and Branch Optimization

The compiler eliminates branches whenever it can, and by doing so also removes unreachable

code. These optimizations make it harder to implement static branches in the stressmark.


Table 3.7: Branch optimization

Unoptimized:

    if(condition) goto L3; else goto L2;
    L2: [...]
    L3: [...]

Optimized:

    if(condition) goto L3;
    L2: [...]
    L3: [...]

Unoptimized:

    [...]
    goto L3;
    L2: [...]   // nothing jumps to this block
    L3: [...]

Optimized:

    [...]
    L3: [...]   // the goto and the unreachable L2 block are eliminated

In listing 3.4 we show a solution to implement static branches in the stressmark. The for-

loop represents the stressmark body. In the finalization routine, we jump to the block that

otherwise would be eliminated.

Listing 3.4: Static branch implementation

    for(i = 0; i < MaxIter; i++) {  // stressmark body
        [...]
        goto L3;
    L2: [...]
    L3: [...]
    }
    goto L2;  // finalization routine

Instruction Rescheduling

Most instruction rescheduling optimizations only become available at the -O2 optimization level. We allow this optimization as long as it does not move instructions across blocks. In practice, we rarely see instruction rescheduling in the compiled code.

Exceptions

While compiling divisions and modulo operations for the MIPS target, the compiler generates code that checks for division by zero. By using the compiler flag "-mno-check-zero-division" we can disable this safety check.


3.4 Interesting C Constructs

Before we use all the knowledge we gained about the optimization behavior of the compiler

to implement a good stressmark, we look at some interesting language constructs in C that

may help constructing stressmarks.

3.4.1 Volatile Variables

Volatile variables in C are variables that may change in a way that is not predictable by

the compiler. Volatile variables are typically used to implement signal handlers, or to access

memory-mapped devices.

The volatile keyword prevents the compiler from storing the variable content in a register;

this means that writing to a volatile variable always results in writing to a static memory

address, and reading from a volatile variable always results in reading from a static memory

address.

Since the content of the memory is modified, these operations are effectful and cannot be

optimized away.

3.4.2 Const Variables

The value of const variables cannot be changed after initialization. Typically the const keyword does not improve performance at a sufficiently high optimization level, since the compiler can figure out by itself whether a variable will be modified. The crippled compiler, however, does need this information to improve the address calculation for memory instructions.

3.4.3 Control Flow

While researching how we could implement the flow control in the stressmark, we found many

alternatives. Some examples:


Table 3.8: Alternative control flow implementations

Alternative 1:

    while( i > 0 ) {
        i--;
        if( condition1 ) {
            a = b + c;
        }
        c = a * b;
        if( condition2 ) {
            d = e + f;
        }
        f = e * d;
    }

Alternative 2:

    start:
        if(i <= 0) goto end;
        i--;
        if(!condition1) goto L1;
        a = b + c;
    L1: c = a * b;
        if(!condition2) goto L2;
        d = e + f;
    L2: f = e * d;
        goto start;
    end:

In the example both alternatives are functionally equivalent and they compile to exactly the

same machine code. We opted to use gotos because they are more flexible; only gotos allow

us to implement unoptimizable static branches (listing 3.4).

3.5 Forming the Stressmark

We will explain the structure of the stressmarks in three stages. We start with only arithmetic

instructions, then add memory operations, and finally control flow to the stressmark.

3.5.1 Arithmetic Operations

The first stressmark is a simple loop containing only arithmetic operations. Stressmarks such as this one can be generated by the framework by providing a workload model with an instruction mix consisting of 100% arithmetic instructions.

The stressmark starts by initializing the used variables (vN), the loopcounter (i) and a division

variable (vDiv). The division variable is created to avoid division-by-zero problems. The next

part of the stressmark is the stressmark loop with a very small start block. The stressmark

body contains the arithmetic operations and it finishes by jumping back to the start block.

When the stressmark has finished, it returns the sum of all the variables to avoid optimization (section 3.3.2).


We simplified this example by using only integer operations. More typical stressmarks will use

a combination of integer and floating point operations. This stressmark contains one dynamic

branch instruction for the loop even though this is not defined in the workload model.

Listing 3.5: Arithmetic instructions

    int stressmark() {
        int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
        const int vDiv = 3;
    start:
        if (i-- <= 0) goto end;  // start block
        v1 = v2 + v3;    // int add
        v2 = v3 / vDiv;  // int div
        v3 = v4 * v5;    // int mul
        v4 = v5 + v1;    // int add
        v5 = v1 / vDiv;  // int div
        [...]
        goto start;
    end:
        return v1 + v2 + v3 + v4 + v5;
    }

The number of registers a stressmark uses is critical to avoid register spilling. Register usage breakdown:

- Necessarily stored in registers
  - Loop counter (i): 1
  - Variables (vN): minimum dependency distance + 1
- Optionally stored in registers
  - Division variable (vDiv): one register for every data type

How the compiler compiles the division variable depends on the number of available registers.

Sometimes it is loaded as a literal before the division; if not, the variable will continuously

be kept in a register.

3.5.2 Memory Instructions

To support memory instructions we need to declare thread-local (localMemoryArray) and shared (globalMemoryArray) memory regions. These regions are split into smaller arrays, one for each striding memory instruction.

We assign each striding memory instruction a dedicated array to avoid memory instructions

influencing each other’s behavior; more specifically, we want to avoid memory instructions

causing data of another memory instruction to be cached. The performance of a memory

instruction should after all be independent of other instructions.

To implement the striding behavior of the memory instructions, we use the variables lStrideN for local and sStrideN for shared memory instructions, with N equaling the stride distance. These variables are incremented by the stride value (= stride distance × cache line length) in the start block of the stressmark. A striding memory instruction will read from its array at an offset defined by lStrideN or sStrideN.

Non-striding memory instructions are a lot easier to implement; we can simply use volatile

variables (sCel1, lCel1, lCel2) to this effect.

Implementing memory instructions causes the initialization and finalization block to grow a

bit, but this of course does not affect the stressmark. The start block has grown as well and

it now contains some expensive modulo operations that will be executed at the beginning of

every iteration. If the stressmark body is sufficiently large, those operations shouldn’t have

an effect on the performance.

The implementation of non-striding memory instructions is very cheap. They read from or

write to an address equal to a static offset from the stack pointer, and can be executed in

a single instruction. Striding memory instructions require some address calculation. They

could be implemented more efficiently if every instruction in the loop had a dedicated hardware

register to store the address. However, since our implementation is already constrained by

the number of registers, we opted not to implement it in this way.

Listing 3.6: Arithmetic + memory instructions

    volatile int sCel1;

    int stressmark(int * const globalMemoryArray) {
        int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
        const int vDiv = 3;
        int * const localMemoryArray = malloc(...);
        int * const lArr1 = &localMemoryArray[0];
        int * const lArr2 = &localMemoryArray[100];
        int * const lArr3 = &localMemoryArray[200];
        int * const sArr1 = &globalMemoryArray[0];
        volatile int lCel1, lCel2;
        int lStride2 = 0, lStride4 = 0, sStride2 = 0;
    start:
        if (i-- <= 0) goto end;              // begin start block
        lStride2 = (lStride2 + 2*4) % 100;
        lStride4 = (lStride4 + 4*4) % 100;
        sStride2 = (sStride2 + 2*4) % 150;   // end start block
        v1 = v2 + v3;           // int add
        v2 = lArr1[lStride2];   // mem read stride=2
        v3 = v3 / vDiv;         // int div
        lArr2[lStride2] = v4;   // mem write stride=2
        v4 = lCel1;             // mem read stride=0
        v5 = lArr3[lStride4];   // mem read stride=4
        lCel2 = v1;             // mem write stride=0
        v1 = v2 / vDiv;         // int div
        [...]
        goto start;
    end:
        free(localMemoryArray);
        return v1 + v2 + v3 + v4 + v5;
    }

The number of registers used by the implementation has significantly increased. Register usage breakdown:

- Necessarily stored in registers
  - Loop counter (i): 1
  - Variables (vN): minimum dependency distance + 1
  - Stride offsets (lStrideN and sStrideN): 1 for every stride distance used by shared or thread-local memory instructions (e.g. 4 stride distances used by shared memory instructions and 2 by thread-local memory instructions means a total of 6 required registers)
  - Address of the localMemoryArray and globalMemoryArray: 2 registers
- Optionally stored in registers
  - Division variable (vDiv): one register for every data type


The variables localMemoryArray and globalMemoryArray should be stored in registers to

prevent them from being loaded before a striding memory instruction is executed.

The number of required registers is increased by a maximum of ten. Typically there will

only be a few striding memory instructions in a stressmark, so the actual number of extra

registers is lower. It is important to note that the non-striding memory instructions require

no registers because of the use of volatile variables.

3.5.3 Branch Instructions

The final step is to include branch instructions, which will cause some instructions to be executed conditionally. Note first that conditional execution is not allowed for every instruction type. Striding memory instructions must be executed each iteration because their offset is calculated at every iteration; if they are not executed, there will be a gap in their memory access pattern.

We want to implement the branch behavior as cheaply as possible in terms of register usage.

We reuse the loop counter to calculate if a branch will be taken or not. We use the lowest bits

of the loop counter. The first bit will constantly alternate, forming the pattern ...101010...,

which corresponds to an inverse branch transition rate of 1. The pattern of the second bit is

...110011001100..., corresponding to an inverse branch transition rate of 2, etc. This makes it very simple to implement branch instructions conforming to the workload model, as long as the inverse branch transition rates are powers of two. However, these branches are highly

regular and therefore very predictable. This is not so bad because we typically want very

predictable branches in a stressmark to reduce the stalling probability. A more advanced

implementation of branch transition rates can be found in the remarks (listing 3.9); note

that it uses a lot more registers though.

Using branches also has an effect on the minimum dependency distance. To preserve the

minimum dependency distance across multiple branch instructions, we have to take all possible

paths into account; this results in more registers being used.

Because of this, we reduce the number of instructions in a conditional block to a single instruc-

tion. It would be reasonable to think that the compiler optimizes the very small conditional

blocks, replacing them by conditional moves. However, we can disable this optimization using

the flags -fno-if-conversion and -fno-if-conversion2.

The example is simplified for better readability. The static branch implementation (listing 3.4) was omitted.

Listing 3.7: Arithmetic + memory + branch instructions

    // ibtr means "inverse branch transition rate"
    volatile int globalMemoryCel;

    int stressmark(int * const globalMemoryArray) {
        int i = 999999, v1 = 2, v2 = 3, v3 = 5, v4 = 7, v5 = 13;
        const int vDiv = 3;
        int * const localMemoryArray = (int *) malloc(...);
        int * const lArr1 = &localMemoryArray[0];
        int * const lArr2 = &localMemoryArray[100];
        int * const lArr3 = &localMemoryArray[200];
        volatile int lCel1, lCel2;
        int lStride2 = 0, lStride4 = 0;
    start:
        if (i-- <= 0) goto end;
        lStride2 = (lStride2 + 2*4) % 100;
        lStride4 = (lStride4 + 4*4) % 100;
        v1 = v2 + v3;               // int add
        v2 = lArr1[lStride2];       // mem read stride=2
        if (i & 2)                  // branch ibtr=2
            v3 = v3 / vDiv;         // int div
        lArr2[lStride2] = v4;       // mem write stride=2
        v5 = lArr3[lStride4];       // mem read stride=4
        if (i & 3)                  // branch ibtr=4
            v4 = lCel1;             // mem read stride=0
        lCel2 = v1;                 // mem write stride=0
        v1 = v2 / vDiv;             // int div
        if (i & 3)                  // branch ibtr=4
            v2 = v3 * v4;
        [...]
        goto start;
    end:
        free(localMemoryArray);
        return v1 + v2 + v3 + v4 + v5;
    }


Register usage breakdown:

- Necessarily stored in registers
  - Loop counter (i): 1
  - Variables (vN): minimum dependency distance + (minimum dependency distance × fraction of branch instructions) + 1
  - Stride offsets (lStrideN and sStrideN): 1 for every stride distance used by shared or thread-local memory instructions (e.g. 4 stride distances used by shared memory instructions and 2 by thread-local memory instructions means a total of 6 required registers)
  - Address of the localMemoryArray and globalMemoryArray: 2 registers
- Optionally stored in registers
  - Division variable (vDiv): one register for every data type

All that remains to be done at this point is to start the stressmark threads.

Listing 3.8: Starting stressmarks

    void * runstressmark(void * ptr) {
        stressmark((int * const) ptr);
        return NULL;
    }

    int main(int argc, char** argv) {
        int t0, t1, t2, t3;
        globalMemoryArray = (int *) malloc(...);
        sesc_init();
        sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
        sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
        sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
        sesc_spawn((void *) *runstressmark, (void *) globalMemoryArray, NULL);
        sesc_wait();
        sesc_exit(0);
        return 0;
    }

The main method allocates the global memory and starts as many threads as there are hardware threads. The threading implementation is platform-specific; for the MIPS target we use a simulator-specific library (SESC threads), and for the x86 target we use Pthreads.


While designing the stressmark form, we mainly focus on avoiding compiler optimization and

minimizing the register usage to ensure the ILP can be maximized.

3.6 Remarks

3.6.1 Register Usage

It is clear that our stressmarks will perform best on architectures with a large number of registers. Most modern microarchitectures are RISC-based and therefore fulfill this requirement. The big exception is the 32-bit x86 microarchitecture; after some quick testing, we concluded that our framework is not capable of generating stressmarks for this register-starved microarchitecture. We then evaluated our framework with x86-64, and found that although we can generate stressmarks for this platform, their effectiveness is unclear.

3.6.2 Alternative Branch Transition Rate Implementation

Listing 3.9: Alternative BTR

    unsigned int i, btrIndex;
    // pattern = 11000010110010001100001011001000
    unsigned int btr2 = 0xC2C8C2C8;

    for (i = 0; i < 10000; i++) {
    startblock:
        btrIndex = i % 32;
    stressmarkbody:
        [...]
        if ((btr2 >> btrIndex) & 1) {
            [...]
        }
        [...]
    }

This is a way to implement branch transition rates more accurately. The patterns produced

by this method are much less predictable. This implementation consumes a lot more registers:

one for every branch transition rate.

It could be worthwhile to expose this choice as a platform-dependent workload parameter, depending on whether the platform has many registers or few. This method is probably more suited for synthetic benchmarks modeling the load of manually written benchmarks than for our stressmarks.


3.6.3 Differences Between Workload Model and the Compiled Stressmark

There are two reasons why the workload model and the properties of the compiled benchmark, i.e. the effective workload, can differ.

The first reason is that not all operations are necessarily compiled into a single instruction;

on some platforms, they can instead be compiled into a series of instructions. For example,

when compiling for a MIPS target, integer multiplication and division operations are each

transformed into two instructions: one for the operator itself, and one for storing the higher

bits of the result value. Load instructions can even result in up to five instructions.

The second reason is that a stressmark sometimes uses too many registers and thus introduces spilling. As we cannot detect this, there is unfortunately not much we can do about it.

The important thing to notice however, is that the stressmark needn’t be exactly consistent

with the workload model because it is optimized by a search algorithm that is not aware of

the semantics of the workload model anyway.

The type of algorithm we use, a genetic search algorithm, is especially robust in this respect because it distinguishes between a genotype and a phenotype. The former,

corresponding to the workload model, is always an abstraction of the latter, in our case the

stressmark. The crucial point is that the fitness value will always reflect the effective workload,

regardless of the exact nature of the relation between the stressmark and its workload model.

Therefore, as long as all relevant aspects of the stressmark somehow can be controlled by

modifying the workload model, the search algorithm will always select for the best solution.

That being said, the search algorithm becomes more effective as the workload model more effectively controls the relevant aspects of the stressmark. This is best achieved by gaining insight into the stressmark's important qualities and making sure the workload model properly reflects them.

3.6.4 Exploration of the Implementation of SIMD Instructions

Using Auto-vectorization

We can represent SIMD operations as in this simplified example:

Listing 3.10: Auto-vectorization

    int i, a[4], b[4], c[4];
    for (i = 0; i < 4; i++) {  // may compile to 1 SIMD instruction
        a[i] = b[i] + c[i];
    }


Using GCC 4.0+ with the -O2 optimization level and the option "-ftree-vectorize", GCC can optimize these kinds of code patterns into SIMD instructions.

This approach is portable across platforms that support similar SIMD instructions, making it a good candidate to be defined as a platform-dependent workload parameter. The GCC developers are working on methods (using pragmas) to make auto-vectorization more predictable in the future.

However, this approach has a number of serious disadvantages. First of all, it only works at a high optimization level (-O2), a level our method cannot work reliably with. These kinds of tricks are also somewhat compiler-dependent, as not all C compilers support auto-vectorization. The transformation is therefore currently not guaranteed, rendering it unpredictable.

Using Intrinsics

Another approach is to use intrinsics for the implementation of SIMD operations. We do not

use this approach because it is both compiler and architecture-specific. It does however give

direct control.

3.7 Stressmark Generation

The stressmark generator is built using the pipe and filter architectural pattern. This made it

easy to replace parts of this software pipeline or compare alternative implementations. There

are four distinct phases in the generation process.

[Figure: a pipeline with four phases: backbone generation, operation distribution, variable allocation, and C generation. The workload model and the platform are its inputs; the stressmark is its output.]

Figure 3.2: Stressmark generation.

3.7.1 Backbone Generation

In the first step, a graph made out of basic blocks is created. Each block is assigned an id, a

branch transition rate, a size and the links to its parent(s) and child(ren).


The number of blocks depends on the proportion of branch instructions in the instruction

mix and the trace size.

3.7.2 Operation Distribution

In this step, we determine which operations should be executed in each block and in what order. The algorithm also makes sure that striding memory operations do not get placed in conditional blocks.

Arithmetic operations are a data type (integer, float, double) combined with an operation

(add, multiplication, division).

Memory operations are a load or store in shared or non-shared memory with a certain stride

value.

We have two ways to distribute the operations: random distribution and smooth distribution.

- Random distribution means the operations are randomly distributed across the blocks. This causes the stressmark to become inconsistent (a variation of 10 to 20% in power for the same workload).

- Smooth distribution means the operations are chosen deterministically. To choose an operation, the algorithm looks at the instruction profile of the previously chosen operations, compares it to the instruction profile in the workload model, and picks the operation whose frequency deviates the most from the frequency demanded by the workload model's instruction profile.

Variable Allocation

This step assigns variables to the operations determined in the previous step. The variables are chosen such that they cannot be optimized away and the minimum dependency distance is preserved. The algorithm minimizes the number of variables used. In this step the memory sizes are also determined, and each memory instruction is assigned its dedicated array.

When this step is finished, the stressmark is essentially ready; all the information to create

the stressmark is available.

C Generator

Now the stressmark is converted to C code. This step is platform-dependent, resulting in a

couple of different versions:

Page 47: Automatic Generation of Multi-core Stressmarks …lib.ugent.be/.../418/423/RUG01-001418423_2010_0001_AC.pdfWouter Kampmann, Lieven Lemiengre Automatic Generation of Multi-core Stressmarks

3.7 Stressmark Generation 38

- Single-threaded: the most platform-portable version
- SESC multi-threaded: a version to run multi-threaded benchmarks on SESC
- Pthread: a version using the POSIX threading implementation to run multi-threaded on Linux
- Pthread with hardware performance counters: a special version to measure the IPC value of the stressmark

Using PAPI to Measure IPC

Using hardware performance counters in Linux is not a trivial affair. We use kernel 2.6.30, which has built-in support for hardware performance counters. To access the counters, we use the high-level API provided by PAPI 4.0 [4], which enables us to directly measure the IPC value. To make the measurements accurate, we perform them ten times and take the highest value.


Chapter 4

Stressmark Optimization

4.1 Introduction

[Figure: a feedback loop in which an abstract workload model is turned into a synthetic benchmark, measurements are taken (SESC / HPC), and the optimization step produces the next workload model.]

Figure 4.1: Optimization process.

The foregoing chapters describe in detail the abstract workload model we use, and how this

model can be transformed into a platform-portable synthetic benchmark, which can then be

executed on its target platform. During the execution of the benchmark, several characteristics can be measured that reflect performance or the design requirements and constraints of the microprocessor architecture: power usage, temperature, throughput, etc. We call these parameters the output characteristics of the benchmark.

The goal of a stressmark is to stress the processor, i.e. to induce extreme behavior by max-

imizing (or minimizing) one of these output characteristics. In order to reach this goal, we

start from a random initial workload model and then generate and execute the correspond-

ing stressmark. We measure the output characteristic of our choice, and then optimize the

result by using a genetic algorithm that generates new workload models and selects for this

characteristic.


4.2 Genetic Search Algorithm

4.2.1 Concepts

A genetic search algorithm is a heuristic that can be used to solve optimization and search

problems by employing the principles of biological evolution. Potential solutions to the prob-

lem, the individuals, are grouped in generations. Each individual in a generation has a

genotype, a phenotype, and a fitness value. The genotype, in our case the workload model, is

an abstract representation of the phenotype, in our case the corresponding stressmark. The

fitness value, a property of the phenotype, is the parameter getting optimized.

The algorithm proceeds by creating a new generation of individuals based on the existing

one. This is done in two steps: selection and reproduction. Through selection, two parent

individuals are picked from the current generation in order to produce an individual for the

new generation (also known as the child). This selection is based on the fitness values of the individuals in the current generation: individuals with a higher fitness value have a higher probability of being selected. Reproduction is done in two phases: crossover and mutation.

During crossover, the genotypes of both parents are combined to form the genotype of the

child. In this way, crossover is a possible means to select the best properties of both parents,

creating a solution with higher fitness than each parent. After crossover, mutation is applied

by randomly varying the genotype of the child with a certain probability. The function of

mutation is to explore new possible solutions in the search space.

After all individuals of the new generation have been created using selection, crossover, and mutation, their fitness values are calculated and the process repeats itself. If the algorithm is working correctly, the average fitness value should increase throughout.

One more principle that we applied is elitism. When a new generation is produced, the

best solutions of the previous generation are normally lost, since new individuals are usually

not identical copies of the previous generation. Elitism prevents this by copying the best

individual(s) of the previous generation to the new generation without altering them. The

elitism parameter is the number of individuals copied.

4.2.2 Domain and Fitness Value

The dimensions of the search space of our genetic algorithm are defined by the parameters of

the abstract workload model, some of which are discrete, and some of which are continuous.

As mentioned in the chapter on the workload model, the number of parameters is kept as

low as possible to reduce the size of the search space to a minimum and speed up the search

process. Nevertheless, a total of 30 parameters remains.


In principle, the fitness can be any measurable output characteristic of the stressmark’s ex-

ecution. We set up a simulated SMP MIPS platform on which we optimize for maximal

power usage, and an Intel Core2Quad x86-64 platform on which we optimize for maximal

IPC (instructions per cycle).

4.2.3 Configuration

Using the meta algorithm discussed further in this chapter, we decided on a generation size

of 72 workload individuals, a mutation probability of 10%, a crossover probability of 80%,

and an elitism value of 1.

4.2.4 Genetic Operators

Mutation

Mutation of a workload model is done by iterating over each of its parameters. If a parameter

is selected for mutation (depending on the mutation probability), its value is randomized.

The randomization process is dependent on the parameter type:

• Integer parameters have a maximum value, a minimum value, and a step size defined. The new value is min + n · step, with n a random integer between 0 and ⌊(max − min)/step⌋.

• Double parameters are randomized in the same way as integer parameters, except that the value of n is not capped.

• Parameters with enumerated values are set to a random element of the enumeration set.

• So-called FixedSumParameters, used to represent instruction mix distributions, are vectors with components totaling a predefined sum, usually 100. These parameters are randomized by picking a single one of their components and setting its value to a random number between 0 and the predefined fixed sum. As the total sum of the components will then no longer be correct, the vector is rescaled to fix this.
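The randomization of these parameter types can be sketched as follows (illustrative Python; the concrete minimum, maximum, step, and mix values are hypothetical):

```python
import random

def mutate_integer(minimum, maximum, step):
    """New value is minimum + n * step, with n drawn uniformly from
    0 .. floor((maximum - minimum) / step)."""
    n = random.randint(0, (maximum - minimum) // step)
    return minimum + n * step

def mutate_fixed_sum(vector, total=100.0):
    """Randomize one component of a FixedSumParameter, then rescale the
    vector so its components again total `total`."""
    v = list(vector)
    i = random.randrange(len(v))
    v[i] = random.uniform(0.0, total)
    scale = total / sum(v)
    return [x * scale for x in v]

random.seed(1)
mix = mutate_fixed_sum([40.0, 30.0, 20.0, 10.0])  # a hypothetical instruction mix
```

The rescaling step is what keeps an instruction mix a valid distribution after one of its components has been replaced.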

Selection

For selection we tried tournament selection and proportionate (or roulette-wheel) selection, and found that the latter gives the best results. Proportionate selection elects two parents for crossover by picking them one after the other from the entire population, with a probability directly proportional to the fitness of the individuals. It is possible that the same individual is selected twice.


Crossover

Crossover of two workload models begins by creating an empty workload model for the result

and selecting one of the parent workloads. The algorithm then fills the empty result workload

by iterating over the workload parameters. In each iteration, the algorithm switches the selected parent with the other with a given probability, and then sets the result workload's value for the considered parameter to the value of the selected parent.
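This parameter-switching crossover can be sketched as follows (illustrative Python; the switching probability `p_switch` is an assumed value, as the text only speaks of "a given probability"):

```python
import random

def crossover(parent_a, parent_b, p_switch=0.5):
    """Fill the child parameter by parameter, copying the value of the
    currently selected parent and switching parents with probability
    p_switch before each copy."""
    selected, other = parent_a, parent_b
    child = []
    for i in range(len(parent_a)):
        if random.random() < p_switch:
            selected, other = other, selected
        child.append(selected[i])
    return child

random.seed(7)
child = crossover([1, 2, 3, 4], [10, 20, 30, 40])
```

Every child parameter thus comes from one of the two parents at the same position, which is what lets crossover combine the good properties of both.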

4.3 Meta Algorithm

Although genetic algorithms are quite easy to implement, it is often unclear in the beginning

exactly which design choices should be made and how they are best configured. Four settings

have to be set right in order for the algorithm to function efficiently:

1. Population size. The population size parameter is all about getting the trade-off right

between generation size and the number of generations. It is often unclear whether a

large number of small generations or a small number of large generations will yield the

best results.

2. Mutation probability. High mutation probability increases the randomness and variation

in the population; it makes the search less directed. Low probabilities on the other hand

make the search more vulnerable to end up in local extrema.

3. Crossover probability. This probability has again an impact on the variation, but in a

different way. A population is often partitioned in different classes of good solutions

(think species), each class containing individuals with the same quality that positively

affects their fitness value, a quality that is different from the qualities of individuals

of other classes. A high crossover probability will reduce the number of these classes, heavily mixing them until they merge, and so lowers variation; a low probability will encourage the forming of these classes, while risking that they grow apart without ever getting the chance to combine their qualities into an even better result.

4. Elitism. If individuals with high fitness values are preserved, they feed their properties into the population every generation, keeping the search process "on track", but they also risk outcompeting solutions with alternative qualities, even though these may yield better results in the long run.

Because the problem of getting these parameters right is not a trivial one, we implement a

simple meta search algorithm that optimizes them for us.


4.3.1 Domain and Value

The four settings discussed in the previous section are the dimensions of our search space, and

the search method is a simple hill climbing algorithm. We consider the following possibilities

for each dimension:

1. The couple (generation size, generation count) can be (18, 12), (36, 6), (54, 4), or (72,

3). The product of the two components, i.e. the total number of individuals, is always

216.

2. The mutation probability varies between 0 and 1.

3. The crossover probability varies between 0 and 1.

4. Elitism is expressed relative to the size of the population, between 0 and 0.25.

The value of each point in the search space is determined by using its components to run a genetic algorithm, and then calculating the gain in average fitness of its individuals between the first and last generation. When comparing the values of points, care is taken that the

algorithm executions for each point start from the same initial population.
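The evaluation of a configuration point can be sketched as follows (illustrative Python; `run_ga` is a hypothetical stand-in for a full genetic algorithm run that returns the fitness values of the first and last generation):

```python
def config_value(run_ga, config, initial_fitnesses):
    """Value of a point = gain in average fitness between the first and
    last generation; every point is evaluated starting from the same
    initial population."""
    first_gen, last_gen = run_ga(config, initial_fitnesses)
    avg = lambda fits: sum(fits) / len(fits)
    return avg(last_gen) - avg(first_gen)

def run_ga(config, initial_fitnesses):
    # Hypothetical stand-in: pretend the GA multiplies every fitness by a
    # configuration-dependent gain factor.
    return initial_fitnesses, [f * config["gain"] for f in initial_fitnesses]

gain = config_value(run_ga, {"gain": 1.5}, [1.0, 2.0, 3.0])
```

Fixing the initial population across evaluations makes the comparison between configuration points fair, since the measured gain then depends only on the configuration.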

4.3.2 Neighborhood Concept

In order to implement a hill climbing algorithm, a neighborhood concept needs to be defined

first. Neighbors are typically found by slightly increasing and decreasing the component

values of the different dimensions of a point. Because the evaluation of each point requires

the running of an entire genetic algorithm, in our case the number of neighbors has to be

limited as much as possible. We therefore allow only one dimension's value at a time to be modified.

The step size we use for the modification of the mutation, crossover, and elitism parameters

is dependent on a zoom level. At zoom level 0, the step size is one sixth of the entire domain

(e.g. between 0 and 1); at zoom level 1, the step size is (1/6)² · domain size, and so on. At

the beginning of the search process, the zoom level is set at 0; it is increased whenever the

algorithm has reached an extremum, gradually refining the result.
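The zoom-dependent step size and the one-dimension-at-a-time neighborhood can be sketched as follows (illustrative Python; the function names are ours, not the framework's):

```python
def step_size(domain_min, domain_max, zoom):
    """At zoom level z, the step is (1/6) ** (z + 1) of the domain size."""
    return (domain_max - domain_min) * (1.0 / 6.0) ** (zoom + 1)

def neighbors(point, dim, zoom, bounds):
    """Neighbors vary a single dimension's value up and down by the
    zoom-dependent step, staying inside that dimension's bounds."""
    lo, hi = bounds[dim]
    step = step_size(lo, hi, zoom)
    result = []
    for candidate in (point[dim] - step, point[dim] + step):
        if lo <= candidate <= hi:
            neighbor = list(point)
            neighbor[dim] = candidate
            result.append(neighbor)
    return result
```

Increasing the zoom level whenever the climb stalls shrinks the step by a factor of six, refining the extremum found at the coarser level.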


Chapter 5

The Stressmark Runner Framework

5.1 Design Considerations

Every software engineering student is familiar with several methods of software development,

ranging from the traditional, rigid waterfall model to newer, agile approaches like the extreme

programming method. This was however the first time we needed to develop a fairly large

piece of software within a research context. We found that such an environment gives rise to

specific requirements and challenges. The most important ones are discussed in the ensuing

paragraphs.

5.1.1 Scala/Java

Probably the most important requirement of the software development process in a research

environment is that the researcher should be able to focus on his core job and quickly imple-

ment the desired functionality while not being distracted by practical issues concerning the

programming language or the development environment. Ease of use should encourage him

to experiment and try out alternative solutions to the research problem he is examining. Fi-

nally, since the point of research is to discover new things, the functional requirements of the

software being built will change constantly and so the development process and programming

language need to facilitate these changes.

This is why we picked the Scala programming language for our project. Scala is a high-level

programming language that is compiled into byte code running on the Java virtual machine.

It is two-way compatible with Java, meaning that Scala code can be called from within a

piece of Java code as well as the other way round.

There are several benefits that make Scala a suitable language for developing research tools

and getting work done quickly in a research context. The first is expressivity. The more expressive a programming language, the more compact its notation, and the more work can be done

in less time. While high-level programming languages in general tend to be more expressive

than low-level programming languages, for Scala this is particularly the case. A standard

implementation of the well-known quicksort algorithm gives a fair idea of the expressivity of

Scala.

Listing 5.1: Scala quicksort example

def qsort[T <% Ordered[T]](list: List[T]): List[T] = list match {
  case Nil => Nil
  case pivot :: tail =>
    val (before, after) = tail partition (_ < pivot)
    qsort(before) ::: (pivot :: qsort(after))
}

The implementation, using only standard language constructs, contains merely six short lines

of code. The resulting function is general enough to sort comparable objects of any type.

Since Scala is a functional programming language, and a lot of attention has been paid to

collections and syntactic flexibility, the code is very clear and expresses the intention of the

programmer in a natural way that closely follows the core line of reasoning of the algorithm.

Implemented in C++, the code would be several times larger, and a helper function

to partition a list of objects would have to be written, distracting attention from the core

functionality.

The second benefit is a corollary of the fact that Scala is intertwined with Java; because of

the two-way compatibility, any Java library can be called directly from Scala without using

glue code or compromising syntactic clarity or brevity. Since a huge number of Java libraries

can be found on the internet, most of them available for free and fairly well-documented, this

is hugely beneficial.

Comparing Scala to lower level languages, other advantages become clear. The type system

prevents basic errors yet is flexible enough not to be restrictive, and of course the developer

needn’t bother with memory allocation and pointers, two concepts leading to numerous bugs

in C/C++ that are often very hard to track down.

5.1.2 Apache Ant

Unfortunately, the use of a high-level programming language presents a big challenge as well.

Within a research environment a lot of interoperating tools are used, and often these tools are

written in a number of different programming languages. The environment is by its nature a

heterogeneous one, despite the fact that a tight integration is often key to work efficiency. This


is even more so the case since the development environment and production environment are

one and the same; the researcher is at the same time developer as well as end-user, constantly

trading one role for the other.

The only point where all these research and development tools come together is the command

line environment and this is where the trouble lies. Since Scala is executed on the Java

Virtual Machine which makes it platform-independent, communication with the operating

system and the command line environment becomes notoriously burdensome. A lot of glue

code is often required and it is hard to handle errors properly.

This is why we chose to use the Apache Ant tool, which was originally made to ease the build

and deployment processes of software applications written in Java. Ant is the bridge between

the Java and the command line environment. Itself written in Java and supported by the

open-source community, it has a well-documented API that is readily available to the Scala

developer. Using this API, it becomes a lot easier and safer to execute file system operations

and invoke command line scripts and tools.

5.1.3 The Mirrored Command Suite Pattern

With Ant firmly in place, and so having joined our Java/Scala environment with the third-

party tools we work with, we needed to expose the functionality of our framework in a way

that it could be readily used to experiment and run tests. At first, we tried to write an

Ant project file in XML to accomplish this. This seemed reasonable, as Ant project files support all the functionality we needed. The way to construct a project configuration for Ant is to

define different build targets, which nicely corresponded to the different functional units our

framework implements. We could then execute these targets via the command line, which we

figured would be a convenient way to experiment.

In theory this approach still sounds fine, but unfortunately we would soon find out that in the

end, the proof of the pudding is in the eating. The problem was the complexity and overhead

of working with the Ant configuration file; although it was possible to run commands and

algorithms written in Scala as well as command line tools by defining build targets in the XML

file, it turned out to be a real pain to do so. Commands implemented in Scala each needed

to have a special dedicated interface class to be executed by Ant, passing arguments and

parameters from the configuration space to the command turned out to be very troublesome,

and a lot of other inconveniences soon surfaced.

For this reason we decided to drop the configuration file altogether and opted instead for an

architecture pattern we like to call the mirrored command suite. This approach was inspired

by the way Linux applications and tool sets are often structured, combined with the need


(yet again) to integrate the Scala environment with the command line environment. The key

principles are the following:

1. All functionality of the framework that needs to be directly available to the researcher

to experiment with, is exposed as a set of input/output commands with a (limited)

number of parameters to control their behavior.

2. If this framework functionality includes processes (e.g. the generation of a stressmark),

each step in these processes is made available as a command in its own right. This allows

the researcher to easily execute, control, and debug each of these steps individually. For

convenience, the process as a whole can also be exposed as a single command.

3. Each command is mirrored, meaning it is made available twice: once through the com-

mand line shell as a script, and once through a Java/Scala class called the CommandIn-

terpreter. Typically, one version will be a proxy calling the second version. For example,

if a command invokes a third party tool, the shell script is the natural place to do this,

so the Java/Scala command will be a proxy using Ant to run the shell script. On the

other hand, if the command runs our own stressmark generator which is written in

Scala, the shell script will be the proxy running Scala, invoking the real command, and

passing the command parameters.

4. All shell script names start with the same prefix, in our case "smr" for stressmark

runner, followed by the name of the command.

5. The parameters and usage of each command are described in concise usage instruc-

tions that can be accessed in the traditional way by invoking the shell script without

parameters or the parameter "help".

6. If a command’s input and/or output is structured data, this data is presented in a

human-readable format that can easily be edited by the researcher. This format could

be XML, but we opted for YAML [6] (standing for YAML Ain't Markup Language) as

this is less bloated and tends to be easier to read and edit.
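The pattern can be sketched in Python (an illustrative sketch, not the actual Scala implementation; only the "smr" script prefix and the generate-workload example come from the text, while the class and method names here are hypothetical):

```python
import subprocess

class Command:
    """A unit of framework functionality, mirrored as a shell script."""
    name = "base"
    def execute(self, args):
        raise NotImplementedError

class GenerateWorkload(Command):
    """Code-side command; its shell mirror would be 'smr-generate-workload'."""
    name = "generate-workload"
    def execute(self, args):
        return 0, f"workload written to {args[0]}", ""

class ShellProxy(Command):
    """Code-side proxy for a command whose real work happens in a shell
    script (prefix 'smr-'), mirroring the pattern's third principle."""
    def __init__(self, name):
        self.name = name
    def execute(self, args):
        proc = subprocess.run(["smr-" + self.name] + list(args),
                              capture_output=True, text=True)
        return proc.returncode, proc.stdout, proc.stderr

class CommandInterpreter:
    """Maps an instruction string onto a command object and runs it."""
    def __init__(self, commands):
        self.commands = {c.name: c for c in commands}
    def run(self, instruction):
        name, *args = instruction.split()
        return self.commands[name].execute(args)

interp = CommandInterpreter([GenerateWorkload()])
code, out, err = interp.run("generate-workload myworkload.yml")
```

Each command thus exists twice with identical semantics: once as a shell script and once behind the interpreter, with one side a thin proxy for the other.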

We found that this approach is perfectly suited for development in a research environment.

For starters, owing to the fact that the basic architecture is extremely general and has very

little structure since it is just a set of commands, the software can easily evolve in the course of

the research. It is worthwhile to notice that this dynamic nature is completely compatible and

in line with the key tenets of the modern software development methodologies mentioned in

the beginning of this chapter: iterative development in small steps, responsiveness to changing

requirements, and a strong emphasis on experiment and early interaction with the software

that is being built.


Even the merits of the mental exercise of simply dividing the framework’s functionality in

different commands, each with their respective input, output, and parameters, should not be

underestimated, as this greatly contributes to the formation of clear concepts, structuring the

chaos that research usually is. We found that the flat, simple structure of the command set

can actually be better in this respect than a traditional and potentially more complex class

system.

On a more practical note, the command line shell can be optimally used as an interactive

development environment with the developer/user exploring and debugging new commands

and fully leveraging the finished functionality. While invoking commands, thanks to basic

shell functionality and the common prefix of the command’s shell scripts, pressing the tab

key functions as a code completion tool, and the usage instructions as parameter hints.

Continuous integration, another tenet of modern software development, fits in here as well. Since the development and user environments are the same, it makes perfect sense to create additional commands that ease the repetitive tasks a developer/researcher

typically faces while modifying the code of different tools: build processes, updating library

binaries with newly compiled versions, and so on.

Last but not least, a mirrored command suite also lends itself perfectly to building a job queue for distributed execution of commands on top of it, which is the subject of the next design

consideration.

5.1.4 A Distributed Job Queue

Research in computer science would be no fun at all without overly complex simulators and

sky-rocketing simulation times. Unfortunately, sometimes even too much caffeine really is too

much. It is for these rare occasions that we found it necessary to curb simulation times by

building a distributed job queue for the concurrent execution of commands.

The target platform for running this job system is the Hydra server cluster of the ELIS

research group at our university. We used a MySQL database server and nine worker servers

with a shared file system, each running a dual-core processor. When executing the jobs in the queue, two worker threads run on each machine in order to efficiently utilize both cores.

Like any SQL transaction, calls to the MySQL database comply with the so-called ACID set of

properties, where ACID stands for atomicity, consistency, isolation, and durability. Thanks to

these properties, the database is perfectly suited for the role of central communication point

between our 18 workers. These workers connect to the database, fetch the next job fit for


execution from the queue, and start crunching. After a worker has finished its job, the result

is written back to the database, and the whole process repeats itself from the beginning.

The main concern typically would be workers fetching or updating the same job due to

concurrency issues. A simple reservation system prevents this from happening. Each job

has a status set to QUEUED as long as the job is available for processing. When a worker

attempts to execute the job, it will first try to update the status to RESERVED using the

following query:

Listing 5.2: Get work query

UPDATE jobs SET State='RESERVED' WHERE JobID=?id AND State='QUEUED'

The ACID properties of the update transaction guarantee that this query can only update

the job record once. If a worker detects that its query did not update the record, it can therefore safely conclude the job is being executed by another worker and attempt to fetch the next available job instead.
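The reservation scheme can be exercised with an in-memory SQLite database standing in for MySQL (an illustrative Python sketch; the thesis setup uses a MySQL server, and the table layout here is an assumption based only on the query above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE jobs (JobID INTEGER PRIMARY KEY, State TEXT)")
conn.execute("INSERT INTO jobs VALUES (1, 'QUEUED')")
conn.commit()

def try_reserve(conn, job_id):
    """Atomically flip QUEUED -> RESERVED; a rowcount of 0 means another
    worker reserved the job first."""
    cur = conn.execute(
        "UPDATE jobs SET State='RESERVED' WHERE JobID=? AND State='QUEUED'",
        (job_id,))
    conn.commit()
    return cur.rowcount == 1

first = try_reserve(conn, 1)   # this worker wins the job
second = try_reserve(conn, 1)  # a second attempt finds it already RESERVED
```

The guard `AND State='QUEUED'` is what makes the update a test-and-set: of two racing workers, exactly one sees a modified row.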

5.1.5 Lessons Learned

Apart from the points discussed above, we also encountered some smaller and more obvious

issues which are briefly summarized here as lessons learned:

• Avoid immature development tools, and especially unstable builds. This proved a hard thing to do, as Scala is a relatively new language and none of the plug-ins for the mainstream IDEs had reached release candidate level. We stuck to NetBeans early on, but had to try quite a lot of builds until we found a relatively stable one.

• You Ain't Gonna Need It. Despite the fact that every software engineer knows the

infamous YAGNI maxim, there are few who always heed its advice. As numerous

others before us, we found that designing functionality too rigorously too early tends

to lead to over-engineering and features that were never really needed in the first place.

Agile practices help to keep this phenomenon to an absolute minimum.

• Debugging concurrency issues is no fun, especially in a distributed environment. We tried to alleviate this problem by using MySQL as a central point of communication as mentioned before, and by going to great lengths to test as much functionality as possible

locally and single-threaded, making sure most errors were detected while they were still

easy to trace.

• No silver bullet. The mirrored command suite pattern has one disadvantage, which we only realized later on. Since each process is split up into different commands representing the steps

it contains, and each of these commands is invoked through the command line shell,


there is no encompassing runtime environment to preserve any state between steps.

This is usually not a problem, as this state is typically the input and output of each

process step, which we want to be editable by the researcher anyway and therefore save

in a human-readable file format. If this state becomes increasingly complex though,

the cost of writing the glue code saving and loading it may outweigh the benefits of

this approach. During the course of the project, this happened to be the case for the

measurement points of the meta search algorithm described elsewhere in this document.

The solution we came up with was to save the state to the database instead of the file

system, preserving the possibility to view and edit it while reducing the serialization

overhead. It is however easy to see that this solution will not necessarily work in every

case.

• Keep the use of shared objects, especially singletons, to a minimum while developing a concurrent system. Scala explicitly supports the declaration of singleton objects, which the programmer defines much like classes. While this is an attractive language feature, we found that heavy use of singletons (like static methods in the traditional object-oriented programming model) can easily introduce concurrency issues if they contain shared mutable state. We mitigated this by defining facade classes that regulate access to these singleton objects and by minimizing the shared members.

5.2 Platform Setups

5.2.1 SESC SMP MIPS

Component Overview

The first platform we test our framework on employs the SESC simulator to execute the

generated stressmarks on a simulated SMP MIPS architecture. In this setup, our framework

runs on the Hydra distributed platform as described in the section on design considerations.

We distinguish between the worker nodes and the central database node which contains the

job queue for the workers to execute. Running on the worker nodes are the components we

have implemented ourselves in Scala (i.e. StressmarkRunner and StressmarkGenerator), and

the third-party tools we invoke as executable binaries. The former are described in detail in

the next section; the latter we discuss in the ensuing paragraphs.

In order to compile the generated C code of our stressmarks, we use the gcc cross compiler

for the MIPS instruction set, which is made available in the sesc-utils package. The version

of gcc is 3.4. To implement multi-threading, we used the SESC threading library. Apart from

gcc and SESC, we use the gnuplot tool for displaying the simulation results as a graph.


Figure 5.1: SESC overview. (Component diagram: the Hydra worker node runs the StressmarkRunner, StressmarkGenerator, SESC Simulator, SESC Threading Library, GCC Compiler, and GNUPlot components; the job database node runs the MySQL Job Database component.)

SESC - SuperESCalar Simulator

”SESC is a microprocessor architectural simulator developed primarily by the

i-acoma research group at UIUC and various groups at other universities that

models different processor architectures, such as single processors, chip multi-

processors and processors-in-memory. It models a full out-of-order pipeline with

branch prediction, caches, buses, and every other component of a modern proces-

sor necessary for accurate simulation. SESC is an event-driven simulator. It has

an emulator built from MINT, an older project that emulates a MIPS processor.”

[5]

The default SMP (symmetric multi-processing) architecture configuration of SESC simulates

256 identical 70 nm cores running at 5 GHz with an issue width of 4. The branch predictor

is based on the Alpha 21264 hybrid predictor. The cache configuration is the following:

• L1D and L1I: 32 kB, associativity of 4, LRU, write-through

• private L2: 512 kB, associativity of 8, LRU, write-back, MESI

As the large number of cores combined with the 70 nm process and high clock rate does not

even remotely resemble a realistic processor design, we altered the configuration as follows:

• the number of cores was reduced to 4

• the clock rate of each core was reduced to 1 GHz


Experience

Paul Sack, one of the developers of SESC, introduces the simulator stating that ”the biggest

challenge for new students in architecture research groups is not passing theory or software

classes. It is not finding a new apartment or registering with the INS. It is understanding the

architecture of the processor simulator that will soon confront them—a simulator coded not

for perfection, but for deadlines. Even the most well-conceived simulator can quickly look

like a Big Ball of Mud to the uninitiated.” We found that these observations accurately matched our own.

5.2.2 Intel Core2Quad x86-64

Figure 5.2: Hardware overview. (Component diagram: a single test node runs the StressmarkRunner, StressmarkGenerator, GCC Compiler, GNUPlot, pthread Library, and MySQL Job Database components.)

The second test platform is the Intel Core2 Quad 9450 hardware processor, executing a 64-bit x86 instruction set. There are four 45 nm cores running at 2.66 GHz. The Thermal Design


Power (TDP), which Intel defines as the maximum power usage, is 95 Watts. The cache

configuration is the following:

• L1D and L1I: 32 kB per core

• L2: 2 × 6 MB (shared by two cores each)

In the Intel Developers Manual [1] we found that the maximum number of instructions per

cycle (IPC) for this processor is four. We use this figure to evaluate the performance of our

generated stressmarks. Measuring the IPC during our tests was done by employing hardware

performance counters.

The setup of the software components is the same as the one for the SESC platform, except

for three things: the SESC simulation has been replaced by native execution of the generated

stressmark, its threading library was replaced by the standard POSIX implementation for

Linux (pthread), and the database now runs locally since native execution is fast enough not

to require running the test in a distributed way.

5.3 Framework Architecture

5.3.1 Overview

At the highest level, the framework comprises two components: the StressmarkGenerator, which transforms the abstract workload model into C code, and the StressmarkRunner

which in turn contains the command set exposing the entire framework’s functionality to the

researcher, the genetic search algorithm used to optimize the generated stressmarks, and the

job queue for concurrent execution of commands.

The remainder of this chapter is a more detailed exploration of the four StressmarkRunner

packages. Note that only the most important classes of each package are given. For the full

picture, the source of the project should be consulted.

5.3.2 Commands

The Command Interpreter

As a central part of the mirrored command suite pattern, the command interpreter mirrors

the command line shell in Scala, allowing the researcher to run the commands that expose the framework's functionality by providing an instruction string, for example "generate-workload myworkload.yml" to save a random workload model in YAML format to the file myworkload.yml.

Page 63: Automatic Generation of Multi-core Stressmarks …lib.ugent.be/.../418/423/RUG01-001418423_2010_0001_AC.pdfWouter Kampmann, Lieven Lemiengre Automatic Generation of Multi-core Stressmarks

Figure 5.3: Packages. The stressmarkrunner packages (commands, jobmanager, search, util) and the stressmarkgenerator packages (benchgenerator, descriptors, elements, memory, util).

It interprets the instruction, identifies the command name and the different parameters, and

returns an AbstractCommand object, e.g. SMRGenerateWorkload.

This command object can then be executed, and the result code, the standard output string,

and the error string can be retrieved. Note that these three output elements mimic those of

a traditional command line shell script. Moreover, to make the congruency with shell scripts

complete, commands are always run in the context of a work directory.

As mentioned in the section on design considerations, it will often be the case that a command

object is no more than a proxy wrapper for a shell script (or an executable binary), which

is why the command line shell command class (CLSCommand) is provided. CLSCommand

uses the Apache Ant library to allow running a shell script in Scala, simply by providing the

instruction string (e.g. ”ls -a”).

Figure 5.4: Commands Core. The core command classes SMRGenerateWorkload, SMRGenerateC, SMRCompile, SMRSimulate, SMRParseSimreport, SMRPlotSimresults, and SMRFullRun, all derived from AbstractCommand.

With the CommandInterpreter, the AbstractCommand, and the CLSCommand classes, everything is in place to expose the entire functionality of the framework by providing the

command implementations. These implementations are grouped in five packages:

1. core: a package containing all commands necessary to generate one or more random

workloads, obtain the corresponding stressmarks, simulate them on SESC, and plot the

results in a graph.

2. jobmanager: contains commands to control the job queue for concurrent execution of

commands.

3. ga: provides the functionality needed to execute the genetic search algorithm for optimizing stressmarks.

4. metaga: contains commands for the meta search algorithm to optimize the genetic search algorithm.


5. other: a kitchen-sink package with all commands that belong nowhere else.

Forming the interface by which the developer/researcher controls the framework, the com-

mand implementations in each of these packages are now briefly discussed. The notation

used for the command names is the shell script variant; for example, SMRGenerateWorkload

becomes smr-generate-workload.

Core Commands

smr-generate-workload <workload output file> [<workload output file> ...]

The smr-generate-workload command generates a YAML file containing a workload with

randomly initiated parameters. Parameter values can easily be edited by the researcher.

smr-generate-c <workload file> <output c file>

This command generates the C code of the synthetic benchmark based on the workload that

is provided as input.

smr-compile <c file> <binary output file> [-debug]

The smr-compile command runs the gcc MIPS cross-compiler that comes with SESC utils

using the correct compiler flags, linked libraries, etc. The output file is a MIPS binary file

that can be run on the SESC simulator. Optionally, an extra debug file can be generated containing the assembly code.

smr-simulate <binary file> <configuration> <report output file>

The smr-simulate command runs a MIPS binary file on the SESC simulator using the re-

quested hardware configuration. The result is a human-readable SESC simulation report.

smr-parse-simreport <simulation report> <simresults output file>

The smr-parse-simreport command parses a SESC simulation report and generates a YAML file containing the data that is relevant to the genetic search algorithm (mainly power statistics).

smr-plot-simresults [-r] <output png file> <simresults input file> [<simresults

input file> ...]

The power statistics of one or more simulation runs can be plotted in a stacked graph contain-

ing the power usage split up into the Fetch, Issue, Memory, Execution, and Clock categories.

smr-full-run <workload file> [<workload file> ...]


The smr-full-run command combines all the necessary commands to run a simulation of a

synthetic benchmark based on the provided workload file. If more than one workload file is

provided as input, the simulations of all files will be executed in sequence (i.e. not using the

job manager queue).

JobManager Commands

AbstractCommand CLSCommand

jobmanager

SMRQReset SMRQRunWorkers

SMRQShutdownWorkers SMRQStatus

Figure 5.5: Commands Jobmanager.

smr-q-status

The smr-q-status command shows the number of jobs in the job queue and their state (CREATED, QUEUED, RESERVED, RUNNING, SUCCESS, or ERROR). If there are active workers, the command they are running is shown as well.

smr-q-run-workers <worker count>

The smr-q-run-workers command starts a number of worker threads on the local machine.

The workers will automatically connect to the job queue database and start executing any

queued jobs (provided all the jobs they depend upon are already finished successfully).

smr-q-shutdown-workers

The smr-q-shutdown-workers command tries to shut down all active workers currently con-

nected to the job queue. Since workers shut down automatically when all queued jobs have

finished, smr-q-shutdown-workers should only be used to stop the execution of the current

job queue in order to resolve errors, or cancel or pause the execution.


smr-q-reset

The smr-q-reset command clears the job queue. All job data in the database will be removed.

Genetic Algorithm Commands

Figure 5.6: Commands Genetic Algorithm. The SMRGaSetup, SMRGaEvolve, SMRGaPlotResults, SMRGaSummarizeGeneration, and SMRGaSummarizeTotal command classes, all derived from AbstractCommand.

smr-ga-setup <population size> <generations>

The smr-ga-setup command creates the jobs necessary to run the genetic search algorithm

described earlier with a certain population size for a certain number of generations. The jobs

are ready to be executed by the workers (e.g. by running the smr-q-run-workers command).

The jobs employ the rest of the commands in this section to execute the search algorithm.

smr-ga-evolve [-D ga.mutationProb=x] [-D ga.crossoverProb=x] [-D ga.elitism=x]

<output files prefix> <input file> [<input file> ...]

The smr-ga-evolve command takes the simulation results of one population of synthetic bench-

marks as input and generates the workloads for a new population based on these results.


During the genetic evolution process, mutation, crossover, and elitist selection are applied.

smr-ga-plot-results [-r] <output file> <input file> [<input file> ...]

The smr-ga-plot-results command uses gnuplot to generate a PNG file containing a graph that

displays the fitness values of the different generations of stressmarks. The graph includes

the minimum, average, and maximum fitness value (power usage) of each generation. The

input files are generation summaries generated by the smr-ga-summarize-generation command

below.

smr-ga-summarize-generation <output file> <input file> [<input file> ...]

The smr-ga-summarize-generation command takes a list of simresults YAML files as input

and generates another YAML file containing the minimum, average, and maximum fitness

for each input file. The input files normally correspond to the stressmark individuals in the

current generation of the genetic algorithm described in this document.

Meta GA Commands

Figure 5.7: Commands Meta GA. The SMRMetagaInit, SMRMetagaSetupEvaluation, SMRMetagaCollectEvaluationResult, and SMRMetagaExpand command classes, all derived from AbstractCommand.

smr-metaga-init


The smr-metaga-init command creates a single metaga measuring point in the database. The

measuring point corresponds to a genetic search using a specific configuration. A search

configuration is comprised of the generation size, the number of generations, the mutation

and crossover probabilities, and the elitism parameter.

smr-metaga-setup-evaluation

The smr-metaga-setup-evaluation command generates the necessary jobs to run the genetic

searches for each unevaluated meta-ga measuring point in the database. If, for example, smr-metaga-init has been used to create a single measuring point, the jobs that run the genetic search corresponding to the configuration of that measuring point will be generated.

smr-metaga-collect-evaluation-results

This command collects the scores of the differently configured genetic algorithms that have

been run and stores these results in the database.

smr-metaga-expand

The smr-metaga-expand command generates the neighbors of the best scoring (unexpanded)

metaga measuring point in the database. Every neighbor slightly varies in one parameter

of the measuring point configuration (e.g. a slightly increased mutation probability, or an elitism parameter that is decreased by one).

Other Commands

smr-consistency-test <workload> <test-runs>

This command generates the jobs necessary to run a number of stressmark simulations based

on a single workload and calculate the consistency of the different outcomes (i.e. the minimum, maximum, average, and standard deviation of the power usage). Use smr-q-run-workers to

start executing the generated jobs.

Apart from the commands controlling the functionality of the framework, a couple of com-

mands have been made available to alleviate the development process and make it more

efficient:

1. smr-build-sesc-and-spot [clear]: runs the make commands and others necessary to (re)build the SESC simulator and the HotSpot adaptation for SESC.

2. smr-build-sesc-configs: runs the make commands and others necessary to (re)build the

configuration files for the different architectures supported by SESC.


Figure 5.8: Other Commands. The SMRConsistencyTest, SMRConsistencyTestCalcResults, SMRBuildSescAndSpot, SMRBuildSescConfigs, SMRUpdateJar, SMRUpdateJarOnHydra, and SMRBootHydraWorkers command classes.

3. smr-update-jar: copies the StressmarkRunner and StressmarkGenerator jar files pro-

duced by Netbeans to the local test environment.

4. smr-update-jar-on-hydra: uploads the StressmarkRunner and StressmarkGenerator jar

files produced by Netbeans to the Hydra environment.

5. smr-boot-hydra-workers: automatically boots two workers on every Hydra server. This

command is only available on the Hydra servers, not in the local test environment.


5.3.3 Jobmanager

Figure 5.9: Jobmanager. The JobManager creates SimpleJobs (subclassed by RegularJob and ChildJob) and CompoundJobs, runs Workers to execute them, and persists job state in the JobDatabase.

The jobmanager package contains all classes related to the job queue that is used to run

commands concurrently. Using the JobManager, new jobs can be created and worker threads

can be run to execute the jobs in the queue. Before creating a new job, however, the developer should decide whether a CompoundJob is necessary or a SimpleJob will suffice.

- SimpleJobs are single instructions that come in three different types. The CUSTOM

type is used for executing a framework command through the CommandInterpreter,

the SHELL type runs a command line instruction, and the ANT type executes an Ant

target defined in a configuration file.

- CompoundJobs are sequences of ChildJobs. ChildJobs are just like SimpleJobs apart

from the fact they have a CompoundJob as parent. All ChildJobs of a single Com-

poundJob are fetched and executed by the same worker thread.


Optionally, jobs can be organized in dependency groups and be made dependent on such

groups. Throughout its lifetime, a job can progress through the following states:

Figure 5.10: Jobstates. A job progresses from CREATED through QUEUED, RESERVED, and RUNNING to either SUCCESS or ERROR.

A job that is in the CREATED state is present in the queue, but will not be executed yet.

After all properties have been set correctly, it can be QUEUED so a worker can try to fetch

it. In order to make sure each job is executed by only one worker, a worker must first update the job's state to RESERVED, as described in the section on design considerations. After

the reservation has succeeded, the worker first waits for all dependencies of the job to be

finished successfully and then starts RUNNING the job which will either lead to SUCCESS

or an ERROR. The result of the whole operation is then written back to the job database.
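The legal transitions described above can be sketched as a small state machine. This is an illustrative C rendering, not framework code:

```c
#include <assert.h>

/* Sketch of the job lifecycle from Figure 5.10. A transition is only
 * legal along CREATED -> QUEUED -> RESERVED -> RUNNING, after which
 * the job ends in SUCCESS or ERROR. */
typedef enum { CREATED, QUEUED, RESERVED, RUNNING, SUCCESS, ERROR } JobState;

static int legal_transition(JobState from, JobState to)
{
    switch (from) {
    case CREATED:  return to == QUEUED;
    case QUEUED:   return to == RESERVED;   /* a worker reserves the job */
    case RESERVED: return to == RUNNING;    /* dependencies have finished */
    case RUNNING:  return to == SUCCESS || to == ERROR;
    default:       return 0;                /* SUCCESS and ERROR are final */
    }
}
```

The RESERVED step is what guarantees that two workers never fetch the same QUEUED job.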

Note that our platform setup includes a shared file system for all workers. It is the responsibility of the developer creating the jobs to ensure that files are managed in a way that plays well with the concurrent execution of queued jobs. Concurrency can be controlled using dependency

groups and compound jobs if necessary.

5.3.4 Search

The packages ga and metaga contain the classes implementing the core of the genetic search

algorithm and the meta search algorithm.

Genetic Algorithm Classes

For conducting a genetic search, the developer creates the Individuals populating the first

generation. An Individual is nothing more than a fitness value and a ParameterMap which

stores a number of key-value pairs expressing the individual’s properties and information on

how to mutate each of these values.

Figure 5.11: Genetic Algorithm Classes. A GeneticPopulation (with crossover, sortByFitness, proportionateSelect, and evolve methods) contains one or more Individuals, each holding a fitness value and a ParameterMap; GaMeasurement represents a meta search measuring point.

In the case of our framework, the ParameterMap of the Individuals will invariably contain the abstract workload model and the knowledge of how, for

example, the instruction mix can be mutated by randomizing the percentage of arithmetic,

memory, and branch instructions.

When all individuals are created and their fitness values computed, a GeneticPopulation can

be instantiated containing them. Progressing to the next generation then becomes as simple

as calling the evolve method, passing the size of the next generation, the number of best

individuals that should survive without mutation (i.e. the elitism factor), and the mutation

and crossover probabilities. The evolve method returns a new GeneticPopulation object with a new set of individuals, only this time the fitness values of these individuals are unknown. It is now

the responsibility of the developer to set these fitness values before calling the evolve method

again, and so on.
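A much-simplified sketch of one such evolve step follows, assuming a single double-valued gene per individual; the real method also applies crossover and proportionate selection, which are omitted here for brevity.

```c
#include <assert.h>
#include <stdlib.h>

/* Simplified sketch of an evolve() step. Real individuals carry a
 * full ParameterMap instead of one gene. Assumes size <= 64. */
typedef struct {
    double gene;
    double fitness;   /* negative means "not yet simulated" */
} Individual;

static int by_fitness_desc(const void *a, const void *b)
{
    double fa = ((const Individual *)a)->fitness;
    double fb = ((const Individual *)b)->fitness;
    return (fa < fb) - (fa > fb);
}

/* Fill `next` with the new generation: the `elitism` best individuals
 * survive unchanged, the rest are (possibly mutated) copies whose
 * fitness must be computed again before the next evolve() call. */
static void evolve(const Individual *pop, Individual *next, int size,
                   int elitism, double mutation_prob)
{
    Individual sorted[64];
    for (int i = 0; i < size; i++) sorted[i] = pop[i];
    qsort(sorted, size, sizeof *sorted, by_fitness_desc);
    for (int i = 0; i < size; i++) {
        next[i] = sorted[i];
        if (i >= elitism) {
            next[i].fitness = -1.0;   /* unknown until simulated */
            if ((double)rand() / RAND_MAX < mutation_prob)
                next[i].gene += (double)rand() / RAND_MAX - 0.5;
        }
    }
}
```

In the framework, "computing the fitness" of the new individuals means generating, compiling, and simulating the corresponding stressmarks.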


Meta Search Classes

The problem for the developer might be that it is unclear for which values of the population

size, elitism, and mutation/crossover probabilities the genetic algorithm performs best. In

order to allow examining this problem thoroughly, the meta search algorithm can be used.

The meta search is a hill climbing algorithm and its four-dimensional search space contains the

points defined by the tuple (generation size and count, elitism, mutation probability, crossover

probability). The points of the search space which are being explored by the algorithm are

stored in the database and represented by the GaMeasurement class.

Note that the GaMeasurement class is in the first place a helper class that eases accessing the

database and calculating neighboring measurement points; the rest of the hill climbing functionality is implemented by the metaga commands described earlier (most notably the selection of the best measurement in order to expand it).

The developer should manually instantiate the initial measurement point by providing its

properties and calculating its fitness. After having committed this measurement point to the

database, its neighbors can be retrieved. The neighbors will be up to eight new measurement

points, each slightly varying in one dimension of the search space. It is now the responsibility

of the developer to calculate the fitness of each neighbor. After this, the process can be

repeated by calling the getNeighbours method again, etc.

The zoom level of the getNeighbours method defines the granularity of the variation between

a measurement point and its neighbors.
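Neighbor generation can be sketched as follows; the field names and step sizes are illustrative assumptions, not the framework's actual values.

```c
#include <assert.h>

/* Sketch of neighbor generation in the hill climber. Each neighbor
 * varies exactly one dimension of the configuration, up or down, by
 * a step that shrinks as the zoom level grows. */
typedef struct {
    int    generation_size;
    int    elitism;
    double mutation_prob;
    double crossover_prob;
} GaConfig;

static int get_neighbours(GaConfig base, int zoom, GaConfig out[8])
{
    double step = 0.1 / zoom;      /* finer variation at deeper zoom */
    int n = 0;
    for (int dir = -1; dir <= 1; dir += 2) {
        GaConfig c;
        c = base; c.generation_size += dir * 8;    out[n++] = c;
        c = base; c.elitism         += dir;        out[n++] = c;
        c = base; c.mutation_prob   += dir * step; out[n++] = c;
        c = base; c.crossover_prob  += dir * step; out[n++] = c;
    }
    return n;                      /* up to eight new measuring points */
}
```

Each returned configuration becomes a new GaMeasurement whose fitness is the outcome of a full genetic search.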

5.3.5 Util

Figure 5.12: Util. The Database, YamlStruct, and ParameterMap utility classes.

This package contains a number of utility classes facilitating the use of the database and the

reading of YAML files. For the latter we use the jyaml library [3].


Chapter 6

Results

6.1 Number of SESC Instructions

In order to obtain stable output characteristics when running a stressmark on the SESC

simulator, it is necessary to run the simulation long enough. There are two reasons for this.

The first one is the initialization phase executed when the stressmark is being started. During

this phase, the stressmark instructions themselves have not kicked in yet, and therefore the output characteristics do not reflect the stressmark's qualities. Moreover, as the power characteristic we use applies to the entire run of the stressmark, the effect of the initialization dies away only slowly.

The second cause is the fact that even the stressmark code itself needs to be run for a certain

amount of time in order for its behavior to stabilize, since, for example, the behavior of the

data caches changes over time as they will slowly adapt to the stressmark loop code being

executed.

6.1.1 Rabbit Mode

Luckily, there is a way to minimize at least the effects of the initialization phase. SESC supports a so-called ”rabbit mode”, allowing the simulator to hop over the initialization,

speeding up the simulation process during this phase by calculating only the data that is

strictly necessary to progress in a correct way. A corollary of this is that there are no

statistics recorded when in rabbit mode, which is exactly what we want to lessen the effect

of the initialization on the average power usage.


6.1.2 Test Setup

We determine the number of instructions necessary for a stable output characteristic by

simulating an increasing number of instructions of a random stressmark. The first run only

simulates 1000 instructions, the second 2500, the next 5000, 10000, etc. We continue this

process until the value of the output characteristic remains more or less constant. We can

then choose an optimal number of instructions, balancing the tradeoff between result accuracy

and simulation time.

The entire process is then repeated with rabbit mode enabled in order to see its effect. The number of instructions skipped in rabbit mode is one million.

Figure 6.1: Executed Instructions. Average power (W) versus the number of simulated instructions (1K to 50M), with rabbit mode disabled and with one million instructions executed in rabbit mode first.

6.1.3 Results and Discussion

Figure 6.1 shows the results of both experiments. Please note that the number of instructions

on the horizontal axis increases exponentially, and that it is proportional to the simulation

time.

Looking at the power characteristic with rabbit mode disabled, the effect of the initialization

code is very clear. The heavy workload our stressmark generates only gradually increases the average power consumption, reaching relative stability at about ten million instructions.

As was expected, the test with the rabbit mode shows a lessened effect of the initialization

and more quickly climbs to its maximum. Note however that this comes at the (small) cost

of running the first million instructions in rabbit mode prior to the actual simulation.

In order to keep the simulation time short enough without compromising the accuracy of our

results too much, we have eventually settled for simulations running a million instructions in

rabbit mode, followed by two million instructions of normal simulation. These figures are used

for all simulations discussed in this document, except if stated otherwise.

6.2 Exploration of Search Space

For many developers, the idea of solving a hard problem simply by unleashing a genetic algorithm on it in the hope that it will magically do the heavy lifting, is a very tempting one. This is because it often appears that one does not really need any insight into the problem domain in order to design a genetic algorithm that provides all the answers one is looking for. This is, however, a deceptive thought, as the success of a genetic algorithm very much depends on the definition of its genotype and its genetic operators. It is for the mental exercise of devising these two definitions that a deep understanding of the problem at hand is crucial.

In this section, we discuss the data we analyzed in order to gain this much-needed insight into the problem domain of our framework, and to check the relevance of the genotype we defined (the abstract workload model) by making sure it efficiently controls the output characteristics.

We do this by studying the relations between several parameters of our workload model and

the two output characteristics we optimize for: power consumption and IPC.

The platform we used for generating this data is the SESC SMP configuration described

earlier. The configuration contains four hardware threads, and two integer and two floating

point ALUs. We now look at different types of workload models.

6.2.1 Integer Addition

The instruction mix of our first workload model demands a stressmark with nothing but

integer addition instructions. The minimum dependency distance varies from 1 to 16.

As we will notice time and time again, the first thing immediately becoming apparent is the

close correlation between our two output characteristics, the number of instructions per cycle

and the power usage.

Figure 6.2: Integer Addition. Power (W) and IPC versus the minimum dependency distance.

This is of course to be expected, as a higher throughput immediately translates into heavy usage of the different processor components, which in turn results in a

higher energy consumption.

In order for this high throughput to be possible, as many instructions as possible need to be

processed in parallel. This will be the case if many instructions following each other can be

executed independently, in other words: the larger the minimum dependency distance (MDD)

between instructions, the higher the IPC and power usage.

The data clearly meets our expectations on this account, although we notice the output

characteristics ceasing to grow at an MDD of 10 and upwards. This is the point where an

IPC of two is reached, and since there are only two integer ALUs, it is clear that this is the

limit of what we can achieve by using only integer instructions.
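The effect of the MDD on instruction-level parallelism can be made concrete with two hand-written loop bodies; these are hypothetical equivalents of the code the generator emits, not its actual output.

```c
#include <assert.h>

/* Two hand-written loop bodies illustrating the minimum dependency
 * distance (MDD); the real stressmark generator emits similar code
 * automatically. */

/* MDD = 1: every addition needs the previous result, so the
 * integer ALUs cannot work in parallel. */
static long chain_mdd1(long iters)
{
    long a = 1;
    for (long i = 0; i < iters; i++)
        a = a + a;                /* serial dependency chain */
    return a;
}

/* MDD = 4: four independent chains, so up to four additions can be
 * in flight at once, limited only by the number of integer ALUs. */
static long chain_mdd4(long iters)
{
    long a = 1, b = 1, c = 1, d = 1;
    for (long i = 0; i < iters; i++) {
        a += i; b += i; c += i; d += i;
    }
    return a + b + c + d;
}
```

On the simulated SMP configuration with two integer ALUs, the second loop can sustain roughly twice the IPC of the first.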

6.2.2 Integer Multiplication

The second workload model only contains integer multiplication operations. All the conclu-

sions we reached for the integer addition operations are valid in this case as well. Note that

the power usage is slightly less than in the previous case; this is partly because every multiplication in C code is compiled into two instructions: one for the multiplication operation itself, and one for copying the low bits from the result register. The latter consumes less energy, dragging down the average figure for the power usage.

Figure 6.3: Integer Multiplication. Power (W) and IPC versus the minimum dependency distance.

6.2.3 Double Addition

We now look at a stressmark that contains exclusively double add operations. Once again the

positive correlation between the MDD and our output characteristics is apparent, although a

couple of things are different this time.

The minimum dependency distance we need in order to reach an IPC of 2 is higher than in the

case of integer operations. At an MDD of 16, the IPC actually grows larger than 2, although

there are only 2 floating point ALUs. This is due to spilling from floating-point registers to temporary registers, and the spill operations positively affect the IPC since they are handled by the integer ALUs.

6.2.4 Double Multiplication

Our conclusions for double multiplication are the same as those for addition, with again a

higher MDD that is needed to reach an IPC of 2, and a lower power usage than in the case

of additions.

Figure 6.4: Double Addition. Power (W) and IPC versus the minimum dependency distance.

6.2.5 Integer and Double Combined

If we look at a stressmark based on a workload model combining integer additions with

double multiplications, we can increase the MDD even more to reach an IPC of four. Since

the registers of both the double and the integer ALUs are saturated at this point, register

spilling this time affects the IPC negatively with a serious drop in performance at an MDD

of about 27.

6.2.6 Private Loads and Stores

We now turn our attention from arithmetic to memory instructions. The first workload model we take a look at in this category is restricted to thread-local load and store instructions with

stride 0, meaning these instructions access the same address each iteration. Instead of the

minimum dependency distance, we now vary the ratio of loads and stores. On the left,

the instruction mix exclusively contains load instructions; on the right, we only have store

instructions.
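A stride-0 access pattern boils down to loops of the following shape; these are hand-written, hypothetical equivalents of the generated code.

```c
#include <assert.h>

/* Hypothetical stride-0 loops of the kind the generator emits: every
 * iteration touches the same thread-local address. The volatile
 * qualifier keeps the compiler from optimizing the accesses away. */
static volatile long slot;

static long stride0_loads(int iters)
{
    long sum = 0;
    for (int i = 0; i < iters; i++)
        sum += slot;              /* load from the same address */
    return sum;
}

static void stride0_stores(int iters)
{
    for (int i = 0; i < iters; i++)
        slot = i;                 /* store to the same address */
}
```

Varying the mix between these two loops is exactly the load/store ratio explored below.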

It is apparent from the graph that executing a lot of store operations has a heavily negative

impact on the IPC. This is to be expected as a store instruction needs to be propagated

through the different cache levels, causing pipeline stalls.

Figure 6.5: Double Multiplication. Power (W) and IPC versus the minimum dependency distance.

When, on the other hand, a load instruction is executed, the data can be immediately fetched from the cache, resulting in a

relatively high IPC.

6.2.7 Shared Loads and Stores

The last instruction mixes we consider are exclusively comprised of shared load and store operations, again with stride 0. Important to mention is the fact that this time, all operations

use the same memory address to load from and store at. The mix containing only load

instructions draws attention as it reaches an IPC of 2, sharply contrasting with the instruction

mixes containing store operations. Once again, this can be explained by the cache behavior.

Since all operations are applied to the same address, even a small number of store operations ruins the performance, as cache coherence issues cause many pipeline stalls, rendering the

caches virtually useless.
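The shared-address pattern corresponds to several threads hammering a single location, as in this pthread sketch; the thread and iteration counts are arbitrary (the SESC configuration used four hardware threads).

```c
#include <assert.h>
#include <pthread.h>

/* Sketch of the shared-address case: several threads store to one
 * memory location. On real hardware every store invalidates the
 * cache line in the other cores, serializing execution. */
static volatile long shared_slot;

static void *store_worker(void *arg)
{
    long iters = (long)arg;
    for (long i = 0; i < iters; i++)
        shared_slot = i;          /* contended store */
    return NULL;
}

/* Spawn `nthreads` workers hammering the shared location. */
static void run_shared_stores(int nthreads, long iters)
{
    pthread_t t[16];              /* assumes nthreads <= 16 */
    for (int i = 0; i < nthreads; i++)
        pthread_create(&t[i], NULL, store_worker, (void *)iters);
    for (int i = 0; i < nthreads; i++)
        pthread_join(t[i], NULL);
}
```

Replacing the stores with loads of the same address removes the invalidation traffic, which is why the loads-only mix reaches an IPC of 2.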

6.2.8 Conclusion

Having explored the search space of our framework by looking at distinct workload models

exhibiting different types of behavior, we conclude that the parameters of our model effectively control the output characteristics in the way we expected. Although the effects of changing the minimum dependency distance and the load/store ratio can be understood in these simple scenarios, it is unclear how these effects add up when combined, producing complex behavior. This is where our genetic algorithm comes into play.

Figure 6.6: Integer and Double operations. (Power in watts and IPC versus minimum dependency distance.)

6.3 GA Results

We now take a look at the main result of our work. As described earlier, we set up two different target platforms to demonstrate the portability of our stressmarks. We run the genetic algorithm with the parameters discovered by the meta-search algorithm:

1. A generation size of 72 individuals.

2. A mutation probability of 10%.

3. A crossover probability of 80%.

4. An elitism factor of 1.

Note that we do not determine the number of generations upfront; we simply run the algorithm until the output characteristic no longer changes significantly for a number of generations. The graphs show fewer generations than we actually ran, in order to focus on the more interesting first part.

Figure 6.7: Private Loads and Stores. (Power in watts and IPC for load/store ratios from 100/0 to 0/100.)

6.3.1 SESC Platform

The output characteristic for the SESC platform is the average power usage over the course

of the entire execution of the stressmark. Figure 6.9 shows an overview of the generations

produced throughout the search process, displaying the maximum, average, and minimum

power usage.

In the initial population, the power usage of the fittest stressmark is 41.28 watts, the power

usage of the least fit stressmark is 12.47 watts, and the average power usage is 21.28 watts.

Throughout the search process, the maximum and the average power usage grow significantly, while the minimum power usage remains around 10 watts. This is a good sign: it shows that the trade-off between variety and quality is well balanced in the population, which is necessary to refine the solutions found without getting stuck in a local maximum.

Looking further at the evolution of the fitness, we find the characteristics of a typical genetic algorithm run. In the first generations, the best properties present in the initial population are selected for and combined using the crossover operator, quickly increasing the maximum fitness until it has almost tripled around the sixth generation. This is the point where the algorithm plateaus for the first time, struggling for a couple of generations until new fitness-increasing properties are discovered around generation 12. We see this happen again around generations 42, 65, and 75.

Figure 6.8: Shared Loads and Stores. (Power in watts and IPC for load/store ratios from 100/0 to 0/100.)

Note that despite the elitism factor of 1, which in principle always preserves the fittest individ-

ual, the maximum fitness sometimes drops, and towards the end seems to alternate between

two values. This is because the behavior of the stressmark generator was not yet entirely deterministic at the time of this experiment, causing variance in the output characteristics of the stressmark corresponding to the fittest workload model (and undoubtedly the other models as well).

The final best result is a stressmark first produced in generation 87, with a power usage of 163 watts, an increase by a factor of four compared with the fittest individual of the initial generation. The properties of its workload model are shown below:

Listing 6.1: Workload SESC

memoryShared: 64
traceSize: 100
arithmeticInstructionMix:
  doubleAdd: 44
  doubleMul: 28
  integerAdd: 23
  integerMul: 5
mdd: 29
swThreads: 3
instructionMix:
  arithmeticInstructions: 89
  memoryInstructions: 2
  branchInstructions: 9
memoryThreadLocal: 2048
memoryStrideProfile:
  size1: 26
  size0: 37
  size4: 2
  size3: 2
  size2: 33
memoryInstructionMix:
  unsharedLoad: 5
  sharedLoad: 10
  unsharedStore: 41
  sharedStore: 44
branchTransition:
  rate0: 11
  rate1: 33
  rate2: 10
  rate4: 28
  rate8: 18

Figure 6.9: Result SESC. (Minimum, average, and maximum power in watts per generation, 0–100.)

As might be expected, we notice a high proportion of arithmetic instructions and a large

minimum dependency distance. The arithmetic instruction mix is quite balanced but seemingly avoids integer multiplications. The memory instruction mix contains an unexpectedly high proportion of store operations, but the selection pressure on this mix is probably small, since memory instructions make up only 2 percent of the total.

6.3.2 Core 2 Quad Platform

Figure 6.10: Result Core 2 Quad. (Minimum, average, and maximum IPC per generation, 0–28.)

The same experiment was repeated on the Intel Core 2 Quad target hardware platform. This time, the output characteristic we use is the number of instructions per cycle, which is a fair indicator of power usage, as we demonstrated in the section exploring the search space. The IPC is measured using hardware performance counters.

Although less pronounced than in the case of the SESC platform, here too we find a significant

increase in the output characteristic. In the initial population, the maximum IPC value is 2.25, while in the final generation the fittest individual has an IPC of 2.92, an increase of roughly 30%.

Our algorithm seems to be limited to an IPC of 3, although the maximum IPC of the processor

is 4. The workload model used for the best stressmark gives us a clue as to why this is the

case:

Listing 6.2: Workload x86-64

memoryShared: 64
traceSize: 50
arithmeticInstructionMix:
  doubleMul: 7
  integerAdd: 33
  doubleAdd: 19
  integerMul: 41
mdd: 9
swThreads: 8
instructionMix:
  arithmeticInstructions: 81
  memoryInstructions: 2
  branchInstructions: 17
memoryThreadLocal: 2048
memoryStrideProfile:
  size0: 30
  size1: 26
  size2: 19
  size3: 17
  size4: 8
memoryInstructionMix:
  unsharedLoad: 25
  sharedLoad: 41
  unsharedStore: 30
  sharedStore: 4
branchTransition:
  rate0: 13
  rate1: 24
  rate2: 13
  rate4: 26
  rate8: 24

Looking at the instruction mix in particular, we can see that the algorithm heavily selected

for arithmetic instructions and, to a lesser extent, branch instructions. If we follow this approach and eliminate all memory instructions, an IPC of 3 indeed becomes the limit: the Intel Core 2 Quad has only three arithmetic ALUs, which the algorithm fully exploits by balancing the arithmetic instruction mix for optimal throughput.

The IPC limit of 4 is based on the fetch width and it is now clear that it would be necessary to

somehow add memory instructions to the mix to come close to this figure. Other individuals

in the population do have these memory instructions but have a lower fitness, probably

because, as a consequence, they no longer optimally stress the arithmetic ALUs. If a sweet spot exists that combines arithmetic and memory instructions to approach an IPC of 4, our algorithm was unfortunately unable to find it, as were we.

6.4 GA Efficiency

Although we found that our genetic algorithm yielded quite satisfactory results, especially in

the case of the SESC target platform, we now set up an experiment to obtain a more objective

measure of its performance. The genetic algorithm set up for SESC had a population size of 72

individuals, and it took roughly a hundred generations in order to find an optimal stressmark.

This means that a total of 7200 workload models were produced, and the same number

of simulations were executed for measuring the output characteristics of the corresponding

stressmarks.

Using this figure, we set up a random search algorithm on the same target platform, producing another 7200 workloads and determining their stressmarks' fitness values. We provide two views on this data: a recording of the different results as they were generated (in grey), and a more informative, sorted version showing the distribution of stressmarks in the entire set (the thin black line). We take a look at the sorted distribution of the stressmarks.

The first thing we notice is the small proportion of stressmarks producing a constant power usage of 1.9 watts. These are different instances of the same dummy stressmark that the framework generates whenever it encounters a workload model that cannot be used to produce a valid stressmark. This may, for example, be the case if the proportion of branch

instructions is too high (e.g. nearly 100%). We find that 159 stressmarks in the set are invalid, or 2.2%. This is certainly good enough, since we are dealing with completely randomly generated workloads here; during a genetic algorithm run this percentage will be much lower, as invalid stressmarks are immediately eliminated by the selection process.

Figure 6.11: Random Search. (Power in watts for the 7200 random workloads: random sequence in grey, sorted in black.)

Looking at the rest of the distribution, we can spot a small number of extreme stressmarks. The minimum power usage is 7.74 watts and the maximum 88.84 watts; the standard deviation is a meager 6.99 watts, and the average stressmark has a power usage of 24.03 watts.

We cannot compare the maximum power usage to the result shown earlier, since that result was obtained using another version of our stressmark generator. The graph below (figure 6.12) shows a run of the genetic algorithm using the same version. Its maximum value after 7200 simulations is 140.79 watts, performing significantly better than the random search algorithm: a solid 52%.

It is of course also important to note that the random search algorithm, being random, will yield different results each time it is run, especially since the standard deviation in the set is so low, making well-performing stressmarks a rare commodity. The genetic search algorithm not only performs better; its performance is also much more stable, especially given the fairly low mutation rate used.

On a closing note, the random search algorithm also provided us with some data on the timing performance of SESC, our distributed computing setup, and our framework in general (excluding the components of the genetic algorithm). It took the eighteen workers on our Hydra test platform seven hours to generate and evaluate the 7200 random workload models, an overall rate of 3.5 seconds per stressmark.

Figure 6.12: GA Comparison. (Minimum, average, and maximum power in watts per generation, 0–100.)

6.5 Theoretical Maximum

The random search algorithm provides a means to evaluate the efficiency of our genetic algo-

rithm; we now describe a way of testing the efficiency of our stressmark generator framework

in its entirety. We do this again using the SESC target platform.

SESC supports different architectures, each defined in its own configuration file, whose contents express the properties of the building blocks involved. Among these properties are the energy values dissipated when a particular building block is active, a particular operation is performed by the building block, or a certain event takes place (e.g. a cache miss).

Based on these values, we calculate the theoretical maximum power usage a stressmark can generate. This is done simply by summing all energies that can be dissipated in a single clock cycle and multiplying the result by the clock frequency. We also assume that it is not possible to simultaneously write to and load from a cache unit, or to effect a cache hit at the same time as a cache miss; when encountering incompatible energy values like these, we add only the larger of the two. The result is displayed in figure 6.13.

Core 1: 54.30 watts
Core 2: 54.30 watts
Core 3: 54.30 watts
Core 4: 54.30 watts
L1 Caches: 30.44 watts
L2 Caches: 68.46 watts
TLB: 5.75 watts

Figure 6.13: Theoretical Upper Limit.

The main components of the SMP configuration are four identical cores, each with a theoretical maximum power usage of 54.30 watts, the two cache levels, together totaling nearly 100 watts, and a TLB using at most 5.75 watts. This brings the total theoretical maximum to 321.8 watts.

The best stressmark produced by the genetic algorithm has a power usage of 163.18 watts, or 50.7% of the theoretical maximum. In Joshi et al. [12] we find a comparable percentage of

57%.


Chapter 7

Conclusion

Originally described as early as 1965, Moore's law holds true today as it always has. While the number of transistors is still growing exponentially and their size keeps shrinking, processor designers find themselves in a pickle: spending their transistor budgets while continuing to comply with ever-tightening design requirements such as power usage and temperature. Having hit the power wall around 2002 and slipped from the single-core into the multi-core era, these problems are larger today than ever before.

A corollary of this tendency is the importance of knowing and understanding the worst-case behavior of new microprocessors, which is typically investigated by writing stressmarks: benchmark programs that stress the processor to its limits. As this job becomes ever more tedious and expensive, the industry is looking for ways to automate the process.

In this master's thesis, we described the StressmarkRunner framework, a solution for the automatic generation of stressmarks based on prior research in this area by Joshi et al. [12]. Our two main contributions are the use of the C programming language to make these stressmarks platform-portable, and the support of multi-core platforms.

The stressmark generation process begins with the workload model, an abstract description

containing a number of parameters determining the stressmark’s characteristics. We started

from the workload model proposed by Joshi et al. [12] and have simplified and extended it

in order for it to better suit our own requirements.

Simplification was achieved by removing a number of workload parameters such as block

size and its standard deviation, and by reducing the parameter determining the minimum

dependency distance to a scalar value. We found that these parameters, which are necessary

for the generation of synthetic benchmarks mimicking the behavior of manually designed

benchmarks, are overkill in the case of stressmarks.

Extension of the workload model was necessary to support multi-threaded stressmarks. We


added the number of software threads as a workload parameter, and distinguished between

shared and private memory instructions, allowing cache coherence issues to enter the equation.

We ended up with a total of thirty parameters, each one aimed at stressing a specific part of

the processor with as few overlapping effects between the different parameters as possible. We

made sure that new parameters can be added with relative ease, allowing for future extension

with platform-specific parameters, thus increasing the platform-portability of our solution.

As the next step, we employed the abstract workload model to generate synthetic benchmarks

in C, again with platform-portability as our main goal. This portability is preserved during

the stressmark generation phase through the wide support of C compilers, which act as

implementers providing the platform-specific details needed to convert our C benchmark into

executable binaries. We considered alternatives for C such as Fortran, or GCC’s or LLVM’s

intermediate representations. The latter approach stimulated us to reflect on the role of the

programming language as an interface to the backend of the compiler.

The use of C or one of its alternatives also poses a large number of challenges, which we thoroughly researched and documented. Compiler optimization plays a crucial role in the compilation of stressmarks; it is a huge and complex process that we tried to understand in order to control it as well as possible. We described different types of optimizations,

discussed their consequences for the stressmark’s behavior, and proposed and implemented so-

lutions to the problems they posed. We discussed and utilized different C language constructs

that proved useful and often necessary to express the different parameters of the workload

model in C, and we described in detail how to implement these parameters.

Having explored the possibilities of the use of a low-level programming language, we also paid

attention to the limits of this approach. We discussed the example of SIMD instructions,

which at the moment cannot be implemented in a truly platform-portable way.

The synthetic benchmarks we create from their workload models using the stressmark generator are then optimized to maximally stress the components of the underlying platform. We achieved this by writing a genetic algorithm that selects for one of the output characteristics of the stressmarks. As configuring a genetic algorithm is often a tricky enterprise, we used a simple hill-climbing meta-algorithm to determine the best mutation and crossover probabilities, population size, and elitism factor.

Finally able to generate real stressmarks, we set up two target platforms to run the genetic

algorithm on while demonstrating the platform-portability of our approach. The first plat-

form was the SESC simulator, running the configuration of an SMP architecture executing

the MIPS instruction set. The second was the Intel Core 2 Quad processor, executing the x86-64 instruction set.


On the SESC platform, we ran the genetic algorithm optimizing for maximum power usage

through more than a hundred generations, totaling more than 7200 stressmark individuals.

We achieved a resulting power usage three times higher than the maximum usage in the initial

generation.

Because of the large number of simulations and the relatively long simulation times, we set

up a distributed system with a job queue in support of this experiment. For this, we used

nine dual-core servers in the Hydra cluster running at the ELIS research centre at our alma

mater, the University of Ghent. We learnt a lot by developing this system in a research environment, a setting that inspired us to design a new software architecture pattern suited to the specific requirements that research imposes.

On the Intel Core 2 Quad hardware platform, we ran a similar experiment optimizing the

IPC output characteristic of our stressmarks, resulting in a 30% increase, almost reaching an

IPC of three. Since the maximum IPC of the platform is four, we examined the potential

reasons for the result we obtained by studying the processor architecture, finding that our

algorithm restricted itself to the arithmetic ALUs.

On top of all this, we tried to thoroughly verify our framework and methods. First, we

explored our search space by examining characteristic workload models, making sure the

results met our expectations and that the stressmarks' output could be effectively controlled through the workload model. Second, we compared the performance of our genetic algorithm to that of a random search algorithm, gaining new insights by examining the distribution of the stressmarks' fitness values and finding that our genetic algorithm is 50%

more effective than the random search we ran. Third, we calculated the theoretical maximum

power usage of the SESC SMP platform by summing the maximum power values of its

components, finding that the power usage of our stressmarks reaches 50% of the theoretical

maximum, a value comparable to the one we found in the literature.

Looking back at a fruitful year, however, the things we probably cherish most are the experiences we gained (most of them good and all of them valuable) from setting about the eventful undertaking that producing a master's thesis is, and following through until the end of this very paragraph.


Bibliography

[1] Intel developers manual (basic architecture). http://www.intel.com/Assets/PDF/

manual/253665.pdf.

[2] Intel turbo boost. http://www.intel.com/technology/turboboost/.

[3] Jyaml library. http://jyaml.sourceforge.net/.

[4] Papi: Performance application programming interface. http://icl.cs.utk.edu/papi/.

[5] Sesc documentation. http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/.

[6] YAML: YAML Ain't Markup Language.

[7] 14th International Conference on High-Performance Computer Architecture (HPCA-14

2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.

[8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry

Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf,

Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel comput-

ing research: A view from berkeley. Technical Report UCB/EECS-2006-183, EECS

Department, University of California, Berkeley, Dec 2006.

[9] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, (1), 2001.

[10] Michael Haungs, Phil Sallee, and Matthew Farrens. Branch transition rate: A new metric for improved branch classification analysis. In International Symposium on High-Performance Computer Architecture, 2000.

[11] John Hennessy and David Patterson. Computer Architecture - A Quantitative Approach.

Morgan Kaufmann, 2003.

[12] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated micropro-

cessor stressmark generation. In HPCA [7], pages 229–239.


[13] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the

IEEE, 86(1):82–85, 1998.

[14] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software.

Dr. Dobb’s Journal, 30(3):202–210, 2005.


List of Figures

1.1 SPECint performance over the years (image source: [11]). . . . . . . . . . . . 2

1.2 Power wall, frequency wall and ILP wall (image source: [14]). . . . . . . . . . 3

2.1 Miss rates of local (left) and global (right) branch predictors for different classes

of branches, identified by transition rate and taken rate. . . . . . . . . . . . . 12

3.1 Global stressmark structure. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Stressmark generation. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36

4.1 Optimization process. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 39

5.1 SESC overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51

5.2 Hardware overview. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52

5.3 Packages. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54

5.4 Commands Core. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55

5.5 Commands Jobmanager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 57

5.6 Commands Genetic Algorithm. . . . . . . . . . . . . . . . . . . . . . . . . . . 58

5.7 Commands Meta GA. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59

5.8 Other Commands. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 61

5.9 Jobmanager. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 62

5.10 Jobstates. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63

5.11 Genetic Algorithm Classes. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64

5.12 Util. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65

6.1 Executed Instructions. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 67

6.2 Integer Addition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 69

6.3 Integer Multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 70

6.4 Double Addition. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 71

6.5 Double Multiplication. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 72

6.6 Integer and Double operations. . . . . . . . . . . . . . . . . . . . . . . . . . . 73

6.7 Private Loads and Stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 74


6.8 Shared Loads and Stores. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.9 Result SESC. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76

6.10 Result Core 2 Quad. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77

6.11 Random Search. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 80

6.12 GA Comparison. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 81

6.13 Theoretical Upper Limit. . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 82


List of Tables

2.1 Example of an arithmetic instruction profile . . . . . . . . . . . . . . . . . . . 6

2.2 Example of a memory instruction profile . . . . . . . . . . . . . . . . . . . . . 7

2.3 Example of a branch transition rate distribution . . . . . . . . . . . . . . . . 8

2.4 Example of a data and memory footprint . . . . . . . . . . . . . . . . . . . . 8

2.5 Example of a stream stride distribution . . . . . . . . . . . . . . . . . . . . . 9

2.6 Workload summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10

3.1 Used compiler flags . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19

3.2 Redundant function . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 20

3.3 Alternatives . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.4 Redundant operations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21

3.5 Redundant blocks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.6 Loop invariants . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 23

3.7 Branch optimization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.8 Alternative control flow implementations . . . . . . . . . . . . . . . . . . . . . 27


Listings

3.1 Compilation result with -O1 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17

3.2 Dependency distance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22

3.3 Constant folding and propagation . . . . . . . . . . . . . . . . . . . . . . . . . 24

3.4 Static branch implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . 25

3.5 Arithmetic instructions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28

3.6 Arithmetic + memory instructions . . . . . . . . . . . . . . . . . . . . . . . . 29

3.7 Arithmetic + memory + branch instructions . . . . . . . . . . . . . . . . . . 31

3.8 Starting stressmarks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 33

3.9 Alternative BTR . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34

3.10 Auto-vectorization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35

5.1 Scala quicksort example . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45

5.2 Get work query . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49

6.1 Workload SESC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75

6.2 Workload x86-64 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 78


Automatic Generation of Multi-core Stressmarks

Introduction

From 1986 to 2002, the period before the so-called power wall, the power consumption of microprocessors doubled every four years and serial processing speed improved by about 50% per year [14]. These speed improvements were driven primarily by Moore's law [13], which states that the number of transistors on an integrated circuit doubles roughly every two years. Transistors were initially scarce and power consumption was not a limiting factor. With more transistors, the execution pipeline could be made deeper so that the clock frequency could rise. Out-of-order processors made it possible to use those pipelines efficiently, and even to drive multiple execution pipelines, by maximizing ILP (Instruction Level Parallelism).

This evolution came to an abrupt end around 2002; the power wall had been reached. Raising the clock frequency increases power consumption and therefore thermal dissipation. The cost of cooling solutions, however, grows exponentially as a function of that dissipation, and around 2002 the limit of what is practically feasible in cooling was reached.

A ceiling was also reached for ILP. The hardware structures became too large, and their cost was no longer proportional to the speed gained. Moore's law still holds, which means that transistors became less and less, and power consumption more and more, the limiting factor. Processor designers therefore have to concentrate increasingly on the power behavior of the processor.


Figure 1: Power wall, frequency wall and ILP wall [14].

Conventional benchmarks can be used to analyze the typical power behavior of a processor, but that is no longer sufficient. A better way to analyze processor behavior is with stressmarks.

There is an ever-growing gap between the maximum and the typical power consumption of a processor [9]. This poses a difficult problem for processor designers, because correct operation of the processor must be guaranteed even under extremely rare circumstances. Stressmarks are used to investigate those rare cases. These are benchmarks that provoke extreme behavior in the processor. Some applications are the following:

- Determining the safety margins for the thermal and power management of the processor, for example temporarily lowering the clock frequency when the processor gets too hot.

- Detecting so-called hotspots: small regions on the chip that become very hot during a short period. Hotspots are bad for the lifetime and reliability of a processor.

- Dimensioning the cooling system of the processor and/or its power supply.

Today, stressmarks are developed manually by a specialist who knows the processor inside out. It is a tedious and time-consuming task that has to be redone every time the behavior of the processor changes.

In this thesis we try to automate the development of stressmarks. We build on previous work, the "StressMaker framework" of Joshi et al. [12]. This framework can automatically generate benchmarks for an Alpha 21264 microprocessor. The key idea behind the framework is the use of synthetic benchmarks that are built from a small set of program characteristics. A stressmark is obtained by optimizing those characteristics for maximum power consumption with a search algorithm.

Our framework improves on this approach on a few critical points:

- The synthetic benchmarks are generated entirely in pure C constructs. This makes the framework platform-portable: in principle it can generate a stressmark for any processor supported by the compiler.

- The program characteristics are specialized for generating stressmarks (instead of synthetic benchmarks that imitate the behavior of other benchmarks).

- The framework can create multi-threaded stressmarks that communicate through memory.


Abstract Workload Model

A stressmark, then, is a synthetic benchmark that is generated from a number of characteristics. We distinguish between platform characteristics and program characteristics. The platform characteristics are limited to the size of a cache line and the number of hardware threads the processor can execute.

The program characteristics describe the workload that is exerted on the processor. These are the parameters that the search algorithm will optimize. It is essential to keep the number of workload parameters to a minimum in order to keep the search space as small as possible.

The workload parameters are the following:

General instruction mix

A relative distribution of arithmetic instructions, memory instructions and branch instructions.

Arithmetic instruction mix

The relative distribution of arithmetic operations. An operation is defined by a data type (floating point or integer) and an arithmetic operation (addition, subtraction or division).

Memory instruction mix

A relative distribution of memory operations. There are four operations: reading from or writing to shared memory, and reading from or writing to thread-local memory.

Branch behavior (inverse branch transition rates)

This parameter determines the relative distribution of branch transition rates. The branch transition rate is the number of times a branch changes direction (i.e. taken or not taken) relative to the total number of times the branch instruction was executed. A branch transition rate of 0 means the branch is static.


Minimum dependency distance

The smallest permissible read-after-write dependency distance in the stressmark.

Sizes

The number of instructions in the stressmark and the sizes of the thread-local and shared memory.

Behavior of striding memory instructions

Memory instructions walk through memory with a fixed stride, defined in units of the cache-line size. This parameter is a relative distribution of the strides of the memory instructions. A stride of 0 is possible; in that case the same address is read every time.

Synthetic Benchmarks in C

To make the framework platform-portable, we use the C programming language instead of assembly to implement the stressmarks. The compiler then compiles the C code for the target architecture; it is used for instruction selection and register allocation on the target platform.

Developing synthetic benchmarks in C is difficult because a compiler optimizes the program for maximum performance, while the characteristics of the stressmark must be preserved after compilation. Compilers perform many optimizations for which this does not hold, such as the elimination of redundant code, loop-invariant code motion, arithmetic optimizations, unwanted floating-point exceptions, instruction reordering, and so on.

A compiler can be configured to control which optimizations are performed and which are not, but this is a laborious process. The difficulty lies in finding a balance between under- and over-optimization. With under-optimization, the compiler optimizes too little to still perform instruction selection and register allocation properly; with over-optimization, the compiler changes the characteristics of the stressmark, for example by eliminating instructions.

Our approach is to configure the GCC compiler for a minimal acceptable optimization level and then to shape the code of the stressmark so that the remaining optimizations no longer affect its characteristics.

Stressmark Optimization

Figure 2: The optimization process (abstract workload model → synthetic benchmark → measurements (SESC / HPC) → optimization).

As described in the previous section, synthetic benchmarks can be generated from an abstract workload model. While a benchmark runs, we can measure various characteristics such as power consumption, IPC, or temperature. To turn a synthetic benchmark into a stressmark that stresses one of these characteristics, we use a genetic search algorithm.

We start with an initial population of randomly generated workload models. The StressmarkGenerator application then produces the corresponding benchmarks; we run them and measure the value of the characteristic to be stressed for each run. That value is the fitness of the individual (the stressmark) in question. Next we apply the two phases of the genetic process: selection and reproduction.

Selection is proportional to the fitness of the individuals: the probability of being selected is proportional to the fitness of the stressmark. After two stressmarks have been selected, a child stressmark is produced by applying crossover and mutation. In crossover, the workloads of both stressmarks are combined by copying some parameters from the first parent and some from the second; the crossover probability determines how thoroughly the two parents are mixed. In the mutation phase, one or more random changes may additionally be applied to the resulting workload model; the number of changes is influenced by the mutation probability.

Starting from the initial population, a new one is generated by repeatedly applying selection and crossover, while also taking the elitism factor into account. Elitism means that one or more individuals from the previous generation are carried over unchanged into the next population. By generating new populations again and again, we select for ever-higher fitness and thus obtain a stressmark that maximally stresses the chosen characteristic.

We use the following configuration values for running our tests:

- A population size of 72 workload models

- A mutation probability of 10%

- A crossover probability of 80%

- An elitism factor of 1

This configuration was determined with a simple hill-climbing algorithm that we designed for this purpose.

Development of the Framework

Although we encountered several software development techniques during our education, this was the first time we developed a large software application in a research environment. We found that this gave rise to specific requirements and challenges, which we discuss briefly.

The most important requirement is probably that the researcher must be able to concentrate on the research work without being hindered by the software being used. A programming language must therefore be chosen that reflects this requirement.


For this reason we chose the Scala programming language. Scala combines functional aspects with object-oriented paradigms. It is compiled to Java bytecode and is therefore compatible with Java in both directions: Java code can be invoked directly from Scala and vice versa.

Using Scala has several advantages. The expressiveness of the language keeps the code concise and clearly conveys the programmer's intent. Since Java is compatible with Scala, the many available Java libraries can be used. The type system is both robust and flexible, preventing errors without imposing too many restrictions. Because the language has a high level of abstraction, there are no C-style pointers and memory allocation is handled automatically.

To conveniently drive the command line from the Scala environment, we use the Apache Ant library. This is very important because the various third-party applications we use are only available through the command line.

We further improved this integration by adopting a software architecture that we call the "mirrored command line suite". Its principles are the following:

1. All functionality of the framework we develop must be available in the form of input-output commands with a (limited) number of parameters to control their behavior.

2. The individual steps of the processes in the framework are exposed as separate commands, so that the developer can easily run and debug them separately.

3. Every command is mirrored, by which we mean that it is made available to the programmer twice: once in the Scala environment and once as a script on the command line.

4. All command line scripts that execute commands have names starting with the same prefix, so that the developer can easily find them by typing the prefix and pressing tab.

5. If the input and/or output is structured data, it is stored in files in a human-readable format. We chose YAML [6] for this.

We found this approach very solid for use in a research environment, for several reasons. First, the flexibility of the mirrored command line suite allows it to grow along with the changing requirements of the software. Second, dividing the functionality into a simple command structure ensures that the concepts within the research domain are clearly defined. In addition, the command line offers everything needed to serve as an interactive development environment, such as code completion. We can also provide commands that do not implement framework functionality themselves, but integrate the development environment more tightly into the researcher's workflow by automating repetitive tasks in the build process (continuous integration).

Test Platforms

SESC SMP MIPS

The first platform on which we test our framework uses the SESC simulator to simulate an SMP MIPS architecture. SESC is a simulator built on top of MINT that can simulate both single- and multi-core architectures.

The configuration we used is a version of the symmetric multi-processing (SMP) architecture. It contains 4 identical cores, each running at a clock frequency of 1 GHz in a 70 nm technology. The branch predictor is based on the Alpha 21264 hybrid predictor, and the cache configuration is the following:

- L1D and L1I: 32 kB, 4-way associative, LRU, write-through

- Private L2: 512 kB, 8-way associative, LRU, write-back, MESI

To limit the total simulation time, we implemented a distributed job system to execute the commands offered by the framework in parallel on this platform.


Figure 3: The SESC platform. Each Hydra worker node runs the StressmarkRunner, StressmarkGenerator, SESC simulator, GCC compiler, GNUPlot and SESC threading library components; the job database node runs the MySQL job database.

For this we used the Hydra server cluster of the ELIS research group at our university. It consists of 9 worker servers with dual-core processors and a shared file system, plus a MySQL database server. Each worker server runs 2 threads, each with a worker that connects to the database, requests the next job, and thus executes job after job in parallel with the other workers. In this way we achieved an average total simulation time of only 3.5 seconds per stressmark during our tests.

We use the GCC cross-compiler for the MIPS instruction set that ships with SESC (GCC version 3.4). To implement multi-threading, we use the SESC threading library.

Intel Core2Quad x86-64

The second platform is an Intel Core2Quad 9450 hardware processor executing the 64-bit x86 instruction set. It has 4 cores in 45 nm technology, each running at 2.66 GHz. The TDP (Thermal Design Power), which Intel defines as the maximum power consumption, is 95 W. The cache configuration is the following:

- L1D and L1I: 32 kB per core

- L2: 2 x 6 MB (each shared by two cores)

The Intel Developer's Manual [1] states that the maximum number of instructions per cycle (IPC) for this processor is 4. We use this number to evaluate the performance of our generated stressmarks. We measured the effective IPC during testing using hardware performance counters.

The setup of the software components is the same as for the SESC platform, with three exceptions. First, execution of the stressmark on SESC is replaced by execution on the hardware processor; second, the threading library is replaced by the standard POSIX implementation for Linux (pthreads); and third, the database now runs locally, because execution on hardware is fast enough that distributed execution is not required.

Results

Number of SESC Instructions

To obtain stable characteristics when running the stressmarks on the SESC simulator, the simulation must run long enough. First, there is the initialization phase that is executed when the stressmark starts up. Second, the code of the stressmark itself must also run for a sufficiently long time to exhibit stable behavior.

Fortunately, there is a way to reduce the effects of the initialization phase. SESC supports a so-called "rabbit mode", which allows the initialization to be skipped by executing it at high speed, keeping only the data that is strictly needed to proceed correctly.

We determine the number of instructions needed for a stable run by executing the same stressmark with an increasing number of instructions: first only 1000, then 2500, then 5000, 10000, and so on. We continue until the value of the characteristics no longer changes significantly. This process is then repeated, this time preceded by one million instructions in rabbit mode.


Figure 4: Executed instructions. Power (W) versus the number of simulated instructions (1K to 50M), with rabbit mode disabled and with 1M instructions in rabbit mode.

Results and discussion

Figure 4 shows the results of both experiments. Looking at the power consumption with rabbit mode disabled, the effect of initialization is clearly visible. The heavy workload generated by the stressmark makes power consumption rise only gradually as we run more instructions, stabilizing at about 10 million instructions. As expected, the test with rabbit mode enabled does much better, showing a reduced initialization effect. Note, however, that we pay the (small) cost of running one million instructions in rabbit mode before simulating the 10 million in normal mode. To keep the simulation time short enough without compromising the correctness of the result too much, we ultimately chose to run our simulations with one million instructions in rabbit mode, followed by two million normally simulated instructions.


Exploring the Search Space

For the genetic search algorithm to work efficiently, we need to verify that our abstract workload model can effectively influence the characteristics we optimize for. We do this by examining a few cross-sections of the search space. We use the SESC platform described earlier, running an SMP configuration with four hardware threads and two integer and two floating-point ALUs. On it we run workload models with specific instruction mix profiles, each time looking at power consumption and IPC as a function of the minimum dependency distance (MDD) between the instructions.

Arithmetic instructions

Figure 5: Integer additions. Power (W) and IPC versus the minimum dependency distance (1 to 16).

The first instruction mix profile contains only integer addition operations. We vary the minimum dependency distance from 1 to 16. What is immediately clear is the strong correlation between the two measured characteristics. This is to be expected, since a higher IPC leads to better utilization of the components, which in turn leads to higher power consumption.

To achieve a high IPC, as many instructions as possible must execute in parallel, which requires the minimum dependency distance to be as large as possible. The results confirm this: the characteristics increase as we vary the MDD from 1 to 10. At that point, however, an IPC of 2 is reached, and since only 2 integer ALUs are available, this is clearly the expected maximum.

Figure 6: Double additions. Power (W) and IPC versus the minimum dependency distance (1 to 16).

The second instruction profile contains only addition operations on doubles. The previous conclusions still largely hold, but here we see an interesting additional effect. At an MDD of 16, the IPC rises above 2, even though only 2 double ALUs are available. This is caused by register spilling: since the spill instructions are handled by the integer ALUs, those are lightly loaded as well.


Memory instructions

Figure 7: Private loads and stores. Power (W) and IPC for load/store mixes ranging from 100/0 to 0/100.

We now consider a profile containing only memory instructions. All instructions are thread-local, but the ratio of loads to stores varies, with a mix consisting exclusively of loads on the left and exclusively of stores on the right. The figure shows that executing many stores has a bad effect on the IPC. This is to be expected, since each store must be propagated through the various cache levels, which causes pipeline stalls.

Results of the Genetic Algorithm

SESC platform

The characteristic measured on the SESC platform is the average power consumption over the course of the entire simulation. Figure 8 gives an overview of the generations produced during the search, recording the maximum, average and minimum fitness of the workload models in each population.

Figure 8: SESC results. Minimum, average and maximum power (W) per generation (0 to 100).

In the initial population, the power consumption of the best stressmark is 41.28 W, that of the worst 12.47 W, and the average 21.28 W. During the search, the maximum and average power consumption grow, while the minimum consumption stays fairly low. This is a good sign, as it shows a good balance between variation and quality in the population; such a balance is necessary to optimize the solutions found sufficiently without getting stuck in local minima of the search space.

The final result is a maximum consumption of 163 W, generated by a workload model discovered in generation 87. This is 4 times the maximum consumption in the first generation. The properties of the workload model in question are:

Listing 1: SESC workload

memoryShared: 64
traceSize: 100
arithmeticInstructionMix:
  doubleAdd: 44
  doubleMul: 28
  integerAdd: 23
  integerMul: 5
mdd: 29
swThreads: 3
instructionMix:
  arithmeticInstructions: 89
  memoryInstructions: 2
  branchInstructions: 9
memoryThreadLocal: 2048
memoryStrideProfile:
  size1: 26
  size0: 37
  size4: 2
  size3: 2
  size2: 33
memoryInstructionMix:
  unsharedLoad: 5
  sharedLoad: 10
  unsharedStore: 41
  sharedStore: 44
branchTransition:
  rate0: 11
  rate1: 33
  rate2: 10
  rate4: 28
  rate8: 18

As expected, we see a large number of arithmetic instructions and a high minimum dependency distance. The arithmetic mix is fairly balanced but appears to avoid integer multiplications. The memory instruction mix is arbitrary, because memory instructions make up only 2% of the total and therefore have no influence on the whole.

Core2Quad platform

Figure 9: Core2Quad result. Minimum, average and maximum IPC per generation (0 to 28).

We repeated the same experiment on the hardware platform, this time using the IPC as the characteristic. Although the result is less pronounced than on the SESC platform, we again find a significant increase of the characteristic: the IPC rises from 2.25 in the first generation to 2.92 in the last, an increase of 30%. Our algorithm appears to be limited to an IPC of 3, even though the maximum IPC of the processor is 4. The workload model of the best individual shows us why:

Listing 2: x86-64 workload

memoryShared: 64
traceSize: 50
arithmeticInstructionMix:
  doubleMul: 7
  integerAdd: 33
  doubleAdd: 19
  integerMul: 41
mdd: 9
swThreads: 8
instructionMix:
  arithmeticInstructions: 81
  memoryInstructions: 2
  branchInstructions: 17
memoryThreadLocal: 2048
memoryStrideProfile:
  size0: 30
  size1: 26
  size2: 19
  size3: 17
  size4: 8
memoryInstructionMix:
  unsharedLoad: 25
  sharedLoad: 41
  unsharedStore: 30
  sharedStore: 4
branchTransition:
  rate0: 13
  rate1: 24
  rate2: 13
  rate4: 26
  rate8: 24

Looking at the instruction mix, we see that the algorithm has selected strongly for arithmetic operations and, to a lesser degree, for branch instructions. If we follow this strategy and eliminate all memory instructions, 3 is indeed the maximum IPC: the Intel Core2Quad has only 3 ALUs, and the search algorithm utilizes them fully.

The IPC limit of 4 is set by the fetch width, and it is now clear that memory instructions would have to be added to the mix to approach that limit. The population does contain individuals that use such instructions, but they have a lower fitness, presumably because they no longer fully load the integer ALUs. If a mix combining arithmetic operations and memory instructions that approaches an IPC of 4 exists at all, our algorithm, like us, was unable to find it.
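The bottleneck argument above can be made concrete with a back-of-the-envelope model: sustainable IPC is capped by the fetch width and, per execution-resource class, by units available divided by the fraction of instructions needing that class. This is an illustration, not the thesis tooling; the port counts and the lumping of arithmetic and branch instructions onto the ALU ports are simplifying assumptions.

```python
# Illustrative IPC upper bound from an instruction mix and unit counts.
# Assumed Core2Quad-like machine: 3 ALU ports (handling arithmetic and
# branches together), 2 memory ports, fetch width 4.

def ipc_bound(mix, units, fetch_width):
    """min over resources of units/fraction, capped at the fetch width."""
    bounds = [fetch_width]
    for cls, frac in mix.items():
        if frac > 0:
            bounds.append(units[cls] / frac)
    return min(bounds)

UNITS = {"alu": 3, "mem": 2}

# Best individual: 98% ALU-bound work, 2% memory -> stuck near IPC 3.
print(round(ipc_bound({"alu": 0.98, "mem": 0.02}, UNITS, 4), 2))  # 3.06

# Shifting work onto the memory ports lifts the bound toward the fetch limit.
print(round(ipc_bound({"alu": 0.80, "mem": 0.20}, UNITS, 4), 2))  # 3.75
```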

Efficiency

To test the efficiency of our genetic algorithm, we had a random search algorithm (Monte Carlo) evaluate the same number of simulations that the genetic algorithm performed (7200).

The graph shows two views of the same data. The grey bars show the simulations in the order in which they were run (as far as that is possible in a distributed environment), while the black curve shows the same results sorted. We use the latter to examine the distribution of workload models more closely.

The first thing we notice is a small number of stressmarks that each produce a constant power consumption of 1.9 W. These are instances of the same fallback stressmark code, which is generated when the StressmarkGenerator cannot produce a stressmark that satisfies the workload model given as input, for example when the requested fraction of branch instructions is too high (close to 100%). Only 2.2% of the stressmarks are invalid, which is acceptably small, certainly for randomly generated stressmarks. The genetic algorithm immediately selects these out.

Figuur 10: Random search: power (W) of the 7200 randomly generated workloads, shown both in run order (grey bars) and sorted (black curve).

The rest of the distribution shows a maximum value of 88.84 W. Running the genetic search algorithm with the same version of the StressmarkGenerator yields 140.79 W, 52% better than the random search algorithm. Moreover, the genetic search algorithm's performance is, as expected, much more stable, with a far smaller luck factor.
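The comparison between the two search strategies can be illustrated with a toy version of the experiment. The fitness function below is a stand-in (the real framework measures power or IPC on a simulator or on hardware), and the genetic operators are deliberately minimal; nothing here reproduces the actual StressmarkGenerator search.

```python
# Toy genetic vs. random search under an equal evaluation budget.
# Individuals are bit vectors; fitness is a stand-in for measured power.
import random

def fitness(bits):
    return sum(bits)  # placeholder for a simulated/measured characteristic

def random_search(rng, n_bits, budget):
    """Monte Carlo: best of `budget` independent random individuals."""
    return max(fitness([rng.randint(0, 1) for _ in range(n_bits)])
               for _ in range(budget))

def genetic_search(rng, n_bits, budget, pop_size=20):
    """Steady-state GA: mutate a good parent, replace the worst individual."""
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    evals = pop_size
    while evals < budget:
        pop.sort(key=fitness, reverse=True)
        parent = pop[rng.randrange(pop_size // 2)]       # pick from top half
        child = [b ^ (rng.random() < 0.05) for b in parent]  # 5% bit flips
        pop[-1] = child                                  # replace the worst
        evals += 1
    return max(fitness(ind) for ind in pop)

rng = random.Random(42)
print("random best: ", random_search(rng, 32, 1000))
print("genetic best:", genetic_search(rng, 32, 1000))
```

With the same budget, the genetic search typically ends far closer to the optimum than the random one, mirroring the 140.79 W versus 88.84 W result above.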

Theoretical Maximum

As a final result, we consider an estimate of the theoretical maximum of our SESC configuration and compare it with the result achieved by our genetic search algorithm. We compute this maximum by summing the energy consumption of every component of the architecture for one cycle in the activated state, and multiplying that number by the clock frequency. This yields the distribution in figure 11.

We obtain a theoretical maximum of 321.8 W in total. The best stressmark produced by our genetic algorithm yields 163.18 W, or 50.7% of the theoretical maximum. This is comparable to the 57% reported by Joshi et al. [12].
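The estimate is easy to reproduce from the per-component figures in figure 11. In the sketch below the per-cycle-energy-times-frequency step is already folded into the per-component wattages, which are taken directly from the figure:

```python
# Theoretical maximum power of the SESC configuration: the sum of the
# per-component powers when every component is active every cycle.
# Component values are those of figure 11.

COMPONENT_POWER_W = {
    "Core 1": 54.30, "Core 2": 54.30, "Core 3": 54.30, "Core 4": 54.30,
    "L1 Caches": 30.44, "L2 Caches": 68.46, "TLB": 5.75,
}

theoretical_max = sum(COMPONENT_POWER_W.values())
best_stressmark = 163.18  # best individual found by the genetic algorithm

print(round(theoretical_max, 2))                           # 321.85
print(round(100 * best_stressmark / theoretical_max, 1))   # 50.7
```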

Figuur 11: Theoretical Maximum.

    Core 1      54.30 W
    Core 2      54.30 W
    Core 3      54.30 W
    Core 4      54.30 W
    L1 Caches   30.44 W
    L2 Caches   68.46 W
    TLB          5.75 W

Page 125: Automatic Generation of Multi-core Stressmarks …lib.ugent.be/.../418/423/RUG01-001418423_2010_0001_AC.pdfWouter Kampmann, Lieven Lemiengre Automatic Generation of Multi-core Stressmarks

Bibliography

[1] Intel Developer's Manual (Basic Architecture). http://www.intel.com/Assets/PDF/manual/253665.pdf.

[2] Intel Turbo Boost. http://www.intel.com/technology/turboboost/.

[3] JYaml library. http://jyaml.sourceforge.net/.

[4] PAPI: Performance Application Programming Interface. http://icl.cs.utk.edu/papi/.

[5] SESC documentation. http://iacoma.cs.uiuc.edu/~paulsack/sescdoc/.

[6] YAML: YAML Ain't Markup Language.

[7] 14th International Conference on High-Performance Computer Architecture (HPCA-14 2008), 16-20 February 2008, Salt Lake City, UT, USA. IEEE Computer Society, 2008.

[8] Krste Asanovic, Ras Bodik, Bryan Christopher Catanzaro, Joseph James Gebis, Parry Husbands, Kurt Keutzer, David A. Patterson, William Lester Plishker, John Shalf, Samuel Webb Williams, and Katherine A. Yelick. The landscape of parallel computing research: A view from Berkeley. Technical Report UCB/EECS-2006-183, EECS Department, University of California, Berkeley, Dec 2006.

[9] S. H. Gunther, F. Binns, D. M. Carmean, and J. C. Hall. Managing the impact of increasing microprocessor power consumption. Intel Technology Journal, (1), 2001.

[10] Michael Haungs, Phil Sallee, and Matthew Farrens. Branch transition rate: A new metric for improved branch classification analysis. In Proceedings of the International Symposium on High-Performance Computer Architecture (HPCA), page 241, 2000.

[11] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann, 2003.

[12] Ajay M. Joshi, Lieven Eeckhout, Lizy Kurian John, and Ciji Isen. Automated microprocessor stressmark generation. In HPCA [7], pages 229-239.

[13] G. E. Moore. Cramming more components onto integrated circuits. Proceedings of the IEEE, 86(1):82-85, 1998.

[14] Herb Sutter. The free lunch is over: A fundamental turn toward concurrency in software. Dr. Dobb's Journal, 30(3):202-210, 2005.
