Reliability Assessment of Microservice Architectures

Scuola Politecnica e delle Scienze di Base
Master's Degree in Computer Engineering (Corso di Laurea Magistrale in Ingegneria Informatica)
Master's Thesis in Distributed Systems (Tesi di Laurea Magistrale in Sistemi Distribuiti)

Academic Year 2017/2018

Advisor: Prof. Stefano Russo
Co-advisor: Prof. Roberto Pietrantuono
Candidate: Antonio Guerriero (matr. M63000564)

Dedicated to those who have always been by my side: my family, my friends, Zia Rita, Nonno Antonio, and Milly.


Index

Introduction
Chapter 1: Background
  1.1 Microservices Architectures
    1.1.1 Example
  1.2 Reliability Assessment
  1.3 Adaptive Web Sampling
    1.3.1 Sampling setup
    1.3.2 Design
  1.4 Related Work
Chapter 2: MART strategy
  2.1 MART overview
    2.1.1 Assumptions
  2.2 Test generation algorithm
    2.2.1 Domain Interpretation
    2.2.2 Weights Matrix determination
    2.2.3 Testing strategy
    2.2.4 Estimation
    2.2.5 Active Set Update
    2.2.6 Algorithm implementation
  2.3 Probability Update
  2.4 Formulation with dynamic sampler selection
Chapter 3: Simulation of the test generation algorithm
  3.1 Simulation Scenarios
    3.1.1 Population generators
  3.2 Evaluation Criteria
  3.3 Empirical correction of Estimator
  3.4 Sensitivity Analysis
    3.4.1 Sensitivity Analysis in static implementation
    3.4.2 Sensitivity Analysis in dynamic implementation
  3.5 Results
    3.5.1 MSE
    3.5.2 Sample Variance
    3.5.3 Failing Point Number
    3.5.4 Considerations
Chapter 4: Experimentation
  4.1 Pet Clinic
  4.2 MART setup
  4.3 Functions
    4.3.1 True Reliability calculation function
    4.3.2 Update functions
    4.3.3 Reliability Assessment function
    4.3.4 Operational Testing function
    4.3.5 Distance between profiles function
  4.4 Experimental design
    4.4.1 Experimental scenarios
    4.4.2 Evaluation criteria
    4.4.3 True Reliability estimation
  4.5 Results
    4.5.1 Experiment 1
    4.5.2 Experiment 2
    4.5.3 Experiment 3
    4.5.4 Experiment 4
    4.5.5 Further considerations
  4.6 ANOVA
Conclusions
Bibliography


Introduction

This Thesis addresses an open problem: the reliability assessment of Microservice Architectures (MSA). Nowadays, many software architectures are defined according to this style, especially by companies providing on-demand software services; a relevant example is Netflix [1], the world's leading internet entertainment service.

Microservice architectures are typically developed by means of an agile approach (as in DevOps), with frequent deliveries of software services (up to several releases per day). In such a dynamic scenario, service reliability may change over time, due to changes in the service provisioning and/or in the way services are used by customers (the operational profile). The goal is to determine how these changes may affect the reliability of the entire system. Hence, we deal with the specific problem of online reliability assessment, i.e., assessment performed while the system is in the operational phase. All these considerations require the ability to provide reliability estimates that are, at the same time, accurate and efficient.

There are different techniques for reliability assessment, for example through conceptual models of the system under test. In this Thesis, the assessment is performed by testing, with the system under test online and in its usage environment. This approach has rarely been used, because it is difficult to obtain the exact operational profile of the system; this Thesis presents a technique to overcome this limit.


Starting from the preliminary tester’s knowledge, the information about the operational

profile and the proneness to failure of a given service is progressively refined during the

execution. This information can make the difference in reliability assessment; in fact, with

agile development and in-vivo testing, there is an implicit feedback mechanism that allows

us to update the system representation.

Operational testing does not consider several characteristics of this kind of software system, in particular:

1. Variability of Microservices, caused by the high frequency of releases;

2. Variability of the operational profile;

3. Testing budget constraints.

The technique developed in this Thesis, called Microservice Adaptive Reliability

Testing (MART), exploits the information coming from the field and an advanced

sampling algorithm to obtain an updated, accurate, and efficient estimate of reliability.

In particular, a new testing algorithm is developed, based on an adaptive sampling

procedure particularly suited for rare and clustered populations [2], such as faults in

operational software systems.

The Thesis is organized in four chapters: first, a background and related work examination is presented, in order to explain preliminary knowledge and to survey the existing literature about reliability assessment of MSA; the second chapter presents the formulation, describing the conceptual transformation from the sampling scheme to the MART strategy; the third chapter reports the results of the simulation, carried out to evaluate the behavior of the algorithm under specific assumptions so as to determine the best configuration; the fourth chapter deals with the experimentation, in which MART is validated and evaluated on a real system.


Chapter 1: Background

This Chapter presents the knowledge needed as background for the rest of the Thesis. In

particular, it focuses on three points: Microservice Architectures, reliability assessment

and the sampling strategy defined by Thompson in [2].

To verify that the reliability assessment of Microservice Architectures is indeed an open problem, the current literature about the covered subjects is examined in the related work section.

1.1 Microservices Architectures

As described in the article by Paolo Di Francesco, Patricia Lago, and Ivano Malavolta [3], the most

recurring definition is the one provided by J. Lewis and M. Fowler [4]: “the microservice

architectural style is an approach to developing a single application as a suite of small

services, each running in its own process and communicating with lightweight

mechanisms, often an HTTP resource API. These services are built around business

capabilities and independently deployable by fully automated deployment machinery.

There is a bare minimum of centralized management of these services, which may be

written in different programming languages and use different data storage technologies”.

Microservice Architectures are particularly prone to be implemented by extremely agile

processes, for which automated continuous development, integration, test and release,

monitoring and feedback are a cornerstone. All these features are supported by the

underlying cloud-based technologies, such as containers, which alleviate several manual tasks and save time.


Microservice Architectures are characterized by a simple communication infrastructure; in

fact, they are usually based on REST, so that every service can be reached with a URI over HTTP.

Several technologies are being developed to implement MSA-based applications. In this

Thesis, the framework based on Spring is adopted. Specifically, Spring Boot, based on

the Spring framework, is used to accelerate and facilitate application development. Spring Cloud builds on Spring Boot by providing a set of libraries that enhance the behavior

of an application when included [5].

Another important characteristic is the use of Netflix OSS [6], the open-source Microservice components implemented by Netflix, a company at the forefront of managing and implementing Microservice systems.

Spring Cloud Netflix [7] is a project that offers Netflix OSS integrations for Spring Boot

apps through auto-configuration and binding to the Spring Environment and other Spring

programming model idioms.
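As a minimal illustration of how such a service exposes its functionality as a REST resource reachable via a URI over HTTP, consider the following Spring Boot sketch (the class name and the endpoint are hypothetical, not taken from the systems used in this Thesis):

// Minimal Spring Boot microservice sketch: a single REST endpoint reachable via a URI over HTTP.
// DemoServiceApplication and /status are illustrative names only.
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@SpringBootApplication
@RestController
public class DemoServiceApplication {

    // GET http://host:port/status returns a plain-text payload.
    @GetMapping("/status")
    public String status() {
        return "UP";
    }

    public static void main(String[] args) {
        SpringApplication.run(DemoServiceApplication.class, args);
    }
}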

1.1.1 Example

Microservice Architecture (MSA) arises from the broader area of service-oriented architecture (SOA). As described in [8] and [9], there are several differences between

SOA and MSA. The following example, taken from [10], shows a simple application

structured according to the MSA style compared to an SOA in order to highlight the main

difference between the two - a detailed comparison is beyond the scope of this Thesis. The

example is an e-commerce system: there are two main roles in an SOA, a service

provider and a service consumer. A software agent can play both roles. The consumer

layer is the point where consumers (human users, other services or third parties) interact

with the SOA, and the provider layer consists of all the services defined within the SOA,

in detail: a service to manage the order, a service to manage the inventory and a service to

manage the shipping. Figure 1 shows a quick view of the SOA.

Communication is based on the Enterprise Service Bus (ESB), a common communication bus consisting of a variety of point-to-point connections


between providers and consumers. In addition, the data storage is shared among all the services.

Figure 1: SOA representation

On the other hand, in an MSA, services should be independently deployable: it should be possible to shut down a service, when it is not required in the system, with no impact on the other services. Figure 2 shows a view of an MSA.

Figure 2: MSA representation

This architecture makes all the services independent, each one implemented in a

microservice, and the communication mechanism (usually REST-based) is simpler than in

the SOA case.


1.2 Reliability Assessment

Dependability is a software quality factor, defined as “the trustworthiness of a computer

system such that reliance can justifiably be placed on the service it delivers” [11]; it is

composed of five attributes: Reliability, Availability, Safety, Integrity and Maintainability.

Reliability is a Dependability attribute with different definitions, but the one most commonly used in engineering applications is: “the characteristic of an item expressed by the

probability that it will perform a required function under stated conditions for a stated

period of time” [12]. In the most general case, reliability can be defined as $R(t, \tau)$, the probability that the system is in proper service in the interval $[t, t+\tau]$, given that it was in proper service at time $t$:

$R(t, \tau) = P(\text{no Failure in } (t, t+\tau) \mid \text{proper service at time } t)$

In particular, when the interval is [0, t], reliability can be described as:

$R(t) = P(\text{no Failure in } (0, t))$

Defining the unreliability $F(t)$ as the Cumulative Distribution Function (CDF) of the time to failure, reliability is computed as:

$R(t) = 1 - F(t)$

Defining $\lambda$ as the failure rate of a system, measured as the number of failures per hour, it is possible to calculate:

$R(t) = e^{-\lambda t}$

The latter is called the “exponential failure law”: in a system with constant failure rate, reliability decreases exponentially over time.
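As a quick worked example (with an illustrative failure rate, not taken from the Thesis): for a constant failure rate $\lambda = 10^{-3}$ failures per hour, the reliability over a 1000-hour interval is

$R(1000) = e^{-\lambda t} = e^{-10^{-3} \cdot 1000} = e^{-1} \approx 0.368$

i.e., the system survives 1000 hours without failure with a probability of about 37%.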

In the considered systems, a failure is perceived only when a request is submitted to a

Microservice; thus, the discrete reliability is introduced:

$\text{Reliability} = 1 - \text{Probability of Failure on Demand}$

What does reliability assessment mean? It is the activity of quantifying the reliability of a system.


There are different techniques to do this; for example, reliability assessment can be performed through conceptual models of the system under test, but in this Thesis the assessment is carried out by testing. Testing effectiveness and efficiency are strongly influenced by the test case generation strategy. In the following section, the adaptive web sampling scheme is introduced; it is the basis of the test generation algorithm of the MART strategy.

1.3 Adaptive Web Sampling

Sampling is a statistical technique to infer characteristics of an entire population from a set of samples (a set of observations; in our case, a set of test cases).

The considered sampling algorithm is the Adaptive Web Sampling Without Replacement,

described by Thompson in [2]. This choice depends on several factors: in software systems, failures form a rare population, and all the information on the operational profile and failure probability can be encoded so as to “drive” the sampling toward the most significant samples.

It is worth examining Thompson's idea more deeply: he considers a network characterized by a certain number of nodes and links between them; the goal of adaptive web sampling is to estimate a variable y by selecting n nodes from the N available. The sampling is based on a mixture distribution: nodes are chosen either at random or depending on the link weights.

1.3.1 Sampling setup

In this formulation, the author considers a population made of a set of labeled units (1, 2, ..., N); every label is associated with one or more variables of interest $y_i$. A pair of values is attached to each node:

• $y_i$: the variable of interest associated with the ith node;

• $w_{ij}$: associated with any pair of nodes i, j (this value indicates whether a link exists between i and j, and describes its weight).

A sample s is defined as a subset of units from the population: it comprises both a sample of nodes $s^{(1)}$ and a sample of pairs of nodes $s^{(2)}$. The design is adaptive if it depends on any of the variables of interest in the sample.


The original data, defined as $D_0$, is the sequence of sample units, in the order selected, with their associated values. It is assumed that the minimal sufficient statistic consists only of the set of labels of the distinct units selected, with the associated y values. Finally, the reduced data are defined as $D_r = \{(i, y_i), ((j,k), w_{jk}) : i \in s^{(1)}, (j,k) \in s^{(2)}\}$.

1.3.2 Design

An initial sample $s_0$ is selected from some design $p_0$. At the kth step, the sample $s_k$ is taken, depending on the values associated with the current active set $a_k$. The active set is defined as a subsequence or subset of the current sample, together with any associated variables of interest. Thus, the selection distribution $p_k(s_k \mid a_k, y_{a_k}, w_{a_k})$ is defined.

The idea is to select the next sample according to a probability d: with probability d, the next unit is taken using a distribution based on the unit values or on the graph structure of the active set; alternatively, with probability 1-d, another distribution is used, for example based on the sampling frame or on the spatial structure of the population. This is realized by a mixture distribution. In [2], with probability d a unit linked to the active set is selected at random; with probability 1-d, the next unit is taken completely at random. The probability d can depend on the values in the active set; in fact, if there are no outgoing links from the active set, the next unit must be taken randomly.

The adaptive selection can be made unit by unit or in waves. One of the most important

features of this approach is its flexibility, which is obtained through the mixture distribution and through the allocation of part of the effort to the initial sample. This flexibility balances the way in which the population is explored: going deeper by following links, or going wide with only one or a few waves.

The current sample is denoted $s_{ck}$; the numbers of units in the active set and in the current sample are $n_{a_k}$ and $n_{c_k}$, respectively. The next set of units $s_k$ is selected with probability $q(s_k \mid a_k, y_{a_k}, w_{a_k})$; when sampling in waves, the current sample $s_{ckt}$ before the tth unit of the kth wave is considered. When the tth unit in the kth wave is selected, it is possible to determine $w_{a_{kt}+}$ as the total number of links out, or the total of the weight values, from the active set $a_k$ to the units not in the current sample $s_{ckt}$; consequently, it is possible to define:

$w_{a_{kt}+} = \sum_{\{i \in a_{kt},\, j \notin s_{ckt}\}} w_{ij}$

Thus, for each unit i, $y_i$ and $w_{i+}$ are observed; moreover, for each pair (i, j), with i and j both in the sample, $w_{ij}$ and $w_{ji}$ are observed.

Now, consider the case in which a unit i not already in the sample is taken: assuming that in the active set there are one or more units having links (or positive weight values) out to unit i, summarized as $w_{a_k i} = \sum_{j \in a_k} w_{ji}$, the probability that this unit is taken is:

$q_{kti} = d\,\frac{w_{a_k i}}{w_{a_{kt}+}} + (1-d)\,\frac{1}{N - n_{s_{ckt}}}$

where d is between 0 and 1; if there are no links out from the active set, only the second part of the equation is considered. The equation takes a unit linked to the active set, at random, with probability d, or selects a random unit among those that are not part of the current sample with probability 1-d.

The overall sample selection probability is:

$p(s) = p_0 \prod_{k=1}^{K} \prod_{t=1}^{n_k} q_{kti}$

where:

• $i_{kt}$ is the tth unit in the kth wave;

• the kth wave, in the order selected, is $s_k = (i_{k1}, \ldots, i_{k n_k})$;

• $n_k$ is the size of the kth wave;

• $p_0$ is the selection probability of the initial sample;

• K is the number of waves.

The link weight can be associated with the selection probability of the current unit, $p(i \mid s_{ckt}, a_k, y_{a_k}, w_{a_k})$. In the most general context, w is a continuous weight variable and the extraction probability is:

$q_{kti} = d\, p(i \mid s_{ckt}, a_k, y_{a_k}, w_{a_k}) + (1-d)\, p(i \mid s_{ckt})$

It is possible to add a step in which the obtained solution is accepted or rejected, based on a certain value, such as the value of y or the out-degree. As previously stated, d can be replaced with a probability $d(k, t, a_k, y_{a_k}, w_{a_k})$, so that it depends on the nodes and links in the active set, or changes as the sample selection progresses.
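To make the selection mechanism concrete, the following sketch (our own illustrative code, not taken from [2]) computes the mixture selection probability $q_{kti}$ for a candidate unit i:

public class AwsSelectionProbability {
    /**
     * Computes q_kti = d * (w_aki / w_akt+) + (1 - d) * 1 / (N - n_sckt).
     * wAki:     total link weight from the active set to unit i;
     * wAktPlus: total link weight out of the active set towards units not in the sample;
     * bigN:     population size; nSckt: current sample size; d: mixture parameter in [0, 1].
     * If the active set has no outgoing links, only the random component is used.
     */
    public static double q(double d, double wAki, double wAktPlus, int bigN, int nSckt) {
        double random = 1.0 / (bigN - nSckt);
        if (wAktPlus == 0) return random;   // no links out of the active set: pure random selection
        return d * (wAki / wAktPlus) + (1 - d) * random;
    }
}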

1.4 Related Work

A considerable number of works about Microservice Software Architectures have been


published in the last few years. Besides architectural and design issues, researchers started

targeting quality concerns and how this new architectural style impacts them. Among the

quality attributes of interest, performance and maintainability are the most investigated

ones, followed by security-related studies, according to a recent mapping study [3].

Reliability is considered in only a few studies, and always in its broader acceptation related to

dependability (i.e., fault tolerance, robustness, resiliency, anomaly detection) – no study

deals with reliability meant as probability of failure-free operation for a specified time and

environment. Moreover, reliability-related considerations rarely appear as the main

proposal of a research, but more often as a side concern in a design-related proposal.

For instance, in [13] the authors propose a novel architecture that enables scalable and resilient self-management of microservice applications on the cloud, in which continuous management monitors application and infrastructural metrics to provide automated and responsive reactions to failures (health management) and to changing environmental conditions (auto-scaling), minimizing human intervention. In [14] the authors define a prototype framework for software service emergence called Mordor, with the aim of enabling service emergence in a pervasive environment. In [15] a formal model for multi-phase live testing is defined, to test changes or new features in the production environment. [16] identifies the key challenges that impede realizing the full promise of containerizing infrastructure services. [17] describes the possibility of using Microservice patterns for IoT applications. In [18] the authors introduce software to compose one or more arbitrary Docker containers to realize systems of Microservices. In [19] the author proposes a new Cloudware PaaS platform based on the microservice architecture and lightweight container technology, with which traditional software is deployable without any modification. In [20] a framework for systematically testing the failure-handling capabilities of microservices is defined. In [21] the author introduces an automated tool able to understand the service architecture and topology, and to inject faults to assess the fault tolerance and resiliency of the system. In [22] the authors present some related cloud computing patterns and discuss their adaptations for the implementation of IMS or other telecommunication systems. [23] presents three cloud microservices that can substantially accelerate the development and evolution of location- and context-based applications. In [24] learning-based testing (LBT) is used to evaluate the functional correctness of distributed systems and their robustness to injected faults. [25] reports experiences in migrating a monolithic on-premise software architecture to microservices, with positive and negative considerations. In [26] the authors propose a decentralized message bus to be used as a communication tool between services.

In these articles, the problem of Microservice implementation, deployment and improvement is central, but with no reference to reliability assessment; however, from them emerges the need to evaluate the environment characteristics and the general dependability of the deployed system.

Enlarging the scope to quality assessment in general, in particular to Reliability, Performance and Security, in all cases there are references to the evaluation, rather than to the assessment, of the quality attributes considered.

Once again, within the scope of reliability assessment of Microservice Architectures there are no references. For this reason, the subject matter of this Thesis is an absolute novelty.

In MSAs, as well as in many other scenarios, the assumption of an operational profile

known at development time is easily violated. In addition, in MSA the changing of the

software itself needs to be considered too, as continuous service upgrades occur. The method proposed here is conceived to encompass both the updates of the profile and of the services' software failure probability, generating new tests at run time in order to assess the actual reliability depending on the current usage and deployed software. Indeed, generating and executing tests at run time further stresses a second issue commonly raised against operational testing, namely the high cost required to expose many failures besides the high-occurrence ones. To this aim, evolutions of operational testing could be

considered, which improve the fault detection ability by a partition-based approach and

through adaptation. For instance, Cai et al. published several papers on Adaptive Testing,


in which the assignment of the next test to a partition is based on the outcomes of previous

tests, with the overall goal of reducing the variance of the reliability estimator, as in [27].

The profile (assumed known) is defined on partitions, and selection within partitions is

done by simple random sampling with replacement (SRSWR). Adaptiveness is also

exploited in [28], where importance sampling is used to allocate tests toward more failure-

prone partitions. Adaptive random testing (ART) [29] also exploits adaptiveness, as test

outcomes are used to evenly distribute the next tests across the input domain, but it aims at

increasing the number of exposed failures rather than at assessing reliability. Besides adaptiveness, the sampling procedure is another key to improving efficiency while preserving the unbiasedness of the estimate. In [30], the authors introduced a family of sampling-based algorithms that, exploiting the knowledge of testers about partitions, enable the usage of more efficient sampling strategies than SRSWR/SRSWOR. The method proposed here includes a new sampling-based testing algorithm, conceived to quickly detect clusters of faults with a very scarce testing budget, hence suitable for run-time testing, and to dynamically consider the updated operational profile and the service versions deployed at assessment time. Estimation efficiency (i.e., small variance) and accuracy (w.r.t. the real reliability at assessment time) are both pursued thanks to these features.


Chapter 2: MART strategy

This Chapter focuses on MART: in particular, the testing strategy is formulated, detailing the test generation algorithm, the reliability estimation, and the update of knowledge about the MSA application.

2.1 MART overview

MART is a testing technique that exploits a sampling-based testing strategy and field information to update the knowledge about the application under test and to improve the assessment task.

Figure 3: MART

As described in Figure 3, MART is characterized by two principal steps. The first step is performed at development time and consists of the Initialization, in which the input domain is partitioned and interpreted as a network, linking test frames to one another, and of the Mapping of occurrence probabilities to each partition, which defines the “suspected” operational profile. The second step is performed at run time: the reliability assessment is carried out on demand, and the probability update is performed cyclically. These operations are fundamental to adapt the assumed operational profile to the true one.

2.1.1 Assumptions

To guarantee the correct MART usage, the following assumptions are considered:

• The application can be monitored: it is possible to collect requests and responses to

microservices;

• Frozen code during the assessment operations;

• Perfect oracle: it is always possible to recognize if a failure occurred or not;

• The test case domain can be partitioned;

• Independent requests (this assumption derives from the independent nature of

Microservices).

2.2 Test generation algorithm

This section reports the formulation of the test generation algorithm of MART. The formulation admits several optimizations that can improve the behavior of the sampling technique. It starts with the domain interpretation, in which the test case domain is partitioned and interpreted as a network, defining the weight matrix.

2.2.1 Domain Interpretation

To partition the input domain, the concept of test frame is introduced, defined as an element of the Cartesian product of equivalence classes.

For example, the method OP1(inputClass1, inputClass2, ..., inputClassM) is characterized by M different input classes. Each of these comprises a certain number of instances; for instance, inputClass1 assumes 5 values:


• inputClass1,1 = in range positive integer;

• inputClass1,2 = in range negative integer;

• inputClass1,3 = out of range positive integer;

• inputClass1,4 = out of range negative integer;

• inputClass1,5 = 0.

The test frame is a combination of these classes, for example: OP1(inputClass1,4,

inputClass2,3, … inputClassM,K).

In this way, representing the input domain as a network is simpler: possible links between test frames can be identified, for example, based on the Microservice they belong to.
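As an illustration of how test frames can be enumerated as the Cartesian product of the input classes of a method, consider the following sketch (illustrative code, not the implementation used in this Thesis); a method with input classes of cardinalities 3 and 5, for instance, yields 15 test frames:

import java.util.ArrayList;
import java.util.List;

public class TestFrameEnumerator {
    /**
     * Enumerates all test frames of a method as the Cartesian product of its
     * input classes. classes.get(m) holds the labels of the (m+1)-th input
     * class, e.g. ["inputClass1,1", ..., "inputClass1,5"].
     */
    public static List<List<String>> enumerate(List<List<String>> classes) {
        List<List<String>> frames = new ArrayList<>();
        frames.add(new ArrayList<>());              // start with the empty combination
        for (List<String> inputClass : classes) {
            List<List<String>> extended = new ArrayList<>();
            for (List<String> partial : frames) {
                for (String value : inputClass) {
                    List<String> frame = new ArrayList<>(partial);
                    frame.add(value);               // extend each partial frame with one class value
                    extended.add(frame);
                }
            }
            frames = extended;
        }
        return frames;                              // size = product of the class cardinalities
    }
}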

Example

Three methods, described in Table 1, are considered to build a test frame example.

Table 1: example methods

Login(String username, String password); — username made of at least 8 characters; password made of at least 8 characters, with at least one number and one special character.

Select(int selection, int flag, boolean value); — selection ∈ [0, 12], flag ∈ {0, 1, 2}.

Register(String username, String password, int age); — username made of at least 8 characters; password made of at least 8 characters, with at least one number and one special character; age ∈ [16, 99].

The following input classes are defined:

InputUsername, with K = 3 values:

• String ≥ 8 characters -> Valid class;

• String < 8 characters -> Invalid class;

• Username is not a String -> Invalid class.

InputPassword, with K = 5 values:

• String ≥ 8 characters, with at least one number and one special character -> Valid class;

• String ≥ 8 characters, without numbers -> Invalid class;

• String ≥ 8 characters, without special characters -> Invalid class;

• String < 8 characters -> Invalid class;

• String with only special characters -> Invalid class.

InputSelection, with K = 4 values:

• In range Integer -> Valid class;

• Out of range Integer > 12 -> Invalid class;

• Negative out of range Integer -> Invalid class;

• Value different from Integer (e.g., character) -> Invalid class.

InputFlag, with K = 5 values:

• Integer = 0 -> Valid class;

• Integer = 1 -> Valid class;

• Integer = 2 -> Valid class;

• Integer value different from {0, 1, 2} -> Invalid class;

• Value different from Integer -> Invalid class.

InputValue, with K = 3 values:

• Value = true, 1 -> Valid class;

• Value = false, 0 -> Valid class;

• Value different from {true, false, 0, 1} -> Invalid class.

InputAge, with K = 4 values:

• In range Integer -> Valid class;

• Out of range Integer > 99 -> Invalid class;

• Out of range Integer < 16 -> Invalid class;

• Value different from Integer (e.g., character) -> Invalid class.


Thus, the test frames shown in Table 2 can be defined (F = failing, NF = non-failing).

Table 2: test frames example

1  Login(InputUsername1, InputPassword1);  NF
2  Login(InputUsername2, InputPassword1);  F
3  Login(InputUsername3, InputPassword1);  F
4  Login(InputUsername1, InputPassword2);  F
5  ...
6  Login(InputUsername3, InputPassword5);  F
7  Select(InputSelection1, InputFlag1, InputValue1);  NF
8  Select(InputSelection2, InputFlag1, InputValue1);  F
9  Select(InputSelection3, InputFlag1, InputValue1);  F
10 Select(InputSelection4, InputFlag1, InputValue1);  F
11 Select(InputSelection1, InputFlag2, InputValue1);  NF
12 ...
13 Select(InputSelection4, InputFlag5, InputValue3);  F
14 Register(InputUsername1, InputPassword1, InputAge1);  NF
15 Register(InputUsername2, InputPassword1, InputAge1);  F
16 Register(InputUsername3, InputPassword1, InputAge1);  F
17 ...
18 Register(InputUsername3, InputPassword5, InputAge4);  F

To perform the reliability assessment, it is necessary to adequately represent the system, so

as to encode the information on the operational profile and the probability of failure

within the test frames. Two values are attached to each test frame:

• the occurrence probability, between 0 and 1 and such that $\sum_i OP_i = 1$, which represents the probability that a test case taken from the corresponding partition is executed during system execution;

• the failure probability, between 0 and 1, which represents the proneness to failure of the partition's test cases.


The way in which these values are encoded determines the impact of the MART strategy.

2.2.2 Weights Matrix determination

This Section focuses on how to define the links between test frames and on how to encode the information about their failure proneness, which is then exploited for reliability assessment.

Test frames failure probability

In the domain interpretation, test frames are introduced as partitions of test cases, and two probabilities are attached to them. In particular, the concept of failure probability is introduced: in case of perfect partitioning, there are only test frames with failure probability 0 and test frames with failure probability 1. This means that the test cases inside a partition either all fail or all succeed. This situation is unrealistic; therefore, test frames with failure probability in ]0, 1[ are considered, and this information is exploited to “drive” the testing strategy towards the test cases more prone to failure.

Weight calculation

To determine the weight matrix, a concept of distance between test frames is defined. For this purpose, several techniques can be considered; for example, this distance can be computed starting from the signature of the test frame and calculating the Hamming distance. In the rest of the Thesis, the distance is computed using the distance factor, i.e., the number of differing input classes between two test frames. The distance between two test frames is calculated as the absolute value of the difference of their distance factors.

As previously stated, the weights encode the tester's information, so as to “drive” the algorithm towards the test frames more prone to failure. To this aim, it is necessary to exploit all the available information and to combine it: one approach is based on the joint probability, that is, configuring the weights according to the probability that the next selected node fails given that the previously selected node has exposed a failure.

In case of failure, the testing strategy should prefer destination nodes whose joint probability with the source node is maximum. The joint probability is $\Pr(D \cap S) = \Pr(D \mid S) \cdot \Pr(S)$, where Pr(S) is the failure probability of the source S, Pr(D) is the failure probability of the destination D, and Pr(D|S) has to be determined as a function of the distance.

This probability represents the “belief” that D fails given that S has failed. The belief increases as the distance k decreases; recall that k, the absolute value of the difference between distance factors, is always an integer greater than or equal to zero. Thus, this probability is defined as $\Pr(D \mid S) = \Pr(D) \cdot f(k)$, with $f(k)$ increasing as k decreases (with subsequent normalization). In particular, $f(k) = 1/k$ is used: in this way the weight is inversely proportional to the distance and, when $k > 0$, the product with Pr(D) is between 0 and 1. If $k = 0$, this quantity is not calculated, because there is no link.

The assumption is that the “belief” on Pr(S) remains unchanged (independently of the observed failures). Removing this assumption, the performance can be improved, but for now the static case is considered.

In conclusion, the weight determination is based on the definition of the probabilities associated with the different pairs of test frames. Considering a pair of test frames (destination, source), the weights are calculated as:

$\Pr(D \cap S) = \Pr(D \mid S) \cdot \Pr(S) = \Pr(D) \cdot f(k) \cdot \Pr(S)$

where k is the distance between the nodes and $f(k) = 1/k$.
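A sketch of this weight computation, directly following the formula above (illustrative code, not the TestFrameDistance/TestFrameLoader implementation described later):

public class WeightCalculator {
    /**
     * Joint-probability weight between a source and a destination test frame:
     * Pr(D ∩ S) = Pr(D) * f(k) * Pr(S), with f(k) = 1/k.
     * prS and prD are the failure probabilities of source and destination;
     * k is the distance between the two frames; k = 0 means no link (weight 0).
     */
    public static double weight(double prS, double prD, int k) {
        if (k == 0) return 0.0;
        return prD * (1.0 / k) * prS;
    }

    /** Fills the weight matrix (sources on rows, destinations on columns). */
    public static double[][] weightMatrix(double[] failProb, int[][] distance) {
        int n = failProb.length;
        double[][] w = new double[n][n];
        for (int s = 0; s < n; s++)
            for (int d = 0; d < n; d++)
                w[s][d] = (s == d) ? 0.0 : weight(failProb[s], failProb[d], distance[s][d]);
        return w;
    }
}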

Example

Figure 4 is considered as a simple example of network: the distance between every pair of nodes is 1 ($k_{ij} = 1, \forall i, j$).

Figure 4: Example

Recalling that the assignment of weights is achieved through the joint probability calculation $\Pr(D \cap S) = \Pr(D \mid S) \cdot \Pr(S) = \Pr(D) \cdot f(k) \cdot \Pr(S)$, and considering $f(k) = 1/k$, the weights are:

1. About node 0:

   a. W1 = 0.08;

   b. W2 = 0.24;

   c. W3 = 0.32.

2. About node 1:

   a. W0 = 0.08;

   b. W2 = 0.03.

3. About node 2:

   a. W0 = 0.24.

4. About node 3:

   a. W0 = 0.32.

It is possible to notice that, starting from node 0, it is more likely to reach nodes 2 and 3 rather than node 1; likewise, starting from node 1, it is more likely to reach node 0 rather than node 2.

In Table 3 the matrix built on the obtained weights is shown, considering sources on rows and destinations on columns.

Table 3: weight matrix

S \ D    0      1      2      3
0        0      0.08   0.24   0.32
1        0.08   0      0.03   0
2        0.24   0      0      0
3        0.32   0      0      0

Weights Matrix computation

To calculate the weights matrix, it is necessary to consider the following three steps:

• Test frames acquisition;

• Joint probability computation for each pair of test frames;

• Matrix population.


These steps are performed by two classes, TestFrameDistance and TestFrameLoader, and are represented in the activity diagram in Figure 5.

Figure 5: Distance Matrix Population

2.2.3 Testing strategy

In this Section, the test frame and weight matrix concepts are used to define the testing strategy. This strategy is based on Thompson's adaptive web sampling without replacement, described in Section 1.3.

The testing strategy is characterized by two-stage sampling: in the first stage, the sampling unit is the test frame; in the second stage, a test case is generated randomly from the selected partition. After generation, each test case is executed and its outcome is collected for the estimation.

Like adaptive web sampling, this testing strategy consists of two principal steps: first, $n_0$ test frames (with $n_0 \geq 1$) are selected using Simple Random Sampling Without Replacement (SRSWOR), building the “initial sample”; then, the remaining units are selected with a mixture distribution.

The first step is necessary to populate the active set, which represents the set of test frames extracted so far and constitutes the knowledge base for the second step. The mixture distribution translates into “the use of two different samplers”: a Weight Based Sampler (WBS) and a Simple Random Sampler (SRS). At each step only one sampler is used: the WBS is selected with probability d, while the SRS is selected with probability 1-d.

Once the initial sample is built, the extraction probability is:

$q_{ki} = d\,\frac{w_{a_k i}}{w_{a_k +}} + (1-d)\,\frac{1}{N - n_{s_{a_k}}}$

where:

• $q_{ki}$ is the probability of extracting the test frame i at step k;

• $a_k$ is the active set at the kth step, containing all the sampled test frames with information about their outgoing links;

• N is the cardinality of the test frame population;

• $n_{s_{a_k}}$ is the sample dimension at the kth step.

The parameters to be defined are:

• d: a value between 0 and 1;

• $w_{ij}$: the static weights, corresponding to the (i, j) elements of the weight matrix.

The term $d\,\frac{w_{a_k i}}{w_{a_k +}}$ represents the probability of taking a unit with the Weight Based Sampler: it selects the next test frame among the ones linked with the active set, with probability proportional to the link weight. If there are no outgoing links from the active set, the next test frame must be selected with the Simple Random Sampler.
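The whole two-stage generation step can be sketched as follows (illustrative code: a test frame is drawn with the mixture of the two samplers, then a test case is drawn uniformly from the chosen partition):

import java.util.List;
import java.util.Random;

public class TestGenerationStep {
    private final Random rnd = new Random();
    private final double d;                             // mixture parameter in [0, 1]

    public TestGenerationStep(double d) { this.d = d; }

    /** Stage 1: pick the next test frame with the WBS (probability d) or the SRS (probability 1-d).
     *  activeSetLinks[i] is the outgoing-link weight towards frame i (0 if sampled or unlinked);
     *  sampled[i] marks already selected frames; at least one unsampled frame is assumed to remain. */
    public int pickFrame(double[] activeSetLinks, boolean[] sampled) {
        double total = 0;
        for (double w : activeSetLinks) total += w;
        if (total > 0 && rnd.nextDouble() < d) {        // Weight Based Sampler
            double r = rnd.nextDouble() * total, cum = 0;
            for (int i = 0; i < activeSetLinks.length; i++) {
                cum += activeSetLinks[i];
                if (r < cum) return i;
            }
        }
        int i;                                          // Simple Random Sampler, without replacement
        do { i = rnd.nextInt(sampled.length); } while (sampled[i]);
        return i;
    }

    /** Stage 2: draw one test case uniformly at random from the selected partition. */
    public <T> T pickTestCase(List<T> partition) {
        return partition.get(rnd.nextInt(partition.size()));
    }
}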

Having defined the testing strategy, it is now important to introduce how the reliability of the system is estimated.

2.2.4 Estimation

The estimation represents a crucial problem; in fact, estimation in sampling with varying probabilities is still an open issue, and many solutions have been proposed, for instance in [32] and [33].

Before introducing the estimation problem, it is necessary to identify the variable to be estimated. With this testing strategy, a test case is selected from each sampled test frame; to estimate the reliability, the variable $x_i = p_i y_i$ is considered, where $p_i$ is the occurrence probability (the probability that at run time an input will be taken from the corresponding test frame) and $y_i$ is the outcome of the current test case (1 in case of failure, 0 otherwise). This variable is used to estimate the unreliability, defined as $\phi = \sum_i x_i$ (summing over all the N test frames), and consequently to obtain the reliability $R = 1 - \phi$.

The estimator used in this case is an adaptation of the estimator based on conditional selection described in [2]. This estimator is presented in two versions: the first for an initial sample dimension greater than one, the second particularized for an initial sample of unitary dimension.

Estimation with initial sample dimension greater than one

The reliability estimation is performed in three steps. In the first step, the total estimator of SRSWOR is considered for the initial sample:

$t_{s_0\pi} = \frac{N}{n_0} \sum_{i=0}^{n_0-1} x_i = \frac{N}{n_0} \sum_{i=0}^{n_0-1} p_i y_i$

In the following step, the estimator $z_i$ is considered as a total estimator of x:

$z_i = \sum_{j \in s_{a_k}} x_j + \frac{x_i}{q_{ki}} = \sum_{j \in s_{a_k}} p_j y_j + \frac{p_i y_i}{q_{ki}}$

Finally, the unreliability is calculated as:

$\hat{\phi} = \frac{1}{n}\left(n_0\, t_{s_0\pi} + \sum_{i=n_0}^{n-1} z_i\right)$

and the reliability is calculated as $\hat{R} = 1 - \hat{\phi}$.

To calculate the analytic variance, the following variance estimator is used [2]:

$\widehat{\mathrm{var}}(\hat{\phi}) = \left(\frac{n_0}{n}\right)^2 v_1 + \left(\frac{n-n_0}{n}\right)^2 v_2$

where:

- $v_1 = \left(\frac{N-n_0}{N n_0}\right) v_0$, where $v_0$ is the sample variance of the initial sample;

- $v_2 = \sum_{i=n_0}^{n-1} \frac{(z_i - \bar{z})^2}{(n-n_0)(n-n_0-1)N^2}$, where $\bar{z} = \sum_{i=n_0}^{n-1} \frac{z_i}{n-n_0}$.

$v_0$ is calculated as described in [31]:

$v_0 = \frac{1}{n_0-1} \sum_{i=0}^{n_0-1} (z_i - m(z))^2$, where $m(z) = \frac{1}{n_0} \sum_{i=0}^{n_0-1} z_i$.

Estimation with unitary initial sample dimension

This case is a particularization of the previous one. With $n_0 = 1$, the total estimator of the initial sample is:

$t_{s_0\pi} = N x_0 = N p_0 y_0$

$z_i$ is defined as before:

$z_i = \sum_{j \in s_{a_k}} x_j + \frac{x_i}{q_{ki}} = \sum_{j \in s_{a_k}} p_j y_j + \frac{p_i y_i}{q_{ki}}$

Finally, the unreliability is calculated as:

$\hat{\phi} = \frac{1}{n}\left(t_{s_0\pi} + \sum_{i=1}^{n-1} z_i\right)$

with the reliability calculated as $\hat{R} = 1 - \hat{\phi}$.

The variance estimator is simpler than the one used previously:

$\widehat{\mathrm{var}}(\hat{\phi}) = \sum_{i=0}^{n-1} \frac{(z_i - t_{s_0\pi})^2}{n(n-1)N^2}$

2.2.5 Active Set Update

In this Section, the focus is on the active set update operations; this procedure is relevant to understand how the selected test frame affects the next sampling step.

Information about links is stored in an array, where each position represents a test frame and the contained value is 0, if there is no outgoing link from the active set to it, or the link weight otherwise. Each time a test frame is selected, the information about the outgoing links from the active set is updated as follows (a sketch of this update is given after the list):

1. The row relative to the selected test frame is taken from the weight matrix.

2. Each element of this row is summed with the homologous value in the active set outgoing links array.

3. As the testing technique is without replacement, there are no self-links; as a consequence, the value 0 is stored in the cells of the test frames already selected.

4. All the values of the active set outgoing links array are normalized.
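The following sketch implements the four steps above (illustrative code):

public class ActiveSetLinks {
    /**
     * Updates the active set outgoing links array after the test frame
     * `selected` has been sampled. links[i] holds the (normalized) weight of
     * the links from the active set to frame i; weightMatrix[selected] is the
     * row of the newly selected frame; sampled[i] marks frames already taken.
     */
    public static void update(double[] links, double[][] weightMatrix,
                              boolean[] sampled, int selected) {
        sampled[selected] = true;
        double total = 0;
        for (int i = 0; i < links.length; i++) {
            links[i] += weightMatrix[selected][i];   // steps 1-2: add the selected frame's row
            if (sampled[i]) links[i] = 0;            // step 3: no self-links (without replacement)
            total += links[i];
        }
        if (total > 0)                               // step 4: normalize
            for (int i = 0; i < links.length; i++) links[i] /= total;
    }
}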

At each step, when the new test frame is taken with the Weight Based Sampler, it is selected from the test frames linked to the active set with a probability proportional to the values in the active set outgoing links array (test frames with bigger weights are taken with greater probability). If there are no outgoing links from the active set, the next test frame is selected with Simple Random Sampling.

2.2.6 Algorithm implementation

The classes used to implement the described algorithm are represented in Figure 6.

Figure 6: Class Diagram

The testing strategy has been implemented in two different versions: Figure 7 describes the version with unitary initial sample, while Figure 8 shows the version with variable initial sample dimension. The latter aims to provide the weight-based sampling with a more representative active set: test frames (and consequently test cases) are selected until a failure is found or a limit is reached (maxInitialSampleSize, which in our case is set to 25% of n).
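The variable initial sample loop can be sketched as follows (illustrative code; executeAndCheck stands for the test execution and the oracle, returning true on failure):

import java.util.ArrayList;
import java.util.List;
import java.util.Random;
import java.util.function.IntPredicate;

public class InitialSampleBuilder {
    private final Random rnd = new Random();

    /**
     * Builds the initial sample with simple random sampling without replacement:
     * frames are drawn (and one test case per frame executed) until a failure is
     * observed or maxInitialSampleSize is reached (25% of the testing budget n).
     */
    public List<Integer> build(int nFrames, int n, IntPredicate executeAndCheck) {
        int maxInitialSampleSize = Math.max(1, n / 4);
        boolean[] sampled = new boolean[nFrames];
        List<Integer> initial = new ArrayList<>();
        while (initial.size() < maxInitialSampleSize) {
            int frame;
            do { frame = rnd.nextInt(nFrames); } while (sampled[frame]);
            sampled[frame] = true;
            initial.add(frame);
            if (executeAndCheck.test(frame)) break;  // stop at the first failure
        }
        return initial;
    }
}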


Figure 7: Implementation with unitary initial sample


Figure 8: Implementation with variable initial sample

Example

To better understand how the test generation algorithm of MART works, an example based on a dummy test frame network is presented. In this case, an initial sample formed by only one test frame is considered (first implementation).


Table 4: test frames with attached failure probability and occurrence probability (columns: test frame, failure probability, occurrence probability)

Login(InputUsername1, InputPassword1) 0 0.05

Login(InputUsername2, InputPassword1) 0 0.05

Login(InputUsername3, InputPassword1) 0 0.05

Login(InputUsername1, InputPassword2) 0 0.05

Login(InputUsername2, InputPassword2) 0 0.05

Login(InputUsername3, InputPassword2) 0 0.05

Login(InputUsername1, InputPassword3) 1 0.05

Login(InputUsername2, InputPassword3) 0 0.05

Login(InputUsername3, InputPassword3) 0 0.05

Login(InputUsername1, InputPassword4) 0 0.05

Login(InputUsername2, InputPassword4) 0 0.05

Login(InputUsername3, InputPassword4) 0 0.05

Login(InputUsername1, InputPassword5) 0 0.05

Login(InputUsername2, InputPassword5) 0 0.05

Login(InputUsername3, InputPassword5) 1 0.05

Login(InputUsername4, InputPassword1) 0 0.05

Login(InputUsername4, InputPassword2) 0 0.05

Login(InputUsername4, InputPassword3) 1 0.05

Login(InputUsername4, InputPassword4) 1 0.05

Login(InputUsername4, InputPassword5) 1 0.05

Table 4 lists all the obtained test frames: the failure probability of each one is a binary value (0 or 1), which results from a perfect partitioning of the test cases; the occurrence probability is the same for every test frame. The distance between test frames is calculated as the difference between their signatures (Hamming distance). The resulting network is


represented in Figure 9. Recalling that the failure probability is calculated as the joint probability described in Section 2.2, there are links only between test frames with failure probability 1.
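Since the signature of a test frame in this example is just its list of input-class identifiers, the Hamming distance can be computed as in the sketch below (the array representation is an assumption made for illustration).

```java
class SignatureDistance {
    /** Hamming distance between two test frame signatures of equal length:
        the number of positions where the input classes differ. */
    static int hamming(String[] sigA, String[] sigB) {
        int d = 0;
        for (int i = 0; i < sigA.length; i++)
            if (!sigA[i].equals(sigB[i])) d++;
        return d;
    }

    public static void main(String[] args) {
        String[] a = {"InputUsername1", "InputPassword1"};
        String[] b = {"InputUsername1", "InputPassword3"};
        System.out.println(hamming(a, b));  // prints 1
    }
}
```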

Assuming the algorithm runs with d = 0.8, two possible executions are presented.

Execution 1: in this execution, represented in Figure 10, it is possible to evaluate how the testing strategy works when the first sample is a failing point; in this case 4 out of 5 failing points are found.

Figure 9: Example Network


Figure 10: First sampled test frame is failing

Execution 2: in this execution, represented in Figure 11, it is possible to evaluate the testing strategy when a non-failing point is taken as the first sample. In this case samples are selected randomly until a failing sample is found; this behavior derives from the absence of outgoing links from the Active Set in the second and third steps.

Figure 11: First sampled test frame is not failing


2.3 Probability Update

The testing algorithm exploits updated information about the application under test. This information, as described in the previous section, is encoded through probabilities.

The operational profile is defined as the set of occurrence probabilities (OP) assigned to the test frames. These values are such that $\sum_i OP_i = 1$; they represent the probability that a "Client" performs a test case contained in the i-th test frame. The failure probability, on the other hand, is a value between 0 and 1 attached to each test frame: it encodes the probability that a test case taken from that set exposes a failure.

The operational profile and the failure probability of each test frame are updated at run time by evaluating the request-response pairs generated by the system under test. For this purpose monitoring tools can be used, such as MetroFunnel [34], which has been modified to also capture the payload, so as to properly monitor the examined system.

The operational profile update is realized using a sliding window of size MAX, which represents the maximum number of test cases that can be considered for this purpose. The expression that regulates the operational profile update is the following:

$$OP_i^{new} = OP_i^{old} \cdot \left[ p + (1-p) \cdot \left( 1 - \frac{x}{MAX} \right) \right] + OP_i^{obs} \cdot (1-p) \cdot \frac{x}{MAX}$$

Where:

• $OP_i^{old}$: the occurrence probability of the i-th test frame at the previous update step (at the first execution it is the one specified by the tester);

• $OP_i^{obs}$: the occurrence probability calculated with a frequentist approach, i.e., the ratio between the number of test cases observed in the i-th test frame and the total number of executed test cases;

• $p$: a value between 0 and 1, representing the minimum fraction of history preserved by the update operation (50% in the examined case);

• $x$: the number of executed test cases (at most MAX).

This approach guarantees that the sum of the occurrence probabilities remains unitary; in fact

$$\sum_i OP_i^{old} = 1 \quad \text{and} \quad \sum_i OP_i^{obs} = 1,$$

and since the two weighting coefficients, $p + (1-p)(1 - x/MAX)$ and $(1-p) \cdot x/MAX$, sum to 1, the update is a convex combination that preserves the total probability mass.
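A minimal Java sketch of this sliding-window update follows; the method and parameter names are hypothetical, and the per-frame counts stand for the monitored traffic in the current window.

```java
import java.util.Arrays;

class ProfileUpdater {
    /** Sliding-window update of the occurrence probabilities.
        opOld: previous profile; counts: observed test cases per frame in the window;
        x: executed test cases (capped at max); p: minimum fraction of preserved history. */
    static double[] update(double[] opOld, int[] counts, int x, int max, double p) {
        int total = Arrays.stream(counts).sum();
        // w = (1-p)*x/MAX; note that p + (1-p)*(1 - x/MAX) = 1 - w
        double w = (1 - p) * Math.min(x, max) / (double) max;
        double[] opNew = new double[opOld.length];
        for (int i = 0; i < opOld.length; i++) {
            double opObs = total > 0 ? counts[i] / (double) total : opOld[i];
            opNew[i] = opOld[i] * (1 - w) + opObs * w;   // convex combination, sums to 1
        }
        return opNew;
    }
}
```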

This technique guarantees that changes in the true operational profile are reflected by the update function within a few steps. This property is the reason for using a sliding window: if the entire history were considered, changes in the true operational profile would be much harder to detect.

The mechanism described is an implementation of the feedback loop (Figure 12) that characterizes agile development: the operational profile can effectively be updated cyclically during system execution.

Figure 12: Operational Profile Feedback

The failure probability of each test frame is updated in the same way:

$$FP_i^{new} = FP_i^{old} \cdot \left[ p + (1-p) \cdot \left( 1 - \frac{x}{MAX} \right) \right] + FP_i^{obs} \cdot (1-p) \cdot \frac{x}{MAX}$$

where FP is the failure probability and $FP_i^{obs}$ is the observed ratio between failed and executed test cases in the i-th test frame.

Other update approaches could be used, such as black-box ones, where the frequentist or Bayesian estimators are adjusted upon profile changes, or white-box approaches, where the control flow transfer among components is captured. However, investigating the best update strategy is outside the scope of this work and is a matter of future work.

2.4 Formulation with dynamic sampler selection

This Section describes another version of the test generation algorithm of MART: a pseudo-adaptive version in which an initial value of d, named d0 (with $d_0 \in [0.5, 1)$), is specified. It encodes the trust the tester places in the WBS compared with SRS.

The idea is based on the recent sampling history of each of the two techniques, similarly to the algorithms used for processor branch prediction. The architecture considered for this purpose is described in Figure 13.

Figure 13: Dynamic d logic

In this approach the remaining part of d (that is, 1 - d0) is encoded in the cells of two shift registers, one per sampler. The idea is to distribute this quantity over the cells in descending order; for example, the configuration for a shift register of size four is represented in Figure 14.

Figure 14: Sampler Shift Register

At each execution of the corresponding sampler, the values in the cells are shifted to the left, with the new value inserted at position 0. The shift register update logic ensures that d always stays


within [0, 1]: it takes the maximum value (1) when the WBS register is full and the other one is empty; it takes the minimum value (d0 - (1 - d0) = 2d0 - 1) when the SRS register is full and the other one is empty.
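The following Java sketch illustrates one plausible reading of this logic, under the assumption that each selection shifts both registers (a 1 into the register of the sampler that was used, a 0 into the other) and that the cell values split 1 - d0 in descending order; all names are hypothetical.

```java
class DynamicD {
    private final double d0;
    private final double[] cellValues;   // descending split of (1 - d0) over the cells
    private final boolean[] wbs, srs;    // shift registers: true = that sampler was used

    DynamicD(double d0, double[] cellValues) {
        this.d0 = d0;
        this.cellValues = cellValues;
        this.wbs = new boolean[cellValues.length];
        this.srs = new boolean[cellValues.length];
    }

    /** Record which sampler was used: shift both registers left from position 0. */
    void record(boolean usedWbs) {
        shiftIn(usedWbs ? wbs : srs, true);
        shiftIn(usedWbs ? srs : wbs, false);
    }

    private static void shiftIn(boolean[] reg, boolean bit) {
        System.arraycopy(reg, 0, reg, 1, reg.length - 1);
        reg[0] = bit;
    }

    /** Current d: 1 when WBS is full and SRS empty, 2*d0 - 1 in the opposite case
        (assuming the cell values sum to 1 - d0). */
    double current() {
        double d = d0;
        for (int i = 0; i < cellValues.length; i++) {
            if (wbs[i]) d += cellValues[i];
            if (srs[i]) d -= cellValues[i];
        }
        return d;
    }
}
```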

The estimator used in this approach is the same as the one used in the static formulation.


Chapter 3: Simulation of the test generation algorithm

This Chapter focuses on evaluating the test generation algorithm by simulation. The aim is to tune the parameters of the two versions of the test generation algorithm of MART (with static and dynamic d), and then to evaluate their performance against Simple Random Sampling Without Replacement.

3.1 Simulation Scenarios

To obtain the simulation scenarios, it is necessary to consider three main factors about the population and the problem size:

• Type of partitioning: it represents how test cases are partitioned into test frames. This information is encoded in the test frame's failure probability, specifically as a Correct/Failing pair of values, for which three settings are considered: 0/1, 0.25/0.75 and 0.1/0.9. This probability is necessary to determine the weight associated with each network link. In the 0/1 case, a test frame believed to be failing (correct) contains only failing (correct) test cases, hence its failure probability (as the proportion of failing test cases) is 1 (0). This corresponds to a Perfect Partitioning; if, in addition, the failing test frames are organized in clusters, a Perfect Clustered Partitioning is considered. The pairs 0.25/0.75 and 0.1/0.9 mean that the partitioning is not accurate, because failures are distributed across all test frames. These probabilities generate the Close to Uniform Partitioning (for 0.25/0.75) and the Close to Perfect Partitioning (for 0.1/0.9); if the failing test frames are organized in clusters, the population distribution is Clustered.

• Failing test frame proportion: this is the proportion of failing test frames over the total, for which two values are considered: 0.1 and 0.2.

• Total number of test frames (N): two orders of magnitude are tried: N = 100 and N = 1000.

The combination of the first two factors generates 12 different configurations, shown in Table 5. A completely uniform population distribution is then added, in which the failure probability of each test frame is obtained as a random value between 0 and 1; as a consequence, failures are uniformly distributed among the test frames (this represents an ideal case).

Table 5: Configurations

Configuration   Type of partitioning                      Failing test frame proportion
1               Uniform (Random)                          0.5
2               Close to uniform (0.25/0.75)              0.1
3               Close to perfect partitioning (0.1/0.9)   0.1
4               Clustered (0.25/0.75)                     0.1
5               Clustered (0.1/0.9)                       0.1
6               Perfect                                   0.1
7               Perfect Clustered (1/0)                   0.1
8               Close to uniform (0.25/0.75)              0.2
9               Close to perfect partitioning (0.1/0.9)   0.2
10              Clustered (0.25/0.75)                     0.2
11              Clustered (0.1/0.9)                       0.2
12              Perfect                                   0.2
13              Perfect Clustered                         0.2

Adding the last factor, N, to these combinations yields 26 different scenarios. Assessment is made at 9 checkpoints: n1 = 0.1N, n2 = 0.2N, ..., n9 = 0.9N.

For the described scenarios, a uniform operational profile is adopted. The populations from 1 to 7 are defined in ascending order of compatibility with the testing strategy (from worst to best case). The same consideration also applies to populations from 8 to 13, but with a different failure distribution.

3.1.1 Population generators

Different population distributions are generated using the following functions:

• generatePopulationAndMatrix: generates a random set of test frames with random failure probability and a random distance factor (between 0 and maxdistance); the occurrence probabilities can be random or equiprobable;

• generatePopulationAndMatrixBinary: generates a random set of test frames with failure probability chosen randomly between two selected values ((0/1), (0.1/0.9), (0.25/0.75)), respecting a failure proportion (the proportion of frames with the high failure probability over the total). A random distance factor (between 0 and maxdistance) and the occurrence probabilities (random or equiprobable) are also generated;

• generatePopulationAndMatrixCluster: generates a set of test frames with the policy defined below, a random distance factor (between 0 and maxdistance) and occurrence probabilities that can be random or equiprobable.

The steps used to determine the failure probabilities of a set of N test frames with clustered distribution are the following (a sketch follows the list):

1. a t% of test frames is considered as failing; the t%-th part of N is called X;

2. the lowest failure probability is assigned to all points;

3. each cluster of failure points is made of a certain percentage (in this case 10% and 20% are considered) of X's cardinality, called T;

4. |X| / T points are chosen randomly as centroids;

5. finally, for each centroid, the T test frames at minimum distance are chosen and assigned the maximum failure probability.
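A compact Java sketch of this clustered-population generator, under the assumption that a distance matrix dist[i][j] between test frames is already available (all names are illustrative), is the following.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.Random;

class ClusteredPopulation {
    /** Assign lowFp everywhere, then mark |X|/T clusters of the T frames
        closest to random centroids with highFp (steps 1-5 above). */
    static double[] generate(double[][] dist, int n, double t, double clusterFraction,
                             double lowFp, double highFp, Random rnd) {
        double[] fp = new double[n];
        Arrays.fill(fp, lowFp);                                         // step 2
        int x = (int) Math.round(t * n);                                // step 1: |X|
        int tSize = Math.max(1, (int) Math.round(clusterFraction * x)); // step 3: T
        int clusters = x / tSize;                                       // step 4
        for (int c = 0; c < clusters; c++) {
            final int centroid = rnd.nextInt(n);
            Integer[] idx = new Integer[n];
            for (int i = 0; i < n; i++) idx[i] = i;
            Arrays.sort(idx, Comparator.comparingDouble(i -> dist[centroid][i]));
            for (int k = 0; k < tSize; k++) fp[idx[k]] = highFp;        // step 5
        }
        return fp;
    }
}
```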


3.2 Evaluation Criteria

Accuracy and efficiency are considered as evaluation criteria, estimated as follows. A simulation scenario j is repeated 100 times; denote by r one such repetition. At the end of each repetition, the reliability estimate $\hat{R}_{j,r}$ is computed by the technique under assessment, as well as the true reliability $R_j$. In simulation it is known in advance which input t is a failure point (hence $R_j = 1 - \sum_{t \in T} p_t \delta_t$, where $\delta_t$ is 1 if input t is a failure point and 0 otherwise).

For each scenario j, the sample mean (denoted as M), sample variance (var) and mean squared error (MSE) are computed:

• $M(\hat{R}_j) = \frac{1}{100} \sum_{r=1}^{100} \hat{R}_{j,r}$;

• $MSE(\hat{R}_j) = \frac{1}{100} \sum_{r=1}^{100} (\hat{R}_{j,r} - R_j)^2$;

• $var(\hat{R}_j) = \frac{1}{100-1} \sum_{r=1}^{100} (\hat{R}_{j,r} - M(\hat{R}_j))^2$.

Estimation accuracy is compared through the MSE, and efficiency through the sample variance. Lastly, the average number of detected failing points (NFP) is also considered.
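For reference, a minimal Java sketch that computes these three metrics from the per-repetition estimates of one scenario could look as follows (names are illustrative).

```java
class ScenarioMetrics {
    /** Returns {mean, MSE, sample variance} over the repetitions of one scenario. */
    static double[] compute(double[] estimates, double trueReliability) {
        int k = estimates.length;                       // 100 repetitions per scenario
        double mean = 0, mse = 0, var = 0;
        for (double r : estimates) mean += r / k;
        for (double r : estimates) mse += (r - trueReliability) * (r - trueReliability) / k;
        for (double r : estimates) var += (r - mean) * (r - mean) / (k - 1);
        return new double[] { mean, mse, var };
    }
}
```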

3.3 Empirical correction of the Estimator

The simulations showed that, for small values of n, $z_i$ at each pass is strongly influenced both by the order in which test frames are selected and by the number of detected failures. This behavior is a consequence of the working conditions foreseen for MART: since it is conceived to work under a scarce testing budget, the formulation considers at most one test case per test frame. Unreliability is calculated as $\hat{\phi} = \sum_{i=0}^{n-1} p_i y_i$, where $y_i$ is a binary value, because only one test case is taken from each test frame. A consequence is that a test case taken from a test frame with "high" failure probability (e.g., 0.9) may not fail, and a test case taken from a test frame with "low" failure probability (e.g., 0.1) may fail, causing underestimation or overestimation respectively. This phenomenon is more evident for small values of n.

For the discussed problem a reliability overestimation is more dangerous than an underestimation, so the idea is to adjust this value to avoid that condition.

The estimate defined in the previous chapter is based on the mean of the estimates calculated with $z_i$, which is influenced by the presence of "outliers". Their impact is heavy when small values of n are considered, but the mean remains the most representative value of the set.

The simulation shows that the testing strategy selects the test frames with the highest failure probability in the initial part of the testing; this increases the probability of obtaining an underestimation of unreliability (an overestimation of reliability).

The idea is to compute the mean of the estimated values within a single algorithm execution and then keep only:

• values that differ from the mean by no more than 90% on the overestimation side;

• values that differ from the mean by no more than 10% on the underestimation side.

In other words, a window with two limit values is considered: the upper bound is the mean plus 90% of the mean, the lower bound is the mean minus 10% of the mean.
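A minimal sketch of this windowed mean, assuming the per-pass estimates $z_i$ are available as an array, is shown below.

```java
import java.util.Arrays;

class CorrectedEstimator {
    /** Re-average the zi values after discarding those outside
        [mean - 0.10*mean, mean + 0.90*mean]. */
    static double correctedMean(double[] zi) {
        double mean = Arrays.stream(zi).average().orElse(0);
        double lo = mean - 0.10 * mean;
        double hi = mean + 0.90 * mean;
        return Arrays.stream(zi)
                     .filter(v -> v >= lo && v <= hi)
                     .average()
                     .orElse(mean);   // fall back to the plain mean if everything is filtered
    }
}
```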

This adjustment does not affect the unbiasedness of the estimators; it amounts to computing the mean over the values that best represent the examined population. The adjustment is particularly relevant for small values of n, for which outliers have the greatest impact.

The final solution consists of a combination of the two estimators, chosen according to the value of n. It is therefore fundamental to define what small and big values of n mean. It is observed that the adjusted estimator performs well when n is between 0 and 30-40%. This consideration leads to dividing the n values as in Figure 15: for 0-34%, n is small and the adjusted estimator is used; for 66-100%, n is big and the estimators defined in the previous chapter are used.

Figure 15: Different Estimator usage

For n between 34% and 66%, a linear combination of the two estimators is used. This linear combination is realized through a "coefficient of correction" defined as

$$coc = (n\% - 0.34) \cdot 3$$

and the estimation is calculated as

$$estimation = (1 - coc) \cdot correctedEstimation + coc \cdot standardEstimation.$$
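The piecewise choice between the two estimators can be sketched as follows (a direct transcription of the rule above; the method and parameter names are illustrative).

```java
class CombinedEstimator {
    /** Adjusted estimator for small n, standard one for big n,
        linear blend via coc = (nPct - 0.34) * 3 in between. */
    static double estimate(double nPct, double correctedEstimation, double standardEstimation) {
        if (nPct <= 0.34) return correctedEstimation;
        if (nPct >= 0.66) return standardEstimation;
        double coc = (nPct - 0.34) * 3;
        return (1 - coc) * correctedEstimation + coc * standardEstimation;
    }
}
```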

Figure 16 shows the trend of the MSE of the adjusted estimator; in particular, it converges to 0 as n increases. For this purpose, configurations 1, 2 and 10 are considered.

Figure 16: Results of the adjusted estimator based on MSE — (a) Configuration 1, (b) Configuration 2, (c) Configuration 10

3.4 Sensitivity Analysis

3.4.1 Sensitivity Analysis in the static implementation

The value of d is chosen by comparing simulation results obtained for four values: 0.2, 0.4, 0.6 and 0.8. These represent the trust placed in weight-based sampling compared to simple random sampling.

The sensitivity analysis considers only five configurations: 1, 2, 8 and 9, which represent extreme cases, and configuration 12, as a best-case example.

The evaluation criteria are MSE and Sample Variance.

Figure 17: MSE comparison between d = 0.2, d = 0.4, d = 0.6 and d = 0.8 for the most significant configurations — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9, (e) Configuration 12

From the results shown in Figure 17, the best value of d with respect to MSE is 0.8, in particular for small values of n.

Figure 18: Sample Variance comparison between d = 0.2, d = 0.4, d = 0.6 and d = 0.8 for the most significant configurations — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9, (e) Configuration 12

As shown in Figure 18, the sample variance is only slightly influenced by the value of d across the configurations; in fact, with few exceptions, the variance values have the same order of magnitude. In configurations 1 and 2, better sample variance values are observed for d = 0.2 and d = 0.4, while in configurations 9 and 12 better values are obtained for d = 0.6 and d = 0.8.

The chosen value of d is 0.8, because it offers the best trade-off between MSE and variance.

3.4.2 Sensitivity Analysis in the dynamic implementation

In this case, the sensitivity analysis is performed for two parameters: d0 (values between 0.5 and 0.9 with step 0.1) and the shift register size (3, 4 and 5).

Sensitivity Analysis on d0

The value of d0 is chosen by comparing simulation results obtained for five different values between 0.5 and 0.9 with step 0.1. MSE, Sample Variance and Number of Failing Points (NFP) are the evaluation criteria used for this sensitivity analysis.

The analysis considers only four configurations, 1, 2, 8 and 9, which represent the limit cases.


As shown in Figures 19, 20 and 21, three different trends can be observed:

• the MSE decreases as d0 increases;

• the Sample Variance tends to grow as d0 increases;

• the NFP is almost constant for all d0 values.

Considerations about the limit cases are also interesting, in particular for the first subset, where there are substantial differences across the d0 values.

Figure 19: Sensitivity Analysis on d0 for MSE — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9. Ad1 is the implementation with dynamic d and unitary initial sample; Ad2 is the implementation with dynamic d and variable initial sample.

As previously noted, the MSE decreases as d0 increases. As shown in Figure 19, the best values of d0 are 0.8 and 0.9.

Figure 20: Sensitivity Analysis on d0 for Sample Variance — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9. Ad1 is the implementation with dynamic d and unitary initial sample; Ad2 is the implementation with dynamic d and variable initial sample.

Figure 20 shows that the Sample Variance is better for small values of d0, with 0.9 being the worst case.

Figure 21: Sensitivity Analysis on d0 for NFP — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9. Ad1 is the implementation with dynamic d and unitary initial sample; Ad2 is the implementation with dynamic d and variable initial sample.


The NFP performance, shown in Figure 21, is roughly the same for every value of d0.

Considering all the observations, the selected value of d0 is 0.8, which offers a good trade-off between MSE and Sample Variance.

Sensitivity Analysis on the Shift Register

The analysis is carried out on three values of the shift register size: 3, 4 and 5. The values associated with each cell are organized as in Figure 22.

Figure 22: Shift registers values

As in the previous case, simulations are performed for configurations 1, 2, 8 and 9. In terms of MSE and Sample Variance, the three sizes differ little. For MSE, the best shift register size is 4, because it also gives better performance in the limit cases.

Figure 23: Sensitivity Analysis on Shift Register size for MSE — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9

As shown in Figure 23, performance is better for SR = 3 and SR = 4. As with the MSE, the Sample Variance is roughly the same in all cases, except the limit ones.

Figure 24: Sensitivity Analysis on Shift Register size for Sample Variance — (a) Configuration 1, (b) Configuration 2, (c) Configuration 8, (d) Configuration 9

Observing Figure 24, SR = 5 performs best, while sizes 4 and 3 show the same performance. It is important to note that the differences appear at the third decimal digit, hence they are very small. The chosen value is SR = 4, because it is the best value for MSE, with acceptable variance values (second best).

3.5 Results

Results are based on four different variants of the MART algorithm plus the SRS case:

1. static d with n0 = 1 (1);

2. static d with n0 ≥ 1 (2);

3. Simple Random Sampling (SRS);

4. dynamic d with n0 = 1 (Ad1);

5. dynamic d with n0 ≥ 1 (Ad2).

3.5.1 MSE

To evaluate the differences between the approaches, it is possible to observe the histograms in Figure 25. The evaluation of the MSE is important to assess how much the obtained values deviate from the true reliability.

Figure 25: MSE simulation results for each Configuration — panels (a) to (m) correspond to Configurations 1 to 13

In panels (a), (b) and (h), SRS is better than the other algorithms; this result depends on the nature of the configurations: configuration 1 corresponds to a uniform distribution of failures across partitions, and configurations 2 and 8, as defined in Section 3.1, are close to a uniform failure distribution.

In all other cases SRS obtains worse results; in particular, for small values of n, the algorithm with unitary initial sample (1) is the best.

3.5.2 Sample Variance

The Sample Variance is important to evaluate the quality of the estimate; in particular, it indicates how representative the estimate is of the population mean. All results are shown in Figure 26.

Figure 26: Sample Variance simulation results for each Configuration — panels (a) to (m) correspond to Configurations 1 to 13

Figure 26 shows that SRS presents the worst sample variance values in all configurations. The results of the other algorithms are very close to each other, except in panels (a), (c) and (h), where the static techniques are better for small values of n.

3.5.3 Failing Point Number

This quantity reflects the tendency of the different techniques to expose failures; all results are shown in Figure 27.

Figure 27: NFP simulation results for each Configuration — panels (a) to (m) correspond to Configurations 1 to 13

In Figure 27, SRS presents the worst values in all configurations; in panels (a), (b) and (h) it achieves its best values, which are nonetheless worse than those of the other techniques. As for the NFP values of the other four techniques, they are essentially the same across all configurations.

3.5.4 Considerations

The first consideration concerns the efficiency and accuracy of SRS compared with the four versions of the testing strategy. For this purpose, configurations 1, 2 and 8, represented in panels (a), (b) and (h), are considered. In these configurations the MSE is better for SRS; this result depends on the uniform distribution of failures (or close-to-uniform in the 0.25/0.75 case), which is an ideal case far from the real world. On the other hand, SRS is worse than the four versions of the test generation algorithm of MART in terms of sample variance, where the difference is very marked.

For all other configurations our techniques are better than SRS, both in sample variance and, except for a few isolated points, in MSE.

Finally, the NFP values are considered, for which the four implementations of the test generation algorithm are globally better.


It is now useful to determine the best among the four versions of the test generation algorithm of MART. The first step is the comparison between the different initial sample sizes. Results show that the two techniques with unitary initial sample are globally better, both for MSE and for sample variance.

This leads to the comparison between the technique with static d and the technique with dynamic d, both with unitary sample size, keeping in mind that the differences are very slight. For MSE, the values are roughly the same: in some configurations (e.g., 5) the dynamic d is better, while in others (e.g., 6) the static d is better. For variance, the static d is better on average.

In conclusion, the test generation algorithm of MART with static d is considered the best, not only for its Sample Variance, but also for its simpler formulation compared to the dynamic one.


Chapter 4: Experimentation

The objectives of the experimentation, evaluated by applying MART to a real system, are:

• to demonstrate the validity of the update operations, considering operational profiles with different distances from the true one;

• to verify the advantage of using MART rather than operational testing;

• to verify the behavior of MART when the true operational profile changes.

From the simulation, a set of desirable working conditions emerged, under which the MART technique is expected to work better than SRS:

• clustered failures;

• a scarce testing budget;

• good belief of the tester about the system: the probability values assigned by the tester are close to the real ones;

• good partitioning: each partition mostly contains either failing or non-failing test cases.

These will be considered in the interpretation of the experimental results.

4.1 Pet Clinic

The application considered for the experimentation is Pet Clinic [35], a system to manage owners, pets, vets and visits of a veterinary clinic. This microservice architecture is based on Spring Cloud Netflix technology [7], which provides the necessary integration between the Spring environment and Netflix OSS [6].

The architecture includes the following services:

• Admin Server (Spring Boot Admin): a project to manage and monitor Spring Boot applications [36].


• Api Gateway Application (Zuul Server): Zuul is the front door for all requests from devices and web sites to the backend of the Netflix streaming application [37]. As an edge service application, Zuul is built to enable dynamic routing, monitoring, resiliency and security. It can also route requests to multiple Amazon Auto Scaling Groups as appropriate.

• ConfigServerApplication: the server provides an HTTP, resource-based API for external configuration (name-value pairs, or equivalent YAML content) [38].

• CustomersServiceApplication, VetsServiceApplication, VisitsServiceApplication: microservices for the management of owners, pets, vets and visits.

• DiscoveryServerApplication (Eureka Server): Eureka is a REST (Representational State Transfer) based service that is primarily used in the AWS cloud for locating services for the purpose of load balancing and failover of middle-tier servers [39].

• Monitoring (AspectJ): for the monitoring of specific "aspects" [40].

• Tracing Server (Zipkin): Zipkin is a distributed tracing system. It helps gather the timing data needed to troubleshoot latency problems in microservice architectures, managing both the collection and the lookup of this data [41].

The system is considered in a reduced version, because the admin server and the tracing server are not useful for the experimentation objectives.

All Pet Clinic services are executed locally with Docker [42], an open-source project that automates system deployment using Linux containers.

4.2 MART setup

The setup stage consists in defining all the information necessary for the algorithm execution. The structure representing a test frame is characterized by the following fields:

• Service: name of the microservice that exposes the method;

• Distance Factor: integer number encoding the distance between test frames owned by the same Service;

• Linked Service: a way to encode logical links between different Services;


• test frame: the name of the test frame, represented as an encoded URL, built from the Cartesian product of the Input Classes;

• Type: REST request type (GET, POST, ...);

• Failure Probability: probability that a test case taken from this test frame exposes a failure;

• Occurrence Probability: probability that one of the test cases included in the test frame is run during system execution;

• Payload: encoding of the payload associated with a method, represented in JSON.

All this information must be defined in order to derive automatically all the data structures necessary for the MART execution; a sketch of such a structure is shown below.
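A plain Java class mirroring the fields listed above could look as follows; the field names come from this section, while the concrete types are an assumption.

```java
/** Sketch of the test frame structure used in the MART setup. */
class TestFrame {
    String service;               // microservice exposing the method
    int distanceFactor;           // distance between frames of the same service
    String linkedService;         // logical link towards another service
    String frame;                 // encoded URL built from the Input Classes
    String type;                  // REST request type (GET, POST, ...)
    double failureProbability;    // P(a test case from this frame exposes a failure)
    double occurrenceProbability; // P(one of its test cases is run in execution)
    String payloadJson;           // JSON encoding of the associated payload
}
```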

4.2.1 Weight Matrix determination

To obtain the weight matrix automatically, the "distance information" is encoded through three fields:

• the field Service determines whether there is a link between test frames, depending on the service they belong to (the assumption that there are no self-links is guaranteed);

• the field Distance Factor determines the distance between the test frames for which a link is defined;

• the field Linked Service determines whether there is a link of fixed weight (2 in this case) between methods of different services.

All the other operations carried out for the weight matrix definition are the same as described in Subsection 2.2.2, respecting the absence of self-links and the inverse proportionality of the weights with respect to the distance.

4.2.2 Test frame definition

As described in Subsection 2.2.1, test frames directly depend on the Input Classes, which are described in order to realize the test case partitioning. This representation uses the following fields:


• Name: instance name of an input class;

• Type: type of input, chosen from a set of possible values;

• Min: the interpretation of this field depends on the specified type;

• Max: the interpretation of this field depends on the specified type.

Input type and values are encoded in the field Type. The possible values are:

• Range: integer value between Min and Max;

• Lower: integer value lower than Min;

• Greater: integer value greater than Min;

• Different: every value of the type that differs from the one specified in Min;

• Symbol: special character specified in Min;

• S_range: alphanumeric string with length between Min and Max;

• S_greater: alphanumeric string with length greater than Min;

• Empty: empty input;

• N_range: numeric string with length between Min and Max;

• N_greater: numeric string with length greater than Min.

In this implementation, test cases are randomly generated using inputs obtained from the different Input Class instances, as sketched below.
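A minimal generator for a few of the Type values above might look like this (the handling of each case is an illustrative assumption).

```java
import java.util.Random;

class InputGenerator {
    /** Generate a random concrete input for an Input Class instance. */
    static String randomInput(String type, int min, int max, Random rnd) {
        switch (type) {
            case "Range":   return String.valueOf(min + rnd.nextInt(max - min + 1));
            case "Lower":   return String.valueOf(min - 1 - rnd.nextInt(100));
            case "Greater": return String.valueOf(min + 1 + rnd.nextInt(100));
            case "Empty":   return "";
            case "S_range": {                 // alphanumeric string, length in [min, max]
                int len = min + rnd.nextInt(max - min + 1);
                String alpha = "abcdefghijklmnopqrstuvwxyz0123456789";
                StringBuilder sb = new StringBuilder();
                for (int i = 0; i < len; i++)
                    sb.append(alpha.charAt(rnd.nextInt(alpha.length())));
                return sb.toString();
            }
            default: throw new IllegalArgumentException("unsupported type: " + type);
        }
    }
}
```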

4.3 Functions

Each operation necessary for the experimentation is implemented by a function; all functions are collected in a Java application. Since Pet Clinic is a REST-based application, the functions are implemented as a Client that sends requests, receives responses, evaluates them and computes the target values. The examined system is an interactive application, which means that, to observe their behavior, the services have to be stimulated externally. The following subsections describe the implementation of these functions.

4.3.1 True Reliability calculation function

The first step is to calculate the true reliability under the true operational profile (the operational profile being the set of occurrence probabilities attached to each test frame): first, 10,000 test cases are executed, collecting the outcome of each one. Then the True Reliability is calculated from the ratio between the number of failed test cases (F) and the total number of executed test cases (T):

$$TrueReliability = 1 - \frac{F}{T}$$

Figure 28: True Reliability Function

The True Reliability calculation function is described in Figure 28. In the implemented function, all requests are made by a single Client.

4.3.2 Update functions

The run-time assessment requires the ability to monitor and update the usage profile and the failure probability of each test frame. Common monitoring tools can be used to gather data, such as Wireshark, Amazon CloudWatch or Nagios. In this case MetroFunnel [34], a tool tailored for microservice applications, is customized for the purpose by adding payload information. Alternatively, given the interactive nature of the system, all the workload can be generated by a single Client, making it simple to collect requests and responses at a single point.

The probability update is implemented, as described in Section 2.3, in the Loop Requests and Monitoring Parsing functions: the first is based on the single-Client idea, the second implements log parsing with the modified MetroFunnel.

4.3.3 Reliability Assessment function

This paragraph describes the reliability assessment function, in which MART is applied to obtain the reliability of the system. To run the algorithm, the steps described in Chapter 2 must be executed:

• the test case domain is partitioned into test frames, considering the different methods and the Cartesian product of the input classes;

• each test frame is assigned both an occurrence probability and a failure probability.

After these two steps, the reliability assessment is executed.

The previous chapters presented different versions of the test generation algorithm of MART, but the experimentation is based on the first version, with unitary initial sample and static d. This procedure is implemented by the reliability assessment by MART function.

4.3.4 Operational Testing function

A benchmark algorithm is considered to evaluate the MART performance: Operational Testing. It executes pseudo-random requests based on the current operational profile and calculates the reliability in a frequentist way, as in the True Reliability case:

$$Reliability = 1 - \frac{FailedTestCases}{ExecutedTestCases}$$


This procedure is implemented by the reliability assessment by operational testing function.

4.3.5 Distance between profiles function

The distance between two operational profiles is calculated as the sum of the absolute differences between the old and the new occurrence probability of homologous test frames:

$$distance = \sum_{i=0}^{TF_{size}} \left| occProb_i^{old} - occProb_i^{new} \right|$$

where:

• $TF_{size}$: the cardinality of the test frame set;

• $occProb_i^{old}$: occurrence probability of a test frame in the old operational profile;

• $occProb_i^{new}$: occurrence probability of a test frame in the new operational profile.

This procedure is implemented by the Distance Calculation function.
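In code, this is a plain L1 distance between the two probability vectors; a minimal sketch (assuming the two profiles list the frames in the same order) follows.

```java
class ProfileDistance {
    /** Sum of absolute differences between homologous occurrence probabilities. */
    static double distance(double[] oldProfile, double[] newProfile) {
        double d = 0;
        for (int i = 0; i < oldProfile.length; i++)
            d += Math.abs(oldProfile[i] - newProfile[i]);
        return d;
    }
}
```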

4.4 Experimental design

4.4.1 Experimental scenarios

To define the experimental scenarios, the test frames are analyzed to provide a realistic testing-time characterization. Specifically, 30 test cases are executed for each test frame, collecting the outcomes. The results (in terms of failing/correct test cases) support the distinction between more and less failure-prone test frames. Based on them, the test frames are divided into three categories, and an initial failure probability is assigned to each one:

• First Category: 25 test frames which exhibited no failure. Each test frame of this category is assigned an initial failure probability fi = ε = 0.01, with i denoting the test frame. The ε (instead of assigning 0) represents the uncertainty due to the limited number of observations;

• Second Category: 46 test frames, which failed at every one of the 30 executions. Specularly, the initial failure probability for these test frames is fi = 1 − ε = 0.99;

• Third Category: 191 test frames, the remaining ones, which failed sporadically. Based on the observed proportion of failures, approximately 1 failure every 10 requests, the initial probability is set to fi = 0.1.


To demonstrate that MART is able to track variations of the true operational profile, two different "true profiles" are considered; they are obtained by assigning each category a percentage, representing the probability of selecting an input from a test frame belonging to that category:

• to build the first profile, 80% is assigned to the First Category, 15% to the Second Category and 5% to the Third Category;

• to build the second profile, 55% is assigned to the First Category, 35% to the Second Category and 10% to the Third Category.

The probability attached to test frames in the same category is equal (there is equiprobability inside categories).

The following factors are defined according to a Design of Experiments planning:

• technique: operational testing or MART;

• true profile: the true operational profile;

• used profile: three operational profiles, which differ from the first true operational profile by 10%, 50% and 90%;

• n: number of executed test cases, expressed as a percentage of the number of test frames; the considered values are 20%, 40% and 70%;

• K: number of experiment repetitions, fixed at 30;

• step: number of different experiments executed in a session;

• update cycles: number of updates (each consisting of 5000 test cases) executed between step changes.

In conclusion, four different experiments are considered. In the first three, MART is evaluated with a used profile that differs from the true one by a given quantity; they are used to observe the impact of the update compared to operational testing. In the fourth, MART is evaluated when the true operational profile changes.

The first three experiments consist of three steps and 1 update cycle per step change. These experiments refer to the first true operational profile (80%, 15%, 5%).


The fourth experiment consists of five steps and 1 update cycle per step change. In this experiment the assessment considers: for the first two steps, the first true operational profile; for step 3, both profiles; for the last two steps, the second profile.

These combinations generate 42 different scenarios.

4.4.2 Evaluation criteria

As in the simulation, accuracy and efficiency metrics are used; however, an experimental scenario j is repeated 30 times instead of 100. At the end of each repetition r, the true reliability $R_j$ is computed by preliminarily running 10,000 test cases under the true profile and using $R = 1 - \frac{F}{T}$, with F being the number of observed failures and T = 10,000. For each scenario j, the MSE and the Sample Variance are used as accuracy and efficiency metrics.

The number of experiment runs is: (2 techniques × 3 profiles × 3 n × 30 repetitions) + (2 techniques × 1 profile × 3 n × 30 repetitions) = 720 runs. The first addend refers to the first three experiments, the second to the fourth experiment.

For MART, all runs of the first addend are executed for three consecutive steps and all runs of the second for five steps, for a total of 1260 effective runs. For operational testing (applicable only at step 1) there are (12 experimental scenarios × 30 repetitions) = 360 runs, for an overall total of 1620 effective runs.

4.4.3 True Reliability estimation

For the first profile, the true reliability is 0.9436, calculated over 10,000 independently executed tests, with 564 failed test cases. For the second profile, the true reliability is 0.8979, calculated over 10,000 independently executed tests, with 1021 failed test cases.

As highlighted by the true reliability estimation, the reliability is strongly influenced by the occurrence probability assigned to the third category; in fact, the unreliability is very close to this percentage (5.64% versus 5% for the first profile, 10.21% versus 10% for the second). This result indicates that a good partitioning was performed.


4.5 Results

This Section presents the results of all experiments. The code necessary for the experiment execution is written in Java. The experiments are performed on a MacBook Pro 15'' with an Intel Core i7 2.5 GHz CPU and 16 GB of 1600 MHz DDR3 RAM, with Java version "1.8.0_101" and Docker version 17.12.0-ce, build c97c6d6.

4.5.1 Experiment 1

The operational profile considered in this experiment differs from the first true operational profile by 10% and is obtained with the following input distribution:

• 75% for the first category,

• 17.5% for the second category,

• 7.5% for the third category.

Figure 29: MSE of each technique at each Step, for different n%


Figure 30: Sample Variance of each technique at each Step, for different n%

The wrong operational profile is evaluated at step 1: as shown in Figures 29 and 30, operational testing is worse than MART, both for MSE and for sample variance. The considerations are roughly the same for all the considered values of n.

There is an MSE increment only in the 40% case for MART: this condition depends on several factors, such as the combination of the two formulated estimators (see Section 3.3) and the peculiar nature of the algorithm, which exposes failures more consecutively, causing microservice crashes and violating the request independence assumption. The latter effect is particularly evident for large values of n; this also highlights the strength of MART, since when the independent-requests assumption is respected, performance improves.

An update phase, in which the generated traffic is observed to adjust the operational profile, is necessary to move to the next step. In this case, the difference between the initial operational profile and the one obtained by the update is 0.07, an improvement of 3%.

In step 2 the superiority is even more evident, both for MSE and for Sample Variance; in fact, the results of operational testing are the same as those obtained in the first step. This is a direct consequence of the update function, which is absent in the operational testing technique.


In step 3 the difference between the true profile and the operational profile obtained in the previous step, after the global update, is very small, about 1%, which is a good value considering that there are 262 test frames.

Compared to the previous step the MSE is lower, which indicates a constant improvement of the MART performance compared to operational testing.

4.5.2 Experiment 2

The operational profile considered differs from the first true operational profile by 50%:

• 57.5% for the first category,

• 40% for the second category,

• 2.5% for the third category.

The operational profile considered in the previous case leads to an underestimation of the true system reliability; in this case, instead, the operational profile is built so as to obtain an overestimation of the system reliability.

Figure 31: MSE of each technique at each Step, for different n%


Figure 32: Sample Variance of each technique at each Step, for different n%

Results are shown in Figures 31 and 32. At step 1 the MSE values of the two techniques are close, in particular for n = 70%, while MART is clearly superior in terms of sample variance. After an update cycle, the operational profile improves by 25% with respect to the true one, with a generally decreasing trend of the MSE values. After another update cycle there is a further improvement of 12%, with a decrease of the estimated reliability value for all n percentages. The comparison between the performances of the algorithms leads to the same considerations as in the previous experiment.

4.5.3 Experiment 3

The operational profile considered differs from the true operational profile by 90%:

• 35% for the first category,

• 55% for the second category,

• 10% for the third category.

Experiment results are reported in Figures 33 and 34.


Figure 33: MSE of each technique at each Step, for different n%

Figure 34: Sample Variance of each technique at each Step, for different n%

At the first step, operational testing achieves better MSE values than MART for small n, but worse values of sample variance. After an update cycle, the operational profile improves by 45% with respect to the true one, with a very significant improvement of the reliability estimate in the case of MART.


After another update cycle, the operational profile improves by a further 22%. In this case the trend of convergence towards the true reliability value is evident. The considerations are the same as in the previous two experiments, showing that the technique implemented in this thesis outperforms the compared ones.

4.5.4 Experiment 4

The fourth experiment has a different formulation. The objective is to observe how MART behaves when the true operational profile varies, verifying not only the superiority of MART over operational testing, but also the effectiveness of the operational profile update. After three steps the true operational profile is changed; the ability of MART to adapt to this variation is then evaluated over two additional steps.

Figure 35: MSE of each technique at each Step, for different n%


Figure 36: Sample Variance of each technique at each Step, for different n%

Figures 35 and 36 show that, for the first three steps, the results are roughly the same as in experiment 2. Step 4 is not an actual step, but a re-evaluation of step 3 against the new true operational profile: at this step there is a very significant increase of the MSE. At steps 5 and 6, after their operational profile update operations, the estimated value tends towards the true value, as shown by the decreasing MSE. The sample variance is roughly the same across all steps. The worked example below shows why the MSE jumps at step 4.
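To see why the re-evaluation alone inflates the MSE, consider a hedged numeric illustration (values assumed, not measured): if the step 3 estimates cluster around a true reliability R = 0.98, their squared errors are small; if the profile change moves the true reliability to R' = 0.95 while the estimates are unchanged, each squared error becomes roughly (\hat{R} - R')^2 \approx (0.98 - 0.95)^2 = 9 \times 10^{-4}, i.e., orders of magnitude larger without any change in the estimates themselves.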

4.5.5 Further considerations

Regarding the failure probability update, the results are relevant: the probabilities attributed by default are respected, except for a few test frames. In most cases the 0.99 probabilities are increased, while the 0.1 and 0.01 ones are reduced. Because of the violation of request independence, in a few cases these last two values show a slight increase (~0.1).
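The exact update rule is the one defined by the MART strategy; the behavior described above is consistent with, for instance, a simple exponential smoothing of each frame's failure probability towards its observed failure rate. The sketch below is an assumption for illustration, not the thesis' actual rule:

    // Illustrative sketch (not the thesis' actual rule): smooth each test
    // frame's failure probability towards the failure rate observed for that
    // frame; alpha controls how fast observations override the default value.
    static double updateFailureProbability(double current, int failures,
                                           int executions, double alpha) {
        if (executions == 0) {
            return current; // no new evidence for this frame
        }
        double observedRate = (double) failures / executions;
        return (1 - alpha) * current + alpha * observedRate;
    }

Under such a rule, frames that fail almost always (default 0.99) drift upwards, while frames that rarely fail (defaults 0.1 and 0.01) drift towards zero, matching the observed behavior.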

4.6 ANOVA

For statistical significance, a one-way analysis of variance (ANOVA) test is conducted. First, the properties of the test data are checked, in particular the normality and homoscedasticity of both the MSE and the sample variance residuals, in order to determine the type of ANOVA to apply.


Figure 37: normality of MSE residuals

Figure 38: normality of Sample Variance residuals

The Shapiro-Wilk test is run to verify the normality of the residuals; results are reported in Figures 37 and 38. The null hypothesis that the data come from a normal distribution is rejected for the MSE with a p-value < 0.0001 and for the sample variance with a p-value = 0.0003. Homoscedasticity is checked with Levene's test, since it is less sensitive to non-normality.


Figure 39: Levene's Test

As shown in Figure 39, homoscedasticity is also rejected, with a p-value = 0.0003 in the MSE case and a p-value < 0.0001 in the sample variance case. Since both assumptions of the parametric ANOVA are violated, the non-parametric Wilcoxon's test is adopted in both cases; a minimal sketch of this check follows.
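The tool used to run the statistical tests is not reported in the thesis; below is a minimal sketch with Apache Commons Math (the library choice, the paired formulation, and the placeholder data are all assumptions) of the comparison between the two techniques:

    import org.apache.commons.math3.stat.inference.WilcoxonSignedRankTest;

    // Minimal sketch: paired Wilcoxon signed-rank test on the MSE measurements
    // of MART vs. operational testing, one pair per (step, n%) configuration.
    // The arrays below are placeholders, not the thesis' data.
    public final class SignificanceCheck {
        public static void main(String[] args) {
            double[] mseMart        = {1.2e-5, 0.9e-5, 1.1e-5, 2.0e-5};
            double[] mseOperational = {3.4e-5, 2.8e-5, 3.1e-5, 4.0e-5};
            WilcoxonSignedRankTest wilcoxon = new WilcoxonSignedRankTest();
            // exactPValue = false -> normal approximation of the statistic
            double p = wilcoxon.wilcoxonSignedRankTest(mseMart, mseOperational, false);
            System.out.println("p-value = " + p);
        }
    }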

Figure 40: Wilcoxon’s test

As shown in Figure 40, the hypothesis of no difference among the techniques is rejected with a p-value < 0.0001 in both cases: this means that the differences between the two techniques are statistically significant and the considerations reported above are valid.


Conclusions

Reliability assessment in MSA calls for run-time approaches. The run-time testing method presented in this thesis, based on MART, supports the on-demand assessment of reliability, pursuing accuracy and efficiency of the estimate during the operational phase. Results suggest that both the run-time adaptivity to the real observed profile and failing behavior, and the testing-time adaptivity implemented by MART (allowing it to spot failures with few tests while preserving the unbiasedness of the estimate), are good starting points to elaborate further in the future. Improvements can be achieved by investigating what other information can be useful to expedite the assessment (e.g., about service interactions), by exploring other approaches for the information update (e.g., Bayesian updates), and/or by exploring different partitioning criteria. More extensive experiments are also planned to improve generalization in terms of experimental subjects and operational profiles.

MART is therefore a valid technique for the reliability assessment of Microservice Architectures. This claim is well supported: the results are positive and were obtained under conditions very close to a real case.


Bibliography

[1] Netflix, Inc. https://media.netflix.com/en/about-netflix

[2] S. K. Thompson, "Adaptive Web Sampling", Biometrics, The International Biometric Society, 2006.

[3] P. Di Francesco, P. Lago, I. Malavolta, "Research on Architecting Microservices: Trends, Focus and Potential for Industrial Adoption", IEEE International Conference on Software Architecture (ICSA), 2017.

[4] J. Lewis and M. Fowler, "Microservices: a definition of this new architectural term", 2014. Available at: http://martinfowler.com/articles/microservices.html

[5] Spring Cloud. Available at: http://projects.spring.io/spring-cloud/

[6] Netflix OSS. Available at: https://netflix.github.io/

[7] Spring Cloud Netflix. Available at: https://cloud.spring.io/spring-cloud-netflix/

[8] D. Shadija, M. Rezai, R. Hill, "Towards an Understanding of Microservices", in Proceedings of the 23rd International Conference on Automation and Computing (ICAC), University of Huddersfield, IEEE Computer Society, 2017.

[9] M. Richards, "Microservices vs. Service-Oriented Architecture", O'Reilly Media, 2015.

[10] I. Miri, "Microservices vs. SOA". Available at: https://dzone.com/articles/microservices-vs-soa-2

[11] M. R. Lyu, "Handbook of Software Reliability Engineering", Chap. 2, 12, 13 and 16, McGraw-Hill, Inc., Hightstown, NJ, 1996.

[12] J. J. Naresky, "Reliability Definitions", IEEE, 1970.

[13] G. Toffetti, S. Brunner, M. Blochlinger, F. Dudouet, A. Edmonds, "An architecture for self-managing microservices", Proceedings of the 1st International Workshop on Automated Incident Management in Cloud (AIMC), pp. 19-24, 2015.

[14] N. Cardozo, "Emergent software services", in Proceedings of the 2016 ACM International Symposium on New Ideas, New Paradigms, and Reflections on Programming and Software, pp. 15-28, 2016.

[15] G. Schermann, D. Schöni, P. Leitner, and H. C. Gall, "Bifrost: Supporting Continuous Deployment with Automated Enactment of Multi-Phase Live Testing Strategies", in ACM/IFIP/USENIX Middleware, p. 12, 2016.

[16] H. Kang, M. Le, S. Tao, "Container and microservice driven design for cloud infrastructure DevOps", IEEE International Conference on Cloud Engineering (IC2E), pp. 202-211, 2016.

[17] B. Butzin, F. Golatowski, D. Timmermann, "Microservices approach for the internet of things", IEEE 21st International Conference on Emerging Technologies and Factory Automation (ETFA), 2016.

[18] J. Stubbs, W. Moreira, R. Dooley, "Distributed systems of microservices using Docker and serfnode", 7th International Workshop on Science Gateways (IWSG), pp. 34-39, 2015.

[19] D. Guo, W. Wang, G. Zeng, Z. Wei, "Microservices architecture based cloudware deployment platform for service computing", IEEE Symposium on Service-Oriented System Engineering (SOSE), pp. 358-363, 2016.

[20] V. Heorhiadi, S. Rajagopalan, H. Jamjoom, M. K. Reiter, V. Sekar, "Gremlin: systematic resilience testing of microservices", Proc. of ICDCS, pp. 57-66, 2016.

[21] A. Nagarajan, A. Vaddadi, "Automated Fault-Tolerance Testing", IEEE, 2016.

[22] P. Potvin, M. Nabaee, F. Labeau, K. Nguyen, M. Cheriet, "Microservice cloud computing pattern for next generation networks", LNICST 166, pp. 263-274, 2016.

[23] P. Bak, R. Melamed, D. Moshkovich, Y. Nardi, H. Ship, A. Yaeli, "Location and context-based microservices for mobile and internet of things workloads", IEEE International Conference on Mobile Services, pp. 1-8, 2015.

[24] K. Meinke, P. Nycander, "Learning-based testing of distributed microservice architectures: Correctness and fault injection", International Conference on Software Engineering and Formal Methods, pp. 3-10, 2015.

[25] A. Balalaie, A. Heydarnoori, P. Jamshidi, "Migrating to Cloud-Native Architectures Using Microservices: An Experience Report", Proc. 1st Int'l Workshop on Cloud Adoption and Migration, 2015.

[26] P. Kookarinrat, Y. Temtanapat, "Design and Implementation of a Decentralized Message Bus for Microservices", 13th International Joint Conference on Computer Science and Software Engineering (JCSSE), Khon Kaen, Thailand, 2016.

[27] K.-Y. Cai, Y.-C. Li, K. Liu, "Optimal and adaptive testing for software reliability assessment", Information and Software Technology 46 (15), pp. 989-1000, 2004.

[28] D. Cotroneo, R. Pietrantuono, S. Russo, "RELAI Testing: A Technique to Assess and Improve Software Reliability", IEEE Trans. on Software Engineering 42 (5), pp. 452-475, 2016.

[29] T. Chen, H. Leung, and I. Mak, "Adaptive random testing", in Proc. 9th Asian Comput. Sci. Conf. Adv. Comput. Sci.: Higher-Level Decision Making, pp. 320-329, 2005.

[30] R. Pietrantuono, S. Russo, "On Adaptive Sampling-Based Testing for Software Reliability Assessment", in Proceedings 27th International Symposium on Software Reliability Engineering (ISSRE), IEEE, pp. 1-11, 2016.

[31] D. Cocchi, "Teoria dei Campioni", Chap. 2 and 3.

[32] D. Raj, "Some Estimators in Sampling with Varying Probabilities without Replacement", Journal of the American Statistical Association, Vol. 51, No. 274, pp. 269-284, 1956.

[33] M. N. Murthy, "Ordered and Unordered Estimators in Sampling without Replacement", Sankhyā: The Indian Journal of Statistics (1933-1960), Vol. 18, No. 3/4, pp. 379-390, 1957.

[34] R. Iorio, M. Cinque, R. Della Corte, "Real-time monitoring of microservices-based software systems", 2017.

[35] Pet Clinic. Available at: https://github.com/spring-petclinic/spring-petclinic-microservices

[36] Admin Server. Available at: http://codecentric.github.io/spring-boot-admin/current/

[37] Zuul Server. Available at: https://github.com/Netflix/zuul/wiki

[38] Config Server. Available at: https://github.com/spring-cloud/spring-cloud-config/

[39] Eureka Server. Available at: https://github.com/Netflix/eureka/wiki

[40] AspectJ. Available at: http://www.baeldung.com/aspectj

[41] Zipkin. Available at: https://zipkin.io/

[42] Docker. Available at: https://www.docker.com/