
μTIL: Mutation-based Statistical Test Inputs Generation for Automatic Fault Localization

Mickaël Delahaye
UJF-Grenoble 1, LIG UMR 5217
Grenoble, France
[email protected]

Lionel C. Briand
SnT Center, University of Luxembourg
Luxembourg
[email protected]

Arnaud Gotlieb
Certus Software V&V Center, Simula Research Laboratory
Oslo, Norway
[email protected]

Matthieu Petit
OWI Technologies
92340 Bourg-la-Reine, France
[email protected]

Abstract—Automatic Fault Localization (AFL) is a process to locate faults automatically in software programs. Essentially, an AFL method takes as input a set of test cases including failed test cases, and ranks the statements of a program from the most likely to the least likely to contain a fault. As a result, the efficiency of an AFL method depends on the "quality" of the test cases used to rank statements. More specifically, in order to improve the accuracy of the ranking within test budget constraints, we have to ensure that program statements are executed by a reasonably large number of test cases which cover the input domain as uniformly as possible. This paper proposes μTIL, a new statistical test inputs generation method dedicated to AFL, based on constraint solving and mutation testing. Using mutants where the locations of injected faults are known, μTIL is able to significantly reduce the length of an AFL test suite while retaining its accuracy (i.e., the code size to examine before spotting the fault). In order to address the motivations stated above, the statistical generator objectives are twofold: 1) each feasible path of the program is activated with the same probability; 2) the subdomain associated to each feasible path is uniformly covered. Using several widely used ranking techniques (i.e., Tarantula, Jaccard, Ochiai), we show on a small but realistic program that a proof-of-concept implementation of μTIL can generate test sets with significantly better fault localization accuracy than both random testing and adaptive random testing. We also show on the same program that using mutation testing enables a 75% length reduction of the AFL test suite without a decrease in accuracy.

I. INTRODUCTION

Though software testing aims at revealing faults in programs while debugging aims at locating and correcting them, these activities need to be combined to be cost-effective. Over the last decade, many automated fault-localization methods using execution traces have been proposed [9], [19], [22]. These methods, referred to as Automatic Fault Localization (AFL), use both statement coverage and test verdicts to rank statements from the most suspect to the least suspect [34]. Basically, an AFL method takes as input a set of test cases (test inputs and associated verdicts), including failed test cases, and ranks the statements of a program according to their likelihood of containing a fault. As a consequence, though the efficiency of any AFL method depends on the quality of the test inputs that serve to generate a statement ranking, the generation strategy for obtaining optimal test inputs, i.e., inputs that maximize the accuracy of the fault localization process, is still an open research problem [1], [4], [5], [35].

This research was partly funded by the French-government Single Inter-Ministry Fund (FUI) through the IO32 project (instrumentation and tools for 32-bit microcontrollers).

In this paper, we propose a new approach, called μTIL, that combines constraint solving, statistical test input generation, and mutation testing in order to generate test suites leading to accuracy improvements in AFL techniques. Though several measures have been proposed, the diagnosis accuracy of AFL is generally taken to reflect the size of the code a developer must examine before spotting a fault.

There are two key contributions in μTIL. The first contribution is related to the probabilistic properties of the test inputs it generates, which strive to ensure that:
• each feasible path of the program is activated with the same probability;
• the subdomain associated to each feasible path is uniformly covered.
Roughly speaking, a test suite having these two properties covers each feasible path of the program uniformly and thus ensures that faulty statements are executed by a reasonably large number of fairly distributed test cases, which increases the diagnosis accuracy of a coverage-based fault localization process. However, generating a test suite with these two properties is challenging, as establishing path feasibility is usually very difficult (even undecidable in the general case) and generating fairly distributed test inputs over a subdomain is costly. In μTIL, we build on our previous work that exploits Constraint Programming, a set of constraint solving techniques, to generate test suites that uniformly cover a subdomain [16], and we tailor statistical test inputs generation to uniformly activate each feasible path. The second contribution concerns our usage of mutation testing to minimize the length of the AFL test suite, while retaining the fault diagnosis capabilities of μTIL. The method generates program mutants where injected faults are precisely located and computes a fault diagnosis using an AFL technique. Then, among the test inputs that are generated, those that contribute to locating the mutant fault are kept while the others are discarded.


This process can be viewed as a learning process that consists in training μTIL to minimize the length of the AFL test suite, without loss in diagnosis accuracy.

In this paper, we provide a detailed discussion explaining when μTIL is useful and is likely to provide practical benefits. As a brief summary, it is beneficial to use μTIL when:
• the cost of running additional test cases is acceptable in light of the cost of debugging;
• the cost of test execution (on a target platform) needs to be reduced, especially in contexts where the test suite is expected to be used a large number of times to support many debugging activities;
• it is possible to simulate the execution environment and execute test cases with mutant programs using a simulator on the development platform.
The above conditions are representative, for example, of many embedded system development environments.

To evaluate our approach, a proof-of-concept implementation was developed. From the SIR repository [12], we selected a small but realistic program to assess the feasibility and potential benefits of our approach, before initiating a larger scale experimental evaluation. We generated five types of test suites: uniformly generated random test suites, two versions of statistical test suites based on Adaptive Random Testing [8], statistical test suites based on μTIL with mutation testing disabled, and statistical test suites using the full capabilities of μTIL. Based on a rigorous statistical evaluation [30], our results show that μTIL-generated test suites lead to AFL accuracy that significantly surpasses both random testing and adaptive random testing. We also show that mutation testing in μTIL-generated test suites enables a 75% length reduction of the AFL test suite without loss in fault localization accuracy.

Outline of the paper. The paper is organized as follows: Sec. 2 gives some background on AFL and mutation testing; Sec. 3 describes our overall approach, while Sec. 4 delves into the details of our statistical test inputs generation technique. Sec. 5 explains how to generate random test inputs that uniformly activate each feasible path of the program, while Sec. 6 describes our implementation prototype μTIL. Sec. 7 presents our experimental validation and Sec. 8 discusses the applicability of μTIL. Finally, Sec. 9 concludes this work.

II. BACKGROUND

In this section, we survey coverage-based Automatic Fault Localization (AFL) and introduce the necessary background on mutation testing to understand the rest of the paper. It should be noted that, though researchers are currently debating the practical interest of AFL [27], we will not engage in this discussion and simply assume that, if AFL is indeed being used, an effective test inputs generation process is required, thus justifying our contribution.

A. Automatic Fault Localization

Though our approach is meant to be applicable to all AFL techniques, we only review the three most well-known AFL measures for which large amounts of experimental results [23] are available, namely the Tarantula approach by Jones, Harrold and Stasko [22], the Jaccard approach, and the Ochiai approach [1], [2]. The current body of experimental results suggests that these approaches perform well in practice [9], [19], [21]. These three techniques take as inputs a set of test cases, including at least one failed test case, and the corresponding execution traces in terms of covered blocks or statements. Then, by computing suspiciousness scores, they rank program blocks or statements from the most suspicious to the least suspicious. The three approaches we consider in this paper differ only in the suspiciousness scores they compute. Using the notations of [23], where:
• $n_s$ is the number of test cases that succeed;
• $n_f$ is the number of test cases that fail;
• $n_s(e)$ is the number of test cases that run through a particular element $e$ and succeed;
• $n_f(e)$ is the number of test cases that run through a particular element $e$ and fail;
Fig. 1 shows the formulas used in each of the three approaches.

Tarantula: $\dfrac{n_f(e)/n_f}{n_s(e)/n_s + n_f(e)/n_f}$
Jaccard: $\dfrac{n_f(e)}{n_f + n_s(e)}$
Ochiai: $\dfrac{n_f(e)}{\sqrt{n_f \cdot (n_f(e) + n_s(e))}}$

Figure 1. Suspiciousness scores for various AFL approaches
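To make these definitions concrete, the following sketch (our own illustration, not part of the μTIL tool) computes the three scores of Fig. 1 from a coverage matrix and test verdicts; the sample data reproduces the coverage matrix of the pow example of Fig. 2.

```python
import math

def suspiciousness(coverage, verdicts):
    """Tarantula, Jaccard and Ochiai scores for each program element.

    coverage[t][e] is 1 if test case t executes element e, 0 otherwise;
    verdicts[t] is 'f' for a failed test case and 's' for a passed one."""
    n_f = sum(1 for v in verdicts if v == 'f')
    n_s = len(verdicts) - n_f
    scores = {}
    for e in range(len(coverage[0])):
        nf_e = sum(row[e] for row, v in zip(coverage, verdicts) if v == 'f')
        ns_e = sum(row[e] for row, v in zip(coverage, verdicts) if v == 's')
        covered = nf_e + ns_e > 0
        tarantula = (nf_e / n_f) / (ns_e / n_s + nf_e / n_f) if covered and n_f and n_s else 0.0
        jaccard = nf_e / (n_f + ns_e) if n_f + ns_e > 0 else 0.0
        ochiai = nf_e / math.sqrt(n_f * (nf_e + ns_e)) if covered and n_f else 0.0
        scores[e] = (tarantula, jaccard, ochiai)
    return scores

# Coverage matrix of the pow example of Fig. 2 (rows = test cases, columns = statements {1}..{7}).
coverage = [
    [1, 1, 0, 1, 1, 1, 0],   # x=2,  y=4,  verdict s
    [1, 1, 0, 1, 0, 0, 0],   # x=-2, y=0,  verdict s
    [1, 1, 1, 0, 0, 0, 1],   # x=2,  y=-4, verdict f
    [1, 1, 1, 0, 1, 1, 1],   # x=-3, y=-3, verdict s
]
scores = suspiciousness(coverage, ['s', 's', 'f', 's'])
# Statements {3} and {7} (indices 2 and 6) obtain the highest scores on all three measures.
```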

The three approaches correlate the execution traces of test cases using a diagnosis matrix, such as the one given on the left-hand side of Fig. 2. The matrix represents the execution traces for a set of test cases and their associated verdicts (s: success, f: fail). Based on this matrix, the three techniques compute suspiciousness scores using the formulas given in Fig. 1 and rank the statements from the most "suspicious" to the least "suspicious" (right part of Fig. 2). As a result, they provide a rank-ordered list of statements called a diagnosis, e.g., [3, 7, 1, 2, 4, 5, 6] in the example. The underpinning principle of these approaches is that faulty statements appear more frequently in the traces of failed test cases than in those of passed test cases. In other words, suspiciousness should be an indicator of the likelihood for a statement to contain a fault.

Evaluating the accuracy of a diagnosis is crucial in our approach, and the literature contains many definitions of accuracy [3], [19]–[21]. One of the most commonly used definitions of accuracy is provided in [20], derived from [35] under the name Expense.

Definition 1 (Expense of a fault localization diagnosis). Let f be an actual faulty statement in a program and D be a diagnosis. Let #SRank(f) be the number of statements whose suspiciousness score is greater than or equal to the suspiciousness of f, and #SExec be the total number of executable statements. Then Expense is defined as:

$\mathit{Expense}(D) = \dfrac{\#SRank(f)}{\#SExec}$
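A direct transcription of Definition 1 (ours, not the Evaluator component of μTIL) could look as follows:

```python
def expense(scores, faulty_stmt, executable_stmts):
    """Fraction of executable statements ranked at least as suspicious as the fault.

    scores maps each statement to a single suspiciousness value,
    faulty_stmt is the known fault location,
    executable_stmts is the collection of executable statements."""
    n_rank = sum(1 for s in executable_stmts if scores[s] >= scores[faulty_stmt])
    return n_rank / len(executable_stmts)
```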


Test inputs:  t1: x=2, y=4    t2: x=-2, y=0    t3: x=2, y=-4    t4: x=-3, y=-3
Verdicts:     t1: s           t2: s            t3: f            t4: s

pow(x, y: integer): float
local i, p: integer

Statement                               t1  t2  t3  t4   n_s(e)  n_f(e)  Taran.  Jacc.  Ochiai  Rank
{1}   i := 0;                            1   1   1   1     3       1      0.50    0.50   0.50     2
{2}   Result := 1;                       1   1   1   1     3       1      0.50    0.50   0.50     2
{3}   if y<0 then p := -x;               0   0   1   1     1       1      0.75    0.50   0.70     1
{4}   else p := y;                       1   1   0   0     2       0      0.00    0.00   0.00     3
      while i<p do
{5}     Result := Result * x;            1   0   0   1     2       0      0.00    0.00   0.00     3
{6}     i := i + 1;                      1   0   0   1     2       0      0.00    0.00   0.00     3
      done
{7}   if y<0 then Result := 1/Result;    0   0   1   1     1       1      0.75    0.50   0.70     1
      end

Figure 2. Diagnosis matrix and computation of various suspiciousness scores

In the presence of multiple faulty statements, the computation of #SRank(f) is performed using the least favorable statement in the ranking. It is also interesting to complement the Expense measurement with Dynamic Basic Block computations. A Dynamic Basic Block (DBB) is a group of statements that are covered by exactly the same subset of the test suite [5]. The average size of the DBBs is another measure of the diagnosis accuracy obtained with a specific test suite: the smaller, the better.

Since suspiciousness scores are estimates based on test execution traces, the more test cases used to compute these scores, the more confident we can be that the diagnosis accurately reveals faulty statements. Test cases that are well scattered over the input domain and exhibit high program structure coverage are also expected to lead to better diagnosis accuracy [20]. However, since having a large number of test cases contradicts the usual objective of minimizing software testing effort and cost (e.g., oracle definition, execution time), we must strive to generate test cases whose execution traces optimize the localization ability of AFL.
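For concreteness, a minimal sketch (ours, not μTIL code) of how DBBs and their average size can be computed by grouping statements that share the same coverage vector:

```python
from collections import defaultdict

def dynamic_basic_blocks(coverage):
    """Group statements covered by exactly the same subset of the test suite.

    coverage[t][e] is 1 if test case t executes statement e, 0 otherwise."""
    groups = defaultdict(list)
    for e in range(len(coverage[0])):
        signature = tuple(row[e] for row in coverage)   # which tests execute statement e
        groups[signature].append(e)
    return list(groups.values())

def average_dbb_size(coverage):
    dbbs = dynamic_basic_blocks(coverage)
    return sum(len(b) for b in dbbs) / len(dbbs)
```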

B. Mutation testing

Mutation testing [11] is a well-known technique to generate test inputs having the capability to reveal faults in programs [26]. Given a program P, a mutation fault injected in P results in a program P′, called a mutant, that slightly differs from P through a single modification, e.g., a variable replacement or an operator replacement. For a given set of mutants, mutation testing requires generating a test set that kills all the mutants, i.e., a set of test inputs on which the execution of P′ produces either a result distinct from the execution of P (called strong mutation), or a distinct state just after the statement where the fault has been introduced (called weak mutation). Of course, the choice of good mutation operators is crucial to build a test suite able to reveal actual faults in P. A difficult problem in mutation testing lies in the automatic detection of equivalent mutants, i.e., mutants P′ that behave as P [17]. In general, equivalent mutants cannot be killed as there is no way to differentiate them from P through program execution. Generating test cases able to kill (non-equivalent) mutants is also considered a difficult problem [29]. In our work, we only use mutation testing to reduce the size of a test suite while maintaining its AFL capabilities, so we do not go into further detail on this classical software testing approach.
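As a small illustration (ours, written in Python rather than in the Java programs that μJava mutates), a relational-operator-replacement mutant of the pow example of Fig. 2 might look as follows; the test input (x=2, y=0) kills it under strong mutation because the two versions return different results:

```python
def pow_original(x, y):
    i, result = 0, 1
    p = -x if y < 0 else y
    while i < p:
        result *= x
        i += 1
    return 1 / result if y < 0 else result

def pow_mutant(x, y):
    i, result = 0, 1
    p = -x if y < 0 else y
    while i <= p:          # mutation: '<' replaced by '<='
        result *= x
        i += 1
    return 1 / result if y < 0 else result

assert pow_original(2, 0) != pow_mutant(2, 0)   # the mutant is killed by (x=2, y=0)
```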

III. OVERVIEW OF THE μTIL APPROACH

In this section, we give an overview of μTIL, a new statistical test inputs generation approach that we propose for maximizing the accuracy of AFL. We first present its architecture and then describe its functionalities through a user session.

Figure 3. The μTIL method (architecture diagram)

Fig. 3 depicts the architecture of μTIL with its four main components, represented with diamonds, namely TS Generator (the test inputs generator), Evaluator, μJava, and the User Interface. The μTIL method takes as inputs the source code of a program P and a set of failed execution traces. As output, it generates a set of test data that can be used to feed any AFL technique. The ultimate goal of μTIL is to generate a set of test data for P that maximizes the accuracy of the diagnosis, while keeping the set at a reasonable size. Note that, once test data have been generated by μTIL, the user or an external oracle procedure still has to provide test verdicts in order to perform AFL on P.

TS Generator generates statistical test inputs (without any verdict), according to specific probabilistic properties. This component is the most challenging part of the μTIL method; it is fully described in Section IV of the paper.

μJava is a tool¹ that automatically generates mutants for Java programs. We consider this tool as a black box used to generate mutants on demand. The interested reader can consult the details in [24] regarding the way mutants are generated and which mutation operators can be considered.

¹ http://www.cs.gmu.edu/~offutt/mujava/

Evaluator implements a procedure to evaluate the "quality" of a test set in terms of fault localization accuracy, as defined in Sec. II. It takes the mutant P′ and the program under test P as inputs, and computes test mutation verdicts. These verdicts are used to compute the diagnosis Expense of a set of test inputs. Depending on the value of a given threshold σ for the Expense, Evaluator decides whether the process should continue generating test inputs or should switch to another mutant program. This component has several parameters, such as the value of σ, the number of test inputs to consider for the evaluation, or the suspiciousness measure.

The User Interface allows users to monitor and control μTIL with a set of parameters that includes the number of mutants to generate, time-out and memory-out values, the threshold σ, the expected size of the final test suite, the suspiciousness measure, and so on.

A typical user session with μTIL starts with an initial statistical test inputs generation ((1) and (2) in Fig. 3) using TS Generator. This component can (optionally) be initialized with traces corresponding to one or several failed tests. Using a suspiciousness measure, the accuracy of the generated test inputs is evaluated on P′, a mutant of P generated by μJava. As said previously, the Evaluator computes a diagnosis (i.e., the Expense value) associated with the generated test inputs by using strong mutation between P and P′. This computation is possible in μTIL because the location of the mutation fault is precisely known. If the Expense value is greater than a given threshold σ, then the test inputs generator TS Generator is called again. This process iterates until the expected threshold is reached. Of course, as the time required by this process cannot be predicted in advance, time-out procedures are added to control it. When the Expense becomes less than or equal to σ, the set of test inputs is presented to the user, who can decide either to continue the statistical test input generation process ((3) in Fig. 3) or to stop it ((4) in Fig. 3). This process offers users several advantages, such as controlling the number of test inputs generated by μTIL as well as the time required for AFL, and fine-tuning the test generation parameters. μTIL can also be completely automated, without any user interaction, using default values for the parameters. In our experimental settings, reported in Sec. VII of the paper, we used these default values.
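The control loop described above can be summarized by the following sketch (ours; generate_inputs, make_mutant, run_strong_mutation and diagnosis_expense are hypothetical stand-ins for the TS Generator, μJava and Evaluator components, not the actual implementation):

```python
def mutil_session(program, failed_traces, sigma=0.2, n_mutants=20, batch=5):
    """Grow an AFL test suite until, for each mutant considered, the Expense of the
    diagnosis it induces drops below the threshold sigma."""
    suite = generate_inputs(program, batch, seeds=failed_traces)     # steps (1)-(2) of Fig. 3
    for m in range(n_mutants):
        mutant = make_mutant(program, m)                             # muJava, fault location known
        while True:
            verdicts = run_strong_mutation(program, mutant, suite)   # compare P and P' outputs
            if diagnosis_expense(suite, verdicts, mutant.fault_location) <= sigma:
                break                                                # accurate enough, next mutant
            suite += generate_inputs(program, batch)                 # step (3) of Fig. 3
    return suite
```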

IV. STATISTICAL TEST INPUTS GENERATION

This section explains how to generate statistical test inputs for AFL. The generator strives to ensure that:
• each feasible path of the program is activated with the same probability;
• the subdomain associated to each feasible path is uniformly exercised.

Recent test inputs generators such as PathCrawler [33], PEX [32], SAGE [13] or Euclide [15] exploit SMT solving or Constraint Programming approaches to generate test inputs and detect infeasible paths. For a given path, checking path feasibility only requires checking whether the path condition has a solution or not. Constraint propagation interleaved with value-labelling search, as proposed in the context of Constraint Programming [18], is well suited to address this problem. In the rest of this section, we briefly review how to check path feasibility with constraint propagation and search, and how to uniformly sample feasible paths.

A. Path feasibility

Roughly speaking, constraint propagation considers each constraint in isolation as a filter for the variation domains of its variables. Once a reduction is performed on the domain of a variable (by filtering out inconsistent values), constraint propagation awakes the other constraints that hold on the variable, in order to propagate the reduction. Technically, constraint propagation incrementally introduces constraints into a propagation queue. Then, an iterative algorithm handles the constraints of this queue one by one, filtering inconsistent values from the domains of their variables. When the variation domain of a variable is too large, filtering algorithms usually consider only the bounds of the domain for efficiency reasons, e.g., a domain {v1, v2, ..., vn−1, vn} is approximated by the range v1..vn. The algorithm iterates until the queue becomes empty, which corresponds to a state where no more pruning can be performed. When selected in the propagation queue, each constraint is also added into a constraint store, which memorizes all the considered constraints. The constraint store is contradictory if the domain of at least one variable becomes empty. However, constraint propagation alone does not guarantee satisfiability, as it only prunes the variation domains. To address this caveat, it is necessary to launch a (sometimes costly) value-labelling search process. Labelling consists in trying a value for a chosen variable and restarting constraint propagation to see whether the hypothesis is still consistent. If we get a contradiction, then the process backtracks to another value or variable to enumerate. This process is performed until all the domains have been explored. If there is a solution, this process will find it, while if the constraint system is unsatisfiable, this process will also detect it. Hence, it is a complete decision procedure. However, value-labelling search is exponential in the number of variables, so practical limitations must be addressed by time-out or memory-out interrupting procedures. Note that similar practical limitations exist for SMT solvers [25].

Path feasibility can be efficiently checked using constraint propagation, value-labelling search and symbolic execution. Symbolic execution is a well-known process that replaces the input parameters of the program by symbolic values and executes each statement of a given path, in order to build a conjunction of constraints that holds only on the input symbolic values (called path conditions). By calling a constraint-propagation-based solver on path conditions as described above, it becomes possible to check path feasibility.
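To illustrate, here is a small sketch (ours) that checks the feasibility of one path of the pow example and extracts one witness input; we use the z3 SMT solver purely for illustration, whereas μTIL itself relies on constraint propagation with clp(FD):

```python
from z3 import Int, Solver, sat

# Path of pow(x, y) taking the 'y < 0' branch and executing the loop body exactly once.
# Path condition: y < 0  and  0 < -x  and  1 >= -x
x, y = Int('x'), Int('y')
solver = Solver()
solver.add(y < 0, 0 < -x, 1 >= -x)

if solver.check() == sat:
    model = solver.model()
    print('feasible, e.g.', model[x], model[y])   # one concrete test input for this path
else:
    print('infeasible path')
```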

B. Uniform generation of feasible paths

Once infeasible paths are discarded, generating feasible paths uniformly requires giving the same probability to any path to be selected, as proposed in Alg. 1. The algorithm takes as inputs the program Program, the expected number of test inputs to generate Nb_DT, and k, the maximum number of loop iterations. The algorithm issues a random test set that guarantees the uniform selection of feasible paths in the program.

Algorithm 1: Uniform generation of feasible paths
Input: Program, Nb_DT, k
Output: DT_Seq

 1  DT_Seq ← ∅
 2  Paths ← extract_paths(Program, k)
 3  while |DT_Seq| < Nb_DT do
 4      Select p at random from Paths
 5      Get Path_Conditions from p
 6      if solutions(Path_Conditions) = ∅ then
 7          Remove p from Paths
 8      else
 9          DT ← one_solution(Path_Conditions)
10          Add DT to DT_Seq
11      end
12  end
13  return DT_Seq

The operation on line 5 computes the path conditions using symbolic execution. Line 6 of the algorithm checks whether Path_Conditions is satisfiable or not using a constraint-propagation-based solver, while line 9 uses the function one_solution to return a solution of Path_Conditions. This last operation is crucial to AFL, as described in the next section.
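A compact Python transcription of Algorithm 1 (ours; extract_paths, path_conditions and one_solution are hypothetical stand-ins for the symbolic-execution and constraint-solving machinery, and, following the optimization mentioned below, feasibility checking and sampling are merged into one_solution returning None for infeasible paths) could look as follows:

```python
import random

def uniform_feasible_paths(program, nb_dt, k):
    """Draw nb_dt test inputs such that every feasible path (loops unfolded at
    most k times) has the same probability of being exercised."""
    dt_seq = []
    paths = extract_paths(program, k)        # hypothetical: enumerate paths up to depth k
    while len(dt_seq) < nb_dt and paths:
        p = random.choice(paths)             # uniform choice among the remaining paths
        pc = path_conditions(p)              # hypothetical: symbolic execution of p
        dt = one_solution(pc)                # hypothetical: PRT sampling (Sec. V), None if infeasible
        if dt is None:
            paths.remove(p)                  # infeasible path: never consider it again
        else:
            dt_seq.append(dt)
    return dt_seq
```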

This algorithm can be optimized in several ways, but we chose to keep it as simple as possible for the sake of clarity. For instance, checking path feasibility (line 6) can be performed at the same time as finding a solution (line 9) in order to minimize the number of calls to the solver. Detecting infeasible paths incrementally and caching previously detected infeasible paths also helps avoid the exploration of useless paths.

Uniform selection of feasible paths gives any path of the program the same chance to contribute to the fault localization process. By doing so, this process maximizes the feasible-path coverage of the program through statistical test data generation. It should be noted that using a pure random generator over the input domain cannot yield the same result. Although any input point has the same chance² to be selected in random testing, there is no uniform selection of feasible paths. By selecting input values at random over the input space, only feasible paths are selected, but there is no reason for each path to be selected with the same probability. In the worst case, random testing may activate only the feasible path corresponding to the largest input subdomain, leading to the coverage of a single path of the program. As trace-based fault localization tries to identify faulty statements by crossing traces executing distinct statements, such a worst-case scenario would be catastrophic for AFL. In fact, every statement belonging to the covered feasible path would obtain exactly the same score.

Thus, building a statistical test data generator that gives the same chance to any feasible path yields optimal results in terms of feasible-path coverage, thus enabling better fault localization.

² In general, equiprobability can only be approximated on machines through the use of pseudo-random number generators.

V. PATH-ORIENTED RANDOM TESTING

In this section, we detail a process called Path-oriented Random Testing (PRT) [16], which generates test inputs having the property of uniformly activating a given path. This process is implemented within the function one_solution, shown in Alg. 1 and discussed above. The PRT process aims at selecting random solutions of a path condition, giving the same probability to each solution that can potentially be selected. Note that this is not a trivial problem, as witnessed by several authors [6], [10], [14].

In the PRT approach, we aim to perform uniform selection by using an over-approximation of the solution set and by rejecting spurious test data (test data that activate another path) [16]. The ultimate goal is to minimize the number of rejects in order to keep the cost at an acceptable level. Let us consider a simple example to illustrate the approach.

A. Illustrative example for PRT

Consider the constraint set {y ≥ 0, x ≤ 14, x ≥ y}, which corresponds to the triangle domain shown on the left in Fig. 4. Constraint propagation over these constraints gives D = (x ∈ 0..14, y ∈ 0..14), which corresponds to the larger rectangle domain of Fig. 4. A first way of building a uniform generator within the grey triangle domain is, first, to generate at random a point in the rectangle and, second, to check whether the constraints are satisfied or not. Note that drawing first a value v for x and propagating x = v is forbidden, as it leads to a non-uniform test inputs generator. Indeed, in the example of Fig. 4, the number of values for y that satisfy the constraints depends on v. For x = 0 there is a single value for y, while for x = 14 there are 15 possible values for y. Thus, the input (x = 0, y = 0) would be selected with probability 1 while the input (x = 14, y = 0) would be selected with probability 1/15. Building a uniform generator requires those probabilities to be equal. Generating at random an input within the rectangle and then checking whether it belongs to the triangle has the disadvantage of requiring many inputs to be discarded (about 1 over 2 in this example). A better approach, suggested in [16], consists, first, in dividing the input domain into equivalent subdomains (i.e., subdomains with a similar number of inputs), second, in discarding some of these subdomains by using constraint propagation, and third, in drawing at random an input from a remaining subdomain.

Consider an arbitrary division parameter equal to 4; then PRT divides the rectangle domain x ∈ 0..14, y ∈ 0..14 into 4² = 16 subdomains of equal area. As a result, we get the following 16 subdomains: D1 = (x ∈ 0..3, y ∈ 12..15), D2 = (x ∈ 0..3, y ∈ 8..11), ..., D16 = (x ∈ 12..15, y ∈ 0..3). The partition of the (slightly augmented) initial domain D′ = (x ∈ 0..15, y ∈ 0..15) is shown on the right-hand side of Fig. 4. With constraint propagation, subdomains D1, D2, D3, D5, D6 and D9 can be safely discarded (e.g., D1 = (x ∈ 0..3, y ∈ 12..15) does not intersect the triangle domain {y ≥ 0, x ≤ 14, x ≥ y}, and constraint propagation on x ≥ y and D1 exhibits a contradiction). As all the subdomains have the same area, we can still easily build a uniform test inputs generator for the resulting domain D′ = D4 ∪ D7 ∪ D8 ∪ D10 ∪ · · · ∪ D16. It suffices to draw at random a subdomain in D′ and then to draw at random a value in this subdomain. Note, however, that this process still computes an over-approximation of the set of solutions, and some points that do not satisfy the constraints can still be selected by the statistical test data generator. Hence, rejection is still necessary, but the number of points to reject has been drastically reduced, as about half of the initial domain has been discarded.

B. Uniform generation of test inputs

This section formalizes the reasoning presented in the previous section with the algorithm shown below. The algorithm draws at random a test input that satisfies Path_Conditions, the conjunction of constraints associated with a selected path. It takes as inputs a set of variables along with their variation domains, Path_Conditions, and div, a division parameter. It returns either a test datum that has been selected uniformly among all the solutions of Path_Conditions, or fail, an indication of the infeasibility of the corresponding path. First, the algorithm partitions the subdomain resulting from constraint propagation into div^n subdomains of equal area (divide function). Then, each subdomain Di in the partition is checked for unsatisfiability. This results in a list of subdomains (D′1, ..., D′p) where p ≤ div^n. Second, uniform selection of test data is realized by first picking a subdomain and then sampling a point inside this subdomain. If the selected point does not satisfy the path conditions, it is simply rejected. This process is repeated until a test input is found.

Algorithm 2: Uniform generation of a test input
Input: (x1, ..., xn), Path_Conditions, div
Output: a solution or fail

(D1, ..., D_{div^n}) ← divide((x1, ..., xn), div)
forall Di in (D1, ..., D_{div^n}) do
    if Di is inconsistent w.r.t. Path_Conditions then
        Remove Di from (D1, ..., D_{div^n})
    end
end
Let (D′1, ..., D′p) be the remaining subdomains
if p ≥ 1 then
    while true do
        Pick D uniformly at random from (D′1, ..., D′p)
        Pick t uniformly at random from D
        if t satisfies Path_Conditions then
            return t
        end
    end
end
return fail

Uniformly selecting test inputs that activate a given path ensures that the best results will be obtained in terms of fault localization. This is based on the following reasoning. First, the accuracy of any AFL ranking algorithm depends on the accuracy of the estimates of the suspiciousness scores [2], [20], [23]; so, the better the scores, the better the diagnosis. Second, uniform random selection of inputs over the subdomain corresponding to a given feasible path maximizes the chances of drawing low-probability inputs leading to faults that can only be revealed on this feasible path. This statement relies on the well-known fact that maximizing the probability of the least likely element is obtained by giving the same probability to all elements [31]. Hence, this process ensures that even the least likely fault-revealing inputs are given the same probability of being drawn as other inputs, leading to an increase in the accuracy of suspiciousness scores.

To state this more clearly, suppose that only one test input is selected for a given feasible path, and let this input have a successful verdict even though there exist fault-revealing inputs activating this path. Then, any suspiciousness score (e.g., Tarantula, Jaccard, Ochiai) will be computed by considering this feasible path to be fault-free, which is obviously incorrect and will artificially decrease the suspiciousness of the statements in that path. Uniform coverage of a feasible path subdomain helps maximize the probability of drawing the less likely fault-revealing inputs.


Figure 4. The triangle domain example (left: the triangle domain {y ≥ 0, x ≤ 14, x ≥ y} within the box x ∈ 0..14, y ∈ 0..14; right: the partition of the augmented box x ∈ 0..15, y ∈ 0..15 into the 16 equal-area subdomains D1, ..., D16)

By combining uniform selection of feasible paths with uniform selection of inputs in their subdomains, we devise a statistical test input generator that maximizes the accuracy of AFL suspiciousness scores. In addition, such a generator ensures that no feasible path is executed with too small a number of test cases and thus enables the computation of suspiciousness scores based on a reasonably large number of test executions. This helps improve the scores and ultimately the accuracy of the ranking. In practice, the magnitude of this improvement is expected to be larger when the latent faults are located in feasible paths that are unlikely to be covered by random testing.

VI. IMPLEMENTATION

We implemented the approach presented in this paper in a tool also called μTIL. The tool generates statistical test inputs for Automatic Fault Localization (AFL) on Java source code. We chose Java in order to target a real-world programming language to demonstrate our ideas, but many important features of the language have not yet been taken into account. For example, unstructured control flow such as break and continue, native methods, and object orientation through overloading, inheritance or virtual method calls are not currently handled by μTIL. Given a Java class and a method, μTIL generates a set of statistical test inputs that is fine-tuned for AFL. The interest of the approach is twofold: the generated test suite respects the probabilistic properties discussed in the previous sections, and mutation testing is used to improve the AFL capabilities of μTIL.

The μTIL tool is composed of a source code parser and analyzer, a constraint generator and solver, a statistical test input generator, and modules to call μJava [24] and to evaluate the diagnosis Expense through source code instrumentation. The module dedicated to source code analysis includes a complete Java 1.5 parser that builds a symbol table and an abstract syntax tree. From this tree, the constraint generation module derives a constraint model of the Java method under test. The constraint solving module is used to prune the search space in various places and, finally, the statistical test data generator provides solutions of path conditions.

Initiated with failed traces, a first set of test inputs for AFL is generated by the statistical test inputs generator. This first test suite is called the basis and is composed of N test inputs, where N is much smaller than the expected final length M of the test suite (by default, we take N = 20% of M). Using mutants from μJava, an evaluator computes the Expense of the diagnosis. As said previously, knowing where the fault is located in the mutant program allows us to establish test verdicts and then compute the diagnosis Expense. If the Expense value is less than the selected threshold σ (in practice, σ = 0.2), meaning that the AFL capabilities of the test suite are sufficient for this mutant, then the process switches to another mutant. On the contrary, if the Expense is greater than σ, then the test suite is augmented with new statistically generated test inputs until the Expense reaches σ. Of course, additional mechanisms control this iterative process to prevent the tool from falling into unbounded loops or simply too costly computations. By implementing this process, μTIL bends its test input generation to minimize the length of the test suite, while maintaining its localization accuracy.

μTIL is implemented in 4.5 KLOC of SICStus Prolog.

The tool calls the clp(FD) [7] library to perform finite-domain constraint solving, the PCC(FD) [28] library for probabilistic constraint operators, and PRT [16]. As mentioned earlier, exploring all the paths of a program to check feasibility is not necessary, and using a probabilistic constraint model of the program allows us to avoid many spurious calls to the constraint solver. The user interacts with μTIL through an interface written in Python. Statistical test inputs generation in μTIL is parameterized through a predicate that takes optional arguments constraining the maximum number of paths to be selected, the number of test inputs to be generated, k (a parameter that bounds loop unfolding), div (the division parameter), and time-out and memory-out limits. μTIL also allows limiting the number of test inputs to generate at each iteration and the number of mutants to consider, as well as setting the threshold σ and the AFL technique to use (i.e., Tarantula, Jaccard or Ochiai).

Figure 5. Average DBB size for ART 1-2, μTIL 0-1 and RT

VII. EXPERIMENTAL RESULTS

We evaluated μTIL on a small but realistic program extracted from the SIR repository [12], called tcas. The source code modestly contains 173 lines of C code, but it features nested conditionals, logical operators, type definitions, and function calls. We translated this version into Java in order to apply μTIL.

Because our goal is to assess the benefits of using μTIL, our experiment addresses the following two research questions:

Q1. Is the statistical test suite generated by μTIL more accurate for AFL than Random Testing and Adaptive Random Testing?

Q2. Does the mutation-based learning process of μTIL enable a significant reduction of the test suite length without decreasing its AFL capabilities?

In order to answer these questions, we defined an experimental process involving distinct test suite generation procedures:

1) Random Testing (RT). We implemented a uniform (pseudo-)random test input generator using the Python random library. For each input variable, we defined a range corresponding to its possible values and uniformly generated 50 sequences of 100 tuples of values within those ranges;

2) Adaptive Random Testing (ART). As a variation of RT, we implemented an ART test input generator and considered two distinct types of ART test suites. Starting with a random test input t, ART randomly picks a set of W candidate test inputs, evaluates the distance between t and the W candidates, and then selects the furthest candidate as a new test input (a sketch of this selection scheme is given after this list). The process iterates until a given number of test inputs is reached. We generated 50 ART test suites containing 100 test inputs each, with two possible values for W: 10 (called ART) and 20 (called ART2);

3) μTIL with no mutant (μTIL-0). We built a version of μTIL where no mutant is used to reduce the length of the test suite. In this version, the statistical generator ensures that each feasible path is exercised by a reasonably large and uniformly distributed test set. We used this generator to compute 20 test suites of length 100;

4) μTIL with mutants (μTIL-1). In this version of μTIL, test inputs are incrementally selected using mutation testing to reduce the length of the test suites, while retaining the diagnosis capabilities of μTIL. The resulting test suites contain 25 test inputs instead of 100, and we measure the Expense of the fault diagnosis to determine whether it is significantly different from μTIL-0. For each considered mutant (up to 20), test suites targeting an Expense value lower than 0.2 were generated.
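For reference, a compact sketch of the adaptive random selection scheme (ours, using the standard max-min-distance formulation of ART, which may differ in detail from the variant implemented by the authors; art_suite is our name) could look as follows:

```python
import math
import random

def art_suite(ranges, size, w, rng=random):
    """Adaptive Random Testing: grow a suite by repeatedly generating w random
    candidates and keeping the one farthest (max-min Euclidean distance)
    from the test inputs already selected.

    ranges is a list of (low, high) bounds, one per input variable."""
    def rand_input():
        return tuple(rng.randint(lo, hi) for lo, hi in ranges)

    suite = [rand_input()]                      # start from one random test input
    while len(suite) < size:
        candidates = [rand_input() for _ in range(w)]
        best = max(candidates, key=lambda c: min(math.dist(c, t) for t in suite))
        suite.append(best)
    return suite

# Example: 100 test inputs over two integer inputs in 0..100, with W = 10.
print(len(art_suite([(0, 100), (0, 100)], size=100, w=10)))
```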

The reason why we generated 50 test suites of length 100 for RT and ART, and 20 for μTIL-0, is that our analysis needs to account for the randomness in their Expense and DBB results. In other words, the use of these test techniques leads to accuracy distributions that need to be compared through statistical means. We considered for our experiments 10 faulty program variants of tcas. These variants were generated by manually injecting a combination of at least two faults within two distinct statements. Hence, these variants do not correspond to possible mutants of the original program. We initially fed each of them with a failed trace, resulting from the failure detection phase. The reason is that one would not start debugging without first having detected a failure, and we therefore needed to ensure that every test suite, regardless of the test technique employed and program variant, led to at least one failing test execution. Note that we ran the above experimental process with larger test suites (i.e., 250 and 1000 instead of 100) to determine whether test suite length had any effect, but obtained very similar results, which are therefore not reported here.

Using the three AFL techniques considered in this paper, namely Tarantula, Jaccard and Ochiai, we computed the Expense value of the diagnosis obtained when running all the test suites on each of the program variants. We obtained the results shown in Fig. 6, where observations (a test suite executed on a variant), boxplots (showing medians, minimums and maximums, and 25/75 percent quartiles), and means are depicted for each test technique. In addition, as a further measure of accuracy, we also computed the mean DBB size obtained for the localization. We obtained the results shown in Fig. 5, once again showing boxplots and mean lines, but this time for the DBB mean obtained for each test suite run on each program variant.

Regarding research question Q1, our results show that both versions of μTIL outperform RT and ART for our two accuracy measures. The differences are statistically significant when comparing both Expense (p-value < 0.0001) and DBB means (p-value < 0.0001) using a Wilcoxon test to compare independent samples. This test is non-parametric and does not make any distributional assumption [30].

Figure 6. Expense for the three diagnoses (Tarantula, Jaccard, Ochiai) using test suites generated with ART 1-2, μTIL 0-1, and RT

With respect to research question Q2, our results show that the μTIL-1 version, though generating much smaller test suites (of length 25 instead of 100), retains similar accuracy. From the figures above, it is indeed clear that the medians and averages of μTIL-0 and μTIL-1 are very close for both measures of accuracy.

VIII. CONDITIONS OF APPLICATION

Several conditions need to be met for the μTIL method to be beneficial within a testing and debugging process. First, the overall approach is particularly useful when, after the failure detection phase, the cost of debugging is so high that additional test executions are worthwhile. In addition, given that μTIL-0 is a complex statistical test data generation technique (which exploits constraint solving, infeasible path detection, and uniform sampling over a subdomain), using μTIL-0 instead of a simpler (adaptive) random testing technique is beneficial whenever program execution is much more costly than test generation in terms of resource usage (e.g., limited access time to a test lab, manual effort setting up hardware in embedded systems), CPU time, or memory consumption. Generating a μTIL-0 test suite is definitely more costly than RT or ART, but if the test generation cost, using our constraint solving approach, is significantly lower than any relevant aspect of the test execution cost, then our results comparing μTIL-0, RT and ART using test suites of the same length offer valuable insights.

Furthermore, since μTIL-1 performs mutation testing and Expense computation by executing the mutants, μTIL-1 is only an option if the AFL test suite generation can be performed within a simulation environment. It is particularly beneficial if the cost of test execution on the target platform must be reduced. In the context of embedded systems, executing a program in a simulation environment is easier because there is less costly communication (cross-compilation, bus data exchange) between a host and its target machines. Therefore, using μTIL-1 should be reserved for situations where the program execution can be simulated at lower cost. Since the μTIL-1 method enables a significant reduction in test suite length, it is useful to alleviate the cost of AFL when it is performed on the actual program under test running on the target machine instead of in a simulation environment.

To sum up, we argue that the μTIL-1 approach is of practical interest in testing contexts where:

• program or system executions on the target platform are costly in terms of resources, CPU time or memory;
• a simulation environment is available for constructing the AFL test suite.

In practice, such conditions are in no way exceptional, as exemplified by the many cross-testing processes for embedded software (e.g., critical on-board computing systems, smart cards, mobile phone applications). However, before extrapolating the applicability of μTIL to those contexts, larger experiments and industrial case studies need to be performed.

IX. CONCLUSION

In this paper, we proposed μTIL, a new statistical test input generation approach based on constraint solving and mutation testing. Its objective is to generate a random test suite with the following two properties: 1) each feasible path of the program is activated with the same probability; 2) the subdomain associated to each feasible path is uniformly covered. These test suite properties help improve the fault localization accuracy of AFL techniques, thus leading to more cost-effective debugging. In addition, using mutation testing, the statistical generator can reduce the length of the test suite while retaining the accuracy of the diagnosis. This is the first time that a mutation-based statistical test input generator has been built for minimizing a test suite used to localize faults. Our μTIL method has been implemented in a proof-of-concept tool for Java and, although our experiment is modest in size, it shows without ambiguity that 1) μTIL generates test suites with higher AFL accuracy than ART and RT; and 2) μTIL is able to significantly reduce the length of an AFL test suite, while retaining the original AFL accuracy. Though μTIL is not applicable in all situations, we identified software testing contexts where it could be of practical interest, for example when program executions on the target platform are costly and a simulated execution environment is available. Embedded software development where cross-testing is performed meets these conditions.

In the short term, we plan to perform more extensive experiments to fine-tune the μTIL parameters, such as the number of mutants, the value of the Expense threshold, or the number of test inputs to consider for each feasible path. In the longer term, we plan to apply μTIL in the fault localization process of industrial embedded software development and assess it in realistic conditions.

REFERENCES

[1] R. Abreu, P. Zoeteweij, and A. J. C. van Gemund. On the accuracy of spectrum-based fault localization. In TAIC-PART'07: Testing: Academic and Industrial Conference, Practice and Research Techniques, 2007.

[2] Rui Abreu, Peter Zoeteweij, and Arjan J. C. van Gemund. An evaluation of similarity coefficients for software fault localization. In Proc. of the 12th Pacific Rim Int. Symp. on Dependable Computing, PRDC'06, pages 39–46, 2006.

[3] Shaimaa Ali, James H. Andrews, Tamilselvi Dhandapani, and Wantao Wang. Evaluating the accuracy of fault localization techniques. In Proc. of the 2009 IEEE/ACM Int. Conf. on Automated Soft. Eng., ASE'09, pages 76–87. IEEE Computer Society, 2009.

[4] S. Artzi, J. Dolby, F. Tip, and M. Pistoia. Directed test generation for effective fault localization. In Proc. of the 2010 Int. Symp. on Soft. Testing and Analysis, ISSTA'10, pages 49–60, 2010.

[5] B. Baudry, F. Fleurey, and Y. Le Traon. Improving test suites for efficient fault localization. In Proceedings of the International Conference on Software Engineering, pages 20–28, Shanghai, China, May 2006.

[6] E. Bin, R. Emek, G. Shurek, and A. Ziv. Using a constraint satisfaction formulation and solution techniques for random test program generation. IBM Systems Journal, 41(3), 2002.

[7] M. Carlsson, G. Ottosson, and B. Carlson. An open-ended finite domain constraint solver. In Proc. of Programming Languages: Implementations, Logics, and Programs, 1997.

[8] Tsong Yueh Chen, Hing Leung, and I. K. Mak. Adaptive random testing. In ASIAN'04, pages 320–329, 2004.

[9] Holger Cleve and Andreas Zeller. Locating causes of program failures. In Proceedings of the 27th International Conference on Software Engineering, ICSE'05, pages 342–351, 2005.

[10] Rina Dechter, Kalev Kask, Eyal Bin, and Roy Emek. Generating random solutions for constraint satisfaction problems. In Eighteenth National Conference on Artificial Intelligence, pages 15–21, Menlo Park, CA, USA, 2002. American Association for Artificial Intelligence.

[11] R. A. DeMillo and J. A. Offutt. Constraint-based automatic test data generation. IEEE Transactions on Software Engineering, 17(9):900–910, September 1991.

[12] Hyunsook Do, Sebastian G. Elbaum, and Gregg Rothermel. Supporting controlled experimentation with testing techniques: An infrastructure and its potential impact. Empirical Software Engineering: An International Journal, 10(4):405–435, 2005.

[13] Patrice Godefroid, Michael Y. Levin, and David A. Molnar. Automated whitebox fuzz testing. In NDSS'08: Network and Distributed System Security Symposium. The Internet Society, 2008.

[14] V. Gogate and R. Dechter. A new algorithm for sampling CSP solutions uniformly at random. In Principles and Practice of Constraint Programming (CP'06), volume 4204 of LNCS, pages 711–715, 2006.

[15] A. Gotlieb. Euclide: A constraint-based testing platform for critical C programs. In 2nd IEEE International Conference on Software Testing, Validation and Verification (ICST'09), Denver, CO, Apr. 2009.

[16] A. Gotlieb and M. Petit. A uniform random test data generator for path testing. The Journal of Systems and Software, 83(12):2618–2626, Dec. 2010.

[17] B. J. M. Gruen, D. Schuler, and A. Zeller. The impact of equivalent mutants. In Mutation'09: Proc. of the 3rd Int. Workshop on Mutation Analysis, pages 192–199, Apr. 2009.

[18] P. V. Hentenryck, V. Saraswat, and Y. Deville. Design, implementation, and evaluation of the constraint language cc(FD). Journal of Logic Programming, 37:139–164, 1998.

[19] Dennis Jeffrey, Neelam Gupta, and Rajiv Gupta. Fault localization using value replacement. In Proc. of the 2008 Int. Symp. on Soft. Testing and Analysis, ISSTA'08, pages 167–178, 2008.

[20] Bo Jiang, W. K. Chan, and T. H. Tse. On practical adequate test suites for integrated test case prioritization and fault localization, 2011.

[21] James A. Jones and Mary Jean Harrold. Empirical evaluation of the Tarantula automatic fault-localization technique. In ASE'05: Proc. of the 20th IEEE/ACM Int. Conf. on Automated Software Engineering, pages 273–282, 2005.

[22] James A. Jones, Mary Jean Harrold, and John Stasko. Visualization of test information to assist fault localization. In ICSE'02: Proc. of the 24th Int. Conf. on Soft. Eng., pages 467–477, 2002.

[23] Lucia, D. Lo, L. Jiang, and A. Budi. Comprehensive evaluation of association measures for fault localization. In IEEE Int. Conf. on Software Maintenance (ICSM'10), pages 1–10, Timisoara, 2010.

[24] Yu-Seung Ma, Jeff Offutt, and Yong Rae Kwon. MuJava: an automated class mutation system. Softw. Test. Verif. Reliab., 15:97–133, June 2005.

[25] R. Nieuwenhuis, A. Oliveras, E. Rodríguez-Carbonell, and A. Rubio. Challenges in satisfiability modulo theories (invited paper). In 18th Int. Conf. on Rewriting Techniques and Applications (RTA'07), LNCS 4533, pages 2–18, Jun. 2007.

[26] A. Jefferson Offutt and Ronald H. Untch. Mutation 2000: uniting the orthogonal. In Mutation Testing for the New Century, pages 34–44. Kluwer Academic Publishers, 2001.

[27] C. Parnin and A. Orso. Are automated debugging techniques actually helping programmers? In Proc. of the Int. Symp. on Software Testing and Analysis (ISSTA 2011), pages 199–209, Toronto, Canada, July 2011.

[28] M. Petit and A. Gotlieb. Boosting probabilistic choice operators. In Proceedings of Principles and Practice of Constraint Programming, Springer Verlag, LNCS 4741, pages 559–573, Providence, USA, September 2007.

[29] H. Riener, R. Bloem, and G. Fey. Test case generation from mutants using model checking techniques. Pages 388–397, 2011.

[30] D. J. Sheskin. Handbook of Parametric and Nonparametric Statistical Procedures. Chapman and Hall/CRC, 2007.

[31] P. Thévenod-Fosse and H. Waeselynck. An investigation of statistical software testing. Journal of Software Testing, Verification and Reliability, 1(2):5–25, July 1991.

[32] N. Tillmann and J. de Halleux. Pex: White box test generation for .NET. In Proc. of the 2nd Int. Conf. on Tests and Proofs, LNCS 4966, pages 134–153, 2008.

[33] N. Williams, B. Marre, P. Mouy, and M. Roger. PathCrawler: Automatic generation of path tests by combining static and dynamic analysis. In Proc. Dependable Computing, EDCC'05, 2005.

[34] W. E. Wong, Yu Qi, Lei Zhao, and Kai-Yuan Cai. Effective fault localization using code coverage. In COMPSAC'07: Computer Software and Applications Conference, pages 449–456, 2007.

[35] Y. Yu, J. A. Jones, and M. J. Harrold. An empirical study of the effects of test-suite reduction on fault localization. In Proceedings of the Int. Conf. on Software Engineering (ICSE'08), pages 201–210, Leipzig, Germany, May 2008.
