Studying Evolving Software Ecosystems inspired by ecological models
Tom Mens, Maelick Claes, Service de Génie Logiciel
Philippe Grosjean, Service d’écologie numérique des milieux aquatiques
[email protected]/genlog/projects/ecos
9 July 2013 - SATToSE, Bern
Collaborators
?
Long-term goals
• Determine the main factors that drive the success or failure of OSS projects within their ecosystem
• Investigate new techniques and mechanisms to predict and improve the survivability of OSS projects
  – Inspired by research in biological ecology
• Use these insights to help
  – the developer community to improve upon their practices
  – companies and users to compare and adopt OSS projects
Standing on the shoulders of giants
Lamarck
Lotka
Volterra
Darwin
Pareto
Terminology: Biological ecosystem
Definitions
• Ecology: the scientific study of the interactions that determine the distribution and abundance of organisms
• Ecosystem: the physical and biological components of an environment considered in relation to each other as a unit
  – combines all living organisms (plants, animals, micro-organisms) and physical components (light, water, soil, rocks, minerals)
Example: coral reefs
• High biodiversity: polyps, sea anemones, fish, mollusks, sponges, algae
Terminology: Software ecosystem
Business-oriented view
• “a set of actors functioning as a unit and interacting with a shared market for software and services, together with the relationships among them.” (Jansen et al. 2009)
Examples
• Eclipse
• Android and iOS app stores
Software ecosystem
Development-centric view
• “a collection of software products that have some given degree of symbiotic relationships.” (Messerschmitt & Szyperski 2003)
• “a collection of software projects that are developed and evolve together in the same environment.” (Lungu 2008)
Examples
• Gnome, KDE
• Debian, Ubuntu
• R’s CRAN
• Apache
Comparison
Biological evolution
AND BY A DUMMY
Ecological theories of evolution of species
• Lamarckism
  – animal organs and behaviour can change according to the way they are used
  – those characteristics can be transmitted from one generation to the next to reach a greater level of perfection
• Example
  – giraffes’ necks have become longer while trying to reach the upper leaves of a tree
Jean-Baptiste Lamarck (1744–1829)
Ecological theories of evolution of species
• Darwinism
  – all species of life have descended over time from common ancestors
  – this branching pattern of evolution resulted from natural selection, similar to artificial selection in selective breeding
• Example
  – 13 types of Galapagos finches, same habits and characteristics, but different beaks
Charles Darwin (1809–1882)
Ecological theories of evolution of species
Hologenome theory
• The unit of natural selection is the holobiont: the organism together with its associated microbial communities, which live together in symbiosis.
• The holobiont can adapt to changing environmental conditions far more rapidly than by genetic mutation and selection alone.
Competition vs cooperation
• While Darwin’s theory emphasises competition (survival of the fittest), hologenome theory also includes cooperation (through symbiosis)
Evolution History: Ecology
Darwin (1837)
• The evolution history of species can be represented by a phylogenetic tree.
• The tree describes the evolutionary relationships among species, assuming that they share a common ancestor.
Evolution History: Ecology
Reticulate evolution
• Unlike in the Darwinian model, evolution history is represented using a graph structure
• When reticulation of species occurs, two or more evolutionary lineages are combined at some level of biological organization.
• Causes
  – hybrid speciation (two lineages recombine to create a new one)
  – horizontal gene transfer (genes are transferred across species)
Evolution History: Software
Trophic web (food chain) in natural ecosystems
Trophic web in software ecosystems
Producer-consumer relation
Onion model
• Users
• Peripheral developers
• Core developers
TOP-DOWN: change requests & bug reports
BOTTOM-UP: changes in core projects and architecture
Core Architecture - or Why developers are polyps
Coral reef ecosystem
• Scleractinian coral polyps are responsible for creating the coral reef structure
• This coral reef is required for the other species of the ecosystem to thrive.
Software ecosystem
• Core developers are responsible for creating the core software architecture
• Based on this core architecture, other developers and third parties can create other projects, services, and so on.
Ecosystem Dynamics
Predator-prey relationship
• An instance of the consumer-resource relationship
• Predators (hunting animals) feed upon their prey (attacked animals)
Dynamic model
• Two mutually dependent parametric differential equations (Lotka-Volterra 1925/1926)
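The Lotka-Volterra model mentioned above is usually written as dx/dt = αx − βxy for the prey x and dy/dt = δxy − γy for the predators y. The sketch below integrates these two equations with a plain explicit Euler scheme. The parameter values, the step size and the function names are illustrative assumptions, not values from the talk; a production analysis would more likely use a dedicated ODE solver.

```python
def lotka_volterra_step(prey, predators, alpha, beta, delta, gamma, dt):
    """Advance both populations by one explicit Euler step of size dt."""
    # Prey grow at rate alpha and are eaten at rate beta * predators.
    d_prey = alpha * prey - beta * prey * predators
    # Predators grow by feeding (delta * prey) and die at rate gamma.
    d_predators = delta * prey * predators - gamma * predators
    return prey + dt * d_prey, predators + dt * d_predators


def simulate(prey0, predators0, alpha=1.0, beta=0.1, delta=0.075, gamma=1.5,
             dt=0.001, steps=20000):
    """Return the list of (prey, predators) states visited by the simulation.

    All default parameter values are hypothetical and chosen for illustration.
    """
    trajectory = [(prey0, predators0)]
    prey, predators = prey0, predators0
    for _ in range(steps):
        prey, predators = lotka_volterra_step(
            prey, predators, alpha, beta, delta, gamma, dt)
        trajectory.append((prey, predators))
    return trajectory


if __name__ == "__main__":
    # Print every 2000th state to see the populations oscillate.
    for prey, predators in simulate(10.0, 5.0)[::2000]:
        print(f"prey={prey:8.3f}  predators={predators:8.3f}")
```

The point x = γ/δ, y = α/β is an equilibrium of the model: both derivatives vanish there, so a step from that point leaves the populations unchanged.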
Software Ecosystem Dynamics
Analogies based on predator-prey relationship
• Debuggers are predators, software defects are prey [Calzolari et al. “Maintenance and testing effort modeled by linear and nonlinear dynamic systems,” Information and Software Technology, 43(8): 477–486, 2001]
• Developers are predators, the information they seek is prey [Lawrance et al. Scents in programs: Does information foraging theory apply to program maintenance? VL/HCC 2007, pp. 15–22]
Software Ecosystem Dynamics
Analogies based on predator-prey relationship
• Dual views in a software ecosystem
  – Developers are predators, the projects they work on are prey
  – Projects are predators that feed upon the cognitive resources of their developers
Bipartite developer-project graph
project 1
project 2
project 3
Other desirable ecosystem characteristics
• Stability: the capacity to maintain an equilibrium over longer periods of time
• Resistance: the ability to withstand environmental changes without too much disturbance of its biological communities
• Resilience: the ability to return to an equilibrium after a disturbance
• Higher biodiversity favours these characteristics
Reference shown on slide: Uzma Raja and Marietta J. Tretter, “Defining and Evaluating a Measure of Open Source Project Survivability,” IEEE Transactions on Software Engineering, vol. 38, no. 1, January/February 2012, p. 163. Digital Object Identifier 10.1109/TSE.2011.39.
Abstract: In this paper, we define and validate a new multidimensional measure of Open Source Software (OSS) project survivability, called Project Viability. Project viability has three dimensions: vigor, resilience, and organization. We define each of these dimensions and formulate an index called the Viability Index (VI) to combine all three dimensions. Archival data of projects hosted at SourceForge.net are used for the empirical validation of the measure. An Analysis Sample (n = 136) is used to assign weights to each dimension of project viability and to determine a suitable cut-off point for VI. Cross-validation of the measure is performed on a hold-out Validation Sample (n = 96). We demonstrate that project viability is a robust and valid measure of OSS project survivability that can be used to predict the failure or survival of an OSS project accurately. It is a tangible measure that can be used by organizations to compare various OSS projects and to make informed decisions regarding investment in the OSS domain.
Measuring diversity
Ecosystem Biodiversity
• Biodiversity: the degree of variation of species within a given ecosystem
• Based on Shannon’s notion of entropy and the 2nd law of thermodynamics
Species diversity
• X = set of n distinct species x_i
• p(x_i) = proportion of all individuals that belong to species x_i
$$H(X) = -\sum_{i=1}^{n} p(x_i) \ln p(x_i)$$
• Interpretation
  – Maximum diversity if all species have the same number of individuals
  – Low diversity if a particular species dominates the others
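As a small illustration of the entropy formula H(X) = −Σ p(x_i) ln p(x_i), the sketch below computes the diversity of a community from the number of individuals per species. The function name and the example counts are assumptions made for this illustration only.

```python
import math


def shannon_diversity(counts):
    """Shannon entropy H(X), in nats, of a list of per-species counts."""
    total = sum(counts)
    if total == 0:
        return 0.0
    entropy = 0.0
    for count in counts:
        if count > 0:  # species with no individuals contribute nothing
            p = count / total
            entropy -= p * math.log(p)
    return entropy


if __name__ == "__main__":
    even = [25, 25, 25, 25]      # all species equally abundant
    dominated = [97, 1, 1, 1]    # one species dominates the others
    print(shannon_diversity(even), shannon_diversity(dominated))
```

With n species the maximum value is ln n, reached when all species have the same number of individuals; the value approaches 0 as a single species dominates.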
Evolution of diversity in open source software
Econometric indices
• Gini and Theil are measures of inequality in a distribution
Reference shown on slide: Rajesh Vasa, Markus Lumpe, Philip Branch and Oscar Nierstrasz, “Comparative Analysis of Evolving Software Systems Using the Gini Coefficient,” Proc. ICSM 2009, Edmonton, Canada (pages 179 and 184 shown).
Abstract: Software metrics offer us the promise of distilling useful information from vast amounts of software in order to track development progress, to gain insights into the nature of the software, and to identify potential problems. Unfortunately, however, many software metrics exhibit highly skewed, non-Gaussian distributions. As a consequence, usual ways of interpreting these metrics (for example, in terms of “average” values) can be highly misleading. Many metrics, it turns out, are distributed like wealth, with high concentrations of values in selected locations. We propose to analyze software metrics using the Gini coefficient, a higher-order statistic widely used in economics to study the distribution of wealth. Our approach allows us not only to observe changes in software systems efficiently, but also to assess project risks and monitor the development process itself. We apply the Gini coefficient to numerous metrics over a range of software projects, and we show that many metrics not only display remarkably high Gini values, but that these values are remarkably consistent as a project evolves over time.
Figures visible in the excerpt: Figure 3, box plot of Gini coefficients across all analyzed systems; Figure 4, selected Gini profiles in Spring.
Gini = A/(A+B)
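The ratio Gini = A/(A+B) is conventionally read on a Lorenz curve: A is the area between the line of perfect equality and the Lorenz curve, and B is the area under the Lorenz curve, so that A + B = 1/2. The sketch below follows that reading and approximates B with the trapezoid rule. The slide does not define A and B, so this interpretation, like the function name, is an assumption.

```python
def gini(values):
    """Gini coefficient A / (A + B) of a list of non-negative values."""
    ordered = sorted(values)
    n = len(ordered)
    total = sum(ordered)
    if n == 0 or total == 0:
        return 0.0
    cumulative_share = 0.0
    area_b = 0.0  # area under the Lorenz curve
    for value in ordered:
        previous = cumulative_share
        cumulative_share += value / total
        # Trapezoid between two consecutive points of the Lorenz curve.
        area_b += (previous + cumulative_share) / (2 * n)
    area_a = 0.5 - area_b  # area between the equality line and the curve
    return area_a / (area_a + area_b)


if __name__ == "__main__":
    print(gini([5, 5, 5, 5]))    # evenly distributed activity
    print(gini([0, 0, 0, 10]))   # one contributor does all the work
```

A value of 0 means a perfectly even distribution; for n values the maximum of this discrete estimate is (n − 1)/n, reached when a single element holds everything.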
Evolution of diversity in open source software
Econometric indices
Commits sent, e-mails sent, bug reports modified
Evince
• Theil index: corresponds to Shannon’s notion of entropy
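To make the link between the Theil index and Shannon entropy concrete, here is a minimal sketch. It assumes the common Theil T form, T = (1/n) Σ (x_i/μ) ln(x_i/μ) with μ the mean of the values, which equals ln n − H when H is the Shannon entropy of the shares x_i/Σx. The slides do not state which variant was used, so the choice of this form and the function name are assumptions.

```python
import math


def theil(values):
    """Theil T index of a list of non-negative values (0 = perfect equality)."""
    n = len(values)
    if n == 0:
        return 0.0
    mean = sum(values) / n
    if mean == 0:
        return 0.0
    total = 0.0
    for value in values:
        if value > 0:  # the limit of r * ln(r) as r -> 0 is 0
            ratio = value / mean
            total += ratio * math.log(ratio)
    return total / n


if __name__ == "__main__":
    print(theil([3, 3, 3]))      # equal activity for everyone
    print(theil([0, 0, 0, 8]))   # all activity concentrated in one person
```

The index ranges from 0 (everyone equally active) to ln n (a single person carries out all the activity).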
Reference shown on slide: Mathieu Goeminne and Tom Mens, “Evidence for the Pareto principle in Open Source Software Activity,” Institut d’Informatique, Faculté des Sciences, Université de Mons (UMONS), Mons, Belgium.
Abstract: Numerous empirical studies analyse evolving open source software (OSS) projects, and try to estimate the activity and effort in these projects. Most of these studies, however, only focus on a limited set of artefacts, being source code and defect data. In our research, we extend the analysis by also taking into account mailing list information. The main goal of this article is to find evidence for the Pareto principle in this context, by studying how the activity of developers and users involved in OSS projects is distributed: it appears that most of the activity is carried out by a small group of people. Following the GQM paradigm, we provide evidence for this principle. We selected a range of metrics used in economy to measure inequality in distribution of wealth, and adapted these metrics to assess how OSS project activity is distributed. Regardless of whether we analyse version repositories, bug trackers, or mailing lists, and for all three projects we studied, it turns out that the distribution of activity is highly imbalanced.
Research questions listed in the excerpt:
1) Is there a core group of OSS project members (developers and/or users) that are significantly more active than the other members?
2) How does the distribution of activity within an OSS community evolve over time?
3) Is there an overlap between the different types of activity (e.g., committing, mailing, submitting and changing bug reports) the community members contribute to?
4) How does the distribution of activity vary across different OSS projects?
Brasero
Software Ecosystem Biodiversity
• Uses the notion of biodiversity to measure developer activity focus and module activity focus
  – Cf. bipartite author-module graph
• Based on the notion of relative entropy
• More details: see results of the hackathon.
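Relative entropy (Kullback-Leibler divergence) between two distributions p and q is D(p‖q) = Σ p_i ln(p_i/q_i). The sketch below applies it to a developer-module contribution matrix: a developer's share of work per module is compared with the overall share of work per module. This is an unnormalised illustration of the idea only; it is not claimed to be the exact focus measure of the cited work, and the matrix layout and function names are assumptions.

```python
import math


def relative_entropy(p, q):
    """Kullback-Leibler divergence D(p || q), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def developer_focus(matrix, developer):
    """Divergence of one developer's module distribution from the overall one.

    matrix[d][m] is the (hypothetical) number of contributions of developer d
    to module m. A higher value means a more specialised developer.
    """
    row = matrix[developer]
    row_total = sum(row)
    grand_total = sum(sum(r) for r in matrix)
    p = [cell / row_total for cell in row]
    q = [sum(r[m] for r in matrix) / grand_total for m in range(len(row))]
    return relative_entropy(p, q)


if __name__ == "__main__":
    contributions = [
        [10, 0],  # developer 0 works on module 0 only
        [5, 5],   # developer 1 spreads work over both modules
    ]
    print(developer_focus(contributions, 0), developer_focus(contributions, 1))
```

Transposing the matrix gives the dual, module-centred view: how concentrated the contributions to one module are compared with the overall activity of the developers.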
Reference shown on slide: Daryl Posnett, Raissa D’Souza, Premkumar Devanbu and Vladimir Filkov (University of California Davis, USA), “Dual Ecological Measures of Focus in Software Development,” ICSE 2013, San Francisco, CA, USA, p. 452.
Abstract: Work practices vary among software developers. Some are highly focused on a few artifacts; others make wide-ranging contributions. Similarly, some artifacts are mostly authored, or “owned”, by one or few developers; others have very wide ownership. Focus and ownership are related but different phenomena, both with strong effect on software quality. Prior studies have mostly targeted ownership; the measures of ownership used have generally been based on either simple counts, information-theoretic views of ownership, or social-network views of contribution patterns. We argue for a more general conceptual view that unifies developer focus and artifact ownership. We analogize the developer-artifact contribution network to a predator-prey food web, and draw upon ideas from ecology to produce a novel, and conceptually unified view of measuring focus and ownership. These measures relate to both cross-entropy and Kullback-Leibler divergence, and simultaneously provide two normalized measures of focus from both the developer and artifact perspectives. We argue that these measures are theoretically well-founded, and yield novel predictive, conceptual, and actionable value in software projects. We find that more focused developers introduce fewer defects than defocused developers. In contrast, files that receive narrowly focused activity are more likely to contain defects than other files.
Migration in software ecosystems
Gnome case study
Migration in software ecosystems
• How do software projects evolve?
  – Analogy to “gene transfer” in reticulate evolution
• Transfer of knowledge
  – Sharing or migration of contributors across projects
• Transfer of code
  – Copy-paste reuse and code cloning
  – Branching and merging of code repositories
[Figure: programming languages plotted by number of files (x-axis, ticks at 100, 1000, 10000) against LOC (y-axis, ticks at 1e+05, 1e+07). Languages shown: C, Java, Objective C, Python, Lisp, JS, ASP.Net, C/C++ Header, C++, Perl, yacc, C#, IDL, Haskell, Objective C++, lex, Assembly, Visual Basic, PHP, Ruby, Tcl/Tk]
Migration in software ecosystems
Gnome case study
• 16 years of activity
• > 1400 projects
• > 5800 contributors (> 4300 coders)
• > 1.3M commits (> 0.6M code commits)
• > 12M file touches (> 6M code file touches)
Migration in software ecosystems
Gnome case study
Hierarchical clustering of projects
Developers tend to collaborate more if they use the same programming language
Migration in software ecosystems
Gnome case study
[Figure: number of joiners (y-axis) over time (x-axis, 1997 to 2013)]
Migration in software ecosystems
Gnome case study
Evolution, Gimp, GTK+
Excerpt (Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik):
project that were not active in this project during the preceding 6-month period, but that were involved in some activity in other GNOME projects instead. Global joiners are incoming coders in the considered project that were not active in any of the GNOME projects during the preceding period. A similar definition holds for the local and global leavers. Formally, the metrics are defined as follows. Let p be a GNOME project, t a 6-month activity period (and t − 1 the previous period), c a coder, Gnome the set of GNOME’s code projects, and isDev(c, t, p) a predicate which is true if and only if c made a code commit in p during t:
\[
\begin{aligned}
\mathit{localLeavers}(p,t) &= \{\, c \mid \mathit{isDev}(c,t-1,p) \land \lnot\mathit{isDev}(c,t,p) \land \exists p_2\,(p_2 \in \mathit{Gnome} \land \mathit{isDev}(c,t,p_2)) \,\}\\
\mathit{globalLeavers}(p,t) &= \{\, c \mid \mathit{isDev}(c,t-1,p) \land \forall p_2\,(p_2 \in \mathit{Gnome} \Rightarrow \lnot\mathit{isDev}(c,t,p_2)) \,\}\\
\mathit{localJoiners}(p,t) &= \{\, c \mid \mathit{isDev}(c,t,p) \land \lnot\mathit{isDev}(c,t-1,p) \land \exists p_2\,(p_2 \in \mathit{Gnome} \land \mathit{isDev}(c,t-1,p_2)) \,\}\\
\mathit{globalJoiners}(p,t) &= \{\, c \mid \mathit{isDev}(c,t,p) \land \forall p_2\,(p_2 \in \mathit{Gnome} \Rightarrow \lnot\mathit{isDev}(c,t-1,p_2)) \,\}
\end{aligned}
\]
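The four set definitions translate almost literally into code. The following Python sketch assumes the commit history is available as a set of (coder, period, project) tuples with integer period indices; this data layout, the function name `migration_sets` and the sample data are assumptions made for illustration, not the tooling used in the study.

```python
from collections import defaultdict

def migration_sets(commits, gnome, p, t):
    """Return the local/global leavers and joiners of project p in period t.

    commits: set of (coder, period, project) tuples, one per code commit
             (hypothetical layout; periods are consecutive integers).
    gnome:   set of project names forming the ecosystem.
    """
    # activity[(coder, period)] = projects in which the coder made a code commit
    activity = defaultdict(set)
    for coder, period, project in commits:
        activity[(coder, period)].add(project)
    coders = {coder for coder, _, _ in commits}

    def is_dev(c, period, project):
        return project in activity.get((c, period), set())

    def active_in_gnome(c, period):
        # True iff there is some p2 in Gnome with isDev(c, period, p2)
        return any(p2 in gnome for p2 in activity.get((c, period), set()))

    return {
        "localLeavers": {c for c in coders
                         if is_dev(c, t - 1, p) and not is_dev(c, t, p)
                         and active_in_gnome(c, t)},
        "globalLeavers": {c for c in coders
                          if is_dev(c, t - 1, p) and not active_in_gnome(c, t)},
        "localJoiners": {c for c in coders
                         if is_dev(c, t, p) and not is_dev(c, t - 1, p)
                         and active_in_gnome(c, t - 1)},
        "globalJoiners": {c for c in coders
                          if is_dev(c, t, p) and not active_in_gnome(c, t - 1)},
    }

# Hypothetical toy history over two periods and two projects.
gnome = {"gtk+", "glib"}
commits = {
    ("ann", 1, "gtk+"), ("ann", 2, "glib"),  # moves from gtk+ to glib
    ("bob", 1, "gtk+"),                      # disappears from the ecosystem
    ("dan", 1, "glib"), ("dan", 2, "gtk+"),  # moves from glib to gtk+
    ("eve", 2, "gtk+"),                      # arrives from outside
}
result = migration_sets(commits, gnome, "gtk+", 2)
print(result)
```

In the toy history, ann is a local leaver and bob a global leaver of gtk+ in period 2, while dan is a local joiner and eve a global joiner.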
[Figure panels: evolution, gtk+, gimp; x-axis: Time (1997 to 2013); y-axis: Joiners]
Fig. 1.11 Historical evolution (timeline) of the number of local (black solid) and global (red dashed) joiners (y-axis) for three GNOME projects.
We did not find any general trend; the patterns of intake and loss of coders are highly project-specific. Figure 1.11 illustrates the evolution of the number of local and global joiners for some of the more important GNOME projects (the figures for leavers are very similar). For some projects (e.g., evolution) we do not observe a big difference between the number of local and global joiners. These projects seem to attract new developers both from within and outside of GNOME. Other projects, like gimp, attract most of their incoming developers from outside GNOME. A third category of projects attracts most of its incoming developers from other GNOME projects. This is the case for gtk+, glib and libgnome, which can be considered as belonging to the core of GNOME. This observation seems to suggest that libraries, toolkits and auxiliary projects attract more inside developers, while projects that are well-known to the outside world (such as GIMP, a popular
Timeline (6-month intervals) of joiners to Gnome projects
– Black = local joiners from other Gnome projects
– Red = global joiners from outside of Gnome
– Blue = stayers
[Figures: number of joiners and of leavers (y-axis) over time (x-axis, 1997 to 2013)]
Migration in software ecosystems
Gnome case study
Evolution, Gimp, GTK+
– Black = local joiners from other Gnome projects
– Red = global joiners from outside of Gnome
– Blue = stayers
Timeline (6-month intervals) of leavers from Gnome projects
[Figures: number of leavers (y-axis) over time (x-axis, 1997 to 2013)]
Migration in software ecosystems
Gnome case study
Evolution, Gimp, GTK+
CF(p) = Collaboration factor for Gnome project p
= percentage of coders in p having contributed to other Gnome projects
CF(Gimp) = 65.3% (low collaboration)
CF(GTK+) = 94.8% (very high collaboration)
CF(Evolution) = 85.1% (high collaboration)
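As a rough sketch, the collaboration factor can be computed from a list of (coder, project) code commits. The slide does not say over which time window "having contributed" is evaluated, so the Python example below assumes the whole recorded history; the function name `collaboration_factor` and the sample data are hypothetical.

```python
def collaboration_factor(commits, gnome, p):
    """Percentage of coders of project p who also committed code to another
    project of the ecosystem (assumption: evaluated over the whole history).

    commits: iterable of (coder, project) pairs, one per code commit.
    gnome:   set of project names forming the ecosystem.
    """
    commits = set(commits)
    coders_of_p = {coder for coder, project in commits if project == p}
    if not coders_of_p:
        return 0.0  # convention chosen here for projects without coders
    collaborators = {
        coder for coder, project in commits
        if coder in coders_of_p and project != p and project in gnome
    }
    return 100.0 * len(collaborators) / len(coders_of_p)

# Hypothetical toy data: three coders in gimp, one of whom also works on gtk+.
gnome = {"gimp", "gtk+", "glib"}
commits = [
    ("ann", "gimp"), ("ann", "gtk+"),
    ("bob", "gimp"),
    ("eve", "gimp"),
    ("dan", "gtk+"), ("dan", "glib"),
]
print(collaboration_factor(commits, gnome, "gimp"))
print(collaboration_factor(commits, gnome, "gtk+"))
```

In this toy data one of the three gimp coders also contributes elsewhere, whereas both gtk+ coders do, which mirrors the contrast between the low value for Gimp and the very high value for GTK+ on the slide.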
Some references
To appear in 2013 in Springer’s Empirical Software Engineering journal
On the variation and specialisation of workload – A case study of the Gnome ecosystem community
Bogdan Vasilescu · Alexander Serebrenik · Mathieu Goeminne · Tom Mens
DOI: 10.1007/s10664-013-9244-1
Abstract Most empirical studies of open source software repositories focus on the analysis of isolated projects, or restrict themselves to the study of the relationships between technical artifacts. In contrast, we have carried out a case study that focuses on the actual contributors to software ecosystems, being collections of software projects that are maintained by the same community. To this aim, we defined a new series of workload and involvement metrics, as well as a novel approach (T̃-graphs) for reporting the results of comparing multiple distributions. We used these techniques to statistically study how workload and involvement of ecosystem contributors vary across projects and across activity types, and we explored to what extent projects and contributors specialise in particular activity types. Using Gnome as a case study we observed that, next to coding, the activities of localization, development documentation and building are prevalent throughout the ecosystem. We also observed notable differences between frequent and occasional contributors in terms of the activity types they are involved in and the number of projects they contribute to. Occasional contributors and contributors that are involved in many different projects tend to be more involved in the localization activity, while frequent contributors tend to be more involved in the coding activity in a limited number of projects.
Keywords open source · software ecosystem · metrics · developer community · case study
B. Vasilescu and A. Serebrenik
MDSE, Eindhoven University of Technology, PO Box 513, 5600 MB Eindhoven, The Netherlands
Tel.: +31-40-2473595 Fax: +31-40-2475404
E-mail: {b.n.vasilescu | a.serebrenik}@tue.nl
M. Goeminne and T. Mens
COMPLEXYS Research Institute, Université de Mons, Place du Parc 20, 7000 Mons, Belgium
Tel.: +32-65-373453 Fax: +32-65-373459
E-mail: {mathieu.goeminne | tom.mens}@umons.ac.be
Chapter 10
Studying Evolving Software Ecosystems based on Ecological Models
Tom Mens, Maelick Claes, Philippe Grosjean and Alexander Serebrenik
Research on software evolution is very active, but evolutionary principles, models and theories that properly explain why and how software systems evolve over time are still lacking. Similarly, more empirical research is needed to understand how different software projects co-exist and co-evolve, and how contributors collaborate within their encompassing software ecosystem.
In this chapter, we explore the differences and analogies between natural ecosystems and biological evolution on the one hand, and software ecosystems and software evolution on the other hand. The aim is to learn from research in ecology to advance the understanding of evolving software ecosystems. Ultimately, we wish to use such knowledge to derive diagnostic tools aiming to analyse and optimise the fitness of software projects in their environment, and to help software project communities in managing their projects better.
Tom Mens and Maelick Claes and Philippe Grosjean
COMPLEXYS Research Institute, University of Mons, Belgium
e-mail: tom.mens,maelick.claes,[email protected]
Alexander Serebrenik
Eindhoven University of Technology, The Netherlands
e-mail: [email protected]
This work has been partially supported by F.R.S-F.N.R.S. research grant BSS-2012/V 6/5/015 (which supported an author’s stay at the Université de Mons) and ARC research project AUWB-12/17-UMONS-3, “Ecological Studies of Open Source Software Ecosystems”, financed by the Ministère de la Communauté française - Direction générale de l’Enseignement non obligatoire et de la Recherche scientifique, Belgium.
UMONS
Faculté des Sciences
Département d’Informatique
Understanding the Evolution of Socio-technical Aspects in Open Source Ecosystems: An Empirical Analysis of GNOME
Mathieu Goeminne
A dissertation submitted in fulfillment of the requirements of the degree of Docteur en Sciences
Advisor
Dr. TOM MENS, Université de Mons, Belgium
Jury
Dr. XAVIER BLANC, Université de Bordeaux 1, France
Dr. VÉRONIQUE BRUYÈRE, Université de Mons, Belgium
Dr. JESUS M. GONZALEZ-BARAHONA, Universidad Rey Juan Carlos, Spain
Dr. TOM MENS, Université de Mons, Belgium
Dr. ALEXANDER SEREBRENIK, Technische Universiteit Eindhoven, The Netherlands
Dr. JEF WIJSEN, Université de Mons, Belgium
June 2013
A historical dataset for GNOME contributors
Mathieu Goeminne, Maelick Claes and Tom Mens
Software Engineering Lab, COMPLEXYS research institute, UMONS, Belgium
Abstract: We present a dataset of the open source software ecosystem GNOME from a social point of view. We have collected historical data about the contributors to all GNOME projects stored on git.gnome.org, taking into account the problem of identity matching, and associating different activity types to the contributors. This type of information is very useful to complement the traditional, source-code related information one can obtain by mining and analyzing the actual source code. The dataset can be obtained at https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge.
I. INTRODUCTION
The historical and empirical study of open source software (OSS) ecosystems is a relatively recent but fast-growing research domain. An important characteristic of such ecosystems, at least according to our definition [15], is the fact that they are made up of a set of software projects sharing a community of users and contributors. A well-known example is GNOME. Its constituent software projects are designed to work together in order to constitute a complete software desktop environment. The GNOME projects are developed by a developer community that is spread across the world. We have observed that it is not uncommon for a contributor to be actively involved in many projects at a time [16]. In addition to this, the type of activity a contributor is involved in may change from one person to another. For example, a very important activity involves internationalization (localization and translation), which is globally managed via the web application Damned Lies¹ for all GNOME translation teams.
Many tools and datasets have been proposed to analyse a software project’s history, but few are available at the level of the ecosystem because of the additional level of difficulty involved. It does not suffice to simply consider the union of all project histories belonging to the same ecosystem. Because some projects may have contributors in common, and some contributors may be involved in different projects over time, this information needs to be explicitly represented at the ecosystem level. The same is true for the types of activity of an ecosystem’s contributor, and how this varies over time, and over the different projects he is involved in.
¹ http://l10n.gnome.org
In this paper, we present the process we have used to create a dataset containing the historical information related to contributors to the GNOME ecosystem. Our database and the tools and scripts used to create it can be found on a dedicated Bitbucket repository².
In contrast to many other datasets, we do not focus on source code, since a significant number of the files committed to GNOME’s project repositories do not even contain code (e.g., image files, web pages, documentation, localization and many more). This type of information is often ignored in MSR research, although it is very relevant to understanding which types of activities contributors are involved in. For GNOME we observed, for example, that a significant fraction of the community is working on internationalization instead of code [16].
II. MOTIVATION
An important motivation for creating a historical dataset for analysing contributors to the GNOME ecosystem was inspired by the many OSS repository mining studies that have used GNOME as a case study [2], [13]. In 2009 and 2010, GNOME was part of the MSR Mining Challenge, which led to many contributions [1], [5], [8], [9], [11], [12], [14].
Of specific interest, in the context of software ecosystem research, are the social interactions in the community of contributors. Following a holistic approach, [7] estimated effort and studied developer co-operation and co-ordination in GNOME, based on the version control repositories and mailing lists. Similarly, [4] developed an advanced measure of individual developer contribution based on the source code repository, mailing lists and bug tracking systems, and applied the measure to a number of GNOME projects. [6] studied six GNOME projects in order to understand how contributors join, socialize and develop within GNOME. [10] studied relations between the GNOME contributors by means of social network analysis.
In our own previous work [15], [16] we used the dataset presented in this article to statistically analyse the specialization of workload and involvement of GNOME contributors across projects and activity types, and we
² https://bitbucket.org/mgoeminne/sgl-flossmetric-dbmerge
@ MSR 2013
[to appear in 2014]