Sampling in Open Source Software Development: The case for using the Debian GNU/Linux Distribution

WORKING DRAFT 2006-06-16

Sebastian Spaeth, Matthias Stuermer, Stefan Haefliger, Georg von Krogh, ETH Zurich

Abstract*

Research on open source (OS) projects often focuses on the SourceForge collaboration platform. We argue that a GNU/Linux distribution, such as Debian, is better suited for the sampling of projects because it avoids biases and contains unique information only available in an integrated environment. Research on the reuse of components in particular can build on dependency information inherent in the Debian GNU/Linux package tracking system.

1. Introduction

One of the central problems of research design for OS projects is sampling. Hundreds of thousands of OS projects exist, some of which are not registered on any collaboration platform or do not have a corresponding website. Many researchers turn to http:/ using projects registered there as the population (e.g. Crowston and Howison, 2004; Madey et al., 2002). SourceForge is arguably the largest collaboration platform: it currently hosts 122,000 projects and has more than 1.3m registered users. Despite its size, sampling from SourceForge means systematically omitting projects hosted on other platforms (such as Savannah, Tigris, or Berlios) and those not listed on such platforms at all. SourceForge was intended specifically as an incubator for small
projects, therefore potentially biasing the sample further. Large projects, such as Apache, Samba, Mozilla, or the Linux kernel, are able to host their own infrastructure and are not registered on SourceForge. Others use only part of the offered tools, such as the file distribution or the CVS repository, and ignore for instance the included bug database or dump old and unmaintained data in their code repository. Howison and Crowston (2004) have recognized this in an article which explicitly warns against blindly mining SourceForge project data.

* The authors wish to thank Christian Sommer, Daniel Baumann and Mario Bischof for their valuable research assistance and two Debian developers for their comments.

When examining open source projects, we propose to look at software distributions containing entire populations of OS projects; Debian in particular seems an appropriate target.

The Debian Project¹ is one of the oldest GNU/Linux distributions in existence, founded by Ian Murdock on Aug 16th, 1993 (http:/ builds on the GNU operating system and the Linux kernel. Debian is maintained by a large community of individuals and is hence one of the few large distributions without a commercial background. Many
other GNU/Linux distributions, for instance Ubuntu, are directly derived from it. Debian offers a vast package repository compared to other distributions, and most OS applications, license permitting, can be found there. Debian's history and comprehensiveness make it attractive to study as it represents a large universe of OS projects.

We argue that sampling projects from Debian overcomes some of the major sampling problems incurred when using single collaboration platforms, as Debian includes software components hosted on various platforms, and even those without a website of their own.

The work of integrating thousands of such packages into a distribution results in a repository of software components that can be navigated along various criteria. One of the interesting attributes to research is the dependency information, allowing us to study the amount of reuse for each component.

The study
of competition among OS projects has been delayed by the difficulty of measuring (relative) success. Relational information among OS projects, as it is available from an integrated environment of software components, enables an analysis of reuse across projects. Every instance of component reuse saves overall development cost. Therefore, frequent reuse, as we argue, amounts to relative success of a program and invites empirical studies that aim at predicting reuse success. In this paper, we discuss the sampling that precedes a study of competitive dynamics among OS projects and propose the Debian distribution as a suitable environment for such a sampling.

¹ The official pronunciation of Debian is 'deb ee n'. The name originates from the names of the creator of Debian, Ian Murdock, and his wife, Debra.

This paper is organized as follows. The next section provides the outline of an ongoing research project on software reuse in order to illustrate the use of a distribution as an empirical base when analyzing competitive dynamics among OS software components. The third section reviews the critical sampling issues when using the Debian distribution and discusses data collection in more detail. The
last section concludes with some limitations and implications for OS research.

2. The case of software reuse

The development of open source software reveals new innovation practices that promise fruitful insights to students of organizations (von Krogh and von Hippel, 2006). While the study of innovation processes in open source software development has received considerable attention with regard to the developers' motivations (Hertel et al., 2003; Lakhani and Wolf, 2005), project governance (Lee and Cole, 2003; O'Mahony, 2003; Shah, 2006), and coordination (Bonaccorsi and Rossi, 2003), the study of competition among open source software products has received little attention, because competitive performance is difficult to measure satisfactorily where monetary rewards and gains are virtually absent (Crowston et al., 2003). An analysis of software reuse can both serve to discriminate between more and less successful open source software projects and hold implications for firms and the management of innovation.

One of the central problems in the management of innovation is if and how firms reuse previously created knowledge across the various stages of an innovation process (see Argote, 1999; Majchrzak, Cooper, and Neece, 2004; von Krogh, Ichijo, and Nonaka, 2000; Zander and Kogut, 1995). Research on reuse of software shows that implementing reuse is difficult for organizations to achieve and depends on many organizational settings which explicitly encourage programmers to conduct reuse (Lynex and Layzell, 1998; Banker and Kauffman, 1991). The creation of easily reusable components also needs explicit encouragement, as it requires additional planning and implementation effort in areas such as documentation, portability, architectural design and collaboration. Literature on software reuse estimates additional costs of up to 200 percent on top of the production costs for making software components reusable (Tracz, 1995).

A recent study among open source software development projects revealed that reuse is widespread and frequent (von Krogh et al., 2006). Understanding the incentives at work in open source software development helps to explain the extent of code reuse in general. We argue that an integrated environment of software components can inform information systems research and, ultimately, strategy research about the success factors of components in terms of reuse frequency. By studying the given information in Debian and coding additional characteristics of software components, we ask: what are the characteristics of software components that are reused more often than others? And how can we predict reuse success?

Open source software components come
in a myriad of forms and originate from diverse contexts. Ultimately, however, some are reused more frequently than others (see Figure 1). Information Systems (IS) literature, and research on software reuse in particular, predicts that developers will reuse components that help lower the development costs for the firm or the individual (Banker et al., 1993; Prieto-Diaz, 1993; Frakes and Isoda, 1994; Griss, 1993; von Krogh et al., 2006).

In open source software development, considerable reuse of components is conducted despite the lack of explicit organizational encouragement, and a plethora of ready-to-use components are publicly available for free in the pool of open source projects (von Krogh et al., 2006). Software developers reuse if their development costs can be mitigated through reuse relative to writing the software from scratch (Banker et al., 1993). Standards and tools for classification and retrieval help to lower the costs for search and integration of a component (Isakowitz and Kauffman, 1996; Poulin, 1995). Hence, the organizational implications for component developers are as follows: Published, easily identifiable, and well documented components should facilitate reuse.

In addition to searching for a component, the developer needs to trust and then integrate a component into his or her system in order to reuse http:/ accompanying reusable software should contain quality ratings or certification to enhance the developer's trust in a component written by someone else (Knight and Dunn, 1998; Poulin, 1995). Taking into consideration the requirements for successful reuse, a component should lower the prospective reuser's search and integration costs and communicate confidence in the component's quality.

As already elaborated, the Debian distribution offers an ideal context for studying the variance of reuse across software components. A comprehensive tracking system structures the packages managed and distributed by Debian². These packages are created by the Debian developers acting as maintainers who bundle a so-called upstream project, the piece of software intended to be integrated into the Debian distribution, with various scripts, e.g. for the configuration and installation of the upstream software into the particular Debian system. Additionally, the Debian maintainer adds meta-information about the package, like a short description of the component's functionality or the
software section the package belongs to. Most important, the maintainer logs the dependencies of this package, that is, which other packages need to be installed first in order to execute the intended software. Therefore, each package comes with dependency information, leading to a dependency graph: If a user of the distribution wishes to run program x, the dependency information can be used to check if all other packages needed by program x are installed or need to be installed on the user's machine. The reverse dependencies, that is, which programs depend on package y to function properly, can be understood as instances of reuse.

An abundance of other package information can serve as potential independent variables for a model. They include:
• Number of binary packages per source package.
• Size of the source package.
• Size of the binary packages.
• Debian bug statistics: Amount and urgency rating.
• Age of the Debian package.
• Change logs for the package within Debian.
• License information.
• Identity of maintainers and, usually, authors of the package.
• External (upstream) source of the package.
• Version information.

Depending on the focus of the reuse model and the required variables, additional project information for each Debian package should be collected from the communities that program and maintain the software in order to properly test that model. Researchers should avoid fitting their theories to the available data (Howison and Crowston, 2004).

Studying dependencies among Debian packages could shed light on the characteristics of components that are more frequently reused. As each incidence of reuse saves overall coding costs, the variance in reuse can be interpreted as success among developers of software components and hence as competitive success with respect to similar components.

² The Debian Package Tracking System can be found here: http:/

3. Sampling and data collection

For research questions that probe into the comparative or competitive dynamics of OS projects, a complete, unbiased, and high-variance sample is difficult to obtain. Particularly for large-scale populations, researchers often rely on collaboration platforms such as SourceForge with its known pitfalls (Howison and Crowston, 2004). Research projects that require reliable, comprehensive, standardized, and relational information across OS software projects may find it advantageous to sample projects from
Debian. Some of the advantages include the following.
• Debian developers assemble a comprehensive universe of software products.
• Debian contains programs which are actually used, as at least one person insisted on each being included in the repository.
• Rules and guidelines ensure a standardized process of including software into the distribution³. As a result, the information available for each package is usually complete (information listed above) and dates back as far as 1993.
• Packages within one version of the distribution are designed to be compatible and hence form an integrated environment of software products.
• Categories, or sections, within the distribution provide possible sampling foci (sections include editors, web, admin, libs, mail, and so on).
• For each package, a maintainer assumes personal responsibility and is knowledgeable about the software he or she maintains for Debian.

The current stable Debian
release, termed sarge, was examined, including security updates as of May 2006. All main and contrib (3rd party contribution) packages were used, only excluding unfree packages⁴ which contain popular, yet non-free software, such as Macromedia's Flash or the Realtime player. This led to a repository of 19,692 binary packages for 32-bit Intel-compatible processors (i386), which were associated with a section attribute (such as mail, text, or libs). A binary package is the ready-to-run compiled version of one program or part of a program. Each binary package is compiled from exactly one corresponding source package, although one source package can be split up into several binary packages in Debian (these often offer additional but optional functionality), leading to a 1:n relationship (see schematics in Figure 2). An example of multiple binary packages derived from one source package “php3” is the binary core package “php3”, plus the optional binary package “php3-xml” providing an xml interface to the PHP programming language.

³ The developer manual and reference can be found here: http:/
⁴ For further information see the Debian Free Software Guidelines on http:/

Discussions with two De
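
The reverse-dependency measure from Section 2 and the 1:n relationship between source and binary packages can be illustrated with a short sketch. The Python below assumes package metadata in Debian's control-file layout (stanzas separated by blank lines, with Package, Source, and Depends fields). The sample stanzas, the package names libexample0, libexample1, and libbase, and all function names are invented for illustration and are not taken from the sarge data. Counting every alternative in an "a | b" clause as a dependency is likewise an assumption of this sketch, not a method stated in the paper.

```python
import re
from collections import defaultdict

# Hypothetical sample in control-file layout; NOT real sarge metadata.
SAMPLE = """\
Package: php3
Section: web
Depends: libbase (>= 1.0), libexample1

Package: php3-xml
Source: php3
Section: web
Depends: php3, libexample1 | libexample0

Package: libexample1
Section: libs
Depends: libbase

Package: libbase
Section: libs
"""


def parse_stanzas(text):
    """Split control-file text into one dict of fields per package."""
    stanzas = []
    for block in re.split(r"\n\s*\n", text.strip()):
        fields, key = {}, None
        for line in block.splitlines():
            if line[:1] in (" ", "\t") and key:
                # Continuation line: append to the previous field.
                fields[key] += " " + line.strip()
            elif ":" in line:
                key, value = line.split(":", 1)
                fields[key] = value.strip()
        stanzas.append(fields)
    return stanzas


def dependency_names(depends):
    """Return bare package names from a Depends value.

    Version constraints "(...)" and architecture qualifiers "[...]" are
    dropped; every alternative in "a | b" is counted (an assumption).
    """
    names = set()
    for clause in depends.split(","):
        for alternative in clause.split("|"):
            name = re.sub(r"[(\[].*", "", alternative).strip()
            if name:
                names.add(name)
    return names


def reverse_dependencies(stanzas):
    """Map each package y to the set of packages that depend on y."""
    rdeps = defaultdict(set)
    for stanza in stanzas:
        for name in dependency_names(stanza.get("Depends", "")):
            rdeps[name].add(stanza["Package"])
    return rdeps


def binaries_by_source(stanzas):
    """Group binary packages by source package (the 1:n relationship)."""
    groups = defaultdict(list)
    for stanza in stanzas:
        # Without a Source field, the source name equals the binary name.
        source = stanza.get("Source", stanza["Package"]).split()[0]
        groups[source].append(stanza["Package"])
    return groups


if __name__ == "__main__":
    packages = parse_stanzas(SAMPLE)
    for name, users in sorted(reverse_dependencies(packages).items()):
        print(name, "is reused by", len(users), "package(s)")
    print(dict(binaries_by_source(packages)))
```

Applied to the text of a real package index, the number of distinct reverse dependencies per package would be one possible operationalization of reuse frequency. Whether to count alternatives, recommended packages, or only direct dependencies is a modelling decision left open here.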
