Mining software repositories: Difference between revisions

From Wikipedia, the free encyclopedia
Revision as of 15:58, 2 November 2018

The mining software repositories[citation needed] (MSR) field[1] analyzes the rich data available in software repositories, such as version control repositories, mailing list archives, bug tracking systems, and issue tracking systems, to uncover interesting and actionable information about software systems, projects, and software engineering.

Definition

Herzig and Zeller define “mining software archives” as a process to “obtain lots of initial evidence” by extracting data from software repositories. They further define “data sources” as product-based artefacts such as source code, requirement artefacts, or version archives, and claim that these sources are unbiased, but noisy and incomplete.[2]

Techniques

Coupled Change Analysis

The idea behind coupled change analysis is that developers frequently change code entities (e.g. files) together when fixing defects or introducing new features. These couplings between entities are often not made explicit in the code or in other documents, so developers who are new to a project in particular may not know which entities need to be changed together. Coupled change analysis aims to extract these couplings from a project's version control system: from the commits and the timing of changes, it may be possible to identify which entities frequently change together. This information could then be presented to developers who are about to change one of the entities, to support them in their further changes.[3]
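As a rough illustration of the idea, the following Python sketch counts how often pairs of files appear in the same commit and lists the files most often changed together with a given file. It is a minimal sketch under stated assumptions, not the method of the cited work: the commit history is a hypothetical in-memory list, the function names `co_change_counts` and `coupled_with` are invented for this example, and timing information is ignored.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of files appears in the same commit."""
    counts = Counter()
    for files in commits:
        # Sort and de-duplicate so (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(files)), 2):
            counts[pair] += 1
    return counts


def coupled_with(counts, target, min_support=2):
    """Return files that changed together with `target` at least `min_support` times."""
    partners = []
    for (a, b), n in counts.items():
        if n >= min_support and target in (a, b):
            partners.append((b if a == target else a, n))
    # Most frequent partners first; ties broken by file name for stable output.
    return sorted(partners, key=lambda item: (-item[1], item[0]))


# Hypothetical history: each inner list holds the files touched by one commit.
history = [
    ["parser.py", "lexer.py"],
    ["parser.py", "lexer.py", "README.md"],
    ["parser.py", "ast.py"],
    ["lexer.py", "parser.py"],
]
counts = co_change_counts(history)
print(coupled_with(counts, "parser.py"))  # [('lexer.py', 3)]
```

The `min_support` threshold is an assumption that filters out pairs that co-occur only once; a real analysis would likely also normalise by how often each file changes on its own.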

Commit Analysis

There are many different kinds of commits in version control systems, e.g. bug fix commits, new feature commits, and documentation commits. To make data-driven decisions based on past commits, one needs to select subsets of commits that meet a given criterion. This can be done based on the commit message[4] or based on the commit content.[5]
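Selecting commits by message can be sketched with a naive keyword heuristic, shown below in Python. This is only an illustrative assumption of how such a filter might look, not the classifier used in the cited work: the keyword lists, the category labels, and the function names `classify` and `select` are hypothetical, and the commits are made-up examples.

```python
import re

# Hypothetical keyword lists, checked in order; the first matching category wins.
CATEGORY_PATTERNS = [
    ("bug fix", re.compile(r"\b(fix(es|ed)?|bug|defect|crash)\b", re.IGNORECASE)),
    ("documentation", re.compile(r"\b(docs?|documentation|readme|javadoc)\b", re.IGNORECASE)),
    ("new feature", re.compile(r"\b(add(s|ed)?|implement(s|ed)?|feature|support)\b", re.IGNORECASE)),
]


def classify(message):
    """Return the first category whose keywords occur in the commit message."""
    for label, pattern in CATEGORY_PATTERNS:
        if pattern.search(message):
            return label
    return "other"


def select(commits, label):
    """Return the identifiers of all commits classified as `label`."""
    return [commit_id for commit_id, message in commits if classify(message) == label]


# Made-up (identifier, message) pairs standing in for a real commit log.
commits = [
    ("a1", "Fix null pointer crash in parser"),
    ("b2", "Add support for SVN repositories"),
    ("c3", "Update README with install steps"),
    ("d4", "Bump version to 1.2"),
]
print(select(commits, "bug fix"))  # ['a1']
```

Because the first matching pattern wins, a message such as "Update docs to fix typo" would be labelled a bug fix here, which shows why message-based selection is noisy and why analysing the commit content is a useful complement.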

Data & Tools

The primary mining data comes from version control systems. Early mining experiments were done on CVS repositories.[6] Researchers then extensively analyzed SVN repositories.[7] Git repositories are now dominant,[8] but special care must be taken to handle branches and forks.[9]
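One reason branches need care is that a Git history is a directed acyclic graph rather than a single line of commits. The Python sketch below illustrates this on a hypothetical commit graph given as a mapping from each commit to its parents; the commit names and the helper functions are assumptions made for this example and do not reproduce the analysis of the cited work.

```python
def merge_commits(parents):
    """Return commits with more than one parent, i.e. merges of branches."""
    return sorted(c for c, ps in parents.items() if len(ps) > 1)


def first_parent_history(parents, head):
    """Follow only the first parent of each commit, starting at `head`."""
    history = []
    current = head
    while current is not None:
        history.append(current)
        ps = parents.get(current, [])
        current = ps[0] if ps else None
    return history


def all_ancestors(parents, head):
    """Return every commit reachable from `head`, across all branches."""
    seen = set()
    stack = [head]
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        stack.extend(parents.get(commit, []))
    return seen


# Hypothetical graph: F was developed on a branch forked at A and merged in M.
parents = {
    "A": [],
    "B": ["A"],
    "C": ["B"],
    "F": ["A"],
    "M": ["C", "F"],
}
print(merge_commits(parents))                # ['M']
print(first_parent_history(parents, "M"))    # ['M', 'C', 'B', 'A']
print(sorted(all_ancestors(parents, "M")))   # ['A', 'B', 'C', 'F', 'M']
```

In this example a linear walk along first parents never visits commit F, so any metric computed that way would silently omit the work done on the branch.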

Tools:

  • LibVCS4j is a Java library that allows existing tools to analyse the evolution of software systems by providing a common API for different version control systems and issue trackers.
  • PyDriller is a Python framework for analysing Git repositories.
  • Coming is a Java tool to search for patterns in past commits.
  • CVSAnalY extracts information out of source code repository logs and stores it into a database.

Contradictory Findings

Software Metrics

See also

References

  1. ^ Working Conference on Mining Software Repositories, the main software engineering conference in the area
  2. ^ K. S. Herzig and A. Zeller, “Mining your own evidence,” in Making Software, pp. 517–529, Sebastopol, Calif., USA: O’Reilly, 2011.
  3. ^ Gall, H.; Hajek, K.; Jazayeri, M. (November 1998). "Detection of logical coupling based on product release history". Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272): 190–198. doi:10.1109/icsm.1998.738508.
  4. ^ Hindle, Abram; German, Daniel M.; Godfrey, Michael W.; Holt, Richard C. (2009). "Automatic classification of large changes into maintenance categories". doi:10.1109/ICPC.2009.5090025.
  5. ^ Martinez, Matias; Duchien, Laurence; Monperrus, Martin (2013). "Automatically Extracting Instances of Code Change Patterns with AST Analysis". doi:10.1109/ICSM.2013.54.
  6. ^ Canfora, G.; Cerulo, L. (2005). "Impact Analysis by Mining Software and Change Request Repositories". 11th IEEE International Software Metrics Symposium (METRICS'05). p. 29. doi:10.1109/METRICS.2005.28.
  7. ^ d'Ambros, Marco; Gall, Harald; Lanza, Michele; Pinzger, Martin (2008). "Analysing Software Repositories to Understand Software Evolution". Software Evolution. pp. 37–67. doi:10.1007/978-3-540-76440-3_3.
  8. ^ Kalliamvakou, Eirini; Gousios, Georgios; Blincoe, Kelly; Singer, Leif; German, Daniel M.; Damian, Daniela (2014). "The promises and perils of mining GitHub". Proceedings of the 11th Working Conference on Mining Software Repositories - MSR 2014. pp. 92–101. doi:10.1145/2597073.2597074.
  9. ^ Biazzini, Marco; Monperrus, Martin; Baudry, Benoit (2014). "On Analyzing the Topology of Commit Histories in Decentralized Version Control Systems" (PDF). 2014 IEEE International Conference on Software Maintenance and Evolution. pp. 261–270. doi:10.1109/ICSME.2014.48.