
From Surf Wiki (app.surf) — the open knowledge base

Mining software repositories


Within software engineering, mining software repositories (MSR) is the field that analyzes the data available in software repositories, such as version control systems, to uncover information about software systems and projects.

Definition

Herzig and Zeller define “mining software archives” as a process to “obtain lots of initial evidence” by extracting data from software repositories. They further define “data sources” as product-based artifacts such as source code, requirement artifacts, or version archives, and claim that these sources are unbiased but noisy and incomplete.

Techniques

Coupled change analysis

The idea behind coupled change analysis is that developers frequently change code entities (e.g. files) together when fixing defects or introducing new features. These couplings between entities are often not made explicit in the code or in other documents. Developers who are new to a project, in particular, may not know which entities need to be changed together. Coupled change analysis aims to extract these couplings from a project's version control system. From the commits and the timing of changes, it may be possible to identify which entities frequently change together. This information can then be presented to developers who are about to change one of the entities, to guide their further changes.
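As a minimal sketch of this idea, assuming the commit history is already available as a list of file sets, the following Python code counts how often each pair of files appears in the same commit and suggests likely companions for a file about to be changed. The function names, the `min_support` threshold, and the example history are hypothetical, and the sketch uses only commit membership, not the timing of changes.

```python
from collections import Counter
from itertools import combinations


def co_change_counts(commits):
    """Count how often each unordered pair of files appears in the same commit."""
    pair_counts = Counter()
    for files in commits:
        # Sort so that (a, b) and (b, a) map to the same key.
        for pair in combinations(sorted(set(files)), 2):
            pair_counts[pair] += 1
    return pair_counts


def suggest_coupled(pair_counts, changed_file, min_support=2):
    """Return (file, count) pairs that co-changed with `changed_file` often enough."""
    suggestions = []
    for (a, b), count in pair_counts.items():
        if count >= min_support and changed_file in (a, b):
            other = b if a == changed_file else a
            suggestions.append((other, count))
    # Most frequent co-changes first; ties broken by file name for stable output.
    return sorted(suggestions, key=lambda item: (-item[1], item[0]))


# Hypothetical history: each entry lists the files touched by one commit.
example_commits = [
    ["parser.py", "lexer.py"],
    ["parser.py", "lexer.py", "README.md"],
    ["parser.py", "ast.py"],
    ["lexer.py", "parser.py"],
]

counts = co_change_counts(example_commits)
print(suggest_coupled(counts, "parser.py"))
```

The support threshold is a design choice: a low value surfaces more candidate couplings at the cost of more noise from large, unrelated commits.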

Commit analysis

Version control systems contain many different kinds of commits, e.g. bug fix commits, new feature commits, and documentation commits. To make data-driven decisions based on past commits, one needs to select the subsets of commits that meet a given criterion. This selection can be based on the commit message.
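One simple way to select commits by message is keyword matching. The sketch below is only an illustration: the category names and keyword patterns are assumptions, not a standard taxonomy, and a real project would tune them to its own commit conventions.

```python
import re

# Hypothetical keyword patterns, checked in this order; the first match wins.
CATEGORY_PATTERNS = {
    "bug_fix": re.compile(r"\b(fix(e[sd])?|bug|defect)\b", re.IGNORECASE),
    "documentation": re.compile(r"\b(docs?|documentation|readme)\b", re.IGNORECASE),
    "feature": re.compile(r"\b(add(s|ed)?|feature|implement(s|ed)?)\b", re.IGNORECASE),
}


def classify_commit(message):
    """Return the first category whose pattern matches the message, else 'other'."""
    for category, pattern in CATEGORY_PATTERNS.items():
        if pattern.search(message):
            return category
    return "other"


def select_commits(messages, category):
    """Keep only the commit messages that fall into the given category."""
    return [m for m in messages if classify_commit(m) == category]


# Hypothetical commit messages.
messages = [
    "Fix null pointer in parser",
    "Update README with install steps",
    "Add CSV export",
    "Bump version",
]
print(select_commits(messages, "bug_fix"))
```

Keyword matching is cheap but imprecise: a message such as "Add docs for API" matches more than one pattern, so the order of the patterns decides its category.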

Documentation generation

Mining software repositories can also be used to generate useful documentation. For instance, Jadeite computes usage statistics and helps newcomers quickly identify commonly used classes.
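To illustrate what a usage statistic could look like, the following sketch counts how many source files import each class. It is not Jadeite's implementation; the choice of Java `import` statements as the signal, the regular expression, and the function name are all assumptions made for this example.

```python
import re
from collections import Counter

# Matches plain Java imports such as "import java.util.List;".
# Static and wildcard imports are deliberately not matched in this sketch.
IMPORT_RE = re.compile(r"^\s*import\s+([\w.]+)\s*;", re.MULTILINE)


def class_usage(sources):
    """Count, for each imported class, the number of source files that import it."""
    usage = Counter()
    for text in sources:
        # Count each class at most once per file, so one file cannot dominate.
        usage.update(set(IMPORT_RE.findall(text)))
    return usage


# Hypothetical corpus of two Java source files.
sources = [
    "import java.util.List;\nimport java.util.Map;\nclass A {}",
    "import java.util.List;\nimport static java.lang.Math.max;\nclass B {}",
]
usage = class_usage(sources)
print(usage.most_common(1))
```

A documentation tool could then use such counts to rank or emphasize the most commonly used classes.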

Data and tools

The primary mining data comes from version control systems. Early mining experiments were done on CVS repositories. Researchers then extensively analyzed SVN repositories. Today, Git repositories are dominant. Depending on the nature of the data required (size, domain, processing), one can download data from one of these sources. More recently, data governance and data collection for building large language models have changed this practice by introducing web crawlers that obtain data from multiple sources and domains.
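As a hedged sketch of a first step in extracting data from a Git repository, the code below parses the text produced by `git log --name-only --pretty=format:commit%x09%H%x09%ct%x09%s` into a list of commit records. The `commit` marker at the start of each header line and the dictionary layout are choices made for this example; the parser works on text only, so it can be tried without a repository.

```python
def parse_git_log(log_text):
    """Parse tab-separated commit headers followed by the changed file paths."""
    commits = []
    for line in log_text.splitlines():
        if line.startswith("commit\t"):
            # Header line: marker, hash, committer timestamp, subject.
            _, sha, timestamp, subject = line.split("\t", 3)
            commits.append({
                "sha": sha,
                "timestamp": int(timestamp),
                "subject": subject,
                "files": [],
            })
        elif line.strip() and commits:
            # Any other non-blank line is a file path of the current commit.
            commits[-1]["files"].append(line)
    return commits


# Hypothetical log output in the format described above.
sample_log = (
    "commit\tabc123\t1700000000\tFix parser bug\n"
    "parser.py\n"
    "lexer.py\n"
    "\n"
    "commit\tdef456\t1700000100\tAdd docs\n"
    "README.md\n"
)
for commit in parse_git_log(sample_log):
    print(commit["sha"], commit["files"])
```

In practice the log text could be obtained by running the command with `subprocess.run(..., capture_output=True, text=True)` inside a cloned repository.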

References

  1. Working Conference on Mining Software Repositories (http://msrconf.org), the main software engineering conference in the area
  2. K. S. Herzig and A. Zeller, “Mining your own evidence,” in Making Software, pp. 517–529, Sebastopol, Calif., USA: O’Reilly, 2011.
  3. (1998). "Proceedings. International Conference on Software Maintenance (Cat. No. 98CB36272)".
  4. (2009). "2009 IEEE 17th International Conference on Program Comprehension".
  5. (2009). "2009 IEEE Symposium on Visual Languages and Human-Centric Computing (VL/HCC)".
  6. (2005). "11th IEEE International Software Metrics Symposium (METRICS'05)".
  7. (2008). "Software Evolution".
  8. (2014). "Proceedings of the 11th Working Conference on Mining Software Repositories - MSR 2014".
Info: Wikipedia Source

This article was imported from Wikipedia and is available under the Creative Commons Attribution-ShareAlike 4.0 License. Content has been adapted to SurfDoc format. Original contributors can be found on the article history page.
