What’s the code? Automatic Classification of Source Code Archives

Authors: Secil Ugurel, Robert Krovetz, C. Lee Giles, David M. Pennock, Eric Glover, Hongyuan Zha

The World Wide Web hosts many source code archives, usually organized by application category and programming language. However, manually organizing source code repositories is not a trivial task, since they grow rapidly and are very large (on the order of terabytes). We demonstrate machine learning methods for automatically classifying archived source code into eleven application topics and ten programming languages. For topical classification, we concentrate on C and C++ programs from the Ibiblio and SourceForge archives. We show that a support vector machine (SVM) classifier can be trained on examples of a given programming language or on programs in a specified category.
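To make the idea of training an SVM on labeled source files concrete, here is a minimal sketch of a two-class language classifier. It is an illustration under stated assumptions, not the authors' method: the bag-of-tokens features, the Pegasos-style subgradient trainer (used here as a stand-in for a full SVM solver), the toy training snippets, and all function names (`featurize`, `train_linear_svm`, `predict`) are hypothetical choices made for this example.

```python
import re
from collections import Counter

def featurize(source):
    """L2-normalized bag of tokens: words/numbers or single punctuation marks.
    (Assumed feature set; the abstract does not specify one.)"""
    counts = Counter(re.findall(r"\w+|[^\w\s]", source))
    norm = sum(c * c for c in counts.values()) ** 0.5 or 1.0
    return {tok: c / norm for tok, c in counts.items()}

def score(weights, features):
    """Signed distance-like score of a feature vector under a linear model."""
    return sum(weights.get(tok, 0.0) * val for tok, val in features.items())

def train_linear_svm(examples, epochs=200, lam=0.01):
    """Linear SVM (hinge loss + L2 penalty) fitted by Pegasos-style
    subgradient steps. `examples` is a list of (source, label), label +1 or -1."""
    data = [(featurize(src), y) for src, y in examples]
    weights, step = {}, 0
    for _ in range(epochs):
        for feats, y in data:
            step += 1
            eta = 1.0 / (lam * step)                    # decaying step size
            violated = y * score(weights, feats) < 1.0  # inside the margin?
            shrink = 1.0 - eta * lam                    # L2 regularization
            for tok in weights:
                weights[tok] *= shrink
            if violated:
                for tok, val in feats.items():
                    weights[tok] = weights.get(tok, 0.0) + eta * y * val
    return weights

def predict(weights, source):
    """Positive score means Python, negative means C (labels chosen below)."""
    return "python" if score(weights, featurize(source)) >= 0 else "c"

# Toy training set (hypothetical): +1 = Python, -1 = C.
TRAINING = [
    ("def add(a, b):\n    return a + b\n", +1),
    ("import os\nfor name in os.listdir('.'):\n    print(name)\n", +1),
    ("class Stack:\n    def __init__(self):\n        self.items = []\n", +1),
    ('#include <stdio.h>\nint main(void) {\n    printf("hi");\n    return 0;\n}\n', -1),
    ("int add(int a, int b) {\n    return a + b;\n}\n", -1),
    ("#include <stdlib.h>\nvoid fill(int *p, int n) {\n"
     "    for (int i = 0; i < n; i++) { p[i] = 0; }\n}\n", -1),
]

weights = train_linear_svm(TRAINING)
print(predict(weights, "def greet(name):\n    print(name)\n"))
print(predict(weights, "int main(void) {\n    return 0;\n}\n"))
```

A classifier for ten languages or eleven topics would need one such binary model per class (for example, one-versus-rest) and far more training data than the six snippets used here.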
