© 2013 Software Languages Team, University of Koblenz-Landau
Method and tool support for classifying software languages with Wikipedia
Ralf Lämmel, Dominik Mosen and Andrei Varanovich
Software Languages Team, University of Koblenz-Landau
http://softlang.uni-koblenz.de/wikitax/
Why and how to classify software languages?
planet-sl.org/sle2013/
The term "software language" refers to artificial languages used in software development. These include general-purpose programming languages, domain-specific languages, modeling and metamodeling languages, data models and ontologies. Examples include general purpose modeling languages such as SysML and UML, metamodeling frameworks such as Ecore, MOF or GOPRR, domain-specific modeling languages for business process modeling, such as BPMN, or embedded systems, such as Simulink or Modelica, and specialized XML-based and OWL-based languages and vocabularies. The term "software language" is intentionally broad; besides the above categories and examples, it also encompasses implicit approaches to language definition, such as APIs and collections of design patterns.
planet-sl.org/slebok/
101companies: the emerging hitchhiker’s guide through the software galaxy
http://101companies.org/wiki/Software_language
BTW, where is the Wikipedia for “Software Language”?
:-) Anyone?
Problem-specific exploration tools
• Baskaya et al.: A tool for ontology-editing and ontology-based information exploration. ESAIR 2010.
• Haun et al.: CET: A tool for creative exploration of graphs. ECML/PKDD 2010.
• Dumas et al.: ViDaX: an interactive semantic data visualisation and exploration tool. AVI 2012.
• Hora et al.: Bug Maps: A tool for the visual exploration and analysis of bugs. CSMR 2012.
• De Roover et al.: Multi-dimensional exploration of API usage. ICPC 2013.
Category graph exploration with WikiTax
http://softlang.uni-koblenz.de/wikitax/
Fig. 1. Exploration of level 1 and 2 subcategories of Computer languages.
Graph reduction. WikiTax supports reduction of the graph, both during (level-by-level) extraction and post extraction. Reduction boils down to the exclusion of nodes, i.e., categories. (In fact, we may also remove individual edges, given that a category may have multiple parent categories.) A category is removed if domain knowledge suggests that it does not serve the intended kind of classification, e.g., classification of software languages in our case. When exclusion is performed during extraction, the excluded nodes (edges) are ignored during subsequent extraction steps. When exclusion is performed post extraction, nodes (edges) are only blacklisted, without actually reducing the graph. In this manner, exclusion decisions can be revisited.
WikiTax’s visualization. Figure 1 shows the WikiTax exploration view after the extraction of levels 1 and 2 starting from the category Computer languages. Some edges are marked for exclusion. (Exclusion would be confirmed with the ‘removal’ button.) The marked categories are to be excluded because domain knowledge suggests that these categories do not serve language classification in a conceptual manner. Highlighting is applied to the categories according to the metric of immediate member pages. In the figure, the category Articles with example code is selected so that extra data, e.g., member pages, is shown in the panel on the right. All categories and pages are clickable to navigate to Wikipedia.
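The difference between excluding during extraction and blacklisting post extraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the toy graph `TOY_WIKI` and the functions `extract` and `visible` are hypothetical and are not taken from the WikiTax code base, which is Java-based.

```python
from collections import deque

# Hypothetical in-memory stand-in for Wikipedia's category graph:
# parent category -> list of immediate subcategories (toy data).
TOY_WIKI = {
    "Computer languages": ["Programming languages", "Articles with example code"],
    "Programming languages": ["Functional languages", "Lists of programming languages"],
    "Articles with example code": ["Articles with example C code"],
    "Functional languages": [],
    "Lists of programming languages": [],
    "Articles with example C code": [],
}


def extract(root, max_level, excluded=frozenset()):
    """Breadth-first, level-by-level extraction of (parent, child) edges.

    Categories in `excluded` are skipped entirely, so their subtrees are
    never visited in subsequent extraction steps.
    """
    edges = []
    seen = {root}
    queue = deque([(root, 0)])
    while queue:
        category, level = queue.popleft()
        if level == max_level:
            continue  # do not descend below the requested level
        for child in TOY_WIKI.get(category, []):
            if child in excluded:
                continue
            edges.append((category, child))
            if child not in seen:
                seen.add(child)
                queue.append((child, level + 1))
    return edges


def visible(edges, blacklisted):
    """Post-extraction view: hide blacklisted categories without deleting edges."""
    return [(p, c) for p, c in edges if p not in blacklisted and c not in blacklisted]


# Exclusion during extraction: the subtree below the excluded category is never fetched.
during = extract("Computer languages", 2, excluded={"Articles with example code"})

# Exclusion post extraction: the full graph is kept; the blacklist acts only as a filter.
full = extract("Computer languages", 2)
after = visible(full, blacklisted={"Lists of programming languages"})
```

Because `full` is left untouched by `visible`, a blacklisting decision can be revisited simply by changing the blacklist, which mirrors the revisitable decisions described above.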
Result of 2 levels of extraction with some categories marked for exclusion
The WikiTax approach
• Category graph extraction (from Wikipedia)
• ... reduction (by the exclusion of categories)
• ... visualization (using simple metrics)
• ... export (for external processing)
Metamodel of the WikiTax category graph
Fig. 2. Metamodel of the WikiTax category graph.
WikiTax’s metamodel. WikiTax operates on an enhanced category graph; see the metamodel in Figure 2. Each category is associated with its contained pages and subcategories. The subcategory associations carry attributes that keep track of metadata as follows:
• backwardArc: Marker for cyclic edges in the category graph.
• blacklisted: Marker for categories blacklisted post extraction.
• excluded: Marker for categories excluded during reduction.
• comment: Label (‘reason for exclusion’) to be associated with the edge.
Categories are associated with measures as follows:
• level: The level 0, 1, 2, ... of the category in the graph, with the root at level 0.
• subcategories: The number of immediate subcategories.
• transitiveSubcategories: The number of all subcategories.
• pages: The number of immediately contained pages.
• transitivePages: The number of all pages in this category.
The implementation of WikiTax uses the Java-based JGraLab library (see footnote 4) for the representation of (annotated) graphs, with JSON as an export format.
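To make the metamodel more concrete, here is a minimal Python sketch of the attributed category graph and its measures. The attribute and measure names follow the metamodel above; the class layout, the helper functions, and the toy data are assumptions for illustration only, not the JGraLab-based implementation. In particular, counting distinct pages for `transitivePages` is one plausible reading of "all pages in this category".

```python
from dataclasses import dataclass, field


@dataclass
class SubcategoryEdge:
    """Attributed subcategory association (attribute names follow the metamodel)."""
    child: "Category"
    backwardArc: bool = False   # edge closes a cycle in the category graph
    blacklisted: bool = False   # hidden post extraction
    excluded: bool = False      # removed during reduction
    comment: str = ""           # reason for exclusion


@dataclass
class Category:
    name: str
    pages: list = field(default_factory=list)   # immediately contained pages
    edges: list = field(default_factory=list)   # outgoing SubcategoryEdge objects


def descendants(category):
    """All categories reachable via subcategory edges, ignoring backward arcs."""
    found = {}
    stack = [category]
    while stack:
        current = stack.pop()
        for edge in current.edges:
            if edge.backwardArc or edge.child.name in found:
                continue
            found[edge.child.name] = edge.child
            stack.append(edge.child)
    return list(found.values())


def measures(category):
    """Compute the measures except `level`, which depends on the chosen root."""
    subs = descendants(category)
    all_pages = set(category.pages)
    for sub in subs:
        all_pages.update(sub.pages)   # a page listed in several categories counts once
    return {
        "subcategories": len(category.edges),
        "transitiveSubcategories": len(subs),
        "pages": len(category.pages),
        "transitivePages": len(all_pages),
    }


# Toy graph (made-up names): root -> mid -> leaf, plus a backward arc leaf -> root.
leaf = Category("leaf", pages=["P2", "P3"])
mid = Category("mid", pages=["P1", "P2"], edges=[SubcategoryEdge(leaf)])
root = Category("root", pages=["P0"], edges=[SubcategoryEdge(mid)])
leaf.edges.append(SubcategoryEdge(root, backwardArc=True))
result = measures(root)
```

Marking the cycle-closing edge with `backwardArc` keeps the traversal finite and prevents the root from being counted as its own subcategory.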
Exclusion types. A methodologically important aspect of graph reduction is that reasons for category exclusion are not simply documented by a comment; rather, a manageable, well-defined set of exclusion types is to be developed over time. For instance, the category Unified Modeling Language could be said to be of an exclusion type ‘Singleton classifier’, meaning that this category, by design, is primarily concerned with a single language, i.e., UML in this case; the other members or subcategories of the category are concerned with UML concepts, tools, and other related artifacts. §3 lists several more exclusion types. The aggregation and use of exclusion types captures domain knowledge and insight into Wikipedia’s category graph in a transparent manner.
Footnote 4: https://github.com/jgralab
Metrics
Implementation of WikiTax
• Wikipedia API
• TGraphs, JSON, and CSV
• Java (Swing)
• JGraLab
See GitHub URL etc.: http://softlang.uni-koblenz.de/wikitax/ (Open Source and Open Data)
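As an illustration of the "Wikipedia API" item, the following Python sketch builds a MediaWiki API request that lists the immediate subcategories of a category. The `categorymembers` list module and its `cmtitle`, `cmtype`, `cmlimit`, and `cmcontinue` parameters are part of the public MediaWiki API; whether WikiTax issues exactly this query is an assumption, and the function name `subcategory_query` is hypothetical.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def subcategory_query(category, cmcontinue=None):
    """Build a MediaWiki API URL listing the immediate subcategories of a category."""
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": "subcat",   # use "page" to list member pages instead
        "cmlimit": "500",
        "format": "json",
    }
    if cmcontinue is not None:
        # Continuation token returned by the previous response, for long listings.
        params["cmcontinue"] = cmcontinue
    return f"{API_ENDPOINT}?{urlencode(params)}"


url = subcategory_query("Computer languages")
```

Issuing one such request per category and per level yields the level-by-level extraction; the sketch deliberately stops at URL construction so that it runs without network access.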
Problem-specific concern: exclusion
Why and what and how to exclude?
Are these true classifiers in terms of software concepts?
Exclusion types
• Alternative classifier (unrelated to software concepts)
‣ e.g., Academic programming languages
• Deviating classifier (in fact, non-classifier)
‣ e.g., Articles with example code
• Singleton classifier (focusing on one language)
‣ e.g., Cascading Style Sheets
• List classifier (collecting list pages)
‣ e.g., Lists of programming languages
• Maintenance classifier
‣ e.g., Uncategorized programming languages
Category | Exclusion type
Academic programming languages | Alternative classifier
Articles with example code | Deviating classifier
Cascading Style Sheets | Singleton classifier
Data types | Deviating classifier
Discontinued programming languages | Alternative classifier
DocBook | Singleton classifier
Esoteric programming languages | Alternative classifier
Experimental programming languages | Alternative classifier
HTML | Singleton classifier
JSON | Singleton classifier
Lists of computer languages | List classifier
Lists of programming languages | List classifier
Markup language comparisons | Deviating classifier
Markup language stubs | Maintenance classifier
Non-English-based programming languages | Alternative classifier
Programming language families | Deviating classifier
Programming language standards | Deviating classifier
Programming language topics | Deviating classifier
Programming languages by creation date | Alternative classifier
Programming languages conferences | Deviating classifier
Software by programming language | Deviating classifier
SyncML | Singleton classifier
TeX | Singleton classifier
Text Encoding Initiative | Singleton classifier
Troff | Singleton classifier
Uncategorized programming languages | Maintenance classifier
Unified Modeling Language | Singleton classifier
Wikipedia categories named after programming languages | Deviating classifier
XML | Singleton classifier
Fig. 4. Exclusion types for levels 1 and 2 of Computer languages; this list is produced by the WikiTax tool based on metadata (comments) entered by us interactively.
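A fixed set of exclusion types, as opposed to free-form comments, can be modeled as an enumeration. The Python sketch below is a hedged illustration: the five type names come from the slides, the six sample rows are a subset of Fig. 4, and the storage format (a dictionary from category to comment) is an assumption rather than the WikiTax format.

```python
from collections import Counter
from enum import Enum


class ExclusionType(Enum):
    ALTERNATIVE = "Alternative classifier"
    DEVIATING = "Deviating classifier"
    SINGLETON = "Singleton classifier"
    LIST = "List classifier"
    MAINTENANCE = "Maintenance classifier"


# A few (category, comment) pairs from Fig. 4, as they might be entered interactively.
comments = {
    "Academic programming languages": "Alternative classifier",
    "Articles with example code": "Deviating classifier",
    "Cascading Style Sheets": "Singleton classifier",
    "Lists of programming languages": "List classifier",
    "Uncategorized programming languages": "Maintenance classifier",
    "Unified Modeling Language": "Singleton classifier",
}

# Lookup by value raises ValueError for any comment outside the agreed set,
# which keeps the set of exclusion types manageable and well-defined.
typed = {category: ExclusionType(comment) for category, comment in comments.items()}

# Aggregate view: how often each exclusion type was used in this sample.
histogram = Counter(typed.values())
```

Validating every comment against the enumeration is what turns ad hoc remarks into an aggregated, reviewable list of the kind shown in Fig. 4.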
During the study, we noticed, for example, an asymmetry between ‘query’ and ‘transformation’. That is, there is a category Transformation languages at level 1, but there is apparently no category for ‘query languages’, not even at level 2. Let us inspect the page for SQL, which is an obvious query language. It turns out that SQL is a member of various categories, including a category Query languages, which in turn is a subcategory of various categories, including the category Domain-specific programming languages, which occurred in Figure 3. Let us compare this classification scheme with the one of XSLT, which is an obvious transformation language: it is a member of the categories Transformation languages, Declarative programming languages, Functional languages, Markup languages, XML-based programming languages, and yet other categories that may count as ‘alternative classifiers’. However, XSLT (unlike SQL) is not a member of the category Domain-specific programming languages.
WikiTax is helpful in making such observations regarding consistency (or lack thereof) of classification on Wikipedia.
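The SQL versus XSLT observation amounts to a reachability check over category memberships. The Python sketch below uses the category lists shown on the SQL, Query languages, XSLT, and Transformation languages slides (a 2013 snapshot; Wikipedia may differ today). The table of parent categories is deliberately partial, containing only what the slides show, and the function `reaches` is a hypothetical helper rather than a WikiTax feature.

```python
# Page -> categories, as listed on the slides.
PAGE_CATEGORIES = {
    "SQL": [
        "Database management systems", "Computer languages", "Data modeling languages",
        "Declarative programming languages", "Query languages",
        "Relational database management systems", "SQL",
    ],
    "XSLT": [
        "Declarative programming languages", "Functional languages", "Markup languages",
        "Transformation languages", "World Wide Web Consortium standards",
        "XML-based programming languages", "XML-based standards",
    ],
}

# Category -> parent categories; partial on purpose (only what the slides show).
PARENT_CATEGORIES = {
    "Query languages": ["Domain-specific programming languages", "Data management", "Databases"],
    "Transformation languages": ["Computer languages"],
}


def reaches(page, target):
    """True if `target` is among the page's categories or their recorded ancestors."""
    stack = list(PAGE_CATEGORIES[page])
    seen = set()
    while stack:
        category = stack.pop()
        if category == target:
            return True
        if category in seen:
            continue
        seen.add(category)
        stack.extend(PARENT_CATEGORIES.get(category, []))
    return False


sql_is_dsl = reaches("SQL", "Domain-specific programming languages")
xslt_is_dsl = reaches("XSLT", "Domain-specific programming languages")
```

With the recorded data, SQL reaches Domain-specific programming languages through Query languages, while XSLT does not, which is the asymmetry discussed above. A negative result is only as reliable as the parent table is complete.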
Exclusion types for levels 1 and 2 of Computer languages
Category | Subcategories
Data modeling languages | –
Markup languages | Declarative markup languages, GIS file formats, Knowledge representation languages, Lightweight markup languages, Mathematical markup languages, Musical markup languages, Page description markup languages, Playlist markup languages, User interface markup languages, Vector graphics markup languages, Web syndication formats, XML markup languages
Programming languages | .NET programming languages, Agent-based programming languages, Agent-oriented programming languages, Concatenative programming languages, Concurrent programming languages, Data-structured programming languages, Declarative programming languages, Dependently typed languages, Domain-specific programming languages, Dynamic programming languages, Extensible syntax programming languages, Formula manipulation languages, Function-level languages, Functional languages, High Integrity Programming Language, High-level programming languages, ICL programming languages, Intensional programming languages, Low-level programming languages, Multi-paradigm programming languages, Nondeterministic programming languages, Object-based programming languages, Pattern matching programming languages, Procedural programming languages, Process termination functions, Prototype-based programming languages, Reactive programming languages, Secure programming languages, Set theoretic programming languages, Statically typed programming languages, Synchronous programming languages, Term-rewriting programming languages, Text-oriented programming languages, Tree programming languages, Visual programming languages, XML-based programming languages
Specification languages | Algorithm description languages, Dependently typed languages, Formal specification languages, Hardware description languages
Stylesheet languages | –
Transformation languages | Macro programming languages
Fig. 3. Reduced subcategory lists for subcategories of Computer languages.
List classifier. The category collects lists or categories of lists (rather than plain categories) of software languages. For instance, category Lists of computer languages has Lists of programming languages as a subcategory, which in turn contains pages for some lists of languages, such as the List of BASIC dialects.
Maintenance classifier. The category is used by the Wikipedia authors to capture some information related to the maintenance of pages or categories. For instance, the category Uncategorized programming languages describes itself as serving categories or pages “which need to be classified under more specific categories”. Also: “This category may be empty occasionally or even most of the time.”
An observation regarding Wikipedia style. The resulting classification of Figure 3 with the remaining level-1 and level-2 subcategories is of a manageable size. We may review the classification and observe some of its characteristics in this manner.
Computer languages
An aside: Where are the query languages?
Let’s compare SQL and XSLT!
http://en.wikipedia.org/wiki/SQL
Categories: Database management systems | Computer languages | Data modeling languages | Declarative programming languages | Query languages | Relational database management systems | SQL
Category:Query_languages
Categories: Domain-specific programming languages | Data management | Databases
A level-2 subcategory of computer languages
http://en.wikipedia.org/wiki/XSLT
Categories: Declarative programming languages | Functional languages | Markup languages | Transformation languages | World Wide Web Consortium standards | XML-based programming languages | XML-based standards
A level-1 subcategory of computer languages
Category:Transformation_languages
Categories: Computer languages
Programming languages -- all levels
• Initial extraction
‣ 423 categories, 7515 pages, 8 levels
• 1st pruning phase
‣ 29 excluded categories as discussed earlier
‣ 288 categories, 6671 pages
• 2nd pruning phase
‣ 79 categories, 1560 pages, 4 levels
Two hours of work
Fig. 5. Metrics-based views on Programming languages graph (panels: Pages, Categories).
Programming languages: all levels. According to Figure 3, the subcategory of Computer languages with by far the most subcategories is Programming languages. Thus, we embarked on a more comprehensive exploration of category Programming languages:
Initially, we extracted 423 categories over 8 levels with 7515 pages. The automatic extraction took several minutes. We performed exclusion in two steps. First, we (re-)excluded those direct subcategories that already appeared in Figure 4. After such initial pruning, 288 categories with 6671 pages remained. We completed reduction at all levels of the category graph. This process required about 2 hours of manual work to determine what categories to remove and for what reason. This effort is intrinsically manual; it requires domain knowledge and involves consultation of the relevant and additional Wikipedia pages. Ultimately, 79 categories over 4 levels with 1560 pages remained. Figure 5 visualizes the reduced taxonomy for two different metrics supported by WikiTax.
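The effect of the two pruning phases can be worked out from the reported numbers. The short Python sketch below only restates the figures given above and derives differences and remaining shares; the variable names are illustrative.

```python
# (step, categories, pages) as reported for the Programming languages graph.
steps = [
    ("initial extraction", 423, 7515),
    ("1st pruning phase", 288, 6671),
    ("2nd pruning phase", 79, 1560),
]

initial_categories, initial_pages = steps[0][1], steps[0][2]

# Share of the initial extraction that remains after each step.
for name, categories, pages in steps:
    print(f"{name}: {categories / initial_categories:.1%} of categories, "
          f"{pages / initial_pages:.1%} of pages remain")

removed_first = steps[0][1] - steps[1][1]    # 423 - 288 = 135 categories
removed_second = steps[1][1] - steps[2][1]   # 288 - 79 = 209 categories
```

Note that the first phase removes 135 categories although only 29 categories were excluded explicitly; presumably the remainder are subcategories that become unreachable once their excluded ancestors are gone.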
http://softlang.uni-koblenz.de/wikitax/
Thanks! Questions?