Page 1: Method and tool support for classifying software languages

© 2013 Software Languages Team, University of Koblenz-Landau

Method and tool support for classifying software languages with Wikipedia

Ralf Lämmel, Dominik Mosen and Andrei Varanovich

Software Languages Team, University of Koblenz-Landau

http://softlang.uni-koblenz.de/wikitax/

Page 2: Method and tool support for classifying software languages

Why and how to classify software languages?

Page 3: Method and tool support for classifying software languages

planet-sl.org/sle2013/

The term "software language" refers to artificial languages used in software development. These include general-purpose programming languages, domain-specific languages, modeling and metamodeling languages, data models and ontologies. Examples include general purpose modeling languages such as SysML and UML, metamodeling frameworks such as Ecore, MOF or GOPRR, domain-specific modeling languages for business process modeling, such as BPMN, or embedded systems, such as Simulink or Modelica, and specialized XML-based and OWL-based languages and vocabularies. The term "software language" is intentionally broad; besides the above categories and examples, it also encompasses implicit approaches to language definition, such as APIs and collections of design patterns.

Page 4: Method and tool support for classifying software languages

planet-sl.org/slebok/

Page 5: Method and tool support for classifying software languages

101companies: the emerging hitchhiker’s guide through the software galaxy

http://101companies.org/wiki/Software_language

Page 6: Method and tool support for classifying software languages

BTW, where is the Wikipedia for “Software Language”?

:-) Anyone?

Page 7: Method and tool support for classifying software languages

Problem-specific exploration tools

• Baskaya et al.: A tool for ontology-editing and ontology-based information exploration. ESAIR 2010.

• Haun et al.: CET: A tool for creative exploration of graphs. ECML/PKDD 2010.

• Dumas et al.: ViDaX: an interactive semantic data visualisation and exploration tool. AVI 2012.

• Hora et al.: Bug Maps: A tool for the visual exploration and analysis of bugs. CSMR 2012.

• De Roover et al.: Multi-dimensional exploration of API usage. ICPC 2013.

Page 8: Method and tool support for classifying software languages

Category graph exploration with WikiTax

http://softlang.uni-koblenz.de/wikitax/

Page 9: Method and tool support for classifying software languages

Category graph exploration with WikiTax

Fig. 1. Exploration of level 1 and 2 subcategories of Computer languages.

Graph reduction. WikiTax supports reduction of the graph, both during (level-by-level) extraction and post extraction. Reduction boils down to the exclusion of nodes, i.e., categories. (In fact, we may also remove individual edges, given that a category may have multiple parent categories.) A category would be removed if domain knowledge suggests that the category at hand does not serve the intended kind of classification, e.g., classification of software languages in our case. When exclusion is performed during extraction, then the excluded nodes (edges) are ignored during subsequent extraction steps. When exclusion is performed post extraction, then nodes (edges) are only blacklisted, without actually reducing the graph. In this manner, exclusion decisions can be revisited.

WikiTax’s visualization. Figure 1 shows the WikiTax exploration view after the extraction of levels 1 and 2 starting from the category Computer languages. Some edges are marked for exclusion. (Exclusion would be confirmed with the ‘removal’ button.) The marked categories are to be excluded because domain knowledge suggests that these categories do not serve language classification in a conceptual manner. Highlighting is applied to the categories according to the metric of immediate member pages. In the figure, the category Articles with example code is selected so that extra data is shown in the panel on the right, e.g., member pages. All categories and pages are clickable to navigate to Wikipedia.
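
The difference between excluding during extraction and blacklisting post extraction can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not WikiTax's implementation (which is Java-based): the toy graph `SUBCATEGORIES`, the function names `extract` and `visible`, and the subcategory `XML-based standards` under `XML` are hypothetical.

```python
from collections import deque

# Toy category graph (parent -> subcategories). A hypothetical stand-in for
# data that the real tool would obtain from Wikipedia.
SUBCATEGORIES = {
    "Computer languages": ["Programming languages", "Markup languages",
                           "Articles with example code"],
    "Programming languages": ["Functional languages",
                              "Lists of programming languages"],
    "Markup languages": ["XML"],
    "XML": ["XML-based standards"],
}


def extract(root, max_level, excluded=frozenset()):
    """Level-by-level (breadth-first) extraction up to max_level.

    Categories in `excluded` are skipped, so nothing below them is
    visited in subsequent extraction steps.
    """
    levels = {root: 0}
    queue = deque([root])
    while queue:
        category = queue.popleft()
        if levels[category] == max_level:
            continue  # do not descend below the requested level
        for sub in SUBCATEGORIES.get(category, []):
            if sub in excluded or sub in levels:
                continue
            levels[sub] = levels[category] + 1
            queue.append(sub)
    return levels


def visible(levels, blacklist):
    """Post-extraction view: blacklisted nodes are hidden, not deleted."""
    return {c: lvl for c, lvl in levels.items() if c not in blacklist}


full = extract("Computer languages", 2)
pruned = extract("Computer languages", 3, excluded={"XML"})
view = visible(full, blacklist={"XML"})
```

In this sketch, `pruned` never contains anything below `XML`, whereas `full` keeps `XML` even while `view` hides it, so the blacklisting decision can still be revisited by changing the blacklist.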

Page 10: Method and tool support for classifying software languages

Result of 2 levels of extraction with some categories marked for exclusion

Page 11: Method and tool support for classifying software languages

The WikiTax approach

• Category graph extraction (from Wikipedia)

• ... reduction (by the exclusion of categories)

• ... visualization (using simple metrics)

• ... export (for external processing)

Page 12: Method and tool support for classifying software languages

Metamodel of the WikiTax category graph

Fig. 2. Metamodel of the WikiTax category graph.

WikiTax’s metamodel. WikiTax operates on an enhanced category graph; see the metamodel in Figure 2. Thus, each category associates with contained pages and subcategories. The subcategory associations are attributed to keep track of metadata as follows:

• backwardArc: Marker for cyclic edges in the category graph.
• blacklisted: Marker for categories blacklisted post extraction.
• excluded: Marker for categories excluded during reduction.
• comment: Label (‘reason for exclusion’) to be associated with the edge.

Categories are associated with measures as follows:

• level: The level 0, 1, 2, ... of the category in the graph with the root at level 0.
• subcategories: The number of immediate subcategories.
• transitiveSubcategories: The number of all subcategories.
• pages: The number of immediately contained pages.
• transitivePages: The number of all pages in this category.

The implementation of WikiTax uses the Java-based JGraLab library [4] for the representation of (annotated) graphs with JSON as an export format.

Exclusion types. A methodologically important aspect of graph reduction is that reasons for category exclusion are not just simply documented by a comment, but a manageable, well-defined set of exclusion types is to be developed over time. For instance, the category Unified Modeling Language could be said to be of an exclusion type ‘Singleton classifier’ to mean that this category, by design, is primarily concerned with a single language, i.e., UML in this case; the other members or subcategories of the category are concerned with UML concepts, tools, and other related artifacts. §3 lists several more exclusion types. The aggregation and use of exclusion types captures domain knowledge and insight into Wikipedia’s category graph in a transparent manner.

[4] https://github.com/jgralab

Metrics
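
To make the edge attributes and category measures concrete, here is a minimal Python sketch of the metamodel. It is an assumption-laden illustration, not the JGraLab schema that WikiTax actually uses: the class names `SubcategoryEdge` and `Category`, the method `measures`, the choice to skip backward arcs when computing transitive measures, and the example page titles are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class SubcategoryEdge:
    """Attributed subcategory association (attribute names follow the slide)."""
    target: "Category"
    backward_arc: bool = False  # marker for a cyclic edge
    blacklisted: bool = False   # blacklisted post extraction
    excluded: bool = False      # excluded during reduction
    comment: str = ""           # reason for exclusion


@dataclass
class Category:
    name: str
    level: int = 0
    pages: list = field(default_factory=list)
    edges: list = field(default_factory=list)

    def _descendants(self):
        """All subcategories reachable without following backward arcs."""
        seen = {}
        stack = [self]
        while stack:
            node = stack.pop()
            for edge in node.edges:
                if edge.backward_arc or id(edge.target) in seen:
                    continue
                seen[id(edge.target)] = edge.target
                stack.append(edge.target)
        return list(seen.values())

    def measures(self):
        """Compute the five measures listed above for this category."""
        descendants = self._descendants()
        all_pages = set(self.pages)
        for sub in descendants:
            all_pages.update(sub.pages)
        return {
            "level": self.level,
            "subcategories": sum(1 for e in self.edges if not e.backward_arc),
            "transitiveSubcategories": len(descendants),
            "pages": len(self.pages),
            "transitivePages": len(all_pages),
        }


# Hypothetical miniature graph; page titles are placeholders.
root = Category("Computer languages", level=0)
markup = Category("Markup languages", level=1, pages=["Page A"])
xml_markup = Category("XML markup languages", level=2, pages=["Page B", "Page C"])
root.edges.append(SubcategoryEdge(markup))
markup.edges.append(SubcategoryEdge(xml_markup))
xml_markup.edges.append(SubcategoryEdge(root, backward_arc=True))

root_measures = root.measures()
```

Whether excluded or blacklisted edges should also be left out of the transitive measures is a design decision that the slides do not spell out; the sketch counts them.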

Page 13: Method and tool support for classifying software languages

Implementation of WikiTax

• Wikipedia API

• TGraphs, JSON, and CSV

• Java (Swing)

• JGraLab

See GitHub URL etc.: http://softlang.uni-koblenz.de/wikitax/ (Open Source and Open Data)
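
The slides do not show how WikiTax talks to the Wikipedia API, so the following Python sketch is only one plausible way to request the members of a category. It builds a request for the MediaWiki Action API's `list=categorymembers` query and reads titles from a response; the helper names and the sample response (including its `pageid` values and continuation token) are hypothetical, and no network access is performed.

```python
from urllib.parse import urlencode

API_ENDPOINT = "https://en.wikipedia.org/w/api.php"


def categorymembers_url(category, member_type="subcat", limit=500, cont=None):
    """Build a categorymembers query URL for one category.

    member_type is "subcat" for subcategories or "page" for member pages.
    """
    params = {
        "action": "query",
        "list": "categorymembers",
        "cmtitle": f"Category:{category}",
        "cmtype": member_type,
        "cmlimit": limit,
        "format": "json",
    }
    if cont:
        params["cmcontinue"] = cont  # resume a paged result
    return f"{API_ENDPOINT}?{urlencode(params)}"


def member_titles(response):
    """Extract member titles from a decoded JSON response."""
    members = response.get("query", {}).get("categorymembers", [])
    return [m["title"] for m in members]


def next_continuation(response):
    """Return the continuation token, or None when the listing is complete."""
    return response.get("continue", {}).get("cmcontinue")


url = categorymembers_url("Computer languages")

# Hypothetical response in the shape returned by the API (values made up).
sample = {
    "continue": {"cmcontinue": "subcat|TOKEN", "continue": "-||"},
    "query": {"categorymembers": [
        {"pageid": 1, "ns": 14, "title": "Category:Markup languages"},
        {"pageid": 2, "ns": 14, "title": "Category:Programming languages"},
    ]},
}
titles = member_titles(sample)
```

Fetching the URL (for example with `urllib.request.urlopen`) and looping while `next_continuation` returns a token would give one extraction step for one category.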

Page 14: Method and tool support for classifying software languages

Problem-specific concern: exclusion

Page 15: Method and tool support for classifying software languages

Why and what and how to exclude?

Are these true classifiers in terms of software concepts?

Page 16: Method and tool support for classifying software languages

Exclusion types

• Alternative classifier (unrelated to software concepts)

‣ e.g., Academic programming languages

• Deviating classifier (in fact, non-classifier)

‣ e.g., Articles with example code

• Singleton classifier (focusing on one language)

‣ e.g., Cascading Style Sheets

• List classifier (collecting list pages)

‣ e.g., Lists of programming languages

• Maintenance classifier

‣ e.g., Uncategorized programming languages

Page 17: Method and tool support for classifying software languages

Category                                                Exclusion type
Academic programming languages                          Alternative classifier
Articles with example code                              Deviating classifier
Cascading Style Sheets                                  Singleton classifier
Data types                                              Deviating classifier
Discontinued programming languages                      Alternative classifier
DocBook                                                 Singleton classifier
Esoteric programming languages                          Alternative classifier
Experimental programming languages                      Alternative classifier
HTML                                                    Singleton classifier
JSON                                                    Singleton classifier
Lists of computer languages                             List classifier
Lists of programming languages                          List classifier
Markup language comparisons                             Deviating classifier
Markup language stubs                                   Maintenance classifier
Non-English-based programming languages                 Alternative classifier
Programming language families                           Deviating classifier
Programming language standards                          Deviating classifier
Programming language topics                             Deviating classifier
Programming languages by creation date                  Alternative classifier
Programming languages conferences                       Deviating classifier
Software by programming language                        Deviating classifier
SyncML                                                  Singleton classifier
TeX                                                     Singleton classifier
Text Encoding Initiative                                Singleton classifier
Troff                                                   Singleton classifier
Uncategorized programming languages                     Maintenance classifier
Unified Modeling Language                               Singleton classifier
Wikipedia categories named after programming languages  Deviating classifier
XML                                                     Singleton classifier

Fig. 4. Exclusion types for levels 1 and 2 of Computer languages; this list is produced by the WikiTax tool based on metadata (comments) entered by us interactively.
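
As a small worked example, the exclusion metadata of Fig. 4 can be tallied per exclusion type. The data below is transcribed from the figure; the variable names and the use of Python (rather than WikiTax's own export format) are assumptions made for illustration.

```python
from collections import Counter

# Exclusion metadata from Fig. 4, grouped by exclusion type.
EXCLUSIONS = {
    "Alternative classifier": [
        "Academic programming languages",
        "Discontinued programming languages",
        "Esoteric programming languages",
        "Experimental programming languages",
        "Non-English-based programming languages",
        "Programming languages by creation date",
    ],
    "Deviating classifier": [
        "Articles with example code",
        "Data types",
        "Markup language comparisons",
        "Programming language families",
        "Programming language standards",
        "Programming language topics",
        "Programming languages conferences",
        "Software by programming language",
        "Wikipedia categories named after programming languages",
    ],
    "Singleton classifier": [
        "Cascading Style Sheets", "DocBook", "HTML", "JSON", "SyncML",
        "TeX", "Text Encoding Initiative", "Troff",
        "Unified Modeling Language", "XML",
    ],
    "List classifier": [
        "Lists of computer languages",
        "Lists of programming languages",
    ],
    "Maintenance classifier": [
        "Markup language stubs",
        "Uncategorized programming languages",
    ],
}

# Invert to the per-category view used in the figure, then count per type.
exclusion_type = {
    category: kind
    for kind, categories in EXCLUSIONS.items()
    for category in categories
}
counts = Counter(exclusion_type.values())
total_excluded = sum(counts.values())
```

The total of 29 agrees with the number of excluded categories reported for the first pruning phase.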

this manner. During the study, we realized, for example, an asymmetry between ‘query’ versus ‘transformation’. That is, there is a category Transformation languages at level 1, but there is apparently no category for ‘query languages’, not even at level 2. Let us inspect the page for SQL, which is an obvious query language. It turns out that SQL is a member of various categories including a category Query languages which in turn is a subcategory of various categories including the category Domain-specific programming languages which occurred in Figure 3. Let us compare this classification scheme with the one of XSLT, which is an obvious transformation language: it is a member of the categories Transformation languages, Declarative programming languages, Functional languages, Markup languages, XML-based programming languages, and yet other categories that may count as ‘alternative classifiers’. However, XSLT (unlike SQL) is not a member of the category Domain-specific programming languages.

WikiTax is helpful in making such observations regarding consistency (or lack thereof) of classification on Wikipedia.
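
The SQL versus XSLT asymmetry can be checked mechanically by walking parent categories upwards. The sketch below is a hedged illustration: the `PARENTS` map contains only the memberships named in the text and in Figure 3 (Wikipedia lists more), and the function name `ancestors` is hypothetical.

```python
# Page/category -> parent categories, restricted to the relations that the
# text and Figure 3 mention; the real Wikipedia data has further parents.
PARENTS = {
    "SQL": ["Query languages"],
    "Query languages": ["Domain-specific programming languages"],
    "XSLT": [
        "Transformation languages",
        "Declarative programming languages",
        "Functional languages",
        "Markup languages",
        "XML-based programming languages",
    ],
    "Domain-specific programming languages": ["Programming languages"],
    "Declarative programming languages": ["Programming languages"],
    "Functional languages": ["Programming languages"],
    "XML-based programming languages": ["Programming languages"],
    "Programming languages": ["Computer languages"],
    "Markup languages": ["Computer languages"],
    "Transformation languages": ["Computer languages"],
}


def ancestors(node):
    """Return every category reachable by following parent links upwards."""
    found = set()
    stack = [node]
    while stack:
        for parent in PARENTS.get(stack.pop(), []):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found


dsl = "Domain-specific programming languages"
sql_is_dsl = dsl in ancestors("SQL")
xslt_is_dsl = dsl in ancestors("XSLT")
only_sql = ancestors("SQL") - ancestors("XSLT")
```

Under these restricted relations, `only_sql` contains Query languages and Domain-specific programming languages, which is exactly the inconsistency the paragraph points out.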

Exclusion types for levels 1 and 2 of Computer languages

Page 18: Method and tool support for classifying software languages

Category: Subcategories

Data modeling languages: –

Markup languages: Declarative markup languages, GIS file formats, Knowledge representation languages, Lightweight markup languages, Mathematical markup languages, Musical markup languages, Page description markup languages, Playlist markup languages, User interface markup languages, Vector graphics markup languages, Web syndication formats, XML markup languages

Programming languages: .NET programming languages, Agent-based programming languages, Agent-oriented programming languages, Concatenative programming languages, Concurrent programming languages, Data-structured programming languages, Declarative programming languages, Dependently typed languages, Domain-specific programming languages, Dynamic programming languages, Extensible syntax programming languages, Formula manipulation languages, Function-level languages, Functional languages, High Integrity Programming Language, High-level programming languages, ICL programming languages, Intensional programming languages, Low-level programming languages, Multi-paradigm programming languages, Nondeterministic programming languages, Object-based programming languages, Pattern matching programming languages, Procedural programming languages, Process termination functions, Prototype-based programming languages, Reactive programming languages, Secure programming languages, Set theoretic programming languages, Statically typed programming languages, Synchronous programming languages, Term-rewriting programming languages, Text-oriented programming languages, Tree programming languages, Visual programming languages, XML-based programming languages

Specification languages: Algorithm description languages, Dependently typed languages, Formal specification languages, Hardware description languages

Stylesheet languages: –

Transformation languages: Macro programming languages

Fig. 3. Reduced subcategory lists for subcategories of Computer languages.

List classifier: The category collects lists or categories of lists (rather than plain categories) of software languages. For instance, category Lists of computer languages has Lists of programming languages as a subcategory, which in turn contains pages for some lists of languages, such as the List of BASIC dialects.

Maintenance classifier: The category is used by the Wikipedia authors to capture some information related to the maintenance of pages or categories. For instance, the category Uncategorized programming languages describes itself as serving categories or pages “which need to be classified under more specific categories”. Also: “This category may be empty occasionally or even most of the time.”

An observation regarding Wikipedia style. The resulting classification of Figure 3 with the remaining level-1 and level-2 subcategories is of a manageable size. We may review the classification and observe some of its characteristics in

Computer languages

Page 20: Method and tool support for classifying software languages

An aside: Where are the query languages?

Page 21: Method and tool support for classifying software languages

Let’s compare SQL and XSLT!

Page 23: Method and tool support for classifying software languages

Category:Query_languages

Categories: Domain-specific programming languages | Data management | Databases

A level-2 subcategory of computer languages

Page 24: Method and tool support for classifying software languages

http://en.wikipedia.org/wiki/XSLT

Categories: Declarative programming languages | Functional languages | Markup languages | Transformation languages | World Wide Web Consortium standards | XML-based programming languages | XML-based standards

A level-1 subcategory of computer languages

Page 26: Method and tool support for classifying software languages

Programming languages -- all levels

• Initial extraction

‣ 423 categories, 7515 pages, 8 levels

• 1st pruning phase

‣ 29 excluded categories as discussed earlier

‣ 288 categories, 6671 pages

• 2nd pruning phase

‣ 79 categories, 1560 pages, 4 levels

Two hours of work

Page 27: Method and tool support for classifying software languages

[Fig. 5 panels: Pages; Categories]

Fig. 5. Metrics-based views on Programming languages graph.

Programming languages: all levels. According to Figure 3, the subcategory of Computer languages with by far the most subcategories is Programming languages. Thus, we embarked on a more comprehensive exploration of category Programming languages:

Initially, we extracted 423 categories over 8 levels with 7515 pages. The automatic extraction took several minutes. We performed exclusion in two steps. First, we (re-) excluded those direct subcategories that already appeared in Figure 4. After such initial pruning, 288 categories with 6671 pages remained. We completed reduction at all levels of the category graph. This process required about 2 hours of manual work to determine what categories to remove and for what reason. This effort is intrinsically manual; it requires domain knowledge and involves consultation of the relevant and additional Wikipedia pages. Ultimately, 79 categories over 4 levels with 1560 pages remained. Figure 5 visualizes the reduced taxonomy for two different metrics supported by WikiTax.
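
The effect of the two pruning phases can be put into numbers with a short calculation. The category and page counts are taken from the text; the variable names and the derived percentages are illustrative additions.

```python
# (phase, categories, pages) as reported above.
PHASES = [
    ("initial extraction", 423, 7515),
    ("1st pruning phase", 288, 6671),
    ("2nd pruning phase", 79, 1560),
]

# Categories and pages dropped by each pruning phase.
removed = [
    (later[0], earlier[1] - later[1], earlier[2] - later[2])
    for earlier, later in zip(PHASES, PHASES[1:])
]

# Share of the initial extraction that survives both phases, in percent.
initial_categories, initial_pages = PHASES[0][1], PHASES[0][2]
final_categories, final_pages = PHASES[-1][1], PHASES[-1][2]
remaining_categories_pct = round(100 * final_categories / initial_categories, 1)
remaining_pages_pct = round(100 * final_pages / initial_pages, 1)

for phase, categories, pages in removed:
    print(f"{phase}: -{categories} categories, -{pages} pages")
```

The first phase drops 135 categories although only 29 were excluded directly; presumably the remainder are subcategories that are no longer reached once their excluded parents are ignored, though the text does not state this explicitly.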

Page 30: Method and tool support for classifying software languages

http://softlang.uni-koblenz.de/wikitax/

Thanks! Questions?