Network based models of cognitive and social dynamics of human languages



Available online at www.sciencedirect.com

Computer Speech and Language 25 (2011) 635–638

Editorial

Network based models of cognitive and social dynamics of human languages

Recently, complex network theory (Albert and Barabási, 2002; Newman, 2003), a popular modeling paradigm in statistical mechanics and the physics of complex systems, has proven to be a promising tool for understanding the structure and dynamics of human languages. There has been a plethora of work on linguistic networks with various motivations and at various levels of linguistic structure. The work in this area can be broadly classified into two categories: (1) studies that investigate the structural properties of language from the perspective of language evolution and thereby explain the emergence of certain universal characteristics of languages, and (2) studies that exploit network based representations to develop useful practical applications, such as machine translation, information retrieval and text summarization. In other words, complex network based models find application in areas as diverse as language evolution, acquisition, historical linguistics, mining and analyzing social networks of blogs and emails, link analysis and information retrieval, information extraction, and representation of the mental lexicon.

“You shall know a word by the company it keeps” (Firth, 1957: 11). Indeed, one of the most commonly studied linguistic networks in the literature is the word co-occurrence network (Ferrer-i-Cancho and Sole, 2001a, 2004). In language, words interact among themselves in different ways – some words co-occur with higher probability than others. These co-occurrences are nontrivial; that is, their patterns cannot be inferred from the frequency distribution of the individual words. As several previous studies have clearly established (Choudhury et al., 2010; Ferrer-i-Cancho, 2005; Ferrer-i-Cancho and Sole, 2001a,b, 2004), understanding the structure and the emergence of these patterns can offer important clues and insights about language evolution. In fact, owing to its popularity, this framework has also been applied to unsupervised grammar induction – one of the most difficult problems in natural language acquisition (Solan et al., 2005).
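To make this concrete, here is a minimal sketch of how such a co-occurrence network can be built by linking words that appear within a fixed window of each other. The toy corpus and the window size of 2 are illustrative choices, not taken from any of the studies cited above.

```python
from collections import defaultdict


def cooccurrence_network(sentences, window=2):
    """Build an undirected co-occurrence graph: each edge (u, v) carries
    a count of how often u and v appear within `window` tokens of each
    other. Edges are stored under a sorted word pair so that (u, v) and
    (v, u) map to the same key."""
    edges = defaultdict(int)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, u in enumerate(tokens):
            for v in tokens[i + 1:i + 1 + window]:
                if u != v:
                    edges[tuple(sorted((u, v)))] += 1
    return dict(edges)


corpus = [
    "the cat sat on the mat",
    "the dog sat on the rug",
]
net = cooccurrence_network(corpus, window=2)
print(net[("on", "sat")])  # → 2: "sat on" occurs in both sentences
```

Frequent pairs get heavy edges, while the absence of an edge between two frequent words is itself informative – exactly the kind of nontrivial pattern the co-occurrence studies analyze.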

It is well known that the average receptive vocabulary of a normal high school student exceeds 100,000 words (Miller and Gildea, 1987). Quite surprisingly, however, speakers are capable of navigating this huge lexicon (often termed the Mental Lexicon, or ML) very efficiently; the reaction time to judge whether a word form is legitimate is less than 100 ms. Consequently, two important interrelated questions arise about the ML: (a) how are words stored in long term memory, i.e., how is the ML organized, and (b) how are these words retrieved from the ML? In search of answers, the ML can be modeled as a web of inter-connected nodes, where each node corresponds to a word form and the inter-connections may be based on any one (or more) of the following: (a) phonological similarity (e.g., the words banana, bear and bean may be connected since they start with the same phoneme) (Gruenenfelder and Pisoni, 2005; Kapatsinski, 2006; Tamariz, 2005; Vitevitch, 2004), (b) semantic similarity (e.g., the words banana, apple and pear may be connected since all of them are names of fruits) (Sigman and Cecchi, 2002), and (c) orthographic properties (Choudhury et al., 2007).
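As a toy illustration of such a lexicon network, the sketch below links word forms that differ by a single edit – a common operationalization of form similarity in the neighborhood literature – using spelling as a stand-in for a phonemic transcription. The word list and the edit-distance-1 criterion are illustrative assumptions, not a claim about any particular model above.

```python
def edit_distance_one(a, b):
    """True if a and b differ by exactly one substitution, insertion,
    or deletion."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        return sum(x != y for x, y in zip(a, b)) == 1
    if len(a) > len(b):
        a, b = b, a  # make a the shorter string
    # a is shorter by one: b must equal a with one character removed
    return any(b[:i] + b[i + 1:] == a for i in range(len(b)))


def neighbor_network(words):
    """Mental-lexicon-style graph: nodes are word forms, edges link
    form-similar words (here: edit distance 1 over orthography)."""
    return {w: [v for v in words if v != w and edit_distance_one(w, v)]
            for w in words}


lexicon = ["bean", "bear", "beat", "bead", "banana", "pear"]
net = neighbor_network(lexicon)
print(net["bean"])  # bear, beat, bead differ from bean by one letter
print(net["pear"])  # only bear is a single substitution away
```

In such a network, questions (a) and (b) above become questions about graph structure: how densely connected a word's neighborhood is, and how short the paths are along which activation could spread during retrieval.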

Complex networks have also found applications in modeling some of the hardest problems in phonology, such as the analysis of the structure, self-organization and evolutionary dynamics of human speech sound systems, including consonants (Mukherjee et al., 2007, 2009), vowels (Mukherjee et al., 2008) and syllables (Soares et al., 2005).

0885-2308/$ – see front matter © 2011 Published by Elsevier Ltd. doi:10.1016/j.csl.2011.02.001


Graph based approaches are quite common in the areas of Natural Language Processing (NLP) and Information Retrieval (IR). Interestingly, although there are no obvious technical differences between the scope of graph theory in these areas and in the area of complex networks, the terminologies used and the objectives are often quite different. The work on linguistic networks discussed above is primarily targeted at the statistical physics community, and its objective is to unfurl the structure of languages and their dynamics. Nevertheless, there is some equally interesting and significant work that uses the same set of mathematical tools with the objective of developing practical engineering applications concerning languages. One of the earliest and most recurrent applications of networks in NLP has been the automatic induction of syntactic and semantic categories based on the distributional hypothesis, which states that words of a similar syntactic (or semantic) category are found in similar contexts (Harris, 1968). Measuring the extent to which two words appear in similar contexts thus measures their similarity (Miller and Charles, 1991). One can construct a weighted network where the nodes are words and the weight of the edge between two words encodes their contextual similarity; this network can then be clustered to obtain syntactic or semantic categories in a completely unsupervised fashion (Biemann, 2006; Dhillon et al., 2003; Nath et al., 2008; Rapp, 2005; Schütze, 1993). Identification of syntactic or semantic classes is of great importance to NLP and IR. For instance, part-of-speech (POS) tagging is the first step towards parsing; however, supervised machine learning techniques for POS tagging demand a large amount of human annotated data, which is expensive to produce and non-existent for most languages. Since automatic induction of POS tags through graph clustering does not require annotated data, it might turn out to be a very useful technique in NLP for resource poor languages. Similarly, semantic clustering of words is useful for search and IR.
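The distributional construction can be sketched as follows. This is a toy example with invented sentences and plain count vectors; real systems use much larger corpora, weighted context counts and a graph clustering step on top of these pairwise similarities.

```python
import math
from collections import Counter, defaultdict


def context_vectors(sentences, window=1):
    """Map each word to a bag of the words seen within `window`
    positions of its occurrences (its distributional context)."""
    vecs = defaultdict(Counter)
    for s in sentences:
        toks = s.lower().split()
        for i, w in enumerate(toks):
            for j in range(max(0, i - window), i + window + 1):
                if j != i and j < len(toks):
                    vecs[w][toks[j]] += 1
    return vecs


def cosine(u, v):
    """Cosine similarity between two sparse count vectors; this value
    would serve as the edge weight in the word similarity network."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


corpus = [
    "the cat chased the mouse",
    "the dog chased the ball",
]
vecs = context_vectors(corpus)
# "cat" and "dog" share the context "the ... chased", so the weight of
# the edge between them is high (close to 1.0):
print(cosine(vecs["cat"], vecs["dog"]))
```

Clustering the resulting weighted graph – for instance with the graph clustering methods cited above – then groups "cat" and "dog" into one induced category without any annotated data.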

Network methods are also at the heart of any information retrieval task. The central problem of IR is to rank a given large collection of documents with respect to their similarity to a query. In a typical IR setup, the whole web, consisting of billions of webpages, represents the collection of documents to be ranked, and the query is only one or two words long. One of the challenges of IR is to utilize the network structure of the web to compute the ranks of the documents. The web can be conceptualized as a directed graph where the nodes are the webpages and a hyperlink from webpage A to webpage B represents a directed edge between the nodes corresponding to A and B.

PageRank (Brin and Page, 1998) is one of the first and most popular ranking algorithms, and is used by the Google search engine. The basic idea behind PageRank is that the rank (or popularity) of a node is a function of the ranks of its neighbors; in other words, a page that receives a hyperlink from a popular page is itself popular. An alternative view of the algorithm involves a random walker (here, a random surfer) who starts from a random node and follows the edges of the graph randomly to reach other nodes. The PageRank of a page is proportional to the probability that a random surfer reaches that page by following random hyperlinks on the web. Yet another way to define PageRank is as the components of the principal eigenvector of a suitably normalized adjacency matrix of the web graph; for this reason, PageRank is closely related to what is known as eigenvector centrality in the complex network literature. In addition to Information Retrieval, PageRank has been widely successful in various NLP tasks such as text summarization (Mihalcea and Tarau, 2004), document classification (Mihalcea and Hassan, 2005), and semantic similarity (Hughes and Ramage, 2007).
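The random-surfer formulation translates directly into a short power-iteration sketch. The toy web graph, the damping factor of 0.85 and the iteration count are illustrative choices; production implementations work on sparse matrices at vastly larger scale.

```python
def pagerank(links, damping=0.85, iters=100):
    """Power-iteration PageRank on a directed graph given as
    {node: [nodes it links to]}. With probability `damping` the surfer
    follows a random outgoing link; otherwise it jumps to a uniformly
    random page."""
    nodes = list(links)
    n = len(nodes)
    rank = {u: 1.0 / n for u in nodes}
    for _ in range(iters):
        new = {u: (1.0 - damping) / n for u in nodes}
        for u, outs in links.items():
            if outs:
                share = damping * rank[u] / len(outs)
                for v in outs:
                    new[v] += share
            else:  # dangling page: spread its rank uniformly
                for v in nodes:
                    new[v] += damping * rank[u] / n
        rank = new
    return rank


# Toy web graph: both A and B link to C, so C should rank highest.
web = {"A": ["B", "C"], "B": ["C"], "C": ["A"]}
ranks = pagerank(web)
print(max(ranks, key=ranks.get))  # → C
```

Note that the ranks sum to 1 at every iteration: they form a probability distribution over pages, which is exactly the stationary distribution of the random surfer.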

The primary objective of this special issue is to bring forth the use of complex networks in modeling the cognitive and social dynamics of languages. The cognitive dynamics of languages include topics focused primarily on language acquisition, which can be extended to language change (historical linguistics) and language evolution as well. Since the latter phenomena are also governed by social factors, we can further classify them under the social dynamics of languages, which in turn include topics such as mining the social networks of blogs and emails. Unfortunately, this field of research remains poorly represented in the computer science literature. It is our hope that this special issue will bring awareness of this field among computer scientists, leading in the long run to cross-fertilization of ideas and collaboration between the statistical physics and computational sciences communities. We believe that Computer Speech and Language, with a target audience consisting primarily of NLP and IR researchers, therefore constitutes a perfect journal for this issue.

This special issue includes five full-length articles, the first two being applications of complex networks in modeling cognitive dynamics and the last three being instances of network methods in modeling sociolinguistic phenomena. In “Network analysis of a corpus of undeciphered Indus civilization inscriptions indicates syntactic organization”, Sitabhra Sinha and his co-authors demonstrate through network analysis the syntactic organization of Indus inscriptions obtained from archaeological excavations at sites of the Indus Valley civilization (2500–1900 BCE) in Pakistan and northwestern India. Their results suggest the existence of a grammar underlying these inscriptions.


In “Applications of graph theory to an English rhyming corpus”, Morgan Sonderegger builds rhyme graphs with the objective of inferring pronunciation patterns of a language by observing its rhyming words. His results indicate how rhyme graphs could be used for historical pronunciation reconstruction. In “Geometric representations of language taxonomies”, Philippe Blanchard and his coauthors present a Markov chain analysis of a network generated from the matrix of lexical distances, which allows complex relationships between different languages in a language family to be represented geometrically, in terms of distances and angles. They apply this method to construct taxonomies for the Indo-European and Austronesian language families. In “Bipartite spectral graph partitioning for clustering dialect varieties and detecting their linguistic features”, Martijn Wieling and John Nerbonne apply bipartite spectral graph partitioning to Dutch dialect data to simultaneously cluster varieties and identify their most distinctive linguistic features. They show that this method yields clear and sensible geographical groupings, and they further discuss and analyze the importance of the concomitant features. Finally, in “Geography of social ontologies: Testing a variant of the Sapir–Whorf Hypothesis in the context of Wikipedia”, Alexander Mehler and his coauthors evaluate a variant of the Sapir–Whorf Hypothesis within complex network theory and thereby develop an approach to classifying linguistic networks of tens of thousands of vertices by exploring a small range of mathematically well-established topological indices.

Our sincere thanks go to all the referees for their thoughtful, high quality and elaborate reviews, especially considering our extremely tight reviewing time frame. The articles appearing in this issue have surely benefited from their expert feedback. In conclusion, we hope that through this special issue we will kindle interest in the use of complex network theory and motivate greater awareness of the dynamical nature of human languages in the design of various engineering applications.

References

Albert, R., Barabási, A.L., 2002. Statistical mechanics of complex networks. Reviews of Modern Physics 74, 47.
Biemann, C., 2006. Unsupervised part-of-speech tagging employing efficient graph clustering. In: Proceedings of the COLING/ACL 2006 Student Research Workshop, Sydney, Australia, July, pp. 7–12 (Association for Computational Linguistics).
Brin, S., Page, L., 1998. The anatomy of a large-scale hypertextual web search engine. Computer Networks and ISDN Systems 30 (1–7), 107–117.
Choudhury, M., Chatterjee, D., Mukherjee, A., 2010. Global topology of word co-occurrence networks: beyond the two-regime power-law. In: Coling, pp. 162–170.
Choudhury, M., Thomas, M., Mukherjee, A., Basu, A., Ganguly, N., 2007. How difficult is it to develop a perfect spell-checker? A cross-linguistic analysis through complex network approach. In: Proceedings of the Second Workshop on TextGraphs: Graph-Based Algorithms for Natural Language Processing, Rochester, NY, USA, pp. 81–88 (Association for Computational Linguistics).
Dhillon, I.S., Mallela, S., Modha, D.S., 2003. Information-theoretic co-clustering. In: Proceedings of the Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2003), pp. 89–98.
Ferrer-i-Cancho, R., 2005. The structure of syntactic dependency networks: insights from recent advances in network theory. In: Altmann, G., Levickij, V., Perebyinis, V. (Eds.), Problems of Quantitative Linguistics, pp. 60–75.
Ferrer-i-Cancho, R., Sole, R.V., 2001a. The small world of human language. Proceedings of the Royal Society of London Series B: Biological Sciences 268 (1482), 2261–2265.
Ferrer-i-Cancho, R., Sole, R.V., 2001b. Two regimes in the frequency of words and the origin of complex lexicons: Zipf's law revisited. Journal of Quantitative Linguistics 8, 165–173.
Ferrer-i-Cancho, R., Sole, R.V., 2004. Patterns in syntactic dependency networks. Physical Review E 69, 051915.
Firth, J.R., 1957. Papers in Linguistics 1934–1951. Oxford University Press.
Gruenenfelder, T., Pisoni, D.B., 2005. Modeling the mental lexicon as a complex graph. Ms., Indiana University.
Harris, Z.S., 1968. Mathematical Structures of Language. Wiley.
Hughes, T., Ramage, D., 2007. Lexical semantic relatedness with random graph walks. In: Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning (EMNLP-CoNLL), Prague, Czech Republic, June, pp. 581–589 (Association for Computational Linguistics).
Kapatsinski, V., 2006. Sound similarity relations in the mental lexicon: modeling the lexicon as a complex network. Speech Research Lab Progress Report, Indiana University.
Mihalcea, R., Hassan, S., 2005. Using the essence of texts to improve document classification. In: Proceedings of the Conference on Recent Advances in Natural Language Processing (RANLP).
Mihalcea, R., Tarau, P., 2004. TextRank – bringing order into texts. In: Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP 2004), Barcelona, Spain.
Miller, G.A., Charles, W.G., 1991. Contextual correlates of semantic similarity. Language and Cognitive Processes 6 (1), 1–28.
Miller, G.A., Gildea, P.M., 1987. How children learn words. Scientific American 257 (3), 86–91.
Mukherjee, A., Choudhury, M., Basu, A., Ganguly, N., 2007. Modeling the co-occurrence principles of the consonant inventories: a complex network approach. International Journal of Modern Physics C 18 (2), 281–295.
Mukherjee, A., Choudhury, M., Basu, A., Ganguly, N., 2009. Self-organization of sound inventories: analysis and synthesis of the occurrence and co-occurrence networks of consonants. Journal of Quantitative Linguistics 16 (2), 157–184.
Mukherjee, A., Choudhury, M., RoyChowdhury, S., Basu, A., Ganguly, N., 2008. Rediscovering the co-occurrence principles of the vowel inventories: a complex network approach. Advances in Complex Systems 11 (3), 371–392.
Nath, J., Choudhury, M., Mukherjee, A., Biemann, C., Ganguly, N., 2008. Unsupervised parts-of-speech induction for Bengali. In: Proceedings of the Sixth International Language Resources and Evaluation Conference (LREC).
Newman, M.E.J., 2003. The structure and function of complex networks. SIAM Review 45 (2), 167–256.
Rapp, R., 2005. A practical solution to the problem of automatic part-of-speech induction from text. In: Conference Companion Volume of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL-05), Ann Arbor, Michigan, USA.
Schütze, H., 1993. Part-of-speech induction from scratch. In: Proceedings of the 31st Annual Meeting of the Association for Computational Linguistics, Morristown, NJ, USA, pp. 251–258.
Sigman, M., Cecchi, G.A., 2002. Global organization of the Wordnet lexicon. Proceedings of the National Academy of Sciences 99 (3), 1742–1747.
Soares, M.M., Corso, G., Lucena, L.S., 2005. The network of syllables in Portuguese. Physica A: Statistical Mechanics and its Applications 355 (2–4), 678–684.
Solan, Z., Horn, D., Ruppin, E., Edelman, S., 2005. Unsupervised learning of natural languages. Proceedings of the National Academy of Sciences 102 (33), 11629–11634.
Tamariz, M., 2005. Exploring the Adaptive Structure of the Mental Lexicon. PhD Thesis, Department of Theoretical and Applied Linguistics, University of Edinburgh.
Vitevitch, M.S., 2004. Phonological Neighbors in a Small World. Ms., University of Kansas.

Animesh Mukherjee a

Monojit Choudhury b

Samer Hassan c

Smaranda Muresan d

a Institute for Scientific Interchange (ISI), Viale Settimio Severo 65, 10133 Torino, Italy
b Microsoft Research Lab India, Bangalore 560080, India

c Language Information Technologies Lab, University of North Texas, TX 76201, United States
d School of Communication and Information, Rutgers University, NJ 08901, United States