a post-detection decipherment matrix


Acta Astronautica 61 (2007) 712–715
www.elsevier.com/locate/actaastro

A post-detection decipherment matrix

John Elliott∗

Computational Intelligence Research Group, School of Computing, Leeds Metropolitan University, Leeds LS6 3QS, UK

Received 15 February 2006; accepted 6 October 2006
Available online 17 April 2007

Abstract

In any decipherment of an unknown language, whether it may be from our own antiquity or from a non-terrestrial source, it is critical to realising an efficient and successful conclusion that amongst the first stages of analysis we isolate its genetic affinity (α). No decipherment, unless a crib is available, has ever been successful without this. In order that a more scientific approach is developed, rather than relying on intuitions and good guesses, a corpus of scripts (both ancient and modern) is to be developed to underpin structural analysis, which is generic for any decipherment. It is intended that such a resource will facilitate the structural interrogation and identification of linguistic features, with contemporary languages, to ascertain affinities with known scripts.

© 2007 Elsevier Ltd. All rights reserved.

1. Introduction

To date, no decipherment of an ancient or unknown language has been achieved by a cryptographer, crypto-palaeographer or linguist using robust scientific methods, or without the aid of a crib: decipherment has typically relied on the insights and good guesses of hobbyists from unrelated disciplines. This illustrates how ‘courageous’ a strategy of relying on established algorithms and brute-force cryptanalysis techniques to attempt a decipherment is likely to be. Articles, books and even the film Contact portray post-detection signal decipherment as a task for ‘imported’ cryptographers. However, such expertise and methodologies are reliant on the premise that the data are an encoded form of a known ‘system’, which has a high probability of not being the case here. It is therefore submitted that such a ‘signal’ is likely to present the antithesis of the expected norm for decryption techniques: a plain text representation of an unknown system.

∗ Tel.: +44 0113 2832600x5157; fax: +44 0113 2833182. E-mail address: [email protected].

0094-5765/$ - see front matter © 2007 Elsevier Ltd. All rights reserved. doi:10.1016/j.actaastro.2007.02.006

To enable a realistic attempt at a decipherment of an unknown language, it is the universal attributes and behavioural characteristics of language structure that first need to be modelled and understood [1,2]. This will then provide the necessary methodologies and algorithms to detect such hierarchical layers, which comprise intelligent, complex communication. As part of this ‘toolkit’, it is submitted that all known systems (language parameters) need to be structurally analysed to ‘place’ their ‘system’ within a language matrix. This will need to include all known languages, whether ‘living’ (in current use) or ancient; this must also include endeavours to incorporate yet undeciphered scripts, to provide as complete a picture as possible. In creating such a relational matrix, post-detection decipherment will be assisted by a structural ‘map’ that will have the potential for ‘placing’ an alien communication with its nearest known ‘neighbour’, to assist subsequent categorisation of basic parameters as a precursor to decipherment.


2. Problem statement

The problem addressed is the structural interrogation of an unknown script and the identification of its linguistic relationships with contemporary languages, in order to establish a relational affinity as a precursor to decipherment.

3. Existing solutions

Historically, in any attempt at deciphering an unknown language, a preliminary step is to compile a catalogue of all the different characters that occur in the script. This is important because the number of characters in a script’s symbol set provides a clue to whether it is an alphabet, syllabary or logography: usually in the order of 20–30, 70–120 or 300 plus, respectively. ‘Nearly all successful decipherments have involved clues through the script being a language that was familiar or very like a known language’ [3]. Where such clues do not exist, or the familiar clues are too few to provide a useful key, decipherment has generally relied on the discovery of bilingual or multilingual inscriptions, where the main task is then to ‘map’ the known meaning correctly to the unknown. Exemplars of such ‘keys’ are the ancient Egyptian Hieroglyphs, whose decipherment was aided by the bilingual inscriptions of the Rosetta stone, and the Cuneiform script, where a trilingual inscription provided the crib.
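The inventory-size heuristic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `guess_script_type` is hypothetical, the ranges are the approximate ones quoted in the text, and any inventory that falls between them is simply reported as indeterminate.

```python
def guess_script_type(symbols):
    """Guess a script type from the number of distinct symbols observed.

    The ranges follow the rough figures quoted in the text (20-30, 70-120,
    300 plus); sizes outside them are labelled 'indeterminate' (an assumption).
    """
    inventory = len(set(symbols))
    if 20 <= inventory <= 30:
        kind = "alphabet"
    elif 70 <= inventory <= 120:
        kind = "syllabary"
    elif inventory >= 300:
        kind = "logography"
    else:
        kind = "indeterminate"
    return inventory, kind


# Toy usage: an English pangram contains 26 distinct letters.
sample = "the quick brown fox jumps over the lazy dog".replace(" ", "")
print(guess_script_type(sample))
```

A short inscription will under-count the true symbol set, so in practice the estimate would only be trusted once the count of new symbols has stopped growing with sample size.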

In addition, the Hittite language was deciphered after a ‘good guess’ as to the nature of its related languages, and the Creto-Mycenaean inscriptions were deciphered on the assumption that the language was Greek. Even the decipherment of Linear B was ultimately assisted by the realisation of its similarities with Greek. Additional scripts deciphered on the basis of their relationship with a familiar language or related script were the Brahmi script of Ashokan India, the Cypriote syllabary and the Himyaritic script of Southern Arabia [3,4].

These intuitions, cribs and ‘good guesses’ cannot be the foundations of either good practice or a sound theoretical base on which to devise strategies for subsequent detection and decipherment. Nevertheless, the particular relevance of these past experiences is that some of the techniques used to aid disambiguation and the realisation of correspondences are well suited to the computational task set here. In particular, the utilisation of contextual, morphological and distributional analysis of characters and words has been of great benefit in identifying and classifying vowels, consonants and determiners (word classifiers, such as in the Egyptian Hieroglyphs). The decipherment of Ugaritic, Linear B and the Turkic runes particularly benefited from such textual analysis before ultimately relying on familiar keys. It is therefore these techniques that I will adopt as candidate tools to assist in my research. Unfortunately, the successful decipherments documented have predominantly been reliant upon hobbyists using their intuitions on an ad hoc basis, devoid of any rigorous formal methods.

4. Proposed solution

Alphabetic and syllabic scripts convey a transparency to speech phonetics, which pictographic, logographic and rebus scripts do not. Due to this, the two script groups present different initial challenges for the matrix. However, in order that the matrix is a fully integrated set of relationships, a method of mapping normalised structural phenomena for these two orthographically disparate groups is required.

To derive a membership feature set that will represent language structure and measure genetic affinity, a representative set of known languages will be compiled and analysed to ascertain vector types and their associated weights. Using resources that also include grammatically tagged corpora will assist this process. This quasi-hidden layer is incorporated for modelling aspects of syntax that ‘meet’ semantics, whilst remaining features that are computationally tractable.

Therefore, it is proposed that by compiling a profile of linguistic features and their behavioural characteristics for a given script, classification of affinity to other scripts can be achieved.

Formally, the mathematical description and measurement of inter-language affinity is given as

$$\alpha(L) = (u \,\|\, k) = \sum_{i=1}^{n} u(x_i) \log \frac{u(x_i)}{k(x_i)},$$

$$U \Leftrightarrow K \;\{\text{iff �� } r \subset R\}, \qquad r \subset R \;\{r \Rightarrow \in\} \quad (u \equiv k, \text{ where } \alpha = 0),$$

$$R_{n \to \infty} = \sum_{u,k}^{n,m} (\alpha),$$

where $L$ is the language script, $U$ the unknown script (the script to be classified), $K$ the known script, $x_i$ the feature under analysis, $R$ the set of relations in the matrix and $r$ the linguistically related subset of $R$.
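As an illustration of the first expression, which has the form of a relative entropy between the feature distribution u of the unknown script and the distribution k of a known script, the following Python sketch computes α for two toy samples. The function names, the toy data and the additive smoothing (used so that k(x) is never zero) are assumptions made for this sketch and are not taken from the paper.

```python
import math
from collections import Counter


def feature_distribution(observations, vocabulary, smoothing=1.0):
    """Relative frequencies over a shared vocabulary, with additive
    smoothing (an assumption here) so that no probability is zero."""
    counts = Counter(observations)
    total = sum(counts[x] for x in vocabulary) + smoothing * len(vocabulary)
    return {x: (counts[x] + smoothing) / total for x in vocabulary}


def alpha(u, k):
    """Sum over features of u(x) * log(u(x) / k(x)); zero means a perfect fit."""
    return sum(u[x] * math.log(u[x] / k[x]) for x in u)


# Toy feature observations (hypothetical): symbols of two short samples.
unknown = list("abababcc")
known = list("abcccccc")
vocabulary = sorted(set(unknown) | set(known))

u = feature_distribution(unknown, vocabulary)
k = feature_distribution(known, vocabulary)

print(alpha(u, u))  # identical distributions give 0.0
print(alpha(u, k))  # dissimilar distributions give a positive value
```

Under this reading, placing an unknown script with its nearest known neighbour amounts to choosing the known script whose distribution gives the smallest α.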

5. Analytical analysis

To provide an intuitive metric, the value of α (alpha) is inversely related to affinity. Therefore, a value of zero denotes a perfect fit, and increasing values indicate an ever-increasing absence of affinity. The following examples of the analysis of language structure illustrate metrics that will comprise the feature set for inter-language comparison.

One method that has been ‘employed’ in recent language classification (often ambiguously termed language identification) systems is the use of letter n-gram frequencies. Using this method, the source language can be determined with great accuracy using the frequency of its letter bigrams and/or trigrams from a very small sample of text [5]. Much work has been conducted in this sphere of research over the past few years (e.g., [6,7]), spawning many similar analytical tools that centre on this specific task. Adapting this knowledge can provide additional information and glean insights into another universal attribute: orthotactic constraints. It is worth noting, though, that analysing excessive amounts of data can introduce a convergence problem, where exceptions to normal characteristics begin to affect statistics. It is therefore intended that only a few thousand words be used to derive orthotactic profiles for this metric.
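A minimal sketch of letter n-gram profiling is given below in Python. It is not the method of [5–7]: the function names, the toy reference strings and the distance measure (a sum of absolute differences between relative frequencies) are all assumptions chosen to keep the example short.

```python
from collections import Counter


def ngram_profile(text, n=2):
    """Relative frequencies of letter n-grams (letters and spaces only)."""
    cleaned = "".join(c for c in text.lower() if c.isalpha() or c == " ")
    grams = Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))
    total = sum(grams.values())
    return {gram: count / total for gram, count in grams.items()}


def profile_distance(p, q):
    """Sum of absolute frequency differences over all n-grams seen."""
    return sum(abs(p.get(g, 0.0) - q.get(g, 0.0)) for g in set(p) | set(q))


def identify(sample, references, n=2):
    """Return the name of the reference text with the closest profile."""
    sample_profile = ngram_profile(sample, n)
    return min(
        references,
        key=lambda name: profile_distance(
            sample_profile, ngram_profile(references[name], n)
        ),
    )


# Toy reference texts (hypothetical); real profiles would be derived
# from a few thousand words per language, as suggested in the text.
references = {
    "english": "the cat sat on the mat and the dog sat on the log",
    "german": "der hund und die katze sitzen auf der matte und schlafen",
}
print(identify("the dog sat on the mat", references))
```

The same profiles also expose orthotactic constraints: n-grams whose frequency is zero or near zero in a sufficiently large sample indicate letter sequences that the script disallows.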

Secondly, as a precursor to a planned study of combinational constraint behaviour across many different language families, intended to ascertain whether the behaviour of core parts of speech, irrespective of their encoding strategies, displays evidence of a generic cohesive template, two languages were chosen as exemplar comparators: Chinese [8,9] and English [10]. These provided an opportunity to compare the two very different orthographic systems of a Sino-Tibetan logographic script and an Indo-European (West Teutonic) alphabetic language, thereby facilitating a robust testing regime to ascertain whether underlying similarities can be detected.

All parts-of-speech pairs were analysed for combinational constraint behaviour over a window of 10 words to reflect cognitive constraints, and results were recorded in accordance with the key described below:

Key to Table 1:

Z = zero occurrences at bigram, or at the offset specified
� = very weak bonding (near zero) at bigram occurrences: <5%
� = strong bonding at bigram co-occurrences: frequency 20% �x
∗ = indicates opposing cohesive trend when parts of speech are reversed
�n = high peak beyond bigram at offset distance of ‘n’: increase in frequency 20% �x
� = flat distribution across offsets (bigram bonding evident): no offsets deviate 20% �x

Table 1. Summary of results for bi-directional cohesion comparison between English and Chinese

| | Noun | Adjective | Adverb | Prep. | Conjunction | Verb | Article |
|---|---|---|---|---|---|---|---|
| Noun | �∗ | �2 | �2 | � | � | �∗ | �2 |
| Adjective | �2 | � | � | �5,8,10 | �4,6,9 | �, �3 | Z, �� |
| Adverb | Z, �2,5 | �∗ | �∗ | �, �2,3,9 | �9 | �∗ | Z, �2,3,5,7 |
| Prep. | �∗ | �8 | �4,8 | �, �3,5 | Z, � | �∗2 | �, �6,10 |
| Conjunction | �∗ | � | �3,8 | � | �, �6,7,10 | �∗ | Z, �2,3 |
| Verb | � | �2 | � | � | �4,6 | �, �2 | � |
| Article | �∗ | Z, �2 | Z, �6,8 | Z, �6 | Z, �5,8 | �∗ | Z, �3 |

Table 1 is a summary of the findings, which indicate that the interactive behaviour between the core parts of speech of the two languages is in fact remarkably similar. These results support the hypothesis that the way we weave our respective ontological descriptions of the world around us, when communicating, is in fact constrained by general binding rules. The one area where differences are seen to occur, for this experiment, is with immediate bonding of articles with some of the descriptive parts of speech: specifically, with verbs, adjectives and adverbs, when preceded by articles and where conjunctions and adverbs are immediate priors. It is believed the restricted set of articles used in the Chinese language is the root cause of this effect.

The classification for the similarity criteria depicted above was extremely conservative, and even those pairs classified as only similar were in fact showing many closely related trends in the bonding at the varying degrees of separation. These findings have helped further indicate and corroborate that cross-language structural consistency occurs to the highest syntactic level of abstraction, irrespective of orthography.

Although the overall summary of linguistic cohesion, for selected pairs of parts of speech, is subject to a small amount of subjective assessment upon manual compilation, the underlying analysis is a purely objective procedure. This involves detecting immediate strong and weak bonding, repulsion and computational analysis of mode occurrence above threshold values, against otherwise diminishing frequencies, at varying degrees of separation. The system also recognises flat distributions across all offsets and indicates opposing cohesive trends when parts of speech are reversed.
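The procedure described above can be sketched as follows in Python. This is an illustrative reading rather than the system used in the study: the function names, the toy tag sequence and the label ‘neutral’ are hypothetical, and the thresholds (a bigram share below 5% for weak bonding, and counts 20% above or within 20% of the mean across offsets for strong bonding, peaks and flat distributions) are one assumed interpretation of the key to Table 1.

```python
from collections import Counter


def offset_profile(tags, first, second, window=10):
    """Count occurrences of `second` at each offset 1..window after `first`."""
    counts = Counter()
    for i, tag in enumerate(tags):
        if tag != first:
            continue
        for offset in range(1, window + 1):
            j = i + offset
            if j < len(tags) and tags[j] == second:
                counts[offset] += 1
    return [counts[offset] for offset in range(1, window + 1)]


def summarise(profile, weak_share=0.05, deviation=0.20):
    """Label bigram bonding, peaks beyond the bigram, and flatness.

    Thresholds are assumptions: 'weak' if the bigram holds under 5% of all
    co-occurrences, 'strong' or 'peak' if a count is 20% above the mean.
    """
    total = sum(profile)
    if total == 0:
        return {"bigram": "Z", "peaks": [], "flat": False}
    mean = total / len(profile)
    bigram = profile[0]
    if bigram == 0:
        label = "Z"
    elif bigram / total < weak_share:
        label = "weak"
    elif bigram >= (1 + deviation) * mean:
        label = "strong"
    else:
        label = "neutral"
    peaks = [
        offset
        for offset, count in enumerate(profile, start=1)
        if offset > 1 and count >= (1 + deviation) * mean
    ]
    flat = all(abs(count - mean) <= deviation * mean for count in profile)
    return {"bigram": label, "peaks": peaks, "flat": flat}


# Toy tag sequence (hypothetical): article, noun, verb repeated.
tags = ["ART", "NOUN", "VERB"] * 20
forward = summarise(offset_profile(tags, "ART", "NOUN"))
reverse = summarise(offset_profile(tags, "NOUN", "ART"))
print(forward)
print(reverse)
```

Comparing the forward and reverse summaries for the same pair is what would reveal an opposing cohesive trend when the parts of speech are reversed.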

The summary of these results shows that after comparing what are seemingly two very dissimilar scripts, using a variety of representative collocations, evidence indicates that the underlying bonding behaviour displays high degrees of similarity, suggesting genetic affinity, for this feature, a conclusion corroborated by the knowledge that common word order and aspects of morphology do in fact exist.

This then supports the hypothesis that bi-directional cohesion of linguistic objects, at degrees of separation, aids detection and subsequent classification of pattern behaviour. And the sub-goal, to determine whether particular types of linguistic objects display evidence of ‘bonding’ and ‘repulsion’ that could be detected from surface structure, is achieved.

These physical and behavioural characteristics of linguistic features in a sample are then calculated as a precursor to deducing general principles for ‘typing’ and clustering into syntactico-semantic lexical classes.

6. Conclusions

Ultimately, these measured characteristics, which comprise the behavioural ‘footprint’ of a given script, will provide a set of comparable measures by which its affinity can be measured against another linguistic ‘system’. The comparison of measures then equates to the summation ($\sum_{1}^{n}$) that informs the value alpha (the genetic affinity), which is then evaluated against a threshold derived from known empirical evidence.

The primary aim of this research is to provide academics with the facility to interrogate undeciphered scripts as a precursor to decipherment. However, it is hoped that through such analytical methods, a deeper understanding of language structure and an individual language’s true structural make-up, which may facilitate more accurate and consistent classification, will also be gleaned.

References

[1] J. Elliott, E. Atwell, B. Whyte, Language identification in unknown signals, in: Proceedings of COLING’2000, 18th International Conference on Computational Linguistics, 2 vols., Morgan Kaufmann, Los Altos, CA, 2000, pp. 1021–1026, ISBN: 1-55860-717-X.

[2] J. Elliott, The filtration of inter-galactic objets trouvés and the identification of the lingua ex machina hierarchy, in: Proceedings of World Space Congress, the 53rd International Astronautical Congress, 2002, IAA-02-IAA.9.2.10.

[3] P. Daniels, W. Bright, The World’s Writing Systems, Oxford University Press, Oxford, UK, 1996.

[4] M. Pope, The Story of Decipherment, Thames and Hudson, London, UK, 1999.

[5] C. Souter, G. Churcher, J. Hayes, J. Hughes, S. Johnson, Natural language identification using corpus-based models, Hermes Journal of Linguistics 13 (1994).

[6] G. Grefenstette, Comparing two language identification schemes, in: Proceedings of the Third International Conference on Statistical Analysis of Textual Data (JADT), Rome, Italy, 1995.

[7] T. Dunning, Statistical identification of language, Technical Report MCCS94-273, New Mexico State University, USA, 1994.

[8] S. Piao, Sentence and word alignment between Chinese and English, Ph.D. Thesis, Lancaster University, 2000.

[9] S. Piao, Chinese corpus adapted from CEPC corpus, Sheffield University, Sheffield, UK, 2000.

[10] S. Johansson, E. Atwell, R. Garside, G. Leech, The Tagged LOB Corpus: Users’ Manual, ICAME, The Norwegian Computing Centre for the Humanities, Bergen University, Norway, 1986. Available from: 〈http://www.hit.uib.no/icame/lobman/lob-cont.html〉.