



Acta Astronautica 68 (2011) 389–398

0094-5765/$ - see front matter © 2010 Elsevier Ltd. All rights reserved.
doi:10.1016/j.actaastro.2010.04.013

journal homepage: www.elsevier.com/locate/actaastro

First stage identification of syntactic elements in an extra-terrestrial signal

John Elliott

Computational Intelligence Research Group, School of Computing, Leeds Metropolitan University, Leeds LS1 3HE, England

Article info

Article history:

Received 27 February 2009

Received in revised form

1 March 2010

Accepted 16 April 2010
Available online 12 June 2010

Keywords:

Language

Entropy

Extraterrestrial

Cognition

Audio

Structure

Decipherment

Tel.: +44 113 812 7379
E-mail address: [email protected]

Abstract

By investigating the generic attributes of a representative set of terrestrial languages at varying levels of abstraction, it is our endeavour to isolate elements of the signal universe which are computationally tractable for its detection and structural decipherment. Ultimately, our aim is to contribute in some way to the understanding of what ‘languageness’ actually is. This paper describes algorithms and software developed to characterise and detect generic intelligent language-like features in an input signal, using natural language learning techniques: looking for characteristic statistical ‘‘language-signatures’’ in test corpora. As a first step towards such species-independent language-detection, we present a suite of programs to analyse digital representations of a range of data, and use the results to extrapolate whether or not there are language-like structures which distinguish this data from other sources, such as music, images, and white noise.

© 2010 Elsevier Ltd. All rights reserved.

1. Introduction

Having detected a signal from an extra-terrestrial source, which satisfies criteria indicating language-like structures at a physical level [1,3,8], second stage analysis is required to begin the process of identifying internal grammar components that constitute the basic building blocks of the symbol system.

Unlike traditional natural language processing, a solution cannot be assisted using vast amounts of training data with well-documented ‘legal’ syntax and semantic interpretation. Using computational linguistic universals derived from analysing a representative sample set of the human chorus, the algorithms developed are designed to work unsupervised and without in-built prior knowledge, for the filtration of inter-galactic Objets Trouvés and the decoding of an unknown signal’s grammar structure.


With the use of embedded clauses and phrases, we are able to represent an expression or description, however complex, as a single component of another description. This allows us to build up complex structures far beyond our otherwise restrictive cognitive capabilities. It is this universal hierarchical structure, together with the essential ontological requirements for describing the world around us (external), and the relationships between grammatical structures (internal) evident in all human languages, and necessary for any advanced communicator, that constitute the next phase in the signal’s analysis.

Given this, the first step to interpretation is to identify these language-like features, detecting where word chunks and phrase-like boundaries occur. It is from these basic syntactic units that the analysis of behavioural trends and inter-relationships amongst terminals and non-terminals alike begins to unlock the encoded internal grammatical structure; clustering into syntacto-semantic classes and indicating candidate parts-of-speech.

Therefore, at this stage we are endeavouring to identify features of language surface structure which are universal – or at least quasi-universal – to the world’s major language families, irrespective of their adopted scripts or lexical encoding strategies. It is by using these features that algorithms developed from such information will enable us to extract core syntactic elements without the necessity of a primer or universal Rosetta stone, and making as few assumptions as possible.

The problem goal is to separate language from non-language without dialogue, and learn something about the structure of language in the passing. The language may not be human (animals, aliens, computers…), the perceptual space can be unknown, and we cannot assume human language structure but must begin somewhere. We need to approach the language signal from a naive viewpoint, in effect increasing our ignorance and assuming as little as possible.

Given this standpoint, an informal description of ‘language’ might include that it:

- has structure at several interrelated levels
- is not random
- has grammar
- has letters/characters, words, phrases and sentences
- has parts of speech
- is recursive
- has a theme with variations
- is aperiodic but evolving
- is generative
- has transformation rules
- is designed for communication
- has Zipfian type-token distributions at several levels

Language as a ‘signal’:

- has some signalling elements (a ‘script’)
- has a hierarchy of signalling elements? (‘words’, ‘phrases’, etc.)
- is serial?
- is correlated across a distance of several signalling elements, applying at various levels in the hierarchy
- is usually not truly periodic
- is quasi-stationary?
- is non-ergodic?

We assume that a language-like signal will be encoded symbolically, i.e. with some kind of character-stream. Our language-detection algorithm for symbolic input uses a number of statistical clues such as entropy, ‘‘chunking’’ to find character bit-length and boundaries, and matching against a Zipfian type-token distribution for ‘‘letters’’ and ‘‘words’’.

In addition to the search for such decoding strategies of language scripts, and thereby the understanding of what language structure actually is, is the analysis of audio signals for their structural features, to facilitate the discrimination of language-like signals from non-language phenomena. To further enable rigorous analysis of this problem we have continued to look at a variety of sound/signal samples, which also include the ‘sounds of space’ tape compiled by Dr Cullers. To date our results remain encouraging and we are continuing to positively discriminate language – whether distorted or subject to interference – from ‘other’ natural phenomena.

This paper describes the algorithms and software developed for these purposes, including a visualisation tool to facilitate the examination of annotation-combination space topology and approach vectors.

2. Identifying structure and the ‘character set’

2.1. Revisiting entropy

The initial task, given an incoming bit-stream, is to identify whether a language-like structure exists and, if detected, what are the unique patterns/symbols which constitute its ‘character set’. A visualisation of the alternative possible byte-lengths is gleaned by plotting the entropy calculated for a range of possible byte-lengths.

In ‘real’ decoding of unknown scripts it is accepted that identifying the correct set of discrete symbols is no mean feat [12]. To make life simple for ourselves we assume a digital signal with a fixed number of bits per character. Very different techniques are required to deal with audio or analogue equivalent waveforms [1,3]. We have reason to believe that the following method can be modified to relax this constraint, but this needs to be tested further. The task then reduces to trying to identify the number of bits per character. Given that the probability of symbol i is P_i, the message entropy of a string of length N will be given by the first order measure

E = −SUM[P_i ln P_i], i = 1, N

If the signal contains merely a set of random digits, the expected value of this function will rise monotonically as N increases. However, if the string contains a set of symbols of fixed length representing a character set used for communication, it is likely to show some decrease in entropy when analysed in blocks of this length, because the signal is ‘less random’.

Of course, we need to analyse blocks that begin and end at character boundaries. We simply carry out the measurements in sliding windows along the data. In Fig. 1, we see what happens when we applied this to samples of 7-bit ASCII text, Chinese Big 5 encoded text and image data. We notice a clear drop, as predicted, for bit lengths of 7, 14 and 21 for ASCII text, which equate to unigrams, bigrams and trigrams, respectively. Chinese shows a similar drop at 14 bits, which is where its single symbols are encoded. In contrast, the image data results show that at no given segmentation can any decline in the entropic value be detected: values increase monotonically with n, as predicted for non-language-like phenomena – data which, although it stores information, has a structure that equates to random events or ‘noise’. Modest progress though it may be, it is not unreasonable to assume that the first piece of evidence for the presence of language-like structure would be the identification of a low-entropy character set within the signal. Thresholding at

[L]_e = {t1 ≤ H1c ≤ t2}

where t1 = H1p − 1; t2 = H1n + 1; [L]_e is the physical language structure; H1c the current entropic value; H1p the previous entropic value; and H1n the next entropic value.
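The byte-length search described above can be sketched as follows. This is a minimal illustration, not the author's original code; the function names `block_entropy` and `candidate_char_widths` are hypothetical. It computes first-order block entropy for each candidate bit width and normalises per bit, so that a dip marks a candidate character width:

```python
from collections import Counter
from math import log2

def block_entropy(bits: str, width: int) -> float:
    """First-order entropy (bits) of the stream read in non-overlapping
    blocks of `width` binary digits."""
    blocks = [bits[i:i + width] for i in range(0, len(bits) - width + 1, width)]
    counts = Counter(blocks)
    n = len(blocks)
    return -sum((c / n) * log2(c / n) for c in counts.values())

def candidate_char_widths(bits: str, max_width: int = 24) -> dict:
    """Per-bit entropy for each candidate width.  Random data stays flat;
    a character-coded signal dips at the true width (e.g. 7 for ASCII)."""
    return {w: block_entropy(bits, w) / w for w in range(1, max_width + 1)}

# 7-bit-coded text shows a dip at width 7 that random data would not.
bits = "".join(format(ord(ch), "07b") for ch in "abcd" * 200)
profile = candidate_char_widths(bits, 10)
```

In practice the measurement would also be repeated at every bit offset (the sliding window above), since an unknown signal need not start on a character boundary.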

Once detected, the bit-length where a candidate language signal is encoded can then be analysed for evidence of internal structure.

This next level of analysis, where higher-order entropic values calculate the conditional probability given prior information, will then detect such structure if present, and add further credence to the belief that the signal carries information. This assumption is based on all known natural intelligent communication devices embedding such information, allowing complex structures to be built far beyond our otherwise restrictive cognitive capabilities.
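The higher-order measure can be estimated from n-gram counts via the chain rule, H(order) − H(order − 1). A minimal sketch, assuming nothing about the paper's implementation (the helper names are mine):

```python
from collections import Counter
from math import log2

def entropy_of_ngrams(symbols: str, k: int) -> float:
    """Shannon entropy (bits) of the k-gram distribution of the stream."""
    grams = Counter(symbols[i:i + k] for i in range(len(symbols) - k + 1))
    n = sum(grams.values())
    return -sum((c / n) * log2(c / n) for c in grams.values())

def conditional_entropy(symbols: str, order: int) -> float:
    """Entropy of the next symbol given the previous (order - 1) symbols.
    Falls as order rises for structured (language-like) data; stays near
    the first-order value for random data."""
    if order == 1:
        return entropy_of_ngrams(symbols, 1)
    return entropy_of_ngrams(symbols, order) - entropy_of_ngrams(symbols, order - 1)

# A fully predictable stream: first-order entropy 2 bits, conditional
# entropy at order 2 collapses towards zero.
text = "abcd" * 500
h1 = conditional_entropy(text, 1)
h2 = conditional_entropy(text, 2)
```

Note that higher orders need rapidly more data: sparse n-gram counts bias the estimate downwards even for random input, so sample size must be checked before reading a fall as structure.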

To test this assumption, a series of tests was performed on language and non-language data. These comprised a variety of natural languages, randomly generated text and the DNA genome for E. coli [11]. Results, which are summarised in Fig. 2 below, show that increasing the conditional order at this character level lowers the entropic value for language. In contrast to this, both the random text and DNA show no appreciable variation, approximately retaining their first-order values and therefore displaying no evidence of internal structure up to fifth order. As well as calculating the internal dynamics between characters, tests were performed at word level where, interestingly, after initial higher values for first order, subsequent higher-order results show language at word level mirroring the internal dynamics of characters. This may well indicate a relationship between the syntactic and semantic content encoded, and will be an avenue of further investigation.

Fig. 1. First order entropy (first-order entropy vs. bit chunk length; series: English alphabet, Chinese symbols, image data; unigram, bigram and trigram sample sizes).

Fig. 2. Higher-order entropic trends (entropic values vs. order; series: characters in natural language, words in natural language, DNA, random text).

2.2. Ngrams

At this stage, and complementary to the calculation of a system’s entropy, is the analysis of character ngrams. The analysis of occurrences of character combinations can indicate the redundancy in a system and its ‘legal’ combinations, which equate to the sound-combination constraints of its exponents. Representative languages across the majority of all language groups – thereby representing in excess of 90% of the world’s population – were analysed against randomly generated text and DNA, and results were compared for evidence of this being a consistent and uniquely language-like metric. Results show that, for all languages tested, a marked drop of greater than 25% occurs in the percentage of ‘legal’ trigrams relative to bigrams: all with similar negative slopes irrespective of sample length or script. Randomly generated text almost immediately generates 100% of bigram combinations, its trigrams rapidly moving towards, and achieving, the same coverage. These results are uncharacteristic of language, where high ratios of redundancy are evident. However, the natural system comparator of DNA does display similar redundancy, and could not be filtered out by this metric alone. Fig. 3 shows how the length of a sample can affect the percentage of ngrams. From this, it can be conjectured that as little as 3000 characters can reliably indicate the presence of a random event.

Fig. 3. Occurrences of Ngrams (percentage of possible combinations observed vs. sample length, for natural language and random bigrams and trigrams).

Fig. 4. Candidate word-length distributions using the 3 most frequent characters (frequency vs. word length).
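The coverage metric above can be stated compactly: the percentage of possible n-grams over the observed alphabet that actually occur. A sketch under my own naming (`ngram_coverage` is not from the paper), with synthetic random and English-like samples for comparison:

```python
from collections import Counter
from random import Random

def ngram_coverage(text: str, n: int) -> float:
    """Percentage of possible n-grams over the observed alphabet that occur.
    Random streams quickly approach 100%; language leaves a marked
    shortfall, reflecting its 'legal' combination constraints."""
    alphabet = set(text)
    observed = {text[i:i + n] for i in range(len(text) - n + 1)}
    return 100.0 * len(observed) / (len(alphabet) ** n)

rng = Random(0)
random_text = "".join(rng.choice("abcde") for _ in range(5000))
english_text = ("the quick brown fox jumps over the lazy dog "
                "and the slow black cat sleeps near the warm fire ") * 5
```

The comparison is only fair at matched sample lengths, which is exactly the dependence Fig. 3 charts: a short sample under-reports coverage for random text too.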

3. Identifying ‘words’

The next task, still below the stages normally tackled by NLL researchers, is to chunk the incoming character-stream into words. Looking at a range of (admittedly human language) text, if the text includes a space-like word-separator character, this will be the most frequent character. So, a plausible hypothesis would be that the most frequent character is a word-separator; then plot type-token frequency distributions for words, and for word-lengths. If the distributions are Zipfian, and there are no significant ‘outliers’ (very large gaps between ‘spaces’ signifying very long words), then we have evidence corroborating our space hypothesis; this also corroborates our byte-length hypothesis, since the two are inter-dependent.

Again, work by cryptopaleologists suggests that, once the character set has been found, the separation into word-like units is not trivial, and again we cheat slightly: we assume that the language possesses something akin to a ‘space’ character. Taking our entropy measurement described above as a way of separating characters, we now try to identify which character represents ‘space’. It is not unreasonable to believe that, in a word-based language, it is likely to be one of the most frequently used characters.

Using a number of texts in a variety of languages, we first identified the top three most used characters. For each of these we hypothesised in turn that it represented ‘space’. This then allowed us to segment the signal into word-like units (‘words’ for simplicity). We could then compute the frequency distribution of words as a function of word length, for each of the three candidate ‘space’ characters (see Fig. 4).
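The procedure just described reduces to a few lines. This is an illustrative sketch, not the paper's code; `word_length_profiles` is a name of my own choosing:

```python
from collections import Counter

def word_length_profiles(text: str, k: int = 3) -> dict:
    """For each of the k most frequent characters, hypothesise it to be the
    word separator and return the resulting word-length frequency
    distribution (length -> count)."""
    profiles = {}
    for ch, _ in Counter(text).most_common(k):
        lengths = [len(chunk) for chunk in text.split(ch) if chunk]
        profiles[ch] = Counter(lengths)
    return profiles

sample = "the cat sat on the mat and the dog ran to the red barn"
profiles = word_length_profiles(sample)
```

The true separator is then the candidate whose distribution is peaked at short lengths with a Zipf-like tail and no extreme outliers; a flat or long-tailed profile argues against that candidate.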

It can be seen that one ‘separator’ candidate (unsurprisingly, in fact, the most frequent character of all) results in a very varied distribution of word lengths. This is an interesting distribution which, on the right-hand side of the peak, approximately follows the well-known ‘law’ according to Zipf [7], which predicts this behaviour on the grounds of minimum effort in a communication act. Conversely, results similar to the ‘flatter’ distributions above, obtained when using the most frequent character, would likely indicate the absence of word separators in the signal.

To ascertain whether the word-length frequency distribution holds for language in general, multiple samples from 20 different languages from Indo-European, Bantu, Semitic, Finno-Ugrian and Malayo-Polynesian groups were analysed (see Fig. 5). Using statistical measures of significance, it was found that most groups fell well within 5% limits – only two individual languages came near to exceeding these limits.

Fig. 5. Word length frequency (frequency % vs. word length for word-length distributions in multiple samples from Indo-European, Semitic, Finno-Ugrian, and Malayo-Polynesian language groups).

Zipf’s law is a strong indication of language-like behaviour. It can be used to segment the signal provided a ‘space’ character exists. However, we should not assume Zipf to be an infallible language detector. Natural phenomena such as molecular distribution in yeast DNA possess characteristics of power laws [4]. Nevertheless, it is worth noting that such non-language possessors of power-law characteristics generally display distribution ranges far greater than language, with long repeats far from each other [1]; characteristics detectable at this level, or at least by higher-order entropic evaluation.
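A common operational form of the Zipf test is the slope of the rank-frequency distribution on log-log axes, which the law predicts to be near −1 for natural-language words. A sketch (my own helper name, fitted by ordinary least squares):

```python
from collections import Counter
from math import log

def zipf_slope(tokens) -> float:
    """Least-squares slope of log(frequency) against log(rank).
    Zipf's law predicts a slope near -1 for natural-language words;
    as noted in the text, this is necessary but not sufficient evidence."""
    freqs = sorted(Counter(tokens).values(), reverse=True)
    xs = [log(r) for r in range(1, len(freqs) + 1)]
    ys = [log(f) for f in freqs]
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    num = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    den = sum((x - mx) ** 2 for x in xs)
    return num / den

# A synthetic, exactly Zipfian sample: word i occurs about 1000/i times.
sample = [f"w{i}" for i in range(1, 51) for _ in range(round(1000 / i))]
slope = zipf_slope(sample)
```

Given the yeast-DNA caveat above, a slope near −1 should be read as one clue among several, not as a verdict on its own.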

Fig. 6. Function word separation in English (frequency vs. number of words between candidate functional words).

4. Identifying ‘phrase-like’ chunks

Although alien brains may be more or less powerful than ours [10], it is reasonable to assume that all intelligent problem solvers are subject to the same ultimate constraints of computational power and storage, and their symbol systems will reflect this.

Thus, language must use small sets of rules to generate a vast world of implications and consequences. Perhaps its most important single device is the use of embedded clauses and phrases [5], with which to represent an expression or description, however complex, as a single component of another description.

In serial languages, this appears to be achieved by clustering words into ‘chunks’ (phrases, sentences) of information, which are more-or-less consistent and self-contained elements of thought. Furthermore, in human language at least, these ‘chunks’ tend to consist of content terms, which describe what the chunk is ‘about’, and functional terms, which attribute references and context by which the content terms convey their information unambiguously.

Functional terms in a language tend to be short, probably attributable to the principle of least effort, as they are used frequently.

A further distinguishing characteristic of functional and content terms is that different texts will often vary in their content but tend to share a common linguistic structure, and therefore make similar use of functional terms. That is, the probability distribution of content terms will vary from text to text, but the distribution of function terms will not.

Using text from a number of languages, which had been enciphered using a simple substitution cipher (to avoid cheating), we identified, across a variety of intra-language texts, the most common words with least inter-text variation. These we call ‘candidate function words’.

Now, suppose these words occurred at random in the signal: we would expect the spacing between them to be merely a function of their individual probabilities of occurrence. But this is NOT what happens. Instead, there is empirical evidence across the languages and language groups analysed that function word separation is constrained to within short limits, with very few occurrences more than nine words apart. Analysing this statistically (as a Poisson distribution), or simply simulating it practically by random generation, we find that there would be a not-insignificant number of cases with very large gaps (of the order of several tens of words) between successive occurrences. This contrasts markedly with what occurs in natural language.
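The gap measurement itself is simple to state in code. A sketch under my own naming (`occurrence_gaps` is hypothetical, not from the paper):

```python
def occurrence_gaps(words, target):
    """Number of intervening words between successive occurrences of
    `target`.  For genuine function words in natural language the gaps
    stay small (rarely above nine), whereas random placement at the same
    overall frequency produces occasional very large gaps."""
    positions = [i for i, w in enumerate(words) if w == target]
    return [b - a - 1 for a, b in zip(positions, positions[1:])]

words = "the cat sat on the mat near the old tree by the river".split()
gaps = occurrence_gaps(words, "the")
```

Comparing the empirical gap distribution for each candidate function word against the geometric/Poisson expectation from its raw frequency is then a direct test of the constraint described above.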

Initial findings show that the frequency distribution of these lengths of text – our candidate phrases – follows a Zipfian distribution curve and rarely exceeds lengths of more than eight.

We might conclude from this that our brains tend to ‘chunk’ linguistic information into phrase-like structures of the order of seven or so word units long.

Interestingly enough, this fits in well with human cognition theory [9], which states that our short-term mental capacity operates well only up to 7 (±2) pieces of information, but any causal connection between this and our results must be considered highly speculative at this stage!

Fig. 6 depicts the phrase length distribution between function words, which is characteristic of all languages tested.

Fig. 7. Visualisation of a correlation profile for a word pair (w1 = the, w2 = king): frequency vs. offset, with P(w1, w2) at positive offsets and P(w2, w1) at negative offsets.

Fig. 8. VB-tag profile (frequency vs. distance for the parts of speech vb, jj, prep, art, cc, rb and cnoun).

5. Clustering into syntactico-semantic classes

Unlike traditional natural language processing, a solution cannot be assisted using vast amounts of training data with well-documented ‘legal’ syntax and semantic interpretation, or known statistical behaviour of speech categories. Therefore, at this stage we are endeavouring to extract the syntactic elements without a ‘Rosetta’ stone and by making as few assumptions as possible. Given this, a generic system is required to facilitate the analysis of behavioural trends amongst selected pairs of terminals and non-terminals alike, regardless of the target language.

Therefore, one intermediate research goal is to apply natural language learning techniques to the identification of ‘‘higher-level’’ lexical and grammatical patterns and structure in a linguistic signal. We have begun the development of tools to visualise the correlation profiles between pairs of words or parts of speech, as a precursor to deducing general principles for ‘typing’ and clustering into syntactico-semantic lexical classes. Linguists have long known that collocation and combinational patterns are characteristic features of natural languages, which set them apart [13]. Speech and language technology researchers have used word-bigram and n-gram models in speech recognition, and variants of PoS-bigram models for part-of-speech tagging. In general, these models focus on immediate neighbouring words, but pairs of words may have bonds despite separation by intervening words; this is more relevant in semantic analysis, e.g. [6,2]. We sought to investigate possible bonding between type tokens (i.e., pairs of words or parts-of-speech tags) at a range of separations, by mapping the correlation profile between a pair of words or tags. This can be computed for a given word-pair type (w1, w2) by recording each word-pair token (w1, w2, d) in a corpus, where d is the distance, or number of intervening words. The distribution of these word-pair tokens can be visualised by plotting d (the distance between w1 and w2) against frequency (how many (w1, w2, d) tokens are found at this distance). Distance can be negative, meaning that w2 occurred before w1, and can range over any size of window (i.e., 2 to n). In other words, we postulate that it might be possible to deduce part-of-speech membership and, indeed, identify a set of part-of-speech classes, using the joint probability of words themselves. But is this possible? One test would be to take an already tagged corpus and see if the parts-of-speech did indeed fall into separable clusters.
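The (w1, w2, d) recording step can be sketched directly from that description. This is an illustrative implementation of my own, not the authors' tool; here d is the signed offset in token positions (so d − 1 intervening words for d > 0):

```python
from collections import Counter

def correlation_profile(words, w1, w2, window=10):
    """Histogram of signed offsets between tokens of w1 and w2 within a
    window: offset d > 0 means w2 occurs d positions after w1, d < 0
    means w2 occurs before w1."""
    profile = Counter()
    pos1 = [i for i, w in enumerate(words) if w == w1]
    pos2 = [i for i, w in enumerate(words) if w == w2]
    for i in pos1:
        for j in pos2:
            d = j - i
            if d != 0 and abs(d) <= window:
                profile[d] += 1
    return profile

words = "the king saw the king".split()
profile = correlation_profile(words, "the", "king")
```

Note this counts every pairing inside the window, including pairs with intervening secondary occurrences of the selected words; whether to suppress those is the exclusivity choice discussed for the white areas of the topology plots later in this section.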

Fig. 7 shows the results for the relationship between a content and a function word, so identified by looking at their cross-corpus statistics.

Using a five-thousand-word extract from the LOB corpus [14], a number of parts-of-speech pairings were analysed for their cohesive profiles. The arbitrary figure of five thousand was chosen as it both represents a sample large enough to reflect trends seen in much larger samples (without losing any valuable data), and a sample size which we see as at least plausible when analysing ancient or extra-terrestrial languages, where data is at a premium.

To enable analysis of multiple selections and how they compare with each other, the information extrapolated is then ported for 3D graphical representation (see Fig. 8).

This particular stage will eventually be integrated for purposes of efficiency, but is not essential. Examining language in such a manner also lends itself to summarising the behaviour to its more notable features when forming profiles, thereby conducting information compression akin to principal component analysis: a technique more usually found in applications conducting analysis of images, and found here to be extremely effective. Fig. 9 shows a sample of the main syntactic behavioural features for their co-occurrence, ranging over the chosen window of ten words.

Fig. 10. In/the profile.

Most of the combination patterns found correspond to tag-bigrams, which could be extracted automatically in a Markov model. However, some longer-distance cohesive trends were found, indicated by λn in Fig. 9, where n is the offset distance. For example, in our sample, adverbs (Rb) were never immediately followed by a common noun (Cnoun), but there was a peak at a separation of 5.

Such information could be used to guide the development of constraint grammars. The English constraint grammar described in Ref. [15] includes constraint rules up to 4 words either side of the current word (see Table 16, p. 352); the peaks and troughs in the visualisation tool might be used to find candidate patterns for such long-distance constraints.

To investigate whether particular combinations display distinguishable traits at more distant separations, which may further aid unsupervised language learning, a one-hundred word/parts-of-speech-tag window was employed. The rationale here being: given no prior knowledge except that gleaned from previous stages of unsupervised analysis, can statistically based features of the annotation-combination space topology contribute towards clustering the functional words into parts of speech?

To ascertain the feasibility of such a hypothesis, English functional words discovered during previous unsupervised analysis were analysed. It was found, using such distant behaviour, that frequent, almost-bound word combinations, such as in Fig. 10, display a marked ‘tailing off’ in the direction where bonding is evident, in contrast to the opposing direction, where repulsion occurs at the immediate zero-offset bigram occurrence.

Fig. 9. Analysis of distinguishing grammatical collocations of main Parts of Speech (matrix of pairwise behaviour for noun, adjective, adverb, preposition, conjunction, verb and article). Key: Ζ = zero bigram occurrences; δ = very weak bonding at bigram occurrences; β = strong bonding at bigram co-occurrences; * = opposing cohesive trend; λn = high peak beyond bigram at offset n.

The red area depicts behaviour for the bigram in/the, whilst the blue depicts its opposite. This profile contrasts markedly with the non-bound word-pair topology, where such ‘tailing off’ is absent. The white areas, which contribute towards the overall topology, are where the exclusivity of selected pair combinations is not enforced and intervening secondary occurrences of a selected word are ignored.

With the ongoing work on compiling inter-language data on ontologically encoded universals through parts-of-speech, it is hoped that such cohesive and topographical information will provide key information for unlocking this linguistic layer in an unknown signal. Current work, analysing parts-of-speech annotated corpora for ten of the most widely used, yet disparately encoded, writing systems, is promising to uncover such features.

6. SASs to distinguish language in audio

The classification of an audio signal using its physical make-up has also been aided by visualisation techniques. As an initial stage towards filtering intelligent, language-like communication from 'noise' and other semi-structured natural phenomena, a suite of programs has been written to break the signal into its constituent parts and present its underlying feature structures visually. This is achieved by sampling the analogue sound wave at


J. Elliott / Acta Astronautica 68 (2011) 389–398

10,000 samples/s and measuring features such as the amplitude at a given point, the number of samples in an envelope, and the distance between these samples.

Vertical analysis of the sound wave examines amplitude, to ascertain when high activity equates to significant activity sections (SASs) above a selected threshold. Horizontal analysis uses these envelopes to capture the duration and rhythm of the sound wave.
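A minimal sketch of this segmentation, assuming a simple per-sample amplitude test. The function name is illustrative, and the actual system also merges nearby envelopes (as Fig. 11 indicates), which this sketch omits:

```python
def sas_envelopes(samples, threshold, rate=10_000):
    """Segment a signal sampled at `rate` samples/s into significant
    activity sections (SASs): maximal runs whose absolute amplitude
    exceeds the threshold.  Returns (start, duration) pairs in seconds."""
    envelopes = []
    start = None
    for i, s in enumerate(samples):
        if abs(s) > threshold:
            if start is None:
                start = i                      # envelope opens
        elif start is not None:
            envelopes.append((start / rate, (i - start) / rate))
            start = None                       # envelope closes
    if start is not None:                      # signal ends mid-envelope
        envelopes.append((start / rate, (len(samples) - start) / rate))
    return envelopes

sig = [0.0, 0.1, 0.9, 0.8, 0.05, 0.0, 0.7, 0.6, 0.9, 0.0]
print(sas_envelopes(sig, 0.5))  # two envelopes
```

The resulting (start, duration) pairs feed both the 'snapshot' amplitude analysis and the duration-over-time rhythm analysis described below.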

The system therefore looks at both a signal's overall structure, viewed as a 'snapshot', and its rhythm over time. Fig. 11 below illustrates this process.

This has enabled the comparison of a range of signals with our own spoken language structure, ranging from the audio signatures of other putatively intelligent communicators, such as dolphins and apes, to the rapid clicks of pulsars and the white noise of the hydrogen line and cosmic background radiation.

Results have been promising, consistently showing that, as a first-pass filter, this technique distinguishes language-like signals, even when subject to distortion and noise, not only from random noise but also from other structured signals such as music, which, although it can vary dramatically, has yet to mimic the signature of language. The following images summarise results obtained from visualising and analysing such signals.

In general, for 'snapshot' distributions of amplitude about a centrally oriented zero line, the following are found:

• Language communication, such as that found in humans, dolphins and apes, displays a leptokurtic distribution, seen in Fig. 12.

Fig. 12. Language.

• Random noise results in platykurtic distributions, such as in Fig. 13.

• Modem and fax transmissions show a distribution which is halfway between language and noise. See Fig. 14.

• Music ranges from a bimodal distribution to distinct alternating spikes, as found in Fig. 15a and b.
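The 'snapshot' contrast above rests on the kurtosis of the amplitude distribution. A minimal sketch, using the standard sample excess-kurtosis formula; the numeric cut-off is an illustrative placeholder, since the paper reports only the qualitative shapes:

```python
def excess_kurtosis(xs):
    """Sample excess kurtosis: positive for a leptokurtic (peaked)
    distribution, negative for a platykurtic (flat) one."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / m2 ** 2 - 3.0

def classify_snapshot(amplitudes, cutoff=1.0):
    """Illustrative decision rule only: the paper reports qualitative
    shapes (leptokurtic vs. platykurtic), not a numeric cut-off."""
    return "language-like" if excess_kurtosis(amplitudes) > cutoff else "noise-like"
```

Speech snapshots, concentrated near zero with occasional large excursions, score high; flat noise snapshots score negative.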

[Figure: a waveform shown against a selected amplitude threshold; regions above the threshold form an S.A.S., and d denotes the duration of merged envelopes.]

Fig. 11. Analysing the Signal.

In contrast to the above visualisation of occurrences of amplitude, the figures below depict the duration of SAS envelopes over time. The y-axis indicates the duration of a given sound envelope, whilst the x-axis is time. Using this method, Fig. 16 is a visualisation of human speech in terms of SAS duration. The rhythm of long and short envelopes is particularly symptomatic of vocalisation and is readily distinguishable from other sounds, such as pulsars, music and noise, as mentioned above.

Fig. 13. Random noise/hydrogen line.


On analysing sound samples taken from the 'sounds of space' tape provided by Dr Cullers at the SETI Institute, the system was able to distinguish sounds produced by the human carrier mechanism despite interference and phase shifts. Results consistently produced the distinctive leptokurtic profile together with the type of time series seen in Fig. 16. The regularly occurring 'clicks' of the pulsar were immediately distinguishable by their 'tooth-comb' time series, predictably producing a consistency in intervals and amplitude uncharacteristic of more complex sources. See Fig. 17 below.
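The 'tooth-comb' regularity of pulsar clicks can be quantified by, for example, the coefficient of variation of inter-click intervals. This particular statistic is our illustration, not necessarily the measure used by the system:

```python
def interval_regularity(click_times):
    """Coefficient of variation of inter-click intervals: close to zero
    for metronomic sources such as pulsars, larger for the uneven
    rhythm of vocalisation."""
    gaps = [b - a for a, b in zip(click_times, click_times[1:])]
    mean = sum(gaps) / len(gaps)
    var = sum((g - mean) ** 2 for g in gaps) / len(gaps)
    return var ** 0.5 / mean

pulsar_clicks = [0.0, 0.5, 1.0, 1.5, 2.0]    # evenly spaced clicks
speech_bursts = [0.0, 0.12, 0.5, 0.58, 1.3]  # irregular envelopes
```

A near-zero score flags the consistency in intervals described above; vocalisation scores markedly higher.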

Individually, the two visualisation methods mentioned above are reasonably robust in their ability to differentiate

Fig. 14. Fax/Modem transmission.

Fig. 15. a, b: Music.

language from other signals. However, combining the two methods produces a system which we believe to be an extremely useful automated first-pass filter for intelligent audio communication discovery.
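Such a combined first-pass filter might be sketched as follows; both thresholds are illustrative placeholders rather than values taken from the system:

```python
def first_pass_filter(snapshot_kurtosis, rhythm_cv):
    """Pass a signal on for further analysis only if its amplitude
    snapshot is peaked (leptokurtic) AND its envelope rhythm is neither
    metronomic nor shapeless.  Both thresholds are placeholders."""
    peaked = snapshot_kurtosis > 1.0
    rhythmic = 0.2 < rhythm_cv < 2.0
    return peaked and rhythmic
```

Requiring both cues rejects pulsars (regular rhythm), noise (flat snapshot) and most music, while passing speech-like signals.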

7. Summary and future developments

To summarise, our achievements to date include:

• a method for splitting a binary digit-stream into characters, by using entropy to diagnose byte-length;

• internal structure confirmation and detection of 'legal' combinations using higher-order entropy and n-grams;

Fig. 16. Human voice time series.


Fig. 17. Pulsar time series.


• a method for tokenising unknown character-streams into words of language;

• an approach to chunking words into phrase-like sub-sequences, by assuming high-frequency function words act as phrase-delimiters;

• a visualisation tool for exploring word-combination patterns, where word-pairs need not be immediate neighbours but characteristically combine despite several intervening words;

• a toolkit for analysing the physical structure of audio signals, both historically and over time.
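The second item, confirming internal structure via higher-order entropy, can be illustrated with the conditional entropy of successive symbols, which falls as sequential 'legality' constraints tighten. This is a toy example, not the toolkit's code:

```python
import math
from collections import Counter

def cond_entropy(text):
    """H(next symbol | current symbol): low values signal the kind of
    sequential structure ('legal' combinations) summarised above."""
    pairs = Counter(zip(text, text[1:]))
    firsts = Counter(text[:-1])
    n = len(text) - 1
    h = 0.0
    for (a, b), c in pairs.items():
        h -= (c / n) * math.log2(c / firsts[a])
    return h

print(cond_entropy("abababababab"))   # 0.0: each symbol fully determines the next
print(cond_entropy("aabbbaababba"))   # higher: looser sequential constraints
```

In a language-like stream this statistic sits well below the unconditional symbol entropy, because only a subset of combinations is 'legal'.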

So far, our approaches have involved working with languages with which we are most familiar and, to a certain extent, making use of linguistic 'knowns' such as pre-tagged corpora. It is still early days, and we make no apology for this initial approach. However, we feel that by deliberately reducing our dependence on prior knowledge ('increasing our ignorance of language') and by treating language as a 'signal', we might be contributing a novel approach to natural language processing which might ultimately lead to a better, more fundamental understanding of what distinguishes language from the rest of the signal universe.

References

[1] John Elliott, Eric Atwell, Language in signals: the detection of generic species-independent intelligent language features in symbolic and oral communications, in: Proceedings of the 50th International Astronautical Congress, paper IAA-99-IAA.9.1.08, International Astronautical Federation, Paris, 1999.

[2] George Demetriou, Eric Atwell, A domain independent semantic tagger for the study of meaning associations in English text, in: Proceedings of IWCS-4: Fourth International Workshop on Computational Semantics, Tilburg, Netherlands, 2001.

[3] John Elliott, Eric Atwell, Is there anybody out there? The detection of intelligent and generic language-like features, Journal of the British Interplanetary Society 53 (1–2) (2000) 13–22.

[4] H.J. Jensen, Self-Organized Criticality, Cambridge University Press, 1998.

[5] M. Minsky, Why Intelligent Aliens will be Intelligible, Cambridge University Press, 1984.

[6] Andrew Wilson, Paul Rayson, The automatic content analysis of spoken discourse, in: C. Souter, E. Atwell (Eds.), Corpus-Based Computational Linguistics, Rodopi, Amsterdam, 1993.

[7] G.K. Zipf, Human Behaviour and the Principle of Least Effort, Addison-Wesley Press, New York, 1949 (1965 reprint).

[8] John Elliott, Eric Atwell, Bill Whyte, Language identification in unknown signals, in: Proceedings of COLING 2000, 18th International Conference on Computational Linguistics, pp. 1021–1026, Association for Computational Linguistics (ACL) and Morgan Kaufmann Publishers, San Francisco, 2000.

[9] R.L. Solso, Cognitive Psychology, third ed., Allyn & Bacon, Massachusetts, USA, 1991.

[10] R. Norris, How old is ET?, in: Proceedings of the 50th International Astronautical Congress, paper IAA-99-IAA.9.1.04, International Astronautical Federation, Paris, 1999.

[11] UW Genome Project and Blatter Laboratory, University of Wisconsin, USA, ⟨http://www.genome.wis.edu⟩.

[12] E. Charniak, Statistical Language Learning, MIT Press, Cambridge, MA, 1993.

[13] John Sinclair, Corpus, Concordance, Collocation: Describing English Language, Oxford University Press, 1991.

[14] Fred Karlsson, Atro Voutilainen, Juha Heikkilä, Arto Anttila, Constraint Grammar: A Language-Independent System for Parsing Unrestricted Text, Mouton de Gruyter, Berlin, 1995.

[15] Geoffrey Leech, Roger Garside, Eric Atwell, The automatic grammatical tagging of the LOB corpus, ICAME Journal 7 (1983) 13–33.

Glossary

Unigram; Bigram; Trigram; N-gram: statistical sequences of 1, 2, 3 and N linguistic units, used to form probabilistic models.

Zipfian: named after George Zipf; the observation that, given a corpus of natural language utterances, the frequency of any word is inversely proportional to its rank in the frequency table.
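A toy rank-frequency table illustrates the quantities involved; in a corpus this small the inverse proportionality is only roughly visible:

```python
from collections import Counter

words = ("the quick fox and the lazy dog and "
         "the old fox saw the dog").split()
ranked = Counter(words).most_common()   # [(word, frequency), ...] by rank
for rank, (word, freq) in enumerate(ranked, start=1):
    print(rank, word, freq, rank * freq)   # rank * freq roughly constant if Zipfian
```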