8/4/2019 Using Language Examples in an Introductory SAS Programming Class
Using Language Examples in an Introductory SAS Programming Class
USCOTS
Ohio State University
Saturday, June 27th, 2009
Roger Bilisoly, PhD
Department of Mathematical Sciences
Central Connecticut State University
-
Why analyze language in a SAS class?
There are several excellent sources of free texts on the Web. For example,
Project Gutenberg at http://www.gutenberg.org/wiki/Main_Page
Google books at http://books.google.com/
VIRGObeta at http://virgobeta.lib.virginia.edu/
There are several sources of free word lists on the Web. For example,
Moby word lists for English, German, Spanish, French, Italian, and Japanese at Gutenberg.org.
The American Cryptogram Association has lists for many additional languages. See http://cryptogram.org/cdb/words/words.html.
The National Puzzlers' League has many types of word lists for English. See http://www.puzzlers.org/dokuwiki/doku.php?id=solving:wordlists:about:start.
Adding variety to the types of data used could broaden the appeal of a statistics class.
Many examples of statistical analyses of text have already been developed by linguists and computer scientists.
Corpus linguists use computers to analyze text samples designed to be representative of a certain aspect of a language. For example, the million-word Brown Corpus was created to be representative of American English in 1961.
-
Homework Problem: Find the Proportion of Each Letter of the Alphabet in Dickens's A Christmas Carol.
Are there any letter frequency anomalies? For example, does the letter J appear more often than average due to the name Jacob Marley?
This novel was originally published in 1843. How do its letter frequencies compare to American English in 1961, i.e., to the Brown Corpus?
How do its letter frequencies compare to German frequencies, e.g., to Goethe's Die Leiden des jungen Werther?
Complications: Other languages using the Latin alphabet often employ diacritical marks (e.g., German has umlauts) and sometimes add new letters (e.g., German has ß, the Eszett, which stands for a double s). Hence alphabets are more complex than one might first suppose.
-
This SAS Code Introduces both Character Data and Frequency Tables.
data carol;
  infile "C:\A_Christmas_Carol.txt";
  input char $1. @@;            /* read characters one at a time */
  lowchar = lowcase(char);      /* SAS v9 has many character functions */
run;

data letters_carol;
  set carol;
  if anyalpha(lowchar) > 0;     /* subsetting IF: keep letters only */
run;

proc freq data=letters_carol order=freq;
  tables lowchar / out=carolfreq;   /* also save the counts as a data set */
run;
The above code can be introduced early in a programming class, and
the ability to read in external files is important for applications.
-
Letter Frequencies for A Christmas Carol with some comparisons.
The FREQ Procedure
                                  Cumulative   Cumulative
lowchar   Frequency   Percent      Frequency      Percent
e             14869     12.27          14869        12.27
t             10890      8.99          25759        21.26
o              9696      8.00          35455        29.26
a              9315      7.69          44770        36.95
h              8378      6.91          53148        43.86
i              8309      6.86          61457        50.72
n              7962      6.57          69419        57.29
s              7916      6.53          77335        63.82
r              7038      5.81          84373        69.63
d              5676      4.68          90049        74.31
l              4555      3.76          94604        78.07
u              3335      2.75          97939        80.82
w              3096      2.55         101035        83.38
c              3036      2.51         104071        85.88
g              2980      2.46         107051        88.34
m              2841      2.34         109892        90.68
f              2438      2.01         112330        92.70
y              2299      1.90         114629        94.59
p              2122      1.75         116751        96.35
b              1943      1.60         118694        97.95
k              1031      0.85         119725        98.80
v              1029      0.85         120754        99.65
x               131      0.11         120885        99.76
j               113      0.09         120998        99.85
q                97      0.08         121095        99.93
z                84      0.07         121179       100.00
ö                 1      0.00         121180       100.00
Top 12 letters in frequency order for several sources:
Christmas Carol: ETOAHI NSRDLU
Brown Corpus: ETAOIN SRHLDU
Die Leiden des jungen Werther: ENIRSH TADULC
Rule of Thumb: ETAOIN SHRDLU
The letter j:
Dickens: 0.0009
Brown: 0.0020
The last table row, with a frequency of 1, is from the word Laocoön, a figure from Greek mythology.
-
Homework Problem: Find Initial Consonant Clusters.
How do languages differ in their use of consonants?
As noted earlier, diacritical marks and additional letters make this complicated. In addition, the same sound can be represented in quite different ways in different languages.
Sounds in a language are restricted in practice: these are called phonotactic constraints in linguistics.
For example, English has a ts sound (as in cats), but it doesn't appear at the beginning of words, except for loanwords like tsar (from the Russian царь, where ts = ц). German does have an initial ts sound, but it's represented with the letter z (as in Zimmer).
However, ts can also appear where t ends a syllable and s starts the next syllable, as in pantsuit. In this case the sound is not the ts appearing in cats or tsar.
Studying initial consonant clusters restricts attention to one syllable, so boundaries are not a problem.
Let's compare English and German.
-
Initial Consonant Clusters: English vs. German

English
Obs   start   COUNT   PERCENT
  1   c        7333   8.01202
  2   r        7012   7.66129
  3   m        6261   6.84075
  4   d        5831   6.37094
  5   s        5540   6.05299
  6   p        5398   5.89784
  7   b        5265   5.75253
  8   h        4079   4.45671
  9   t        3713   4.05682
 10   l        3706   4.04917
 11   f        3426   3.74324
 12   g        2473   2.70199
 13   w        2284   2.49549
 14   n        2206   2.41027
 15   pr       2150   2.34908
 16   v        1924   2.10216
 17   st       1364   1.49030
 18   ch       1330   1.45315
 19   tr       1311   1.43240
 20   j        1104   1.20623
 21   k        1017   1.11117
 22   sh        987   1.07839

German
Obs   start   COUNT   PERCENT
  1   v       11356   9.43581
  2   g        9310   7.73577
  3   b        9282   7.71251
  4   w        8208   6.82011
  5   h        6851   5.69256
  6   z        6444   5.35438
  7   k        6214   5.16327
  8   s        4849   4.02908
  9   m        4847   4.02742
 10   f        4836   4.01828
 11   r        4035   3.35272
 12   d        3978   3.30536
 13   l        3501   2.90902
 14   t        3456   2.87162
 15   st       3144   2.61238
 16   n        2977   2.47362
 17   sch      2669   2.21770
 18   p        2521   2.09472
 19   tr       1724   1.43249
 20   pr       1681   1.39676
 21   sp       1348   1.12007
 22   fr       1296   1.07686
 23   gr       1258   1.04528
First, note that English and German phonology (the sounds the letters make) differ. For example, a German v is pronounced like the English f. Second, these two languages have different constraints on initial letters. For example, almost no words in German start with c, but z, which is pronounced like ts, is a common starting letter in German (it ranks 6th above). Third, the frequencies of initial letters do not match the overall letter frequencies found earlier.
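The slides do not show the code behind these tables. One way such counts might be produced is sketched below, under several assumptions: the word list path is hypothetical, the file holds one word per line, an initial cluster is taken to be every letter before the first vowel, and y is treated as a vowel. For German, umlauted vowels would also need to be added to the vowel list. The data set names clusters and startfreq are illustrative only.

```sas
/* Sketch only: the file path and the vowel list are assumptions. */
data clusters;
  length word $30 start $30;
  infile "C:\wordlist.txt" truncover;   /* hypothetical word list, one word per line */
  input word;
  word = lowcase(word);
  pos = findc(word, 'aeiouy');          /* position of the first vowel, 0 if none */
  if pos > 1;                           /* keep words that begin with a consonant */
  start = substr(word, 1, pos - 1);     /* the letters before the first vowel */
run;

/* Count each initial cluster and save the counts as a data set. */
proc freq data=clusters noprint;
  tables start / out=startfreq;
run;

/* List the most common clusters first. */
proc sort data=startfreq;
  by descending count;
run;

proc print data=startfreq(obs=25);
run;
```

Because words that begin with a vowel are dropped before counting, the PERCENT column from this sketch is a share of consonant-initial words only; whether the tables above were computed the same way is not stated in the slides.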
-
Analyzing Word Games and Language
Many language games require finding words given specific letter constraints. Crossword puzzles and hangman are two examples.
In linguistics, morphology, the study of the structure of words, can be analyzed in similar ways. Words are broken into morphemes, which are the smallest units of a word that have meaning.
For example, in English, many adverbs are formed by adding the morpheme -ly to an existing word.
Compare: "Scoot is quick," and "Scoot runs quickly." Quick is an adjective in the former sentence, and quickly is an adverb in the latter.
"Run here, Scoot, and be quick about it." Here quick is used as an adverb. However, a rule with exceptions can still be useful.
This adverb example is from Section 6.4.3 of Practical Text Mining with Perl (Bilisoly, 2008).
-
Can you solve the following word puzzles?
1. Find all the words that fit the following crossword puzzle pattern: ___b__u
2. Find all the words that fit the following hangman pattern: _e____s, where t, a, o, i, n don't appear.
3. How useful is the idea that most adverbs in English can be formed by adding -ly to an existing word? Unfortunately, there are many complications:
Happy becomes happily (y changes to i).
Seasonable becomes seasonably (e is dropped).
Automatic becomes automatically (-al- is added).
Hill becomes hilly (only y is added).
And there are words ending in -ly that are not adverbs: anomaly, apply, fly, etc.
-
Here are the SAS solutions to the crossword and hangman problems.
data one;
  length word $30;
  infile "C:\crosswd.txt";
  input word;
  len = length(word);
run;

data two;
  set one;
  if len = 7;                     /* the pattern ___b__u has seven letters */
  if substr(word,4,1) = 'b';      /* fourth letter is b */
  if substr(word,7,1) = 'u';      /* seventh letter is u */
run;

proc print data=two; run;

SAS output:
Obs    word       len
  1    jambeau      7
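data three;
  set one;
  if len = 7;
  if findc(word,'taoin') = 0;     /* none of t, a, o, i, n appears */
  /* first and last e are both at position 2, so e appears only there */
  if findc(word,'e') = 2 and findc(word,'e',-30) = 2;
  /* first and last s are both at position 7, so s appears only there */
  if findc(word,'s') = 7 and findc(word,'s',-30) = 7;
run;

proc print data=three; run;

SAS output:
Obs    word       len
  1    bedbugs      7
  2    bedrugs      7
  3    bedumbs      7
  4    begulfs      7
  5    ferrums      7
  6    peplums      7
  7    rebuffs      7
  8    redbuds      7
  9    redbugs      7
 10    regulus      7
 11    vellums      7
 12    zephyrs      7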
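As an alternative to the SUBSTR and FINDC tests, each puzzle can also be written as a single pattern with the SAS v9 function PRXMATCH. This is a sketch, not the solution used in the class. It assumes the data set one from the code above and that the words are lowercase with no embedded blanks; the data set names crossword_rx and hangman_rx are illustrative.

```sas
/* Sketch only: assumes data set ONE (variable word) from the code above. */
data crossword_rx;
  set one;
  /* three characters, b, two characters, u; \s*$ allows for trailing blanks */
  if prxmatch('/^\S{3}b\S{2}u\s*$/', word) > 0;
run;

data hangman_rx;
  set one;
  /* positions 1 and 3-6: any character except t, a, o, i, n, e, s or a blank */
  if prxmatch('/^[^taoines\s]e[^taoines\s]{4}s\s*$/', word) > 0;
run;

proc print data=crossword_rx; run;
proc print data=hangman_rx; run;
```

Excluding e and s from the open positions plays the same role as the paired FINDC tests in the original code, which require that e and s occur only at positions 2 and 7.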
-
Word Inflections
A complete analysis of adverbs would be quite complicated.
However, the exceptions noted earlier (happily, etc.) were easy to find by reading in a word list and then checking each word that ends in -ly to see if it is still a word after removing -ly.
There is a methodology called regular expressions that finds general text patterns. This is implemented in version 9 of SAS using functions such as PRXPARSE and PRXMATCH.
English is not very inflected, but this varies from language to language. For example, English is less inflected than German, and Finnish is heavily inflected.
Moreover, there are many other word structures (morphemes) to analyze: plurals, verb conjugations, compound nouns, etc.
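The -ly check described above might be sketched as follows. The slides do not show this code, so the details are assumptions: the word list is the crosswd.txt file used for the puzzles, the data set and variable names (words, ly_words, ly_check, stem, stem_is_word) are illustrative, and PRXMATCH is used only to test for a final -ly.

```sas
/* Sketch only: file path, data set names, and variable names are assumptions. */
data words;
  length word $30;
  infile "C:\crosswd.txt" truncover;
  input word;
run;

/* Keep words ending in ly and strip those two letters to get a candidate stem. */
data ly_words;
  set words;
  length stem $30;
  if length(word) > 2 and prxmatch('/ly\s*$/', word) > 0;
  stem = substr(word, 1, length(word) - 2);
run;

/* Flag whether each stem is itself in the word list. */
proc sql;
  create table ly_check as
  select a.word,
         a.stem,
         case when b.word is null then 0 else 1 end as stem_is_word
  from ly_words as a
       left join words as b
       on a.stem = b.word
  order by a.word;
quit;

/* How often does removing ly leave a word? */
proc freq data=ly_check;
  tables stem_is_word;
run;

/* A few of the exceptions, e.g. stems such as happi or seasonab. */
proc print data=ly_check(obs=20);
  where stem_is_word = 0;
run;
```

A stem being a word does not make the -ly form an adverb (hilly would be flagged 1, as hill is a word), so the flag is only a first filter for finding exceptions.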
-
Current Status
I used language examples in CCSU's STAT 456 (Fundamentals of SAS), Spring 2009, for the first time.
Initial feedback is mixed. The language examples were difficult for non-native speakers of English.
Would this be helpful in an introductory class? I plan to ask my future classes about their interest in word games to judge whether this is worth pursuing at the introductory level.