Corpus Linguistics: Analysing word frequencies
INTRODUCTION
A corpus of British English and American English
1
WORD FREQUENCIES
❖ Books from 19th century British and American writers downloaded from the Gutenberg Project
❖ Number of individual words: 30 723
❖ Number of occurrences in AE corpus: 11 709 009
❖ Number of occurrences in BE corpus: 13 795 791
❖ Total size of corpus: 25 504 800
Is the word “colour” used more often in American or British English?
| | Occ. of “colour” | Total number of words | Frequency (per million words, pmw) |
|---|---|---|---|
| AE | 255 | 11 709 009 | 21.77 pmw |
| BE | 1772 | 13 795 791 | 128.44 pmw |
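The per-million-words figures can be recomputed from the raw counts. A minimal sketch, where `per_million` is an illustrative helper name and not something from the slides:

```python
def per_million(occurrences: int, corpus_size: int) -> float:
    """Relative frequency expressed per million words (pmw)."""
    return occurrences / corpus_size * 1_000_000

# Counts for "colour" and corpus sizes taken from the figures above.
ae_pmw = per_million(255, 11_709_009)
be_pmw = per_million(1772, 13_795_791)

print(f"AE: {ae_pmw:.2f} pmw")
print(f"BE: {be_pmw:.2f} pmw")
print(f"BE/AE ratio: {be_pmw / ae_pmw:.1f}")
```

Normalising by corpus size matters here because the BE corpus is about two million words larger than the AE corpus.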
Is it significant?
CHI SQUARE
χ² = 906.71
p < .001
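The slides do not show how χ² was computed. The reported value for “colour” is reproduced closely by a test with one degree of freedom in which the expected counts of the word are proportional to the two corpus sizes, so the sketch below assumes that set-up. The function names are illustrative.

```python
import math

def chi_square_gof(occ_a: int, occ_b: int, size_a: int, size_b: int) -> float:
    """Chi-square statistic (1 df) comparing a word's counts in two corpora.

    Assumption: under the null hypothesis the word's occurrences are split
    between the corpora in proportion to the corpus sizes.
    """
    total_occ = occ_a + occ_b
    total_size = size_a + size_b
    expected_a = total_occ * size_a / total_size
    expected_b = total_occ * size_b / total_size
    return ((occ_a - expected_a) ** 2 / expected_a
            + (occ_b - expected_b) ** 2 / expected_b)

def p_value_1df(chi2: float) -> float:
    """Upper-tail p-value of a chi-square statistic with one degree of freedom."""
    return math.erfc(math.sqrt(chi2 / 2))

# "colour": 255 occurrences in AE, 1772 in BE.
chi2_colour = chi_square_gof(255, 1772, 11_709_009, 13_795_791)
print(f"chi2 = {chi2_colour:.2f}")
print(f"p < .001: {p_value_1df(chi2_colour) < 0.001}")
```

A full 2×2 contingency table (word vs. all other words) would give a slightly different statistic, so the choice of set-up should be stated when reporting results.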
Is the word “the” used more often in American or British English?
| | Occ. of “the” | Total number of words | Frequency (per million words, pmw) |
|---|---|---|---|
| AE | 848 729 | 11 709 009 | 72 485 pmw |
| BE | 914 669 | 13 795 791 | 66 300 pmw |
CHI SQUARE
χ² = 3503.73
p < .001
But... is χ² really appropriate?
Normal distribution: about 95% of the values lie within two standard deviations of the mean
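The 95% figure can be checked with the error function from the standard library. A small sketch, with `share_within` as an illustrative name:

```python
import math

def share_within(k: float) -> float:
    """Probability that a normal variable lies within k standard deviations of its mean."""
    return math.erf(k / math.sqrt(2))

for k in (1, 2, 3):
    print(f"within {k} SD: {share_within(k):.4f}")
```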
Word frequencies: zeta distribution
ZIPF’S LAW
A small number of words have a very high frequency, and a large number of words have a very low frequency.
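To get a feel for how skewed such a distribution is, the sketch below builds an idealised Zipf distribution over 30 723 ranks (the number of distinct words reported for this corpus). The exponent of 1.0 is an illustrative assumption, not a value fitted to the data.

```python
def zipf_shares(n_types: int, exponent: float = 1.0) -> list[float]:
    """Share of all tokens taken by each rank under an idealised Zipf distribution."""
    weights = [1 / rank ** exponent for rank in range(1, n_types + 1)]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical set-up: one rank per distinct word in the corpus.
shares = zipf_shares(30_723)
top_10 = sum(shares[:10])
bottom_half = sum(shares[len(shares) // 2:])

print(f"top 10 ranks: {top_10:.1%} of tokens")
print(f"bottom half of the ranks: {bottom_half:.1%} of tokens")
```

Under these assumptions the ten most frequent ranks account for far more tokens than the entire lower half of the vocabulary, which is very different from a bell-shaped distribution.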
NULL HYPOTHESIS
The null hypothesis is the idea that there is no relationship between two measured phenomena.
IN OTHER WORDS...
The null hypothesis is the hypothesis that chance alone can explain what we’re observing.
LANGUAGE IS NOT RANDOM
“Words are not selected at random. There is no a priori reason to expect them to behave as if they had been, and indeed they do not.”
Adam Kilgarriff, 1996
Chi-square: How bad is it?
1
USING CHI-SQUARE ON THE WHOLE CORPUS:
❖ Only keep words with over 5 occurrences
❖ Only keep words that occur in both the AE and BE corpora
❖ Number of significant results: 15197
❖ Number of non-significant results: 15526
❖ 49.4% of tests turn out significant (p < .05)
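The procedure above can be sketched as a loop over the vocabulary. This is a minimal illustration, not the author's script: the chi-square set-up (expected counts proportional to corpus sizes, one degree of freedom), the reading of “over 5 occurrences” as the combined count, the function names, and the small dictionaries of counts are all assumptions.

```python
import math

def chi_square_1df(occ_a: int, occ_b: int, size_a: int, size_b: int) -> float:
    """Chi-square statistic with expected counts proportional to corpus sizes."""
    total = occ_a + occ_b
    exp_a = total * size_a / (size_a + size_b)
    exp_b = total - exp_a
    return (occ_a - exp_a) ** 2 / exp_a + (occ_b - exp_b) ** 2 / exp_b

def count_significant(ae: dict[str, int], be: dict[str, int],
                      size_ae: int, size_be: int,
                      min_occ: int = 5, alpha: float = 0.05) -> tuple[int, int]:
    """Return (significant, tested) over the words that pass both filters."""
    significant = tested = 0
    for word in ae.keys() & be.keys():        # word occurs in both corpora
        if ae[word] + be[word] <= min_occ:    # keep only words with over min_occ occurrences
            continue
        tested += 1
        chi2 = chi_square_1df(ae[word], be[word], size_ae, size_be)
        p = math.erfc(math.sqrt(chi2 / 2))    # p-value for 1 degree of freedom
        significant += p < alpha
    return significant, tested

# Toy vocabulary: "colour" and "the" use the counts from the slides,
# the other entries are invented for illustration only.
ae_counts = {"colour": 255, "the": 848_729, "rain": 3, "sidewalk": 40}
be_counts = {"colour": 1772, "the": 914_669, "rain": 1, "pavement": 60}

sig, tested = count_significant(ae_counts, be_counts, 11_709_009, 13_795_791)
print(f"{sig} of {tested} tests significant ({sig / tested:.1%})")
```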
ALTERNATIVE #1: Cramer’s V
2
Cramer’s Phi & Cramer’s V
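The formulas shown on this slide are not preserved in the transcript. For reference, the standard textbook definitions are given here, with N the number of observations and r and c the numbers of rows and columns of the contingency table:

```latex
\phi = \sqrt{\frac{\chi^2}{N}}, \qquad
V = \sqrt{\frac{\chi^2}{N \cdot \min(r - 1,\, c - 1)}}
```

For a 2×2 table, min(r − 1, c − 1) = 1, so V and φ coincide.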
USING CRAMER’S PHI ON THE WHOLE CORPUS:
❖ Same data as for Chi-square test
❖ Maximum value of Phi coefficient is determined by the distribution of the two variables
Φ > 0.5:
❖ Significant results: 1198
❖ Non-significant results: 29525
❖ 3.90% of tests turn out significant
Φ > 0.6:
❖ Significant results: 698
❖ Non-significant results: 30025
❖ 2.27% of tests turn out significant
❖ Phi coefficient for “colour”: 0.447
❖ Phi coefficient for “the”: 0.001
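The slides do not state which formula or which N produced these coefficients. As a hedged sketch: if χ² is computed with expected counts proportional to corpus sizes and N is taken as the word's total occurrences in both corpora (both assumptions), then χ²/N comes out at about 0.447 for “colour” and about 0.002 for “the”, which is close to the reported figures. The usual φ is the square root of that ratio. Function names are illustrative.

```python
import math

def chi_square_1df(occ_a: int, occ_b: int, size_a: int, size_b: int) -> float:
    """Chi-square statistic with expected counts proportional to corpus sizes."""
    total = occ_a + occ_b
    exp_a = total * size_a / (size_a + size_b)
    exp_b = total - exp_a
    return (occ_a - exp_a) ** 2 / exp_a + (occ_b - exp_b) ** 2 / exp_b

def phi_squared(occ_a: int, occ_b: int, size_a: int, size_b: int) -> float:
    """chi2 / N, with N taken as the word's total occurrences (an assumption)."""
    return chi_square_1df(occ_a, occ_b, size_a, size_b) / (occ_a + occ_b)

sizes = (11_709_009, 13_795_791)
counts = {"colour": (255, 1772), "the": (848_729, 914_669)}

for word, (ae, be) in counts.items():
    phi2 = phi_squared(ae, be, *sizes)
    print(f"{word}: chi2/N = {phi2:.3f}, phi = {math.sqrt(phi2):.3f}")
```

Either way, the contrast is the point: “the” has a far larger χ² than “colour” but a negligible effect size.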
ALTERNATIVE #2: Wilcoxon-Mann-Whitney ranking test
3
WMW
❖ Uses frequency to rank items and determine the value of the statistic (U)
❖ Divide the data into equal-sized samples
❖ For each observation, retain the frequency and the origin of the sample (AE or BE)

Example: “raining”

| Frequency | 1 | 2 | 4 | 4 | 5 | 7 |
|---|---|---|---|---|---|---|
| Origin | AE | AE | BE | AE | BE | BE |
| Rank | 1 | 2 | 3 | 4 | 5 | 6 |
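The ranking step for the “raining” example can be sketched as follows. Two points are assumptions rather than something the slides show: tied frequencies receive the average of their ranks (the usual convention, so the two 4s both get 3.5 instead of 3 and 4), and U is computed with the standard formula U1 = R1 − n1(n1 + 1)/2. Function names are illustrative.

```python
def midranks(values: list[float]) -> list[float]:
    """Rank values from 1 upward; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the whole run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = average
        i = j + 1
    return ranks

def u_statistic(sample_1: list[float], sample_2: list[float]) -> float:
    """U for sample 1: U1 = R1 - n1 (n1 + 1) / 2, with R1 the rank sum of sample 1."""
    ranks = midranks(sample_1 + sample_2)
    n1 = len(sample_1)
    r1 = sum(ranks[:n1])
    return r1 - n1 * (n1 + 1) / 2

ae = [1, 2, 4]  # frequencies of "raining" in the AE samples
be = [4, 5, 7]  # frequencies of "raining" in the BE samples

print(midranks(ae + be))
print(u_statistic(ae, be), u_statistic(be, ae))
```

The two U values always sum to n1 × n2, which makes a quick sanity check.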
❖ The significance of the U statistic can be checked using normal distribution tables.
❖ AE and BE were divided into 10 equal-sized chunks
❖ Tests made on all words with a frequency over 30 (n = 15756)
R1 = sum of ranks in sample 1
n1 = sample size for sample 1
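Checking U against the normal distribution can be sketched as below. This uses the standard large-sample approximation without tie or continuity correction, which is a simplification; the U value and the reading of the set-up as 10 chunks per variety are hypothetical, and `u_to_p` is an illustrative name.

```python
import math

def u_to_p(u: float, n1: int, n2: int) -> tuple[float, float]:
    """Normal approximation for U: return (z, two-sided p-value).

    No tie or continuity correction is applied (a simplification).
    """
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    p = math.erfc(abs(z) / math.sqrt(2))
    return z, p

# Hypothetical U value for one word, assuming 10 AE chunks and 10 BE chunks.
z, p = u_to_p(u=20.0, n1=10, n2=10)
print(f"z = {z:.2f}, p = {p:.4f}")
```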
p < 0.05:
❖ Significant results: 2357
❖ Non-significant results: 13399
❖ 14.96% of tests turn out significant
p < 0.01:
❖ Significant results: 889
❖ Non-significant results: 14867
❖ 5.64% of tests turn out significant
Let’s see some results...
British English
American English
CONCLUSION
Choosing the most appropriate test
4
| CRAMER’S V | WILCOXON-MANN-WHITNEY |
|---|---|
| No local copy of corpus needed | Local copy of corpus needed |
| No programming skills required | Some programming skills required |
| Interpretation can be difficult | Interpretation is easy |
THANKS!