

QUITA
Quantitative Index Text Analyzer

Miroslav Kubát, Vladimír Matlach

Department of General Linguistics, Palacký University, Czech Republic

Acknowledgement: QUITA was supported by the student project IGA (no. FF_2013_031) of Palacký University, Olomouc.

oltk.upol.cz/software

Our aim is to provide a user-friendly tool for quantitative text analysis to researchers from various disciplines (linguistics, criticism, history, sociology, psychology, politics, biology, etc.). QUITA combines the main stages of quantitative research: obtaining results, statistical testing, and graphical visualization. No additional software, such as spreadsheet applications or specialized statistical programs, is needed.

INDICATORS TO COMPUTE

Frequency Structure indicators
o Type-Token Ratio (TTR)
o h-point (h)
o Vocabulary Richness (R1)
o Repeat Rate (RR)
o Relative Repeat Rate of McIntosh (RRmc)
o Hapax Legomenon Percentage (HL)
o Lambda (Λ)
o Gini Coefficient (G)
o Vocabulary Richness (R4)
o Curve Length (L)
o Curve Length Indicator (R)
o Entropy (H)
o Adjusted Modulus (A)

Miscellaneous indicators
o Verb Distances (VD)
o Activity (Q) & Descriptivity (D)
o Writer’s View (α)
o Average Token Length (ATL)
o Thematic Concentration (TC)
o Secondary Thematic Concentration (STC)
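To make a few of the frequency-structure indicators concrete, the sketch below computes TTR, RR, H and the h-point from a token list using their common textbook definitions. It is not QUITA's code: the function names `frequency_indicators` and `h_point` are hypothetical, and QUITA's exact formulas (for example the logarithm base for entropy or the handling of edge cases) may differ.

```python
import math
from collections import Counter


def h_point(freqs):
    """h-point of a rank-frequency list sorted in descending order.

    Returns the rank r where f(r) == r. If no such rank exists, the two
    neighbouring points are joined by a straight line and intersected
    with f = r. Appending a zero-frequency rank at the end is an
    assumption of this sketch so that a crossing always exists.
    """
    if not freqs:
        raise ValueError("need at least one frequency")
    prev_rank, prev_freq = 0, 0
    for rank, freq in enumerate(list(freqs) + [0], start=1):
        if freq == rank:
            return float(rank)
        if freq < rank:
            # Linear interpolation between (prev_rank, prev_freq) and (rank, freq).
            return (prev_freq * rank - freq * prev_rank) / (
                rank - prev_rank + prev_freq - freq
            )
        prev_rank, prev_freq = rank, freq


def frequency_indicators(tokens):
    """Return N, V, TTR, RR, H (in bits) and h for a list of tokens."""
    n = len(tokens)
    if n == 0:
        raise ValueError("need at least one token")
    freqs = sorted(Counter(tokens).values(), reverse=True)
    probs = [f / n for f in freqs]
    return {
        "N": n,                                         # number of tokens
        "V": len(freqs),                                # number of types
        "TTR": len(freqs) / n,                          # type-token ratio V / N
        "RR": sum(p * p for p in probs),                # repeat rate, sum of p_i^2
        "H": -sum(p * math.log2(p) for p in probs),     # Shannon entropy
        "h": h_point(freqs),
    }


if __name__ == "__main__":
    print(frequency_indicators("a a a b b c".split()))
```

The remaining indicators in the list build on the same rank-frequency data, so a sorted frequency list like `freqs` is a natural starting point for them as well.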

TEXT-PROCESSING

Pre-processing
o Tokenizer (word, line, char, DNA triplet, DNA nucleotide)
o Multilingual lemmatizer (AR, CZ, DE, DK, EN, ES, FI, FR, IT, NL, PT, RO, RU, SE)
o POS tagger (distinguishes parts of speech in a text)
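The tokenizer modes named above can be illustrated with a minimal sketch. The function `tokenize` and its mode names are hypothetical, and the rules it applies (lowercasing words, skipping whitespace in char mode, reading DNA in non-overlapping triplets from the first base and dropping an incomplete final triplet) are assumptions of this example, not a description of QUITA's behaviour.

```python
import re


def tokenize(text, mode="word"):
    """Split text into tokens according to a hypothetical mode name."""
    if mode == "word":
        # Assumption: words are runs of word characters, lowercased.
        return re.findall(r"\w+", text.lower())
    if mode == "line":
        # Assumption: empty lines are skipped.
        return [line for line in text.splitlines() if line.strip()]
    if mode == "char":
        # Assumption: whitespace is not counted as a token.
        return [ch for ch in text if not ch.isspace()]
    if mode in ("dna_nucleotide", "dna_triplet"):
        # Keep only the four nucleotide letters.
        seq = re.sub(r"[^ACGT]", "", text.upper())
        if mode == "dna_nucleotide":
            return list(seq)
        usable = len(seq) - len(seq) % 3  # drop an incomplete last triplet
        return [seq[i:i + 3] for i in range(0, usable, 3)]
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    print(tokenize("The cat, the hat.", "word"))
    print(tokenize("ATGGCCTA", "dna_triplet"))
```

Whichever unit is chosen, the resulting token list is what the indicators are computed on, so the same text can yield very different values in word mode and in char mode.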

Post-processing
o N-grams (QUITA can build n-grams from characters, words, or any other token type)
o Text length reduction
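Because n-grams are built from tokens, the same routine works for characters, words, or DNA triplets. The following is a minimal sketch of that idea; the function name `ngrams` is hypothetical and this is not QUITA's implementation.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # A sliding window of width n; yields nothing if the list is shorter than n.
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


if __name__ == "__main__":
    print(ngrams(["to", "be", "or", "not"], 2))  # word bigrams
    print(ngrams(list("text"), 3))               # character trigrams
```

The n-gram tuples can then be treated as tokens in their own right and passed to any frequency-based indicator.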

STATISTICAL COMPARISON

CREATING CHARTS