Latent Dimensions of Religion and Spirituality: A Longitudinal Correlated Topic Model
Seong-Hyeon (Sung) Kim¹, Nathaniel R. Strenger², & Narae Lee¹
¹ Fuller Graduate School of Psychology, Pasadena, California, USA
² Pastoral Counseling Center, Dallas, Texas, USA
Overview
• Religion & Spirituality (R/S)
• Research Questions
• Topic models
• Automated text analysis
• Topics: Latent dimensions of text
• Topic proportions as compositional data
• Ternary diagrams
• Topic correlations
Religion & Spirituality (R/S)
• Definitions
• Religion: “the search for significance that occurs within the context of established institutions that are designed to facilitate spirituality” (Pargament et al., 2013, p. 15).
• Spirituality: “the search for the sacred” (Pargament et al., 2013, p. 14).
Pargament, K. I., Mahoney, A., Exline, J. J., Jones, J. W., & Shafranske, E. P. (2013). Envisioning an integrative paradigm for the psychology of religion and spirituality. In K. I. Pargament, J. J. Exline, & J. W. Jones (Eds.), APA handbook of psychology, religion, and spirituality (Vol 1): Context, theory, and research (pp. 3–19). Washington, DC: American Psychological Association. https://doi.org/10.1037/14045-001
Religion & Spirituality (R/S)
• Gorsuch (1984) introduced factor analysis as a tool to investigate the dimensions of R/S.
• He had criticized the over-supply of R/S measures.
• Our research introduces topic modeling as a tool to identify the fundamental dimensions, or building blocks, of R/S as conceptualized in existing R/S measures.
Gorsuch, R. L. (1984). Measurement: The boon and bane of investigating religion. American Psychologist, 39(3), 228–236. https://doi.org/10.1037/0003-066X.39.3.228
Automated Text Analysis
• Quantitative (NOT qualitative) text analysis
• Three Different Types
1. Dictionary method: Pre-defined set of categories
2. Supervised learning: Outcome categories known (e.g., spam mail sorting)
3. Unsupervised learning: e.g., topic modeling (outcome categories unknown)
Topic Modeling
• Identify topics, the latent dimensions, in the text data
• Machine (statistical) learning + computer science + statistics
• Latent Dirichlet Allocation (LDA; Blei, Ng, & Jordan, 2003): Basic and popular, but does not allow topic correlations
Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.
TASA Corpus: 37,000 Texts & 300 Topics
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 424–440). Hillsdale, NJ: Erlbaum.
Example: Steyvers & Griffiths (2007)
• 2 topics, each giving approximately equal probability to three words
• Topic 1: “money,” “loan,” and “bank”
• Topic 2: “river,” “stream,” and “bank”
• 16 documents were created by arbitrarily mixing the two topics
• Let’s analyze this collection of documents with LDA (Blei et al., 2003)
Steyvers, M., & Griffiths, T. (2007). Probabilistic topic models. In T. Landauer, D. S. McNamara, S. Dennis, & W. Kintsch (Eds.), Handbook of Latent Semantic Analysis (pp. 424–440). Hillsdale, NJ: Erlbaum.
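The way such a collection could be generated can be sketched in a few lines of Python. This is only an illustration of the mixing idea: the topic names, the document length of 16 words, and the evenly spaced mixing shares are assumptions, not details taken from Steyvers and Griffiths (2007).

```python
import random

# Two toy topics; each spreads probability equally over its three words.
TOPICS = {
    "finance": ["money", "loan", "bank"],
    "nature": ["river", "stream", "bank"],
}

def generate_document(finance_share, length, rng):
    """Draw `length` words; each word first picks a topic, then a word from it."""
    words = []
    for _ in range(length):
        topic = "finance" if rng.random() < finance_share else "nature"
        words.append(rng.choice(TOPICS[topic]))
    return words

def generate_corpus(n_docs=16, length=16, seed=0):
    """Build n_docs documents whose topic mix runs from all-finance to all-nature."""
    rng = random.Random(seed)
    # Assumed mixing shares: evenly spaced from 1.0 down to 0.0.
    shares = [1 - i / (n_docs - 1) for i in range(n_docs)]
    return [generate_document(share, length, rng) for share in shares]

corpus = generate_corpus()
```

Because “bank” belongs to both topics, a topic model has to rely on the words it co-occurs with to decide which sense a given document is using.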
Steyvers & Griffiths (2007)
Example: 16 Documents
Term Distributions for Topics
Topic 1                Topic 2
Word     Probability   Word     Probability
bank     .390          stream   .391
money    .314          bank     .345
loan     .287          river    .240
river    .009          money    .012
stream   .000          loan     .012
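A short Python sketch shows how these term distributions combine with a document's topic proportions to give word probabilities for that document. The word probabilities are the ones listed above; the 50/50 document is a hypothetical example, and the function name is illustrative.

```python
# Estimated P(word | topic), copied from the term distributions listed above.
TOPIC_WORD = {
    "Topic 1": {"bank": 0.390, "money": 0.314, "loan": 0.287, "river": 0.009, "stream": 0.000},
    "Topic 2": {"stream": 0.391, "bank": 0.345, "river": 0.240, "money": 0.012, "loan": 0.012},
}

def word_probabilities(doc_topic):
    """Mix the topic-word distributions using a document's topic proportions."""
    vocab = TOPIC_WORD["Topic 1"].keys()
    return {
        word: sum(doc_topic[topic] * TOPIC_WORD[topic][word] for topic in TOPIC_WORD)
        for word in vocab
    }

# Hypothetical document that draws equally on both topics.
mixed = word_probabilities({"Topic 1": 0.5, "Topic 2": 0.5})
```

Stacking this calculation over all documents is the matrix-factorization view of a topic model: the word-by-document probabilities are the product of a word-by-topic matrix and a topic-by-document matrix.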
Topic Distribution for Documents
Matrix Factorization
LDA & Beyond
• Limitations of LDA
• Fails to model correlation between topics
• Stems from the implicit independence assumption in the Dirichlet distribution on the topic proportions in documents
• Topics are usually correlated in texts.
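One way to see the limitation: under a Dirichlet, the correlation between any two topic proportions is fixed by the concentration parameters and is always negative, so two topics can never be modeled as tending to appear together. The sketch below uses the standard closed-form Dirichlet correlation; the parameter values are hypothetical.

```python
import math

def dirichlet_correlation(alpha, i, j):
    """Correlation between components i and j of a Dirichlet(alpha) vector."""
    a0 = sum(alpha)
    return -math.sqrt(alpha[i] * alpha[j] / ((a0 - alpha[i]) * (a0 - alpha[j])))

# Hypothetical concentration parameters for three topics.
symmetric = dirichlet_correlation([1.0, 1.0, 1.0], 0, 1)  # equals -0.5
skewed = dirichlet_correlation([5.0, 1.0, 0.5], 0, 1)     # also negative
```

No choice of parameters yields a positive value, since the only dependence comes from the proportions having to sum to one.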
LDA & Beyond
• Correlated Topic Model (CTM; Blei & Lafferty, 2007)
• Replaces the Dirichlet in LDA with the “more flexible logistic normal distribution” (p. 19).
• Blei and Lafferty (2007) cite Aitchison & Shen (1980), Aitchison (1982), and Aitchison (1985).
Aitchison, J. (1982). The statistical analysis of compositional data. Journal of the Royal Statistical Society. Series B (Methodological), 44(2), 139–177.
Aitchison, J. (1985). A general class of distributions on the simplex. Journal of the Royal Statistical Society. Series B (Methodological), 47(1), 136–146.
Aitchison, J., & Shen, S. M. (1980). Logistic-normal distributions: Some properties and uses. Biometrika, 67(2), 261–272.
Blei, D. M., & Lafferty, J. D. (2007). A correlated topic model of science. The Annals of Applied Statistics, 1(1), 17–35. https://doi.org/10.1214/07-AOAS114
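A minimal Python sketch of a logistic normal draw for three topics follows: sample a correlated Gaussian vector, then map it onto the simplex with a softmax. The mean, the covariance matrix, and the convention of fixing the third coordinate at 0 are assumptions chosen for illustration, not the estimation procedure of the CTM.

```python
import math
import random

def sample_logistic_normal(mu, cov, rng):
    """Draw 3 topic proportions: 2-D correlated Gaussian, then softmax with a fixed 0."""
    (s11, s12), (_, s22) = cov
    # Cholesky factor of the 2x2 covariance matrix.
    l11 = math.sqrt(s11)
    l21 = s12 / l11
    l22 = math.sqrt(s22 - l21 ** 2)
    z1, z2 = rng.gauss(0, 1), rng.gauss(0, 1)
    eta = [mu[0] + l11 * z1, mu[1] + l21 * z1 + l22 * z2, 0.0]
    # Softmax maps the Gaussian draw onto the simplex.
    exps = [math.exp(e) for e in eta]
    total = sum(exps)
    return [e / total for e in exps]

rng = random.Random(1)
# Hypothetical parameters: the positive covariance lets topics 1 and 2 rise together.
theta = sample_logistic_normal(mu=[0.0, 0.0], cov=[[1.0, 0.8], [0.8, 1.0]], rng=rng)
```

The off-diagonal covariance term is what the Dirichlet lacks: it can be set positive or negative for each pair of topics.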
Structural Topic Model (STM)
• Our research used STM, which is based on CTM
• Allows topic correlations
• Allows covariates (i.e., predictors of topic proportions)
• We collected 255 R/S measures published between 1929 and 2016 to identify the latent dimensions of text.
Atkins, D. C., Rubin, T. N., Steyvers, M., Doeden, M. A., Baucom, B. R., & Christensen, A. (2012). Topic models: A novel method for modeling couple and family text data. Journal of Family Psychology, 26, 816–827. https://doi.org/10.1037/a0029607
Preprocessing
• R ‘tm’ package (Feinerer & Hornik, 2015)
• Items of 255 R/S measures
• Preprocessed texts
• Removed stop words, numbers, and punctuation
• e.g., a/an, the, to, for, at, she/he, I, ., or ?
• Lemmatized words
• e.g., educate, educated, or educating → educate
Feinerer, I. & Hornik, K. (2015). tm: Text Mining Package (Version 0.6-2) [Computer software]. Retrieved from https://CRAN.R-project.org/package=tm.
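The preprocessing steps can be illustrated with a small Python stand-in. This is not the ‘tm’ workflow used in the study: the stop-word list and the lemma dictionary are tiny hypothetical examples, and the sample sentence is invented.

```python
import re

# Hypothetical mini stop-word list and lemma dictionary, for illustration only.
STOP_WORDS = {"a", "an", "the", "to", "for", "at", "she", "he", "i"}
LEMMAS = {"educated": "educate", "educating": "educate"}

def preprocess(text):
    """Lowercase, drop numbers, punctuation, and stop words, then lemmatize."""
    # Keeping only runs of letters removes digits and punctuation in one step.
    tokens = re.findall(r"[a-z]+", text.lower())
    return [LEMMAS.get(token, token) for token in tokens if token not in STOP_WORDS]

tokens = preprocess("She educated the 3 students; he is educating for a living?")
# tokens == ["educate", "students", "is", "educate", "living"]
```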
Preprocessing
• Created a document-term matrix
• Dimensions: 255 documents × 5,617 terms
• Included
• unigrams
• bigrams (e.g., Jesus Christ)
• trigrams (e.g., religious (and/or) spiritual belief)
• Deleted low-frequency terms (frequency < 3)
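A minimal Python sketch of building such a document-term matrix from unigrams, bigrams, and trigrams is given below. The example documents are hypothetical, and the cutoff is read here as a total count across the corpus; the slides do not say whether the threshold was a total count or a document count.

```python
from collections import Counter

def ngrams(tokens, n_max=3):
    """Return all unigrams, bigrams, and trigrams as space-joined strings."""
    return [
        " ".join(tokens[i:i + n])
        for n in range(1, n_max + 1)
        for i in range(len(tokens) - n + 1)
    ]

def document_term_matrix(docs, min_count=3):
    """Count n-grams per document, keeping terms seen at least min_count times overall."""
    counts = [Counter(ngrams(doc)) for doc in docs]
    totals = sum(counts, Counter())
    vocab = sorted(term for term, count in totals.items() if count >= min_count)
    matrix = [[doc_counts[term] for term in vocab] for doc_counts in counts]
    return vocab, matrix

# Hypothetical preprocessed items, not taken from the 255 measures.
docs = [
    ["jesus", "christ", "love"],
    ["jesus", "christ", "pray"],
    ["pray", "jesus", "christ"],
]
vocab, matrix = document_term_matrix(docs)
# vocab == ["christ", "jesus", "jesus christ"]; each row of matrix is [1, 1, 1]
```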
Model Estimation
• R ‘stm’ package (Roberts, Stewart, & Tingley, 2016)
• Topics
• Latent dimensions of text data
• Comparable to principal components or factors
• Estimated based on word co-occurrences across documents
• Structural topic modeling
• Estimate covariates’ effect on topic proportions
• Current analysis: Decade of publication (1950s through 2010s) as a predictor
Roberts, M. E., Stewart, B. M., & Tingley, D. (2016). stm: R Package for Structural Topic Models (Version 1.1.3) [Computer software]. Retrieved from http://www.structuraltopicmodel.com
Top 50 Frequent Terms
Diagnostic Indexes
3 Topics Identified
• Topic 1: Spirituality
spirituality, spiritual belief, religious spiritual, wilderness, never experience, spiritual experience, connect, illness, transcendent, transcendent spiritual
• Topic 2: Religion
church member, loving, teaching church, dealing, dealing life, local religious, join, local religious group, question meaning life, religious denomination
• Topic 3: Judeo-Christianity
christian, allah, miracle, god will, god god, punish, client, god feel, patient, writing
The estimated regression lines and their 95% confidence intervals are plotted.
Longitudinal Change of Expected Topic Proportions from the 1950s to the 2010s
Created using R ‘compositions’ package (van den Boogaart, Tolosana, & Bren, 2015)
Van den Boogaart, K. G., Tolosana, R. & Bren, M. (2015). compositions: R Package for Compositional Data Analysis (Version 1.40-1) [Computer software]. Retrieved from https://cran.r-project.org/web/packages/compositions/index.html
Normal Distribution on the Simplex
Topic Correlations
1. exp(-var(z)): Buccianti & Pawlowsky-Glahn (2005)
• z = ilr-transformed parts
• Values near 0 (1) → high (low) variability of ratios between parts
• e.g., .0016 for Topics 1 and 2
2. exp(-τ²/2): van den Boogaart & Tolosana-Delgado (2013)
• τ: Variation
• Interpret this as a correlation coefficient
• Very small between topics
Buccianti, A., & Pawlowsky-Glahn, V. (2005). New perspectives on water chemistry and compositional data analysis. Mathematical Geology, 37(7), 703–727.
Van den Boogaart, K. G., & Tolosana-Delgado, R. (2013). Analyzing compositional data with R (Vol. 122). Heidelberg: Springer.
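Both indexes can be sketched in Python for a single pair of topics. Several choices here are assumptions: z is taken as the two-part ilr balance ln(x_i / x_j) / √2, τ is taken as the variance of the log-ratio (the “variation”) and plugged into the slide's formula as written, the sample variance (n − 1 denominator) is used, and the topic proportions are hypothetical.

```python
import math
import statistics

def variation(part_i, part_j):
    """Sample variance of the log-ratio ln(x_i / x_j) across documents."""
    return statistics.variance([math.log(a / b) for a, b in zip(part_i, part_j)])

def ilr_index(part_i, part_j):
    """exp(-var(z)), with z the two-part ilr balance ln(x_i / x_j) / sqrt(2)."""
    z = [math.log(a / b) / math.sqrt(2) for a, b in zip(part_i, part_j)]
    return math.exp(-statistics.variance(z))

def variation_index(part_i, part_j):
    """exp(-tau^2 / 2), applying the slide's formula with tau set to the variation."""
    tau = variation(part_i, part_j)
    return math.exp(-tau ** 2 / 2)

# Hypothetical topic proportions across three documents.
proportional = variation_index([0.2, 0.3, 0.4], [0.1, 0.15, 0.2])  # ratio is constant
unstable = variation_index([0.05, 0.5, 0.9], [0.9, 0.4, 0.02])     # ratio swings widely
```

When two parts keep a constant ratio across documents, the log-ratio has zero variance and both indexes equal 1; the more the ratio varies, the closer they move toward 0.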
THANK YOU