This Is The Architecture Of Our Practice

An overview of how Rock Creek Analytics provides opinion research designed and used for the Internet, from sampling to analysis.

Rock Creek Analyticsʼ pioneering tools use Internet text to analyze public opinion. We provide detailed profiles of people and issues. We find, follow, and assess developing trends. Our work is quick, accurate, and thorough.

We use Internet content because everyone – people, institutions, even issues – leaves a record there: what we say and what is said about us. Even more important, opinion in Net content is there because people want it there: people go to blogs, Facebook, and Twitter to express their opinions and leave them for others to read.

We download that content and analyze it for the characteristic and distinctive words and phrases that mark off opinion about a person, an issue, or any other factor in developing policy. We use a variety of statistical perspectives to create profiles that characterize whatʼs being said online by and about someone (or something). Or whatʼs being said about something new: an emerging trend.

Here is what we do:

•  Create profiles – from as many perspectives as you need. What makes Glenn Beck and his message jump out for his audience? We tell you the how and the why of the rise of Glenn Beck to prominence, and not just the how much.

•  Identify and assess emerging trends. Where did the nativist outcry during the financial crisis come from? What made the argument effective and how did it trail off?

•  Measure the prominence – newsworthiness, notoriety – of an agitator or a cause. When and by how much did someone become noticed, and how quickly did they fade into the crowd?

We do this by benchmarking: comparing text and opinion about a person with a context, comparing one political position with another, finding celebrity against a shifting background: Al Gore and the “global warming debate”; Sarah Palin and the 2008 “Presidential campaign”.

Our benchmarking uses a series of statistical tools: most importantly, evaluating significance. We find the differences that matter, and put them together. What follows is how we do it.


Search and sampling

Most opinion research uses random sampling. Ours does not.

In random sampling, each item has an equal chance of being selected and each selection is made independently. Randomness is modeled by the normal distribution. Even in a non-random environment, randomness is the basis of the standard polling process.

Compare Krosnick, J. A. (2006) with Blumenthal, M., No Such Thing As A Perfect Sample (2009), on The Economist/YouGov Internet Presidential poll: a typical debate, focusing on randomness as an unquestioned assumption rather than considering whether it applies to Net content (including whatʼs said by poll subjects found online).

Internet content is not randomly ordered. The Net material we use is not amenable to random sampling, and is not described by the mathematical models of randomness, such as the normal distribution. Rather than randomness, we base our analysis on the makeup of Internet architecture: power laws, and scale-free and small-world distributions.

The Structure of the Web; L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker, Web Caching and Zipf-like Distributions: Evidence and Implications; Menczer, F., Lexical and Semantic Clustering by Web Links.

Power laws, scale freedom, and small-world distribution therefore apply to samples of Net content, including those we use in our work. Random sampling is unlikely to produce workable material in a non-random environment, and what it does produce is unlikely to be representative of that environment.

“[C]ohesive collections of Web pages (for instance, pages on a site or pages about a topic) mirror the structure of the Web at large.” Dill, S., et al., Self-Similarity In the Web; Menczer, F., Lexical and Semantic Clustering by Web Links; D. Gibson, J. Kleinberg, Inferring Web Communities from Link Topology.

This difficulty obtains for any grouped text, either downloaded, as contemplated above, or gathered off-line.

“The inherent non-randomness of corpus data renders statistical estimates unreliable, since the random variation on which they are based will be smaller than the true variation of the observed frequencies.” S. Evert, How random is a corpus?

The library metaphor

Consider analyzing opinion leadership during the efforts in September and October 2008 to stem the financial crisis. Ultimately the Net content of interest was in the topics of finance and politics. Under the assumption of self-similarity, those topics were the source from which Net content was taken.

See Chakrabarti, et al., The Structure of Broad Topics on the Web; David M. Pennock, Gary W. Flake, Steve Lawrence, Eric J. Glover, and C. Lee Giles, Winners donʼt take all: Characterizing the competition for links on the web.

The concurrent and cross-sectional analysis conducted for profiles is similar.

The units of sampling in this case were documents/texts, from 100 to 2000 words in length, in two groups of about 5000 Internet files total, taken from on-line websites of newspapers (the New York Times), topical websites, political and other weblogs, and assorted newsgroups.

Different Net upload-store-download technologies present different issues for, variously, search and retrieval (and analysis). Of the four here, only web pages are usually undated; by-lines are used in on-line newspaper articles in contrast with pseudonyms found elsewhere, collective versus individual authorship, and so on. See below.

Originally the choices were made to see if opinion leadership existed in one technology, e.g. re-purposed newspapers, or another.

Blogs and Web 2.0 social media are sometimes used when changing opinion is being analyzed.

In some cases, we use social media for time-sensitive searching, weighted to reflect recency and time-sensitive topicality.

Blogs present separate issues. Because a blog post with a comment string may be a kind of conversation (therefore a single functional text), but may also be broken up into several different chunks of code, we may sample this as a set of strings while analyzing the string as a single unit.

The unit of analysis we use is a ʻtextʼ. A text may refer to a single item (a web page) or to an aggregate (10,000 Net files using the words “financial” and “crisis” which were uploaded or posted September 1, 2008 – September 28, 2008).

The term ʻtextʼ as used here refers to each of two different functions:

•  The formal definition required for sampling and retrieval: ʻone or more sentences demarcated by typographical conventions (white space, binding) or technical definitions and use (<body> text </body> in HTML)ʼ. (A code sketch of this definition follows the list.)

•  The functional definition required for analysis: ʻa semantic unit of language in use, containing one or more sentences, containing chains of repeated and related words, and both familiar and novel information.ʼ
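
As a minimal sketch of the formal definition, assuming HTML input (the class name and the bare-bones parsing are illustrative, not our production extractor):

    # Sketch: treat the text between <body> and </body> as one 'text',
    # per the formal definition above. Standard library only; a real
    # extractor would also handle scripts, styles, and malformed markup.
    from html.parser import HTMLParser

    class BodyText(HTMLParser):
        def __init__(self):
            super().__init__()
            self.in_body = False
            self.chunks = []

        def handle_starttag(self, tag, attrs):
            if tag == "body":
                self.in_body = True

        def handle_endtag(self, tag):
            if tag == "body":
                self.in_body = False

        def handle_data(self, data):
            if self.in_body:
                self.chunks.append(data)

    def body_text(html):
        """Return the 'text' of one HTML file, per the formal definition."""
        parser = BodyText()
        parser.feed(html)
        return " ".join(parser.chunks)

    print(body_text("<html><body><p>One text.</p></body></html>"))  # One text.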

Net content is embedded in different kinds of code. Typically we analyze content in HTML files (also, when needed, blog comments in database languages). (Newsgroup code, UUE etc., presents separate issues out of scope here.)

The working convention for retrieval is, therefore, “file” = “HTML document” = “web page” = “text”. Depending on context, however, a blog thread – post and comments – may count as a single text, or the post and each comment may each count as separate individual texts. (This is a working distinction; the details are out of scope.)

Texts are also, as the formal definition implies, collections of words with more or less well-formed boundaries. Words are the units of measurement at this level of granularity, and thereby serve two critical functions: they are units of analysis for frequency, dispersion, and collocation, and they are semantic anchors for contextual and topical analysis, in words, phrases, sentences, and texts. The dialectic between the formal and semantic/topical perspectives on words is the keystone of our work.

The formal extensional definition of words – ʻcharacter strings bounded with white space, grammatical markingʼ and so on – is omitted here.

Text files are obtained from the Internet by using several kinds of search engine, each with a different kind of ranking algorithm: Google (as an example of PageRank), backlinks (Yahoo, among others), HITS/authority, and unique visitors/popularity. Results are retrieved (with date limitations, as needed) and downloaded by using returns from each search engine separately (ranked by weight, then recursed, with results retrieved from a new search) and by aggregating results.

The search engines used and the weighting algorithms vary from case to case. The collection process is designed around self-similarity, small-world, and scale-free assumptions.
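
As an illustration of the aggregation step, here is a minimal sketch that merges ranked result lists from several engines under per-engine weights. The reciprocal-rank scoring is a stand-in chosen for this sketch, and the engine names, weights, and URLs are assumptions; as noted above, the real weighting varies from case to case.

    # Sketch: weighted aggregation of ranked search results.
    # Reciprocal-rank scoring is an illustrative stand-in, not the
    # firm's actual weighting scheme.
    def aggregate(rankings, weights):
        """rankings: {engine: [url, ...] in rank order}; weights: {engine: float}."""
        scores = {}
        for engine, urls in rankings.items():
            for rank, url in enumerate(urls, start=1):
                scores[url] = scores.get(url, 0.0) + weights[engine] / rank
        return sorted(scores, key=scores.get, reverse=True)

    merged = aggregate(
        {"pagerank":  ["a.example", "b.example", "c.example"],
         "backlinks": ["b.example", "d.example"]},
        {"pagerank": 0.6, "backlinks": 0.4})
    print(merged)  # b.example rises: it is ranked well by both engines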

Analysis: Frequencies

Trend recognition is one of the most common forms of frequency analysis, and so we will focus on it here. Discovering a trend – either retrospectively, or more or less concurrently – is a before-and-after analysis of content. Working with the blocks of text files and the statistics of word frequency change, we compare sequential blocks of comparable topical materials.

The before-and-after analysis used for trends begins with compiling word frequencies in the ʻbeforeʼ material – which also functions as a benchmark. In this case the words are those in the September and October text, and frequencies are enumerated for each.

The compiled lists give what are sometimes called the “observed absolute frequencies” for the listed words.
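
A minimal sketch of this compilation step (the directory names and the simple tokenizer are assumptions for illustration, not our production pipeline):

    # Sketch: compile observed absolute frequencies for a set of
    # downloaded text files.
    import re
    from collections import Counter
    from pathlib import Path

    def word_frequencies(paths):
        """Return a Counter of observed absolute frequencies."""
        counts = Counter()
        for path in paths:
            text = Path(path).read_text(encoding="utf-8", errors="ignore")
            # Lowercase word tokens bounded by non-letter characters.
            counts.update(re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower()))
        return counts

    before = word_frequencies(Path("before").glob("*.txt"))  # benchmark set
    after = word_frequencies(Path("after").glob("*.txt"))    # comparison set
    print(before.most_common(5))  # led by function words such as "the"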

Page 4: Practice Architecture 1

8/3/2019 Practice Architecture 1

http://slidepdf.com/reader/full/practice-architecture-1 4/10

4

We use several metrics. One set is drawn from changes in network graphs and graph-based results.

Google (as an example of PageRank), backlinks, HITS/authority, and unique visitors/popularity.

For a comparable approach, see Gabrilovich, Dumais, Horvitz, Newsjunkie: Providing Personalized Newsfeeds via Analysis of Information Novelty, and, to similar effect, Jon Kleinberg, Temporal Dynamics of On-Line Information Streams.

To analyze incipient trends and numbers of words, a second set of metrics uses measurements of word association, the significance of changes in word frequency, and changes in the dispersion of the most important words to measure effects.

Some trend metrics use an interval scale – those used for word frequencies, for example. To an extent we can measure the trend and its effect using interval data and derivatives. However, we are constrained by the need to use ordinal results (web page ranking systems) and non-parametric dispersion analysis.

For example: relative frequencies are critical for identifying the under-the-radar onset of trends, popularity and some Google functions for following trends, and word and link dispersion for quantifying effects.

To compare before-and-after opinion on political issues, we looked at the two one-month periods before and after September 26th, the date the first bailout legislation failed. We took each to mark an appropriate sampling unit for Net opinion on the political dimensions of the financial crisis.

Assuming self-similarity, we used “financial” and “crisis” to define a set of texts dealing with that topic and also representative of the larger domains of opinion on the issue on the Net. These definitions also served as search engine queries (in the first instance, as discussed above). After weighted ranking, we also introduced date limitations (of convenience, further simplified for this discussion) and downloaded files in two sets of about 250,000 words each, for the two time periods.

There is also the dynamic case, not used here, with feedback, such that results (from one or another level) about opinion from earlier periods are introduced (at one or another level) for another period. This ranges from media feedback to explicitly and overtly gaming a popularity-based search engine such as Technorati.

The listsʼ word frequencies – here for the pre- and post-September 26 text collections – are then compared. This contrast is the next step in showing whether a trend – a discrete and identifiable chain of opinion – emerged from one dated set of texts compared to its predecessor. What turns up when raw counts are compared?

Page 5: Practice Architecture 1

8/3/2019 Practice Architecture 1

http://slidepdf.com/reader/full/practice-architecture-1 5/10

5

There was very little difference between the periods at this level in this case: terms like “bailout” and “financial”, which led the substantive discussion, decrease, but only very slightly.

Observed absolute frequencies of critical terms before and after 9/26/08. (Functional words are much more common than the content words we analyze. The function word “the” has been added to the table for comparison; its use declines as well, also only slightly. This suggests that a decline in observed frequencies standing alone is unlikely to be informative.)

                  pre-9/26          post-9/26
    “the”         14,171 (1st)      13,300 (1st)
    bailout          899 (32nd)        582 (59th)
    government       661 (45th)        437 (79th)
    financial        601 (50th)        553 (63rd)

(Numbers in parentheses show frequency rank. Data from a Rock Creek study; the pre- and post-9/26 text sets were about 250,000 words each.)

However, going beyond this case: as a general matter, if the sample sets are of different sizes, comparing raw word frequencies between sets is not, standing alone, even valid, much less useful, at least not until the comparison is normalized.

The more frequent occurrence of a word in one text collection does not by itself show that the observed word is actually more frequent, because the observed frequencies are dependent on the sizes of the texts that are being compared. Gries, S. Th., Useful statistics for corpus linguistics.

Normalized frequencies

One method for benchmarking comparisons normalizes the different instances: the ratio of a wordʼs raw count to the word count for the entire text. This can be expressed as “[word X] per thousand”, or as a percentage.
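
A hedged sketch of the normalization, continuing the counting example above:

    # Sketch: normalized (relative) frequencies, per thousand words,
    # so that text sets of different sizes can be compared fairly.
    def per_thousand(counts):
        total = sum(counts.values())
        return {word: 1000.0 * c / total for word, c in counts.items()}

    rel_before = per_thousand(before)  # 'before'/'after' from the sketch above
    rel_after = per_thousand(after)
    print(rel_before.get("bailout", 0.0), rel_after.get("bailout", 0.0))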

Differences after normalizing continue to be slight in this case. “[F]inancial” and “bailout” dominated the content words of the substantive debate, but “bailout”, for example, shows only about a 0.15 percentage-point decrease in use for the post-September 26 period.

Important terms as percentages of observed total word counts

                  pre-9/26    post-9/26
    “the”         5.6         4.6
    bailout        .35          .20
    government     .26          .15
    financial      .24          .19

As the tables suggest, comparing word frequencies in almost any pair of texts may not show much difference even when normalized. The critical question is whether the differences in use (i.e., frequencies) for important terms matter. Our analysis relies on a statistical test for comparing frequencies.

The results of relative frequency testing serve three different but related inquiries:

•  First, are the two texts being compared (non-trivially) distinct from one another? Is there word use in the sets of text that shows meaningful differences in opinion for September and October 2008?

•  Second, how are the texts distinct? What word use distinguishes one from the other? In this case, did patterns of word use – ultimately trends in opinion – emerge and develop?

•  Third, if there are conspicuous differences, do the sharp-edged differences in word patterns suggest more or less topically and thematically related set(s) of words?


The process begins with the frequency lists just described. For each word in the two frequency lists we derive the significance statistic to obtain a value with which to distinguish the September and October texts and to analyze the distinction.

There are more than two dozen tests now being discussed in the scientific disciplines concerned with evaluating the significance of frequency differences in word use when paired texts are compared. The most commonly used are a log likelihood test and the chi-squared test.

The log likelihood test we use does not assume randomness or normally distributed data in making comparisons. This test is therefore better suited to the Netʼs non-random word and content distribution.

See Dunning, T., Accurate methods for the statistics of surprise and coincidence (cited more than 1300 times); Rayson, P., Garside, R., Comparing Corpora using Frequency Profiling.
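
A minimal sketch of the statistic in the two-corpus form given by Rayson and Garside; the set sizes in the example are back-calculated from the “the” row of the frequency table above:

    # Sketch: two-corpus log likelihood, per Rayson & Garside.
    # a, b: a word's observed counts in the two text sets;
    # n1, n2: the sets' total word counts.
    import math

    def log_likelihood(a, b, n1, n2):
        e1 = n1 * (a + b) / (n1 + n2)  # expected count in set 1
        e2 = n2 * (a + b) / (n1 + n2)  # expected count in set 2
        ll = 0.0
        if a > 0:
            ll += a * math.log(a / e1)
        if b > 0:
            ll += b * math.log(b / e2)
        return 2.0 * ll

    # "bailout": 899 before vs 582 after; set sizes of roughly 253,000
    # and 289,000 words are estimated from the "the" percentages above.
    print(log_likelihood(899, 582, 253000, 289000))  # ≈ 117

With a sign convention of negative for declining terms, this reproduces the keyness of about -117 reported for “bailout” in the table below.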

By contrast, the commonly used chi-squared test derives probabilities for the frequencies by comparing them with random ordering; our samples are not random. The test depends on the assumption – not applicable for Net content, or words in general – that words are independent and identically distributed.

We also use log likelihood analysis to derive the sets of words that can be used to distinguish one set of texts from another, or to characterize one of them. Sometimes these sharply defined words are referred to as “keywords”: those that occur uncommonly more (or less) often in one text (or set of texts) than in another. Keyness measures relative distinctiveness: how far a term departs from its comparative benchmark. These keywords are the words that characterize individual texts – Romeo and Juliet – or groups of texts (post- as opposed to pre-September 26 discussions of the financial crisis).

Scott, M. & Tribble, C., TEXTUAL PATTERNS. Note that key/keyness are derivative terms: the log likelihood test (discussed above) measures the salience (more or less the same as statistical significance) of relative frequency between texts. That is, the metric looks not at the arithmetical difference in word use, but at how much that difference matters.
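
Continuing the sketches above, a signed keyness ranking might look like this (the sign convention, negative for declines, follows the table below; the helper builds on the log_likelihood function already sketched):

    # Sketch: signed keyness for every word across the two sets,
    # negative where a word declines, positive where it surges.
    def keyness(before, after):
        n1, n2 = sum(before.values()), sum(after.values())
        rows = []
        for word in set(before) | set(after):
            a, b = before.get(word, 0), after.get(word, 0)
            score = log_likelihood(a, b, n1, n2)
            if b / n2 < a / n1:
                score = -score  # declining term
            rows.append((word, score))
        return sorted(rows, key=lambda r: abs(r[1]), reverse=True)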

When we applied “keyness” to the September–October Net discussion, some words stood out. While still uncommon in raw numbers, these words made the later October discussion distinctive: “[m]inorities”, “hate”, and “alien” became visible in this relative frequency analysis. These emerging keywords are evidence of a change in the terms of the debate.

The log likelihood test also uses a logarithmically based ratio scale that facilitates comparisons of individual word (and some other) usage across sets of texts. This in turn allows cross-sectional and longitudinal comparisons.

New terms in the crisis debate. (Left column: percentage of observed total words, with frequency rank; right column: keyness, the departure from expected frequencies, negative for declines.)

                  Percentage        KEYNESS
    bailout       .20 (64th)        -117
    government    .15 (85th)         -82
    financial     .19 (68th)         -14
    minorities    .02 (525th)         34
    hate          .01 (284th)         21
    alien         --  (3141st)        10

Keywords are the hallmarks of frequency change – they bring out the contrast between profile and background, or between the blocks of opinion recorded in Net text. If people talk differently about Toyota than they do about General Motors, what stands out – by the log likelihood statistical metric – in the comparison?


Keywords are also markers for shifts in word use as the Net discussion moves forward. How are keywords situated within text? Where do they fall? This kind of ʻlocationʼ is measured with analysis of “dispersion”: the even or uneven distribution of an item through the text being studied.

Dispersion – placing keywords in the Internet text environment – is the basis for the audience-effect side of our opinion research.

We can place keywords in different audience segments represented by the different groups of Net text, and then compare the text groups for the distribution of keywords. If “minority” and “alien” are found significantly more often during October on conservative blogs than elsewhere, this is evidence of an echo chamber effect for that issue.

Dispersion, then, shows where messages have taken hold: where and by how much the message has had an effect. That is, we are developing ways to measure where, how, and how much a message has affected different blocs of Internet opinion. Where are the key terms found most often in the audience – and in which parts of the audience?

Ordinarily dispersion for ratio-scaled data is measured by standard deviation or variance. However, where the data may be non-parametric, those metrics are not available. Also, pair-wise data is not available, and sample sizes are large. For continuing surveys of the problem see S. Gries, Dispersions and adjusted frequencies in corpora, and Dispersions and adjusted frequencies in corpora, further explorations.
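
One concrete possibility, sketched from the Gries papers just cited, is the “deviation of proportions” (DP); the part sizes and counts below are invented for illustration:

    # Sketch: Gries's "deviation of proportions" (DP). occurrences is
    # a word's count in each corpus part (e.g. per blog or audience
    # segment); part_sizes is each part's total word count.
    def dp(occurrences, part_sizes):
        total_occ = sum(occurrences)
        total_size = sum(part_sizes)
        return 0.5 * sum(abs(o / total_occ - s / total_size)
                         for o, s in zip(occurrences, part_sizes))

    # 0 = perfectly even spread; values near 1 = concentration in a
    # few parts, the "echo chamber" pattern described above.
    print(dp([120, 3, 2], [10000, 10000, 10000]))  # ≈ 0.63, highly uneven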

In this case, the keyword term with visibly uneven dispersion is “minorities”, and the effect is concentrated in the posts of three bloggers, two visibly conservative-leaning. Here, Malkin seems ultimately to have been preaching to the converted.

Nativist rhetoric was found on conservative blogs, at least 40,000 of them (by different sampling than that used above). However, this was less than 4% of blogs discussing the financial crisis, and only traces of the message could be found in common mainstream media websites. Please note that these results are crude, and come from the application of a form of head-counting using the results of the frequency analysis.

Collocation

Collocation is the degree to which words occur together unusually often, by some measure of significance. “Collocation” is a formal term for this intuition – that some words tend to occur near each other: “night” and “day”, “kick” and “bucket”, “global” and “warming”.

Evert, S., Corpora and collocations.

If one or more collocations – a set of key phrases and other patterns – can be found in a text, then we can build up to quantitatively derived core features of the text. Moreover, when collocates can be found and aggregated for the distinctively frequent vocabulary (keywords) in a text, the process marks off the message of the text, whether this is the intended message or the message picked up by the Internet audience reading it.

It follows that the study of collocations and their uses extends and applies statistics in virtually every language discipline, from machine translation to literary analysis to email forensics.


As a rule of thumb, the higher the statistical score for a word pairʼs collocation, the more the association tells us about the pairʼs role in the text. This measurement and analysis can be done by hand (as by inspecting a text for every instance of a word in order to identify which words recur near each other) or using statistics.

This is a considerable over-simplification: well more than 25 measures for collocation have been proposed; seldom, if ever, will all point in the same direction. Corpora and collocations; Oakes, M., STATISTICS FOR CORPUS LINGUISTICS 188-195. In practice, we use one or more of the mutual information, t-test, and log likelihood tests for a project.
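
A minimal sketch of two of those measures, mutual information and the t-score, in their common window-based forms (the counts in the example are invented, not taken from the Malkin analysis):

    # Sketch: two common collocation scores for a word pair.
    # o11: co-occurrence count within a fixed window; f1, f2: the
    # words' individual frequencies; n: total words in the corpus.
    import math

    def mutual_information(o11, f1, f2, n):
        expected = f1 * f2 / n
        return math.log2(o11 / expected)

    def t_score(o11, f1, f2, n):
        expected = f1 * f2 / n
        return (o11 - expected) / math.sqrt(o11)

    # e.g. "illegal" near "alien", with made-up counts:
    print(mutual_information(42, 180, 95, 250000))  # ≈ 9.3: strong association
    print(t_score(42, 180, 95, 250000))             # ≈ 6.5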

Collocates add color to literal meaning; repeated and prominent usage may enhance the coloring of surrounding words. “Cause” is an example; when used as a verb its usual collocates are negative.

ʻCauseʼ collocates with (among other things): damage, problems, pain, disease, distress, trouble, blood, concern, degradation, harm, pollution, suffering, anxiety, death, fear, stress, surprise, symptoms.

Collocation can compound the effect of a distinctive and vivid vocabulary. For example, at the end of 2008 we analyzed the impact of an online essay Michelle Malkin wrote in late September of that year, arguing that illegal immigrants were to blame for the banking collapse.

Illegal immigration and the mortgage mess.

There were interlocking word patterns in that essay that, as received and passed on in Net discussion, could be captured and measured with statistical collocation analysis.

These words also interlocked: “illegal” collocated markedly with both “alien” and “Hispanic”, and so on, as shown in the figure below.

The full recursive collocation analysis is beyond scope here. Details on request.

Figure 2 – how the core words in Malkinʼs essay interlocked. From the Rock Creek Analytics collocation analysis of the Malkin essay. The links and nodes are not to scale, except that the node and link sizes are scaled relative to each other, and these are among the most common words in the text. The graphic was created using Voisine network visualization software.

Each of these keywords collocated significantly often with each of the others. The width of the lines represents the result of applying the collocation metrics, a kind of tensile strength. Moreover, two of these keywords were “key”, used unusually often (by Malkin in comparison with other September Net opinion). See above for her keywords.

The result was a tightly bound bundle of blame. The word pattern, by itself or in noteworthy part, was picked up by about 40,000 weblogs in October 2008.

However, given the dispersion review noted above, this number may not reflect a significant impact on the overall discussion.


One of the discussion threads picking up and echoing her phrasing also used the negative coloration of “cause” described above: “Giving home loans to minorities caused financial crisis”.

There are many other rhetorical devices in Net text that also served to convey Malkinʼs message (and that can be measured, but were not analyzed in this case). They include synonyms, homonyms, and other rhetorical figures, like part-for-whole synecdoche (“alien”).

Repeating phrasing and other distinctive vocabulary in this way reflects its influence – we tend to quote or reword the phrasing of ideas we agree with – and by tracking phraseology in this way we can track influence.

For an example of tracking phraseology in this way, see J. Leskovec, L. Backstrom, and J. Kleinberg, Meme-tracking and the Dynamics of the News Cycle. (However, to be clear, the conception of a meme used in that article is very different from the one we use, as shown, for example, in Figure 2.)

Concordancing

A concordance is a list of a word (or sometimes a brief phrase), along with immediate context, from a corpus or text collection. The process produces a list of occurrences of the search term, with each occurrence centered in the GUI window of specialized concordance software. Each instance of the search term displays the words that come before and after in the text, to the left and right of the term as it is shown in the software window.

See S. Hunston, CORPORA IN APPLIED LINGUISTICS 38-66.

Although not by itself a complex tool, concordancing serves several functions: when counting from the display, it can be used to discover latent word patterns. It can analyze a text using the context for collocated terms. It can be used to identify and supply words for relative frequency analysis.
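
A minimal keyword-in-context (KWIC) sketch of what such software does on screen; the sample sentence borrows a phrase quoted later in this review, and the window width is arbitrary:

    # Sketch: a bare-bones KWIC concordance, centering each hit with
    # fixed-width left and right context, as concordance software does.
    import re

    def concordance(text, term, width=30):
        lines = []
        for m in re.finditer(r"\b" + re.escape(term) + r"\b", text, re.IGNORECASE):
            left = text[max(0, m.start() - width):m.start()]
            right = text[m.end():m.end() + width]
            lines.append(left.rjust(width) + " " + m.group(0) + " " + right)
        return lines

    sample = ("Critics blamed the massive illegal alien mortgage racket, "
              "and illegal lending became the refrain.")
    for line in concordance(sample, "illegal"):
        print(line)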

Here is an example of a concordance list from the Malkin essay discussed above, using “illegal” as the search term.

Figure 3: screenshot of concordance software, showing “illegal” as used in context in Malkinʼs essay. Material taken from the Malkin essay, using AntConc software.


Beyond investigation and research, a concordance serves as a check for the results of other functions. Do the collocations appear to be significant when examined in context? What do keyword results show when their usage is examined in context? “Illegal” here shows as a linchpin of nativist rhetoric.

Putting the data together

This technical review has shown how we extract opinion from Internet text and put it to work. Here is a brief summary of the example.

Which are the words that matter – that make a message stand out, that make a text distinctive? How do words and phrases compare with competing messages?
    Frequency and relative frequency analysis. Keywords: “minorities” and “alien” as core conservative rhetoric.

How do critical words – especially keywords – hold together?
    Collocation: “illegal”, “Hispanics” and “alien”.

Where and how much do messages have an impact?
    Nativism had conservative resonance, but slight if any effect elsewhere.

How were critical words used in context?
    Concordancing: here, for example, “illegal” in context: “the massive illegal alien mortgage racket”.

Conclusion

The Internet is nothing more than a vast collection of computer files, billions of them. Many are machine-readable text that can be displayed in English. These text files are documents describing, referring to, and corresponding to people, institutions, and issues. Many of them reflect and express opinion. Neglecting the analysis of Net text means missing out on a critical resource for opinion research.

These are the most critical and the most valuable tools available at Rock Creek Analytics.

They work.

Contact Donald Weightman, principal

(cell 202 997-3290)

[email protected] or [email protected]