Using the web as a linguistic resource. Approaches and methods

Upload: erika-miglietta

Post on 14-Jul-2016

DESCRIPTION

The use of the web during the act of translation

TRANSCRIPT

Page 1: Using the Web

Using the web as a linguistic resource.

Approaches and methods

Page 2: Using the Web

• Glossaries/Dictionaries
• Translation-Oriented Web Search
• Using traditional corpora online
• Concordancing the web
• The mega-corpus/mini-web approach
• Building and using monolingual/comparable corpora

Page 3: Using the Web

• Glossaries/Dictionaries
• Translation-Oriented Web Search
• Using traditional corpora online
• Concordancing the web
• The mega-corpus/mini-web approach
• Building and using monolingual/comparable corpora

Page 4: Using the Web

Glossaries

Page 5: Using the Web
Page 6: Using the Web
Page 7: Using the Web
Page 8: Using the Web

• Glossaries/Dictionaries
• Translation-Oriented Web Search
• Using traditional corpora online
• Concordancing the web
• The mega-corpus/mini-web approach
• Building and using monolingual/comparable corpora

Page 9: Using the Web

Translation-oriented Web Search

• Basic issues in web search

• Case studies:
- evidence of attested usage;
- investigating phraseology;
- testing translation candidates

Page 10: Using the Web

Problems and limitations of search engines

• limited number of results
• format
• ranking
• normalization, punctuation, special characters
• informational, navigational, transactional needs

Page 11: Using the Web

Google syntax

+ > co-occurrence of two or more words
- > excludes a word
OR > any one of two or more words

“” > exact phrase
* > any single word in a given position
country (.uk; .it; .fr…)
institution (.ac.uk; .edu; .org…)
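The operators above can be combined mechanically. The following is a minimal sketch, assuming that co-occurring words are simply listed side by side; `build_query` and its parameter names are hypothetical helpers invented for illustration, not part of any search engine API.

```python
def build_query(exact=None, all_of=(), any_of=(), exclude=(), site=None):
    """Assemble a search string from the operators listed on the slide."""
    parts = list(all_of)                      # co-occurrence: words listed together
    if exact:
        parts.append(f'"{exact}"')            # "" exact phrase (may contain *)
    if any_of:
        parts.append(" OR ".join(any_of))     # OR: any one of the alternatives
    parts.extend(f"-{word}" for word in exclude)  # - excludes a word
    if site:
        parts.append(f"site:{site}")          # restrict to a country or institution
    return " ".join(parts)


query = build_query(exact="the key * success", exclude=["lock"], site=".ac.uk")
print(query)
```

The excluded word "lock" is an arbitrary example; any of the arguments can be left out.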

Page 12: Using the Web

Web page evaluation (4)

Learning to read addresses:

• .it – Italy
• .uk – United Kingdom
• .us – United States
• .ca – Canada
• .au – Australia
• .ie – Ireland (Eire)

Page 13: Using the Web

Learning to read addresses:

• .edu – restricted use by educational sites (usually a university or college)
• .com – general use by commercial business sites
• .gov – restricted use by U.S. governmental/non-military sites
• .mil – restricted use by U.S. military sites and agencies
• .net – general use by networks, internet service providers, organizations
• .org – general use by non-profit organizations and others

Page 14: Using the Web

• Some specific portals:
• PubMed > site:ncbi.nlm.nih.gov/pubmed
• Elsevier > site:.elsevier.com

Page 15: Using the Web

The Kleene star

• By placing an asterisk (Kleene star) within quotes one can explore phrases:

e.g. the key to/for success?
“the key * success”
- for: 1,340,000
- to: 90,000,000

e.g. linee cellulari di/dell’epatoma umano (human hepatoma cell lines)
“linee cellulari * epatoma umano”?
linee cellulari di epatoma umano!
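The same wildcard idea can be tried offline on any text you already have. The sketch below assumes that each * stands for exactly one word (a search engine may treat it more loosely) and runs on a few invented sentences that are not real search results; `wildcard_to_regex` is a hypothetical helper name.

```python
import re
from collections import Counter


def wildcard_to_regex(pattern):
    """Turn a phrase with * slots into a regex where each * captures one word."""
    tokens = [r"(\w+)" if tok == "*" else re.escape(tok) for tok in pattern.split()]
    return re.compile(r"\b" + r"\s+".join(tokens) + r"\b", re.IGNORECASE)


# Invented sample text, for illustration only.
sample = (
    "Patience is the key to success. Many say the key to success is practice. "
    "Is this the key for success in business? Hard work is the key to success."
)

regex = wildcard_to_regex("the key * success")
fillers = Counter(match.group(1).lower() for match in regex.finditer(sample))
print(fillers.most_common())
```

Counting the words that fill the slot gives the same kind of evidence as comparing hit counts, only on a corpus whose size and origin you control.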

Page 16: Using the Web

1. Exploring collocation: PAESAGGI SUGGESTIVI

Page 17: Using the Web
Page 18: Using the Web

Exploring collocation: PAESAGGI SUGGESTIVI

Page 19: Using the Web

Complex query

• Geographical position: UK
• Exact phrase: «suggestive landscapes»
• Co-occurrence: travel OR tourism OR holiday
• Language: English
• Provenance: United Kingdom
• Domain: .uk

Google search…
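As a rough illustration, the phrase, co-occurrence and domain parts of this complex query can be written as a single string and URL-encoded. This sketch assumes only the standard `q` parameter of a Google search URL; the language and provenance settings are left to the Advanced Search form and are not encoded here.

```python
from urllib.parse import quote_plus

phrase = '"suggestive landscapes"'          # exact phrase
context = "travel OR tourism OR holiday"    # co-occurrence alternatives
domain = "site:.uk"                         # domain restriction

query = " ".join([phrase, context, domain])
url = "https://www.google.com/search?q=" + quote_plus(query)

print(query)
print(url)
```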

Page 20: Using the Web
Page 21: Using the Web

2. Validating translation candidates

You are translating an Italian article on cancer into English and you encounter the phrase “sede di insorgenza”.

Sede = site
Insorgenza = onset
ONSET SITE vs SITE OF ONSET

Page 22: Using the Web

“onset site”

Page 23: Using the Web

cancer “onset site” (relevance)

Page 24: Using the Web

cancer “onset site” site:elsevier.com (relevance + reliability)

Page 25: Using the Web

cancer “site of onset” site:.elsevier.com

Page 26: Using the Web

• «onset site» > 8700
• cancer «onset site» > 1170
• cancer «onset site» site:.elsevier.com > 36

• «site of onset» > 29500
• cancer «site of onset» > 10100
• cancer «site of onset» site:.elsevier.com > 55
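The hit counts reported on this slide can be turned into shares to make the comparison explicit. This is a small sketch using only the figures above; the dictionary labels and the `share` helper are invented for illustration, and raw hit counts are at best rough estimates.

```python
# Hit counts as reported on the slide.
counts = {
    "no context": {"onset site": 8700, "site of onset": 29500},
    "+ cancer": {"onset site": 1170, "site of onset": 10100},
    "+ cancer, site:.elsevier.com": {"onset site": 36, "site of onset": 55},
}


def share(hits, candidate):
    """Proportion of all hits at one step that belong to one candidate."""
    return hits[candidate] / sum(hits.values())


for label, hits in counts.items():
    pct = 100 * share(hits, "site of onset")
    print(f"{label:30s} site of onset: {pct:.1f}% of hits")
```

At every step "site of onset" accounts for the majority of hits, although the margin narrows on the restricted domain, where both counts are small.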

Page 27: Using the Web

• «onset site» cancer Google Books

• «site of onset» cancer Google Books

Page 28: Using the Web
Page 29: Using the Web
Page 30: Using the Web

3. Exploring phraseology: LA FREQUENZA … È IN AUMENTO

• La frequenza del carcinoma a cellule squamose della mucosa orale è in rapido aumento; inoltre, il suo comportamento clinico è difficilmente prevedibile basandosi solo sui classici parametri istologici.

Draft translation
• The frequency of squamous cell oral cancer is rapidly increasing…

But…
• Do people really say that “the frequency of something is increasing”?
• Would people say this when talking about cancer?

Page 31: Using the Web

Step 1: input a string to be tested
“The frequency of * is increasing”

50,000 matches

Step 2: add context (“cancer”) to boost relevance
cancer “The frequency of * is increasing”

9000 matches

Step 3: add domain (e.g.: .ac.uk) to boost reliability
cancer “The frequency of * is increasing” site:.ac.uk

4 matches
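Steps 1 to 3 amount to refining one query twice. A minimal sketch of that progression follows, with a hypothetical `refine` helper; the match counts are the ones reported above and are not computed by the code.

```python
def refine(query, context=None, site=None):
    """Return a narrower query: a context word in front, a site: filter behind."""
    if context:
        query = f"{context} {query}"
    if site:
        query = f"{query} site:{site}"
    return query


step1 = '"The frequency of * is increasing"'
step2 = refine(step1, context="cancer")   # boost relevance
step3 = refine(step2, site=".ac.uk")      # boost reliability

# Match counts as reported on the slide.
reported = {step1: 50_000, step2: 9_000, step3: 4}
for query, hits in reported.items():
    print(f"{hits:>7,}  {query}")
```

Each refinement trades quantity for relevance or reliability, which is why the final count is so small.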

Page 32: Using the Web

Step 4: search for a more specific pattern
“the frequency of * cancer is increasing” site:.ac.uk

0 hits

Step 5: move the Kleene star to further explore phraseology

e.g. “The * of * cancer is increasing”…
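Moving the star can be done systematically by replacing one word at a time. The sketch below, with a hypothetical `star_variants` helper, generates every such variant of the Step 4 pattern; which variants are worth searching remains a judgment call.

```python
def star_variants(phrase):
    """Replace each word in turn with *, keeping any stars already present."""
    tokens = phrase.split()
    variants = []
    for i, token in enumerate(tokens):
        if token == "*":
            continue  # this slot is already open
        candidate = tokens[:i] + ["*"] + tokens[i + 1:]
        variants.append('"' + " ".join(candidate) + '"')
    return variants


variants = star_variants("the frequency of * cancer is increasing")
for variant in variants:
    print(variant)
```

One of the generated variants is the two-star pattern given as the example in Step 5.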

Page 33: Using the Web

“the * of * cancer is increasing”

Page 34: Using the Web

The incidence of * cancer is increasing

Page 35: Using the Web

• Glossaries/Dictionaries
• Translation-Oriented Web Search
• Using traditional corpora online
• Concordancing the web
• The mega-corpus/mini-web approach
• Building and using monolingual/comparable corpora

Page 36: Using the Web

USING THE BNC

- Go to http://corpus.byu.edu/bnc/

- Register as: “Graduate student: not languages or linguistics”

Page 37: Using the Web
Page 38: Using the Web
Page 39: Using the Web

Collocation
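As a rough illustration of what a collocate list is, the sketch below counts the words that occur within a small window around a node word. It is not the BNC interface: the `collocates` function, the window size and the sample text are all invented for illustration, and sentence boundaries are ignored.

```python
import re
from collections import Counter


def collocates(text, node, window=2):
    """Count words occurring within `window` tokens to the left or right of `node`."""
    tokens = re.findall(r"[a-z']+", text.lower())
    counts = Counter()
    for i, token in enumerate(tokens):
        if token == node:
            left = tokens[max(0, i - window):i]
            right = tokens[i + 1:i + 1 + window]
            counts.update(left + right)
    return counts


# Invented sample text, for illustration only.
sample = (
    "The landscape was breathtaking. A breathtaking landscape of hills. "
    "The political landscape is changing."
)

result = collocates(sample, "landscape")
print(result.most_common(3))
```

A real corpus tool would also filter out function words such as "the" or rank collocates with an association measure rather than by raw frequency.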

Page 40: Using the Web

Colligation

Page 41: Using the Web

Colligation

Page 42: Using the Web

Semantic preference

Page 43: Using the Web

Semantic prosody

Page 44: Using the Web

Compare collocates

Page 45: Using the Web

• Glossaries/Dictionaries
• Translation-Oriented Web Search
• Using traditional corpora online
• Concordancing the web
• The mega-corpus/mini-web approach
• Building and using monolingual/comparable corpora

Page 46: Using the Web

www.webcorp.org.uk

Page 47: Using the Web
Page 48: Using the Web

Results for LANDSCAPE from WebCorp

Page 49: Using the Web
Page 50: Using the Web
Page 51: Using the Web
Page 52: Using the Web
Page 53: Using the Web
Page 54: Using the Web

Exercises

• Explore collocation
Volatility
Spread

• Explore neologisms
Flexicurity
Staycation

• Explore phrasal creativity
Everyone and their * knows

• Explore Computer-Mediated Communication
l8r … dunno … IMHO

Page 55: Using the Web

Volatility - chemistry

Page 56: Using the Web

Volatility - market

Page 57: Using the Web
Page 58: Using the Web
Page 59: Using the Web
Page 60: Using the Web
Page 61: Using the Web

More examples from blogs:

• Dunno
• IMHO

Page 62: Using the Web

WebCorp Linguist’s Search Engine

Page 63: Using the Web

http://www.webcorp.org.uk/live/wlse.jsp

REGISTER FOR FREE

Page 64: Using the Web

• Synchronic English Web Corpus, a corpus consisting of 467,713,650 words from web-extracted texts covering the period 2000-2010;

• Diachronic English Web Corpus, consisting of 128,951,238 words covering the period Jan 2000 - Dec 2010, with each month containing approximately 1 million words;

• Birmingham Blog Corpus, consisting of 628,558,282 words from blog texts.

Page 65: Using the Web

Synchronic English Web Corpus

• Mini-web Sample > a 339,907,995-word corpus compiled from 100,000 randomly selected web pages to form a sample of the distribution of texts throughout the web.

• Domains > 127,805,655 words from 56,000 pages selected from 14 domains on the basis of the Open Directory classification of web pages.
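The corpus sizes quoted for WebCorp can be cross-checked with simple arithmetic. This sketch uses only the figures given on these slides; the variable names are invented, and the averages are rough indications rather than official statistics.

```python
# Synchronic English Web Corpus: the two sub-corpora should add up to the total.
synchronic_total = 467_713_650
mini_web_words, mini_web_pages = 339_907_995, 100_000
domain_words, domain_pages, domain_count = 127_805_655, 56_000, 14
assert mini_web_words + domain_words == synchronic_total

# Diachronic English Web Corpus: Jan 2000 to Dec 2010 inclusive.
diachronic_total = 128_951_238
months = (2010 - 2000 + 1) * 12
words_per_month = diachronic_total / months

print(f"months covered:           {months}")
print(f"words per month:          {words_per_month:,.0f}")
print(f"pages per domain:         {domain_pages // domain_count:,}")
print(f"words per page (mini-web): {mini_web_words / mini_web_pages:,.0f}")
print(f"words per page (domains):  {domain_words / domain_pages:,.0f}")
```

The monthly average comes out just under one million words, consistent with the "approximately 1 million words" per month stated for the diachronic corpus.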

Page 66: Using the Web

Domains in the Synchronic English Web Corpus

Page 67: Using the Web
Page 68: Using the Web

SPREAD

• Spread > health
• Spread > business

Page 69: Using the Web
Page 70: Using the Web
Page 71: Using the Web

balkanis/zation

Page 72: Using the Web
Page 73: Using the Web

Birmingham Blog Corpus

• kinda

Page 74: Using the Web

kinda feel + adj

Page 75: Using the Web

kinda sorta + verb