Exploring Polish-English Parallel Corpora with Paralela
Piotr Pęzik, University of Łódź
pelcra.pl
About the data
• Manually and automatically aligned collections acquired in a number of projects (CESAR, ATLAS and, most notably, CLARIN-PL)
• Finally made available through Paralela, which is developed as one of the CLARIN-PL (clarin-pl.eu) applications
• An open-ended corpus with user-defined virtual collections
Sources
[Figure: bar charts of corpus size in words per source, in four panels.
Sources 1..5 (axis 0 to 70,000,000 words), labels shown: JRC-Acquis, EUBooks, EMEA, RAPID.
Sources 6..26 (axis 0 to 4,500,000 words): CORDIS, OSW, potop, Academia, ogniemimieczem, faraon, villette, janeeyre, greatexpectations, womeninlove, nostromo, quovadis, sonsandlovers, emma, chance, dogmat, wutheringheights, ESO, kim, wpustyniiwpuszczy, lordjim.
Sources 27..42 (axis 0 to 120,000 words): cosel, wiry, korsarz, bruhl, po?e (label garbled), napoluchwaly, diveinto, sawyer, gulliver, flowerofnorth, argonauci, duet, treasure, 2ndjunglebook, skalky, almayer.
Sources 43..60 (axis 0 to 70,000 words): houndbaskervilles1, houndbaskervilles2, thegoldhunters, goldensnare, captainscourageous, kres, puck, thewolfhunters, theniggerofthenarcissus, junglebook1, junglebook2, thestoryofthetreasuresee…, shadowline, heartofdarkness, hania, Batory, freya, thecallofthewild.]
196M words, 4.7M segments
Genres
Alignment
Druga księga dżungli; Bez dogmatu; Historia Jees-uck; Panna Brill; The Secret Agent; Zwycięstwo;
Szaleństwo Almayera; Emma; Doktor Jekyll i Pan Hyde; Państwo Gołębiowie; Opowieść; Podróż;
Amy Foster; Wiara w człowieka; Doktor Jekyll i pan Hyde; Na polu chwały; Anarchista; Zimowa Powieść;
Argonauci; Falk; Szara Wilczyca; Na Polu Chwały; Czarny Sternik; Wiry;
W Zatoce; Kwiat dalekiej północy; Kim; Nostromo - opowieści z wybrzeża; Bestia; Zakochane kobiety;
Święto bankowe; Freja z siedmiu wysp; Klub Pickwicka; Pleśń Świata; Zew krwi; Wichrowe Wzgórza;
Złowroga Bisara; Freja z Siedmiu Wysp; Komediantka; Potop; Łowcy złota; Podlotek;
Bruhl; Garden Party; Korsarz; Książę Roman; Władca skalnej doliny; Młodość;
Kapitanowie zuchy; Gaspar Ruiz; U kresu sił; Oficer Pruski; Gospoda pod "Dwiema Wiedźmami"; Z legend dawnego Egiptu;
Los; Złote sidła; Krzyżacy; Puk z Pukowej Górki; Murzyn z załogi "Narcyza"; Żywy Telegraf;
Cienie; Wielkie Nadzieje; Pokojówka jaśnie pani; Czerwony Bóg; Poszukiwacze skarbu;
Ciężkie czasy na te czasy; Hamlet; Lady Susan; Robinson Kruzo: jego życia losy, doświadczenia i przypadki; Watsonowie;
Pyramids; Hamlet: królewicz duński; Laguna; Ukryty Sojusznik; Biała cisza;
Przygody Tomka Sawyera; Jądro ciemności; Ostatni Mohikanin; Smuga Cienia. Wyznanie; Łowcy wilków;
Hrabina Cosel; Jej pierwszy bal; Życie Mamy Parker; Choroba samotnego wodza; Trzech starszych panów w jednej łódce (oprócz psa);
Córki zmarłego pułkownika; Przygody Huck'a; Opowiadania; Lekcja śpiewu; Jutro;
Dzieje, przygody, doświadczenia i zapiski Dawida Copperfielda, T. 1;
Idealna rodzina; Zamążpójście Lit Lit; Stalky i spółka; Przygody T. S.;
Dzieje, przygody, doświadczenia i zapiski Dawida Copperfielda, T. 2;
Szpieg; Meir Ezofowicz; Uśmiech Szczęścia - opowieść portowa; Humoreski;
Dzieje, przygody, doświadczenia i zapiski Dawida Copperfielda, T. 2;
Dziwne Losy Jane Eyre; Meksykanin; Synowie i Kochankowie; Tajfun;
Bez dogmatu; Dziwne losy Jane Eyre; Milknące Głosy; Obcy; W oczach Zachodu
>100 literary classics (10M words)
Mantel
More data
• 350M words by early 2016?
• Most of it: low-hanging fruit with better metadata wherever possible
• But: more public-domain literary classics as well
• http://paralela.clarin-pl.eu/
• Based on Apache Solr, scales up to billions of words, scales out to even more
Query syntax
• SlopeQ 2 syntax for corpus queries (cf. spokes.clarin-pl.eu)
• Solr DisMax syntax for metadata
Metadata filters: <lemma=miłość> AND (genre:typ_lit_proza NOT source:wutheringheights AND (alignment:simple OR alignment:paraphrase) AND wc:[5 TO *])
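The filter above combines a SlopeQ term with Solr-style field clauses. A minimal Python sketch of assembling such a string from its parts is given here. The helper names (`field`, `any_of`, `build_query`) are hypothetical and not part of Paralela, and the sketch only builds the query text without contacting the service.

```python
def field(name: str, value: str) -> str:
    """Format one metadata clause, e.g. genre:typ_lit_proza."""
    return f"{name}:{value}"


def any_of(*clauses: str) -> str:
    """Join alternative clauses with OR inside parentheses."""
    return "(" + " OR ".join(clauses) + ")"


def build_query(term: str, include: str, exclude: str, extra: list[str]) -> str:
    """Combine a corpus term with metadata clauses (hypothetical helper).

    The layout mirrors the slide: TERM AND (include NOT exclude AND ...).
    """
    head = f"{include} NOT {exclude}"
    body = " AND ".join([head, *extra])
    return f"{term} AND ({body})"


query = build_query(
    "<lemma=miłość>",
    include=field("genre", "typ_lit_proza"),
    exclude=field("source", "wutheringheights"),
    extra=[
        any_of(field("alignment", "simple"), field("alignment", "paraphrase")),
        "wc:[5 TO *]",  # range clause copied from the slide
    ],
)
print(query)
```

The printed string is the same query as on the slide, so it could be pasted into the search box as is.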
SlopeQ 2 queries: <pos=adj:.*:gen:.*> <lemma=wiara>|<lemma=nadzieja>
SlopeQ queries: (<lemma=dać> do zrozumienia)=3
Bitext queries: PL:(<lemma=dać> do zrozumienia)=3 AND EN:(<lemma=give> to understand)=3
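A bitext query pairs one slop query per language side. The sketch below, with hypothetical helper names (`slop_phrase`, `bitext`), shows how such pairs could be generated for candidate translation equivalents. It only formats strings in the notation shown on the slide.

```python
def slop_phrase(phrase: str, slop: int) -> str:
    """Wrap a phrase in the (...)=N slop notation shown on the slides."""
    return f"({phrase})={slop}"


def bitext(pl: str, en: str) -> str:
    """Require a match on both the Polish and the English side of a segment."""
    return f"PL:{pl} AND EN:{en}"


bitext_query = bitext(
    slop_phrase("<lemma=dać> do zrozumienia", 3),
    slop_phrase("<lemma=give> to understand", 3),
)
print(bitext_query)
```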
Negation and regular expressions: <lemma=st.+>|!<lemma=start>|!<pos=n.+>
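One plausible reading of the query above is: lemmas matching `st.+`, excluding the lemma `start`, and excluding tokens whose part of speech matches `n.+`. The Python sketch below emulates that reading on a few made-up tokens. Treating `|!` as "and not", matching patterns against the whole value, and the `verb`/`noun` tags are all assumptions, not documented SlopeQ behaviour.

```python
import re


def matches(lemma: str, pos: str) -> bool:
    """Emulate <lemma=st.+>|!<lemma=start>|!<pos=n.+> under the stated assumptions."""
    return (
        re.fullmatch(r"st.+", lemma) is not None   # lemma must match st.+
        and re.fullmatch(r"start", lemma) is None  # ...but must not be "start"
        and re.fullmatch(r"n.+", pos) is None      # ...and POS must not match n.+
    )


# Made-up (lemma, pos) pairs; the tag names are illustrative only.
tokens = [
    ("start", "verb"),
    ("stop", "verb"),
    ("station", "noun"),
    ("stay", "verb"),
    ("go", "verb"),
]

hits = [lemma for lemma, pos in tokens if matches(lemma, pos)]
print(hits)
```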
Syntax enhancements
• Metadata-only DisMax queries, e.g. source:Academia
• WordNet identifiers, e.g. <lemma=zamek pos=noun:.+gen:.+ ssid=33424>
Facets
• Search facets are computed for every query on the full set of results
• Used to:
  • Visualize the relative frequencies of metadata values
  • Navigate through and filter results in a relevance feedback mode
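The two uses of facets can be illustrated on a handful of made-up results. In Paralela the facets are presumably computed server-side by Solr. The client-side Python sketch below only shows the idea, and the function names are hypothetical.

```python
from collections import Counter

# Made-up result metadata; the field names follow the slides (source, alignment).
results = [
    {"source": "emma", "alignment": "simple"},
    {"source": "emma", "alignment": "paraphrase"},
    {"source": "kim", "alignment": "simple"},
    {"source": "nostromo", "alignment": "simple"},
]


def facet_counts(results, field):
    """Count how often each value of a metadata field occurs in the results."""
    return Counter(r[field] for r in results)


def relative_frequencies(counts):
    """Turn absolute facet counts into shares of the full result set."""
    total = sum(counts.values())
    return {value: n / total for value, n in counts.items()}


def drill_down(results, field, value):
    """Keep only results with the chosen facet value (one feedback step)."""
    return [r for r in results if r[field] == value]


source_facet = facet_counts(results, "source")
source_shares = relative_frequencies(source_facet)
simple_only = drill_down(results, "alignment", "simple")
print(source_facet, source_shares, len(simple_only))
```

After a drill-down step the facets would be recomputed on the narrowed result set, which is what makes the navigation iterative.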
Facets
Export formats
• Excel
• JSON
• Soon: TMX, XLIFF
• Up to 100 000 results per query
Full programmatic access
• REST API available
• Used by the web application
Making Paralela (more) useful
• More (useful) data: 350M words
• Good quality metadata (often requires metadata re-harvesting for existing collections)
• Grouping and sampling (e.g. max. 5 results per source, cf. spokes.clarin-pl.eu)
• ACD alignment
• Virtual collections: registered users will be able to build (and save) complex metadata queries to define their own corpora
• Some predefined subcorpora will be made available by default
• Write a good tutorial/documentation with lots of examples
• Hands-on workshops (watch out for announcements on [email protected])
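The grouping idea from the list above (e.g. at most 5 results per source) can be sketched as follows. This is an illustrative client-side version in Python with a hypothetical function name, not a description of how Paralela implements grouping.

```python
from collections import defaultdict


def cap_per_source(results, limit=5):
    """Keep at most `limit` results per source, preserving the original order."""
    seen = defaultdict(int)
    kept = []
    for result in results:
        if seen[result["source"]] < limit:
            kept.append(result)
            seen[result["source"]] += 1
    return kept


# Made-up result list dominated by one source.
results_by_source = [{"source": "potop", "id": i} for i in range(8)]
results_by_source += [{"source": "emma", "id": i} for i in range(2)]

capped = cap_per_source(results_by_source, limit=5)
print(len(capped))
```

This variant keeps the first results of each source. A sampling variant would shuffle each group before capping it.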
Virtual collections
• Collection1 = (genre:typ_lit_proza NOT (genre:duet OR genre:sawyer)) AND (ts:[2015-03-30 TO *])
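A virtual collection can be thought of as a saved, named metadata query that is combined with any corpus query at search time. The Python sketch below assumes the combination is a plain AND, modelled on the metadata filter example. The `virtual_collections` mapping and `search_in` helper are hypothetical.

```python
# Saved metadata queries keyed by collection name; Collection1 is copied from the slide.
virtual_collections = {
    "Collection1": (
        "(genre:typ_lit_proza NOT (genre:duet OR genre:sawyer)) "
        "AND (ts:[2015-03-30 TO *])"
    ),
}


def search_in(collection_name: str, corpus_query: str) -> str:
    """Restrict a corpus query to a saved virtual collection (assumed AND semantics)."""
    metadata_query = virtual_collections[collection_name]
    return f"{corpus_query} AND ({metadata_query})"


restricted = search_in("Collection1", "<lemma=miłość>")
print(restricted)
```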
Love (miłość) in prose and in guidebooks
EXAMPLES