3RD INTERNATIONAL CONFERENCE ON LINGUISTIC AND CULTURAL DIVERSITY IN CYBERSPACE
28 June - 3 July, 2014, Yakutsk, Russia
Daniel Pimienta [email protected]
Networks & Development Foundation: http://funredes.org
Observatory of languages & cultures in the Internet: http://funredes.org/lc
Executive Committee Member of http://maaya.org
A methodology for exploring the situation of French & languages of France in the Internet which could apply to other groups of languages.
Daniel Pimienta and Daniel Prado MAAYA, May 2014
Mayotte
CREDITS
The methodology results from merging the products of two independent studies carried out by the team D. Prado/D. Pimienta, on behalf of MAAYA, in 2013:
OIF-mandated study about the space of French on the Internet
Study about the space of the languages of France on the Internet, mandated by the General Delegation for French and the Languages of France (DGLFF) of the Ministry of Culture
TWO COMPLEMENTARY APPROACHES
FRENCH, a language classified in position 8 in terms of speakers (L1+L2)
OTHER “MINORITY” LANGUAGES spoken in French territories
ANTECEDENTS
• DIFFICULTIES IN PRODUCTION OF INDICATORS
• DILINET PROJECT
• DILINET PROJECT STATUS
• MEANWHILE…
LINGUISTIC DIVERSITY INDICATORS PARADOX
[Figure: timeline from 1988 to 2014 with the labels INTEREST and CAPACITY. A second panel (1988 to 2010) places the measurement initiatives on the timeline: FUNREDES/UL, LOP, ALIS/ISOC, OCLC, FUNREDES, XEROX, IDESCAT.]
WHAT INDICATORS DO WE HAVE?
Internet users per language (source: InternetWorldStats)… until 2011
Web pages per language (not all!)… until 2008
Other indicators per country (FUNREDES/UL)… until 2008
WHERE IS THE BOTTLENECK?
The two main indicator-building activities rely:
- on crawling ccTLDs for languages in Asia, Africa and the Caribbean and applying language recognition algorithms (LOP).
- on using the counting capacity of search engines and their large percentage of web coverage (FUNREDES/UNION LATINA).
WHERE IS THE BOTTLENECK?
But…
- The size of the web is getting too large for traditional crawling (close to infinite!).
- Search engines no longer index a substantial part of it (80% → 5%)
- Search engine counting has become unreliable.
- … And anyway all we got is static data mostly focused on the number of web pages per language.
A RESEARCH PROJECT
Collaboration between UNESCO, OIF and UNION LATINA, with participation of ITU.
High-profile partners: ERCIM, MAAYA, UNESCO, OIF, FUNREDES, EXALEAD, UPC, DIALOGIC, CNRS/LIMSI, FRAUNHOFER, CWI, VOCAPIA, NIELSEN
Important investment (estimated at 300 K euros, direct and indirect)
PROCESS
Proposals submitted to 2 EU/FP7 calls:
– Jan. 2012: Integrated Project of 7 M euros for ICT-2011.4.4 Intelligent Information Management
– Jan. 2013: Specific Targeted Research Project of 3 M euros for ICT-2013.4.1 Content analytics and language technologies - Cross-media content analytics
2 near misses, reflecting low EU interest in the theme
New attempt in process with Qatar partners, with LOP on board
MEANWHILE…
• InternetWorldStats stopped updating 3 years ago
• A new and interesting player, but limited to the top 10 million sites (2% of the sites): W3Techs
• Web evolution towards dynamic pages, video, social networks
The context calls for alternative approaches
PART 1 : MEASURING FRENCH
Defining a large set of spaces and applications to get data from.
Searching for a large number of Internet sites which offer linguistic or country data for those spaces/applications.
Applying appropriate selection criteria to this set of sites.
Collecting, compiling and organizing data.
Cross-referencing Internet data with reliable demo-linguistic data.
Putting results in perspective.
P1 : SPACES & APPLICATIONS
UP TO 100, split into the following categories:
Spaces
• Infrastructure
• Online library
• Smartphones
• VOIP/Chat
• Operating systems
• Browsers
Applications
• Office applications
• Web 2.0
• Search engines
• Email
• P2P
P1: SOURCES
• Traditional sources (UN, UNESCO, ITU, OECD, EU) have little linguistic data but plenty of country data
• Most non-traditional sources are either:
– Marketing companies offering free glimpses of expensive data
– Experts showing their capacity through reports
• The lifespan of non-traditional sources is often short.
SOURCE SELECTION CRITERIA
• Too small scope
• Too biased
• Not recently updated
• Methodology not reliable
SELECTED SOURCES
more than 200 sources
less than 100 sources
10 = excellent
< 5 = Not used but kept for future check
SOURCES PARAMETERS
• Title
• URL
• Publication year
• Rating (0 to 10)
• Focus (worldwide, Europe, France, USA, OECD…)
• Frequently updated (y/n)
• Type of source (meta, general, space, application, book, report, paper, webpage…)
• Application or space concerned
• Language specific (y/n)
• Comments
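The source parameters and the rating rule (sources rated below 5 are not used but kept for a future check) can be sketched as a small record plus a filter. This is only an illustration: the field names, the `split_sources` helper and the two example entries are hypothetical and do not come from the study's actual files.

```python
from dataclasses import dataclass


@dataclass
class Source:
    """One candidate source, with the parameters listed on the slide."""
    title: str
    url: str
    year: int                 # publication year
    rating: int               # 0 to 10, where 10 = excellent
    focus: str                # worldwide, Europe, France, USA, OECD...
    frequently_updated: bool
    source_type: str          # meta, general, space, application, report...
    target: str               # application or space concerned
    language_specific: bool
    comments: str = ""


MIN_RATING = 5  # below this, a source is not used but kept for a future check


def split_sources(sources):
    """Return (used, parked): sources rated >= 5 and sources rated < 5."""
    used = [s for s in sources if s.rating >= MIN_RATING]
    parked = [s for s in sources if s.rating < MIN_RATING]
    return used, parked


# Hypothetical entries, for illustration only.
candidates = [
    Source("Example social network report", "http://example.org/a", 2013, 8,
           "worldwide", True, "report", "social networks", True),
    Source("Example outdated page", "http://example.org/b", 2009, 3,
           "Europe", False, "webpage", "blogs", False),
]
used, parked = split_sources(candidates)
print(len(used), "used,", len(parked), "kept for future check")
```

Keeping the low-rated sources in a separate list, rather than deleting them, mirrors the "kept for future check" rule.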
DEMO-LINGUISTIC DATA
• No institutional support → low data quality
• Large and diverse geography → divergent data
• Main demo-linguistic sources → divergent data
• Language typology → boundary dilemma
• L2 counting
DEMO-LINGUISTIC CHOICES
– ETHNOLOGUE FOR L1 (homogeneity)
– DIVERSE SOURCES FOR L2 (reliability)
– WIKIPEDIA FOR COUNTRY DEMOGRAPHICS
– INTERVAL DATA FOR SOME SPACES/APPLICATIONS
PUT IN PERSPECTIVE
(L12 = L1+L2; "-" = blank cell)
ELEMENT      A  B  C   D    I   L1  L12  P  L1xI  L12xI  L1xP  L12xP  TYPE
Viadeo       2  5  7  10  0,7    -    1  6     0    0,7     0      6  RS
Tumblr       6  6  7   6  1,5    4    2  4   6,0    3,0    16      8  RS
Hotmail      5  5  6   6  0,9    -    2  4     0    1,8     0      8  APP
Open office  9  9  9   8  5,8    -    2  5     0   11,7     0     10  APP
Blogs.com    6  7  7   5  1,5    -    2  5     0    2,9     0     10  BLOG
Ning         7  7  7   8  2,7    6    -  5  16,5      0    30      0  RS
Msn          7  7  7   6  2,1    6    -  5  12,3      0    30      0  APP
Wordpress    8  7  7   7  2,7    7    -  5  19,2      0    35      0  BLOG
AVERAGE: 6,8 4,2 7,4 4,3 7,2 4,2
I = A x B x C x D / 1000
A = Level of world relevance (0 to 10)
B = Level of reliability of source (0 to 10)
C = Level of trust for French (0 to 10)
D = Level of relevance for French (0 to 10)
P = Direct weighting
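The weighting can be illustrated with a short sketch. The index follows the formula I = A x B x C x D / 1000 given on the slide. The weighted mean used to aggregate positions is an assumption, since the slides only show the product columns (L1xI, L1xP), not how the averages were derived. The rows are the eight sample elements as read from the slide's table, with `None` for blank cells, and the function names are hypothetical.

```python
def reliability_index(a, b, c, d):
    """I = A x B x C x D / 1000, each level being on a 0 to 10 scale."""
    return a * b * c * d / 1000


def weighted_mean(pairs):
    """Weighted mean of (value, weight) pairs, skipping blank (None) values."""
    kept = [(v, w) for v, w in pairs if v is not None]
    total = sum(w for _, w in kept)
    return sum(v * w for v, w in kept) / total if total else None


# element, A, B, C, D, position as L1, position as L1+L2, direct weight P
ROWS = [
    ("Viadeo",      2, 5, 7, 10, None, 1,    6),
    ("Tumblr",      6, 6, 7, 6,  4,    2,    4),
    ("Hotmail",     5, 5, 6, 6,  None, 2,    4),
    ("Open office", 9, 9, 9, 8,  None, 2,    5),
    ("Blogs.com",   6, 7, 7, 5,  None, 2,    5),
    ("Ning",        7, 7, 7, 8,  6,    None, 5),
    ("Msn",         7, 7, 7, 6,  6,    None, 5),
    ("Wordpress",   8, 7, 7, 7,  7,    None, 5),
]

# Position of French as L1, weighted by the computed index I ...
l1_by_i = weighted_mean(
    (l1, reliability_index(a, b, c, d)) for _n, a, b, c, d, l1, _l12, _p in ROWS
)
# ... and weighted by the direct weight P.
l1_by_p = weighted_mean((l1, p) for _n, _a, _b, _c, _d, l1, _l12, p in ROWS)

print(f"I for Viadeo: {reliability_index(2, 5, 7, 10):.1f}")
print(f"L1 position weighted by I: {l1_by_i:.1f}")
print(f"L1 position weighted by P: {l1_by_p:.1f}")
```

Comparing the I-weighted and P-weighted results gives a quick check on how sensitive the final position is to the choice of weighting. These eight rows are only a sample, so the results are not expected to match the averages of the full study.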
ANALYZE PER TYPE
Type of space L1 L1+L2
BOOKS 3 *
BLOGS 6,5 3,3
APPLICATIONS 6,7 3,6
SOCIAL NETWORKS 7 4
INFRASTRUCTURES 7,9 4
USERS 9 4 *
CONTENTS 8 4,1
VIDEO 7 6 *
P2P 6,3
* = Only one source
CONCLUSION P1
French, as a first language, can be considered above, but close to, position 7 in the Internet, all elements combined.
French, as a first and second language, can be considered above, but very close to, position 4.
CONCLUSION P1
French, in spite of its lower demographic strength, is in close competition in the Internet, depending on the space/application, with:
Spanish, German, Japanese, Portuguese, and to some extent with Russian and Arabic.
CONCLUSION P1: TRENDS
Strongly emerging languages (competing with English): Chinese (will go over English), Spanish
Emerging languages (competing with French): Hindi, Bengali, Russian, Arabic
New players: Urdu, Indonesian
CONCLUSION P1
Most of the elements of the applied methodology should work for other languages of large worldwide scope, such as Arabic, Portuguese, Spanish or Russian.
PART 2 : LANGUAGES OF FRANCE
MAYOTTE
SELECTION OF “LANGUAGES OF FRANCE” FOR THAT STUDY
• Alsatian
• Basque
• Breton
• Catalan
• Corsican
• Creole (*)
• Flemish
• Frankish
• Franco-Provençal
• Futunan
• Languages of Mayotte (*)
• Oïl languages (*)
• Kanak languages (*)
• Occitan (*)
• Tahitian
• Wallisian
(*) : family of languages
SELECTION CRITERIA
• Territory based languages (no immigration languages)
• Subset with higher probability of Internet presence
–> more than 50,000 speakers
or
–> used in official teaching
Language families
• Creole : Martinique, Guadeloupe, Guyane, la Réunion
• Occitan: auvergnat, gascon, languedocien, limousin, provençal, vivaro-alpin
• Kanak: ajië, drehu, nengone, paicî, xârâcùù (+ 24 more not studied)
• Languages of Mayotte: kibushi and shimaoré
Language’s terminology
• Alsatian: alemannic, alemannisch, alsacien, elsaessisch, elsässisch, etc.
• Basque: biscayan, gipuzkera, gipuzkoan, guipuzcoan, guipuzcoano, euskera, euskara, roncalese, vasco, vascuense, vizcaino, etc.
• Catalan: Aiguavivan, Algherese, Aragonais oriental, Balear, Català, Catalán, Catalan-Valencian-Balear, Eivissenc, Mallorqui, Menorqui, Menorquin, Lleidatà, Pallarese, Ribagorçan, Valencià, Valenciano, etc.
• Corsican: corsu, corsican, corsi, corso, sartenais, venaco, vico-ajaccio, etc.
Language’s terminology
• Frankish (francique mosellan): lothrìnger ditsch, lothringer deutsch, lothringer plattm, lothrénger deitsch, lothrìnger deitsch, lothrénger platt, francique luxembourgeois, francique mosellan, platt, etc.
• Futunan: fakafutuna
Language’s terminology
• Francoprovençal: arpetan, arpian, arpitan, arpitano, brassè, burgondan, burgondês, dauphinois, delfinese, dialetto, faetar, francoprovençâl, friborgês, fribourgeois, genevois, harpitan, lyonè, lyonnais, mâconês, neuchatelais, neuchâtelois, patois, patoua, patouès, romand, romand, savoiardo, savoyard, savoyârd, tot-parier, valaisan, valdostano, valdôtain, valdôtèn, valêsan, vaudois, vôdouês
Language’s terminology
• Oïl languages: angevin, berrichon, bourbonnais, bourguignon-morvandiau, brionnais-charolais, champenois, frain-comtou, franc-comtois, gallo, langue comtoise, lorrain, mâconnais, manceau, maraîchin, mayennais, normand, normand méridional, picard, poitevin, poitevin-saintongeais, saintongeais, wallon, etc.
Language’s terminology
• Occitan: béarnais, aspois, girondin, lemozin, limousin, médocain, mondin, monegasque, neugue, niçois, nissard, nissart, occitanien, occitanique, parler d’oc, romans, patois, proensal, raimondin, rouergat, etc.
• Shibushi: malgache de Mayotte, kibushi kimaore, kibushi kiantalaoutsi, kibushi, kibuki, bushi
• Tahitian: reo tahiti
• Wallisian: fakaʻuvea, faka uvea, ouvéa
DIFFERENT METHODOLOGY
The same method cannot apply because most of the languages would not have any Internet references offering space/applications data as they do for French (or Spanish, English or Russian).
What would be the alternative, knowing that the Internet spaces of most of those languages are quite small compared to French?
LoF METHODOLOGY
• Cannot search only Internet references giving data on the situation of those languages on the Internet.
• Cannot search all references related to those languages.
BUT
How about searching references closely related to those languages?
SCOPE OF THE SEARCH
• References closely related to one of the languages of the study (not the territory!)
• Also references offering data on all the languages of France, or on all languages including the ones which are studied.
• What is the definition of closely related?
Close relationship to language
Best choice: site/book/paper discussing the situation in the Internet of the language and/or offering data about it
Good choices:
– Meta reference about the language (data base, clearinghouse, linguistic organization,…)
– Linguistic resources (dictionary, …)
– Reference discussing the language
– Cultural reference if it has an indirect relation with the language (literature, poetry or songs)
– Reference offering serious language learning
– Blogs in or about the language
Close relationship to language
• Bad choices:
– Touristic resources (except excellent presentations of language)
– Reference looking good but not public domain
– Reference copying another source (go to the very source)
RATING
WARNING
This is no value judgment about the reference; what is rated is only the level of contribution and proximity to the theme “language on the Internet”.
TARGET
The theme of language on the Internet, or bringing meaningful data about that theme
RATING
9: Exceptional contribution to the theme or meaningful data
8: Strong contribution or interesting data
7: Interesting contribution or original data
6: Average contribution
5: Indirect relation
4: Indirect relation but not much content
3: Not accessible but kept in memory because of special interest.
<3: Forget it
COLLECTED DATA
YEAR
UPDATED (Y/N)
SECTOR: GOV, EDU, ORG, COM, PER
TYPE: Article / Blog / Portal / Linguistic Resource / Social Network / META / Data Base / Library
LANGUAGE: Local, French, English, German, Spanish
DATA: Y/N
COMMENTS
SEARCH METHOD
1) Simple search with the language's most common name to find the main sites in the first 100 answers
2) Go to the external links page if possible and note all of them
3) Systematic analysis of links
4) Back to 2 until it is clear that no new links appear
5) Complete with more sophisticated search (GoogleScholar, books, blogs, other languages, other language terminology)
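Steps 1 to 4 amount to a snowball exploration: start from the main sites, follow their external links, and stop when no new links appear. A minimal sketch follows, assuming two hypothetical callbacks: `get_links` returns the external links of a reference, and `is_relevant` stands for the human judgment of "closely related". In the study this work was done by hand, so the code only shows the logic, with a toy in-memory link table instead of real web pages.

```python
from collections import deque


def snowball(seeds, get_links, is_relevant, max_refs=1000):
    """Breadth-first exploration of links, starting from the seed references.

    Stops when the queue is empty (no new links appear) or when
    max_refs relevant references have been collected.
    """
    seeds = list(dict.fromkeys(seeds))   # drop duplicate seeds, keep order
    seen = set(seeds)
    queue = deque(seeds)
    kept = []
    while queue and len(kept) < max_refs:
        ref = queue.popleft()
        if not is_relevant(ref):         # step 3: analyze each link
            continue
        kept.append(ref)
        for link in get_links(ref):      # step 2: note all external links
            if link not in seen:         # step 4: only new links go back in
                seen.add(link)
                queue.append(link)
    return kept


# Toy link table standing in for real sites (hypothetical names).
LINKS = {
    "main-site": ["dictionary", "blog"],
    "dictionary": ["blog", "tourist-page"],
    "blog": ["main-site"],
    "tourist-page": [],
}
found = snowball(
    ["main-site"],
    get_links=lambda ref: LINKS.get(ref, []),
    is_relevant=lambda ref: ref != "tourist-page",
)
print(found)
```

Links are not followed from references judged not relevant, which keeps the exploration close to the language rather than drifting towards the territory.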
RESULTS
A total > 1000 references (still missing 4 languages)
This obviously cannot be taken as an exhaustive search, but we do have enough data to use statistics and draw conclusions useful for public policies.
NUMBER OF REFERENCES
RATING SPLIT
STATISTICS
• Some key indicators are observed:
– The rate of wrong links informs about the vitality of the language in the Internet (for example, a Creole rate above 20% reveals problems)
– The split between ORG, PER, EDU, COM
– The split between reference types
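These indicators are simple proportions over the collected references. The sketch below assumes each reference is a dictionary with hypothetical keys `sector`, `type` and `reachable`. The sample records are invented for illustration and are not data from the study.

```python
from collections import Counter


def wrong_link_rate(records):
    """Share of references whose link no longer works (0.0 to 1.0)."""
    if not records:
        return 0.0
    broken = sum(1 for r in records if not r["reachable"])
    return broken / len(records)


def split(records, field):
    """Percentage of references per value of `field` (e.g. sector or type)."""
    counts = Counter(r[field] for r in records)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {value: round(100 * n / total) for value, n in counts.items()}


# Invented sample records, for illustration only.
sample = [
    {"sector": "ORG", "type": "Portal", "reachable": True},
    {"sector": "ORG", "type": "Linguistic Resource", "reachable": True},
    {"sector": "EDU", "type": "Article", "reachable": True},
    {"sector": "EDU", "type": "Article", "reachable": False},
    {"sector": "PER", "type": "Blog", "reachable": True},
]
print(f"wrong links: {wrong_link_rate(sample):.0%}")
print("sector split:", split(sample, "sector"))
print("type split:", split(sample, "type"))
```

Because each percentage is rounded on its own, a row may sum to 99% or 101%.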
SECTOR SPLIT
                     ORG  EDU  PER  GOV  COM  OTHER
General              27%  49%   0%   8%   7%    8%
Languages of France  20%  48%   7%  23%   2%    0%
Breton               52%  17%   6%   3%  22%    0%
Corsican             15%  24%  27%  19%  14%    0%
Creoles              24%  31%  14%   5%  26%    0%
Francoprovençal      44%  17%  35%   4%   1%    0%
Futunian             28%  56%  16%   0%   0%    0%
Kanak                21%  48%  12%   7%   6%    6%
Mayotte              34%  37%  14%   2%  14%    0%
Occitan              39%  19%  25%   7%   9%    1%
Tahitian             28%  28%   6%   9%  30%    0%
Wallisian            19%  39%  26%   0%  16%    0%
TOTAL                31%  30%  17%   8%  12%    2%
SECTOR SPLIT
                     ORG  EDU  PER  GOV  COM  OTHER
General              27%  49%   0%   8%   7%    8%
Languages of France  20%  48%   7%  23%   2%    0%
Breton               52%  17%   6%   3%  22%    0%
Corsican             15%  24%  27%  19%  14%    0%
Creoles              24%  31%  14%   5%  26%    0%
Kanak                21%  48%  12%   7%   6%    6%
Occitan              39%  19%  25%   7%   9%    1%
TOTAL                31%  30%  17%   8%  12%    2%
ORG = ORGANIZED CIVIL SOCIETY
EDU = ACADEMIA
PER = CITIZENSHIP
GOV = GOVERNMENT (OFTEN LOCAL)
COM = TOURISM
TYPE SPLIT
TYPES                 Gen  LoF  Breton  Corsican  Creole  Francoprovençal  Kanak  Occitan  Tahitian  TOTAL
PUBLICATIONS          23%  40%     15%       22%     25%              21%    40%      23%       13%    24%
DATA BASE              4%   2%      0%        4%      0%               7%     2%       2%        4%     3%
BLOGS                  0%   2%      6%       22%      2%               6%     8%      15%        0%     9%
MEDIA                  0%   2%      0%        2%      1%               2%     0%       3%        0%     1%
META                  14%   7%     16%        2%     10%               2%     5%       2%       19%     9%
PORTAL                10%  10%     44%       24%     28%              25%    14%      24%       30%    23%
LINGUISTIC RESOURCES  48%  38%     18%       23%     31%              28%    26%      30%       28%    29%
SOCIAL NETWORK         1%   0%      1%        0%      2%              10%     3%       0%        4%     2%
TOTAL                  7%   6%      8%        9%     10%              10%    12%      24%        4%   100%
LANGUAGE SPLIT
                              MEAN  Group with higher %  Group with lower %
% in English                   10%  General              Occitan
% in French                    48%  LoF                  Tahitian
% in local language             7%  Corsican             LoF
% in French & local language   19%  Breton & Corsican    LoF
% multilingual                 18%  Tahitian             Corsican
EMERGING PATTERNS
• A1- Not much spoken, Internet presence pushed by citizenship and multiple stakeholders, including local government: Corsican
• A2- Not much spoken, Internet presence pushed by citizenship but with low government involvement: Occitan & Franco-provençal
• A3- Not much spoken, Internet presence pushed by civil society organizations but with low government involvement: Breton
• B- Spoken language but low Internet presence except academic: Creole, Kanak, Futunian and Wallisian
CONCLUSION P2
• First interesting results into a field not yet systematically explored
• Next step will be to create a public clearinghouse, invite players to contribute, and promote dialogue across languages
• The approach should be applicable to other countries with a variety of “minority languages” (such as Italy, Spain, Germany or Russia).
GENERAL CONCLUSION
The methodology presented could probably be reused without much modification for other language families…
MERCI
Thank you
Gracias
Obrigado
Amesegnalhu
Shukran Dhonnyobaad
Orkun
Doh jeh Dekuji Adjarama
Abhar
Toda raba
N’gue penù
Tack
спасибо