can we track the geography of surnames based on bibliographic data?
Post on 05-Aug-2015
Can we track the geographic origin of
surnames based on bibliographic data?
Nicolas Robinson-Garcia, Ed Noyons & Rodrigo Costas
15th INTERNATIONAL CONFERENCE ON SCIENTOMETRICS & INFORMETRICS
29 June – 3 July, 2015, Bogazici University, Istanbul, Turkey
EC3metrics spin-off; CWTS, Leiden University
Agenda
o Background
o Bibliographic data
o Method 1. Kullback-Leibler divergence
o Method 2. Concentration Index
o The ‘golden list’
o Next or previous steps
Background
“the use of surnames in human population biology dates back to 1875, when George Darwin used frequency of occurrences of the same surname in married couples to study in-breeding”
Kissin, 2011
WHAT IS IN A SURNAME?
o Proxy for genetic/ethnic origin -> Epidemiology, Biomedical research
o Proxy for country origin -> Demographic studies, migratory movements
Background
o The representation of Jewish surnames in biomedical journals and US-patents
Kissin, 2011; Kissin & Bradley, 2013
o Relation between ethnic mix collaboration and citation impact
Freeman & Huang, 2014
… in the field of bibliometrics
Background
HOW CAN WE DETERMINE THE GEOGRAPHIC ORIGIN OF SURNAMES?
METHODS
o Manually curated lists
o Probability and Bayesian methods
o Clustering techniques
DATA SOURCES
o National census
o Dispersion of sources
o Lack of international coverage
Bibliographic data
o Scientific databases as international surnames data sources
› Regional restrictions
› Temporal restrictions
o Establishing ‘trusted’ linkages between surnames and countries
› Reprint address
› First author-First address
› One country publications
› Author-address linkages (2008)
Some figures:
-> 1,568,052 distinct surnames assigned to 119 countries
-> France 8.8%; Germany 8.0%; Russia 7.1%; Spain 4.9%
Assumptions
HYPOTHESIS 1
A surname should be assigned to the country where that surname has the highest frequency.
HYPOTHESIS 2
A surname should be assigned to the country where that surname has the greatest concentration.
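The two hypotheses can point to different countries for the same surname. The sketch below is only an illustration: the counts are invented (not the authors' data), and "concentration" is read here as the surname's share of all authors in a country, which is an assumption rather than the definition used in the talk.

```python
# Toy counts, invented for illustration only (not the authors' data).
# surname_counts: occurrences of one surname per country.
# country_totals: all author records per country (hypothetical).
surname_counts = {"SPAIN": 900, "CUBA": 50, "USA": 400}
country_totals = {"SPAIN": 100_000, "CUBA": 2_000, "USA": 200_000}


def by_frequency(counts):
    """Hypothesis 1: country with the highest raw count of the surname."""
    return max(counts, key=counts.get)


def by_concentration(counts, totals):
    """Hypothesis 2 (assumed reading): country where the surname makes up
    the largest share of that country's author records."""
    return max(counts, key=lambda c: counts[c] / totals[c])


freq_choice = by_frequency(surname_counts)
conc_choice = by_concentration(surname_counts, country_totals)
print(freq_choice, conc_choice)
```

With these made-up numbers the raw count favours the larger country, while the share favours the smaller country where the surname is relatively more common.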
Method 1. Kullback-Leibler
OPERATIONALIZATION
A surname will be assigned to a country if 1) it has the highest frequency, and 2) there are “certain levels of assurance”.
METHOD 1
Kullback-Leibler divergence indicates the (dis)similarity of a global surname distribution with its distribution in each country.
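One possible reading of Method 1 is sketched below. It compares a surname's distribution over countries with a baseline distribution of all authors over countries, and assigns the surname to its highest-frequency country only when the divergence passes a threshold. The counts, the choice of baseline, the direction of the divergence, and the `THRESHOLD` value are all assumptions made for illustration; the slides do not specify them.

```python
import math


def kl_divergence(p, q):
    """D_KL(P || Q) in nats. Terms with p_i == 0 contribute 0.
    Requires q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def normalize(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts)
    return [c / total for c in counts]


# Toy data, invented for illustration only.
countries = ["SPAIN", "CUBA", "USA"]
baseline = normalize([100_000, 2_000, 200_000])  # all authors per country
surname = normalize([900, 50, 400])              # one surname per country

divergence = kl_divergence(surname, baseline)

THRESHOLD = 0.1  # hypothetical "level of assurance", not from the talk
top_country = countries[max(range(len(countries)), key=lambda i: surname[i])]
assigned = top_country if divergence >= THRESHOLD else None
print(top_country, assigned)
```

A surname spread across countries exactly like the baseline has a divergence of 0 and would be left unassigned under this reading, which is one way a method of this kind could end up with less than full coverage.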
Method 2. Gini Index
OPERATIONALIZATION
A surname will be assigned to a country if it is the one with the highest concentration of such surname.
METHOD 2
The Gini Index is an inequality indicator already employed for other purposes in bibliometrics. It weights, on a scale from 0 to 1, the concentration of a surname in a country.
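As a reminder of what the indicator measures, here is a minimal sketch of the standard Gini coefficient applied to one surname's counts across countries. How the authors derive a per-country concentration score from it is not stated in the slides, so this only shows the textbook coefficient; the counts are invented.

```python
def gini(values):
    """Standard Gini coefficient of non-negative values.
    0 means perfectly even; the maximum for n values is (n - 1) / n."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Rank-weighted sum over the ascending values, ranks starting at 1.
    weighted = sum(rank * x for rank, x in enumerate(xs, start=1))
    return 2 * weighted / (n * total) - (n + 1) / n


# Toy counts of one surname in three countries (illustrative only).
even = gini([5, 5, 5])        # surname spread evenly
single = gini([0, 0, 10])     # surname found in one country only
mixed = gini([900, 50, 400])  # somewhere in between
print(even, single, mixed)
```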
Kullback-Leibler vs. Gini index
Top 10 countries with the highest number of surnames assigned

Method 1. Kullback-Leibler       Method 2. Gini index
Country       No. surnames       Country    No. surnames
FRANCE        138349             USA        310739
GERMANY       112445             FRANCE     117938
RUSSIA        111716             GERMANY    111375
SPAIN         83529              RUSSIA     94369
USA           76219              ITALY      65699
ITALY         69637              JAPAN      52399
ENGLAND       63885              ENGLAND    47521
JAPAN         56345              CANADA     46146
CANADA        49775              POLAND     44087
NETHERLANDS   41306              INDIA      42897
Kullback-Leibler vs. Gini index

Surname     Method 1. Kullback-Leibler   Method 2. Gini index
CLINTON     USA                          USA
EGGHE       BELGIUM                      BELGIUM
GARFIELD    USA                          USA
HERRERA     SPAIN                        CUBA
GARCIA      SPAIN                        CUBA
EINSTEIN    USA                          ISRAEL
NOYONS      NETHERLANDS                  NETHERLANDS
PEREIRA     BRAZIL                       PORTUGAL
The ‘golden list’
Validating the methods proposed
SEARCHING A ‘GOLDEN LIST’ TO VALIDATE THE RESULTS
o Coverage
o Criteria
› Language
› Ethnicity
› Historical origin
o Reliance and double assignments
In search of a ‘golden list’ of surnames assigned to countries/languages/ethnicities
http://en.wikipedia.org/wiki/Category:Surnames_by_language

Unified country   Languages
Denmark           Danish
England           Celtic; Anglo-Cornish; English; Scottish; Irish
Finland           Finnish
France            Breton; French
Germany           German
Greece            Greek
Iceland           Icelandic
Italy             Italian
Japan             Japanese
Netherlands       Afrikaans; Dutch
Portugal          Portuguese
Spain             Basque; Catalan; Galician;
The ‘golden list’
              METHOD 1                  METHOD 2
Countries     % coverage   % correct    % coverage   % correct
DENMARK       91.1%        68.75%       100%         60.16%
ENGLAND       28.8%        80.97%       100%         58.56%
FINLAND       99.11%       94.62%       100%         91.96%
FRANCE        88.08%       68.28%       100%         50.54%
GERMANY       52.24%       69.00%       100%         43.78%
GREECE        84.12%       78.32%       100%         78.57%
ICELAND       100.00%      65.52%       100%         100.00%
ITALY         87.65%       86.97%       100%         64.77%
JAPAN         98.74%       98.95%       100%         91.39%
NETHERLANDS   88.11%       60.96%       100%         41.67%
PORTUGAL      98.54%       92.59%       100%         91.91%
SPAIN         93.18%       48.74%       100%         54.74%
Total         73.22%       79.03%       100%         61.29%
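The two validation columns could be computed along the following lines. The definitions are assumptions, since the slides do not spell them out: coverage is taken as the share of golden-list surnames that a method assigns to any country, and % correct as the share of those covered surnames whose assigned country matches the golden list. The surnames and assignments are placeholders, not the authors' data.

```python
def evaluate(golden, assigned):
    """Return (coverage, share_correct) for one method.

    golden:   dict surname -> expected country (the 'golden list')
    assigned: dict surname -> country chosen by the method, or missing/None
              when the method makes no assignment
    """
    if not golden:
        return 0.0, 0.0
    covered = [s for s in golden if assigned.get(s) is not None]
    coverage = len(covered) / len(golden)
    correct = sum(1 for s in covered if assigned[s] == golden[s])
    share_correct = correct / len(covered) if covered else 0.0
    return coverage, share_correct


# Placeholder data, invented for illustration only.
golden = {"S1": "SPAIN", "S2": "SPAIN", "S3": "JAPAN", "S4": "ITALY"}
method_output = {"S1": "SPAIN", "S2": "CUBA", "S3": "JAPAN"}  # S4 unassigned

coverage, share_correct = evaluate(golden, method_output)
print(f"coverage={coverage:.2%} correct={share_correct:.2%}")
```

Under these definitions, a method that always assigns a country reaches 100% coverage by construction, so the two methods are only directly comparable on the % correct column.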
Next or previous steps
o Is the Web of Science a good sample of the world population?
› Country census crossed with the WoS
o Time frames and migratory movements
› Apply methods to different periods
o Validation and comparison with other techniques
› Bayesian, probability, clustering
o Multiple assignments of countries (e.g., Lee, Santos)