building nanobank data structure and selection criteria jason fong and emre uyar university of...

BUILDING NANOBANK

Data Structure and Selection Criteria

Jason Fong and Emre Uyar
University of California, Los Angeles

What is Nanobank?

Nanobank is a collection of observations from various sources (scientific articles, patents and government grants) determined to be related to the nanotechnology field, either by probabilistic information retrieval (IR) methods or by being declared nano by a source authority.

Data Sources - Articles

580,711 scientific articles from peer-reviewed journals.

Source: Science Citation Index, Arts & Humanities Citation Index and Social Sciences Citation Index of the Institute for Scientific Information Inc. (ISI®). Altogether, these indexes contain more than 24,250,000 entries from over 8,700 peer-reviewed scientific journals.

Data Sources – Patents and Grants

240,437 patents from the U.S. Patent and Trademark Office’s (USPTO) online database of more than 4,000,000 patents granted from 1976 to 2006.

52,831 grants from NIH and NSF databases.

Data Contents

Articles
◦ Titles
◦ Journal volume and issue numbers
◦ Publication years
◦ Author names
◦ Names and addresses of organizations affiliated with authors

Data Contents

Patents
◦ Titles and abstracts
◦ Application and grant dates
◦ Names and addresses of inventors and assignees
◦ U.S. and international patent classifications

Data Contents

Grants
◦ Titles and abstracts
◦ Receiving organization names and addresses
◦ PI and co-PI names
◦ Grant amounts

Nanobank Data Structure

• Internal database
– Stored in a relational database
– Separate tables for various data items
– ID numbers for each item link between tables
• Version posted on Nanobank.org
– Denormalized form of internal database
– Storing redundant data isn’t as space-efficient, but lessens the need to join multiple tables
– Nanobank Codebook contains detailed information on tables and fields available in each
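
The difference between the internal (normalized) form and the posted (denormalized) form can be illustrated with a small sketch. The table and column names below (article, author, article_author, article_flat) are hypothetical and are not taken from the Nanobank Codebook; the sketch only shows how ID-linked tables might be flattened so that users do not need joins.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (internal-style) form: separate tables linked by ID numbers.
    -- All table and column names here are hypothetical.
    CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT, pub_year INTEGER);
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);

    INSERT INTO article VALUES (1, 'Example title A', 2001);
    INSERT INTO author VALUES (10, 'Author One'), (11, 'Author Two');
    INSERT INTO article_author VALUES (1, 10), (1, 11);

    -- Denormalized (posted-style) form: one row per article-author pair.
    -- Article fields are repeated on every row, so no join is needed later.
    CREATE TABLE article_flat AS
    SELECT a.article_id, a.title, a.pub_year, au.name AS author_name
    FROM article a
    JOIN article_author aa ON aa.article_id = a.article_id
    JOIN author au ON au.author_id = aa.author_id;
""")

# Reading the flat table requires no joins, at the cost of redundant storage.
flat_rows = conn.execute(
    "SELECT title, pub_year, author_name FROM article_flat ORDER BY author_name"
).fetchall()
```

The title and year are stored once in the normalized tables but twice in the flat table, which is the space trade-off described above.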

Document Selection

Document Selection Methods
◦ Keywords
◦ Probabilistic
◦ Authority-selected

Tables include a field to indicate selection method:
◦ “nanobank_flag” = 1 if selected by Keywords or Probabilistic; 0 otherwise
◦ “authority_flag” = 1 if Authority-selected; 0 otherwise
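
A minimal sketch of how the two flags relate to the three selection methods. The function name and its boolean arguments are hypothetical; only the flag names and their 0/1 meaning come from the slide.

```python
def selection_flags(by_keywords, by_probabilistic, by_authority):
    """Return the two flag fields for one document (hypothetical helper)."""
    return {
        # 1 if selected by Keywords or Probabilistic; 0 otherwise
        "nanobank_flag": 1 if (by_keywords or by_probabilistic) else 0,
        # 1 if Authority-selected; 0 otherwise
        "authority_flag": 1 if by_authority else 0,
    }


# A document can carry both flags when several methods select it.
example = selection_flags(by_keywords=False, by_probabilistic=True, by_authority=True)
```

Because the flags are independent, a user could filter on either one or on both, depending on how strict a definition of "nano" is wanted.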

Document Selection: Keywords

• Search for text patterns matching words or phrases related to nanotechnology
• Words and phrases chosen by subject specialists
• Less effective for identifying very early or recent documents
– Early documents were written before the terms were in common usage
– Recent documents have terms that are too new to be included in the search patterns
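
Keyword selection of this kind can be sketched with regular expressions. The patterns below are hypothetical placeholders, not the list chosen by Nanobank's subject specialists, and the function name is an assumption.

```python
import re

# Hypothetical patterns for illustration only; the real list was
# chosen by subject specialists and is not reproduced here.
PATTERNS = [r"\bnano\w*", r"\bquantum dots?\b", r"\bfullerenes?\b"]
KEYWORD_RE = re.compile("|".join(PATTERNS), re.IGNORECASE)


def matches_keywords(text):
    """Return True if any keyword pattern occurs in the text."""
    return KEYWORD_RE.search(text) is not None


hit = matches_keywords("Growth of carbon nanotubes on silicon")
miss = matches_keywords("Optical properties of colloidal gold particles")
```

The second title illustrates the limitation noted above: a relevant document that does not use the listed vocabulary is not matched by a fixed pattern list.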

Document Selection: Probabilistic

Incorporates new terms as they come into common usage
Uses the Xapian search engine library to perform ranking calculations
Analyzes document text and ranks against a set of query terms

Document Selection: Probabilistic

Initial query terms from the Virtual Journal of Nanoscale Science & Technology (VJNano):
◦ All articles in VJNano assumed to be relevant
◦ Select highest ranked terms

Document selection process:
◦ Use initial query terms to select relevant documents from all journal articles
◦ Select additional terms from those relevant documents and add to query
◦ Repeat selection with expanded query terms
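
The select, expand, and repeat loop above can be sketched in plain Python. This is not Xapian and not the ranking formula Nanobank used: the term score (document frequency in the relevant set, weighted by the share of all occurrences that fall in that set) is a simplified stand-in, and all function names, parameters, and sample documents are hypothetical.

```python
from collections import Counter


def doc_terms(doc):
    """Distinct lowercase terms in a document (whitespace tokenization)."""
    return set(doc.lower().split())


def select_terms(relevant, collection, k, exclude=()):
    """Pick the k highest-scoring terms from the relevant documents.

    Simplified stand-in score: r * r / n, where r is the number of relevant
    documents containing the term and n the number in the whole collection.
    """
    rel_df = Counter(t for d in relevant for t in doc_terms(d))
    all_df = Counter(t for d in collection for t in doc_terms(d))
    scored = []
    for term, r in rel_df.items():
        if term in exclude:
            continue
        n = max(all_df[term], r)
        scored.append((-(r * r / n), term))  # negative score sorts best first
    return [term for _, term in sorted(scored)[:k]]


def select_documents(collection, query, min_hits=1):
    """Select documents containing at least min_hits distinct query terms."""
    q = set(query)
    return [d for d in collection if len(doc_terms(d) & q) >= min_hits]


def expand_and_select(collection, seed_docs, n_initial=1, n_added=1, rounds=1):
    """Seed the query from assumed-relevant documents, then expand it."""
    query = select_terms(seed_docs, collection, n_initial)
    for _ in range(rounds):
        relevant = select_documents(collection, query)
        query += select_terms(relevant, collection, n_added, exclude=set(query))
    return query, select_documents(collection, query)


collection = [
    "nanotube nanowire growth",
    "nanotube nanowire device",
    "silicon nanowire sensor",
    "protein folding dynamics",
    "river sediment transport",
]
seed_docs = collection[:2]  # stand-in for the VJNano articles
query, selected = expand_and_select(collection, seed_docs)
```

In this toy run the seed documents yield the initial term "nanotube"; the first selection pass adds "nanowire", and the expanded query then also picks up the third document, which the initial query alone would have missed.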

Document Selection: Authority Set

• Articles
– Listed in the Virtual Journal of Nanoscale Science & Technology
• Patents
– Listed under United States Patent Classification Class 977 (Nanotechnology)
• NSF Grants
– Program name contains “nano”
• NIH Grants
– NIH descriptive tag contains “nano”
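
The four authority rules can be written as one small function. The record layout (the keys source, in_vjnano, us_classes, program_name and descriptive_tags) is hypothetical, and treating the “nano” match as case-insensitive is an assumption not stated on the slide.

```python
def authority_flag(record):
    """Return 1 if the record is authority-selected, else 0.

    Field names are hypothetical; case-insensitive matching is an assumption.
    """
    source = record["source"]
    if source == "article":
        # Listed in the Virtual Journal of Nanoscale Science & Technology
        return int(bool(record.get("in_vjnano", False)))
    if source == "patent":
        # Listed under US Patent Classification Class 977
        return int("977" in record.get("us_classes", []))
    if source == "nsf_grant":
        return int("nano" in record.get("program_name", "").lower())
    if source == "nih_grant":
        tags = record.get("descriptive_tags", [])
        return int(any("nano" in tag.lower() for tag in tags))
    return 0


patent = {"source": "patent", "us_classes": ["438", "977"]}
grant = {"source": "nsf_grant", "program_name": "Nanoscale Interdisciplinary Research"}
```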

GEOCODING

Standardizing between differing naming conventions used in different sources.
Standardizing non-uniformity in how observations are recorded.
Correcting common mistakes.
For US observations: providing grouping units other than city and state that are not available in the original data sources, such as counties and BEA areas.

COUNTRY GEOCODING

Country names in all observations are cleaned, standardized and assigned an ISO code (2-letter alphabetical).
The current ISO list of countries is taken as the basis; historical entries are assigned to the closest current country to the extent possible.
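
Country cleaning of this kind is often done with a lookup from cleaned name variants to ISO 3166-1 alpha-2 codes. The sketch below assumes that approach; the variant list is a hypothetical excerpt, and mapping "West Germany" to DE is an assumed example of assigning a historical entry to the closest current country.

```python
import re

# Hypothetical excerpt of a variant-to-ISO lookup table.
COUNTRY_TO_ISO = {
    "usa": "US",
    "u s a": "US",
    "united states": "US",
    "germany": "DE",
    "west germany": "DE",  # assumed example of a historical entry
    "japan": "JP",
}


def clean_country(name):
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    return " ".join(re.sub(r"[^a-z ]", " ", name.lower()).split())


def country_code(name):
    """Return the 2-letter ISO code, or None if the name is not in the table."""
    return COUNTRY_TO_ISO.get(clean_country(name))


codes = [country_code(n) for n in ["U.S.A.", "United  States", "West Germany", "Atlantis"]]
```

Unmatched names return None so that they can be routed to hand cleaning rather than guessed.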

US GEOCODING

US observations are those in the 50 US states, DC and 7 US associated areas.
Cities, states, counties and BEA economic areas are coded using “Populated Places” data obtained from the FIPS 55 database and BEA.
The basis is the city-state combination. City names are standardized and matched to the names in the FIPS database on a state-by-state basis.
In articles, 99.98% of US observations have been assigned a definite city-state code.
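
A minimal sketch of state-by-state city matching. The place table is a one-entry illustrative excerpt (its codes should be verified against FIPS 55 before any use), the variant table is hypothetical, and the function names are assumptions; the output keys follow the variable names used in this deck.

```python
import re

# Illustrative one-entry excerpt keyed by state; verify codes against FIPS 55.
PLACES = {
    "CA": {
        "LOS ANGELES": {
            "State_code": "06",
            "City_code": "44000",
            "County_code": "06037",
            "County_name": "Los Angeles",
        },
    },
}

# Hypothetical table of known variants and misspellings, keyed by (state, name).
CITY_VARIANTS = {
    ("CA", "LA"): "LOS ANGELES",
    ("CA", "LOS ANGELAS"): "LOS ANGELES",
}


def standardize_city(city, state):
    """Uppercase, strip punctuation, then map known variants within the state."""
    name = " ".join(re.sub(r"[^A-Z ]", " ", city.upper()).split())
    return CITY_VARIANTS.get((state, name), name)


def geocode_us(city, state):
    """Return the geocoding variables for a city-state pair, or None if unmatched."""
    state = state.strip().upper()
    std_name = standardize_city(city, state)
    place = PLACES.get(state, {}).get(std_name)
    if place is None:
        return None
    return {"Standard_city_name": std_name, **place}


result = geocode_us("Los Angelas", "ca")
```

Keying the variant table by state keeps an abbreviation such as "LA" from being resolved to the wrong city in a different state.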

US GEOCODING: Variables Created

1. Standard_city_name: Standardized name as it appears in the FIPS database (corrected for misspellings, abbreviations, etc.)
2. State_code: 2-digit numeric code.
3. City_code: 5-digit numeric code, unique by state.
4. County_code: 5-digit numeric code.
5. County_name

City code + state code uniquely determine a populated place.
Numeric codes are the same as the codes used by FIPS.

GEOCODING – US BEA Areas

The Bureau of Economic Analysis (BEA) created 179 Economic Areas in the US; each county is assigned to a unique BEA Economic Area.
BEA_code: 3-digit numeric code that identifies the associated BEA Economic Area for each observation.

"BEA's economic areas define the relevant regional markets surrounding metropolitan or micropolitan statistical areas. They consist of one or more economic nodes - metropolitan or micropolitan statistical areas that serve as regional centers of economic activity and the surrounding counties that are economically related to the nodes.

The economic areas were redefined on November 17, 2004, and are based on commuting data from the 2000 decennial population census, on redefined statistical areas from OMB (February 2004), and on newspaper circulation data from the Audit Bureau of Circulations for 2001."

ORGANIZATION CODES

Each observation is assigned an alphanumeric code.

The 2-letter alphabetical part determines the organization type.

The numeric part groups names that are the same after standardization and hand cleaning.

First 2 characters   Organization type
FI                   Firm
UN                   University
NL                   National Lab
RI                   Research Inst
UG                   US Government
HO                   Hospital
AS                   Academy of Sciences
NO                   No Organization
SC                   School
OT                   Other
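
A short sketch of splitting an organization code into its two parts. The exact layout of the codes is not given here, so the pattern (two letters immediately followed by digits, as in the made-up code "UN00042") is an assumption; the type table is taken from the list above.

```python
import re

ORG_TYPES = {
    "FI": "Firm",
    "UN": "University",
    "NL": "National Lab",
    "RI": "Research Inst",
    "UG": "US Government",
    "HO": "Hospital",
    "AS": "Academy of Sciences",
    "NO": "No Organization",
    "SC": "School",
    "OT": "Other",
}

# Assumed layout: two letters followed by one or more digits.
CODE_RE = re.compile(r"([A-Z]{2})(\d+)")


def parse_org_code(code):
    """Split a code into (organization type, numeric group id)."""
    match = CODE_RE.fullmatch(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized organization code: {code!r}")
    prefix, number = match.groups()
    if prefix not in ORG_TYPES:
        raise ValueError(f"unknown organization type prefix: {prefix!r}")
    return ORG_TYPES[prefix], int(number)


org_type, group_id = parse_org_code("UN00042")  # made-up code
```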

Organization Codes: Types of Cleaning

1. Standardization of common identifiers:
◦ IBM = IBM Corp. = IBM Corporation
◦ Univ = University = University of = Universidade = Universidad = Univerzitet = Universita = Universitat = Universiti = Universite = Universitet = Universiteit
2. Using look-up tables and hand cleaning to identify common variants (and misspellings) of names used by the same organization:
◦ IBM = Int Buisness Machines = International Business Machines Corporation = Int Business Machines Operation
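
The two cleaning steps can be sketched as a token-level standardization followed by a variant lookup. The token map, the dropped-word list, and the variant table below are small hypothetical excerpts built from the examples above, not the tables actually used; in practice the variant table would be maintained by hand.

```python
import re

# Step 1 tables: common identifiers mapped to one standard token (excerpt).
TOKEN_MAP = {
    "university": "univ",
    "universidad": "univ",
    "universidade": "univ",
    "universiteit": "univ",
    "international": "int",
}
DROPPED_TOKENS = {"corp", "corporation", "of"}

# Step 2 table: hand-identified variants and misspellings (excerpt).
VARIANTS = {
    "int buisness machines": "ibm",
    "int business machines": "ibm",
    "int business machines operation": "ibm",
}


def standardize_name(name):
    """Return a standardized key for an organization name."""
    tokens = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    tokens = [TOKEN_MAP.get(t, t) for t in tokens if t not in DROPPED_TOKENS]
    key = " ".join(tokens)
    return VARIANTS.get(key, key)


names = [
    "IBM Corp.",
    "IBM Corporation",
    "Int Buisness Machines",
    "International Business Machines Corporation",
]
keys = {standardize_name(n) for n in names}
```

Names that reduce to the same key would then share the numeric part of the organization code.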
