Post on 21-Jan-2016
BUILDING NANOBANK
Data Structure and Selection Criteria
Jason Fong and Emre Uyar
University of California, Los Angeles
What is Nanobank?
Nanobank is a collection of observations from various sources (scientific articles, patents, and government grants) determined to be related to the nanotechnology field, either by probabilistic information retrieval (IR) methods or by being declared nano by a source authority.
Data Sources - Articles
580,711 scientific articles from peer-reviewed journals.
Source: Science Citation Index, Arts & Humanities Citation Index, and Social Sciences Citation Index of the Institute for Scientific Information Inc. (ISI®). Together, these indexes contain more than 24,250,000 entries from over 8,700 peer-reviewed scientific journals.
Data Sources – Patents and Grants
240,437 patents from the U.S. Patent and Trademark Office’s online database of more than 4,000,000 patents granted by the USPTO from 1976 to 2006.
52,831 grants from NIH and NSF databases.
Data Contents
Articles
◦ Titles
◦ Journal volume and issue numbers
◦ Publication years
◦ Author names
◦ Names and addresses of organizations affiliated with authors
Data Contents
Patents
◦ Titles and abstracts
◦ Application and grant dates
◦ Names and addresses of inventors and assignees
◦ U.S. and international patent classifications
Data Contents
Grants
◦ Titles and abstracts
◦ Receiving organization names and addresses
◦ PI and co-PI names
◦ Grant amounts
Nanobank Data Structure
• Internal database
– Stored in a relational database
– Separate tables for various data items
– ID numbers for each item link between tables
• Version posted on Nanobank.org
– Denormalized form of internal database
– Storing redundant data isn’t as space-efficient, but lessens the need to join multiple tables
– Nanobank Codebook contains detailed information on tables and fields available in each
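The trade-off between the normalized internal form and the denormalized posted form can be illustrated with a small in-memory SQLite database. This is only a sketch: the table names, column names, and rows are hypothetical and are not taken from the Nanobank Codebook.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Normalized (internal-style) form: separate tables linked by ID numbers.
    -- All table and column names here are hypothetical.
    CREATE TABLE article (article_id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE author (author_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE article_author (article_id INTEGER, author_id INTEGER);

    INSERT INTO article VALUES (1, 'Example title A'), (2, 'Example title B');
    INSERT INTO author VALUES (10, 'Author X'), (11, 'Author Y');
    INSERT INTO article_author VALUES (1, 10), (1, 11), (2, 10);

    -- Denormalized (posted-style) form: one wide table in which titles and
    -- names are repeated, so readers do not have to write the joins.
    CREATE TABLE article_author_flat AS
        SELECT a.article_id, a.title, au.author_id, au.name
        FROM article AS a
        JOIN article_author AS aa ON aa.article_id = a.article_id
        JOIN author AS au ON au.author_id = aa.author_id;
""")

# A single-table query now answers what previously needed a three-way join.
flat_rows = conn.execute(
    "SELECT title, name FROM article_author_flat ORDER BY article_id, author_id"
).fetchall()
```

The flat table stores 'Example title A' and 'Author X' twice each, which is the redundancy the slide refers to.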
Document Selection
Document Selection Methods
◦ Keywords
◦ Probabilistic
◦ Authority-selected
Tables include fields to indicate the selection method:
◦ “nanobank_flag” = 1 if selected by Keywords or Probabilistic; 0 otherwise
◦ “authority_flag” = 1 if Authority-selected; 0 otherwise
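As a minimal sketch of how the two flags might be used when filtering records, assuming each record is available as a dictionary (the `id` field and the sample rows are hypothetical; only the two flag names come from the slide):

```python
# Hypothetical sample rows; only the flag names come from the slides.
records = [
    {"id": 1, "nanobank_flag": 1, "authority_flag": 0},
    {"id": 2, "nanobank_flag": 0, "authority_flag": 1},
    {"id": 3, "nanobank_flag": 1, "authority_flag": 1},
]

def selection_methods(record):
    """Describe how a record entered the collection, based on its flags."""
    methods = []
    if record["nanobank_flag"] == 1:
        methods.append("keywords or probabilistic")
    if record["authority_flag"] == 1:
        methods.append("authority")
    return methods

# Records picked up by the keyword or probabilistic methods.
ir_selected = [r["id"] for r in records if r["nanobank_flag"] == 1]

# Records included only because a source authority declared them nano.
authority_only = [
    r["id"] for r in records
    if r["authority_flag"] == 1 and r["nanobank_flag"] == 0
]
```

Note that the flags are not mutually exclusive: a record can carry both.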
Document Selection: Keywords
• Search for text patterns matching words or phrases related to nanotechnology
• Words and phrases chosen by subject specialists
• Less effective for identifying very early or recent documents
– Early documents were written before the terms were in common usage
– Recent documents have terms that are too new to be included in the search patterns
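Keyword selection of this kind can be sketched with regular expressions. The three patterns below are hypothetical stand-ins; the actual list chosen by the subject specialists is not given in the slides.

```python
import re

# Hypothetical patterns; the real specialist-chosen list is not in the slides.
KEYWORD_PATTERNS = [
    r"\bnanotub\w*",                # nanotube, nanotubes, nanotubular, ...
    r"\bquantum dots?\b",           # quantum dot, quantum dots
    r"\batomic force microscop\w*", # microscope, microscopy, ...
]
KEYWORD_RE = re.compile("|".join(KEYWORD_PATTERNS), re.IGNORECASE)

def matches_keywords(text):
    """Return True if any keyword pattern occurs in the text."""
    return KEYWORD_RE.search(text) is not None
```

A document that describes nanoscale work without using any listed term is missed, which is the weakness noted above for very early and very recent documents.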
Document Selection: Probabilistic
Incorporates new terms as they come into common usage
Uses the Xapian search engine library to perform ranking calculations
Analyzes document text and ranks against a set of query terms
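To give a feel for what ranking documents against a set of query terms involves, here is a sketch using the standard BM25 formula, a widely used probabilistic ranking function. It does not call Xapian, and the slides do not say which weighting scheme or parameters were used, so the formula choice, the `k1` and `b` values, and the whitespace tokenizer are all assumptions.

```python
import math
from collections import Counter

def tokenize(text):
    # Simplistic whitespace tokenizer, assumed for illustration only.
    return text.lower().split()

def bm25_scores(docs, query_terms, k1=1.2, b=0.75):
    """Score each document against the query terms with standard BM25."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs

    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query_terms:
            if tf[term] == 0:
                continue
            # Rarer terms get a larger inverse document frequency weight.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term frequency saturates and is normalized by document length.
            norm = tf[term] + k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

docs = [
    "carbon nanotube growth on silicon",
    "nanotube nanotube transistor",
    "medieval poetry review",
]
scores = bm25_scores(docs, ["nanotube", "transistor"])
```

Documents that share no terms with the query score zero; the rest can be sorted by score and cut off at a chosen threshold.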
Document Selection: Probabilistic
Initial query terms from the Virtual Journal of Nanoscale Science & Technology (VJNano):
◦ All articles in VJNano assumed to be relevant
◦ Select highest ranked terms
Document selection process:
◦ Use initial query terms to select relevant documents from all journal articles
◦ Select additional terms from those relevant documents and add to query
◦ Repeat selection with expanded query terms
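The selection process above is a form of iterative query expansion, which can be sketched as follows. This is not the Nanobank implementation: term ranking is approximated here by plain document frequency, relevance by a minimum number of matching query terms, and the stopword list, thresholds, and sample texts are all hypothetical.

```python
from collections import Counter

STOPWORDS = {"a", "and", "by", "in", "of", "on", "the"}  # assumed list

def tokenize(text):
    return text.lower().split()

def top_terms(docs, n, exclude):
    """Return the n terms occurring in the most documents (ties by name)."""
    counts = Counter()
    for doc in docs:
        counts.update(set(tokenize(doc)) - set(exclude))
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:n]]

def select(corpus, query, min_hits=2):
    """Keep documents containing at least min_hits distinct query terms."""
    return [d for d in corpus if len(set(tokenize(d)) & set(query)) >= min_hits]

def expand_and_select(seed_docs, corpus, n_initial=3, n_new=2, rounds=2):
    # Step 1: initial query terms from the seed set (standing in for VJNano).
    query = top_terms(seed_docs, n_initial, exclude=STOPWORDS)
    selected = select(corpus, query)
    for _ in range(rounds):
        # Step 2: take additional terms from the documents judged relevant.
        new_terms = top_terms(selected, n_new, exclude=set(query) | STOPWORDS)
        if not new_terms:
            break
        query.extend(new_terms)
        # Step 3: repeat the selection with the expanded query.
        selected = select(corpus, query)
    return query, selected

seed_docs = [
    "nanotube growth on silicon",
    "nanotube transistor on silicon",
    "quantum dot growth",
]
corpus = [
    "nanotube growth by catalytic deposition",
    "catalytic deposition of nanowire arrays",
    "medieval poetry review",
    "silicon nanotube catalytic deposition",
]
query, selected = expand_and_select(seed_docs, corpus)
```

In this toy run the second corpus document shares no term with the initial query and is only picked up after "catalytic" and "deposition" are added, which is how the method can reach documents that use newer vocabulary.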
Document Selection: Authority Set
• Articles
– Listed in the Virtual Journal of Nanoscale Science & Technology
• Patents
– Listed under United States Patent Classification Class 977 (Nanotechnology)
• NSF Grants
– Program name contains “nano”
• NIH Grants
– NIH descriptive tag contains “nano”
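The four authority rules can be written as a single function. In this sketch the record layout (`kind`, `in_vjnano`, `uspc_classes`, `program_name`, `tags`) is hypothetical, and the "contains" tests are assumed to be case-insensitive.

```python
def authority_flag(record):
    """Return 1 if a source authority declares the record nano, else 0.

    All field names are hypothetical; only the rules come from the slides.
    """
    kind = record["kind"]
    if kind == "article":
        # Listed in the Virtual Journal of Nanoscale Science & Technology.
        return int(bool(record.get("in_vjnano", False)))
    if kind == "patent":
        # Listed under U.S. Patent Classification Class 977.
        return int("977" in record.get("uspc_classes", []))
    if kind == "nsf_grant":
        return int("nano" in record.get("program_name", "").lower())
    if kind == "nih_grant":
        return int(any("nano" in tag.lower() for tag in record.get("tags", [])))
    return 0
```

For example, `authority_flag({"kind": "patent", "uspc_classes": ["438", "977"]})` evaluates to 1.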
GEOCODING
Reconciling the differing naming conventions used in different sources.
Resolving non-uniformity in how observations are recorded.
Correcting common mistakes.
For US observations: providing additional grouping units (beyond city and state) that are not available in the original data sources, such as counties and BEA areas.
COUNTRY GEOCODING
Country names in all observations are cleaned, standardized, and assigned an ISO code (2-letter alphabetic).
The current ISO list of countries is taken as the basis; historical entries are assigned to the closest current country to the extent possible.
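A minimal sketch of the clean, standardize, and assign steps, assuming a hand-built lookup table. The table below holds only a few illustrative entries, and mapping "West Germany" to DE is one example of assigning a historical entry to the closest current country.

```python
import re

# Illustrative entries only; a real table would cover every observed variant.
COUNTRY_TO_ISO = {
    "united states": "US",
    "usa": "US",
    "germany": "DE",
    "west germany": "DE",  # historical entry mapped to the closest current country
    "japan": "JP",
}

def clean_country(raw):
    """Lowercase, drop punctuation and digits, and collapse whitespace."""
    text = re.sub(r"[^a-z ]", "", raw.lower())
    return re.sub(r"\s+", " ", text).strip()

def country_code(raw):
    """Return the 2-letter ISO code, or None if the name is not recognized."""
    return COUNTRY_TO_ISO.get(clean_country(raw))
```

Unrecognized names return `None` so that they can be routed to hand cleaning rather than guessed.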
US GEOCODING
US observations are those in the 50 US states, DC, and 7 US associated areas.
Cities, states, counties, and BEA economic areas are coded using “Populated Places” data obtained from the FIPS 55 database and BEA.
The basis is the city-state combination. City names are standardized and matched to the names in the FIPS database on a state-by-state basis.
In articles, 99.98% of US observations have been assigned a definite city-state code.
US GEOCODING: Variables Created
1. Standard_city_name: Standardized name as it appears in the FIPS database (corrected for misspellings, abbreviations, etc.)
2. State_code: 2-digit numeric code.
3. City_code: 5-digit numeric code, unique by state.
4. County_code: 5-digit numeric code.
5. County_name
City code + state code uniquely determine a populated place.
Numeric codes are the same as the codes used by FIPS.
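The state-by-state matching and the resulting variables can be sketched as a lookup keyed on the city-state combination. The two place entries and the variant table are illustrative and should be checked against the FIPS data before use; the function names are hypothetical.

```python
import re

# (state_code, standardized city name) -> (city_code, county_code, county_name)
# Illustrative entries; verify against the FIPS "Populated Places" data.
FIPS_PLACES = {
    ("06", "LOS ANGELES"): ("44000", "06037", "Los Angeles"),
    ("06", "PASADENA"): ("56000", "06037", "Los Angeles"),
}

# Hand-built corrections for misspellings and abbreviations, kept per state.
CITY_VARIANTS = {
    ("06", "LA"): "LOS ANGELES",
    ("06", "LOS ANGELAS"): "LOS ANGELES",
}

def standardize_city(raw, state_code):
    """Uppercase, strip punctuation, then apply state-specific corrections."""
    name = re.sub(r"[^A-Z ]", "", raw.upper())
    name = re.sub(r"\s+", " ", name).strip()
    return CITY_VARIANTS.get((state_code, name), name)

def geocode_us(raw_city, state_code):
    """Return the geocoding variables for a city-state pair, or None."""
    name = standardize_city(raw_city, state_code)
    match = FIPS_PLACES.get((state_code, name))
    if match is None:
        return None
    city_code, county_code, county_name = match
    return {
        "Standard_city_name": name,
        "State_code": state_code,
        "City_code": city_code,
        "County_code": county_code,
        "County_name": county_name,
    }
```

Keying on the pair rather than the city name alone matters because the same name can occur in several states: in this toy table, "Pasadena" with state code 48 finds no match even though "Pasadena" with state code 06 does.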
GEOCODING – US BEA Areas
The Bureau of Economic Analysis (BEA) created 179 Economic Areas in the US by assigning each county to a unique BEA area.
BEA_code: 3-digit numeric code that determines the associated BEA Economic Area for each observation.
"BEA's economic areas define the relevant regional markets surrounding metropolitan or micropolitan statistical areas. They consist of one or more economic nodes - metropolitan or micropolitan statistical areas that serve as regional centers of economic activity and the surrounding counties that are economically related to the nodes.
The economic areas were redefined on November 17, 2004, and are based on commuting data from the 2000 decennial population census, on redefined statistical areas from OMB (February 2004), and on newspaper circulation data from the Audit Bureau of Circulations for 2001."
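Because each county belongs to exactly one Economic Area, assigning a BEA_code reduces to a county-level lookup. In the sketch below the county codes are real FIPS county codes, but the 3-digit area codes are placeholders, not actual BEA codes.

```python
from collections import defaultdict

# County FIPS code -> BEA area code. The area codes are PLACEHOLDERS,
# not real BEA Economic Area codes.
COUNTY_TO_BEA = {
    "06037": "901",
    "06059": "901",
    "36061": "902",
}

def bea_code(county_code):
    """Return the BEA area code for a county, or None if it is unknown."""
    return COUNTY_TO_BEA.get(county_code)

def counties_by_bea(mapping):
    """Invert the mapping to list the counties that make up each area."""
    groups = defaultdict(list)
    for county, area in sorted(mapping.items()):
        groups[area].append(county)
    return dict(groups)
```

Storing the mapping as a dictionary keyed on county enforces the one-area-per-county property by construction.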
ORGANIZATION CODES
Each observation is assigned an alphanumeric code.
The 2-letter alphabetic part determines the organization type.
The numeric part groups names that are the same up to standardization and hand cleaning.
First 2 letters   Organization type
FI Firm
UN University
NL National Lab
RI Research Inst
UG US Government
HO Hospital
AS Academy of Sciences
NO No Organization
SC School
OT Other
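A code of this shape can be split into its two parts as sketched below. The slides do not give the exact layout of the code, so the assumption here is two letters followed immediately by digits; the sample code `UN00042` is made up.

```python
import re

ORG_TYPES = {
    "FI": "Firm",
    "UN": "University",
    "NL": "National Lab",
    "RI": "Research Inst",
    "UG": "US Government",
    "HO": "Hospital",
    "AS": "Academy of Sciences",
    "NO": "No Organization",
    "SC": "School",
    "OT": "Other",
}

# Assumed layout: two letters followed directly by one or more digits.
ORG_CODE_RE = re.compile(r"^([A-Z]{2})(\d+)$")

def parse_org_code(code):
    """Split an organization code into (organization type, group number)."""
    match = ORG_CODE_RE.match(code.strip().upper())
    if match is None:
        raise ValueError(f"unrecognized organization code: {code!r}")
    prefix, number = match.groups()
    if prefix not in ORG_TYPES:
        raise ValueError(f"unknown organization type prefix: {prefix!r}")
    return ORG_TYPES[prefix], int(number)
```

Under these assumptions, `parse_org_code("UN00042")` returns `("University", 42)`.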
Organization Codes: Types of Cleaning
1. Standardization of common identifiers:
◦ IBM = IBM Corp. = IBM Corporation
◦ Univ = University = University of = Universidade = Universidad = Univerzitet = Universita = Universitat = Universiti = Universite = Universitet = Universiteit
2. Using lookup tables and hand cleaning to identify common variants (and misspellings) of names used by the same organization:
◦ IBM = Int Buisness Machines = International Business Machines Corporation = Int Business Machines Operation
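The two cleaning stages can be sketched as a rule-based standardizer followed by a lookup table. The variant lists are copied from the examples above; the function name, the suffix handling, and the uppercase output convention are assumptions made for illustration.

```python
import re

UNIVERSITY_VARIANTS = {
    "univ", "university", "universidade", "universidad", "univerzitet",
    "universita", "universitat", "universiti", "universite",
    "universitet", "universiteit",
}
CORPORATE_SUFFIXES = {"corp", "corporation"}

# Stage 2: hand-built lookup of known variants (and misspellings).
LOOKUP = {
    "int buisness machines": "IBM",
    "international business machines": "IBM",
    "int business machines operation": "IBM",
}

def standardize(name):
    """Map an organization name to a canonical form."""
    # Stage 1: normalize case and punctuation, then common identifiers.
    words = re.sub(r"[^a-z0-9 ]", " ", name.lower()).split()
    words = ["univ" if w in UNIVERSITY_VARIANTS else w for w in words]
    while words and words[-1] in CORPORATE_SUFFIXES:
        words.pop()
    # Treat "University of X" the same as "University X".
    kept = [
        w for i, w in enumerate(words)
        if not (w == "of" and i > 0 and words[i - 1] == "univ")
    ]
    key = " ".join(kept)
    # Stage 2: fall back to the lookup table for irregular variants.
    return LOOKUP.get(key, key.upper())
```

Names that standardize to the same string would then share the numeric part of the organization code.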