Automating NSF HERD Reporting Using Machine Learning and Administrative Data
Rodolfo H. Torres
CIMA Session: The Use of Advanced Analytics to Drive Decisions
2018 APLU Annual Meeting
New Orleans Marriott, New Orleans, LA
November 11, 2018
This research has been supported in part by the National Science Foundation under EAGER Awards 1547464 / 1547513. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of the National Science Foundation.
Luke Huan
Initial co-PI; former Professor, EECS / ITTC, University of Kansas
Current position: Head of Beijing Big Data Lab, Baidu Research
Joshua Rosenbloom
Professor and Chair, Department of Economics, Iowa State University
Joseph St. Amand
Former graduate student, EECS, University of Kansas
Current position: Chief Technology Officer, Patients Voices
Adrienne Sadovsky
Principal Analyst Senior, Office of Research, University of Kansas
Project in Collaboration with
From https://www.nsf.gov/statistics/srvyherd/#sd
“The Higher Education Research and Development Survey,…, is the primary source of information on R&D expenditures at U.S. colleges and universities. The survey collects information on R&D expenditures by field of research and source of funds and also gathers information on types of research and expenses …The survey is an annual census of institutions that expended at least $150,000 in separately budgeted R&D in the fiscal year.”
The HERD Survey
In FY 2016, 902 institutions reported data, for a total of $72B in R&D expenditures, of which $39B came from federal sources.
• R&D expenditures by source of funds (federal government, state and local government, business, nonprofit, institutional, and other)
• R&D expenditures passed through to sub-recipients or received as a sub-recipient
• Federally funded R&D expenditures by federal agency
• R&D expenditures by purpose of work (basic research, applied research, development, etc.)
• Federally and non-federally funded R&D expenditures by field (e.g., computer sciences, chemistry, economics)
Some Features of the HERD Survey
Total and federally financed higher education R&D expenditures, by type of R&D: 2010–2016 (in thousands)
Sample of Tables in the HERD Report
Columns: Fiscal year | Total: All R&D expenditures, Basic research, Applied research, Development | Federal: All R&D expenditures, Basic research, Applied research, Development
2010 61,286,610 40,416,177 15,478,375 5,392,058 37,477,582 25,399,596 9,361,940 2,716,046
2011 65,274,393 42,809,196 16,733,579 5,731,618 40,768,251 27,331,458 10,498,586 2,938,207
2012 65,729,007 42,401,697 17,295,653 6,031,657 40,142,223 26,469,347 10,577,754 3,095,122
2013 67,013,138 43,305,409 17,390,865 6,316,864 39,445,931 26,071,617 10,327,219 3,047,095
2014 67,196,537 42,989,478 17,745,860 6,461,199 37,960,175 24,905,121 10,015,778 3,039,276
2015 68,566,890 43,865,982 18,022,569 6,678,339 37,848,552 24,945,232 9,969,994 2,933,326
2016 71,833,308 45,101,655 19,986,766 6,744,887 38,793,542 24,944,577 10,893,286 2,955,679
SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey https://ncsesdata.nsf.gov/herd/2016/html/HERD2016_DST_08.html
SOURCE: National Science Foundation, National Center for Science and Engineering Statistics, Higher Education Research and Development Survey https://ncsesdata.nsf.gov/herd/2015/html/HERD2016_DST_05.html
Expenditures by Field and Source, 2016 (in thousands)
Columns: Field | All R&D expenditures | Source of funds: Federal government, State and local government, Institution funds, Business, Nonprofit organizations, All other sources
All R&D fields 71,833,308 38,793,542 4,025,280 17,974,962 4,210,563 4,614,800 2,214,161
Science 56,290,662 31,090,354 3,023,028 13,541,084 3,031,096 3,868,151 1,736,949
Computer and information sciences 2,077,884 1,442,771 49,502 399,965 90,288 59,588 35,770
Geosciences, atmospheric sciences, and ocean sciences 3,087,774 1,992,990 157,693 614,647 109,478 127,763 85,203
Atmospheric science and meteorology 626,518 513,275 18,416 68,923 6,319 7,660 11,925
Geological and earth sciences 999,351 605,706 47,334 226,541 51,237 32,557 35,976
Ocean sciences and marine sciences 1,097,864 665,121 59,874 241,440 32,896 69,841 28,692
Geosciences, atmospheric sciences, and ocean sciences, nec 364,041 208,888 32,069 77,743 19,026 17,705 8,610
Life sciences 40,887,850 21,798,334 2,437,745 9,700,749 2,569,302 3,038,475 1,343,245
Agricultural sciences 3,293,092 976,912 873,403 1,031,049 166,341 134,067 111,320
Biological and biomedical sciences 13,048,981 7,707,943 554,094 2,983,417 552,727 958,620 292,180
Health sciences 22,393,716 12,098,295 813,806 5,025,036 1,802,695 1,832,951 820,933
Natural resources and conservation 689,725 315,559 115,681 193,967 14,949 30,632 18,937
Life sciences, nec 1,462,336 699,625 80,761 467,280 32,590 82,205 99,875
Mathematics and statistics 681,661 444,419 25,714 170,414 8,844 23,601 8,669
Physical sciences 4,893,565 3,286,816 93,518 1,044,829 139,153 200,852 128,397
Astronomy and astrophysics 622,008 418,147 1,839 122,375 4,578 34,443 40,626
Chemistry 1,775,071 1,097,719 48,331 421,143 82,673 82,956 42,249
Materials science 172,086 111,802 4,579 38,435 9,518 5,465 2,287
Physics 2,124,098 1,523,751 33,703 417,189 37,851 71,221 40,383
Physical sciences, nec 200,302 135,397 5,066 45,687 4,533 6,767 2,852
Psychology 1,218,721 761,433 49,603 291,319 13,084 84,105 19,177
Social sciences 2,366,571 898,576 145,563 908,025 50,569 282,278 81,560
Anthropology 96,505 39,440 2,501 42,190 1,982 7,860 2,532
Economics 396,393 112,338 37,543 166,032 8,910 54,860 16,710
Political science and government 385,245 103,681 15,042 177,119 3,991 61,439 23,973
Sociology, demography, and population studies 504,594 269,371 27,602 135,118 8,213 52,471 11,819
Social sciences, nec 983,834 373,746 62,875 387,566 27,473 105,648 26,526
Sciences, nec 1,076,636 465,015 63,690 411,136 50,378 51,489 34,928
Engineering 11,381,727 6,583,476 699,032 2,335,527 1,055,444 359,441 348,807
Aerospace, aeronautical, and astronautical engineering 883,260 623,571 24,846 115,771 80,432 31,049 7,591
Bioengineering and biomedical engineering 1,084,355 650,752 56,057 254,840 46,976 53,428 22,302
Chemical engineering 885,273 467,678 40,386 199,334 121,432 34,328 22,115
Civil engineering 1,331,155 591,637 221,119 348,873 84,724 46,354 38,448
Electrical, electronic, and communications engineering 2,517,147 1,742,632 51,270 416,262 167,403 55,818 83,762
Industrial and manufacturing engineering 239,078 148,464 10,846 56,714 16,372 3,652 3,030
Mechanical engineering 1,435,828 860,745 55,454 279,079 169,124 33,587 37,839
Metallurgical and materials engineering 771,683 442,893 29,270 181,287 74,702 19,512 24,019
Engineering, nec 2,233,948 1,055,104 209,784 483,367 294,279 81,713 109,701
Non-S&E 4,160,919 1,119,712 303,220 2,098,351 124,023 387,208 128,405
Categorizing each project by purpose and field of research is done "manually" at KU and requires considerable time and effort:
• Labor intensive (expensive)
• Subjective
• Questionable reliability and validity
Goals
• Apply machine‐learning and text analysis tools to automate project classification
• Ease administrative burden
• Generate more objective classifications
A Proof of Concept Project
• We identified 1,700 historical awards that had been manually classified and attempted to classify them automatically using the project title, SOW/abstract, PI home department, and additional metadata. We treated the "purpose" and the "field" classifications as two separate tasks.
• After eliminating awards for which electronic abstracts were not available, we were left with a set of roughly 1,500 awards that could be used as a training data set.
• We used the "bag-of-words" model to represent the data; each word is treated as a separate "feature". There were 17,046 separate features, but using tools for feature weighting and selection we reduced this number to a few hundred. This feature-extraction "pipeline" is configurable, allowing us to experiment with different ways of producing features for the classification models.
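The bag-of-words representation and feature pruning described above can be sketched as follows. This is a minimal standard-library illustration, not the project's actual configurable pipeline (see the KUHERD repository for that); the award texts and the `min_df` document-frequency cutoff are illustrative assumptions.

```python
# Minimal bag-of-words sketch: each word becomes a feature, and a
# simple document-frequency cutoff stands in for the weighting and
# selection tools that reduced 17,046 features to a few hundred.
from collections import Counter
import re

def bag_of_words(text):
    """Map a document to a {word: count} feature dictionary."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(tokens)

def vocabulary(documents, min_df=2):
    """Keep only words appearing in at least `min_df` documents."""
    df = Counter()
    for doc in documents:
        df.update(set(bag_of_words(doc)))  # count each doc once per word
    return sorted(w for w, n in df.items() if n >= min_df)

# Made-up award abstracts, not real data.
docs = [
    "Numerical methods for partial differential equations",
    "Spectral methods for differential equations in fluid dynamics",
    "Genomic analysis of crop disease resistance",
]
vocab = vocabulary(docs)
# Fixed-length feature vectors suitable for the classifiers below.
vectors = [[bag_of_words(d)[w] for w in vocab] for d in docs]
```

Words appearing in only one document (e.g. "genomic" here) are dropped, shrinking the feature space in the same spirit as the selection step described above.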
Methods
We divided the awards into a "testing" set (about 30% of the data) and a "training" set (which is then divided into 5 parts for cross-validation). We explored the application of established machine-learning models:
• Decision Tree
• Support Vector Machine
• Logistic Regression
• Random Forest
• Naïve Bayes
• Neural Network
We evaluate each model on a per-category basis using the F1 score, comparing its predictions against the "human" classification done by hand.
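The split just described (hold out ~30% for testing, divide the rest into 5 cross-validation folds) can be sketched with the standard library alone; the award count of 1,500 matches the training set size mentioned earlier, while the seed and round-robin fold assignment are illustrative assumptions, not the study's actual procedure.

```python
# Sketch of the train/test split plus 5-fold partition of the
# training data used for cross-validation.
import random

def split_awards(award_ids, test_frac=0.3, n_folds=5, seed=42):
    rng = random.Random(seed)           # fixed seed for reproducibility
    ids = list(award_ids)
    rng.shuffle(ids)
    n_test = int(len(ids) * test_frac)  # ~30% held out for testing
    test, train = ids[:n_test], ids[n_test:]
    folds = [train[i::n_folds] for i in range(n_folds)]  # round-robin folds
    return train, test, folds

train, test, folds = split_awards(range(1500))
```

Each of the candidate models (decision tree, SVM, logistic regression, etc.) would then be trained and tuned on the folds and scored once on the held-out test set.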
Methods (cont.)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)
F1 Score = 2 (Precision × Recall) / (Precision + Recall)
Methods (cont.)
Actual Outcome vs. Predicted Outcome:
                       Predicted: In Field   Predicted: Not in Field
Actual: In Field       TP                    FN
Actual: Not in Field   FP                    TN
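Putting the definitions above together, the per-category F1 score can be computed directly from the confusion-matrix counts. The counts in the example are illustrative, not results from the study.

```python
# Per-category evaluation: precision, recall, and F1 from the
# confusion-matrix counts (TP, FP, FN), as defined above.
def f1_score(tp, fp, fn):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: 40 awards correctly assigned to a field, 10 wrongly
# assigned to it, 10 belonging to it but missed.
score = f1_score(tp=40, fp=10, fn=10)  # precision = recall = 0.8, F1 = 0.8
```

Because F1 is the harmonic mean of precision and recall, a model cannot score well on a category by over-predicting it (hurting precision) or under-predicting it (hurting recall).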
• Greater success with Field of Study than Research Purpose
• Best models: Logistic Regression and Support Vector Machine models
• Surprisingly, using the title of the project alone performs better than using the SOW/abstract
• Potentially compromising factors:
o SOW/Abstract not sufficiently clear
o Models cannot understand complex relationships between the words
o Words have different meaning in different contexts
o Insufficient sample size
Results
[Chart: Training and Testing F1 Scores by field label; y-axis: F1 score; series: Training Score, Testing Score]
Results (cont.)
[Chart: Label Distribution, number of awards per field label]
[Chart: Training and Testing F1 Scores by field label; y-axis: F1 score]
Results (cont.)
[Chart: F1 scores vs. sample size]
Conclusions and Future Work
• It is feasible to classify the projects using machine-learning if enough data is available
• Need to collect more data points
• Need to better understand in which areas the tools do not perform well, and why
• Recruit other universities:
o Expand training data
o Determine whether tool is applicable cross-university
Publication: Enhancing and Automating University Reporting of R&D Expenditure Data Using Machine Learning Techniques.
Joshua L. Rosenbloom, Rodolfo H. Torres, Joseph St. Amand, and Adrienne Sadovsky
Merrill Advanced Studies Center Report, No. 121, 2017.
https://merrill.ku.edu/sites/merrill.ku.edu/files/docs/2017_whitepaper/University_Research_Planning_in_the_Data_Era_2017.pdf
Software and documentation: https://github.com/jstamand/KUHERD
More Information
Questions?
Field | Code (* = new field, Spring 2016)
Aerospace / Aeronautical / Astronautical Engineering A1
Bioengineering and Biomedical Engineering A2
Chemical Engineering A3
Civil Engineering A4
Electrical, Electronic, and Communications Engineering A5
Mechanical Engineering A6
Metallurgical & Materials Engineering A7
Other Engineering A8
Industrial and Manufacturing Engineering A9 *
Astronomy and Astrophysics B1
Chemistry B2
Physics B3
Other Physical Sciences B4
Materials Science B5 *
Atmospheric Sciences and Meteorology C1
Geological and Earth Sciences C2
Ocean Sciences and Marine Sciences C3
Other Geosciences, Atmospheric, and Ocean Sciences C4
Mathematics and Statistics D
Computer and Information Sciences E
Agricultural Sciences F1
Biological and Biomedical Sciences F2
Health Sciences F3
Other Life Sciences F4
Natural Resources and Conservation F5 *
Psychology G
Economics H1
Political Science and Government H2
Sociology, Demography, and Population Studies H3
Other Social Sciences H4
Anthropology H5 *
Other Sciences I
Education K
Law L
Humanities M
Visual and Performing Arts N
Business Management and Business Administration O
Communication and Communications Technologies P
Social Work Q
Other Non-S&E Fields R