Large-scale testing: Uses and abuses
Richard P. Phelps
Universidad Finis Terrae, Santiago, Chile
January 7, 2014
1. Three types of large-scale tests
2. Measuring test quality
3. A chronology of mistakes
4. Economists misunderstand testing
5. How SIMCE is affected
Achievement
Aptitude
Non-cognitive
1. Three types of large-scale tests
Achievement tests
Historically, were larger versions of classroom tests
~ 1900 - “scientific” achievement tests developed (Germany & USA)
SOURCE: Phelps, Standardized Testing Primer, 2007
J.M. Rice - systematically analyzed test structures & effects
E.L. Thorndike - developed scoring scales
Achievement tests
Purpose: to measure how much you know and can recall
Developed using: content coverage analysis
How validated: retrospective or concurrent validity (correlation with past measures, such as high school grades)
Requires mastery of content prior to the test.
Fairness assumes that all have the same opportunity to learn the content
Coachable – specific content is known in advance
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
1890s – A. Binet & T. Simon (France)
- Pre-school children with mental disabilities
- achievement test not possible
- developed content-free test of mental abilities (association, attention, memory, motor skills, reasoning)
1917 – Adapted by U.S. Army to select and assign soldiers in World War I
1930s – Harvard University president J. Conant
- wanted new admission test to identify students from lower social classes with the potential to succeed at Harvard
- developed the first Scholastic Aptitude Test (SAT)
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Purpose: predict how much can be learned
Developed using: skills/job analysis
How validated: predictive validity, correlation with future activity (e.g., university or job evaluations)
Content independent. Measures:
- what student does with content provided
- how student applies skills & abilities developed over a lifetime
Not easily coachable – the content is either:
- not known in advance,
- basic, broad, commonly known by all, curriculum-free;
- less dependent on the quality of schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Aptitude tests
Aptitude tests can identify:
- Students bored in school who study what interests them on their own
- Students not well adapted to high school, but well adapted to university
- Students of high ability stuck in poor schools
SOURCE: Phelps, Standardized Testing Primer, 2007
Comparing Achievement & Aptitude tests

              Achievement        Aptitude
Measure       past learning      potential
Development   content analysis   job/skills analysis
Validation    retrospective      predictive
Content       dependent          independent
Coachable?    very much          not much
Non-cognitive tests
More recently developed – measure values, attitudes, preferences
Types:
- integrity tests
- career exploration
- matchmaking
- employment “fit”
Non-cognitive tests
Purpose: to identify “fit” with others or a situation
Developed using: surveys, personal interviews
How validated? success rate in future activities
Content is personal, not learned
“Faking” can be an issue (e.g., “honesty” tests)
Comparing Achievement, Aptitude, & Non-Cognitive Tests

              Achievement        Aptitude              Non-Cognitive
Measure       past learning      potential             attitudes, values, preferences
Development   content analysis   job/skills analysis   surveys
Validation    retrospective      predictive            predictive
Content       dependent          independent           independent
Coachable?    very much          very little           can be faked
2. Measuring test quality
Three measures are important:
1. Predictive validity
2. Content coverage
3. Sub-group differences
Test reports can be “data dumps”
Predictive validity (values from -1.0 to +1.0)
…measures how well higher scores on an admission test match better outcomes at university (e.g., grades, completion)
A test with low predictive validity provides little information.
Source: NIST, Engineering Statistics Handbook
A positive correlation between two measures
Source: NIST, Engineering Statistics Handbook
A negative correlation between two measures
Source: NIST, Engineering Statistics Handbook
No correlation between two measures
How does one measure predictive capacity?
Correlation coefficient: a scale running from -1 (perfect negative correlation) through 0 (no correlation) to +1 (perfect positive correlation)
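The correlation coefficient on the scale above can be computed directly from paired scores. The following is a minimal sketch, assuming the Pearson coefficient is the one intended; the function name `pearson_r` and the sample data are hypothetical and are not taken from any of the reports cited in these slides.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined when a variable is constant")
    return sxy / math.sqrt(sxx * syy)

# Hypothetical data: admission test scores and first-year university grades
test_scores = [450, 520, 580, 610, 660, 700, 750]
first_year_grades = [4.1, 4.6, 4.4, 5.0, 5.3, 5.1, 5.9]

r = pearson_r(test_scores, first_year_grades)
print(f"predictive validity (r) = {r:.2f}")
```

A value near +1 means higher test scores reliably go with better later outcomes; a value near 0 means the test says little about them.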
[Figure: Predictive validities: SAT and PSU. Bar chart; vertical axis 0 to 0.6; series: SAT, PSU 2010.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
[Figure: Predictive validities: SAT and PSU (faculty: Administracion). Bar chart; vertical axis 0 to 0.6; bars: Language, Mathematics, SAT Writing, PSU Social Science; series: SAT, PSU.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
[Figure: Predictive validities: SAT and PSU (faculty: Arquitectura). Bar chart; vertical axis 0 to 0.6; bars: Language, Mathematics, SAT Writing, PSU Social Science; series: SAT, PSU.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
[Figure: Predictive validities: SAT and PSU (faculty: Educacion). Bar chart; vertical axis 0 to 0.6; bars: Language, Mathematics, SAT Writing, PSU Social Science; series: SAT, PSU.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013
[Figure: Predictive validities: ACT and PSU. Bar chart; vertical axis 0 to 0.6; bars: Language, Mathematics, Social Science, Science; series: ACT, PSU.]
SOURCE: ACT, Research Summary Services, 1997-1998; Pearson, Final Report Evaluation of the Chile PSU, January 2013
[Figure: Predictive validities of the PSU (CTA v Pearson estimates). Bar chart; vertical axis 0 to 0.6; bars: Language, Mathematics; series: CTA, Pearson.]
SOURCE: Pearson, Final Report Evaluation of the Chile PSU, January 2013; CTA
[Figure: Incremental predictive validities (engineering), controlling for NEM. Bar chart; vertical axis 0 to 35; groups: Language & Math, and Language & Math + subject test, each shown for U. Chile and PUC; series: PAA, PSU.]
SOURCE: S.A. Prado, Estudio de Validez Predictiva de la PSU y Comparacion con el Sistema PAA, Universidad de Chile
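Incremental predictive validity, as in the comparison above, asks how much an admission test adds to prediction once high school grades (NEM) are already taken into account. The sketch below shows one common way to express this, as the gain in explained variance (R squared) from adding the test to a grades-only prediction. It is an assumption that this matches the cited study's method, which the slides do not describe; the function names and the three correlations are hypothetical.

```python
def r_squared_two_predictors(r_y1, r_y2, r_12):
    """R^2 of an outcome regressed on two predictors, from pairwise correlations.

    r_y1: correlation of the outcome with predictor 1 (e.g., NEM)
    r_y2: correlation of the outcome with predictor 2 (e.g., the admission test)
    r_12: correlation between the two predictors
    """
    if abs(r_12) >= 1:
        raise ValueError("predictors must not be perfectly correlated")
    return (r_y1 ** 2 + r_y2 ** 2 - 2 * r_y1 * r_y2 * r_12) / (1 - r_12 ** 2)

def incremental_validity(r_y_grades, r_y_test, r_grades_test):
    """Gain in R^2 from adding the test to a prediction based on grades alone."""
    r2_both = r_squared_two_predictors(r_y_grades, r_y_test, r_grades_test)
    r2_grades_only = r_y_grades ** 2
    return r2_both - r2_grades_only

# Hypothetical correlations, chosen only for illustration
gain = incremental_validity(r_y_grades=0.40, r_y_test=0.35, r_grades_test=0.50)
print(f"additional variance explained by the test: {gain:.1%}")
```

A test that largely duplicates the information in grades (a high correlation between the two predictors) adds little, even if its own correlation with the outcome looks respectable.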
Content coverage (values from 0% to 100%)
…how much of the content domain of a test has been taught in the schools.
It is not fair to expect students to master content to which they have not been exposed, or to compare students who have been exposed to students who have not.
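Content coverage as defined above is a simple proportion: the share of the test's content domain that a school has actually taught. A minimal sketch follows, assuming topics can be listed and matched by name; the function name `content_coverage` and the topic lists are hypothetical.

```python
def content_coverage(test_topics, taught_topics):
    """Percentage (0 to 100) of the test's topics that were taught."""
    domain = set(test_topics)
    if not domain:
        raise ValueError("the test content domain is empty")
    covered = domain & set(taught_topics)
    return 100.0 * len(covered) / len(domain)

# Hypothetical example: a test covering four topics, a school that taught three of them
test_topics = ["algebra", "geometry", "probability", "functions"]
taught_topics = ["algebra", "geometry", "functions", "arithmetic review"]

coverage = content_coverage(test_topics, taught_topics)
print(f"content coverage: {coverage:.0f}%")
```

Topics taught but not tested (here, "arithmetic review") do not raise the figure; only the tested domain counts.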
[Figures: Percentage curricular coverage in Chilean high schools, by type of school: 2012. Bar charts; vertical axis 0 to 100; bars: Municipal, Subvencionado, Pagado. Panels: Mathematics, Level 1; Language & Communication, Level 2; Mathematics, Level 3; Language & Communication, Level 4.]
[Figures: Percentage curricular coverage in Chilean high schools, by type of curriculum: 2012. Bar charts; vertical axis 0 to 100; bars: Humanista-Científica, Técnico-Profesional, Polivalente. Panels: Mathematics, Level 4; Language & Communication, Level 4.]
[Figure: Percentage of Chilean high schools with full curricular coverage, by subject area: 2012, Levels 1–4. Bars: Mathematics, Language & Communication; scale 0% to 100%; series: Cover 100%, Do NOT cover 100%.]
SOURCE: Centro de Estudios Mineduc, Cobertura Curricular en Ensenanza Media Lenguaje y Comunicacion – Matematica, Septiembre 2012
Subgroup differences
Differences in test scores among subgroups (e.g., gender, ethnic, school type) should be due only to differences in the attribute measured by the test and not to systematic biases in the test.
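One way to examine a subgroup difference is to compare the raw gap in mean scores with a gap adjusted for a background measure. The sketch below uses a standard covariance adjustment with a pooled within-group slope, which is equivalent to the group coefficient in a linear regression of score on group and covariate. It is an assumption that this resembles the adjustment behind any figures cited in these slides, whose method is not described here; the function names and data are hypothetical.

```python
from statistics import mean

def raw_gap(scores_a, scores_b):
    """Difference in mean score between group A and group B."""
    return mean(scores_a) - mean(scores_b)

def adjusted_gap(scores_a, covariate_a, scores_b, covariate_b):
    """Gap between groups after adjusting for one covariate.

    Uses the pooled within-group slope of score on covariate, then removes
    the part of the raw gap explained by the groups' covariate difference.
    """
    sxy = 0.0
    sxx = 0.0
    for scores, cov in ((scores_a, covariate_a), (scores_b, covariate_b)):
        mean_score, mean_cov = mean(scores), mean(cov)
        sxy += sum((x - mean_cov) * (y - mean_score) for x, y in zip(cov, scores))
        sxx += sum((x - mean_cov) ** 2 for x in cov)
    if sxx == 0:
        raise ValueError("covariate does not vary within groups")
    slope = sxy / sxx
    return raw_gap(scores_a, scores_b) - slope * (mean(covariate_a) - mean(covariate_b))

# Hypothetical data: test scores and a background index for two school types
scores_a, background_a = [620, 640, 660], [2, 3, 4]
scores_b, background_b = [540, 560, 580], [1, 2, 3]

gap_raw = raw_gap(scores_a, scores_b)
gap_adj = adjusted_gap(scores_a, background_a, scores_b, background_b)
print(f"raw gap = {gap_raw:.0f}, adjusted gap = {gap_adj:.0f}")
```

A gap that stays large after adjustment, or that grows over time, is a signal to look for sources of difference other than the attribute the test is meant to measure.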
[Figure: Growing gaps in PSU Mathematics raw & adjusted scores, by type of curriculum: 2002–2010. Line chart titled "Brechas PSU Matemáticas para toda la muestra" (PSU Mathematics gaps for the whole sample); horizontal axis: years 2002 to 2010; vertical axis: gaps ("Brechas"), 0 to 200. Series: PP Muni-TP unadjusted gap; PP Muni-TP adjusted gap; PP Muni-CH unadjusted gap; PP Muni-CH adjusted gap. Data labels visible on the chart: 111, 170, 46, 85, 95, 124, 43, 51.]
SOURCE: Koljatic, Silva, & Phelps, Consequential Tests and Conflicts of Interest: The Case of Chile’s PSU, forthcoming, 2014
[Figure: Growing gaps in PSU Language & Communication raw & adjusted scores, by type of curriculum: 2002–2010. Line chart titled "Brechas PSU Lenguaje para toda la muestra" (PSU Language gaps for the whole sample); horizontal axis: years 2002 to 2010; vertical axis: gaps ("Brechas"), 0 to 200. Series: PP Muni-TP unadjusted gap; PP Muni-TP adjusted gap; PP Muni-CH unadjusted gap; PP Muni-CH adjusted gap. Data labels visible on the chart: 106, 161, 44, 79, 86, 113, 36, 44.]
SOURCE: Koljatic, Silva, & Phelps, Consequential Tests and Conflicts of Interest: The Case of Chile’s PSU, forthcoming, 2014
3. A chronology of mistakes
2000, initial proposal, SIES/PSU project
This proposal attempts a redesign of the tests currently used to select students for higher education in Chile. It is expected that [this new test will] have a positive impact in the efficiency of the selection process, improving the psychometric properties of the measuring instruments, and establishing a better articulation between the selection system and the secondary education curriculum.
SOURCE: Proyecto FONDEF, Reformulacion de las Pruebas de Seleccion a la Educacion Superior
2001 (World Bank & MINEDUC)
…the Academic Aptitude Test for entry to the university system is under revision, together with the universities belonging to the Council of Rectors. This instrument of entry selection, needs also to be aligned with the new curriculum and may become an exit exam from the secondary education system.
SOURCE: World Bank, Implementation Completion Report on a Loan in the Amount of $35 million to the Republic of Chile for Secondary Education, 2001
2005 (World Bank)
…The new law adopted in May 2005 (Bulletin 3223-04) established a system of student loans available to all students achieving a threshold score in the University Admission Exam (PSU). …the new system does not impede students unable to provide collateral from financing their studies. The new system promises to improve equity further by increasing options for talented students from non-affluent families to access higher education.
SOURCE: IMPLEMENTATION COMPLETION REPORT (TF-25378 SCL-44040 PPFB-P3360) ON A LOAN IN THE AMOUNT OF US$145.45 MILLION TO THE REPUBLIC OF CHILE FOR THE HIGHER EDUCATION IMPROVEMENT PROJECT, December 2005
2009 (OECD & World Bank)
[One option for revising admission testing] would be for Chile to move away from a university entry test towards a national school leaving test or set of tests – ideally, not simple multiple choice tests but longer exams, which test both knowledge and candidates’ ability to think and to apply knowledge. Such school leaving exams or tests could also remove the need for a separate school leaving certificate, by having two pass levels, the lower level equivalent to the NEM and the higher level setting the minimum standard for entry to an academic or professional degree course.
SOURCE: OECD & World Bank, Tertiary Education in Chile, 2009
2009 (OECD & World Bank)
The second option [to revising admissions testing] would be to reform the PSU by incorporating elements other countries consider useful and important in identifying the students most likely to benefit from HE. These elements would include extended essays and questions designed to test reasoning ability and learning potential. They could also include personal statements which could cover non-curricular experience, personal motivation and interest in the programme. Again, there should be a variant for vocational secondary school students.
SOURCE: OECD & World Bank, Tertiary Education in Chile, 2009
2010 (OECD)
Over time the government should consider replacing the university entry exam with a national school leaving exam as the prime criterion for entry into tertiary education institutions. This could establish a closer link between test results and the school that is responsible for them, making it easier to reach the goal that has been pursued with the introduction of the PSU.
SOURCE: N. Brandt, CHILE: CLIMBING ON GIANTS' SHOULDERS: BETTER SCHOOLS FOR ALL CHILEAN CHILDREN; OECD ECONOMICS DEPARTMENT WORKING PAPERS No. 784
2010 (OECD)
There is evidence that central curriculum based exit exams are strongly and positively related to student academic performance (Wößmann, 2005; Bishop, 2006). To allow students to show in more detail their knowledge and their ability to apply it, the school exit exam could be a bit more in-depth than the multiple-choice PSU, including verbal and nonverbal reasoning.
SOURCE: N. Brandt, CHILE: CLIMBING ON GIANTS' SHOULDERS: BETTER SCHOOLS FOR ALL CHILEAN CHILDREN; OECD ECONOMICS DEPARTMENT WORKING PAPERS No. 784
4. Economists misunderstand testing
Testing & Measurement PhD program (University of Massachusetts, USA, 2013-2014)
- EDUC 501 Classroom Assessment
- EDUC 553 Construction, Validation, and Uses of Criterion-Referenced Tests
- EDUC 555 Introduction to Statistics & Computer Analysis I
- EDUC 632 Principles of Educational & Psychological Testing
- EDUC 637 Non-Parametric Statistics Analysis
- EDUC 656 Introduction to Statistical & Computer Analysis II
- EDUC 661 Educational Research Methods I
- EDUC 727 Scale and Instrument Development
- EDUC 731 Structural Equation Modeling
- EDUC 735 Advanced Theory & Practice of Testing I
- EDUC 736 Advanced Theory & Practice of Testing II
- EDUC 771 Application of Applied Multivariate Statistics I
- EDUC 772 Application of Applied Multivariate Statistics II
- EDUC 821 Advanced Validity Theory & Test Validation
How economists misunderstand testing - 1
Increasing an admission test’s correlation with high school work can decrease its correlation with university work
Incentives aren’t all that matter in improving efficiency; …also important: more and better information, better classification & allocation
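The first point above, that raising a test's correlation with high school work can lower its correlation with university work, can be illustrated with a toy model. The sketch assumes two independent, unit-variance components (curriculum knowledge and general ability), a test that mixes them with weight `w` on curriculum knowledge, high school work that reflects curriculum knowledge only, and university work that leans more on general ability. All of these are illustrative assumptions, not estimates from any data.

```python
import math

def correlations(w, univ_weight_curriculum=0.3):
    """Correlations of a test with high school and university work in a toy model.

    test       = w * C + (1 - w) * G
    highschool = C
    university = a * C + (1 - a) * G
    where C (curriculum knowledge) and G (general ability) are independent
    with unit variance, and a = univ_weight_curriculum.
    """
    a = univ_weight_curriculum
    sd_test = math.sqrt(w ** 2 + (1 - w) ** 2)
    sd_univ = math.sqrt(a ** 2 + (1 - a) ** 2)
    r_highschool = w / sd_test
    r_university = (w * a + (1 - w) * (1 - a)) / (sd_test * sd_univ)
    return r_highschool, r_university

# Shift the test's weight toward curriculum content and compare
r_hs_low, r_univ_low = correlations(w=0.3)
r_hs_high, r_univ_high = correlations(w=0.9)
print(f"w=0.3: high school r = {r_hs_low:.2f}, university r = {r_univ_low:.2f}")
print(f"w=0.9: high school r = {r_hs_high:.2f}, university r = {r_univ_high:.2f}")
```

In this model the trade-off appears whenever university work depends on something the high school curriculum does not capture; if both depended on the same mix, the two correlations would move together.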
How economists misunderstand testing - 2
Incentives generally work best when applied to the actor responsible for the target behavior; …currently, students bear the consequences when schools do not teach the curriculum tested on the PSU
How economists misunderstand testing - 3
Many useful and successful tests serve multiple purposes. But, some purposes are compatible and some are not. Responsible authorities have argued that the PSU will:
1. Measure the implementation of a new curriculum;
2. Fairly measure mastery of two very different curricula;
3. Incentivize high schools to implement the new curriculum;
4. Incentivize high school students to study more;
5. Predict success in university generally;
6. Predict success across very different types of university programs;
7. Reduce socio-economic disparities.
How economists misunderstand testing - 4
The PSU: A test at war with itself
Expected to do too many things…
…it does none of them well,
…and makes some of them worse.
(a science-humanities exit exam, sold originally as a science-humanities curriculum coverage survey, that is used as an entry exam for all students)
You cannot get there from here
A non-cognitive test, used as a high-stakes admission test, will exacerbate the problems. It is easily faked. Wealthier students will pay for coaching and the scores will be invalid.
The PSU cannot be “fixed”; it is fundamentally flawed.
The old system – PAA + PCEs – was a sensible system.
Option for Technical-Professional Graduates:
As is done in Germany, offer short course on scientific-humanistic 11th & 12th grade curricula with exam at the end for technical-professional graduates who decide after graduation that they wish to change careers.
Create separate test for technical-professionals to enter university.
ETS & Pearson recommendations:
Lessen the content in PSU to the common level – 10th grade – and to that which is genuinely necessary for a good prediction.
Other options to consider
How the PSU Runs:
• CRUCh: "owners" of the PSU
• Comité Técnico Asesor (CTA) para la PSU: designated by CRUCh as supervisors of DEMRE and official evaluators of the PSU
• DEMRE: responsible for developing test items, test assembly, test administration, test scoring, application system for CRUCh and associated universities, etc.
Ministry of Education: has funded the system since 2007 (fee waivers)
[Organization chart: CRUCH; Comité Técnico Asesor del CRUCH para la PSU (CTA); DEMRE, U. de Chile]
Source: adapted from the Pearson Report (2013)
5. How SIMCE is affected
What does this have to do with SIMCE?
Most people do not see the differences among tests. In public perception, one bad test makes all tests look bad.
SIMCE’s largest challenge may be the loss of public goodwill towards all testing.
“If a thing exists, it exists in some amount. If it exists in some amount, then it is capable of being measured.”
-- René Descartes, Principles of Philosophy, 1644