Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests
DESCRIPTION
The use of web accessibility evaluation tools is a widespread practice: they are heavily employed because they reduce the burden of identifying accessibility barriers. However, overreliance on automated tests often leads to setting aside further testing that entails expert evaluation and user testing. In this paper we empirically show the capabilities of current automated evaluation tools. To do so, we investigate the effectiveness of six state-of-the-art tools by analysing their coverage, completeness and correctness with regard to WCAG 2.0 conformance. We corroborate that relying on automated tests alone has negative effects and can have undesirable consequences. Coverage is very narrow: at most, 50% of the success criteria are covered. Similarly, completeness ranges between 14% and 38%; moreover, some of the tools that exhibit higher completeness produce lower correctness scores (66-71%), because catching as many violations as possible leads to an increase in false positives. Relying on automated tests alone therefore entails that one in two success criteria will not even be analysed, and that among those analysed only about four in ten violations will be caught, at the further risk of generating false positives.
TRANSCRIPT
Benchmarking Web Accessibility Evaluation Tools: Measuring the Harm of Sole Reliance on Automated Tests
10th International Cross-Disciplinary Conference on Web Accessibility (W4A 2013)
Markel Vigo, University of Manchester (UK); Justin Brown, Edith Cowan University (Australia); Vivienne Conway, Edith Cowan University (Australia)
http://dx.doi.org/10.6084/m9.figshare.701216
Problem & Fact
W4A 2013, 13 May 2013
WWW is not accessible
Evidence
Webmasters are familiar with accessibility guidelines
Lazar et al., 2004. Improving web accessibility: a study of webmaster perceptions. Computers in Human Behavior 20(2), 269–288.
Hypothesis I
Assuming guidelines do a good job...
H1: Awareness of accessibility guidelines is not that widespread.
Evidence II
Webmasters put compliance logos on non-compliant websites
Gilbertson and Machin, 2012. Guidelines, icons and marketable skills: an accessibility evaluation of 100 web development company homepages. W4A 2012.
Hypothesis II
Assuming webmasters are not trying to cheat...
H2: There is a lack of awareness of the negative effects of overreliance on automated tools.
Expanding on H2: Why we rely on automated tests
• It's easy
• In some scenarios it seems like the only option: web observatories, real-time...
• We don't know how harmful they can be
Expanding on H2: Knowing the limitations of tools
• If we are able to measure these limitations we can raise awareness
• Inform developers and researchers
• We ran a study with 6 tools
• Computed coverage, completeness and correctness with regard to WCAG 2.0
Method: Computed Metrics
• Coverage: whether a given Success Criterion (SC) is reported at least once
• Completeness: the proportion of violations in the ground truth that a tool actually reports
• Correctness: the proportion of violations reported by a tool that are actual violations
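Under these definitions, completeness and correctness are recall- and precision-style set computations. A minimal sketch, assuming each tool report and the ground truth are represented as sets of violation identifiers (all names and data below are hypothetical):

```python
def completeness(reported, ground_truth):
    """Fraction of true violations the tool catches (recall-like)."""
    return len(reported & ground_truth) / len(ground_truth)

def correctness(reported, ground_truth):
    """Fraction of reported violations that are real (precision-like)."""
    return len(reported & ground_truth) / len(reported)

# Toy data: 10 true violations; a tool reports 6 issues, 4 of them real.
gt = {f"v{i}" for i in range(10)}
report = {"v0", "v1", "v2", "v3", "x1", "x2"}
print(completeness(report, gt))  # 0.4
print(correctness(report, gt))   # 0.666...
```

Coverage, in contrast, is computed over Success Criteria rather than individual violations: an SC counts as covered if the tool flags it at least once.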
Method: Stimuli
Vision Australia (www.visionaustralia.org.au)
• Non-profit
• Non-government
• Accessibility resource
Prime Minister (www.pm.gov.au)
• Federal Government
• Should abide by the Transition Strategy
Transperth (www.transperth.wa.gov.au)
• Government affiliated
• Used by people with disabilities
Method: Obtaining the "Ground Truth"
Ad-hoc sampling → Manual evaluation → Agreement → Ground truth
Method: Computing Metrics
For every page in the sample: each tool T1–T6 evaluates the page and produces a report R1–R6; each report is then compared with the ground truth (GT) to compute the metrics M1–M6.
Accessibility of Stimuli
Vision Australia (www.visionaustralia.org.au)
Prime Minister (www.pm.gov.au)
Transperth (www.transperth.wa.gov.au)
Results: Coverage
• 650 WCAG Success Criteria violations (A and AA)
• 23-50% of SC are covered by automated tests
• Coverage varies across guidelines and tools
Results: Completeness per tool
• Completeness ranges between 14% and 38%
• Variable across tools and principles
Results: Completeness per type of SC
• How conformance levels influence completeness
• Wilcoxon Signed Rank test: W=21, p<0.05
• Completeness levels are higher for A-level SC
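The test above compares paired per-tool completeness scores on A-level vs AA-level criteria. A stdlib-only sketch of the signed-rank statistic, using invented scores (the study's actual data is in the linked dataset):

```python
# Hypothetical per-tool completeness on A-level vs AA-level SC (6 tools).
completeness_a  = [0.38, 0.30, 0.25, 0.33, 0.20, 0.28]
completeness_aa = [0.22, 0.18, 0.16, 0.26, 0.10, 0.15]

diffs = [a - b for a, b in zip(completeness_a, completeness_aa)]
# Rank the absolute differences (rank 1 = smallest; no ties in this toy data).
order = sorted(range(len(diffs)), key=lambda i: abs(diffs[i]))
w_plus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] > 0)
w_minus = sum(rank + 1 for rank, i in enumerate(order) if diffs[i] < 0)
print(w_plus, w_minus)  # 21 0: every tool does better on A-level SC
```

With six tools and every difference in the same direction, the rank sum reaches its maximum of 1+2+...+6 = 21, consistent with the W=21 on the slide; the two-sided exact p-value for that extreme case is 2/2⁶ ≈ 0.031, hence p < 0.05.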
Results: Completeness vs. accessibility
• How accessibility levels influence completeness
• ANOVA: F(2,10)=19.82, p<0.001
• The less accessible a page is, the higher the completeness
Results: Tool Similarity on Completeness
• Cronbach's α = 0.96
• Multidimensional Scaling (MDS)
• Tools behave similarly
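Cronbach's α here measures how consistently the tools score the same pages; α = 0.96 indicates the tools' completeness scores move largely in lockstep. A stdlib sketch of the α computation, with invented, deliberately correlated scores:

```python
from statistics import pvariance

def cronbach_alpha(items):
    """items: one score list per tool, all of the same length (one entry per page)."""
    k = len(items)
    item_var_sum = sum(pvariance(scores) for scores in items)
    totals = [sum(page_scores) for page_scores in zip(*items)]
    return k / (k - 1) * (1 - item_var_sum / pvariance(totals))

# Three hypothetical tools scoring the same four pages, nearly in lockstep.
base = [0.10, 0.20, 0.30, 0.40]
tools = [base,
         [s + 0.05 for s in base],
         [s * 1.1 for s in base]]
print(cronbach_alpha(tools))  # close to 1.0
```

Using the population variance throughout is fine because the sample-variance correction cancels in the ratio.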
Results: Correctness
• Tools with lower completeness scores exhibit higher correctness (93-96%)
• Tools that obtain higher completeness yield lower correctness (66-71%)
• The tools with the highest completeness are thus also the most incorrect ones
Implications: Coverage
• We corroborate that 50% is the upper limit for automating guideline checks
• Natural Language Processing?
  – Language: 3.1.2 Language of Parts
  – Domain: 3.3.4 Error Prevention
Implications: Completeness I
• Automated tests do a better job...
  ...on non-accessible sites
  ...on A-level success criteria
• Automated tests aim at catching stereotypical errors
Implications: Completeness II
• Strengths of tools can be identified across WCAG principles and SC
• A method to inform decision making
• Maximising completeness in our sample of pages:
  – Using all tools: 55% (+17 percentage points)
  – Using non-commercial tools only: 52%
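The combined 55% figure presumably arises from taking the union of several tools' reports, since a union can only add caught violations on top of the best single tool. A toy sketch (all numbers invented):

```python
def completeness(reported, ground_truth):
    """Fraction of true violations caught."""
    return len(reported & ground_truth) / len(ground_truth)

gt = set(range(100))              # 100 true violations
reports = [set(range(0, 38)),     # best single tool: catches 38%
           set(range(20, 50)),    # overlaps, but adds 12 new catches
           set(range(45, 60))]    # adds 10 more
combined = set().union(*reports)
print(max(completeness(r, gt) for r in reports))  # 0.38
print(completeness(combined, gt))                 # 0.6
```

The correctness trade-off from the results applies equally: the union also accumulates every tool's false positives, so combined correctness can only drop.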
Conclusions
• Coverage: 23-50%
• Completeness: 14-38%
• Higher completeness leads to lower correctness
Follow up
Contact: @markelvigo | [email protected]
Presentation DOI: http://dx.doi.org/10.6084/m9.figshare.701216
Datasets: http://www.markelvigo.info/ds/bench12/index.html