
  • Reliability and validity of findings in ergonomics research

H. Kanis

School of Industrial Design Engineering, Delft University of Technology, Landbergstraat 15, 2628 CE Delft, The Netherlands

    (Received 29 February 2012; final version received 30 April 2013)

Evaluation of findings in ergonomics/human factors (E/HF) research suffers from misconceived assessments in terms of reliability and validity. Evaluation of E/HF studies published after 2000 confirms these observations. With an eye on these misconceived assessments, the present paper focuses on the consequences of various types of human involvement in (co-)shaping the phenomena to be evaluated. Issues addressed include the questionability of the inter-individual equivalence of findings, the defectiveness of the reliability coefficient as dispersion measure and the elusiveness of intended/presumed measurement as a validity criterion. These deficiencies are at odds with the zeal of E/HF authors to flag their findings as reliable and valid. In particular, positive evaluations of findings may show off as rhetoric. An evaluation procedure of consecutive constituents in a flowchart is proposed as an aid for appropriate evaluations. Various conditions are discussed that may encourage the adoption of correct procedures.

Keywords: reliability; validity; human agency; investigative syntaxes; measurement by fiat

    1. Introduction

Researchers have to account for the evaluation of their findings. This evaluation is, in general, based on two criteria. The first of these is the extent of the dispersion in measurement repetition, i.e. random variation in measurement results that tends to be present whatever the accuracy of an attempted repetition. A current term for addressing this criterion in the field of ergonomics/human factors (E/HF) is reliability. The second criterion concerns the occurrence of systematic differences emerging in a comparison of measurement results with reference values. These systematic differences turn into deviation when a reference is adopted as unquestioned. The prevalent notions for addressing the second criterion in E/HF refer to the term validity.

I have discussed both of these criteria in two reviews of E/HF research papers (Kanis 1997, on reliability; 2000, on validity). Both gave rise to several criticisms, including the procedures applied, the use by authors of a plethora of terms and the adoption of associative criteria for the specification of reliability or the establishment of validity.

In order to find out whether improvements occurred after publication of the criticisms, I conducted a new review of recent E/HF research papers. This study yielded similar results to the two initial reviews. To be precise, each of the findings of the first two reviews is reaffirmed in this new inspection. Occasionally, the review of more recent research papers has resulted in a new observation; an example is the refuge that authors take in the authority of experts in validations.


The déjà-vu character of most findings of the new review leads to the conclusion that a descriptive analysis of the results of the new inspection can be dispensed with, since this would largely be repetitive. This conclusion raises the obvious question of what may clarify the ostensible lack of impact of criticisms on the evaluation of findings in E/HF research. The answering of this question hinges on various considerations. For instance, it may be wondered whether a discursive paper format would be the best medium to disseminate my argument. In addition, this argument might be presented poorly in both reviews. Moreover, the argument may be a misapprehension of evaluation procedures that always served well, without publicly raising bothersome questions from reviewers or readers. As a counterweight to these speculations, the fact is that both previous reviews were accepted for publication. This acceptance at least lends some credibility to the criticisms put forward.

In order to conclude my evaluation of findings in E/HF research, I have opted for the positioning of my criticisms as the starting point of a peer-reviewed commentary study. Therefore, I have compiled the information from both previous reviews and from the new study in the present paper. The objective of this target paper is to outline what is right/what goes wrong in the evaluation of findings in E/HF research, in order to point out what should or may be done to establish appropriate evaluations. The discussion starts with an inventory of the evaluation procedures in the technical sciences and in the quantitative social sciences; these research disciplines are seen as the main constituents of E/HF research practice (cf. Kanis 2000). Similarities and differences between evaluation procedures are identified in order to shed light on their applicability in the broad spectrum of findings in E/HF research. Where appropriate, alternatives are put forward.

Constituents of evaluation procedures are arranged in a flowchart. The practicability of this flowchart is discussed on the basis of its application to the findings in the E/HF research papers sampled for the new review. These research papers (53) are summarised in the Appendix, including brief descriptions of the relevant parts. The descriptions are based on the focal points of both previous reviews (Kanis 1997, 2000), involving the specification of dispersion measures (which concerns reliability) and the identification of deviation (i.e. validity, on the basis of so-called investigative syntaxes), as well as the kind of criteria applied (e.g. correlation coefficients). Observations resulting from the application of the flowchart are compared with the authors' assessments of current practice in their evaluation of findings. To conclude, I discuss what can be done to encourage the adoption of the proposed procedures.

    2. Terms, definitions, procedures

Of the two constituent disciplines of E/HF research, the technical sciences evaluation of findings described in ISO 5725 (1994) is charted first, in view of the straightforward introduction of some basic principles and operations in this field. As an example, I will discuss the application of the technical sciences evaluation in a study of anthropometrics. After this, the quantitative social sciences evaluation of findings is addressed. Ingredients of this approach, which is by far the most frequently applied in E/HF, are valued in comparison with the elements constituting the technical sciences evaluation. Shortcomings are specified, and alternatives proposed.

    2.1 Evaluation of research findings in the technical sciences

Table 1 charts the terminology and definitions in ISO 5725 (1994). Of the terms in Table 1, repeatability, reproducibility and precision occasionally feature in E/HF research papers.


In Table 1, the evaluation terms are grouped under the two basic procedures outlined in the introduction: measurement repetition and comparison of findings with an independent criterion. In ISO 5725, measurement repetition involves repeated measurement of test items that are claimed to be identical (see note c in Table 1), i.e. with the implication that 'exactly the same thing is being measured' (1994, 1). Examples of test items, or specimens, mentioned in the ISO standard are the purity of materials (liquids, powders, solid objects; 1994, 6, 10), and also transitory phenomena (e.g. the water flow in a river; 1994, 11). As indicated in the Introduction, measurement repetition involving these test items usually gives rise to a dispersion of results, i.e. variation with a random character.

The second evaluation procedure, comparison of findings with an independent criterion, may give rise to the identification of non-random variation in measurement results, that is, persistent differences in terms of systematic deviation from a reference value that is adopted as unquestioned.

The inventory in Table 1 shows that evaluation of findings is addressed in terms of closeness of agreement in both procedures. With respect to dispersion, the measure of closeness of agreement (or of precision) is usually expressed in terms of imprecision (see note b in Table 1). Thus, this measure is operationalised on the basis of the extent of negative evidence. This also applies to deviation. Then, the measure of closeness of agreement (or of trueness) is 'usually/normally' expressed in terms of bias (see note d in Table 1). This type of formulation resembles the expressions 'absence of dispersion' for reliability and 'absence of deviation' for validity as adhered to in the previous reviews (Kanis 1997, 2000).

    2.1.1 The technical sciences type of evaluation in E/HF research

An example of the application of the technical sciences type of evaluation is a study by Meunier and Yin (2000, see Appendix). These authors tested the results of measuring anthropometrics according to an alternative method (so-called 2D image-based) against the results of the traditional method as the reference.

Measurement repetition consists of measuring the anthropometrics (stature; sleeve length; circumference of neck, chest, waist, hip) of one participant 10 times.

Table 1. Evaluation of research findings in the technical sciences.(a)

Evaluation of findings on the basis of:

Measurement repetition
  Terms: Repeatability, reproducibility.
  Definition: Precision,(b) as the closeness of agreement between test results under repeatability/reproducibility conditions.(c)

Comparison with an independent criterion
  Term: Trueness.(d)
  Definition: Closeness of agreement between the average of a large series of test results and an accepted reference value.

(a) Accuracy (Trueness and Precision) of measurement methods and results (ISO 5725 (1994)).
(b) The measure of precision is usually expressed in terms of imprecision and computed as a standard deviation (ISO 5725 (1994), 3).
(c) Repeatability conditions: repetition of same method, identical test items, same lab(s), same operator, same equipment, short time interval. Reproducibility conditions: repetition of same method, identical test items, different lab(s), different operator, different equipment, no specified time interval.
(d) The measure of trueness is usually/normally expressed in terms of bias (ISO 5725 (1994), 2).


Precision is specified by the range of the findings (n = 10) and by the standard deviation of their distribution. It is noticeable that these dispersion measures, apparently, are presumed to hold for all participants in the Meunier and Yin study. The authors could have checked this assumption by demonstrating that intra-individual variation in measurement repetition is distributed evenly, or homoscedastically, throughout participants. The issue of homo-/heteroscedasticity is discussed in more detail in Section 2.2.1.

Comparison with an independent criterion for each anthropometric in the Meunier and Yin study consists of testing the mean of the differences between the results of both methods across the participants for its statistically non-significant difference from zero (referred to by t-tests; 2000, 448). This procedure accounts for the occurrence of a sample of participants, with each participant having her/his own reference value, whereas in the Table 1 description it suffices to adopt one reference value for the comparison with the average of the results of a large series of measurement repetitions.
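To make this comparison procedure concrete, the following minimal sketch (with fabricated numbers, not data from Meunier and Yin) tests whether the mean of the per-participant differences between an alternative and a reference method deviates from zero.

```python
# Minimal sketch: testing the mean of per-participant differences against zero.
# All numbers are fabricated for illustration only.
import numpy as np
from scipy import stats

# Hypothetical data: one anthropometric (e.g. stature, in mm) measured on the
# same participants by a reference method and by an alternative method.
reference   = np.array([1712, 1688, 1754, 1630, 1701, 1665, 1720, 1698])
alternative = np.array([1714, 1685, 1757, 1633, 1699, 1668, 1722, 1701])

differences = alternative - reference
t_stat, p_value = stats.ttest_1samp(differences, popmean=0.0)

print(f"mean difference = {differences.mean():.1f} mm")
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
# A non-significant result is consistent with the absence of systematic
# deviation (bias) of the alternative method relative to the reference.
```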

    2.1.2 Human-involved phenomena: human agency

The evaluation applied by Meunier and Yin does not coincide in every respect with the procedure described in ISO 5725. A basic distinction concerns the role of human involvement as a possible constituent of the findings to be evaluated.

As outlined above, the evaluation procedure in the ISO standard depends on measuring test items repeatedly as 'the same thing'. This same thing is presumed to exist 'out there' in a positivist sense, i.e. comprehensively definable as to its kind without any divergence induced by the interference of human agency.

This kind of positivism-rooted measuring cannot be strictly upheld for the anthropometrics in the Meunier and Yin study. These anthropometrics are to a limited extent co-shaped by participants, depending on the ways in which instructions given to these participants work out. These instructions (may) involve the adoption of a particular posture, moving away from a measurement platform between measurements and breathing normally (Meunier and Yin 2000, 448). This type of human agency is alien to phenomena to be evaluated in accordance with ISO 5725. The fact that the dispersion evaluation of the Meunier and Yin findings can be seen to comply with the technical sciences approach, as applied in the repeated measurement of a single participant, is a matter of approximation. This approximation can be explained on the basis of Figure 1, which presents research topics encountered in E/HF studies.

In Figure 1, human involvement of findings emerges as the commonality between the research topics raised. However, this commonality comes in different degrees. Human involvement can be denoted as passive in the case of the head circumference. Passive involvement implies that participants are not in a position to affect the constancy of what is being measured. This constancy allows for unaffected repetition of measurements. Hence, for the head circumference, the evaluation of findings can take place in compliance with the technical sciences procedure.

In addition to the head circumference, Figure 1 (including its caption) points to several examples of research topics that are (co-)shaped by active involvement of participants, i.e. by human agency, including physiological functions (e.g. heart beat), actions and performances (e.g. force exertion) as well as self-reports (e.g. effort experienced). Active human involvement may jeopardise the constancy of what is being measured in measurement repetitions. In the Meunier and Yin study, constancy in consecutive measurements of the anthropometrics (n = 10, intra-individually) is advanced by the instructions as the means to focus human agency on the substantiation of the anthropometrics aimed for.


In addition, adherence to these instructions in the Meunier and Yin study cannot reasonably be seen as a demanding effort for participants. That is why accumulation of carry-over of that effort between consecutive measurements, for instance tiredness, seems to be far-fetched. This observation sides with the fact that Meunier and Yin did not discuss the absence of imaginable systematic differences such as an upward or a downward trend in the consecutive measurement results. These considerations leave the repeated measurement of a single participant as a viable approach in the Meunier and Yin study.

The impact of human agency on the evaluation of findings is further discussed in Section 2.2.

    2.2 Evaluation of research findings in the quantitative social sciences

Figure 1. Research topics encountered in E/HF papers sampled for both reviews (Kanis 1997, 2000) and for the new inspection (see Appendix). Generally, phenomena to be measured or observed are human-involved. This involvement ranges from:
- passive (e.g. the head circumference; see the Measurer's Handbook (Clauser et al. 1988) for the way in which to deal with participants' head-hair) to
- active human involvement, including various kinds of human agency: physiological functions (e.g. heart beat, muscle activity, saccadic eye movements), actions or performances (e.g. adoption of postures, body movements, manipulations in product use, force exertion) as well as self-reports (e.g. past occurrences; duration of episodes experienced; internal states, references, processes; scoring a perceived workload, experienced anxiety, (dis)comfort or aesthetics).


Table 2 gives an overview of the evaluative terminology and definitions in the quantitative social sciences, particularly those from educational and psychological testing, in terms of reliability/the reliability coefficient, and in terms of validity as the general notion covering concepts such as construct validity, content validity and predictive validity (see Table 2, note d).

Table 2 demonstrates that terms are grouped under the same basic evaluation procedures as in Table 1: measurement repetition (dispersion evaluation) and comparison of findings with an independent criterion (evaluation of deviation).

As distinct from the technical sciences (see Table 1), Table 2 shows the occurrence in the quantitative social sciences of a multi-participant procedure (i) in the computation of the so-called reliability coefficient. In addition, this coefficient, as a criterion in the evaluation of findings in terms of dispersion (ii), is absent in the technical sciences evaluation procedure. As for deviation, the evaluation of findings in the quantitative social sciences is based on measurement as intended/presumed (iii), whereas in the technical sciences this evaluation is operationalised in terms of bias as manifestation of the closeness of agreement (see Table 1).

The conceptual divergence of these issues (i, ii, iii) basically affects the evaluation of research findings in E/HF, as is discussed in the next sections.

    2.2.1 Facing human agency: a multi-participant procedure in dispersion evaluation (i)

In the Meunier and Yin study, the repeated measurement of a single participant is an adequate procedure because of the (assumed) constancy of the anthropometrics across repeated measurements. The application of this procedure tends to become problematic the more intrusive the involvement of participants in (co-)shaping the phenomena to be evaluated. For instance, frequent exertion of a maximum force by the same individual within a short time is expected to cause temporary loss of strength. This loss will, inevitably, surface as a systematic difference between consecutive measurements. This thwarts the status of these measurements as real repetitions. In a similar way, memory tends to bias consecutive self-reports. This infringes on the independence of these self-reports if repeated within a short time; examples would be the repetitive grading of the perceived effort required to carry out a task, or of the fatigue felt at the end of a working day, or of the experienced discomfort of a seat.

Table 2. Evaluation of findings in the quantitative social sciences.(a)

Evaluation of findings on the basis of:

Measurement repetition
  Terms: Reliability,(b) reliability coefficient.
  Definition: Reliability coefficient: proportion of the total variance (in test series with different participants) that is true variance.(c)

Comparison with an independent criterion
  Terms: Validity,(d) validation.(e)
  Definition: The extent to which is measured what is intended/presumed to be measured.

(a) Sources: Guilford (1954), Cronbach (1971), Nunnally (1978), Carmines and Zeller (1987), Messick (1989).
(b) Other meanings encountered in the literature (Kanis 2000): reliability as an indicator of failures in the performance of (technical) systems and humans; reliable as statistically significant in relation to a confidence interval; reliable as equivalent with valid.
(c) The reliability coefficient equals the correlation coefficient (Pearson) in a test-retest (e.g. Kanis 1997, 156).
(d) Validity may be envisaged as a unitary concept involving various validities such as concurrent val., construct val., content val., criterion val., external val., predictive val. (Messick 1989, 24-26).
(e) Validation as the establishment of any validity (see note d).


These examples illustrate the difficulty of generating random variation by means of repeated measurement of phenomena that are co-shaped by human agency. In the quantitative social sciences this problem is handled by limiting, on the one side, the number of repetitions per participant, and by involving, on the other side, several participants in the evaluation. The test-retest, i.e. one repetition per participant, is the most widely applied procedure. Thus, by a test-retest approach across several participants, within-participant carry-over may be avoided. This can be checked by demonstrating that there is no statistically significant difference between test and retest results across participants. The absence of this difference is not enough, though, for the specification of dispersion for a subpopulation, i.e. as a constant across participants. Therefore, intra-individual variation should be distributed evenly, or homoscedastically (see above), across participants.

Figure 2. Maximum force exertion by children (n = 72) in pulling: intra-individual differences between test-retest results plotted against their mean (a so-called Bland-Altman plot, cf. Essendrop, Schibye, and Hansen 2001). The second measurement took place one week after the first measurement. Test-retest results did not differ systematically: t(71) = 1.11, p = .27. The positive relationship of the intra-individual differences with their mean may rely on the proportionality of human control (Carlton and Newell 1993, 18, 19). Homoscedasticity of the patterning can be assumed on the basis of the statistical non-significance of the association between the absolute values of the intra-individual differences and the corresponding means (Bland and Altman (1996) propose to statistically test the absence of heteroscedasticity by Kendall's τ). Source of the data: Steenbekkers (1993).


Figure 2 gives an example of so-called heteroscedastic patterning. This patterning shows a positive relationship between the level of human agency, i.e. the maximum force exertion, and the dispersion in measurement repetition of that agency, i.e. the test-retest differences.
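A minimal sketch of such checks is given below, using fabricated force data rather than the Steenbekkers (1993) measurements: the test-retest differences are tested for a systematic shift across participants, and Kendall's τ between the absolute differences and the means probes heteroscedasticity along the lines proposed by Bland and Altman (1996).

```python
# Minimal sketch with fabricated test-retest force data (not the original study).
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 72
test   = rng.uniform(50, 250, n)                  # first maximal pull force (N)
retest = test + rng.normal(0, 0.05 * test, n)     # retest; spread grows with level

differences = retest - test
means = (test + retest) / 2

# Systematic difference between test and retest across participants
t_stat, p_sys = stats.ttest_rel(retest, test)

# Heteroscedasticity: association of |differences| with the means (Kendall's tau)
tau, p_hetero = stats.kendalltau(np.abs(differences), means)

print(f"systematic difference: t = {t_stat:.2f}, p = {p_sys:.2f}")
print(f"heteroscedasticity:    tau = {tau:.2f}, p = {p_hetero:.3f}")
# A significant tau would argue against specifying a single dispersion measure
# (e.g. one SEM) for the whole subpopulation.
```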

Statistical testing of the differences in measurement repetition and checking homoscedasticity both rely on the comparison of test-retest results across participants. The feasibility of this comparison requires that a similar phenomenon is measured amongst participants, in terms of inter-individual equivalence. Then, findings can be aggregated with regard to their degrees, e.g. in an average or in a variance. However, inter-individual equivalence is not a self-evident property of findings, as is discussed in the next section.

    2.2.2 Equivalence of findings in a multi-participant procedure (i, continued)

As discussed above, human agency may lead to the adoption of a multi-participant approach in dispersion specification. Then, human agency can be the source of heteroscedastic data patterning across participants. In addition, human agency may be the cause of non-equivalence of findings across participants, in terms of inter-individual differences in kind rather than in degree. Instances of non-equivalence were found in studies into the development of survey questions (Bercini 1992; Foddy 1998). These studies show that respondents being asked the same question may end up answering different questions due to divergent interpretations of key concepts, or to distinctive comprehension difficulties, or to the adoption of different perspectives or contexts in answering a question. It is noticeable that the issues addressed by the survey questions in these studies appear to have much in common with the research topics dealt with by self-reports in Figure 1.

Inter-individual variety in kind, rather than in degree, is also found in a study into the interpretation of design models by prospective users (Rooden and Kanis 2011). In this study, participants were asked to imagine a functioning product on the basis of a design model of a blood pressure monitor (drawings, foam model or a final appearance model). Participants exhibited various ways of use to compensate for provisional features or incompleteness of the models. Presenting the same model to different participants, effectively, resulted in different products being operated in a simulation amongst these participants.

It is difficult to pin down the extent to which inter-individual non-equivalence of findings, as shown in these examples, is recognised in the vast field of social science research. What can be observed, though, are criticisms of taking inter-individual equivalence of findings for granted, on the basis of so-called measurement by fiat (Torgerson 1958). Measurement by fiat may best be demonstrated by means of a few examples.

Prior (2004, 82) points to the use of instruments like a questionnaire as an act of measurement by fiat. That is to say, the instrument imposes commonality of meaning on questions and answers that are, in all likelihood, variously understood at different times and by different people. In fact, there is no reason to presume that internal states, references or processes of human beings in answering questions would be less diverse than the inter-individual variety usually measured or observed for human characteristics and activities. Clearly, disregarding this variety will conceal inter-individual non-equivalence of findings.

Seale (1999, 35, 120) remarks that researchers may end up with measurement by fiat in deciding that a measurement result means a certain thing rather than another, thus fixing meanings of questions and answers that suit their preconceptions. Then, arbitrariness would aggravate any imposed inter-individual equivalence of findings as raised by Prior (2004).


Implications of these observations by Prior and Seale can be indicated in the case of rating scales, which are widely used for the establishment of participants' self-reports (cf. Annett 2002, 967). Ultimately, measurement by fiat may imply that a rating scale is presumed to be interpreted similarly by participants, given its layout and wording. That is to say, differences between participants' scores on the same topic are only recognised as differences in degree, not in kind, since all participants are assumed to understand the scale(-elements) in the same way and to rate accordingly.

In their discussion of measurement by fiat, Prior and Seale both refer to Cicourel (1964, 3, 14), who pleads for 'literal' measurement instead of measurement by fiat. Cicourel opposes the unwarranted imposition of information on findings by methods applied, particularly through numerical procedures (1964, 2, 14, 33; see also Pawson 1982, 38). The paramount example of this type of procedure is the adoption of ordinal scores of participants as if they rated on interval level. This upgrade of quantification results in findings that are amenable to numerical analysis such as averaging and the determination of standard deviations (see above), as well as the use of parametric statistics, including, for instance, the computation of product moment correlations (see Velleman and Wilkinson 1993, for an overview of the longstanding controversy about the applicability of (non-)parametric statistics).

What matters here is that a contrived unanimity of findings amongst participants is further solidified in the metrication of ordinal differences in degree. In other words, with the example of rating scales in mind, variability in the meaning of a finding first tends to be sacrificed to an arbitrary commonality (cf. Prior, Seale) that, next, is exacerbated by quantification on interval level (cf. Cicourel, Pawson).

The emergence of the measurement by fiat mechanism in the field of E/HF research is discussed further in Sections 3.2.1 and 4.1.

    2.2.3 The reliability coefficient as criterion in dispersion evaluation (ii)

Table 2 shows that the quantitative social sciences present the so-called reliability coefficient as the criterion of the evaluation of dispersion. In accordance with the definition in Table 2, this coefficient is written as

\[ \frac{sd^{2}_{\mathrm{inter}}}{sd^{2}_{\mathrm{inter}} + sd^{2}_{\mathrm{intra}}} = 1 - \frac{sd^{2}_{\mathrm{intra}}}{sd^{2}_{\mathrm{total}}}, \qquad (1) \]

with
- sd^2_inter for the variance of measurement results inter- (or between) participants, in the literature also called true variance,
- sd^2_intra for the variance of measurement results intra- (or within) participants, i.e. the dispersion variance, and
- sd^2_total for the observed variance.

The reliability coefficient (Expression (1)) equals the Pearson correlation coefficient for the test-retest results in a multi-participant procedure (see note c in Table 2). Expression (1) shows that the reliability coefficient is addressed by the complement of the dispersion variance sd^2_intra, as a proportion: sd^2_inter / sd^2_total. This concept renders the reliability coefficient dimensionless (a) as well as range dependent (b).


(ad a) Obviously, the quotient sd^2_inter / sd^2_total cannot be but dimensionless. This property implies that the reliability coefficient does not address, at least not directly, the actual dispersion in the measurement dimension at hand, such as in a test-retest.

(ad b) Range dependence of the reliability coefficient shows up, for instance, by involving participants who score at the extremes of the dimension concerned whilst their intra-individual dispersion aligns with that of the other participants (i.e. dispersion patterning homoscedastic all over). More heterogeneity amongst participants enlarges sd^2_inter in Expression (1) and, consequently, inflates the reliability coefficient due to the relative decrease of the contribution of sd^2_intra because of its constancy. Hence, a high reliability coefficient does not warrant the conclusion that dispersion is small. Neither does a low reliability coefficient denote a large dispersion. In the medical world particularly, Bland and Altman (1986) have objected to the evaluation of measurement dispersion as depicted above (see also Atkinson and Nevill 1998). Bland and Altman point at the ineffectiveness of their own criticisms and of proposed alternatives (2003, 87, 90, 91).
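The range dependence can be made tangible with a small simulation. The sketch below (assumed normally distributed scores, not data from any reviewed study) keeps the intra-individual dispersion fixed and only varies the heterogeneity of the sample; the test-retest correlation, i.e. the reliability coefficient, changes accordingly.

```python
# Minimal sketch (simulated data): the same intra-individual dispersion yields a
# higher test-retest reliability coefficient when the sample is more heterogeneous.
import numpy as np

rng = np.random.default_rng(1)
sd_intra = 5.0                                        # within-participant dispersion, fixed

def reliability(sd_inter: float, n: int = 200) -> float:
    true_scores = rng.normal(100, sd_inter, n)        # between-participant spread
    test   = true_scores + rng.normal(0, sd_intra, n)
    retest = true_scores + rng.normal(0, sd_intra, n)
    return np.corrcoef(test, retest)[0, 1]            # Pearson r = reliability coefficient

print(f"homogeneous sample   (sd_inter =  5): r = {reliability(5):.2f}")
print(f"heterogeneous sample (sd_inter = 25): r = {reliability(25):.2f}")
# Dispersion (sd_intra) is identical in both cases; only the range of the
# participants differs, yet the coefficient is markedly larger in the second case.
```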

2.2.4 Correct dispersion measures in a multi-participant procedure (ii, continued)

If no indications of inter-individual non-equivalence of findings occur (Section 2.2.2) and the intra-individual variation across participants has been established as random and its patterning as homoscedastic (Section 2.2.1), then the intra-participant standard deviation sd_intra is the obvious dispersion specification. This measure, also known as the standard error of measurement (SEM), can be computed in various ways. In a straightforward derivation it holds that the variance of {x_1j - x_2j}, with x_1j and x_2j as the test-retest results of n participants (j = 1 ... n), equals twice the intra-participant variance sd^2_intra (also called error variance). Hence,

\[ sd_{\mathrm{intra}} = \frac{sd\{x_{1j} - x_{2j}\}}{\sqrt{2}}. \qquad (2) \]

An equivalent way to express sd_intra is on the basis of the association between {x_1j} and {x_2j} (Kanis 1997, 157). Another alternative is the root mean square error (RMSE) as the residual standard deviation in an analysis of variance (ANOVA); this type of analysis can cope with more than one repetition per participant (Kanis 2000, 1951). Finally, the so-called limits of agreement (LoA) approach can be applied. This type of dispersion measure originates from the comparison of measurement results of two different methods that are intended to measure the same phenomenon (Bland and Altman 1986). This technique can be expanded to measurement repetition, with sd{x_1j - x_2j} as the basis for the specification of a confidence interval for test-retest differences (Hopkins 2000, 4, 5; Bland and Altman 2003, 92).
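By way of illustration, the sketch below computes the SEM according to Expression (2), together with 95% limits of agreement, for a small set of fabricated test-retest results.

```python
# Minimal sketch (assumed test-retest data): the intra-participant standard
# deviation (SEM) via Expression (2), and Bland-Altman style limits of agreement.
import numpy as np

# Hypothetical test-retest results for n participants (same unit, e.g. newtons).
x1 = np.array([112.0, 95.0, 130.0, 87.0, 140.0, 101.0, 118.0, 125.0])
x2 = np.array([108.0, 99.0, 126.0, 90.0, 137.0, 104.0, 115.0, 129.0])

d = x1 - x2
sem = d.std(ddof=1) / np.sqrt(2)          # Expression (2): sd{x1j - x2j} / sqrt(2)
loa = (d.mean() - 1.96 * d.std(ddof=1),   # 95% limits of agreement for the
       d.mean() + 1.96 * d.std(ddof=1))   # test-retest differences

print(f"SEM (sd_intra) = {sem:.2f}")
print(f"95% limits of agreement = ({loa[0]:.2f}, {loa[1]:.2f})")
# Unlike the reliability coefficient, both measures are expressed in the
# dimension of the measurements themselves.
```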

    2.2.5 Evaluation of findings in terms of validity: intended/presumed measurement (iii)

The aspiration to establish validity on the basis of the extent to which intended or presumed measurement is achieved (see Table 2) characterises validation as a positively oriented endeavour.


This is different for the evaluation of findings in the technical sciences, where closeness of agreement is specified on the basis of a straightforward difference in the dimension of measurement, in terms of bias or deviation (Table 1, particularly note d). This technical sciences evaluation approach targets the absence of discordant evidence rather than the presence of confirmatory evidence as in the quantitative social sciences.

Once a criterion is set, the identification of possible bias in findings, on the basis of direct comparison, appears to be straightforward in the technical sciences. This straightforwardness may be lacking for validation in the quantitative social sciences. In particular, the extent to which intended or presumed measurement is achieved is difficult to establish. In fact, this type of criterion easily renders matters elusive. What Guilford wrote many decades ago still appears to be of current interest: 'In the crudest terms, we say that a test is valid when it measures what it is presumed to measure. This is, however, but one step better than the definition that states that a test is valid if it measures the truth' (Guilford 1954, 470, 471).

It might be suggested that the lack of foothold in the Table 2 formulation of validity can be countered by the involvement of experts who should appraise the extent to which an intended or presumed measurement is realised, or has been realised. However, any indirectness or implicitness in expert appraisals would be at odds with validation as an empirical and traceable effort, rather than as a matter of make-believe or of authority ex cathedra.

We are therefore left to conclude that the apparent difficulty in the quantitative social sciences to empirically substantiate the extent of intended or presumed measurement has not prompted a reconsideration of validation as a confirmatory endeavour. This observation does not imply, though, that there would be no other possibilities (see the next section).

    2.2.6 Distinctive validations charted by investigative syntaxes (iii, continued)

An alternative to confirmatory validation as indicated above could be a Popperian approach, in which validity, in terms of the absence of deviation, can only be claimed as long as attempts to produce counter-evidence fail. This approach would, as a method of enquiry, render validation a questioning endeavour. This view is incorporated in so-called investigative syntaxes (Kanis 2000), which constitute a descriptive tool for charting distinctive types of questioning.

Figure 3 gives an example. It shows the validation of a newly developed instrument to observe patient transfer techniques (Warming et al. 2004) in terms of investigative syntax s2; this syntax is one of the five syntaxes identified, shown in Figure A in the Appendix (cf. Kanis 2000; see the caption of Figure 3 for details).

A final consideration in discussing investigative syntaxes involves the identification of D resulting from the comparison (#) of questioned findings, or of questioned predictions (see syntaxes s4 and s5), with an adopted criterion. If the two compared phenomena are in the same dimension, the #-operation may fully comply with the technical sciences evaluation procedure, in specifying closeness of agreement on the basis of straightforward differences in the dimension of measurement (see above).

Matters become more complex if the two compared phenomena differ in kind, as in the example in Figure 3. In this case, direct comparison resulting in the specification of clear-cut deviation is out of the question, since the #-operation cannot be but an associative comparison. The obvious way to deal with association between series of data is by correlation. Then, D concerns the (non-)occurrence of corresponding trends or patterns.


It is noticeable that conclusions based on correlation analyses tend to lack cogency compared with conclusions based on direct comparison. It is true that associative comparison may be focussed on the correspondence of differences between questioned ratings or scores with an adopted contrast (cf. Haber and LoBiondo-Wood 2006, 342), such as in the study by Warming et al. between recommended patient transfer techniques and techniques self-chosen by nurses (see Figure 3). However, this correspondence still appears to be a matter of association. The significance of correlation coefficients as dimensionless measures of this association further suffers from their range dependence. This dependence is outlined above for the Pearson correlation coefficient rP (see Section 2.2.3). As a consequence, range dependence also holds for the Spearman correlation coefficient rS, as a special case of rP. Since the computation of rS is based on inter-participant differences in ranks, instead of on actual differences (including outliers as in the calculation of rP), rS tends to be less sensitive to the range of data (cf. Atkinson and Nevill 1998, 226). Finally, the so-called intraclass correlation coefficient (ICC) also appears to be range dependent, as can be deduced from its definition, i.e. on the basis of population variances as specified in Expression (1) (Winer et al. 1991, 93; see also McGraw and Wong 1996, 30).

To conclude this discussion, the importance of inspection of a graphical presentation of data is stressed in order to spot (if not for other reasons) inflationary artefacts like those due to overstretched data intervals in correlation analyses.
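The kind of inflationary artefact meant here is easily mimicked. The sketch below (entirely fabricated numbers) shows how a single participant far outside the range of the others drives up the Pearson coefficient, whereas the rank-based Spearman coefficient, and certainly a scatter plot, would expose the artefact.

```python
# Minimal sketch (fabricated numbers): how an overstretched data interval can
# inflate a correlation, and why r_P and r_S should be read alongside a plot.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(50, 5, 30)
y = rng.normal(50, 5, 30)            # essentially unrelated to x

r_p, _ = stats.pearsonr(x, y)
r_s, _ = stats.spearmanr(x, y)
print(f"without outlier: r_P = {r_p:.2f}, r_S = {r_s:.2f}")

# One participant scoring far outside the range of the others stretches the
# interval and drives the Pearson coefficient up; the rank-based Spearman
# coefficient is far less affected.
x_out = np.append(x, 150.0)
y_out = np.append(y, 150.0)
r_p, _ = stats.pearsonr(x_out, y_out)
r_s, _ = stats.spearmanr(x_out, y_out)
print(f"with outlier:    r_P = {r_p:.2f}, r_S = {r_s:.2f}")
# Inspection of a scatter plot would expose the artefact at once.
```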

Figure 3. Application of an investigative syntax. In a study by Warming et al. (2004), an observation instrument was developed for the evaluation of patient transfer techniques. Items included for their possible/apparent relevance and/or their association with low back pain (604) are observed (f_a^o; 1 assumed for a successful operation, 0 for an unsuccessful one). Each issue is weighted according to its importance (605) for the support of a transfer technique; these weightings reflect theoretical considerations (prop). In addition, maximal lumbar compression forces on the low back were calculated (f′_b) on the basis of video recordings of the patient transfers. A difference in f′_b appears to exist between recommended techniques and transfer techniques self-chosen by nurses. This difference in f′_b, set as the golden standard (605, 611), should go together with a corresponding difference in weighted f_a^o. This is what is found, i.e. differences as expected. Thus, the non-observation of D (see Figure A in the Appendix) means that the weighting of f_a^o as prop is seen to hold. The authors correlate calculated compression forces with weighted scores; r = 0.59, p < .01 (609). In addition, findings are presented graphically. Of note is that in this study the questioning of prop in s2 can be further detailed, as in syntax s4 including weighted f_a^o as predictions (see Figure A in the Appendix). In fact, the syntaxes s2 and s4 can be seen largely to differ gradually, rather than by mutual exclusion (Kanis 2000, 1960).


  • 2.3 In sum: towards an evaluation flowchart

In the inventory of terms, definitions and procedures pertaining to the evaluation of findings in the technical sciences and in the quantitative social sciences (the main constituents of E/HF research practice), the technical sciences approach emerges not only as clear-cut but also as limited, due to its problematic (or non-)applicability to research findings (co-)shaped by human agency. Evaluation of these findings, i.e. of those (co-)shaped by active human involvement (see Figure 1), might be expected to be accommodated by the quantitative social science approach (including the multi-participant test-retest procedure). However, doubts have been raised regarding the application of this approach as it suffers from various shortcomings, which are:
- the questionability of the inter-individual equivalence of findings in a multi-participant procedure,
- the defectiveness of the reliability coefficient as dispersion measure,
- the elusiveness of intended/presumed measurement as a validity criterion, and
- the insidiousness of correlations as dimensionless measures of association (rather than of agreement) that are range dependent.

In the evaluation flowchart dealt with in the next section, these doubts and shortcomings are accounted for, including the application of suitable alternatives put forward in the previous discussion.

    3. Flowchart

The flowchart in Figure 4 depicts the evaluation of research findings in terms of the notions addressed so far. Thus, these notions form the constituents of this evaluation process. The assembly of these evaluation constituents into the flowchart is aligned with two alternative entries: passive versus active human involvement, and dispersion versus deviation of findings.

    3.1 Evaluation constituents

In the flowchart, four types of evaluation constituents are distinguished; these types are modelled in different boxes:
- for a state, including running a procedure;
- for alternatives, such as choices or findings;
- for a (possible) problem encountered in the evaluation; and
- for the evaluation in terms of dispersion or deviation of findings.

There are no compelling prescriptions for the way of modelling different categories or for the number of constituents to be distinguished (cf. Fryman 2002).

    3.2 Entries

The first entry is the distinction between passive and active human involvement of the findings to be evaluated (see constituent 2 in the flowchart).


Figure 4. Evaluation of human-involved research findings, in consecutive constituents. Human involvement comprises participants' characteristics, actions and activities, including self-reports.
Legend: N, total number of participants in a study; n, number of participants involved in measurement repetition (n ≤ N); f_a, empirical finding; f′_b, status of f_b unquestioned; #, comparison. 3p-12p: constituents of the evaluation of findings with passive human involvement; 3a-19a: constituents of the evaluation of findings with active human involvement. Boxes modelling the constituents: state, (running) procedure; alternatives (choices, findings); (possible) problem; evaluation in terms of dispersion or deviation of findings. Dashed arrows depict potential problems due to measurement by fiat (see main text, Section 3.2.1).
(a) The absence of measurement by fiat in the case of passive human involvement is discussed in Section 3.2.1.
(b) Dispersion specification and identification of deviation are positioned as parallel procedures (see main text, Section 3.2.2 for further discussion).
(c) The flowchart does not refer to alternatives for the test-retest (8a), such as the parallel/alternate form or the so-called split-half. These alternatives face the same problems as the test-retest with respect to the inter-individual equivalence of findings. Moreover, these alternatives are hardly applied in E/HF research.
(d) An example of taking account of heteroscedasticity by computing a dispersion measure locally can be found in Kanis (1997, 161).
(e) SEM: standard error of measurement; LoA: limits of agreement; RMSE: root mean square error.
(f) Comparison (#) focused on the correspondence of differences between questioned ratings and scores with an adopted contrast (see main text, Section 2.2.6).
(g) Detectable deviation tends to be limited by dispersion: the greater the dispersion, the wider the range within which findings cannot be demonstrated to differ from each other or to deviate from a criterion (see main text, Section 3.2.2 for further discussion).


The second entry involves the distinction between the two basic evaluation procedures: specification of dispersion and identification of deviation. These procedures are applied to both alternatives of the first entry; see the constituents 4p and 8p (..p for passive human involvement), and 7a and 13a (..a for active human involvement).

    3.2.1 Human involvement: passive, active

The kind of human involvement, active versus passive, is associated with fundamental differences in the evaluation of findings. An example of this type of difference involves measurement by fiat.

In the case of active human involvement, the flowchart addresses measurement by fiat in terms of both problems discussed above: the inter-participant non-equivalence of self-reports (4a) and the adoption of ordinal scores as interval data (6a). These problems are not at stake in the case of passive human involvement (3p), for two reasons. Firstly, inter-individual non-equivalence of findings, as referred to in 4a, does not occur in passive human involvement because of the absence of any interference by human agency. Secondly, passive human involvement, as a rule, accommodates measurement at interval/ratio level, just as in the technical sciences evaluation of findings (discussed in Section 2.1.2; see also the Meunier and Yin study on anthropometrics). Then, a dimensional upgrade as referred to in 6a appears to be out of order.

Of note are the dashed arrows 3a > 4a > 5a and 5a > 6a > 7a/13a. These arrows depict potential problems due to measurement by fiat: measurement by fiat that is not recognised, or is ignored, may still affect ensuing evaluations. Problematic empirical evidence may surface at particular places in the flowchart; see the possible neglect of changes over time of phenomena studied (9a), the taking of homoscedasticity for granted (11a), and the disregard of systematic differences in favour of the desired random variation (6p).

    3.2.2 Specification of dispersion, identification of deviation

The basic difference between dispersion specification and the identification of deviation involves the way in which they are determined.

The extent of dispersion rests solely on variation in the results of repeated measurement (4p, 7a), resulting in the range of findings or their standard deviation as its specifications in passive human involvement (7p), and SEM, LoA or RMSE as its specifications in active human involvement of findings (12a).

The identification of deviation needs an independent criterion, f′_b, to compare with. As observed in Section 2.1.2, in the case of passive human involvement, the kind of findings to be evaluated (f_a, at interval or ratio level, see above) can be established without any interference of human agency. Then, an obvious approach is to adopt a criterion of the same dimension that allows for direct comparison with the findings to be evaluated (9p, 10p); these findings may be questioned, for instance, as a consequence of their generation by a new method (cf. Meunier and Yin 2000).

In the case of active human involvement, a criterion of the same kind as the findings to be evaluated may not be available. This is demonstrated by Warming et al. (2004; see Figure 3). Then, the #-operation is complicated considerably, as will be further discussed in Section 4. Both alternatives – f_a and f′_b being of the same kind or not – are addressed in 14a.

    In the flowchart, dispersion specification and identification of deviation are positioned

    as parallel procedures, both in passive human involvement studies (left part of the


flowchart) and in active human involvement studies (right part). However, the two procedures are not independent of each other.

    On the one hand, identification of deviation is in need of dispersion specification since

    detectable deviation is limited by dispersion: the greater the dispersion, the wider the range

    within which findings cannot be demonstrated to differ systematically from each other or to

    deviate from a criterion. In extremis, the notion of deviation may lose significance

    completely since it becomes unclear what actually has been measured. Or, in other words,

    unreliability of findings, in the end, rules out any evaluation of these findings as (in)valid.

This observation is at odds with the unjustified claim featuring in some textbooks that unreliable data are invalid (cf. Babbie 2004, 145). Here, these considerations would argue for the consecutive numbering in Figure 4 of the constituents (3p–12p; 3a–19a), indicating that, as a rule, dispersion specification has to precede the identification of deviation.

    On the other hand, however, this sequential practice may fall short if findings from

    repeated measuring unmistakably differ systematically. Clearly, these findings would not

constitute a proper basis for dispersion specification. This concurrence renders the specifi-

    cation of dispersion and the identification of systematic differences or of deviation recur-

    sive. In the flowchart, this possibility is not detailed further.

    3.3 Additional considerations

    3.3.1 Inspection

Occasionally, the flowchart points at inspection as a way to (co-)evaluate findings (10a, 15a, 5p; see Section 2.2.6). Consultation of plots is strongly advocated by Altman and Bland (1983, 313) and Bland and Altman (1986, 308; 1999, 143). Obviously, inspection is not meant to be an alternative to statistical testing or to correlation analysis. Rather, inspection should go together with this type of analysis in order to detect peculiarities such as trends or patterns, which may then be addressed or specified by a quantitative analysis.
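To make this role of plot inspection concrete, the following sketch (in Python, with invented data and variable names chosen purely for illustration) draws a Bland-Altman-style difference-against-mean plot with the mean difference and ±1.96 SD reference lines; it is a minimal illustration of the kind of graph whose consultation Bland and Altman advocate, not a reproduction of any cited analysis.

```python
# Minimal sketch (assumed, invented data): a Bland-Altman-style plot for visual
# inspection of test-retest differences alongside any statistical analysis.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
test = rng.normal(100, 15, size=30)           # hypothetical first measurement
retest = test + rng.normal(2, 5, size=30)     # hypothetical repetition: some bias plus noise

mean = (test + retest) / 2
diff = retest - test
bias = diff.mean()
half_width = 1.96 * diff.std(ddof=1)          # 95% limits of agreement around the bias

plt.scatter(mean, diff)
plt.axhline(bias)
plt.axhline(bias + half_width, linestyle='--')
plt.axhline(bias - half_width, linestyle='--')
plt.xlabel('Mean of test and retest')
plt.ylabel('Retest minus test')
plt.title('Inspect for trends, outliers or widening spread')
plt.show()
```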

    3.3.2 Questioning of empirical findings or of propositions

Figure A in the Appendix shows that the investigative syntaxes focus either on the questioning of empirical findings (s1, s3) or on the questioning of propositions (s2, s4, s5; cf. Kanis 2000). In questioning empirical findings, the flowchart in Figure 4 features the comparison f_a # f′_b from syntax s1/syntax s3 as the central element (see the constituents 8p–12p and 13a–19a). Figure A in the Appendix shows that in questioning propositions the flowchart has to cover either the comparison f°_a # f′_b (s2) or the comparison pred # f′_b (s4, s5).

In questioning propositions on the basis of syntax s2, f_a in the Figure 4 flowchart is replaced by f°_a in the constituents 8p–12p and 13a–19a. In addition, the constituents 8p and 13a are rephrased in terms of identification of systematic differences between f°_a and f′_b (the status of f°_a and of f′_b as unquestioned rules out the identification of a systematic difference as deviation). Finally, the constituents 11p, 12p, 18a and 19a are rephrased in terms of the presence (11p, 18a) or absence (12p, 19a) of systematic differences between f°_a and f′_b, which lines up with the questionability of the propositions involved.
In questioning propositions on the basis of syntax s4 or s5, including pred # f′_b as the central element, f_a is replaced in the Figure 4 flowchart by pred in the constituents 8p–12p and 13a–19a. The presence (11p, 18a) or absence (12p, 19a) of deviation of pred from f′_b exemplifies the questionability of the propositions involved.


It is noted that the accommodation of the Figure 4 flowchart for questioning propositions also includes a rephrasing of other (introductory) constituents referring to f_a (i.e. 1, 2, 3a, 5a). These rephrasings are not further detailed here.

    4. The flowchart in E/HF research practice

    The practicability of the flowchart is charted by its application to the research papers sum-

    marised in the Appendix. To begin with, the discussion focuses on the occurrence of mea-

    surement by fiat. Next, specific occurrences are addressed including other problems faced

    as indicated in the flowchart. Finally, observations resulting from the application of the

flowchart are compared with the authors' assessments of current practice in the evaluation

    of their findings, if mentioned in their papers.

    Of note is that the assembly of the flowchart discussed above occasionally refers to a

    research paper summarised in the Appendix. This means that any demonstration of the

    practicability of the flowchart on the basis of that research paper, as a firm proof of its

    overall practicality, would get stuck in circularity.

    4.1 Measurement by fiat

    As shown above, the flowchart accommodates two ways of measurement by fiat: the

    assumed inter-individual equivalence of findings and the loose interpretation of a mea-

    surement dimension, i.e. the adoption of ordinal scores as interval/ratio data in the analy-

    sis. These are discussed both for passive and for active human involvement of findings.

    4.1.1 Inter-individual equivalence of findings (3a in the flowchart)

    Questioning the inter-individual equivalence of findings with a passive human involve-

    ment such as body measures (Meunier and Yin 2000; Luximon and Goonetilleke 2004)

    seems to be out of place. As indicated above, a characteristic like the head circumference

can reasonably be assumed to exist 'out there', i.e. to be comprehensively definable, inde-

    pendent of any interference from human agency.

    This positivist view on the kind of phenomenon being measured can be seen as

    smoothly applicable when human agency is present in terms of physiologic functions

    such as heart rate (Simeonov et al. 2005), muscle activity (e.g. Mogk and Keir 2003;

Lariviere et al. 2004) or saccadic eye movements (Jainta, Jaschinski, and Baccino 2004). This means that data may be conceived as equivalent inter-individually. This equivalence

    is equally plausible for human actions such as making movements (e.g. Fenety, Putnam,

    and Walker 2000) or operating a product (e.g. Blangsted, Hansen, and Jensen 2004;

    Stanton and Baber 2005). The same holds for human performances such as a walk-

    and-turn test and a one-leg-stand in a sobriety test (Stuster 2006), or the exertion of a

    particular force (e.g. de Looze et al. 2000; Bao and Silverstein 2005). It is noticeable that

    in none of these studies, inter-individual equivalence of findings is put forward as an issue

    in its own right.

Occasionally, a surprise may occur, though. In Kanis (2000), an example is given of participants exceeding their maximum force exertion, F_max, when asked to exert a force they experience as just comfortable, F_comf. Apparently, it can occur that F_comf > F_max intra-individually. Then, F_max cannot be taken as fully equivalent across participants. In fact, F_comf and F_max are best seen as self-reports reflecting internal states, references or processes, as indicated in Figure 1.


Of the various kinds of data presented in Figure 1, self-reports in terms of scores on worded scales tend to be most profoundly shaped by human agency. This type of finding features in half of the research papers summarised in the Appendix (26 out of 53).

    As it happens, only a few studies touch upon inter-individual equivalence of self-reports.

An example is the paper by Pickup et al. (2005), who found that their newly developed Integrated Workload Scale reflects, according to all participants, 'how hard they were working' (689). This observation can be seen to allocate the perceived amount or degree of workload, rather than the kind of workload, as the source of differences between participants.

Another example is the study by Smith, Andrews, and Wawrow (2006) into the development and evaluation of an automotive seating discomfort questionnaire. This questionnaire consists of visual analogue scales (148). The authors conclude that between-participants variability of ratings 'suggests that subjects have different perceptions of discomfort but are able to maintain this perception over time' (146).

A third example was found in a study by Ben-Bassat and Shinar (2006) into the comprehension of traffic signs. Ratings of the standardisation and compatibility of these signs were compared between experts and students. Because 'there was a possibility that some participants [. . .] would use different criteria for evaluating the signs', participants were spotted as outliers and eliminated if their answers were 'grossly inconsistent with those of the majority' (187, 188). So, in this approach, large differences in scored perceptions are conceived as deviations in kind rather than in degree.

    These references illustrate that inter-individual equivalence of findings seems to be

largely taken for granted in E/HF studies. Of course, this condition does not exclude the possibility that discrepancies could have emerged in the research papers consulted.

    The fact that this has not happened might, at least to some extent, be ascribed to the pre-

    supposition of inter-individual equivalence of findings. This renders equivalence in actual

    research practice a self-fulfilling assumption rather than being questioned empirically.

    4.1.2 Ordinal findings as interval data (6a in the flowchart)

    In the case of passive human involvement in the phenomena studied, findings can be seen

    to be measurable at interval level, as in the technical sciences. Then, the adoption of ordi-

    nal scores as interval data in the analysis is not at issue. This may be different for active

human involvement, especially in the case of self-reports consisting of participants'

    scores on scales.

    In some 15 of the 26 papers in the Appendix featuring these ordinal self-reports

    (see previous discussion on inter-individual equivalence of findings), scores are dealt

    with as interval data, mainly in correlation analyses. The admissibility of this conversion

of ordinal scores to interval data is not discussed in any of the papers. This leads to the conclusion that, if this dimensional upgrade were to cause a difference in terms of statistical significance, the conversion would not be justified by the findings. And if such a significant difference cannot be demonstrated, any dimensional upgrade would be pointless.
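As a concrete illustration of this point, the sketch below (invented scores; hypothetical variable names) contrasts a rank-based correlation, which respects the ordinal character of the scores, with a Pearson correlation computed on the same scores treated as interval data, so that one can check whether the 'dimensional upgrade' actually changes the statistical conclusion.

```python
# Minimal sketch (assumed, invented scores): does treating ordinal scores as
# interval data change the statistical conclusion?
from scipy.stats import pearsonr, spearmanr

workload_scores = [2, 3, 3, 4, 5, 5, 6, 7, 7, 7]             # hypothetical ordinal ratings
task_duration   = [12, 15, 14, 20, 22, 25, 24, 30, 28, 33]   # hypothetical paired measure

r_interval, p_interval = pearsonr(workload_scores, task_duration)   # treats scores as interval
r_ordinal, p_ordinal = spearmanr(workload_scores, task_duration)    # uses ranks only

print(f"Pearson  r = {r_interval:.2f}, p = {p_interval:.3f}")
print(f"Spearman r = {r_ordinal:.2f}, p = {p_ordinal:.3f}")
# If the two analyses lead to different significance conclusions, the upgrade to
# interval level is doing real (unjustified) work; if not, the upgrade adds nothing.
```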

    4.2 Dispersion specification for active human involvement of findings

    (7a and the following constituents in the flowchart)

    In E/HF research, a number of precautions are taken to make sure that intra-individual

    differences in measurement repetition are random.


Participants may have to move around between measurement repetitions (Meunier and Yin 2000, 448; see also DeVocht et al. 2006, 299). Repetition of measurement may

    be masked by not informing participants (Kouchi and Mochimaru 2004, 1511). In the

    retest, questionnaire items can be randomised in order to counteract the role of memory

    (Leung, Chan, and He 2004, 236). For the same reason, a certain time lapse may be

    deemed appropriate (cf. Murphy et al. 1998, 171, 175). In the papers compiled in the

Appendix, the longest time intervals between test and retest are reported by Lariviere et al. (2004, 105; 19–79 days in measuring strength and muscle activity), Bohle, Tilley,

    and Brown (2001, 891; 11 weeks in scoring circadian rhythm characteristics and behav-

    ioural dispositions of people) and Zinzen et al. (2000, 1793; six months in charting the

    prevalence of low back problems). Authors do not discuss the length of these time inter-

    vals with respect to conceivable changes of the human organism (such changes might ren-

    der data intra-individually non-equivalent).

    4.2.1 Test-retests (8a in the flowchart)

    Assuming inter-individual equivalence of findings, the question arises of whether the sta-

    tistical non-significance of the test-retest differences has actually been established. Test-

    retests were carried out in 24 studies compiled in the Appendix. In 10 of these studies, no

    indications were found that test-retest differences were analysed for their possible statisti-

    cal significance. This oddity raises the question of why the authors involved did a test-

    retest in the first place.

    In the rest of the studies featuring a test-retest (14 papers), the statistical significance

    of the intra-individual variation across participants is checked in an ANOVA or by a

    paired t-test, as far as mentioned at all. In four of these 14 papers, the demonstrated statis-

    tical non-significance of the test-retest differences is taken as a token of reliability or of

    repeatability (Tran, Letowski, and Abouchacra 2000, 821; Mogk and Keir 2003, 961,

    969; Kolich and Taboun 2004, 848; McGorry, Chang, and Dempsey 2004, 21, 27). This

    is an erroneous interpretation: the larger the intra-individual variations, the smaller the

    chance of coming across a statistically significant difference. Then, the statistical non-

    significance of test-retest differences as a token of reliability rests upon the very

    unreliability of data. In fact, statistical non-significance demonstrates nothing in terms

    of reliability. It only means that the test-retest results give no reason to question these

    results as possibly being biased (Atkinson and Nevill 1998, 222).
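A small simulation can make this argument tangible: for a fixed systematic test-retest difference, inflating the intra-individual variation makes a paired t-test less likely to reach significance, so non-significance may reflect noisy data rather than agreement. All numbers below are illustrative assumptions.

```python
# Minimal sketch (assumed parameters): non-significance of test-retest differences
# becomes MORE likely as intra-individual variation grows, for the same bias.
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
n_participants, bias, n_sim = 20, 3.0, 2000    # hypothetical systematic retest bias of 3 units

for intra_sd in (1.0, 5.0, 15.0):              # increasing within-participant noise
    significant = 0
    for _ in range(n_sim):
        true_level = rng.normal(100, 10, n_participants)
        test = true_level + rng.normal(0, intra_sd, n_participants)
        retest = true_level + bias + rng.normal(0, intra_sd, n_participants)
        if ttest_rel(test, retest).pvalue < 0.05:
            significant += 1
    print(f"intra-individual SD = {intra_sd:4.1f}: "
          f"bias detected in {significant / n_sim:.0%} of simulated test-retests")
# Larger intra-individual SD means the same bias is detected less often, so
# non-significance by itself says nothing about reliability.
```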

    4.2.2 Homo-/heteroscedastic patterning (10a in the flowchart)

    The type of intra-individual variation patterning across participants is addressed in three

studies. Fothergill and Sims (2000, 1492) found that the dispersion of their data was patterned homoscedastically. McGorry, Chang, and Dempsey (2004, 27) noted that 'there is a slight homoscedasticity present' in their data on wrist angular displacement, which, presumably, should read 'a slight heteroscedasticity'. Finally, Essendrop, Schibye, and Hansen (2001, 383) observed that 'from the Bland-Altman plots it became evident that the differences [between results of isometric muscle strength tests] were the same through the entire measuring range'. However, some of the plots presented in this paper indicate a

    heteroscedastic patterning (no statistical tests are reported). None of the other papers

    addresses the issue of dispersion patterning in spite of the fact that in most cases force

    exertion and movements are studied. These human activities tend to show a positive


relationship between the measured activity level and the dispersion in measurement repetition of that activity, as shown in Figure 2 and addressed in its caption.

    4.2.3 Dispersion measures actually established (12a in the flowchart)

    The intra-individual variation in measurement repetition across participants is specified

    by a dispersion measure in five of the 24 studies in the Appendix featuring a test-retest. In

three of these, the SEM (Kanlayanaphotporn et al. 2002, 243; Lariviere et al. 2004, 107, 110; Leung, Chan, and He 2004, 237) is used. Whether test-retest differences are statisti-

    cally (non-)significant is not reported.

    In the other two studies, intra-individual variation in repeated measurements is

    addressed by the LoA (Fothergill and Sims 2000, 1492; Essendrop, Schibye, and

    Hansen 2001, 383 and the following pages). In both papers, some test-retest differ-

    ences are reported as being statistically significant. These systematic differences are

    accounted for in the calculated LoA.
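To show what these dispersion specifications amount to in practice, the sketch below computes a SEM and Bland-Altman limits of agreement from one invented test-retest data set, using common textbook-style formulas; it is not a restatement of the exact procedures of the cited studies.

```python
# Minimal sketch (assumed, invented data; textbook-style formulas): SEM and
# limits of agreement (LoA) as specifications of test-retest dispersion.
import numpy as np

rng = np.random.default_rng(3)
true_level = rng.normal(100, 12, size=25)
test = true_level + rng.normal(0, 4, size=25)
retest = true_level + rng.normal(0, 4, size=25)

diff = retest - test
sem = diff.std(ddof=1) / np.sqrt(2)     # SEM from the SD of differences (one common derivation)
bias = diff.mean()
loa_low = bias - 1.96 * diff.std(ddof=1)
loa_high = bias + 1.96 * diff.std(ddof=1)

print(f"SEM ~ {sem:.2f} units")
print(f"Mean difference (bias) = {bias:.2f}; 95% LoA = [{loa_low:.2f}, {loa_high:.2f}]")
# Unlike a correlation coefficient, both quantities are expressed in the units of
# measurement, and the LoA explicitly carry any systematic difference (bias).
```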

    Correlation coefficients are also included in these five studies. However, as discussed

    in Section 2.2.3, the idea that correlations would specify test-retest dispersion is a

    misconception.

Another regular pitfall is the use of Cronbach's α to address precision issues. Cronbach's α is a criterion for the internal consistency of multiple items put forward to capture a particular construct. However, several authors term α a measure of reliability (Taylor 2000, 323; Zinzen et al. 2000, 1793; Bohle, Tilley, and Brown 2001, 891; Radovanovic and Alexandre 2004, 325; Diaz-Morales and Sanchez-Lopez 2005, 357; Gutierrez et al. 2005, 743; Harris, Chan-Pensley, and McGarry 2005, 969, 973; Durso, Bleckley, and Dattel 2006, 730). Signifying α as a reliability measure can be considered a misconception in view of the fact that reliability relies on measurement repetition. Conceiving scores for different items as repetitions would classify these items as homogeneous in being unidimensional; for this conception, α is an inappropriate index (see e.g. Schmitt 1996).
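The distinction matters in practice: α is computed across different items within a single administration, not across repeated measurements. The sketch below applies the standard formula to an invented participant-by-item score matrix, simply to show what the statistic does and does not use.

```python
# Minimal sketch (assumed, invented scores; standard formula): Cronbach's alpha
# uses variation across items within one administration; no repetition involved.
import numpy as np

# rows = participants, columns = items of one questionnaire (hypothetical 1-7 ratings)
scores = np.array([
    [4, 5, 4, 6],
    [2, 3, 2, 3],
    [6, 6, 5, 7],
    [3, 4, 4, 4],
    [5, 5, 6, 6],
], dtype=float)

k = scores.shape[1]                          # number of items
item_variances = scores.var(axis=0, ddof=1)
total_variance = scores.sum(axis=1).var(ddof=1)
alpha = (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

print(f"Cronbach's alpha = {alpha:.2f}")
# A high alpha says the items covary (internal consistency); it says nothing about
# how the same participants would score on a retest.
```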

    4.3 Deviation identification for active human involvement of findings

    (13a in the flowchart)

    Comparison is the core of the identification of deviation, as shown by the investigative

    syntaxes. Compared phenomena can be similar in kind (14a), such as quantifications on

    the basis of the same measurement or observation unit. In these cases, the comparison

    can be carried out straightforwardly, i.e. in a direct confrontation of findings with the

    adopted criterion. Direct comparison occurs in 15 studies listed in the Appendix (see

    Table B). Once findings can be seen as equivalent, direct comparison is an apt way of

    identifying (the absence of) deviation.

    In 16 studies listed in Table B (see Appendix), the comparison concerns the confron-

    tation of data or findings of different dimensions with each other, which results in an asso-

    ciative comparison (15a). Obviously, direct comparison is out of the question in these

    cases. As Table B shows, authors chose to compute correlations or to establish the corre-

    spondence of differences between questioned findings with a chosen contrast. The latter

    approach is applied, for example, in the evaluation of newly developed observational

    methods to assess patient handling. Assumed contrasts involve different groups of

    patients (cardiology versus intensive care) in Radovanovic and Alexandre (2004, 323),

    different tasks (recommended versus self-chosen transfer techniques) in Warming et al.


(2004, 609, see Figure 3) and different types of observational categories (ergonomic haz-

    ards versus safety of working techniques) in Johnsson et al. (2004, 596). Each study con-

    firms the anticipated correspondence. The impact of these confirmations relies on the

    significance of the chosen contrasts, particularly in terms of the challenge they pose.

    Occasionally, authors indicate the limitations of their choices. Radovanovic and Alexandre

    (2004, 323) draw attention to the tentative character of their comparison of only two

    groups of patients. In the same vein, Warming et al. (2004, 611) question the compression

force as 'golden standard' in the comparison with scores generated by their observational

    method of patient handling. In other studies, authors show less reservation (see Pickup

et al. 2005, 692, who eventually call their newly developed workload scale 'valid'; see also

    Johnsson et al. 2004, 600, for a similar general conclusion).

    4.3.1 Correlation in addition to direct comparison

    If compared data are of the same dimension, direct comparison is the obvious way to

reveal the occurrence or absence of deviation. These data can also be correlated, as

    happens in Blangsted, Hansen, and Jensen (2004, 239), McGorry, Chang, and Dempsey

    (2004, 26), Azar, Andrews, and Callaghan (2005, 691), Hamilton and Clarke (2005,

    668) and Francis and Oxtoby (2006, 284). However, because correlations are dimen-

    sionless quantities, they do not reflect possible deviations, as is demonstrated by

    Expression (1). This expression was introduced with reference to the test-retest, but can

be seen to hold for any paired series of quantitative data. Variances, such as sd²_inter and sd²_intra in Expression (1), are invariant under a constant difference across participants. Hence, Expression (1) is not affected by a constant systematic difference. Consequently, this algorithm is fit to reveal neither the presence of a constant difference nor a corresponding deviation.
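This invariance is easy to demonstrate numerically: adding a constant offset to one member of each data pair leaves the Pearson correlation untouched while creating an obvious systematic deviation. The data below are invented purely to show the effect.

```python
# Minimal sketch (assumed, invented data): a correlation coefficient is blind to a
# constant systematic difference between paired measurements.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(4)
method_a = rng.normal(100, 10, size=30)
method_b = method_a + rng.normal(0, 3, size=30)   # agrees with method_a, plus noise
method_b_biased = method_b + 20                   # same data shifted by a constant 20 units

r_unbiased, _ = pearsonr(method_a, method_b)
r_biased, _ = pearsonr(method_a, method_b_biased)

print(f"r without bias: {r_unbiased:.3f}")
print(f"r with a +20-unit systematic difference: {r_biased:.3f}")   # identical value
print(f"mean deviation actually present: {np.mean(method_b_biased - method_a):.1f} units")
```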

    4.3.2 Correlation as a measure of association rather than agreement

    As noted, calculation of correlation requires paired series of data that do not have to be of

    the same dimension. This makes correlation an apt criterion for addressing associations

    in terms of corresponding trends or patterns in compared series of data that differ in kind.

    Setting a required level of association to be observed would pose a boundary between

    where correspondence ends and deviation begins. This type of boundary will always be

    ambiguous because of the range dependency of correlations. In the Appendix, there is

    one example of the application of a boundary value. Belz, Robinson, and Casali (2004,

    165) reject self-assessed alertness and vehicle separation as valid indicators of lorry-

    driver fatigue on the basis of low associations with observed drowsiness and driver man-

    nerisms as the current indicators.
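The range dependency referred to above can likewise be shown in a few lines: the same underlying relationship yields a markedly lower correlation when the measured range is restricted, which is why any fixed correlation threshold is an ambiguous boundary. The data are invented for illustration.

```python
# Minimal sketch (assumed, invented data): the same relationship gives different
# correlations depending on the range of values sampled.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(5)
x_full = rng.uniform(0, 100, size=200)
y_full = x_full + rng.normal(0, 15, size=200)      # one fixed underlying relation

restricted = (x_full > 40) & (x_full < 60)         # keep only a narrow slice of the range

r_full, _ = pearsonr(x_full, y_full)
r_restricted, _ = pearsonr(x_full[restricted], y_full[restricted])

print(f"r over the full range:     {r_full:.2f}")
print(f"r over a restricted range: {r_restricted:.2f}")   # substantially lower
# Any fixed boundary value for an 'acceptable' association therefore depends on the
# range of the data, not only on how well the two series correspond.
```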

    Other papers in the Appendix are inconclusive about any level of association as crite-

rion. Correlations are largely seen to support validity as they are 'adequate', 'positive', 'moderate', 'significant', 'strong', 'good' and 'high' (Ben-Bassat and Shinar 2006, 182;

    Fenety, Putnam, and Walker 2000, 387, 392; Bohle, Tilley, and Brown 2001, 897; Kolich

    and Taboun 2004, 860; Richardson, Jones, and Torrance 2004, 961, 962; Gutierrez et al.

    2005, 743; Pickup et al. 2005, 687).

In Fenety, Putnam, and Walker (2000, 388), the report of correlations is accompanied by a graphical presentation of the findings (in-chair positions in centimetres

    over time by two different measurement instruments). This presentation demonstrates

    the correspondence between data-patterns. The authors do not discuss whether this


correspondence or the length of data intervals accounts for the observed levels of association (in most cases, r ≥ 0.99). For that matter, the in-chair positions are not discussed further on the basis of direct comparison, which would have been possible in this study. Kolich and Taboun (2004, 860) depict another example of associative comparison, whereas direct comparison would to some extent have been possible in their case, between predicted and experienced overall comfort of automobile seats rated on the same scale.

    4.4 Dispersion/deviation of findings, passive human involvement

    (3p and the following constituents in the flowchart)

    In the discussion of measurement by fiat in Section 4.1, it is argued that in the case of pas-

sive human involvement findings are seen to be measurable at interval level (at the least).

    This similarity with the technical sciences also holds for the inter-participant equivalence

    of findings: questioning inter-individual equivalence of findings that are passively human-

    involved seems meaningless. In this way, the absence of human agency considerably sim-

plifies the evaluation constituents 4p–12p compared to 7a–19a in the flowchart. As an example, see 8p and 9p, where a criterion, f′_b, is adopted that is of the same kind (same dimension) as f_a. This type of convenient criterion may not be available in the case of

    active human involvement of findings (13a, 14a).

    Practical examples of the application of the evaluation constituents, as depicted by 3p

    and the following constituents in the flowchart, are rare. In E/HF research, apparently,

    human involvement in the phenomena being studied is largely active, in terms of human

    agency.

4.5 The flowchart versus authors' assessment of their evaluations

    The flowchart in Figure 4 depicts appropriate evaluation procedures. In a number of

    research papers, evaluations live up to these procedures to a certain extent. Compliance is

    encountered for identifications of deviation on the basis of direct comparison (see the dis-

    cussion in Section 4.3). Occasionally, dispersion measures are specified, as discussed in

    Section 4.2.3. In the majority of the E/HF research papers selected, though, the evaluation

    of data is plagued by deficiencies. Constituents in Figure 4 are regularly ignored, misin-

    terpreted or addressed incorrectly. Every now and then, it seems that anything that can go

    wrong will go wrong.

    The proliferation of flawed evaluations may be triggered by imitation. Bland and

Altman point this out in discussing why the 'totally inappropriate' method of correlation is 'almost universally used' in medical research (1986, 310; see also Bland

    and Altman 2003, 87). In the selection of papers in the Appendix, imitation may be

    seen at work in the almost complete neglect of the potential non-equivalence of

    findings, as well as in the correlation-mania, which appears not only to be prominent

    in medical research (cf. Bland and Altman 1986), as is shown by Table A in the

    Appendix.

4.5.1 Authors' assessments of current evaluation practices

    In only one of the studies selected (see Appendix) is inadequacy of the applied evaluation

    procedure identified. Essendrop, Schibye, and Hansen (2001, 379, 386) come across the

    invariance of correlation for systematic differences in observing that correlations do not


reflect deviations in terms of increase of strengths, as measured in their study. These authors also point at the possible inflation of correlation by heterogeneity amongst participants (Essendrop, Schibye, and Hansen 2001, 386; see also Fenety, Putnam, and Walker 2000, 390). In two more papers, the SEM is adopted in order to counterbalance the range-dependency of the presented ICCs (see Kanlayanaphotporn et al. 2002, 246; Lariviere et al. 2004, 110).

    This inspection suggests that the correctness of evaluation procedures seems to

    be largely taken for granted in E/HF, with correlating being, by far, the most fre-

    quently applied procedure. This popularity may reflect the easily applicable function-

    ality of the computation of correlations in statistical packages, which can be applied

    to any paired series of quantitative data. Moreover, correlations, being dimension-

    less, are mouldable quantities that, occasionally, are even averaged (Jainta, Jaschin-

    ski, and Baccino 2004, 111; Arthur, Edwards et al. 2005, 664; McDowell et al.

    2006, 125). Range dependency may further obscure the arbitrariness of invented

evaluations. In fact, correlating seems to be a 'Jack of all trades', readily at hand as a supple analytic tool for the generation of convenient evidence, that is, evidence in support of the zeal in E/HF studies to tell a positive story in terms of reliable and valid findings. Authors, in making the most of it, are not on the lookout, it seems,

    for surprises or unanticipated counter-evidence. And indeed, hardly any is reported.

    This may be illustrated by the observation that in four out of five of the studies sum-

    marised in the Appendix, findings are presented as being reliable and/or valid.

Reliability and validity as observed are regularly qualified (to mention the most popular terms) as 'encouraging', 'excellent', 'good', 'high', 'moderate', 'positive', 'satisfactory' or 'significant'. Occasionally, this positive focus is qualified by terms such as 'initial' or 'preliminary' (Bohle, Tilley, and Brown 2001; Arthur, Edwards et al. 2005; Diaz-Morales and Sanchez-Lopez 2005; Gutierrez et al. 2005; Harris, Chan-Pensley, and McGarry 2005). The infrequent negative evaluations are described as 'low', 'poor' or 'weak' (Kanlayanaphotporn et al. 2002; Belz, Robinson, and Casali 2004; Jainta, Jaschinski, and Baccino 2004; Arthur, Bell et al. 2005; Azar, Andrews, and Callaghan 2005; Koppelaar and Wells 2005).

Another example of authors' positive preoccupations concerns direct comparison (17a, 18a in the flowchart). As discussed in Section 4.3, direct comparison is an apt way of identifying (the absence of) deviation. In most studies featuring direct comparison, systematic differences are established between findings or predictions and the criteria adopted (see Ds in Table B of the Appendix). These differences give rise to the identification of (some) deviations. However, in none of these studies do the authors abandon a method or a proposition or some underlying model as invalid. This may illustrate the wish to validate matters rather than to question or to reject them.

Clear-cut examples of the attraction of being 'reliable and valid' are references to these notions without presenting the empirical evidence, as can be seen in the following quotes:

    From the test subjects' questionnaires [about workload in an underwater environment] it emerged that for all the test subjects the procedure established for each experimental task was valid. (Toscano, Fubini, and Gaia 2004, 389)

    The usability, reliability and validity of the tool [for auditing in a study into injury reduction of manual tasks] was tested with government inspectors and found to be good. (Straker et al. 2004, 171)


Finally, we note that the real possibility of identifying deficient data evaluation may be small as a consequence of the inadequacy of the applied evaluation procedures themselves. Inadequacies include the side-stepping and non-specification of constituents of the evaluation procedure outlined in Figure 4. These shortcomings render the outcomes of evaluation procedures elastic, poly-interpretable or unspecified. In this way, inadequate data evaluation blurs its own inadequacy, resulting in wide margins for the establishment of positive evaluations in terms of 'reliable' and 'valid'.

    5. Discussion