Current Concepts in Validity and Reliability for Psychometric Instruments: Theory and Application (Am J Med. 2006)

David A. Cook, MD, MHPE, Thomas J. Beckman, MD, FACP

Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minn.





 


The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable; being at once relevant and meaningful.”10 However, the skills required to assess the validity of results from psychometric assessments are different than the skills used in appraising the medical literature11 or interpreting the results of laboratory tests.12 In a recent review of clinical teaching assessment, we found that validity and reliability were frequently misunderstood and misapplied.13 We also have noted that research studies with sound methods often fail to present a broad spectrum of validity evidence supporting the primary outcome.6,14-16

 


Methods for evaluating the validity of results from psychometric assessments derive from theories of psychology and educational assessment.17,18




VALIDITY, CONSTRUCTS, AND MEANINGFUL INTERPRETATION OF INSTRUMENT SCORES

 


Validity refers to “the degree to which evidence and theory support the interpretations of test scores entailed by the proposed uses of tests.”19 In other words, validity describes how well one can legitimately trust the results of a test as interpreted for a specific purpose.



Many instruments measure a physical quantity such as height, blood pressure, or serum sodium. Interpreting the meaning of such results is straightforward.20 In contrast, results from assessments of patient symptoms, student knowledge, or physician attitudes have no inherent meaning. Rather, they attempt to measure an underlying construct, an “intangible collection of abstract concepts and principles.”21 The results of any psychometric assessment have meaning (validity) only in the context of the construct they purport to assess.17



Validity is not a property of the instrument, but of the instrument’s scores and their interpretations.17,19



Note that the instruments in these examples did not change—only the score interpretations.



Because validity is a property of inferences, not instruments, validity must be established for each intended interpretation. Similarly, a patient symptom scale whose scores provided valid inferences under research study conditions or in highly selected patients may need further evaluation before use in a typical clinical practice.


 


A Conceptual Approach to Validity

 


We often read about “validated instruments.” This view is inaccurate. First, we must remember that validity is a property of the inference, not the instrument. Second, the validity of interpretations is always a matter of degree. An instrument’s scores will reflect the underlying construct more accurately or less accurately but never perfectly. 

 


Validity is best viewed as a hypothesis or “interpretive argument” for which evidence is collected in support of proposed inferences.17,23,24 As Downing states,

 

“Validity requires an evidentiary chain which clearly links the interpretation of . . . scores . . . to a network of theory, hypotheses, and logic which are presented to support or refute the reasonableness of the desired interpretations.”21

 


As with any hypothesis-driven research, the hypothesis is clearly stated, evidence is collected to evaluate the most problematic assumptions, and the hypothesis is critically reviewed, leading to a new cycle of tests and evidence “until all inferences in the interpretive argument are plausible, or the interpretive argument is rejected.”25 However, validity can never be proven.

 

 


Sources of Validity Evidence



Evidence should be sought from several different sources to support any given interpretation, and strong evidence from one source does not obviate the need to seek evidence from other sources. While accruing evidence, one should specifically consider two threats to validity: inadequate sampling of the content domain (construct underrepresentation) and factors exerting nonrandom influence on scores (bias, or construct-irrelevant variance).24,27

 


Content.    

 


Content evidence involves evaluating the “relationship between a test’s content and the construct it is intended to measure.”19 The content should represent the truth (construct), the whole truth (construct), and nothing but the truth (construct). Thus, we look at

  • the construct definition,
  • the instrument’s intended purpose,
  • the process for developing and selecting items (the individual questions, prompts, or cases comprising the instrument),
  • the wording of individual items, and
  • the qualifications of item writers and reviewers.

 

Content evidence is often presented as a detailed description of steps taken to ensure that the items represent the construct.28


 


Response Process.



Reviewing the actions and thought processes of test takers or observers (response process) can illuminate the “fit between the construct and the detailed nature of performance . . . actually engaged in.”19 For example, educators might ask, “Do students taking a test intended to assess diagnostic reasoning actually invoke higher-order thinking processes?” They could approach this problem by asking a group of students to “think aloud” as they answer questions. If an instrument requires one person to rate the performance of another, evidence supporting response process might show that raters have been properly trained. Data security and methods for scoring and reporting results also constitute evidence for this category.21

 

 


Internal Structure.



Reliability29,30 (discussed below and in Table 3) and factor analysis31,32 data are generally considered evidence of internal structure.21,31 Scores intended to measure a single construct should yield homogeneous results, whereas scores intended to measure multiple constructs should demonstrate heterogeneous responses in a pattern predicted by the constructs.

 


Furthermore, systematic variation in responses to specific items among subgroups who were expected to perform similarly (termed “differential item functioning”) suggests a flaw in internal structure, whereas confirmation of predicted differences provides supporting evidence in this category.19 For example, if Hispanics consistently answer a question one way and Caucasians answer another way, regardless of other responses, this will weaken (or support, if this was expected) the validity of intended interpretations. This contrasts with subgroup variations in total score, which reflect relations to other variables as discussed next.


 


Relations to Other Variables.

 


Correlation with scores from another instrument or outcome for which correlation would be expected, or lack of correlation where it would not, supports interpretation consistent with the underlying construct.18,33 For example, correlation between scores from a questionnaire designed to assess the severity of benign prostatic hypertrophy and the incidence of acute urinary retention would support the validity of the intended inferences.
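To make this concrete, here is a minimal sketch (not from the article) of the kind of correlation one might compute, relating hypothetical symptom scores to a binary outcome; all data and names are invented for illustration:

```python
import numpy as np

# Hypothetical data: symptom scores and whether acute urinary retention occurred
scores    = np.array([8, 22, 15, 5, 30, 12, 27, 9])
retention = np.array([0,  1,  0, 0,  1,  0,  1, 0])  # 1 = event occurred

# Pearson r with a binary variable (point-biserial correlation)
r = np.corrcoef(scores, retention)[0, 1]
print(f"r = {r:.2f}")  # a substantial positive r would support the intended inference
```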

 

 


Consequences.



Evaluating intended or unintended consequences of an assessment can reveal previously unnoticed sources of invalidity. For example, if a teaching assessment shows that male instructors are consistently rated lower than females, it could represent a source of unexpected bias. It could also mean that males are less effective teachers. Evidence of consequences thus requires a link relating the observations back to the original construct before it can truly be said to influence the validity of inferences.

 


Another way to assess evidence of consequences is to explore whether desired results have been achieved and unintended effects avoided. In the example just cited, if highly rated faculty ostracized those with lower scores, this unexpected negative outcome would certainly affect the meaning of the scores and thus their validity.17 On the other hand, if remediation of faculty with lower scores led to improved performance, it would support the validity of these interpretations. Finally, the method used to determine score thresholds (eg, pass/fail cut scores or classification of symptom severity as low, moderate, or high) also falls under this category.21 Evidence of consequences is the most controversial category of validity evidence and was the least reported evidence source in our recent review of instruments used to assess clinical teaching.34


 


Integrating the Evidence.


"의도한" 또는 "예측한"이란 단어가 자주 사용되었다. 각 근거는 내재된(이론적)구인으로 돌아가서 어떤 관계가 있는지 보아야 하고, 사전에 기술된 관계를 확인하기 위해서 사용될 때 가장 강력하다. 만약 근거가 애초의 타당도 주장을 지지하지 않는다면, 그 주장은 "기각되거나 해석 and/or 측정 절차를 조정하여 향상될 수 있"고, 이후에 그 주장은 다시 평가되어야 한다. 실제로 타당도 평가는 testing과 revision의 지속적 사이클이다.

The words “intended” and “predicted” are used frequently in the above paragraphs. Each line of evidence relates back to the underlying (theoretical) construct and will be most powerful when used to confirm relationships stated a priori.17,25 If evidence does not support the original validity argument, the argument “may be rejected, or it may be improved by adjusting the interpretation and/or the measurement procedure”25 after which the argument must be evaluated anew. Indeed, validity evaluation is an ongoing cycle of testing and revision.17,31,35



The amount of evidence necessary will vary according to the proposed uses of the instrument.

  • High-stakes uses such as board certification demand a high degree of confidence in score interpretations, whereas lower-stakes uses may tolerate a lower degree of confidence.
  • Some instrument types will rely more heavily on certain categories of validity evidence than others.21
  • Observer ratings: internal structure, characterized by high inter-rater agreement.
  • Multiple-choice exams: content evidence.


What About Face Validity?


"안면타당도"라는 표현이 많은 의미를 가지고 있지만, 이는 실질적 검증을 거치지 않고, 그저 겉보기의 타당도를 기술하기 위해 사용된다. 이는 자동차의 속력을 외관만 가지고 추정하는 것과 비슷하다. 이러한 판단은 단순한 찍기이다. Content evidence와 안면타당도는 표면적으로 유사해 보이지만 실제로는 크게 다르다. Content evidence는 systematic, documented 접근법이나, 안면타당도는 그 (평가)도구가 어떻게 생겼는지만 보고 판단하는 것이다.

Although the expression “face validity” has many meanings, it is usually used to describe the appearance of validity in the absence of empirical testing. This is akin to estimating the speed of a car based on its outward appearance or the structural integrity of a building based on a view from the curb. Such judgments amount to mere guesswork. The concepts of content evidence and face validity bear superficial resemblance but are in fact quite different. Whereas content evidence represents a systematic and documented approach to ensure that the instrument assesses the desired construct, face validity bases judgment on the appearance of the instrument.

 


Downing and Haladyna note, “Superficial qualities . . . may represent an essential characteristic of the assessment, but . . . the appearance of validity is not validity.”27 DeVellis37 cites additional concerns about face validity, including

  • fallibility of judgments based on appearance,
  • differing perceptions among developers and users, and
  • instances in which inferring intent from appearance might be counterproductive.

 


For these reasons, we discourage use of this term.




RELIABILITY: NECESSARY, BUT NOT SUFFICIENT, FOR VALID INFERENCES



Reliability refers to the reproducibility or consistency of scores from one assessment to another.19 Reliability is a necessary, but not sufficient, component of validity.21,29 Scores from psychometric instruments are just as susceptible to unreliability as any other measurement, but with one crucial distinction: It is often impractical or even impossible to obtain multiple measurements in a single individual. Thus, it is essential that ample evidence be accumulated to establish the reliability of scores before using an instrument in practice.



There are numerous ways to categorize and measure reliability (Table 3).30,37-41 We would expect that scores measuring a single construct would correlate highly (high internal consistency). If internal consistency is low, it raises the possibility that the scores are, in fact, measuring more than one construct. Reproducibility over time (test-retest), between different versions of an instrument (parallel forms), and between raters (inter-rater) are other measures of reliability.
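As an illustration of internal consistency, here is a minimal sketch of Cronbach's alpha, one commonly reported coefficient (Table 3 lists others); the item-response matrix is invented:

```python
import numpy as np

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix."""
    k = items.shape[1]                          # number of items
    item_vars = items.var(axis=0, ddof=1)       # variance of each item
    total_var = items.sum(axis=1).var(ddof=1)   # variance of total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Hypothetical responses: 5 respondents x 4 items, each rated 1-5
responses = np.array([[4, 5, 4, 4],
                      [2, 2, 3, 2],
                      [5, 4, 5, 5],
                      [3, 3, 2, 3],
                      [1, 2, 1, 2]])
print(f"Cronbach's alpha = {cronbach_alpha(responses):.2f}")
```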


Generalizability studies use analysis of variance to quantify the contribution of each error source to the overall error (unreliability) of the scores, just as analysis of variance does in clinical research.
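As a sketch of the simplest such design, the following estimates variance components for a one-facet study (persons crossed with raters) from two-way ANOVA mean squares; the ratings are invented, and real generalizability studies often involve more facets:

```python
import numpy as np

# Hypothetical ratings: rows = persons (examinees), columns = raters
X = np.array([[7., 8., 7.],
              [4., 5., 4.],
              [9., 9., 8.],
              [5., 6., 6.]])
n_p, n_r = X.shape
grand = X.mean()

# Two-way ANOVA (no replication) mean squares
ms_p = n_r * ((X.mean(axis=1) - grand) ** 2).sum() / (n_p - 1)
ms_r = n_p * ((X.mean(axis=0) - grand) ** 2).sum() / (n_r - 1)
resid = X - X.mean(axis=1, keepdims=True) - X.mean(axis=0, keepdims=True) + grand
ms_pr = (resid ** 2).sum() / ((n_p - 1) * (n_r - 1))

# Variance components: persons, raters, and person-by-rater error
var_p = max((ms_p - ms_pr) / n_r, 0.0)
var_r = max((ms_r - ms_pr) / n_p, 0.0)
var_pr = ms_pr

# Generalizability coefficient for the mean of n_r raters (relative decisions)
g = var_p / (var_p + var_pr / n_r)
print(f"persons={var_p:.2f}  raters={var_r:.2f}  error={var_pr:.2f}  G={g:.2f}")
```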



Reliability constitutes only one form of evidence. It is also important to note that reliability, like validity, is a property of the score and not the instrument itself.30


 


PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN SELECTING AN INSTRUMENT



The American Urological Association Symptom Index1 (AUA-SI, also known as the International Prostate Symptom Score) illustrates how validity evidence can be weighed when selecting an instrument.


Content evidence for AUA-SI scores is abundant and fully supportive.1 The instrument authors reviewed both published and unpublished sources to develop an initial item pool that reflected the desired content domain. Word choice, time frame, and response set were carefully defined. Items were deleted or modified after pilot testing.


Some response process evidence is available. Patient debriefing revealed little ambiguity in wording, except for one question that was subsequently modified.1 Scores from self-administration or interview are similar.49


Internal structure is supported by good to excellent internal consistency and test-retest reliability,1,49,50 although not all studies confirm this.51 Factor analysis confirms two theorized subscales.50,52


In regard to relations to other variables, AUA-SI scores distinguished patients with clinical benign prostatic hypertrophy from young healthy controls,1 correlated with other indices of benign prostatic hypertrophy symptoms,53 and improved after prostatectomy.54 Another study found that patients with a score decrease of 3 points felt slightly improved.51 However, a study found no significant association between scores and urinary peak flow or postvoid residual.55


Evidence of consequences is minimal. Thresholds for mild, moderate, and severe symptoms were developed by comparing scores with global symptom ratings,1 suggesting that such classifications are meaningful. One study56 found that 81% of patients with mild symptoms did not require therapy over 2 years, again supporting the meaning (validity) of these scores.




PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN DEVELOPING AN INSTRUMENT



The first step in developing any instrument is to identify the construct and corresponding content.

  • In our example we could look at residency program objectives and other published objectives such as Accreditation Council for Graduate Medical Education competencies,57 search the literature on qualifications of ideal physicians, or interview faculty and residents.
  • We also should search the literature for previously published instruments, which might be used verbatim or adapted.
  • From the themes (constructs) identified we would develop a blueprint to guide creation of individual questions.
  • Questions would ideally be written by faculty trained in question writing and then checked for clarity by other faculty.


For response process,

  • we would ensure that the response format is familiar to faculty, or if not (eg, if we use computer-based forms), that faculty have a chance to practice with the new format.
  • Faculty should receive training in both learner assessment in general and our form specifically, with the opportunity to ask questions.
  • We would ensure security measures and accurate scoring methods.
  • We could also conduct a pilot study in which we ask faculty to “think out loud” as they observe and rate several residents.


In regard to internal structure,

  • inter-rater reliability is critical, so we would need data to calculate this statistic.
  • Internal consistency is of secondary importance for performance ratings,30 but this and factor analysis would be useful to verify that the themes or constructs we identified during development hold true in practice.


For relations to variables,

  • we could correlate our instrument scores with scores from another instrument assessing clinical performance. Note, however, that this comparison is only as good as the instrument with which comparison is made. Thus, comparing our scores with those from an instrument with little supporting evidence would have limited value.
  • Alternatively, we could compare the scores from our instrument with United States Medical Licensing Examination scores, scores from an in-training exam, or any other variable that we believe is theoretically related to clinical performance.
  • We could also plan to compare results among different subgroups. For example, if we expect performance to improve over time, we could compare scores among postgraduate years.
  • Finally, we could follow residents into fellowship or clinical practice and see whether current scores predict future performance.


Last, we should not neglect evidence of consequences.

  • If we have set a minimum passing score below which remedial action will be taken, we must clearly document how this score was determined.
  • If subgroup analysis reveals unexpected relationships (eg, if a minority group is consistently rated lower than other groups), we should investigate whether this finding reflects on the validity of the test.
  • Finally, if low-scoring residents receive remedial action, we could perform follow-up to determine whether this intervention was effective, which would support the inference that intervention was warranted.



APPENDIX: INTERPRETATION OF RELIABILITY INDICES AND FACTOR ANALYSIS



Acceptable values will vary according to the purpose of the instrument. For high-stakes settings (eg, licensure examination) reliability should be greater than 0.9, whereas for less important situations values of 0.8 or 0.7 may be acceptable.30 Note that the interpretation of reliability coefficients is different than the interpretation of correlation coefficients in other applications, where a value of 0.6 would often be considered quite high.62 Low reliability can be improved by increasing the number of items or observers and (in education settings) using items of medium difficulty.30 Improvement expected from adding items can be estimated using the Spearman-Brown “prophecy” formula (described elsewhere).41
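The prophecy formula itself is short; a sketch with illustrative (invented) numbers:

```python
def spearman_brown(reliability: float, lengthening_factor: float) -> float:
    """Predicted reliability after multiplying test length by the given factor."""
    return (lengthening_factor * reliability) / (1 + (lengthening_factor - 1) * reliability)

# Eg, a scale with reliability 0.70, doubled in length:
print(f"{spearman_brown(0.70, 2):.2f}")  # -> 0.82
```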


A less common, but often more useful,63 measure of score variance is the standard error of measurement (SEM) (not to be confused with the standard error of the mean, which is also abbreviated SEM). The SEM, given by the equation SEM = standard deviation × √(1 − reliability),64 is the “standard deviation of an individual’s observed scores”19 and can be used to develop a confidence interval for an individual’s true score (the true score is the score uninfluenced by random error).
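A worked sketch of the SEM and the confidence interval it yields; the standard deviation, reliability, and observed score below are invented:

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: sd * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Eg, scores with standard deviation 10 and reliability 0.84
s = sem(10, 0.84)  # = 4.0
lo, hi = 75 - 1.96 * s, 75 + 1.96 * s
print(f"SEM = {s:.1f}; 95% CI for an observed score of 75: {lo:.1f} to {hi:.1f}")
```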


Agreement between raters on binary outcomes (eg, heart murmur present: yes or no?) is often reported using kappa, which represents agreement corrected for chance.40 A different but related test, weighted kappa, is necessary when determining inter-rater agreement on ordinally ranked data (eg, Likert scaled responses) to account for the variation in intervals between data points in ordinally ranked data (eg, in a typical 5-point Likert scale the “distance” from 1 to 2 is likely different than the distance from 2 to 3). Landis and Koch65 suggest that kappa less than 0.4 is poor, from 0.4 to 0.75 is good, and greater than 0.75 is excellent.
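A minimal sketch of unweighted kappa for two raters on a binary finding (weighted kappa, as noted above, additionally weights the degree of disagreement); the ratings are invented:

```python
import numpy as np

def cohen_kappa(a: np.ndarray, b: np.ndarray) -> float:
    """Unweighted Cohen's kappa: observed agreement corrected for chance."""
    categories = np.union1d(a, b)
    p_observed = (a == b).mean()
    # Chance agreement: product of the raters' marginal proportions, summed
    p_chance = sum((a == c).mean() * (b == c).mean() for c in categories)
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical: two raters judging "murmur present?" in 10 patients (1 = yes)
rater_a = np.array([1, 0, 1, 1, 0, 0, 1, 0, 1, 0])
rater_b = np.array([1, 0, 1, 0, 0, 0, 1, 0, 1, 1])
print(f"kappa = {cohen_kappa(rater_a, rater_b):.2f}")  # -> 0.60, "good"
```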


Factor analysis32 is used to investigate relationships between items in an instrument and the constructs they are intended to measure.
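As a sketch of how this works in practice, the following simulates six items driven by two latent constructs and recovers that structure with scikit-learn's FactorAnalysis; the data, loadings, and two-factor assumption are all invented:

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)

# Simulate 200 respondents answering 6 items: items 0-2 reflect construct 1,
# items 3-5 reflect construct 2, plus random noise
latent = rng.normal(size=(200, 2))
loadings = np.array([[1.0, 0.0], [0.9, 0.0], [0.8, 0.0],
                     [0.0, 1.0], [0.0, 0.9], [0.0, 0.8]])
items = latent @ loadings.T + 0.3 * rng.normal(size=(200, 6))

fa = FactorAnalysis(n_components=2).fit(items)
print(np.round(fa.components_, 2))  # items should separate onto the two factors
```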


41. Traub RE, Rowley GL. An NCME instructional module on understanding reliability. Educational Measurement: Issues and Practice. 1991;10(1):37-45.










Am J Med. 2006 Feb;119(2):166.e7-16.

Current concepts in validity and reliability for psychometric instruments: theory and application.


Abstract

Validity and reliability relate to the interpretation of scores from psychometric instruments (eg, symptom scales, questionnaires, education tests, and observer ratings) used in clinical practice, research, education, and administration. Emerging paradigms replace prior distinctions of face, content, and criterion validity with the unitary concept "construct validity," the degree to which a score can be interpreted as representing the intended underlying construct. Evidence to support the validity argument is collected from 5 sources:

CONTENT: Do instrument items completely represent the construct?

RESPONSE PROCESS: The relationship between the intended construct and the thought processes of subjects or observers.

INTERNAL STRUCTURE: Acceptable reliability and factor structure.

RELATIONS TO OTHER VARIABLES: Correlation with scores from another instrument assessing the same construct.

CONSEQUENCES: Do scores really make a difference?

Evidence should be sought from a variety of sources to support a given interpretation. Reliable scores are necessary, but not sufficient, for valid interpretation. Increased attention to the systematic collection of validity evidence for scores from psychometric instruments will improve assessments in research, patient care, and education.

PMID: 16443422

