Validity: evidence for the meaningful interpretation of assessment results (Med Educ, 2003)

Validity: on the meaningful interpretation of assessment data

Steven M Downing





Validity is the evidence presented to support or refute the meaning or interpretation assigned to assessment results. All assessments require validity evidence, and nearly every topic in assessment involves validity in some way. Validity is an indispensable element of assessment; without it, an assessment has little or no meaning.

Validity refers to the evidence presented to support or refute the meaning or interpretation assigned to assessment results. All assessments require validity evidence and nearly all topics in assessment involve validity in some way. Validity is the sine qua non of assessment, as without evidence of validity, assessments in medical education have little or no intrinsic meaning.


Validity is always approached as a hypothesis. The interpretive meaning the assessor intends is associated with the assessment results, data are collected on the basis of this initial hypothesis, and the results either support or refute the validity hypothesis. Under this conceptualization, assessment data can be more or less valid for some specific purpose, meaning or interpretation, and perhaps only at a specific point in time or for a specific population. An assessment itself can never be said to be 'valid' or 'invalid'; one can only say that, at a particular point in time, scientifically sound evidence exists that supports or refutes a given interpretation of the assessment scores.

Validity is always approached as hypothesis, such that the desired interpretative meaning associated with assessment data is first hypothesized and then data are collected and assembled to support or refute the validity hypothesis. In this conceptualization, assessment data are more or less valid for some very specific purpose, meaning or interpretation, at a given point in time and only for some well-defined population. The assessment itself is never said to be ‘valid’ or ‘invalid’; rather, one speaks of the scientifically sound evidence presented to either support or refute the proposed interpretation of assessment scores, at a particular time period in which the validity evidence was collected.


The contemporary view is that validity is a unitary concept that draws on multiple sources of evidence. The sources of evidence are typically suggested logically by the intended interpretation or meaning. In the current framework, all validity is construct validity, as Messick described most eloquently and as the Standards of Educational and Psychological Measurement embody. In the past, validity was divided into three distinct types: content, criterion and construct, with criterion-related validity often subdivided into concurrent and predictive depending on when the criterion data were collected.

In its contemporary conceptualization,1,3–14 validity is a unitary concept, which looks to multiple sources of evidence. These evidentiary sources are typically logically suggested by the desired types of interpretation or meaning associated with measures. All validity is construct validity in this current framework, described most eloquently by Messick8 and embodied in the current Standards of Educational and Psychological Measurement.1 In the past, validity was defined as three separate types: content, criterion and construct, with criterion-related validity usually subdivided into concurrent and predictive depending on the timing of the collection of the criterion data.2,15


Why is construct validity now the sole type of validity? The complex answer lies in the philosophy of science: making meaningful and reasonable inferences from sampled content to a domain or larger population involves countless webs of inter-related inference.

Why is construct validity now considered the sole type of validity? The complex answer is found in the philosophy of science8 from which, it is posited, there are many complex webs of inter-related inference associated with sampling content in order to make meaningful and reasonable inferences to a domain or larger population of interest.


The more direct answer is this: nearly all assessment is social science, and medical education is no exception. Such assessments deal with constructs, intangible collections of abstract concepts and principles that are inferred from behaviour and explained by educational or psychological theory. Educational achievement, too, is a construct, inferred from written tests over a well-defined knowledge domain, oral examinations over specific problems or cases, or standardized patient examinations of history-taking and communication.

The more straightforward answer is: nearly all assessments in the social sciences, including medical education, deal with constructs – intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory. Educational achievement is a construct, usually inferred from performance on assessments such as written tests over some well-defined domain of knowledge, oral examinations over specific problems or cases in medicine, or highly structured standardized patient examinations of history-taking or communication skills.


Educational ability or aptitude is another familiar example of a construct. This construct is even more intangible and abstract than achievement, because there is less agreement about its meaning among educators and psychologists. Tests that measure educational ability, such as the MCAT, are heavily relied upon for medical school admission in North America, so supporting the validity of using the MCAT requires presenting scientifically sound evidence from multiple sources. One important source of validity evidence is showing how well MCAT scores predict achievement after admission to medical school.

Educational ability or aptitude is another example of a familiar construct – a construct that may be even more intangible and abstract than achievement because there is less agreement about its meaning among educators and psychologists.16 Tests that purport to measure educational ability, such as the Medical College Admissions Test (MCAT), which is relied on heavily in North America for selecting prospective students for medical school admission, must present scientifically sound evidence, from multiple sources, to support the reasonableness of using MCAT test scores as one important selection criterion for admitting students to medical school. An important source of validity evidence for an examination such as the MCAT is likely to be the predictive relationship between test scores and medical school achievement.


Validity requires an evidentiary chain linking the interpretation of assessment scores to the theory, hypotheses and logic that support or refute the reasonableness of the intended interpretation. Validity can never simply be assumed; it is an ongoing process of generating hypotheses, collecting data, testing, critically evaluating and reasoning logically. The validity argument relates theory and empirical evidence in ways that indicate which interpretations are reasonable and which are not.

Validity requires an evidentiary chain which clearly links the interpretation of the assessment scores or data to a network of theory, hypotheses and logic which are presented to support or refute the reasonableness of the desired interpretations. Validity is never assumed and is an ongoing process of hypothesis generation, data collection and testing, critical evaluation and logical inference. The validity argument11,12 relates theory, predicted relationships and empirical evidence in ways to suggest which particular interpretative meanings are reasonable and which are not reasonable for a specific assessment use or application.


To interpret scores meaningfully, some assessments, such as achievement tests of knowledge, may require fairly direct content-related evidence of the adequacy of the tested content, evidence of score reproducibility, the statistical quality of the items, and the basis for passing scores or grades. Other kinds of assessment, such as performance examinations, require different evidence.

In order to meaningfully interpret scores, some assessments, such as achievement tests of cognitive knowledge, may require fairly straightforward content-related evidence of the adequacy of the content tested (in relationship to instructional objectives), statistical evidence of score reproducibility and item statistical quality and evidence to support the defensibility of passing scores or grades. Other types of assessments, such as complex performance examinations, may require both evidence related to content and considerable empirical data demonstrating the statistical relationship between the performance examination and other measures of medical ability, the generalizability of the sampled cases to the population of skills, the reproducibility of the score scales, the adequacy of the standardized patient training and so on.



Typical sources of validity evidence, which can vary with the purpose of the assessment and the intended interpretation, include the following.

Some typical sources of validity evidence, depending on the purpose of the assessment and the desired interpretation are: 

  • evidence of the content representativeness of the test materials, 
  • the reproducibility and generalizability of the scores, 
  • the statistical characteristics of the assessment questions or performance prompts, 
  • the statistical relationship between and among other measures of the same (or different but related) constructs or traits, 
  • evidence of the impact of assessment scores on students and 
  • the consistency of pass–fail decisions made from the assessment scores.


The higher the stakes of an assessment (e.g. licensure or certification examinations), the greater the need to collect validity evidence from multiple sources and to re-evaluate it continually.

The higher the stakes associated with assessments, the greater the requirement for validity evidence from multiple sources, collected on an ongoing basis and continually re-evaluated.17 The ongoing documentation of validity evidence for a very high-stakes testing programme, such as a licensure or medical specialty certification examination, may require the allocation of many resources and the contributions of many different professionals with a variety of skills – content specialists, psychometricians and statisticians, test editors and administrators. 




Sources of evidence for construct validity


According to the Standards, 'validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests'. The current Standards fully embrace the unitary view of validity (all validity is construct validity), in which validity is the process of carefully defining the construct, then gathering and assembling data and evidence to support or refute some very specific interpretation. Historically, the methods of validation and the evidence associated with construct validity rest largely on the early work of Cronbach, Cronbach and Meehl, and Messick. The earliest unitary conceptualization dates to a 1957 paper by Loevinger; Kane placed validity in the context of an interpretive argument that must be established for each assessment.

According to the Standards: ‘Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests’1 (p. 9). The current Standards1 fully embrace this unitary view of validity, following closely on Messick’s work8,9 that considers all validity as construct validity, which is defined as an investigative process through which constructs are carefully defined, and data and evidence are gathered and assembled to form an argument either supporting or refuting some very specific interpretation of assessment scores.11,12 Historically, the methods of validation and the types of evidence associated with construct validity have their foundations on much earlier work by Cronbach,3–5 Cronbach and Meehl6 and Messick.7 The earliest unitary conceptualization of validity as construct validity dates to 1957 in a paper by Loevinger.18 Kane11–13 places validity into the context of an interpretive argument, which must be established for each assessment; Kane’s work has provided a useful framework for validity and validation research. 



The Standards


Five sources of evidence

The Standards1 discuss five distinct sources of validity evidence (Table 1)


Depending on the type of assessment, one source of validity evidence is often emphasized more than the others.

Some types of assessment demand a stronger emphasis on one or more sources of evidence as opposed to other sources and not all sources of data or evidence are required for all assessments. 

  • For example, a written, objectively scored test covering several weeks of instruction in microbiology might emphasize content-related evidence, together with some evidence of response quality, internal structure and consequences, but very likely would not seek much or any evidence concerning relationship to other variables.
  • On the other hand, a high-stakes summative Objective Structured Clinical Examination (OSCE), using standardized patients to portray and rate student performance on an examination that must be passed in order to proceed in the curriculum, might require all of these sources of evidence.





Sources of validity evidence for example assessments


Scores by themselves have no meaning. The evidence must therefore provide a logical basis for interpreting the scores from a given assessment in the intended way.

The scores have little or no intrinsic meaning; thus the evidence presented must convince the skeptic that the assessment scores can reasonably be interpreted in the proposed manner.



Content evidence


• Examination blueprint

• Representativeness of test blueprint to achievement domain

• Test specifications

• Match of item content to test specifications

• Representativeness of items to domain

• Logical/empirical relationship of content tested to achievement domain

• Quality of test questions

• Item writer qualifications

• Sensitivity review

For a written test, documentation of validity evidence related to the content tested is the most essential; it is expressed in the test blueprint and test specifications.

For the written assessment, documentation of validity evidence related to the content tested is the most essential. The outline and plan for the test, described by a detailed test blueprint or test specifications, clearly relates the content tested by the 250 MCQs to the domain of the basic sciences as described by the course learning objectives. The test blueprint is sufficiently detailed to describe subcategories and subclassifications of content and specifies precisely the proportion of test questions in each category and the cognitive level of those questions. The blueprint documentation shows a direct linkage of the questions on the test to the instructional objectives. 
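The blueprint's proportionate weighting can be audited mechanically by comparing the realized item counts against the blueprint's target proportions. A hypothetical sketch (the function name, category names and tolerance are illustrative assumptions, not from the paper):

```python
def check_blueprint(blueprint, item_counts, tol=0.02):
    """Compare the proportion of items written per content category
    against the blueprint's target proportions.

    blueprint: {category: target_proportion}, targets summing to 1.0
    item_counts: {category: number_of_items_written}
    Returns the categories whose realized proportion deviates from the
    target by more than tol, mapped to (target, actual).
    """
    total = sum(item_counts.values())
    deviations = {}
    for category, target in blueprint.items():
        actual = item_counts.get(category, 0) / total
        if abs(actual - target) > tol:
            deviations[category] = (target, round(actual, 3))
    return deviations
```

For a 250-item examination, a blueprint assigning 40% to one course and 60% to another would be satisfied by 100 and 150 items respectively; any drift beyond the tolerance is flagged for the content committee.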


Independent content experts can judge whether the test blueprint is reasonable. The relationship between the test items and the major learning objectives and teaching-learning activities should be clear. If most learning objectives are at the application or problem-solving level, the test items should be written at those cognitive levels as well.

Independent content experts can evaluate the reasonableness of the test blueprint with respect to the course objectives and the cognitive levels tested. The logical relationship between the content tested by the 250 MCQs and the major instructional objectives and teaching⁄ learning activities of the course should be obvious and demonstrable, especially with respect to the proportionate weighting of test content to the actual emphasis of the basic science courses taught. Further, if most learning objectives were at the application or problem-solving level, most test questions should also be directed to these cognitive levels.


The quality of the test questions is also a source of content-related validity evidence.

The quality of the test questions is a source of content-related validity evidence. 

    • Do the MCQs adhere to the best evidence-based principles of effective item-writing?19
    • Are the item-writers qualified as content experts in the disciplines?
    • Are there sufficient numbers of questions to adequately sample the large content domain?
    • Have the test questions been edited for clarity, removing all ambiguities and other common item flaws?
    • Have the test questions been reviewed for cultural sensitivity?


The same content issues apply to SP examinations. If there are 10 SP cases, those 10 cases must be representative, for example, of ambulatory primary care encounters.

For the SP performance examination, some of the same content issues must be documented and presented as validity evidence. 


For example, each of the 10 SP cases fits into a detailed content blueprint of ambulatory primary care history and physical examination skills. There is evidence of faculty content–expert agreement that these specific 10 cases are representative of primary care ambulatory cases. Ideally, the content of the 10 clinical cases is related to population demographic data and population data on disease incidence in primary care ambulatory settings. 


There should also be evidence that clinical experts jointly wrote, reviewed and revised the SP cases, including the checklists and rating criteria. Competent editing of the cases, detailed guidelines provided to the SPs, rating criteria prepared and reviewed by faculty experts, and training delivered by experienced SP trainers are all important.

Evidence is documented that expert clinical faculty have created, reviewed and revised the SP cases together with the checklists and ratings scales used by the SPs, while other expert clinicians have reviewed and critically critiqued the SP cases. Exacting specifications detail all the essential clinical information to be portrayed by the SP. Evidence that SP cases have been competently edited and that detailed SP training guidelines and criteria have been prepared, reviewed by faculty experts and implemented by experienced SP trainers are all important sources of content-related validity evidence.


While the SP examination is being administered, the SP portrayals must be monitored closely so that every student experiences nearly the same case. Different SPs trained on the same case should rate student performance about the same.

There is documentation that during the time of SP administration, the SP portrayals are monitored closely to ensure that all students experience nearly the same case. Data are presented to show that a different SP, trained on the same case, rates student case performance about the same. Many basic quality-control issues concerning performance examinations contribute to the content-related validity evidence for the assessment.20





Response process


• Student format familiarity

• Quality control of electronic scanning/scoring

• Key validation of preliminary scores

• Accuracy in combining different formats scores

• Quality control/accuracy of final scores/marks/grades

• Subscore/subscale analyses

• Accuracy of applying pass-fail decision rules to scores

• Quality control of score reporting to students/faculty

• Understandable/accurate descriptions/interpretations of scores for students


Response process may seem a strange source of validity evidence. Here it refers to whether all relevant sources of error associated with administering the test have been controlled or eliminated to the greatest extent possible.

As a source of validity evidence, response process may seem a bit strange or inappropriate. Response process is defined here as evidence of data integrity such that all sources of error associated with the test administration are controlled or eliminated to the maximum extent possible. Response process has to do with aspects of assessment such as ensuring 

  • the accuracy of all responses to assessment prompts, 
  • the quality control of all data flowing from assessments, 
  • the appropriateness of the methods used to combine various types of assessment scores into one composite score and 
  • the usefulness and the accuracy of the score reports provided to examinees.


For a written test, it is important to document all test administration procedures, the information provided about the test, and the instructions given to students. Documentation of every quality-control procedure used to ensure the accuracy of test scores is an important source of evidence. One example is final key validation after preliminary scoring: confirming the accuracy of the scoring key and removing poorly performing items from the final scoring.

For evidence of response process for the written comprehensive examination, documentation of all practice materials and written information about the test and instructions to students is important. Documentation of all quality-control procedures used to ensure the absolute accuracy of test scores is also an important source of evidence: the final key validation after a preliminary scoring – to ensure the accuracy of the scoring key and eliminate from final scoring any poorly performing test items; a rationale for any combining rules, such as the combining into one final composite score of MCQ, multiple true–false and short-essay question scores.


For an SP examination, there must be data demonstrating the accuracy of the SP ratings, together with the score calculation and reporting methods and their rationale, and in particular explanatory materials on the appropriate interpretation of performance-assessment scores.

For the SP performance examination, many of the same response process sources may be presented as validity evidence. For a performance examination, documentation demonstrating the accuracy of the SP rating is needed and the results of an SP accuracy study is a particularly important source of response process evidence. Basic quality control of the large amounts of data from an SP performance examination is important to document, together with information on score calculation and reporting methods, their rationale and, particularly, the explanatory materials discussing an appropriate interpretation of the performance-assessment scores (and their limitations).


There should also be evidence of the rationale for choosing between global ratings and checklist ratings.

Documentation of the rationale for using global versus checklist rating scores, for example, may be an important source of response evidence for the SP examination. Or, the empirical evidence and logical rationale for combining a global rating-scale score with checklist item scores to form a composite score may be one very important source of response evidence.




Internal structure


• Item analysis data:

1. Item difficulty/discrimination

2. Item/test characteristic curves (ICCs/TCCs)

3. Inter-item correlations

4. Item-total correlations

• Score scale reliability

• Standard errors of measurement (SEM)

• Generalizability

• Dimensionality

• Item factor analysis

• Differential Item Functioning (DIF)

• Psychometric model


Internal structure concerns the statistical and psychometric characteristics of the assessment.

Internal structure, as a source of validity evidence, relates to the statistical or psychometric characteristics of the examination questions or performance prompts, the scale properties – such as reproducibility and generalizability, and the psychometric model used to score and scale the assessment.


Item analysis

Many of the statistical analyses needed to support or refute evidence of the test’s internal structure are often carried out as routine quality-control procedures. Analyses such as item analyses – which compute 

  • the difficulty (or easiness) of each test question (or performance prompt), 
  • the discrimination of each question (a statistical index indicating how well the question separates the high-scoring from the low-scoring examinees) and 
  • a detailed count of the number or proportion of examinees who responded to each option of the test question 

– are completed.
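Routine item-analysis statistics like these take only a few lines to compute. A hypothetical illustration (the function name and data layout are assumptions, not from the paper): difficulty is the proportion answering correctly, and discrimination is the corrected item-total (point-biserial) correlation.

```python
import numpy as np

def item_analysis(responses, key):
    """Basic item analysis for an objectively scored test.

    responses: (n_examinees, n_items) array of chosen option indices
    key: (n_items,) array of correct option indices
    Returns per-item difficulty (proportion correct), discrimination
    (correlation of the item with the rest-of-test score) and a count
    of how many examinees chose each option of each item.
    """
    responses = np.asarray(responses)
    scored = (responses == key).astype(float)      # 1 = correct, 0 = incorrect
    total = scored.sum(axis=1)                     # each examinee's raw score
    difficulty = scored.mean(axis=0)               # classical p-value per item
    discrimination = []
    for j in range(scored.shape[1]):
        rest = total - scored[:, j]                # exclude the item itself
        if scored[:, j].std() == 0 or rest.std() == 0:
            discrimination.append(0.0)             # undefined: no variance
        else:
            discrimination.append(float(np.corrcoef(scored[:, j], rest)[0, 1]))
    n_options = int(responses.max()) + 1
    option_counts = [np.bincount(responses[:, j], minlength=n_options)
                     for j in range(responses.shape[1])]
    return difficulty, np.array(discrimination), option_counts
```

Items with very low difficulty or near-zero (or negative) discrimination would be flagged for review before final scoring, as the key-validation step above describes.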


Reliability is an important aspect of validity evidence: without reliability there is no validity.

Reliability is an important aspect of an assessment’s validity evidence. Unless assessment scores are reliable and reproducible (as in an experiment) it is nearly impossible to interpret the meaning of those scores – thus, validity evidence is lacking.
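Score-scale reliability for a written test is commonly estimated with an internal-consistency coefficient such as Cronbach's alpha. A minimal sketch, illustrative only (the paper does not prescribe a particular coefficient for its examples):

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for an (n_examinees, n_items) item-score matrix.

    alpha = k/(k-1) * (1 - sum of item variances / variance of totals)
    """
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]
    item_vars = scores.var(axis=0, ddof=1)         # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars.sum() / total_var)
```

Perfectly consistent items yield alpha = 1, while items that covary not at all yield alpha near 0; the reliability estimate in turn feeds the standard error of measurement discussed under consequences.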


The reproducibility of the pass–fail decision is very important. If the ultimate outcome of the assessment (passing or failing) cannot be reproduced at a high level of certainty, meaningful interpretation of the test scores is impossible.

In both example assessments described above, in which the stakes are high and a passing score has been established, the reproducibility of the pass–fail decision is a very important source of validity evidence. That is, analogous to score reliability, if the ultimate outcome of the assessment (passing or failing) can not be reproduced at some high level of certainty, the meaningful interpretation of the test scores is questionable and validity evidence is compromised.


For performance assessments such as SP examinations, there is a specialized type of reliability derived from generalizability theory.

For performance examinations, such as the SP example, a very specialized type of reliability, derived from generalizability theory (GT)21,22 is an essential component of the internal structure aspect of validity evidence. GT is concerned with how well the specific samples of behaviour (SP cases) can be generalized to the population or universe of behaviours. 


GT is also useful for identifying the sources of error.

GT is also a useful tool for estimating the various sources of contributed error in the SP exam, such as error due to the SP raters, error due to the cases (case specificity), and error associated with examinees. As rater error and case specificity are major threats to meaningful interpretation of SP scores, GT analyses are important sources of validity evidence for most performance assessments such as OSCEs, SP exams and clinical performance examinations. 
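A one-facet G study for a fully crossed persons x raters design can be sketched from the ANOVA expected mean squares. This is an illustrative sketch, not the paper's analysis; the function names are invented, and the standard p x r expected-mean-square equations are assumed:

```python
import numpy as np

def g_study_p_by_r(X):
    """One-facet G study for a fully crossed persons x raters design.

    X: (n_persons, n_raters) score matrix.
    Returns estimated variance components for persons, raters and
    the residual (person x rater interaction confounded with error).
    """
    X = np.asarray(X, dtype=float)
    n_p, n_r = X.shape
    grand = X.mean()
    p_means = X.mean(axis=1)
    r_means = X.mean(axis=0)
    ss_p = n_r * ((p_means - grand) ** 2).sum()
    ss_r = n_p * ((r_means - grand) ** 2).sum()
    ss_pr = ((X - grand) ** 2).sum() - ss_p - ss_r
    ms_p = ss_p / (n_p - 1)
    ms_r = ss_r / (n_r - 1)
    ms_pr = ss_pr / ((n_p - 1) * (n_r - 1))
    var_p = max((ms_p - ms_pr) / n_r, 0.0)   # person (true-score) variance
    var_r = max((ms_r - ms_pr) / n_p, 0.0)   # rater severity variance
    return var_p, var_r, ms_pr               # ms_pr estimates the residual

def g_coefficient(var_p, var_res, n_raters):
    """Relative G coefficient for a D study with n_raters raters per person."""
    return var_p / (var_p + var_res / n_raters)
```

The same decomposition extends to cases as a facet, which is how case specificity is quantified; a D study then asks how many cases or raters are needed to push the G coefficient to an acceptable level.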


When sophisticated statistical measurement models such as IRT are used, the measurement model itself is evidence of internal structure and construct validity, together with the factor structure, inter-item correlation structure and other structural characteristics.

For some assessment applications, in which sophisticated statistical measurement models like Item Response Theory (IRT) models23,24 are used, the measurement model itself is evidence of the internal structure aspect of construct validity. In IRT applications, which might be used for tests such as the comprehensive written examination example, the factor structure, item-intercorrelation structure and other internal structural characteristics all contribute to validity evidence.


Issues of bias and fairness are also important. All assessments are administered to heterogeneous groups of examinees, so statistical bias is possible. Bias analyses such as differential item functioning (DIF) and the sensitivity review of items or performance prompts are both sources of internal structure validity evidence.

Issues of bias and fairness also pertain to internal test structure and are important sources of validity evidence. All assessments, presented to heterogeneous groups of examinees, have the potential of validity threats from statistical bias. Bias analyses, such as differential item functioning (DIF)25,26 analyses and the sensitivity review of item and performance prompts are sources of internal structure validity evidence. Documentation of the absence of statistical test bias permits the desired score interpretation and therefore adds to the validity evidence of the assessment.
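One widely used DIF procedure is the Mantel-Haenszel common odds ratio, computed across strata of examinees matched on ability (usually total score). This is an illustrative implementation, not the paper's; the function name is invented:

```python
import numpy as np

def mantel_haenszel_dif(item, group, strata):
    """Mantel-Haenszel common odds ratio for one dichotomous item.

    item: 0/1 correctness per examinee
    group: 0 = reference group, 1 = focal group
    strata: matching variable per examinee (e.g. total test score)
    Returns the MH odds ratio; values near 1 suggest little DIF,
    values far from 1 suggest the item favours one group.
    """
    item, group, strata = map(np.asarray, (item, group, strata))
    num = den = 0.0
    for s in np.unique(strata):
        m = strata == s
        a = np.sum((group[m] == 0) & (item[m] == 1))  # reference, correct
        b = np.sum((group[m] == 0) & (item[m] == 0))  # reference, incorrect
        c = np.sum((group[m] == 1) & (item[m] == 1))  # focal, correct
        d = np.sum((group[m] == 1) & (item[m] == 0))  # focal, incorrect
        t = a + b + c + d
        if t == 0:
            continue
        num += a * d / t
        den += b * c / t
    return num / den if den > 0 else float("inf")
```

Because examinees are matched on total score first, an odds ratio far from 1 indicates a group difference on the item beyond any overall ability difference, which is the pattern a DIF review would investigate.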


Relationship to other variables


• Correlation with other relevant variables
• Convergent correlations - internal/external:
1. Similar tests
• Divergent correlations - internal/external:
1. Dissimilar measures
• Test-criterion correlations
• Generalizability of evidence

This is the design of a typical 'validity study': a new measure is compared against an existing, established one.

This familiar source of validity evidence is statistical and correlational. The correlation or relationship of assessment scores to a criterion measure’s scores is a typical design for a ‘validity study’, in which some newer (or simpler or shorter) measure is ‘validated’ against an existing, older measure with well known characteristics.


Here, both confirmatory and counter-confirmatory evidence are sought.

This source of validity evidence embodies all the richness and complexity of the contemporary theory of validity in that the relationship to other variables aspect seeks both confirmatory and counter-confirmatory evidence. For example, it may be important to collect correlational validity evidence which shows a strong positive correlation with some other measure of the same achievement or ability and evidence indicating no correlation (or a strong negative correlation) with some other assessment that is hypothesized to be a measure of some completely different achievement or ability.


This is related to the multitrait-multimethod design first described by Campbell and Fiske.

The concept of convergence and divergence of validity evidence is best exemplified in the classic research design first described by Campbell and Fiske.27 In this ‘multitrait multimethod’ design, different measures of the same trait (achievement, ability, performance) are correlated with one another and with measures of different traits. The resulting pattern of correlation coefficients may show the convergence and divergence of the different assessment methods on measures of the same and different abilities or proficiencies.
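The multitrait-multimethod pattern can be illustrated with simulated data: two unrelated 'traits', each measured by two 'methods'. This is a hypothetical sketch (the variable names and the noise level are assumptions, not from Campbell and Fiske); the same trait measured by different methods should correlate highly (convergence), while different traits should not (divergence).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
# Two uncorrelated latent traits (e.g. history-taking vs. factual knowledge).
trait = rng.standard_normal((n, 2))
# Each trait measured by two methods, each with method-specific noise.
measures = {
    "t1_checklist": trait[:, 0] + 0.3 * rng.standard_normal(n),
    "t1_global":    trait[:, 0] + 0.3 * rng.standard_normal(n),
    "t2_checklist": trait[:, 1] + 0.3 * rng.standard_normal(n),
    "t2_global":    trait[:, 1] + 0.3 * rng.standard_normal(n),
}
names = list(measures)
M = np.corrcoef(np.array([measures[k] for k in names]))  # 4 x 4 MTMM matrix
convergent = M[0, 1]   # same trait, different methods: expected high
divergent = M[0, 2]    # different traits: expected near zero
```

The full 4 x 4 matrix is the MTMM table: high monotrait-heteromethod entries and low heterotrait entries are the pattern that supports construct validity.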


For a written examination, one can examine the correlation of total and subscale scores with other achievement examinations.

In the written comprehensive examination example, it may be important to document the correlation of total and subscale scores with achievement examinations administered during the basic science courses.



Consequences


• Impact of test scores/results on students/society

• Consequences on learners/future learning

• Positive consequences outweigh unintended negative consequences?

• Reasonableness of method of establishing pass-fail (cut) score

• Pass-fail consequences:

1. P/F decision reliability - classification accuracy

2. Conditional standard error of measurement at pass score (CSEM)

• False positives/negatives

• Instructional/learner consequences


Although solidly embodied in the current Standards, this is the most controversial aspect of validity evidence. It concerns the impact of test scores, decisions and outcomes on examinees, and the impact of assessment on teaching and learning. The consequences of assessment for examinees, faculty, patients and society can be enormous, and they can be positive or negative, intended or unintended.

This aspect of validity evidence may be the most controversial, although it is solidly embodied in the current Standards.1 The consequential aspect of validity refers to the impact on examinees from the assessment scores, decisions and outcomes, and the impact of assessments on teaching and learning. The consequences of assessments on examinees, faculty, patients and society can be great and these consequences can be positive or negative, intended or unintended.


High-stakes examinations abound in North America. The consequences of failing such an examination are profound: decisions about who enters medical school or who is licensed to practise medicine carry great costs.

High-stakes examinations abound in North America, especially in medicine and medical education. Extremely high-stakes assessments are often mandated as the final, summative hurdle in professional education. The consequences of failing any of these examinations are enormous, in that medical education is interrupted in a costly manner or the examinee is not permitted to enter graduate medical education or practice medicine.


The same is true of specialty and subspecialty certification examinations. False positives impose costs on patients; false negatives impose costs on the candidates themselves.

Likewise, most medical specialty boards in the USA mandate passing a high-stakes certification examination in the specialty or subspecialty, after meeting all eligibility requirements of postgraduate training. The consequences of passing or failing these types of examinations are great, as false positives (passing candidates who should fail) may do harm to patients through the lack of a physician’s skill and specialized knowledge, or false negatives (unjustly failing candidates who should pass) may harm individual candidates who have invested a great deal of time and resources in graduate medical education.


It must be shown that no harm comes from the test results or, at the very least, that the good outweighs the harm.

Evidence related to consequences of testing and its outcomes is presented to suggest that no harm comes directly from the assessment or, at the very least, more good than harm arises from the assessment.



Passing rates, judgments of the appropriateness of the passing rates (and the passing score), and correlations with the results of other examinations.

In both example assessments, sources of consequential validity may relate to issues such as 

  • passing rates (the proportion who pass), 
  • the subjectively judged appropriateness of these passing rates, 
  • data comparing the passing rates of each of these examinations to other comprehensive examinations such as the USMLE Step 1 and so on.


The passing score, the procedure used to determine it, and its statistical properties are all part of validity. Documentation of how the pass–fail cut score was established, and the rationale for the method chosen, are also important.

The passing score (or grade levels) and the process used to determine the cut scores, the statistical properties of the passing scores, and so on all relate to the consequential aspects of validity.28 Documentation of the method used to establish a pass–fail score is key consequential evidence, as is the rationale for the selection of a particular passing score method.


Other psychometric quality indicators

Other psychometric quality indicators concerning the passing score and its consequences (for both example assessments) include 

  • a formal, statistical estimation of the pass–fail decision reliability or classification accuracy29 and 
  • some estimation of the standard error of measurement at the cut score.30
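The standard error of measurement at the cut score, and the resulting borderline band of uncertain decisions, are simple to compute. A minimal sketch, not from the paper (the function names are illustrative, and the classical formula SEM = SD x sqrt(1 - reliability) is assumed):

```python
import math

def sem(sd, reliability):
    """Classical standard error of measurement: SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def borderline_band(cut, sd, reliability, z=1.0):
    """Score band within z SEMs of the cut score.

    Examinees scoring inside this band have pass-fail classifications
    that are the least certain, so false positives and false negatives
    concentrate here.
    """
    e = z * sem(sd, reliability)
    return cut - e, cut + e
```

For example, with a score SD of 10 and reliability of 0.91, the SEM is 3.0, so with a cut score of 70 the one-SEM borderline band runs from 67 to 73; reporting this band alongside the cut score is one way to document decision uncertainty.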


Equally important consequences of assessment methods on instruction and learning have been discussed by Newble and Jaeger.31 The methods and strategies selected to evaluate students can have a profound impact on exactly what is taught, how and what students learn, how this learning is used and retained (or not) and how students view and value the educational process.









Med Educ. 2003 Sep;37(9):830-7.

Validity: on meaningful interpretation of assessment data.

Author information

  • Department of Medical Education, College of Medicine, University of Illinois at Chicago, 60612-7309, USA. sdowning@uic.edu

Abstract

CONTEXT:

All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets. Five sources--content, response process, internal structure, relationship to other variables and consequences--are noted by the Standards for Educational and Psychological Testing as fruitful areas to seek validity evidence.

PURPOSE:

The purpose of this article is to discuss construct validity in the context of medical education and to summarize, through example, some typical sources of validity evidence for a written and a performance examination.

SUMMARY:

Assessments are not valid or invalid; rather, the scores or outcomes of assessments have more or less evidence to support (or refute) a specific interpretation (such as passing or failing a course). Validity is approached as hypothesis and uses theory, logic and the scientific method to collect and assemble data to support or fail to support the proposed score interpretations, at a given point in time. Data and logic are assembled into arguments--pro and con--for some specific interpretation of assessment data. Examples of types of validity evidence, data and information from each source are discussed in the context of a high-stakes written and performance examination in medical education.

CONCLUSION:

All assessments require evidence of the reasonableness of the proposed interpretation, as test data in education have little or no intrinsic meaning. The constructs purported to be measured by our assessments are important to students, faculty, administrators, patients and society and require solid scientific evidence of their meaning.

PMID: 14506816

