Threats to the Validity of Locally Developed Multiple-Choice Tests in Medical Education: Construct-Irrelevant Variance and Construct Underrepresentation (Adv Health Sci Educ Theory Pract. 2002)


STEVEN M. DOWNING

University of Illinois at Chicago, College of Medicine, Department of Medical Education (MC 591),

808 South Wood Street, Chicago, IL 60612-7309, USA (E-mail: sdowning@uic.edu)





Introduction


The multiple-choice question (MCQ) format is the most commonly used for such tests, due to

  • its many positive psychometric characteristics,
  • its long history of research evidence,
  • its versatility in testing most cognitive knowledge,
  • its relative (apparent) ease to write, store, administer, and score, and
  • its continued use by the highest-stakes examinations in medical education (Downing, 2002).


But accomplishing these positive outcomes of MCQ-type assessment is challenging and may be more art than science (Haladyna, Downing and Rodriguez, 2002).


 

There are many threats to the validity of MCQ test score interpretation. Test validity refers to the accurate and meaningful interpretation of test scores and to the reasonableness of the inferences drawn from test scores (Messick, 1989; American Educational Research Association, 1999).


Construct-Irrelevant Variance (CIV)

 

Construct-irrelevant variance (American Educational Research Association, 1999; Cook and Campbell, 1979) refers to the introduction into assessments of extraneous, uncontrolled variables that tend to erroneously inflate or deflate scores for some or all examinees, thus reducing the meaningfulness and accuracy of test score interpretations, the legitimacy of decisions made on the basis of test scores, and the validity evidence for tests. There are many potential CIV sources in objective tests commonly used to assess achievement in medical education.


 

POORLY CRAFTED TEST QUESTIONS


Test items used at all levels of medical education training often violate the most basic principles of effective item writing, as a recent study by Jozefowicz and colleagues demonstrates (Jozefowicz et al., 2002).

  • Ambiguously worded test questions,
  • questions written in non-standard form,
  • questions cast in overly complex formats, or
  • questions testing too much information in a single item

...may provide unintended cues to the correct answer or facilitate the elimination of incorrect options, thus inflating examinee scores erroneously and adding CIV to the measurement.


INSECURE TEST QUESTIONS AND OTHER TESTING “IRREGULARITIES”


If some or all examinees have prior knowledge of test questions, their test scores may be erroneously inflated. Similarly, other testing irregularities, such as cheating, can add false-positive information to test scores. All such ill-gotten gain is CIV.


TESTWISENESS


Testwiseness refers to a set of behaviors that allows examinees to maximize their test scores. The examinee who, absent all other information about the correct-answer choice, marks C or chooses the longest answer from the option set, is using testwise cues.


GUESSING


Because every one-best-answer MCQ offers a limited set of options, there is a statistical probability of arriving at the correct response by random guessing alone.


While it is unlikely that examinees will obtain high scores or achieve a passing score through guessing alone, seriously flawed test items do increase the probability of success through guessing and thereby add CIV to the measurement.
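The point about guessing can be made concrete with a quick binomial calculation. This is an illustrative sketch: the 40-item test length, five-option format, and 70% passing score below are assumptions chosen for the example, not figures taken from this passage.

```python
from math import comb


def p_at_least(k: int, n: int, p: float) -> float:
    """Probability of k or more successes in n independent trials,
    each with success probability p (binomial upper tail)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))


# Assumed scenario: 40 one-best-answer items with 5 options each,
# so blind guessing gives p = 0.2 per item; passing score 28/40 (70%).
expected_by_chance = 40 * 0.2                 # about 8 items correct on average
p_pass_by_chance = p_at_least(28, 40, 0.2)    # vanishingly small

# A seriously flawed item that lets examinees eliminate three of the
# five options raises the per-item guessing probability to 0.5:
p_pass_flawed = p_at_least(28, 40, 0.5)       # still small, but orders of magnitude larger
```

This matches the text's claim: passing by guessing alone is essentially impossible on an intact test, but flawed items that permit option elimination sharply raise the probability of unearned success.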


ITEM BIAS: DIFFERENTIAL ITEM FUNCTIONING


In its most basic meaning, test item bias refers to the fairness of the test item for different subgroups of examinees. A general class of analyses used to detect item bias is called Differential Item Functioning (DIF) analysis and can be used to detect items that unfairly discriminate between subgroups. Poorly written MCQs tend to be confusing, artificially difficult or easy, and may be differentially confusing for different subgroups.
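One standard DIF screen is the Mantel-Haenszel procedure, which compares the odds of answering an item correctly for a reference and a focal subgroup within matched ability strata (usually total-score bands). A minimal sketch, with made-up counts purely for illustration:

```python
def mantel_haenszel_odds_ratio(strata):
    """Common odds ratio for one item across score strata.

    Each stratum is a 2x2 table (a, b, c, d):
      a = reference group correct,  b = reference group incorrect,
      c = focal group correct,      d = focal group incorrect.
    A ratio near 1.0 suggests no DIF; values far from 1.0 flag an item
    that favors one subgroup even after matching on overall ability.
    """
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den


# Hypothetical item: within each score band the two subgroups answer
# correctly at the same odds, so the common odds ratio is 1.0 (no DIF).
no_dif_item = [(30, 10, 15, 5), (20, 20, 10, 10)]

# Hypothetical flagged item: the reference group's odds of success are
# three times the focal group's within the same score band.
dif_item = [(30, 10, 10, 10)]
```

Operational programs typically follow such a screen with effect-size classification and content review before deleting or revising an item.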


INDEFENSIBLE PASSING SCORES


While all standards require judgment and all standards are somewhat arbitrary, if the methods used to establish the pass-fail mark are seriously flawed, CIV is added to the measurement. CIV associated with passing standards includes problems such as

  • low reproducibility of passing scores,
  • low relationship between the pass-fail outcomes for tests of similar content,
  • low agreement between passing standards established by different instructors in similar or the same courses,

and so on.
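Low agreement between two passing standards can be quantified directly, for example with Cohen's kappa on the pass-fail decisions each standard produces for the same examinees. A minimal sketch; the decision vectors below are invented for illustration:

```python
def cohen_kappa(decisions_a, decisions_b):
    """Chance-corrected agreement between two pass/fail classifications
    (1 = pass, 0 = fail) of the same examinees under two standards."""
    n = len(decisions_a)
    observed = sum(a == b for a, b in zip(decisions_a, decisions_b)) / n
    pass_a = sum(decisions_a) / n
    pass_b = sum(decisions_b) / n
    # Agreement expected by chance, from the two marginal pass rates
    chance = pass_a * pass_b + (1 - pass_a) * (1 - pass_b)
    return (observed - chance) / (1 - chance)


# Two hypothetical standards that disagree on borderline examinees:
standard_1 = [1, 1, 1, 0, 0, 1, 0, 1]
standard_2 = [1, 1, 0, 0, 1, 1, 0, 1]
kappa = cohen_kappa(standard_1, standard_2)  # well below 1.0
```

A kappa near 1.0 would indicate that the two standards classify examinees almost identically; values much lower signal exactly the kind of low agreement the text lists as a CIV source.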


Construct Underrepresentation


Much of the validity argument for typical achievement examinations used in medical education rests on content-related evidence and data showing the relationship between test item content and the domain of interest. Theoretically, the specific test items used on a particular examination represent a sample of all possible test items that could be selected from the population. Classical measurement theory (CMT) rests solidly on this foundation and assumption (Anastasi, 1988).


TRIVIAL CONTENT TESTED AT THE FACTUAL-RECALL LEVEL


Many of the test items constructed to test achievement in medical education curricula appear to ask trivial questions – questions that are unimportant for future learning or the clinical care of patients. The ability to remember facts and answer questions about these facts may have little to do with students' ultimate success or failure at managing complex clinical problems, and may not predict their success at applying this knowledge to novel problems.


TEACHING-TO-THE-TEST


If instructors specifically guide examinees to content that is to be tested on the examination, the sampling assumption is violated, CIV is introduced to test scores, and the accurate interpretation of scores may be jeopardized. Under these circumstances, the score on the test does not represent a random sample of the content domain, so one cannot draw legitimate inferences from test scores to that domain.


TOO FEW TEST ITEMS


Tests that are too short cannot adequately sample the content domain and thus threaten the validity of inferences to the domain.


Also, short tests tend to produce scores that are unreliable or have low reproducibility indices, with relatively large standard errors of measurement (SEM). For example, a typical formative test in undergraduate medical education may have only about 40 test questions to cover a fairly broad content area. Such a test may have a reliability coefficient of 0.50–0.60 and a SEM of 3–5 raw score points. Unreliable tests produce wide confidence intervals around student scores. For example, if the SEM is 5, the 95 percent confidence interval around each raw score (including the passing score) is nearly ±10 points, creating much uncertainty about the true pass-fail status of many students scoring near the passing score (depending on the test score distribution).
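The arithmetic behind the ±10-point interval follows from the classical relation SEM = SD·√(1 − reliability). A sketch using the numbers in the example; the raw-score SD of 7.5 is an assumed value chosen to reproduce an SEM of about 5, and the standard Spearman-Brown projection is added to show how lengthening the test would help:

```python
import math


def sem(sd: float, reliability: float) -> float:
    """Classical standard error of measurement."""
    return sd * math.sqrt(1 - reliability)


def ci95(score: float, sem_value: float):
    """95% confidence interval around an observed score (normal-theory)."""
    half_width = 1.96 * sem_value
    return (score - half_width, score + half_width)


def spearman_brown(reliability: float, length_factor: float) -> float:
    """Projected reliability if the test length is multiplied by length_factor."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)


# 40-item test, reliability 0.55 (midpoint of the quoted 0.50-0.60 range),
# assumed raw-score SD of 7.5 points:
s = sem(7.5, 0.55)            # about 5 raw-score points
low, high = ci95(28, s)       # interval of roughly +/- 10 points around a score of 28

# Doubling the test to 80 comparable items would be projected to raise
# reliability to about 2 * 0.55 / 1.55, i.e. roughly 0.71.
projected = spearman_brown(0.55, 2)
```

The 1.96 multiplier is simply the normal-theory 95% critical value, which is why an SEM of 5 yields an interval of "nearly ±10 points" as the text states.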



Must All Tests in Medical Education Produce Valid Inferences?


 

The answer to this rhetorical question is yes. Validity is never assumed and is always an open-ended question, seeking a variety of sources of evidence. If medical educators test students, there is an obligation to collect and present hard evidence that the test measures what is intended and that the inferences drawn from test scores are more-or-less accurate and more-or-less defensible.


Most of the issues discussed under the rubric of construct-irrelevant variance and construct underrepresentation are under the control of those who create tests in medical education. Techniques to develop effective, higher cognitive-level test items that measure important and useful information – items that do not cue examinees to the correct answer, permit inordinate guessing, or confuse knowledgeable students – are well known and readily accessible (Case and Swanson, 1998).



Adv Health Sci Educ Theory Pract. 2002;7(3):235-41.

Abstract

Construct-irrelevant variance (CIV) - the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error - and construct underrepresentation (CUR) - the under-sampling of the achievement domain - are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of content underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and which are composed of trivial questions written at low-levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.

PMID: 12510145 [PubMed - indexed for MEDLINE]

