CIV and IWF in MCQs: Do the Principles Make a Difference? (Acad Med, 2002)

Construct-irrelevant Variance and Flawed Test Questions: Do Multiple-choice Item-writing Principles Make Any Difference?

STEVEN M. DOWNING

EVALUATION METHODS: WHAT DO WE KNOW? 

Moderator: Reed G. Williams, PhD




Messick defined CIV as follows.

Messick defines construct-irrelevant variance (CIV) as

‘‘. . . excess reliable variance that is irrelevant to the interpreted construct.’’2


Testwiseness, teaching to the test, and test irregularities (cheating) are all examples of CIV.

Testwiseness, teaching to the test, and test irregularities (cheating) are all examples of CIV that tend to inflate test scores by adding measurement error to scores.



Review of the Literature


An NBME study showed that violations of basic item-writing principles are common.

Yet, a recent study from the National Board of Medical Examiners (NBME)6 shows that violations of the most basic item-writing principles are very common in achievement tests used in medical education.


Individual IWFs have been studied, but their cumulative effect has not.

While several individual item flaws have been studied (negative stems,6 multiple true–false items,7 none of the above option8), the cumulative effect of grouping flawed items together as scales measuring the same ability has not been investigated.


Method



Three independent raters, blinded to item-performance data, classified the items using the standard principles of effective item writing as the universe of item-writing principles.4


Absolute passing standards were established for this test by the faculty responsible for teaching this instructional unit using a modified Nedelsky method.9
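The paper does not spell out the modification used, but the core Nedelsky logic can be sketched as follows (a minimal illustration, assuming the unmodified method: for each item, a judge marks how many wrong options a minimally competent examinee could eliminate, and the chance of a correct answer is one over the options remaining):

```python
def nedelsky_passing_score(eliminated_counts, n_options=5):
    """Absolute passing score under the Nedelsky method: for each item,
    the minimally competent examinee's chance of answering correctly is
    1 / (options remaining after eliminating recognizably wrong ones).
    The cutoff is the sum of these per-item probabilities."""
    return sum(1.0 / (n_options - e) for e in eliminated_counts)

# e.g., three items where judges felt 2, 3, and 0 distractors could be ruled out
cutoff = nedelsky_passing_score([2, 3, 0], n_options=5)
```

In practice, multiple judges' cutoffs are averaged; the function and example counts here are illustrative, not taken from the paper.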


The following were computed:

Typical item-analysis data were computed for each scale:

  • means,
  • standard deviations,
  • mean item difficulty,
  • mean biserial discrimination indices, and
  • Kuder-Richardson 20 (K-R 20) reliability coefficients,

together with the absolute passing score and the passing rate (proportion of students passing).
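Two of the listed statistics can be sketched from a 0/1 response matrix (rows = examinees, columns = items). This is a minimal illustration of item difficulty (p values) and K-R 20, with assumed variable names; it is not the NBME's analysis code:

```python
def item_analysis(responses):
    """Compute per-item difficulty (proportion correct) and K-R 20
    reliability from a list of 0/1 response rows."""
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # item difficulty: proportion of examinees answering each item correctly
    p = [sum(row[j] for row in responses) / n for j in range(n_items)]
    # K-R 20 = (k/(k-1)) * (1 - sum(p*q) / variance of total scores)
    mean_total = sum(totals) / n
    var_total = sum((t - mean_total) ** 2 for t in totals) / n
    kr20 = (n_items / (n_items - 1)) * (1 - sum(pj * (1 - pj) for pj in p) / var_total)
    return p, kr20
```

Biserial discrimination would additionally correlate each item with the total score; it is omitted here for brevity.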


Results


Comparing the 22 standard items with the 11 flawed items, K-R 20 was .62 versus .44.

Comparing the standard (22 items) and the flawed (11 items) scales,

  • the observed K-R 20 reliability was .62 versus .44.
  • The standard-scale mean p value was .70; the flawed-scale mean p value was .63 (t197 = 6.274, p < .0001).
  • The standard-scale items were slightly more discriminating than the flawed items, rbis = .34 versus .30 (using the total test score as criterion).
  • The flawed and the standard scales were correlated r = 0.52 (p < .0001). 
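Since the two scales differ in length (22 versus 11 items), part of the reliability gap is a length artifact. One interpretive step, not taken in the paper itself, is to project the flawed scale to 22 items with the Spearman-Brown prophecy formula:

```python
def spearman_brown(reliability, length_factor):
    """Projected reliability when a scale's length is multiplied by
    length_factor (Spearman-Brown prophecy formula)."""
    return (length_factor * reliability) / (1 + (length_factor - 1) * reliability)

# 11-item flawed scale (K-R 20 = .44) projected to 22 items
projected = spearman_brown(0.44, 2.0)
```

The projection comes to roughly .61, close to the standard scale's .62, suggesting the observed reliability difference largely reflects scale length rather than the flaws themselves.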



Discussion and Conclusions


At least one IWF was found in one third of the items.

One third of the questions in this test have at least one item flaw.


Flawed items showed increased difficulty: poorly constructed items add artificial difficulty. This CIV interferes with accurate and meaningful interpretation of test scores and negatively affects passing rates.

The increased test and item difficulty associated with the use of flawed item forms is an example of CIV, because poorly crafted test questions add artificial difficulty to the test scores. This CIV interferes with the accurate and meaningful interpretation of test scores and negatively impacts students' passing rates, particularly for passing scores at or just above the mean of the test score distribution.









Acad Med. 2002 Oct;77(10 Suppl):S103-4.

Construct-irrelevant variance and flawed test questions: Do multiple-choice item-writing principles make any difference?

Downing SM. University of Illinois at Chicago, College of Medicine, Department of Medical Education, 808 South Wood Street, Chicago, IL 60612-7309, USA.

PMID: 12377719 [PubMed - indexed for MEDLINE]

