Quality assurance of item writing: During the introduction of multiple choice questions in medicine for high stakes examinations (Med Teach, 2009)


JAMES WARE1 & TORSTEIN VIK2

1Health Sciences Centre, Kuwait University, Kuwait, 2Norwegian University of Science and Technology (NTNU), Norway






Introduction



It was also felt appropriate to look for markers of excellence and build them into the training process.

  • Susan Case and David Swanson’s NBME web monograph gives useful guidelines and tips on how this goal might be achieved (Case & Swanson 2004).

  • Other sources include a series of papers written by Haladyna and Downing (1989), who have also produced evidence suggesting that item writing flaws (IWFs) can prejudice the outcome of high stakes examinations (Haladyna & Downing 1989; Downing 2005).

  • It is an important area of educational research that has stimulated further examination in the health sciences (Tarrant & Ware 2008).



Methods


All Norwegian medical schools have a 6-year curriculum; however, there is no national licensing examination, and each school has its own educational model.


Particular emphasis was put on recognising items testing lower levels of cognition, K1 (recall and comprehension), and higher levels, K2 (application and reasoning). This is a modification of a proposal made by Irwin and Bamber (1982) and also accords with the classification used by the IDEAL Consortium (Prideaux & Gordon 2002). An arbitrary goal was set for NTNU examinations to be delivered with at least 50% K2 items.



Following the delivery of each MCQ paper the results were computed without negative marking (Downing 2003). Item analysis was carried out using the IDEAL software (vers. 4.1), and the post hoc reviews confirming the results were done with the aid of these data. The item statistics are based on classical test theory and the output is presented in the format shown in Table 1 (Osterlind 1998). Also available on the output sheet are the mean group result, variance, SD, Kuder–Richardson reliability and SE of measurement. Data were used both from the performance of each item and from the whole test, particularly p-values and upper-lower item discrimination based on the top and bottom 27% of candidates. The frequency of candidates marking each option was also noted, and where ≥5% of candidates marked a distracter it was deemed functional. A further analysis was carried out after the removal of items with p-values ≥85%.
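The classical statistics described above (item p-values and upper-lower 27% discrimination) can be sketched in Python. The IDEAL software itself is not documented here, so this is an illustrative NumPy reimplementation; the function name and signature are ours, not from the paper.

```python
import numpy as np

def item_analysis(responses, key, group_frac=0.27):
    """Classical test theory item statistics for a one-best-answer MCQ paper.

    responses: (n_candidates, n_items) array of chosen option indices
    key:       (n_items,) array of correct option indices
    Returns per-item p-values and upper-lower (27%) discrimination.
    """
    scored = (responses == key).astype(float)   # 1 = correct; no negative marking
    totals = scored.sum(axis=1)                 # candidate total scores
    n = scored.shape[0]
    k = max(1, int(round(n * group_frac)))      # size of the top/bottom groups
    order = np.argsort(totals)
    lower, upper = order[:k], order[-k:]

    p_values = scored.mean(axis=0)              # item difficulty (facility)
    discrimination = scored[upper].mean(axis=0) - scored[lower].mean(axis=0)
    return p_values, discrimination
```

Feeding the matrix of marked options and the answer key for one paper through this function reproduces, in spirit, the per-item columns of the Table 1 output format.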


Arbitrary levels of discrimination were used to create ranges reflecting three levels of discriminating power:

  • >0.40, excellent;

  • 0.30–0.39, good and

  • 0.15–0.29, moderate.

 


Below 0.15 was considered as having no discrimination power of significance.
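These arbitrary bands can be encoded as a small helper. Note that the stated ranges leave the exact 0.40 boundary unassigned, so treating it as excellent here is our assumption:

```python
def discrimination_band(d):
    """Classify an upper-lower discrimination index into the paper's
    arbitrary bands (0.40 itself treated as excellent by assumption)."""
    if d >= 0.40:
        return "excellent"
    if d >= 0.30:
        return "good"
    if d >= 0.15:
        return "moderate"
    return "none"       # below 0.15: no discrimination power of significance
```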



Results


The longest option was the commonest IWF (55%) and the four others were: word repeats in the vignette and correct option (2/31), logical clues (4/31), sentence completions (6/31) and negatively worded questions (2/31).



Discussion


There still remains much discussion about what constitutes an item writing violation, with the available empirical data being rather few (Haladyna & Downing 1989). Notwithstanding the controversies, it remains reasonable to set rules for an institution, and we believe the flaws listed in Appendix 2 are worth avoiding until such time as we know that including any of them does not affect the test outcome. This becomes more important when the guessing factor, inherent in any selected-response item format, is accounted for. At NTNU negative marking was not used, and there is good evidence for avoiding such a strategy (Downing 2003).


The five criteria we would recommend for high-stakes end-of-course (viz., graduation) or year-end examinations are the following:

  • 1. Strong adherence to an in-house style: for NTNU see Appendix 1.

  • 2. The proportion of K2 items is at or above 50%.

  • 3. Greater than or equal to 50% of all distracters shall be functioning at the 5% level.

  • 4. Greater than or equal to 60% of items shall have moderate or better discrimination using set ranges.

  • 5. The frequency of IWFs agreed for the institution shall be <10%.
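Criteria 2 to 5 are simple numeric thresholds that could be checked automatically after each paper; criterion 1 (in-house style) still needs human review. A minimal sketch, with a function name and signature of our own invention:

```python
def meets_quality_criteria(k2_prop, functional_distracter_prop,
                           discriminating_prop, iwf_prop):
    """Check a delivered MCQ paper against criteria 2-5.
    All arguments are proportions in [0, 1]."""
    return (k2_prop >= 0.50                         # 2: at least 50% K2 items
            and functional_distracter_prop >= 0.50  # 3: distracters at 5% level
            and discriminating_prop >= 0.60         # 4: moderate or better
            and iwf_prop < 0.10)                    # 5: IWF frequency below 10%
```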




NTNU chose to use five options, although there is evidence that four or even three option MCQs function as well (Haladyna & Downing 1993). Whatever number is chosen, and this may be a quite arbitrary decision, an important part of quality assurance is to determine that the number of options that function justifies the number set as policy. We believe that after removing items with p-values ≥0.85 the desirable functional distracter proportion should be >50%.
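The re-analysis described here, dropping very easy items and then counting distracters chosen by at least 5% of candidates, can be sketched as follows (NumPy-based; names are ours):

```python
import numpy as np

def functional_distracter_proportion(responses, key, n_options=5,
                                     p_cutoff=0.85, threshold=0.05):
    """Proportion of distracters marked by >= `threshold` of candidates,
    computed after removing very easy items (p-value >= p_cutoff)."""
    n, m = responses.shape
    p_values = (responses == key).mean(axis=0)
    functional = total = 0
    for i in range(m):
        if p_values[i] >= p_cutoff:
            continue                        # drop very easy items
        freq = np.bincount(responses[:, i], minlength=n_options) / n
        for opt in range(n_options):
            if opt == key[i]:
                continue                    # skip the keyed option
            total += 1
            functional += freq[opt] >= threshold
    return functional / total if total else 0.0
```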


The influential Maastricht school (Schuwirth & Van der Vleuten 2003) stresses that item format is not the arbiter of the cognitive level tested, but rather the content; using vignettes does not guarantee an item testing at K2.

References

Case S, Swanson DW. 2004. Item writing manual: Constructing written test questions for the basic and clinical sciences, National Board of Medical Examiners publications, retrieved in 2004, from: http://www.nbme.org/aboutitem/writing.asp


Tarrant M, Ware J. 2008. The impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments. Med Educ 43:198–206.


Downing SM. 2003. Guessing on selected-response examinations. Med Educ 37:670–671.


Downing SM. 2005. The effects of violating standard itemwriting principles on tests and students: The consequences of using flawed items on achievement examinations in medical education. Adv Health Sci Educ 10:133–143.




Appendix 2



IWFs to be avoided 


1. Grammatical clues, found when using sentence completions. The option with an incorrect grammatical flow is automatically eliminated by most candidates.


2. Logical clues, based on information in the stem also being used in the correct keyed option. Test wise candidates are quick to spot this flaw. 


3. Word repeats, where the stem contains a complete word, or part of a word, that is clearly identified in the correct keyed option.


4. Convergence cues, usually based on multiple facts used in the options. The good candidate quickly adds up these facts and finds the correct option with the most repeaters in it. Alternatively, more than two options deal with similar areas to the exclusion of the others, which are then the distracters and serve little purpose.


5. The longest option is the correct keyed option because of the number of qualifying statements added to justify it as the best choice. 


6. Lost sequence in presentation of data, failure to use ranges and mixed units, as well as overlapping data, or no normal values given. All these flaws add to the uncertainty and, therefore, become confusing. 


7. Use of absolute terms such as never, always, only, etc., which are seldom appropriate qualifiers for clinical statements; the option is eliminated by a good candidate.


8. Use of vague terms such as frequently, occasionally or rarely (among others), which cause uncertainty and are usually eliminated as being fillers.


9. Use of negative(s) in the question. These items are frequently misunderstood, as one does not expect the formulation to be in the negative. Alternatively, the correct option is so implausible that it would not apply under any circumstance.


10. Use of EXCEPT in the stem as part of the question formulation. Although it seldom confuses, these items often identify the correct keyed option as being out of sequence with the others, without the use of any knowledge.


11. The use of none or all of the above (NOTA or AOTA) as the last option. Options written to satisfy these absolutes are problematic: NOTA often provides clues, while AOTA rewards partial information.


12. Failure to pass the Hand Cover Test (HCT) increases uncertainty about the question being asked, or leaves the examinee guessing. 


13. Unclear language, ambiguities, gratuitous information, vignette not required etc. 


14. Use of interpreted data. Not infrequently a complex vignette is followed by a reference to the condition, disease or diagnosis followed by a question which requires no reference to the information given in the vignette, only knowledge of the condition. 


15. Inaccurate information, including implausible options.






Med Teach. 2009 Mar;31(3):238-43. doi: 10.1080/01421590802155597.

Quality assurance of item writing: during the introduction of multiple choice questions in medicine for high-stakes examinations.


Abstract

BACKGROUND:

One Norwegian medical school introduced A-type MCQs (best one of five) to replace more traditional assessment formats (e.g. essays) in an undergraduate medical curriculum. Quality assurance criteria were introduced to measure the success of the intervention.

METHOD:

Data collection from the first four year-end examinations included item analysis, frequency of item writing flaws (IWF) and proportion of items testing at a higher cognitive level (K2). All examinations were reviewed before and after delivery and no items were removed.

RESULTS:

Overall pass rates were similar to previous cohorts examined with traditional assessment formats. Across 389 items, the proportion of items with ≥5% of candidates marking two or more functioning distracters was ≥47.5%. After removal of items with high p-values (≥85%), this distracter proportion became >75%. With each successive year in the curriculum the proportion of K2 items used rose steadily to almost 50%. 31/389 (7%) items had IWFs. 65% of items had a discriminatory power ≥0.15.

CONCLUSIONS:

Five item quality criteria are recommended: (1) adherence to an in-house style, (2) item proportion testing at K2 level, (3) functioning distracter proportion, (4) overall discrimination ratio and (5) IWF frequency.


