Impact of item-writing flaws in multiple-choice questions in high-stakes nursing assessments (Med Educ, 2008)

Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments

Marie Tarrant1 & James Ware2






INTRODUCTION


If properly constructed, MCQs can test higher levels of cognitive reasoning and can discriminate between high- and low-achieving students. In reality, however, in-house MCQs are often poorly constructed, because few faculty members have adequate education and training in developing high-quality MCQs.

If properly constructed, MCQs are able to test higher levels of cognitive reasoning and can accurately discriminate between high- and low-achieving students.1,3 The reality, however, is that MCQs on many cross-discipline examinations developed in-house are poorly constructed because few teaching faculty have adequate education and training in developing high-quality MCQs.


 

Although guidelines for constructing high-quality MCQs are clearly outlined in numerous publications, violations of these guidelines are common.

Although guidelines for constructing high-quality MCQs have been clearly outlined in numerous publications,11–16 violations of these guidelines are nonetheless common.


Item-writing flaws (IWFs) can affect student performance on MCQs by making items more or less difficult. Although the impact of only a few IWFs has been empirically evaluated, experts generally agree that IWFs do affect student performance.

Item-writing flaws can affect student performance on MCQs, making items either more or less difficult to answer.12,16,17 Although the impact of only a few item-writing flaws has been empirically evaluated,18–22 experts agree that item-writing flaws do affect student performance.


 

Some IWFs make items easier.

Some flaws, such as

  • the use of absolute terms (e.g. always, never),

  • the use of 'all of the above',

  • making the correct option the longest or most detailed,

  • using word repeats or logical clues in the stem as to the correct answer, and

  • grammatical clues,

...cue the examinee to the correct answer and make items less difficult.12,15–17,20

 

 

Other IWFs make items more difficult.

Furthermore, experts recommend against

  • using items with negatively worded stems (i.e. not, except),

  • unfocused or unclear stems,

  • gratuitous or unnecessary information in the stem, and

  • the 'none of the above' option

...as these formats can make questions more difficult.16,18,22

 

 

Complex or K-type items should be avoided, because they can be confusing or allow students to answer based on partial information.

Complex or K-type item formats that have a range of correct responses and require examinees to select from combinations of responses should also be avoided as they can be confusing and allow examinees to answer questions based on partial information.23,24


On this topic, Downing assessed the quality of examinations taken by students at a US medical school and found that 33–46% of MCQs contained IWFs. As a consequence, 10–25% of the examinees classified as failures would have passed had the flawed items been removed.

In the only published studies on this topic, Downing7,8 assessed the quality of examinations given to medical students in a US medical school and found that 33–46% of MCQs were flawed. As a consequence, 10–25% of examinees who were classified as failures would have passed if flawed items had been removed from the tests.7,8





METHODS




Collection of MCQs from an undergraduate nursing programme over 5 years

As part of a larger study5 examining the quality of MCQs, we retrieved all high-stakes tests and examinations containing MCQs that had been administered in an undergraduate nursing programme over a 5-year period from 2001 to 2005 (n = 121).


Exclusion criteria

We eliminated tests without item analysis data (n = 54), all of which were administered prior to 2003. We also removed tests that were not summative assessments (n = 10) as these were not considered to be high-stakes tests. To ensure enough flawed items on each test analysed, we removed tests with < 50 items (n = 42). Finally, we removed tests with unacceptably low reliability (r < 0.50) (n = 5). Although higher reliability (r > 0.70) is desirable in high-stakes, classroom-type assessments,25 0.50 is sufficient reliability to allow researchers to draw meaningful conclusions about individual achievement.7 Thus, 10 tests were available for analysis.


Procedure for assessing MCQ quality

Procedures for assessing the quality of the MCQs were rigorous and have been described in detail elsewhere.5 Briefly, MCQs were reviewed for the presence or absence of 32 commonly identified violations11–16 of item-writing guidelines by a 4-person consensus panel consisting of expert clinicians and trained item writers. Items were classified as 'flawed' if they contained ≥ 1 of the assessed violations of good item writing. Items were classified as 'standard' if they did not contain any of the assessed item-writing violations.8 In total, 15 of the 32 common violations were found in the reviewed papers (Appendix). Each panel member reviewed each question independently; discordance on violations occurred on 13.1% (n = 87) of the items. These items were further discussed until consensus was reached among panel members. Item analysis data were retrieved only after all item classification was complete.


Two separate scales were computed

For each test, 2 separate scales were computed:

  • (flawed items included) a total scale which reflected the characteristics of the test as it was administered, and

  • (flawed items excluded) a standard scale which reflected the characteristics of a hypothetical test that included only the unflawed items.8
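As a minimal sketch of this two-scale computation (the 0/1 scoring matrix and flaw classifications below are hypothetical, not data from the study), the total and standard scales can be derived from the same response data:

```python
import numpy as np

# Hypothetical scoring matrix: rows = examinees, columns = items (1 = correct).
scored = np.array([
    [1, 1, 0, 1, 0],
    [1, 0, 1, 1, 1],
    [0, 1, 1, 0, 1],
    [1, 1, 1, 1, 0],
])
# Hypothetical panel classification: True = item contains >= 1 item-writing flaw.
flawed = np.array([False, True, False, False, True])

# Total scale: the test as it was administered (all items).
total_pct = scored.mean(axis=1) * 100

# Standard scale: a hypothetical test containing only the unflawed items.
standard_pct = scored[:, ~flawed].mean(axis=1) * 100

# Criterion-referenced pass mark of 50%, as in the programme studied.
pass_mark = 50.0
passed_total = int((total_pct >= pass_mark).sum())
passed_standard = int((standard_pct >= pass_mark).sum())
```

In this toy data, the third examinee passes the total scale (60%) but fails the standard scale (33%), which is the pattern the study reports for borderline students.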

 

Characteristics of the included tests

All items were 4-option, single-best answer questions with no penalty for incorrect answers. All tests were pencil-and-paper format and were completed in-person by examinees. No computer-based tests were assessed. Test results were computed using optical scanning sheets and a customised software program. In the undergraduate nursing programme, criterion-referenced assessment is used and pass scores are set at 50%.



The following data were computed

For the 2 scales assessed in this study (total and standard), the following data were computed:

  • mean item difficulty;

  • mean item discrimination;

  • raw test scores;

  • percent scores;

  • Kuder-Richardson 20 reliability (KR-20);

  • the number and proportion of examinees passing, and

  • the number and proportion of high-achieving examinees (those scoring ≥ 80%).
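Of these statistics, KR-20 is the least self-explanatory; it is straightforward to compute from a 0/1 scoring matrix. A sketch (this uses the sample variance of total scores; implementations differ on that detail, and the example data are made up):

```python
import numpy as np

def kr20(scored):
    """Kuder-Richardson 20 reliability for dichotomously scored (0/1) items."""
    k = scored.shape[1]                         # number of items
    p = scored.mean(axis=0)                     # proportion correct per item
    pq_sum = (p * (1.0 - p)).sum()              # summed item variances
    var_total = scored.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1.0 - pq_sum / var_total)

# Example: a near-Guttman response pattern yields high reliability.
scored = np.array([
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
])
print(round(kr20(scored), 4))  # 0.9375
```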

 

To enhance comparisons, we also computed the mean item difficulty and mean item discrimination for flawed items but did not calculate total scores for this scale.

 

  • Item difficulty is the proportion of examinees answering the question correctly, with lower values reflecting more difficult questions.26

  • Item discrimination is computed using the point-biserial correlation coefficient, or the correlation between the item and total test score.26 Item discrimination is a measure of how effectively an item discriminates between high- and low-ability students.13 Items with higher discrimination values are more desirable.

 

Item-analysis was conducted using IDEAL 4.1, an item-analysis program (IDEAL-HK, Hong Kong, China).27 All other data analysis was conducted using STATA Version 9.2 (Stata Corporation Inc., College Station, TX, USA).28




RESULTS





DISCUSSION


There was an unacceptably high proportion of flawed items in the tests reviewed here.

There was an unacceptably high level of flawed items in the tests we reviewed.


Writing good MCQ items is time-consuming and difficult. Therefore, if teachers lack the necessary skills, institutions have a responsibility to provide adequate training so that valid and reliable assessments can be developed. Researchers have already shown that training substantially improves the quality of MCQs.

Well constructed MCQ items are time-consuming and difficult to write. Therefore, if teachers responsible for assessment and evaluation lack the necessary skills, it is the responsibility of the academic institutions that employ them to provide the necessary training and instruction to enable them to develop valid and reliable assessments.29 Research has shown that training substantially improves the quality of MCQs developed by teaching faculty.6,9 



Overall, the findings of this study show a complex relationship between flawed items and student achievement. First, mean difficulty scores show that flawed items were not substantially more or less difficult than standard items.

Overall, the results of this study show a complex interaction between flawed items and student achievement. First, mean difficulty scores show that flawed items were not substantially more or less difficult over the 10 tests than were standard items.


This is unsurprising given that IWFs can affect students' responses to MCQs in varied ways, and that the frequencies of the various IWFs differed across tests.

This is not surprising given the varying effects that flawed items can have on student responses on MCQs and the different frequencies of various item-writing flaws on the tests examined.



Second, although flawed items across the 10 tests did not significantly lower difficulty, pass rates were higher on the total scale than on the standard scale. This indicates that borderline students benefited from flawed items.

Second, although flawed items across all 10 tests were not substantially less difficult, more examinees were able to pass the total scales compared with the standard scales (94.5% versus 90.9%). This indicates that borderline students benefit from flawed items.


Third, fewer students scored ≥ 80% on the total scale than on the standard scale, showing that flawed items negatively affected high-achieving students. In high-stakes tests, these students likely rely more on knowledge and reasoning than on testwiseness, and so would be unfairly penalised by flawed items.

  • Testwiseness refers to 'behaviours that allow examinees to guess or deduce correct answers without knowing the material, thereby increasing their test scores'.17 These findings clearly illustrate the impact of construct-irrelevant variance (CIV) on student achievement.

  • CIV refers to 'the introduction of extraneous variables (i.e. item-writing flaws, test-wiseness) that are irrelevant to the construct being measured and which can increase or decrease test scores for some or all examinees'.17,30

Third, fewer examinees scored ≥ 80% on the total scales when compared with the standard scales (14.6% versus 21.0%), demonstrating that flawed test items negatively impact high-achieving students. These students may be more likely to rely on knowledge and reasoning to answer questions on high-stakes assessments and less likely to rely on test-wiseness, and thus are unfairly penalised when questions are flawed. Test-wiseness refers to behaviours that allow examinees to guess or deduce correct answers without knowing the material, thereby increasing their test scores.17 These findings clearly illustrate the impact of construct-irrelevant variance on student achievement. Construct-irrelevant variance refers to the introduction of extraneous variables (i.e. item-writing flaws, test-wiseness) that are irrelevant to the construct being measured and which can increase or decrease test scores for some or all examinees.17,30


The flawed items reviewed in this study had lower discrimination than standard items. When borderline students' scores are artificially inflated and high-achieving students' scores are lowered, an assessment loses its discriminating power and fails to differentiate student achievement.

The flawed test items reviewed in this study had lower discriminating power than did standard items. When test scores for borderline students are artificially inflated and scores for high-achieving students are lowered, assessments lose their discriminating power and there is less differentiation in student achievement.


The findings differ from those of previous research, which found that flawed items negatively affected examinee pass rates. This is unsurprising, given the wide variation in the types of IWFs and in their frequencies. What is consistent with other findings is that flawed items perform worse than unflawed items and negatively affect student achievement and discrimination.

Findings from this study differ from those of previous research that has found flawed items to negatively affect examinee pass rates.7,8 This is not surprising, however, when we consider the wide variation in types of item-writing violations and the differing frequencies of these violations on various tests and examinations. What is consistent with other findings is that flawed items perform worse than unflawed items and negatively affect student achievement and discrimination.


In professional programmes such as nursing, teachers are accountable to many stakeholders, including licensing bodies and the public. Student performance on high-stakes tests therefore has serious consequences for both examinees and patients.

In professional programmes such as nursing, teachers are accountable to many stakeholders, including licensing bodies and the public.2,31 Thus student performance on high-stakes tests can have serious consequences for both examinees and patients.17




CONCLUSIONS


 

Unfortunately, IWFs are all too common across many disciplines. We have shown the unintended consequences of flawed items for borderline students. We have also shown their effect on high-achieving students, something that had not been demonstrated previously. One might assume that if IWFs benefit borderline students, they would raise all students' scores and benefit everyone; our findings suggest otherwise.

The presence of item-writing violations is unfortu- nately all too common in teacher-developed exam- inations across many disciplines. We have shown the unintended consequences of using flawed items for borderline students. We also examined the impact of flawed items on high-achieving students, something that has not been done previously. One might naturally assume that if item-writing flaws benefit borderline students, they would benefit all students, raising test scores for everyone. Our findings suggest otherwise.





7 Downing SM. Construct-irrelevant variance and flawed test questions: do multiple-choice item-writing principles make any difference? Acad Med 2002;77 (Suppl):103–4.


8 Downing SM. The effects of violating standard item-writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education. Adv Health Sci Educ 2005;10:133–43.


9 Jozefowicz RF, Koeppen BM, Case S, Galbraith R, Swanson D, Glew RH. The quality of in-house medical school examinations. Acad Med 2002;77:156–61.


11 Case SM, Swanson DB. Constructing Written Test Questions for the Basic and Clinical Sciences, 3rd edn. Philadelphia: National Board of Medical Examiners 2001;19–29.


16 Haladyna TM, Downing SM, Rodriguez MC. A review of multiple-choice item-writing guidelines for classroom assessment. Appl Meas Educ 2002;15:309–34.


17 Downing SM. Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct under-representation. Adv Health Sci Educ 2002;7:235–41.




Med Educ. 2008 Feb;42(2):198-206. doi: 10.1111/j.1365-2923.2007.02957.x.

Impact of item-writing flaws in multiple-choice questions on student achievement in high-stakes nursing assessments.

Author information

  • 1Department of Nursing Studies, Faculty of Medicine, University of Hong Kong, Hong Kong, China. tarrantm@hku.hk

Abstract

CONTEXT:

Multiple-choice questions (MCQs) are frequently used to assess students in health science disciplines. However, few educators have formal instruction in writing MCQs and MCQ items often have item-writing flaws. The purpose of this study was to examine the impact of item-writing flaws on student achievement in high-stakes assessments in a nursing programme in an English-language university in Hong Kong.

METHODS:

From a larger sample, we selected 10 summative test papers that were administered to undergraduate nursing students in 1 nursing department. All test items were reviewed for item-writing flaws by a 4-person consensus panel. Items were classified as 'flawed' if they contained ≥ 1 flaw. Items not containing item-writing violations were classified as 'standard'. For each paper, 2 separate scales were computed: a total scale which reflected the characteristics of the assessment as administered and a standard scale which reflected the characteristics of a hypothetical assessment including only unflawed items.

RESULTS:

The proportion of flawed items on the 10 test papers ranged from 28–75%; 47.3% of all items were flawed. Fewer examinees passed the standard scale than the total scale (748 [90.6%] versus 779 [94.3%]). Conversely, the proportion of examinees obtaining a score ≥ 80% was higher on the standard scale than on the total scale (173 [20.9%] versus 120 [14.5%]).

CONCLUSIONS:

Flawed MCQ items were common in high-stakes nursing assessments but did not disadvantage borderline students, as has been previously demonstrated. Conversely, high-achieving students were more likely than borderline students to be penalised by flawed items.


