Effects of Violating Standard Item-Writing Principles (Adv Health Sci Educ Theory Pract. 2005)

The Effects of Violating Standard Item Writing Principles on Tests and Students: The Consequences of Using Flawed Test Items on Achievement Examinations in Medical Education 


STEVEN M. DOWNING

University of Illinois at Chicago, Department of Medical Education (MC 591), College of Medicine, 808 South Wood Street, Chicago, IL 60612-7309, USA (Phone: +1-312-996-6428; Fax: +1-312-413-2048; E-mail: sdowning@uic.edu)





Introduction


However, as Mehrens and Lehmann (1991) point out, there are often major deficiencies in examinations prepared by classroom instructors. And, Jozefowicz and others (2002) show that poorly constructed test items are frequently used in medical schools.


While test item writing may be as much art as science, there are well established principles, many of which are evidence-based, suggesting what is an effective item form vs. an ineffective item form (Haladyna et al., 2002).


Several item flaws have been studied empirically for their effect on item and test psychometric characteristics.

  • For example, the use of ‘‘all of the above’’ (AOTA) and ‘‘none of the above’’ (NOTA) as options has been extensively studied with mixed results (Harasym et al., 1998).

  • Variants of the straightforward multiple-choice question (MCQ) stem, such as multiple true–false or unfocused stems, have been studied and generally found to be detrimental to item performance (Case and Downing, 1989; Downing et al., 1995).

  • Complex item forms, which require selection of combinations of individual options, have been extensively studied and found to be generally detrimental to the psychometric attributes of tests (Albanese, 1993; Dawson-Saunders et al., 1989).

  • The use of negative words in the stem has been evaluated with mixed results concerning difficulty and discrimination of test items (Downing et al., 1991; Tamir, 1993).


A recent review paper (Haladyna et al., 2002) recommends the avoidance of negation in the stem and reports that most educational measurement textbook authors recommend avoiding the AOTA option. The use of the NOTA option has mixed recommendations from textbook authors, and the empirical research is also mixed, but the current recommendation is to avoid use of the NOTA option, except when used by highly experienced item writers (Crehan and Haladyna, 1991; Frary, 1991).


Methods


 

EXAMPLE ITEMS


1. It is correct that: 

A. Growth hormone induces production of IGFBP3 

B. The predominant insulin-like growth factor binding protein (IGFBP) in human serum is IGFBP3 

C. Multiple forms of IGFBP are derived from a single gene 

D. All of the above 

E. Only A and B are correct


This is an example of an unfocused stem item. The stem does not pose a direct question. The options must each be addressed as ‘‘true or false,’’ but the item is scored as a single-best answer question. Further, option D, ‘‘all of the above’’ is not recommended. And, option E is a combination of two other possible answers, making this a ‘‘partial-K type’’ item. Overall, there are three distinct flaws in this question.


2. Which of the following will NOT occur after therapeutic administration of chlorpheniramine? 

A. Dry mouth 

B. Sedation 

C. Decrease in gastric acid production 

D. Drowsiness 

E. All of the above


The second item is an example of a negative-stem question. It requires the student to identify which sign or symptom will not occur. Option E (all of the above) is not recommended. This example item has two item flaws.


 

For each of the three scales evaluated for this study (standard, flawed, and total), item analysis data were computed: raw score means, standard deviations, mean item difficulty, mean point-biserial correlation with the total examination score, Kuder–Richardson 20 reliability (K–R 20), minimum passing score, the number of students passing, and passing rate (the proportion of students who passed).
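To make these quantities concrete, here is a minimal sketch of classical item analysis computed from a matrix of scored responses. It is an illustration only, not the software used in the study; the function name, the toy data, and the passing score are invented for the example.

```python
import numpy as np

def item_analysis(X, passing_score):
    """Classical item analysis for X, a (students x items) matrix of
    scored responses: 1 = correct, 0 = incorrect."""
    n_students, k = X.shape
    total = X.sum(axis=1)        # raw score per student
    p = X.mean(axis=0)           # item difficulty: proportion answering correctly
    # Point-biserial discrimination: correlation of each item with the
    # total score (uncorrected for item overlap, for brevity).
    r_pb = np.array([np.corrcoef(X[:, j], total)[0, 1] for j in range(k)])
    # Kuder-Richardson 20 internal-consistency reliability.
    kr20 = (k / (k - 1)) * (1 - np.sum(p * (1 - p)) / total.var(ddof=1))
    pass_rate = np.mean(total >= passing_score)  # proportion of students passing
    return {"mean": total.mean(), "sd": total.std(ddof=1),
            "difficulty": p, "point_biserial": r_pb,
            "kr20": kr20, "passing_rate": pass_rate}

# Toy data: 6 students x 4 items (invented for illustration).
X = np.array([[1, 1, 1, 0],
              [1, 1, 0, 1],
              [1, 0, 1, 1],
              [0, 1, 1, 0],
              [1, 0, 0, 0],
              [0, 0, 1, 0]])
print(item_analysis(X, passing_score=2))
```

In the study, these statistics were computed separately for the standard-item and flawed-item subscales, which is what permits the comparisons reported below.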


Results


Table I. Descriptive statistics of four tests


 

Table II. Frequency of item flaws in four basic science examinations

 


 

Table III. Psychometric characteristics of the standard and flawed scales: Tests A, B, C, and D




PASS–FAIL AGREEMENT ANALYSIS


Table IV. Pass–fail agreement analysis, all examinations, all students (N = 749)





Discussion


In this study, it is important to understand the relationship among item difficulty, item discrimination, score reliability, and passing scores and passing rates.

  • Item difficulty refers to the proportion (%) of students getting the item correct.

  • Item discrimination describes how effectively the test item separates or differentiates between high-ability and low-ability students; highly discriminating test items are desirable.

  • All things being equal, highly discriminating items tend to produce high score reliability.

  • Because items in this study had each been assigned a passing score value (by the Nedelsky absolute standard-setting method, sketched below), it was possible to calculate passing scores (the score needed to pass the test) and passing rates (the percentage of students who pass) for the two subscales of interest – the standard and the flawed subscales. It is important to note that the passing score and the passing rate are inversely related; that is, high passing scores tend to produce lower passing rates.
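The Nedelsky procedure named in the last point can be sketched briefly. For each MCQ, judges mark the options a borderline (minimally competent) examinee could eliminate as clearly wrong; the item contributes the reciprocal of the remaining option count to the passing score, and the inverse relation between passing score and passing rate then follows directly. The function names and numbers below are illustrative assumptions, not the study's actual materials.

```python
def nedelsky_passing_score(remaining_options):
    """Nedelsky absolute standard: an item with R options that a
    borderline examinee cannot rule out contributes 1/R (the chance
    of guessing correctly among the surviving options)."""
    return sum(1.0 / r for r in remaining_options)

def passing_rate(total_scores, cut):
    """Proportion of students at or above the cut score; raising the
    cut can only lower this proportion (the inverse relation above)."""
    return sum(s >= cut for s in total_scores) / len(total_scores)

# Five items; judges left 2, 3, 2, 4, and 5 plausible options each.
cut = nedelsky_passing_score([2, 3, 2, 4, 5])  # 0.5+0.33+0.5+0.25+0.2 = ~1.78
print(passing_rate([3, 2, 4, 1, 2, 5], cut))   # toy scores, invented
```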


There was a high frequency of flawed items in the tests studied. This is an important finding, although not completely unexpected (Downing, 2002; Jozefowicz et al., 2002; Mehrens and Lehmann, 1991). Classroom assessments in medical school settings are not immune to poorly crafted test items.


Flawed item formats were more difficult than standard, non-flawed item formats for students in three of four examinations studied. These mixed results showed that flawed item formats were 0–15 percentage points more difficult than questions posed in a standard form. This finding is somewhat surprising, given that examinees in this study are medical students, highly experienced in taking MCQ examinations and presumably very testwise.


Passing rates (the proportion of students meeting or exceeding the passing score) tended to be negatively impacted by flawed items. Poorly crafted, flawed test questions tended to present more of a passing challenge for students.


The agreement between pass–fail outcomes assigned by the standard and the flawed scales shows that 102 of 749 students (14%) pass the standard items but fail the flawed items, while only 30 students (4%) pass the flawed items and fail the standard items. (These data must be interpreted cautiously, since the scales differ in length and reliability, and the passing scores also differ for some of the scales.) The 102 students (of 749) classified as passing the standard items while failing the flawed items are of great concern. Some of these misclassifications are due to random measurement error (unreliability), but some proportion is also due to the systematic error introduced by flawed items, given the results of this research.
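The comparison just described is a 2×2 cross-classification of pass–fail decisions. As a hedged illustration (the names are hypothetical; the boolean vectors would come from comparing each subscale score to its own cut score), the sketch below tallies such a table and its raw agreement rate.

```python
def pass_fail_table(standard_pass, flawed_pass):
    """Cross-tabulate pass/fail decisions on the standard-item and
    flawed-item subscales for the same students (parallel bool lists)."""
    table = {(s, f): 0 for s in ("pass", "fail") for f in ("pass", "fail")}
    for s, f in zip(standard_pass, flawed_pass):
        table[("pass" if s else "fail", "pass" if f else "fail")] += 1
    n = len(standard_pass)
    # Raw agreement: students classified the same way by both subscales.
    agreement = (table[("pass", "pass")] + table[("fail", "fail")]) / n
    return table, agreement
```

In the study's terms, the ("pass", "fail") cell holds the 102 false negatives of concern: students who passed the standard items but were failed by the flawed ones.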


A false negative rate this high seems unreasonable, given the relative ease and low costs associated with re-writing flawed questions into a form that would adhere to the standard, evidence-based principles of effective item writing. Clearly, this high misclassification rate impacts the consequential validity evidence for the tests in a negative manner (Messick, 1989).


The effect of flawed item forms on score reliability is mixed. The nature of the item format flaws studied contributes systematic error to the measurement, not random error; only random errors of measurement are estimated by internal-consistency score reliability. Thus, it is not surprising that score reliability shows little relationship to item flaws.


Messick (1989, p. 34) defines construct-irrelevant variance (CIV) as ‘‘…excess reliable variance that is irrelevant to the interpreted construct.’’ The excess difficulty and tendency toward lower passing rates for flawed vs. standard items meets Messick’s definition of CIV perfectly.
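One conventional way to formalize this point (an illustrative classical-test-theory decomposition, not notation taken from the paper) is to split observed-score variance into a construct-relevant component, a reliable but construct-irrelevant component contributed by the flaws, and random error:

$$\sigma^{2}_{X} \;=\; \sigma^{2}_{\text{construct}} \;+\; \sigma^{2}_{\text{CIV}} \;+\; \sigma^{2}_{E}$$

Because internal-consistency estimates such as K–R 20 count all reliable variance, including the CIV term, toward "true" score variance, flawed items can raise difficulty and distort pass–fail decisions while leaving the reliability coefficient essentially unchanged, consistent with the pattern reported above.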


The results of this study suggest that efforts to teach faculty the principles of effective objective-test item writing should be increased. The good news is that these faculty development efforts can concentrate on eliminating the five most common errors found in this study and thereby eliminate nearly all flawed items from tests.




Haladyna, T.M., Downing, S.M. & Rodriguez, M.C. (2002). A review of multiple-choice item-writing guidelines. Applied Measurement in Education 15: 309–334.


Jozefowicz, R.F., Koeppen, B.M., Case, S., Galbraith, R., Swanson, D. & Glew, H. (2002). The quality of in-house medical school examinations. Academic Medicine 77: 156–161.



 

 

 

 





Abstract

The purpose of this research was to study the effects of violations of standard multiple-choice item writing principles on test characteristics, student scores, and pass–fail outcomes. Four basic science examinations, administered to year-one and year-two medical students, were randomly selected for study. Test items were classified as either standard or flawed by three independent raters, blinded to all item performance data. Flawed test questions violated one or more standard principles of effective item writing. Thirty-six to sixty-five percent of the items on the four tests were flawed. Flawed items were 0–15 percentage points more difficult than standard items measuring the same construct. Over all four examinations, 646 (53%) students passed the standard items, while 575 (47%) passed the flawed items. The median passing rate difference between flawed and standard items was 3.5 percentage points, but ranged from -1 to 35 percentage points. Item flaws had little effect on test score reliability or other psychometric quality indices. Results showed that flawed multiple-choice test items, which violate well-established and evidence-based principles of effective item writing, disadvantage some medical students. Item flaws introduce the systematic error of construct-irrelevant variance to assessments, thereby reducing the validity evidence for examinations and penalizing some examinees.


