Are examiners' judgments in OSCE-style exams influenced by contrast effects? (Acad Med, 2015)

Are Examiners’ Judgments in OSCE-Style Assessments Influenced by Contrast Effects?

Peter Yeates, MClinEd, PhD, Marc Moreau, MD, and Kevin Eva, PhD





Unfortunately, judgment-based assessments carry psychometric weaknesses, and these problems have not been satisfactorily overcome through either reformulation or training.

judgment-based assessments are susceptible to a raft of psychometric weaknesses1–3 that have not been satisfactorily resolved through either reformulation4,5 or training.6–8


Assessor errors: Research on "assessor cognition" suggests that assessors hold relatively unique, idiosyncratic performance theories ("personal beliefs about what constitutes good performance") and use their own clinical ability, rather than recognized assessment standards, as their frame of reference. Assessors perceive that their judgments are influenced by their own emotions (e.g., not wanting to seem mean), by prior experience with particular trainees, and by their institutional culture. Further, assessors' judgments are shaped by inferences that go beyond what they observe, including presumptions about a trainee's culture, education, or motivation. Beyond these findings, assessors show a tendency, despite instructions to judge against a behavioral standard, to compare a trainee's performance against that of other trainees.

“assessor cognition” assessors appear to possess relatively unique, idiosyncratic performance theories (personal beliefs about what constitutes good performance)9,10 and may use their own clinical abilities as their frame of reference (rather than recognized assessment standards) when judging the performance of trainees.11,12 Assessors perceive that their judgments are influenced by their own emotions (e.g., not wanting to feel mean), their prior experiences of particular trainees’ performance, and their institutional culture.12 Further, assessors’ judgments appear to be frequently guided by inferences that go beyond their observations, including presumptions about a trainee’s culture, education, or motivation.12,13 Additional to these findings, an exploratory investigation10 suggested that assessors, despite instructions to judge against a behavioral standard, showed a tendency to make judgments by comparing trainees’ performance against other trainees’.


In short, this indicates that assessors' understanding of competence may be inherently comparative.

in essence, it indicates that their understanding of competence may be inherently comparative (i.e., norm referenced).17


When students enter an OSCE circuit in no particular order, the relationship between one student's performance and the preceding student's is random. Consequently, in a very large dataset, there should be no relationship at all between a candidate's score and the preceding candidate's score. A positive relationship would indicate an assimilation effect, whereas a negative relationship would indicate a contrast effect.

When students are not entered into an OSCE circuit in any particular order, the relationship between a performance and its predecessor should be random. Consequently, when a large dataset is examined, no relationships should exist between the scores of performances and their predecessors. A positive relationship between successive performances would indicate an assimilation effect, whereas a negative relationship would indicate a contrast effect. The first dataset was drawn from the 2011 United Kingdom Foundation Programme Office (UKFPO) Clinical Assessment. The second dataset was drawn from the 2008 Multiple Mini Interview (MMI) that was used for selection into the University of Alberta Medical School.
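The lagged-score comparison described above can be sketched as follows. This is a minimal illustration, not the paper's analysis code; the function name and toy data are assumptions, and the paper used linear regression on real exam scores rather than a toy sequence.

```python
import numpy as np

def lagged_slope(scores, lag=1):
    """Slope of a simple regression of each score on the score `lag`
    places earlier: negative -> contrast effect, positive -> assimilation
    effect, near zero -> no carry-over between successive candidates."""
    scores = np.asarray(scores, dtype=float)
    y = scores[lag:]       # index candidate's score
    x = scores[:-lag]      # preceding candidate's score
    slope, _intercept = np.polyfit(x, y, 1)
    return slope

# Toy sequence in which strong performances tend to be followed by
# lower marks -- the signature pattern of a contrast effect:
demo = [8, 4, 9, 3, 8, 4, 9, 3]
print(lagged_slope(demo) < 0)  # True
```

With genuinely random ordering, the expected slope is zero; a systematically negative slope across a large dataset is what signals contrast.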


A "station-circuit" comprises the scores of the candidates who pass through a particular station in a particular circuit of the exam. To prevent station difficulty and rater stringency from influencing the analysis, every score was converted to a z score. Performance-based assessment scores were collected for 5,288 candidates in total. The mean and median of all scores were similar, and skewness and kurtosis values fell between -1 and 1.

A “station-circuit,” therefore, comprised the scores for the candidates at an individual station for an individual circuit of the exam. To prevent station difficulty and/or rater stringency from influencing the analyses, we transformed every candidate score into a z score centered around the mean of its station-circuit. Performance-based assessment data were available for 5,288 candidate observations (see Table 1). All scores’ mean and median values were similar, with skewness and kurtosis values between −1 and 1, indicating that data were adequately normal for parametric analysis.
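The within-station-circuit z transformation amounts to centering each circuit's scores on its own mean and scaling by its own standard deviation. A minimal sketch, assuming a hypothetical circuit of raw marks (the numbers are illustrative only):

```python
import numpy as np

def station_circuit_z(scores):
    """Center a station-circuit's scores on its own mean and scale by its
    own SD, so station difficulty and rater stringency drop out of the
    comparison across stations and circuits."""
    s = np.asarray(scores, dtype=float)
    return (s - s.mean()) / s.std(ddof=1)

# One hypothetical station-circuit of raw marks:
circuit = [14, 16, 18, 20, 22]
z = station_circuit_z(circuit)
print(round(z.mean(), 10), round(z.std(ddof=1), 10))  # 0.0 1.0
```

After the transform, every station-circuit has mean 0 and SD 1, so a score's sign and magnitude express only how the candidate stood relative to peers at that station.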





To ensure that any observed relationship was not an artifact of the analysis itself, a Monte Carlo simulation was run in Excel: 20 random numbers were generated and subjected to the same analysis.

To ensure that any relationship observed was not an analytic artifact generated from the way in which scores were compared with preceding scores, we used Microsoft Excel (Microsoft Corporation, Redmond, Washington) to run a Monte Carlo simulation. Twenty random numbers were produced and then used to calculate the same “preceding candidate” metrics as described above (n-1, n-2, etc.).
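The same sanity check can be sketched in a few lines. This is an assumed reconstruction (the paper used Excel): 20 random numbers stand in for an unordered candidate stream, the "preceding candidate" metrics (n-1, n-2, n-3, and the average of the preceding three) are computed for each position, and with purely random data the score-vs-average slope should hover near zero.

```python
import numpy as np

rng = np.random.default_rng(42)

def preceding_metrics(scores):
    """For each candidate from the 4th onward, return the current score,
    the scores 1, 2, and 3 places back, and the average of those three."""
    s = np.asarray(scores, dtype=float)
    return np.array([(s[i], s[i-1], s[i-2], s[i-3], s[i-3:i].mean())
                     for i in range(3, len(s))])

sim = rng.normal(size=20)          # 20 random "scores"
m = preceding_metrics(sim)
slope, _ = np.polyfit(m[:, 4], m[:, 0], 1)  # score vs. preceding average
print(m.shape)  # (17, 5)
```

If this slope were systematically nonzero for random inputs, any relationship found in the real data could be an artifact of the metric construction rather than an examiner effect.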


The first analysis asked whether a contrast effect appeared. The current score was clearly related to the average of the three preceding scores, and relationships also appeared with each individual preceding score.

We initially queried whether the previously demonstrated contrast effects15,16 would appear in real-world assessments; the results suggest that examiners may be susceptible to contrast effects in real-world situations despite the formality of high-stakes exams and despite explicit behavioral guidance. Notably, the observed relationships were stronger when current scores were related to the average of three preceding scores relative to when they were related to any individual preceding performance.





Theoretically, this supports our earlier assertion that, despite the espoused reliance on criterion-referenced assessment in health professions education, even highly trained, well-resourced examiners participating in assessment lack any truly "fixed" sense of competence.

Theoretically, this sustains our prior assertions15,16 that, despite our espoused reliance on criterion-based assessments within health professional education,19 highly trained and well-resourced examiners may still lack any truly fixed sense of competence against which to judge when making assessment decisions.


Assessors appear to set their standard for a given examinee not simply by contrast with the most recent examinee, but by integrating the performances of examinees encountered earlier. This supports the idea that, despite behavioral descriptors and training, assessors accumulate a mental database of past exemplars against which they make comparative judgments. Because the effect persists to the very end of the exam, it is not a warm-up phenomenon, and so it cannot be removed by discarding one or two initial stations. The strongest negative relationships were with the average of the examinees seen four, five, and six places earlier, indicating a primacy effect: the cases examiners encounter early on carry particular weight.

This suggests that assessors mentally amalgamate previous performances to produce a performance standard to judge against rather than simply contrasting with the most recent performance. This is consistent with the idea that, despite the availability of behavioral descriptors and training, examiners accumulate a mental database of past exemplars against which they make comparative judgments. The persistence of the effect near the end of the exam indicates that it is not a “warm-up” phenomenon, and so cannot be counteracted by discarding one or two initial stations. That the strongest negative relationships were observed between later candidates and the average of students that preceded them by four, five, and six places suggests a primacy effect in these exemplar comparisons, in that the early cases that examiners see may be particularly meaningful in setting expectations.


Past studies showed contrast effects larger than those observed here, for which several explanations are possible. First, in laboratory (i.e., manipulated) settings, participants are consistently exposed to strong, bidirectional manipulations. That AvN4-6 exerted the largest influence is particularly suggestive: video-based exemplars could be used to give every assessor a standardized set of initial comparators.

The preceding laboratory studies showed contrast effects that were larger than those observed in this study, explaining up to 24% of observed score variance in mini-CEX scores. A number of explanations are possible. First, in the laboratory context, participants were consistently exposed to a strong, bidirectional manipulation (either seeing very good or very poor performances prior to intermediate performances). That AvN4-6 showed the largest influence has particularly important implications for practical strategies to overcome the biases observed in this line of research. If confirmed, it would suggest potential benefits in using video-based exemplars to create a standardized set of initial exemplar comparators for all examiners.











Acad Med. 2015 Jan 27. [Epub ahead of print]

Are Examiners' Judgments in OSCE-Style Assessments Influenced by Contrast Effects?

Author information

  • 1P. Yeates is clinical lecturer in medical education, Centre for Respiratory Medicine and Allergy, Institute of Inflammation and Repair, University of Manchester, and specialist registrar, Respiratory and General Internal Medicine, Health Education North West, Manchester, United Kingdom. M. Moreau is assistant dean for admissions, Faculty of Medicine and Dentistry, and professor, Division of Orthopaedic Surgery, University of Alberta, Edmonton, Alberta, Canada. K. Eva is senior scientist, Centre for Health Education Scholarship, and professor and director of educational research and scholarship, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Abstract

PURPOSE:

Laboratory studies have shown that performance assessment judgments can be biased by "contrast effects." Assessors' scores become more positive, for example, when the assessed performance is preceded by relatively weak candidates. The authors queried whether this effect occurs in real, high-stakes performance assessments despite increased formality and behavioral descriptors.

METHOD:

Data were obtained for the 2011 United Kingdom Foundation Programme clinical assessment and the 2008 University of Alberta Multiple Mini Interview. Candidate scores were compared with scores for immediately preceding candidates and progressively distant candidates. In addition, average scores for the preceding three candidates were calculated. Relationships between these variables were examined using linear regression.

RESULTS:

Negative relationships were observed between index scores and both immediately preceding and recent scores for all exam formats. Relationships were greater between index scores and the average of the three preceding scores. These effects persisted even when examiners had judged several performances, explaining up to 11% of observed variance on some occasions.

CONCLUSIONS:

These findings suggest that contrast effects do influence examiner judgments in high-stakes performance-based assessments. Although the observed effect was smaller than observed in experimentally controlled laboratory studies, this is to be expected given that real-world data lessen the strength of the intervention by virtue of less distinct differences between candidates. Although it is possible that the format of circuital exams reduces examiners' susceptibility to these influences, the finding of a persistent effect after examiners had judged several candidates suggests that the potential influence on candidate scores should not be ignored.

PMID: 25629945 [PubMed - as supplied by publisher]

