"Relatively speaking...": how contrast effects influence assessors' scores and narrative feedback (Med Educ, 2015)

Relatively speaking: contrast effects influence assessors’ scores and narrative feedback

Peter Yeates,1 Jenna Cardell,2 Gerard Byrne3 & Kevin W Eva4







Assessing trainees' clinical performance is important, but inter-assessor variability is a persistent problem.

Accurate assessment of trainees’ clinical performance is vital both to guide trainees’ educational development1 and to ensure they achieve an appropriate standard of care.2 Although overall support exists for workplace-based assessment (WBA),3–6 inter-assessor score variability has been cited as a cause for concern.7,8



Assessor training is generally sparse, and neither training nor rating-scale adjustments have produced much improvement. Such interventions assume that the challenges of rating performance can be overcome by helping raters better understand the rating task and the relationship between the observed performance and some definable criterion.

Training of assessors is generally sparse,9 and both training10,11 and scale alterations12,13 have produced only limited improvements. Implicit in such interventions is the assumption that challenges with rating performance can be overcome by helping raters better understand the rating task and the relationship between the observed performance and some definable criterion against which they can compare.


Assessors' judgements can be biased in two directions: contrast effects and assimilation effects.

As a result, the performance of recently viewed candidates can bias assessors’ judgements of current candidates. Such effects can (theoretically) occur in either of two directions: 

      • contrast effects occur when a preceding good performance reduces the scores given to a current performance by making the current performance seem poor ‘by contrast’, and 
      • assimilation effects occur when a preceding good performance increases the scores given to a current performance by focusing attention on similar aspects of performance.17


When the performances an assessor observes are mixed in quality, several patterns can occur: primacy, recency, and averaging.

When assessors observe a mixture of preceding performances, a variety of effects may occur: 

      • assessors may be most influenced by the initial performances they encounter (primacy), 
      • by the latest performances (recency), or 
      • by the aggregation of previous performances they have seen (averaging).



One aim is to determine whether an observed contrast effect reflects an influence on assessors' perceptions of the trainee's performance, or instead arises in the process of translating judgements into scores.

An additional objective of the current study is to determine whether the observation of contrast effects represents an influence on assessors’ perceptions of performance or is an artefact of the way that assessors translate judgements into scores.



On the difficulty of translating assessors' judgements into scores.

Crossley et al.19 have suggested that score variations arise in part because scales may not align with assessors' thinking, suggesting that the disappointing psychometric performance of WBA to date may stem not from disagreements about the performance observed, but from different interpretations of the questions and the scales. Other authors have suggested that assessments should focus on narrative comments rather than scores.20–23



To preserve ecological validity, the study used the assessment format that these assessors were already accustomed to.

As this study was intended to be a fundamental examination of the mechanism through which rater judgements might be influenced, we chose to preserve ecological validity by not altering the assessment format to which these examiners were accustomed. Similarly, we did not specify additional criteria or provide additional training to assessors.



Study design

The study used an Internet-based, randomised, double-blinded design. 



The 6-point rating scale

Scores were assigned using a 6-point scale with reference to expectations for completion of FY-1: 

      • 1 = well below expectations; 
      • 2 = below expectations; 
      • 3 = borderline; 
      • 4 = meets expectations; 
      • 5 = above expectations, and 
      • 6 = well above expectations. 

The UK FP intends these judgements to be criterion-referenced as ‘expectations for completion of FY-1’ are defined by reference to the outcomes of its curriculum.26



On the robustness of the t-test.

Adding to the literature suggesting that t-tests are very robust to deviations from normality,27–29 a recent systematic review has reported that equivalent non-parametric tests are more flawed than their parametric counterparts when such deviations exist.30 Nonetheless, we examined the skewness of the distributions (and found it to be < 1 in all instances) prior to proceeding with analysis via independent-samples t-tests.
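As an illustrative sketch (not from the paper), the skewness screen and an independent-samples t statistic can be computed in plain Python; the function names here are my own:

```python
import math

def skewness(xs):
    # Adjusted Fisher-Pearson sample skewness, used as a quick normality screen
    # (the authors checked |skewness| < 1 before running t-tests).
    n = len(xs)
    m = sum(xs) / n
    s = math.sqrt(sum((x - m) ** 2 for x in xs) / (n - 1))
    return sum(((x - m) / s) ** 3 for x in xs) * n / ((n - 1) * (n - 2))

def independent_t(a, b):
    # Student's independent-samples t statistic with pooled variance.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    sp2 = ((na - 1) * va + (nb - 1) * vb) / (na + nb - 2)
    return (ma - mb) / math.sqrt(sp2 * (1 / na + 1 / nb))
```

A symmetric sample gives skewness of zero, and identical groups give a t statistic of zero, which is a quick sanity check on both functions.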




Analysis of free-text comments

To address RQ 3, free-text feedback comments were coded using content analysis. Researchers, blinded to the group from which comments arose, segmented the feedback into phrases. Next, each comment was coded independently by two researchers to indicate whether it was ‘communication-focused’ (i.e. commenting on the candidate’s interpersonal skills) or ‘content-focused’ (i.e. commenting on the candidate’s knowledge or clinical skills). The two researchers used an initial subset of data to develop a shared understanding of the codes and then independently coded all remaining segments. The researchers also independently coded comments as positive, negative or equivocal. The independently coded comments were then compared and agreement was calculated using Cohen’s κ-values. Discrepant codes were independently reconsidered by both researchers and remaining differences were discussed and resolved. Frequencies of each thematic category (communication and content) were calculated for each performance by each participant. The positive, negative and equivocal codes were assigned scores of +1, −1 and 0, respectively, and their sum was calculated for each participant for each performance. This variable was termed ‘positive/negative (pos/neg) balance’.
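The two quantities used here, inter-rater agreement (Cohen's κ) and the pos/neg balance, can be sketched in a few lines of Python. This is my own minimal illustration of the standard formulas, not the authors' code:

```python
from collections import Counter

def cohens_kappa(rater1, rater2):
    # Cohen's kappa: chance-corrected agreement between two raters
    # who assigned categorical codes to the same segments.
    n = len(rater1)
    p_observed = sum(a == b for a, b in zip(rater1, rater2)) / n
    c1, c2 = Counter(rater1), Counter(rater2)
    p_expected = sum(c1[k] * c2[k] for k in c1) / n ** 2
    return (p_observed - p_expected) / (1 - p_expected)

def pos_neg_balance(codes):
    # Sum of +1 / -1 / 0 for positive / negative / equivocal comments,
    # per participant per performance.
    value = {'positive': 1, 'negative': -1, 'equivocal': 0}
    return sum(value[c] for c in codes)
```

For example, four comments coded positive, positive, negative, equivocal give a pos/neg balance of +1.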


Pearson product-moment correlation

We examined the relationship between scores and feedback measures using Pearson’s product moment correlations.
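The Pearson product-moment correlation is simply the covariance of two variables scaled by their standard deviations; a stdlib-only sketch (illustrative, not the authors' code):

```python
import math

def pearson_r(xs, ys):
    # Pearson product-moment correlation between paired observations,
    # e.g. scores vs. pos/neg balance for the same performances.
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)
```

Perfectly linear pairs give r = 1 (or r = -1 for a decreasing relationship), which bounds the "moderate to strong" correlations reported later.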


Effect size

All analyses were performed using IBM SPSS Statistics for Windows Version 20.0 (IBM Corp., Armonk, NY, USA). A p-value of < 0.05 was set as the significance threshold. Cohen’s d is used to report effect size for all statistically significant comparisons. By convention, d = 0.8 is considered to represent a large effect, d = 0.5 represents a moderate effect, and d = 0.2 represents a small effect.
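Cohen's d expresses a mean difference in units of the pooled standard deviation; a minimal sketch of the standard formula (not taken from the paper):

```python
import math

def cohens_d(a, b):
    # Cohen's d: difference in group means divided by the pooled
    # standard deviation of the two groups.
    na, nb = len(a), len(b)
    ma, mb = sum(a) / na, sum(b) / nb
    va = sum((x - ma) ** 2 for x in a) / (na - 1)
    vb = sum((x - mb) ** 2 for x in b) / (nb - 1)
    pooled_sd = math.sqrt(((na - 1) * va + (nb - 1) * vb) / (na + nb - 2))
    return (ma - mb) / pooled_sd
```

On this metric the study's d = 1.3 for the biased-versus-unbiased good performance is well past the conventional "large effect" threshold of 0.8.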














The valence of feedback (valence in psychology: the emotional value associated with a stimulus) showed moderate to strong correlations with performance scores. The content of the feedback, however, was not affected by assessors' recent experience of other performances, suggesting that assessors identified similar issues regardless of condition but judged their severity differently depending on context.

The valence of feedback showed moderate to strong relationships with performance scores and revealed similar influences of contrast. The content of participants’ feedback, by contrast, was not altered by their recent experiences of other performances, which suggests that raters across conditions identified similar issues, but interpreted their severity differently depending on the cases they had previously seen.



Even a single prior performance influences the next assessment; assessors are swayed more by the immediate context than by information in long-term memory.

Firstly, this study has shown that a single performance is sufficient to produce a contrast effect on assessors’ judgements despite the fact that participants claimed to have considerable experience in conducting assessments of this type. Schwarz32 has suggested that contrast effects occur because humans are more readily influenced by information from their immediate context than by information in long-term memory.



Although assessors are influenced by the immediate context, the comparison they make is against an average of their preceding experiences.

This study further suggests that whilst assessors are readily influenced by information in their immediate context, the comparison they make is against an evolving standard that may be based on the average of preceding performances.



Is it an error arising when translating one's judgement into a number (a score)?

Prior studies have yielded questions about whether apparent biases arise because assessors find it difficult to translate their judgements into scores.24,34




The content of assessors' feedback itself did not change; only the perceived severity of the issues appears to have shifted. This suggests that contrast effects influence assessors' fundamental impressions of performance, not merely the process of translating judgements into scores.

In this study, the content of assessors’ feedback (linguistic expressions of their judgement) was unchanged as a function of our experimental manipulation, which suggests that they saw the issues similarly in each condition. The considerable variability in the valence of their comments, however, suggests that the perceived severity of the issues observed was influenced by contrasts with previously seen cases. This suggests that such cases fundamentally influence assessors’ impressions of performance rather than simply biasing their translation of judgements into scores.


In small programmes, particular examiners meet only a relatively small number of trainees over a long period, which may make the trainees appear more different from one another than they really are because each is compared against the others.

In smaller programmes, however, such as the longitudinal integrated clerkships that are becoming increasingly popular,36 particular examiners will interact with a relatively small number of trainees over a long period of time, creating risk that the trainees will appear more different from one another than is realistic because they are implicitly contrasted against one another.


There are social pressures on assessment. Assessors often must continue working with those they assess, so ratings are commonly skewed positive. The narrower the range of ratings, the less likely a contrast effect is to have practical influence. In higher-stakes examinations, where examiner and examinee are unknown to each other, assessors can spread their ratings more widely, giving contrast effects greater potential influence.

Finally, it should be noted that social pressures of various types can have both implicit and explicit influences on the ratings that assessors assign. In many assessment contexts, including those in which assessors must continue to work with those being assessed, a positive skew in the ratings assigned is commonplace. The more the ratings are compressed into a narrow range, the less likely it is that a contrast effect of discernible practical significance will be observed. In higher-stakes examinations in which the examiner and examinee are unknown to one another or in the context of a video-based review of performance in which the examiner is anonymous (as in this study), raters may be more likely to spread their ratings out, thereby creating greater potential for the psychological contrasts observed in this study to be seen to have influence.







Med Educ. 2015 Sep;49(9):909-19. doi: 10.1111/medu.12777.

Relatively speaking: contrast effects influence assessors' scores and narrative feedback.

Author information

  • 1Centre for Respiratory Medicine and Allergy, Institute of Inflammation and Repair, University of Manchester, Manchester, UK.
  • 2Royal Bolton Hospital, Bolton NHS Foundation Trust, Bolton, Lancashire, UK.
  • 3Health Education North West, Health Education England, Manchester, UK.
  • 4Centre for Health Education Scholarship, Division of Medicine, University of British Columbia, Vancouver, BC, Canada.

Abstract

CONTEXT:

In prior research, the scores assessors assign can be biased away from the standard of preceding performances (i.e. 'contrast effects' occur).

OBJECTIVES:

This study examines the mechanism and robustness of these findings to advance understanding of assessor cognition. We test the influence of the immediately preceding performance relative to that of a series of prior performances. Further, we examine whether assessors' narrative comments are similarly influenced by contrast effects.

METHODS:

Clinicians (n = 61) were randomised to three groups in a blinded, Internet-based experiment. Participants viewed identical videos of good, borderline and poor performances by first-year doctors in varied orders. They provided scores and written feedback after each video. Narrative comments were blindly content-analysed to generate measures of valence and content. Variability of narrative comments and scores was compared between groups.

RESULTS:

Comparisons indicated contrast effects after a single performance. When a good performance was preceded by a poor performance, ratings were higher (mean 5.01, 95% confidence interval [CI] 4.79-5.24) than when observation of the good performance was unbiased (mean 4.36, 95% CI 4.14-4.60; p < 0.05, d = 1.3). Similarly, borderline performance was rated lower when preceded by good performance (mean 2.96, 95% CI 2.56-3.37) than when viewed without preceding bias (mean 3.55, 95% CI 3.17-3.92; p < 0.05, d = 0.7). The series of ratings participants assigned suggested that the magnitude of contrast effects is determined by an averaging of recent experiences. The valence (but not content) of narrative comments showed contrast effects similar to those found in numerical scores.

CONCLUSIONS:

These findings are consistent with research from behavioural economics and psychology that suggests judgement tends to be relative in nature. Observing that the valence of narrative comments is similarly influenced suggests these effects represent more than difficulty in translating impressions into a number. The extent to which such factors impact upon assessment in practice remains to be determined as the influence is likely to depend on context.

© 2015 John Wiley & Sons Ltd.

PMID: 26296407

