Seeing the same thing differently: mechanisms of assessor differences in directly-observed performance assessments (DOPA) (Adv in Health Sci Educ, 2013)

Seeing the same thing differently 

Mechanisms that contribute to assessor differences in directly-observed performance assessments

Peter Yeates • Paul O’Neill • Karen Mann • Kevin Eva







Background


The assessment of professional competence requires sampling and integration of measures of performance on multiple diverse skills (Van der Vleuten and Schuwirth 2005). Within this framework, workplace based assessments (or performance assessments) represent an attractive tool as they offer samples of performance from real practice, simultaneously assess multiple competencies in an integrated manner and offer opportunities for feedback (Norcini 2003). Whilst support exists for the general utility of these assessments, variability inherent in the scores that result from them is problematic as it threatens their validity (Hawkins et al. 2010; Pelgrim et al. 2011).


Whilst score variations may have arisen partly due to true score variability (in that different assessors generally assess different performances), a further study controlled for this by asking assessors to rate common videoed performances. This study showed that assessors' scores ranged from 1 to 6 on a 9-point scale whilst rating the same performance (Holmboe et al. 2003).


In addition to problems of inter-rater reliability, scores also demonstrate range restriction (Alves de Lima et al. 2007; Wilkinson et al. 2008) and high correlations between scores from different domains of competence (Cook et al. 2010; Fernando et al. 2008; Margolis et al. 2006), the latter issue being problematic to those who present variable competencies as independent aspects of practice (Lurie et al. 2009).


Restricting the scale range has been seen to reduce rather than increase inter-rater reliability (Cook and Beckman 2009), whereas the addition of behavioural anchors has produced only small improvements (Donato et al. 2008). Cook et al. (2009) investigated whether rater training could improve inter-rater reliability, but showed no significant effect. Thus, performance assessment scores are problematic, and neither alterations of scale format nor rater training have produced the desired improvement in their psychometric properties.


In attempting to reconceptualise this problem, Govaerts et al. (2007) assert that viewing performance assessments through the lens of classical test theory offers a limited perspective. This theoretical orientation views the assessor as a "faulty instrument" that produces unreliable measures of a stable entity (hence the classical test theory notion of "true score" and "error") (Streiner and Norman 2008, p 170). Instead, Govaerts et al. propose a theoretical view of performance assessment based on a constructivist, social-psychological perspective. This asserts that social and cognitive factors interact to produce idiosyncratic individual judgements on performance (i.e., variability that can be attributed to meaningful differences in the perceptions of raters). That is, while some variability will arise from the well-documented cultural or other biases that raise concerns about the validity of a rating process, some might arise simply from individual peculiarities in approach, for example comparatively unique ways in which the task is understood or judged, or differences in the specific aspects of practice to which assessors attend. In this model, score variations arise from a plurality of "true scores" rather than from a single "true score" that is distorted by "error".


The social and psychological processes that contribute to forming judgements on an individual's performance have been extensively studied within occupational psychology (DeNisi 1984; Feldman 1981), and their relevance to assessment within medical education has been considered (Govaerts et al. 2007; Gingerich et al. 2011; Williams et al. 2003). In summary, the judgement process can be viewed as a categorisation task that proceeds through a mixture of automatic and deliberate cognition to assign individuals to categories, in part based on similarity. Assessors necessarily possess judgement-related schemata that have arisen through exposure to past exemplars. Various distortions of memory and information search, and faulty attribution processes, can account for many errors within these processes. Judgements are conducted within a social context and can be influenced by (amongst other things) the assessor's disposition, the purpose of the assessment, and the relationship between the trainee and the assessor.


Very few studies have investigated the processes responsible for assessors' judgements within directly-observed performance assessments (DOPA) in medical education. Govaerts et al. (2011) showed that, compared to non-experts, expert assessors:

  • developed problem representations more quickly,
  • were more sensitive to contextual cues, and
  • made more inferences.

 

Thus experts appear to possess more detailed assessment schemata than non-experts. Kogan et al. (2010, 2011) described a model in which assessors possess differing personal characteristics (disposition/clinical competence/age/gender etc.) and then view performance through dual lenses of inferences about trainees, and either internal or external frames of reference. In their model the judgement and subsequent synthesis are influenced by environmental factors and context. Thus judgements are susceptible to a range of cognitive and social factors.



Methods



Assessment format


Assessors score 7 domains of the performance (history taking; physical examination; communication skills; critical judgement; professionalism; organisation/efficiency; and overall clinical care) using a 6-point Likert scale:

  • point 4 is anchored against the criterion of "meets expectation for F1 completion";
  • point 3 is "borderline for F1 completion";
  • the remaining points comprise "well below", "below", "above" and "well above" this criterion, plus an "unable to comment" option.
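As a rough illustration, the rating form described above can be sketched as a small data structure. The domain names and anchor labels follow the text; the structure itself and the `describe` helper are hypothetical, not part of the actual mini-CEX instrument:

```python
# Illustrative sketch of the rating form described above (assumed structure;
# domain names and anchor labels are taken from the text).

DOMAINS = [
    "history taking", "physical examination", "communication skills",
    "critical judgement", "professionalism", "organisation/efficiency",
    "overall clinical care",
]

# 6-point scale anchored against "expectation for F1 completion",
# with None standing in for the "unable to comment" option.
ANCHORS = {
    1: "well below expectation for F1 completion",
    2: "below expectation for F1 completion",
    3: "borderline for F1 completion",
    4: "meets expectation for F1 completion",
    5: "above expectation for F1 completion",
    6: "well above expectation for F1 completion",
    None: "unable to comment",
}

def describe(score):
    """Return the anchor label for a numeric score (or None)."""
    return ANCHORS[score]
```

In this form, a completed assessment would simply map each of the seven domains to one of the six scale points (or to the "unable to comment" option).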



Development of materials


We developed scripted videos of performances by foundation (PGY1) doctors at different levels: "good", "borderline" and "poor" performances.

 

We used literature on desirable contents of history taking (Kurtz et al. 2003; Martin 2003) and on the development of expertise (Boshuizen and Schmidt 1992; McLaughlin et al. 2007), along with the authors' experience of foundation doctors, to write abstract descriptions of expected performance at each level.


 


Participants


All participants were consultant physicians from the North West of England.


 


Procedure


Participants viewed videos individually on a laptop computer with headphones. They were instructed to imagine that they were on the medical admissions unit and that a Foundation Year 1 doctor had requested a Mini-CEX.



Think aloud process


"Think aloud" protocols such as those used here should not be treated as necessarily indicative of the actual thought processes that are guiding participants' decision-making (Bargh and Chartrand 1999). They are useful, however, for exploring individuals' perceptions of factors that influence their thought processes, which can yield rich insight for further testing (Wilson 1994).

 

To maximize the usefulness of the process in our study we followed the guidance provided by Ericsson and Simon (1980), who suggest that it is important to:

  • (a) ensure that participants are actively engaged in the task in question,
  • (b) ask participants to describe, rather than explain, their thoughts, and
  • (c) reduce the time between participants' thoughts and their verbalisation.



Analysis of videos’ construct validity

 


 


Analysis of qualitative data


Audio recordings were transcribed verbatim and checked for accuracy. A researcher (PY) labelled sections to indicate whether they were

  • concurrent (spoken whilst watching performance) or
  • retrospective (spoken after), or
  • from follow up interviews.


Following repeated reading, PY began analysis by inductively assigning codes. Codes were developed to describe each new aspect relevant to the research question. These were discussed with other researchers (PON, KM) and refined from an initial 67 codes to 21 based on similarity. We grouped these codes as:

  • "trainee-focused codes": comments on the behaviours of the trainee, and
  • "assessor-focused codes": comments that indicated ways in which the assessor was thinking.

 

A second researcher (KM) coded 2 transcripts independently, and comparison was made to develop the interpretation and consistency of codes. Constant comparison was used to compare the use and content of both trainee-focused and assessor-focused codes within each assessor, across the different performance levels, and subsequently between assessors.


As the analysis proceeded, we used memos to capture emergent theoretical ideas from the data that enabled understanding of the research question. These were systematically tested and refined or refuted with existing and subsequent data. We developed axial codes to label further examples of these new theoretical concepts.

 

Data were further examined to determine the inter-relationships between theoretical concepts, and to organise concepts into themes. We discussed and reflexively considered all emerging concepts against the data as analysis progressed. Throughout the process, deviant cases that did not fit were sought and used to challenge and refine the emerging theory. When analysis was completed, only slight deviations from the theory were found. These are highlighted in the results.


We collected and analysed data iteratively as the study progressed. Codes were applied by the same researcher (PY) and theory was progressively developed across iterations. Throughout each iteration, we monitored each area of developing theory, and considered whether the new data extended or changed the conceptual ideas that it expressed. Saturation was judged to have occurred when iteration 4 developed the theory very little and iteration 5 did not alter the theory. Coding was done using QSR NVivo 8 software. This was used as part of the audit trail, which also included documentation of all memos, and the iterative development of theory.


Results



Sources of variability in assessors’ judgements



Differential salience


We found that what struck one assessor as important about a given performance varied from what struck a different assessor as important about the same performance.


Moreover, when a given assessor commented little on one aspect of a performance, they typically commented more on different aspects. Thus the relative extent to which assessors commented on different aspects of the performances varied. In this way, assessors' overall focus within each performance seemed comparatively unique.


In sum, the aspects of the performances that assessors regarded as useful for determining their quality varied, despite viewing the same performances. It is not clear from our data whether there were differences in attentional focus during the observation (i.e., noticing) or differences in the emphasis assigned to a given aspect of performance (i.e., weighting).


By having different aspects of the same performance take on variable degrees of salience, raters were in essence forming judgements based on different observations. This represents the first source of variability that contributes to differences between assessors in the judgement process.


 


Criterion uncertainty


The assessment format asks assessors to judge performance in comparison to "meeting expectations for F1 completion". These expectations were described as experientially developed through exposure to post-graduate (foundation) year 1 doctors encountered over the course of assessors' careers.


Assessors differed in the way they described the constituents of their expectations. Some assessors emphasised the need for factual coverage; others were more concerned with communication or rapport building, diagnostic accuracy, or evidence of developing independence.

 

For some, the interview process (rather than factual content) was key to their expectations. Singular aspects (e.g. the presence or absence of a drug-allergy history) were sometimes pivotal.


Further comments indicated considerable variation in assessors' perceptions of the level at which foundation doctors typically perform.


In summary, assessors possessed expectations of foundation doctor performance that served as a general criterion. These were experientially derived, differently constructed, and often ambiguous. Perhaps in response to this ambiguity, assessors also made relative comparisons that augmented their criteria, but these comparisons had the potential to be situationally influenced, as assessors' perceptions of the level at which foundation doctors typically perform also varied. Consequently, differences in understanding and use of the assessment's criteria (probably due to differing personal experiences) acted as a second mechanism that contributed to variability in assessors' scores, introducing relative uniqueness into their judgements.

 


 


Information integration


Thus it appears that, by and large, assessors judge in their own narrative descriptive language.


Most assessors described that their judgements evolved in global terms, or that they perceived the domains as difficult to distinguish between.


As a consequence, allocating a score for each domain required two processes:

  • conversion of the assessor's judgement from their individual narrative description into the scale descriptors, and
  • conversion into scores for each domain based on a global impression.

 

In this way it appears that variability in global impressions influenced variability of perceptions of domain scores, rather than the reverse.


In sum, as assessors' judgements took shape, the degree to which a performance was judged to be competent was represented by means of individual, comparatively unique narrative descriptions that varied between assessors. These tended to form along with a global overall judgement, both of which had to be converted into scale descriptors for individual aspects of practice. How that conversion took place may have further influenced the variability inherent in the scores.


 


Discussion



Summary of findings and theory


 

Despite viewing the same performances, assessors' attentional focus, and perhaps the weight they assign to different aspects of performance, varies such that different aspects of the performance become salient to different assessors to different degrees. Consequently, assessors appear to rely on different, comparatively unique combinations of observations when formulating judgements.


Secondly, assessors compare these observations against mentally held competence criteria that are often uncertain, are differently constructed, and include different perceptions of typicality. Assessors appear to formulate these criteria, and judge against them, at least partly through reference to exemplar trainees with whom they have experience. Uniqueness in assessors' experience is therefore likely to have a multifaceted influence:

  • by altering assessors' perception of which facets of the performance are most salient,
  • by influencing the criterion standard held by the assessor, and
  • by creating a group of exemplars against which assessors directly compare.


Finally, as assessors form judgements that are influenced by these various processes, they describe (and therefore presumably mentally represent) the valence of those judgements (or the judged degree of difference between their observations and their criteria) in individually generated narrative language. These individual narrative (and global) judgements are converted into the assessment scale to produce scores for each individual domain.


Further, while we have focused on variability in judgement at the individual level, one should not presume that such variability could ever be infinite in scope. Thammasitboon et al. (2008) showed that whilst different assessors possessed different conceptions of competence in multi-source feedback, their conceptions could be grouped into four different constructs of competence. Presumably a larger sample of assessors might have enabled us to identify repeated patterns of performance that could be grouped. Observation of such patterns would reinforce the conclusion that meaningful differences might be drawn from variability in perception, rather than simply concluding that all variability not attributable to the individuals being assessed should be deemed to be "error".


 


Consideration of limitations


 


Theoretical implications of findings


This study has a number of important implications. Govaerts et al. (2007) suggested that score variation is better viewed as a form of idiosyncrasy rather than simply as error. Our results do not allow us to make claims about "accuracy" per se, but they do illustrate how multiple individual peculiarities can combine to produce idiosyncrasy in assessors' judgements.


Idiosyncratic judgements on performance by assessors have been previously reported in a different context. Whereas Ginsburg et al. (2010)'s study was based on impressions of different residents, our results show that similar findings can occur even when assessors view the same pool of performances, thus indicating that at least some of this idiosyncrasy arises from assessor differences rather than from differences in trainee behaviour.


From an educational perspective, the difference between inaccuracy and idiosyncrasy is important: most dominantly, the field has adopted a psychometric perspective which assumes that once we have sampled enough to ensure adequate reliability, we can provide learners with "accurate" feedback. If some variability instead indicates meaningful (i.e., equally valid) different impressions of performance, we are now faced with determining how to triangulate between multiple perspectives to create a complete picture of a learner and convey useful feedback about their performance. This concept has been previously articulated by Govaerts et al. (2007) and resonates with the approach to programmatic assessment suggested by van der Vleuten and Schuwirth (2005).


The individual mechanisms that contribute to idiosyncrasy of assessors' impressions have further implications for both theory and practice within assessment. The finding that assessors attribute different degrees of salience to different aspects of common performances (through noticing or paying attention to them differently) questions the notion that assessors can "objectively" observe, let alone judge or rate, performances. Thus, whereas much prior effort has gone into providing examiners in formal exam settings with defined criteria against which to judge (Newble 2004), very little work has been undertaken concerning examiners' attentional focus whilst judging.


Assessor training, in particular "Frame of Reference Training" (FORT), has been shown to be effective across a breadth of contexts in occupational psychology, showing moderate to large effects on a range of endpoints (Woehr 1994). Frame of reference training involves:

  • defining performance dimensions,
  • providing a sample of behavioural incidents representing each dimension (along with the level of performance represented by each incident), and
  • practice and feedback using these standards to evaluate performance (Schleicher et al. 2002).


Therefore, the results reported by Holmboe et al. (2004) and Cook et al. (2009) in a medical education context, showing limited or no effect of rater training, are surprising. We found that assessors described their criteria as experientially derived over the course of their careers, through exposure to repeated exemplars. Despite this long experience, criteria remain uncertain and can be influenced by recent examples. These findings raise the possibility that exposure to a greater number of exemplars, rather than agreeing on attributes, may represent an alternative, potentially successful strategy that is more akin to the way assessors' criteria are represented.





Adv in Health Sci Educ. 2013 Aug;18(3):325-41. doi: 10.1007/s10459-012-9372-1. Epub 2012 May 12. PMID: 22581567.

Seeing the same thing differently: mechanisms that contribute to assessor differences in directly-observed performance assessments.

School of Translational Medicine, University of Manchester, Manchester, UK. peter.yeates@manchester.ac.uk

Abstract

Assessors' scores in performance assessments are known to be highly variable. Attempted improvements through training or rating format have achieved minimal gains. The mechanisms that contribute to variability in assessors' scoring remain unclear. This study investigated these mechanisms. We used a qualitative approach to study assessors' judgements whilst they observed common simulated videoed performances of junior doctors obtaining clinical histories. Assessors commented concurrently and retrospectively on performances, provided scores and follow-up interviews. Data were analysed using principles of grounded theory. We developed three themes that help to explain how variability arises: Differential Salience (assessors paid attention to, or valued, different aspects of the performances to different degrees); Criterion Uncertainty (assessors' criteria were differently constructed, uncertain, and were influenced by recent exemplars); and Information Integration (assessors described the valence of their comments in their own unique narrative terms, usually forming global impressions). Our results (whilst not precluding the operation of established biases) describe mechanisms by which assessors' judgements become meaningfully different or unique. Our results have theoretical relevance to understanding the formative educational messages that performance assessments provide. They give insight relevant to assessor training, assessors' ability to be observationally "objective", and the educational value of narrative comments (in contrast to numerical ratings).

