Rater Cognition (Med Educ, 2016)

Rater cognition: review and integration of research findings

Genevieve Gauthier,1 Christina St-Onge1 & Walter Tavares2,3,4





INTRODUCTION


In order to assess these complex competencies, judgements need to be made about the performance of trainees, to support the development of high-level skills or abilities, such as knowledge application, analysis, synthesis and evaluation.1 The contexts and complexity of abilities and skills assessed within these frameworks require that we increasingly rely on rater judgements in assessing trainee performance.2


Some have greeted the increased reliance on rater judgement in CBE with substantial criticism, because rater judgement has often been accused of being too variable and fraught with bias. In fact, the psychometric perspective, which has been dominant until recently, has positioned rater variability as measurement error, because rater variability has been shown at times to explain more of the variability seen in trainees’ scores than the trainees’ own performances.3–6 Attempts to address this variability problem by improving rating forms and systems, or by training raters, have not produced meaningful improvements.7–11


In a recent review, Gingerich et al.13 categorised rater cognition research into three different perspectives focusing on the assessors as either

  • (i) failing to apply appropriate assessment criteria,

  • (ii) being cognitively overwhelmed by the complexity of the task, or

  • (iii) expert decision makers sending unique but meaningful messages.


The bidirectional arrow in Fig. 2 represents the interdependence and constant interaction between the phases.

STRUCTURE OF PHASES


Aligned with previous work on performance, recent findings in rater-cognition research have implied a three-phase framework of observation, processing and integration.

  • The observation phase refers to how raters attend to and actively select information about trainees and their performances. This phase is the most agreed upon and is mentioned by most authors.

  • The processing phase encompasses appraisal of how raters retrieve and use contextual information and prior knowledge to inform and compare the performance at hand.

  • The integration phase speaks to how raters combine different sources of information to form an overall judgement.




Observation phase


In this initial phase, a rater observes a trainee’s performance in what is usually a complex and authentic or educational setting to identify and attend to relevant information that will be used subsequently to form a judgement. The impact of the observation phase and its interrelation with subsequent phases have been well acknowledged,17–20 but the specific terminology used to describe this initial phase differs between researchers (e.g. recognition, selection of information, perception and first impression).21–27



The number and complexity of mechanisms that occur during the observation phase have been well conceptualised.

  • Tavares and Eva18 described this phase of information acquisition as an active selection and detection of relevant behaviours that require management, integration and categorisation in the rater’s working memory.
  • The observation, once conceived as a passive and objective task, requires active observation skills that involve both automatic and controlled processes during which observers are influenced by conscious and unconscious situational and dispositional factors.28,29
  • Yeates et al.17 discussed the notion of differential salience occurring when, despite viewing the same video, raters paid attention to different aspects of the performance.



Processing phase


In this phase, raters are thought to give meaning to the information collected during the observation phase. That is, they are thought to make sense of what they have observed by comparing it with some implicit or explicit standard.

 

This phase is also referred to as organisation,21,25 judgement,20 criterion uncertainty,17 interpretation,19,23 provisional grade27 and evaluation.24



More specifically, from an assessment perspective, the processing phase implies two levels of analysis. First, raters need to determine the quality of the performance in terms of elements of competency. Second, raters need to determine the level at which the trainee demonstrates these elements of competency in comparison to various reference points. This process, which involves both automatic and controlled processes,17,18,28,30 is often referred to as schema-based categorisation.19,30




Integration phase


During this phase, raters form and articulate an overall judgement and rating about trainee performance.

 

Fewer authors seem to agree about this phase; some authors17,18 put mechanisms that we have included in the integration phase into the processing phase.


This third phase is sometimes referred to as the translation phase,18,23 final grade27 or feedback phase.20,24


The integration phase encompasses the integration of different sources of information. This information is uniquely weighted to form a final global impression, which is then subsequently converted into a corresponding rating.



 

 

SHARED MECHANISMS


Table 1 illustrates the most salient mechanisms described in the reviewed studies on rater cognition.


 

 


 

Mechanisms in the observation phase


The first phase, the observation phase, is where we position three different mechanisms reported as happening somewhat independently from one another:

  • (i) generating automatic impressions about the person,

  • (ii) formulating high-level inferences and

  • (iii) focusing on different dimensions of the assessed competencies.


The automatic generation of impressions about a person is a mechanism that takes place unconsciously and through which raters make generalisations about a trainee’s performance based on that person’s specific unrelated aspects or traits. The existence and impact of this mechanism has been well documented and includes the well-known halo effect.19,21,22,26,27,32


Although the concept of a person-based impression that happens automatically is well accepted, questions persist as to the potential usefulness of this mechanism,22 the degree of influence it has on the overall process,21 and the exact nature of these judgements in different contexts.19


In parallel, the results of these investigations raise interesting questions regarding the current unit of analysis in most assessment tasks, which is anchored at the performance level as opposed to the person level, and which seems to be contrary to how human judgement is structured.29 For example, research on candidate-specific raters compared with station-specific raters reported a higher number of candidate failures and a higher level of internal consistency, while showing a trade-off in reliability.33


Another mechanism common to various studies on rater cognition, and beginning during the observation phase, is the formulation of high-level inferences as opposed to descriptions of observable behaviours or facts regarding trainee performance.13,20,25,30,34,35


Kogan et al.20 viewed the high level of inferences in rater observations (when the raters were not always conscious of them) as a problem to be addressed. By contrast, the increased frequency of high-level inferences of experienced raters compared with less experienced raters led Govaerts et al.25 to infer that automaticity developed as raters gained experience in assessing trainee performance.


To understand the type of information that could influence raters’ inferences about complex competencies such as professionalism, Ginsburg et al.34 …


Understanding what type of information can support raters’ ability to make useful inferences has also been explored in the context of standard setting.36


Another layer of complexity may arise as we begin investigating rater assessment practices regarding inferences. As reported by Pulito et al.,37 raters used direct observations in assessing medical knowledge, professionalism and clinical-reasoning skills, whereas they made inferences using the students’ presentations to assess history-taking and physical-examination skills.



Another shared finding during the observation phase is how raters attend to different elements of competencies.17,20,23,27,30,37,38 Yeates et al.17 observed variations in comments by raters assessing the same video performance. What struck some raters as being a key aspect of a performance was sometimes not even mentioned by other raters. Kogan et al.20 had similar findings.



Similarly, Yaphe and Street27 highlighted that raters focus on different attributes depending on whether the performance being evaluated was good or poor.



The use of assorted elements of competencies may reflect raters’ unique understanding of competency in relation to specific types of performances and contexts. The influence of context and task on rater performance is highlighted in Tavares’38 findings, in which the performance of raters was affected when they were required to evaluate seven dimensions instead of two.



Mechanisms in the processing phase



The glue that binds the mechanisms together in the processing phase is the presence and use of complex schemata based on:

  • (i) a personal conception of competency,

  • (ii) comparison with various exemplars and

  • (iii) task and context specificity.

 

Although these mechanisms have been described as being both automatic and controlled, raters seem to experience difficulty in articulating how they interpret trainee performance, suggesting that such mechanisms may be more automatic than controlled. For example, participants in Kogan et al.’s20 study had difficulty explaining their assessment processes, and a number of them could not articulate how they interpreted a trainee’s performance in terms of specific references or standards.


The complex and implicit nature of rater judgement has been linked to a categorisation process, in which interpretations occur through the use and comparison of schemata developed by exposure to trainee performance and exemplars of performance.17,25,39–42 The occurrence of such a categorisation mechanism might explain why experienced raters can give richer and more detailed descriptions of trainee performances than inexperienced raters.25,35,39,43


One of the three components identified as playing a role in the categorisation process is the rater’s personal conception of competency.19,20,39,44 Kogan et al.45 found that the raters’ own clinical skills related to a task correlated with their assessment of trainees performing the task. The importance of the raters’ own level of competence with respect to the task is also reflected in Berendonk et al.’s39 findings about raters selecting assessment tasks related to their own area of expertise and content knowledge. Raters’ concept of competence may involve judging how things should be done in the context of procedural skills, but it may also frame the nature of raters’ expectations and standards when judging more complex competencies such as communication and professionalism.44


Another component of the categorisation process is the use of various exemplars accumulated through experience as benchmarks for comparison with observed trainee performances. The use of these exemplars has been identified as a strategy adopted by raters to set standards for specific assessment tasks.20,30,39–41 The nature of these exemplars ranges from trainees assessed in the past40–42 to the rater’s own memory of his or her own performance at the same level of training20,30 to current experience with learners at similar levels or even colleagues with varied levels of experience.20,30 As demonstrated by research on the contrast effect,40,42,46 raters implicitly relied on a norm-based evaluation process even in the context of criteria-based assessment.


 

The role of specific tasks and contexts, which we refer to as task specificity, relates to the well-documented phenomenon of context specificity47 and was mentioned in a number of studies.18–20,27,30,39,43,48 Experienced raters provided richer assessment comments and descriptions of task-specific performances,19,39 and they discussed the evolution of task-specific standards as they gained experience with a specific task. They also acknowledged the role of the task as a mediating factor for learners.



 

Mechanisms in the integration phase



In this last phase, the observed mechanisms seem to reuse the information from the previous phases to

  • (i) weigh and synthesise information differently,

  • (ii) produce categorical or narrative judgements and then

  • (iii) translate narrative judgements into scales.

 

During the integration phase, raters seem to have used different methods to synthesise various sources of information, and they do not seem to have been thinking in terms of number grades or scales.17,20,23,27,34 Nevertheless, the overall judgements developed throughout the assessment process often have to be eventually translated into scales to suit the rating tools.


 

The raters reported that they sometimes simply averaged all sources of information, prioritised based on their own beliefs about the importance of specific elements of competencies, or even prioritised according to the assessment’s purpose within the programme. St-Onge et al.30 reported a particular weighting or veto mechanism associated with specific oversights or errors, even when the overall performance was described as being good. Yeates et al.41 also reported variation in valence regarding the different sources of information obtained.


 

Berendonk et al.39 discussed the importance of content knowledge in raters’ ability to integrate divergent information about a performance.



A rater’s production of narrative judgements stands out as a well-documented mechanism. Raters did not seem to think in quantitative or categorical terms about the overall or specific aspects of a performance.17,20,27 Participants themselves commented that numbers were meaningless for them20 and that the process of quantifying complex aspects of a performance created uncertainty and doubts.17,20,39


 

This mechanism may reveal key aspects of the process. It may be that the rating task was framed in such a way that it required raters to compare disparate concepts (i.e. apples and oranges). As stated by one participant, it is difficult to compare and quantify the relative quality of the different paths taken to do something in an acceptable manner.20 This statement beautifully illustrates the cognitive challenges put on raters when using psychometric assumptions about knowledge that are unfit for the nature of the task at hand.23,52,53


A related but separate mechanism is the translation of narrative judgements into scales, which raters seemed to do towards the end of the process. Yaphe and Street27 illustrated that the final grade given by raters was not the product of an arithmetic process but a reflection based on an overall impression together with an analysis of the responses given by the examinee. Yeates et al.17 investigated the raters’ process of forming an overall judgement through a narrative description, which was then translated into descriptions for a usable scale. Kogan et al.20 also reflected on the raters’ inability to discriminate between fragments on a nine-point scale and shared participants’ quotes about the development of their own rules for addressing the dilemma. If the translation of global judgements into scales represents a struggle for raters, we may need to reflect on what is lost in translation.53–55


 

ISSUES AND OPPORTUNITIES FOR RESEARCH



Disentangling the rating task


 

No single mechanism or factor will ever be able to entirely explain the observed variability in rater judgement.22 Future research should therefore examine:


  • how each mechanism is interconnected with the others,

  • how these theories later interact within the complexity of the task in authentic contexts,

  • the nature of specific mechanisms under different conditions (e.g. summative versus formative assessment), and

  • what happens when specific mechanisms function well.


For example, in a study looking at medical educators’ shared concept of competency in assessing clinical reasoning in complex patient scenarios, we found that educators exhibited a high level of agreement about the key elements associated with competent reasoning for specific problems, yet evidenced total disagreement about the standards that should be associated with these elements.58



Acknowledging the multiple uses and audiences of rater output


We suggest that an additional phase should be added to the three-phase process proposed above.


Even though a number of authors have acknowledged the importance and influence of feedback on the rating process,17,20,39 only Kogan et al.20 included an additional feedback phase as a distinct section in both their research design and conceptual model. In their conceptual model, they portrayed the delivery of feedback as a distinct phase actively interacting with other aspects of the rating process.



Although rater awareness of the learner group has been shown to induce rater leniency,60 recent research on the analysis of narrative evaluations also revealed that raters recognised their colleagues as an additional audience for their feedback.56 These findings align with raters’ concerns about the public aspects of their ratings and raters’ active consideration of the overall purpose of the assessment task, as expressed in Berendonk et al.39 Adding this fourth phase, entitled dissemination, would acknowledge the impact of different audiences and objectives on raters’ ability to provide accurate ratings.



Investigating raters’ experience and assessment practices



As raters’ abilities have been shown to be developmental and affected by practice and exposure,25,38,39 we may need to reflect on the context in which raters acquire these abilities. Rater practices are mostly situated in formative contexts that differ from the summative contexts in which they are generally studied. Studying raters’ abilities to assess a trainee’s performance and the way raters make judgements in formative contexts may be highly relevant to gaining a better understanding of raters’ daily assessment practices and the underlying mechanisms supporting raters’ judgements and interactions.

 

For example, raters have reported developing stable standards, as opposed to normative ones, as they gain more experience with a specific assessment task and context.39 A recent experimental study also showed that experience with the task at hand for the specific level of learners was a confounding variable.42 Research focusing on better understanding the judgement process in formative contexts can bring a longitudinal perspective to the development and accuracy of judgement and offer an alternative to inter-rater reliability as a way to measure rating accuracy.52



CONCLUSION


This review, focusing on the assessment process instead of the rater per se, underscores the complexity and number of cognitive processes identified in research on rater cognition.

 

  • During the observation phase, as raters selected and attended to trainee performance, they were influenced by their automatic impressions about the trainee while formulating high-level inferences about the various dimensions of the competencies they were assessing. These automatic mechanisms have been situated in this first phase to convey their impact on succeeding mechanisms and events.

  • In the next phase, processing, simultaneous schema-based mechanisms highlight different components of raters’ expertise and experience through their personal conception of competence, their ability to use their repertoire of exemplars for comparison, and their ability to monitor and adjust their judgements according to aspects of the task and context.

  • In the subsequent phase, the observed mechanisms seem to reuse the information from the previous phases in uniquely weighting and synthesising information and in generating narrative judgements later translated into scales used for the assessment.


We also believe that looking at the raters as fallible, cognitively limited or as experts can contribute to a shared understanding of the phenomenon. For example,

  • researchers in the expertise camp focus on how mechanisms function, whereas

  • researchers in the cognitively limited resource camp attend primarily to the limitations of mechanisms in realistic contexts, and

  • those in the fallible camp investigate how these mechanisms produce outputs that are (in)compatible with the requirements of our current assessment system.


11 Kogan JR, Conforti LN, Bernabeo E, Iobst W, Holmboe E. How faculty members experience workplace-based assessment rater training: a qualitative study. Med Educ 2015;49(7):692–708.


13 Gingerich A, Kogan J, Yeates P, Govaerts M, Holmboe E. Seeing the ‘black box’ differently: assessor cognition from three research perspectives. Med Educ 2014;48(11):1055–68.



 

 




Med Educ. 2016 May;50(5):511–22. doi: 10.1111/medu.12973.

Rater cognition: review and integration of research findings.

Author information

  • 1Médecine interne, Université de Sherbrooke, Sherbrooke, Quebec, Canada.
  • 2Division of Emergency Medicine, McMaster University, Hamilton, Ontario, Canada.
  • 3Centennial College, School of Community and Health Studies, Toronto, Ontario, Canada.
  • 4ORNGE Transport Medicine, Faculty of Medicine, Mississauga, Ontario, Canada.

Abstract

BACKGROUND:

Given the complexity of competency frameworks, associated skills and abilities, and contexts in which they are to be assessed in competency-based education (CBE), there is an increased reliance on rater judgements when considering trainee performance. This increased dependence on rater-based assessment has led to the emergence of rater cognition as a field of research in health professions education. The topic, however, is often conceptualised and ultimately investigated using many different perspectives and theoretical frameworks. Critically analysing how researchers think about, study and discuss rater cognition or the judgement processes in assessment frameworks may provide meaningful and efficient directions in how the field continues to explore the topic.

METHODS:

We conducted a critical and integrative review of the literature to explore common conceptualisations and unified terminology associated with rater cognition research. We identified 1045 articles on rater-based assessment in health professions education using Scopus, Medline and ERIC, and 78 articles were included in our review.

RESULTS:

We propose a three-phase framework of observation, processing and integration. We situate nine specific mechanisms and sub-mechanisms described across the literature within these phases: (i) generating automatic impressions about the person; (ii) formulating high-level inferences; (iii) focusing on different dimensions of competencies; (iv) categorising through well-developed schemata based on (a) personal concept of competence, (b) comparison with various exemplars and (c) task and context specificity; (v) weighting and synthesising information differently; (vi) producing narrative judgements; and (vii) translating narrative judgements into scales.

CONCLUSION:

Our review has allowed us to identify common underlying conceptualisations of observed rater mechanisms and subsequently propose a comprehensive, although complex, framework for the dynamic and contextual nature of the rating process. This framework could help bridge the gap between researchers adopting different perspectives when studying rater cognition and enable the interpretation of contradictory findings of raters' performance by determining which mechanism is enabled or disabled in any given context.

PMID: 27072440  DOI: 10.1111/medu.12973

