평가자의 인지 Rater Cognition (Med Educ, 2016)

Rater cognition: review and integration of research findings

Genevieve Gauthier,1 Christina St-Onge1 & Walter Tavares2,3,4





도입

INTRODUCTION


Complex한 역량을 평가하기 위해서는 피훈련자의 퍼포먼스에 대한 판단이 내려져야 하며, 이는 고등 기술이나 고등 능력(지식 적용, 분석, 통합, 평가 등)의 개발을 support하기 위함이다. 평가하고자 하는 능력과 기술의 복잡성과 그것들이 속한 context로 인해서 우리는 피훈련자의 퍼포먼스를 평가할 때 점차 평가자의 판단에 의존하게 된다.

In order to assess these complex competencies, judgements need to be made about the performance of trainees, to support the development of high-level skills or abilities, such as knowledge application, analysis, synthesis and evaluation.1 The contexts and complexity of abilities and skills assessed within these frameworks require that we increasingly rely on rater judgements in assessing trainee performance.2


CBE에서 평가자 판단에 대한 의존이 늘어나는 것에 대해서는 상당한 비판이 있는데, 왜냐하면 평가자의 판단은 지나치게 variable하고 bias에 취약하다고 여겨져 왔기 때문이다. 실제로, (최근까지 dominant했던) psychometric한 관점에서 평가자 variability는 측정오류로 간주되었는데, 왜냐하면 평가자 variability가 때로는 피훈련자 자신의 퍼포먼스보다 '평가 점수'의 변산을 더 많이 설명하는 것으로 나타났기 때문이다. 그러나 평가서식이나 시스템의 개선, 평가자 훈련 등으로 이 variability 문제를 해결하기 위한 시도는 유의미한 성과를 거두지 못했다.

Some have greeted the increased reliance on rater judgement in CBE with substantial criticism, because rater judgement has often been accused of being too variable and fraught with bias. In fact, the psychometric perspective, which has been dominant until recently, has positioned rater variability as measurement error, because rater variability has been shown at times to explain more of the variability seen in trainees’ scores than the trainees’ own performances.3–6 Attempts to address this variability problem by improving rating forms and systems, or by training raters, have not produced meaningful improvements.7–11


Gingerich 등은 최근 리뷰에서 평가자 인지 연구를 평가자를 바라보는 세 가지 관점으로 분류하였다.

In a recent review, Gingerich et al.13 categorised rater cognition research into three different perspectives focusing on the assessors as either

  • (i) 적절한 평가기준 적용 실패  failing to apply appropriate assessment criteria,

  • (ii) 과제의 복잡성에 의해 인지적으로 압도당함 being cognitively overwhelmed by the complexity of the task, or

  • (iii) 유니크하지만 의미있는 메시지를 보내는 전문가 판단 expert decision makers sending unique but meaningful messages.


Fig. 2의 양측 화살표는 상호의존성과 지속적 상호작용을 의미한다.

The bidirectional arrow in Fig. 2 represents the interdependence and constant interaction between the phases.

 

 


 

각 Phase의 구조

STRUCTURE OF PHASES


rater cognition에 관한 최근 연구 결과는 관측, 처리, 통합의 세 가지 phase 프레임워크를 시사한다.

Aligned with previous work on performance, recent findings in rater-cognition research have implied a three-phase framework of observation, processing and integration.

  • 관측: 어떻게 평가자들이 능동적으로 정보를 선택하는가
    The observation phase refers to how raters attend to and actively select information about trainees and their performances. This phase is the most agreed upon and mentioned by most authors.

  • 처리: 맥락적 정보와 사전지식을 어떻게 회수하고 사용하며, 어떻게 주어진 퍼포먼스와 비교하는가
    The processing phase encompasses appraisal of how raters retrieve and use contextual information and prior knowledge to inform and compare the performance at hand.

  • 통합: 어떻게 다양한 출처의 정보를 통합하는가
    The integration phase speaks to how raters combine different sources of information to form an overall judgement.




관측

Observation phase


관측에서 평가자는 피훈련자의 퍼포먼스를 관측하여, 판단을 형성하는데 필요한 정보에 관심을 기울인다. 관측phase의 impact와 이후 이어지는 phase와의 관계는 잘 밝혀졌으나, 이 초기 phase를 기술하는 용어는 다양하다.

In this initial phase, a rater observes a trainee’s performance in what is usually a complex and authentic or educational setting to identify and attend to relevant information that will be used subsequently to form a judgement. The impact of the observation phase and its interrelation with subsequent phases have been well acknowledged,17–20 but the specific terminology used to describe this initial phase differs between researchers (e.g. recognition, selection of information, perception and first impression).21–27



관측phase에는 다양한 메커니즘이 있다.  

The number and complexity of mechanisms that occur during the observation phase have been well conceptualised.

  • 행동 관련 정보에 대한 능동적인 선택과 탐지. 통합과 카테고리화
    Tavares and Eva18 described this phase of information acquisition as an active selection and detection of relevant behaviours that require management, integration and categorisation in the rater’s working memory.
  • 자동적이면서 통제된 프로세스. 상황situation과 성향disposition에 대한 의식적, 무의식적 요인
    The observation, once conceived as a passive and objective task, requires active observation skills that involve both automatic and controlled processes during which observers are influenced by conscious and unconscious situational and dispositional factors.28,29
  • Differential salience: 같은 것을 보면서도 다른 측면에 초점을 두는 것
    Yeates et al.17 discussed the notion of differential salience occurring when, despite viewing the same video, raters paid attention to different aspects of the performance.



처리

Processing phase


수집된 정보에 의미를 부여하고, 암묵적 또는 명시적 기준과 비교함으로써 관측한 것을 make sense함.

In this phase, raters are thought to give meaning to the information collected during the observation phase. That is, they are thought to make sense of what they have observed by comparing it with some implicit or explicit standard.

 

다음과 같이 불리기도

This phase is referred to as organisation,21,25 judgement,20 criterion uncertainty,17 interpretation,19,23 provisional grade27 and evaluation.24



평가라는 관점에서 처리phase는 두 단계의 분석을 함축한다. 첫째, 평가자는 elements of competency의 측면에서 퍼포먼스의 퀄리티를 결정해야 한다. 둘째, 평가자는 (다양한 기준점과의 비교를 통해) 피훈련자가 이 elements of competency를 어떤 level로 보여주는지 판단해야 한다. 이 과정은 automatic and controlled 프로세스를 모두 포함하며, schema-based categorization이라고 불린다.

More specifically, from an assessment perspective, the processing phase implies two levels of analysis. First, raters need to determine the quality of the performance in terms of elements of competency. Second, raters need to determine the level at which the trainee demonstrates these elements of competency in comparison to various reference points. This process, which involves both automatic and controlled processes,17,18,28,30 is often referred to as schema-based categorisation.19,30




통합

Integration phase


이 시기에, 평가자는 피훈련자의 퍼포먼스에 대한 전반적인 판단을 내린다.

During this phase, raters form and articulate an overall judgement and rating about trainee performance,

 

이 phase에 대한 연구자들의 판단은 다양한데, 어떤 연구자는 우리가 통합phase에 포함시킨 것을 처리phase로 분류하기도 한다.

Fewer authors seem to agree about this phase. Some authors17,18 put mechanisms that we have included in the integration phase into the processing phase.


다른 이름으로도 불린다.

This third phase is sometimes referred to as the translation phase,18,23 final grade27 or feedback phase.20,24


다양한 출처의 정보를 통합하는 과정이다. 이 정보는 unique한 가중치가 적용되어 최종적 global impression을 형성하고, 이어서 corresponding rating으로 변환된다.

The integration phase encompasses the integration of different sources of information. This information is uniquely weighted to form a final global impression, which is then subsequently converted into a corresponding rating.



 

 

공통의 메커니즘

SHARED MECHANISMS


Table 1 illustrates the most salient mechanisms described in the reviewed studies on rater cognition.


 

 


 

관측 메커니즘

Mechanisms in the observation phase


관측 phase에는 서로 어느 정도 독립적으로 일어나는 세 가지 메커니즘이 있다.

The first phase, the observation phase, is where we position three different mechanisms reported as happening somewhat independently from one another:

  • (i) 한 사람에 대한 자동화된 인상 형성 generating automatic impressions about the person,

  • (ii) 고차원의 추론 형성 formulating high-level inferences and

  • (iii) 평가대상 역량의 다양한 차원에 집중 focusing on different dimensions of the assessed competencies.


'자동화된 인상 형성'이란 무의식적으로 발생하는 것이며, 한 피훈련자의 특정한 unrelated aspects를 가지고 퍼포먼스에 대한 일반화를 내리는 것이다. 이 메커니즘은 halo effect로도 잘 알려져있다.

The automatic generation of impressions about a person is a mechanism that takes place unconsciously and through which raters make generalisations about a trainee’s performance based on that person’s specific unrelated aspects or traits. The existence and impact of this mechanism has been well documented and includes the well-known halo effect.19,21,22,26,27,32


비록 이러한 인상 형성이 자동적으로 이뤄지는 것임은 널리 받아들여지고 있지만, 메커니즘의 유용성, 전반적인 프로세스에 미치는 영향, 다양한 맥락에서 발생하는 이들 판단의 exact nature 등에 대한 논란이 있다.

Although the concept of a person-based impression that happens automatically is well accepted, questions persist as to the potential usefulness of this mechanism,22 the degree of influence it has on the overall process,21 and the exact nature of these judgements in different contexts.19


이 메커니즘이 던지는 또 다른 질문으로 현재 이뤄지는 대부분의 평가과제에서 '분석의 단위'가 무엇인지에 대한 것이다. 평가과제는 대부분 'person level'이 아니라 'performance level'에서 이뤄지나, 사람의 판단이 구조화되는 방식은 그 반대인 것으로 보인다는 점이다. 예컨대 지원자-특이적 평가를 한 평가자들과 스테이션-특이적 평가를 한 평가자들을 비교하면, 전자의 경우에서 failure한 지원자의 수가 더 많고, 신뢰도가 하락하면서 internal consistency는 높아졌다.

In parallel, the results of these investigations raise interesting questions regarding the current unit of analysis in most assessment tasks, which are anchored at the performance level as opposed to the person level, which seems to be contrary to how human judgement is structured.29 For example, research on candidate-specific raters compared with station-specific raters reported a higher number of candidate failures and a higher level of internal consistency while showing a trade-off in reliability.33


두 번째 메커니즘은 (관찰가능한 행동의 단순 묘사가 아닌) high-level inference의 형성이다.

Another mechanism common to various studies on rater cognition, and beginning during the observation phase, is the formulation of high-level inferences as opposed to description of observable behaviours or facts regarding trainee performance.13,20,25,30,34,35


Kogan 등은 평가자의 관측에 있어서 이러한 high-level inference를 문제시하였다. 반면, 경험이 많은 평가자들에서 high-level inference의 빈도가 높아진다는 사실을 바탕으로 Govaerts 등은 평가자의 경험이 쌓일수록 이러한 자동화automaticity가 개발된다고 보았다.

Kogan et al.20 viewed the high level of inferences in rater observations (when the raters were not always conscious of them) as a problem to be addressed. By contrast, the increased frequency of high-level inferences of experienced raters compared with less experienced raters led Govaerts et al.25 to infer that automaticity developed as raters gained experience in assessing trainee performance.


이러한 유형의 정보가 (프로페셔널리즘과 같은) complex competencies에 대한 평가자의 inference에 영향을 미칠 수 있다는 사실을 이해하기 위하여..

To understand the type of information that could influence raters’ inferences about complex competencies such as professionalism, Ginsburg et al.34


어떤 유형의 정보가 평가자가 '유용한' inference를 내리는 것을 뒷받침할 수 있는지는 standard setting의 맥락에서도 연구되어왔다.

Understanding what type of information can support raters’ ability to make useful inferences has also been explored in the context of standard setting,36


complexity의 또 다른 층위는... Pulito 등이 보고한 바와 같이, 평가자는 직접적 관측결과를 활용하여 의학지식/프로페셔널리즘/임상추론기술 등을 평가하며, 학생의 발표능력을 가지고 추론하여 병력청취나 신체진찰 스킬을 평가한다.

Another layer of complexity may arise as we begin investigating rater assessment practices regarding inferences. As reported by Pulito et al.,37 raters used direct observations in assessing medical knowledge, professionalism and clinical-reasoning skills, whereas they made inferences using the students’ presentations to assess history-taking and physical-examination skills.



관측시기에서 또 다른 메커니즘은 평가자가 어떻게 다양한 역량의 element에 집중할 수 있는가이다. Yeates 등은 평가자들이 동일한 비디오를 보고도 코멘트에 variation을 보임을 관측하였다. 한 평가자에게 key aspect라고 보여진 것이 다른 평가자에게는 언급조차 되지 않았다.

Another shared finding during the observation phase is how raters attend to different elements of competencies.17,20,23,27,30,37,38 Yeates et al.17 observed variations in comments by raters assessing the same video performance. What struck some raters as being a key aspect of a performance was sometimes not even mentioned by other raters. Kogan et al.20 had similar findings.



유사하게 Yaphe and Street는 퍼포먼스가 좋게 평가되는지 나쁘게 평가되는지에 따라 평가자가 어떤 attribute에 focus하는 것이 달라진다는 것을 보여주었다.

Similarly, Yaphe and Street27 highlighted that raters focus on different attributes depending on whether the performance being evaluated was good or poor.



assorted elements of competencies 를 활용하는 것은 각 평가자가 (특정 퍼포먼스의 유형이나 맥락에 따라) 역량에 대한 unique한 이해를 가지고 있음을 시사한다. context나 task가 rater performance에 미치는 영향은 Tavares의 결과에서도 강조되는데, 7개의 영역을 평가하라고 했을 때와 2개의 영역을 평가하라고 했을 때의 평가결과가 달랐다.

The use of assorted elements of competencies may reflect raters’ unique understanding of competency in relationship to specific types of performances and contexts. The influence of context and task on rater performance is highlighted in Tavares’38 findings, in which the performance of raters was affected when they were required to evaluate seven dimensions instead of two.



처리 메커니즘

Mechanisms in the processing phase



처리 phase의 메커니즘들을 하나로 묶어주는 것은 다음에 기반한 복잡한 schemata의 존재와 사용이다.

The glue that binds the mechanisms together in the processing phase is the presence and use of complex schemata based on:

  • (i) 역량에 대한 개개인의 개념 a personal conception of competency,

  • (ii) 다양한 사례의 비교 comparison with various exemplars and

  • (iii) 과제와 맥락 특이성 task and context specificity.

 

이 메커니즘들이 automatic and controlled 하다고 묘사되지만, 평가자들은 피훈련자의 퍼포먼스를 어떻게 해석하는지를 설명하는데 어려움을 겪는 것으로 보이며, 이는 그러한 메커니즘들이 controlled 이기보다는 automatic한 것임을 시사한다. 예컨대 Kogan의 연구에서 평가자들은 자신의 평가 프로세스를 설명하는데 어려움을 겪었고, 다수의 평가자는 (구체적인 레퍼런스나 스탠다드의 측면에서) 어떻게 퍼포먼스를 해석했는지 설명하지 못했다.

Although these mechanisms have been described as being both automatic and controlled, raters seem to experience difficulty in articulating how they interpret trainee performance, suggesting that such mechanisms may be more automatic than controlled. For example, participants in Kogan et al.’s20 study had difficulty explaining their assessment processes and a number of them could not articulate how they interpreted a trainee’s performance in terms of specific references or standards.


평가자 판단의 complex and implicit한 특징은 카테고리화 과정과 연결되는데, 피훈련자의 퍼포먼스와 모범적인 퍼포먼스 사례에 노출되면서 개발된 schemata를 비교하고 활용함으로써 해석이 이뤄지는 것이다. 이러한 카테고리화 메커니즘의 발생은 왜 경험이 많은 평가자들이 경험이 적은 평가자들보다 피훈련자의 퍼포먼스에 대한 더 풍부하고 디테일한 description을 하는지를 설명해준다.

The complex and implicit nature of rater judgement has been linked to a categorisation process, in which interpretations occur through the use and comparison of schemata developed by exposure to trainee performance and exemplars of performance.17,25,39–42 The occurrence of such a categorisation mechanism might explain why experienced raters can have richer and more detailed descriptions of trainee performances than inexperienced raters.25,35,39,43


카테고리화 프로세스에서 역할을 한다고 알려진 세 가지 요소 중 하나는 '평가자 개인이 지닌 역량에 대한 개념'이다. Kogan 등은 평가자 자신의 임상스킬이 피훈련자 퍼포먼스 평가와 상관관계가 있음을 보여주었다. 평가자 자신의 역량의 중요성은 Berendonk의 결과에서도 나타나는데, 평가과제의 선택이 평가자 자신의 전문분야 또는 전문지식과 관련되었다는 것이다. 평가자의 역량에 대한 개념은 "특정 술기가 주어진 맥락에서 어떻게 수행되어야 하는지"에 대한 판단을 포함하는 것이지만, 이는 동시에 평가자가 의사소통이나 프로페셔널리즘과 같은 복잡한 역량을 판단하는 기준을 프레임하기도 한다.

One of the three components identified as playing a role in the categorisation process is the rater’s personal conception of competency.19,20,39,44 Kogan et al.45 found that the raters’ own clinical skills related to a task correlated with their assessment of trainees performing the task. The importance of the raters’ own level of competence with respect to the task is also reflected in Berendonk et al.’s39 findings about selecting assessment tasks related to the rater’s own area of expertise and content knowledge. Raters’ concept of competence may involve judging how things should be done in the context of procedural skills, but it may also frame the nature of raters’ expectations and standards when judging more complex competencies such as communication and professionalism.44


카테고리화 프로세스의 또 다른 요소는 자신이 경험한 다양한 모범사례를 벤치마크로 사용하는 것이다. 이들 모범사례는 평가자들이 특정 평가과제에서 스탠다드를 설정하는 전략으로 밝혀진 바 있다. 모범사례는 과거에 경험한 피훈련자가 될 수도 있고, 평가자 자신이 피평가자 시절에 어떻게 했었는지가 될 수도 있고, 현재 비슷한 수준의 피평가자에 대한 경험이 될 수도 있고, 심지어는 다양한 레벨에 있는 동료가 될 수도 있다. contrast effect에 대한 연구에서 밝혀진 바와 같이, 평가자들은 (심지어 criteria-based assessment에서조차) implicit하게 norm-based evaluation에 의존한다.

Another component of the categorisation process is the use of various exemplars accumulated through experience as benchmarks for comparison with observed trainee performances. The use of these exemplars has been identified as a strategy adopted by raters to set standards for specific assessment tasks.20,30,39–41 The nature of these exemplars ranges from trainees assessed in the past40–42 to the rater’s own memory of his or her own performance at the same level of training20,30 to current experience with learners at similar levels or even colleagues with varied levels of experience.20,30 As demonstrated by research on the contrast effect,40,42,46 raters implicitly relied on a norm-based evaluation process even in the context of criteria-based assessment.


 

구체적인 과제와 맥락의 역할은 'task specificity'라고도 불리는 것으로, 'context specificity'라는 잘 알려진 현상과 관련된다. 경험이 많은 평가자들은 과제 특이적 퍼포먼스에 대해서 더 풍부한 코멘트와 묘사를 제공하며, 특정 과제에 대한 경험을 쌓아나가면서 과제-특이적 스탠다드가 진화한다고 이야기하였다. 또한 이들은 task의 역할을 학습자를 위한 mediating factor로서 인정하였다.

The role of specific tasks and contexts, which we refer to as task specificity, relates to a well-documented phenomenon of context specificity47 and was mentioned in a number of studies.18–20,27,30,39,43,48 Experienced raters provided richer assessment comments and descriptions of task-specific performances19,39 and they discussed the evolution of task-specific standards as they gained experience with a specific task. They acknowledged the role of the task as a mediating factor for learners.



 

통합 메커니즘

Mechanisms in the integration phase



이 마지막 phase에서 이전 phase로부터 수집된 정보를 재활용하는 것으로 보인다.

In this last phase, the observed mechanisms seem to reuse the information from the previous phases to

  • (i) 정보를 서로 달리 가중치를 주어 통합 weigh and synthesise information differently,

  • (ii) 카테고리식 혹은 내러티브식의 판단을 내림 produce categorical or narrative judgements and then

  • (iii) 내러티브한 판단을 척도로 변환 translate narrative judgements into scales.

 

통합phase에서 평가자들은 서로 다른 방법을 사용하여 다양한 출처의 정보를 통합하는데, 이 때 숫자점수나 척도의 형태로 생각하는 것은 아닌 것으로 보인다. 그럼에도 불구하고 전반적인 평가프로세스의 판단은 궁극적으로 scale로 변환되어야 한다.

During the integration phase, raters seem to have used different methods to synthesise various sources of information and they do not seem to have been thinking in terms of number grades or scales.17,20,23,27,34 Nevertheless, the overall judgements developed throughout the assessment process often have to be eventually translated into scales to suit the rating tools.


 

평가자들은 종종 여러 출처의 정보를 단순히 평균을 내기도 하고, 무엇이 더 중요한지에 대한 자신의 신념에 기반하여 우선순위를 매기기도 하고, 심지어는 프로그램 내에서의 평가 목적에 따라 우선순위를 매기기도 한다. St-Onge 등은 특정한 실수나 오류와 연관된 가중치 부여 혹은 거부veto 메커니즘을 보고하였는데, 이는 전반적인 퍼포먼스가 'good'이라고 기술된 경우에도 마찬가지였다. Yeates 등은 수집된 다양한 출처의 정보에 따라 valence에 variation이 있음을 보여주었다.

The raters reported that they sometimes simply averaged all sources of information or prioritised based on their own beliefs about the importance of specific elements of competencies or even prioritised according to the assessment’s purpose within the programme. St-Onge et al.30 reported a particular weighting or veto mechanism associated with specific oversights or errors, even when the overall performance was described as being good. Yeates et al.41 also reported variation in valence regarding the different sources of information obtained,


 

Berendonk 등은 divergent한 정보를 통합하는 데 있어서 content knowledge가 중요함을 고찰하였다.

Berendonk et al.39 discussed the importance of content knowledge in raters’ ability to integrate divergent information about a performance.



평가자가 내러티브 판단을 만들어내는 것은 잘 알려진 메커니즘이다. 평가자는 전반적 퍼포먼스나 특정 퍼포먼스에 대해 정량적 혹은 카테고리적 용어로 생각하지 않는다. 평가자들은 '숫자' 그 자체는 자신들에게 무의미하다고 코멘트하였으며, 퍼포먼스의 복잡한 측면들을 정량화하는 과정에 불확실성과 의심을 낳는다고 하였다.

A rater’s production of narrative judgements stands out as a well-documented mechanism. Raters did not seem to think in quantitative or categorical terms about the overall or specific aspects of a performance.17,20,27 Participants themselves commented about numbers being meaningless for them20 and about the process of quantifying complex aspects of a performance creating uncertainty and doubts.17,20,39


 

이 메커니즘이 핵심일 수도 있다. 이러한 방식으로 프레임된 '평가과제rating task'는 평가자들이 서로 다른 개념(예컨대 사과와 오랜지)을 비교하게 만든다. 한 평가자가 말한 것처럼, 무언가를 합당한 방식으로 하는데 여러가지 paths의 상대적 quality를 비교하고 정량화하는 것은 어렵다. 이 말은 주어진 task의 특성과는 부합하지 않는 지식에 관한 psychometric assumption을 사용하여 평가를 해야 하는 상황에서 평가자들에게 부여되는 cognitive challenge를 매우 잘 설명해준다.

This mechanism may reveal key aspects of the process. It may be that the rating task was framed in such a way that it required raters to compare disparate concepts (i.e. apples and oranges). As stated by one participant, it is difficult to compare and quantify the relative quality of the different paths taken to do something in an acceptable manner.20 This statement beautifully illustrates the cognitive challenges put on raters when using psychometric assumptions about knowledge that are unfit for the nature of the task at hand.23,52,53


내러티브 판단을 scale로 변환하는 메커니즘이 있다. Yaphe and Street는 평가자가 내리는 최종 grade가 arithmetic process가 아니라 피평가자에 대한 전반적 인상에 analysis of response가 더해진 결과임을 보여주었다. Yeates 등은 평가자가 내러티브한 묘사를 통해서 전반적 판단을 형성하는 프로세스에 대해서 연구하였으며, 이 narrative description은 usable scale을 활용한 description으로 변환된다. Kogan 등은 평가자가 9점 척도의 fragment를 서로 구분하는 능력이 없음을 보여주었으며, 평가자들이 각자 그 딜레마를 해결하기 위한 어떤 법칙을 형성하였는지를 공유하였다. 만약 global judgement를 scale로 변환하는 과정이 평가자에게 struggle이라면, 그 변환과정에서 상실하는 것이 무엇인지 생각해봐야 한다.

A related but separate mechanism is the translation of narrative judgements into scales, which raters seemed to do towards the end of the process. Yaphe and Street27 illustrated that the final grade given by raters was not an arithmetic process but a reflection based on an overall impression with an analysis of responses given by the examinee. Yeates et al.17 investigated the raters’ process of forming an overall judgement through a narrative description, which was then translated into descriptions for a usable scale. Kogan et al.20 also reflected on the raters’ inability to discriminate between fragments on a nine-point scale and share participants’ quotes about the development of their own rules for addressing the dilemma. If the translation of global judgement into scales represents a struggle for raters, we may need to reflect on what is lost in translation.53–55


 

연구의 이슈와 기회

ISSUES AND OPPORTUNITIES FOR RESEARCH



평가 과제를 구분하기

Disentangling the rating task


 

한 가지 메커니즘으로 다 설명하지 못할 것. 다음과 같은 것들이 연구되어야.

no single mechanism or factor will ever be able to entirely explain the observed variability in rater judgement,22


  • how each is interconnected with the others.

  • how these theories later interact within the complexity of the task in authentic contexts.

  • nature of specific mechanisms under different conditions (e.g. summative versus formative assessment)

  • what happens when specific mechanisms function well


예컨대, 의학교육자들이 복잡한 환자 시나리오를 기반으로 임상추론 역량을 평가하는 경우, 교육자들은 구체적인 문제에서 competent한 추론과 연관된 key element에 대해서는 높은 수준의 일치도를 보였으나, 이 element들과 연관되어야 하는 standard에 대해서는 total disagreement가 확인되었다.

For example, in a study looking at medical educators’ shared concept of competency in assessing clinical reasoning in complex patient scenarios, we found that educators exhibited a high level of agreement about the key elements associated with competent reasoning for specific problems yet evidenced total disagreement about the standards that should be associated with these elements.58



평가자 결과의 다양한 활용

Acknowledging the multiple uses and audiences of rater output


세 phase외에 추가적 phase가 있어야 할 수도.

We suggest that an additional phase should be added to the three-phase process proposed above


피드백의 중요성을 인정한 연구자들은 많으나 추가적인 feedback phase를 별도의 section으로 둔 연구자는 별로 없다. 이 경우 feedback의 전달을 rating process의 다른 측면들과 능동적으로 상호작용하는 별도의 phase로 묘사하였다.

Even though a number of authors have acknowledged the importance and influence of feedback on the rating process,17,20,39 only Kogan et al.20 included an additional feedback phase as a distinct section in both their research design and conceptual model. In their conceptual model, they portrayed the delivery of feedback as being a distinct phase actively interacting with other aspects of the rating process,



학습자 집단에 대한 평가자의 awareness는 평가자를 더 관대하게 평가하게끔 만드는 것으로 알려져 있지만, narrative evaluation을 분석한 최근의 연구를 보면, 평가자는 자신의 동료들을 피드백의 추가적인 audience로 인식하고 있었다. 이 결과는 평가자가 자신의 평가의 public aspect를 우려한다는 것, 그리고 평가자가 평가과제의 overall purpose를 적극적으로 고려하고 있다는 것과도 부합한다. dissemination이라는 네 번째 phase를 추가함으로써, 다양한 audience와 목적이 정확한 rating을 제공하는 평가자의 능력에 미치는 영향을 인정할 수 있다.

Although rater awareness of the learner group has been shown to induce rater leniency,60 recent research on the analysis of narrative evaluations also revealed that they recognised their colleagues as an additional audience for their feedback.56 These findings align with raters’ concerns about the public aspects of their ratings and raters’ active consideration of the overall purpose of the assessment task, as expressed in Berendonk et al.39 Adding this fourth phase, entitled dissemination, would acknowledge the impact of different audiences and objectives on raters’ ability to provide accurate ratings.



평가자의 경험과 평가수행을 연구

Investigating raters’ experience and assessment practices



평가자의 능력은 발달하는 것이며, 실천과 노출에 의해서 영향을 받기 때문에, 평가자가 이러한 능력을 습득하는 맥락을 고려해봐야 한다. 평가자의 평가 실천은 주로 formative context에서 이뤄지는데, 이는 평가자 연구가 일반적으로 이뤄져 온 summative context와는 다르다. 따라서 formative context에서 판단을 내리는 방식을 연구하여야 한다.

As raters’ abilities have been shown to be developmental and affected by practice and exposure,25,38,39 we may need to reflect on the context in which raters acquire these abilities. Rater practices are mostly situated in formative contexts that differ from the summative contexts in which they are generally studied. Studying raters’ abilities to assess a trainee’s performance and the way raters make judgements in formative contexts may be highly relevant to gaining a better understanding of raters’ daily assessment practices and the underlying mechanisms supporting raters’ judgements and interactions.

 

예컨대 평가자는 경험이 더 많아질수록 (normative한 것이 아니라) stable standard를 develop한다고 보고되고 있다. 최근의 연구에서 주어진 과제에서의 경험이 confounding variable이었다.

For example, raters have reported developing stable standards, as opposed to normative ones, as they gain more experience with a specific assessment task and context.39 A recent experimental study also showed that experience with the task at hand for the specific level of learners was a confounding variable.42 Research focusing on better understanding the judgement process in formative context can bring a longitudinal perspective on the development and accuracy of judgement and offer an alternative to inter-rater reliability as a way to measure rating accuracy.52



결론

CONCLUSION


This review focusing on the assessment process instead of the rater per se underscores the complexity and number of cognitive processes identified in research on rater cognition.

 

  • During the observation phase, as raters selected and attended to trainee performance, they were influenced by their automatic impressions about the trainee while formulating high-level inferences about the various dimensions of the competencies they were assessing. These automatic mechanisms have been situated in this first phase to convey their impact on succeeding mechanisms and events.

  • In the next phase, processing, simultaneous schema-based mechanisms highlight different components of raters’ expertise and experience through their personal conception of competence, their ability to use their repertoire of exemplars for comparison, and their ability to monitor and adjust their judgements according to aspects of the task and context.

  • In the subsequent phase, the observed mechanisms seem to reuse the information from the previous phases in uniquely weighting and synthesising information and in generating narrative judgements later translated into scales used for the assessment.


We also believe that looking at the raters as fallible, cognitively limited or as experts can contribute to a shared understanding of the phenomenon. For example,

  • researchers in the expertise camp focus on how mechanisms function, whereas

  • researchers in the cognitively limited resource camp attend primarily to the limitations of mechanisms in realistic contexts, and

  • those in the fallible camp investigate how these mechanisms produce outputs that are (in)compatible with the requirements of our current assessment system.


11 Kogan JR, Conforti LN, Bernabeo E, Iobst W, Holmboe E. How faculty members experience workplace-based assessment rater training: a qualitative study. Med Educ 2015;49(7):692–708.


13 Gingerich A, Kogan J, Yeates P, Govaerts M, Holmboe E. Seeing the ‘black box’ differently: assessor cognition from three research perspectives. Med Educ 2014;48 (11):1055–68.



 

 




 2016 May;50(5):511-22. doi: 10.1111/medu.12973.

Rater cognition: review and integration of research findings.

Author information

  • 1Medecine interne, Universite de Sherbrooke, Sherbrooke, Quebec, Canada.
  • 2Division of Emergency Medicine, McMaster University, Hamilton, Ontario, Canada.
  • 3Centennial College, School of Community and Health Studies, Toronto, Ontario, Canada.
  • 4ORNGE Transport Medicine, Faculty of Medicine, Mississauga, Ontario, Canada.

Abstract

BACKGROUND:

Given the complexity of competency frameworks, associated skills and abilities, and contexts in which they are to be assessed in competency-based education (CBE), there is an increased reliance on rater judgements when considering trainee performance. This increased dependence on rater-based assessment has led to the emergence of rater cognition as a field of research in health professions education. The topic, however, is often conceptualised and ultimately investigated using many different perspectives and theoretical frameworks. Critically analysing how researchers think about, study and discuss rater cognition or the judgement processes in assessment frameworks may provide meaningful and efficient directions in how the field continues to explore the topic.

METHODS:

We conducted a critical and integrative review of the literature to explore common conceptualisations and unified terminology associated with rater cognition research. We identified 1045 articles on rater-based assessment in health professions education using Scopus, Medline and ERIC and 78 articles were included in our review.

RESULTS:

We propose a three-phase framework of observation, processing and integration. We situate nine specific mechanisms and sub-mechanisms described across the literature within these phases: (i) generating automatic impressions about the person; (ii) formulating high-level inferences; (iii) focusing on different dimensions of competencies; (iv) categorising through well-developed schemata based on (a) personal concept of competence, (b) comparison with various exemplars and (c) task and context specificity; (v) weighting and synthesising information differently, (vi) producing narrative judgements; and (vii) translating narrative judgements into scales.

CONCLUSION:

Our review has allowed us to identify common underlying conceptualisations of observed rater mechanisms and subsequently propose a comprehensive, although complex, framework for the dynamic and contextual nature of the rating process. This framework could help bridge the gap between researchers adopting different perspectives when studying rater cognition and enable the interpretation of contradictory findings of raters' performance by determining which mechanism is enabled or disabled in any given context.


PMID: 27072440
DOI: 10.1111/medu.12973


국지적으로 개발된 MCQ에서 타당도에 대한 위협: 구인-무관-변이와 구인 과소반영(Adv Health Sci Educ Theory Pract. 2002)

Threats to the Validity of Locally Developed Multiple-Choice Tests in Medical Education: Construct-Irrelevant Variance and Construct Underrepresentation


STEVEN M. DOWNING

University of Illinois at Chicago, College of Medicine, Department of Medical Education (MC 591),

808 South Wood Street, Chicago, IL 60612-7309, USA (E-mail: sdowning@uic.edu)





도입

Introduction


MCQ는 이러한 시험에서 가장 흔히 사용되는 형태이며, 그 이유는 다음과 같다.

The multiple-choice question (MCQ) format is most commonly used for such tests, due to

  • its many positive psychometric characteristics,
  • its long history of research evidence,
  • its versatility in testing most cognitive knowledge,
  • its relative (apparent) ease to write, store, administer, and score, and
  • its continued use by the highest-stakes examinations in medical education (Downing, 2002).


그러나 MCQ형태의 시험이 이러한 긍정적인 성과를 내는 것은 어려우며, 과학이라기보다는 예술에 가깝다.

But, to accomplish these positive outcomes of MCQ-type assessment is challenging and may be more art than science (Haladyna, Downing and Rodriguez, 2002).


 

MCQ점수 해석에는 여러 타당도 위협이 있다. 검사의 타당도란 "accurate and meaningful interpretation of test scores and to the reasonableness of the inferences drawn from test scores"을 뜻한다.

There are many threats to the validity of MCQ test score interpretation. Test validity refers to the accurate and meaningful interpretation of test scores andtothe reasonableness of the inferences drawn fromtest scores (Messick, 1989; American Educational Research Association, 1999).


CIV

Construct-Irrelevant Variance

 

CIV는 평가에 외재적extraneous, 비통제된uncontrolled 변인이 들어가서, 이것이 일부 혹은 전체 응시자의 점수를 erroneously inflate 또는 deflate시킴으로써 검사점수 해석의 의미와 정확성을 떨어뜨리고, 검사 점수를 기반으로 한 결정의 정당성을 약화시키며, 검사의 타당도근거를 낮추는 것을 말한다.

Construct-irrelevant variance1 (American Educational Research Association, 1999; Cook and Campbell, 1979) refers to introducing into assessments extraneous, uncontrolled variables which tend to erroneously inflate or deflate scores for some or all examinees, thus reducing the meaningfulness and accuracy of test score interpretations, the legitimacy of decisions made on the basis of test scores, and the validity evidence for tests. There are many potential CIV sources in objective tests commonly used to assess achievement in medical education.


 

잘 설계되지 않은 문항

POORLY CRAFTED TEST QUESTIONS


최근의 Jozefowicz 등이 보여준 바에 따르면 의학교육의 단계를 막론하고 시험문항은 가장 기본적인 문항작성법의 원칙도 위반하는 경우가 많다. 아래와 같은 경우에 오답을 배제할 수 있는 의도하지 않은 힌트를 줘서 응시자의 점수가 erroneously inflate 되어 CIV가 증가한다.

Test items used at all levels of medical education training often violate the most basic principles of effective item writing, as a recent study by Jozefowicz and colleagues demonstrates (Jozefowicz et al., 2002).

  • Ambiguously worded test questions,
  • questions written in non-standard form,
  • questions cast in overly complex formats, or
  • questions testing too much information in a single item

...may provide unintended cues to the correct answer or facilitate the elimination of incorrect options – thus inflating examinee scores erroneously and adding CIV to the measurement.


보안상의 문제와 기타 다른 비정형성irregularities

INSECURE TEST QUESTIONS AND OTHER TESTING “IRREGULARITIES”


만약 학생이 문제를 미리 안다면 점수가 높아질 것이다. 또한 부정행위 등도 false-positive information을 추가한다.

If some or all examinees have prior knowledge of test questions, their test scores may be erroneously inflated. Similarly, other testing irregularities, such as cheating, can add false-positive information to test scores. All such ill-gotten gain is CIV.


Testwiseness

TESTWISENESS


Testwiseness 란 응시자가 점수를 최대화하는데 사용할 수 있는 행동들을 말한다. 예컨대 정답에 대한 아무런 정보도 없을 때 C를 찍거나 가장 긴 답가지를 찍는 것은 testwise한 것이다.

Testwiseness refers to a set of behaviors that allows examinees to maximize their test score. The examinee who, absent all other information about the correct-answer choice, marks C or chooses the longest answer from the option set, is using testwise cues.


찍기

GUESSING


단순 찍기만으로도 정답을 맞출 수 있다. 찍는 것 만으로 고득점을 하는 것은 어렵지만, 심각하게 flawed item은 찍기를 통해서 성공할 가능성을 높여준다.

Thus, there is a statistical probability of arriving at the correct response by random guessing alone.


While it is unlikely that examinees will obtain high scores or achieve a passing score through guessing alone, seriously flawed test items do increase the probability of success through guessing and thereby add CIV to the measurement.
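blind guessing만으로 합격점에 도달할 확률이 얼마나 낮은지는 이항분포로 간단히 계산해볼 수 있다. 아래는 40문항, 5지선다, 합격선 60%라는 가상의 수치를 가정한 최소한의 Python 스케치이며, 원문에 제시된 분석이 아니다.

```python
import math

def p_pass_by_blind_guessing(n_items=40, n_options=5, pass_fraction=0.60):
    """Probability of reaching the passing score by blind random guessing alone,
    modelling each item as an independent 1/n_options chance of success.
    (Hypothetical illustration; the numbers are not from the paper.)"""
    p = 1 / n_options
    need = math.ceil(n_items * pass_fraction)  # items needed to reach the cut-off
    return sum(math.comb(n_items, k) * p ** k * (1 - p) ** (n_items - k)
               for k in range(need, n_items + 1))

print(p_pass_by_blind_guessing())  # vanishingly small: well under one in a million
```

반면 flawed item이 오답을 배제할 단서를 주면 문항당 성공 확률이 1/n_options보다 커지므로, 같은 계산에서 찍기로 성공할 확률이 크게 올라간다.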


문항의 Bias

ITEM BIAS: DIFFERENTIAL ITEM FUNCTIONING


가장 기본적인 의미에서 test item bias란 응시자의 하위집단에 따라 문항의 공정성이 달라지는 것을 말한다. item bias를 잡아내기 위한 일반적인 분석법을 DIF(Differential Item Functioning)라고 하며, subgroup 간에 부당하게 차별하는 문항을 잡아낼 수 있다.

In its most basic meaning, test item bias refers to the fairness of the test item for different subgroups of examinees. A general class of analyses used to detect item bias is called Differential Item Functioning (DIF analysis) and can be used to detect items that unfairly discriminate between subgroups. Poorly written MCQs tend to be confusing, artificially difficult or easy, and may be differentially confusing for different subgroups.
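DIF 분석의 대표적인 방법 중 하나는 Mantel–Haenszel 절차로, 총점으로 능력을 매칭한 뒤 reference group과 focal group의 문항 정답 odds를 비교한다. 아래는 이를 보여주는 일반적인 Python 스케치이며(데이터 구조와 함수명은 예시를 위한 가정), 인용된 연구들에서 실제로 사용한 분석은 아니다.

```python
def mantel_haenszel_dif(item, group, total):
    """Mantel-Haenszel common odds ratio for a single item, stratifying on total score.

    item  : list of 0/1 item responses (1 = correct)
    group : list of 0/1 flags (0 = reference group, 1 = focal group)
    total : list of total test scores, used as the ability-matching variable

    A common odds ratio far from 1.0 suggests the item functions differently
    for the two groups even after matching on overall ability (possible DIF).
    """
    num = den = 0.0
    for s in set(total):                       # one stratum per total-score level
        idx = [i for i, t in enumerate(total) if t == s]
        a = sum(1 for i in idx if group[i] == 0 and item[i] == 1)  # reference, correct
        b = sum(1 for i in idx if group[i] == 0 and item[i] == 0)  # reference, incorrect
        c = sum(1 for i in idx if group[i] == 1 and item[i] == 1)  # focal, correct
        d = sum(1 for i in idx if group[i] == 1 and item[i] == 0)  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den else float("nan")
```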


정당화 불가능한 합격점수

INDEFENSIBLE PASSING SCORES


모든 기준에는 판단이 개입되며, 어느 정도 임의적arbitrary인 면이 있다. 만약 합격선을 설정하는 데 사용한 방법에 심각한 flaw가 있다면 CIV가 더해지는 것이다. 합격기준에 연관된 CIV에는 다음과 같은 것이 있다.

While all standards require judgment and all standards are somewhat arbitrary, if the methods used to establish the pass-fail mark are seriously flawed, CIV is added to the measurement. CIV associated with passing standards include problems such as

  • low reproducibility of passing scores,
  • low relationship between the pass-fail outcomes for tests of similar content,
  • low agreement between passing standards established by different instructors in similar or the same courses,

and so on.


구인 과소반영

Construct Underrepresentation


의학교육에서 이뤄지는 많은 시험의 타당도는 내용-관련 근거에 의존하며, 시험문항이 (평가)관심 영역과 관계가 있음을 보여주는 자료에 기반한다. 이론적으로 특정 시험에서 사용되는 구체적인 문항들은 모집단으로부터 선택가능한 모든 시험문항의 표본을 대표해야 한다. Classical measurement theory (CMT)는 오로지 여기에 기반한다.

Much of the validity argument for typical achievement examinations used in medical education rests on content-related evidence and data showing the relationship between test item content and the domain of interest. Theoretically, the specific test items used on a particular examination represent a sample of all possible test items selected from the population. Classical measurement theory (CMT) rests solidly on this foundation and assumption (Anastasi, 1988).


지식-회상 수준에만 머무는 문항

TRIVIAL CONTENT TESTED AT THE FACTUAL-RECALL LEVEL


많은 시험문항은 앞으로의 학습이나 환자 진료에 별로 중요하지 않은 사소한 것들을 물어본다. 팩트를 기억하거나 이 팩트들에 대해서 답하는 능력은 학생이 복잡한 임상문제를 다루는 능력의 궁극적인 성공/실패, 또는 새로운 문제에 이러한 지식을 적용하는 데 성공할지에 대해서 거의 알려주지 않는다.

Many of the test items constructed to test achievement in medical education curricula appear to ask trivial questions – questions that are unimportant for future learning or the clinical care of patients. The ability to remember facts and answer questions about these facts may have little to do with the ultimate success or failure of students’ ability to manage complex clinical problems or to predict their success at applying this knowledge to novel problems,


시험(문항)에만 대비한 교육

TEACHING-TO-THE-TEST


만약 교수자가 구체적으로 시험에 나올 내용만을 짚어준다면 sampling assumption이 위배되는 것이고 CIV가 추가된다. 따라서 점수에 기반한 해석은 부정확해진다. 이러한 상황에서 검사의 점수는 content domain의 무작위 표집을 대표하지 않으며, 검사점수로부터의 정당한 추론이 불가능해진다.

If instructors specifically guide examinees to content that is to be tested on the examination, the sampling assumption is violated, CIV is introduced to test scores, and the accurate interpretation of scores may be jeopardized. Under these circumstances, the score on the test does not represent a random sample of the content domain to which one can draw legitimate inferences from test scores.


너무 적은 수의 문항

TOO FEW TEST ITEMS


너무 문항이 적으면 content domain을 적절히 표집할 수 없다.

Tests that are too short can not adequately sample the content domain and thus threaten the validity of inferences to the domain.


또한 문항 수가 너무 적은 시험은 unreliable해지거나 낮은 reproducibility를 낳고, SEM이 커진다. 예컨대 40문항 정도 되는 시험은 0.5~0.6 정도의 신뢰도 계수가 나오며, SEM은 3~5점 정도 된다. unreliable한 검사는 CI가 커지는데, 만약 SEM이 5라면 95% 신뢰구간CI은 거의 ±10점 정도가 되며, 합격선 근처에 있는 학생들의 진정한 합-불합 status에 대해서 불확실해진다.

Also, short tests tend to produce scores that are unreliable or have low reproducibility indices, with relatively large standard errors of measurement (SEM). For example, a typical formative test in undergraduate medical education may have only about 40 test questions to cover a fairly broad content area. Such a test may have a reliability coefficient of 0.50–0.60 and a SEM of 3–5 raw score points. Unreliable tests produce wide confidence intervals around student scores. For example, if the SEM is 5, the 95 percent confidence interval around each raw score (including the passing score) is nearly ±10 points, creating much uncertainty about the true pass-fail status of many students scoring near the passing score (depending on the test score distribution).
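위 예시의 수치는 고전검사이론의 표준 공식으로 그대로 확인할 수 있다(본문에 인용된 숫자를 다시 풀어 쓴 것일 뿐, 새로운 자료는 아니다).

```latex
\mathrm{SEM} = SD_X \sqrt{1 - r_{XX}}, \qquad
\mathrm{CI}_{95\%} \approx X \pm 1.96 \times \mathrm{SEM}
```

따라서 SEM이 5점이면 95% 신뢰구간은 X ± 1.96 × 5 ≈ X ± 10점이 되어, 합격선 근처 점수의 합-불합 판정이 매우 불확실해진다.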



모든 의학교육의 시험이 valid inference를 갖춰야 하는가?

Must All Tests in Medical Education Produce Valid Inferences?


 

레토릭한 답은 '그렇다'이다. 타당도는 절대로 '그냥 갖췄을 것assumed'이라고 여겨서는 안되는 것이며, 늘 그 근거를 찾아야 하는 열려있는 질문인 것이다. 만약 의학교육자들이 학생을 test한다면, 그 test measure가 의도한 것을 평가하며 점수로부터 유도한 추론이 얼마나 accurate하고 얼마나 defensible한지에 대한 hard evidence를 보여줘야 한다.

The answer to this rhetorical question is yes. Validity is never assumed and is always an open-ended question, seeking a variety of sources of evidence. If medical educators test students, there is an obligation to collect and present hard evidence that the test measures what is intended and that the inferences drawn from test scores are more-or-less accurate and more-or-less defensible.


CIV와 CU에 대한 대부분의 이슈들은 통제가능한 것이다. 효과적이고 높은 인지수준을 평가하기 위한 시험을 만드는 방법(items that do not cue exam- inees to the correct answer, permit inordinate guessing, or confuse knowledgeable students)은 잘 알려져 있다.

Most of the issues discussed under the rubric of construct-irrelevant variance and construct underrepresentation are under the control of those who create tests in medical education. Techniques to develop effective, higher cognitive-level test items that measure important and useful information – items that do not cue examinees to the correct answer, permit inordinate guessing, or confuse knowledgeable students – are well known and readily accessible (Case and Swanson, 1998).



 2002;7(3):235-41.

Threats to the validity of locally developed multiple-choice tests in medical education: construct-irrelevant variance and construct underrepresentation.

Author information

  • 1University of Illinois at Chicago, College of Medicine, Department of Medical Education (MC 591), 808 South Wood Street, Chicago, IL 60612-7309, USA. sdowning@uic.edu

Abstract

Construct-irrelevant variance (CIV) - the erroneous inflation or deflation of test scores due to certain types of uncontrolled or systematic measurement error - and construct underrepresentation (CUR) - the under-sampling of the achievement domain - are discussed as threats to the meaningful interpretation of scores from objective tests developed for local medical education use. Several sources of CIV and CUR are discussed and remedies are suggested. Test score inflation or deflation, due to the systematic measurement error introduced by CIV, may result from poorly crafted test questions, insecure test questions and other types of test irregularities, testwiseness, guessing, and test item bias. Using indefensible passing standards can interact with test scores to produce CIV. Sources of content underrepresentation are associated with tests that are too short to support legitimate inferences to the domain and which are composed of trivial questions written at low-levels of the cognitive domain. "Teaching to the test" is another frequent contributor to CUR in examinations used in medical education. Most sources of CIV and CUR can be controlled or eliminated from the tests used at all levels of medical education, given proper training and support of the faculty who create these important examinations.

PMID: 12510145


표준적인 문항작성 원칙 위반에 따른 효과 (Adv Health Sci Educ Theory Pract. 2005)

The Effects of Violating Standard Item Writing Principles on Tests and Students: The Consequences of Using Flawed Test Items on Achievement Examinations in Medical Education 


STEVEN M. DOWNING

University of Illinois at Chicago, Department of Medical Education (MC 591), College of Medicine,

808 South Wood Street, Chicago, Il 60612-7309, USA (Phone: +1-312-996-6428; Fax: +1-312-

413-2048; E-mail: sdowning@ uic.edu)





도입

Introduction


그러나 Mehrens and Lehmann 이 지적한 것처럼, 교수자가 준비한 시험에는 종종 여러 중대한 결함들이 있다. Jozefowicz 등은 poorly constructed test item이 의과대학에서 흔하게 사용됨을 보여주었다.

However, as Mehrens and Lehmann (1991) point out, there are often major deficiencies in examinations prepared by classroom instructors. And, Jozefowicz and others (2002) show that poorly constructed test items are frequently used in medical schools.


문항작성이란 과학이라기보다는 예술에 가깝지만, 정립된 원칙이 있고, 이것 중 다수는 근거-기반으로 효과적인 형식과 비효과적인 형식이 무엇인지 알려준다.

While test item writing may be as much art as science, there are well established principles, many of which are evidence-based, suggesting what is an effective item form vs. an ineffective item form (Haladyna et al., 2002).


몇몇 IWF에 대해서 그 영향이 연구된 바 있다.

Several item flaws have been studied empirically for their effect on item and test psychometric characteristics.

  • For example, the use of ‘‘all of the above’’ (AOTA) and ‘‘none of the above’’ (NOTA) as options has been extensively studied with mixed results (Harasym et al., 1998).

  • 일반적 MCQ의 변형된 형태 Variants of the straightforward multiple-choice question (MCQ) stem, such as multiple-true–false or unfocused stems, have been studied and generally found to be detrimental to item performance (Case and Downing, 1989; Downing et al., 1995).

  • 조합형 문항 Complex item forms, which require selection of combinations of individual options, have been extensively studied and found to be generally detrimental to the psychometric attributes of tests (Albanese, 1993; Dawson-Saunders et al., 1989).

  • The use of negative words in the stem has been evaluated with mixed results concerning difficulty and discrimination of test items (Downing et al., 1991; Tamir, 1993).


최근의 리뷰문헌을 보면, 부정문을 지양하라거나 AOTA를 지양하라는 것 등을 권고한다. NOTA의 사용에 대해서는 권고가 일정하지 않으나, 현재의 권고는 고도로 숙련된 출제자가 아닌 경우 NOTA를 지양하라는 것이다.

A recent review paper (Haladyna et al., 2002) recommends the avoidance of negation in the stem and reports that most educational measurement textbook authors recommend avoiding the AOTA option. The use of the NOTA option has mixed recommendations from textbook authors and the empirical research is also mixed but the current recommendation is to avoid use of the NOTA option, except when used by highly experienced item writers (Crehan and Haladyna, 1991; Frary, 1991).


방법

Methods


 

문항 예시

EXAMPLE ITEMS


1. It is correct that: 

A. Growth hormone induces production of IGFBP3 

B. The predominant insulin-like growth factor binding protein (IGFBP) in human serum is IGFBP3 

C. Multiple forms of IGFBP are derived from a single gene 

D. All of the above 

E. Only A and B are correct


This is an example of an unfocused stem item. The stem does not pose a direct question. The options must each be addressed as ‘‘true or false,’’ but the item is scored as a single-best answer question. Further, option D, ‘‘all of the above’’ is not recommended. And, option E is a combination of two other possible answers, making this a ‘‘partial-K type’’ item. Overall, there are three distinct flaws in this question.


2. Which of the following will NOT occur after therapeutic administration of chlorpheniramine? 

A. Dry mouth 

B. Sedation 

C. Decrease in gastric acid production 

D. Drowsiness 

E. All of the above


The second item is an example of a negative-stem question. It requires the student to identify which sign or symptom will not occur. Option E (all of the above) is not recommended. This example item has two item flaws.


 

아래의 분석을 시행함.

For each of the three scales evaluated for this study (standard, flawed, and total), item analysis data were computed: raw score means, standard deviations, mean item difficulty, mean point-biserial correlation with the total examination score, Kuder–Richardson 20 reliability (K–R 20), minimum passing score, the number of students passing, and passing rate (the proportion of students who passed).
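참고로, 이러한 고전적 문항분석 지표(문항 난이도, point-biserial 변별도, KR-20)가 0/1 응답 행렬에서 어떻게 계산되는지를 보여주는 최소한의 Python 스케치를 덧붙인다. 함수명과 데이터 구조는 예시를 위한 가정이며, 저자들이 실제 사용한 분석 코드가 아니다.

```python
import math

def item_analysis(responses):
    """Classical item analysis for a 0/1 response matrix (rows = examinees, columns = items).

    Returns per-item difficulty (proportion correct), per-item point-biserial
    discrimination against the total score, and KR-20 reliability of the total score.
    (Sketch only: the item is not removed from its own total for the point-biserial,
    and total scores are assumed to vary.)
    """
    n_students, n_items = len(responses), len(responses[0])
    totals = [sum(row) for row in responses]
    mean_total = sum(totals) / n_students
    var_total = sum((t - mean_total) ** 2 for t in totals) / (n_students - 1)
    sd_total = math.sqrt(var_total)

    difficulties, point_biserials = [], []
    for j in range(n_items):
        col = [row[j] for row in responses]
        p = sum(col) / n_students                      # item difficulty (proportion correct)
        if 0 < p < 1:
            mean_correct = sum(t for t, x in zip(totals, col) if x) / sum(col)
            r_pb = (mean_correct - mean_total) / sd_total * math.sqrt(p / (1 - p))
        else:
            r_pb = 0.0                                 # item has no variance
        difficulties.append(p)
        point_biserials.append(r_pb)

    # KR-20 = k/(k-1) * (1 - sum(p*q) / variance of total scores)
    kr20 = (n_items / (n_items - 1)) * (1 - sum(p * (1 - p) for p in difficulties) / var_total)
    return difficulties, point_biserials, kr20
```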


Results


Table I. Descriptive statistics of four tests


 

Table II. Frequency of item flaws in four basic science examinations

 


 

Table III. Psychometric characteristics of the standard and flawed scales: Tests A, B, C, and D




PASS–FAIL AGREEMENT ANALYSIS


Table IV. Pass–fail agreement analysis, all examinations, all students N = 749





고찰

Discussion


이 연구에서는 item difficulty, item discrimination, score reliability, passing score, passing rate 간의 관계를 이해하는 것이 중요하다.

In this study, it is important to understand the relationship among item difficulty, item discrimination, score reliability, and passing scores and passing rates.

  • 문항 난이도란 정답을 맞춘 학생의 %이다.
    Item difficulty refers to the proportion (%) of students getting the item correct.

  • 문항 변별도란 얼마나 문항이 잘하는 학생과 못하는 학생을 구분해낼 수 있는가이다. 변별도가 높은 것이 바람직하다.
    Item discrimination describes how effectively the test item separates or differentiates between high ability and low ability students – noting that test items that highly discriminate are desirable.

  • 모든 것이 동일하다면 변별도가 높을수록 신뢰도가 높다.
    All things being equal, highly discriminating items tend to produce high score reliability.

  • 이 연구에서의 문항에는 passing score value가 정해졌기 때문에, IWF가 없는 경우와 있는 경우의 passing score와 passing rate를 계산할 수 있었다. passing score와 passing rate가 서로 반비례 관계에 있다는 것을 이해하는 것이 중요하다.
    Because items in this study had each been assigned a passing score value (by the Nedelsky absolute standard setting method) it was possible to calculate passing scores (the score needed to pass the test) and passing rates (the percentage of students who pass) for the two subscales of interest – the standard and the flawed subscales. It is important to note that the passing score and the passing rate are inversely related; that is, high passing scores tend to produce lower passing rates.


이번 연구에서는 IWF가 매우 흔했다. 이것은 중요한 결과이며, 기대하지 못한 바는 아니다.

There was a high frequency of flawed items in the tests studied. This is an important finding, although not completely unexpected (Downing, 2002; Jozefowicz et al., 2002; Mehrens and Lehmann, 1991). Classroom assessments in medical school settings are not immune to poorly crafted test items.


IWF가 있는 경우 3/4에서 더 난이도가 높았다. 0~15%정도 더 어려웠는데, 이 결과는 MCQ에 매우 testwise한 의과대학생을 대상으로 했다는 것을 고려하면 조금 놀랍다.

Flawed item formats were more difficult than standard, non-flawed item formats for students in three of four examinations studied. These mixed results showed that flawed item formats were 0–15 percentage points more difficult than questions posed in a standard form. This finding is somewhat surprising, given that examinees in this study are medical students, highly experienced in taking MCQ examinations and presumably very testwise.


Passing rate는 IWF에 의해서 negative한 영향을 받는다. Poorly crafted 문항은 학생들에게 더 challenge가 된다.

Passing rates (the proportion of students meeting or exceeding the passing score) tended to be negatively impacted by flawed items. Poorly crafted, flawed test questions tended to present more of a passing challenge for students.


IWF가 없는 경우와 있는 경우의 합-불합의 일치도를 보면 749명 중 102명이 IWF가 없는 경우 합격했으나 IWF가 있는 문항에서 불합격했고, 30명의 학생만이 IWF가 없는 경우 불합격, 있는 경우 합격이었다. 이 결과의 해석에 있어 length and reliability, passing score가 일부 scale에 있어 차이가 있으므로 주의를 기울여야 한다. 위의 102명이 중요한데, 이 misclassification 중 일부는 무작위측정오차에 따른 것이지만, 일정 부분은 IWF에 의한 systematic error이기 때문이다.

The agreement between pass–fail outcome assigned by the standard and the flawed scales shows that 102 of 749 students (14%) pass the standard items but fail the flawed items, while only 30 students (4%) pass the flawed items and fail the standard items. (These data must be interpreted cautiously, since the scales differ in length and reliability and the passing scores also differ for some of the scales.) The 102 students (of 749) classified as passing the standard items while failing the flawed items are of great concern. Some of these misclassifications are due to random measurement error (unreliability), but some proportion is also due to the systematic error introduced by flawed items, given the results of this research.


IWF가 있는 문항을 다시 쓰는 것이 상대적으로 쉽고 비용이 적게 든다는 점을 고려하면, 이러한 false-negative 비율은 과도하게 높아 보인다. 명백하게 이렇게 높은 misclassification 비율은 consequential validity evidence에 부정적으로 작용한다.

A false negative rate this high seems unreasonable, given the relative ease and low costs associated with re-writing flawed questions into a form that would adhere to the standard, evidence-based principles of effective item writing. Clearly, this high misclassification rate impacts the consequential validity evidence for the tests in a negative manner (Messick, 1989).


IWF가 신뢰도에 미치는 영향은 일관되지는 않았다. IWF에 의한 영향은 systematic error이지 random error가 아니며, 오직 random error만이 내적신뢰도에 의해서 추정가능하다. 따라서 점수의 신뢰도가 IWF와 관계가 거의 없는 것은 놀랍지 않다.

The effect of flawed item forms on score reliability is mixed. The nature of the item format flaws studied contributes systematic error to the measurement, not random error; only random errors of measurement are estimated by the internal consistency score reliability. Thus, it is not surprising that the score reliability shows little relationship to item flaws.


Messick은 CIV를 다음과 같이 정의했다. IWF에 의한 높아진 난이도와 낮아진 passing rate는 이 정의에 정확히 부합한다.

Messick (1989, p. 34) defines construct-irrelevant variance (CIV) as ‘‘…excess reliable variance that is irrelevant to the interpreted construct.’’ The excess difficulty and tendency toward lower passing rates for flawed vs. standard items meets Messick’s definition of CIV perfectly.


이 연구의 결과로부터 교수들에게 효과적인 객관식문항작성법 교육이 필요함을 제기한다. 좋은 소식은 이 연구에서 드러난 다섯 개의 가장 흔한 오류에만 집중해도 된다는 것이다.

The results of this study suggest that efforts to teach faculty the principles of effective objective-test item writing should be increased. The good news is that these faculty development efforts can concentrate on eliminating the five most common errors found in this study and thereby eliminate nearly all flawed items from tests.




Haladyna, T.M., Downing, S.M. & Rodriguez, M.C. (2002). A review of multiple-choice item- writing guidelines. Applied Measurement in Education 15: 309–334.


Jozefowicz, R.F., Koeppen, B.M., Case, S., Galbraith, R., Swanson, D. & Glew, H. (2002). The quality of in-house medical school examinations. Academic Medicine 77: 156–161.



 

 

 

 





The effects of violating standard item writing principles on tests and students: the consequences of using flawed test items on achievement examinations in medical education.

Author information

  • 1Department of Medical Education (MC 591), College of Medicine, University of Illinois at Chicago, 60612-7309, USA. sdowning@uic.edu

Abstract

The purpose of this research was to study the effects of violations of standard multiple-choice item writing principles on test characteristics, student scores, and pass-fail outcomes. Four basic science examinations, administered to year-one and year-two medical students, were randomly selected for study. Test items were classified as either standard or flawed by three independent raters, blinded to all item performance data. Flawed test questions violated one or more standard principles of effective item writing. Thirty-six to sixty-five percent of the items on the four tests were flawed. Flawed items were 0-15 percentage points more difficult than standard items measuring the same construct. Over all four examinations, 646 (53%) students passed the standard items while 575 (47%) passed the flawed items. The median passing rate difference between flawed and standard items was 3.5 percentage points, but ranged from -1 to 35 percentage points. Item flaws had little effect on test score reliability or other psychometric quality indices. Results showed that flawed multiple-choice test items, which violate well established and evidence-based principles of effective item writing, disadvantage some medical students. Item flaws introduce the systematic error of construct-irrelevant variance to assessments, thereby reducing the validity evidence for examinations and penalizing some examinees.



다지선다형 문제에서의 찍기(Guessing) (Med Educ, 2003)

Guessing on selected-response examinations


S M Downing




어떤 사용자들은 객관식이 실제 성취도를 공정하게 보여주기 위해서는 '찍기'의 adverse effect를 줄이기 위한 transform이 필요하다고 걱정한다.

Some users of objective formats worry that student scores will not fairly represent true achievement unless the scores are transformed in some way to reduce the adverse effects of guessing.


학생들이 선다형 문제에서 blind guessing하는 빈도는 상당히 과장되어 있었을 수 있다. 학생들은 일반적으로 매우 testwise하고 직관적으로 random guessing이 고부담시험에서 좋은 전략이 아님을 알고 있다.

The frequency of student blind guessing on selected-response test questions may be considerably overestimated.2,3 Medical students are generally very testwise and intuitively understand that random guessing is not a good strategy to obtain high scores on even moderately high-stakes examinations.


반면, informed guessing은 학생이 어떤 일부 지식을 가지고 오답을 신중하게 제거한 후, 정답에 이를 가능성을 높이는 것을 말한다. 삶에서, 의학에서 대부분의 결정이 불완전한 지식을 바탕으로 이뤄진다.

Informed guessing, on the other hand, describes test-taking behaviour in which the student uses some partial knowledge to thoughtfully eliminate incorrect answers and improve the probability of arriving at a correct answer. Most decisions in life (and in medicine) are made with incomplete knowledge.


KR-20과 같은 내적신뢰도 추정치는 소수 학생의 간헐적인 random guessing에는 크게 민감하지 않지만, 많은 학생이 많은 문항에서 random guessing을 했다면 이 계수로 감지할 수 있다.

While typical estimates of internal consistency reliability, such as the Kuder–Richardson 20 estimate, are not sensitive to the occasional random guess by a few students, random guessing on large numbers of test items by many students would be detected by this coefficient.
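참고로, 여기서 언급된 KR-20(Kuder–Richardson 20)은 일반적으로 다음과 같이 계산된다(원문에 제시된 식이 아니라, 표준적인 정의를 옮겨 적은 것이다):

\[ \mathrm{KR\text{-}20} = \frac{k}{k-1}\left(1 - \frac{\sum_{j=1}^{k} p_j q_j}{\sigma_X^2}\right) \]

여기서 k는 문항 수, p_j는 문항 j의 정답률, q_j = 1 − p_j, σ_X²는 총점의 분산이다. 많은 학생이 많은 문항에서 random guessing을 하면 문항 간 공분산이 약해져 총점 분산 σ_X²가 Σp_jq_j에 비해 작아지고, 그 결과 계수가 떨어진다. 이것이 본문에서 말하는 '이 계수로 찾아낼 수 있다'의 원리이다.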



의학교육에서 일반적인 시험을 보면, random guessing외에도 많은 변수가 낮은 신뢰도에 기여한다.

In typical classroom assessments in medical education, many variables other than random guessing contribute to lower test reliability. Variables such as

  • poorly written and ambiguous test items,

  • examinations that are too short,

  • items with flaws which cue students to the correct answer or confuse them into giving an incorrect answer,

  • items with implausible distracters and

  • poorly discriminating items

...all contribute more measurement error than random guessing.4,5



guessing을 보정하는 가장 흔한 두 가지 방법은 오답에 대해서 일부 감점을 하는 것이거나, 찍지 않고 빈칸으로 남긴 문항에 대해서 부분점수를 부여해서 guessing하지 않은 것을 보상하는 것이다. 그러나 이 두 가지 보정 점수 모두 '정답 문항 수(raw number-correct)'와 완벽한 상관관계를 가지며, 즉 raw scoring과 위의 두 가지 보정 공식은 수험자를 정확히 동일한 순서로 석차를 매긴다.

The two most common methods of correcting for guessing are to subtract fractional points for incorrect answers from the total of correct answers, or to add fractional points for omitted items to ‘number-correct’ scores in order to reward the absence of guessing. However, both types of guessing-corrected scores correlate perfectly with raw number-correct scores and with each other, indicating that raw scoring and both of the guessing-correction formula scores rank order examinees in exactly the same order.
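본문이 말하는 두 가지 보정 방식을 흔히 쓰이는 형태로 적어 보면 다음과 같다(원문에 식이 제시된 것은 아니며, 문항당 선택지 수 C, 정답 문항 수 R, 오답 문항 수 W, 무응답 문항 수 O, 전체 문항 수 n이라는 통상적 표기를 가정한 스케치이다):

\[ FS_1 = R - \frac{W}{C-1}, \qquad FS_2 = R + \frac{O}{C} \]

시험 길이가 고정되어 있으므로 R + W + O = n이고, 따라서

\[ FS_1 = R - \frac{n - R - O}{C-1} = \frac{C}{C-1}\left(R + \frac{O}{C}\right) - \frac{n}{C-1} = \frac{C}{C-1}\,FS_2 - \frac{n}{C-1}. \]

즉 두 보정 점수는 서로의 선형변환이며, 무응답이 없는 경우(O = 0)에는 둘 다 R의 선형 증가함수가 된다. 이것이 본문에서 말하는 '보정 점수가 raw number-correct 점수와 동일한 석차를 낸다'는 주장의 산술적 배경이다.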


그러나 부분점수를 더하거나 빼는 것은 CIV(construct-irrelevant variance)를 더하게 되는데, 왜냐하면 이러한 보정된 점수는 학생의 성취도 뿐만 아니라 내용과 상관없는 변수, 예컨대 위험을 감수/회피하려는 성향, 고득점/저득점에 대한 학생의 기대와 같은 것을 동반하기 때문이다. 만약 검사점수가 criterion-referenced로 해석된다면, 즉 점수 해석을 위해 절대값이 배정된다면, 그러한 해석은 guessing correction에 의해서 flawed될 수 있다. 진점수 외에도 non-random 또는 systematic error가 일부 피험자들에게 더해지는 것이다. 이것이 바로 CIV의 정의이기도 하다.

It is noteworthy, however, that the fractional score additions or subtractions may add construct-irrelevant variance to the number-correct scores, since these corrected scores are a measure of student achievement plus a measure of some other variable that is not related to content, such as the examinee’s risk-taking propensity or aversion, or the student’s expectations about a high or a low score on the test.6 If test scores are interpreted in a criterion-referenced system, such that the absolute value of scores is assigned some interpretation (such as pass–fail), that interpretation may be flawed due to the guessing correction. Non-random or systematic error will be added to (or subtracted from) the true scores of some, but not all, examinees. This is the definition of construct-irrelevant variance.7
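이 단락을 고전검사이론의 표기로 도식화하면 다음과 같다(원문에 나오는 식이 아니라 이해를 돕기 위한 스케치이다):

\[ X = T + E_{\text{random}} \quad\longrightarrow\quad X_{\text{corrected}} = T + E_{\text{random}} + E_{\text{systematic}} \]

여기서 E_systematic은 위험 감수/회피 성향이나 점수에 대한 기대처럼, 측정하려는 construct와는 무관하지만 일부 수험자에게만 체계적으로 더해지거나 빠지는 성분이다. 이 성분이 바로 Messick이 말하는 construct-irrelevant variance에 해당하며, 절대 기준(pass–fail)으로 점수를 해석할 때 특히 문제가 된다.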



Correction-for-guessing 공식은 raw score에서 random guessing의 영향력을 제거하지 못한다. 사실, '찍지 말라'고 해도 testwise한 학생들은 '모든 문항에 대답을 해야' 가장 높은 점수를 받을 것이라는 것을 알고 있다. 따라서, 대부분의 교육측정전문가들은 이미 오래 전에 모든 guessing-correction 공식을 배척하였으며, 단순히 맞춘 문항의 점수만 낸다. 이러한 방식은 충분히 길고 심사숙고하여 만들어진 시험이 잘 가르쳐지고 잘 준비된 학생들에게 시행되기만 한다면 test validity에 있어서 거의 위협이 되지 않는다. 그러나 correction for guessing은 CIV를 추가시킬 가능성이 높다.

Correction-for-guessing formulas do not remove the effect of random guessing from raw scores. In fact, even when directed not to guess (when students are warned that formula scoring will be used), testwise students know that their score will be maximized by answering every question.8 Therefore, most educational measurement specialists long ago abandoned all guessing-correction formulas in favour of simple number-correct scoring. There is an extremely low threat to test validity from guessing in reasonably long and carefully constructed objective tests which are administered to students who are well taught and well prepared for testing. However, there is a very real likelihood of adding construct-irrelevant variance or systematic measurement error to test scores in an attempt to correct for guessing.9,10 In the case of proposed guessing corrections, the cure may be worse than the disease.



1 Burton RF. Misinformation, partial knowledge and guessing in true/false tests. Med Educ 2002;36:805–11.




 2003 Aug;37(8):670-1.

Guessing on selected-response examinations.


PMID: 12895242



의사 역량 평가의 개념적, 실천적 어려움(Med Teach, 2015)

Conceptual and practical challenges in the assessment of physician competencies

CYNTHIA R. WHITEHEAD1,2,3, AYELET KUPER1,2,4, BRIAN HODGES1,2,5 & RACHEL ELLAWAY6

1University of Toronto, Canada, 2Wilson Centre, Canada, 3Women’s College Hospital, Canada, 4Sunnybrook Health Sciences Centre, Canada, 5University Health Network, Canada, 6Northern Ontario School of Medicine, Canada






도입

Introduction


평가에 대한 현재 모델은 우리의 사고와 실천을 발전시켜왔다. 그러나 이 모델들은 의료행위와 평가의 본성nature를 더 이해해가면서 점차 낡고 올이 다 드러났다.

Current models of assessment have served us well in advancing our thinking and practices, but they are becoming increasingly threadbare in light of our emerging understanding of the nature of medical practice and of the assessment of medical practitioners.


배경

Background



최근 의과대학 교육과정은 생의학적 기초에서 (이 전에는 암묵적으로만 다뤄져오던) 의사소통/프로페셔널리즘/리더십과 같은 역량에 이르기까지 확장되어왔다. 이러한 개념이 비공식 또는 잠재 교육과정에서 공식 교육과정으로 옮겨온 것은 단순히 내용만 더해진 것이 아니다. 이러한 변화는 형식이 없던 것에 형식을 부여했고, 새로운 형식을 위한 새로운 방법이 필요했다.

In recent years medical school curricula have broadened from their biomedical base to explicitly include previously tacit competencies such as communication, professionalism and leadership. Moving these concepts from the informal and hidden curricula of medical education to its formal curriculum is not just a matter of adding content; these changes require form to be given to the formless, and new methods to be put in place to match these new and emerging forms.


CanMEDS의 개발은 한 사례이다. Educating Future Physicians for Ontario (EFPO) project 가 CanMEDS의 근원이다.

The development of CanMEDS is an example of this shift. The roots of CanMEDS can be found in the Educating Future Physicians for Ontario (EFPO) project (Whitehead et al. 2011a).


EFPO 프로세스에서 의사의 8개 역할이 개발되었다. 캐나다의 전문의들은 EFPO 역할을 가지고 다시 작업하여 현재의 일곱 개 CanMEDS 역할로 만들었으며, 이 과정에서 8번째 역할인 "Person"을 제거하고 이를 "Professional" 역할 안에 포함embedding시켰다.

Eight relatively distinct physician roles were developed in the EFPO process. Canadian specialist physicians took the EFPO roles and re-worked them into the seven current CanMEDS Roles (Medical Expert, Collaborator, Manager, Health Advocate, Communicator, Scholar and Professional) (Frank et al. 1996), in the process removing EFPO’s eighth role of ‘‘Person’’, embedding it instead in the Professional Role (Whitehead et al. 2011a, 2014).



이 역할에 깔려 있는 복잡한 개념을 고려했을 때, 각 역할을 학생과 trainee에게 인식시키는 것은 잘해봐야 근사치에 가까운 수준approximation 정도이다. 

Understandably, given the complex concepts that underpin the roles, the realization of each role in our students and trainees remains an approximation at best.



EFPO 와 CanMEDS 는 귀납적 프로세스의 결과물이다. 의료행위의 복잡성과 다원성을 단순하고 인식가능한 역할로 만든 것이다. 개별 역할의 집합(즉, 이상적인 의사의 완벽한 모습totality)이 CanMEDS 역량 프레임워크를 구성한다. 이제 CanMEDS는 공식화되어서 국가 및 국제적 인증 기준까지 공식화되고 확장되었다. 교육자들은 이렇게 성문화된now-codified, 표준화된 역할을 교육 및 평가를 위해 translate해야 한다. 의학교육자들은 이 역량을 평가하기 위한 실용적 접근법으로서 모든 가능한 도구들(채점기준/마일스톤/체크리스트)을 현장에 있는 임상교사들에게 제공해주기 위해서 설계/파일럿/의무화/도입에 열심히 노력했다.  교육자들이 수년간 이 프로세스에 몰입해온 동안 competency project는 교육리더들에게 있어서 "아직 걸음마 단계"라고 여겨지고 있다.

The EFPO and CanMEDS Roles were the result of an inductive process, rendering the complexities and pluralities of what it takes to be a medical practitioner into a set of simple and recognizable roles. The aggregate of the individual Roles (reflecting the totality of the ideal physician) comprises the CanMEDS competency framework. Now that CanMEDS has been formalized and expanded on in national and international accreditation standards, educators have been required to deductively translate the now-codified and standardized roles back into their teaching and assessment practices. Medical educators have risen to the challenge, working diligently to design, pilot, mandate and implement all manner of toolkits, rubrics, milestones and checklists to provide on-the-ground clinical teachers with practical approaches to learner assessment of these competencies (Bandiera et al. 2006; Sherbino et al. 2008; Royal College of Physicians and Surgeons of Canada 2014). While educators have already been engaged in this process for a number of years, the competency project is still considered by education leaders to be ‘‘in its infancy’’ (Association of Faculties of Medicine of Canada 2012, p. 4).


일부 CanMEDs 역할(Medical Expert)는 상대적으로 가르치고 평가하기가 쉽지만, 다른 역할들은 그보다 어렵다. 교육자들이 효과적이고 relevant한 평가도구를 모든 역량에 대하여 가지고 있지 않는 이상, 평가는 평가하기 '손쉬운'영역으로 skew되고 말 것이다.

While some CanMEDS Roles or competencies (such as Medical Expert) have been relatively easy to teach and assess, others have proved to be more challenging (Verma et al. 2005; Leveridge et al. 2007; Bryden et al. 2010). Unless educators have effective and relevant assessment tools for all competencies, assessment may end up skewed towards ‘‘easier’’ areas,




Competence와 Competency 이해하기

Understanding competence and competency



각각의 프레임워크는 복잡한 의료행위의 개별 요소들을 단순하게 구조화된 다이어그램으로 표현하며, 종종 그림으로 묘사되기도 한다. CanMEDS는 "Medical Expert"를 중심에 둔 일곱 개 역할을 꽃으로 표현했다. ACGME도 기능을 목록화 한 CanMEDS와 비슷한 모델을 가지고 있으며, Scottish Doctor은 세 개의 concentric ring을 가지고 의사가 무엇을 할 줄 알아야 하고, 그것을 어떻게 해야 하고, 전문직으로서의 역할이 무엇인가를 표현하였다. 이 모델들이 각각 사용한 언어가 다르고, 그것이 구현arrange된 방식이 다르지만, 이 모델은 모두 의사의 능력 또는 역량을 보여주기 위한 목적이 있다.

Each framework reduces the complexities of multiple individual components of medical practice into a diagram or a simple structure, often represented as a figure. CanMEDS is based around seven roles represented as a flower with ‘‘Medical Expert’’ at the centre (Frank & Danoff 2007), the Accreditation Council for Graduate Medical Education (ACGME) (2006) has a similar model to CanMEDS usually given as a list of functions, while the Scottish Doctor has three concentric rings representing what the doctor is able to do, their approach to practice and their role as a professional (Ellaway et al. 2007). The language used and the way they are each arranged visually is different, although they all aim to represent a physician’s capabilities or competencies (ibid).


또한, 맥락이 바뀌면 역량 프레임워크도 새로운 사회와 사회적 상황을 반영할 수 있게 달라져야 한다. 이에 우리는 역량 프레임워크의 적용은 상황-, 맥락- 특이적으로 이뤄져야 한다고 제안한다.

Furthermore, as contexts change, then, at least by implication, competency frameworks may also need to change to reflect new social and societal circumstances. We suggest, therefore, that the applicability of competency frameworks should be considered as situated and context- specific.


그러나 '추상화'로서, 이 역량 프레임워크는 그것이 보여주고자 하는 복잡한 아이디어와, 맥락의 변화에 반응하여 달라지는 의료행위의 변화 방식을 단순화한 버전을 제공할 수 밖에 없다. 따라서 어떤 프레임워크도 절대 "진실"일 수 없으며, 모든 프레임워크는 "근사치approximation"으로 장점과 함께 한계점이 분명하다.

As abstractions, however, they necessarily provide a simplified version of the complex ideas they represent and the ways that practice changes in response to the context in which it takes place. No framework, therefore, is ever ‘‘the truth’’, but instead all frameworks are approximations, and all will inevitably have limitations as well as strengths.


 

역량의 평가

Assessment of competencies


북미 교육에서 측정을 지배해온 전통은 psychometric 방법론 적용과 역량프레임워크의 개념에 초점을 두고 있고, 이것 때문에 평가와 측정에 관한 논의가 제한된다. Psychometric 기법은 한 개인 내에 stable trait로서 존재한다고 여겨지는 현상을 평가하기 위하여 처음 사용되었다. 예컨대 진실성/논리적 추론/시공간 능력 등이 있다. 인지심리학자들은 나중에 이 기법을 지식을 평가하는데 활용하였고, 더 나아가 수행능력 평가에까지 활용하였다. 'Psychometrically evaluated instruments'는 개인의 stable하고 latent한 특질을 평가하는데 유용했다. 교육자에게 이것은 real/measurable/underlying 심리특성을 평가함을 뜻했다.

The dominant tradition of measurement in education in North America has led to a focus on the application of psychometric methods and concepts to competency frameworks, thereby sometimes limiting discussion of assessment and capability to matters of measurement (Hodges 2013). Psychometric techniques were first used to evaluate phenomena that were thought of as stable traits that existed within a particular individual: things like truthfulness, logical reasoning and visual–spatial ability. Cognitive psychologists later expanded the use of these techniques to assess knowledge, and then further to assess performance. Psychometrically evaluated instruments are useful for assessing stable, latent traits within individuals. This implies that, as educators, we are assessing real, measurable, underlying psychological traits (Kuper et al. 2007).



psychometric approach 는 지식이나 테크닉적 스킬에 대해서는 잘 작동했다. 그러나 이 접근법이 개인의 특성으로서 stable하지 않은 construct에 대해서는 잘 작동하지 않았다. CanMEDS 역할 중 Advocate나 Collaborator는 어떻게 의료전문직이 다른 사람과 상호작용해야하는지에 대한 내용이나, 상호작용이란 본질적으로 맥락- 그리고 문화- 특이적인 것이다. 따라서 아무리 이 역량을 깔끔하게neatly 기술하다고 하더라도, 의사소통/협력/프로페셔널리즘/Advocacy 역량에 대한 개개인의 관점은 역사적, 상황적으로 달라지며 변화가능한 것이고, 배경과 문화가 다르면 달라진다.

A psychometric approach works well for constructs that relate to knowledge and technical skill. This approach does not, however, easily align with constructs that are not stable individual traits. CanMEDS Roles such as Advocate and Collaborator, for example, depict how medical professionals should perform in their interactions with others – interactions that are intrinsically context- and culture-specific. Therefore, no matter how many attempts are made to neatly codify physician competence, individual views of competent communication, collaboration, professionalism and advocacy will be historically contingent, situational, changeable – and inevitably different from those from other backgrounds and cultures (Kuper et al. 2007).


추가적으로, medical training을 통해서 커뮤티니가 요구하고 바라는 것을 제공해야 할 사회적 책무성이 강조되고 있다. 따라서 평가 전략과 평가 행위는 반드시 광범위한 개념(공정성, 개인의 요구, 안전, 신뢰도와 타당도, 사회와 커뮤니티의 특수한 요구에 대한 반응)을 포괄해야 한다.

In addition, there is a growing call for socially accountable medical training to ensure that our graduates can provide what communities need and want (Boelen & Heck 1995; Frenk et al. 2010). Assessment strategies and practices must therefore embrace and encompass a wide range of concepts, including fairness, individual needs, safety, reliability, and validity and responsiveness to particular societal and community needs.


새로운 접근법을 받아들이려는 역사적인 전례가 있음을 안다. 하지만 이러한 도구들은 여러가지 측면에서 평가에 유용하기는 하나, 사회적, 문화적으로 결정되는 21세기 의사 역량에 대한 현재의 미묘한 차이current nuanced를 담아내기에는 이상적이지 않다.

We also know that there is historical precedent to the adoption of new approaches: We suggest, however, that these tools, while very useful for assessing many things, are not ideal for the more socially and culturally-determined roles that comprise the current nuanced 21st Century understanding of physician competence.





방법과 도구를 다시 생각하기

Rethinking methods and means



여기서 우리는 두 가지 구체적인 사례에 초점을 두고자 한다. Ethnography와 Realist evaluation이다. 이 두 가지 모두 복잡한 사회적 구성개념construct을 평가하기 위해 설계된 것이다. 맥락과 사회적 위치location를 고려하며, 동일한 상황 내에서도 잠재적으로 가능한 다양한 subject position의 존재를 존중하고, 역량을 수행하는 것이 맥락에 따라서 달라질 수 있음에 개방적이다.

In this paper we focus on two specific examples: ethnography and realist evaluation, both of which were designed to assess complex social constructs. Each takes into account context and social location, honours the existence of multiple potential subject positions within the same situation, and is open to wide variability in the contextual performance of competence.


 

  • Realist inquiry explains the dynamics of complex systems in terms of various mechanisms in different contexts that lead to different outcomes (Wong et al. 2012; Pawson 2013). Realist inquiry also works with the concept of middle-range theory: demi-regularities within and around particular contexts rather than global phenomena. Realist assessment is therefore about explaining what individuals and groups are doing and how they are doing it rather than measuring a stable and predictable construct.

 

  • Ethnographic assessment involves gathering data about social interactions, using tools including observation, discussions and the analysis of written artifacts. Originally deriving from the discipline of anthropology, ethnography examines social processes, perceptions and behaviours within and between groups (Reeves et al. 2008, 2013).


이 두 가지는 복잡한 사회적 construct 평가를 framing하기 위한 방법론의 사례일 뿐이다.

Ethnography and realist evaluation are only two examples of methodological approaches for framing the assessment of complex social constructs such as the non-Medical Expert (Intrinsic) Roles (Sherbino et al. 2011).1 


이 방법론들은 trainee의 수행능력을 사회적 영역social realm의 관점에서 바라보며, 신뢰도와 타당도에 초점을 맞춰온 지난 60년간 "노이즈"로 무시되어온 수행능력의 맥락을 다시 비춰준다.

These methodologies can illuminate aspects of trainee performance in the social realm, as well as the contexts of that performance, that have been ignored as ‘‘noise’’ within our almost 60-year focus on validity and reliability.




나아갈 방향

Ways forward


그러나 우리는 변화가 없다면, 사회적으로 구성되는 non-Medical Expert (Intrinsic) 역할에 대한 평가가 relevant하고 rigorous해질 수 없다고 주장한다. 우리는 교육자들이 고의로 방법론적으로 incongruent한 평가도구를 사용하리라고 생각하지 않는다.

However, we contend that, without change, the assessment of the socially constructed, non-Medical Expert (Intrinsic) Roles cannot be relevant and rigorous. We do not think that educators will knowingly choose to use methodologically incongruent assessment tools, as in doing so they will fall short of the needs of learners, patients, the profession and society as a whole.


"우리가 평가하지 않으면 학생들은 그것을 중요하게 여기지 않을 것이다"라는 mantra가 교육자가 만들어내는 산출물의 어느 정도를 차지할까? 의학교육과 같은 Academic culture에서 교수들은 늘 강의를 끝낼 때 "학생들이 관심있어 하는 것은 시험에 뭐가 나올지 뿐이라는 것을 다 안다. 지금부터 알려주겠다"라는 식으로 강의를 끝내며, 이것이 test-focused 환경에서 만들어진 교사를 보여준다. 이러한 환경은 universal하지 않다. 예를 들면 덴마크에서는 레지던트 시험이 없다.

To what extent is the mantra ‘‘if we don’t assess it the students will not value it’’ a product of educators’ own making? Academic cultures, such as medical education, where professors routinely end lectures with statements like ‘‘I know all you care about is what is on the test, so now I will tell you’’ clearly implicate teachers in the construction of a test-focused environment. This environment is not universal; in Denmark, for example, there are no residency exams (Karle & Nystrup 1995; Hodges & Segouin 2008).


고찰

Discussion


우리의 집단적 기억은 모호하고 부분적이다. 우리는 과거에 어땠는지를 빠르게 잊고, 어떻게 여기까지 왔는지를 잊어버린다. 우리는 우리가 지금 가지고 있는 것과 하고 있는 것을 정상으로 받아들인다normalize.

Our collective memory tends to be rather vague and partial: we quickly forget where we have been and how we got here. We normalize what we currently have and do (in this case our current constellation of competencies, roles and frameworks)


non-Medical Expert (Intrinsic) Role의 평가에 대한 또 다른 가능한 평가방법으로는  case study methodology, critical discourse analysis and phenomenology등이 있다.

Other potential methodologies relevant to the assessment of the non-Medical Expert (Intrinsic) Roles include case study methodology, critical discourse analysis and phenomenology.


가능한 옵션을 늘리고 다양한 평가 접근법에 의존하는 것은 의학교육의 다른 트렌드와도 부합한다. 같은 논리를 우리의 평가방법론에도 적용해야 한다.

Expanding our options and drawing on multiple assessment approaches fits with other trends in medical education. We need to apply the same logic to our assessment methodologies.


국제 커뮤니티가 평가에 대한 사회과학 기반의 접근법을 탐구하지 않는다면, socially-constructed non-Medical Expert (Intrinsic) Roles를 적절하게 평가할 수 있는 능력이 제한될 것이다.

If this international community does not explore the potential of social science-based approaches to assessment then there will remain limits to the ability to adequately assess the socially-constructed non-Medical Expert (Intrinsic) Roles.








 


Sherbino J, Frank JR, Flynn L, Snell L. 2011. ‘‘Intrinsic Roles’’ rather than ‘‘armour’’: Renaming the ‘‘non-medical expert roles’’ of the CanMEDS framework to match their intent. Adv Health Sci Educ Theory Pract 16(5):695–697.





 2015 Mar;37(3):245-51. doi: 10.3109/0142159X.2014.993599. Epub 2014 Dec 19.

Conceptual and practical challenges in the assessment of physician competencies.

Author information

  • 1University of Toronto , Canada .

Abstract

Abstract The shift to using outcomes-based competency frameworks in medical education in many countries around the world requires educators to find ways to assess multiple competencies. Contemporary medical educators recognize that a competent trainee not only needs sound biomedical knowledge and technical skills, they also need to be able to communicate, collaborate and behave in a professional manner. This paper discusses methodological challenges of assessment with a particular focus on the CanMEDS Roles. The paper argues that the psychometric measures that have been the mainstay of assessment practices for the past half-century, while still valuable and necessary, are not sufficient for a competency-oriented assessment environment. New assessment approaches, particularly ones from the social sciences, are required to be able to assess non-Medical Expert (Intrinsic) roles that are situated and context-bound. Realist and ethnographic methods in particular afford ways to address the challenges of this new assessment. The paper considers the theoretical and practical bases for tools that can more effectively assess non-Medical Expert (Intrinsic) roles.

PMID: 25523113


성찰에 대한 평가의 교란요인: 비판적 리뷰 (BMC Med Educ, 2011)

Factors confounding the assessment of reflection: a critical review

Sebastiaan Koole1*, Tim Dornan2, Leen Aper1, Albert Scherpbier3, Martin Valcke4, Janke Cohen-Schotanus5 and

Anselme Derese1






배경

Background


평생학습은 최신의 헬스케어 서비스 제공을 위해 필수적이다. 평생학습이 단순히 컨퍼런스 참석을 의미하는 것이 아니며, 오늘날 평생학습이란 지속적 프로세스로서, 일상의 전문직 행동에 embed된 것이다. 평생학습의 핵심은 자신의 행동에 대해서 성찰하는 능력, 치료의 과정과 성과를 검토하고, 새로운 학습목표를 세우고, 수월성을 추구하기 위한 미래의 행동을 계획하는 것이다.

Lifelong learning is, consequently, crucial to the provision of up-to-date healthcare services [1]. Rather than just attending conferences, lifelong learning today is seen as a continuous process, embedded in everyday professional practice. At its core lies practitioners’ ability to reflect upon their own actions, continuously reviewing the processes and outcomes of treatments, defining new personal learning objectives, and planning future actions in pursuit of excellence [2-5].


많은 교육기관에서 성찰능력을 직헙훈련 프로그램의 목표로 삼고 있으며, 성찰적 사고가 개발될 수 있다는 것이 전제이다.

As a result, many educational institutions incorporate the ability to reflect as an objective of their vocational programs, premised on a belief that reflective thinking is something that can be developed rather than a stable personality trait [4,10,11].


그러나 성찰능력을 어떻게 가장 잘 개발하도록 도와줄 수 있는가는 불확실하다. 합의된 방법이 부족하다.

There is, however, uncertainty about how best to help people develop their ability to reflect [11]. Lack of an agreed way of assessing reflection is a particular obstacle.


 

평가는 피드백의 원천으로서(형성평가) motivation에 영향을 주며, 요구되는 수준의 역량이 달성되었는지를 판단할 때(총괄평가)에도 쓰인다. 어떻게 '성찰적 학습'을 조작화할 것인가에 대한 명확성이 계속 부족하다는 것은 더 깊은 문제의 증상이다. 서로 다른, 광범위하게 받아들여지는 이론들이 성찰을 서로 다르게 정의하고, 서로 다른 성과를 중요하게 여기며, 성찰을 평가할 차원과 기준도 서로 다르게 제시한다. 그 결과 연구결과를 비교하기도 어렵다.

Assessment also has a motivational influence, as a source of feedback (formative assessment) and when used to judge whether requisite levels of competence have been attained (summative assessment) [3,4,12]. The persisting lack of clarity about how to operationalise reflective learning is symptomatic of an even deeper problem. Different, widely accepted theories define reflection in different ways, consider different outcomes as important, define different dimensions along which reflection could be assessed and point towards different standards [11]. Consequently, research findings are hard to compare. This unsatisfactory state of affairs leaves curriculum leaders without practical guidelines,



논문의 목적

The purpose of this article is to review four factors, which confound the assessment of reflection:


  • 1. Non-uniformity in defining reflection and linking theory with practice. 
  • 2. A lack of agreed standards to interpret the results of assessments. 
  • 3. Threats to the validity of current methods of asses- sing reflection. 
  • 4. The influence of internal and external contextual factors on the assessment of reflection.

고찰

Discussion


1. '성찰'의 정의

1. Defining reflection

 

다양한 성찰의 정의 


  • Boenink 등은 '상황을 분석하는 서로 다른 관점의 숫자'로서 성찰을 묘사했다. 즉 하나의 관점에서부터 다수의 관점까지 다양할 수 있다.

  • Aukes 등은 자기성찰/공감적 성찰/성찰적 의사소통의 조합으로서 개인의 성찰을 개념화하면서, 정서적, 의사소통적 요소를 강조했다.

  • Sobral은 학습의 관점에서 reflection-in-learning을 강조했다.

Boenink et al [10] described reflection in terms of the number of different perspectives a person used to analyse a situation. Reflection ranged from a single perspective to a balanced approach considering multiple relevant perspectives. Aukes et al [13] emphasised emotional and communication components when they conceptualised personal reflection as a combination of self-reflection, empathic reflection, and reflective communication. Sobral’s [14] emphasis on reflection-in-learning approached reflection from a learning perspective.



이 세 가지 관점이 이 분야에서의 비일관성을 보여준다면 Dewey, Boud, Schön, Kolb, Moon, and Mezirow 의 연구는 공통점을 보여준다.

If those three perspectives exemplify inconsistency in the field, the work of Dewey, Boud, Schön, Kolb, Moon, and Mezirow exemplifies shared ground between reflection theories and used terms.

 

  • Dewey is usually regarded as the founder of the concept of reflection in an educational context. He described reflective thought as “active, persistent, and careful consideration of any belief or supposed form of knowledge in the light of the grounds that support it, and the further conclusions to which it tends” [15]. He saw reflective thinking in the education of individuals as a lever for the development of a wider democratic society.

  • In line with his work, Boud et al emphasised reflection as a tool to learn from experience in experiential learning [16]. They identified reflection as a process that looks back on experience to obtain new perspectives and inform future behaviour. A special feature of their description of reflection in three stages -

    • 1. 경험으로 돌아가기 Returning to an experience;

    • 2. 감정에 집중하기 attending to feelings; and

    • 3. 경험을 재평가하기 re-evaluating the experience - was the emphasis it placed on the role of emotions.

  • Moon described reflection as an input-outcome process [17]. She identified reflection as a mental function transforming factual or theoretical, verbal or non-verbal knowledge, and emotional components generated in the past or present into the output of reflection (e.g. learning, critical review or self-development). (사실적 또는 이론적/ 언어적 또는 비언어적/ 과거 혹은 현재의 감정적 요소를 성찰의 output으로 만드는 것)

  • Schön’s concept of the reflective practitioner identified reflection as a tool to deal with complex professional situations [18,19]. Reflection in a situation (reflection-in-action) is linked to practitioners’ immediate behaviour. Reflection after the event (reflection-on-action) provides insights that can improve future practice. Those two types of reflection together form a continuum for practice improvement.

  • The term ‘reflective learning’ describes reflection in the context of experiential learning. Kolb’s widely accepted experiential learning cycle describes four stages of learning:

    • 1. 경험을 한다(구체적 경험) having an experience (concrete experience),

    • 2. 성찰적 관찰(경험을 성찰한다) reflective observation (reflecting on this experience),

    • 3. 추상적 개념화(경험에서 배운다) abstract conceptualisation (learning from the experience) and

    • 4. 능동적 실험(배운 것을 시도해본다) active experimentation (trying out what you have learned) [20].

    • 이 네 단계는 나선형이다. These four stages are conceptualised as a spiral, each of whose turns is a step forward in a person’s experiential learning.

  • Lifelong learning is considered today as essential for maintaining a high standard of professional practice. Mezirow’s transformative learning theory described lifelong learning in terms of learners’ transforming frames of reference, in which reflection is the driving force [21].





공통 요소의 '절충 모델'

Towards an ‘eclectic model’ of common elements


Atkins and Murphy 는 성찰을 다음과 같이 밝혔다.

Atkins and Murphy [22] identified reflection as:

  • 1. 불편한 감정 또는 생각의 인지 ‘awareness of uncomfortable feelings and thoughts’, resulting in

  • 2. 감정과 지식의 분석 an ‘analysis of feelings and knowledge’, finally leading to

  • 3. 새로운 관점 ‘new perspectives’.


Korthagen의 ALACT 모델(’Action, Looking back on action, Awareness of essential aspects, Creating alternative methods of action, and Trial’)은 '인식하게 됨'이라는 첫 번째 phase를 두 단계로 기술한다: 일반적인 회상적 행동과 더 해석적인 행동.

Korthagen’s ALACT model (’Action, Looking back on action, Awareness of essential aspects, Creating alternative methods of action, and Trial’) [23] describes the first phase of ‘becoming aware’ in two steps: a general retrospective action and a more interpretive action.


이 두 가지 이론을 통합하면 첫 번째 phase가 나온다(경험 검토’reviewing an experience’). 두 가지 하부 구성요소

Integrating those two theories resulted in a first phase (’reviewing an experience’) with two subcomponents:

  • 1. 무슨 일이 일어났는지를 일반적으로 기술한다. generally describing what happened and

  • 2. 생각/감정/맥락적 요인을 고려하여 본질적 측면을 밝힌다. identifying essential aspects by considering both thoughts, feelings and contextual factors.



그러나 단순히 경험을 검토하는 것이 효과적 성찰로 이어지지는 않는다. Bourner에게 있어서, 경험에서 더 정보를 얻기 위해서interrogate 탐색 질문을  활용하는 것은 '성찰'과 '생각thinking'의 차이였고, 그는 '성찰적 탐구reflective inquiry'를 성찰의 중요한 요소로 보았다. 성찰에 대한 이러한 관점은 Mamede and Schmidt가 성찰적 행동을 '성찰에 대한 개방성openness to reflection'으로 본 것과 마찬가지이다. Bourner는 탐색질문searching question을 하는 것만 강조했고, 그것에 대한 답을 구하는 것을 강조하지는 않았다. Korthagen의 접근법은 Bourner의 접근법에다가 질문에 대답을 하는 과정으로서 '대안적 행동방법 만들기creating alternative methods of action'를 추가하여 이를 보완해준다. 이렇게 추가한 것은 Boud가 분석을 association, integration, validation and appropriation의 조합이라고 한 것과 잘 맞는다.

Just reviewing an experience, however, does not necessarily lead to effective reflection. For Bourner [24], using searching questions to interrogate an experience was the key difference between reflecting and thinking and he saw ‘reflective inquiry’ as a crucial component of reflection. This aspect of reflection was also represented in Mamede and Schmidt’s proposed structure of reflective practice as ‘openness to reflection’ [25]. Bourner only emphasised posing searching questions, however, not answering them. Korthagen’s approach supplements Bourner’s by contributing ‘creating alternative methods of action’ as a process of answering questions. This addition is compatible with Boud’s characterization of analysis as a combination of association, integration, validation and appropriation.

 

개인의 Frame of reference 안에서 이뤄지는 내면의 대화internal dialogue를 하는 것은 분석의 방향을 제시해주며, "감각적 인상을 거르는 가정과 기대의 구조"를 보여준다. 이러한 개인의 관점은 인식/인지/감정/성향(intentions, expectations and purposes) 등으로 구성되어 있으며, 우리의 감각 경험에 의미를 부여하는 맥락을 제공한다. 성찰의 첫 번째 phase가 경험을 묘사하고, 감정/생각/다른 측면을 인식하는 것이라면, 두 번째 phase는 성찰적 탐구reflective inquiry를 가지고 경험을 분석하여 개인의 독특한 Frame of reference 안에서 분석 프로세스를 trigger하는 것이다.

The internal dialogue that results is conducted within a ‘personal frame of reference’ that, according to Mezirow, directs the analysis and represents “the structure of assumptions and expectations through which we filter sense impressions” [21]. This personal perspective, made up of our perceptions, cognitions, feelings and dispositions (intentions, expectations and purposes), creates a context in which we give meaning to our sensory experiences. If the first phase of reflection, then, is identified as the description of an experience and the awareness of feelings, thoughts, and other essential aspects, our second phase of reflection is analysing experiences by reflective inquiry, which triggers a process of analysis within a person’s unique frame of reference.



Moon의 투입-산출 모델은 성찰이 '목적성을 가짐purposeful'을 강조한다. Atkins and Murphy에 의해 밝혀진 이 목적, 즉 세 번째 phase는 '새로운 관점의 발견identification of new perspectives'이다. Korthagen and Boud는 둘 다 추가 단계를 하나 더 넣었는데, 이 새로운 관점을 행동으로 옮겨서 새로운 성찰적 사이클의 시작점으로 삼는 것이다. Stockhausen의 성찰적 실천의 임상학습 나선 모델에서는 '재구성reconstruction' phase가 같은 기능을 한다. 이 phase에서 성찰적 통찰reflective insights은 미래의 행동을 위한 계획으로 전환된다. 이러한 행동이 미래의 성찰로 이끌어줄 수 있으므로, 경험에 대해서 성찰을 하는 것은 중요한 경험을 well-informed practical action으로 전환시켜주는 순환적 과정이다.

Moon’s input-outcome model emphasises that reflection is purposeful [17]. This purpose is identified by Atkins and Murphy in the third phase of reflection as the ‘identification of new perspectives’ [22]. Both Korthagen and Boud, however, included an additional stage - the conversion of those new perspectives into actions that are the starting point for new reflective cycles [16,23]. The ‘reconstruction phase’ of Stockhausen’s clinical learning spiral model of reflective practice among undergraduate nursing students in clinical practice settings had the same function [26]. During this phase, reflective insights were transformed into plans for future actions. Since those actions could lead to further reflections, reflecting on experiences was identified as a cyclic process that transformed significant experiences into deliberate, well informed practical actions.

 

이러한 insight를 가지고 '새로운 관점의 발견'을 성찰 프로세스의 성과로서 정의했으며, 이 새로운 관점은 '성찰을 통한 미래의 행동'으로 이끌어준다. 이 phase를 연구자에 따라서는 행동-전-성찰reflection-before-action이라고 부르며, 이번 절충 모델에서는 성찰을 순환적 과정으로 만듦으로서 포함되었다. 이 모델에서 성찰은 과거의 성찰에서부터 나온 학습목표로부터 정보를 받고, 발달 프로세스로서 성찰의 중요성을 강조한다. Korthagen and Stockhausen은 모두 성찰나선reflection spiral이라는 용어와 함께 이 프로세스를 강조하였으며, 이를 통해서 더 높은 차원의 이해/실천/학습으로 갈 수 있는 길이라고 보았다.

We incorporated those insights into the eclectic model by defining the outcome of a reflection process as the identification of new perspectives, which leads to future actions informed by reflection. Stockhausen also described a preparatory phase to establish objectives for a new clinical experience. This phase, which other authors have labelled as reflection-before-action [27,28], is incorporated into the eclectic model by representing reflection as a cyclical process. It allows reflection to be informed by learning goals arising from past reflections and stresses the importance of reflection as a developmental process. Both Korthagen and Stockhausen have highlighted this process with the term reflection spiral with each winding leading to a higher order of understanding, practice or learning [23,26].



요약

Reviewing the experience has two components:

  • ‘description of the experience as a whole’, and

  • ‘awareness of essential aspects based on the consideration of personal thoughts, feelings, and important contextual factors’.

Critical analysis starts with

  • ‘reflective inquiry’ - posing searching questions about an experience - and progresses to

  • ‘searching for answers’ while remaining aware of the ‘frame of reference’ within which the inquiry is being conducted.

Reflective outcome comprises the

  • ‘new perspectives’ resulting from phase two, and the

  • ‘translation of those perspectives into behaviour that has been informed by reflection’.



 

모델을 만드는 것부터 성찰 평가를 위한 지표 개발까지

From model building to developing indicators for assessment of reflection


 

성찰 프로세스의 적절성 지표

indicator of the adequacy of reflection processes (table 2).

 

 

 

 

 


2. 성찰 평가 해석을 위한 기준

2. Standards to interpret reflection assessment


Boud의 이론에는 여섯 가지 항목이 있다.

Boud’s theory had six items:

  • attending to feelings,

  • association,

  • integration,

  • validation,

  • appropriation and

  • outcome of reflection.

Mezirow는 학생을 다음과 같이 나눴다.

Mezirow, labelled students as:

  • 비-성찰자 non-reflectors (no evidence of reflective thinking),

  • 성찰자(경험을 학습 기회에 연관짓기) reflectors (evidence of relating experience to learning opportunities) and

  • 비판적 성찰자(성찰의 결과를 전문직적 행동에 통합하기) critical reflectors (evidence of integrating reflective outcomes in professional behaviour).


연구자들은 Boud의 카테고리가 written material에 적용하기 어렵다는 것을 알았고, Mezirow의 것보다 신뢰도가 떨어졌다. 그러나 Mezirow의 카테고리는 세 개밖에 없어서 사람들 간 변별력이 떨어졌다. Kember 등은 더 세분화된finer-tuned 코딩체계를 통해서 이 문제를 해결하고자 했다. 이들이 제시한 7개 카테고리는

The researchers found Boud’s categories hard to apply to written materials, resulting in less reliable coding than using Mezirow’s scheme. With only three categories, however, this latter scheme had a limited capacity to discriminate between people. Kember et al [31] addressed this issue by using a finer-tuned coding scheme based on the work of Mezirow. Their seven categories ranged from

  • 비성찰적 사고 unreflective thinking (

    • 습관적 행동 habitual action,

    • 자기반성 introspection and

    • 사려깊은 행동 thoughtful action) to

  • 성찰적 사고 reflective thinking (

    • 내용 성찰 content reflection,

    • 과정 성찰 process reflection,

    • 내용과 과정 성찰 content and process reflection and

    • 전제premise 성찰 premise reflection).


Boenink 는 성찰을 1~10까지 순위를 매겼고, 이는 학생이 쓴 성찰적 반응에서 드러난 관점의 숫자를 기준으로 매긴 것이었다. 그러나 이 척도는 성찰의 한 가지 측면밖에 보여주지 못한다는 한계가 있다.

Boenink et al [10] used an alternative approach, which ranked reflections from 1-10. Their scale was based on the number of perspectives students described in short written reflective reactions to a case vignette describing a challenging situation. The scale was limited, however, by measuring only one aspect of reflection (being aware of the frame of reference used).


 

Duke and Appleton 는 8개의 스킬을 평가했다. Grade를 줌으로써 이 연구자들은 성찰적 스킬에 대한 기준을 처음으로 설정했으나, 어떻게 level을 grade로 연결했는지를 밝히지는 않았다.

Duke and Appleton [29] developed a broader marking grid to score reflective reports. It assessed eight skills that support reflection, identified by a literature review, on five-level scales, ‘anchored’ and linked to a grade (A, B+, B, C and F). By providing grades, these authors were the first to set standards for reflective skills; however, the authors did not disclose how they linked the levels to grades.


Boyd 는 성찰적 판단을 King and Kitchener가 제안한 7개의 지적발달을 기준으로 코딩하였다.

Boyd [32] assessed reflective judgement using a coding scheme based on seven stages of intellectual development described by King and Kitchener:

  • 전-성찰적 사고 Pre-reflective thinking (stages 1-3);

  • 유사-성찰적 사고 quasi-reflective thinking (stages 4 and 5); and

  • 성찰적 사고 reflective thinking (stages 6 and 7).

Measurements made with the scale had an interrater reliability of 0.76 (Cronbach alpha).


접근법은 두 그룹으로 나눌 수 있다. 한 가지 접근법은 level에 따라 순위를 매기는 것이고, 다른 하나는 성찰 프로세스의 phase를 밝히는 것이다.

Based on their approach, coding schemes can be divided into two groups. A first approach ranks reflections according to levels. The other approach is the identification of phases in the reflection process, considering items of reviewing an experience, analysis and reflective outcome based on the used model of reflection [29,30].



연구 결과에서 공통된 것은 학생들이 성찰에 숙달된 수준이 매우 낮고, 따라서 발전의 여지가 충분하다는 것이다. 

their results share a common feature. Within their own scale, all studies demonstrate learners to have very limited mastery of reflection, indicating apparent room for improvement.



충분한 성찰의 수준을 갖추고 있는 의사를 구분할 수 있는 기준이 나올 때 까지는, 이해관계자들에게 어떤 성찰스킬이 필요한지 명확하게 설명해주고, 학습자들에게 최대한으로 그것을 개발하게끔 해야 할 것이다. 

Until standards have been formulated that can identify practitioners whose level of reflection is adequate, it seems reasonable to clarify to stakeholders (curriculum developers, students, practitioners, assessors) what reflection skills are expected and urge learners to develop them as far as possible.



성찰적 학습을 촉진하기 위해서 성찰능력을 개발하는 것과, 성찰의 빈도를 늘리는 것 사이에 균형이 필요하다.

In promoting reflective learning, however, a balance has to be struck between developing an ability to reflect and increasing the frequency of reflection.



3. 평가를 어렵게 만드는 요인들

3. Factors that complicate assessment


 

성찰의 메타인지적 성격 때문에, 평가를 위해서는 '성찰'을 '글written words'로 바꿔야 한다 (인터뷰 기록, 포트폴리오 성찰일지 등)

The metacognitive nature of reflection is an important complicating factor of reflection assessment [4]. Subjects are most often asked to ‘translate’ their reflections into written words, which are assessed against coding schemes or scoring grids [29-31,38-40]. Other suggested methods to ‘visualise’ reflections include the verbalisation in interviews [32,41,42], written responses to vignettes [10], or reflective writings in portfolios [34,43].

 

따라서 평가자는 성찰을 기록한 사람의 '선택적 묘사'에 대해서 그것이 과연 '적절한지'를 확인해보지도 못하고, 평가해야 한다. 이 때 (비)의도적 뒤늦은 깨달음, 자성적 능력의 부족 등으로 편향이 생길 수 있다. 기록된 것은 선택적으로 기록된 것이고, 불완전하다. 이런 측면에서 인터뷰가 장점이 있으나 여전히 주관적인 평가이며, 성찰활동에 대한 선택적 네러티브만 평가할 수 있다.

Assessors’ dependency on a person’s interpretative description is a serious threat to the validity of assessments of reflection because they have to judge selective descriptions without being able to verify their adequacy. Accordingly this approach fails to detect bias caused by a lack of (un)intentional hindsight and introspection ability [44,45], reflections being determined by the requirements of the assessment and selectivity and/or incompleteness of aspects they portray [44]. Interviews have the advantage that they can pose clarifying questions and monitor a reflecting person’s reactions, but they still leave assessors to ground their judgements in potentially subjective and selective narrative accounts of reflective activity.


두 가지 문제가 있다. 

There are two related problems in that.

  • 성찰을 기술하는 의미론적 스킬semantic skill이 효과적인 성찰에 중요한 부분이긴 하지만, 성찰을 글 또는 말로 바꿔야 하기에, 순수한 성찰스킬이 아닌 다른 것(글쓰기 기술, 말하기 기술)에 영향을 받을 수 있다.
    Although the semantic skill of describing reflections is considered integral to effective reflection [46], skills other than pure reflective skills are needed to turn reflection into writing and/or speech, which has a self-evident effect on reflective narratives [44].

  • 다른 문제는 평가를 위해서 written approach를 하는 것이 학습자가 선호하는 학습스타일과 맞지 않을 수 있다. 인터넷 세대의 학생들은 그룹-기반의 테크놀로지 멀티미디어 활동을 선호한다(블로그, SNS 등). 또한 창의적으로 멀티미디어를 사용하게 지지해주는 것이 성찰에 더 헌신하게끔 해주고, 더 효율적 성찰을 도와줄지도 모른다.
    The other problem lies in a decrease of motivation caused by the non-alignment between the written approach to assessment and a learner's preferred learning style [12]. Findings of Sandars and Homer [47] suggest the discrepancy between ‘net generation’ students' learning preference of group-based and technological multimedia activities (blogs, social networks, digital storytelling) and the text based approaches to reflective learning. Moreover, supporting learners to reflect with the creative use of multimedia will likely increase their commitment to reflect and stimulate even more efficient reflection [48].



자기기입식 설문: 정확한 introspect가 요구된다. Eva and Regehr 는 자기-평가적 접근만 활용하는 것은 부정확한 결과를 가져오며, introspection에 대한 삼각측량이 필요함을 강조함.

Self-assessment questionnaires have the advantage of circumventing indirect observation [13,14,49,50], but their requirement to introspect accurately introduces another validity threat [22,51], because it is then unclear if it is reflection or the ability to introspect that is being tested. Eva and Regehr [45] concluded that it is best not to build solely on self-assessment approaches as they tend to be inaccurate and they recommended triangulating introspection with other forms of feedback.


이러한 이유로 과연 성찰이 평가가능한 것인가라는 질문이 남는다. 두 가지 요소가 중요해보인다. 타당한 접근을 위해서 Bourner는 내용Content과 프로세스Process에 대한 평가가 서로 구분되어야 한다고 제안했다. 주관성 때문에 내용이 평가에 있어서 장애요인이 된다면, 프로세스는 더 일반적인 특징이 될 수 있다. 유사하게, Bourner는 관찰가능한 항목 (학습목표 기술 등)이 성찰능력을 보여주는 것으로 사용되어야 한다고 주장했다.

Since there are such serious validity threats, the question remains whether it is possible to assess reflection at all. Two elements appear to be important. In search for a valid approach, Bourner [24] suggested the content and the process of reflection should be viewed as two separate entities. While the content is a barrier to assessment because of its subjective nature, the process has a more general character. Similarly Bourner proposed that observable items, like the ability to formulate learning goals, should be used to demonstrate a person’s capacity for reflecting.







4. 성찰의 평가에 영향을 주는 내적, 외적 맥락요인

4. Internal and external contextual factors affecting reflection assessment


성찰에 대한 평가는 성찰능력 뿐 아니라 맥락적 요인에 의해서도 영향을 받는다. Motivation은 학습과 성취의 중요한 매개인자이다. 기대-가치 모델Expectation-value model은 과제에 대한 개인의 가치와 그 과제를 성공적으로 수행했을 때의 기대치가 과업 수행의 주요 예측인자라고 하였다. 이것을 성찰에 적용시키면, practice에 있어 성찰을 얼마나 중요하게 생각하느냐가 성찰에 얼마나 많은 시간과 노력을 쏟는지를 결정할 것이다. 성찰이 주는 보상에 대해 긍정적인 기대를 하지 않는 사람은 깊이있고 비판적인 성찰을 하지 않을 것이다. 이 motivational model은 성찰적 학습에 대한 개인의 과거 경험과 성찰 과정에 대한 개인의 이해가 motivation에 영향을 미치고, 궁극적으로 행동에도 영향을 준다고 설명한다. 따라서 성찰의 가치를 frame하고 의도한 결과를 얻기 위해서는 introductory session이 중요하다.

The results of assessments of reflection are influenced by contextual factors as well as people’s ability to reflect. Motivation is considered to be an important mediator of learning and achievement in medical education [55,56]. The expectancy-value model proposed by Wigfield and Eccles identifies the subjective value of a task to a person and their expectation of performing it successfully as main predictors of task performance [57]. Applied to reflection, it predicts that the perceived importance of reflection for (professional) practice will determine the time and effort a person is willing to invest in it; those who do not expect a positive return are unlikely to reflect profoundly and critically [4]. This motivational model also explains how personal factors like prior experience of reflective learning and a person’s understanding of the reflection process will influence motivation and consequently reflective behaviour. Hence introductory sessions are important to frame the value and intended outcomes of reflection [4].




과거에는 성찰을 지극히 개인적인 프로세스라고 보았다. 그러나 점차 사회적 상호작용에 의해서 촉진되는 프로세스라고 개념화하는 쪽으로 생각이 바뀌고 있다. supervisor와 동료들이 학습자에게 정기적으로 피드백을 주고, 생각을 자극하는 질문을 함으로써 성찰을 향상시킬 수 있다. 퍼실리테이터는 비-판단적 질문을 통해서 (학습자가) 그 상황을 더 탐구하고, 대안적 관점과 solution을 찾고, 당연하게 여겼던 가정이 무엇이었는지를 알게 해줄 수 있다. 더 나아가서 상황situations이 강력한 감정과 부정적인 생각을 불러일으켜서 효율적인 성찰을 방해할 수 있다. 퍼실리테이터는 이러한 강력한 감정들을 동화assimilate시키고 성찰 프로세스에 초점을 맞추게끔 도와줄 수 있다. 성찰적 사고, 감정, 정서 등을 완전히 탐구하기 위해서는 성찰을 하는 사람과 퍼실리테이터 사이에 안전한 환경이 마련되는 것이 중요하다. 다른 사람을 돕는다는 의미 외에도, 퍼실리테이터가 된다는 것은 자기자신의 성찰도 더 효과적으로 할 수 있게 됨을 뜻한다. 그러나 Schön은 학습자와 코치 사이의 관계가 균형잡히지 않았을 경우에, 그리고 맥락적 요인에 과도하게 영향을 받았을 경우에 성찰적 실천이 방해받을 수 있음을 경고하였으며, 방어적 태도defensiveness로 이어질 수 있다고 하였다. 맥락적 요인의 강조와 더불어 Schaub 등은 성찰적 학습을 장려하는 교사의 능력을 평가하는 척도를 개발하였다. 여기서는 학습자에게 교사가 self-insight를 지지하는지, 안전한 환경을 조성하는지, 자기-조절을 장려하는지 등을 물어본다.

Whereas reflection was traditionally conceived of as a strictly individual process, ideas are shifting towards conceptualising it as a process facilitated by social interaction [4,45]. A stimulating environment in which supervisors and peers give learners regular feedback and ask thought-provoking questions can, from that point of view, be expected to improve reflection. With non-judgemental questions, facilitators can encourage learners to fully explore the situation, to consider alternative perspectives and solutions, and to uncover taken-for-granted assumptions [3]. Furthermore, situations can provoke strong emotions and negative thoughts which could potentially form a barrier obstructing efficient reflection. A facilitator can help to assimilate these strong emotions and refocus on the reflection process [12,16]. To fully explore reflective thoughts, feelings and possible emotions, it is crucial to create a safe environment established between the reflecting person and the facilitator(s) [3]. Next to supporting others, being a facilitator is also reported as even more effective for a person’s own reflections [58]. Schön, however, warned that an unbalanced relationship between learner and coach and an undue influence of contextual factors could hinder reflective practice, as it could lead to defensiveness [18]. In line with this emphasis on contextual factors, Schaub et al developed a scale to assess teachers’ competence in encouraging reflective learning [59]. It asks learners to identify whether teachers support self-insights, create a safe environment, and encourage self-regulation.

 


요약

Summary


성찰은 메타인지적 과정이므로, 성찰기록, 포트폴리오, 면접 등의 간접적으로만 평가될 수 있다. 이 방법에서 평가자들은 보고받은 성찰이 진실인지 확인하는 것verify이 어렵다. 자기평가식 설문지가 널리 사용되고 있는데 이 역시 마찬가지의 타당도 문제를 가지고 있고, 근본적으로 자기-평가의 문제도 있다. 이러한 타당도 문제를 해결하기 위하여, 평가는 주관적으로 미화된 성찰의 내용이 아니라 성찰의 프로세스에 초점을 맞춰야 한다는 주장이 제기되고 있다. 추가적으로, 성찰이 그 성찰을 자극triggering 상황적 맥락과 엮여 있기 때문에, 이러한 triggering situation에 대한 객관적 정보를 고려하는 것이 평가자로 하여금 묘사된 성찰을 verify할 수 있게 해준다. 성찰 프로세스는 내적(동기, 기대, 과거경험)과 외적(평가의 성격, 퍼실리테이터의 존재, 평가에 대한 introduction) 요인에 영향을 받는다. 이러한 요인들에 대해서 인식하는 것이 효과적인 교육 전략을 개발하고, 평가결과를 해석하고, 성찰 프로세스에 대하여 이해를 높이는데 중요할 것이다.

Because reflection is a metacognitive process, it can only be assessed indirectly; through written reflections in vignettes or portfolios, or spoken expressions in interviews. These methods do not allow assessors to verify information related to the reflections reported, which is a serious limitation. The widespread use of self-assessment questionnaires shares both that validity problem and the inherent limitations of self-assessment. To counter these validity threats, it has been proposed that assessment should focus on the process rather than the subjectively coloured content of reflection. In addition, as reflections are intimately entangled with their triggering situational context, we suggest where possible to consider objective information about this triggering situation, allowing assessors to verify described reflections. The reflection process is influenced by internal (e.g. motivation, expectancy and prior experiences with reflection) and external factors (formative or summative character of assessment, presence of facilitators and introduction to the assessment). Awareness of these factors is important to develop effective educational strategies, to interpret assessment results and finally to increase understanding of the reflection process.



실용 가이드라인

practical guidelines


  • 1. Clearly define the concept of reflection and verify that all stakeholders (curriculum developers, students, assessors and supervisors) adopt the same definition and intended outcomes

  • 2. Be specific about what level of reflection skills is expected, identifying good and inadequate reflection and communicate this to all stakeholders

  • 3. Be aware of possible bias in self-assessment meth- ods, caused by inadequate ability to introspect. 

  • 4. Provide assessors with a perspective on the situation triggering the reflection to create the ability to verify the described reflections in an objective frame of additional information. 

  • 5. Consider and report contextual factors when asses- sing reflection and/or when engaging in reflective educa- tion in support to interpret the outcomes.




 


 


 







 2011 Dec 28;11:104. doi: 10.1186/1472-6920-11-104.

Factors confounding the assessment of reflection: a critical review.

Author information

  • 1Centre for Educational Development, Faculty of Medicine and Health Sciences, Ghent University, Ghent, Belgium. sebastiaan.koole@ugent.be

Abstract

BACKGROUND:

Reflection on experience is an increasingly critical part of professional development and lifelong learning. There is, however, continuing uncertainty about how best to put principle into practice, particularly as regards assessment. This article explores those uncertainties in order to find practical ways of assessing reflection.

DISCUSSION:

We critically review four problems: 1. Inconsistent definitions of reflection; 2. Lack of standards to determine (in)adequate reflection; 3. Factors that complicate assessment; 4. Internal and external contextual factors affecting the assessment of reflection.

SUMMARY:

To address the problem of inconsistency, we identified processes that were common to a number of widely quoted theories and synthesised a model, which yielded six indicators that could be used in assessment instruments. We arrived at the conclusion that, until further progress has been made in defining standards, assessment must depend on developing and communicating local consensus between stakeholders (students, practitioners, teachers, supervisors, curriculum developers) about what is expected in exercises and formal tests. Major factors that complicate assessment are the subjective nature of reflection's content and the dependency on descriptions by persons being assessed about their reflection process, without any objective means of verification. To counter these validity threats, we suggest that assessment should focus on generic process skills rather than the subjective content of reflection and where possible to consider objective information about the triggering situation to verify described reflections. Finally, internal and external contextual factors such as motivation, instruction, character of assessment (formative or summative) and the ability of individual learning environments to stimulate reflection should be considered.

PMID: 22204704, PMCID: PMC3268719


바다괴물 & 소용돌이: 의학교육에서 시험과 성찰 사이에서의 항해(Med Teach, 2015)

Sea monsters & whirlpools: Navigating between examination and reflection in medical education

Brian David Hodges







Introduction


Homer의 Odyssey에서 Odysseus는 Strait of Messina를 지나야 하는데 두 개의 큰 위협이 있다. 하나는 스킬라(큰 바위 옆에 사는 머리가 여섯, 발이 열두 개인 여자 괴물)이고 다른 하나는 카리브디스(Sicily 섬 앞바다의 큰 소용돌이, 배를 삼킴)이다. 이러한 비유를 보다 시적으로 바꿔보면 "진퇴양난에 빠진caught between a rock and a hard place" 것으로, 지금 의학교육이 당면한 상황을 우화적으로 보여준다.

In Homer’s epic poem, the Odyssey, Odysseus must travel through the Strait of Messina, passing between two great threats: the Scylla and Charybdis. The Scylla, on one side of the strait, is a multi-headed monster that plucks sailors off the ship and eats them. On the other side of the strait lies the Charybdis, a deadly, sucking whirlpool that is invisible to all who approach it. This metaphor, a more poetic version of ‘‘being caught between a rock and a hard place’’, is a useful allegory for the challenges facing medical education.


가장 우려되는 것은 "'책무성'이라는 담론에서 출발한 고부담의 외부 시험"과, 보다 최근에 강조되기 시작한 "자기주도성, 성찰과 같은 내적 동기부여에 대한 투자" 사이의 tension이다. 나는 이 두 가지 담론이 이론적/실제적으로 양립불가능할 수 있다고 주장한 바 있으나, 우리는 여전히 이 두 가지를 모두 추구하고자 한다. 어떻게 여기까지 온 것일까?

One of the most worrisome is the growing tension between high stakes, external examinations driven by a discourse of ‘‘accountability’’ and a more recent, but no less passionate, investment in internally motivated notions of ‘‘self-direction’’ and ‘‘reflection’’. I have argued that these two discourses may be theoretically and practically incompatible, yet we persist (Hodges 2007). How did we get here?


'시험'이라는 문화의 폭발적 성장

The explosion of a culture of examination



19세기에 의학은 '길드'였고, 역량은 '옳은 부류의right kind' 사람man이 되는 것과 연결되어 있었다(이 당시에는 여자 의사가 거의 없었다). 이 당시의 평가 시스템은 판단judgement 모델로서, 수련에서의 진급과 고용이 사부master의 승인에 따라 정해졌다.

In the nineteenth century, medicine was a guild and competence was linked to the notion of being the ‘‘right kind’’ of man (there were very few women doctors in the nineteenth century). The assessment system of the time was a judgement model in which progression in training and employment was based on approval of a master.

 

20세기에, 생명과학이 발전하고, 의과대학이 대학 안으로 들어가면서, '역량'이라는 개념은 character에 대한 종합적 판단이 아니라, 지식의 기반에 대한 것base of knowledge로 바뀌었다. '역량'을 보여주기 위해서 의과대학은 지필고사를 개발하였다. 20세기 초반 MCQ의 발명은 평가를 보다 효율적이고 쉽게 수행할 수 있게 해줬다.

In the twentieth century, the biological sciences flourished and medical schools were relocated into universities, heralding a shift in the concept of competence away from holistic judgement of ‘‘character’’ toward a rich base of knowledge. To confirm competence, medical schools developed written examinations. The invention of multiple choice questions in the early twentieth century made assessment more efficient and easier to administer.

 

20세기 중반, 또 한번의 패러다임 전환이 있었는데, '수행능력으로서의 역량'의 개념이 나타났다. OSCE나 시뮬레이션 같은 수행능력 기반 평가는 평가의 face를 바꾸었다. 의학교육자들은 이제 '밀러의 피라미드'라는 것을 활용했다. 동시에 의학교육자들은 평가의 책임을 외부로 확장시켰는데, 의과대학 자체 시험에 더하여 state 혹은 province 차원 전문직 조직의 시험이 추가되고, 그 다음에는 국가 수준의 고부담 표준화 시험까지 추가되었다. 그 결과 의사가 되려고 하는 사람들이 일생동안 치러야 하는 시험은 엄청나게 많아졌다. 이러한 이야기는 의학교육에만 국한된 것이 아니며, 20세기 자체가 서구 국가에서는 '시험의 폭발적 증가'의 시대였다.

By the mid-twentieth century, there was another paradigm shift and the notion of competence-as-performance was born. Performance-based assessments such as Objective Structured Clinical Exams (OSCEs) and simulations changed the face of assessment. Medical educators were climbing what is now called Miller’s pyramid: a competence ladder starting from a base of ‘‘knowing’’, rising to ‘‘knowing how’’, to ‘‘showing how’’ and finally to ‘‘doing’’ (Miller 1990). At the same time, medical educators distributed the responsibility for assessment outward, adding examinations given by state or provincial professional organizations to tests in medical schools, and then national, high stakes, standardized examinations. The net result was an enormous expansion of testing in the life of would-be physicians. This story is not limited to medical education, however; in the twentieth century there was an explosion of testing across Western countries.


 

무수한 시험들..

When we are born we have an Apgar Test and at the end of our lives, as we quietly slip away, someone will perform a Glasgow Coma Scale. In between we undergo all manner of elementary and high school tests, college exams, intelligence tests, driving tests, MCATs, SATs, LSATs and on and on. Our lives are punctuated by an endless series of written and performance assessments.

 

미셸 푸코Michel Foucault는 우리 사회를 '시험의 사회examined society'라 부르며, 시험examination은 고전시기classical age의 가장 눈부신, 그러나 가장 덜 연구된 발명품 중 하나라고 했다. 시험이 늘어난 것에는 여러 가지 장점도 있다.

Michel Foucault called ours an examined society and argued that the examination is one of the most brilliant, if least studied, inventions of the classical age (Foucault 1975/1995, pp. 184–185). There have been many benefits from the proliferation of testing. In medical education these include

  • 교육과 학습목표의 합치 greater alignment of teaching with learning objectives,

  • 대중에 대한 책무성 강화 more accountability to the public and

  • 학생에 대한 피드백 가능성 the possibility of better feedback to learners (although formative feedback tends to become rare as the stakes of testing get higher).

  • 새로운 평가도구의 개발 Further, educators have developed new testing tools and can assess a wider range of competencies.

  • 역량 프레임워크의 개발을 위한 긴밀한 협조 Finally, the rise of assessment has gone hand in glove with the elaboration of new competence frameworks such as the Canadian CanMEDS roles and the American ACGME competencies (Whitehead et al. 2013).

 

그러나 Hanson 은 이 모든 시험이 단순이 역량을 측정하는 것 이상의 기능을 하고 있다고 하며, 이 시험이 우리를 발명inventing us 한다고 했다.

However Hanson (1993) is among those who have cautioned that all those tests are doing more than just measuring competence: they are also inventing us (p. 210).


우리의 시험이 학생을 바람직한 방향으로 이끌고 있는가?

The question that drives my research is ‘‘Are our assessment methods shaping students in a desirable way?’’


'시험에 뭐가 나오나요'라고만 묻는 학생들에게 시험은 그냥 그 자체로 존재하는 것이며 다른 교육과의 관계는 아무런 의미가 없다. 내 동료들은 학생들이 '뭐가 시험에 나오나요'라고 물었다는 이야기를 들으면 놀라기보다는 모두 동의한다는듯 고개를 끄덕인다. 우리 모두는 시험에 따르는 안타까운 부작용을 잘 알고 있다. 시험은 종종 학생이 학습과는 잘 맞지 않는 행동을 하게 한다. 나는 이 부작용 - '의도하지 않은'이라 부르는 - 효과에 관심이 있다.

For this student the examination existed for its own sake only – devoid of any meaningful relationship to the pedagogy that preceded it. Far from shocking my colleagues, recounting this anecdote never fails to invoke a concerted nodding of heads. We all know about these unfortunate adverse effects of testing: examinations drive behaviours that are often at odds with learning. I am most interested in these adverse – let us call them ‘‘unintended’’ – effects of assessment.


우리는 왜 이런 부작용을 감수하고 있는가? 파블로프식 보상-반응 행동을 만들어내는 시험을 일부러 설계하려는 교사는 상상하기 어렵다. 그러나 모든 의학교육자들은, 우리의 강도 높은 '시험' 문화가 많은 것을 가져다주기는 했지만, 이러한 의도하지 않은 효과가 드물기는커녕 실제로는 만연해 있다는 것을 안다.

Why do we tolerate these effects? I cannot imagine any teacher setting out deliberately to create an examination that creates the Pavlovian reward-response effect illustrated by my student. And yet all medical educators know that whilst our intense testing culture has brought many gains, these unintended effects are not rare, but actually endemic.


Scylla와 Charybdis의 비유로 돌아가자. 시험의 과도한 사용이라는 Scylla는 무시하면 화를 입게 될 위험이다. 이는 학생이 내적 동기와 자기주도성을 기르는 대신 외부의 강제/보상에 반응하도록 밀어붙여 동기를 꺾는다. 우리는 평생학습의 중요성을 끊임없이 강조하면서도, 정작 교육환경은 철저히 외적 동기와 관리감독 중심으로 만들어서, 보건의료인이 스스로의 학습을 guide해 나갈 동력도 기술도 기르지 못할 위험에 놓이게 한다. 많은 교사들은 너무 많은 시험이 학생들을 정작 배웠으면 하는 것들로부터 멀어지게 한다고 개탄하지만, 매우 최근까지도 그 해결책은 시험 도구를 어설프게 손보는 것이었지 패러다임 전환을 꾀하는 것은 아니었다.

Returning to the Scylla and Charybdis metaphor, the Scylla of overusing examinations is a danger we ignore at our peril. It diminishes student motivation by pushing them to respond to external reinforcement/reward rather than fostering internally motivated, self-direction. We speak constantly of the centrality of lifelong learning but then construct an educational envir- onment that is so externally motivated, so surveillance- oriented, that health professionals risk developing neither the drive nor the skills to guide their own learning. Many teachers decry the fact that too many examinations drive students away from the things we wish them to learn, but until quite recently the solutions amounted to tinkering with examination tools rather than fomenting a paradigm shift.


과도하게 많은 시험이라는 Scylla 를 두려워하여 의학교육자들은 반대편을 보기 시작했다. 조용한 건너편 바다속에는 짙은 안개 속에 "자기-성찰"이라는 약속의 땅이 있었다. 나는 내 교육자 동료들이 "학생들이 학습에 대한 열정을 가지기를 바라는" 열망을 이해하며 나도 그 열망을 공유하고 있다. 개인의 발달 동력이 모두 내부로부터 오는 이 이상적인 세계에는 외부적 힘이 존재하지 않는다.

Fearing the Scylla of over-examination some medical educators are looking to the other shore. What is over there? Off in the distant tranquil sea, shrouded in a gentle mist, is the promise of ‘‘self-reflection’’. I understand and share the desire of my educator colleagues who dream of a world in which students have an enduring inner passion for learning. In this paradise, no external forces are needed because learning and personal development will be driven from within.



'역량으로서의 성찰'의 등장

The rise of a discourse of competence-as-reflection


성찰이란 무엇인가?

What is reflection?

  • Dewey (1933) defined it as ‘‘active, persistent and careful consideration of any belief or supposed form of knowledge’’ (p. 9);

  • Boud et al. (1985) as ‘‘intellectual and affective activities in which individuals engage to explore their experiences in order to lead to a new understanding and appreciation’’ (p. 19); and

  • Wikipedia as the ‘‘capacity of humans to exercise introspection and willingness to learn more about our fundamental nature, purpose and essence’’ (Wikipedia 2014).

이 모든 정의는 약간씩만 달라 보이며, 연구자들은 성찰에 대한 정의가 분산되어있고, 성찰을 특징짓는 행동 역시 다양하다고 지적한다. 어떤 사람들은 taxonomy를 시도했다. 그리고 이러한 정의definition의 문제는 성찰을 가르치려고 할 때 더 극명해진다.

Worryingly, these definitions all sound a little different and several authors have pointed out that dispersion of definitions and diversity of practices characterize ‘‘reflection’’. Some have attempted taxonomies (Kinsella (2012) presents a thoughtful, epistemologically-oriented categorization). The problem becomes even more apparent when one is asked to teach reflection.




교육자들이 얼마나 '성찰 행위의 목적이 무엇인가'를 명확히 하지 않고 성찰이라는 행위를 받아들였는지를 생각하면 정말 놀랍다. 이 이유는 '성찰'이 모든 상처를 낫게 하는 연고라고 봤기 때문이다.

It is striking the degree to which educators have embraced reflection as a practice without clearly articulating to what end the practices of reflection are being engaged. This may be because reflection has become a kind of generic salve to heal all wounds:


의학교육에서 성찰이란, 정말 많고 이질적인 문제들에 대한 해결책으로 존재해왔다. 이렇게 분산된 형태의 성찰행위와 맞물려, 성찰은 무언가 긍정적이고 좋은 것이라는 가정이 거의 보편적으로 퍼져 있으며, 의도하지 않은 효과의 가능성은 거의 언급조차 되지 않는다.

Reflection, in the way it is often taken up in medical education, seems to stand in as the solution for so many different and disparate challenges that I have begun to wonder if we have any shared idea about it at all. Coupled with this dispersion of practices is an almost universal assumption that reflection is something positive, something good, with hardly a nod to the possibility of unintended effects.


Ng은 서로 다른 이론가들과 학문분야들이 성찰행위reflective practice를 다양한 방식으로 이론화하고 적용해왔으며, 이 때문에 새로 이 분야에 들어온 사람이 방대한 문헌 속에서 길을 찾기 혼란스럽다고 했다. 이러한 혼란의 위험은 성찰과 성찰행위가 기각되거나, 오해되거나, 지나치게 단순화될 수 있다는 것이다.

Ng (2012) has argued that different theorists and disciplines have theorized and applied reflective practice in a variety of ways, making it confusing for newcomers to navigate their way through the large body of literature. The danger in this confusion is the possibility for reflection and reflective practice to be dismissed, misinterpreted or oversimplified (p. 119).


그러나 성찰이 보건의료직에게 어떻게 관련되고 적용되어야 하는지를 고민한 학자들도 있다. 이들은 이론적 토대에 초점을 두었다.

Yet there are scholars who have given considerable thought to the idea of reflection and its relevance/application in health professions (Nelson & Purkis 2004; Kinsella 2008, 2012; Mann et al. 2009; Nelson 2012; Ng 2012). These scholars focus on the theoretical underpinnings of reflection, drawing on theorists such as Dewey (1933), Habermas (1971), Kolb (1984) and Schön (1983, 1987).


의학교육에서 무비판적으로 성찰을 활용할 때 있을 수 있는 문제 중 하나의 예를 들자면, 간호교육학에서 성찰을 활용할 때 Habermas의 '성찰의 주요 기능은 우세한 사고와 존재의 방식에서 해방되는 것이다'라는 주장을 무시했다는 것이다. 성찰의 목적이 권력구조에서 해방되어 현재상태에 도전하는 것이라면, "의무적 성찰"을 만들어서 점수를 주고 인증서를 주는 것은 이해할 수 없는 일이다.

These scholars have also pointed to the pitfalls of uncritical use of reflection in medical education. To take but one example, Nelson (2012) writes that the use of reflection in nursing education (largely reflective diaries for practice assessment) has ignored Habermas’ (1971) notion that a main function of reflection is emancipation from dominant ways of thinking and being. Yet if the purpose of reflection is to get free of power structures and to challenge the status quo, creating ‘‘mandatory reflection’’ for grading and certification is incomprehensible.

 

 



메타인지로서의 성찰

Reflection as metacognition


인지심리학에서 발달한 것으로, 기본이 되는 생각은 '우리는 자신의 인지 프로세스를 인식할 수 있다'이다.

Reflection as metacognition is a concept that arose in cognitive psychology and is based on the idea that we can become aware of our own cognitive processes (Flavell 1979).

  • Think aloud 프로토콜: 자신의 생각을  말로 표현해서, 어떻게 자기가 생각하는지를 더 잘 이해하고, 불일치/예상밖의 변화/빈틈을 더 잘 이해함
    Practices associated with reflection as metacognition are variations on the think aloud protocol developed by cognitive scientists for research. The notion is that by articulating one’s thoughts (usually to another person, but possibly to the self) one is able to see more clearly the nature of how one thinks and by extension some of the inconsistencies, vagaries, traps and holes in our cognitive processes.

  • Medical error와 관계됨.
    For this reason, metacognition has been associated with medical error and popularized in books such as How Doctors Think (Groopman & Prichard 2007).

  • 자신의 사고를 관찰함으로써 어떻게 자신의 감정이 우리의 인지에 영향을 주는가를 알 수 있음
     Observing our own thoughts also opens a window onto the way our emotions affect our cognitions



메타인지 기반 교육의 문제는?

Could there be any problems with education based on meta-cognition?


  • 자신의 사고 과정을 보고하거나 회상하라는 요청을 받으면, 실제로는 활용할 수 없었던 팩트나 인지(과정)를 무의식적으로 '지어낼' 수 있음. 따라서 인지 보고의 '진실성'에 지나치게 의존하면 문제가 될 수 있음(특히 메타인지나 'think aloud'가 평가에 사용될 경우).
    First, within the cognitive paradigm there is a well-known phenomenon that subjects asked to report or recall their thinking processes will unwittingly ‘‘invent’’ facts or cognitions that they think they used in their decisions but were not even available to them (Koole et al. 2011). Thus relying too heavily on the veracity of cognitions could present a problem, particularly if metacognition and ‘‘think aloud’’ were used for assessment.

  • 학생들이 지식형성과 지식활용의 사회-문화적 차원을 놓칠 수 있음. 지나치게 내면inward에 집중하는 것은 비용이 따른다.
    A second concern is that a cognitive focus might distract students from the socio-cultural dimensions of know- ledge formation and use. Kinsella (2012), for example, has cautioned that there is a cost to exclusively turning inward – an individual may become overly focused on their own thoughts and lose perspective on the importance of external, socio- cultural dimensions of knowledge (p. 43).


인지과정을 보고할 때 '진실성'을 지나치게 강조하는 것의 위험성을 인지하고, 학습자가 사회-문화적 시스템에 대한 시선을 놓치지 않게 해야 함.

The caution is to be aware of the slipperiness of ‘‘veracity’’ in reporting cognition and the need for vigilance that learners do not lose sight of the social and cultural systems in which they and their thoughts are embedded.




마음챙김으로서의 성찰

Reflection as mindfulness


'마음챙김mindfulness'란 ‘‘active, open attention to the present’’ and when one can ‘‘observe your thoughts and your feelings from a distance, without judging them, good or bad’’ 이다. 불교에 뿌리를 두고 있지만, 대부분의 종교는 어떤 식으로는 성찰적 기도와 명상을 강조하며, 일상적인 집착에서 벗어나 삶의 더 큰 관점을 보게 한다. 임상연구에서 mindfulness는 불안/스트레스/우울/심리증상에 효과가 있는 것으로 드러났다. 의학교육자들은 번아웃burnout때문에 관심을 가짐.

Mindfulness is a state of ‘‘active, open attention to the present’’ and when one can ‘‘observe your thoughts and your feelings from a distance, without judging them, good or bad’’ (Psychology Today 2014). Although mindfulness has roots in Buddhism, most religions promote some form of reflective prayer or meditation that helps shift away from quotidian preoccupations toward a larger perspective on life. In clinical research, mindfulness has been shown to be effective in reducing anxiety, distress, depression and other psychological symptoms. This is interesting to medical educators because of growing appreciation that our field is beset by burnout (Fralick & Flegel 2014).



치료적 활용 측면에서는 상대적으로 benign하다고 여겨지지만, 이 방법을 쓰는 임상가들은 과거의 트라우마나 이인증depersonalization이 나타나는지 경계하며, 의학교육자들도 마찬가지여야 할 것이다. 그러나 임상적인 이슈를 제쳐두면, 더 껄끄러운 질문은 내면을 지향하는 비-판단적 접근이 '평가'와 어떻게 합치될 수 있느냐이다. 스스로에 대해 비-판단적이 되는 법을 배우는 것은 평가의 기풍ethos와 철학적으로 부합하기 어렵다. 왜냐하면 '평가'란 그 정의상 '판단'이며, 의학교육에서는 종종 상당히 가혹하고 고부담의 판단이기 때문이다. mindfulness와 같은 비-판단적 형태의 성찰을 활용하는 교육자는, 그러한 교육법이 애초에 평가 대상이 되어야 하는지, 혹은 (많은 학교가 학생 웰니스/지원과 학사 문제를 분리하듯이) 그 교육법과 평가가 분리decouple되어야 하는지를 생각해봐야 한다.

While considered relatively benign in therapeutic uses, clinicians using the method are vigilant for the emergence of past traumas and of depersonalization (Booth 2014) and medical educators should be as well. But clinical issues aside, the more prickly question that arises is how an inward looking, non-judgemental approach, aligns with assessment. Learning to be non-judgemental about oneself is difficult to square philosophically with the ethos of assessment, which by definition is a judgment – often a rather harsh and high stakes one in medical education. It is important for the educator using non-judgmental forms of reflection, such as mindfulness, to consider whether pedagogy should be assessed at all or whether pedagogy and assessment should be decoupled (Koole et al. 2011) as many schools do with student wellness/support and academic matters.



정신분석으로서의 성찰

Reflection as psychoanalysis


소크라테스는 '반성(성찰)하지 않는 삶'은 살 가치가 없다고 했다. 한 세기에 걸친 정신분석은, 자신의 inner life를 성찰하고 자기 및 타인과의 관계에 깔린 (종종 무의식적인) 정신역동을 드러내는 것이 치유적 가치를 지닌 일이라고 보아왔다. 오늘날 많은 사람들이 정신역동적 formulation(꿈 분석, 어린 시절의 관계에서 비롯된 감정을 현재의 관계로 전이하는 것, 자기 형성 과정의 결핍과 트라우마 등)의 중요성을 믿으며, 이러한 역동을 드러내고 해석하여 바람직한 효과를 내도록 설계된 심리치료와 정신분석의 거대한 산업이 존재한다. 예술과 인문학은 프로이트, 융, 그리고 그 후예들이 퍼뜨린 정신분석 개념에 크게 의존한다. 임상적 적용에 대한 공식 훈련을 받은 것은 정신분석가들뿐이겠지만, 정신역동 개념이 대중문화에 널리 퍼지면서 많은 교사들이 그것을 교실로 가져오고 싶어한다. 실제로 정신역동적 해석은 삶의 여정, 타인과의 관계, 자신과의 관계를 이해하도록 돕는 수단으로서, 정체성 형성의 한가운데에 있는 의학 학습자medical learner에게 가치가 있다.

Socrates apparently said that the ‘‘unexamined life’’ is not worth living. A century of psychoanalysis has embraced the notion that reflecting on one’s inner life and uncovering the (often unconscious) psychodynamics of one’s relationship to the self and to others, is a valuable pursuit with healing properties. Today, many people believe in the importance of psycho- dynamic formulations (dream analysis, transference of emo- tions from earlier relations onto present relationships, deficits and traumas of the formation of self, etc.) and there is a whole industry of psychotherapies and psychoanalytic approaches designed to achieve felicitous effects by uncovering and interpreting these dynamics. The arts and humanities draw heavily on psychoanalytic concepts promulgated by Freud, Jung and their descendants. While only practicing psychoana- lysts are likely to have had formal training in the clinical applications, psychodynamic concepts are widespread in popular culture and many teachers will be tempted to bring them into the classroom. Indeed psychodynamic interpret- ations, which served as a means to help people to understand life’s journey, their relationship to others and to the self are valuable for medical learners who are deep in the midst of their identity formation.



그러나 흥미롭게도, 프로이드는 정신분석을 하는 것을 경계했는데 "분석자가 얼마나 다른 사람에 대하여 교사로서, 모델로서, 이상으로서 행동하고 싶든 간에, 그리고 대상자를 자신의 이미지대로 만들고 싶든 간에, 분석자가 잊지 말아야 할 것은 그것이 분석적 관계에서 자신의 역할이 아니라는 사실이다"라고 했다. mindfulness와 마찬가지로 정신분석의 임상적 활용이 평가에서의 교육적 활용과 잘 맞지 않을 수 있다. 정신과의사로서, 나는 introspection을 매우 중요하다고 생각하지만, 누군가의 분석가/치료자가 되는 것과 누군가의 교사가 되는 것 사이의 경계가 흐릿해지는 것을 우려한다. introspection을 shaping하는 것은, 특히 누군가가 다른 사람의 진로 궤적에 영향을 줄 수 있는 평가적 권력을 가진 경우 아주 복잡한 정신역동을 초래하며, 성찰의 개념을 뒤죽박죽으로 만들 수 있다.

Interestingly however, Freud (1940/1969) himself cautioned that in the practice of psychoanalysis, ‘‘However much the analyst may be tempted to act as teacher, model, and ideal to other people and to make men in his own image, he should not forget that that is not his task in the analytic relationship’’ (p. 50). As with mindfulness, the clinical uses of psychoanalysis may not mix well with the pedagogical and the evaluative. As a psychiatrist myself, while I greatly value introspection, I worry about blurring the role of being someone’s analyst/therapist and someone’s teacher. Shaping introspection, particularly when one has power (through assessment for example) over the career trajectory of students creates complex psychodynamics and muddles the notion of reflection.


정신역동을 교육의 프레임 안으로 가져오려는 모델도 실제로 있다. 예를 들어 Balint group은 전 세계에서 현직 의사들이 환자에 대한 자신의 반응을 이해하도록 돕는 데 활용되어 왔다. 그러나 이러한 접근법은 정교한 training과 facilitation을 필요로 한다.

There are indeed models that bring psychodynamics into an educational frame, for example Balint groups have been used around the world to help practicing physicians under- stand their reactions to patients (Benson & Magraith 2005). But this approach requires sophisticated training and facilitation.



또한 심리분석가들은 잘 알텐데, 지나치게 inward focus하는 것은 narcissistic self-preoccupation을 초래할 수 있다.

Further, as psychoanalysts well know (and echoing Kinsella’s (2012) critique), too much inward focus can also lead to narcissistic self-preoccupation.


Kinsella 와 Ng은 모두 'self'라는 접두사를 'reflection'에 가져다 붙이는 문제를 제기한 바 있다. 이들은 '자기-성찰'이라는 용어가 의학교육에서 사용될 때, '비판적 성찰'과 'reflexivity'라는 개념으로부터 오히려 멀어지게 한다고 주장했다. 이 후자의 개념들은 개개인이 권력/문화/시스템적 불평등(차별과 같은)의 사회적 구성을 고려하는 것이며, 이러한 것은 '자기'를 우선시하는 경우에 강조되기 어렵다.

Kinsella (2012) and Ng et al. (in press) both highlight the problem of adding the prefix ‘‘self’’ to ‘‘reflection’’ and argue that the adoption of the term ‘‘self-reflection’’ in medical education moves us away from concepts of ‘‘critical reflection’’ and ‘‘reflexivity’’. These latter approaches, which allow individuals to consider social constructions of power, culture and systematic inequities such as discrimination (following Nelson’s (2012) call to rediscover the Habermasian critical/emancipatory functions of reflection) are not very well emphasized when prioritizing the ‘‘self’’.



고해confession으로서의 성찰

Reflection as confession


불교 전통에서의 명상처럼, '고해confession'는 가톨릭 신앙에서 중요하다. Catholic Online은 고해를 하러 가기 전에 "지난번 고해성사sacramental confession 이후 자신이 지은 대죄와 소죄를 돌아보아야review 한다"고 설명한다. 명상과 달리 고해에는 다른 사람이 개입한다. 따라서 "만약 도움이 필요하면, 특히 한동안 고해를 하지 않았다면, 사제에게 요청하기만 하면 되며, 그러면 그가 바람직한 고해의 단계들을 함께 '걸어가며' 도와줄 것이다"라고 말한다. 교육에서의 평가를 고해에 비유하는 이론가들도 있다. 예를 들어, 나는 최근 의과대학생들에게 "지금은 성찰할 시간이다. 종이를 꺼내서 이번 주에 겪은 경험을 적어라. 프로페셔널리즘 문제일 수도 있고, 경험했던 어떤 문제, 혹은 보거나 가담했던 과오lapse일 수도 있다. 성찰한 내용을 쓰고, 다 쓰면 채점을 위해 제출해라. 다음 주에 돌려주겠다"라고 말하는 교사를 본 적이 있다. 의학교육자들이 실제 '고해' 관행을 그대로 받아들이고 있다는 뜻은 아니지만, 나는 Fejes와 Dahlstedt의 주장에 동의한다. 이들은 푸코를 따라, 서구의 범죄자 처벌/교화 시스템과 교육이 고해적 관행, 즉 고해와 속죄를 통해 종교적/도덕적 규범 위반을 사면absolution받는 관행에 크게 의존한다고 본다. 한 가지 사례는, 의학교육의 프로페셔널리즘 운동에서 학생들에게 자신의 프로페셔널리즘 '과오lapse'를 보고(혹은 고해)하게 하고, 어쩌면 그에 대해 속죄atone하게 하는 관행이다.

Like meditation in Buddhist tradition, confession is important for those of Catholic faith. Catholic Online explains that before you go to confession, ‘‘you should make a review of your mortal and venal sins since your last sacramental confession’’ (Catholic on Line 2014). Unlike meditation, confession involves another person. Thus, ‘‘if you need some help, especially if you’ve been away for some time [you should] simply ask the priest and he will help you by ‘walking’ you through the steps to make a good confession’’ (Catholic on Line 2014). There are theorists who have compared what we do in educational assessment to confession (Fejes & Dahlstedt 2013). For example, I recently observed a teacher say to medical students, ‘‘It’s reflection time. Please take a piece of paper, write down an experience you’ve had this week – it could be a professionalism issue, a problem you’ve experi- enced, a lapse you saw or were part of. Write down your reflections and when you’re done, please turn them in for marking. I’ll have them back to you for next week’’. Though I do not mean to imply that medical educators are taking up the actual practice of confession, I agree with Fejes and Dahlstedt who argue, after Foucault (1975/1995), that western systems of criminal punishment/reform as well as education draw significantly on confessional practices: absolution of the transgression of religious or moral codes through confession and atonement. An example is the practice among medical education’s professionalism movement to have students report (or confess) and then perhaps atone for their professionalism ‘‘lapses’’ (Hodges et al. 2009).


고해적 접근이 다른 형태의 성찰과 가장 다른 점은, 외부의 판단자 또는 '고해 신부confessor'가 하는 중추적 역할이다. 예를 들어 Frankford 등은 이렇게 썼다. "성찰이 모든 사람이 기본적으로 갖춘 기술이라고 가정해서는 안 된다. 이 프로세스는 물론 혼자서도 할 수 있지만, 퍼실리테이터나 동료와 함께 하는 성찰은 성찰이 자각적conscious으로 이루어지도록 보장함으로써 그 과정을 더 강력하게 해준다. 퍼실리테이터나 동료와 디브리핑debrief하는 것은 '정확성'과 '객관성'을 체크해줄 수 있다."

What differentiates confessional approaches from other forms of reflection is the pivotal role of the external judge or ‘‘confessor’’. Frankford et al. (2000) have written, for example, ‘‘it should not be assumed that reflection is a natural part of everyone’s skill set. This process can be done alone, of course, but reflection with facilitators, or peers, strengthens the process by ensuring that reflection is conscious. Debriefing with facilitators or peers can ‘‘provide a check’’ of accuracy and objectivity’’ (p. 712, emphasis added).



성찰의 '정확성과 객관성'을 걱정해야 한다는 사실 자체가 중요한 점을 드러낸다. 고해자가 고해의 단계를 '함께 걸어가도록' 돕는 사제처럼, 성찰의 '정확성과 객관성'을 이끌고 다듬는 의학교육자는 '고해 신부confessor'의 속성을 띠게 된다. 이것은 매우 흥미로운 현상인데, 바로 이 고해적 속성 속에서 성찰이 한 바퀴를 돌아 다시 외부의 시험examination과 만나기 때문이다. 성찰을 shape/judge/grade하려 한다면, 우리는 일부 의학교육자들이 벗어나고자 하는 바로 그 20세기의 유산인 '외부 통제 소재locus of control'의 개념으로 되돌아가는 셈이다.

That we should be concerned with the ‘‘accuracy and objectivity’’ of reflection reveals something important. Like the priest who will help the penitent ‘‘walk through’’ confession, the medical educator who guides and shapes the ‘‘accuracy and objectivity’’ of reflection may take on qualities of a ‘‘confessor’’. This is a very interesting phenomenon because it is in this confessional quality that reflection comes back, full circle, to meet external examination. If we are to shape, judge and grade reflections, we are returning to a concept of external locus of control, precisely the twentieth century inheritance that some medical educators are trying to shake off.



결론

Conclusions



아마도 가장 큰 과제는 성찰의 실천을 평가와 조화시키려는 노력일 것이다. 실제로 일부 교육자들은 성찰이 애초에 평가되어야 하는 것이냐고 의문을 표한다. Murdoch-Eaton & Sandars는 성찰에 지나치게 도구적인 관점으로 접근하면, 의미 있는 통찰이 아니라 그저 하나의 의식ritual을 만들어낼 뿐이라고 지적한다. Ng은 "성찰의 본질과 목적은, 그것이 지나치게 prescriptive한 방식으로 경험될 때, 그리고 비판적 대화critical dialogue 대신 공식적 평가의 대상이 될 때 훼손될 수 있다"고 했다. 우리는 완전히 합치시킬 수 없는 두 개의 패러다임 사이에서 찢겨 있는 것으로 보인다. 메타인지/마음챙김/정신역동적 접근은 reflective pedagogy의 좋은 기반이 될 수 있다. 그러나 이들은 '시험'과는 잘 맞지 않는다. 고해적confessional 관행이 그 (미심쩍은) 타협안일지 모른다.

Perhaps our biggest challenge is trying to square practices of reflection with assessment. Indeed some educators ask if reflection should be assessed at all (Sumsion & Fleet 1996; Stewart & Richardson 2000). Murdoch-Eaton & Sandars (2014) caution that adopting an overly instrumental approach to reflection results in the creation of rituals more than any meaningful insight. Ng et al. (in press) has argued that, ‘‘The very essence and purpose of reflection may be compromised when it is experienced in an overly prescriptive manner, and when it is subjected to formal evaluation, instead of critical dialogue’’ (p. 1). We are, it seems, torn between two paradigms that we cannot fully align. Metacognition, mindful- ness and psychodynamic approaches may be a good basis on which to base reflective pedagogy. But they do not align well with examination. Confessional practices may be the (dubious) compromise.


바다 여행의 비유로 돌아오면, 분명 지나친 외부 평가의 시대였던 곳으로부터 항로를 돌리는 지금, 우리 앞에 놓여 있을지 모르는 위험에 대해 신중하게(그야말로 성찰적으로) 생각하기를 바란다. 즉 '시험에 나오는 것'에만 움직이는 학생을 만드는 것에서 벗어나는 항로를 그리면서도, 무비판적이고 무이론적인 '자기'-성찰이라는 보이지 않는 소용돌이 속으로 맹목적으로 곤두박질치지 않기를 바란다.

To return to the metaphor of a sea-journey I hope that as we steer a course away from what was certainly a time of excessive external assessment, we are thoughtful (indeed reflective) about the dangers that may lie in front of us; that in charting a course away from forming students who are driven only by ‘‘what is on the exam’’ that we do not lurch headlong and blindly into an invisible whirlpool of uncritical, un- theorized ‘‘self’’-reflection.



Koole S, Dornan T, Aper L, Scherpbier A, Valcke M, Cohen-Schotanus J, Derese A. 2011. Factors confounding the assessment of reflection: A critical review. BMC Med Educ 11:1–9.









 2015 Mar;37(3):261-6. doi: 10.3109/0142159X.2014.993601. Epub 2014 Dec 19.

Sea monsters & whirlpools: Navigating between examination and reflection in medical education.

Author information

  • 1University of Toronto , Canada .

Abstract

The 16th International Ottawa Conference/Canadian Conference on Medical Education (2014) featured a keynote deconstructing the rising discourse of competence-as-reflection in medical education. This paper, an elaborated version of the presentation, is an investigation into the theoretical roots of the diverse forms of reflective practice that are being employed by medical educators. It also raises questions about the degree to which any of these practices is compatible with assessment.

PMID:
 
25523011
 
[PubMed - indexed for MEDLINE]


지금까지의 평가에 문제가 있었다면? : Pumpkin Plan의 필요성 (Med Teach, 2015)

Have we got assessment wrong? Thoughts from the Ottawa Conference and the need for a Pumpkin Plan

Ronald M. Harden







Whitehead 등은 북미에서 psychometric한 접근법이 평가를 지배해왔지만, 이것이 CanMEDS를 비롯한 역량 프레임워크에서 정의된 대로의 역량 평가와는 쉽게 align되지 않는다고 말한다. 이들은 신발 한 켤레의 비유를 들었다. "우리는 모두 한때 잘 나갔던, 가장 좋아하는 신발이나 코트, 혹은 머그잔이 있다. 우리는 그것이 허물어져가도 계속 사용하는데, 그 물건에 대한 애착이 실용적, 미적 한계를 초월하기 때문이다. 그러나 조만간 우리는 어쩔 수 없이 그 물건이 망가졌다는 사실을 인정하고, 그것을 치워놓고 새로 시작하게 된다." 이들은 이렇게 주장한다. "평가에 관한 현재의 모델은 우리의 사고와 실천을 발전시키는 데 크게 기여했다. 그러나 의료행위의 본질과 의료인 평가에 대한 새로운 이해emerging understanding에 비추어 보면, 현재의 모델은 점점 낡아threadbare지고 있다."

Whitehead et al. (2015) argue that while a psychometric approach has dominated assessment in North America, this does not align easy with assessment of competency as defined in CanMEDS and other competency frameworks. They use the metaphor of a pair of shoes ...‘‘We have all had a favourite pair of shoes, a coat, or perhaps a mug that has seen better days. We carry on using it even though it is falling apart as our fondness transcends its practical or aesthetic limitations. However, sooner or later, we are forced to admit its decayed state and we set it aside and start afresh.’’ They argue that ...‘‘Current models of assessment have served us well in advancing our thinking and practices, but they are becoming increasingly threadbare in light of our emerging understanding of the nature of medical practice and of the assessment of medical practitioners.’’


Hodges 는 '고부담 외부 시험이 학생들에게 가하는 압박'과 '내부적인 동기부여(자기주도성, 성찰)에 대한 더 많은 요구' 사이의 tension을 논한 바 있다. Hodges는 성찰에 대한 네 가지 접근법을 밝혔다.

Hodges (2015) high- lights in his paper the growing tension between the pressure on students from high stakes external examinations and the need for more involvement in internally motivated notions of ‘‘self-direction’’ and ‘‘reflection’’. Hodges identifies four approaches to reflection:

  • metacognition,

  • mindfulness,

  • psycho- analysis and

  • confession.

 

Hodges는 각 접근법에 대해 그 일차적 활동, 그 practice가 어떻게 더 좋은 의사를 만드는가, 교사의 역할은 무엇인가, 그리고 (중요하게는) 있을 수 있는 의도하지 않은 결과를 기술했다. Hodges는 이어서, 의학교육자로서 우리가 당면한 주요 과제는 '성찰의 실천practice of reflection'을 '평가'와 조화시키는 것이라고 제안한다. 우리는 '시험에 무엇이 나오나요?'라는 질문에만 움직이는 학생을 만드는 것에서 벗어나되, '무비판적이고 무이론적인 자기-성찰'이라는 보이지 않는 소용돌이 속으로 무턱대고 곤두박질치지 않도록 항로를 계획해야 한다.

He describes the primary activity in each approach, how the practice will create a better doctor, the role of the teacher and, importantly, possible unintended consequences. Hodges continues by suggesting that a major challenge we face as medical educators is to square practices of reflection with assessment. We need to chart ‘‘...a course away from forming students who are driven only by ‘What is in the exam?’ to lurching headlong and blindly into an invisible whirlpool of un-critical, un-theorized ‘self’-reflection’’ (Hodges 2015).


2010년의 14th Ottawa에서 최선의 평가란 무엇인가에 대한 합의문을 개발하였다.

The development of consensus statements on current best practices in assessment was a feature of the 14th Ottawa held in Miami, USA in 2010. Topics addressed included

  • ‘‘Criteria for Good Assessment’’ (Norcini et al. 2011),

  • ‘‘Assessment for Selection’’ (Prideaux et al. 2011),

  • ‘‘Research in Assessment’’ (Schuwirth et al. 2011),

  • ‘‘Assessment of Professionalism’’ (Hodges et al. 2011),

  • ‘‘Technology Enabled Assessment’’ (Amin et al. 2011), and

  • ‘‘Performance Assessment’’ (Boursicot et al. 2011).


Mike Michalowicz 는 그의 책, The Pumpkin Plan에서 성공을 위해서는 다음을 인식해야 한다고 주장했다.

Mike Michalowicz (2012) in his book, The Pumpkin Plan, argues that for continuing success we need to recognise that


"모든 것에는 철이 있는 법이다. 호박은 영원히 유지되지 않는다. 심지어 가장 크고 우수한 호박도 죽는다. 궁극적으로 우리는 그 큰 호박에서 씨를 빼서 새로 심고, 새로 시작해야 한다. 모든 것이 다 그러할 것이다"

there is a season for everything. He describes that pumpkins do not last forever, even giant and great pumpkins die. Eventually we need to extract the seed from a giant pumpkin and use it to plant a new one and start again. The same is true, he suggests, in any endeavour.

 


 

Hodges BD. 2015. Sea monsters & whirlpools: Navigating between examination and reflection in medical education. Med Teach 37(3):261–266.


Whitehead CR, Kuper A, Hodges B, Ellaway R. 2015. Conceptual and practical challenges in the assessment of physician competencies. Med Teach 37(3):245–251.





 2015 Mar;37(3):209-10. doi: 10.3109/0142159X.2015.1010497.

Have we got assessment wrong? Thoughts from the Ottawa Conference and the need for a Pumpkin Plan.

Author information

  • 1AMEE , Dundee , UK.
PMID:
 
25651987
 
[PubMed - indexed for MEDLINE]


Psychometric Instruments의 신뢰도와 타당도에 대한 현재의 개념: 이론과 적용 (Am J Med, 2006)

Current Concepts in Validity and Reliability for Psychometric Instruments: Theory and Application

David A. Cook, MD, MHPE, Thomas J. Beckman, MD, FACP

Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minn.





 

'타당도validity'라는 용어는 "어떤 평가의 결과로부터 도출한 결론(해석)이 충분한 근거well-grounded를 가지고 정당화가능한 정도, 즉 관련되고relevant 의미있는meaningful 정도"를 말한다. 그러나 psychometric 평가의 결과로부터 타당도를 평가하는 데 필요한 스킬은 의학논문을 평가하거나 검사실 검사 결과를 해석하는 데 쓰이는 스킬과는 다르다. 최근의 임상교육평가에 관한 리뷰에서, 우리는 타당도와 신뢰도가 자주 잘못 이해되고 잘못 적용됨을 알게 되었다. 또한 우리는 탄탄한 방법론을 활용한 연구조차 일차 결과를 뒷받침하는 폭넓은 타당도 근거 스펙트럼을 제시하지 못하는 경우가 많다는 것을 발견하였다.

The term “validity” refers to the degree to which the conclusions (interpretations) derived from the results of any assessment are “well-grounded or justifiable; being at once relevant and meaningful.”10 However, the skills required to assess the validity of results from psycho- metric assessments are different than the skills used in appraising the medical literature11 or interpreting the results of laboratory tests.12 In a recent review of clinical teaching assessment, we found that validity and reliability were fre- quently misunderstood and misapplied.13 We also have noted that research studies with sound methods often fail to present a broad spectrum of validity evidence supporting the primary outcome.6,14-16

 

psychometric 평가의 결과에 대한 타당도를 평가하는 방법은 심리학과 교육평가로부터 derive되었다.

Methods for evaluating the validity of results from psy- chometric assessments derive from theories of psychology and educational assessment,17,18



평가도구 점수의 타당도, 구인, 유의미한 해석

VALIDITY, CONSTRUCTS, AND MEANINGFUL INTERPRETATION OF INSTRUMENT SCORES

 

타당도는 "검사의 의도한 목적에 대하여 근거나 이론이 검사점수의 해석을 지지하는 정도"이며, 다른 말로는 타당도란 어떤 검사의 결과가 특정한 목적을 위해서 해석되었을 때 얼마나 그것을 신뢰할 수 있는가이다.

Validity refers to “the degree to which evidence and theory sup- port the interpretations of test scores entailed by the proposed uses of tests.”19 In other words, validity describes how well one can legitimately trust the results of a test as interpreted for a specific purpose.


많은 도구들이 물리적 양(키, 혈압, 혈중 소듐 등)을 측정한다. 그러한 결과의 의미를 해석하는 것은 복잡하지 않다. 반면, 환자의 증상, 학생의 지식, 의사의 태도 등에 대한 평가의 결과는 그것 자체에 내재된 의미가 없다. 오히려, 그 평가들은 "무형의 추상적 개념과 원칙의 집합"이라 할 수 있는 기저에 깔린 구인을 측정하기 위한 것이다. 모든 psychometric 평가의 결과는 그것이 평가하고자 하는 구인이라는 맥락에 대해서만 의미(타당도)를 가질 수 있다.

Many instruments measure a physical quantity such as height, blood pressure, or serum sodium. Interpreting the meaning of such results is straightforward.20 In contrast, results from assessments of patient symptoms, student knowledge, or physician attitudes have no inherent mean- ing. Rather, they attempt to measure an underlying con- struct, an “intangible collection of abstract concepts and principles.”21 The results of any psychometric assessment have meaning (validity) only in the context of the construct they purport to assess.17


타당도는 평가도구의 특성이 아니라, 그 도구로부터 얻은 점수와 그 해석에 대한 것이다.

Validity is not a property of the instrument, but of the instru- ment’s scores and their interpreta- tions.17,19


평가도구는 변하지 않고 평가의 해석만 달라질 수 있다.

Note that the instruments in these examples did not change—only the score interpretations.


타당도란 '추론'의 특성이며 '도구'의 특성이 아니기 때문에, 타당도는 각각의 의도한 해석에 따라서 establish 되어야 한다. 유사하게, 환자의 증상에 대한 척도가 특정 연구 조건이나 고도로 선택된 환자집단에서 타당하다고 하더라도, 일반적인 임상진료에 활용되기 위해서는 추가적인 평가를 거쳐야 한다.

Because validity is a property of inferences, not instruments, validity must be established for each intended interpretation. Similarly, a patient symp- tom scale whose scores provided valid inferences under research study conditions or in highly selected patients may need further evaluation before use in a typical clinical practice.


 

타당도에 대한 개념적 접근

A Conceptual Approach to Validity

 

우리는 종종 "validated instruments"라는 문구를 접한다. 이러한 관점을 틀린 것이다. 첫째로, 우리는 타당도가 inference의 특징이며 instrument의 특징이 아님을 기억해야 한다. 둘째, 해석의 타당도는 언제나 정도의 차이에 대한 것이다. 어떤 점수가 내재된 구인을 '더 정확히' 혹은 '덜 정확히' 반영할 수는 있지만, 절대로 '완벽하게' 반영할 수는 없다. 

We often read about “validated instruments.” This view is inaccurate. First, we must remember that validity is a property of the inference, not the instrument. Second, the validity of interpretations is always a matter of degree. An instrument’s scores will reflect the underlying construct more accurately or less accurately but never perfectly. 

 

타당도를 바라보는 가장 정확한 관점은 '가설' 혹은 '해석적 주장'으로, 제시된 추론을 지지하기 위한 근거가 수집되어야 한다. Downing이 기술한 바와 같이

Validity is best viewed as a hypothesis or “interpretive argument” for which evidence is collected in support of proposed inferences.17,23,24 As Downing states,

 

“Validity requires an evidentiary chain which clearly links the inter- pretation of . . . scores . . . to a network of theory, hypoth- eses, and logic which are presented to support or refute the reasonableness of the desired interpretations.”21

 

모든 가설-기반 연구에서 가설이 명확하게 기술되어야 하고, 가장 문제가 될 소지가 있는 가정을 평가하기 위한 근거가 수집되어야 하고, 가설이 비판적으로 검토되어야 하듯이, 검사와 근거 수집의 새로운 사이클은 "해석적 주장의 모든 추론이 그럴듯해질plausible 때까지, 혹은 해석적 주장이 기각될 때까지" 이어진다. 그러나 타당도는 절대로 '증명될' 수 없다.

As with any hypothesis-driven research, the hypothesis is clearly stated, evidence is collected to evaluate the most problem- atic assumptions, and the hypothesis is critically reviewed, leading to a new cycle of tests and evidence “until all inferences in the interpretive argument are plausible, or the interpretive argument is rejected.”25 However, validity can never be proven.

 

 

타당도 근거의 출처

Sources of Validity Evidence


주어진 해석을 지지하기 위한 근거는 다양한 서로 다른 출처로부터 수집되어야 하며, 하나의 출처에서 강력한 근거를 얻었다고 해서 다른 출처로부터의 근거가 필요하지 않은 것이 아니다. 근거를 수집함에 있어서, 타당도에 대한 두 가지 주요 위협을 특히 고려해야 한다. 하나는 내용 영역의 부적절한 표집sampling(구인 과소대표성)이며, 다른 하나는 점수에 비-무작위적으로 영향을 주는 요인들(편향, 구인-비관련 변동)이다.

Evidence should be sought from several different sources to support any given interpretation, and strong evidence from one source does not obviate the need to seek evidence from other sources. While accruing evidence, one should specifically consider two threats to validity: inadequate sampling of the content domain (construct underrepresentation) and factors exerting nonrandom influence on scores (bias, or construct-irrelevant variance).24,27

 

내용

Content.    

 

검사의 내용과 측정하고자 하는 구인의 관계. 내용은 truth(구인), whole truth(구인), 그리고 nothing but truth(구인)을 반영해야 한다. 따라서 다음을 봐야 한다.

Content evidence involves evaluating the “relationship between a test’s content and the construct it is intended to measure.”19 The content should represent the truth (construct), the whole truth (construct), and nothing but the truth (construct). Thus, we look at

  • 구인의 정의 the construct definition,
  • 도구의 의도한 목적 the instrument’s intended purpose,
  • 문항 개발과 선택 프로세스 the process for developing and selecting items (the individual questions, prompts, or cases comprising the instrument),
  • 각 문항의 워딩 the wording of individual items, and
  • 문항 개발자와 감수자의 자격 the qualifications of item writers and reviewers.

 

Content evidence is often presented as a detailed description of steps taken to ensure that the items represent the construct.28


 

응답 프로세스

Response Process.


시험을 치르는 사람 혹은 평가자의 행동 및 사고 프로세스를 리뷰하는 것(응답 프로세스)은 "실제로 행해지는 수행의 세세한 특성과 구인이 얼마나 서로 잘 합치fit되는가"를 보여준다. 예컨대, 교육자는 "진단추론을 평가하기 위한 시험을 치르는 학생이 실제로 고등-차원의 사고 프로세스를 쓰는가?"를 궁금해할 수 있다. 이 문제에 대한 접근법으로, 일군의 학생들에게 질문에 답을 하면서 "think aloud"하게 할 수 있다. 만약 도구가 한 사람이 다른 사람의 수행을 평가하는 방식이라면, 응답 프로세스를 지지하는 타당도 근거로 평가자가 적절한 훈련을 받았음을 보여줄 수 있다. 자료의 보안, 점수 산출 및 보고 방법 등도 이 범주의 근거에 들어간다.

Reviewing the actions and thought processes of test takers or observers (response process) can illuminate the “fit between the construct and the detailed nature of performance . . . actually engaged in.”19 For example, educators might ask, “Do students taking a test intended to assess diagnostic reasoning actually invoke higher-order thinking processes?” They could approach this problem by asking a group of students to “think aloud” as they answer questions. If an instrument requires one person to rate the performance of another, evidence supporting response process might show that raters have been properly trained. Data security and methods for scoring and reporting results also constitute evidence for this category.21

 

 

내적 구조

Internal Structure.


신뢰도와 요인분석이 보통 내적 구조를 보여준다.

Reliability29,30 (discussed below and in Table 3) and factor analysis31,32 data are generally considered evidence of internal structure.21,31 Scores intended to measure a single construct should yield homogenous results, whereas scores intended to measure multiple constructs should demonstrate heterogenous responses in a pattern predicted by the constructs.
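
아래는 위 문단의 내적 일관성internal consistency 개념을 보여주기 위한 최소한의 Python 스케치이다. 본문(Cook & Beckman)이 제시한 절차가 아니라 이해를 돕기 위한 예시이며, 함수 이름 cronbach_alpha와 응답 데이터는 모두 가상의 것이다.

def cronbach_alpha(item_scores):
    # item_scores: one list of item scores per respondent (hypothetical data below)
    n_items = len(item_scores[0])

    def variance(xs):
        m = sum(xs) / len(xs)
        return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)

    item_vars = [variance([resp[i] for resp in item_scores]) for i in range(n_items)]
    total_var = variance([sum(resp) for resp in item_scores])
    # alpha = k/(k-1) * (1 - sum of item variances / variance of total scores)
    return (n_items / (n_items - 1)) * (1 - sum(item_vars) / total_var)

scores = [  # five respondents, four items rated 1-5 (hypothetical)
    [4, 5, 4, 4],
    [2, 2, 3, 2],
    [5, 5, 4, 5],
    [3, 3, 3, 2],
    [4, 4, 5, 4],
]
print(round(cronbach_alpha(scores), 2))  # about 0.94: homogeneous items -> high internal consistency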

 

유사한 수행이 기대되는 하위그룹 간에 특정 문항에 대한 체계적systematic 응답 차이가 있다면(differential item functioning이라고 불린다) 이는 내적 구조에 문제가 있음을 시사하며, 반대로 예측한 차이가 확인되면 이 범주의 근거를 지지해준다. 예컨대, 다른 응답과 무관하게 히스패닉은 어떤 문항에 일관되게 한 방향으로 답하고 코카시안은 다른 방향으로 답한다면, (이러한 차이가 예측된 것이 아닌 이상) 이는 의도한 해석의 타당도를 약화시킨다. 이는 총점에서의 하위집단 간 차이와는 다른 것으로, 총점의 차이는 다음에 논의할 '다른 변인과의 관련성'에 해당한다.

Furthermore, systematic variation in responses to specific items among subgroups who were expected to perform similarly (termed “differen- tial item functioning”) suggests a flaw in internal structure, whereas confirmation of predicted differences provides sup- porting evidence in this category.19 For example, if Hispan- ics consistently answer a question one way and Caucasians answer another way, regardless of other responses, this will weaken (or support, if this was expected) the validity of intended interpretations. This contrasts with subgroup vari- ations in total score, which reflect relations to other vari- ables as discussed next.


 

다른 변인과의 관련성

Relations to Other Variables.

 

상관이 기대되는 다른 도구의 점수나 결과와 상관관계가 있는 것(그리고 상관이 없어야 할 곳에서는 없는 것)은 내재된 구인과 일치하는 해석을 지지해준다. 예컨대 양성 전립선비대증의 증상 심각도를 평가하기 위한 설문 점수와 급성 요폐acute urinary retention 발생률 사이의 상관관계는 의도한 추론의 타당도를 지지해준다.

Correlation with scores from another instrument or outcome for which correlation would be expected, or lack of correlation where it would not, supports interpretation consistent with the underlying construct.18,33 For example, correlation between scores from a questionnaire designed to assess the severity of benign prostatic hypertrophy and the incidence of acute urinary retention would support the validity of the intended inferences.

 

 

결과

Consequences.


평가의 의도한 결과 혹은 의도하지 않은 결과를 평가하는 것은 이전에는 인식하지 못했던 비-타당성의 원인을 드러낼 수도 있다. 예컨대, 교수 평가에서 남자 교수가 지속적으로 여자 교수보다 낮은 점수를 받는다면, 이것은 예상치 못한 bias의 원인을 반영하는 것일 수 있다. 혹은 남자가 실제로 덜 효과적인 교사임을 의미하는 것일 수도 있다. 따라서 결과에 대한 근거는, 그것이 진정으로 추론의 타당도에 영향을 미친다고 결론짓기 전에, 관찰된 바를 원래의 구인과 연결지어 보아야 한다.

Evaluating intended or unintended conse- quences of an assessment can reveal previously unnoticed sources of invalidity. For example, if a teaching assessment shows that male instructors are consistently rated lower than females it could represent a source of unexpected bias. It could also mean that males are less effective teachers. Ev- idence of consequences thus requires a link relating the observations back to the original construct before it can truly be said to influence the validity of inferences.

 

결과consequences에 대한 근거를 평가하는 또 다른 방법은 의도한 결과가 달성되었는지, 그리고 의도하지 않은 효과는 회피되었는지를 보는 것이다. 앞의 예에서, 만약 높은 평가를 받은 교수들이 낮은 점수를 받은 교수들을 배척ostracize했다면, 이 예상치 못한 부정적 결과는 분명 점수의 의미, 즉 타당도에 영향을 미친다. 반대로, 낮은 점수를 받은 교수들에 대한 remediation이 수행능력 향상으로 이어졌다면, 이는 해당 해석의 타당도를 지지하는 근거가 된다. 마지막으로, 합격선 같은 점수 기준threshold을 결정하는 방법 역시 이 분류에 들어간다. 결과에 대한 근거는 타당도 근거 중 가장 논쟁적인 범주이며, 임상교육 평가도구에 대한 우리의 최근 리뷰에서 가장 드물게 보고된 근거 출처였다.

Another way to assess evidence of consequences is to explore whether desired results have been achieved and unintended effects avoided. In the example just cited, if highly rated faculty ostracized those with lower scores, this unexpected negative outcome would certainly affect the meaning of the scores and thus their validity.17 On the other hand, if reme- diation of faculty with lower scores led to improved perfor- mance, it would support the validity of these interpretations. Finally, the method used to determine score thresholds (eg, pass/fail cut scores or classification of symptom severity as low, moderate, or high) also falls under this category.21 Evidence of consequences is the most controversial cate- gory of validity evidence and was the least reported evi- dence source in our recent review of instruments used to assess clinical teaching.34


 

근거의 통합

Integrating the Evidence.


"의도한" 또는 "예측한"이란 단어가 자주 사용되었다. 각 근거는 내재된(이론적)구인으로 돌아가서 어떤 관계가 있는지 보아야 하고, 사전에 기술된 관계를 확인하기 위해서 사용될 때 가장 강력하다. 만약 근거가 애초의 타당도 주장을 지지하지 않는다면, 그 주장은 "기각되거나 해석 and/or 측정 절차를 조정하여 향상될 수 있"고, 이후에 그 주장은 다시 평가되어야 한다. 실제로 타당도 평가는 testing과 revision의 지속적 사이클이다.

The words “intended” and “pre- dicted” are used frequently in the above paragraphs. Each line of evidence relates back to the underlying (theoretical) construct and will be most powerful when used to confirm relationships stated a priori.17,25 If evidence does not sup- port the original validity argument, the argument “may be rejected, or it may be improved by adjusting the interpreta- tion and/or the measurement procedure”25 after which the argument must be evaluated anew. Indeed, validity evalua- tion is an ongoing cycle of testing and revision.17,31,35


얼마나 어떤 근거가 필요한지는 도구의 의도한 목적에 따라 달라진다.

The amount of evidence necessary will vary according to the proposed uses of the instrument.

  • 고부담 결정(예: high-stakes board certification)에는 high degree of confidence가 필요하고, 덜 중요한 상황에서는 lower degree of confidence로도 충분하다.
  • Some instrument types will rely more heavily on certain categories of validity evidence than others.21
  • observer ratings: internal structure characterized by high inter-rater agreement.
  • multiple-choice exams: content evidence.

안면타당도는?

What About Face Validity?


"안면타당도"라는 표현이 많은 의미를 가지고 있지만, 이는 실질적 검증을 거치지 않고, 그저 겉보기의 타당도를 기술하기 위해 사용된다. 이는 자동차의 속력을 외관만 가지고 추정하는 것과 비슷하다. 이러한 판단은 단순한 찍기이다. Content evidence와 안면타당도는 표면적으로 유사해 보이지만 실제로는 크게 다르다. Content evidence는 systematic, documented 접근법이나, 안면타당도는 그 (평가)도구가 어떻게 생겼는지만 보고 판단하는 것이다.

Although the expression “face validity” has many mean- ings, it is usually used to describe the appearance of validity in the absence of empirical testing. This is akin to estimating the speed of a car based on its outward appearance or the structural integrity of a building based on a view from the curb. Such judgments amount to mere guesswork. The con- cepts of content evidence and face validity bear superficial resemblance but are in fact quite different. Whereas content evidence represents a systematic and documented approach to ensure that the instrument assesses the desired construct, face validity bases judgment on the appearance of the in- strument.

 

Downing and Haladyna는 "표면적 퀄리티가 평가의 본질적 특성을 보여줄 수도 있지만, '타당해 보이는 것'은 '타당도'가 아니다"라고 했다. DeVellis는 안면타당도에 대하여 다음을 우려했다.

Downing and Haladyna note, “Superficial quali- ties . . . may represent an essential characteristic of the assessment, but . . . the appearance of validity is not valid- ity.”27 DeVellis37 cites additional concerns about face va- lidity, including

  • 판단의 오류 가능성 fallibility of judgments based on appear- ance,
  • 개발자와 사용자 간의 인식 차이 differing perceptions among developers and users, and
  • 외관으로부터 의도를 추론하는 것이 역효과를 낳을 수 있음 instances in which inferring intent from appearance might be counterproductive.

 

우리는 이 용어를 사용하지 않을 것을 권장한다.

For these reasons, we discourage use of this term.



신뢰도: 타당도 추론의 필요조건이지만 충분조건은 아님

RELIABILITY: NECESSARY, BUT NOT SUFFICIENT, FOR VALID INFERENCES


신뢰도는 한 평가에서 다른 평가로 점수가 재현되는 정도, 즉 일관성에 관한 것이다. 신뢰도는 타당도의 필요조건이지만 충분조건은 아니다. Psychometric 도구의 점수도 마찬가지로 unreliability에 취약하지만, 한 가지 중요한 차이가 있다. 한 개인으로부터 다수의 측정결과를 얻는 것이 비실용적이거나 심지어 불가능한 경우가 많다는 것이다. 따라서 도구를 실제로 사용하기 전에 점수의 신뢰도를 뒷받침하는 충분한 근거가 축적되어야 한다.

Reliability refers to the reproducibility or consistency of scores from one assessment to another.19 Reliability is a necessary, but not sufficient, component of validity.21,29 Scores from psychometric instru- ments are just as susceptible to unreliability, but with one crucial distinction: It is often impractical or even impossible to obtain multiple measurements in a single individual. Thus, it is essential that ample evidence be accumulated to establish the reliability of scores before using an instrument in practice.


신뢰도에 대한 측정과 카테고리화에는 다양한 방법이 있다.

There are numerous ways to categorize and measure reliability (Table 3).30,37-41 We would expect that scores measuring a single construct would correlate highly (high internal consistency). If internal consistency is low, it raises the possibility that the scores are, in fact, measuring more than one construct. Reproducibility over time (test-retest), between different versions of an instrument (parallel forms), and between raters (inter-rater) are other measures of reliability.
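
예컨대 test-retest 신뢰도는 같은 사람들이 두 시점에 받은 점수 간의 Pearson 상관으로 추정할 수 있다. 아래는 그 계산을 보여주는 최소한의 스케치이며, 점수 데이터는 가상의 것이다.

import math

time1 = [12, 18, 9, 15, 20, 11]   # hypothetical scores at time 1
time2 = [13, 17, 10, 14, 21, 12]  # same people, retested later

def pearson(x, y):
    mx, my = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

print(round(pearson(time1, time2), 2))  # close to 1.0 -> scores reproducible over time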


Generalizability studies use analysis of variance to quantify the contribution of each error source to the overall error (unreliability) of the scores, just as analysis of variance does in clinical research.
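
아래는 사람(피평가자) × 평가자가 완전히 교차된fully crossed 설계를 가정한 G-study의 최소한의 스케치로, 이원분산분석의 평균제곱으로부터 분산 성분을 추정하는 아이디어만 보여준다. 평정 데이터는 가상의 것이며, 실제 G-study의 설계와 공식은 이보다 다양할 수 있다.

ratings = [  # rows = persons (trainees), columns = raters; hypothetical
    [4, 5, 4],
    [2, 3, 2],
    [5, 5, 4],
    [3, 3, 3],
]
n_p, n_r = len(ratings), len(ratings[0])

grand = sum(sum(row) for row in ratings) / (n_p * n_r)
person_means = [sum(row) / n_r for row in ratings]
rater_means = [sum(ratings[p][r] for p in range(n_p)) / n_p for r in range(n_r)]

# Mean squares from the two-way ANOVA decomposition (one observation per cell)
ss_p = n_r * sum((m - grand) ** 2 for m in person_means)
ss_r = n_p * sum((m - grand) ** 2 for m in rater_means)
ss_tot = sum((ratings[p][r] - grand) ** 2 for p in range(n_p) for r in range(n_r))
ms_p = ss_p / (n_p - 1)
ms_r = ss_r / (n_r - 1)
ms_res = (ss_tot - ss_p - ss_r) / ((n_p - 1) * (n_r - 1))

# Variance components: person (signal), rater, and residual (error)
var_res = ms_res
var_p = max((ms_p - ms_res) / n_r, 0.0)
var_r = max((ms_r - ms_res) / n_p, 0.0)

# Relative G coefficient for the mean rating across n_r raters
g_coef = var_p / (var_p + var_res / n_r)
print(round(var_p, 2), round(var_r, 2), round(var_res, 2), round(g_coef, 2))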


신뢰도는 근거의 한 가지 형태일 뿐이다. 또한 신뢰도는 타당도와 마찬가지로 점수의 특성이지, 평가도구 자체의 특성이 아니라는 점도 중요하다.

Reli- ability constitutes only one form of evidence. It is also important to note that reliability, like validity, is a property of the score and not the instrument itself.30


 

도구 선택에 있어서 실제적 적용 (사례와 예시)

PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN SELECTING AN INSTRUMENT


AUA-SI

The American Urological Association Symptom Index1 (AUA-SI, also known as the International Prostate Symptom Score)


Content evidence for AUA-SI scores is abundant and fully supportive.1 The instrument authors reviewed both published and unpublished sources to develop an initial item pool that reflected the desired content domain. Word choice, time frame, and response set were carefully defined. Items were deleted or modified after pilot testing.


Some response process evidence is available. Patient debriefing revealed little ambiguity in wording, except for one question that was subsequently modified.1 Scores from self-administration or interview are similar.49


Internal structure is supported by good to excellent in- ternal consistency and test-retest reliability,1,49,50 although not all studies confirm this.51 Factor analysis confirms two theorized subscales.50,52


In regard to relations to other variables, AUA-SI scores distinguished patients with clinical benign prostatic hyper- trophy from young healthy controls,1 correlated with other indices of benign prostatic hypertrophy symptoms,53 and improved after prostatectomy.54 Another study found that patients with a score decrease of 3 points felt slightly im- proved.51 However, a study found no significant association between scores and urinary peak flow or postvoid residual.55


Evidence of consequences is minimal. Thresholds for mild, moderate, and severe symptoms were developed by comparing scores with global symptom ratings,1 suggesting that such classifications are meaningful. One study56 found that 81% of patients with mild symptoms did not require therapy over 2 years, again supporting the meaning (valid- ity) of these scores.




PRACTICAL APPLICATION OF VALIDITY CONCEPTS IN DEVELOPING AN INSTRUMENT



The first step in developing any instrument is to iden- tify the construct and corresponding content.

  • In our ex- ample we could look at residency program objectives and other published objectives such as Accreditation Com- mittee for Graduate Medical Education competencies,57 search the literature on qualifications of ideal physicians, or interview faculty and residents.
  • We also should search the literature for previously published instruments, which might be used verbatim or adapted.
  • From the themes (constructs) identified we would develop a blueprint to guide creation of individual questions.
  • Questions would ideally be written by faculty trained in question writing and then checked for clarity by other faculty.


For response process,

  • we would ensure that the re- sponse format is familiar to faculty, or if not (eg, if we use computer-based forms), that faculty have a chance to practice with the new format.
  • Faculty should receive training in both learner assessment in general and our form specifically, with the opportunity to ask questions.
  • We would ensure security measures and accurate scoring methods.
  • We could also conduct a pilot study in which we ask faculty to “think out loud” as they observe and rate several residents.


In regard to internal structure,

  • inter-rater reliability is critical so we would need data to calculate this statistic.
  • Internal consistency is of secondary importance for per- formance ratings,30 but this and factor analysis would be useful to verify that the themes or constructs we identi- fied during development hold true in practice.


For relations to variables,

  • 다른 도구와의 비교는 비교대상이 되는 도구가 얼마나 좋은 도구인가에 달렸다.
    we could correlate our in- strument scores with scores from another instrument as- sessing clinical performance. Note, however, that this comparison is only as good as the instrument with which comparison is made. Thus, comparing our scores with those from an instrument with little supporting evidence would have limited value.
  • Alternatively, we could com- pare the scores from our instrument with United States Medical Licensing Examination scores, scores from an in-training exam, or any other variable that we believe is theoretically related to clinical performance.
  • We could also plan to compare results among different subgroups. For example, if we expect performance to improve over time, we could compare scores among postgraduate years.
  • Finally, we could follow residents into fellowship or clinical practice and see whether current scores predict future performance.


Last, we should not neglect evidence of consequences.

  • If we have set a minimum passing score below which remedial action will be taken, we must clearly document how this score was determined.
  • If subgroup analysis reveals unex- pected relationships (eg, if a minority group is consistently rated lower than other groups), we should investigate whether this finding reflects on the validity of the test.
  • Finally, if low-scoring residents receive remedial action, we could perform follow-up to determine whether this inter- vention was effective, which would support the inference that intervention was warranted.



APPENDIX: INTERPRETATION OF RELIABILITY INDICES AND FACTOR ANALYSIS



Acceptable values will vary according to the purpose of the instrument. For high-stakes settings (eg, licensure examination) reliability should be greater than 0.9, whereas for less important situations values of 0.8 or 0.7 may be acceptable.30 Note that the interpretation of reliability coefficients is different than the interpretation of correlation coefficients in other applications, where a value of 0.6 would often be considered quite high.62 Low reliability can be improved by increasing the number of items or observers and (in education settings) using items of medium difficulty.30 Improvement expected from adding items can be estimated using the Spearman-Brown “prophecy” formula (described elsewhere).41
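
본문이 '다른 곳에 기술되어 있다'고만 언급한 Spearman-Brown 'prophecy' 공식을 가상의 수치로 예시하면 다음과 같다. 검사 길이를 k배로 늘렸을 때 기대되는 신뢰도를 추정하는 식이다.

def spearman_brown(reliability, k):
    # predicted reliability when the test is lengthened by a factor k
    return (k * reliability) / (1 + (k - 1) * reliability)

print(round(spearman_brown(0.70, 2), 2))  # doubling a 0.70-reliability test -> about 0.82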


A less common, but often more useful,63 measure of score variance is the standard error of measurement (SEM) (not to be confused with the standard error of the mean, which is also abbreviated SEM). The SEM, given by the equation SEM = standard deviation × √(1 - reliability),64 is the “standard deviation of an individual’s observed scores”19 and can be used to develop a confidence interval for an individual’s true score (the true score is the score uninfluenced by random error).
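
위에 인용된 SEM 공식을 가상의 수치로 풀어보면 다음과 같다(표준편차, 신뢰도, 관찰 점수는 모두 예시용이다).

import math

sd, reliability, observed = 10.0, 0.84, 75.0
sem = sd * math.sqrt(1 - reliability)                   # 10 * sqrt(0.16) = 4.0
ci_95 = (observed - 1.96 * sem, observed + 1.96 * sem)  # 95% CI for the true score
print(sem, ci_95)                                       # 4.0 and roughly (67.2, 82.8)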


Agreement between raters on binary outcomes (eg, heart murmur present: yes or no?) is often reported using kappa, which represents agreement corrected for chance.40 A dif- ferent but related test, weighted kappa, is necessary when determining inter-rater agreement on ordinally ranked data (eg, Likert scaled responses) to account for the variation in intervals between data points in ordinally ranked data (eg, in a typical 5-point Likert scale the “distance” from 1 to 2 is likely different than the distance from 2 to 3). Landis and Koch65 suggest that kappa less than 0.4 is poor, from 0.4 to 0.75 is good, and greater than 0.75 is excellent.
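
아래는 두 평가자의 이분형(yes/no) 판정에 대한 (비가중) Cohen's kappa 계산을 보여주는 최소한의 스케치이다. 2x2 표의 빈도는 가상의 것이며, 순서형 자료에 쓰는 weighted kappa는 별도의 가중치가 필요하므로 여기서는 다루지 않는다.

# rows = rater A (yes, no); columns = rater B (yes, no); hypothetical counts
table = [[40, 5],
         [10, 45]]

n = sum(sum(row) for row in table)
p_obs = (table[0][0] + table[1][1]) / n                                        # observed agreement
p_yes = ((table[0][0] + table[0][1]) / n) * ((table[0][0] + table[1][0]) / n)
p_no = ((table[1][0] + table[1][1]) / n) * ((table[0][1] + table[1][1]) / n)
p_chance = p_yes + p_no                                                        # agreement expected by chance
kappa = (p_obs - p_chance) / (1 - p_chance)
print(round(kappa, 2))  # 0.7 -> "good" (0.4-0.75) on the Landis & Koch scale quoted above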


Factor analysis32 is used to investigate relationships be- tween items in an instrument and the constructs they are intended to measure.


41. Traub RE, Rowley GL. An NCME instructional module on under- standing reliability. Educational Measurement: Issues and Practice. 1991;10(1):37-45.










 2006 Feb;119(2):166.e7-16.

Current concepts in validity and reliability for psychometric instruments: theory and application.

Author information

  • 1Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minn 55905, USA. cook.david33@mayo.edu

Abstract

Validity and reliability relate to the interpretation of scores from psychometric instruments (eg, symptom scales, questionnaires, education tests, and observer ratings) used in clinical practice, research, education, and administration. Emerging paradigms replace prior distinctions of face, content, and criterion validity with the unitary concept "construct validity," the degree to which a score can be interpreted as representing the intended underlying construct. Evidence to support the validity argument is collected from 5 sources:

CONTENT:

Do instrument items completely represent the construct?

RESPONSE PROCESS:

The relationship between the intended construct and the thought processes of subjects or observers.

INTERNAL STRUCTURE:

Acceptable reliability and factor structure.

RELATIONS TO OTHER VARIABLES:

Correlation with scores from another instrument assessing the same construct.

CONSEQUENCES:

Do scores really make a difference? Evidence should be sought from a variety of sources to support a given interpretation. Reliable scores are necessary, but not sufficient, for valid interpretation. Increased attention to the systematic collection of validity evidence for scores from psychometric instruments will improve assessments in research, patient care, and education.

PMID:
 
16443422
 
[PubMed - indexed for MEDLINE]


post-psychometric 시대의 평가: 주관과 집단을 생각하기 (Med Teach, 2013)

Assessment in the post-psychometric era: Learning to love the subjective and collective

Brian Hodges





평가 영역에서 psychometric discourse의 등장

The rise of psychometric discourse in assessment


20세기의 마지막 50년간 의학교육은 평가에 새로운 언어, 개념, 실천이 등장하는 것을 목격하였고, 이것들은 다 함께 psychometrics에 대한 담론을 이루었다.

In the last half of the twentieth century, medical education witnessed the rise of a new language, concepts, and practices of assessment which, taken together, constitute the discourse of psychometrics.


우리는 이제 psychometric 담론이 저물고 주관성과 집단성에 대한 새로운 담론이 떠오르는 것을 보고있다.

We are now seeing a decline of the dominance of psychometric discourse and a rise in discourses anchored in subjectivity and collectively.


그러나 미래를 논하기 전에, 우리는 먼저 어떻게 진실로 여겨졌던 특정한 명제가 전혀 의심당하지 않고 수십년간 받아들여져왔는지를 이해할 필요가 있다. 1922년에 처음 언급된 (비록 그 이후 수십년간 의학교육계에 반영되지는 않았지만) 이 말을 볼 만 하다 "교육의 산출물과 교육적 목적에 대한 지식은 반드시 정량적이어야하며, 측정의 형태를 띈다" 그러한 진실을 정당화해준 것의 결과로 엄청난 담론의 변화가 있었다.

Before discussing the future however, we first need to understand how a particular set of truth statements became accepted as unquestionable for decades. First articulated in 1922 (although not fully adopted into medical education until a few decades later), an exemplar is: ‘‘Knowledge of educational products and educational purposes must become quantitative, take the form of measurement’’ (Thorndike 1922, p.1). What arose from the legitimization of such truths was a huge discursive shift;



인간의 행동을 숫자로 변환한다는 개념이 의학교육의 모든 분야에서 사고방식을 구성하게 되었다.

The notion of converting human behaviors to numbers constituted a way of thinking that found its way into every corner of medical education.


psychometric 담론에 중요한 여러 개념이 있으나, 아마 가장 중요한 것은 신뢰도reliability일 것이다.

Many concepts are central to psychometric discourse, although perhaps none is more important than reliability.


모든 검사는 신뢰도(Cronbach's alpha) 계수가 최소 0.8 이상이어야 한다는 당부admonition 없이는 어떤 평가 가이드나 논문도 완전complete하지 않았다. 이 필수조건의 기원은 심리측정 교과서였다. 예컨대 Nunnally와 Bernstein(1994)의 교과서 Psychometric Theory에서는 "만약 중요한 의사결정이 구체적인 검사 점수를 근거로 내려진다면, 0.90의 신뢰도는 최저 기준일 뿐이다"라고 했다.

No assessment guide or article was complete without the admonition that all tests must have a (Cronbach’s alpha) reliability coefficient of at least 0.8. The origin of this imperative was psychology measurement textbooks; for example, Nunnally and Bernstein’s (1994) textbook Psychometric Theory states, ‘‘if important decisions are made with respect to specific test scores, a reliability of 0.90 is the bare minimum.’’
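
여기서 말하는 0.8/0.9 기준이 가리키는 계수가 무엇인지 보여주기 위한 최소한의 스케치를 덧붙인다. 원문에 없는 내용이며, Cronbach's alpha의 표준 공식을 가상의 (응시자 × 문항) 점수 행렬에 적용한 것이다.

```python
# Minimal sketch: Cronbach's alpha for a (candidates x items) score matrix.
# Data are simulated purely for illustration; compare against the 0.8 / 0.9 thresholds above.
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """scores: rows = candidates, columns = items."""
    k = scores.shape[1]
    item_variances = scores.var(axis=0, ddof=1)
    total_variance = scores.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

rng = np.random.default_rng(1)
ability = rng.normal(size=100)                                     # 100 hypothetical candidates
items = ability[:, None] + rng.normal(scale=1.0, size=(100, 8))    # 8 correlated items
print(round(cronbach_alpha(items), 2))
```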


가장 중대한 담론의 변화는 '주관성'이라는 단어에 부정적인 함의가 담기게 된 것이다. '객관성'의 반댓말로서, '주관성'이 평가에 들어가는 것은 비뚤림biased를 의미했고, 비뚤림biased란 즉 '불공평unfair'한 것이었다. '객관적인 평가'와 '표준화된 평가'는 마치 동의어처럼 쓰였다.

The most important discursive shift was the negative connotation taken on by the word subjective. Framed in opposition to objective, the use of subjective in conjunction with assessment came to mean biased and biased came to mean unfair. There was also a strong association forged between assessment that was objective and tools that were standardized.



psychometric 담론의 등장은 신뢰도가 모든 검사에서 바람직한 특성이라는 것을 의미하게 되었다.

The rise of psychometric discourse meant that reliability was a desirable characteristic of all tests;


이러한 관점에서 보자면, "오래된 구두시험old oral exams"이라 불리던 일부 평가 형태는 불공정하고 부적절한 것으로 여겨졌는데, 한두 명의 평가자와 표준화되지 않은 질문으로 이루어졌기 때문이다. 1960년대 후반의 중대한 사건 중 하나는 미국 NBME가 모든 구두시험을 중단한 것인데, 그 근거는 10,000건 이상의 최종 구두시험을 대상으로 한 대규모 연구에서 두 평가자 간 평균 상관관계가 0.25 미만으로 나타난 것이었다. 1960년대 후반부터 2005년에 다수-스테이션의 표준화된 임상술기평가(CSA)가 도입되기까지, 미국에서는 표준화된 지필고사만으로 의과대학 졸업생의 역량을 평가했다.

Seen through this lens, some forms of assessment, such as what were called ‘‘old orals exams,’’ were deemed unfair and unsuitable because of the one or two examiners and unstandardized questions. A pivotal event occurred in the late 1960s when the National Board of Medical Examiners in the United States discontinued the use of all oral examinations on the basis of a large-scale study of more than 10,000 final oral exams in which the average correlation between the two examiners was less than 0.25 (McGuire 1966). From the end of the 1960s, until the adoption of a multi- station, standardized clinical skills (CSA) assessment examin- ation in 2005, the United States required only standardized written exams to assess the competence of graduating medical students.


신뢰도에 대해서 집중하기 시작함으로써 더 포괄적인 표집sampling을 하게 되었고, 이는 명백하게 평가를 더 공정하게 만들었다. '경험만 필요할 뿐'이라고 더 이상 여겨지지 않았다. 평가자들은 검사 방법과 심지어 보정calibration까지 익혀야 했다.

Attention to reliability contributed to broader sampling that undoubtedly did make assessment fairer. Another positive effect was the rise in examiner training. No longer was it assumed that experience was all that was needed; examiners required orientation to testing methods and even calibration.



Psychometric 고부담 검사와 이에 대한 불만

Psychometric high-stakes testing and its discontents


그럼에도 불구하고, psychometric 담론이 지배하면서 생긴 부정적 영향도 점점 명확해져 갔다. 여기에는 표준화가 가능하도록 역량을 하부-하부-영역으로까지 점점 더 잘게 쪼개는 분자화finer and finer atomization of competencies가 있었다.

Nevertheless, adverse effects of the dominance of psychomet- ric discourse became apparent. This included the finer and finer atomization of competencies into sub-sub-domains that could be standardized



평가 내용의 표준화를 위해서 시험 자료와 시나리오가 균질화 되었는데, 그에 따라서 시험을 보는 모든 사람에게 검사를 동등화하기 위하여 막상 실제 임상상황에서 수반되는 진단적/맥락적/대인관계적 변인이 시험에서 사라지게 되었다. 또 다른 문제는 보안을 위해서 큰 문제은행을 만들어야 했던 것이다.

Standardization of examination content led to the homogenization of test materials and or scenarios, while diagnostic, contextual, inter-personal variables that might be part of the authentic variability of real practice settings were often removed to make tests equivalent for all test takers. Another problematic effect was the need to create large testing banks because of exam security.


마지막으로, 표준화된 다수-표본 수집 검사(OSCE, MCQ 등)의 비용이 높아지면서 시험은 드물게만 치러지게 되었고, 종종 한 훈련 블록이나 학년, 프로그램이 끝나는 시점에만 시행되어 시험 결과가 학생의 학습요구에 닿지 못하는 범위 밖에 놓이게 되었다.

Finally, the increased expense of standardized, multiple sampling exam- inations (such as OSCEs, MCQs) meant that exams were given infrequently, often at the end of a training block, year, or program, putting the test results out of range of students’ learning needs.



 psychometric 담론이 기대고 있는 기본적 개념은 무엇인가? 가장 근본적으로 이것은 인간의 현상을 숫자로 바꾸는 작업이다. 이러한 변환은 정확exact한 프로세스가 아닌데, 그 변환과정에서 정보가 상실된다. 그 프로세스를 통해서 생성된 숫자는 무언가를 대표하긴 하나 - 무엇인가가 존재한다는 것 - 그러나 그 숫자 자체로는 어떤 실체entity가 아니다. 

What are the fundamental concepts on which psychometric discourse rests? First and foremost, it is a set of practices to convert human phenomena into numbers. Such conversion is not an exact process; data are lost during the conversion. The numbers generated during the process represent something—a form of existence of something—but they are not that entity, in and of themselves.



표준화된 지능검사에서 121점을 받은 것은 그 자체로는 개인의 지능이 아니다.

a score of 121 on a standardized intelligence test is not, in and of itself, a person’s intelligence.



이러한 가정은 어떤 현상phenomena가 한 개인 안에 있다는 것에 기반한다. 즉, 한 개인에게는 측정할 수 있는 quantity나 양amount가 존재한다. 그리고 이 측정은, 즉 점수는, 제거되어야 하는 외부의 통계학적 노이즈에 의해서 가려진다. 그리고 시험이 가지고 있는 여러 사람을 구분해내는 능력은 무언가 긍정적인 것이다.

These assumptions are grounded on the ideas that phenomena are located within individuals; that there is a quantity or amount that can be measured; that this measure, or score, is obscured by sources of true statistical noise from extraneous factors that needs to be eliminated; and that the ability of tests to discriminate between individuals is something positive.



개념적 측면에서 이러한 가정에는 몇 가지 문제가 있다.

From a conceptual perspective there are several difficulties with these assumptions. Among them, in no particular order, are that

  • 역량이란 개인에게 내재된embedded 특성이 아니라 집단이 갖는 특성이다.
    competence is not a characteristic of individuals but is embedded in collectivities;
  • 역량이란 고정된 것이 아니라 맥락에 따라 변화하는 것이다.
    competence is not a fixed, stable characteristic but one that varies in different contexts;
  • 검사는 한 개인의 사고와 행동을 만든shape다.
    tests have the power to shape the thoughts and behaviors of individuals; and
  • 여러 개개인을 서로 구분해내는 것은, 한 개인 안에서 여러 능력을 구분해내는 것보다 덜 유용할 수 있다.
    finally, discriminating between individuals might be less helpful than some form of differentiation of abilities within individuals.



개념적 우려에서 평가의 실천practice으로 눈을 돌리면, psychometric 담론은 세 가지 핵심 필수요소imperatives를 제공했다.

Turning from conceptual concerns to practices of assess- ment, psychometric discourse provided three key imperatives:

  • 역량의 하위 요소를 찾는다.
    to identify sub-components of competence;
  • 평가를 표준화하고 다수의 표본을 수집한다.
    to standardize assessments and take multiple samples; and
  • 하위 점수를 합산하여 역량을 재구성한다.
    to aggregate sub- scores to reconstitute competence.

 

CanMEDS, ACGME, TD 등이 있다.

the CanMEDS roles, the ACGME competence framework, and Tomorrow’s Doctor in the UK.



흥미롭게도 'Dissecting the good doctor'에서 Whitehead 는 의학교육이 character에 대한 관심에서 characteristics에 대한 관심으로 초점을 옮겨가며 진화했음을 추적했다. 그녀는 역량의 개별적 영역들을 밝혀내는 것의 장점도 있지만, 점점 더 작은 수준dimension에서의 측정에 더 많이 의존하게 되면서 character를 평가하는 것의 예술 the art of judging character을 잃을 수도 있다고 지적한다.

Interestingly, in Dissecting the good doctor, Whitehead traces the evolution of medical education from a concern with a holistic notion of character to a focus on characteristics (Whitehead et al. 2012). She argues that while there are many advantages to identifying individual domains of competence, to place more and more reliance on measurements of smaller and smaller dimensions is to risk losing the art of judging character.


패턴인식과 게슈탈트에 대한 인식apperception이 진단적 역량의 핵심임에도, 우리는 어떤 이유에서인지 감독관의 피훈련자에 대한 판단을 "편견에 휩싸인 것"으로 생각해왔다. 평가에 있어서 게슈탈트의 가치를 재조명하는 것은 감독자의 전인적인 판단holistic supervisor judgments의 지혜를 되찾는 길이 될 수도 있다.

While pattern recognition and the apperception of gestalt are at the heart of medical diagnostic competence, somehow we have moved to thinking of supervisor judgments of trainees as being ‘‘riddled with bias.’’ Refocusing on the value of gestalt in assessment raises the possibility of capturing the wisdom in holistic supervisor judgments.


 

두 번째 psychometric 필수요소는 평가의 표준화와 다수의 표본을 수집하는 것이다. 그러나 van der Vleuten and Schuwirth 은 신뢰도의 주요 결정인자가 총 평가시간이지 평가한 도구의 표준화가 아님을 보여주었다.

The second psychometric imperative has been to standard- ize assessments and take multiple samples. Yet, as van der Vleuten and Schuwirth (2005) have shown, the major deter- minant of reliability is total testing time, not the standardization of the instrument used.


만약 어떤 도구를 사용하는지가 중요하지 않다면, 이것이 의미하는 바는 더 표준화된 평가도구(MCQ, OSCE)가 더 주관적인 평가도구(논술, 구술고사)보다 반드시 더 신뢰도가 높다거나 하지는 않다는 점이다. 중요한 것은, 신뢰도는 평가자의 숫자와 관련이 매우 높다는 것이다. 즉, 신뢰도를 획득하기 위해서 결정적으로 중요한 변수는 시험 시간과 다수의 표본을 수집하는 것이지, '표준화'가 아니다. 따라서 우리는 비록 우리가 표본을 주관적 영향을 받는 출처들로부터 수집하더라도, 전인평가holistic judgement를 두려워해서는 안 된다.

If the type of tool does not matter, the implication is that those tools that are more standardized (MCQ, OSCE) are not necessarily more reliable than those that are more subjective (essays, oral examinations). The caveat, of course, is that reliability is strongly tied to the number of examiners (Swanson 1987). The critical variables in attaining reliability, therefore, are testing time and multiple sampling, not standardization. We should not, therefore, be afraid of holistic judgment, although we should sample widely across sources of subjective influences (raters, examiners, patients).
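
표집(총 검사시간, 평가자 수)이 신뢰도를 어떻게 끌어올리는지는 표준적인 Spearman-Brown 예측 공식으로 예시할 수 있다. 아래 스케치는 원문의 계산이 아니며, 시작 신뢰도 값들은 전적으로 임의의 가정이다.

```python
# Spearman-Brown prophecy formula: predicted reliability when the number of
# sampled observations (items, cases, or examiners) is multiplied by k.
# The single-observation reliabilities below are illustrative assumptions only.
def spearman_brown(single_obs_reliability: float, k: int) -> float:
    r = single_obs_reliability
    return k * r / (1 + (k - 1) * r)

for label, r1 in [("standardised station", 0.30), ("holistic oral judgement", 0.25)]:
    preds = {k: round(spearman_brown(r1, k), 2) for k in (1, 4, 8, 12)}
    print(label, preds)
# Both forms approach acceptable reliability once enough samples are aggregated,
# which is the point about testing time and sampling rather than standardisation.
```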


세 번째 필수요소는, 일단 평가를 하고 나면, 하부 점수들을 합쳐서 역량을 결정한다는 것이다. 만약 우리가 가장 좋은 자전거 타이어와, 가장 좋은 트렉터의 엔진과, 가장 좋은 비행기의 날개를 모아서 무언가를 만든다고 하자. 운송수단의 측면에서 우리가 만든 것은 아무런 가치도 없을 것이다.

The third psychometric imperative is that once assessed, component sub-scores should be recombined to determine competence. Imagine that we put together the world’s highest quality bicycle tire, a top quality tractor engine, and one wing from a state of the art airplane. In terms of transportation we will not have created anything of value.


MCQ 72% + OSCE 80% + 훈련중평가in-training evaluation 4/5 + 사례보고, mini-CEX, SP 인터뷰 점수를 모두 더하면... 무엇이 되는가? 인간적 현상을 숫자로 바꾸는 것 자체가 정확한 프로세스가 아닌데, 그것들을 다시 합치는 것은 문제를 더 가중시킨다. 개별 평가 도구 차원에서 신뢰도는 매우 유용하지만, 서로 다른 유형의 도구로 수집한 매우 이질적인 정보 출처들을 합치는 경우에는 신뢰도가 별로 쓸모가 없다.

Adding 72% on a MCQ + 80% on OSCE + 4/5 on in-training evaluation + scores on case reports, mini-CEXs, and SP interviews gives us...what? Converting human phenomena into numbers is not an exact process, but recombining them compounds the problem. Reliability is very useful at the level of individual assessment tools, but it is not of much use when we combine very heterogeneous sources of information collected with different types of instruments.



집단성을 사랑하자

Learning to love the collective


Psychometric 담론은 평가의 대상이 되는 구인이 한 개인에게 내재한다는 생각에 근거한다. 그러나 팀-기반 의료는 개인으로부터 협력과 집단으로 초점을 옮겨갔다.

Psychometric discourse is based on the idea that constructs of interest are located in individuals. Yet, the rise of team-based health care is shifting the focus from individuals to collaboration and collectivity (Lingard 2012).

  • 의사는 팀으로서 일하고 시스템 내에서 일한다. 의료의 퀄리티를 한 개인의 수준에 놓을attribute 수 없다.
    Ringsted et al. (2007, p. 2764) wrote, ‘‘In the assessment of physicians it must be acknowledged that physicians often work in teams and systems, rendering it impossible to attribute quality of practice to a single person.’’
  • 뛰어난 역량을 갖춘 개인이 모여서 부실한 역량을 갖춘 팀을 이룰 수 있으며, 실제로 자주 그런 일이 생긴다.
    For Lingard (2009, p. 626), ‘‘our individu- alist healthcare system and education culture [focuses] atten- tion on the individual learner’’ nevertheless ‘‘competent individual professionals can—and do, with some regularity— combine to create an incompetent team.’’
  • 의학교육의 어떤 측면은 '사회적 구조물social constructs'로 보는 것이 타당하다. 개개인들의 능력이 발현 되는 것이라기보다 두 명 이상의 개인의 상호작용에 따른 결과물이다.
    And Kuper (2007, p. 1122) argues that, ‘‘some aspects of medical education are better thought of as social constructs: instead of being considered as expressions of a single individual’s abilities, they are conceived of as the products of interactions between two or more individuals or groups.’’




'역량에서의 집단성collective in competence'에 대한 관심은 1990년대 후반 수술장에서 본격적으로 시작되었는데, 이 시기는 항공 분야에서 안전과 위험관리에 기울이는 면밀한 검토와 의학에서 동일한 파라미터들이 상대적으로 방치되는 현실이 서로 비교되던 때였다.

Attention to the collective in competence began in earnest in the operating room in the late 1990s, at a time when comparisons were made between the scrutiny given to safety and risk management in aviation and the relative neglect of those same parameters in medicine.


 

 

이후 많은 연구들이 팀-기반 훈련이 도입되었을 때 환자성과가 더 향상된다는 것을 보여주었다.

Many studies have since demonstrated improved patient outcomes when team-based training is employed (Haynes et al. 2009; Marr et al. 2012; Stevens et al. 2012).

 

역량이란 한 개인이 가지는 것이라는 개념은 점점 더 옹호될 수 없는 것이 되어가고 있다.

the notion that competence is something held by an individual becomes more and more untenable.





주관성을 사랑하자

Learning to love the subjective

 

 

20세기 후반, 의학교육은 심리학과 psychometric적 방법을 통해서 '불공정'과 동의어처럼 쓰이던 '주관성'의 문제를 해결하기 위해 노력했다.

In the late twentieth century, medical education tried to solve the problem of subjectivity when it became equated with unfairness, by turning to methods from psychology and psychometrics.



Eva와 Hodges는, 평가에서 주관성의 위험을 비판하는 문헌이 쌓여 있는 한편, 오류가능성이 있는 다수의 판단이 합쳐지면 가치를 만들어낸다는 것을 보여주는 문헌도 있다고 지적한다. 이는 James Surowiecki의 책 "군중의 지혜The wisdom of crowds"의 핵심 논지이기도 한데, 주관성의 가치는 판단의 수가 늘어날수록, 그 판단들이 독립적일수록, 그리고 (흥미롭게도) 관점이 균질하지 않고 다양할수록 향상된다.

Eva and Hodges (2012) point out that, while the literature is replete with critiques of the dangers of subjectivity in assessment, there is a literature showing that many fallible judgments, summed together, create value. This is the key argument in James Surowiecki’s (2004) book, ‘‘The wisdom of crowds,’’ in which he writes that the value of subjectivity increases with the number of judgments, the independence of those judgments, and, interestingly, the diversity (not homogeneity) of perspectives.
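
'군중의 지혜' 논지를 보여주기 위한 간단한 시뮬레이션 스케치를 덧붙인다. 원문에 없는 내용이며 모든 수치는 가상의 것이다: 개별적으로는 오류가능성이 큰 독립적 판단들도 평균을 내면 참값에 가까워진다.

```python
# Toy simulation of the "wisdom of crowds" point: individually fallible but
# independent judgements, once averaged, land close to the underlying value.
import numpy as np

rng = np.random.default_rng(2)
true_quality = 7.0          # hypothetical "true" performance on a 9-point scale
rater_noise = 1.5           # each single judgement is quite fallible

for n_raters in (1, 5, 20, 100):
    errors = []
    for _ in range(2000):   # repeat the experiment many times
        judgements = true_quality + rng.normal(scale=rater_noise, size=n_raters)
        errors.append(abs(judgements.mean() - true_quality))
    print(n_raters, "raters -> mean absolute error", round(float(np.mean(errors)), 2))
```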


주관적 판단을 다시 생각해보게끔 하는 가장 강력한 방법 중 하나는 임상 진단에서 무엇이 중요한가와 연결지어보는 것이다. 임상에서 패턴인식은 고도의 가치를 지니는 것이다. 경험이 풍부한 의사는 환자의 게슈탈트 인상을 토대로 무슨 질문과 검사를 할지를 찾아낸다. 경험이 풍부한 의사는 예비적 진단을 내리기 전에 철저한 증상과 징후체크리스트를 사용하지 않는다. 이것은 초심자가 하는 방식이다.

One of the most powerful ways to rethink subjective judgment is to relate it to something of great value in medical practice—clinical diagnosis. In the clinical domain, pattern recognition is highly valued. Experienced clinicians rely on a gestalt impression of presenting features to engage in further questioning and investigation. What they do not do is use an exhaustive checklist of symptoms and signs before forming preliminary diagnostic impressions. That is what novices do.



이러한 프로세스가 교육에서도 작동하지 않으리라고 믿을 만한 이유는 없다. 경험이 풍부한 교사는 피훈련자의 역량에 대한 인상을 빠르게 형성한다. 당연히, 의사가 첫인상을 후속 질문과 검사로 검증clarify해야 하는 것처럼, 교육자들도 특정 평가 도구와 관측을 사용하여 전인적 인상을 확인해야 한다.

There is no reason to believe that this process does not also operate in education. Experienced teachers also form rapid impressions of the competence of their trainees. Of course, just as clinicians must clarify first impressions with follow-up questions and investigations, so too, educators need to use specific assess- ment tools and observations to confirm their holistic impressions.


사고실험을 해보자. 현재 당신이 가르치는 의과대학생이나 레지던트, 혹은 동료 한 명을 떠올려 보자. 이제 스스로에게 '내 가족을 이 의사에게 보낼 것인가?'를 물어보자. 대부분의 사람들에게 그 대답은 쉽고, 자동적이다. 이것이 게슈탈트 인상gestalt impression이다.

Try a thought experiment. Think about one of your current medical students, residents, or colleagues. Now ask yourself, would you send a family member to see this doctor? For most people the answer is easy, and automatic. This is a gestalt impression.


만약 그러한 인상을 여럿 모으고 통합한다면 - '배심원 모델'이라 불리는 것을 사용하여 - 종합적인 판단은 어떤 중요한 가치를 가질 것이다. 또한 단순하게 예-아니오가 아니라 왜 그러한 판단을 내렸는가를 살피고 들어가면 평가와 피드백의 강건한robust 정보 출처를 얻게 될 것이다. 질적 연구자는 이러한 구체적 묘사를 'thick description'이라 부른다.

if multiple such impressions were collected and integrated—using something called a jury model—the collective judgment would have significant value. Further, if we went beyond just asking a yes or no question and had each rater describe why they would (or would not) refer a family member to this doctor, we would have a robust source of information for evaluation and feedback. Qualitative researchers call such a detailed narrative a thick description.


그러한 절차가 두 명의 개인을 평가하는 데 사용된다고 상상해보자. 모든 평가자는 왜 보내겠는지 혹은 왜 보내지 않겠는지 그 이유도 구체적으로 적는다. 이 모델에서 assessor의 역할은 단순히 숫자로 변환하는 것이 아니라 '해석'하는 것이며, 이 해석은 수치적인 것과 언어적인 것을 모두 포함한다. 예컨대 가족을 보내고 싶지 않다는 인상이 환자들로부터만, 혹은 비-의사 보건전문직 동료들로부터만 체계적으로 나온다면, 그것은 무언가를 의미하는 것이다. 미래의 assessor는 이러한 독립적 판단들을 종합하는 사람aggregator이어야 한다.

Imagine such a procedure was used to evaluate two individuals. All raters are also asked to write down in detail why or why not. In this model, the assessor’s role would be to interpret, not simply to apply transformation to numbers; interpretation would be both numerical and linguistic. If, for example, the impression of not wanting to send a family member to the doctor came systematically from patients, or from non-medical health professional colleagues, that would mean something. The assessor of the future must be an aggregator of such independent judgments.
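
위에서 기술된 '배심원 모델' 식의 기록이 어떻게 구조화되고 출처별로 요약될 수 있는지에 대한 최소한의 스케치이다. 평가자 집단, 판단, 이유는 모두 이 예시를 위해 지어낸 것이며, 핵심은 집계가 횟수뿐 아니라 이유(thick description)와 출처별 구분을 함께 보존한다는 점이다.

```python
# Minimal sketch of a "jury model" record: each independent judgement keeps the
# yes/no verdict, the rater's group, and the reason. Everything here is invented.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Judgement:
    rater_group: str   # e.g. "physician", "nurse", "patient"
    would_refer: bool  # "would you send a family member to this doctor?"
    reason: str        # free-text rationale kept for feedback

judgements = [
    Judgement("physician", True,  "thorough history, clear plan"),
    Judgement("nurse",     False, "dismissive with ward staff"),
    Judgement("patient",   False, "did not explain the medication changes"),
    Judgement("patient",   True,  "listened carefully"),
]

# Aggregate by source: systematic "no" votes from one group would mean something.
by_group = defaultdict(list)
for j in judgements:
    by_group[j.rater_group].append(j)

for group, items in by_group.items():
    yes = sum(j.would_refer for j in items)
    print(f"{group}: {yes}/{len(items)} would refer")
    for j in items:
        print("   -", j.reason)
```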



post-psychometric 시대의 평가

Assessment in the post-psychometric era


그곳에 이르기 위해서 우리는 우리의 수준을 한층 끌어올리고raise our game, 평가 프로그램이 가져오는 전반적 영향에 초점을 두어야 한다. Rowntree(1987)가 썼듯이,

To get there, we need to raise our game and focus on the overall impacts of assessment programs. As Rowntree (1987, p. 2) has written,

 

"어떻게 좋은 객관식 문제를 만들어야 하고, 어떻게 통계적으로 시험 결과를 다뤄야 하며, 어떻게 서로 다른 평가자들이 서로 다르게 평가하는 것을 보상할 수 있는가에 대해 쓰여진 것은 많다.

‘It is easy to find writers concerned with how to produce a better multiple choice question, how to handle test results statistically, or how to compensate for the fact that different examiners respond differently to a given piece of student work.

 

그러나 '평가의 목적'이 무엇인가에 대한 의문, 어떤 역량을 identify해야 하는가, 교사와 학습자의 관계에 미치는 영향은 무엇인가, 진실/공정/신뢰/인간성/사회적 정의 등의 개념과는 어떻게 연결되는가 에 대한 의문을 논하는 사람은 찾기 어렵다.

It is much less easy to find writers questioning the purpose of assessment, asking what qualities it does or should identify, examining its effects on the relationship between teachers and learners, or attempting to relate it to such concepts as truth, fairness, trust, humanity or social justice.’’

 

우리 앞에 놓인 도전은 평가 프로그램에 엄격성을 갖추는build rigor 동시에, 역량이 맥락적이고 구성되는 것이며 변화 가능하다는 것, 그리고 적어도 부분적으로는 주관적이고 집단적이라는 것을 인식하는 것이다.

The challenge before us then is to build rigor into our assessment programs, and to recognize that competence is contextual, constructed, and changeable and, at least in part, also subjective and collective.





Whitehead C, Hodges BD, Austen Z. 2012. Dissecting the doctor: from character to characteristics in North American medical education. Adv Health Sci Educ (epub ahead of print, 28 September 2012).


Zibrowski EM, Singh SI, Goldszmidt MA, Watling CJ, Kenyon CF, Schulz V, et al. 2009. The sum of the parts detracts from the intended whole: Competencies and in-training assessments. Med Educ 43:741–748.






 2013 Jul;35(7):564-8. doi: 10.3109/0142159X.2013.789134. Epub 2013 Apr 30.

Assessment in the post-psychometric era: learning to love the subjective and collective.

Author information

  • 1Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada. brian.hodges@uhn.ca

Abstract

Since the 1970s, assessment of competence in the health professions has been dominated by a discourse of psychometrics that emphasizes the conversion of human behaviors to numbers and prioritizes high-stakes, point-in-time sampling, and standardization. There are many advantages to this approach, including increased fairness to test takers; however, some limitations of overemphasis on this paradigm are evident. Further, two shifts are underway that have significant consequences for assessment. First, as clinical practice becomes more interprofessional and team-based, the locus of competence is shifting from individuals to teams. Second, expensive, high-stakes final examinations are not well suited for longitudinal assessment in workplaces. The result is a need to consider assessment methods that are subjective and collective.

PMID: 23631408 [PubMed - indexed for MEDLINE]


 

관찰을 통해서 임상술기평가의 블랙박스 열기: 개념적 모델(Med Educ, 2011)

Opening the black box of clinical skills assessment via observation: a conceptual model

Jennifer R Kogan,1 Lisa Conforti,2 Elizabeth Bernabeo,2 William Iobst2 & Eric Holmboe2







INTRODUCTION


이에 더하여, 고부담 최종시험을 제한하자는 요구에 응답하여 이제는 임상 근무환경에서 술기를 지속적으로 평가하는 것이 더욱 강조되고 있으며, 그러한 평가는 정확하고 타당해야 한다.

Additionally, in response to calls to limit high-stakes final examinations, greater emphasis is now placed on the continuous assessment of skills in the clinical workplace7–11 and such assessment must be accurate and valid.



METHODS


표본

Sample

 



연구 설계와 자료 수집

Study design and data collection


On their study day, faculty members individually watched four videos and two live scenarios of a standardised postgraduate year 2 (PGY2) resident (SR) taking a history, performing a physical exami- nation or counselling a standardised patient (SP).20 The live cases also scripted resident receptiveness to feedback. These cases were previously used with medical residents and each case was scripted to depict a PGY2 resident whose performance was unsatisfactory, satisfactory or superior for content (history taking, examination, counselling) and interpersonal skills (some cases portrayed superior content but unsatisfactory interpersonal skills). Initial error scripting (by JRK) was based on actual resi- dent performance norms. The study team reviewed scripts to confirm that they reflected predetermined performance levels. For the video cases, volunteer medical residents trained on a single script, practised with the SP and were videotaped once their perfor- mance accurately represented the intended perfor- mance level.


After watching each of four video encounters (Fig. 1a), faculty staff completed a mini-clinical evaluation exercise (mini-CEX). The mini-CEX, developed by the American Board of Internal Medicine (ABIM) to provide residents with feed- back about their history-taking, physical examina- tion, counselling and interpersonal skills, details seven competencies that are rated on a 9-point scale (1–3 = unsatisfactory, 4–6 = satisfactory, 7–9 = superior).25,26 Faculty members were then interviewed individually for 15 minutes by a trained study investigator using a semi-structured interview guide.



Videos were shown in a random order and each faculty member was interviewed by at least three interviewers. Following the video scenarios, faculty members observed two live representations of an SR taking a history, conducting an examination and counselling an SP (Fig. 1b). Following each encounter, faculty staff rated the SR using the mini- CEX and provided the SR with up to 10 minutes of feedback, which was video-recorded. Faculty members were then interviewed individually by a study investigator for 30 minutes using the semi- structured interview (Appendix S1). Faculty mem- bers were asked about the feedback encounter before and after watching a DVD of themselves giving feedback to the SR.




자료 분석

Data analysis


Grounded theory approach 활용. 그러한 이유는 관찰과 평가 프로세스에 대해서 알려진 바가 거의 없었고, 현재의 가설이나 이전 연구로부터의 추론에 제약되지 않고자 하였기 때문임.

We utilised a grounded theory approach to analyse the data for emergent themes and to develop a thematic coding structure.27 We selected grounded theory because little is known about the observation and evaluation process and we wished to avoid restricting ourselves to current hypotheses or inferences from prior studies.28 Transcripts were sampled for coding across faculty participants, SP cases and interviewers.27 Two researchers (JRK, LC) independently coded and used constant comparative techniques to develop a preliminary coding structure.27 A portion of the transcripts were also coded by additional study team members (EB, WI, EH) to review, further define and refine the coding structure. Refinement of the coding structure continued as analysis progressed. Coding was terminated when theoretical saturation was achieved and when all team members agreed upon the final interpretation of the data. In total, 56 of 172 video interviews (33%) and 29 of 88 live interviews (33%) were coded. NVivo Version 2.0 (QSR International Pty Ltd, Melbourne, Vic, Australia) was used to organise and analyse the coding structure.

 

 


결과

RESULTS

 

주제 1. 관찰과 평가의 Frame of reference

Theme 1. Frames of reference during observation and rating


자기 자신을 레퍼런스로 활용

Using self as a reference



가장 흔하게, 교수들은 레지던트의 수행능력을 교수 자신이 스스로의 진료에 대해서 인식하는 것과 비교하였다.

Most frequently, faculty members compared resident performance with how they perceived themselves to practise:


교수들이 자신의 임상적 강점과 약점에 대해서 어떻게 인식하는지가 그들의 판단과 평가를 mediate했으며, 제시된 상황에 대해 얼마나 편하게 느끼느냐도 여기에 영향을 받았다.

Faculty members’ perceptions of their own clinical strengths or limitations at times mediated their judgements and ratings, as well as their comfort with the encounter.


교수들이 특히 중요하다고 혹은 우선시 되어야한다고 생각하는 역량이 무엇인지도 평가를 frame하였다.

Competencies believed by faculty members to be especially important and which were prioritised in feedback also framed ratings:


교수들은 레지던트의 수행능력을 평가할 때, 본인이 레지던트 시절에 어떻게 했었는지에 대한 스스로의 인식을 기준으로 삼기도 했다.

Faculty staff also referred to comparisons of resident performance with the faculty member’s perception of his or her own performance as a resident.



다른 의사를 레퍼런스로 활용

Using other doctors as a reference


많은 교수들이 레지던트의 수행능력을 비슷한 단계에 있는 다른 레지던트와 비교했다.

Many faculty members compared resident perfor- mance with that of residents at a similar stage:


그러나 일부 교수들은 평가가 꼭 몇 년차인지 (PGY level)에 따라 달라야 하는가에 의문을 제기하였다.

However, some faculty staff questioned whether ratings should be based on the resident’s PGY level:


일부 교수들은 레지던트 수행능력을 (수련을 마치고 진료하는) 보통 의사들practising doctor와 비교하였다. 많은 교수들은 일부 practising doctor들이 임상 스킬이 부족한 면이 있기에 과연 레지던트들에게 그 임상 스킬을 요구하는 것이 타당한지에 대해 고민하였다.

Some faculty staff also compared resident perfor- mance with that of practising doctors. Many faculty members acknowledged that some practising doctors have deficient clinical skills and this led them to question what it might be reasonable to expect of a resident:


 

환자 성과outcome을 레퍼런스로 활용

Using patient outcomes as a reference


 

기타 Frame of reference

Additional frames of reference



특정 임상수행(History taking)의 기본 요소 충족 여부

SIGECAPS [Sleep, Interest, Guilt, Energy, Concentration, Appetite, Psychomotor, Suicidal].



어떤 이들에게는 평가의 기준을 명확하게 말로 표현하는 것 자체가 어려운 일이었다. 대신 일부 교수들은 평가를 이끈 것이 '직감gut feeling'이었다고 말하였다. 또 어떤 교수들은 어떻게 '관찰'에서 '판단'으로 옮겨갔는지를 말로 설명하기 어려워했으며, 그 변환이 게슈탈트를 반영한다고 하거나, 단순히 자신이 어떤 프레임워크에 비추어 판단과 평가를 내리는지 스스로도 불확실하다고 언급했다. 일부 교수들은 대인관계 기술에 대한 평가를 특히 어렵다고 했는데, 이 기술이 더 주관적이고 정량화하기 어렵다고 느꼈기 때문이다.

For others, articulating the standard for evalua- tion was difficult. Instead, some faculty staff referred to having a ‘gut’ feeling that drove evaluation. Others had difficulty in verbalising how they moved from observations to judgement and commented that the transition represented a gestalt or simply that they were uncertain of which framework they were making judgements and ratings against. Some faculty staff found the assessment of interpersonal skills particularly challenging because these skills were felt to be more subjective and difficult to quantify:



중요한 것은, 우리의 자료가 교수들이 frame of reference를 설정하는 것이 복잡하고, 역동적이고, 사람마다 매우 다르다는 것을 보여준다. 많은 교수들은 상황마다는 물론 한 상황 내에서도 FOR의 차이를 보였다.

Importantly, our data show that the ways in which faculty staff implemented these frames of reference were complex, dynamic and highly variable. Many faculty members shifted between frames of reference both within and between encounters:


 

주제 2. 추론inference의 역할

Theme 2. The role of inference


우리는 레지던트와 그들의 수행능력에 대한 추론inference이 평가 과정에서 두드러진다는 것을 발견했다. 교수들은 구체적 자료(레지던트의 행동)를 사용하였고, (의식적으로든 무의식적으로든) 그 행동들 중 일부를 선별하여 의미와 해석을 부여affix하였으며, 가정assumption을 세우고 거기서 자주 결론을 도출하였다. 종종 추론은 '고차원적high-level'이었는데, 이는 목격한 행동에 기반한 상당한 해석이 있었음을 의미한다. Table 2는 동일한 비디오를 본 교수들이 동일한 행동을 어떻게 서로 다르게 해석하는지를 보여준다. 추론은 레지던트의 감정(자신감과 편안함의 수준), 성격, 스킬(지식 기반과 잠재력), 향상에 대한 동기, 과거 경험과 준비 상태 등에 대해 이뤄졌다.

We found that inferences about residents and their performance were prominent during assessment. Faculty members used concrete data (resident actions), selected from those actions (consciously or subconsciously), affixed meaning and interpreta- tion to those actions and made assumptions from which they frequently drew conclusions. Often, inferences were of the ‘high’ level, meaning there was significant interpretation based on the behav- iour witnessed. Table 2 provides examples of how the same behaviour was interpreted differently by different faculty staff watching the same video. Inferences were made about the residents’ feelings (i.e. their levels of confidence and comfort), personalities, skills (i.e. knowledge base and potential), motivation to improve, and prior experiences and preparation. 


일부 교수들은 그들이 추론을 한다는 사실을 인식하고 있었다. 그러나 많은 교수들은 주관적인 추론을 내리면서 그렇게 하고 있다는 것을 인식하는데 실패하였으며, 결과적으로 레지던트의 수행능력에 대해서 무수한 가정을 하고 있었다. 

A few faculty staff seemed to be aware that they made inferences, However, many faculty members failed to recognise when they made subjective inferences and consequently made numerous assumptions about residents’ performance.

 

 

 

 


주제 3. '판단'을 점수로 통합하는 과정

Theme 3. Variable approaches to synthesising judgements to numerical ratings


 

'판단'을 '숫자(점수)'로 변환하는 과정이 얼마나 다양하고 불확실한가를 보여준다

Our data showed significant variability and uncer- tainty surrounding how to translate a judgement about the resident into a numerical rating, espe- cially the overall mini-CEX rating.


일부 교수들은 개개 mini-CEX 역량을 평균하여 점수를 주었다.

Some faculty members chose to average all of the individual mini-CEX competencies:


다른 사람들은 비-보상적으로 점수를 주었다. 

Others used non-compensatory grading:


일부 교수들은 그 해당 시나리오encounter가 어디에 초점을 두는지 혹은 목적이 무엇인지에 따라 가중을 두었다.

Some faculty staff weighted ratings according to the encounter’s focus or purpose:


일부는 기존의 프레임워크를 활용하여 평가하였다.

A few faculty members used existing frameworks to guide them in assessing residents.


많은 교수들은 판단을 수치 점수로 변환하는 데 어려움을 겪었다. 교수들은 각 숫자의 의미를 스스로도 이해하지 못하고 있다는 점, 9점 척도에서 점수를 구별해내기 어렵다는 점, 그리고 점수들을 어떻게 통합synthesise할지 불확실하다는 점을 언급했다.

Many faculty staff struggled to translate their judge- ments to a numerical rating. Faculty members described their own lack of understanding about the meaning of the numbers, their inability to discrimi- nate along a 9-point scale, and their uncertainty regarding how to synthesise ratings:
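
이 주제에서 기술된 세 가지 통합 방식(평균, 비-보상적, 가중)을 가상의 mini-CEX 역량 점수에 적용해 본 예시를 덧붙인다. 점수와 가중치는 모두 임의로 정한 것이며, 동일한 관찰도 통합 방식에 따라 전혀 다른 전체 점수가 된다는 점만을 보여준다.

```python
# Illustration of the three aggregation approaches described above, applied to
# invented mini-CEX competency ratings on the 9-point scale (not a real rubric).
competencies = {
    "history taking": 7, "physical exam": 6, "counselling": 3,
    "clinical judgement": 6, "professionalism": 8,
}
weights = {"counselling": 2.0}  # e.g. an encounter whose stated focus was counselling
scores = list(competencies.values())

# (1) Compensatory: a strong domain can offset a weak one.
average = sum(scores) / len(scores)

# (2) Non-compensatory: the overall rating cannot exceed the weakest domain.
non_compensatory = min(scores)

# (3) Weighted by the encounter's focus or purpose.
weighted = (sum(v * weights.get(k, 1.0) for k, v in competencies.items())
            / sum(weights.get(k, 1.0) for k in competencies))

print(round(average, 1), non_compensatory, round(weighted, 1))
# The same observed performance yields quite different overall ratings.
```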




주제 4. 평가에 관여되는 레지던트 수행 외적 요인

Theme 4. Factors external to resident performance that drive ratings


다음과 같은 요인이 영향을 주었다.

Several additional factors influenced faculty staff ratings, including

  • 맥락(시나리오의 복잡성, 레지던트의 과거 경험, 교수-레지던트 관계)
    context (the complexity of the encounter, the resident’s prior experience, the faculty–resident relationship) and
  • 피드백을 대하는 태도(레지던트의, 교수의, 기관의)
    response to feedback (by the resident, by the faculty member, by the institution).

맥락

Context


임상 시나리오가 얼마나 복잡한 것인지, 또는 레지던트가 그 시나리오에 얼마나 친숙해 보이는지 등이 영향을 주었다.

Contextual factors such as the complexity of the clinical scenario and perceptions of the resident’s familiarity with a clinical situation influenced how faculty members translated their observations into ratings:


교수-레지던트 관계가 얼마나 오래되었는지도 평가에 영향을 주었다. 교수들은 실제 레지던트와의 경험을 언급하면서, 레지던트와 종단적 관계를 맺고 있을 때는 그 레지던트가 이미 어떤 피드백을 받았는지 알고 있다고 설명했다. 피드백 이후에도 레지던트가 동일한 실수를 반복하면 평가자는 더 엄격해졌다.

The duration of the resident–faculty relationship also impacted ratings. Faculty members, referring to their experiences with actual residents, explained that when they had a longitudinal relationship with a resident, they knew what that resident had already received feedback on. The repetition of a mistake by the resident after feedback resulted in rater stringency:


반대로, 기존에 어떤 레지던트와 긍정적인 관계에 있었을 경우 평가가 더 느슨leniency해지고, 후광 효과가 있었다. 

By contrast, a pre-existing positive relationship with a resident was sometimes associated with rater leniency and the halo effect:



피드백을 대하는 태도

Response to feedback


레지던트가 점수에 대해서 어떤 반응을 보일지에 대한 우려가 종종 감정emotion으로 드러났다.

Emotions frequently stemmed from concern about residents’ reactions to numerical ratings that might be either high or low:

 

'만약 누군가 '만족' 범주에 들어간다면, 나는 6점을 줄 것이다. 왜냐면 왜 6점이 아니라 4점 혹은 5점이냐를 가지고 따지고 싶지 않기 때문이다'

‘If anyone’s in the satisfactory category, I tend to put them in the 6 range because I don’t want to be having a conversation about why it wasn’t a 6 instead of a



추가적으로, 건설적 피드백을 주는 것에 대한 교수들 자신의 정서적 반응과 레지던트의 감정에 미칠 영향에 대한 우려, 어떻게 레지던트가 교수를 인식할까에 대한 정서적 반응 등도 영향을 주었다.

In addition, the faculty member’s own emotional response to providing constructive feedback (i.e. feeling mean or unkind; being ‘demoralising’) and concern about the emotional impact on the resident and how the resident might perceive the faculty member seemed to mediate assessment:



반대로 어떤 교수들은 그들의 역할과 책임을 '코치'로서 인식했다

By contrast, other faculty staff were focused on their roles and responsibilities as coaches:



'나는 그들을 최대한 좋은 의사로 만들어야 한다. 나는 그들의 코치이다'

It’s to make them the best doctors they can be. I’m their coach.


마지막으로 여러 교수들이 기관 차원의 문화의 역할에 대해 언급했다.

Finally, several faculty members described the role of the broader institutional culture in guiding their ratings:


내가 점수를 낮게 준 학생과 학장과 셋이 나란히 앉아 내 점수를 디펜드 하는 것은 매우 불편하다. 그러한 상황에 다시는 놓이고 싶지 않다.

Sitting in very uncomfortable meetings with the dean and the student who I graded very poorly and having to sit there and defend my score. I don’t want to be in this situation again.’




DISCUSSION


 

교수들은 레지던트의 임상스킬을 평가할 때 관찰과 평가에 잠재적으로 영향을 미칠 수 있는 특성의 결합물amalgam of characteristics을 활용한다. 이러한 특성에는 나이/성별/임상 및 교육 경험/임상 및 교육 역량/태도와 감정/피드백에 대한 반응 등이 있다.

Faculty staff bring to resident clinical skills assessment an amalgam of characteristics that potentially impact on observations and assessment. These characteristics include age, gender, clinical and teaching experience, clinical and educational competence, and attitudes and emotions related to observation and feedback.


 

교수들은 레지던트와 환자의 상호작용을 두 가지 렌즈로 바라본다. 하나는 frame of reference이고, 다른 하나는 inference이다. 전자는 스스로의 역량에 대한 생각, 환자 성과에 대한 것, 다른 레지던트와의 비교 등에 대한 것이며, 후자는 관찰에 추가적인 의미를 부여하고 해석을 덧붙이는 것이다.

Faculty mem- bers observe resident and patient interactions through two lenses, one of which concerns a frame of reference (whereby faculty members use their own or other doctors’ performance, or patient outcomes, as a yardstick against which to compare resident performance) and one of which refers to inference which further shapes the meaning and interpretation assigned to observations.



피훈련자에 대한 교수들의 관측은 어떤 맥락 안에서 이뤄지고, 그 맥락적 요인에 의해 영향을 받는다. 여기에는 임상시스템(환자와의 친숙도, 환자의 복잡한 정도, 임상 유닛의 조직)과 교육시스템(조직 문화와 실수)이 있다.

The faculty member’s observation of the trainee with the patient occurs within and is influenced by contextual factors including the clinical system (i.e. familiarity with the patient, patient complexity, organisation of the clinical unit, etc.) and educational system (i.e. institutional culture and oversight).


관찰 동안에, 그리고 관찰 이후에 교수들은 관찰한 것을 해석하고 통합하여 점수로 변환한다. 그러나 이 프로세스는 깔끔하거나 예측가능하거나 단순하지 않다. 여러가지에 의해서 영향을 받는데, 점수를 어떻게 주느냐에 따라오는 피드백, 조직 문화, 시나리오의 복잡성 등이 있다. 이러한 영향은 관찰/피드백/평가에서 맥락이 중요하다는 것을 더 지지해준다.

During and after observation, the faculty member interprets and synthesises his or her observations into a rating. However, the process is not neat, predictable or straightforward. Multiple additional influences that can impact on ratings include anticipated feedback, institutional culture and encounter complexity. These influences further support the importance of context in observation, feedback and ratings.29,30


이러한 복잡한 상호작용은 상황인지이론situated cognition theory로 지지되는데, 이 이론에서는 개개인의 사고나 앎knowing, 처리processing이 그 사고나 행동이 일어나는 구체적 사회적 상황과 분리불가능하게 엮여 있다고 주장한다.

These complex interactions can be supported by situated cognition theory which contends that an individual’s thinking, knowing and processing are uniquely tied to and inextricably situated within (and cannot be completely separated from) the specific social situations within which those thoughts and actions occur.29,31,32


상황인지 이론은, '상황setting'이 기여하는 바를 인식하지 못하면 그 구인construct을 온전히 포착하지 못하는 사고에 대한 관점에 이르게 된다고 주장한다.

Situated cognition contends that failing to acknowledge the contributions of the setting leads to a perspective on thinking that cannot fully capture the construct;33


피평가자, 임상 상황, 교육 상황, 조직 문화, 교수들 자신 등이 영향을 준다. 

We have found that factors including the trainees, the clinical and educational setting, the institutional culture and the faculty members them- selves



교수들은 평가를 할 때 다수의 frame of reference를 활용한다. 우리가 놀란 점은 교수들이 매우 흔하게 자신의 현재 진료 스타일에 비추어서 레지던트를 평가한다는 사실이다. 이것이 특히 중요한 이유는, 교수들이 임상스킬의 proficiency에 있어서 매우 차이가 크기 때문이다. 그럼에도 이번 연구에서 교수들은 관찰과 평가에서 근거-기반 프레임워크 혹은 환자 성과를 거의 사용하지 않았다. 더 나아가서 동일한 수행에 대해서도 서로 다른 평가를 내리는 것은 레지던트가 스스로의 셀프-이미지와 맞지 않는 건설적 피드백은 무시하거나 검열해버릴 가능성이 있어서 잠재적으로 피드백 프로세스를 훼손시킬 수 있다.

Our data suggest that faculty staff approach assess- ment using multiple frames of reference. We were struck by how often self was used as a frame of reference, particularly by comparing a resident’s performance with one’s current practice style. This finding has important implications because faculty staff have variable clinical skill proficiency.20,34–36 Yet, in the present study, faculty staff rarely used evidence-based frame- works describing best clinical skills practices (e.g. informed decision making, patient communication) or patient outcomes37–39 to anchor observation and assessment. Variable frames of reference are also problematic for residents. Furthermore, variable assessments of the same performance can potentially undermine the feedback process if residents preferentially dismiss or censor constructive feedback that is incongruent with their self-image.40


수행능력에 대한 결론에 도달하기 위해서 교수들은 실제 자료(레지던트의 행동)를 활용하고, 그 행동들 중에서 선택하며, 거기에 의미를 부여해야 한다. 이러한 가정들이 결론을 형성하고, 그 결론이 다시 행동(평가와 피드백)으로 이어진다.

Reaching conclusions about performance requires faculty staff to use real data (resident behaviours), select from those behav- iours, and affix meaning to them. These assump- tions form conclusions which, in turn, lead to actions (the rating and feedback).41


비디오 시나리오에서는 교수가 SR과 대화할 기회가 없었지만, 라이브 관찰에서 피드백을 위해 SR을 직접 만났을 때에도 유사한 추론이 이뤄졌고, 그 추론은 확인되거나 질문되지 않았다.

Although the faculty member did not have the opportunity to talk with the SR in the video encounters, similar inferences were made and not questioned during the live observations when the faculty member met with the SR for feedback.



교수들은 관측결과를 판단(점수)로 변환하는데 어려움을 겪는다. 그러나 평가서식을 이용하는 교수들은 각 점수가 의미하는 바가 무엇이며, 어떻게 전반적 평가를 선택할 것인지를 알아야 한다. 우리가 아는 한, 교수개발 프로그램은 어떻게 관찰과 평가를 overall rating으로 통합하는지에 대해서 설명address해주지 않는다.

We found that faculty members struggled to translate their observations and judgements into numerical ratings. However, faculty members who use rating forms need to know what the numbers mean and how to select an overall rating. To our knowledge, faculty development has not addressed how observations and assessments should be synthesised into an overall rating.



교수들은 평가에 있어서 기준이나 정신모형을 공유할 필요가 있다. 이상적으로는 자기-기반의 프레임워크에서 준거-기반의 프레임워크로 바꿔나가야 한다. 특정 훈련단계에서 기대되는 역량이 무엇인가를 알고, milestone을 명확히 하는 것이 도움이 될 것이다.

Our findings suggest that there is a need to ensure that faculty staff approach assessment with a shared standard or mental model, ideally shifting from a self-based to a criterion-based framework. Knowledge of expected competencies and elucidation of milestones at par- ticular levels of training could be valuable to faculty staff who are required to make assessments.45



 


 


 










 2011 Oct;45(10):1048-60. doi: 10.1111/j.1365-2923.2011.04025.x.

Opening the black box of clinical skills assessment via observation: a conceptual model.

Author information

  • 1Department of Medicine, University of Pennsylvania, School of Medicine, Philadelphia, Pennsylvania 19104, USA. jennifer.kogan@uphs.upenn.edu

Abstract

OBJECTIVES:

This study was intended to develop a conceptual framework of the factors impacting on faculty members' judgements and ratings of resident doctors (residents) after direct observation with patients.

METHODS:

In 2009, 44 general internal medicine faculty members responsible for out-patient resident teaching in 16 internal medicine residency programmes in a large urban area in the eastern USA watched four videotaped scenarios and two live scenarios of standardised residents engaged in clinical encounters with standardised patients. After each, faculty members rated the resident using a mini-clinical evaluation exercise and were individually interviewed using a semi-structured interview. Interviews were videotaped, transcribed and analysed using grounded theory methods.

RESULTS:

Four primary themes that provide insights into the variability of faculty assessments of residents' performance were identified: (i) the frames of reference used by faculty members when translating observations into judgements and ratings are variable; (ii) high levels of inference are used during the direct observation process; (iii) the methods by which judgements are synthesised into numerical ratings are variable, and (iv) factors external to resident performance influence ratings. From these themes, a conceptual model was developed to describe the process of observation, interpretation, synthesis and rating.

CONCLUSIONS:

It is likely that multiple factors account for the variability in faculty ratings of residents. Understanding these factors informs potential new approaches to faculty development to improve the accuracy, reliability and utility of clinical skills assessment.

© Blackwell Publishing Ltd 2011.


PMID: 21916943 [PubMed - indexed for MEDLINE]


타당도주장(validity arguments)에 대한 현대적 접근: Kane의 프레임워크에 대한 실용 가이드(Med Educ, 2015)

A contemporary approach to validity arguments: a practical guide to Kane’s framework

David A Cook,1,2 Ryan Brydges,3,4 Shiphra Ginsburg3,4 & Rose Hatala5





INTRODUCTION


타당도 근거를 수집하고 해석하는 과정을 'validation'이라고 부른다.

The process of collecting and inter- preting validity evidence is called ‘validation’.


Messick6 and Kane7 은 지난 100년간 어떻게 validation이 발전해왔는가에 대해 설명했다. 간략히 요약하자면, 교육자들은 초반에는 두 종류의 validity를 인식했다.

Messick6 and Kane7 have offered detailed reviews of how validation has evolved over the past 100 years. To summarise very briefly (see Fig. 1), educators ini- tially recognised two types of validity:

  • 내용타당도: 평가 문항을 만드는 것과 관련
    content valid- ity (which relates to the creation of the assessment items), and
  • 준거타당도: 동일한 현상을 측정하는 레퍼런스기준reference standard과 점수가 얼마나 잘 correlate하는가
    criterion validity (which refers to how well scores correlate with a reference-standard mea- sure of the same phenomenon).

 

그러나 내용타당도는 거의 언제나 검사를 지지하는 결과를 보였고, 연구자들은 곧 레퍼런스기준을 찾고 validating하는 것이 매우 어렵다는 것을 - 특히 실체가 없는 특성(예: 프로페셔널리즘)에 대해서 - 알게 되었다. 명확하게 정의내릴 수 있는 준거가 존재하지 않는 상황에 대한 대안으로서, 이론가들은 '구인타당도'를 제시했다. 이는 '구인'의 개념이나 이론을 기반으로 실체가 없는 특성attributes(구인contruct)이 관찰가능한 특성attributes과 연결되어 있다는 것이다. 이렇게 되면 타당도를 검증할 때, 관찰가능한 특성을 측정하고, 이론적인 관계theorised relationship을 평가하면 된다.

However, content validity nearly always supported the test, and investi- gators quickly recognised that identifying and vali- dating a reference standard is very difficult, especially for intangible attributes (e.g. professional- ism). As an alternative for contexts in which no definitive criterion existed, theorists proposed con- struct validity,8 in which intangible attributes (con- structs) are linked with observable attributes based on a conception or theory of the construct. Validity can then be tested by measuring observable attri- butes and evaluating the theorised relationships.

 

 

전문가들은 곧 이러한 다양한 타당도 '유형'들이 (신뢰도 지표reliability metrics와 함께) 궁극적으로 구인-관련 관계construct-related relationship를 지지하(거나 반박하)는 공통 경로를 갖는다는 것을 인식하게 되었다. 이에 따라 연구자들은 서로 다른 '유형'의 타당도라는 개념을 버리고, '유일한 타당도인 구인타당도construct validity를 다수의 출처에서 나온 근거evidence from multiple sources가 지지한다'는 통합된 프레임워크에 이르게 되었다. 그러나 Messick의 프레임워크는 이후 폭넓게 받아들여졌음에도, 서로 다른 근거 출처들 사이의 우선순위를 정해주지 않으며, 평가에 따라 그 우선순위가 어떻게 달라질 수 있는지(예: 고부담 평가와 저부담 평가에서 어떻게 다른지)도 보여주지 않는다.

Experts soon realised that all these different ‘types’ of validity, together with reliability metrics, ulti- mately had the common pathway of supporting (or refuting) the construct-related relationships. This led researchers (as detailed by Messick6) to abandon the different ‘types’ of validity in favour of a unified framework in which construct validity (the only type) is supported by evidence derived from multi- ple sources. However, although Messick’s framework has subsequently been widely embraced,9,10 it does not prioritise among the different evidence sources or indicate how priority might vary for different assessments.11 


Kane의 프레임워크가 아름다운 점은, 양적 평가, 질적 평가, 평가프로그램에 모두 적용될 수 있다는 점이다. 이러한 다재다능함이 우리를 질적 자료와 주관적 자료가 점차 더 가치를 인정받고 있는, 그리고 다수의 평가자료가 일상적으로 통합되는 'post-psychometric era'로 옮겨가게 했다.

The beauty of Kane’s framework is that it applies equally to an individual quantitative assessment tool, a qualitative assessment tool, or a programme of assessment. Such versatility will be required as we move into a ‘post-psychometric era’ of assessment in which qualitative and subjective data are increas- ingly valued13 and multiple assessment data points of varying rigour are routinely integrated.14


의사결정의 초점과 그 결과

A FOCUS ON DECISIONS AND CONSEQUENCES


우리가 학습자를 평가할 때, 우리는 주로 '숫자'를 생성한다. 그러나 우리가 진짜로 원하는 것은 그 학습자에 대한 '판단decision'이며, 예컨대 '합격인가?' 하는 것이다.

When we assess a learner, we usually generate a number, but what we want – indeed, what we need – is a decision about that learner. Did he pass?


궁극적으로 validation은 그 판단의 방어가능성defensibility를 지지하는 근거를 수집하는 것이 전부이다.

Ultimately, validation is all about collecting evidence to support the defensibility of that decision.


임상의학에 비유하자면 이렇다. PSA 검사가 전립선암 선별검사로 유용한가? 근거를 살펴보면 재검사 시에도, 해마다 측정해도 결과가 상당히 재현가능reproducible하다. 그러나 이러한 긍정적 근거에도 불구하고 전문가 조직들은 대부분의 남성에게 선별검사를 권고하지 않는다. 이러한 불일치가 일어나는 이유는 추가 검사evaluation에 따르는 의도하지 않은 부정적 결과 때문이며, 더 중요하게는 대규모 무작위연구들이 서로 상충하는 결론에 도달했기 때문이다.

An analogy with clinical medicine may help to illus- trate this point. Is the prostate-specific antigen (PSA) test useful in screening for prostate cancer? Evidence suggests that values are quite reproducible on retesting and from year-to-year,16,17 Yet despite this favourable evidence, professional organisations rec- ommend against screening for most men.21–23 This incongruity arises because of the unintended adverse consequences of further evaluation24 and, more importantly, because large randomised trials have arrived at conflicting conclusions


이 임상 사례에서 중요한 교훈을 얻을 수 있다. 

From this clinical example we learn several impor- tant lessons


첫째, 모든 평가가 다 도움이 되는 것은 아니다.(낮은 점수가 불필요한 재교육활동으로 이어질 수 있다)

Firstly, not all assessments are beneficial. (if, for instance, a low score prompted unnecessary remediation activities).

 

둘째, 사람들은 동일한 근거를 두고도 서로 다른 결론에 이르를 수 있다.

Secondly, peo- ple may rightly arrive at different conclusions when interpreting the same evidence

 

셋째, 평가는 어떤 맥락에서는 유용하면서 다른 맥락에서는 그렇지 않을 수 있다(PSA검사의 특성은 연령에 따라 다르며, 교육상황에서 체크리스트는 절차적 기술을 평가하는데는 적합하나 임상현상에서의 미묘한 뉘앙스를 잡아내지는 못한다.)

Thirdly, an assessment might be useful in some contexts but not in others (e.g. PSA test properties vary by age; an education checklist may prove adequate for assessing procedural skills in a simulation-based context, but fail to capture important nuances of clinical practice).

 

넷째, 시험(검사)의 유용성은 목적에 따라 다르다(PSA검사는 일반적으로 암의 재발을 보는데 좋다. mini-CEX는 형성적 피드백을 제공하는데는 좋으나, 총괄적 목적이나 프로그램 평가의 목적으로는 덜 그러하다.)

Fourthly, the usefulness of a test may vary for differ- ent purposes (e.g. the PSA test is generally consid- ered useful in monitoring for cancer recurrence; the mini-clinical evaluation exercise [mini-CEX] seems appropriate as a tool for formative feedback, but may be less defensible when used for summative purposes or programme evaluation26).

 

다섯째, 평가 행위 자체가 개입intervention이다. Test-enhanced learning은 하나의 사례이다.(PSA검사와 PSA검사 비수행을 무작위연구 할 수 있으며, 평가시행 vs 평가비시행에 대해서도 무작위 연구가 가능하다.)

Fifthly, the act of assessment is in fact an intervention, as wit- nessed by research on test-enhanced learning,27 (e.g. one can conduct a randomised trial of PSA testing versus no PSA,19,20 or educational assessment versus no assessment28,29).



validation의 목적은 어떠한 의사결정과 그에 수반되는 결과가 유용한지를 평가할 근거를 수집하는 것이다.

The purpose of validation is to collect evidence that evaluates whether or not a decision and its atten- dant consequences are useful.



타당도 논거

THE VALIDITY ARGUMENT


타당도논거validity argument는 타당도근거의 수집과 해석을 이끈다. 한 조각의 근거가 단독적으로 결론을 내릴정도로 논란의 여지가 없는 경우는 거의 없다. 보통, 타당도논거는 다수의 근거들로 이뤄지며, 각각으로는 불완전하나 종합적으로 판단을 내리기체 충분하다.

The validity argument guides the collection and interpretation of validity evidence. Rarely is a single piece of evidence so incontrovertible that it single-handedly ‘makes the case’. Rather, the argument usually con- sists of multiple pieces of evidence, individually incomplete but collectively sufficient to convince the jury.


법정의 비유를 들자면,얼마나 많은 근거가 필요한가는 그에 따른 결정의 무게gravity에 달렸다.

Continuing the analogy of a legal argument, the amount of evidence required varies depending on the gravity of the pending decision.


같은 원칙을 적용할 수 있다.

Turning now to assessment validation, the same principles apply. (Fig. 2).



Brennan은 '고려해야 할 악마 같은 디테일이 있을 수는 있지만, 기본적 접근법은 간단하다'고 말했다. Kane은 '첫째, 제안된 해석이나 활용에서 주장하는 바를 진술한다(해석/활용 논증interpretation/use argument, IUA). 둘째, 이 주장들을 평가한다(타당도 논증validity argument)'고 했다. 이 두 단계 접근법, 즉 주장을 진술하고 그 주장을 평가하는 것은, 가설을 세우고 검증하는 일상적인 연구 관행과 다르지 않다.

Brennan34 observed: ‘There may be devilish details to be considered, but the basic approach is straight- forward.’ Kane12 declared: ‘First, state the claims that are being made in a proposed interpretation or use (the IUA [interpretation/use argument]), and second, evaluate these claims (the validity argu- ment).’ This two-step approach – stating and then evaluating claims – is analogous to the routine research practice of stating and then testing a hypothesis.


Kane의 프레임워크

KANE’S FRAMEWORK



그러나 가설의 가장 취약한 고리를 찾아내고, 그것을 평가할 검사를 계획하는 것은 그렇게 쉬운obvious 일은 아니다(Brennan이 말한 '악마의 디테일'과 같다). 다행히도, Kane은 타당도주장에 대해서 생각해볼 수 있는 프레임워크를 만들었다.

However, identifying the weakest links and assump- tions in the hypothesis, and planning the tests that will evaluate those assumptions, is rarely obvious (the ‘devilish details’ referred to by Brennan34). For- tunately, Kane has described a framework for think- ing about the validity argument


Kane은 평가를...다음과 같이 나눔

Kane traces an assessment from

  • 단일한 관찰결과(객관식 시험 점수)로부터의 점수
    the Scoring of a single observa- tion (e.g. multiple-choice examination question, skill station, clinical observation or portfolio item),
  • 관찰 점수(들)를 사용하여 시험 상황에서의 수행능력을 대표하는 전체 시험점수를 만들어내는 일반화
    to using the observation score(s) to generate an overall test score representing performance in the test setting (Generalisation),
  • 시험 상황에서의 점수를 실제 상황에서의 수행능력으로 추론하는 외삽
    to drawing an inference regarding what the test score might imply for real- life performance (Extrapolation), and then
  • 이 정보를 해석하고 결정을 내리는 함의
    to inter- preting this information and making a decision (Implications) (Fig. 3).


타당도논거는 다수의 추론inference를 포괄하는 다양한 출처의 근거를 포함해야 한다. 또한 가장 취약한 고리에 초점을 맞추는 것도 중요하다. Kane의 프레임워크가 가지는 장점 중 하나는 psychometric data에 지나치게 의존하지 않으며, 따라서 비-정량적 평가에도 적용가능하다는 것이다.

The validity argument should contain multiple sources of evidence that span several (if not all) inferences. It is also important to focus on the weak- est links (most questionable assumptions). One advantage of Kane’s framework is that it does not rely heavily on psychometric data, and thus the con- cepts apply readily to non-quantitative assessments (such as learning portfolios and narrative perfor- mance reviews).




제안하고자 하는 활용법을 정의함

Define the proposed use


합격/불합격 결과를 위한 평가와 단순히 raw score만 보고하면 되는 평가는 필요한 타당도 근거의 우선순위가 다를 수 있다. 의대 2학년생을 위한 의사소통 평가, 1년차 레지던트를 위한 평가, 정신과의사 혹은 정형외과의사를 위한 평가 등에서 다 다를 수 있다.

Tests intended to result in pass/fail decisions may require different prioritisation of validity evidence than those reporting raw scores. The validation of an assessment of communication skills for a second-year medical student, first-year resident or practising physician, or for a psychiatrist versus an orthopaedic surgeon. Interpretations to guide formative feedback or to establish a minimal level of competence



점수산출 추론

Scoring inference


각 평가는 몇 개의 수행능력을 관찰하는 것으로부터 시작하며, 이를 통해 공정/정확/재생산가능한 양적 점수를 생성해낸다

Each assessment begins with an observation of some performance to generate a fair, accurate, reproducible quantitative score (or an accurate and insightful narrative comment).



일반화 추론

Generalisation inference


일반화를 이해하기 위해서 우리는 '시험 상황'에서의 수행능력과 '실제 상황'에서의 것을 구분할 필요가 있다. 일반화는 '시험 상황'의 수행능력에 대한 것이다.

To understand Generalisation we need to distinguish performance in the ‘test world’ (formally the ‘uni- verse of assessment’) from that in the ‘real world’. Generalisation deals with test-world performance.


시험 상황universe of assessment에서 만들 수 있는 문항의 숫자는 이론적으로는 거의 무한하다.

In the universe of assessment, there are in theory a limitless number of items that we could create or select


궁극적으로 이 무한한 가능성의 우주에서 일부를 선택하여 어떤 표본을 만든다고 했을 때, 우리는 이 표본으로부터 모든 assessment universe로 일반화할 수 있기를 원한다.

The test items we ultimately select represent a sample of the items from this universe of possibilities. However, we ideally want to general- ise from this sample to the entire assessment uni- verse.


따라서 '일반화'란 '얼마나 문항들을 잘 선택했는가'에 대한 답을 찾는 것이다.

Thus, Generalisation seeks to answer the question: how well do the selected test items



근거를 살펴보면, 이 질문에 대한 답은 주로 두 가지 출처로부터 온다.

Evi- dence to answer this question comes from two primary sources:

  • 검사 영역 내에서 적절한 표본 선정을 위해서 선택한 방법
    methods taken to ensure adequate and appropriate sampling within the test domain, and
  • 완전히 새로운 표본을 사용했을 때 비슷한 점수를 얻을 가능성(재생산가능성, 신뢰도)
    empiric studies to determine the likelihood of obtaining similar scores if we use an entirely new sample of items (reproducibility or reliability).


적절한 표본 선정을 위해서 선택하는 방법에는 블루프린트 혹은 무작위 선택이 있다.

Methods to ensure appropriate sampling might include a test blueprint (across domains) or random sampling (within a domain)
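
블루프린트(영역 간 할당)와 영역 내 무작위 표집이 어떻게 결합되는지를 보여주는 스케치를 덧붙인다. 영역 이름, 문항 은행, 할당량은 모두 이 예시를 위해 임의로 만든 것이다.

```python
# Sketch of blueprint-driven sampling: fixed numbers of items per domain
# (across-domain blueprint) drawn at random within each domain.
# Domains, item pools, and quotas are invented for the example.
import random

item_bank = {
    "cardiology":  [f"cardio_{i}" for i in range(40)],
    "respirology": [f"resp_{i}" for i in range(30)],
    "ethics":      [f"ethics_{i}" for i in range(20)],
}
blueprint = {"cardiology": 4, "respirology": 3, "ethics": 2}  # quota per domain

random.seed(0)
test_form = {domain: random.sample(item_bank[domain], n)
             for domain, n in blueprint.items()}
for domain, items in test_form.items():
    print(domain, items)
```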


비-수치적 자료 혹은 assessment universe가 고도로 이질적인 경우와 같은 질적 연구에서는 '포화'라는 개념이 유용할 수 있다.

The qualitative research concept of saturation may be useful, espe- cially for non-numeric data or if the universe is highly heterogeneous:


수치 점수의 재생산가능성은 신뢰도를 이용해서 결정할 수 있다.(classical test theory 또는 Generalisability theory에서의 신뢰도)

The reproducibility of numeric scores can be empirically determined using reliability metrics, drawing on classical test theory or generalisability theory.


질적연구에서 개별 질적자료를 synthesis하는 것은 통찰을 제공하는/정확한/방어가능한 해석법이며, 양적자료의 generalisation에 비견될 수 있다. 평가자간 신뢰도를 대부분의 수치점수에 있어서 '에러'로 처리하지만, 질적 평가에 있어서 우리는 평가자간 차이를 수행능력에 대한 통찰을 제공해주는 가치있는 것으로 본다(다양한 관점). 서로 다른 자료 출처로부터 선택과 통합하는 것(triangulation), 언제 종료할 것인가를 결정하는 것(saturation)은 질적연구자료의 '일반화추론'을 도와준다.

For qualitative assessments, the synthesis of individ- ual pieces of qualitative data to form an insightful, accurate and defensible interpretation is analogous to quantitative generalisation. Whereas we treat inter-rater variability as error for most numeric scores, in qualitative assessments we view observer variability as representing potentially valuable insights into performance (i.e. different perspec- tives38,39). The method for selecting and synthesis- ing data from different sources (triangulation) and deciding when to stop (saturation) will inform the Generalisation inference for qualitative data.



외삽 추론

Extrapolation inference 


'외삽'은 시험 상황에서 실제 상황으로 나아가는 것이다.

Extrapolation takes us from the test-world universe to the real world.


'외삽'을 지지하는 근거는 주로 두 가지가 있다.

Evidence to support Extrapolation comes primarily from two sources:

  • 시험 영역에서의 점수가 실제 수행능력의 핵심 특성을 반영하게끔 하는 방법
    methods taken to ensure that the test domain reflects the key aspects of real perfor- mance, and
  • 시험 수행능력과 실제상황 수행능력의 관계를 평가하는 분석
    empiric analyses evaluating the relationship between the test performance and real-world perfor- mance.
다음과 같은 방법 사용 가능:

 

  • interview or poll experts
  • observe the actual task
  • think aloud
  • review past literature


그러나 known-group comparison은 상대적으로 약한 타당도 근거만을 제공하는데, 왜냐하면 관련성이 인과성을 의미하지는 않기 때문이다. 더 강력한 외삽근거는 시험점수가 실제상황 평가와 개념적으로 관련된 점수와 상관관계가 어떤지 보는 것이다.

How- ever, known-group comparisons offer relatively weak validity evidence because association does not imply causation.40 Stronger Extrapolation evidence can be collected by correlating test scores with scores from a conceptually related real-world assessment.
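
시험 점수와 실제 상황 평가 점수의 상관을 보는 외삽 근거가 어떤 형태의 분석인지 보여주는 장난감 예시이다. 원문 데이터가 아니라 시뮬레이션한 수치이며, scipy의 pearsonr을 사용한다.

```python
# Toy example of the correlational evidence described above: simulated test
# scores versus simulated workplace-based ratings for the same trainees.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(3)
ability = rng.normal(size=60)                                   # 60 hypothetical trainees
test_scores = ability + rng.normal(scale=0.7, size=60)          # test-world measure
workplace_ratings = ability + rng.normal(scale=0.9, size=60)    # real-world measure

r, p = pearsonr(test_scores, workplace_ratings)
print(f"r = {r:.2f}, p = {p:.3f}")   # a strong correlation would support Extrapolation
```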


질적연구에서, 외삽은 이해관계자들이 그 해석에 동의함을 보여주는 근거, 혹은 새로운 훈련이나 수행의 맥락에서 적용될 것이라는 기대 등에 의해서 더 지지될 수 있다. 

For qualitative assessment, Extrapolation might be further supported by evidence suggesting that stakeholders agree with the interpretations and anticipate that they will apply to new contexts in training or practice.


안타깝게도, 일반화와 외삽은 서로 반대로 작용할 수 있다. Kane은 '우리는 일반화를 희생하여 외삽을 강화시킬 수 있고(평가 과제가 평가 대상 영역을 최대한 반영하도록 함), 반대로 외삽을 희생하여 일반화를 강화시킬 수 있다(다수의 고도로 표준화된 과제를 사용)'고 말했다.

Unfortunately, Generalisation and Extrapolation are often at odds with one another. Kane7 notes: ‘We can strengthen extrapolation at the expense of generalisation by making the assessment tasks as representative of the target domain as possible, or we can strengthen generalisation at the expense of extrapolation by employing larger numbers of highly standardised tasks.’


함의 추론

Implications inference 



마지막 단계는 대상영역의 점수로부터 그 점수의 해석으로 나아가고, 그 해석으로부터 특정한 활용방법/결정/후속활동 으로 나아가는 것이다. Kane은 '검사 점수를 특정 방식으로 해석하는 것이 타당하다는 근거가 자동적으로 그 점수를 어떻게 활용할지에 대한 것까지 정당화시켜주는 것은 아니다' 라고 말했다. 또한 '완벽하게 정확한 정보를 바탕으로 하고 있더라도, 그에 따른 의사결정은 목적을 달성하지 못할 수도 있고, 목적을 달성하더라도 비용이 너무 많이 들어갈 수도 있고, 그냥 폐기될 수도 있다' 라고 했다. 또 다른 말로는, 우리가 비록 정확한 측정을 하였다 하더라도, 그 정보가 유용할 것인지(혹은 적절하게 활용될 것인지)는 또 다른 문제라는 것이다. 따라서 타당도 논거의 최종 단계는 이 평가가 학습자/이해관계자/사회에 미칠 여파를 평가하는 것이다.

The final inference moves from the target domain score to some interpretation about that score, and from that interpretation to a specific use, decision or action. As Kane7 states: ‘It is generally inappropriate to assume that evidence supporting a particular interpretation of test scores automatically justifies a proposed use of the scores.’ He also notes: ‘A decision procedure that does not achieve its goals, or does so at too high a cost, is likely to be abandoned even if it is based on perfectly accurate information.’7 In other words, even if we measure the attribute correctly, it doesn’t necessarily mean this information will be useful (or used well). Thus, the final phase in the validity argument evaluates the consequences or impact of the assessment on the learner, other stakeholders and society at large.42



평가의 여파에 대한 자료를 수집하는 가장 단순한 방법은 일부 학습자에게만 제공하는 것이다.

The most straightforward way to collect data regarding the consequences of assessment would be to offer the assessment to some learners but not to others.


그러나 이러한 연구는 대부분 연구자들에게 수행하기 어렵다. '함의추론'을 평가하는 더 현실적인 방법은 다음과 같다.

However, such studies are difficult to conduct and exceed the reach of most investigators. More achievable studies evaluating the Implications inference include

  • 기준 설정 연구
    standard-setting studies (discussed under Scoring),
  • 비-비교 연구
    non-comparative studies exploring intended and unintended consequences (e.g. what happens to learners who fail a key examination), and
  • 하위그룹간 차이 비교
    evaluations of differences in test performance among subgroups for which performance should be similar, such as men and women (differential item functioning; a minimal sketch follows this list).
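The sketch referred to in the last item above: a minimal, assumption-laden illustration of screening one item for differential item functioning using a Mantel-Haenszel common odds ratio computed across total-score strata. The function name and data layout are my own, not from the paper.

```python
import numpy as np

def mantel_haenszel_or(correct, group, total_score):
    """Common odds ratio for one item across total-score strata.
    correct: 0/1 item responses; group: 0 = reference, 1 = focal;
    total_score: matching variable (e.g. total test score)."""
    correct = np.asarray(correct)
    group = np.asarray(group)
    total_score = np.asarray(total_score)
    num, den = 0.0, 0.0
    for s in np.unique(total_score):
        idx = total_score == s
        a = np.sum((group[idx] == 0) & (correct[idx] == 1))  # reference, correct
        b = np.sum((group[idx] == 0) & (correct[idx] == 0))  # reference, incorrect
        c = np.sum((group[idx] == 1) & (correct[idx] == 1))  # focal, correct
        d = np.sum((group[idx] == 1) & (correct[idx] == 0))  # focal, incorrect
        n = a + b + c + d
        if n > 0:
            num += a * d / n
            den += b * c / n
    return num / den if den > 0 else float("nan")

# An odds ratio far from 1.0 flags potential differential item functioning.
```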

 

이처럼 질적 평가에서 최종 해석에 관한 전문가들의 동의를 평가하는 것, 그리고 학습자와 평가자에 대한 결정사항의 영향력을 평가하는 것이 함의추론을 지지한다.

Likewise, in qualitative assessments, evaluating the agreement of experts with final interpretations and the impact of decisions on learners and raters would support the Implications inference.


이러한 질문에 답하는 것이다.

  • 시험에서 떨어진 학생들과 통과한 학생들에게는 어떤 일이 생기는가?
    what happened to those learners who failed the test and those who passed? 
  • 재교육remediation이 후속 평가에서 수행능력의 향상을 가져왔는가?
    Did remediation result in improved performance on follow-up assessment?

PUTTING THE ARGUMENT TOGETHER


일관성있는 논거의 계획과 제시

Planning and presenting a coherent argument


비록 Kane이 어떤 순서로 타당도 근거를 수집하고 평가해야하는가를 명시하진 않았지만 논거의 phase에 따른 자연스러운 진행과정이 있다.

Although Kane does not specify the order in which validity evidence should be collected and evaluated, there seems to be a natural progression that aligns the phases of the argument (from left to right in Fig. 3) with the priority and sequence of collecting empiric evidence. It seems natural...

  • to solidify evidence regarding the scoring rubric before analysing the generalisability of those scores,
  • to evaluate generalisability before extrapolating to real life, and
  • to confirm relationships with real-life performance before attempting to confirm the impact of assessment on meaningful outcomes.

타당도논거의 모든 추론들이 모두 가치가 있지만, 그 중요도가 모두 같지는 않다. '일반화'는 형성평가를 강조하는 상황에서는 덜 중요하고, '외삽'은 실제 상황에서의 수행을 직접 관찰하는 상황에서는 덜 중요하다. 이러한 중요도의 차이는 근거를 수집하기 전에 '가설을 명확히 설정하는 것(IUA)'이 필요함을 강조한다.

Although all of the inferences in the validity argument merit some attention, they are not all of equal importance. Generalisation may be less important when the emphasis is on formative feedback, and the Extrapolation inference may be less important for assessments (both qualitative and quantitative) that rely on direct observation of real clinical performance, as the underlying assumptions are relatively plausible. This underscores the need to clearly state the hypothesis (the interpretation/use argument) before collecting evidence!

실제 관찰 결과는 하나의 추론 내에서(일반화 내에서 근거가 상충할 수 있음), 추론 간에서(일반화 논거에는 긍정적이나 외삽 논거에는 부정적일 수 있음), 그리고 맥락에 따라 서로 다를 수 있다. 사전에 IUA(the interpretation/use argument)와 평가 계획을 구체화하는 것이 이러한 관찰결과를 통합하는 데 도움이 된다.

Empiric findings often disagree within an inference (e.g. conflicting evidence for Generalisation), between inferences (e.g. favourable Generalisation but unfavourable Extrapolation), and across different contexts or research studies. A pre-specified interpretation/use argument and evaluation plan helps to integrate such findings.


 

타당도 근거 축적building의 흔한 오류

Flaws in building the validity argument


교육자들은 흔히 한 목적 혹은 한 맥락에서 validate된 검사는 다른 것에서도 그러할 것이라고 가정하는 실수를 범하곤 한다. 실제로는, 모든 평가가 interpretation and use마다 validation되어야 한다. Kane이 언급한 실수의 유형들.

Educators commonly make the mistake of assuming that a test validated for one purpose or context is valid for another. In reality, all assessments must be validated for each new proposed interpretation and use. Kane7 identified a number of other flaws in building the validity argument.

  • 제한된 근거만을 평가하여 해석이나 결정이 타당하다고 판단내리는 것
    Firstly, educators often conclude that interpretations and decisions are valid after evaluating limited evidence.
  • 주어진 목적에서 필요한 것보다 더 과도한ambitious 논거를 요구하는 것(비평가, 연구자, 규제기준)
    Secondly, critics, naïve investigators or inappropriate regulatory requirements might propose an argument that is more ambitious than required for a given purpose.
  • 수집하기 쉬운 근거만 수집하고, 더 질문의 여지가 있는 것에 대해서는 근거를 수집하지 않는 것
    Thirdly, investigators often collect easy-to-measure evidence for assumptions that are already plausible; this typically occurs at the expense of addressing other more questionable assumptions, and can be misleading if the sheer quantity of evidence obscures important omissions.



Kane의 프레임워크의 실용적 적용

PRACTICAL APPLICATION OF KANE’S FRAMEWORK



 

임상상황: 제안된 '함의'를 지지하기 위해서, 우리는 특정 질병에 대한 스크리닝과 이후 질병을 치료하는 것이 아무것도 안하고 기다리는 것보다 장기적으로 임상성과가 더 나은지를 알고 싶어한다.

Finally, to support the proposed Implications we would want to know that screening for a disease and then treating it yields better long-term clinical outcomes than waiting for the disease to become clinically apparent, and that adverse effects of the treatment do not outweigh the benefits.


 

양적연구상황: 제안된 '함의'를 지지하기 위해서, 우리는 다음과 같은 것을 알고 싶어한다.

Finally, to support the proposed Implications, we would want to know

  • 딜레이 결정을 내린 것이 patient care를 향상시키는가 that decisions to delay operating privileges improve patient care,
  • 재교육으로 향상이 되는가 that remediation leads to objective improvement,
  • 레지던트가 좋다고 느끼는가 that residents perceive a benefit, and
  • 이러한 딜레이가 레지던트나 교육 프로그램에 부담이 되지는 않는가 that the delay does not impose an excessive burden on residents or training programmes.

 

그러나 이를 지지하는 근거는 아직 없다.

However, virtually no evidence has been reported to support the Implications inference.63


 

질적연구상황:

Finally, we consider the use of narrative comments (qualitative data) from supervisors assessing residents’ clinical performance to make decisions about promotion to the next training year.

 

Scoring inference 를 위해서는 다음을 보고자 함

To support the Scoring inference we would expect to see

  • that questions prompt a variety of relevant narrative data,
  • that assessors have actually observed the behaviours they are asked to assess, and
  • that narrative comments provide a rich, detailed description of observed behaviours.

 

Generalisation inference 를 위해서는 다음을 보고자 함

To support Generalisation, we would expect to see

  • that narratives have been solicited from people representing a variety of clinical roles,
  • that the narratives collectively form a coherent picture of the resident, and
  • that those conducting the interpretive analysis have appropriate training or experience.

 

Extrapolation inference 를 위해서는 다음을 보고자 함

To support Extrapolation, we would anticipate

  • that those providing raw narratives agree with the synthesised ‘picture’ and
  • that the qualitative narrative agrees with other data (qualitative or quantitative) measuring similar traits.

 

Implications inference 를 위해서는 다음을 보고자 함

Finally, to support the proposed Implications, we would want to know

  • that both those providing narratives and the residents themselves agree with the decision based on these narratives, and
  • that actions based on these decisions have the desired effect.

 

We found evidence to support many, but not all, of these propositions (Table 2).64–77



CONCLUSIONS


결론적으로 네 가지를 강조하고자 한다

In conclusion, we emphasise four points.

 

첫째, validation은 끝이 아니라 과정이다. 검사가 'validate'되었다고 말하는 것은 그 과정을 수행했다는 것을 의미할 뿐, 의도한 해석, validation 과정의 결과, 그 과정이 이뤄진 맥락 등을 알려주는 것은 아니다.

Firstly, validation is not an endpoint but a process. Stating that a test has been ‘validated’ merely means that the process has been applied, but does not indicate the intended interpretation, the result of the validation process or the context in which this was done.

 

둘째, 이상적인 validation은 명확한 IUA(해석/활용 논거)를 기술하는 것으로부터 시작된다. 핵심 주장과 가정을 정의하는, 신중하게 계획된 IUA로 이어지며, 그 후에야 논리적·경험적 근거를 수집하고 조직하여 타당도 논거를 구성한다.

Secondly, validation ideally

  • begins with a clear statement of the proposed interpretation and use (decision),
  • continues with a carefully planned interpretation/use argument that defines key claims and assumptions, and
  • 앞의 두 단계가 이뤄진 후에야 다음 단계로 진행됨
    only then proceeds with the collection and organisation of logical and empirical evidence into a substantiated validity argument.

 

셋째, 가장 취약한 고리에 초점을 둬야 한다.

Thirdly, educators should focus on the weakest links (most questionable assumptions) in the chain of inference.

 

넷째, 여기서 제시된 임상 및 교육 사례 모두에서 점수/일반화/외삽 근거는 상당히 강하다. 실제 세계의 점수로부터 구체적 결정으로 나아가는, 실행가능한 함의implication를 추론하려 할 때에야 비로소 중요한 결함들이 드러난다. 이러한 이유로, 우리는 함의와 그에 따른 결정이 타당도 논거에서 궁극적으로 가장 중요한 추론이라고 믿는다.

Fourthly, in all of the clinical and educational examples cited herein, the Scoring, Generalisation and Extrapolation evidence is fairly strong; only when we attempt to infer actionable Implications, moving from the real-world score to specific decisions, do important deficiencies come to light. For this reason, we believe that the Implications and associated decisions are ultimately the most important inferences in the validity argument.

















 2015 Jun;49(6):560-75. doi: 10.1111/medu.12678.

A contemporary approach to validity arguments: a practical guide to Kane's framework.

Author information

  • 1Mayo Clinic Online Learning, Mayo Clinic College of Medicine, Rochester, Minnesota, USA.
  • 2Division of General Internal Medicine, Mayo Clinic, Rochester, Minnesota, USA.
  • 3Department of Medicine, University of Toronto, Toronto, Ontario, Canada.
  • 4Wilson Centre, University Health Network, Toronto, Ontario, Canada.
  • 5Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Abstract

CONTEXT:

Assessment is central to medical education and the validation of assessments is vital to their use. Earlier validity frameworks suffer from a multiplicity of types of validity or failure to prioritise among sources of validity evidence. Kane's framework addresses both concerns by emphasising key inferences as the assessment progresses from a single observation to a final decision. Evidence evaluating these inferences is planned and presented as a validity argument.

OBJECTIVES:

We aim to offer a practical introduction to the key concepts of Kane's framework that educators will find accessible and applicable to a wide range of assessment tools and activities.

RESULTS:

All assessments are ultimately intended to facilitate a defensible decision about the person being assessed. Validation is the process of collecting and interpreting evidence to support that decision. Rigorous validation involves articulating the claims and assumptions associated with the proposed decision (the interpretation/use argument), empirically testing these assumptions, and organising evidence into a coherent validity argument. Kane identifies four inferences in the validity argument: Scoring (translating an observation into one or more scores); Generalisation (using the score[s] as a reflection of performance in a test setting); Extrapolation (using the score[s] as a reflection of real-world performance), and Implications (applying the score[s] to inform a decision or action). Evidence should be collected to support each of these inferences and should focus on the most questionable assumptions in the chain of inference. Key assumptions (and needed evidence) vary depending on the assessment's intended use or associated decision. Kane's framework applies to quantitative and qualitative assessments, and to individual tests and programmes of assessment.

CONCLUSIONS:

Validation focuses on evaluating the key claims, assumptions and inferences that link assessment scores with their intended interpretations and uses. The Implications and associated decisions are the most important inferences in the validity argument.

© 2015 John Wiley & Sons Ltd.


완전학습에서의 평가: 타당도와 정당화의 핵심 이슈(Acad Med, 2015)

Making the Case for Mastery Learning Assessments: Key Issues in Validation and Justification

Matthew Lineberry, PhD, Yoon Soo Park, PhD, David A. Cook, MD, MHPE,

and Rachel Yudkowsky, MD, MHPE






교육연구라는 활동 영역에서 타당도 검증과 정당화validation and justification는 중요한 활동이다. 새롭게 나타난 근거가 오랫동안 해온 평가의 타당성을 반박할 수도 있으며, 검사점수의 해석과 활용에 관한 논란이 최고 법원highest courts of law까지 이어지기도 한다.

Validation and justification are important activities in the educational research enterprise; new evidence may show long-standing assessment practices to be invalid,8 and controversies about interpretations and uses of test scores have risen to the highest courts of law.9


그러나 완전학습에서의 평가는 점수의 해석과 활용이 표준적 평가와 다르며, 타당도와 정당화 과정에도 변화가 필요하다.

However, mastery learning assessments entail interpretations and uses of scores that differ from those of standard assessments, requiring changes in validation and justification practices.


완전학습 평가의 해석과 활용

Interpretations of and Uses for Mastery Learning Assessments


'완전(마스터)'이 의미하는 바는 무엇인가? 구어적으로는 높은 수준의 전문성을 말한다. 그러나 '완전학습'에 있어서 '완전(마스터)'란 단순히 '다음 교육 단계로 넘어갈 수 있게 준비되었음'을 말한다. 의과대학생이 돌연변이를 잘 이해하여 genetic transmission에 대해 배울 준비가 되어있다고 하더라도, 일반적인 관점에서 그 주제를 '마스터'했다고 볼 수는 없고, 다음 단계로 넘어갈 준비가 되었음을 말한다.

What does “mastery” mean? Colloquially, it suggests a high level of expertise. However, for mastery learning, it only means readiness to proceed to the next phase of instruction. A medical student who understands mutagenesis enough to learn about genetic transmission has almost certainly not “mastered” mutagenesis in the lay sense but may have mastered it enough to move on to the next educational unit.


한 학습유닛을 다 마친 학습자는 - 비록 완전학습적 관점에서의 '마스터'임에도 - 자신이 그 내용을 진짜로 일반적 관점에서 '마스터' 했다고 믿고 있을 수 있다. 마찬가지로, 교육자들도 마스터 기준을 정해달라는 요청을 받을 때, '마스터'라는 단어가 일반적으로 쓰이는 의미가 덧씌워지면서, 교육자들은 불필요하게 높은 기준을 설정할 수도 있다.

Learners who advance through a unit may believe they have “mastered” its content in the lay sense when they have only done so in the mastery learning sense. Conversely, educators asked to set mastery standards may set unnecessarily high standards, letting the lay connotation of “mastery” color their judgments.


학습자들은 얼마나 오래 '마스터' 수준을 유지해야 할까? 완전학습 모델에서 성취도는 종종 훈련이 종료된 직후에 평가된다. 대부분의 의학교육에서의 학습단위unit가 이후에 배울 많은 학습단위unit과 연결되어 있는 반면, 학습자의 성취도는 시간이 지나면서 종종 쇠퇴한다. 더 나아가서, 단기적 성취mastery를 최대화하기 위한 여러 학습활동이 오히려 그 성취의 장기적인 유지와 일반화에는 반대로 작용하기도 한다.

How long are learners expected to retain “mastery”? In mastery learning models, achievement is often assessed immediately after the completion of training. Yet most learning units in medical education are connected to many later units, and achievement often decays rapidly following training.13 Moreover, many learning activities that maximize short-term mastery are precisely the opposite of those that support long-term retention and generalization of mastery.14



완전학습 평가를 훈련 직후에 시행되는 평가로만 제한하는 것은 (균일하고, 오래 지속되는 역량을 갖추게 하려는) 완전학습 시스템의 의도를 전복시킬 수도 있는 것이다.

limiting mastery learning assessment to the period immediately following training could subvert the intent of the mastery system, which is to ensure uniform, enduring competence.15


'마스터'는 지식이나 스킬의 완전성을 의미할 수도 있다. 어떤 맥락에서는 '마스터'는 학습자가 해당 영역의 모든 하위영역subunit에서 충분한 역량을 갖췄음을 의미하기도 한다.

Mastery also may connote a completeness of knowledge or skill. In some contexts, mastery means that a learner has achieved sufficient competence in all the subunits of a content area


그러한 경우에, 예컨대 학습자가 절차 과제에서 90%를 달성했더라도 놓친 10%가 심각한 오류에 해당한다면 '마스터'를 수여하는 것은 부적절하다. 이러한 비보상적noncompensatory(conjunctive) 점수 계산 방식에서는 학습자의 수행능력을 각각의 subunit에 대해 최소 기준에 비추어 평가하며, 모든 subunit을 통과했을 때에만 '마스터'를 받을 수 있다.

In such situations, for example, if a learner scores 90% on a procedural task but the missed 10% reflect a serious error, the designation of mastery would be inappropriate.16 In such noncompensatory (i.e., conjunctive) scoring, learners’ performance on each subunit would be evaluated against a minimum standard, and mastery would be achieved only when the learner passes all subunits.
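A minimal sketch of the noncompensatory (conjunctive) decision rule described above; the subunit names and cut scores are hypothetical and chosen only to mirror the "90% with one serious error" example.

```python
# Hypothetical subunits and their minimum standards (not from the article)
subunit_cuts = {"history": 0.80, "physical_exam": 0.80, "sterile_technique": 1.00}

def mastery_decision(subunit_scores, cuts=subunit_cuts):
    """Conjunctive rule: mastery only if every subunit meets its own cut score."""
    return all(subunit_scores[name] >= cut for name, cut in cuts.items())

# Overall 89% correct, but the missed material falls on a critical subunit
learner = {"history": 0.92, "physical_exam": 0.85, "sterile_technique": 0.90}
print(mastery_decision(learner))  # False: one subunit is below its standard
```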



완전학습 모델의 핵심적 전제는 통과하고 다음으로 넘어가거나, 실패하고 현 과정을 반복하거나 이 두 가지 중 하나라는 것이다. 중간 지점은 없다. 따라서 '통과'기준은 반드시 엄격하게 설정되어야 한다.

the central inference in the mastery model is pass and advance or fail and repeat; there is no middle ground. Thus, the passing standard must be established with great rigor.


 

진점수가 '마스터 판정 기준' 근처에 있는 학습자(예: 합격점수로부터 1 측정표준오차 이내)에 대해서는 정밀한 측정이 우선되어야 한다. 이 범위에서 변별도가 높은 문항을 과다표집oversample해야 하며, 그러한 문항을 찾으려면 문항반응이론item response theory과 같은 정교한 psychometric 접근이 필요할 수 있다.

for learners whose true scores are within range of the mastery standard, perhaps within one standard error of measurement from the cut score, precise measurement becomes the priority. Assessment items that discriminate well in this range should be oversampled, though identifying such items may require sophisticated psychometric approaches, such as item response theory.17
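One way to make "items that discriminate well near the cut score" concrete is item information under a two-parameter logistic (2PL) IRT model; the item pool and cut score below are hypothetical, and item response theory is named in the passage only as an example of such approaches.

```python
import numpy as np

def item_information_2pl(theta, a, b):
    """Fisher information of a 2PL item at ability theta:
    I(theta) = a^2 * P * (1 - P), with P = 1 / (1 + exp(-a * (theta - b)))."""
    p = 1.0 / (1.0 + np.exp(-a * (theta - b)))
    return a ** 2 * p * (1 - p)

# Hypothetical item pool: (discrimination a, difficulty b)
pool = [(0.6, -1.5), (1.4, 0.1), (1.1, 0.0), (0.9, 2.0), (1.6, -0.2)]
cut_theta = 0.0  # mastery standard expressed on the ability scale

info = [item_information_2pl(cut_theta, a, b) for a, b in pool]
ranked = sorted(zip(info, pool), reverse=True)
print("items ranked by information at the cut score:")
for i, (a, b) in ranked:
    print(f"  a={a:.1f}, b={b:+.1f}  ->  info={i:.2f}")
```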


고부담 시험을 위해서 그러한 문항은 보안을 철저히 해서 부적절하게 학생들에게 노출되는 것을 방지해야 한다(선배 학생이 후배 학생에게 물려주는 것 등). 이는 해당 문항에 대한 측정 정밀도를 떨어뜨릴 수 있다. 문항의 노출disclosure를 방지하는 방법으로는, 무엇을 맞추고 틀렸는지, 왜 그런 점수를 받았는지 알려주지 않는 것이 한 방법이 될 수 있다. 대신 학생들은 총점만 알게 된다. 이렇게 할 경우에, 이러한 범위에 있는 문항에 대해서는 측정의 정밀성을 위하여 '평가에 기반한 피드백'을 희생해야 할 수도 있다.

For high-stakes examinations, such items also need to be kept secure from inappropriate disclosure to examinees (e.g., senior students sharing test items from previous years with junior students), which would compromise the measurement precision of those items. Preventing such disclosure likely requires that, for any given item, educators not divulge which answers are correct versus incorrect nor the reasons they are so scored; instead, examinees are likely only to be told their total score across many items. As such, for items in this range, beneficial assessment-based feedback will often need to be sacrificed to maintain measurement precision.


이러한 평가를 반드시 사용하게 되는 시점은 학습자들이 다음 단계로 넘어갈지를 결정하는 시점이다. 이 결정에 대해서 두 가지 핵심 디테일 있다.

(1)통과하지 못한 학습자들에게 들어가는 자원과 이들을 위한 정책,

(2)마스터 기준을 계속 충족하지 못하는 학생들에 대한 특별한 조치consequences.

 

완전학습 평가점수를 다른 방식으로 사용하면 의도하지 못한 결과가 나올 수도 있다. 예컨대, '빨리 교육과정을 마스터한 학생에게 dean's letter를 수여'하는 경우 "마스터까지 걸리는 시간"이 새로운 성취지표가 되면서 학습자들로 하여금 교육과정을 '마스터'하기보다는 빨리 해치워rush through버리게끔 만들기도 한다.

The most obvious use of such assessments is for deciding when to advance learners in the curriculum. Two key details related to this decision are (1) the resources and policies in place for learners who do not pass, and (2) any special consequences for learners who fail persistently to meet mastery standards. Other uses of mastery scores exist but may have unintended consequences. For instance, a dean’s letter to a residency program that extols a medical student who quickly mastered the curriculum inadvertently makes “time to mastery” a new achievement indicator, perhaps encouraging learners to rush through the curriculum rather than truly mastering it.

 

 

 




타당도 근거: 내용

Sources of Validity Evidence: Content


완전학습시스템에서 사전시험을 볼 수도 있으며, testing effect를 통해 학습이 강화될 수 있고, 일부 학습자들로 하여금 이미 유닛을 마스터 한 경우 그 유닛을 넘어가게 해줄 수도 있다. 이러한 시스템에서 대부분의 학습자는 최소한 두 차례의 평가 - 사전시험, 사후시험 - 를 치르게 된다. 추가적으로, '마스터'에 대한 정의를 어떻게 내리느냐에 따라서 단순히 수행의 '산출물product'이 아니라 수행의 '과정how'이 핵심 평가준거가 될 수도 있다. 예컨대, 봉합기술의 '마스터'를 아무런 의식도 하지 않고 (무의식적으로 이뤄지는) automatical한 봉합의 수행으로 정의할 경우, 적절한 평가방법은 학습자의 집중력이 방해받는distract되는 상황에서도 그것을 잘 해내느냐가 되어야 한다.

mastery systems may include pretests before instruction begins, possibly enhancing learning via the testing effect19 and allowing some learners to skip already-mastered units entirely. In such systems, most learners complete at least two assessments—a pretest and at least one posttest. Additionally, depending on one’s definition of mastery, certain aspects of how learners perform may be key criteria, beyond simply the products of their performance (e.g., correct answers or completed procedural tasks). For instance, if one defines mastery of suturing skill as the ability to suture automatically, with minimal to no conscious thought, a suitable assessment must detect when learners can suture even while they are distracted.21



타당도 근거: 응답 절차

Sources of Validity Evidence: Response Process


완전학습 시스템에서 재시험은 내용에 관한 보안에 위협이 될 수도 있고, 학습자가 어떻게 평가문항에 응답하는지에 영향을 준다.

Retesting in mastery learning systems could in some cases create a content security threat that may be evident in how learners respond to assessment items. 


요령이 좋은 학습자들은 'test-wise'해지기 위해서 완전학습평가시험을 일부러 치른 다음에, 부족한 부분만 재빨리 채워서 재시험을 볼 수도 있다.

Savvy learners might deliberately take a mastery examination for which they are not prepared to become “test-wise,” and then study only enough to briefly regurgitate the required information on a retest.


답안을 암기해가는 학습자들에 대한 가장 직접적인 해결책은 (비록 자원이 많이 드나) 충분히 큰 문제(내용)은행을 만드는 것이다. 또는 학습자의 추론과정을 묻는 방법 역시 가능하다. 예를 들어, '정답이 무엇이냐'를 묻기보다는 '왜 그것이 정답이냐'를 물을 수도 있다. 그러나 이러한 더 심화된 이해는 '정답을 고르는 능력'과는 다른 구인을 대변하고 있음이 증명된 바 있기도 하다. 다행히도, 내용의 보안문제는 일부 영역에서는 문제가 되지 않는다. 예를 들어 임상스킬 절차의 체크리스트는 모든 단계를 만족스러운 수준으로 수행할 수 있도록 학습자들에게 제공되기도 한다.

The most straightforward solution to the problem of learners memorizing answers is to build larger content banks (e.g., more items, more scenarios), though this is admittedly resource intensive. Probing learners’ reasoning for the answers they select to detect superficial memorization also may be possible; for instance, one may ask not only what the correct answer is on a multiple-choice examination but also why it is correct. However, such deeper understanding is a demonstrably different construct than the ability to recognize correct answers.22 Fortunately, content security is not a concern for some types of content; for instance, procedural checklists are given freely to learners with the expectation that they will be able to demonstrate all procedural steps satisfactorily.


타당도 근거: 내적 구조와 신뢰도

Sources of Validity Evidence: Internal Structure and Reliability


즉, 동일 수행능력 영역에서의 점수는 평가 상황에 무관하게 신뢰성있어야reliable across 한다.

namely, scores reflecting the same dimension of performance should ideally be reliable across each test condition.


엄격하게 말하자면, 완전학습 평가에서 신뢰도는 '마스터'와 '비마스터'를 얼마나 일관되게 구분하느냐에 의해서만 정의된다. 전통적인 신뢰도 통계치들(알파계수, 검사-재검사 상관)은 진점수의 전체 범위에 걸쳐 학습자들을 변별하는 것의 신뢰도에 관한 것이다. 그러나 특정 cut score에서의 통과/탈락 결정의 신뢰도는 가능한 전체 점수 범위에 걸친 동일한 평가의 평균적 신뢰도와 크게 다를 수 있다. 일반적으로, 평균적 수행능력 수준에 가까운 cut score가 가장 덜 reliable하며, 극단적으로 높거나 낮은 cut score는 매우 reliable한 경우가 많다. 적절하게 변형된 신뢰도 공식들(conditional error variance absolute decision generalizability coefficient, decision-consistency reliability indices 등)이 이미 존재하며, 완전학습평가에서는 이를 사용해야 한다.

Strictly speaking, reliability in mastery learning assessments is defined only in terms of how consistently the mastery versus nonmastery distinction is made. Common reliability statistics, such as coefficient alpha and test–retest correlations, refer to the reliability of discriminations between learners across the full range of their true scores. However, the reliability of a pass/fail decision at a particular cut score can be dramatically different from the average reliability of the same assessment across the range of possible scores. Generally, cut scores at or near the average learner performance level will be the least reliable, whereas extremely high or low cut scores are often highly reliable.23 Suitably modified reliability equations are available and should be used for mastery learning assessments, including the conditional error variance absolute decision generalizability coefficient24 and decision-consistency reliability indices.25,26
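A minimal sketch of one decision-consistency style index: the proportion of learners who receive the same pass/fail classification on two parallel forms or administrations. This is only one of several published indices cited in the passage, and the toy scores are invented.

```python
import numpy as np

def decision_consistency(form1, form2, cut):
    """Proportion of examinees classified the same way (pass/fail)
    on two parallel forms or administrations."""
    pass1 = np.asarray(form1) >= cut
    pass2 = np.asarray(form2) >= cut
    return np.mean(pass1 == pass2)

# Toy scores for the same learners on two parallel forms
form_a = np.array([78, 85, 90, 69, 74, 92, 81, 70])
form_b = np.array([74, 88, 87, 72, 69, 95, 79, 73])
print(decision_consistency(form_a, form_b, cut=75))  # 0.875 for this toy data
```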


만약 학습자가 언제 마스터평가 시험을 치를지 선택할 수 있다면, 학생들의 시험점수는 매우 비슷할 것이다(대부분이 합격선에 있음). 이러한 경우에는 점수의 variance가 작아지고, 신뢰도 추정계수가 약화attenuate될 것이다. 완전학습시스템의 목표는 - 모든 학습자가 균일한 성취를 하는 것으로 - 전통적인 신뢰도 추정과는 잘 맞지 않는다. 동시에, remediation과 retraining이 문항 수준의 점수 variation에 영향을 미칠 수 있으며, 신뢰도를 상승시킬 수도 있다. 따라서, 재시험의 빈도에 따라 완전학습평가는 안정적이지 못한 신뢰도 추정reliability estimates을 보여줄 수도 있다.

If learners can choose when to take the mastery assessment their total test scores will be very similar (i.e., very near the passing score). In situations of such reduced score variance (i.e., restriction in range), reliability estimates will be attenuated. The very goal of mastery learning systems—uniform achievement from all learners—is thus at odds with classical reliability estimation. At the same time, remediation and retraining can affect item-level score variation and may actually increase reliability. Therefore, depending on the frequency of retesting, mastery learning assessments can show unstable reliability estimates. 
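A small simulation, under assumed parallel-forms conditions, illustrating how restriction of range near a cut score attenuates a classical reliability estimate; all numbers are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
true_score = rng.normal(0, 1, n)
form_a = true_score + rng.normal(0, 0.5, n)   # two parallel forms:
form_b = true_score + rng.normal(0, 0.5, n)   # true score + independent error

def reliability(a, b):
    """Parallel-forms reliability estimated as the correlation of the two forms."""
    return np.corrcoef(a, b)[0, 1]

full = reliability(form_a, form_b)

# Restrict to learners whose observed scores sit near a passing standard
near_cut = np.abs(form_a - 0.0) < 0.5
restricted = reliability(form_a[near_cut], form_b[near_cut])

print(f"full-range reliability:  {full:.2f}")        # close to the simulated 0.80
print(f"restricted-range value:  {restricted:.2f}")  # markedly attenuated
```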


연장선상에서, 이 이슈는 요인분석을 통한 내적 구조 분석도 어렵게 만드는데, 왜냐하면 요인분석을 하려면 subject와 item 사이에 일정정도의 variance가 존재해야 하기 때문이다.

By extension, these issues may limit one’s ability to assess internal structure using methods such as factor analysis, which also requires a reasonable degree of variance between subjects and items.


마지막으로, 자격시험 전반에서 그러하듯, 평가 운영자는 완전학습평가를 비보상적noncompensatory 방식으로 채점할 수 있는데, 이 때 학습자는 다수의 서로 다른 subunit에서 '마스터'를 받아야만 다음 단계로 진행할 수 있다. 이러한 점수체계에서 전체 측정오차는 각 subunit 측정오차의 지수함수가 되어, 전체 통과/탈락 결정이 매우 unreliable해질 수 있다. 예컨대, 다섯 개 subunit이 각각 0.8의 통과/탈락 신뢰도를 가진다면, 전체 신뢰도는 0.8^5 = 0.33이 되어 극히 낮은 신뢰도 계수가 된다.

Finally, as with credentialing examinations generally, administrators may choose to score mastery learning assessments in a noncompensatory fashion, whereby learners must demonstrate mastery on many different subunits before progressing.27 In noncompensatory scoring, overall measurement error is an exponential function of the measurement error for each subunit and thus can “balloon” into very unreliable overall pass/fail decisions. For instance, if learners must pass each of five procedural skill stations, which each have a pass/fail reliability of 0.8, overall pass/fail decision reliability would be only 0.8*0.8*0.8*0.8*0.8 = 0.33, an abysmally low reliability coefficient.28
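A worked restatement of the passage's own 0.8^5 arithmetic, treating the product of subunit pass/fail reliabilities as the rough index of overall conjunctive decision reliability used in that illustration.

```python
# Mirrors the article's example: five subunits, each with pass/fail
# reliability 0.8, combined under a conjunctive (pass-all) rule.
subunit_reliabilities = [0.8, 0.8, 0.8, 0.8, 0.8]

overall = 1.0
for r in subunit_reliabilities:
    overall *= r  # classification error compounds across subunits

print(f"overall conjunctive decision reliability ~ {overall:.2f}")  # 0.33
```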


타당도 근거: 다른 변인과의 관계

Sources of Validity Evidence: Relationships to Other Variables


완전학습 시스템에서 평가결과와 가장 중요한 관계에 있는 것은, 평가점수가 뒤따라오는 교육유닛에서의 성공과 관련되어 있는지에 대한 것이며, 여기에는 궁극적으로 진료로의 이행transition to practice도 포함된다.

the most important relationship to evaluate in a mastery learning system is whether assessment scores relate to learners’ success in their subsequent educational unit(s), including their eventual transition to practice.


완전학습평가에서 점수분포범위의 제한(restriction of range)이 신뢰도 추정을 손상시키는 것과 마찬가지로, 다른 변인과의 관계를 추정하는 것도 어렵게 만든다. 그러나 완전학습시스템을 도입하기 이전에 수집된, 상대적으로 범위 제한이 없는unrestricted 평가자료와 다른 변인의 관계를 보는 것은 가능하다.

As it impairs the estimation of reliability, the restriction of range in mastery learning assessment scores makes estimating relationships to other variables difficult. However, correlating relatively unrestricted assessment data obtained prior to implementing a mastery learning system with other variables is possible.


타당도와 정당화 근거: 평가결과 활용에 따른 여파consequences

Sources of Validity and Justification Evidence: Consequences of Assessment Use


평가가 의도한 추론desired inference를 지지할 수 있느냐에 초점을 둔 타당도근거와 달리, 여파(결과, consequences)근거는 '의도한/의도하지 않은 결과', '평가의 도입절차가 논리적이고 바람직한가' 등을 고려하여 점수를 활용하고 적용하는 것을 정당화하는 것을 목적으로 한다. 여파근거는 기준을 설정하는 프로세스, 학습 프로세스/학습 성과 평가에 따른 영향impact, 헬스케어 수행practice of health care에 대한 정보 등을 포함한다.

In contrast to validity evidence that focuses on whether the assessment can support desired inferences, consequences evidence seeks to justify the uses or applications of scores by considering the intended and unintended consequences of the assessment and whether implementation of the assessment is reasonable and desirable.6,7 Consequences evidence includes information about the process of setting standards and the impact of the assessment on the learning process, learning outcomes, and the practice of health care.12


완전학습은 교육과정과 교육훈련 프로그램에 큰 영향을 줄 수 있다. 충분한 교육시간과 재교육, 재연습, 재시험을 위한 자원을 필요로 하며, 역량바탕접근을 강화한다.

The mastery model potentially could widely influence curricula and training programs. Mastery standards mandate sufficient curricular time and resources for repeated practice, remediation, and retesting, thus reinforcing a competency- based approach to education.5


개별 학습자 수준에서 다음을 찾아볼 수 있다.

On an individual learner level, one can seek evidence of

  • increased efficiency and effectiveness of study and practice strategies,
  • increased attention to the critical elements of the assessed domain,
  • more functional motivational orientations,32 and
  • improved self- regulation of learning.

 

그러나 완전학습 시스템은 정기적으로 '마스터' 여부를 재평가하지 않기에 학습자가 '마스터'수준을 단기적으로만 유지하지, 전체 커리어에 걸쳐 유지하게끔 하는 것에 초점을 두지 않을 수도 있다.

However, mastery learning systems that do not periodically reassess mastery may lead learners to focus on demonstrating mastery in the short term rather than maintaining mastery throughout their careers.


완전학습시스템은 학습자가 다음 단계로 넘어갈 준비가 되었을 때에만 넘어갈 수 있게끔 하는 것을 의도한다. 따라서 다음 교육유닛에서 학습자의 성과가 가장 주요한 관심의 대상이 되는 결과이다. 그러나 학습자가 이후에 보이는 progress를 가지고 완전학습평가에 관한 inference를 하는 것은 어렵다.

  • 만약 학습자가 이후 교육유닛에서 보이는 수준이 평균 이하라면, 앞서서 수여한 '마스터' 기준중 하나 이상이 너무 느슨했음을 뜻한다.
  • 반대로, 이후 교육유닛에서 학습자가 만족스러운 수준을 보인다면, 앞서 수여한 '마스터' 기준이 충분히 엄격했다고 볼 수 있지만, 더 느슨한 기준으로도 더 적은 시간을 들여 동등한 결과를 낼 수 있었을지도 모른다.

Mastery learning systems are meant to ensure that learners progress only when they are ready to do so; thus, learner outcomes in subsequent educational units are a primary consequence of interest. However, drawing inferences about mastery learning assessments from learners’ later progress can be challenging. If learners’ progress in later educational units is found to be subpar, it may be that one or more of the previous mastery standards were too lenient. If learners’ subsequent progress is satisfactory, the preceding mastery standards were arguably stringent enough, though more lenient standards may have yielded comparable results in less time.



systematic하게 기준을 실험하고, 어떻게 이후 성과가 영향을 받는지 실험하는 것은 logistic하게, 그리고 종종 윤리적으로 문제가 된다.

to systematically experiment with the standards and observe how later outcomes are affected can be logistically and sometimes ethically challenging.


마지막으로, 환자/보건의료시스템/사회 전체 에 미치는 영향에 대한 근거를 볼 수도 있다.

Finally, one can seek evidence of an impact on outcomes for patients, the health care system, and society as a whole.








 



8 Lineberry M, Kreiter CD, Bordage G. Threats to validity in the use and interpretation of script concordance test scores. Med Educ. 2013;47:1175–1183.


11 Cook DA, Beckman TJ. Current concepts in validity and reliability for psychometric instruments: Theory and application. Am J Med. 2006;119:166.e7–166.16.


22 Williams RG, Klamen DL, Markwell SJ, Cianciolo AT, Colliver JA, Verhulst SJ. Variations in senior medical student diagnostic justification ability. Acad Med. 2014;89:790–798.


23 Stansfield RB, Kreiter CD. Conditional reliability of admissions interview ratings: Extreme ratings are the most informative. Med Educ. 2007;41:32–38.







 2015 Nov;90(11):1445-50. doi: 10.1097/ACM.0000000000000860.

Making the case for mastery learning assessments: key issues in validation and justification.

Author information

  • 1M. Lineberry is assistant professor, Department of Medical Education, and assistant director for research, Dr. Allan L. and Mary L. Graham Clinical Performance Center, University of Illinois at Chicago College of Medicine, Chicago, Illinois. Y.S. Park is assistant professor, Department of Medical Education, University of Illinois at Chicago College of Medicine, Chicago, Illinois. D.A. Cook is professor of medicine and medical education, associate director, Mayo Clinic Online Learning, and consultant, Division of General Internal Medicine, Mayo Clinic College of Medicine, Rochester, Minnesota. R. Yudkowsky is associate professor, Department of Medical Education, and director, Dr. Allan L. and Mary L. Graham Clinical Performance Center, University of Illinois at Chicago College of Medicine, Chicago, Illinois.

Abstract

Theoretical and empirical support is increasing for mastery learning, in which learners must demonstrate a minimum level of proficiency before completing a given educational unit. Mastery learning approaches aim for uniform achievement of key objectives by allowing learning time to vary and as such are a course-level analogue to broader competency-based curricular strategies. Sound assessment is the cornerstone of mastery learning systems, yet the nature of assessment validity and justification for mastery learning differs in important ways from standard assessment models. Specific validity issues include (1) the need for careful definition of what is meant by "mastery" in terms of learners' achievement or readiness to proceed, the expected retention of mastery over time, and the completeness of content mastery required in a particular unit; (2) validity threats associated with increased retesting; (3) the need for reliability estimates that account for the specific measurement error at the mastery versus nonmastery cut score; and (4) changes in item- and test-level score variance over retesting, which complicate the analysis of evidence related to reliability, internal structure, and relationships to other variables. The positive and negative consequences for learners, educational systems, and patients resulting from the use of mastery learning assessments must be explored to determine whether a given mastery assessment and pass/fail cut score are valid and justified. In this article, the authors outline key considerations for the validation and justification of mastery learning assessments, with the goal of supporting insightful research and sound practice as the mastery model becomes more widespread.



'블랙박스' 다르게 보기: 세 가지 관점에서 보는 평가자의 인식(Med Educ, 2014)

Seeing the ‘black box’ differently: assessor cognition from three research perspectives

Andrea Gingerich,1 Jennifer Kogan,2 Peter Yeates,3 Marjan Govaerts4 & Eric Holmboe5






INTRODUCTION


수행능력 평가의 한 가지 형태는 workplace-based assessment (WBA)로서 매일매일의 진료현장에서 복잡한 임상과제에 대한 수행능력을 피훈련자가 실제로authentically 환자와 실제 임상 상황에서 상호작용을 하는 모습을 직접 관찰하여 평가한다.

One type of performance assessment, workplace-based assessment (WBA), incorporates the assessment of complex clinical tasks within day-to-day practice through direct observation of trainees as they authentically interact with patients in real clinical settings.


WBA가 중요하고 필요하지만, 이러한 형태의 평가는 측정상 한계가 있다. 낮은 평가자간 신뢰도와 같은 한계는 종종 평가자의 판단이 잘못된 것으로 그 책임을 돌리곤 한다. 실제로 수행능력 평가를 분석하는데 psychometric을 사용하여 평가자에 의한 variance가 피훈련자에 의한 variance보다 크게 나타나곤 한다.

Despite the importance and necessity of their use, WBA and other performance assessments have measurement limitations.7–9 These limitations, such as low inter-rater reliability, are often attributed to flaws in assessors’ judgements.10–12 In fact, when psychometrics are used to analyse performance assessments, often a greater amount of variance in ratings can be accounted for by the assessors (i.e. rater variance) than the trainees (i.e. true score variance).13–15


이 논문에서는 rater라는 단어보다 assessor라는 단어를 사용하고자 하며, 이는 평가가 단순히 rating(점수의 수치화)뿐 아니라 서술형 코멘트/피드백/supervisory 결정 등과 관련되기 때문이다.

In this paper, the term ‘assessor’ will be used rather than ‘rater’ to emphasise that assessment involves not only rating (numerical scores), but the provision of narrative comments, feedback and supervisory decisions.



방법

METHODS



결과

RESULTS


 

비록 상호배타적이지는 않지만, 평가자의 인식에 대한 세 가지의 구분되는 관점이 있다.

There appear to be three distinct, although not mutually exclusive, perspectives on assessor cognition within the research community.

  • 행동학습이론, 통제가능한 인지과정으로 보는 관점.
    The first perspective describes potentially controllable cognitive processes invoked during assessment and draws on components of behavioural learning theory to help frame an approach to reduce unwanted variability in assessors’ assessments through faculty training.
  • 사회 심리학 연구에서 흔히 다뤄지며, 자동적이고 회피불가능한 인간 인식의 오류에 대한 것
    The second perspective draws on social psychology research and focuses on identifying the automatic and unavoidable biases of human cognition so that assessment systems can compensate for them.
  • 사회 문화적 이론에 기반하며, 판단의 다양성이 유용한 정보를 제공할 수 있다는 관점
    A third perspective draws from socio-cultural theory and the expertise literature and proposes that variability in judgements could provide useful assessment information within a radically different assessment design.


처음 두 가지 관점은 어떤 주어진 수행능력에는 단일한 '참'기준이 있으며, 비록 평가자간 변동성(variability)을 설명하는 방식에는 차이가 있지만, 둘 모두 이것을 에러로 바라본다. 반면, 평가자간 변동성이 다양한 합리적 진실에 의해서 생긴다는 관점에서는 이것을 '오류'로 보지 않는다.

Importantly, the first two perspectives assume that any given performance exhibits a singular ‘true’ standard of performance; although they differ in their explanations of assessor variability, both perspectives view it as error. Conversely, the third perspective argues that variability may arise as a result of multiple legitimately different truths, which may not represent error.




관점 1. 평가자는 훈련가능하다

Perspective 1. The assessor as trainable


 

이 관점에서 WBA의 평가자간 변동성은 평가자가 평가 준거를 '잘 알지 못하거나' '정확하게 적용하지 못함으로써' 나타나는 결과이다. 따라서 평가에 있어서의 변동은 평가자가 제공하는 정보가 부정확함을 의미하고, 이 변동은 반드시 최소화되어서 평가 정보의 퀄리티를 향상시켜야 한다. 이러한 변동을 줄여서 평가의 측정결과를 향상시키기 위한 실행 가능한 해결책은 목표를 정해서 평가자를 훈련시키는 것이다.

From this perspective, inter-assessor variability in WBA is seen as the result of assessors not ‘knowing’ or correctly ‘applying’ assessment criteria. Therefore, variability in assessment judgements reflects inaccuracy in the information provided by assessors and this variability must be minimised to improve the quality of assessment information. A viable solution to reduce variability in judgements and improve measurement outcomes in assessments is the provision of targeted training for assessors.


이러한 관점은 일부 행동학습이론에 토대를 두며, 여기서는 

  • 피훈련자의 행동에 측정하고 평가할 수 있는 관찰가능한 변화가 있을 때 학습이 일어났다고 본다.
  • 학습과제는 구체적인 측정가능한 행동들로 쪼개지고, 구체적 행동 목표SBO를 설정하여 학습자는 구체적으로 어떤 행동을 해야 하는가를 배우게 된다.
  • 평가의 측정(점수표)는 준거-기준평가이며 왜냐하면 학습자를 평가할 때 동료에 비해서 얼마나 잘 했느냐가 아니라 그 준거에 비추어볼 때 얼마나 잘 했느냐를 평가하기 때문이다.

This perspective is partially grounded in behavioural learning theory, which assumes that trainee learning has occurred when there are observable changes in the trainee’s behaviours (or actions) which can be measured and evaluated. Learning tasks can be broken down into specific measurable behaviours16 and, by identifying specific behavioural objectives, learners can know exactly what behaviours should be performed.17,18 Assessment measures (i.e. scoring rubrics) are criterion-referenced in that learners are assessed according to how well they do rather than by how well they rank among their peers.19,20


WBA에서 평가자가 피훈련자를 관찰하고 평가할 때, 평가자는 반드시 피훈련자의 '바람직한' 행동과 '바람직하지 못한'행동을 찾아낼 수 있어야 한다. 

In WBA, in which assessors observe and assess trainees with patients, assessors must be able to identify trainees’ ‘desired’ and ‘undesired’ behaviours (clinical skills).


피훈련자 평가를 위해서 '가장 바람직한 행동'이 무엇인가에 대한 정보를 주어야 하며, 평가자는 이 퀄리티 기준quality metrics 을 가지고 임상스킬을 평가한다. 이렇게 기준이 있다면 단 한 차례의 자극single stimulus, 즉 한 차례의 환자-의사 상호작용도 이상적으로는 평가자간 유사한 반응을 일으켜야 한다. 그러나 평가자들은 종종 퀄리티 기준을 적절하게 활용하는데 실패한다.

Best practices for care quality should inform trainee assessment, and assessors should use these quality metrics to assess clinical skills.31 A single stimulus, the interaction between a trainee and a patient, would then ideally result in more similar responses by assessors. However, assessors often fail to appropriately use quality metrics to assess clinical skills.


WBA에서의 연구는 평가에 안 좋은 영향을 미칠 수 있는 최소 세 가지의 핵심 인지 프로세스를 찾아냈다. 하나는 평가자가 피평가자를 판단할 때 사용하는 frame of reference(FOR)이나 기준이 다양하다는 것이다. '불만족' '만족' 우수'는 흔히 사용되는 anchor이나 이것에 대한 해석은 매우 다양하다.

Research in WBA has revealed at least three key cognitive processes used by assessors that could adversely influence assessments. One is that assessors use variable frames of reference, or standards, against which they judge trainees’ performance.32–35 ‘Unsatisfactory’, ‘satisfactory’ and ‘superior’ are common anchors on many assessment tools.36 How these anchors are interpreted is very variable.


다른 흔한 FOR은 평가자 자신이다. 피훈련자가 환자를 대하는 모습을 볼 때 평가자는 자기 자신의 스킬을 비교대상으로 삼는다('자신'이 FOR이 됨). 이는 평가에 있어서 큰 문제가 되는데, 왜냐하면 임상스킬에 있어서 의사마다 차이가 크고, 심지어 어떤 경우에는 핵심 임상스킬 수행 능력이 부족한 경우조차 있기 때문이다. 스스로의 임상스킬이 떨어질 경우 제대로 평가를 할 수 있을 가능성이 낮다. 많은 평가자에게 있어서 피훈련자를 평가할 때 사용하는 준거는 경험적으로 개발되며, 동일한 수행에 대해서도 사람마다 관심을 가지고 바라보는 지점이 다르기 때문에 평가의 퀄리티를 결정하는 평가자간 변동성을 야기한다.

Another particularly prevalent frame of reference that assessors use is themselves. While observing trainees with patients, assessors commonly use their own skills as comparators (the ‘self’ as the frame of reference).32,37 This is problematic for assessment because practising physicians’ clinical skills may be variable, or sometimes even deficient, in core skill domains such as history taking, physical examination and counselling.38–41 They may be less able to do this if their own clinical skills are insufficient. For many assessors, the criteria they use to assess trainees develop experientially and different individuals subsequently come to focus on different aspects of performance, which results in variable definitions among assessors of what determines quality.32,33


오류의 근원이 되는 두 번째는 평가자가 직접 관찰을 하면서 '평가'가 아니라 '추론'을 하는 경우에 발생한다. 평가자는 그러나 자신이 이러한 '추론'을 내리고 있다는 사실을 인지하지 못하며, 그 추론의 정확성을 validate하지 않는다. 검증되지 않은 추론은 정확한 평가를 '왜곡'할 위험이 있으며, 왜냐하면 이러한 평가자의 추론은 관찰되거나 측정될 수 없기 때문이다.

A second potential source of measurement error arises when assessors make inferences during direct observation rather than assessing observable behaviours.32,42 Assessors do not recognise when they are making these inferences and do not validate them for accuracy.32 Unchecked inferences risk ‘distorting’ the accurate assessment of the trainee because the assessor’s inferences cannot be observed and measured.


세 번째로는 평가자가 불편한 간접영향을 회피하고자 평가 판단을 조정하는 경우가 있다. 어떤 평가자들은 인기와 호감을 얻기 위해서 점수를 잘 줄 수 있다.

A third cognitive process used by assessors that might increase assessment variability is the modifying of assessment judgements to avoid unpleasant repercussions. Some may inflate assessments in order to be perceived as popular and likable teachers.


이러한 관점에서 앞서 언급된 오류의 원인들은, 적어도 일부분은, 교수개발을 통해서 극복될 수 있으며, 어떤 행동학습이론의 원칙은 '훈련을 통한 문제해결'을 지지한다.

From this perspective, the aforementioned sources of error can, in part, be addressed through faculty development (i.e. the assessor is trainable) and certain principles of behavioural learning theory can be invoked to support proposed ‘training solutions’.


피훈련자에 대한 평가는 그들이 달성해야 하는 역량을 기준으로 이뤄져야 하며, 이것을 달성하기 위해서 평가자는 준거-기반 접근법을 익혀야 한다. 준거기반 평가에서는 피훈련자의 수행능력이 의료행위에 대한 근거에 기반하여 사전에 정의된 준거에 따라 평가된다.

Assessment of trainees should be based upon the competencies they need to achieve. To accomplish this, assessors will need to learn a criterion-based approach to assessment in which trainee performance is compared with pre-specified criteria that are ideally grounded in evidence-based best practices.


그러나 제대로 이뤄지지 않으면 문제가 되는데, 학습자가 평가를 거치며 혼란스러운 뒤섞인 메시지나 피드백을 전달받게 되면, 어떤 행동을 강화해야 하는지에 대한 비일관성이 학습에 오히려 안 좋은 영향을 미칠 수 있다.

This situation creates problems for learners, assessors and patients. Learners receive mixed messages during assessment, as well as discrepant feedback, which can interfere with their learning because there is inconsistency in what is or is not being reinforced.




관점 2. 평가자는 오류에 빠지기 쉽다.

Perspective 2. The assessor as fallible


논리적으로, 관점 1에서 드러난 어떤 평가의 문제도 더 명확한 프레임워크를 제시하고 더 평가자를 훈련시켜서 더 정확한 관찰을 하게 하면 향상될 수 있다. 그러나 수십년의 연구 결과는 이러한 접근법으로는 거의 차이를 만들어내지 못함을 알려준다. 왜 그럴까? 여러 문헌에서 이 '정밀한 분석기계' 가설에 도전하였다. 두 번째 관점은 평가자간 변동성을 인간 인지과정의 근본적 한계에서 기인한다고 본다. 간략히 말하자면, 낮은 평가자간 신뢰도는 훈련을 해도 계속 있을 것이며, 그 이유는 평가자가 평가에 대한 준비가 잘 안되었기 때문이 아니라, 인간의 판단이 원래 불완전하고, 여러 요인에 의해 쉽게 영향을 받기 때문이다.

Logically, any difficulties with this approach should be improved through clearer frameworks or through training in more accurate observation. Yet decades of research tell us that these approaches make comparatively little difference.49 Why? A different body of literature challenges this ‘precise analytical machine’ assumption. This second perspective sees assessor variability arising from fundamental limitations in human cognition. In short, low inter-rater reliability persists despite training, not because assessors are ill prepared, but because human judgement is imperfect and will always be readily influenced.


인지심리학과 사회심리학 연구들은 평가자가 단순히(수동적으로) 관찰하고 특징을 잡아내는 것이 아님을 주장한다. 인간의 작업기억과 처리용량은 제한되어 있다. 정보는 매우 빠르게 소실되거나, 그렇지 않으려면 처리과정을 통해 기존에 가지고 있던 지식구조에 연결되어야 유지되고 사용될 수 있다. 그 결과, 수행능력에 대한 '객관적' 관찰이란 애초에 존재하지 않는다.

Cognitive and social psychology assert that assessors cannot simply (passively) observe and capture performances.50 Human working memory and processing capacity are limited.51 Information is either lost very quickly, or must be processed and linked to a person’s pre-existing knowledge structures to allow it to be retained and used.52 As a result, there can be no such thing as ‘objective’ observation of performance.


인지와 관련한 무수한 bias들이 있지만, 몇 가지가 유용하다. 정보를 인지적으로 관리가능하게 만들기 위해서 사람들은 'schema' 혹은 관련 정보의 네트워크를 활성화시켜야 한다. 예를 들면 '심장 마비'라는 용어는 '전형적인' 환자의 이미지를 떠올리게 한다. 이러한 '전형적인' 환자에 대한 개념은 우리가 사람을 카테고리화하는 경향으로부터 발생하는 것이며, 종종 '대표성 오류representativeness bias'에 빠지게 한다.

Although numerous biases in cognition exist, some illustration is useful. To make information cognitively manageable, people activate ‘schemas’ or networks of related information. Thus, for example, the phrase ‘heart attack’ might also activate a mental image of a ‘typical’ heart attack patient. The notion of a ‘typical’ patient, or person, arises from our tendency to categorise people,55 which leaves us open to ‘representativeness bias’.56


이러한 과정은 정신적 노력을 매우 절감시켜주나, 중요한 정보를 놓치게 하는 원인도 되고, 판단을 비뚤게 할 수도 있다. 이러한 유형의 bias는 '고정관념stereotype'에 대한 문헌에서 잘 연구되어 있다.

This saves a lot of mental effort, but means we tend to ignore important information, and this can bias our judgements. This type of bias is well illustrated by the literature on stereotypes.


고정관념은, 일단 발동되기만 하면, 개개인이 어떤 특징에 관심을 갖게 되는지, 어떤 판단을 내리게 되는지, 어떤 기억을 회상할 것인지를 왜곡distort시킨다. 후자가 특히 중요한데, 평가자가 방금 관찰한 것을 '객관적으로' 회상하기 보다는, 사람들은 무의식적으로 자신이 기존에 가지고 있던 고정관념적 신념에 기반해서 '빈 칸을 채우는'식으로 작동하기 때문이다. 이는 WBA에서 특히 중요한데, 왜냐하면 단순히 점수를 왜곡시키는 것이 아니라 피훈련자에게 제공되는 피드백에 영향을 주기 때문이다.

Once active, stereotypes can distort which features individuals pay attention to,57 the judgements they reach58 and their recall of what occurs.59 The latter is particularly important: rather than ‘objectively’ recalling what they have just observed, people may unconsciously ‘fill in the blanks’ based on what their stereotypical beliefs suggest.60 This is particularly important in WBA because it will distort not just scores, but also the feedback given to trainees.


중요한 점은, 고정관념의 영향이 의식의 통제 아래 있지 않다는 점이다. 맥락의 변화는 어떤 고정관념이 활성화될지를 결정한다. 또한 사람들은 그들의 인식이나 행동에 영향을 미치는 무의식적 사고를 잘 인식하지 못한다. 정서/시간 압박/주기 리듬/동기부여/편견의 정도/개인의 인식 선호 등이 영향을 준다

Importantly, the influence of stereotypes is often not under conscious control: changes in context determine which stereotypes are activated,61 and people are often unaware of the unconscious thoughts that influence either their cognition62 or behaviour.63 Emotions,64 time pressure,59 circadian rhythms,65 motivation, pre-existing levels of prejudice66 and individual cognitive preferences67 can all shape these judgements.


고정관념을 회피하게 만들려는 목적의 지침이 오히려 역설적으로 그것을 더 악화시킨다.

Instructions to avoid stereotyping can make their influence paradoxically worse.68


우리는 시니어 의사들이 학생들에 대한 고정관념을 가지고 있어서, 소수인종 학생들의 수행능력이나 행동에 대해서 무의식적으로 낮게 보는 경향이 있음을 안다. 또한 의사들이 피훈련자의 수행능력을 판단함에 있어서 스스로의 판단에 과도한 자신감을 보이는 것으로 드러났다. 스스로의 판단에 대한 과도한 자신감은 흔히 대표성 편향representativeness bias의 결과로 나타난다고 본다.

However, we do know that senior doctors possess well-developed stereotypes of the way that ethnic minority students may perform or behave69 and that, in other aspects of education, unconscious stereotyping of ethnic minorities can be seen to account for the reduced academic achievement of these students.70 It has previously been shown that doctors judging performances of trainees are over-confident in their judgements (they are right less often than they think).71 Judgemental overconfidence is thought to typically arise as a result of representativeness bias.56


인간은 절대적 수치를 계량하거나 판단을 내리는 것에 취약하다고 알려져 있다. 판단은 매우 쉽게 맥락적 정보에 의해 영향을 받으며, 이는 동화 혹은 대조 효과assimilation or contrast effects로 알려져 있다.

Humans are known to be poor at judging or scaling absolute quantities; judgements are easily influenced by contextual information72 through processes known as assimilation or contrast effects.73


연구 결과를 보면 이러한 영향이 다양한 범위의 수행능력에서 나타나고 있으며, 매우 왕성하나, 평가자는 그것이 존재함조차 모르고 있는 경우가 많다. 

One study suggested that this effect can occur across a range of performance levels, is fairly robust and that assessors may lack insight into its operation.33


실제로 더 많은 구체적인 체크리스트를 만드는 것은 평가자의 인지부담을 증가시키고, 이러한 접근법은 역설적으로 개선하고자 하는 문제를 악화시킨다.

In fact, as making more detailed checklists might increase the cognitive load experienced by assessors, this approach could potentially (paradoxically) worsen the very problem it hopes to improve.75


따라서, 이러한 관점에서 내리는 결론은 평가-기반 판단의 허무주의로 빠지게 된다. 인간의 판단은 애초에 문제가 있으며 교정될 수 없는 것이 아닐까? 그렇지 않다. 대신, 이것이 시사하는 바는 인지적 개입이 가능한 도구 속에 해결책이 있다. 최근의 연구를 살펴보면, 사람들은 한 사람에 대한 판단을 내리기 전 평등주의자적 동기egalitarian motivation가 있다. 이는 고정관념의 활성화를 줄여줄 수 있으며, 행동의 의도나 대인관계 상호작용에 관한 고정관념의 영향을 줄여줄 수 있다.

It would be easy, therefore, to conclude that this perspective demands a nihilistic view of judgement-based assessments: judgement is flawed and cannot be fixed. It does not. Instead, it suggests that progress may lie within a toolbox of possible cognitive interventions. Recent research indicates that people can be induced to adopt an ‘egalitarian motivation’ prior to making judgements of a person.78,79 This reduced the cognitive activation of stereotypes78,79 and lessened the influence of stereotypes on behavioural intentions and interpersonal interactions.79


말할 필요도 없이, 더 많은 연구가 필요하며, 비록 이러한 인터벤션이 성공적이더라도, 맥락적 영향이 판단에 미치는 영향을 완전히 극복할 수는 없다. 한 가지 함의는 인간의 판단을 알고리즘을 활용한 측정으로 대체해야 하는가이다.

Needless to say, much further work is required before any claims can be made about the potential benefits of these approaches. Even if these interventions are successful, they are unlikely to completely overcome contextual influences on judgements.74 One possible implication of this perspective would be to seek ways to replace human judgement with algorithmic measurement.


알고리즘을 활용한 측정에는 인간의 판단이 개입되지 않으며, 아마 인간의 판단을 점차 대체할지도 모른다.

No human judgement is involved.80 Perhaps further developments of this sort will gradually replace human judgement.




관점 3. 평가자의 특이성은 나름의 의미가 있다.

Perspective 3. The assessor as meaningfully idiosyncratic


만약 평가자간 변동성이, 적어도 일부분이나마, 서로 다르긴 해도 분명히 (피평가자와) 관련되어 있고 합당한, 그러나 서로 다르고 종종 상반되는 해석을 낳는다면 어떨까? 라는 의문을 가질 수 있다. 이러한 관점에서는 평가자 인식의 독특성idiosyncrasy가 유의미한 평가정보를 제공해줄 수 있으면서, 동시에 평가자간 변동성과 불일치를 야기하고, 더 나아가 낮은 평가자간 신뢰도에 이르게 한다고 본다.

One of its fundamental questions concerns what happens if variability, at least in part, derives from the forming by assessors of relevant and legitimate but different, and sometimes conflicting, interpretations. This perspective examines potential sources of idiosyncrasy within assessor cognition that could provide meaningful assessment information, but also lead to variability, assessor disagreement and low inter-rater reliability.


WBA가 표준화되지 않은 상태에서, 평가자의 idiosyncrasies에 따르는 변동은 맥락특이성에 따른 변동에 비견outmatch 될 수 있다. psychometric한 측정관점에서 보자면, 이 두 가지 중 어떤 것도 피평가자의 역량에 대해서 알려주는 바가 없으며, 일반적으로 측정오류로 여겨진다. 그러나 상황인지이론situated cognition theory와 사회-문화 (학습)이론에 따르면, 맥락-특이적 variance는 오류error가 아니다. 이 이론에 따르면 맥락은 비활성inert한 것이 아니며, 피훈련자의 수행능력과 분리되어서 여러 맥락이 서로 상호교환가능한 것이 아니다. 대신 맥락은 피훈련자가 어떤 의도한 스킬을 수행하는데 있어서 그것을 가능하게 하거나 제약시키는 요인으로 여겨진다. 이는 왜냐하면 '맥락'이라는 것이 모든 사람과 모든 환경 사이에서 가능한 모든 역동적 상호작용을 포괄하기 때문이며, 단순히 물리적 환경에 대한 이름표가 아니기 때문이다. 맥락을 이렇게 이해한다면, 피훈련자는 그들이 접하는 임상상황이나 임상사건에 대한 완전한 통제를 가지고 있지 않으며, 그들의 역량은 독특한 맥락에 의해서 형성되고, 그 맥락과 연결되어 드러나는 것이다.

In the non-standardised reality of WBA, variance attributable to the idiosyncrasies of assessors is only outmatched by variance attributable to context specificity.81–83 From a psychometric measurement standpoint, neither of these sources of variance reveal anything about the trainee’s competence and are generally assumed to contribute to measurement error. Viewed from situated cognition theory and socio-cultural (learning) theories, however, context-specific variation is not ‘error’. According to these theories, context is not an inert or interchangeable detail separate from a trainee’s performance, but instead is viewed as enabling and constraining the trainee’s ability to perform any intended or required skills.84–86 This is because context is understood to encompass all the dynamic interactions between everyone and everything within an environment, and is not just a label for the physical location.84,85,87,88 Based on this understanding of context, trainees will not have full control over the events within a clinical encounter and their competence will instead be shaped by, revealed within, and linked to that unique context.89,90


이러한 관점에서는 맥락을 '무시되어야 할 것' 혹은 여러 맥락에 걸쳐 '평균내어질 수 있는 것'으로 보기 어렵다. 또한 역량이 개별 피훈련자에게만 내재되어reside solely within 있다는 관점, 역량이 서로 다른 장소/환자/시간에 걸쳐 안정적으로 유지된다는 관점에 대해서도 의문을 표한다. 반대로 역량은 사회적으로 구성되고, 다른 사람에 의해서 보여지고 인지될demonstrated and perceived 필요가 있다. WBA에서 한 사람이 다른 사람의 역량을 '인지'한다는 관점이 특히 중요한 이유는 많은 핵심 구인들이 직접적으로 관찰가능하지 않기 때문이다. 대신 환자-중심, 프로페셔널리즘, 휴머니즘 등과 같은 여러 구인이 관찰가능한 행위로부터 추론되는 것이다.

Viewpoints such as these make it more difficult to think of context as something to be disregarded or averaged across. They also call into question the idea of competence as something that resides solely within each trainee and remains stable across different places, patients and time.91 On the contrary, competence has been described as being socially constructed and needing to be demonstrated and perceived by others.92–94 The idea of perceiving others’ competence is especially important for WBA because many of the key constructs that must be assessed are not directly observable.95 Instead, constructs such as patient-centredness, professionalism, humanism and many others must be inferred from observable demonstrations.89,93


여러 연구로부터 평가자의 전문성은 임상에서 진단의 전문성과 닮아있음을 제시한다. 경험이 많은 의사는 신속하고 자동화된 패턴 인식을 통해서 진단을 내린다diagnostic impression. 정보의 집합을 빠르게 유의미한 패턴으로 묶고, 빠르고 정확하게 진단적 추론을 한다. 이들은 구체적인 체크리스트를 사용하지 않으며, 오히려 환자를 만나는 맥락에 따른 사소한 차이들을 반영해내는 방식으로 정보를 사용한다. 추가적으로 전문가는 '기대'에 위배되는 '이상anomalies'를 인지하며, 즉각적 사건을 넘어서 배경이 가지는 중요성을 알고, ...등등

Research increasingly suggests that assessor expertise resembles diagnostic expertise in the clinical domain to a remarkable extent.43,100,101 Experienced clinicians use rapid, automatic pattern recognition to form diagnostic impressions; they very rapidly cluster sets of information into meaningful patterns, enabling fast and accurate diagnostic reasoning.102 They do not use detailed checklists with signs and symptoms based on textbook knowledge as novices would do, and more than that, they use information reflecting (subtle) variations in the context of the patient encounter.103 In addition, experts can

  • recognise anoma- lies that violate expectancies,
  • note the significance of the situation beyond the immediate events,
  • iden- tify what events have already taken place based on the current situation, and
  • form expectations of events that are likely to happen also based on the current situation.105–107

WBA에 대한 연구결과는 경험 많은 평가자가 이와 유사하게 평가 과제의 상황-특이적 신호를 인지하고, 과제-특이적 신호를 과제-특이적 수행요건 및 수행능력 평가에 연결시킬 수 있음을 보여준다.

In WBA, research findings indicate that experienced assessors are similarly able to note situation-specific cues in the assessment task, link task-specific cues to task-specific performance requirements and performance assessment,


경험이 많은 임상 평가자는 복잡한 과제에 대한 수행능력을 평가할 때에도, 시간의 압박이 있어도, 목표들이 서로 상충하고 잘 정의되어있지 않아도, 피훈련자의 수행능력에서 미래의 수행능력과 관련된 신호를 잡아낼 수 있다. 이들은 핵심을 짚어낼 줄 안다.

Even when experienced clinical assessors are engaged in complex tasks, often under time pressures and with conflicting as well as ill-defined goals, they seem to be capable of identifying cues in trainees’ performances that correlate with future performances.100 They spot the gist.


평가 전문성은 다른 어떤 전문직의 전문성과 마찬가지로 특정 맥락에의 몰입immersion을 통해 발달한다. 각 평가자의 전문성은 서로 다른 맥락의 영향을 받고 저마다의 독특한 경험에 의해 형성되므로, 일반적 수행, 과제-특이적 수행, 사람에 대한 서로 다른 멘탈모델과 스키마가 나타날 것으로 예상되며, 결국 각 평가자는 저마다 독특한 인지 필터unique cognitive filter를 발달시키게 된다.

Assessor expertise, as with any professional expertise, develops through immersion within specific contexts.108 As each asses- sor’s expertise will have been influenced by differ- ent contexts and shaped by unique experiences, different mental models of general performance, task-specific performance and person schemas might be expected, with each assessor inevitably developing a unique cognitive filter.42,43

결과적으로, 평가자마다 하나의 복잡한 수행에서 서로 다른 'gist'(핵심 개념)를 포착하고, 그것에 대해 서로 다른 해석을 구성할 수 있다.

Consequently, assessors may spot different ‘gists’ or underlying concepts within a complex performance and con- struct different interpretations of them.89,109 Variations in assessor judg- ements may very well represent variations in the way performance can be understood, experienced and interpreted.


이러한 관점에서 평가자 간 차이는 제거해야 할 무언가가 아니다. 평가자 간 해석의 불일치는 평가 판단이 이상적이지 못함을 반영한다기보다는, 수행의 복합성과, 그 수행이 평가자의 이해라는 필터를 거칠 때 본질적으로 따라오는 '주관적' 해석을 반영하는 것일 수 있다. 만약 평가 판단의 차이가 피훈련자의 수행이 다른 사람들에게 다양하게 지각되고 경험될 수 있다는 데서 비롯되는 것이라면, 평가자 간 해석의 불일치는 상호보완적이며 동등하게 유효할 수 있다.

From this perspective, differences in assessor judge- ments are not something to eliminate. However, rather than reflecting subop- timal judgements, inconsistencies among assessors’ interpretations may very well reflect the complexity of the performance and the inherently ‘subjective’ interpretation of that performance filtered through the assessor’s understanding. If differences in assess- ment judgements were to come from differences in the way the trainee’s performance can be perceived and experienced by others, then the inconsistencies among assessors’ interpretations might be comple- mentary and equally valid.


판단이 어떤 형태로든 정보 포화saturation에 이를 때까지 의도적으로 수집된 것이라면, 서로 모순되는 판단조차 유용한 정보가 될 수 있다. 평가자의 판단을 분석할 때 '신뢰도' 대신 '포화'를 활용하는 것의 핵심 이점은, 다수의 해석majority interpretation과는 다르지만 그 레지던트의 행동이 지각될 수 있는 중요한 변종들variants을 대표하는, 반복되는 해석의 작은 덩어리pockets까지 포착할 수 있다는 점이다.

Even contradictory judgements might be informa- tive if judgements were collected purposefully until some type of information saturation was reached.113 A key benefit of using saturation, rather than reli- ability, to analyse assessors’ judgements is that it provides the power to capture pockets of repeated interpretations that may differ from the majority interpretation yet represent important variants of how that resident’s behaviour can be perceived.
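아래는 '신뢰도 대신 포화'라는 아이디어를 절차로 옮겨본 가상의 스케치이다. collect_until_saturation 함수, patience 파라미터, 예시 해석 코드명은 모두 설명을 위해 임의로 둔 것이며, 원문이 제안하는 구체적 분석 절차는 아니다. 요점은 새로운 해석이 더 이상 나오지 않을 때까지 판단을 의도적으로 수집하면, 다수 해석과 다른 소수의 '변종' 해석도 빈도로 남는다는 것이다.

```python
from collections import Counter

def collect_until_saturation(interpretation_sets, patience=3):
    """평가자별 해석 코드 집합을 순서대로 받아,
    연속 patience명에게서 새 코드가 나오지 않으면 수집을 멈춘다.
    반환: (사용한 평가자 수, 코드별 빈도) - 소수 해석(pocket)도 빈도로 보존된다."""
    seen, counts, no_new, used = set(), Counter(), 0, 0
    for codes in interpretation_sets:
        used += 1
        new = set(codes) - seen
        counts.update(codes)
        seen |= new
        no_new = 0 if new else no_new + 1
        if no_new >= patience:
            break
    return used, counts

# 가상의 예: 평가자 6명이 같은 레지던트의 수행에 붙인 해석 코드
judgements = [
    {"경청 부족", "체계적 병력청취"},
    {"체계적 병력청취", "공감 표현"},
    {"경청 부족"},
    {"체계적 병력청취", "공감 표현"},
    {"지나치게 프로토콜 의존적"},   # 소수이지만 의미 있는 변종 해석
    {"체계적 병력청취"},
]
n_used, counts = collect_until_saturation(judgements, patience=3)
print(n_used, dict(counts))
```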


경험이 많은 평가자를 WBA의 잠재적으로 중요한 평가도구로 본다면, 지속적인 피드백과 평가 판단에 대한 의도적 연습deliberate practice을 제공하여 평가자의 전문성을 함양하는 것이 중요해진다. 체크리스트나 과제를 관찰가능한 하위요소로 쪼개는 것처럼 평가자 간 변동을 최소화하려는 해결책은 지양하는 것이 바람직한데, 평가자가 전문가적인 판단을 내리는 것을 방해할 수 있기 때문이다.

If experienced assessors are viewed as poten- tially important assessment instruments for WBA, then it will be important to cultivate expertise in assessors through the provision of ongoing feedback and deliberate practice in making assessment judge- ments. Solutions that aim to minimise assessor vari- ability, such as checklists and the reduction of tasks into observable subcomponents, would be best avoided as they may interfere with assessors making expert judgements.91,114,115


피훈련자의 경우, 평가자들로부터 서로 상충되는 평가정보를 받을 수 있으므로, 다른 사람들이 자신의 행동을 본래 의도와 다르게 해석하게 되는 과정을 이해하고 조정하는 데 guided reflection이 도움이 될 수 있다.

As for trainees, because they may receive conflicting assessment information from assessors, guided reflection may help them to reconcile how others can derive an interpretation of their behaviour that differs from how it was intended.


두 번째 관점과는 대조적으로, 이 관점에서 평가자 간 변동성은 평가자들이 저마다 다르게 전문성을 발달시키고 전문가적 판단을 사용하는 데서 비롯되는, 잠재적으로 유용한 평가정보의 원천으로 기술된다.

By contrast with the second perspective, vari- ability has been described as a potentially useful source of assessment information that stems from assessors differently developing expertise and using expert judgement.


DISCUSSION



세 관점의 공통점

Areas of concordance


첫째로, 세 가지 관점 모두 평가자가 피훈련자가 환자와 상호작용하는 것을 실제로 관찰할 것을 요구하며, 모든 관점이 현재 UME와 PGME에서 관찰-기반 평가의 양quantity과 빈도가 이상적인 수준에 못 미침을 인정한다. 이것은 즉각적인 관심이 필요한, 평가 프로그램의 심각한 결핍이다. 따라서 WBA를 향상시키기 위한 첫 번째 단계는 기관이 지원을 제공하고, 교수들이 실제로 관찰을 수행하도록 확실히 하는 것이다.

Firstly, all three perspectives require assessors to actually observe trainees interacting with patients and all recognise that the current quantity and fre- quency of observation-based assessment of under- graduate and postgraduate medical trainees is less than ideal. This is a serious deficiency in assessment programmes, which requires immediate atten- tion.36,116–124 Hence, the first step to improving WBA requires institutions to provide support and to ensure that faculty staff actually do it.


두 번째 공통점은 교수들이 스스로의 임상역량을 기르고 유지해야 한다는 것이며, 동시에 평가자로서의 전문성도 길러야 한다. 피훈련자 스킬의 퀄리티를 평가하는데 있어서 장애물은 특정 과제를 수행할 때 그 특정 스킬이 필요하다는 것을 평가자가 인식하지 못하는 것이다. 따라서 평가자를 위한 교수개발은 임상 스킬을 어떻게 평가할지 뿐만 아니라, 스스로 그 임상스킬을 어떻게 개발할 수 있는지도 포함되어야 한다.

A second area of concordance among the three per- spectives concerns the need for faculty members to achieve and maintain their own clinical compe- tence, while concomitantly developing expertise as assessors. An impediment to assessing the quality of specific skills performed by a trainee is an assessor’s lack of awareness of the specific skills required to competently perform that task. Therefore, faculty development for assessors may need to include training that refers to their own clinical skills devel- opment in addition to training in how to assess those skills.


마지막으로, 각 관점에 공통적으로 적용되며, 평가자 인지의 강점을 극대화하고 약점을 최소화하는 데 도움이 될 수 있는 두 가지 메커니즘이 있다.

Finally, there are two mechanisms common to each perspective that may help to maximise the strengths and minimise the weaknesses of assessor cognition.

  • 각 피훈련자가 수행하는 과제에 대한 robust한 샘플링과, 이를 평가하는 평가자에 대한 동등하게 robust한 샘플링
    One concerns the robust sampling of tasks per- formed by each trainee and assessed by an equally robust sample of assessors and is intended to improve the reliability, validity, quality and defensi- bility of assessment decisions.
  • 모든 활용가능한 정보를 종합하여 피평가자의 총괄적 수행능력에 대한 완전한 그림을 명확히 보여줄 수 있게 하는 평가자간 그룹토론
    The other is facili- tated group discussions among assessors and assessment decision makers that provide opportuni- ties to synthesise all available assessment data to cre- ate a clearer composite picture of a trainee’s overall performance.125 Group discussions allow both con- sistent and variable judgements to be explored and better understood.126


세 관점의 차이

Areas of discordance


세 관점의 차이에는 과연 하나의 진실이 존재하는지 다수의 진실'들'이 존재하는지에 대한 것, 교수개발의 목표, 추론을 하는 것의 효용성, 신뢰성reliability의 추구 등이 있다. 관점의 차이를 극복하고 완전히 통합하여 하나의 이론을 만들려고 노력하기보다는 상황에 따라 도움이 되는 관점을 적용하는 것이 좋을 것이다.

There are also areas of discordance, or incompati- bilities, among the three perspectives that cannot be ignored. For example, whether there exists one or multiple ‘truths’, the goals of faculty development, the utility of making inferences and the pursuit of reliability have been previously discussed. Rather than trying to overcome the discordances and fully integrate the different perspectives into a unified theory, it may be useful to identify circumstances in which the strengths of a particular perspective may be especially advantageous.


단순한 축구 비유가 도움이 될 수 있다. 축구선수는 공을 골대 안에 넣어야 득점하며, 골대를 벗어난 것은 모두 miss이다. 보건의료서비스 제공도 비슷하게 경계가 있다. 안전하고 효과적인 환자-중심 진료를 제공하는 방법이 무한하지는 않다. 어떤 임상업무는 좀더 타이트한 경계, 즉 더 작은 '골대'를 가진다. 예를 들어 중심정맥관(CVC) 삽입이나 폐렴 예방을 위한 기계환기 관리 등이 그러하다. 이러한 임상행위는 최신 근거와 절차적 체크리스트가 규정하는 경계 안에서 수행되어야 하며, 기준에서 벗어나는 것이 매우 제한된다. 따라서 이러한 수행에 대한 평가자 판단은 변동성이 적은 것이 바람직하다.

 

그러나 피훈련자의 수행의 질을 판단하는 것이 훨씬 더 많은 맥락적 요인에 달려 있는 경우도 있다. 예를 들어 나쁜 소식 전하기에도 가이드라인이 있지만(SPIKES framework), 그 경계boundary zone(골대의 크기)은 CVC 삽입보다 넓다. 물론 둘 모두 무한한 것은 아니다. 맥락적 요인에 의해 크게 영향을 받을 수 있는 임상과제의 경우, 평가자 판단의 변동성과 전문성을 수용할 수 있는 평가 시스템이 적합하고 가치 있을 것이다.

A simple football (soccer) analogy might help to illustrate how different perspectives on assessor cog- nition could be purposefully matched to fundamen- tally different assessment situations to improve WBA. A football player must place the ball into the net in order to score a goal and anything outside the boundary of the net is a miss. The delivery of health care is similarly bounded; there are not limit- less ways for trainees to provide safe, effective patient-centred care. Some clinical tasks have tighter boundaries, or a smaller ‘net’. For example, the insertion of central venous catheters and the management of mechanical ventilators to prevent pneumonia should be performed within the bound- aries specified by the latest evidence-based medicine or procedural checklists. Variance from the stan- dards in these cases should be limited. Correspond- ingly, it would be advantageous for assessor judgements of these performances to have less vari- ability. However, there are situations in which deter- mining the quality of the trainee’s performance depends on a larger number of contextual factors. For example, although there are guidelines for delivering bad news (e.g. the SPIKES127 framework), the boundary zone (i.e. the size of the net) is wider for breaking bad news than it is for central venous catheter insertion, but nei- ther is infinite. For clinical encounters that can be highly influenced by contextual factors, an assess- ment system that can accommodate variability and expertise in assessors’ judgements may be appropri- ate and valuable.


Moving forward








Med Educ. 2014 Nov;48(11):1055-68. doi: 10.1111/medu.12546.

Seeing the 'black box' differently: assessor cognition from three research perspectives.

Author information

  • 1Northern Medical Program, University of Northern British Columbia, Prince George, British Columbia, Canada.

Abstract

CONTEXT:

Performance assessments, such as workplace-based assessments (WBAs), represent a crucial component of assessment strategy in medical education. Persistent concerns about rater variability in performance assessments have resulted in a new field of study focusing on the cognitive processes used by raters, or more inclusively, by assessors.

METHODS:

An international group of researchers met regularly to share and critique key findings in assessor cognition research. Through iterative discussions, they identified the prevailing approaches to assessor cognition research and noted that each of them were based on nearly disparate theoretical frameworks and literatures. This paper aims to provide a conceptual review of the different perspectives used by researchers in this field using the specific example of WBA.

RESULTS:

Three distinct, but not mutually exclusive, perspectives on the origins and possible solutions to variability in assessment judgements emerged from the discussions within the group of researchers: (i) the assessor as trainable: assessors vary because they do not apply assessment criteria correctly, use varied frames of reference and make unjustified inferences; (ii) the assessor as fallible: variations arise as a result of fundamental limitations in human cognition that mean assessors are readily and haphazardly influenced by their immediate context, and (iii) the assessor as meaningfully idiosyncratic: experts are capable of making sense of highly complex and nuanced scenarios through inference and contextual sensitivity, which suggests assessor differences may represent legitimate experience-based interpretations.

CONCLUSIONS:

Although each of the perspectives discussed in this paper advances our understanding of assessor cognition and its impact on WBA, every perspective has its limitations. Following a discussion of areas of concordance and discordance across the perspectives, we propose a coexistent view in which researchers and practitioners utilise aspects of all three perspectives with the goal of advancing assessment quality and ultimately improving patient care.

© 2014 John Wiley & Sons Ltd.

PMID: 25307633 [PubMed - indexed for MEDLINE]


같은 것을 다르게 보는 것 - DOPA에서 평가자 간 차이의 기전 (Adv in Health Sci Educ, 2013)

Seeing the same thing differently 

Mechanisms that contribute to assessor differences in directly-observed performance assessments

Peter Yeates • Paul O’Neill • Karen Mann • Kevin Eva







Background


전문역량의 평가를 위해서는 다양한 스킬에 대한 수행능력을 측정한 정보의 표집sampling과 통합이 필요하다. 이러한 프레임워크에서 WBA(또는 수행능력 평가)는 매력적인 도구인데, 왜냐하면 실제 현장에서 수행능력의 표본sample을 제공해주기 때문이며, 다수의 역량을 통합된 형태로 동시적으로 평가할 수 있게 해주고, 피드백의 기회를 주기 때문이다. 이러한 평가의 일반적 유용성에 대한 근거들이 있지만, 이러한 평가로부터 나오는 점수에 내재한 variability는 타당성을 위협하는 문제가 되기도 한다.

The assessment of professional competence requires sampling and integration of measures of performance on multiple diverse skills (Van der Vleuten and Schuwirth 2005). Within this framework, workplace based assessments (or performance assessments) represent an attractive tool as they offer samples of performance from real practice, simultaneously assess multiple competencies in an integrated manner and offer opportunities for feedback (Norcini 2003). Whilst support exists for the general utility of these assessments, vari- ability inherent in the scores that result from them are problematic as it threatens their validity (Hawkins et al. 2010; Pelgrim et al. 2011).


점수의 편차variation 중 일부는 진점수true score의 차이에서 기인할 수 있지만(평가자들은 대개 서로 다른 수행을 평가하므로), 후속 연구에서는 평가자들에게 동일한 녹화된 수행을 평가하게 하여 이를 통제했다. 이 연구에서 평가자들은 동일한 수행을 보고도 9점 척도에서 1점부터 6점까지의 점수를 주었다.

Whilst score variations may have arisen partly due to true score variability (in that different assessors generally assess different performances) a further study controlled for this by asking assessors to rate common videoed performances. This study showed that assessors’ scores ranged from 1 to 6 on a 9 point scale whilst rating the same performance (Holmboe et al. 2003).


평가자 간 신뢰도 문제뿐만 아니라, 점수의 범위 제한range restriction 문제도 있으며, 서로 다른 역량 영역의 점수들 간 상관이 높게 나타나는 문제도 있다. 특히 후자는 다양한 역량을 진료의 독립적인 측면으로 제시하는 사람들에게 문제가 된다.

In addition to problems of inter-rater reliability, scores also demon- strate range restriction (Alves de Lima et al. 2007; Wilkinson et al. 2008), and high correlations between scores from different domains of competence, (Cook et al. 2010; Fernando et al. 2008; Margolis et al. 2006), the latter issue being problematic to those who present variable competencies as independent aspects of practice (Lurie et al. 2009).


scale range가 제한되면 평가자간 신뢰도가 높아지기보다는 낮아지게 되고, 이를 극복하기 위해서 behavioral anchor를 추가하기도 했지만 그 향상 정도는 미미했다. Cook et al. 는 평가자 훈련이 평가자간 신뢰도를 향상시킬 수 있는가 보았지만, 유의한 효과는 없었다. 즉 수행능력에 대한 평가 점수는 문제가 있으며, scale format을 바꾸거나 평가자 훈련을 한다고 해도 기대하는 만큼 많은 향상이 이뤄지지는 않는다.

Restricting the scale range has been seen to reduce rather than increase inter-rater reliability (Cook and Beckman 2009), whereas the addition of behavioural anchors has produced small improvements (Donato et al. 2008). Cook et al. (2009) investigated whether rater training could improve inter-rater reliability, but showed no significant effect. Thus, performance assessment scores are problematic, and neither alterations of scale format nor rater training have produced the desired improve- ment in their psychometric properties.
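평가자 간 신뢰도를 실제로 어떻게 수치화하는지 감을 잡기 위한 스케치이다. Shrout & Fleiss의 ICC(2,1) 공식을 numpy로 직접 계산했으며, 점수 행렬은 설명을 위해 임의로 만든 가상의 값이다(본문 연구들의 자료가 아님).

```python
import numpy as np

def icc_2_1(x):
    """Shrout & Fleiss ICC(2,1). x: (피평가 수행 n) x (평가자 k) 점수 행렬."""
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    ss_rows = k * ((x.mean(axis=1) - grand) ** 2).sum()   # 수행(대상) 간 변동
    ss_cols = n * ((x.mean(axis=0) - grand) ** 2).sum()   # 평가자 간 변동
    ss_err = ((x - grand) ** 2).sum() - ss_rows - ss_cols
    bms = ss_rows / (n - 1)
    jms = ss_cols / (k - 1)
    ems = ss_err / ((n - 1) * (k - 1))
    return (bms - ems) / (bms + (k - 1) * ems + k * (jms - ems) / n)

# 가상의 자료: 동일한 비디오 5편을 평가자 4명이 9점 척도로 채점했다고 가정
ratings = [
    [3, 6, 4, 5],
    [2, 5, 3, 4],
    [6, 8, 6, 7],
    [1, 4, 2, 3],
    [5, 7, 5, 6],
]
print(f"ICC(2,1) = {icc_2_1(ratings):.2f}")
```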


이러한 문제를 재개념화하기 위한 시도로서, Govaerts 등은 수행능력 평가를 고전검사이론의 관점에서 보는 것은 제한된 관점만을 제공한다고 지적했다. 이러한 관점에서는 평가자를 고정된 실체stable entity에 대해 신뢰할 수 없는 측정결과만 만들어내는 '고장난 도구faulty instrument'로 본다. 그러나 Govaerts 등은 이러한 관점 대신 수행능력 평가를 구성주의적, 사회-심리학적 관점에서 보기를 제안한다. 이 관점에서는 사회적, 인지적 요인이 상호작용하여 수행에 대한 개개인의 특이적idiosyncratic 판단을 만든다. 즉, 점수의 변동 중 일부는 평가자의 지각에서의 유의미한 차이에서 기인한다는 것이다. 어떤 변동은 이미 잘 알려진 문화적 편향 등에서 생길 수 있으나, 다른 변동은 단순히 개개인의 접근 방식이 독특해서 생기는 것일 수도 있다. 예컨대 평가자가 과제를 이해하거나 판단하는 방식, 혹은 주의를 기울이는 진료의 측면이 서로 다른 경우이다. 이 모델에서 점수의 변동은 '진점수'의 다원성plurality에서 기인하는 것이지, 단일한 '진점수'가 '오류'에 의해 왜곡된 결과가 아니다.

In attempting to reconceptualise this problem, Govaerts et al. (2007) assert that viewing performance assessments through the lens of classical test theory offers a limited per- spective. This theoretical orientation views the assessor as a ‘‘faulty instrument’’ that produces unreliable measures of a stable entity (hence the classical test theory notion of ‘‘true score’’ and ‘‘error’’) (Streiner and Norman 2008, p 170). Instead, Govaerts et al. propose a theoretical view of performance assessment based on a constructivist, social- psychological perspective. This asserts that social and cognitive factors interact to produce idiosyncratic individual judgements on performance (i.e., variability that can be attributed to meaningful differences in the perceptions of raters). That is, while some variability will arise from the well-documented cultural or other biases that raise concerns about the validity of a rating process, some might arise simply from individual peculiarities in approach—for example comparatively unique ways in which the task is understood or judged, or differences in the specific aspects of practice to which assessors attend. In this model, score variations arise from a plurality of ‘‘true scores’’ rather than from a single ‘‘true score’’ that is distorted by ‘‘error’’.


개개인의 수행에 대한 판단을 형성하는 데 기여하는 사회적, 심리학적 프로세스는 직업심리학 영역에서 많이 연구되어 왔으며, 의학교육 평가와의 관련성도 고려된 바 있다. 요약하면, 이 판단 프로세스는 개개인을 (부분적으로는 유사성에 기초하여) 어떤 카테고리에 배정하는, 자동적 인지와 숙고적 인지automatic and deliberate cognition가 혼합된 카테고리화 과제로 볼 수 있다. 평가자는 과거의 사례에 노출되면서 형성된 판단-관련 스키마schemata를 가지고 있다. 이 과정에서 기억의 왜곡, 정보 탐색의 왜곡, 잘못된 귀인attribution 프로세스 등이 많은 오류를 설명해 줄 수 있다. 판단은 사회적 맥락 속에서 이뤄지며, 평가자의 성향, 평가의 목적, 평가자와 피평가자의 관계 등 다양한 요소에 의해 영향을 받을 수 있다.

The social and psychological processes that contribute to forming judgements on an individual’s performance have been extensively studied within occupational psychology (De Nisi 1984; Feldman 1981), and their relevance to assessment within medical education has been considered (Govaerts et al. 2007; Gingerich et al. 2011; Williams et al. 2003). In summary, the judgement process can be viewed as a categorisation task that proceeds through a mixture of automatic and deliberate cognition to assign individuals to categories, in part based on similarity. Assessors necessarily possess judgement-related schemata that have arisen through exposure to past exemplars. Various distortions of memory and information search, and faulty attribution processes can account for many errors within these processes. Judgements are conducted within a social context and can be influenced by (amongst other things) the assessor’s disposition, the purpose of the assessment, and the relationship between the trainee and the assessor.


극소수의 연구만이 의학교육에서 직접관찰 수행평가DOPA(directly-observed performance assessments)에서의 평가자 판단 프로세스를 연구한 바 있다. Govaerts 등은 비전문가와 비교할 때 전문가 평가자가 다음과 같이 다름을 보였다.

Very few studies have investigated the processes responsible for assessors’ judgements within directly-observed performance assessments in medical education. Govaerts et al. (2011) showed that, compared to non-experts, expert assessors

  • 문제의 대표적 특징을 더 빠르게 찾아냄 developed problem rep- resentations more quickly,
  • 맥락적 힌트에 더 민감함 were more sensitive to contextual cues and
  • 더 많은 추론을 함 made more infer- ences.

 

따라서 전문가는 비전문가보다 더 상세한 평가 스키마schemata를 가지고 있는 것으로 보인다. Kogan 등은 평가자들이 서로 다른 개인적 특성(성향/임상역량/연령/성별 등)을 지니고 있으며, 피훈련자에 대한 추론과 내적 혹은 외적 참조틀internal or external frames of reference이라는 이중 렌즈dual lenses를 통해 수행을 바라본다는 모델을 제시했다. 이 모델에서 판단과 이어지는 통합과정은 환경적 요인과 맥락에 의해 영향을 받는다. 따라서 판단은 다양한 인지적, 사회적 요인에 취약하다고 할 수 있다.

Thus experts appear to possess more detailed assessment schemata then non-experts. Kogan et al. (2010, 2011) described a model in which assessors possess differing personal characteristics (disposition/clinical competence/age/gender etc.) and then view perfor- mance through dual lenses of inferences about trainees, and either internal or external frames of reference. In their model the judgement and subsequent synthesis are influenced by environmental factors and context. Thus judgements are susceptible to a range of cognitive and social factors.


방법

Methods


평가 형식

Assessment format


Assessors score 7 domains of the performance (history taking; physical examination; communication skills; critical judgement; professionalism; organisation/efficiency; and overall clinical care) using a 6-point Likert scale anchored at

  • point 4 against the criterion of ‘‘meets expectation for F1 completion’’.
  • Point 3 is ‘‘borderline for F1 completion’’ with the remaining points comprising
  • ‘‘well below’’, ‘‘below’’, ‘‘above’’ and ‘‘well above’’ this criterion, plus ‘‘unable to comment’’.
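위 서식의 6점 척도를 자료구조로 옮겨보면 다음과 같다. 본문이 명시한 것은 3점(borderline)과 4점(meets expectation)뿐이므로, 1·2·5·6점의 배치는 본문 기술에서 자연스럽게 따라오는 한 가지 해석일 뿐이며 원 서식을 그대로 옮긴 것은 아니다.

```python
# 본문 기술을 바탕으로 한 해석적 재구성 (1, 2, 5, 6점의 배치는 가정)
F1_COMPLETION_SCALE = {
    1: "well below expectation for F1 completion",
    2: "below expectation for F1 completion",
    3: "borderline for F1 completion",
    4: "meets expectation for F1 completion",
    5: "above expectation for F1 completion",
    6: "well above expectation for F1 completion",
}
UNSCORED = "unable to comment"   # 척도 밖의 별도 선택지
```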


평가자료 개발

Development of materials


PGY1(foundation) 의사의 수행에 대해 '우수good', '경계borderline', '불량poor' 수준의 스크립트 기반 비디오 개발

We developed scripted videos of performances by foundation (PGY1) doctors at different levels: ‘‘good’’, ‘‘borderline’’ and ‘‘poor’’ performances.

 

다음의 문헌 참조(병력청취, 전문성의 개발)

We used literature

  • on desirable contents of history taking (Kurtz et al. 2003; Martin 2003) and
  • on the development of expertise (Boshuizen and Schmidt 1992; McLaughlin et al. 2007),

저자들의 경험을 이용

along with the authors’ experience of foundation doctors, to write abstract descriptions of expected performance at each level.


 

참가자

Participants


All participants were consultant physicians from the North West of England.


 

절차

Procedure


Participants viewed videos individually on a laptop computer with headphones. They were instructed to imagine that they were on the medical admissions unit and that a Foundation Year 1 doctor had requested a Mini-CEX.


Think aloud 절차

Think aloud process


여기서 활용된 'Think aloud' 프로토콜이 참가자의 의사결정을 이끄는 실제 사고 과정을 반드시 보여준다고 여겨서는 안 된다. 그러나 이는 사고 과정에 영향을 주는 요인에 대한 개개인의 지각을 탐색하는 데 유용하며, 추가 검증further testing을 위한 풍부한 통찰insight을 제공할 수 있다.

‘‘Think aloud’’ protocols such as those used here should not be treated as necessarily indicative of the actual thought processes that are guiding participants’ decision-making (Bargh and Chartrand 1999). They are useful, however, for exploring individuals’ per- ceptions of factors that influence their thought processes, which can yield rich insight for further testing (Wilson 1994).

 

유용성을 최대화 하기 위해서 Ericsson and Simon 의 가이드라인 활용. 다음의 것에 중요함.

To maximize the usefulness of the process in our study we followed the guidance provided by Ericsson and Simon (1980) who suggest that it is important to

  • (a) 참가자가 열심히 하는지 확인 ensure that participants are actively engaged in the task in question,
  • (b) 참가자가 생각을 묘사(not 설명)하게 함 ask participants to describe, rather than explain their thoughts, and
  • (c) 생각과 생각이 말로 나오는 시간 간격 줄임 reduce the time between participants’ thoughts and their verbalisation.


구인타당도 분석

Analysis of videos’ construct validity

 


 

질적자료 분석

Analysis of qualitative data


Audio recordings were transcribed verbatim and checked for accuracy. A researcher (PY) labelled sections to indicate whether they were

  • concurrent (spoken whilst watching per- formance) or
  • retrospective (spoken after), or
  • from follow up interviews.


Following repeated reading, PY began analysis by inductively assigning codes. Codes were developed to describe each new aspect relevant to the research question. These were discussed with other researchers (PON, KM) and refined from an initial 67 codes to 21 based on similarity. We grouped these codes as

  • ‘‘trainee focused codes’’—comments on the behaviours of the trainee, and
  • ‘‘assessor focused codes’’—comments that indicated ways in which the assessor was thinking.

 

A second researcher (KM) coded 2 transcripts independently, and comparison was made to develop the interpretation and consistency of codes. Constant comparison was used to compare the use and content of both trainee- focused and assessor-focused codes within each assessor, across the different performance levels, and subsequently between assessors.


As the analysis proceeded, we used memos to capture emergent theoretical ideas from the data that enabled understanding of the research question. These were systematically tested and refined or refuted with existing and subsequent data. We developed axial codes to label further examples of these new theoretical concepts.

 

Data were further examined to determine the inter-relationships between theoretical concepts, and to organise concepts into themes. We discussed and reflexively considered all emerging concepts against the data as analysis progressed. Throughout the process, deviant cases that did not fit were sought and used to challenge and refine the emerging theory. When analysis was com- pleted, only slight deviations from the theory were found. These are highlighted in the results.


We collected and analysed data iteratively as the study progressed. Codes were applied by the same researcher (PY) and theory was progressively developed across iterations. Throughout each iteration, we monitored each area of developing theory, and considered whether the new data extended or changed the conceptual ideas that it expressed. Satu- ration was judged to have occurred when iteration 4 developed the theory very little and iteration 5 did not alter the theory. Coding was done using QSR NVivo 8 software. This was used as part of the audit trail, which also included documentation of all memos, and the iterative development of theory.


Results


평가자 판단의 variability의 원인

Sources of variability in assessors’ judgements


두드러지는 특징에 대한 관점 차이

Differential salience


한 평가자에게 중요하게 다가온 것과 다른 평가자에게 중요하게 다가온 것이 다르다.

We found that what struck one assessor as important about a given performance varied from what struck a different assessor as important about the same performance.


또한 한 평가자가 수행의 어떤 측면에 대해 코멘트를 거의 하지 않았을 경우, 전형적으로 다른 측면에 대해서는 더 많은 코멘트를 했다. 즉, 수행의 여러 측면 각각에 대해 코멘트하는 상대적 정도는 평가자마다 달랐다. 이런 방식으로, 각 수행에 대한 평가자의 전반적 초점은 평가자마다 비교적 독특했다.

Moreover, when a given assessor commented little on one aspect of a performance, they typically commented more on different aspects. Thus the relative extent to which assessors commented on different aspects of the performances varied. In this way, assessors’ overall focus within each performance seemed compara- tively unique.


종합하면, 같은 수행을 보고 있어도 수행의 질을 판단하는 데 유용하다고 여긴 수행의 측면들은 평가자마다 달랐다. 그러나 이러한 차이가 관찰 중 주의 초점attentional focus의 차이(즉, noticing)인지, 수행의 특정 측면에 부여하는 비중의 차이(즉, weighting)인지는 우리 자료만으로는 불분명하다.

In sum, the aspects of the performances that assessors regarded as useful for deter- mining their quality varied, despite viewing the same performances. It is not clear from our data whether there were differences in attentional focus during the observation (i.e., noticing) or differences in the emphasis assigned to a given aspect of performance (i.e., weighting).


같은 수행에 대해서도 어떤 측면에서 보느냐에 따라서 두드러지는 특징의 정도degrees of salience가 다르기 때문에, 평가자들은 본질적으로 (같은 수행을 보아도) 서로 다른 관찰에 기반한 판단을 내린다고 할 수 있다.

By having different aspects of the same performance take on variable degrees of salience, raters were in essence forming judgements based on different observations, thereby representing the first source of vari- ability between assessors that contributes to differences between assessors in the judge- ment process.


 

준거 불확실성

Criterion uncertainty


평가 포멧은 "F1 종료시 기대되는 수준"과 비교하여 판단을 내리게 했다. F1 종료시 기대되는 수준은 평가자가 지금까지 겪어온 PGY1 의사가 누구냐에 따라 경험적으로 만들어진 것이라 할 수 있다.

The assessment format asks assessors to judge performance in comparison to ‘‘meeting expectations for F1 completion’’. These expectations were described as experientially- developed through exposure to post-graduate (foundation) year 1 doctors who were encountered over the course of their careers.


평가자들은 이 '기대치'의 구성요소가 무엇인가를 묘사할 때 서로 달랐다.

Assessors differed in the way they described the constituents of their expectations. Some assessors emphasised

  • 지식 the need for factual coverage;
  • 라뽀 others were more concerned with communication or rapport building;
  • 진단 정확성 diagnostic accuracy; or
  • 독립성 evidence of developing independence.

 

어떤 사람들에게는 내용 그 자체보다 면담의 프로세스가 중요했다.

For some the interview process (rather than factual content) was key to their expectations.

가끔은 어떤 한 가지 특정 측면singular aspect의 유무가 중요했다.

Singular aspects (i.e. the presence or absence of a drug-allergy history) were sometimes pivotal.


평가자간 PGY1 의사에게 전형적으로 기대되는 수행능력에 상당한 차이가 있음을 보여준다.

Further comments indicated considerable variation in assessors’ perceptions of the level at which foundation doctors typically perform.


요약하면, 평가자는 PGY1 의사의 수행에 대한 기대를 가지고 있었고 이것이 일반적 준거general criterion로 작용했다. 그러나 이 준거는 경험적으로 형성된 것이고, 서로 다르게 구성되며, 종종 모호했다. 아마도 이러한 모호함에 대응하여 평가자들은 준거를 보강하는augment 상대적 비교도 수행했는데, 이는 상황에 따라 영향을 받을 가능성이 있었다. PGY1 의사가 전형적으로 수행하는 수준에 대한 평가자들의 인식 또한 서로 달랐기 때문이다. 결과적으로, 아마도 개인적 경험의 차이에서 비롯되었을 평가 준거의 이해와 활용의 차이가 평가자 점수 변동의 두 번째 기전으로 작용하여, 판단에 상대적 독특성을 부여했다.

In summary, assessors possessed expectations of foundation doctor performance that served as a general criterion. These were experientially derived, differently constructed, and often ambiguous. Perhaps in response to this ambiguity, assessors also made relative comparisons that augmented their criteria, but had the potential to be situationally influ- enced as assessors’ perceptions of the level at which foundation doctors typically perform also varied. Consequently differences in understanding and use of the assessment’s cri- teria—probably due to differing personal experiences—acted as a second mechanism that contributed to variability in assessors’ scores, introducing relative uniqueness into their judgements.

 


 

정보 통합

Information integration


평가자들은 나름의 서사적 기술 언어narrative description language를 사용하여 판단을 내린다.

Thus it appears that—by and large—assessors judge in their own narrative descriptive language.


대부분의 평가자들은 포괄적 용어global term으로 판단을 묘사했으며, 영역간 구분을 짓는 것이 어렵다는 것을 인식했다.

Most assessors described that their judgements evolved in global terms or that they perceived that the domains were difficult to distinguish between:


그 결과 각 영역에 대해서 점수를 지정하는 것은 두 단계를 필요로 한다.

As a consequence, allocating a score for each domain required two processes:

  • 평가자의 판단을 각자의 서사적 기술 언어에서 척도 기술어scale descriptor로 변환하는 것
    con- version of the assessor’s judgement from their individual narrative description into the scale descriptors and
  • 총괄적 인상을 영역별 점수로 변환하는 것
    conversion into scores for each domain based on a global impression.

 

이러한 점에서, 총괄적 인상global impression의 변동이 영역 점수에 대한 지각의 변동에 영향을 주는 것이지 그 반대가 아닌 것으로 보인다.

In this way it appears that variability in global impressions influenced variability of per- ceptions of domain scores rather than the reverse.


종합하면, 평가자의 판단이 형태를 갖춰감에 따라, 수행이 얼마나 유능한 것으로 판단되는가는 평가자마다 다른, 개별적이고 비교적 독특한 서사적 기술로 표상되었다. 이러한 서사적 기술은 총괄적 전반 판단과 함께 형성되는 경향이 있었고, 둘 모두 수행의 개별 측면에 대한 척도 기술어scale descriptors로 변환되어야 했다. 이 변환이 어떻게 일어나는가가 점수에 내재한 변동성에 추가로 영향을 주었을 수 있다.

In sum, as assessors’ judgements took shape, the degree to which a performance was judged to be competent was represented by means of individual, comparatively unique narrative descriptions that varied between assessors. These tended to form along with a global overall judgement, both of which had to be converted into scale descriptors for individual aspects of practice. How that conversion took place may have further influenced the variability inherent in the scores.


 

고찰

Discussion


요약과 이론

Summary of findings and theory


 

같은 수행을 보고도 평가자가 집중해서 보는 측면과 서로 다른 측면에 배정되는 가중치는 평가자마다 달랐고 그 다른 정도도 다양했다. 결과적으로, 평가자는 서로 다른 관찰, 상대적으로 독특한 관찰의 조합을 바탕으로 판단을 내린다고 할 수 있다.

Despite viewing the same performances, assessors’ attentional focus and perhaps the weight they assign to different aspects of performance varies such that different aspects of the per- formance become salient to different assessors to different degrees. Consequently, asses- sors appear to rely on different, comparatively unique combinations of observations when formulating judgements.


두번째로, 평가자는 이러한 관찰 결과를 그들이 가지고 있는mentally held (종종 모호하고, 서로 다르게 구성되고, 서로 다른 '전형성typicality'의) 역량 기준과 비교하게 된다. 평가자는 이러한 기준을 형성할 때, 그리고 이 기준을 가지고 판단을 내릴 때, (적어도) 그들이 경험한 피훈련자의 사례를 참고로 하게 된다. 따라서 평가자들의 경험이 독특하기 때문에 평가자에게 다음과 같은 방식으로 다양한 방면으로 영향을 주게 된다multifaceted influence

  • 수행의 어떤 측면facet이 가장 두드러지는가salient 에 대해서 영향을 주고
  • 평가자의 평가기준에 영향을 주고
  • 평가자가 직접적으로 비교할 사례집단을 형성하여 영향을 준다

Secondly, assessors compare these observations against mentally held competence criteria that are often uncertain, are differently constructed, and include different percep- tions of typicality. Assessors’ appear to formulate these criteria and judge against them at least partly through reference to exemplar trainees with whom they have experience. Uniqueness in assessors’ experience is likely, therefore, to have a multifaceted influence by altering assessors’ perception of which facets of the performance are most salient, by influencing the criterion standard held by the assessor, and by creating a group of exem- plars against which assessors directly compare.


마지막으로, 평가자가 판단을 내릴 때 이러한 다양한 프로세스에 의해서 영향을 받기 때문에, 따라서 평가자들은 그러한 판단의 긍정-부정(valence of those judgements)을(혹은 관찰결과와 평가기준 간 차이의 정도를) 개개인별로 특이적으로 생성한 서사적 언어로 표현한다. 이러한 서사적(총괄적) 판단은 평가 스케일로 변환되어 개별 영역의 점수를 생성한다.

Finally, as assessors form judgements that are influenced by these various processes, they describe—and therefore presumably mentally represent—the valence of those judgements (or the judged degree of difference between their observations and their criteria) in individually generated narrative language. These individual narrative (and global) judgements are converted into the assessment scale to produce scores for each individual domain.


마지막으로, 우리가 개개인의 수준에서 판단의 variability에 초점을 맞추었지만, 그 variability가 무한하다고 가정해서는 안된다. Thammasitboon 등은 서로 다른 평가자가 역량에 대한 다양한 개념을 가지고 있지만, 이 개념은 네 개의 역량 구인으로 그룹지어질 수 있음을 밝혔다. 아마도 더 많은 수의 평가자를 대상으로 표본을 수집하면 반복되는 패턴을 찾을 수 있을 것이다. 그러한 패턴을 밝힘으로써 평가자에 의한 모든 variability를 단순히 error라고 뭉뚱그리지 않고 variability에 대한 더 깊은 이해가 가능할 것이다.

Further, while we have focused on variability in judgement at the individual level, one should not presume that such variability could ever be infinite in scope. Thammasitboon et al. (2008) showed that whilst different assessors possessed different conceptions of competence in multi-source feedback, their conceptions could be grouped into four dif- ferent constructs of competence. Presumably a larger sample of assessors might have enabled us to identify repeated patterns of performance that could be grouped. Observation of such patterns would reinforce the conclusion that meaningful differences might be drawn from variability in perception rather than simply concluding that all variability not attributable to the individuals being assessed should be deemed to be ‘‘error.’’


 

연구의 한계

Consideration of limitations


 

이론적 함의

Theoretical implications of findings


이 연구에는 몇 가지 중요한 함의가 있다. Govaerts 등은 점수의 변동variation을 단순히 오류error로 보기보다는 특이성idiosyncrasy의 한 형태로 보는 것이 낫다고 제안했다. 우리의 결과는 '정확성accuracy' 그 자체에 대한 주장을 가능하게 하지는 않지만, 어떻게 한 개인의 여러 특이점들이 결합하여 평가자 판단의 특이성을 만들어내는지를 보여준다.

This study has a number of important implications. Govaerts et al. (2007) suggested that score variation is better viewed as a form of idiosyncrasy rather than simply as error. Our results do not allow us to make claims about ‘accuracy’ per se, but they do illustrate how multiple individual peculiarities can combine to produce idiosyncrasy in assessors’ judgements.


수행에 대한 평가자의 특이적idiosyncratic 판단은 이전에 다른 맥락에서 보고된 바 있다. Ginsburgh 등의 연구는 서로 다른 전공의에 대한 인상에 기반한 것이었지만, 우리의 연구는 평가자들이 동일한 수행들을 보았을 때조차 유사한 결과가 나타남을 보여주었다. 즉, 이러한 특이성의 적어도 일부는 피훈련자의 행동 차이가 아니라 평가자의 차이에서 비롯되는 것이다.

Idiosyncratic judgements on performance by assessors have been previously reported in a different context. Ginsburgh et al. (2010)’s study was based on impressions of different residents, our results show that similar findings can occur even when assessors view the same pool of performances—thus indicating that at least some of this idiosyncrasy arises from assessor differences rather than from differences in trainee behaviour.


교육적 관점에서 '부정확inaccuracy'과 '특이성idiosyncrasy'의 구분은 중요하다. 무엇보다, 이 분야가 채택해 온 psychometric 관점은 충분히 표집하여 적절한 신뢰도를 확보하기만 하면 학습자에게 '정확한' 피드백을 줄 수 있다고 가정한다. 만약 일부 변동성이 수행에 대한 (동등하게 유효한equally valid) 서로 다른 의미 있는 인상을 나타내는 것이라면, 우리의 과제는 어떻게 다양한 관점을 삼각측량triangulation하여 학습자에 대한 온전한 그림complete picture을 그려내고, 이를 가지고 수행에 관한 유용한 피드백을 전달할 것인가가 된다. 이러한 개념은 Govaerts 등에 의해 이미 언급된 바 있으며, van der Vleuten and Schuwirth의 programmatic assessment 접근법과도 일맥상통한다.

From an educational perspective, the difference between inaccuracy and idiosyncrasy is important: most dominantly, the field has adopted a psychometric perspective which assumes that once we have sampled enough to ensure adequate reliability, we can provide learners with ‘‘accurate’’ feedback. If some variability indicates meaningful (i.e., equally valid) different impressions of performance we are now faced with determining how to triangulate between multiple perspectives to create a complete picture of a learner and convey useful feedback about their performance. This concept has been previously artic- ulated by Govaerts et al. (2007) and resonates with the approach to programmatic assessment suggested by van der Vleuten and Schuwirth (2005).


평가자가 받은 인상의 특이성에 기여하는 개별 기전들은 평가의 이론과 실제 모두에 추가적인 함의를 가진다. 평가자가 동일한 수행의 서로 다른 측면에 서로 다른 정도의 중요성salience을 부여한다는 사실은, 판단이나 채점은 차치하더라도, 평가자가 수행을 '객관적으로' 관찰할 수 있는가 하는 의문을 제기한다. 따라서 지금까지 공식적 시험 세팅에서 평가자에게 판단의 준거가 될 정해진 기준을 제공하는 데에는 많은 노력이 기울여졌지만, 평가자가 판단하는 동안의 주의 초점attentional focus에 관한 연구는 거의 이루어지지 않았다.

The individual mechanisms that contribute to idiosyncrasy of assessors’ impressions have further implications for both theory and practice within assessment. The finding that assessors attribute different degrees of salience to different aspects of common perfor- mances—through noticing or paying attention to them differently—questions the notion that assessors can ‘‘objectively’’ observe—let alone judge or rate—performances. Thus, whereas much prior effort has gone into providing examiners in formal exam settings with defined criteria against which to judge (Newble 2004), very little work has been undertaken concerning examiners’ attentional focus whilst judging.


평가자 훈련, 특히 ‘‘Frame of Reference Training’’ (FORT) 훈련은 직업심리학에서 폭넓게 효과적인 것으로 확인되었다. FORT는 다음을 포함한다.

Assessor training, in particular ‘‘Frame of Reference Training’’ (FORT) training, has been shown to be effective across a breadth of contexts in occupational psychology, showing moderate to large effects on a range of endpoints (Woehr 1994). Frame of ref- erence training involves:


  • defining performance dimensions, 
  • providing a sample of behavioural incidents representing each dimension (along with the level of performance represented by each incident) and 
  • practice and feedback using these standards to evaluate perfor- mance (Schleicher et al. 2002)


따라서 Holmboe 등과 Cook 등이 의학교육 맥락에서 보고한, 평가자 훈련의 효과가 제한적이거나 없다는 결과는 놀랍다. 우리는 평가자들이 자신의 준거를 경력 전반에 걸쳐 반복적으로 사례exemplar를 접하면서 경험적으로 형성된 것으로 기술함을 확인했다. 이러한 오랜 경험에도 불구하고 준거는 여전히 불확실하며, 최근의 사례에 의해 영향을 받을 수 있다. 이러한 결과는, 평가 속성에 대한 합의를 이루는 것보다 더 많은 수의 사례에 노출되는 것이, 평가자의 준거가 표상되는 방식에 더 부합하는, 잠재적으로 성공적인 대안적 전략일 수 있음을 시사한다.

Therefore, the results reported by Holmboe et al. (2004) and Cook et al. (2009)ina medical education context, showing limited or no effect of rater training, are surprising. We found that assessors described their criteria as experientially derived over the course of their careers, through exposure to repeated exemplars. Despite this long experience, criteria remain uncertain, and can be influenced by recent examples. These findings raise the possibility that exposure to a greater number of exemplars, rather than agreeing on attri- butes, may represent an alternative, potentially successful, strategy that is more akin to the way assessors’ criteria are represented.





Adv Health Sci Educ Theory Pract. 2013 Aug;18(3):325-41. doi: 10.1007/s10459-012-9372-1. Epub 2012 May 12.

Seeing the same thing differently: mechanisms that contribute to assessor differences in directly-observed performance assessments.

Author information

  • 1School of Translational Medicine, University of Manchester, Manchester, UK. peter.yeates@manchester.ac.uk

Abstract

Assessors' scores in performance assessments are known to be highly variable. Attempted improvements through training or rating format have achieved minimal gains. The mechanisms that contribute to variability in assessors' scoring remain unclear. This study investigated these mechanisms. We used a qualitative approach to study assessors' judgements whilst they observed common simulated videoed performances of junior doctors obtaining clinical histories. Assessors commented concurrently and retrospectively on performances, provided scores and follow-up interviews. Data were analysed using principles of grounded theory. We developed three themes that help to explain how variability arises: Differential Salience-assessors paid attention to (or valued) different aspects of the performances to different degrees; Criterion Uncertainty-assessors' criteria were differently constructed, uncertain, and were influenced by recent exemplars; Information Integration-assessors described the valence of their comments in their own unique narrative terms, usually forming global impressions. Our results (whilst not precluding the operation of established biases) describe mechanisms by which assessors' judgements become meaningfully-different or unique. Our results have theoretical relevance to understanding the formative educational messages that performance assessments provide. They give insight relevant to assessor training, assessors' ability to be observationally "objective" and to the educational value of narrative comments (in contrast to numerical ratings).

PMID: 22581567 [PubMed - indexed for MEDLINE]


평가자-기반 평가에서 첫인상의 역할에 대한 고찰(Adv in Health Sci Educ, 2014)

Exploring the role of first impressions in rater-based assessments

Timothy J. Wood




의학은 오랜 기간 학습자의 역량을 평가할 때 교사나 전문가의 판단에 의존해 왔다. 이들을 평가자로 활용하는 것은 두 가지 요인을 반영한다. 첫째, 좋은 의사가 되는 데 필요한 스킬은 지필고사와 같은 비-평가자 기반 평가방법으로는 쉽게 드러나지 않는다. 둘째 요인은 의사가 어떻게 훈련되는가와 관련되어 있다. 학습자는 훈련의 일부로서 임상환경에서 관찰의 대상이 된다.

Medicine has a long history of assessing the competence of learners by relying on the judgments of teacher and/or experts. This use of these people as raters is likely a reflection of two factors. First, the skills that make a good physician do not necessarily lend themselves easily to non-rater based assessments methods like written examinations. The second factor relates to how physicians are trained. Learners are observed in clinical settings as part of their training,


최근에는 역량-바탕 프레임워크를 채택하여 학습자의 스킬을 평가하자는 움직임이 커지고 있다. 이 평가 프레임워크는 근무지 평가뿐 아니라 피드백의 활용을 강조하는데, 둘 모두 관찰을 필요로 하므로 평가자의 중요한 역할이 더욱 부각된다.

More recently, there has been an increased push to adopt a competency- based framework to assess the skills of learners (Holmboe et al. 2010). This assessment framework emphasizes the use of feedback as well as workplace assessments, both of which require observation thus further highlighting the critical role of the rater.


그러나 안타깝게도, 모든 인간은 선입견과 편견을 가지고 있고, 이것이 학습자의 역량을 평가할 때 그 평가의 퀄리티에 영향을 끼친다.

Unfortunately, all humans have preconceived notions, biases and abilities that influence the quality of the judgments they make when assessing the competence of learners (Gige- renzer and Gaissmaier 2011; Hoyt 2000; Landy and Farr 1980; Saal et al. 1980, 1974; Williams et al. 2003).


의학교육에서 활용되는 평가가 가치를 가지려면(타당하고 신뢰성 있으려면), 사람들이 타인에게 점수를 매기는 인지적 프로세스를 이해하는 것이 중요하다. 실제로, 모든 평가에 있어서 이러한 종류의 정보를 수집하는 것은 chain of validity evidence의 한 부분이다.

To ensure the assessments that are used in medical education have value (i.e., are valid and reliable), it is crucial that we understand the cognitive processes behind howpeople assign scores when assessing others. In fact, the collection of this type of information is considered part of the chain of validity evidence one should collect with regards to any assessment (AERAet al. 1999; Clauser et al. 2008; Cook and Beckman 2006; Downing and Haladyna 2009).


특히 관심의 대상이 되는 것은 '첫인상', '단편적 판단', '아는바 없음' 와 같은 판단이다.

Of particular interest is the impact of judgments often referred to as ‘‘first impression’’, ‘‘thin slice’’ or ‘‘zero acquaintance’’ judgments.


첫인상

First impressions


이러한 타인에 대한 판단을 '인상'이라 부르며, 우리가 타인의 인성과 행동에 관한 정보를 인지하고 조직하고 통합하는데 도움을 주는 카테고리이다. 첫인상은 빠르게 만들어지는 인상으로서, 누군가를 만나고 5분 내에 형성된다. 첫인상은 첫인상 판단이 내려지는 시점이 매우 빠르고 제한된 정보에 따라 내려진다는 점을 감안하면 놀라울 정도로 정확하다.

These judgments about others are called impressions, which are categories that we use to help us perceive, organize and integrate information about an individual’s personality and behavior (Feldman 1981; Fiske and Neuberg 1990; Gingerich et al. 2011). First impressions are a type of impression that is made quickly, usually within 5 min of meeting someone for the first time. First impressions have been found to be surprisingly accurate given how quickly they form and the limited information on which they are based (Ambady and Rosenthal 1992; Ambady 2010; Harris and Garris 2008).

 


 

첫인상 뒤에 숨겨진 인지절차는 무엇인가?

What are the cognitive processes behind a first impression?


많은 인지적 활동(의사결정, 추론, 카테고리화, 기억) 등은 두 가지 절차에 따라 이뤄진다. 시스템1과 시스템2 프로세스이다. 일반적으로 시스템1은 빠르고, 노력이 덜 들며, 비-분석적이고, 자동적이고, 무의식적이며, 시스템2는 느리고, 노력이 들고, 분석적이고, 통제되며, 의식적이다.

Many cognitive activities including decision making, reasoning, categorization, and memory are thought to consist of two underlying processes; what have come to be known as System 1 and System 2 processes (Evans 2008; Uleman et al. 2008; Kahneman 2011). It is generally accepted that System 1 processes are rapid, effortless, non-analytic, automatic, and/or unconscious, whereas System 2 processes are slow, effortful, analytic, controlled, and/or conscious.


  • 시스템1: 'cat', 'dog'처럼 친숙한 단어를 읽는 것
  • 시스템2: 'parasitological', 'incudostapedial'처럼 낯선 단어를 읽는 것


시스템1과 시스템2 모두 많은 인지활동에 활용될 수 있으며, 인지심리학과 사회적판단, 의사결정에 관한 연구에서 우리가 어떻게 이 두 가지 프로세스를 조화시키는지 이해하려고 노력해왔다.

Both System 1 and System 2 processes can be used to perform many of the cognitive activities listed above; therefore the focus of research in cognitive psychology and social judgment and decision making is to try to understand how we coordinate these two processes (see Brooks 2005; DeNisi et al. 1984; Fiske and Neuberg 1990; Jacoby 1991; Kahneman 2011; Norman 2009; Schneider and Chein 2003 for examples in these areas).


첫인상은 주로 시스템1 프로세스를 반영한다.

First impressions are thought to reflect primarily System 1 processes


만약 첫인상에 대한 이러한 가정(시스템1 프로세스)이 사실이라면, 무의식적 프로세스에 의존하는 과제에 있어서, 사람들이 자신이 그 과제를 어떻게 수행했는지를 말로 설명하는 것은 어려울 것이다.

If this assumption about first impressions is true, for tasks that rely on unconscious processes, it should be difficult for people to accurately verbalize how they performed a task


 

어린아이에게 자전거를 어떻게 타는지 설명해주는 것이 얼마나 어려운가를 생각해보라

One just has to think of how hard it is to explicitly verbalize to a child the steps needed to ride a bicycle to realize this occurs.


첫인상과 관련된 인식의 수준level of awareness을 살펴본 연구는 적다. 이루어진 연구들은 주로 정확성과 자신감accuracy and confidence의 관계에 초점을 두었다. 예컨대 Smith 등은 정확성과 자신감 사이에 낮지만 정적인 상관관계가 있음을 확인했다.

there have only been a few studies that have looked at the level of awareness associated with first impressions. Of the work that has been done, the focus has been on the relationship between accuracy and confidence. For example, (Smith et al. 1991) found a low but positive relation between accuracy and confidence levels.


유사하게 Ames 등의 연구에서 정확성과 자신감의 상관은 자신의 평가에 전혀 자신이 없는 평가자에게서 가장 높았는데, 이들은 자신이 그 성격 판단에 대해 전혀 감이 없다는 것을 스스로 알고 있었기 때문이다.

Similarly, (Ames et al. 2010), the correlation between accuracy and confidence was highest for those raters with no confidence in their rating because they knew when they had no idea about a personality judgment.


비록 사람들이 다른 사람에 대해서 판단하는 것이 가져올 결과를 알고 있더라도, 사람들은 어떻게 그 판단을 내렸는지 설명하는 일을 어려워하며, 그 판단을 어떻게 내렸는가에 대한 통찰insight을 거의 가지고 있지 않다

The conclusion from both of these studies is that, although people may be aware of the outcome of forming a judgment of others, they appear to have difficulty articulating how they did it and/or have little insight into how they actually made that judgment.


Biesanz 등은 자신감과 정확성 사이의 낮은 상관이 인식의 결여lack of awareness를 반영한다는 결론에 의문을 표했다. 이들은 사람들이 자신의 판단이 전반적으로 얼마나 정확한지는 모르더라도, 개별 첫인상 판단이 정확한 바로 그 순간에는 이를 인지하고 있다고 주장한다.

Recently, Biesanz et al. (2011) questioned the finding that a low relationship between confidence and accuracy in first impression judgments reflects a lack of awareness. They argued that people are aware at the time when individual first impression judgments are accurate even if they do not know how accurate their judgments are globally.


첫인상이 시스템1 프로세스를 사용함을 반영하는 또 다른 연구패턴은 이 판단이 빠르게 내려진다는 것이다.

Another pattern of results that one should expect if a first impression reflects the use of System 1 processes is that the judgments should be made quickly.


Willis and Todorov 는 사진에 있는 사람을 보고 성격에 대한 인상을 정확하게 판단할 때, 100ms만 보고서도 시간제한없이 본 것과 비슷한 정확도로 판단할 수 있음을 밝혔다.

Willis and Todorov (2006). found that people can produce as accurate an impression of the personality traits associated with a person in a photograph after 100 ms exposure as they do when viewing the same photo- graph with no time constraints.

 


Dodson 등은 입학 OSCE의 평가자에게 지원자의 능력을 5분 시점에, 그리고 다시 8분 시점에 평가하게 했다. 5분 시점의 평가는 8분 시점의 평가보다 점수가 낮았으나, 두 시점 점수 간 상관은 높았으며(r = 0.82–0.91), 5분 시점의 평가에서도 점수의 신뢰도는 떨어지지 않았다.

Dodson et al. (2009) asked examiners on an admissions OSCE to provide a rating of the examinee’s abilities at the 5 min mark and then again at the 8 min mark. Ratings at 5 min were lower than ratings at eight minutes, but the correlation between ratings at the two time points was high (r = 0.82–0.91) with no drop in reliability for scores at the 5 min mark.
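두 시점 평가의 관계를 확인하는 계산을 가상의 수치로 흉내 내본 스케치이다. 아래 점수는 임의로 만든 것이며, 본문이 보고한 r = 0.82–0.91을 재현하려는 것이 아니다.

```python
import numpy as np

# 같은 지원자들에 대한 5분 시점 / 8분 시점 평가 (가상의 점수)
at_5min = np.array([5.0, 6.5, 4.0, 7.0, 5.5, 6.0, 3.5, 6.5])
at_8min = np.array([5.5, 7.0, 4.5, 7.0, 6.5, 6.0, 4.0, 7.0])

r = np.corrcoef(at_5min, at_8min)[0, 1]   # 두 시점 점수 간 상관
mean_shift = (at_8min - at_5min).mean()   # 8분 시점이 체계적으로 높은가
print(f"r = {r:.2f}, 평균 변화 = {mean_shift:+.2f}")
```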


Govaerts 등은 경험이 많은 평가자와 경험이 거의 없는 평가자에게 피평가자의 비디오를 보게 하였는데, 모든 경우에서 평가자들은 5분 내에 판단을 내릴 수 있다고 생각했다.

Govaerts et al (2011; see also Govaerts et al. 2013) asked experienced and inexperienced examiners to watch two videos of a trainee with a patient. In all conditions, therefore, the examiners thought they could judge the performance in under 5 min.


첫인상이 시스템1 프로세스를 사용한다고 가정한다면, 세 번째 특징은 판단을 내릴 때 인지적 자원의 소모가 거의 없어야 한다는 것이다. 인지심리학에서 과제의 자동화를 연구하는 한 가지 흔한 방법은 divided attention task를 사용하는 것.

A third characteristic that one would expect if first impressions reflect System 1 pro- cesses is that the judgment should require few cognitive resources to operate. In cognitive psychology, one common method used to study the automaticity of a task has been to use a divided attention task;


 

이러한 논리에 따르면, 또 다른 과제를 동시에 수행하게 하더라도 초기 판단의 정확성이나 수행에는 거의 영향이 없을 것이다.

By this logic, introducing a simultaneous task will have little impact on the accuracy or performance associated with the initial judgment.


Patterson and Stockbridge는 높은 인지부하 조건에서 첫인상 그룹이 숙고 그룹보다 더 정확한 판단을 내렸음을 확인했는데, 이는 첫인상 판단이 주로 시스템1 프로세스에 의존한다면 기대할 수 있는 결과이다.

Patterson and Stockbridge (1998). found that participants in the high cognitive load condition were more accurate in the first impression group compared to the deliberative group, a finding one would expect if a first impression judgment relied primarily on a System 1 process.


Ambady는 이유를 설명하게 한 조건reasons condition에서는 강의 초반 평가와 최종 강의평가 간 상관이 낮았지만(r = 0.27), 통제조건과 지연조건은 인지부하조건과 차이가 없었음을(r = 0.65–0.71) 보고했다. 이러한 패턴은 참가자들이 초기 판단을 내릴 때 주로 시스템1 프로세스에 의존한다면 기대되는 결과이다.

Ambady found a low correlation between the initial ratings and the final course ratings in the reasons condition (r = 0.27) but no differences between the control and delay conditions compared to the cognitive load condition (r = 0.65–0.71). This pattern is what would be expected if participants were relying primarily on a System1 process to make their initial judgments.


요약하자면, 첫인상에 깔린 인지프로세스에 대한 연구를 보면 첫인상은 시스템1 프로세스에 의존하는 것으로 보이며, 왜냐하면 평가자는 흔히 자신들이 어떻게 그 인상을 형성했는지 인지하지 못하며, 그 판단은 빠르게 내려지고, 인지적 부하를 가한 조건에 민감하게 반응하지 않았기 때문이다.

In summary, research looking at the underlying cognitive processes behind first impressions would suggest that first impressions likely reflect the reliance on a System 1 process because raters are typically unaware of how they created an impression, the impressions are made quickly, and they are not sensitive to manipulations that add com- peting attentional demands.


첫인상은 얼마나 정확한가?

How accurate are first impressions?


이 질문에 대한 대답은 논쟁의 여지가 있다. 듀얼-프로세스 모델에 관한 많은 문헌은 사람들이 시스템1 프로세스에 의존할 때 늘어나는 오류에 초점을 두어 왔다. 그 설명에 따르면, 시스템1 프로세스에 의존할 때 평가자는 숙고적이고 분석적인 프로세스에 비해 휴리스틱, 기억 인출, 기타 인지적 편향의 영향을 받기 쉽고, 이것이 오류로 이어진다. 이러한 관점에서 보면 첫인상은 오류에 취약하며, 여기에 기반한 판단은 지양되어야 한다.

The answer to this question is debatable. Much of the literature on dual process models (Evans 2008; Croskerry 2009; Kahneman 2011; Tversky and Kahneman 1974) has focused on the increase in errors that occur when people rely on System 1 processes. The explanation for the increase is that when relying on System 1 processes, raters are more likely to be influenced by heuristics, memory retrieval or other cognitive biases which lead to errors compared to processes that are more deliberative and analytic. From this perspective, first impressions should be prone to errors, and judgments based on them should be avoided.


시스템1 프로세스가 시스템2 프로세스보다 더 에러를 발생시킬 가능성이 높다는 근거에도, 일부 연구자들은 이 연구결과의 일반성에 의문을 표한다. 예컨대, 임상추론 연구를 리뷰하여, 일부 연구자들은 임상문제에 대해서 (빠른 반응은 정확하나) 늦은 반응slow response은 오류를 만들어내는 상황들을 찾아내었다.

Despite evidence that System 1 processes can lead to more errors than System 2 pro- cesses, some researchers have challenged the generality of these results (Eva and Norman 2005; Gigerenzer and Gaissmaier 2011; Klein 2009). For example, (Norman 2009; also Sherbino et al. 2012), in a review of clinical reasoning studies, described situations in which errors were associated with slow responses to clinical problems, whereas fast responses were more accurate.


또 다른 연구는 실험실 세팅에서 내린 틀린 판단이 실제 상황에서는 옳은 판단이었을 수 있다는 것에 초점을 둔다. Müller-Lyer 착시를 시스템1과 시스템2에 따라서 해석한 예시가 있다.

Other researchers have suggested that rating-based research needs to be focused on what people can do in more naturalistic settings or with more realistic stimuli because a wrong judgment in the laboratory may be a correct judgment in the real world. This distinction is best demonstrated by considering how the Müller-Lyer Illusion is interpreted in terms of System 1 and System 2 processes.


즉, '착각'이 반드시 판단의 오류를 의미하는 것은 아니라는 점이다.

In other words, the illusion does not reflect an error of judgment.


정확성에 대해 한 가지 더 언급할 것이 있다. 많은 사회적 판단 연구에서 '정확성'은 상대적인 개념인데, 무엇이 옳고 그른지를 명확히 정의하는 절대 기준gold standard이 존재하지 않기 때문이다. 오히려 정확성은 '판단'에 기초하며, 그 판단은 '동의agreement'나 '예측prediction'에 의존한다.

Another comment about accuracy is needed. Accuracy is a relative concept in many social judgment studies because a gold standard that clearly defines right or wrong does not exist. Rather, accuracy is based on a judgment, which may rely on agreement or prediction (Funder 1987; Funder and West 1993; Kenny 1993).


'동의agreement'에 있어서, 평가자 평가와 자기평가를 비교한 것이나 같은 준거로 다른 평가자의 평가와 비교한 연구 등이 있다.(self-other agreement / consensus rating)

In the case of agreement, studies usually compare ratings made by a rater to those made by the target (self-other agreement) or to a rating made by other raters using the same criteria (consensus rating).


'예측prediction'에 있어서, 동일한 혹은 다른 준거에 따라 미래의 결과를 예측하는지 보는 것이다.

In the case of prediction, the ratings are used to see if they predict a future result based on either the same or different criteria.


Funder가 주장한 바와 같이, 이 분야의 연구는 '상관관계의 크기가 아니라, 판단이 더 정확해지는지 아니면 덜 정확해지는지'를 연구해야 한다.

As argued by Funder (1987), research in this area should study circumstances in which judgments become more or less accurate rather than focus on the magnitude of the correlation.


판단의 정확성이 상대적이라는 관점에서, 첫인상에 대한 연구는 정확도에 있어서 다양한 결과를 보여주었다. Barrick 등은 짧은 라뽀 세션rapport session에 기반한 첫인상과 인터뷰 점수가 중등도의 상관관계를 가짐을 보여주었다.

In light of the argument that accuracy is relative, studies of first impressions have shown considerable range in terms of the degree of accuracy. Barrick et al. (2010) found a moderate correlation between first impression based on a short rapport session and an interview score (r = 0.42).


요약하자면, 듀얼-프로세스 이론가들의 공통적인 관점은 첫인상을 포함하여 시스템1 프로세스에 의존하는 것이 판단 오류로 이어질 수 있으므로 이를 경계해야 한다는 것이다. 연구자들은 이에 대해 두 가지 반응을 보인다. 첫째, 이러한 패턴이 모든 경우에 참은 아니며, 느리고 숙고적인 프로세스에 기반한 판단이 첫인상에 기반한 판단보다 더 오류를 일으키는 경우도 있다. 둘째, 일부 연구자들은 오류를 유발하는 요인에 대한 연구 자체의 가치에 의문을 표하며, 실험실에서 시스템1 프로세스로 내린 '오류' 판단이 실험실 밖에서는 오히려 옳은 판단인 경우가 많음을 지적한다. 또한 첫인상을 포함한 대부분의 판단에서 정확성은 상대적인 개념인데, 옳고 그름을 가르는 절대 기준gold standard이 없는 경우가 많기 때문이다. 이러한 상대성을 고려한다면, 어떤 프로세스가 오류를 유발하는가에만 초점을 두기보다는 판단의 정확성이 높아지거나 낮아지는 조건을 연구하는 것이 더 생산적일 것이다.

In summary, a common perspective from dual-process theorists is that reliance on System 1 processes, including first impressions, can lead to errors in judgment and that we need to be wary of relying on these processes when making judgments. Researchers have had two responses to this perspective. First, the pattern is not necessarily true in all cases and it has been shown that judgments based on slow deliberative processes can be more error-prone than those made on first impressions. Second, some researchers have questioned whether studies of factors that produce errors are of value, and point out that often errors made in the laboratory using System 1 processes are actually correct judgments when studied outside the laboratory. In addition, accuracy of most judgments, including first impressions, is relative because there is often no gold standard that determines right from wrong. Given this relativity, it may be more fruitful to study conditions that cause accuracy to increase or decrease rather than focus solely on whether one process leads to errors.


첫인상의 정확도에 영향을 주는 요인은 무엇인가?

What factors modify the accuracy of a first impression?


첫인상의 정확도에 영향을 주는 요인을 찾는 것이 중요하다. Gingerich 등은 평가자의 기분, 평가자가 알던 다른 사람과의 유사성, 사전에 접한 정보 등을 지적했다. 예컨대, 피평가자의 관찰가능한 성격 (외향성 등)은 덜 관찰가능한 성격 (신경증, 개방성) 등에 비해서 더 정확하게 평가가능하다.

An examination of other factors that could modify the accuracy of a first impression would be of value. Gingerich et al. (2011) have reviewed some of these factors within the larger impression formation literature; they include the mood of the rater, similarity to other people the rater knows, and seeing information in advance. For example, with regard to the people being rated, observable personality traits like extroversion are typically judged more accurately than less observable traits like neuroticism or openness (Ambady et al. 1999; Borkenau and Liebler 1992; Lippa and Dietz 2000).


평가자-기반 요인에 있어서..지능intelligence, 젠더, 평가자의 기분mood (슬픈 평가자가 덜 정확하다)

With regard to rater-based factors,

  • intelligence has been identified as a factor that could influence the accuracy of first impressions judgments.
  • Gender has also been identified as a potential factor that can influence the accuracy of judgments based on first impressions (Ambady et al. 1995; Chan et al. 2011; c.f. Lippa and Dietz 2000).
  • Another rater-based factor that can influence the accuracy of a first impression is the mood of the rater, with sad raters having less accurate first impressions than happy raters (Ambady and Gray 2002).


인상 관리impression management와 안정성stability에 대한 연구. 인상 관리란 직업 면접에서 흔히 연구되며, 피면담자가 면담자와의 상호작용을 컨트롤하여 영향을 주고자 하는 것.

Another issue related to factors that could influence the accuracy of a judgment based on a first impression concerns impression management and the stability of the first impression. Impression management is most commonly studied in the job interview literature and refers to situations in which interviewees attempt to influence an interviewer by controlling the interaction between themselves and the interviewer. Barrick et al. (2009) found evidence that

  • 외모 appearance (i.e. physical and professional),
  • 인상관리 impression management (i.e., self-promotion, ingratiation to the interviewer, emphasizing positives, focusing attention on the interviewer), and
  • 언어/비언어적 특성 verbal (voice) and non-verbal (smiling, eye contact) characteristics

 

can all have an influence on the impression a rater may form.


 

 

요약하면, 첫인상에 영향을 줄 수 있는 요인은 다양하다. 이들 중 일부는 판단의 대상이 되는 사람과 관련되어있다. '평가자와 얼마나 유사한가'와 같은 비의도적인 요인들 뿐 아니라 '피평가자가 평가자가 받는 인상을 관리하려는 노력'과 같은 의도적 요인들도 있다. 젠더/지능/기분 등이 관련된다. 외향성은 더 판단하기 쉬운 특성이다.

In summary, a number of factors were identified that can influence the accuracy of a first impression. Some of these factors are related to the person being judged: either unintentional factors like similarity to the rater, or intentional factors like those deliberately used by ratees to manage the impressions raters may create. Other factors that influence accuracy, like gender, intelligence, or mood, are related to the rater. Finally, some traits like extraversion appear to be easier to judge than other traits.


 

평가에 있어서 첫인상의 영향력은?

What is the impact of first impressions for assessment?


첫인상 연구의 대부분은 다음 등이다

The majority of studies of first impressions have focused on

  • the ability of raters to make a personality judgment of some kind,
  • rate the abilities of a teacher, or
  • predict the success of a job interview.


첫 번째 영역은 '자기충족적 예언' 혹은 '기대효과expectancy effect'와 관련된 것이다. 이는 첫인상이 이후의 평가자와 피평가자의 상호작용에 영향을 준다는 것이다.

The first area deals with a phenomenon called self-fulfilling prophecies or an expectancy effect. This phenomenon occurs when an initial impression influences subsequent interactions between the rater and the person being rated (Dipboye 1982; Harris and Garris 2008; Rosenthal 1994).


Snyder 등은 남성 참가자가 부정적 기대를 가지고 있으면, 여성 참가자를 덜 친절하게 대하고 여성으로부터 부정적 반응을 얻는다.

Snyder et al. concluded that if male participants had negative expectations, they treated the female participants in a less friendly manner, getting a negative reaction from the females.


유사하게, 직무 면접에서 Dougherty 등은 긍정적 첫인상이 면접관의 긍정적 커뮤니케이션 스타일과 연결되며, 합격 가능성이 높아지고, 더 긍정적인 보컬 스타일과 연결된다.

Similarly, in a study using job interviews, Dougherty et al. (1994) found that positive first impressions were related to more positive communication styles by the interviewer, increased likelihood to extend an offer, and more positive vocal style.


두 번째로, 첫인상과 후광효과에 대한 것이다. 후광효과는 평가자가 피평가자를 판단할 때 '독립적인 특성들 간' 분별에 실패할 때 발생한다. 후광효과는 모든 평가 영역간 상관관계가 다 높게 나타나는 식으로 드러나거나, 혹은 평균 SD가 작은 방식으로 드러난다. 이는 다양한 측면dimensions에 걸쳐서 한 가지 요인이 모든 variability를 설명할 수 있는 경우이며, 또는 유의미한 평가자-피평가자 상호작용 rater-ratee interaction이 발견되는 경우이다.

The second area in which first impressions could have an impact on assessment is a type of rater bias called a halo effect. A halo effect is thought to occur when a rater fails to discriminate among independent aspects of behavior when making a judgment about a person. Halo is typically manifested as high average correlations across all dimensions being assessed, low average standard deviations across all dimensions being assessed, a single factor accounting for all of the variability in scores across multiple dimensions, or a significant rater × ratee interaction (Balzer and Sulsky 1992; Cooper 1981).
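
As a concrete illustration of these statistical signatures, the sketch below generates synthetic multi-dimension ratings with a strong shared "general impression" component and then computes the indicators named above (mean inter-dimension correlation, mean within-ratee standard deviation, variance captured by a single component). All numbers are invented; this is not an analysis from any of the cited studies.

```python
import numpy as np

# Synthetic ratings: 30 ratees scored by one rater on 5 supposedly independent dimensions.
# A strong "general impression" component is mixed in to mimic halo.
rng = np.random.default_rng(1)
n_ratees, n_dims = 30, 5
general_impression = rng.normal(0, 1, (n_ratees, 1))      # component shared by all dimensions
specific = rng.normal(0, 1, (n_ratees, n_dims))            # dimension-specific component
ratings = 4 + 0.9 * general_impression + 0.3 * specific    # values on a 1-7-style scale

# Indicator 1: high average inter-dimension correlation.
corr = np.corrcoef(ratings, rowvar=False)
off_diag = corr[~np.eye(n_dims, dtype=bool)]
print(f"mean inter-dimension r = {off_diag.mean():.2f}")

# Indicator 2: low average within-ratee standard deviation across dimensions.
print(f"mean within-ratee SD   = {ratings.std(axis=1).mean():.2f}")

# Indicator 3: a single component accounting for most of the variance.
eigvals = np.linalg.eigvalsh(corr)[::-1]
print(f"variance explained by first component = {eigvals[0] / n_dims:.0%}")
```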


후광의 원인에는 여러가지가 있다.

Several sources of halo have also been identified by researchers.

  • 일반적 인상general impression이 이어지는 모든 판단에 영향을 주는 경우
    The first source of halo occurs when a rater makes a judgment about a person based on a general impression (e.g. a first impression) that they form. This impression then influences all subsequent ratings or judgments about the person. For example, if a rater forms a first impression of a learner that is either positive or negative in nature, then this impression will guide the ratings on all dimensions being rated.
  • 한 영역에서 두드러지는 특징salient dimension이 다른 영역에도 줄줄이 영향을 미치는 것
    The second source of halo occurs when a salient dimension or trait drives the ratings on other dimensions being judged. For example, a high or low rating on communication skills could influence ratings on other dimensions, even those that may be unrelated, like technical skills or knowledge.
  • 평가대상이 되는 영역 간의 분간에 실패한 것inadequate discrimination between dimension
    A third source of halo is an inadequate discrimination between dimensions being rated. This source of halo usually occurs when the dimensions being rated are ambiguous and raters end up grouping what are intended to be unrelated dimensions and providing similar ratings.


후광효과가 평가의 정확성/비정확성을 가져오는가? 후광의 존재는 시스템1 프로세스에 따른 것으로 이해되며, 따라서 첫인상의 정확도에 대한 것과 같이 논쟁의 여지가 있다.

Does the presence of a halo effect lead to accurate or inaccurate ratings? The presence of halo is considered to be due to a System 1 process (i.e., use of a general impression or memory of behaviors rather than independent ratings) and therefore, like the discussion around the accuracy of first impressions, the accuracy of judgments influenced by a halo is debatable.


Cook 등은 훈련받은 평가자와 대조군 사이에 후광효과나 정확성의 차이가 거의 없었다고 밝힘(Bernardin과 Pence의 연구와 유사한 결과). 후광-정확성 관계는 추가 연구 필요

Cook et al. (2008) found similar results to the Bernardin and Pence study in that there was little difference in halo and accuracy between raters who were trained and those in a control group. It would appear, therefore, that the relationship between halo and accuracy is an area that warrants further research to understand the conditions that influence this relationship.


요약하자면, 첫인상은 두 가지 방향으로 영향을 줄 수 있다. 자기충족적 예언, 그리고 후광효과
In summary, first impressions may influence the types of assessments used in medicine in two ways. It could contribute to a self-fulfilling prophecy in which negative or positive first impressions influence the way a rater thinks about or interacts with a target. It could also contribute to the presence of a type of rater bias called the halo effect because one of the causal mechanisms behind halo is the use of a general impression by a rater when making a judgment about a target.


결론과 함의

Conclusion, implications for assessment in medical education


Factors related to impression formation (Gingerich et al. 2011), cognitive load (Tavares and Eva 2013; van Merriënboer and Sweller 2010; Wood 2013), familiarity with the examinee (Stroud et al. 2011), rater expertise (Berendonk et al. 2013) as well as rater biases (Iramaneerat and Yudkowsky 2007; Williams et al. 2003) and overly structured assessments within competency-based frameworks (Ginsburg et al. 2010) have all been identified as influencing the way assessors assign scores.


1) 첫인상이 평가에 얼마나 영향을 주는가?

1) To what degree will first impressions influence subsequent ratings within a particular assessment context or tool?


영향을 준다는 것은 확실해 보이(is related to subsequent scores)나, 다양한 맥락에서의 확인이 필요

The basic finding, that first impressions are related to subsequent scores, is compelling but requires demonstration in a variety of contexts.


2) 한 평가 상황내에서도 첫인상이 바뀌는가?
2) Do first impressions change within the context of a single assessment session and if so under what conditions?


OSCE에서 초반에 못하다가 점차 회복하는 학생들이 있다. 그러나 좀 더 rigorous한 연구가 필요
Anecdotally, many physician examiners can describe an examinee that started off an OSCE station or oral examination badly and then recovered brilliantly. These stories suggest that impressions can change, but such anecdotal evidence must be supported by rigorous research. The stability of a first impression is particularly important for examinations like OSCEs,


만약 판단이 첫 몇분간 끝난다면, 평가 시간이 길어지는 것이 평가의 퀄리티에 주는 영향이 없을 것이다.

If a judgment about the examinee’s ability is made within the first couple of minutes, and that judgment remains stable throughout the assessment despite a change in an examinee’s performance, then longer assessments may not be adding anything to the quality of the rating that one cannot get within a few minutes.


3) 시스템1과 시스템2 프로세스의 조화

3) How does the coordination of System 1 and System 2 processes influence the use of and the accuracy of a first impression?


어떤 경우에는 시스템1이 더 정확

Under some circumstances, System 1 processes, like first impressions, can lead to more accurate judgments than System 2 processes, but it is not clear under what conditions this may occur.


무슨 평가방법을 사용하느냐

One such factor is the scoring method used. There is a considerable amount of literature on the advantages and disadvantages of using checklists versus rating scales for assessments (Hawkins and Boulet 2008; Van der Vleuten and Swanson 1990).

  • 체크리스트: 고도로 심사숙고하는deliberative 평가법. 시스템2를 활용함 A checklist is a highly deliberative scoring process so would likely reflect the use of System 2 processes.
  • 평가스케일Rating scales: 시스템1의 역할이 더 커질 수 있음(평가자가 해석할 여지가 많고 덜 rigid함). Rating scales, on the other hand, have more room for rater interpretation and are less rigid, so could allow a larger role for System 1 processes like first impressions to influence scoring.

 

평가의 목적이 무엇이냐

The purpose of the assessment (i.e., formative or summative assessment), is also important in terms of whether System 1 or System 2 processes should be favored.

  • 형성적 피드백을 위한 평가는 더 심사숙고해야하고 분석적 채점 프로세스를 위한 설계
    It is possible that an examination designed for formative feedback might favor a deliberative, analytical scoring process in order to provide feed- back,
  • 총괄평가를 위한 평가는 더 global하고 덜 analytic함.
    whereas an examination designed solely for summative assessment may favor a more global, less analytical scoring process.

 

인지부하: 어떤 과제는 인지적 자원을 더 필요로 함. 예를 들면 응급실에서 피평가자의 병력청취, 의사소통기술, 프로페셔널리즘을 판단해야 하는 경우

Cognitive load is another factor that would likely influence the use of and accuracy of first impressions. Because some rating tasks require a higher degree of cognitive resources (i.e., attention) than other tasks, the resulting scores could start to mimic the results found under divided attention manipulations described earlier. For example, imagine a situation in which a rater must evaluate an examinee’s history taking, communication skills and professionalism while they interact with a live patient in a busy Emergency Department.


4) 자기충족적 예언이 얼마나 평가에 영향을 주는가?

4) To what degree does a self-fulfilling prophecy influence the ratings?


5) 후광효과와 첫인상의 관계

5) What is the relationship between first impressions and the halo effect?


 

상황에 따라 다름;

First impressions are thought to contribute to the presence of a halo effect.

  • Under some circumstances, especially when one wants to identify specific strengths and weakness within a person, the presence of halo would make the assessment difficult.
  • In other cir- cumstances, especially when trying to discriminate abilities between individuals, the presence of halo may actually be a benefit due to the high reliability.

What is unclear is what those circumstances are, and how manipulations that influence first impressions impact on the presence or absence of halo.


checklist vs rating scale

First impression ratings could be compared to a condition in which examiners score examinees using a checklist versus a condition in which they use a rating scale. If rating scales support the use of System 1 processes and checklists facilitate System 2 process, one might find a larger correlation with the former scoring system.
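
A minimal sketch of how the comparison suggested above might be analysed, using entirely synthetic data: a first-impression rating is correlated with a checklist total and with a rating-scale score that, by construction, leans more heavily on the first impression. The construction is an assumption made for illustration, not a result from the literature.

```python
import numpy as np

# Assumption (not from the paper): each examinee gets a first-impression rating in the
# opening minute, a checklist total, and a global rating-scale score for the station.
rng = np.random.default_rng(2)
n = 80
first_impression = rng.normal(0, 1, n)

rating_scale = 0.7 * first_impression + rng.normal(0, 0.70, n)   # built to lean on System 1
checklist = 0.3 * first_impression + rng.normal(0, 0.95, n)      # built to lean on System 2

r_scale = np.corrcoef(first_impression, rating_scale)[0, 1]
r_check = np.corrcoef(first_impression, checklist)[0, 1]
print(f"first impression vs rating-scale score r = {r_scale:.2f}")
print(f"first impression vs checklist score     r = {r_check:.2f}")
```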






 2014 Aug;19(3):409-27. doi: 10.1007/s10459-013-9453-9. Epub 2013 Mar 26.

Exploring the role of first impressions in rater-based assessments.

Author information

  • 1Academy for Innovation in Medical Education (AIME), RGN2206, Faculty of Medicine, University of Ottawa, Ottawa, ON, K1H-8M5, Canada, twood@uottawa.ca.

Abstract

Medical education relies heavily on assessment formats that require raters to assess the competence and skills of learners. Unfortunately, there are often inconsistencies and variability in the scores raters assign. To ensure the scores from these assessment tools have validity, it is important to understand the underlying cognitive processes that raters use when judging the abilities of their learners. The goal of this paper, therefore, is to contribute to a better understanding of the cognitive processes used by raters. Representative findings from the social judgment and decision making, cognitive psychology, and educational measurement literature will be used to enlighten the underpinnings of these rater-based assessments. Of particular interest is the impact judgments referred to as first impressions (or thin slices) have on rater-based assessments. These are judgments about people made very quickly and based on very little information. A narrative review will provide a synthesis of research in these three literatures (social judgment and decision making, educational psychology, and cognitive psychology) and will focus on the underlying cognitive processes, the accuracy and the impact of first impressions on rater-based assessments. The application of these findings to the types of rater-based assessments used in medical education will then be reviewed. Gaps in understanding will be identified and suggested directions for future research studies will be discussed.

PMID: 23529821


사회적 판단으로서 평가자의 평가: 평가 오차의 원인에 대한 생각(Acad Med, 2011)

Rater-Based Assessments as Social Judgments: Rethinking the Etiology of Rater Errors

Andrea Gingerich, Glenn Regehr, and Kevin W. Eva






평가자기반 평가는 학생들이 복잡한 과제를 수행하는 것을 직접 볼 수 있기 때문에 역량의 높은 레벨에 해당하는 능력을 확인할 수 있는 장점이 있다. 그러나 안타깝게도 평가자기반평가(RBA)는 일반적으로 psychometric하게 약점이 있다.

Rater-based assessments are used because they allow students to be observed performing complex tasks corresponding to higher levels of competency.1,2 Unfortunately, rater-based assessments generally demonstrate psychometric weaknesses6–9 including

  • measurement errors of leniency,10
  • undifferentiation,11
  • range restriction,12
  • bias,13 and
  • unreliability.14

 

심지어 여러 평가자가 동일한 수행을 보고도 reproducibility나 평가자간 신뢰도 등에 문제가 발견된 바 있다.

One of the biggest threats to the reproducibility of clinical ratings, low interrater reliability,15,16 has been found to occur even when different raters view the same performance.17–20



극단적인 예로, 20개의 OSCE 스테이션 중 19개에서, 동일한 관찰가능 행동에 대해 한 평가자는 수행했다고(긍정적으로) 평가한 반면 다른 평가자는 수행하지 않았다고(부정적으로) 평가하는 불일치가 스테이션당 1개에서 8개까지 발생하였다.

In a dramatic example, 19 of 20 OSCE stations each had one to eight discrepancies where at least one rater made a positive evaluative comment about the presence or absence of a specific observable behavior, while another rater made a negative evaluative comment regarding the exact same behavior.21



피평가자의 실제 수행능력이 맥락이나 사례 특이적으로 달라질 수 있다는 것이 RBA가 복잡한 주된 이유로 인정된 바 있지만, 그것의 효과에 대해서는 우리가 이미 잘 이해하고 있으며, 현재 평가 시스템에서는 잘 다뤄지고 있다. 그러나 다양한 평가자가 동일한 수행환경에서 나타나는 동일한 수행능력에 대해서도 서로 다르게 평가하는 것의 이유에 대해서는 우리가 알고있는 바가 더 적으며, 그러한 차이가 극복될 수 있는 것인가에 대한 상당한 논쟁이 있다. Marshall과 Ludbrook은 "관례적인 임상술기 평가에 있어서 평가자가 피평가자에 대해서 내리는 판단은 전적으로 개인 수준의 것이다"라고 했다.

While actual ratee performance differences attributable to context or case specificity are acknowledged to play a critical role in the complexities of rater-based assessment,22 its effects are well understood and accounted for in current assessment systems. Causes of variability in ratings, given by multiple raters for the same performance within the same context, are more uncertain, with considerable debate currently taking place about whether or not such variability can be overcome.23–25 The challenge is illustrated well by Marshall and Ludbrook,26(p215) who stated that “the judgment that an examiner makes of a candidate in the setting of the conventional test of clinical skills is an entirely personal one.”



평가자가 문제라면, 평가자 훈련이 가장 지속적으로 이뤄진 해결책이다. 그러나 평가자 훈련은 평가 결과에 미미한 향상만을 가져왔으며, 일부 연구자들은 의학에서 평가자들이란 훈련으로 바뀌지 않는 사람들은 아닌지 의구심을 표했다. "일부 평가자들은 본질적으로 일관된 면이 있고, 어떤 사람들은 좀 덜 하다. 이들 중 전자에 해당하는 사람들은 훈련으로 나아지지 않는다."

With raters identified as the problem, rater training has been the most persistently proposed solution.31 Rater training’s meager improvement of measurement outcomes, however, has provoked some researchers to suspect that medical raters are impervious to training,7,32 by suggesting that “some examiners are inherently consistent raters and others less so. The former do not need training and the latter are not improved by training.”33(p349)




표준화된 프레임워크를 사용하는 것으로는 이 문제를 해결하기 어렵다면, 여러 의학교육연구에서는 평가자의 사회-인지 프로세스의 중요성에 관심을 가질 것을 요구하며, 그것이 수행능력 평가에 가지는 함의를 다룬 바 있다. 이 저자들에게 평가자는 '능동적 정보 처리자'이며, 판단/추론/의사결정 전략을 활용하여 피평가자를 판단한다. 그들은 또한 평가점수를 결정하는 데 있어서 인상(impression)형성/해석/회상/판단 등의 복잡한 상호작용을 강조하였다. 일부 연구자는 평가절차와 psychometric 측정 원칙과 인간평가자의 능력의 잠재적 불일치를 지적했다.

Given the apparent intractability of this problem using our standard frameworks, a handful of medical education researchers have called attention to the importance of considering raters’ social cognitive processes and corresponding implications concerning measurement of performance assessments. These authors have stressed the need to see raters as active information processors using judgment, reasoning, and decision-making strategies to assess ratees.34 They have also highlighted a complex interaction of impression formation, interpretation, memory recall, and judgment in assigning ratings.21 And several have described potential incongruence between assessment procedures, psychometric measurement principles, and human rater capabilities.2,35,36


이들 연구자들이 문제에 접근한 방식은 '인상형성(impression formation)' 연구를 떠올리게 하는데, 사회적 상황에서 어떻게 한 개인이 다른 개인에 대한 판단을 내리는지에 대한 이해에 초점을 둔 사회인지에 관한 큰 연구영역 중 하나이다. 인상이란 다른 사람을 아는 일부분으로서 형성된다. 인상은 상대방에 대한 사실적 정보, 추론, 평가반응으로 구성된다. 인상이란 어떤 사람과 상호작용하기 위해서 그 사람에 대한 기존의 지식구조에 정보를 조직화하는데 사용된다. 사회인지 연구자들은 어떻게 사람들이 사회세계(social world)에 대해서 생각하는가에 대한 구체적인 인지 프로세스에 관심을 가졌다. 그들은 어떻게 사회적 정보가 encode, store, retrieve, structured, represented 되는지를 연구했으며, 어떻게 판단을 가지고 의사결정을 내리는지에 대한 프로세스를 연구하였다.

The approach being explored by these authors is highly reminiscent of the impression formation literature, a large research domain within social cognition focused on understanding how individuals make judgments of others in social settings.37 Impressions are formed as part of knowing another person. They are constructed from factual information, inferences, and evaluative reactions regarding the target person.37 It has been suggested that impressions are used to organize information into a structure of knowledge about the person38 in order to interact with him or her.39 Social cognition researchers are interested in the specific cognitive processes used by people to think about the social world. They investigate how social information is encoded, stored and retrieved from memory, and structured and represented as knowledge; they also study the processes used to form judgments and make decisions.40


 

흥미롭게도 평가자 나름의 독특한 방식(idiosyncrasy)도 인상형성 연구자들의 연구대상이 되었다. 이 연구에서 평가자들은 심지어 완전히 동일한 정보가 주어졌을 때 조차 피평가자에 대해서 상이한 인상을 가진다는 것이 확인되었다. 실제로, 한 평가자가 다수의 피평가자에 대해서 가지는 인상(들)이 다수의 평가자가 하나의 피평가자에게 가지는 인상(들)보다 더 유사하다. 전형적으로 인성특성 평가의 variance를 차지하는 가장 큰 부분은 피평가자 간의 차이가 아니라 피평가자와 평가자의 관계에서 독특하게 나타나는 차이이다.

Interestingly, the idiosyncrasy of raters has also been of interest to impression formation researchers.41 In that literature, it is well established that different raters will often form different impressions of the same ratee even when given the exact same information.42,43 In fact, the descriptions made by a single rater about multiple others have been found to be more similar than the descriptions made by multiple raters about a single ratee.44 Typically, the largest portion of variance in personality trait ratings is not attributable to differences perceived between the ratees but to differences uniquely contained within the relationship between each rater and ratee.42,45





결과

Results


심리학 연구에서 다른 사람을 인식하는 행위(인상형성)은 흔히 카테고리화 작업으로 묘사되는데, 이 카테고리화 작업에 진행되는 인지적 프로세스들 간에는 차이가 있을 수 있다.

Within psychology literatures, the act of perceiving other people (i.e., forming impressions) is commonly described as a categorization task, though differences exist in the way in which these cognitive processes are thought to be enacted.46,47

 


 

idiosyncratic 하지만 convergent 한 인간모델(Person Models)로서의 인상형성

Impression formation as idiosyncratic yet convergent Person Models


사회적 판단은 특정 조건에서는 idiosyncratic하고 실수의 가능성이 높다. 예를 들면 평가자의 감정이나 기분이 영향을 줄 수 있다. 만약 피평가자가 평가자로 하여금 평가자에게 중요한 다른 누군가를 떠올리게 하면, 피평가자는 유사한 특성을 공유하는 것으로 인식될 수 있다. 만약 평가자가 근래에 피평가자를 묘사하는 말을 들었다면 모호한 행동도 그 전에 들었던 묘사와 일관된 것으로 해석될 수 있다. 이들은 인상이 피평가자 자체보다 다양한 변인과 맥락적 요인에 영향을 받기 쉽다는 것을 보여준다.

Social judgments have been found to be idiosyncratic and fallible under certain conditions.48 For example, raters’ mood and emotions at the time of the judgment can have an influence.49 If the ratee reminds the rater of a significant other, the ratee can be perceived to share similar characteristics.50 If the rater has recently been exposed to a description of the ratee, ambiguous behavior can be interpreted as being consistent with that description.51,52 Thus, there exists an implicit understanding that impressions are subject to variables and contextual factors beyond the ratee himself or herself.


인상형성에 있어 평가자의 idiosyncrasy 에도 불구하고 인상형성이 평가자간 상당히 일치한다는 근거도 있다. 한 연구에서는 평가자들이 피평가자에 대한 인상을 묘사할 때, 모든 묘사가 세 가지 대표적 스토리(인간모델)로 그룹지어질 수 있었다. 이 모델은 평가자가 접근가능한 정보로부터 받은 인상에 대한 즉석 묘사이다. 중요한 것은, 많은 이야기가 만들어질 수 있고 동일한 세 모델이 모든 개인에게 적용되는 것은 아니더라도, 한 개인에 대한 이야기는 아래의 세 가지 모델 중 하나로 분류가능했다는 점이다.

Despite this expectation of rater idiosyncrasy in impression formation, however, there exists evidence that impressions will often be quite consistent across raters. One line of research, for example, has demonstrated that when raters were asked to write descriptions of a ratee based on their impressions, all descriptions for that ratee could be grouped into three representative stories (or “Person Models”) about that individual.42,45 The models are ad hoc descriptions of the ratee based on the rater’s impressions formed from the information available. Importantly, although many stories can be generated, stories pertaining to any one individual tend to fall into one of three models, though the same three models are not relevant to every individual.
 

    • Model 1 (67.6% of descriptions): [Ratee E] is energetic, friendly, and expressive, although she is more outgoing with her friend than her mother. She seems to be a kind and considerate person who enjoys talking to others. She laughs a lot and has many ideas.
    • Model 2 (15.5% of descriptions): [Ratee E] is insecure and nervous. She seems distracted at times, and she has trouble making decisions. She plays with her pen a lot and keeps bringing up a trip she was supposed to go on last year.
    • Model 3 (16.9% of descriptions): [Ratee E] has to dominate the conversation. She is rude and obnoxious and seems insensitive to other people. She doesn’t even say bless you when her friend sneezes. She seems self-centred and barely lets her friend talk.


비록 판단이 idiosyncratic하지만, 그것이 무한정 그렇지는 않다. 정보의 조각을 조합하고 정보의 우선순위를 결정함으로서 다양한 이야기를 만들어낼 수 있다. 후속 연구에서 인간모델은 긍정적 혹은 부정적 평가와 연결되는데 모델1에 해당하는 경우 긍정적인 평가를, 모델 2나 3에 해당하는 경우 부정적 평가를 받는 것으로 나온다. 따라서 인간모델은 평가자와 피평가자 사이에 독특한 관계에 의해서 나타나는 variance의 상당부분을 설명해준다.

Thus, although judgments are idiosyncratic, they are not infinitely so. It has been suggested that different combinations and prioritization of the pieces of information resulted in the different explanatory stories.42 In a follow-up study,45 the Person Models corresponded with ratings of liking and positive–negative evaluation such that raters using Model 1 viewed the ratee positively and liked her, whereas raters using Models 2 or 3 viewed her negatively and disliked her. The Person Models, therefore, were found to account for a substantial portion of the variance in impressions attributed to the unique relationship between the rater and the ratee.



만약 평가자가 그들이 피평가자에 관해 받은 정보를 바탕으로 피평가자에 대한 coherent한 인상을 구성해나갈 때 그 한 부분으로 인간모델을 만드는 것이라면, 그리고 모든 피평가자에 대해서 일반적으로 사용되는 세 가지 인간모델이 있다면, 이는 평가자들 사이에 cohesion과 coherence가 존재함에도 평가자간 신뢰도가 RBA에서 왜 감소하는지 설명해줄 수 있을지도 모른다.

If raters are forming Person Models as part of constructing a coherent impression about a ratee from the information they are receiving, and if there generally exist about three Person Models that are used for every ratee, this could help explain decreased interrater reliability in rater-based assessments while still yielding a sense of relative cohesion and coherence for each rater.



명목 카테고리화 작업으로서의 인상형성

Impression formation as a nominal categorization process


여기서는 피평가자의 행동에 대해서 '즉석에서' 네러티브를 구성하는 것이 아니라, 기존에 존재하던 스키마에 피평가자를 묶어내는 경향에 초점을 둔다.

Here, the focus is not on the ad hoc construction of narratives around a ratee’s behavior; rather, the focus is on raters’ tendencies to lump ratees into preexisting schemas.



비록 과도한 일반화가 가지는 명확한 위험성도 있으나, 카테고리화의 뚜렷한 장점도 있다. 카테고리를 사용함으로써 평가자는 피평가자의 카테고리-일치 행동을 관찰할 때에는 인지적 리소스를 사용할 필요가 없다. 실제로 평가자는 카테고리-비일치 행동만 관찰하면 된다. 피평가자의 카테고리화는 평가자로 하여금 주어진 정보를 넘어서서 기존의 카테고리 구성원과 일치하는 디테일까지 예상(추론)할 수 있게 해준다. 이는 개별 피평가자를 이해하는데 유용하며, 이들이 어떻게 행동할지 예측하게 해준다. 또한 그들과 상호작용할 때에 어떻게 하는 것이 가장 좋은지 결정을 도와준다. 인상형성의 인간모델 이론과 마찬가지로, 카테고리-기반 지식은 왜 피평가자가 특정 행동을 특정 상황에서 하는가에 대한 가능한 설명을 제공해주는 프레임워크가 된다.

Although there are clear and readily recognized dangers in overgeneralization (such as stereotypes), there are apparent benefits to categorization as well.46 With the use of categories, cognitive resources do not need to be used to monitor a ratee’s category-consistent behavior. Instead, the rater only needs to note any category-inconsistent behaviors.55 Categorization of the ratee also allows the rater to go beyond the given information to infer other expected details consistent with typical category members.56 This can be useful to better understand the individual ratees, to make predictions about how they will behave, and to decide how best to behave when interacting with them.47 Consistent with the Person Model theories of impression formation, category-based knowledge is thought to act as a framework to provide possible explanations for why a ratee might display particular behaviors in a given situation.



사회적 카테고리화에 대한 연구를 보면 이 카테고리가 장기기억에 미리 형성되어 존재할 수 있으나, 한 사람에 대한 사회적 카테고리화는 그 사람이 다양한 카테고리에 속할 수 있기 때문에 flexible한 측면도 있다. 위에서 묘사한 바와 같이, 이 연구에서는 한 사람에게 적용할 다양한 잠재적 카테고리 중 하나를 결정하는 데에는 맥락이 중요함을 찾아내었다. 예컨대, 아이를 안고 있는 남자는 마트에서는 아빠일 수 있지만, 병원에서는 간호사일 수 있다.

Although the social categorization literature suggests that these categories can exist preformed in long-term memory,46 social categorizations of a person are thought to be flexible because any individual can be categorized in multiple ways.58 Consistent with the findings described above, this literature has found context to be important in determining which category of the many possibilities will be applied to the person.51,59 For example, a man carrying a baby in a grocery store may be categorized as a dad but in a hospital as a nurse.



이 분야의 연구를 보면, 어떻게 카테고리 활성화를 조절할 수 있는가에 대한 것도 있다. 흥미롭게도, 의도적으로 카테고리-기반 가정에  반하여 사회적 판단을 조정하려고 하는 시도, 또는 카테고리적 사고를 억제하려는 시도는 오히려 카테고리화를 유발하여 더 안 좋은 영향을 미칠 수 있다. 이는 예를 들어 평가자가 고정관념을 회피하고자 하는 노력을 했을 때 오히려 더 고정관념에 빠졌다거나, 피평가자에 대한 더 고정관념적 기억을 잘 했다는 연구에서 나타난다. 이는 카테고리화를 극복하고자 하는 좋은 의도와 동기가 어쩌면 완전히 불가능하거나 적어도 결과의 향상을 가져오지 못할 수 있다는 것을 보여준다.

Researchers in this area have been particularly concerned with the question of how controllable category activation is. Interestingly, there is evidence to suggest that intentionally trying to adjust social judgments to counteract categorization-based assumptions or trying to suppress categorical thinking can cause the categorizations to have more adverse influence on impressions.64 This has been repeatedly demonstrated, for example, with studies where raters who were trying to avoid the use of stereotypes ended up demonstrating more stereotypic thinking in subsequent trials65 and more stereotyped memories of the ratee.66 This suggests that good intentions and the motivation to avoid categorizing people may not be completely possible and, when attempted, may not result in improved judgments.



만약 우리가 평가자가 피평가자에 대한 인상을 형성하고 피평가자를 인식하는 데 있어서 카테고리화할 수 있다는 것을 인정한다면, 이것은 RBA에 중요한 함의를 갖는다. 아마 가장 흥미로운 것은 카테고리가 순위/간격 자료가 아니라 명목자료에 가깝다는 사실일 것이다. 명목변수에는 본질적인 논리적 위계나 절대영점(true zero)이 없고, 카테고리 간 간격도 균일하지 않다. 그러나 평가 서식은 순위형 답변(behaviorally anchored scale에서 서술값 선택 등)이나 리커트 척도와 같은 간격형 숫자값 선택을 요구한다. 만약 평가자가 피평가자를 특정 카테고리에 속하는 것으로 인식한다면, 이 카테고리적 판단을 rating scale 값으로 변환하는 것은 어떻게 이루어질까?

If we were to accept that raters may be categorizing ratees as part of perceiving and forming an impression of them, this could have important implications for rater-based assessment. Perhaps the most intriguing implication is the resemblance of categories to nominal rather than ordinal or interval data. Nominal variables have categories but do not have an inherent, logical order, a true zero, or an equal interval between the categories. Assessment forms often require ordinal responses such as the selection of an ordered descriptive value on a behaviorally anchored scale, or interval responses such as the selection of a numerical value on a Likert-type rating scale. If raters are judging ratees by perceiving them as belonging to a particular category, then how do they translate that categorical judgment into a rating scale value?
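
A toy sketch of the translation problem posed above, with invented category labels and rater-specific mappings: two raters who agree perfectly at the nominal level can still produce divergent scale scores simply because they convert categories to numbers differently.

```python
# All labels and mappings below are invented for this sketch.
observed = ["safe but slow", "polished performer", "struggling novice", "safe but slow"]

# Each rater's personal conversion from a nominal category to a 1-5 scale value.
rater_a_map = {"struggling novice": 2, "safe but slow": 3, "polished performer": 5}
rater_b_map = {"struggling novice": 1, "safe but slow": 4, "polished performer": 5}

scores_a = [rater_a_map[c] for c in observed]   # -> [3, 5, 2, 3]
scores_b = [rater_b_map[c] for c in observed]   # -> [4, 5, 1, 4]

print("identical categorisations:", observed)
print("rater A scale values:", scores_a)
print("rater B scale values:", scores_b)
# Nominal agreement is perfect, yet the numeric ratings diverge; in a conventional
# psychometric analysis this divergence would surface as rater "error".
```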




다차원적 카테고리화로서의 인상형성

Impression formation as dimensionally based categorizations



사람들은 두 개의 차원에 대한 이분법적 판단에 따라 카테고리화 될 수 있다. 광범위한 연구에서, 사회적 판단의 기저에 있는 두 개의 직교하는 차원이 인상형성 variance의 상당부분을 설명할 수 있음을 일관되게 보여준 바 있다.

As is described more thoroughly in the following, people can appear to be placed into categories based on dichotomized judgments on two underlying dimensions. An extensive literature consistently identifies two orthogonal dimensions underlying social judgments that can account for the majority of variance in impression formation.



모든 연구에서 한 차원은 사회적으로 바람직하거나 그렇지 않은 것에 대한 것이다.(정직-부정직 등)

In all studies, one of the dimensions refers to socially desirable or undesirable traits that directly impact on others. It includes positive traits such as friendly or honest and negative traits such as cold or deceitful.



두 번째 차원은 연구에 따라서 보다 다양한데, 개인의 성공에 영향을 미치는 특질에 대한 것이다. 지능/야망과 같은 긍정적 특질과 우유부단함/비효율적 과 같은 부정적 특질을 포함한다.

The second dimension has more variability across studies and refers to traits that tend to more directly influence the individual’s success.68,69 It tends to include positive traits such as intelligent or ambitious and negative traits such as indecisive or inefficient.

 

이 차원들은 다양한 이름으로 연구된 바 있다.

These dimensions have been given various labels, likely attributable, in part, to differing domains having been studied:

  • warmth/competence,69,70
  • communion/ agency,68,71
  • social/intellectual,72
  • other- profitability/self-profitability,73
  • morality/ competence,74 and
  • social desirability/ social utility.75

 

어떤 명명을 선택하느냐가 서로 다른 영역으로부터 온 연구자들이 매우 다른 차원을 찾아내었음을 보여주나, 연구자들은 대체로 이들 특질/행동에 겹치는 부분이 있음을 인정한다.

Although the choice of labels for each of the dimensions may imply that researchers from different domains have identified very different dimensions, the researchers agree there is a common overlap of traits and behaviors.68–70,75,76


흥미롭게도, 사회적 판단의 기저에 두 개의 연속적인 scaled 차원이 있다는 추측에도 불구하고, 사회적판단에 관한 문헌의 많은 연구자들은 이 두 직교 차원이 고/저 가치판단으로 이분화된다고 본다. 이분화된 두 차원을 교차시키면 네 개의 조합이 만들어지고, 개인은 이 네 영역(cluster) 중 하나로 카테고리화된다.

Interestingly, despite the speculation that there are two continuous, scaled dimensions underlying the process of social judgment, many researchers in the social judgment literature suggest that these two orthogonal dimensions are dichotomized into high- versus low-value judgments. When the two dimensions are crossed, therefore, the result is four potential combinations, and it has been proposed that individuals and groups are categorized in one of these four clusters.77
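
A minimal sketch of this cluster-based categorisation, assuming two dichotomised dimensions (labelled warmth and competence as in the work cited above); the cut-off and the example values are invented for illustration.

```python
# Dichotomise two dimensional judgments and assign one of four clusters.
def categorise(warmth: float, competence: float, cutoff: float = 0.0) -> str:
    high_w = warmth >= cutoff
    high_c = competence >= cutoff
    if high_w and high_c:
        return "high warmth / high competence"
    if high_w:
        return "high warmth / low competence"
    if high_c:
        return "low warmth / high competence"
    return "low warmth / low competence"

# Hypothetical ratees with (warmth, competence) judgments on an arbitrary scale.
examples = {"ratee 1": (0.8, 0.6), "ratee 2": (-0.4, 0.9), "ratee 3": (0.5, -0.7)}
for name, (w, c) in examples.items():
    print(name, "->", categorise(w, c))
```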



Warmth/Competence dimension의 사례 

Researchers have shown that the stereotyped groups described in the preceding section can be categorized into each cluster based on rater judgments of warmth/competence dimensions and that each cluster is associated with emotional and behavioral responses in the rater.78 More specifically, in North America,

  • groups judged high on warmth and competence, such as the middle class, invoke the emotions of pride and admiration and lead to behaviors of wanting to help and associate with them.
  • Groups judged low on warmth and high on competence, such as the stereotypically gluttonous rich, elicit envy and willingness to associate but also to attack under certain conditions.
  • Groups judged high on warmth and low on competence, including stereotypes for the elderly and disabled, elicit pity and willingness to help but also to avoid.
  • Low judgments of both warmth and competence, including stereotypes for the homeless and drug-addicted, invoke the emotions of disgust and contempt and lead to behaviors of wanting to attack and to avoid.


사회적 판단의 두 차원에 깔린 근본적 특징은 진화론적 관점에서 설명되곤 하는데, 낯선 사람이 친구인지 적인지를 상대방의 의도가 무엇인가, 그리고 상대방이 그 의도를 달성할 능력이 있는가에 기초하여 성공적으로 판단하는 것이 생존에 도움이 되었으리라는 것이다. 이렇듯, 차갑거나 비도덕적 의도를 가진 사람이 그 의도를 달성할 능력까지 갖춘 것으로 분류될 경우, 의도는 마찬가지로 비도덕적이나 능력이 없는 사람보다 더 부정적으로 인식된다. 즉 두 개의 negative 판단의 조합보다 한 개의 positive와 한 개의 negative 판단의 조합이 더 부정적인 인상으로 이어질 수 있다는 것으로, 차원적 판단에 기반한 카테고리화가 단순한 대수적 연산의 결과가 아님을 보여준다.

The fundamental nature of two dimensions underlying social judgments has been explained using an evolutionary perspective. It has been proposed that successfully determining whether strangers are potential friends or enemies, based on their perceived intentions and also on whether they are capable of achieving those intentions, would provide a survival advantage.79 As such, persons categorized as having cold or immoral intentions and high competence receive more strongly negative impression ratings than those categorized as having immoral intentions and low competence.80 This occurs despite the immoral–incompetent categorization resulting from two negative dimensional judgments and the immoral–competent categorization resulting from the combination of a positive and a negative dimensional judgment. Categorizations based on dimensional judgments, therefore, do not purely reflect an algebraic combination of values judged on two orthogonal dimensions.




임상역량을 평가하기 위해 만들어진 서식의 요인분석을 통해서 두 개의 내재된 요인을 발견하곤 한다. 평가 variance의 대부분을 설명하는 두 요인은 지식과 대인관계기술이다. 이 때 '지식'은 사회적판단에서 '역량competence'에 해당하며, 대인관계기술은 'warmth'에 해당한다. 따라서 의학에서 평가자들은 위에서 언급된 북미에서의 stereotype의 예와 같은 인지 프로세스를 활용하여 피평가자를 네 가지 부류 중 하나로 분류하고 있을 수 있다.

Factor analysis of rating forms designed to assess clinical competence often identifies two underlying factors regardless of the number of items or the number of dimensions included on the form. Of the two factors that explain the majority of variance in ratings, one tends to refer to knowledge and the other to interpersonal skills. The knowledge dimension seems analogous to the competence dimension in social judgments, and the interpersonal skills dimension seems comparable to the warmth dimension. As such, medical raters could be using the cognitive processes, previously described using the example of stereotyped groups in North America, to classify ratees into one of the four clusters with consequent emotions and reactions.
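
To illustrate the kind of factor analysis described above, the sketch below fits a two-factor model to synthetic rating-form data that are built so that items load on two latent traits (labelled here "knowledge" and "interpersonal skills"). Item names, loadings and all data are invented; this is not a reanalysis of any published rating form.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
n = 200
knowledge = rng.normal(0, 1, n)          # latent "knowledge/competence" trait
interpersonal = rng.normal(0, 1, n)      # latent "interpersonal/warmth" trait

# Four hypothetical rating-form items, two loading on each latent trait.
items = np.column_stack([
    0.8 * knowledge + rng.normal(0, 0.5, n),       # e.g. "clinical reasoning"
    0.7 * knowledge + rng.normal(0, 0.5, n),       # e.g. "investigation plan"
    0.8 * interpersonal + rng.normal(0, 0.5, n),   # e.g. "rapport with patient"
    0.7 * interpersonal + rng.normal(0, 0.5, n),   # e.g. "explanation and listening"
])

fa = FactorAnalysis(n_components=2, random_state=0).fit(items)
print(np.round(fa.components_, 2))  # two rows of loadings, one per extracted factor
```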



결론

Discussion


비록 사례특이성이 중요한 역할을 하는 것으로 밝혀졌지만, 평가자 variability 또한 construct-irrelevant error의 원인으로 지적되고 있으며, 이것을 어떻게 극복할지는 보다 불분명하다. 평가자의 객관성이나 평가자의 능력을 강화하기 위한 해결책은 효과가 미미했으며, 이제는 평가자 '에러'에 대한 다른 개념을 고려할 때일 수 있다.

Although case specificity has been shown to play a very important role, rater variability (based on idiosyncrasies of opinion, defiance, or ineptitude) has also been seen as a source of construct-irrelevant error16,25 with less clear understanding of how to overcome the challenge it creates. Solutions targeted at bolstering rater objectivity and ability have had little impact on reducing these measurement errors,7 and hence, perhaps the time has come to consider an alternate conception of rater “error.”


만약 우리가 RBA에서의 평가자가 사회적판단에서의 평가자와 동일한 인지 프로세스를 사용한다고 전제하고 시작했다면, 여기에서 함의는 무엇이고 어떻게 바꿀 수 있을까?

If we were to start with the premise that raters in rater-based assessments use the same cognitive processes as raters in social judgments, then what would the implications be for assessment and how would it change the way we talk about assessment?



 

심리학자들은 사회적판단을 내리는 데 있어서 사람들은 다른 사람을 카테고리화하는 경향이 있음을 밝혔다. 인상형성에 관한 연구에 따르면 카테고리화 프로세스에는 세 가지 다른 개념이 있다.

Psychologists have shown that, in making social judgments, people have a propensity to categorize other people. In the impression formation literature, there seem to be at least three different conceptualizations of this categorization process.

 

  • The Person Model literature presents an adaptable type of categorization based on the construction of stories, as needed, to describe specific individuals.42,45
  • In contrast, the categorization literature suggests that categories can be preformed constructs that exist in the long-term memory and are applied when activated.46
  • And a third conceptualization is the concept of cluster-based categorization that results from dichotomous judgments on two dimensions.77,78

이들 개념 간 차이와 무관하게, 인상형성 연구결과는 이러한 카테고리화가 전형적인 카테고리 구성원에 대한 정보를 새로운 사람에게 적용하게 해주는 것이라고 공통적으로 말하고 있으며, 그 결과 인지적 자원을 아낄 수 있게 해주고, 어떻게 행동할지 예상하게 해주며, 어떻게 상호작용할지 최적의 선택지를 제공한다.

Regardless of these differences in conceptualization, there is general agreement in the impression formation literature that such categorizations allow information about a typical category member to be applied to the new person, thereby reducing the cognitive resources needed to monitor the person’s behavior, allowing for predictions of how he or she will behave, and providing options for how best to interact with him or her.46



첫째, 카테고리화는 무의식중에, 그리고 자발적으로 일어날 수 있으며, 어떤 식으로든 이 프로세스를 통제하는 것은 매우 어렵다. 그렇기 때문에 평가자 훈련을 통해서 카테고리화의 영향을 변화시키려는 직접적 노력을 impede한다. 더 나아가 이들 카테고리화가 여러 평가자들 간에 놀라울 정도로 비슷하다는 연구결과가 있으나, 평가자 특이성(rater idiosyncrasy)은 여전히 존재하고 있으며, 최소한 서로 다른 인간모델의 서브그룹 수준에서는 차이가 있다.

First, the categorization of the person can happen spontaneously and without awareness,60 and there may be poor control over these processes even when they are made explicit.64 This could directly impede efforts to modify the influence of categorization on assessments through rater training. Further, although there is evidence of these categorizations being surprisingly consistent across raters, there is nonetheless room for rater idiosyncrasy, or at least subgroups that consistently use a different Person Model in understanding a particular individual’s behavior.45


평가자들은 피평가자를 서로 다른 스케일에 두는 것이 아니라, 애초에 서로 다른 명목 카테고리로 분류하는 것이다.

It is not that raters are scaling the behaviors differently but, rather, that they are placing ratees in different nominal categories.


둘째로, 대부분의 의학교육에서의 RBA는 표준화된 형식을 가지고 사전에 결정된 수행능력 영역/역할/역량을 평가한다. 이렇게 이론적으로 구성되어있는 평가 영역들은 우리에게 내재된 인지프로세스의 카테고리화와 잘 맞지 않을 수 있으며, 어떤 피평가자를 카테고리화하는데는 적용가능하지 않을 수 있다. 따라서 평가자 에러는 사람이 판단을 내릴 때 사용하는 인지 프로세스와 잘 맞지 않는 평가체계를 사용하게 하는 것에서 유발되는 것일 수도 있다.

Second, in the vast majority of rater- based assessments in medical education, the standard forms require ratings on a predetermined list of performance domains, roles, and/or competencies. These theoretically constructed assessment dimensions may not correspond with the categorizations that result from our innate cognitive processes, and they may not be universally applicable to all ratee categorizations. It is possible, therefore, that rater error might stem from an assessment system that asks raters to carry out judgment tasks that are incongruent with the cognitive processes used by humans to perform judgments. 


만약 평가자가 nominal한 판단을 내리는데, 평가서식은 ordinal/interval 평가를 내리도록 만들어져 있다면, 어떻게 이 카테고리적 판단을 rating scale로 변환할 것인가? 서로 다른 변환체계를 사용하는 평가자가 평가자 에러의 한 부분이 아닐까?

If raters are forming nominal judgments but assessment forms require ordinal or interval ratings, how do they translate that categorical judgment into a rating scale value? Could raters using different conversion systems explain a portion of rater error?






6 Lurie SJ, Mooney CJ, Lyness JM. Measurement of the general competencies of the Accreditation Council for Graduate Medical Education: A systematic review. Acad Med. 2009;84:301–309.


30 Kogan JR, Holmboe ES, Hauer KE. Tools for direct observation and assessment of clinical skills of medical trainees: A systematic review. JAMA. 2009;302:1316–1326.








 2011 Oct;86(10 Suppl):S1-7. doi: 10.1097/ACM.0b013e31822a6cf8.

Rater-based assessments as social judgments: rethinking the etiology of rater errors.

Author information

  • 1Northern Medical Program, University of Northern British Columbia, 3333 University Way, Prince George, British Columbia V2N 4Z9. gingeri@unbc.ca

Abstract

BACKGROUND:

Measurement errors are a limitation of using rater-based assessments that are commonly attributed to rater errors. Solutions targeting rater subjectivity have been largely unsuccessful.

METHOD:

This critical review examines investigations of rater idiosyncrasy from impression formation literatures to ask new questions for the parallel problem in rater-based assessments.

RESULTS:

Raters may form categorical judgments about ratees as part of impression formation. Although categorization can be idiosyncratic, raters tend to consistently construct one of a few possible interpretations of each ratee. If raters naturally form categorical judgments, an assessment system requiring ordinal or interval ratings may inadvertently introduce conversion errors due to translation techniques unique to each rater.

CONCLUSIONS:

Potential implications of raters forming differing categorizations of ratees combined with the use of rating scales to collect categorical judgments on measurement outcomes in rater-based assessments are explored.

PMID: 21955759


OSCE에 대한 오해(Med Teach, 2015)

Misconceptions and the OSCE

RONALD M. HARDEN

AMEE, Dundee, UK






OSCE에 반대하는 이유로 가장 흔하게 언급되는 것은 비용에 대한 것이다.

The most frequently cited argument against the use of the OSCE relates to cost


Walsh가 언급한 바와 같이 의료전문직 교육은 비용이 많이 들며, 영국에서는 매년 £5bn이 이에 할당된다.

As noted by Walsh (2015), healthcare professional education is expensive, with £5bn allocated in England to this each year.


 

Brown 등은 평가와 관련한 비용을 이해하는데 유용한 기여를 하였다. Aberdeen, UK에서 운영되는 최종학년 총괄 OSCE의 서로 다른 요소들의 비용을 정량화한 것이다. 추정된 비용은 학생당 £355였다. 이는 매우 작은 부분으로, 아마 한 명의 의사를 양성하는데 들어가는 전체 비용의 0.1%에도 미치지 못할 것이며, 최종 시험의 비용으로는 합당해 보인다. 왜냐하면 Brown 등이 지적한 바와 같이, 의과대학 최종 시험에서 위음성(역량있는 의사의 탈락)이나 그보다 더한 위양성(역량 부족한 의사의 합격)이 발생할 때 사회와 전문직이 치르는 비용이 매우 크기 때문이다.

Brown et al. (2015) have made a useful contribution to our understanding of cost in relation to assessment by quantifying the cost of the different components of a summative final year OSCE as organised in Aberdeen, UK. The estimated cost was £355 per student. This represents a small proportion, probably <0.1% of the total budget for training a doctor and seems a reasonable expense for a final examination given that, as pointed out by Brown et al. (2015), the costs to society and the profession of false negatives (failing a competent doctor) or even more so false positives (passing an incompetent doctor) in the final assessment of a student are high.



이러한 목적으로 사용되는 도구의 스펙(specification)은 부담이 크다. 그러나 우리가 해야 할 질문은 그보다 비용이 덜 들어가는 평가도구가 과연 존재하느냐는 것이다. 답은 절대적으로 '그렇지 않다'이다.

The specification of a tool which can be used for this purpose is demanding. The question has to be asked, however, as to whether there is a less costly tool available. The answer is almost certainly no.


OSCE의 비용-효과에 대해 고려할 때, OSCE와 관련된 다른 이점들을 생각해 봐야 한다. 여기에는 시험이 학생의 학습에 가지는 긍정적인 영향을 포함하여, 학생들의 지식 습득은 물론 임상스킬 발전에 관한 학습을 더 장려한다는 것이다.

In a consideration of the cost benefit ratio for the OSCE, it is important also to take into account the other advantages associated with an OSCE. These include the positive impact that the examination has on students' learning, encouraging students to focus their attention on the development of clinical skills as well as on the acquisition of knowledge.


두 번째는 수험생의 환자 진료에 대한 전체적인 접근이 아니라 분절된 일부 부분만 평가한다는 것이다.

A second argument cited against the use of the OSCE is that it does not assess the examinee’s overall approach to the care of the patient and fragments or compartmentalises medicine.


이러한 우려는 만약 OSCE가 유일한 평가도구였다면 타당한 우려였을 수 있다. 그러나 실제에서는 OSCE는 평가자의 도구상자에서 여러 평가도구 중 하나일 뿐이며 포트폴리오나 다른 근무지기반 평가도구들과 함께 쓰이는 것일 뿐이다.

This would be a legitimate concern if the OSCE was used as the only assessment tool. In practice, however, the OSCE should be seen as only one of the tools in the examiners tool kit, used alongside portfolios and other work-based assessment instruments (Friedman Ben-David et al. 2001).


이와 관련된 다른 근거는 OSCE가 실제 임상상황에서의 평가가 아니며(authentic assessment), 시험 스테이션에서의 것이 의사의 실제 직무에서 기대되는 스킬을 평가하지 못한다는 것이다. 

A related argument against the use of the OSCE is that the examination is not an authentic assessment of students’ competence and that the stations in the examination do not assess skills expected of a doctor in clinical practice.


네 번째 오해는 평가자의 역할에 대한 것이다. 평가자의 역할은 학생을 관찰하고 체크리스트를 가지고 수행결과를 평가하는 것으로 제한되는 기계적인 것으로 보일 수도 있다. 그러나 체크리스트를 채우는 것 외에도 평가자는 학생의 역량에 대한 전반적인 판단을 내려야 하며, global rating scale의 점수를 줘야 한다. 그들은 매우 우수한 수행능력을 보인 학생에게는 보너스 점수를 줄 수 있는 자유도 있다.

A fourth misconception about the use of the OSCE relates to the role of the examiner. The task of the examiner may be seen as a mechanistic one with the role restricted to the observation of a student and the scoring of their performance using a checklist and tick boxes, each representing an element of the performance. In addition to completing a check list, however, the examiner may be asked also to come to an overall judgement of the student’s competence and to score a global rating scale. They may also have the freedom to award bonus points, where excellent performance is demonstrated.


시험에 앞서서, 평가자는 스테이션의 설계, 어떤 역량을 평가할 것인지, 그리고 이것이 OSCE 스테이션에서 어떻게 구현될 것인지에도 기여할 수 있다.

In advance of the examination, the examiner can also contribute to the design of the station, the competences to be assessed and how this is achieved in the OSCE station.


시험에 이어서 평가자는 학생에게 개인적으로 그리고 전체적으로 피드백을 제공하는 역할을 한다.

Following the examination, the examiner also has an important role to play in the provision of feedback to the students individually and to the class as a whole.


OSCE와 관련된 다섯 번째 문제는 OSCE가 학생에게 스트레스를 유발한다는 것이다. 학생들이 여러 개의 스테이션을 거쳐가며 시험을 본다는 것이 상당히 스트레스를 받는 일이고 매우 피곤할 수 있다. 그러나 오랜 시간에 걸쳐서 학생을 평가하는 상황 자체가 의사가 실제 임상상황에서 겪는 압박을 반영하는 것이기도 하다. 스트레스는 모든 평가의 특징이며, OSCE에서만 유독 문제로 나타나는 것은 아니다.

A fifth problem attributed to the OSCE relates to a concern about the OSCE as a cause of stress to the student. It may be seen as excessively stressful or tiring for the student, particularly if the student is assessed over a large number of stations. It can be argued, however, that to assess a student over an extended period of time itself reflects the pressures challenging a doctor in clinical practice. While stress may be a feature of any assessment, in general it has not featured as a particular concern for students in an OSCE.


이러한 오해는 전통적인 의과대학에서 더 두드러지는데, 그러한 곳에서는 임상 스킬의 엄격한 평가에 대한 지원이 충분하지 않고, 교수개발에 최소한의 관심만 기울이곤 한다.

Such misunderstandings may be more evident in a traditional school, where the climate does not support the rigorous assessment of clinical skills and where minimum attention is paid to faculty development (Troncon 2004).








 2015 Jun 15:1-3. [Epub ahead of print]

Misconceptions and the OSCE.

Author information

  • 1AMEE , Dundee , UK.
PMID: 26075956


OSCE에서 총괄점수(Global grade)와 체크리스트의 차이(Med Teach, 2015)

Investigating disparity between global grades and checklist scores in OSCEs

GODFREY PELL, MATT HOMER & RICHARD FULLER

University of Leeds, UK





OSCE는 장점이 명확하다. 이 장점은 특히 기준확립에 적절한 방법을 쓴다면, 구체적 내용을 잘 정할 수 있으며, 표준화할 수 있고, 광범위한 측정이 가능하며, 평가의 질에 대한 사후 분석이 가능하다. 신뢰도에 대한 측정은 평가의 질을 결정할 때 흔히 사용되는 요소이며, 이 때 스테이션 수준에서 OSCE형식에서 다룬 여러 문항에 대해 분석하고 교정하는 것에 초점을 둔다.

OSCEs have clear strengths, especially when appropriate standard setting methodologies are employed, allowing careful specification of content, standardisation and an opportunity to undertake extensive measurement and post hoc analysis to determine assessment quality. Measurements of reliability are routinely used as an element of determining assessment quality (Streiner & Norman 2003; Chapter 8), with an increasing focus on the value of station level metrics in the detection and remediation of a range of problems with OSCE formats (Pell et al. 2010; Fuller et al. 2013).


전통적인 시험 형식에서 OSCE는 각 스테이션마다 두 가지 평가 결과물이 나오는데, 체크리스트CL과 종합점수GG이다. 다른 형식의 OSCE는 CL을 없애기도 했는데, 예를 들면 미국의 주요 면허시험에서, 병력청취에 관한 CL은 분별력이 약하다는 이유로 사라졌다. 한 스테이션 내에서 CL와 GG의 alignment는 중요한 특징이며, 좋은 스테이션이라면 두 가지의 alignment가 강해야 한다.

In ‘‘traditional’’ test formats, OSCEs have two assessment outcomes within each station, a checklist score and a global grade (other formats of the OSCE have seen a move away from checklists, for example, in the USA’s main licensing exam, the history-taking checklist has been eliminated due to concerns regarding its poor discrimination). The alignment between the checklist/marking scheme score and overall global grade within a station is an important characteristic, and one would expect that in a high-quality station (i.e. one that is working well as part of a reliable and valid assessment), this alignment should be strong.


많은 연구에서 CL이나 GG의 불일치를 연구했으나, 한 스테이션 내에서 비교하고 그 의미를 찾아본 연구는 없다.

A number of studies have looked at checklist discrepancies and/or rating discrepancies (Boulet et al. 2003), but to our knowledge, none has investigated discrepancies between checklist scores and global ratings in a station and what this might mean.


어떤 식으로든 misalignment가 발생하면 스테이션 수준에서 "역량을 갖춘"학생을 떨어뜨리고 "역량이 부족한"학생을 합격시킬 가능성이 있다.

Any degree of misalignment increases the likelihood of failing ‘‘competent’’ students or passing ‘‘incompetent’’ students at the station level.


동시에, 이 분야의 연구는 이러한 수행능력 검사에서 평가자의 의사결정에 관한 복잡한 영역을 이해하기 위한 연구의 증가에 따라 더 풍요로워지고 있다. 평가에 대한 구성주의자적 관점을 적용하는 입장에서 이러한 연구는 평가자의 의사결정에 영향을 미치는 요인이 고도로 individualize, contextualize 되어있으며, 평가자의 경험이나 연공서열에 영향을 받는다는 것을 밝혔다. 이는 OSCE내에서 검사형식설계, 구조, 평가자 행동, 피평가자 수행능력이 복잡한 관계에 있음을 보여주며, 종종 변인의 '블랙박스'로 여겨지곤 한다. Misalignment가 발생했을 경우, 우리의 연구는 이 '블랙박스'를 이해하기 위한 것이다.

At the same time, research in these areas is complemented by a growing body of literature that seeks to understand the complex area of assessor decision making in performance testing (Sadler 2009; Govaerts 2011; Kogan et al. 2011; Yorke 2011). Employing constructivist views of assessment, this literature reveals that the factors affecting assessor decision-making can be highly individualised, contextualised and influenced by characteristics such as assessor experience and seniority (Pell & Roberts 2006; Kogan et al. 2011). This can be summarised as a complex interaction of test format design issues, construct, assessor behaviours and candidate performances within the OSCE environment, sometimes described as a ‘‘black box’’ of variance (Gingerich et al. 2011; Kogan et al. 2011). Where misalignment occurs, our work seeks additional understanding of this ‘‘black box’’ with regard to this error variance.


'전통적인' CL평가에 대한 불만이 높아지는 가운데, 그리고 이와 함께 GG 기반 채점이 늘어나고 있으며 이는 GG가 CL점수보다 더 신뢰성있다는 연구에 근거하고 있다. 이 misalignment의 일부는 잘못된 CL설계에 기인하며, 또한 이 두 점수가 서로 다른 trait을 평가한다는 것을 반영한다. 

There is a growing dissatisfaction with ‘‘traditional’’ (i.e. reductionist) checklist marking schedules both in healthcare and wider education (Sadler 2009), with an accompanying growth in the use of global/domain based marking schema, supported by work that indicates that global grades are more reliable than checklist scores (Cohen et al. 1996; Regehr et al. 1998). It is important to note that some of the misalignment between scores and grades in a station can reflect poor checklist design, and that these two performance ‘‘scores’’ may measure quite different traits. 


이 접근법은 평가점수의 자연발생적 변인에 대한 우려, 즉 평가자 판단의 에러에 대한 것을 조사할 수 있는 능력이 없이는 평가의 진짜 가능성을 제대로 인식할 수 없음을 보여준다.

This approach poses the real possibility that assessments take place without an ability to investigate concerns about the nature of variance in marks, implying that error in assessor judgements may be more likely to go unrecognised.




방법

Methods


Initial exemplification and exploration


Our OSCE format uses global grade year-specific descriptors (indexed as clear fail, borderline and three passing grades) alongside a specific marking schema that develops from a traditional checklist format in our junior OSCEs (third year) to a sophisticated ‘‘key features’’ format in the final, qualifying OSCE (fifth year).


Within each station, a pass mark is calculated from all the grades/marks within the station using the Borderline Regression Method (Kramer et al. 2003; Pell & Roberts 2006).
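
The following is a minimal sketch of how a Borderline Regression Method pass mark could be computed for one station, assuming global grades are coded numerically (e.g. clear fail = 0, borderline = 1, three passing grades = 2 to 4) and checklist scores are percentages. All data are synthetic and the coding scheme is an assumption for illustration, not the authors' exact procedure.

```python
import numpy as np

rng = np.random.default_rng(4)
grades = rng.integers(0, 5, 230)                          # one global grade per candidate (0..4)
checklist = 40 + 10 * grades + rng.normal(0, 6, 230)      # checklist % loosely tracking the grade

# Regress checklist scores on global grades across all candidates in the station.
slope, intercept = np.polyfit(grades, checklist, 1)

borderline_code = 1                                       # assumed numeric code for "borderline"
pass_mark = intercept + slope * borderline_code           # predicted score at the borderline grade

print(f"regression: checklist = {intercept:.1f} + {slope:.1f} * grade")
print(f"station pass mark (predicted score at borderline grade) = {pass_mark:.1f}%")
```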


두 가지 유용한 통계수치는 '총 탈락률'과 '보더라인 점수자의 비율'이다. 여기서는 CL과 GG를 특히 보더라인 점수자에 대해서 비교했다.

The first useful statistical measures are the overall failure rate at the station (38/230 = 16.5%) and the percentage in the ‘‘Borderline’’ grade (48/234 = 20.9%), the latter of which is arguably high since in one in five encounters the assessors are unable to make a clear pass/fail global decision. The methods we develop will enable us to examine the congruence between the two assessor judgements, global grades and checklist marks, particularly within borderline categories.


Treatment of ‘‘Borderline’’ grades




Formulating measures of misclassification


degree of misalignment를 정량화하기 위한 표 만들었음.

One of the key areas of research in this study is to explore the possibility of developing useful metrics to quantify the degree of misalignment that is listed in Table 2 as a step towards highlighting stations that require further investigation.





Results – Application in practice



문제가 뚜렷하게 드러난 스테이션

Stations with established problems based on existing station-level metrics


CL 점수는 낮지만 GG에서 합격을 받은 학생이, CL 점수는 높지만 GG에서 불합격을 받은 학생보다 3배(25:8) 많다.

As part of the validation of the new metrics, we examine their application where existing station-level metrics already highlight concerns about quality. In Table 3, station 3 shows a poor R-square value with an accompanying low value for the slope of the regression line (inter-grade discrimination), already suggesting that the station is not discriminating well between students based on ability. From this analysis, we would anticipate there to be a wide range of checklist marks for each global grade. The pass/fail grid reveals a high level of asymmetry in the off-diagonal cells, with three times as many candidates (25:8) achieving a global pass grade from the assessor but poor checklist marks compared to those not achieving a global pass whilst having good checklist marks.




표면적으로는 문제가 없는 스테이션

Stations with no ‘‘apparent’’ problems


(GG에서 나타난) 평가자의 예측보다 (CL에서 나타난)수행능력이 더 낫다.

We now examine stations where the ‘‘standard’’ set of metrics would not highlight underlying quality issues in respect of assessor reveals decision central making theme: and judgements. This whole analysis achieve a candidates as a comparatively better performance (determined by the check-list score) than would be expected by assessors’ prediction(determined by the global grade). 



보더라인이 높게 나온 스테이션

Stations with relatively high numbers of ‘‘Borderline’’ students


As a final part of the work examining the impact of the misalignment measure, we review stations where the proportion of borderline grades awarded is relatively ‘‘high’’. Station 8, which focusses on medico-legal responsibilities after the death of a patient, has the highest proportion of borderline grades (25.2% of the whole cohort) amongst the stations listed in Table 3. Review of traditional station level metrics (columns 2–5) shows an acceptably performing station, but with a high number of student failures.




Discussion


수행능력 검사에는 많은 "노이즈"가 있으며, 최근 근무지-기반 평가에서는 이것을 "블랙박스"라고 개념화한 적이 있다.

Despite this, there remains a large degree of ‘‘noise’’ in performance testing, recently conceptua- lised within workplace assessment settings as a ‘‘black box’’ (Kogan et al. 2011).


연구자들은 이 '노이즈'를 해독하기 시작했으며, 보건의료계열 평가에서는 복잡하고 변화하는 OSCE 스테이션의 특성 속에서 평가자의 행동과 의사결정에 초점을 두기 시작했다. 다른 전문직 분야의 연구는 GG와 CL/채점 루브릭 사이의 더 넓은 긴장관계를 강조하며, 평가자가 전체론적·종합적 판단을 신뢰한 나머지 CL 기준의 활용을 무시하는 것과 같은 적극적 '위반(transgression)'을 드러냈다. 또 다른 연구는 GG와 기술어(descriptor)만 사용하는 것의 어려움을 보여주는데, 평가자가 기술어 안의 '안전'이나 '프로페셔널리즘' 같은 복합적인 구인을 이해하려 애쓰기 때문이다. 그러한 구인은 종종 한 단어로 표현되며, 평가자 훈련에도 불구하고 다양한 재해석이 판단의 변동을 가져온다. 일부 연구자들은 이를 단순한 에러가 아니라 '예상된 변동(anticipated variance)'으로 개념화한다. 이러한 복잡한 역학은 일련의 관찰을 통해 '불확정성(indeterminacy)'으로 개념화되었으며, CL, 루브릭, 채점체계 사용의 이론적 배경에 도전하고 있다.

Researchers have begun to unpack this ‘‘noise’’, and work within healthcare assessment has focused on assessor behaviours and decision making in the complex, changing nature of the OSCE station (Govaerts 2011). Work from other professional disciplines has highlighted a wider tension between the balance of global grades and checklists/marking rubrics, revealing active ‘‘transgressions’’ as assessors trust of holistic, global judgements overrides their use of checklist criteria (Marshall 2000). Other work reveals the challenges of using global grades and descriptors alone, as assessors seek to make sense of complex constructs such as ‘‘safety’’ or ‘‘professionalism’’ within descriptors. Such constructs are often represented by single words, and despite assessor training, multiple re-interpretations lead to variation in judgements – with some researchers conceptualising this as anticipated variance, rather than just simply error (Govaerts et al. 2007). This complex dynamic has been conceptualised through a series of observations as ‘‘indeterminancy’’, challenging the theoretical background to the accepted use of checklists, rubrics and grading schemes (Sadler 2009).


OSCE에서 GG와 CL과 관한 연구는 많으나, 둘 사이의 alignment 연구는 적다.

Whilst an extensive literature exists in respect of the use of global grades and checklists within OSCEs (Cunnington et al. 1996; Humphrey-Murto & MacFadyen 2002; Wan et al. 2011), little has been done to explore the nature of the alignment between the two.


우리는 다른 기관들도 각자의 자료로 PM 계산을 모델링하여, 각 기관의 상황에 맞는 가장 적절한 parameter M 값(formula 1)을 결정하기를 권고한다.

We encourage other institutions to model the PM calculations using their own data to determine the most suitable value of the parameter M (formula 1) to meet local conditions.


이 연구는 대규모 OSCE의 다양한 코호트와 평가 수준에 걸쳐, 평가자의 CL 결정과 '예측(즉, GG)' 사이의 misalignment 정도를 밝혔다. 부정적인(adverse) standard station-level metrics를 가지는 스테이션들에서 misalignment 지표는 기존 지표를 잘 보완하며, 평가자가 스테이션과 CL 구인에 불만을 갖는 지점을 보여준다. 도입부에서 언급한 것처럼 misalignment는 여러 문제에 기인할 수 있다(예컨대 평가자 훈련, 보조자료, '까칠한(rogue)' 평가자 등). 그보다 더 중요한 것은 기존 지표에서는 '수용가능한' 것으로 판단되었을 스테이션에 대한 더 깊은 이해이다. 이들 스테이션의 불만족스러운 특성은 많은 경우 보더라인 집단에 대한 평가자 판단에 있다. 이 맥락에서 misalignment 지표를 해석하면 서로 다른 '방향성(directionalities)'이 드러나는데, 평가자는 불합격을 주는 데 어려움을 겪으며, 보더라인 집단의 학생 수행을 과대평가하는 경향이 있다. 이러한 결과는 평가자가 'bestowed credit'을 준다는 기존 연구와 일치하는데, 즉 GG나 CL 시스템에 포함되어 있지 않은 피평가자의 행동에 대해 가점 또는 벌점을 준다는 것이며, 이는 수행능력 평가의 충실도(fidelity)에 대한 위협으로 지적되어 왔다.

This study has revealed the extent of misalignment between assessors’ checklist decisions and their ‘‘predictions’’ (i.e. the global grades) across a range of different academic cohorts and levels of assessment in a large-scale OSCE. Within stations with ‘‘adverse’’ standard station-level metrics (Pell et al. 2010; Fuller et al. 2013), the misalignment measures complement these well, highlighting where assessors are dissatisfied with station and checklist constructs. As stated earlier in the introduction, the misalignment could be the result of a number of problems (including but not limited to, for example, assessor training, support materials, ‘‘rogue’’ assessors and so on). Of more importance is the deeper insight into stations that might have been judged as ‘‘acceptable’’ based on pre-existing metrics. The unsatisfactory characteristic of many of these stations lies in assessor judgement of the borderline group. Interpreting the misalignment measure in this context reveals different directionalities – with assessors showing difficulty in awarding fail grades and a tendency to over rate student performance in the borderline group. Such findings resonate with assessors awarding ‘‘bestowed credit’’ – rewarding or penalising other candidate activities that are not featured within grading and checklist systems, and an activity that has been identified as a threat to the fidelity of performance assessments (Sadler 2010).


보더라인 그룹에서 합-불합 결과의 상당한 비대칭은 약 10%의 스테이션에서 발생하는 것으로 추정한다. 다시 말해, 보더라인 그룹의 대다수가 해당 스테이션에 합격하는(혹은 반대로 불합격하는) 경우가 생긴다는 것이다. 그 결과 이 보더라인 그룹의 평균점수, 즉 borderline group method에 따른 커트라인은 borderline regression method에 따른 커트라인보다 낮(거나 높)다. 질 관리 관점에서 이는 BRM을 지지하는 또 하나의 근거인데, borderline group method에서는 이러한 문제가 드러나지 않기 때문이다.

We estimate that the incidence of substantial asymmetry of pass/fail outcomes within the borderline group occurs in approximately 10% of stations. In other words, there are incidences where the large majority of candidates in the borderline group pass the station (or, conversely, fail the station). Hence, the mean mark for this borderline group, giving the cut-score as per the borderline group method, is lower (or higher) than that under the borderline regression method. We would argue from a quality perspective that this is further evidence in favour of BRM, since under the borderline group method these issues would remain unknown.
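아래 스케치는 가상의 자료로, 보더라인 그룹의 CL 점수가 한쪽으로 치우칠 때 borderline group method(보더라인 그룹의 평균 CL 점수)와 borderline regression method의 커트라인이 어떻게 달라질 수 있는지 보여주기 위한 가정적 예시이다.

```python
import numpy as np

# 가상의 스테이션: 보더라인(=1) 등급 수험생이 많고, 이들의 CL 점수가 대체로 높은 상황.
checklist = np.array([11.0, 13.0, 18.5, 19.0, 20.0, 21.5, 14.5, 22.0, 17.5, 16.0])
grades    = np.array([0,    1,    1,    1,    1,    2,    1,    3,    2,    2])

# Borderline group method: 보더라인 그룹의 평균 CL 점수를 커트라인으로 사용.
bgm_cut = checklist[grades == 1].mean()

# Borderline regression method: 회귀선에서 보더라인 등급에 해당하는 예측 CL 점수.
slope, intercept = np.polyfit(grades, checklist, deg=1)
brm_cut = intercept + slope * 1

# 보더라인 그룹 대다수가 CL에서 좋은(또는 나쁜) 점수를 받으면
# BGM 커트라인이 BRM 커트라인보다 높(거나 낮)아진다.
print(f"BGM cut = {bgm_cut:.2f}, BRM cut = {brm_cut:.2f}")
```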





Gingerich A, Regehr G, Eva KW. 2011. Rater-based assessments as social judgments: Rethinking the etiology of rater errors. Acad Med 86:S1–S7.


Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. 2011. Opening the black box of clinical skills assessment via observation: A conceptual model. Med Educ 45(10):1048–1060.


Yorke M. 2011. Assessing the complexity of professional achievement. Chapter 10. In: Jackson N, editor. Learning to be professional through a higher education. London: Sceptre. Available from: http://learningtobeprofessional.pbworks.com/w/page/15914981/Learning%20to%20be%20Professional%20through%20a%20Higher%20Education%20e-Book












 2015 Dec;37(12):1106-13. doi: 10.3109/0142159X.2015.1009425. Epub 2015 Feb 16.

Investigating disparity between global grades and checklist scores in OSCEs.

Author information

  • 1a University of Leeds, UK.

Abstract

BACKGROUND:

When measuring assessment quality, increasing focus is placed on the value of station-level metrics in the detection and remediation of problems in the assessment.

AIMS:

This article investigates how disparity between checklist scores and global grades in an Objective Structured Clinical Examination (OSCE) can provide powerful new insights at the station level whenever such disparities occur and develops metrics to indicate when this is a problem.

METHOD:

This retrospective study uses OSCE data from multiple examinations to investigate the extent to which these new measurements of disparity complement existing station-level metrics.

RESULTS:

In stations where existing metrics are poor, the new metrics provide greater understanding of the underlying sources of error. Equally importantly, stations of apparently satisfactory "quality" based on traditional metrics are shown to sometimes have problems of their own - with a tendency for checklist score "performance" to be judged stronger than would be expected from the global grades awarded.

CONCLUSIONS:

There is an ongoing tension in OSCE assessment between global holistic judgements and the necessarily more reductionist, but arguably more objective, checklist scores. This article develops methods to quantify the disparity between these judgements and illustrates how such analyses can inform ongoing improvement in station quality.

PMID:
 
25683174
 
[PubMed - in process]


타당도를 위협하는 것들 (Med Educ, 2004)

Validity threats: overcoming interference with proposed interpretations of assessment data

Steven M Downing1 & Thomas M Haladyna2






타당도란 검사점수 해석에 있어서 meaningfulness가 얼마나 되느냐에 관한 것이다. 

Validity refers to the degree of meaningfulness for any interpretation of a test score. In a previous paper in this series1 validity was discussed and sources of validity evidence based on the Standards for Educational and Psychological Testing2


meaningful interpretation을 훼방하는 모든 것이 타당도를 위협하는 것이다.

Any factors that interfere with the meaningful interpretation of assessment data are a threat to validity.


Messick은 두 개의 주요 위협을 언급했다. 구인-과소반영(CU)과 구인-무관변인(CIV)이다. CU는 내용영역에 대해서 과소-샘플링 혹은 편향된 샘플링을 하는 것이다. CIV는 측정하려는 구인과 무관한 변인에 의해서 생기는 평가자료의 시스템적 에러(systematic error)이다. (무작위 에러(random error)가 아니다.)

Messick3 noted 2 major sources of validity threats: construct under-representation (CU) and construct-irrelevant variance (CIV). Construct under-representation refers to the undersampling or biased sampling of the content domain by the assessment instrument. Construct-irrelevant variance refers to systematic error (rather than random error) introduced into the assessment data by variables unrelated to the construct being measured.




지필고사 

Written examinations


지필고사에서 CU는 너무 짧은 시험 등이 원인이 될 수 있다. 또 다른 예시는 시험문항의 내용이 시험의 blueprint와 맞지 않아서 어떤 영역이 과대반영되거나 어떤 영역이 과소반영 되는 것이다. 수업목표는 고차원의 인지행동인데 시험에서는 낮은 수준의 인지행동만 평가한다거나(암기, 사실인식) 하는 것도 마찬가지다. 또한 미래의 학습과 무관한 사소한(지엽적) 내용에 대해서만 묻는 것도 이에 포함된다.

In a written examination, such as an objective test in a basic science course, CU is exemplified in an examination that is too short to adequately sample the domain being tested. Other examples of CU are: test item content that does not match the examination specifications well, so that some content areas are oversampled while others are undersampled; use of many items that test only low level cognitive behaviour, such as recall or recognition of facts, while the instructional objectives require higher level cognitive behaviour, such as application or problem solving; and, use of items that test trivial content that is unrelated to future learning.4


적절한 샘플링을 위해서는 시험문항 수가 충분해야 하며, 일반적으로 최소 30문항 이상이 필요하다.

Tests must have sufficient numbers of items in order to sample adequately (generally, at least 30 items)


지필고사에서 CIV는 모든 학생이 아니라 종종 일부 학생에게서만 발생한다. CIV는 의도하지않은, 타겟을 벗어난(off-target) 구인에 대한 측정이며, 일차적으로 관심대상이 되는 구인에 대한 것이 아니고, 따라서 타당도를 위협하게 된다.

Construct-irrelevant variance represents systematic noise in the measurement data, often associated with the scores of some but not all examinees. This CIV noise represents the unintended measurement of some construct that is off-target, not associated with the primary construct of interest, and therefore interferes with the validity evidence for assessment data.


CIV는 statistically biased items을 사용한다거나(일부 집단이 과도하게 문제를 잘 풀거나 못 푸는 경우), 혹은 문화적으로 둔감한 언어를 사용하여 학생들을 offend하는 경우가 있다.

Construct-irrelevant variance is also introduced by including statistically biased items6 on which some subgroup of students under- or over-performs compared to their expected performance, or by including test items which offend some students by their use of culturally insensitive language.


만약 문항이 기술된 방식이 학생에게 적합하지 못하면, 읽기능력이 CIV variable이 된다. 자신의 모국어가 아닌 언어로 시험을 치르는 경우 특히 중요하다.

If the reading level of achievement test items is inappropriate for students, reading ability becomes a CIV variable which is unrelated to the construct measured, thereby introducing CIV.7 This reading level issue may be particularly important for students taking tests written in a language that is non-native to them.


CIV의 마지막 사례는 정당화되지 못하는 합격선에 대한 것이다. 모든 합격선을 결정하는 방법은 상대적이든 절대적이든 arbitrary한 것이다. 그럼에도 이러한 방법과 그 결과가 변덕스러워서는 안된다.

A final example of CIV for written tests concerns the use of indefensible passing scores.10 All passing score determination methods, whether relative or absolute, are arbitrary. These methods and their results should not be capricious, however.



Performance examinations

OSCE같은 것은 실제상황의 시뮬레이션이며, 실제 상황은 아니다. 학생들의 수행능력은 훈련된 SP에 의해서 통제된 환경에서 평가하게 되며, 최대치의 수행능력을 요구하는 제한된 수의 선택된 사례에 대해서 평가하게 된다. 이것은 실제 상황에서의 수행능력이 아니고, 체크리스트나 평가스케일에 부여된 의미에 대한 구체적 해석을 통해 어떤 영역에 대한 평가점수를 바탕으로 추론하게 되는 것이다. 

They are simulations of the real world, but are not the real world. The performance of students, rated by trained SPs in a controlled environment on a finite number of selected cases requiring maximum performance, is not actual performance in the real world; rather, inferences must be made from performance ratings to the domain of performance, with a specific interpretation or meaning attributed to the checklist or rating scale data.


어떤 domain에 대해 최소한의 일반화가능한 추론을 하기 위해서는, 각각 20분 정도 진행되는 약 12개의 SP encounter가 필요할 수 있다. 충분한 generalisability가 확보되지 않는 것은 CU에 해당한다.

Approximately 12 SP encounters, lasting 20 minutes each, may be required to achieve even minimal generalisability to support inferences to the domain.16 Lack of sufficient generalisability represents a CU threat to validity.


만약 SP가 충분히 잘 훈련되지 않아서 환자의 일반적인 모습을 잘 보여주지 못하는 경우에는 모든 학생이 동일한 환자문제 혹은 자극에 노출되지 않으므로 관심을 갖는 구인이 잘못 해석될 수 있다.

If the SPs are not sufficiently well trained to consistently portray the patient in a standardised manner, the construct of interest is therefore misrepresented, because all students do not encounter the same patient problem or stimulus.


  • 학생에게 부적절한 난이도 inappropriate difficulty for students
  • 모호한 체크리스트나 평가스케일 checklist or rating scale items that are ambiguous
  • 발견/교정되지 않은 특정 학생 그룹에만 영향을 주는 통계적 비뚤림 Statistical bias for 1 or more subgroups of students, which is undetected and uncorrected,
  • 평가자의 인종/민족 편견 Racial or ethnic rater bias

학생이 SP에게 거짓행동을 할 수도 있으며, 특히 SP 사례의 비-의학적 측면에서 그러할 수 있다. 그 경우 그러한 학생들이 평가를 더 잘 받을 수도 있다.

It is possible for students to bluff SPs, particularly on non-medical aspects of SP cases, making ratings higher for some students than they actually should be.


일반화가능도는 generalizability theory를 활용하여 이러한 유형의 시험에서 반드시 측정되어야한다. 고부담 수행능력 평가에서 일반화가능도 계수는 최소한 0.8이상은 되어야 한다. phi-coefficient 는 criterion-referenced performance examinations (상대적 기준이 아니라 절대적 기준으로 합/불합을 결정하는 시험)에서 적합한 방법이다.

Generalisability must be estimated for most performance-type examinations, using generalisability theory.17,18 For high-stakes performance examinations, generalisability coefficients should be at least 0.80; the phi-coefficient is the appropriate estimate of generalisability for criterion-referenced performance examinations (which have absolute, rather than relative passing scores).16
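아래는 persons × cases 완전교차 설계에서 분산성분을 추정하고 절대적 결정용 phi 계수를 계산하는 최소한의 스케치이다. 자료와 수치는 가상이며, 기대평균제곱(EMS)에 기반한 표준적인 1-facet G-study 공식을 가정한 것이다.

```python
import numpy as np

def phi_coefficient(scores):
    """완전교차 persons x cases 설계에서의 phi(절대적 결정) 계수.

    scores : 2차원 배열, 행 = 수험생(p), 열 = SP 사례(c), 셀당 점수 1개.
    """
    scores = np.asarray(scores, dtype=float)
    n_p, n_c = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)
    c_means = scores.mean(axis=0)

    # p x c ANOVA의 제곱합.
    ss_p = n_c * np.sum((p_means - grand) ** 2)
    ss_c = n_p * np.sum((c_means - grand) ** 2)
    ss_res = np.sum((scores - grand) ** 2) - ss_p - ss_c

    ms_p = ss_p / (n_p - 1)
    ms_c = ss_c / (n_c - 1)
    ms_res = ss_res / ((n_p - 1) * (n_c - 1))

    # 기대평균제곱으로부터 분산성분 추정.
    var_res = ms_res                      # person x case 상호작용 + 오차
    var_p = (ms_p - ms_res) / n_c         # universe score(수험생) 분산
    var_c = (ms_c - ms_res) / n_p         # 사례 난이도 분산

    # 절대적 결정에서는 사례 분산도 오차로 취급한다.
    abs_error = (var_c + var_res) / n_c
    return var_p / (var_p + abs_error)

# 가상의 예: 수험생 5명 x 사례 4개.
demo = np.array([
    [7.5, 8.0, 7.0, 7.5],
    [6.0, 6.5, 5.5, 6.0],
    [8.5, 9.0, 8.0, 8.5],
    [5.0, 5.5, 4.5, 5.5],
    [7.0, 7.5, 6.5, 7.0],
])
print(round(phi_coefficient(demo), 2))
```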


수행능력을 평가하기 위한 case는 최종적으로 사용되기에 앞서 학생을 대표할 수 있는 집단을 대상으로 미리 테스트를 해보아야 한다.

Performance cases should be pretested with a representative group of students prior to their final use, testing the appropriateness of case difficulty and all other aspects of the case presentation.



임상수행능력 평가

Ratings of clinical performance


의학교육에서 Clerkship이나 Preceptorship에서 임상수행능력 평가는 종종 주요한 평가 수단이다. 이 방법은 주로 현실 그대로의 상황에서 교수가 관찰한 학생의 수행능력에 의존하게 된다.

In medical education, ratings of student clinical performance in clerkships or preceptorships (on the wards) are often a major assessment modality. This method depends primarily on faculty observations of student clinical performance behaviour in a naturalistic setting.


이 경우에 CU 위협의 예는 관찰 횟수가 너무 적은 것, 즉 교수 평가자가 평가 대상 행동을 너무 적게 관찰하는 것이다. Williams 등은 유용하고 해석가능한, 충분히 일반화가능한 자료를 얻기 위해서는 임상수행능력에 대한 7개에서 11개의 독립적 평가가 필요하다고 했다.

The CU threat is exemplified by too few observations of the target or rated behaviour by the faculty raters (Table 1). Williams et al.20 suggest that 7–11 independent ratings of clinical performance are required to produce sufficiently generalisable data to be useful and interpretable.


주요 CIV 위협은 평가자의 systematic error에 의한 것이다. 이러한 관찰 기반 평가에서 평가자는 측정오류의 주된 원인인데, CIV는 그중에서도 평가자의 엄격/관대 오류, central tendency 오류(척도의 가운데 점수만 주는 것), 제한된 범위의 점수만 사용하는 것(restriction of range)과 같은 systematic error와 관련이 있다. 평가자가 평가해야 할 개별 특질을 무시하고 모든 특질을 하나처럼 취급하면 halo effect가 생긴다.

The major CIV threat is due to systematic rater error. Raters are the major source of measurement error for these types of observational assessments, but CIV is associated with systematic rater error, such as rater severity or leniency errors, central tendency error (rating in the centre of the rating scale) and restriction of range (failure to use all the points on the rating scale). The halo rater effect occurs when the rater ignores the traits to be rated and treats all traits as if they were one.


비록 더 많은 훈련을 통해서 부적절한 평가자 영향을 줄일 수는 있지만, 평가자의 엄격/관대 성향에 대응하는 또 다른 방법은 얼마나 엄격/관대한지를 추정하여 최종 평가단계에서 그로 인한 영향을 보정하는 것이다.

Although better training may help to reduce some undesirable rater effects, another way to combat rater severity or leniency error is to estimate the extent of severity (or leniency) and adjust the final ratings to eliminate the unfairness that results from harsh or lenient raters.
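아래는 평가자 엄격/관대 정도를 추정해 점수를 보정하는 가장 단순한 방식(각 평가자의 평균이 전체 평균에서 벗어난 정도를 빼 주는 것)을 가정한 스케치이다. 실제로는 many-facet Rasch 모형 등 더 정교한 방법이 쓰이며, 이 단순 보정은 평가자가 맡은 학생들의 실제 능력 차이와 평가자의 엄격성을 구분하지 못한다는 한계가 있다. 자료와 이름은 가상이다.

```python
import pandas as pd

# 가상의 임상수행능력 평가 자료: 한 행 = 한 번의 관찰.
ratings = pd.DataFrame({
    "rater":   ["A", "A", "B", "B", "B", "C", "C", "A", "C"],
    "student": ["s1", "s2", "s1", "s3", "s4", "s2", "s3", "s4", "s4"],
    "score":   [6.0, 7.0, 4.0, 5.0, 4.5, 8.0, 8.5, 6.5, 9.0],
})

overall_mean = ratings["score"].mean()

# 엄격/관대 추정치: 각 평가자의 평균점수가 전체 평균에서 벗어난 정도.
rater_effect = ratings.groupby("rater")["score"].mean() - overall_mean

# 보정점수 = 원점수 - 해당 평가자의 추정 효과.
ratings["adjusted"] = ratings["score"] - ratings["rater"].map(rater_effect)

print(rater_effect)
print(ratings)
```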


평가스케일은 임상수행능력 평가에 흔히 사용된다. 그런데 평가문항이 부적절하게 기술되어 있으면, 평가자가 워딩 때문에 혼란을 겪거나, 의도한 것과 다른 학생 특질을 평가하게 될 수 있다.

Rating scales are frequently used for clinical performance ratings. If the items are inappropriately written, such that raters are confused by the wording or misled to rate a different student characteristic from that which was intended,


합격/불합격 결정이나 성적을 결정하는 방법도 CIV의 원인이 된다.

the methods used to establish passing scores or grades may be a source of CIV.




안면타당도는?

What about face validity?


'안면타당도'라는 용어는, 비록 일부 의학교육자들이 흔히 사용하는 단어지만 교육측정전문가들 사이에서는 이미 1940년대부터 조롱의 대상이 되어왔다. 안면타당도는 여러 다른 의미를 가질 수 있다. 가장 치명적인 의미는 Mosier에 따르면.."검사의 타당도는 상식(common sense)를 활용하여 그 검사가 시험 상황과 직무 상황 모두에 존재하는 세부 능력을 측정한다는 것을 발견함으로써 가장 잘 결정할 수 있다"와 같은 것이다. 명백하게, 의학교육자들의 논문이나 그들이 쓰는 단어에 안면타당도의 자리는 없다. 따라서 이러한 유형의 안면타당도에 의존하는 것은 타당도의 주요 위협이 된다.

The term face validity, despite its popularity in some medical educators’ usage and vocabulary, has been derided by educational measurement professionals since at least the 1940s. Face validity can have many different meanings. The most pernicious meaning, according to Mosier, is: …the validity of the test is best determined by using common sense in discovering that the test measures component abilities which exist both in the test situation and on the job. 23(p 194) Clearly, this meaning of face validity has no place in the literature or vocabulary of medical educators. Thus, reliance on this type of face validity as a major source of validity evidence for assessments is a major threat to validity.


안면타당도는, 위의 정의에 따르면, 현대의 교육측정연구자들에게 지지받지 못한다. 안면타당도는 타당도의 적합한 근거가 될 수 없으며, 다른 여러 타당도 근거 중 어떤 것도 안면타당도가 대체할 수는 없다.

Face validity, in the meaning above, is not endorsed by any contemporary educational measurement researchers.24 Face validity is not a legitimate source of validity evidence and can never substitute for any of the many evidentiary sources of validity.2


그럼에도 안면타당도라는 용어는 종종 의학교육에서 사용된다는 것을 감안하면, 어떠한 정당성을 가질 수는 없을까? 만약 안면타당도라는 용어를 통해서, 어떤 측정이 의도한 구인을 측정하는 것으로 보이는 표면적 퀄리티를 갖는다는 것을 의미한다면(예컨대 SP사례를 통해 병력청취 기술을 판단한다) 이는 그 평가의 필수적 특성을 보여줄 수는 있을지는 몰라도 타당도는 아니다. 이 SP 특징은 학생이나 교수가 그 평가를 받아들일 수 있느냐와 연관이 되고, 따라서 행정가들에게, 심지어는 대중들에게 중요할 수는 있으나 타당도는 아니다. 이러한 식의 안면-비타당도를 회피하자는 것이 Messick의 주장이었다. 타당해 보이는 것이 타당도는 아니다. 외관(appearance)은 가설이나 이론에서 유도된, 실제 자료를 바탕으로 지지하거나 반박할 수 있는, 그래서 논리적 주장으로 만들어질 수 있는 과학적 근거가 아니다.

However, as the term face validity is sometimes used in medical education, can it have any legitimate meaning? If by face validity one means that the assessment has superficial qualities that make it appear to measure the intended construct (e.g. the SP case looks like it assesses history taking skills), this may represent an essential characteristic of the assessment, but it is not validity. This SP characteristic has to do with acceptance of the assessment by students and faculty or is important for administrators and even the public, but it is not validity. (The avoidance of this type of face invalidity was endorsed by Messick.3) The appearance of validity is not validity; appearance is not scientific evidence, derived from hypothesis and theory, supported or unsupported, more or less, by empirical data and formed into logical arguments.


안면타당도라는 용어를 대체할 용어를 고려해볼 수 있다. 예컨대 어떤 객관시험이 관심 대상인 성취 구인을 측정하는 것처럼 보인다면, 이는 평가 프로그램이 전반적으로 성공하고, 받아들여지고, 활용되는 데 필요한 일종의 부가가치이자 중요한(심지어 필수적인) 특성이라고 볼 수 있다. 그러나 이것만으로는 타당도의 충분한 과학적 근거가 아니다. 타당해 보이는 외관은 필요할 수는 있어도 타당도의 충분한 근거는 아니다. 평가의 표면적인 모습/느낌과 탄탄한 타당도 근거가 일치하는 것을 '합치(congruent)' 혹은 '사회정치적 의미(sociopolitical meaningfulness)'라고 부를 수는 있겠지만, 이는 분명히 타당도 근거의 일차적 유형이 아니며, 앞서 제시된 다섯 가지 primary source of validity evidence 중 어떤 것도 대체할 수 없다.

Alternative terms for face validity might be considered. For example, if an objective test looks like it measures the achievement construct of interest, one might consider this some type of value-added and important (even essential) trait of the assessment that is required for the overall success of the assessment programme, its acceptance and its utility, but this clearly is not sufficient scientific evidence of validity. The appearance of validity may be necessary, but it is not sufficient evidence of validity. The congruence between the superficial look and feel of the assessment and solid validity evidence might be referred to as congruent or sociopolitical meaningfulness, but it is clearly not a primary type of validity evidence and can not, in any way, substitute for any of the 5 suggested primary sources of validity evidence.2



2 American Educational Research Association, American Psychological Association, National Council on Measurement in Education. Standards for Educational and Psychological Testing. Washington, DC: American Educational Research Association 1999.









 2004 Mar;38(3):327-33.

Validity threats: overcoming interference with proposed interpretations of assessment data.

Author information

  • 1University of Illinois at Chicago, College of Medicine, Department of Medical Education, Chicago, Illinois 60612-7309, USA. sdowning@uic.edu

Abstract

CONTEXT:

Factors that interfere with the ability to interpret assessment scores or ratings in the proposed manner threaten validity. To be interpreted in a meaningful manner, all assessments in medical education require sound, scientific evidence of validity.

PURPOSE:

The purpose of this essay is to discuss 2 major threats to validity: construct under-representation (CU) and construct-irrelevant variance (CIV). Examples of each type of threat for written, performance and clinical performance examinations are provided.

DISCUSSION:

The CU threat to validity refers to undersampling the content domain. Using too few items, cases or clinical performance observations to adequately generalise to the domain represents CU. Variables that systematically (rather than randomly) interfere with the ability to meaningfully interpret scores or ratings represent CIV. Issues such as flawed test items written at inappropriate reading levels or statistically biased questions represent CIV in written tests. For performance examinations, such as standardised patient examinations, flawed cases or cases that are too difficult for student ability contribute CIV to the assessment. For clinical performance data, systematic rater error, such as halo or central tendency error, represents CIV. The term face validity is rejected as representative of any type of legitimate validity evidence, although the fact that the appearance of the assessment may be an important characteristic other than validity is acknowledged.

CONCLUSIONS:

There are multiple threats to validity in all types of assessment in medical education. Methods to eliminate or control validity threats are suggested.

PMID:
 
14996342
 
[PubMed - indexed for MEDLINE]


타당도: 평가결과의 유의한 해석을 위한 도구 (Med Educ, 2003)

Validity: on the meaningful interpretation of assessment data

Steven M Downing





타당도는 평가결과에 따르는 의미나 해석을 지지하거나 반박하기 위한 근거이다. 모든 평가는 타당도 근거를 필요로 하고, 평가의 거의 모든 주제가 어떤 식으로든 '타당도'와 관련이 있다. 타당도는 평가의 필수불가결한 요소이며, 타당도가 결여되면 평가는 거의 혹은 아무 의미가 없다.

Validity refers to the evidence presented to support or refute the meaning or interpretation assigned to assessment results. All assessments require validity evidence and nearly all topics in assessment involve validity in some way. Validity is the sine qua non of assessment, as without evidence of validity, assessments in medical education have little or no intrinsic meaning.


타당도는 언제나 '가설'의 형태로 접근하게 된다. 평가자가 기대하는 해석적 의미를 평가 결과와 연관짓게 되고, 이 최초 가설을 바탕으로 자료가 수집되고, 타당도 가설을 지지하거나 반박하는 결과로 나타난다. 이러한 개념하에서 평가자료는 어떤 특정한 목적, 의미, 해석에 대해서 더 타당하거나 덜 타당할 수 있으며, 이는 특정 시점이나 특정 집단에 대해서만 그러할 수도 있다. 평가 그 자체만으로는 절대로 '타당'하다거나 '타당하지 않다'라는 말을 할 수 없으며, 평가점수의 해석을 하는 데 있어서 그것을 지지하거나 반박하는 과학적으로 타당한 근거가, 특정 시점에 존재한다고 말할 수 있을 뿐이다.

Validity is always approached as hypothesis, such that the desired interpretative meaning associated with assessment data is first hypothesized and then data are collected and assembled to support or refute the validity hypothesis. In this conceptualization, assessment data are more or less valid for some very specific purpose, meaning or interpretation, at a given point in time and only for some well-defined population. The assessment itself is never said to be ‘valid’ or ‘invalid’ rather one speaks of the scientifically sound evidence presented to either support or refute the proposed interpretation of assessment scores, at a particular time period in which the validity evidence was collected.


타당도는 다양한 근거 출처를 고려하는 일원화된(unitary) 개념이라는 것이 현재의 개념화이다. 이러한 근거 출처는 보통 의도한 해석이나 의미의 유형에 따라 논리적으로 제안된다. 현재의 프레임워크에서 모든 타당도는 구인타당도(construct validity)이며, 이는 Messick이 가장 우아하게 기술하였고 현행 Standards에 구현되어 있다. 과거에는 타당도가 세 가지 유형, 즉 content, criterion, construct로 나뉘었고, 이 중 criterion-related validity는 준거자료의 수집 시점에 따라 concurrent와 predictive로 다시 나뉘곤 했다.

In its contemporary conceptualization,1,3–14 validity is a unitary concept, which looks to multiple sources of evidence. These evidentiary sources are typically logically suggested by the desired types of interpretation or meaning associated with measures. All validity is construct validity in this current framework, described most eloquently by Messick8 and embodied in the current Standards of Educational and Psychological Measurement.1 In the past, validity was defined as three separate types: content, criterion and construct, with criterion-related validity usually subdivided into concurrent and predictive depending on the timing of the collection of the criterion data.2,15


왜 구인타당도가 이제 유일한 유형의 타당도가 된 것일까? 그 복잡한 답은 과학철학에서 찾을 수 있다. 관심 있는 영역이나 더 넓은 모집단에 대해 의미 있고 합리적인 추론을 하기 위해 내용을 샘플링하는 과정에는, 서로 연결된 추론의 복잡한 거미줄이 무수히 얽혀 있다는 것이다.

Why is construct validity now considered the sole type of validity? The complex answer is found in the philosophy of science8 from which, it is posited, there are many complex webs of inter-related inference associated with sampling content in order to make meaningful and reasonable inferences to a domain or larger population of interest.


보다 직접적인 답은 이러하다: 의학교육을 포함한 사회과학의 거의 모든 평가는 구인(construct)을 다루기 때문이다. 구인이란 행동으로부터 추론되고 교육학이나 심리학 이론으로 설명되는, 무형의 추상적 개념과 원칙의 집합체이다. 교육적 성취 역시 구인으로서, 잘 정의된 지식영역에 대한 지필고사나, 특정 문제나 사례에 대한 구술고사, 표준화환자를 이용한 병력청취와 의사소통 평가 등으로부터 추론(infer)되는 것이다.

The more straightforward answer is: Nearly all assessments in the social sciences, including medical education, deal with constructs – intangible collections of abstract concepts and principles which are inferred from behavior and explained by educational or psychological theory. Educational achievement is a construct, usually inferred from performance on assessments such as written tests over some well-defined domain of knowledge, oral examinations over specific problems or cases in medicine, or highly structured standardized patient examinations of history-taking or communication skills.


교육적 능력이나 적성 역시 우리에게 친숙한 구인의 또 다른 사례이다. 이 구인은 학업성취보다도 더 실체가 없고 추상적인데, 왜냐하면 그 의미에 대해 교육자나 심리학자 사이의 합의가 덜 되어있기 때문이다. 교육적 능력을 측정한다고 하는 검사는 - MCAT같은 - 북미에서 의과대학 입학 시 주요하게 활용되며, 따라서 MCAT 점수를 입학 선발의 중요한 기준으로 사용하는 것의 합리성을 지지하려면 다양한 출처로부터 과학적으로 타당한 근거를 제시할 수 있어야 한다. 이러한 시험에서 타당도 근거의 중요한 출처 하나는 MCAT 점수가 의과대학 입학 후 학업성취를 얼마나 예측하는가를 보여주는 것이다.

Educational ability or aptitude is another example of a familiar construct – a construct that may be even more intangible and abstract than achievement because there is less agreement about its meaning among educators and psychologists.16 Tests that purport to measure educational ability, such as the Medical College Admissions Test (MCAT), which is relied on heavily in North America for selecting prospective students for medical school admission, must present scientifically sound evidence, from multiple sources, to support the reasonableness of using MCAT test scores as one important selection criterion for admitting students to medical school. An important source of validity evidence for an examination such as the MCAT is likely to be the predictive relationship between test scores and medical school achievement.


타당도는 평가 점수 해석을 그 의도한 해석의 논리성을 지지하거나 반박하는 이론/가설/논리와 연결시키는 evidentiary chain을 필요로 한다. 타당도는 절대로 당연히 가정될 수 있는 것이 아니며, 지속적으로 가설을 수립하고 자료를 모으고, 검증하고, 비판적으로 평가하고, 논리적으로 추론해야 하는 것이다. 타당도에 대한 근거, 그에 관련된 이론, 경험적 근거는 어떤 특정한 해석이 타당하고 어떤 해석이 그렇지 않은가를 알려주는 것이다.

Validity requires an evidentiary chain which clearly links the interpretation of the assessment scores or data to a network of theory, hypotheses and logic which are presented to support or refute the reasonableness of the desired interpretations. Validity is never assumed and is an ongoing process of hypothesis generation, data collection and testing, critical evaluation and logical inference. The validity argument11,12 relates theory, predicted relationships and empirical evidence in ways to suggest which particular interpretative meanings are reasonable and which are not reasonable for a specific assessment use or application.


유의미한 점수의 해석을 위해서, 어떤 평가는 - 예컨대 지식에 대한 학업성취도 - 상당히 직접적인 시험 내용의 적합성에 대한 근거로 내용-관련 근거, 점수의 재생산가능성, 문항의 통계적 질, 합격선이나 학점을 결정한 근거 등이 필요할 수 있다. 수행능력 평가와 같은 다른 종류의 평가에서는 다른 것이 필요하다.

In order to meaningfully interpret scores, some assessments, such as achievement tests of cognitive knowledge, may require fairly straightforward content-related evidence of the adequacy of the content tested (in relationship to instructional objectives), statistical evidence of score reproducibility and item statistical quality and evidence to support the defensibility of passing scores or grades. Other types of assessments, such as complex performance examinations, may require both evidence related to content and considerable empirical data demonstrating the statistical relationship between the performance examination and other measures of medical ability, the generalizability of the sampled cases to the population of skills, the reproducibility of the score scales, the adequacy of the standardized patient training and so on.



평가의 목적이나 의도한 해석에 따라 달라질 수 있는 타당도 근거의 전형적인 출처에는 다음과 같은 것이 있다.

Some typical sources of validity evidence, depending on the purpose of the assessment and the desired interpretation are: 

  • evidence of the content representa- tiveness of the test materials, 
  • the reproducibility and generalizability of the scores, 
  • the statistical characteristics of the assessment questions or performance prompts, 
  • the statistical relationship between and among other measures of the same (or different but related) constructs or traits, 
  • evidence of the impact of assessment scores on students and 
  • the consistency of pass–fail decisions made from the assessment scores.


평가에 따르는 부담이 클수록(면허시험, 전문의 자격시험 등), 타당도 근거를 더 다양한 출처로부터 지속적으로 수집하고 재평가할 필요가 커진다.

The higher the stakes associated with assessments, the greater the requirement for validity evidence from multiple sources, collected on an ongoing basis and continually re-evaluated.17 The ongoing documentation of validity evidence for a very high-stakes testing programme, such as a licensure or medical specialty certification examination, may require the allocation of many resources and the contributions of many different professionals with a variety of skills – content specialists, psychometricians and statisticians, test editors and administrators.




구인타당도의 출처

Sources of evidence for construct validity


the Standards에 따르면 "타당도는 근거와 이론이 검사점수에 대해 의도된 해석과 활용을 지지하는 정도"이다. 현재의 Standards는 타당도에 대한 일원화된 관점 - 모든 타당도는 구인타당도이다 - 을 충분히 반영하고 있다. 이 때 구인타당도란, 구인을 주의 깊게 정의하고, 자료와 근거를 모으고 통합해서 평가점수에 대한 매우 구체적인 해석을 지지하거나 반박하는 주장을 만들어가는 탐구의 절차이다. 역사적으로 타당도를 검증하는 방법과 구인타당도와 관련된 근거의 유형은 Cronbach, Cronbach and Meehl, Messick의 초기 작업에 많은 토대를 두고 있다. 타당도를 구인타당도로 보는 가장 초기의 단일화된 개념은 1957년 Loevinger의 논문으로 거슬러 올라간다. Kane은 타당도를 각각의 평가마다 확립되어야 하는 해석적 주장(interpretive argument)의 맥락에 두었으며, 그의 작업은 타당도와 타당화 연구에 유용한 프레임워크를 제공했다.

According to the Standards: ‘Validity refers to the degree to which evidence and theory support the interpretations of test scores entailed by proposed uses of tests’1 (p. 9). The current Standards1 fully embrace this unitary view of validity, following closely on Messick’s work8,9 that considers all validity as construct validity, which is defined as an investigative process through which constructs are carefully defined, data and evidence are gathered and assembled to form an argument either supporting or refuting some very specific interpretation of assessment scores.11,12 Historically, the methods of validation and the types of evidence associated with construct validity have their foundations on much earlier work by Cronbach,3–5 Cronbach and Meehl6 and Messick.7 The earliest unitary conceptualization of validity as construct validity dates to 1957 in a paper by Loevinger.18 Kane11–13 places validity into the context of an interpretive argument, which must be established for each assessment; Kane’s work has provided a useful framework for validity and validation research.



The Standards


다섯 가지 근거

The Standards1 discuss five distinct sources of validity evidence (Table 1)


평가의 종류에 따라 한 종류의 타당도를 다른 종류의 타당도보다 더 강조하곤 한다.

Some types of assessment demand a stronger emphasis on one or more sources of evidence as opposed to other sources and not all sources of data or evidence are required for all assessments. 

  • For example, a written, objectively scored test covering several weeks of instruction in microbiology, might emphasize content-related evidence, together with some evidence of response quality, internal structure and consequences, but very likely would not seek much or any evidence concerning relationship to other variables.
  • On the other hand, a high-stakes summative Objective Structured Clinical Examination (OSCE), using standardized patients to portray and rate student performance on an examination that must be passed in order to proceed in the curriculum, might require all of these sources of evidence





Sources of validity evidence for example assessments


점수 자체는 아무런 의미도 없다. 따라서 이 '근거'는 특정 평가에서 얻은 점수가 의도한 방식대로 해석할 수 있다는 것에 대한 논리적 근거를 제시해야 한다.

The scores have little or no intrinsic meaning; thus the evidence presented must convince the skeptic that the assessment scores can reasonably be interpreted in the proposed manner.



내용 타당도 근거 

Content evidence


 Examination blueprint

• Representativeness of test blueprint to achievement domain

• Test specifications

• Match of item content to test specifications

• Representativeness of items to domain

• Logical/empirical relationship of content tested to achievement domain

• Quality of test questions

• Item writer qualifications

• Sensitivity review

지필고사에 있어서 '내용'과 관련된 타당도 근거자료가 가장 필수적이다. Blueprint나 Test specification에서 드러난다. 

For the written assessment, documentation of validity evidence related to the content tested is the most essential. The outline and plan for the test, described by a detailed test blueprint or test specifications, clearly relates the content tested by the 250 MCQs to the domain of the basic sciences as described by the course learning objectives. The test blueprint is sufficiently detailed to describe subcategories and subclassifications of content and specifies precisely the proportion of test questions in each category and the cognitive level of those questions. The blueprint documentation shows a direct linkage of the questions on the test to the instructional objectives. 


독립적인 내용전문가가 test blueprint가 합리적인지 판단할 수 있다. 시험문항과 주요 학습목표와 교수-학습 활동의 관계가 명확해야 한다. 만약 대부분의 학습목표가 적용이나 문제해결 수준의 것이라면, 시험문항도 그러한 인지수준에 맞춰야 한다.

Independent content experts can evaluate the reasonableness of the test blueprint with respect to the course objectives and the cognitive levels tested. The logical relationship between the content tested by the 250 MCQs and the major instructional objectives and teaching/learning activities of the course should be obvious and demonstrable, especially with respect to the proportionate weighting of test content to the actual emphasis of the basic science courses taught. Further, if most learning objectives were at the application or problem-solving level, most test questions should also be directed to these cognitive levels.


시험문항의 질 역시 내용-관련 타당도 근거의 하나이다. 

The quality of the test questions is a source of content-related validity evidence. 

    • MCQ가 효과적인 문항작성법에 근거했나?
      Do the MCQs adhere to the best evidence-based principles of effective item-writing.19 
    • 문항작성자가 내용전문가로서 자격이 있는가?
      Are the item-writers qualified as content experts in the disciplines? 
    • 문항 수가 충분한가?
      Are there sufficient numbers of questions to adequately sample the large content domain? 
    • 문항의 문장을 분명하고 오류 없이 기술했는가?
      Have the test questions been edited for clarity,removing all ambiguities and other common item flaws?
    • 문화적 민감성에 따라 검토되었는가?
      Have the test questions been reviewed for cultural sensitivity? 


SP에 있어서 마찬가지로 contents에 대한 이슈가 있다. SP case가 10개 있다면 이 10개는 - 예컨대 일차의료 외래상황에 대한 - 대표성이 있어야 함. 

For the SP performance examination, some of the same content issues must be documented and presented as validity evidence.


For example, each of the 10 SP cases fits into a detailed content blueprint of ambulatory primary care history and physical examination skills. There is evidence of faculty content–expert agreement that these specific 10 cases are representative of primary care ambulatory cases. Ideally, the content of the 10 clinical cases is related to population demographic data and population data on disease incidence in primary care ambulatory settings.


또한 임상 전문가가 SP case를 (체크리스트와 평가기준 포함) 공동으로 작성/검토/수정했는지에 대한 근거가 있어야 함. SP case에 대한 editing이 잘 되었고 SP가 자세한 가이드라인을 제공받았으며, 평가준거가 전문가에 의해서 준비되고 검토되며, SP trainer에 의해서 훈련되었는가 등도 모두 중요함.

Evidence is documented that expert clinical faculty have created, reviewed and revised the SP cases together with the checklists and ratings scales used by the SPs, while other expert clinicians have reviewed and critically critiqued the SP cases. Exacting specifications detail all the essential clinical information to be portrayed by the SP. Evidence that SP cases have been competently edited and that detailed SP training guidelines and criteria have been prepared, reviewed by faculty experts and implemented by experienced SP trainers are all important sources of content-related validity evidence.


SP로 시험을 수행하는 동안에도 SP가 수행하는 내용을 면밀히 감시해서 모든 학생이 거의 동일한 case를 경험하게 해야 함. 서로 다른 SP가 동일한 case를 수행했다면, 학생 평가도 동일하게 내려야 함. 

There is documentation that during the time of SP administration, the SP portrayals are monitored closely to ensure that all students experience nearly the same case. Data are presented to show that a different SP, trained on the same case, rates student case performance about the same. Many basic quality-control issues concerning performance examinations contribute to the content-related validity evidence for the assessment.20





평가 절차 근거 

Response process


• Student format familiarity

• Quality control of electronic scanning/scoring

• Key validation of preliminary scores

• Accuracy in combining different formats scores

• Quality control/accuracy of final scores/marks/grades

• Subscore/subscale analyses:

• Accuracy of applying pass-fail decision rules to scores

• Quality control of score reporting to students/faculty

• Understandable/accurate descriptions/interpretations of scores for students


Validity 근거로서 response process는 이상해 보일 수 있다. 여기서 response process란 시험 시행과 관련한 모든 오류의 원인이 가능한 최대한 통제되거나 제거되었는가, 즉 자료의 무결성(data integrity)에 대한 것이다.

As a source of validity evidence, response process may seem a bit strange or inappropriate. Response process is defined here as evidence of data integrity such that all sources of error associated with the test administration are controlled or eliminated to the maximum extent possible. Response process has to do with aspects of assessment such as ensuring 

  • 응답의 정확도
    the accuracy of all responses to assessment prompts, 
  • 평가에 있어 data flow의 질 관리
    the quality control of all data flowing from assessments, 
  • 다양한 평가점수를 하나의 점수로 산출하는 방식의 적합성
    the appropriateness of the methods used to combine various types of assessment scores into one composite score and 
  • 피평가자에게 제공되는 점수의 유용성과 정확도
    the usefulness and the accuracy of the score reports provided to examinees.


지필고사에 있어서 모든 시험시행절차와 관련된 문서와 시험에 대한 정보, 학생에게 제공되는 지침을 기록하는 것이 중요. 시험점수의 절대적 정확성을 확보하기 위한 모든 quality-control procedure와 관련된 것의 문서화가 중요한 근거. 이는 일차 채점 이후 final key validation이다. scoring key의 정확성을 확실하게 하고, final scoring에서 안 좋은 문항을 배제시키는 것이다. 

For evidence of response process for the written comprehensive examination, documentation of all practice materials and written information about the test and instructions to students is important. Documentation of all quality-control procedures used to ensure the absolute accuracy of test scores is also an important source of evidence: the final key validation after a preliminary scoring – to ensure the accuracy of the scoring key and eliminate from final scoring any poorly performing test items; a rationale for any combining rules, such as the combining into one final composite score of MCQ, multiple true–false and short-essay question scores.


SP시험에 있어서, SP rating의 정확성을 보여주는 자료가 있어야 한다. 점수 계산법, reporting methods와 그 논리 - 특히 수행능력 평가 점수의 적절한 해석에 대한 설명자료 등.

For the SP performance examination, many of the same response process sources may be presented as validity evidence. For a performance examination, documentation demonstrating the accuracy of the SP rating is needed and the results of an SP accuracy study is a particularly important source of response process evidence. Basic quality control of the large amounts of data from an SP performance examination is important to document, together with information on score calculation and reporting methods, their rationale and, particularly, the explanatory materials discussing an appropriate interpretation of the performance-assessment scores (and their limitations).


global rating과 checklist rating 중 어떤 것을 선택했는지에 대한 논리에 대한 근거.

Documentation of the rationale for using global versus checklist rating scores, for example, may be an important source of response evidence for the SP examination. Or, the empirical evidence and logical rationale for combining a global rating-scale score with checklist item scores to form a composite score may be one very important source of response evidence.




내적 구조 근거 

Internal structure


• Item analysis data:

1. Item difficulty/discrimination

2. Item/test characteristic curves (ICCs/TCCs)

3. Inter-item correlations

4. Item-total correlations

• Score scale reliability

• Standard errors of measurement (SEM)

• Generalizability

• Dimensionality

• Item factor analysis

• Differential Item Functioning (DIF)

• Psychometric model


통계적, psychometric 특징과 관련되어 있음.

Internal structure, as a source of validity evidence, relates to the statistical or psychometric characteristics of the examination questions or performance prompts, the scale properties – such as reproducibility and generalizability, and the psychometric model used to score and scale the assessment.


문항 분석

Many of the statistical analyses needed to support or refute evidence of the test’s internal structure are often carried out as routine quality-control procedures. Analyses such as item analyses – which computes 

  • 난이도 the difficulty (or easiness) of each test question (or performance prompt), 
  • 변별도 the discrimination of each question (a statistical index indicating how well the question separates the high scoring from the low scoring examinees) and 
  • 각 답가지별로 선택한 학생 비율 a detailed count of the number or proportion of examinees who responded to each option of the test question, 

are completed.
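아래는 0/1 응답행렬에서 문항 난이도(정답률)와 수정된 문항-총점 상관(point-biserial 변별도)을 계산하는 스케치이다. 자료는 가상이며, 답가지별 응답 비율 집계 등은 생략했다.

```python
import numpy as np

def item_analysis(responses):
    """0/1 행렬(행 = 수험생, 열 = 문항)에서 기본적인 문항 통계치를 계산한다."""
    responses = np.asarray(responses, dtype=float)
    total = responses.sum(axis=1)

    difficulty = responses.mean(axis=0)          # 각 문항의 정답률(난이도 지수)

    discrimination = np.empty(responses.shape[1])
    for j in range(responses.shape[1]):
        # 수정된 문항-총점 상관: 해당 문항을 제외한 총점과 상관을 구해
        # 문항이 자기 자신과 상관되는 것을 피한다.
        rest = total - responses[:, j]
        discrimination[j] = np.corrcoef(responses[:, j], rest)[0, 1]

    return difficulty, discrimination

# 가상의 예: 수험생 8명 x 문항 5개.
resp = np.array([
    [1, 1, 0, 1, 1],
    [1, 0, 0, 1, 0],
    [1, 1, 1, 1, 1],
    [0, 0, 0, 1, 0],
    [1, 1, 0, 1, 1],
    [1, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
])
p, r = item_analysis(resp)
print("difficulty:", np.round(p, 2))
print("discrimination:", np.round(r, 2))
```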


신뢰도는 타당도 근거의 중요한 측면. 신뢰도 없이 타당도 없다.

Reliability is an important aspect of an assessment’s validity evidence. Unless assessment scores are reliable and reproducible (as in an experiment) it is nearly impossible to interpret the meaning of those scores – thus, validity evidence is lacking.


합격-불합격의 재생산가능성이 매우 중요하다. 평가의 궁극적 결과(합-불합)이 일정 수준 이상으로 재생산가능하지 않으면 검사점수의 의미있는 해석이 불가능

In both example assessments described above, in which the stakes are high and a passing score has been established, the reproducibility of the pass–fail decision is a very important source of validity evidence. That is, analogous to score reliability, if the ultimate outcome of the assessment (passing or failing) can not be reproduced at some high level of certainty, the meaningful interpretation of the test scores is questionable and validity evidence is compromised.


SP와 같이 수행능력 평가에서는 일반화가능도이론에서 유도한 특별한 타입의 신뢰도가 있음

For performance examinations, such as the SP example, a very specialized type of reliability, derived from generalizability theory (GT)21,22 is an essential component of the internal structure aspect of validity evidence. GT is concerned with how well the specific samples of behaviour (SP cases) can be generalized to the population or universe of behaviours. 


GT는 error의 source를 찾는데 유용함

GT is also a useful tool for estimating the various sources of contributed error in the SP exam, such as error due to the SP raters, error due to the cases (case specificity), and error associated with examinees. As rater error and case specificity are major threats to meaningful interpretation of SP scores, GT analyses are important sources of validity evidence for most performance assessments such as OSCEs, SP exams and clinical performance examinations.


IRT와 같은 정교한 통계적 측정모델을 활용하는 경우, 측정모델 그 자체가 구인타당도의 내적구조 측면에 대한 근거가 된다. 요인 구조, 문항 간 상관 구조, 기타 내적 구조 특성이 모두 타당도 근거에 기여한다.

For some assessment applications, in which sophisticated statistical measurement models like Item Response Theory (IRT) models23,24 are used, the measurement model itself is evidence of the internal structure aspect of construct validity. In IRT applications, which might be used for tests such as the comprehensive written examination example, the factor structure, item-intercorrelation structure and other internal structural characteristics all contribute to validity evidence.


편향과 비뚤림(공정성)의 이슈도 내적 구조와 관련되며, 중요한 타당도 근거의 출처이다. 모든 평가는 이질적인 집단을 대상으로 치러지게 되므로 통계적 편향으로 인한 타당도 위협의 가능성이 있다. differential item functioning (DIF)과 같은 bias analysis와, 문항이나 performance prompt에 대한 sensitivity review가 모두 내적구조 타당도 근거이다. 통계적 편향이 없음을 문서화하면 의도한 점수 해석이 가능해지므로 평가의 타당도 근거가 더해진다.

Issues of bias and fairness also pertain to internal test structure and are important sources of validity evidence. All assessments, presented to heterogeneous groups of examinees, have the potential of validity threats from statistical bias. Bias analyses, such as differential item functioning (DIF)25,26 analyses and the sensitivity review of item and performance prompts are sources of internal structure validity evidence. Documentation of the absence of statistical test bias permits the desired score interpretation and therefore adds to the validity evidence of the assessment.
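본문이 말하는 DIF 분석의 한 가지 흔한 방식은 Mantel-Haenszel 절차이다. 아래는 총점을 층화(matching) 변수로 삼아 한 문항의 공통 오즈비와 ETS delta 지표를 계산하는 스케치로, 함수명과 자료 구조는 가정이며 실제 DIF 분석에는 유의성 검정과 ETS 분류 규칙 등이 추가된다.

```python
import numpy as np

def mantel_haenszel_dif(item, total, group):
    """한 문항에 대한 Mantel-Haenszel 공통 오즈비와 ETS delta 지표.

    item  : 해당 문항의 0/1 응답
    total : 층화(matching)에 쓰는 총점
    group : 0 = reference group, 1 = focal group
    """
    item, total, group = (np.asarray(x) for x in (item, total, group))
    num, den = 0.0, 0.0
    for k in np.unique(total):
        s = total == k
        a = np.sum(s & (group == 0) & (item == 1))   # reference, 정답
        b = np.sum(s & (group == 0) & (item == 0))   # reference, 오답
        c = np.sum(s & (group == 1) & (item == 1))   # focal, 정답
        d = np.sum(s & (group == 1) & (item == 0))   # focal, 오답
        n = a + b + c + d
        if n == 0:
            continue
        num += a * d / n
        den += b * c / n
    if num == 0 or den == 0:
        return np.nan, np.nan                        # 계산 불가능한 자료는 NaN 반환
    odds_ratio = num / den
    mh_delta = -2.35 * np.log(odds_ratio)            # 음수이면 reference group에 유리한 문항
    return odds_ratio, mh_delta
```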


다른 변인과의 관계 근거 

Relationship to other variables


• Correlation with other relevant variables
• Convergent correlations - internal/external:
1. Similar tests
• Divergent correlations-internal/external
1. Dissimilar measures
• Test-criterion correlations
• Generalizability of evidence

전형적인 '타당도 연구'의 방법. 새로운 척도를 기존의 척도와 비교하는 것.
This familiar source of validity evidence is statistical and correlational. The correlation or relationship of assessment scores to a criterion measure’s scores is a typical design for a ‘validity study’, in which some newer (or simpler or shorter) measure is ‘validated’ against an existing, older measure with well known characteristics.


이 때 confirmatory evidence와 counter-confirmatory evidence를 모두 찾게 된다. 

This source of validity evidence embodies all the richness and complexity of the contemporary theory of validity in that the relationship to other variables aspect seeks both confirmatory and counter-confirmatory evidence. For example, it may be important to collect correlational validity evidence which shows a strong positive correlation with some other measure of the same achievement or ability and evidence indicating no correlation (or a strong negative correlation) with some other assessment that is hypothesized to be a measure of some completely different achievement or ability.


Campbell and Fiske가 제안한 multitrait multimethod 디자인과 관련되어 있다.

The concept of convergence and divergence of validity evidence is best exemplified in the classic research design first described by Campbell and Fiske.27 In this ‘multitrait multimethod’ design, different measures of the same trait (achievement, ability, performance) are correlated with different measures of the same trait. The resulting pattern of correlation coefficients may show the convergence and divergence of the different assessment methods on measures of the same and different abilities or proficiencies.
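아래는 multitrait-multimethod 방식의 상관행렬을 점검하는 가정적 스케치이다. 같은 특질을 다른 방법으로 잰 측정치끼리는 높은 상관(수렴), 다른 특질 간에는 낮은 상관(발산)을 기대한다. 측정치 이름과 자료는 모두 가상이다.

```python
import numpy as np

# 가상의 자료: 두 특질을 각각 두 가지 방법으로 측정
# (예: 의사소통은 SP 체크리스트와 교수 평정, 지식은 MCQ와 구술시험).
rng = np.random.default_rng(1)
n = 50
communication = rng.normal(size=n)
knowledge = rng.normal(size=n)

measures = {
    "comm_sp":     communication + rng.normal(scale=0.5, size=n),
    "comm_rating": communication + rng.normal(scale=0.5, size=n),
    "know_mcq":    knowledge + rng.normal(scale=0.5, size=n),
    "know_oral":   knowledge + rng.normal(scale=0.5, size=n),
}

names = list(measures)
corr = np.corrcoef(np.vstack([measures[k] for k in names]))

# 수렴 근거: comm_sp vs comm_rating, know_mcq vs know_oral의 상관이 높아야 함.
# 발산 근거: 서로 다른 특질 간(예: comm_sp vs know_mcq) 상관은 낮아야 함.
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        print(f"{names[i]:12s} vs {names[j]:12s}: r = {corr[i, j]:+.2f}")
```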


지필평가에서는 전체 점수와 subscale 점수의 상관관계를 볼 수 있다.

In the written comprehensive examination example, it may be important to document the correlation of total and subscale scores with achievement examinations administered during the basic science courses.



후속 결과 근거 

Consequences


• Impact of test scores/results on students/society

• Consequences on learners/future learning

• Positive consequences outweigh unintended negative consequences?

• Reasonableness of method of establishing pass-fail (cut) score

• Pass-fail consequences:

1. P/F Decision reliability- Classification accuracy

2. Conditional standard error of measurement at pass score (CSEM)

• False positives/negatives

• Instructional/learner consequences


비록 현재 Standards에 포함되어 있으나 가장 논쟁이 많이 되는 것이다. 시험 점수, 결정, 결과가 피시험자에게 미치는 영향, 그리고 교수-학습에 미치는 영향 등이다. 평가 결과가 피시험자, 교수, 환자, 사회에 미치는 영향은 엄청나게 클 수 있으며, 의도했든 그렇지 않았든 긍정적이거나 부정적일 수 있다.

This aspect of validity evidence may be the most controversial, although it is solidly embodied in the current Standards.1 The consequential aspect of validity refers to the impact on examinees from the assessment scores, decisions and outcomes, and the impact of assessments on teaching and learning. The consequences of assessments on examinees, faculty, patients and society can be great and these consequences can be positive or negative, intended or unintended.


북미에는 고부담 시험이 많다. 이런 경우 이 시험에 탈락할 때 따르는 결과는 심대하다. 의과대학에 입학할 것인지, 의사 자격을 부여할 것인지 등에 대한 결정에 따르는 비용이 크다.

High-stakes examinations abound in North America, especially in medicine and medical education. Extremely high-stakes assessments are often mandated as the final, summative hurdle in professional education. The consequences of failing any of these examinations is enormous, in that medical education is interrupted in a costly manner or the examinee is not permitted to enter graduate medical education or practice medicine.


마찬가지로 전문의, 세부전문의 자격 시험도 그러하다. 위양성은 환자에게, 위음성은 시험을 본 당사자에게 요구하는 비용이 크다.

Likewise, most medical specialty boards in the USA mandate passing a high-stakes certification examination in the specialty or subspecialty, after meeting all eligibility requirements of postgraduate training. The consequences of passing or failing these types of examinations are great, as false positives (passing candidates who should fail) may do harm to patients through the lack of a physician’s skill and specialized knowledge or false negatives (failing candidates who should pass) may unjustly harm individual candidates who have invested a great deal of time and resources in graduate medical education.


시험 결과에서 오는 harm이 없거나, 아니면 최소한 good > harm임을 보여야 한다.

Evidence related to consequences of testing and its outcomes is presented to suggest that no harm comes directly from the assessment or, at the very least, more good than harm arises from the assessment.



합격률, 합격률(합격선)의 적정함에 대한 판단, 다른 시험결과과의 상관관계

In both example assessments, sources of consequential validity may relate to issues such as 

  • passing rates (the proportion who pass), 
  • the subjectively judged appropriateness of these passing rates, 
  • data comparing the passing rates of each of these examinations to other comprehensive examinations such as the USMLE Step 1 and so on.


합격 점수, 합격 점수 결정 절차, 합격점수의 통계적 특성 등이 모두 validity의 일부이다. 어떻게 합-불합 기준점수를 설정했는지, 그 방법에 대한 근거도 중요함.

The passing score (or grade levels) and the process used to determine the cut scores, the statistical properties of the passing scores, and so on all relate to the consequential aspects of validity.28 Documentation of the method used to establish a pass–fail score is key consequential evidence, as is the rationale for the selection of a particular passing score method.


다른 psychometric quality indicator 들

Other psychometric quality indicators concerning the passing score and its consequences (for both example assessments) include a 

  • 결정의 신뢰도 formal, statistical estimation of the pass–fail decision reliability or 
  • 분류 정확도 classification accu- racy29 and 
  • SEM추정 some estimation of the standard error of measurement at the cut score.30


Equally important consequences of assessment methods on instruction and learning have been discussed by Newble and Jaeger.31 The methods and strategies selected to evaluate students can have a profound impact on exactly what is taught, how and what students learn, how this learning is used and retained (or not) and how students view and value the educational process.









 2003 Sep;37(9):830-7.

Validity: on meaningful interpretation of assessment data.

Author information

  • 1Department of Medical Education, College of Medicine, University of Illinois at Chicago, 60612-7309, USA. sdowning@uic.edu

Abstract

CONTEXT:

All assessments in medical education require evidence of validity to be interpreted meaningfully. In contemporary usage, all validity is construct validity, which requires multiple sources of evidence; construct validity is the whole of validity, but has multiple facets. Five sources--content, response process, internal structure, relationship to other variables and consequences--are noted by the Standards for Educational and Psychological Testing as fruitful areas to seek validity evidence.

PURPOSE:

The purpose of this article is to discuss construct validity in the context of medical education and to summarize, through example, some typical sources of validity evidence for a written and a performance examination.

SUMMARY:

Assessments are not valid or invalid; rather, the scores or outcomes of assessments have more or less evidence to support (or refute) a specific interpretation (such as passing or failing a course). Validity is approached as hypothesis and uses theory, logic and the scientific method to collect and assemble data to support or fail to support the proposed score interpretations, at a given point in time. Data and logic are assembled into arguments--pro and con--for some specific interpretation of assessment data. Examples of types of validity evidence, data and information from each source are discussed in the context of a high-stakes written and performance examination in medical education.

CONCLUSION:

All assessments require evidence of the reasonableness of the proposed interpretation, as test data in education have little or no intrinsic meaning. The constructs purported to be measured by our assessments are important to students, faculty, administrators, patients and society and require solid scientific evidence of their meaning.

PMID: 14506816 [PubMed - indexed for MEDLINE]


Reliability: the reproducibility of assessment data (Med Educ, 2004)

Reliability: on the reproducibility of assessment data
Steven M Downing

 

 

 

 

 

What is reliability? The simplest definition is that reliability is the degree to which assessment data or scores are reproducible across times or occasions. Because this definition concerns reproducing data, reliability, like validity, is a property of the results of an assessment, not of the measuring instrument itself. Feldt and Brennan put it this way: "Quantification of the consistency and inconsistency in examinee performance constitutes the essence of reliability analysis."

What is reliability? In its most straightforward defini-tion, reliability refers to the reproducibility of assess-ment data or scores, over time or occasions. Notice that this definition refers to reproducing scores or data, so that, just like validity, reliability is a charac- teristic of the result or outcome of the assessment, not the measuring instrument itself. Feldt and Brennan5 suggest that: Quantification of the consis- tency and inconsistency in examinee performance constitutes the essence of reliability analysis. (p 105)

 

 

Consistency of assessment data

THE CONSISTENCY OF ASSESSMENT DATA

 

Thus, reliability is a necessary but not a sufficient condition for validity, and for every assessment it is a major source of validity evidence. Without sufficient reliability the data are uninterpretable, because data from a low-reliability assessment contain a large component of random error.

Thus, reliability is a necessary but not sufficient condition for validity6 and reliability is a major source of validity evidence for all assessments.7,8 In the absence of sufficient reliability, assessment data are uninterpretable, as the data resulting from low reliability assessments have a large component of random error.


Theoretically, classical measurement theory (CMT) defines reliability as the proportion of total score variance that is true score variance. (Because reliability coefficients are interpreted like correlation coefficients, it is also accurate to think of reliability as the squared correlation between true scores and observed scores.) Starting from the basic definition, observed score = true score plus random error of measurement, and adding a few statistical assumptions, one can derive the formulae commonly used to estimate reliability or reproducibility. In an ideal world there would be no error term and every observed score would always equal the true score exactly.

Theoretically, reliability is defined in classical measurement theory (CMT) as the ratio of true score variance to total score variance. (As reliability coefficients are interpreted like correlation coefficients, it is also accurate to think of reliability as the squared correlation of the true scores with the observed scores.5) Starting from the basic definitional formula, X = T + e (the observed score is equal to the true score plus random errors of measurement), and making some statistical assumptions along the way, one can derive all the formulae commonly used to estimate reliability or reproducibility of assessments. In the ideal world there would be no error term in the formula and all observed scores would always be exactly equal to the true score (defined as the long-run mean score, much like μ, the population mean score).
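To make the CMT definition concrete, here is a minimal Python sketch (my own illustration, not from the article): observed scores are simulated as true score plus random error, and the ratio of true-score variance to observed-score variance coincides with the squared true–observed correlation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_examinees = 100_000

true = rng.normal(500, 80, n_examinees)      # T: each examinee's "long-run mean" score
error = rng.normal(0, 60, n_examinees)       # e: random measurement error
observed = true + error                      # X = T + e

reliability = true.var() / observed.var()             # true variance / total variance
r_squared = np.corrcoef(true, observed)[0, 1] ** 2     # squared true-observed correlation

print(round(reliability, 3), round(r_squared, 3))      # both ≈ 0.64 = 80² / (80² + 60²)
```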

 

 

Reliability of achievement examinations (reliability of written tests)

RELIABILITY OF ACHIEVEMENT EXAMINATIONS

 

For the reproducibility of written test scores, the concept most often used is internal consistency, typically estimated with the Cronbach alpha coefficient or the Kuder-Richardson formula 20 (KR 20). The logic of internal-consistency reliability is intuitive and straightforward; the statistical derivation of these formulae starts from the test-retest concept.

The approach typically utilised to estimate the reproducibility of test scores in written examinations employs the concept of internal consistency, usually estimated by the Cronbach alpha9 coefficient or Kuder-Richardson formula 20 (KR 20).10 The logic of internal test consistency reliability is straightforward and intuitive. The statistical derivation of these formulae starts with the test-retest concept,

 

Although the test-retest concept underlies most reliability estimates, genuine test-retest designs are rare, and even then they are difficult to carry out in practice.

While this test-retest concept is the foundation of most of the reliability estimates used in medical education, the test-retest design is rarely, if ever, used in actual practice, as it is logistically so difficult to carry out.

 

Fortunately, measurement statisticians devised ways to estimate test-retest reliability from a single administration; the logic is to split the test in half. The test is divided into two random halves, and their correlation approximates the test-retest reproducibility. This, however, is the reliability of only half the test, so a further calculation with the Spearman-Brown prophecy formula is needed to estimate the reliability of the full-length examination.

Happily, measurement statisticians sorted out ways to estimate the test-retest condition many years ago, from a single testing.11 The logic is: the test-retest design divides a test into 2 random halves; the correlation of the scores from the 2 random half-tests approximates the test-retest reproducibility of the examination scores. (Note that this is the reliability of only half of the test and a further calculation must be applied, using the Spearman-Brown prophecy formula, in order to determine the reliability of the complete examination.12)
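A minimal sketch of the split-half logic just described, using simulated item responses (all numbers and names are illustrative): the two random half-test scores are correlated, and the half-test correlation is then stepped up with the Spearman-Brown prophecy formula.

```python
import numpy as np

rng = np.random.default_rng(1)
n_students, n_items = 200, 40

ability = rng.normal(0, 1, n_students)
# dichotomous item responses whose probability of success depends on ability
p_correct = 1 / (1 + np.exp(-(ability[:, None] - rng.normal(0, 1, n_items))))
responses = (rng.random((n_students, n_items)) < p_correct).astype(int)

# split the items into two random halves and score each half
items = rng.permutation(n_items)
half1 = responses[:, items[: n_items // 2]].sum(axis=1)
half2 = responses[:, items[n_items // 2:]].sum(axis=1)

r_half = np.corrcoef(half1, half2)[0, 1]     # reliability of a half-length test
r_full = 2 * r_half / (1 + r_half)           # Spearman-Brown step-up to full length
print(round(r_half, 3), round(r_full, 3))
```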

 

A further statistical derivation splits the test into two halves in every possible way; this gives Cronbach's alpha coefficient, which can be used with polytomous data and is the more general form of KR 20, which applies only to dichotomous data.

A further statistical derivation (making a few assumptions about the test and the statistical characteristics of the items) allows one to estimate internal consistency reliability from all possible ways to split the test into 2 halves: this is Cronbach’s alpha coefficient, which can be used with polytomous data (0, 1, 2, 3, 4, …n) and is the more general form of the KR 20 coefficient, which can be used only with dichotomously scored items (0, 1), such as typically found on selected-response tests.
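The alpha calculation itself is short; the sketch below computes coefficient alpha from an examinee-by-item score matrix (invented data). With 0/1 items the same formula is numerically equivalent to KR 20.

```python
import numpy as np

def cronbach_alpha(item_scores: np.ndarray) -> float:
    """item_scores: examinees x items; works for dichotomous or polytomous items."""
    k = item_scores.shape[1]
    item_vars = item_scores.var(axis=0, ddof=1).sum()   # sum of item variances
    total_var = item_scores.sum(axis=1).var(ddof=1)     # variance of total scores
    return k / (k - 1) * (1 - item_vars / total_var)

# tiny illustrative matrix: 6 examinees x 4 dichotomous items
scores = np.array([[1, 1, 1, 0],
                   [1, 0, 1, 0],
                   [1, 1, 1, 1],
                   [0, 0, 0, 0],
                   [1, 1, 0, 1],
                   [0, 0, 1, 0]])
print(round(cronbach_alpha(scores), 3))
```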

 

A high internal-consistency reliability estimate for a written test suggests that much the same results would be obtained if the test were administered again later.

A high internal consistency reliability estimate for a written test indicates that the test scores would be about the same, if the test were to be repeated at a later time.

 

 

Reliability of rater data (reliability of raters)

RELIABILITY OF RATER DATA

 

 

For assessments that rely on human raters, or for which human judgement is the primary source of data, the reliability or consistency of interest centres on the rater. The greatest threat to the reproducibility of clinical or oral ratings is inconsistency within individual raters, or low reproducibility across raters. (In most designs, however, raters are nested in or confounded with items and cases, so it is impossible to estimate error attributable purely to raters outside the context of those items or cases.)

For all assessments that depend on human raters or judges for their primary source of data, the reliability or consistency of greatest interest is that of the rater or judge. The largest threat to the reproducibility of such clinical or oral ratings is rater inconsistency or low interrater reproducibility. (Technically, in most designs, raters or judges are nested or confounded with the items they rate or the cases, or both, so that it is often impossible to directly estimate the error associated with raters except in the context of items and cases.)

 

In such rating-scale assessments, interrater reliability matters more for the reliability estimate than the internal consistency (alpha) of the scale.

The internal consistency (alpha) reliability of the rating scale (all items rated for each student) may be of some marginal interest to establish some commu- nality for the construct assessed by the rating scale, but interrater reliability is surely the most important type of reliability to estimate for rater-type assess- ments.

 

There are several ways to estimate interrater reliability (a combined sketch follows this list).

There are many ways to estimate interrater reliability, depending on the statistical elegance desired by the investigator. 

  • The simplest method is percent agreement, which is fine for everyday or in-house use but not appropriate for publication.
    The simplest type of interrater reliability is percent agreement, such that for each item rated, the agreement of the 2 (or more) independent raters is calculated. Percent-agreement statistics may be acceptable for in-house or everyday use, but would likely not be acceptable to manuscript reviewers and editors of high quality publications, as these statistics do not account for the chance occurrence of agreement. 
  • Kappa accounts for the possibility of chance agreement and is often used with two independent raters. The phi coefficient is a similar correlation coefficient, but it does not correct for chance agreement and therefore tends to overestimate.
    The kappa13 statistic (a type of correlation coefficient) does account for the random-chance occurrence of rater agreement and is therefore sometimes used as an interrater reliability estimate, particularly for individual questions, rated by 2 independent raters. (The phi14 coefficient is the same general type of correlation coefficient, but does not correct for chance occurrence of agreement and therefore tends to overestimate true rater agreement.)
  • The most elegant method is analysis based on generalisability theory; a properly designed GT study yields variance components for every variable of interest.
    The most elegant estimates of interrater agreement use generalisability theory (GT) analysis.2–4 From a properly designed GT study, one can estimate variance components for all the variables of interest in the design: the persons, the raters and the items.
  • Less elegant than GT but the most practical method is the ICC. Like GT, it uses ANOVA to estimate the variance associated with particular factors. Its strengths for interrater reliability are that it can be computed with commonly available statistical software, it yields both the actual interrater reliability of the n raters used and the reliability of a single rater (often of greater interest), and it can handle missing ratings.
    A slightly less elegant, but perhaps more accessible method of estimating interrater reliability is by use of the intraclass correlation coefficient.15 Intraclass correlation uses analysis of variance (ANOVA), as does generalisability theory analysis, to estimate the variance associated with factors in the reliability design. The strength of intraclass correlation used for interrater reliability is that it is easily computed in commonly available statistical software and it permits the estimation of both the actual interrater reliability of the n-raters used in the study as well as the reliability of a single rater, which is often of greater interest. Additionally, missing ratings, which are common in these datasets, can be managed by the intraclass correlation.
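The sketch below pulls the listed indices together on invented ratings: percent agreement, Cohen's kappa via scikit-learn, and an intraclass correlation computed from two-way ANOVA mean squares. The ICC variant shown, ICC(2,1) (two-way random effects, absolute agreement, single rater), is one common choice and is offered as an illustration rather than as the specific formulation Downing had in mind.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# illustrative ratings: 6 trainees (rows) rated by 2 raters (columns) on a 1-5 scale
ratings = np.array([[4, 4],
                    [3, 2],
                    [5, 5],
                    [2, 2],
                    [4, 3],
                    [3, 3]])

r1, r2 = ratings[:, 0], ratings[:, 1]
percent_agreement = np.mean(r1 == r2)      # simplest index; ignores chance agreement
kappa = cohen_kappa_score(r1, r2)          # chance-corrected agreement

def icc_2_1(y: np.ndarray) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single rater."""
    n, k = y.shape
    gm = y.mean()
    msr = k * ((y.mean(axis=1) - gm) ** 2).sum() / (n - 1)    # between-subject mean square
    msc = n * ((y.mean(axis=0) - gm) ** 2).sum() / (k - 1)    # between-rater mean square
    resid = y - y.mean(axis=1, keepdims=True) - y.mean(axis=0) + gm
    mse = (resid ** 2).sum() / ((n - 1) * (k - 1))            # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

print(percent_agreement, round(kappa, 2), round(icc_2_1(ratings.astype(float)), 2))
```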

 

 

Reliability of performance examinations (OSCEs and SPs)

RELIABILITY OF PERFORMANCE EXAMINATIONS: OSCES AND SPS

 

Compared with ward-type evaluations in real settings, SP examinations and OSCEs assess these skills in a more standardised, controlled fashion.

While ward-type evaluations attempt to assess some of these skills in the real setting (which tends to lower reliability due to the interference of many uncontrolled variables and a lack of standardisation), simulated patient (SP) examinations and objective structured clinical exam- inations (OSCEs) can be used to assess such skills in a more standardised, controlled fashion.


Performance examinations require special care in reliability analysis. Because items are nested within cases, the unit of reliability analysis must be the case, not the item. A common assumption of all reliability analyses is that items are 'locally independent', meaning each item score should be reasonably independent of the others. Items nested in sets (e.g. OSCE stations, SP cases, key-feature sets, MCQ testlets) violate this assumption, so the case set must be the unit of analysis. Practically, for a 20-station OSCE with 5 items per station, the reliability analysis should use the 20 station scores, not the 100 item scores; the estimate from 20 scores will almost certainly be lower than that from 100.

Performance examinations pose a special challenge for reliability analysis. Because the items rated in a performance examination are typically nested in a case, such as an OSCE, the unit of reliability analysis must necessarily be the case, not the item. One statistical assumption of all reliability analyses is that the items are locally independent, which means that all items must be reasonably independent of one another. Items nested in sets, such as an OSCE, an SP examination, a key features itemset16 or a testlet17 of multiple choice questions (MCQs), generally violate this assumption of local independence. Thus, the case set must be used as the unit of reliability analysis. Practically, this means that if one administers a 20- station OSCE, with each station having 5 items, the reliability analysis must use the 20 OSCE scores, not the 100 individual item scores. The reliability esti- mate for 20 observations will almost certainly be lower than that for 100 observations.
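A sketch of the unit-of-analysis point with simulated data (the nesting strength and all numbers are assumptions): items are summed to station scores, and alpha computed over the 20 station scores is compared with the inflated value obtained by (incorrectly) treating the 100 nested items as independent.

```python
import numpy as np

rng = np.random.default_rng(2)
n_candidates, n_stations, items_per_station = 150, 20, 5

ability = rng.normal(0, 1, n_candidates)
# 0/1 item scores; items within a station share that candidate's station-level performance
station_effect = ability[:, None] + rng.normal(0, 0.7, (n_candidates, n_stations))
p = 1 / (1 + np.exp(-station_effect))
items = (rng.random((n_candidates, n_stations, items_per_station)) < p[..., None]).astype(int)

def alpha(m):                                   # coefficient alpha over the columns of m
    k = m.shape[1]
    return k / (k - 1) * (1 - m.var(axis=0, ddof=1).sum() / m.sum(axis=1).var(ddof=1))

station_scores = items.sum(axis=2)              # 20 station scores per candidate (correct unit)
item_scores = items.reshape(n_candidates, -1)   # 100 item scores (violates local independence)
print(round(alpha(station_scores), 2), round(alpha(item_scores), 2))
```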

 

 

How are reliability coefficients used in assessment?

HOW ARE RELIABILITY COEFFICIENTS USED IN ASSESSMENT?

 

One practical use is in calculating the standard error of measurement (SEM).

One practical use of the reliability coefficient is in the calculation of the standard error of measurement (SEM). The SEM for the entire distribution of scores on an assessment is given by the formula:12

SEM = SD × √(1 − reliability)

 

 

This SEM can be used to construct confidence intervals around observed scores.

This SEM can be used to form confidence bands around the observed assessment score, indicating the precision of measurement, given the reliability of the assessment, for each score level.
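A minimal sketch of this calculation, using the worked example that appears later in this article (SD = 100, reliability = 0.50, observed score 575), reproduces the 575 ± 139 confidence band.

```python
import math

sd, reliability, observed = 100, 0.50, 575

sem = sd * math.sqrt(1 - reliability)      # standard error of measurement ≈ 70.7
half_width = 1.96 * sem                    # 95% confidence band half-width ≈ 139
print(round(sem), (round(observed - half_width), round(observed + half_width)))  # (436, 714)
```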

 

 

 

How much reliability is enough?

HOW MUCH RELIABILITY IS ENOUGH?

 

For very high-stakes assessments (e.g. licensure or certification examinations, with major consequences for examinees and society) reliability should be at least 0.90; for moderate-stakes assessments (end-of-course or end-of-year examinations) 0.80–0.89; and for lower-stakes classroom assessments 0.70–0.79 or so.

If the stakes are extremely high, the reliability must be high in order to defensibly support the validity evidence for the measure. Various authors, textbook writers and researchers offer a variety of opinions on this issue, but most educational measurement professionals suggest a reliability of at least 0.90 for very high stakes assessments, such as licensure or certification examinations in medicine, which have major consequences for examinees and society. For more moderate stakes assessments, such as major end-of-course or end-of-year summative examinations in medical school, one would expect reliability to be in the range of 0.80–0.89, at minimum. For assessments with lower consequences, such as formative or summative classroom-type assessments, created and administered by local faculty, one might expect reliability to be in the range of 0.70–0.79 or so.

 

 

The consequences to examinees of false positive or false negative decisions matter far more than the absolute value of the reliability coefficient.

The consequences on examinees of false positive or false negative outcomes of the assessment are far more important than the absolute value of the reliability coefficient.

 

One way to estimate the reproducibility of pass/fail decisions is to calculate a pass/fail reproducibility index, indicating how much confidence can be placed in the outcomes. It ranges from 0 to 1 and is interpreted as the probability that the same pass or fail decision would be made on retesting. Generalisability theory can also be used to estimate the precision of measurement at the cut score.

One method of estimating this pass/fail decision reproducibility was presented by Subkoviak20 and permits a calculation of a pass/fail reproducibility index, indicating the degree of confidence one can place on the pass/fail outcomes of the assessment. Pass/fail decision reliability, ranging from 0.0 to 1.0, is interpreted as the probability of an identical pass or fail decision being made upon retesting. Generalisability theory also permits a calculation of the precision of measurement at the cut score (a standard error of measurement at the passing score), which can be helpful in evaluating this all-important accuracy of classification.
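Subkoviak's index is estimated analytically from a single administration; the sketch below instead illustrates the underlying idea by simulation (the reliability, cut score and score distribution are my own assumptions): two parallel administrations with the same reliability are generated and the proportion of identical pass/fail decisions is counted.

```python
import numpy as np

rng = np.random.default_rng(3)
n, reliability, cut = 100_000, 0.80, 480

true = rng.normal(500, 100 * np.sqrt(reliability), n)   # true-score SD implied by the reliability
sem = 100 * np.sqrt(1 - reliability)                    # SD of the error component
form_a = true + rng.normal(0, sem, n)                   # first administration
form_b = true + rng.normal(0, sem, n)                   # hypothetical retest

decision_consistency = np.mean((form_a >= cut) == (form_b >= cut))
print(round(decision_consistency, 3))    # proportion of identical pass/fail decisions on retest
```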

 

What are the practical consequences of low reliability for interpreting assessment data? Wainer and Thissen present the results shown in Table 1.

What are some of the practical consequences of low reliability of the interpretation of assessment data? Wainer and Thissen21 discuss the expected change in test scores, upon retesting, for various levels of score reliability (Table 1).

 

When reliability is low, the expected change in scores on retesting can be considerable. For example, if reliability is 0.5...

Expected changes in test scores upon retesting can be quite large, especially for lower levels of reliability. Consider this example: a test score distribution has a mean of 500 and a standard deviation of 100. If the score reliability is 0.50, the standard error of meas- urement equals 71.

 

For a student scoring 575, the 95% confidence interval is 575 ± 139, so retest scores could plausibly range from 436 to 714. This is a very wide interval, and this level of reliability is not uncommon (especially in rater-based or performance examinations). Even at a reliability of 0.75, scores could vary by up to 98 points.

Thus, a 95% confidence interval for a student scoring 575 on this test is 575 ± 139. Upon retesting this student, we could reasonably expect 95/100 retest scores to fall somewhere in the range of 436–714. This is a very wide score interval, at a reliability level that is not uncommon, especially for rater-based oral or performance examinations in medical education. Even at a more respectable reliability level of 0.75, using the same data example above, we would reasonably expect this student’s scores to vary by up to 98 score points upon repeated retesting. The effect of reliability on reasonable and meaningful interpretation of assessment scores is indeed real.

 



 

Improving reliability

IMPROVING RELIABILITY OF ASSESSMENTS

 

There are several ways to improve reliability. Most important is to use a sufficiently large number of test questions, raters or performance cases; a common cause of low reliability is having too few. Items and prompts should be written clearly and unambiguously and be thoroughly reviewed by content experts. Use questions and cases of medium difficulty: if items are too easy or too hard, nearly everyone answers them correctly or incorrectly and they yield little information about student achievement or reliability.

There are several ways to improve the reliability of assessments. Most important is the use of sufficiently large numbers of test questions, raters or perform- ance cases. One frequent cause of low reliability is the use of far too few test items, performance cases or raters to adequately sample the domain of interest. Make sure the questions or performance prompts are clearly and unambiguously written and that they have been thoroughly reviewed by content experts. Use test questions or performance cases that are of medium difficulty for the students being assessed. If test questions or performance prompts are very easy or very hard, such that nearly all students get most questions correct or incorrect, very little information is gained about student achievement and the reliability of these assessments will be low. (In mastery-type testing, this will present different issues.)

 

If possible, obtain pretest or tryout data before items are used operationally.

If possible, obtain pretest or tryout data from assessments before they are used as live or scored questions. However, it is possible to bank effective test questions or performance cases in secure item pools for reuse later.

 

 

 

 

 

  

 

 

 


 

 2004 Sep;38(9):1006-12.

Reliability: on the reproducibility of assessment data.

Author information

  • 1Department of Medical Education, College of Medicine, University of Illinois at Chicago, 808 South Wood Street, Chicago, IL 60612-7309, USA. sdowning@uic.edu

Abstract

CONTEXT:

All assessment data, like other scientific experimental data, must be reproducible in order to be meaningfully interpreted.

PURPOSE:

The purpose of this paper is to discuss applications of reliability to the most common assessment methods in medical education. Typical methods of estimating reliability are discussed intuitively and non-mathematically.

SUMMARY:

Reliability refers to the consistency of assessment outcomes. The exact type of consistency of greatest interest depends on the type of assessment, its purpose and the consequential use of the data. Written tests of cognitive achievement look to internal test consistency, using estimation methods derived from the test-retest design. Rater-based assessment data, such as ratings of clinical performance on the wards, require interrater consistency or agreement. Objective structured clinical examinations, simulated patient examinations and other performance-type assessments generally require generalisability theory analysis to account for various sources of measurement error in complex designs and to estimate the consistency of the generalisations to a universe or domain of skills.

CONCLUSIONS:

Reliability is a major source of validity evidence for assessments. Low reliability indicates that large variations in scores can be expected upon retesting. Inconsistent assessment scores are difficult or impossible to interpret meaningfully and thus reduce validity evidence. Reliability coefficients allow the quantification and estimation of the random errors of measurement in assessments, such that overall assessment can be improved.

PMID: 15327684 [PubMed - indexed for MEDLINE]


Are examiners' judgements in OSCE-style assessments influenced by contrast effects? (Acad Med, 2015)

Are Examiners’ Judgments in OSCE-Style Assessments Influenced by Contrast Effects?

Peter Yeates, MClinEd, PhD, Marc Moreau, MD, and Kevin Eva, PhD





Unfortunately, judgement-based assessments suffer from psychometric weaknesses, and neither reformulation nor training has satisfactorily resolved them.

judgment-based assessments are susceptible to a raft of psychometric weaknesses1–3 that have not been satisfactorily resolved through either reformulation4,5 or training.6–8


Rater errors: research on assessor cognition shows that assessors hold relatively unique, idiosyncratic performance theories (personal beliefs about what constitutes good performance) and may use their own clinical ability, rather than recognised assessment standards, as their frame of reference. Assessors recognise that their judgements are influenced by their emotions (e.g. not wanting to seem mean), by prior experience with particular trainees, and by institutional culture. Their judgements are also shaped by inferences that go beyond what they observe, including presumptions about a trainee's culture, education or motivation. In addition, despite instructions to judge against a behavioural standard, assessors tend to compare a trainee's performance with that of other trainees.

“assessor cognition” assessors appear to possess relatively unique, idiosyncratic performance theories (personal beliefs about what constitutes good performance)9,10 and may use their own clinical abilities as their frame of reference (rather than recognized assessment standards) when judging the performance of trainees.11,12 Assessors perceive that their judgments are influenced by their own emotions (e.g., not wanting to feel mean), their prior experiences of particular trainees’ performance, and their institutional culture.12 Further, assessors’ judgments appear to be frequently guided by inferences that go beyond their observations, including presumptions about a trainee’s culture, education, or motivation.12,13 Additional to these findings, an exploratory investigation10 suggested that assessors, despite instructions to judge against a behavioral standard, showed a tendency to make judgments by comparing trainees’ performance against other trainees’.


In short, assessors' understanding of competence may be inherently comparative (norm referenced).

in essence, it indicates that their understanding of competence may be inherently comparative (i.e., norm referenced).17


Because students enter an OSCE circuit in no particular order, the relationship between a student and the preceding student is random; in a large dataset there should therefore be no correlation between a score and the preceding scores. A positive correlation would indicate an assimilation effect, a negative correlation a contrast effect.

When students are not entered into an OSCE circuit in any particular order, the relationship between a performance and its predecessor should be random. Consequently, when a large dataset is examined, no relationships should exist between the scores of performances and their predecessors. A positive relationship between successive performances would indicate an assimilation effect, whereas a negative relationship would indicate a contrast effect. The first dataset was drawn from the 2011 United Kingdom Foundation Programme Office (UKFPO) Clinical Assessment. The second dataset was drawn from the 2008 Multiple Mini Interview (MMI) that was used for selection into the University of Alberta Medical School.


A 'station-circuit' comprises the scores of the candidates who pass through a particular station in a particular circuit. To prevent station difficulty or rater stringency from influencing the analyses, all scores were converted to z scores. Performance-based assessment data were available for 5,288 candidate observations; means and medians were similar, with skewness and kurtosis between −1 and 1.

A “station-circuit,” therefore, comprised the scores for the candidates at an individual station for an individual circuit of the exam. To prevent station difficulty and/or rater stringency from influencing the analyses, we transformed every candidate score into a z score centered around the mean of its station-circuit. Performance-based assessment data were available for 5,288 candidate observations (see Table 1). All scores’ mean and median values were similar, with skewness and kurtosis values between −1 and 1, indicating that data were adequately normal for parametric analysis.
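A sketch of the standardisation step described above, assuming a long-format table with hypothetical column names (station, circuit, score): each candidate score is centred on the mean, and scaled by the SD, of its own station-circuit.

```python
import pandas as pd

# hypothetical long-format data: one row per candidate observation
df = pd.DataFrame({
    "station": [1, 1, 1, 1, 2, 2, 2, 2],
    "circuit": ["A", "A", "B", "B", "A", "A", "B", "B"],
    "score":   [7.0, 9.0, 6.0, 8.0, 5.0, 6.5, 7.5, 8.5],
})

grouped = df.groupby(["station", "circuit"])["score"]
df["z"] = (df["score"] - grouped.transform("mean")) / grouped.transform("std")
print(df)
```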





To confirm that any observed relationship was not an artifact of the analysis, a Monte Carlo simulation was run in Excel: 20 random numbers were generated and the same analysis applied.

To ensure that any relationship observed was not an analytic artifact generated from the way in which scores were compared with preceding scores, we used Microsoft Excel (Microsoft Corporation, Redmond, Washington) to run a Monte Carlo simulation. Twenty random numbers were produced and then used to calculate the same “preceding candidate” metrics as described above (n-1, n-2, etc.).
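A minimal sketch of this kind of Monte Carlo check (my own simplified version in Python rather than Excel): with purely random scores, the regression of each score on the average of its three predecessors should have a slope near zero, confirming that the preceding-candidate metrics do not create an artifactual relationship by themselves.

```python
import numpy as np

rng = np.random.default_rng(4)
n_runs, n_scores = 1000, 20
slopes = []

for _ in range(n_runs):
    scores = rng.normal(0, 1, n_scores)                 # 20 random "candidate" z scores
    # average of the three preceding scores (an AvN1-3-style metric)
    prev3 = np.array([scores[i - 3:i].mean() for i in range(3, n_scores)])
    current = scores[3:]
    slopes.append(np.polyfit(prev3, current, 1)[0])     # simple linear regression slope

print(round(np.mean(slopes), 3))   # close to 0: no artifactual contrast/assimilation effect
```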


The first analysis asked whether a contrast effect appeared. Current scores were clearly related to the average of the three preceding scores, and to the preceding scores individually.

We initially queried whether the previously demonstrated contrast effects15,16 (...) examiners may be susceptible to contrast effects in real-world situations despite the formality of high-stakes exams and despite explicit behavioral guidance. Notably, the observed relationships were stronger when current scores were related to the average of three preceding scores relative to when they were related to any individual preceding performance.





Theoretically, this supports our prior assertion that, despite the espoused reliance on criterion-referenced assessment in health professions education, even highly trained, well-resourced examiners may lack a truly 'fixed' sense of competence against which to judge.

Theoretically, this sustains our prior assertions15,16 that, despite our espoused reliance on criterion-based assessments within health professional education,19 highly trained and well-resourced examiners may still lack any truly fixed sense of competence against which to judge when making assessment decisions.


Assessors appear to set their standard by amalgamating the performances they have previously seen rather than simply contrasting with the most recent examinee. This supports the idea that, despite behavioural descriptors and training, examiners build a mental database of past exemplars against which they make comparative judgements. Because the effect persists to the end of the exam it is not a warm-up phenomenon and cannot be removed by discarding one or two initial stations. The strongest negative relationships involved the average of candidates seen four, five and six places earlier, suggesting a primacy effect: the cases examiners see early on are particularly meaningful in setting expectations.

This suggests that assessors mentally amalgamate previous performances to produce a performance standard to judge against rather than simply contrasting with the most recent performance. This is consistent with the idea that, despite the availability of behavioral descriptors and training, examiners accumulate a mental database of past exemplars against which they make comparative judgments. The persistence of the effect near the end of the exam indicates that it is not a “warm-up” phenomenon, and so cannot be counteracted by discarding one or two initial stations. That the strongest negative relationships were observed between later candidates and the average of students that preceded them by four, five, and six places suggests a primacy effect in these exemplar comparisons, in that the early cases that examiners see may be particularly meaningful in setting expectations.


Earlier studies showed larger contrast effects than found here; several explanations are possible. First, in laboratory settings participants are consistently exposed to strong, bidirectional manipulations. That AvN4-6 had the largest influence is particularly suggestive: video-based exemplars could be used to give all examiners a standardised set of initial comparators.

The preceding laboratory studies showed contrast effects that were larger than those observed in this study, explaining up to 24% of observed score variance in mini-CEX scores. A number of explanations are possible. First, in the laboratory context, participants were consistently exposed to a strong, bidirectional manipulation (either seeing very good or very poor performances prior to intermediate performances). That AvN4-6 showed the largest influence has particularly important implications for practical strategies to overcome the biases observed in this line of research. If confirmed, it would suggest potential benefits in using video-based exemplars to create a standardized set of initial exemplar comparators for all examiners.











 2015 Jan 27. [Epub ahead of print]

Are Examiners' Judgments in OSCE-Style Assessments Influenced by Contrast Effects?

Author information

  • 1P. Yeates is clinical lecturer in medical education, Centre for Respiratory Medicine and Allergy, Institute of Inflammation and Repair, University of Manchester, and specialist registrar, Respiratory and General Internal Medicine, Health Education North West, Manchester, United Kingdom. M. Moreau is assistant dean for admissions, Faculty of Medicine and Dentistry, and professor, Division of Orthopaedic Surgery, University of Alberta, Edmonton, Alberta, Canada. K. Eva is senior scientist, Centre for Health Education Scholarship, and professor and director of educational research and scholarship, Department of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Abstract

PURPOSE:

Laboratory studies have shown that performance assessment judgments can be biased by "contrast effects." Assessors' scores become more positive, for example, when the assessed performance is preceded by relatively weak candidates. The authors queried whether this effect occurs in real, high-stakes performance assessments despite increased formality and behavioral descriptors.

METHOD:

Data were obtained for the 2011 United Kingdom Foundational Programme clinical assessment and the 2008 University of Alberta Multiple Mini Interview. Candidate scores were compared with scores for immediately preceding candidates and progressively distant candidates. In addition, average scores for the preceding three candidates were calculated. Relationships between these variables were examined using linear regression.

RESULTS:

Negative relationships were observed between index scores and both immediately preceding and recent scores for all exam formats. Relationships were greater between index scores and the average of the three preceding scores. These effects persisted even when examiners had judged several performances, explaining up to 11% of observed variance on some occasions.

CONCLUSIONS:

These findings suggest that contrast effects do influence examiner judgments in high-stakes performance-based assessments. Although the observed effect was smaller than observed in experimentally controlled laboratory studies, this is to be expected given that real-world data lessen the strength of the intervention by virtue of less distinct differences between candidates. Although it is possible that the format of circuital exams reduces examiners' susceptibility to these influences, the finding of a persistent effect after examiners had judged several candidates suggests that the potential influence on candidate scores should not be ignored.

PMID: 25629945 [PubMed - as supplied by publisher]


"상대적으로 본다면...", 대조효과가 평가자의 점수와 서술 피드백에 미치는 영향(Med Educ, 2015)

Relatively speaking: contrast effects influence assessors’ scores and narrative feedback

Peter Yeates,1 Jenna Cardell,2 Gerard Byrne3 & Kevin W Eva4







Accurate assessment of trainees' clinical performance is important, but inter-assessor score variability is a persistent concern.

Accurate assessment of trainees’ clinical performance is vital both to guide trainees’ educational development1 and to ensure they achieve an appropriate standard of care.2 Although overall support exists for workplace-based assessment (WBA),3–6 inter-assessor score variability has been cited as a cause for concern.7,8



Assessor training is generally sparse, and neither training nor changes to rating scales have produced more than limited improvement. Such interventions assume that rating problems can be solved by helping raters better understand the relationship between the observed performance and some definable criterion.

Training of assessors is generally sparse,9 and both training10,11 and scale alterations12,13 have produced only limited improvements. Implicit in such interventions is the assumption that challenges with rating performance can be overcome by helping raters better understand the rating task and the relationship between the observed performance and some definable criterion against which they can compare.


Preceding performances can bias assessors' judgements in either of two directions: contrast effects and assimilation effects.

As a result, the performance of recently viewed candidates can bias assessors’ judgements of current candidates. Such effects can (theoretically) occur in either of two directions: 

      • contrast effects occur when a preceding good performance reduces the scores given to a current performance by making the current performance seem poor ‘by contrast’, and 
      • assimilation effects occur when a preceding good performance increases the scores given to a current performance by focusing attention on similar aspects of performance.17


When assessors observe a mixture of preceding performances, several effects are possible: primacy, recency or averaging.

When assessors observe a mixture of preceding performances, a variety of effects may occur: 

      • assessors may be most influenced by the initial performances they encounter (primacy), 
      • by the latest performances (recency), or 
      • by the aggregation of previous performances they have seen (averaging).



A further aim was to determine whether observed contrast effects reflect an influence on assessors' perceptions of performance or an artefact of how assessors translate their judgements into scores.

An additional objective of the current study is to determine whether the observation of contrast effects represents an influence on assessors’ perceptions of performance or is an artefact of the way that assessors translate judgements into scores.



On the difficulty of translating assessors' judgements into scores

Crossley et al.19 have suggested that score variations arise in part because scales may not align with assessors’ thinking, suggesting that disappointing psychometric performance of WBA to date may stem not from disagreements about the performance observed, but from different interpretations of the questions and the scales. Other authors have suggested that assessments should focus on narrative comments rather than scores.20–23



To preserve ecological validity, the assessment format these examiners were accustomed to was retained.

As this study was intended to be a fundamental examination of the mechanism through which rater judgements might be influenced, we chose to preserve ecological validity by not altering the assessment format to which these examiners were accustomed. Similarly, we did not specify additional criteria or provide additional training to assessors.



Study design

The study used an Internet-based, randomised, double-blinded design. 



Six-point rating scale

Scores were assigned using a 6-point scale with reference to expectations for completion of FY-1: 

      • 1 = well below expectations; 
      • 2 = below expectations; 
      • 3 = borderline; 
      • 4 = meets expectations; 
      • 5 = above expectations, and 
      • 6 = well above expectations. 

The UK FP intends these judgements to be criterion-referenced as ‘expectations for completion of FY-1’ are defined by reference to the outcomes of its curriculum. 26



Robustness of the t-test.

Adding to the literature suggesting that t-tests are very robust to deviations from normality,27–29 a recent systematic review has reported that equivalent non-parametric tests are more flawed than their parametric counterparts when such deviations exist.30 Nonetheless, we examined the skewedness of the distributions (and found it to be < 1 in all instances) prior to proceeding with analysis via independent-samples t-tests.




Analysis of free-text comments

Analysis of free-text comments

To address RQ 3, free-text feedback comments were coded using content analysis. Researchers, blinded to the group from which comments arose, segmented the feedback into phrases. Next, each comment was coded independently by two researchers to indicate whether it was ‘communication-focused’ (i.e. commenting on the candidate’s interpersonal skills) or ‘content-focused’ (i.e. commenting on the candidate’s knowledge or clinical skills). The two researchers used an initial subset of data to develop a shared understanding of the codes and then independently coded all remaining segments. The researchers also independently coded comments as positive, negative or equivocal. The independently coded comments were then compared and agreement was calculated using Cohen’s kappa values. Discrepant codes were independently reconsidered by both researchers and remaining differences were discussed and resolved. Frequencies of each thematic category (communication and content) were calculated for each performance by each participant. The positive, negative and equivocal codes were assigned scores of +1, −1 and 0, respectively, and their sum was calculated for each participant for each performance. This variable was termed ‘positive/negative (pos/neg) balance’.
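A sketch of the coding arithmetic described above, with invented codes: agreement between two coders is summarised with Cohen's kappa (via scikit-learn), and positive, negative and equivocal segments are scored +1, −1 and 0 and summed into the pos/neg balance.

```python
from sklearn.metrics import cohen_kappa_score

# hypothetical codes assigned to the same feedback segments by two independent coders
coder1 = ["positive", "negative", "negative", "equivocal", "positive", "negative"]
coder2 = ["positive", "negative", "equivocal", "equivocal", "positive", "negative"]

kappa = cohen_kappa_score(coder1, coder2)            # chance-corrected coder agreement

valence = {"positive": 1, "negative": -1, "equivocal": 0}
pos_neg_balance = sum(valence[c] for c in coder1)    # summed valence for one performance
print(round(kappa, 2), pos_neg_balance)
```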


Pearson product-moment correlation

We examined the relationship between scores and feedback measures using Pearson’s product moment correlations


Effect size

All analyses were performed using IBM SPSS Statistics for Windows Version 20.0 (IBM Corp., Armonk, NY, USA). A p-value of < 0.05 was set as the significance threshold. Cohen’s d is used to report effect size for all statistically significant comparisons. By convention, d = 0.8 is considered to represent a large effect, d = 0.6 represents a moderate effect, and d = 0.4 represents a small effect.
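For reference, a minimal sketch of the effect size statistic being reported, computed as the mean difference divided by the pooled standard deviation; the two groups of ratings are invented.

```python
import numpy as np

def cohens_d(group_a, group_b):
    """Standardised mean difference using the pooled standard deviation."""
    a, b = np.asarray(group_a, float), np.asarray(group_b, float)
    pooled_var = ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / (len(a) + len(b) - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

# hypothetical ratings from two experimental groups on the 6-point scale
print(round(cohens_d([5, 5, 4, 5, 4, 5], [4, 4, 3, 4, 4, 3]), 2))
```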














The valence of the feedback (valence: the emotional value associated with a stimulus) showed moderate-to-strong relationships with performance scores, but the content of the feedback was not affected by assessors' recent experience of other performances, suggesting that raters across conditions identified similar issues but judged their severity differently depending on context.

The valence of feedback showed moderate to strong relationships with performance scores and revealed similar influences of contrast. The content of participants’ feedback, by contrast, was not altered by their recent experiences of other performances, which suggests that raters across conditions identified similar issues, but interpreted their severity differently depending on the cases they had previously seen.



Even a single prior performance was enough to influence the next judgement; assessors are more readily influenced by the immediate context than by information in long-term memory.

Firstly, this study has shown that a single performance is sufficient to produce a contrast effect on assessors’ judgements despite the fact that participants claimed to have considerable experience in conducting assessments of this type. Schwarz 32 has suggested that contrast effects occur because humans are more readily influenced by information from their immediate context than by information in long-term memory.



Although assessors are influenced by their immediate context, the comparison they make is against an evolving standard based on the average of preceding performances.

This study further suggests that whilst assessors are readily influenced by information in their immediate context, the comparison they make is against an evolving standard that may be based on the average of preceding performances.



Is this an error arising when judgements are converted into numerical scores?

Prior studies have yielded questions about whether apparent biases arise because assessors find it difficult to translate their judgements into scores.24,34




The content of assessors' feedback did not change; what changed was the perceived severity of the issues. The effect therefore appears to act on assessors' fundamental impressions of performance rather than merely on the translation of judgements into scores.

In this study, the content of assessors’ feedback (linguistic expressions of their judgement) was unchanged as a function of our experimental manipulation, which suggests that they saw the issues similarly in each condition. The considerable variability in the valence of their comments, however, suggests that the perceived severity of the issues observed was influenced by contrasts with previously seen cases. This suggests that such cases fundamentally influence assessors’ impressions of performance rather than simply biasing their translation of judgements into scores.


In smaller programmes, examiners interact with relatively few trainees over a long period, so trainees may appear more different from one another than they really are because they are implicitly contrasted against each other.

In smaller programmes, however, such as the longitudinal integrated clerkships that are becoming increasingly popular,36 particular examiners will interact with a relatively small number of trainees over a long period of time, creating risk that the trainees will appear more different from one another than is realistic because they are implicitly contrasted against one another.


Social pressures also shape ratings. Where assessors must keep working with those they assess, ratings are often positively skewed; the narrower the range of ratings, the less likely a contrast effect is to have practical significance. In higher-stakes examinations where examiner and examinee are unknown to each other, raters may spread their ratings more, giving contrast effects greater scope to operate.

Finally, it should be noted that social pressures of various types can have both implicit and explicit influences on the ratings that assessors assign. In many assessment contexts, including those in which assessors must continue to work with those being assessed, a positive skew in the ratings assigned is commonplace. The more the ratings are compressed into a narrow range, the less likely it is that a contrast effect of discernible practical significance will be observed. In higher-stakes examinations in which the examiner and examinee are unknown to one another or in the context of a video-based review of performance in which the examiner is anonymous (as in this study), raters may be more likely to spread their ratings out, thereby creating greater potential for the psychological contrasts observed in this study to be seen to have influence.







 2015 Sep;49(9):909-19. doi: 10.1111/medu.12777.

Relatively speaking: contrast effects influence assessors' scores and narrative feedback.

Author information

  • 1Centre for Respiratory Medicine and Allergy, Institute of Inflammation and Repair, University of Manchester, Manchester, UK.
  • 2Royal Bolton Hospital, Bolton NHS Foundation Trust, Bolton, Lancashire, UK.
  • 3Health Education North West, Health Education England, Manchester, UK.
  • 4Centre for Health Education Scholarship, Division of Medicine, University of British Columbia, Vancouver, BC, Canada.

Abstract

CONTEXT:

In prior research, the scores assessors assign can be biased away from the standard of preceding performances (i.e. 'contrast effects' occur).

OBJECTIVES:

This study examines the mechanism and robustness of these findings to advance understanding of assessor cognition. We test the influence of the immediately preceding performance relative to that of a series of prior performances. Further, we examine whether assessors' narrative comments are similarly influenced by contrast effects.

METHODS:

Clinicians (n = 61) were randomised to three groups in a blinded, Internet-based experiment. Participants viewed identical videos of good, borderline and poor performances by first-year doctors in varied orders. They provided scores and written feedback after each video. Narrative comments were blindly content-analysed to generate measures of valence and content. Variability of narrative comments and scores was compared between groups.

RESULTS:

Comparisons indicated contrast effects after a single performance. When a good performance was preceded by a poor performance, ratings were higher (mean 5.01, 95% confidence interval [CI] 4.79-5.24) than when observation of the good performance was unbiased (mean 4.36, 95% CI 4.14-4.60; p < 0.05, d = 1.3). Similarly, borderline performance was rated lower when preceded by good performance (mean 2.96, 95% CI 2.56-3.37) than when viewed without preceding bias (mean 3.55, 95% CI 3.17-3.92; p < 0.05, d = 0.7). The series of ratings participants assigned suggested that the magnitude of contrast effects is determined by an averaging of recent experiences. The valence (but not content) of narrative comments showed contrast effects similar to those found in numerical scores.

CONCLUSIONS:

These findings are consistent with research from behavioural economics and psychology that suggests judgement tends to be relative in nature. Observing that the valence of narrative comments is similarly influenced suggests these effects represent more than difficulty in translating impressions into a number. The extent to which such factors impact upon assessment in practice remains to be determined as the influence is likely to depend on context.

© 2015 John Wiley & Sons Ltd.

PMID: 26296407 [PubMed - in process]


Post-examination analysis of objective test scores: AMEE Guide No. 66

Post-examination interpretation of objective test data: Monitoring and improving the quality of high-stakes examinations: AMEE Guide No. 66

MOHSEN TAVAKOL & REG DENNICK

University of Nottingham, UK






Introduction

Measurement error can arise for the following reasons.

The output of the examination process is transferred to students either formatively, in the form of feedback, or summatively, as a formal judgement on performance. Clearly, to produce an output which fulfils the needs of students and the public, it is necessary to define, monitor and control the inputs to the process. Classical Test Theory (CTT) assumes that inputs to post-examination analysis contain sources of measurement error that can influence the student's observed scores of knowledge and competencies. Sources of measurement error are derived from test construction, administration, scoring and interpretation of performance. For example: quality variation among knowledge-based questions, differences between raters, differences between candidates and variation between standardised patients (SPs) within an Objective Structured Clinical Examination (OSCE).


The simplest interpretation of reliability is that the percentage of error is 100 minus the squared reliability (multiplied by 100).

To improve the quality of high-stakes examinations, errors should be minimised and, if possible, eliminated. CTT assumes that minimising or eliminating sources of measurement errors will cause the observed score to approach the true score. Reliability is the key estimate showing the amount of measurement error in a test. A simple interpretation is that reliability is the correlation of the test with itself; squaring this correlation, multiplying it by 100 and subtracting from 100 gives the percentage error in the test. For example, if an examination has a reliability of 0.80, there is 36% error variance (random error) in the scores. As the estimate of reliability increases, the fraction of a test score that is attributable to error will decrease. Conversely, if the amount of error increases, reliability estimates will decrease (Nunnally & Bernstein 1994).


(...)


(...)



Interpretation of basic post-examination results

Individual questions

Descriptive analysis is the first step. If there are no missing responses, students either had enough time or guessed on some items; conversely, many missing responses may indicate insufficient time, an overly hard test, or negative marking.

A descriptive analysis is the first step in summarising and presenting the raw data of an examination. A distribution frequency for each question immediately shows up the number of missing questions and the patterns of guessing behaviour. For example, if there were no missing question responses identified, this would suggest that students either had good knowledge or were guessing for some questions. Conversely, if there were missing question responses, this might be either an indication of an inadequate time for completing the examination, a particularly hard exam or negative marking is being used (Stone & Yeh 2006; Reeve et al. 2007).


The SD shows the variation within a question.

The means and variances of test questions can provide us with important information about each question. The mean of a dichotomous question, scored either 0 or 1, is equal to the proportion of students who answer correctly, denoted by p. The variance of a dichotomous question is calculated from the proportion of students who answer a question correctly (p) multiplied by those who answer the question incorrectly (q). To obtain the standard deviation (SD), we merely take the square root of p × q. For example, if in an objective test, 300 students answered Question 1 correctly and 100 students answered it incorrectly, the p value for Question 1 will be equal to 0.75 (300/400), and the variance and SD will be 0.18 (0.75 × 0.25) and 0.42 (√0.18) respectively. The SD is useful as a measure of variation or dispersion within a given question. A low SD indicates that the question is either too easy or too hard. For example, in the above example, the SD is low indicating that the item is too easy. Given the item difficulty of Question 1 (0.75) and a low item SD, one can conclude that responses to the item were not dispersed (there is little variability on the question) as most students paid attention to the correct response. If the question had a high variability with a mean at the centre of distribution, the question might be useful.
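These item statistics are easy to reproduce; a sketch using the numbers given (300 of 400 students correct). Note that the Guide rounds the variance to 0.18 before taking the square root, which is why it reports 0.42 rather than 0.43.

```python
import math

n_correct, n_total = 300, 400

p = n_correct / n_total          # item difficulty (proportion correct) = 0.75
q = 1 - p                        # proportion incorrect = 0.25
variance = p * q                 # variance of a dichotomous item ≈ 0.19
sd = math.sqrt(variance)         # item standard deviation ≈ 0.43
print(p, round(variance, 3), round(sd, 3))
```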


Total performance

For total performance, the sum of each student's correct responses is obtained and its mean and SD calculated.

After obtaining the mean and SD for each question, the test can be subjected to conventional performance analysis where the sum of correct responses of each student for each item is obtained and then the mean and SD of the total performance are calculated. Creating a histogram using SPSS allows us to understand the distribution of marks on a given test. Students’ marks can take either a normal distribution or may be skewed to the left or right or distributed in a rectangular shape. Figure 1(a) illustrates a positively skewed distribution. This simply shows that most students have a low-to-moderate mark and a few students received a relatively high mark in the tail. In a positively skewed distribution, the mode and the median are greater than the mean indicating that the questions were hard for most students. Figure 1(b) shows a negatively skewed distribution of students’ marks. This shows that most students have a moderate-to-high mark and a few students received relatively a low mark in the tail. In a negatively skewed distribution, the mode and the median are less than the mean indicating that the questions were easy for most students.




Figure 1(c) shows most marks distributed in the centre of a symmetrical distribution curve. This means that half the students scored greater than the mean and half less than mean. The mean, mode and median are identical in this situation. Based on this information, it is hard to judge whether the exam is hard or easy unless we obtain differences between the mode, median or mean plus an estimate of the SD. We have explained how to compute these statistics using SPSS elsewhere (Tavakol & Dennick 2011b; Tavakol & Dennick 2012).


As an example, we would ask you to consider the two distributions in Figure 2, which represent simulated marks of students in two examinations.




Both the mark distributions have a mean of 50, but show a different pattern. Examination A has a wide range of marks, with some below 20 and some above 90. Examination B, on the other hand, shows few students at either extreme. Using this information, we can say that Examination A is more heterogeneous than Examination B and that Examination B is more homogenous than Examination A.


In order to better interpret the exam data, we need to obtain the SD for each distribution. For example, if the mean marks for the two examinations are 67.0, with different SDs of 6.0 and 3.0, respectively, we can say that the examination with a SD of 3.0 is more homogenous and hence more consistent in measuring performance than the examination with a SD of 6.0. A further interpretation of the value of the SD is how much it shows students’ marks deviating from the mean. This simply indicates the degree of error when we use a mean to explain the total student marks. The SD also can be used for interpreting the relative position of individual students in a normal distribution. We have explained and interpreted it elsewhere (Tavakol & Dennick 2011a).



Interpretation of classical item analysis

Because a test cannot be repeated an infinite number of times, we instead have as many students as possible take it once.

In scientific disciplines, it is often possible to measure variables with a great deal of accuracy and objectivity but when measuring student performance on a given test due to a wide variety of confounding factors and errors, this accuracy and objectivity becomes more difficult to obtain. For instance, if a test is administered to a student, he or she will obtain a variety of scores on different occasions, due to measurement errors affecting his or her score. Under CTT, the student's score on a given test is a function of the student's true score plus random errors (Alagumalai & Curtis 2010), which can fluctuate from time to time. Due to the presence of random errors influencing examinations, we are unable to exactly determine a student's true score unless they take the exam an infinite number of times. Computing the mean score in all exams would eliminate random errors resulting in the student's score eventually equalling the true score. However, it is practically impossible to take a test an infinite number of times. Instead we ask an infinite number of students (in reality a large cohort!) to take the test once allowing us to estimate a generalised standard error of measurement (SEM) from all the students’ scores. The SEM allows us to estimate the true score of each student which has been discussed elsewhere (Tavakol & Dennick 2011b).


Reliability

It is worth reiterating here that just as the observed score is composed of the sum of the true score and the error score, the variance of the observed score in an examination is made up of the sum of the variances of the true score and the error score, which can be formulated as follows:

Variance (observed scores) = Variance (true scores) + Variance (error scores)    (1)









Now imagine a test has been administered to the same cohort several times. If there is a discrepancy between the variance of the observed scores for each individual, on each test, the reliability of the test will be low. The test reliability is defined as the ratio of the variance of the true score to the variance of the observed score:

Reliability = Variance (true scores) / Variance (observed scores)    (2)



Given this, the greater the ratio of the true score variance to the observed score variance, the more reliable the test. If we substitute variance (true scores) from Equation (1) in Equation (2), the reliability will be as follows:

Reliability = [Variance (observed scores) − Variance (error scores)] / Variance (observed scores)



And then we can rearrange the reliability index as follows:

Reliability = 1 − [Variance (error scores) / Variance (observed scores)]


This equation simply shows the relationship between source of measurement error and reliability. For example, if a test has no random errors, the reliability index is 1, whereas if the amount of error increases, the reliability estimate will decrease.



Increasing the test reliability

The statistical procedures employed for estimating reliability are Cronbach's alpha and the Kuder–Richardson 20 formula (KR-20). If the test reliability was less than 0.70, you may need to consider removing questions with low item-total correlation. For example, we have created a simulated SPSS output for four questions in Tables 1 and 2.




Table 1 shows Cronbach's alpha for four questions, 0.72. Table 2 shows item-total correlation statistics with the column headed ‘Cronbach's Alpha if Item deleted’. (Item-total correlation is the correlation between an individual question score and the total score).


The fourth question in the test has a total-item correlation of −0.51 implying that responses to this particular question have a negative correlation with the total score. If we remove this question from the test, the alpha of the three remaining questions increase from 0.725 to 0.950, making the test significantly more reliable.


Tables 3 and 4 show the output SPSS after removing Question 4:



Tables 3 and 4 illustrate the impact of removing Question 4 from the test, which significantly increases the value of alpha.


If Question 2 in Table 4 is also removed, alpha becomes perfect (= 1), meaning the remaining questions measure exactly the same thing. That is not necessarily good, because it implies redundancy among the items; in that case the test could be shortened without compromising reliability. Reliability is a function of test length: the more items, the higher the reliability.

However, if we now remove Question 2, the value of the alpha for the test will be perfect, i.e. 1, which means each question in the test must be measuring exactly the same thing. This is not necessarily a good thing as it suggests that there is redundancy in the test, with multiple questions measuring the same construct. If this is the case, the test length could be shortened without compromising the reliability (Nunnally & Bernstein 1994). This is because the reliability is a function of test length. The more items there are, the higher the reliability of a test.
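A sketch of the item-total statistics under discussion (corrected item-total correlation and alpha-if-item-deleted), computed on an invented 0/1 response matrix rather than the Guide's simulated SPSS output; the fourth question is deliberately constructed to behave badly.

```python
import numpy as np

def cronbach_alpha(m):
    k = m.shape[1]
    return k / (k - 1) * (1 - m.var(axis=0, ddof=1).sum() / m.sum(axis=1).var(ddof=1))

rng = np.random.default_rng(5)
responses = (rng.random((50, 4)) < 0.6).astype(float)
responses[:, 3] = 1 - responses[:, 0] * (rng.random(50) < 0.8)   # Q4 mostly opposes Q1

for q in range(responses.shape[1]):
    rest = np.delete(responses, q, axis=1)
    item_total_r = np.corrcoef(responses[:, q], rest.sum(axis=1))[0, 1]  # corrected item-total r
    alpha_if_deleted = cronbach_alpha(rest)                              # alpha without this item
    print(f"Q{q + 1}: item-total r = {item_total_r:+.2f}, alpha if deleted = {alpha_if_deleted:.2f}")
```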


Although alpha and KR-20 are useful for estimating reliability, they conflate all possible sources of error into a single value. Error can come from multiple sources, and the influence of each source can be estimated with a generalisability coefficient.

Although Cronbach's alpha and KR-20 are useful for estimating the reliability of a test, they conflate all sources of measurement error into one value (Mushquash & O'Connor 2006). Recall that true scores equal observed scores plus errors, which is derived from a variety of sources. The influence of each source of error can be estimated by the coefficient of generalisability, which is similar to a reliability estimate in the true score model (Cohen & Swerdlik 2010). Later we will describe how to identify and reduce sources of measurement errors using generalisability theory or G-theory as it is known. What is more, in our previous Guide (Tavakol & Dennick 2012), we explained and interpreted item difficulty level, item discrimination index and point bi-serial coefficient in terms of CTT. In this Guide, we will explain and interpret these concepts in terms of Item Response Theory (IRT) using item characteristic parameters (item difficulty and item discrimination) and the student ability/performance to all questions using the Rasch model.



Factor analysis

Linear factor analysis is widely used by test developers in order to reduce the number of questions and to ensure that important questions are included in the test. For example, the course convenor of cardiology may ask all medical teachers involved in teaching cardiology to provide 10 questions for the exam. This might generate 100 questions, but all these questions are not testing the same set of concepts. Therefore, identifying the pattern of correlations between the questions allows us to discover related questions that are aimed at the underlying factors of the exam. A factor is a construct which represents the relationship between a set of questions and will be generated if the questions are correlated with the factor. In factor analysis language, this refers to factor ‘loadings’. After factor analysis is carried out, related questions load onto factors which represent specific named constructs. Questions with low loadings can therefore be removed or revised.


If a test measures a single trait, only one factor with high loadings will explain the observed question relationships and hence the test is uni-dimensional. If multiple factors are identified, then the test is considered to be multi-dimensional.


There are two kinds: EFA and CFA.

There are two main components to linear factor analysis: exploratory and confirmatory. Exploratory Factor Analysis (EFA) identifies the underlying constructs or factors within a test and hypothesises a model relationship between them. Confirmatory Factor Analysis (CFA) validates whether the model fits the data using a new data set. Below, each method is explained.


Exploratory factor analysis

EFA is used to revise items or to select items belonging to a specific knowledge domain; it also calculates a communality for each item.

EFA is widely used to identify the relationships between questions and to discover the main factors in a test as previously described. It can be used either for revising exam questions or choosing questions for a specific knowledge domain. For example, if in the cardiology exam we are interested in testing the clinical manifestations of coronary heart disease, we simply look for the questions which load on to this domain. The following simulated example, using an examination with 10 questions taken by 50 students, demonstrates how to improve the questions in an examination. This allows us to demonstrate how to revise and strengthen exam questions and to calculate the loadings on the domain of interest. As well as identifying the factors EFA also calculates the ‘communality’ for each question. To understand the concept of communality, it is necessary to explain the variance (the variability in scores) within the EFA approach.


In factor analysis, the variance of each question is divided into two parts.

We have already learnt from descriptive statistics how to calculate the variance of a variable. In the language of factor analysis, the variance of each question consists of two parts. 

      • One part can be shared with the other questions, called ‘common variance’; 
      • the rest may not be shared with other questions, called ‘error’ or ‘random variance’. 

The communality of a question is the proportion of its variance explained by the identified factors, and ranges from 0 to 1.

The communality for a question is the value of the variance accounted for by the particular set of factors, ranging from 0 to 1.00. 

For example, a question that has no random variance would have a communality of 1.00; a question that has not shared its variance with other questions would have a communality of 0.00. The communality shown for Question 9 (Table 5) is 0.85, that is 85% of the variance in Question 9 is explained by factor 1 and factor 2, and 15% of the variance of Question 9 has nothing in common with any other question. To compute the shared variances for each question in SPSS, the following steps are carried out in SPSS (SPSS 2009). From the menus, choose ‘Analyse’, ‘Dimension Reduction’ and ‘Factor’, respectively. Then move all questions on to the ‘Variables’ box. Choose ‘Descriptive’ and then click ‘Initial Solution’ and ‘Coefficients’, respectively. Then click ‘Rotation’. Choose ‘Varimax’ and click on ‘Continue’ and then ‘OK’. In Table 5, we have combined the simulated data of the SPSS output together.
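The same loadings and communalities can be approximated outside SPSS. The sketch below is only an illustration under stated assumptions: it uses scikit-learn's FactorAnalysis (the varimax rotation option requires scikit-learn 0.24 or later) on simulated scores, not the Guide's Table 5 data.

```python
# Illustrative EFA: two simulated factors, ten questions, varimax rotation.
import numpy as np
from sklearn.decomposition import FactorAnalysis
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_students = 50
f1 = rng.normal(size=n_students)                 # latent factor 1
f2 = rng.normal(size=n_students)                 # latent factor 2
loadings_true = np.zeros((10, 2))
loadings_true[:7, 0] = 0.8                       # Q1-Q7 load on factor 1
loadings_true[7:, 1] = 0.8                       # Q8-Q10 load on factor 2
noise = rng.normal(scale=0.5, size=(n_students, 10))
X = f1[:, None] * loadings_true[:, 0] + f2[:, None] * loadings_true[:, 1] + noise

X_std = StandardScaler().fit_transform(X)        # standardise question scores
fa = FactorAnalysis(n_components=2, rotation="varimax").fit(X_std)

loadings = fa.components_.T                      # rows = questions, cols = factors
communalities = (loadings ** 2).sum(axis=1)      # h2 = sum of squared loadings
for q, (load, h2) in enumerate(zip(loadings, communalities), start=1):
    print(f"Q{q}: loadings = {np.round(load, 2)}, h2 = {h2:.2f}")
```

Questions with low communalities or low loadings on both factors would then be revised or discarded, as with Question 5 in Table 5.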



How large should a loading be?

Table 5 shows that two factors have emerged. Factor 1 demonstrates excellent loading with Questions 9, 2, 6, 10, 4, 1 and 3 and Factor 2 demonstrates excellent loading with Questions 7 and 8, indicating these items have a strong correlation with Factors 1 and 2. 

      • It should be noted that loadings with values greater than 0.71 are considered excellent (0.71² ≈ 0.50, i.e. 50% common variance between the item and the factor; in other words, 50% of the variation in the item can be explained by variation in the factor), 
      • 0.63 (40% common variance) very good, 
      • 0.45 (20% common variance) fair. 
      • Values less than 0.32 (10% common variance) are considered poor; such items contribute little to the overall test and should be investigated (Comrey & Lee 1992; Tabachnick & Fidell 2006). 


The column labelled h2 gives the communality of each question. For Question 5 the value of 8% means that only 8% of its variance is explained by the factors; questions with values below 30% are not related to the other questions loading on the identified factors. In Table 5, Question 5 has the lowest communality and loads on neither Factor 1 nor Factor 2, so it should be revised or discarded.

Table 5 also shows communalities for each question in the column labelled h2. For example, 92% of the variance in Question 2 is explained by the two factors that have emerged from the EFA approach. The lowest communality is for Question 5, indicating that only 8% of the variance in this question is explained by the two factors. Low values of less than 30% indicate that the variance of the question does not relate to the other questions loaded on to the identified factors. In Table 5, Question 5 has the lowest communality figure and has not loaded onto Factor 1 or 2, suggesting that this question should be revised or discarded.


Before deleting Question 5, Factor 1 accounted for 0.47 of the variance and Factor 2 for 0.23; after deleting it, the total rises to 0.78. Moreover, most questions load on Factor 1, providing convergent and discriminant evidence for the construct validity of the test: it is 'convergent' because the loadings on Factor 1 are high, and 'discriminant' because the questions loading on Factor 1 do not load on Factor 2. Cronbach's alpha should be calculated separately for each of the two factors.

Table 5 also shows the values of variance explained by the two factors that have been identified from the EFA approach; 0.47 of the variance is accounted for by Factor 1 and 0.23 of the variance is accounted for by Factor 2. Therefore, 0.70 of the variance is accounted for by all of the questions. However, if we delete Question 5, we can increase the total variance accounted for to 0.78. A further interpretation of Table 5 is that the vast majority of questions have loaded on to Factor 1, providing evidence of convergence and discrimination for the construct validity of the test. We can argue that the test is convergent as there are high loadings on to Factor 1. The test is also discriminant as the questions that have loaded on to Factor 1 have not loaded on to Factor 2. This means that Factor 2 measures another construct/concept which is discriminated from Factor 1. Because two factors have been identified, it would be appropriate to calculate Cronbach's alpha coefficient for each factor, because they are measuring two different constructs. It should be noted that items which load on more than one factor (cross-loading items) need to be investigated.


Confirmatory factor analysis

In CFA, the hypothesised model extracted by EFA is used to confirm the latent factors; however, to avoid a circular argument, a new data set is needed to test model fit.

The technique of CFA has been widely used to validate psychological tests but has been less used to evaluate and improve the psychometric properties of exam questions. The EFA approach can reveal how exam questions are correlated or connected to an underlying domain of factors. For example, an EFA approach may show that the internal structure of a 100 question test consist of three underlying domains, say physical examination, clinical reasoning and communication skills. The number of factors identified constitutes the components of a hypothesised model, the factor structure model. In the above example, the model would be termed a three-factor model. The CFA approach uses the hypothesised model extracted by EFA to confirm the latent (underlying) factors. However, in order to confirm model fitting, a new data set must be used to avoid a circular argument. For example, the same test could be administered to a different but comparable group of students.


Therefore a model is first identified with EFA and then tested with CFA. Using SEM, the new sample data are fitted to the hypothesised model to assess goodness of fit.

Therefore, educators must first identify a model using EFA and test it using CFA. This approach also allows educators to revise exam questions and the factors underlying their constructs (Floyd & Widaman 1995). For example, suppose EFA has revealed a two-factor model from an exam consisting of history-taking and physical examination questions. The researcher wishes to measure the psychometric characteristics of the questions and test the overall fit of the model to improve the validity and reliability of the exam. This can be achieved by the use of structural equation modelling (SEM), which determines the goodness-of-fit of the newly input sample data to the hypothesised model. The model fit is assessed using Chi-square testing and other fit indices. In contrast to other statistical hypothesis testing procedures, if the value of Chi-square is not significant, the new data fit the model and the model is confirmed. However, as the value of Chi-square is a function of sample size, other fit indices should also be investigated (Dimitrov 2010). These indices are the comparative fit index (CFI) and the root mean square error of approximation (RMSEA):

      • A CFI value of greater than 0.90 shows a psychometrically acceptable fit to the exam data. 
      • The value of RMSEA needs to be below 0.05 to show a good fit (Tabachnick & Fidell 2006). An RMSEA of zero indicates that the model fit is perfect. 

Available statistical software and how to run the analysis

It should be noted that CFA can be run in a number of popular statistical software programmes such as SAS, LISREL, AMOS and Mplus. For the purpose of this article, we chose AMOS (Analysis of Moment Structures) for its ease of use. The AMOS software can easily create models and calculate the value of Chi-square as well as the fit indices. In the above example, a test of eight questions has two factors, history-taking and physical examination, and the variance of these eight exam questions can be explained by these two highly correlated factors. The test developer draws the two-factor model (the path diagram) in AMOS to test the model (Figure 3). Before estimating the parameters of the model, click on 'View', then 'Analysis Properties', and then tick 'Minimization history', 'Standardised estimates', 'Squared multiple correlations' and 'Modification indices'. To run the estimation, from the menu at the top, click on 'Analyze' and then 'Calculate Estimates'.
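Outside AMOS, the same two-factor CFA workflow can be sketched in Python. The example below is a hedged illustration: it assumes the third-party semopy package (not used in the Guide) and eight hypothetical questions, q1–q8, generated from two correlated latent abilities.

```python
# Illustrative CFA with semopy: two correlated factors, four indicators each.
import numpy as np
import pandas as pd
import semopy

rng = np.random.default_rng(1)
n = 200
hist = rng.normal(size=n)                            # latent history-taking ability
phys = 0.7 * hist + rng.normal(scale=0.7, size=n)    # correlated physical-exam ability
cols = {f"q{i}": 0.8 * hist + rng.normal(scale=0.6, size=n) for i in range(1, 5)}
cols.update({f"q{i}": 0.8 * phys + rng.normal(scale=0.6, size=n) for i in range(5, 9)})
df = pd.DataFrame(cols)                              # the 'new' confirmation sample

# lavaan-style description of the hypothesised two-factor model
desc = """
HistoryTaking =~ q1 + q2 + q3 + q4
PhysicalExam  =~ q5 + q6 + q7 + q8
HistoryTaking ~~ PhysicalExam
"""

model = semopy.Model(desc)
model.fit(df)
print(model.inspect())            # loadings (slopes) and the factor covariance
print(semopy.calc_stats(model))   # fit indices, including chi-square, CFI and RMSEA
```

As in the AMOS example, a non-significant chi-square together with a CFI above 0.90 and an RMSEA below 0.05 would support the two-factor model in the new sample.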



SEM calculates intercepts and slopes: the intercept is analogous to the item difficulty index and the slope to the item discrimination index.

The output is given in Table 6. SEM calculates the slopes and intercepts of the relationships between questions and factors. From a CTT perspective, the intercept is analogous to the item difficulty index and the slope (the standardised regression weight/coefficient) is analogous to the discrimination index.



Question 4 in history-taking has low discrimination and can be judged to contribute little to the overall history-taking score.

Table 6 shows that Question 1 in history-taking and Question 3 in physical examination were easy (intercept = 0.97) and hard (0.08), respectively. Table 6 also shows that Question 4 in history-taking is not contributing to the overall history-taking score (slope = −0.03). Further analysis was conducted to assess the degree of fit of the model to the exam data. Focusing on Table 7, the absence of significance for the Chi-square value (p = 0.49) implies support for the two-factor model in the new sample. Reviewing the values of both CFI and RMSEA in Table 7, it is evident that the two-factor model is a good fit to the exam data for the new sample.


The 0.70 correlation between history-taking and physical examination supports the hypothesised two-factor model.

Further evidence for the relationship between the history-taking and physical examination components of the test is revealed by the calculation of a 0.70 correlation between the two factors, supporting the hypothesised two-factor model. It should be noted that AMOS will display the correlation between factors/components by clicking the ‘view the output diagram’ button. You can also view correlation estimates from ‘text output’. From the main menu, choose view and then click on ‘text output’.



Generalisability theory analysis

We would ask you to recall that reliability is concerned with the ability of a test to measure students' knowledge and competencies consistently. For example, if students are re-examined with the same items and with the same conditions on different occasions, the results should be more or less the same. In CTT, the items and conditions may be the causes of measurement errors associated with the obtained scores. Reliability estimates, such as KR-20 or Cronbach's alpha, cannot identify the potential sources of measurement error associated with these items and conditions (also known as facets of the test) and cannot discriminate between each one. However, an extension of CTT called Generalisability Theory or G-theory, developed by Lee J. Cronbach and colleagues (Cronbach et al. 1972), attempts to recognise, estimate and isolate these facets allowing test constructors to gain a clearer picture of sources of measurement error for interpreting the true score. One single analysis of, for example, the results of an OSCE examination, using G-theory can estimate all the facets, potentially producing error in the test. Each facet of measurement error has a value associated with it called its variance component, calculated via an analysis of variance (ANOVA) procedure, described below. These variance components are next used to calculate a G-coefficient which is equivalent to the reliability of the test and also enables one to generalise students’ average score over all facets.


For example, imagine an OSCE has used SPs, a range of examiners and various items to assess students' performance on 12 stations. SPs, examiners and items, and their interactions (e.g. the interaction between SPs and items), are considered facets of the assessment. The score that a student obtains from the OSCE will be affected by these facets of measurement error, and therefore the assessor should estimate the amount of error caused by each facet. Furthermore, we examine students using a test in order to make a final decision regarding their performance, and to make this decision we need to generalise from the test score each student obtains. This means that assessors should ensure the credibility and trustworthiness of the score as a means of making a good decision (Raykov & Marcoulides 2011). Therefore, the composition of the errors associated with the observed (obtained) scores gained from a test needs to be investigated. G-theory analysis can then provide useful information for test constructors to minimise identified sources of error (Brennan 2001). We will now explain how to calculate the G-coefficient from variance components.


G-coefficient calculation

To calculate the G-coefficient from variance components of facets, test analysers traditionally use the ANOVA procedure. ANOVA is a statistical procedure by which the total variance present in a test is partitioned into two or more components which are sources of measurement error. Using the calculated mean square of each source of variation from the ANOVA output (e.g. SPs, items, assessors, etc.), investigators determine the variance components and then calculate the G-coefficient from these values.


However, SPSS and other statistical packages like the Statistical Analysis System (SAS) now allow us to calculate the variance components directly from the test data. We will now illustrate how to obtain the variance components from SPSS directly for calculating the G-coefficient. The procedure used varies according to the number of facets in the test. There are single facet and multiple facet designs as described below.


Single facet design

A single facet design examines only a single source of measurement error in a test, although in reality others may exist. For example, in an OSCE examination, we might like to focus on the influence of examiners as a source of error. In G-theory, this is called a one-facet 'student (s) crossed-with-examiner (e)' design: (s × e). Consider an OSCE in which three examiners independently rate a cohort of clinical students on three different stations using a 5-item checklist, with each item scored from 1 to 5. The total mark can therefore range from 5 to 25, with a higher mark suggesting a greater level of performance on each station. Using G-theory, we can find out how much measurement error is generated by the examiners. For illustrative purposes, only 10 students and the three examiners are presented in the Data Editor of SPSS in Figure 4.




Before analysing, the data needs to be restructured. To this end, from the data menu at the top of the screen, one clicks on ‘restructure’ and follows the appropriate instructions. In Figure 5, the restructured data format is presented.




To obtain the variance components, the following steps are carried out:


How to run the analysis in SPSS

From the menus, choose 'Analyse', then 'General Linear Model', and then click on 'Variance Components'. Click on 'Score' and then click on the arrow to move 'Score' into the box marked 'Dependent Variable'. Click on student and examiner to move them into 'Random Factors'. After 'Variance Estimates' appears, click OK, and the contribution of each source of variance to the result is presented as shown in Table 8.
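The same variance components can be estimated directly in Python. The sketch below uses hypothetical scores (not the Figure 4 data) for the one-facet student × examiner design: it derives the components from the ANOVA mean squares and then forms the G-coefficient defined in the following paragraphs.

```python
# One-facet (student x examiner) G-study from ANOVA mean squares.
# The score matrix is hypothetical: rows = 10 students, columns = 3 examiners.
import numpy as np

scores = np.array([
    [20, 18, 19], [15, 14, 16], [22, 21, 23], [12, 13, 11], [18, 17, 19],
    [25, 24, 23], [14, 15, 13], [19, 20, 21], [16, 15, 17], [21, 22, 20],
], dtype=float)
n_s, n_e = scores.shape

grand = scores.mean()
stu_means = scores.mean(axis=1)
exa_means = scores.mean(axis=0)

ms_s = n_e * ((stu_means - grand) ** 2).sum() / (n_s - 1)          # students
ms_e = n_s * ((exa_means - grand) ** 2).sum() / (n_e - 1)          # examiners
ss_res = ((scores - stu_means[:, None] - exa_means[None, :] + grand) ** 2).sum()
ms_res = ss_res / ((n_s - 1) * (n_e - 1))                          # residual

var_s = (ms_s - ms_res) / n_e     # student component (object of measurement)
var_e = (ms_e - ms_res) / n_s     # examiner component (facet of error)
var_res = ms_res                  # residual/interaction component

g = var_s / (var_s + var_res / n_e)   # relative G-coefficient over 3 examiners
print(round(var_s, 3), round(var_e, 3), round(var_res, 3), round(g, 2))
```

Negative component estimates, which can occur with small samples, are conventionally set to zero before the G-coefficient is calculated.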



Student variance is not classified as a facet of measurement error; it is the object of measurement. Here the examiners account for 6.20% of the variance, which is acceptably low. The residual variance is not attributable to any specific cause but reflects the interaction between the different facets.

Table 8 shows that the estimated variance components associated with student and examiner are 10.144 and 1.578, respectively. Expressed as a percentage of the total variance, it can be seen that 40.00 % is due to the students and 6.20 % to the examiners. However, the variance of the students is not considered a facet of measurement error as this variation is expected within the student cohort and in terms of G-theory, it is called the ‘object of measurement’ (Mushquash & O'Connor 2006). Importantly for our analysis, the findings indicate that the examiners generated 6.20% of the total variability, which is considered a reasonably low value. Higher values would create concern about the effect of the examiners on the test. The residual variance is the amount of variance not attributed to any specific cause but is related to the interaction between the different facets and the object of measurement of the test. In this example, 13.656 or 53.80% of the variance is accounted for by this factor.



On the basis of the findings of Table 8, we are now in a position to calculate the generalisability coefficient. In this case, the G-coefficient is defined as the ratio of the student variance component (denoted σ²_s) to the sum of the student variance component and the residual variance component (denoted σ²_res) divided by the number of examiners (k) (Nunnally & Bernstein 1994), and written as follows:

G = σ²_s / (σ²_s + σ²_res / k)


Inserting the values from above (σ²_s = 10.144, σ²_res = 13.656, k = 3 examiners), this gives:

G = 10.144 / (10.144 + 13.656/3) = 10.144 / 14.696 ≈ 0.69


The G-coefficient is the counterpart of a reliability coefficient and ranges from 0 to 1; it is interpreted as the reliability of the test once the various sources of error, estimated from their variance components, have been taken into account.

The G-coefficient, traditionally denoted ρ², is the counterpart of the well-known reliability coefficient, with values ranging from 0 to 1.0. It is worth noting that the G-coefficient in the single-facet design described above is equal to Cronbach's alpha coefficient (for non-dichotomous data) and to Kuder–Richardson 20 (for dichotomous data). The interpretation of the value of the G-coefficient is that it represents the reliability of the test taking into account the multiple sources of error calculated from their variance components. The higher the value of the G-coefficient, the more we can rely on (generalise) the students' scores and the less influence the study facets have had. In the above example, the G-coefficient has a reasonably high value and the variance component for examiners is low. This shows that the examiners did not vary greatly in scoring students and that we can have confidence in the students' scores.


A multi-facet design

In many situations, more facets need to be taken into account.

Clearly in an OSCE examination, there are a number of other potential facets that need to be taken into consideration in addition to the examiners. For example, the number of stations, the number of SPs and the number of items on the OSCE checklist. We will now explain how to calculate the variance components and a G-coefficient for a multi-facet design building on the previous example. Each of three stations now has a SP and a 5-item checklist leading to an overall score for each student. Here, examiners, stations, SPs and items can affect the student performance and hence are facets of measurement error.


Items (i), students (s), stations (st), SPs (sp) and examiners (e) can all be entered as sources of error.

However, because we are now interested in the influence of the number of items as a source of error, we need to input the score for each item (i), for each student (s), for each station (st), for each SP (sp) and for each examiner (e). After entering the exam data into SPSS and restructuring it, the analysis of variance components is carried out as described before. Table 9 shows the hypothetical results of the variance components for potential sources of measurement error in the OSCE results.



Interactions that do not appear in the table generate no measurement error.

Table 9 shows that 59.16%, 16.37% and 15.04% of the variance is generated by the student × item × examiner interaction, the student × examiner interaction and the students themselves, respectively. The absence of residual variance for other combinations of facets indicates that student scores do not fluctuate owing to those interactions and consequently they do not lead to any measurement error. The value of the variance component for examiners (0.06) in Table 9 differs from the value in Table 8 (1.57) because, in creating the multi-facet matrix, we are using individual item scores from students rather than their total mark for all stations. These findings also indicate that there is little disagreement about the actual scores given to students by each examiner (2.88%). We can insert the values of the variance components and the numbers associated with each facet shown in Table 9 into the following equation:






Zero values of variance components are not inserted, thus excluding SPs and stations.
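Although the exact expression is not reproduced here, a common general form of the relative G-coefficient for a fully crossed student (s) × item (i) × examiner (e) design is given below; this is our assumption about the shape of the elided equation, and, as stated above, components estimated as zero (such as those involving SPs and stations) simply drop out.

$$
E\rho^{2} = \frac{\sigma^{2}_{s}}{\sigma^{2}_{s} + \dfrac{\sigma^{2}_{s\times i}}{n_{i}} + \dfrac{\sigma^{2}_{s\times e}}{n_{e}} + \dfrac{\sigma^{2}_{s\times i\times e}}{n_{i}\,n_{e}}}
$$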



In this example, the G-coefficient is high and the variance components of the facets are low, hence the reliability of the OSCE is very good. If higher values of variance components are found for particular facets, then they need to be examined in more detail. This might lead to better training for examiners or modifying items in checklists or the number of stations. Given the high G-coefficient shown with these hypothetical data, we could in principle reduce the values of k for individual facets whilst maintaining a reasonably high value of G and hence maintaining the reliability of the OSCE exam. In the real world of OSCEs, this could lead to simplifications and a reduction in the cost of OSCE examining. As for Cronbach's alpha statistic, there are different views concerning acceptable values for G ranging from 0.7 to 0.95 (Tavakol and Dennick 2011a, b). This ability to manipulate the generalisability equation in order to see how examination factors can influence sources of measurement error and hence reliability lies at the heart of decision study or D-study (Raykov & Marcoulides 2011). Thus G-theory and D-study provide a greater insight into the various processes occurring in examinations, hidden by merely measuring Cronbach's alpha statistic. This enables assessors to improve the quality of assessments in a much more specific and evidence-based way.



The IRT and Rasch modelling

CTT tells us little about student ability or about how items behave across different levels of ability. IRT focuses on exactly this, and such analyses can also be used to build stronger item banks for CAT.

Test constructors have traditionally quantified the reliability of exam tests using the CTT model. For example, they use item analysis (item difficulty and item discrimination), traditional reliability coefficients (e.g. KR-20 or Cronbach's alpha), item-total correlations and factor analysis to examine the reliability of tests. We have just shown how G-theory can be used to make more elaborate analyses of examination conditions with a view to monitoring and improving reliability. CTT focuses on the test and its errors but says little about how student ability interacts with the test and its items (Raykov & Marcoulides 2011). On the other hand, the aim of IRT is to measure the relationship between the student's ability and the item's difficulty level to improve the quality of questions. Analyses of this type can also be used to build up better question banks for Computer Adaptive Testing (CAT).



Consider a student taking an exam in anatomy. The probability that the student can answer item 1 correctly is affected by the student's anatomy ability and the item's difficulty level. If the student has a high level of anatomical knowledge, the probability that he/she will answer the item 1 correctly is high. If an item has a low index of difficulty (i.e. a hard item), the probability that the student will answer the item correctly is low. IRT attempts to analyse these relationships using student test scores plus factors (parameters) such as item difficulty, item discrimination, item fairness, guessing and other student attributes such as gender or year of study. In an IRT analysis, graphs are produced showing the relationship between student ability and the probability of correct item responses, as well as item maps depicting the calibrations of student abilities with the above parameters. Also tables showing ‘fit’ statistics for items and students, to be described later.


Depending on the number of parameters, there are 1PL (the Rasch model), 2PL and 3PL models.

A variety of forms of IRT have been introduced. If we wish to look at the relationship between item difficulty and student ability alone, we use the one-parameter logistic IRT (1PL). This is called the Rasch model in honour of the Danish statistician who promoted it in the 1960s. The Rasch model assesses the probability that a student will answer an item correctly given their conceptual ability and the item difficulty. Two-parameter IRT (2PL) or three-parameter IRT (3PL) are also available where further parameters such as item discrimination, item difficulty, gender or year of study can be included. For the purposes of this article, we are going to concentrate on 1PL or Rasch modelling.


In Rasch modelling, student ability is standardised to a mean of 0, as is item difficulty. After standardisation, a student with a score of 0 has exactly average ability, and a score of 1.5 means 1.5 SDs above the mean; similarly, an item with a difficulty of 0 is an item of average difficulty.

In Rasch modelling, the scores of students' ability and the values of item difficulty are standardised to make interpretation easier. After standardisation, the mean student ability level is set to 0 and the SD is set to 1. Similarly, the mean item difficulty level is set to 0 and the SD is set to 1. Therefore, after standardisation, a student who receives a mean score of 0 has an average ability for the items being assessed; with a score of 1.5, the student's ability is 1.5 SDs above the mean. Similarly, an item with a difficulty of 0 is considered an average item and an item with a difficulty of 2 is considered to be a hard item. In general, if the value for a given item is positive, that item is difficult for that cohort of students, and if the value is negative, that item is easy (Nunnally & Bernstein 1994).


To standardise the student ability and item difficulty, consider Table 10, which presents simulated dichotomous data for seven items on an anatomy test taken by seven students, showing the ability of each student and the difficulty level of each of the seven items. To calculate the ability of a student, which is called θ, the natural logarithm of the ratio of the fraction correct to the fraction incorrect (or 1 − fraction correct) for that student is taken. For example, the ability of student 2 (θ2) is calculated as follows:
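In symbols, the calculation described above takes the logit form below, with the fraction correct for each student read from Table 10:

$$
\theta = \ln\!\left(\frac{\text{fraction correct}}{1 - \text{fraction correct}}\right)
$$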








This student's ability is above the mean.

This indicates that the ability of student 2 is 0.89 SDs above the mean. To calculate the difficulty level of each item, which is called b, the natural log of the ratio of the fraction incorrect (or 1 − fraction correct) to the fraction correct for each item is calculated. For example, the difficulty of item 1 is calculated as follows:
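In symbols, again using the fractions from Table 10:

$$
b = \ln\!\left(\frac{1 - \text{fraction correct}}{\text{fraction correct}}\right)
$$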




A value of −1.73 means the item was easy.

A value of −1.73 suggests that the item is relatively easy. This standardisation process is carried out for all students and all items and can easily be facilitated in an Excel spreadsheet (Table 10).






The probability that a student of a given ability answers an item of a given difficulty correctly can be estimated as follows.

We are now in a position to estimate the probability that a student with a specific ability will correctly answer a question with a specific item difficulty. For 1PL, the following equation is used to estimate the probability:
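The worked examples in the next paragraph follow the standard 1PL (Rasch) form:

$$
P(\text{correct} \mid \theta, b) = \frac{1}{1 + e^{-(\theta - b)}}
$$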


So the probability that student 1 answers item 1 correctly can be computed from this formula. When a student's ability equals an item's difficulty, as for student 3, the probability of a correct answer is 50%, the same as chance. The basic aim of Rasch analysis is to create items whose difficulty matches student ability; put simply, to match 'clever' students with 'clever' items.

Here p is the probability, θ is the student ability and b the item difficulty. Referring to Table 10, the ability of student 1 is −0.28 (slightly below average) and item 1, with a difficulty level of −1.73, is an easy item. On the basis of the above formula, the probability that student 1 will answer item 1 correctly is 1/(1 + e^(−(−0.28 − (−1.73)))) = 1/(1 + e^(−1.45)) ≈ 0.81. Considering student 3's ability level and the difficulty of item 4, the probability that the student will answer item 4 correctly is 1/(1 + e^(−(0.28 − 0.28))) = 1/(1 + e^0) = 0.50. This shows that when the level of student ability and the level of item difficulty are matched, the probability that the student will select the correct answer is 50%, which is equal to chance. The fundamental aim of Rasch analysis is to create test items whose degree of difficulty matches student ability. In simple terms, the 'cleverness' of the students should be matched with the 'cleverness' of the items. To examine the relationship between student ability and item difficulty further, Table 11 shows the probability (p) that a student will answer item 1, with item difficulty (b), correctly, given their ability (θ), using data taken from Table 10 and the equation above.


Item characteristic curves

In Rasch analysis, the relationship between item difficulty and student ability is depicted graphically in an item characteristic curve (ICC) shown in Figure 6.
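Curves of this kind are straightforward to generate. The short Python sketch below plots 1PL item characteristic curves for three illustrative difficulty values; these are not the items of Figure 6.

```python
# Illustrative 1PL (Rasch) item characteristic curves.
import numpy as np
import matplotlib.pyplot as plt

def p_correct(theta, b):
    """Rasch probability of a correct response for ability theta and difficulty b."""
    return 1.0 / (1.0 + np.exp(-(theta - b)))

theta = np.linspace(-4, 4, 200)            # student ability in logits
for b, label in [(-1.73, "easy item (b = -1.73)"),
                 (0.0, "average item (b = 0)"),
                 (1.5, "hard item (b = 1.5)")]:
    plt.plot(theta, p_correct(theta, b), label=label)

plt.axhline(0.5, linestyle="--", linewidth=0.8)   # 50% probability reference line
plt.xlabel("Student ability (theta, logits)")
plt.ylabel("Probability of a correct answer")
plt.legend()
plt.show()
```

An easy item (such as b = −1.73) shifts the curve to the left; a hard item shifts it to the right, matching the interpretation of Figures 6 and 7.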






In Figure 6, dotted lines are drawn to interpret the characteristics of item 1. There is a 50% probability that students with an ability of −1.85 will answer this question correctly; that is, even a student of well-below-average ability has an even chance of answering it correctly. In addition, a student of average ability (θ = 0) has about an 80% chance of giving a correct answer. The implication is that this question is too easy. It should be noted that the curve for an easy item is shifted to the left along the theta axis, whereas a hard item shifts the curve to the right. Examples of ICC curves for items taken from the examination analysed in Figure 8 are displayed in Figure 7. Figure 7(a) shows a difficult question (Question 101) and Figure 7(b) shows an easy question (Question 3). Figure 7(c) shows the 'perfect' question (Question 46), for which students of average ability have a 50% chance of giving the correct answer.








Item-student maps

Figure 8 can be read in two halves. The left side shows student ability and indicates that all students are of above-average ability; the right side shows item difficulty. Some items are very hard and some very easy, but overall the students are 'cleverer' than the items.

The distribution of students' ability and the difficulty of each item can also be presented on an item–student map (ISM). Using IRT software programmes such as Winsteps® (Linacre 2011), item difficulty and student ability can be calculated and displayed together. Figure 8 shows the ISM using data from a knowledge-based test. The map is split into two sides: the left side indicates the ability of the students, whereas the right side shows the difficulty of each item. The ability of each student is represented by a 'hash' (#) or 'dot' (.); items are shown by their item number. Item difficulty and student ability values are transformed mathematically, using natural logarithms, into an interval scale whose units of measurement are termed 'logits'. With a logit scale, differences between values can be quantified and equal distances on the scale are of equal size (Bond & Fox 2007). Higher values on the scale imply both greater item difficulty and greater student ability. The letters 'M', 'S' and 'T' represent the mean, one standard deviation and two standard deviations of item difficulty and student ability, respectively. The mean item difficulty is set to 0; therefore, for example, items 46, 18 and 28 have item difficulties of 0, 1 and −1, respectively. A student with an ability of 0 logits has a 50% chance of answering items 46, 60 or 69 correctly. The same student has a greater than 50% probability of correctly answering less difficult items, for example items 28 and 62, and a less than 50% probability of correctly answering more difficult items, such as items 64 and 119.


By looking at the ISM in Figure 8 we can now interpret the properties of the test. First, the student distribution shows that the ability of students is above the average, whereas more than half of the items have difficulties below the average. Second, the students on the upper left side are ‘cleverer’ than the items on the lower right side meaning that the items were easy and unchallenging. Third, most students are located opposite items to which they are well matched on the upper right and there are no students on the lower left side. However, items 101, 40, 86 and 29 are too difficult and beyond the ability of most students.


Overall, in this example, the students are ‘cleverer’ than most of the items. Many items in the lower right hand quadrant are too easy and should be examined, modified or deleted from the test. Similarly, some items are clearly too difficult. The advantage of Rasch analysis is that it produces a variety of data displays encapsulating both student and item characteristics that enable test developers to improve the psychometric properties of items. By matching items to student ability, we can improve the authenticity and validity of items and develop higher quality item banks, useful for the future of computer adapted testing.



Conclusions

Objective tests, as well as OSCE stations, should be psychometrically sound instruments for measuring the proficiency of students, and this Guide should be of use to medical educators interested in the practical use of such examinations. In this Guide, we have tried to explain simply how to interpret the outcomes of psychometric analyses of objective test data. Examination tests should be standardised both nationally and locally, and we need to be confident of their psychometric soundness. A natural question is to what extent our exam data measure student ability, that is, to what extent the students have learned the subject matter. The interpretation of exam data using psychometric methods is central to understanding students' competence in a subject and to identifying students of low ability. Furthermore, these methods can be employed for test validation research. We would suggest that medical teachers, especially those not trained in psychometric methods, practise these methods on hypothetical data and then analyse their own real exam data in order to improve the quality of their examinations.











2012;34(3):e161-75. doi: 10.3109/0142159X.2012.651178. PMID: 22364473.

Post-examination interpretation of objective test data: monitoring and improving the quality of high-stakes examinations: AMEE Guide No. 66.

Author information: University of Nottingham, UK.

Abstract

The purpose of this Guide is to provide both logical and empirical evidence for medical teachers to improve their objective tests by appropriate interpretation of post-examination analysis. This requires a description and explanation of some basic statistical and psychometric concepts derived from both Classical Test Theory (CTT) and Item Response Theory (IRT) such as: descriptive statistics, exploratory and confirmatory factor analysis, Generalisability Theory and Rasch modelling. CTT is concerned with the overall reliability of a test whereas IRT can be used to identify the behaviour of individual test items and how they interact with individual student abilities. We have provided the reader with practical examples clarifying the use of these frameworks in test development and for research purposes.


Assessment-related theories: AMEE Guide No. 57

General overview of the theories used in assessment: AMEE Guide No. 57

LAMBERT W. T. SCHUWIRTH1 & CEES P. M. VAN DER VLEUTEN2

1Flinders University, Australia, 2Maastricht University, The Netherlands





Introduction

Assessment is high on everyone's agenda, and this is not surprising.

It is our observation that when the subject of assessment in medical education is raised, it is often the start of extensive discussions. Apparently, assessment is high on everyone's agenda. This is not surprising because assessment is seen as an important part of education in the sense that it not only defines the quality of our students and our educational processes, but it is also seen as a major factor in steering the learning and behaviour of our students and faculty.


Discussions about assessment, however, often rest on tradition and intuition. Heeding tradition is not necessarily a bad thing; as George Santayana put it, 'those who do not learn from history are doomed to repeat it'.

Arguments and debates on assessment, however, are often strongly based on tradition and intuition. It is not necessarily a bad thing to heed tradition. George Santayana already stated (quoting Burk) that Those who do not learn from history are doomed to repeat it.1 So, we think that an important lesson is also to learn from previous mistakes and avoid repeating them.


Intuition, too, is not something to be set aside capriciously; it is often a strong force in changing people's behaviour. Equally, however, intuition frequently fails to match research findings.

Intuition is also not something to put aside capriciously, it is often found to be a strong driving force in the behaviour of people. But again, intuition is not always in concordance with research outcomes. Some research outcomes in assessment are somewhat counter intuitive or at least unexpected. Many researchers may not have exclaimed Eureka but Hey, that is odd instead.


Two important tasks follow from this. First, to avoid repeating mistakes, we need to evaluate critically whether what we do traditionally and routinely is still worth doing. Second, to correct mistaken intuitions, we must be able to translate research findings into appropriate methods and approaches.

This leaves us, as assessment researchers, with two very important tasks. First, we need to critically study which common and tradition-based practices still have value and consequently which are the mistakes that should not be repeated. Second, it is our task to translate research findings to methods and approaches in such a way that they can easily help changing incorrect intuitions of policy makers, teachers and students into correct ones. Both goals cannot be attained without a good theoretical framework in which to read, understand and interpret research outcomes. The purpose of this AMEE Guide is to provide an overview of some of the most important and most widely used theories pertaining to assessment. Further Guides in assessment theories will give more detail on the more specific theories pertaining to assessment.


Unfortunately, as in many other scientific fields, assessment in medical education has no single underlying theory and instead draws on a variety of theories from adjacent fields. In addition, theoretical frameworks more directly relevant to the assessment of the health professions have been developed, the most important of which is the distinction between 'assessment of learning' and 'assessment for learning'.

Unfortunately, like many other scientific disciplines, medical assessment does not have one overarching or unifying theory. Instead, it draws on various theories from adjacent scientific fields, such as general education, cognitive psychology, decision-making and judgement theories in psychology and psychometric theories. In addition, there are some theoretical frameworks evolving which are more directly relevant to health professions assessment, the most important of which (in our view) is the notion of ‘assessment of learning’ versus ‘assessment for learning’ (Shepard 2009).


In this AMEE Guide we will present the theories that have featured most prominently in the medical education literature in the recent four decades. Of course, this AMEE Guide can never be exhaustive; the number of relevant theoretical domains is simply too large, nor can we discuss all theories to their full extent. Not only would this make this AMEE Guide too long, but also this would be beyond its scope, namely to provide a concise overview. Therefore, we will discuss only the theories on the development of medical expertise and psychometric theories, and then end by highlighting the differences between the assessment of learning and assessment for learning. As a final caveat, we must say here that this AMEE Guide is not a guide to methods of assessment. We assume that the reader has some prior knowledge about this or we would like to refer to specific articles or to text books (e.g. Dent & Harden 2009).



Theories of how (medical) experts are made

Theories on the development of (medical) expertise

What characterises an expert in medicine? What distinguishes novices from experts? These questions inevitably arise, because if you do not know what you are assessing, you cannot decide how to assess it.

What distinguishes someone as an expert in the health sciences field? What do experts do differently compared to novices when solving medical problems? These are questions that are inextricably tied to assessment, because if you do not know what you are assessing it also becomes very difficult to know how you can best assess.


It may seem self-evident that one becomes an expert by gaining experience and learning.

It may be obvious that someone can only become an expert through learning and gaining experience.


One of the earliest studies was de Groot's work on why chess grandmasters became grandmasters and what they did differently from good amateurs. He initially thought it was because they could think more moves ahead, but this was not the case; both groups looked about seven moves ahead. What de Groot found instead was that grandmasters were better at remembering the positions of the pieces on the board: he and his successors showed that grandmasters could accurately reproduce board positions after only a brief glance.

One of the first to study the development of expertise was de Groot (1978), who wanted to explore why chess grandmasters became grandmasters and what made them differ from good amateur chess players. His first intuition was that grandmasters were grandmasters because they were able to think more moves ahead than amateurs. He was surprised, however, to find that this was not the case; players in both expertise groups did not think further ahead than roughly seven moves. What he found, instead, was that grandmasters were better able to remember positions on the board. He and his successors (Chase & Simon 1973) found that grandmasters were able to reproduce positions on the board more accurately, even after very short viewing times. Even after having seen a position for only a few seconds, they were able to reproduce it with much greater accuracy than amateurs.


One might think this simply means they have superior memory, but that is not the case. Human working memory holds about seven units (plus or minus two), and this capacity cannot be improved by learning.

One would think then that they probably had superior memory skills, but this is not the case. The human working memory has a capacity of roughly seven units (plus or minus two) and this cannot be improved by learning (Van Merrienboer & Sweller 2005, 2010).


The most salient difference was not the number of units that could be held in working memory, but the amount of information each unit contained.

The most salient difference between amateurs and grandmasters was not the number of units they could store in their working memory, but the richness of the information in each of these units.


When information is stored in one's native language, every word and some fixed expressions are stored as single units, because they link directly to memories already present in long-term memory. With a foreign language that has no entries in long-term memory, part of one's cognitive resources must be spent memorising the characters themselves. Storing information as units that each carry more information is called chunking, and it is a key element of what expertise is and how it develops.

To illustrate this, imagine having to copy a text in your own language, then a text in a foreign Western European language and then one in a language that uses a different character set (e.g. Cyrillic). It is clear that copying a text in your own language is easiest and copying a text in a foreign character set is the most difficult. While copying you have to read the text, store it in your memory and then reproduce it on the paper. When you store the text in your native language, all the words (and some fixed expressions) can be stored as one unit, because they relate directly to memories already present in your long-term memory. You can spend all your cognitive resources on memorising the text. In the foreign character set you will also have to spend part of your cognitive resources on memorising the characters, for which you have no prior memories (schemas) in your long-term memory. A medical student who has just started his/her study will have to memorise all the signs and symptoms when consulting a patient with heart failure, whereas an expert can almost store it as one unit (and perhaps only has to store the findings that do not fit to the classical picture or mental model of heart failure). This increasing ability to store information as more information-rich units is called chunking and it is a central element in expertise and its development. Box 1 provides an illustration of the role of chunking.




So, why were the grandmasters better than good amateurs? Well, mainly because they possessed much more stored information about chess positions than amateurs did, or in other words, they had acquired so much more knowledge than the amateurs had.


If there is one lesson to draw from these chess studies, replicated in many other fields, it is that a rich, well-organised knowledge base is the foundation of successful problem solving.

If there is one lesson to be drawn from these early chess studies – which have been replicated in such a plethora of other expertise domains that it is more than reasonable to assume that these findings are generic – it is that a rich and well-organised knowledge base is essential for successful problem solving (Chi et al. 1982; Polsen & Jeffries 1982).


The next question is what 'well-organised' means. Fundamentally, organisation allows new information to be stored quickly and retained longer, and allows relevant information to be retrieved when it is needed. Although the computer is often compared to the human brain, humans do not use a File Allocation Table; this illustrates how difficult it is for us to store new information when there is no existing knowledge to which it can be linked.

The next question then would be: What does ‘well-organised’ mean? Basically, it comes down to organisation that will enable the person to store new information rapidly and with good retention and to be able to retrieve relevant information when needed. Although the computer is often used as a metaphor for the human brain (much like the clock was used as a metaphor in the nineteenth century), it is clear that information storage on a hard disk is very much different from human information storage. Humans do not use a File Allocation Table to index where the information can be found, but have to embed information in existing (semantic) networks (Schmidt et al. 1990). The implication of this is that it is very difficult to store new information if there is no existing prior information to which it can be linked. Of course, the development of these knowledge networks is quite individualised, and based on the individual learning pathways and experiences. For example, we – the authors of this AMEE Guide – live in Maastricht, so our views, connotations and association with ‘Maastricht’ differ entirely from those of most of the readers of the AMEE Guides, although we may share the knowledge that it is a city (and perhaps that it is in the Netherlands) and that there is a university with a medical school, the rest of the knowledge is much more individualised.


Knowledge is highly domain-specific: a person can know a great deal about one topic and almost nothing about another. Because expertise rests on a well-organised knowledge base, expertise is domain-specific as well. For assessment, this means that a person's performance on one case or item is a poor predictor of performance on other cases or items, so one should never rely on limited assessment information; a high-stakes decision based on a single case is highly unreliable.

Knowledge generally is quite domain specific (Elstein et al. 1978; Eva et al. 1998); a person can be very knowledgeable on one topic and a lay person on another, and because expertise is based on a well-organised knowledge base, expertise is domain specific as well. For assessment, this means that the performance of a candidate on one case or item of a test is a poor predictor for his or her performance on any other given item or case in the test. Therefore, one can never rely on limited assessment information, i.e. high-stakes decisions made on the basis of a single case (e.g. a high-stakes final VIVA) are necessarily unreliable.


A second key lesson is that problem solving is idiosyncratic. Whereas the domain specificity discussed above means that the same person's performance varies across cases, idiosyncrasy means that different experts may solve the same case in different ways. This is logical given that the way knowledge is organised differs between individuals. The assessment implication is that, when 'diagnosing' a candidate's expertise, assessing the process of problem solving is less informative than assessing its outcome.

A second important and robust finding in the expertise literature – more specifically, the diagnostic expertise literature – is that problem-solving ability is idiosyncratic (cf. e.g. the overview paper by Swanson et al. 1987). Domain specificity, which we discussed above, means that the performance of the same person varies considerably across various cases; idiosyncrasy here means that the way different experts solve the same case varies substantially between experts. This is also logical, keeping in mind that the way knowledge is organised is highly individual. The assessment implication is that, when trying to capture, for example, the diagnostic expertise of candidates, the process may be less informative than the outcome, as the process is idiosyncratic (and, fortunately, the outcome of the reasoning process is much less so).


The third and probably most important issue is transfer, which is closely related to domain specificity and idiosyncrasy. Transfer is the extent to which the problem-solving approach a person can apply to one problem can also be applied to other problems; it requires recognising the similarity between two different problems and the principle that applies to both.

A third and probably most important issue is the matter of transfer (Norman 1988; Regehr & Norman 1996; Eva 2004). This is closely related to the previous issue of domain specificity and idiosyncrasy. Transfer pertains to the extent to which a person is able to apply a given problem-solving approach to different situations. It requires that the candidate understands the similarities between two different problem situations and recognises that the same problem-solving principle can be applied. Box 2 provides an illustration (drawn from a personal communication with Norman).




The specific situations presented in the two problems are their 'surface features', and the principle underlying both problems is their 'deep structure'. In this context, transfer is the ability to identify the deep structure without being blinded by the surface features.

Most often, the first problem is not recognised as being essentially the same as the second, nor that the problem-solving principle is the same. Both solutions lie in splitting the total load into several parts. In problem 1, the 1000 W laser beam is replaced by 10 rays of 100 W each, converging exactly on the spot where the filament was broken. In the second problem the solution is more obvious: build five bridges and then let your men run onto the island. If the problem were presented as 'you want to irradiate a tumour but do minimal harm to the skin above it', it would probably be recognised even more readily by physicians. The specific presentation of these problems is labelled the surface features of the problem, and the underlying principle is referred to as the deep structure of the problem. Transfer exists by virtue of the expert's ability to identify the deep structure and not be blinded by the surface features.


One of the most widely used theories of the development of medical expertise holds that it starts with the accumulation of isolated facts, which are then combined into meaningful semantic networks. These networks are further condensed into denser illness scripts and, with years of experience, become instance scripts that allow a particular diagnosis to be recognised instantly. The difference between illness scripts (congealed patterns of a particular diagnosis) and instance scripts is that instance scripts also incorporate contextual features that a lay person might overlook, including the patient's appearance or even an odour.

One of the most widely used theories on the development of medical expertise is the one suggested by Schmidt, Norman and Boshuizen (Schmidt 1993; Schmidt & Boshuizen 1993). Generally put, this theory postulates that the development of medical expertise starts with the collection of isolated facts which further on in the process are combined to form meaningful (semantic) networks. These networks are then aggregated into more concise or dense illness scripts (for example pyelonephritis). As a result of many years of experience, these are then further enriched into instance scripts, which enable the experienced clinician to recognise a certain diagnosis instantaneously. The most salient difference between illness scripts (that are a sort of congealed patterns of a certain diagnosis) and instance scripts is that in the latter contextual, and for the lay person sometimes seemingly irrelevant, features are also included in the recognition. Typically, these include the demeanour of the patient or his/her appearance, sometimes even an odour, etc.


Important lessons for assessment.

These theories then provide important lessons for assessment:


  1. Do not rely on short tests. The domain specificity problem informs us that high-stakes decisions based on short tests or tests with a low number of different cases are inherently flawed with respect to their reliability (and therefore also validity). Keep in mind that unreliability is a two-way process, it does not only imply that someone who failed the test could still have been satisfactorily competent, but also that someone who passed the test could be incompetent. The former candidate will remain in the system and be given a re-sit opportunity, and this way the incorrect pass–fail decision can be remediated, but the latter will escape further observation and assessment, and the incorrect decision cannot be remediated again.
  2. For high-stakes decisions, asking for the process is less predictive of the overall competence than focussing on the outcome of the process. This is counterintuitive, but it is a clear finding that the way someone solves a given problem is not a good indicator for the way in which she/he will solve a similar problem with different surface features; she/he may not even recognise the transfer. Focussing on multiple outcomes or some essential intermediate outcomes – such as with extended-matching questions, key-feature approach assessment or the script concordance test – is probably better than in-depth questioning the problem-solving process (Bordage 1987; Case & Swanson 1993; Page & Bordage 1995; Charlin et al. 2000).
  3. Assessment aimed only at reproduction will not help to foster the emergence of transfer in the students. This is not to say that there is no place for reproduction-orientated tests in an assessment programme, but they should be chosen very carefully. When learning arithmetic, for example, it is okay to focus the part of the assessment pertaining to the tables of multiplication on reproduction, but with long multiplications, focussing on transfer (in this case, the algorithmic transfer) is much more worthwhile.
  4. When new knowledge has to be built into existing semantic networks, learning needs to be contextual. The same applies to assessment. If the assessment approach is to be aligned with the educational approach, it should be contextualised as well. So whenever possible, set assessment items, questions or assignments in a realistic context.



Psychometric theories

Whatever purpose an assessment may pursue in an assessment programme, it always entails a more or less systematic collection of observations or data to arrive at certain conclusions about the candidate. The process must be both reliable and valid. Especially, for these two aspects (reliability and validity) psychometric theories have been developed. In this chapter, we will discuss these theories.


Validity

The conception of validity has changed several times over the past century.

Simply put, validity pertains to the extent to which the test actually measures what it purports to measure. In the recent century, the central notions of validity have changed substantially several times. 


The first theories of validity were close to the notions of criterion or predictive validity. This is not entirely illogical; it resembles the question many teachers ask: 'does this really produce better doctors?'. But as long as there is no single, adequate, measurable criterion for a 'good doctor', this question cannot be answered. This is exactly the difficulty of defining validity in terms such as the 'good doctor'. There is also the problem of validating the criterion itself: if a researcher proposed a measure of the 'good doctor' and used it as the criterion for a particular assessment, that measure would in turn need to be validated, which would require yet another criterion, and so on indefinitely.

The first theories on validity were largely based on the notion of criterion or predictive validity. This is not illogical as the intuitive notion of validity is one of whether the test predicts an outcome well. The question that many medical teachers ask when a new assessment or instructional method is suggested is: But does this produce better doctors?. This question – however logical – is unanswerable in a simple criterion-validity design as long as there is no good single measureable criterion for good ‘doctorship’. This demonstrates exactly the problem with trying to define validity exclusively in such terms. There is an inherent need to validate the criterion as well. Suppose a researcher was to suggest a measure to measure ‘doctorship’ and to use it as the criterion for a certain assessment, then she/he would have to validate the measure for ‘doctorship’ as well. If this again were only possible through criterion validity, it would require the research to validate the criterion for the criterion as well – etcetera ad infinitum.


두 번째의 직관적 접근법은 수행을 직접 관찰하여 판단하는 것이다. 예컨대 플룻 연주 실력을 평가하는 것은 복잡하지 않다. 플룻 전문가 패널을 구성하여 각 지원자의 연주를 평가하게 하면 된다. 물론 각 지원자의 연주가 다양한 범위의 음악을 포괄하도록 일종의 블루프린팅이 필요할 것이다. 오케스트라 지원자라면 해당 오케스트라 레퍼토리의 모든 고전음악 양식이 다뤄지도록 해야 할 것이다. 이러한 형태의 내용타당도(혹은 직접타당도)는 타당도 검증 절차에서 중요한 역할을 해왔고 지금도 그러하다.

A second intuitive approach would be to simply observe and judge the performance. If one, for example, wishes to assess flute-playing skills, the assessment is quite straightforward. One could collect a panel of flute experts and ask them to provide judgements for each candidate playing the flute. Of course, some sort of blueprinting would then be needed to ensure that the performances of each candidate would entail music in various ranges. For orchestral applicants, it would have to ensure that all classical music styles of the orchestra's repertoire would be touched upon. Such forms of content validity (or direct validity) have played an important role and still do in validation procedures.


그러나 우리가 학생에 대해 평가하고자 하는 것의 대부분은 이렇게 명확하게 관찰가능한 것이 아니며, 관찰로부터 추론해야 한다. 지능이나 신경증 같은 특성만 눈에 보이지 않는 잠재특성(latent trait)인 것이 아니라, 지식, 문제해결능력, 프로페셔널리즘 등도 마찬가지다. 직접 관찰이 불가능하고, 관찰된 행동에 기반한 가정으로서만 평가할 수 있다.

However, most aspects of students we want to assess are still not clearly visible and need to be inferred from observations. Not only are characteristics such as intelligence or neuroticism invisible (so-called latent) traits, but so are elements such as knowledge, problem-solving ability, professionalism, etc. They cannot be observed directly and can only be assessed as assumptions based on observed behaviour.


Cronbach와 Meehl은 당시 아직 새로운 개념이었던 구인타당도(construct validity)를 정교화하였는데, 그들에 따르면 구인타당도 검증은 귀납적·경험적 과정과 유사하다. 연구자는 먼저 시험이 측정하고자 하는 구인(construct)에 대하여 명확한 이론과 개념을 정의하거나 상정해야 한다. 그리고 나서 시험을 설계하고 그 데이터를 비판적으로 평가하여, 데이터가 구인에 대한 이론적 개념을 지지하는지 확인한다.

In an important paper, Cronbach and Meehl (1955) elaborated on the then still young notion of construct validity. In their view, construct validation should be seen as analogous to the inductive empirical process; first the researcher has to define, make explicit or postulate clear theories and conceptions about the construct the test purports to measure. Then, she/he must design and carry through a critical evaluation of the test data to see whether they support the theoretical notions of the construct. An example of this is provided in Box 3.



이것은 '중간 효과'라고도 불리는데, 검사의 타당도에 대한 가정을 반증하는 중요한 요인이다.

The so-called ‘intermediate effect’, as described in the example (Box 3) (especially when it proves replicable) is an important falsification of the assumption of validity of the test.


여기서 배울 교훈은 다음과 같다.

We have used this example deliberately, and there are important lessons that can be drawn from it. 


이러한 중간 효과가 존재한다는 것은 (특히 반복 검증될 경우) 타당도 가정을 강하게 반증하는 근거가 된다. 현재는 타당도 검증 절차가 타당도 가정을 반증할 확률을 최대화하도록 설계된 '실험'이나 관찰을 포함해야 한다고 보며(Popper의 반증주의와 유사하다), 검사의 타당성을 지지하는 근거는 이러한 결정적 '관찰'로부터 나와야 한다. 의학에 비유하자면, 특정 질병의 존재를 최대한 확실하게 확인하려면 질병이 없을 때 검사 결과가 음성으로 나올 가능성이 최대한 높은 검사를 사용해야 하는 것과 같다. '약한' 실험에서 나온 확증적 근거는 타당도 가정에 기여하지 못한다.

First, it demonstrates that the presence of such an intermediate effect in this case is a powerful falsification of the assumption of validity. This is highly relevant, as currently it is generally held that a validation procedure must contain ‘experiments’ or observations which are designed to optimise the probability of falsifying the assumption of validity (much like Popper's falsification principle2). Evidence supporting the validity must therefore always arise from critical ‘observations’. There is a good analogy to medicine or epidemiology. If one wants to confirm the presence of a certain disease with the maximum likelihood, one must use the test with the maximum chance of being negative when disease is absent (the maximum specificity). Confirming evidence from ‘weak’ experiments therefore does not contribute to the validity assumption.


두 번째로, 흔한 오해 중 하나와 달리 '실제성(authenticity)'이 곧 '타당성'을 의미하지는 않는다. 평가프로그램에 authentic한 시험을 포함하거나 authenticity를 높이려는 데에는 좋은 이유들이 있으나, 그 부가가치는 총괄평가보다는 형성평가 기능에서 더 두드러진다. 이러한 상황을 가정해보자. 진료하는 의사의 일상적 수행의 질을 평가하기 위해서 실제 진료 장면을 다수 관찰할 수도 있고, 차트(기록과 노트), 검사 오더, 의뢰 자료를 광범위하게 리뷰할 수도 있다. 후자가 분명 덜 authentic하지만, 일상적 진료에 대해서는 더 valid한 평가라고 주장할 수 있다. 첫 번째 방식에서 나타날 수 있는 '관찰자 효과'는 의사의 행동에 영향을 줘서 실제 일상 수행에 대해 편향된 그림을 보여줄 수 있기 때문이다.

Second, it demonstrates that authenticity is not the same as validity, which is a popular misconception. There are good reasons in assessment programmes to include authentic tests or to strive for high authenticity, but the added value is often more prominent in their formative than in their summative function. An example may illustrate this: Suppose we want to assess the quality of the day-to-day performance of a practising physician and we had the choice between observing him/her in many real-life consultations or extensively reviewing charts (records and notes), ordering laboratory tests and referral data. The second option is clearly less authentic than the first one but it is fair to argue that the latter is a more valid assessment of the day-to-day practice than the former. The observer effect, for example, in the first approach may influence the behaviour of the physician and thus draw a biased picture of the actual day-to-day performance, which is clearly not the case in the charts, laboratory tests and referral data review.


세 번째로, 타당도는 검사 자체의 속성이 아니며, 그 검사가 의도한 특성을 평가하는 정도에 대한 것이다. Box 3의 예시가 데이터 수집의 빈틈없음(thoroughness)을 측정하는 것이 목적이었다면 타당한 평가였겠지만, 전문성을 측정하기 위한 것이었다면 정보 수집과 활용의 효율성을 구인의 핵심 요소로서 포함하지 못한 것이다.

Third, it clearly demonstrates that validity is not an entity of the assessment per se; it is always the extent to which the test assesses the desired characteristic. If the PMPs in the example in Box 3 were aimed at measuring thoroughness of data gathering – i.e. to see whether students are able to distinguish all the relevant data from non-relevant data – they would have been valid, but if they are aimed at measuring expertise they failed to incorporate efficiency of information gathering and use as an essential element of the construct.


Current views (Kane 2001, 2006) highlight the argument-based inferences that have to be made when establishing validity of an assessment procedure.


요약하자면, 관찰에서 점수로, 관찰점수에서 완전체점수(universe score)로, 완전체점수에서 특정 영역으로, 특정 영역에서 구인으로의 순차적 추론이 이루어진다.

In short, inferences have to be made from observations to scores, from observed scores to universe scores (which is a generalisation issue), from universe scores to target domain and from target domain to construct.


혈압을 측정한다고 하면, 청진기의 소리(observation) => 혈압계의 수치(observed score) => (반복, 다른 상황에서의 측정 수치(universe score) => 환자의 심혈관계 상태(target domain) => 건강(construct)

To illustrate this, a simple medical example may be helpful: When taking a blood pressure as an assessment of someone's health, the same series of inferences must be made. When taking a blood pressure, the sounds heard through the stethoscope when deflating the cuff have to be translated into numbers by reading them from the sphygmomanometer. This is the translation from (acoustic and visual) observation to scores. Of course, one measurement is never enough (the patient may just have come running up the stairs) and it needs to be repeated, preferably under different circumstances (e.g. at home to prevent the ‘white coat’-effect). This step is equivalent to the inference from observed scores to universe scores. Then, there is the inference from the blood pressure to the cardiovascular status of the patient (often in conjunction with other signs and symptoms and patient characteristics) which is equivalent to the inference from universe score to target domain. And, finally this has to be translated into the concept ‘health’, which is analogous to the translation of target domain to construct. There are important lessons to be learnt from this.


  • 첫째, 타당도 검증은 논증에 기반하여 사례(case)를 구축하는 일이다. 이러한 논증은 가급적 타당도 연구의 결과에 기반하는 것이 좋지만, 그럴듯하지만(plausible) 반박될 수도 있는(defeasible) 주장을 포함할 수도 있다.
    First, validation is building a case based on argumentation. The argumentation is preferably based on outcomes of validation studies but may also contain plausible and/or defeasible arguments.
  • 평가에서 측정하고자 하는 명확한 정의나 구인에 대한 이론 없이는 평가에 대한 타당도를 검증할 수 없다. 따라서 특정 검사도구는 그 자체로 타당한 것이 아니라, 특정 구인을 검사하는데 타당한 것이다.
    Second, one cannot validate an assessment procedure without a clear definition or theory about the construct the assessment is intended to capture. So, an instrument is never valid per se but always only valid for capturing a certain construct.
  • 세 번째로, 타당도 검증은 끝나지 않으며, 종종 무수한 관찰과 기대와 비판적 실험이 필요하기도 하다.
    Third, validation is never finished and often requires a plethora of observations, expectations and critical experiments.
  • 마지막으로, 이러한 추론들을 이끌어내기 위해서는 일반화가능성이 필요하다.
    Fourth, and finally, in order to be able to make all these inferences, generalisability is a necessary step.


신뢰도 Reliability

신뢰도란 검사 점수가 재현 가능한 정도, 즉 한 응시자가 특정 검사에서 얻은 결과가 그 영역의 다른 검사에서도 동일하게 나올 것인가를 의미하며, 앞선 타당도 섹션에서 말한 '일반화' 단계에 대한 접근 중 하나이다. 그러나 일반화가 타당도 검증에 필요한 여러 단계 중 '단지' 하나라 하더라도, 이 일반화가 이루어지는 방식은 그 자체로 별도의 이론들의 대상이다. 이를 이해하기 위해서는 다음의 세 수준의 일반화를 구분하는 것이 도움이 된다.

Reliability of a test indicates the extent to which the scores on a test are reproducible, in other words, whether the results a candidate obtains on a given test would be the same if she/he were presented with another test or all the possible tests of the domain. As such, reliability is one of the approaches to the generalisation step described in the previous section on validity. But even if generalisation is ‘only’ one of the necessary steps in the validation process, the way in which this generalisation is made is subject to theories in its own right. To understand them, it may be helpful to distinguish three levels of generalisation.


첫 번째로, '평행 검사'의 개념이 필요한데, '평행 검사'라는 것은 유사한 내용의, 동일한 난이도의, 유사한 블루프린트의, 이상적으로는 동일한 학생에게 원 시험 직후에, 학생이 이전 검사에 의한 피로가 없다는 가정 하에 진행되는 가상의 검사이다.

First, however, we need to introduce the concept of the ‘parallel test’ because it is necessary to understand the approaches to reproducibility described below. A parallel test is a hypothetical test aimed at a similar content, of equal difficulty and with a similar blueprint, ideally administered to the same group of students immediately after the original test, under the assumption that the students would not be tired and that their exposure to the items of the original test would not influence their performance on the second.


세 종류의 일반화가 있다.

Using this notion of the parallel test, three types of generalisations are made in reliability, namely if the same group of students were presented with the original and the parallel test:



  1. 같은 학생이 두 시험에서 합/불합 하는가.
    Whether the same students would pass and fail on both tests.
  2. 1등부터 꼴등까지의 등수가 동일한가
    Whether the rank ordering from best to most poorly performing student would be the same on both the original and the parallel tests.
  3. 모든 학생이 동일한 점수를 받는가
    Whether all students would receive the same scores on the original and the parallel tests.


Three classes of theories are in use for this: classical test theory (CTT), generalisability theory (G-theory) and item response theory (IRT).


고전검사이론 Classical test theory

CTT is the most widely used theory. It is the oldest and perhaps easiest to understand. It is based on the central assumption that the observed score is a combination of the so-called true score and an error score (O = T + e).3 The true score is the hypothetical score a student would obtain based on his/her competence only. But, as every test will induce measurement error, the observed score will not necessarily be the same as the true score.


This in itself may be logical but it does not help us to estimate the true score. How would we ever know how reliable a test is if we cannot estimate the influence of the error term and the extent it makes the observed score deviate from the true score, or the extent to which the results on the test are replicable?


The first step in this is determining the correlation between the test and a parallel test (test–retest reliability). If, for example, one wanted to establish the reliability of a haemoglobin measurement one would simply compare the results of multiple measurements from the same homogenised blood sample, but in assessment this is not that easy. Even the ‘parallel test’ does not help here, because this is, in most cases, hypothetical as well.


The next step, as a proxy for the parallel test, is to randomly divide the test in two halves and treat them as two parallel tests. The correlation between those two halves (corrected for test length) is then a good estimate of the ‘true’ test–retest correlation. This approach, however, is also fallible, because it is not certain whether this specific correlation is a good exemplar; perhaps another subdivision in two halves would have yielded a completely different correlation (and thus a different estimate of the test–retest correlation). One approach is to repeat the subdivision as often as possible until all possibilities are exhausted and use the mean correlation as a measure of reliability. That is quite some work, so it is simpler and more effective to subdivide the test into as many parts as possible (the individual items) and calculate the correlations between them. This approach is a measure of internal consistency and the basis for the famous Cronbach's alpha. It can be taken as the mean of all possible split-half reliability estimates (cf. e.g. Crocker & Algina 1986).
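
To make this concrete, here is a minimal Python/numpy sketch (not from the Guide) that computes Cronbach's alpha for a small, made-up candidates-by-items matrix of dichotomous scores; the matrix and the helper name cronbach_alpha are purely illustrative.

```python
import numpy as np

def cronbach_alpha(scores: np.ndarray) -> float:
    """Cronbach's alpha for a (candidates x items) score matrix."""
    k = scores.shape[1]                         # number of items
    item_var = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)  # variance of the total scores
    return (k / (k - 1)) * (1 - item_var.sum() / total_var)

# Hypothetical 0/1 responses: 8 candidates x 5 items
scores = np.array([[1, 1, 1, 1, 0],
                   [1, 1, 1, 0, 0],
                   [1, 1, 0, 0, 0],
                   [1, 1, 1, 1, 1],
                   [1, 0, 1, 0, 0],
                   [0, 1, 0, 0, 0],
                   [1, 1, 1, 1, 0],
                   [1, 0, 0, 0, 0]], dtype=float)
print(f"Cronbach's alpha = {cronbach_alpha(scores):.2f}")
```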


Cronbach's alpha가 널리 사용되고 있기는 하지만, 이는 norm-referenced 관점(상대평가적 관점)에서만 올바르게 사용될 수 있다. Criterion-referenced 관점에서 Cronbach's alpha를 사용하면 신뢰도가 과대추정된다. Box 4에 설명되어 있다.

Although Cronbach's alpha is widely used, it should be noted that it remains an estimate of the test–retest correlation, so it can only be used correctly if conclusions are drawn at the level of whether the rank orderings between the original and the parallel tests are the same, i.e. a norm-referenced perspective. It does not take into account the difficulty of the items on the test, and because the difficulty of the items of a test influences the exact height of the score, using Cronbach's alpha in a criterion-referenced perspective overestimates the reliability of the test. This is explained in Box 4.



Although the notion of Cronbach's alpha is based on correlations, reliability estimates can range from 0 to 1. In rare cases, calculations could result in a value lower than zero, but this is then to be interpreted as being zero.


신뢰도에 대해서는 실제 점수와 연결해서 평가해야 한다. 신뢰도 0.9가 0.75보다 항상 좋은 것일까?

Although it is often helpful to have a measure of reliability that is normalised, in that for all data, it is always a number between 0 and 1, in some cases, it is also important to evaluate what the reliability means for the actual data. Is a test with a reliability of 0.90 always better than a test with a reliability of 0.75? Suppose we had the results of two tests and that both tests had the same cut-off score, for example 65%. The score distributions of both tests have a standard deviation (SD) of 5%, but the mean, minimum and maximum scores differ, as shown in Table 1.



Based on these data, we can calculate a 95% confidence interval (95%-CI) around each score or the cut-off score. For this, we need the standard error of measurement (SEM). In the beginning of this section, we showed the basic formula in CTT (observed score = true score + error). In CTT, the SEM is the SD of the error term or, more precisely put, the square root of the error variance. It is calculated as follows:

SEM = SD × √(1 − reliability)


커트라인은 65점으로 같은데 Test 1은 평균은 높지만 신뢰도가 낮고, Test 2는 평균은 낮지만 신뢰도가 높다. Test 1은 신뢰도는 낮지만 평균이 높아서 95% CI 안에 극히 일부 학생만 들어가는 반면, Test 2는 신뢰도가 높지만 평균이 낮아서 95% CI 안에 다수의 학생이 포함된다. 즉 Test 1은 낮은 신뢰도에도 불구하고 부정확한 pass-fail decision이 내려질 가능성이 더 낮은 것이다.

If we use this formula, we find that in test 1, the SEM is 2.5% and in test 2, it is 1.58%. The 95% CIs are calculated by multiplying the SEM by 1.96. So, in test 1 the 95% CI is ±4.9% and in test 2 it is ±3.09%. 

    • In test 1 the 95% CI around the cut-off score ranges from 60.1% to 69.9% but only a small proportion of the score of students falls into this 95% CI.4 This means that for those students we are not able to conclude, with a p ≤ 0.05, whether these students have passed or failed the test. 
    • In test 2, the 95% CI ranges from 61.9% to 68.1% but now many students fall into the 95% CI interval. We use this hypothetical – though not unrealistic – example to illustrate that a higher reliability is not automatically better. To illustrate this further, Figure 1 presents a graphical representation of both tests.
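
As a quick numerical check of the reasoning above, the following sketch recomputes the SEM and the 95% CIs from the figures quoted in the text (SD = 5%, cut-off = 65%, reliabilities of 0.75 for test 1 and 0.90 for test 2, matching the SEM values given above); SEM = SD × √(1 − reliability) is the standard CTT expression.

```python
import numpy as np

def sem(sd: float, reliability: float) -> float:
    """Standard error of measurement: the SD of the error term in CTT."""
    return sd * np.sqrt(1 - reliability)

cut_off = 65.0
for name, rel in [("test 1", 0.75), ("test 2", 0.90)]:
    s = sem(sd=5.0, reliability=rel)
    half_width = 1.96 * s                      # 95% confidence half-width
    print(f"{name}: SEM = {s:.2f}%, 95% CI around the cut-off = "
          f"{cut_off - half_width:.1f}% to {cut_off + half_width:.1f}%")
```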





일반화가능도 이론 Generalisability theory

G-theory is not per se an extension to CTT but a theory on its own. It has different assumptions than CTT, some more nuanced, some more obvious. These are best explained using a concrete example. We will discuss G-theory here, using such an example.


When a group of 500 students sit a test, say a 200-item knowledge-based multiple-choice test, their total scores will differ. In other words, there will be variance between the scores. From a reliability perspective, the goal is to establish the extent to which these score differences are based on differences in ability of the students in comparison to other – unwanted – sources of variance. In this example, the variance that is due to differences in ability (in our example ‘knowledge’) can be seen as wanted or true score variance. Level of knowledge of students is what we want our test to pick up, the rest is noise – error – in the measurement. G-theory provides the tools to distinguish true or universe score variance from error variance, and to identify and estimate different sources of error variance. The mathematical approach to this is based on analysis of variance, which we will not discuss here. Rather, we want to provide a more intuitive insight into the approach and we will do this stepwise with some score matrices.


In Table 2, all students have obtained the same score (for reasons of simplicity, we have drawn a table of five test items and five candidates). From the total scores and the p-values, it becomes clear that all the variance in this matrix is due to systematic differences in items. Students collectively ‘indicate’ that item 1 is easier than item 2, and item 2 is easier than item 3, etc. There is no variance associated with students. All students have the same total score and they have collected their points on the same items. In other words, all variance here is item variance (I-variance).



Table 3 draws exactly the opposite picture. Here, all variance stems from differences between students. Items agree maximally as to the ability of the students. All items give each student the same marks, but their marks differ for all students, so the items make a consistent, systematic distinction between students. In the score matrix, all items agree that student A is better than student B, who in turn is better than student C, etc. So, here, all variance is student-related variance (person variance or P-variance).




Table 4 draws a more dispersed picture. For students A, B and C, items 1 and 2 are easy and items 3–5 difficult, and the reverse is true for students D and E. There seems to be a clearly discernable interaction effect between items and students. Such a situation could occur if, for example, items 1 and 2 are on cardiology and 3–5 on the locomotor system, and students A, B and C have just finished their clerkship in cardiology and the other students have just finished their orthopaedic surgery placements.




Of course, real life is never this simple, so matrix 5 (Table 5) presents a more realistic scenario: some variance can be attributed to systematic differences in item difficulty (I-variance), some to differences in student ability (P-variance), and some to interaction effects (P × I-variance), which in this situation cannot be disentangled from general error (e.g. perhaps student D knew the answer to item 4 but was distracted or he/she misread the item).
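
The decomposition just described can be sketched in a few lines of numpy for a fully crossed P × I design with one observation per cell; the 5 × 5 matrix below is invented in the spirit of Table 5, and the expected-mean-squares algebra is the standard random-effects ANOVA approach rather than anything specific to this Guide.

```python
import numpy as np

def variance_components(x: np.ndarray):
    """Estimate P, I and residual (PxI,e) variance components
    for a fully crossed persons-by-items (P x I) score matrix."""
    n_p, n_i = x.shape
    grand = x.mean()
    p_means = x.mean(axis=1)          # one mean per person
    i_means = x.mean(axis=0)          # one mean per item

    ms_p = n_i * ((p_means - grand) ** 2).sum() / (n_p - 1)
    ms_i = n_p * ((i_means - grand) ** 2).sum() / (n_i - 1)
    resid = x - p_means[:, None] - i_means[None, :] + grand
    ms_res = (resid ** 2).sum() / ((n_p - 1) * (n_i - 1))

    var_pi_e = ms_res                      # interaction confounded with error
    var_p = max((ms_p - ms_res) / n_i, 0)  # person (wanted) variance
    var_i = max((ms_i - ms_res) / n_p, 0)  # item (difficulty) variance
    return var_p, var_i, var_pi_e

# Hypothetical 5 candidates x 5 items matrix in the spirit of Table 5
scores = np.array([[1, 1, 0, 0, 0],
                   [1, 1, 1, 0, 0],
                   [1, 1, 1, 1, 0],
                   [1, 0, 1, 1, 1],
                   [0, 1, 1, 1, 1]], dtype=float)
var_p, var_i, var_pi_e = variance_components(scores)
print(f"P-variance = {var_p:.3f}, I-variance = {var_i:.3f}, "
      f"PxI,e-variance = {var_pi_e:.3f}")
```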





Generalisability is then determined by the portion of the total variance that is explained by the wanted variance (in our example, the P-variance). In a generic formula:

Generalisability = wanted variance / (wanted variance + error variance)

Or in the case of our 200 multiple choice test example:5

σ²P / (σ²P + σ²I/ni + σ²PI,e/ni)

The example of the 200-item multiple-choice test is called a one-facet design. There is only one facet on which we wish to generalise, namely would the same students perform similarly if another set of items (another ‘parallel’ test) were administered. The researcher does not want to draw conclusions as to the extent to which another group of students would perform similarly on the same set of items. If the latter were the purpose, she/he would have to redefine what is wanted and what is error variance. In the remainder of this paragraph we will also use the term ‘factor’ to denote all the components of which the variance components are estimates (so, P is a factor but not a facet).


위의 식에서 어떤 것이 error term에 들어가는지가 어떤 종류의 일반화를 할 것인가를 결정한다.

If we are being somewhat more precise, the second formula is not always a correct translation of the first. The first deliberately does not call the denominator ‘total variance’, but ‘wanted’ and ‘error variance’. Apparently, the researcher has some freedom in deciding what to include in the error term and what not. This of course, is not a capricious choice; what is included in the error term defines what type of generalisations can be made.



If, for example, the researcher wants to generalise as to whether the rank ordering from best to most poorly performing student would be the same on another test, the I-variance does not need to be included in the error term (for a test–retest correlation, the systematic difficulty of the items or the test is irrelevant). For the example given here (which is a so-called P × I design), the generalisability coefficient without the I/ni term is equivalent to Cronbach's alpha.


The situation is different if the reliability of an exact score is to be determined. In that case, the systematic item difficulty is relevant and should be incorporated in the error term. This is the case in the second formula.


To distinguish between both approaches, the former (without the I-variance) is called the ‘generalisability coefficient’ and the latter the ‘dependability coefficient’. This distinction further illustrates the versatility of G-theory: when the researcher has a good overview of the sources of variance that contribute to the total variance, she/he can clearly distinguish and compare the wanted and the unwanted sources of variance.
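
A minimal sketch of the two coefficients, using made-up variance component estimates; the only difference between them is whether the systematic item variance (σ²I/ni) is counted as error.

```python
# Illustrative (made-up) variance component estimates for a P x I design
var_p, var_i, var_pi_e = 0.040, 0.010, 0.150   # person, item, interaction/error
n_i = 200                                      # number of items in the test

g_coef = var_p / (var_p + var_pi_e / n_i)              # generalisability (relative)
d_coef = var_p / (var_p + (var_i + var_pi_e) / n_i)    # dependability (absolute)
print(f"generalisability coefficient = {g_coef:.3f}")
print(f"dependability coefficient    = {d_coef:.3f}")
```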


The same versatility holds for the calculation of the SEM. As discussed in the section on CTT, the SEM is the SD of the error term, so in a generalisability analysis it can be calculated as the square root of the error variance components, so either

SEM = √(σ²PI,e/ni)   (relative, matching the generalisability coefficient)   or   SEM = √(σ²I/ni + σ²PI,e/ni)   (absolute, matching the dependability coefficient)

In this example the sources of variance are easy to understand, because there is in fact one facet, but more complicated situations can occur. In an OSCE with two examiners per station, things already become more complicated. 

    • First, there is a second facet (the universe of possible examiners) on top of the first (the universe of possible stations). 
    • Second, there is crossing and nesting. 


A crossed design is most intuitive to understand. The multiple-choice example is a completely crossed design (P × I, the ‘×’ indicating the crossing), all items are seen by all students. Nesting occurs when certain ‘items’ of a factor are only seen by some ‘items’ of another factor. This is a cryptic description, but the illustration of the OSCE may help. The pairs of examiners are nested within each station. It is not the same two examiners who judge all stations for all students, but examiners A and B are in station 1, C and D in station 2, etc. The examiners are crossed with students (assuming that they remain the same pairs throughout the whole OSCE), because they have judged all students, but they are not crossed with all stations as A and B have only examined in station 1, etc. In this case examiner pairs are nested within stations.


There is a second part to the analyses in a generalisability analysis, namely the decision study or D-study. You may have noticed in the second formula that both the I-variance and the interaction term are divided by ni. This indicates that the variance component is divided by the number of elements in the factor (in our example, the number of items in the I-variance) and that the terms in the formula are the mean variances per element in the factor (the mean item variance). From this, it is relatively straightforward to extrapolate what the generalisability or dependability would have been if the numbers were to change (e.g. what is the dependability if the number of items on the test were twice as high, or which is more efficient, using two examiners per OSCE station or having more stations with only one examiner?), just by inserting another value for ni. Although it may seem very simple, one word of caution is needed: such extrapolations are only as good as the original variance component estimates. The higher the number of original observations, the better the extrapolation. In our example, we had 200 items on the test and 500 students taking it, but it is obvious that this leads to better estimates and thus better extrapolations than 50 students sitting a 20-item test.
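
A D-study can then be sketched by simply re-evaluating the dependability formula for other (hypothetical) test lengths, keeping the same made-up variance components as in the previous sketch; as noted in the text, such extrapolations are only as good as the original estimates.

```python
# D-study: extrapolate dependability to other test lengths (illustrative values)
var_p, var_i, var_pi_e = 0.040, 0.010, 0.150

for n_i in (50, 100, 200, 400):
    phi = var_p / (var_p + (var_i + var_pi_e) / n_i)
    print(f"{n_i:>3} items: dependability = {phi:.3f}")
```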



문항반응이론 Item response theory

CTT와 G-theory가 공통적으로 가지고 있는 단점은 검사 난이도 효과를 응시자 집단 효과로부터 분리해낼 수 없다는 것이다. 특정 문항 세트에 대한 점수가 낮은 것은 그 문항들이 특히 어렵기 때문일 수도 있고, 응시자 집단의 능력 수준이 특히 낮기 때문일 수도 있다. IRT는 문항 난이도를 학생의 능력과 독립적으로 추정하고, 학생의 능력을 문항 난이도와 독립적으로 추정함으로써 이 문제를 해결하고자 한다.

Both CTT and G-theory have a common disadvantage. Both theories do not have methods to disentangle test difficulty effects from candidate group effects. If a score on a set of items is low, this can be the result of a particularly difficult set of items or of a group of candidates who are of particularly low ability level. Item response theories try to overcome this problem by estimating item difficulty independent of student ability, and student ability independent of item difficulty.


CTT에서 난이도는 p-value, 즉 해당 문항을 맞춘 응시자의 비율로 나타나며, point biserial, Rit(item-total correlation), Rir(item-rest correlation) 같은 변별도 지수는 특정 문항에 대한 수행과 전체 검사 혹은 나머지 문항에 대한 수행의 상관관계를 보여준다. 평균 능력이 다른 집단이 같은 검사를 보면 p-value가 달라지고, 문항이 다른 검사에 재사용되면 변별도 지수들이 모두 달라진다. IRT에서는 각 문항에 대한 응시자의 응답을 응시자의 능력을 고려하여 모델링한다.

Before we can explain this, we have to go back to CTT again. In CTT, item difficulty is indicated by the so-called p-value, the proportion of candidates who answered the item correctly, and discrimination indices such as point biserials, Rit (item-total correlation) or Rir (item-rest correlation), all of which are measures to correlate the performance on an item to the performance on the total test or the rest of the items. If in these cases a different group of candidates (of different mean ability) were to take the test, the p-values would be different, and if an item were re-used in a different test, all discrimination indices would be different. With IRT, the responses of the candidates to each individual item on the test are modelled, given their ability.
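
These classical item statistics are straightforward to compute; the sketch below does so for a small, invented 0/1 score matrix (p-value, item-total and item-rest correlations).

```python
import numpy as np

def item_statistics(scores: np.ndarray):
    """p-value (difficulty), Rit (item-total) and Rir (item-rest) correlations
    for a (candidates x items) matrix of 0/1 scores."""
    total = scores.sum(axis=1)
    stats = []
    for i in range(scores.shape[1]):
        item = scores[:, i]
        p_value = item.mean()                         # proportion answering correctly
        rit = np.corrcoef(item, total)[0, 1]          # item-total correlation
        rir = np.corrcoef(item, total - item)[0, 1]   # item-rest correlation
        stats.append((p_value, rit, rir))
    return stats

# Hypothetical responses: 8 candidates x 4 items
scores = np.array([[1, 1, 1, 0],
                   [1, 1, 0, 0],
                   [1, 0, 1, 1],
                   [1, 1, 1, 1],
                   [0, 1, 0, 0],
                   [1, 0, 0, 0],
                   [0, 0, 1, 0],
                   [1, 1, 1, 1]], dtype=float)
for i, (p, rit, rir) in enumerate(item_statistics(scores), start=1):
    print(f"item {i}: p = {p:.2f}, Rit = {rit:.2f}, Rir = {rir:.2f}")
```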


이러한 모델링에는 몇 가지 가정이 필요하다.

Such modelling cannot be done without making certain assumptions. 

    • The first assumption is that the ability of the candidates is uni-dimensional and 
    • the second is that all items on a test are locally independent except for the fact that they measure the same (uni-dimensional) ability. If, for example, a test would contain an item asking for the most probable diagnosis in a case and a second for the most appropriate therapy, these two items are not locally independent; if a candidate answers the first item incorrectly, she/he will most probably answer the second one incorrectly as well.
    • The third assumption is that modelling can be done through an item response function (IRF) indicating that for every position on the curve, the probability of a correct answer increases with a higher level of ability. The biggest advantage of IRT is that difficulty and ability are modelled on the same scale. IRFs are typically graphically represented as an ogive, as shown in Figure 2.






모델링에는 데이터가 필요하다. 따라서 모델링을 하기 전에 사전 검사가 필요하다.

Modelling cannot be performed without data. Therefore pre-testing is necessary before modelling can be performed. The results on the pre-test are then used to estimate the IRF. For the purpose of this AMEE Guide, we will not go deeper into the underlying statistics but for the interested reader some references for further reading are included at the end.


세 수준의 모델링이 가능하다.

Three levels of modelling can be applied, conveniently called one-, two- and three-parameter models. 

    • A one-parameter model distinguishes items only on the basis of their difficulty, or the horizontal position of the ogive. Figure 3 shows three items with three different positions of the ogive. The curve on the left depicts the easiest item of the three in this example; it has a higher probability of a correct answer at lower abilities of the candidate. The right-most curve indicates the most difficult item. In this one-parameter modelling, the forms of all curves are the same, so their power to discriminate (equivalent to the discrimination indices of CTT) between students of high and low abilities is the same.
    • A two-parameter model includes this discriminatory power (on top of the difficulty). The curves for different items not only differ in their horizontal position but also in their steepness. Figure 4 shows three items with different discrimination (different steepness of the slopes). It should be noted that the curves do not only differ in their slopes but also in their positions, as they differ both in difficulty and in discrimination (if they would only differ in slopes, it would be a sort of one-parameter model again).
    • A three-parameter model includes the possibility that a candidate with extremely low ability (near-to-zero ability) still produces the correct answer, for example through random guessing. The third parameter determines the offset of the curve or more or less its vertical position. Figure 5 shows three items differing on all three parameters.
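
The three models can be summarised in a single three-parameter logistic (3PL) function: setting c = 0 reduces it to a two-parameter model, and additionally fixing a to a constant gives a one-parameter (Rasch-type) model. The sketch below is a generic 3PL form, not code from the Guide, and the item parameters are invented for illustration.

```python
import numpy as np

def irf(theta, a=1.0, b=0.0, c=0.0):
    """Three-parameter logistic item response function.
    a: discrimination (slope), b: difficulty (location), c: guessing (lower asymptote).
    c=0 gives the 2PL; additionally fixing a gives a 1PL (Rasch-type) model."""
    return c + (1 - c) / (1 + np.exp(-a * (theta - b)))

theta = np.linspace(-3, 3, 7)                    # ability scale
easy  = irf(theta, a=1.0, b=-1.0)                # easy item (ogive shifted left)
hard  = irf(theta, a=1.0, b=1.5)                 # difficult item (shifted right)
guess = irf(theta, a=1.5, b=0.0, c=0.2)          # steeper item with a guessing floor
for t, e, h, g in zip(theta, easy, hard, guess):
    print(f"theta={t:+.1f}  P(easy)={e:.2f}  P(hard)={h:.2f}  P(3PL)={g:.2f}")
```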





대략 one-parameter modelling에는 200~300개의 응답이, three-parameter model에는 1000개의 응답이 필요하다.

As said before, pre-testing is needed for parameter estimation and logically there is a relationship between the number of candidate responses needed for good estimates; the more parameters have to be estimated, the higher the number of responses needed. As a rule of thumb, 200–300 responses would be sufficient for one-parameter modelling, whereas a three-parameter model would require roughly 1000 responses. Typically, large testing bodies that employ IRT mix items to be pre-tested with regular items, without the candidates knowing which item is which. But it is obvious that such requirements in combination with the complicated underlying statistics and strong assumptions limit the applicability of IRT in various situations. It will be difficult for a small-to-medium-sized faculty to produce enough pre-test data to yield acceptable estimates, and, in such cases, CTT and G-theory will have to do.


IRT는 검사 신뢰도에 관한 가장 강력한 이론으로 볼 수 있다.

On the other hand, IRT must be seen as the strongest theory in reliability of testing, enabling possibilities that are impossible with CTT or G-theory. One of the ‘eye-catchers’ in this field is computer-adaptive testing (CAT). In this approach, each candidate is presented with an initial small set of items. Depending on the responses, his/her level of ability is estimated, and the next item is selected to provide the best additional information as to the candidate's ability and so on. In theory – and in practice – such an approach reduces the SEM for most if not all students. Several methods can be used to determine when to stop and end the test session for a candidate. One would be to administer a fixed number of items to all candidates. In this case, the SEM will vary between candidates but most probably be lower for most of the candidates than with an equal number of items with traditional approaches (CTT and G-theory). Another solution is to stop when a certain level of certainty (a certain SEM) is reached. In this case, the number of items will vary per candidate. But apart from CAT, IRT will mostly be used for test equating, in such situations where different groups of candidates have to be presented with equivalent tests.
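
A toy sketch of the adaptive logic described here, under simplifying assumptions (a 2PL item bank with invented parameters, item selection by maximum Fisher information at the current ability estimate, and a crude gradient-ascent ability update instead of a proper maximum-likelihood or Bayesian estimator); real CAT engines are considerably more sophisticated.

```python
import numpy as np

def p_correct(theta, a, b):
    """2PL probability of a correct response."""
    return 1.0 / (1.0 + np.exp(-a * (theta - b)))

def item_information(theta, a, b):
    """Fisher information of a 2PL item at ability theta."""
    p = p_correct(theta, a, b)
    return a ** 2 * p * (1 - p)

def update_theta(theta, a, b, responses, steps=20, lr=0.5):
    """Crude gradient-ascent update of theta over the items administered so far."""
    for _ in range(steps):
        p = p_correct(theta, a, b)
        grad = np.sum(a * (responses - p))   # d log-likelihood / d theta for the 2PL
        theta += lr * grad / len(responses)
    return theta

rng = np.random.default_rng(42)
true_theta = 1.0
bank_a = rng.uniform(0.8, 2.0, size=50)          # hypothetical item bank: discriminations
bank_b = rng.uniform(-2.5, 2.5, size=50)         # and difficulties
available = list(range(50))

theta_hat, used, answers = 0.0, [], []
for _ in range(10):                              # administer 10 adaptive items
    info = [item_information(theta_hat, bank_a[i], bank_b[i]) for i in available]
    pick = available.pop(int(np.argmax(info)))   # most informative item at current estimate
    used.append(pick)
    answers.append(float(rng.random() < p_correct(true_theta, bank_a[pick], bank_b[pick])))
    theta_hat = update_theta(theta_hat, bank_a[used], bank_b[used], np.array(answers))

print(f"estimated ability = {theta_hat:.2f} (true = {true_theta})")
```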


권고 Recommendations

The three theories – CTT, G-theory and IRT – seem to co-exist. This is an indication that there is good use for each of them depending on the specific test, the purpose of the assessment and the context in which the assessment takes place. Some rules of thumb may be useful.


    • CTT is helpful in straightforward assessment situations such as the standard open-ended or multiple choice test. In CTT, item parameters such as p-values and discrimination indices can be calculated quite simply with most standard statistical software packages. The interpretation of these item parameters is not difficult and can be taught easily. Reliability estimates, such as Cronbach's alpha, however, are based on the notion of test–retest correlation. Therefore, they are most suitable for reliability estimates from a norm-orientated perspective and not from a domain-orientated perspective. If they are used in the latter case, they will be an overestimation of the actual reproducibility.

    • G-theory is more flexible in that it enables the researcher to include or exclude sources of variance in the calculations. This presupposes that the researcher has a good understanding of the meaning of the various sources of variance and the way they interact with each other (nested versus crossed), but also how they represent the domain. The original software for these analyses is quite user-unfriendly and requires at least some knowledge of older programming languages such as Fortran (e.g. UrGENOVA; http://www.education.uiowa.edu/casma/GenovaPrograms.htm, last access 17 December 2010). Variance component estimates can be done with SPSS, but the actual g-analysis would still have to be done by hand. Some years ago, two researchers at McMaster wrote a graphical shell around UrGenova to make it more user friendly (http://fhsperd.mcmaster.ca/g_string/download.html, accessed 17 December 2010). Using this shell spares the user from having to know and employ a difficult syntax. Nevertheless, it still requires a good understanding of the concepts of G-theory. In all cases where there is more than one facet of generalisation (as in the example with the two examiners per station in an OSCE), G-theory has a clear advantage over CTT. In CTT, multiple parameters would have to be used and somehow combined (in this OSCE, Cronbach's alpha and Cohen's kappa or an ICC for inter-observer agreement), whereas in the generalisability analysis both facets are incorporated. If a one-facet situation exists (like the multiple choice examination) from a domain-orientated perspective (e.g. with an absolute pass–fail score), a dependability coefficient is a better estimate than CTT.

    • IRT should only be used if people with sufficient understanding of the statistics and the underlying concepts are part of the team. Furthermore, considerably large item banks are needed and pre-testing on a sufficient number of candidates must be possible. This limits the routine applicability of IRT in all situations other than large testing bodies, large schools or collaboratives.



새롭게 떠오르는 이론들 Emerging theories

새롭게 떠오르는 이론의 대부분은 '학습의 평가'에서 '학습을 위한 평가'로의 관점 전환과 관련되어있다. 비록 이 자체는 이론의 변화는 아니지만, 관점의 변화가 새로운 이론을 가져오거나 기존 이론의 확장을 가져왔다.

Although we by no means possess a crystal ball, we see some new theories or extension to existing theories emerging. Most of these are related to the changing views from (exclusively) assessment of learning to more assessment for learning. Although this in itself is not a theory change but more a change of views on assessment, it does lead to the incorporation of new theories or extensions to existing ones.


첫째로, '학습을 위한 평가'라는 것이 무엇인지 설명할 필요가 있다. 기존의 관점이 상징적으로 보여주는 것이 바로 교과목 종료 후에 보는 총괄평가이다. 이러한 방법은 전 세계적으로 흔하게 사용되는 것이지만, 교육적 맥락에서는 이러한 방법에 대한 불만이 커지고 있다. 이러한 평가는 학습환경의 변화를 잘 따라가지 못하고 있으며, 이러한 'purely selective test'는 의료의 'screening procedure'에 비견될 수 있다. 필수 역량에 미달한 학생에 대하여 졸업 여부를 판별하는 데는 좋을 수 있으나, 아직 역량이 부족한 학생이 어떻게 충분한 역량을 갖추게 할 것인가에 대한 정보는 주지 못한다. 또한 각 학생을 어떻게 가능한 가장 좋은 의사로 키울 수 있을 것인가에 대한 정보도 주지 못한다. 환자를 더 낫게 만드는 것은 screening 그 자체가 아니라 잘 맞춰진 진단과 치료인 것처럼, 학습의 평가(assessment of learning)는 학습을 향상시키는 데 큰 도움이 되지 못하지만, 학습을 위한 평가(assessment for learning)는 도움이 될 수 있다.

First, however, it might be helpful to explain what assessment for learning entails. For decades, our thinking about assessment has been dominated by the view that assessment's main purpose is to determine whether a student has successfully completed a course or a study. This is epitomised in the summative end-of course examination. The consequences of such examinations were clear; if she/he passes, the student goes on and does not have to look back; if she/he fails, on the other hand, the test has to be repeated or (parts of) the course has to be repeated. Successful completion of a study was basically a string of passing individual tests. We draw – deliberately – somewhat of a caricature, but in many cases, this is the backbone of an assessment programme. Such an approach is not uncommon and is used at many educational institutes in the world, yet there is a growing dissatisfaction in the educational context. Some discrepancies and inconsistencies are felt to be increasingly incompatible with learning environments. These are probably best illustrated with an analogy. Purely selective tests are comparable in medicine to screening procedures (e.g. for breast cancer or cervical cancer). They are highly valuable in ensuring that candidates lacking the necessary competence do not graduate (yet), but they do not provide information as to how an incompetent candidate can become a competent one, or how each student can become the best possible doctor she/he could be. Just as screening does not make the patients better, but tailored diagnostic and therapeutic interventions do, assessment of learning does not help much in improving the learning but assessment for learning can.


We will mention the most striking discrepancies between assessment of and assessment for learning.


  • 교육과정의 가장 중심이 되는 목표는 학생이 공부를 열심히 해서 가능한 많이 배울 수 있게 하는 것이다. 따라서 평가도 이러한 목적에 맞게 이뤄져야 한다. 충분한 역량을 갖춘 학생을 골라내는데만 집중된 평가는 이러한 목표에 도달할 수 없다.
    A central purpose of the educational curriculum is to ensure that students study well and learn as much as they can; so, assessment should be better aligned with this purpose. Assessment programmes that focus almost exclusively on the selection between the sufficiently and insufficiently competent students do not reach their full potential in steering student learning behaviour.
  • '학습의 평가'에서 하는 질문은 'A가 B보다 낫나?'이다. CTT나 G-theory에서는 학생 간 차이가 없을 경우 신뢰도를 계산해낼 수 없다. '학습을 위한 평가'에서 질문은 '오늘의 A가 어제의 A보다 낫나?'이다. 이는 수월성 추구에 더 큰 의미를 부여하는데, 이제 수월성이 집단 수준이 아니라 개인 수준에서 정의되기 때문이다(모든 학생이 '우수'에 도달하면, 그 '우수'는 다시 '평범함'이 된다). 물론 '학습을 위한 평가'에서도 A와 B의 향상이 충분한가라는 질문은 다뤄져야 한다.
    If the principle of assessment of learning is exclusively used, the question all test results need to answer is: is John better than Jill?, where the pass–fail score is more or less one of the possible ‘Jills’. Typically CTT and G-theory cannot calculate test reliability if there are no differences between students. A test–retest correlation does not exist if there is no variance in scores; generalisability cannot be calculated if there is no person variance. The central question in the views of assessment for learning is therefore: Is John today optimally better than he was yesterday, and is Jill today optimally better than she was yesterday? This also gives more meaning to the desire to strive for excellence, because now excellence is defined individually rather than on the group level (if everybody in the group is excellent, ‘excellent’ becomes mediocre again). It goes without saying that in assessment for learning, the question whether John's and Jill's progress is good enough needs to be addressed as well.
  • 보다 어려운, 그리고 보다 철학적인 문제는 '학습의 평가'에서 '일반화' 혹은 '예측'이 '동질성(uniformity)'에 기반하고 있다는 점이다. 즉 모든 학생이 동일한 상황에서 동일한 시험을 치르면 충분히 잘 일반화하고 예측할 수 있다고 보는 것이다. 그러나 '학습을 위한 평가'에서 '예측'은 여전히 중요하긴 하지만, 평가법의 선택은 보다 진단적이어서, 학생의 구체적 특징에 따라 평가법을 선택할 수 있는 충분한 유연성이 있어야 한다. CAT나 임상의사의 진단적 사고, 즉 구체적인 추가 진단검사를 개별 환자에 맞추어 선택하는 것이 이와 유사하다 할 수 있다.
    A difficult and more philosophical result of the previous point is that the idea of generalisation or prediction (how well will John perform in the future based on the test results of today) in an assessment of learning is mainly based on uniformity. It states that we can generalise and predict well enough if all students sit the same examinations under the same circumstances. In the assessment for learning view, prediction is still important but the choice of assessment is more diagnostic in that there should be room for sufficient flexibility to choose the assessment according to the specific characteristics of the student. This is analogous to the idea of (computer) adaptive testing or the diagnostic thinking of the clinician, tailoring the specific additional diagnostics to the specific patient.
  • '학습의 평가'에서는 최선의 평가법을 개발해내는 것이 중요하다. 이러한 관점에서 가장 이상적인 평가 프로그램은 의학적 역량의 각 측면을 평가하기에 '가장 좋은' 평가도구만을 사용하게 된다. 예컨대 지식의 평가를 위한 객관식 문항, 술기 평가를 위한 OSCE, 문제해결능력을 위한 long simulation 등이다. 그러나 '학습을 위한 평가'에서는 다양한 도구와 평가 시점으로부터 정보를 얻어, 다음의 세 가지 질문에 최적으로 답하는 것이 중요하다.
    In the assessment of learning view, developments are focussed more on the development (or discovery) of the optimal instrument for each aspect of medical competence. The typical example of this is the OSCE for skills. In this view, an optimal assessment programme would incorporate only the best instrument for each aspect of medical competence. Typically, such a programme would look like this: multiple-choice tests for knowledge, OSCEs for skills, long simulations for problem-solving ability, etc. From an assessment for learning view, information needs to be extracted from various instruments and assessment moments to optimally answer the following three questions:


    1. 진단적 질문: 이 학생에 대한 완전한 그림을 그리기에 충분한 정보를 가지고 있는가?
      Do I have enough information to draw the complete picture of this particular student or do I need specific additional information? (the ‘diagnostic’ question)
    2. 치료적 질문: 이 시점에서 가장 필요한 교육적 개입은 무엇인가?
      Which educational intervention is most indicated for this student at this moment? (the ‘therapeutic’ question)
    3. 예후적 질문: 이 학생이 옳은 길을 가고 있으며 유능한 전문직으로 성장할 것인가?
      Is this student on the right track to become a competent professional on time? (the ‘prognostic’ question).


  • 단일한 혹은 소수의 평가로만 위의 질문에 답을 할 수는 없을 것이다. 대신 평가프로그램이 필요하며, 여기에는 각각의 장점과 단점이 있는 다양한 평가법이 포함되는데, 이는 의사가 다양한 진단적 도구를 활용하는 것과 마찬가지다. 이 도구들은 양적일 수도 있고 질적일 수도 있으며, 더 '객관적'일 수도, 더 '주관적'일 수도 있다. 비유를 좀 더 해보자면, 만약 의사가 환자의 Hb 수치 오더를 내린다면 검사실 분석가의 의견이 아니라 단순히 '객관적인' 수치를 원하는 것이다. 그러나 한편으로 병리학자에게 의뢰할 때는 특정 숫자가 아니라 서술적('주관적') 판단을 기대한다. 유사하게 평가 프로그램도 양적, 질적 요소를 모두 갖추게 된다.
    It follows logically from the previous point that this cannot be accomplished with one single assessment method or even with only a few. A programme of assessment is needed instead, incorporating a plethora of methods, each with its own strengths and weaknesses, much like the diagnostic armamentarium of a clinician. These can be qualitative or quantitative, more ‘objective’ or more ‘subjective’. To draw the clinical analogy further: if a clinician orders a haemoglobin level of a patient she/he does not want the laboratory analyst's opinion but the mere ‘objective’ numerical value. If, on the other hand, she/he asks a pathologist, she/he does not expect a number but a narrative (‘subjective’) judgement. Similarly, such a programme of assessment will consist of both qualitative and quantitative elements.


이 이론들 중 많은 부분은 여전히 더 개발이 필요하나 일부는 다른 분야의 이론에서 가져올 수도 있다.
Much of the theory to support the approach of assessment for learning still needs to be developed. Parts can be adapted from theories in other fields; parts need to be developed within the field of health professions assessment research. We will briefly touch on some of these.


  • 평가 프로그램의 질을 결정하는 것은 무엇인가? 한 가지 중요한 것은 좋은 평가프로그램은 개별 구성요소의 합보다 전체가 더 커야 한다는 점이다. 그러나 이런 목표를 달성하기 위해서 각 요소를 어떻게 결합할 것인가는 또 다른 문제이다. 
    What determines the quality of assessment programmes? It is one thing to state that in a good assessment programme the total is more than the sum of its constituent parts, but it is another to define how these parts have to be combined in order to achieve this. Emerging theories describe a basis for the definition of quality. Some adopt a more ideological approach (Baartman 2008) and some a more utilistic ‘fitness-for-purpose’ view (Dijkstra et al. 2009). 
    • 평가의 질이란 평가프로그램이 얼마나 '이상적인 모습'에 가까운가에 따른 것이다.
      In the former,
      quality is defined as the extent to which the programme is in line with an ideal (much like formerly quality of an educational programme was defined in terms of whether it was PBL or not); 
    • 평가의 질이란 프로그램에서 명확하게 정의한 목표에 의해서 정의되는 것이며, 각 부분이 이 목표를 달성하기 위해서 최적화되어야 한다. 
      in the latter
      the quality is defined in terms of a clear definition of the goals of the programme and whether all parts of the programmes optimally contribute to the achievement of this goal. This approach is more flexible in that it would allow for an evaluation of the quality of assessment of learning programmes as well. 
  • At this moment, theories about the quality of assessment programmes are being developed and researched (Dijkstra et al. 2009, submitted 2011).

  • 평가가 어떻게 학습에 영향을 미치는가? 상당한 합의가 있어 보인다. 그러나 연구가 그리 많이 되어있지는 않다.
    How does assessment influence learning? Although there seems to be complete consensus about this – a completely shared opinion – not much empirical research has been performed in this area. For example, many of the intuitive ideas and uses of this notion are strongly behaviouristic in nature and do not incorporate motivational theories very well. The research, especially in health professions education, is either focussed on the test format (Hakstian 1971; Newble et al. 1982; Frederiksen 1984) or on the opinions of students (Stalenhoef-Halling et al. 1990; Scouller 1998). Currently, new theories are emerging incorporating motivational theories and describing better which factors of an assessment programme influence learning behaviour, how they do that and what the possible consequences of these influences are (Cilliers et al. submitted 2010, 2010).

  • Test-enhanced learning이 최근 논의되고 있다. 전문가 이론에 따르면 시험을 보는 것 자체가 다양한 측면에서 지식의 저장, 유지, 인출에 도움이 된다고 보는 것은 합당하다. 그러나 평가프로그램에 있어서, 특히 '학습을 위한 평가' 차원에서 어떻게 해야하는가는 별로 아는 바가 많지 않다.
    The phenomenon of test-enhanced learning has been discussed recently (Larsen et al. 2008). From expertise theories it is logical to assume that from sitting a test, as a strong motivator to remember what was learned, the existing knowledge is not only more firmly stored in memory, but also reorganised from having to produce and apply it in a different context. This would logically lead to better storage, retention and more flexible retrieval. Yet we know little about how to use this effect in a programme of assessment especially with the goal of assessment for learning.

  • 피드백이 효과를 나타내게 해주는 것은 무엇인가? 피드백을 총괄평가 결정과 함께 주는 것은 그 가치를 떨어뜨린다는 지적이 있지만, 어떤 요인이 여기에 영향을 주는가에 대해서는 알려져 있는 바가 적다.
    What makes feedback work? There are indications that the provision of feedback in conjunction with a summative decision limits its value, but there is little known about which factors contribute to this. Currently, research not only focusses on the written combination of summative decisions and formative feedback, but also on the combination of a summative and formative role within one person. This research is greatly needed as in many assessment programmes it is neither always possible nor desirable to separate teacher and assessor role.

  • 평가프로그램의 차원에서 인간의 판단은 포함될 수밖에 없다. 심리학에서는 오래전부터 인간의 판단(human judgement)이 보험계리적(actuarial) 방법에 비해서 오류의 가능성이 더 높다고 보아왔다. 판단의 정확성에 영향을 주는 편향에는 여러 가지가 있다.
    In a programme of assessment the use of human judgement is indispensible. Not only in the judgement of more elusive aspects of medical competence, such as professionalism, reflection, etc., but also because there are many situations in which a prolonged one-on-one teacher-student relationship exists, as is for example the case in long integrated placements or clerkships. From psychology it is long known that human judgement is fallible if it is compared to actuarial methods (Dawes et al. 1989). There are many biases that influence the accuracy of the judgement. 
    • The most well-known are primacy, recency and halo effects (for a more complete overview, cf. Plous 1993). 
    • A primacy effect indicates that the first impression (e.g. in an oral examination) often dominates the final judgement unduly; 
    • a recency effect indicates the opposite, namely that the last impressions largely determine the judgement. There is good indication that the length of the period between the observation and the making of the judgement determines whether the primacy or the recency effect is the most prominent. 
    • The halo effect pertains to the inability of people to judge different aspects of someone's performance and demeanour fully independently during one observation, so they all influence each other. 
    • Other important sources of bias are cognitive dissonance, fundamental attribution error, ignoring base rates, confirmation bias. All have their specific influences on the quality of the judgement. As such, these theories shed a depressing light on the use of human judgement in (high-stakes) assessment. 
  • 그러나, 이러한 이론들에도 불구하고 human judgement에서 오는 편향을 줄일 수 있는 방법이 있다. 자연주의적 의사결정(naturalistic decision making)에 대한 이론에서는 왜 사람의 판단이 명확한 숫자 기반 결정보다 더 부정확한가에 초점을 두는 것이 아니라, 왜 사람들이 정보가 불충분하고 이상적이지 못한 상황에서 명확히 정의되지 않는 문제를 맞닥뜨리고도 훌륭히 수행하는가를 연구한다. 경험의 저장, 경험으로부터의 학습, 상황-특이적 스크립트의 보유 등이 중요한 역할을 하는 것으로 보인다. 그리고 많은 부분이 빠른 패턴 인식과 매칭에 기반하고 있다.
    Yet, from these theories and the studies in this field, there are also good strategies to mitigate such biases. Another theoretical pathway which is useful is the one on naturalistic decision making (Klein 2008; Marewski et al. 2009). This line of research does not focus on why people are so poor judges when compared to clear-cut and number-based decisions, but why people still do such a good job when faced with ill-defined problems with insufficient information and often under less than ideal situations. Storage of experiences, learning from experiences and the possession of situation-specific scripts seem to play a pivotal role here, enabling the human to employ a sort of expertise-type problem solving. Much is based on quick pattern recognition and matching. 
  • 두 가지 theoretical pathway 모두 관찰에서 얻은 제한된 정보만을 가지고 접근하는 인간의 접근법에 대해서 다루고 있다. 의료전문가가 임상추론을 하고 진단활동을 하는 것과 평가를 위해서 학생의 수행능력을 판단하는 것에는 유사성이 있음이 많은 연구에서 보고되고 있다.  
    Both theoretical pathways have commonality in that they both describe human approaches that are based on a limited representation of the actual observation. When, as an example, a primacy effect occurs, the judge is in fact reducing information to be able to handle it better, but when the judge uses a script, she/he is also reducing the cognitive load by a simplified model of the observation. Current research increasingly shows parallels between what is known about medical expertise, clinical reasoning and diagnostic performance and the act of judging a student's performance in an assessment setting. The parallels are such that they most probably have important consequences for our practices of teacher training.

  • 위에서 다룬 것을 설명하는 데 필요한 이론이 CLT이다. CLT는 인간의 작업기억이 제한적이어서 제한된 수의 정보를 짧은 시간만큼만 기억할 수 있다는 것으로부터 시작한다. CLT에서 인지부하는 세 가지 종류가 있다. 내재적, 외재적, 본유적 부하이다. 내재적 부하는 과제에 내재되어 있는 복잡성에 의해서 생기는 부하이다. 외재적 부하는 그 과제와 직접적으로 관련되어 있지는 않지만, 그 과제를 처리하기 위해서 처리해야 하는 모든 정보들과 관련되어 있다. CLT에 근거하자면 authentic setting에서 의과대학 교육과정을 바로 시작하는 것은 바람직하지 않은데, authenticity는 도움이 될 것 같아 보이지만 주의를 분산시키고, 외재적 부하가 과도하게 걸려서 학습을 위해 필요한 자원(본유적 부하)까지 다 잡아먹기 때문이다.
    An important underlying theory to explain the previous point is cognitive load theory (CLT) (Van Merrienboer & Sweller 2005, 2010). CLT starts from the notion that the human working memory is limited in that it can only hold a low number of elements (typically 7 ± 2) for a short-period of time. Much of this we already discussed in the paragraphs on expertise. CLT builds on this as it postulates that cognitive load consists of three parts: intrinsic, extraneous and germane load. 
    • Intrinsic load is generated by the innate complexity of the task. This has to do with the number of elements that need to be manipulated and the possible combinations (element interactivity). 
    • Extraneous load relates to all information that needs to be processed yet is not directly relevant for the task. If, for example, we would start the medical curriculum by placing the learners in an authentic health care setting and require them to learn from solving real patient problems, CLT states that this is not a good idea. The authenticity may seem helpful, but it distracts; the cognitive resources needed to deal with all the practical aspects would constitute a high extraneous load, even to such an extent that it would minimise the resources left for learning (the germane load).

내재적 인지부하

내재적 인지부하(intrinsic cognitive load)란 학습자료나 과제 자체가 가지고 있는 난이도와 복잡성이라 할 수 있다. 요소 간 상호작용성이 높은 학습 자료에서는 개념을 획득하고 개념들 간의 관련성을 이해하는 것이 작동기억의 부하를 감소시킬 수 있다[2]. 내재적 인지부하는 학습의 난이도에 따라 상대적일 수 있으며, 이는 사전지식의 보유와 관련이 있다고도 할 수 있다[3].

외재적 인지부하

외재적 인지부하(extraneous cognitive load)는 학습 과제 자체의 난이도가 아닌 학습방법, 자료제시방법 등 교수전략에 의해 개선될 수 있는 인지부하이다. 그러나 외재적 인지부하는 내재적 인지부하에 영향을 받는다(김 경, 김동식, 2004). 즉, 학습 과제 자체가 내재적 인지부하가 낮다면 교수 설계가 부적절하여 외재적 인지부하가 발생하더라도 이것이 작동기억의 범위 내에 있기 때문에 문제를 해결하는데 어려움이 없게 된다.[1]


본유적 인지부하

본유적 인지부하(germane cognitive load)란 작동기억의 범위 안에서 학습과 직접 관련이 있는 정신적인 노력을 의미한다. 학습자에게 지나치게 낮거나 높은 수준의 학습자료를 제공하면 이러한 인지부하는 일어나지 않는다. 그러나 학습자에게 적절한 수준의 학습자료를 제공하면 학습자는 문제를 해결하기 위해 정신적인 노력을 기울이게 되는데, 이때 발생하는 인지부하를 '본유적 인지부하'라고 한다[2].
  • 마지막으로 새로운 모델이 개발되고 옛 모델도 재발견이 이뤄지고 있다.
    Finally, new psychometric models are being developed and old ones are being rediscovered at the present time. It is clear that, from a programme of assessment view, when many instruments are incorporated in the programme, no single psychometric model will be useful for all elements of the programme. In the 1960s and 1970s, some work was done on domain-orientated reliability approaches (Popham and Husek 1969; Berk 1980). In the currently widely used methods, internal consistency (like Cronbach's alpha) is often used as the best proxy for reliability or universe generalisation, but one can wonder whether this is the best approach in all situations. Most standard psychometric approaches do not handle a changing object of measurement very well. By this we mean that the students – hopefully – change under the influence of the learning programme. In the situation of a longer placement, for example, the results of repeatedly scored observations (for instance, repeated mini-CEX) will differ in their outcomes, with part of this variance being due to the learning of the student and part to measurement error (Prescott-Clements et al. submitted 2010). Current approaches do not provide easy strategies to distinguish between both effects. Internal consistency is a good approach to reliability when stability of the object of measurement and of the construct can reasonably be expected; it is problematic when this is not the case. The domain-orientated approaches therefore were not focussed primarily on internal consistency but on the probability that a new observation would shed new and unique light on the situation, much like the clinical adage never to ask for additional diagnostics if the results are unlikely to change the diagnosis and/or the management of the disease. As said above, these methods are being rediscovered and new ones are being developed, not to replace the existing theories, but rather to complement them.






Med Teach. 2011;33(10):783-97. doi: 10.3109/0142159X.2011.611022.

General overview of the theories used in assessment: AMEE Guide No. 57.

Author information

  • 1Flinders University, Adelaide 5001, South Australia, Australia. lambert.schuwirth@flinders.edu.au

Abstract

There are no scientific theories that are uniquely related to assessment in medical education. There are many theories in adjacent fields, however, that can be informative for assessment in medical education, and in the recent decades they have proven their value. In this AMEE Guide we discuss theories on expertise development and psychometric theories, and the relatively young and emerging framework of assessment for learning. Expertise theories highlight the multistage processes involved. The transition from novice to expert is characterised by an increase in the aggregation of concepts from isolated facts, through semantic networks to illness scripts and instance scripts. The latter two stages enable the expert to recognise the problem quickly and form a quick and accurate representation of the problem in his/her working memory. Striking differences between experts and novices is not per se the possession of more explicit knowledge but the superior organisation of knowledge in his/her brain and pairing it with multiple real experiences, enabling not only better problem solving but also more efficient problem solving. Psychometric theories focus on the validity of the assessment - does it measure what it purports to measure and reliability - are the outcomes of the assessment reproducible. Validity is currently seen as building a train of arguments of how best observations of behaviour (answering a multiple-choice question is also a behaviour) can be translated into scores and how these can be used at the end to make inferences about the construct of interest. Reliability theories can be categorised into classical test theory, generalisability theory and item response theory. All three approaches have specific advantages and disadvantages and different areas of application. Finally in the Guide, we discuss the phenomenon of assessment for learning as opposed to assessment of learning and its implications for current and future development and research.

PMID: 21942477 [PubMed - indexed for MEDLINE]


의학교육에서의 평가: 일반화가능도 이론의 개념

Medical education assessment: a brief overview of concepts in generalizability theory

Mohsen Tavakol1, Robert L. Brennan2

1Medical Education Unit, The University of Nottingham, UK

2Centre for Advanced Studies in Measurement and Assessment, The University of Iowa, USA






의학교육자들은 학생평가의 질 향상을 위해서 측정오차가 발생하는 원인을 알아야 한다.

General Medical Council (GMC) in the UK has emphasized the importance of internal consistency for students’ assessment scores in medical education.1 Typically Cronbach’s alpha is reported by medical educators as an index of internal consistency. Medical educators mark assessment questions and then estimate statistics that quantify the consistency (and, if possible, the accuracy and appropriateness) of the assessment scores in order to improve subsequent assessments. The basic reason for doing so is the recognition that student marks are affected by various types of errors of measurement which always exist in student marks, and which reduce the accuracy of measurement. The magnitude of measurement errors is incorporated in the concept of reliability of test scores, where reliability itself quantifies the consistency of scores over replications of a measurement procedure. Therefore, medical educators need to identify and estimate sources of measurement error in order to improve students’ assessment.


CTT에서 학생의 관찰점수는 진점수와 하나의 미분화된 오차항의 합이다. 이 모델에서 흔히 사용되는 신뢰도 척도는 Cronbach's alpha이다. 그러나 거의 항상 alpha는 item 표본과 관련된 오차만을 반영한다. 따라서 alpha만으로는 서로 다른 측정오차의 원인이 미치는 영향을 집어내거나 분리하거나 추정할 수 없다. CTT를 확장한 것이 G이론이며, 'facet'이라 불리는 다양한 측정오차의 원인을 구분할 수 있게 해준다. 

Under the Classical Test Theory (CTT) model, the student’s observed score is the sum of the student’s true score and a single undifferentiated error term. Using this model, the most frequently reported estimate of reliability is Cronbach’s alpha. Almost always, however, when alpha is reported, it incorporates errors associated with sampling of items, only. Accordingly, alpha does not allow us to pinpoint, isolate, and estimate the impact of different sources of measurement error associated with observed student marks. An extension of CTT called “G (Generalizability) theory” enables us to differentiate the multiple, potential sources of measurement error called “facets” (sometimes called “dimensions” in experimental design literature). 


모든 facet의 집합은 허용가능한 관측의 전집(universe of admissible observations, UAO)이라고 할 수 있다.

For example, in an OSCE exam, a student might be observed by one of a large sample of examiners, for one of a large sample of standardized patients (SPs), and for one of a large sample of cases. The facets, then, would be examiners, SPs and cases---each of which serves as a potential source of measurement error. The set of all facets constitutes the universe of admissible observations (UAO) in the terminology of G theory. As another example, suppose that for a cardiology exam, the investigator is interested in an item facet, only; in that case, there is only one facet.


어떤 facet을 사용해야 하는지, 얼마나 많은 facet을 사용해야 하는가에 대한 정답은 없다. 이를 결정하는 것은 연구자의 책임이며, 각 facet의 중요도에 대한 근거를 제시할 수 있어야 한다. 

There is no right answer to the question of which facets, or how many facets, should be included in the UAO. It is the investigator’s responsibility to justify any decision about the inclusion of facets, and provide supporting evidence about the importance of each facet to the consistency and accuracy of the measurement procedure. G theory provides a conceptual framework and statistical machinery to help an investigator do so.


특정 검사 형태에는 각 facet별로 구체적인 조건의 수가 정해져 있다. 비슷하게 구성된 모든 검사 형태의 (가상적) 집합을 일반화 전집(universe of generalization, UG)이라고 하며, UG 전체에 걸친 평균점수가 universe score이다(CTT의 진점수에 대응된다).

For any given form of a test, there are a specified number of conditions for each facet. The (hypothetical) set of all forms similarly constructed is called the universe of generalization (UG). For any given examinee, we can conceive of getting an average score over all such forms in the UG. This average score is called the student’s universe score, which is the analogue of true score in CTT. The variance of such universe scores, called universe score variance, can be estimated using the analysis of variance “machinery” employed by G theory.


G이론에서는 다양한 설계가 가능하다.

G theory can accommodate numerous designs to examine the measurement characteristics of many kinds of student assessments. If medical educators wish to investigate assessment items as a single source of measurement error on a test, this is a single-facet design. There are two types of single-facet designs. If the same sample of questions is administered to a cohort of students, we say the design is crossed in that all students (s) respond to all items (i). This crossed design is symbolised as s × i, and read as students crossed with items. If each student takes a different set of items, we have a nested design, which is symbolised i:s, meaning that items are nested within students.

In most realistic circumstances there are facets in addition to items. Imagine a case-based assessment with four cases and a total of 40 items designed to measure students’ ability in dermatology. In this example, all students take all items; hence, students are crossed with items (s × i), but items are distributed into cases (e.g., 10 items in case 1, 10 items in case 2, 10 items in case 3 and 10 items in case 4). That is, items are nested within cases, and this design is called a two-facet nested design that is symbolised as s × (i:c).
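
As a concrete illustration of the ANOVA "machinery" referred to in this section, the following minimal Python sketch estimates G-study variance components for the simplest case, a single-facet crossed s × i design, from the expected mean squares. The data and the function name are invented for illustration only; in practice the dedicated G theory software mentioned later in the article would normally be used, and multi-facet designs need more elaborate equations.

```python
import numpy as np

def variance_components_sxi(scores):
    """Estimate G-study variance components for a crossed s x i design.

    `scores` is a students x items matrix with one observation per cell.
    Uses the expected-mean-square equations of the two-way ANOVA without
    replication, so the s x i interaction is confounded with error.
    """
    scores = np.asarray(scores, dtype=float)
    n_s, n_i = scores.shape
    grand = scores.mean()
    student_means = scores.mean(axis=1)
    item_means = scores.mean(axis=0)

    ss_s = n_i * np.sum((student_means - grand) ** 2)   # between students
    ss_i = n_s * np.sum((item_means - grand) ** 2)      # between items
    ss_res = np.sum((scores - grand) ** 2) - ss_s - ss_i  # s x i interaction + error

    ms_s = ss_s / (n_s - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_s - 1) * (n_i - 1))

    return {
        "students": max((ms_s - ms_res) / n_i, 0.0),   # universe-score variance
        "items": max((ms_i - ms_res) / n_s, 0.0),      # item facet
        "residual": ms_res,                            # s x i interaction + error
    }

# Toy example: 5 students all answering the same 4 items (crossed design).
scores = [[7, 6, 8, 7],
          [5, 4, 6, 5],
          [9, 8, 9, 8],
          [4, 5, 5, 4],
          [6, 6, 7, 6]]
print(variance_components_sxi(scores))
```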


특정 facet에 대해서 variance component가 크다면, 이 facet이 학생 점수에 상대적으로 큰 영향을 줬다는 의미이다. 예를 들어 OSCE에서 만약 시험관(examiner)에 대한 variance component가 높게 나왔다면, 시험관이 평가에 있어서 일관되게 행동하지 못했음을 보여주는 것이다.

The designs discussed in the previous paragraphs are usually called G study designs, and they are associated with the UAO. The principal purpose of such designs is to collect data that can be used to estimate what are called “variance components.” In essence, the set of variance components for the UAO provides a decomposition of the total observed variance into its component parts. These component parts reflect the differential contribution of the various facets; i.e., a relatively large variance component associated with a facet indicates that the facet has a relatively large impact on student marks. For example, in an OSCE, if the variance component for examiners (the examiner facet) is estimated as high, we would conclude that the examiners have not behaved consistently in their rating of the construct of interest.


variance component를 계산하고 나면, 연구자들은 error variance를 추정하고 UG와 관련된 reliability-like coefficient를 계산한다. 

Once variance components are estimated, typically investigators estimate error variances and reliability-like coefficients that are associated with the UG. Such coefficients can range from 0 to 1. 


One coefficient is called a generalizability coefficient; it incorporates relative error variance. Another coefficient is called a Phi coefficient; it incorporates absolute error variance. 


Computing these coefficients and error variances requires specifying the D study design which, in turn, specifies the number of conditions of each facet that are (or will be) used in the operational measurement procedure. 


Relative error variance (and, hence, a generalizability coefficient) is appropriate when interest focuses on the rank ordering of students. 


Absolute error variance (and, hence, a Phi coefficient) is appropriate when interest focuses on the actual or “absolute” scores of students. 


Relative error variance (for a so-called “random effects” model) involves all the variance components that are interactions between students and facets. 


Absolute error variance includes relative error variance plus the variance components for the facets themselves. The square roots of these error variances are called standard errors of measurement. They can be used to establish confidence intervals for students’ universe scores. For further information about these coefficients and error variances, readers may refer to particular books.2,3


Knowing the magnitude of estimated variance components enables us to design student assessments that are optimal, at least from a measurement perspective. For example, a relatively small estimated variance component for the interaction of students and items suggests that a relatively small number of items may be sufficient for a test to achieve an acceptable level for a generalizability coefficient.
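
To make the link between variance components and these coefficients concrete, here is a hedged Python sketch for a single-facet s × i random design: relative error uses only the component for interactions with students, absolute error adds the facet component itself, and the D-study loop projects both coefficients for different test lengths. The variance components used below are invented numbers, not results from any real data set.

```python
import math

def d_study(var_students, var_items, var_residual, n_items):
    """Error variances and coefficients for an s x i D study with n_items items."""
    rel_error = var_residual / n_items                        # interactions with students only
    abs_error = var_items / n_items + var_residual / n_items  # adds the facet itself
    g_coef = var_students / (var_students + rel_error)        # generalizability coefficient
    phi = var_students / (var_students + abs_error)           # Phi (dependability) coefficient
    return {
        "G": round(g_coef, 3),
        "Phi": round(phi, 3),
        "SEM_rel": round(math.sqrt(rel_error), 3),             # relative standard error of measurement
        "SEM_abs": round(math.sqrt(abs_error), 3),              # absolute standard error of measurement
    }

# Illustrative (invented) variance components, projected for tests of 10-60 items.
for n in (10, 20, 40, 60):
    print(n, d_study(var_students=0.30, var_items=0.05, var_residual=0.90, n_items=n))
```

Running the loop shows the point made in the paragraph above: when the student-by-item component is modest relative to universe-score variance, the coefficients reach acceptable levels with relatively few items.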


In practice, powerful computer programs are required to estimate variance components, coefficients, and error variances, especially for multifaceted designs. Several G theory software programs have been developed for estimating such statistics (see, for example, http://www.education.uiowa.edu/centers/casma/computer-programs).

Variance components can also be estimated using SPSS and SAS, but these packages do not directly estimate coefficients and error variances. The first author is developing an online user-friendly application for estimating variance components, for both balanced and unbalanced designs. Using a simple script, readers will be able to print out the estimates of important parameters in G theory. The application is written in R and C++ languages and executed by PHP codes. Figure 1 shows a balanced design output from the application.








Medical education assessment: a brief overview of concepts in generalizability theory

Mohsen Tavakol, Robert L. Brennan
Int J Med Educ. 2013; 4: 221–222. Published online 2013 September 11. doi: 10.5116/ijme.5278.a850
PMCID: PMC4205529


Cronbach's alpha 이해하기 (IJME, 2011)

Making sense of Cronbach’s alpha

Mohsen Tavakol, Reg Dennick

International Journal of Medical Education






Reliable하지 않은 도구는 valid할 수 없다.

An instrument cannot be valid unless it is reliable. However, the reliability of an instrument does not depend on its validity.2



What is Cronbach alpha?

1951년 개발되었음. 0과 1사이의 값을 가진다.

Alpha was developed by Lee Cronbach in 1951 to provide a measure of the internal consistency of a test or scale; it is expressed as a number between 0 and 1.


내적일관성은 검사를 수행하기 전에 결정되어야 한다. 신뢰도는 한 검사에서 측정오차의 양이 어느정도인지를 보여주는 것이다. 상관관계를 제곱해서 1에서 빼면 측정오차의 index가 된다. 즉, 신뢰도가 0.80이라면 무작위오차가 0.36이라는 의미이다.

Internal consistency should be determined before a test can be employed for research or examination purposes to ensure validity. In addition, reliability estimates show the amount of measurement error in a test. Put simply, this interpretation of reliability is the correlation of test with itself. Squaring this correlation and subtracting from 1.00 produces the index of measurement error. For example, if a test has a reliability of 0.80, there is 0.36 error variance (random error) in the scores (0.80×0.80 = 0.64; 1.00 – 0.64 = 0.36).12
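
For readers who want to see the computation itself, the short Python sketch below applies the usual formula for alpha, k/(k−1) × (1 − Σσ²_item/σ²_total), to a small invented item-score matrix, and also reproduces the error-variance index described in the paragraph above (1 − reliability²). All data are hypothetical and the function name is ours, not the article's.

```python
import numpy as np

def cronbach_alpha(scores):
    """Cronbach's alpha for a students x items score matrix."""
    scores = np.asarray(scores, dtype=float)
    k = scores.shape[1]                          # number of items
    item_vars = scores.var(axis=0, ddof=1)       # variance of each item
    total_var = scores.sum(axis=1).var(ddof=1)   # variance of the total scores
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

# Toy data: 6 students, 4 items each scored 0-10.
scores = [[8, 7, 9, 8], [5, 6, 5, 6], [9, 9, 8, 9],
          [4, 5, 4, 5], [7, 6, 7, 7], [6, 6, 6, 5]]
alpha = cronbach_alpha(scores)
error_index = 1 - alpha ** 2   # the error-variance index described in the text above
print(round(alpha, 2), round(error_index, 2))
```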


검사에 포함된 문항이 서로 연관되어 있다면 alpha는 올라간다. 그러나 alpha가 높다고 해서 언제나 내적일관성이 높은 것은 아닌데, alpha는 검사 길이의 영향도 받기 때문이다. 검사 문항이 너무 적으면 alpha는 떨어진다. 따라서 alpha를 높이기 위해서 같은 개념을 측정하는 문항을 더 포함시킬 수 있다. 또한 alpha는 특정 수검자 표본에서 얻은 점수의 속성이다. 따라서 다른 문헌에서 제시된 alpha에만 의존해서는 안 되며 검사를 시행할 때마다 alpha를 새로 구해야 한다.

If the items in a test are correlated to each other, the value of alpha is increased. However, a high coefficient alpha does not always mean a high degree of internal consistency. This is because alpha is also affected by the length of the test. If the test length is too short, the value of alpha is reduced.2, 14 Thus, to increase alpha, more related items testing the same concept should be added to the test. It is also important to note that alpha is a property of the scores on a test from a specific sample of testees. Therefore investigators should not rely on published alpha estimates and should measure alpha each time the test is administered. 14


homogeneity가 '단차원성'을 의미하는 것과 달리, 내적일관성은 문항들의 상호연관성과 관련이 있다. 한 척도가 '단차원적'이라는 것은 모든 문항이 하나의 잠재 특성 혹은 구인을 측정한다는 의미이다. 내적일관성은 균일성/단차원성의 필요조건이지만 충분조건은 아니다. 근본적으로 신뢰도라는 개념은 검사문항의 단차원성을 가정하고 있으며, 이 가정이 위배되면 신뢰도를 크게 과소추정하게 된다. 다차원적인 검사가 단차원적인 검사보다 alpha가 반드시 낮은 것은 아니라는 점도 잘 알려져 있다. 따라서 보다 엄밀하게 보면, alpha를 단순히 검사의 '내적일관성' index로 해석해서는 안 된다.

Internal consistency is concerned with the interrelatedness of a sample of test items, whereas homogeneity refers to unidimensionality. A measure is said to be unidimensional if its items measure a single latent trait or construct. Internal consistency is a necessary but not sufficient condition for measuring homogeneity or unidimensionality in a sample of test items.5, 15 Fundamentally, the concept of reliability assumes that unidimensionality exists in a sample of test items16 and if this assumption is violated it does cause a major underestimate of reliability. It has been well documented that a multidimensional test does not necessarily have a lower alpha than a unidimensional test. Thus a more rigorous view of alpha is that it cannot simply be interpreted as an index for the internal consistency of a test. 5, 15, 17


alpha는 단순히 검사 문항의 단차원성을 측정하는 것이 아니지만, 문항들이 단차원적인지 확인하는 용도로 활용될 수는 있다. 반대로 한 검사가 두 개 이상의 개념(구인)을 검사하고 있다면, 전체 검사지의 alpha를 보고하는 것은 문항의 숫자를 늘리는 것과 같아서 결과적으로 alpha를 inflation시키는 것이다. 따라서 원칙적으로 alpha는 각각의 개념에 대해서 계산되어야 한다. 비균질한, 사례를 바탕으로 한 문항들로 구성된 총괄평가에서 alpha는 각 사례별로 계산되어야 한다.

Alpha, therefore, does not simply measure the unidimensionality of a set of items, but can be used to confirm whether or not a sample of items is actually unidimensional. 5 On the other hand if a test has more than one concept or construct, it may not make sense to report alpha for the test as a whole as the larger number of questions will inevitably inflate the value of alpha. In principle therefore, alpha should be calculated for each of the concepts rather than for the entire test or scale. 2, 3 The implication for a summative examination containing heterogeneous, case-based questions is that alpha should be calculated for each case.


더 중요한 것은 alpha가 'tau equivalent model'에 근거하고 있다는 것이다. 이 모델은 각각의 문항이 같은 scale로 같은 latent trait를 측정하고 있음을 가정한다. 따라서 factor analysis를 통해서 밝힐 수 있듯, 다수의 trait을 다루고 있다면 이 가정에 위배되는 것이며 alpha는 검사의 reliability를 과소평가하게 된다. 검사문항의 숫자가 너무 작다면 이 역시 tau-equivalence 가정을 위배하는 것이고, 신뢰도를 과소추정할 수 있다. 현실에서 alpha는 신뢰도의 하한추정값(lower-bound estimate)이며, 왜냐하면 비균질한 검사문항이 tau-equivalent model의 가정을 위배하는 것이기 때문이다. SPSS에서 'standardised item alpha'가 'Cronbach's alpha'보다 높다면 tau equivalent measurement에 대한 추가 검사가 필요하다.

More importantly, alpha is grounded in the ‘tau equivalent model’ which assumes that each test item measures the same latent trait on the same scale. Therefore, if multiple factors/traits underlie the items on a scale, as revealed by Factor Analysis, this assumption is violated and alpha underestimates the reliability of the test.17 If the number of test items is too small it will also violate the assumption of tau-equivalence and will underestimate reliability.20 When test items meet the assumptions of the tau-equivalent model, alpha approaches a better estimate of reliability. In practice, Cronbach’s alpha is a lower-bound estimate of reliability because heterogeneous test items would violate the assumptions of the tau-equivalent model.5 If the calculation of “standardised item alpha” in SPSS is higher than “Cronbach’s alpha”, a further examination of the tau-equivalent measurement in the data may be essential.


Numerical values of alpha

검사문항의 숫자, 상호연관성, 차원성 등이 alpha값에 영향을 준다. 수용가능한 alpha값에 대한 보고는 0.7에서 0.95까지의 범위에 이른다. alpha값이 낮은 것은 문항 수가 작거나, 상호연관성이 적거나, 비균질한 구인때문일 수 있다. 만약 낮은 alpha값이 문항간 낮은 상관관계에 기인한 것이라면 일부 문항을 수정하거나 버려야 한다. 가장 쉬운 방법은 각 문항과 총점의 상관관계를 구해보는 것이다. 상관관계가 낮은 문항을 버리면 된다. 만약 alpha가 너무 높다면, 이는 일부 문항은 서로 다른 문항인 척 위장하고 있지만 사실상 같은 문제임을 보여주는 것이다. alpha의 최대값은 0.90정도가 추천된다.

As pointed out earlier, the number of test items, item interrelatedness and dimensionality affect the value of alpha.5 There are different reports about the acceptable values of alpha, ranging from 0.70 to 0.95. 2, 21, 22 A low value of alpha could be due to a low number of questions, poor interrelatedness between items or heterogeneous constructs. For example if a low alpha is due to poor correlation between items then some should be revised or discarded. The easiest method to find them is to compute the correlation of each test item with the total score test; items with low correlations (approaching zero) are deleted. If alpha is too high it may suggest that some items are redundant as they are testing the same question but in a different guise. A maximum alpha value of 0.90 has been recommended.14
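
The "easiest method" mentioned above can be sketched in a few lines of Python: correlate each item with the total score and flag items whose correlation approaches zero (or turns negative). The corrected variant, which removes the item from the total before correlating, is a common refinement; the data and names below are invented for illustration.

```python
import numpy as np

def item_total_correlations(scores, corrected=True):
    """Correlation of each item with the total score.

    With corrected=True the item itself is removed from the total before
    correlating, which avoids inflating the correlation for that item.
    """
    scores = np.asarray(scores, dtype=float)
    totals = scores.sum(axis=1)
    out = []
    for j in range(scores.shape[1]):
        rest = totals - scores[:, j] if corrected else totals
        out.append(round(float(np.corrcoef(scores[:, j], rest)[0, 1]), 2))
    return out

# Toy data: item 4 runs against the rest of the test (negative correlation)
# and would be a clear candidate for revision or removal.
scores = [[8, 7, 9, 3], [5, 6, 5, 7], [9, 9, 8, 2],
          [4, 5, 4, 8], [7, 6, 7, 3], [6, 6, 6, 6]]
print(item_total_correlations(scores))
```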




Making sense of Cronbach's alpha

Mohsen Tavakol, Reg Dennick
Int J Med Educ. 2011; 2: 53–55. Published online 2011 June 27. doi: 10.5116/ijme.4dfb.8dfd
PMCID: PMC4205511


Programmatic Assessment를 위한 열두가지 팁(Medical Teacher, 2014)

12 Tips for programmatic assessment

C.P.M. VAN DER VLEUTEN1, L.W.T. SCHUWIRTH2, E.W. DRIESSEN1, M.J.B. GOVAERTS1 & S. HEENEMAN1

1Maastricht University, Maastricht, The Netherlands, 2Flinders University, Adelaide, Australia






Introduction

평가 프로그램을 구성할 때, 개별 평가는 '모든 평가의 합은 개별 평가의 단순합보다 크다'라는 생각을 가지고 선택되어야 한다. 따라서 개별 평가가 모두 완벽할 필요는 없다. 여러 평가법을 혼합한 결과가 이상적이어야 한다. 

From the notion that every individual assessment has severe limitations in any criterion of assessment quality (Van der Vleuten 1996), we proposed to optimise the assessment at the programme level (Van der Vleuten & Schuwirth 2005). In a programme of assessment, individual assessments are purposefully chosen in such a way that the whole is more than the sum of its parts. Not every individual assessment, therefore, needs to be perfect. The dependability and credibility of the overall decision relies on the combination of the emanating information and the rigour of the supporting organisational processes. Old methods and modern methods may be used, all depending on their function in the programme as a whole. The combination of methods should be optimal. After the introduction of assessment programmes we have published conceptual papers on it (Schuwirth & Van der Vleuten 2011, 2012) and a set of guidelines for the design of programmes of assessment (Dijkstra et al. 2012). More recently we proposed an integrated model for programmatic assessment that optimised both the learning function and the decision-making function in competency-based educational contexts (Van der Vleuten et al. 2012), using well-researched principles of assessment (Van der Vleuten et al. 2010). 


Dijkstra의 가이드라인이 보다 일반적이고 교육과정이 존재하지 않는 평가프로그램(예: 인증 프로그램)에도 적용가능한 반면, 통합적 모델은 구성주의적 학습 프로그램을 위한 것이다. PA(programmatic assessment)에서 의사결정은 개별 평가 시점과 분리된다. 각각의 평가는 학습자에 대한 정보를 모으는 것이 일차적 목적이며, 결정은 여러 평가에 걸쳐 충분한 정보가 수집되었을 때에만 내려진다. PA는 학습에 대한 종단적 관점을 포괄하며, 특정 학습성과와 관련해서 평가가 이뤄진다. 성장과 발달을 모니터하고 멘토링을 제공한다. 정보가 모두 모였을 때 (독립적인) 평가자 그룹이 의사결정을 내린다. 이러한 PA모델은 교육현장에서 좋은 평가를 받고 있지만, 많은 사람들은 PA가 복잡하고 이론적이라고 생각한다. 

Whereas the Dijkstra et al. guidelines are generic in nature and even apply to assessment programmes without a curriculum (e.g. certification programmes), the integrated model is specific to constructivist learning programmes. In programmatic assessment decisions are decoupled from individual assessment moments. These individual assessment moments primarily serve for gathering information on the learner. Decisions are only made when sufficient information is gathered across individual moments of assessment. Programmatic assessment also includes a longitudinal view of learning and assessment in relation to certain learning outcomes. Growth and development is monitored and mentored. Decision-making on aggregated information is done by an (independent) group of examiners. Although this model of programmatic assessment is well received in educational practice (Driessen et al. 2012; Bok et al. 2013), many find programmatic assessment complex and theoretical. Therefore, in this paper we will describe concrete tips to implement programmatic assessment.



평가를 위한 마스터플랜을 세우라

Tip 1 Develop a master plan for assessment

역량프레임워크 형태로 큰 틀에서의 구조를 선택해야 한다. 개별 평가에 대해서 모두 P/F 결정을 내리는 것이 아니라 다양한 평가가 이루어진 다음에 일관된 평가를 내려야 하기 때문이다. 기존의 형성평가와 종합평가라는 개념은 '저부담' 과'고부담' 의사결정으로 새롭게 정의된다. '고부담'결정은 많은 자료를 필요로 한다. 

Just like a modern curriculum is based on a master plan, programmatic assessment has to be based on such a master plan as well. Essential here is the choice for an overarching structure usually in the form of a competency framework. This is important since in programmatic assessment pass/fail decisions are not taken at the level of each individual assessment moment, but only after a coherent interpretation can be made across many assessment moments. An individual assessment can be considered as a single data point. The traditional dichotomy between formative and summative assessment is redefined as a continuum of stakes, ranging from low- to high-stakes decisions. The stakes of the decision and the richness of the information emanating from the data points are related, ensuring proportionality of the decisions: high-stake decisions require many data points. In order to meaningfully aggregate information across these data points an overarching structure is needed, such as a competency framework. Information from various data points can be combined to inform the progress on domains or roles in the framework. For example, information on communication from an objective structured Clinical examination (OSCE) may be aggregated with information on communication from several mini-clinical evaluation exercise (Mini-CEX) and a multisource-feedback tool.
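
As a purely illustrative sketch (not part of the original paper), the Python fragment below shows the kind of aggregation the master plan implies: individual low-stakes data points from different methods are tagged with a domain of the competency framework and summarised per domain, so that decisions draw on the aggregate rather than on any single assessment. All method names, scores and feedback strings are hypothetical.

```python
from collections import defaultdict

# Hypothetical data points: (method, framework domain, score 0-100, narrative feedback).
data_points = [
    ("OSCE station 3", "communication", 72, "Structured, but jargon-heavy explanation."),
    ("mini-CEX week 2", "communication", 80, "Good rapport; check understanding more often."),
    ("MSF nursing staff", "communication", 65, "Hand-overs are sometimes rushed."),
    ("mini-CEX week 4", "clinical reasoning", 78, "Differential appropriately prioritised."),
]

def aggregate_by_domain(points):
    """Group low-stakes data points per competency domain; decisions use the aggregate."""
    grouped = defaultdict(list)
    for method, domain, score, feedback in points:
        grouped[domain].append((method, score, feedback))
    return {
        domain: {
            "n_data_points": len(items),
            "mean_score": round(sum(s for _, s, _ in items) / len(items), 1),
            "feedback": [f"{m}: {f}" for m, _, f in items],
        }
        for domain, items in grouped.items()
    }

for domain, summary in aggregate_by_domain(data_points).items():
    print(domain, summary)
```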


따라서 마스터플랜은 전체 평가구조와 교육과정에서 각 데이터포인트가 어디에 위치하는지를 보여주는 지도가 되어야 한다. 실제 상황에서 이뤄지는 직접관찰과 같은 비표준화된 조건의 평가도 있고 이런 경우 전문가의 판단이 불가피하다. 학습 단계에 따라서 마스터플랜에는 표준화된 방법과 비표준화된 방법이 혼합된다. 교육과정에 대한 마스터플랜과 평가에 대한 마스터플랜은 이상적으로 하나의 마스터플랜이어야 한다.

The master plan should therefore also provide a mapping of data points to the overarching structure and to the curriculum. The choices for each method and its content are purposefully chosen with a clear educational justification for using this particular assessment in this part of the curriculum in this moment in time. Many competency frameworks emphasise complex skills (collaboration, professionalism, communication, etc.) that are essentially behavioural, and therefore require longitudinal development. They are assessed through direct observation in real-life settings, under unstandardised conditions, in which professional, expert judgement is imperative. Depending on the curriculum and the phase of study, the master plan will thus contain a variety of assessment contents, a mixture of standardised and non-standardised methods and the inclusion of modular as well as longitudinal assessment elements. For any choice, the contribution to the master plan and through this alignment with the curriculum and the intended learning processes is crucial. The master plan for the curriculum and the assessment is ideally one single master plan.


비표준화된 평가에서 전문가 판단을 사용함으로써 생기는 주관성은 PA에서 두 가지 방식으로 다룰 수 있다.

The resulting subjectivity from non-standardised assessment using professional judgement is something that can be dealt with in programmatic assessment in two ways. 

First, by sampling many contexts and assessors, because many subjective judgements provide a stable generalisation from the aggregated data (Van der Vleuten et al. 1991). 

Second, because subjectivity can be dealt through bias-reduction strategies showing due process in the way decisions are reached. We will revisit these latter strategies later in Tip 6. Subjectivity is not dealt with by removing professional judgement from the assessment process, for example, by over-structuring the assessment.



피드백을 장려하는 평가 규정을 개발하라

Tip 2 Develop examination regulations that promote feedback orientation

총괄평가식 접근에서 피드백은 대체로 무시된다. 개별 평가와 credit을 연결시킬수록 학습자는 피드백을 받고 그것을 따르기보다는 어떻게 시험에서 통과할지만 고민하게 된다. Credit point는 여러 데이터포인트에 기반한 고부담 결정에만 연계되어야 한다. 

Individual data points are optimised for providing information and feedback to the learner about the quality of their learning and not for pass/fail decisions. Pass–fail decisions should not be made on the basis of individual data points – as is often the case in traditional regulations. Examination regulations traditionally connect credits to individual assessments; this should be prevented in programmatic assessment. Research has shown that feedback is ignored in assessment regimes with a summative orientation (Harrison et al. 2013). Because linking credits to individual assessments raises their stake, learners will primarily orientate themselves on passing the test instead of on feedback reception and follow-up (Bok et al. 2013). Credit points should be linked only to high-stake decisions, based on many data points. In all communication and most certainly in examination regulations the low-stake nature of individual assessments should be given full rein.



정보 수집을 위한 견고한 시스템을 도입하라

Tip 3 Adopt a robust system for collecting information

e-portfolio는 다음의 세 가지 기능을 수행해야 한다.

In programmatic assessment, information about the learner is essential and massive information is gathered over time. Being able to handle this information flexibly is vital. One way of collecting information is through the use of (electronic) portfolios. Here, portfolios have a dossier function allowing periodic analyses of the student’s competence development and learning goals. The (e-)portfolio should therefore serve three functions: 

  • (1) provide a repository of formal and informal assessment feedback and other learning results (i.e. assessment feedback, activity reports, learning outcome products, and reflective reports), 
  • (2) facilitate the administrative and logistical aspects of the assessment process (i.e. direct online loading of assessment and feedback forms via multiple platforms, regulation of who has access to which information and by connecting information pieces to the overarching framework), and 
  • (3) enable a quick overview of aggregated information (such as overall feedback reports across sources of information).

User friendliness is vital. The (e-)portfolio should be easily accessible to whatever stakeholder has access to it. Many e-portfolios are commercially available, but care should be taken to ensure that the structure and functionalities of these portfolios are sufficiently aligned with the requirements of the assessment programme.



모든 저위험평가가 학습자들에게 의미있는 피드백을 제공하도록 하라

Tip 4 Assure that every low-stakes assessment provides meaningful feedback for learning

풍부한 정보량은 PA의 핵심이다. 의미있는 피드백이란 다양한 형태를 띨 수 있다.

Information richness is the cornerstone of programmatic assessment. Without rich assessment information programmatic assessment will fail. Mostly, conventional feedback from assessments, that is, grades and pass/fail decisions, are poor information carriers (Shute 2008). Meaningful feedback may have many forms. 

  • 한가지는 시험이 종료된 이후에 정답과 오답에 대한 정보를 제공해주는 것이다.
    One is to give out the test material after test administration with information on the correct or incorrect responses. In standardised testing, score reports may be used that provide more detail on the performance (Harrison et al. 2013), for example, by giving online information on the blueprint categories of the assessment done, or on the skill domains (i.e. in an OSCE), or longitudinal overview for progress test results (Muijtjens et al. 2010). 
  • 구두로 제공되는 피드백도 있을 수 있다. 비표준화된 평가에서 rating scale을 활용한 양적 정보를 얻기도 하지만, 한계가 있고 복잡한 기술에 대한 피드백은 묘사적 정보를 통하는 것이 더 낫다.
    Sometimes verbal feedback in or after the assessment may be given (Hodder et al. 1989). In unstandardised assessment, quantitative information usually stems from the rating scales being used. This is useful, but it also has its limitations. Feedback for complex skills is enhanced by narrative information (Govaerts et al. 2007). 
  • 묘사적 정보는 표준화된 평가도 더 풍부하게 만들 수 있다.
    Narrative information may also enrich standardised assessment. For example, in one implementation of programmatic assessment narrative feedback is given to learners on weekly open-ended questions (Dannefer & Henson 2007). 
  • 수량화하기 어려운 것을 억지로 수량화시킨다면 평가대상의 의의를 상실하게 될 수도 있다. 또한 점수만 따기 위한 행동이나 학점 인플레이션을 유발할 수도 있다.
    Putting a metric on things that are difficult to quantify may actually trivialise what is being assessed. Metrics such as grades often lead to unwanted side effects like grade hunting and grade inflation. 
  • 또한 평점은 의도치않게 피드백 과정을 '망칠' 수도 있다. 
    Grades may unintentionally “corrupt” the feedback process. Some argue we should replace scores with words (Govaerts & Van der Vleuten 2013), particularly in unstandardised situations where complex skills are being assessed such as in clinical workplaces. This is not a plea against scores. Scoring and metrics are fine particularly for standardised testing. This is a plea for a mindful use of metrics and words when they are appropriate to use in order to provide meaningful feedback.


효과적인 피드백을 얻기 위한 과정은 길고 힘들다. 자원을 절약하는 것에 관심을 가지는 것도 좋지만, 양질의 피드백을 제공하는데는 결국 시간과 노력이 필요하다. 두 가지를 명심해야 할 것이다.

Obtaining effective feedback from teachers, supervisors or peers can be a tedious process, because it is time and resource intensive. Considering resource-saving procedures is interesting (e.g. peer feedback or automatic online feedback systems), but ultimately providing good quality feedback will cost time and effort. Two issues should be kept in mind when thinking about the resources. 

  • 평가와 학습은 서로 얽혀 있다. 즉 가르치는 시간과 평가하는 시간이 명확하게 구분되지 않는다.
    In programmatic assessment, assessment and learning are completely intertwined (assessment as learning), so the time for teaching and assessment becomes rather blurred. 
  • 쓸모없는 피드백을 자주 하는것보다는 가끔이라도 양질의 피드백을 하는 것이 낫다.
    Second, more infrequent good feedback is better than frequent poor feedback. Feedback reception is highly dependent on the credibility of the feedback (Watling et al. 2012), so the “less-is-more” principle really applies to the process of feedback giving. High-quality feedback should be the prime purpose of any individual data point. If this fails within the implementation, programmatic assessment will fail.



학습자에게 멘토링을 제공하라

Tip 5 Provide mentoring to learners

피드백만으로는 부족할 수 있다. 피드백은 이상적으로는 성찰적 대화의 한 부분이어야 하며, 멘토링은 그러한 대화를 만들어나가는 효과적인 수단이다.

Feedback alone may not be sufficient for learners to be heeded well (Hattie & Timperley 2007). Research findings clearly indicate that feedback, reflection, and follow-up on feedback are essential for learning and expertise development (Ericsson 2004; Sargeant et al. 2009). Reflection for the mere sake of reflection is not well received by learners, but reflection as a basis for discussion is appreciated (Driessen et al. 2012). Feedback should ideally be part of a (reflective) dialogue, stimulating follow-up on feedback. Mentoring is an effective way to create such a dialogue and has been associated with good learning outcomes (Driessen & Overeem 2013).


PA에서 멘토링은 피드백 과정과 피드백 활용을 지원하기 위한 목적을 갖는다. 멘토의 역할은 학습자에게서 최대치를 이끌어내는 것이다. 전통적 평가에서는 최소 기준을 만족하는 것이 진급을 위해 충분했다면, PA에서는 개인의 수월성을 추구하는 것이 목적이며, 멘토는 이러한 수월성 달성을 위한 핵심인물이다. 

In programmatic assessment mentoring is used to support the feedback process and the feedback use. In a dialogue with an entrusted person, performance may be monitored, reflections shared and validated, remediation activities planned, and follow-up may be negotiated and monitored. This is the role of a mentor. The mentor is a regular staff member, preferably having some knowledge over the curriculum. Mentor and learner meet each other periodically. It is important that the mentor is able to create a safe and entrusted relationship. For that purpose the mentor should be protected from having a judgemental role in the decision-making process (Dannefer & Henson 2007). The mentor’s function is to get the best out of the learner. In conventional assessment programmes, adherence to minimum standards can suffice for promotion and graduation. In programmatic assessment individual excellence is the goal and the mentor is the key person to promote such excellence.



신뢰할 수 있는 의사결정을 내려라

Tip 6 Ensure trustworthy decision-making

풍부한 정보를 담고 있는 자료는 보통 양적, 질적 자료의 특성을 모두 가지고 있기 때문에, 이러한 정보를 종합해서 판단하는 것은 전문가적 판단력이 필요하다. '고부담'이라는 특성을 감안했을 때, 이러한 판단은 충분한 신뢰성을 갖추어야 하며 절차적 방법론이 이러한 신뢰성의 근거가 되어야 한다. 다음의 절차를 포함할 수 있다.
High-stakes decisions must be based on many data points of rich information, that is, resting on broad sampling across contexts, methods and assessors. Since this information rich material will be of both quantitative and qualitative nature, aggregation of information requires professional judgement. Given the high-stakes nature, such professional judgement must be credible or trustworthy. Procedural measures should be put in place that bring evidence to this trustworthiness. These procedural measures may include (Driessen et al. 2013):


  • An appointment of an assessment panel or committee responsible for decision-making (pass–fail–distinction or promotion decisions) having access to all the information, for example, embedded in the e-portfolio. Size and expertise of the committee will matter for its trustworthiness.
  • Prevention of conflicts of interest and ensuring independence of panel members from the learning process of individual learners.
  • The use of narrative standards or milestones.
  • The training of committee members on the interpretation of standards, for example, by using exceptional or unusual cases from the past for training purposes.
  • The organisation of deliberation proportional to the clarity of information. Most learners will require very little time; very few will need considerable deliberation. A chair should prepare efficient sessions.
  • The provision of justification for decisions with high impact, by providing a paper trail on committee deliberations and actions, that is, document very carefully.
  • The provision of mentor and learner input. The mentor knows the learner best. To eliminate bias in judgement and to protect the relationship with the learner, the mentor should not be responsible for final pass–fail decisions. Smart mentor input compromises can be arranged. For example, a mentor may sign for the authenticity of the e-portfolio. Another example is that the mentor may write a recommendation to the committee that may be annotated by the learner.
  • Provision of appeals procedures.

This list is not exhaustive, and it is helpful to think of any measure that would stand up in court, such as factors that provide due process in procedures and expertise of the professional judgement. These usually lead to robust decisions that have credibility and can be trusted.



중간 의사결정을 위한 평가를 조직하라

Tip 7 Organise intermediate decision-making assessments

모든 과정이 끝나고 이루어지는 '고부담 결정'은 학습자를 놀라게 하는 식으로 진행되어서는 안 된다. 중간평가의 결과를 제공하고 향후 결정에 대한 피드백을 줌으로써 최종 결정의 신뢰성을 높일 수 있다. 중간평가는 최종 결정보다 적은 수의 데이터포인트를 기반으로 내려진다. '저부담'과 '고부담' 사이의 '중부담' 평가라 할 수 있으며, 진단적·치료적·예후적 역할을 할 수 있다. 이상적으로는 평가위원회가 모든 중간평가 결과를 제공하는 것이 좋으나, 모든 학생에 대해 전체 위원회가 평가하는 것은 지나치게 자원이 많이 소모될 것이다. 따라서 자원을 더 효율적으로 사용할 수 있는 접근법을 고려할 필요가 있다.

High-stakes decisions at the end of the course, year, or programme should never be a surprise to the learner. Therefore, provision of intermediate assessments informing the learner and prior feedback on potential future decisions is in fact another procedural measure adding to the credibility of the final decision. Intermediate assessments are based on fewer data points than final decisions. Their stakes are in between low-stake and high-stake decisions. Intermediate assessments are diagnostic (how is the learner doing?), therapeutic (what should be done to improve further?), and prognostic (what might happen to the learner; if the current development continues to the point of the high-stake decision?). Ideally, an assessment committee provides all intermediate evaluations, but having a full committee assessing all students may well be a too resource-intensive process. Less costly compromises are to be considered, such as using subcommittees or only the chair of the committee to produce these evaluations, or having the full committee only looking at complex student cases and the mentors evaluating all other cases.



개별화된 교정교육을 장려하고 촉진하라.

Tip 8 Encourage and facilitate personalised remediation

교정교육은 재시험과는 다르다. 교정교육은 지속적인 성찰과정에서 드러나는 진단적 정보를 바탕으로 이뤄져야 하며, 언제나 개별화되어야 한다. 따라서 교육과정은 학습자가 교정교육을 계획하고 이수할 수 있도록 충분한 유연성이 있어야 한다. 비용이 많이 드는 교정교육 패키지를 개발할 필요는 없으며 학습자를 어떤 교정교육이 필요할지에 대한 결정에 참여시키고, 경험이 풍부한 멘토로부터 지원을 받도록 하면 된다. 이상적으로 교정교육은 충분한 지원과 방법을 학습자에게 제공하여 스스로의 책임이 되도록 해야 한다.

Remediation is essentially different from resits or supplemental examinations. Remediation is based on the diagnostic information emanating from the on-going reflective processes (i.e. from mentor meetings, from intermediate evaluations, and from the learner self) and is always personalised. Therefore, the curriculum must provide sufficient flexibility for the learner to plan and complete remediation. There is no need for developing (costly) remediation packages. Engage the learner in making decisions on what and how remediation should be carried out, supported by an experienced mentor. Ideally, remediation is made a responsibility of the learner who is provided with sufficient support and input to achieve this.



프로그램의 효과와 활용을 모니터하고 평가하라

Tip 9 Monitor and evaluate the learning effect of the programme and adapt

멘토는 중요한 이해관계자이다.

Just like a curriculum needs evaluation in a plan-do-act-cycle, so does an assessment programme. Assessment effects can be unexpected, side effects often occur, assessment activities, particularly very routine ones, often tend to trivialise and become irrelevant. Monitor, evaluate, and adapt the assessment programme systematically. All relevant stakeholders involved in the process of programmatic assessment provide a good source of information on the quality of the assessment programme. One very important stakeholder is the mentor. Through the mentor’s interaction with the learners, they will have an excellent view on the curriculum in action. This information could be systematically gathered and exchanged with other stakeholders responsible for the management of the curriculum and the assessment programme. Most schools will have a system for data-gathering on the quality of the educational programme. Mixed-method approaches combining quantitative and qualitative information are advised (Ruhe & Boudreau 2013). Similarly, learners should be able to experience the impact of the evaluations on actual changes in the programme (Frye & Hemmer 2012).



평가절차에서 나온 정보를 교육과정 평가에 활용하라

Tip 10 Use the assessment process information for curriculum evaluation

평가는 주로 세 가지 기능을 한다.

Assessment may serve three functions: 

  • to promote learning, 
  • to promote good decisions on whether learning outcomes are achieved, and 
  • to evaluate the curriculum. 


In programmatic assessment, the information richness is a perfect basis also for curriculum evaluation. The assessment data gathered, for example, in the e-portfolio, provides an X-ray not only of the competence development of the learners, but also of the quality of the learning environment.



이해관계자간 지속적 상호작용을 하라

Tip 11 Promote continuous interaction between the stakeholders

PA는 모든 수준의 구성원에게 영향을 미치므로 교육기관 전체의 책임이다. 도입 이후에는 이해관계자 집단 간의 빈번하고 지속적인 의사소통이 필수적이며, 의사소통의 내용은 기준 운영상의 불완전성, 사건, 흥미로운 사례 등 시스템 개선으로 이어질 수 있는 것들이다. 평가위원회와 멘토 사이에 벽이 있다면 객관적이고 독립적인 의사결정에는 도움이 되겠지만, 그만큼 정보의 풍부함은 줄어들 수 있다. 

As should be clear from the previous, programmatic assessment impacts at all levels: students, examiners, mentors, examination committees, assessment developers, and curriculum designers. Programmatic assessment is, therefore, the responsibility of the whole educational organisation. When implemented, frequent and on-going communication between the different stakeholder groups is essential in the process. Communication may regard imperfections in the operationalisation of standards or milestones, incidents, and interesting cases that could have consequences for improvement of the system. Such communication could eventually affect procedures and regulations and may support the calibration of future decisions. For example, a firewall between the assessment committee and mentors fosters objectivity and independency of the decision-making, but at the same time may also hamper information richness. Sometimes, however, decisions need more information about the learner and then continuous communication processes are indispensable. The information richness in programmatic assessment enables us to make the system as fair as possible.



도입을 위한 전략을 개발하라

Tip 12 Develop a strategy for implementation

PA는 학습에 대한 구성주의적 관점을 기반으로 한다. 평가시스템을 급격하게 변화시키는 것은 평가가 무뎌지거나 학생들의 'gaming'에 취약해질 것이라는 우려를 가지게 하나, 실제 활용한 사례를 살펴보면 그 반대이다. 그럼에도 고등교육의 많은 부분이 변화에 저항하는 특성이 있어서 변화전략이 필요하다. 

Programmatic assessment requires a culture change in thinking about assessment that is not easy to achieve in an existing educational practice. Traditional assessment is typically modular, with summative decisions and grades at the end of modules. When passed, the module is completed. When failed, repetition through resits or through repetition of the module is usually the remedy. This is all very appropriate in a mastery learning view on learning. However, modern education builds on constructivist learning theories, starting from notions that learners create their own knowledge and skills, in horizontally and/or vertically integrated programmes to guide and support competence. Programmatic assessment is better aligned to notions of constructivist learning and longitudinal competence development through its emphasis on feedback, use of feedback to optimise individual learning and remediation tailored to the needs of the individual student. This radical change often leads to fear that such assessment systems will be soft and vulnerable to gaming of students, whereas the implementation examples demonstrate the opposite effect (Bok et al. 2013). Nevertheless, for this culture change in assessment a change strategy is required, since many factors in higher education are resistant to change (Stephens & Graham 2010). A change strategy needs to be made at the macro-, meso- and micro levels.


  • At the macro level, national legal regulations and university regulations are often strict about assessment policies. Some universities prescribe grade systems to be standardised across all training programmes. These macro level limitations are not easy to influence, but it is important to know the “wriggle room” these policies leave for the desired change in a particular setting. Policy-makers and administrators need to become aware of why a different view on assessment is needed. They also need to be convinced on the robustness of the decision-making in an assessment programme. The qualitative ontology underlying the decision-making process in programmatic assessment is a challenging one in a positivist medical environment. Very important is to explain programmatic assessment in a language that is not jargonistic and which aligns with the stakeholder’s professional language. For clinicians, for example, analogies with diagnostic procedures in clinical health care often prove helpful.
  • At the meso level programmatic assessment may have consequences for the curriculum. Not only should the assessment be aligned with the overarching competency framework, but with the curriculum as well. Essential are the longitudinal lines in the curriculum requiring a careful balance of modular and longitudinal elements. Individual stakeholders and committees need to be involved as early as possible. Examination rules and regulations need to be constructed which are optimally transparent, defensible, but which respect the aggregated decision-making in programmatic assessment. The curriculum also needs to allow sufficient flexibility for remediation. Leaders of the innovation need to be appointed, who have credibility and authority.
  • Finally, at the micro level teachers and learners need to be involved in the change right from the start. Buy-in from teachers and learners is essential. To create buy-in the people involved should understand the nature of the change, but more importantly they should be allowed to see how the change also addresses their own concerns with the current system. Typically, teaching staff do have the feeling that something in the current assessment system is not right, or at least suboptimal, but they do not automatically make the connection with programmatic assessment as a way to solve these problems.


The development of programmatic assessment is a learning exercise for all and it is helpful to be frank about unexpected problems that arise during the first phases of the implementation; that is innate to innovation. It is therefore good to structure this learning exercise as a collective effort, which may exceed traditional faculty development (De Rijdt et al. 2013). Although conventional faculty development is needed, involving staff and students in the whole design process supports the chance of success and the creation of ownership (Könings et al. 2005) and creates a community of practice promoting sustainable change (Steinert 2014).


PA로 변화하는 것은 전통적 교육과정이 PBL로 변화하는 과정에 비견될 수 있다. 

Changing towards programmatic assessment can be compared with changing traditional programmes to problem-based learning (PBL). Many PBL implementations have failed due to problems in the implementation (Dolmans et al. 2005). When changing to programmatic assessment, careful attention should be paid to implementation and the management of change at all strategic levels.




Conclusion

Programmatic assessment has a clear logic and is based on many assessment insights that have been shaped through research and educational practice. Logic and feasibility, however, are inversely related in programmatic assessment. To introduce full-blown programmatic assessment in actual practice all stakeholders need to be convinced. This is not an easy task. Just like in PBL, partial implementations are possible with programmatic assessment (i.e. the increase in feedback and information in an assessment programme, mentoring). Just like in PBL, this will lead to partial success. We hope these tips will allow you to get as far as you can get.








 2014 Nov 20:1-6. [Epub ahead of print]

12 Tips for programmatic assessment.

Author information

  • 1Maastricht University , Maastricht , The Netherlands .

Abstract

Programmatic assessment is an integral approach to the design of an assessment program with the intent to optimise its learning function, its decision-making function and its curriculum quality-assurance function. Individual methods of assessment, purposefully chosen for their alignment with the curriculum outcomes and their information value for the learner, the teacher and the organisation, are seen as individual data points. The information value of these individual data points is maximised by giving feedback to the learner. There is a decoupling of assessment moment and decision moment. Intermediate and high-stakes decisions are based on multiple data points after a meaningful aggregation of information and supported by rigorous organisational procedures to ensure their dependability. Self-regulation of learning, through analysis of the assessment information and the attainment of the ensuing learning goals, is scaffolded by a mentoring system. Programmatic assessment-for-learning can be applied to any part of the training continuum, provided that the underlying learning conception is constructivist. This paper provides concrete recommendations for implementation of programmatic assessment.

PMID: 25410481 [PubMed - as supplied by publisher]


의과대학 임상실습에서의 학생평가방법: 과거, 현재 및 제언

Assessment of Medical Students in Clinical Clerkships

이상엽1,2ㆍ임선주1ㆍ윤소정3ㆍ백선용1ㆍ우재석1

1부산대학교 의학전문대학원 의학교육실, 2양산부산대학교병원 가정의학클리닉, 3부산대학교 교수학습지원센터


Sang Yeoup Lee1,2 · Sun Ju Im1 · So Jung Yune3 · Sunyong Baek1 · Jae Seok Woo1

1Medical Education Unit, Pusan National University School of Medicine; 2Family Medicine Clinic, Pusan National University Yangsan Hospital, Yangsan; 3Center for Teaching

and Learning, Pusan National University, Busan, Korea





서 론

교육목표를 바탕으로 개발된 교육과정에 따라 교수-학습이 이루어지고 나면, 교육평가를 통해 교육목표에 도달했는지 확인해야 한다. 평가는 교육과정의 마지막 단계이며 동시에 평가결과를 다시금 교육목표에 반영하게 되는 시작단계라고도 할 수 있다. 최근에 대두되는 성과나 역량중심의 교육과정에서는 성과나 역량을 무엇으로, 어떻게 측정할 것인지가 실제적인 핵심이기 때문에 평가의 중요성은 더욱 강조되고 있다. 특히 의학에 있어 임상실습의 평가는 지금까지 그 타당성과 신뢰성에 많은 의문이 제기되어 온 바 있다. 따라서 본 종설에서는 의학교육과정의 임상실습평가의 과거와 현재를 살펴보고 개선방향을 논의하고자 한다.



과 거


불과 수년 전만 하더라도 기초와 임상의학만 모두 배우고 나면임상실습 진입식과 함께 졸업은 따 놓은 당상(堂上)이었다. 임상실습과정에서는 출석미달 이외에는 유급이 거의 없기 때문에, 실습병원에 출석만 제대로 하면 진급이 되었다. 전공의가 실습점수에 적지 않게 관여하기 때문에, 선배가 있는 학생은 더욱 마음 편히 실습을 하곤 했다(Lee et al., 2002; Park &Kim, 2004). 교수도 이제 다배웠다고 생각해서인지 여간해선 임상실습에서 유급을 결정하지않았다. 제도적으로도 학생 수가 많은 학교에서는 집단(group)별로 실습을 순환하기 때문에 학생들 간에 실습점수를 비교하기에는현실적으로 많은 어려움이 있었다. 임상실습을 모두 마치고 나면오히려 실습 전보다 의학지식은 더 모자란 것 같았고, 할 줄 아는 임상술기도 별로 없었다. 어미 뒤를 쫓아가는 병아리처럼 학생들은교수의 꽁무니만 이리저리 따라 다녔다. 학생에게 인턴이나 전공의는 하늘 같은 존재였고, 온갖 수모를 당할지라도 실제로 교수보다는 이들에게서 배우는 것이 더 많았다. 각 과를 일정기간 돌며 환자의 진단과 치료과정을 어깨너머로 익혔다. 지금에야 실습 도중에간단한 수업이나 설명도 해 주지만, 이전에는 교수가 직접 가르쳐주는 것은 거의 없고 말 그대로 알아서 보고 스스로 배우도록 하였다. 환자 회진 때는 행여 질문이라도 할까 싶어 회진대열의 가급적맨 뒤에서 따라다닌다. 수술실에서는 교수의 어깨너머로 수술을참관하기도 하지만 가까이서 보려고 다가가면 수술조명을 가린다고 야단맞기 일쑤였다. 간혹 견인기구를 당기면서 수술 보조역할을하게 되면 큰 영광이었고, 수술 도중에 교수가 질문을 하면 아무런대답도 못 하는 경우가 허다했다. 학생이 대답을 못 하고 쩔쩔매는모습을 즐기는 교수도 있었을 것이다. 각종 세미나, 컨퍼런스에 참여하고 이런 저런 경험을 겪으면서, 의사가 되면 저런 일을 하는구나, 이 과는 이렇게 아침 일찍 출근하는구나, 이 과에는 이런 환자들이 방문하는구나, 나중에 의사가 되면 나도 저렇게 할 수가 있을까를 느끼다 보면 어느새 실습이 모두 끝나버린다. 임상실습학점이나오면 관심도 없었거니와 실습 때 질문해서 답을 제대로 한 경우가 없다보니, 실습학점이 낮다고 이의신청을 하는 경우는 거의 없었다. 실습평가가 객관적인지, 타당한지, 신뢰할 만한지에 대해 학생은 생각할 여유가 없었다. 교수 또한 임상실습평가의 문제점에 대한 인식과 고민은 있었으나 그것을 개선하려는 시도는 거의 하지못했다(Park, 2004).



현 재


임상실습 이전의 교육과정에는 강의식 수업, 팀바탕학습, 문제바탕학습, 토론식 수업 등 다양한 방식의 교수방법에 따른 교육평가가 시행되고 있으며, 임상실습평가보다는 용이한 것이 사실이다. 현재까지도 의과대학 임상실습의 학생평가방법에 대해서는 신뢰도,타당도에 대한 의문이 여전히 남아 있으며, 과연 무엇을 기준으로유급을 결정할 것인지에 대한 확신이 서지 않는 경우가 많다. 이런이유로 학업성취도 따라 유급(fail)이 결정되기도 하는 임상실습 전의 교육과정과 달리 임상실습에서는 그렇게 결정하기가 쉽지 않기때문에 대부분 높은 학점을 부여한다. 이러한 상황은 미국의 경우도 마찬가지이다. 미국 임상실습과정은 학교마다 매우 다양한 학점부여체계를 가지고 있다. 어떤 체계를 사용하더라도 전체 의대생의1% 미만은 임상실습과정에서 유급을 하게 되지만, 나머지 학생의대부분, 즉 97%는 최상위 세 가지 등급에 속하는 학점을 주고 있어,한국과 마찬가지로 미국에서도 임상실습평가에서 소위 학점인플레이션현상이 나타나고 있다(Alexander et al., 2012).


임상실습은 학생신분으로 정식의사가 되기 이전에 의사로서의역할과 업무를 익힐 수 있는 기회이다. 즉, 환자면담과 진찰, 모의처방, 의무기록작성 등을 경험함으로써 임상실습 전에 익혔던 의학지식들이 임상추론을 통해 서서히 적용이 되도록 하는 것이다. 하지만 실제로는 임상지식습득, 구두사례발표능력, 문제해결능력 등을주로 평가하고 있었으며, 팀워크, 의사소통(면담)기술, 임상적 의사결정능력, 진단검사사용능력, 의무기록작성 등 의사의 실무에 대한 평가는 미흡했다(Kim et al., 2009; Yang et al., 2007) 또한 비교적 대표성을 지닌 우리나라 23개 의과대학 . 의학전문대학원의 임상실습평가현황에 의하면, 학생들이 주로 피동적으로 참가하는소규모 강의, 외래참관, 세미나, 수술실 관찰 및 지원, 병동회진 등의 교육방법이 임상실습에 주로 사용되고 있었다. 임상실습평가는학생들의 수행능력 향상에 초점을 두고, 실습활동에 근거해야 하는데도(Miller &Archer, 2010), 실제로는 평가방법의 경우에도 필기시험, 출석, 보고서 같이 수행능력평가로는 타당성에 의문이 있는 평가방식이 전체 평가반영비율의 50% 이상으로 대부분을 차지하고 있었다.


그러나 2000년에 시작된 의학교육인증평가와 2010년도 제74회의사국가시험의 실기시험 도입이라는 외적 동기는 의과교육과정,특히 임상실습과정에 대대적인 혁신을 불러일으켰다. 각 학교마다실기시험을 대비하여 각종 임상수기훈련센터, 혹은 임상시뮬레이션 등 다양한 이름의 시설이 갖추어지고, 임상실습교육을 강화하는 교육과정 개편이 이루어지기 시작하였다. 이에 따른 새로운 평가방법으로 객관구조화진료시험, 진료수행시험과 같은 임상수행능력평가시험이 확대 . 보급되었으며, 실기모형 혹은 표준화 환자의도움으로 일차의료에 필요한 술기능력뿐 아니라 이전에는 평가하기 어려웠던 의사소통, 면담기술 혹은 환자의사관계까지도 평가가가능해졌다. 여전히 실제 환자를 대상으로 하는 것과는 차원이 다른 것이지만 학생들은 임상실습 전에 미리 최소한의 준비를 할 수있게 되었다. 아울러 의료윤리, 의료면담기술, 신체진찰교육, 의무기록작성, 외래실습확대와 같은 교육과정도 보다 확대되었다.


학생들에게도 많은 변화가 생겼지만, 학생을 제대로 평가하려면제대로 된 문항의 개발이 전제되어야 하기 때문에, 학교마다 교수또한 임상수행능력시험 문항개발이라는 새로운 과업이 생겼다. 개별 학교에서 국가시험처럼 12문항의 임상수행능력평가시험을 실시하려면 학생 수가 120명인 경우 2일이 소요되며, 적지 않은 표준화환자와 교수평가자가 필요하고 비용도 만만치 않았다. 따라서 여러의과대학 . 의학전문대학원과 연합체(컨소시엄)를 구성하여 역할을 공유하게 되었다. 의학교육인증평가기준에는 모든 핵심과목에서 해당 과목별로 실습기간 중 임상수행능력평가를 시행하고, 그결과를 실습성적에 반영하는 것을 우수기준으로 정하고 있지만,대개는 한 학기 혹은 모든 학기의 임상실습과정이 종료되면 종합임상수행능력시험을 시행하고, 그 결과를 해당 실습과별로 실습점수에 반영한다. 하지만 실습과별로 배당된 문항은 현실적으로 1-2문항에 불과하기 때문에, 1-2문항에 우수한 점수를 받은 것이 과연해당 실습과의 진료능력과 역량이 탁월했다고 단정 짓기는 어렵다.그렇다고 해도 현재로서는 임상수행능력평가시험이 의학생의 진료수행능력평가를 위한 가장 신뢰할 만한 도구로 여겨지고 있다(Park, 2012; Ramchandani, 2011).




제언 및 논의


의학교육에서 평가는 매우 중요하고 다양한 목적을 지닌다(Yuet al., 1994). 평가도구가 양호한 지를 결정하는 요인으로는 타당도,신뢰도, 객관도뿐 아니라 실용도, 구체성, 관련성, 효율성 등 다양한항목들이 있다. 또한 평가시기에 따라 진단평가, 형성평가, 총괄평가 등으로 분류되며, 평가방법에 따라 양적 평가와 질적 평가로 분류될 수 있다. 그리고 관찰이나 포트폴리오, 면접 및 구술 등을 통해 학습자의 수행과정 및 그 결과를 전문적으로 판단하는 수행평가방식이 있다. 배운 것을 기억해 내거나 이해하는 것과 같은 지식의 회상(recall)과 지식을 응용하고 분석하고 평가하는 것과 같은지식의 적용(application)은 필기시험으로, 술기(skill)는 실기시험으로 평가가 가능하지만, 창조성(creating)은 실제 임상상황에서학생이 하는 것을 관찰하거나 구두시험을 통해 학생의 사고흐름을파악해야 가능하다.


임상실습은 학생들이 강의에서 배운 의학지식보다는 그 지식을활용하여 실제 환자에게 의사소통기술과 임상술기를 적용해 보고,의사의 직무를 경험해 보게 하는 과정이다(Miller &Archer, 2010).기본적으로 매일매일의 실습활동이 평가대상이지만 모든 학생이임상실습에서 동일한 임상표현을 경험하는 것은 아니라는 점에서동일한 상황의 평가는 아니고 구조적이지도 않다. 게다가 의과대학에서 학업성취도는 진급 여부를 결정하는 주요 근거가 되지만(Park et al., 2009), 임상실습에서는 학업성취도를 명확히 구분할수 있는 평가방법을 모색하기란 쉽지 않다. 임상실습평가의 타당성과 신뢰성을 확보하기 어렵기 때문이다. 선행 연구에서 임상실습성적은 필기시험성적과는 높은 상관성이 있었으나 임상수행능력시험성적과는 낮은 상관성을 보였는데, 이는 임상실습성적의 평가항목의 대부분이 지식에 치우쳐 있었기 때문으로 생각된다(Kim,2003; Koh &Park, 2009; Lurie &Mooney, 2010).


임상실습평가에 있어서의 잠재된 문제 중 하나는 평가자 요인에 있다. 아무리 좋은 평가도구를 사용하더라도 평가자의 잦은 변동은 신뢰도(reliability)와 객관도(objectivity)에 문제가 된다. 신뢰도는 평가도구나 결과의 일관성인 반면 객관도는 평가자의 평가의 일관성이라고 할 수 있다. 객관도는 평가자가 다를 때의 평가자 간의 평가결과 일치 정도(degrees of consistency)를 말하는 것으로서 평가자 간의 신뢰도라고도 한다. 객관도를 높이기 위해 평가도구 자체를 객관화시키고, 명확하고 구체적인 평가기준을 마련하며, 평가자가 주관적 요소를 최대한 줄이고 객관적인 평가가 되도록 자신의 평가역량을 개발하는 것 또한 중요한 문제이다. 이를 위해 각 대학에서는 임상실습에 대한 공통평가체계를 수립하고 구조화, 표준화할 필요가 있다. 그것은 평가자의 단순 관찰부터 소그룹지도, 일대일지도, 실습노트나 포트폴리오와 같은 다양한 평가자료의 활용 및 자기성찰, 객관화한 임상수행시험, 실습현장의 다면평가, 환자안전, 의사소통과 협력, 리더십, 관리능력 등에 대한 다양한 평가가 실습과정 중에 이루어질 필요가 있다. 이렇듯 학생들의 임상실습역량은 학생을 둘러싼 다면적 평가를 통해 비로소 합리적인 평가가 가능해진다(Han, 2012; Wang et al., 2011). 기술적으로는 여러 평가자가 동시에 논의하여 하나의 종합평가를 내리는 것도 신뢰도를 높일 수 있는 하나의 방편이 될 것이다.


또한 각 실습활동에 대한 의미와 교육목적을 학생에게 사전에공지하고 평가지침과 평가항목을 구체적이고 명확하게 하는 것이효과적이다. 실습평가는 실습활동에 근거하여 각 활동별로 용이하게 평가가 이루어질 수 있어야 한다. 몇몇 자기주도 학생이라면 평가가 되지 않는 실습활동이라고 하더라도 학교에서 주어진 실습계획 이외에도 자신만의 실습계획을 추가하여 적극적으로 실습활동을 하겠지만, 대부분의 학생은 평가하지 않는 실습활동은 피동적이기 마련이다. 한편 어떤 평가항목이 다른 항목의 평가기능을 포함한다면 그 항목은 삭제하여 간결하게 하는 것이 좋다. 예를 들어많은 경우 출석이 임상실습평가항목의 기본요소로 활용이 되고 있지만, 만약 어떤 식으로든 날마다 평가가 이루어진다면 출석을 하지 않으면 의례히 평가점수가 없기 때문에 출석항목이 평가항목에는 없어도 된다(Kim, 2003). 또한 수행평가 체크리스트의 경우에도교수입장에서는 학생의 수행을 일일이 관찰해야 하는 부담이 있고거의 모든 학생이 결국 평가자의 확인을 모두 받아오기 때문에 평가점수에 포함하지 않고 점수 없이 기본항목으로만 두던지 점수를주더라도 비중을 줄여도 될 것이다. 그리고 자기평가나 동료평가항목에 있어서는 임상실습 후 학생들은 교수보다 자신을 관대하게 평가하는 경향을 보이는 것으로 나타나 자기평가나 동료평가가 우리나라 정서에서는 아직 교수평가를 대체하기는 어렵다고 결론을 내린 바 있다(Hur et al., 2008). 임상실습교육방법과 평가의 향후 개선방안에 대해 다음과 같은 제언을 하고자 한다.



첫째, 임상실습 전에 반드시 알아야 할 의학지식의 목록을 알리고 진단평가로 예습필기시험을 치고 실습을 마친 후에는 종합필기시험을 시행할 수도 있다. 또한 임상실습 전에 사전에 예고한 중요질환이나 임상표현에 관한 구술시험을 치고 임상실습을 마친 후 동일한 구술시험으로 사전-사후 구술시험결과의 향상 정도를 평가에 반영하는 것도 가능하다.


둘째, 구성주의 교육이론에 근거하여 학생들이 임상실습에서 새롭게 배우게 되는 지식과 기술들을 사전에 통합교육과정에서 학습한 기초 및 임상의학지식과 연합하여 새로이 구성할 수 있도록 한다. 또한 매 실습에서 배우게 되는 지식들이 다음 실습과 어떻게 통합되고 구성되어 갈 수 있을지 인지하도록 한다. 이 목적을 달성하려면 교수가 학생에게 순간순간 피드백과 코칭이 필요하다. 1분 지도(the one minute preceptor), summarize the case-narrow the differential-analyze the differential-probe the preceptor-plan management-select an issue for self directed learning (SNAPPS) 교수법, 혹은 set-tutor demonstration-explanation-practice-subsequentdeliberate practice set (STEPS) 교수법이 그 예들이다.

  • 1분 지도는 짧은 시간 이내에 학생이 파악하고 있는 입원 환자상태나 원인에 대해 먼저 묻고 답변에 따라 긍정적 피드백이나 잘못된 생각이나 실수를 교정해주면서 임상추론능력을 향상시키려는 지도방법이고, 
  • SNAPPS 교수법은 학생으로 하여금 병력청취와 신체진찰에서 얻은 객관적 사실을 보고하도록 한 후, 학생이 미리 생각해 둔질문을 토대로 교수와 토론을 해나가면서 진단과정과 치료계획까지 다루어 보도록 하는 또 하나의 임상추론능력 계발방법이다.
  • STEPS 교수법은 특히 임상술기교육에 유용한 것으로 술기를 설명하고 시범을 보여 준 후 학생들이 해보도록 하는 과정을 단계별로구조화시킨 교수법이다. 하지만 여기에도 중요한 것은 전 학생에게일관되게 시행되어야 하고 반드시 적절한 평가가 수행되어야 한다는 점이다(Im, 2012).


셋째, 학생들이 집담회나 세미나에 참관한 경우는 마칠 때 학생을 위해 요약정리를 해 주는 것이 필요하다(Park, 2004). 초보학습자의 경우에는 자신이 배운 내용의 핵심적인 내용과 기억해야 할내용을 잘 인식하지 못하는 경우가 많다. 따라서 요약정리시간을갖고 반드시 기억할 수 있도록 하는 것이 필요하다.



넷째, 사전에 공지한 범위 내에서 구술시험 혹은 질의응답을 시행하여 임상추론과정을 평가할 수 있다. 이는 온라인과 오프라인모두에서 가능하다. 예를 들어 답변에서 미흡한 부분이나 질의응답과정에 도출된 과제는 페이스북과 같은 웹사이트를 활용하여 탑재하고 내용에 대해 실시간 피드백을 시행할 수 있다. 저자가 운영하는 http://www.facebook.com/ilovefamilydoc 사이트를 방문하면 참고가 될 것이다.



다섯째, 배정받은 환자의 증례발표의 경우 제대로 하려면 환자의동의를 구한 후 병력청취와 신체검사를 임상추론과정을 스스로해보아야 목표하던 교육효과를 얻을 수 있으나, 임상실습이 주로 3차 의료기관에서 이루어지고 배정받은 환자는 이미 1, 2차 의료기관에서 어느 정도 진단이 된 다음에 방문 또는 입원하였기 때문에임상추론과정을 경험해 보기가 어렵다. 또한 대부분의 졸업목표인일차진료영역을 넘어선 경우가 많다. 이를 개선하려면 진단되지 않은 첫 환자를 배정하도록 하고, 1, 2차 병원의 실습을 확대하며, 외래진료 위주로 초진 환자에게 병력청취와 신체검사를 수행하도록한다.


여섯째, RIME이란 reporter-interpreter-manager-educator의 첫 글자를 딴 것으로, 번역하자면 보고자, 해석자, 관리자, 교육자라고 할 수 있다.

  • 보고자는 면담기술을 발휘하여 체계적으로 환자병력을수집하고 신체검사를 할 수 있어야 하고, 사례별로 특히 살펴봐야할 것을 알고 있어야 한다. 
  • 해석자는 환자의 문제에 우선순위를 정하고 감별 진단하는 것이며, 
  • 관리자는 환자에게 합리적인 진단과치료방법들을 여러 개 제시하고 선택할 수 있는 능력을 갖추어야한다. 
  • 교육자는 지금 가진 의학적인 지식을 특별한 환자에게 적용할 수 있도록 전문적인 공부를 더 하고 그 지식을 남에게 가르쳐 주어야 한다. 

보고자, 해석자, 관리자, 교육자로 갈수록 높은 수준의 역량을 가지게 된다. RIME는 1999년 미국의 Pangaro 박사가 제시한 이후 유용한 것으로 입증되어 임상실습평가에 널리 사용되고 있으나, 2009-2010년 동안 미국의 119개 의과대학의 임상실습평가자료를 분석한 결과에 따르면 RIME조차도 학교 간 평가등급의 폭과 부정확성의 문제를 좁히지 못한 것으로 나타났다(Kim, 2004; Pangaro, 1999, 2000). 그럼에도 RIME는 다양한 임상실습 장면에 적용할 수 있는 역량 평가방법의 하나가 될 수 있다.


일곱째, 외래참관의 경우 대개 조원 전체가 진료실에 앉지도 못한 채 교수의 진료를 참관하게 되지만 대부분의 환자가 재진이다.그런 경우 진료시간이 짧고 교수의 의무기록을 채 읽어보기도 전에다음 환자가 들어오기 때문에 그 한순간을 참관한 학생들은 어떤문제를 가진 환자인지 알기 어렵다. 이를 개선하려면 만약 외래진료시간이 3시간이고 한 조에 3명의 학생이 배정되어 있다면 학생1명마다 1시간씩 교수 옆에 앉도록 하여 교수와 같은 눈높이로 교수가 의무기록을 어떻게 작성하는지를 보게 하고 간단하게라도 환자에 대한 설명을 해 주어야 한다. 혹은 학생마다 외래 환자를 배정하여 진료순서 이전에 초진부터 지금까지의 의무기록을 면밀히 검토한 후 그 환자의 진료 차례가 되면 함께 진료실로 들어와 참관하도록 하여 비록 짧은 진료의 한 단편일지라도 맥락을 이해할 수 있게 할 수도 있다. 그리고 진료를 마친 후에는 학생들과 함께 오늘 진료한 환자에 대해 질의응답을 하면서 학생들의 임상추론능력을 간접적으로라도 평가하기를 권한다. 또한 환자나 간호사도 그 학생을평가할 수 있도록 360도 평가(360 degree assessment) 같은 다면평가를 활용할 수도 있다.


여덟째, 입원 환자의 경우에도 다양한 의사의 직무를 경험하도록 공동주치의 역할과 회진보고에 참여시켜 볼 수 있다. 병동실습을 마친 학생들의 피드백조사에 의하면 학생들은 평가교수가 배정된 병동 환자에게 공동주치의로서 자신을 정식으로 소개해 주고,회진할 때마다 모든 학생을 참관하도록 하기보다는 참관이 용이하도록 한 번 회진에 학생 2-3명씩 배정해 주며, 학생 자신들이 실제진찰하는 장면을 평가해 주기를 바라고 있었다(Park, 2004).


아홉째, 학생 수가 적은 대학의 경우에는 최근 일부 의과대학에서 시행하는 통합임상실습을 시도해 볼 수 있다. 매주 전과를 임상실습하게끔 일정을 계획하여 멘토와 함께 1년 내내 지속하는 것이다. 개별 임상과의 단편적인 단기간의 임상경험이 아닌 최소한 1년간 그 임상과의 환자를 만나면서 환자-학생관계를 발전시켜 나갈수 있고 질병에 대한 통합적인 시각을 갖도록 하는 게 목적이다(Ogur et al., 2007). 가정의학과 임상실습 학생을 대상으로 6주간의전통순환실습(rotation-based clerkship, RBC) 학생과 32주간의 장기통합실습(longitudinal integrated clerkship, LIC) 학생의 실습만족도를 실습 전후에 비교한 연구에 의하면 LIC 학생의 만족도가RBC 학생보다 훨씬 높게 나타났다(Myhre et al., 2013). 또 다른 면담연구에서는 학생이 LIC가 실제적이고 유용하고 건설적이라고인식하는 것으로 보고하였다(Bates et al., 2013).




결 론


임상실습평가에 대한 관심은 이전부터 꾸준히 있었으나 평가자의 잦은 변동, 평가기준 적용의 차이, 부적합한 평가내용, 지속적인관찰의 어려움 등으로 인해 평가의 타당성과 신뢰성은 아직 어려움이 많다. 이는 외국의 선진의과대학도 마찬가지이다. 우리나라의 경우 전국적인 설문을 통해 임상실습교육의 개선이 필요한 영역을 조사한 결과, 교수들은 임상실습 담당 인력의 부족, 학생 임상실습교육에 대한 관심부족, 임상실습시설 및 장비부족 등을 지적하였으며, 수기 위주의 실습교육, 수행평가 강화 및 타당한 평가도구의 개발 등이 필요함을 지적하였다(Yang et al., 2007). 임상실습 담당인력부족 등 단기간에 해결되기는 어려운 점이 대부분이다. 하지만관심부족, 평가도구나 방식의 개선은 얼마든지 가능할 것이다. 결국 임상실습평가에 가장 중요한 요인은 평가자이다. 평가자가 자주바뀌거나 평가에 대한 사전교육을 받지 못한 전공의나 전임 의사에게 평가를 일임해서는 안 될 것이다. 평가자가 늘 일관되고 타당한 평가를 할 수 있도록 해야 하며, 평가방법은 현실 가능해야 하고용이해야 한다(Plymale et al., 2010). 한 임상과목에 여러 교수가 학생의 각 실습활동마다 할당되어 1년 내내 책임평가를 하면 타당도가 높아질 수 있다. 평가는 교육의 어떤 분야보다 중요하다 할 수 있다. 타당하고 신뢰할 만한 평가를 위해서는 향후 더 많은 연구가 이루어져야 할 필요가 있으며, 각 대학마다 많은 노력들이 투자되어야 한다. 그러나 늦었다고 생각할 때가 가장 빠를 때이다.







The clinical clerkship focuses students on developing their ability to perform comprehensive diagnosis and management of patients with common undifferentiated problems by the integration of knowledge and clinical reasoning. Therefore, the clerkship evaluation system should assess their actual problem solving and professional behavior. However, concern remains that clerkship evaluations are imprecise and highly variable. This review is designed to provide faculty members with concepts, options, and a methodology to actively teach and evaluate the clinical clerkship, as well as offer encouragement and inspiration to medical students. We reviewed past and current clinical clerkship evaluations and discuss several tips to improve clinical excellence such as continuity, transparency of the evaluation process, a faculty development program, practical examination of clinical skills, implementation of a checklist for recording exposure and skills, providing prompt and constructive feedback to students, self-evaluation of professional performance, varying multi-faceted assessment combinations, being outpatient clinic-centered, and having dedicated faculty members who give students one-on-one contact with a preceptor.

Keywords: Clinical clerkship, Medical education, Assessment
