Opening the black box of clinical skills assessment via observation: a conceptual model (Med Educ, 2011)

Jennifer R Kogan,1 Lisa Conforti,2 Elizabeth Bernabeo,2 William Iobst2 & Eric Holmboe2







INTRODUCTION


Beyond calls to limit high-stakes final examinations, continuous assessment in the clinical workplace is now expected, and that assessment is expected to be accurate and valid.

Additionally, in response to calls to limit high-stakes final examinations, greater emphasis is now placed on the continuous assessment of skills in the clinical workplace7–11 and such assessment must be accurate and valid.



METHODS


Sample

 



Study design and data collection


On their study day, faculty members individually watched four videos and two live scenarios of a standardised postgraduate year 2 (PGY2) resident (SR) taking a history, performing a physical examination or counselling a standardised patient (SP).20 The live cases also scripted resident receptiveness to feedback. These cases were previously used with medical residents and each case was scripted to depict a PGY2 resident whose performance was unsatisfactory, satisfactory or superior for content (history taking, examination, counselling) and interpersonal skills (some cases portrayed superior content but unsatisfactory interpersonal skills). Initial error scripting (by JRK) was based on actual resident performance norms. The study team reviewed scripts to confirm that they reflected predetermined performance levels. For the video cases, volunteer medical residents trained on a single script, practised with the SP and were videotaped once their performance accurately represented the intended performance level.


After watching each of four video encounters (Fig. 1a), faculty staff completed a mini-clinical evaluation exercise (mini-CEX). The mini-CEX, developed by the American Board of Internal Medicine (ABIM) to provide residents with feedback about their history-taking, physical examination, counselling and interpersonal skills, details seven competencies that are rated on a 9-point scale (1–3 = unsatisfactory, 4–6 = satisfactory, 7–9 = superior).25,26 Faculty members were then interviewed individually for 15 minutes by a trained study investigator using a semi-structured interview guide.
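The 9-point scale described above can be written down as a small lookup. This is only an illustrative sketch of the three bands quoted in the text; the function name `mini_cex_band` is hypothetical and is not part of the mini-CEX form or the study.

```python
def mini_cex_band(score: int) -> str:
    """Return the category label for a mini-CEX score on the 9-point scale.

    Bands follow the text: 1-3 unsatisfactory, 4-6 satisfactory, 7-9 superior.
    The function name is a hypothetical helper for illustration only.
    """
    if not isinstance(score, int) or not 1 <= score <= 9:
        raise ValueError("mini-CEX scores are whole numbers from 1 to 9")
    if score <= 3:
        return "unsatisfactory"
    if score <= 6:
        return "satisfactory"
    return "superior"
```

Note that a 4 and a 6 land in the same band, which is the region where faculty in the study later report difficulty discriminating between adjacent numbers.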



Videos were shown in a random order and each faculty member was interviewed by at least three interviewers. Following the video scenarios, faculty members observed two live representations of an SR taking a history, conducting an examination and counselling an SP (Fig. 1b). Following each encounter, faculty staff rated the SR using the mini-CEX and provided the SR with up to 10 minutes of feedback, which was video-recorded. Faculty members were then interviewed individually by a study investigator for 30 minutes using the semi-structured interview (Appendix S1). Faculty members were asked about the feedback encounter before and after watching a DVD of themselves giving feedback to the SR.




Data analysis


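A grounded theory approach was used because little was known about the observation and evaluation process, and the authors wanted to avoid being constrained by current hypotheses or inferences from prior studies.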

We utilised a grounded theory approach to analyse the data for emergent themes and to develop a thematic coding structure.27 We selected grounded theory because little is known about the observation and evaluation process and we wished to avoid restricting ourselves to current hypotheses or inferences from prior studies.28 Transcripts were sampled for coding across faculty participants, SP cases and interviewers.27 Two researchers (JRK, LC) independently coded and used constant comparative techniques to develop a preliminary coding structure.27 A portion of the transcripts were also coded by additional study team members (EB, WI, EH) to review, further define and refine the coding structure. Refinement of the coding structure continued as analysis progressed. Coding was terminated when theoretical saturation was achieved and when all team members agreed upon the final interpretation of the data. In total, 56 of 172 video interviews (33%) and 29 of 88 live interviews (33%) were coded. NVivo Version 2.0 (QSR International Pty Ltd, Melbourne, Vic, Australia) was used to organise and analyse the coding structure.

 

 


RESULTS

 

Theme 1. Frames of reference during observation and rating


Using self as a reference



Most commonly, faculty members compared resident performance with how they perceived their own practice.

Most frequently, faculty members compared resident performance with how they perceived themselves to practise:


Faculty members' perceptions of their own clinical strengths and weaknesses at times mediated their judgements and ratings, and also affected how comfortable they felt with the encounter.

Faculty members’ perceptions of their own clinical strengths or limitations at times mediated their judgements and ratings, as well as their comfort with the encounter.


The competencies that faculty members considered especially important or high priority also framed their ratings.

Competencies believed by faculty members to be especially important and which were prioritised in feedback also framed ratings:


Faculty members also judged resident performance against how they themselves had performed as residents.

Faculty staff also referred to comparisons of resident performance with the faculty member’s perception of his or her own performance as a resident.



Using other doctors as a reference


Many faculty members compared resident performance with that of other residents at a similar stage.

Many faculty members compared resident performance with that of residents at a similar stage:


However, some faculty members questioned whether ratings should depend on the resident's year of training (PGY level).

However, some faculty staff questioned whether ratings should be based on the resident’s PGY level:


Some faculty members compared resident performance with that of practising doctors (those who have completed training). Because some practising doctors themselves have deficient clinical skills, many faculty members wondered what could reasonably be expected of a resident.

Some faculty staff also compared resident performance with that of practising doctors. Many faculty members acknowledged that some practising doctors have deficient clinical skills and this led them to question what it might be reasonable to expect of a resident:


 

Using patient outcomes as a reference


 

Additional frames of reference



Whether the basic elements of a specific clinical task (history taking) were covered

SIGECAPS [Sleep, Interest, Guilt, Energy, Concentration, Appetite, Psychomotor, Suicidal].



For some, explaining the standard for evaluation was difficult. Instead, some faculty members said they relied on a 'gut feeling'. Others found it hard to put into words how they moved from observation to judgement, saying that the transition reflected a gestalt or that they were unsure which framework they were using when judging and rating. Some found the assessment of interpersonal skills especially hard because these skills are subjective and difficult to quantify.

For others, articulating the standard for evaluation was difficult. Instead, some faculty staff referred to having a ‘gut’ feeling that drove evaluation. Others had difficulty in verbalising how they moved from observations to judgement and commented that the transition represented a gestalt or simply that they were uncertain of which framework they were making judgements and ratings against. Some faculty staff found the assessment of interpersonal skills particularly challenging because these skills were felt to be more subjective and difficult to quantify:



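Importantly, the data show that the way faculty members applied frames of reference was complex, dynamic and highly variable. Many faculty members shifted frames of reference between encounters and even within a single encounter.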

Importantly, our data show that the ways in which faculty staff implemented these frames of reference were complex, dynamic and highly variable. Many faculty members shifted between frames of reference both within and between encounters:


 

Theme 2. The role of inference


Inferences about residents and their performance were prominent during assessment. Faculty members took concrete data (resident actions), selected from those actions (consciously or not), attached meaning to them, interpreted them, and formed assumptions from which they often drew conclusions. Inferences were often high-level, meaning that considerable interpretation was built on the observed behaviour. Table 2 shows how the same behaviour was interpreted differently by different faculty members. Inferences were made about residents' feelings (confidence, comfort), personality, skills (knowledge, potential), motivation to improve, and prior experience and preparation.

We found that inferences about residents and their performance were prominent during assessment. Faculty members used concrete data (resident actions), selected from those actions (consciously or subconsciously), affixed meaning and interpretation to those actions and made assumptions from which they frequently drew conclusions. Often, inferences were of the ‘high’ level, meaning there was significant interpretation based on the behaviour witnessed. Table 2 provides examples of how the same behaviour was interpreted differently by different faculty staff watching the same video. Inferences were made about the residents’ feelings (i.e. their levels of confidence and comfort), personalities, skills (i.e. knowledge base and potential), motivation to improve, and prior experiences and preparation.


A few faculty members were aware that they were making inferences. Many, however, failed to recognise when they were making subjective inferences and, as a result, made numerous assumptions about resident performance.

A few faculty staff seemed to be aware that they made inferences. However, many faculty members failed to recognise when they made subjective inferences and consequently made numerous assumptions about residents’ performance.

 

 

 

 


Theme 3. Variable approaches to synthesising judgements to numerical ratings


 

The data show how variable and uncertain the process of converting a judgement into a number (score) is.

Our data showed significant variability and uncertainty surrounding how to translate a judgement about the resident into a numerical rating, especially the overall mini-CEX rating.


Some faculty members averaged the individual mini-CEX competencies.

Some faculty members chose to average all of the individual mini-CEX competencies:


Others graded in a non-compensatory way.

Others used non-compensatory grading:


Some faculty members weighted ratings according to the focus or purpose of the encounter.

Some faculty staff weighted ratings according to the encounter’s focus or purpose:


A few used existing frameworks to guide their assessment.

A few faculty members used existing frameworks to guide them in assessing residents.
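To make the difference between these approaches concrete, the sketch below applies three of them to the same set of competency ratings. It is a minimal illustration, not the study's method: the competency names and the weights are hypothetical, and modelling non-compensatory grading as "the lowest competency score sets the overall rating" is an assumption, since the text does not define the rule that faculty actually applied.

```python
from typing import Dict, Optional


def overall_by_average(ratings: Dict[str, int]) -> int:
    """Average all competency ratings and round to a whole score.

    Python's round() rounds exact halves to the nearest even integer.
    """
    return round(sum(ratings.values()) / len(ratings))


def overall_non_compensatory(ratings: Dict[str, int]) -> int:
    """Assumed rule: a weak competency cannot be offset by strong ones,
    so the lowest competency rating becomes the overall rating."""
    return min(ratings.values())


def overall_weighted(ratings: Dict[str, int],
                     weights: Optional[Dict[str, float]] = None) -> int:
    """Weighted average; competencies without an explicit weight count as 1.0."""
    weights = weights or {}
    total_weight = sum(weights.get(name, 1.0) for name in ratings)
    weighted_sum = sum(score * weights.get(name, 1.0)
                       for name, score in ratings.items())
    return round(weighted_sum / total_weight)


# Hypothetical ratings for one encounter on the 9-point scale.
ratings = {"history": 7, "examination": 6, "counselling": 8, "interpersonal": 3}

average_rating = overall_by_average(ratings)              # 6
strict_rating = overall_non_compensatory(ratings)         # 3
focused_rating = overall_weighted(ratings, {"interpersonal": 3.0})  # 5
```

With identical competency ratings, the overall rating comes out as 6, 3 or 5 depending on the rule, which places the same performance in either the satisfactory or the unsatisfactory band.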


Many faculty members struggled to convert judgements into numbers. They described not understanding what each number meant, being unable to discriminate between points on the 9-point scale, and being unsure how to synthesise ratings.

Many faculty staff struggled to translate their judgements to a numerical rating. Faculty members described their own lack of understanding about the meaning of the numbers, their inability to discriminate along a 9-point scale, and their uncertainty regarding how to synthesise ratings:




Theme 4. Factors external to resident performance that drive ratings


The following factors influenced ratings.

Several additional factors influenced faculty staff ratings, including

  • context (the complexity of the encounter, the resident’s prior experience, the faculty–resident relationship) and
  • response to feedback (by the resident, by the faculty member, by the institution).

Context


The complexity of the clinical scenario and how familiar the resident appeared to be with it influenced ratings.

Contextual factors such as the complexity of the clinical scenario and perceptions of the resident’s familiarity with a clinical situation influenced how faculty members translated their observations into ratings:


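The length of the faculty–resident relationship also mattered. Drawing on their experience with actual residents, faculty members explained that in a longitudinal relationship they knew what feedback a resident had already received. A mistake repeated after feedback led to more stringent ratings.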

The duration of the resident–faculty relationship also impacted ratings. Faculty members, referring to their experiences with actual residents, explained that when they had a longitudinal relationship with a resident, they knew what that resident had already received feedback on. The repetition of a mistake by the resident after feedback resulted in rater stringency:


Conversely, a pre-existing positive relationship with a resident was sometimes associated with rater leniency and a halo effect.

By contrast, a pre-existing positive relationship with a resident was sometimes associated with rater leniency and the halo effect:



Response to feedback


Emotions often stemmed from concern about how residents would react to their scores.

Emotions frequently stemmed from concern about residents’ reactions to numerical ratings that might be either high or low:

 

'If someone falls into the "satisfactory" category, I will give a 6, because I don't want to argue about why it is a 4 or a 5 rather than a 6.'

‘If anyone’s in the satisfactory category, I tend to put them in the 6 range because I don’t want to be having a conversation about why it wasn’t a 6 instead of a



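In addition, faculty members' own emotional response to giving constructive feedback, their concern about its emotional impact on the resident, and their concern about how the resident would perceive them also seemed to influence assessment.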

In addition, the faculty member’s own emotional response to providing constructive feedback (i.e. feeling mean or unkind; being ‘demoralising’) and concern about the emotional impact on the resident and how the resident might perceive the faculty member seemed to mediate assessment:



By contrast, other faculty members saw their role and responsibility as that of a coach.

By contrast, other faculty staff were focused on their roles and responsibilities as coaches:



'My job is to make them the best doctors they can be. I am their coach.'

It’s to make them the best doctors they can be. I’m their coach.


Finally, several faculty members mentioned the role of institutional culture.

Finally, several faculty members described the role of the broader institutional culture in guiding their ratings:


Sitting side by side with the dean and a student I graded poorly, and having to defend my score, is very uncomfortable. I never want to be in that situation again.

Sitting in very uncomfortable meetings with the dean and the student who I graded very poorly and having to sit there and defend my score. I don’t want to be in this situation again.’




DISCUSSION


 

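When assessing residents' clinical skills, faculty members bring an amalgam of characteristics that can potentially influence observation and assessment. These include age, gender, clinical and teaching experience, clinical and educational competence, and attitudes and emotions related to observation and feedback.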

Faculty staff bring to resident clinical skills assessment an amalgam of characteristics that potentially impact on observations and assessment. These characteristics include age, gender, clinical and teaching experience, clinical and educational competence, and attitudes and emotions related to observation and feedback.


 

Faculty members view resident and patient interactions through two lenses: frame of reference and inference. The former uses the faculty member's own performance, other doctors' performance or patient outcomes as a yardstick; the latter adds further meaning and interpretation to what is observed.

Faculty members observe resident and patient interactions through two lenses, one of which concerns a frame of reference (whereby faculty members use their own or other doctors’ performance, or patient outcomes, as a yardstick against which to compare resident performance) and one of which refers to inference which further shapes the meaning and interpretation assigned to observations.



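Faculty observation of a trainee takes place within a context and is influenced by contextual factors. These include the clinical system (familiarity with the patient, patient complexity, organisation of the clinical unit) and the educational system (institutional culture and oversight).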

The faculty member’s observation of the trainee with the patient occurs within and is influenced by contextual factors including the clinical system (i.e. familiarity with the patient, patient complexity, organisation of the clinical unit, etc.) and educational system (i.e. institutional culture and oversight).


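During and after observation, faculty members interpret and synthesise what they observed into a rating. This process is not neat, predictable or simple. It is influenced by several factors, including the feedback anticipated to follow the rating, institutional culture and the complexity of the encounter. These influences further support the importance of context in observation, feedback and rating.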

During and after observation, the faculty member interprets and synthesises his or her observations into a rating. However, the process is not neat, predictable or straightforward. Multiple additional influences that can impact on ratings include anticipated feedback, institutional culture and encounter complexity. These influences further support the importance of context in observation, feedback and ratings.29,30


These complex interactions are supported by situated cognition theory, which holds that an individual's thinking, knowing and processing are inextricably tied to the specific social situation in which those thoughts and actions occur.

These complex interactions can be supported by situated cognition theory which contends that an individual’s thinking, knowing and processing are uniquely tied to and inextricably situated within (and cannot be completely separated from) the specific social situations within which those thoughts and actions occur.29,31,32


Situated cognition theory holds that failing to acknowledge what the setting contributes leads to a view of thinking that cannot fully capture the construct.

Situated cognition contends that failing to acknowledge the contributions of the setting leads to a perspective on thinking that cannot fully capture the construct;33


Trainees, the clinical setting, the educational setting, institutional culture and the faculty members themselves all have an influence.

We have found that factors including the trainees, the clinical and educational setting, the institutional culture and the faculty members themselves



Faculty members use multiple frames of reference when assessing. Strikingly, they very often judged residents against their own current practice style. This matters because faculty members vary widely in clinical skills proficiency. Yet in this study they rarely used evidence-based frameworks or patient outcomes to anchor observation and assessment. Furthermore, variable assessments of the same performance can undermine the feedback process, because residents may dismiss or censor constructive feedback that does not fit their self-image.

Our data suggest that faculty staff approach assessment using multiple frames of reference. We were struck by how often self was used as a frame of reference, particularly by comparing a resident’s performance with one’s current practice style. This finding has important implications because faculty staff have variable clinical skill proficiency.20,34–36 Yet, in the present study, faculty staff rarely used evidence-based frameworks describing best clinical skills practices (e.g. informed decision making, patient communication) or patient outcomes37–39 to anchor observation and assessment. Variable frames of reference are also problematic for residents. Furthermore, variable assessments of the same performance can potentially undermine the feedback process if residents preferentially dismiss or censor constructive feedback that is incongruent with their self-image.40


Reaching conclusions about performance requires using real data (resident behaviours), selecting from those behaviours and attaching meaning to them. These assumptions form conclusions, which in turn lead to action (rating and feedback).

Reaching conclusions about performance requires faculty staff to use real data (resident behaviours), select from those behaviours, and affix meaning to them. These assumptions form conclusions which, in turn, lead to actions (the rating and feedback).41


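Although faculty members had no opportunity to talk with the SR in the video encounters, they made similar inferences during the live observations and did not question those inferences when they met the SR to give feedback.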

Although the faculty member did not have the opportunity to talk with the SR in the video encounters, similar inferences were made and not questioned during the live observations when the faculty member met with the SR for feedback.



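Faculty members struggled to convert observations and judgements into numerical ratings. Yet faculty members who use rating forms need to know what each number means and how to choose an overall rating. As far as the authors know, faculty development has not addressed how observations and assessments should be synthesised into an overall rating.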

We found that faculty members struggled to translate their observations and judgements into numerical ratings. However, faculty members who use rating forms need to know what the numbers mean and how to select an overall rating. To our knowledge, faculty development has not addressed how observations and assessments should be synthesised into an overall rating.



Faculty members need a shared standard or mental model for assessment, ideally moving from a self-based framework to a criterion-based one. Knowing the competencies expected at a given stage of training and clarifying milestones would help.

Our findings suggest that there is a need to ensure that faculty staff approach assessment with a shared standard or mental model, ideally shifting from a self-based to a criterion-based framework. Knowledge of expected competencies and elucidation of milestones at particular levels of training could be valuable to faculty staff who are required to make assessments.45



 


 


 










Med Educ. 2011 Oct;45(10):1048-60. doi: 10.1111/j.1365-2923.2011.04025.x.

Opening the black box of clinical skills assessment via observation: a conceptual model.

Author information

  • 1Department of Medicine, University of Pennsylvania, School of Medicine, Philadelphia, Pennsylvania 19104, USA. jennifer.kogan@uphs.upenn.edu

Abstract

OBJECTIVES:

This study was intended to develop a conceptual framework of the factors impacting on faculty members' judgements and ratings of resident doctors (residents) after direct observation with patients.

METHODS:

In 2009, 44 general internal medicine faculty members responsible for out-patient resident teaching in 16 internal medicine residency programmes in a large urban area in the eastern USA watched four videotaped scenarios and two live scenarios of standardised residents engaged in clinical encounters with standardised patients. After each, faculty members rated the resident using a mini-clinical evaluation exercise and were individually interviewed using a semi-structured interview. Interviews were videotaped, transcribed and analysed using grounded theory methods.

RESULTS:

Four primary themes that provide insights into the variability of faculty assessments of residents' performance were identified: (i) the frames of reference used by faculty members when translating observations into judgements and ratings are variable; (ii) high levels of inference are used during the direct observation process; (iii) the methods by which judgements are synthesised into numerical ratings are variable, and (iv) factors external to resident performance influence ratings. From these themes, a conceptual model was developed to describe the process of observation, interpretation, synthesis and rating.

CONCLUSIONS:

It is likely that multiple factors account for the variability in faculty ratings of residents. Understanding these factors informs potential new approaches to faculty development to improve the accuracy, reliability and utility of clinical skills assessment.

© Blackwell Publishing Ltd 2011.


PMID: 21916943 [PubMed - indexed for MEDLINE]

