
Faculty and resident evaluations of medical students on a surgery clerkship correlate poorly with standardized exam scores

Seth D. Goldstein, M.D.a,*, Brenessa Lindeman, M.D.a, Jorie Colbert-Getz, Ph.D.b, Trisha Arbella, B.S.a, Robert Dudas, M.D.c, Anne Lidor, M.D.a, Bethany Sacks, M.D.a


aDepartment of Surgery, Johns Hopkins School of Medicine, 1800 Orleans Street, Bloomberg Children’s Center 7310, Baltimore, MD 21287, USA;

bOffice of Medical Education Services, Johns Hopkins School of Medicine, Baltimore, MD, USA;

cDepartment of Pediatrics, Johns Hopkins School of Medicine, Baltimore, MD, USA


Abstract


BACKGROUND: The clinical knowledge of medical students on a surgery clerkship is routinely assessed via subjective evaluations from faculty members and residents. Interpretation of these ratings should ideally be valid and reliable. However, prior literature has questioned the correlation between subjective and objective components when assessing students’ clinical knowledge.

METHODS: Retrospective cross-sectional data were collected from medical student records at The Johns Hopkins University School of Medicine from July 2009 through June 2011. Surgical faculty members and residents rated students’ clinical knowledge on a 5-point, Likert-type scale. Interrater reliability was assessed using intraclass correlation coefficients for students with ≥4 attending surgeon evaluations (n = 216) and ≥4 resident evaluations (n = 207). Convergent validity was assessed by correlating average evaluation ratings with scores on the National Board of Medical Examiners (NBME) clinical subject examination for surgery. Average resident and attending surgeon ratings were also compared by NBME quartile using analysis of variance.

RESULTS: There were high degrees of reliability for resident ratings (intraclass correlation coefficient, .81) and attending surgeon ratings (intraclass correlation coefficient, .76). Resident and attending surgeon ratings shared a moderate degree of variance (19%). However, average resident ratings and average attending surgeon ratings shared a small degree of variance with NBME surgery examination scores (ρ² ≤ .09). When ratings were compared among NBME quartile groups, the only significant difference was for residents’ ratings of students in the lower 25th percentile of scores compared with the top 25th percentile of scores (P = .007).

CONCLUSIONS: Although high interrater reliability suggests that attending surgeons and residents rate students with consistency, the lack of convergent validity suggests that these ratings may not be reflective of actual clinical knowledge. Both faculty members and residents may benefit from training in knowledge assessment, which will likely increase opportunities to recognize deficiencies and make student evaluation a more valuable tool.


© 2014 Elsevier Inc. All rights reserved.










Building clinical knowledge is the main purpose of medical student clerkships, yet there is no "gold standard" for assessing this key area. To assess knowledge during clinical clerkships, most schools use some mixture of subjective faculty/resident evaluations and scores on national standardized examinations.

Fostering the development of clinical knowledge is among the primary goals of medical student clerkships,1 but no gold standard for assessment has emerged in this key area. Common approaches to knowledge assessment on clinical clerkships at most medical schools remain a mixture of subjective evaluations from faculty members and residents with objective scores on national standardized examinations. 


Student assessment should be valid and reliable. However, prior studies have drawn conflicting conclusions about the correlation between subjective and objective assessments of students' clinical knowledge.

Student assessment should ideally be valid and reliable; however, prior literature has demonstrated mixed conclusions when examining correlations between subjective and objective components of student clinical knowledge. 

    • Literature from radiology and pediatrics has demonstrated moderate correlations between grades from subjective and objective components of medical knowledge. 2,3 
    • However, other studies in emergency medicine and internal medicine have shown lower levels of correlation between medical knowledge assessment by faculty members and discipline-specific standardized exam performance.4–6 
    • Only 1 prior study has also examined evaluations of surgical students,7 demonstrating low predictive value of resident ratings that was only marginally better than the predictive value of surgical faculty member ratings. 


These points matter not only for administrative decisions such as what grade to give a student, but also for detecting a student's deficiencies early and setting appropriate strategies to address them. Although self-assessment is a key element of adult learning, medical students' self-assessments have repeatedly been shown to correlate poorly with objective measures and with final clerkship grades.

These points are of key importance not only regarding the administrative decision of what grades to assign students but also because early recognition of deficits in student performance is crucial in offering constructive strategies to overcome them. Although self-assessment is a key component of adult learning, research has repeatedly demonstrated poor correlations between medical students’ self-assessments with objective measures of knowledge8,9 and their final clerkship grades.10


Although little research has been done on the validity of subjective evaluations of students on clinical clerkships, every medical school in the United States uses them.

Rigorous validation of scores from subjective assessments on student clerkships has not been conducted, although all medical schools in the United States use these in the clinical years.11 

    • One study showed that a student’s overall assigned clerkship grade can be predicted by faculty ratings in only a single performance area,12 despite these ratings’ not correlating with standardized, objective measures. 

Because subjective faculty ratings have potential predictive ability, an evaluator who can sense where a student's performance falls short should be able to give feedback at the right time.

Because of the potential predictive ability of subjective ratings, instructors who sense deficiencies in students’ performance are able to provide timely feedback and work with learners to adapt learning plans earlier during the clerkship.


This study examined the convergent validity between subjective ratings and NBME scores, as well as the interrater reliability between faculty and resident evaluations. We expected that surgical residents' and faculty members' ratings would correlate poorly with standardized exam scores.

This study was designed to investigate the convergent validity between subjective ratings of clinical knowledge and scores on the National Board of Medical Examiners (NBME) subject examination, as well as interrater reliability of faculty members’ and residents’ evaluations of global clinical knowledge among students on the surgery clerkship. We hypothesized that surgical residents’ and faculty members’ ratings of clinical knowledge would correlate poorly with the students’ standardized exam scores.



Methods


Retrospective cross-sectional data were collected from medical student records at The Johns Hopkins University School of Medicine from July 2009 through June 2011 (n = 219 students ranging from the 2nd to the 4th year). 


The medical student basic clerkship was just under 9 weeks in duration and was divided into a 4.5-week general surgery experience and 2 separate 2-week surgical subspecialty rotations, though not necessarily in that order. 


Students were instructed to approach potential evaluators at the conclusion of their time on a given service to request evaluations, which were then sent by e-mail and completed within 4 weeks. Minimums of 4 faculty member and 4 resident evaluations were desired. Surgical faculty members and residents rated students’ clinical knowledge as part of a 17-item summative evaluation. All items were rated on a 5-point, Likert-type scale. The clinical knowledge 1-to-5 rating descriptors are provided in Table 1.


Data analysis was performed using SPSS version 20 (IBM, Armonk, NY). The clinical knowledge rating was extracted from the full evaluation, and the interrater reliability, or consensus between evaluators, of those scores was assessed. The proportion of variance due to variability of scores between raters, known as the intraclass correlation coefficient (ICC), was calculated separately for both faculty member and resident ratings.

An ICC ≥ .75 indicates good agreement among raters and thus good reliability. Values of .50 to .74 indicate moderate reliability, and values ≤ .49 indicate poor reliability.13
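The study performed these calculations in SPSS; purely as an illustration of the reliability step, the following is a minimal Python sketch that computes a one-way random-effects ICC(1) from a hypothetical students-by-raters matrix of 1-to-5 clinical knowledge ratings (the function name and example ratings are assumptions, not the study's data).

```python
import numpy as np

def icc_oneway(scores):
    """One-way random-effects ICC(1) for a students x raters matrix
    (rows = students, columns = raters; balanced design assumed)."""
    scores = np.asarray(scores, dtype=float)
    n, k = scores.shape
    grand_mean = scores.mean()
    student_means = scores.mean(axis=1)
    # Mean squares between students and within students (across raters)
    ms_between = k * np.sum((student_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((scores - student_means[:, None]) ** 2) / (n * (k - 1))
    return (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)

# Hypothetical example: 5 students, each rated by 4 residents on the 1-to-5 scale
ratings = np.array([
    [4, 4, 5, 4],
    [3, 3, 4, 3],
    [5, 5, 5, 4],
    [2, 3, 3, 2],
    [4, 5, 4, 4],
])
print(f"ICC(1) = {icc_oneway(ratings):.2f}")  # ~.74, i.e., moderate-to-good reliability
```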



Convergent validity of clinical knowledge ratings was assessed by correlating average ratings with scores on the NBME clinical subject examination for surgery using Spearman’s ρ. The ρ² value was also calculated to determine the shared variance between ratings and examination scores. 

A ρ² value ≥ .25 indicates a high degree of variance shared between 2 variables, values of .09 to .24 indicate a moderate degree of variance, and values < .09 indicate a small degree of variance.14 
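For illustration, here is a minimal sketch of this correlation step in Python, using scipy's spearmanr in place of SPSS; the per-student arrays below are hypothetical values, not the study data.

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical per-student values: average evaluator rating and NBME surgery score
avg_ratings = np.array([3.8, 4.2, 3.5, 4.6, 4.0, 3.2, 4.4, 3.9])
nbme_scores = np.array([72, 80, 65, 88, 74, 61, 70, 83])

rho, p_value = spearmanr(avg_ratings, nbme_scores)
shared_variance = rho ** 2  # proportion of variance shared by the two measures

print(f"Spearman rho = {rho:.2f}, rho^2 = {shared_variance:.2f}, P = {p_value:.3f}")
# rho^2 >= .25 high, .09-.24 moderate, < .09 small shared variance (cutoffs above)
```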


Additionally, students were assigned to 1 of 3 groups on the basis of their examination scores relative to their peers’ scores: top 25%, middle 50%, and bottom 25%. Average clinical knowledge ratings from residents and attending surgeons were analyzed for differences on the basis of NBME score quartile using analyses of variance.
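Again for illustration only, a sketch of this quartile grouping and comparison, with scipy's f_oneway standing in for SPSS's analysis of variance and hypothetical per-student values:

```python
import numpy as np
from scipy.stats import f_oneway

# Hypothetical per-student data (not the study dataset)
nbme = np.array([61, 65, 70, 72, 74, 80, 83, 88, 66, 77, 71, 84])
avg_rating = np.array([3.2, 3.5, 4.4, 3.8, 4.0, 4.2, 3.9, 4.6, 3.4, 4.1, 3.7, 4.3])

# Split students by NBME score: bottom 25%, middle 50%, top 25%
q25, q75 = np.percentile(nbme, [25, 75])
bottom = avg_rating[nbme <= q25]
middle = avg_rating[(nbme > q25) & (nbme < q75)]
top = avg_rating[nbme >= q75]

# One-way ANOVA of average clinical knowledge ratings across the three groups
f_stat, p_value = f_oneway(bottom, middle, top)
print(f"ANOVA across NBME quartile groups: F = {f_stat:.2f}, P = {p_value:.3f}")
```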



Comments

Assessment of residents and medical students at US medical schools has been following the ACGME core-competency model. Medical knowledge is one of the fundamental domains, and although it is arguably the easiest to measure, there is no gold standard for assessing it. Grading schemas for medical student clerkships vary widely from school to school, as medical schools have applied their own standards.

Assessment of residents and medical students in the United States has become increasingly aligned with the model of 6 core competencies developed by the Accreditation Council for Graduate Medical Education.15 Medical knowledge is 1 of the fundamental domains in which practicing physicians are called to exhibit competence. Although it is debatably the easiest to measure, there is no present consensus on which assessment methods should constitute a gold standard. As such, multiple studies have shown that wide variability exists in grading schema for medical student clerkships,16–18 as medical schools have each applied their own standards on an ad hoc basis.


Although there is still no clear magic bullet for assessing medical student performance, this study did find a predictable pattern. Both faculty and resident ratings showed low convergent validity with NBME surgery subject examination scores, but interrater reliability among the multiple faculty members and residents rating a given student was high.

Our findings suggest that although the magic bullet of medical student performance assessment continues to remain elusive, there may be predictable patterns. Although the convergent validity of subjective assessments from both faculty members and residents with NBME surgery subject examination scores was low, the interrater reliability between multiple faculty member and resident ratings for each student was high. 


This suggests that some other aspect of student performance is being used as a proxy for knowledge. Data from a prior study of what influences faculty evaluations of medical students indicate that, when assigning grades, faculty members look at a single dimension of student performance rather than applying multiple grading criteria. In our study, surgical faculty appear to hold a generalized global assessment of each student when rating, colloquially known as the "halo effect."

On that basis, we posit that other aspects of student performance are perhaps being considered as proxies for knowledge. Data from a prior study that examined factors contributing to faculty members’ evaluation of medical students’ performance on a surgery clerkship indicated that faculty members form 1-dimensional views of students’ performance when assigning grades, rather than nuanced cognitive models that account for differentiation among multiple grading criteria.19 Our data further imply that faculty evaluators in surgery may be conceptualizing a generalized global assessment of student performance on which they base all of their ratings. This is colloquially known as the halo effect.


Interestingly, residents' ratings correlated better with NBME scores than faculty ratings did. Several studies report that faculty members rarely observe students and that students spend most of their time with residents. There is literature on variation in faculty evaluations of students, but no comparable findings exist for resident evaluations. We agree with Dudas et al. that "fragmentation of student clinical experiences is a threat to assessment," and this likely underlies the argument that longitudinal clerkships can accurately assess student competence across a broad range of domains.

Interestingly, we found that residents’ evaluations of students’ knowledge correlated better with NBME examination scores than did faculty members’ evaluations. Studies have shown that students are observed infrequently by faculty members in clinical encounters20,21 and spend a majority of their contact time with residents.22 There is documentation of variation in faculty members’ evaluations of students by clinical service, with increasing length of rotation (4 vs 2 weeks) correlating with lower overall scores,23 although no similar data exist for residents’ evaluations. We agree with Dudas et al3 that ‘‘fragmentation of student clinical experiences is a threat to assessment,’’ and it can be reasonably argued that longitudinal clerkship experiences provide the maximal opportunities for accurate student assessment across all domains of competency.


Conclusions

There is broad interest in the assessment of medical students, and it should be valid and reliable. Medical schools use both subjective and objective assessments, but it is not known to what extent clinicians can detect what students do well and what they do poorly.

Assessment of medical students is of broad interest and should ideally be valid and reliable. Medical schools use both subjective and objective assessments of performance on student clerkships, as each contributes differently to insight regarding students’ abilities. However, it is unknown to what extent clinicians can detect excellence or deficiencies while working with medical students. 



Our data suggest that there are contextual and discipline-specific trends at work regarding the ways in which faculty members and residents perceive and subsequently rate students’ performance. Specifically, compared with medical specialties, the surgical routines and culture may not easily lend themselves to accurate student assessment without focused training in such. We have yet to explore if and how the importance given to technical skills in surgery skews our ability to assess other domains compared with nonprocedural disciplines. Within the confines of the existing student clerkship structure, it seems evident that both faculty members and residents would benefit from dedicated training in rating students’ knowledge. Such an intervention could decrease the ‘‘halo effect’’ that has been demonstrated to plague the subjective evaluation of medical students26 and would also provide increased opportunities to recognize deficiencies within the context of the time-limited settings so common in surgical practice.






Am J Surg. 2014 Feb;207(2):231-5. doi: 10.1016/j.amjsurg.2013.10.008. Epub 2013 Oct 24.






