Developing a Peer Assessment of Lecturing Instrument: Lessons Learned (Acad Med, 2009)

Developing a Peer Assessment of Lecturing Instrument: Lessons Learned

Lori R. Newman, MEd, Beth A. Lown, MD, Richard N. Jones, ScD, Anna Johansson, PhD, and Richard M. Schwartzstein, MD







Traditionally, clinician–educators’ teaching has been assessed by students.1,2 There is, however, growing agreement among medical school administrators and educational researchers that effective assessment of teaching must include evidence from multiple sources.3–6 Peer review of teaching, combined with student evaluation, can provide essential data to evaluate and improve medical school and clinical teaching.7 Peer review engages faculty in a discussion about their teaching skills, provides formative assessment of specific instructional techniques, and may be included as a component of summative assessment for academic promotional purposes.


 


Program Description



Background and goals



The task force began its work by developing a peer assessment instrument on medical lecturing. The lecture remains the most commonly used instructional method in the first two years of medical education11,12 and, thereby, offers fertile ground to assess faculty.



At the program’s inception, a Shapiro Institute task force (made up of the authors and two members of the institute staff) began developing instruments for the assessment program.



In 2007, the Shapiro Institute for Education and Research at Harvard Medical School (HMS) and Beth Israel Deaconess Medical Center (BIDMC) initiated a program of peer assessment of faculty teaching.




Peers are able to judge the appropriateness of the content delivered, the lecturer’s expertise, and the quality of studies presented during the lecture.2,15,16




Participants


In 2007, after receiving institutional review board approval from the BIDMC Committee on Clinical Investigation, the task force invited all members of the BIDMC’s Resource Faculty in Medical Education to participate in a study to develop an instrument for peer assessment of lecturing and to measure the reliability of the scores obtained from the instrument. The Resource Faculty consists of HMS physician faculty members, representing all major clinical departments at BIDMC, who have a strong commitment to medical education and experience teaching in a variety of medical school and hospital settings.



Instrument Development

 


Criteria identification



Rationale and purpose for using the modified Delphi method

We used the modified Delphi method19 to develop our instrument for peer assessment of lecturing. The Delphi method is shown to be an effective consensus building process to use when published information is inadequate or nonexistent.20 The modified Delphi method is an iterative process designed to establish expert consensus on specific questions or criteria by systematic collection of informed judgments from professionals in the field.

 

How the modified Delphi process works

Using this method, a researcher first surveys a panel of experts individually about a particular issue or set of criteria. After analyzing and compiling their responses, the researcher resurveys the experts, asking each to indicate agreement or disagreement with the items. Repeated rounds of surveys are carried out until full consensus is reached. For development of the peer assessment of lecturing instrument, the Resource Faculty members served as the expert panelists.

 

Why the Resource Faculty were chosen as the expert panel

We chose to involve the Resource Faculty because of their educational expertise, diverse clinical backgrounds, and experience teaching in a variety of instructional settings. Furthermore, we felt the Resource Faculty would have a strong interest and commitment to the development of this instrument, as their education leadership role involves the peer assessment of teaching.


Compiling the list for the first survey round

In preparation for the first survey round, we generated an initial list of effective lecturing behaviors, skills, and characteristics. To compile the list, we

  • spoke with faculty members with extensive expertise in lecturing and

  • reviewed the medical literature for observable, effective lecturing behaviors11–14,21–25 (Figure 1, Delphi Round 1).

 

Conducting the first survey round

We constructed and distributed a listing of 19 possible criteria to the panelists and asked them to rate the importance of including each item in an instrument to assess medical lecturing. We based the ratings on a four-point scale: 1 = very important; 2 = important; 3 = not important; 4 = eliminate. We also asked panelists to suggest different wording, note redundancies, or propose additional items for the instrument. All 14 Resource Faculty experts responded to the first Delphi survey round.


 

Analyzing the first-round results: criteria selected by mean score (cutoff 2.5) and standard deviation (cutoff 1.0)

We used measures of central tendency and dispersion to analyze the data collected from the first survey round. Calculating these measures allowed us to determine the level of group consensus for inclusion or exclusion of each criterion. The mean value of 2.5 (the midpoint of our four-point scale) was chosen as the numerical indicator of group consensus. Those criteria with mean values less than 2.5 were included. Standard deviation (SD) was used to measure the dispersion of responses for each criterion and provide further evidence of group consensus. The smaller the SDs, the greater the consensus. Those criteria with an SD of less than 1 were included. Seventeen of the 19 criteria had means between 1.0 and 2.2 and SDs between 0.00 and 0.96. Two of the criteria had means of 2.6 and 2.9, with SDs of 1.1 and 1.2, and were eliminated.
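As a minimal illustration of the inclusion rule described above, the following Python sketch applies the mean (< 2.5) and SD (< 1.0) cutoffs to a panel of 14 ratings per criterion. The criterion names and scores are hypothetical, not the study's data.

```python
import statistics

# Hypothetical round-1 ratings from 14 panelists for three candidate criteria,
# on the 4-point scale (1 = very important, 2 = important, 3 = not important, 4 = eliminate).
ratings = {
    "states_goals":     [1, 1, 2, 1, 2, 1, 1, 2, 1, 1, 2, 1, 1, 2],
    "uses_humor":       [2, 3, 4, 2, 3, 1, 4, 2, 3, 2, 4, 3, 2, 3],
    "provides_summary": [1, 2, 1, 1, 2, 2, 1, 1, 2, 1, 1, 2, 1, 1],
}

MEAN_CUTOFF = 2.5  # midpoint of the 4-point scale
SD_CUTOFF = 1.0    # smaller SD = stronger consensus

for name, scores in ratings.items():
    mean = statistics.mean(scores)
    sd = statistics.stdev(scores)  # sample standard deviation
    keep = mean < MEAN_CUTOFF and sd < SD_CUTOFF
    print(f"{name:17s} mean={mean:.2f} sd={sd:.2f} -> {'include' if keep else 'drop'}")
```

In this toy example the first and third criteria pass both cutoffs while the second is dropped, mirroring how 17 of the 19 round-one criteria were retained.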



Analyzing the first-round results: revising the wording of the criteria

In addition, we edited the criteria according to the panelists’ suggestions for rewording. Five items were reworded to describe explicit, observable behaviors. For example, the original criterion “Captures and keeps the audience’s attention” became “Captures attention by explaining or demonstrating need, importance, or relevance of topic.” Several panelists noted redundancies among six of the criteria. We therefore eliminated three of these criteria. The outcome of the first Delphi survey round was a listing of 14 criteria. We summarized and distributed to the panel of experts the data from the first survey round and the resulting list of criteria, along with a written request for a second round of review (Figure 1, Delphi Round 2).



Results of the second survey round

Twelve experts responded to the second Delphi survey round. Thirteen of the 14 criteria had mean ratings between 1 and 1.3 and SDs between 0.0 and 0.6. One criterion had a mean of 2.5 and an SD of 1.2 and was eliminated from the listing. We again edited and reworded the criteria according to the panelists’ suggestions. Most suggestions were recommendations to shorten the criterion’s word length, and to add specific behavioral descriptors or anchors to the assessment instrument. The panelists noted redundancy of two criteria, and we therefore eliminated one of these.


Third survey round conducted on the final list of 12 criteria

We e-mailed a final revised listing of the 12 criteria to the expert panelists for the third Delphi survey round. All 14 experts reached full consensus on this final listing of criteria (Figure 1, Delphi Round 3). Using this listing of 12 criteria of effective lecturing, we constructed our initial peer assessment instrument. We used a three-point scale to rate each criterion: 1 = excellent demonstration, 2 = adequate demonstration, and 3 = does not demonstrate. We also added an option to indicate unable to assess, along with a global rating of the lecture.



Three performance levels distinguished for each criterion

To differentiate the three levels of lecturer performance, we included behavioral descriptors of each criterion culled from the literature.6,26 The behavioral descriptors were placed under the column heading for rating level 1, excellent demonstration of performance. For rating level 2, adequate demonstration of performance, we used qualifying terms such as “limited in scope.” For rating level 3, does not demonstrate, we used terms such as “does not present.”

 

Final set of 11 criteria confirmed

We presented the rating scale and criteria to the faculty as a group, who recommended eliminating one additional criterion, “Presents material at level appropriate for learners.” The group felt that, to assess this criterion, a peer observer would need to know the learners’ opinions regarding the appropriateness of the presentation level. This resulted in identification of 11 criteria of effective lecturing.




Rating scale development


Reviewing the rating scale and practicing ratings on videotaped lectures

We invited the same Resource Faculty members who participated in the Delphi rounds to consider and review the rating scale and behavioral anchors of the peer assessment instrument to finalize it for pilot testing of interrater reliability. These faculty members met for two, 2-hour sessions to discuss peer observation techniques, consider the behavioral descriptors for each criterion, comment on the sufficiency of the three rating levels, and provide feedback on the overall format of the instrument. To gain experience using the instrument, we asked the group to watch, score, and discuss videotaped lectures filmed during an HMS human physiology course. We showed 10-minute segments from the beginning, middle, and end of each lecture and asked the faculty to rate the elements observed. After rating the lecture segments, the faculty shared their scores and discussed behaviors they saw that persuaded them to choose a particular level of performance. Several faculty made suggestions for minor rewording of the behavioral descriptors.


Most ratings fell at the middle level of the three-point scale, so the scale was expanded to five points

During the second rating scale development session, the faculty noted that the three-point rating scale was limiting, as they tended to rate most criteria at the second performance level (adequate demonstration). The group suggested changing the instrument to a five-point scale (1 = excellent demonstration, 2 = very good, 3 = adequate, 4 = poor, and 5 = does not demonstrate criteria) and maintaining descriptive benchmarks for the excellent, adequate, and poor performance rating levels. At a follow-up meeting with the faculty, we distributed the finalized peer assessment of lecturing instrument, consisting of 11 criteria rated on a five-point scale. The group unanimously agreed on this final version (Appendix 1).
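To make the structure of the finalized form concrete, here is a small sketch of how such a rating form might be represented in code. This is an assumption for illustration, not the authors' implementation; the criterion wording is taken from Table 1 below, while the names (CRITERIA, SCALE, blank_form) are hypothetical.

```python
# The 11 criteria of effective lecturing (wording as in Table 1), each rated on the
# 5-point scale, with an "unable to assess" option and an overall global rating.
CRITERIA = [
    "Clearly states goals of the talk",
    "Communicates or demonstrates importance of lecture topic",
    "Presents material in a clear, organized fashion",
    "Shows enthusiasm for the topic",
    "Demonstrates command of the subject matter",
    "Explains and summarizes key concepts",
    "Encourages appropriate audience interaction",
    "Monitors audience's understanding and responds accordingly",
    "Audio and/or visual aids reinforce the content effectively",
    "Voice is clear and audiovisuals are audible/legible",
    "Provides a conclusion",
]

SCALE = {1: "excellent demonstration", 2: "very good", 3: "adequate",
         4: "poor", 5: "does not demonstrate"}
UNABLE_TO_ASSESS = "U/A"  # recorded instead of a 1-5 score when a criterion cannot be observed

def blank_form():
    """One peer assessment rating form: a score (or U/A) per criterion plus a global rating."""
    return {"criteria": {c: None for c in CRITERIA}, "global_rating": None}
```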



Pilot testing reliability of the instrument’s measures


Testing the instrument on new videotaped lectures

We subsequently pilot tested the instrument to measure internal consistency and interrater agreement. We instructed each participant to rate the entirety of four, 1-hour HMS videotaped lectures (not viewed previously) according to the criteria, and to provide a global rating assessment of the quality of each lecture. Because of faculty time constraints, the number of observers varied in the assessment of the four lectures. We collected a total of 31 peer assessment rating forms for the lectures (the four lectures had 12, 9, 5, and 5 reviewers, respectively).


Analyzing the pilot results

We analyzed the pilot data to measure reliability of the scores obtained from the instrument. Cronbach alpha was used to assess internal consistency reliability of the ratings.27 The coefficient alpha was high (α = 0.87, 95% bootstrap confidence interval [BCI] = 0.80–0.91), indicating that the items on the instrument measure a cohesive set of concepts of lecture effectiveness. Bootstrap resampling approaches were used to obtain interval estimates. Missing data were handled with multiple imputation.28
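For readers who want to reproduce this kind of analysis, here is a hedged Python sketch of Cronbach's alpha with a percentile bootstrap confidence interval. The rating matrix is randomly generated placeholder data (so the resulting alpha value is meaningless), and the sketch omits the multiple imputation the authors used for missing data.

```python
import numpy as np

def cronbach_alpha(X):
    """Cronbach's alpha for an (observations x items) matrix of scores."""
    X = np.asarray(X, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1)
    total_var = X.sum(axis=1).var(ddof=1)
    return (k / (k - 1)) * (1 - item_vars.sum() / total_var)

def bootstrap_ci(X, stat, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap CI, resampling whole rating forms (rows) with replacement."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    boot = np.array([stat(X[rng.integers(0, n, n)]) for _ in range(n_boot)])
    tail = (1 - level) / 2
    return np.percentile(boot, [100 * tail, 100 * (1 - tail)])

# Placeholder data: 31 completed rating forms x 11 criteria on the 5-point scale.
rng = np.random.default_rng(1)
forms = rng.integers(1, 6, size=(31, 11)).astype(float)

lo, hi = bootstrap_ci(forms, cronbach_alpha)
print(f"alpha = {cronbach_alpha(forms):.2f}, 95% bootstrap CI [{lo:.2f}, {hi:.2f}]")
```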



Internal consistency

There was some variability in the internal consistency across each of the four lectures (0.92, 0.77, 0.93, 0.87). All but one were close to a minimal threshold of 0.90 for making decisions about individuals, and well above the threshold for making decisions about groups (0.80).29



Interrater agreement: ICC values varied widely across criteria

Interrater agreement was assessed by forming all possible pairs of raters who observed the same lecture. The reliability of a randomly selected reviewer’s scores was computed using the intraclass correlation coefficient (ICC). The measure of ICC for the 31 raters’ scores across all criteria and the global measure was fair (0.27, 95% BCI = -0.08 to 0.44). However, there was variability of ICC measures for the individual criteria.

  • For criterion 11 (ICC = 0.69), the magnitude of association across pairs of raters can be described as substantial.

  • For criteria 3 through 7 and 9, the magnitude of association can be described as moderate to fair.

  • The reviewers reached only slight agreement on criteria 1, 2, 8, and 10, and on the global rating of the lectures.30

Table 1 presents a comparison by criteria of the interrater agreement (as measured by ICC) for all four lectures. The table is arranged in descending order of agreement.
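The paper does not specify which ICC variant was computed, so the sketch below assumes a one-way random-effects ICC(1,1), which matches the idea of the reliability of a randomly selected reviewer's scores and accommodates a different number of raters per lecture. The per-lecture scores are hypothetical.

```python
import numpy as np

def icc_oneway(groups):
    """ICC(1,1), one-way random effects, for a single criterion.

    `groups` is a list of 1-D arrays: the scores that the raters who watched
    a given lecture assigned to that criterion (group sizes may differ).
    """
    groups = [np.asarray(g, dtype=float) for g in groups]
    a = len(groups)                        # number of lectures (targets)
    n_i = np.array([len(g) for g in groups])
    N = n_i.sum()
    grand = np.concatenate(groups).mean()

    ss_between = sum(n * (g.mean() - grand) ** 2 for n, g in zip(n_i, groups))
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (a - 1)
    ms_within = ss_within / (N - a)

    k0 = (N - (n_i ** 2).sum() / N) / (a - 1)   # average raters per lecture, unbalanced design
    return (ms_between - ms_within) / (ms_between + (k0 - 1) * ms_within)

# Hypothetical scores for one criterion: four lectures rated by 12, 9, 5, and 5 peers.
rng = np.random.default_rng(2)
lectures = [np.clip(np.round(rng.normal(m, 0.8, size=n)), 1, 5)
            for m, n in zip([2.0, 3.5, 1.8, 4.2], [12, 9, 5, 5])]
print(f"ICC(1,1) = {icc_oneway(lectures):.2f}")
```

Repeating this per criterion (and for the global rating) would yield a column of ICCs like the one in Table 1.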


 

 

 


Lessons Learned About Instrument Development and Peer Assessment of Lecturing




Lesson 1: Consensus building fosters instrument coherence and self-reflection




The time and effort the Resource Faculty dedicated to the development of the assessment instrument was vital to establishing cohesive measures of effective lecturing. The effort expended likely contributed to the high measurement of internal consistency when we tested the reliability of the instrument. One lesson learned, and noted in the literature, is that collaboration of faculty in the development of an assessment instrument can create a shared definition of good performance.32 Resource Faculty also noted that the work of establishing the criteria of effective lecturing stimulated self-reflection and consideration of how well they met these standards when giving their own lectures.





Lesson 2: Faculty members must trust the validity and reliability of the evaluation process




For peer assessment to be used as evidence of effective teaching, the process requires a high degree of objectivity to produce credible, reliable, and defensible evaluations.9 Faculty undergoing peer review need to trust that the ratings are not idiosyncratic scores of their performance. We therefore felt it was critical to test the reliability of our assessment instrument through measuring interrater agreement,33 as faculty would be more likely to trust the feedback. The instrument itself could then be used as instructional material in faculty development. Conversely, low interrater agreement of the instrument’s scores would be a significant threat to its usefulness in a comprehensive assessment program or inclusion in high-stakes decision making.



There was considerable variability in our instrument’s interrater agreement measures. There are several possible explanations for this variability.

  • The most significant factor is that we did not provide proper rater training. In our two, 2-hour faculty development sessions, the Resource Faculty discussed peer observation techniques, offered comments on the instrument, and practiced using the assessment tool. However, these were not formal training sessions (see Lesson 3).

  • A second factor contributing to the low interrater agreement measure may be that the faculty raters used predetermined, internal standards in judging the quality of a lecturer’s performance. Braskamp and Ory34 note that, at times, raters compare a person’s performance or contribution against those of others, or against some a priori standard derived from previous experience.

 


In our study, the faculty might have approached the peer observation event with an internal bias about how the lecture should be presented. This may have been the case, in particular, if the topic was of interest to the faculty or within the faculty’s own discipline. Therefore, the faculty’s idiosyncratic perceptions may have superseded more objective appraisal of the lecturing performance.



 


Lesson 3: Peer rater training is essential for high-stakes evaluation



Careful attention to rater training has been singled out as the most effective strategy for increasing accuracy and consistency of performance assessment ratings.32 During training, raters learn to avoid common rater errors (such as halo, leniency, and central tendency) and discuss behaviors indicative of each performance dimension until individual perceptions are brought into closer congruence with those held by the group.35 To increase proficiency at discriminating between performance dimensions, raters view and discuss samples of each performance level included on the rating scale. Most important, raters practice scoring performances and receive feedback from a training facilitator on the accuracy of their scores.





The success of rater training programs requires that participants commit to the time and effort necessary to internalize the standards of the system and become consistent in their use of the ratings. The need for a high level of commitment among all faculty participants can make training a large group of peer raters problematic. One solution might be to establish a small cadre of faculty who undergo intensive rater training together. Reliable appraisal data obtained from this cadre of peer raters could then be used in summative, high-stakes assessment of lecturing effectiveness.





Table 1. Interrater agreement (ICC) by criterion across all four lectures, in descending order of agreement

Criterion                                                        ICC
11 (Provides a conclusion)                                       0.69
3 (Presents material in a clear, organized fashion)              0.60
4 (Shows enthusiasm for the topic)                               0.56
6 (Explains and summarizes key concepts)                         0.46
5 (Demonstrates command of the subject matter)                   0.45
9 (Audio and/or visual aids reinforce the content effectively)   0.38
7 (Encourages appropriate audience interaction)                  0.22
8 (Monitors audience’s understanding and responds accordingly)   0.20
2 (Communicates or demonstrates importance of lecture topic)     0.14
1 (Clearly states goals of the talk)                             0.07
10 (Voice is clear and audiovisuals are audible/legible)         0.06
Global rating (Overall, how would you rate this lecture?)        0.19

 





Acad Med. 2009 Aug;84(8):1104-10. doi: 10.1097/ACM.0b013e3181ad18f9.

Developing a peer assessment of lecturing instrument: lessons learned.

Author information: Shapiro Institute for Education and Research, Harvard Medical School and Beth Israel Deaconess Medical Center, Boston, Massachusetts 02215, USA. lnewman@bidmc.harvard.edu

Abstract

Peer assessment of teaching can improve the quality of instruction and contribute to summative evaluation of teaching effectiveness integral to high-stakes decision making. There is, however, a paucity of validated, criterion-based peer assessment instruments. The authors describe development and pilot testing of one such instrument and share lessons learned. The report provides a description of how a task force of the Shapiro Institute for Education and Research at Harvard Medical School and Beth Israel Deaconess Medical Center used the Delphi method to engage academic faculty leaders to develop a new instrument for peer assessment of medical lecturing. The authors describe how they used consensus building to determine the criteria, scoring rubric, and behavioral anchors for the rating scale. To pilot test the instrument, participants assessed a series of medical school lectures. Statistical analysis revealed high internal consistency of the instrument's scores (alpha = 0.87, 95% bootstrap confidence interval [BCI] = 0.80 to 0.91), yet low interrater agreement across all criteria and the global measure (intraclass correlation coefficient = 0.27, 95% BCI = -0.08 to 0.44). The authors describe the importance of faculty involvement in determining a cohesive set of criteria to assess lectures. They discuss how providing evidence that a peer assessment instrument is credible and reliable increases the faculty's trust in feedback. The authors point to the need for proper peer rater training to obtain high interrater agreement measures, and posit that once such measures are obtained, reliable and accurate peer assessment of teaching could be used to inform the academic promotion process.

PMID: 19638781

