A new method for group decision making in medical trainee selection (Med Educ, 2016)

A new method for group decision making and its application in medical trainee selection

James R Kiger & David J Annibale






INTRODUCTION


The criteria that medical schools and residency programmes use to select applicants rely in part on test scores and grades. In most cases, however, these numerical data are combined with, or even outweighed by, considerations that are hard to quantify, such as interview performance, leadership and prior experience. In the end, every programme must reduce all this information to a simple list in order of preference. To do so, programmes often use pseudo-quantitative scoring systems, which are mathematically unsound and can be counterproductive.

The criteria by which a medical school or residency training programme selects its preferred applicants may, in part, rely on test scores or grades. In almost every case, however, these numerical data are combined with, if not superseded by, considerations that are difficult to quantify, such as interview performance, leadership traits and prior experience. In the end, every school or programme must find a way of distilling all this information into a simple list of applicants in order of preference. To achieve this goal, groups often rely on pseudo-quantitative scoring systems that are mathematically unsound and may be counterproductive to the collaborative process of making a list.


Our subspecialty training programme uses the NRMP. The NRMP was introduced in 1952, at a time of growing confusion and frustration among medical students and residency programmes. A centralised body took on the task of assigning all graduating medical students to available residency spots. The NRMP system has held its place for 60 years and has expanded to more specialties and subspecialties.

Our subspecialty training programme uses the National Resident Matching Program (NRMP) for applicant selection. The NRMP was formed in 1952 in response to escalating confusion and exasperation on the part of medical students and residency programmes. This centralised body assumed the task of sorting all of the nation’s graduating medical students into available residency spots.2 The NRMP system has stood relatively unchanged for more than 60 years, and has expanded to cover more specialties and subspecialties.


 

Applicants and training programmes each submit their own rank-order lists to the NRMP, which uses a 'deferred acceptance' algorithm to sort applicants so that the result is stable and optimal. For applicants, ranking is demanding but fundamentally a personal matter. For training programmes it is more complicated: they must decide how to integrate quantitative data with qualitative characteristics, and how to turn the subjective opinions of multiple interviewers into a final rank order. The imprecision of this step has been documented in several published reports.

Applicants and training programmes both submit rank-order lists to the NRMP, which employs a ‘deferred acceptance’ algorithm to sort the applicants into training positions such that stable and optimal results are achieved.2,3 For applicants, creating a rank order may be taxing, but is a fundamentally personal matter. For training programmes, generating a rank-order list may be significantly more complicated. Each programme must decide how to integrate objective quantitative data (test scores, grades, etc.) with qualitative characteristics (volunteer work, written statements, etc.) and the subjective opinions of multiple interviewers into a final rank-order list. The imprecision of this process is highlighted by published reports that have demonstrated the lack of correlation between information gathered during the interview process, the position of applicants on a programme’s rank-order list, and future resident performance.4–8



ERAS, provided by the AAMC, includes a pseudo-quantitative method for generating a rank-order list. Interviewers score applicants on a Likert-type rating scale (1–9), and the averaged scores produce a preliminary ranking. The ERAS system is one example of a Likert scale-based system.

The Electronic Residency Application Service (ERAS), provided by the Association of American Medical Colleges (AAMC), incorporates a pseudo-quantitative method to generate a rank-order list. Interviewers assign applicants scores on a Likert-type rating scale (integers of 1–9), and averaged scores for applicants are sorted to create a preliminary rank-order list. This ERAS system is simply one example of a Likert scale-based system.



Pseudo-quantitative methods such as this are beset by a number of fundamental problems:


  • 1 Score distributions differ between interviewers.
    the scores assigned by different interviewers are differently distributed;

  • 2 Numeric scores do not mean the same thing to every interviewer.
    numeric scores have no consistent meaning for interviewers (e.g. an interviewer who gives consistently lower scores may view a score of 7 points as signifying an excellent candidate, whereas another interviewer may view the same score as indicating an average candidate);

  • 3 The scores are ordinal data on an arbitrary scale, so arithmetic operations on them are inappropriate.
    Likert scale-type scores are ordinal data on an arbitrary scale; it is inappropriate to perform arithmetic operations, such as the calculation of means, on such data,9–11 and

  • 4 Each applicant is interviewed by only some of the faculty, and each faculty member interviews only some of the applicants.
    candidates are interviewed only by a subset of faculty staff, and each faculty member may interview only a subset of candidates. Any particular candidate’s final score may be altered substantially by the inclusion or exclusion of an interviewer who gives consistently high or low scores.
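To make problems 1, 2 and 4 concrete, here is a minimal Python sketch of score averaging. The interviewer names, applicant labels and scores are all made up for illustration and do not come from the paper. Each interviewer's own ordering says Y is better than Z (strict) and Z is better than X (lenient), yet the averaged scores produce exactly the reverse order.

```python
from statistics import mean

# Hypothetical 1-9 scores; each interviewer scores only the applicants they met.
scores = {
    "lenient": {"X": 8, "Z": 9},  # uses only the top of the scale
    "strict": {"Y": 7, "Z": 6},   # gives consistently lower scores
}

def mean_score_ranking(scores):
    """Rank applicants by the mean of whatever scores they happened to receive."""
    by_applicant = {}
    for given in scores.values():
        for applicant, score in given.items():
            by_applicant.setdefault(applicant, []).append(score)
    averages = {a: mean(s) for a, s in by_applicant.items()}
    ranking = sorted(averages, key=averages.get, reverse=True)
    return ranking, averages

ranking, averages = mean_score_ranking(scores)
print(ranking)   # ['X', 'Z', 'Y']
print(averages)  # X: 8, Z: 7.5, Y: 7
```

X ends up first only because X was seen by the lenient interviewer alone, which is the kind of dependence on interviewer assignment described in problem 4.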


Because of these problems, we re-evaluate the ERAS-generated ranking in group discussion before submitting it to the NRMP. During the discussion, scores are adjusted so that the ranking matches the 'group opinion'. This group discussion can itself be swayed by a vocal minority, and the views of those who cannot attend are left out.

Given these problems, our programme has had to re-evaluate the preliminary ERAS-generated rank-order list in group discussions prior to submission to the NRMP. During such discussions, scores are modified to force the rank list to conform to the ‘group opinion’. Of course, this group opinion may be unduly influenced by a vocal minority, and those who are unable to attend are left out of the discussion.


Others have suggested different mathematical methods to improve the rank-ordering process.

  • One approach is to have interviewers compile individually ordered preference lists of applicants, instead of assigning scores. Both Chew et al. and Collins et al. suggest applying a formula to individual rank lists to create scores that can then be averaged.12,13

  • These systems resemble the Borda voting system in which each voter gives each candidate a number of points proportional to that candidate’s place on the voter’s list.14 These systems are hampered by the fact that the score derived from any given voter is dependent on the number of candidates seen by that voter.

  • A recent article by Ross and Moore suggests retaining scores, but comparing candidates pairwise and assigning a ‘win percentage’ to each in a system similar to that used in sports ranking.15



We proposed a set of design principles to which an optimal system should adhere:


  • 1 the opinions of all interviewers will carry equal weight;

  • 2 the rank-order list will not be influenced by which interviewers meet any individual candidate;

  • 3 interviewers will compare only applicants whom they have met;

  • 4 the system will not depend on scores assigned on an arbitrary scale, and

  • 5 the final ordering will be transparent and reproducible.



METHODS


Algorithm development


We developed an algorithm termed ‘collaborative unbiased rank list integration’ (CURLI).


The CURLI algorithm involves four steps:


  • 1 each interviewer submits a personal ranked preference list of the applicants he or she has met or reviewed;

  • 2 each personal rank-order list is used to generate a pairwise preference table of applicants;

  • 3 the individual preference tables are summed to generate a composite preference table, and

  • 4 a sorting algorithm is applied to the composite preference table to generate a final rank-order list.


The basic result is this: if applicants A and B were both interviewed by some subset of faculty, and a majority of those interviewers prefer A to B, then A is placed higher on the final preference list. This holds regardless of how many interviews each faculty member conducted and of any individual scoring bias.

The fundamental result of the CURLI algorithm is as follows: if applicants A and B are both interviewed by a subset of faculty members, and candidate A is preferred to candidate B by a majority of those interviewers, then candidate A will appear higher on the final preference list. This is unaffected by how many interviews any specific faculty member conducts or any individual scoring biases.


Personal rank-order lists


The fundamental change for interviewers is that instead of scoring applicants on an arbitrary scale, they are asked to maintain a personal ranked preference list of the applicants they have interviewed. Interviewers include only applicants they have met, conforming to design principles 2 and 3 above. Interviewers no longer assign arbitrary scores, removing the undue influence exerted by interviewers who give consistently high or low scores, satisfying principles 1 and 4.


Pairwise preference tables


Each pairwise cell is filled with 1 or −1 according to which of the two applicants the interviewer ranked higher.

Each interviewer’s ranked preference list is converted to a preference table, which is populated by the numbers 1 or −1 depending upon which applicant appears higher on that preference list. No values are assigned to applicants the interviewer did not meet. A preference list implies a comparison between all possible pairs of applicants on that list. Applicants appearing higher on the rank-order list are preferred to all applicants ranked below them. Therefore, a rank-order list of size n contains (n × [n − 1])/2 pairwise comparisons between applicants.



Example: of four applicants A, B, C and D, the interviewer does not meet C and ranks the other three in the order B, D, A.

For example, imagine there are four applicants: A, B, C and D. An interviewer meets all but applicant C, and submits the following rank-order list: B–D–A.


Table 1 shows the preference table generated from this list.
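Since the table itself is not reproduced here, the following Python sketch shows one way to build such a preference table from the list B–D–A. The sign convention is an assumption inferred from the sorting description in the paper (a cell is −1 when the row applicant is ranked above the column applicant, +1 otherwise); the function name `preference_table` is hypothetical and not from the paper.

```python
from itertools import combinations

def preference_table(ranked):
    """Map (row, col) to -1 if row is ranked above col, +1 if below.

    Assumed convention: negative means the row applicant is preferred.
    Applicants the interviewer did not meet get no entries at all.
    """
    table = {}
    for higher, lower in combinations(ranked, 2):  # list order is preserved
        table[(higher, lower)] = -1
        table[(lower, higher)] = 1
    return table

table = preference_table(["B", "D", "A"])

# Print the 4x4 grid; cells involving the unseen applicant C stay blank.
applicants = ["A", "B", "C", "D"]
print("   " + " ".join(f"{col:>3}" for col in applicants))
for row in applicants:
    cells = [f"{table.get((row, col), ''):>3}" for col in applicants]
    print(f"{row:>3}", *cells)
```

A list of three applicants yields 3 × 2 / 2 = 3 comparisons, stored here as six mirrored cells.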



Composite preference table

A composite preference table is computed simply by adding all of the individual preference tables.


For example, four interviewers (I, II, III and IV) provide the following rank lists for four applicants:


  • Interviewer I: B–D–A; 

  • Interviewer II: C–B–A–D; 

  • Interviewer III: B–C–D–A, and

  • Interviewer IV: C–D–B.


Table 2 shows the resulting four individual preference tables. Table 3 shows the composite preference table yielded by the sum for each cell.
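As a rough check of the summation step, the sketch below recomputes a composite table from the four rank lists given above. It assumes the same sign convention as the sorting description (negative means the row applicant is preferred); the helper name and data layout are illustrative choices, not the authors' implementation.

```python
from collections import defaultdict
from itertools import combinations

rank_lists = {
    "I": ["B", "D", "A"],
    "II": ["C", "B", "A", "D"],
    "III": ["B", "C", "D", "A"],
    "IV": ["C", "D", "B"],
}

def preference_table(ranked):
    """-1 where the row applicant is ranked above the column applicant, +1 otherwise."""
    table = {}
    for higher, lower in combinations(ranked, 2):
        table[(higher, lower)] = -1
        table[(lower, higher)] = 1
    return table

# Sum the individual tables cell by cell.
composite = defaultdict(int)
for ranked in rank_lists.values():
    for pair, value in preference_table(ranked).items():
        composite[pair] += value

for pair in [("A", "B"), ("A", "C"), ("A", "D"), ("B", "C"), ("B", "D"), ("C", "D")]:
    print(pair, composite[pair])
# ('A', 'B') 3, ('A', 'C') 2, ('A', 'D') 1, ('B', 'C') 1, ('B', 'D') -2, ('C', 'D') -3
```

Each cell is the number of interviewers preferring the column applicant minus the number preferring the row applicant, so interviewers who did not see both applicants contribute nothing to that cell.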

 

 


 

 

Sorting


A modified bubble-sort algorithm was applied to the composite table to produce the final rank-order list.

A sorting algorithm is applied to the composite preference table to obtain the final rank-order list. For our programme, we applied a modified bubble-sort algorithm to the composite table.16 An initial unsorted list is generated. Each applicant is compared with the applicant immediately below on the rank list by checking the corresponding value in the composite preference table. If the lower-ranked applicant is preferred (i.e. the value in the cell is > 0), the order of the two applicants is swapped. This is continued until no more pairs of applicants are swapped. In the ideal scenario, the re-sorted list will yield a composite preference table with all negative values in the upper triangle.


Re-sorting the table in the final order gives Table 4.

For our example, the final sorted rank list is: C–B–D–A. Re-sorting the preference table to reflect this order gives a matrix with a fully negative upper triangle which indicates that every applicant is preferred by a majority of interviewers to all the applicants below them on the list (Table 4).
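The sorting step can be sketched in Python as a plain bubble sort driven by the composite table. The paper says a modified bubble sort was used but the modification is not described here, so this is only an approximation. The composite values are hard-coded from summing the four example rank lists under the assumed convention that a negative cell means the row applicant is preferred; `curli_sort` is a hypothetical name.

```python
# Composite cells for the worked example (row, col): value.
upper = {
    ("A", "B"): 3, ("A", "C"): 2, ("A", "D"): 1,
    ("B", "C"): 1, ("B", "D"): -2, ("C", "D"): -3,
}
composite = dict(upper)
composite.update({(col, row): -v for (row, col), v in upper.items()})  # mirrored cells

def curli_sort(applicants, composite):
    """Swap adjacent applicants while the lower one is preferred (cell > 0)."""
    order = list(applicants)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            above, below = order[i], order[i + 1]
            if composite.get((above, below), 0) > 0:
                order[i], order[i + 1] = below, above
                swapped = True
    return order

final = curli_sort(["A", "B", "C", "D"], composite)
print(final)  # ['C', 'B', 'D', 'A']

# Check the "fully negative upper triangle" property for the sorted order.
all_negative = all(
    composite[(final[i], final[j])] < 0
    for i in range(len(final))
    for j in range(i + 1, len(final))
)
print(all_negative)  # True
```

Each swap fixes one adjacent pair without changing the relative order of any other pair, so the loop terminates even when the majority preferences contain a cycle. In that case the upper triangle cannot be made fully negative.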



If the same example were run with a Borda voting scheme, where applicants are ordered by the points they earn from their position on each list, B could come out on top even though two of the three interviewers who compared B and C preferred C.

If one imagines running the same example with a Borda voting scheme, for instance, in which each applicant is awarded points based on his or her position on each list, it is possible that applicant B may have been ranked highest, although two of the three interviewers who directly compared applicants B and C preferred applicant C.
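The paper does not say which Borda points rule it has in mind. As one illustrative assumption, the Python sketch below awards N − position points per list (N = 4 applicants in total, position counted from 0) and sums them. Under that rule B does finish ahead of C, while the head-to-head count still favours C. Other points rules can give different totals.

```python
rank_lists = [
    ["B", "D", "A"],       # Interviewer I
    ["C", "B", "A", "D"],  # Interviewer II
    ["B", "C", "D", "A"],  # Interviewer III
    ["C", "D", "B"],       # Interviewer IV
]
N_APPLICANTS = 4  # A, B, C, D

def borda_totals(rank_lists, n_applicants):
    """Assumed rule: top of a list earns n points, next n - 1, and so on."""
    totals = {}
    for ranked in rank_lists:
        for position, applicant in enumerate(ranked):
            totals[applicant] = totals.get(applicant, 0) + (n_applicants - position)
    return totals

def head_to_head(rank_lists, a, b):
    """Count lists containing both a and b in which each is ranked higher."""
    a_wins = b_wins = 0
    for ranked in rank_lists:
        if a in ranked and b in ranked:
            if ranked.index(a) < ranked.index(b):
                a_wins += 1
            else:
                b_wins += 1
    return a_wins, b_wins

totals = borda_totals(rank_lists, N_APPLICANTS)
print(totals)                               # {'B': 13, 'D': 9, 'A': 5, 'C': 11}
print(head_to_head(rank_lists, "C", "B"))   # (2, 1)
```

B collects points from four lists and C from only three, which is why a points total can diverge from the direct pairwise majority.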

 

 



Methodology



We implemented this new ranking algorithm during the 2013 neonatal-perinatal fellowship match. All faculty members and fellows were instructed to maintain a personal ranked preference list of the applicants they interviewed. They were also asked to assign a score of 1–9 to each participant as had been done in previous years, as per the ERAS system. These ‘shadow’ scores were used to compare the outcome of the CURLI algorithm with the results that would have been generated by the old Likert scale-based method.



RESULTS


During the trial year 14 applicants were interviewed, and 19 faculty members and fellows served as interviewers. Figure 1 shows the minimum, maximum, median and interquartile ranges for the scores assigned by each individual interviewer.

 

 


 

Interviewers used only part of the scoring range, and 86% of scores were 6 or higher.

On average, each interviewer scored nine applicants. All interviewers utilised a truncated part of the scoring range at the top of the scale. Of 162 total scores assigned, 139 (86%) were ≥ 6. The median score assigned by each interviewer ranged between 6 and 8.


Individual interviewers were inconsistent with themselves: of the 162 scores assigned, 23 were out of order relative to the ranking given by the same interviewer.

We observed discordance between individual interviewers’ assigned scores and their final assessments of an applicant’s desirability. Collectively, the interviewers assigned a total of 162 scores, 23 (14%) of which were out of order in relation to the rank-order list of the interviewer who had given them.


Compared with the score-based list, the new CURLI algorithm would have placed nine of the 14 applicants at a different position on the final ranking list.

[...] by the new CURLI algorithm. Of the 14 applicants, nine would have been assigned to a different place on the final ranking list.

 

 

In the previous 3 years our division held two 2-hour meetings to adjust the preliminary list; this time a single 1-hour meeting was enough, and no applicant's rank was changed.

In the prior 3 years, our division had scheduled two 2-hour meetings to discuss and modify the preliminary rank-order list. In this trial year, we required only a single 1-hour meeting to achieve consensus. No candidates were moved as a result of that discussion. Figure 2 shows the relationships between the preliminary rank-order list and the final rank-order list for 2013 and the prior 2 years. The changes reflect the alterations made during the divisional meeting. In 2011 and 2012, the positions of nine of 14 applicants, and 13 of 16 applicants, respectively, were moved on the final list.

 

 


 

 

DISCUSSION


From an administrative standpoint, meeting time fell from 4 hours to 1 hour and no ranks were changed. Displaying the composite preference table provided transparency.

From an administrative perspective, the new method reduced meeting time from 4 hours to 1 hour, during which no changes were made to the rank-order list. During that meeting the composite preference table was displayed, providing complete transparency.


The CURLI algorithm has several advantages. It is reproducible and transparent, it is less vulnerable to pressure from a minority who want to move an applicant, and it removes the inequality caused by intrinsic differences in scoring among interviewers.

We suggest that our CURLI algorithm has numerous theoretical benefits that are borne out in practice. It is reproducible and transparent. There is reduced vulnerability to pressure from a minority of participants to change a candidate’s rank position, and the inequality imposed by intrinsic differences in scoring among interviewers is removed.


The CURLI algorithm has clear advantages over the alternatives. In Borda voting schemes and similar methods, applicants receive points for their place on each list, or their rank numbers are averaged.

Compared with other options that have been proposed, we feel that the CURLI method offers clear advantages. Borda voting schemes, and similar methods, introduce a process whereby applicants receive points for their place on each list, or in which the rank number on each list is averaged.12–14


These methods may work well when every interviewer sees every applicant, but they become problematic when each interviewer sees only a subset. For example, an interviewer might see only a few applicants, all of them among the least desirable. Under Borda-like methods, the top-ranked applicant on that list gains a large advantage. CURLI makes only relative comparisons, so it does not have this problem.

These methods may yield satisfactory results if all interviewers see all applicants (i.e. every individual preference list is full), but in cases like ours in which each interviewer sees only a subset of applicants, these methods are problematic and allow bias. Take, for example, an interviewer who interviews only a few applicants, all of whom happen to be among the least desirable. Under the Borda-like methods, the top-ranked applicant on this list will obtain a huge advantage in points or rank, even though that applicant may actually not be desirable compared with all the other applicants that particular interviewer did not see. As the CURLI method uses the rank lists only to make pairwise comparisons between applicants the interviewer actually saw, it suffers no such bias.


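Other pairwise comparison methods exist, but they are less transparent and more cumbersome than CURLI. Most interviewers did not maintain even internal consistency in their scores. CURLI dispenses with arbitrary scores entirely.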

Other pairwise comparison methods have been proposed, but we feel they are less transparent and more cumbersome than our CURLI method.15 As our case study highlights, the majority of interviewers failed to maintain even internal consistency in their score assignment during one interview season. The CURLI method we have described dispenses with arbitrary scores entirely.



The method may also be applicable to scoring knowledge or clinical reasoning assessments, such as script concordance testing (SCT).

We believe this method may find further application in medical training in the scoring of knowledge or clinical reasoning assessment tools, such as script concordance testing.17


 





Med Educ. 2016 Oct;50(10):1045-53. doi: 10.1111/medu.13112.

A new method for group decision making and its application in medical trainee selection.

Author information

  • 1 Department of Pediatrics, Medical University of South Carolina, Charleston, South Carolina, USA. kiger@musc.edu.
  • 2 Department of Pediatrics, Medical University of South Carolina, Charleston, South Carolina, USA.

Abstract

CONTEXT:

The problems associated with generating a collaborative ranked preference list represent a common source of dilemma in academic medicine and medical education. Such issues present during the process of choosing among applicants to medical schools, during the selection of postgraduate trainees, and in the course of performance assessments and the prioritising of financial expenditures. Currently, most institutions use pseudo-quantitative methods, such as the averaging of scores awarded on an arbitrary scale. These methods are mathematically problematic and may not accurately reflect group opinion.

METHODS:

The present authors developed a novel algorithm for creating a collaborative preference list that generates and sorts a matrix of pairwise comparisons between applicants or choices without placing any reliance on arbitrary Likert scale-type scores. This method achieves equality in influence across individual assessors, as well as transparency and reproducibility. The authors report a case study of their experience using this new algorithm in the 2013 neonatal-perinatal fellowship match.

RESULTS:

When used by this group in the selection of fellowship trainees, the method proposed here allowed for greater efficiency and created a rank-order list that did not require reshuffling or significant debate. A survey of faculty staff and fellows showed much higher levels of satisfaction with the new algorithm and a unanimous desire to use the new algorithm in the future, in preference to a score-based system.

CONCLUSIONS:

The algorithm developed and described here may reduce arbitrariness in processes that require the collaborative creation of a preference list. This method may have wide applicability in medical education and training, and beyond. The present authors' experience of using this algorithm during the National Resident Matching Program match showed improved perceptions of fairness, ease of use and efficiency.

PMID: 27628721

DOI: 10.1111/medu.13112
[PubMed - in process]

