Narrative information obtained during student selection predicts problematic study behavior (Med Teach, 2016)

MIRJAM G. A. OUDE EGBRINK & LAMBERT W. T. SCHUWIRTH

Maastricht University, The Netherlands





Introduction


Until recently, the focus was on predictors of cognitive academic performance. It is now clear, however, that non-cognitive qualities are also important for future medical students and doctors.

Until recently, the focus has been primarily on predictors of cognitive academic performance (Salvatori 2001; Siu & Reiter 2009). Nowadays, however, it is clear that, besides cognitive skills, non-cognitive qualities are important competencies of future medical students and doctors.


The MMI is now in use.

Recently, the so-called multiple mini-interview (MMI) has shown that multiple individual human judgments of non-cognitive skills, when combined, predict future performance in a sufficiently reliable way.


In 2007, Maastricht University began using the MMI in selection for the P-CI master. The selection procedure yields a ranking list that reflects predicted suitability to perform successfully in this research master.

In 2007, the MMI method was introduced as part of the selection procedure for the four-year medical research master Physician-Clinical Investigator (P-CI) at Maastricht University (Guyaux et al. 2010). The selection procedure results in a ranking list, representing differences in predicted suitability to perform successfully in this research master.


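Most selected students succeed in both cognitive and non-cognitive respects, but some show problematic behavior. Clearly, these problems were not predicted by the MMI scores or by any other part of the selection procedure. In theory, the narrative information that interviewers write down during the MMIs, which is stored in the student files, might predict future behavior better.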

Although most selected students are successful in both cognitive and non-cognitive aspects of the study, some encounter professional lapses or problematic study behavior. Clearly, these problems were not predicted by the MMI scores or any other part of the selection procedure. Theoretically, the narrative information that is written down by the interviewers during the MMIs and stored in the student files could be a better predictor of such problems and could constitute a useful resource for the student mentors (called counselors in the P-CI master), but till now this information has been unused.



Methods


 

Context


The four-year P-CI research master is a graduate-entry program that enables students to become a medical doctor as well as a clinical investigator. This combination makes it a challenging program for the students. Each year, a selection procedure decides which 30 students are allowed to enter this master.

  • They must have finished a biomedical bachelor with good results; GPAs as well as a cognitive test are taken into account in the first part of the selection procedure.

  • The second part consists of MMIs on different topics, such as motivation, past performance, empathy and communication skills. The applicants’ performances on each individual interview are graded independently by the interviewers as being ‘‘sufficient’’, ‘‘doubtful’’ or ‘‘insufficient’’, and the combination of all individual scores adds up to a ranking list. In each station, interviewers also make notes that are not used in the procedure itself; both notes and grading are completed in the time interval between individual interviews. The notes are stored for possible use in appeals, to underpin the interviewers’ judgments.


Students and counselors


In this study, we focused on students who enrolled into the P-CI master in 2007 (cohort 2007; n = 30) and 2008 (cohort 2008; n = 30). In this master, each student is assigned to a counselor at the start of the first year, who mentors the student on an individual basis throughout his/her study. Each counselor typically takes care of 3–8 students per cohort. Every year, student and counselor meet at least four times.


Seven counselors mentored the 60 students in cohorts 2007 and 2008 (five in cohort 2007 and six in cohort 2008; four of them were active in both cohorts). In the end, 54 out of 60 students have finished their study within four to five years, while one student is currently finishing the last part.


Study design


This retrospective exploratory study was subdivided into three parts.

 

  • First, the seven counselors were asked to name the three most prevalent non-cognitive problems they encountered in ‘their’ students, and grade them (3-2-1) to indicate the frequency of occurrence (3 = most frequent). From their reactions the two most highly-graded problems were selected for further analysis.

  • Second, two independent and blinded investigators (MoE and LS) analyzed the de-identified notes written down during the MMIs of 15 randomly chosen students out of the total of 55, and identified what they thought to be possible indicators for these two most frequent non-cognitive problems.

  • Third, a case-control study design was used. The counselors were asked to identify the students who exhibited either one or both of these non-cognitive problems during their study (cases). The notes of their MMIs were de-identified and screened by the same two independent and blinded investigators (MoE and LS) to investigate whether the proposed indicators of these problems were indeed present. As a control, the MMI notes of a similar number of control students from the same cohorts (without the identified non-cognitive problems) were screened for the presence of these indicators as well.



Results


Part 1: The two most prevalent non-cognitive problems


Planning problems

Planning difficulties related to problems with

  • time management,

  • underestimation of study load, and

  • problematic prioritizing of tasks.

 

Self-reflection problems

Self-reflection-related problems were addressed as

  • insufficient awareness of (the consequences of) own functioning,

  • indications of defensive behavior, and

  • insufficient or non-effective actions to improve this.

 

 


 

Part 2: Indicators in MMI notes


The narrative information that was written down during MMIs with 15 randomly chosen students was analyzed to investigate whether indications for the two most prevalent non-cognitive problems were already present during the selection procedure preceding the master.



In the MMI notes of five students both investigators found no indicators at all for the two non-cognitive problems. In the MMI notes of the other 10 students one or more potential indicators were found. In four of them potential indicators for both planning-related and self-reflection-related problems were present.



As a result of this analysis, a limited number of potential indicators for planning-related and self-reflection-related problems were identified (Table 2).

 

 


 


Part 3: Case-control study


Based on the above-mentioned findings, a case-control study was performed to investigate how predictive these indicators were for planning-related and/or reflection-related problems during the research master P-CI.


The seven counselors identified 23 students who exhibited planning-related and/or reflection-related problems during their study (cases).

  • Thirteen students had planning-related problems, while

  • six had reflection-related problems; another

  • four students showed problems in both domains.


Altogether, the data indicate a statistically-significant association between the presence of indicators for planning-related problems in MMI notes and the actual occurrence of such problems during the subsequent study (Table 3A: odds ratio 9.33; 95% confidence interval 2.12–41.07; p = 0.003). No such evidence was found for self-reflection-related problems (Table 3B: odds ratio 1.39; 95% confidence interval 0.29–6.68).
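An odds ratio and its 95% confidence interval of this kind are computed from a 2×2 table of cases and controls against presence of the indicator. Table 3A is not reproduced in these notes, so the counts below are hypothetical and only illustrate the arithmetic. The Wald interval on the log scale is one common choice; the excerpt does not say which method the authors used, so that is an assumption as well.

```python
import math

def odds_ratio_wald(a, b, c, d, z=1.96):
    """Odds ratio and Wald confidence interval for a 2x2 table.

    a: cases with indicator      b: cases without indicator
    c: controls with indicator   d: controls without indicator
    """
    odds_ratio = (a * d) / (b * c)
    # Standard error of log(odds ratio).
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    lower = math.exp(math.log(odds_ratio) - z * se)
    upper = math.exp(math.log(odds_ratio) + z * se)
    return odds_ratio, lower, upper

# Hypothetical counts for illustration only, NOT the values from Table 3A.
odds_ratio, lower, upper = odds_ratio_wald(a=10, b=5, c=4, d=16)
print(f"OR = {odds_ratio:.2f}, 95% CI {lower:.2f} to {upper:.2f}")
```

With small cell counts the interval is wide, which is consistent with the broad interval reported above for a sample of this size.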

 

 


 

 

Discussion


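Usually, the selection stage is used only to decide whom to admit and whom to reject. In this study, information obtained at the selection stage was used to predict future problematic behavior.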

As a result, the selection procedure is merely used to decide on who is admitted and who is not. In the current study, we propose to use narrative information obtained during selection interviews to predict future problems.


This would make early and dedicated counseling and remediation possible, helping selected students succeed. Selection would then serve not merely as assessment-of-learning but also as assessment-for-learning.

This may enable early and dedicated counseling and remediation to improve the selected students’ study success. This way, selection will not only serve as an assessment-of-learning measure but also as a first assessment-for-learning step (Shepard 2000; Schuwirth & Van der Vleuten 2011).


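Counseling that starts at the very beginning of the study career makes educational and therapeutic interventions possible. Unorganized students struggle with the workload and the pressure of progressing along a predetermined timetable. More than aptitude, time management and prioritizing matter for academic achievement. Organized studying is linked to both progress and success. Early and dedicated counseling should therefore prevent or reduce planning-related study problems and improve study success.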

With the current knowledge, however, counseling can be more focused right from the beginning of a study career, enabling specific educational and even therapeutic interventions. Literature shows that unorganized students suffer most from workload and pressure of progressing in their studies according to a predetermined timetable (Ruohoniemi et al. 2010). More than aptitude, time management and prioritizing are important for academic achievement (West & Sadoski 2011). Organized studying appears to be related to both study progress and success (Rytkonen et al. 2012). Therefore, early and dedicated counseling will help to prevent or diminish planning-related study problems and, as a consequence, improve study success.


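Adequate self-reflection is important for healthcare professionals. This is why a key goal of our portfolio and counseling system is to make students aware of the importance of self-reflection and to stimulate the development of their reflective skills.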

Adequate self-reflection is nowadays considered an essential attribute of competent healthcare professionals. This is why it is one of the important goals of our portfolio and counseling system to increase students’ awareness of the importance of self-reflection and to stimulate development of their reflective skills (Driessen et al. 2005).


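Dropping out of or falling behind in medical school, despite the effort put into selection, is a cause for concern: for the students, who suffer personal distress; for the university, which spends a disproportionate amount of time and energy on struggling students; and for society, which bears the burden of the public funds spent on these students.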

Drop-out from or delay during medical school, in spite of selection efforts, is a cause for concern (Yates 2011; Stratton & Elam 2014). This is the case

  • for the students involved who suffer from personal distress,

  • for the university that is faced with a disproportionate amount of time and energy spent on struggling students, and

  • for society that has to bear the financial burden for drop-out and delayed students in countries where they receive public funding.


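Indeed, putting selection data to additional use is also attractive from a financial perspective. In countries such as the Netherlands, where education is publicly funded, preventing delay or drop-out offsets a substantial share of the costs.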

Indeed, the additional use of selection data is attractive from a financial perspective. In countries like the Netherlands, where education is publicly funded, the gains of avoiding delay or drop-out will compensate largely for the costs of a selection procedure and counseling system.


Siu E, Reiter HI. 2009. Overview: What’s worked and what hasn’t as a guide towards predictive admissions tool development. Adv Health Sci Educ Theory Pract 14:759–775.


Stratton TD, Elam CL. 2014. A holistic review of the medical school admission process: examining correlates of academic underperformance. Med Educ Online 19:22919.



 

 

 






 2016 Aug;38(8):844-9. doi: 10.3109/0142159X.2015.1132410. Epub 2016 Jan 25.

Narrative information obtained during student selection predicts problematic study behavior.

Author information

  • 1a Maastricht University , The Netherlands.

Abstract

INTRODUCTION:

Up to now, student selection for medical schools is merely used to decide which applicants will be admitted. We investigated whether narrative information obtained during multiple mini-interviews (MMIs) can also be used to predict problematic study behavior.

METHODS:

A retrospective exploratory study was performed on students who were selected into a four-year research master's program Physician-Clinical Investigator in 2007 and 2008 (n = 60). First, counselors were asked for the most prevalent non-cognitive problems among their students. Second, MMI notes were analyzed to identify potential indicators for these problems. Third, a case-control study was performed to investigate the association between students exhibiting the non-cognitive problems and the presence of indicators for these problems in their MMI notes.

RESULTS:

The most prevalent non-cognitive problems concerned planning and self-reflection. Potential indicators for these problems were identified in randomly chosen MMI notes. The case-control analysis demonstrated a significant association between indicators in the notes and actual planning problems (odds ratio: 9.33, p = 0.003). No such evidence was found for self-reflection-related problems (odds ratio: 1.39, p = 0.68).

CONCLUSIONS:

Narrative information obtained during MMIs contains predictive indicators for planning-related problems during study. This information would be useful for early identification of students-at-risk, which would enable focused counseling and interventions to improve their academic achievement.

PMID: 26805655

DOI: 10.3109/0142159X.2015.1132410


A new method for group decision making and its application in medical trainee selection (Med Educ, 2016)

James R Kiger & David J Annibale






INTRODUCTION


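The criteria by which medical schools and residency programmes select applicants rest partly on test scores and grades. In most cases, however, these numerical data are combined with, or even outweighed by, considerations that are hard to quantify, such as interview performance, leadership and prior experience. In the end, every programme must somehow distil all this information into a simple list in order of preference. To do so, groups often use pseudo-quantitative scoring systems that are mathematically unsound and may be counterproductive.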

The criteria by which a medical school or residency training programme selects its preferred applicants may, in part, rely on test scores or grades. In almost every case, however, these numerical data are combined with, if not superseded by, considerations that are difficult to quantify, such as interview performance, leadership traits and prior experience. In the end, every school or programme must find a way of distilling all this information into a simple list of applicants in order of preference. To achieve this goal, groups often rely on pseudo-quantitative scoring systems that are mathematically unsound and may be counterproductive to the collaborative process of making a list.


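Our subspecialty training programme uses the NRMP. The NRMP was introduced in 1952, at a time of growing confusion and frustration among medical students and residency programmes. A centralised body took on the task of assigning all graduating medical students to available residency spots. The NRMP system has held its place for 60 years and has expanded to cover more specialties and subspecialties.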

Our subspecialty training programme uses the National Resident Matching Program (NRMP) for applicant selection. The NRMP was formed in 1952 in response to escalating confusion and exasperation on the part of medical students and residency programmes. This centralised body assumed the task of sorting all of the nation’s graduating medical students into available residency spots.2 The NRMP system has stood relatively unchanged for more than 60 years, and has expanded to cover more specialties and subspecialties.


 

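Applicants and training programmes each submit their own rank-order list to the NRMP, which uses a 'deferred acceptance' algorithm to sort applicants so that stable and optimal results are achieved. For applicants, ranking is taxing but fundamentally a personal matter. For training programmes it is more complicated: they must decide how to integrate quantitative data with qualitative characteristics, and how to turn subjective information from multiple interviewers into a final rank-order list. The imprecision of this step has been documented in several published reports.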

Applicants and training programmes both submit rank-order lists to the NRMP, which employs a ‘deferred acceptance’ algorithm to sort the applicants into training positions such that stable and optimal results are achieved.2,3 For applicants, creating a rank order may be taxing, but is a fundamentally personal matter. For training programmes, generating a rank-order list may be significantly more complicated. Each programme must decide how to integrate objective quantitative data (test scores, grades, etc.) with qualitative characteristics (volunteer work, written statements, etc.) and the subjective opinions of multiple interviewers into a final rank-order list. The imprecision of this process is highlighted by published reports that have demonstrated the lack of correlation between information gathered during the interview process, the position of applicants on a programme’s rank-order list, and future resident performance.4–8



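ERAS, provided by the AAMC, includes a pseudo-quantitative method for generating a rank-order list. Interviewers score applicants on a Likert-type rating scale (1–9), and the mean score per applicant produces a preliminary ranking. The ERAS system is one example of a Likert scale-based system.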

The Electronic Residency Application Service (ERAS), provided by the Association of American Medical Colleges (AAMC), incorporates a pseudo-quantitative method to generate a rank-order list. Interviewers assign applicants scores on a Likert-type rating scale (integers of 1–9), and averaged scores for applicants are sorted to create a preliminary rank-order list. This ERAS system is simply one example of a Likert scale-based system.



Pseudo-quantitative methods of this kind have several fundamental problems.

Pseudo-quantitative methods such as this are beset by a number of fundamental problems:


  • 1 Score distributions differ between interviewers.
    the scores assigned by different interviewers are differently distributed;

  • 2 Numbers do not mean the same thing to every interviewer.
    numeric scores have no consistent meaning for interviewers (e.g. an interviewer who gives consistently lower scores may view a score of 7 points as signifying an excellent candidate, whereas another interviewer may view the same score as indicating an average candidate);

  • 3 The scores are ordinal data on an arbitrary scale and are unsuited to arithmetic operations.
    Likert scale-type scores are ordinal data on an arbitrary scale; it is inappropriate to perform arithmetic operations, such as the calculation of means, on such data,9–11 and

  • 4 Each applicant is interviewed by only some of the faculty, and each faculty member interviews only some of the applicants.
    candidates are interviewed only by a subset of faculty staff, and each faculty member may interview only a subset of candidates. Any particular candidate’s final score may be altered substantially by the inclusion or exclusion of an interviewer who gives consistently high or low scores.
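The effect described in points 1 and 4 can be seen in a small sketch. The interviewer labels, applicants and scores below are invented for illustration; the ranking rule (sort by mean score) follows the Likert-based approach described above.

```python
from statistics import mean

# Hypothetical 1-9 scores; interviewer labels and values are invented.
scores = {
    "lenient": {"X": 8, "Y": 9},  # prefers Y to X
    "harsh": {"Y": 5, "Z": 6},    # prefers Z to Y
    "typical": {"X": 7, "Z": 8},  # prefers Z to X
}

def mean_score_ranking(scores):
    """Rank applicants by the mean of whatever scores they received."""
    received = {}
    for by_applicant in scores.values():
        for applicant, score in by_applicant.items():
            received.setdefault(applicant, []).append(score)
    means = {applicant: mean(values) for applicant, values in received.items()}
    ranking = sorted(means, key=means.get, reverse=True)
    return ranking, means

ranking, means = mean_score_ranking(scores)
print(ranking[0], means)
```

In this made-up case applicant X comes out on top by mean score even though every interviewer who saw X preferred the other applicant they met. X simply avoided the harsh interviewer.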


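Because of these problems, we re-evaluate the ERAS-generated ranking in group discussion before submitting it to the NRMP. During the discussion, scores are adjusted to fit the 'group opinion' and the ranking is rearranged. Of course, such group discussion can be swayed by a vocal minority, and the views of those who cannot attend are left out.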

Given these problems, our programme has had to re-evaluate the preliminary ERAS-generated rank-order list in group discussions prior to submission to the NRMP. During such discussions, scores are modified to force the rank list to conform to the ‘group opinion’. Of course, this group opinion may be unduly influenced by a vocal minority, and those who are unable to attend are left out of the discussion.


Mathematical approaches to improving the rank-ordering process have been proposed.

Others have suggested different mathematical methods to improve the rank-ordering process.

  • One approach is to have interviewers compile individually ordered preference lists of applicants, instead of assigning scores. Both Chew et al. and Collins et al. suggest applying a formula to individual rank lists to create scores that can then be averaged.12,13

  • These systems resemble the Borda voting system in which each voter gives each candidate a number of points proportional to that candidate’s place on the voter’s list.14 These systems are hampered by the fact that the score derived from any given voter is dependent on the number of candidates seen by that voter.

  • A recent article by Ross and Moore suggests retaining scores, but comparing candidates pairwise and assigning a ‘win percentage’ to each in a system similar to that used in sports ranking.15



We set out several design principles.

We proposed a set of design principles to which an optimal system should adhere:


  • 1 the opinions of all interviewers will carry equal weight;

  • 2 the rank-order list will not be influenced by which interviewers meet any individual candidate;

  • 3 interviewers will compare only applicants whom they have met;
  • 4 the system will not depend on scores assigned on an arbitrary scale, and
  • 5 the final ordering will be transparent and reproducible.



METHODS


Algorithm development


We developed an algorithm termed ‘collaborative unbiased rank list integration’ (CURLI)


The CURLI algorithm involves four steps:


  • 1 each interviewer submits a personal ranked preference list of the applicants he or she has met or reviewed;

  • 2 each personal rank-order list is used to generate a pairwise preference table of applicants;
  • 3 the individual preference tables are summed to generate a composite preference table, and

  • 4 a sorting algorithm is applied to the composite preference table to generate a final rank-order list.


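The basic result is this: if applicants A and B are both interviewed by a subset of faculty members, and A is preferred to B by a majority of those interviewers, then A appears higher on the final preference list. This holds regardless of how many interviews any faculty member conducted or what individual scoring biases exist.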

The fundamental result of the CURLI algorithm is as follows: if applicants A and B are both interviewed by a subset of faculty members, and candidate A is preferred to candidate B by a majority of those interviewers, then candidate A will appear higher on the final preference list. This is unaffected by how many interviews any specific faculty member conducts or any individual scoring biases.


Personal rank-order lists


The fundamental change for interviewers is that instead of scoring applicants on an arbitrary scale, they are asked to maintain a personal ranked preference list of the applicants they have interviewed. Interviewers include only applicants they have met, conforming to design principles 2 and 3 above. Interviewers no longer assign arbitrary scores, removing the undue influence exerted by interviewers who give consistently high or low scores, satisfying principles 1 and 4.


Pairwise preference tables


When one applicant is preferred over another, a 1 is entered for that pairwise comparison.

Each interviewer’s ranked preference list is converted to a preference table, which is populated by the numbers 1 or −1 depending upon which applicant appears higher on that preference list. No values are assigned to applicants the interviewer did not meet. A preference list implies a comparison between all possible pairs of applicants on that list. Applicants appearing higher on the rank-order list are preferred to all applicants ranked below them. Therefore, a rank-order list of size n contains (n × [n − 1])/2 pairwise comparisons between applicants.



Suppose that, of four applicants A, B, C and D, the interviewer does not meet C and ranks the other three in the order B, D, A.

For example, imagine there are four applicants: A, B, C and D. An interviewer meets all but applicant C, and submits the following rank-order list: B–D–A.


Table 1 shows the preference table generated from this list.
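Table 1 is not reproduced in these notes, so the following is a minimal sketch of how such a table could be built from the list B–D–A. The sign convention (−1 when the row applicant is ranked above the column applicant, +1 when below) is an assumption inferred from the sorting step, where a positive cell means the lower-listed applicant is preferred. The function name is hypothetical.

```python
def preference_table(rank_list, applicants):
    """Pairwise table from one ranked list (most preferred first).

    table[row][col] is -1 if `row` is ranked above `col`, +1 if below,
    and 0 if the interviewer did not rank both applicants.
    """
    position = {name: i for i, name in enumerate(rank_list)}
    table = {r: {c: 0 for c in applicants} for r in applicants}
    for r in rank_list:
        for c in rank_list:
            if r != c:
                table[r][c] = -1 if position[r] < position[c] else 1
    return table

applicants = ["A", "B", "C", "D"]
table = preference_table(["B", "D", "A"], applicants)
for r in applicants:
    print(r, [table[r][c] for c in applicants])

# A list of n applicants implies n * (n - 1) / 2 pairwise comparisons.
comparisons = sum(
    1 for r in applicants for c in applicants if r < c and table[r][c] != 0
)
print(comparisons)
```

Applicant C, whom the interviewer did not meet, keeps zeros throughout, and the three ranked applicants yield 3 × 2 / 2 = 3 comparisons.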



Composite preference table

A composite preference table is computed simply by adding all of the individual preference tables.


For example, four interviewers (I, II, III and IV) provide the following rank lists for four applicants:


  • Interviewer I: B–D–A; 

  • Interviewer II: C–B–A–D; 

  • Interviewer III: B–C–D–A, and

  • Interviewer IV: C–D–B.


Table 2 shows the resulting four individual preference tables. Table 3 shows the composite preference table yielded by the sum for each cell.
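Tables 2 and 3 are not reproduced here. As a rough sketch, the cell-by-cell sum for the four rank lists above can be computed as follows. The sign convention (−1 when the row applicant is ranked above the column applicant) is an assumption inferred from the sorting step, and the function names are hypothetical.

```python
APPLICANTS = ["A", "B", "C", "D"]
RANK_LISTS = {
    "I": ["B", "D", "A"],
    "II": ["C", "B", "A", "D"],
    "III": ["B", "C", "D", "A"],
    "IV": ["C", "D", "B"],
}

def preference_table(rank_list):
    """-1 if row is ranked above col, +1 if below, 0 if not compared."""
    pos = {name: i for i, name in enumerate(rank_list)}
    return {
        r: {
            c: 0 if r == c or r not in pos or c not in pos
            else (-1 if pos[r] < pos[c] else 1)
            for c in APPLICANTS
        }
        for r in APPLICANTS
    }

def composite_table(rank_lists):
    """Sum the individual preference tables cell by cell."""
    total = {r: {c: 0 for c in APPLICANTS} for r in APPLICANTS}
    for rank_list in rank_lists:
        table = preference_table(rank_list)
        for r in APPLICANTS:
            for c in APPLICANTS:
                total[r][c] += table[r][c]
    return total

composite = composite_table(RANK_LISTS.values())
for r in APPLICANTS:
    print(r, [composite[r][c] for c in APPLICANTS])
```

Under this convention a negative cell means a net majority of the interviewers who compared the pair preferred the row applicant.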

 

 


 

 

Sorting


A modified bubble-sort algorithm was applied to the composite table.

A sorting algorithm is applied to the composite preference table to obtain the final rank-order list. For our programme, we applied a modified bubble-sort algorithm to the composite table.16 An initial unsorted list is generated. Each applicant is compared with the applicant immediately below on the rank list by checking the corresponding value in the composite preference table. If the lower-ranked applicant is preferred (i.e. the value in the cell is > 0), the order of the two applicants is swapped. This is continued until no more pairs of applicants are swapped. In the ideal scenario, the re-sorted list will yield a composite preference table with all negative values in the upper triangle.


Re-sorting yields Table 4.

For our example, the final sorted rank list is: C–B–D–A. Re-sorting the preference table to reflect this order gives a matrix with a fully negative upper triangle which indicates that every applicant is preferred by a majority of interviewers to all the applicants below them on the list (Table 4).
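The sorting step can be sketched with a plain bubble sort over the composite preferences for the four example rank lists. The authors' specific modifications to bubble sort are not described in this excerpt, so this is an illustration under that assumption rather than their implementation; the function names and the dictionary layout are hypothetical.

```python
from itertools import combinations

RANK_LISTS = [
    ["B", "D", "A"],
    ["C", "B", "A", "D"],
    ["B", "C", "D", "A"],
    ["C", "D", "B"],
]

def composite_table(rank_lists):
    """net[(x, y)] < 0 means x is preferred to y by a net majority."""
    net = {}
    for rank_list in rank_lists:
        # combinations() keeps list order, so `higher` precedes `lower`.
        for higher, lower in combinations(rank_list, 2):
            net[(higher, lower)] = net.get((higher, lower), 0) - 1
            net[(lower, higher)] = net.get((lower, higher), 0) + 1
    return net

def curli_sort(initial, net):
    """Bubble sort: swap neighbours while the lower one is preferred."""
    order = list(initial)
    swapped = True
    while swapped:
        swapped = False
        for i in range(len(order) - 1):
            upper, lower = order[i], order[i + 1]
            if net.get((upper, lower), 0) > 0:
                order[i], order[i + 1] = lower, upper
                swapped = True
    return order

net = composite_table(RANK_LISTS)
final = curli_sort(["A", "B", "C", "D"], net)
print("–".join(final))
```

Each swap removes exactly one pair in which the lower-listed applicant is preferred, so the loop always terminates, although with cyclic group preferences the result may depend on the initial order.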



If the same example were run with a Borda voting scheme, in which applicants are ordered by the points they earn, B could rank highest even though two interviewers preferred C.

If one imagines running the same example with a Borda voting scheme, for instance, in which each applicant is awarded points based on his or her position on each list, it is possible that applicant B may have been ranked highest, although two of the three interviewers who directly compared applicants B and C preferred applicant C.
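One way to check this claim is to score the four example rank lists with a Borda-style rule. The point rule used below (4 points for first place on a list, 3 for second, and so on, summed across lists) is an assumption chosen for illustration; other variants, such as averaging points per list, can give a different order.

```python
RANK_LISTS = [
    ["B", "D", "A"],
    ["C", "B", "A", "D"],
    ["B", "C", "D", "A"],
    ["C", "D", "B"],
]

def borda_totals(rank_lists, n_applicants):
    """Sum points per applicant: first place earns n, second n - 1, ..."""
    totals = {}
    for rank_list in rank_lists:
        for place, name in enumerate(rank_list):
            totals[name] = totals.get(name, 0) + (n_applicants - place)
    return totals

totals = borda_totals(RANK_LISTS, n_applicants=4)
winner = max(totals, key=totals.get)

# Head-to-head count for B versus C among interviewers who ranked both.
both = [r for r in RANK_LISTS if "B" in r and "C" in r]
prefer_c = sum(1 for r in both if r.index("C") < r.index("B"))

print(totals, winner, f"{prefer_c} of {len(both)} prefer C to B")
```

Under this rule B collects the most points because B appears on all four lists, while C, seen by only three interviewers, wins the direct comparison.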

 

 



Methodology



We implemented this new ranking algorithm during the 2013 neonatal-perinatal fellowship match. All faculty members and fellows were instructed to maintain a personal ranked preference list of the applicants they interviewed. They were also asked to assign a score of 1–9 to each participant as had been done in previous years, as per the ERAS system. These ‘shadow’ scores were used to compare the outcome of the CURLI algorithm with the results that would have been generated by the old Likert scale-based method.



RESULTS


During the trial year 14 applicants were interviewed, and 19 faculty members and fellows served as interviewers. Figure 1 shows the minimum, maximum, median and interquartile ranges for the scores assigned by each individual interviewer.

 

 


 

Interviewers used only part of the scoring range, and 86% of scores were 6 or higher.

On average, each interviewer scored nine applicants. All interviewers utilised a truncated part of the scoring range at the top of the scale. Of 162 total scores assigned, 139 (86%) were ≥ 6. The median score assigned by each interviewer ranged between 6 and 8.


There was discordance within individual interviewers. Of the 162 scores assigned in total, 23 were out of order relative to the interviewer's own rank list.

We observed discordance between individual interviewers’ assigned scores and their final assessments of an applicant’s desirability. Collectively, the interviewers assigned a total of 162 scores, 23 (14%) of which were out of order in relation to the rank-order list of the interviewer who had given them.


Under the new CURLI algorithm, nine of the 14 applicants would have been assigned to a different place on the ranking list.

… by the new CURLI algorithm. Of the 14 applicants, nine would have been assigned to a different place on the final ranking list.

 

 

In the previous 3 years, our division held two 2-hour meetings to adjust the preliminary list; this time a single 1-hour meeting sufficed. No applicant's rank changed.

In the prior 3 years, our division had scheduled two 2-hour meetings to discuss and modify the preliminary rank-order list. In this trial year, we required only a single 1-hour meeting to achieve consensus. No candidates were moved as a result of that discussion. Figure 2 shows the relationships between the preliminary rank-order list and the final rank-order list for 2013 and the prior 2 years. The changes reflect the alterations made during the divisional meeting. In 2011 and 2012, the positions of nine of 14 applicants, and 13 of 16 applicants, respectively, were moved on the final list.

 

 


 

 

DISCUSSION


From an administrative perspective, meeting time fell from 4 hours to 1 hour, and no rankings changed. Displaying the composite preference table provided transparency.

From an administrative perspective, the new method reduced meeting time from 4 hours to 1 hour, during which no changes were made to the rank-order list. During that meeting the composite preference table was displayed, providing complete transparency.


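The CURLI algorithm has several advantages. It is reproducible and transparent. It withstands pressure from a minority to change an applicant's rank. It reduces the inequity caused by intrinsic differences between interviewers.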

We suggest that our CURLI algorithm has numerous theoretical benefits that are borne out in practice. It is reproducible and transparent. There is reduced vulnerability to pressure from a minority of participants to change a candidate’s rank position, and the inequality imposed by intrinsic differences in scoring among interviewers is removed.


The CURLI algorithm has clear advantages. In Borda voting schemes and similar methods, applicants receive points for their place on each list, or their rank numbers are averaged.

Compared with other options that have been proposed, we feel that the CURLI method offers clear advantages. Borda voting schemes, and similar methods, introduce a process whereby applicants receive points for their place on each list, or in which the rank number on each list is averaged.12–14


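These methods may give satisfactory results when every interviewer sees every applicant, but they become problematic when each interviewer sees only some applicants. For example, an interviewer may see only a few applicants, all of whom happen to be among the least desirable. Under Borda-like methods, the highest-ranked of these applicants gains a huge advantage. CURLI has no such problem because it makes only relative comparisons.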

These methods may yield satisfactory results if all interviewers see all applicants (i.e. every individual preference list is full), but in cases like ours in which each interviewer sees only a subset of applicants, these methods are problematic and allow bias. Take, for example, an interviewer who interviews only a few applicants, all of whom happen to be among the least desirable. Under the Borda-like methods, the top-ranked applicant on this list will obtain a huge advantage in points or rank, even though that applicant may actually not be desirable compared with all the other applicants that particular interviewer did not see. As the CURLI method uses the rank lists only to make pairwise comparisons between applicants the interviewer actually saw, it suffers no such bias.


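Other pairwise comparison methods exist, but they are less transparent and more cumbersome than CURLI. Most interviewers failed to maintain even internal consistency in their scores. CURLI removes arbitrary scores entirely.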

Other pairwise comparison methods have been proposed, but we feel they are less transparent and more cumbersome than our CURLI method.15 As our case study highlights, the majority of interviewers failed to maintain even internal consistency in their score assignment during one interview season. The CURLI method we have described dispenses with arbitrary scores entirely.



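The method could also be applied to the scoring of knowledge or clinical reasoning assessments, such as script concordance testing (SCT).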

We believe this method may find further application in medical training in the scoring of knowledge or clinical reasoning assessment tools, such as script concordance testing.17


 





 2016 Oct;50(10):1045-53. doi: 10.1111/medu.13112.

A new method for group decision making and its application in medical trainee selection.

Author information

  • 1Department of Pediatrics, Medical University of South Carolina, Charleston, South Carolina, USA. kiger@musc.edu.
  • 2Department of Pediatrics, Medical University of South Carolina, Charleston, South Carolina, USA.

Abstract

CONTEXT:

The problems associated with generating a collaborative ranked preference list represent a common source of dilemma in academic medicine and medical education. Such issues present during the process of choosing among applicants to medical schools, during the selection of postgraduate trainees, and in the course of performance assessments and the prioritising of financial expenditures. Currently, most institutions use pseudo-quantitative methods, such as the averaging of scores awarded on an arbitrary scale. These methods are mathematically problematic and may not accurately reflect group opinion.

METHODS:

The present authors developed a novel algorithm for creating a collaborative preference list that generates and sorts a matrix of pairwise comparisons between applicants or choices without placing any reliance on arbitrary Likert scale-type scores. This method achieves equality in influence across individual assessors, as well as transparency and reproducibility. The authors report a case study of their experience using this new algorithm in the 2013 neonatal-perinatal fellowship match.

RESULTS:

When used by this group in the selection of fellowship trainees, the method proposed here allowed for greater efficiency and created a rank-order list that did not require reshuffling or significant debate. A survey of faculty staff and fellows showed much higher levels of satisfaction with the new algorithm and a unanimous desire to use the new algorithm in the future, in preference to a score-based system.

CONCLUSIONS:

The algorithm developed and described here may reduce arbitrariness in processes that require the collaborative creation of a preference list. This method may have wide applicability in medical education and training, and beyond. The present authors' experience of using this algorithm during the National Resident Matching Program match showed improved perceptions of fairness, ease of use and efficiency.

PMID: 27628721
DOI: 10.1111/medu.13112
[PubMed - in process]


Identifying the bad apples (Adv in Health Sci Educ, 2015)


Geoff Norman




Thirty-five years ago, two social psychologists wrote a book called "Human Inference". It showed how vulnerable human judgments and actions are to a range of contextual variables. One of these is the "vividness hypothesis": a single vivid experience can shape social attitudes even in the face of clear statistical evidence.

Thirty-five years ago, two social psychologists, Richard Nisbett and Lee Ross, wrote a classic book called ‘‘Human Inference: Strategies and Shortcomings of Social Judgment’’ (1980). In that book, they demonstrated how human judgments and actions are vulnerable to many contextual variables. One particular shortcoming they labeled the ‘‘vividness hypothesis’’: ‘A single vivid instance can influence social attitudes when pallid statistics of far greater evidential value do not’ (p. 57).


Evidence of this psychological bias abounds.

  • Crime rates have been falling steadily in all developed countries for 25 years.
    Politicians continue to garner votes claiming they are ‘‘tough on crime’’ despite the fact that crime rates have been steadily declining in all developed countries for 25 years; as one example of many, homicide rates in Canada are half what they were in 1999.

  • How serious are air crashes? They are only about 1/3 of the 1970 figure, despite a five-fold increase in miles flown.
    How bad was 2014 for air crashes? Remember MH370 and MH17? In fact, the number of civil aviation crashes was the lowest on record, and about 1/3 of what it was in 1970, despite a five-fold increase in passenger miles flown.

  • How many people are killed by jihadists? Violent death rates have been declining steadily for millennia.
    What about all those people killed by jihadists? Violent death rates have been on a steady decline for millennia (Pinker 2011).


The case of Dr. Harold Shipman, a British GP. It prompted calls to reform the educational process so that nothing like it happens again. Since then, attention to 'non-cognitive' factors, professionalism in particular, has grown.

Dr. Harold Shipman, a British GP who is estimated to have killed 250 of his patients and was eventually convicted of 15 murders. The publicity surrounding his trial and conviction led to calls to reform the educational process so that such things do not happen again (Powis 2015). In particular, we have seen increased focus on ‘‘non-cognitive’’ factors, particularly professionalism.


van Mook et al. reviewed several issues around identifying and remediating dyscompetent residents. Studies suggest that unprofessional behavior in practice can be predicted from performance in medical school.

In a review article, van Mook et al. (2014) examine the multiple issues related to the identification and remediation of ‘‘dyscompetent’’ residents, particularly in the area of professionalism. And an original study by Santen et al. (2014) extends the findings of two landmark studies by Papadakis et al. (2004, 2008), which showed that unprofessional behavior in practice was apparently predictable from performance in medical school. Both studies used a ‘‘case control’’ design.


Santen et al. replicated and extended these results using the student assessment data that routinely come before promotion committees.

The Santen et al. (2014) study in this issue replicates and extends these findings by examining the routine student assessments arising from promotion committees, instead of creating a system geared to identifying professionalism issues.


Both are case-control studies, a design commonly used when the outcome of interest (developing cancer, dying) is infrequent. In the Papadakis study, 70 of 6330 graduates were disciplined by the California state board, a prevalence of 1.1 %; the other 6260 were not.

Both studies use a case–control design. Case–control studies are frequently used when the outcome of interest, such as developing cancer or dying, is infrequent. This design certainly applies in the Papadakis study: of 6330 graduates from UCSF over the time interval, 70 were disciplined by the California state board, a prevalence of about 1.1 %. And, of course, 6260 were not.


And here is the crux. The 'disease' we are looking for, unprofessionalism, has a very low prevalence. In that situation, even with a very good diagnostic test, the true positives are swamped by false positives. Only 38 % (27) of the 70 can be detected, and to do so 1190 other graduates would be wrongly labeled unprofessional. The PPV is 27/(1190 + 27), only 2.2 %. In the Santen study as well, only 140 of more than 2000 graduates had poor performance in medical school.

And there’s the rub. The disease we’re screening for, documented unprofessionalism, has a very low prevalence. Under these circumstances, even very good diagnostic tests result in true positive cases that are swamped by false positives. Working through this example, if we used documented concerns as a medical student as a screening test to decide if a graduate should be allowed to proceed, we would detect 38 % of the bad apples or 27; but we would incorrectly label 0.19 × 6260 = 1190 other graduates as unprofessional. The positive predictive value of the test is 27/(1190 + 27) = 2.2 %. Similar data arise in the Santen study, where review of 20 years’ data, involving over 2000 graduates, showed that 140 had poor performance in school, and only 29 were subsequently sanctioned by the state medical board.
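
The screening arithmetic above can be reproduced in a few lines. This is a minimal sketch: the function name `positive_predictive_value` is an illustrative assumption, and the inputs (70 cases, 6260 controls, 38 % sensitivity, 19 % false-positive rate) are the figures quoted in the text. The counts are left unrounded, so they can differ by a unit from the rounded 27 and 1190 in the text.

```python
def positive_predictive_value(n_cases, n_controls, sensitivity, false_positive_rate):
    """Return (true positives, false positives, PPV) for a screening rule."""
    true_pos = n_cases * sensitivity              # cases correctly flagged
    false_pos = n_controls * false_positive_rate  # controls wrongly flagged
    return true_pos, false_pos, true_pos / (true_pos + false_pos)

tp, fp, ppv = positive_predictive_value(70, 6260, 0.38, 0.19)
print(f"true positives ~{tp:.1f}, false positives ~{fp:.1f}, PPV = {ppv:.1%}")
```

Because the controls outnumber the cases by roughly 90 to 1, even a modest false-positive rate produces far more false positives than true positives.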


In other words, in the Papadakis study only about 2 of every 100 students flagged this way would ever be reported to the state board.

So in the Papadakis study, for every 100 students who would have been denied graduation, if they had proceeded to implement a policy based on documented concerns, only two would end up reported to the State board.


 

Adding an economic argument: if educating a doctor costs $100,000 per year, this implies a social cost of $400,000 × 98 students, roughly $40 million, because these 98 students behaved satisfactorily but were flagged as unsatisfactory and could not graduate.

If you want to put an economic spin on it, if it costs $100,000/year to educate a doctor, that policy would result in a social cost of $400,000 × 98 = $40 million of education costs based on the number of satisfactory students who had an unsatisfactory and then could not graduate, without even considering lost income in practice.


But unprofessional behavior does not appear overnight, and it may be detectable from the point of admission. This is the argument of Powis and others for using personality tests at admission.

But if these unprofessional behaviours are longstanding, perhaps they are detectable at the time of admissions. This is the promise held out by Powis, who has argued repeatedly for the more widespread use of personality tests at admissions (2003, 2009, 2015).


In fact, there is evidence on the usefulness of such a policy. In a 2007 study, Papadakis used California Psychological Inventory (CPI) results to show that the mean scores of cases and controls differed by 2.1 SD. So far so good.

In fact, there is some useful evidence to inform this policy. Papadakis, in another published study (2007), looked at performance on a personality test (the California Psychological Inventory), using a subsample from her earlier study that had undergone the psychological testing as part of admissions. The sample was 19 cases (from the original 70) who had difficulty with the state board, and 26 controls (of 196 sampled from 6260), all of whom had taken the CPI as part of admission to medical school. For the total score, the mean of the cases was 156 (SD = 14.7); for the controls 181 (SD = 11.7). This means that the case mean was 25/11.7 = 2.1 SDs below the control mean. So far so good.


Imagine we use this score for selection. We can work out the proportions, keeping in mind that 70 people will eventually get into trouble and 6260 will not.

Now let us imagine using these data for selection, by establishing a threshold score that students must attain to be considered for admission—a policy directly advocated by Powis (2015). We can look at the proportion of each group who are accepted or rejected, keeping in mind that our real denominator is 70 cases who will eventually get in trouble with the state board, and 6260 controls who won’t.


If the threshold is set at 156, the 'case' mean, we miss 50 % (35) of the cases. This is -2.1 SD on the control distribution, so 2 % of the controls, 125 people, are falsely labeled. That is, 125/(125 + 35), or 81 %, actually have no problem.

If we were to set the threshold at 156, the ‘‘case’’ mean, then we’ll miss 50 % of the cases, 35. And this is a z score of -2.1 for the controls, so we’ll falsely label 2 % of the controls, 125, as unprofessional. And 81 % (125/(125 + 35)) of the people we have labeled would not have any problems in practice.


Clearly, a test that detects only 50 % of the cases is problematic. So raise the sensitivity to 90 %. That catches 63 of the 70 cases, but 1315 of the controls are flagged along with them. In other words, 95 % of those 'caught' will have no problems later.

Clearly, a test that only detects 50 % of the cases is of little value. So let’s rack it up to a sensitivity of 90 %, which is a Z value on the ‘‘Cases’’ distribution of 1.28. We will detect 63 of 70 cases. That means the threshold, in Z units on the ‘‘control’’ distribution, is (-2.1 + 1.28) = -0.82, which equates to 21 % of the Control distribution below the threshold, or 1315. In short, similar to the previous calculation, 1315/(1315 + 63) = 95 % of the individuals identified by a low score on the psychological test would not have any further problems in practice.
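
The threshold calculation can be sketched with the standard normal distribution. The sketch assumes, as the text implicitly does, that both groups are normally distributed and that a single z scale can be shifted by the 2.1 SD separation; the function name `false_label_share` is illustrative. It uses exact normal-CDF values, so the results differ somewhat from the rounded percentages in the text.

```python
from statistics import NormalDist

def false_label_share(sensitivity, separation_sd=2.1, n_cases=70, n_controls=6260):
    """Share of flagged applicants who are controls, i.e. falsely labeled."""
    std_normal = NormalDist()
    # Threshold position on the case distribution that flags `sensitivity` of cases.
    z_on_cases = std_normal.inv_cdf(sensitivity)
    # Same threshold seen from the control distribution, whose mean is 2.1 SD higher.
    z_on_controls = z_on_cases - separation_sd
    flagged_cases = sensitivity * n_cases
    flagged_controls = std_normal.cdf(z_on_controls) * n_controls
    return flagged_controls / (flagged_controls + flagged_cases)

for s in (0.5, 0.9):
    print(f"sensitivity {s:.0%}: {false_label_share(s):.0%} of flagged applicants are controls")
```

Raising the sensitivity moves the threshold toward the control mean, so the number of flagged controls grows much faster than the number of additional cases detected.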

 


 

Clearly, using personality tests to detect those who will eventually be disciplined by the state comes at a serious cost.

Clearly, any attempt to identify individuals who will be eventually subject to report to the State disciplinary board using personality tests comes at a serious cost in terms of denying access to many who would not have problems.


The premise of this approach is that 'cognitive measures' alone are insufficient to find those who will later cause problems. But is that really so?

The underlying premise of this approach is that cognitive measures are inadequate to identify individuals who will become problems in practice. But is this necessarily the case?


Tamblyn et al. studied the validity of the MCCQE, which has two parts: a written examination and an OSCE. For predicting communication complaints, the RR for the bottom quartile of the OSCE was 1.43, and for the written examination 1.34. For predicting quality-of-care complaints, the RR was 1.38 for the communication score and 1.54 for the written examination. So, however much they are disparaged, cognitive measures are an important predictor of practice performance. The same conclusion emerges from Teherani et al., who examined whether residents' postgraduate performance predicts later disciplinary action. Performance was measured two ways: ABIM in-training evaluations and the ABIM certification examination. The hazard ratio for discipline charges looked impressive at about 1.9; the ABIM certification examination was about 1.7. Here too, the prevalence was about 1 %.

Tamblyn et al. (2007) have studied the validity of the Medical Council of Canada Qualifying Examination in predicting complaints (quality of care and communication skills) to provincial licensing bodies. The MCCQE examination has two parts: a written examination, primarily multiple choice completed at graduation, and an OSCE completed 1 year later. In terms of predicting communication complaints, the relative risk of a complaint for a communication skill performance in the bottom quartile of the OSCE was 1.43; for the written exam score it was 1.34. For quality of care complaints, the relative risks were 1.38 for communication skills and 1.54 for the written test. (Relative risks for the data gathering and problem solving parts of the OSCE ranged from .97 to 1.13, predicting nothing). So it appears that, however much they are disparaged, cognitive measures of performance are an important predictor of practice performance. The same conclusion came from a study by Teherani et al. (2005), who looked at postgraduate performance of residents as a predictor of disciplinary action in practice. Performance was measured two ways: by American Board of Internal Medicine in-training evaluations, and by the ABIM certification examination. Again, the hazard ratio in predicting discipline charges looked impressive, about 1.9. However, the ABIM certification examination was not far behind at 1.7. And as before, with a prevalence of disciplinary action of about 1 % in this sample, the results do not support the use of either measure as a ‘‘diagnostic test’’.


Another assumption in discussions of assessing 'non-cognitive' qualities or personality at admission is the either-or hypothesis: the assumption that the admissions committee must make a Faustian choice between applicants with 'good character' and applicants with 'good academic ability'. But that choice arises only if personality and academic ability are negatively correlated.

One other assumption pervades discussion of assessing ‘‘non-cognitive’’ or personality at admissions: the ‘‘either-or’’ hypothesis. It is presumed that the admissions committee must make a Faustian choice between selecting someone who is personable, professional and compassionate, or someone who is academically top-tier. Such a choice would only be necessary if there were a strong negative correlation between personal qualities and academic performance. But is there?


One study showed a negative association between high school grades and interview scores. In more recent MMI studies, however, the correlation was -0.21 in one and 0.07 in the other. Wherever the truth lies, there appears to be no problem in selecting for both academic excellence and interpersonal skills. Moreover, for the Neo-5, the current state-of-the-art personality test, the only consistent relationship with other measures is a moderate positive relationship between conscientiousness and grades.

One study (Powis and Bristow 1997) showed a significant negative association between scores on a personal interview and high school grades. However, two more recent studies examined the relation between the MMI (a well-validated measure of non-cognitive skills) and university GPA. In the first study (Eva et al. 2004) the correlation was -0.21; in the second (Kulasegaram et al. 2010) the correlation was +.07. Wherever the true correlation lies, it would appear that there should be no problem identifying students who have both academic excellence and interpersonal skills. Moreover, when one examines the constructs measured by the current state of the art personality test, the Neo-5 personality test, about the only consistent relationship with other measures that has emerged is a moderate positive relationship between conscientiousness and grades (Kulasegaram et al. 2010).


Admissions strategies that use both academic and interpersonal measures are perfectly appropriate. Forcing a choice between the two is not. And it is misguided to expect that we can build a test that diagnoses the rare disease of 'unprofessionalism'.

It is perfectly appropriate to devise admissions strategies, in-course performance indices, and certification procedures that include both academic and interpersonal measures. It is not appropriate to force a choice between one and the other. And it is folly to presume that we will ever be able to create an adequate diagnostic test to the ultimately rare disease of unprofessionalism.



 



Powis, D. (2015). Selecting medical students: An unresolved challenge. Medical Teacher, 37, 252–260.



 2015 May;20(2):299-303. doi: 10.1007/s10459-015-9598-9.

Identifying the bad apples.

Author information

  • 1McMaster University, Hamilton, ON, Canada, norman@mcmaster.ca.
[PubMed - indexed for MEDLINE]


Validating MMI scores: are we measuring multiple attributes? (Adv in Health Sci Educ, 2014)

Tom Oliver • Kent Hecker • Peter A. Hausdorf • Peter Conlon






Introduction


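The MMI is an interview method for assessing applicants' non-cognitive attributes. Traditional (less structured) interviews have been reported to have low reliability and validity, whereas the MMI is sufficiently reliable and correlates significantly with performance in school and on licensure exams. In addition, total MMI scores show discriminant validity from cognitive measures such as GPA, which suggests that the MMI measures something different.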

The multiple mini-interview (MMI) is an interview method used in health professional school selection to assess the non-cognitive attributes of applicants (Eva et al. 2004). Whereas more traditional—and often less-structured—interviews have been found to have poor reliability and validity in health professional school selection (Kreiter et al. 2004; Eva et al. 2004; Albanese et al. 2003; Edwards et al. 1990), previous studies have found MMI scores to have sufficient reliability and to be significantly correlated to performance in school and licensure exams (Eva et al. 2009, 2012; Hecker and Violato 2011; Reiter et al. 2007). In addition, there is consistent evidence for the discriminant validity of total MMI scores from ratings of cognitive skill such as incoming grade point average (GPA; Eva et al. 2004, 2009, 2012; Reiter et al. 2007), which suggests that the MMI is measuring something other than cognitive skill.


Non-cognitive attributes assessed by the MMI


Non-cognitive attributes cover a wide range. MMIs are designed so that raters assess multiple constructs in a candidate both within a station and across stations. However, different non-cognitive attributes assessed within the same station have been reported to be highly correlated, so MMI scores are usually reported as totals rather than as construct-based scores.

Non-cognitive attributes can include a variety of individual differences related to attitudes, personality traits, and motivations (Schmitt et al. 2009). MMIs have been designed to have raters assess candidates on multiple constructs (e.g. oral communication and moral reasoning) both within and across interview stations. However, MMI measures of different non-cognitive attribute constructs assessed within the same station have been found to be highly correlated (Eva et al. 2004; Lemay et al. 2007; Roberts et al. 2009). As a result, it is common practice to report total scores (i.e. the average score across all measures) within each station instead of construct-based scores.


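Two things an MMI can measure are oral communication (OrCo) and problem evaluation (PrEv). OrCo is the ability to convey verbal messages constructively; PrEv is the ability to identify problems and to make decisions and judgments that take the perspectives of different stakeholders into account.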

Two of the more distinct interpersonal constructs that an MMI can attempt to measure are oral communication and problem evaluation. Oral communication is the ability to convey verbal messages constructively; and problem evaluation is the ability to identify and take into account multiple perspectives from various different stakeholders in decision making and judgment.


Raters can observe and rate candidates' OrCo and PrEv abilities.

Raters have an opportunity to observe and rate the candidate on

  • the clarity of their language and confidence in their conveyed verbal response (oral communication), and

  • the breadth and depth to which they can explore underlying issues within cases and correctly balance pros and cons for the situation (problem evaluation).


MMI measures related to personality characteristics


Three studies have examined the relationship between total MMI scores and personality measures. Their results are mixed.

Three exploratory studies have investigated the relationship between total MMI scores and personality measures (Griffin and Wilson 2012; Jerant et al. 2012; Kulasegaram et al. 2010). The results from these studies found mixed evidence.


Two personality traits that affect health professionals' interpersonal performance are emotionality and extraversion.

Two personality traits that are likely to be related to health professionals’ interpersonal performance are emotionality and extraversion (Ashton and Lee 2007).

  • High emotionality: people who have high emotionality tend to feel empathy and sentimental attachments with others, are sensitive to dangerous and stressful situations, and feel dependent on the emotional support from others;

  • High extraversion: people who are extraverted tend to feel confident when leading or addressing groups of people, enjoy social gatherings and interactions, and frequently experience positive feelings of enthusiasm and energy.


Extraversion:

Extraversion is a trait that includes tendencies such as acting confident with others and expressing enthusiasm and energy (Ashton and Lee 2007).


Emotionality:

Emotionality is a trait that includes tendencies such as being familiar with the anxieties and fears that come with stressful situations, and feeling emotional connections with others (Ashton and Lee 2007).


MMI measures related to future performance


 

The MMI measures the following (OrCo and PrEv):

MMI measures of two distinct constructs (oral communication and problem evaluation) and

 

The communication skill interview measures the following:

communication skill interview scores of students’

  • effectiveness in building a relationship (i.e. build a patient’s or client’s feelings of rapport and trust with the practitioner) and

  • effectiveness in explaining and planning (i.e. build a patient’s or client’s understanding and motivation to support an action plan; Silverman et al. 2005).


OrCo is expected to relate to building a relationship, and PrEv to explaining and planning.

Oral communication should be more closely related to building a relationship and problem evaluation should be more closely related to explaining and planning.


Research objectives and hypotheses


H1 Given the explicit measurement and distinctiveness of oral communication and problem evaluation, there will be a stronger model fit for a 2-factor solution for MMI scores than for a 1-factor solution. 


H2a Oral communication MMI scores will be positively related to building the relationship score in a communication interview.

H2b Problem evaluation MMI scores will be positively related to explaining and planning scores in a communication interview.


H3a Oral communication MMI scores will be positively related to extraversion scores measured by the HEXACO-PI-R-60 (Ashton and Lee 2009). 


H3b Problem evaluation MMI scores will be positively related to emotionality scores measured by the HEXACO-PI-R-60 (Ashton and Lee 2009).


 

Method


Sample


Measures


MMI


The MMI consisted of eight 10-min stations, with two raters per station who each independently rated the participants on two constructs. The development of the MMI followed the description outlined in Hecker et al. (2009). The majority of the stations were developed at the University of Calgary, Canada and modified by the admissions committee at OVC.


The eight stations were meant to assess oral communication and problem evaluation for a range of issues relevant to success as a veterinarian. These issues were ethical and moral (2 stations), interpersonal (3 stations), intrapersonal (1 station), and professional (2 stations). At each station raters scored candidates on two items, one for each construct. Each item was scored on a scale of 1–5 (1 = unacceptable; 3 = meets expectations; 5 = exceptional).


Communication interview


 

The standardized clinical communication interview was designed by Adams and Ladner. Each participant was assessed on the use of effective communication skills at two interview stations. Little medical knowledge was required. The simulated client rated the participant immediately on 7 items (4 on building the relationship, 3 on explaining and planning).

The standardized clinical communication interviews were initially designed by Adams and Ladner (2004) with the consultation of practicing veterinarians. Each participant participated in two communication interview stations designed to assess participants’ use of effective communication skills. Medical and technical knowledge requirements were minimal. The simulated client rated the participant immediately after each station on 7 items (using a 9 point scale) meant to assess two constructs, building the relationship (4 items) and explaining and planning (3 items).

 

The two scores were converted to T-scores to adjust for differences between the simulated clients at different stations. The mean of the T-scores across the two stations was then calculated.

The two scores within each station were converted to a T-score to account for differences in simulated client scores between stations (Howell 2002). The mean of each participant’s T-score across the two stations was calculated to use as his/her building the relationship and explaining and planning score.
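
The T-score step can be sketched as follows. This assumes the conventional definition of a T-score (mean 50, SD 10) applied within each station, and uses the sample standard deviation; the source does not say which variant was used, and the function names are illustrative.

```python
from statistics import mean, stdev

def t_scores(raw_scores):
    """Standardize one station's raw scores to T-scores (mean 50, SD 10)."""
    m, s = mean(raw_scores), stdev(raw_scores)  # sample SD is an assumption
    return [50 + 10 * (x - m) / s for x in raw_scores]

def participant_scores(station_a, station_b):
    """Average each participant's T-scores across two stations.

    Both lists must hold the participants in the same order.
    """
    return [(a + b) / 2 for a, b in zip(t_scores(station_a), t_scores(station_b))]

# A lenient client (station B) and a strict client (station A) end up on the same scale.
print(participant_scores([3, 5, 7], [6, 7, 8]))
```

Standardizing within station removes differences in how harshly each simulated client scores before the two stations are averaged.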


Personality


The personality traits of emotionality and extraversion were measured with the HEXACO-PI-R-60 (Ashton and Lee 2009).


Analysis



Results



Discussion


The major findings from this study were:


1. The two-factor model is supported. However, the OrCo and PrEv constructs were highly correlated both within the model and in the observed data.

1. There was support for a two factor model, however, the oral communication and problem evaluation constructs were highly correlated both within the model (.87) and the correlation analyses with the actual data (.73; Table 4).


2. OrCo scores correlated significantly with extraversion and with building the relationship.

2. Oral Communication MMI score was significantly correlated with extraversion (small but significant) and building the relationship scores, supporting Hypotheses 2a and 3a.


3. PrEv scores were not significantly correlated with emotionality, but were significantly correlated with building the relationship and with explaining and planning.

3. Problem evaluation MMI score was not significantly related to emotionality score but did correlate with building the relationship (not hypothesized) and explaining and planning, thus supporting Hypothesis 2b but not Hypothesis 3b.


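4. Total MMI scores had a small but significant correlation with extraversion, and significant correlations with building the relationship and with explaining and planning.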

4. Total MMI score had a weak but significant correlation with extraversion, and significant correlations with building the relationship and explaining and planning.


The two-factor model fit better, but the two factors were highly correlated. Caution is therefore needed before concluding that two truly distinct factors are being measured.

While there was a stronger and significantly better model fit for a two factor model (Fig. 1) than a one factor model, the two constructs were highly correlated (.87). Thus while there was support for a two factor model, caution must be taken in concluding that we are measuring two truly distinct factors as there was weak evidence for discriminant validity between the two construct scores.


Practical implications of these findings: there is evidence for

  • investing the time and effort in MMI station construction,

  • creating appropriate scoring rubrics based upon attributes known to be predictive for future performance and

  • conducting rater training to ensure appropriate and fair assessment of the candidate.


The second research objective was to see whether the MMI measures fit the nomological network of non-cognitive constructs. Interestingly, although the OrCo and PrEv MMI scores were highly correlated, their relationships with other non-cognitive constructs differed.

The second research objective was to test whether the MMI measures fit within the nomological network for non-cognitive constructs. Interestingly, even though the MMI scores of oral communication and problem evaluation were highly related, they were found to have different relationships to other measures of non-cognitive constructs.


As in hypothesis 2B, the MMI PrEv score was significantly related to explaining and planning. However, the hypothesis that the MMI PrEv score would be related to emotionality was not borne out.

Consistent with our hypothesis (2B), the MMI problem evaluation rating had a significant positive relationship with explaining and planning. However, the hypothesized relationship between the MMI problem evaluation measure and emotionality was not found (hypothesis 3B).

 

One explanation is that the MMI did not properly measure students' capacity for empathy. The current MMI measures how well students recognize the perspectives of others, but not how they express empathy towards others. Including stations with more direct interaction could overcome this limitation.

One explanation is that the MMI did not effectively measure students’ ability to empathize with others. The current MMI measured students’ ability to recognize the points of view of others, but it did not measure how students would express feelings of empathy towards others. One way that this could be done is to include stations that require candidates to engage more directly in an interaction.


Another explanation is that emotionality is a broad personality trait, and as such covers many facets of personality. Narrow traits are often found to be more predictive than broad traits when there is a strong conceptual link to a specific criterion.

Another explanation is that emotionality is a broad personality trait. As a broad personality trait, emotionality measures a broad range of individual attributes (e.g. empathy towards others, sensitivity to physical harm). Narrow traits are often found to be more predictive than broad traits when there is a strong conceptual link to a specific criterion (Rothstein and Goffin 2006; Tett et al. 1991).


Any valid selection process should turn the general applicant pool into a more homogeneous group, but homogeneous only on the characteristics related to success in school or at work.

Any valid selection process should lead to the selection of a more homogenous group of successful candidates from the general applicant pool, wherein the successful applicants are homogenous only on the characteristics that lead to in-school or in-job success.


Thus, if there are personality traits that improve clinical interviewing or health outcomes, MMI scenarios should be designed to assess them.

Thus, if these are personality traits that can lead to better performance within the clinical interview and potentially better health outcomes, then it can be argued MMI scenarios should be designed to assess attributes related to these traits.



 

Ashton, M. C., & Lee, K. (2007). Empirical, theoretical, and practical advantages of the HEXACO model of personality structure. Personality and Social Psychology Review, 11, 150–166.


Jerant, A., Griffin, E., Rainwater, J., Henderson, M., Sousa, F., Bertakis, K. D., et al. (2012). Does applicant personality influence multiple mini-interview performance and medical school acceptance offers? Academic Medicine, 87, 1–10.




 2014 Aug;19(3):379-92. doi: 10.1007/s10459-013-9480-6. Epub 2014 Jan 22.

Validating MMI scores: are we measuring multiple attributes?

Author information

  • 1University of Guelph, Guelph, Canada.

Abstract

The multiple mini-interview (MMI) used in health professional schools' admission processes is reported to assess multiple non-cognitive constructs such as ethical reasoning, oral communication, or problem evaluation. Though validation studies have been performed with total MMI scores, there is a paucity of information regarding how well MMI scores differentiate the constructs being measured, the relationship between MMI scores (construct or total) and personality characteristics, and how well MMI scores (construct or total) predict future performance in practice. Results from these studies could assist with MMI station development, rater training, score interpretation, and resource allocation. The purpose of this study was to investigate the validity of MMI construct scores (oral communication and problem evaluation), and their relationship to personality measures (emotionality and extraversion) and specific scores from standardized clinical communications interviews (building the relationship and explaining and planning). Confirmatory factor analysis results support a two factor MMI model, however the correlation between these factors was .87. Oral communication MMI scores significantly correlated with extraversion (r c = .25, p < .05), but MMI scores were not related to emotionality. Scores for building a relationship were significantly related to MMI oral communication scores, (r c = .46, p < .001) and problem evaluation scores (r c = .43, p < .001); scores for explaining and planning were significantly related to MMI problem evaluation scores (r c = .36, p < .01). The results provide validity evidence for assessing multiple non-cognitive attributes during the MMI process and reinforce the importance of developing MMI stations and scoring rubrics for attributes identified as important for future success in school and practice.

PMID: 24449121
[PubMed - in process]


Conditional reliability of admissions interview ratings: extreme ratings are the most informative (Med Educ, 2007)

R Brent Stansfield1 & Clarence D Kreiter2





INTRODUCTION


Raters usually rate applicants on Likert-type scales, which yield quantitative but unreliable measures.

Interviewers typically rate applicants on Likert-type scales1 that yield quantitative, but unreliable, measures.4,6


There is little evidence for the predictive validity of interviews.


The low reliability of interview scores may stem not from an invalid interview procedure but from treating ratings as equally informative across the whole scale. Suppose a rater can identify outstanding applicants but cannot tell mediocre from inadequate ones. That rater's scores are more informative in the high range than in the low range, and despite this validity the overall reliability of the ratings will be low. Using such ratings properly means accounting for conditional reliability, that is, reliability within different ranges of the scale.

Unreliable interview scores may not arise from invalid interviewing processes, but rather from the treatment of ratings as homogeneously informative measures. Imagine an interviewer able to identify stellar candidates, but unable to distinguish mediocre from poor ones; his high scores would be more informative than his low scores. Despite this validity, his ratings would have low reliability overall. The proper use of his ratings would account for conditional reliability: the reliability of different scale ranges.


Other studies of conditional reliability have found heterogeneity of error variance in Likert-type scales. On political opinion questions, a midpoint response may mean 'undecided' or 'never thought about it' rather than 'neutral' or 'neither agree nor disagree'. This implies that midpoint responses carry less certainty, and therefore a higher standard error of measurement, than non-midpoint responses. In one study of an anxiety scale, simply treating midpoint responses as missing data raised Cronbach's alpha from 0.70 to 0.94.

Other investigations of conditional reliability have found heterogeneity of error variance in Likert-type scales. Use of midpoint responses on political opinion questions may represent 'undecided' or 'never thought about it' as opposed to 'neutral' or 'neither agree nor disagree'.13 This suggests less certainty, and therefore a higher standard error of measurement, in midpoint responses than in non-midpoint responses. A study of education graduate students’ responses on an anxiety scale raised Cronbach’s alpha from 0.70 to 0.94 merely by treating midpoint responses as missing data.14


METHODS


Participants: observed and simulated


Observed set 1


Observed set 2


Simulated set


Analysis


RESULTS


Observed set 1 is more reliable than the simulated set


Low and high ratings are more reliable


Weighting low and high responses improves validity



DISCUSSION



Raters agree with one another more often about the highest- and lowest-quality applicant interviews. This agreement is not a mathematical artefact: the simulated set showed more inter-rater disagreement at extreme ratings than the observed sets. For applicants whom one (or more) raters deemed 'average', raters disagreed more often than chance would predict. Ratings of these moderate applicants were negatively reliable, which suggests that the modal response is being used invalidly, closer to 'I don't know' than to 'average applicant'. If so, the large inter-rater disagreement may reflect differences in confidence rather than substance. Because raters rarely use levels 1 and 2 of the 5-point scale, level 4 is effectively the midpoint of a 3-point scale (3, 4, 5).

Raters tend to agree more about the lowest and highest quality applicant interviews. This agreement is not a mathematical artefact: the simulated set contains much more inter-rater disagreement at extreme ratings than observed sets 1 or 2 (Fig. 2). Raters tend to disagree more than chance about applicants whom 1 rater has deemed average. These moderate ratings are actually 'negatively reliable', suggesting an invalid use of the modal response, perhaps denoting 'I don't know' rather than 'average applicant'. If so, these large inter-rater disagreements reflect differences in confidence rather than substance. As raters rarely use levels 1 and 2, the modal level 4 is effectively the midpoint on a 3-point scale; these results mirror those finding midpoint responses unreliable.13,14


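More importantly, these results suggest that, when component scores are weighted into a final admissions score, it is better to ignore moderate interview ratings altogether. Combining an unreliable measure with reliable ones in a weighted score can lower the reliability of the resulting score. Treating all moderate responses as missing data removes the effect of their noise, while extreme scores (which have predictive validity in these data) should still be allowed to influence applicants' relative standing.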

More importantly, these results suggest that ignoring moderate interview ratings entirely during the admissions process is preferable to using them when computing larger weighted sum scores. Introducing unreliable measures into weighted averages with reliable ones can compromise the reliability of the resulting score.6 Treating all moderate responses as missing data eliminates the impact of the noise in those responses, while allowing extreme scores (which in these data have some predictive validity) to influence applicants’ relative standings.
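One way to read the suggestion above is as a scoring rule that drops moderate ratings before averaging. The Python sketch below is illustrative only: the function name `score_ignoring_moderate` and the choice of level 4 as the single "moderate" level are assumptions made here (the excerpt describes level 4 as the modal rating), not the weighting function the authors estimated from their data.

```python
from statistics import mean

# Assumption: level 4 is treated as the only "moderate" level on the 1-5 scale.
MODERATE_LEVELS = frozenset({4})


def score_ignoring_moderate(ratings, moderate_levels=MODERATE_LEVELS):
    """Average an applicant's ratings after treating moderate ones as missing.

    Returns None when every rating is moderate, i.e. when no
    informative rating is left to average.
    """
    informative = [r for r in ratings if r not in moderate_levels]
    return mean(informative) if informative else None


print(score_ignoring_moderate([5, 4, 4]))  # only the 5 is kept
print(score_ignoring_moderate([2, 4, 5]))  # the 2 and the 5 are kept
print(score_ignoring_moderate([4, 4]))     # nothing informative remains
```

A graded weighting (small but non-zero weights for moderate levels) would be a natural variation; the all-or-nothing rule is used here because it matches the "missing data" treatment described in the text.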


7 Kreiter CD, Gordon JA, Elliott S, Callaway M. Recommendations for assigning weights to component tests to derive an overall course grade. Teach Learn Med 2004;16:133–8.









Med Educ. 2007 Jan;41(1):32-8.

Conditional reliability of admissions interview ratings: extreme ratings are the most informative.

Author information

  • 1Department of Medical Education, University of Michigan, Ann Arbor, Michigan 48109, USA. rbent@umich.edu

Abstract

CONTEXT:

Admissions interviews are unreliable and have poor predictive validity, yet are the sole measures of non-cognitive skills used by most medical school admissions departments. The low reliability may be due in part to variation in conditional reliability across the rating scale.

OBJECTIVES:

To describe an empirically derived estimate of conditional reliability and use it to improve the predictive validity of interview ratings.

METHODS:

A set of medical school interview ratings was compared to a Monte Carlo simulated set to estimate conditional reliability controlling for range restriction, response scale bias and other artefacts. This estimate was used as a weighting function to improve the predictive validity of a second set of interview ratings for predicting non-cognitive measures (USMLE Step II residuals from Step I scores).

RESULTS:

Compared with the simulated set, both observed sets showed more reliability at low and high rating levels than at moderate levels. Raw interview scores did not predict USMLE Step II scores after controlling for Step I performance (additional r2 = 0.001, not significant). Weighting interview ratings by estimated conditional reliability improved predictive validity (additional r2 = 0.121, P < 0.01).

CONCLUSIONS:

Conditional reliability is important for understanding the psychometric properties of subjective rating scales. Weighting these measures during the admissions process would improve admissions decisions.

PMID: 17209890 [PubMed - indexed for MEDLINE]


Enhancing the reliability of the MMI for selecting future health care leaders (Acad Med, 2011)

Enhancing the Reliability of the Multiple Mini-Interview for Selecting Prospective Health Care Leaders

Sebastian Uijtdehaage, PhD, Lawrence “Hy” Doyle, EdD, and Neil Parker, MD





The current crisis in delivering effective and accessible health care in the United States has given rise to dual-degree leadership programs in US undergraduate medical education. Examples: the Program in Medical Education (PRIME) and, at the David Geffen School of Medicine at UCLA, UCLA-PRIME.

The current crisis in providing effective and accessible health care in the United States has spawned a number of dual-degree leadership programs for medical undergraduates.1

  • In 2005, the University of California (UC) initiated an ambitious initiative, the Program in Medical Education (PRIME), to increase enrollment in its medical schools in order to address the needs of California’s disadvantaged populations.2,3
  • In 2007, at the David Geffen School of Medicine at UCLA, UCLA-PRIME was developed as a five-year dual-degree program focused on the development of leadership skills in 18 medical students per year whose career goals would be to improve health care for the disadvantaged and medically underserved.


The selection of future physicians often fails for several reasons.

The selection of future physicians, however, often fails on several accounts.4

  • Cognitive records such as GPA and MCAT scores tend to override noncognitive attributes.
    First, the cognitive record of the applicant, that is, grade point average (GPA) and Medical College Admission Test (MCAT) scores, commonly overrides any consideration of noncognitive attributes in decisions to admit.5
  • The noncognitive qualities sought in applicants are unclear, implicit, and not agreed upon.
    Second, the noncognitive qualities sought in applicants are unclear, remain implicit, and are not necessarily agreed on by stakeholders.
  • Even when they are clear and agreed upon, reliable and valid assessment methods are scarce.
    Third, even if a set of desirable noncognitive qualities for candidates is clear and agreed on, reliable and valid assessment methods are scarce. This is particularly true for characteristics such as altruism, empathy, and leadership.
  • The overall admissions process is rarely transparent or uniformly applied.
    Furthermore, the entire admissions process is rarely transparent or uniformly applied.


Unfortunately, admissions interviews are context-specific: an applicant's performance can vary with the interviewer, the questions asked, and other factors. Kreiter and colleagues reported that, among the variance components of admissions interview scores, the component attributable to applicants was smaller than the applicant-by-occasion interaction component. These and similar findings suggest that the reliability of traditional interviews is inadequate and that their validity is therefore questionable.

Unfortunately, admissions interviews are, like many other assessments, prone to “context specificity.”7 That is, the performance of an applicant during the interview may depend to an important extent on the particular interviewer, the specific questions asked, or other factors irrelevant to the applicant’s suitability. Indeed, Kreiter and colleagues8 studied the variance components of admissions interview scores and found that the variance component attributable to applicants was smaller than the variance component attributable to the applicant-by-occasion interaction. These and similar findings imply that traditional interviews may have inadequate reliability and, thus, questionable validity.


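The initial MMI study by Eva and colleagues was conducted on graduate students, a relatively heterogeneous group compared with medical school applicants. This may have inflated the reliability results. As Eva and colleagues noted in a later article, "the reliability and validity of any assessment strategy is dependent on the context in which the strategy is applied and the content of the assessment"; in other words, the MMI's favorable psychometric properties may not hold in a more homogeneous group.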

The initial MMI study by Eva and colleagues12 was conducted on graduate students, a relatively heterogeneous group compared with a pool of medical school applicants. This may have inflated their reliability results. As Eva and colleagues22 put forth in a subsequent article, “the reliability and validity of any assessment strategy is dependent on the context in which the strategy is applied and the content of the assessment.” In other words, the promising psychometric properties of the MMI may not necessarily hold up for a more homogenous pool of applicants who have been selected for consideration on the basis of a more specific set of attributes.



Method


We first used a Delphi approach to build an inventory of the desirable characteristics of UCLA-PRIME candidates, with a focus on leadership and commitment to disadvantaged populations.

First, we generated an inventory of the desirable characteristics of UCLA-PRIME candidates with a focus on leadership and commitment to disadvantaged populations using a Delphi approach among stakeholders (program administrators, deans, faculty members, and community leaders). We described the details of the Delphi study elsewhere.23 Characteristics that were deemed essential for the PRIME program included

  • commitment to and experience with underserved populations,
  • cultural sensitivity,
  • leadership potential,
  • maturity, and
  • being an effective team member.


Study 1 (2009)



In 2009, we created a panel of 28 interviewers consisting of 18 faculty members, 6 medical students, and 4 community members.


  • On the day of the MMI, we handed out the scenarios and a list of applicants to the interviewers.
  • The interviewers practiced the scenarios with each other before the applicants arrived.
  • We instructed the interviewers to rate the overall performance of the applicant using a seven-point Likert scale (1 = unsatisfactory; 7 = outstanding).
  • Specifically, we asked them to “consider the applicant’s communication skills, strength of the argument, and suitability for the medical profession.”
  • We strongly encouraged the interviewers to use the full rating scale, recognizing that interviewees had been selected from a very large pool of applicants and exceeded all other admissions requirements. Interviewers scored the applicants immediately after each interview.
  • They could adjust their scoring after they completed interviewing the entire cohort.
  • A total score was calculated for each applicant by summing the scores for individual stations. Thus, total scores could range from 12 through 84.
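The scoring arithmetic in the list above can be checked with a short Python sketch. The function name `total_score` and the validation rules are assumptions added for illustration; the only facts taken from the text are the 12 stations, the seven-point scale, and the summing of station scores.

```python
N_STATIONS = 12              # stations in the 2009 circuit
SCALE_MIN, SCALE_MAX = 1, 7  # seven-point Likert scale


def total_score(station_scores):
    """Sum one applicant's station scores after basic validation."""
    if len(station_scores) != N_STATIONS:
        raise ValueError(f"expected {N_STATIONS} station scores")
    if any(not SCALE_MIN <= s <= SCALE_MAX for s in station_scores):
        raise ValueError("station score outside the rating scale")
    return sum(station_scores)


lowest = total_score([SCALE_MIN] * N_STATIONS)
highest = total_score([SCALE_MAX] * N_STATIONS)
print(lowest, highest)  # 12 84
```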




Study 2 (2010)

 

Several changes

  • Venue change: First, we moved the MMI venue to our education building and used adjacent rooms typically used for small-group teaching of medical students. The applicants could familiarize themselves with the layout of the facility before commencing the MMI.
  • Easy station replaced with a harder one: Second, we replaced an easy station (Station 9, “How did you prepare for this interview?”) with a perhaps more challenging task in which applicants were asked to describe student characteristics desirable for the PRIME program. Difficulty level was not assessed formally but was suggested by the fact that interviewers had difficulty differentiating performance of the applicants in the original station. The remaining 11 stations were the same as in 2009.
  • Normative scoring rubric: Third, we asked the interviewers to rate the performance of an applicant relative to the pool of all applicants. Accordingly, we changed the seven-point Likert-scale anchors to a normative scoring rubric (1 = bottom 15%; 4 = middle 50%; 7 = top 15%).
  • Wording changes: Finally, we changed the wording of two stations that previously led to confusion among some applicants. In 2009, one station asked the applicants to discuss “surgeons’ mortality rates.” A few applicants proceeded to discuss the mortality rate of surgeons and not their patients. In 2010, we changed the prompt to “surgeons’ patient mortality rates.” In another station, we replaced the term “SARS epidemic” with the more recent “H1N1 epidemic” but left the crux of the station the same.





Results


Study 1 (2009)


The distribution was skewed toward the maximum score.

The distribution of the total MMI scores, however, was skewed toward the maximum score, suggesting that interviewers had difficulty using the lower range of the rating rubric (Figure 1).

 

 


 

Study 2 (2010)

 

 





 


Discussion


The MMI can be used effectively even with a homogeneous applicant pool.

Our study showed that the MMI can be effectively used to assess a homogeneous group of applicants and that its reliability can be enhanced with minor changes in protocol.


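The reliability of the MMI when first implemented in 2009 was 0.58, lower than reported in other studies. The applicants were a relatively homogeneous group because initial screening on primary and secondary application information selected students with a strong commitment to disadvantaged populations. This homogeneity and the small sample size may have reduced variability.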

Reliability of the first MMI implementation in 2009 was 0.58—lower than reported elsewhere. Our interviewees were a relatively homogenous group of applicants because initial screening considered primary and secondary application information that demonstrated a strong commitment to disadvantaged populations. This homogeneity and the smaller sample size may have resulted in comparatively less variability among the interviewees and could have suppressed the reliability of the overall MMI assessment as estimated by the generalizability coefficient.


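Several changes were made in 2010 that appear to have contributed to the improved reliability. One was replacing an easy station with a harder one: to help discriminate between applicants, stations need an appropriate level of difficulty. Item response theory suggests that items of medium difficulty discriminate best.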

We made a few changes in the 2010 implementation of the MMI process that, all taken together, seemed to have contributed to a substantial improvement of the reliability. One such change was the replacement of a seemingly “easy” station (determined at face value) with a more challenging one. To facilitate discrimination between applicants, the stations must have an optimal level of difficulty. Item response theory suggests that items of median difficulty best discriminate between groups with either high or low magnitude of a latent trait.28


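Indeed, our results show that the easy station simply "added noise to the signal." When reliability was recalculated without the easy station, it increased, contrary to the usual expectation that reliability falls when an assessment point is removed.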

And, indeed, our analysis showed that an easy station simply “added noise to the signal.” When we recalculated the reliability excluding Station 9, the reliability improved; it did not decrease, as one would expect when taking away one assessment point.
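The effect described above, where dropping a station raises reliability, can be reproduced on toy data. The Python sketch below uses Cronbach's alpha as a simpler stand-in for the generalizability coefficient the authors report, and the scores are made up for illustration; neither the data nor the function `cronbach_alpha` comes from the study.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha for a list of applicants, each a list of station scores."""
    k = len(rows[0])
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Made-up scores: three stations that rank the applicants consistently,
# plus a fourth station whose scores do not track the other three.
scores = [
    [3, 3, 3, 6],
    [4, 4, 4, 4],
    [5, 5, 5, 6],
    [6, 6, 6, 4],
]

alpha_all = cronbach_alpha(scores)
alpha_without_last = cronbach_alpha([row[:3] for row in scores])
print(round(alpha_all, 2), round(alpha_without_last, 2))
```

With these toy numbers the coefficient is higher once the fourth station is removed, which is the same direction of change the authors observed when they excluded Station 9.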


In the 2010 study, interviewers appeared better able to use the full rating scale once the anchors were changed to bottom 15%, bottom 30%, middle 50%, and so on. This scoring approach encouraged rank-ordering of applicants. Interviewers could adjust their scores after seeing a cohort of 13 applicants, as was also the case in 2009.

In our 2010 study, the interviewers seemed better able to use the full range of the rating scale after we changed its anchors to “bottom 15%,” “bottom 30%,” “middle 50%,” etc., and asked interviewers to rate an applicant’s performance relative to the pool of all applicants. Thus, we encouraged rank-ordering of candidates with a more normative approach of scoring. Interviewers could adjust their scoring after having seen a cohort of 13 applicants (and this was allowed in the 2009 study as well).



Implementing the MMI is feasible but still a daunting task.

We found that implementing MMIs was feasible but a daunting task nonetheless.


 

It requires extensive human resources and a great deal of preparation (securing space, identifying appropriate interview questions, interviewer training, etc.). However, this cost is offset by the fewer hours each rater needs to assess the applicant pool. The time saving is even greater once the time interviewers spend writing reports and attending committee meetings is taken into account.

Clearly, the MMI requires extensive human resources. In a recent cost-efficiency analysis, Rosenfeld et al.29 found that MMI requires more upfront preparation (securing space, identifying appropriate interview questions, interviewer training, etc.) compared with the traditional interview process. This cost, however, was offset by considerably fewer hours required of each person to assess a pool of applicants. We would note that the time saving is even more considerable if the time spent by interviewers in writing reports and attending committee meetings in which applicants are discussed is taken into account.



Limitations: validity was not assessed.

Our study has several limitations. First, we did not assess the validity of the MMI process even though one could argue that blueprinting the MMI stations based on our Delphi study provided an acceptable level of content validity.


Research in this area is constrained by the widely used but still ill-defined term "noncognitive characteristics." As Norman pointed out, the term "noncognitive skills" refers to characteristics not reflected in MCAT scores or GPA, including tacit knowledge, communication skills, emotional intelligence, and stable personality traits. Admissions committees should explicitly define which of these qualities matter for a medical career and practice, in line with the institution's philosophy and goals.

Research in this area is hampered by the ubiquitous but ill-defined term “noncognitive characteristics.” As Norman32 pointed out, the umbrella term “noncognitive skills” is used to describe those characteristics that MCAT score or GPA do not reflect, such as tacit knowledge, communication skills, emotional intelligence, and stable personality traits. We feel that admissions committees must explicitly define those qualities they deem essential for a successful medical school career and subsequent practice and that are in concordance with the institution’s philosophy and goals.




 




 

 



1 Crites GE, Ebert JR, Schuster RJ. Beyond the dual degree: Development of a five-year program in leadership for medical undergraduates. Acad Med. 2008;83:52–58. http://journals.lww.com/academicmedicine/Fulltext/2008/01000/Beyond_the_Dual_Degree__Development_of_a_Five_Year.8.aspx. Accessed April 28, 2011.



26 Crossley J, Russell J, Jolly B, et al. ‘I’m pickin’ up good regressions’: The governance of generalisability analyses. Med Educ. 2007;41:926–934.



34 Ko M, Edelstein RA, Heslin KC, et al. Impact of the University of California, Los Angeles/Charles R. Drew University Medical Education Program on medical students’ intentions to practice in underserved areas. Acad Med. 2005;80:803–808. http://journals.lww.com/academicmedicine/Fulltext/2005/09000/Impact_of_the_University_of_California,_Los.4.aspx. Accessed April 28, 2011.








Acad Med. 2011 Aug;86(8):1032-9. doi: 10.1097/ACM.0b013e3182223ab7.

Enhancing the reliability of the multiple mini-interview for selecting prospective health care leaders.

Author information

  • 1Center for Educational Development and Research, David Geffen School of Medicine, University of California, Los Angeles, USA. bas@mednet.ucla.edu

Abstract

PURPOSE:

The David Geffen School of Medicine at UCLA Program in Medical Education (UCLA-PRIME) used a 12-station multiple mini-interview (MMI) circuit to assess applicants. The authors sought to determine the reliability of the MMI, potential bias in scores, and the degree of acceptance by interviewers and applicants.

METHOD:

In 2009, 28 interviewers interviewed a cohort of 76 applicants. An anonymous survey assessed interviewers' and applicants' satisfaction with the MMI process and perceived bias. Psychometric properties were determined with generalizability and decision theory. The process was repeated the following year with a new cohort of 78 applicants and minor modifications aimed at improving reliability.

RESULTS:

The MMI format was well received by both applicants and interviewers. No bias based on gender or disadvantaged status was found. The preliminary reliability of the MMI in 2009 was 0.58 (lower than reported in previous studies) but improved in 2010 to 0.71 after an easy station was replaced with a more challenging one and a new scoring rubric was introduced.

CONCLUSIONS:

This interview technique proved to be reliable and was seen as transparent, uniform, and fair. The predictive validity of this process remains to be determined.

PMID: 21694560 [PubMed - indexed for MEDLINE]


Reflecting the values of community, faculty, and students in medical school admissions tools (Teach Learn Med, 2005)

Reflecting the Relative Values of Community, Faculty, and Students in the Admissions Tools of Medical School

Harold I. Reiter, Kevin W. Eva

McMaster University Department of Clinical Epidemiology and Biostatistics Hamilton, Ontario, Canada






As the second millennium drew to a close, the United States and Canada not only defined the attributes required of physicians but also strengthened curricular and evaluative processes to reinforce those attributes at the postgraduate and practice levels. Examples: the six competencies of the ACGME; “Educating Future Physicians of Ontario” in Canada; and the seven competence domains of the Core Committee of the Institute for International Medical Education.

In the concluding years of the second millennium, efforts were under way in both the United States and Canada not only to define the attributes desirable in our physicians but also to foster curricular and evaluative processes to enhance those attributes at the postgraduate and practice levels.

  • In the United States, efforts by the American Board of Medical Specialties and by the Accreditation Council for Graduate Medical Education produced a document describing the six competencies expected of physicians.1
  • A parallel movement in Canada, arising from the project “Educating Future Physicians of Ontario,”2 led to the creation of CanMEDS 2000 and its seven roles of the physician.3
  • From a global perspective, the Core Committee of the Institute for International Medical Education has grouped the essentials that physicians must have under seven competence domains.4


The emphasis on personal rather than cognitive qualities is no coincidence. Tools for assessing cognitive ability have traditionally been fairly successful, whereas tools for assessing personal qualities, with very rare exceptions, lack reliability and validity.

The emphasis on personal, as opposed to cognitive, qualities in that review is no accident. As clearly demonstrated in an earlier, separate literature review6 of admissions tools to health professional schools, traditional tools for the evaluation of cognitive qualities have largely succeeded, although those evaluating personal qualities, with rare exception, have failed to demonstrate reliability and validity.


The MMI was a significant step forward in this respect.

A significant step in the development of those tools was taken with the advent of the Multiple Mini-Interview (MMI).7


Methods


A list of seven personal characteristics relevant to undergraduate admissions was created. The definitions were not comprehensive but could serve as a guide.

Adapting the roles, competencies, and competence domains outlined in Table 1 in conjunction with the literature on admissions and local discussion, we created a list of seven personal characteristics that could be conceived to be relevant in an undergraduate admissions context. These characteristics, along with the definitions provided to participants, are illustrated in Table 2. Participants were told that these definitions were not comprehensive but that they should serve as a guide.


Following the paired comparison approach, a 21-item questionnaire was created that compares the seven characteristics with one another. The instruction was as follows.

Following the paired comparison approach,12 a questionnaire was created by listing all pairs of these 7 characteristics (e.g., collaborative versus ethical) and randomizing the order in which the items were presented. Participants were given the following instruction.


Please choose the characteristic you consider more important.

For each pair of characteristics outlined below, please circle the characteristic that you consider more important in determining who should be admitted to the Undergraduate MD Program at McMaster University. You must choose one characteristic from each pair, or your responses will not be analyzed. Definitions for each characteristic are provided on the preceding page.


The task took about 10 minutes. z scores were calculated.

Participants responded to 21 pairings; the task required approximately 10 min to complete. From these data, the probability of each item being selected was determined and converted to z scores to determine the relative importance of each of the seven characteristics on an interval level scale.

  • Negative z scores do not indicate that the characteristic is viewed as unimportant; undoubtedly each of the items included are valued to some extent.
  • Rather, negative z scores simply indicate that the characteristic is less important relative to the other options provided.
  • For example, imagine only two items, A and B, were included in the study, both of which are considered important characteristics. If item A was selected as more important than item B 60% of the time, the probability of selecting item A (0.6) would convert to a z score of 0.26 for item A and the probability of selecting item B (0.4) would convert to a z score of –0.26 for item B (see Streiner & Norman13 for an accessible description of the analyses).
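The pairing count and the probability-to-z conversion described above can be sketched in Python. This is a minimal illustration, assuming the conversion is the inverse of the standard normal cumulative distribution; the placeholder labels and the helper `z_from_probability` are hypothetical, and the study's full scaling procedure (see Streiner & Norman) may involve further steps. Note that the standard normal quantile for 0.6 is about 0.25, close to the 0.26 quoted in the excerpt.

```python
from itertools import combinations
from statistics import NormalDist

# Placeholder labels; the study's seven characteristics are listed in its Table 2.
characteristics = [f"characteristic_{i}" for i in range(1, 8)]

# Every unordered pair appears once on the questionnaire.
pairs = list(combinations(characteristics, 2))
print(len(pairs))  # 21


def z_from_probability(p):
    """Convert a selection probability to a z score (inverse standard normal CDF)."""
    return NormalDist().inv_cdf(p)


z_a = z_from_probability(0.6)  # item A chosen 60% of the time
z_b = z_from_probability(0.4)  # item B chosen 40% of the time
print(round(z_a, 2), round(z_b, 2))
```

The two z scores are equal in magnitude and opposite in sign, which is why a negative value signals only relative, not absolute, unimportance.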





Results


However the groups were divided, the z score results were very similar.

The resultant z score comparisons were remarkably uniform regardless of whether the group under consideration was from community, faculty, or the student body. Similarly, homogeneity was observed on comparing those with more or less intimate administrative level of involvement.

 


 

Discussion



Mistakes at the admissions stage have dramatic consequences. Beyond the potential harm to society, undergraduate medical education costs about $90,000 per student per year. Given the reasonable societal expectation of uniformly successful decisions and the high cost of education, spending further time, money, and effort to correct errors of admissions judgment is unacceptable.

The cost of a misstep in admissions is dramatic. Aside from the potential damage unleashed on society, the financial cost of undergraduate medical education approximates $90,000 (US) annually per student.14,15 Given the reasonable expectation of uniformly successful decision making and the high cost of education, any further significant expenditure of time, money, and effort to remediate errors of judgment by the admissions office is unacceptable.



Over the past 50 years, the expectation that community, faculty, and student perspectives would differ significantly has brought a major shift in the composition of admissions committees. Between 1957 and 1971, the proportion of committees that included students rose from nearly 0% to 56%, and it reached 74% by 1982. Community representation grew more slowly but did arrive: as of 1971 only 3% of committees included community members, and by the 1982 survey 27% of committee members came from non-medical, non-professional backgrounds.

Over the last 50 years, the expectation of significant differences in perspective between community, faculty, and students has promulgated a seismic shift in representation on admissions committees. Between 1957 and 1971, the presence of students on admissions committees of schools affiliated with the Association of American Medical Colleges swung sharply upward, from near nonexistence to 56% (41/73) of committees responding to the survey indicating a student presence.16 This presence continued to rise to 74% (64/86) by the time a similar survey was conducted in 1982.17 The rise of community influence was more delayed, but nevertheless forthcoming. Even by the time of the 1971 survey, only 3% (2/73) of committees reported a community stakeholder presence, although this appears to have risen by the 1982 survey (27% of responding committee memberships arose from non medical–nonprofessional backgrounds in that survey).


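The differences in perspective that would justify these changes are less clear. One study comparing rank orders of the relative importance of specific domains found much in common between community members and an admissions committee.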

The existence of differences in perspective to warrant these shifts is less clear. A comparison of rank order of the relative importance of particular defined domains was conducted between community members versus members of the Admissions Committee of the University of Massachusetts Medical School (UMMS).18 The study reported that the “results of the rank-ordering of criteria indicate commonalities in outlook and approach between the [community member] conferees and the UMMS Admissions Committee despite the fact that the ranking of the characteristics was done independently” (p. 640). The methodology used by UMMS was, in contrast to the paired comparison analysis described here, far more resource intensive and included a much smaller sample size of stakeholders (n = 20).


Based on these results, MMI stations should be designed to emphasize ethical decision-making and communication.

These results can now be used to guide the development of admissions protocols, particularly the MMI, ensuring that the stations are designed to preferentially emphasize ethical decision-making and communication skills.

 






Teach Learn Med. 2005 Winter;17(1):4-8.

Reflecting the relative values of community, faculty, and students in the admissions tools of medical school.

Author information

  • 1McMaster University, Department of Clinical Epidemiology and Biostatistics, Hamilton, Ontario L8N 325, Canada.

Abstract

BACKGROUND:

In defining the characteristics of medical students that society and the medical profession find desirable, little effort has been spent assessing the relative value of the dozens of characteristics that have been identified. Furthermore, many institutions go to great lengths to ensure equal representation across stakeholder groups in an effort to maximize the heterogeneity of the pool of students accepted to study medicine; however, the extent to which different stakeholders value different characteristics has yet to be determined.

PURPOSE:

This study was an attempt to assess the relative value of the characteristics of medical students that society and the medical profession find desirable.

METHODS:

Using documents created internationally to identify the core competencies of medical personnel, a series of 7 characteristics were generated for inclusion in a study that adopted the paired comparison technique. Of 347 surveyed, 292 respondents indicated the rank ordering they would assign to each characteristic by circling the more important characteristic in all possible pairings.

RESULTS:

Overwhelmingly, "ethical" was deemed to be the most important characteristic on which selection tools should be based. Surprisingly, the pattern of responses was highly consistent regardless of stakeholder group and degree of affiliation with the undergraduate medical program.

CONCLUSIONS:

The generalizable features of this study not only include the empirical findings but also demonstrate useful survey protocol that can be adapted by any admission committee to guide the generation of an institution-specific admissions blueprint. A novel protocol that provides the necessary flexibility is discussed.

PMID: 15691807 [PubMed - indexed for MEDLINE]


How effective is each selection method? A systematic review (Med Educ, 2016)

How effective are selection methods in medical education? A systematic review

Fiona Patterson,1 Alec Knight,2 Jon Dowell,3 Sandra Nicholson,4 Fran Cousans2 & Jennifer Cleland5




INTRODUCTION


In practice, selection in medical education is often driven by political considerations and key stakeholders. Such influences can produce resistance to any move away from 'traditional' measures, even when there is compelling evidence for change, and make evidence-based selection difficult. However, Kreiter and Axelson's non-systematic review noted that, over the last 25 years, effective educational interventions have produced learning gains with effect sizes below 0.20, whereas evidence-based selection is far more powerful: well-designed selection tools yield gains of more than 1 SD.

Indeed, selection for medical education internationally is frequently driven by political considerations and the preferences of key stakeholders.1 Such influences may result in resistance against any move away from ‘traditional’ measures despite compelling evidence to do so, often to the detriment of evidence-based selection practices. However, Kreiter and Axelson’s2 non-systematic review of medical admissions research and practice in the last 25 years noted that effective educational interventions typically produce only small gains in learning (effect sizes generally below 0.20), whereas evidence-based selection is comparatively far more powerful, with well-designed selection tools achieving performance gains exceeding one standard deviation.


Prior academic attainment has generally been, and remains, the primary basis for selection, and it is usually assessed at the initial screening stage.

Prior academic attainment has generally been, and continues to be, the primary basis for selection and is usually assessed at an initial screening stage.3


However, there are several concerns about this approach. First, previous reviews have found that academic performance is a good but not perfect predictor of performance, explaining only about 23% of the variance in UME performance and 6% in PGME performance.

However, there are several concerns about this approach. Firstly, previous reviews have concluded that academic performance is a good, but not perfect, predictor of performance, accounting for approximately 23% of the variance in performance in undergraduate medical training and 6% in postgraduate performance.4
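To put the variance figures in more familiar terms, the short sketch below converts "proportion of variance explained" into the size of the corresponding correlation. It assumes a single predictor, so that variance explained equals the squared correlation; the review does not state how the 23% and 6% figures were obtained, and the function name is illustrative.

```python
import math


def implied_correlation(variance_explained):
    """Magnitude of r implied by a proportion of variance explained.

    Assumes one predictor, so that variance explained equals r squared.
    The sign of r cannot be recovered from the variance alone.
    """
    if not 0.0 <= variance_explained <= 1.0:
        raise ValueError("variance_explained must lie between 0 and 1")
    return math.sqrt(variance_explained)


undergraduate_r = implied_correlation(0.23)  # undergraduate training
postgraduate_r = implied_correlation(0.06)   # postgraduate performance
print(round(undergraduate_r, 2), round(postgraduate_r, 2))  # prints 0.48 0.24
```

Read this way, prior attainment correlates at roughly 0.48 with undergraduate performance and roughly 0.24 with postgraduate performance, which leaves most of the variance unexplained.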


Second, although academic achievement has consistently been shown to be a good predictor of performance in medical school, historically much less research has addressed methods for reliably assessing important non-academic attributes, interests and motivational qualities.

Secondly, although academic achievement is consistently shown to be a good predictor of performance in medical school,5 historically substantially less attention has been paid to researching methods that reliably evaluate important non-academic personal attributes, interests and motivational qualities.


Third, longitudinal cohort studies are lacking.

Thirdly, there has been a dearth of longitudinal cohort studies examining the predictors of success after qualification.


The fairness of medical school admissions and of selection for specialty training has attracted considerable public interest and criticism.

Medical school admissions processes and selection for specialty training attract strong public interest and often criticism regarding fairness.7–9






METHODS


Data sources


We conducted a formal literature search using the criteria specified in Table S1 (online).


Study selection and inclusion and exclusion criteria


Assessment of study type, quality and selection method


 

The research questions and the evidence quality categories are given in Table 1. Muir and Grey's ‘salience’ and ‘safety’ categories were removed.

The research questions and evidence quality categories are displayed in Table 1. In relation to the different research questions under investigation, we removed Muir and Grey’s (1996)10 ‘salience’ and ‘safety’ categories as they were not relevant to our context.


Each study was assessed on the following.

Therefore, we examined each study in relation to four research questions concerning, respectively:

  • effectiveness;
  • procedural issues;
  • acceptability, and
  • cost-effectiveness.

 

This was meant to address the implicit assumption that predictive validity is the most important measure of a selection method's effectiveness. The success of a selection tool also depends on other factors, such as its accessibility, ease of implementation and how acceptable it is to key stakeholders.

This approach was intended to address the assumption implicit in much previous research that predictive validity is the most important measure of the effectiveness of a selection method; we acknowledge that the success of a selection tool may be determined by a range of additional factors, including its accessibility, ease of implementation and the extent to which it is viewed as acceptable by key stakeholders.



RESULTS


For a full list and description of all papers identified in the review, refer to Tables S2 and S3 (online).


Type of evidence


Effectiveness


Procedural issues


Acceptability


Cost-effectiveness


 

Aptitude tests



Summary


The evidence on the usefulness of aptitude tests in student selection is mixed and depends heavily on which aptitude test was studied, so it is difficult to draw general conclusions about aptitude tests. For example, some studies support the predictive validity of aptitude tests, whereas others report that particular tests lack it. The evidence on fairness is similarly mixed: some studies find that certain groups score higher, and others do not. For example, the evidence on equity across groups of medical school applicants (by sex, age, language status and socio-economic status) varies. Other evidence suggests that aptitude tests are fair regardless of applicant background, are little affected by coaching and are stable over time, with the UMAT as a possible exception. It is therefore important to evaluate each aptitude test individually.

Mixed evidence exists among researchers on the usefulness of aptitude tests in medical student selection and findings largely depend on the specific aptitude test studied; hence commenting on the generality of findings is problematic. For example, some studies support the predictive validity of aptitude tests, but other research suggests that some specific aptitude tests lack predictive validity. Mixed evidence also exists on the fairness of aptitude tests, with some research suggesting that certain groups score more highly on aptitude tests than other groups, whereas other research suggests that this is not the case. For example, there is varied evidence on the equity of aptitude tests for different groups of medical school applicants (e.g. according to sex, age, language status and socio-economic status).11,15,20,24,46–50 Other evidence suggests that aptitude tests are equitable with respect to candidate background, are affected relatively little by candidate coaching, and remain stable over time,20,24,44,50–52 with the possible exception of the UMAT.30 It is therefore important to evaluate each aptitude test in its own right in order to draw conclusions on the quality of the tool.





Academic records


Summary


Researchers agree that academic records provide useful information for medical school selection. Studies generally find that prior academic attainment is predictive: the stronger the academic record, the more likely the student is to succeed in medical school. However, there is concern that prior attainment is losing its discriminatory power as more applicants present with top grades. Long-term follow-up evidence that applicants with higher grades become better doctors is also lacking. Furthermore, Milburn notes that over-reliance on A-level results in the UK may distort the social intake of universities, and that selecting medical students on academic attainment alone may neglect important non-academic factors.

There is a high level of consensus among researchers that academic records provide useful information to inform medical student selection. Research generally suggests that prior academic attainment has predictive power, meaning that those with stronger academic records are more likely to succeed in medical school. However, there is concern that the discriminatory power of prior academic attainment may be diminishing as increasing numbers of medical school applicants have top grades. There is also a lack of long-term follow-up data to provide evidence that medical school applicants with higher grades go on to become better physicians. Moreover, Milburn8 notes that over-reliance on A-level results in the UK may create a distorted social intake to universities, and recruiting medical students solely on the basis of academic attainment may neglect important non-academic factors required for success in medical school and beyond.


Personal statements


Effectiveness


The evidence on predictive validity is mixed. Some evidence supports the predictive validity of personal statements for dropout, performance in internal medicine and clinical aspects of training, but other studies report that personal statements are less reliable than other commonly used selection tools and do not predict success in medical school. Some authors suggest, however, that personal statements make applicants aware of the characteristics of the medical degree they are applying for and so help them make a more informed decision.

Evidence on the predictive validity of personal statements is varied. Although some evidence has been found for the predictive validity of personal statements for medical school dropout rates,65 performance on internal medicine14 and clinical aspects of training,66 several others have reported that personal statements have low reliability compared with other commonly used selection instruments70 and are not predictive of subsequent success at medical school.2,71–73 Some authors suggest, however, that personal statements may have some value for making applicants aware of the characteristics of the medical degree they are applying to, which may help them to make a more informed decision to apply.73


 

Procedural issues


Procedural factors affect the reliability and validity of personal statements. Applicants use the personal statement to present themselves in ways they believe will appeal to admissions committees, which may not accurately reflect who they are. The information captured is therefore partial and subjective. Factors affecting effectiveness include how early the statement is submitted relative to the deadline, the marking method, and on-site versus off-site completion. Finally, one study noted that UK medical schools use personal statements in different ways: some use them formally in selection decisions, whereas others ignore them because they may unfairly bias selection.

Evidence suggests that a number of procedural factors affect the reliability and validity of personal statements. Medical school candidates may use personal statements to present themselves in ways they believe are attractive to admission committees, which may not necessarily be accurate.74,75 Hence, the information captured by personal statements is likely to be both partial and subjective in nature. Factors that may affect the effectiveness of the selection method include the earliness of submission in relation to a deadline,76 marking method, and on-site versus off-site completion.77 Finally, one article highlighted the fact that personal statements are used differentially by different UK medical schools.78 Some medical schools use the information formally in making selection decisions, whereas others ignore this information out of concern that it may unfairly bias selection decisions.


Acceptability


Research has identified possible sources of data contamination in personal statements, including candidates' prior expectations, the time spent completing the submission, and input from third parties. Other studies have commented on political validity and stakeholder satisfaction. Stevens et al. found that about 60% of students considered personal statements suitable for medical school selection, whereas Elam et al. reported that the contents of application forms are very unlikely to have any significant influence on admissions committee decisions. White et al. argued that applicants present themselves as they believe candidates are expected to appear rather than as a genuine reflection of themselves. Similarly, Kumwenda et al. found that most applicants believed that other applicants stretched the truth, and that a proportion believed statements were unlikely to be checked for accuracy.

Research has highlighted potential sources of data contamination in personal statements, including candidates’ prior expectations, the length of time spent completing submissions, and input to submissions from third parties. Other research14,74 has commented on the political validity and stakeholder satisfaction of personal statements in medical student selection. Whereas Stevens et al.45 found that approximately 60% of students thought that personal statements were suitable to use for admission to medical school, Elam et al.13 reported that the contents of medical school candidates’ application forms are very unlikely to exert any significant influence on decisions made by admissions committees. White et al.74 also argued that medical school candidates present themselves in ways that they believe are expected of candidates, rather than in ways that are genuine reflections of themselves. Likewise, Kumwenda et al.79 found that most medical school applicants believed that others stretched the truth in their personal statements, and a proportion of applicants believed it was unlikely that statements were checked for accuracy.


 

Summary



The evidence on the effectiveness of personal statements is mixed at best: very little supports their predictive validity, and many studies report that they lack reliability and validity. Personal statements are nevertheless widely used in medical school selection worldwide, even though their effectiveness as a selection tool is influenced by many extraneous factors. Their content may also unfairly cloud the judgement of those making selection decisions.

Evidence on the effectiveness of personal statements in medical student selection is mixed at best. Little evidence exists to support the predictive validity of personal statements, and a large volume of research evidence suggests that the selection method lacks reliability and validity. Personal statements remain widely used in medical school selection worldwide, despite concerns that the effectiveness of the selection method is influenced by numerous extraneous factors. The content of personal statements may also unfairly cloud the judgement of individuals making selection decisions.



References


Summary


There is ample evidence that references are neither reliable nor valid. Even so, they remain a common tool in medical school selection. Including references in medical school selection is therefore unhelpful, and the resources would be better spent on other selection tools.

There is a good level of consensus that references are neither a reliable nor a valid tool for selecting candidates for medical school. Despite these findings, references remain a common feature of medical school selection worldwide. To this extent, the inclusion of references in medical school admission processes may be unhelpful and may use valuable resources that could be directed more usefully to selection methods with evidentially based reliability and validity.




SJT

Situational judgement tests


Summary


There is ample evidence that SJTs, when well constructed, are reliable, valid, cost-effective and acceptable. SJTs are complex to develop, and there are many options for item format, instructions and scoring. When these options are calibrated appropriately, the evidence points to the strength of SJTs for assessing non-academic attributes in medical student selection.

There is a good level of consensus among researchers that SJTs, when properly constructed, can form a reliable, valid, cost-effective and acceptable element of medical school selection systems. SJTs are complex to develop and there is a wide range of options available in relation to item formats, instructions and scoring. When these options are calibrated appropriately, research evidence points to the strength of SJTs in medical student selection for assessing non-academic attributes.




Personality and emotional intelligence


Summary


Broadly, researchers agree that some personality domains are significantly associated, positively or negatively, with performance in medical school. The relationship is often complex, however: conscientiousness, for example, is positively associated with knowledge-based assessment but negatively associated with some clinical assessments. This suggests that the criterion constructs need closer attention when personality-based selection tools are reviewed. Personality assessment can be cost-effective and can be combined with other methods, such as interviews, in which responses can be probed further. Selectors should be aware that evidence for the long-term predictive validity of personality assessment beyond medical school is lacking, and that personality assessment may narrow the diversity of students entering medical school. Research on the predictive validity of EI is sparse and at a very early stage.

Taken broadly, there is a relatively high level of consensus among researchers that some domains or traits of personality are significantly positively or negatively associated with aspects of performance in medical school. However, the associations between personality domains and medical school performance are often complex, as is demonstrated by evidence that conscientiousness may be positively associated with knowledge-based assessment, but negatively associated with some clinical aspects of medical school assessment. This suggests that closer attention to the criterion constructs should also be considered when reviewing personality-based selection tools. Personality assessment can be cost-effective and may be used in combination with an interview method in which applicant responses can be probed further. Recruiters should be aware that there is a relative dearth of evidence regarding the long-term predictive validity of personality assessment beyond medical school, and that there has been some concern that personality assessment may narrow the diversity of types of individuals entering medical education and training. Research on the predictive validity of EI assessment was sparse and at a very early stage of development.



Interviews and multiple mini-interviews


Type of evidence



Effectiveness


 

Despite some evidence to the contrary, the prevailing view from the evidence as a whole is that the traditional interview lacks predictive validity and is not a robust method of student selection. Edwards et al. found that poorer interview performance was associated with higher medical school grades. The mixed evidence on the effectiveness of interviews may reflect the diversity of interview methods, which range from relatively unstructured interviews to highly structured panel interviews. Eva and Macala found no difference in the reliability of interviewer ratings between unstructured and structured MMI stations, although behavioural indicator stations were more reliable than other station types.

Despite some evidence to the contrary,14,16,33,123–130 the balance of evidence suggests that generally, the traditional interview is not a robust method of selecting medical students, and lacks predictive validity.4,9,28,80,131–137 Edwards et al.17 found that poorer interview performance was associated with higher medical school grade point average (GPA). The mixed findings on the effectiveness of interviews may reflect substantial differences in interview methods, which range from relatively unstructured individual interviews to highly structured panel interviews. However, Eva and Macala138 found no difference between the reliability of interviewer ratings in unstructured and structured multiple mini-interview (MMI) stations, although behavioural indicator stations differentiated between candidates more reliably than other station types.




Research on MMIs is more consistent than research on traditional interviews; for example, their psychometric properties are reported to be adequate. Uijtdehaage and Parker showed that the reliability of an MMI could be improved by replacing an easy station with a more difficult one and by using relative rather than absolute ratings of candidates. However, Hissbach et al. found that rater bias could affect applicant scores more than systematic differences in candidate performance. Although some attributes, such as communication skills, are commonly assessed in MMIs, it is unclear what the different interview approaches actually measure. The construct validity of MMIs is still under investigation, although the relationship between MMIs and academic measures is small or absent regardless of design. Moreover, tightly standardised face-to-face interviews may not be comparable with scenario-based MMI stations that use standardised actors, and the dimensionality of MMI stations (whether an MMI can measure more than one construct per station) remains debated.

The findings from research on MMIs tend to be more directionally consistent than those from research on traditional interviews: for example, the psychometric properties of MMIs are usually reported to be adequate.44,139–146 Uijtdehaage and Parker146 found that the reliability of an MMI was improved by replacing an easy station with a more challenging one, and using relative, rather than absolute, ratings of candidate performance. However, Hissbach et al.147 found that rater bias had a greater effect on applicant scores than systematic differences in candidate performance. There is little clarity about what is being measured within the different approaches described, although some attributes, such as communication skills, are commonly purported to be assessed by MMIs. Construct validity evidence for MMIs remains exploratory and largely inconclusive, although irrespective of design differences, the relationships between MMIs and academic measures are small to absent.145 Moreover, tightly standardised face-to-face interviews may not be comparable with scenario-based MMI stations utilising standardised role actors, and the dimensionality of MMI stations (i.e. whether MMIs can measure more than one construct per station/interview question) has been debated in the literature.145



Procedural issues


Interviews vary between schools in length, panel composition, structure, content and scoring method. This variety may explain the mixed findings on reliability and validity. Other evidence indicates that candidate performance can be substantially affected by coaching. Although many authors report implementing MMIs successfully, using interviews brings logistical difficulties relating to the range and type of questions, as well as interviewer subjectivity. Uijtdehaage and Parker summarised that ‘implementing an MMI was feasible but a daunting task’.

Schools differ significantly in terms of the length, panel composition, structure, content and scoring methods for interviews. The differential usage of the interview method in medical student selection may underlie the mixed findings on both the reliability and validity of interviews reported above. Other research evidence suggests that candidate performance may be significantly affected by coaching.30 Using interviews in a selection process also presents logistical difficulties relating to the range and type of questions155 and interviewer subjectivity,51,143,156,157 although numerous authors report on the successful implementation of MMIs into their medical school admission processes.44,146 Uijtdehaage and Parker summarised that ‘implementing an MMI was feasible but a daunting task’.146




Acceptability


Most studies report that applicants and interviewers view the interview process positively, and there is evidence that MMIs and more structured interviews are preferred to less structured ones. Some evidence suggests that applicants prefer medical schools that conduct interviews. Campagna-Vaillancourt et al. found that most applicants and assessors considered the MMI an appropriate way to assess a range of competencies, regarded it as fair and preferred it to the traditional interview. Introducing an MMI in stages may improve acceptance. Standardised interviews can also be used in PGME selection and are acceptable to both international medical graduates (IMGs) and interviewers.

Most research reports that applicants and interviewers tend to view the interviewing process positively,44,45,60,146 and there is tentative evidence that MMIs and more structured interviews are preferred over less structured methods.138,158 Some evidence suggests that aspiring medical students may prefer the schools that conduct interviews.159 Campagna-Vaillancourt et al.144 found that the majority of applicants and assessors perceived an MMI to be appropriate to assess a range of competencies and considered it to be a fair process, as well as being preferable to a traditional interview. The staged introduction of an MMI into a selection process may foster institutional acceptance of the method.160 Standardised interviews can also be adapted for use in postgraduate medical selection to measure characteristics that are considered important and acceptable to both international medical graduates and interviewers.139,141,161


Cost-effectiveness


The cost-effectiveness of MMIs is generally reported to be good, although interviews cost more than machine-marked tests and MMIs cost more than traditional interviews because of station development and actor payments. Value for money may be improved by examining the number of stations and reducing it where reliability is not affected. However, some studies suggest that increasing the number of stations or questions improves reliability more than increasing the number of interviewers. Roberts et al. estimated that, to reach a Cronbach's alpha of 0.80 for high-stakes assessment, an MMI needs 14 stations if each has a single interviewer; this falls to between 7 and 12 stations if each station has two interviewers. Dodson et al. found that shortening MMI stations from 8 to 5 minutes saves resources with minimal effect on applicant ranking and test reliability. Knorr and Hissbach concluded that no general recommendation on the minimum number of MMI stations can be made.

The cost-effectiveness of MMIs is generally reported to be good,154 although comparatively interviews are significantly more costly than machine-marked tests, and MMIs are more expensive than traditional interviews because they incur increased costs for station development and actor payments.145,146 Value for money may be improved by examining the number of stations in an MMI, and reducing the number of stations if reliability is not affected. However, some research suggests that increasing the number of questions or stations in MMIs increases reliability more than increasing the number of interviewers.143,145,162 Indeed, Roberts and colleagues estimated that to reach a Cronbach’s coefficient alpha of 0.80 for high-stakes assessment, MMIs must include 14 stations if each is manned by a single interviewer. This number could be reduced to between seven and 12 stations if each station is manned by two interviewers.143 Alternatively, Dodson et al.163 found that reducing the duration of MMI stations from 8 to 5 minutes conserves resources with minimal effect on applicant ranking and test reliability. Knorr and Hissbach145 concluded in their systematic review that no general recommendation for the minimum number of MMI stations can be derived from the literature at present.
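The trade-off between the number of stations and reliability can be illustrated with the classical Spearman-Brown formula. This is a sketch under assumptions, not a reconstruction of the cited estimate: the review does not say how Roberts and colleagues derived their figure, the formula treats stations as interchangeable parallel items, and the function names are illustrative.

```python
import math


def spearman_brown(single_reliability, k):
    """Reliability of a composite of k parallel stations."""
    return k * single_reliability / (1 + (k - 1) * single_reliability)


def single_station_reliability(composite_reliability, k):
    """Invert Spearman-Brown: per-station reliability implied by a k-station total."""
    return composite_reliability / (k - (k - 1) * composite_reliability)


def stations_needed(single_reliability, target):
    """Smallest whole number of stations that reaches the target reliability."""
    k = target * (1 - single_reliability) / (single_reliability * (1 - target))
    # Small tolerance so floating-point noise does not push an exact result up by one.
    return math.ceil(k - 1e-9)


# If 14 single-interviewer stations give 0.80, each station contributes
# a reliability of about 0.22 under this model.
r1 = single_station_reliability(0.80, 14)
print(round(r1, 3))

# Halving the circuit to 7 such stations would lower the composite reliability.
print(round(spearman_brown(r1, 7), 2))
```

Under these assumptions a single station is individually quite unreliable, which is consistent with the finding that adding stations tends to raise reliability more than adding interviewers.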


Tiller et al. showed that an MMI can be conducted via Skype to reduce cost and time.

Tiller et al.164 found that cost and time savings for candidates were substantial when an MMI was conducted online via Skype rather than in person, although further research is required regarding the impact on fidelity of the lack of a face-to-face encounter.



Summary


Interviews are among the most widely used selection tools. The evidence indicates that traditional interviews lack the reliability and validity needed for high-stakes decisions, and that MMIs offer better reliability and validity. More theory-driven research is needed on the predictive and construct validity of MMIs, particularly on which constructs can be measured accurately. More evidence is needed on the appropriateness of the criteria assessed in interviews, informed by validation studies. Cost-effectiveness should be evaluated, and alternative approaches to scoring and alternative uses of scores (such as minimum thresholds) need further study. MMIs have spread rapidly in recent years as evidence of their reliability has accumulated, but construct validity and dimensionality remain problematic: schools need to understand better what they intend to measure and what they actually measure. The impact of the MMI on candidates (fairness, performance, coaching effects and so on) is a practical concern that should inform design decisions such as question rotation.

Interviews are among the most widely used tools in selection for medical school admission. Evidence suggests that traditional interviews lack the reliability and validity that would be expected of a selection instrument in a high-stakes selection setting. Evidence also suggests that MMIs offer improved reliability and validity over traditional interview approaches. Further theory-driven research is warranted, however, in relation to the predictive and construct validity of the MMI method, particularly with respect to the constructs that can be assessed accurately (e.g. communication, critical thinking, empathy, etc.). More evidence is required regarding the appropriateness of criteria that can be assessed in interviews and should be informed by validation studies. In addition, the cost-efficiency and utility of MMIs should be evaluated, along with alternative approaches to scoring and alternative uses of scores (including any minimum threshold criteria). The use of MMIs has spread rapidly in recent years as they can be designed as a reliable selection method. However, issues surrounding the construct validity and dimensionality of MMIs remain problematic: it is critically important that schools better understand what they are seeking to measure, and actually are measuring, with this approach. The impact of the MMI on candidates (in terms of fairness, performance, coaching effects, etc.) is an outstanding practical concern that should influence design decisions such as question rotation.






Selection centres


Summary


Overall, research on the usefulness of SCs is sparse. Evidence for the predictive validity of SCs is stronger in postgraduate selection, and more research is needed.

Overall, research on the utility of SCs for medical student selection was relatively sparse. Evidence on the predictive validity of SCs for postgraduate selection is stronger, although further evidence is required to build a case for their predictive validity in medical school selection.






DISCUSSION


Summary of key findings


There is too much reliance on cross-sectional designs, and the focus is on reliability rather than validity, which risks results that are ‘reliably wrong’. Although some studies have addressed predictive validity, few have addressed construct validity (what is being measured) or cost-effectiveness. Although the review covers 18 years of research, long-term follow-up studies are scarce, though their number has grown over the last 2 years.

There is an over-reliance on cross-sectional study designs and a general focus on reliability estimates as indicators of quality rather than aspects of validity (a method may have high reliability but be ‘reliably wrong’25). Although some studies have addressed issues relating to predictive validity, very little research has explored construct validity issues (i.e. what is being measured) and the relative cost-effectiveness of selection methods. During the 18 years covered by this review, there have been remarkably few long-term evaluation studies; however, we note that over the last 2 years there has been an increase in the amount of longitudinal evidence emerging in this area.


Few studies have examined the selection system as a whole, including the relative contributions of the different methods and the effect of weightings, when several selection methods are used in combination.

There remain comparatively few studies examining selection system design overall and the relative contributions of the various selection methodologies (and the impacts of various weightings) when methods are used in combination (as is the norm in medical school selection172,173).


There are nevertheless clear messages about reliability, validity and effectiveness. Academic attainment remains a common feature of most selection policies, and the evidence for continuing to use it is strong. The evidence shows that structured interviews, MMIs, SJTs and SCs are more effective and fairer than traditional interviews, personal statements and references. The evidence on the effectiveness and fairness of aptitude tests is mixed and depends on the test. This is probably because there is currently no agreed framework for what ‘aptitude’ means: tests range from assessments of ‘pure’ cognitive ability (UKCAT) to academic tests (BMAT). In this situation it is difficult to assess systematically the relative contribution of different aptitude tests.

There are, however, some clear messages about the comparative reliability, validity and effectiveness of various selection methods. The academic attainment of candidates remains a common feature of most selection policies and the strength of evidence in support of it continuing to do so remains strong. The extant evidence paints a relatively clear picture illustrating that structured interviews or MMIs, SJTs and SCs are more effective methods and generally fairer than traditional interviews, references and personal statements. Evidence is currently mixed regarding the effectiveness and fairness of aptitude tests, depending on the tool in question. This stems largely from the fact that there is no currently agreed framework that specifies what is meant by aptitude; at present tests range from assessments of ‘pure’ cognitive ability (e.g. the UKCAT) to academic tests (e.g. the BMAT). As such, it is difficult to systematically assess the relative contributions of different aptitude tests, and of aptitude tests within a wider selection system.


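Findings on the acceptability of the various selection methods are also mixed, because of political issues such as differing stakeholder views, differences in the philosophies of medical students and medical schools, and the way a tool is implemented.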

The picture regarding the acceptability of various selection methods is also mixed, and may be influenced by a variety of political issues including differing stakeholder views, variations in the philosophies of both medical students and medical schools, and the ways in which the tool is implemented as part of a selection system.


In evaluating the papers reviewed here, it must be made clear that some terms cover a broad spectrum of methods. The quality of an instrument can vary enormously with its design, so each design must be examined individually before conclusions about effectiveness are drawn.

When judging the papers in this review, it was clear that some terms cover a broad spectrum of methods: MMIs, SJTs, aptitude tests, personality assessments and SCs are measurement methods that comprise a multitude of different design parameters. Depending on the design, this may significantly alter the quality of the instrument to the extent that each needs to be individually evaluated before conclusions about its effectiveness can be reached.


Implications for theory


A persistent problem in selection research concerns the outcomes we are trying to predict with selection tools. As an illustration of this criterion problem, the relationship between conscientiousness and performance shows mixed results depending on whether early medical school outcomes or later (clinical) outcomes are examined. In addition, the outcome measures used to evaluate selection tools concern attainment and maximal performance (medical school achievement, licensure examination performance), which may differ from clinical practice and typical (day-to-day) performance.

A persistent problem with selection research relates to the issue of which outcomes we are trying to predict by using various selection methods.59 For example, to illustrate this criterion problem, when exploring the association between conscientiousness and performance outcomes, we find mixed results when examining outcomes relating to early examination performance in medical school and performance within clinical practice in later years. Furthermore, our review also highlights that outcome measures used to evaluate selection methods most often focus on indicators of attainment and maximal performance (e.g. medical school achievements, performance in licensure examinations) rather than indicators relating to clinical practice and typical (day-to-day) in-role job performance.


A clear framework of outcome criteria is needed for judging the accuracy of selection methods.

In judging the evidence for the relative accuracy of selection methods, it becomes apparent that a clear framework of outcome criteria with which to interpret the research evidence and compare selection methods, both individually, and within a selection system, has yet to be established.



In addition, research has focused mainly on predictive validity and less on what each instrument measures (construct validity), which raises the question of how a method can add value to a selection system if the constructs it measures are unknown. This is particularly true of the MMI: despite its recent popularity, the lack of consistency about which attributes MMIs are meant to assess leaves the evidence on construct validity inconclusive.

In addition, evidence regarding the effectiveness of some methods has focused predominantly on the predictive validity of the tool, rather than on assessing precisely what different methods are measuring (i.e. construct validity); this raises the question of how a method can be considered to add value to a selection system if the constructs it is measuring are unknown. This is particularly the case for MMI research, in which, despite the method’s increasing popularity in recent years, there is a lack of consistency regarding the attributes selectors are using MMIs to assess for and, relatedly, evidence regarding construct validity remains inconclusive.



What counts as an indicator of applicant competence depends on the point in a medical career at which it is judged. Selection criteria therefore vary with the specific role and include both academic and non-academic indicators. A factor may be an important predictor for UME yet work in the opposite direction for clinical performance. Different selection methods may therefore predict differently at different stages. For example, an SJT may be less predictive of performance in the early years of medical school (which are mainly academically focused) but more predictive of clinical practice. The challenge for selection system design in medicine is to integrate the research evidence on what is reliable and valid, for both academic and non-academic qualities, from undergraduate selection through to selection for specialty training many years later.

It is clear that indicators of competence for entrance to medical training and practice are likely to be different at different points in a medical career; thus, applicants are judged on multiple selection criteria depending on the specific role, which may include varying combinations of academic and non-academic indicators of aptitude. A factor may be identified as an important predictor for undergraduate training, but may actually hinder some aspects of performance in clinical practice.59,66 As such, different selection methods may predict differently at different stages: for example, an SJT may be less predictive of performance in the early years at medical school (which tends to be more academically-focused), but significantly more predictive of performance outcomes when trainees enter clinical practice.28,174 A major challenge within medicine is to integrate the research evidence to inform the design of selection systems that are reliable and valid (and weighted appropriately) from undergraduate selection through to selection for specialty training after many years of education, for both academic and non-academic qualities.


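More theory-driven research is therefore needed to establish what a 'competent' physician is. A unified taxonomy of performance indicators should be developed and used as markers of short- and long-term predictive validity. For example, some researchers argue that medical school selection should select in on academic attainment and select out on non-academic skills. It has also been argued that non-academic attributes should play a larger role in postgraduate (PGME) selection, with weightings that differ by specialty. For example, empathy and communication matter more in general practice and paediatrics, whereas vigilance and situational awareness matter more in anaesthesia.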

Hence, there is a need for more theoretically driven, future-oriented research aimed at identifying what a ‘competent’ physician is at the various stages of training and practice. This will allow researchers and practitioners to move towards crafting a unified taxonomy of performance indicators which may be used as markers in short- and long-term predictive validity studies of selection methods. For example, some researchers suggest that from undergraduate selection onwards, medical students should be selected in on the basis of academic attainment and selected out on the basis of non-academic skills and attributes.175 It could be argued that non-academic attributes and skills should therefore play a much larger role in postgraduate selection and the weighting of these may differ depending on the specialty. For example, research from job analysis studies shows that empathy and communication are weighted more heavily for selection into general practice176 and paediatrics, whereas vigilance and situational awareness carry more weight in anaesthesia.177



Implications for practice


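SJTs and MMIs are more valid predictors of inter- and intrapersonal (non-academic) attributes than references or personal statements. SJTs and MMIs may be complementary: SJTs assess a broader range of constructs efficiently, whereas MMIs involve a face-to-face encounter. Although costly, structured interviews allow applicant responses to be probed further and in more depth.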

Our review shows that SJTs and MMIs are more valid predictors of inter- and intrapersonal (non-academic) attributes than personal statements or references. Situational judgement tests (SJTs) and MMIs may be complementary: whereas SJTs can measure a broader range of constructs efficiently as they can be machine-marked, MMIs, by contrast, involve a face-to-face encounter. Although expensive, structured interviews (including MMIs) allow applicant responses to be probed further and in more depth.


At present, the picture for aptitude tests and cognitive factors is less clear.

At present, the picture for aptitude tests and cognitive factors is less clear as a result of
  • the large number of aptitude tests and the differences between those that are currently available,
  • the diverse outcome measures against which performance on aptitude tests is compared (to assess validity, see the ‘criterion problem’ discussed above),
  • the multiple ways in which aptitude tests are implemented, and
  • the mixed nature of the evidence on the effectiveness of aptitude testing.

 

There is also evidence that some aptitude tests favour certain types of applicant.

There is also some evidence that some aptitude tests may favour certain types of candidate,46 which may have unfavourable implications for fairness and widening access to medicine.


The difficulties in interpreting and applying the evidence on selection methods include the following.

The challenges of interpreting and applying evidence of selection methods include

  • the relative lack of longitudinal data,
  • lack of an agreed-upon framework of outcome criteria, and
  • institutional differences (including in available resources, curricula and philosophies of what a high-performing medical student is considered to be).

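Kreiter and Axelson note that the complexity of admissions goals is an obstacle. Social justice, educational equality, health care and political outcomes are often competing goals. When judging the quality and effectiveness of selection methods, it should be recognised that some criteria compete with one another. For example, acceptability to stakeholders or assessors may be high while the validity evidence is poor. Similarly, the validity evidence for SCs is strong, but they are costly and therefore hard to use. In this respect, when judging the quality and effectiveness of selection tools, medical schools should consider the context within which the selection system operates.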

Kreiter and Axelson2 acknowledge that the complexity of admissions goals may also be an obstacle to evidence-based progress in medical school admissions because concerns regarding social justice, educational equality, health care and political outcomes are broad and frequently competing. When judging the quality and effectiveness of selection methods, it is noteworthy that some criteria may compete with one another. For example, the stakeholder acceptability of referees’ reports in selection is generally high, but the evidence for their validity is poor. Similarly, regarding other criteria, the evidence for the validity of SCs is high, but they are relatively costly to implement. In this respect, when judging the quality and effectiveness of different selection methods, medical schools and employers may choose to weight different features depending on the context within which the selection system is operating.


Susceptibility to coaching is a common concern for any assessment tool.

A common central concern for any selection tool is susceptibility to coaching. Research over the last 10 years has increasingly focused on this issue, probably because there has been increasing emphasis on how to validly assess non-academic attributes in selection for medical education.

  • Personal statements: susceptible to coaching, with companies selling them internationally. In particular, personal statements are at significant risk of being influenced by coaching, or indeed of being written by somebody other than the applicant; a brief online search reveals a large number of companies internationally that sell pre-written personal statements.
  • SJTs: no effect of coaching found. With regard to SJTs, recent studies have found no effects of commercial coaching on SJT scores or the predictive validity of SJTs.87,178 However, ongoing research is required to assess the coachability of the full range of non-academic selection tools in greater depth.

 

Scoping a future research agenda



It is difficult to draw firm conclusions.

It is clear from our review that it is challenging to draw firm conclusions regarding the relative strength of the different tools given the variety in the quality and design of the currently available research evidence: at present there are insufficient data, and medical education providers’ agendas are too diverse, to propose a fully comprehensive framework for international best practice in medical selection methods.


Well-designed studies are needed.

There is a clear need for well-planned studies focusing on the long-term follow-up of medical students, tracking students from admission through to assessments in more senior training posts in clinical practice, at the point of licensure and beyond.


Research on widening access and diversity is needed.

Within the broader sphere of issues of fairness in selection, more research exploring issues of widening access and diversity is required, whether it refers to race, ethnicity or social class, as this remains a challenge within medical school admissions globally, and it is becoming increasingly important politically to reflect society within the health care professions.179,180


O’Neill et al. found that selection method had no significant effect on social diversity and argued that diversifying the applicant pool matters more. Conclusions so far remain tentative.

O’Neill et al.181 found no significant effect of selection method on social diversity in the medical student population, and suggest that the attraction of a sufficiently diverse applicant pool is more important for widening access than which selection tool is used. Therefore, only tentative conclusions can be drawn.



Prior educational attainment has been called the 'academic backbone' of medical education because of its high predictive validity, but research is needed on how 'contextual data' can be used.

Whereas traditional markers of prior educational attainment have been called the ‘academic backbone’ of medical education because they are highly predictive of subsequent performance both at medical school and beyond, there is a need to explore how ‘contextual data’ can be used to allow the social and educational backgrounds of applicants to be taken into consideration alongside their educational achievements.


The term 'non-cognitive' is problematic because it implies 'not thinking'.

A key criticism of selection research is that there is a distinct lack of theory-driven studies that examine issues related to validity and the constructs being measured and that, more broadly, acknowledge contemporary models of adult intellectual development and skill acquisition, or attempt to integrate cognitive and non-cognitive factors.172,173 The term ‘non-cognitive’ is in itself problematic as it arguably implies ‘not thinking’;




The authors propose the following.

In summary, we propose the following priorities for a future research agenda over the next 50 years in order to enable schools and employers to make evidence-based decisions about which selection tools to use and why:


1 longitudinal research exploring predictive validity and following students throughout the course of their careers within education, training and practice;


2 research enabling greater understanding of how selection tools may impact on widening access and diversity agendas, and


3 theory-driven studies of the construct validity of both academically and non-academically oriented selection methods and selection systems that will help us to understand what we are assessing for in both the short and long terms.




Finally, we propose that the following five considerations will be integral in shaping the direction of medical education research over the next 50 years:



 

1. Medical school admissions will remain highly competitive.

1. Medical school admissions will remain highly competitive. The prestige of being a physician is likely to continue to drive a high applicant-to-selection ratio in medical school selection internationally over the next 50 years. However, this is unlikely to be true in all postgraduate specialties; some medical career pathways may be perceived to be of higher status and will therefore be more competitive than others. Medical selection may become part of a process to facilitate recruitment into areas of most need. This may, in turn, require varying emphasis on selection for specific attributes and competencies: one size is unlikely to fit all.



2. There will be a greater focus on non-academic competencies.

2. There will be an increased focus on, and value of, non-academic attributes and skills in medical selection, aligned with what wider society wishes from its physicians. The role of the physician’s own well-being and resilience, and how these can best be selected for, then supported and developed, will be of increasing importance. Trainees’ expectations of their work–life balance will also be integral to medical selection over the next 50 years. Consideration must be given during selection to the discourse around how we encourage new generations of medical students to expend discretionary effort in future. This is strongly related to:




3. The ability to lead multidisciplinary teams and to build a culture of 'everyday' innovation with limited resources.

3. a growing focus on capability to lead multidisciplinary teams, and building a culture of ‘everyday’ innovation in an environment of reduced resources.



4. Rather than a focus on one or two 'innovators', commitment from every team member will be needed.

4. Rather than a focus on just one or two people in a team, who are touted as the ‘innovators’, there is likely to be an increased onus on all health care professionals to innovate and provide leadership in order to engage multiprofessional teams and to continue to deliver high-quality and compassionate care in a climate of ongoing health care spending cuts.185,186 This may represent a significant change in how applicants to medical education are selected. This, in turn, relates to:


5. Securing a wider applicant pool.

5. a focus on attracting a wider selection pool and recruiting a more diverse workforce, reflecting a philosophical shift towards acknowledging that non-traditional students may be able to align themselves with patients from diverse backgrounds and also contribute to the education of their peers by acting to challenge the current medical culture.187,188 Bringing such ‘non-traditional’ applicants into the health care system may promote, and indeed necessitate, innovative working practices. However, as we have discussed elsewhere,180 there is currently a multitude of unanswered questions on how this may be best implemented and how outcomes can be measured in a reliable and valid way.













 2016 Jan;50(1):36-60. doi: 10.1111/medu.12817.

How effective are selection methods in medical education? A systematic review.

Author information

  • 1Department of Organisational Psychology, City University, London, UK.
  • 2Work Psychology Group, Derby, UK.
  • 3School of Medicine, University of Dundee, Dundee, UK.
  • 4Barts and The London School of Medicine and Dentistry, Queen Mary University of London, London, UK.
  • 5School of Medicine and Dentistry, University of Aberdeen, Aberdeen, UK.

Abstract

CONTEXT:

Selection methods used by medical schools should reliably identify whether candidates are likely to be successful in medical training and ultimately become competent clinicians. However, there is little consensus regarding methods that reliably evaluate non-academic attributes, and longitudinal studies examining predictors of success after qualification are insufficient. This systematic review synthesises the extant research evidence on the relative strengths of various selection methods. We offer a research agenda and identify key considerations to inform policy and practice in the next 50 years.

METHODS:

A formalised literature search was conducted for studies published between 1997 and 2015. A total of 194 articles met the inclusion criteria and were appraised in relation to: (i) selection method used; (ii) research question(s) addressed, and (iii) type of study design.

RESULTS:

Eight selection methods were identified: (i) aptitude tests; (ii) academic records; (iii) personal statements; (iv) references; (v) situational judgement tests (SJTs); (vi) personality and emotional intelligence assessments; (vii) interviews and multiple mini-interviews (MMIs), and (viii) selection centres (SCs). The evidence relating to each method was reviewed against four evaluation criteria: effectiveness (reliability and validity); procedural issues; acceptability, and cost-effectiveness.

CONCLUSIONS:

Evidence shows clearly that academic records, MMIs, aptitude tests, SJTs and SCs are more effective selection methods and are generally fairer than traditional interviews, references and personal statements. However, achievement in different selection methods may differentially predict performance at the various stages of medical education and clinical practice. Research into selection has been over-reliant on cross-sectional study designs and has tended to focus on reliability estimates rather than validity as an indicator of quality. A comprehensive framework of outcome criteria should be developed to allow researchers to interpret empirical evidence and compare selection methods fairly. This review highlights gaps in evidence for the combination of selection tools that is most effective and the weighting to be given to each tool.

© 2015 John Wiley & Sons Ltd.

PMID: 26695465 [PubMed - in process]


The elusive grail of social inclusion in medical selection (Med Educ, 2013)

Nancy Sturman & Malcolm Parker






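In this issue, O'Neill et al. show that the social composition of Danish medical students did not differ whether they were selected on school-leaving grades or through an 'attribute-based' track. The latter was intended for students whose academic attainment may have been limited by socio-economic factors, offering them a chance of entry on the basis of 'other valuable qualifications and attributes'.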

In this issue, a study by O’Neill and colleagues reports that the social composition of Danish medical students was similar whether they were selected according to school-leaving grades or on an ‘attribute-based’ track.1 The latter was designed to afford students whose academic grades may have been limited by socio-economic disadvantage, a chance of entry on the basis of ‘other valuable qualifications and attributes’.1


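The 2010 Ottawa Conference Consensus Statement on selection argued that wider social and cultural inclusion is needed to reflect the patient populations served, and that this has 'political' validity because under-representation is tantamount to discrimination. This equity argument is linked to others: that therapeutic relationships and patient outcomes improve when patients and doctors are better 'matched', and that differences in social class between doctors and patients underlie communication difficulties and the delivery of inferior treatment to patients of lower socio-economic status (SES).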

The 2010 Ottawa Conference Consensus Statement on selection for the health care professions argued that wider social and cultural inclusion to reflect the patient populations to be served has a ‘political’ validity in that under-representation is tantamount to discrimination.4 This argument for equity is accompanied by other equity-related concerns, including the proposition that therapeutic relationships and patient outcomes are strengthened by better ‘matching’ of patients and doctors.5 Furthermore, there is at least some evidence that differences in social class between doctors and patients underlie difficulties in communication and the delivery of inferior treatment to patients of lower socio-economic status.6


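Clearly, improving care and outcomes for disadvantaged patients falls within the social accountability agenda of medical schools. In increasingly pluralist societies, however, there are many social groups that could be described as under-represented or disadvantaged, so a 'social inclusion' agenda may become arbitrary, unwieldy, or both. Nor should it be assumed that doctors from disadvantaged backgrounds will be more likely to care for disadvantaged patients; after qualifying, these doctors themselves move into a higher socio-economic bracket. An alternative to recruiting more medical students from lower-SES groups is to develop cultural and social competence in all medical students, including effective communication with diverse social groups. Most doctors will care for patients whose backgrounds differ from their own, so this training should be essential in every medical school. It is also worth noting that medicine is not the only career path for providing effective health care and promoting health in marginalised communities.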

Clearly, improving care and outcomes for disadvantaged patients falls within the social accountability agenda of medical schools. However, in increasingly pluralist societies, there are many under-represented and disadvantaged social groups which might reasonably lay claim to inclusion in medical student quotas. A ‘social inclusion agenda’ might become arbitrary or unwieldy, and perhaps both. It should also not be assumed that doctors from disadvantaged backgrounds will be more likely to work with disadvantaged patients. Most of these doctors will themselves shift to a higher socio-economic bracket after medical qualification.6 An alternative to recruiting more medical students from lower socio-economic strata is to train all medical students in cultural and social competence, including effective communication with different social groups.7 As almost all doctors will work with patients from backgrounds which differ from their own at some stage during their careers, this training should be fundamental to all medical school programmes. It should also be noted that medicine is not the only career open to talented young people with a commitment to promoting health and providing health care effectively in disadvantaged communities.


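The cost to medical education of a successful social inclusion agenda is high. Medical courses are demanding, and students with lower academic qualifications struggle academically and sometimes socially. Upstream strategies that help these students become more competitive at selection, together with academic support programmes during medical school, appear helpful. However, the costs of pre-medical 'pipeline' or special preparation programmes, and of curricula that retain and support these students in medical school, are considerable.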

A successful social inclusion agenda for medical education is also costly. Medical courses are challenging, and students with lower academic qualifications are more likely to struggle academically and sometimes also socially.8 Deliberate upstream strategies to support students from identified groups to become more competitive at selection,9 as well as targeted academic support programmes during medical school training,10 appear to be successful. However, the costs of pre-medical ‘pipeline’ and special preparation programmes within these comprehensive strategies of recruitment, retention and support in the curriculum are considerable.5


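Are there other arguments for social inclusion? One is that greater diversity in student cohorts creates a socially and intellectually richer educational environment in which traditional, harmful paradigms of medical culture are challenged. This argument seems intuitively right and is hard to refute. However, there may be more cost-effective ways to address the harmful institutional structures, hierarchical problems and ethical lapses that pervade the hidden curriculum of medical schools.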

Are there other arguments for the social inclusion agenda? There is also the suggestion that greater diversity in student cohorts is likely to produce a socially and intellectually richer educational environment in which traditional, potentially harmful paradigms of medical culture are more likely to be challenged. This argument is also intuitively compelling and probably irrefutable. However, there may be other, more cost-effective strategies for addressing the harmful institutional structures, hierarchical relationships and ethical lapses that have been identified as comprising a pervasive hidden curriculum in medical education.



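Applying to medical school is a high-stakes decision for applicants and their families, and any selection process will attract gaming from applicants and will reject some who have the potential to become excellent doctors. Selecting and training medical students is already costly and resource-intensive. Quests for social accountability goals that are plausible but hard to attain should therefore be reflective, finite and practical.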

The stakes are high for medical applicants and their families, and any selection process will both attract its share of gaming from applicants and deny admission to some with the potential to become excellent doctors. Medical student selection and training are already expensive and resource-intensive. Quests for plausible but elusive social accountability goals should therefore be reflective, finite and practical.




Cohort and other studies may show that the goal of social inclusion in medical selection has benefits; or they may show that, despite its apparent political validity, it is impractical, misplaced and unaffordable.

Cohort and other studies may demonstrate benefits, or they may show that, despite its apparent political validity, the quest for social inclusion in medical selection is impractical, misplaced and unaffordable.





 2013 Jun;47(6):542-4. doi: 10.1111/medu.12211.

The elusive grail of social inclusion in medical selection.

Author information

  • 1School of Medicine, University of Queensland, 8th Floor, Health Sciences Building, Royal Brisbane Hospital, Herston, Brisbane, Queensland 4068, Australia. n.sturman1@uq.edu.au


Medical School Admissions: Enhancing the Reliability and Validity of an Autobiographical Screening Tool (Acad Med, 2006)

Kelly L. Dore, Mark Hanson, Harold I. Reiter, Melanie Blanchard, Karen Deeth, and Kevin W. Eva





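Like many other schools, the Michael G. DeGroote School of Medicine at McMaster University invites applicants to interview on the basis of uGPA and a candidate-written autobiographical submission (ABS). Research shows that the reliability and validity of uGPA are fairly robust, whereas those of the ABS are weak.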

Like many schools, the Michael G. DeGroote School of Medicine at McMaster University invites candidates to interview based on undergraduate grade point average (uGPA) and a candidate-written autobiographical submission (ABS). Local research has demonstrated strong reliability and validity for uGPA,2,3 but the reliability of the ABS has been weak.2


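The ABS consists of five questions covering applicants' personal experiences, suitability for McMaster, and suitability for a career in medicine. Each applicant's ABS is stripped of personal identifiers and then scored by three independent raters (one health science faculty member, one community member, and one medical student).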

The ABS is composed of five questions designed to evaluate noncognitive characteristics such as applicants’ personal experiences, suitability for McMaster and suitability for a career in medicine. Each applicant’s five ABS questions, stripped of any personal identifiers, are scored by three independent raters: one health science faculty member, one community member, and one medical student.


Each rater scores 30 to 60 ABSs, and more than 150 raters take part each year.

Each rater scores 30–60 ABS submissions and upwards of 150 raters participate annually.



Non-independence of the ratings


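ABS scores have been shown to correlate very poorly with performance in medical school, and likewise with the national licensing examination (NLE). One reason is the low interrater reliability of ABS scoring (0.45). The ABS does, however, show high internal consistency. Although high internal consistency can indicate a reliable scale, it may also indicate that the scores given to individual questions are not independent measures of the applicant; that is, they reflect a halo effect. If an applicant's first answer influences how raters perceive the second and third answers, the applicant's score is determined by the first impression rather than by the average of the scores for each answer. This matters because, functionally, only three observations (one per rater) are being collected rather than fifteen (raters × questions).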

Scores on the ABS have been shown to correlate poorly with performance both within medical school and on the national licensing examinations written postgraduation.2 One reason identified by Kulatunga-Moruzi and Norman is that the interrater reliability of ABS scoring is less than adequate (0.45). The ABS has, however, been seen to have high internal consistency (0.88). Although high internal consistency may be seen as supportive of the reliability of a measure, it may in fact be a negative indication that the scores assigned to the individual questions do not provide independent measures of the applicant. That is, the halo effect may be afflicting this measure; if performance on the first question influences the raters’ perceptions of performance on subsequent questions, then the initial overall impression of the candidate will determine the scores assigned to individual questions rather than the individual questions summing to provide a global assessment. This is an important distinction, because it would indicate that, functionally, only three observations (from three raters) are being collected in the current system instead of the desired fifteen.
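
To make the internal-consistency argument concrete, the sketch below computes Cronbach's alpha for one rater's question-level scores. The function name `cronbach_alpha` and both toy score matrices are illustrative assumptions, not data from the study. It only shows that when a rater gives a candidate the same score on every question (an extreme halo effect), alpha reaches its maximum even though the five questions add no independent information.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates x questions score matrix.

    scores: list of rows, one row per candidate, one column per question.
    """
    k = len(scores[0])  # number of questions
    # Sample variance of each question across candidates.
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    # Sample variance of each candidate's total score.
    total_var = variance([sum(row) for row in scores])
    return (k / (k - 1)) * (1 - sum(item_vars) / total_var)


# Hypothetical extreme halo: the rater gives each candidate the same
# score on all five questions, so the questions are fully redundant.
halo_scores = [[3] * 5, [5] * 5, [4] * 5, [6] * 5]
alpha_halo = cronbach_alpha(halo_scores)  # approximately 1.0

# Hypothetical scores in which questions vary within each candidate.
varied_scores = [
    [3, 5, 2, 6, 4],
    [5, 3, 6, 4, 5],
    [4, 6, 3, 5, 2],
    [6, 4, 5, 3, 6],
]
alpha_varied = cronbach_alpha(varied_scores)  # lower than alpha_halo
```

A high alpha under vertical scoring is therefore compatible with either a coherent scale or redundant, halo-driven ratings; the statistic alone cannot separate the two.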


To find out whether this was a problem, the direction in which ratings were collected was changed.

To test whether or not this was an issue, we altered the direction in which ratings were collected.
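
The change of direction can be pictured as a change in the order in which a rater works through the candidate-by-question grid. The sketch below assumes the definitions given in the article's abstract (vertical: all questions for one candidate in turn; horizontal: all candidates for one question in turn); the candidate labels and variable names are hypothetical.

```python
from itertools import product

candidates = ["A", "B", "C"]   # hypothetical candidate labels
questions = [1, 2, 3, 4, 5]    # the five ABS questions

# Vertical (traditional): finish every question for one candidate
# before moving on to the next candidate.
vertical_order = [(c, q) for c, q in product(candidates, questions)]

# Horizontal (new): score one question for every candidate
# before moving on to the next question.
horizontal_order = [(c, q) for q, c in product(questions, candidates)]
```

Both orders cover the same (candidate, question) pairs; only the sequence differs, which is what limits the carry-over of a first impression from one question to the next for the same candidate.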


Non-independence of the ratees


Undoubtedly, a small number of applicants have a ghostwriter produce their ABS. More commonly, applicants show their ABS to friends, family, current students or physicians and ask for feedback.

Undoubtedly a small percentage of candidates are less than scrupulous and hire ghostwriters in an attempt to generate a more appealing ABS. More commonly, however, candidates will pass their submissions around to friends, family, current students, or practicing physicians for feedback to improve the submission.



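This raises measurement problems. First, as long as there is an upper limit on how good an ABS can be, and if such feedback leads to improvement, submissions will become more homogeneous, lowering the maximum achievable reliability and validity. Second, even without this restriction of range, validity is in question because it is unclear whether the applicant or the applicant's support system is being assessed.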

This practice creates a pair of measurement problems. First, given that there is an upper limit on how good an ABS can appear, and assuming that the collection of feedback results in improvement, the submissions may end up being more homogeneous than the candidates, thus lowering the maximum achievable reliability and validity. Second, even without restriction of range, the validity itself must be questioned as it becomes questionable whether one is discriminating between candidates or between candidate support systems.
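
The first problem, restriction of range, can be illustrated numerically. The sketch below uses made-up numbers, not data from the study, and models homogeneous submissions simply as keeping only the upper half of the submission scores (a simplifying assumption). The helper `pearson_r` and the variable names are hypothetical.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)


# Hypothetical submission scores and a later outcome measure.
submission = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
outcome = [2, 1, 4, 3, 6, 5, 8, 7, 10, 9]
r_full = pearson_r(submission, outcome)

# Keep only submissions near the ceiling (scores of 6 or more).
top = [(s, o) for s, o in zip(submission, outcome) if s >= 6]
r_restricted = pearson_r([s for s, _ in top], [o for _, o in top])
```

With the same underlying relationship, the correlation computed on the narrower band is lower than on the full range, which is the sense in which homogeneous submissions cap the achievable validity.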


Method



The ABS written onsite is comparable with, but not identical to, the ABS submitted in advance. It focuses on ethical decision making, advocacy and personal experiences.

The onsite ABS questions were comparable with, but not identical to, the noninvigilated questions participants answered offsite, with questions focusing on ethical decision making, advocacy, and personal experiences.



The ABSs of 30 randomly selected applicants were scored.

For a subset of 30 randomly selected candidates, two scoring methods were compared for each ABS.



Results


 

Scores for the ABS submitted in advance were higher than scores for the ABS written onsite. A significant interaction indicates that this main effect was driven by the traditional (offsite, vertical) scoring method.

The scores for the ABS completed offsite (mean 4.4) were significantly higher than those completed onsite (mean 4.1; F = 5.7, p < .05). A significant interaction between site and scoring method (p < .01) revealed that this main effect was driven by a higher mean score in the traditional (offsite, vertical) scoring method (mean 4.7) relative to the other three groups (mean 4.0 to 4.2).


Interrater reliability was high onsite. For the offsite ABS, however, interrater reliability was moderate with horizontal scoring but poor with vertical scoring.

When the interrater reliability was assessed, it was found to be high for ABS’s completed onsite (0.81 with vertical scoring, 0.78 with horizontal scoring). However, the offsite ABS interrater reliability was moderate when horizontal scoring was used (0.69), but poor when vertical scoring was used (0.03).


Most importantly, ABS scores correlated more strongly with the MMI under horizontal scoring than under vertical scoring.

Perhaps more importantly, the ABS scores correlated better with the MMI when the horizontal scoring method was used (r = 0.44 offsite and 0.65 onsite) relative to when the vertical scoring method was used (r = 0.12 offsite and 0.28 onsite).


Discussion


The higher internal consistency under vertical scoring raises concern about the halo effect.

The higher internal consistency achieved using the vertical scoring method provides evidence for our concern that the halo effect may have been biasing ABS assessments.


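Considered in isolation, the ABS performed best when written onsite under invigilation and time limits and scored horizontally. Invigilation preserves the independence of the ratees. However, the ABS is not used in isolation. Compared with the onsite ABS, the MMI is preferred for several reasons. First, in overall generalizability the MMI is at least as strong as the onsite ABS. Second, in terms of predictive validity, the MMI shows a significant positive correlation with in-school measures. Third, scoring the onsite ABS takes considerable rater time and delays decisions, whereas MMI results are available the same day.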

Seen in a vacuum, the method of ABS administration that performed best is clearly application of the horizontal scoring method to submissions collected in onsite, invigilated, time-controlled circumstances. Invigilation ensures independence of the ratees. However, the ABS does not function in a vacuum. Given the choice between MMI and onsite ABS, the MMI is preferred for a number of reasons. First, in terms of overall test generalizability, the MMI is at least as strong as the onsite ABS4. Second, with respect to predictive validity, the MMI has demonstrated significant positive correlation with in-school measures,5 and national licensing examination scores.6 Third, scoring of onsite ABS’s requires rater time subsequent to the date of interview, thus delaying decision making, whereas MMI scores are available immediately on that date.



 







 2006 Oct;81(10 Suppl):S70-3.

Medical school admissions: enhancing the reliability and validity of an autobiographical screening tool.

Author information

  • 1Program for Educational Research and Development, McMaster University, MDCL 3510, 1200 Main Street West, Hamilton, Ontario, L8N 3Z5, Canada. kelly.dore@learnlink.mcmaster.ca

Abstract

BACKGROUND:

Most medical school applicants are screened out preinterview. Some cognitive scores available preinterview and some noncognitive scores available at interview demonstrate reasonable reliability and predictive validity. A reliable preinterview noncognitive measure would relax dependence upon screening based entirely on cognitive tendencies.

METHOD:

In 2005, applicants interviewing at McMaster University's Michael G. DeGroote School of Medicine completed an offsite, noninvigilated, Autobiographical Submission (ABS) preinterview and another onsite, invigilated, ABS at interview. Traditional and new ABS scoring methods were compared, with raters either evaluating all ABS questions for each candidate in turn (vertical scoring-traditional method) or evaluating all candidates for each question in turn (horizontal scoring-new method).

RESULTS:

The new scoring method revealed lower internal consistency and higher interrater reliability relative to the traditional method. More importantly, the new scoring method correlated better with the Multiple Mini-Interview (MMI) relative to the traditional method.

CONCLUSIONS:

The new ABS scoring method revealed greater interrater reliability and predictive capacity, thus increasing its potential as a screen for noncognitive characteristics.

PMID: 17001140 [PubMed - indexed for MEDLINE]


The Multiple Mini-Interview (MMI) for student selection in health professions training – A systematic review (Med Teach, 2013)

ALLAN PAU1, KAMALAN JEEVARATNAM2, YU SUI CHEN1, ABDOUL AZIZ FALL1, CHARMAINE KHOO1 & VISHNA DEVI NADARAJAH1

1International Medical University, Malaysia, 2Royal College of Surgeons in Ireland, Perdana University, Malaysia







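Selecting students for health professions training programmes is a high-stakes decision. Panel or board interviews are commonly used, but the evidence indicates that they have only limited ability to predict academic or clinical performance.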

Admissions to health professions training programmes are high stake decisions. The panel or board interview is commonly used to aid this decision (Edwards et al. 1990), although the evidence suggests its limited ability to predict academic or clinical performance in health care disciplines (Goho & Blackman 2006).


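For example, Dixon et al. reviewed the panel interview and reported that structure and scoring anchors affect reliability and validity. Wilkinson et al. argued that panel interviews have little predictive value and that the 'threat' of an interview may deter some potential applicants, and concluded that GPA is the best predictor of academic performance.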

For example, Dixon et al. (2002), in their review on the panel interview commented that structure and scoring anchors impact on its reliability and validity. Wilkinson et al. (2008), in their study argued that panel interviews have little predictive value and added that the ‘‘threat’’ of an interview may even dissuade some potential applicants and concluded that GPA (grade point average from student pre entry qualification) has the best predictive value to academic performance.


Structuring the interview improves its acceptability and reliability. The MMI is a highly structured student selection method.

Structuring the interview has been reported to enhance its acceptability and reliability (Patrick et al. 2001). The Multiple Mini-Interview (MMI) is a highly structured student selection method designed to resemble the Objective Structured Clinical Examination (OSCE) (Eva et al. 2004c).


By sampling candidates’ competencies widely, the MMI gives a more accurate picture of their overall ability.

The MMI, therefore, allows a wide sampling of candidates’ competencies in order to gain a more accurate picture of their overall ability.



Methods



Results



Characteristics of studies reviewed



Features of the MMI


  • The number of stations used in the studies reviewed ranged from 4 to 12, with 10 studies using a 10-station MMI, 6 using 12, 5 using 8, and the remaining 9 using 4, 7, 9 or 11 stations.
  • Fourteen of the studies used one assessor per station while 4 used 2 assessors, and the remaining 12 did not report the number of assessors per station.
  • Most studies used faculty as assessors, while some used a combination of faculty and community practitioners (Hecker & Violato 2011) and others included students (Brownell et al. 2007).


  • The range of time at each station was 5 to 15 min with a mode of 8 min. Eleven studies reported using 8-min stations, five using 7-min, three using 10-min, one using 5-min and one 15-min stations.
  • Two studies tested the effect of different lengths of time at stations; one comparing eight and six minutes (Cameron & Mackeigan 2012) and the other eight and five minutes (Dodson et al. 2009). Seven did not report the time at each station.

 

  • The average MMI has 10 stations, each lasting eight minutes and is rated by one assessor.

 


 

Feasibility


Three studies reported on the feasibility of the MMI. One reported that it did not require more examiners when compared to the panel interview, did not cost more, and the interviews could be completed over a short period of time (Brownell et al. 2007; Finlayson & Townson 2011). Another study reported that it provided a positive experience for interviewers as well as applicants (Eva et al. 2004c).





Acceptability


Of the 30 studies reviewed, 14 reported on the acceptability of the MMI. Some authors reported that the MMI was acceptable to interviewees and interviewers because it was perceived as fair (Razack et al. 2009), transparent (Uijtdehaage et al. 2011) and providing opportunities for the interviewees to regain composure if they had problems with a previous station (Kumar et al. 2009). Positive experience for both applicants and examiners has also been reported (Eva et al. 2004c).


Acceptability was also determined as free from gender and cultural bias (Brownell et al. 2007), and socio-economic disadvantage (Uijtdehaage et al. 2011) or benefit of previous coaching (Griffin et al. 2008). Griffin et al. (2008) reported that previous coaching, as disclosed by applicants, had no effect on UMAT or MMI scores. Applicants who had previous MMI experience improved their subsequent performance in the same stations but not in new stations. 



Preference for station length differed between interviewers and interviewees, with the former judging six mins to be “just right” and eight mins to be “a bit long”, and the latter preferring longer time (Cameron & Mackeigan 2012). One study reported that graduate candidates outperformed school-leavers (Dowell et al. 2012) while another reported no difference between graduate and school-leaver applicants (O’Brien et al. 2012).


Acceptability of the MMI was compared to that of the panel or standard interview by O’Brien et al. (2011) for graduate and school-leaver applicants to 4-year and 5-year medical training programmes. The 5-year candidates, generally school-leaver applicants, reportedly felt that the MMI gave a more accurate picture of their abilities and that the panel interview was more difficult. In contrast, the 4-year candidates felt the MMI was more difficult.



Reliability


Eighteen studies reported on the reliability of the MMI. Intra-station reliability was reported to reach 0.98 by Lemay et al. (2007). The inter-item reliability (i.e. the internal consistency of the three scores assigned within any one station) and the inter-rater reliability within stations have also been reported to be very high by Dore et al. (2010). However, Finlayson & Townson (2011) conducted a 4-station MMI, each at 15 min, and reported inter-rater reliability ranging from 0.50 to 0.69 for three stations, and 0.10 for one station.



Generally the reported reliability ranged from moderate (Roberts et al. 2008) to acceptable (Dore et al. 2010) to high (Lemay et al. 2007), with Cronbach’s alpha ranging from 0.69 to 0.98. However, Finlayson & Townson (2011) reported inter-station reliability ranging from 0.45 to 0.47. Other researchers have also reported low inter-station correlations (Lemay et al. 2007).
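For readers less familiar with the statistic, Cronbach's alpha across MMI stations can be computed from an applicant-by-station score table. The sketch below is illustrative only: the function name `cronbach_alpha` and the toy scores are hypothetical and are not data from any of the reviewed studies.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a table of scores.

    scores: list of rows, one row per applicant, one column per station.
    """
    k = len(scores[0])
    if k < 2:
        raise ValueError("alpha needs at least two stations")
    # Variance of each station's scores across applicants.
    station_vars = [pvariance([row[j] for row in scores]) for j in range(k)]
    # Variance of each applicant's total score.
    total_var = pvariance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(station_vars) / total_var)


# Hypothetical toy data: stations that rank applicants identically.
consistent = [[1, 1, 1], [2, 2, 2], [3, 3, 3]]
# Hypothetical toy data: two stations whose scores are unrelated.
unrelated = [[1, 1], [1, 2], [2, 1], [2, 2]]
print(cronbach_alpha(consistent), cronbach_alpha(unrelated))
```

Alpha approaches 1 when stations order applicants the same way and drops toward 0 when station scores are unrelated, which is why low inter-station correlations and a respectable overall alpha can coexist once enough stations are summed.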


Using generalisability analysis, Hecker & Violato (2011) reported a G coefficient of 0.79 for seven stations with two assessors. A Decision study indicated that G = 0.81 can be achieved from ten stations with one assessor. Similarly, in Dore et al.’s (2010) study, G = 0.55 to 0.72 for seven stations is increased to G = 0.64 to 0.79 with 10 stations in a D-study.
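The Dore et al. (2010) figures quoted above are consistent with a simple Spearman-Brown style projection, in which only the number of stations changes. The sketch below is a simplification under that assumption: a full D-study works from estimated variance components and can also vary the number of assessors, which this does not model (so it does not reproduce the Hecker & Violato design). The function names are hypothetical.

```python
def single_station_reliability(g, n_stations):
    """Back out the implied reliability of one station from G for n stations."""
    return g / (n_stations - (n_stations - 1) * g)


def projected_g(g, n_from, n_to):
    """Project G from n_from stations to n_to stations (Spearman-Brown)."""
    r1 = single_station_reliability(g, n_from)
    return n_to * r1 / (1 + (n_to - 1) * r1)


# Seven-station values reported in the review, projected to ten stations.
for g7 in (0.55, 0.72):
    print(g7, round(projected_g(g7, 7, 10), 2))
```

Under this assumption the seven-station range of 0.55 to 0.72 projects to roughly 0.64 to 0.79 for ten stations, matching the range quoted in the review.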



Validity



Content validity.


The validity of the MMI was discussed in 17 of the 30 studies. One key observation was that the MMI scores did not correlate with traditional admission tools scores such as the personal interview (r = 0.185), undergraduate grades (r = 0.317), simulated tutorial (r = 0.227) and autobiographical sketch (r = 0.170) (Eva et al. 2004c). Other studies reported that the MMI did not correlate with pre-entry academic qualifications, such as GPA scores (Hecker et al. 2009), pre-pharmacy average (PPA) (r = 0.025) or Pharmacy College Admission Test (PCAT) (r = 0.042) (Cameron & Mackeigan 2012), GAMSAT (correlation 0.04) and UK Clinical Aptitude Test (UKCAT) (correlation 0.00) (O’Brien et al. 2011).

 

However, positive association with certain cognitive skills, such as the GAMSAT scores for “Reasoning in Humanities and Social Sciences” (r = 0.26) and “Written Communication” (0.26) (Roberts et al. 2008), and cognitive reasoning skills (Roberts et al. 2009) have been reported, as well as correlation with autobiographical submission focusing on ethical decision making (r = 0.65) (Dore et al. 2006). The MMI was not reported to be associated with emotional intelligence (Yen et al. 2011).




Predictive validity.


For medical students, MMI performance at admission was the best predictor for subsequent OSCE as well as clerkship performance (Eva et al. 2004a). Validity against future non-cognitive assessment was investigated by Eva et al. (2009), who reported that MMI performance at admission was statistically significantly predictive of performance at future examinations, such as the percentage of stations passed in the MCCQE (Medical Council of Canada Qualifying Examination) Part II.

 

However, a cross-sectional study investigating the association between MMI performance of medical residency applicants and their MCCEE (Medical Council of Canada Evaluating Examination) and MCCQE I scores reported low, non-significant correlations, and also non-significant correlation with MCCQE II scores (Hofmeister et al. 2009). In a more recent study, Eva et al. (2012) reported that better MMI performance at entry to medical school was predictive of higher MCCQE scores.



Discussion



The key findings of this review are as follows.

The key findings were that the MMI was

  • (i) practically feasible in terms of efficient utilisation of time, costs and human resources when compared to the panel interview;
  • (ii) generally acceptable to both interviewees and interviewers;
  • (iii) generally reliable with acceptable Cronbach’s alpha and G-coefficient values; and
  • (iv) predictive of future performance in certain aspects of medical council examinations.



Expertise is needed to develop the stations and conduct the interviews, so the initial preparation costs may be high.

Expertise is also necessary in developing the stations and conducting the interviews. Therefore the initial preparatory costs to develop the MMI are likely to be high (Rosenfeld et al. 2008).



Kumar et al. noted that the scenario-based MMI makes rehearsing or coaching responses harder. MMI performance has in fact been shown to be unrelated to self-reported previous coaching, so applicants without access to coaching are not disadvantaged.

Kumar et al. (2009) identified that the scenario-based nature of the MMI made it harder for rehearsal and coaching of responses, and indeed it has been reported that performance at the MMI is not associated with self-reported previous coaching (Griffin et al. 2008), and therefore does not disadvantage applicants with no access to coaching.



Intra-station and inter-rater reliability are high while inter-station reliability is low, because different stations test different attributes.

For example, it is expected that intra-station and inter-rater reliability would be high and inter-station reliability low (Lemay et al. 2007; Dore et al. 2010) since different stations may test different attributes.


However, reliability appears to be related to the number of stations or interviewers, and also to the content of each station.

However, reliability would appear to be associated with number of stations or interviewers (Hecker & Violato 2011), and the content of each station (Lemay et al. 2007).


The reliability of the MMI is acceptable, and this was recognised in a consensus statement from the Ottawa 2010 Conference on selection for the health care professions. Interviewer subjectivity is the largest source of measurement error, which suggests that interviewer training would help.

The reliability of the MMI has generally been reported to be acceptable. This has been recognised by the Ottawa 2010 Conference in a consensus statement on assessment for selection for the health care professions (Prideaux et al. 2011). Interviewer subjectivity is the largest source of measurement error, suggesting that interviewer training could be helpful (Roberts et al. 2008).


Most studies show that MMI performance is unrelated to pre-entry achievement (GPA, MCAT, GAMSAT), which suggests that the MMI assesses non-cognitive attributes.

Most studies reported that MMI performance was not associated with pre-entry academic qualifications such as GPA, MCAT and GAMSAT scores. This suggests that the MMI is capable of testing non-cognitive attributes, such as

  • professionalism (Hofmeister et al. 2009),
  • legal, ethical and organisational skills (Eva et al. 2009),
  • motivation, interest in medicine, decision making skills, ability to debate a complex issue (O’Brien et al. 2011),
  • empathy, moral and ethical reasoning, motivation and preparedness to study medicine, teamwork and leadership, honesty and integrity (Till et al. 2013), and
  • advocacy, ambiguity, collegiality and collaboration, cultural sensitivity, responsibility and reliability (Lemay et al. 2007).


Lemay JF, Lockyer JM, Collin VT, Brownell AK. 2007. Assessment of non-cognitive traits through the admissions multiple mini-interview. Med Educ 41(6):573–579.











Med Teach. 2013 Dec;35(12):1027-41. doi: 10.3109/0142159X.2013.829912. Epub 2013 Sep 20.

The Multiple Mini-Interview (MMI) for student selection in health professions training - a systematic review.

Author information

  • International Medical University, Malaysia.

Abstract

BACKGROUND:

The Multiple Mini-Interview (MMI) has been used increasingly for selection of students to health professions programmes.

OBJECTIVES:

This paper reports on the evidence base for the feasibility, acceptability, reliability and validity of the MMI.

DATA SOURCES:

CINAHL and MEDLINE

STUDY ELIGIBILITY CRITERIA:

All studies testing the MMI on applicants to health professions training.

STUDY APPRAISAL AND SYNTHESIS METHODS:

Each paper was appraised by two reviewers. Narrative summary findings on feasibility, acceptability, reliability and validity are presented.

RESULTS:

Of the 64 citations identified, 30 were selected for review. The modal MMI consisted of 10 stations, each lasting eight minutes and assessed by one interviewer. The MMI was feasible, i.e. did not require more examiners, did not cost more, and interviews were completed over a short period of time. It was acceptable, i.e. fair, transparent, free from gender, cultural and socio-economic bias, and did not favour applicants with previous coaching. Its reliability was reported to be moderate to high, with Cronbach's alpha = 0.69-0.98 and G = 0.55-0.72. MMI scores did not correlate to traditional admission tools scores, were not associated with pre-entry academic qualifications, were the best predictor for OSCE performance and statistically predictive of subsequent performance at medical council examinations.

CONCLUSIONS:

The MMI is reliable, acceptable and feasible. The evidence base for its validity against future medical council exams is growing with reports from longitudinal investigations. However, further research is needed for its acceptability in different cultural context and validity against future clinical behaviours.

PMID: 24050709 [PubMed - indexed for MEDLINE]


Effect of changing the weights of academics, experiences, and MMI-measured competencies on the racial/ethnic composition of accepted cohorts (Acad Med, 2015)

The Effect of Differential Weighting of Academics, Experiences, and Competencies Measured by Multiple Mini Interview (MMI) on Race and Ethnicity of Cohorts Accepted to One Medical School

Carol A. Terregino, MD, Meghan McConnell, PhD, and Harold I. Reiter, MD






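In medical education there have been calls for broad strategies that go beyond counting trainee diversity or matching representational ratios. Increasing the diversity of the health care workforce is one approach to reducing disparities between groups. Cohen et al. cite improved access and optimal management of the health care system, in addition to equity and fairness, as pragmatic reasons for achieving workforce diversity.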

Within the context of medical education, there has been a call for broad strategies extending beyond measures of the compositional diversity of trainees or representational ratios.2 Enhancing diversity in the health care workforce has been proposed as one approach to address those group disparities.3 Cohen et al3 cite increasing access and ensuring optimal management of the health care system, in addition to issues of equity and fairness, as pragmatic reasons for attaining workforce diversity.


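Increasing trainee diversity matters for raising educational quality for all students, improving access to care for rural, inner-city, and minority populations, and accelerating progress in public health research. Reliance on GPA and MCAT scores is a major constraint on diversity in medicine, and researchers have argued that MMI-based selection promotes diversity.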

Increasing trainee diversity is important for shaping educational quality for all students, increasing access to health care in rural, inner-city, and minority populations, and accelerating advances in medical and public health research.22 Reliance on GPAs and MCAT scores may severely constrain diversity within medicine,23,24 and researchers have argued that basing admission selections on MMI scores may promote applicant diversity.17,25


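Changes in medical school accreditation standards have also drawn schools’ attention to diversity. The Holistic Review Project is a model that encourages considering both academic and personal competencies in student selection, and in response RWJMS introduced a more holistic screening process.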

Changes in accreditation requirements reflect the enhanced attention to diversity expected of all medical schools.26 The Holistic Review Project has articulated a model that promotes the consideration of both academic and personal competencies in the application process.27 In response, Rutgers Robert Wood Johnson Medical School (RWJMS) began to implement a more holistic screening process;


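Importantly, applicants are admitted solely on the basis of their MMI scores. MCAT data help set the minimum level of academic competence. Using data from 11 medical schools, Julian showed that the likelihood of academic difficulty remains very low unless MCAT scores fall below 8 for biological sciences, 7 for physical sciences, and 6 for verbal reasoning. These findings suggest that, once a reasonable academic threshold is exceeded, the admissions process should pay less attention to academic performance and more to core personal competencies.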

Importantly, applicants are admitted exclusively on the basis of their MMI scores. MCAT data support this reliance on academic thresholds. Using data from 11 schools, a study by Julian28 demonstrated that the risk of academic difficulties remained very low until entering students’ MCAT scores fell below 8 for biological sciences, 7 for physical sciences, and 6 for verbal reasoning. These findings suggest that for students exceeding acceptable academic thresholds, selection procedures should be less concerned with academic performance and more concerned with core personal competencies performance.


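Supporting this hypothesis, the first RWJMS cohort whose final admissions decision was based solely on the MMI performed on par with earlier cohorts in first- and second-year courses and on USMLE Step 1. In addition, this cohort’s MMI scores predicted the core personal competencies assessed during medical school (reliability, integrity, service/sensitivity to diversity).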

In support of this hypothesis, the first cohort at RWJMS whose final admissions decision was based solely on MMI scores performed equivalently in first- and second-year courses and on United States Medical Licensing Examination (USMLE) Step 1 relative to previous cohorts admitted on the basis of traditional interviews, academic scores, and experiences. Additionally, the MMI scores from this first cohort predicted scores for students’ core personal competencies assessed in medical school (reliability, integrity, service/sensitivity to diversity).29


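We examined whether academic measures, experience measures, and personal competencies scores differed by applicants’ self-reported race/ethnicity, and whether changing the weights of these scores could affect the diversity of the entering class.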

Specifically, we examined whether academic measures (GPA, MCAT), experience scores (service, clinical, and research [SCR]), and personal competencies scores (MMI) varied as a function of applicants’ self-reported race/ethnicity, and whether change in weighting of scores would impact diversity by altering the demographic composition of the entering classes.



Method


Setting, study population, application screening process


Retrospective study

This is a retrospective study of previously collected and recorded data for the RWJMS admissions process for entering classes 2011–2013.


Academic criteria

We determined that applicants screened for MMI were academically and experientially prepared, based on threshold criteria previously set by the RWJMS Admissions Committee (

    • total GPA > 3.0, 
    • total MCAT > 22, 
    • MCAT biological science score > 8, and 
    • no other MCAT score < 6).


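Service, clinical exposure, research, the personal essay, and letters of recommendation were rated on a 5-point scale (3: acceptable for an applicant).

    • A 5 in research: a peer-reviewed presentation or publication
    • A 5 in service: founding a service organization; a 3: regular involvement in a service organization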


We scored service, clinical exposure, research, the personal essay, and letters of recommendation on a 1–5 Likert scale. The scale was developed so that a score of 3 is an acceptable score for an applicant. An example of a research rating of 5 would indicate culmination of the research experience with peer-reviewed presentation or publication. With respect to service, regular involvement in a service organization would be rated 3, whereas the founder of a service organization would be rated a 5.


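The sum of the screening scores was not used to rank applicants but only as a threshold (no interview below a certain score). The SCR score, however, informed screening decisions rather than serving as an absolute criterion, because some students, for example, had no research experience. Applicants who met the academic criteria and whose SCR, personal essay, and letters scores were at least 3 were considered for interview. After that, GPA, MCAT, experiences screening scores, essays, and letters were not considered again.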

The sums of the screening scores were not used to rank applicants but served as threshold scores below which an interview would not be offered. An SCR score was developed to inform but not dictate screening decisions, as some students did not have research experience. We considered for interview only applicants who met the academic criteria and who had SCR, personal essay, and letters scores of at least 3. We did not revisit the GPA, MCAT, experiences screening scores, essays, and letters after applicants were selected for interview.
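The screening rules described above can be written down as a small eligibility check. This is only a sketch: the `Applicant` fields and function names are hypothetical, the old three-section MCAT (biological sciences, physical sciences, verbal reasoning) is assumed, and treating the SCR score as a hard cutoff is a simplification, since the paper says the SCR score informed but did not dictate screening decisions.

```python
from dataclasses import dataclass


@dataclass
class Applicant:
    gpa: float
    mcat_total: int
    mcat_bio: int
    mcat_phys: int
    mcat_verbal: int
    scr: float      # service/clinical/research score, 1-5
    essay: float    # personal essay score, 1-5
    letters: float  # letters of recommendation score, 1-5


def meets_academic_threshold(a: Applicant) -> bool:
    """GPA > 3.0, total MCAT > 22, biology > 8, no other section < 6."""
    return (
        a.gpa > 3.0
        and a.mcat_total > 22
        and a.mcat_bio > 8
        and min(a.mcat_phys, a.mcat_verbal) >= 6
    )


def eligible_for_interview(a: Applicant) -> bool:
    """Academic threshold plus SCR, essay, and letters scores of at least 3."""
    return meets_academic_threshold(a) and min(a.scr, a.essay, a.letters) >= 3


# Hypothetical applicant, not taken from the study data.
example = Applicant(3.5, 30, 10, 10, 10, scr=3, essay=4, letters=3)
print(eligible_for_interview(example))
```

Note that nothing here produces a ranking: in the described process these scores act only as a gate, and ranking after the gate is based on the MMI.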





The MMI process, committee deliberations, admissions decisions


MMI with six stations. The interview stems used on a given interview day are used only on that day.

The MMI process at RWJMS consists of a six-station MMI. Each station consists of a behavioral descriptor or situational judgment-type interview stem addressing a specific AAMC COA core personal competency4 or combination of competencies. All interview stems are unique on a given interview day and written by one of the authors (C.A.T.). The MMI process at RWJMS employs only the 30 members of the standing committee, who participate in modified frame-of-reference training prior to the sessions. Extensive interviewer training allows for the assumption of adequate reliability with a six-station MMI.


Table 1

Table 1 demonstrates the behaviorally anchored rating scale for communication.




The following are rated on a 5-point scale.

In each station, interviewers evaluate applicants on the basis of

    • communication, 
    • content/argument, and 
    • overall global impression 

using a behaviorally anchored 1–5 Likert scale.



Statistical analysis


We performed “what-if” analyses with different weightings. Because the measures were on different scales, they were converted to z scores before the alternative weightings were applied.

In addition to comparing differences in mean performance scores as a function of applicant self-reported race/ethnicity, we also conducted a series of “what-if” analyses to determine whether alternative weighting methods would have changed final admissions decisions and entering class composition. Because the different performance measures are on different numeric scales, we converted performance measures (GPA, MCAT, SCR score, and MMI) to z scores before implementing alternative weighting schemes.
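A "what-if" analysis of this kind can be sketched as: standardize each measure to z scores, form a weighted composite, and re-rank. The code below is a minimal illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, the three applicants and their scores are invented, and the population standard deviation is used for standardization.

```python
from statistics import mean, pstdev


def z_scores(values):
    """Standardize a list of raw scores to mean 0, SD 1."""
    m, s = mean(values), pstdev(values)
    return [(v - m) / s for v in values]


def composite_ranking(applicants, weights):
    """Rank applicants by a weighted sum of z-scored measures.

    applicants: dict of name -> dict of measure -> raw score
    weights:    dict of measure -> weight (for example, summing to 1)
    """
    names = list(applicants)
    z = {m: z_scores([applicants[n][m] for n in names]) for m in weights}
    totals = {
        n: sum(weights[m] * z[m][i] for m in weights)
        for i, n in enumerate(names)
    }
    return sorted(names, key=lambda n: totals[n], reverse=True)


# Invented scores for illustration only.
pool = {
    "A": {"gpa": 3.9, "mcat": 36, "mmi": 9.0},
    "B": {"gpa": 3.4, "mcat": 28, "mmi": 12.0},
    "C": {"gpa": 3.6, "mcat": 31, "mmi": 10.5},
}
print(composite_ranking(pool, {"mmi": 1.0}))
print(composite_ranking(pool, {"gpa": 0.3, "mmi": 0.7}))
print(composite_ranking(pool, {"gpa": 1.0}))
```

Re-running the ranking under each weighting scheme and then comparing who falls above the acceptance cutoff is the essence of the comparison; standardizing first keeps a measure with a large numeric range, such as the MCAT, from dominating the composite simply because of its scale.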





Results



Traditional performance measures


The interaction between applicant and MMI station explained 33% of the variance. This interaction effect means that applicants’ performance varies across MMI stations, that is, context specificity.

The interaction between applicant and MMI station accounted for the second largest amount of variance (33%). This interaction effect indicates that applicant performance varied across MMI stations, an effect commonly referred to as “context specificity.”15







Relation of traditional performance measures to applicant diversity







“What-if” analyses: The effects of alternative weighting of performance measures on race/ethnicity composition of accepted applicants


The proportion of URIM applicants accepted ranged from 57% down to 22% depending on the weighting.

The proportion of URIM applicants accepted into the undergraduate medical program would have declined from 57% to 22% depending on weighting.









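Discussion


Our results suggest that giving MMI scores more weight than traditional academic or experience scores would increase racial/ethnic diversity. To our knowledge, this is the only report from a U.S. medical school showing that the MMI, unlike the MCAT or GPA, is neutral toward URIM applicants.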

Our findings suggest that increasing use of MMI scores in admission decisions may enhance racial/ethnic diversity among entering medical students, relative to reliance on traditional academic measures and experience scores. To our knowledge this is the only report from a U.S. medical school showing the neutrality of the MMI for underrepresented applicants, contrary to the MCAT or GPA.31


There was no difference in MMI performance between URIM and non-URIM applicants, consistent with a small Canadian study. Extrapolation from that study is limited by its small sample and by the very different social and cultural backgrounds of the United States and Canada.

Our results revealed that there was no statistically significant difference in MMI performance between URIM and non-URIM groups, a finding consistent with a small Canadian study on five aboriginal applicants.25 Extrapolation from that study, however, is limited because of the size of that study, and the very different social and cultural backgrounds of the United States and Canada. 


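Looking only at the racial/ethnic makeup of the top 45% of students, the change is even more dramatic. Reiter et al. combined MMI results from six Canadian medical schools, focusing on the MMI’s effect on enhancing diversity, increasing access to medical school, and neutralizing the effect of academic variables. McMaster’s approach weighted GPA 60% and the autobiographical questionnaire 40% for interview invitations, and the MMI 70% and GPA 30% for final selection. The Canadian study found that these differential weightings did not affect the accepted cohorts when compared by household income or community size.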

The change in racial/ethnic makeup of the top 45% ranked students who would be offered acceptance is even more surprising. Reiter et al17 combined MMI results of six Canadian medical schools over two years, focusing on MMI effect on enhancing diversity, increasing access to medical school, and neutralizing the effect of academic variables. McMaster’s formulaic approach to invitation for interview was 60% GPA and 40% autobiographical questionnaire, and postinterview selection was 70% MMI score and 30% GPA. The Canadian study found that these differential weighting schemes did not impact the diversity of accepted cohorts, as measured by income and community size.17











Acad Med. 2015 Dec;90(12):1651-7. doi: 10.1097/ACM.0000000000000960.

The Effect of Differential Weighting of Academics, Experiences, and Competencies Measured by Multiple Mini Interview (MMI) on Race and Ethnicity of Cohorts Accepted to One Medical School.

Author information

  • 1C.A. Terregino is senior associate dean for education and associate dean for admissions, Rutgers Robert Wood Johnson Medical School, Piscataway, New Jersey. M. McConnell is assistant professor, Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada. H.I. Reiter is professor, Department of Oncology, McMaster University, Hamilton, Ontario, Canada.

Abstract

PURPOSE:

To examine whether academic scores, experience scores, and Multiple Mini Interview (MMI) core personal competencies scores vary across applicants' self-reported ethnicities, and whether changes in weighting of scores would alter the proportion of ethnicities underrepresented in medicine (URIM) in the entering class composition.

METHOD:

This study analyzed retrospective data from 1,339 applicants to the Rutgers Robert Wood Johnson Medical School interviewed for entering classes 2011-2013. Data analyzed included two academic scores (grade point average [GPA] and Medical College Admission Test [MCAT]), service/clinical/research (SCR) scores, and MMI scores. Independent-samples t tests evaluated whether URIM ethnicities differed from non-URIM across GPA, MCAT, SCR, and MMI scores. A series of "what-if" analyses were conducted to determine whether alternative weighting methods would have changed final admissions decisions and entering class composition.

RESULTS:

URIM applicants had significantly lower GPAs (P < .001), MCATs (P < .001), and SCR scores (P < .001). However, this pattern was not found with MMI score (non-URIM 10.4 [1.6], URIM 10.4 [1.3], P = .55). Alternative weighting analyses show that including academic/experiential scores impacts the percentage of URIM acceptances. URIM acceptance rate declined from 57% (100% MMI) to 43% (10% GPA/10% MCAT/10% SCR/70% MMI), 39% (30% GPA/70% MMI), to as low as 22% (50% MCAT/50% MMI).

CONCLUSIONS:

Sole reliance on the MMI for final admissions decisions, after threshold academic/experiential preparation are met, promotes diversity with the accepted applicant pool; weighting of "the numbers" or what is written about the application may decrease the acceptance of URIM applicants.

PMID: 26488572 [PubMed - in process]


Medical school admission interviews: are structured interviews more reliable than unstructured interviews? (Teaching and Learning in Medicine, 2010)

Medical School Preadmission Interviews: Are Structured Interviews More Reliable Than Unstructured Interviews?


Rick Axelson and Clarence Kreiter

Department of Family Medicine, University of Iowa, Iowa City, Iowa, USA

Kristi Ferguson

Office of Consultation and Research in Medical Education, University of Iowa, Iowa City, Iowa, USA

Catherine Solow and Kathi Huebner

Office of Student Affairs and Curriculum, Carver College of Medicine, Iowa City, Iowa, USA





One common way to improve the reliability of scores is to use a structured interview. The degree of structure is defined by the use of a scoring rubric, question standardization, the use of probing, and other factors. However, direct evidence supporting structured interviews is sparse, and a study by Kreiter et al. suggests there is no logical rationale, in terms of fairness or reliability, for presenting the same questions to all applicants. This result may seem counterintuitive, but interview questions should be regarded as a random measurement facet, and sampling a small number of questions effectively equates for question difficulty across applicants.

One commonly advocated method for enhancing score reliability is to use a structured interview format.3,4 The level of interview structure is defined by the use of a scoring rubric, question standardization, the use of probing, and other factors. There is, however, little direct evidence to support the practice of using structured interviews, and a recent study by Kreiter et al.5 suggests there is no logical rationale related to fairness or reliability that would support presenting the same questions to all applicants. This finding may appear counterintuitive; however, it is easily demonstrated that interview questions should be regarded as a random measurement facet and that sampling a small number of questions effectively equates for question difficulty across applicants.









METHODS


A 25-minute interview with two faculty members. Each faculty interviewer assesses about 8 to 10 applicants per year.

The University of Iowa Roy J. and Lucille A. Carver College of Medicine (UICCOM) is a public medical school with a total enrollment of 572 students. As part of the application process to UICCOM, particularly well-qualified candidates are selected to participate in a 25-min interview with two faculty members. A pool of faculty interviewers is recruited by the director of medical admissions each year (average interviewers used per year are approximately 150) to conduct the interviews. Each faculty member interviews approximately eight to ten applicants per year.


The interview has two parts

Hence, there are two parts to the interview: 

  • (a) a structured component, where candidates are read predetermined questions and their responses are scored on a scale from 1 to 5 using an established scoring rubric, and 
  • (b) an unstructured component, where there is a free-flowing exchange between faculty and the candidate on any appropriate topic of interest to the faculty interviewer and/or the candidate. 

In the unstructured part, ratings are also on a 5-point scale. There is no explicit scoring rubric; the anchors are 5 (excellent) and 1 (poor).

Scores ranging from 1 to 5 are also awarded on this unstructured portion of the interview but, given the variable nature of these exchanges, are not guided by explicit scoring rules or rubrics. For each of these 5-point rating scales, the anchors are 5 (excellent) and 1 (poor).


Interview protocol

The interview protocol is as follows. 

  • Structured part with four standard questions: Each interview begins with a highly structured component that asks the same four standard questions of all applicants being interviewed. 
  • Questions change each year but come from a standard pool: Questions vary somewhat from year to year, but they are drawn from a standard pool of interview questions.1 
    • Motivation for applying: In general, these questions ask about applicants’ motivation for pursuing a career in medicine, 
    • Overcoming challenges: how they might deal with various challenges encountered in practicing medicine, and 
    • Past experiences and personal attributes: how applicants’ experiences and/or attributes will enable them to be outstanding physicians. 
  • Independent rating after each answer: Immediately following the applicant response to a question, the two faculty raters, guided by a scoring rubric, independently rate each of the applicant’s responses before moving on to the next question. 
  • No follow-up questions: Interviewers are not allowed to probe or ask follow-up questions. 
  • No time limit per question: There is no time limit set for responses to each question; candidate responses are typically about 2 to 3 min per question. 
  • Open conversation for the remaining time: After all the structured questions are completed, the remaining minutes of the interview are devoted to an open conversation with the applicant.

Interviewer training

  • All first-time interviewers are trained: Training is provided for all first-time interviewers. 
  • Protocol, scoring rubrics, and practice scoring with sample videos: Training sessions provide faculty with an overview of the interview protocol, scoring rubrics for structured questions, and an opportunity to score some sample (fictitious) videotaped responses using the scoring rubrics. 
  • Discussion with the trainer after each sample: After each sample response is viewed and scored, faculty discuss their rationale for awarding a given score with the trainer. 
  • Review of the unstructured portion and of appropriate topics: Trainers also review the protocol for the unstructured portion of the interview, emphasizing what are considered appropriate and inappropriate topics for discussion. 
  • Observer feedback for first-time interviewers: To facilitate adjustment to actual interview sessions, first-time interviewers receive feedback from an observer who is present during their initial day of interviewing. 
  • Role of the observer: Observers are experienced interviewers who provide feedback regarding any areas where new interviewers may be straying from the established interview protocols and scoring procedures.


Variance components

Table 5 shows variance components and reliability obtained from two complete interview occasions, each employing a structured and unstructured format [p• × o•], and provides information related to a complete replication of an interview using both the structured and unstructured format.






The proportion of person variance for the structured format was 22% compared with 30% for the unstructured format and implies the unstructured format will yield more consistent scores across replications. The universe score correlation between the formats was .82, suggesting the formats may not assess identical attributes of the applicant. 



DISCUSSION


Contrary to previous findings, the unstructured format proved more reliable not only in terms of interrater agreement but also in the random replications (test–retest) analysis. Furthermore, the two formats appear to assess related but distinct constructs. The universe score correlation and the Person × Format interaction show that the two formats do not measure identical constructs related to the applicant. Finally, reliability could be increased by combining the two measures (structured + unstructured).

Contrary to the predominant view in the research literature, we found that the unstructured format was more reliable from both an interrater agreement perspective and in the random replications (test–retest) analysis. Further, it appears that the different formats are measuring related, yet distinct, constructs. The universe score correlation (ru = .82) and Person × Format interaction indicated that the two formats do not measure identical constructs related to the applicant. Last, we found that reliability can be increased by combining the two measures into a composite score. An examination of weighted composite scores indicates a sum score with approximately equal weights on both formats maximizes reliability and the information obtained.
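
The composite-score finding can be illustrated with a small classical-test-theory sketch. This is not the multivariate G study the authors ran: it assumes standardized scores and uncorrelated errors, and, purely for illustration, treats the reported person-variance proportions (0.22 structured, 0.30 unstructured) as single-format reliabilities. The function name `composite_reliability` is hypothetical.

```python
import math


def composite_reliability(w_struct, w_unstruct, rel_struct, rel_unstruct, true_corr):
    """Reliability of a weighted sum of two standardized scores.

    Assumes unit observed variances and uncorrelated measurement errors.
    true_corr is the correlation between the two true (universe) scores.
    """
    # Observed correlation is the true-score correlation attenuated by unreliability.
    observed_corr = true_corr * math.sqrt(rel_struct * rel_unstruct)
    cross = 2 * w_struct * w_unstruct * observed_corr
    # Errors are uncorrelated, so the covariance term is pure true-score variance.
    true_var = w_struct**2 * rel_struct + w_unstruct**2 * rel_unstruct + cross
    observed_var = w_struct**2 + w_unstruct**2 + cross
    return true_var / observed_var


# Illustrative inputs only (see the assumptions stated above).
REL_STRUCT, REL_UNSTRUCT, TRUE_CORR = 0.22, 0.30, 0.82

for w in (0.0, 0.25, 0.5, 0.75, 1.0):
    r = composite_reliability(w, 1 - w, REL_STRUCT, REL_UNSTRUCT, TRUE_CORR)
    print(f"weight on structured = {w:.2f} -> composite reliability = {r:.3f}")
```

Under these assumptions an equally weighted sum comes out more reliable than either format alone, which matches the direction of the reported result; the printed values are not the paper's estimates.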








 2010 Oct;22(4):241-5. doi: 10.1080/10401334.2010.511978.

Medical school preadmission interviews: are structured interviews more reliable than unstructured interviews?

Author information

  • Department of Family Medicine, University of Iowa, Iowa City, Iowa 52242, USA. rick-axelson@uiowa.edu

Abstract

BACKGROUND:

The medical education research literature consistently recommends a structured format for the medical school preadmission interview. There is, however, little direct evidence to support this recommendation.

PURPOSE:

To shed further light on this issue, the present study examines the respective reliability contributions from the structured and unstructured interview components at the University of Iowa.

METHODS:

We conducted three univariate G studies on ratings from 3,043 interviews and one multivariate G study using responses from 168 applicants who interviewed twice.

RESULTS:

Examining interrater reliability and test-retest types of reliability, the unstructured format proved more reliable in both instances. Yet, combining measures from the two interview formats yielded a more reliable score than using either alone.

CONCLUSIONS:

At least from a reliability perspective, the popular advice regarding interview structure may need to be reconsidered. Issues related to validity, fairness, and reliability should be carefully weighed when designing the interview process.

PMID: 20936568 [PubMed - indexed for MEDLINE]


Past-behavioural versus situational questions in an MMI for postgraduate trainee selection: a comparison of reliability and acceptability (BMC Med Educ, 2015)

Past-behavioural versus situational questions in a postgraduate admissions multiple mini-interview: a reliability and acceptability comparison

Hiroshi Yoshimura1,2,3*, Hidetaka Kitazono2, Shigeki Fujitani2, Junji Machi2,3, Takuya Saiki4, Yasuyuki Suzuki4 and Gominda Ponnamperuma5






Settings and participants


Overview of TBUIMC; specialties; mission; educational outcomes; MMI administration

TBUIMC is a Japanese general hospital, which newly introduced three specialty training programmes: internal medicine, surgery, and emergency medicine. To accomplish the trans-specialty mission of ‘fostering high-quality generalist physicians providing holistic patient care’, the educational committee of TBUIMC decided to introduce the Accreditation Council for Graduate Medical Education (ACGME) six general competencies [36] as educational outcomes. In 2013, the MMI took place at the partitioned TBUIMC conference room, in three separate weekends. Of the 26 candidates who applied for the TBUIMC programmes, 13, 10, and 3 were invited for the MMI on the first, the second, and the third day of the MMI, respectively.


Interview schedule; candidates; examiners

Three separate days were set for candidates’ convenience, having better access to selection opportunities in TBUIMC; this facilitated the recruitment process. All candidates were Japanese medical graduates, whose level of training ranged from Post Graduate Year (PGY)-2 to PGY-4. They were either in the second year of, or had concluded the two-year National Obligatory Initial Postgraduate Clinical Training Programme (NOIPCTP), following their graduation from Japanese medical schools, and the Japanese National Licensure Examination [37]. A total of 18 examiners, including TBUIMC’s educational committee members (most of whom were US specialty board certified) and clinical supervisors, were all Japanese physicians in the aforementioned three specialties. All candidates, regardless of their applying specialties or the PGY level, were examined by all examiners, who were randomly allocated to the stations. All examiners stayed within the same station, on all three days.


Intervention


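Of the six ACGME competencies, medical knowledge was excluded; each of the remaining five competencies had one station; each competency has 2 to 8 sub-domains; two examiners per station; the STAR approach was used for PBQs. For SQs, examiners were not allowed to probe on their own.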

To base stations on the competencies of the ACGME, except ‘medical knowledge’, 5 stations were created to assess one competency (domain) per station. Out of the 2 to 8 sub-domains in each competency [36], two sub- domains (one for the PBQ, and the other for the SQ) per station were selected so that one PBQ followed by one SQ was administered within the same station (Table 1). The same questions were asked from all candidates. Two examiners were assigned to one station and they alternated questioning roles. 

  • In PBQs, Situation-Task-Action-Result (STAR) approach was applied for guiding interviews [38]. 
  • In SQs, presenting a scenario with a dilemma and making the candidates describe what they would do, in a situation where the candidate had to choose between two or more mutually exclusive courses of action [21,22] were followed by structured probing [27]. Examiners were not allowed to probe independently. 

A sample of instructions to examiners for one of the stations is shown in Table 2.




Interview guide





Fewer than 10 stations may be sufficiently reliable. Factors other than the question format probably contributed as well.

The current study suggests that less than 10 stations of the MMI with one examiner per station may be sufficiently reliable. In addition to the question format, other structuring processes may have contributed to this, e.g.

  • An established framework: basing stations on an established competency framework;
  • Minimal unnecessary rapport building: minimising unnecessary rapport building between examiners and candidates;
  • Identical questions with planned probing: asking exactly the same questions from each candidate with planned probing;
  • Three distinguishable rating rubrics: using three distinguishable rating rubrics;
  • Ratings anchored with detailed descriptors: rating candidates on points anchored with detailed descriptors; and
  • Examiner training: providing examiner training.

These structuring efforts probably helped reduce the number of stations.

These structuring efforts would help reduce the number of stations, especially where only limited examiner resources are available for a relatively smaller number of candidates.


The highly structured station interview format may have contributed to the positive (but modest) reactions of examiners and candidates. Interestingly, this study shows contrasting reactions of candidates and examiners to SQs and PBQs: candidates preferred SQs, whereas examiners preferred PBQs. Notably, all participants regarded the current MMI as fair, and most stressed the importance of using both SQs and PBQs.

As non-medical personnel selection studies have suggested [27], the highly structured nature of the station interview formats and other structuring efforts in the present study may be responsible for the positive but modest candidate and examiner reaction compared with previous studies [1,7-9,11-15]. Interestingly, this study also indicates contrasting acceptability for SQs and PBQs amongst candidates and examiners, i.e. SQs being more favourable for candidates as opposed to PBQs being more favourable for examiners. Of particular note, all participants admitted fairness of the current MMI and most expressed importance of using both SQs and PBQs. As to how best PBQs and SQs could be combined, the participant reactions could be used as a guide for generating a discussion on both question formats at a given level (undergraduate or postgraduate [foundation, specialty, or subspecialty]) of admissions MMIs in the future, as is being discussed in the area of SSPIs in non-medical personnel selection [27].

























 2015 Apr 14;15:75. doi: 10.1186/s12909-015-0361-y.

Past-behavioural versus situational questions in a postgraduate admissions multiple mini-interview: a reliability and acceptability comparison.

Author information

  • 1Educational Committee, Prefectural Okinawa Nanbu and Children's Medical Centre, Haebaru Town, Okinawa Prefecture, Japan. yoshimura.hiroshi@gmail.com.
  • 2Educational Committee, Tokyo Bay Urayasu-Ichikawa Medical Centre, Urayasu City, Chiba Prefecture, Japan. yoshimura.hiroshi@gmail.com.
  • 3Department of Surgery, University of Hawaii, John A. Burns School of Medicine, Honolulu, State of Hawaii, USA. yoshimura.hiroshi@gmail.com.
  • 4Educational Committee, Tokyo Bay Urayasu-Ichikawa Medical Centre, Urayasu City, Chiba Prefecture, Japan. hkitazono@gmail.com.
  • 5Educational Committee, Tokyo Bay Urayasu-Ichikawa Medical Centre, Urayasu City, Chiba Prefecture, Japan. shigekifujitani@gmail.com.
  • 6Educational Committee, Tokyo Bay Urayasu-Ichikawa Medical Centre, Urayasu City, Chiba Prefecture, Japan. junji@hawaii.edu.
  • 7Department of Surgery, University of Hawaii, John A. Burns School of Medicine, Honolulu, State of Hawaii, USA. junji@hawaii.edu.
  • 8Medical Education Development Centre, Faculty of Medicine, Gifu University, Gifu City, Gifu Prefecture, Japan. saikitak@gifu-u.ac.jp.
  • 9Medical Education Development Centre, Faculty of Medicine, Gifu University, Gifu City, Gifu Prefecture, Japan. ysuz@gifu-u.ac.jp.
  • 10Faculty of Medicine, University of Colombo, Colombo, Western Province, Sri Lanka. gomindap@hotmail.com.

Abstract

BACKGROUND:

The Multiple Mini-Interview (MMI) mostly uses 'Situational' Questions (SQs) as an interview format within a station, rather than 'Past-Behavioural' Questions (PBQs), which are most frequently adopted in traditional single-station personal interviews (SSPIs) for non-medical and medical selection. This study investigated reliability and acceptability of the postgraduate admissions MMI with PBQ and SQ interview formats within MMI stations.

METHODS:

Twenty-six Japanese medical graduates, first completed the two-year national obligatory initial postgraduate clinical training programme and then applied to three specialty training programmes - internal medicine, general surgery, and emergency medicine - in a Japanese teaching hospital, where they underwent the Accreditation Council for Graduate Medical Education (ACGME)-competency-based MMI. This MMI contained five stations, with two examiners per station. In each station, a PBQ, and then an SQ were asked consecutively. PBQ and SQ interview formats were not separated into two different stations, or the order of questioning of PBQs and SQs in individual stations was not changed due to lack of space and experienced examiners. Reliability was analysed for the scores of these two MMI question types. Candidates and examiners were surveyed on this experience.

RESULTS:

The PBQ and SQ formats had generalisability coefficients of 0.822 and 0.821, respectively. With one examiner per station, seven stations could produce a reliability of more than 0.80 in both PBQ and SQ formats. More than 60% of both candidates and examiners felt positive about the overall candidates' ability. All participants liked the fairness of this MMI when compared with the previously experienced SSPI. SQs were perceived more favourable by candidates; in contrast, PBQs were perceived more relevant by examiners.
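
The station-count projection in these results is the kind of figure a decision (D) study produces. The sketch below assumes a design with candidates crossed with stations and examiners nested within stations, and uses HYPOTHETICAL variance components chosen only so that the output lands near the reported figures; they are not the paper's estimates. The function names `g_coefficient` and `min_stations` are illustrative.

```python
def g_coefficient(var_person, var_person_station, var_residual, n_stations, n_raters):
    """Relative G coefficient: persons crossed with stations, raters nested in stations."""
    relative_error = (var_person_station / n_stations
                      + var_residual / (n_stations * n_raters))
    return var_person / (var_person + relative_error)


def min_stations(target, n_raters, components, max_stations=50):
    """Smallest number of stations whose G coefficient reaches the target, or None."""
    for n in range(1, max_stations + 1):
        if g_coefficient(*components, n_stations=n, n_raters=n_raters) >= target:
            return n
    return None


# HYPOTHETICAL components: (person, person x station, residual incl. rater).
COMPONENTS = (1.0, 0.5, 1.166)

# Design actually run: 5 stations, 2 examiners per station.
print(round(g_coefficient(*COMPONENTS, n_stations=5, n_raters=2), 3))
# Projection: how many stations are needed with a single examiner per station?
print(min_stations(0.80, n_raters=1, components=COMPONENTS))
```

Adding stations shrinks both error terms, whereas adding examiners shrinks only the second, which is why projections of this kind usually favour more stations over more examiners per station.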

CONCLUSIONS:

Both PBQs and SQs are equally reliable and acceptable as station interview formats in the postgraduate admissions MMI. However, the use of the two formats within the same station, and with a fixed order, is not the best to maximise its utility as an admission test. Future studies are required to evaluate how best the SQs and PBQs should be combined as station interview formats to enhance reliability, feasibility, acceptability and predictive validity of the MMI.

PMID: 25890189 [PubMed - indexed for MEDLINE]
PMCID: PMC4427914


The relationship between MMI-based admissions outcomes and medical school applicants' race, ethnicity, and socioeconomic status (Acad Med, 2015)

How Medical School Applicant Race, Ethnicity, and Socioeconomic Status Relate to Multiple Mini-Interview–Based Admissions Outcomes: Findings From One Medical School

Anthony Jerant, MD, Tonya Fancher, MD, MPH, Joshua J. Fenton, MD, MPH, Kevin Fiscella, MD, MPH, Francis Sousa, MD, Peter Franks, MD, and Mark Henderson, MD





Few studies have examined how underrepresented racial/ethnic minority (URM) and lower SES applicants are affected by adoption of the MMI. This matters because U.S. medical schools admit disproportionately few URM and lower SES students.

Little studied is how underrepresented racial/ethnic minority (URM) and lower socioeconomic status (SES) applicants may be affected by adoption of the MMI. This is a key issue given that U.S. medical schools admit disproportionately few URM and lower SES individuals.6–8


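Traditional unstructured interviews have long been criticized as vulnerable to interviewer bias. Implicit biases that work against racial/ethnic minority and lower SES persons are common in U.S. society, including among physicians. The effect of bias in interviews can be reduced by increasing structure (removing ambiguity and, with it, the tendency to rely on stereotype-driven judgments) and by pooling the ratings of multiple raters, which dilutes the influence of any individual's bias.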

A long-recognized problem with traditional nonstructured interviews is vulnerability to interviewer biases triggered by various applicant characteristics.17–22 Implicit (i.e., unconscious) biases disfavoring racial/ ethnic minority and lower SES persons are common in U.S. society,23 including among physicians.24 The effects of bias during interviews can be reduced by increasing structure (removing ambiguity and, therefore, the tendency to rely on stereotype-driven judgments) and pooling evaluations from multiple raters (potentially diluting or offsetting individual biases).20,25–27


To our knowledge, only three studies have examined the association of URM status or SES with MMI performance.

Only three studies to our knowledge have explored the associations of medical school applicants’ racial/ethnic minority status or SES with MMI performance.


No study has examined whether race/ethnicity affects acceptance after the MMI, or whether race/ethnicity or SES affects the likelihood of an MMI invitation.

To our knowledge, no studies have examined whether applicants’ race/ethnicity influences acceptance following MMI participation, or whether race/ethnicity or SES influences the likelihood of being invited to an MMI.



Method


Application, screening, and MMI invitation and scheduling


Applications were evaluated for MMI invitation on the following basis:
Faculty evaluated secondary applications for invitation to an MMI based on cumulative GPA and MCAT scores, personal statements, extracurricular activities, recommendation letters, and other characteristics that could contribute to fulfilling the educational and service missions of the school.

MMI process and scoring


2 minutes of reading plus 8 minutes on task at each of 10 stations, covering the following topics

The MMI consisted of 10 individual 10-minute stations. At each station, applicants had 2 minutes to read a brief set of instructions, and 8 minutes to address the assigned tasks on entering the room. Nine stations assessed skills in the following domains: 

    • integrity/ ethics, 
    • professionalism, 
    • interpersonal communication, 
    • diversity/cultural awareness, 
    • teamwork, 
    • ability to handle stress, and 
    • problem solving. 
    • An additional station asked applicants to explain their choice to pursue a career in medicine

Most stations were adapted from content developed at McMaster University and marketed by ProFitHR.34


One trained rater, blinded to the applicant's AMCAS application information, was assigned to each station

A single trained rater, blinded to participants’ AMCAS application information, attended each station.


216 different raters in total

There were 216 different raters during the study period; 

    • Mean number of stations rated: the mean number of MMI stations that each evaluated was 104 (standard deviation [SD] 61.9; range 8–276).
    • Women: Women made up 61% of raters.
    • Rater backgrounds: Rater professional backgrounds were as follows: physicians, 31%; medical students, 15%; other clinicians (e.g., nurses), 11%; basic science faculty, 6%; patients, 2%; and various nonclinician leaders (e.g., deans), professionals (e.g., lawyers), and high-level administrative staff (e.g., curriculum manager), 35%.


Rater backgrounds were varied because diverse perspectives were thought to help select future physicians who can work effectively with people from all walks of life. Mandatory rater training consisted of a one-hour course covering the admissions process, rater roles and duties, and the need to avoid pursuing protected class issues.

The range of rater backgrounds reflected the conviction that diverse perspectives are helpful in selecting future physicians who will be able to work effectively with people from all walks of life. Mandatory rater training included a one-hour course reviewing the admissions process, rater roles and duties, and the need to avoid pursuing protected class issues (e.g., race/ethnicity, gender).36


Scoring at each station (four-point scale)

At each station, raters scored overall applicant performance using an anchored four-point scale: 

    • 0, < 25th percentile performance (relative to other applicants); 
    • 1, 25th–50th percentile; 
    • 2, 51st–75th percentile; or 
    • 3, > 75th percentile. 

Raters were also asked to consider both the applicant's communication abilities and the content of their statements.

Raters were instructed to consider both the applicant’s communication abilities and the content (e.g., comprehensiveness) of their statements in assigning ratings. The total MMI score was the mean of each applicant’s individual station scores. Scale internal consistency (Cronbach alpha = 0.67) was comparable to that observed in other MMI studies.2,18,37–41
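
The scoring steps described here (anchored 0–3 station rating, total score as the mean of station ratings, internal consistency via Cronbach's alpha) can be sketched as follows. In the study the percentile band was a rater's judgment, not a computed value, so `rating_from_percentile` is only an illustration of the anchors; all function names and the sample ratings are hypothetical.

```python
from statistics import mean, variance


def rating_from_percentile(percentile):
    """Map a judged percentile (0-100) to the anchored 0-3 station rating."""
    if percentile < 25:
        return 0
    if percentile <= 50:
        return 1
    if percentile <= 75:
        return 2
    return 3


def total_mmi_score(station_ratings):
    """Total MMI score: the mean of one applicant's station ratings."""
    return mean(station_ratings)


def cronbach_alpha(ratings):
    """Cronbach's alpha for a list of applicants, each a list of station ratings."""
    k = len(ratings[0])
    # Variance of each station (column) and of each applicant's summed score.
    station_vars = [variance(column) for column in zip(*ratings)]
    total_var = variance([sum(row) for row in ratings])
    return k / (k - 1) * (1 - sum(station_vars) / total_var)


# Made-up ratings for four applicants at three stations (not study data).
ratings = [[3, 2, 3], [1, 1, 2], [0, 1, 0], [2, 3, 2]]
print([total_mmi_score(row) for row in ratings])
print(round(cronbach_alpha(ratings), 2))
```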



Acceptance recommendation


Subsequently, the committee made one of the following recommendations: reject, low waitlist, high waitlist, or offer acceptance.


URM status


URM status was determined from AMCAS application information

We determined URM status (URM [black, Southeast Asian, Native American, or Pacific Islander race and/or Hispanic ethnicity] versus not [all other responses]) from self-reported race/ethnicity information in the AMCAS application.



Socioeconomic disadvantage


An SES measure was developed from AMCAS application information

We developed a composite measure of SES using self-reported information in the AMCAS application,


The following information was used

The following predictors (yes/ no items except where indicated) were significant and maximized the area under the receiver operating characteristic curve (0.95):

    • fee assistance received for medical school application (yes/no); 
    • childhood spent in an underserved area; 
    • family recipients of family assistance program; 
    • income level category of applicant’s family (< $25,000; $25,000 to < $50,000; $50,000 to < $75,000; or > $75,000); 
    • applicant contributed to family income; 
    • any financial-need-based scholarship(s) in paying for postsecondary education; 
    • percentage of postsecondary education costs contributed by the family; and 
    • parents’ highest level of educational attainment (< high school, high school graduate, some college, or college graduate).


Applicant characteristics



MMI invitation




MMI score



Acceptance recommendation





Discussion


URM applicants were no less likely than non-URM applicants to receive an MMI invitation, scored similarly on the MMI, and were just as likely to be recommended for acceptance.

Further, URM applicants were no less likely than non-URM applicants to receive an MMI invitation, performed similarly on the MMI, and were just as likely to be recommended for acceptance.


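The similar MMI scores of URM and non-URM applicants suggest that structured interviews incorporating the perspectives of multiple raters are less vulnerable to individual raters' implicit biases. Although we did not measure rater implicit bias, the literature documents that such biases are pervasive in U.S. society, that physicians and other professionals are no exception, and that they affect the outcomes of employment interviews in various fields including medicine. Implicit bias was therefore probably present among our raters, but it had no significant net effect, since mean MMI scores did not differ between URM and non-URM applicants. Because the low representation of URMs in medicine is a widely acknowledged problem, biases against URM applicants may have been offset by ratings biased in their favor, made by raters trying to address limited racial/ethnic diversity.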

The similar MMI scores for URM and non-URM participants support the notion that structured interview processes that incorporate the perspectives of multiple evaluators like the MMI may be less vulnerable to the effects of individual evaluator implicit biases.20,25–27 Although we did not measure rater implicit biases regarding racial/ ethnic minorities, such biases have been documented to be pervasive in U.S. society, including among physicians and other professionals,23,24 and can affect the outcomes of employment interviews in various fields including medicine.17,19–22 Thus, it is likely that implicit biases were present among our raters; however, they did not exert a significant net influence, given that mean MMI scores did not differ between URM and non-URM applicants. Because lack of URMs in medicine is a widely acknowledged problem,6,7,13,33,42–44 it is possible that biases against URM applicants were offset by ratings biased in favor of URM applicants, made by raters seeking to address limited racial/ethnic diversity in the physician workforce.


In contrast, lower SES applicants had lower MMI scores.

In this context, our finding that lower SES applicants had worse adjusted MMI performance may be cause for concern. 


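Nonetheless, the effect of lower SES on MMI score was small: the score declined by about 0.12 points across the 0–1 range of the SES score. The lower MMI scores were also offset by a greater likelihood of MMI invitation and acceptance recommendation. These results may reflect a shift away from purely metric-based applicant review toward the more holistic process advocated by the AAMC.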

Nonetheless, the decrement in MMI performance with decreasing SES in our study was small: The MMI score (scale of 0–3 points) declined by a mean of 0.12 points across the 0–1 range of the SES score. Further, the lower MMI scores among lower SES applicants were more than offset by their greater likelihood of being invited to an MMI and recommended for acceptance. These findings may reflect the ongoing shift from a purely metric-based applicant review process toward the more holistic process advocated by the Association of American Medical Colleges.12,15


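Lower SES applicants may have fewer of the life experiences that build the skills the MMI assesses. Similar reasoning has been proposed for the lower MCAT scores of such applicants. Less affluent applicants are more likely to have held paid jobs during postsecondary education, but those jobs probably did not involve MMI-type screening. Lack of experience with MMI-type screening can be a disadvantage in the medical school MMI, because prior experience with a given interview format is associated with higher scores in that format. Lower-level jobs are also unlikely to foster the higher-level communication, critical thinking, and problem-solving skills the MMI requires, and the time spent on such jobs may limit participation in activities that build those skills.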

Lower SES applicants may have fewer life experiences bolstering skills assessed by the MMI. Similar reasoning has been suggested to explain the lower MCAT scores among such applicants.45 Although less affluent applicants are more likely to report paid employment during postsecondary education, their financial circumstances may require taking jobs that do not require MMI-type preemployment screening. Lack of prior experience with MMI-type screening may be a disadvantage in the medical school MMI because prior experience with a particular interview format is associated with better future performance with that format.46 Lower-level jobs also may not facilitate the higher-level communication, critical thinking, and problem-solving skills the MMI assesses, and the time required for such jobs may limit participation in pursuits that build such skills (e.g., scholarly presentations, volunteer clinic work).


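Prior work indicates that use of language unfamiliar to the rater can trigger a biased low rating. Applicants' verbal skills shape interviewers' immediate impressions and, in turn, their final appraisals. SES-based disparities in the physician workforce have received less attention than racial/ethnic disparities. It is therefore less likely that raters consciously biased their ratings in favor of lower SES applicants.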

Prior work indicates that applicant factors such as use of language unfamiliar to the typical rater could trigger a biased low rating.20,21 Applicants’ verbal skills have been shown to determine immediate interviewer impressions and, in turn, final appraisals.49 The issue of SES-based physician workforce disparities has received less attention than race/ ethnicity-based disparities.6 Thus, it is less likely that raters consciously biased their evaluations in favor of lower SES applicants to address SES-based physician workforce disparities.


34 Advanced Psychometrics for Transitions Inc. Welcome to ProFitHR. http://www.profithr.com/. Accessed April 4, 2015.




















 2015 Dec;90(12):1667-74. doi: 10.1097/ACM.0000000000000766.

How Medical School Applicant Race, Ethnicity, and Socioeconomic Status Relate to Multiple Mini-Interview-Based Admissions Outcomes: Findings From One Medical School.

Author information

  • A. Jerant is professor, Department of Family and Community Medicine, Center for Healthcare Policy and Research, University of California, Davis, School of Medicine, Sacramento, California. T. Fancher is associate professor, Division of General Internal Medicine, Department of Internal Medicine, University of California, Davis, School of Medicine, Sacramento, California. J.J. Fenton is associate professor, Department of Family and Community Medicine, Center for Healthcare Policy and Research, University of California, Davis, School of Medicine, Sacramento, California. K. Fiscella is professor, Department of Family Medicine, University of Rochester School of Medicine and Dentistry, Rochester, New York. F. Sousa is assistant dean, Admissions and Student Development, and volunteer clinical professor, Department of Internal Medicine, University of California, Davis, School of Medicine, Sacramento, California. P. Franks is professor, Department of Family and Community Medicine, Center for Healthcare Policy and Research, University of California, Davis, School of Medicine, Sacramento, California. M. Henderson is associate dean, Admissions and Outreach, and professor, Division of General Medicine, Department of Internal Medicine, University of California, Davis, School of Medicine, Sacramento, California.

Abstract

PURPOSE:

To examine associations of medical school applicant underrepresented minority (URM) status and socioeconomic status (SES) with Multiple Mini-Interview (MMI) invitation and performance and acceptance recommendation.

METHOD:

The authors conducted a correlational study of applicants submitting secondary applications to the University of California, Davis, School of Medicine, 2011-2013. URM applicants were black, Southeast Asian, Native American, Pacific Islander, and/or Hispanic. SES from eight application variables was modeled (0-1 score, higher score = lower SES). Regression analyses examined associations of URM status and SES with MMI invitation (yes/no), MMI score (mean of 10 station ratings, range 0-3), and admission committee recommendation (accept versus not), adjusting for age, sex, and academic performance.

RESULTS:

Of 7,964 secondary-application applicants, 19.7% were URM and 15.1% self-designated disadvantaged; 1,420 (17.8%) participated in the MMI and were evaluated for acceptance. URM status was not associated with MMI invitation (OR 1.14; 95% CI 0.98 to 1.33), MMI score (0.00-point difference, CI -0.08 to 0.08), or acceptance recommendation (OR 1.08; CI 0.69 to 1.68). Lower SES applicants were more likely to be invited to an MMI (OR 5.95; CI 4.76 to 7.44) and recommended for acceptance (OR 3.28; CI 1.79 to 6.00), but had lower MMI scores (-0.12 points, CI -0.23 to -0.01).
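
One way to read these results is to check whether each confidence interval contains the null value: 1 for an odds ratio, 0 for a score difference. The sketch below applies that rule to the figures quoted in the abstract; the helper name `excludes_null` is hypothetical, and the check is a reading aid, not a re-analysis of the adjusted regression models.

```python
def excludes_null(ci_low, ci_high, null_value):
    """True if the confidence interval does not contain the null value."""
    return not (ci_low <= null_value <= ci_high)


# (label, estimate, CI low, CI high, null value); figures from the abstract above.
RESULTS = [
    ("URM -> MMI invitation (OR)", 1.14, 0.98, 1.33, 1.0),
    ("URM -> MMI score (points)", 0.00, -0.08, 0.08, 0.0),
    ("URM -> acceptance (OR)", 1.08, 0.69, 1.68, 1.0),
    ("Lower SES -> MMI invitation (OR)", 5.95, 4.76, 7.44, 1.0),
    ("Lower SES -> acceptance (OR)", 3.28, 1.79, 6.00, 1.0),
    ("Lower SES -> MMI score (points)", -0.12, -0.23, -0.01, 0.0),
]

for label, estimate, low, high, null in RESULTS:
    verdict = "association" if excludes_null(low, high, null) else "no clear association"
    print(f"{label}: {estimate} [{low}, {high}] -> {verdict}")
```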

CONCLUSIONS:

MMI-based admissions did not disfavor URM applicants. Lower SES applicants had lower MMI scores but were more likely to be invited to an MMI and recommended for acceptance. Multischool collaborations should examine how MMI-based admissions affect URM and lower SES applicants.

PMID: 26017355 [PubMed - in process]


The relationship between interviewers' characteristics and ratings in the MMI (Acad Med, 2004)

The Relationship between Interviewers’ Characteristics and Ratings Assigned during a Multiple Mini-Interview

Kevin W. Eva, PhD, Harold I. Reiter, MD, MSc, Jack Rosenfeld, PhD, and Geoffrey R. Norman, PhD






The MMI allows a reliable estimate of candidates' performance, but attention must be paid to the biases that can arise from the different vantage points of heterogeneous raters.

This Multiple Mini-Interview (MMI) has been shown to provide a reliable estimate of candidates’ performance,1 but the new protocol demands that attention be paid to the biases that might arise as a result of the different vantage points held by heterogeneous raters.



Background


The problem is content specificity. In selection decisions, as Albanese et al. noted, "one is most interested in stable qualities that have a high probability of occurrence in an almost infinite number of different situations." Although it is debated whether such "stable qualities" exist, it has become clear in various contexts that the average performance displayed across many encounters generalizes better to an individual's qualities than any single encounter does.

The problem is one of content specificity. In making selection decisions, as indicated by Albanese et al. “one is most interested in stable qualities that have a high probability of occurrence in an almost infinite number of different situations.”2,p.317 Although debate exists regarding whether such “stable qualities” exist, it has become clear in various contexts that the average performance an individual displays over the course of many encounters is a more generalizable indication of that individual’s qualities than is any single encounter.5


The Multiple Mini-Interview


The MMI is essentially an OSCE for admissions, but we changed the name because the judgments are not objective and the stations are intentionally nonclinical.

Although essentially an admissions OSCE, we have opted to change the name of the protocol to make explicit the facts that the judgments are not objective and the stations are intentionally nonclinical.


This process should be informed by the educational philosophy of the institution in which the admissions committee works and by broader documents that outline the key competencies of practicing physicians. A process for doing so has been developed by Reiter and Eva.

This process should be informed by the educational philosophy adopted by the institution in which the admissions committee works as well as broader documents that outline the key competencies of practicing physicians.6,7 A process for doing so has been developed by Reiter and Eva.8


Previous research shows that the MMI provides a reliable assessment of candidates' abilities, that overall test reliability improves more by increasing the number of stations than the number of raters per station, and that both candidates and examiners view it positively. The remaining question is whether ratings differ between faculty and non-faculty members. At McMaster, heterogeneity has always been a fundamental principle, because breadth of experience across students is believed to enrich the scholastic experience. To maximize heterogeneity among students, interviewers have been drawn from various populations, including faculty members, medical students, and community members. Because we assign a single interviewer to each station, it is important to establish whether the ratings of faculty and community members are consistent with one another.

Previous research has shown that the MMI provides a reliable assessment of candidates’ abilities, that the overall test reliability improves to a greater extent by maximizing the number of stations rather than by maximizing the number of observers per station, and that the MMI is viewed positively by both candidates and examiners alike.1 Remaining unanswered, however, is the question of whether faculty members and nonfaculty members are distinguishable by their ratings. At McMaster, heterogeneity has always been a fundamental principle because it is believed that breadth of experiences across students enriches the scholastic experience.9 To try to maximize heterogeneity across students, interviewers have traditionally been drawn from various populations, including faculty members, medical students, and individuals from the community at large. As we propose assigning a single interviewer to each station, the question of whether faculty members and individuals from the community assign performance ratings consistent with one another becomes an increasingly important question.



METHOD


Participants


In addition, 18 health sciences faculty members and 18 community members drawn from the legal profession and human resource departments of both local businesses and the university were recruited to act as examiners. In two instances, faculty members had to withdraw; they were replaced with current medical students.


Procedure


On the study weekend, three sessions were run sequentially on each of two days with a 40-minute break for the examiners between sessions. Two examiners were assigned to each station. 

    • Three of the nine stations were staffed by two faculty members,
    • three by two community members, and
    • three by one member of each group.

Before the first MMI on each day the authors of this article met with the examiners to ensure that the procedure was clear, to answer any last-minute queries, and to reinforce that the ratings should be assigned independently.



RESULTS


Scores

Internal consistency was high, so only the overall performance score was used.

Table 1 shows the average score and standard deviation assigned to candidates for each of the four items on the evaluation form. The internal consistency (i.e., the average relationship between pairs of questions) was found to equal .96, indicating a high degree of redundancy. As a result, only the “overall performance” score was used in subsequent analyses.
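As a rough illustration of how an internal-consistency figure of this kind can be computed, the sketch below implements Cronbach’s alpha, one common index. The excerpt does not say which statistic the authors used, so the choice of alpha is an assumption, and the ratings are hypothetical values made up for the example.

```python
from statistics import pvariance


def cronbach_alpha(item_scores):
    """Cronbach's alpha for a list of items, each a list with one score per candidate."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("need at least two items")
    # Total score per candidate, summed across items.
    totals = [sum(scores) for scores in zip(*item_scores)]
    item_var_sum = sum(pvariance(item) for item in item_scores)
    total_var = pvariance(totals)
    return (k / (k - 1)) * (1 - item_var_sum / total_var)


# Hypothetical ratings: 4 evaluation-form items x 5 candidates (not study data).
ratings = [
    [5, 3, 6, 4, 2],
    [5, 3, 6, 4, 3],
    [4, 3, 6, 5, 2],
    [5, 2, 6, 4, 2],
]
alpha = cronbach_alpha(ratings)
print(f"alpha = {alpha:.2f}")
```

When the items rank candidates almost identically, as in these made-up ratings, alpha approaches 1, which is the kind of redundancy that justifies analysing a single overall score.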



To determine whether the ratings faculty members assigned were biased relative to those community members assigned, a repeated measures ANOVA was performed on the data collected within the three stations that were staffed by both a community and a faculty member. The mean score assigned by faculty members (4.66) bordered on being significantly less than that assigned by community members (4.96; F(1,53) = 3.972, mean squared error = 1.790, p .06).




Reliability Analysis



The Relationship between Interviewers’ Characteristics and Ratings


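Generalizability was highest, at .58, for stations staffed by two community members; it was .46 for stations with two faculty members and .31 for stations with one faculty member and one community member. Each pairwise difference was statistically significant.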

The generalizability for the three stations that were staffed by two community members was highest at .58. The three stations that were staffed by two faculty members revealed the second highest generalizability = .46. Least reliable were the three stations that were staffed by one member of each group (generalizability = .31). Each pairwise difference is statistically significant: .58 versus .46, z(106) = 2.78, p < .05; .46 versus .31, z(106) = 3.12, p < .05; .58 versus .31, z(106) = 5.90, p < .05.


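In either case, the generalizability of the MMI was lowest for stations with one rater from each group, which indicates larger inconsistency between the two groups than within either.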

In either case, the generalizability of the MMI appears to be lowest among stations evaluated by one community member and one faculty member, suggesting that there are larger inconsistencies in the way that community members rate candidates relative to the way that faculty members rate candidates than there are within either group of raters.



Post-MMI Surveys








DISCUSSION


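These findings show that demonstrating inter-rater reliability alone is not sufficient evidence that an interview measures stable, generalizable applicant characteristics. Rather, they suggest that applicants vary considerably and unpredictably from one interview to another, so the result of one interview predicts the result of the next only poorly.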

These findings suggest that the demonstration of adequate interrater reliability, which has been used in the past as an argument for standardized interviews, is insufficient evidence to ensure that an interview is measuring stable and generalizable applicant characteristics. By contrast, the findings suggest that applicants will vary considerably, in unpredictable fashion, from one interview to another. Consequently, the scores derived from any one interview will be a poor predictor of performance in a second interview.


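At the very least, these results support Ferrier et al.’s claim that heterogeneous raters produce a more heterogeneous class. The difference in mean scores between faculty and community raters may be overcome with more rater training, but the absolute difference will not matter as long as every circuit has an equal proportion of raters from each group.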

At the very least these results support Ferrier et al.’s9 claim that using heterogeneous raters may result in a more heterogeneous class. The difference we observed in the mean scores faculty and community raters provide may be overcome with further training, but the absolute difference in scores will not matter as long as all circuits contain an equal proportion of examiners from each group. It should be noted that the distinction drawn in this study between raters of different backgrounds is very broad.


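Another advantage of the MMI is that it can achieve the four purposes of admissions interviews identified by Edwards et al. (information gathering, decision making, verification, and recruitment) without confounding them within a single interview. It also overcomes the inefficient use of time noted in traditional interviews.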

Additional advantages to the MMI include the potential to achieve the four purposes of admissions interviews identified by Edwards et al.4 (i.e., information gathering, decision making, verification, and recruitment) without confounding these purposes within a single interview (e.g., one station could be designed as a recruitment station without the goal of attracting the best candidates affecting the rest of the interview process). The MMI also corrects for the inefficient use of time that has been identified by Litton-Hawes et al.12 as a problem in more traditional interviews.


Any chance effect of being randomly assigned to a “hard” or “easy” interviewer is diluted when candidates are rated by a larger number of examiners.

Similarly, any chance effects of being randomly assigned to an “easy” or “hard” panel of interviewers will be diluted with the MMI as candidates are exposed to a greater number of examiners.


Why are community members’ ratings less consistent with faculty members’ ratings than the ratings within either group are with each other?

Of further interest is the finding that community members’ ratings were less consistent with those provided by faculty members than were the ratings provided within either group.




8. Reiter HI, Eva KW. Reflecting the relative values of community, faculty, and students in the admissions tools of medical school. Submitted manuscript.


Background: In defining the characteristics of medical students that society and the medical profession find desirable, little effort has been spent assessing the relative value of the dozens of characteristics that have been identified. Furthermore, many institutions go to great lengths to ensure equal representation across stakeholder groups in an effort to maximize the heterogeneity of the pool of students accepted to study medicine; however, the extent to which different stakeholders value different characteristics has yet to be determined. 


Purpose: This study was an attempt to assess the relative value of the characteristics of medical students that society and the medical profession find desirable. 


Methods: Using documents created internationally to identify the core competencies of medical personnel, a series of 7 characteristics were generated for inclusion in a study that adopted the paired comparison technique. Of 347 surveyed, 292 respondents indicated the rank ordering they would assign to each characteristic by circling the more important characteristic in all possible pairings. 


Results: Overwhelmingly, “ethical” was deemed to be the most important characteristic on which selection tools should be based. Surprisingly, the pattern of responses was highly consistent regardless of stakeholder group and degree of affiliation with the undergraduate medical program. 


Conclusions: The generalizable features of this study not only include the empirical findings but also demonstrate useful survey protocol that can be adapted by any admission committee to guide the generation of an institution-specific admissions blueprint. A novel protocol that provides the necessary flexibility is discussed.














 2004 Jun;79(6):602-9.

The relationship between interviewers' characteristics and ratings assigned during a multiple mini-interview.

Author information

  • 1Department of Clinical Epidemiology and Biostatistics, McMaster University, Hamilton, Ontario, Canada. evakw@mcmaster.ca

Abstract

PURPOSE:

To assess the consistency of ratings assigned by health sciences faculty members relative to community members during an innovative admissions protocol called the Multiple Mini-Interview (MMI).

METHOD:

A nine-station MMI was created and 54 candidates to an undergraduate MD program participated in the exercise in Spring 2003. Three stations were staffed with a pair of faculty members, three with a pair of community members, and three with one member of each group. Raters completed a four-item evaluation form. All participants completed post-MMI questionnaires. Generalizability Theory was used to examine the consistency of the ratings provided within each of these three subgroups.

RESULTS:

The overall test reliability was found to be .78 and a Decision Study suggested that admissions committees should distribute their resources by increasing the number of interviews to which candidates are exposed rather than increasing the number of interviewers within each interview. Divergence of ratings was greater within the pairing of community member to faculty member and least for pairings of community members. Participants responded positively to the MMI.

CONCLUSION:

The MMI provides a reliable protocol for assessing the personal qualities of candidates by accounting for context specificity with a multiple sampling approach. Increasing the heterogeneity of interviewers may increase the heterogeneity of the accepted group of candidates. Further work will determine the extent to which different groups of raters provide equally valid (albeit different) judgments.

PMID: 15165983


MMI test characteristics: ask candidates to recall rather than to imagine (Med Educ, 2014)

Multiple mini-interview test characteristics: ‘tis better to ask candidates to recall than to imagine

Kevin W Eva1 & Catherine Macala2





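By definition, the MMI gathers information about a candidate through a series of independent observations (usually in the form of interviews), blueprinted against the goals and desires of the selecting institution and the nature of the profession the selected students will enter. The MMI should therefore be regarded as an assessment process rather than an assessment tool or instrument. Consequently, the question ‘For what does the MMI select?’ is meaningless, because the answer depends entirely on implementation.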

By definition, it involves collecting (and aggregating across) a series of brief independent observations of the candidate (typically in the form of interviews), preferably blueprinted against the goals and desires of both the institution making the selection and the profession to which the candidate is applying. As a result, the MMI should be considered a process of assessment rather than a tool or instrument, and generic questions such as ‘For what does the MMI select?’ are meaningless because the answer is entirely dependent on implementation.


Just as an MCQ examination can be built to represent diverse content, an MMI can be composed of highly varied stations.

Just as one can populate a multiple-choice question (MCQ) examination with questions representative of diverse content areas, one can populate an MMI with highly variable stations.


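Previous research points to general principles. Reliability increases with the number of observations, reaching a plateau at about 10 to 12 stations; lengthening each station brings little benefit; and observing performance across several independent situations is more effective than adding raters within each situation.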

Research has identified general principles, including that the reliability of measurement improves with increasing number of observations, often reaching a plateau in the 10–12 range,2 that extending the length of the interactions has little discernible benefit,3 and that observing performance across independent situations has a greater beneficial impact on the reliability of measurement than does incorporating the opinions of multiple raters within each situation.4,5
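The diminishing return from adding observations can be sketched with the Spearman-Brown prophecy formula from classical test theory. This is a simplification chosen for illustration: the studies summarised here used Generalizability Theory, and the single-station reliability of 0.20 below is a hypothetical value, not a figure from the source.

```python
def spearman_brown(single_reliability, n):
    """Projected reliability of the mean of n parallel observations."""
    return n * single_reliability / (1 + (n - 1) * single_reliability)


# Hypothetical reliability of a single station (assumption for illustration).
r1 = 0.20

projected = {n: round(spearman_brown(r1, n), 2) for n in (1, 4, 8, 12, 16)}
for n, rel in projected.items():
    print(f"{n:2d} stations -> projected reliability {rel:.2f}")
```

Under this assumption, going from 4 to 8 stations adds far more reliability than going from 12 to 16, which is the flattening pattern the paragraph describes.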



Background


The MMI process was largely designed on two foundations: sampling and structure.


The importance of sampling stems from concerns about trait-based models of human behaviour. The words used to describe people (smart, eloquent, professional) suggest unchanging traits, but actual behaviour is highly context-specific.

The priority placed on sampling is drawn from empirically derived concerns about trait-based models of human behaviour.6 Whereas the adjectives we use to describe people (e.g. ‘smart’, ‘eloquent’, ‘professional’) imply unwavering features of the individual, behaviour has been shown repeatedly to be context-specific.7


One observation tells us no more about an individual’s clinical prowess than one MCQ answer tells us about the extent of an individual’s knowledge base.


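Contrary to the argument that an 8-minute interview does not show every aspect of a candidate’s ability, we regard this logistic necessity as a strength rather than a weakness. Several studies show that the value of longer interviews is illusory, because interviewers form their impressions of a candidate very quickly. Moreover, more time gives the candidate more opportunity to steer the conversation away from the intended focus of the interview.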

Contrary to the argument that 8-minute selection interviews do not allow sufficient time to yield a full perspective on a candidate’s ability, we view this logistic necessity as a strength rather than a liability. A variety of studies have demonstrated that the added value of longer interviews is illusory as examiners tend to form impressions very quickly.9,10 Further, more time yields greater opportunity for the applicant to sway the conversation to issues that are distinct from the intended focus of the interview.11


9 Ambady N, Bernieri F, Richeson J. Toward a histology of social behaviour: judgmental accuracy from thin slices of the behavioural stream. Adv Exp Soc Psychol 2000;32:201–72.

10 Ambady N, Rosenthal R. Thin slices of expressive behaviour as predictors of interpersonal consequences: a meta-analysis. Psychol Bull 1992;111:256–74.


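The value of the second foundation, structure, is less clear. When the MMI was first created, panel-based interviews were reported to vary widely in inter-rater reliability but to do better when structured (that is, when interviewers were given specific questions). Although intuitively plausible, recent findings call this assumption into question. Kreiter et al. pointed out that earlier studies made only indirect comparisons. Using generalisability analysis of 25-minute medical school admission interviews made up of five structured questions, they found that the variance attributable to question was negligible. From this, the authors concluded that asking multiple questions (that is, increasing sampling) washes out differences in difficulty between questions, so structuring questions offers no advantage. A few years later, an unstructured component was added to the structured interview at the same institution, and Axelson et al. reported that it had higher inter-rater and test–retest reliability than the structured component. The conclusion remains ambiguous.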

The value of the second foundation, structure, however, has become less clear over time. When the MMI was created, the literature on panel-based interviewing practices revealed that the inter-rater reliability of such exercises was highly variable, but tended to be greater when interviews were structured by giving interviewers a specific set of questions.16 This remains intuitively appealing, but recent research has led us to question this assumption. Kreiter et al.17 critiqued the literature for offering only indirect comparisons. Using data collected from a set of 25-minute medical school selection interviews containing five structured questions, they used generalisability analyses to illustrate that the variance attributable to question had a negligible influence on the reliability observed. These findings led the authors to argue that asking multiple questions (i.e. increased sampling) washes out differences in difficulty level across questions such that structuring questions offers no advantage. A few years later, an unstructured component was added to the end of the structured interview at the same institution, and Axelson et al.18 reported that resulting scores had greater inter-rater and test–retest reliability than the structured component. As the authors noted, it is unclear whether the performance of the unstructured interview derived from the fact that it followed the structured interview or whether the benefit of such structuring is illusory.


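This question matters because creating structured stations is one of the main barriers to adopting the MMI process. Concerns about test security have led many schools to build or purchase databases of stations (even though the impact of security breaches is uncertain). If the benefits of the MMI turn out to be unrelated to structure, that is, if they come mainly from sampling, the cost of introducing an MMI would fall substantially.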

This is an important question because the creation of structured stations is one of the primary barriers to adoption of the MMI process.19 Concern about test security breaches derived from the repeated use of set questions has led most institutions we have encountered to generate or purchase a database of stations to reduce this risk (although the impact of such breaches remains questionable20,21). If the benefits that have been observed to accrue from the adoption of MMI practices are unrelated to structure and, instead, are derived dominantly from the sampling it promotes, then the cost inherent in creating an MMI might be substantially reduced.


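The most common type of MMI station asks the candidate to discuss a relevant issue with an examiner. What counts as ‘relevant’ depends on the institution’s blueprint, but published examples mostly present a dilemma related to situations the candidate is likely to face. In the organisational and industrial psychology literature, such interview dialogues are either experience-based (recalling past experiences) or situation-based (imagining a situation one might encounter). Which type is more effective has been much debated.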

The most common type of MMI station involves asking a candidate to discuss an issue of relevance with an examiner. The definition of ‘relevance’ depends on the blueprint the institution establishes, but published examples indicate a tendency towards describing a dilemma about which the candidate is expected to engage in dialogue. The organisational and industrial psychology literature defines such dialogues as generally being ‘experience-based’ (i.e. candidates are required to recall their particular experiences and the behaviours they demonstrated) or ‘situation-based’ (i.e. candidates are required to imagine and describe what they would do if they were to encounter a particular situation).22 There has been considerable debate in this literature regarding which type of interview is most effective.


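Those who favour situation-based interviews argue that interviews should be future-oriented, so that candidates without similar past experience still have the opportunity to demonstrate their personal qualities in the given situation.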

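Those who favour experience-based interviews argue that past behaviour is the most accurate predictor of future behaviour, and that interviews should avoid hypothetical situations and focus on past experience.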


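Impression management (controlling how one appears to others) takes different forms depending on the interview type: situation-based interviews tend to elicit ingratiation (inducing liking, for example by conforming to the interviewer’s opinions), whereas experience-based interviews tend to elicit self-promotion (presenting one’s success as due to one’s own competence rather than other factors).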

Those who favour situation-based interviewing argue that structure is important and that interviews should be future-oriented so that interviewees without previous experience in a given context are granted the opportunity to demonstrate their personal qualities; those who favour experience-based interviewing argue that past behaviour is most predictive of future behaviour and, as a result, one should avoid discussion of the hypothetical and focus on previous experience.20 Impression management (i.e. attempts to control the image one projects) appears to take place in different ways according to interview type, with situation-based interviews tending to induce ingratiating tactics (i.e. behaviours aimed at inducing liking, such as opinion conformity) and experience-based interviewing tending to induce self-promotion (i.e. behaviours aimed at indicating that one’s success is attributable to competence rather than other factors).11



Participants


Four circuits, 12 stations, 48 examiners.

Four distinct circuits of 12 stations required the participation of 48 examiners.



Materials


All stations were based on the CanMEDS framework.

All stations were focused upon the Professional role promoted within the CanMEDS framework presented by the Royal College of Physicians and Surgeons of Canada.25 


The four SJ stations presented situations that could arise during later training and asked candidates to imagine the situation and say what they would do.

Four SJ stations were designed around this role, the operational definition being that the station had to present a situation that could plausibly occur during medical training and would require the candidate to imagine and discuss what he or she would do in that situation.


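Each station had four examiners (one per circuit), and the instructions were posted on the door. Examiners received a one-page description of the station’s intent and up to six questions per station. They were to hold a real conversation rather than read the questions as a script; the questions were only prompts to keep the conversation going. They also received background on CanMEDS and a scoresheet with 6-point scales for three qualities: (i) communication skills, (ii) reasoning ability, and (iii) professionalism.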

This information was provided to the four examiners who were assigned to that station (one per circuit) and posted on the doors of their rooms for candidates to read. In addition, examiners were given one page of information outlining both the intent of the station and a list of up to six questions they could ask the candidate. They were told that they should engage in actual dialogue with candidates rather than treating the list of questions as a script (i.e. the questions were presented simply as prompts that examiners might find useful if conversation stalled). Examiners were also given a page of background information outlining aspects of the CanMEDS competencies that were relevant to the situation described, along with a copy of the scoresheet on which they were to offer their assessment. None of the background information or prompting questions contained content that was specific to the instructions given to candidates and thus the same information could be given to examiners in other experimental conditions. The scoresheet consisted of a series of 6-point scales (1 = weak, 2 = below average, 3 = average, 4 = very good, 5 = excellent, 6 = exceptional) on which examiners were asked to rate each candidate’s (i) communication skills, (ii) reasoning ability, and (iii) professionalism. Brief definitions were provided for each quality.


The SJ stations were slightly modified to create the four BI stations.

To generate the four BI stations, each of the SJ stations was modified so that the candidate was instructed to think of a time in which he or she had experienced a situation analogous to the scenario presented in the SJ station.


All other information was the same as for the SJ stations.

All other information provided to the examiners on these stations was identical to that provided to the SJ station interviewers with the exception of minor wording revisions to ensure that the grammar remained appropriate.


For the FF stations, examiners were asked to hold a conversation that would let them assess the candidate’s suitability.

To generate the four FF stations, examiners were told simply that we wanted them to conduct a conversation that would help them evaluate the candidate’s suitability for the Professional role. They were given the same background information as used in other stations, but the prompting questions were removed. The station instruction, as presented to candidates, said simply:



Procedure


Candidates were randomly assigned to a circuit and a starting station.
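A minimal sketch of such a random assignment follows. The source does not describe how the randomisation was done, so the function name, the seed, and the rule that each (circuit, starting station) slot holds at most one candidate are assumptions; the 4 circuits, 12 stations and 41 candidates come from the study description.

```python
import random


def assign_candidates(candidates, n_circuits=4, n_stations=12, seed=None):
    """Randomly give each candidate a distinct (circuit, starting station) slot."""
    slots = [
        (circuit, station)
        for circuit in range(1, n_circuits + 1)
        for station in range(1, n_stations + 1)
    ]
    if len(candidates) > len(slots):
        raise ValueError("more candidates than available starting slots")
    rng = random.Random(seed)  # seeded generator so the draw can be repeated
    chosen = rng.sample(slots, len(candidates))  # sampling without replacement
    return dict(zip(candidates, chosen))


# Hypothetical candidate identifiers (41 candidates took part in the study).
candidates = [f"candidate_{i:02d}" for i in range(1, 42)]
assignment = assign_candidates(candidates, seed=2014)
```

Sampling without replacement guarantees that no two candidates start at the same station of the same circuit, which a single-examiner station requires.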


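Candidates had 2 minutes to read the instructions; the interview ended after 7 minutes and they moved to the next room. There were 3 minutes between stations: 1 minute for the candidate survey and 2 minutes to read the next station’s instructions.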

At the start of the MMI, candidates were given 2 minutes to read the first station, after which a buzzer was sounded to alert them to enter the interviewing rooms. Seven minutes later, another buzzer was sounded to indicate that the interview was complete and that the candidate should move to the next station. From this point onward, a pause of 3 minutes was provided between stations and candidates were asked to spend 1 minute completing a candidate survey about the preceding station and 2 minutes reading and preparing for the next station.
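The timing above implies a total circuit length that can be worked out directly. The sketch assumes a 12-station circuit with no pause counted after the final station (the source does not say whether the last survey added time), and the function name is hypothetical.

```python
def circuit_minutes(n_stations, read_first=2, interview=7, pause=3):
    """Total minutes for one candidate: initial reading, interviews, pauses between stations."""
    return read_first + n_stations * interview + (n_stations - 1) * pause


total = circuit_minutes(12)
print(total)  # 2 + 12*7 + 11*3 = 119 minutes
```

On these assumptions a candidate spends just under two hours in the circuit, of which 84 minutes are interview time.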



Analysis


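Context specificity appears as an Applicant X Station interaction. The study design makes it difficult to separate rater effects from station effects, so a pure test of context specificity is not possible. This design rests on three reasons.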

Context specificity is generally indicated by a large Applicant X Station interaction. The design of this study did not allow us the capacity to separate rater influences from station influences and therefore a pure test of context specificity is not available. This design decision was based on three reasons: 

  • (i) rater effects are likely to be present in all experimental conditions;
  • (ii) the inclusion of one examiner per station is common practice in MMIs, objective structured clinical examinations (OSCEs) and other comparable assessment activities, and
  • (iii) previous work has robustly indicated that rater variance tends to contribute little error relative to station variance.4,5



RESULTS


Reliability


The Applicant X Station interaction was the largest source of variance, followed by residual error and then Applicant.

Table 1 reveals that the dominant source of variance in all cases was the Applicant X Station interaction. The residual error (Item X Station X Applicant [Circuit]) was next most dominant, followed by Applicant differences, which accounted for 10.0–18.7% of the variance.


Variance attributable to Applicant followed the order BI > SJ > FF, showing that BI stations discriminated best between applicants. The main effects of Station, Item and Circuit, and their interactions, were negligible.

The variance attributable to Applicant declined from BI to SJ and then to FF stations, suggesting that BI stations offered better capacity to consistently discriminate between applicants relative to the other forms of interview. The main effects of Station, Item and Circuit, and their interactions, were negligible, generally contributing < 3% of the variance in scores.
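The link between the Applicant variance share and reliability can be made explicit with the standard relative generalisability coefficient. The form below assumes a simplified one-facet design, applicants crossed with stations; the study's own design also includes Item and Circuit, so its reported coefficients will not follow from this expression alone.

```latex
% Assumed one-facet design: applicants (p) crossed with stations (s).
% \sigma^2_{p}    : variance attributable to applicants
% \sigma^2_{ps,e} : applicant-by-station interaction confounded with residual error
% n_s             : number of stations averaged over
\[
  E\rho^2 \;=\; \frac{\sigma^2_{p}}{\sigma^2_{p} + \dfrac{\sigma^2_{ps,e}}{n_s}}
\]
```

A larger applicant component raises the coefficient for a fixed number of stations, which is why a station type with more Applicant variance discriminates between candidates more reliably.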


Inter-station reliability, which reflects whether ratings are consistent across stations, was best for BI stations.

Inter-station reliability, reflecting the extent to which the scores assigned are consistent across stations, suggested that BI stations allowed better measurement than SJ or FF stations.




Relationship to the actual admission MMI

SJ, r = 0.45; BI, r = 0.57, and FF, r = 0.42.

The correlations between the average of the four stations within each station type and the average of the 9-station MMI used for the actual admission decision were: SJ, r = 0.45; BI, r = 0.57, and FF, r = 0.42.



Acceptability


Candidates found the FF stations more difficult and more anxiety-provoking.

In general, candidates considered the FF stations to be more challenging and more anxiety-provoking than either the SJ or BI stations (Table 4). 


Examiners’ perceptions did not differ much across station types.

In general, examiners’ perceptions of their ability to assess candidate performance and the amount of strain MMI stations placed on candidates were insensitive to station type, although BI stations were rated relatively low on one question (Table 5).



DISCUSSION


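Because the various aspects of utility used to judge the quality of an assessment process often do not align (for example, raising reliability tends to lower feasibility), compromises are usually needed. We were surprised that the various outcomes gave consistent conclusions, both internally and relative to validity studies. Moderate reliability can be reached simply by aggregating many observations (G = 0.66 for FF), but structuring stations brought gains in acceptability and in reliability, the latter only for BI. Even if SJ stations were equal to BI stations in reliability, the greater feasibility (ease of creation) and equivalent acceptability of BI stations would favour their use.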

Given that the various aspects of utility used to assess the quality of assessment processes commonly do not align (e.g. increasing reliability tends to decrease feasibility), thereby requiring that compromises are made,14 we were surprised by the extent to which the various outcomes considered yielded consistent conclusions both internally and with respect to validity studies that have been conducted in other domains of selection. Although moderate reliability can be achieved simply by aggregating across many observations (G = 0.66 in the FF condition), there did appear to be some benefit from the structuring of stations in terms of both acceptability and reliability, the latter being true only when BI techniques were used (G = 0.77). Even if SJ stations were to be considered equal to BI stations in terms of their reliability, the greater feasibility (i.e. ease of generation) and equivalent acceptability of BI stations would support the prioritising of their use.


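Speculatively, BI stations, which make candidates reflect on their own experiences, may also answer an early criticism of the MMI: that candidates have no opportunity to present autobiographical details.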

Speculatively, the use of BI stations, which require candidates to reflect on and discuss personal experiences they have had, may also help MMI administrators to address one of the more robust early criticisms of the MMI process, which claims that candidates desire an opportunity to present autobiographical details during their interview.1







17 Kreiter CD, Solow C, Brennan RL, Yin P, Ferguson K, Huebner K. Examining the influence of using same versus different questions on the reliability of the medical school preadmission interview. Teach Learn Med 2006;18 (1):4–8.


18 Axelson R, Kreiter C, Ferguson K, Solow C, Huebner K. Medical school preadmission interviews: are structured interviews more reliable than unstructured interviews? Teach Learn Med 2010;22 (4):241–5.


20 Reiter HI, Salvatori P, Rosenfeld J, Trinh K, Eva KW. The effect of defined violations of test security on admissions outcomes using multiple mini-interviews. Med Educ 2006;40:36–42. 


21 Griffin B, Harding DW, Wilson IG, Yeomans ND. Does practice make perfect? The effect of coaching and retesting on selection tests used for admission to an Australian medical school. Med J Aust 2008;189:270–3.


23 Taylor PJ, Small B. Asking applicants what they would do versus what they did do: a meta-analytic comparison of situational and past behaviour employment interview questions. J Occup Organ Psychol 2002;75 (3):277–94.


24 Klehe U-C, Latham G. What would you do – really or ideally? Constructs underlying the behaviour description interview and the situation interview in predicting typical versus maximum performance. Hum Perform 2006;19:357–82.


















 2014 Jun;48(6):604-13. doi: 10.1111/medu.12402.

Multiple mini-interview test characteristics: 'tis better to ask candidates to recall than to imagine.

Author information

  • 1Centre for Health Education Scholarship, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Abstract

CONTEXT:

The multiple mini-interview (MMI), used to facilitate the selection of applicants in health professional programmes, has been shown to be capable of generating reliable data predictive of success. It is a process rather than a single instrument and therefore its psychometric properties can be expected to vary according to the stations generated, the alignment between the stations and the qualities an institution prioritises, and the outcomes used. The purpose of this study was to explore the MMI's test characteristics when station type is manipulated.

METHODS:

A 12-station MMI was established in which four stations were presented in three different ways. These included: situational judgement (SJ) stations, in which applicants were asked to imagine what they would do in specific situations; behavioural interview (BI) stations, in which applicants were asked to recall what they did in experienced situations, and free form (FF) stations, which were unstructured in that the examiner was simply given a brief explanation of the intent of the station without further guidance on how to conduct the discussion. Four circuits of the 12 stations were run with one examiner within each station. Candidates and examiners were surveyed regarding their experience. The reliability of the scores derived from the assessment was analysed separately for each station type.

RESULTS:

A total of 41 medical school candidates participated after completing the regular admission process. Although the score assigned did not differ across station type, BI stations more reliably differentiated between candidates (g = 0.77) than did the other station types (SJ, g = 0.69; FF, g = 0.66). The correlation between actual MMI scores and BI stations was also greatest (BI, r = 0.57; SJ, r = 0.45; FF, r = 0.42). Candidates' opinions indicated that FF stations were more anxiety-provoking, less clear, and more difficult than structured stations (SJ and BI stations). Examiner opinions indicated equivalence on these measures.

CONCLUSIONS:

The results suggest that structuring stations has value, although that value was gained only through the use of BI stations, in which candidates were asked to recall and discuss a specific experience of relevance to the purpose of the interview station.

© 2014 John Wiley & Sons Ltd.

PMID: 24807436


Should MMI scores be adjusted for interviewer stringency or leniency? (Med Educ, 2010)

Should candidate scores be adjusted for interviewer stringency or leniency in the multiple mini-interview?

Chris Roberts,1 Imogene Rothnie,2 Nathan Zoanetti3 & Jim Crossley4






Theoretical framework for interviewer performance


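There are broadly three kinds of interviewer-related error: (1) stringency/leniency, (2) interviewer subjectivity (candidate-specific and question-specific), and (3) interactions.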

There are broadly three areas of interviewer-related error within the MMI,1,4,8 which are expanded upon in Fig. 1.



However, because of the complex assessment design, no MMI data set has yet allowed precise estimation of the first-order effects or the second-order effects (interactions). This is essentially because interviewers are nested within questions in large-scale interviewing plans. Estimates to date put candidate-to-candidate variance at 22% to 25%. MMI question difficulty accounts for 0-3%, and among the interviewer-related factors, stringency/leniency accounts for 14%.

However, because of the designs inherent in complex assessment procedures,6 no set of MMI data has thus far allowed for precise estimates of each first-order effect and their second-order interactions using G theory. This is because of confounding within the naturalistic large-scale interviewing plan, in which interviewers are usually nested in MMI questions. Current estimates suggest candidate-to-candidate variance ranges from 22%4 to 25%.1 MMI question difficulty variance is in the range of 0–3%.1,4 Of the interviewer-related factors, interviewer stringency/leniency accounts for 14% of error,4


Interviewer candidate-specific subjectivity has been estimated to account for as much as 45% of variance.

Variance reflecting interviewer candidate-specific subjectivity has been estimated to be as high as 45% in a study of assessments which used two interviewers within each station.8


Regarding the judgements that MMI interviewers make, Kumar et al. have offered preliminary insight into the tensions that arise when interviewers reach their decisions.

Kumar et al.9 have provided some preliminary insights into the tensions that arise in the process of making such decisions. 

  • Valuing independent decision making versus reaching consensus on the standard expected of entry-level students
    These highlight, firstly, the contrast between appreciation of independent decision making and the need to achieve a consensus around the standards expected of entry-level students.
  • Whether interviewers feel they are assessing entry-level reasoning skills as opposed to communication skills
    The second source of tension concerns the extent to which interviewers may feel they are assessing entry-level reasoning skills in professionalism domains compared with communications skills.
  • How interviewers overcome their subjective judgements about candidates
    The third source relates to how interviewers overcome their subjectivity towards certain candidates and
  • How interviewers handle their concerns over 'failing' candidates
    the fourth to how they handle their concerns over ‘failing’ candidates.
  • Candidates actively use their interaction with the interviewer to secure a favourable judgement, which is not necessarily related to the quality of their answers
    Finally, candidates are actively interacting with interviewers using their impression management skills to promote a favourable decision for themselves, which is not necessarily related to the quality of their answers.9



Methodological approaches


Use of IRT

Researchers have turned to item response theory (IRT)11 to provide this opportunity.


Use of MFRM

Roberts et al.12 applied multi-faceted Rasch modelling (MFRM) to the MMI, but they focused on differences in the performance of MMI questions in an item bank rather than on differences between the interviewers themselves. However, they did note that questions appeared to be measuring a unidimensional construct, ‘entry-level reasoning skills in professionalism’, as suggested by a good fit to the IRT model.12 The consistency of judgements within and between judges and candidates has been the focus of a number of papers.13–17 IRT software such as FACETS provides easily derived estimates of candidate ability, interviewer stringency/leniency and question difficulty.


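The 'observed average score' is based on raw scores, whereas the 'fair average score' is the score expected if the elements of all other facets were at their average values. In this setting, the fair average score is the score adjusted for interviewer stringency/leniency.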

An ‘observed average score’ is the average rating based on raw scores received by the candidate. The ‘fair average score’ is the measure that would have been observed if all the measures of the other elements on all other facets had been located at the average measure.18 In this setting, the fair average for candidates is the score that has been adjusted for interviewer stringency ⁄ leniency and question difficulty.
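For orientation, the rating scale form of the many-facet Rasch model is commonly written as below. This is the generic textbook form, not an equation quoted from the paper, and the symbol names are assumptions chosen to match the three facets named in the text.

```latex
% Assumed notation (not taken from the paper):
%   P_{nijk}  probability that candidate n receives category k
%             from interviewer j on question i
%   B_n       ability of candidate n
%   C_j       stringency of interviewer j
%   D_i       difficulty of question i
%   F_k       difficulty of the step from category k-1 to category k
\log\frac{P_{nijk}}{P_{nij(k-1)}} = B_n - C_j - D_i - F_k
```

Under this reading, a fair average is the expected rating for candidate n when C_j and D_i are set to their facet means, which is why it removes the luck of drawing a stringent interviewer or a hard question.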


McManus et al. showed that after adjusting for stringency/leniency the outcome was unchanged for 95.9% of candidates, while 2.6% who had failed on raw marks would pass and 1.5% who had passed on raw marks would fail. Harasym estimated that as many as 11% of candidates could be affected.

For example, in the case of a clinical examination for entry into a professional college, McManus et al.14 found that if examination scores were adjusted for examiner stringency/leniency and the same pass mark was kept, the outcome for 95.9% of candidates would be unchanged using adjusted marks, whereas 2.6% of candidates would pass, although they had failed on the basis of raw marks, and 1.5% of candidates would fail, despite having passed on the basis of raw marks. However, Harasym17 estimated that as many as 11% of candidates in an MMI might be affected by adjusting for interviewer stringency/leniency,




Psychometric analysis


Software

Multi-facet Rasch modelling was used in FACETS Version 3.65 (Winsteps.com, Chicago, IL, USA) to perform a concurrent estimation of several independent first-order facets and their associated error variances. A model was specified that included identification of the individual facets, the rating scale and how the interviewer was expected to interact with the rating scale.



Setting


Details of the MMI design principles have been reported elsewhere.4,9,12 Candidates were applying to a 4-year, graduate-entry, problem-based learning (PBL) programme. From 2007 onwards, candidates were applying for medicine or dentistry or both. The MMI in this study was designed to assess entry-level reasoning skills in professionalism and had eight stations, with each candidate rotating through the circuit and meeting a different single interviewer at each station. Questions were sourced from a preprepared bank and took the format of a non-clinical scenario followed by structured prompts. Each question had five prompts marked with a 4-point Likert scale, giving a total of 20 raw marks per station and 160 for the whole assessment. In this design, although the performance of a candidate on any particular MMI question was assessed once only by a single interviewer, the total performance was rated by eight interviewers. Furthermore, each MMI question was assessed by several interviewers during the course of the MMI process. This created a network through which every parameter was linked to every other parameter with these connecting observations, allowing the measures estimated from the observations to be placed on one common scale.11 This naturalistic interviewing plan also allowed for the partially nested G study design.4



Interviewers

Each interviewer interviewed a median of 22 candidates. The interviewers included 89 faculty members, 47 community members and 39 graduates.

Each interviewer had interviewed a median of 22 candidates (SD 18.44, range 4–121). Complete details were available in the database for 117 interviewers. Of the 207 used, 88 interviewers were known to be male and 95 were known to be female. Twenty-two were aged 18–34 years, 27 were aged 35–44 years and 68 were aged > 45 years. They included 89 faculty members, 47 community members and 39 graduates.


MFRM

Multi-facet Rasch modelling


Moving up the Y axis, interviewers become more stringent, candidates more able, and questions more difficult.

Reading the ruler (Fig. 2) from bottom to top shows increasing interviewer stringency, increasing candidate ability and increasing question difficulty.


Both Fig. 2 and Table 1 show that interviewers are more variable than MMI questions.

Both Fig. 2 and Table 1 show that interviewers are more variable than MMI questions and the spread of interviewers is nearly 3.5 times that of MMI questions.


Interviewer J over-fits the model, with ratings that are too predictable, which suggests a possible halo effect; interviewer G under-fits, with too much randomness in scoring.

Interviewer J appeared to be over-fitting the model and his or her ratings were too predictable, suggesting a halo effect. Interviewer G seems to be under-fitting the model with too much randomness in his or her scoring.







Making adjustments for interviewer leniency and question difficulty


Candidate E met stringent interviewers, so the observed average score is low at 3.50 while the fair average score is 3.64.

Here, candidate E has a lower observed average score of 3.50, but a higher fair average score of 3.64 because he or she answered harder MMI questions and saw more stringent interviewers.


If fair average scores were used instead of observed average scores, 31 of the 270 admitted candidates (11.5%) would move from an offer to no offer. The important point is that this is a two-way movement: someone else receives each of those places.

Let us assume a scenario in which the fair average rather than observed average scores are used to rank the candidates. In our situation, in which 270 student places were on offer, if the MMI were the sole determinant of ranking, 31 of 270 (11.5%) candidates who were offered a place on the basis of their observed score rankings would not have been offered a place on the basis of their fair average rankings. This is a two-way movement.
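The two-way movement can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: the function name `offer_changes` and all scores are hypothetical, except that candidate E's two averages (3.50 and 3.64) are the ones quoted above.

```python
def offer_changes(observed, fair, places):
    """Compare who gets an offer under two rankings.

    observed, fair: dicts mapping candidate ID -> average score.
    places: number of places on offer.
    Returns (lost, gained): candidates offered a place only under the
    observed ranking, and only under the fair-average ranking.
    """
    def top(scores):
        # Rank candidates from highest to lowest score and keep the top `places`.
        ranked = sorted(scores, key=lambda c: scores[c], reverse=True)
        return set(ranked[:places])

    observed_top, fair_top = top(observed), top(fair)
    return observed_top - fair_top, fair_top - observed_top


# Hypothetical scores for four candidates competing for three places.
observed = {"A": 3.80, "B": 3.62, "C": 3.55, "E": 3.50}
fair = {"A": 3.78, "B": 3.60, "C": 3.52, "E": 3.64}

lost, gained = offer_changes(observed, fair, places=3)
print(lost, gained)  # {'C'} {'E'}
```

Because the number of places is fixed, the two sets always have the same size: every candidate who drops out of the offer list is replaced by one who moves in.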





Interviewer goodness-of-fit statistics


For the interviewers, the infit mean-square statistic ranged from 0.74 to 1.58 (mean 1.03, SD 0.74). This was a high-stakes assessment, similar to a clinical rating situation, and the values were well within the accepted lower- and upper-control limits of 0.5 and 1.7 that indicate acceptable model fit.19



Number of candidates examined

Interviewer stringency showed a significant negative correlation with the number of candidates interviewed: interviewers who saw more candidates were more lenient. This is the opposite of what McManus et al. found.

Interviewer stringency/leniency showed a significant but inverse correlation with the number of candidates examined (r = -0.21, n = 207, p = 0.002). Thus, interviewers who interviewed more candidates tended to be somewhat more lenient. McManus et al.14 found examiners became more stringent with more candidates. Our finding contrasts with this, but we do not have data to show whether more lenient interviewers participated in more assessments or whether more interviewing caused interviewers to become more lenient.
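As a rough plausibility check on the reported correlation, the standard t statistic for a Pearson correlation can be computed from its magnitude and sample size. The sketch below assumes an inverse correlation of 0.21 with n = 207, as described in the text; the function name `correlation_t` is hypothetical and the paper does not say which test the authors used.

```python
import math


def correlation_t(r, n):
    """t statistic for testing a Pearson correlation against zero (df = n - 2)."""
    return r * math.sqrt((n - 2) / (1 - r * r))


t_value = correlation_t(-0.21, 207)
print(round(t_value, 2))  # -3.08
```

A t of about -3.08 on 205 degrees of freedom corresponds to a small two-sided p-value, which is in line with the reported p = 0.002 despite the modest size of the correlation.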



Implications


Translating IRT output into variance components is important. This passage concerns the use of MFRM.

The translation of IRT output into variance components is important. Some have reported a number of limitations in applying IRT models to assessments which measure the performance of skills or behaviours, as in the MMI.14 These arose because of claims that the MFRM analysis could not take into account the second-order effects of interviewer-by-station, interviewer-by-candidate and candidate-by-station variance. There was concern that, as in an incorrectly designed G study,6 error would be apportioned wrongly and hence any calculation of reliability or standard error of measurement was likely to be inflated. The use of MFRM to isolate variance components is very new and there has been some misunderstanding in the medical education literature about how they can be estimated and reported with software such as FACETS. This has inflated reliability estimates, undermining the credibility of the IRT method for this type of assessment. For example, McManus et al.14 reported variation between examinees in a clinical examination for entry into a professional college as an unrealistic 87%. This resulted from a calculation which partly assumed that the three first-order effects of examiner, item and person were proportions of 100% and thus neglected to take account of the bias or interactions and the residuals that MFRM also reports.
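The inflation described here is easy to reproduce with arithmetic. The sketch below uses the first-order percentages reported in this paper's abstract (candidate 19.1%, interviewer 8.9%, question 2.6%) and rescales them as if they were the whole of the variance. It illustrates the kind of error described, not the actual calculation in McManus et al.; the function name `share_of_listed` is hypothetical.

```python
def share_of_listed(components, target):
    """Percentage share of `target` when only the listed components count as the total."""
    return 100 * components[target] / sum(components.values())


# First-order effects as percentages of total variance, from the abstract.
first_order = {"candidate": 19.1, "interviewer": 8.9, "question": 2.6}

# Treating the three first-order effects as if they summed to 100%
# ignores the interaction and residual variance.
naive_candidate_share = share_of_listed(first_order, "candidate")
print(round(naive_candidate_share, 1))  # 62.4
```

Dropping the interactions and residuals from the denominator raises the apparent candidate share from 19.1% to about 62%, which shows how a reliability estimate built on that share would be overstated.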


FACETS can be used to decompose the variance components.

An iterative relationship between the FACETS software developer and the educational research measurement community has ensured that later iterations of FACETS are able to provide the decomposition of variance components, including interactions, with naturalistic data.


During MMI training, interviewers ask whether they should be told who is a hawk and who is a dove. However, stringency/leniency in the MMI appears to be a fairly stable characteristic whether measured by IRT or by G theory, which matches the findings of McManus et al. The implication, as McManus et al. suggest, is that interviewers should keep behaving as they always have rather than try to correct their stringency or leniency.

In MMI training, interviewers often ask whether they should be given feedback on which of them are ‘hawks’ and which are ‘doves’ so that they can try to correct their tendencies to mark higher (leniently) or lower (stringently) on the rating scale. The finding that interviewer stringency/leniency seems to be a stable characteristic in the MMI, whether measured by IRT or by G theory, is remarkable and echoes the findings of McManus et al.14 in examiner stringency in clinical rating situations. The implication, as McManus et al.14 suggest, is that interviewers should not try to correct their hawkish or dove-like tendencies, but should instead continue to behave as they have always done.


As Kumar et al. point out, theory is underdeveloped on interviewers' experience of the MMI process and on the impact of training.

As Kumar et al.9 have noted, theoretical development in the area of interviewers’ experience of the process and impact of training is lacking.



13 Downing SM. Threats to the validity of clinical teaching assessments: what about rater error? Med Educ 2005;39:353–5.





















Med Educ. 2010 Jul;44(7):690-8. doi: 10.1111/j.1365-2923.2010.03689.x.

Should candidate scores be adjusted for interviewer stringency or leniency in the multiple mini-interview?

Author information

  • 1Sydney Medical School-Northern, University of Sydney, Sydney, New South Wales, Australia. christopher.roberts@sydney.edu.au

Abstract

CONTEXT:

There are significant levels of variation in candidate multiple mini-interview (MMI) scores caused by interviewer-related factors. Multi-facet Rasch modelling (MFRM) has the capability to both identify these sources of error and partially adjust for them within a measurement model that may be fairer to the candidate.

METHODS:

Using FACETS software, a variance components analysis estimated sources of measurement error that were comparable with those produced by generalisability theory. Fair average scores for the effects of the stringency/leniency of interviewers and question difficulty were calculated and adjusted rankings of candidates were modelled.

RESULTS:

The decisions of 207 interviewers had an acceptable fit to the MFRM model. For one candidate assessed by one interviewer on one MMI question, 19.1% of the variance reflected candidate ability, 8.9% reflected interviewer stringency/leniency, 5.1% reflected interviewer question-specific stringency/leniency and 2.6% reflected question difficulty. If adjustments were made to candidates' raw scores for interviewer stringency/leniency and question difficulty, 11.5% of candidates would see a significant change in their ranking for selection into the programme. Greater interviewer leniency was associated with the number of candidates interviewed.

CONCLUSIONS:

Interviewers differ in their degree of stringency/leniency and this appears to be a stable characteristic. The MFRM provides a recommendable way of giving a candidate score which adjusts for the stringency/leniency of whichever interviewers the candidate sees and the difficulty of the questions the candidate is asked.

PMID: 20636588 [PubMed - indexed for MEDLINE]


COMPARISON OF SITUATIONAL AND BEHAVIOR DESCRIPTION INTERVIEW QUESTIONS FOR HIGHER-LEVEL POSITIONS (Personnel Psychology, 2001)


ALLEN I. HUFFCUTT Department of Psychology Bradley University
JEFF A. WEEKLEY Kenexa

WILLI H. WIESNER, TIMOTHY G. DEGROOT Department of Psychology McMaster University
CASEY JONES Kenexa

 

 

 

 

 

Pulakos and Schmitt hypothesized that the situational interview (SI) is less effective than the behavior description interview (BDI) for higher-level positions. To evaluate that hypothesis we conducted two new structured interview studies. Both concerned selection for higher-level positions and used matched SI and BDI questions written to assess the same job characteristics. The results show that the SI predicts performance less well for such positions. Moreover, although the SI and BDI questions were matched on job characteristics, their correlation was very low, and BDI ratings were related to Extroversion. Reasons for the lower effectiveness of the SI are discussed.

Based on a study of federal investigative agents, Pulakos and Schmitt (1995) hypothesized that situational interviews are less effective for higher-level positions than behavior description interviews. To evaluate their hypothesis we analyzed data from 2 new structured interview studies. Both of these studies involved higher-level positions, a military officer and a district manager respectively, and had matching SI and BDI questions written to assess the same job characteristics. Results confirmed that situational interviews are much less predictive of performance in these types of positions. Moreover, results indicated very little correspondence between situational and behavior description questions written to assess the same job characteristic, and a link between BDI ratings and the personality trait Extroversion. Possible reasons for the lower situational interview effectiveness are discussed.

 

 


 


The two best-known formats for modern structured interviews are the SI and the BDI. In an SI, applicants are asked how they would respond to hypothetical job situations. The SI is grounded in goal-setting theory, which assumes that intentions (goals) are the immediate precursor of actions.

Situational and behavior description interviews have emerged as the two most popular formats for constructing modern structured interviews (Campion, Palmer, & Campion, 1997; Harris, 1989). In a situational interview (SI) applicants are given hypothetical job situations and asked to indicate how they would respond (Latham, Saari, Pursell, & Campion, 1980). Situational interviews are grounded in goal-setting theory, particularly in that intentions (i.e., goals) are the immediate precursor of a person’s actions (Latham, 1989).

In a BDI, applicants are asked about relevant past experiences; the BDI rests on the premise that past behavior is the best predictor of future behavior.

In a behavior description interview (BDI) applicants are asked to relate actual incidents from their past relevant to the target job (Janz, 1982). Behavior description interviews are grounded in the premise that the past is the best predictor of the future (Janz, 1989).

 

However, the study by Pulakos and Schmitt points to a potential threat to the validity scenario above. They developed both an SI and a BDI for a fairly complex job and, in a sample of 216, found correlations with performance evaluations of -0.02 for the SI and 0.32 for the BDI. What matters most about this study is their hypothesis that the SI may not be effective for higher-level positions; if true, it has substantial implications for the science and practice of structured interviewing.

However, a study by Pulakos and Schmitt (1995) suggests a possible caveat to the above validity scenario. They developed both situational and behavior description interviews for a fairly complex position, a federal investigative agent. In a sample of 216 incumbents (108 for each format), the correlations with performance evaluations were -0.02 for the SI and 0.32 for the BDI. What is particularly important about this study is their hypothesis that situational interviews may not be as effective for higher-level positions as they are for lower-level positions. If true, this has very important implications for the science and practice of structured interviewing.

 





So why would the SI be less effective than the BDI for higher-level positions? The first explanation is that SI questions are too simple for such candidates. Analysis of means and standard deviations shows this was not the case in any of the three studies; if anything, the SI questions had slightly better properties than the BDI questions. The second explanation is that SI answers are harder to rate for higher-level positions; the interrater reliability data make this unlikely too. The third explanation is that BDI questions are valid because they capture job performance in the current or a recent position; neither the Pulakos and Schmitt study nor our second study supports this. Another possibility is that the SI and the BDI capture different constructs. Our results show that SI and BDI questions written to assess the same job characteristics do not correspond well, at least for higher-level positions. This low correspondence is an important finding that has not been reported elsewhere in the interview literature. One implication is that the SI and the BDI should not be treated as interchangeable methods of measurement; it is better to view them as separate testing devices that capture different constructs. A second implication is that, whatever constructs the BDI captures, it is superior to the SI for higher-level positions.

So why would situational interviews be less effective than behavior description interviews for higher-level positions? The first possible explanation is that SI questions are just too simple for higher-level positions. Analysis of the means and standard deviations suggests that this was not the case in any of the three studies. Rather, it was not uncommon for the SI questions to have slightly better properties than the BDI questions. The second possible explanation is that responses to SI questions are more difficult to rate with higher-level positions. Analysis of interrater reliability data in all three studies again suggests that this was not the case. The third possible explanation is that BDI questions are valid because they capture job performance in either the current or a recent position, an explanation which is particularly viable in concurrent designs. Data available in Pulakos and Schmitt (1995) and in our second study (both of which were concurrent) does not support this idea either. Another possible explanation is that SI and BDI ratings tend to capture different constructs. Our results strongly suggest that SI and BDI questions written to assess the same job characteristics do not tend to correspond, at least not for higher-level positions. This lack of correspondence is an important finding, one we are unaware of anywhere else in the interview literature. One implication of this finding is that situational and behavior description formats probably should not be considered as alternate methods of measurement. Rather, it might be more appropriate to view them as separate testing devices, ones which for the most part capture different constructs. A second implication is that whatever constructs BDI questions tend to capture for higher-level positions are more predictive of performance than whatever constructs SI questions tend to capture.

 

Finally, one methodological issue concerning SI validity deserves mention. In the Pulakos and Schmitt study, some candidates tried to consider every possible contingency while others gave only superficial responses. Because the latter answers were still correct, candidates who reasoned in a more complex way did not necessarily receive higher ratings, even though they arguably were the better candidates. The implication is that the SI scoring system is better suited to jobs of lower complexity. Benchmark answers from SI studies show that SI scoring focuses strictly on what action the candidate would take, not on why or how they would take it. A focus on overt action may be perfectly adequate for lower-level positions. For higher-level positions, however, how a candidate arrived at an action and why they chose it may matter as much as the action itself.

Last, there is a methodological issue related to SI validity in higher-level positions that warrants mention. During the Pulakos and Schmitt (1995) study it was observed that some candidates thought through every possible contingency when answering the SI questions and other applicants gave more superficial responses. Because the latter answers were still essentially correct, candidates engaging in more complex thought did not necessarily receive higher ratings even though a case could be made that they represented better job candidates. The implication of Pulakos and Schmitt’s (1995) observation is that the standard SI scoring system may be better suited for jobs of lower complexity. An examination of the benchmark answers provided as examples in several SI studies illustrates the tendency for SI scoring to be based strictly on what overt action candidates would take (e.g., Campion et al., 1994, Latham & Saari, 1984), not on why they would take that action or how they arrived at that action. A focus upon overt actions may be perfectly adequate (and even preferred) for lower-level positions. But for higher-level positions knowing how candidates arrived at a particular action and why they chose that action is often just as important as the action itself.


It also matters that interviewers are not allowed to probe SI answers, even though probing is where the reasons behind a chosen action would most likely surface. There is not yet direct evidence that the SI scoring system is the problem; research is needed.

It is also important to point out that interviewers typically are not allowed to probe responses to SI questions, and probing is where information related to why they choose a particular action would be most likely to emerge. Admittedly we do not have direct evidence at the present time that the standard SI scoring system is the culprit. Nonetheless, this issue and its implications are important enough to warrant investigation of modifications to the SI scoring system in future research.

 


In summary, the results make the following contributions. Most importantly, they support Pulakos and Schmitt's hypothesis that the SI does not suit higher-level positions. Taking their original study together with the two new ones, interview developers can be advised to use the SI with caution for higher-level positions. The recommendation carries weight because all three studies compared the SI and the BDI directly. In addition, ratings on SI and BDI questions corresponded very poorly. Interestingly, the one study that compared them for a lower-level position found much higher correspondence. Finally, the association between BDI ratings and Extroversion scores suggests that verbal presentation skills may have had a larger influence.

In summary, results of this investigation contribute to the interview literature in several ways. Probably the most important contribution is they support Pulakos and Schmitt’s (1995) hypothesis that situational interviews do not tend to work as well for higher-level positions. Based on the combined results of their original study and our two new studies, the formal recommendation can now be made that interview developers should exercise considerable caution when using the standard SI format for higher-level positions. What makes this recommendation particularly viable is that all three of these studies involved direct comparison of situational and behavior description interviews for the same position. In addition, our results suggest a strong lack of correspondence between SI and BDI questions written to assess the same job characteristics in higher-level positions. What is interesting is that the one published study which involved a direct comparison of SI and BDI validity for the same lower-level position found a much higher correspondence (Campion et al., 1994). Finally, our results suggest an association between BDI ratings and Extroversion scores, which may point to a larger influence from verbal presentation skills.

 

 

 

 

 

 


COMPARISON OF SITUATIONAL AND BEHAVIOR DESCRIPTION INTERVIEW QUESTIONS FOR HIGHER-LEVEL POSITIONS

  1. ALLEN I. HUFFCUTT1,*, 
  2. JEFF A. WEEKLEY2, 
  3. WILLI H. WIESNER3,
  4. TIMOTHY G. DEGROOT3 and
  5. CASEY JONES2

Article first published online: 7 DEC 2006

DOI: 10.1111/j.1744-6570.2001.tb00225.x

Personnel Psychology

Volume 54, Issue 3, pages 619–644, September 2001


Asking applicants what they would do versus what they did do: A meta-analytic comparison of situational and past behaviour employment interview questions (J Occup Organ Psychol., 2002)


Paul J. Taylor1* and Bruce Small2

1Chinese University of Hong Kong and University of Waikato, Hamilton, New Zealand 2AgResearch, Hamilton, New Zealand

 

 

 

 

 

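The study analysed the criterion-related validity and inter-rater reliability of structured interviews using situational questions (SQ, "What would you do in the following situation?") or past behaviour questions (PBQ, "Can you recall a time when ...? What did you do?"). Reliability and validity were compared according to whether descriptively-anchored rating scales were used and across levels of job complexity.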

Criterion-related validities and inter-rater reliabilities for structured employment interview studies using situational questions (e.g. ‘‘Assume that you were faced with the following situation . . . what would you do?’’) were compared meta-analytically with studies using past behaviour questions (e.g. ‘‘Can you think of a time when . . . what did you do?’’). Validities and reliabilities were further analysed in terms of whether descriptively-anchored rating scales were used to judge interviewees’ answers, and validities for each question type were also assessed across three levels of job complexity.

 

Both SQ and PBQ showed high validity, but studies using PBQ with descriptively-anchored rating scales (DARS) showed substantially higher validity than studies using SQ with DARS (.63 vs .47). Question type (SQ vs PBQ) moderated interview validity even after controlling for rating scale use. There was no support for the hypothesis that SQ is less valid for high-complexity jobs.

While both question formats yielded high validity estimates, studies using past behaviour questions, when used with descriptively anchored answer rating scales, yielded a substantially higher mean validity estimate than studies using the situational question format with descriptively-anchored answer rating scales (.63 versus .47). Question type (situational versus past behaviour) was found to moderate interview validity, after controlling for whether studies used answer rating scales. No support was found for the hypothesis that situational questions are less valid for predicting job performance in high-complexity jobs.

 

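When descriptively-anchored rating scales were used, sample-weighted mean inter-rater reliabilities were similar for situational and past behaviour questions; past behaviour question studies without such scales were slightly lower.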

Sample-weighted mean inter-rater reliabilities were similar for both situational and past behaviour questions, provided that descriptively-anchored rating scales were used (.79 and .77, respectively), although they were slightly lower (.73) for past behaviour question studies lacking such rating scales.
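A sample-weighted mean simply weights each study's estimate by its sample size, so larger studies count for more. The sketch below shows the arithmetic under that assumption; the function name and the two studies are hypothetical and are not data from the meta-analysis, which may also have applied corrections not shown here.

```python
def sample_weighted_mean(estimates):
    """Mean of per-study estimates weighted by sample size.

    estimates: list of (estimate, sample_size) pairs.
    """
    total_n = sum(n for _, n in estimates)
    return sum(value * n for value, n in estimates) / total_n


# Hypothetical inter-rater reliabilities from two studies of unequal size.
studies = [(0.70, 50), (0.80, 150)]
weighted = sample_weighted_mean(studies)
print(round(weighted, 3))  # 0.775
```

The unweighted mean of the two estimates would be 0.75; weighting pulls the result towards the larger study.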

 

 


 

 

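Models of the determinants of job performance give grounds for expecting past behaviour questions to outperform situational questions. There are two groups of performance determinants: 'can do' and 'will do'.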

In contrast, models of the determinants of job performance provide a basis for expecting past behaviour questions to exhibit superior criterion-related validity over situational questions. While various theorists have specified somewhat different variables as performance determinants (see Blumberg & Pringle, 1982; Campbell, 1990; McCloy, Campbell, & Cudeck, 1994; Vroom, 1964), all have in common two fundamental groups of performance determinants: ‘can do’ and ‘will do’ variables.

 

  • ‘Can do’ variables include job knowledge, skills and abilities, while
  • ‘will do’ variables primarily concern workers’ motivation to perform.

 

In the performance determinants model of Campbell and colleagues, performance is a function of three variables.

  • (1) declarative knowledge,
  • (2) procedural knowledge and skills, and
  • (3) motivation

In Campbell and colleagues’ performance determinants model (Campbell, 1990; McCloy et al., 1994), for example, performance is viewed as a function of three variables: (1) declarative knowledge, (2) procedural knowledge and skills, and (3) motivation.

 

Declarative knowledge is a necessary but not sufficient condition for procedural knowledge and skills, whereas motivation is a direct determinant of performance.

Declarative knowledge is seen as a necessary, though insufficient, condition for procedural knowledge and skills, while motivation is a direct determinant of performance.

 

 

Measures of maximal performance are theorized to be a function of declarative and procedural knowledge, whereas measures of typical performance (for example, supervisory ratings) are a function of all three determinants. The difference is therefore that, because everyone is motivated to perform in a testing situation, maximal performance measures cannot capture an individual's motivation to apply knowledge and skills to day-to-day work.

Measures of maximal performance, such as job knowledge tests and work sample tests, have been theorized to be a function of declarative knowledge and procedural knowledge/skills, while measures of typical performance, such as supervisory ratings of job performance, are believed to be a function of all three performance determinants (McCloy et al., 1994). Thus the critical difference between maximal and typical measures of performance is that maximal performance measures fail to assess differences in individuals’ motivation to apply knowledge and skills to day-to-day job performance, since all performers are motivated to perform well during the testing situation.

 

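We believe that the distinction between maximal and typical performance, and their relationship to the determinants of job performance, bears on how interview questions are formatted.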

We believe that the distinction between measures of maximal and typical performance, and their relationship to the determinants of job performance, are relevant to how structured interview questions are formatted.

 

Situational questions, like simulations, are measures of maximal performance and can assess declarative and procedural knowledge. Because candidates are already primed to give the best possible answer, however, they cannot assess the candidate's motivation in day-to-day performance.

Situational questions, like simulations, work samples and situational judgment tests, are measures of maximal performance, and so they can assess declarative knowledge and procedural knowledge/skills; but since interviewees are all motivated to provide the best answer possible, answers do not necessarily reflect interviewees’ motivation to apply that knowledge/skill to day-to-day job performance. For example, an interviewee who is able to describe the appropriate response to a hypothetical situation certainly demonstrates the requisite knowledge/skill, but it remains uncertain whether the individual would actually apply that knowledge/skill in an actual job situation.

 

In contrast, past behaviour questions are more likely to assess typical performance because they focus on day-to-day situations the candidate has faced, and so they can assess all three determinants, including motivation. A candidate who reports having responded effectively in a past situation had the knowledge and skills and was also sufficiently motivated. Since typical performance is usually what matters in hiring, past behaviour questions may be a more accurate indicator of future performance.

Past behaviour questions, however, are more likely to assess typical performance since they focus on candidates’ responses to the day-to-day situations that candidates have faced, and so they can assess all three performance determinants (including motivation). Interviewees who report that they have responded effectively in a past situation demonstrate both the necessary knowledge and skills, and also that they were sufficiently motivated to apply their knowledge/skills in that situation. Since the criterion of interest in employment settings is usually typical performance, past behaviour questions could be expected to provide a more accurate indication of future job performance than situational questions.

 

 

 

 

 

 

 


 

 

Taylor, P. J., & Small, B. (2002). Asking applicants what they would do versus what they did do: A meta-analytic comparison of situational and past behaviour employment interview questions. Journal of Occupational and Organizational Psychology, 75(3), 277–294. DOI: 10.1348/096317902320369712. Article first published online: 16 DEC 2010.

The Evolving Medical School Admissions Interview (AAMC, 2011)

Analysis in Brief

 

 

 

Many studies have examined the reliability and validity of medical school admissions interviews, but no description of the typical interview process has been published in nearly 20 years.

While numerous studies have examined the reliability and validity of medical school admissions interviews, a description of the typical interview process has not been published in nearly 20 years.2,3

 

Twenty years ago, the typical admissions interview was a one-on-one interview with faculty or staff. Interviewers received little guidance on question content and used graphic rating scales. Since then, many medical schools have adopted semi-structured interviews, the MMI, and similar techniques. In 2011, more than 80,000 admissions interviews were conducted.

Twenty years ago, the typical admissions interview was characterized by one-on-one interviews conducted by faculty and staff. Interviewers received little guidance about the content of questions but were required to use graphic rating scales to evaluate applicants. Since then, many medical school admissions interviews use techniques such as semi-structured interviews4 and the Multi-Mini Interview (MMI).5,6 In 2011, admissions committees conducted over 80,000 admissions interviews (median = 566; range = 87 to 1,438).7

 

 

How is the interview pool selected?

 

Admissions officers reported that admissions committee members (64%) and staff (56%) review applications to decide whom to interview; only 12% use computer-based algorithms. At 69% of schools, two or more people review each applicant's information, and at 53% the review takes 15 minutes or more. Both academic data (undergraduate GPA, MCAT) and non-academic data (community service, personal statements) are used, but academic data usually carry more weight at this stage.

More than half of admissions officers reported that admissions committee members (64%) and staff (56%) review application materials to decide which applicants to interview; only 12 percent reported that their schools use computer-based algorithms to make this decision.8 Sixty-nine percent of respondents indicated that two or more people review each applicant’s information. At most schools (53%), the review takes 15 minutes or more. The companion AIB indicates that both academic (e.g., undergraduate GPAs and MCAT scores) and non-academic (e.g., medical community service, personal statements) data are used to select the interview pool; however, more weight is given to academic data at this stage in the admissions process.

 

 

What process is used to conduct a typical admissions interview?

 

As was the case 20 years ago, at about 83% of schools faculty, staff, and sometimes medical students conduct one-on-one interviews. At 59% of schools, each applicant has two interviews. At more than 50% of schools, interviewers review personal statements, letters of evaluation, MCAT scores, and undergraduate GPAs before or during the interview. An admissions interview typically lasts 30 to 44 minutes.

As was the case 20 years ago, many admissions officers (83%) indicate faculty and staff and, in some cases medical students, conduct one-on-one interviews. Fifty-nine percent of schools conduct two interviews with each interviewee. At more than 50 percent of schools, interviewers review personal statements, letters of evaluation, MCAT scores, and undergraduate GPAs prior to or during the interview.9 Admissions interviews typically last between 30 and 44 minutes each.

 

Today's admissions interview is more structured than in the past. Sixty-four percent of schools give interviewers general guidance on question content, and most use a standard rating process to evaluate applicants.

Results show that the current admissions interview is more structured than it was in the past. Sixty-four percent of schools provide general guidance to interviewers about the content of the questions they should ask. Similarly, most employ a standard rating process to evaluate applicants during the interview.

 

What characteristics are assessed in the typical admissions interview?

 

Fewer than 50% of schools ask about applicants' academic content knowledge (biology, chemistry, psychology, and so on).

Less than 50 percent of respondents indicated that their interviews include questions about applicants’ academic content knowledge (e.g., biology, chemistry, psychology, etc.).10

 

 

 

Discussion

 

Medical schools use the interview almost exclusively to assess applicants' personal characteristics. This reliance probably reflects how difficult it is to assess personal characteristics with other admissions tools.

These data also show that medical schools use the interview, almost exclusively, to assess applicants’ personal characteristics. Reliance on the interview is likely due to the difficulty of assessing personal characteristics with other admissions tools currently available earlier in the admissions process.

 

 

 

 

 

 


Dunleavy, D. M., & Whittaker, K. M. (2011). The evolving medical school admissions interview. AAMC Analysis in Brief, 11(7), 1-2.

The effect of defined violations of test security on admissions outcomes using multiple mini-interviews (Med Educ, 2006)

Harold I Reiter, Penny Salvatori, Jack Rosenfeld, Kien Trinh & Kevin W Eva






The first pilot of the MMI was completed in November 2001. Modelled on the OSCE, a 6-station MMI with 18 mock applicants showed promising generalisability (reliability), acceptability and feasibility. Large-scale studies of actual applicants in 2002 and 2003 then confirmed these conclusions and provided preliminary evidence of predictive validity.

In November 2001 the first pilot project of a multiple mini-interview (MMI) process for student admissions was completed.1 Modelled after the objective structured clinical examinations (OSCEs), a 6-station MMI with 18 faux-applicants generated promising data regarding overall test generalisability (reliability), acceptability and feasibility. Results of subsequent large-scale studies of actual medical school applicants in April 2002 and 2003 confirmed prior conclusions and generated preliminary data demonstrating predictive validity.1–3


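The planned move to the MMI raised concerns about test security, and the concern is real. The integrity of the interview process depends on two critical factors: how high the stakes are, and what obstacles limit breaches. Admission is extremely high-stakes and the means of breaching integrity are plentiful, so the MMI is an attractive target: there is motive, method and opportunity. Station stems are likely to reach the general public, but it is far less certain whether anything is gained by such unscrupulous conduct.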

With the anticipated move towards MMI implementation, concerns arose regarding test security. Cause for concern is real. The integrity of the interview process, like any other evaluation in academia, is endangered to a greater or lesser extent based upon 2 critical factors. How high are the stakes involved? What obstacles are in place to limit the extent of breaches of academic integrity? As a result, the MMI provides an attractive target for such breaches. 

  • There is motive, with the exceedingly high stakes of career-making in the balance. 
  • There is method, with the explosion of communication technology decreasing obstacles to information dissemination. 
  • There is also opportunity, with the stems of interview stations available, of necessity, to those applicants undergoing the MMI. 

Thus the availability of stems to the general populace is anticipated. It remains far less certain whether anything is gained by such unscrupulous conduct.


More broadly, the key question about security violations is how large their effect is. Of 18 studies in clinical skills assessment, 6 reported a statistically significant improvement, 4 reported limited benefits, and 8 reported no difference.

More broadly, the issue is one of determining the impact of security violations on perceived competence levels. Literature exists outlining this impact in the domain of clinical skills assessment. Of 18 studies, 6 showed a statistically significant improvement in performance after test security violations,4–9 4 showed limited benefits10–13 and 8 revealed no difference.14–21


In their review of this literature, Swanson et al. called for methodological improvements and described four key methodological considerations for studies of this kind.

Swanson et al.,22 in their review of this literature, promoted the need for methodological improvements in this area. They described 4 key methodological aspects to ensure when designing these types of studies. As applied to the MMI, these are as follows.


  • 1 Subgroups of applicants being interviewed must be comparable, achievable using random assignment of those being rated.
  • 2 The violation(s) must be known to have occurred.
  • 3 The study must have sufficient power, in terms of sample size, to enable any presumptive impact to be identifiable statistically.
  • 4 The tool must be sufficiently reliable for true shifts in the ability to perform to be detectable.



STUDY 1


Methods


Fifty-seven applicants voluntarily took part in an MMI trial after completing their traditional interviews.

A total of 57 applicants to the MD programme participated in a voluntary trial run of the MMI after their traditional interviews were completed.


Two weeks beforehand, half of the applicants were given the content of all 9 stations; the other half were not.

Two weeks in advance of the interview date, half of the volunteers were provided copies of all 9 station stems via electronic mail. Access to these 9 station stems remained restricted from the other half of the volunteers.


Two examiners rated overall performance on a 7-point scale.

Two examiners provided a global performance rating for each candidate at each station using an anchored 7-point scale. 



Results


Twenty-four applicants received the station content 2 weeks in advance. The difference in mean scores was 0.06; showing it to be statistically significant would require a sample of 1495 per group.

Twenty-four applicants received the station summaries 2 weeks in advance of their participation. The mean score of these participants was 4.97 (SD = 0.46). The 33 applicants who did not receive the stations in advance achieved a mean score of 4.91 (SD = 0.67). This difference is not statistically significant; F(1,55) = 0.19, MSE = 6.22, P > 0.65. To reveal a difference of 0.06 to be significant with the pooled standard deviation of 0.58 would require a sample size of 1495 per group.


Discussion


The difference between groups was in the direction that would cause concern, but it was so small that a very large number of applicants would be needed for it to reach significance; in other words, it is not clinically important.

While the difference between groups is in the direction that would cause concern, it is so minuscule that 7 times the number of participants that the MD programme interviews typically would be required to show the difference to be significant, thereby suggesting that the result would be clinically unimportant even if large enough samples were drawn.



STUDY 2


Methods


Conducted in March/April 2004, during the first real, high-stakes MMI.

The second study occurred in March/April 2004 with the first real high stakes implementation of the MMI.


Twenty-four stations were developed and used over 2 interview dates, 12 on each. One examiner per station rated on a 7-point scale.

Twenty-four stations were developed, with 12 used on each of 2 interview dates. The 24 stations again focused upon personal quality domains. The system of scoring remained similar to that described above, with the exception that only 1 examiner was present per station.


Two of the 12 stations served as pilot stations whose scores did not count toward the total. Half of the applicants were sent one of the two pilot stations in advance and the other half were sent the other. On the interview day applicants were told that some stations were included for pilot purposes and would not affect the admissions decision, but not which ones. Repeated-measures t-tests compared scores on the station seen in advance with scores on stations applicants had not seen.

Once again an intentional security violation was introduced, this time by using 2 of the 12 stations as pilot stations, scores on which did not count toward the admissions decision. Half the applicants received 1 of the 2 pilot stations with their mailed letter inviting them to interview; in a covering letter they were told to expect to encounter that particular station during their interview. The other half of the applicants received the other pilot station in the same manner. On the day of the interview applicants were told that some stations were included for pilot purposes and that these stations would not count towards their admissions decision; they were not told, however, which stations fell into this category. Repeated measures t-tests were used to compare scores on the station seen in advance to scores received on stations to which applicants were naive.


Results


Mean score and reliability

The mean overall performance score received by candidates per station was 4.94 (SD = 1.10). The overall test–retest reliability of this 12-station MMI with 1 examiner per station was 0.70.


Discussion


Despite the high stakes, and despite the stations being provided 2 weeks in advance, prior exposure brought no benefit.

Despite the high stakes nature of this interview process and the fact that stations were delivered 2 weeks in advance with clear indication that they would be included in the interview, we again witnessed no benefit of prior exposure in the performance ratings assigned.


However, several examiners, who were unaware of the intervention, spontaneously remarked that some applicants seemed over-rehearsed, which suggests a possible mechanism for why knowing the stations in advance brings little benefit.

Anecdotally, a number of examiners, each of whom was blinded to the intervention, noted spontaneously that some responses seemed too rehearsed, potentially providing insight into the mechanism by which potential benefits of prior knowledge of the stations are lost.



STUDY 3


Methods


Of those interviewed for occupational therapy (OT), 38 had also applied to physiotherapy (PT). These 38 sat the same 7-station MMI twice, for OT in the morning and for PT in the afternoon, so they had not only seen the stems but had actually worked through the stations. Stations were scored on a 7-point scale, and interviewers did not know who the 38 repeat interviewees were. Interviewers rated overall performance and suitability for the profession separately; because the two ratings correlated above 0.95, only overall performance is compared.

Of the interviewees for occupational therapy seats, 38 also interviewed for physiotherapy seats. These 38 applicants underwent the same 7-station MMI for both interviews (OT in the morning and PT in the afternoon). They were therefore privy not only to the stems of the MMI stations, but also potentially gained benefit from the experience of working through the 7 stations with an interviewer. As before, the stations focused on personal quality domains and were globally scored using a 7-point anchored scale. Interviewers in this programme were blinded to the candidates being repeated interviewees. They were asked to assign ratings of each candidate’s overall performance and to provide a 7-point gut opinion of the person as a candidate for the profession/programme. The correlation between these two scores was greater than 0.95, so only the overall performance score will be reported for the sake of comparison with the first 2 studies outlined above.


The means differed by 0.01; reaching significance would require more than 29,000 per group.

The mean score provided to the sample of 38 applicants during their interview for the OT programme was 3.46 (SD = 0.43). The mean score provided to the same group during their interview for the PT programme was 3.45 (SD = 0.44). This difference is not statistically significant; t(37) = 0.14, P > 0.8. To reveal a difference of 0.01 to be significant with the pooled standard deviation of 0.43 would require a sample size of over 29 000 per group.
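
The two sample-size figures reported for Studies 1 and 3 can be roughly reproduced with the usual normal-approximation formula for comparing two means. The sketch below is an illustration only: the source does not state the significance level or power behind its figures, so the two-sided alpha of 0.05 and 80% power used here are assumptions, and the function name `n_per_group` is hypothetical.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(diff, sd, alpha=0.05, power=0.80):
    """Approximate per-group sample size needed to detect a mean difference.

    Uses the normal approximation n = 2 * ((z_alpha + z_beta) * sd / diff)^2.
    alpha (two-sided) and power are assumed values, not taken from the paper.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / diff) ** 2)


# Differences and pooled standard deviations as quoted in the text.
study1 = n_per_group(diff=0.06, sd=0.58)
study3 = n_per_group(diff=0.01, sd=0.43)
print(study1, study3)
```

With the rounded inputs quoted in the text, this gives a little under 1,500 per group for Study 1 (the paper reports 1495, presumably from unrounded values) and just over 29,000 for Study 3, in line with the reported figure.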





GENERAL DISCUSSION



Developing MMI stations is labour intensive and involves several steps. The written product consists of the following.

MMI station development can be labour intensive, requiring several steps.24 The written product consists of:


1 A station stem entitled Instructions for the Applicant

2 A station guide entitled Instructions for the Observer

3 An in-depth review of the station implications entitled Background and Theory

4 A station score sheet


Of these four, the first is available to the applicant during the station and the other three only to the observer. All documents are closely guarded: observers receive general MMI training well in advance but learn about specific stations only on the morning of the interview. Applicants first see the stems once the MMI begins. Having no paper or pen has not stopped them from reconstructing and publishing the stems.

The first of these 4 is available to the applicant throughout the station; the other 3 are available to the observer only. All documents are jealously guarded. While observers receive general MMI training well in advance, they remain station-naive until the morning of the interview date when they receive station-specific training. Once the MMI commences, the interviewed applicants become privy to the station stems. Their lack of paper and pen has not significantly constrained the subsequent publication of those stems.


The MMI held in 2004 illustrated the security challenges posed by modern communication technology. The first comment by an interviewed applicant describing how the MMI was run appeared on a forum website 7 minutes after the MMI ended. Within weeks, all 24 station stems had been described there with reasonable accuracy.

The practical administration of the MMI in March 2004 provided a sample of security challenges in the age of hand-held computers, wireless communication and the internet. The first comment about the MMI by an interviewed applicant, informing others about how the MMI was run, was posted in a forum website 7 minutes after MMI completion.23 In the subsequent weeks, after interviews were completed, reasonably accurate descriptions of all 24 MMI station stems could be viewed on the same site. 


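The rapid spread of the stems is therefore hardly surprising, nor does it appear particularly worrying for MMI implementation. The effect of leaking checklists and background and theory documents remains unknown, but the more practical concern about leaked stems appears misplaced.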

The rapid publication of these stems was therefore hardly surprising. Nor, apparently, is it particularly unnerving for prospective MMI implementation. While the effects of security violations of station checklists and background and theory remain unknown, the more practical concern regarding security violation of station stems appears misplaced.


One plausible explanation for this counter-intuitive result is that MMI stations, unlike OSCEs and other knowledge/ability tests, are designed to avoid having a single correct answer, so the interviewer can challenge whatever the applicant says. Preparing for every possible challenge is very difficult, and an applicant who tries to force a pre-planned agenda onto the discussion tends to perform worse.

One plausible explanation of this counter-intuitive result is that MMI stations, unlike OSCEs and other knowledge/ability tests, are designed to guard against the possibility of there being 1 correct answer, thereby allowing the interviewer to challenge any response provided by the candidate. It would be very difficult to prepare responses for every possible challenge, thus resulting in poorer performance if a candidate attempts to force a pre-planned agenda on the discussion.


In these studies the violation was limited to the stems. Even with a 2-week window there was no score improvement, so the length of time available does not appear to matter.

In these studies the extent of the violation was limited to the availability of the stem. Our results show that there was no score enhancement despite the 2-week window of opportunity. It appears that time delay is not an issue.


Less clear is the influence of the extent of the violation. In the first two studies, leaking only the stems had no significant effect. The more extensive test–retest violation in the third study likewise had no effect, although this may reflect the short time delay.

Less clear is the influence of the extent of the violation. One (ref. 5) of 2 OSCE studies (refs 5, 15) with egregious and identifiable violations suggested that extent of violation is a critical factor. In the first 2 MMI studies broadcasting of the stem alone, a more limited violation, had no significant impact on test scores. The more extensive, test–retest violation of the third MMI study also failed to demonstrate any impact on scores. However, this may have been a result of the short time delay (several hours only), and thus short potential responsive preparatory time between information access and the retest.


Neither the extent of the security violation nor the time delay is sufficient on its own to raise MMI scores; together they might be. Alternatively, the absence of correct answers in the MMI may prevent unwanted performance gains altogether.

Alone, neither factor is sufficient to enhance MMI scores. Together, they may be sufficient. Alternatively, the absence of correct answers on MMI performance might result in no unwanted performance modification, even in a setting combining both more extensive security violations and greater time delay between violation and subsequent performance.



CONCLUSIONS


The stages from applying to medical school through to speciality certification.

From 

  • application to medical school through to its successful completion,24 
  • national licensing examinations for general medical licence,25 
  • entry into one’s preferred speciality training26 and to 
  • speciality certification,27 


Of these, by far the greatest single hurdle is admission to medical school. Only 3.8% of applicants to McMaster's medical school were admitted. Of those who enter, 99% graduate; pass rates on the Canadian licensing examination are above 95% for Part I and 91% for Part II. Eighty-eight percent are chosen by their preferred speciality training programme, and 91% pass the speciality certification examination at the first attempt.

the single greatest hurdle in terms of likelihood of success is, overwhelmingly, admission to medical school. Only 3.8% of applicants to the McMaster University Undergraduate Medical Program were admitted in 2004.24 Of those who enter the programme, 99% graduate.25 Canadian graduates nationally enjoy a greater than 95% success rate on Part I and 91% success rate on Part II, respectively, of the Licentiate Medical Council of Canada examination upon their first sitting of each examination.26 They also enjoy an 88% likelihood of being chosen by the preferred speciality training programmes in Canada27 and a 91% first attempt success rate on Royal College fellowship speciality certification examinations.28
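
The contrast drawn here can be made concrete with simple arithmetic. The sketch below multiplies the post-admission success rates quoted above as if they were independent, sequential stages applying to one cohort. That is a simplifying assumption made purely for illustration, not a calculation from the paper, and the Part I rate is entered as 0.95 although the text says "greater than 95%".

```python
# Success rates quoted in the text, expressed as fractions.
admission = 0.038
later_stages = {
    "graduation": 0.99,
    "LMCC Part I": 0.95,  # text says "greater than 95%"
    "LMCC Part II": 0.91,
    "preferred speciality programme": 0.88,
    "speciality certification": 0.91,
}

# Assumption: stages are independent and passed in sequence.
post_admission = 1.0
for rate in later_stages.values():
    post_admission *= rate

print(f"admission: {admission:.1%}")
print(f"all later stages combined: {post_admission:.1%}")
print(f"ratio: {post_admission / admission:.0f}x")
```

Under these assumptions, clearing every later stage combined (a little under 70%) is still far more likely than being admitted in the first place.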


In the American and Canadian systems, once a student has entered medical school, failing to reach speciality certification is unlikely. This is not to belittle the efforts of medical students and residents; rather, the same effort spent at the admissions stage yields far lower success rates.

In the American and Canadian systems, failure to complete medical school through failure to obtain one’s preferred speciality or family practice certification remains unlikely. This is not meant to denigrate the Herculean efforts and enormous talents required on the part of dedicated medical students and residents, but rather to recognise that the odds are very much in their favour for those later stages. That same effort, talent and dedication, expended at the level of admission to medical school, combine for far lower success rates.


Under normal circumstances, including security violations in which station stems are distributed, confidence in MMI outcomes can be maintained.

Under normal circumstances, including the potential security violation of distribution of station stems, confidence in the veracity of MMI outcomes can be maintained.




24 Ontario Medical School Application Service Statistical Summary 2004. Ontario: Ontario Universities’ Application Centre, Council of Ontario Universities, 8 October 2004.









Med Educ. 2006 Jan;40(1):36-42.

The effect of defined violations of test security on admissions outcomes using multiple mini-interviews.

Author information: Dept. of Clinical Epidemiology and Biostatistics, McMaster University, 1200 Main Street West, Hamilton, Ontario L8Z 3N5, Canada.

Abstract

INTRODUCTION:

Heterogeneous results exist regarding the impact of security violations on student performances in objective structured clinical examinations (OSCEs). Three separate studies investigate whether anticipated security violations result in undesirable enhancement of MMI performance ratings.

METHODS:

Study 1: low-stakes: MMI station stems provided to a random half of 57 medical school applicants 2 weeks in advance of participation in a research study. Study 2: high-stakes: 384 medical school applicants sat a 12-station MMI to determine admission. Each half received 1 of 2 pilot MMI station stems 2 weeks in advance. Study 3: high-stakes: 38 interviewees with dual applications to occupational therapy and physiotherapy experienced the same 7-station MMI twice on the same date.

RESULTS:

No statistically significant differences in MMI performances were detected.

CONCLUSIONS:

Predictable violations of MMI security do not unduly influence applicant performance ratings.

PMID: 16441321 [PubMed - indexed for MEDLINE]


Description of the Interview Process in Selecting Students for Admission to U.S. Medical Schools (J Med Educ, 1981)

James B. Puryear, Ph.D., and Lloyd A. Lewis, Ph.D.



According to 1981-82 data, 99% of all medical schools use the interview, which suggests that admissions committees rely on it to help select students.

According to data obtained from Medical School Admission Requirements, 1981-82 (5), 99 percent of all medical schools use the interview in the selection process. This widespread use further supports the idea that medical school admissions committees rely on the interview to help select students.


Despite this reliance, according to Fruen there has been little research on the interview.

Despite this reliance, according to Fruen (6), there has been surprisingly little research on the interview in the medical school admissions setting.


Only eight articles were found from the last 10 years.

The authors found only eight articles in the Journal of Medical Education during the last 10 years that concentrated on the subject of medical student admission interviews.



Use of the Interview


Ninety-nine percent use the interview. As to its importance, 98% rated it as important to some degree on a four-point scale, and 72% rated it very important.

Ninety-nine percent of the respondents (106 of 107) indicated that they used the interview in student selection. In order to ascertain just how much the interview is used, three questions were asked. On a four-point scale, ranging from very important to unimportant, an overwhelming majority (98 percent) of the respondents who used the interview indicated that the interview was important to some degree. Seventy-two percent said it was very important.



The Interview Format


Eighty-five percent used one-to-one interviews.

Eighty-five percent of the respondents indicated that their interviews were "one to one" (that is, one interviewer and one interviewee).


At 100% of schools faculty and staff served as interviewers; 72% also used students and 34% used alumni.

One-hundred percent of the responding schools using interviews indicated that faculty and staff members were used to interview applicants, while 72 percent of the schools used students also as interviewers. Alumni also interviewed applicants in 34 percent of the schools.


About half had a standard set of questions.

About half (47 percent) of the medical schools which interview had a standard set of questions or areas of inquiry that all interviewers included in their interview sessions.


Forty-six percent required two separate interviews per applicant.

Forty-six percent of the interviewing schools required applicants to be interviewed in two separate interviews,



Interview Administration


Ninety-two percent trained or at least briefed their interviewers, and 94% required a written report on the interview for the applicant's file.

Ninety-two percent of the interviewing schools trained or at least briefed their interviewers in some way on interviewing applicants. A similar percentage (94 percent) required a written report on the interview for the applicant's file.


Seventy-six percent held interviews on campus only; 24% interviewed both on and off campus.

It was found that most (76 percent) of the schools which interview held interviews only on campus, while the rest (24 percent) interviewed students on and off campus.



Implications

Most medical schools consider the interview important, but few have a means of evaluating the effectiveness of their own interviews.

Obviously, most medical schools consider the interview important in the selection of students. However, medical schools have apparently not, for the most part, established a means of evaluating the effectiveness of their own interviews.












J Med Educ. 1981 Nov;56(11):881-5.

Description of the interview process in selecting students for admission to U.S. medical schools.

Abstract

A survey was made of the medical schools in the United States to obtain a description of the interview process used in the selection of first-year medical students. The following questions were the basis for the study: What is the role of the interview in the selection of medical students? What is the nature of the interview process? How is the interview administered? An 87 percent response rate was obtained. The results indicated that 99 percent of the responding medical schools use interviews in evaluating students for medical school admission, and the interview ranks second only to the grade-point average in importance among four selection factors. The interview is usually in a one to one setting, with each applicant having two separate interviews. All schools use faculty and staff members in interviewing, and usually at least one admissions committee member interviews each applicant. Usually interviews are conducted on the campus of the school. Implications drawn from the results indicate a need for a quantification of methods to incorporate the interview into the selection process.

PMID: 7299795 [PubMed - indexed for MEDLINE]


Modified Personal Interviews: Resurrecting Reliable Personal Interviews for Admissions? (Acad Med, 2012)

Mark D. Hanson, MD, MEd, Kulamakan Mahan Kulasegaram, Nicole N. Woods, PhD, Lindsey Fechtig, and Geoff Anderson, MD, PhD




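Personal interviews in particular have low overall reliability (poor agreement among interviewers and poor consistency across interview occasions), which limits their predictive power.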

Particularly, personal interviews have low overall reliability— lack of agreement among interviewers and lack of consistency across different interview occasions—which, in turn, limits their predictive power.3,4


One common solution is to sample the performance independently multiple times: the multiple independent sampling (MIS) method.

One common solution to increase the reliability of a performance measurement is to assess samples of the performance independently multiple times—that is, the multiple independent sampling (MIS) method.


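The most prominent such technique is the MMI, which uses highly structured, scenario-based interviews. The AAMC recently reported that most schools still use the personal interview, despite the MMI's high reliability and moderate validity and despite the psychometric limitations of the personal interview.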

The most notable use of this measurement technique is the Multiple Mini-Interview (MMI),6,7 which uses up to 10 highly structured, scenario-based interviews to assess applicants. Interestingly, the Association of American Medical Colleges recently reported that a preponderance of schools use the admissions personal interview, not the MMI,1 despite not only the evidence regarding the MMI’s high reliability and moderate validity7–9 but also the critical psychometric limitations of the personal interview.2–4


Uptake of the MIS method is low because of the substantial resources required (10 interviewers are needed for reliable scores) and the potential effects on recruitment (changes to recruitment-focused activities). In addition, the flexibility and intuitive simplicity of the personal interview make admissions committees reluctant to abandon it.

Two critical factors contributing to the low uptake of the MIS method to admissions interviews are the aforementioned high resourcing requirements (10 interviewers needed to attain reliable scores) and the potential effects on recruitment (due to the associated alterations to campus recruitment-focused activities).10 Additionally, the flexibility and intuitive simplicity of the personal interview may make admissions committees (and interviewers) reluctant to abandon it altogether.


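Axelson and Kreiter investigated applying MIS to the personal interview. In their 2009 study of applicants interviewed in two consecutive years, they reported that a traditional interview format could achieve reasonable reliability by reducing the number of interviewers per panel while increasing the number of separate interviews. Admissions committees might therefore improve reliability with multiple, brief, single-rater interviews instead of relying on many structured scenarios.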

Axelson and Kreiter10 investigated the application of MIS to the admissions personal interview itself. In their 2009 investigation, they reviewed the multiple interview scores of applicants who had been interviewed twice in consecutive years by a panel of two interviewers for admission to medical school. They estimated, after analyzing the scores of 168 candidates across four years who had interviewed twice, that reasonable reliability could be achieved using a traditional personal interview format by reducing the number of interviewers in the panel while increasing the number of separate personal interviews. Thus—instead of relying on a large number of structured scenarios—admissions committees might be able to depend on multiple, brief, single-rater interviews to enhance the reliability of the personal interview.


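This study is the first prospective test of the reliability of the modified personal interview (MPI), which applies the MIS method to the personal interview.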

We report here the first prospective empirical test of the reliability of a similar modification to the admissions personal interview format using an MIS methodology named the modified personal interview (MPI).



Method



First-year students were told about the LEAD program. The attributes required of LEAD candidates were derived, communicated to potential applicants, and used to write the questions for the MPI process.

We informed the first-year students about LEAD and its selection process via announcements made during class and notifications sent over e-mail. The selection process constituted submission of written materials followed by, for a selected subset of candidates, the MPI process. We derived the attributes of successful LEAD candidates from the literature on leadership11,12 and through LEAD faculty consensus. These desired attributes were communicated to the pool of potential applicants and blueprinted onto (aligned with) questions asked during the MPI process.



Written submission materials (three components)

The written submission materials comprised three components: 

    • a two-page curriculum vitae (CV) summarizing applicants’ academic and leadership experiences, 
    • three brief descriptions of leadership experiences reported in the CV, and 
    • a brief vision statement of leadership goals and career aspirations.

The MPI process

Four interview rooms, 10 to 12 minutes each, four interviewers; the interviewers framed their questions as behavioral descriptive questions.

Candidates who proceeded on to the interview stage moved among four interview rooms to complete the MPIs in succession. Each MPI was about 10 to 12 minutes long; a few interviews were longer at the discretion of the faculty interviewer. The four interviewers, all of whom had participated in the review of the written materials, framed all questions as behavioral descriptive questions which have strong validity in assessing personal characteristics.13–15


Interviewers were trained on the MPI format and on the focus of the interviews. Three of the interviews were semistructured, and the interviewers had a list of predetermined questions.

Interviewers received training on the MPI format and on the focus of the interviews. Three of the interviews were semistructured, and the interviewers used a list of predetermined questions.


All four interviewers rated three common attributes and one attribute specific to their MPI.

All four interviewers rated three common attributes—maturity, communication skills, and interpersonal skills—and a fourth attribute unique to their MPI.


Interviewers rated each attribute on a five-point scale, for a total of 20 points.

The interviewers evaluated each attribute as a separate item on a five-point Likert-type scale (1 = poor, 2 = good, 3 = very good, 4 = excellent, and 5 = outstanding) to increase the scoring range available to interviewers. All items were totaled for a final MPI score out of 20, and overall total scores were used for selection.
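The scoring rule above can be sketched in a few lines of Python. This is an illustration only: the function name `mpi_score`, the dictionary layout, and the placeholder name `attribute_4` for the MPI-specific attribute are hypothetical and do not come from the study.

```python
# Hypothetical sketch of the MPI scoring rule: four attributes, each rated
# 1-5, summed to a score out of 20 for a single MPI.
COMMON_ATTRIBUTES = ("maturity", "communication_skills", "interpersonal_skills")


def mpi_score(ratings: dict) -> int:
    """Sum the four item ratings (1-5 each) for one MPI; the maximum is 20."""
    if len(ratings) != 4:
        raise ValueError("expected exactly four rated attributes")
    for name in COMMON_ATTRIBUTES:
        if name not in ratings:
            raise ValueError(f"missing common attribute: {name}")
    for name, value in ratings.items():
        if value not in (1, 2, 3, 4, 5):
            raise ValueError(f"{name} must be rated 1-5, got {value}")
    return sum(ratings.values())


# "attribute_4" stands in for the attribute unique to a given MPI.
example = {"maturity": 4, "communication_skills": 3,
           "interpersonal_skills": 5, "attribute_4": 4}
print(mpi_score(example))  # 16
```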



Results


Sixteen applicants; 10 went on to the MPI stage and 8 were selected. The interviews took three hours in total.

Sixteen candidates submitted initial applications to LEAD. Of these, we selected 10 for the MPI stage, 8 of whom were selected for the program. The entire set of MPIs was completed in three hours in one afternoon.


58% of the variance was attributable to pi and pq:i.

The majority of variance among MPI scores (58%) was attributable to the participant–interview interaction (pi) as well as the participant–question interaction nested within MPI (pq:i), which suggests that these facets caused random error in the assessment of applicants.


Overall reliability was 0.79.

Overall reliability of the MPI component and subsequent average MPI reliability was 0.79. The reliability of questions nested within MPIs (q:i) was 0.97.
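A rough way to see how a reliability figure like this changes with the number of interviews is the Spearman-Brown formula. The study itself used generalizability theory and a decision study, so the sketch below is a simplified stand-in under the assumption of parallel interviews, not the authors' analysis; the function names are illustrative.

```python
def single_interview_reliability(r_k: float, k: int) -> float:
    """Invert Spearman-Brown: reliability of one interview, given the
    reliability r_k of the average of k parallel interviews."""
    return r_k / (k - (k - 1) * r_k)


def projected_reliability(r_1: float, n: int) -> float:
    """Spearman-Brown: reliability of the average of n parallel interviews."""
    return n * r_1 / (1 + (n - 1) * r_1)


# Start from the reported 0.79 for four MPIs (treating MPIs as parallel
# measurements is an assumption made only for this sketch).
r1 = single_interview_reliability(0.79, 4)
for n in (1, 2, 3, 4, 6):
    print(n, round(projected_reliability(r1, n), 2))
```

Under this simplification, three interviews project to roughly 0.74 and four recover the reported 0.79.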






Discussion and Conclusions


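MIS applied within the MPI format gives a reliable selection strategy. Four MPIs gave high reliability, and a d-study indicates that three would still exceed 0.7. The process took 8 faculty hours in total; a traditional interview for the same number of applicants would need more than 13 faculty hours.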

This report provides some evidence that MIS as applied within the MPI format is a reliable selection strategy. High reliability was achieved with just four MPIs, and a d-study revealed that future MPIs can achieve reliability greater than 0.7 with only three MPIs. A total of only 8 faculty hours was spent conducting the MPI process. A comparable traditional admissions personal interview of 40 minutes’ duration with two interviewers would take more than 13 faculty hours (66% more time) for the same number (n = 10) of applicants.
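The faculty-hour comparison can be checked with simple arithmetic. The sketch below assumes every MPI runs the full 12 minutes with a single interviewer, which is one reading of the figures reported above; the function name is illustrative.

```python
def faculty_hours(n_applicants, interviews_per_applicant,
                  minutes_per_interview, interviewers_per_interview=1):
    """Total faculty time in hours for one interview format."""
    total_minutes = (n_applicants * interviews_per_applicant
                     * minutes_per_interview * interviewers_per_interview)
    return total_minutes / 60


# MPI: 10 applicants, 4 interviews each, 12 minutes, 1 interviewer (assumed).
mpi = faculty_hours(10, 4, 12)
# Traditional: 10 applicants, one 40-minute interview, 2 interviewers.
traditional = faculty_hours(10, 1, 40, interviewers_per_interview=2)

print(mpi, round(traditional, 1), round(traditional / mpi - 1, 2))
# 8.0 13.3 0.67
```

The extra time works out to about two-thirds, which the report gives as 66%.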


This modification of the traditional interview could increase the uptake of MIS. Earlier MIS-based formats such as the MMI needed at least 10 separate interviews; here the MPI reached the threshold with 3. The specialized LEAD selection context may explain this: the MPI assessed only a narrow set of attributes, and other nonacademic attributes had already been assessed at medical school admission. This narrow, specific context may have enhanced interview reliability.

This modification of the personal interview has the potential to increase the uptake of MIS in admissions interviews. Previous application of MIS in the MMI showed that at least 10 separate interviews were needed to achieve acceptable reliability.6,7 The MPI here met a minimum threshold at 3 interviews. A potential explanation for this finding is the specialized selection context of the LEAD admissions process. The MPI as used in this process focused on a narrow set of attributes related to leadership qualities as determined by LEAD faculty. Other aspects of nonacademic performance had already been assessed in the medical school admissions process. This specialized context also enabled the use of expert raters, which may have further enhanced interview reliability.


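The specialization of the selection process also contributes to face validity. Face validity has been described as the extent to which applicants believe the application process is relevant to the job in question (in the medical school context, to the curriculum), and applicant acceptance of an admissions process has been associated with it.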

The specialization of this selection process (with interviewers rating applicants’ performances according to a predetermined, defined suite of attributes aligned with a specific physician role—in this case, the role of physician leader) also lends to the face validity of the MPI format. Face validity has been described as the extent to which the applicants believe the application process is relevant to the job in question,16 or—to extrapolate to the medical school context—medical school curriculum. Applicant acceptance of admissions processes has been associated with face validity.16


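In this study, the overlap of assessed attributes enhanced content validity. The strong association between written application scores and MPI scores likely results from the deliberate blueprinting (mapping) used in developing the LEAD application process. Using the written-application raters as interviewers may also have contributed.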

In the current study, the overlap of attributes across the written application and MPIs enhanced content validity (and reliability). The strong association between written application scores and MPI performance is likely a result both of the intentional attribute mapping or blueprinting we performed in developing the LEAD application process and of the availability of written application materials during MPI occasions. The use of raters from the written application as interviewers may have also contributed to the strong association of scores across both evaluations, even though we removed all personal identifying information from the candidates’ written application materials.



Thus, we would not expect the recruitment of applicants to decrease through the use of the MPI format.




1 Dunleavy DM, Whittaker KM. The evolving medical school admissions interview. AAMC Analysis in Brief. 2011;11. https://www.aamc.org/download/261110/data/aibvol11_no7.pdf. Accessed June 13, 2012.


10 Axelson RD, Kreiter CD. Rater and occasion impacts on the reliability of pre-admission assessments. Med Educ. 2009;43:1198–1202.


13 Taylor P, Small B. Asking applicants what they would do versus what they did do: A meta-analytic comparison of situational and past behaviour employment interview questions. J Occup Organ Psychol. 2002;75:277–294.


15 Huffcutt AI, Weekley JA, Wiesner WH, Degroot TG, Jones C. Comparison of situational and behavior description interview questions for higher-level positions. Pers Psychol. 2001;54:619–644.


 2012 Oct;87(10):1330-4.

Modified personal interviews: resurrecting reliable personal interviews for admissions?

Author information

  • 1Faculty of Medicine, University of Toronto, Toronto, Ontario, Canada. mark.hanson@utoronto.ca

Abstract

PURPOSE:

Traditional admissions personal interviews provide flexible faculty-student interactions but are plagued by low inter-interview reliability. Axelson and Kreiter (2009) retrospectively showed that multiple independent sampling (MIS) may improve reliability of personal interviews; thus, the authors incorporated MIS into the admissions process for medical students applying to the University of Toronto's Leadership Education and Development Program (LEAD). They examined the reliability and resource demands of this modified personal interview (MPI) format.

METHOD:

In 2010-2011, LEAD candidates submitted written applications, which were used to screen for participation in the MPI process. Selected candidates completed four brief (10-12 minutes) independent MPIs each with a different interviewer. The authors blueprinted MPI questions to (i.e., aligned them with) leadership attributes, and interviewers assessed candidates' eligibility on a five-point Likert-type scale. The authors analyzed inter-interview reliability using the generalizability theory.

RESULTS:

Sixteen candidates submitted applications; 10 proceeded to the MPI stage. Reliability of the written application components was 0.75. The MPI process had overall inter-interview reliability of 0.79. Correlation between the written application and MPI scores was 0.49. A decision study showed acceptable reliability of 0.74 with only three MPIs scored using one global rating. Furthermore, a traditional admissions interview format would take 66% more time than the MPI format.

CONCLUSIONS:

The MPI format, used during the LEAD admissions process, achieved high reliability with minimal faculty resources. The MPI format's reliability and effective resource use were possible through MIS and employment of expert interviewers. MPIs may be useful for other admissions tasks.

PMID: 22914517


An admissions OSCE: the MMI (Med Educ, 2004)

An admissions OSCE: the multiple mini-interview

Kevin W Eva, Jack Rosenfeld, Harold I Reiter & Geoffrey R Norman






Because many medical schools, particularly in North America, have very low attrition rates, admission is arguably the most important assessment a school conducts.

Because many medical schools, particularly in North America, have very low rates of attrition, one could argue that the admissions procedure is the most important evaluation exercise conducted by a school.


Some form of interview is typically used. By the early 1980s, 99% of US medical programmes used interviews in admissions, as did 81% of physiotherapy programmes and 63% of occupational therapy programmes. A more recent survey shows little change: Nayer reported that 99% of medical schools and 83% of physiotherapy programmes use interviews.

Typically, some form of interview is used; by the early 1980s, 99% of medical programmes in the USA were found to use the interview as part of the admissions process,2 as were 81% of physiotherapy programmes and 63% of occupational therapy programmes.3 A more recent survey suggests there has been little change in these proportions; Nayer reported that 99% of US medical schools and 83% of US physiotherapy programmes use interviews.4


The face validity of the interview is strong, but evidence of its effectiveness is equivocal. Interrater reliability varies widely, from 0.14 to 0.95, and this inconsistency appears to depend on how interviews are administered. Structured interviews show higher reliability and validity than unstructured ones.

While the face validity of the interview remains strong, evidence of its effectiveness is more equivocal. Interrater reliability estimates vary widely, from 0.14 to 0.95, but this inconsistency might largely be an effect of variability in the way in which interviews are administered;5 structured formats (i.e. standardised questions with sample answers provided to interviewers) tend to yield higher rates of reliability and validity than do unstructured formats.6,7


However, even these reliability estimates may be artificially inflated for the following reasons.

However, even these reliability estimates may be artificially inflated by:


1. The interview team has access to candidates’ academic information.

2. Non-verbal communication between interviewers (even if unintentional).

1 the interview team having access to academic information on candidates,8,9 and 

2 non-verbal communication (which is, admittedly, often unintentional) between members of the interviewing team.


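As a result, even when interrater reliability is high, a candidate’s score may largely come down to luck. A lucky candidate who draws a like-minded, easy interviewer who sways the rest of the panel will score highly, whereas an unlucky candidate who draws an incompatible, hard interviewer who sways the panel will score poorly.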

As a result, despite acceptable interrater reliability in some cases, a candidate’s score may still be attributable, in large part, to chance. A lucky candidate who is randomly assigned to a like-minded, easy interviewer who influences the rest of the interview panel will score highly, whereas an identical, but less fortunate candidate who is randomly assigned to an incompatible, hard interviewer who influences the rest of the interview panel will score poorly.10


Other biases that affect the interview are the interviewers’ backgrounds and expectations. Harasym et al. found that differences between interviewers accounted for 56% of total variance. Such strong bias is unacceptable, and unethical, in an interview meant to assess the candidate rather than the interviewer.

Other biases that have been shown to impinge upon the personal interview include both the interviewers’ backgrounds6,8,11 and the interviewers’ expectations.6,12 In fact, Harasym et al. found that interviewer variability accounts for 56% of the total variance in interview ratings.12 Such strong biases are unacceptable (and unethical) for an assessment tool that is intended to examine the characteristics of the candidate, not the interviewers.


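However, interviewers are not the only limit on the generalisability of interview scores. The personal interview may be, at least in part, yet another domain affected by context specificity. Decades of research show that cognitive skills depend heavily on context. In other words, performance is determined less by trait (stable characteristics of the individual) than our intuitions suggest, and more by state (the context in which the performance is elicited).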

However, it is not simply interviewer bias that limits the generalisability of interview scores. Many of the problems with the personal interview might be explained, at least in part, by the possibility that the personal interview is yet another domain that is plagued by context specificity.13 Decades of research have indicated that many of our cognitive skills are highly dependent on context.14,15 In other words, our performance is commonly less determined by trait (the stable characteristics of the individual) than our intuitions suggest, and more determined by the state (the context within which the performance is elicited).


For example, an individual’s ability to problem solve or communicate effectively when discussing the magnetic compass will not reliably predict the same ability when discussing the detrimental effect of monopolies on the world economy.

For example, an individual’s ability to problem solve or communicate effectively when discussing the impact of the magnetic compass on the modern world will not predict with great certainty that individual’s ability to problem solve or communicate effectively when discussing the detrimental effect of monopolies on the world’s economy.16


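Consistent with this possibility, Turnbull et al. showed that although interrater reliability in the oral certification examinations of the Royal College of Physicians and Surgeons of Canada was high, generalisability across interview sessions was low, which lowered overall test reliability. A single interview therefore may not give a generalisable picture of a candidate’s true abilities, even when interrater reliability is improved by standardising questions and training interviewers.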

Consistent with this possibility, Turnbull et al. showed that, although interrater reliability within the oral interview certification examinations used by the Royal College of Physicians and Surgeons of Canada was high, the generalisability across interview sessions was low, thereby lowering the overall test reliability.17 As a result, a single interview may not provide an accurate, generalisable portrayal of a candidate’s true abilities even though interrater reliability may be improved by standardising the questions asked and training the interviewers. Multiple topics might be raised within an interview, but this may still represent a small sample of possible responses by the candidate and an interviewer’s impressions of each response may not be independent of one another.


This paper describes the development of the MMI.

The current paper will first outline the development of an innovative admissions protocol – the multiple mini-interview (MMI) – that is intended to take advantage of this lesson in the context of student admissions and, second, report results from 2 studies of this protocol performed at McMaster University. In testing this innovation, it was necessary to make many decisions based solely on educated intuition. As a result, we make no claims at this point regarding the optimal use of the MMI, but instead present our logic and reasoning with the hope that some of our assumptions and expectations will be further tested in the future.




THE MULTIPLE MINI-INTERVIEW


First, the term OSCE is used only to orient the reader. Like the OSCE, the MMI consists of many short stations. The MMI, however, is neither objective nor clinical. Research on clinical reasoning and on the OSCE shows that subjective ratings can be reliable and valid estimates of an individual’s abilities, so we did not regard the subjective nature of the interview as a limitation of this admissions tool. We also avoided stations that require clinical knowledge, to prevent bias in favour of health sciences students.

First and foremost, it should be noted that the term OSCE has been used in the title of this article simply to orient the reader to the protocol that has been developed for the MMI. Like the OSCE, the MMI is intended to consist of a large number of short stations, each with a different examiner. The MMI is not, however, objective. Nor is it clinical. Research on both the clinical reasoning exercise19,20 and the OSCE21,22 has shown that subjective ratings can be reliable and valid estimates of an individual’s abilities. As a result, we do not view the subjective nature of the interview process itself to be a limiting feature of this admissions tool. Furthermore, we have carefully avoided developing stations that require clinical knowledge in an effort to prevent biasing the process in favour of health sciences students/personnel.


The MMI is intended to assess the range of cognitive and non-cognitive skills that the personal interview currently assesses inadequately. Its advantage is that it dilutes the effects of chance and of interviewer and situational bias. Unlike traditional interviews, the various points of discussion are rated independently because the interviewers sit in separate rooms.

In contrast to what it is not, the MMI is an OSCE-style exercise consisting of multiple, focused encounters. It is intended to assess many of the cognitive and non-cognitive skills that are currently assessed (inadequately) by the personal interview. Its specific advantage is that multiple interviews should dilute the effect of chance and interviewer/situational biases. Unlike traditional interviews, we can ensure that the ratings assigned to the multiple points of discussion are given independently because interviewers engage the applicants in separate rooms.


The term ‘interview’ has been kept, but the protocol is meant to allow flexibility in how stations are developed. At any station the examiner may act as an interviewer or as an observer.

While the term interview has been maintained, one of the intended benefits of this protocol is the flexibility with which stations can be developed. For any given station, the examiner might be an interviewer or an observer.

  • Direct discussion with the interviewer: As an example, a station on ethical decision making, such as station 1 (see Appendix) can consist of a discussion between candidate and interviewer. Obviously some part of the rating assigned by the interviewer will be influenced by the candidate’s ability to communicate effectively, but stations that are intended to tap into communication skills more directly can also be developed.
  • Interviewer observes an interaction with a simulated patient: For example, communication skills stations might consist of interviews conducted with a simulated patient while the examiner acts as an observer. Station 3 (see Appendix) is one such station in which the candidate is told s/he has to pick up a colleague to fly to a conference only to discover upon entering the room that the colleague has developed a fear of flying as a result of the September 11th tragedy. The observer rates the candidate based on the communication skills and empathy observed during the interaction between the candidate and ‘colleague’.


This flexibility in station design makes it less likely that candidates can prepare or rehearse for specific questions. Instead of answering traditional questions (e.g. Why do you want to become a doctor?), candidates must respond spontaneously to the situation presented. Candidates will no doubt still rehearse, but predicting the questions becomes harder once the station database is large enough.

This flexibility in station development reduces the likelihood that candidates will benefit from preparing and rehearsing responses to specific questions. Instead of asking the usual historical questions (e.g. Why do you want to become a doctor?), candidates must respond spontaneously to the presented situation. Undoubtedly, candidates will still prepare and rehearse responses, but it will be more difficult to predict the types of questions one will be asked if a database of stations is developed to sufficient size.


If a programme wants to explore applicants’ life histories, a traditional interview station can let the candidate discuss whatever experiences, challenges or beliefs s/he would like the admissions committee to recognise. Similarly, if a programme uses the interview partly for recruitment, a station can be set aside for that purpose without compromising the rest of the interview process.

If a programme does desire to query applicants regarding their life history, traditional interview stations can be used in which the interviewer allows the candidate to discuss whatever personal experiences, challenges or beliefs s/he would like the admissions committee to recognise. Similarly, if a programme desires to use the interview, in part, as a recruitment exercise, then a station can be assigned for this purpose without fear of impinging upon the rest of the interview process.


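For the remaining stations, interview topics can be drawn from any subject, from art history to zoology. A secondary benefit is that interviewers can be recruited from diverse academic and community areas. We chose four domains that are not comprehensive but are considered vital for health professionals.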

For the remaining stations, specific interview topics can potentially be drawn from any subject ranging from art history to zoology. In fact, an anticipated secondary advantage of this new protocol lies in its potential to draw interviewers from diverse academic and community areas and allow them to assess topics that are consistent with their domain of expertise. We opted to focus our test stations on 4 domains that are not considered to be comprehensive, but are considered to be vital for a career in the health sciences:


1 critical thinking;

2 ethical decision making;

3 communication skills, and

4 knowledge of the health care system.



To judge the suitability of stations, we decided that candidates should not be expected to have specialised knowledge, for example details of a medical condition. Stations should instead assess candidates’ ability to think logically through a topic and communicate ideas effectively. We also regarded any question with a definitively correct answer as unsuitable. This does not mean that no answer is better than another, only that interviewers should not be looking for a particular catch phrase or opinion.

To assess the suitability of potential stations, we decided that candidates should not be expected to possess specialised knowledge. For example, they should not be expected to know details of a medical condition. Rather, stations should be developed in such a way that they allow candidates to display an ability to think logically through a topic and communicate their ideas effectively. In addition, as a simple heuristic, we viewed any question that had a definitively correct answer to be inadequate. That is not to say that some answers are not better than others, but rather that the interviewers should not be searching for a specific catch phrase or a specific opinion.



EXPERIMENT 1: PILOT STUDY WITH GRADUATE STUDENT PARTICIPANTS


As in an OSCE, a separate room was used for each station. Overview of the interview format.

As in an OSCE, separate rooms were used for each station. Posted to each door was a card with the ‘Instructions to Applicants’, as shown in the Appendix. In addition, as this was not intended to be a memory task, the same information was included on a card inside the interview room so that the candidate could refer back to it if s/he desired to do so. Each station lasted 8 minutes and was followed by a 2-minute interval during which interviewers completed standardised evaluation forms and candidates prepared for the subsequent station. The evaluation forms requested interviewers to rate each of the candidates using 7-point scales on:


The same four rating criteria at every station: communication skills, strength of arguments, suitability for the health sciences, and overall performance.

1 communication skills; 

2 strength of the arguments raised; 

3 suitability for the health sciences, and 

4 overall performance. 


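In general, increasing the number of stations improves reliability more than increasing the number of raters within a station. This supports the hypothesis that the traditional interview is undermined by context specificity.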

In general, it appears that increasing the number of stations has a greater impact on the reliability of the test than increasing the number of raters within any given station, thereby supporting the hypothesis that context specificity plagues the traditional interview.
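The stations-versus-raters comparison can be illustrated with the standard relative G coefficient for a design in which raters are nested within stations. The variance components below are hypothetical values chosen only to show the pattern; they are not the study's estimates, and the function name is illustrative.

```python
def g_coefficient(var_p, var_ps, var_res, n_stations, n_raters):
    """Relative G coefficient for candidates crossed with stations and
    raters nested in stations.

    var_p   : candidate (true-score) variance
    var_ps  : candidate-by-station interaction variance
    var_res : candidate-by-rater-within-station variance plus residual
    """
    error = var_ps / n_stations + var_res / (n_stations * n_raters)
    return var_p / (var_p + error)


# Hypothetical variance components, for illustration only.
VAR_P, VAR_PS, VAR_RES = 1.0, 3.0, 2.0

baseline = g_coefficient(VAR_P, VAR_PS, VAR_RES, n_stations=5, n_raters=1)
more_raters = g_coefficient(VAR_P, VAR_PS, VAR_RES, n_stations=5, n_raters=2)
more_stations = g_coefficient(VAR_P, VAR_PS, VAR_RES, n_stations=10, n_raters=1)
print(round(baseline, 2), round(more_raters, 2), round(more_stations, 2))
```

Doubling the stations divides both error terms, while doubling the raters only divides the second, so whenever the candidate-by-station component is non-zero the extra stations help more.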




EXPERIMENT 2: UNDERGRADUATE MD PROGRAMME CANDIDATES


Overview of the study procedure

All applicants (n = 396) who were offered an interview by McMaster University’s undergraduate medical programme were sent a letter inviting them to participate in an admissions research study. The letter stressed that their participation (or lack of participation) would in no way influence their chances of being accepted to the medical programme and offered candidates $40 in an attempt to make it clear that this initiative was completely separate from the regular admissions process. A total of 182 candidates responded affirmatively, of which the first 120 candidates whose schedules coincided with participation in one of 12 prearranged research sessions were selected.


Three sessions were run back to back on each of 4 days, with a 40-minute break between sessions.

Three sessions were run sequentially during each of the 4 interview days, with a 40-minute break for examiners between sessions. All candidates were allowed to participate only after completion of the regular admissions protocol. Three candidates backed out due to illness, resulting in a total sample size of 117; 2 of these left before completing a post-MMI survey.
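With 120 candidates spread over 12 sessions, each session has 10 candidates for 10 stations, which fits an OSCE-style circuit. The sketch below shows one way such a rotation could work. The rotation rule and the reuse of the pilot's 8-minute station and 2-minute changeover are assumptions for illustration, not details reported for this experiment.

```python
def rotation_schedule(n_candidates=10, n_stations=10):
    """Return one dict per round mapping candidate index -> station index.

    In round t, candidate c visits station (c + t) mod n_stations, so each
    station holds at most one candidate per round.
    """
    return [
        {c: (c + t) % n_stations for c in range(n_candidates)}
        for t in range(n_stations)
    ]


def circuit_minutes(n_stations=10, station_min=8, changeover_min=2):
    """Length of one full circuit, assuming the pilot's station timing."""
    return n_stations * (station_min + changeover_min)


schedule = rotation_schedule()
print(schedule[0][0], schedule[1][0], schedule[9][0])  # 0 1 9
print(circuit_minutes())  # 100
```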


Interviewer recruitment: mostly faculty, plus 8 students and 2 Human Resources staff.

Interviewers were recruited broadly from the Faculty of Health Sciences, the students currently in the medical programme, and the community at large (including McMaster University’s Human Resources Department). From the surplus of individuals who volunteered to participate, we selected 40 (10 per day) based on their willingness to volunteer for an entire day. Evaluators were mostly drawn from the Faculty of Health Sciences, but 8 students and 2 members of the Human Resources Department also participated. The list of health sciences volunteers included representation from rehabilitation sciences, nursing, biochemistry and medicine.



Procedure


1 all 10 stations reported in the Appendix were used; 

2 only 1 interviewer was assigned per station, and 

3 as a result of the high correlations among the 4 evaluation questions used during the pilot study, we opted to ask evaluators to simply score ‘the applicant’s overall performance on this station’.



Reliability analyses


The variance from the candidate–station interaction was 5 times that from candidates themselves, again pointing to context specificity.

Furthermore, the variance attributable to the candidate–station interaction was 5 times greater than that assigned to the candidates themselves, further supporting the hypothesis that context specificity negatively impacts on traditional interviews.
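The 5-to-1 ratio gives a feel for how many stations are needed. The sketch below assumes that, with one rater per station, the candidate–station term is the only error source and that the "5 times" ratio applies to it; this is a simplification for illustration, not the paper's full generalisability analysis, and the function name is hypothetical.

```python
def g_from_ratio(ratio_ps_to_p, n_stations):
    """Relative G coefficient when error variance is `ratio_ps_to_p` times
    the candidate variance and is averaged over `n_stations` stations."""
    return 1 / (1 + ratio_ps_to_p / n_stations)


for n in (1, 5, 10, 20):
    print(n, round(g_from_ratio(5, n), 2))
```

Under these assumptions a single station gives about 0.17 and ten stations about 0.67, close to the 0.65 reported for the 10-station MMI.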



Correlation with other measures


The MMI scores did not correlate highly with any of the other admissions tools currently used by McMaster’s admissions protocol. The correlations between the MMI and the existing admissions tools23 were:

personal interview, r = 0.185

simulated tutorial, r = 0.317

undergraduate grade, r = −0.227

autobiographical sketch, r = 0.170








Post-MMI surveys


Candidates were also asked 3 open-ended questions. The greatest benefits of the MMI? A poor performance at one station can be made up at another, and the MMI gives a more balanced view of the applicant’s skills and experiences.

In addition, candidates were asked 3 open-ended questions. In response to the question: ‘What do you believe to be the greatest benefits of using the MMI?’, many commented on the opportunity to recover from poor stations and the belief that the MMI should provide ‘a more balanced view of the applicant’s skills and experiences’. Positive comments were also recorded regarding the opportunity to maintain a dialogue with the interviewer and the opportunity to solve and discuss ‘REAL PROBLEMS [sic]’.


What should be improved, and what are the weaknesses? Suggestions included a chair between stations, more time per station, and a station that assesses group skills.

Candidates were also asked the questions: ‘Are there any improvements you would like to see made before the MMI is implemented?’ and ‘What do you believe to be the greatest weaknesses of the MMI?’ Their responses to these focused primarily on logistical issues, such as ‘including a chair between stations’, lengthening the amount of time for each interview (most often suggested as lengthening to 10 minutes) and ‘allow[ing] for some discussion at the end, [to provide an] opportunity to go back to a point not adequately covered’. Some commented that the MMI would ‘allow for a shorter interview day’, but that ‘a break half way through would help’. Others noted the lack of an opportunity to reveal group skills – a domain that could potentially be built into future iterations of the MMI.


Interestingly, in contrast to the candidates, examiners felt that 8 minutes was more than enough to assess a candidate. The most consistent comment was a wish for more training beforehand, in particular more information and a longer list of possible questions.

Interestingly, in contrast to the comments offered by some candidates, examiners tended to suggest that 8 minutes was more than enough time to get a sense of the candidate’s performance. In general, the most consistent comment, raised by approximately a quarter of respondents, was that the examiners would have liked more training beforehand, potentially in the form of including more information and a longer list of potentially relevant questions in the preparatory package received by all examiners.



DISCUSSION


Four main issues arise when evaluating an assessment protocol.

There are 4 issues that need to be considered when evaluating the efficacy of any assessment protocol:


1 reliability; 

2 validity; 

3 feasibility, and 

4 acceptability. 


Reliability: in an acceptable range. The low correlations with other admissions tools are consistent with context specificity.

The reliability of the MMI has now been shown to be in an acceptable range (0.65–0.81) across 2 studies, using graduate student volunteers and actual applicants to the undergraduate medical programme. While adequate, this reliability might be further improved with examiner training. The low correlations between various admissions tools including the MMI is consistent with the hypothesis that context specificity impacts upon admissions protocols, thereby further promoting the need for a tool that adopts a multiple observations approach analogous to that provided by OSCEs when assessing clinical competence.


Validity: blueprinting is needed. We selected 4 domains. Schools and programmes are encouraged to decide which characteristics they value before building MMI stations.

The blueprinting process undertaken for the generation of stations was intended to maximise the content validity of the MMI. We selected 4 domains that are thought to represent important, non-cognitive characteristics for success in the health sciences. We advocate that specific schools and specific programmes within the schools that consider implementing the MMI engage in a similar process, determining the characteristics they value before creating MMI stations. This blueprinting technique might then ensure an optimal match between the curricular tenets of the programme and the characteristics of the individuals accepted into the programme.


Feasibility: an admissions tool will not be used unless it is feasible and cost-effective.

That being said, even the most reliable and most valid of admissions exercises will not be useful if they do not prove to be feasible and cost-effective. In fact, the issue of cost-effectiveness ranks high among the primary assaults that have been launched against the use of personal interviews.5 For McMaster’s medical programme, approximately 400 applicants are interviewed annually, each of whom requires an hour of interview time (30 minutes for the interview and 30 minutes for scoring and a break). There are 4 people on each interview team, so each personal interview requires 4 person hours per applicant; the entire interview programme therefore requires 1600 person hours in total. Of these, 550 are typically faculty hours, the cost of which amounts to an estimated $27 500 per annum. The use of other non-cognitive tools, particularly the simulated tutorial, increases total interviewer time to about 1800 hours and faculty cost to about $32 000.


The MMI costs less, and many schools already run OSCEs.

A 10-station MMI (with 10 minutes per station) could be run for only 2 person hours per candidate (including a 20-minute break for all examiners). Assuming the same ratios of faculty versus community personnel, this would require 275 faculty hours at a cost of $13 750 per annum. These values could potentially be reduced even further if it is determined that 10 minutes per station is not required or if fewer stations are used (although the disadvantage of this latter strategy will be poorer reliability). The cost will be increased slightly by the use of SPs, with the absolute value of the increase depending on the number of SP stations used. As one final note on the feasibility of implementing the MMI, most health sciences programmes have considerable experience with mounting OSCEs. This expertise can potentially be used to make the transition from a personal interview to the MMI as smooth as possible.
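The cost figures for the personal interview and the MMI follow from a short calculation. The sketch below derives the faculty share and the hourly cost from the numbers quoted above (550 of 1600 person hours, $27 500 per annum) and applies them to both formats; the variable and function names are illustrative.

```python
APPLICANTS = 400
FACULTY_SHARE = 550 / 1600             # faculty fraction of all person hours
COST_PER_FACULTY_HOUR = 27_500 / 550   # implied by the quoted annual cost


def annual_cost(person_hours_per_applicant):
    """Return (total person hours, faculty hours, faculty cost in dollars)."""
    person_hours = APPLICANTS * person_hours_per_applicant
    faculty_hours = person_hours * FACULTY_SHARE
    return person_hours, faculty_hours, faculty_hours * COST_PER_FACULTY_HOUR


print(annual_cost(4))  # personal interview: (1600, 550.0, 27500.0)
print(annual_cost(2))  # 10-station MMI:     (800, 275.0, 13750.0)
```

Halving the person hours per applicant halves both the faculty hours and the cost, which is where the 275 hours and $13 750 come from.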


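Acceptability: are interviewers willing to take part? Some found it more tiring than the personal interview. This could be addressed by adjusting the protocol or the number or length of breaks.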

Finally, a note on the acceptability of admissions tools. As the MMI and personal interview do require more human resources than simple reliance on grades, it is important that the individuals who are asked to act as interviewers are willing to participate in the process.


Interviewers in the pilot study were most concerned about the experience being more tiring than the personal interview, because a single person is responsible for each interview. Addressing this concern might require adjusting the protocol, increasing the number or length of breaks, or changing some other aspect of the process.


An equal number of interviewers found the exercise fun.

It should be kept in mind, however, that an equal number of interviewers reported the exercise to be fun and entertaining.



Strengths of the MMI

The anticipated strengths of the MMI are 6-fold:


1 it allows multiple samples of insight into a candidate’s abilities; 

2 it dilutes the effect of chance and examiner bias; 

3 stations can be structured so that all candidates respond to the same questions and interviewers receive background information a priori; 

4 admissions directors have a great deal of flexibility in that stations can be designed with a blueprint of the qualities they would like to select for in mind; 

5 candidates can feel confident that they will be given a chance to recover from a disastrous station by moving onto a new, independent interviewer, and 

6 fewer resources might be required. 














Med Educ. 2004 Mar;38(3):314-26.

An admissions OSCE: the multiple mini-interview.

Author information

  • 1Department of Clinical Epidemiology and Biostatistics, Programme for Educational Research and Development, McMaster University, Hamilton, Ontario, Canada. evakw@mcmaster.ca

Abstract

CONTEXT:

Although health sciences programmes continue to value non-cognitive variables such as interpersonal skills and professionalism, it is not clear that current admissions tools like the personal interview are capable of assessing ability in these domains. Hypothesising that many of the problems with the personal interview might be explained, at least in part, by it being yet another measurement tool that is plagued by context specificity, we have attempted to develop a multiple sample approach to the personal interview.

METHODS:

A group of 117 applicants to the undergraduate MD programme at McMaster University participated in a multiple mini-interview (MMI), consisting of 10 short objective structured clinical examination (OSCE)-style stations, in which they were presented with scenarios that required them to discuss a health-related issue (e.g. the use of placebos) with an interviewer, interact with a standardised confederate while an examiner observed the interpersonal skills displayed, or answer traditional interview questions.

RESULTS:

The reliability of the MMI was observed to be 0.65. Furthermore, the hypothesis that context specificity might reduce the validity of traditional interviews was supported by the finding that the variance component attributable to candidate-station interaction was greater than that attributable to candidate. Both applicants and examiners were positive about the experience and the potential for this protocol.

DISCUSSION:

The principles used in developing this new admissions instrument, the flexibility inherent in the multiple mini-interview, and its feasibility and cost-effectiveness are discussed.

PMID: 14996341


Assessing non-cognitive traits with the MMI in medical school admissions (Med Educ, 2007)

Assessment of non-cognitive traits through the admissions multiple mini-interview

Jean-François Lemay, Jocelyn M Lockyer, V Terri Collin & A Keith W Brownell







The admissions interview is among the most subjective and variable stages of student selection. It is generally valued for assessing non-cognitive attributes, but its reliability has been questioned. For this reason, the Michael G DeGroote School of Medicine at McMaster University introduced the MMI.

The interview has been identified as among the most subjective and variable aspects of the medical school admissions process.1–4 Generally, the admissions interview is valued for its ability to assess non-cognitive attributes, although the reliability of the interview process has been questioned.5,6 In response to this challenge, the Michael G DeGroote School of Medicine at McMaster University introduced a multiple-station-based assessment process, the multiple mini-interview (MMI).6–8


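The MMI is more reliable than traditional interviews and better predicts pre-clerkship performance. It can also be designed to reflect the values of the medical school. The MMI appears to give all applicants an unbiased opportunity, regardless of gender, time of day, or background.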

MMI is more reliable than traditional interviews,6–8 is better able to predict pre-clerkship performance7 and can be designed to reflect the values of the medical school.9 Furthermore, the MMI appears to offer an unbiased opportunity for all applicants, regardless of gender, time of day, or background.6,8,10


Even when some candidates were given the interview content in advance, MMI results did not differ between the two groups.

An assessment of the MMI when some candidates were provided with test content a priori and others were not showed no differences between the 2 groups.11


In 2003, our Faculty of Medicine defined the non-cognitive attributes it wanted its students to have.

In 2003, our Faculty of Medicine adopted a list of non-cognitive attributes that we wanted our medical students to possess.12


The Michael G DeGroote School of Medicine's MMI focused on only 4 attributes, but we wanted to assess a broader range.

Although the Michael G DeGroote School of Medicine’s MMI had only focused on 4 attributes (ethics, critical thinking, communication skills and knowledge of the health care system8,9), we felt the MMI could be developed to assess multiple non-cognitive attributes.



Each of the first 9 MMI stations was designed to assess a distinct non-cognitive attribute: 

1 advocacy; 

2 ambiguity; 

3 collegiality and collaboration; 

4 cultural sensitivity; 

5 empathy; 

6 ethics; 

7 honesty and integrity; 

8 responsibility and reliability, and 

9 self-assessment.9


At the 10th and final station, applicants were asked why they would make an excellent medical student and doctor.

At the 10th and last station, applicants were asked why they thought they would become an excellent medical student and doctor.


Every station used a similar template: a 5–15-line scenario, read in 2 minutes, ending with "Discuss this with the assessor."

The template for each station was similar. The applicant was provided with a 5–15-line scenario to read during a 2-minute period prior to entering the room. Each scenario ended with the statement: Discuss this with the assessor. 


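Once in the room, the discussion lasted 8 minutes. Assessors were instructed to let applicants talk at length without commenting or interrupting, and were given 4 probing questions to use if needed. They also received the objectives of each station and some background information (see Appendix). Two weeks before the interviews, all assessors attended a 2-hour orientation and training session that covered an overview of the MMI and its research basis, a video of 2 applicants at the same station, and the scoring system.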

After entering the room, applicants discussed the scenario with the assessor for 8 minutes. Interview assessors were instructed to allow the applicant to talk at length and not to interrupt the commentary. The interview assessors were provided with 4 probing questions to use if they deemed it necessary. Assessors were provided with information about the objectives of the station and some background information. An example of a complete station is provided in the Appendix. Two weeks prior to the MMI, all assessors were oriented to the MMI in a 2-hour training session in which we provided an overview of the MMI and its research basis, showed a videoclip with 2 different applicants interviewed on the same station, and explained the scoring system.


At each station, applicants were rated on 5 criteria, each on a 10-point scale.

At each station, each applicant was assessed on 5 criteria using a 10-point scale. Thus applicants could achieve a score of up to 50 points for each station. The criteria were: 

  • ability to understand and address the objectives of the scenario, 
  • communication skills displayed, 
  • strength of the arguments presented, 
  • suitability for a career in medicine, and 
  • overall performance.
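The scoring scheme above (5 criteria, 10 points each, up to 50 points per station) can be sketched as follows. The criterion labels are shortened from the list, and the function name and the assumption that ratings run from 1 to 10 are illustrative rather than taken from the paper.

```python
CRITERIA = (
    "understanding of objectives",
    "communication skills",
    "strength of arguments",
    "suitability for medicine",
    "overall performance",
)


def station_score(ratings):
    """Sum one rating per criterion into a station score (maximum 50)."""
    if set(ratings) != set(CRITERIA):
        raise ValueError("expected exactly one rating per criterion")
    for name, value in ratings.items():
        # Assumption: the 10-point scale runs from 1 to 10.
        if not 1 <= value <= 10:
            raise ValueError(f"{name}: rating {value} is outside 1-10")
    return sum(ratings.values())


example = dict.fromkeys(CRITERIA, 7)  # hypothetical applicant rated 7 throughout
print(station_score(example))         # 35
```

Summing the ten station scores would then give the applicant's total MMI score.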


Assessors were reminded that all applicants were already highly qualified but that they still had to discriminate between them, and were advised to use the first 3 or 4 applicants to calibrate their scoring. In the training session they were urged to use the full range of the scale, from 1 to 10.

We reminded interview assessors that all the applicants were highly qualified but that they would have to discriminate between applicants. We suggested interview assessors use the first 3 or 4 applicants to calibrate their scoring for the station. In the training session, we particularly emphasised to the interviewers that they needed to use the full range of the 10-point scales.


Each applicant met 9 different assessors. Interviews ran over 2 days, so to maintain confidentiality 2 parallel stations with similar but not identical content and context were created for each attribute. Scores did not differ significantly between the two days.

Thus, each applicant was assessed by 9 different interview assessors. As interviews were conducted on 2 days, to ensure confidentiality, 2 parallel stations with similar, but not identical, content and context were created for each non-cognitive attribute. As an ANOVA comparing station performance for each day did not reveal any significant differences, we combined the data for days 1 and 2 for this study.




Statistical analysis

For each applicant we had the following data: scores on each of the 10 stations (5 scales) as well as the total score, sociodemographic data (age, gender, grade point average [GPA]) and whether the applicant was accepted or placed on the waitlist. Descriptive statistics for the applicant pool for both the sociodemographic data and each station were tabulated. The internal consistency reliability (Cronbach’s alpha) was examined for each station (Question 1). 
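As a minimal sketch of the internal-consistency calculation mentioned above, Cronbach's alpha for one station can be computed from the five subscale scores of each applicant. The function name and the sample data are hypothetical; the paper does not describe its software or report raw scores.

```python
from statistics import variance


def cronbach_alpha(rows):
    """Cronbach's alpha; rows holds one list of k subscale scores per applicant."""
    k = len(rows[0])
    if k < 2 or len(rows) < 2:
        raise ValueError("need at least 2 items and 2 applicants")
    # Sample variance of each item (column) across applicants.
    item_vars = [variance([row[i] for row in rows]) for i in range(k)]
    # Sample variance of each applicant's summed score.
    total_var = variance([sum(row) for row in rows])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical ratings: 4 applicants x 5 criteria at a single station.
scores = [
    [7, 8, 7, 6, 7],
    [4, 5, 4, 5, 4],
    [9, 9, 8, 9, 9],
    [6, 5, 6, 6, 5],
]
print(round(cronbach_alpha(scores), 3))
```

Alpha approaches 1 when the five criteria rise and fall together across applicants, which is the "item cohesiveness" the authors refer to.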


To determine whether the stations were measuring a single construct or several, we correlated the total scores for each station. As a result of the multicollinearity among the subscale scores (within each station), the total scores for individual stations were used in the analysis. Additionally, an exploratory factor analysis (EFA) was used to determine whether the structure of the data was unidimensional or multidimensional. For the EFA, we analysed the data using principal component analysis, with varimax rotation following the Kaiser rule (i.e. eigenvalues >1.0), to determine the number of factors to be extracted (Question 2). 


We assessed the ability of the stations to discriminate between those placed on the acceptance and waiting lists by examining the mean scores for each station through an ANOVA (Question 3). 


ANOVA was used to determine whether there were sociodemographic differences (gender, GPA, age, degrees) between those accepted and those waitlisted (Question 4).









Cronbach's alpha indicates item cohesiveness.

The high Cronbach’s alpha scores suggest there was high item cohesiveness among the subscales within each station and provide evidence of stable scores for each applicant.


Interestingly, station 10 was included because current medical students and faculty argued that applicants should have a chance to talk about themselves. Applicants were told in advance that this station would be included. However, the data confirm how idiosyncratic assessor beliefs are, and show that admissions committees should use the MMI to assess specific attributes rather than allow an undirected discussion. The results for the 10th station may also have been affected by assessors having no opportunity to calibrate and by the candidate's earlier performance. Little seems to be gained from this type of station, and it could affect borderline candidates.

Interestingly, the decision to include this station was made at the urging of our current medical students and faculty members, who felt that applicants should have an opportunity to talk about themselves and present their cases for being accepted into our medical school. Applicants were advised in advance of the interview day that this would be among the stations. However, these data reveal the idiosyncratic nature of assessor beliefs and provide a compelling reason for admissions committees to use the MMI to assess specific attributes and not to have a free-flowing discussion. The data on this 10th station were probably also compromised by the fact that assessors did not have an opportunity to calibrate applicants and may have been influenced by the prior performance of the candidate. There appears to be little to be gained from this type of station and it could affect admission for borderline candidates.


The correlation analyses showed low correlations between stations, and the factor analysis revealed 10 distinct factors, supporting both content specificity and the multidimensional nature of the assessment.

Our correlation analyses showed a low correlation between stations; however, the factor analysis revealed 10 distinct factors, attesting to the multidimensional structure of the data as well as providing support for the existence of content specificity. This gives us confidence that the MMI is able to assess and differentiate between a large and diverse set of traits.




11 Reiter HI, Salvatori P, Rosenfeld J, Trinh K, Eva KW. The effect of defined violations of test security on admissions outcomes using multiple mini-interviews. Med Educ 2006;40(1):36–42.


12 University of Calgary. Non-cognitive qualities we look for in students admitted to the Faculty of Medicine. University of Calgary. http://admissions.myweb.med.ucalgary.ca/NoncognitiveQualitiesWeLookFor.html. [Accessed 18 September 2006.]













The art of medicine: Science as superstition: selecting medical students (Lancet, 2010)

The art of medicine - Science as superstition: selecting medical students

Donald A Barr

Department of Pediatrics, Stanford University, Stanford,

CA 94305-2160, USA

barr@stanford.edu





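My research took me back to the end of the 19th century. The US medical profession was struggling to establish an identity, and the assumptions underlying that identity became clear between the founding of the Johns Hopkins School of Medicine in 1893 and the tenth annual meeting, in 1914, of the Council on Medical Education (CME), established jointly by the AMA and the AAMC. Addressing that 1914 meeting, Dr Victor Vaughan, a founding member of the CME, stated the core belief on which the American medical profession had been built.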

My research took me to the close of the 19th century. The profession of medicine in the USA was struggling to establish an identity. That identity, and the assumptions underlying it, became clear between 1893, with the founding of the Johns Hopkins School of Medicine, and 1914, with the tenth annual convening of the Council on Medical Education (CME) established jointly by the American Medical Association and the Association of American Medical Colleges. Addressing the CME in 1914, Dr Victor Vaughan, a founding member of the Council, spoke the core belief on which the American medical profession by then was built.


No one is fit to study medicine without being thoroughly acquainted with the fundamental facts of physics, chemistry, and biology. These sciences are the nourishment on which medicine feeds; without them, everything done in the name of medicine is fraud, sham, and superstition.

“No man is fit to study medicine, unless he is acquainted, and pretty thoroughly acquainted, with the fundamental facts in physical, chemical, and biological subjects…The facts of the biological, physical, and chemical sciences are the pabulum on which medicine feeds. Without these sciences, everything that goes under the name of medicine is fraud, sham, and superstition.”


Is there evidence to support Vaughan's claim? I went back 40 years before his remarks and found none. I then went forward through more than 90 years of published research. I did find evidence that performance in the undergraduate sciences predicts failure in the early science courses of medical school: those in the bottom 10–20% on standardised tests were the most likely to fail the early years. There was also a correlation between undergraduate science grades and preclinical science grades. But that was where the evidence stopped.

Was there evidence to support Vaughan’s words? Is science the pabulum that nurtures young physicians? I went back in the literature of medical education 40 years before the time of Vaughan’s comments, and found no scientific evidence to support his assertion. I went ahead, through more than 90 years of published research. I did find evidence that performance in the undergraduate sciences can predict failure in the initial sciences courses in the medical school curriculum. Those who score in the bottom 10–20% on standardised tests of scientific knowledge are the most likely to fail the early years of medical school (although more than half would succeed, given the chance). I also found evidence of a correlation between one’s grades in undergraduate sciences and one’s grades in the preclinical science courses that initiate the curriculum in many medical schools. That, though, was where the evidence stopped.


Students learn the practice of medicine through clinical instruction in the final years of medical school and through graduate medical education (GME). I found no scientific evidence, however, that undergraduate science grades are related to clinical or professional quality as a physician. Professional quality, the 'art' of medicine, is not simply knowledge of chemistry. Instead I found something else: consistently, undergraduate science performance was inversely associated with the personal, non-cognitive qualities central to the art of medicine. I first came across this evidence close to home.

Students learn the practice of medicine through clinical instruction, gained in the final years of medical school and through graduate medical education. I found no scientific evidence that supported the power of performance in undergraduate science courses as a way to predict clinical or professional quality as a physician. Professional quality, the “art” of medicine, is based on something other than knowledge of chemistry. My search found something else, though: something troubling. It found consistent evidence that performance in the premedical sciences is inversely associated with many of the personal, non-cognitive qualities so central to the art of medicine. I first found this evidence in my own educational back yard.


Harrison Gough, a psychologist at UC Berkeley, gave psychological tests to 1071 students entering UCSF (1955–1967). He found that undergraduate grades and MCAT science scores were related to grades in the first 2 years of medical school but were "almost completely unrelated to performance in the fourth year and to faculty rating of general and clinical competence". He then compared the students' psychological profiles with their performance in undergraduate science and found that those who did better in science were "narrower in interests, less adaptable, less articulate, and less comfortable in interpersonal relationships". What first startled me about Gough's study was that, in my 5 years on the UCSF admissions committee, no one had ever given us this information.

Harrison Gough, a psychologist at the University of California, Berkeley, administered a series of psychological tests to 1071 students entering medical school at UCSF between 1955 and 1967. Gough reported that students’ undergraduate science grades and MCAT science scores were associated with grades in the first 2 years of medical school, but were, “almost completely unrelated to performance in the fourth year and to faculty rating of general and clinical competence”. He then compared the psychological profiles of these students with their performance in premedical sciences. He found that the students who did better in science were, “narrower in interests, less adaptable, less articulate, and less comfortable in interpersonal relationships”. What startled me when I first read the results of Gough’s research was that, through the 5 years of my participation on UCSF’s admissions committee, we were never informed of the results of his study.


Several other researchers have found that the psychological profile of students who do best in the premedical sciences is the reverse of what one hopes for in a physician. In the 1970s, Witkin found that the students most successful in the sciences "have an impersonal orientation: they are not very interested in others". Tutton's studies in Australia in the 1990s found that students who did best in the premedical sciences scored lower on standardised measures of empathy and tended to be shy, submissive, withdrawn, and socially awkward, characteristics the author called "the antithesis of what most of us would want in a clinician".

A number of others have found the psychological profile of students who perform best in the premedical sciences to be the reverse of what one might hope for in a physician. Writing in the 1970s, Witkin found students who were most successful in the sciences, “have an impersonal orientation: they are not very interested in others”. Tutton’s studies of medical students in Australia in the 1990s found that students who did the best in the premedical sciences scored lower on standardised measures of empathy and tended to be “shy”, “submissive”, “withdrawn”, or “awkward and ill at ease socially”, characteristics the author suggested are, “the antithesis of what most of us would want in a clinician”.




The Humanities and Medicine Program of the Mount Sinai School of Medicine in New York offers another example.



Gough HG. Some predictive implications of premedical scientific competence and preferences. J Med Educ 1978; 53: 291.


Tutton PJ. Psychometric test results associated with high achievement in basic science components of a medical curriculum. Acad Med 1996; 71: 181–86.


Witkin HA, Goodenough DR. Field dependence and interpersonal behavior. Psychological Bull 1977; 84: 661.




Lancet. 2010 Aug 28;376(9742):678-9.

Science as superstition: selecting medical students.

PMID: 20879079


Rater and occasion effects on the reliability of medical student selection (Med Educ, 2009)

Rater and occasion impacts on the reliability of pre-admission assessments

Rick D Axelson & Clarence D Kreiter





Research published over the past 5 years has developed and evaluated alternatives to the medical school pre-admission interview (MSPI). Several medical schools have already replaced the traditional MSPI with OSCE-style methods; approaches such as the MMI assess applicants' non-cognitive attributes and supplement measures of academic achievement (e.g. GPA) and cognitive aptitude (e.g. MCAT). Existing research suggests that these methods are more reliable than the MSPI.

However, research published over the last 5 years has documented the development and evaluation of new assessment techniques that are designed as alternatives to the MSPI. This literature suggests that a number of medical schools have already replaced the traditional MSPI with objective structured clinical examination (OSCE)-style measurement methods. Similar in function to the MSPI, station-based OSCE methods such as the multiple mini-interview (MMI)1,2 attempt to assess applicants’ non-cognitive attributes and generate scores that are used to supplement measures of undergraduate academic achievement (e.g. grade point average [GPA]) and cognitive aptitude (e.g. Medical College Admission Test [MCAT] score) in making admission decisions. Research suggests that, compared with the traditional MSPI, these new techniques yield summary scores with superior reliability.3–6


Adopting these new methods may improve the reliability and validity of the non-cognitive information gathered about applicants, but an OSCE-style format requires medical schools to substantially redesign both the interview process and the applicant's campus visit. Because an OSCE has multiple stations, each scored by an independent rater, large numbers of applicants, raters and standardised participants must be on campus at the same time. This substantially changes the nature, function and cost of the interview and the campus visit.

Although adopting this new method may improve the reliability and validity of the non-cognitive information obtained about an applicant, medical colleges must weigh this against the fact that OSCE-style assessments will require a significant restructuring of both the pre-admission interview process and the applicant’s campus visit. Because the OSCE format requires multiple independently rated performances of each applicant in response to station challenges, logistics necessitate that a large number of applicants, raters and standardised participants be simultaneously present at one location on campus for the administration of this type of assessment. The changes required by an OSCE-style assessment can alter the nature, function and costs of both the interview and the campus visit. 


Schools that use the interview to show applicants the strengths of the institution run relatively unstructured interviews with small groups of students. Although some report that the added expense is small, for these schools adopting an OSCE-style interview would generally raise costs considerably.

In schools that currently use the interview to familiarise applicants with the positive attributes of the institution, interviews generally require smaller groups of students and are typically conducted in relatively informal, less structured sessions. For these schools, implementing an OSCE-style approach would require significant restructuring of existing recruitment and admissions procedures and would probably increase their cost. Although some have maintained that the added expense is small,8 changing from the MSPI to an OSCE-style format will undoubtedly incur considerable development costs and alter the nature and function of the applicant’s pre-admission campus visit.


The scenarios, tasks and questions used in an admissions OSCE depend on how the construct being assessed is defined and conceptualised, so what happens within each station varies widely between schools.

Because the scenarios, tasks and interview questions used in an admission OSCE differ depending on how assessment designers define and conceptualise the construct being assessed, there is considerable variability across schools in what transpires within the stations presented.


Content for OSCE-style assessments has been designed around frameworks such as professionalism or job analysis, and whichever approach is used, the scores tend to be reliable. Despite differences in station content, every OSCE-style method has multiple independent raters assess each applicant. Because G studies and D studies of performance-based assessment consistently show that reliability rises as the number of independently rated behavioural samples increases, the high reliability obtained with the MMI is not surprising. Furthermore, because validity generalisation theory holds that reliability sets the ceiling on validity, improved validity can also be expected.

Approaches to shaping the content of the OSCE-style assessment have focused on designing station challenges that fit within a framework of professionalism, job analysis or another domain, but all tend to yield reliable scores. Yet, despite content differences, all OSCE-style methods are similar in that they elicit multiple independently rated applicant performances. Given that generalisability (G) studies and decision (D) studies of performance-based assessment scores have consistently demonstrated that increasing the number of independently rated behavioural samples also efficiently increases reliability, the positive reliability outcome from the use of the MMI is not surprising. Further, as validity generalisation theory suggests that reliability governs the maximum attainable validity,9 the positive validity outcomes are also expected.


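However, there is some misunderstanding about how to interpret G study results from the MMI. Roberts et al. recently estimated a G coefficient of 0.70 for the combined score on an eight-station MMI. This reliability is similar to that of other MMIs and far superior to that of the MSPI, but Roberts et al. interpreted the result as showing that 'interviewer subjectivity' is the main determinant of the reliability obtained. Even if that is true, their G study does not support the conclusion. Moreover, as we will show, simply adding raters to a single interview room does not raise reliability to the level Roberts et al. reported. If rater subjectivity were the main source of error, adding raters to a single station or running a panel interview would yield reliability comparable to the MMI. The high reliability in their G study is more likely due to the number of independently rated occasions.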

It should be pointed out, however, that there remains some misunderstanding regarding the interpretation of G study results derived using MMI scores. In a recent example, Roberts et al.3 published a G study of an MMI trial and estimated a G coefficient of 0.70 for a score summarising performance on an eight-station MMI. Although this reliability result is consistent with other studies of the MMI and is far superior to results obtained with the MSPI,10 Roberts et al.3 interpreted the results as suggesting that ‘interviewer subjectivity’ is the most important determinant governing the level of obtained reliability. Although this may be true, their G study does not support this conclusion. Further, as we will show, increasing the number of raters for a single encounter does not yield reliabilities to the level reported by Roberts and his colleagues.3 If rater subjectivity were the primary source of error, simply adding raters to a single station or panel interview would achieve reliabilities similar to those reported for the MMI. In the G study reported by Roberts et al.,3 it seems much more likely that the high level of reliability can be primarily attributed to the number of independently rated occasions on which the applicant was allowed to perform.


To understand MMI validity, it helps to study why OSCE-style techniques yield high reliability.

In understanding MMI validity, it is informative to study why OSCE-style techniques yield these high reliabilities. Do they emanate from the number of raters, the unique challenges presented by the MMI assessment or, alternatively, from the OSCE-style measurement format that affords multiple opportunities to perform? To help address these issues, the present study examines whether independent replications of the MSPI are likely to positively impact reliability. If the MSPI achieves dramatically improved reliabilities with a simple strategic restructuring, this may also imply that this modified MSPI is a useful intermediate approach for those who are currently unable to implement MMIs.



Methods

METHODS


Each interviewee participated in a 25-minute interview conducted by two faculty members. 

  • The interview began with a structured component (5-point scale)
    Interviews began with a structured component, in which candidates were read and responded to a series of four predetermined questions. Answers to each question were independently and immediately scored by the interviewers on a scale of 1–5 (5 = excellent, 1 = poor) using an established scoring rubric. 
  • An unstructured interview followed (5-point scale)
    Following the completion of the structured questions, the interview was opened to a free-flowing, unstructured exchange between the faculty interviewers and the candidate on any questions or topics of interest. At the end of the interview, each faculty interviewer independently assigned a score for the unstructured portion of the interview on a scale of 1–5 (5 = excellent, 1 = poor). 
  • The two parts took about the same amount of time
    On average, equal amounts of interview time were spent on the structured and unstructured parts of the interview during the study period (2003–2007).



Across the 5 years, 168 applicants were interviewed twice in consecutive years. As the faculty interviewers were drawn from a large pool (n > 150) and assigned to an applicant in a ‘pseudo random’ fashion, it is very unlikely that students who interviewed in consecutive years encountered the same interviewers. Consequently, a random model with rater (r) nested within both person (p) and occasion (o) and person crossed with occasion (r : [p × o]) was used to estimate variance components (VCs) for those applicants who interviewed twice.
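For reference, the textbook generalisability-theory decomposition for a design with raters nested in person × occasion can be written as below. The notation is the conventional one and is an assumption here, not taken from the paper; with one score per rater, the nested rater effect is confounded with residual error.

```latex
\begin{align}
\sigma^2(X_{por}) &= \sigma^2_p + \sigma^2_o + \sigma^2_{po} + \sigma^2_{r:po,e} \\
E\rho^2(n_o, n_r) &= \frac{\sigma^2_p}
  {\sigma^2_p + \dfrac{\sigma^2_{po}}{n_o} + \dfrac{\sigma^2_{r:po,e}}{n_o\, n_r}}
\end{align}
```

The second line is the relative G coefficient for a score averaged over $n_o$ occasions with $n_r$ raters each. Adding raters shrinks only the last error term, whereas adding occasions shrinks both.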












Effect of adding occasions (stations) > effect of adding raters

As shown in Fig. 2, increasing the number of interview occasions is much more effective than increasing the number of raters within an occasion. 

  • For example, the reliability estimate for one rater for one occasion is 0.23, but rises to 0.73 for nine occasions each with one rater. 
  • However, when the number of raters for a single occasion is increased, the reliability, estimated at 0.23 for one rater, increases to only 0.36 for nine raters.
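The pattern in these figures can be reproduced with a small D-study sketch. The variance shares below are an assumption: they were back-solved so that the standard relative G coefficient for raters nested in person × occasion matches the three reliabilities quoted above (0.23, 0.73, 0.36), and they are not the variance components reported in the paper.

```python
def g_coefficient(var_p, var_po, var_r, n_occasions, n_raters):
    """Relative G coefficient for raters nested in person x occasion."""
    error = var_po / n_occasions + var_r / (n_occasions * n_raters)
    return var_p / (var_p + error)


# Assumed variance shares (sum to 1), back-solved from the quoted reliabilities.
VAR_P = 0.23      # person (true-score) variance
VAR_PO = 0.36375  # person x occasion interaction
VAR_R = 0.40625   # rater within person x occasion, plus residual error

for n_o, n_r in [(1, 1), (9, 1), (1, 9)]:
    g = g_coefficient(VAR_P, VAR_PO, VAR_R, n_o, n_r)
    print(f"occasions={n_o}, raters={n_r}: G = {g:.2f}")
# occasions=1, raters=1: G = 0.23
# occasions=9, raters=1: G = 0.73
# occasions=1, raters=9: G = 0.36
```

Under these assumed shares, nine raters in one interview leave the person × occasion term untouched, which is why reliability stalls at 0.36, while nine single-rater occasions divide both error terms by nine.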



DISCUSSION


Multiple single-rater MSPIs alone can raise reliability.

These results suggest that the reliability of a score reflecting the summary of performances on multiple single-rater MSPIs is likely to be quite high and that a simple modification of the panel interview might substantially improve the quality of interview scores. For those schools that are reluctant to implement an MMI, a restructured MSPI might prove to be an effective intermediate approach.


Increasing the number of questions or raters within a single interview is not as effective as replicating the entire interview several times.

As G studies clearly indicate that increasing the number of questions or raters within a single interview will not enhance reliability in the same way as replicating the entire interview process,10 changes to a single-interview format are unlikely to provide an efficient means of enhancing reliability.


This is the approach the authors recommend.

In summary, this study indicates that replicating a number of brief interviews, each with one rater, is likely to be superior to the often recommended panel interview approach and may offer a practical, low-cost method for enhancing MSPI reliability.







Med Educ. 2009 Dec;43(12):1198-202. doi: 10.1111/j.1365-2923.2009.03537.x.

Rater and occasion impacts on the reliability of pre-admission assessments.

Author information

  • 1Department of Family Medicine, University of Iowa, Iowa City, USA. rick-axelson@uiowa.edu

Abstract

CONTEXT:

Some medical schools have recently replaced the medical school pre-admission interview (MSPI) with the multiple mini-interview (MMI), which utilises objective structured clinical examination (OSCE)-style measurement techniques. Their motivation for doing so stems from the superior reliabilities obtained with the OSCE-style measures. Other institutions, however, are hesitant to embrace the MMI format because of the time and costs involved in restructuring recruitment and admission procedures.

OBJECTIVES:

To shed light on the aetiology of the MMI's increased reliability and to explore the potential of an alternative, lower-cost interview format, this study examined the relative contributions of two facets (raters, occasions) to interview score reliability.

METHODS:

Institutional review board approval was obtained to conduct a study of all students who completed one or more MSPIs at a large Midwestern medical college during 2003-2007. Within this dataset, we identified 168 applicants who were interviewed twice in consecutive years and thus provided the requisite data for generalisability (G) and decision (D) studies examining these issues.

RESULTS:

Increasing the number of interview occasions contributed much more to score reliability than did increasing the number of raters.

CONCLUSIONS:

Replicating a number of interviews, each with one rater, is likely to be superior to the often recommended panel interview approach and may offer a practical, low-cost method for enhancing MSPI reliability. Whether such a method will ultimately enhance MSPI validity warrants further investigation.

PMID: 19930511


Seeking inclusion in a process of exclusion: medical school student selection (Med Educ, 2015)

Seeking inclusion in an exclusive process: discourses of medical school student selection

Saleem Razack, Brian Hodges, Yvonne Steinert & Mary Maguire










Over the past decade, calls have grown nationally and internationally for medical classes to become more demographically representative and to reflect the diversity of society. Even so, entry to medical school remains highly competitive and excludes underprivileged groups.

In the last decade, growing concerns at both national and international levels have resulted in calls for an increase in the demographic representativeness of medical classes to better reflect the diversity of society.1–3 Despite this, entry into medical school remains highly competitive and exclusive of underprivileged groups.


The cherished ideal of the medical profession is that students enter medical school on merit, both de facto and de jure.

The cherished ideal within the medical profession is that student selection for entry into medicine functions as a de facto and de jure system of meritocracy. 


The call to reflect demographic representativeness and the competitive selection process based on academic achievement are in inherent tension.

There is an inherent tension between calls to address the demographic representativeness of the profession and the competitive process of selection driven by academic achievement.


'Discourse', a term used in many social sciences and humanities, plays a central role in regulating social and institutional practices, including how some things come to be valued and justified over others. A discourse is an institutionalised way of believing, thinking and acting that sets the social boundaries of what can and cannot be said about a particular topic.

‘Discourse’, a term used by many in the social sciences and humanities disciplines, can play a central role in regulating social institutional practices, including the valuing and justification of certain actions over others. Discourse can be defined as an institutionalised way of believing, thinking and acting that includes allowing social boundaries to define what can and cannot be said about a particular topic.7


Theoretical framework


Foucault–Bourdieu–Bakhtin: the theoretical basis for the discourse analysis


Basis of the discourse analysis

Throughout the three phases of the study, our discourse analysis was grounded in the critical social and language theories of Michel Foucault, Pierre Bourdieu and Mikhail Bakhtin.9–11


Foucault's theory

Foucault’s theory of discourse was notably helpful in approaching the way in which discourses help construct versions of reality. Foucault12 was specifically concerned with how, in a given set of social conditions and history, it becomes possible to say that certain things are ‘true’ and other things are not. What is considered to be true is linked to power dynamics that are embedded in discourses and practices that are not always immediately visible to people involved in their (re)production. Accordingly, discourses form objects and exist through systems of distributions of power, which can lead either to the reproduction of existing structures or to the production of new ones.13 Foucault’s concept of discourse allowed us to appreciate what he called the ‘conditions of possibility’ for what agents of student selection and the applicants who desire to be selected articulate in their representations of self. Conditions of possibility emerge as statements of truth (énoncés) and make it possible to think, say and do certain things and not others. Statements of truth in turn can serve to generate ‘dividing practices’, or processes that categorise, classify and separate individuals. Discourses produced by organisations thus imply the creation, maintenance and transformation of certain relations of power and control by contributing to the definition of ways of believing, thinking and acting, and of social boundaries that define ‘who’ can say ‘what’ in a given context.


Bourdieu's four kinds of capital

Inspired by the economic notion of capital, Pierre Bourdieu discussed four kinds of capital: 

  • (i) economic capital: material and financial resources at the disposition of an individual; 
  • (ii) social capital: the potential and actual resources associated with the maintenance of a network of more or less institutionalised relations; 
  • (iii) cultural capital: the cultural resources made available to a given individual, mainly through his or her family environment and schooling, and 
  • (iv) symbolic capital: any kind of resource (social, religious, ethnic, associative, gender-based, artistic, etc.) that is recognised in a given society or group that contributes to the definition of an individual’s social status.14


Each of the four kinds of capital operates in a different field.

Each of these different types of capital is operative within a distinctive ‘field’, which refers to a relatively autonomous game or competition for capital that takes place through a configuration of objective relations that constitute an arena of the production, circulation and appropriation of goods, services, knowledge and status. Individuals may occupy different positions of power depending on their specific capital relative to each field. According to Bourdieu, different forms of capital interact, and one’s standing in different fields as well as the opportunities that are put at one’s disposal derive from the outcomes of this interaction. Bourdieu’s theory of capitals helped to frame how different applicants’ experiences, backgrounds or modes of preparation for the selection process might be more or less valued by the admission committee. It also contributed to a better understanding of the relationship between knowledge and power in the context of student selection. In fact, Brosnan15 discusses how medical education and admission into medical school can be conceptualised as a field within which applicants’ portfolios, personal narratives, curriculum or interview answers can be understood as capital or marketable commodities.


Bakhtin's perspective

We worked with Bakhtin’s dialogic view of language, in particular the concepts of ‘single-voiced’ (authoritative) and ‘double-voiced’ (internally persuasive) discourses, in order to understand the mechanisms by which language may be used to assert power. Single-voiced or authoritative language (‘This is so...’) ‘is directed towards its referential object and constitutes the ultimate semantic authority within the limits of a given context’, whereas the double-voiced or internally persuasive discourse (‘This may be so in reference to that...’) inserts ‘a new semantic intention into a discourse which already has, and which retains, an intention of its own’.16 The distinction between these two types of discourse helped us to understand how academic discourses come to predominate over service-to-society discourses, and then to relate these to the social world through the knowledge–power relation theories of Foucault and Bourdieu’s concepts of the forms of capital. Examining the discourses in this way can help us to understand how power might be asserted through language in order to claim ‘truth’.




Research questions


Phase 1. Stakeholders: institutions, credentialing and licensing bodies


Phase 2. Stakeholders: ACMs


Phase 3. Stakeholders: applicants to medical education programmes



Research phases



RESULTS AND DISCUSSION



Institutional representations of 'excellence': tension between the academic discourse and the service discourse

Phase 1a. Institutional representations of excellence: tension between discourses of the academy and the service discourses of the profession


The lessons learned from this phase of the research can be summarised thus: 

  • Scholarship, research and knowledge dominate the way medical schools describe 'excellence', and this creates a hierarchy among medical schools.
    scholarship, research and knowledge creation predominate when medical schools write about excellence, establishing a hierarchy among medical schools; 
  • There is a dialectic tension between excellence and social accountability.
    there is a dialectic tension between claims of excellence and those of social accountability; 
  • Diversity is reified into a concrete object and, within the discourses of excellence, appears as surface features such as race, religion and gender.
    diversity appears as a reified object, tokenised into superficial features (race, religion, gender, etc.) within the discourses of excellence, and 
  • The definition of 'equity' remains abstract.
    equity remains vaguely defined.

Policy documents: dysfunctional attempts to resolve the tension
Phase 1b. Policy documents: dysfunctional attempts to resolve the tension


Our conclusions from this phase of the research can be summarised thus: 

  • The documents analysed in this section present 'diversity' as if it were something new.
    the documents analysed in this section present diversity as something new; 
  • The concept of 'social accountability' is very vague, but it is valued as the reason for change.
    the concept of ‘social accountability’ is defined vaguely, but valued as the reason for required change, and 
  • Many documents present medical education as ahistorical and taken for granted, which makes it harder to integrate other perspectives.
    documents present medical education, as well as regulatory bodies, as ahistorical and as taken for granted, which makes it difficult to integrate alternative perspectives.

From the macro to the micro: how do admissions committee members (ACMs) operate within these tensions?

Phase 2. From the macro to the micro: how do ACMs live the discursive tensions in their day-to-day practice of student selection?



The following two exemplar quotations from committee members encapsulate this tension:


  • "We cannot lower the importance of academic achievement. That would go against a democratic society, even while understanding that students raised in wealthy households, whose parents have helped with homework since age three, achieve more."
    I’m saying that we cannot discount academic performance. I mean, that frankly goes against everything in a democratic society – understanding that excellent performance is much easier if you come from a wealthy household, whose parents helped you with your homework since age three.
  • "Basically, applicants have to be evaluated as individuals. In any given year it is not possible to sort the places into all kinds of categories and make everyone fit into these little holes."
    I think that the bottom line is that you have to evaluate the candidates that come to you as individuals and in any given year you can’t start establishing all sorts of quotas and trying to get people to fit into these little pigeon holes.

From the work in this phase of the research, we concluded that: 

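  • Admissions committees were negotiating a complex set of priorities, but paid particular attention to excellence.
    members of selection committees negotiate a complex terrain with multiple priorities, but pay special attention to excellence, and 
  • There is a tension between the discourse that diversity matters and long-standing inequities must be addressed, and the discourse that selection must be based on merit.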
    there is a tension in their discourse between the importance of diversity and the need to address historic inequities in the composition of the profession, and the notion of merit-based selection.



The students' perspective: traversing social distance through the reproduction of privilege

Phase 3. The perspective of students: traversing social distances through the reproduction of privilege



The analysis of applicant interviews in this phase of the research afforded us the following insights: 

  • Applicants negotiate between their identities and their perceived social distance from the medical profession.
    applicants negotiate their identities and their perceived social distance from the medical profession;
  • There is a tension between how they negotiate these features and how they perform their own authenticity.
    there are tensions between how they negotiate these features and how they perform their own sense of authenticity, and 
  • Applicants are aware of the privilege of the medical profession and follow scripted behaviours thought to appear appropriate in the selection process, which involve class, gender and language.
    participants understand the profession as privileged and perform by following scripted behaviours that are seen as appropriate in the selection process and involve aspects of class, gender and language.



Seeking inclusion in an inherently exclusive process


We propose the following principles for evaluating and accrediting student selection processes.

Specifically, we propose the following guiding principles through which to evaluate and accredit student selection processes for entry into medical school: 

  • In seeking excellence, the definition of 'excellence' should include both the academic and the service-to-society aspects.
    in the seeking of excellence, definitions of excellence that integrate both the academic and the service-to-society discourses should be included; 
  • For inclusiveness, the application process should be evaluated for how open it is to applicants from diverse backgrounds (its ability to welcome).
    to support inclusiveness, the process of applying to medical school is examined for its ability to welcome persons from diverse backgrounds; 
  • Selection tools should be aligned with curricular outcomes and should include content related to service to society.
    the selection tools used should be aligned with desired curricular outcomes, including those related to service to society; 
  • Fairness can be strengthened by identifying and addressing potential biases.
    fairness should be enhanced by ensuring that potential biases are examined and addressed, and 
  • Transparency is needed to address the power asymmetries arising from the hidden social and cultural capital embedded in the selection process, and it can be achieved by giving every applicant a clear description and knowledge of that process.
    transparency should be facilitated by clear descriptions and knowledge of the processes available to all candidates as an approach to addressing power imbalances that may result from the hidden forms of social and cultural capital inherent in the selection process.









Med Educ. 2015 Jan;49(1):36-47. doi: 10.1111/medu.12547.

Seeking inclusion in an exclusive process: discourses of medical school student selection.

Author information

  • 1Centre for Medical Education, Faculty of Medicine, McGill University, Montreal, Quebec, Canada; Department of Pediatrics, Faculty of Medicine, McGill University, Montreal, Quebec, Canada.

Abstract

CONTEXT:

Calls to increase medical class representativeness to better reflect the diversity of society represent a growing international trend. There is an inherent tension between these calls and competitive student selection processes driven by academic achievement. How is this tension manifested?

METHODS:

Our three-phase interdisciplinary research programme focused on the discourses of excellence, equity and diversity in the medical school selection process, as conveyed by key stakeholders: (i) institutions and regulatory bodies (the websites of 17 medical schools and 15 policy documents from national regulatory bodies); (ii) admissions committee members (ACMs) (according to semi-structured interviews [n = 9]), and (iii) successful applicants (according to semi-structured interviews [n = 14]). The work is theoretically situated within the works of Foucault, Bourdieu and Bakhtin. The conceptual framework is supplemented by critical hermeneutics and the performance theories of Goffman.

RESULTS:

Academic excellence discourses consistently predominate over discourses calling for greater representativeness in medical classes. Policy addressing demographic representativeness in medicine may unwittingly contribute to the reproduction of historical patterns of exclusion of under-represented groups. In ACM selection practices, another discursive tension is exposed as the inherent privilege in the process is marked, challenging the ideal of medicine as a meritocracy. Applicants' representations of self in the 'performance' of interviewing demonstrate implicit recognition of the power inherent in the act of selection and are manifested in the use of explicit strategies to 'fit in'.

CONCLUSIONS:

How can this critical discourse analysis inform improved inclusiveness in student selection? Policymakers addressing diversity and equity issues in medical school admissions should explicitly recognise the power dynamics at play between the profession and marginalised groups. For greater inclusion and to avoid one authoritative definition of excellence, we suggest a transformative model of faculty development aimed at promoting multiple kinds of excellence. Through this multi-pronged approach, we call for the profession to courageously confront the cherished notion of the medical meritocracy in order to avoid unwanted aspects of elitism.

© 2014 John Wiley & Sons Ltd.







A comprehensive model for the selection of medical students (Med Teach, 2009)

A comprehensive model for the selection of medical students

MILES BORE, DON MUNRO & DAVID POWIS

The University of Newcastle, Australia






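There are two reasons why medical schools need a selection procedure. First, there are more applicants than available places; second, society and the profession want to admit only those who will become competent and ethical doctors. The method preferred until now has been to select on academic achievement alone. In many countries, however, the available measures of academic achievement do not show sufficiently meaningful differences between applicants.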

There are essentially two reasons why medical schools around the world need to have a selection procedure for medical students. Firstly, there are almost invariably more applicants than there are places available. Secondly, there is a social and professional desire to admit only those who will become competent and ethical practitioners. The erstwhile preferred method to achieve the first aim is to select on prior academic achievement alone. However, in many countries the available measures of academic performance do not provide sufficient variance to allow meaningful differentiation in performance (Rolfe & Powis 1997; McManus et al. 2005) on which to base selection decisions.


The model we propose draws on medical student selection, personality psychology and psychometrics.

The model we propose has evolved from our research and work in the fields of medical student selection, personality psychology and psychometrics (Powis 1998; Powis & Rolfe 1998; Bore et al. 2005; Lumsden et al. 2005; Munro et al. 2005).


An overview of the model is shown in Figure 1.

The model consists of the following components and is shown schematically in Figure 1. 

  • Informed self-selection through the provision of timely vocational guidance. 
  • Academic achievement as indicated by performance at school and/or undergraduate studies. 
  • Cognitive ability as measured by psychometric testing.
  • Personality as measured by psychometric testing. 
  • Interpersonal skills as measured by interview.



Informed self-selection


Given that many applicants are only 17 or 18 years old when they apply, they should have the opportunity during their school years to consider whether they are really suited to medicine.

Given that many applicants are just 17 or 18 years old at the time of application, such insight usually needs to have been gained in the school years.


One approach is to build a website that shows everything about medical education and practice. It could include the following.

One possible approach would be the development of a website that presents a vocational ‘warts and all’ view of medical education and practice (Blundell et al. 2007).


  • Descriptions of being a medical student supplied by current and past students. 
  • Descriptions of internship and specialty training. 
  • A typical day in the life of each specialty, with negatives emphasised as much as positives. 
  • Suggestions on where to get more information. 
  • Suggestions on how to find out if one is suited to medicine (e.g. by doing voluntary work in a hospital). 
  • Suggestions about other health professional careers.



Academic achievement


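High academic achievement alone does not guarantee competent and ethical practice. However, the best predictor of future behaviour is past behaviour, and past academic achievement correlates significantly with future academic achievement.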

High academic achievement alone does not ensure the competent and ethical practice of medicine. However, the best predictor of future behaviour is past behaviour and past academic achievement is correlated significantly with future academic performance (Kuncel et al. 2001; McManus et al. 2005).


The measure of academic achievement may not discriminate sufficiently between applicants.

The metric used to indicate academic achievement might not provide sufficient discrimination between candidates,


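Moreover, even with a measure such as the UAI that discriminates widely between applicants, there is no reason to suppose that an applicant scoring 98.6 will make a better student or doctor than one scoring 95, 96 or 97.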

Even where the range of the metric allows greater discrimination, as with the Universities Admissions Index (UAI) in Australia, there is no reason to suppose that an applicant with a UAI of 98.6 (out of 100) will make a better student or doctor than a person with 95 or 96 or 97, for example.


Whether specific subjects should be required as prerequisites is also an important issue.

There is also the issue of whether achievement in specific subjects should be used as prerequisites.


We believe that such prerequisites should not simply add another hurdle, and should be used only when the medical school curriculum requires the subject.

We believe that it should not be imposed simply as an extra hurdle, but only when required by the medical school curriculum.


Requirements such as prerequisites make a large difference to access to the medical programme. Lower-level and fewer requirements greatly increase access for people from lower socioeconomic backgrounds, whereas higher-level and differently weighted requirements shrink the applicant pool and make the next stage of selection manageable.

These will influence the degree of access to the medical school program: lower levels and fewer prerequisites would provide greater access to people from lower social and economic backgrounds (Powis et al. 2007), while higher levels and differently weighted prerequisites will reduce the applicant pool so that the next stage of selection is manageable.


One thing is certain: the less weight given to academic achievement, the more weight must go to other variables. This is what justifies using other selection variables.

One point is obvious: Where less weight is given to academic criteria, more weight has to be given to other selection variables. It is this very point that, in part, justifies using other selection variables.



Cognitive ability


Most medical school selection procedures include a test of cognitive skills.

Most medical school selection procedures in Australia and the UK include a test of cognitive skills.


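There are two issues with knowledge tests.

  • First, such knowledge tests are redundant, because the knowledge has already been measured reliably by academic achievement indicators.
  • Second, even where applicants have knowledge of a specific area, that area is already part of the medical curriculum, so using the variable in selection is debatable.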

There are two issues with knowledge tests. First, they appear to be redundant given that such knowledge has been measured more reliably by academic achievement indicators (also probably more validly than is possible with brief tests). The other issue is that gaining knowledge in these specific areas is part of the medical curriculum anyway, and so the justification for using such a variable in selection is debatable.


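However, tests of specific knowledge may be appropriate for shortened graduate entry programs (GEPs), where they can check knowledge, such as biology and chemistry, that the curriculum does not cover.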

However, the use of specific knowledge tests may be appropriate for shortened graduate entry programs where it is necessary to check for some content knowledge of biology and chemistry not covered in the program’s curriculum.


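Another approach is to test specific cognitive abilities (rather than knowledge) that may be related to success in medical school and as a doctor, such as verbal ability, verbal reasoning, numerical ability and numerical reasoning. However, meta-analyses show that General Cognitive Ability (GCA) is a moderate to strong predictor of attainment and performance within occupations, and that measures of specific abilities do not appreciably add to the overall predictive power of GCA.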

An alternative approach is to measure specific cognitive abilities (rather than knowledge) that may be related to success in medical school and medical practice, such as verbal ability/ reasoning, numerical ability/reasoning and so on. However, meta-analytic research has clearly shown that General Cognitive Ability (GCA) is a moderate to strong predictor of occupational attainment and performance within occupations, and that measures of specific abilities do not appreciably improve the overall predictive power of GCA (Schmidt & Hunter 2004; Brown et al. 2006).


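The advantage of using GCA rather than specific abilities lies in the concept of 'compensation'. A GCA test can include questions that measure specific cognitive abilities (SCA), such as verbal, numerical, abstract, spatial and other reasoning abilities. If these are summed into a single score, a lower score in one specific area can be compensated for by higher scores in others. If selection is based on SCA alone, however, high scores in one or two areas cannot compensate for low ability in another. Taking GCA as the main approach also helps to secure diversity among the selected students: even if every applicant has high GCA, specific abilities will differ within the group, which we consider important given the diversity of medical practice.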

An advantage of using an indicator of GCA rather than measuring narrower specific abilities hinges on the concept of ‘compensation’. A test of GCA might include questions that measure specific cognitive abilities, e.g. verbal, numerical, abstract, spatial and other reasoning abilities. If summed to produce a single score (indicating individual differences in GCA), then a lesser ability in one specific area can be compensated for by higher abilities in other areas. However, if people are selected on their performance on individual specific abilities, then high ability in one or two areas cannot compensate for lower ability in others. The outcome of using a general (compensatory) approach to ability testing in selection is higher variability within the selected pool of applicants. All will have high GCA, but specific abilities will vary within this pool and this we see as important, given the diversity of medical practice.
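
The difference between compensatory and non-compensatory scoring can be sketched in a few lines. The applicants, sub-score names and thresholds below are hypothetical, and taking the mean of the sub-scores is only a stand-in for a single GCA score.

```python
# Hypothetical z scores on four specific abilities for three applicants.
applicants = {
    "A": {"verbal": 1.8, "numerical": -0.4, "abstract": 1.2, "spatial": 0.9},
    "B": {"verbal": 0.6, "numerical": 0.7, "abstract": 0.5, "spatial": 0.6},
    "C": {"verbal": -0.2, "numerical": 0.1, "abstract": 0.3, "spatial": 0.0},
}


def compensatory(scores, min_mean=0.5):
    """Select on the mean of the sub-scores (a single, GCA-like score)."""
    return sum(scores.values()) / len(scores) >= min_mean


def non_compensatory(scores, min_each=0.0):
    """Select only if every specific ability clears its own threshold."""
    return all(value >= min_each for value in scores.values())


selected_gca = [name for name, s in applicants.items() if compensatory(s)]
selected_specific = [name for name, s in applicants.items() if non_compensatory(s)]

print("compensatory (single score):", selected_gca)
print("non-compensatory (each ability):", selected_specific)
```

Applicant A has one weak area offset by strengths elsewhere, so A is retained under the single-score rule but dropped when each ability must clear its own threshold. This is the sense in which the compensatory approach leaves more variation in specific abilities within the selected pool.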


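Although some studies show academic achievement to be a better predictor than GCA, meta-analyses of GCA show a significant relationship between GCA and both educational performance and occupational attainment.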

While some research has found academic achievement to be a better predictor of some occupational outcomes in medicine than GCA (McManus et al. 2003), the broader research reported in the GCA meta-analysis literature does demonstrate the significant relationship of GCA to educational performance and occupational attainment.


The problem for selection committees that want to use a test of GCA is finding a suitable one. Such tests exist, but most are already well known and vulnerable to coaching.

The challenge for selection committees is to find a test that measures GCA, but not academic achievement/knowledge, that can be administered to large groups of applicants. Such tests do exist (e.g. Raven’s Progressive Matrices), but most of them are well-established tests that are widely known and susceptible to coaching.



Personality

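The use of personality tests is probably the most contested area of medical student selection. Nevertheless, personality measures have long been used extensively for selection in the police, the military, government service, and commerce and industry.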

The use of personality tests in the selection of medical students is possibly the most contentious area in the selection debate. However, personality measures have been (and are) used extensively for selection in the commercial and industrial sectors as well as the police, military and other government services in many countries.


The FFM has the most supporting evidence.

‘Five Factor Model’ (FFM) has emerged as the dominant empirically supported approach.


Each of the five traits has lower-order facets.

Each of the five traits consists of a number of lower-order facets.


Three points from a recent meta-analysis of personality testing deserve emphasis.

There are three relevant points made in a recent personality meta-analysis that may be emphasised. 


First, personality traits have shown predictive validity. The coefficients are not high, at about 0.10 to 0.45, but it has been recognised for many years that even such modest validity is valuable in a process like medical student selection, where selection decisions are very important and the proportion selected is very low.

First, personality traits have demonstrated predictive validity: there are correlations between personality predictors and work-related criteria of between 0.10 and 0.45. While such coefficients appear low, it has been recognised for many years that tests of even modest validity can make a valuable contribution to selection decisions where the proportion to be selected from the applicant pool is low and the importance of good selection decisions is high (as in medicine). 
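
Why modest validity still pays off at a low selection ratio can be shown with a standard result for top-down selection. The sketch assumes that the predictor and the criterion are bivariate normal and that the top scorers are selected; neither the formula nor the numbers come from the paper. Under those assumptions, the expected standardised criterion score of those selected is the validity multiplied by the normal density at the cut point, divided by the selection ratio.

```python
from statistics import NormalDist


def mean_criterion_of_selected(validity, selection_ratio):
    """Expected criterion z score of the selected group.

    Assumes bivariate normality and strict top-down selection on the
    predictor, so that E[criterion | selected] = r * pdf(cut) / ratio.
    """
    std_normal = NormalDist()
    cut = std_normal.inv_cdf(1 - selection_ratio)  # predictor cut point
    return validity * std_normal.pdf(cut) / selection_ratio


validity = 0.20  # hypothetical, within the 0.10 to 0.45 range quoted above
for ratio in (0.50, 0.10):
    gain = mean_criterion_of_selected(validity, ratio)
    print(f"selection ratio {ratio:.2f}: expected criterion z = {gain:.2f}")
```

With the same validity, selecting one applicant in ten yields a larger expected gain on the criterion than selecting one in two, which is the situation medical schools typically face.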


Second, although cognitive ability explains much of performance in the workplace, personality adds to that predictive validity. That is, adding personality traits predicts more than cognitive ability alone.

Second, while cognitive ability accounts for a greater proportion of variance in work-related performance criteria, personality has incremental predictive validity: that is, the proportion of variance accounted for in job-related criterion measures increases when personality traits are included alongside cognitive ability (Ones et al. 2007).
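
Incremental validity can be made concrete with the two-predictor formula for the squared multiple correlation. The three correlations below are hypothetical and serve only to show the arithmetic.

```python
def r_squared_two_predictors(r_ability, r_personality, r_between):
    """Squared multiple correlation for two predictors of one criterion.

    r_ability and r_personality are the correlations of each predictor
    with the criterion; r_between is the correlation between predictors.
    """
    numerator = (r_ability ** 2 + r_personality ** 2
                 - 2 * r_ability * r_personality * r_between)
    return numerator / (1 - r_between ** 2)


# Hypothetical correlations (assumptions, not values from the paper).
r_ability, r_personality, r_between = 0.50, 0.25, 0.10

ability_only = r_ability ** 2
combined = r_squared_two_predictors(r_ability, r_personality, r_between)
incremental = combined - ability_only

print(f"variance explained by ability alone: {ability_only:.3f}")
print(f"variance explained by both:          {combined:.3f}")
print(f"incremental contribution:            {incremental:.3f}")
```

The increment is largest when the personality measure correlates with the criterion but only weakly with cognitive ability, because the two predictors then carry little overlapping information.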


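Third, although people tend to 'fake good', the resulting loss of predictive validity is very small. This is encouraging, but we address two strategies for countering deliberate faking on personality measures in high-stakes testing.

    • One is to include a lie scale.
    • The other is to exclude extreme scorers.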

The third point is that while there is a tendency for people to ‘fake good’ on personality tests when they are taken under high stakes conditions, it has been found that this reduces the predictive validity of the tests only minimally (Ones et al. 2007). While this is encouraging, we suggest that faking on personality measures in high stakes testing can be countered to some extent by two strategies: inclusion of a ‘lie scale’ and by exclusion of extreme scorers. The details of these strategies are elaborated below.


Including non-cognitive variables in selection requires close attention to two things. The first is which variables to include. Based on the literature, there are four essential criteria.

The inclusion of non-cognitive variables in selection requires careful consideration of two aspects. First is the question of what variables to include. Our view based on the literature is that there are four essential non-cognitive criteria for competent and ethical practitioners in the medical and allied health professions:


  • ‘Involved with’ rather than ‘detached from’ or ‘manipulative of’ others (Agreeableness in terms of the Big 5). 
  • ‘Emotionally stable’ and ‘resilient’ rather than ‘overly emotionally reactive’ or ‘unpredictable’ (Big 5 ‘Emotional Stability’). 
  • ‘Self-controlled’ and ‘conscientious’ rather than ‘impulsive’ and ‘disorderly’ (Big 5 ‘Conscientiousness’). 
  • Neither too judgemental nor too permissive in one’s moral/ ethical values.

There are other non-cognitive variables with evidence of predictive validity.

There are other non-cognitive variables that can be, and have been, used in selection, for example, 

    • integrity (considered to be a ‘compound’ personality construct related to conscientiousness, agreeableness and emotional stability; Ones & Viswesvaran 2001), 
    • stress tolerance (Hogan R & Hogan J 1995) and 
    • Moral Orientation (Bore et al. 2005).

Tests for these qualities can be valuable where there is evidence of predictive validity for specific occupations.

The main issue is how to use these variables. A simple sum is inappropriate.

A major issue in the present context is how these variables should be used in the selection model. 




Combining scores to create the interview pool


Several models can be considered when making selection decisions. The most appealing is the regression model, in which the selection variables are regressed on a criterion or outcome, and the resulting regression weights are applied to actual selection scores. This approach has been widely used in occupations where an outcome variable is readily available.

There are several models that can be considered when using scores to make selection decisions (Gatewood & Feild 2001). Perhaps the most appealing is the regression model where, experimentally, the selection variables are regressed against a criterion/outcome variable and the regression weights found are then applied to actual selection scores. This approach has been used extensively in selecting people for jobs in which outcome variables are readily available (e.g. sales performance, production rates, accident rates, absenteeism and so on).


In medicine, however, a set of criterion variables that is easy to measure, valid and ethically obtainable has not yet been established. This is not surprising.

A set of criterion variables (or compound of criteria) for the practice of medicine that can be reliably measured, that have evidence of validity and that can be obtained ethically has yet to be agreed and established. This is perhaps not surprising.


An alternative to the regression model is to use the scores in a multiple cut-off model, in which applicants who fall within a particular range are selected.

An alternative to the regression model is to use the scores of each test in a multiple cut-off model. That is, applicants who score within a particular range for each variable measured are selected (in the model we are outlining here) into the interview pool.



The selection procedure – A demonstration


Like any selection procedure, this model is designed to maximise the probability of selecting the most appropriate students. First, an applicant pool is formed from past academic achievement and informed self-selection. We recommend limiting it to the top 10% of academic achievers. This effectively lowers the contribution of school performance, and more of the variance in overall ability is handled in the later selection steps.

Like any selection procedure, the model here is designed to maximise the probability of selecting the most appropriate students. In the first instance a pool of applicants is created based on past academic performance and informed self-selection. Each medical school sets its own academic performance criterion; however, we would suggest a cut point that allows the top 10% of academic achievers over this hurdle (Neame et al. 1992). This effectively lowers the contribution of school performance: more of the variance in overall ability is dealt with by the subsequent selection steps.


Assume a hypothetical pool of 1000 applicants.

To demonstrate the multiple cut-off method, we have used actual scores from experimental testing conducted by us in a number of sample groups to create a hypothetical n = 1000 pool.



Test scores are obtained on six variables, converted to z scores, and used to select the interview pool.

In our sample of 1000 applicants, test scores were obtained for six variables as shown in Table 1. The raw scores were normed and transformed into z scores (mean of 0, SD of 1). We can now use the scores to create the interview pool in a two-step procedure.
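
The z-score transformation mentioned above can be sketched as follows. The raw scores are hypothetical, and standardising against the applicant pool's own mean and population SD is an assumption; the paper says only that the scores were normed.

```python
from statistics import mean, pstdev


def to_z_scores(raw_scores):
    """Standardise raw scores to mean 0 and SD 1 within the given pool."""
    mu = mean(raw_scores)
    sigma = pstdev(raw_scores)  # population SD of the pool (an assumption)
    return [(score - mu) / sigma for score in raw_scores]


raw = [12, 15, 9, 20, 14]  # hypothetical raw scores on one test
z = to_z_scores(raw)

for score, z_score in zip(raw, z):
    print(f"raw {score:>2} -> z {z_score:+.2f}")
```

Putting every test on the same scale is what allows a single cut point such as ±2 to mean the same thing across the six variables.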


Remove applicants with extreme self-representation.

In Step 1, excessively positive self-representation on the non-cognitive tests can be managed by removing extreme high scorers on a Lie scale.


We recommend excluding extreme high or low scorers beyond 2 SD.

We also suggest that the extreme high and low scorers (z scores less than −2 or greater than +2) on the non-cognitive tests should not progress to the interview pool. 


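There are two reasons. First, 'faking good' produces extremely high scores, so this can serve as a strategy for screening out lying. Second, people who are extremely reactive, impulsive and detached from others are unsuited to medicine, and so are those at the opposite extreme: overly controlled, lacking in emotion, or overly empathic and confident. In terms of moral orientation, people who are too liberal or too attached to social rules are likely to have problems with ethical practice, and a balance between the two extremes matters more.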

The reason for this is two-fold. First, ‘faking good’ would produce extremely high scores so this is another strategy to manage lying. Second, while being excessively reactive, impulsive or detached (narcissistic and aloof) is inappropriate to the practice of medicine, so too is being at the other extreme: overly controlled, resilient (lacking in emotion) or involved (overly empathic and confident). With regard to moral orientation, we suggest that being too liberal or too socially rule-bound is potentially problematic in terms of ethical practice and a balance between these two extremes is more appropriate (Bore et al. 2005).




Step 1 reduced the 1000 applicants by 9.9%, to 901.

Applying Step 1 to our sample of 1000 applicants resulted in a reduction of 9.9% to 901.
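Step 1 can be sketched as a simple filter over z-scored applicants. This is only an illustration: the variable names are hypothetical, the Lie scale threshold of z > 2 is an assumption (the source says only "extreme high scorers"), and the three applicants are made up.

```python
# Hypothetical names for the non-cognitive variables; the source names
# Control, Resilience, Involvement and a moral orientation measure.
NON_COGNITIVE = ("control", "resilience", "involvement", "moral_orientation")


def passes_step1(applicant, lie_cutoff=2.0, extreme=2.0):
    """Return True if the applicant survives Step 1.

    lie_cutoff is an assumed threshold for 'extreme high' Lie scale scores.
    extreme is the +/-2 z-score band suggested in the text.
    """
    if applicant["lie"] > lie_cutoff:
        return False
    # Drop anyone with a non-cognitive z score below -2 or above +2.
    return all(-extreme <= applicant[name] <= extreme for name in NON_COGNITIVE)


# Made-up z scores for three applicants.
applicants = [
    {"id": "A", "lie": 0.3, "control": 0.5, "resilience": -0.2,
     "involvement": 1.1, "moral_orientation": 0.0},
    {"id": "B", "lie": 2.6, "control": 0.1, "resilience": 0.4,
     "involvement": 0.2, "moral_orientation": -0.3},
    {"id": "C", "lie": -0.1, "control": -2.4, "resilience": 0.0,
     "involvement": 0.6, "moral_orientation": 0.9},
]

step1_pool = [a for a in applicants if passes_step1(a)]
step1_ids = [a["id"] for a in step1_pool]
```

Here applicant B is removed by the Lie scale and applicant C by the extreme low Control score, leaving only A.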


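In Step 2, lowest-score cut points are applied to the personality and cognitive tests to determine the interview pool. This step can be repeated, raising the cut points step by step, until only the number to be interviewed remains.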

The light shaded area in Figure 2 results from Step 2, which involves applying lowest score ‘cut points’ for the personality (Control, Resilience and Involvement) and cognitive test results to reach the number of applicants to be interviewed. This process can be done iteratively, raising and lowering the lowest score cut points, until only the number to be interviewed remain in the pool.
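The iterative search in Step 2 might look like the sketch below. It is one possible reading, not the authors' procedure: it assumes each variable's cut point is a shared base value plus a per-variable offset (a hypothetical way of expressing relative weight), and it only raises the base, so it can stop slightly below the target. The pool and variable names are made up.

```python
def apply_cut_points(pool, cut_points):
    """Keep applicants whose z score meets the cut point on every variable."""
    return [a for a in pool
            if all(a[var] >= cut for var, cut in cut_points.items())]


def find_cut_points(pool, offsets, target, start=-2.0, step=0.25, max_iter=100):
    """Raise a shared base cut point until at most `target` applicants remain.

    offsets: per-variable additions to the base; a larger offset means a
    higher cut point for that variable (an assumed weighting scheme).
    """
    base = start
    for _ in range(max_iter):
        cuts = {var: base + off for var, off in offsets.items()}
        remaining = apply_cut_points(pool, cuts)
        if len(remaining) <= target:
            return cuts, remaining
        base += step
    raise RuntimeError("no cut points found within max_iter steps")


# Made-up z scores; the cognitive test gets a higher cut point than Control.
pool = [
    {"id": "a1", "cognitive": 1.5, "control": 0.5},
    {"id": "a2", "cognitive": 0.2, "control": 1.0},
    {"id": "a3", "cognitive": -0.5, "control": 0.1},
    {"id": "a4", "cognitive": 1.0, "control": -1.0},
    {"id": "a5", "cognitive": 0.8, "control": 0.9},
]
cuts, interview_pool = find_cut_points(
    pool, offsets={"cognitive": 0.5, "control": 0.0}, target=2)
interview_ids = [a["id"] for a in interview_pool]
```

In practice a school would also lower individual cut points again when the pool shrinks too far, as the text describes; that back-and-forth adjustment is left out here to keep the sketch short.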


Changing a cut point amounts to changing the weighting among the variables. A variable with a higher weight has a higher cut point.

Changing the lowest score cut point on any of the given variables changes the weighting given to that variable. Greater weight is given to a variable if its cut point is raised, and lesser weight is given if the cut point is lowered, relative to the remaining variables.


An advantage of this method is that each medical school can adjust the mix of qualities it wants in its incoming students.

This method has several advantages. It allows individual medical schools to determine the mix of qualities they want in their incoming students.


Another advantage is that selection for interview does not rest on performance in a single test.

Another advantage is that proceeding to the interview pool does not ultimately rest on performance in one test (usually a cognitive skills test).


Perhaps the most important advantage of this method is that tests with stronger validity evidence can be given greater weight.

Perhaps, the most important advantage is that the method allows tests with a greater evidence of validity to have greater weighting in selection decision-making.




Interviews


The final component of this selection model is the interview. There are two issues: whether to interview at all, given the burden on faculty, and what to assess.

The final component of our selection model is the interview. There are two major issues in relation to medical school selection interviews. First, whether to interview at all given the costs and inconvenience to faculty staff; and second, what to assess in the interview.


Our view on the first is that no selection model should omit contact between the applicant and a representative of the medical school.

Our view regarding the first is that no selection model should omit an opportunity for personal contact between the applicant and a representative of the medical school.


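A problem with interviews is that applicants have already thoroughly rehearsed their 'motivation to be a doctor'. Their answers may or may not be truthful. The alternative is therefore to measure only what can be directly observed in the interview.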

A problem with interviews is that they are sometimes used as a measure of ‘motivation to be a doctor’ with applicants usually having carefully rehearsed their answer to the question ‘Why do you want to be a doctor?’ Their answer might, or might not, reflect reality. An alternative, and preferable, approach is to use the interview to measure only what can be directly observed.


  • interpersonal skills/communication, 
  • punctuality and presentation, 
  • decision-making (in response to presented scenarios) and 
  • behaviour under pressure (globally throughout the interview or in a specifically designed task),

An extended form of the MMI would rate applicants at every point of contact with admission staff while they carry out a series of different tasks.

An extension of the multiple mini interviews might be to have the applicants rated at all points of contact with school admission staff (including administrators as well as interviewers) over a series of different tasks.


Each medical school has the opportunity to develop its own interview procedure, and many have done so. We think it is important that medical schools vary so that applicants do not face a single uniform system.

There is an opportunity for medical schools to develop their own interview procedures and many have done so. We see it as important that there is variety among medical schools so that candidates are not faced by a monolithic system.


The key point is that interviews and interview tasks must be structured and objective. Only observable behaviour should be rated, and all raters and interviewers should be trained in the procedure.

The important point is that the interviews and tasks need to be structured and objective, in that only observable behaviour is rated, and that all interviewers/raters should have been trained in the procedure.



 2009 Dec;31(12):1066-72. doi: 10.3109/01421590903095510.

A comprehensive model for the selection of medical students.

Author information

  • School of Psychology, University of Newcastle, Callaghan, NSW, Australia. Miles.Bore@newcastle.edu.au

Abstract

BACKGROUND:

Medical schools have a need to select their students from an excess of applicants. Selection procedures have evolved piecemeal: Academic thresholds have risen, written tests have been incorporated and interview protocols are developed.

AIM:

To develop and offer for critical review and, ultimately, present for adoption by medical schools, an evidence-based and defensible model for medical student selection.

METHODS:

We have described here a comprehensive model for selecting medical students which is grounded on the theoretical and empirical selection and assessment literature, and has been shaped by our own research and experience.

RESULTS:

The model includes the following selection criteria: Informed self-selection, academic achievement, general cognitive ability (GCA) and aspects of personality and interpersonal skills. A psychometrically robust procedure by which cognitive and non-cognitive test scores can be used to make selection decisions is described. Using de-identified data (n = 1000) from actual selection procedures, we demonstrate how the model and the procedure can be used in practice.

CONCLUSION:

The model presented is based on a currently best-practice approach and uses measures and methods that maximise the probability of making accurate, fair and defensible selection decisions.

PMID: 19995169 [PubMed - indexed for MEDLINE]

