MMI test characteristics: ask candidates to recall rather than to imagine (Med Educ, 2014)

Multiple mini-interview test characteristics: ‘tis better to ask candidates to recall than to imagine

Kevin W Eva1 & Catherine Macala2





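By definition, the MMI gathers information about a candidate through a series of independent observations (usually in the form of interviews), with a blueprint built from the goals and desires of the selecting institution and the characteristics of the profession the selected students will enter. The MMI should therefore be regarded as an assessment process rather than an assessment tool or instrument. The question "What does the MMI select for?" is consequently meaningless, because the answer depends entirely on the implementation.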

By definition, it involves collecting (and aggregating across) a series of brief independent observations of the candidate (typically in the form of interviews), preferably blueprinted against the goals and desires of both the institution making the selection and the profession to which the candidate is applying. As a result, the MMI should be considered a process of assessment rather than a tool or instrument, and generic questions such as ‘For what does the MMI select?’ are meaningless because the answer is entirely dependent on implementation.


Just as an MCQ examination can be assembled to represent diverse content areas, an MMI can be composed of highly varied stations.

Just as one can populate a multiple-choice question (MCQ) examination with questions representative of diverse content areas, one can populate an MMI with highly variable stations.


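Prior research points to some general principles. Reliability rises as the number of observations increases, reaching a plateau at around 10 to 12 stations; lengthening the time per station offers little benefit; and observing performance across several independent situations is more effective than adding raters within each situation.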

Research has identified general principles, including that the reliability of measurement improves with increasing number of observations, often reaching a plateau in the 10–12 range,2 that extending the length of the interactions has little discernible benefit,3 and that observing performance across independent situations has a greater beneficial impact on the reliability of measurement than does incorporating the opinions of multiple raters within each situation.4,5
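The diminishing return from adding stations can be illustrated with the Spearman-Brown prophecy formula, which projects the reliability of an average over n parallel observations from the reliability of a single one. The sketch below is only an illustration of that general principle: the single-station reliability of 0.20 is a hypothetical value chosen for the example, not a figure from the paper, and the function name is an assumption.

```python
def spearman_brown(single_station_reliability: float, n_stations: int) -> float:
    """Projected reliability of the mean score across n_stations stations.

    Assumes the stations behave like parallel measurements, which is a
    simplification of any real MMI.
    """
    r = single_station_reliability
    return n_stations * r / (1 + (n_stations - 1) * r)


# Hypothetical single-station reliability (not taken from the study).
R_SINGLE = 0.20

projection = {n: spearman_brown(R_SINGLE, n) for n in (1, 4, 8, 10, 12, 16, 20)}

for n, reliability in projection.items():
    # Each extra station adds less than the one before it.
    print(f"{n:2d} stations -> projected reliability {reliability:.2f}")
```

Under these assumptions, going from 1 to 4 stations adds far more reliability than going from 12 to 16, which is the sense in which the curve flattens.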



Background


The MMI process rests on two foundations: sampling and structure.

The MMI process was largely designed on two foundations: sampling and structure.


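The emphasis on sampling stems from concerns about trait-based models of human behaviour. The words used to describe people ('smart', 'eloquent', 'professional') suggest fixed characteristics, yet actual behaviour is highly context-specific.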

The priority placed on sampling is drawn from empirically derived concerns about trait-based models of human behaviour.6 Whereas the adjectives we use to describe people (e.g. ‘smart’, ‘eloquent’, ‘professional’) imply unwavering features of the individual, behaviour has been shown repeatedly to be context-specific.7


A single observation of one clinical situation tells us no more about a person than a single MCQ item tells us about that person's knowledge.

One observation tells us no more about an individual’s clinical prowess than one MCQ answer tells us about the extent of an individual’s knowledge base.


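Contrary to the claim that an 8-minute interview cannot show every aspect of a candidate's ability, we regard this logistical necessity as a strength rather than a weakness. Several studies show that the value of longer interviews is illusory, because interviewers form their impressions of a candidate very quickly. Moreover, more time gives the candidate more opportunity to steer the conversation away from the interview's intended focus.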

Contrary to the argument that 8-minute selection interviews do not allow sufficient time to yield a full perspective on a candidate’s ability, we view this logistic necessity as a strength rather than a liability. A variety of studies have demonstrated that the added value of longer interviews is illusory as examiners tend to form impressions very quickly.9,10 Further, more time yields greater opportunity for the applicant to sway the conversation to issues that are distinct from the intended focus of the interview.11


9 Ambady N, Bernieri F, Richeson J. Toward a histology of social behaviour: judgmental accuracy from thin slices of the behavioural stream. Adv Exp Soc Psychol 2000;32:201–72.

10 Ambady N, Rosenthal R. Thin slices of expressive behaviour as predictors of interpersonal consequences: a meta-analysis. Psychol Bull 1992;111:256–74.


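The value of the second foundation, structure, is less clear. When the MMI was first created, the literature indicated that the inter-rater reliability of panel-based interviews varied widely but improved when interviews were structured (by giving interviewers specific questions). Although this is intuitively plausible, recent findings call the assumption into question. Kreiter et al. pointed out that earlier studies offered only indirect comparisons. Applying generalisability analysis to 25-minute medical school admission interviews made up of five structured questions, they showed that the variance attributable to 'question' was negligible. From this the authors concluded that asking multiple questions (i.e. increasing sampling) washes out differences in difficulty between questions, so that structuring the questions offers no advantage. A few years later, an unstructured component was added to the structured interview at the same institution, and Axelson et al. reported that its scores had higher inter-rater and test–retest reliability than the structured component. The conclusion remains ambiguous.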

The value of the second foundation, structure, however, has become less clear over time. When the MMI was created, the literature on panel-based interviewing practices revealed that the inter-rater reliability of such exercises was highly variable, but tended to be greater when interviews were structured by giving interviewers a specific set of questions.16 This remains intuitively appealing, but recent research has led us to question this assumption. Kreiter et al.17 critiqued the literature for offering only indirect comparisons. Using data collected from a set of 25-minute medical school selection interviews containing five structured questions, they used generalisability analyses to illustrate that the variance attributable to question had a negligible influence on the reliability observed. These findings led the authors to argue that asking multiple questions (i.e. increased sampling) washes out differences in difficulty level across questions such that structuring questions offers no advantage. A few years later, an unstructured component was added to the end of the structured interview at the same institution, and Axelson et al.18 reported that resulting scores had greater inter-rater and test–retest reliability than the structured component. As the authors noted, it is unclear whether the performance of the unstructured interview derived from the fact that it followed the structured interview or whether the benefit of such structuring is illusory.


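This question matters greatly because creating structured stations is one of the main barriers to adopting the MMI process. Concerns about test security have led many institutions to build or purchase a database of stations to guard against breaches (even though the impact of such breaches is uncertain). If the benefits of the MMI turn out to be unrelated to structure, that is, if they derive mainly from sampling, the cost of introducing an MMI would fall considerably.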

This is an important question because the creation of structured stations is one of the primary barriers to adoption of the MMI process.19 Concern about test security breaches derived from the repeated use of set questions has led most institutions we have encountered to generate or purchase a database of stations to reduce this risk (although the impact of such breaches remains questionable20,21). If the benefits that have been observed to accrue from the adoption of MMI practices are unrelated to structure and, instead, are derived dominantly from the sampling it promotes, then the cost inherent in creating an MMI might be substantially reduced.


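The most common type of MMI station asks the candidate to discuss a relevant issue with the interviewer. What counts as 'relevant' depends on the blueprint the institution has drawn up, and published examples mostly present a dilemma related to situations the candidate may encounter. In the organisational and industrial psychology literature, such interview dialogues are either experience-based (recalling past experiences) or situation-based (imagining a situation one might face). Which type of interview is more effective has been much debated.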

The most common type of MMI station involves asking a candidate to discuss an issue of relevance with an examiner. The definition of ‘relevance’ depends on the blueprint the institution establishes, but published examples indicate a tendency towards describing a dilemma about which the candidate is expected to engage in dialogue. The organisational and industrial psychology literature defines such dialogues as generally being ‘experience-based’ (i.e. candidates are required to recall their particular experiences and the behaviours they demonstrated) or ‘situation-based’ (i.e. candidates are required to imagine and describe what they would do if they were to encounter a particular situation).22 There has been considerable debate in this literature regarding which type of interview is most effective.


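Proponents of situation-based interviewing argue that interviews should be future-oriented, so that candidates without similar past experience still have the chance to demonstrate their personal qualities in the given situation.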

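Proponents of experience-based interviewing, in contrast, argue that past behaviour is the most accurate predictor of future behaviour, and that interviews should avoid the hypothetical and focus on past experience.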


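Impression management (managing how one appears to others) takes different forms depending on the interview type: situation-based interviews tend to elicit ingratiation (inducing liking, for example by conforming to the interviewer's opinions), whereas experience-based interviews mainly elicit self-promotion (attributing one's success to one's own competence rather than to other factors).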

Those who favour situation-based interviewing argue that structure is important and that interviews should be future-oriented so that interviewees without previous experience in a given context are granted the opportunity to demonstrate their personal qualities; those who favour experience-based interviewing argue that past behaviour is most predictive of future behaviour and, as a result, one should avoid discussion of the hypothetical and focus on previous experience.20 Impression management (i.e. attempts to control the image one projects) appears to take place in different ways according to interview type, with situation-based interviews tending to induce ingratiating tactics (i.e. behaviours aimed at inducing liking, such as opinion conformity) and experience-based interviewing tending to induce self-promotion (i.e. behaviours aimed at indicating that one’s success is attributable to competence rather than other factors).11



Participants


Four circuits, 12 stations, 48 examiners.

Four distinct circuits of 12 stations required the participation of 48 examiners.



Materials


All stations were based on the CanMEDS framework.

All stations were focused upon the Professional role promoted within the CanMEDS framework presented by the Royal College of Physicians and Surgeons of Canada.25


The four SJ stations asked candidates to imagine a situation that could arise during later training and to say what they would do.

Four SJ stations were designed around this role, the operational definition being that the station had to present a situation that could plausibly occur during medical training and would require the candidate to imagine and discuss what he or she would do in that situation.


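Four examiners (one per circuit); station instructions posted on the door; a one-page description of the station's intent; up to six questions per station. Examiners were to hold a real conversation rather than read the questions as a script, with the questions serving only as prompts to keep the conversation going. Background information on CanMEDS. Scoresheet: three qualities rated on a 6-point scale: (i) communication skills, (ii) reasoning ability, and (iii) professionalism.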

This information was provided to the four examiners who were assigned to that station (one per circuit) and posted on the doors of their rooms for candidates to read. In addition, examiners were given one page of information outlining both the intent of the station and a list of up to six questions they could ask the candidate. They were told that they should engage in actual dialogue with candidates rather than treating the list of questions as a script (i.e. the questions were presented simply as prompts that examiners might find useful if conversation stalled). Examiners were also given a page of background information outlining aspects of the CanMEDS competencies that were relevant to the situation described, along with a copy of the scoresheet on which they were to offer their assessment. None of the background information or prompting questions contained content that was specific to the instructions given to candidates and thus the same information could be given to examiners in other experimental conditions. The scoresheet consisted of a series of 6-point scales (1 = weak, 2 = below average, 3 = average, 4 = very good, 5 = excellent, 6 = exceptional) on which examiners were asked to rate each candidate’s (i) communication skills, (ii) reasoning ability, and (iii) professionalism. Brief definitions were provided for each quality.


The four BI stations were created by slightly modifying the SJ stations.

To generate the four BI stations, each of the SJ stations was modified so that the candidate was instructed to think of a time in which he or she had experienced a situation analogous to the scenario presented in the SJ station.


All other information was identical to that for the SJ stations.

All other information provided to the examiners on these stations was identical to that provided to the SJ station interviewers with the exception of minor wording revisions to ensure that the grammar remained appropriate.


For the FF stations, examiners were asked to hold a conversation that would let them evaluate the candidate's suitability.

To generate the four FF stations, examiners were told simply that we wanted them to conduct a conversation that would help them evaluate the candidate’s suitability for the Professional role. They were given the same background information as used in other stations, but the prompting questions were removed. The station instruction, as presented to candidates, said simply:



Procedure


Candidates were randomly assigned.

Candidates were randomly assigned to a circuit and a starting station.
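One way such an assignment could be carried out is sketched below. It assumes that each (circuit, starting station) slot can hold at most one candidate, which follows from having one examiner per station, and it uses the 4 circuits of 12 stations and the 41 candidates reported in the abstract. The function name, the candidate labels and the seed are illustrative only; the paper does not describe its randomisation procedure in this excerpt.

```python
import random


def assign_candidates(candidates, n_circuits=4, n_stations=12, seed=None):
    """Randomly give each candidate a unique (circuit, starting_station) slot."""
    slots = [
        (circuit, station)
        for circuit in range(1, n_circuits + 1)
        for station in range(1, n_stations + 1)
    ]
    if len(candidates) > len(slots):
        raise ValueError("more candidates than available starting slots")
    rng = random.Random(seed)
    # sample() draws without replacement, so no slot is used twice.
    chosen = rng.sample(slots, k=len(candidates))
    return dict(zip(candidates, chosen))


# Hypothetical candidate labels; 41 matches the number reported in the abstract.
candidate_ids = [f"candidate_{i:02d}" for i in range(1, 42)]
assignment = assign_candidates(candidate_ids, seed=2014)

for candidate in candidate_ids[:3]:
    circuit, start = assignment[candidate]
    print(f"{candidate}: circuit {circuit}, starting at station {start}")
```

With 41 candidates and 48 slots, seven starting slots stay empty in any given run.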


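Two minutes to read the instructions; the interview ended after 7 minutes and candidates moved to the next room. There was a 3-minute pause between stations: 1 minute to complete the candidate survey and 2 minutes to read the next station's instructions.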

At the start of the MMI, candidates were given 2 minutes to read the first station, after which a buzzer was sounded to alert them to enter the interviewing rooms. Seven minutes later, another buzzer was sounded to indicate that the interview was complete and that the candidate should move to the next station. From this point onward, a pause of 3 minutes was provided between stations and candidates were asked to spend 1 minute completing a candidate survey about the preceding station and 2 minutes reading and preparing for the next station.
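The timing described above can be laid out as a simple timetable. The sketch assumes that each candidate rotates through all 12 stations without any additional breaks, which the text implies but does not state outright; the function and variable names are illustrative.

```python
def mmi_timeline(n_stations=12, first_read=2, interview=7, pause=3):
    """Return (station_number, start_minute, end_minute) for each interview.

    Minute 0 is the moment candidates begin reading the first station.
    """
    events = []
    clock = first_read  # the first buzzer sounds after the initial reading time
    for station in range(1, n_stations + 1):
        start = clock
        end = start + interview
        events.append((station, start, end))
        # 1 minute of survey plus 2 minutes of reading before the next station
        clock = end + pause
    return events


timeline = mmi_timeline()
total_minutes = timeline[-1][2]  # the circuit ends with the last interview

for station, start, end in timeline[:3]:
    print(f"station {station}: minutes {start} to {end}")
print(f"total duration: {total_minutes} minutes")
```

Under these assumptions the circuit takes 2 + 12 × 7 + 11 × 3 = 119 minutes, not counting any survey completed after the final station.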



Analysis


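Context specificity is indicated by the Applicant x Station interaction. The study design makes it difficult to separate rater effects from station effects, so a pure test of context specificity is not possible. This design decision rested on three reasons.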

Context specificity is generally indicated by a large Applicant X Station interaction. The design of this study did not allow us the capacity to separate rater influences from station influences and therefore a pure test of context specificity is not available. This design decision was based on three reasons:

  • Rater effects appear in every experimental condition.
    (i) rater effects are likely to be present in all experimental conditions; 
  • Having one examiner per station is common in MMIs and OSCEs.
    (ii) the inclusion of one examiner per station is common practice in MMIs, objective structured clinical examinations (OSCEs) and other comparable assessment activities, and 
  • Previous studies show that rater variance contributes little compared with station variance.
    (iii) previous work has robustly indicated that rater variance tends to contribute little error relative to station variance.4,5



RESULTS


Reliability


The Applicant x Station interaction was the largest source of variance, followed by the residual error and then Applicant.

Table 1 reveals that the dominant source of variance in all cases was the Applicant X Station interaction. The residual error (Item X Station X Applicant [Circuit]) was next most dominant, followed by Applicant differences, which accounted for 10.0–18.7% of the variance.


Variance attributable to Applicant followed the order BI > SJ > FF, indicating that BI stations discriminated best between candidates. The main effects of Station, Item and Circuit, and their interactions, were negligible.

The variance attributable to Applicant declined from BI to SJ and then to FF stations, suggesting that BI stations offered better capacity to consistently discriminate between applicants relative to the other forms of interview. The main effects of Station, Item and Circuit, and their interactions, were negligible, generally contributing < 3% of the variance in scores.
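To see how variance components of this kind translate into a reliability estimate, a generalisability (G) coefficient can be computed as the Applicant variance divided by the Applicant variance plus the error variance of the averaged score. The sketch below uses a simplified design (applicants crossed with stations, items within stations) and hypothetical variance proportions; it does not reproduce the study's actual model or its Table 1 values, and the function name is an assumption.

```python
def g_coefficient(var_applicant, var_applicant_x_station, var_residual,
                  n_stations, n_items):
    """Relative G coefficient for a score averaged over stations and items.

    Simplified sketch: the Applicant x Station component is averaged over
    stations, and the residual over stations and items.
    """
    error = (var_applicant_x_station / n_stations
             + var_residual / (n_stations * n_items))
    return var_applicant / (var_applicant + error)


# Hypothetical variance proportions (not the study's values).
components = {
    "var_applicant": 0.15,
    "var_applicant_x_station": 0.50,
    "var_residual": 0.30,
}

g_four = g_coefficient(**components, n_stations=4, n_items=3)
g_twelve = g_coefficient(**components, n_stations=12, n_items=3)

print(f"G with 4 stations:  {g_four:.2f}")
print(f"G with 12 stations: {g_twelve:.2f}")
```

The example shows why a larger Applicant share matters: with the error terms held fixed, raising the Applicant variance or adding stations both increase G.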


Inter-station reliability, which reflects how consistent scores are across stations, was best for BI.

Inter-station reliability, reflecting the extent to which the scores assigned are consistent across stations, suggested that BI stations allowed better measurement than SJ or FF stations.




Relationship to the actual admission MMI

SJ, r = 0.45; BI, r = 0.57, and FF, r = 0.42.

The correlations between the average of the four stations within each station type and the average of the 9-station MMI used for the actual admission decision were: SJ, r = 0.45; BI, r = 0.57, and FF, r = 0.42.



Acceptability


Candidates found the FF stations more difficult and more anxiety-provoking.

In general, candidates considered the FF stations to be more challenging and more anxiety-provoking than either the SJ or BI stations (Table 4). 


Examiners' perceptions did not differ much between station types.

In general, examiners’ perceptions of their ability to assess candidate performance and the amount of strain MMI stations placed on candidates were insensitive to station type, although BI stations were rated relatively low on one question (Table 5).



DISCUSSION


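Because the various aspects of utility used to judge the quality of an assessment process often do not align (e.g. raising reliability tends to lower feasibility), compromises are usually required. We were surprised that the various outcomes yielded consistent conclusions, both internally and with respect to validity studies. Although moderate reliability can be reached simply by aggregating many observations (G = 0.66 for FF), structuring the stations brought benefits in both acceptability and reliability, although the reliability benefit appeared only for BI. Even if SJ were equal to BI in reliability, the greater feasibility (ease of creation) and equivalent acceptability of BI stations would favour their use.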

Given that the various aspects of utility used to assess the quality of assessment processes commonly do not align (e.g. increasing reliability tends to decrease feasibility), thereby requiring that compromises are made,14 we were surprised by the extent to which the various outcomes considered yielded consistent conclusions both internally and with respect to validity studies that have been conducted in other domains of selection. Although moderate reliability can be achieved simply by aggregating across many observations (G = 0.66 in the FF condition), there did appear to be some benefit from the structuring of stations in terms of both acceptability and reliability, the latter being true only when BI techniques were used (G = 0.77). Even if SJ stations were to be considered equal to BI stations in terms of their reliability, the greater feasibility (i.e. ease of generation) and equivalent acceptability of BI stations would support the prioritising of their use.


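Speculatively, using BI stations, which have candidates reflect on their own experiences, may also address one of the early criticisms of the MMI: that candidates have no opportunity to present autobiographical details.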

Speculatively, the use of BI stations, which require candidates to reflect on and discuss personal experiences they have had, may also help MMI administrators to address one of the more robust early criticisms of the MMI process, which claims that candidates desire an opportunity to present autobiographical details during their interview.1







17 Kreiter CD, Solow C, Brennan RL, Yin P, Ferguson K, Huebner K. Examining the influence of using same versus different questions on the reliability of the medical school preadmission interview. Teach Learn Med 2006;18 (1):4–8.


18 Axelson R, Kreiter C, Ferguson K, Solow C, Huebner K. Medical school preadmission interviews: are structured interviews more reliable than unstructured interviews? Teach Learn Med 2010;22 (4):241–5.


20 Reiter HI, Salvatori P, Rosenfeld J, Trinh K, Eva KW. The effect of defined violations of test security on admissions outcomes using multiple mini-interviews. Med Educ 2006;40:36–42. 


21 Griffin B, Harding DW, Wilson IG, Yeomans ND. Does practice make perfect? The effect of coaching and retesting on selection tests used for admission to an Australian medical school. Med J Aust 2008;189:270–3.


23 Taylor PJ, Small B. Asking applicants what they would do versus what they did do: a meta-analytic comparison of situational and past behaviour employment interview questions. J Occup Organ Psychol 2002;75 (3):277–94.


24 Klehe U-C, Latham G. What would you do – really or ideally? Constructs underlying the behaviour description interview and the situation interview in predicting typical versus maximum performance. Hum Perform 2006;19:357–82.


















Med Educ. 2014 Jun;48(6):604-13. doi: 10.1111/medu.12402.

Multiple mini-interview test characteristics: 'tis better to ask candidates to recall than to imagine.

Author information

  • 1Centre for Health Education Scholarship, Faculty of Medicine, University of British Columbia, Vancouver, British Columbia, Canada.

Abstract

CONTEXT:

The multiple mini-interview (MMI), used to facilitate the selection of applicants in health professional programmes, has been shown to be capable of generating reliable data predictive of success. It is a process rather than a single instrument and therefore its psychometric properties can be expected to vary according to the stations generated, the alignment between the stations and the qualities an institution prioritises, and the outcomes used. The purpose of this study was to explore the MMI's test characteristics when station type is manipulated.

METHODS:

A 12-station MMI was established in which four stations were presented in three different ways. These included: situational judgement (SJ) stations, in which applicants were asked to imagine what they would do in specific situations; behavioural interview (BI) stations, in which applicants were asked to recall what they did in experienced situations, and free form (FF) stations, which were unstructured in that the examiner was simply given a brief explanation of the intent of the station without further guidance on how to conduct the discussion. Four circuits of the 12 stations were run with one examiner within each station. Candidates and examiners were surveyed regarding their experience. The reliability of the scores derived from the assessment was analysed separately for each station type.

RESULTS:

A total of 41 medical school candidates participated after completing the regular admission process. Although the score assigned did not differ across station type, BI stations more reliably differentiated between candidates (g = 0.77) than did the other station types (SJ, g = 0.69; FF, g = 0.66). The correlation between actual MMI scores and BI stations was also greatest (BI, r = 0.57; SJ, r = 0.45; FF, r = 0.42). Candidates' opinions indicated that FF stations were more anxiety-provoking, less clear, and more difficult than structured stations (SJ and BI stations). Examiner opinions indicated equivalence on these measures.

CONCLUSIONS:

The results suggest that structuring stations has value, although that value was gained only through the use of BI stations, in which candidates were asked to recall and discuss a specific experience of relevance to the purpose of the interview station.

© 2014 John Wiley & Sons Ltd.

PMID: 24807436 [PubMed - indexed for MEDLINE]

