Variance in the attributes assessed by the MMI (Medical Teacher, 2014)

Variance in attributes assessed by the multiple mini-interview

NIKKI BIBLER ZAIDI1, CHRISTOPHER SWOBODA2, LEIGH LIHSHING WANG2 & R. STEPHEN MANUEL3

1University of Michigan Medical School, USA, 2University of Cincinnati, USA, 3University of Cincinnati College of Medicine, USA






Introduction

What would the most reliable admissions interview format look like? A discussion of the MMI. The MMI's multi-sampling technique has been welcomed as a means of overcoming low reliability.

The medical school preadmission interview (MSPI) remains a widely used tool in medical school admissions (Monroe et al. 2013); therefore, discussions regarding the most reliable and valid MSPI format continue to evolve (Edwards et al. 1990; Goho & Blackman 2006). Over the past decade, the multiple mini-interview (MMI) has gained considerable attention as an alternative to more traditional MSPI formats. The MMI is a multi-sampling, structured interview format in which interviewers, referred to as “raters,” assess specific applicant attribute(s) using multiple 5–15 minute interview stations. Each interview station is assigned a different discussion prompt, referred to as a “scenario;” likewise, each station has a different rater who is tasked with assigning applicants scores for a set of items on an evaluation tool (Eva et al. 2004c; Pau et al. 2013). The MMI’s multi-sampling technique has been celebrated for increasing the low reliability estimates that plague traditional MSPIs (Eva et al. 2004b, c; Uijtdehaage et al. 2011), and some studies suggest that MMI scores can be used to predict performance during medical school clerkships and on medical licensure examinations (Eva et al. 2004a, 2009, 2012).


The reliability of the MMI is commonly estimated using G theory. The goal of any measurement, the MMI included, is to reduce the unwanted variance that obscures true scores. Most G-theory studies of the MMI have modeled raters and stations as facets.

Reliability of the MMI is commonly estimated using Generalizability (G) theory because of the multi-faceted nature of the measurements. In any measurement process, including the MMI, the goal is to reduce the unwanted variance in observed scores that can obscure true scores. G theory can simultaneously capture multiple sources of unwanted variance, referred to as “facets,” to provide an estimate of generalizability – or reliability (Brennan 2001). Consequently, MMI reliability is expressed as a G coefficient and represents a “universe of admissible observations” – a universe defined by the specific facet(s) that the researcher decides to include in the model. The decision regarding facets for inclusion is based on the context of a measurement to which the researcher plans to generalize findings. For instance, if some raters are always more lenient or more severe than other raters, then raters are a source of unwanted variance in MMI scores, and the rater facet would be modeled in a subsequent G study if the researcher wishes to generalize across raters. Most MMI studies associated with medical school admissions have modeled raters and/or stations as facets (Eva et al. 2004b, c; Uijtdehaage et al. 2011).
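For intuition, the rater example above can be made concrete with the simplest case: a one-facet applicant × rater design, where the G coefficient is true-score variance divided by true-score plus relative-error variance. This is a minimal sketch; the function name and the numeric inputs are illustrative assumptions, not the paper's model:

```python
def g_coefficient_one_facet(var_p: float, var_pr: float, n_r: int) -> float:
    """Relative G coefficient for a one-facet p x r (applicant x rater)
    crossed design: applicant (true-score) variance over applicant
    variance plus relative error, which shrinks as raters are added."""
    return var_p / (var_p + var_pr / n_r)

# Illustrative values: applicant variance 0.5, applicant-rater
# interaction variance 1.0, averaged over 4 raters.
print(round(g_coefficient_one_facet(0.5, 1.0, 4), 3))  # -> 0.667
```

Adding raters divides the interaction variance by `n_r`, which is exactly why multi-sampling formats such as the MMI can raise reliability.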


Each facet is composed of “conditions,” which are analogous to the levels of a factor in CTT. Researchers assume that conditions can be altered without making the measurement any less acceptable.

Each facet that defines the universe of admissible observations is comprised of “conditions” (Brennan 2001). These conditions are analogous to the levels of a factor in classical test theory (CTT). Overall, the MMI literature reports facets with a range of corresponding conditions and it is presumed, as an assumption for most applications of G theory, that the varying conditions represent random samples from these facets. Therefore, it is also generally assumed by researchers that these conditions can be altered without making the measurement any less acceptable. Although G theory can examine the extent to which such changes in a facet’s conditions make the measurement more or less acceptable (Shavelson & Webb 1991), this concept of interchangeability has not been examined for all potential facets of the MMI.


G-coefficient studies of the MMI have modeled raters and stations as facets but have largely ignored which attributes are being assessed. The attributes assessed by the MMI, though rarely modeled so far, can be a substantial source of variance. Studies have identified as many as 87 different attributes required of a physician, yet an admissions interview can assess only a handful of them. The specific attributes assessed by an MMI will therefore vary across medical schools: leadership, cultural sensitivity, interpersonal skills, critical thinking, and so on. These attributes are typically rated on a Likert scale, under the implicit assumption that the attributes chosen are a random selection from the many characteristics important for a physician. In other words, MMI studies assume these attributes are interchangeable at the item level: a score for “leadership” is parallel to, and interchangeable with, a score for “cultural sensitivity.” The researchers question the face validity of this assumption, and further investigation is needed.

The extant literature reports moderate to high G coefficients for medical school MMIs ranging from 0.58 to 0.81 (Eva et al. 2004b, c; Uijtdehaage et al. 2011). These reports, however, are based on studies that have modeled raters and stations as facets but have essentially ignored the impact of the attributes assessed. The attributes assessed by the MMI have the potential to introduce additional, and largely unaccounted for, variance in MMI scores. The medical literature identifies up to 87 different attributes considered important for an aspiring physician (Price et al. 1971; Albanese et al. 2003); yet, an MSPI, including the MMI, can only reasonably capture a handful of these attributes. Therefore, the specific attributes assessed by an MMI will vary across medical schools and can range from leadership potential, cultural sensitivity, interpersonal skills, and critical thinking to a single, overall performance score (Eva et al. 2004c; Reiter et al. 2007; Uijtdehaage et al. 2011). These attributes are generally assessed as items on a Likert-like scale (Eva et al. 2004c) and carry the implicit assumption that an institution’s choice of attribute(s) can be considered a random selection from the domain of characteristics deemed important for the medical profession. Subsequently, MMI studies have largely considered attributes to be interchangeable conditions within the item facet. This would suggest that it is reasonable to believe that scores for the item, “leadership potential,” are parallel and interchangeable with scores for the item, “cultural sensitivity.” Consequently, the researchers question the face validity of this assumption and believe it warrants further investigation.


Furthermore, the composition of attributes assessed by the MMI extends beyond the item facet into the station facet. Each MMI station runs on a specific scenario tied to a particular topic, and raters use that scenario to score predetermined attributes captured as “items” on an evaluation form. Differences among station scenarios can therefore introduce yet another source of unintended variance. Previous studies have included the station facet, but only as a measurement occasion. Thus the established finding that generalizability rises as the number of measurement occasions increases may merely restate CTT's Spearman-Brown prophecy formula.

Furthermore, the impact of the composition of attributes assessed by the MMI has the potential to reach beyond the item facet into the station facet. Each MMI station is assigned a specific scenario that focuses on topics such as “knowledge of the healthcare system” or “critical thinking” and is intended to elicit information regarding a set of predetermined attributes that are captured as items on an evaluation form (Eva et al. 2004c). Consequently, differences among station scenarios have the potential to introduce further unwanted variance in the attributes assessed. Previous studies have modeled the station facet in their G studies (Eva et al. 2004b, c; Uijtdehaage et al. 2011); however, these studies generally recognize the station facet in terms of a measurement occasion only. Therefore, while it is well-established in the literature that increasing the number of measurement occasions increases generalizability estimates for the MMI, this finding merely aligns with CTT's Spearman-Brown prophecy formula.
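The Spearman-Brown projection referred to above can be stated compactly: it predicts the reliability of a measurement lengthened by a factor of k from the reliability of a single occasion. A minimal sketch, where the reliability value 0.30 and the station count 6 are illustrative numbers, not figures from the paper:

```python
def spearman_brown(rho: float, k: float) -> float:
    """Spearman-Brown prophecy formula: predicted reliability of a
    measurement whose number of parallel occasions is multiplied by k."""
    return (k * rho) / (1 + (k - 1) * rho)

# A hypothetical single-station reliability of 0.30 projected to 6 stations.
print(round(spearman_brown(0.30, 6), 3))  # -> 0.72
```

This is why more stations reliably push the G coefficient up even when nothing is learned about which facet the error comes from.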


Consequently, the existing literature has largely ignored the influence of station scenarios on the assessment of attributes in the MMI. Given that the MMI was developed in large part to dilute context specificity, this warrants further study. Indeed, in Eva et al.'s pilot study, the applicant-station interaction accounted for five times more variance than the applicant alone, suggesting that station content can be a major source of error in MMI scores. Yet no study has examined how the content of an MMI station, i.e. the scenario, affects the evaluation of medical school applicants on specific attributes (i.e. items). This study examines whether items, defined as specific attributes on an MMI evaluation form, are assessed consistently across MMI stations regardless of scenario.

Consequently, the extant literature has largely ignored the potential influence of the stations’ scenario on the assessment of attributes within an MMI. Given the fact that the MMI was created in large part to dilute the effects of context specificity (Eva 2003), this warrants further investigation. In fact, Eva et al.’s (2004c) pilot study concluded that variance attributable to the candidate–station interaction was five times greater than that assigned to the candidate alone. This finding suggests that station content may introduce the most significant source of error in MMI scores. Yet, to the best of the authors’ knowledge, no study has examined how the MMI station content – the scenario – may influence the assessment and evaluation of medical school applicants on a set of predetermined attributes (i.e. items). This study will explore whether items, defined as specific attributes on an MMI evaluation form, are assessed consistently across MMI stations regardless of station scenario.



Methods

This study examines one aspect of psychometric evidence from one United States (US) medical school that has fully adopted the MMI process as a replacement for the traditional MSPI. Using G theory, this study examines the variance attributable to the item facet and the scenario-item interaction. Data used for analysis represent MMI scores that were collected for the sole purpose of making admissions decisions. These data come from a US medical school that receives approximately 4000 admissions applications annually and interviews approximately 625 applicants each year. This institution fully adopted the MMI to select the entering class of 2009. With IRB approval (# 10-06-08-01), all applicants who participated in the MMI from 2009 to 2013 are included in the dataset used for analysis. This empirically collected dataset represents a nested design; therefore, only a small subset of applicants was used in this analysis in order to create a fully crossed design because in G theory, nested facets make it impossible to estimate all variance components separately (Brennan 2001).


“Communication,” operationalized as six specific attributes plus one overall score.

After a comprehensive blueprinting process, the school’s Admissions Committee identified one overarching characteristic – communication – to assess through the MMI. The rationale for choosing this single construct was largely rooted in literature that suggests that one of the chief patient complaints concerns poor communication between the patient and physician (Wofford et al. 2004). Communication was selected as the single construct from the larger domain of attributes deemed important for an aspiring physician. To operationalize this construct, six specific attributes and one “overall score” were used as sub-score items in the MMI evaluation tool. The specific attributes assessed by this MMI included (1) multiple perspectives, (2) reflection of scenario, (3) articulation, (4) interest in dilemma, (5) non-verbal communication, and (6) interpersonal skills. These seven items were measured on a seven-point Likert-like scale that assumes equal intervals between the anchors (Unsatisfactory-1; Below Average-2; Slightly Below Average-3; Average-4; Slightly Above Average-5; Above Average-6; Outstanding-7).


G-String IV software (Bloch & Norman, Hamilton, Ontario, Canada), was used to estimate variance components attributable to the facets of measurement. In G theory, the object of measurement is not considered a facet. Therefore, this study’s two-facet design includes the object of measurement – applicants (p) – and two facets of generalization – scenario (s) and item (i).
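G-String IV reports these variance-component estimates directly. As an illustration of what the software is computing, here is a minimal numpy sketch of variance-component estimation for a fully crossed p × s × i random-effects design with one observation per cell, using the usual expected-mean-square equations; the function is an illustrative sketch, not the paper's software or data:

```python
import numpy as np

def variance_components(X):
    """Variance components for a fully crossed p x s x i random-effects
    G study, X of shape (n_p, n_s, n_i), one observation per cell."""
    n_p, n_s, n_i = X.shape
    grand = X.mean()
    mp, ms, mi = X.mean(axis=(1, 2)), X.mean(axis=(0, 2)), X.mean(axis=(0, 1))
    mps, mpi, msi = X.mean(axis=2), X.mean(axis=1), X.mean(axis=0)

    # Mean squares for main effects and two-way interactions.
    MS_p = n_s * n_i * np.sum((mp - grand) ** 2) / (n_p - 1)
    MS_s = n_p * n_i * np.sum((ms - grand) ** 2) / (n_s - 1)
    MS_i = n_p * n_s * np.sum((mi - grand) ** 2) / (n_i - 1)
    MS_ps = n_i * np.sum((mps - mp[:, None] - ms[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_s - 1))
    MS_pi = n_s * np.sum((mpi - mp[:, None] - mi[None, :] + grand) ** 2) \
        / ((n_p - 1) * (n_i - 1))
    MS_si = n_p * np.sum((msi - ms[:, None] - mi[None, :] + grand) ** 2) \
        / ((n_s - 1) * (n_i - 1))
    resid = (X - mps[:, :, None] - mpi[:, None, :] - msi[None, :, :]
             + mp[:, None, None] + ms[None, :, None] + mi[None, None, :] - grand)
    MS_psi = np.sum(resid ** 2) / ((n_p - 1) * (n_s - 1) * (n_i - 1))

    # Solve the expected-mean-square equations (negative estimates set to 0).
    v_psi = MS_psi
    v_ps = max((MS_ps - MS_psi) / n_i, 0.0)
    v_pi = max((MS_pi - MS_psi) / n_s, 0.0)
    v_si = max((MS_si - MS_psi) / n_p, 0.0)
    v_p = max((MS_p - MS_ps - MS_pi + MS_psi) / (n_s * n_i), 0.0)
    v_s = max((MS_s - MS_ps - MS_si + MS_psi) / (n_p * n_i), 0.0)
    v_i = max((MS_i - MS_pi - MS_si + MS_psi) / (n_p * n_s), 0.0)
    return dict(p=v_p, s=v_s, i=v_i, ps=v_ps, pi=v_pi, si=v_si, psi=v_psi)
```

Dividing each component by the total then gives the percentage-of-variance figures of the kind reported in the Results section.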


Facet of differentiation

The object of measurement is considered the facet of differentiation. This facet of differentiation is analogous to the dependent variable and is considered the only desired source of variation. In other words, the object of measurement is the “universe” or “true” score (Brennan 2001). The facet of differentiation is the person, the applicant (p) facet, which represents the true MMI score for the applicant. Therefore, this variance should be large and other modeled sources of variance are expected to be small.


Facets of generalization

The facets of generalization are analogous to the independent variables and they contribute unwanted sources of error to the universe score, or for this study – MMI scores. These facets of generalization include the sources of measurement error that the researcher intends to generalize from the sample to the universe of admissible observations. Because this study intends to generalize applicant scores from one scenario to applicant scores from a much larger set of scenarios, scenario (s) is considered a facet of generalization. Likewise, because this study intends to generalize from applicant scores on one attribute item to applicant scores on a much larger set of items, item (i) is also a facet of generalization. In line with G theory assumptions, both the scenario facet and the item facet are considered random and conditions within these facets are deemed interchangeable (Shavelson & Webb 1991).


Confounded facet

Because there is one rater assigned to each scenario at this US medical school, the variance attributable to rater cannot be disentangled from variance attributable to scenario. Therefore, rater and scenario variance are completely confounded. For the purposes of this study, however, this confounded effect is recognized as a limitation, and the variance accounted for by this confounded facet is considered attributable to the scenario (s).


Sample

Because applicants' MMI scores were collected under different scenarios, the data have a nested structure. A purposive sample was therefore drawn to find a subset in which the scenario facet is fully crossed with the applicant facet; that is, applicants rated within the same scenarios on the same items were sampled to create a completely crossed design.

This study uses actual admissions data; therefore, the data structure represents a pragmatic design in which inevitable nesting and confounding occurs. A fully crossed G study elicits the most information; however, the existing data set represents a nested structure in which MMI scores are collected for applicants by using different scenarios. Therefore, a purposive sampling method was employed to generate a subset of data in which the scenario (s) facet (confounded but representing the same raters within scenario combinations) was fully crossed with the applicant (p) facet. Consequently, a subset of the full dataset was intentionally sampled for applicants rated within the same scenario using the same items to ensure a completely crossed design. The sample included 16 applicants who were evaluated within the same six scenarios and scored on the same seven items. This small, purposive sample was necessary in order to examine the variance attributable to the main effect of the scenario (s) facet and the scenario- item (si) interaction (Shavelson & Webb 1991), which is information pertinent to the study’s objectives.
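The purposive-sampling step described above (keeping only applicants rated under every scenario in a chosen set) can be sketched in plain Python. The record layout of (applicant, scenario) pairs and the toy data are assumptions for illustration, not the school's actual data:

```python
def fully_crossed_subset(records, scenarios):
    """records: iterable of (applicant, scenario) pairs.
    Return the applicants rated under *every* scenario in `scenarios`,
    i.e. the subset in which applicants and scenarios are fully crossed."""
    seen = {}
    for applicant, scenario in records:
        if scenario in scenarios:
            seen.setdefault(applicant, set()).add(scenario)
    return {a for a, s in seen.items() if s == set(scenarios)}

# Toy data: C is missing scenario 2, so only A and B are fully crossed.
records = [("A", 1), ("A", 2), ("B", 1), ("B", 2), ("C", 1)]
print(fully_crossed_subset(records, {1, 2}))
```

In the study itself, the same logic yielded 16 applicants who shared the same six scenarios and seven items.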



Results

While the true score (p) should represent a sizable amount of variance, Table 1 shows that the applicant (p) represents only 6% of total variance. The estimated variance components from the G study suggest that the greatest amount of variance is attributable to the main effect of the scenario (s) facet and the interaction between scenario and applicant (ps). Collectively, these two variance components account for 77% of the total variance. The item facet (i) represents the lowest estimated variance, accounting for only 0.6% of the total variance in MMI scores. Likewise, the scenario-item interaction (si) accounts for only 1.4% of the total variance. The low estimate of variance attributable to the item facet is reinforced by a high Cronbach’s alpha (0.97) for the seven items, which suggests very high internal consistency among the attributes measured by this MMI.
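The internal-consistency figure reported above comes from the standard Cronbach's alpha formula applied to an applicants × items score matrix. A minimal sketch with illustrative toy data (not the study's data):

```python
import numpy as np

def cronbach_alpha(scores) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) score matrix:
    k/(k-1) * (1 - sum of item variances / variance of total scores)."""
    X = np.asarray(scores, dtype=float)
    k = X.shape[1]
    item_vars = X.var(axis=0, ddof=1).sum()
    total_var = X.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Perfectly parallel items (each applicant gets identical item scores)
# yield alpha = 1.0; the study's seven items came close, at 0.97.
print(cronbach_alpha([[1, 1], [2, 2], [3, 3]]))  # -> 1.0
```

An alpha this high is exactly what one would expect when raters assign near-identical values across items, which is one of the interpretations the Discussion raises.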





Discussion

The high internal consistency of the seven sub-scores (items) suggests that this MMI is measuring a single unidimensional attribute; the fact that only about 2% of the variance is attributable to the item-related components supports this. If the items capture one unidimensional attribute, the seven items could be condensed into a single item. The small variance due to p and i implies that most of the variance is attributable to s.

The high internal consistency of the seven sub-scores (items) may support assumptions that the current MMI process is measuring one unidimensional attribute; this is further supported by the fact that only 2% of variance is attributable to the item-related components – (i), (pi), and (si). These findings either support the G theory assumption that conditions of the item facet can be considered interchangeable or suggest that raters do not understand how to use the items on the MMI evaluation tool and simply assign the same value to each item. If the items are capturing one unidimensional attribute, then the seven-item evaluation tool could be condensed into a single-item tool. The low percentage of variance attributable to both items (i) and the true score – the applicant (p) – further suggests that the variation in MMI scores is mostly attributable to scenarios (s).


Given the differences in the content of station scenarios, it is plausible that scenarios increase inconsistency in the attributes an applicant exhibits. For example, an applicant may display different attributes in a scenario involving an ethical dilemma than in one involving a teamwork activity. Thus, even if the MMI measures one unidimensional attribute at the item level, station content may alter the measurement of that attribute and introduce multidimensionality at the scenario level. The researchers expected this disparity to appear as a large scenario-item interaction, but the results did not support that hypothesis, probably because the proportion of variance due to the item facet is so small. The small scenario-item interaction may therefore reflect item-facet variance being subsumed into scenario-facet variance. The interaction could in fact contribute substantial variance to MMI scores, but this did not surface in the present sample.

Given the variation among the content of station scenarios, it is plausible to believe that scenarios promote inconsistencies among attributes exhibited by an applicant. For instance, a scenario involving an ethical dilemma might highlight different attributes than a scenario requiring an applicant to engage in a teamwork activity. Therefore, even if the MMI is supposedly measuring one unidimensional attribute at the item level, the content of the stations may elicit different measurements of attributes, thereby introducing multidimensionality at the scenario level. While the researchers expected to find this disparity manifested as a large variance component associated with the scenario-item interaction, this initial analysis does not support the assumption. This potential effect may be obscured by the low percentage of variance attributable to the item facet. Therefore, it is possible that the small scenario-item interaction is a result of variance attributable to the item facet being subsumed by the variance attributable to the varying conditions of the scenario facet. Consequently, the interaction may indeed contribute substantial variance in MMI scores; but this was not identified within this study’s sample.


As the AERA, APA, and NCME standards state, "If the test developer indicates that the conditions of administration are permitted to vary from one test taker or group to another, permissible variation in conditions for administration should be identified, and a rationale for permitting the different conditions should be documented." The small variance attributable to the item facet may justify permitting different conditions of that facet (i.e. different attributes), but this assumption does not extend to the scenario facet. The present results call the interchangeability of scenario-facet conditions into question, so more care should be taken in selecting scenarios.

As outlined by the AERA, APA and NCME Standard 3.21, “If the test developer indicates that the conditions of administration are permitted to vary from one test taker or group to another, permissible variation in conditions for administration should be identified, and a rationale for permitting the different conditions should be documented” (American Educational Research Association, American Psychological Association, & National Council on Measurement in Education 1999, p. 47). While the low variance attributable to the item facet may suggest permissibility for permitting different conditions of the item facet (i.e. attributes), this assumption may not hold for the conditions of the scenario facet. The results of this study suggest that interchangeability for the conditions of the scenario facet is questionable. Subsequently, more attention should be directed towards the selection of scenarios.


Despite these findings, the study has several limitations.

Despite these findings, this study has some limitations. Because this study uses empirically collected data intended for admissions purposes, the researchers did not have direct control over the data collection process. Consequently, this G study is limited by the need for a purposive sample which results in a very small sample size relative to the larger dataset. The researchers felt that the benefit of using a fully crossed design justified the small sample size. In G theory, the nested facet cannot be estimated separately from its interaction effects because nesting creates missing cells in the design; additionally, the scenario-item (si) interaction, a major focus of this study, could only be examined using a fully crossed design (Brennan 2001). Nonetheless, the sample size used in this study may not be representative of the larger population; subsequently, the variance components might be influenced by sampling error. Because estimated variance components can be very unstable when the number of conditions within a measurement is small, this study could be replicated using a larger sample size. In addition, only two facets of generalization are modeled and one of these facets, the scenario facet, is confounded with another potential source of variation – the rater facet. Therefore, other facets could be added to the model in order to expand the universe of admissible observations and the corresponding generalizability of the study. Consequently, this study’s external validity is limited to the extent that other MMIs mirror the one used in this analysis. Despite these limitations, this study offers a solid framework for future exploration into the impact that scenario content can have on the attributes assessed by the MMI.



Conclusions

Because of the variation in how and what an institution-specific MMI measures, psychometric properties must be examined for each medical school that chooses to adopt the MMI as a replacement for the MSPI. This study adds to the growing body of literature related to psychometric analyses of the MMI. Because the extant literature has primarily focused on predictive validity and largely ignored other aspects of validity, this study adds to the foundation for further exploration into construct validity. As the MMI continues to gain momentum as a replacement for the traditional MSPI, the measurement process deserves careful attention, especially in terms of how and what is measured. Future analysis should explore the potential that both items and scenarios have on subsequent MMI scores. Overall, the results of this study reinforce the need to examine all psychometric properties of a measurement process – especially one, such as the MMI, that is used for high-stakes admissions purposes.










Med Teach. 2014 Sep;36(9):794-8. doi: 10.3109/0142159X.2014.909587. Epub 2014 May 12.
