Should MMI scores be adjusted for interviewers' stringency/leniency tendencies? (Med Educ, 2010)

Should candidate scores be adjusted for interviewer stringency or leniency in the multiple mini-interview?

Chris Roberts,1 Imogene Rothnie,2 Nathan Zoanetti3 & Jim Crossley4






Theoretical framework for interviewer performance


There are broadly three kinds of rater-related error: (1) stringency/leniency; (2) interviewer subjectivity (candidate-related and question-related); (3) interactions.

There are broadly three areas of interviewer-related error within the MMI,1,4,8 which are expanded upon in Fig. 1.



However, owing to the complexity of the assessment design, no MMI dataset has yet allowed precise estimation of the first-order effects or their second-order (interaction) effects, essentially because in large-scale interviewing plans interviewers are nested within questions. Current estimates put candidate-to-candidate variance at 22–25%, MMI question difficulty at 0–3%, and, among interviewer-related factors, stringency/leniency at 14%.

However, because of the designs inherent in complex assessment procedures,6 no set of MMI data has thus far allowed for precise estimates of each first-order effect and their second-order interactions using G theory. This is because of confounding within the naturalistic large-scale interviewing plan, in which interviewers are usually nested in MMI questions. Current estimates suggest candidate-to-candidate variance ranges from 22%4 to 25%.1 MMI question difficulty variance is in the range of 0–3%.1,4 Of the interviewer-related factors, interviewer stringency/leniency accounts for 14% of error.4


Interviewer candidate-specific subjectivity has been estimated at up to 45% in one study.

Variance reflecting interviewer candidate-specific subjectivity has been estimated to be as high as 45% in a study of assessments which used two interviewers within each station.8


On the judgements that MMI interviewers make, Kumar et al. have provided preliminary insight into the tensions that arise as interviewers reach their decisions.

Kumar et al.9 have provided some preliminary insights into the tensions that arise in the process of making such decisions. 

  • Valuing independent decision making vs reaching consensus on the standards expected of entry-level students
    These highlight, firstly, the contrast between appreciation of independent decision making and the need to achieve a consensus around the standards expected of entry-level students. 
  • The sense of assessing entry-level reasoning skills as opposed to communication skills
    The second source of tension concerns the extent to which interviewers may feel they are assessing entry-level reasoning skills in professionalism domains compared with communications skills. 
  • How interviewers overcome their subjectivity towards certain candidates
    The third source relates to how interviewers overcome their subjectivity towards certain candidates and 
  • How interviewers handle their concerns over 'failing' candidates
    the fourth to how they handle their concerns over ‘failing’ candidates. 
  • Candidates actively interact with interviewers, using impression management to elicit a favourable judgement of themselves, irrespective of the quality of their answers
    Finally, candidates are actively interacting with interviewers using their impression management skills to promote a favourable decision for themselves, which is not necessarily related to the quality of their answers.9



Methodological approaches


Using IRT

Researchers have turned to item response theory (IRT)11 to provide this opportunity.


Using MFRM

Roberts et al.12 applied multi-faceted Rasch modelling (MFRM) to the MMI, but they focused on differences in the performance of MMI questions in an item bank rather than on differences between the interviewers themselves. However, they did note that questions appeared to be measuring a unidimensional construct, ‘entry-level reasoning skills in professionalism’, as suggested by a good fit to the IRT model.12 The consistency of judgements within and between judges and candidates has been the focus of a number of papers.13–17 IRT software such as FACETS provides easily derived estimates of candidate ability, interviewer stringency/leniency and question difficulty.


The 'observed average score' is based on raw scores, whereas the 'fair average score' is the score expected if the elements of all other facets were at their average values. In this setting, the fair average is the score adjusted for interviewer stringency/leniency.

An ‘observed average score’ is the average rating based on raw scores received by the candidate. The ‘fair average score’ is the measure that would have been observed if all the measures of the other elements on all other facets had been located at the average measure.18 In this setting, the fair average for candidates is the score that has been adjusted for interviewer stringency/leniency and question difficulty.
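A toy numerical sketch may help. This is not the FACETS algorithm (FACETS performs the adjustment on the logit scale); everything here is kept on the raw rating scale, and the deviation values are invented for illustration.

```python
def fair_average(observed_avg, stringency_devs, difficulty_devs):
    """Toy fair-average adjustment on the raw rating scale.

    stringency_devs / difficulty_devs: how much harsher (+) or easier (-)
    each interviewer/question the candidate met was, relative to the
    average element of that facet. Harsher-than-average elements depressed
    the raw score, so the excess is credited back.
    """
    return (observed_avg
            + sum(stringency_devs) / len(stringency_devs)
            + sum(difficulty_devs) / len(difficulty_devs))

# A candidate with a modest observed average who met harsher-than-average
# interviewers and questions ends up with a higher fair average.
print(round(fair_average(3.50, [0.10, 0.06], [0.04, 0.08]), 2))  # 3.64
```

On the logit scale the same credit-back logic applies, but through the Rasch model's expected-score function rather than simple addition.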


McManus showed that after adjusting for stringency/leniency, 95.9% of outcomes were unchanged, while 2.6% of candidates who failed on raw scores would pass and 1.5% who passed on raw scores would fail. Harasym estimated that as many as 11% of candidates could be affected.

For example, in the case of a clinical examination for entry into a professional college, McManus et al.14 found that if examination scores were adjusted for examiner stringency/leniency and the same pass mark was kept, the outcome for 95.9% of candidates would be unchanged using adjusted marks, whereas 2.6% of candidates would pass, although they had failed on the basis of raw marks, and 1.5% of candidates would fail, despite having passed on the basis of raw marks. However, Harasym17 estimated that as many as 11% of candidates in an MMI might be affected by adjusting for interviewer stringency/leniency.
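The bookkeeping behind such a comparison is simple to sketch. The marks and the pass mark below are invented for illustration.

```python
def outcome_changes(raw, adjusted, pass_mark):
    """Count candidates whose pass/fail outcome is unchanged, who newly
    pass, or who newly fail when adjusted marks replace raw marks at a
    fixed pass mark."""
    unchanged = newly_pass = newly_fail = 0
    for r, a in zip(raw, adjusted):
        raw_pass, adj_pass = r >= pass_mark, a >= pass_mark
        if raw_pass == adj_pass:
            unchanged += 1
        elif adj_pass:        # failed on raw marks, passes after adjustment
            newly_pass += 1
        else:                 # passed on raw marks, fails after adjustment
            newly_fail += 1
    return unchanged, newly_pass, newly_fail

# Invented marks for five candidates, pass mark 50.
raw      = [49, 52, 55, 48, 51]
adjusted = [51, 52, 49, 47, 52]
print(outcome_changes(raw, adjusted, 50))  # (3, 1, 1)
```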




Psychometric analysis


Software

Multi-facet Rasch modelling was used in FACETS Version 3.65 (Winsteps.com, Chicago, IL, USA) to perform a concurrent estimation of several independent first-order facets and their associated error variances. A model was specified that included identification of the individual facets, the rating scale and how the interviewer was expected to interact with the rating scale.



Setting


Details of the MMI design principles have been reported elsewhere.4,9,12 Candidates were applying to a 4-year, graduate-entry, problem-based learning (PBL) programme. From 2007 onwards, candidates were applying for medicine or dentistry or both. The MMI in this study was designed to assess entry-level reasoning skills in professionalism and had eight stations, with each candidate rotating through the circuit and meeting a different single interviewer at each station. Questions were sourced from a preprepared bank and took the format of a non-clinical scenario followed by structured prompts. Each question had five prompts marked with a 4-point Likert scale, giving a total of 20 raw marks per station and 160 for the whole assessment. In this design, although the performance of a candidate on any particular MMI question was assessed once only by a single interviewer, the total performance was rated by eight interviewers. Furthermore, each MMI question was assessed by several interviewers during the course of the MMI process. This created a network through which every parameter was linked to every other parameter with these connecting observations, allowing the measures estimated from the observations to be placed on one common scale.11 This naturalistic interviewing plan also allowed for the partially nested G study design.4
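The raw-mark arithmetic of this design can be checked directly:

```python
# 8 stations; each question has 5 prompts rated on a 4-point Likert scale.
stations, prompts_per_question, scale_max = 8, 5, 4

per_station = prompts_per_question * scale_max  # 20 raw marks per station
total_marks = stations * per_station            # 160 for the whole assessment
print(per_station, total_marks)  # 20 160
```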



Interviewers

Each interviewer saw a median of 22 candidates; there were 89 faculty members, 47 community members and 39 graduates.

Each interviewer had interviewed a median of 22 candidates (SD 18.44, range 4–121). Complete details were available in the database for 117 interviewers. Of the 207 used, 88 interviewers were known to be male and 95 were known to be female. Twenty-two were aged 18–34 years, 27 were aged 35–44 years and 68 were aged > 45 years. They included 89 faculty members, 47 community members and 39 graduates.


MFRM

Multi-facet Rasch modelling


Reading the Y-axis upward: interviewers become more stringent, candidates more able and questions more difficult.

Reading the ruler (Fig. 2) from bottom to top shows increasing interviewer stringency, increasing candidate ability and increasing question difficulty.


Both Fig. 2 and Table 1 show that interviewers are more variable than MMI questions.

Both Fig. 2 and Table 1 show that interviewers are more variable than MMI questions and the spread of interviewers is nearly 3.5 times that of MMI questions.


Interviewer J over-fits the model and is too predictable, suggesting a possible halo effect; interviewer G under-fits, with too much randomness in scoring.

Interviewer J appeared to be over-fitting the model and his or her ratings were too predictable, suggesting a halo effect. Interviewer G seems to be under-fitting the model with too much randomness in his or her scoring.
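Over- and under-fit are diagnosed with the infit mean-square statistic. A minimal sketch, assuming observed ratings, model-expected ratings and model variances are available; all values below are invented.

```python
def infit_mean_square(observed, expected, variances):
    """Infit (information-weighted) mean-square statistic: the sum of
    squared score residuals divided by the sum of model variances.
    Values well below 1 mean ratings are more predictable than the model
    expects (possible halo effect); values well above 1 mean excess
    randomness."""
    residual_sq = sum((x - e) ** 2 for x, e in zip(observed, expected))
    return residual_sq / sum(variances)

# An interviewer whose ratings hug the model expectations very tightly.
obs = [3, 3, 3]             # invented observed ratings
exp = [2.8, 3.1, 2.9]       # model-expected ratings
var = [0.5, 0.5, 0.5]       # model variance of each rating
print(round(infit_mean_square(obs, exp, var), 2))  # 0.04 -- far below 1
```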







Making adjustments for interviewer leniency and question difficulty


Candidate E met stringent interviewers, so his or her observed average score is a low 3.50, but the fair average score is 3.64.

Here, candidate E has a lower observed average score of 3.50, but a higher fair average score of 3.64 because he or she answered harder MMI questions and saw more stringent interviewers. 


If fair average scores were used instead of observed average scores, 31 of the 270 offer-holders (11.5%) would move from acceptance to rejection; crucially, this is a two-way movement, with others being accepted instead.

Let us assume a scenario in which the fair average rather than observed average scores are used to rank the candidates. In our situation, in which 270 student places were on offer, if the MMI were the sole determinant of ranking, 31 of 270 (11.5%) candidates who were offered a place on the basis of their observed score rankings would not have been offered a place on the basis of their fair average rankings. This is a two-way movement.
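The two-way movement can be illustrated by ranking a toy candidate pool both ways. Names and scores are invented; candidate E's 3.50 observed / 3.64 fair pair echoes the example in the text.

```python
def offers(scores, places):
    """Names of the top-`places` candidates under a given score ranking."""
    return set(sorted(scores, key=scores.get, reverse=True)[:places])

# Five invented candidates competing for three places.
observed = {'A': 3.9, 'B': 3.7, 'C': 3.6, 'D': 3.5, 'E': 3.5}
fair     = {'A': 3.8, 'B': 3.6, 'C': 3.4, 'D': 3.5, 'E': 3.64}

lost   = offers(observed, 3) - offers(fair, 3)  # offered on observed only
gained = offers(fair, 3) - offers(observed, 3)  # the two-way movement
print(sorted(lost), sorted(gained))  # ['C'] ['E']
```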





Interviewer goodness-of-fit statistics


For the interviewer, the infit mean square statistic ranged from 0.74 to 1.58 (mean 1.03, SD 0.74). This was a high-stakes assessment, similar to a clinical rating situation, and the values were well within the accepted lower and upper control limits of 0.5 and 1.7 that indicate acceptable model fit.19
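Screening interviewers against these control limits is a simple filter. Here 0.74 and 1.58 are the reported range endpoints; interviewers K and L are invented to show what a flagged value looks like.

```python
def misfitting(infit_by_interviewer, lower=0.5, upper=1.7):
    """Interviewers whose infit mean square falls outside the accepted
    control limits, flagging them for review."""
    return {name: ms for name, ms in infit_by_interviewer.items()
            if not lower <= ms <= upper}

# Illustrative values only; K and L are hypothetical misfitting interviewers.
infits = {'J': 0.74, 'G': 1.58, 'K': 1.82, 'L': 0.45}
print(misfitting(infits))  # {'K': 1.82, 'L': 0.45}
```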



Number of candidates examined

Interviewer stringency showed a significant negative correlation with the number of candidates interviewed: interviewers who saw more candidates were more lenient. This is the opposite of McManus's finding.

Interviewer stringency/leniency showed a significant but inverse correlation with the number of candidates examined (r = −0.21, n = 207, p = 0.002). Thus, interviewers who interviewed more candidates tended to be somewhat more lenient. McManus et al.14 found examiners became more stringent with more candidates. Our finding contrasts with this, but we do not have data to show whether more lenient interviewers participated in more assessments or whether more interviewing caused interviewers to become more lenient.
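The reported association is a plain Pearson correlation. A self-contained sketch with invented data in the direction of the finding:

```python
import math

def pearson_r(x, y):
    """Plain Pearson product-moment correlation coefficient."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Invented data: stringency (logits, higher = harsher) falls as the
# number of candidates interviewed rises, giving an inverse correlation.
stringency = [0.6, 0.3, 0.1, -0.2, -0.5]
n_examined = [5, 10, 22, 40, 120]
print(pearson_r(stringency, n_examined) < 0)  # True
```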



Implications


Translating IRT output into variance components is important; this concerns the use of MFRM.

The translation of IRT output into variance components is important. Some have reported a number of limitations in applying IRT models to assessments which measure the performance of skills or behaviours, as in the MMI.14 These arose because of claims that the MFRM analysis could not take into account the second-order effects of interviewer-by-station, interviewer-by-candidate and candidate-by-station variance. There was concern that, as in an incorrectly designed G study,6 error would be apportioned wrongly and hence any calculation of reliability or standard error of measurement was likely to be inflated. The use of MFRM to isolate variance components is very new and there has been some misunderstanding in the medical education literature about how they can be estimated and reported with software such as FACETS. This has inflated reliability estimates, undermining the credibility of the IRT method for this type of assessment. For example, McManus et al.14 reported variation between examinees in a clinical examination for entry into a professional college as an unrealistic 87%. This resulted from a calculation which partly assumed that the three first-order effects of examiner, item and person were proportions of 100% and thus neglected to take account of the bias or interactions and the residuals that MFRM also reports.
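The normalisation error can be made concrete. The candidate, interviewer, interaction and question percentages below are the ones this study reports; the residual is assumed here purely so that the components sum to 100.

```python
# First-order effects and the interaction reported by the study (percent).
first_order  = {'candidate': 19.1, 'interviewer': 8.9, 'question': 2.6}
interactions = {'interviewer_x_question': 5.1}
residual     = 64.3  # assumed for illustration so components total 100

# Wrong: treat the three first-order effects as if they were 100% of variance.
wrong = first_order['candidate'] / sum(first_order.values())

# Right: divide by the full total, including interactions and residual.
total = sum(first_order.values()) + sum(interactions.values()) + residual
right = first_order['candidate'] / total

print(round(wrong, 3), round(right, 3))  # 0.624 0.191
```

The "wrong" calculation inflates the candidate share roughly threefold, which is exactly the kind of over-statement described above.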


FACETS can be used to decompose variance components.

An iterative relationship between the FACETS software developer and the educational research measurement community has ensured that later iterations of FACETS are able to provide the decomposition of variance components, including interactions, with naturalistic data.


During MMI training, interviewers ask whether they should be told who is a 'hawk' and who is a 'dove'. However, whether measured by IRT or by G theory, stringency/leniency in the MMI appears to be a stable trait, consistent with McManus's findings. The implication, as McManus suggests, is that interviewers should not try to correct their stringency/leniency but should continue as they always have.

In MMI training, interviewers often ask whether they should be given feedback on which of them are ‘hawks’ and which are ‘doves’ so that they can try to correct their tendencies to mark higher (leniently) or lower (stringently) on the rating scale. The finding that interviewer stringency/leniency seems to be a stable characteristic in the MMI, whether measured by IRT or by G theory, is remarkable and echoes the findings of McManus et al.14 on examiner stringency in clinical rating situations. The implication, as McManus et al.14 suggest, is that interviewers should not try to correct their hawkish or dove-like tendencies, but should instead continue to behave as they have always done.


As Kumar et al. noted, theoretical development is lacking on interviewers' experience of the MMI process and on the effects of training.

As Kumar et al.9 have noted, theoretical development in the area of interviewers’ experience of the process and impact of training is lacking.



13 Downing SM. Threats to the validity of clinical teaching assessments: what about rater error? Med Educ 2005;39:353–5.





















Med Educ. 2010 Jul;44(7):690-8. doi: 10.1111/j.1365-2923.2010.03689.x.

Should candidate scores be adjusted for interviewer stringency or leniency in the multiple mini-interview?

Author information

  • 1Sydney Medical School-Northern, University of Sydney, Sydney, New South Wales, Australia. christopher.roberts@sydney.edu.au

Abstract

CONTEXT:

There are significant levels of variation in candidate multiple mini-interview (MMI) scores caused by interviewer-related factors. Multi-facet Rasch modelling (MFRM) has the capability to both identify these sources of error and partially adjust for them within a measurement model that may be fairer to the candidate.

METHODS:

Using FACETS software, a variance components analysis estimated sources of measurement error that were comparable with those produced by generalisability theory. Fair average scores for the effects of the stringency/leniency of interviewers and question difficulty were calculated and adjusted rankings of candidates were modelled.

RESULTS:

The decisions of 207 interviewers had an acceptable fit to the MFRM model. For one candidate assessed by one interviewer on one MMI question, 19.1% of the variance reflected candidate ability, 8.9% reflected interviewer stringency/leniency, 5.1% reflected interviewer question-specific stringency/leniency and 2.6% reflected question difficulty. If adjustments were made to candidates' raw scores for interviewer stringency/leniency and question difficulty, 11.5% of candidates would see a significant change in their ranking for selection into the programme. Greater interviewer leniency was associated with the number of candidates interviewed.

CONCLUSIONS:

Interviewers differ in their degree of stringency/leniency and this appears to be a stable characteristic. The MFRM provides a recommendable way of giving a candidate score which adjusts for the stringency/leniency of whichever interviewers the candidate sees and the difficulty of the questions the candidate is asked.

PMID: 20636588 [PubMed - indexed for MEDLINE]

