Global grades versus checklist scores in OSCEs (Med Teach, 2015)

Investigating disparity between global grades and checklist scores in OSCEs

GODFREY PELL, MATT HOMER & RICHARD FULLER

University of Leeds, UK





OSCEs have clear strengths: especially when appropriate standard-setting methods are used, content can be carefully specified, the exam can be standardised, and extensive measurement and post hoc analysis of assessment quality become possible. Measurements of reliability are routinely used in determining assessment quality, with a focus on station-level metrics for detecting and remediating a range of problems with OSCE formats.

OSCEs have clear strengths, especially when appropriate standard setting methodologies are employed, allowing careful specification of content, standardisation and an opportunity to undertake extensive measurement and post hoc analysis to determine assessment quality. Measurements of reliability are routinely used as an element of determining assessment quality (Streiner & Norman 2003; Chapter 8), with an increasing focus on the value of station level metrics in the detection and remediation of a range of problems with OSCE formats (Pell et al. 2010; Fuller et al. 2013).


In traditional test formats, OSCEs produce two assessment outcomes at each station: a checklist score (CL) and a global grade (GG). Other OSCE formats have moved away from checklists; for example, in the main US licensing exam, the history-taking checklist was eliminated because of its poor discrimination. Within a station, the alignment between CL and GG is an important characteristic, and in a good station this alignment should be strong.

In ‘‘traditional’’ test formats, OSCEs have two assessment outcomes within each station, a checklist score and a global grade (other formats of the OSCE have seen a move away from checklists, for example, in the USA’s main licensing exam, the history-taking checklist has been eliminated due to concerns regarding its poor discrimination). The alignment between the checklist/marking scheme score and overall global grade within a station is an important characteristic, and one would expect that in a high-quality station (i.e. one that is working well as part of a reliable and valid assessment), this alignment should be strong.
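This within-station alignment can be checked empirically. As a minimal sketch (with made-up candidate data and an assumed ordinal coding of the five grades: 0 = clear fail, 1 = borderline, 2-4 = the three passing grades), a rank correlation between the two assessor judgements:

```python
import numpy as np
from scipy.stats import spearmanr

# Hypothetical station data: one checklist score and one global grade
# per candidate (ordinal coding assumed, not taken from the paper).
checklist = np.array([12, 15, 18, 20, 22, 24, 25, 27, 28, 30])
grades    = np.array([ 0,  1,  1,  2,  2,  3,  3,  3,  4,  4])

# In a well-functioning station the two judgements should show a
# strong monotonic association.
rho, p = spearmanr(checklist, grades)
print(f"Spearman rho = {rho:.2f}")
```

A high rho would suggest the checklist and the global grade are capturing the same underlying performance; a weak or negative rho would flag the station for review.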


A number of studies have examined discrepancies in checklist scores or in global ratings, but none has compared the two within a station and explored what such discrepancies might mean.

A number of studies have looked at checklist discrepancies and/or rating discrepancies (Boulet et al. 2003), but to our knowledge, none has investigated discrepancies between checklist scores and global ratings in a station and what this might mean.


Any misalignment increases the likelihood of failing "competent" students or passing "incompetent" students at the station level.

Any degree of misalignment increases the likelihood of failing ‘‘competent’’ students or passing ‘‘incompetent’’ students at the station level.


At the same time, research in this area is enriched by a growing body of work seeking to understand the complex territory of assessor decision making in performance testing. Taking a constructivist view of assessment, this work has shown that the factors influencing assessor decisions are highly individualised and contextualised, and are affected by assessor experience and seniority. This points to a complex interaction of test-format design, constructs, assessor behaviour and candidate performance within the OSCE, often described as a "black box" of variance. Where misalignment occurs, our study seeks to understand this "black box".

At the same time, research in these areas is complemented by a growing body of literature that seeks to understand the complex area of assessor decision making in performance testing (Sadler 2009; Govaerts 2011; Kogan et al. 2011; Yorke 2011). Employing constructivist views of assessment, this literature reveals that the factors affecting assessor decision-making can be highly individualised, contextualised and influenced by characteristics such as assessor experience and seniority (Pell & Roberts 2006; Kogan et al. 2011). This can be summarised as a complex interaction of test format design issues, construct, assessor behaviours and candidate performances within the OSCE environment, sometimes described as a ‘‘black box’’ of variance (Gingerich et al. 2011; Kogan et al. 2011). Where misalignment occurs, our work seeks additional understanding of this ‘‘black box’’ with regard to this error variance.


Dissatisfaction with "traditional" checklist marking is growing, accompanied by increasing use of global-grade-based marking, supported by research indicating that global grades are more reliable than checklist scores. Some of this misalignment stems from poor checklist design, and it also reflects the fact that the two scores may measure quite different traits.

There is a growing dissatisfaction with ‘‘traditional’’ (i.e. reductionist) checklist marking schedules both in healthcare and wider education (Sadler 2009), with an accompanying growth in the use of global/domain based marking schema, supported by work that indicates that global grades are more reliable than checklist scores (Cohen et al. 1996; Regehr et al. 1998). It is important to note that some of the misalignment between scores and grades in a station can reflect poor checklist design, and that these two performance ‘‘scores’’ may measure quite different traits. 


This approach raises the real possibility that assessments take place without the ability to investigate concerns about naturally occurring variance in marks, meaning that error in assessor judgements may go unrecognised.

This approach poses the real possibility that assessments take place without an ability to investigate concerns about the nature of variance in marks, implying that error in assessor judgements may be more likely to go unrecognised.




Methods


Initial exemplification and exploration


Our OSCE format uses global grade year-specific descriptors (indexed as clear fail, borderline and three passing grades) alongside a specific marking schema that develops from a traditional checklist format in our junior OSCEs (third year) to a sophisticated ‘‘key features’’ format in the final, qualifying OSCE (fifth year).


Within each station, a pass mark is calculated from all the grades/marks within the station using the Borderline Regression Method (Kramer et al. 2003; Pell & Roberts 2006).
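A minimal sketch of how such a pass mark can be derived under the Borderline Regression Method: regress checklist score on the ordinal global grade across all candidates at a station, then read off the predicted score at the borderline grade. The grade coding and candidate data below are illustrative, not the paper's:

```python
import numpy as np

# Ordinal grade coding (assumed): 0 = clear fail, 1 = borderline,
# 2-4 = the three passing grades.
grades = np.array([0, 1, 1, 1, 2, 2, 3, 3, 4, 4])
scores = np.array([10, 14, 16, 15, 20, 22, 25, 26, 29, 31])

# Linear regression of checklist score on global grade.
slope, intercept = np.polyfit(grades, scores, 1)

# Pass mark = predicted checklist score at the borderline grade (1).
pass_mark = slope * 1 + intercept
print(f"pass mark = {pass_mark:.1f}")
```

The slope of this regression line is the "inter-grade discrimination" referred to later: a shallow slope means the checklist barely separates the grade bands.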


Two useful statistics are the overall failure rate at the station and the percentage of candidates awarded a borderline grade; here, checklist scores and global grades are compared, particularly for borderline candidates.

The first useful statistical measures are the overall failure rate at the station (38/230 = 16.5%) and the percentage in the ‘‘Borderline’’ grade (48/234 = 20.9%), the latter of which is arguably high since in one in five encounters the assessors are unable to make a clear pass/fail global decision. The methods we develop will enable us to examine the congruence between the two assessor judgements: global grades and checklist marks, particularly within borderline categories.


Treatment of ‘‘Borderline’’ grades




Formulating measures of misclassification


A table (Table 2) was constructed to quantify the degree of misalignment.

One of the key areas of research in this study is to explore the possibility of developing useful metrics to quantify the degree of misalignment that is listed in Table 2 as a step towards highlighting stations that require further investigation.
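The metrics of Table 2 are not reproduced here, but one plausible building block for such a measure is a 2x2 pass/fail grid that cross-tabulates each candidate's global-grade outcome against their checklist outcome relative to the station pass mark, with the off-diagonal cells counting misaligned judgements. A sketch with invented data:

```python
import numpy as np

# Illustrative data: checklist scores, an assumed station pass mark,
# and the assessors' global pass/fail decisions (1 = pass).
pass_mark = 18.0
scores      = np.array([12, 17, 19, 16, 21, 25, 14, 20])
global_pass = np.array([ 0,  1,  1,  0,  1,  1,  0,  0])

checklist_pass = (scores >= pass_mark).astype(int)

# grid[g, c]: rows index the global outcome, columns the checklist outcome.
grid = np.zeros((2, 2), dtype=int)
for g, c in zip(global_pass, checklist_pass):
    grid[g, c] += 1

# Off-diagonal cells are the misaligned judgements:
# grid[1, 0] = global pass but checklist fail; grid[0, 1] = the reverse.
misaligned = grid[0, 1] + grid[1, 0]
print(grid)
print("misaligned:", misaligned)
```

Both the size and the asymmetry of the off-diagonal cells are informative: a large total signals general misalignment, while a lopsided split signals a systematic direction to assessor disagreement.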





Results – Application in practice



Stations with clearly identified problems

Stations with established problems based on existing station-level metrics


Three times as many students passed despite poor checklist marks as failed despite good checklist marks.

As part of the validation of the new metrics, we examine their application where existing station-level metrics already highlight concerns about quality. In Table 3, station 3 shows a poor R-square value with an accompanying low value for the slope of the regression line (inter-grade discrimination), already suggesting that the station is not discriminating well between students based on ability. From this analysis, we would anticipate there to be a wide range of checklist marks for each global grade. The pass/fail grid reveals a high level of asymmetry in the off-diagonal cells, with three times as many candidates (25:8) achieving a global pass grade from the assessor but poor checklist marks compared to those not achieving a global pass whilst having good checklist marks.
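The reported 25:8 asymmetry works out as roughly a threefold imbalance:

```python
# Off-diagonal counts reported for station 3: 25 candidates with a global
# pass but poor checklist marks, against 8 with the reverse pattern.
pass_low_score = 25   # global pass, poor checklist mark
fail_high_score = 8   # good checklist mark, no global pass
ratio = pass_low_score / fail_high_score
print(f"asymmetry ratio = {ratio:.2f}")  # about 3:1
```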




Stations with no apparent problems

Stations with no ‘‘apparent’’ problems


Candidates' performance (as captured by the checklist score) is better than assessors' predictions (as captured by the global grade).

We now examine stations where the ‘‘standard’’ set of metrics would not highlight underlying quality issues in respect of assessor decision making and judgements. This whole analysis reveals a central theme: candidates achieve a comparatively better performance (determined by the checklist score) than would be expected from assessors' predictions (determined by the global grade).



Stations with high proportions of borderline grades

Stations with relatively high numbers of ‘‘Borderline’’ students


As a final part of the work examining the impact of the misalignment measure, we review stations where the proportion of borderline grades awarded is relatively ‘‘high’’. Station 8, which focusses on medico-legal responsibilities after the death of a patient, has the highest proportion of borderline grades (25.2% of the whole cohort) amongst the stations listed in Table 3. Review of traditional station level metrics (columns 2–5) shows an acceptably performing station, but with a high number of student failures.




Discussion


Performance testing contains a great deal of "noise", which has recently been conceptualised in workplace-based assessment as a "black box".

Despite this, there remains a large degree of ‘‘noise’’ in performance testing, recently conceptualised within workplace assessment settings as a ‘‘black box’’ (Kogan et al. 2011).


Researchers have begun to decode this "noise"; in healthcare assessment, attention has focused on assessor behaviour and decision making in relation to the complex, changing nature of OSCE stations. Work in other professions highlights the tension between GG and CL, pointing to active "transgressions" in which assessors trust holistic, global judgements and override their use of checklists. Other work points to the problems of relying on GG alone when assessors must make sense of complex constructs such as "safety" or "professionalism" within descriptors. Such constructs are often represented by a single word, and despite assessor training, multiple reinterpretations produce variation in judgements; some researchers conceptualise this as "anticipated variance" to distinguish it from simple error. Through a series of observations, this complex dynamic has been conceptualised as "indeterminacy", challenging the theoretical background to the use of checklists, rubrics and grading schemes.

Researchers have begun to unpack this ‘‘noise’’, and work within healthcare assessment has focused on assessor behaviours and decision making in the complex, changing nature of the OSCE station (Govaerts 2011). Work from other professional disciplines has highlighted a wider tension between the balance of global grades and checklists/marking rubrics, revealing active ‘‘transgressions’’ as assessors trust of holistic, global judgements overrides their use of checklist criteria (Marshall 2000). Other work reveals the challenges of using global grades and descriptors alone, as assessors seek to make sense of complex constructs such as ‘‘safety’’ or ‘‘professionalism’’ within descriptors. Such constructs are often represented by single words, and despite assessor training, multiple re-interpretations lead to variation in judgements – with some researchers conceptualising this as anticipated variance, rather than just simply error (Govaerts et al. 2007). This complex dynamic has been conceptualised through a series of observations as ‘‘indeterminacy’’, challenging the theoretical background to the accepted use of checklists, rubrics and grading schemes (Sadler 2009).


There is an extensive literature on global grades and checklists in OSCEs, but little on the alignment between the two.

Whilst an extensive literature exists in respect of the use of global grades and checklists within OSCEs (Cunnington et al. 1996; Humphrey-Murto & MacFadyen 2002; Wan et al. 2011), little has been done to explore the nature of the alignment between the two.


We recommend that each institution model the PM calculations with its own data to derive the most suitable parameter values.

We encourage other institutions to model the PM calculations using their own data to determine the most suitable value of the parameter M (formula 1) to meet local conditions.


This study revealed the misalignment between assessors' checklist decisions and their "predictions" (i.e. global grades). Within stations with adverse standard station-level metrics, the misalignment measures show where assessors are dissatisfied with the station and checklist. As noted in the introduction, misalignment can result from many problems (e.g. assessor training, support materials, "rogue" assessors). More important is the deeper insight into stations judged "acceptable" by existing metrics. The unsatisfactory feature of many of these stations lies in assessor judgement of the borderline group. Misalignment in this context shows different "directionalities": assessors have difficulty awarding fail grades and tend to rate borderline students generously. These findings are consistent with previous work on "bestowed credit", i.e. rewarding or penalising candidate behaviour not captured in the GG or CL system, which is a threat to the fidelity of performance assessment.

This study has revealed the extent of misalignment between assessors' checklist decisions and their ‘‘predictions’’ (i.e. the global grades) across a range of different academic cohorts and levels of assessment in a large-scale OSCE. Within stations with ‘‘adverse’’ standard station-level metrics (Pell et al. 2010; Fuller et al. 2013), the misalignment measures complement these well, highlighting where assessors are dissatisfied with station and checklist constructs. As stated earlier in the introduction, the misalignment could be the result of a number of problems (including but not limited to, for example, assessor training, support materials, ‘‘rogue’’ assessors and so on). Of more importance is the deeper insight into stations that might have been judged as ‘‘acceptable’’ based on pre-existing metrics. The unsatisfactory characteristic of many of these stations lies in assessor judgement of the borderline group. Interpreting the misalignment measure in this context reveals different directionalities – with assessors showing difficulty in awarding fail grades and a tendency to over-rate student performance in the borderline group. Such findings resonate with assessors awarding ‘‘bestowed credit’’ – rewarding or penalising other candidate activities that are not featured within grading and checklist systems, and an activity that has been identified as a threat to the fidelity of performance assessments (Sadler 2010).


We estimate that discordant pass/fail outcomes within the borderline group occur in about 10% of stations. In other words, there are stations where the large majority of the borderline group passes (or, conversely, fails). As a result, the mean mark of the borderline group, i.e. the cut-score under the borderline group method, is lower (or higher) than the cut-score under the borderline regression method.

We estimate that the incidence of substantial asymmetry of pass/fail outcomes within the borderline group occurs in approximately 10% of stations. In other words, there are incidences where the large majority of candidates in the borderline group pass the station (or, conversely, fail the station). Hence, the mean mark for this borderline group, giving the cut-score as per the borderline group method, is lower (or higher) than that under the borderline regression method. We would argue from a quality perspective that this is further evidence in favour of BRM, since under the borderline group method these issues would remain unknown.
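The divergence between the two standard-setting methods can be illustrated with invented data in which borderline-graded candidates happen to receive generous checklist marks (the asymmetric case described above):

```python
import numpy as np

# Ordinal grade coding (assumed): 0 = clear fail, 1 = borderline,
# 2-4 = passing grades. Borderline candidates here score unusually high.
grades = np.array([0, 1, 1, 1, 1, 2, 3, 4, 4, 4])
scores = np.array([8, 20, 21, 22, 23, 19, 24, 27, 29, 30])

# Borderline group method: cut-score = mean checklist mark of the
# borderline-graded candidates only.
bgm_cut = scores[grades == 1].mean()

# Borderline regression method: cut-score = regression prediction at the
# borderline grade, using every candidate at the station.
slope, intercept = np.polyfit(grades, scores, 1)
brm_cut = slope * 1 + intercept

print(f"BGM cut = {bgm_cut:.1f}, BRM cut = {brm_cut:.1f}")
```

Because the BRM uses the whole cohort rather than only the borderline subgroup, an atypical borderline group shifts its cut-score far less, which is the quality argument made above in favour of the BRM.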





Gingerich A, Regehr G, Eva KW. 2011. Rater-based assessments as social judgments: Rethinking the etiology of rater errors. Acad Med 86:S1–S7.


Kogan JR, Conforti L, Bernabeo E, Iobst W, Holmboe E. 2011. Opening the black box of clinical skills assessment via observation: A conceptual model. Med Educ 45(10):1048–1060.


Yorke M. 2011. Assessing the complexity of professional achievement. Chapter 10. In: Jackson N, editor. Learning to be professional through a higher education. London: Sceptre. Available from: http://learningtobeprofessional.pbworks.com/w/page/15914981/Learning%20to%20be%20Professional%20through%20a%20Higher%20Education%20e-Book.












Med Teach. 2015 Dec;37(12):1106-13. doi: 10.3109/0142159X.2015.1009425. Epub 2015 Feb 16.


Abstract

BACKGROUND:

When measuring assessment quality, increasing focus is placed on the value of station-level metrics in the detection and remediation of problems in the assessment.

AIMS:

This article investigates how disparity between checklist scores and global grades in an Objective Structured Clinical Examination (OSCE) can provide powerful new insights at the station level whenever such disparities occur and develops metrics to indicate when this is a problem.

METHOD:

This retrospective study uses OSCE data from multiple examinations to investigate the extent to which these new measurements of disparity complement existing station-level metrics.

RESULTS:

In stations where existing metrics are poor, the new metrics provide greater understanding of the underlying sources of error. Equally importantly, stations of apparently satisfactory "quality" based on traditional metrics are shown to sometimes have problems of their own - with a tendency for checklist score "performance" to be judged stronger than would be expected from the global grades awarded.

CONCLUSIONS:

There is an ongoing tension in OSCE assessment between global holistic judgements and the necessarily more reductionist, but arguably more objective, checklist scores. This article develops methods to quantify the disparity between these judgements and illustrates how such analyses can inform ongoing improvement in station quality.

PMID: 25683174

