원하는 것을 측정하기 위한 OSCE 개발을 위한 12가지 팁(Med Teach, 2017)

Meded. 2021. 8. 19. 05:31

Twelve tips for developing an OSCE that measures what you want
Vijay John Daniels & Debra Pugh

도입
Introduction

OSCE(Objective Structured Clinical Examination)는 1975년(Harden et al. 1975)에 처음 도입되었으며, 그 이후 지역 기관과 국가 고위험 검사 모두에서 임상 능력 평가에 OSCE가 광범위하게 사용되고 있다(Patrício et al. 2013).
The Objective Structured Clinical Examination (OSCE) was first introduced in 1975 (Harden et al. 1975) and, since that time, OSCEs have been used extensively (Patrício et al. 2013) for assessing clinical skills, both at local institutions and on national high-stakes examinations.

타당성에 대한 우리의 이해는 여러 개별 타당성 유형(예: 기준, 내용 유효성 등)에서 타당성에 대한 주장을 뒷받침하기 위해 다양한 근거 출처를 사용하는 구성 타당성의 통일적 개념으로 발전해 왔다.

첫째는 메식(Messick 1989)의 5가지 출처의 프레임워크를 통해
그리고 더 최근에는 케인(Kane)의 주장argument-기반 검증 접근법(Kane 2013)이다

Our understanding of validity has evolved from several separate types of validity (e.g. criterion, content validity etc.) to a unitary concept of construct validity in which various sources of evidence are used to support an argument for validity,

first through Messick’s framework of the five sources (Messick 1989) and, more recently,
through Kane’s argument-based approach to validation (Kane 2013).

쿡 외 연구진(2015)이 요약한 바와 같이, 케인의 프레임워크는 관찰에서 평가에 기초한 의사결정에 이르기까지 유효한 해석을 보장하기 위한 4가지 핵심 단계에 초점을 맞춘다.

첫 번째 단계는 관찰된 성과를 점수(점수)로 변환하여 점수가 최대한 성과를 반영하도록 하는 것입니다.
두 번째 단계는 특정 검사에서 테스트 성능 환경(즉, 가능한 모든 동등한 테스트 – 일반화)에 이르는 점수를 일반화하는 것입니다.
세 번째는 테스트 환경에서의 성능을 실제 삶(Extrapolation)으로 외삽하는 것입니다.
마지막으로 네 번째 단계는 의사결정을 위한 정보의 해석입니다(함의implication).

As summarized by Cook et al. (2015), Kane’s framework involves focusing on four key steps to ensure valid interpretation from observation to making a decision based on the assessment.

The first step is translation of an observed performance into a score (Scoring) ensuring the score reflects the performance as best as possible.
The second step is generalizing the score from the specific examination to the test performance environment (i.e. all possible equivalent tests – Generalization).
Third is extrapolating performance in the test environment to real life (Extrapolation).
And finally the fourth step is the interpretation of this information for making a decision (Implications).

타당성에 대한 두 가지 주요 위협은 다음과 같다.

구인-대표성 부족(샘플링이 너무 적거나 부적절한 표본 추출)
구인-무관 분산(점수 변동을 초래하는 관심 구성과 관련이 없는 것)

The two main threats to validity are

construct underrepresentation (too little sampling or inappropriate sampling) and
construct-irrelevant variance (anything unrelated to the construct of interest that results in score variability).

본 논문의 목적은 케인의 타당성 프레임워크의 렌즈를 통해 바라본 바와 같이 원하는 것을 측정하는 OSCE를 개발하기 위한 12가지 팁을 제공하는 것입니다. 12가지 팁은 OSCE를 개발할 때 사용할 수 있는 순서로 제시됩니다. 각 팁의 핵심 사항은 표 1에 요약되어 있습니다.
The purpose of this paper is to provide 12 tips for developing an OSCE that measures what you want, as viewed through the lens of Kane’s validity framework. The 12 tips are presented in the order they would be operationalized when developing an OSCE. Key points from each tip are summarized in Table 1.

팁 1 OSCE 결과의 용도를 결정합니다.
Tip 1 Decide on the intended use of the results from your OSCE

OSCE의 개발은 끝에서 시작해야 합니다. 결과를 가지고 어떤 결정을 내리게 됩니까? OSCE는 형성적입니까 아니면 총괄적입니까? 이 질문에 대한 답은 케인의 모델의 Implication 단계에 대한 증거를 제공합니다. 그리고 이 단계가 마지막이지만, 이러한 질문에 대한 답은 나머지 OSCE 개발의 틀을 만들 것이며, 따라서 이러한 질문들이 왜 먼저 이루어져야 하는지에 대한 것입니다. 예를 들어, 저부담의 시험은 학습자에게 피드백을 제공하는 데 사용되며, 이는 고부담의 임상실습후 및 국가 면허 시험과 달리, 개별적 코칭이나 재교육을 위하여 사용할 수 있다. 이러한 이유로 저부담 시험은 고부담 검사와 같은 수준의 점수 신뢰도가 필요하지 않으므로(Downing 2004) 더 짧은 검사가 가능하다.
Development of an OSCE should begin with the end: What decisions will I make with the results? Is the OSCE formative or summative? The answers to these questions provide evidence for the Implications stage of Kane’s model. Although this stage is last, the answers will frame the rest of OSCE development, which is why these questions must be asked first. For example, a lower stakes exam would be used to provide feedback to learners, and could lead to individual coaching or remediation, compared to a higher stakes end-of-clerkship or national certification examination, which can result in repeating a clerkship or year of residency. For these reasons, a lower stakes exam does not require the same level of score reliability as a high stakes examination (Downing 2004), and so a shorter examination is possible.

또 다른 참신한 디자인은 모든 응시자가 상대적으로 짧은 선별 시험screening examination에 참여하는 순차적 OSCE이다. 그런 다음, 사전에 정의된 기준에 미치지 못한 사람만 전체 길이의 OSCE에 참여하여 기술을 평가받게 된다.
Another novel design is the sequential OSCE in which all candidates would be required to participate in a relatively short screening examination. Then, only those who perform below a predefined standard would subsequently be required to participate in a full-length OSCE to assess their skills.
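The two-stage logic of a sequential OSCE can be sketched in a few lines. This is only an illustration: the `Candidate` class, the `triage` function, and the 65% screening standard used below are hypothetical and do not come from the article.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    name: str
    screening_score: float  # percent score on the short screening OSCE


def needs_full_osce(candidate: Candidate, screening_standard: float) -> bool:
    """True when the candidate performed below the predefined standard."""
    return candidate.screening_score < screening_standard


def triage(candidates, screening_standard):
    """Split candidates into those exempted and those referred to the full-length OSCE."""
    exempt = [c.name for c in candidates if not needs_full_osce(c, screening_standard)]
    referred = [c.name for c in candidates if needs_full_osce(c, screening_standard)]
    return exempt, referred


if __name__ == "__main__":
    cohort = [Candidate("A", 72.0), Candidate("B", 58.0), Candidate("C", 65.0)]
    # 65.0 is an assumed screening standard, chosen only for the example.
    print(triage(cohort, screening_standard=65.0))
```

In practice the screening standard would itself come from a defensible standard-setting procedure (see Tip 8), since everyone who clears it is exempted from further testing.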

팁 2 OSCE가 평가해야 할 항목 결정
Tip 2 Decide what your OSCE should assess

OSCE는 전체 컨텐츠 도메인을 평가하는 데 사용할 수 없습니다. 오히려 학습자가 습득해야 할 지식과 기술의 샘플을 평가하는 데 사용됩니다. OSCE가 교육 목표를 반영하도록 하려면 청사진을 작성하는 것이 중요합니다. 청사진blueprint 작성은 콘텐츠 전문가가 관심 구인construct이 적절하게 대표되도록 하는 프로세스를 말합니다(Coderre et al. 2009).
OSCEs cannot be used to assess an entire content domain. Rather, they are used to assess a sample of the knowledge and skills that learners are expected to have mastered. To ensure that an OSCE reflects educational objectives, blueprinting is key. Blueprinting refers to the process by which content experts ensure that constructs of interest are adequately represented (Coderre et al. 2009).

따라서 한 OSCE 스테이션에서 보여준 퍼포먼스는 다른 상황에서의 병력청취 및 신체 검사 수행능력으로 일반화할 수 있습니다(Generalization). 각 스테이션의 길이는 보통 5분에서 10분 사이이다(Khan et al. 2013). 그러나 어떤 과제를 평가하느냐에 따라 더 길어질 수 있다.
This helps to ensure that one can generalize performance on these stations to the learner’s ability to perform other history and physical examinations in an OSCE (Generalization). The length of each station is usually between five and ten minutes (Khan et al. 2013) but could be longer depending on what task is being assessed.

시험 결과의 의도된 용도(즉, 낮은 위험 대 높은 위험)를 고려하여, 관심 구성을 적절하게 표본 추출할 수 있는 [충분한 수의 스테이션]이 있어야 한다. 국지적으로 개발된 저부담 시험은 8~10개의 스테이션도 괜찮은 반면, 고부담 OSCE는 허용 가능한 신뢰성을 달성하기 위해 14~18개의 스테이션이 필요할 수 있다(Khan et al. 2013).
There must be enough stations to adequately sample the construct of interest, taking into account the intended use of the exam results (i.e. low versus high stakes). A lower stakes locally developed exam may have only eight to ten stations, whereas a high stakes OSCE may require 14-18 stations to achieve acceptable reliability (Khan et al. 2013).
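One common way to reason about how many stations are needed is the Spearman-Brown prophecy formula, which projects how reliability changes as a test is lengthened. The article does not name this formula; it is offered here as a rough planning aid under the assumption that added stations behave like the existing ones. The function names and the pilot figures in the example are hypothetical.

```python
import math


def projected_reliability(current_reliability: float,
                          current_stations: int,
                          new_stations: int) -> float:
    """Spearman-Brown projection of reliability for a lengthened or shortened OSCE."""
    factor = new_stations / current_stations
    return factor * current_reliability / (1 + (factor - 1) * current_reliability)


def stations_needed(current_reliability: float,
                    current_stations: int,
                    target_reliability: float) -> int:
    """Smallest number of comparable stations projected to reach the target reliability."""
    factor = (target_reliability * (1 - current_reliability)) / (
        current_reliability * (1 - target_reliability)
    )
    # Small tolerance so floating-point noise does not add a spurious extra station.
    return math.ceil(factor * current_stations - 1e-9)


if __name__ == "__main__":
    # Hypothetical pilot: 8 stations with an observed reliability of 0.65.
    print(stations_needed(0.65, 8, 0.80))
    print(round(projected_reliability(0.65, 8, 16), 3))
```

The projection is only as good as its assumption: stations that sample a different construct, or that are scored less consistently, will not add reliability at the projected rate.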

모든 CanMEDS 역할(Frank et al. 2015; Jefferies et al. 2007)을 평가하기 위해 OSCE가 사용되었지만, 본질적(즉, 비의료 전문가) 역할(예: 전문성, 협업 등)을 실제로 평가하는데 어려움이 있으며, 이는 테스트 성과가 실제 성과를 얼마나 잘 추정하는가에 영향을 미칩니다.
Although OSCEs have been used to assess all of the CanMEDS roles (Frank et al. 2015; Jefferies et al. 2007), there are challenges in assessing the intrinsic (i.e. nonMedical Expert) roles authentically (e.g. professionalism, collaboration, etc.), which has an impact on how well the test performance extrapolates to real-world performance.

평가에 대한 프로그램적 접근방식(Schuwirth and van der Vleuten 2011)은 OSCE를 전체 평가 프레임워크의 한 부분으로 볼 것이다. 그러면 OSCE 개발을 안내할 수 있는 두 가지 질문이 제시됩니다.

(1) 전체 프로그램 중 어디에서 해당 스킬을 평가합니까? (혹은 평가할 수 있습니까?);
(2) OSCE에서 평가하기로 선택한 경우, 이를 authentic하게 평가할 수 있습니까?

A programmatic approach to assessment (Schuwirth and van der Vleuten 2011) would view an OSCE as one part of an overall assessment framework. This leads to two questions that can guide OSCE development:

(1) Where else are skills assessed (or where else could they be assessed) in my overall program?; and
(2) If I choose to assess this in an OSCE, can I do it authentically?

팁 3 사례 개발
Tip 3 Develop the cases

OSCE에서 평가할 항목을 결정한 후에는 사례 개발을 신중하게 고려해야 합니다. 사례는 관심 임상 문제를 확실히 나타내기 위해 개발되어야 한다(Extrapolation). 후보자에 대한 지침에는 현재 문제와 관련된 정보, 과제 및 만남을 완료하기 위한 기간(Pugh 및 Smee 2013)이 포함되어야 한다.
Once you have decided what will be assessed by your OSCE, careful consideration should be given to case development. Cases should be developed to ensure that they authentically represent the clinical problem of interest (Extrapolation). Instructions to candidates should include information related to the presenting problem, a task, and a time-frame for completing the encounter (Pugh and Smee 2013).

사례는 OSCE 사례 개발 모범 사례(Pugh 및 Smee 2013)를 반영하기 위해 콘텐츠 전문가와 교육 전문가 모두의 검토를 거쳐야 합니다. 이러한 전문가는 검토 시 다음 질문을 고려해야 합니다.

(1) 과제가 명확합니까? (Scoring),
(2) 할당된 시간 내에 과제를 완료할 충분한 시간이 있습니까? (Extrapolation)
(3) 사례가 임상 문제를 실제authentically로 나타냅니까?; (Extrapolation)
(4) 난이도 수준이 학습자에게 적합한가? (Extrapolation)

이 단계에서 사례를 시범적으로 테스트하면 잠재적 문제를 식별하고 완화할 수 있습니다.

Cases should undergo review by both content experts as well as educational experts to ensure that the cases reflect best practices of OSCE case development (Pugh and Smee 2013). These experts should consider the following questions in their review:

(1) Is the task clear? (Kane’s Scoring stage);
(2) Is there enough time to complete the task in the allotted time?;
(3) Does the case authentically represent a clinical problem?; and
(4) Is the level of difficulty appropriate for the learners? (the last three relate to Kane’s Extrapolation stage).

Pilot-testing of cases at this stage can help identify and mitigate potential issues.

팁 4 OSCE가 후보자를 평가하는 방법 결정(스코어링 루브릭)
Tip 4 Decide how your OSCE should assess candidates (the scoring rubric)

스코어링 루브릭의 개발은 OSCE 타당성에 대한 연구의 많은 부분이 집중된 분야입니다. 루브릭이 개발되거나 선택되는 방법에 대한 설명은 케인의 프레임워크에서 스코어링Scoring에 대한 중요한 타당성 증거를 제공할 수 있습니다.
The development of scoring rubrics is an area where much of the research on OSCE validity has focused. A description for how rubrics were developed or selected can provide important validity evidence for Scoring in Kane’s framework.

체크리스트는 관찰 가능한 행동(예: 흡연 이력에 대한 질문, JVP 식별 등)을 평가하는 데 사용됩니다. 체크리스트는 일반적으로 이분법(예: 했거나 하지 않았거나)이지만, 다분법(예: 잘 했음, 시도했지만 잘 하지 못했음, 하지 않았음)일 수도 있다(Pugh, Halman, et al. 2016). 체크리스트는 매우 저학년 의대생을 대상으로 하는 경우처럼 그것이 목표가 아닌 한, 기계적인 접근 방식rote approach을 사용하는 학습자에게 보상을 주지 않도록 주의 깊게 구성해야 합니다. 대부분의 학습자에 대해서는 주제를 이해하는 학습자와 그렇지 않은 학습자를 변별하는 데 도움이 되는 항목을 포함하려고 시도해야 합니다(즉, 주요 특성key features 접근 방식)(Daniels et al. 2014).
Checklists are used to assess observable behaviors (e.g. asked about smoking history, identified the JVP, etc.). Checklists are generally dichotomous (e.g. did or did not do), but they can also be polytomous (e.g. done well, attempted but not done well, not done) (Pugh, Halman, et al. 2016). Checklists should be carefully constructed to avoid rewarding learners who use a rote approach unless that is the goal, such as for very junior medical students. For most learners, there should be an attempt to include items that help to discriminate between learners who understand the subject matter and those who do not (i.e. a key features approach) (Daniels et al. 2014).

병력이나 신체검사에서 임상적으로 구별되는 주요 특징에 초점을 맞추지 않고 [비특정 철저성nonspecific thoroughness]을 보상하는 긴 체크리스트를 사용하는 경우, 이는 사려 깊은 진단 전문가로서 의사들이 원하는 것에 대해 잘 추론하지 못할 것이다. 직관적으로 인식된 중요도에 기초하여 체크리스트 항목에 차등 가중치를 적용하는 것이 타당하지만, 가중치 항목은 전반적인 신뢰성이나 통과/실패 결정에 큰 영향을 미치지 않는 것으로 보인다(Sandilands et al. 2014).

If one uses long checklists that reward nonspecific thoroughness as opposed to focusing on key clinically discriminating features in a history or physical examination, this will not extrapolate well to what we want in physicians as thoughtful diagnosticians. Although, intuitively, it makes sense to apply differential weights to checklist items based on their perceived importance, weighting items does not appear to affect overall reliability or pass/fail decisions significantly (Sandilands et al. 2014).

팁 5 평가자 교육
Tip 5 Train your raters

Scoring에 대한 (타당도를) 추가적으로 지지하는 근거로는, 채점자가 의도한 대로 채점 루브릭을 해석했는지 확인하기 위해 교육받았다는 근거가 있다. 평가자에게는 OSCE의 목적, 학습자의 수준 및 학습자와 어떻게 상호작용해야 하는지에 대한 정보가 포함된 오리엔테이션을 제공해야 합니다(예: 학습자에게 프롬프트 또는 피드백을 제공할 수 있습니까?). 또한 체크리스트 항목에 대한 성공의 조작적 정의와 등급 척도에 대한 각 행동 앵커의 의미를 포함하여 채점 루브릭의 예를 제공해야 한다.
Further support for Scoring includes evidence demonstrating raters were trained to ensure they interpreted scoring rubrics as intended. Raters should be provided with an orientation that includes information about the purpose of the OSCE, the level of the learners, and how they should interact with learners (e.g. can they provide prompts or feedback to learners?). They should also be provided with examples of the scoring rubrics, including the operational definition of success on any checklist items and the meaning of each behavioral anchor for rating scales.

[기준 체계 훈련frame-of-reference]과 같은 보다 상세한 형태의 오리엔테이션은 때때로 평가자에게 제공되며, 여기에는

수행능력 차원performance dimension를 정의하여 원하는 성과에 대한 공유된 정신모델을 만들고
각 차원에 대한 행동의 예를 제공한 다음
평가자가 표본 퍼포먼스를 가지고 연습한 뒤 피드백을 받을 수 있도록 한다.

A more detailed form of orientation, such as frame-of-reference training, is sometimes provided to raters, which involves

creating a shared mental model of the desired performance by defining performance dimensions,
providing examples of behaviors for each dimension, and then
allowing raters to practice and receive feedback on sample performances (Roch et al. 2012).

이 방법은 시간이 많이 소요될 수 있으며 일반적으로 고부담 시험에서만 주로 사용되지만, 채점에 대한 타당성Scoring 주장을 강화할 수 있습니다.
This method can be time-consuming and is usually reserved for high-stakes examinations, but can strengthen the validity argument for scoring.

원하지 않는 등급 점수 변동undesired variation은 [CIV construct irrelevant variance]을 초래할 수 있으므로 점수Scoring 추론의 타당성을 위협할 수 있다는 점을 명심해야 한다. 훈련에도 불구하고, 평가자들은 실수를 할 수 있다. 전통적으로 우리는 종종 일부 평가자를 다른 평가자(즉, 매와 비둘기)에 비해 지나치게 가혹하거나 관대한 것으로 생각하지만, 보다 최근의 연구는 평가자 변동성variability이 이보다 더 복잡하다는 것을 보여준다(Govaerts et al. 2013; Gingerich et al. 2014).
It is important to remember that any undesired variation in rater scoring may introduce construct irrelevant variance and thus threaten the validity of scoring inferences made. Despite training, raters may make mistakes. Although traditionally we often think of some raters as excessively harsh or lenient compared to other raters (i.e. hawks and doves), more recent research demonstrates that rater variability is more complex than this (Govaerts et al. 2013; Gingerich et al. 2014).

팁 6 표준화된 환자를 위한 스크립트 개발 및 교육
Tip 6 Develop scripts for and train standardized patients

대부분의 OSCE는 학습자가 임상 기술을 입증할 수 있도록 표준화된 환자(SP)를 사용합니다. [SP 교육에 대한 엄격하고 표준화된 접근 방식]은 SP 묘사portrayals 간의 차이를 줄이기 때문에 스코어링Scoring의 무결성integrity에 대한 추가적인 타당성 증거를 제공합니다.
Most OSCEs employ the use of standardized patients (SPs) to allow learners to demonstrate their clinical skills. A rigorous and standardized approach to SP training provides further validity evidence for the integrity of Scoring as it reduces the variance between SP portrayals.

SP에는 묘사portrayal를 안내하는 스크립트가 제공되어야 하며, 실제 환자에 기반한 스크립트가 진실성authenticity를 더해줄 수 있다. 병력청취의 경우, 스크립트에는 다음에 대한 세부 정보가 풍부하게 있어야 한다.

제시될 임상표현(타임라인 및 관련 양성·음성 소견 포함)
SP의 과거 의료 기록(의약품 사용 포함)
필요한 경우 사회력(예: 흡연 및 알코올 사용)

SPs should be provided with a script to guide their portrayal, and basing the script on a real patient adds authenticity. For history stations, the script is relatively rich in details about:

the presenting problem (including a timeline and pertinent positives and negatives);
the SP’s past medical history (including medication use); and
social history (e.g. smoking and alcohol use), as required.

최소한 모든 체크리스트 항목에 대해 스크립트로 작성된 답변이 있어야 하지만 학습자가 예상한 질문에 대한 답변이 제공되어야 합니다. 예상치 못한 질문에 대해 SP는 상황에 따라 "아니오" 또는 "잘 모르겠습니다"라고 대답하도록 교육할 수 있습니다. 반대로 신체검사 스테이션의 경우 세부사항이 적게 요구될 수 있지만, SP는 자극에 반응하도록 훈련될 수 있다(예: 복부 검사 시 경계, 관절의 움직임 범위 제한 등).

At a minimum, there should be a scripted answer for all checklist items, but there should be answers provided for any anticipated questions that learners might ask. For unanticipated questions, SPs can be trained to answer either “no” or “I’m not sure” depending on the context. In contrast, for physical examination stations, fewer details may be required, but SPs can be trained to react to stimuli (e.g. guarding during an abdominal examination, limited range of motion of a joint, etc.).

스크립트에 포함할 다른 세부 사항은 다음과 관련될 수 있습니다.

인구 통계(예: 나이와 성별),
방에서의 SP 시작 위치(예: 앉음 vs 누움),
외모(예: 불안함 vs 침착함),
행동(예: 협동함 vs 회피)

Other details to be included in the script may relate to

demographics (e.g. age and gender),
SP starting position in room (e.g. sitting vs lying down),
appearance (e.g. anxious vs calm), and
behavior (e.g. cooperative vs evasive).

평가자가 학습자의 문제 이해도를 더 잘 평가할 수 있도록 SP가 질문(예: "나에게 무슨 일이 일어나고 있는 것 같습니까?")하는 문항이나 프롬프트도 스크립트에 포함될 수 있습니다.
The script may also include statements or prompts for the SP to ask (e.g. “What do you think is going on with me?”) to allow raters to better assess learners’ understanding of the problem.

팁 7 데이터 수집 프로세스의 무결성 보장
Tip 7 Ensure integrity of data collection processes

데이터 수집에는 데이터 무결성을 보장하기 위한 일종의 품질 보장이 있어야 합니다. 이것은 시험 점수가 관측치를 반영한다는 추가 증거를 제공합니다(Kane의 채점Scoring 단계).
Data collection should have some sort of quality assurance to ensure data integrity. This provides further evidence that test scores reflect the observations (Kane’s Scoring stage).

OSCE를 진행하는 동안 직원은 평가자가 평가지를 올바르게 작성하는지(예: 항목을 건너뛰지 않는지) 주기적으로 확인하고, 질문이 있을 경우 이를 해결할 수 있습니다. OSCE가 끝난 후 컴퓨터에 점수를 수동으로 입력하는 경우, 정확한 데이터 입력을 위해 채점표 중 일부를 무작위로 확인해야 합니다. 스캔 가능한 점수표를 만들 수 있는 합리적인 가격의 소프트웨어 패키지가 있어 무작위 검증의 필요성을 줄이기는 하지만 없앨 수는 없습니다.
During an OSCE, staff can periodically verify that raters are completing the rating instruments correctly (i.e. not skipping any items) and address any questions they might have. After the OSCE, if scores are manually entered into a computer, a random set of score sheets should be checked to ensure accurate data entry. There are reasonably-priced software packages that allow creating scannable score sheets which reduces, but does not eliminate, the need for random verification.
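The random check of manually entered score sheets can be sketched as follows. The 10% audit fraction, the function names, and the dictionary layout (sheet ID mapped to the recorded score) are assumptions for illustration, not a procedure prescribed by the article.

```python
import random


def sample_sheets_for_audit(sheet_ids, fraction=0.1, seed=None):
    """Draw a random subset of score-sheet IDs to re-check against the paper originals."""
    sheet_ids = list(sheet_ids)
    if not sheet_ids:
        return []
    rng = random.Random(seed)  # a fixed seed makes the audit sample reproducible
    n = max(1, round(len(sheet_ids) * fraction))
    return sorted(rng.sample(sheet_ids, n))


def find_entry_errors(paper_scores, entered_scores, audited_ids):
    """Return audited sheet IDs whose entered score differs from the paper score."""
    return [sid for sid in audited_ids if paper_scores[sid] != entered_scores.get(sid)]


if __name__ == "__main__":
    paper = {sheet_id: 50 + sheet_id % 30 for sheet_id in range(1, 201)}
    entered = dict(paper)
    entered[17] = 0  # simulate a keying error
    audited = sample_sheets_for_audit(paper, fraction=0.1, seed=2021)
    print(len(audited), find_entry_errors(paper, entered, audited))
```

If the audit finds errors, a reasonable response is to widen the sample or re-enter the affected batch rather than correct only the sheets that happened to be drawn.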

일부 센터에서는 태블릿 및 eOSCE 시스템에 액세스할 수 있는데, 이러한 시스템은 코멘트를 옮겨 적는 시간과 누락된 등급 척도 수를 줄여주는 추가적인 장점이 있으며, 피드백의 양과 질을 높일 수 있습니다(Daniels et al. 2016; Denison et al. 2016). 그러나 인터넷 기반 시스템에 대한 안정적인 인터넷 액세스와 태블릿 또는 eOSCE 시스템에 장애가 발생한 경우를 위한 백업 계획이 반드시 필요합니다.
Some centers may have access to tablets and eOSCE systems, which have the added advantage of reducing the time needed to transcribe comments and the number of missed rating scales, and can increase the quantity and quality of feedback (Daniels et al. 2016; Denison et al. 2016). However, reliable internet access for internet-based systems, and a back-up plan for when a tablet or the eOSCE system fails, are imperative.

결측 데이터에 대해 결정해야 합니다(예: 비어 있는 등급 척도).
Decisions must be made about missing data (e.g. a rating scale that is left blank).

마지막으로, 다른 평가와 마찬가지로, 테스트 보안 문제를 고려해야 합니다. 학습자의 능력을 정확하게 측정하려면 모든 학생이 평가에 대한 정보에 동등하게 액세스할 수 있어야 합니다. 시험 자료에 대한 무단 접근(예: 학생이 만든 유령 은행을 통해)은 OSCE의 점수 해석의 타당성을 위협하는 부당한 이점을 학습자에게 제공합니다.

Finally, as with any assessment, one must consider the issue of test security. To ensure an accurate measurement of learners’ abilities, it is important that all students have equal access to information about the assessment. Unauthorized access to test materials (e.g. through student created ghost banks) provides learners with an unfair advantage that threatens the validity of the interpretation of scores from the OSCE.

팁 8 표준 설정 접근법 선택
Tip 8 Choose a standard setting approach

표준 설정 방법(즉, cut score)의 선택도 평가의 Implication에 영향을 미치므로 점수 해석의 타당성을 뒷받침하기 위해 세심한 주의를 기울여야 한다. 부적절하게 높은 cut-score를 설정하면 실제로 능력이 있는 학습자가 낙제할 수 있고, 너무 낮은 cut-score를 설정하면, 약한 학습자가 자신의 능력에 대해 지나치게 자신감을 가질 수 있습니다. 이는 특히 합격-불합격 결정이 학습자, 교육자 및 환자에게 중요한 영향을 미치는 고부담 평가에 중요합니다.
The choice of standard-setting methods (i.e. cut score) also deserves careful attention in order to support the validity of score interpretations as this impacts the Implications of the assessment. Cut scores that are inappropriately high may result in failing learners who are actually competent, while cut scores that are too low may lead weak learners to be overly confident in their abilities. This is especially important for high-stakes assessments in which pass-fail decisions have important repercussions for learners, educators and patients.

컷 스코어를 설정할 때 gold-standard는 없지만 선택한 방법에 대한 자세한 근거를 제시해야 합니다. OSCE에 가장 일반적으로 사용되는 세 가지 기준 참조 방법은 Angoff, Borderline Group 및 Borderline Regression입니다.
Although there is no gold standard when setting a cutscore, a detailed rationale for the method chosen should be provided. The three most common criterion-referenced methods used for OSCEs are Angoff, Borderline Group, and Borderline Regression.
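Two of the methods named above can be illustrated for a single station. In the Borderline Group method the cut score is the mean checklist score of candidates whose global rating was "borderline"; in Borderline Regression the checklist scores are regressed on the global ratings and the cut score is the predicted checklist score at the borderline rating. The sketch below assumes a 1 to 5 global rating scale on which 2 means borderline; that scale, the function names, and the data are hypothetical.

```python
def borderline_group_cut(checklist_scores, global_ratings, borderline_rating):
    """Mean checklist score of candidates rated exactly 'borderline'."""
    borderline = [s for s, g in zip(checklist_scores, global_ratings)
                  if g == borderline_rating]
    if not borderline:
        raise ValueError("no candidates received the borderline rating")
    return sum(borderline) / len(borderline)


def borderline_regression_cut(checklist_scores, global_ratings, borderline_rating):
    """Checklist score predicted at the borderline rating by simple linear regression."""
    n = len(checklist_scores)
    mean_x = sum(global_ratings) / n
    mean_y = sum(checklist_scores) / n
    sxx = sum((x - mean_x) ** 2 for x in global_ratings)
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(global_ratings, checklist_scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept + slope * borderline_rating


if __name__ == "__main__":
    ratings = [1, 2, 2, 3, 3, 4, 4, 5]            # global ratings, 2 = borderline (assumed)
    scores = [38, 52, 47, 61, 58, 72, 69, 83]     # checklist percent scores (made up)
    print(borderline_group_cut(scores, ratings, 2))
    print(round(borderline_regression_cut(scores, ratings, 2), 1))
```

Borderline Regression uses every candidate's data rather than only the borderline group, which is one reason it is often considered more stable for small cohorts. The station cut scores are then typically combined into an overall OSCE cut score.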

다음 결정은 전체 합격/불합격 결정이 전체 OSCE 점수만 기준으로 이루어져야 하는지, 또는 수험자가 최소 스테이션 수도 통과해야 하는지에 대한 것이다. 후자(conjunctive) 접근방식은 수험자가 폭넓은 지식을 입증하도록 하기 위해(즉, 여러 스테이션에서의 낙제 성과를 다른 스테이션에서의 매우 우수한 성과로 보상할 수 없도록) 일부 교육자가 선호한다(Homer et al. 2017).
The next decision is whether the overall pass/fail determination should be based on the overall OSCE score alone, or if examinees must also pass a minimum number of stations. The latter (conjunctive) approach is favored by some educators, to ensure that examinees demonstrate a breadth of knowledge (i.e. that a failing performance on several stations cannot be compensated for by very strong performance on others) (Homer et al. 2017).
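The difference between a purely compensatory rule and a conjunctive rule is easy to see in code. The function below is a hypothetical sketch: the name `passes_osce`, the use of the mean station score as the overall score, and the cut scores in the example are all assumptions.

```python
def passes_osce(station_scores, station_cuts, overall_cut, min_stations_passed=0):
    """Apply an overall cut score and, optionally, a minimum number of stations passed.

    With min_stations_passed=0 the rule is purely compensatory; any larger
    value adds the conjunctive requirement.
    """
    overall = sum(station_scores) / len(station_scores)
    stations_passed = sum(score >= cut
                          for score, cut in zip(station_scores, station_cuts))
    return overall >= overall_cut and stations_passed >= min_stations_passed


if __name__ == "__main__":
    scores = [95, 95, 90, 40, 45, 42]   # strong on three stations, failing on three
    cuts = [60] * 6
    print(passes_osce(scores, cuts, overall_cut=60))                         # compensatory
    print(passes_osce(scores, cuts, overall_cut=60, min_stations_passed=4))  # conjunctive
```

The same candidate passes under the compensatory rule and fails under the conjunctive one, which is exactly the situation the conjunctive approach is meant to catch.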

팁 9 OSCE가 가능한 모든 양식을 얼마나 잘 일반화하는지 고려합니다.
Tip 9 Consider how well the OSCE would generalize to all possible forms

또 다른 중요한 타당성 근거 출처는 결과의 일반화 가능성Generalizability과 관련이 있다. OSCE의 심리측정적 특성을 분석하여, 타당성 주장시 이 요소(generalizability)에 대한 지원을 제공할 수 있습니다.
Another important source of validity evidence relates to the Generalizability of the results. Support for this element of the validity argument can be provided by analyzing the psychometric properties of the OSCE.

점수의 신뢰성(즉, 재현성)은 타당성 증거의 중요한 요소입니다. 알파는 일반적으로 전반적인 신뢰성을 측정하고 문제가 있는 스테이션을 찾는 데 사용됩니다. 단일 스테이션의 성능에 기반하여 결정을 내리는 경우 스테이션 레벨에서 알파를 사용하여 신뢰성을 평가하고 문제가 있는 항목을 식별할 수 있습니다.
The reliability (i.e. reproducibility) of scores is an important element of validity evidence. Alpha is usually used across stations to measure overall reliability and to look for problematic stations. If decisions are made based on the performance of a single station, then alpha can be used at the station level to evaluate reliability and identify problematic items.
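Assuming "alpha" here refers to Cronbach's alpha, the across-station calculation can be sketched with the standard library alone. The data layout (one row per candidate, one column per station) and the function names are assumptions; "alpha if station deleted" is a common way to look for problematic stations, since a station whose removal raises alpha is worth reviewing.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates-by-stations table of scores."""
    k = len(scores[0])
    station_variances = [variance([row[j] for row in scores]) for j in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(station_variances) / total_variance)


def alpha_if_station_deleted(scores):
    """Alpha recomputed with each station left out in turn."""
    k = len(scores[0])
    return [
        cronbach_alpha([[v for j, v in enumerate(row) if j != dropped] for row in scores])
        for dropped in range(k)
    ]


if __name__ == "__main__":
    # Made-up data: stations 1 and 2 agree perfectly, station 3 is noisier.
    table = [[1, 1, 3], [2, 2, 1], [3, 3, 2], [4, 4, 4]]
    print(round(cronbach_alpha(table), 3))
    print([round(a, 3) for a in alpha_if_station_deleted(table)])
```

The same functions apply at the station level if each row holds a candidate's item scores within one station instead of station totals.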

OSCE는 본질적으로 다면적이기 때문에(예: 사람, 항목, 평가자, 트랙 등), 일반화가능도 이론(G-이론)은 종종 신뢰성을 계산하고 다양한 오류 발생원의 영향을 결정하는 데 선호된다. 그러나 G-이론은 스테이션당 여러 평가자가 있는 경우에 가장 효과적이며, 그렇지 않은 경우에는 평가자로 인한 분산과 스테이션으로 인한 분산을 분리할 수 없습니다. 구문 기반 GENOVA(Crick and Brennan 1983)와 보다 사용자 친화적인 G-string IV(Bloch and Norman 2015)와 같이 G-스터디를 실행하는 데 무료로 사용할 수 있는 패키지가 있습니다.
Because OSCEs are inherently multi-faceted (e.g. persons, items, raters, tracks, etc.), generalizability theory (G-theory) is often preferred for calculating reliability as well as determining the impact of the various sources of error. However, G-theory works best if there are multiple raters per station; otherwise, one cannot tease out the variance due to raters as opposed to due to the station. There are freely available packages for running G-studies such as the syntax-based GENOVA (Crick and Brennan 1983) and the more user friendly G-string IV (Bloch and Norman 2015).
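To make the idea of variance components concrete, here is a minimal sketch of the simplest possible G-study: a fully crossed persons-by-stations design with one score per cell. It is not what GENOVA or G-string IV do internally, only the textbook expected-mean-squares estimates for this one design; the function name and the returned keys are assumptions. Note that with a single rater per station the residual term mixes rater effects with the person-by-station interaction, which is the limitation described above.

```python
def g_study_persons_by_stations(scores):
    """Variance components and relative G coefficient for a crossed p x s design."""
    n_p, n_s = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_s)
    person_means = [sum(row) / n_s for row in scores]
    station_means = [sum(row[j] for row in scores) / n_p for j in range(n_s)]

    ss_p = n_s * sum((m - grand) ** 2 for m in person_means)
    ss_s = n_p * sum((m - grand) ** 2 for m in station_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_s

    ms_p = ss_p / (n_p - 1)
    ms_s = ss_s / (n_s - 1)
    ms_res = ss_res / ((n_p - 1) * (n_s - 1))

    # Negative estimates are truncated at zero, a common convention.
    var_p = max(0.0, (ms_p - ms_res) / n_s)
    var_s = max(0.0, (ms_s - ms_res) / n_p)
    var_res = ms_res  # person x station interaction confounded with error (and raters)

    g_relative = var_p / (var_p + var_res / n_s)
    return {"var_person": var_p, "var_station": var_s,
            "var_residual": var_res, "g_relative": g_relative}


if __name__ == "__main__":
    table = [[1, 1, 3], [2, 2, 1], [3, 3, 2], [4, 4, 4]]
    for name, value in g_study_persons_by_stations(table).items():
        print(name, round(value, 3))
```

For this one-facet design the relative G coefficient coincides with Cronbach's alpha computed across stations; the added value of G-theory appears once further facets such as raters or tracks are modeled.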

팁 10 검사와 다른 변수와의 상관 관계를 검토합니다.
Tip 10 Review the correlation of your examination with other variables

Tamblyn과 동료들은 라이선스 검사의 낮은 점수가 상담, 처방 및 유방 촬영 검사의 패턴으로 측정되는 낮은 임상 관행과 관련이 있음을 입증했다.
Tamblyn and colleagues demonstrated that lower scores on a licensing examination were associated with lower quality of clinical practice as measured by patterns in consultations, prescribing, and mammography screening.

이 데이터는 면허시험에서 케인의 추정Extrapolation 단계에 대한 증거를 뒷받침합니다.
These data provide evidence for Kane’s Extrapolation stage in the validity argument for that licensing exam.

보다 일반적으로는 [OSCE 점수를 다른 평가와 비교]하여 증거를 찾습니다. 예를 들어, Pugh와 동료들은 현지에서 개발된 [Internal Medicine OSCE progress test]의 성과가 고부담 [내과 인증 시험의 점수]와 상관관계가 있음을 입증했습니다.
More commonly, evidence is sought by comparing OSCE scores to other assessments. For example, Pugh and colleagues demonstrated that performance on a locally developed Internal Medicine OSCE progress test correlated with scores on the high stakes Internal Medicine certification examination.

모든 상관관계가 기관 외부의 데이터로 이루어질 필요는 없습니다. 로컬 데이터를 사용하여 OSCE 점수를 유사하거나 다른 역량을 측정하는 다른 평가와 상호 연관시킬 수 있습니다. 또 다른 분석에서는 OSCE가 상급 학습자와 하급 학습자를 변별하는지 여부를 조사할 수 있으며, 이 또한 타당성 증거를 제공한다.
Not all correlations need to be done with data external to the institution. Local data can be used to correlate OSCE scores to other assessments measuring similar and dissimilar competencies. Another analysis could examine if an OSCE discriminates more senior versus junior learners as this also provides validity evidence.
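Both local analyses mentioned here need only basic statistics. The sketch below computes a Pearson correlation between OSCE scores and another assessment, and a standardized mean difference (Cohen's d) between senior and junior learners. Cohen's d is my choice of summary, not one named in the article, and all data and function names are made up for illustration.

```python
from math import sqrt
from statistics import mean, variance


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of scores."""
    mean_x, mean_y = mean(x), mean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def cohens_d(senior, junior):
    """Standardized difference in mean OSCE score, using the pooled standard deviation."""
    n1, n2 = len(senior), len(junior)
    pooled_var = ((n1 - 1) * variance(senior) + (n2 - 1) * variance(junior)) / (n1 + n2 - 2)
    return (mean(senior) - mean(junior)) / sqrt(pooled_var)


if __name__ == "__main__":
    osce = [62, 70, 75, 81, 88]
    written_exam = [58, 66, 79, 77, 90]   # hypothetical assessment of a related competency
    print(round(pearson_r(osce, written_exam), 2))
    print(round(cohens_d(senior=[78, 84, 90, 86], junior=[65, 72, 70, 77]), 2))
```

A higher correlation with assessments of similar competencies than with assessments of dissimilar ones, together with higher scores for more senior learners, is the pattern that would support the Extrapolation inference.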

팁 11 OSCE가 학습자에게 미치는 영향 평가
Tip 11 Evaluate the effects of the OSCE on learners

형성적이든 총괄적이든 평가가 학습을 촉진한다는 것을 알고 있습니다(Kane의 함의Implication 단계).
Whether formative or summative, we know that assessment drives learning (Kane’s Implications stage).

평가는 긍정적이고 부정적인 방식으로 학습에 영향을 미칠 수 있으며, 따라서 OSCE가 학습을 촉진하거나 방해하는 방법에 대한 증거를 찾아야 한다.
Assessment can influence learning in both positive and negative ways, and so one should seek evidence for how an OSCE is promoting or impeding learning.

고려해야 할 질문은 다음과 같습니다.

OSCE는 학습에 어떤 영향을 미칩니까?;
불합격 또는 합격한 학습자에게 수반하는 결과는 무엇입니까?
불합격자에게 재교육이 제공되는 경우, 재시험에서 성과가 개선된다는 증거가 있는가?
OSCE는 커리큘럼의 후속 변화에 어떤 영향을 미칩니까(예: 많은 수의 후보자가 불합격할 경우), 반대로 커리큘럼의 변화는 OSCE 수행능력에 어떤 영향을 미칩니까?
OSCE가 환자 치료에 어떤 영향을 미칩니까?
OSCE의 목적이 학습을 유도하는 것이라면, 학습자가 OSCE의 결과로 학습하고 있음을 보여주는 데이터가 있습니까?

Questions to be considered include:

How does the OSCE influence learning?;
What are the outcomes of learners who fail versus pass?;
If remediation is provided to those who fail, is there evidence that performance improves on a repeat assessment?;
How does the OSCE influence subsequent changes in the curriculum (e.g. if a high number of candidates fail a station) and, conversely, do changes to the curriculum influence OSCE performance?;
How does the OSCE influence patient care?; and finally,
If the purpose of the OSCE is to drive learning, then is there data to show the learners are learning as a result of the OSCE?

팁 12 전체 프로세스를 검토하여 유효성에 대한 위협을 찾습니다.
Tip 12 Review the entire process to look for threats to validity

타당성 주장은 평가의 해석과 사용을 제안한 후 유효성의 증거를 검토하는 반복적인 과정으로, 증거가 의도된 해석이나 사용을 뒷받침하지 않는 경우에는, 사용을 수정하거나 평가 과정을 개정한다. 이러한 상황은 평가가 목적에 부합하는지 확인하기 위해 [지속적으로 이뤄져야] 합니다. 이러한 [지속적인 품질 보증ongoing quality assurance]은 신뢰성과 같은 심리측정적인 면에만 초점이 맞춰지는 경우가 많지만, OSCE 개발의 모든 측면을 검토하여 케인 모델의 네 가지 단계와 관련된 문제를 찾아야 합니다.
An argument for validity is an iterative process in which one states the proposed interpretation and use of the assessment, then examines the evidence of validity and, if the evidence does not support the intended interpretation or use, revises either the use or the assessment process. This should happen continually to ensure the assessment is meeting its purpose. Too often this ongoing quality assurance is focused solely on psychometrics such as reliability, but all aspects of the development of an OSCE should be reviewed to look for issues related to each of the four stages of Kane’s model.

종종 간과되는 일부 OSCE 지표로는 다음이 있다.

전체 불합격 또는 특정 스테이션 불합격 학생 비율(프로그램 평가 정보일 수 있음),
스테이션에서의 [(체크리스트) 합계 점수]와 [Global 등급 척도] 사이의 상관 관계(상관성이 낮으면 점수 시트 내용에 대한 우려가 높아짐) 및
동일한 스테이션이지만, 평가자 또는 위치에 차이가 있는 그룹 간의 비교

Some OSCE metrics that are often overlooked are

the percent of students who fail overall or fail a specific station (can be program evaluation information),
correlation between a station’s sum score and global rating scale (lower correlation raises concern about score sheet content), and
comparisons between groups who encounter the same stations, but with differences such as raters or locations (Pell et al. 2010).
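The first and third of these metrics can be computed directly from raw station scores. The sketch below is illustrative only: the data layout, the function names, and the cut scores are assumptions, and the second metric (the checklist-to-global-rating correlation) can be obtained with any ordinary Pearson correlation routine.

```python
from collections import defaultdict
from statistics import mean


def percent_failing(scores_by_station, cut_by_station):
    """Percent of candidates scoring below the cut score at each station."""
    return {
        station: 100.0 * sum(s < cut_by_station[station] for s in scores) / len(scores)
        for station, scores in scores_by_station.items()
    }


def mean_by_group(records):
    """Mean score per (station, group); a group is e.g. a rater or a site.

    records is an iterable of (station, group, score) tuples.
    """
    grouped = defaultdict(list)
    for station, group, score in records:
        grouped[(station, group)].append(score)
    return {key: mean(values) for key, values in grouped.items()}


if __name__ == "__main__":
    scores = {"chest pain": [50, 70, 80, 55], "knee exam": [65, 72, 90, 61]}
    cuts = {"chest pain": 60, "knee exam": 60}   # hypothetical station cut scores
    print(percent_failing(scores, cuts))

    records = [("chest pain", "site A", 50), ("chest pain", "site A", 70),
               ("chest pain", "site B", 80), ("chest pain", "site B", 55)]
    print(mean_by_group(records))
```

A large gap between groups on the same station does not by itself show a rater or site problem, since the groups may differ in ability, but it flags where to look.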

Med Teach. 2018 Dec;40(12):1208-1213.

doi: 10.1080/0142159X.2017.1390214. Epub 2017 Oct 25.

Vijay John Daniels 1, Debra Pugh 2

Affiliations

1 Department of Medicine, University of Alberta, Edmonton, Canada.
2 Department of Medicine, University of Ottawa, Ottawa, Canada.

PMID: 29069965
DOI: 10.1080/0142159X.2017.1390214

Abstract
The Objective Structured Clinical Examination (OSCE) is used globally for both high and low stakes assessment. Despite its extensive use, very few published articles provide a set of best practices for developing an OSCE, and of those that do, none apply a modern understanding of validity. This article provides 12 tips for developing an OSCE guided by Kane's validity framework to ensure the OSCE is assessing what it purports to measure. The 12 tips are presented in the order they would be operationalized during OSCE development.
