On differences: why expert-novice comparisons contribute little to the validity argument (Adv in Health Sci Educ, 2015)

Much ado about differences: why expert-novice comparisons add little to the validity argument


David A. Cook





Introduction


The growing emphasis on competency-based medical education (CBME) and on outcomes has made evidence-based education essential.

The growing emphasis on competency-based education (Albanese et al. 2008; Weinberger et al. 2010) and educational outcomes (Prystowsky and Bordage 2001; Cook and West 2013) creates an imperative for evidence-based educational metrics.


Known-groups comparison studies fall under the "relationships with other variables" source of validity evidence, where the "other variable" is usually training status.

Such known-groups comparison studies provide evidence of "relationships with other variables" (American Educational Research Association 1999; Downing 2003; Cook and Beckman 2006), with the other variable usually being the training status (or rather, the presumably higher proficiency that comes with advanced training).


The typical approach compares staff physicians with postgraduate trainees, or fourth-year with second-year medical students, on the assumption that those at a more advanced training status will score higher than those at a less advanced one.

The typical study might enroll and compare scores between staff physicians and postgraduate trainees, or senior medical students and junior medical students, with the hypothesis that those with more advanced training status (the "experts") will have higher scores than those less advanced (the "novices").


Such known-groups comparisons contribute little to the validity argument. An assessment that fails to discriminate between groups does have a problem, but confirming that the groups differ is not by itself sufficient to establish validity.

The problem is that such known-groups comparisons actually contribute little to the validity argument. While failure to confirm discrimination between groups suggests a serious potential problem for the assessment [as has been shown for checklists as measures of clinical reasoning (Neufeld et al. 1981; Hodges et al. 1999)], confirmation of such differences is by itself insufficient to establish score validity.


 

The main problem: association does not imply causality


The most important problem in the expert-novice study is confounding. An observed difference can be explained in many ways, and "association does not imply causation."

Arguably the most important flaw in the expert-novice study is the problem of confounding: there are multiple plausible explanations for any observed differences. "Association does not imply causation." However, these analyses actually provide no evidence to confirm that score differences reflect the target characteristic or any other specific underlying characteristic.


 

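Of course, we would never deliberately use a completely irrelevant characteristic such as the number of grey hairs. But researchers developing a new instrument do not actually know what the instrument's numbers do (or do not) measure. In Fig. 1, judging from the scores alone, there is no way to tell which of three things the instrument measures: cardiology proficiency, pulmonology proficiency, or grey hair.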

Of course, we would never intentionally use a completely irrelevant characteristic such as grey hair to measure clinical proficiency. Yet researchers evaluating a new instrument don't really know what the instrument's scores do (or do not) measure; they only know the pattern of the numbers. In Fig. 1 there is no way to know (judging from the numbers alone, without the benefit of labels) whether the instrument is measuring proficiency in cardiology, proficiency in pulmonology, or simply grey hair.
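The confounding argument can be illustrated with a small simulation. This is only a sketch under invented assumptions: the three traits, the group sizes, and the 1.5 SD shift between groups are all hypothetical, and the traits are generated independently of one another. It is not a reconstruction of Fig. 1. Each trait is higher on average in the experts, so an instrument tracking any one of them separates the groups about equally well.

```python
import random
import statistics


def cohens_d(low_group, high_group):
    """Standardized mean difference between two samples, using the pooled SD."""
    n_low, n_high = len(low_group), len(high_group)
    pooled_var = (
        (n_low - 1) * statistics.variance(low_group)
        + (n_high - 1) * statistics.variance(high_group)
    ) / (n_low + n_high - 2)
    return (statistics.fmean(high_group) - statistics.fmean(low_group)) / pooled_var ** 0.5


def simulate_group(rng, n, mean_shift):
    """Give each person three independent traits, all shifted by training status."""
    return [
        {
            "cardiology": rng.gauss(mean_shift, 1.0),
            "pulmonology": rng.gauss(mean_shift, 1.0),
            "grey_hair": rng.gauss(mean_shift, 1.0),
        }
        for _ in range(n)
    ]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
novices = simulate_group(rng, 200, mean_shift=0.0)  # hypothetical group sizes
experts = simulate_group(rng, 200, mean_shift=1.5)  # hypothetical 1.5 SD shift

# Expert-novice effect size for an instrument that tracks each trait alone.
effect_sizes = {
    trait: cohens_d([p[trait] for p in novices], [p[trait] for p in experts])
    for trait in ("cardiology", "pulmonology", "grey_hair")
}
for trait, d in effect_sizes.items():
    print(f"{trait}: d = {d:.2f}")
```

All three effect sizes come out large and similar, so the expert-novice difference alone cannot tell an observer which trait the scores reflect.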

 

 


 

 

Cautions in extrapolating results to educational practice


There are also several other methodological problems.


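First, participants in expert-novice studies often do not represent the population to which the results will be applied. Few instruments are used with both first-year and fourth-year students at the same time, or with both first-year trainees and experienced physicians. In addition, the learner groups that will ultimately use the assessment are generally more homogeneous than the study participants, which reduces discriminatory power, and they may differ in important ways other than training level. Such spectrum bias also significantly affects the measurement properties of diagnostic tests in clinical medicine, and the same problem arises in educational assessment. Lijmer et al. pointed out that a test "evaluated in a group of patients already known to have the disease and a separate group of normal patients" is in effect a case-control study, and that such studies overestimate accuracy by a factor of three.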

First, the participants enrolled in expert-novice studies are rarely representative of the population to whom the results will be applied. Few instruments are intended for use with both first-year and final-year medical students at the same time, or first-year postgraduate trainees and experienced physicians. Also, the learner groups that will ultimately use the assessment are typically more homogeneous than those enrolled in the expert-novice study, which decreases discriminatory power, and may differ in important ways other than the level of training (e.g., degree of interest in the study topic). Such spectrum bias has been shown to significantly influence the measurement properties of diagnostic tests in clinical medicine (Lijmer et al. 1999; Whiting et al. 2011), and the same problem holds true for educational assessments. Lijmer et al. (1999, page 1,062) noted that studies in which "the test is evaluated in a group of patients already known to have the disease and a separate group of normal patients" (i.e., known-groups comparisons) are actually case–control studies, and found that such study designs overestimated accuracy by a factor of three.


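Second, the known-groups design does not represent the typical use of the instrument. In practice, the instrument is applied to learners of similar training status whose individual abilities are unknown, in order to estimate and classify each person's ability. The expert-novice comparison, by contrast, starts from groups at different training statuses and confirms that their average scores differ. A difference in mean scores in a known-groups analysis does not guarantee that individual scores will classify learners accurately in prospective use.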

Second, the known-groups design does not mirror a typical application of the instrument. In real-life applications the educator starts with a group of learners of similar training status but with unknown abilities, and uses the assessment to estimate and classify the ability of each individual. By contrast, the expert-novice comparison starts with groups at different training statuses and presumably known ability, and confirms that the average assessment scores vary. Showing that the average group scores differ in the known-groups analysis does not guarantee that individual scores will accurately classify learners prospectively.
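The gap between a group-level difference and individual-level classification can be sketched numerically. The numbers here are assumptions chosen for illustration (two groups of 500, normally distributed scores, a 0.5 SD difference in means); they do not come from the article. With that much overlap between the distributions, a cut score misclassifies a large share of individuals even though the mean difference is clear.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the sketch is repeatable
n = 500  # hypothetical group size

# Hypothetical score distributions: experts average 0.5 SD higher than novices.
novice_scores = [rng.gauss(0.0, 1.0) for _ in range(n)]
expert_scores = [rng.gauss(0.5, 1.0) for _ in range(n)]

novice_mean = statistics.fmean(novice_scores)
expert_mean = statistics.fmean(expert_scores)
mean_difference = expert_mean - novice_mean

# Classify each individual with a cut score midway between the two group means.
cut_score = (novice_mean + expert_mean) / 2
correct = sum(s < cut_score for s in novice_scores) + sum(
    s >= cut_score for s in expert_scores
)
accuracy = correct / (2 * n)

print(f"mean difference: {mean_difference:.2f}")
print(f"individual classification accuracy: {accuracy:.2f}")
```

Under these assumptions, the group means differ reliably, yet the expected accuracy of the midpoint cut score is only about 60%, not far above the 50% expected from guessing.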


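Finally, many studies that evaluate expert-novice differences also estimate reliability coefficients. This is a serious conceptual error, because known-groups comparisons hypothesize variability between groups, whereas reliability analysis focuses on variability within groups. Including groups already known to differ inappropriately inflates the reliability coefficient.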

Finally, many studies that evaluate expert-novice differences also attempt to estimate the reliability of scores. This is a serious conceptual flaw, because known-groups comparisons hypothesize variability between groups while reliability analyses focus on variability within groups. Including groups with known differences in a reliability analysis will erroneously inflate the reliability coefficient, as shown in Fig. 2.
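The inflation can be sketched with simulated data. This is not an attempt to reproduce Fig. 2, and the model is an assumption made for illustration: each person has a true score, two parallel measurements (for example, two raters) add independent error, and reliability is estimated as the Pearson correlation between the two measurements. The group means, standard deviations, and sample sizes are hypothetical.

```python
import random
import statistics


def pearson(x, y):
    """Pearson correlation between two equal-length lists."""
    mean_x, mean_y = statistics.fmean(x), statistics.fmean(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def simulate_group(rng, n, group_mean, true_sd=0.5, error_sd=1.0):
    """Return two parallel measurements per person: true score plus independent error."""
    first, second = [], []
    for _ in range(n):
        true_score = rng.gauss(group_mean, true_sd)
        first.append(true_score + rng.gauss(0.0, error_sd))
        second.append(true_score + rng.gauss(0.0, error_sd))
    return first, second


rng = random.Random(2)  # fixed seed so the sketch is repeatable
novice_a, novice_b = simulate_group(rng, 300, group_mean=0.0)
expert_a, expert_b = simulate_group(rng, 300, group_mean=3.0)

# Reliability estimated within each homogeneous group, then in the pooled sample.
within_novice_r = pearson(novice_a, novice_b)
within_expert_r = pearson(expert_a, expert_b)
pooled_r = pearson(novice_a + expert_a, novice_b + expert_b)

print(f"within novices: r = {within_novice_r:.2f}")
print(f"within experts: r = {within_expert_r:.2f}")
print(f"pooled:         r = {pooled_r:.2f}")
```

Measurement error is identical in all three analyses. The pooled coefficient is higher only because the known difference between groups adds between-person variance, which is exactly the variance a within-group application would not have.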

 

 

 


 

What should researchers do?


Given the problems with known-groups comparisons, what should researchers do? First, it is not wrong to use such analyses in search of "relations with other variables" validity evidence.

Given the problems with known-groups comparisons, how should researchers proceed? First, it is not wrong to perform such analyses in search of validity evidence of relations with other variables (Cook and Beckman 2006).


Such analyses have several advantages.

Advantages include simplicity, convenience, low cost, high power, and short duration.

 

Such comparisons can also examine attributes that should not differ by training level, or examine groups defined by characteristics other than training.

Such comparisons can also examine attributes that should not vary across training level (e.g., professionalism), or examine groups determined by characteristics other than training.

 

From the "necessary but not sufficient" standpoint, these analyses are most interesting when they show no difference where one should exist, or a difference where none should exist. Reconfirming an already hypothesized difference adds little to the validity argument.

In light of the "necessary but not sufficient" guideline, these analyses will be most interesting if they fail to discriminate groups that should be different, or find differences where none should exist. Confirmation of hypothesized differences or similarities adds little to the validity argument.


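Second, researchers can use other study designs to seek stronger evidence of "relations with other variables." For example, they can assemble trainees in a composition similar to real-life use and compare scores against an independent measure of the same or a similar characteristic. Such designs resemble what educators actually do.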

Second, researchers might use other study designs to identify stronger evidence of relations with other variables (Cook et al. 2014). Such studies might assemble a group of trainees similar in composition to that expected in real-life applications, and then examine the correlation with an independent measure of the same or a similar characteristic measured concurrently or at a later date. These designs closely mimic what educators do in real life.


Finally, it is important not to rely on a single source of validity evidence.

Finally, it is important not to rely on any single source of validity evidence (whether from a known-groups comparison, or from any other source).


 

 




Adv in Health Sci Educ. 2015 Aug;20(3):829-34. doi: 10.1007/s10459-014-9551-3. Epub 2014 Sep 27.

Much ado about differences: why expert-novice comparisons add little to the validity argument.

Author information

  • Mayo Clinic Online Learning, Mayo Clinic College of Medicine, Rochester, MN, USA, cook.david33@mayo.edu.

Abstract

One approach to validating assessment scores involves evaluating the ability of scores to discriminate among groups who differ in a specific characteristic, such as training status (in education) or disease state (in clinical applications). Such known-groups comparison studies provide validity evidence of "relationships with other variables." The typical education research study might compare scores between staff physicians and postgraduate trainees with the hypothesis that those with more advanced training (the "experts") will have higher scores than those less advanced (the "novices"). However, such comparisons are too nonspecific to support clear conclusions, and expert-novice comparisons (and known-groups comparisons in general) thus contribute little to the validity argument. The major flaw is the problem of confounding: there are multiple plausible explanations for any observed between-group differences. The absence of hypothesized differences would suggest a serious flaw in the validity argument, but the confirmation of such differences adds little. As such, accurate known-groups discrimination may be necessary, but will never be sufficient, to support the validity of scores. This article elaborates on this and other problems with the known-groups comparison that limit its utility as a source of validity evidence.

PMID: 25260974

DOI: 10.1007/s10459-014-9551-3

