Identifying the bad apples (Adv in Health Sci Educ, 2015)


Geoff Norman




Thirty-five years ago, two social psychologists wrote a book called “Human Inference,” showing how vulnerable human judgments and actions are to a variety of contextual variables. One such weakness is the “vividness hypothesis”: a single vivid experience can shape social attitudes even in the face of clear statistical evidence.

Thirty-five years ago, two social psychologists, Richard Nisbett and Lee Ross, wrote a classic book called “Human Inference: Strategies and Shortcomings of Social Judgment” (1980). In that book, they demonstrated how human judgments and actions are vulnerable to many contextual variables. One particular shortcoming they labeled the “vividness hypothesis”: “A single vivid instance can influence social attitudes when pallid statistics of far greater evidential value do not” (p. 57).


Evidence of this psychological bias abounds.

  • Crime rates have been declining steadily in all developed countries for 25 years.
    Politicians continue to garner votes claiming they are “tough on crime” despite the fact that crime rates have been steadily declining in all developed countries for 25 years; as one example of many, homicide rates in Canada are half what they were in 1999.

  • How serious is the air crash problem? Crashes are only about 1/3 of the 1970 figure, even though distance flown has increased five-fold.
    How bad was 2014 for air crashes? Remember MH370 and MH17? In fact, the number of civil aviation crashes was the lowest on record, and about 1/3 of what it was in 1970, despite a five-fold increase in passenger miles flown.

  • How many people are killed by jihadists? Violent death rates have been declining steadily for millennia.
    What about all those people killed by jihadists? Violent death rates have been on a steady decline for millennia (Pinker 2011).


The case of Dr. Harold Shipman, a British GP, is one example. It prompted calls to reform the educational process so that nothing like it happens again, and since then attention to “non-cognitive” factors, particularly professionalism, has grown.

Consider Dr. Harold Shipman, a British GP who is estimated to have killed 250 of his patients and was eventually convicted of 15 murders. The publicity surrounding his trial and conviction led to calls to reform the educational process so that such things do not happen again (Powis 2015). In particular, we have seen increased focus on “non-cognitive” factors, particularly professionalism.


van Mook et al. review several issues in identifying and remediating dyscompetent residents. Research suggests that unprofessional behavior in practice is predictable from performance in medical school.

In a review article, van Mook et al. (2014) examine the multiple issues related to the identification and remediation of “dyscompetent” residents, particularly in the area of professionalism. And an original study by Santen et al. (2014) extends the findings of two landmark studies by Papadakis et al. (2004, 2008), which showed that unprofessional behavior in practice was apparently predictable from performance in medical school. Both studies used a “case control” design.


Santen et al. replicate and extend these results using the student assessment data that routinely come before promotion committees.

The Santen et al. (2014) study in this issue replicates and extends these findings by examining the routine student assessments arising from promotion committees, instead of creating a system geared to identifying professionalism issues.


Both studies are case-control studies, a design commonly used when the outcome of interest (developing cancer, dying) is infrequent. In the Papadakis study, 70 of 6330 graduates were disciplined by the California state board, a prevalence of about 1.1%. The other 6260 were not.

Both studies use a case-control design. Case-control studies are frequently used when the outcome of interest, such as developing cancer or dying, is infrequent. This design certainly applies in the Papadakis study: of 6330 graduates from UCSF over the time interval, 70 were disciplined by the California state board, a prevalence of about 1.1%. And, of course, 6260 were not.


And here is the crux. The “disease” we are looking for, unprofessionalism, has a very low prevalence. In that situation, even a very good diagnostic test yields true positives that are swamped by false positives. Screening would catch only 38% (27) of the 70 cases, while wrongly labeling 1190 other graduates as unprofessional. The PPV is 27/(1190 + 27), only 2.2%. In the Santen study, too, 140 of more than 2000 graduates had poor performance in medical school.

And there’s the rub. The disease we’re screening for, documented unprofessionalism, has a very low prevalence. Under these circumstances, even very good diagnostic tests result in true positive cases that are swamped by false positives. Working through this example, if we used documented concerns as a medical student as a screening test to decide if a graduate should be allowed to proceed, we would detect 38% of the bad apples, or 27; but we would incorrectly label 0.19 × 6260 = 1190 other graduates as unprofessional. The positive predictive value of the test is 27/(1190 + 27) = 2.2%. Similar data arise in the Santen study, where review of 20 years’ data, involving over 2000 graduates, showed that 140 had poor performance in school, and only 29 were subsequently sanctioned by the state medical board.
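The screening arithmetic for the Papadakis figures can be sketched in a few lines of Python. This is only an illustration: the function name `screening_outcome` and its parameters are hypothetical, and the 38% detection rate and 19% false positive rate are simply the values quoted in the text.

```python
def screening_outcome(n_cases, n_controls, sensitivity, false_positive_rate):
    """Return (true positives, false positives, positive predictive value)."""
    true_pos = sensitivity * n_cases               # cases the screen catches
    false_pos = false_positive_rate * n_controls   # controls wrongly flagged
    ppv = true_pos / (true_pos + false_pos)        # share of flagged who are real cases
    return true_pos, false_pos, ppv


# Figures quoted in the text: 70 cases, 6260 controls,
# 38% of cases and 19% of controls had documented concerns.
tp, fp, ppv = screening_outcome(70, 6260, 0.38, 0.19)
print(f"true positives: {tp:.1f}, false positives: {fp:.1f}, PPV: {ppv:.1%}")
```

Because the positive predictive value depends on prevalence as well as on the test itself, rerunning the function with a larger `n_cases` shows how quickly the picture changes when the condition is less rare.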


So in the Papadakis study, of every 100 students denied graduation, only 2 would ever be reported to the state board.

So in the Papadakis study, if they had proceeded to implement a policy based on documented concerns, then for every 100 students who would have been denied graduation, only two would end up reported to the State board.


 

Adding an economic argument: if educating a doctor costs $100,000 per year, this amounts to a social cost of $400,000 × 98, or about $40 million, because these 98 students would be flagged as unsatisfactory and denied graduation even though they went on to behave satisfactorily.

If you want to put an economic spin on it, if it costs $100,000/year to educate a doctor, that policy would result in a social cost of $400,000 × 98 ≈ $40 million of education costs, based on the number of satisfactory students who were flagged as unsatisfactory and then could not graduate, without even considering lost income in practice.
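As a rough check on the cost figure, the calculation can be written out as below. The constant names are hypothetical, and the four-year programme length is an assumption inferred from the $400,000 per student quoted in the text.

```python
COST_PER_YEAR = 100_000    # dollars per student per year, as quoted in the text
YEARS_OF_TRAINING = 4      # assumption: implied by the $400,000 per-student figure
DENIED = 100               # students refused graduation under the policy
PPV = 27 / (1190 + 27)     # positive predictive value quoted in the text

later_disciplined = round(DENIED * PPV)        # flagged students who are real cases
wrongly_denied = DENIED - later_disciplined    # flagged students who would have been fine
total_cost = wrongly_denied * COST_PER_YEAR * YEARS_OF_TRAINING

print(f"{later_disciplined} of {DENIED} denied students would later be disciplined")
print(f"{wrongly_denied} wrongly denied, at an education cost of ${total_cost:,}")
```

The product comes to a little under the rounded $40 million given in the text.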


But unprofessional behavior does not appear overnight, and it may be detectable at the time of admission. This is the argument made by Powis and others for using personality tests at admission.

But if these unprofessional behaviours are longstanding, perhaps they are detectable at the time of admissions. This is the promise held out by Powis, who has argued repeatedly for the more widespread use of personality tests at admissions (2003, 2009, 2015).


In fact, there is evidence bearing on the usefulness of such a policy. In a 2007 study, Papadakis used CPI personality test results to show that the group means differed, by 2.1 SD. So far so good.

In fact, there is some useful evidence to inform this policy. Papadakis, in another published study (2007), looked at performance on a personality test (the California Psychological Inventory), using a subsample from her earlier study that had undergone the psychological testing as part of admissions. The sample was 19 cases (from the original 70) who had difficulty with the state board, and 26 controls (of 196 sampled from 6260), all of whom had taken the CPI as part of admission to medical school. For the total score, the mean of the cases was 156 (SD = 14.7); for the controls 181 (SD = 11.7). This means that the case mean was 25/11.7 = 2.1 SDs below the control mean. So far so good.


Imagine using these scores for selection. We can work out the proportions, keeping in mind that 70 people will eventually get into trouble and 6260 will not.

Now let us imagine using these data for selection, by establishing a threshold score that students must attain to be considered for admission—a policy directly advocated by Powis (2015). We can look at the proportion of each group who are accepted or rejected, keeping in mind that our real denominator is 70 cases who will eventually get in trouble with the state board, and 6260 controls who won’t.


If we set the threshold at 156, the mean score of the “cases,” we miss 50% (35) of the cases. That threshold sits at -2.1 SD for the “controls,” so 2% of them, 125 people, are falsely labeled. In other words, 125/(125 + 35), or 78%, of those labeled would in fact have no problems.

If we were to set the threshold at 156, the “case” mean, then we’ll miss 50% of the cases, 35. And this is a z score of -2.1 for the controls, so we’ll falsely label 2% of the controls, 125, as unprofessional. And 78% (125/(125 + 35)) of the people we have labeled would not have any problems in practice.


Clearly, a test that detects only 50% of cases is seriously flawed. So raise the sensitivity to 90%. That catches 63 of the 70 cases, but 1315 controls are flagged along with them. In other words, 95% of those “caught” will have no problems later.

Clearly, a test that only detects 50% of the cases is of little value. So let’s rack it up to a sensitivity of 90%, which is a Z value on the “Cases” distribution of 1.28. We will detect 63 of 70 cases. That means the threshold, in Z units on the “control” distribution, is (-2.1 + 1.28) = -0.82, which equates to 21% of the Control distribution below the threshold, or 1315. In short, similar to the previous calculation, 1315/(1315 + 63) = 95% of the individuals identified by a low score on the psychological test would not have any further problems in practice.
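The two threshold calculations can be reproduced with Python's standard library. The sketch follows the simplification used in the text: both groups are treated as normal distributions with the same SD, separated by 2.1 SD. The function name `screen` is hypothetical. Exact normal tail areas are somewhat smaller than the rounded 2% and 21% used in the text, so the counts land close to, but not exactly on, the quoted figures.

```python
from statistics import NormalDist

N_CASES, N_CONTROLS = 70, 6260   # group sizes quoted in the text
EFFECT_SIZE = 2.1                # case mean lies 2.1 SDs below the control mean

std_normal = NormalDist()        # standard normal: mean 0, SD 1


def screen(sensitivity):
    """Return (cases flagged, controls flagged, share of flagged who are controls)."""
    # Threshold position in SD units relative to the case mean.
    z_cases = std_normal.inv_cdf(sensitivity)
    # The same threshold expressed relative to the control mean (equal-SD assumption).
    z_controls = z_cases - EFFECT_SIZE
    flagged_cases = sensitivity * N_CASES
    flagged_controls = std_normal.cdf(z_controls) * N_CONTROLS
    false_share = flagged_controls / (flagged_cases + flagged_controls)
    return flagged_cases, flagged_controls, false_share


for s in (0.50, 0.90):
    cases, controls, share = screen(s)
    print(f"sensitivity {s:.0%}: {cases:.0f} cases and {controls:.0f} controls flagged; "
          f"{share:.0%} of those flagged are false positives")
```

Sweeping `sensitivity` over a wider range makes the trade-off visible: every additional case caught pulls in many more controls.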

 


 

Clearly, using personality tests to detect those who will eventually be disciplined by the state board comes at a serious cost.

Clearly, any attempt to identify individuals who will be eventually subject to report to the State disciplinary board using personality tests comes at a serious cost in terms of denying access to many who would not have problems.


The premise of this approach is that “cognitive measures” alone are inadequate for finding those who will later cause problems. But is that really so?

The underlying premise of this approach is that cognitive measures are inadequate to identify individuals who will become problems in practice. But is this necessarily the case?


Tamblyn et al. studied the validity of the MCCQE, which has two parts: a written examination and an OSCE. For predicting communication complaints, the relative risk (RR) for the bottom quartile of the OSCE was 1.43; for the written examination it was 1.34. For predicting quality-of-care complaints, the RR was 1.38 for the communication score and 1.54 for the written examination. So however much they are disparaged, cognitive measures are important predictors of practice performance. The same conclusion emerges from Teherani et al., who asked whether residents' postgraduate performance could predict later disciplinary action. Performance was measured two ways: ABIM in-training evaluations and the ABIM certification examination. The hazard ratio for discipline charges was an impressive 1.9 or so, and the ABIM certification examination came in at about 1.7. Here too the prevalence was about 1%.

Tamblyn et al. (2007) have studied the validity of the Medical Council of Canada Qualifying Examination in predicting complaints (quality of care and communication skills) to provincial licensing bodies. The MCCQE examination has two parts: a written examination, primarily multiple choice completed at graduation, and an OSCE completed 1 year later. In terms of predicting communication complaints, the relative risk of a complaint for a communication skill performance in the bottom quartile of the OSCE was 1.43; for the written exam score it was 1.34. For quality of care complaints, the relative risks were 1.38 for communication skills and 1.54 for the written test. (Relative risks for the data gathering and problem solving parts of the OSCE ranged from 0.97 to 1.13, predicting nothing.) So it appears that, however much they are disparaged, cognitive measures of performance are an important predictor of practice performance. The same conclusion came from a study by Teherani et al. (2005), who looked at postgraduate performance of residents as a predictor of disciplinary action in practice. Performance was measured two ways: by American Board of Internal Medicine in-training evaluations, and by the ABIM certification examination. Again, the hazard ratio in predicting discipline charges looked impressive: about 1.9. However, the ABIM certification examination was not far behind at 1.7. And as before, with a prevalence of disciplinary action of about 1% in this sample, the results do not support the use of either measure as a “diagnostic test”.


Another assumption in discussions of assessing “non-cognitive” qualities or personality at admission is the either-or hypothesis: the assumption that the admissions committee faces a Faustian choice between applicants with good character and applicants with strong academic ability. But such a choice arises only if personality and academic ability are negatively correlated.

One other assumption pervades discussion of assessing “non-cognitive” or personality at admissions: the “either-or” hypothesis. It is presumed that the admissions committee must make a Faustian choice between selecting someone who is personable, professional and compassionate, or someone who is academically top-tier. Such a choice would only be necessary if there were a strong negative correlation between personal qualities and academic performance. But is there?


One study showed a negative association between high school grades and interview scores. But in more recent MMI studies, the correlation was -0.21 in one and 0.07 in the other. Wherever the truth lies, there appears to be no problem selecting for both academic excellence and interpersonal skills. Moreover, on the Neo-5 personality test, the current state of the art, the only relationship with other measures that shows up consistently is a moderate positive relationship between conscientiousness and grades.

One study (Powis and Bristow 1997) showed a significant negative association between scores on a personal interview and high school grades. However, two more recent studies examined the relation between the MMI (a well-validated measure of non-cognitive skills) and university GPA. In the first study (Eva et al. 2004) the correlation was -0.21; in the second (Kulasegaram et al. 2010) the correlation was +0.07. Wherever the true correlation lies, it would appear that there should be no problem identifying students who have both academic excellence and interpersonal skills. Moreover, when one examines the constructs measured by the current state of the art personality test, the Neo-5 personality test, about the only consistent relationship with other measures that has emerged is a moderate positive relationship between conscientiousness and grades (Kulasegaram et al. 2010).


Admissions strategies that use both academic and interpersonal measures are perfectly appropriate. Forcing a choice between the two is not. And the expectation that we can build a test to diagnose the rare disease of “unprofessionalism” is mistaken.

It is perfectly appropriate to devise admissions strategies, in-course performance indices, and certification procedures that include both academic and interpersonal measures. It is not appropriate to force a choice between one and the other. And it is folly to presume that we will ever be able to create an adequate diagnostic test for the ultimately rare disease of unprofessionalism.



 



Powis, D. (2015). Selecting medical students: An unresolved challenge. Medical Teacher, 37, 252–260.



 2015 May;20(2):299-303. doi: 10.1007/s10459-015-9598-9.

Identifying the bad apples.

Author information

  • McMaster University, Hamilton, ON, Canada, norman@mcmaster.ca.

