Theories related to assessment: AMEE Guide No. 57
General overview of the theories used in assessment: AMEE Guide No. 57
LAMBERT W. T. SCHUWIRTH1 & CEES P. M. VAN DER VLEUTEN2
1Flinders University, Australia, 2Maastricht University, The Netherlands
Introduction
Assessment is on everyone's agenda, which is hardly surprising.
It is our observation that when the subject of assessment in medical education is raised, it is often the start of extensive discussions. Apparently, assessment is high on everyone's agenda. This is not surprising because assessment is seen as an important part of education in the sense that it not only defines the quality of our students and our educational processes, but it is also seen as a major factor in steering the learning and behaviour of our students and faculty.
Discussions about assessment, however, often rely on tradition and intuition. Heeding tradition is not necessarily a bad thing; George Santayana said, "Those who do not learn from history are doomed to repeat it."
Arguments and debates on assessment, however, are often strongly based on tradition and intuition. It is not necessarily a bad thing to heed tradition. George Santayana already stated (quoting Burk) that "Those who do not learn from history are doomed to repeat it."1 So, we think that an important lesson is also to learn from previous mistakes and avoid repeating them.
Intuition, too, should not be cast aside capriciously; it can be a powerful force in changing people's behaviour. Again, though, intuition is often at odds with research findings.
Intuition is also not something to put aside capriciously; it is often found to be a strong driving force in the behaviour of people. But again, intuition is not always in concordance with research outcomes. Some research outcomes in assessment are somewhat counterintuitive or at least unexpected. Many researchers may not have exclaimed "Eureka" but "Hey, that is odd" instead.
Two important tasks follow from this. First, to avoid repeating mistakes, we need to critically assess whether what we do by tradition and routine still has value. Second, to correct mistaken intuitions, we must be able to translate research findings into appropriate methods and approaches.
This leaves us, as assessment researchers, with two very important tasks. First, we need to critically study which common and tradition-based practices still have value and consequently which are the mistakes that should not be repeated. Second, it is our task to translate research findings to methods and approaches in such a way that they can easily help change incorrect intuitions of policy makers, teachers and students into correct ones. Neither goal can be attained without a good theoretical framework in which to read, understand and interpret research outcomes. The purpose of this AMEE Guide is to provide an overview of some of the most important and most widely used theories pertaining to assessment. Further Guides in assessment theories will give more detail on the more specific theories pertaining to assessment.
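Unfortunately, as in many other scientific fields, assessment in medical education has no single underlying theory and draws on a variety of theories from adjacent fields. In addition, theoretical frameworks more directly related to the assessment of health professionals have been emerging, the most important of which is the distinction between 'assessment of learning' and 'assessment for learning'.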
Unfortunately, like many other scientific disciplines, medical assessment does not have one overarching or unifying theory. Instead, it draws on various theories from adjacent scientific fields, such as general education, cognitive psychology, decision-making and judgement theories in psychology and psychometric theories. In addition, there are some theoretical frameworks evolving which are more directly relevant to health professions assessment, the most important of which (in our view) is the notion of ‘assessment of learning’ versus ‘assessment for learning’ (Shepard 2009).
In this AMEE Guide we will present the theories that have featured most prominently in the medical education literature in the past four decades. Of course, this AMEE Guide can never be exhaustive; the number of relevant theoretical domains is simply too large, nor can we discuss all theories to their full extent. Not only would this make this AMEE Guide too long, but it would also be beyond its scope, namely to provide a concise overview. Therefore, we will discuss only the theories on the development of medical expertise and psychometric theories, and then end by highlighting the differences between the assessment of learning and assessment for learning. As a final caveat, we must say here that this AMEE Guide is not a guide to methods of assessment. We assume that the reader has some prior knowledge about this; otherwise, we refer the reader to specific articles or to text books (e.g. Dent & Harden 2009).
Theories on how (medical) experts develop
Theories on the development of (medical) expertise
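What characterises an expert in medicine? What is the difference between a novice and an expert? These questions are bound to arise, because we cannot decide how to assess if we do not know what we are assessing.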
What distinguishes someone as an expert in the health sciences field? What do experts do differently compared to novices when solving medical problems? These are questions that are inextricably tied to assessment, because if you do not know what you are assessing it also becomes very difficult to know how you can best assess.
It may seem self-evident that one becomes an expert by learning and gaining experience.
It may be obvious that someone can only become an expert through learning and gaining experience.
One of the earliest studies was by de Groot, who investigated why chess grandmasters became grandmasters and what they did differently from amateurs. At first he thought it was because they could think more moves ahead than amateurs. This was not the case: neither group looked further ahead than roughly seven moves. What de Groot found instead was that grandmasters were better at remembering the positions of the pieces on the board. He and his successors found that grandmasters could accurately recall positions after only a brief look.
One of the first to study the development of expertise was de Groot (1978), who wanted to explore why chess grandmasters became grandmasters and what made them differ from good amateur chess players. His first intuition was that grandmasters were grandmasters because they were able to think more moves ahead than amateurs. He was surprised, however, to find that this was not the case; players of both expertise groups did not think further ahead than roughly seven moves. What he found, instead, was that grandmasters were better able to remember positions on the board. He and his successors (Chase & Simon 1973) found that grandmasters were able to reproduce positions on the board more correctly, even after very short viewing times. Even after having seen a position for only a few seconds, they were able to reproduce it with much greater accuracy than amateurs.
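One might then think that grandmasters simply have a superior memory, but that is not so. Human working memory holds about seven units (plus or minus two), and this does not improve with learning.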
One would think then that they probably had superior memory skills, but this is not the case. The human working memory has a capacity of roughly seven units (plus or minus two) and this cannot be improved by learning (Van Merrienboer & Sweller 2005, 2010).
The most striking difference was not the number of units that fit in working memory, but the amount of information contained in each unit.
The most salient difference between amateurs and grandmasters was not the number of units they could store in their working memory, but the richness of the information in each of these units.
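When we store information in our native language, every word and some fixed expressions are stored as a single unit, because they link directly to memories already present in long-term memory. With a foreign language that has never been stored in long-term memory, part of our cognitive resources has to go to memorising the characters themselves. (...) Storing information in units that carry more information is called chunking, and it is a key element of what expertise is and how it develops.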
To illustrate this, imagine having to copy a text in your own language, then a text in a foreign Western European language and then one in a language that uses a different character set (e.g. Cyrillic). It is clear that copying a text in your own language is easiest and copying a text in a foreign character set is the most difficult. While copying you have to read the text, store it in your memory and then reproduce it on the paper. When you store the text in your native language, all the words (and some fixed expressions) can be stored as one unit, because they relate directly to memories already present in your long-term memory. You can spend all your cognitive resources on memorising the text. In the foreign character set you will also have to spend part of your cognitive resources on memorising the characters, for which you have no prior memories (schemas) in your long-term memory. A medical student who has just started his/her study will have to memorise all the signs and symptoms when consulting a patient with heart failure, whereas an expert can almost store it as one unit (and perhaps only has to store the findings that do not fit to the classical picture or mental model of heart failure). This increasing ability to store information as more information-rich units is called chunking and it is a central element in expertise and its development. Box 1 provides an illustration of the role of chunking.
So, why were the grandmasters better than good amateurs? Well, mainly because they possessed much more stored information about chess positions than amateurs did, or in other words, they had acquired so much more knowledge than the amateurs had.
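If there is one lesson from these chess studies, as many studies in other fields have also shown, it is that a 'rich' and 'well-organised' knowledge base is the foundation of successful problem solving.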
If there is one lesson to be drawn from these early chess studies – which have been replicated in such a plethora of other expertise domains that it is more than reasonable to assume that these findings are generic – it is that a rich and well-organised knowledge base is essential for successful problem solving (Chi et al. 1982; Polsen & Jeffries 1982).
The next question is what 'well-organised' means. Essentially, organisation allows new information to be stored quickly and retained longer, and relevant information to be retrieved when needed. (...) Although the computer is often compared to the human brain, humans do not use a File Allocation Table as a computer does. This means that it is very hard for a person to store new information when there is no existing knowledge to which it can be connected.
The next question then would be: What does ‘well-organised’ mean? Basically, it comes down to organisation that will enable the person to store new information rapidly and with good retention and to be able to retrieve relevant information when needed. Although the computer is often used as a metaphor for the human brain (much like the clock was used as a metaphor in the nineteenth century), it is clear that information storage on a hard disk is very much different from human information storage. Humans do not use a File Allocation Table to index where the information can be found, but have to embed information in existing (semantic) networks (Schmidt et al. 1990). The implication of this is that it is very difficult to store new information if there is no existing prior information to which it can be linked. Of course, the development of these knowledge networks is quite individualised, and based on the individual learning pathways and experiences. For example, we – the authors of this AMEE Guide – live in Maastricht, so our views, connotations and associations with ‘Maastricht’ differ entirely from those of most of the readers of the AMEE Guides. Although we may share the knowledge that it is a city (and perhaps that it is in the Netherlands) and that there is a university with a medical school, the rest of the knowledge is much more individualised.
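'Knowledge' is highly domain specific: a person may know a great deal about one topic and almost nothing about another. Because expertise rests on a 'well-organised' knowledge base, expertise is domain specific as well. For assessment, this means that performance on a single case or item is a very poor predictor of performance on other cases or items. We should therefore never rely on limited assessment information; a high-stakes decision based on a single case is highly unreliable.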
Knowledge generally is quite domain specific (Elstein et al. 1978; Eva et al. 1998); a person can be very knowledgeable on one topic and a lay person on another, and because expertise is based on a well-organised knowledge base, expertise is domain specific as well. For assessment, this means that the performance of a candidate on one case or item of a test is a poor predictor for his or her performance on any other given item or case in the test. Therefore, one can never rely on limited assessment information, i.e. high-stakes decisions made on the basis of a single case (e.g. a high-stakes final VIVA) are necessarily unreliable.
A second important lesson is that 'problem-solving ability' is idiosyncratic. Whereas domain specificity means that one person's performance varies across cases, idiosyncrasy means that different experts may solve the same case in different ways. This is logical, given that the way knowledge is organised differs from person to person. The lesson for assessment is that, when 'diagnosing' a candidate's expertise, assessing the problem-solving 'process' yields less information than assessing the 'outcome'.
A second important and robust finding in the expertise literature – more specifically the diagnostic expertise literature – is that problem-solving ability is idiosyncratic (cf. e.g. the overview paper by Swanson et al. 1987). Domain specificity, which we discussed above, means that the performance of the same person varies considerably across various cases; idiosyncrasy here means that the way the same case is solved varies substantially between different experts. This is also logical, keeping in mind that the way the knowledge is organised is highly individual. The assessment implication from this is that when trying to capture, for example, the diagnostic expertise of candidates, the process may be less informative than the outcome, as the process is idiosyncratic (and fortunately the outcome of the reasoning process is much less so).
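The most important issue is 'transfer', which is related to domain specificity and idiosyncrasy. Transfer is the extent to which a person can apply a problem-solving approach that works for one problem to another problem; it requires recognising the similarity between the two situations and seeing that the same principle applies.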
A third and probably most important issue is the matter of transfer (Norman 1988; Regehr & Norman 1996; Eva 2004). This is closely related to the previous issue of domain specificity and idiosyncrasy. Transfer pertains to the extent to which a person is able to apply a given problem-solving approach to different situations. It requires that the candidate understands the similarities between two different problem situations and recognises that the same problem-solving principle can be applied. Box 2 provides an illustration (drawn from a personal communication with Norman).
The specific situations presented in the two problems are their 'surface features', and the principle underlying both is their 'deep structure'. Transfer, then, is the ability to uncover the deep structure without being blinded by the surface features.
Most often, the first problem is not recognised as being essentially the same as the second, nor is it recognised that the problem-solving principle is the same. Both solutions lie in the splitting up of the total load into various parts. In problem 1, the 1000 W laser beam is replaced by 10 rays of 100 W each, but converging right on the spot where the filament was broken. In the second problem the solution is more obvious: build five bridges and then let your men run onto the island. If the problem were represented as: you want to irradiate a tumour but you want to do minimal harm to the skin above it, it would probably be recognised even more readily by physicians. The specific presentation of these problems is labelled as the surface features of the problem and the underlying principle is referred to as the deep structure of the problem. Transfer exists by virtue of the expert's ability to identify the deep structure and not to be blinded by the surface features.
One of the most widely used theories on the development of medical expertise holds that expertise starts with the collection of isolated facts, which are then combined into meaningful semantic networks. These networks are subsequently compressed into dense illness scripts and, with years of experience, enriched into instance scripts that allow a particular diagnosis to be recognised instantly. The difference between illness scripts (congealed patterns of a particular diagnosis) and instance scripts is that the latter also take in contextual features that a lay person might overlook, including the patient's appearance or even an odour.
One of the most widely used theories on the development of medical expertise is the one suggested by Schmidt, Norman and Boshuizen (Schmidt 1993; Schmidt & Boshuizen 1993). Generally put, this theory postulates that the development of medical expertise starts with the collection of isolated facts which further on in the process are combined to form meaningful (semantic) networks. These networks are then aggregated into more concise or dense illness scripts (for example pyelonephritis). As a result of many years of experience, these are then further enriched into instance scripts, which enable the experienced clinician to recognise a certain diagnosis instantaneously. The most salient difference between illness scripts (which are a sort of congealed pattern of a certain diagnosis) and instance scripts is that in the latter contextual, and for the lay person sometimes seemingly irrelevant, features are also included in the recognition. Typically, these include the demeanour of the patient or his/her appearance, sometimes even an odour, etc.
Important lessons for assessment.
These theories then provide important lessons for assessment:
- Do not rely on short tests. The domain specificity problem informs us that high-stakes decisions based on short tests or tests with a low number of different cases are inherently flawed with respect to their reliability (and therefore also validity). Keep in mind that unreliability cuts both ways: it implies not only that someone who failed the test could still have been satisfactorily competent, but also that someone who passed the test could be incompetent. The former candidate will remain in the system and be given a re-sit opportunity, and this way the incorrect pass–fail decision can be remediated, but the latter will escape further observation and assessment, and the incorrect decision can no longer be remediated.
- For high-stakes decisions, asking for the process is less predictive of the overall competence than focussing on the outcome of the process. This is counterintuitive, but it is a clear finding that the way someone solves a given problem is not a good indicator for the way in which she/he will solve a similar problem with different surface features; she/he may not even recognise the transfer. Focussing on multiple outcomes or some essential intermediate outcomes – such as with extended-matching questions, key-feature approach assessment or the script concordance test – is probably better than in-depth questioning the problem-solving process (Bordage 1987; Case & Swanson 1993; Page & Bordage 1995; Charlin et al. 2000).
- Assessment aimed only at reproduction will not help to foster the emergence of transfer in the students. This is not to say that there is no place for reproduction-orientated tests in an assessment programme, but they should be chosen very carefully. When learning arithmetic, for example, it is okay to focus the part of the assessment pertaining to the tables of multiplication on reproduction, but with long multiplications, focussing on transfer (in this case, the algorithmic transfer) is much more worthwhile.
- When new knowledge has to be built into existing semantic networks, learning needs to be contextual. The same applies to assessment. If the assessment approach is to be aligned with the educational approach, it should be contextualised as well. So whenever possible, set assessment items, questions or assignments in a realistic context.
Psychometric theories
Whatever purpose an assessment may pursue in an assessment programme, it always entails a more or less systematic collection of observations or data to arrive at certain conclusions about the candidate. The process must be both reliable and valid. Psychometric theories have been developed especially for these two aspects (reliability and validity). In this section, we will discuss these theories.
Validity
The concept of validity has changed several times over the past 100 years.
Simply put, validity pertains to the extent to which the test actually measures what it purports to measure. In the past century, the central notions of validity have changed substantially several times.
The first theories of validity were close to the notion of criterion or predictive validity. This is not entirely illogical; it resembles the question many teachers ask: 'Does this really produce good doctors?' But the question cannot be answered as long as there is no single, adequate, measurable criterion for a 'good doctor', which shows how hard it is to define validity in such terms. There is also the problem that the criterion itself must be validated. If a researcher proposes a measure of the 'good doctor' and uses it as the criterion for an assessment, that measure needs validating too. Validating it requires a further criterion, and so on without end.
The first theories on validity were largely based on the notion of criterion or predictive validity. This is not illogical as the intuitive notion of validity is one of whether the test predicts an outcome well. The question that many medical teachers ask when a new assessment or instructional method is suggested is: "But does this produce better doctors?" This question – however logical – is unanswerable in a simple criterion-validity design as long as there is no good single measurable criterion for good ‘doctorship’. This demonstrates exactly the problem with trying to define validity exclusively in such terms. There is an inherent need to validate the criterion as well. Suppose a researcher was to suggest a measure of ‘doctorship’ and to use it as the criterion for a certain assessment, then she/he would have to validate the measure for ‘doctorship’ as well. If this again were only possible through criterion validity, it would require the researcher to validate the criterion for the criterion as well – etcetera ad infinitum.
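A second intuitive approach is to assess by observing performance. Assessing flute-playing skill, for example, is not complicated: assemble a panel of flute experts and have each candidate play. Some blueprinting would be needed so that playing is assessed across a range of music. Applicants to an orchestra would have to play pieces from across the orchestra's repertoire. Such content validity has played an important role and still does.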
A second intuitive approach would be to simply observe and judge the performance. If one, for example, wishes to assess flute-playing skills, the assessment is quite straightforward. One could collect a panel of flute experts and ask them to provide judgements for each candidate playing the flute. Of course, some sort of blueprinting would then be needed to ensure that the performances of each candidate would entail music in various ranges. For orchestral applicants, it would have to ensure that all classical music styles of the orchestra's repertoire would be touched upon. Such forms of content validity (or direct validity) have played an important role and still do in validation procedures.
However, most of what we want to assess in students is not clearly visible and must be inferred from observation. Intelligence and neuroticism are latent traits, and so are knowledge, problem-solving ability and professionalism. They cannot be observed directly and can only be assessed through assumptions based on observed behaviour.
However, most aspects of students we want to assess are still not clearly visible and need to be inferred from observations. Not only are characteristics such as intelligence or neuroticism invisible (so-called latent) traits, but so are elements such as knowledge, problem-solving ability, professionalism, etc. They cannot be observed directly and can only be assessed as assumptions based on observed behaviour.
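Cronbach and Meehl elaborated the concept of 'construct validity', which in their view is analogous to the inductive empirical process. The researcher first formulates or postulates a clear theory of the construct the test is meant to measure, and then designs and carries out a critical evaluation of the test data to see whether they support the theoretical notion of the construct.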
In an important paper, Cronbach and Meehl (1955) elaborated on the then still young notion of construct validity. In their view, construct validation should be seen as analogous to the inductive empirical process; first the researcher has to define, make explicit or postulate clear theories and conceptions about the construct the test purports to measure. Then, she/he must design and carry through a critical evaluation of the test data to see whether they support the theoretical notions of the construct. An example of this is provided in Box 3.
This is called the 'intermediate effect', and it is an important falsification of the assumption that the test is valid.
The so-called ‘intermediate effect’, as described in the example (Box 3) (especially when it proves replicable) is an important falsification of the assumption of validity of the test.
The lessons to be drawn are as follows.
We have used this example deliberately, and there are important lessons that can be drawn from it.
The presence of such an intermediate effect is strong evidence against validity. Evidence supporting the validity of a test must come from critical 'observations'. By analogy: to confirm the presence of a disease with maximum likelihood, one must use the test most likely to be negative when the disease is absent. Evidence from 'weak' experiments cannot demonstrate validity.
First, it demonstrates that the presence of such an intermediate effect in this case is a powerful falsification of the assumption of validity. This is highly relevant, as currently it is generally held that a validation procedure must contain ‘experiments’ or observations which are designed to optimise the probability of falsifying the assumption of validity (much like Popper's falsification principle2). Evidence supporting the validity must therefore always arise from critical ‘observations’. There is a good analogy to medicine or epidemiology. If one wants to confirm the presence of a certain disease with the maximum likelihood, one must use the test with the maximum chance of being negative when disease is absent (the maximum specificity). Confirming evidence from ‘weak’ experiments therefore does not contribute to the validity assumption.
Second, contrary to a common misconception, assessing in the real work setting does not guarantee validity. There are good reasons to maximise authenticity in assessment, but the added value is usually more prominent in the formative than in the summative function. Consider this: to assess the day-to-day performance of a practising physician, we could observe real consultations, or we could review charts, laboratory test orders and referral data. The latter is clearly less authentic, but it may be more valid. The 'observer effect' in the first approach can influence the physician's behaviour and give a biased picture.
Second, it demonstrates that authenticity is not the same as validity, which is a popular misconception. There are good reasons in assessment programmes to include authentic tests or to strive for high authenticity, but the added value is often more prominent in their formative than in their summative function. An example may illustrate this: Suppose we want to assess the quality of the day-to-day performance of a practising physician and we had the choice between observing him/her in many real-life consultations or extensively reviewing charts (records and notes), laboratory test orders and referral data. The second option is clearly less authentic than the first one but it is fair to argue that the latter is a more valid assessment of the day-to-day practice than the former. The observer effect, for example, in the first approach may influence the behaviour of the physician and thus draw a biased picture of the actual day-to-day performance, which is clearly not the case in the review of charts, laboratory test orders and referral data.
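Third, validity is not a property of the test itself; it is the extent to which the test assesses the intended characteristic. Had the example in Box 3 been aimed at thoroughness of data gathering, it would have been a valid assessment, but as a measure of expertise it failed to include efficiency of information gathering and use as an essential element of the construct.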
Third, it clearly demonstrates that validity is not an entity of the assessment per se; it is always the extent to which the test assesses the desired characteristic. If the PMPs in the example in Box 3 were aimed at measuring thoroughness of data gathering – i.e. to see whether students are able to distinguish all the relevant data from non-relevant data – they would have been valid, but if they are aimed at measuring expertise they failed to incorporate efficiency of information gathering and use as an essential element of the construct.
Current views (Kane 2001, 2006) highlight the argument-based inferences that have to be made when establishing validity of an assessment procedure.
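In short, a sequence of inferences is made: from observations to scores, from observed scores to universe scores, from universe scores to the target domain, and from the target domain to the construct.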
In short, inferences have to be made from observations to scores, from observed scores to universe scores (which is a generalisation issue), from universe scores to target domain and from target domain to construct.
Taking blood pressure as an example: sounds heard through the stethoscope (observation) => sphygmomanometer reading (observed score) => readings repeated under different circumstances (universe score) => the patient's cardiovascular status (target domain) => health (construct)
To illustrate this, a simple medical example may be helpful: When taking a blood pressure as an assessment of someone's health, the same series of inferences must be made. When taking a blood pressure, the sounds heard through the stethoscope when deflating the cuff have to be translated into numbers by reading them from the sphygmomanometer. This is the translation from (acoustic and visual) observation to scores. Of course, one measurement is never enough (the patient may just have come running up the stairs) and it needs to be repeated, preferably under different circumstances (e.g. at home to prevent the ‘white coat’-effect). This step is equivalent to the inference from observed scores to universe scores. Then, there is the inference from the blood pressure to the cardiovascular status of the patient (often in conjunction with other signs and symptoms and patient characteristics) which is equivalent to the inference from universe score to target domain. And, finally this has to be translated into the concept ‘health’, which is analogous to the translation of target domain to construct. There are important lessons to be learnt from this.
- First, validation means building a case on argumentation; the arguments may rest on the outcomes of validation studies or on plausible, defensible reasoning.
First, validation is building a case based on argumentation. The argumentation is preferably based on outcomes of validation studies but may also contain plausible and/or defeasible arguments.
- Second, an assessment cannot be validated without a clear definition or theory of the construct it is meant to measure; an instrument is valid only for a particular construct, never in itself.
Second, one cannot validate an assessment procedure without a clear definition or theory about the construct the assessment is intended to capture. So, an instrument is never valid per se but always only valid for capturing a certain construct.
- Third, validation is never finished and often requires numerous observations, expectations and critical experiments.
Third, validation is never finished and often requires a plethora of observations, expectations and critical experiments.
- Finally, generalisability is needed in order to make these inferences.
Fourth, and finally, in order to be able to make all these inferences, generalisability is a necessary step.
Reliability
Reliability is one approach to the 'generalisation' step described in the validity section above. Even though generalisation is only one of the steps needed for validation, how it is made differs between theories. It helps to distinguish three levels of generalisation.
Reliability of a test indicates the extent to which the scores on a test are reproducible, in other words, whether the results a candidate obtains on a given test would be the same if she/he were presented with another test or all the possible tests of the domain. As such, reliability is one of the approaches to the generalisation step described in the previous section on validity. But even if generalisation is ‘only’ one of the necessary steps in the validation process, the way in which this generalisation is made is subject to theories of its own. To understand them, it may be helpful to distinguish three levels of generalisation.
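First, we need the concept of the 'parallel test': a hypothetical test with similar content, equal difficulty and a similar blueprint, ideally given to the same students immediately after the original test, on the assumption that the students are not tired and are not influenced by the earlier test.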
First, however, we need to introduce the concept of the ‘parallel test’ because it is necessary to understand the approaches to reproducibility described below. A parallel test is a hypothetical test aimed at a similar content, of equal difficulty and with a similar blueprint, ideally administered to the same group of students immediately after the original test, under the assumption that the students would not be tired and that their exposure to the items of the original test would not influence their performance on the second.
There are three types of generalisation.
Using this notion of the parallel test, three types of generalisations are made in reliability, namely if the same group of students were presented with the original and the parallel test:
- Would the same students pass and fail on both tests?
Whether the same students would pass and fail on both tests.
- Would the rank ordering from first to last be the same?
Whether the rank ordering from best to most poorly performing student would be the same on both the original and the parallel tests.
- Would all students receive the same scores?
Whether all students would receive the same scores on the original and the parallel tests.
Three classes of theories are in use for this: classical test theory (CTT), generalisability theory (G-theory) and item response theory (IRT).
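The three generalisations listed above can be made concrete with a small sketch. It is illustrative only: the scores, the cut score of 55 and the function names are hypothetical and do not come from the Guide, and the rank comparison assumes there are no tied scores.

```python
def pass_fail_agreement(original, parallel, cut_score):
    """Fraction of students who get the same pass/fail decision on both tests."""
    same = sum((o >= cut_score) == (p >= cut_score)
               for o, p in zip(original, parallel))
    return same / len(original)


def ranks(scores):
    """Rank position of each student (0 = lowest score); assumes no ties."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    position = [0] * len(scores)
    for rank, student in enumerate(order):
        position[student] = rank
    return position


def same_rank_order(original, parallel):
    """True if students are ordered identically on both tests."""
    return ranks(original) == ranks(parallel)


def max_score_difference(original, parallel):
    """Largest absolute difference between a student's two scores."""
    return max(abs(o - p) for o, p in zip(original, parallel))


# Hypothetical scores of five students on the original and a parallel test.
original = [52, 61, 70, 45, 88]
parallel = [55, 63, 74, 49, 90]

print(pass_fail_agreement(original, parallel, cut_score=55))
print(same_rank_order(original, parallel))
print(max_score_difference(original, parallel))
```

In this made-up data the rank order is preserved even though one pass/fail decision changes and no student receives exactly the same score, which is why the three questions are treated as separate generalisations.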
Classical test theory
CTT is the most widely used theory. It is the oldest and perhaps easiest to understand. It is based on the central assumption that the observed score is a combination of the so-called true score and an error score (O = T + e).3 The true score is the hypothetical score a student would obtain based on his/her competence only. But, as every test will induce measurement error, the observed score will not necessarily be the same as the true score.
This in itself may be logical but it does not help us to estimate the true score. How would we ever know how reliable a test is if we cannot estimate the influence of the error term and the extent it makes the observed score deviate from the true score, or the extent to which the results on the test are replicable?
The first step in this is determining the correlation between the test and a parallel test (test–retest reliability). If, for example, one wanted to establish the reliability of a haemoglobin measurement one would simply compare the results of multiple measurements from the same homogenised blood sample, but in assessment this is not as easy. Even the ‘parallel test’ does not help here, because this is, in most cases, hypothetical as well.
The next step, as a proxy for the parallel test, is to randomly divide the test into two halves and treat them as two parallel tests. The correlation between those two halves (corrected for test length) is then a good estimate of the ‘true’ test–retest correlation. This approach, however, is also fallible, because it is not certain whether this specific correlation is a good exemplar; perhaps another subdivision into two halves would have yielded a completely different correlation (and thus a different estimate of the test–retest correlation). One approach is to repeat the subdivision as often as possible until all possibilities are exhausted and use the mean correlation as a measure of reliability. That is quite some work, so it is simpler and more effective to subdivide the test into as many parts as possible (the individual items) and calculate the correlations between them. This approach is a measure of internal consistency and the basis for the famous Cronbach's alpha. It can be taken as the mean of all possible split-half reliability estimates (cf. e.g. Crocker & Algina 1986).
Although Cronbach's alpha is widely used, it is appropriate only from a norm-referenced (relative) perspective; used from a criterion-referenced perspective, it overestimates reliability. This is explained in Box 4.
Although Cronbach's alpha is widely used, it should be noted that it remains an estimate of the test–retest correlation, so it can only be used correctly if conclusions are drawn at the level of whether the rank orderings on the original and the parallel tests are the same, i.e. from a norm-referenced perspective. It does not take into account the difficulty of the items on the test, and because the difficulty of the items of a test influences the exact height of the score, using Cronbach's alpha from a criterion-referenced perspective overestimates the reliability of the test. This is explained in Box 4.
Although the notion of Cronbach's alpha is based on correlations, reliability estimates can range from 0 to 1. In rare cases, calculations could result in a value lower than zero, but this is then to be interpreted as being zero.
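As a rough illustration of the split-half and internal-consistency ideas described above, the following Python sketch computes one odd/even split-half estimate and Cronbach's alpha for a small score matrix. The score matrix, the choice of an odd/even split, the use of the Spearman-Brown formula as the correction for test length, and all function names are assumptions made for this example; none of them come from the Guide.

```python
from statistics import mean, variance

# Hypothetical 0/1 score matrix: one row per student, one column per item.
SCORES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    return cov / (ssx * ssy) ** 0.5


def split_half_reliability(scores):
    """Correlate the odd- and even-numbered item halves, then correct
    for test length with the Spearman-Brown formula."""
    odd = [sum(row[0::2]) for row in scores]
    even = [sum(row[1::2]) for row in scores]
    r = pearson(odd, even)
    return 2 * r / (1 + r)


def cronbach_alpha(scores):
    """Cronbach's alpha from item variances and total-score variance.
    Values below zero are reported as zero, as described in the text."""
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    alpha = k / (k - 1) * (1 - sum(item_vars) / total_var)
    return max(alpha, 0.0)


print(f"split-half (odd/even): {split_half_reliability(SCORES):.3f}")
print(f"Cronbach's alpha:      {cronbach_alpha(SCORES):.3f}")
```

A different split of the same items would generally give a different split-half value, which is the weakness of the single-split approach noted above; alpha does not depend on any particular split.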
Reliability must be judged in relation to the actual scores. Is a reliability of 0.90 always better than one of 0.75?
Although it is often helpful to have a measure of reliability that is normalised, in that for all data, it is always a number between 0 and 1, in some cases, it is also important to evaluate what the reliability means for the actual data. Is a test with a reliability of 0.90 always better than a test with a reliability of 0.75? Suppose we had the results of two tests and that both tests had the same cut-off score, for example 65%. The score distributions of both tests have a standard deviation (SD) of 5%, but the mean, minimum and maximum scores differ, as shown in Table 1.
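Based on these data, we can calculate a 95% confidence interval (95% CI) around each score or the cut-off score. For this, we need the standard error of measurement (SEM). In the beginning of this section, we showed the basic formula in CTT (observed score = true score + error). In CTT, the SEM is the SD of the error term or, more precisely put, the square root of the error variance. It is calculated as follows:
$$\mathrm{SEM} = \mathrm{SD} \times \sqrt{1 - \text{reliability}}$$
The cut-off score is the same (65%), but test 1 has a higher mean and a lower reliability, whereas test 2 has a lower mean and a higher reliability. Because its mean is high, only very few students on test 1 fall within the 95% CI despite the lower reliability; on test 2, with its lower mean, many students fall within the 95% CI despite the higher reliability. In other words, the test with the lower reliability carries the smaller risk of incorrect pass–fail decisions.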
If we use this formula, we find that in test 1, the SEM is 2.5% and in test 2, it is 1.58%. The 95% CIs are calculated by multiplying the SEM by 1.96. So, in test 1 the 95% CI is ±4.9% and in test 2 it is ±3.09%.
- In test 1, the 95% CI around the cut-off score ranges from 60.1% to 69.9%, but only a small proportion of the students' scores falls within this 95% CI.4 This means that for those students we are not able to conclude, with p ≤ 0.05, whether they have passed or failed the test.
- In test 2, the 95% CI ranges from 61.9% to 68.1%, but now many students fall within the 95% CI.
We use this hypothetical – though not unrealistic – example to illustrate that a higher reliability is not automatically better. To illustrate this further, Figure 1 presents a graphical representation of both tests.
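The arithmetic in this example can be checked with a short Python sketch. It assumes the standard CTT expression SEM = SD × √(1 − reliability), which reproduces the SEM values quoted above, and it assumes that test 1 is the test with reliability 0.75 and test 2 the one with reliability 0.90 (inferred from those SEM values). The function names are illustrative only.

```python
import math


def sem(sd, reliability):
    """Standard error of measurement under classical test theory."""
    return sd * math.sqrt(1.0 - reliability)


def ci95_around(score, sd, reliability):
    """Lower and upper bound of the 95% CI around a score (e.g. the cut-off)."""
    half_width = 1.96 * sem(sd, reliability)
    return score - half_width, score + half_width


CUT_OFF = 65.0  # cut-off score in percent, as in the example
SD = 5.0        # standard deviation of both score distributions, in percent

for label, reliability in (("test 1", 0.75), ("test 2", 0.90)):
    low, high = ci95_around(CUT_OFF, SD, reliability)
    print(f"{label}: SEM = {sem(SD, reliability):.2f}%, "
          f"95% CI around cut-off = {low:.1f}% to {high:.1f}%")
```

The sketch only gives the width of the interval. How many students fall inside it depends on the score distribution of each test, which is why the test with the narrower interval can still leave more pass–fail decisions uncertain.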
Generalisability theory
G-theory is not per se an extension to CTT but a theory on its own. It has different assumptions than CTT, some more nuanced, some more obvious. These are best explained using a concrete example. We will discuss G-theory here, using such an example.
When a group of 500 students sit a test, say a 200-item knowledge-based multiple-choice test, their total scores will differ. In other words, there will be variance between the scores. From a reliability perspective, the goal is to establish the extent to which these score differences are based on differences in ability of the students in comparison to other – unwanted – sources of variance. In this example, the variance that is due to differences in ability (in our example ‘knowledge’) can be seen as wanted or true score variance. Level of knowledge of students is what we want our test to pick up, the rest is noise – error – in the measurement. G-theory provides the tools to distinguish true or universe score variance from error variance, and to identify and estimate different sources of error variance. The mathematical approach to this is based on analysis of variance, which we will not discuss here. Rather, we want to provide a more intuitive insight into the approach and we will do this stepwise with some score matrices.
In Table 2, all students have obtained the same score (for reasons of simplicity, we have drawn a table of five test items and five candidates). From the total scores and the p-values, it becomes clear that all the variance in this matrix is due to systematic differences in items. Students collectively ‘indicate’ that item 1 is easier than item 2, and item 2 is easier than item 3, etc. There is no variance associated with students. All students have the same total score and they have collected their points on the same items. In other words, all variance here is item variance (I-variance).
Table 3 draws exactly the opposite picture. Here, all variance stems from differences between students. Items agree maximally as to the ability of the students. All items give each student the same marks, but their marks differ for all students, so the items make a consistent, systematic distinction between students. In the score matrix, all items agree that student A is better than student B, who in turn is better than student C, etc. So, here, all variance is student-related variance (person variance or P-variance).
Table 4 draws a more dispersed picture. For students A, B and C, items 1 and 2 are easy and items 3–5 difficult, and the reverse is true for students D and E. There seems to be a clearly discernible interaction effect between items and students. Such a situation could occur if, for example, items 1 and 2 are on cardiology and 3–5 on the locomotor system, and students A, B and C have just finished their clerkship in cardiology and the other students have just finished their orthopaedic surgery placements.
Of course, real life is never this simple, so matrix 5 (Table 5) presents a more realistic scenario: some variance can be attributed to systematic differences in item difficulty (I-variance), some to differences in student ability (P-variance) and some to the interaction effects (P × I-variance), which in this situation cannot be disentangled from general error (e.g. perhaps student D knew the answer to item 4 but was distracted or she/he misread the item).
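Generalisability is then determined by the portion of the total variance that is explained by the wanted variance (in our example, the P-variance). In a generic formula:
$$\text{coefficient} = \frac{\text{wanted variance}}{\text{wanted variance} + \text{error variance}}$$
Or in the case of our 200-item multiple-choice test example:5
$$\text{coefficient} = \frac{\sigma^2_P}{\sigma^2_P + \sigma^2_I/n_i + \sigma^2_{P \times I,e}/n_i}$$
where $\sigma^2_P$ is the P-variance, $\sigma^2_I$ the I-variance, $\sigma^2_{P \times I,e}$ the interaction variance confounded with general error, and $n_i$ the number of items.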
The example of the 200-item multiple-choice test is called a one-facet design. There is only one facet on which we wish to generalise, namely would the same students perform similarly if another set of items (another ‘parallel’ test) were administered. The researcher does not want to draw conclusions as to the extent to which another group of students would perform similarly on the same set of items. If the latter were the purpose, she/he would have to redefine what is wanted and what is error variance. In the remainder of this paragraph we will also use the term ‘factor’ to denote all the components of which the variance components are estimates (so, P is a factor but not a facet).
What is included in the error term of the formula above determines what kind of generalisation can be made.
If we are being somewhat more precise, the second formula is not always a correct translation of the first. The first deliberately does not call the denominator ‘total variance’, but ‘wanted’ and ‘error variance’. Apparently, the researcher has some freedom in deciding what to include in the error term and what not. This of course, is not a capricious choice; what is included in the error term defines what type of generalisations can be made.
If, for example, the researcher wants to generalise as to whether the rank ordering from best to most poorly performing student would be the same on another test, the I-variance does not need to be included in the error term (for a test–retest correlation, the systematic difficulty of the items or the test is irrelevant). For the example given here (which is a so-called P × I design), the generalisability coefficient without the I/ni term is equivalent to Cronbach's alpha.
The situation is different if the reliability of an exact score is to be determined. In that case, the systematic item difficulty is relevant and should be incorporated in the error term. This is the case in the second formula.
To distinguish between both approaches, the former (without the I-variance) is called ‘generalisability coefficient’ and the latter ‘dependability coefficient’. This distinction further illustrates the versatility of G-theory: when the researcher has a good overview of the sources of variance that contribute to the total variance, she/he can clearly distinguish the wanted from the unwanted sources of variance and compare them.
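To make the difference between the two coefficients concrete, the sketch below estimates the three variance components of a one-facet P × I design from the mean squares of a two-way layout and then computes both coefficients. The score matrix is hypothetical, the function names are illustrative, and the mean-square estimators are the usual ones for a fully crossed design with one observation per cell, used here as an assumption since the Guide does not discuss the underlying analysis of variance. The last line checks numerically that the generalisability coefficient coincides with Cronbach's alpha for this design.

```python
from statistics import mean, variance

# Hypothetical P x I score matrix (rows = students, columns = items, 1 = correct).
SCORES = [
    [1, 1, 1, 1, 0],
    [1, 1, 1, 0, 0],
    [1, 1, 0, 1, 0],
    [1, 0, 1, 0, 0],
    [0, 1, 0, 0, 0],
]


def variance_components(scores):
    """Estimate P, I and (P x I, error) variance components from mean squares.
    Estimates can come out negative in small samples; they are not clamped here."""
    n_p, n_i = len(scores), len(scores[0])
    grand = mean(x for row in scores for x in row)
    person_means = [mean(row) for row in scores]
    item_means = [mean(row[j] for row in scores) for j in range(n_i)]
    ms_p = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_i = n_p * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ss_res = sum(
        (scores[p][j] - person_means[p] - item_means[j] + grand) ** 2
        for p in range(n_p)
        for j in range(n_i)
    )
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))
    return {"p": (ms_p - ms_res) / n_i, "i": (ms_i - ms_res) / n_p, "pi_e": ms_res}


def generalisability(vc, n_items):
    """Relative coefficient: item difficulty is left out of the error term."""
    return vc["p"] / (vc["p"] + vc["pi_e"] / n_items)


def dependability(vc, n_items):
    """Absolute coefficient: item difficulty is part of the error term."""
    return vc["p"] / (vc["p"] + vc["i"] / n_items + vc["pi_e"] / n_items)


def cronbach_alpha(scores):
    k = len(scores[0])
    item_vars = [variance([row[j] for row in scores]) for j in range(k)]
    total_var = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_vars) / total_var)


vc = variance_components(SCORES)
n_items = len(SCORES[0])
print("variance components:", vc)
print(f"generalisability: {generalisability(vc, n_items):.3f}")
print(f"dependability:    {dependability(vc, n_items):.3f}")
print(f"Cronbach's alpha: {cronbach_alpha(SCORES):.3f}")
```

Because the dependability coefficient adds the I-variance to the error term, it can never exceed the generalisability coefficient when the variance estimates are non-negative.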
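The same versatility holds for the calculation of the SEM. As discussed in the section on CTT the SEM is the SD of the error term, so in a generalisability analysis it can be calculated as the square root of the error variance components, so either
$$\mathrm{SEM} = \sqrt{\sigma^2_{P \times I,e}/n_i}$$
when the I-variance is left out of the error term, or
$$\mathrm{SEM} = \sqrt{\sigma^2_I/n_i + \sigma^2_{P \times I,e}/n_i}$$
when it is included. Here $\sigma^2_I$ is the I-variance, $\sigma^2_{P \times I,e}$ the interaction variance confounded with general error, and $n_i$ the number of items.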
In this example the sources of variance are easy to understand, because there is in fact one facet, but more complicated situations can occur. In an OSCE with two examiners per station, things already become more complicated.
- First, there is a second facet (the universe of possible examiners) on top of the first (the universe of possible stations).
- Second, there is crossing and nesting.
A crossed design is most intuitive to understand. The multiple-choice example is a completely crossed design (P × I, the ‘×’ indicating the crossing), all items are seen by all students. Nesting occurs when certain ‘items’ of a factor are only seen by some ‘items’ of another factor. This is a cryptic description, but the illustration of the OSCE may help. The pairs of examiners are nested within each station. It is not the same two examiners who judge all stations for all students, but examiners A and B are in station 1, C and D in station 2, etc. The examiners are crossed with students (assuming that they remain the same pairs throughout the whole OSCE), because they have judged all students, but they are not crossed with all stations as A and B have only examined in station 1, etc. In this case examiner pairs are nested within stations.
There is a second part to the analyses in a generalisability analysis, namely the decision study or D-study. You may have noticed in the second formula that both the I-variance and the interaction term are divided by ni. This indicates that the variance component is divided by the number of elements in the factor (in our example the number of items in the I-variance) and that the terms in the formula are the mean variances per element in the factor (the mean item variance). From this, it is relatively straightforward to extrapolate what the generalisability or dependability would have been if the numbers were to change (e.g. what would the dependability be if the number of items on the test were twice as high, or which is more efficient: using two examiners per OSCE station or having more stations with only one examiner?), just by inserting another value for the number(s) of elements. Although it may seem very simple, one word of caution is needed: such extrapolations are only as good as the original variance component estimates. The higher the number of original observations, the better the extrapolation. In our example, we had 200 items on the test and 500 students taking it, and it is obvious that this leads to better estimates and thus better extrapolations than 50 students sitting a 20-item test.
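A D-study for the one-facet example amounts to re-evaluating the coefficients for different numbers of items. The sketch below does this for a set of made-up variance components; the numbers and the function name are assumptions for illustration and are not estimates from any real test.

```python
def d_study(var_p, var_i, var_pi_e, n_items):
    """Return (generalisability, dependability) for a test of n_items items,
    given per-item variance components from a P x I design."""
    relative_error = var_pi_e / n_items
    absolute_error = (var_i + var_pi_e) / n_items
    g = var_p / (var_p + relative_error)
    phi = var_p / (var_p + absolute_error)
    return g, phi


# Hypothetical variance components (not taken from the Guide).
VAR_P, VAR_I, VAR_PI_E = 0.01, 0.02, 0.20

for n in (50, 100, 200, 400):
    g, phi = d_study(VAR_P, VAR_I, VAR_PI_E, n)
    print(f"{n:4d} items: generalisability = {g:.3f}, dependability = {phi:.3f}")
```

Both coefficients rise as items are added, with diminishing returns. The caution in the text still applies: the projection is only as trustworthy as the variance components fed into it.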
Item response theory
A disadvantage shared by CTT and G-theory is that they cannot separate test difficulty effects from candidate group effects. A low score on a test may arise because the items are very difficult or because the group of candidates is of low ability. IRT tries to solve this by estimating item difficulty independently of student ability, and student ability independently of item difficulty.
Both CTT and G-theory have a common disadvantage. Neither theory has methods to disentangle test difficulty effects from candidate group effects. If a score on a set of items is low, this can be the result of a particularly difficult set of items or of a group of candidates who are of particularly low ability level. Item response theories try to overcome this problem by estimating item difficulty independent of student ability, and student ability independent of item difficulty.
In CTT, difficulty is expressed as the p-value, the proportion of students who answered the item correctly. Indices such as Rit and Rir show the correlation between performance on a given item and performance on the whole test or on the remaining items. If a different group took the same test, or if an item were reused in another test, these values would differ. In IRT, the candidates' responses to each individual item are modelled given their ability.
Before we can explain this, we have to go back to CTT again. In CTT, item difficulty is indicated by the so-called p-value, the proportion of candidates who answered the item correctly, and discrimination by indices such as point biserials, Rit (item-total correlation) or Rir (item-rest correlation), all of which are measures that correlate the performance on an item with the performance on the total test or the rest of the items. If in these cases a different group of candidates (of different mean ability) were to take the test, the p-values would be different, and if an item were re-used in a different test, all discrimination indices would be different. With IRT, the responses of the candidates to each individual item on the test are modelled, given their ability.
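The CTT item parameters mentioned here are simple to compute. The following sketch calculates the p-value, Rit and Rir for every item of a hypothetical score matrix and then recomputes one p-value in a higher-ability subgroup to show how the value depends on who takes the test. The data, the subgroup and the function names are assumptions for illustration.

```python
from statistics import mean

# Hypothetical 0/1 score matrix: one row per candidate, one column per item.
SCORES = [
    [1, 1, 1, 1, 1, 1],
    [1, 1, 1, 1, 1, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0],
    [1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
]


def pearson(x, y):
    """Pearson correlation; returns NaN when either list has no variance."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    ssx = sum((a - mx) ** 2 for a in x)
    ssy = sum((b - my) ** 2 for b in y)
    if ssx == 0 or ssy == 0:
        return float("nan")
    return cov / (ssx * ssy) ** 0.5


def item_statistics(scores):
    """p-value, item-total (Rit) and item-rest (Rir) correlation per item."""
    totals = [sum(row) for row in scores]
    stats = []
    for j in range(len(scores[0])):
        item = [row[j] for row in scores]
        rest = [t - x for t, x in zip(totals, item)]
        stats.append({
            "p": mean(item),
            "rit": pearson(item, totals),
            "rir": pearson(item, rest),
        })
    return stats


for number, s in enumerate(item_statistics(SCORES), start=1):
    print(f"item {number}: p = {s['p']:.2f}, Rit = {s['rit']:.2f}, Rir = {s['rir']:.2f}")

# The same item looks easier when only the three strongest candidates sit it.
strong_group = SCORES[:3]
print("item 3, all candidates:", item_statistics(SCORES)[2]["p"])
print("item 3, strong group:  ", item_statistics(strong_group)[2]["p"])
```

In the strong subgroup item 3 is answered correctly by everyone, so its correlations are undefined there; this is the group dependence of CTT item parameters that IRT tries to remove.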
Such modelling requires several assumptions.
Such modelling cannot be done without making certain assumptions.
- The first assumption is that the ability of the candidates is uni-dimensional and
- the second is that all items on a test are locally independent except for the fact that they measure the same (uni-dimensional) ability. If, for example, a test contained an item asking for the most probable diagnosis in a case and a second asking for the most appropriate therapy, these two items would not be locally independent; if a candidate answers the first item incorrectly, she/he will most probably answer the second one incorrectly as well.
- The third assumption is that modelling can be done through an item response function (IRF) indicating that for every position on the curve, the probability of a correct answer increases with a higher level of ability. The biggest advantage of IRT is that difficulty and ability are modelled on the same scale. IRFs are typically graphically represented as an ogive, as shown in Figure 2.
Modelling requires data, so pre-testing is needed before modelling can be performed.
Modelling cannot be performed without data. Therefore pre-testing is necessary before modelling can be performed. The results on the pre-test are then used to estimate the IRF. For the purpose of this AMEE Guide, we will not go deeper into the underlying statistics but for the interested reader some references for further reading are included at the end.
Three levels of modelling are possible.
Three levels of modelling can be applied, conveniently called one-, two- and three-parameter models.
- A one-parameter model distinguishes items only on the basis of their difficulty, or the horizontal position of the ogive. Figure 3 shows three items with three different positions of the ogive. The curve on the left depicts the easiest item of the three in this example; it has a higher probability of a correct answer with lower abilities of the candidate. The rightmost curve indicates the most difficult item. In this one-parameter modelling, the forms of all curves are the same, so their power to discriminate (equivalent to the discrimination indices of CTT) between students of high and low abilities is the same.
- A two-parameter model includes this discriminatory power (on top of the difficulty). The curves for different items not only differ in their horizontal position but also in their steepness. Figure 4 shows three items with different discrimination (different steepness of the slopes). It should be noted that the curves do not only differ in their slopes but also in their positions, as they differ both in difficulty and in discrimination (if they would only differ in slopes, it would be a sort of one-parameter model again).
- A three-parameter model includes the possibility that a candidate with extremely low ability (near-to-zero ability) still produces the correct answer, for example through random guessing. The third parameter determines the offset of the curve or more or less its vertical position. Figure 5 shows three items differing on all three parameters.
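The three models can be written as one function with optional parameters. The sketch below uses the logistic form of the item response function, which is a common choice but an assumption here, since the Guide only describes the curves graphically. The parameter names `a` (discrimination), `b` (difficulty) and `c` (lower asymptote for guessing) and the example items are likewise illustrative.

```python
import math


def irf(theta, b, a=1.0, c=0.0):
    """Probability of a correct answer for a candidate of ability theta.

    b: difficulty (horizontal position of the curve)
    a: discrimination (steepness); fixed at 1.0 in a one-parameter model
    c: lower asymptote (vertical offset, e.g. guessing); 0.0 unless three-parameter
    """
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))


# Three hypothetical items, one per model.
one_parameter = dict(b=-1.0)
two_parameter = dict(b=0.0, a=2.0)
three_parameter = dict(b=1.0, a=1.5, c=0.25)

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(
        f"ability {theta:+.1f}: "
        f"1PL = {irf(theta, **one_parameter):.2f}, "
        f"2PL = {irf(theta, **two_parameter):.2f}, "
        f"3PL = {irf(theta, **three_parameter):.2f}"
    )
```

Because difficulty `b` and ability `theta` sit on the same scale, a candidate whose ability equals an item's difficulty has a probability of 0.5 of answering it correctly in the one- and two-parameter models, and halfway between `c` and 1 in the three-parameter model.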
Roughly 200–300 responses are needed for one-parameter modelling and about 1000 responses for a three-parameter model.
As said before, pre-testing is needed for parameter estimation and logically there is a relationship between the number of parameters and the number of candidate responses needed for good estimates: the more parameters have to be estimated, the higher the number of responses needed. As a rule of thumb, 200–300 responses would be sufficient for one-parameter modelling, whereas a three-parameter model would require roughly 1000 responses. Typically, large testing bodies that employ IRT mix items to be pre-tested with regular items, without the candidates knowing which item is which. But it is obvious that such requirements, in combination with the complicated underlying statistics and strong assumptions, limit the applicability of IRT in various situations. It will be difficult for a small-to-medium-sized faculty to produce enough pre-test data to yield acceptable estimates, and, in such cases, CTT and G-theory will have to do.
IRT is a powerful theory of reliability.
On the other hand, IRT must be seen as the strongest theory in reliability of testing, enabling possibilities that are impossible with CTT or G-theory. One of the ‘eye-catchers’ in this field is computer-adaptive testing (CAT). In this approach, each candidate is presented with an initial small set of items. Depending on the responses, his/her level of ability is estimated, and the next item is selected to provide the best additional information as to the candidate's ability and so on. In theory – and in practice – such an approach reduces the SEM for most if not all students. Several methods can be used to determine when to stop and end the test session for a candidate. One would be to administer a fixed number of items to all candidates. In this case, the SEM will vary between candidates but will most probably be lower for most of the candidates than with an equal number of items with traditional approaches (CTT and G-theory). Another solution is to stop when a certain level of certainty (a certain SEM) is reached. In this case, the number of items will vary per candidate. But apart from CAT, IRT will mostly be used for test equating, in such situations where different groups of candidates have to be presented with equivalent tests.
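The CAT loop described above can be sketched in a few dozen lines. This is a simplified illustration, not a description of any operational system: it assumes a two-parameter logistic model, a made-up item bank, a standard normal prior with a grid-based posterior mean as the ability estimate, the posterior SD as the SEM, and selection of the unused item with the highest information at the current estimate. It also starts from the prior rather than from an initial small set of items. All names and numbers are hypothetical.

```python
import math
import random


def p_correct(theta, a, b):
    """Two-parameter logistic probability of a correct answer."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


def item_information(theta, a, b):
    """Information an item provides about ability at theta (2PL model)."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)


GRID = [g / 10.0 for g in range(-40, 41)]  # ability values from -4 to 4


def estimate_ability(responses):
    """Posterior mean and SD of ability given (a, b, correct) responses."""
    weights = []
    for theta in GRID:
        w = math.exp(-0.5 * theta * theta)  # unnormalised standard normal prior
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            w *= p if correct else 1.0 - p
        weights.append(w)
    total = sum(weights)
    est = sum(t * w for t, w in zip(GRID, weights)) / total
    var = sum((t - est) ** 2 * w for t, w in zip(GRID, weights)) / total
    return est, math.sqrt(var)


def run_cat(bank, answer, target_sem=0.4, max_items=30):
    """Administer items until the SEM target or the item limit is reached."""
    remaining = list(bank)
    responses = []
    theta, sem = estimate_ability(responses)
    history = [sem]
    while remaining and len(responses) < max_items and sem > target_sem:
        item = max(remaining, key=lambda it: item_information(theta, it[0], it[1]))
        remaining.remove(item)
        responses.append((item[0], item[1], answer(item)))
        theta, sem = estimate_ability(responses)
        history.append(sem)
    return theta, sem, len(responses), history


def simulated_candidate(true_theta, seed=0):
    """Return a function that answers items like a candidate of given ability."""
    rng = random.Random(seed)
    return lambda item: rng.random() < p_correct(true_theta, item[0], item[1])


# Hypothetical bank of 61 pre-calibrated items as (discrimination, difficulty).
BANK = [(1.0 + 0.5 * (k % 3), -3.0 + 0.1 * k) for k in range(61)]

theta, sem, n_used, history = run_cat(BANK, simulated_candidate(1.0))
print(f"estimated ability {theta:.2f}, SEM {sem:.2f}, items used {n_used}")
```

Setting `target_sem` to 0 and lowering `max_items` gives the fixed-length stopping rule mentioned in the text, where the SEM varies between candidates; the default settings illustrate the variable-length rule, where the number of items varies instead.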
Recommendations
The three theories – CTT, G-theory and IRT – seem to co-exist. This is an indication that there is good use for each of them depending on the specific test, the purpose of the assessment and the context in which the assessment takes place. Some rules of thumb may be useful.
- CTT is helpful in straightforward assessment situations such as the standard open-ended or multiple choice test. In CTT, item parameters such as p-values and discrimination indices can be calculated quite simply with most standard statistical software packages. The interpretation of these item parameters is not difficult and can be taught easily. Reliability estimates, such as Cronbach's alpha, however, are based on the notion of test–retest correlation. Therefore, they are most suitable for reliability estimates from a norm-orientated perspective and not from a domain-orientated perspective. If they are used in the latter case, they will be an overestimation of the actual reproducibility.
- G-theory is more flexible in that it enables the researcher to include or exclude sources of variance in the calculations. This presupposes that the researcher has a good understanding of the meaning of the various sources of variance and the way they interact with each other (nested versus crossed), but also how they represent the domain. The original software for these analyses is quite user unfriendly and requires at least some knowledge of older programming languages such as Fortran (e.g. UrGENOVA; http://www.education.uiowa.edu/casma/GenovaPrograms.htm, last access 17 December 2010). Variance component estimates can be done with SPSS, but the actual g-analysis would still have to be done by hand. Some years ago, two researchers at McMaster wrote a graphical shell around UrGenova to make it more user friendly (http://fhsperd.mcmaster.ca/g_string/download.html, accessed 17 December 2010). Using this shell spares the user from having to know and employ a difficult syntax. Nevertheless, it still requires a good understanding of the concept of G-theory. In all cases where there is more than one facet of generalisation (as in the example with the two examiners per station in an OSCE), G-theory has a clear advantage over CTT. In CTT multiple parameters have to be used and somehow combined (in this OSCE, Cronbach's alpha and Cohen's Kappa or an ICC for inter-observer agreement), whereas in the generalisability analysis both facets are incorporated. If a one-facet situation exists (like the multiple choice examination) from a domain-orientated perspective (e.g. with an absolute pass–fail score), a dependability coefficient is a better estimate than CTT.
- IRT should only be used if people with sufficient understanding of the statistics and the underlying concepts are part of the team. Furthermore, considerably large item banks are needed and pre-testing on a sufficient number of candidates must be possible. This limits the routine applicability of IRT in all situations other than large testing bodies, large schools or collaboratives.
Emerging theories
Most of the emerging theories relate to the shift in perspective from ‘assessment of learning’ to ‘assessment for learning’. Although this is not in itself a change of theory, the change of view has brought in new theories or extensions of existing ones.
Although we by no means possess a crystal ball, we see some new theories or extensions to existing theories emerging. Most of these are related to the changing views from (exclusively) assessment of learning to more assessment for learning. Although this in itself is not a theory change but more a change of views on assessment, it does lead to the incorporation of new theories or extensions to existing ones.
First, we need to know what ‘assessment for learning’ is. The traditional view is epitomised by the summative end-of-course examination. This approach is common worldwide, yet dissatisfaction with it is growing in the educational context. Such assessment does not keep pace with changing learning environments, and ‘purely selective tests’ can be compared to screening procedures in medicine. They may be good at deciding whether students who fall short of the required competence should graduate, but they give no information on how a student who is not yet competent can become competent, nor on how each student can become the best possible doctor. Just as it is not screening itself but well-tailored diagnosis and treatment that make patients better, assessment of learning by itself does not improve learning; only assessment for learning can.
First, however, it might be helpful to explain what assessment for learning entails. For decades, our thinking about assessment has been dominated by the view that assessment's main purpose is to determine whether a student has successfully completed a course or a study. This is epitomised in the summative end-of-course examination. The consequences of such examinations were clear; if she/he passes, the student goes on and does not have to look back; if she/he fails, on the other hand, the test has to be repeated or (parts of) the course has to be repeated. Successful completion of a study was basically a string of passing individual tests. We draw – deliberately – somewhat of a caricature, but in many cases, this is the backbone of an assessment programme. Such an approach is not uncommon and is used at many educational institutes in the world, yet there is a growing dissatisfaction in the educational context. Some discrepancies and inconsistencies are felt to be increasingly incompatible with learning environments. These are probably best illustrated with an analogy. Purely selective tests are comparable in medicine to screening procedures (e.g. for breast cancer or cervical cancer). They are highly valuable in ensuring that candidates lacking the necessary competence do not graduate (yet), but they do not provide information as to how an incompetent candidate can become a competent one, or how each student can become the best possible doctor she/he could be. Just as screening does not make the patients better, but tailored diagnostic and therapeutic interventions do, assessment of learning does not help much in improving the learning but assessment for learning can.
We will mention the most striking discrepancies between assessment of and assessment for learning.
- The central aim of a curriculum is for students to study hard and learn as much as possible, so assessment should serve that aim. Assessment that focuses only on picking out the sufficiently competent students cannot achieve it.
A central purpose of the educational curriculum is to ensure that students study well and learn as much as they can; so, assessment should be better aligned with this purpose. Assessment programmes that focus almost exclusively on the selection between the sufficiently and insufficiently competent students do not reach their full potential in steering student learning behaviour.
- In ‘assessment of learning’ the question is ‘Is A better than B?’. CTT and G-theory cannot calculate reliability when there are no differences between students. In ‘assessment for learning’ the question is ‘Is A better today than A was yesterday?’. This also gives meaning to a continual striving for excellence, because once every student reaches ‘excellent’, ‘excellent’ becomes mediocre again. In ‘assessment for learning’ one naturally also asks whether A's and B's progress is sufficient.
If the principle of assessment of learning is exclusively used, the question all test results need to answer is: is John better than Jill?, where the pass–fail score is more or less one of the possible ‘Jills’. Typically CTT and G-theory cannot calculate test reliability if there are no differences between students. A test–retest correlation does not exist if there is no variance in scores, and generalisability cannot be calculated if there is no person variance. The central question in the views of assessment for learning is therefore: Is John today optimally better than he was yesterday, and is Jill today optimally better than she was yesterday? This also gives more meaning to the desire to strive for excellence, because now excellence is defined individually rather than on the group level (if everybody in the group is excellent, ‘excellent’ becomes mediocre again). It goes without saying that in assessment for learning, the question whether John's and Jill's progress is good enough needs to be addressed as well.
- A more difficult notion is that in ‘assessment of learning’, generalisation or prediction rests on uniformity: predictions and generalisations are made on the basis that all students sit the same examination under the same circumstances. In ‘assessment for learning’ prediction remains important, but the choice of assessment is more diagnostic, with the flexibility to choose methods according to the specific characteristics of the student. This resembles CAT, or the diagnostic thinking of a clinician who tailors specific additional diagnostics to the patient.
A difficult and more philosophical result of the previous point is that the idea of generalisation or prediction (how well will John perform in the future based on the test results of today) in an assessment of learning is mainly based on uniformity. It states that we can generalise and predict well enough if all students sit the same examinations under the same circumstances. In the assessment for learning view, prediction is still important but the choice of assessment is more diagnostic in that there should be room for sufficient flexibility to choose the assessment according to the specific characteristics of the student. This is analogous to the idea of (computer) adaptive testing or the diagnostic thinking of the clinician, tailoring the specific additional diagnostics to the specific patient.
- In ‘assessment of learning’, what matters is developing the best assessment method. From this viewpoint, the ideal assessment programme uses only the ‘best’ instrument for each medical competence: for example, multiple-choice items for knowledge, OSCEs for skills and long simulations for problem-solving ability. In ‘assessment for learning’, by contrast, various instruments are used to gather various kinds of information, and it is important to answer the following three questions.
In the assessment of learning view, developments are focussed more on the development (or discovery) of the optimal instrument for each aspect of medical competence. The typical example of this is the OSCE for skills. In this view, an optimal assessment programme would incorporate only the best instrument for each aspect of medical competence. Typically, such a programme would look like this: multiple-choice tests for knowledge, OSCEs for skills, long simulations for problem-solving ability, etc. From an assessment for learning view, information needs to be extracted from various instruments and assessment moments to optimally answer the following three questions:
- Do I have enough information to draw the complete picture of this particular student or do I need specific additional information? (the ‘diagnostic’ question)
- Which educational intervention is most indicated for this student at this moment? (the ‘therapeutic’ question)
- Is this student on the right track to become a competent professional on time? (the ‘prognostic’ question)
- 단일한 혹은 소수의 평가로만 위의 질문에 답을 할 수는 없을 것이다. 평가프로그램이 필요하며, 각각의 장점과 단점이 있는 다양한 평가법이 필요하며 이는 의사가 다양한 진단적 도구를 활용할 수 있는 것과 마찬가지다. 이 도구들은 양적일 수도 있고 질적일 수도 있으며, 더 객관적일수도, 주관적일수도 있다. 비유를 좀 더 해보자면 만약에 의사가 환자의 Hb 수치 오더를 내리면 단순히 객관적인 수치를 알고 싶은 것일 수 있다. 그러나 한편으로 의사는 병리학자에게 특정 숫자가 아니라 서술적 판단을 요청할 수도 있다. 유사하게 평가 프로그램도 양적, 질적 요소를 다 갖출 수 있다.
It follows logically from the previous point that this cannot be accomplished with one single assessment method or even with only a few. A programme of assessment is needed instead, incorporating a plethora of methods, each with its own strengths and weaknesses, much like the diagnostic armamentarium of a clinician. These can be qualitative or quantitative, more ‘objective’ or more ‘subjective’. To draw the clinical analogy further: if a clinician orders an haemoglobin level of a patient she/he does not want the laboratory analyst's opinion but the mere ‘objective’ numerical value. If, on the other hand, she/he asks a pathologist, s/he does not expect a number but a narrative (‘subjective’) judgement. Similarly, such a programme of assessment will consist of both qualitative and quantitative elements.
Much of this theory still needs to be developed, but parts can be borrowed from theories in other fields.
Much of the theory to support the approach of assessment for learning still needs to be developed. Parts can be adapted from theories in other fields; parts need to be developed within the field of health professions assessment research. We will briefly touch on some of these.
- What determines the quality of an assessment programme? One important point is that in a good assessment programme the whole must be more than the sum of its parts. How the parts should be combined to achieve this is another matter.
What determines the quality of assessment programmes? It is one thing to state that in a good assessment programme the total is more than the sum of its constituent parts, but it is another to define how these parts have to be combined in order to achieve this. Emerging theories describe a basis for the definition of quality. Some adopt a more ideological approach (Baartman 2008) and some a more utilitarian ‘fitness-for-purpose’ view (Dijkstra et al. 2009).
- Ideological approach: quality depends on how close the assessment programme comes to an 'ideal'.
In the former, quality is defined as the extent to which the programme is in line with an ideal (much like formerly quality of an educational programme was defined in terms of whether it was PBL or not).
- Fitness-for-purpose approach: quality is defined by the clearly stated goals of the programme, and each part should be optimised to achieve those goals.
In the latter, quality is defined in terms of a clear definition of the goals of the programme and whether all parts of the programme optimally contribute to the achievement of these goals. This approach is more flexible in that it would allow for an evaluation of the quality of assessment of learning programmes as well.
At this moment, theories about the quality of assessment programmes are being developed and researched (Dijkstra et al. 2009, submitted 2011).
- How does assessment influence learning? There appears to be broad consensus that it does, but not much research has been done.
How does assessment influence learning? Although there seems to be complete consensus that it does, not much empirical research has been performed in this area. For example, many of the intuitive ideas and uses of this notion are strongly behaviouristic in nature and do not incorporate motivational theories very well. The research, especially in health professions education, is focussed either on the test format (Hakstian 1971; Newble et al. 1982; Frederiksen 1984) or on the opinions of students (Stalenhoef-Halling et al. 1990; Scouller 1998). Currently, new theories are emerging that incorporate motivational theories and describe better which factors of an assessment programme influence learning behaviour, how they do that and what the possible consequences of these influences are (Cilliers et al. submitted 2010, 2010).
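- Test-enhanced learning has been discussed recently. From expertise theories, it is reasonable to expect that sitting a test in itself helps the storage, retention and retrieval of knowledge in several ways. However, little is known about how to exploit this within a programme of assessment, particularly for assessment for learning.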
The phenomenon of test-enhanced learning has been discussed recently (Larsen et al. 2008). From expertise theories it is logical to assume that from sitting a test, as a strong motivator to remember what was learned, the existing knowledge is not only more firmly stored in memory, but also reorganised from having to produce and apply it in a different context. This would logically lead to better storage, retention and more flexible retrieval. Yet we know little about how to use this effect in a programme of assessment especially with the goal of assessment for learning.
- What makes feedback work? There are indications that giving feedback together with a summative decision reduces its value, but little is known about which factors are responsible.
What makes feedback work? There are indications that the provision of feedback in conjunction with a summative decision limits its value, but there is little known about which factors contribute to this. Currently, research focusses not only on the written combination of summative decisions and formative feedback, but also on the combination of a summative and a formative role within one person. This research is greatly needed, as in many assessment programmes it is neither always possible nor always desirable to separate the teacher and assessor roles.
- At the level of an assessment programme, human judgement is unavoidable. Psychology holds that human judgement is more error-prone than actuarial methods, for a variety of reasons.
In a programme of assessment the use of human judgement is indispensable. This is so not only for the judgement of more elusive aspects of medical competence, such as professionalism, reflection, etc., but also because there are many situations in which a prolonged one-on-one teacher-student relationship exists, as is for example the case in long integrated placements or clerkships. From psychology it has long been known that human judgement is fallible when compared to actuarial methods (Dawes et al. 1989). There are many biases that influence the accuracy of the judgement.
- The most well-known are primacy, recency and halo effects (for a more complete overview, cf. Plous 1993).
- A primacy effect indicates that the first impression (e.g. in an oral examination) often dominates the final judgement unduly;
- A recency effect indicates the opposite, namely that the last impressions largely determine the judgement. There is good indication that the length of the period between the observation and the making of the judgement determines whether the primacy or the recency effect is the more prominent.
- The halo effect pertains to the inability of people to judge different aspects of someone's performance and demeanour fully independently during one observation, so they all influence each other.
- Other important sources of bias are cognitive dissonance, fundamental attribution error, ignoring base rates, confirmation bias. All have their specific influences on the quality of the judgement. As such, these theories shed a depressing light on the use of human judgement in (high-stakes) assessment.
- Despite these theories, there are ways to reduce the biases that arise in human judgement. Theories of naturalistic decision making do not ask why human decisions are less accurate than clear-cut, number-based ones. They ask why people perform so well on ill-defined problems, with insufficient information and in less than ideal situations. Storing experiences, learning from them and possessing situation-specific scripts appear to play an important role, and much rests on quick pattern recognition and matching.
Yet, from these theories and the studies in this field, there are also good strategies to mitigate such biases. Another theoretical pathway which is useful is the one on naturalistic decision making (Klein 2008; Marewski et al. 2009). This line of research does not focus on why people are such poor judges when compared to clear-cut and number-based decisions, but on why people still do such a good job when faced with ill-defined problems with insufficient information and often under less than ideal situations. Storage of experiences, learning from experiences and the possession of situation-specific scripts seem to play a pivotal role here, enabling the human to employ a sort of expertise-type problem solving. Much is based on quick pattern recognition and matching.
- Both theoretical pathways deal with human approaches that work from only a limited representation of what was observed. A growing body of research reports similarities between how medical experts reason clinically and make diagnoses and how assessors judge a student's performance.
Both theoretical pathways have commonality in that they both describe human approaches that are based on a limited representation of the actual observation. When, as an example, a primacy effect occurs, the judge is in fact reducing information to be able to handle it better, but when the judge uses a script, she/he is also reducing the cognitive load by a simplified model of the observation. Current research increasingly shows parallels between what is known about medical expertise, clinical reasoning and diagnostic performance and the act of judging a student's performance in an assessment setting. The parallels are such that they most probably have important consequences for our practices of teacher training.
- The theory needed to explain the above is cognitive load theory (CLT). CLT starts from the observation that human working memory is limited: it can hold only a small number of elements for a short time. CLT distinguishes three kinds of cognitive load: intrinsic, extraneous and germane. Intrinsic load arises from the inherent complexity of the task. Extraneous load relates to all the information that has to be processed but is not directly relevant to the task. On CLT grounds, starting the medical curriculum straight away in an authentic setting is not advisable: authenticity may seem helpful, but the extraneous load becomes so high that it consumes the resources needed for learning (the germane load).
An important underlying theory to explain the previous point is cognitive load theory (CLT) (Van Merrienboer & Sweller 2005, 2010). CLT starts from the notion that the human working memory is limited in that it can only hold a low number of elements (typically 7 ± 2) for a short period of time. Much of this we already discussed in the paragraphs on expertise. CLT builds on this as it postulates that cognitive load consists of three parts: intrinsic, extraneous and germane load.
- Intrinsic load is generated by the innate complexity of the task. This has to do with the number of elements that need to be manipulated and the possible combinations (element interactivity).
- Extraneous load relates to all information that needs to be processed yet is not directly relevant for the task. If, for example, we would start the medical curriculum by placing the learners in an authentic health care setting and require them to learn from solving real patient problems, CLT states that this is not a good idea. The authenticity may seem helpful, but it distracts: the cognitive resources needed to deal with all the practical aspects would constitute a high extraneous load, even to such an extent that it would minimise the resources left for learning (the germane load).
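Intrinsic cognitive load
Intrinsic cognitive load is the difficulty and complexity inherent in the learning material or task itself. For learning material with high element interactivity, acquiring the concepts and understanding the relationships among them can reduce the load on working memory [2]. Intrinsic cognitive load can be relative to the difficulty of the learning and can therefore also be said to depend on the learner's prior knowledge [3].
Extraneous cognitive load
Extraneous cognitive load stems not from the difficulty of the learning task itself but from the instructional strategy, such as the learning method or the way material is presented, and it can therefore be reduced by better instructional design. Extraneous cognitive load is, however, affected by intrinsic cognitive load (Kim & Kim 2004). That is, if the task has low intrinsic cognitive load, then even when poor instructional design generates extraneous load, the total stays within the capacity of working memory and solving the problem poses no difficulty. [1]
Germane cognitive load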
- Finally, new models are being developed and old ones rediscovered.
Finally, new psychometric models are being developed and old ones are being rediscovered. From a programme of assessment view, it is clear that, with many instruments in the programme, no single psychometric model will be useful for all of its elements. In the 1960s and 1970s, some work was done on domain-orientated reliability approaches (Popham and Husek 1969; Berk 1980). In currently widely used methods, internal consistency (for example, Cronbach's alpha) is often taken as the best proxy for reliability or universe generalisation, but one can wonder whether this is the best approach in all situations. Most standard psychometric approaches do not handle a changing object of measurement very well. By this we mean that the students, hopefully, change under the influence of the learning programme. In a longer placement, for example, the results of repeatedly scored observations (for instance, repeated mini-CEX) will differ, with part of this variance due to the student's learning and part to measurement error (Prescott-Clements et al. submitted 2010). Current approaches do not provide easy strategies to distinguish between the two effects. Internal consistency is a good approach to reliability when stability of the object of measurement and of the construct can reasonably be expected; it is problematic when this is not the case. The domain-orientated approaches were therefore not focussed primarily on internal consistency but on the probability that a new observation would shed new and unique light on the situation, much like the clinical adage never to ask for additional diagnostics if the results are unlikely to change the diagnosis and/or the management of the disease. As said above, these methods are being rediscovered and new ones are being developed, not to replace the existing theories but to complement them.
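To make the internal-consistency idea concrete, the sketch below computes Cronbach's alpha from a small score table using the standard formula, alpha = k/(k-1) × (1 − sum of item variances / variance of total scores). It is only an illustration and is not taken from the Guide: the function name `cronbach_alpha`, the layout (one row per student, one column per item or observation) and the example scores are all assumptions.

```python
from statistics import variance


def cronbach_alpha(scores):
    """Return Cronbach's alpha for a table of scores.

    `scores` is a list of rows, one row per student and one column per
    item (or per repeated observation). Hypothetical helper, shown only
    to illustrate the standard formula.
    """
    n_items = len(scores[0])
    if len(scores) < 2 or n_items < 2:
        raise ValueError("need at least two students and two items")
    if any(len(row) != n_items for row in scores):
        raise ValueError("every student needs a score on every item")

    # Sample variance of each item (column) across students.
    item_variances = [variance(column) for column in zip(*scores)]
    # Sample variance of each student's total score.
    total_variance = variance([sum(row) for row in scores])
    if total_variance == 0:
        raise ValueError("total scores do not vary; alpha is undefined")

    return n_items / (n_items - 1) * (1 - sum(item_variances) / total_variance)


# Made-up scores for three students on two items (illustrative only).
example_scores = [[2, 3], [4, 4], [6, 5]]
alpha = cronbach_alpha(example_scores)
```

Note that the calculation treats every column as a measurement of the same stable construct. If the columns were repeated observations during a placement and the students improved over time, that growth would simply enter the item and total variances, which is the limitation the paragraph above describes.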
Med Teach. 2011;33(10):783-97. doi: 10.3109/0142159X.2011.611022.
General overview of the theories used in assessment: AMEE Guide No. 57.
Author information: Flinders University, Adelaide 5001, South Australia, Australia. lambert.schuwirth@flinders.edu.au
PMID: 21942477 [PubMed - indexed for MEDLINE]