Assessment in medical education: concepts of generalizability theory

Medical education assessment: a brief overview of concepts in generalizability theory

Mohsen Tavakol1, Robert L. Brennan2

1Medical Education Unit, The University of Nottingham, UK

2Centre for Advanced Studies in Measurement and Assessment, The University of Iowa, USA






Medical educators need to understand the sources of measurement error in order to improve the quality of student assessment.

The General Medical Council (GMC) in the UK has emphasized the importance of internal consistency for students' assessment scores in medical education.1 Typically, Cronbach's alpha is reported by medical educators as an index of internal consistency. Medical educators mark assessment questions and then estimate statistics that quantify the consistency (and, if possible, the accuracy and appropriateness) of the assessment scores in order to improve subsequent assessments. The basic reason for doing so is the recognition that student marks always contain various types of measurement error, which reduce the accuracy of measurement. The magnitude of measurement errors is incorporated in the concept of reliability of test scores, where reliability itself quantifies the consistency of scores over replications of a measurement procedure. Therefore, medical educators need to identify and estimate sources of measurement error in order to improve students' assessment.


Under CTT, a student's observed score is the sum of a true score and a single undifferentiated error term. The reliability measure most commonly used with this model is Cronbach's alpha. Almost always, however, alpha incorporates only the error associated with the sampling of items, so we cannot pinpoint, isolate, or estimate from alpha the impact of different sources of measurement error. G theory is an extension of CTT that allows us to distinguish multiple sources of measurement error, called "facets".

Under the Classical Test Theory (CTT) model, the student's observed score is the sum of the student's true score and a single undifferentiated error term. Using this model, the most frequently reported estimate of reliability is Cronbach's alpha. Almost always, however, when alpha is reported, it incorporates only errors associated with the sampling of items. Accordingly, alpha does not allow us to pinpoint, isolate, and estimate the impact of different sources of measurement error associated with observed student marks. An extension of CTT called "G (Generalizability) theory" enables us to differentiate the multiple, potential sources of measurement error called "facets" (sometimes called "dimensions" in the experimental design literature).
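For reference, the CTT decomposition and Cronbach's alpha can be written as follows; these are the standard textbook formulas rather than anything specific to this article (k is the number of items, sigma-squared of Y_i the variance of item i, and sigma-squared of X the total-score variance):

```latex
% CTT: observed score X = true score T + error E,
% with T and E assumed uncorrelated
X = T + E, \qquad \sigma^2(X) = \sigma^2(T) + \sigma^2(E)

% Cronbach's alpha for a k-item test
\alpha = \frac{k}{k-1}\left(1 - \frac{\sum_{i=1}^{k} \sigma^2_{Y_i}}{\sigma^2_X}\right)
```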


The set of all facets constitutes the universe of admissible observations (UAO).

For example, in an OSCE exam, a student might be observed by one of a large sample of examiners, for one of a large sample of standardized patients (SPs), and for one of a large sample of cases. The facets, then, would be examiners, SPs, and cases, each of which serves as a potential source of measurement error. The set of all facets constitutes the universe of admissible observations (UAO) in the terminology of G theory. As another example, suppose that for a cardiology exam, the investigator is interested in an item facet only; in that case, there is only one facet.


There is no right answer to which facets, or how many, should be included. That decision is the investigator's responsibility, and the investigator should be able to provide evidence for the importance of each facet.

There is no right answer to the question of which facets, or how many facets, should be included in the UAO. It is the investigator’s responsibility to justify any decision about the inclusion of facets, and provide supporting evidence about the importance of each facet to the consistency and accuracy of the measurement procedure. G theory provides a conceptual framework and statistical machinery to help an investigator do so.


For a given test, the number of conditions for each facet is specified. This leads to the definition of the universe of generalization (UG) and of the universe score (the analogue of the true score in CTT).

For any given form of a test, there are a specified number of conditions for each facet. The (hypothetical) set of all forms similarly constructed is called the universe of generalization (UG). For any given examinee, we can conceive of getting an average score over all such forms in the UG. This average score is called the student's universe score, which is the analogue of true score in CTT. The variance of such universe scores, called universe score variance, can be estimated using the analysis of variance "machinery" employed by G theory.


G theory accommodates a variety of designs.

G theory can accommodate numerous designs to examine the measurement characteristics of many kinds of student assessments. If medical educators wish to investigate assessment items as a single source of measurement error on a test, this is a single-facet design. There are two types of single-facet designs. If the same sample of questions is administered to a cohort of students, we say the design is crossed in that all students (s) respond to all items (i). This crossed design is symbolised as s × i, and read as students crossed with items. If each student takes a different set of items, we have a nested design, which is symbolised as i:s, meaning that items are nested within students.
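In the notation commonly used in the G theory literature (a sketch, not quoted from the article above), the observed score for the crossed s × i design decomposes as:

```latex
% Random-effects linear model for the crossed s x i design;
% the residual confounds the s x i interaction with unmeasured error
X_{si} = \mu + \nu_s + \nu_i + \nu_{si,e}

% Total observed-score variance splits into three variance components
\sigma^2(X_{si}) = \sigma^2(s) + \sigma^2(i) + \sigma^2(si,e)
```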

In most realistic circumstances there are facets in addition to items. Imagine a case-based assessment with four cases and a total of 40 items designed to measure students' ability in dermatology. In this example, all students take all items; hence, students are crossed with items (s × i), but the items are distributed across cases (e.g., 10 items in case 1, 10 in case 2, 10 in case 3, and 10 in case 4). That is, items are nested within cases, and this design is called a two-facet nested design, symbolised as s × (i:c).
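Applying the same logic to the two-facet s × (i:c) design yields five variance components; again this is standard G theory notation shown as a sketch, not a passage from the article:

```latex
% Random-effects linear model for the two-facet s x (i:c) design
X_{sic} = \mu + \nu_s + \nu_c + \nu_{i:c} + \nu_{sc} + \nu_{si:c,e}

% Five estimable variance components
\sigma^2(X_{sic}) = \sigma^2(s) + \sigma^2(c) + \sigma^2(i:c)
                  + \sigma^2(sc) + \sigma^2(si:c,e)
```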


A large variance component for a particular facet means that the facet had a relatively large influence on student marks. For example, in an OSCE, a high variance component for examiners indicates that the examiners did not behave consistently in their ratings.

The designs discussed in the previous paragraphs are usually called G study designs, and they are associated with the UAO. The principal purpose of such designs is to collect data that can be used to estimate what are called "variance components." In essence, the set of variance components for the UAO provides a decomposition of the total observed variance into its component parts. These component parts reflect the differential contribution of the various facets; i.e., a relatively large variance component associated with a facet indicates that the facet has a relatively large impact on student marks. For example, in an OSCE, if the estimated variance component for examiners (the examiner facet) is large, we would conclude that the examiners have not behaved consistently in their rating of the construct of interest.
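As a minimal sketch of how variance components are estimated for the crossed s × i design, the expected-mean-square equations of a two-way random-effects ANOVA (without replication) can be solved directly. The score matrix below is made-up illustrative data, not data from the article:

```python
import numpy as np

# Made-up example: 5 students (rows) x 4 items (columns), scored 0/1
scores = np.array([
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 1, 1, 1],
    [1, 1, 1, 1],
    [0, 0, 1, 0],
], dtype=float)

n_s, n_i = scores.shape
grand = scores.mean()
s_means = scores.mean(axis=1)   # one mean per student
i_means = scores.mean(axis=0)   # one mean per item

# Mean squares for the two-way ANOVA without replication
ms_s = n_i * np.sum((s_means - grand) ** 2) / (n_s - 1)
ms_i = n_s * np.sum((i_means - grand) ** 2) / (n_i - 1)
residual = scores - s_means[:, None] - i_means[None, :] + grand
ms_res = np.sum(residual ** 2) / ((n_s - 1) * (n_i - 1))

# Solve the expected-mean-square equations:
#   E(MS_s)   = sigma2(si,e) + n_i * sigma2(s)
#   E(MS_i)   = sigma2(si,e) + n_s * sigma2(i)
#   E(MS_res) = sigma2(si,e)
var_si_e = ms_res
var_s = max((ms_s - ms_res) / n_i, 0.0)  # negative estimates set to 0
var_i = max((ms_i - ms_res) / n_s, 0.0)

print(f"sigma2(s)    = {var_s:.4f}")     # universe score variance
print(f"sigma2(i)    = {var_i:.4f}")     # item facet
print(f"sigma2(si,e) = {var_si_e:.4f}")  # interaction + error
```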


Once the variance components have been estimated, investigators estimate error variances and compute reliability-like coefficients associated with the UG.

Once variance components are estimated, investigators typically estimate error variances and reliability-like coefficients that are associated with the UG. Such coefficients can range from 0 to 1.


One coefficient is called a generalizability coefficient; it incorporates relative error variance. Another coefficient is called a Phi coefficient; it incorporates absolute error variance. 


Computing these coefficients and error variances requires specifying the D study design which, in turn, specifies the number of conditions of each facet that are (or will be) used in the operational measurement procedure.


Relative error variance (and, hence, a generalizability coefficient) is appropriate when interest focuses on the rank ordering of students. 


Absolute error variance (and, hence, a Phi coefficient) is appropriate when interest focuses on the actual or "absolute" scores of students.


Relative error variance (for a so-called “random effects” model) involves all the variance components that are interactions between students and facets. 


Absolute error variance includes relative error variance plus the variance components for the facets themselves. The square roots of these error variances are called standard errors of measurement. They can be used to establish confidence intervals for students' universe scores. For further information about these coefficients and error variances, readers may refer to specialized books.2,3
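For the simple s × i case, these quantities take the following standard forms (summarized from the G theory literature for convenience, with n'_i the number of items used in the D study):

```latex
% Relative and absolute error variance for an s x i D study with n'_i items
\sigma^2(\delta) = \frac{\sigma^2(si,e)}{n'_i}, \qquad
\sigma^2(\Delta) = \frac{\sigma^2(i)}{n'_i} + \frac{\sigma^2(si,e)}{n'_i}

% Generalizability and Phi coefficients
E\rho^2 = \frac{\sigma^2(s)}{\sigma^2(s) + \sigma^2(\delta)}, \qquad
\Phi = \frac{\sigma^2(s)}{\sigma^2(s) + \sigma^2(\Delta)}

% Standard errors of measurement; a 95% interval for a universe score
% is approximately \hat{\mu}_s \pm 1.96\,\sigma(\Delta)
SEM_{rel} = \sqrt{\sigma^2(\delta)}, \qquad
SEM_{abs} = \sqrt{\sigma^2(\Delta)}
```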


Knowing the magnitude of estimated variance components enables us to design student assessments that are optimal, at least from a measurement perspective. For example, a relatively small estimated variance component for the interaction of students and items suggests that a relatively small number of items may be sufficient for a test to achieve an acceptable level for a generalizability coefficient.
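Continuing the earlier G study sketch, a minimal D study simply recomputes the coefficients for candidate test lengths; the variance-component values here are hypothetical, chosen only for illustration:

```python
# D study for the s x i design: how many items are needed to reach,
# say, E(rho^2) >= 0.80? Variance components below are assumed
# illustrative values, not estimates from the article.
var_s, var_i, var_si_e = 0.04, 0.02, 0.15

for n_items in (10, 20, 40, 80):
    rel_err = var_si_e / n_items                    # sigma2(delta)
    abs_err = var_i / n_items + var_si_e / n_items  # sigma2(Delta)
    e_rho2 = var_s / (var_s + rel_err)              # generalizability coeff.
    phi = var_s / (var_s + abs_err)                 # Phi coefficient
    print(f"n_i={n_items:3d}  E(rho^2)={e_rho2:.3f}  Phi={phi:.3f}")
```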


In practice, powerful computer programs are required to estimate variance components, coefficients, and error variances, especially for multifaceted designs. Several G theory software programs have been developed for estimating such statistics (see, for example, http://www.education.uiowa.edu/centers/casma/computer-programs).

Variance components can also be estimated using SPSS and SAS, but these packages do not directly estimate coefficients and error variances. The first author is developing an online, user-friendly application for estimating variance components for both balanced and unbalanced designs. Using a simple script, readers will be able to print out the estimates of important parameters in G theory. The application is written in R and C++ and executed via PHP code. Figure 1 shows balanced-design output from the application.








Source: Tavakol M, Brennan RL. Medical education assessment: a brief overview of concepts in generalizability theory. Int J Med Educ. 2013;4:221-222. Published online 2013 Sep 11. doi: 10.5116/ijme.5278.a850. PMCID: PMC4205529

