Research in assessment: Consensus statement from the Ottawa 2010 Conference (Med Teach, 2011)

Meded. 2017. 7. 20. 11:18

Research in assessment: Consensus statement and recommendations from the Ottawa 2010 Conference

LAMBERT SCHUWIRTH¹, JERRY COLLIVER², LARRY GRUPPEN³, CLARENCE KREITER⁴, STEWART MENNIN⁵, HIROTAKA ONISHI⁶, LOUIS PANGARO⁷, CHARLOTTE RINGSTED⁸, DAVID SWANSON⁹, CEES VAN DER VLEUTEN¹ & MICHAELA WAGNER-MENGHIN¹⁰

¹Maastricht University, The Netherlands, ²Southern Illinois University School of Medicine, USA, ³University of Michigan Medical School, USA, ⁴University of Iowa Carver College of Medicine, USA, ⁵University of New Mexico School of Medicine, USA/Brasil, ⁶University of Tokyo, Japan, ⁷Uniformed Services University, USA, ⁸Rigshospitalet, Denmark, ⁹National Board of Medical Examiners, USA, ¹⁰Medical University of Vienna, Austria

Preliminary statements

Not so long ago, the main focus of assessment was on measuring the outcomes of the learning process, i.e. determining whether the students had acquired sufficient knowledge, skills, competencies, etc. This approach is often referred to as assessment of learning. Currently, a second notion has gained ground, namely assessment for learning (Shepard 2009).

The sole reason for highlighting this is that it makes the distinction between educational research and assessment research less clear. Therefore, it is inevitable that this article contains descriptions, positions and consensus that do not pertain exclusively to assessment research but may have bearing on more general educational research as well.

The term assessment will be used to refer to the systematic determination of student/learner achievement and performance.
The term evaluation will be used with reference to issues and questions related to programmes, projects and curriculum, within which questions and issues of assessment of learners are nested and co-embedded with educational issues and questions related to resources, faculty, general institutional and programmatic outcomes, as well as explanations of educational intervention.

Introduction

Medical education as a scientific discipline is still young. Although the two disciplines on which it is founded, educational psychology and clinical medicine, have a much longer scientific history, medical educational research itself did not start as an independent stream before the 1960s. It is now a rapidly changing field seeking its own scientific identity, not least because the scientific languages and mores of medicine – as a biomedical science – and educational psychology – as a social science – differ considerably.

Medical education is therefore faced with the challenge of defining itself relationally, so that it fits identifiably within the larger context of health professions education and the biological and social sciences because of its unique characteristics rather than in spite of them.

Definitions

Theory

When we use the word ‘theory’, we refer to rational assumptions about the nature of phenomena, based on observations, and subject to scientific studies aiming to verify or falsify the theory. A theory is not necessarily practically useful, but it must be useful for understanding phenomena.

An example of a theory is classical test theory. In this theory, the backbone is the notion that an observed score is the sum of the true score (the score a candidate would have obtained if s/he had answered all the possible relevant items of a certain domain) and the error score.
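The decomposition of an observed score into a true score and an error score can be illustrated with a small simulation. This is only a sketch under stated assumptions: the function names `simulate_ctt` and `reliability`, the normal distributions, and the chosen mean and standard deviations are illustrative and do not come from the article. The ratio of true-score variance to observed-score variance is the usual classical test theory definition of reliability, added here for context.

```python
import random
import statistics


def simulate_ctt(n_candidates=2000, true_sd=10.0, error_sd=5.0, seed=0):
    """Simulate observed = true + error for a group of candidates.

    All parameter values are hypothetical and chosen only for illustration.
    """
    rng = random.Random(seed)
    true_scores = [rng.gauss(60.0, true_sd) for _ in range(n_candidates)]
    errors = [rng.gauss(0.0, error_sd) for _ in range(n_candidates)]
    # The backbone of classical test theory: X = T + E
    observed = [t + e for t, e in zip(true_scores, errors)]
    return true_scores, errors, observed


def reliability(true_scores, observed):
    """Share of observed-score variance that is true-score variance."""
    return statistics.pvariance(true_scores) / statistics.pvariance(observed)


true_scores, errors, observed = simulate_ctt()
rel = reliability(true_scores, observed)
# Theoretical value under these assumptions: 10^2 / (10^2 + 5^2) = 0.8;
# the simulated estimate should lie near it.
print(f"estimated reliability: {rel:.3f}")
```

In real testing the true score is never observed directly, which is why reliability has to be estimated indirectly; the simulation can compute it only because it generates the true scores itself.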

Theoretical framework

With a theoretical framework, we imply a set of related theories that together serve to explain a complicated aspect (in this case concerning assessment). An example of a theoretical framework is the approaches to validity. Validity, for example, has been defined as the minimisation of construct under-representation and construct-irrelevant variance by Messick (1994) and as an argument-based rationale by Kane (2006). Both views can be seen as theoretical frameworks.

Note that we avoid using the term ‘paradigm’ here. ‘Paradigm’ has a distinct meaning with important implications depending on the specific philosophy of science stream in which it is used (compare, for example, the views of Kuhn and his successor Imre Lakatos). Therefore, we think that the word should be used with care.

Conceptual framework

Where a theoretical framework tries to formulate a series of related theories to explain optimally a complicated phenomenon, a conceptual framework helps to interpret findings and give directions. An example of a conceptual framework is assessment for learning as opposed to its conceptual counterpart, assessment of learning. In the context of a framework of assessment of learning, case-to-case variance can be seen as error, whereas in the conceptual framework of assessment for learning, the same type of variance can be seen as meaningful variance (e.g. because it provides the teacher with an entry to stimulate the learning of a specific student, through the identification of strengths and weaknesses).

Types of research

The idiographic description or ‘case report’

Often, scientific research is associated almost exclusively with experimental research. Basically, one could state that what is essential in scientific research is a planned and structured collection or management of data with the intent to generate, or add to, generalisable knowledge (Miser 2005). This bears in it the notion of general relevancy and applicability, often epitomised in the two questions: ‘who cares?’ and ‘so what?’ (Bligh 2003).

Despite the above definition of scientific research, an abundant feature in the literature has been the idiographic description of assessment methods and approaches. One of the earliest examples of this – in general education – was given by McCall (1920), who introduced the true-false examination format. Such descriptions can be seen as analogous to the case reports in the medical literature. They may not be in line with the definition of ‘a planned and structured collection or management of data’ mentioned above, but they certainly serve a purpose.

Educational ‘case reports’ describe innovations without providing much supportive data; they may describe new instructional methods or assessment tools. For example, one can regard the first publication on the objective structured clinical examination (OSCE) as such a presentation paper or case report (Harden & Gleeson 1979). Yet, it has had an enormous impact on practice.

Therefore, educational descriptions of local innovations can be valuable under certain provisos. The most important of these provisos is that the authors describe the implications of their instrument/method/approach. In other words, what uses does it have, how should it be used, which of its elements make it so useful and how should it be used in other contexts? Also, whenever possible, they should provide directions for further research.

Recommendation 1: The educational ‘case report’ should always surpass the idiographic description of an instrument, method or approach and lead to generalisable knowledge. A case report can only optimally contribute to the existing literature if it is supported by a description of the theoretical rationale for the instrument and a discussion hypothesising which generic aspects of it can be used in other contexts or situations.

A ‘modern’ example of such a study is the paper introducing the multiple mini interview by Eva et al. (2004).

Developmental or design-based research

There is always a need to develop new instruments or new approaches to assessment. A new development, however, is more than merely a good idea. The theoretical body and the amount of literature in medical education and related disciplines such as (cognitive) psychology, general education and psychometrics should be ample enough to base an idea on, and we take the stance that a good review of the literature and subsequent underpinning of the idea is a condition (a sine qua non) for the development of new ideas (even if this literature serves to support the premise that there is a need for a novel approach).

Recommendation 2: Developmental or design-based research should be realised through more than one single study, and be planned as a train of studies building the bridges between the idea, the pilot experiments, the improvements, the use in real life, etc.

Justification research

In assessment, the best-known examples are the studies aiming to determine whether open-ended questions are superior to multiple-choice questions.

Justification research is, for example, not strong at providing insight into the underlying processes; the results do not tell us why an approach works or does not work. A second limitation is that the ‘justification’ may or may not apply to other schools and settings. A clear description of the theory underlying the study design allows others to interpret the results for their own situation.

Recommendation 3: We take the stance that it is imperative that any justification research is done from one or more well-founded and well-described theoretical frameworks. Without theory, the results are often of limited use.

Many studies have simply correlated scores on a multiple-choice test with those on an open-ended test on the same topic. Typically, moderate correlations of 0.4–0.5 were found (Norman et al. 1996; Schuwirth et al. 1996). These, however, are uninterpretable results, as it is still unclear whether the glass is half full or half empty. Had the research been done from the framework of validity and cognitive psychology, the question would, for example, have been whether the format determines the type of memory pathways the question elicits more than the content of the question does. Such comparisons show that when the content is similar and the format different, correlations are extremely high, and when the content differs and the format is the same, correlations are extremely low (Norman et al. 1985). Looking into this further through a think-aloud protocol study, using the theory on expertise and its development, then shows that the stimulus type (case-based or plain factual knowledge) directly influences the quality of the thinking processes (Schuwirth et al. 2001).

Unfortunately, this limitation is often overlooked, because in medicine, the randomised controlled trial is frequently seen as the best or ultimate scientific approach. Yet in medical education, justification research can only serve to form one link in the chain connecting theoretical scientific findings with practice, much as studies proving the superiority of one cancer drug over a placebo do not help to gain insight into the fundamental mechanisms of cancer.

In addition, one single big justification study cannot answer complicated questions. Therefore, the often-heard question ‘this is all very nice but does this produce better doctors?’ is unanswerable with one single study. (Translated to the clinical context it would be: ‘This CT-scan and MRI stuff is all very nice but does it improve the health of the national population?’) There are good discussions of this debate in the medical education literature (Torgerson 2002; Norman 2003; Regehr 2010).

Recommendation 4: We take the position that justification research alone is not able to answer the ‘big questions’ and needs to be incorporated in a programme of research.

Fundamental theoretical or clarification research

Fundamental theoretical research can be qualitative or quantitative, but it seeks to understand how things work, or why things work (Miscellaneous authors 2001).

It is obvious that for a good research project, be it in medical education in general or in assessment in particular, it is essential to review the existing literature well. Rarely have ‘new ideas’ not been discovered or described before. Not reviewing the literature, of course, can lead to unnecessary duplication, provided it ever gets published. This is not to say that a replication study cannot provide the marginal benefit of an additional example of something already demonstrated, but the nth replication may not be the best use of resources.

Recommendation 5: Replication studies must/should be prepared by a literature review which identifies unique features and purposes of the study.

Research on assessment in recent decades has profited much from the research in cognitive psychology on expertise. Current research into workplace-based assessment has more to offer if the research in the business literature is included as well. In this way, the researcher is challenged to be more precise about what his/her study adds to the existing literature, i.e. whether it is something completely new, whether it is a replication of findings in a totally different field, or whether it is a replication study in a different context.

Recommendation 6: When designing a fundamental theoretical study, the researcher should not only scan the existing medical education literature but also relevant adjacent scientific disciplines (e.g. cognitive psychology, business literature on appraisal, etc.).

Recommendation 7: We take the position that currently there is a need for more clarification research in assessment. Many of the descriptions so far have been idiographic and have not contributed to the emergence of solid underlying scientific theories. Theory formation in science is essential, because without unifying or at least supporting theories, individual studies cannot be linked together in a meaningful way, results cannot be interpreted meaningfully enough and new studies cannot be planned with sufficient focus.

Theoretical frameworks/context

As stated above, research in educational assessment is rooted in social scientific research to a large extent. In social scientific research, even more than in biomedical research, a clear choice of a specific theoretical framework is needed. One currently interesting field in assessment is, for example, the use of human judgement, especially in workplace-based assessment. Research questions can be approached from the theories of naturalistic decision making (Klein 2008), cognitive load theory (Van Merrienboer & Sweller 2005), or theories on the actuarial value of human judgement (Dawes et al. 1989).

There are several reasons why such theoretical frameworks are desirable. First, they help to focus the research questions and underpin the operational definitions of the variables or constructs explored. They are useful in helping us understand the implications of the results and conclusions. They help us to clearly stipulate hypotheses that can be falsified or corroborated. Most importantly, however, they serve to link various studies together into a coherent overarching theory or paradigm, either by using studies founded in the same theoretical framework or different studies comparing or juxtaposing different frameworks.

Recommendation 8: Studies in the field of assessment should, whenever possible, be conducted from a clearly defined theoretical framework. This framework must be reported in the introduction and be used in the description of the methods and in the discussion.

Recommendation 9: If it is not possible to use a theoretical framework, at least a thorough review of the existing literature must have been performed to clearly determine where the specific research is positioned in the existing literature.

Recommendations 8 and 9 may not always be easy to adhere to when designing a study or writing a paper. A suggestion to aid in this is, as an exercise, to try to write the introduction of the study without mentioning the local context in which it was performed and still be able to demonstrate the importance and relevancy of the study (the ‘who-cares-and-so-what’ question).

If it is impossible to do this, the onus is on the researcher to explain what makes the context of the study relevant to its outcomes, for example, by explaining what it has in common with other institutions or what is different, or what is known in the literature. What is it about your setting that makes it interesting to others?

Another very important issue is the theoretical and practical contexts in which the study was performed. Two mainstream contexts at the moment are assessment of learning versus assessment for learning. The former is aimed at establishing accurately enough whether the student’s learning activities have made him/her sufficiently competent. The latter includes the inextricable relationship between assessment and learning. The former mainly approaches assessment as a (psychometric) measurement problem, the latter as an educational design problem.

Study design, choices of methods

Many different methodologies can be chosen to conduct research into assessment of medical competence. Contrary to some beliefs, we take the stance that there is no such thing as one single inherently better methodology. The best methodology is the one that is most suited to answer the research question.

Recommendation 10: Avoid thinking in terms of the innate superiority of one methodology over another; rather, the best methodology is the one that is optimally able to answer the research question.

A suggestion we want to offer is to think through the chosen methodology, to imagine the possible outcomes of the study and then consider critically which conclusions you could draw from them.

Recommendation 11: As educational research is not easy to conduct, it is always wise to include in your team someone with expertise in the methodology you want to use, rather than to simply assume that anyone with a sound mind can do this type of research.

Instrument characteristics: Validity and generalisability

For the correctness of the outcomes of any scientific study, the use of carefully designed instruments is a necessary condition. In social scientific research, the instrumentation is not always as standard as in other types of research. We do not have experimental animal models, standard ultracentrifuges, etc.; instead, we often have to design our instruments ourselves or adapt them from others. This makes it essential that instrument development and description are conducted with the utmost care. Just a collection of questions, for example, does not make a good questionnaire, and just some items do not make a good test.

Validity

Therefore, a validation procedure for an assessment instrument is always a series of studies evaluating the extent to which the instrument scores help to assess the construct. A validation procedure is, much like a scientific theory, never finished. Instead, it needs to consist of a series of critical studies to determine whether the test actually assesses the construct it purports to measure.

The most obvious – and historically first – major notion of validity is one of criterion validity, i.e. does the assessment result predict performance on a criterion measure well enough? Where possible, this is a convenient approach to validity because it is quite practical in convincing stakeholders who are less well versed in education. But, because we aim at assessing an invisible/intangible aspect, a tautological problem may arise, in that the criterion may need validation as well. If we are validating an assessment instrument which is supposed to predict whether someone will be a good professional, we need some criterion to measure ‘good professionalism’. This criterion is also a construct and may need validation as well. This would then invariably lead to a sort of Russian doll problem of needing another criterion to validate the criterion, etc. ad infinitum. This is the reason why criterion validity as the dominant approach has lost ground.

A second intuitive approach to validity is content validity. In content validity, expert judges carefully evaluate the content of the test and determine to what extent this content is representative of the construct of interest. Although it is inherently an obvious approach and it is easy to explain and defend, the sole use of human judgements is its bottleneck as well. If the judges in the content validation process are the test developers themselves, they cannot be expected to be neutral, but even with independent judges, many possible sources of bias may exist, or the specific choice of the judges may influence the outcome of the process, much like the problem with Angoff panels (Verhoeven et al. 2002).

The currently dominant notion of construct validation is analogous to the concept of the empirical approach, with theory generation, data collection, analysis and refinement or change of the theory. In this view, validation is a process of first carefully explicating the construct one tries to assess, and then collecting data to see whether the assumed characteristics of the construct are sufficiently captured by the assessment instrument. An assessment instrument can therefore not be valid in itself; it is always valid FOR a certain purpose. Although this is currently a highly popular view, it suffers from the same central concern as does the inductivist approach to science, namely that one never knows whether there are sufficient observations to verify the validity, or put more precisely, whether there is not a valid observation possible that would successfully ‘falsify’ the validity.

Current views on validity, therefore, are comparable to modern ideas in the philosophy of science, namely that validity must be seen as an argumentation process built on several inferences and theoretical notions. Kane (2006) therefore describes validity as a set of inferences and their strengths.

First, there is the inference from observation to score, i.e. how the observations of students’ actions are converted to a scorable variable.
The second inference is one from observed score to universe score. This is highly equivalent to, but not the same as, reliability in the standard meaning in the literature. Before one can make an inference about the generalisation from observed score to universe score, one must make theoretical assumptions about the nature of the universe. Internal consistency measures (alpha, split-half reliability, KR, test retest) are all useful approaches to the second inference, provided the universe from which the sample was drawn is internally consistent. In such a situation, for example, case specificity is an error source. If, however, the theoretical notion of the construct is one of heterogeneity, internal consistency of the sample is an indication of construct under-representation and therefore of poor generalisability to the universe score. In this situation, case specificity is, instead, innate to the construct. If one were to take the blood pressure of a group of patients during the day and find clear differences between the patients, but no variation between measurements within patients, the sample would be highly internally consistent, but would not be considered a generalisable sample, simply because the construct of blood pressure is assumed to vary with the moment of the day or previous activities.

세 번째 및 네 번째 추측은 우주 점수에서 대상 도메인 및 구인에 이르기까지입니다. 다른 말로 표현하면, 이 시험에서 일반화가능한 점수가 구인을 대표할 수 있습니까? 아니면 구인의 한 요소에 대해서만 일반화가능한 측정입니까?

The third and fourth inferences are from universe scores to target domain and construct. In other words, is the generalisable score on this test representative of the construct, or does it very generalisably measure only one element of the construct?

Messick (1994)은 평가 절차의 consequences를 강조하며, 이것은 타당도에 대한 우리의 생각을 포함하는 요소라고 하였다. 이것이 중요한 이유는, 평가는 결코 진공 상태가 아니며, 교육적 결과에서 결코 벗어날 수 없기 때문이다.

Messick (1994) has highlighted the consequences of the assessment procedure as an element to include in our thinking about validity. This is an important notion because assessment never takes place in a vacuum and can never be seen disentangled from its (educational) consequences.

Recommendation 12 : 도구들은 절대로 완전히 검증되지 않는다. 타당도는 항상 타당도 주장을 뒷받침하는 증거를 수집하는 문제입니다. 논증은 연역적 일 수도 있고 귀납적 일 수도 있고 불가능할 수도 있습니다.

Recommendation 12: Instruments are never completely validated; validity is always a matter of collecting evidence to support the arguments for validity. Arguments may be deductive, inductive or defeasible.

권장 사항 13 : 본질적으로 도구는 그 자체로 결코 타당하지 않습니다. 도구는 어떤 목적에 대해서만 유효한 것이다. 필요하다면, 다른 연구에서 검증 된 도구조차도 특정 연구가 수행 된 맥락에서 다시 검증 될 필요가있다.

Recommendation 13: An instrument is never valid per se. An instrument is always valid FOR something. If necessary, even an instrument validated in another context needs to be validated again for the context in which the specific study was done.

신뢰도

Reliability

관찰 된 점수의 우주 점수에 대한 일반화는 타당도 과정의 일부이지만, 신뢰성의 개념은 종종 많은 연구에서 별개로 취급됩니다.

Though generalisation of the observed score to the universe score is part of the validity process, the concept of reliability is often treated separately in many studies.

신뢰성은 반복된 측정의 일관성을 나타냅니다. 심리 측정 분석은 타당도 근거에 기여하는 데 사용할 수있는 통계를 제공하지만, 반복 측정시에 도구의 일관성을 정량화하는 통계도 제공합니다. 기본적으로 세 가지 유형의 추론이 필요할 수 있습니다.

Reliability refers to the consistency of the measurement when it is repeated. Psychometric analysis offers statistics that can be used to contribute to validity evidence, but also statistics to quantify the consistency of an instrument when it is repeated. Basically, three types of inferences can be required:

(1) 실제 시험과 평행 시험에서 동일한 점수를 얻을 수 있습니까?

(1) Would the student obtain the same score on the parallel test as s/he did on the actual test?

(2) 실제 시험과 평행 시험에서 최상위 학생부터 최하위 학생까지 동일한 순서로 순위를 정할 것입니까?

(2) Would the student take the same place in the rank ordering from best to worst performing student on the parallel test as s/he did on the actual test?

(3) 학생은 실제 시험에서와 동일한 합격 판정을 받습니까?

(3) Would the student obtain the same pass–fail decision as s/he did on the actual test?

신뢰도 계수는 점수 주변의 신뢰 구간을 계산하는 데 사용되므로 신뢰도를 고려하여 점수 범위를 결정합니다.

Reliability coefficients are also used to calculate a confidence interval around a score, thus determining the score range, taking reliability into account.

신뢰도에 대한 세 가지 이론적 접근법은 현재 대중적이며 고전적 테스트 이론, 일반화 가능성 이론 및 확률 론적 이론 (항목 반응 이론 (IRT) 및 Rasch 모델링)입니다.

Three theoretical approaches to reliability are currently popular, classical test theory, generalisability theory and probabilistic theories (item response theory (IRT) and Rasch modelling).

고전 문항 이론

Classical test theory.

고전적 시험 이론의 기본 원리는 관찰 된 점수가 실제 점수와 오류의 결과를 반영한다는 개념입니다. 이것은 관찰 된 점수와 소위 평행 시험 (동일한 주제에 대해 똑같이 어려운 시험)에서 점수 사이의 연관성으로 표현됩니다.

The basic principle of classical test theory is the notion that the observed score reflects the results of a true score plus error. This is expressed as the association between the observed score and the score on a so-called parallel test (an equally difficult test on the same topics).
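
As a worked form of the principle above, the standard textbook formulation can be written as follows. This is a sketch under the usual classical-test-theory assumptions (error uncorrelated with the true score, and parallel tests sharing the same true score and error variance); the symbols $X$, $X'$, $T$ and $E$ are notation introduced here, not taken from the article.

```latex
% Observed score = true score + error
X = T + E, \qquad \operatorname{Cov}(T, E) = 0
% Reliability: correlation between observed scores on two parallel tests X and X'
\rho_{XX'} = \frac{\sigma_T^2}{\sigma_X^2} = \frac{\sigma_T^2}{\sigma_T^2 + \sigma_E^2}
```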

CTT에서 가장 인기있는 통계는 Cronbach 's alpha이며, 각 항목이 '병렬 테스트'라고 가정 할 때 사용 된 항목의 일관성을 나타냅니다. 그러나 이것은 대부분의 경우 가장 적합한 접근법이 아닙니다 (Cronbach & Shavelson 2004). Cronbach의 알파는 기본적으로 시험 - 재검사 상관 관계의 개념에 기반하고 있으므로 후보자 점수의 순위 순서의 복제 가능성 정도를 나타내는 유효한 접근 방법입니다. 따라서 기준 참조 (절대 표준) 접근법이 사용되는 경우 항상 신뢰도를 과대 평가합니다. 이 경우, 테스트의 구체적인 평균 난이도는 Cronbach 's alpha가 고려하지 않은 에러 소스입니다.

The most popular statistic within CTT is Cronbach’s alpha, which expresses the consistency of the items used, assuming that each item is a ‘parallel test’. This, however, is in the majority of cases not the most suitable approach (Cronbach & Shavelson 2004). Cronbach’s alpha is basically based on the notion of a test–retest correlation and is therefore only a valid approach to express the degree of replicability of the rank order of candidates’ scores. As such, it is always an overestimation of the reliability if a criterion-referenced (absolute norm) approach is used. In this case, the specific mean difficulty of the test is an error source which Cronbach’s alpha does not take into account.
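
The point that alpha ignores the mean difficulty of the test can be illustrated with a minimal sketch. It uses the standard formula for Cronbach's alpha; the function name `cronbach_alpha` and the toy data are hypothetical and only serve the illustration.

```python
from statistics import pvariance


def cronbach_alpha(scores):
    """Cronbach's alpha for a candidates x items score matrix.

    scores: list of rows, one per candidate; each row holds one score per item.
    """
    n_items = len(scores[0])
    if n_items < 2:
        raise ValueError("alpha needs at least two items")
    # Variance of each item across candidates.
    item_vars = [pvariance([row[i] for row in scores]) for i in range(n_items)]
    # Variance of the total score across candidates.
    total_var = pvariance([sum(row) for row in scores])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: four candidates, three items.
data = [[1, 1, 1], [0, 0, 1], [1, 0, 1], [0, 0, 0]]
# The same test made uniformly "easier": every item score is shifted by one.
easier = [[x + 1 for x in row] for row in data]

alpha_original = cronbach_alpha(data)
alpha_easier = cronbach_alpha(easier)
```

Both calls give the same value, because shifting every item by a constant changes the mean difficulty but not the variances alpha is built from. This is why alpha says nothing about how stable a decision against an absolute standard would be.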

권고안 14 : Cronbach 's alpha는 criterion-referenced 시험의 신뢰성을 평가하는 데 사용되어서는 안되며, 이러한 경우에는 domain-referenced 지수가 사용되어야한다.

Recommendation 14: Cronbach’s alpha must not be used to estimate the reliability of criterion-referenced tests; in such cases, domain-referenced indices must be used.

기존 문헌에서 알파 해석에 대한 몇 가지 경험적 규칙을 제공하지만 (평균치가 높은 테스트의 경우 최소 0.80), 평균의 표준 오차를 계산하기 위해 신뢰성을 사용하는 것이 더 바람직하며, 이를 바탕으로 95 % 신뢰 구간 어떤 학생에게 합격 / 불합격 결정이 불확실한지를 결정하는 컷오프 점수. 그렇게하면 신뢰성이 실제 데이터와 비교되고 통과 / 실패 결정의 견고성이 확립됩니다. 점수 분포와 컷오프 점수에 따라 0.80의 알파가있는 다른 상황 (다른 분포 및 기타 컷오프 점수 포함)보다 0.60의 알파가보다 안정적인 통과 실패 결정을 내리는 상황이 있습니다.

Although the literature provides some rules of thumb for the interpretation of alpha (generally 0.80 as a minimum for high-stakes testing), it is more advisable to use reliability to calculate the standard error of the mean and from this a 95% confidence interval around the cut-off score to determine for which students a pass–fail decision is too uncertain. That way, the reliability is compared to the actual data and the robustness of the pass–fail decisions is established. Based on the score distribution and the cut-off score, there are situations in which an alpha of 0.60 gives more reliable pass–fail decisions than in other situations (with other distributions and other cut-off scores) with an alpha of 0.80.
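
The procedure described above can be sketched as follows. This is a minimal sketch that assumes the usual classical-test-theory standard error of measurement, SD × sqrt(1 − reliability), is the quantity intended; the function names and all numbers are hypothetical.

```python
import math


def standard_error_of_measurement(sd, reliability):
    """Standard error of measurement from the score SD and a reliability coefficient."""
    return sd * math.sqrt(1.0 - reliability)


def uncertain_decisions(scores, cut_score, sd, reliability, z=1.96):
    """Return the scores that fall inside the confidence band around the cut-off.

    For these candidates the pass-fail decision is too uncertain to call.
    z=1.96 corresponds to a 95% confidence interval.
    """
    sem = standard_error_of_measurement(sd, reliability)
    low, high = cut_score - z * sem, cut_score + z * sem
    return [s for s in scores if low <= s <= high]


# Hypothetical example: SD of 10 points, reliability 0.84, cut-off score 55.
flagged = uncertain_decisions([40, 50, 55, 62, 70], cut_score=55, sd=10, reliability=0.84)
```

With these numbers the band is roughly 55 ± 7.8 points, so the candidates scoring 50, 55 and 62 are flagged. How many candidates end up in the band depends on the score distribution and the cut-off as much as on the coefficient itself, which is the point of interpreting reliability against the actual data.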

권고안 15 : 신뢰도는 실제 데이터에 대한 영향에 비추어 항상 해석되어야한다. 예를 들어, 전체 테스트와 관련하여 신뢰성이보고되면, 합격 / 불합격 오분류의 관점에서 항상 해석되어야합니다.

Recommendation 15: Reliabilities should always be interpreted in the light of their influence on the actual data. For example, if reliabilities are reported with respect to a summative test, they should always be interpreted in the light of possible pass/fail misclassifications.

일반화가능도 이론

Generalisability theory.

일반화 이론은 고전적 테스트 이론의 접근법을 확장하고 사용자가 다양한 소스로부터 오류 분산을 분석하고 추정 할 수 있도록하는보다 유연한 이론입니다.

Generalisability theory expands the approach of classical test theory and is a more flexible theory in that it allows the user to dissect and estimate error variance from various sources.

일반화 가능성 이론은 연구자가 방정식의 오차 분산 원인을 구체적으로 포함하거나 제외 할 수 있다는 점에서 추가적인 유연성을 제공합니다. 연구원은 그들이 사용하는 디자인, 포함시킬 분산 원천, 포함시키지 않는 임의의 원천, 고정 요인 등으로 다루는 분산 원 등을 신중하게 고려해야합니다. 또한 연구자는 연구자가 선택한 설계를 완전히 설명하고, G study 테이블에서 complete variance component를 보고해야 하는데, 이러한 보고가 없으면 결과를 독자가 해석하거나 평가할 수 없으며 결과를 메타 분석 합성에 통합 할 수 없기 때문입니다.

Generalisability theory has additional flexibility in that it allows researchers to specifically include or exclude sources of error variance in the equation. Researchers must think very carefully about the designs they use, which sources of variance to include and which to exclude, which to treat as random and which as fixed factors, etc. It also requires that researchers completely describe the chosen design in any publication and report complete variance component G study tables, because without this sort of complete reporting, results cannot be interpreted or evaluated by the reader and cannot be incorporated into meta-analytic synthesis.
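
To make the idea of dissecting variance sources concrete, here is a minimal sketch of a G study for the simplest case: a fully crossed persons × items design with one score per cell and both facets treated as random. The design choice, the function name `g_study_p_by_i` and the toy data are assumptions made for illustration; real studies usually involve more facets and dedicated software.

```python
def g_study_p_by_i(scores):
    """Variance components for a fully crossed persons x items design.

    scores: list of rows (one per person), each with one score per item.
    Components are estimated from the two-way ANOVA mean squares.
    """
    n_p, n_i = len(scores), len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[j] for row in scores) / n_p for j in range(n_i)]

    # Sums of squares for persons, items and the residual (interaction + error).
    ss_p = n_i * sum((m - grand) ** 2 for m in person_means)
    ss_i = n_p * sum((m - grand) ** 2 for m in item_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_res = ss_total - ss_p - ss_i

    ms_p = ss_p / (n_p - 1)
    ms_i = ss_i / (n_i - 1)
    ms_res = ss_res / ((n_p - 1) * (n_i - 1))

    # Negative estimates are set to zero, a common convention.
    var_res = max(ms_res, 0.0)
    var_p = max((ms_p - ms_res) / n_i, 0.0)
    var_i = max((ms_i - ms_res) / n_p, 0.0)

    relative_error = var_res / n_i            # affects rank ordering only
    absolute_error = (var_i + var_res) / n_i  # also counts item difficulty
    return {
        "var_person": var_p,
        "var_item": var_i,
        "var_residual": var_res,
        "g_relative": var_p / (var_p + relative_error),
        "phi_absolute": var_p / (var_p + absolute_error),
    }


# Hypothetical data: persons differ, items differ in difficulty, no interaction.
result = g_study_p_by_i([[1, 2, 3], [2, 3, 4], [3, 4, 5]])
```

In this toy example the relative coefficient is 1.0 because the rank order of persons is perfectly reproduced, whereas the absolute (domain-referenced) coefficient is 0.75 because item difficulty now counts as error. Reporting the full table of components, not just one coefficient, is what lets a reader recompute either index.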

권고 16 : 일반화 가능성 이론을 도구에 적용 할 때 연구원은 sources of variance의 본질에 대한 명확한 개념을 가져야한다. 그들은 분석에 포함 된 디자인을 포함합니다.

Recommendation 16: When applying generalisability theory to an instrument, the researchers must have clear conceptions about the nature of the sources of variance they include and the design used in the analysis.

권고안 17 : 연구 보고서의 일반화가능도 분석에 대한 설명은 독립적인 연구자가 복제 할 수 있어야한다. 분산 소스, 고정 요소 및 무작위 요소의 처리, 설계 및 완전한 분산 구성 요소 테이블에 대한 포괄적 인 설명이 제공되어야합니다.

Recommendation 17: The description of generalisability analyses in any report of the study must be such that an independent researcher can replicate the study. A comprehensive description of the sources of variance, treatments of fixed and random factors, designs and complete variance component tables must therefore be provided.

확률 론적 이론 (IRT 및 Rasch 모델링).

Probabilistic theories (IRT and Rasch modelling).

기본적으로 사용할 수있는 세 가지 모델이 있습니다. 가장 간단하고 간단한 것은 소위 1 매개 변수 모델입니다. 이 모델에서 항목 당 정답의 확률과 응시자의 능력에 따라 관계가 결정됩니다 (그림 1).

There are basically three models that can be used. The first and simplest is the so-called one-parameter model. In this model, the relationship between the probability of a correct answer and the ability of the candidate is determined per item (Figure 1).

그러나 어려움만으로는 테스트를 선택하고 조작하기위한 매개 변수로 충분하지 않기 때문에 두 매개 변수 모델은 차별적 인 힘을 포함합니다. 이 모델에서는 응시자의 능력을 고려할 때 정답 확률뿐만 아니라 다른 능력 수준의 두 시험 응시자를 구별하는 항목의 힘이 포함됩니다 (그림 2).

But since difficulty alone is not enough as parameter to select and manipulate tests, a two-parameter model includes discriminatory power as well. In this model, not only the probability of a correct answer, given the test taker’s ability, is included but also the power of the item to discriminate between two test takers of different ability levels (Figure 2).

세 개의 매개 변수 모델이 사용되면 오프셋이 포함됩니다. 이는 무작위 추론의 기회와 정확히 같지 않지만 유사합니다 (그림 3).

If three-parameter models are used, the offset is included. This is not precisely the same as a random guessing chance but is similar (Figure 3).
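
The three models described above can be summarised in one function. This is a minimal sketch using the common logistic form of the item response function; the parameter names (`difficulty`, `discrimination`, `offset`) are labels chosen here, and the logistic parameterisation is an assumption, since the article does not give a formula.

```python
import math


def item_response_probability(theta, difficulty, discrimination=1.0, offset=0.0):
    """Probability of a correct answer for a candidate of ability theta.

    One-parameter model:   only difficulty varies per item.
    Two-parameter model:   discrimination varies per item as well.
    Three-parameter model: offset (lower asymptote) is added.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (theta - difficulty)))
    return offset + (1.0 - offset) * logistic


# A candidate whose ability equals the item difficulty.
p_one_param = item_response_probability(theta=0.0, difficulty=0.0)
p_three_param = item_response_probability(theta=0.0, difficulty=0.0, offset=0.2)

# A more discriminating item separates two ability levels more sharply.
gap_low = item_response_probability(1.0, 0.0, discrimination=0.5) - \
    item_response_probability(-1.0, 0.0, discrimination=0.5)
gap_high = item_response_probability(1.0, 0.0, discrimination=2.0) - \
    item_response_probability(-1.0, 0.0, discrimination=2.0)
```

In the one-parameter case a candidate whose ability matches the item difficulty has a probability of 0.5; with an offset of 0.2 this rises to 0.6. Each additional parameter has to be estimated per item, which is why the more complex models need more pre-test data.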

더 많은 매개 변수가 모델에 포함되면 더 많은 사전 테스트 데이터가 필요하다는 것이 명백 할 수 있습니다. 경험적으로 200 명의 테스트 응시자는 하나의 매개 변수 모델을 사전 테스트하기에 충분할 수 있지만 1000 개의 매개 변수 모델은 3 개의 매개 변수 모델을 안정적으로 맞추기 위해 필요합니다.

It may be obvious that the more parameters are included in the model, the more pre-test data are needed. As a rule of thumb, 200 test takers can be enough to pre-test a one-parameter model, whereas up to 1000 would be needed for a stable fit of a three-parameter model.

권장 사항 18 : 귀하의 데이터가 가정에 맞는지, 그리고 팀에 필요한 전문 지식을 가지고 있는지를 확인하지 않는 한 IRT 모델링을 사용하지 마십시오.

Recommendation 18: Do not use IRT modelling unless you are sure your data fit the assumptions and you have the necessary expertise in your team to handle it.

신뢰성의 마지막 문제는 객관성 / 주관성과 신뢰성 / 신뢰성의 관계에 대한 오해입니다.

A final issue in reliability is the misconceptions that exist about the relationship between objectivity/subjectivity on the one hand and reliability/unreliability on the other.

권고안 19 : 신뢰성 또는 충분한 우주를 확보 할 때 표현, 평가, 좋은 표본 추출은 필수적이다. objective는 도움이되지만 universe representation에는 부차적이다. 연구 및 사용 된 도구에서 연구원은 샘플링이 충분히 크고 다양 함을 보장해야합니다.

Recommendation 19: In ensuring reliability or sufficient universe representation, good sampling is essential. Structuring the assessment, making it more objective, can help but is secondary to universe representation. In the study and in the instruments used, the researcher must ensure that the sampling is sufficiently large and varied.

비용/수용가능성

Cost/acceptability

여기에는 평가 프로그램, 기술 지원 문제, 평가 프로그램의 문서화 및 게시, R & D 접근법, 변화 관리, 감사 방법, 비용 효율성, 책임 문제 등을 둘러싼 정치 및 법적 문제가 포함됩니다 (Van der Vleuten 1996; Dijkstra 외. 2010). 이 요소들은 사소하지 않습니다.

They include political and legal issues surrounding the assessment programme, technical support issues, documenting and publishing the assessment programme, R&D approaches, change management, audit methods, cost-effectiveness, accountability issues, etc. (Van der Vleuten 1996; Dijkstra et al. 2010). These elements are not trivial.

예를 들어 현재의 평가 도구는 인간의 관찰과 판단에 크게 의존하기 때문에 이해 관계자의 수용 가능성에 대한 우수한 연구가 필요합니다.

For example, good research into stakeholder acceptability is necessary because current assessment instruments rely heavily on human observation and judgement.

평가 절차의 품질은 교사 교육, 성과에 대한 피드백 등에서 비롯됩니다. 이해 관계자가 평가 절차의 부가가치에 대해 확신하지 못하고 그것을 사용하도록 잘 지시되지 않은 경우 결과는 유효하거나 신뢰할 수 없습니다.

Quality of the assessment procedure then comes from teacher training, feedback on performance, etc. If the stakeholders are not convinced about the added value of the assessment procedure and are not well instructed to use it, the results can never be valid or reliable.

권고안 20 : 연구원은 또한 조직 내에 평가의 임베딩, 프로그램으로서의 평가, 평가의 사용자 등을 고려하여 주제를 고려해야한다

Recommendation 20: Researchers should also consider topics that pertain to the embedding of the assessment within the organisation, to assessment as a programme and to the users of the assessment, in order to fill the paucity in the literature.

윤리적 문제

Ethical issues

우리는 서로 다른 나라들이 윤리적 동의에 관한 절차가 다르다는 것을 알고 있습니다. 일부 국가에서는 윤리위원회가 교육 연구를 면제 된 것으로 자동 판결하는 반면, 다른 국가에서는 종종 윤리적 인 전체 검토가 필요합니다 (때로는 의학 윤리위원회도 필요함).

We acknowledge that different countries have different procedures regarding ethical consent. In some countries, ethical committees automatically rule educational research exempt, whereas in other countries a full ethical review is often needed (sometimes even by medical ethical committees).

권고안 21 : 연구 프로젝트의 윤리적 지층을 판단하기에 충분한 지식과 관할권을 가진 윤리적 검토위원회에서 연구 프로젝트가 논의되어야한다.

Recommendation 21: When an ethical review committee exists with sufficient knowledge and jurisdiction to judge the ethical status of a research project, it should be consulted.

권고안 22 : 연구 제안서를 제출할 적절한 윤리위원회가없는 경우, 연구원은 연구 프로젝트에서 취한 윤리적 인 관심에 관한 정보를 제공해야한다. 정보에 입각 한 동의 절차, 완전 자발적 참여, 참가자에 대한 정확한 브리핑 제공, 그릇된 정보를 회피하기 위한 최대한의 노력, 참가자의 신체적 또는 정신적 상해를 최대한 예방, 보고서 / 출판물의 모든 참여자에 대해 익명성 보장 등.

Recommendation 22: If there is no suitable ethical committee to submit the research proposal to, the researcher should provide information as to the ethical care taken in the research project. S/he should, at a minimum, describe and ensure the informed consent procedure, completely voluntary participation, a correct briefing of the participants, maximum avoidance of disinformation unless there is a good debriefing, utmost prevention of any physical or psychological harm to the participants, and anonymity for all participants in the reports/publications.

시설과 지원

Infrastructure and support

연구 커뮤니티

The research community

의료 평가 연구 공동체는 개방적이고, 협동 적이며, 공동 작업이 가능한 그룹으로 가장 잘 특징 지어 질 수 있습니다. 이러한 공동 작업은 연구자가 지역 문제를 넘어서서 생각하게하고 연구 질문을보다 일반적으로 공식화해야하기 때문에 연구의 질을 향상 시키는데 도움을 줄 수 있습니다. 다양한 각도에서 아이디어를 입력하고 내부 복제를 사용하여 연구를 생산할 수 있습니다. 다른 문맥.

The medical assessment research community can best be characterised as an open, collegial and collaboration-orientated group. Such collaboration can help to improve the quality of research because it forces researchers to think beyond their local problems and formulate their research questions more generically, it brings in ideas from various angles, and it can produce research with in-built replication in other contexts.

권고 23 : 가능할 때마다 기관 간 연구가 시도되어야하며, 최소한 자료와 전문 지식 공유는 개방적이고 공동의 관점에서 이루어져야한다.

Recommendation 23: Whenever possible, cross-institutional research should be attempted, or at least sharing of materials and expertise should be done from an open, collegial standpoint.

과학 저널

The scientific journals

예를 들어 일부 저널은 단어 수 제한을 사용합니다. 다행스럽게도 온라인 자료 또는 보조 자료 (부록, 표 등)의 온라인 간행물을 통해 단어 수 제한은 물류상의 이유로 더 이상 필요하지 않습니다. 실질적으로 모든 저널은 지금까지 그러한 한계를 폐지했습니다.

Some journals have, for example, used word count limits. Fortunately, now that papers and auxiliary material (appendices, tables, etc.) can be published online, a word count limit is no longer necessary for logistical reasons. Practically all journals have abolished such limits by now.

Recommendation 24 : 저널은 논리적 인 이유로 단어 수 한도를 줄이지 말아야하며, 논문의 길이가 포함 된 메시지에 적합한 지 여부를 평가해야합니다.

Recommendation 24: Journals should not impose word count limits for logistical reasons, but should evaluate whether the length of the paper is appropriate for the message it contains.

용어 및 그 사용은 평가 연구에서 여전히 문제입니다. 이 기사의 소개에서 언급했듯이 교육과 평가, 그리고 평가와 평가의 경계는 사라져 가고 있습니다. 예를 들어, '감사'라는 용어는 아마도 10 가지 이상의 다른 의미를 지닙니다.

Terminology and its use is still a problem in assessment research. As stated in the introduction to this article, the boundaries between education and assessment and between assessment and evaluation are fading. For example, the term ‘audit’ probably has over 10 different meanings.

Recommendation 25 : 평가 연구 (또는보다 광범위하게는 보건 과학 교육 연구)의 학문 분야는 MESH 표제와 동등한 기간의 고정 분류법을 필요로한다. 과학 저널은 이 문제를 주도하여야 한다.

Recommendation 25: The discipline of research in assessment (or, more broadly, health sciences education research) needs a fixed taxonomy of terms, equivalent to MeSH headings. The scientific journals are invited to take the lead in this.

평가 연구 또는 보건 과학 교육 연구는 일반적으로 자금 지원 기관에 쉽게 접근 할 수있는 과학적 영역이 아닙니다. 어떤 경우에는 교육에 초점을 맞추기 때문에 생물 의학 연구를위한 자금 제공 기관의 목표 영역에 속하지 않으며 도메인 특정 연구이기 때문에 교육 자금 제공 기관의 영역에 속하지 않습니다. 연구 공동체로서, 우리는 관련성, 중요성 및 평가 연구의 엄격함을 더 잘 알고있는 자금 지원 기관에 대한 노력에 동참해야합니다.

Assessment research, or health sciences education research in general, is not a scientific domain that easily finds its way to funding agencies. In some situations, it does not belong to the target areas of funding agencies for biomedical research because of its focus on education, nor does it belong to the domains of educational funding agencies because the research is too domain specific. We, as a research community, should join efforts in making funding agencies more aware of the relevance, the importance and the rigour of assessment research.

권고안 26 : 평가 연구 공동체 (및 보건 과학 교육위원회)는 기금 지원 기관을 교육 연구에 대한 자금 지원에보다 적극적으로 참여시켜야한다. 우리는 이것이 주요 의학 교육 협회가 주도해야한다고 제안합니다.

Recommendation 26: The assessment research community (and the health sciences education community) should join forces in making funding agencies more open to funding educational research. We suggest that this should be led by the major medical education associations.

Med Teach. 2011;33(3):224-33. doi: 10.3109/0142159X.2011.551558.

Research in assessment: consensus statement and recommendations from the Ottawa 2010 Conference.

Schuwirth L1, Colliver J, Gruppen L, Kreiter C, Mennin S, Onishi H, Pangaro L, Ringsted C, Swanson D, Van Der Vleuten C, Wagner-Menghin M.

Author information

1: Educational Development and Research, Maastricht University, PO Box 616, Maastricht, 6200 MD, The Netherlands. l.schuwirth@maastrichtuniversity.nl

Abstract

Medical education research in general is a young scientific discipline which is still finding its own position in the scientific range. It is rooted in both the biomedical sciences and the social sciences, each with their own scientific language. A more unique feature of medical education (and assessment) research is that it has to be both locally and internationally relevant. This is not always easy and sometimes leads to purely ideographic descriptions of an assessment procedure with insufficient general lessons or generalised scientific knowledge being generated or vice versa. For medical educational research, a plethora of methodologies is available to cater to many different research questions. This article contains consensus positions and suggestions on various elements of medical education (assessment) research. Overarching is the position that without a good theoretical underpinning and good knowledge of the existing literature, good research and sound conclusions are impossible to produce, and that there is no inherently superior methodology, but that the best methodology is the one most suited to answer the research question unambiguously. Although the positions should not be perceived as dogmas, they should be taken as very serious recommendations. Topics covered are: types of research, theoretical frameworks, designs and methodologies, instrument properties or psychometrics, costs/acceptability, ethics, infrastructure and support.

PMID: 21345062
DOI: 10.3109/0142159X.2011.551558

[Indexed for MEDLINE]
