Influences on medical students’ self-regulated learning after test completion (Med Educ, 2012)
Sacha Agrawal,1 Geoffrey R Norman2 & Kevin W Eva3

 

 

INTRODUCTION

In recent years there has been a considerable increase in the extent to which assessment practices are thought of not only as opportunities to measure performance, but as teaching and learning activities in their own right. It has long been argued that ‘the assessment tail wags the curriculum dog’, and Newble and Jaeger,1 among others, have demonstrated that the assessment strategies chosen will influence students’ learning activities. The focus of this conversation, however, has evolved of late. The more established discourse has generally centred on the notion that testing has indirect effects on student learning by leading learners to change their study approach and priorities to ensure good performance in the domains that align with greater stakes. This is evidenced, for example, in Newble and Jaeger’s focus on students’ study behaviour1 and in a variety of studies2 demonstrating that the amount of studying increases when tests are imminent. More recently, however, there has been additional discussion of the direct effects of testing on student learning and the most prominent of these conversations has focused on a phenomenon known as ‘test-enhanced learning’.3

Test-enhanced learning is perhaps best described in the words of Francis Bacon,4 who, in 1620, wrote: ‘If you read a piece of text through twenty times, you will not learn it by heart so easily as if you read it ten times while attempting to recite it from time to time and consulting the text when your memory fails.’ Roediger and Karpicke5 use this quote to demonstrate that the notion that testing improves learning beyond that afforded by repeated study is not new, despite the fact that the phenomenon has been underappreciated and underutilised in formal educational settings. They review an extensive literature in psychology that clearly indicates the testing effect cannot be attributed simply to time on task, as the benefits of testing are often seen even when the exposure to the material is biased in favour of repeated study, as suggested by Bacon’s maxim.

In a typical study of test-enhanced learning, learners are presented with a set of material and are randomised to study the material multiple times or to complete a test after studying the material once. Students are then given a final test shortly (e.g. 5 minutes) after the experimental session or after a more substantial delay (e.g. 1 week).

Generally, as reported by Roediger and Karpicke,6 prior testing yields better recall on delayed tests relative to repeated study even when differences do not exist or may be reversed in immediate testing conditions.

The presently favoured mechanism by which this benefit is thought to occur (known as the retrieval hypothesis) draws on current models of cognition by suggesting that the act of retrieving information from memory strengthens the memory trace and, thus, makes the information more likely to be retrievable when it is needed in the future. This conceptual framework has led to many experimental studies that have yielded useful and concrete suggestions as to how testing might optimally be implemented for learning. For example, Larsen et al.3 have indicated that tests should be frequent, repeated and spaced out over time; they should require the production of information whenever possible (i.e. short-answer tests are preferred over simple multiple-choice question tests, although both formats have been shown to improve learning), and feedback about the correct answers should be provided, although not necessarily immediately.

Within the health professional education community, the growth of research into test-enhanced learning in psychology proved timely as it emerged at a time when health professional educators were growing increasingly sceptical of the value of self-assessment as the foundation on which performance improvement efforts can be built7,8 and were calling for greater awareness of the pedagogic value provided by externally derived data.9

In the relatively short period since the publication of Larsen et al.’s3 suggestion that medical education could benefit from the application of test-enhanced learning, researchers have confirmed that the effects do generalise to medical knowledge,10 that they are equally applicable to skills learning,11 and that the benefits observed may last much longer (up to 6 months) than is typically studied in laboratory-based experiments.12 Furthermore, the test-enhanced learning framework has been used to help elucidate how the implementation of progress testing is able to have a major impact on learning outcomes achieved at a curricular level.13

 

It remains unclear, however, whether test-enhanced learning truly has a direct effect by altering the raw memorability of the material being studied, or whether it may actually have an indirect effect more compatible with the notions of Newble and Jaeger1 by altering study behaviour, which, in turn, leads to better long-term performance.

  • To explain, indirect effects of testing are usually thought of as prospective phenomena: students expect to be tested and, as a result, they alter their study behaviours in preparation for the test.
  • It is possible, however, that such testing effects may just as readily be retrospective: students have been tested and, as a result, they spend more time thinking about the material, puzzling over what they got wrong, and seeking information that might help them understand (or debate) the accuracy of the answer key.

In other words, is test-enhanced learning yielded by retrieval or by rehearsal?

This rehearsal hypothesis is consistent with the extant data: the benefit of testing takes time to emerge; tests that require generation of knowledge (and, hence, sometimes induce greater cognitive activity) sometimes yield better learning outcomes than tests that require recognition of the correct answer, and feedback can be detrimental if it is given too soon (thus reducing the recipient’s need to puzzle over the information prior to learning the correct response).14

Differentiating between the possible mechanisms by which test-enhanced learning might occur does not reduce the importance of the work that has led to this point or decrease the accuracy of the suggestion that educators should utilise testing as a pedagogic intervention. It is important, however, that we understand the means by which testing has an effect if we are to determine how to use testing most effectively.

That said, it is difficult to tease apart the ‘retrieval’ and ‘rehearsal’ hypotheses using the standard test-enhanced learning paradigm described above because it is difficult to accurately control or capture the thought processes and learning activities of students during the 2-day to 6-month delay between the experimental session and the final test on which the impact of test-enhanced learning is measured.

Given the benefits of testing, measuring this information cannot be considered a neutral exercise as it may induce further retrieval or rehearsal. As a result, we have chosen to first take an intermediate step of examining the self-regulated review practices in which students engage in response to sitting a test and the variables that influence those practices.

METHODS


Participants


The study sample was recruited from the final-year cohort of students on the McMaster University MD programme. At the time of participation, subjects had completed their clinical clerkship rotations and were approximately 1 month away from graduation and attempting the Medical Council of Canada Qualifying Examination (MCCQE) Part 1 (a computer-based examination that constitutes part of the process for gaining a licence to practise medicine in Canada).

Materials

The experiment was delivered using a computer-based platform (RunTime Revolution Version 2.8.0; RunTime Revolution Ltd, Edinburgh, UK). All subjects completed the procedure during one of three sessions held on a single day. Participants were asked to answer 10 multiple-choice questions from each of six clinical domains (cardiology, endocrinology, gastroenterology, obstetrics and gynaecology, neurology, psychiatry). The items were drawn from a bank of questions used for in-training assessment that has been shown to have test–retest reliability > 0.7 and to correlate well with MCCQE scores.18


An example of each is given in Table 1. The inter-rater reliability of the classification scheme for item type was confirmed with the assistance of an independent reader (an experienced internist).

Procedure

At the beginning of the procedure, participants were asked to select the domain about which they felt most confident. They were then asked to predict how many questions (out of 10) they would answer correctly within that domain. Next, participants were presented with 10 questions from that domain, one at a time, in random order. They were advised that a correction factor would be imposed for incorrect answers and instructed to attempt only questions for which they felt relatively confident of their response. Once they had seen the question stem, participants were asked to attempt or defer the item by clicking the corresponding button on the screen.

The time that elapsed between the stem appearing on the screen and the clicking of the attempt or defer button was recorded as the time required to make the attempt/defer decision.

If participants chose to attempt the item, they were presented with four response options and asked to select the best response. The time that elapsed prior to the clicking of a ‘submit’ button was recorded as the answer time. Participants were then asked to rate their confidence in their response on a visual analogue scale anchored with 0 and 100 (0 indicating no confidence and 100 indicating total confidence). After seeing all 10 questions, participants repeated the procedure through all remaining domains.

In the next phase, the questions that had been initially deferred were presented again. Participants were informed that the correction factor was no longer in place. They were instructed to give their best response for each question and to rate their confidence in each response. When they had completed all questions, participants were asked to estimate how many items they had answered correctly in each domain and to predict how many they would answer correctly if they were to attempt another 10 questions from each domain in the future.

Finally, to examine the self-regulated review process in which candidates engage in response to a test situation, participants were given an opportunity to review the items they had just been presented with. Generic feedback was given by highlighting the correct answers, but students were not reminded of their own responses. Participants controlled the length of time they spent reviewing each item and this time was recorded. After completing their review, participants were asked again to give estimates of the accuracy of their recently completed and future performance within each domain.

Analysis

Effect sizes for comparisons of means were calculated using Cohen’s d = (mean[1] - mean[2]) / standard deviation.
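
The formula above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' analysis code: the text does not say which standard deviation was used, so the pooled standard deviation of the two groups is assumed here, and the function name `cohens_d` and the sample values are hypothetical.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group1, group2):
    """Cohen's d = (mean[1] - mean[2]) / SD.

    Assumption: SD is the pooled sample standard deviation of the two
    groups; the source text does not specify which SD was used.
    """
    n1, n2 = len(group1), len(group2)
    pooled_var = ((n1 - 1) * variance(group1) + (n2 - 1) * variance(group2)) / (n1 + n2 - 2)
    return (mean(group1) - mean(group2)) / sqrt(pooled_var)


# Hypothetical review times in seconds (not data from the study).
incorrect_items = [8.0, 9.0, 7.0, 10.0]
correct_items = [4.0, 5.0, 3.0, 6.0]
d = cohens_d(incorrect_items, correct_items)
```

A different choice of denominator (for example, the standard deviation of paired differences) would give a different value for the same data, so the choice should be stated whenever an effect size is reported.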

To investigate self-regulated review behaviours, multiple linear regression and ANOVA were applied to examine the relationships among demographic, experimental and response variables and the outcome of review time per test question. Multi-collinearity was examined and variables that were closely related to one another were removed from the analysis if Tabachnik and Fidell’s19 stringent criterion of tolerance < 0.10 was exceeded. As the students had been free to revisit questions previously reviewed, we considered the amount of time a question was reviewed the first time it was considered and the total amount of time the question was reviewed. The correlation between these variables was very high (r = 0.95) and thus only total review time is reported.
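
As a rough illustration of the tolerance criterion: the tolerance of a predictor is 1 − R², where R² comes from regressing that predictor on all the other predictors. The sketch below covers only the special case of two predictors, where R² reduces to the squared Pearson correlation between them. The function names and the correlation values are hypothetical and are not taken from the study.

```python
def tolerance_two_predictors(r):
    """Tolerance = 1 - R^2.

    Assumption: only two predictors are involved, so R^2 is simply the
    squared correlation r^2 between them.
    """
    return 1.0 - r ** 2


def fails_tolerance_criterion(tolerance, threshold=0.10):
    """True when tolerance falls below the threshold (0.10 in the text)."""
    return tolerance < threshold


# Illustrative correlations between two hypothetical predictors.
high = tolerance_two_predictors(0.96)  # about 0.078, below 0.10
low = tolerance_two_predictors(0.50)   # 0.75, comfortably above 0.10
```

With more than two predictors, R² has to come from a full regression of each predictor on the rest, which this sketch does not attempt.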

Ethics


RESULTS

Participants

The study sample consisted of 67 individuals from the graduating class of the McMaster MD programme. Forty (60%) were female. The median age in the study sample was 25 years (range: 23–41 years). By comparison, the class cohort as a whole (n = 149) was also 60% female and its median age was 25 years (range: 22–42 years). Participants’ progress test scores, an indication of medical knowledge,18 were equivalent to those of the class as a whole. Gender had no effect on any of the measured variables.

Participant performance and self-assessment


Self-monitoring


Students chose to attempt a median of 55 of 60 items. Despite this high attempt rate, discrimination between attempted and deferred items was still apparent. When data for the 16 students who attempted all items were excluded in order to permit a paired comparison analysis, students were found to have correctly answered a larger proportion of attempted items (71%) relative to those they deferred (40%) (difference = 31%, 95% confidence interval [CI] 24–38, effect size [ES] = 1.2, paired t-test[50] = 8.8; p < 0.001). The mean accuracy of those who attempted every item was 72%. Item type (fact versus vignette-based) had no effect on this variable.
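
The paired comparison above can be sketched as follows. In a paired design, the t statistic and the standardised mean difference are linked by d = t / √n when the denominator is the standard deviation of the paired differences. With the reported t[50] = 8.8 and n = 51 students (67 minus the 16 excluded), this gives roughly 1.2, which agrees with the reported effect size. That agreement suggests, but does not confirm, which standard deviation was used. The function name `paired_summary` and the sample proportions are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(first, second):
    """Paired t statistic and standardised mean difference for matched values.

    Assumption: the effect size uses the SD of the paired differences.
    """
    diffs = [a - b for a, b in zip(first, second)]
    n = len(diffs)
    mean_diff = mean(diffs)
    sd_diff = stdev(diffs)
    t = mean_diff / (sd_diff / sqrt(n))
    d = mean_diff / sd_diff
    return {"n": n, "df": n - 1, "mean_diff": mean_diff, "t": t, "d": d}


# Hypothetical per-student proportions correct (not data from the study).
attempted = [0.8, 0.7, 0.6, 0.9]
deferred = [0.5, 0.3, 0.4, 0.4]
summary = paired_summary(attempted, deferred)

# Arithmetic check on the reported figures: t = 8.8 with 51 students.
d_from_reported_t = 8.8 / sqrt(51)
```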

Figure 1 illustrates these relationships by bundling decision time into bins defined to create equal numbers of observations in each bin. Because few items were deferred, the number of observations for deferred items per student was small.
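
Equal-count binning of this kind can be sketched as below. This is an illustration under stated assumptions rather than the procedure actually used for Figure 1: it assumes each observation is a (decision time, answered correctly) pair and that the quantity summarised per bin is the proportion correct. The function name `accuracy_by_time_bin` and the records are hypothetical.

```python
def accuracy_by_time_bin(records, n_bins):
    """Split (decision_time, correct) records into bins of near-equal size.

    Records are sorted by decision time; bin sizes differ by at most one.
    Returns (shortest_time, longest_time, proportion_correct) per bin.
    """
    ordered = sorted(records, key=lambda rec: rec[0])
    base, extra = divmod(len(ordered), n_bins)
    summary, start = [], 0
    for i in range(n_bins):
        size = base + (1 if i < extra else 0)
        chunk = ordered[start:start + size]
        start += size
        if not chunk:
            continue  # more bins requested than observations available
        proportion = sum(1 for _, correct in chunk if correct) / len(chunk)
        summary.append((chunk[0][0], chunk[-1][0], proportion))
    return summary


# Hypothetical observations: decision time in seconds, whether answered correctly.
records = [(4.0, False), (1.0, True), (6.0, True), (2.0, True), (5.0, False), (3.0, True)]
bins = accuracy_by_time_bin(records, 3)
```

Equal-count bins keep the precision of each plotted point comparable, at the cost of bins that span unequal ranges of decision time.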

Confidence ratings for individual items ranged from 0 to 100 and followed a bimodal distribution with a median of 65 and large peaks at 50 and 100. Participants assigned higher mean confidence to items they answered correctly (70.1) than to items they answered incorrectly (46.0) (difference = 24.1, 95% CI 22–26, ES = 2.7, paired t-test[66] = 22.3; p < 0.001), which aligned well with their actual accuracy.



Self-regulated review time


Sixty-two students chose to enter the review section of the procedure. Their overall accuracy (68.3%) was identical to that of the five participants who did not engage in review (68.0%). The time spent reviewing items was skewed and ranged from 0.0 to 81.5 seconds (median = 3.0 seconds) per question.

Ten variables were submitted to a multiple regression analysis to determine their association with self-regulated review time:

  • two were demographic (Gender; Age);
  • two were experimental (Session: Morning or Afternoon versus Evening; Question type: Factual versus Vignette-based), and
  • six were measured based on the participants’ response pattern (Order in which questions were presented; Decision to attempt or defer responding; Time to make that decision; Time to answer the question once the decision to attempt or defer was made; Accuracy of the answer given; Confidence in the answer given).

 

The overall model was statistically significant, but showed weak degrees of association (r = 0.34, p < 0.001). The specific variables that were associated with self-regulated review time were:

  • Accuracy (mean review time for correctly answered questions = 4.0 seconds, mean review time for incorrectly answered questions = 8.3 seconds; standardised beta = −0.281, p < 0.001),
  • Decision time and Answer time (longer times were associated with longer time spent in self-regulated review; standardised beta = 0.08 and 0.11, respectively, p < 0.001 in each instance).

Because Accuracy was found to strongly relate to the Decision to attempt or defer responding to a question, we further explored the influence of Accuracy on review time using a repeated-measures two-way ANOVA.

As Fig. 2 illustrates, within each level of accuracy, participants spent longer reviewing items for which the attempt/defer Decision and Accuracy were discordant (i.e. items that were attempted but answered incorrectly, or deferred but answered correctly), compared with items for which these variables were concordant (i.e. items that were attempted and answered correctly, or deferred and answered incorrectly).
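
The concordant/discordant distinction amounts to a 2 × 2 classification of each item by Decision (attempted or deferred) and Accuracy (correct or incorrect). A minimal sketch follows; the function names and the review times are hypothetical and only illustrate how mean review time per cell could be tabulated.

```python
from collections import defaultdict
from statistics import mean


def is_discordant(attempted, correct):
    """Discordant: attempted but incorrect, or deferred but correct."""
    return attempted != correct


def mean_review_time_by_cell(items):
    """Mean review time for each (attempted, correct) cell of the 2 x 2 table."""
    cells = defaultdict(list)
    for attempted, correct, review_time in items:
        cells[(attempted, correct)].append(review_time)
    return {cell: mean(times) for cell, times in cells.items()}


# Hypothetical items: (attempted, correct, review time in seconds).
items = [
    (True, True, 3.0),
    (True, True, 5.0),
    (True, False, 9.0),
    (False, True, 7.0),
    (False, False, 4.0),
]
cell_means = mean_review_time_by_cell(items)
```

Comparing the two cells within each accuracy level (for example, `(True, False)` against `(False, False)` among incorrect answers) mirrors the comparison described in the text.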
  
Averaging across items within category to compare review time with individuals’ general perceptions of their strengths in a given domain revealed these two variables to be unrelated to one another (r = −0.17 to 0.19; median r = −0.01).

DISCUSSION


Self-regulated learning among medical students post-test completion

By including an opportunity at the end of the testing procedure for medical students to review the items on which they had just been tested (and measuring how long they reviewed each item), we were able to empirically examine influences on students’ self-regulated learning tendencies, albeit clearly in a circumscribed context. Students spent longer reviewing items they had answered incorrectly and, more interestingly, showed indications that their review strategies were moderated by the congruence between their accuracy and their expectations. Within both levels of accuracy (i.e. for correctly and incorrectly answered items), students spent more time reviewing items for which they had misjudged their knowledge.

That is, students spent more time reviewing items they had attempted to answer (because they thought they could provide the correct answer) but had then answered incorrectly compared with items they had attempted and answered correctly. They also spent more time on reviewing items they had deferred but answered correctly, compared with items they had deferred and answered incorrectly. Furthermore, the amount of time required to decide whether or not to respond and the amount of time required to provide an answer were both positively related to the amount of time spent reviewing an individual item.

In general, these findings (combined with the lack of influence of other variables including: [i] participants’ confidence; [ii] participants’ demographics; [iii] the order and type of questions, and [iv] when the session was held) suggest that the data or feedback provided by the testing procedure played a dominant role in directing participants’ self-regulated learning behaviour although it consisted solely of the identification of the correct response.


The role of self-monitoring in performance improvement

Moulton et al.20 have argued that ‘slowing down when one should’ represents a defining feature of expertise that indicates the expert shifting from an automatic mode of practice to an effortful and analytic mode that enables her to apply formal judgement. Figure 1 illustrates that participants did slow down when they had some doubt about whether or not they possessed the knowledge necessary to correctly answer the item; this is analogous to the ‘situationally responsive’ slowing down in the model described by Moulton et al.21 Further, the relationship between response time and accuracy indicates responsiveness in a manner that speaks to the ‘when you should’ criterion defined by Moulton et al.21

This is the first controlled study to extend the model of self-monitoring developed by Eva and Regehr15,16 to the health professions context, thus more closely aligning studies performed in the clinical domain and those conducted with laboratory-based analogues. The replication of that previous work within a sample of medical students who were motivated to perform well on the test as part of their preparation for licensing examinations provides further evidence that self-monitoring importantly differs from and is a more accurate psychological process than self-assessment (i.e. the process of making a summary judgement of one’s strength in a particular domain). Further evidence for a strong relationship between response latency, confidence and response accuracy resides in the broader assessment literature.22,23

The consistency of the findings (across study and context), the extent to which the reported self-monitoring indices appear to provide reasonably stable individual differences24 and the relationship observed here between the self-monitoring indices and self-regulated review suggest that self-monitoring deserves a central role in models of self-regulated learning and performance improvement. This is by contrast with the results of dozens of previously published experiments (replicated again in the present study) that have led to the conclusion that clinicians and non-clinicians alike are poor at generating overall estimates of their knowledge and ability at a domain level.

Creating desirable difficulties

The extant literature suggests that efforts to direct one’s own learning may often be misdirected as people’s failures to identify the areas most in need of attention lead to failures to identify optimal learning activities. Empirical findings supportive of the contention that learners often misjudge the value of learning activities provide the foundation for Bjork’s25 claim that the educator’s role is to create situations for students that place them in a position of ‘desirable difficulty’ (i.e. to induce mistakes by creating situations that enable learners to discover their limits through experience25,26).


The challenge point framework, put forward by Guadagnoli and Lee,27 suggests that learning occurs at an optimal rate when the difficulty of the task being practised lies at the edge of the learner’s competence.

  • Tasks that fall below the optimal challenge point will enable better performance during practice, but result in less learning in the long term;
  • tasks that lie above the optimal challenge point will result in both poorer practice and poorer learning.




Understanding test-enhanced learning


We must note that the data collected in this study do not allow us to make a direct comparison between the amount of review time prompted by the test itself and the amount of review time students would have spent on the various questions had they been asked merely to study the material rather than to complete a test.

  • Therefore, we cannot say with certainty that the results support the rehearsal hypothesis outlined in the introduction (i.e. that test-enhanced learning occurs because the act of testing prompts individuals to rehearse or explore material to a greater extent than they would do otherwise).
  • They do, however, raise an alternative to the commonly accepted retrieval hypothesis (i.e. that test-enhanced learning occurs because the act of retrieval directly strengthens the memory trace).

Regardless of which mechanism is accurate (or whether both are), testing can be seen as providing a useful pedagogic strategy.


CONCLUSIONS

 

 

 


Med Educ. 2012 Mar;46(3):326-35.

 doi: 10.1111/j.1365-2923.2011.04150.x.

Sacha Agrawal,1 Geoffrey R Norman & Kevin W Eva

  • 1 Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, Ontario, Canada. agrawas@mcmaster.ca
  • PMID: 22324532

Abstract

  • Context: The inadequacy of self-assessment as a mechanism to guide performance improvements has placed greater emphasis on the value of testing as a pedagogic strategy. The mechanism whereby testing influences learning is incompletely understood. This study was performed to examine which aspects of a testing experience most influence self-regulated learning behaviour among medical students.
  • Methods: Sixty-seven medical students participated in a computer-based, multiple-choice test. Initially, participants were instructed to attempt only items for which they felt confident of their response. They were then asked to indicate their best responses to deferred items. Students were then given an opportunity to review the items, with correct responses indicated. Accuracy, the attempt/defer decision and the time taken to reach this decision were recorded, along with participants' ratings of their confidence in each response and the time spent reviewing each item on completion of the test.
  • Results: Students correctly answered a larger proportion of attempted items than deferred items (71% versus 40%; p < 0.001), and indicated a higher mean confidence in responses to items they answered correctly compared with items they answered incorrectly (70 versus 46; p < 0.001). They spent longer reviewing items they had answered incorrectly than correctly (8.3 versus 4.0 seconds; p < 0.001), and paid particular attention to items for which the attempt/defer decision and accuracy were discordant (p < 0.01). The amount of time required to make a decision on whether or not to answer a test question was also related to reviewing time.
  • Conclusions: Medical students showed a robust ability to accurately and consciously self-monitor their likelihood of success on multiple-choice test items. By focusing their subsequent self-regulated learning on areas in which performance and self-monitoring judgements were misaligned, participants reinforced the importance of providing learners with opportunities to discover the limits of their ability and further elucidated the mechanism through which test-enhanced learning might be derived.
