Statistics 101 (Adv Health Sci Educ Theory Pract. 2019)

Geoff Norman

I have devised a simple pretest. It goes like this:

  • One of your graduate students shows you the findings from her latest study. It is a simple two group comparison – the details do not matter. She compared the two group means using an independent sample t test, which turned out to be just significant –> p = .0498.
  • After due celebration at the local pub, you are haunted by all the recent publicity about non-replication. So the next day you insist that she repeat the study just to be sure. She does it exactly like she did before. Nothing has changed in the design.
  • QUESTION: On purely statistical grounds, what is the probability that you will arrive at the same conclusion in the second study; that is, you will reject the null hypothesis a second time?

You will find the correct answer in the footnote1 on the next page, just to keep you from peeking. For those who got it right, for the right reasons (a legitimate logical argument as to how it comes out that way), skip the next section. For the remainder, read on.

1 The probability that you will reject the null hypothesis a second time is 0.50.


A primer of statistical logic


It’s time for a review. The basic logic of statistical inference is now about 100 years old (RIP Sir Ronald), and has withstood repeated challenges (Cohen 2016), but lives on. We’ve added effect sizes and confidence intervals and odds ratios, but like it or not, you still will have trouble publishing anything quantitative without the magical “p < .05”.

So let’s be clear on what it is and is not telling us. To make it easy, let’s take the simplest case in the book—comparing a sample to a population. Suppose, for example, we have come up with a pediatric analog to “Luminosity” (which does NOT work, by the way). Instead of reducing cognitive decline with aging, we’re going to work at the other end, and devise an online intervention, Lubricosity—designed to make fluid intelligence flow just a bit better, and raise IQ of kids (that doesn’t work either, but let’s pretend it does for now).

However, we begin by doing the opposite and setting up a “null hypothesis”, which in contracted form is:

H0: Population mean (Lubricosity) = Population mean (untreated) = 100.

The basic logic is simple: If our study kids are a random sample from the population of all 12 year olds, and if the treatment doesn’t work, and if we did the study a zillion times and displayed all the sample mean IQ’s, they would be normally distributed around 100, the mean IQ in the untreated population.
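This thought experiment is easy to run for real. The sketch below simulates many repeated studies under H0, using the numbers from the worked example that follows (population SD of 15, samples of 100), and checks that the sample means pile up around 100 with a spread equal to the standard error:

```python
import random
from statistics import mean, stdev

random.seed(1)  # reproducible

# Under H0 (the treatment does nothing), each "study" draws 100 IQ scores
# from a normal population with mean 100 and SD 15, and records the sample mean.
sample_means = [
    mean(random.gauss(100, 15) for _ in range(100))
    for _ in range(10_000)
]

# The means cluster around 100; their spread is the standard error, 15/√100 = 1.5.
print(round(mean(sample_means), 1))   # close to 100.0
print(round(stdev(sample_means), 2))  # close to 1.5
```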

So we randomly sample 100 6 year olds, enrol them in the program for 3 months, then measure their IQ. Critically, because we’re looking at the distribution of means computed from samples of size 100, the width of the normal distribution of means, the standard error of the mean, would be the standard deviation of the original scores (15 for IQ) divided by the square root of the sample size (√100 = 10), or 1.5. The data would look like Fig. 1a), where we have also converted the IQ to a Z score, expressing everything in standard error units.
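The arithmetic can be checked directly; everything here (SD of 15, n = 100, the Z conversion) comes from the paragraph above:

```python
from math import sqrt

sd = 15   # SD of IQ in the population
n = 100   # sample size

# Standard error of the mean: SD divided by the square root of n.
sem = sd / sqrt(n)  # 15 / 10 = 1.5

def z_score(sample_mean, null_mean=100.0):
    """Express a sample mean in standard-error units from the H0 mean."""
    return (sample_mean - null_mean) / sem

print(sem)             # 1.5
print(z_score(104.5))  # 3.0, the value used later in the example
```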

The next step in the logic is to declare that, if the sample mean of the study is sufficiently unlikely to have arisen by chance if it came from this (null hypothesis) distribution, we will reject the null hypothesis and declare that the observed difference is statistically significant. And “unlikely” is always defined the same way: a probability of occurrence of less than 5 in 100.

This then establishes a critical value out on the distribution beyond which the tail probability is < .05, which is conventionally called “alpha”. For the simple z test, this arises at a Z of 1.96, as shown in the figure. In turn, this defines two zones: one to the left of the critical value, where we fail to reject (accept) H0, and one to the right, where we reject H0. (Again, because this is a two-tailed test, there is a similar zone on the left side of the graph, but we’ll ignore this). See Fig. 1a).

Now comes the critical part. If we reject H0, we then logically declare it comes from a different distribution, the H1 distribution, somewhere to the right of the critical value. Now, if we were doing all this hypothetically before we had the data, H1 could be centered almost anywhere (which is why sample size calculations always come up with the right number!) But once the study is over, we have information about where the H1 distribution might be—exactly where we observed it. So we assume that the second population of studies that “worked” has a distribution centered on our “best guess” of the new population mean, the observed sample mean. We also assume the distribution has the same standard deviation as the untreated population (“homoscedasticity”, if you want to sound intellectual).

Let’s assume we found that the study mean was 104.5, 3 standard errors above the H0 mean. Then the curve looks like Fig. 1b):

Now the important bit is the area of the H1 curve to the left of the critical value. That is beta, the likelihood that you would not declare a significant difference, under the alternative hypothesis that there was a difference of 4.5 IQ points. In this case, it’s 0.15. And (1-beta) is the likelihood of detecting a difference if there was one, which is called “power”. This is (1 – .15) = .85.
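Under these assumptions (H1 a unit-width normal in Z units, centred at the observed Z of 3.0, critical value at 1.96), beta and power fall straight out of the normal CDF. `NormalDist` here is Python’s standard-library normal distribution:

```python
from statistics import NormalDist

nd = NormalDist()   # standard normal; everything is in standard-error (Z) units

z_critical = 1.96   # two-tailed alpha = .05
z_observed = 3.0    # centre of the H1 distribution

# Beta: area of the H1 curve to the left of the critical value,
# i.e. the chance of failing to reject H0 even though H1 is true.
beta = nd.cdf(z_critical - z_observed)
power = 1 - beta

print(round(beta, 2))   # 0.15
print(round(power, 2))  # 0.85
```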

To be very clear, what this means is that even though this study found a difference of Z = 3.0, corresponding to a p value of about .003, the chance of replicating the finding of a significant difference is still only 85%.

And that brings us to the question posed at the beginning of the primer. If we computed a p-value of exactly .05, this corresponds to a sample mean at Z = 1.96 on the H0 distribution. That means the H1 distribution is centred right on the critical value. Half of the distribution lies to the left of the critical value and half to the right. The likelihood of replicating the original finding of a significant difference is only 50%!

In Fig. 2, I’ve plotted the likelihood of replication as a function of the calculated p value (for “significant” results). It goes from .50 to .97.
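The curve in Fig. 2 can be reproduced from the same logic: recover the observed Z from the two-tailed p value, centre H1 there, and take the area of H1 beyond the critical value. A sketch using only the standard library:

```python
from statistics import NormalDist

nd = NormalDist()
z_critical = nd.inv_cdf(1 - 0.05 / 2)  # ≈ 1.96

def replication_probability(p):
    """Chance that an exact repeat of the study is again significant,
    given a two-tailed p value from the first study."""
    z_observed = nd.inv_cdf(1 - p / 2)           # Z implied by the p value
    return 1 - nd.cdf(z_critical - z_observed)   # area of H1 beyond the critical value

print(round(replication_probability(0.05), 2))    # 0.5  (the pretest answer)
print(round(replication_probability(0.0001), 2))  # 0.97 (the other end of Fig. 2)
```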



Now let’s return to the 3 scenarios posed at the beginning:

1. “There is not [sic] power analysis presented. Should be included as part of experimental design.”

As we described above, the power of a statistical test is the probability of finding a significant difference when there is one present (i.e. when the data come from the H1 distribution). In the study we reported all the critical differences had p values ranging from 0.001 to 0.0001. The probability of finding a significant difference was 1.0, because we did find a significant difference. The power calculation adds nothing.

Power calculations are useful when a difference is expected but was not found, to estimate the likelihood of finding a difference of some presumed magnitude. They have no value when a difference was detected.



2. Pashler’s solution to non-replication was to build greater statistical power into the studies by increasing the number of trials per subject (increasing sample size).

As it turns out, whether increased sample size will or will not reduce problems of non-replication depends on whether you believe the original finding of a significant effect was true.

It would appear that most of the literature on replication and non-replication holds the view that the original finding is a false positive; the effect is really not there. (John et al. 2012; Masicampo and Lalande 2012; Simmons et al. 2011). If this is the case, then the conceptual logic we have described demonstrates that the likelihood of a false positive is always 0.05, because alpha is set at .05 from the outset. No amount of increase in sample size changes that. (In making this claim, we are deliberately ignoring the many potential investigator biases discussed in some of the references in the bibliography, and are simply looking at the theoretical probability).

So what does increased sample size achieve? Going back to the derivation, as sample size increases, the standard error of the mean decreases, so the two curves move further apart on the original scale. Power increases as the overlap decreases, but this only impacts on the likelihood of detecting a true difference. In other words, increased power will permit detection of smaller and smaller effects if they are real, but does not change the likelihood that an effect will be falsely declared significant.
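This can be made concrete by recomputing power for a fixed true effect (4.5 IQ points, as in the worked example) at different sample sizes. Alpha stays at .05 no matter what n is; only power moves. A sketch under those assumptions:

```python
from math import sqrt
from statistics import NormalDist

nd = NormalDist()

def power_at(n, true_effect=4.5, sd=15, alpha_z=1.96):
    """Power of a z test against a fixed true effect, for sample size n.
    The critical value (alpha) never changes; only the standard error does."""
    sem = sd / sqrt(n)                         # shrinks as n grows
    z_observed = true_effect / sem             # effect in standard-error units
    return 1 - nd.cdf(alpha_z - z_observed)

for n in (25, 100, 400):
    print(n, round(power_at(n), 2))  # power climbs with n; alpha stays .05
```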



3. Correlations at item level from a multi-item survey.

As an extension of the effect of study design on alpha, there is insufficient attention to the effect of multiple tests on the overall alpha level. Analysis of differences observed in a multi-item inventory is rarely sensible, for at least two reasons. First, as the number of tests increases, the likelihood of observing a significant difference increases. As a simple example, using an alpha of .05, the likelihood of finding at least one significant difference (false positive) with 5 tests is .23; with 10 tests, .40; with 20, .64; and with 50, .92. It is not possible to distinguish false from real positives. Moreover, with several hundred participants, which is not uncommon in surveys, even tiny correlations will emerge as statistically significant. One solution is a Bonferroni correction, dividing the conventional alpha by the number of proposed tests. In the published example, the approximately 40 significant results in the paper drop to 3 with a Bonferroni correction.
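The figures quoted above follow from the familywise error rate 1 − (1 − α)^k for k independent tests, and the Bonferroni correction simply tests each item at α/k. A quick check:

```python
alpha = 0.05

def familywise_error(k, alpha=alpha):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

for k in (5, 10, 20, 50):
    print(k, round(familywise_error(k), 2))  # roughly .23, .40, .64, .92

# Bonferroni: divide the conventional alpha by the number of proposed tests.
# For the ~40 tests in the published example, each test is held to:
print(alpha / 40)
```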

There is a second reason why such an approach is not useful. The whole point of a multi-item inventory is recognition that a single item is not sufficiently reliable to yield credible results. Indeed, a factor analysis is directed at providing guidance for creating subscales to identify underlying dimensions, and an overall internal consistency calculation is based on the assumption that all items are measuring the same underlying dimension. So analysis should be conducted at the scale or subscale level, not the item level.

Conclusion


While there is an extensive literature directed at understanding failure to replicate based on various methodological biases (Francis 2013; Schulz et al. 1995), there is, I think, inadequate recognition of the fact that non-replication is an architectural feature of Fisherian statistical inference.

Cohen, J. (2016). The earth is round (p < . 05). In What if there were no significance tests? (pp. 69–82). Routledge.

Norman, G. (2017). Generalization and the qualitative–quantitative debate. Advances in Health Sciences Education, 22(5), 1051–1055.


Adv Health Sci Educ Theory Pract. 2019 Oct;24(4):637-642. doi: 10.1007/s10459-019-09915-3.

Geoff Norman, McMaster University, Hamilton, ON, Canada. norman@mcmaster.ca