Likert scales, levels of measurement and the "laws" of statistics (Adv in Health Sci Educ, 2010)

Geoff Norman






Researchers are often frustrated by reviewers who take issue with the statistical methods they used.

One recurrent frustration in conducting research in health sciences is dealing with the reviewer who decides to take issue with the statistical methods employed.


Some of these comments, however, are simply wrong and say more about the reviewer's competence than about the study design.

Some of these comments, like the proscription on the use of ANOVA with small samples, the suggestion to use power analysis to determine if sample size was large enough to do a parametric test, or the concern that a significant result still might be a Type II error, are simply wrong and reveal more about the reviewer’s competence than the study design.


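Many objections are raised, but the question that remains is whether violating an assumption really increases the chance of an erroneous conclusion. Statisticians call this robustness. If the chance does not increase by much, we can press on.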

But what is left unsaid is how much it increases the chance of an erroneous conclusion. This is what statisticians call ‘‘robustness’’, the extent to which the test will give the right answer even when assumptions are violated. And if it doesn’t increase the chance very much (or not at all), then we can press on.


Parametric methods are highly versatile and powerful. Modern parametric methods assume normally distributed, interval-level data. Generalizability theory, likewise, is built on ANOVA and is therefore a parametric method.

It is critically important to take this next step, not simply because we want to avoid "coming to the wrong conclusion". As it turns out, parametric methods are incredibly versatile, powerful and comprehensive. Modern parametric statistical methods like factor analysis, hierarchical linear models, and structural equation models are all based on an assumption of normally distributed, interval-level data. Similarly, generalizability theory is based on ANOVA, which again is a parametric procedure.


Let us take the objections one at a time.

I will explore the impact of three characteristics (sample size, non-normality, and ordinal-level measurement) on the use of parametric methods. The arguments and responses:



1) You can’t use parametric tests in this study because the sample size is too small


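Nowhere among the assumptions of parametric methods is there a restriction on sample size. ANOVA and the t test rest on the same assumptions; for two groups, the F statistic from the ANOVA equals the square of the t statistic. Nor is there any evidence that non-parametric methods are more appropriate than parametric ones when the sample is small.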

This is the easiest argument to counter. The issue is not discussed in the statistics literature, and does not appear in statistics books, for one simple reason. Nowhere in the assumptions of parametric statistics is there any restriction on sample size. It is simply not true, for example, that ANOVA can only be used for large samples, and one should use a t test for smaller samples. ANOVA and t tests are based on the same assumptions; for two groups the F test from the ANOVA is the square of the t test. Nor is it the case that below some magical sample size, one should use non-parametric statistics. Nowhere is there any evidence that non-parametric tests are more appropriate than parametric tests when sample sizes get smaller.
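The identity between the two-group ANOVA and the t test can be checked by hand. The following is a minimal sketch using only the Python standard library; the two groups of ratings are hypothetical values chosen for illustration, and the function names are not from the paper.

```python
from statistics import mean, variance

def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

def one_way_f(groups):
    """One-way ANOVA F statistic: between-group over within-group mean square."""
    all_values = [x for g in groups for x in g]
    grand = mean(all_values)
    k, n = len(groups), len(all_values)
    ss_between = sum(len(g) * (mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Hypothetical 5-point ratings from two small groups (illustrative only).
group_a = [2, 3, 3, 4, 5]
group_b = [1, 2, 2, 3, 3]

t = pooled_t(group_a, group_b)
f = one_way_f([group_a, group_b])
print(f"t = {t:.4f}, t^2 = {t * t:.4f}, F = {f:.4f}")
```

With only five observations per group, the two statistics still agree exactly, which is the point of the argument: the choice between them has nothing to do with sample size.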


There is one situation in which non-parametric methods give an extremely conservative (that is, wrong) answer: dichotomizing the data. Dichotomizing can greatly reduce statistical power.

In fact, there is one circumstance where non-parametric tests will give an answer that can be extremely conservative (i.e. wrong). The act of dichotomizing data (for example, using final exam scores to create Pass and Fail groups and analyzing failure rates, instead of simply analyzing the actual scores), can reduce statistical power enormously.
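A small simulation can make the loss of power concrete. The sketch below is an assumption-laden illustration, not an analysis from the paper: exam scores are drawn from normal distributions with means 70 and 75 and standard deviation 10, the pass mark is 60, and there are 30 students per group. The scores are compared with a pooled t test, and the pass/fail counts with a two-proportion z test (the normal-approximation equivalent of the chi-square test on the 2 × 2 table).

```python
import random
from statistics import mean, variance

T_CRIT = 2.0017  # two-sided 5% critical value of t with 58 df
Z_CRIT = 1.96    # two-sided 5% critical value of the standard normal

def t_significant(a, b):
    """Pooled two-sample t test on the raw scores."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    t = (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5
    return abs(t) > T_CRIT

def proportion_significant(a, b, cutoff):
    """Two-proportion z test on failure rates after dichotomizing at cutoff."""
    na, nb = len(a), len(b)
    fails_a = sum(x < cutoff for x in a)
    fails_b = sum(x < cutoff for x in b)
    pooled = (fails_a + fails_b) / (na + nb)
    se = (pooled * (1 - pooled) * (1 / na + 1 / nb)) ** 0.5
    if se == 0:  # everyone passed or everyone failed: no evidence either way
        return False
    return abs(fails_a / na - fails_b / nb) / se > Z_CRIT

rng = random.Random(0)  # fixed seed so the run is reproducible
n_sims, n = 2000, 30
hits_scores = hits_pass_fail = 0
for _ in range(n_sims):
    control = [rng.gauss(70, 10) for _ in range(n)]  # hypothetical scores
    treated = [rng.gauss(75, 10) for _ in range(n)]
    hits_scores += t_significant(control, treated)
    hits_pass_fail += proportion_significant(control, treated, cutoff=60)

power_scores = hits_scores / n_sims
power_pass_fail = hits_pass_fail / n_sims
print(f"power on scores: {power_scores:.2f}, on pass/fail: {power_pass_fail:.2f}")
```

Under these assumptions the same true difference is detected far less often once the scores are collapsed to pass/fail, because most of the information in the scores is thrown away.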



This does not mean that sample size is unimportant. It can matter for several reasons.

Sample size is not unimportant. It may be an issue in the use of statistics for a number of reasons unrelated to the choice of test:

(a) With too small a sample, external validity can be a concern.

(a) With too small a sample, external validity is a concern. It is difficult to argue that 2 physicians or 3 nursing students are representative of anything (qualitative research notwithstanding). But this is an issue of judgment, not statistics.

(b) With a small sample there may be concern about the distributions, but about 5 per group is enough. The worry is not whether the test can be run but how robust it is.

(b) As we will see in the next section, when the sample size is small, there may be concern about the distributions. However, it turns out that the demarcation is about 5 per group. And the issue is not that one cannot do the test, but rather that one might begin to worry about the robustness of the test.

(c) With a small sample, a larger effect is needed to reach statistical significance. But "if it's significant, it's significant".

(c) Of course, small samples require larger effects to achieve statistical significance. But to say, as one reviewer said above, ‘‘Given the small number of participants in each group, can the authors claim statistical significance?’’, simply reveals a lack of understanding. If it’s significant, it’s significant. A small sample size makes the hurdle higher, but if you’ve cleared it, you’re there.



2) You can’t use t tests and ANOVA because the data are not normally distributed

This is likely one of the most prevalent myths. We all see the pretty bell curves used to illustrate z tests, t tests and the like in statistics books, and we learn that "parametric tests are based on the assumption of normality". Regrettably, we forget the last part of the sentence. For the standard t tests, ANOVAs, and so on, it is the assumption of normality of the distribution of means, not of the data. The Central Limit Theorem shows that, for sample sizes greater than 5 or 10 per group, the means are approximately normally distributed regardless of the original distribution. Empirical studies of robustness of ANOVA date all the way back to Pearson (1931) who found ANOVA was robust for highly skewed non-normal distributions and sample sizes of 4, 5 and 10. Boneau (1960) looked at normal, rectangular and exponential distributions and sample sizes of 5 and 15, and showed that 17 of the 20 calculated P-values were between .04 and .07 for a nominal 0.05. Thus both theory and data converge on the conclusion that parametric methods examining differences between means, for sample sizes greater than 5, do not require the assumption of normality, and will yield nearly correct answers even for manifestly non-normal and asymmetric distributions like exponentials.
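Robustness studies of this kind are easy to imitate. The sketch below is a simplified simulation in the spirit of those studies, not a reproduction of Pearson's or Boneau's procedures: both groups of 5 are drawn from the same exponential distribution, so every significant result is a false positive, and the pooled t statistic is compared with the tabulated critical value for 8 degrees of freedom. The seed and the number of replications are arbitrary choices.

```python
import random
from statistics import mean, variance

T_CRIT_DF8 = 2.306  # two-sided 5% critical value of t with 8 df

def pooled_t(a, b):
    """Student's two-sample t statistic with a pooled variance estimate."""
    na, nb = len(a), len(b)
    sp2 = ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    return (mean(a) - mean(b)) / (sp2 * (1 / na + 1 / nb)) ** 0.5

rng = random.Random(1)  # fixed seed so the run is reproducible
n_sims, n_per_group = 10000, 5
false_positives = 0
for _ in range(n_sims):
    # Both groups come from the same skewed population: the null is true.
    a = [rng.expovariate(1.0) for _ in range(n_per_group)]
    b = [rng.expovariate(1.0) for _ in range(n_per_group)]
    if abs(pooled_t(a, b)) > T_CRIT_DF8:
        false_positives += 1

type_i_rate = false_positives / n_sims
print(f"empirical Type I error at nominal 0.05: {type_i_rate:.3f}")
```

If the t test were badly misled by the skewed data, the empirical rate would sit far from the nominal 0.05; a rate in its neighborhood is what the robustness literature cited above leads one to expect.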



3) You can’t use parametric tests like ANOVA and Pearson correlations (or regression, which amounts to the same thing) because the data are ordinal and you can’t assume normality.


Even if individual Likert items are ordinal, a Likert scale formed by summing many items is interval.

The question, then, is how robust are Likert scales to departures from linear, normal distributions. There are actually three answers. The first, perhaps the least radical, is that (...) But their strongest argument appears to be that while Likert questions or items may well be ordinal, Likert scales, consisting of sums across many items, will be interval.



Numbers are just numbers.

The second approach, as elaborated by Gaito (1980), is that this is not a statistics question at all. The numbers ‘‘don’t know where they came from’’.


The computer can only draw conclusions about the numbers it is given. Those conclusions about the numbers are not wrong; what the researcher must judge is whether the analysis of the numbers reflects the underlying construct.

And all the computer can do is draw conclusions about the numbers themselves. So if the numbers are reasonably distributed, we can make inferences about their means, differences or whatever. We cannot, strictly speaking, make further inferences about differences in the underlying, latent, characteristic reflected in the Likert numbers, but this does not invalidate conclusions about the numbers. This is almost a "reductio ad absurdum" argument, and appears to solve the problem by making it someone else’s, but not the statistician’s problem. After all, someone has to decide whether the analysis done on the numbers reflects the underlying constructs, and Gaito provides no support for this inference.


The problems that non-normality might cause for ANOVA and similar tests were dealt with above.

So let us return to the more empirical approach that has been used to investigate robustness. As we showed earlier, ANOVA and other tests of central tendency are highly robust to things like skewness and non-normality. Since an ordinal distribution amounts to some kind of nonlinear relation between the number and the latent variable, then in my view the answer to the question of robustness with respect to ordinality is essentially answered by the studies cited above showing robustness with respect to non-normality. 


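But what about correlation and regression? Here we are no longer dealing with a distribution of means. What matters are the data at the extremes of the distribution, because they ultimately anchor the regression line. Skewness or non-normality could therefore give the wrong answer.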

However, when it comes to correlation and regression, this proscription cannot be dealt with quite so easily. The nature of regression and correlation methods is that they inherently deal with variation, not central tendency (Cronbach 1957). We are no longer talking about a distribution of means. Rather, the magnitude of the correlation is sensitive to individual data at the extremes of the distribution, as these ‘‘anchor’’ the regression line. So, conceivably, distortions in the distribution—skewness or non-linearity—could well ‘‘give the wrong answer’’.


If the Likert distributions are skewed or have some other undesirable property, whether to compute correlations or regression coefficients becomes a statistical question, and again one of robustness. The Central Limit Theorem offers no help here, but several studies are reassuring: the Pearson correlation is robust to skewness and non-normality.

If the Likert ratings are ordinal, which in turn means that the distributions are highly skewed or have some other undesirable property, then it is a statistical issue about whether or not we can go ahead and calculate correlations or regression coefficients. It again becomes an issue of robustness. If the distributions are not normal and linear, what happens to the correlations? This time, there is no "Central Limit Theorem" to provide theoretical confidence. However, there have been a number of studies that are reassuring. Pearson (1931, 1932a, b), Dunlap (1931) and Havlicek and Peterson (1976) have all shown, using theoretical distributions, that the Pearson correlation is robust with respect to skewness and non-normality. (...) They concluded that "The Pearson r is rather insensitive to extreme violations of the basic assumptions of normality and the type of scale".


The correlation between the Spearman and Pearson coefficients was 0.99 and the slope was 1.001, and the results were similar for severely skewed data. The two coefficients give almost identical results. With many tied ranks the Spearman gives slightly different answers, but that reflects how the Spearman handles ties, not a problem with the Pearson correlation. The Pearson correlation is highly robust even when these assumptions are violated.

For the original data, the correlation between Spearman and Pearson coefficients was 0.99, the slope was 1.001, and the intercept was -.007. Even with the severely skewed data, the correlation was still 0.987, the slope was 0.995, and the intercept was -.0003. The means of the Pearson and Spearman correlations were within 0.004 for all conditions. For this set of observations, the Pearson correlation and the Spearman correlation based on ranks yielded virtually identical values, even in conditions of manifestly non-normal, skewed data. Now it turns out that, when you have many tied ranks, the Spearman gives slightly different answers than the Pearson, but this reflects error in the Spearman way of dealing with ties, not a problem with the Pearson correlation. The Pearson correlation, like all parametric tests we have examined, is extremely robust with respect to violations of assumptions.
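The comparison can be tried on simulated ratings. The sketch below does not use the paper's data: it assumes a latent bivariate normal pair with correlation 0.6, cut into a skewed 5-point scale at arbitrary thresholds, and computes the Spearman coefficient as the Pearson correlation of average ranks. All names and parameter values are illustrative.

```python
import random

def pearson(x, y):
    """Pearson product-moment correlation."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

def to_skewed_likert(z):
    """Map a latent score to a 1-5 rating with unevenly spaced cut points."""
    cuts = [-0.5, 0.3, 1.0, 1.7]  # assumed thresholds; high ratings are rare
    return 1 + sum(z > c for c in cuts)

rng = random.Random(42)  # fixed seed so the run is reproducible
rho = 0.6                # assumed latent correlation
x, y = [], []
for _ in range(1000):
    z1 = rng.gauss(0, 1)
    z2 = rho * z1 + (1 - rho ** 2) ** 0.5 * rng.gauss(0, 1)
    x.append(to_skewed_likert(z1))
    y.append(to_skewed_likert(z2))

pearson_r = pearson(x, y)
spearman_r = spearman(x, y)
print(f"Pearson r = {pearson_r:.3f}, Spearman rho = {spearman_r:.3f}")
```

A single simulated pair of variables is far weaker evidence than the studies cited, but it shows how closely the two coefficients can track each other on skewed ordinal ratings with many ties.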



4) You cannot use an intraclass correlation (or Generalizability Theory) to compute the reliability because the data are nominal/ordinal and you have to use Kappa (or Weighted Kappa)


Kappa was originally developed as a "Coefficient of agreement for nominal scales" (Cohen 1960), and in its original form was based on agreement expressed in a 2 × 2 frequency table. Cohen (1968) later generalized the formulation to "weighted kappa", to be used with ordinal data such as Likert scales, where the data would be displayed as agreement in a 7 × 7 matrix. Weighting accounted for partial agreement (Observer 1 rates it 6; Observer 2 rates it 5). Although any weighting scheme is possible, the most common is "quadratic" weights, where disagreement of 1 unit is weighted 1, of 2 is weighted 4, of 3, 9, and so forth.


Surprisingly, if one proceeds to calculate an intraclass correlation with the same 7-point scale data, the results are mathematically identical, as proven by Fleiss and Cohen (1973). And if one computes an intraclass correlation from a 2 × 2 table, using "1" when there is agreement and "0" when there is not, the unweighted kappa is identical to an ICC. Since ICCs and G theory are much more versatile (Berk 1979), handling multiple observers and multiple factors with ease, this equivalence is very useful.
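The algebra behind this equivalence can be illustrated numerically. In the sketch below, quadratic-weighted kappa is computed from the observed and chance-expected squared disagreements, and compared with a variance ratio of the intraclass type, 2·cov / (var₁ + var₂ + (mean₁ − mean₂)²), using population (divide-by-n) moments. Treating this ratio as the intraclass correlation is an assumption of the sketch: it is the form in which a systematic difference between observers counts as disagreement, and sample ICC estimators differ from it by terms that shrink as n grows. The ratings are hypothetical.

```python
def quadratic_weighted_kappa(r1, r2, categories):
    """Weighted kappa with squared-difference disagreement weights."""
    n = len(r1)
    # Observed mean squared disagreement between the two observers.
    observed = sum((a - b) ** 2 for a, b in zip(r1, r2)) / n
    # Expected squared disagreement if the ratings were independent,
    # given each observer's marginal distribution.
    p1 = {c: r1.count(c) / n for c in categories}
    p2 = {c: r2.count(c) / n for c in categories}
    expected = sum(p1[i] * p2[j] * (i - j) ** 2
                   for i in categories for j in categories)
    return 1 - observed / expected

def variance_ratio(r1, r2):
    """2*cov / (var1 + var2 + squared mean difference), dividing by n."""
    n = len(r1)
    m1, m2 = sum(r1) / n, sum(r2) / n
    v1 = sum((a - m1) ** 2 for a in r1) / n
    v2 = sum((b - m2) ** 2 for b in r2) / n
    cov = sum((a - m1) * (b - m2) for a, b in zip(r1, r2)) / n
    return 2 * cov / (v1 + v2 + (m1 - m2) ** 2)

# Hypothetical 7-point ratings of 10 subjects by two observers.
observer_1 = [6, 5, 7, 3, 4, 2, 6, 5, 1, 4]
observer_2 = [5, 5, 6, 3, 5, 2, 7, 4, 2, 4]

kappa_w = quadratic_weighted_kappa(observer_1, observer_2, range(1, 8))
ratio = variance_ratio(observer_1, observer_2)
print(f"quadratic weighted kappa = {kappa_w:.6f}, variance ratio = {ratio:.6f}")
```

The two numbers agree to rounding error for any pair of rating vectors, because the observed squared disagreement expands to var₁ + var₂ − 2·cov + (mean₁ − mean₂)² and the chance-expected one to the same expression without the covariance term.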



Summary

Parametric statistics can be used with Likert data, with small sample sizes, with unequal variances, and with non-normal distributions, with no fear of ‘‘coming to the wrong conclusion’’. These findings are consistent with empirical literature dating back nearly 80 years. The controversy can cease (but likely won’t).





Adv in Health Sci Educ. 2010 Dec;15(5):625-32. doi: 10.1007/s10459-010-9222-y. Epub 2010 Feb 10.

Likert scales, levels of measurement and the "laws" of statistics.

Author information

  • McMaster University, 1200 Main St. W., Hamilton, ON, L8N3Z5, Canada. norman@mcmaster.ca

Abstract

Reviewers of research reports frequently criticize the choice of statistical methods. While some of these criticisms are well-founded, frequently the use of various parametric methods such as analysis of variance, regression, and correlation is faulted because: (a) the sample size is too small, (b) the data may not be normally distributed, or (c) the data are from Likert scales, which are ordinal, so parametric statistics cannot be used. In this paper, I dissect these arguments, and show that many studies, dating back to the 1930s, consistently show that parametric statistics are robust with respect to violations of these assumptions. Hence, challenges like those above are unfounded, and parametric methods can be utilized without concern for "getting the wrong answer".

PMID: 20146096 [PubMed - indexed for MEDLINE]


