5.1. Data Entry


Data entry for correlation, regression and multiple regression is straightforward because the data can be entered in columns. So, for each variable you have measured, create a variable in the spreadsheet with an appropriate name, and enter each subject’s scores across the spreadsheet. There may be occasions where you have one or more categorical variables (such as gender) and these variables can be entered in the same way but you must define appropriate value labels. For example, if we wanted to calculate the correlation between the number of adverts (advertising crisps!) a person saw and the number of packets of crisps they subsequently bought we would enter these data as in Figure 5.1.
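The column-per-variable layout described above can be sketched outside SPSS as well. A minimal illustration in Python using pandas, with invented numbers for the adverts/crisps example (these are not the figures from Figure 5.1):

```python
import pandas as pd

# Hypothetical data: one column per variable, one row per subject.
data = pd.DataFrame({
    "adverts": [5, 4, 4, 6, 8],   # number of adverts seen
    "packets": [8, 9, 10, 13, 15] # packets of crisps bought
})
print(data)
```

A categorical variable such as gender would simply be another column, coded numerically with value labels defined separately.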



5.2. Preliminary Analysis of the Data: the Scatterplot


Before conducting any kind of correlational analysis it is essential to plot a scatterplot and look at the shape of your data. A scatterplot is simply a graph that displays each subject’s scores on two variables (or three variables if you do a 3-D scatterplot). A scatterplot can tell you a number of things about your data, such as whether there seems to be a relationship between the variables, what kind of relationship it might be, and whether there are any cases that are markedly different from the others. A case that differs substantially from the general trend of the data is known as an outlier, and if there are such cases in your data they can severely bias the correlation coefficient. Therefore, we can use a scatterplot to show us if any data points are grossly incongruent with the rest of the data set.
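The text produces scatterplots through SPSS menus; as a rough equivalent, here is a minimal sketch with matplotlib using invented anxiety/performance scores (the last point is deliberately made to sit apart from the trend, as an outlier might):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; writes to a file
import matplotlib.pyplot as plt

# Invented scores; the final pair deliberately breaks the downward trend.
anxiety = [70, 65, 80, 60, 75, 55, 20]
exam    = [40, 50, 30, 55, 35, 60, 95]

plt.scatter(anxiety, exam)
plt.xlabel("Exam anxiety")
plt.ylabel("Exam performance")
plt.savefig("scatter.png")
```

Inspecting the resulting plot before computing any coefficient is exactly the preliminary check the paragraph recommends.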



5.3. Bivariate Correlation


Once a preliminary glance has been taken at the data, we can proceed to conducting the actual correlation. Pearson’s Product Moment Correlation Coefficient and Spearman’s Rho should be familiar to most students and are examples of a bivariate correlation. The dialogue box to conduct a bivariate correlation can be accessed by the menu path Analyze ⇒ Correlate ⇒ Bivariate… and is shown in Figure 5.5.



5.3.1. Pearson’s Correlation Coefficient


Pearson’s coefficient requires parametric data, but in practice this statistic is extremely robust.

For those of you unfamiliar with basic statistics (which shouldn’t be any of you … !), it is not meaningful to talk about means unless we have data measured at an interval or ratio level. As such, Pearson’s coefficient requires parametric data because it is based upon the average deviation from the mean. However, in reality it is an extremely robust statistic.



If your data are nonparametric, you should deselect the Pearson tick-box.

This is perhaps why the default option in SPSS is to perform a Pearson’s correlation. However, if your data are nonparametric then you should deselect the Pearson tick-box. The data from the exam performance study are parametric and so a Pearson’s correlation can be applied. The dialogue box (Figure 5.5) allows you to specify whether the test will be one- or two-tailed. 
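Outside SPSS, a Pearson correlation of the kind described here can be computed with `scipy.stats.pearsonr`. The scores below are invented for illustration (they are not the exam data from the text):

```python
from scipy import stats

# Invented anxiety and exam-performance scores for eight subjects.
anxiety = [82, 77, 70, 65, 60, 55, 50, 45]
exam    = [35, 40, 55, 50, 60, 65, 70, 72]

r, p = stats.pearsonr(anxiety, exam)
print(round(r, 3), round(p, 4))  # r is negative: performance falls as anxiety rises
```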


Using one-tailed and two-tailed tests

One-tailed tests should be used when there is a specific direction to the hypothesis being tested, and two-tailed tests should be used when a relationship is expected but the direction of the relationship is not predicted.


Our researcher predicted that at higher levels of anxiety exam performance would be poor and that less anxious students would do well. Therefore, the test should be one-tailed because she is predicting a relationship in a particular direction. What’s more, a positive correlation between revision time and exam performance is also expected, so this too is a one-tailed test.
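One way to see the one-tailed/two-tailed distinction numerically: when the observed correlation lies in the predicted direction, the one-tailed p-value is half the two-tailed one. A sketch with invented data (a negative correlation is the predicted direction here, mirroring the anxiety example):

```python
from scipy import stats

# Invented scores; hypothesis: anxiety and performance correlate negatively.
anxiety = [82, 77, 70, 65, 60, 55, 50, 45]
exam    = [35, 40, 55, 50, 60, 65, 70, 72]

r, p_two = stats.pearsonr(anxiety, exam)  # two-tailed by default
# If r falls in the predicted direction, halve the two-tailed p.
p_one = p_two / 2 if r < 0 else 1 - p_two / 2
print(p_one < p_two)
```

The one-tailed test is therefore more powerful, but only legitimate when the direction was specified in advance.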



5.3.1.1. A Word of Warning about Interpretation: Causality


Correlation coefficients must be interpreted with great care, because they give no information about causality.

A considerable amount of caution must be taken when interpreting correlation coefficients because they give no indication of causality. So, in our example, although we can conclude that exam performance goes down as anxiety about that exam goes up, we cannot say that high exam anxiety causes bad exam performance. This is for two reasons:


The third variable

·  The Third Variable Problem: In any bivariate correlation causality between two variables cannot be assumed because there may be other measured or unmeasured variables affecting the results. This is known as the ‘third variable’ problem or the ‘tertium quid’. In our example you can see that revision time does relate significantly to both exam performance and exam anxiety, and there is no way of telling which of the two independent variables, if either, is causing exam performance to change. So, if we had measured only exam anxiety and exam performance we might have assumed that high exam anxiety caused poor exam performance. However, it is clear that poor exam performance could be explained equally well by a lack of revision. There may be several additional variables that influence the correlated variables, and these variables may not have been measured by the researcher. So, there could be another, unmeasured, variable that affects both revision time and exam anxiety.


Direction of causality

·  Direction of Causality: Correlation coefficients say nothing about which variable causes the other to change. Even if we could ignore the third variable problem described above, and we could assume that the two correlated variables were the only important ones, the correlation coefficient doesn’t indicate in which direction causality operates. So, although it is intuitively appealing to conclude that exam anxiety causes exam performance to change, there is no statistical reason why exam performance cannot cause exam anxiety to change. Although the latter conclusion makes no human sense (because anxiety was measured before exam performance), the correlation does not tell us that it isn’t true.



5.3.1.2. Using r² for Interpretation


Although we cannot draw direct conclusions about causality, we can draw conclusions about variability by squaring the correlation coefficient.

Although we cannot make direct conclusions about causality, we can draw conclusions about variability by squaring the correlation coefficient, which gives a measure of how much of the variability in one variable is explained by the other.


For example, consider the relationship between exam anxiety and exam performance. Exam performances vary from subject to subject because of any number of factors (different ability, different levels of preparation and so on). If we add all of this variability (rather like when we calculated the sum of squares in chapter 1) then we would get an estimate of how much variability exists in exam performances. We can then use r² to tell us how much of this variability is accounted for by exam anxiety. These variables had a correlation of −0.4410. The value of r² will therefore be (−0.4410)² = 0.194. This tells us how much of the variability in exam performance exam anxiety accounts for. If we convert this value into a percentage (simply multiply by 100) we can say that exam anxiety accounts for 19.4% of the variability in exam performance. So, although exam anxiety was highly correlated to exam performance, it can account for only 19.4% of variation in exam scores. To put this value into perspective, this leaves 80.6% of the variability still to be accounted for by other variables.
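The arithmetic in the paragraph above can be checked in a couple of lines:

```python
# Correlation between exam anxiety and exam performance, from the text.
r = -0.4410

r_squared = r ** 2
print(round(r_squared, 3))               # 0.194  (proportion of variability explained)
print(round(r_squared * 100, 1))         # 19.4   (as a percentage)
print(round((1 - r_squared) * 100, 1))   # 80.6   (percentage left unexplained)
```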


Although r² is very useful, it cannot be used to infer causal relationships.

I should note at this point that although r² is an extremely useful measure of the substantive significance of an effect, it cannot be used to infer causal relationships. Although we usually talk in terms of ‘the variance in Y accounted for by X’ or even the variation in one variable explained by the other, this says nothing of which way causality runs. So, although exam anxiety can account for 19.4% of the variation in exam scores, it does not necessarily cause this variation.



5.3.2. Spearman’s Rho

It is a nonparametric statistic, used when parametric or distributional assumptions have been violated.

Spearman’s correlation coefficient is a nonparametric statistic and so can be used when the data have violated parametric assumptions and/or the distributional assumptions. Spearman’s test works by first ranking the data, and then applying Pearson’s equation to those ranks.


The Spearman correlation is used when a variable is ordinal rather than interval.

As an example of nonparametric data, a drugs company was interested in the effects of steroids on cyclists. To test the effect they measured each cyclist’s position in a race (whether they came first, second or third etc.) and how many steroid tablets each athlete had taken before the race. Both variables are nonparametric, because neither of them was measured at an interval level. The position in the race is ordinal data because the exact difference between the abilities of the cyclists is unclear. It could be that the first athlete won by several metres while the remainder crossed the line simultaneously some time later, or it could be that first and second place were very tightly contested but the remainder were very far behind. The Spearman correlation coefficient is used because one of these variables is ordinal, not interval.
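The "rank first, then apply Pearson's equation" description can be verified directly. A sketch with invented race positions and tablet counts (not data from the study in the text):

```python
from scipy import stats

# Invented data: finishing position (ordinal) and tablets taken before the race.
position = [1, 2, 3, 4, 5, 6, 7, 8]
tablets  = [10, 8, 9, 5, 6, 4, 2, 1]

rho, _ = stats.spearmanr(position, tablets)

# Spearman's rho is just Pearson's r computed on the ranked data:
r_on_ranks, _ = stats.pearsonr(stats.rankdata(position),
                               stats.rankdata(tablets))
print(abs(rho - r_on_ranks) < 1e-12)  # the two values agree
```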



5.3.3. Kendall’s Tau (nonparametric)

Kendall’s tau is also a nonparametric correlation, but it is used when the data set is small and contains many tied ranks. Although Spearman’s statistic is better known, many argue that Kendall’s statistic is actually the better estimate of the correlation.

Kendall’s tau is another nonparametric correlation, and it should be used rather than Spearman’s coefficient when you have a small data set with a large number of tied ranks. This means that if you rank all of the scores and many scores have the same rank, Kendall’s tau should be used. Although Spearman’s statistic is the more popular of the two coefficients, there is much to suggest that Kendall’s statistic is actually a better estimate of the correlation in the population (see Howell, 1992, p. 279). As such, we can draw more accurate generalisations from Kendall’s statistic than from Spearman’s.
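`scipy.stats.kendalltau` computes the tie-corrected version of the statistic (tau-b), which is the relevant one for the tied-ranks situation described above. A sketch with a small, invented data set containing many ties:

```python
from scipy import stats

# Small invented data set with many tied ranks.
x = [1, 1, 2, 2, 3, 3, 4]
y = [2, 1, 3, 3, 4, 4, 5]

tau, p = stats.kendalltau(x, y)  # tau-b: adjusts for ties by default
print(round(tau, 3))
```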


(출처 : http://www.statisticshell.com/docs/correlation.pdf)


