The Relationship Between Variables

Understanding Bivariate Regression

All files, software, and tutorials that make up SABLE Copyright (c) 1997 1998 1999 Virginia Tech. You may use these programs under the conditions of the SABLE General License, which incorporates the GNU GENERAL PUBLIC LICENSE.

Smoking causes cancer. Fluoride toothpaste prevents cavities. 78,000 college graduates will buy new cars in the month of June.

You see bold statements and predictions such as these on the news and in advertisements all the time, but have you stopped to wonder how the claims were reached? How was it concluded that smoking and cancer are related? How can "they" predict how many new cars will be sold?

The ability to understand causes and predict outcomes of events is critical in business, medicine, education, government, and, in fact, nearly every facet of our lives. We intuitively have the ability to detect that variables are related. The concepts of regression and correlation, however, provide us with a means to establish concrete evidence of such relationships.

Relating Variables

From known data, we can determine if values of variables are linearly related, meaning a straight line can be used to summarize the data. Suppose Joe is writing a science report on properties of certain metals. He finds several metals whose melting points are between the melting and boiling points of water. His chemistry book lists their melting points in degrees Celsius, but Joe wishes to also give them in degrees Fahrenheit. Below is a graph he has begun. The dark blue points show the freezing and boiling points for water. The melting points for three of the metals are also graphed. He hasn't yet calculated the Fahrenheit melting points for cesium (red) and rubidium (green). Complete the graph by "dragging" the red and green dots up to their proper Fahrenheit values. When you have correctly placed the points, click the "Done" button.

The relationship between degrees Celsius and degrees Fahrenheit is given by the equation of the line:

y (degrees F) = 32 + 9/5 * x (degrees C)

This equation has two numerical values (32 and 9/5) and two variables (degrees Celsius and degrees Fahrenheit). When relating two variables, the variable being predicted is typically labeled as "y" and the variable used to predict y is generally labeled as "x". In other words, we predict the value of y because it somehow relates to the value of x. In the above equation, degrees Fahrenheit is the y variable and degrees Celsius is the x variable.

The first numerical value in the equation is 32. This value represents the obvious fact that 0 degrees C is the same as 32 degrees F. Generally, this first numerical term in an equation representing a linear relationship between two variables indicates the value of y when x is zero, and this value is labeled the "y-intercept".

The second numerical value in the equation is 9/5, and it is the multiplier for the x variable. The value of 9/5 indicates that there is a 9/5-degree increase in Fahrenheit for every one-degree increase in Celsius. In an equation representing a linear relationship between two variables, the second numerical term generally is a multiplier that gives the slope of the regression line seen in the graph of the data and is labeled the "regression coefficient".
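The roles of the y-intercept and the regression coefficient can be checked with a few lines of Python. This is only a minimal sketch: the function name is made up for illustration, and the melting points used for cesium and rubidium are approximate reference values assumed here, not figures taken from Joe's chemistry book.

```python
def celsius_to_fahrenheit(c):
    """Apply y = 32 + (9/5) * x, where x is degrees Celsius."""
    intercept = 32.0    # y-intercept: the value of y when x is 0
    slope = 9.0 / 5.0   # change in y for each 1-unit increase in x
    return intercept + slope * c

# Freezing and boiling points of water (the dark blue points in the text).
freezing_f = celsius_to_fahrenheit(0)
boiling_f = celsius_to_fahrenheit(100)

# Approximate melting points in degrees Celsius (assumed reference values).
cesium_f = celsius_to_fahrenheit(28.4)
rubidium_f = celsius_to_fahrenheit(39.3)

print(freezing_f, boiling_f)
print(round(cesium_f, 1), round(rubidium_f, 1))
```

Because the relationship is perfect, every point produced this way falls exactly on the line.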

The above example is easy to understand because there is a perfect relationship between degrees Celsius and Fahrenheit. That is, knowing the temperature in degrees Celsius allows one to predict the temperature in degrees Fahrenheit with perfect accuracy. The line connecting the points in the above graph is simply the conversion equation between the two measurement scales. In the behavioral sciences, however, the variables of interest are not perfectly related. We can use the data to determine if a linear relationship between the variables exists. If so, a regression line may be calculated from the data values. Without a perfect linear relationship, the regression line will not connect all the data points. Rather, it is the line which comes closest to all the data, making it the best general representation of the data set. Consider the following example.

The graph below shows an increasing number of students graduating from the Sociology department at Imaginary U. over the past 10 years. Use the mouse to "drag" the left and right endpoints of the line to draw the line you think best represents the data. Click "Done!" to see how close you are to the true regression line.

Chances are, you were able to come pretty close to the correct regression line, which the computer calculated by finding the line through the data that minimizes the sum of the squared vertical distances between the points and the line. This is known as a line of best fit, which is another name for the regression line. The line of best fit is important because it gives the predicted y value for any given x value. Scientists interested in the relationship between two variables typically quantify this relationship by representing the line of best fit as a mathematical equation known as a regression equation. The general form of a bivariate (two-variable) regression equation is:

Bivariate Regression Equation: y = a + b_yx x

Where x and y represent the variables being studied, a is the y-intercept, and b_yx is the regression coefficient.
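As an illustration of how a and b_yx might be computed, the sketch below uses the standard least-squares formulas: the slope is the sum of cross-products of deviations divided by the sum of squared x deviations, and the intercept follows from the two means. The function name and the graduation counts are hypothetical, invented for this example; they are not the data shown in the tutorial's graph.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of cross-products of deviations, and sum of squared x deviations.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx             # regression coefficient (slope)
    a = mean_y - b * mean_x   # y-intercept
    return a, b

# Hypothetical data: year (1 to 10) and number of graduates that year.
years = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
graduates = [12, 15, 14, 18, 21, 20, 24, 27, 26, 30]

a, b = fit_line(years, graduates)
print(f"y = {a:.2f} + {b:.2f} * x")
```

Unlike the temperature example, these points do not all fall on the line; the fitted line is simply the one with the smallest sum of squared vertical misses.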

Important points about the regression coefficient:

If b_yx = 0, then the correlation between x and y is zero and it is concluded that the variables are not linearly related.
If b_yx is negative, then the correlation between x and y is negative, meaning that scores on y increase as scores on x decrease and vice versa.
In practical terms, b_yx represents the average increase (or decrease) in y for each 1-unit increase in x.
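The link between the sign of the regression coefficient and the sign of the correlation can be seen by computing both from the same sums. The sketch below is a minimal illustration with hypothetical data sets; the function name and the numbers are assumptions made for the example.

```python
import math

def slope_and_correlation(xs, ys):
    """Return (b, r): the least-squares slope and the Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    b = sxy / sxx                    # slope shares its numerator with r
    r = sxy / math.sqrt(sxx * syy)   # so b and r always have the same sign
    return b, r

x_values = [1, 2, 3, 4, 5]

# Hypothetical scores that fall as x rises: negative slope and correlation.
b_neg, r_neg = slope_and_correlation(x_values, [9, 8, 8, 6, 5])

# Hypothetical scores that rise and then fall: slope and correlation near 0,
# even though the variables are related in a non-linear way.
b_zero, r_zero = slope_and_correlation(x_values, [2, 4, 6, 4, 2])

print(b_neg, r_neg)
print(b_zero, r_zero)
```

The second data set is a reminder that a zero coefficient rules out only a linear relationship, not every kind of relationship.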

Predicting Values for Variables

We can use a known relationship between variables to predict scores for future samples. Research that is done for predictive purposes uses the following steps:

A sample of subjects is recruited, and measures are taken on both the x and y variables.
A regression analysis is conducted that mathematically establishes the line of best fit between the two variables.
Future subjects are measured only on the x variable.
For each future subject, the predicted score on the y variable is the height of the line of best fit at that subject's score on the x variable.

In this situation, the regression equation is often called the "prediction" equation.
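The four steps above can be sketched in code. This is a minimal example under stated assumptions: the variables (hours studied and exam score), the data, and the function names are all hypothetical and chosen only to make the steps concrete.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def predict(a, b, x):
    """Read the predicted y off the line of best fit at a given x."""
    return a + b * x

# Step 1: a sample measured on both x and y (hypothetical data).
hours_studied = [2, 4, 6, 8, 10]
exam_score = [55, 62, 70, 79, 84]

# Step 2: establish the line of best fit.
a, b = fit_line(hours_studied, exam_score)

# Step 3: future subjects are measured on x only.
new_hours = [3, 7]

# Step 4: their predicted y scores come from the prediction equation.
predicted = [predict(a, b, x) for x in new_hours]
print(predicted)
```

Predictions of this kind are most trustworthy for x values inside the range covered by the original sample.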

Use the following activity to explore prediction equations calculated from data that you create.

Assumptions of Regression Analysis

Regression analysis is a powerful tool that can be used in a number of ways. Regression analysis can be used to describe populations, to make predictions about other subjects in the population, or even to test causal hypotheses. However, it is important to note that the use of regression analysis requires assumptions. One of the many assumptions is that the y-variable is normally distributed, and this places some limitations on using regression analysis. Fortunately, if the y-variable deviates from normality, the resulting bias in the estimation of the regression coefficient typically is small. The normal distribution assumption is more limiting in that only ratio and interval measurement scales can be normally distributed. Technically, it is not appropriate to use nominal and ordinal measures as the y-variable in regression analysis. However, behavioral scientists often use ordinal scales as y-variables in regression analyses. This typically occurs when the theoretical concept being measured by the ordinal scale is assumed to be continuous.

For example, intelligence test scores are really ordinal measures because there is no evidence that the units of measurement represent equal intervals. That is, there is no evidence that the difference in intelligence between the IQ scores of 95 and 96 is equal to the difference in intelligence between the IQ scores of 105 and 106. Nonetheless, the behavioral scientist will typically assume that the intervals are equal due to the continuous nature of the concept being measured. In short, any theoretically continuous variable that is measured in some manner where the intervals appear equal is frequently used as the y-variable in regression analysis.

Practically speaking, there are no limiting assumptions for the x-variable. As such, the x-variable can be measured on any measurement scale (i.e., nominal, ordinal, interval, or ratio). The following visualization allows you to examine real data scatterplots and the corresponding regression lines when using x-variables measured on different types of scales.
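To see what a nominal x-variable looks like in a regression, consider a two-category variable coded 0 and 1. In that case the y-intercept works out to the mean of the group coded 0, and the regression coefficient is the difference between the two group means. The sketch below assumes hypothetical control and treatment groups with made-up scores; the coding scheme and names are illustrative only.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line y = a + b * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    b = sxy / sxx
    return mean_y - b * mean_x, b

# Hypothetical nominal x: 0 = control group, 1 = treatment group.
group = [0, 0, 0, 0, 1, 1, 1, 1]
score = [10, 12, 11, 13, 15, 17, 16, 18]

a, b = fit_line(group, score)

# Group means computed directly, for comparison with a and b.
control = [y for g, y in zip(group, score) if g == 0]
treatment = [y for g, y in zip(group, score) if g == 1]
mean_control = sum(control) / len(control)
mean_treatment = sum(treatment) / len(treatment)

print(a, b)
print(mean_control, mean_treatment - mean_control)
```

Nominal variables with more than two categories need more than one 0/1 column, which goes beyond the bivariate case covered here.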

Updated March 16, 1998