Linear regression is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through the data by searching for the values of the regression coefficients that minimize the model’s total error.
This post focuses on linear regression in R and how to perform it.
Definition: Linear regression in R
Linear regression is a supervised learning method used to predict a continuous dependent variable from the values of one or more independent variables.3
The two primary forms of linear regression are:
- Simple linear regression
- Multiple linear regression
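In equation form, the two models can be written as follows. The notation is the usual textbook one and is an assumption here, since the original post does not define any symbols:

```latex
% Simple linear regression: one independent variable x
y = \beta_0 + \beta_1 x + \varepsilon

% Multiple linear regression: p independent variables x_1, \dots, x_p
y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_p x_p + \varepsilon
```

Here y is the dependent variable, \beta_0 is the intercept, the remaining \beta terms are the regression coefficients, and \varepsilon is the error term. With ordinary least squares, which is what R's lm() function uses, "minimizing the total error" means minimizing the sum of squared residuals.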
Dataset for simple linear regression in R
The first dataset contains observations of adult income, ranging from $20k to $80k, and satisfaction rated on a scale from 1 to 10, in an imaginary sample of 400 individuals. The income values are divided by 10,000 so that they are on a scale comparable to the satisfaction ratings. An income value of 1 therefore represents $10,000, a value of 2 represents $20,000, and so on.
Dataset for multiple linear regression in R
The second dataset contains observations of the percentage of people who drink alcohol, have ulcers, and drive to work every day in each city of an imaginary sample of 500 cities.
Download the datasets for linear regression in R from the following links: simple regression and multiple regression.
Linear regression in R: Getting started
The first step of running linear regression in R is to download and install R and RStudio. Next, open RStudio and click on File, New File, then R Script.
As you proceed from one step to the next, copy and paste the code in the text boxes directly into your script. Then, run the code by highlighting the relevant lines and clicking on “Run” or pressing Ctrl + Enter.2
Run the code below to install the first package needed for your analysis:
Finally, run this code to load the packages into your R environment (do this each time you restart R):1
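The package code itself is missing from this copy of the post. Judging from the functions used in the visualization step (ggplot() and stat_regline_equation()), at least ggplot2 and ggpubr are needed. The sketch below assumes those two packages and installs them only if they are not already available; the package list and the CRAN mirror URL are assumptions, not the author's original code.

```r
# Assumed package list: ggplot2 provides ggplot(), ggpubr provides stat_regline_equation().
required.packages <- c("ggplot2", "ggpubr")

# Find the packages that are not installed yet.
missing.packages <- required.packages[
  !vapply(required.packages, requireNamespace, logical(1), quietly = TRUE)
]

# Install only what is missing (the mirror URL is an assumption; any CRAN mirror works).
if (length(missing.packages) > 0) {
  install.packages(missing.packages, repos = "https://cloud.r-project.org")
}

# Load the packages; repeat this each time you restart R.
library(ggplot2)
library(ggpubr)
```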
Step 1 of simple linear regression in R: Loading data
Follow the steps below to load your data into R:1
- In RStudio, go to File, Import Dataset, then choose From Text
- Select your data file and the import dataset window will show up
- The data frame window will display an X (index) column alongside columns listing the data for each of your variables
- Finish by clicking on “Import”
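The same import can also be scripted with read.csv(). To stay self-contained, the sketch below writes a tiny made-up CSV to a temporary file and reads it back. The column names (income, satisfaction) and the three rows are hypothetical stand-ins, not values from the real dataset.

```r
# Hypothetical stand-in for the downloaded file: a few made-up rows in a temporary CSV.
csv.path <- tempfile(fileext = ".csv")
writeLines(c("income,satisfaction",
             "2.5,2.1",
             "4.0,3.3",
             "6.5,4.9"), csv.path)

# With the real download, replace csv.path with the path to your own file.
income.data <- read.csv(csv.path)

# Inspect the structure: both columns should have been read as numeric.
str(income.data)
```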
Use summary() to check if the loaded data has been read correctly.2
Use this code to see if the simple regression dataset has been correctly loaded:
Both variables in this dataset are quantitative, so summary() returns a numeric summary for each: the minimum, first quartile, median, mean, third quartile, and maximum of the independent variable (income) and of the dependent variable (satisfaction).
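As a rough sketch of this check, the code below builds a simulated stand-in for the income dataset and summarizes it. The data frame name income.data matches the later steps of this post, but the column names, the random seed, and the coefficients used to generate the values are all assumptions made for illustration.

```r
# Simulated stand-in for the income dataset (400 rows); all values are made up.
set.seed(1)
income.data <- data.frame(income = runif(400, min = 2, max = 8))
income.data$satisfaction <- 0.2 + 0.7 * income.data$income + rnorm(400, sd = 0.7)

# Minimum, quartiles, median, mean, and maximum of every numeric column.
summary(income.data)
```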
Use this code to check if the multiple regression dataset has been correctly loaded:
Running this function yields the same kind of numeric summary for the independent variables (drinking and driving) and the dependent variable (ulcers).
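The equivalent check for the second dataset might look like this. Again, ulcers.data here is a simulated stand-in: the column names follow the variables described above, while the value ranges, the seed, and the generating coefficients are assumptions.

```r
# Simulated stand-in for the 500-city dataset; all values are made up.
set.seed(2)
ulcers.data <- data.frame(
  drinking = runif(500, min = 1, max = 30),
  driving  = runif(500, min = 1, max = 70)
)
ulcers.data$ulcers <- 15 - 0.2 * ulcers.data$driving +
  0.178 * ulcers.data$drinking + rnorm(500, sd = 0.65)

# One numeric summary per column: drinking, driving, ulcers.
summary(ulcers.data)
```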
Step 2 of linear regression in R: Assumptions
Using R, you can check whether your data meet the four key assumptions of linear regression:5
- Independence of observations
- Normality
- Linearity
- Homogeneity of variance (homoscedasticity)
For the simple regression, there is only one independent variable and one dependent variable, so you do not need to test for hidden correlations among independent variables. However, if you know that the observations are autocorrelated (for example, repeated measurements of the same subject), do not proceed with simple linear regression. Instead, use a structured model such as a linear mixed-effects model.5
Use the hist() function to check whether the dependent variable follows a normal distribution.6
You can test linearity with a scatter plot. If the points can reasonably be described by a straight line, the relationship is linear.6
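The normality and linearity checks for the simple regression can be sketched with base R graphics. The data frame below is a simulated stand-in (made-up values, assumed column names income and satisfaction), so the plots only illustrate the calls, not the real dataset.

```r
# Simulated stand-in for the income dataset; all values are made up.
set.seed(1)
income.data <- data.frame(income = runif(400, min = 2, max = 8))
income.data$satisfaction <- 0.2 + 0.7 * income.data$income + rnorm(400, sd = 0.7)

# Normality: histogram of the dependent variable.
hist(income.data$satisfaction,
     main = "Distribution of satisfaction", xlab = "Satisfaction")

# Linearity: scatter plot of the dependent variable against the independent variable.
plot(satisfaction ~ income, data = income.data)
```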
The last assumption is homogeneity of variance (homoscedasticity): the prediction error should not change significantly over the model’s prediction range. You can check this assumption after fitting the linear model for simple regression in R.5
For the multiple regression, use the cor() function to test independence of observations: check the correlation between your independent variables and make sure they are not too highly correlated.3
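A minimal sketch of that check, assuming the independent variables are stored in columns named drinking and driving of a data frame called ulcers.data (simulated here with made-up values):

```r
# Simulated stand-in for the 500-city dataset; all values are made up.
set.seed(2)
ulcers.data <- data.frame(
  drinking = runif(500, min = 1, max = 30),
  driving  = runif(500, min = 1, max = 70)
)

# Pearson correlation between the two independent variables.
# A value close to 0 suggests they can both be included in the model.
predictor.correlation <- cor(ulcers.data$drinking, ulcers.data$driving)
predictor.correlation
```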
Use the hist() function to test whether the dependent variable follows a normal distribution.2
The resulting histogram is roughly bell-shaped, so you can continue with the linear regression in R.
You can use two scatter plots to check for linearity.1 Plot driving against ulcers in one and drinking against ulcers in the other.
Proceed with linear regression in R if both relationships appear linear.
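The histogram and the two scatter plots for the multiple regression could be produced as follows. This is a sketch on simulated data: ulcers.data, its column names, and the coefficients used to generate the values are assumptions for illustration only.

```r
# Simulated stand-in for the 500-city dataset; all values are made up.
set.seed(2)
ulcers.data <- data.frame(
  drinking = runif(500, min = 1, max = 30),
  driving  = runif(500, min = 1, max = 70)
)
ulcers.data$ulcers <- 15 - 0.2 * ulcers.data$driving +
  0.178 * ulcers.data$drinking + rnorm(500, sd = 0.65)

# Normality: histogram of the dependent variable.
hist(ulcers.data$ulcers, main = "Distribution of ulcers", xlab = "Ulcers (%)")

# Linearity: one scatter plot per independent variable.
plot(ulcers ~ driving, data = ulcers.data)
plot(ulcers ~ drinking, data = ulcers.data)
```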
The homoscedasticity assumption is easier to check after the model has been fitted.
Step 3 of linear regression in R: Analysis
After determining that your data meet the assumptions of linear regression in R, you can proceed to the analysis for evaluating the link between your variables.1
Here, you will test the relationship between income and satisfaction. Simple linear regression in R takes two lines of code: the first fits the linear model, and the second prints the model summary.2
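Those two lines might look like the sketch below. The model name income.satisfaction.lm follows the name used in the homoscedasticity step of this post; the data are a simulated stand-in with made-up values and assumed column names.

```r
# Simulated stand-in for the income dataset; all values are made up.
set.seed(1)
income.data <- data.frame(income = runif(400, min = 2, max = 8))
income.data$satisfaction <- 0.2 + 0.7 * income.data$income + rnorm(400, sd = 0.7)

# Line 1: fit the linear model (satisfaction explained by income).
income.satisfaction.lm <- lm(satisfaction ~ income, data = income.data)

# Line 2: print the model summary.
summary(income.satisfaction.lm)
```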
The Coefficients section of the output displays:
- Estimates of the model parameters
- Standard error of the estimated values
- The t-value (test statistic)
- The p-value, shown as Pr(>|t|)
The final three lines of the output are model diagnostics: the residual standard error, the R-squared values, and the F-statistic with its p-value. The p-value indicates whether there is a significant relationship between the two variables.3
You can use multiple regression to test the relationship between ulcers, drinking, and driving. Fit a linear model with ulcers as the dependent variable and drinking and driving as the independent variables.4
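A sketch of the multiple regression fit is shown below. The model name ulcers.disease.lm follows the name used in the visualization step of this post; the data frame is simulated, and its values and generating coefficients are assumptions.

```r
# Simulated stand-in for the 500-city dataset; all values are made up.
set.seed(2)
ulcers.data <- data.frame(
  drinking = runif(500, min = 1, max = 30),
  driving  = runif(500, min = 1, max = 70)
)
ulcers.data$ulcers <- 15 - 0.2 * ulcers.data$driving +
  0.178 * ulcers.data$drinking + rnorm(500, sd = 0.65)

# Fit ulcers as a function of both independent variables.
ulcers.disease.lm <- lm(ulcers ~ drinking + driving, data = ulcers.data)

# Print coefficient estimates, standard errors, t-values, and p-values.
summary(ulcers.disease.lm)
```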
Step 4 of linear regression in R: Homoscedasticity
Data visualization will help you check whether your model meets the homoscedasticity assumption.1
Run plot(income.satisfaction.lm) to check that this assumption is met.
The code will produce residual plots, which you can use to determine whether the data meet the homoscedasticity assumption.6
For the multiple regression, run the same plot() check on the fitted model (ulcers.disease.lm).
A lack of bias in the residuals indicates that the model meets the homoscedasticity assumption.
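The residual check could be sketched as follows for the multiple regression model; the same calls work for the simple model. Arranging the four diagnostic plots in a 2 x 2 grid with par(mfrow = ...) is a layout choice made here, and the data are simulated with made-up values.

```r
# Simulated stand-in for the 500-city dataset; all values are made up.
set.seed(2)
ulcers.data <- data.frame(
  drinking = runif(500, min = 1, max = 30),
  driving  = runif(500, min = 1, max = 70)
)
ulcers.data$ulcers <- 15 - 0.2 * ulcers.data$driving +
  0.178 * ulcers.data$drinking + rnorm(500, sd = 0.65)
ulcers.disease.lm <- lm(ulcers ~ drinking + driving, data = ulcers.data)

# Show the four standard diagnostic plots in a 2 x 2 grid.
par(mfrow = c(2, 2))
plot(ulcers.disease.lm)

# Restore the default one-plot layout.
par(mfrow = c(1, 1))
```

In the "Residuals vs Fitted" panel, residuals scattered evenly around a roughly horizontal line at zero suggest that the assumption holds.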
Step 5 of linear regression in R: Visualize
The next step is data visualization using a graph. Plotting the data together with the regression line from the linear regression model makes the results easy to share.1
Follow these steps to visualize the simple regression results:
Plot the data points on a graph
income.graph <- ggplot(income.data, aes(x = income, y = satisfaction)) + geom_point()
Add the linear regression line to the plotted data
income.graph <- income.graph + geom_smooth(method = "lm", col = "black")
Add the regression line equation
income.graph <- income.graph +
  stat_regline_equation(label.x = 3, label.y = 7)
Prepare the graph for publication
income.graph +
  labs(title = "Reported satisfaction as a function of income",
       x = "Income (x$10,000)",
       y = "Satisfaction score (1 to 10)")
This will produce a finished graph that you can include in your papers.
Visualizing the multiple regression is more challenging than visualizing the simple regression, because there are two independent variables.5
Follow these steps:
Create a new data frame with the necessary information
plotting.data <- expand.grid(
  driving = seq(min(ulcers.data$driving), max(ulcers.data$driving), length.out = 30),
  drinking = c(min(ulcers.data$drinking), mean(ulcers.data$drinking), max(ulcers.data$drinking)))
This will produce a data frame in the Environment tab that you can click to review.
Predict the values of ulcers based on the linear model
plotting.data$predicted.y <- predict(ulcers.disease.lm, newdata = plotting.data)
Round the drinking values to two decimals
plotting.data$drinking <- round(plotting.data$drinking, digits = 2)
Change the drinking variable into a factor
plotting.data$drinking <- as.factor(plotting.data$drinking)
Plot the original data
ulcers.plot <- ggplot(ulcers.data, aes(x = driving, y = ulcers)) + geom_point()
Add the regression lines
ulcers.plot <- ulcers.plot +
  geom_line(data = plotting.data, aes(x = driving, y = predicted.y, color = drinking), linewidth = 1.25)
Prepare the graph for publication
ulcers.plot <- ulcers.plot +
  labs(title = "Rates of ulcers (% of population) \n as a function of driving to work and drinking",
       x = "Driving to work (% of population)",
       y = "Ulcers (% of population)",
       color = "Drinking \n (% of population)")
Add the regression model equation to your graph
ulcers.plot + annotate(geom = "text", x = 30, y = 1.75, label = "ulcers = 15 + (-0.2*driving) + (0.178*drinking)")
You can add the finished graph to your paper.
Step 6 of linear regression in R: Report
Add the graph to your paper and include a brief statement explaining the results.1
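The numbers typically quoted in such a statement (the coefficient estimate, its confidence interval, the p-value, and R-squared) can be pulled straight from the fitted model. The sketch below does this for the simple regression on simulated stand-in data, so the printed values are illustrative only and are not results from the real dataset.

```r
# Simulated stand-in for the income dataset; all values are made up.
set.seed(1)
income.data <- data.frame(income = runif(400, min = 2, max = 8))
income.data$satisfaction <- 0.2 + 0.7 * income.data$income + rnorm(400, sd = 0.7)
income.satisfaction.lm <- lm(satisfaction ~ income, data = income.data)

# Coefficient table: Estimate, Std. Error, t value, Pr(>|t|).
coef.table <- coef(summary(income.satisfaction.lm))
slope.estimate <- coef.table["income", "Estimate"]
slope.p.value  <- coef.table["income", "Pr(>|t|)"]

# 95% confidence interval for the slope, and the model's R-squared.
slope.ci  <- confint(income.satisfaction.lm)["income", ]
r.squared <- summary(income.satisfaction.lm)$r.squared

# Print the values to quote in the report.
round(c(estimate = slope.estimate, lower = slope.ci[[1]],
        upper = slope.ci[[2]], r.squared = r.squared), 3)
```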
Linear regression uses a straight line to describe the relationship between variables. It finds the line of best fit through the data by searching for the regression coefficient values that minimize the model’s total error.
The two primary types of linear regression are:
- Simple linear regression, which uses one independent variable and one dependent variable
- Multiple linear regression, which uses two or more independent variables and one dependent variable
1 Zach. “How to Perform Simple Linear Regression in R (Step-by-Step).” Statology. October 26, 2020. https://www.statology.org/simple-linear-regression-in-r/.
2 Lateef, Zulaikha. “A Step By Step Guide To Linear Regression In R.” Edureka!. May 19, 2020. https://www.edureka.co/blog/linear-regression-in-r/.
3 datasciencebeginners. “Step-By-Step Guide On How To Build Linear Regression In R (With Code).” R Bloggers. May 16, 2020. https://www.r-bloggers.com/2020/05/step-by-step-guide-on-how-to-build-linear-regression-in-r-with-code/.
4 DataCamp. “Multiple Linear Regression in R: Tutorial With Examples.” December 2022. https://www.datacamp.com/tutorial/multiple-linear-regression-r-tutorial.
5 Khandelwal, Renu. “A Step by Step Guide to Multiple Linear Regression in R.” Medium. December 14, 2021. https://arshren.medium.com/a-step-by-step-guide-to-multiple-linear-regression-in-r-a85d270f70f7.
6 Johnson, Daniel. “R Stepwise & Multiple Linear Regression [Step by Step Example].” Guru99. March 11, 2023. https://www.guru99.com/r-simple-multiple-linear-regression.html.