**Linear regression** is a regression model that uses a straight line to describe the relationship between variables. It finds the line of best fit through the data by searching for the values of the regression coefficients that minimize the model’s total error.

This post focuses on linear regression in R and how to perform it.

## Definition: Linear regression in R

Linear regression is a supervised learning method used to predict a continuous dependent variable from the values of one or more independent variables.^{3}

The **two primary forms** of linear regression are:

- Simple linear regression
- Multiple linear regression

**Dataset for simple linear regression in R**

The first dataset contains observations of adult income, ranging from $20k to $80k, and satisfaction, rated on a scale from 1 to 10, in an imaginary sample of 400 individuals. The income values are divided by 10,000 so that they are on a scale comparable to the satisfaction scores. An income value of 1 therefore represents $10,000, a value of 2 represents $20,000, and so on.

**Dataset for multiple linear regression in R**

The second set of data features observations of the percentage of people that drink alcohol, have ulcers, and drive to work every day in an imaginary sample of 500 cities.

Download the datasets for linear regression in R from the following links: simple regression and multiple regression.


## Linear regression in R: Getting started

The first step in running linear regression in R is to download and install R and RStudio. Next, open RStudio and click File > New File > R Script.

As you proceed from one step to the next, copy and paste the code from each step directly into your script. Then, run the code by highlighting the relevant lines and clicking “Run” or pressing Ctrl + Enter.^{2}

**Run the code** below to install the packages needed for your analysis (you only need to do this once):

install.packages("ggplot2")

install.packages("dplyr")

install.packages("broom")

install.packages("ggpubr")

Finally, **run this code** to load the packages into your R environment (do this each time you restart R):^{1}

library(ggplot2)

library(dplyr)

library(broom)

library(ggpubr)
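If you are not sure which of these packages are already installed, a small check like the one below can help. It is only a sketch: the variable names `pkgs`, `installed`, and `missing` are made up for this illustration, and the code only reports missing packages rather than installing them.

```r
# Packages used in this tutorial
pkgs <- c("ggplot2", "dplyr", "broom", "ggpubr")

# requireNamespace() returns TRUE if a package can be loaded, FALSE otherwise
installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)

# Names of the packages that still need install.packages()
missing <- pkgs[!installed]

if (length(missing) > 0) {
  message("Not installed yet: ", paste(missing, collapse = ", "))
} else {
  message("All packages are installed.")
}
```

You can then pass `missing` to install.packages() if it is not empty.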

## Step 1 of simple linear regression in R: Loading data

Follow the steps below to load your data into R:^{1}

- In RStudio, go to File > Import Dataset > From Text (base)
- Select your data file, and the Import Dataset window will open
- The Data Frame preview should show an X (index) column followed by a column for each of your variables
- Finish by clicking “Import”

Use summary() to check that the data has been read in correctly.^{2}
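If you prefer to load the file from your script instead of the menu, the sketch below shows the general idea. Because the real data file is not bundled with this post, the example first writes a simulated stand-in to a temporary CSV file. The values, the slope of 0.7, and the name `csv.path` are assumptions made for illustration only.

```r
# Simulated stand-in for the income dataset (hypothetical values, not the real file)
set.seed(42)
income <- runif(400, min = 2, max = 8)   # income in units of $10,000
sim <- data.frame(income = income,
                  satisfaction = 0.2 + 0.7 * income + rnorm(400, sd = 0.7))

# Write the data to a temporary CSV file; the default row.names = TRUE
# adds an unnamed index column to the file
csv.path <- tempfile(fileext = ".csv")
write.csv(sim, csv.path)

# Read the file back in; the unnamed index column is imported as "X"
income.data <- read.csv(csv.path)

str(income.data)       # column names and types
summary(income.data)   # numeric summary of every column
```

With your own data, replace `csv.path` with the path to the downloaded file.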

### Simple regression

Use this **code** to see if the simple regression dataset has been correctly loaded:^{4}

summary(income.data)

Both variables in this dataset are quantitative, so summary() returns a numeric summary for each of them: the minimum, first quartile, median, mean, third quartile, and maximum of the independent variable (income) and of the dependent variable (satisfaction).

### Multiple regression

Use this **code** to check if the multiple regression dataset has been correctly loaded:^{4}

summary(heart.data)

Running this function yields a numeric summary of the two independent variables, drinking and driving, and of the dependent variable, ulcers.

## Step 2 of linear regression in R: Assumptions

Using R, you can check whether your data meet the four key assumptions of linear regression. These **assumptions** are:^{5}

- Independence of observations
- Normality
- Linearity
- Homogeneity of variance

### Simple regression

**Independence of observations:** Since there is only one independent variable and one dependent variable, you do not need to test for hidden correlations between variables. However, if you know that your data are autocorrelated (for example, repeated observations of the same subject), do not proceed with simple linear regression in R. Instead, use a structured model such as a linear mixed-effects model.^{5}

**Normality:** The hist() function will help you check whether the dependent variable follows a roughly normal distribution.^{6}

**Linearity:** You can test linearity using a scatter plot. If the points can be described reasonably well with a straight line, the relationship is linear.^{6}

**Homogeneity of variance:** This assumption means that the prediction error does not change significantly over the model’s prediction range. You can check it after fitting the linear model for simple regression in R.^{5}
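The normality and linearity checks described above might look like the following in practice. This is a minimal sketch that uses simulated data in place of the downloaded file. The column names `income` and `satisfaction` and the simulated slope of 0.7 are assumptions for illustration.

```r
# When run as a script rather than in RStudio, draw to a null device
if (!interactive()) pdf(NULL)

# Simulated stand-in for income.data (hypothetical values)
set.seed(42)
income <- runif(400, min = 2, max = 8)
income.data <- data.frame(income = income,
                          satisfaction = 0.2 + 0.7 * income + rnorm(400, sd = 0.7))

# Normality: histogram of the dependent variable
hist(income.data$satisfaction)

# Linearity: scatter plot of the dependent variable against the independent variable
plot(satisfaction ~ income, data = income.data)
```

Look for a roughly bell-shaped histogram and a point cloud that follows a straight line.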

### Multiple regression

**Independence of observations:** The cor() function will help you test the correlation between your independent variables and make sure they are not too highly correlated.^{3}

**Normality:** The hist() function will help you test whether the dependent variable follows a normal distribution.^{2}

If the resulting histogram is roughly bell-shaped, you can continue with the linear regression in R.

**Linearity:** You can use two scatter plots to check for linearity: one for driving and ulcers and another for drinking and ulcers.^{1}

Proceed with linear regression in R if both relationships appear roughly linear.

**Homogeneity of variance:** This assumption is easier to check after fitting the model.
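The checks for the multiple regression dataset can be sketched in the same way. The data below are simulated, not the downloaded dataset. The column names `driving`, `drinking`, and `ulcers.disease` and the coefficients used to generate the data (15, -0.2, and 0.178, borrowed from the model equation shown in Step 5) are assumptions for illustration.

```r
# When run as a script rather than in RStudio, draw to a null device
if (!interactive()) pdf(NULL)

# Simulated stand-in for heart.data (hypothetical values)
set.seed(1)
n <- 500
heart.data <- data.frame(driving = runif(n, min = 1, max = 75),
                         drinking = runif(n, min = 0.5, max = 30))
heart.data$ulcers.disease <- 15 - 0.2 * heart.data$driving +
  0.178 * heart.data$drinking + rnorm(n, sd = 0.65)

# Independence: correlation between the two independent variables
cor(heart.data$driving, heart.data$drinking)

# Normality: histogram of the dependent variable
hist(heart.data$ulcers.disease)

# Linearity: one scatter plot per independent variable
plot(ulcers.disease ~ driving, data = heart.data)
plot(ulcers.disease ~ drinking, data = heart.data)
```

A correlation close to zero suggests that both predictors can be included in the same model.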

## Step 3 of linear regression in R: Analysis

After determining that your data meet the assumptions of linear regression in R, you can proceed to the analysis for evaluating the link between your variables.^{1}

### Simple regression

Here, you should check the relationship between income levels and satisfaction scales. You will need to run two code lines to perform simple linear regression in R and check out the results. The first line of code is the linear model, while the second produces the model summary.^{2}

The Coefficients section of the output displays:

- Estimates of the model parameters
- The standard error of each estimate
- The t-value (test statistic)
- The p-value

The final three lines of the output are model diagnostics. The most important of these is the p-value, which tells you whether there is a significant relationship between the two variables.^{3}
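The two lines of code described above, and the coefficient table they produce, might look like this. It is a sketch on simulated data, so the numbers will differ from those of the real dataset. The model name `income.satisfaction.lm` and the column names are assumptions.

```r
# Simulated stand-in for income.data (hypothetical values)
set.seed(42)
income <- runif(400, min = 2, max = 8)
income.data <- data.frame(income = income,
                          satisfaction = 0.2 + 0.7 * income + rnorm(400, sd = 0.7))

# Line 1: fit the linear model (dependent ~ independent)
income.satisfaction.lm <- lm(satisfaction ~ income, data = income.data)

# Line 2: print the model summary
summary(income.satisfaction.lm)

# The coefficient table alone, as a matrix with four columns:
# Estimate, Std. Error, t value, Pr(>|t|)
coef(summary(income.satisfaction.lm))
```

The row labelled `income` holds the estimated slope, that is, the expected change in satisfaction for each one-unit ($10,000) increase in income.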

### Multiple regression

You can use multiple regression to test the link between ulcers, drinking, and driving. You should use a linear model of ulcers as the dependent variable, and drinking and driving as the independent variables.^{4}
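A model of this form could be fitted as sketched below. The data are simulated, and the data frame name `heart.data`, the model name `ulcers.disease.lm`, and the column names are assumptions for illustration.

```r
# Simulated stand-in for heart.data (hypothetical values)
set.seed(1)
n <- 500
heart.data <- data.frame(driving = runif(n, min = 1, max = 75),
                         drinking = runif(n, min = 0.5, max = 30))
heart.data$ulcers.disease <- 15 - 0.2 * heart.data$driving +
  0.178 * heart.data$drinking + rnorm(n, sd = 0.65)

# Fit the model: one dependent variable, two independent variables joined by "+"
ulcers.disease.lm <- lm(ulcers.disease ~ driving + drinking, data = heart.data)

# Print the model summary
summary(ulcers.disease.lm)
```

The summary has the same layout as for simple regression, with one row per independent variable in the Coefficients section.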

## Step 4 of linear regression in R: Homoscedasticity

**Data visualization** will help you check the **homoscedasticity** of your data and confirm that this assumption of linear regression holds.^{1}

### Simple regression

Run plot(income.satisfaction.lm) to check whether this assumption is met.

The code produces residual plots, which you can use to determine whether the data meet the homoscedasticity assumption.^{6}

### Multiple regression

Run plot(ulcers.disease.lm) to produce the same residual plots for the multiple regression model.

If the residuals show no bias, that is, no clear pattern around zero, the model meets the homoscedasticity assumption.
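By default, plot() on a fitted model shows the diagnostic plots one after another. One way to see all four at once is to split the plotting window into a 2 x 2 grid first, as in this sketch. It uses simulated data, and the object names are assumptions for illustration.

```r
# When run as a script rather than in RStudio, draw to a null device
if (!interactive()) pdf(NULL)

# Simulated stand-in for heart.data (hypothetical values)
set.seed(1)
n <- 500
heart.data <- data.frame(driving = runif(n, min = 1, max = 75),
                         drinking = runif(n, min = 0.5, max = 30))
heart.data$ulcers.disease <- 15 - 0.2 * heart.data$driving +
  0.178 * heart.data$drinking + rnorm(n, sd = 0.65)

ulcers.disease.lm <- lm(ulcers.disease ~ driving + drinking, data = heart.data)

par(mfrow = c(2, 2))      # split the plotting window into 2 rows and 2 columns
plot(ulcers.disease.lm)   # four diagnostic plots, including residuals vs fitted
par(mfrow = c(1, 1))      # restore the single-plot layout
```

In the residuals vs fitted panel, points scattered evenly around the horizontal zero line suggest that the variance is roughly constant.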

## Step 5 of linear regression in R: Visualize

The next step is to visualize the results in a graph. Plotting the data together with the regression line from the model makes your results easy to share.^{1}

### Simple regression

Follow these steps to visualize the results of the simple regression:

**Plot the data points on a graph**

income.graph <- ggplot(income.data, aes(x = income, y = satisfaction)) + geom_point()

income.graph

**Add the linear regression line to the plotted data**

income.graph <- income.graph + geom_smooth(method = "lm", col = "black")

income.graph

**Add the regression line equation**

income.graph <- income.graph +

stat_regline_equation(label.x = 3, label.y = 7)

income.graph

**Prepare the graph for publication**

income.graph +

theme_bw() +

labs(title = "Reported satisfaction as a function of income",

x = "Income (x$10,000)",

y = "Satisfaction score (1 to 10)")

This produces a finished graph that you can include in your papers.

### Multiple regression

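Visualizing a multiple regression is more challenging than visualizing a simple one because there are two independent variables. The steps below plot the relationship between driving and ulcers at three levels of drinking.^{5}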

Follow these steps:

**Create a new data frame with the necessary information**

plotting.data <- expand.grid(

driving = seq(min(heart.data$driving), max(heart.data$driving), length.out = 30),

drinking = c(min(heart.data$drinking), mean(heart.data$drinking), max(heart.data$drinking)))

This creates a data frame in the Environment tab that you can click to review.

**Predict the values of ulcers based on the linear model**

plotting.data$predicted.y <- predict.lm(ulcers.disease.lm, newdata = plotting.data)

**Round the drinking values to two decimals**

plotting.data$drinking <- round(plotting.data$drinking, digits = 2)

**Change the drinking variable into a factor**

plotting.data$drinking <- as.factor(plotting.data$drinking)
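To see what these data preparation steps produce, you can run them end to end on simulated data, as in the sketch below. The simulated values and the object names `heart.data` and `ulcers.disease.lm` are assumptions for illustration, and the example calls the generic predict() function, which dispatches to predict.lm() for a linear model.

```r
# Simulated stand-in for heart.data (hypothetical values)
set.seed(1)
n <- 500
heart.data <- data.frame(driving = runif(n, min = 1, max = 75),
                         drinking = runif(n, min = 0.5, max = 30))
heart.data$ulcers.disease <- 15 - 0.2 * heart.data$driving +
  0.178 * heart.data$drinking + rnorm(n, sd = 0.65)

ulcers.disease.lm <- lm(ulcers.disease ~ driving + drinking, data = heart.data)

# Every combination of 30 driving values and 3 drinking levels (min, mean, max)
plotting.data <- expand.grid(
  driving = seq(min(heart.data$driving), max(heart.data$driving), length.out = 30),
  drinking = c(min(heart.data$drinking), mean(heart.data$drinking), max(heart.data$drinking)))

# Predicted value of the dependent variable for each combination
plotting.data$predicted.y <- predict(ulcers.disease.lm, newdata = plotting.data)

# Round the drinking levels and convert them to a factor for use as a colour legend
plotting.data$drinking <- as.factor(round(plotting.data$drinking, digits = 2))

head(plotting.data)
nrow(plotting.data)   # 30 driving values x 3 drinking levels
```

Each of the three drinking levels becomes one regression line in the final plot.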

**Plot the original data**

heart.plot <- ggplot(heart.data, aes(x = driving, y = ulcers.disease)) + geom_point()

**Add the regression lines**

heart.plot <- heart.plot +

geom_line(data = plotting.data, aes(x = driving, y = predicted.y, color = drinking), size = 1.25)

heart.plot

**Prep the graph for publication**

heart.plot <-

heart.plot +

theme_bw() +

labs(title = "Rates of ulcers disease (% of the population) \n as a function of driving to work and drinking",

x = "Driving to work (% of population)",

y = "Ulcers (% of population)",

color = "Drinking \n (% of population)")

heart.plot

**Add the regression model equation to your graph**

heart.plot + annotate(geom = "text", x = 30, y = 1.75, label = " = 15 + (-0.2*driving) + (0.178*drinking)")

You can add the finished graph to your paper.

## Step 6 of linear regression in R: Report

Add the graph to your paper and include a brief **explanatory statement** that summarizes the results.^{1}
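One way to collect the numbers for such a statement is to pull them directly from the fitted model, as in this sketch. It uses simulated data, so the values are not those of the real dataset, and the wording of the printed sentence is only an example.

```r
# Simulated stand-in for income.data (hypothetical values)
set.seed(42)
income <- runif(400, min = 2, max = 8)
income.data <- data.frame(income = income,
                          satisfaction = 0.2 + 0.7 * income + rnorm(400, sd = 0.7))
income.satisfaction.lm <- lm(satisfaction ~ income, data = income.data)

coefs <- coef(summary(income.satisfaction.lm))    # estimates, standard errors, t and p values
ci <- confint(income.satisfaction.lm)              # 95% confidence intervals by default
r2 <- summary(income.satisfaction.lm)$r.squared    # share of variance explained

# Assemble the key numbers into one line of text
cat(sprintf("Slope for income: %.2f (95%% CI %.2f to %.2f), R-squared = %.2f\n",
            coefs["income", "Estimate"], ci["income", 1], ci["income", 2], r2))
```

Report the p-value from `coefs["income", "Pr(>|t|)"]` alongside these numbers.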


## FAQs

**Linear regression in R** is a technique that finds the line of best fit through research data by searching for the value of the regression coefficient that minimizes the model’s total error.

Linear regression is a form of regression that utilizes straight lines to describe the link between variables.

The **two primary types** of linear regression are:

- Simple linear regression
- Multiple linear regression

Simple linear regression uses **one** independent variable and one dependent variable, while multiple linear regression uses **more than one** independent variable.

^{1} Zach. “How to Perform Simple Linear Regression in R (Step-by-Step).” Statology. October 26, 2020. https://www.statology.org/simple-linear-regression-in-r/.

^{2} Lateef, Zulaikha. “A Step By Step Guide To Linear Regression In R.” Edureka!. May 19, 2020. https://www.edureka.co/blog/linear-regression-in-r/.

^{3} datasciencebeginners. “Step-By-Step Guide On How To Build Linear Regression In R (With Code).” R Bloggers. May 16, 2020. https://www.r-bloggers.com/2020/05/step-by-step-guide-on-how-to-build-linear-regression-in-r-with-code/.

^{4} DataCamp. “Multiple Linear Regression in R: Tutorial With Examples.” December 2022. https://www.datacamp.com/tutorial/multiple-linear-regression-r-tutorial.

^{5} Khandelwal, Renu. “A Step by Step Guide to Multiple Linear Regression in R.” Medium. December 14, 2021. https://arshren.medium.com/a-step-by-step-guide-to-multiple-linear-regression-in-r-a85d270f70f7.

^{6} Johnson, Daniel. “R Stepwise & Multiple Linear Regression [Step by Step Example].” Guru99. March 11, 2023. https://www.guru99.com/r-simple-multiple-linear-regression.html.