Linear Regression Model for predicting the concrete compressive strength based on the concrete's ingredients and age

Behroz Ahmad Ali

bhrz.ali@gmail.com

Compressive strength is the resistance of a material or structure to breaking under compressive forces. The compressive strength of concrete largely determines how it performs in service, so studying it is of considerable practical importance.

The characteristics of concrete depend on the types of ingredients used and their proportions. The main constituents of concrete are cement, water and aggregates in varying proportions, but other materials are usually added to the mix to obtain the required compressive strength and properties.

As concrete cures and hardens over time, its compressive strength increases. The required compressive strength of concrete can vary from 17 MPa for residential purposes up to 70 MPa for some commercial applications.

The dataset used here contains the quantities of the constituents of concrete and its compressive strength after a given number of days. The original owner of the dataset is Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Taiwan. The dataset is freely available in the UCI Machine Learning Repository.

Information about Dataset

The following information has been taken from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

The information given in the dataset:

Cement (component 1) -- quantitative -- kg in a m3 mixture -- Input Variable

Blast Furnace Slag (component 2) -- quantitative -- kg in a m3 mixture -- Input Variable

Fly Ash (component 3) -- quantitative -- kg in a m3 mixture -- Input Variable

Water (component 4) -- quantitative -- kg in a m3 mixture -- Input Variable

Superplasticizer (component 5) -- quantitative -- kg in a m3 mixture -- Input Variable

Coarse Aggregate (component 6) -- quantitative -- kg in a m3 mixture -- Input Variable

Fine Aggregate (component 7) -- quantitative -- kg in a m3 mixture -- Input Variable

Age -- quantitative -- Day (1~365) -- Input Variable

Concrete compressive strength -- quantitative -- MPa -- Output Variable

Linear Regression Model

Here we will create a linear regression model that can predict the compressive strength of concrete based on its ingredients and age.

In [6]:
#Read the data
concrete1 = read.csv("Concrete_Data.csv",header=TRUE,na.strings="?")
#Remove rows with missing values if any.
concrete1 = na.omit(concrete1)
#Make the names of the headings shorter
concrete = concrete1
names(concrete) = c('cement_component','blast_furnace_slag','fly_ash','water','superplasticizer',
                    'coarse_aggregate','fine_aggregate','age','compressive_strength')
In [13]:
#Sample 200 row indices to hold out as test data; the remaining rows are used for training
set.seed(5)
test = sample(nrow(concrete),200)
In [17]:
#Creating a Linear Regression model using all the attributes (Input Variable).
lm.model = lm(compressive_strength~.,data=concrete,subset=-test)
summary(lm.model)
Call:
lm(formula = compressive_strength ~ ., data = concrete, subset = -test)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.663  -6.460   0.848   7.056  33.956 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -21.785271  30.346643  -0.718 0.473035    
cement_component     0.122450   0.009728  12.587  < 2e-16 ***
blast_furnace_slag   0.105964   0.011604   9.131  < 2e-16 ***
fly_ash              0.092771   0.014412   6.437 2.07e-10 ***
water               -0.166748   0.045777  -3.643 0.000287 ***
superplasticizer     0.207143   0.105462   1.964 0.049850 *  
coarse_aggregate     0.017539   0.010728   1.635 0.102456    
fine_aggregate       0.022011   0.012165   1.809 0.070763 .  
age                  0.115364   0.006066  19.019  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.62 on 821 degrees of freedom
Multiple R-squared:  0.6131,	Adjusted R-squared:  0.6094 
F-statistic: 162.6 on 8 and 821 DF,  p-value: < 2.2e-16
In [23]:
#Mean Squared Error of the lm.model on the test data
predict_strength = predict(lm.model,concrete[test,])
mean((concrete[test,]$compressive_strength-predict_strength)^2)
90.1101752537359

p-values

At the 5% significance level, the p-values of the t-statistics are below 0.05 for every predictor except coarse_aggregate (p ≈ 0.10) and fine_aggregate (p ≈ 0.07). Superplasticizer is only marginally significant (p ≈ 0.0499), and the intercept is not significant (p ≈ 0.47). This suggests that, given the other predictors, coarse_aggregate and fine_aggregate contribute little to the fit, so we next try a model without them.
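The p-values can also be read programmatically instead of by eye. The following is a minimal sketch, assuming R's built-in mtcars dataset as a stand-in for the concrete data; the model formula and the 0.05 threshold are illustrative choices, not part of the original analysis.

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
fit <- lm(mpg ~ wt + hp + qsec, data = mtcars)

# coef(summary(fit)) is a matrix with the columns
# "Estimate", "Std. Error", "t value" and "Pr(>|t|)"
coef_table <- coef(summary(fit))
p_values <- coef_table[, "Pr(>|t|)"]

# Predictors whose p-value exceeds the (assumed) 0.05 threshold
not_significant <- setdiff(names(p_values)[p_values > 0.05], "(Intercept)")

print(round(p_values, 4))
print(not_significant)
```

The same two lines applied to lm.model would list the candidates for removal directly.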

In [18]:
#Removing the coarse_aggregate and fine_aggregate attributes from the model
lm.model2 = lm(compressive_strength~.-coarse_aggregate-fine_aggregate,data=concrete,subset=-test)
summary(lm.model2)
Call:
lm(formula = compressive_strength ~ . - coarse_aggregate - fine_aggregate, 
    data = concrete, subset = -test)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.291  -6.272   0.700   6.876  34.039 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        32.130338   4.807333   6.684 4.30e-11 ***
cement_component    0.107023   0.004816  22.221  < 2e-16 ***
blast_furnace_slag  0.087499   0.005661  15.458  < 2e-16 ***
fly_ash             0.071776   0.008681   8.268 5.45e-16 ***
water              -0.236707   0.023979  -9.871  < 2e-16 ***
superplasticizer    0.169648   0.095845   1.770   0.0771 .  
age                 0.114440   0.006039  18.949  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.63 on 823 degrees of freedom
Multiple R-squared:  0.6116,	Adjusted R-squared:  0.6087 
F-statistic:   216 on 6 and 823 DF,  p-value: < 2.2e-16
In [24]:
#Mean Squared Error of the lm.model2 on test data
predict_strength2 = predict(lm.model2,concrete[test,])
mean((concrete[test,]$compressive_strength-predict_strength2)^2)
90.2706287255054

Removing the coarse_aggregate and fine_aggregate attributes does not help: the adjusted R-squared drops slightly (from 0.6094 to 0.6087) and the mean squared error on the test data rises slightly (from 90.11 to 90.27). Therefore we will go ahead with the first model, lm.model.
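Plain R-squared can never increase when predictors are removed, so a nested-model comparison is usually made with adjusted R-squared or a partial F-test. The sketch below shows one way to do this with anova(), again assuming the built-in mtcars data as a stand-in; the formulas are hypothetical and this test was not part of the original analysis.

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
full    <- lm(mpg ~ wt + hp + qsec + drat, data = mtcars)
reduced <- lm(mpg ~ wt + hp, data = mtcars)

# Partial F-test: the null hypothesis is that the coefficients of the
# dropped predictors (qsec and drat) are all zero
cmp <- anova(reduced, full)
print(cmp)

# Adjusted R-squared penalises extra predictors, so it can be
# compared across models of different size
adj <- c(reduced = summary(reduced)$adj.r.squared,
         full    = summary(full)$adj.r.squared)
print(adj)
```

A large p-value in the second row of the anova table would support keeping the smaller model.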

Residual Standard Error

In [34]:
#Residual Standard Error of lm.model
sqrt(deviance(lm.model)/lm.model$df.residual)
10.6166364099954
In [36]:
#Mean response of the training data
mean(concrete[-test,]$compressive_strength)
35.9509545058014

The residual standard error of lm.model is about 10.62 MPa and the mean response of the training data is about 35.95 MPa (roughly 36). The residual standard error as a percentage of the mean response is as follows.

In [37]:
#Error rate
10.62/36*100
29.5
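The residual standard error is sqrt(RSS / (n - p - 1)), where n is the number of training rows and p the number of predictors; for lm.model this gives the 821 degrees of freedom reported by summary() (830 - 8 - 1). A minimal sketch of the calculation, assuming the built-in mtcars data as a stand-in for the concrete data:

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

rss <- sum(residuals(fit)^2)   # residual sum of squares
n   <- nrow(mtcars)            # number of observations
p   <- length(coef(fit)) - 1   # number of predictors (intercept excluded)

rse_manual   <- sqrt(rss / (n - p - 1))
rse_deviance <- sqrt(deviance(fit) / df.residual(fit))  # form used above
rse_summary  <- summary(fit)$sigma                      # value printed by summary()

# Residual standard error as a percentage of the mean response
rse_percent <- rse_manual / mean(mtcars$mpg) * 100
print(c(rse_manual, rse_deviance, rse_summary, rse_percent))
```

All three ways of computing the residual standard error agree up to floating-point error.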

R-squared value

$$TSS = \sum \left( y_{i}-\bar{y} \right)^{2}$$
$$RSS = \sum \left( y_{i}-\hat{y}_{i} \right)^{2}$$
$$R^{2} = \frac{TSS-RSS}{TSS}$$
In [40]:
summary(lm.model)$r.squared
0.613134978332615

The R-squared value of lm.model is 0.61, which indicates that the model explains about 61% of the variability in the training response.
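The definition of R-squared can be checked by computing TSS and RSS by hand and comparing the result with the value reported by summary(). A minimal sketch, assuming the built-in mtcars data as a stand-in for the concrete data:

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

y     <- mtcars$mpg
y_hat <- fitted(fit)

tss <- sum((y - mean(y))^2)   # total sum of squares
rss <- sum((y - y_hat)^2)     # residual sum of squares

r2_manual  <- (tss - rss) / tss
r2_summary <- summary(fit)$r.squared
print(c(r2_manual, r2_summary))
```

The two values should match up to floating-point error.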

Correlation between the response in the test data and the predictions of lm.model

In [44]:
cor(concrete[test,]$compressive_strength,predict(lm.model,concrete[test,]))
0.790447328395033

The correlation between the observed response in the test data and the predictions of lm.model is approximately 0.79. This is a reasonably strong correlation, which indicates that the model's predictions track the test data fairly well, although a substantial share of the variation remains unexplained.
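Correlation measures how well predictions and observations move together, but it does not show how large the errors are, so it is often reported alongside the root mean squared error in the units of the response (for lm.model, the test MSE of 90.11 corresponds to roughly 9.5 MPa). A minimal sketch of both measures on a held-out set, assuming the built-in mtcars data and an arbitrary 10-row test split as stand-ins:

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
set.seed(1)
test_idx <- sample(nrow(mtcars), 10)   # hypothetical hold-out split

fit <- lm(mpg ~ wt + hp, data = mtcars, subset = -test_idx)

actual    <- mtcars$mpg[test_idx]
predicted <- predict(fit, mtcars[test_idx, ])

r    <- cor(actual, predicted)                 # agreement in direction
rmse <- sqrt(mean((actual - predicted)^2))     # typical error size, in mpg
print(c(correlation = r, rmse = rmse))
```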

Comparison of mean squared error between training data and test data

In [46]:
#Mean squared error of lm.model on training data
mean((concrete[-test,]$compressive_strength-predict(lm.model,concrete[-test,]))^2)
111.490779845222
In [47]:
#Mean squared error of lm.model on test data
mean((concrete[test,]$compressive_strength-predict(lm.model,concrete[test,]))^2)
90.1101752537359

Testing the model on a single new observation

In [49]:
new_data = data.frame(cement_component=310,
                     blast_furnace_slag=0,
                     fly_ash=0,
                     water=150,
                     superplasticizer=5,
                     coarse_aggregate=1047,
                     fine_aggregate=676,
                     age=28)
predict(lm.model,new_data,interval="confidence")
       fit      lwr      upr
1 28.67073 22.72917 34.61229
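With interval="confidence", predict() bounds the mean strength of all mixes with these proportions; interval="prediction" instead bounds the strength of a single new batch and is always wider, which may be the more relevant range for one mix. A minimal sketch of the difference, assuming the built-in mtcars data as a stand-in and a hypothetical new observation:

```r
# Illustrative stand-in: built-in mtcars data, not the concrete dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Hypothetical new observation (values chosen only for illustration)
new_car <- data.frame(wt = 3.0, hp = 120)

# Interval for the mean response at these predictor values
conf_int <- predict(fit, new_car, interval = "confidence")

# Interval for a single new response at these predictor values
pred_int <- predict(fit, new_car, interval = "prediction")

print(conf_int)
print(pred_int)
```

Both calls return the same fitted value; only the lower and upper bounds differ.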