Linear Regression Model for predicting the concrete compressive strength based on the concrete's ingredients and age

Behroz Ahmad Ali

bhrz.ali@gmail.com

Compressive strength is the resistance of a material or structure to breaking under compressive forces. The compressive strength of concrete largely determines how it performs in service, so studying it is of great practical importance.

The characteristics of concrete depend on the types of ingredients used and their proportions. The main constituents of concrete are cement, water and aggregates in varying proportions, but other materials are usually added to the mix to obtain the required compressive strength and properties.

As concrete cures and hardens over time, its compressive strength increases. The required compressive strength of concrete can vary from 17 MPa for residential purposes up to 70 MPa for some commercial applications.

The dataset used here contains the amounts of the constituents of concrete and the compressive strength measured after a given number of days. The original owner of the dataset is Prof. I-Cheng Yeh, Department of Information Management, Chung-Hua University, Taiwan. The dataset is freely available in the UCI Machine Learning Repository.

Information about the Dataset

The following information has been taken from the UCI repository: https://archive.ics.uci.edu/ml/datasets/Concrete+Compressive+Strength

The information given in the dataset:

Cement (component 1) -- quantitative -- kg in a m3 mixture -- Input Variable

Blast Furnace Slag (component 2) -- quantitative -- kg in a m3 mixture -- Input Variable

Fly Ash (component 3) -- quantitative -- kg in a m3 mixture -- Input Variable

Water (component 4) -- quantitative -- kg in a m3 mixture -- Input Variable

Superplasticizer (component 5) -- quantitative -- kg in a m3 mixture -- Input Variable

Coarse Aggregate (component 6) -- quantitative -- kg in a m3 mixture -- Input Variable

Fine Aggregate (component 7) -- quantitative -- kg in a m3 mixture -- Input Variable

Age -- quantitative -- Day (1~365) -- Input Variable

Concrete compressive strength -- quantitative -- MPa -- Output Variable

Linear Regression Model

Here we will create a linear regression model that can predict the compressive strength of concrete based on its ingredients and age.

#Read the data
concrete1 = read.csv("Concrete_Data.csv",header=TRUE,na.strings="?")
#Remove rows with missing values if any.
concrete1 = na.omit(concrete1)
#Make the names of the headings shorter
concrete = concrete1
names(concrete) = c('cement_component','blast_furnace_slag','fly_ash','water','superplasticizer',
                    'coarse_aggregate','fine_aggregate','age','compressive_strength')

#Hold out 200 randomly sampled rows as test data
set.seed(5)
test = sample(nrow(concrete),200)

#Creating a Linear Regression model using all the attributes (Input Variable).
lm.model = lm(compressive_strength~.,data=concrete,subset=-test)
summary(lm.model)

Call:
lm(formula = compressive_strength ~ ., data = concrete, subset = -test)

Residuals:
    Min      1Q  Median      3Q     Max 
-28.663  -6.460   0.848   7.056  33.956 

Coefficients:
                     Estimate Std. Error t value Pr(>|t|)    
(Intercept)        -21.785271  30.346643  -0.718 0.473035    
cement_component     0.122450   0.009728  12.587  < 2e-16 ***
blast_furnace_slag   0.105964   0.011604   9.131  < 2e-16 ***
fly_ash              0.092771   0.014412   6.437 2.07e-10 ***
water               -0.166748   0.045777  -3.643 0.000287 ***
superplasticizer     0.207143   0.105462   1.964 0.049850 *  
coarse_aggregate     0.017539   0.010728   1.635 0.102456    
fine_aggregate       0.022011   0.012165   1.809 0.070763 .  
age                  0.115364   0.006066  19.019  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.62 on 821 degrees of freedom
Multiple R-squared:  0.6131,	Adjusted R-squared:  0.6094 
F-statistic: 162.6 on 8 and 821 DF,  p-value: < 2.2e-16

#Mean Squared Error of the lm.model on the test data
predict_strength = predict(lm.model,concrete[test,])
mean((concrete[test,]$compressive_strength-predict_strength)^2)
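
The test-set error computed above can be wrapped in a small reusable helper. The following is a minimal sketch, not part of the original analysis: the function name `mse` and the synthetic `toy` data frame are hypothetical stand-ins used only so the example runs without `Concrete_Data.csv`.

```r
# Hypothetical helper: mean squared error between observed and predicted values.
mse <- function(actual, predicted) {
  mean((actual - predicted)^2)
}

# Synthetic stand-in data (NOT the concrete dataset), for illustration only.
set.seed(1)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 3 + 2 * toy$x + rnorm(100, sd = 1)

# Hold out 20 rows as test data, mirroring the sample() call used above.
toy_test <- sample(nrow(toy), 20)
toy_fit <- lm(y ~ x, data = toy, subset = -toy_test)

# Error on the held-out rows only.
test_mse <- mse(toy$y[toy_test], predict(toy_fit, toy[toy_test, ]))
test_rmse <- sqrt(test_mse)  # RMSE is in the same units as the response
```

Taking the square root gives an error in the units of the response (MPa for the concrete data), which is often easier to interpret than the squared error.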

p-values

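The p-values of the t-statistics are close to zero for most of the coefficients. The exceptions are coarse_aggregate (p ≈ 0.10) and fine_aggregate (p ≈ 0.07), which are not significant at the 5% level, and superplasticizer (p ≈ 0.0499), which is only borderline. The intercept is also not significant (p ≈ 0.47), but it is kept in the model as usual. This suggests that coarse_aggregate and fine_aggregate contribute little to the fit once the other variables are included, so we next try a model without them.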
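
Instead of reading the p-values off the printed summary, they can be extracted programmatically from the coefficient table. This is a minimal sketch on synthetic data; the `toy` data frame, its columns and the 0.05 threshold are assumptions made for illustration.

```r
# Synthetic stand-in data (NOT the concrete dataset): x2 has no real effect on y.
set.seed(7)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)
toy_fit <- lm(y ~ x1 + x2, data = toy)

# coef(summary(fit)) returns the coefficient matrix shown by summary().
coef_table <- coef(summary(toy_fit))
p_values <- coef_table[, "Pr(>|t|)"]

# Names of the terms that are not significant at the (assumed) 5% level.
not_significant <- names(p_values)[p_values >= 0.05]
```

On the model above, the same idea would be `coef(summary(lm.model))[, "Pr(>|t|)"]`.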

#Removing the coarse_aggregate and fine_aggregate attributes from the model
lm.model2 = lm(compressive_strength~.-coarse_aggregate-fine_aggregate,data=concrete,subset=-test)
summary(lm.model2)

Call:
lm(formula = compressive_strength ~ . - coarse_aggregate - fine_aggregate, 
    data = concrete, subset = -test)

Residuals:
    Min      1Q  Median      3Q     Max 
-30.291  -6.272   0.700   6.876  34.039 

Coefficients:
                    Estimate Std. Error t value Pr(>|t|)    
(Intercept)        32.130338   4.807333   6.684 4.30e-11 ***
cement_component    0.107023   0.004816  22.221  < 2e-16 ***
blast_furnace_slag  0.087499   0.005661  15.458  < 2e-16 ***
fly_ash             0.071776   0.008681   8.268 5.45e-16 ***
water              -0.236707   0.023979  -9.871  < 2e-16 ***
superplasticizer    0.169648   0.095845   1.770   0.0771 .  
age                 0.114440   0.006039  18.949  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 10.63 on 823 degrees of freedom
Multiple R-squared:  0.6116,	Adjusted R-squared:  0.6087 
F-statistic:   216 on 6 and 823 DF,  p-value: < 2.2e-16

#Mean Squared Error of the lm.model2 on test data
predict_strength2 = predict(lm.model2,concrete[test,])
mean((concrete[test,]$compressive_strength-predict_strength2)^2)

Removing the "coarse_aggregate" and "fine_aggregate" attributes does not improve the fit: the adjusted R-squared drops slightly from 0.6094 to 0.6087, and the mean squared error on the test data does not decrease. Therefore we will go ahead with the first model, lm.model.
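
A complementary way to compare two nested linear models is a partial F-test, which `anova()` performs when given two `lm` fits. The sketch below uses synthetic data and is not part of the original analysis; the `toy` data frame and the model names are hypothetical.

```r
# Synthetic stand-in data (NOT the concrete dataset): only x1 affects y.
set.seed(2)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n)

full_fit <- lm(y ~ x1 + x2 + x3, data = toy)
reduced_fit <- lm(y ~ x1, data = toy)

# With two nested lm fits, anova() tests whether the extra terms
# (here x2 and x3) jointly improve the fit.
f_test <- anova(reduced_fit, full_fit)
print(f_test)

# The p-value of the F-test is in the second row of the table.
p_value <- f_test[["Pr(>F)"]][2]
```

For the models in this notebook the corresponding call would be `anova(lm.model2, lm.model)`; a large p-value would indicate that the two aggregate variables add little jointly.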

Residual Standard Error

#Residual Standard Error of lm.model
sqrt(deviance(lm.model)/lm.model$df.residual)

#Mean response of the dataset
mean(concrete[-test,]$compressive_strength)

The residual standard error of lm.model is about 10.62 and the mean response in the training data is about 36.00. The residual standard error as a percentage of the mean response is computed as follows.

#Error rate
10.62/36*100
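
With the rounded values above, 10.62/36 is about 29.5%. The same quantity can be computed directly from a fitted model instead of typing the numbers in by hand. The following is a minimal sketch on synthetic data (the `toy` data frame is a hypothetical stand-in); it also checks that the manual formula sqrt(RSS / residual degrees of freedom) agrees with the built-in `sigma()`.

```r
# Synthetic stand-in data (NOT the concrete dataset), for illustration only.
set.seed(3)
toy <- data.frame(x = runif(80, 0, 10))
toy$y <- 20 + 1.5 * toy$x + rnorm(80, sd = 2)
toy_fit <- lm(y ~ x, data = toy)

# Residual standard error: sqrt(RSS / residual degrees of freedom).
rse_manual <- sqrt(deviance(toy_fit) / df.residual(toy_fit))

# sigma() returns the same value for lm fits.
rse_builtin <- sigma(toy_fit)

# Residual standard error as a percentage of the mean training response.
relative_error_pct <- rse_builtin / mean(toy$y) * 100
```

For lm.model, the residual degrees of freedom are 830 training rows minus 9 estimated coefficients, which gives the 821 shown in the summary.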

R-squared value

$$TSS = \sum \left ( y_{i}-\bar{y} \right )^{2}$$

$$RSS = \sum \left ( y_{i}-\hat{y}_{i} \right )^{2}$$

$$R^{2} = \frac{TSS-RSS}{TSS}$$

summary(lm.model)$r.squared

The R-squared value of lm.model is 0.61. This indicates that about 61% of the variability in the response of the training data is explained by the model.
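
The TSS and RSS definitions can be checked numerically against the value reported by `summary()`. This is a minimal sketch on synthetic data; the `toy` data frame is a hypothetical stand-in, and the adjusted R-squared formula used is the standard one for a model with an intercept and p predictors.

```r
# Synthetic stand-in data (NOT the concrete dataset), for illustration only.
set.seed(4)
toy <- data.frame(x = rnorm(120))
toy$y <- 5 + 3 * toy$x + rnorm(120, sd = 2)
toy_fit <- lm(y ~ x, data = toy)

y <- toy$y
tss <- sum((y - mean(y))^2)          # total sum of squares
rss <- sum((y - fitted(toy_fit))^2)  # residual sum of squares

r_squared_manual <- (tss - rss) / tss
r_squared_summary <- summary(toy_fit)$r.squared

# Adjusted R-squared penalises the number of predictors p.
n <- nrow(toy)
p <- 1
adj_r_squared_manual <- 1 - (1 - r_squared_manual) * (n - 1) / (n - p - 1)
```

Because R-squared on the training data cannot increase when predictors are removed, the adjusted value is the more useful one when comparing models of different size.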

Correlation between the observed response in the test data and the predictions of lm.model

cor(concrete[test,]$compressive_strength,predict(lm.model,concrete[test,]))

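There is a correlation of approximately 0.8 between the observed response in the test data and the predictions of lm.model. This is a reasonably strong correlation, which suggests that the model's predictions track the test data fairly well, although correlation alone does not measure the size of the prediction errors.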
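
Since correlation ignores any systematic offset or scaling in the predictions, it can be useful to look at an out-of-sample R-squared alongside it. The sketch below is an illustration on synthetic data, not part of the original analysis; the `toy` data frame is hypothetical, and taking the total sum of squares around the test-set mean is one common convention among several.

```r
# Synthetic stand-in data (NOT the concrete dataset), for illustration only.
set.seed(5)
toy <- data.frame(x = runif(150, 0, 10))
toy$y <- 2 + 4 * toy$x + rnorm(150, sd = 3)
toy_test <- sample(nrow(toy), 30)
toy_fit <- lm(y ~ x, data = toy, subset = -toy_test)

actual <- toy$y[toy_test]
predicted <- predict(toy_fit, toy[toy_test, ])

# Correlation between observed and predicted values on the test rows.
test_cor <- cor(actual, predicted)

# Out-of-sample R-squared: 1 - SSE / SST, with SST taken around the test mean.
test_r2 <- 1 - sum((actual - predicted)^2) / sum((actual - mean(actual))^2)
```

The out-of-sample R-squared can never exceed the squared correlation, so a large gap between the two hints at biased or poorly scaled predictions.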

Comparison of mean squared error between training data and test data

#Mean squared error of lm.model on training data
mean((concrete[-test,]$compressive_strength-predict(lm.model,concrete[-test,]))^2)

#Mean squared error of lm.model on test data
mean((concrete[test,]$compressive_strength-predict(lm.model,concrete[test,]))^2)

Testing the model on a single new observation

new_data = data.frame(cement_component=310,
                     blast_furnace_slag=0,
                     fly_ash=0,
                     water=150,
                     superplasticizer=5,
                     coarse_aggregate=1047,
                     fine_aggregate=676,
                     age=28)
predict(lm.model,new_data,interval="confidence")
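
Note that `interval="confidence"` gives an interval for the mean response at the given inputs, while `interval="prediction"` gives a wider interval for an individual new observation. The following is a minimal sketch of the difference on synthetic data; the `toy` data frame and `new_obs` are hypothetical stand-ins.

```r
# Synthetic stand-in data (NOT the concrete dataset), for illustration only.
set.seed(6)
toy <- data.frame(x = runif(100, 0, 10))
toy$y <- 10 + 2 * toy$x + rnorm(100, sd = 3)
toy_fit <- lm(y ~ x, data = toy)

new_obs <- data.frame(x = 5)

# Interval for the MEAN response at x = 5.
conf_int <- predict(toy_fit, new_obs, interval = "confidence")

# Interval for a SINGLE new observation at x = 5 (adds the residual noise).
pred_int <- predict(toy_fit, new_obs, interval = "prediction")

print(conf_int)
print(pred_int)
```

For a single new concrete mix, `predict(lm.model,new_data,interval="prediction")` is therefore likely the more relevant call when the question is how strong that particular batch will be.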