Atmospheric Pollution Constituents


Summary

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y on a scale of 0 (extremely bad) to 1000 (absolutely pure), together with the component pollutant variables X1, X2, …, X6 on which the index depends. The aim of the department is to establish which of the component variables is contributing most to local atmospheric pollution.

This report analyzed and discussed the association of the purity index Y with the component pollutant variables and developed a model to forecast the purity index. The analysis suggested that the component pollutant variables X1, X2, X4, X5, and X6 are each significantly correlated with purity index Y (p < .05). However, only two component pollutant variables, X1 and X5, remain significant in the multiple regression model, so these are the most likely to contribute significantly to atmospheric pollution (purity index Y). The equation of the chosen (best) regression model is Y = 0.185 + 1.111X1 + 7.598X5.

Further, for the chosen model, the checks for violations of the regression assumptions (multicollinearity, non-normal errors, nonconstant variance, and autocorrelation) found no cause for concern, so the underlying assumptions of the regression analysis appear to be valid.

Introduction

A department dealing with the effects of atmospheric pollutants in the vicinity of an industrial complex has established a data table of measurements of a purity index Y and of the component pollutant variables X1, X2, …, X6 on which it depends. The purity index Y is measured on a scale of 0 to 1000, with 0 being extremely bad and 1000 being absolutely pure. The aim of the department is to establish which of the component variables is contributing most to local atmospheric pollution.

This report will analyze and discuss the association of the purity index Y with component pollutant variables X1, X2, …, X6. Further, this report will develop a model for forecasting the purity index Y based on component pollutant variables X1, X2, …, X6. For this purpose, sample data covering a period of 50 days were obtained. The test is a ‘blind’ one in the sense that none of the pollutants has been identified by name, because naming a pollutant would associate it with its source and, at this stage, raise the possibility of unwanted litigation.

Correlation and Scatterplot Analysis

Figures 1 to 6 show the scatterplots of purity index Y against component pollutant variables X1, X2, …, X6.

Figure 1: Y versus X1
Figure 2: Y versus X2
Figure 3: Y versus X3
Figure 4: Y versus X4
Figure 5: Y versus X5
Figure 6: Y versus X6

There appears to be a strong linear relationship between Y and X1, Y and X2, and Y and X5. In addition, there appears to be a moderately strong linear relationship between Y and X6. By contrast, there appears to be a weak or no linear relationship between Y and X3 and between Y and X4. Table 1 shows the correlation matrix (using MegaStat, an Excel Add-in) for purity index Y and component pollutant variables X1, X2, …, X6.

Table 1: Correlation Matrix

        X1      X2      X3      X4      X5      X6       Y
X1   1.000
X2    .738   1.000
X3   -.293   -.283   1.000
X4    .201    .287   -.130   1.000
X5    .605    .803   -.094    .307   1.000
X6    .491    .675   -.163    .109    .521   1.000
Y     .881    .778   -.261    .290    .805    .533   1.000

Sample size: 50
Critical value, .05 (two-tail): ±.279
Critical value, .01 (two-tail): ±.361

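As shown in Table 1, the correlation of Y is significant at the .05 level for X1, X2, X4, X5, and X6, because each of these correlations exceeds the critical value of ±.279 in absolute value. The correlation of Y with X3 (-.261) does not. Therefore, based on the correlation and scatterplot analysis, component pollutant variable X3 is excluded from the first multiple regression analysis.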
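The critical value quoted under Table 1 can be reproduced from the t distribution, since a correlation is significant when |r| exceeds t / sqrt(t² + df) with df = n - 2. The sketch below is illustrative only: the correlations are copied from Table 1, and the t critical value of 2.0106 (two-tailed, df = 48, α = .05) is an assumed tabulated value rather than something stated in the report.

```python
import math

# Correlations of Y with each pollutant variable, copied from Table 1.
r_with_y = {"X1": 0.881, "X2": 0.778, "X3": -0.261,
            "X4": 0.290, "X5": 0.805, "X6": 0.533}

n = 50              # sample size reported under Table 1
df = n - 2          # degrees of freedom for a test on a correlation
T_CRIT_05 = 2.0106  # assumed tabulated two-tailed t value for df = 48, alpha = .05


def critical_r(t_crit: float, df: int) -> float:
    """Smallest |r| that is significant for the given t critical value."""
    return t_crit / math.sqrt(t_crit ** 2 + df)


r_crit = critical_r(T_CRIT_05, df)

# Variables whose correlation with Y exceeds the critical value in magnitude.
significant = [name for name, r in r_with_y.items() if abs(r) > r_crit]

print(round(r_crit, 3))
print(significant)
```

Under these assumptions the computed threshold rounds to the ±.279 shown in Table 1, and X3 is the only variable that falls below it.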

Multiple Regression Model

Model with Five Independent Variables (Excluding X3)

Table 2

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9477
R Square 0.8982
Adjusted R Square 0.8866
Standard Error 44.0675
Observations 50
ANOVA
df SS MS F Significance F
Regression 5 753910.3541 150782.0708 77.6449 0.0000
Residual 44 85445.5108 1941.9434
Total 49 839355.8649
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -80.3818 67.5179 -1.1905 0.2402 -216.4552 55.6917
X1 1.1879 0.1278 9.2925 0.0000 0.9302 1.4455
X2 -1.4448 1.0805 -1.3372 0.1880 -3.6223 0.7327
X4 6.2999 7.2074 0.8741 0.3868 -8.2257 20.8255
X5 8.4910 1.4413 5.8911 0.0000 5.5862 11.3959
X6 2.4322 3.2784 0.7419 0.4621 -4.1750 9.0393

Table 2 shows the regression model with five component pollutant variables. Although the regression model is significant (F = 77.64, p < .001), the p-values for the coefficients of component pollutant variables X2, X4, and X6 are greater than 0.05. The p-value for the coefficient of X6 (0.462) is higher than those for X2 (0.188) and X4 (0.387); thus, component pollutant variable X6 is excluded from further multiple regression analysis.

Model with Four Independent Variables (Excluding X3 and X6)

Table 3

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9471
R Square 0.8969
Adjusted R Square 0.8878
Standard Error 43.8468
Observations 50
ANOVA
df SS MS F Significance F
Regression 4 752841.5355 188210.3839 97.8967 0.0000
Residual 45 86514.3294 1922.5407
Total 49 839355.8649
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -46.0844 48.9615 -0.9412 0.3516 -144.6979 52.5291
X1 1.1863 0.1272 9.3280 0.0000 0.9301 1.4424
X2 -1.0792 0.9567 -1.1280 0.2653 -3.0061 0.8477
X4 5.6856 7.1238 0.7981 0.4290 -8.6625 20.0338
X5 8.4570 1.4334 5.9000 0.0000 5.5700 11.3440

Table 3 shows the regression model with four component pollutant variables. Although the regression model is significant (F = 97.90, p < .001), the p-values for the coefficients of component pollutant variables X2 and X4 are greater than 0.05. The p-value for the coefficient of X4 (0.429) is higher than that for X2 (0.265); thus, component pollutant variable X4 is excluded from further multiple regression analysis.

Model with Three Independent Variables (Excluding X3, X4 and X6)

Table 4

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9463
R Square 0.8955
Adjusted R Square 0.8887
Standard Error 43.6734
Observations 50
ANOVA
df SS MS F Significance F
Regression 3 751616.9110 250538.9703 131.3532 0.0000
Residual 46 87738.9539 1907.3686
Total 49 839355.8649
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept -12.7192 25.3857 -0.5010 0.6187 -63.8179 38.3795
X1 1.1842 0.1266 9.3505 0.0000 0.9293 1.4391
X2 -1.0248 0.9505 -1.0781 0.2866 -2.9381 0.8885
X5 8.6109 1.4147 6.0865 0.0000 5.7631 11.4586

Table 4 shows the regression model with three component pollutant variables. Although the regression model is significant (F = 131.35, p < .001), the p-value for the coefficient of component pollutant variable X2 (0.287) is greater than 0.05; thus, component pollutant variable X2 is excluded from further multiple regression analysis.

Model with Two Independent Variables X1 and X5

Table 5

SUMMARY OUTPUT
Regression Statistics
Multiple R 0.9449
R Square 0.8928
Adjusted R Square 0.8883
Standard Error 43.7488
Observations 50
ANOVA
df SS MS F Significance F
Regression 2 749399.7991 374699.8996 195.7722 0.0000
Residual 47 89956.0658 1913.9588
Total 49 839355.8649
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.1849 22.4256 0.0082 0.9935 -44.9296 45.2995
X1 1.1114 0.1074 10.3531 0.0000 0.8955 1.3274
X5 7.5978 1.0594 7.1717 0.0000 5.4665 9.7290

Table 5 shows the regression model with two component pollutant variables, X1 and X5. The regression model is significant (F = 195.77, p < .001). The p-values for the coefficients of X1 and X5 are both below 0.05, which indicates that both component pollutant variables significantly predict purity index Y in the regression model.
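The sequence of models in Tables 2 to 5 follows a backward-elimination rule: at each step, the predictor with the largest p-value above 0.05 is removed and the model is refitted. Because the raw 50-day data are not reproduced in this report, the sketch below does not refit anything; it only applies the selection rule to the coefficient p-values already reported in each table. The function name `next_to_drop` is hypothetical.

```python
ALPHA = 0.05  # significance level used throughout the report

# Coefficient p-values as reported in Tables 2, 3, 4 and 5 (intercept omitted).
reported_models = [
    {"X1": 0.0000, "X2": 0.1880, "X4": 0.3868, "X5": 0.0000, "X6": 0.4621},
    {"X1": 0.0000, "X2": 0.2653, "X4": 0.4290, "X5": 0.0000},
    {"X1": 0.0000, "X2": 0.2866, "X5": 0.0000},
    {"X1": 0.0000, "X5": 0.0000},
]


def next_to_drop(p_values, alpha=ALPHA):
    """Return the predictor with the largest p-value if it exceeds alpha, else None."""
    worst = max(p_values, key=p_values.get)
    return worst if p_values[worst] > alpha else None


# One decision per reported model; None means every predictor is significant.
drop_order = [next_to_drop(model) for model in reported_models]
print(drop_order)  # ['X6', 'X4', 'X2', None]
```

The rule reproduces the order of exclusion described above (X6, then X4, then X2) and stops at the model with X1 and X5.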

Table 6 shows the stepwise regression (using MegaStat, an Excel Add-in), giving the best model for each number of variables n. As shown in Table 6, the best multiple regression model is the one with component pollutant variables X1 and X5, because its overall p-value (1.61E-23) is the lowest of the six models.

Table 6: Multiple regression model with different number of independent variables

p-values for the coefficients (columns X1 to X6)

Nvar      X1      X2      X3      X4      X5      X6       Se  Adj R²     R²   p-value
   1   .0000                                           62.649    .771   .776  3.46E-17
   2   .0000                           .0000           43.749    .888   .893  1.61E-23
   3   .0000   .2866                   .0000           43.673    .889   .895  1.45E-22
   4   .0000   .1933   .2597           .0000           43.530    .889   .898  9.57E-22
   5   .0000   .1428   .2482           .0000   .4818   43.772    .888   .900  7.98E-21
   6   .0000   .1282   .2802   .4404   .0000   .4349   43.970    .887   .901  5.57E-20

Adjusted R² is a criterion for deciding the number of independent variables in a multiple regression model, because it penalizes each added variable. Figure 7 shows the Adjusted R² versus the Number of Independent Variables. As shown in Figure 7, there is little increase in Adjusted R² beyond the two independent variables X1 and X5. The Adjusted R² value stays approximately the same (0.888) when more than two independent variables are included in the multiple regression model. Therefore, the best regression model is the one with only the two independent variables X1 and X5.

Figure 7: Adjusted R² versus Number of Independent Variables
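The flattening of Adjusted R² can be checked numerically from the values reported in Table 6. The following sketch is one possible way to formalize the reading of Figure 7: it picks the smallest model whose Adjusted R² is within a tolerance of the best value. The tolerance of 0.005 is a hypothetical choice made here for illustration and is not taken from the report.

```python
# Adjusted R² for the best model of each size, copied from Table 6.
adj_r2_by_nvar = {1: 0.771, 2: 0.888, 3: 0.889, 4: 0.889, 5: 0.888, 6: 0.887}

TOLERANCE = 0.005  # hypothetical threshold for "not much increase"

best = max(adj_r2_by_nvar.values())

# Smallest number of variables whose Adjusted R² is within TOLERANCE of the best.
smallest_adequate = min(
    k for k, value in adj_r2_by_nvar.items() if best - value <= TOLERANCE
)

# Gain in Adjusted R² from adding one more variable.
gains = {
    k: round(adj_r2_by_nvar[k] - adj_r2_by_nvar[k - 1], 3)
    for k in range(2, 7)
}

print(smallest_adequate)
print(gains)
```

With these inputs the gain from one to two variables is 0.117, while every later gain is at most 0.001 in magnitude, which is consistent with stopping at two variables.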

Chosen Multiple Regression Model

The equation of the chosen (best) regression model is Y = 0.185 + 1.111X1 + 7.598X5.

The regression slope coefficient of 1.111 for X1 indicates that for each one-point increase in X1, purity index Y increases by about 1.111 on average, with component pollutant variable X5 held fixed.

The regression slope coefficient of 7.598 for X5 indicates that for each one-point increase in X5, purity index Y increases by about 7.598 on average, with component pollutant variable X1 held fixed.

Component pollutant variables X1 and X5 explain about 89.3% of the variation in purity index Y. The other 10.7% of the variation in purity index Y remains unexplained and may be due to other factors.
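As a small illustration of how the chosen model would be used for forecasting, the sketch below evaluates the fitted equation using the unrounded coefficients from Table 5. The input values for X1 and X5 are hypothetical, since the report does not list the observed data or their units, and the function name `predict_purity` is likewise an assumption.

```python
# Coefficients of the chosen model, taken from Table 5.
INTERCEPT = 0.1849
SLOPE_X1 = 1.1114
SLOPE_X5 = 7.5978


def predict_purity(x1: float, x5: float) -> float:
    """Point forecast of purity index Y from the two-variable model."""
    return INTERCEPT + SLOPE_X1 * x1 + SLOPE_X5 * x5


# Hypothetical readings, for illustration only.
baseline = predict_purity(x1=200.0, x5=30.0)

# Raising X1 by one point with X5 held fixed changes Y by the X1 slope.
effect_of_x1 = predict_purity(201.0, 30.0) - baseline

print(round(baseline, 2))
print(round(effect_of_x1, 4))
```

The second value illustrates the interpretation of a slope coefficient: the change in the forecast equals the coefficient of the variable that was moved, with the other variable held constant.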

T-tests on Individual Coefficients

The null and alternate hypotheses are:

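H0: βj = 0 (component pollutant variable Xj has no linear effect on purity index Y)

H1: βj ≠ 0, for j = 1 and j = 5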

The selected level of significance is 0.05 and the selected test is t-test for Zero Slope.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

Component pollutant variable X1 significantly predicts purity index Y, t(47) = 10.35, p <.001.

Component pollutant variable X5 significantly predicts purity index Y, t(47) = 7.17, p <.001.
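The t statistics and 95% confidence intervals in Table 5 follow from each coefficient and its standard error: t = coefficient / standard error, and the interval is coefficient ± t critical × standard error. The sketch below recomputes them from the rounded values printed in Table 5, so small differences in the last decimal place are expected. The critical value 2.0117 (two-tailed, df = 47, α = .05) is an assumed tabulated value.

```python
T_CRIT = 2.0117  # assumed tabulated two-tailed t value for df = 47, alpha = .05

# (coefficient, standard error) pairs copied from Table 5.
coefficients = {"X1": (1.1114, 0.1074), "X5": (7.5978, 1.0594)}


def t_and_interval(coef: float, std_err: float, t_crit: float = T_CRIT):
    """Return the t statistic and the lower and upper confidence limits."""
    t_stat = coef / std_err
    half_width = t_crit * std_err
    return t_stat, coef - half_width, coef + half_width


results = {name: t_and_interval(*pair) for name, pair in coefficients.items()}

for name, (t_stat, lower, upper) in results.items():
    # H0 (zero slope) is rejected when |t| exceeds the critical value.
    reject = abs(t_stat) > T_CRIT
    print(name, round(t_stat, 2), round(lower, 3), round(upper, 3), reject)
```

Both intervals exclude zero, which is the confidence-interval counterpart of rejecting H0 for X1 and X5.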

F-test on All Coefficients

The null and alternate hypotheses are:

H0: β1 = β5 = 0

H1: at least one of β1 and β5 is not equal to 0

The selected level of significance is 0.05 and the selected test is F-test.

The decision rule is to reject H0 if the p-value ≤ 0.05. Otherwise, do not reject H0.

The regression model is significant, R2 =.893, F(2, 47) = 195.77, p <.001.
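The summary statistics for the chosen model can be recomputed from the sums of squares in the ANOVA portion of Table 5. The sketch below is only an arithmetic check under the standard definitions (F = MSR / MSE, R² = SSR / SST, and Adjusted R² = 1 - (1 - R²)(n - 1) / (n - k - 1)); the variable names are chosen here for illustration.

```python
import math

n = 50  # observations
k = 2   # predictors in the chosen model (X1 and X5)

# Sums of squares copied from the ANOVA section of Table 5.
ss_regression = 749399.7991
ss_residual = 89956.0658
ss_total = ss_regression + ss_residual

# Mean squares divide each sum of squares by its degrees of freedom.
ms_regression = ss_regression / k
ms_residual = ss_residual / (n - k - 1)

f_stat = ms_regression / ms_residual
r_squared = ss_regression / ss_total
adj_r_squared = 1 - (1 - r_squared) * (n - 1) / (n - k - 1)
std_error = math.sqrt(ms_residual)  # standard error of the estimate

print(round(f_stat, 2), round(r_squared, 4),
      round(adj_r_squared, 4), round(std_error, 4))
```

The results agree with the reported F(2, 47) = 195.77, R² = .893, Adjusted R² = .888, and standard error of 43.749.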

Assumptions of Regression Model

Multicollinearity

Klein’s Rule suggests that we should worry about the stability of the regression coefficient estimates only when a pairwise predictor correlation exceeds the multiple correlation coefficient R (i.e., the square root of R2). The value of the correlation coefficient between X1 and X5 is 0.605. The value of Multiple R for the final regression model with X1 and X5 is 0.945 and far exceeds 0.605, which suggests that the confidence intervals and t-tests may not be affected.

Another approach for checking multicollinearity is the Variance Inflation Factor (VIF). Figure 8 shows the interpretation of the Variance Inflation Factor (VIF). As a rule of thumb, we need not worry about multicollinearity if the VIF for each explanatory variable is less than 10.

Figure 8: Variance Inflation Factor (VIF) and Interpretation

Table 7: Variance Inflation Factor (VIF) using MegaStat

Regression Analysis
R² 0.893
Adjusted R² 0.888 n 50
R 0.945 k 2
Std. Error 43.749 Dep. Var. Y
ANOVA table
Source SS df MS F p-value
Regression 749,399.7991 2 374,699.8996 195.77 1.61E-23
Residual 89,956.0658 47 1,913.9588
Total 839,355.8649 49
Regression output confidence interval
variables coefficients std. error t (df=47) p-value 95% lower 95% upper std. coeff. VIF
Intercept 0.1849 22.4256 0.008 .9935 -44.9296 45.2995 0.000
X1 1.1114 0.1074 10.353 1.03E-13 0.8955 1.3274 0.621 1.576
X5 7.5978 1.0594 7.172 4.49E-09 5.4665 9.7290 0.430 1.576

As shown in Table 7, the VIF for both X1 and X5 is 1.576, which is well below 10; thus, there is no need for concern.
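With only two predictors, the VIF of each one reduces to 1 / (1 - r²), where r is the correlation between the two predictors, so both multicollinearity checks can be reproduced from Table 1 and Table 5. The sketch below uses the rounded correlation of 0.605, so the result differs slightly from MegaStat's 1.576 in the third decimal place. The threshold of 10 is the rule of thumb quoted above.

```python
r_x1_x5 = 0.605     # correlation between X1 and X5, from Table 1
multiple_r = 0.945  # Multiple R of the chosen model, from Table 5
VIF_LIMIT = 10      # rule-of-thumb threshold quoted in the text

# VIF for a two-predictor model: 1 / (1 - r^2) for both predictors.
vif = 1 / (1 - r_x1_x5 ** 2)

# Klein's Rule flags a problem only when a pairwise predictor
# correlation exceeds the multiple correlation coefficient R.
klein_flag = abs(r_x1_x5) > multiple_r

print(round(vif, 3), vif < VIF_LIMIT, klein_flag)
```

Neither check raises a flag: the VIF is far below 10, and the predictor correlation is well below Multiple R.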

Non-Normal Errors

Figure 9 shows the normal probability plot of residuals. As shown in figure 9, the residual plot is approximately linear, thus, the residuals seem to be consistent with the hypothesis of normality.

Figure 9: Normal Probability Plot of Residuals
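A normal probability plot pairs the sorted residuals with the quantiles expected from a standard normal distribution; if the points fall close to a straight line, the correlation between the two is close to 1. The sketch below is a minimal illustration of that idea. It assumes plotting positions of (i - 0.5) / n, which is one common convention and not necessarily the one MegaStat uses, and the residuals are hypothetical because the report does not list them.

```python
from statistics import NormalDist, mean


def normal_scores(n: int) -> list:
    """Standard-normal quantiles at plotting positions (i - 0.5) / n."""
    std_normal = NormalDist()
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def probability_plot_correlation(residuals) -> float:
    """Pearson correlation between sorted residuals and their normal scores."""
    ordered = sorted(residuals)
    scores = normal_scores(len(ordered))
    mean_e, mean_z = mean(ordered), mean(scores)
    cov = sum((e - mean_e) * (z - mean_z) for e, z in zip(ordered, scores))
    var_e = sum((e - mean_e) ** 2 for e in ordered)
    var_z = sum((z - mean_z) ** 2 for z in scores)
    return cov / (var_e * var_z) ** 0.5


# Hypothetical residuals, for illustration only.
example_residuals = [-61.2, -38.5, -20.1, -9.4, -2.0,
                     4.3, 11.8, 22.6, 37.9, 58.4]

ppcc = probability_plot_correlation(example_residuals)
print(round(ppcc, 3))
```

A value near 1 corresponds to the approximately linear pattern described for Figure 9; a markedly lower value would suggest departure from normality.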

Nonconstant Variance (Heteroscedasticity)

Figures 10 and 11 show the plots of residuals by X1 and residuals by X5.

Figure 10: Residuals by X1
Figure 11: Residuals by X5

As shown in Figures 10 and 11, the data points are scattered, and there is no pattern in the residuals as we move from left to right; thus, the residuals seem to be consistent with the hypothesis of homoscedasticity (constant variance).

Autocorrelation

Autocorrelation exists when the residuals are correlated with each other. With time-series data, one needs to be aware of the possibility of autocorrelation, a pattern of nonindependent errors that violates the regression assumption that each error is independent of its predecessor. The most common test for autocorrelation is the Durbin-Watson test. The DW statistic lies between 0 and 4. When there is no autocorrelation, the DW statistic will be near 2. In this case, DW = 2.33, which is near 2; thus, the errors appear to be non-autocorrelated. However, for cross-sectional data, the DW statistic is usually ignored.
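The Durbin-Watson statistic is the sum of squared differences between successive residuals divided by the sum of squared residuals. The sketch below implements that definition directly. The residual sequences are hypothetical and were chosen only to show the behavior at the extremes, since the report's own residuals are not listed.

```python
def durbin_watson(residuals) -> float:
    """Durbin-Watson statistic: sum of squared successive differences
    divided by the sum of squared residuals."""
    numerator = sum(
        (residuals[t] - residuals[t - 1]) ** 2
        for t in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


# Hypothetical sequences, for illustration only.
alternating = [1.0, -1.0, 1.0, -1.0]  # sign flips every step
persistent = [2.0, 2.0, 2.0, 2.0]     # each residual repeats the last

print(durbin_watson(alternating))  # 3.0
print(durbin_watson(persistent))   # 0.0
```

Residuals that keep flipping sign push the statistic toward 4 (negative autocorrelation), and residuals that repeat push it toward 0 (positive autocorrelation), which is why a value such as 2.33 is read as close to the no-autocorrelation value of 2.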

Figure 12 shows the residuals by observation number. As shown in Figure 12, the sign of a residual cannot be predicted from the sign of the preceding one, which means that there is no evidence of autocorrelation.

Figure 12: Residuals by Observations

Thus, for the chosen model, all the underlying assumptions of the regression analysis are valid.

Pollutant Variables (X) to Contribute Atmospheric Pollution (Purity Index Y)

As shown in Table 1 (the correlation matrix), the component pollutant variables X1, X2, X4, X5, and X6 are each significantly correlated with purity index Y (p < .05). Thus, taken individually, each of them is significantly associated with atmospheric pollution. However, in the multiple regression analysis, only two component pollutant variables, X1 and X5, remain significant, so these two are the most likely to contribute significantly to atmospheric pollution (purity index Y).
