Development of a QSRR model for a set of Polycyclic Aromatic Hydrocarbons (PAHs) using simple regression analysis in silico
Nadia Ziani1,4, Khadidja Amirat2,4, Souhaila Meneceur3,4, Abderrhmane Bouafia3*
1Faculty of Science, Chemistry Department Badji Mokhtar University Annaba, Annaba, Algeria.
2Faculty of Science, Department of Chemistry University of Sétif 1 - Ferhat Abbas, El Bez, Setif 19000.
3Department of Process Engineering and Petrochemistry, Faculty of Technology,
University of El Oued, 39000 El-Oued, Algeria.
4Renewable Energy Development Unit in Arid Zones (UDERZA), University of El Oued El-Oued, Algeria.
*Corresponding Author E-mail: abdelrahmanebouafia@gmail.com
ABSTRACT:
Polycyclic aromatic hydrocarbons (PAHs) are organic molecules produced by the incomplete combustion of carbonaceous materials [1], arising from minor natural processes [2-4] and major anthropogenic processes [5]. PAHs are made up of carbon and hydrogen atoms forming at least two condensed aromatic rings [6,7]. They are released into all compartments of the environment [8,9]. The number of PAHs identified to date is of the order of 130 [10]. Some cause major environmental problems because of their toxicity. The carcinogenic and mutagenic properties of many PAHs have long been studied and established, while those of several others are still being investigated [11-22]. Because of the pollution generated by the increasing emission of PAHs into the atmosphere, it is imperative to have methods that allow both reliable identification and precise quantification of these compounds.
The aim of this work is to predict the retention indices of 59 PAHs using general molecular descriptors selected by the genetic algorithm of MobyDigs.
MATERIALS AND METHODS:
Dataset:
The retention indices of the 59 PAHs were taken from Kang et al. [23]. The compound names and their corresponding retention indices are shown in Table 1. The data set was divided into two subsets according to the Kennard and Stone algorithm [24]: 38 molecules for the training set (model construction) and 21 compounds for testing the robustness of the model.
Descriptor Generation:
The structure of each compound was sketched in the Spartan software [25] and its geometry was optimized with the PM6 semi-empirical method. The resulting files were transferred to Dragon version 5.3 [26] to calculate the descriptors, and one descriptor of each pair of correlated descriptors (correlation coefficient r ≥ 0.95) was eliminated. The genetic algorithm of MobyDigs [27] selected the best models by maximizing the leave-one-out cross-validation coefficient Q²LOO [28].
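The correlation filter described above can be sketched as follows. This is only an illustration, not the procedure implemented in Dragon: the function name `drop_correlated`, the greedy "keep the first of each correlated pair" rule and the toy data are assumptions, since the text does not say which descriptor of a correlated pair is discarded.

```python
import numpy as np

def drop_correlated(X, names, r_max=0.95):
    """Greedy filter (assumed rule): keep a descriptor only if its absolute
    Pearson correlation with every descriptor already kept is below r_max.
    Assumes no constant columns (their correlation is undefined)."""
    corr = np.abs(np.corrcoef(X, rowvar=False))
    kept = []
    for j in range(X.shape[1]):
        if all(corr[j, k] < r_max for k in kept):
            kept.append(j)
    return X[:, kept], [names[j] for j in kept]

# Toy data (hypothetical, not from the paper): d2 is almost a copy of d1.
rng = np.random.default_rng(0)
d1 = rng.normal(size=50)
d2 = d1 + 0.01 * rng.normal(size=50)
d3 = rng.normal(size=50)
X = np.column_stack([d1, d2, d3])
X_kept, kept_names = drop_correlated(X, ["d1", "d2", "d3"])
print(kept_names)
```

In this toy case the near-duplicate column d2 is the one removed; with real descriptor matrices the outcome depends on the column order, which is why the rule is flagged as an assumption.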
Kennard and Stone Algorithm:
The Kennard and Stone (CADEX) algorithm is a sequential technique that maximizes the Euclidean distances between newly selected samples and those already selected. It begins by locating the two samples furthest from each other, which are removed from the original database and assigned to the calibration set [29,30].
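A minimal sketch of the Kennard and Stone selection is given below. It assumes the standard max-min rule (each new sample is the one whose nearest already-selected neighbour is farthest away); the function name `kennard_stone` and the toy descriptor values are illustrative, and the descriptor space actually used for the distances in this study is not stated in the text.

```python
import numpy as np

def kennard_stone(X, n_select):
    """Return (selected, remaining) row indices chosen by the Kennard-Stone rule."""
    X = np.asarray(X, dtype=float)
    if X.ndim == 1:
        X = X[:, None]
    # Pairwise Euclidean distance matrix
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    # Start with the two samples farthest from each other
    i, j = np.unravel_index(np.argmax(dist), dist.shape)
    selected = [int(i), int(j)]
    remaining = [k for k in range(len(X)) if k not in selected]
    while len(selected) < n_select:
        # Distance of each candidate to its nearest already-selected sample
        min_d = dist[np.ix_(remaining, selected)].min(axis=1)
        # Pick the candidate whose nearest selected neighbour is farthest away
        pick = remaining[int(np.argmax(min_d))]
        selected.append(pick)
        remaining.remove(pick)
    return selected, remaining

# Toy one-descriptor example (values are illustrative only)
x = np.array([0.0, 1.0, 2.0, 3.0, 10.0])
train, test = kennard_stone(x, 3)
print(train, test)
```

For a 38/21 split such as the one used here, the call would be `kennard_stone(X, 38)`, with the remaining indices forming the test set.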
Model development and validation:
Simple linear regression analysis was performed with the MobyDigs software [11], using the ordinary least squares (OLS) method.
Evaluation of the quality of the fit:
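We used the following statistical parameters to assess the goodness of fit:
· The coefficient of determination R²:
$$R^2 = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{1}$$
where $\bar{y}$ is the mean value of the observed values.
· The mean square prediction deviation (SDEP):
$$SDEP = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2}{n}} \tag{2}$$
· The mean square deviation calculated on the calibration set (SDEC):
$$SDEC = \sqrt{\frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}{n}} \tag{3}$$
· The mean square deviation calculated on the external validation set (SDEPext):
$$SDEP_{ext} = \sqrt{\frac{\sum_{i=1}^{n_{ext}} (y_i - \hat{y}_{i/i})^2}{n_{ext}}} \tag{4}$$
· The prediction coefficient:
$$Q^2 = 1 - \frac{PRESS}{SCT} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_{i/i})^2}{\sum_{i=1}^{n} (y_i - \bar{y})^2} \tag{5}$$
where:
SCT: the sum of the squares of the total deviations.
PRESS: the sum of the squares of the prediction errors.
· The coefficient of external prediction, calculated by the following formula:
$$Q^2_{ext} = 1 - \frac{\sum_{i=1}^{n_{ext}} (y_i - \hat{y}_{i/i})^2}{\sum_{i=1}^{n_{ext}} (y_i - \bar{y}_{tr})^2} \tag{6}$$
where $y_i$ and $\hat{y}_{i/i}$ are respectively the measured and predicted values (on the prediction set) of the dependent variable (y), and $\bar{y}_{tr}$ is the mean value of the dependent variable in the training set. The index (ext) relates to the objects of the validation set, and the index (tr) to those of the calibration set (training set).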
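The statistics named in this section can be computed for a one-descriptor model along the following lines. This is a sketch under stated assumptions, not the MobyDigs implementation: it assumes the usual definitions (R² and SDEC from fitted residuals, Q²LOO and SDEP from leave-one-out residuals, Q²ext referenced to the training-set mean), and the helper names `fit_ols`, `predict` and `fit_statistics` are hypothetical. The seven data points are taken from Table 1 purely for illustration and do not reproduce the published split.

```python
import numpy as np

def fit_ols(x, y):
    """Least-squares intercept and slope for y = b0 + b1 * x."""
    A = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict(coef, x):
    return coef[0] + coef[1] * x

def fit_statistics(x_tr, y_tr, x_ext, y_ext):
    n = len(y_tr)
    coef = fit_ols(x_tr, y_tr)
    rss = np.sum((y_tr - predict(coef, x_tr)) ** 2)   # fitted residuals
    sct = np.sum((y_tr - y_tr.mean()) ** 2)           # total sum of squares
    # Leave-one-out: refit without compound i, then predict compound i
    press = 0.0
    for i in range(n):
        mask = np.arange(n) != i
        c = fit_ols(x_tr[mask], y_tr[mask])
        press += (y_tr[i] - predict(c, x_tr[i])) ** 2
    # External set is predicted with the model fitted on the training set
    press_ext = np.sum((y_ext - predict(coef, x_ext)) ** 2)
    return {
        "R2": 1 - rss / sct,
        "Q2_LOO": 1 - press / sct,
        "SDEC": np.sqrt(rss / n),
        "SDEP": np.sqrt(press / n),
        "SDEP_ext": np.sqrt(press_ext / len(y_ext)),
        "Q2_ext": 1 - press_ext / np.sum((y_ext - y_tr.mean()) ** 2),
    }

# Illustrative subset of Table 1 (X3, Ri): five training and two external compounds
x_tr = np.array([3.466, 4.226, 5.344, 6.451, 9.213])
y_tr = np.array([200.0, 236.08, 301.69, 355.49, 495.45])
x_ext = np.array([4.248, 7.691])
y_ext = np.array([236.56, 414.37])
stats = fit_statistics(x_tr, y_tr, x_ext, y_ext)
for key, value in stats.items():
    print(f"{key}: {value:.4f}")
```

Because leave-one-out residuals are never smaller in magnitude than fitted residuals, Q²LOO cannot exceed R² and SDEP cannot fall below SDEC, which is why a small gap between them is read as a sign of robustness.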
Williams Diagram:
The applicability domain was discussed using the Williams diagram (treated in detail in [29, 31]), which represents the standardized prediction residuals as a function of the leverage values. Equation (7) defines the leverage of a compound in the original space of the independent variables.
$$h_i = x_i (X^T X)^{-1} x_i^T \quad (i = 1, \ldots, n) \tag{7}$$
Where $x_i$ is the row vector of the descriptors of compound i and X (n × p) is the model matrix built from the descriptor values of the calibration set; the index T denotes the transposed vector (or matrix).
The critical value of the leverage (h*) is set at 3(p + 1)/n. If h_i < h*, the probability of agreement between the measured and predicted values of compound i is as high as that of the calibration compounds. Compounds with h_i > h* strengthen the model when they belong to the calibration set, but otherwise have questionable predicted values without necessarily being outliers, as their residuals may be low.
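The leverage calculation can be sketched as below. The sketch assumes that the model matrix carries an intercept column in addition to the descriptor, so that p + 1 parameters are counted; the function names `leverages` and `critical_leverage` and the five X3 values (taken from Table 1 for illustration) are not part of the original workflow.

```python
import numpy as np

def leverages(x_train, x_query):
    """h = x (X^T X)^-1 x^T for each query compound.
    Assumption: X holds an intercept column plus one descriptor column."""
    X = np.column_stack([np.ones_like(x_train), x_train])
    XtX_inv = np.linalg.inv(X.T @ X)
    Q = np.column_stack([np.ones_like(x_query), x_query])
    # Row-wise quadratic form q_i (X^T X)^-1 q_i^T
    return np.einsum("ij,jk,ik->i", Q, XtX_inv, Q)

def critical_leverage(n, p):
    """Warning threshold h* = 3 (p + 1) / n."""
    return 3 * (p + 1) / n

# Illustrative X3 values from Table 1 (not the full training set)
x_train = np.array([3.466, 4.226, 5.344, 6.451, 9.213])
h = leverages(x_train, x_train)
h_star = critical_leverage(n=38, p=1)   # one descriptor, 38 training compounds
print(h.round(3), round(h_star, 3))
```

For the training compounds the leverages sum to the number of model parameters (here 2), which gives a quick sanity check on any implementation.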
Randomization Test:
This test makes it possible to detect correlations that are due to chance. It consists of generating a "considered property" vector by random permutation of the components of the real vector. A QSRR model is then calculated on the resulting vector (treated as a real experimental vector), following the usual method. This process is repeated several times (100 in our case).
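A randomization test of this kind can be sketched as follows. Only R² is recomputed for each permuted response, which is a simplification of the full model statistics; the function names `r2_slr` and `y_randomization`, the fixed seed and the use of the first ten rows of Table 1 are assumptions made for illustration.

```python
import numpy as np

def r2_slr(x, y):
    """R^2 of a one-descriptor least-squares line (equals the squared Pearson r)."""
    return np.corrcoef(x, y)[0, 1] ** 2

def y_randomization(x, y, n_perm=100, seed=0):
    """Compare the real R^2 with R^2 values obtained on permuted responses."""
    rng = np.random.default_rng(seed)
    r2_real = r2_slr(x, y)
    r2_perm = np.array([r2_slr(x, rng.permutation(y)) for _ in range(n_perm)])
    return r2_real, r2_perm

# First ten compounds of Table 1 (X3, Ri), used here only as an illustration
x = np.array([3.466, 3.802, 3.933, 4.226, 4.137, 4.178, 4.327, 4.723, 4.861, 5.344])
y = np.array([200.0, 218.14, 221.04, 236.08, 237.58, 240.25, 249.52, 263.31, 265.9, 301.69])
r2_real, r2_perm = y_randomization(x, y)
print(f"real R2 = {r2_real:.4f}, mean permuted R2 = {r2_perm.mean():.4f}")
```

A model that is not a chance correlation should show a real R² far above the bulk of the permuted values.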
RESULTS AND DISCUSSION:
Simple regression model (SLR):
Our model is built with a single descriptor that is related to the retention indices; this descriptor is suitable for modeling by SLR.
The optimal model equation can be written as follows:
Ri = 27.6 (±3.686) + 50.4 (±0.5716) X3 (8)
Here, X3 was calculated with the Dragon program; it belongs to the connectivity indices class.
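As a worked illustration, equation (8) can be applied to two compounds of Table 1. The coefficients are those of equation (8) and the X3 and Ri values come from Table 1; the function name `predict_ri` is hypothetical.

```python
INTERCEPT = 27.6   # intercept of equation (8)
SLOPE = 50.4       # coefficient of X3 in equation (8)

def predict_ri(x3):
    """Retention index predicted from the connectivity index X3."""
    return INTERCEPT + SLOPE * x3

# (name, X3, experimental Ri) taken from Table 1
compounds = [("Naphthalene", 3.466, 200.0), ("Anthracene", 5.344, 301.69)]
for name, x3, ri_exp in compounds:
    ri_calc = predict_ri(x3)
    print(f"{name}: calc {ri_calc:.2f}, exp {ri_exp:.2f}, residual {ri_exp - ri_calc:+.2f}")
```

The calculated values are about 202.3 for naphthalene and 296.9 for anthracene, so the residuals are roughly -2.3 and +4.8 index units, of the same order as the SDEC reported in Table 2.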
All statistical parameters are shown in Table 2.
The high value of R² reflects the quality of the fit, while the very small difference between R² and Q² indicates the robustness of the model. In addition, the similarity of SDEC and SDEP means that the internal predictive ability of the model is not too dissimilar from its fitting power.
Table 1. Values of Ri and X3 for the set of 59 PAHs. The last 21 chemicals are the test set.
| Chemical | Ri | X3 |
|---|---|---|
| Naphthalene | 200 | 3.466 |
| 2-Methylnaphthalene | 218.14 | 3.802 |
| 1-Methylnaphthalene | 221.04 | 3.933 |
| 2-Ethylnaphthalene | 236.08 | 4.226 |
| 2,6-Dimethylnaphthalene | 237.58 | 4.137 |
| 1,3-Dimethylnaphthalene | 240.25 | 4.178 |
| 1,8-Dimethylnaphthalene | 249.52 | 4.327 |
| 2,3,6-Trimethylnaphthalene | 263.31 | 4.723 |
| 2,3,5-Trimethylnaphthalene | 265.9 | 4.861 |
| Anthracene | 301.69 | 5.344 |
| 1-Phenylnaphthalene | 315.19 | 5.886 |
| 2-Methylanthracene | 321.57 | 5.68 |
| 2-Methylphenanthrene | 321.57 | 5.729 |
| 4-Methylphenanthrene | 323.17 | 5.806 |
| 9-Methylanthracene | 329.13 | 5.892 |
| 2-Phenylnaphthalene | 332.59 | 5.897 |
| 9-Ethylphenanthrene | 337.05 | 6.128 |
| 2-Ethylphenanthrene | 337.5 | 6.153 |
| 2,7-Dimethylphenanthrene | 339.23 | 6.065 |
| 9-Isopropylphenanthrene | 345.78 | 6.374 |
| 1,8-Dimethylphenanthrene | 346.26 | 6.339 |
| 9-n-Propylphenanthrene | 350.3 | 6.263 |
| 9,10-Dimethylanthracene | 355.49 | 6.451 |
| 9-Methyl-10-ethylphenanthrene | 359.91 | 6.672 |
| 1-Methyl-7-isopropylphenanthrene | 368.67 | 6.923 |
| 9,10-Dimethyl-3-ethylphenanthrene | 381.85 | 7.246 |
| Benzo(c)phenanthrene | 391.39 | 7.285 |
| Benzo(a)anthracene | 398.5 | 7.278 |
| 9-Phenylphenanthrene | 406.9 | 7.773 |
| 6-Methylbenzo(a)anthracene | 417.57 | 7.69 |
| 1-Phenylphenanthrene | 421.66 | 7.819 |
| 1,12-Dimethylbenzo(a)anthracene | 436.82 | 8.189 |
| 7,12-Dimethylbenzo(a)anthracene | 443.38 | 8.335 |
| Dibenzo(a,c)anthracene | 495.01 | 9.166 |
| Dibenzo(a,h)anthracene | 495.45 | 9.213 |
| 9-Methylbenzo(a)anthracene | 416.5 | 7.614 |
| 12-Methylbenzo(a)anthracene | 419.39 | 7.771 |
| 4-Methylbenzo(a)anthracene | 419.67 | 7.751 |
| 1-Ethylnaphthalene | 236.56 | 4.248 |
| 2,3-Dimethylnaphthalene | 243.55 | 4.387 |
| 1,2-Dimethylnaphthalene | 246.49 | 4.534 |
| 3,6-Dimethylphenanthrene | 337.83 | 6.079 |
| 1-Methylbenzo(a)anthracene | 414.37 | 7.691 |
| 3-Methylbenzo(a)anthracene | 416.63 | 7.614 |
| 7-Methylbenzo(a)anthracene | 423.14 | 7.832 |
| 2,7-Dimethylnaphthalene | 237.71 | 4.137 |
| 1,7-Dimethylnaphthalene | 240.66 | 4.276 |
| 1,6-Dimethylnaphthalene | 240.72 | 4.269 |
| 1,4-Dimethylnaphthalene | 243.57 | 4.414 |
| 1,5-Dimethylnaphthalene | 244.98 | 4.406 |
| 3-Methylphenanthrene | 319.46 | 5.736 |
| 9-Methylphenanthrene | 323.06 | 5.798 |
| 1-Methylanthracene | 323.33 | 5.818 |
| 1-Methylphenanthrene | 323.9 | 5.866 |
| 9,10-Dimethylphenanthrene | 367.97 | 6.48 |
| 11-Methylbenzo(a)anthracene | 412.72 | 7.752 |
| 2-Methylbenzo(a)anthracene | 413.78 | 7.621 |
| 8-Methylbenzo(a)anthracene | 417.56 | 7.752 |
| 5-Methylbenzo(a)anthracene | 418.72 | 7.683 |
The external statistical validation (Q²ext, SDEPext) attests to the good predictive capacity of the model for the compounds that did not take part in its calculation. SDEP differs little from SDEPext (the difference between SDEP and SDEPext is very small, 0.042 %).
The values of Ri and X3 for the set of PAHs are reported in Table 1.
Table 2. Statistical parameters of the developed model.
| ntr | next | Q²LOO | R² | Q²LMO/50 | SDEC | SDEP | SDEPext | S | F | Q²BOOT | R²adj | Q²ext |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 38 | 21 | 99.49 | 99.54 | 99.47 | 5.113 | 5.36 | 5.036 | 5.2527 | 7773.96 | 99.45 | 99.53 | 99.6 |
Figure 1. Residuals versus Ri (exp) for the entire set.
Figure 2. Randomization test associated with the previous QSRR model. The squares represent the randomly ordered retentions and the circle corresponds to the real retention.
Figure 3. Williams plot of the current QSRR model.
It is clear that the statistics obtained for the permuted retention index vectors are smaller than those of the real QSRR model, which makes it possible to affirm that the proposed model is not due to chance.
All standardized residuals lie within 3 standard deviation units (3s). The values h_ii are the ith diagonal terms of the projection matrix H = X(XᵀX)⁻¹Xᵀ, where X is the matrix of the observed values of the explanatory variables and Xᵀ its transpose. The critical value for identifying leverage points is h* = 3(p + 1)/n = 3 × 2/38 ≈ 0.158. It can be seen that all h_ii values are below this critical value, which means that the model has good external predictivity.
Mechanistic Interpretation:
The selected descriptor, its class and its meaning are given in Table 3.
Table 3. Selected descriptor with its meaning and class for the best GA/SLR model.
| Descriptor | Definition | Class |
|---|---|---|
| X3 | Connectivity index chi-3 | Connectivity indices |
CONCLUSION:
In this study, we developed a useful QSRR equation that relates theoretical molecular descriptors to the retention indices of 59 PAHs. For each compound, 1664 descriptors (belonging to 20 classes) were calculated with the Dragon software. The data set was divided into calibration and prediction sets using the Kennard and Stone algorithm. The best descriptors were then selected by the genetic algorithm of MobyDigs. The model obtained has high statistical quality and low prediction errors. In general, it can be concluded that, for this data set, the combination of modeling techniques results in an improvement of the linear models. The results indicate that the chosen descriptor plays an important role in the retention indices of PAHs.
CONFLICT OF INTEREST:
The authors declare no conflict of interest in this reported work.
REFERENCES:
1. Samanta SK, Singh OV, Jain RK. Polycyclic aromatic hydrocarbons: Environmental Pollution and Bioremediation. TRENDS in Biotechnology. 2002; 20(6):243-8.
2. Hylton JH. Aboriginal Self-Government in Canada: Current Trends and Issues. Purich's Aboriginal Issues Series: ERIC. 1994.
3. Juhasz AL, Naidu R. Bioremediation of high molecular weight polycyclic aromatic hydrocarbons: a review of the microbial degradation of benzo [a] pyrene. International Biodeterioration and Biodegradation. 2000; 45(1-2):57-88.
4. Wilcke W. Synopsis polycyclic aromatic hydrocarbons (PAHs) in soil—a review. Journal of Plant Nutrition and Soil Science. 2000; 163(3): 229-48.
5. Hill AJ, Ghoshal S. Micellar solubilization of naphthalene and phenanthrene from nonaqueous-phase liquids. Environmental science & technology. 2002; 36(18):3901-7.
6. Li JL, Chen BH. Solubilization of model polycyclic aromatic hydrocarbons by nonionic surfactants. Chemical Engineering Science. 2002; 57(14):2825-35.
7. Menzie CA, Potocki BB, Santodonato J. Exposure to carcinogenic PAHs in the environment. Environmental Science & Technology. 1992; 26(7):1278-84.
8. Rababah A, Matsuzawa S. Treatment system for solid matrix contaminated with fluoranthene. II––Recirculating photodegradation technique. Chemosphere. 2002; 46(1):49-57.
9. Gabet S. Remobilisation d'Hydrocarbures Aromatiques Polycycliques (HAP) présents dans les sols contaminés à l'aide d'un tensioactif d'origine biologique. 2004.
10. Bernal-Martinez A. Elimination des hydrocarbures aromatiques polycycliques présents dans les boues d'épuration par couplage ozonation-digestion anaérobie. 2005.
11. dos Santos Duarte C, dos Santos Marques A, Takahata Y. QSAR study of quinoline metanols with antimalarial activity against Plasmodium falciparum.
12. Verma MP, Gupta S. Software Development for Nursing: Role of Nursing Informatics. Int J Nur Edu and Research. 2017; 5(2):203-7.
13. Galande AK, Rohane SH. Insilico Molecular docking analysis in Maestro Software. Asian Journal of Research in Chemistry. 2021; 14(1):97-100.
14. Ganatra SH, Patle MR, Bhagat GK. Studies of Quantitative Structure-Activity Relationship (QSAR) of Hydantoin Based Active Anti-Cancer Drugs. Tc. 2011; 1(2.124):8-9351.
15. Swathi N, Subrahmanyam CVS, Satyanarayana K. Synthesis and Quantitative Structure-Antioxidant Activity Relationship Analysis of Thiazolidine-2, 4-dione Analogues. Asian J Research Chem. 2015; 8(1):21-6.
16. Otuokere IE, Amaku FJ. Computer-aided drug design of an anti-angiogenic and immunomodulatory agent,-(2, 4-dioxocyclohexyl)-1H-isoindole-1, 3 (2H)-dione (thalidomide). Asian Journal of Research in Chemistry. 2015; 8(9):601-5.
17. Mathew B, Mathew GE, Shafeer VP, Musthafa CM, Femina P. A Green Route Approach of α, β-Unsaturated Ketone Having a Benzimidazole Tail and Their Virtual Screening on the Molecular Descriptors for Predicting the CNS-Drug likeness. Asian Journal of Research in Chemistry. 2012; 5(1):65-8.
18. Otuokere IE, Alisa CO. Computational Study on Molecular Orbital’s, Excited State Properties and Geometry Optimization of Anti-benign Prostatic Hyperplasia Drug, N-(1, 1-dimethylethyl)-3-oxo-(5α, 17β)-4-azaandrost-1-ene-17-carboxamide (Finasteride). Asian Journal of Research in Pharmaceutical Science. 2014; 4(4): 169-73.
19. Otuokere IE, Amaku FJ. Conformation Analysis and Self-Consistent Field Energy of Immune Response Modifier, 1-(2-methylpropyl)-1H-imidazo [4, 5] quinolin-4-amine (Imiquimod). Asian Journal of Research in Pharmaceutical Science. 2015; 5(3):1-6.
20. Otuokere IE, Amaku FJ, Alisa CO. In silico geometry optimization, excited–state properties of (2E)-N-Hydroxy-3-[3-(Phenylsulfamoyl) Phenyl] prop-2-enamide (Belinostat) and its molecular docking studies with Ebola Virus glycoprotein. Asian Journal of Pharmaceutical Research. 2015; 5(3):131-7.
21. Yadav M, Yadav VK. A Study of Challenges and Practices Related to HRM in Software Industry. Asian Journal of Management. 2017; 8(4):1233-6.
22. Gujjar PJ, Manjunatha T. Testing of Dupont model for software and Training services Companies in India. Asian Journal of Management. 2021; 12(2):169-80.
23. Kang J, Cao C, Li Z. Quantitative structure–retention relationship studies for predicting the gas chromatography retention indices of polycyclic aromatic hydrocarbons: Quasi-length of carbon chain and pseudo-conjugated system surface. Journal of Chromatography A. 1998; 799(1-2):361-7.
24. Kennard RW, Stone LA. Computer aided design of experiments. Technometrics. 1969; 11(1):137-48.
25. Zakharian TY, Coon SR. Evaluation of Spartan semi-empirical molecular modeling software for calculations of molecules on surfaces: CO adsorption on Ni (111). Computers & Chemistry. 2001; 25(2):135-44.
26. Todeschini R, Consonni V, Mauri A, Pavan M. DRAGON-Software for the calculation of molecular descriptors. Web Version. 2004; 3.
27. Leardi R, Boggia R, Terrile M. Genetic algorithms as a strategy for feature selection. Journal of Chemometrics. 1992; 6(5):267-81.
28. Todeschini R, Ballabio D, Consonni V, Mauri A, Pavan M. MOBYDIGS, Software for Multilinear Regression Analysis and Variable Subset Selection by Genetic Algorithm. Release 1.1 for windows. Milano; 2009.
29. Tropsha A, Gramatica P, Gombar VK. The importance of being earnest: validation is the absolute essential for successful application and interpretation of QSPR models. QSAR & Combinatorial Science. 2003; 22(1):69-77.
30. Wu W, Walczak B, Massart DL, Heuerding S, Erni F, Last IR, et al. Artificial neural networks in classification of NIR spectral data: design of the training set. Chemometrics and Intelligent Laboratory Systems. 1996; 33(1):35-46.
31. Eriksson L, Jaworska J, Worth AP, Cronin MTD, McDowell RM, Gramatica P. Methods for reliability and uncertainty assessment and for applicability evaluations of classification-and regression-based QSARs. Environmental Health Perspectives. 2003; 111(10):1361-75.
Received on 30.11.2022; Modified on 25.05.2023
Accepted on 19.09.2023; ©AJRC All rights reserved
Asian J. Research Chem. 2023; 16(5):358-362.
DOI: 10.52711/0974-4150.2023.00057