Development of a new quantitative structure–activity relationship model for predicting Ames mutagenicity of food flavor chemicals using StarDrop™ auto-Modeller™

Background: Food flavors are relatively low molecular weight chemicals with unique odor-related functional groups that may also be associated with mutagenicity. These chemicals are often difficult to test for mutagenicity by the Ames test because of their low production and peculiar odor. Therefore, application of the quantitative structure–activity relationship (QSAR) approach is being considered. We used the StarDrop™ Auto-Modeller™ to develop a new QSAR model.
Results: In the first step, we developed a new robust Ames database of 406 food flavor chemicals consisting of existing Ames flavor chemical data and newly acquired Ames test data. Ames results for some existing flavor chemicals have been revised by expert reviews. We also collected 428 Ames test datasets for industrial chemicals from other databases that are structurally similar to flavor chemicals. A total of 834 chemicals’ Ames test datasets were used to develop the new QSAR models. We repeated the development and verification of prototypes by selecting appropriate modeling methods and descriptors and developed a local QSAR model. A new QSAR model “StarDrop NIHS 834_67” showed excellent performance (sensitivity: 79.5%, specificity: 96.4%, accuracy: 94.6%) for predicting Ames mutagenicity of 406 food flavors and was better than other commercial QSAR tools.
Conclusions: A local QSAR model, StarDrop NIHS 834_67, was customized to predict the Ames mutagenicity of food flavor chemicals and other low molecular weight chemicals. The model can be used to assess the mutagenicity of food flavors without actual testing.
Supplementary Information: The online version contains supplementary material available at 10.1186/s41021-021-00182-6.


Introduction
Food flavor chemicals are used and/or present in foods at very low levels. Human exposure to these flavor chemicals through foods is too low to raise concerns about general toxicity. Regarding mutagenicity, however, there are health concerns even with trace amounts because there is no threshold for mutagenicity, and even very low levels of exposure to mutagenic chemicals do not result in zero carcinogenic risk [1]. Therefore, the presence or absence of mutagenicity is an important point for risk assessment of flavor chemicals.
The bacterial reverse mutation test (Ames test) is an important mutagenicity test, but it requires approximately 2 g of sample for a dose-finding study and main study [2]. However, the amount of flavor produced industrially is extremely small, which often makes testing impossible. Additionally, the peculiar odor of some flavors sometimes makes it difficult to perform the test in the laboratory. Recently, quantitative structure–activity relationship (QSAR) approaches have frequently been used instead of the Ames test for assessing the mutagenicity of chemicals [3]. Ono et al. assessed the viability of this approach by using three QSAR tools to calculate the Ames mutagenicity of 367 flavor chemicals for which Ames test results were available [4]. The highest sensitivity (the ability of a QSAR tool to correctly detect Ames-positive chemicals) was 38.9% with a single tool and only 47.2% even with the three tools combined, which indicated that application of QSAR tools to assess the Ames mutagenicity of flavor chemicals was still premature. Therefore, it is necessary to improve or develop QSAR tools for predicting Ames mutagenicity of flavor chemicals.
Flavor chemicals are relatively low molecular weight chemical substances mainly composed of carbon, hydrogen, oxygen, nitrogen, and sulfur that often have specific functional groups. In Japan, most food flavors are classified into 18 types according to their chemical structure [5]. Therefore, with a focus on their characteristic chemical space, we thought that there was potential to increase the predictive performance by developing a local QSAR model customized for flavor chemicals. In recent years, computational software has been provided to assist with development of QSAR models by machine learning. We have tried to develop a QSAR model specialized for flavor chemicals using StarDrop™ software, which has a module (Auto-Modeller™) that can generate predictive models automatically.
Before developing the QSAR model, we developed a new robust Ames database of 406 food flavor chemicals that is based on Ono's database [4]. We re-evaluated ambiguous data judged as "equivocal" in Ono's database via literature review and incorporated Ames test data of flavor chemicals from other publicly available databases.
In parallel, we performed the Ames test on key flavor chemicals for which Ames data were unavailable and incorporated the results into the new database. This benchmark food flavor chemical database is useful for developing QSAR models and evaluating their performance.

Ames test database of food flavor chemicals
We utilized the Ames test database of food flavor chemicals reported by Ono et al. [4]. Because the database includes 14 "equivocal" judgments (Table 1), we re-evaluated these by reviewing the reference literature and reclassified them as positive, negative, or inconclusive. Ames test data of the "inconclusive" chemicals were excluded from the database. Flavor chemicals found in a publicly available Ames test database (Hansen database [6]) were also added.

Ames test
Ames tests were performed for 45 flavor chemicals. The purities and suppliers of the test chemicals are shown in Table 2. The Ames tests were conducted by contract research organizations in compliance with Good Laboratory Practice, according to the Industrial Safety and Health Act test guideline with the preincubation method [7]. The test guideline requires five strains (Salmonella typhimurium TA100, TA98, TA1535, TA1537, and Escherichia coli WP2 uvrA) in both the presence and absence of metabolic activation (rat S9 mix prepared from phenobarbital and 5,6-benzoflavone-induced rat liver), which is similar to the Organisation for Economic Co-operation and Development guideline TG471 [8]. A result was judged positive when the number of revertant colonies increased to more than twice that of the control in at least one Ames test strain in the presence or absence of S9 mix. Dose dependency and reproducibility were also considered in the final judgment. The relative activity value (RAV), which is defined as the number of induced revertant colonies per mg, was calculated for positive results.
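The two-fold criterion and the RAV can be illustrated with a minimal sketch for a single strain and a single S9 condition. The function names, the example colony counts, and the choice to report the RAV at the most active positive dose are assumptions made for illustration; the final judgment in this study also weighed dose dependency and reproducibility, which the sketch does not model.

```python
from typing import Optional, Sequence


def is_positive(revertants: Sequence[float], control: float, fold: float = 2.0) -> bool:
    """True if any dose yields more than `fold` times the control colony count."""
    return any(count > fold * control for count in revertants)


def relative_activity_value(
    doses_mg: Sequence[float], revertants: Sequence[float], control: float
) -> Optional[float]:
    """Induced revertant colonies per mg at the most active positive dose.

    Assumption: only doses meeting the two-fold criterion are considered.
    Returns None when no dose meets the criterion.
    """
    per_mg = [
        (count - control) / dose
        for dose, count in zip(doses_mg, revertants)
        if dose > 0 and count > 2.0 * control
    ]
    return max(per_mg) if per_mg else None


# Hypothetical plate counts (not data from this study).
doses = [0.05, 0.1, 0.5]   # mg/plate
counts = [120, 250, 900]   # revertant colonies per plate
solvent_control = 100

positive = is_positive(counts, solvent_control)
rav = relative_activity_value(doses, counts, solvent_control)
```

In practice the same check would be repeated for each of the five strains with and without S9 mix, and a chemical would be flagged when at least one combination is positive.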

Commercial QSAR tools
DEREK Nexus™ is a knowledge-based commercial software developed by Lhasa Limited, UK [9,10]. The software includes knowledge rules created by considering insights related to structural alerts, chemical compound examples, and metabolic activations and mechanisms. We used DEREK Nexus™ version 6.1.0 in this study. DEREK Nexus™ ranks the possibility of mutagenicity (certain, probable, plausible, equivocal, doubted, improbable, impossible, open, contradicted, nothing to report) by applying a "reasoning rule." When it is "certain," "probable," "plausible," or "equivocal," the query chemical is predicted to be positive in the Ames test.
CASE Ultra is a QSAR-based toxicity prediction software developed by MultiCASE Inc. (USA). CASE Ultra uses a statistical method to automatically extract alerts based on training data by using machine learning technology [11,12]. The structural characteristics of the alert surroundings are called the "modulator," and these are also learned automatically from the training data. In this algorithm, to construct a QSAR model with continuous toxicity endpoints, various physical chemistry parameters and descriptors are used. We used CASE Ultra version 1.8.0.2 with the GT1_BMUT module in this study. The prediction result of each module is ranked as "known positive," "positive," "negative," "known negative," "inconclusive," or "out of domain." A query chemical ranked "known positive," "positive," or "inconclusive" is predicted to be positive in the Ames test.
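The mapping from each tool's ranked output to a binary Ames call can be written down as a short sketch. The positive categories are those stated above; treating every other DEREK Nexus™ level as negative and returning no call for a CASE Ultra "out of domain" result are assumptions, and the function names are hypothetical rather than part of either product.

```python
from typing import Optional

# Categories stated in the text as leading to a positive prediction.
DEREK_POSITIVE = {"certain", "probable", "plausible", "equivocal"}
CASE_ULTRA_POSITIVE = {"known positive", "positive", "inconclusive"}


def derek_call(level: str) -> str:
    """Binary call from a DEREK Nexus likelihood level.

    Assumption: any level outside DEREK_POSITIVE counts as negative.
    """
    return "positive" if level.strip().lower() in DEREK_POSITIVE else "negative"


def case_ultra_call(rank: str) -> Optional[str]:
    """Binary call from a CASE Ultra rank.

    Assumption: "out of domain" gives no call (None), so the chemical
    would not count toward sensitivity, specificity, or accuracy.
    """
    rank = rank.strip().lower()
    if rank == "out of domain":
        return None
    return "positive" if rank in CASE_ULTRA_POSITIVE else "negative"
```

Keeping the no-call case separate from "negative" makes it possible to report how many chemicals a tool could actually assess alongside its accuracy.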

Analysis of QSAR tool performance
Because Ames test results are binary (positive or negative), the predictive power of a QSAR tool can be objectively quantified from the concordance between the test results and the QSAR calculation results. The 2 × 2 prediction matrix comprising true positive (TP), false positive (FP), false negative (FN), and true negative (TN) is given in Table 3.
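The performance measures derived from the 2 × 2 matrix can be sketched as follows, using the standard definitions of sensitivity, specificity, and accuracy. The class name is hypothetical. The example counts are an assumption: they combine the 9 FN and 13 FP chemicals reported for StarDrop NIHS 834_67 with positive and negative totals (44 and 362) inferred from the dataset sizes given in this paper. Applicability is omitted because its definition is not given in this section.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PredictionMatrix:
    """2 x 2 matrix of QSAR predictions against Ames test results."""

    tp: int  # Ames positive, predicted positive
    fp: int  # Ames negative, predicted positive
    fn: int  # Ames positive, predicted negative
    tn: int  # Ames negative, predicted negative

    @property
    def sensitivity(self) -> float:
        """Fraction of Ames-positive chemicals predicted positive."""
        return self.tp / (self.tp + self.fn)

    @property
    def specificity(self) -> float:
        """Fraction of Ames-negative chemicals predicted negative."""
        return self.tn / (self.tn + self.fp)

    @property
    def accuracy(self) -> float:
        """Fraction of all chemicals predicted correctly."""
        return (self.tp + self.tn) / (self.tp + self.fp + self.fn + self.tn)


# Inferred counts (assumption, see lead-in): 44 positives, 362 negatives.
example = PredictionMatrix(tp=35, fp=13, fn=9, tn=349)
summary = {
    "sensitivity_%": round(100 * example.sensitivity, 1),
    "specificity_%": round(100 * example.specificity, 1),
    "accuracy_%": round(100 * example.accuracy, 1),
}
```

With these counts the three measures round to the values reported for StarDrop NIHS 834_67 (79.5%, 96.4%, and 94.6%).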

Development of a new Ames test database of food flavor chemicals
We developed a new Ames test database consisting of 406 food flavor chemicals (Table 4). The data source is described as follows. Ono et al. reported an Ames test database consisting of 367 food flavor chemicals (positive: 24, equivocal: 12, negative: 331) [4]. However, it actually contained 369 chemicals (positive: 24, equivocal: 14, negative: 331). Table 1 shows the 14 equivocal chemicals. We reviewed key references that led to "equivocal" and re-evaluated to determine if there was evidence of positivity or negativity in view of current testing criteria. Our final judgment and the supporting reasons are described in Table 1 [13–23]. If there was insufficient evidence or no detailed information available for the judgment, we concluded that they were "inconclusive." Among 14 equivocal flavoring chemicals, four were positive, six were negative, and four were inconclusive. In total, 365 flavor chemicals (positive: 28, negative: 337), excluding four inconclusive chemicals, were added to the new database.

Development of a new QSAR model for predicting Ames mutagenicity
We developed a new QSAR model for predicting Ames mutagenicity by using StarDrop™ Auto-Modeller™. Developing a QSAR model requires a study dataset of Ames test results. We used the data for the 406 flavor chemicals in the new Ames test database to develop the model. To further increase the size of the dataset (especially positive data), we added Ames test data of chemicals structurally similar to flavor chemicals. We previously developed a large Ames test database consisting of > 12,000 industrial chemicals [25]. We selected 428 chemicals (positive: 255; negative: 173) from the database that have molecular weights < 500 and possess a characteristic substructure of flavor chemicals defined in the Food Sanitation Law in Japan [5]. The Ames test data of 834 chemicals (positive: 299, negative: 535) were integrated as the study dataset for the development of the QSAR model.
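A short tally shows how adding the industrial chemicals changes the class balance of the study dataset. The flavor-chemical counts (44 positive, 362 negative) are not stated directly in this section; they are inferred here by subtracting the 428 added chemicals from the 834 totals and should be read as an assumption.

```python
# Counts for the 428 added industrial chemicals are stated in the text.
industrial = {"positive": 255, "negative": 173}

# Inferred by subtraction from the 834 totals (299 positive, 535 negative).
flavor = {"positive": 299 - 255, "negative": 535 - 173}

combined = {label: flavor[label] + industrial[label] for label in flavor}


def positive_fraction(counts: dict) -> float:
    """Share of Ames-positive chemicals in a dataset."""
    return counts["positive"] / sum(counts.values())


flavor_pct = round(100 * positive_fraction(flavor), 1)
combined_pct = round(100 * positive_fraction(combined), 1)
```

Under these inferred counts, positives make up roughly a tenth of the flavor-only data but more than a third of the combined study dataset, which is consistent with the stated aim of adding positive data in particular.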
Prototypes of predictive models were built by using an automatic process. The study dataset was divided into training (70%) and validation (30%) data by using the cluster method, which uses an unsupervised nonhierarchical clustering algorithm developed by Butina [26]. Auto-Modeller™ has three modeling methods (Gaussian process, random forest, and decision tree) for the category model. In a pretest, the random forest model gave the best performance for our target. The descriptors were automatically generated, including whole molecule descriptors (e.g., molecular weight, logP, and polar surface area) and 2D structural descriptors from the training set. Because the accuracy of the prototype depends on the training data set and the data splitting process is not replicable, 80 prototypes were built to search for the best model. The prototypes that earned favorable prediction scores were selected for further performance evaluation by using the Ames test data of flavoring chemicals, and their performances were compared with those of the benchmarks. Finally, a new QSAR model "StarDrop NIHS 834_67" was developed.
The prediction result is ranked as "positive" or "negative."
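The cluster-based split described above can be illustrated with a minimal pure-Python sketch of Butina-style clustering followed by a roughly 70/30 split. This is not the Auto-Modeller™ implementation, whose internals are not described here. The fingerprints as sets of integer feature IDs, the Tanimoto cutoff of 0.7, the rule of assigning whole clusters to one side, and all function names are assumptions for illustration.

```python
import random
from typing import List, Sequence, Set, Tuple


def tanimoto(a: Set[int], b: Set[int]) -> float:
    """Tanimoto similarity between two feature sets."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fps: Sequence[Set[int]], cutoff: float = 0.7) -> List[List[int]]:
    """Butina-style clustering: each cluster is a centroid plus its
    still-unassigned neighbours with similarity >= cutoff."""
    n = len(fps)
    neighbours = [
        [j for j in range(n) if j != i and tanimoto(fps[i], fps[j]) >= cutoff]
        for i in range(n)
    ]
    # Candidates with the most neighbours become centroids first.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned: Set[int] = set()
    clusters: List[List[int]] = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in neighbours[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)
    return clusters


def cluster_split(
    clusters: Sequence[Sequence[int]], train_fraction: float = 0.7, seed: int = 0
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to training until the target size would be
    exceeded; the remaining clusters go to validation."""
    rng = random.Random(seed)
    shuffled = [list(c) for c in clusters]
    rng.shuffle(shuffled)
    target = round(train_fraction * sum(len(c) for c in shuffled))
    train: List[int] = []
    validation: List[int] = []
    for cluster in shuffled:
        if len(train) + len(cluster) <= target:
            train.extend(cluster)
        else:
            validation.extend(cluster)
    return train, validation


# Toy fingerprints (hypothetical feature IDs, not real descriptors).
toy_fps = [{1, 2, 3}, {1, 2, 3, 4}, {1, 2, 4}, {7, 8, 9}, {7, 8, 9, 10}, {20, 21}]
toy_clusters = butina_clusters(toy_fps, cutoff=0.7)
# Different seeds give different splits, which is why many prototypes are built.
toy_splits = [cluster_split(toy_clusters, 0.7, seed) for seed in range(3)]
```

Because the split depends on the random seed, each run can produce a different training set and therefore a different prototype, which mirrors the reason given above for building 80 prototypes and selecting among them.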

Performance of the QSAR model
We evaluated the performance of StarDrop NIHS 834_67 in predicting Ames mutagenicity. We calculated the Ames mutagenicity of the 406 food flavors listed in the new Ames database. Table 4 shows the results of the QSAR calculation. Table 5 is a 2 × 2 prediction matrix, and Table 6 shows the performance (sensitivity, specificity, accuracy, and applicability) of the three (Q)SAR tools. StarDrop NIHS 834_67 showed the best performance. Table 7 shows nine FN chemicals that were positive in the Ames test but were negatively predicted by StarDrop NIHS 834_67. Table 8 shows 13 FP chemicals that were negative in the Ames test but were positively predicted by StarDrop NIHS 834_67.

Discussion
We have developed a new Ames database consisting of 406 food flavor chemicals. This benchmark food flavor chemical database is open to the public and useful for risk assessment of food additives and for developing QSAR models for predicting Ames mutagenicity of food flavor chemicals and other low molecular weight chemicals. The main body of the database is derived from the database reported by Ono et al. [4]. We re-assessed 14 "equivocal" chemicals and classified them as negative, positive, or inconclusive. However, the positive and negative chemicals remaining in Ono's database were not re-assessed. Some of these chemicals may also be misjudged. In fact, 2,3-pentanedione (600-14-6), which was negative in Ono's database, was clearly positive in the present Ames test (Additional file (6)). To ensure database robustness, it is necessary to re-assess the test results reported as positive and negative. As described below, Ames test results that differ from the QSAR predictions in particular warrant re-examination.
In 2012, Ono et al. reported the performance of three commercial QSAR tools (Derek for Windows, MultiCASE, and ADMEWorks) for predicting Ames mutagenicity of 367 food flavor chemicals [4]. Derek for Windows and MultiCASE are earlier versions of DEREK Nexus™ and CASE Ultra, respectively. The sensitivity, specificity, and accuracy were 38.9, 93.4, and 88.0% for Derek for Windows and 25.0, 94.3, and 87.5% for MultiCASE, respectively. In this study, we evaluated the performance of DEREK Nexus™ and CASE Ultra for 406 food flavors in the new Ames database. The sensitivity, specificity, and accuracy were 70.5, 96.1, and 93.3% for DEREK Nexus™ and 70.5, 90.3, and 88.2% for CASE Ultra, respectively. These results indicate that the performance of QSAR prediction has improved significantly over the last decade. The improvement in sensitivity was particularly remarkable. Improvement of the QSAR models and accumulation of newly acquired Ames test training data may have contributed to the high performance. In particular, the NIHS-sponsored Ames/QSAR International Challenge Project has contributed significantly to improving the performance of commercial QSAR tools, such as DEREK Nexus™ and CASE Ultra, which have acquired over 12,000 unique chemical Ames datasets [24]. The newly developed StarDrop NIHS 834_67 outperformed DEREK Nexus™ and CASE Ultra. StarDrop NIHS 834_67 also acquired 428 chemicals (positive: 255, negative: 173) selected from the 12,000 unique chemical Ames datasets. Despite incorporating the same training data, StarDrop NIHS 834_67 provided higher predictive performance, probably due to differences in the target chemical space. Flavor chemicals have relatively low molecular weights and unique functional groups, which makes it possible to focus on the chemical space of interest and develop highly predictive models with relatively small training datasets. Our attempt to develop a local QSAR model focused on flavor chemicals has been somewhat successful.
However, it is not surprising that StarDrop NIHS 834_67 showed higher performance than the other QSAR tools. This may be because StarDrop NIHS 834_67 used the results of 39 new flavor chemical datasets and the revised existing flavor chemical data as training and validation data.
Considering that the estimated interlaboratory reproducibility of the Ames test has been reported to be approximately 85% [27,28], the performance of the prediction may be approaching the upper limit. Nonetheless, FN and FP analysis points to possible improvements in the database and QSAR models. Of the nine FN flavor chemicals by StarDrop NIHS 834_67, menthone (89-80-5), raspberry ketone (5471-51-2), and cadinene (29350-73-0) were also predicted as negative by DEREK Nexus™ and CASE Ultra (Table 7). These chemicals, which were predicted to be negative by all three QSAR tools, may actually be Ames negative. Actual Ames tests are needed to confirm this.
On the other hand, of the 13 FP chemicals, 3,4-hexanedione (4437-51-8) was also predicted as positive by DEREK Nexus™ and CASE Ultra. The Ames mutagenicity of this chemical may actually be positive. Interestingly, 12 other FP flavor chemicals were correctly predicted as negative by DEREK Nexus™ and CASE Ultra, which highlights the different characteristics between StarDrop NIHS 834_67 and other QSAR tools and indicates the potential for further improvement.

Conclusions
We developed a new Ames database of 406 food flavor chemicals. Using this database and other Ames datasets of chemicals that are structurally similar to flavor chemicals, we also developed a new QSAR model for predicting Ames mutagenicity. The local QSAR model, StarDrop NIHS 834_67, is customized to efficiently predict the mutagenicity of food flavors and other low molecular weight chemicals, delivering performance superior to that of other commercial QSAR tools. With further improvement, the model can be used to assess the mutagenicity of food flavors without actual testing.