Expansive linguistic representations to predict interpretable odor mixture discriminability
Curation statements for this article:-
Curated by eLife
eLife assessment
Dhurandhar and colleagues developed a computational method that predicts the discriminability of odor mixtures from the chemical structures of their component molecules. The model first transforms chemical structures into natural language descriptions of odor and then performs Lasso regression to obtain a compact transformation into discriminability. The results suggest that the model performs better than one without the transformation to language descriptions; however, some issues need to be addressed before strong conclusions can be drawn.
Abstract
Language is often thought of as being poorly adapted to precisely describe or quantify smell and olfactory attributes. In this work, we show that semantic descriptors of odors can be implemented in a model to successfully predict odor mixture discriminability, an olfactory attribute. We achieved this by taking advantage of the structure-to-percept model we previously developed for monomolecular odorants, using chemical descriptors to predict pleasantness, intensity and 19 semantic descriptors such as “fish,” “cold,” “burnt,” “garlic,” “grass,” and “sweet” for odor mixtures, followed by metric learning to obtain odor mixture discriminability. Through this expansion of the representation of olfactory mixtures, our Semantic model outperforms state-of-the-art methods by taking advantage of the intermediary semantic representations learned from human perception data to enhance and generalize the odor discriminability/similarity predictions. As 10 of the semantic descriptors were selected to predict discriminability/similarity, our approach meets the need of rapidly obtaining interpretable attributes of odor mixtures, as illustrated by the difficulty of finding olfactory metamers. More fundamentally, it also shows that language can be used to establish a metric of discriminability in the everyday olfactory space.
Article activity feed
Reviewer #1 (Public Review):
In this manuscript, Dhurandhar, Cecchi and Meyer present a model that aims to predict the discrimination performance of human subjects in an odor mixture discrimination task using low-dimensional features, which include intensity, pleasantness and a set of 19 semantic descriptors. Specifically, the authors aim to find a metric of odor mixture similarity in feature space that accurately captures similarity (or discriminability) as judged by human subjects. The semantic descriptors are obtained from a chemoinformatic model previously developed by the authors. A mixture's feature vector is defined as the average of the features of the individual components. A Mahalanobis distance is defined between two mixtures; its parameters are fit using experimental data from Bushdid et al., Science, 2016, and the fitted model is then applied to three other independent datasets. The authors show that their model achieves a lower prediction RMSE than a previously published model on two of the datasets.
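To make the two steps summarized above concrete (averaging component features, then measuring a Mahalanobis distance between mixtures), here is a minimal sketch. It is not the authors' implementation: the three-descriptor vectors, the matrix `M`, and the function names `mixture_vector` and `mahalanobis` are all hypothetical and chosen only for illustration.

```python
import math

def mixture_vector(component_vectors):
    """Average the descriptor vectors of a mixture's components."""
    n = len(component_vectors)
    dim = len(component_vectors[0])
    return [sum(v[i] for v in component_vectors) / n for i in range(dim)]

def mahalanobis(x, y, M):
    """Return sqrt((x - y)^T M (x - y)) for a positive semidefinite M."""
    d = [a - b for a, b in zip(x, y)]
    quad = sum(d[i] * M[i][j] * d[j]
               for i in range(len(d)) for j in range(len(d)))
    return math.sqrt(max(quad, 0.0))  # guard against tiny negative round-off

# Hypothetical 3-descriptor vectors (say intensity, pleasantness, "sweet").
mix_a = mixture_vector([[0.8, 0.2, 0.6], [0.4, 0.6, 0.2]])
mix_b = mixture_vector([[0.1, 0.9, 0.3], [0.3, 0.5, 0.1]])

# Hypothetical metric. A diagonal M just weights each descriptor;
# a zero on the diagonal means that descriptor is ignored.
M = [[1.0, 0.0, 0.0],
     [0.0, 4.0, 0.0],
     [0.0, 0.0, 0.0]]

distance = mahalanobis(mix_a, mix_b, M)
```

In this toy setting a zero diagonal entry plays the role of a descriptor that was not selected, which is one way to read the interpretability claim; how the fitted parameters are actually structured in the manuscript is not specified here.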
Strengths:
The idea to relate the embedding vector of individual odor components to the embedding of a mixture so as to predict mixture discrimination performance is novel and interesting.
Weaknesses:
- The authors' claims are not supported by the data presented in the figures. A trivial model that predicts a constant can potentially achieve better predictive performance:
It is difficult to gauge the performance of the model solely from the RMSE because the data and predictions are not plotted (except in pooled form in Figure 4b, where they are obscured by the density plot). The RMSE should at a minimum be compared to the standard deviation of the dataset and reported as the fraction of variance unexplained. Without knowledge of the standard deviation of the experimental data, it is not possible to judge the quality of the prediction.
An examination of the inset in Figure 2a and of Figure 4 shows that the data span from ~0.54 to ~0.75. Since this range is comparable to the RMSE of ~0.17 obtained by the authors' prediction, I examined the data from the four datasets provided as a supplement by the authors. It turns out that the standard deviations of the discrimination performance (the output variable) are: Bushdid 0.176, Ravia 0.144, Snitz1 0.124, Snitz2 0.119. As these numbers indicate, simply using the constant mean as a prediction will lead to an RMSE of 0.176 for the Bushdid dataset.
This appears to contradict the middle inset in Figure 2a, which looks like a good fit. Closer examination of the two plots shows that the experimental data in the two are not the same (note, for example, the two data points with y < 0.45 in the left plot, which are absent in the right). Since the authors have not clarified in the caption whether this is an illustration or actual data, it is unclear how to interpret this plot.
The data transformations performed to obtain the mixture embedding vector seem arbitrary. For a mixture of 30 components (or even 10), this involves taking an average of 30 feature vectors, in which distinguishing features will very likely average out. The authors should explain the rationale for taking the average rather than, for instance, the most common descriptors that appear in the mixture components.
Other comments: i) The authors use linear regression to model a classification task, and the justification for this choice is not explained. ii) Although these are not the authors' primary data, the authors should perhaps comment on why the minimal performance is not at chance level (33%) but instead around 50%, even when the overlap between the mixtures is close to 100%. iii) The authors do not define the Direct model. How is the RMSE of the Direct model on the Ravia dataset (0.45) much larger than the standard deviation of the dataset (0.144)?
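The baseline argument in this review, that a constant-mean predictor has an RMSE equal to the standard deviation of the data, can be checked with a short sketch. The `observed` values below are hypothetical and are not taken from any of the four datasets; the helper names are likewise assumptions made for illustration.

```python
import math
import statistics

def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    return math.sqrt(
        sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed)
    )

def fraction_variance_unexplained(observed, predicted):
    """Mean squared error divided by the (population) variance of the data."""
    return rmse(observed, predicted) ** 2 / statistics.pvariance(observed)

# Hypothetical fractions of subjects who discriminated each mixture pair.
observed = [0.54, 0.60, 0.75, 0.48, 0.90, 0.66]

# Trivial model: always predict the mean of the data.
mean_value = statistics.fmean(observed)
constant_prediction = [mean_value] * len(observed)

baseline_rmse = rmse(observed, constant_prediction)
population_sd = statistics.pstdev(observed)
baseline_fvu = fraction_variance_unexplained(observed, constant_prediction)
```

Here `baseline_rmse` equals `population_sd` and `baseline_fvu` equals 1, so any model worth reporting should have a fraction of variance unexplained clearly below 1. The identity is exact for the population standard deviation (dividing by n); with the sample standard deviation (dividing by n - 1) the two differ only slightly for datasets of realistic size.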
-
Reviewer #2 (Public Review):
The authors introduce a model based on textual data for predicting odor properties of a mixture of chemicals. The modelling approach is relevant to olfactory scientists and experimental neuroscientists.
The work is relevant because it unifies and studies multiple mixture odor datasets, achieving satisfactory results. The work is novel because modelling for mixture datasets is scarce, and it introduces a grounded approach for modelling such data. The model is directly interpretable since it relies on a linear model (Lasso) to build the mapping between features (metric learning).
The authors' evidence supports most of the conclusions of the work, with some room for improvement.
This work can be one of many in the future that further modelling approaches for mixture data.
-
Reviewer #3 (Public Review):
It has been difficult to predict perceptual quality of odor mixtures. In this study, Dhurandhar and colleagues developed a computational method to predict perceptual discriminability of odor mixtures. The authors previously developed a method to predict natural language descriptors from chemical structures of monomolecular odorants (Gutierrez et al., Nat. Commun. 2018). In the new model developed in the present study, the authors used these predicted natural language descriptors to predict the discriminability of odor mixtures. This was done by first averaging the values of natural language descriptors across component odorants in a mixture. The authors then used a Lasso regression to predict the fraction of subjects that correctly discriminated these odors from the Mahalanobis distance between the average descriptors of two odorants. The performance of the model was compared against a "Direct model" in which chemical structures were used directly to compute the vector angles based on the cosine similarity metric.
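For readers unfamiliar with the angle-based comparison attributed to the Direct model above, the following is a minimal sketch of a vector angle derived from cosine similarity. The vectors are hypothetical stand-ins for mixture-level chemical descriptors; how the Direct model actually builds those vectors is not described in this review, so this shows only the distance itself.

```python
import math

def angle_distance(u, v):
    """Angle in radians between two non-zero vectors, via cosine similarity."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cosine = dot / (norm_u * norm_v)
    # Clamp to [-1, 1] so floating-point round-off cannot break acos.
    cosine = max(-1.0, min(1.0, cosine))
    return math.acos(cosine)

# Hypothetical descriptor vectors for two mixtures.
orthogonal_angle = angle_distance([1.0, 0.0], [0.0, 1.0])
parallel_angle = angle_distance([1.0, 2.0], [2.0, 4.0])
```

Orthogonal vectors give an angle of pi/2 and parallel vectors an angle of approximately 0. Unlike a learned Mahalanobis distance, this measure has no fitted parameters, which is one of the ways the two models being compared differ.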
The authors address an important question, and the model they propose is potentially interesting to the community. The method is relatively simple and the manuscript is written relatively clearly. However, I have some concerns about the approach and the methods used.
Major concerns
1. The authors compare the new model against the Direct model. The performance was compared based on the root mean squared error (RMSE). While the result indicates a statistically significant improvement, the models differ in multiple ways, and it is unclear which components of the new model contributed to the improvement. The authors should also test a model in which discrimination performance is predicted from chemical structures using a Lasso regression. Comparison to this model would be necessary to demonstrate that the transformation to natural language descriptors was critical for the improvement, and that the improvement was not simply due to the use of Lasso.
2. The authors should compare their model against other classes of model proposed before.
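Concern 1 turns on what Lasso contributes by itself, independent of the language descriptors. As a rough illustration of that contribution (sparse feature selection), here is a self-contained coordinate-descent sketch for the objective (1/(2n))·||y − Xw||² + alpha·||w||₁ without an intercept. The data, the value of `alpha`, and the function names are hypothetical; this is not the authors' fitting procedure, and a real analysis would normally rely on an established library.

```python
def soft_threshold(value, threshold):
    """Shrink value toward zero by threshold; return 0.0 inside the band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def lasso_coordinate_descent(X, y, alpha, n_iter=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1 by cyclic updates.

    Assumes no intercept and that no column of X is entirely zero.
    """
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of feature j with the residual that excludes feature j.
            rho = sum(
                X[i][j] * (y[i] - sum(X[i][k] * w[k] for k in range(p) if k != j))
                for i in range(n)
            ) / n
            z = sum(X[i][j] ** 2 for i in range(n)) / n
            w[j] = soft_threshold(rho, alpha) / z
    return w

# Hypothetical data: y depends strongly on feature 0 and weakly on feature 1.
X = [[1.0, 1.0], [1.0, -1.0], [-1.0, 1.0], [-1.0, -1.0]]
y = [2.1, 1.9, -1.9, -2.1]
weights = lasso_coordinate_descent(X, y, alpha=0.5)
```

With these toy inputs the weak feature's weight is driven exactly to zero while the strong feature's weight is shrunk from 2.0 to 1.5. Because this selection effect comes from the penalty alone, applying the same regression directly to chemical-structure features is the natural control for separating the benefit of Lasso from the benefit of the semantic representation.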
-