The document in XML format

<record><TEI><teiHeader><fileDesc><titleStmt><title xml:lang="en">The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS</title>
<author><name sortKey="Tetko, Igor V" sort="Tetko, Igor V" uniqKey="Tetko I" first="Igor V." last="Tetko">Igor V. Tetko</name>
<affiliation><nlm:aff id="Aff1">Institute of Structural Biology, Helmholtz Zentrum München für Gesundheit und Umwelt (HMGU), Ingolstädter Landstraße 1, b. 60w, 85764 Neuherberg, Germany</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff2">BigChem GmbH, 85764 Neuherberg, Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lowe, Daniel M" sort="Lowe, Daniel M" uniqKey="Lowe D" first="Daniel M." last="Lowe">Daniel M. Lowe</name>
<affiliation><nlm:aff id="Aff3">NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Cambridge, CB4 0EY UK</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williams, Antony J" sort="Williams, Antony J" uniqKey="Williams A" first="Antony J." last="Williams">Antony J. Williams</name>
<affiliation><nlm:aff id="Aff4">ChemConnector Inc., 904 Tamaras Circle, Wake Forest, NC 27587 USA</nlm:aff>
</affiliation>
</author>
</titleStmt>
<publicationStmt><idno type="wicri:source">PMC</idno>
<idno type="pmid">26807157</idno>
<idno type="pmc">4724158</idno>
<idno type="url">http://www.ncbi.nlm.nih.gov/pmc/articles/PMC4724158</idno>
<idno type="RBID">PMC:4724158</idno>
<idno type="doi">10.1186/s13321-016-0113-y</idno>
<date when="2016">2016</date>
<idno type="wicri:Area/Pmc/Corpus">000017</idno>
<idno type="wicri:explorRef" wicri:stream="Pmc" wicri:step="Corpus" wicri:corpus="PMC">000017</idno>
</publicationStmt>
<sourceDesc><biblStruct><analytic><title xml:lang="en" level="a" type="main">The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS</title>
<author><name sortKey="Tetko, Igor V" sort="Tetko, Igor V" uniqKey="Tetko I" first="Igor V." last="Tetko">Igor V. Tetko</name>
<affiliation><nlm:aff id="Aff1">Institute of Structural Biology, Helmholtz Zentrum München für Gesundheit und Umwelt (HMGU), Ingolstädter Landstraße 1, b. 60w, 85764 Neuherberg, Germany</nlm:aff>
</affiliation>
<affiliation><nlm:aff id="Aff2">BigChem GmbH, 85764 Neuherberg, Germany</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Lowe, Daniel M" sort="Lowe, Daniel M" uniqKey="Lowe D" first="Daniel M." last="Lowe">Daniel M. Lowe</name>
<affiliation><nlm:aff id="Aff3">NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Cambridge, CB4 0EY UK</nlm:aff>
</affiliation>
</author>
<author><name sortKey="Williams, Antony J" sort="Williams, Antony J" uniqKey="Williams A" first="Antony J." last="Williams">Antony J. Williams</name>
<affiliation><nlm:aff id="Aff4">ChemConnector Inc., 904 Tamaras Circle, Wake Forest, NC 27587 USA</nlm:aff>
</affiliation>
</author>
</analytic>
<series><title level="j">Journal of Cheminformatics</title>
<idno type="eISSN">1758-2946</idno>
<imprint><date when="2016">2016</date>
</imprint>
</series>
</biblStruct>
</sourceDesc>
</fileDesc>
<profileDesc><textClass></textClass>
</profileDesc>
</teiHeader>
<front><div type="abstract" xml:lang="en"><sec><title>Background</title>
<p>Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research depends critically on the availability of high-quality MP data and accurate chemical structure representations for model development. Datasets currently available for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in an appropriate form, could be used to develop predictive models.</p>
</sec>
<sec><title>Results</title>
<p>We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points were collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (<ext-link ext-link-type="uri" xlink:href="http://ochem.eu">http://ochem.eu</ext-link>). A number of technical challenges had to be solved at the same time to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task with 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to models developed using the highly curated data from previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Finally, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed.</p>
</sec>
<sec><title>Conclusions</title>
<p>We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at <ext-link ext-link-type="uri" xlink:href="http://ochem.eu/article/99826">http://ochem.eu/article/99826</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</div>
</front>
<back><div1 type="bibliography"><listBibl><biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Dearden, Jc" uniqKey="Dearden J">JC Dearden</name>
</author>
<author><name sortKey="Rotureau, P" uniqKey="Rotureau P">P Rotureau</name>
</author>
<author><name sortKey="Fayet, G" uniqKey="Fayet G">G Fayet</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Lowe, Dm" uniqKey="Lowe D">DM Lowe</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Sushko, Y" uniqKey="Sushko Y">Y Sushko</name>
</author>
<author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Patiny, L" uniqKey="Patiny L">L Patiny</name>
</author>
<author><name sortKey="Kondratov, I" uniqKey="Kondratov I">I Kondratov</name>
</author>
<author><name sortKey="Petrenko, Ae" uniqKey="Petrenko A">AE Petrenko</name>
</author>
<author><name sortKey="Charochkina, L" uniqKey="Charochkina L">L Charochkina</name>
</author>
<author><name sortKey="Asiri, Am" uniqKey="Asiri A">AM Asiri</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bhhatarai, B" uniqKey="Bhhatarai B">B Bhhatarai</name>
</author>
<author><name sortKey="Teetz, W" uniqKey="Teetz W">W Teetz</name>
</author>
<author><name sortKey="Liu, T" uniqKey="Liu T">T Liu</name>
</author>
<author><name sortKey="Oberg, T" uniqKey="Oberg T">T Öberg</name>
</author>
<author><name sortKey="Jeliazkova, N" uniqKey="Jeliazkova N">N Jeliazkova</name>
</author>
<author><name sortKey="Kochev, N" uniqKey="Kochev N">N Kochev</name>
</author>
<author><name sortKey="Pukalov, O" uniqKey="Pukalov O">O Pukalov</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Kovarich, S" uniqKey="Kovarich S">S Kovarich</name>
</author>
<author><name sortKey="Papa, E" uniqKey="Papa E">E Papa</name>
</author>
<author><name sortKey="Gramatica, P" uniqKey="Gramatica P">P Gramatica</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chu, Ka" uniqKey="Chu K">KA Chu</name>
</author>
<author><name sortKey="Yalkowsky, Sh" uniqKey="Yalkowsky S">SH Yalkowsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
<author><name sortKey="Kireeva, N" uniqKey="Kireeva N">N Kireeva</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Baskin, Ii" uniqKey="Baskin I">II Baskin</name>
</author>
<author><name sortKey="Solov V, Vp" uniqKey="Solov V V">VP Solov’ev</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Nigsch, F" uniqKey="Nigsch F">F Nigsch</name>
</author>
<author><name sortKey="Bender, A" uniqKey="Bender A">A Bender</name>
</author>
<author><name sortKey="Van Buuren, B" uniqKey="Van Buuren B">B van Buuren</name>
</author>
<author><name sortKey="Tissen, J" uniqKey="Tissen J">J Tissen</name>
</author>
<author><name sortKey="Nigsch, E" uniqKey="Nigsch E">E Nigsch</name>
</author>
<author><name sortKey="Mitchell, Jb" uniqKey="Mitchell J">JB Mitchell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Jain, A" uniqKey="Jain A">A Jain</name>
</author>
<author><name sortKey="Yalkowsky, Sh" uniqKey="Yalkowsky S">SH Yalkowsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bergstrom, Ca" uniqKey="Bergstrom C">CA Bergstrom</name>
</author>
<author><name sortKey="Norinder, U" uniqKey="Norinder U">U Norinder</name>
</author>
<author><name sortKey="Luthman, K" uniqKey="Luthman K">K Luthman</name>
</author>
<author><name sortKey="Artursson, P" uniqKey="Artursson P">P Artursson</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Boethling, Rs" uniqKey="Boethling R">RS Boethling</name>
</author>
<author><name sortKey="Mackay, D" uniqKey="Mackay D">D Mackay</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ran, Y" uniqKey="Ran Y">Y Ran</name>
</author>
<author><name sortKey="Yalkowsky, Sh" uniqKey="Yalkowsky S">SH Yalkowsky</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Lowe, Dm" uniqKey="Lowe D">DM Lowe</name>
</author>
<author><name sortKey="Sayle, Ra" uniqKey="Sayle R">RA Sayle</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hawizy, L" uniqKey="Hawizy L">L Hawizy</name>
</author>
<author><name sortKey="Jessop, Dm" uniqKey="Jessop D">DM Jessop</name>
</author>
<author><name sortKey="Adams, N" uniqKey="Adams N">N Adams</name>
</author>
<author><name sortKey="Murray Rust, P" uniqKey="Murray Rust P">P Murray-Rust</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Vorberg, S" uniqKey="Vorberg S">S Vorberg</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Ivanov, V" uniqKey="Ivanov V">V Ivanov</name>
</author>
<author><name sortKey="Petrenko, Ae" uniqKey="Petrenko A">AE Petrenko</name>
</author>
<author><name sortKey="Dieden, R" uniqKey="Dieden R">R Dieden</name>
</author>
<author><name sortKey="Lebon, F" uniqKey="Lebon F">F Lebon</name>
</author>
<author><name sortKey="Mathieu, B" uniqKey="Mathieu B">B Mathieu</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Todeschini, R" uniqKey="Todeschini R">R Todeschini</name>
</author>
<author><name sortKey="Consonni, V" uniqKey="Consonni V">V Consonni</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Gasteiger, J" uniqKey="Gasteiger J">J Gasteiger</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Korner, R" uniqKey="Korner R">R Korner</name>
</author>
<author><name sortKey="Pandey, Ak" uniqKey="Pandey A">AK Pandey</name>
</author>
<author><name sortKey="Rupp, M" uniqKey="Rupp M">M Rupp</name>
</author>
<author><name sortKey="Teetz, W" uniqKey="Teetz W">W Teetz</name>
</author>
<author><name sortKey="Brandmaier, S" uniqKey="Brandmaier S">S Brandmaier</name>
</author>
<author><name sortKey="Abdelaziz, A" uniqKey="Abdelaziz A">A Abdelaziz</name>
</author>
<author><name sortKey="Prokopenko, Vv" uniqKey="Prokopenko V">VV Prokopenko</name>
</author>
<author><name sortKey="Tanchuk, Vy" uniqKey="Tanchuk V">VY Tanchuk</name>
</author>
<author><name sortKey="Todeschini, R" uniqKey="Todeschini R">R Todeschini</name>
</author>
<author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
<author><name sortKey="Marcou, G" uniqKey="Marcou G">G Marcou</name>
</author>
<author><name sortKey="Ertl, P" uniqKey="Ertl P">P Ertl</name>
</author>
<author><name sortKey="Potemkin, V" uniqKey="Potemkin V">V Potemkin</name>
</author>
<author><name sortKey="Grishina, M" uniqKey="Grishina M">M Grishina</name>
</author>
<author><name sortKey="Gasteiger, J" uniqKey="Gasteiger J">J Gasteiger</name>
</author>
<author><name sortKey="Schwab, C" uniqKey="Schwab C">C Schwab</name>
</author>
<author><name sortKey="Baskin, Ii" uniqKey="Baskin I">II Baskin</name>
</author>
<author><name sortKey="Palyulin, Va" uniqKey="Palyulin V">VA Palyulin</name>
</author>
<author><name sortKey="Radchenko, Ev" uniqKey="Radchenko E">EV Radchenko</name>
</author>
<author><name sortKey="Welsh, Wj" uniqKey="Welsh W">WJ Welsh</name>
</author>
<author><name sortKey="Kholodovych, V" uniqKey="Kholodovych V">V Kholodovych</name>
</author>
<author><name sortKey="Chekmarev, D" uniqKey="Chekmarev D">D Chekmarev</name>
</author>
<author><name sortKey="Cherkasov, A" uniqKey="Cherkasov A">A Cherkasov</name>
</author>
<author><name sortKey="Aires De Sousa, J" uniqKey="Aires De Sousa J">J Aires-de-Sousa</name>
</author>
<author><name sortKey="Zhang, Qy" uniqKey="Zhang Q">QY Zhang</name>
</author>
<author><name sortKey="Bender, A" uniqKey="Bender A">A Bender</name>
</author>
<author><name sortKey="Nigsch, F" uniqKey="Nigsch F">F Nigsch</name>
</author>
<author><name sortKey="Patiny, L" uniqKey="Patiny L">L Patiny</name>
</author>
<author><name sortKey="Williams, A" uniqKey="Williams A">A Williams</name>
</author>
<author><name sortKey="Tkachenko, V" uniqKey="Tkachenko V">V Tkachenko</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Hall, Lh" uniqKey="Hall L">LH Hall</name>
</author>
<author><name sortKey="Kier, Lb" uniqKey="Kier L">LB Kier</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
<author><name sortKey="Fourches, D" uniqKey="Fourches D">D Fourches</name>
</author>
<author><name sortKey="Horvath, D" uniqKey="Horvath D">D Horvath</name>
</author>
<author><name sortKey="Klimchuk, O" uniqKey="Klimchuk O">O Klimchuk</name>
</author>
<author><name sortKey="Gaudin, C" uniqKey="Gaudin C">C Gaudin</name>
</author>
<author><name sortKey="Vayer, P" uniqKey="Vayer P">P Vayer</name>
</author>
<author><name sortKey="Solov V, V" uniqKey="Solov V V">V Solov’ev</name>
</author>
<author><name sortKey="Hoonakker, F" uniqKey="Hoonakker F">F Hoonakker</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Marcou, G" uniqKey="Marcou G">G Marcou</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Skvortsova, Mi" uniqKey="Skvortsova M">MI Skvortsova</name>
</author>
<author><name sortKey="Baskin, Ii" uniqKey="Baskin I">II Baskin</name>
</author>
<author><name sortKey="Skvortsov, La" uniqKey="Skvortsov L">LA Skvortsov</name>
</author>
<author><name sortKey="Palyulin, Va" uniqKey="Palyulin V">VA Palyulin</name>
</author>
<author><name sortKey="Zefirov, Ns" uniqKey="Zefirov N">NS Zefirov</name>
</author>
<author><name sortKey="Stankevich, Iv" uniqKey="Stankevich I">IV Stankevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Steinbeck, C" uniqKey="Steinbeck C">C Steinbeck</name>
</author>
<author><name sortKey="Han, Y" uniqKey="Han Y">Y Han</name>
</author>
<author><name sortKey="Kuhn, S" uniqKey="Kuhn S">S Kuhn</name>
</author>
<author><name sortKey="Horlacher, O" uniqKey="Horlacher O">O Horlacher</name>
</author>
<author><name sortKey="Luttmann, E" uniqKey="Luttmann E">E Luttmann</name>
</author>
<author><name sortKey="Willighagen, E" uniqKey="Willighagen E">E Willighagen</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Potemkin, Va" uniqKey="Potemkin V">VA Potemkin</name>
</author>
<author><name sortKey="Grishina, Ma" uniqKey="Grishina M">MA Grishina</name>
</author>
<author><name sortKey="Bartashevich, Ev" uniqKey="Bartashevich E">EV Bartashevich</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Salmina, E" uniqKey="Salmina E">E Salmina</name>
</author>
<author><name sortKey="Potemkin, Va" uniqKey="Potemkin V">VA Potemkin</name>
</author>
<author><name sortKey="Poda, G" uniqKey="Poda G">G Poda</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Haider, N" uniqKey="Haider N">N Haider</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Rogers, D" uniqKey="Rogers D">D Rogers</name>
</author>
<author><name sortKey="Hahn, M" uniqKey="Hahn M">M Hahn</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Bender, A" uniqKey="Bender A">A Bender</name>
</author>
<author><name sortKey="Mussa, Hy" uniqKey="Mussa H">HY Mussa</name>
</author>
<author><name sortKey="Glen, Rc" uniqKey="Glen R">RC Glen</name>
</author>
<author><name sortKey="Reiling, S" uniqKey="Reiling S">S Reiling</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Pandey, Ak" uniqKey="Pandey A">AK Pandey</name>
</author>
<author><name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author><name sortKey="Tropsha, A" uniqKey="Tropsha A">A Tropsha</name>
</author>
<author><name sortKey="Papa, E" uniqKey="Papa E">E Papa</name>
</author>
<author><name sortKey="Oberg, T" uniqKey="Oberg T">T Oberg</name>
</author>
<author><name sortKey="Todeschini, R" uniqKey="Todeschini R">R Todeschini</name>
</author>
<author><name sortKey="Fourches, D" uniqKey="Fourches D">D Fourches</name>
</author>
<author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Chang, C C" uniqKey="Chang C">C-C Chang</name>
</author>
<author><name sortKey="Lin, C J" uniqKey="Lin C">C-J Lin</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Solov V, Vp" uniqKey="Solov V V">VP Solov’ev</name>
</author>
<author><name sortKey="Antonov, Av" uniqKey="Antonov A">AV Antonov</name>
</author>
<author><name sortKey="Yao, X" uniqKey="Yao X">X Yao</name>
</author>
<author><name sortKey="Doucet, Jp" uniqKey="Doucet J">JP Doucet</name>
</author>
<author><name sortKey="Fan, B" uniqKey="Fan B">B Fan</name>
</author>
<author><name sortKey="Hoonakker, F" uniqKey="Hoonakker F">F Hoonakker</name>
</author>
<author><name sortKey="Fourches, D" uniqKey="Fourches D">D Fourches</name>
</author>
<author><name sortKey="Jost, P" uniqKey="Jost P">P Jost</name>
</author>
<author><name sortKey="Lachiche, N" uniqKey="Lachiche N">N Lachiche</name>
</author>
<author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Breiman, L" uniqKey="Breiman L">L Breiman</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Kotsiantis, Sb" uniqKey="Kotsiantis S">SB Kotsiantis</name>
</author>
<author><name sortKey="Kanellopoulos, D" uniqKey="Kanellopoulos D">D Kanellopoulos</name>
</author>
<author><name sortKey="Pintelas, Pe" uniqKey="Pintelas P">PE Pintelas</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Zhu, H" uniqKey="Zhu H">H Zhu</name>
</author>
<author><name sortKey="Tropsha, A" uniqKey="Tropsha A">A Tropsha</name>
</author>
<author><name sortKey="Fourches, D" uniqKey="Fourches D">D Fourches</name>
</author>
<author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
<author><name sortKey="Papa, E" uniqKey="Papa E">E Papa</name>
</author>
<author><name sortKey="Gramatica, P" uniqKey="Gramatica P">P Gramatica</name>
</author>
<author><name sortKey="Oberg, T" uniqKey="Oberg T">T Oberg</name>
</author>
<author><name sortKey="Dao, P" uniqKey="Dao P">P Dao</name>
</author>
<author><name sortKey="Cherkasov, A" uniqKey="Cherkasov A">A Cherkasov</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Dunn, Ms" uniqKey="Dunn M">MS Dunn</name>
</author>
<author><name sortKey="Brophy, Tw" uniqKey="Brophy T">TW Brophy</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Manahan, Se" uniqKey="Manahan S">SE Manahan</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Korner, R" uniqKey="Korner R">R Korner</name>
</author>
<author><name sortKey="Pandey, Ak" uniqKey="Pandey A">AK Pandey</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Bruneau, P" uniqKey="Bruneau P">P Bruneau</name>
</author>
<author><name sortKey="Mewes, Hw" uniqKey="Mewes H">HW Mewes</name>
</author>
<author><name sortKey="Rohrer, Dc" uniqKey="Rohrer D">DC Rohrer</name>
</author>
<author><name sortKey="Poda, Gi" uniqKey="Poda G">GI Poda</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Korner, R" uniqKey="Korner R">R Körner</name>
</author>
<author><name sortKey="Pandey, Ak" uniqKey="Pandey A">AK Pandey</name>
</author>
<author><name sortKey="Kovalishyn, Vv" uniqKey="Kovalishyn V">VV Kovalishyn</name>
</author>
<author><name sortKey="Prokopenko, Vv" uniqKey="Prokopenko V">VV Prokopenko</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Sushko, I" uniqKey="Sushko I">I Sushko</name>
</author>
<author><name sortKey="Novotarskyi, S" uniqKey="Novotarskyi S">S Novotarskyi</name>
</author>
<author><name sortKey="Korner, R" uniqKey="Korner R">R Korner</name>
</author>
<author><name sortKey="Pandey, Ak" uniqKey="Pandey A">AK Pandey</name>
</author>
<author><name sortKey="Cherkasov, A" uniqKey="Cherkasov A">A Cherkasov</name>
</author>
<author><name sortKey="Li, J" uniqKey="Li J">J Li</name>
</author>
<author><name sortKey="Gramatica, P" uniqKey="Gramatica P">P Gramatica</name>
</author>
<author><name sortKey="Hansen, K" uniqKey="Hansen K">K Hansen</name>
</author>
<author><name sortKey="Schroeter, T" uniqKey="Schroeter T">T Schroeter</name>
</author>
<author><name sortKey="Muller, Kr" uniqKey="Muller K">KR Muller</name>
</author>
<author><name sortKey="Xi, L" uniqKey="Xi L">L Xi</name>
</author>
<author><name sortKey="Liu, H" uniqKey="Liu H">H Liu</name>
</author>
<author><name sortKey="Yao, X" uniqKey="Yao X">X Yao</name>
</author>
<author><name sortKey="Oberg, T" uniqKey="Oberg T">T Oberg</name>
</author>
<author><name sortKey="Hormozdiari, F" uniqKey="Hormozdiari F">F Hormozdiari</name>
</author>
<author><name sortKey="Dao, P" uniqKey="Dao P">P Dao</name>
</author>
<author><name sortKey="Sahinalp, C" uniqKey="Sahinalp C">C Sahinalp</name>
</author>
<author><name sortKey="Todeschini, R" uniqKey="Todeschini R">R Todeschini</name>
</author>
<author><name sortKey="Polishchuk, P" uniqKey="Polishchuk P">P Polishchuk</name>
</author>
<author><name sortKey="Artemenko, A" uniqKey="Artemenko A">A Artemenko</name>
</author>
<author><name sortKey="Kuz In, V" uniqKey="Kuz In V">V Kuz’min</name>
</author>
<author><name sortKey="Martin, Tm" uniqKey="Martin T">TM Martin</name>
</author>
<author><name sortKey="Young, Dm" uniqKey="Young D">DM Young</name>
</author>
<author><name sortKey="Fourches, D" uniqKey="Fourches D">D Fourches</name>
</author>
<author><name sortKey="Muratov, E" uniqKey="Muratov E">E Muratov</name>
</author>
<author><name sortKey="Tropsha, A" uniqKey="Tropsha A">A Tropsha</name>
</author>
<author><name sortKey="Baskin, I" uniqKey="Baskin I">I Baskin</name>
</author>
<author><name sortKey="Horvath, D" uniqKey="Horvath D">D Horvath</name>
</author>
<author><name sortKey="Marcou, G" uniqKey="Marcou G">G Marcou</name>
</author>
<author><name sortKey="Muller, C" uniqKey="Muller C">C Muller</name>
</author>
<author><name sortKey="Varnek, A" uniqKey="Varnek A">A Varnek</name>
</author>
<author><name sortKey="Prokopenko, Vv" uniqKey="Prokopenko V">VV Prokopenko</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Sopasakis, P" uniqKey="Sopasakis P">P Sopasakis</name>
</author>
<author><name sortKey="Kunwar, P" uniqKey="Kunwar P">P Kunwar</name>
</author>
<author><name sortKey="Brandmaier, S" uniqKey="Brandmaier S">S Brandmaier</name>
</author>
<author><name sortKey="Novoratskyi, S" uniqKey="Novoratskyi S">S Novoratskyi</name>
</author>
<author><name sortKey="Charochkina, L" uniqKey="Charochkina L">L Charochkina</name>
</author>
<author><name sortKey="Prokopenko, V" uniqKey="Prokopenko V">V Prokopenko</name>
</author>
<author><name sortKey="Peijnenburg, Wj" uniqKey="Peijnenburg W">WJ Peijnenburg</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Delaney, Js" uniqKey="Delaney J">JS Delaney</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Potemkin, Va" uniqKey="Potemkin V">VA Potemkin</name>
</author>
<author><name sortKey="Bartashevich, Ev" uniqKey="Bartashevich E">EV Bartashevich</name>
</author>
<author><name sortKey="Belik, Av" uniqKey="Belik A">AV Belik</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Potemkin, Va" uniqKey="Potemkin V">VA Potemkin</name>
</author>
<author><name sortKey="Bartashevich, Ev" uniqKey="Bartashevich E">EV Bartashevich</name>
</author>
<author><name sortKey="Belik, Av" uniqKey="Belik A">AV Belik</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bemis, Gw" uniqKey="Bemis G">GW Bemis</name>
</author>
<author><name sortKey="Murcko, Ma" uniqKey="Murcko M">MA Murcko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Yang, Y" uniqKey="Yang Y">Y Yang</name>
</author>
<author><name sortKey="Chen, H" uniqKey="Chen H">H Chen</name>
</author>
<author><name sortKey="Nilsson, I" uniqKey="Nilsson I">I Nilsson</name>
</author>
<author><name sortKey="Muresan, S" uniqKey="Muresan S">S Muresan</name>
</author>
<author><name sortKey="Engkvist, O" uniqKey="Engkvist O">O Engkvist</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Bauerschmidt, S" uniqKey="Bauerschmidt S">S Bauerschmidt</name>
</author>
<author><name sortKey="Gasteiger, J" uniqKey="Gasteiger J">J Gasteiger</name>
</author>
</analytic>
</biblStruct>
<biblStruct></biblStruct>
<biblStruct></biblStruct>
<biblStruct><analytic><author><name sortKey="Palmer, Ds" uniqKey="Palmer D">DS Palmer</name>
</author>
<author><name sortKey="Mitchell, Jb" uniqKey="Mitchell J">JB Mitchell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Hughes, Ld" uniqKey="Hughes L">LD Hughes</name>
</author>
<author><name sortKey="Palmer, Ds" uniqKey="Palmer D">DS Palmer</name>
</author>
<author><name sortKey="Nigsch, F" uniqKey="Nigsch F">F Nigsch</name>
</author>
<author><name sortKey="Mitchell, Jb" uniqKey="Mitchell J">JB Mitchell</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Ustun, B" uniqKey="Ustun B">B Üstün</name>
</author>
<author><name sortKey="Melssen, Wj" uniqKey="Melssen W">WJ Melssen</name>
</author>
<author><name sortKey="Oudenhuijzen, M" uniqKey="Oudenhuijzen M">M Oudenhuijzen</name>
</author>
<author><name sortKey="Buydens, Lmc" uniqKey="Buydens L">LMC Buydens</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Balakin, Kv" uniqKey="Balakin K">KV Balakin</name>
</author>
<author><name sortKey="Savchuk, Np" uniqKey="Savchuk N">NP Savchuk</name>
</author>
<author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Poda, Gi" uniqKey="Poda G">GI Poda</name>
</author>
<author><name sortKey="Ostermann, C" uniqKey="Ostermann C">C Ostermann</name>
</author>
<author><name sortKey="Mannhold, R" uniqKey="Mannhold R">R Mannhold</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Poda, Gi" uniqKey="Poda G">GI Poda</name>
</author>
<author><name sortKey="Ostermann, C" uniqKey="Ostermann C">C Ostermann</name>
</author>
<author><name sortKey="Mannhold, R" uniqKey="Mannhold R">R Mannhold</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Tanchuk, Vy" uniqKey="Tanchuk V">VY Tanchuk</name>
</author>
<author><name sortKey="Kasheva, Tn" uniqKey="Kasheva T">TN Kasheva</name>
</author>
<author><name sortKey="Villa, Aep" uniqKey="Villa A">AEP Villa</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
<author><name sortKey="Tanchuk, Vy" uniqKey="Tanchuk V">VY Tanchuk</name>
</author>
</analytic>
</biblStruct>
<biblStruct><analytic><author><name sortKey="Tetko, Iv" uniqKey="Tetko I">IV Tetko</name>
</author>
</analytic>
</biblStruct>
</listBibl>
</div1>
</back>
</TEI>
<pmc article-type="research-article"><pmc-dir>properties open_access</pmc-dir>
  <front><journal-meta><journal-id journal-id-type="nlm-ta">J Cheminform</journal-id>
<journal-id journal-id-type="iso-abbrev">J Cheminform</journal-id>
<journal-title-group><journal-title>Journal of Cheminformatics</journal-title>
</journal-title-group>
<issn pub-type="epub">1758-2946</issn>
<publisher><publisher-name>Springer International Publishing</publisher-name>
<publisher-loc>Cham</publisher-loc>
</publisher>
</journal-meta>
<article-meta><article-id pub-id-type="pmid">26807157</article-id>
<article-id pub-id-type="pmc">4724158</article-id>
<article-id pub-id-type="publisher-id">113</article-id>
<article-id pub-id-type="doi">10.1186/s13321-016-0113-y</article-id>
<article-categories><subj-group subj-group-type="heading"><subject>Research Article</subject>
</subj-group>
</article-categories>
<title-group><article-title>The development of models to predict melting and pyrolysis point data associated with several hundred thousand compounds mined from PATENTS</article-title>
</title-group>
<contrib-group><contrib contrib-type="author" corresp="yes"><name><surname>Tetko</surname>
<given-names>Igor V.</given-names>
</name>
<address><phone>+49-89-3187-3575</phone>
<email>i.tetko@helmholtz-muenchen.de</email>
</address>
<xref ref-type="aff" rid="Aff1"></xref>
<xref ref-type="aff" rid="Aff2"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Lowe</surname>
<given-names>Daniel M.</given-names>
</name>
<address><email>daniel@nextmovesoftware.com</email>
</address>
<xref ref-type="aff" rid="Aff3"></xref>
</contrib>
<contrib contrib-type="author"><name><surname>Williams</surname>
<given-names>Antony J.</given-names>
</name>
<address><email>Williams.Antony@epa.gov</email>
</address>
<xref ref-type="aff" rid="Aff4"></xref>
</contrib>
<aff id="Aff1"><label></label>
Institute of Structural Biology, Helmholtz Zentrum München für Gesundheit und Umwelt (HMGU), Ingolstädter Landstraße 1, b. 60w, 85764 Neuherberg, Germany</aff>
<aff id="Aff2"><label></label>
BigChem GmbH, 85764 Neuherberg, Germany</aff>
<aff id="Aff3"><label></label>
NextMove Software Limited, Innovation Centre (Unit 23), Cambridge Science Park, Cambridge, CB4 0EY UK</aff>
<aff id="Aff4"><label></label>
ChemConnector Inc., 904 Tamaras Circle, Wake Forest, NC 27587 USA</aff>
</contrib-group>
<pub-date pub-type="epub"><day>22</day>
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="pmc-release"><day>22</day>
<month>1</month>
<year>2016</year>
</pub-date>
<pub-date pub-type="collection"><year>2016</year>
</pub-date>
<volume>8</volume>
<elocation-id>2</elocation-id>
<history><date date-type="received"><day>31</day>
<month>8</month>
<year>2015</year>
</date>
<date date-type="accepted"><day>8</day>
<month>1</month>
<year>2016</year>
</date>
</history>
<permissions><copyright-statement>© Tetko et al. 2016</copyright-statement>
<license license-type="OpenAccess"><license-p><bold>Open Access</bold>
This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/licenses/by/4.0/">http://creativecommons.org/licenses/by/4.0/</ext-link>
), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (<ext-link ext-link-type="uri" xlink:href="http://creativecommons.org/publicdomain/zero/1.0/">http://creativecommons.org/publicdomain/zero/1.0/</ext-link>
) applies to the data made available in this article, unless otherwise stated.</license-p>
</license>
</permissions>
<abstract id="Abs1"><sec><title>Background</title>
<p>Melting point (MP) is an important property with regard to the solubility of chemical compounds. Its prediction from chemical structure remains a highly challenging task for quantitative structure–activity relationship studies. Success in this area of research critically depends on the availability of high-quality MP data as well as accurate chemical structure representations from which to develop models. Currently available datasets for MP prediction have been limited to around 50k molecules, while far more data are routinely generated following the synthesis of novel materials. Significant amounts of MP data are freely available within the patent literature and, if they were available in an appropriate form, could be used to develop predictive models.</p>
</sec>
<sec><title>Results</title>
<p>We have developed a pipeline for the automated extraction and annotation of chemical data from published PATENTS. Almost 300,000 data points have been collected and used to develop models to predict melting and pyrolysis (decomposition) points using tools available on the OCHEM modeling platform (<ext-link ext-link-type="uri" xlink:href="http://ochem.eu">http://ochem.eu</ext-link>
). A number of technical challenges were solved in order to develop models based on these data. These included the handling of sparse data matrices with >200,000,000,000 entries and parallel calculations using 32 × 6 cores per task using 13 descriptor sets totaling more than 700,000 descriptors. We showed that models developed using data collected from PATENTS had similar or better prediction accuracy compared to models developed using the highly curated data of previous publications. Data for chemicals that decomposed rather than melted were separated from data for compounds that underwent a normal melting transition, and models for both pyrolysis points and MPs were developed. The accuracy of the consensus MP models for molecules from the drug-like region of chemical space was similar to their estimated experimental accuracy, 32 °C. Finally, important structural features related to the pyrolysis of chemicals were identified, and a model to predict whether a compound will decompose instead of melting was developed.</p>
</sec>
<sec><title>Conclusions</title>
<p>We have shown that automated tools for the analysis of chemical information have reached a mature stage allowing for the extraction and collection of high quality data to enable the development of structure–activity relationship models. The developed models and data are publicly available at <ext-link ext-link-type="uri" xlink:href="http://ochem.eu/article/99826">http://ochem.eu/article/99826</ext-link>
.</p>
</sec>
<sec><title>Electronic supplementary material</title>
<p>The online version of this article (doi:10.1186/s13321-016-0113-y) contains supplementary material, which is available to authorized users.</p>
</sec>
</abstract>
<custom-meta-group><custom-meta><meta-name>issue-copyright-statement</meta-name>
<meta-value>© The Author(s) 2016</meta-value>
</custom-meta>
</custom-meta-group>
</article-meta>
</front>
<body><sec id="Sec1"><title>Background</title>
<p>The prediction of physicochemical properties is important in the pharmaceutical industry for structure design and for the purpose of optimizing ADME properties. Physicochemical parameters such as logP, pKa, logD, aqueous solubility and many others impact not only drug-related properties but also environmental chemicals such as surfactants, wetting agents and so on [<xref ref-type="bibr" rid="CR1">1</xref>
, <xref ref-type="bibr" rid="CR2">2</xref>]. The modeling of these properties is best facilitated by obtaining large, structurally diverse, high-quality datasets. The aggregation and curation of such datasets can be very exacting in terms of extraction of the data from the literature. Redrawing of chemical compounds can be difficult and in many cases they are not available as structure depictions but only in the form of chemical names. Validating the measured property in any meaningful way is difficult but manual inspection can highlight obvious errors with the parameters as captured (vide infra).
</p>
<p>Text-mining for the identification and extraction of properties may offer an opportunity to assemble rather large databases of properties harvested from the appropriate corpora. One of the authors (D.L.) has extensive experience with the extraction of chemistry-related information from PATENTS and previous investigations have examined the extraction of chemical reactions [<xref ref-type="bibr" rid="CR3">3</xref>
]. Initial investigations of chemical property measurements contained within the USPTO patent collection indicated the presence of a large number (>100,000) of melting points (MPs), typically within semi-structured experimental sections.</p>
<p>The theme of this memorial issue is the contributions of Jean-Claude Bradley to Open Science. Dr. Bradley had a particular interest in the quality of MP data and invested significant effort in investigating this property. He was interested in the value of MP for predicting temperature-dependent solubility for solvent selection [<xref ref-type="bibr" rid="CR4">4</xref>
] as well as assembling measured experimental properties as part of an Open Notebook Challenge [<xref ref-type="bibr" rid="CR5">5</xref>
]. He was particularly interested in the quality of experimental MPs reported in the literature and those reported by chemical vendors [<xref ref-type="bibr" rid="CR6">6</xref>
]. He had also worked tirelessly to make a large data collection of over 20,000 MPs available as Open Data [<xref ref-type="bibr" rid="CR7">7</xref>
]. In collaboration with Dr. Andrew Lang, one of the editors for this memorial issue, he made available MP web services [<xref ref-type="bibr" rid="CR8">8</xref>
] providing access to open models for prediction [<xref ref-type="bibr" rid="CR9">9</xref>
] and, prior to his passing, published an open dataset of 28,645 measurements for the community to use to develop models [<xref ref-type="bibr" rid="CR10">10</xref>
].</p>
<p>The prediction of MP remains an important task for cheminformatics studies for a number of reasons [<xref ref-type="bibr" rid="CR2">2</xref>
, <xref ref-type="bibr" rid="CR11">11</xref>
–<xref ref-type="bibr" rid="CR17">17</xref>
]. It has specific relevance to the prediction of toxicity and has also been observed to correlate with other physical properties such as boiling point, vapor pressure and water solubility [<xref ref-type="bibr" rid="CR1">1</xref>
, <xref ref-type="bibr" rid="CR18">18</xref>
]. As a result the MP has been used as a descriptor in some of the estimation methods used to predict these properties [<xref ref-type="bibr" rid="CR1">1</xref>
, <xref ref-type="bibr" rid="CR19">19</xref>
] and therefore the use of reliable MP data, or accurate estimates obtained from high-performing models, can improve the accuracy of such methods. With this in mind we decided to investigate the data mining of property data from an openly available patent corpus, with a focus on the extraction, curation and modeling of MP data.</p>
</sec>
<sec id="Sec2"><title>Datasets utilized in this work</title>
<sec id="Sec3"><title>Data extracted by mining patent literature</title>
<p>The workflow for extracting compound/MP associations is summarized in Fig. <xref rid="Fig1" ref-type="fig">1</xref>
. All United States Patent and Trademark Office (USPTO) PATENTS available as structured text were downloaded from ReedTech [<xref ref-type="bibr" rid="CR20">20</xref>
] for the period 1976–2014. Patent grants were available for the entirety of this period, while patent applications were available only from 2001 onwards. Complicating data extraction, the format used by the USPTO has varied over time with four significantly different formats being employed (one textual, one SGML and two XML formats). To simplify further handling, the textual and SGML formats were converted to an equivalent XML representation using a LeadMine [<xref ref-type="bibr" rid="CR21">21</xref>
] library function. From these heterogeneous XML representations, headings and paragraphs were extracted from the description section of each patent. Each paragraph is associated with the paragraph number noted in the XML, which simplifies relating extracted data back to their location in the original patent. From this point the workflow is the same for all patent formats. The headings and paragraphs were grouped into experimental sections using the methodology described by Lowe [<xref ref-type="bibr" rid="CR3">3</xref>
]. LeadMine was then used to identify chemical entities and MPs.<fig id="Fig1"><label>Fig. 1</label>
<caption><p>Workflow for extraction of melting point data</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig1_HTML" id="MO1"></graphic>
</fig>
</p>
<p>The association of MPs with a chemical entity in close proximity (e.g. in a bracket after the chemical name) was achieved using a customized version of ChemicalTagger [<xref ref-type="bibr" rid="CR22">22</xref>
]. This customization consisted of adding support for tokens containing spaces (such that an MP measurement could be treated as a single token) and the integration of LeadMine to identify chemical entities and MPs. ChemicalTagger associates properties with chemical entities using a grammar that describes the syntax of a chemical entity with associated chemical properties. In many experimental sections the association of the MP with the synthesized compound is only implicit from the context, i.e. the MP appears at the end of the experimental section along with any other characterization data. In these cases the assumption is made that the MP applies to the compound being synthesized in that paragraph (Fig. <xref rid="Fig2" ref-type="fig">2</xref>
).<fig id="Fig2"><label>Fig. 2</label>
<caption><p>Example of typical experimental section with entities machine-annotated. The entities to associate are shown above</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig2_HTML" id="MO2"></graphic>
</fig>
</p>
</sec>
<sec id="Sec4"><title>Melting point recognition</title>
<p>Melting points are efficiently identified by LeadMine using a finite state machine compiled from a formal grammar. The same grammar is also used to generate a parser that identifies the different parts of an MP declaration. The grammar can be summarized as:</p>
<p>FromLiterature? MeltingPoint Qualifier? (Value|Range|MeasurementError) OutcomeQualifier?</p>
<p>Where:<table-wrap id="Taba"><table frame="hsides" rules="groups"><thead><tr><th align="left">Term</th>
<th align="left">Examples of text matched</th>
</tr>
</thead>
<tbody><tr><td align="left">FromLiterature</td>
<td align="left">“lit.”</td>
</tr>
<tr><td align="left">MeltingPoint</td>
<td align="left">“mpt”, “melting point”, “m.p.”</td>
</tr>
<tr><td align="left">Qualifier</td>
<td align="left">“>”; “approximately”</td>
</tr>
<tr><td align="left">Value</td>
<td align="left">“75 °C”, “200 °F”, “one hundred degrees Celsius”</td>
</tr>
<tr><td align="left">Range</td>
<td align="left">“184–186”, “191.5–192.4 °C”</td>
</tr>
<tr><td align="left">MeasurementError</td>
<td align="left">“50 ± 1 °C”</td>
</tr>
<tr><td align="left">OutcomeQualifier</td>
<td align="left">“decomp.”, “with decomposition”, “subl.”</td>
</tr>
</tbody>
</table>
</table-wrap>
</p>
<p>As the grammar accepts numbers both as numerals and decimals, and qualifiers both as symbols and words, the different lexical ways of representing a MP are collapsed into a normalized form that is used for further processing. Values expressed as measurement errors were converted to ranges and all temperatures were converted to degrees Celsius. The original text was retained for reference.</p>
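<p>The recognition and normalization steps described above can be illustrated with a minimal sketch. It is not LeadMine's grammar or API: the regular expression, the <monospace>MeltingPoint</monospace> record and the function <monospace>parse_melting_point</monospace> are hypothetical names introduced here, and only a subset of the lexical variants in the table is covered (numbers written as words, for example, are not handled). Values without a unit are assumed to be in degrees Celsius.</p>
<preformat><![CDATA[
```python
import re
from typing import NamedTuple, Optional

# Illustrative pattern only; a simplified stand-in for the formal grammar.
_NUM = r"\d+(?:\.\d+)?"
_MP_RE = re.compile(
    rf"""
    (?P<lit>lit\.\s*)?                                   # FromLiterature?
    \b(?:m\.p\.|mpt|melting\s+point)\s*:?\s*             # MeltingPoint
    (?P<qual>>|<|approximately)?\s*                      # Qualifier?
    (?P<first>{_NUM})                                    # Value
    (?:\s*(?P<sep>[-–]|±)\s*(?P<second>{_NUM}))?         # Range or MeasurementError
    \s*(?:°\s*(?P<unit>[CF]))?                           # unit (assumed Celsius if absent)
    (?:\s*\(?(?P<outcome>decomp\.|with\s+decomposition|subl\.)\)?)?  # OutcomeQualifier?
    """,
    re.IGNORECASE | re.VERBOSE,
)


class MeltingPoint(NamedTuple):
    low_c: float               # lower bound in degrees Celsius
    high_c: float              # upper bound in degrees Celsius
    qualifier: Optional[str]
    outcome: Optional[str]
    from_literature: bool
    text: str                  # original text, retained for reference


def _fahrenheit_to_celsius(t: float) -> float:
    return (t - 32.0) * 5.0 / 9.0


def parse_melting_point(text: str) -> Optional[MeltingPoint]:
    """Return a normalized melting point, or None if nothing matches."""
    m = _MP_RE.search(text)
    if m is None:
        return None
    first = float(m.group("first"))
    second = m.group("second")
    if second is None:
        low = high = first                       # single value
    elif m.group("sep") == "±":
        err = float(second)                      # measurement error -> range
        low, high = first - err, first + err
    else:
        low, high = first, float(second)         # explicit range, kept as written
    if (m.group("unit") or "C").upper() == "F":
        low, high = _fahrenheit_to_celsius(low), _fahrenheit_to_celsius(high)
    outcome = m.group("outcome")
    return MeltingPoint(
        low_c=low,
        high_c=high,
        qualifier=m.group("qual"),
        outcome=outcome.lower() if outcome else None,
        from_literature=m.group("lit") is not None,
        text=m.group(0),
    )
```
]]></preformat>
<p>In this sketch a range whose second temperature is lower than the first is kept as written, so that it can still be flagged by a later consistency check.</p>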
</sec>
<sec id="Sec5"><title>Extracted data</title>
<p>The associations between molecules and melting/decomposition/sublimation points were serialized to SDF format [<xref ref-type="bibr" rid="CR23">23</xref>
] (Fig. <xref rid="Fig3" ref-type="fig">3</xref>
).<fig id="Fig3"><label>Fig. 3</label>
<caption><p>Example of two entries from the resultant SDF</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig3_HTML" id="MO3"></graphic>
</fig>
</p>
</sec>
<sec id="Sec6"><title>Suspicious value detection</title>
<p>Melting points that could be automatically detected as being likely to be incorrect were flagged in the SDF. This flag was set for cases where:<list list-type="bullet"><list-item><p>Value was >500 °C;</p>
</list-item>
<list-item><p>Value was a range wider than 50 °C;</p>
</list-item>
<list-item><p>Value was a range where the second temperature was lower than the first temperature.</p>
</list-item>
</list>
</p>
<p>These heuristics aimed to detect cases where the patent text was likely to be in error, for example because of a typo, a missing decimal point or a missing hyphen.</p>
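<p>The three heuristics listed above can be written as a single predicate. The sketch below is illustrative; the function name <monospace>is_suspicious</monospace> is hypothetical, and applying the 500 °C limit to the larger bound of a range is an assumption, since the text does not say which bound was tested.</p>
<preformat><![CDATA[
```python
def is_suspicious(low_c: float, high_c: float) -> bool:
    """Flag a melting point (single value: low_c == high_c) as likely erroneous."""
    if high_c < low_c:
        return True   # second temperature lower than the first
    if max(low_c, high_c) > 500.0:
        return True   # value above 500 degrees Celsius (assumed: larger bound)
    if high_c - low_c > 50.0:
        return True   # range wider than 50 degrees Celsius
    return False
```
]]></preformat>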
</sec>
<sec id="Sec7"><title>Data filtering</title>
<p>In total, 498,985 associations were found in patent grants and 172,886 associations were found in patent applications. Of these, 1498 and 426 associations, respectively, were excluded because they carried the aforementioned suspicious value flag. Additionally, all compounds that were mixtures, i.e. contained more than one connected component, were excluded.</p>
<p>A large number of MP measurements were duplicated across different PATENTS. To avoid duplicates, records for the same molecule whose reported MP values differed by ΔT ≤ 1 °C were considered full duplicates and eliminated. This procedure eliminated N = 366,532 associations. All other values were considered as multiple measurements for the same molecule. For each molecule we selected one record, the one with the MP nearest to the median experimental value for that molecule. This allowed us to preserve the link to the originating patent, which could then be revisited in case of a problem with a particular record. We also excluded all molecules for which the descriptor calculation programs failed. The final number of records is shown in Table <xref rid="Tab1" ref-type="table">1</xref>
.<table-wrap id="Tab1"><label>Table 1</label>
<caption><p>The number of compounds and average properties of molecules of the analyzed datasets and their drug-like subsets</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left" rowspan="3">Dataset</th>
<th align="left" rowspan="3">Type</th>
<th align="left" colspan="4">Whole set</th>
<th align="left" rowspan="3">Drug-like set, % of the total set</th>
</tr>
<tr><th align="left" rowspan="2">N</th>
<th align="left" colspan="3">Average</th>
</tr>
<tr><th align="left">T (°C)</th>
<th align="left">MW</th>
<th align="left">NA</th>
</tr>
</thead>
<tbody><tr><td align="left">PATENTS</td>
<td align="left">Training</td>
<td char="." align="char">241,958</td>
<td char="." align="char">159</td>
<td char="." align="char">357</td>
<td char="." align="char">25</td>
<td char="." align="char">89</td>
</tr>
<tr><td align="left"> Decomposing</td>
<td align="left">Training</td>
<td char="." align="char">13,785</td>
<td char="." align="char">209</td>
<td char="." align="char">358</td>
<td char="." align="char">25</td>
<td char="." align="char">76</td>
</tr>
<tr><td align="left"> Non-decomposing</td>
<td align="left">Training</td>
<td char="." align="char">228,173</td>
<td char="." align="char">155</td>
<td char="." align="char">357</td>
<td char="." align="char">25</td>
<td char="." align="char">93</td>
</tr>
<tr><td align="left">Bergström</td>
<td align="left">Validation</td>
<td char="." align="char">277</td>
<td char="." align="char">151</td>
<td char="." align="char">295</td>
<td char="." align="char">20.8</td>
<td char="." align="char">92</td>
</tr>
<tr><td align="left">Bradley</td>
<td align="left">Validation</td>
<td char="." align="char">2878</td>
<td char="." align="char">59</td>
<td char="." align="char">174</td>
<td char="." align="char">11.4</td>
<td char="." align="char">53</td>
</tr>
<tr><td align="left">OCHEM</td>
<td align="left">Validation</td>
<td char="." align="char">21,832</td>
<td char="." align="char">117</td>
<td char="." align="char">249</td>
<td char="." align="char">16.7</td>
<td char="." align="char">73</td>
</tr>
<tr><td align="left">Enamine</td>
<td align="left">Validation</td>
<td char="." align="char">22,449</td>
<td char="." align="char">143</td>
<td char="." align="char">223</td>
<td char="." align="char">14.9</td>
<td char="." align="char">91</td>
</tr>
<tr><td align="left">COMBINED</td>
<td align="left">Validation, merge of four sets</td>
<td char="." align="char">47,436</td>
<td char="." align="char">126</td>
<td char="." align="char">233</td>
<td char="." align="char">15.6</td>
<td char="." align="char">81</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><italic>MW</italic>
 molecular weight, <italic>NA</italic>
 number of non-hydrogen atoms</p>
</table-wrap-foot>
</table-wrap>
</p>
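<p>The duplicate removal and record selection described in this section can be sketched as follows. This is an illustration under stated assumptions, not the code used in the study: records are taken to be (molecule identifier, MP in °C, patent identifier) tuples, the function name <monospace>select_representatives</monospace> is hypothetical, and near-duplicates are removed greedily over the sorted values, which is only one possible reading of the ΔT ≤ 1 °C rule.</p>
<preformat><![CDATA[
```python
from collections import defaultdict
from statistics import median


def select_representatives(records, tol=1.0):
    """Pick one (mp, patent) record per molecule.

    records: iterable of (molecule_id, mp_celsius, patent_id) tuples.
    Values within `tol` of the previously kept value are treated as full
    duplicates (assumption: greedy pass over sorted values). Among the
    remaining values, the record nearest to the median is selected, so
    the link to its originating patent is preserved.
    """
    by_molecule = defaultdict(list)
    for molecule_id, mp, patent_id in records:
        by_molecule[molecule_id].append((mp, patent_id))

    selected = {}
    for molecule_id, items in by_molecule.items():
        items.sort(key=lambda item: item[0])
        kept = []
        for mp, patent_id in items:
            if not kept or mp - kept[-1][0] > tol:
                kept.append((mp, patent_id))
        med = median(mp for mp, _ in kept)
        selected[molecule_id] = min(kept, key=lambda item: abs(item[0] - med))
    return selected
```
]]></preformat>
<p>Because the selected value is an actual record rather than a computed average, the patent identifier stored with it always points to a real measurement.</p>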
</sec>
<sec id="Sec8"><title>Experimental accuracy of data</title>
<p>The duplicated measurements (N = 18,058) were used to estimate the experimental accuracy of MP measurements, which was found to be σ = 38 °C. Because the procedure used to eliminate duplicated records also eliminated molecules with ΔT ≤ 1 °C measured in different experiments, we corrected the observed distribution of values for ΔT = 0 and 1 by using the same number of counts as observed for the ΔT = [2, 3] °C interval. This procedure provided σ = 35 °C, which can be used as an average estimate of the experimental accuracy of MP measurements across multiple experiments. This value incorporates the uncertainties due to polymorphism of chemical compounds, the uncertainty and difficulty of experimental measurements, and possible text-mining errors. For example, the distribution of MP values from the PATENTS literature had peaks at 250 and 350 °C, indicating either that measurements were stopped at these temperatures and threshold values were reported, or simply that at these temperatures an estimated value within a fairly broad range was entered (i.e. an accurate MP was not required per se, see Fig. <xref rid="Fig4" ref-type="fig">4</xref>
). All of these uncertainties decreased the accuracy of MP measurements.<fig id="Fig4"><label>Fig. 4</label>
<caption><p>Data distribution in the analyzed sets. The <italic>dashed lines</italic>
 indicate a defined drug-like region, which covers the MPs of >90 % of the compounds in the drug (Bergström) and chemical provider (Enamine) sets as well as 87 % of the compounds from the PATENTS set</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig4_HTML" id="MO4"></graphic>
</fig>
</p>
<p>It is interesting that the experimental accuracy depended on the MP value. A binned plot of the accuracy as a function of the MP temperature indicates that measurements with higher and lower temperatures were less reproducible (Fig. <xref rid="Fig5" ref-type="fig">5</xref>
). The measurements in the drug-like region of [50, 250] °C were estimated to have an experimental measurement error of σ = 32 °C.</p>
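<p>The effect of the correction for the removed ΔT ≤ 1 °C records can be illustrated with a small sketch. The exact estimator of σ is not specified in the text, so everything here is an assumption: σ is taken as the root mean square of the absolute differences ΔT between duplicated measurements, ΔT is binned to whole degrees, and the bins for ΔT = 0 and 1 are each filled with the mean count of the ΔT = 2 and 3 bins. The function names are hypothetical.</p>
<preformat><![CDATA[
```python
from collections import Counter
from math import sqrt


def delta_histogram(deltas):
    """Histogram of absolute differences, binned to whole degrees."""
    return Counter(int(round(abs(d))) for d in deltas)


def corrected_histogram(deltas):
    """Refill the bins for 0 and 1 degree, which duplicate removal emptied.

    Assumption: each of the two bins receives the mean count of the
    bins for 2 and 3 degrees.
    """
    hist = delta_histogram(deltas)
    fill = (hist[2] + hist[3]) / 2
    hist[0] = fill
    hist[1] = fill
    return hist


def rms_from_histogram(hist):
    """Root mean square of the binned differences (assumed estimator)."""
    total = sum(hist.values())
    return sqrt(sum(count * delta * delta for delta, count in hist.items()) / total)
```
]]></preformat>
<p>Refilling the two lowest bins adds small differences back into the distribution, which lowers the estimate; this is the same direction as the change from 38 to 35 °C reported above.</p>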
</sec>
<sec id="Sec9"><title>Validation datasets</title>
<p>Four other MP data sets were used to validate the models developed in this work. These datasets were taken from our previously published study [<xref ref-type="bibr" rid="CR11">11</xref>
]. The “Bergström” dataset contained drug-like molecules [<xref ref-type="bibr" rid="CR17">17</xref>
]. The “Bradley” dataset [<xref ref-type="bibr" rid="CR24">24</xref>
] contains doubly curated data collected by Open Notebook science community members. The OCHEM and Enamine datasets [<xref ref-type="bibr" rid="CR11">11</xref>
] comprised MP values collected from datasets available via the Online Chemical Modeling Environment (<ext-link ext-link-type="uri" xlink:href="http://ochem.eu">http://ochem.eu</ext-link>
) and provided by Enamine Ltd. These datasets had no compounds in common. Compounds overlapping with any of these four sets were removed from the PATENTS set. We also used a combined dataset (COMBINED) composed of the OCHEM, Enamine, Bradley and Bergström sets to simplify the analysis of performance for several studies.</p>
</sec>
<sec id="Sec10"><title>Drug-like subsets</title>
<p>In our previous study we showed that compounds with MP in the range 50–250 °C contributed the majority of compounds in drug-like collections [<xref ref-type="bibr" rid="CR11">11</xref>
]. Table <xref rid="Tab1" ref-type="table">1</xref>
 and Fig. <xref rid="Fig4" ref-type="fig">4</xref>
 confirm this observation and indicate that about 90 % of compounds from the PATENTS, Enamine and Bergström data sets are covered by this temperature interval. Indeed, a drug dataset is unlikely to contain compounds with MPs below room temperature (i.e. liquids) or with very high MPs, e.g. >500 °C. The former may have low affinity and specificity while the latter are likely to be insoluble. Therefore, the pharmaceutical industry mainly works with compounds from the “drug-like” region of chemical space, and the accuracy of prediction for compounds from this region is the most important for drug discovery.</p>
<p>The statistics for all datasets are provided in Table <xref rid="Tab1" ref-type="table">1</xref>
. There is a correlation between the average molecular weight (MW) and the average MP of compounds. This result is in agreement with the known problem in drug discovery of decreasing solubility for large molecules. The PATENTS dataset contained molecules with the largest average MW and thus the largest average MP. The compounds from the Bergström dataset had the second largest average MP. The Bradley dataset, which was composed of many general chemical industry compounds, had the smallest average MW and MP values.</p>
</sec>
</sec>
<sec id="Sec11" sec-type="materials|methods"><title>Methods</title>
<p>The consensus modeling approach, which was also applied in our previous studies [<xref ref-type="bibr" rid="CR11">11</xref>
, <xref ref-type="bibr" rid="CR25">25</xref>
, <xref ref-type="bibr" rid="CR26">26</xref>
], was used to develop models. The descriptors were calculated using 13 descriptor packages, which cover different representations of chemical structures, from simple fingerprints and counts of chemical groups to packages offering a wide variety of descriptor types, such as Dragon [<xref ref-type="bibr" rid="CR27">27</xref>
] and Adriana [<xref ref-type="bibr" rid="CR28">28</xref>
]. All of these descriptor types are implemented within the OCHEM platform [<xref ref-type="bibr" rid="CR29">29</xref>
]. Below we briefly describe the descriptors used (see also Table <xref rid="Tab2" ref-type="table">2</xref>
). Detailed information about each set of descriptors can be found on the OCHEM platform [<xref ref-type="bibr" rid="CR30">30</xref>
].<table-wrap id="Tab2"><label>Table 2</label>
<caption><p>Analyzed sets of descriptors</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Package name</th>
<th align="left">Type of descriptors<sup>a</sup>
</th>
<th align="left">Number of descriptors</th>
<th align="left">Matrix size, billions</th>
<th align="left">Number of descriptors after filtering</th>
<th align="left">Non-zero values, millions</th>
<th align="left">Sparseness<sup>b</sup>
</th>
</tr>
</thead>
<tbody><tr><td align="left">EFG</td>
<td align="left">Binary</td>
<td char="." align="char">595</td>
<td align="left">0.18</td>
<td char="." align="char">347</td>
<td align="left">3.1</td>
<td align="left">33</td>
</tr>
<tr><td align="left">QNPR</td>
<td align="left">Integer</td>
<td char="." align="char">1502</td>
<td align="left">0.45</td>
<td char="." align="char">1040</td>
<td align="left">6.3</td>
<td align="left">49</td>
</tr>
<tr><td align="left">MolPrint</td>
<td align="left">Binary</td>
<td char="." align="char">688,634</td>
<td align="left">205</td>
<td char="." align="char">197,367</td>
<td align="left">8.1</td>
<td align="left">7200</td>
</tr>
<tr><td align="left">E-state count</td>
<td align="left">Float</td>
<td char="." align="char">631</td>
<td align="left">0.19</td>
<td char="." align="char">487</td>
<td align="left">10</td>
<td align="left">14</td>
</tr>
<tr><td align="left">Inductive</td>
<td align="left">Float</td>
<td char="." align="char">54</td>
<td align="left">0.02</td>
<td char="." align="char">39</td>
<td align="left">11</td>
<td align="left">1</td>
</tr>
<tr><td align="left">ECFP4</td>
<td align="left">Binary</td>
<td char="." align="char">1024</td>
<td align="left">0.31</td>
<td char="." align="char">1021</td>
<td align="left">12</td>
<td align="left">25</td>
</tr>
<tr><td align="left">ISIDA</td>
<td align="left">Integer</td>
<td char="." align="char">5886</td>
<td align="left">1.75</td>
<td char="." align="char">2275</td>
<td align="left">18</td>
<td align="left">37</td>
</tr>
<tr><td align="left">ChemAxon</td>
<td align="left">Float</td>
<td char="." align="char">498</td>
<td align="left">0.15</td>
<td char="." align="char">114</td>
<td align="left">23</td>
<td align="left">1.5</td>
</tr>
<tr><td align="left">GSFrag</td>
<td align="left">Integer</td>
<td char="." align="char">1138</td>
<td align="left">0.34</td>
<td char="." align="char">469</td>
<td align="left">24</td>
<td align="left">5.7</td>
</tr>
<tr><td align="left">CDK</td>
<td align="left">Float</td>
<td char="." align="char">239</td>
<td align="left">0.07</td>
<td char="." align="char">182</td>
<td align="left">27</td>
<td align="left">2</td>
</tr>
<tr><td align="left">Adriana</td>
<td align="left">Float</td>
<td char="." align="char">200</td>
<td align="left">0.06</td>
<td char="." align="char">139</td>
<td align="left">32</td>
<td align="left">1.3</td>
</tr>
<tr><td align="left">Mera, Mersy</td>
<td align="left">Float</td>
<td char="." align="char">571</td>
<td align="left">0.17</td>
<td char="." align="char">235</td>
<td align="left">61</td>
<td align="left">1.1</td>
</tr>
<tr><td align="left">Dragon</td>
<td align="left">Float</td>
<td char="." align="char">1647</td>
<td align="left">0.49</td>
<td char="." align="char">911</td>
<td align="left">183</td>
<td align="left">1.5</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
The dominating type of descriptors within the corresponding package</p>
<p><sup>b</sup>
Average number of zero entries per non-zero value of the descriptor matrix</p>
</table-wrap-foot>
</table-wrap>
</p>
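<p>The matrix size and sparseness columns of Table 2 follow from simple arithmetic, sketched below. The function names are hypothetical, and the round figure of 300,000 molecules used in the comment is an assumption based on the "almost 300,000 data points" quoted in the abstract, not the exact number of rows used by the authors.</p>
<preformat><![CDATA[
```python
def matrix_size(n_molecules: int, n_descriptors: int) -> int:
    """Number of entries in a dense molecules x descriptors matrix."""
    return n_molecules * n_descriptors


def sparseness(n_molecules: int, n_descriptors: int, n_nonzero: int) -> float:
    """Average number of zero entries per non-zero value (Table 2, footnote b)."""
    return (matrix_size(n_molecules, n_descriptors) - n_nonzero) / n_nonzero


# Assumed round number of molecules; with the 688,634 MolPrint descriptors
# this gives about 2.07e11 entries, the same order as the 205 billion in Table 2.
molprint_entries = matrix_size(300_000, 688_634)
```
]]></preformat>
<p>A sparseness in the thousands, as for MolPrint, is the reason such a matrix has to be stored in a sparse format that keeps only the non-zero values.</p>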
<p>E-state [<xref ref-type="bibr" rid="CR31">31</xref>
] refers to electro-topological state indices, which are based on chemical graph theory. E-state indices are 2D descriptors that combine the electronic character and topological environment of each skeletal atom and bond. The environment of atoms and bonds determines their type. In a preliminary analysis for this study we found that E-state indices and simple counts of the atom and bond types defining E-state indices produced similar results. Since the development of models with E-state counts was faster, the counts were used.</p>
<sec id="Sec12"><title>In silico design and data analysis (ISIDA) fragments</title>
<p>These 2D descriptors are calculated with the help of the ISIDA fragmenter tool [<xref ref-type="bibr" rid="CR32">32</xref>
]. Compounds are split into substructural molecular fragments (SMF) of (in our case) lengths 2–4. Each fragment type constitutes a descriptor, with the number of occurrences of the fragment type as the respective descriptor value. In this study, we used sequence fragments composed of atoms and bonds.</p>
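<p>The idea of counting sequence fragments can be illustrated with a minimal sketch. It is not the ISIDA fragmenter: the function <monospace>sequence_fragments</monospace>, the input representation (a list of atom symbols and a dictionary of bonds) and the labelling of each fragment by the alphabetically smaller of its two reading directions are all assumptions made for illustration, as is the interpretation of length as the number of atoms in the sequence.</p>
<preformat><![CDATA[
```python
from collections import Counter


def sequence_fragments(atoms, bonds, min_len=2, max_len=4):
    """Count linear atom-bond sequences containing min_len..max_len atoms.

    atoms: list of element symbols, indexed by atom number.
    bonds: dict mapping (i, j) index pairs to a bond symbol such as "-" or "=".
    Returns a Counter mapping fragment label to number of occurrences.
    """
    neighbours = {i: [] for i in range(len(atoms))}
    for (i, j), symbol in bonds.items():
        neighbours[i].append((j, symbol))
        neighbours[j].append((i, symbol))

    counts = Counter()

    def extend(path, tokens):
        # Every path is reached from both ends; count it once, from the
        # end with the lower atom index.
        if len(path) >= min_len and path[0] < path[-1]:
            forward = "".join(tokens)
            backward = "".join(reversed(tokens))
            counts[min(forward, backward)] += 1
        if len(path) == max_len:
            return
        for nxt, symbol in neighbours[path[-1]]:
            if nxt not in path:          # simple paths only, no revisits
                extend(path + [nxt], tokens + [symbol, atoms[nxt]])

    for start in neighbours:
        extend([start], [atoms[start]])
    return counts
```
]]></preformat>
<p>Each key of the returned counter corresponds to one descriptor column, and its count is the descriptor value for the molecule.</p>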
</sec>
<sec id="Sec13"><title>GSFragments</title>
<p>GSFrag and GSFrag-L [<xref ref-type="bibr" rid="CR33">33</xref>
] are used to calculate 2D descriptors representing fragments of length <inline-formula id="IEq1"><alternatives><tex-math id="M1">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$${\text{k}} = 2 \ldots 10$$\end{document}</tex-math>
<mml:math id="M2"><mml:mrow><mml:mtext>k</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>…</mml:mo>
<mml:mn>10</mml:mn>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13321_2016_113_Article_IEq1.gif"></inline-graphic>
</alternatives>
</inline-formula>
 or <inline-formula id="IEq2"><alternatives><tex-math id="M3">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$${\text{k}} = 2 \ldots 7,$$\end{document}</tex-math>
<mml:math id="M4"><mml:mrow><mml:mtext>k</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>2</mml:mn>
<mml:mo>…</mml:mo>
<mml:mn>7</mml:mn>
<mml:mo>,</mml:mo>
</mml:mrow>
</mml:math>
<inline-graphic xlink:href="13321_2016_113_Article_IEq2.gif"></inline-graphic>
</alternatives>
</inline-formula>
 respectively. Similar to ISIDA, descriptor values are the occurrences of specific fragments. GSFrag-L is an extension of GSFrag: it considers labeled vertices in order to take heteroatoms of otherwise identical fragments into account.</p>
</sec>
<sec id="Sec14"><title>CDK v. 1.4.11 (3D)</title>
<p>The Chemistry Development Kit (CDK) [<xref ref-type="bibr" rid="CR34">34</xref>
] is an open source Java library for structural chemo- and bio-informatics. Its descriptor engine calculates 246 descriptors, comprising topological, geometric, electronic, molecular, and constitutional descriptors.</p>
</sec>
<sec id="Sec15"><title>Dragon v. 5.5 (3D)</title>
<p>Dragon is a software package from Talete [<xref ref-type="bibr" rid="CR27">27</xref>
] that calculates 3190 molecular descriptors. They cover 0D–3D space and are subdivided into 29 different logical blocks. Detailed information on the descriptors can be found on the Talete website (<ext-link ext-link-type="uri" xlink:href="http://www.talete.mi.it/">http://www.talete.mi.it/</ext-link>
).</p>
</sec>
<sec id="Sec16"><title>ChemAxon (v. 5.10.4) descriptors (3D)</title>
<p>The ChemAxon [<xref ref-type="bibr" rid="CR35">35</xref>
] Calculator Plugin produces a variety of properties. The properties encoded by numerical or Boolean values were used as descriptors [<xref ref-type="bibr" rid="CR29">29</xref>
]. They were subdivided into seven groups, ranging from 0D to 3D: elemental analysis, charge, geometry, partitioning, protonation, isomers, and others.</p>
<p>Adriana.Code v.2.2.6 [<xref ref-type="bibr" rid="CR28">28</xref>
] (3D), developed by Molecular Networks GmbH, calculates a variety of physicochemical properties of a molecule. The 211 resulting descriptors range from 0D descriptors (such as MW, or atom numbers) to 1D, 2D, and various 3D descriptors.</p>
<p>Mera/Mersy (3D) developed by chemosophia [<xref ref-type="bibr" rid="CR36">36</xref>
] included geometrical, energy characteristics, molecular symmetry and chirality and physicochemical descriptors [<xref ref-type="bibr" rid="CR37">37</xref>
].</p>
</sec>
<sec id="Sec17"><title>QNPR descriptors</title>
<p>Quantitative Name Property Relationship (QNPR) descriptors are 1D descriptors that are based directly on the IUPAC names or SMILES text string representations of molecules. The descriptors are calculated by splitting the respective string into all possible continuous substrings of a fixed length. In our study, substrings of one to three characters obtained by splitting SMILES strings were used. The minimum frequency of occurrence of each substring within the dataset was five.</p>
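<p>The substring enumeration described above can be illustrated with a minimal sketch. It is not the OCHEM implementation: the function names are hypothetical, and the sketch assumes that the "minimum frequency" refers to the total number of occurrences of a substring across the whole dataset.</p>

```python
from collections import Counter


def smiles_substrings(smiles, min_len=1, max_len=3):
    """Count all contiguous substrings of length min_len..max_len in one SMILES string."""
    counts = Counter()
    for k in range(min_len, max_len + 1):
        for start in range(len(smiles) - k + 1):
            counts[smiles[start:start + k]] += 1
    return counts


def qnpr_descriptors(smiles_list, min_frequency=5):
    """Build a count matrix over substrings that are frequent enough in the dataset.

    Assumption: frequency is the total occurrence count over all molecules.
    """
    per_molecule = [smiles_substrings(s) for s in smiles_list]
    totals = Counter()
    for counts in per_molecule:
        totals.update(counts)
    kept = sorted(s for s, n in totals.items() if n >= min_frequency)
    matrix = [[counts.get(s, 0) for s in kept] for counts in per_molecule]
    return kept, matrix
```

<p>For the toy input ["CCO", "CCN", "CCC"] with a minimum frequency of three, only the substrings "C" and "CC" are retained, and each molecule is described by the counts of these two substrings.</p>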
<p>ToxAlert [<xref ref-type="bibr" rid="CR38">38</xref>
] extended functional groups (EFG) [<xref ref-type="bibr" rid="CR39">39</xref>
] included 583 groups covering different functional features of molecules. The groups are based on classifications provided by the CheckMol software [<xref ref-type="bibr" rid="CR40">40</xref>
], which was extended to cover new groups, in particular heterocycles [<xref ref-type="bibr" rid="CR39">39</xref>
].</p>
<p>ECFP4 descriptor circular fingerprints [<xref ref-type="bibr" rid="CR41">41</xref>
] were calculated using ChemAxon software v. 5.10.4. These descriptors are widely used as part of the Pipeline Pilot software [<xref ref-type="bibr" rid="CR42">42</xref>
].</p>
<p>MolPrint descriptors [<xref ref-type="bibr" rid="CR43">43</xref>
] are circular fingerprints which employ Sybyl MOL2 atom types. They are based on counts of MOL2 atom types around each heavy atom of the molecule and enumerate all atom environments present in a molecule.</p>
<sec id="Sec18"><title>Machine learning methods</title>
<p>In our previous work [<xref ref-type="bibr" rid="CR11">11</xref>
] we found that ASNN [<xref ref-type="bibr" rid="CR44">44</xref>
] and SVM [<xref ref-type="bibr" rid="CR45">45</xref>
] methods provided significantly higher accuracy of MP predictions compared to other tested methods while the accuracy of models developed with both methods was similar.</p>
<p>The same two approaches were initially used in this study. However, training on large datasets requires significant computational resources and can take a long time. LibSVM supports parallelization, which can be easily enabled by editing a few lines of code and linking the code with appropriate libraries. This feature was enabled, and all calculations were performed on servers with up to 32 cores. Because all models were validated using fivefold cross-validation (five fold models plus the final model built on all training data), up to 6 × 32 = 192 cores were used simultaneously per task, which allowed fast processing of the data. The implementation of ASNN did not offer this feature. Therefore, after the initial analysis LibSVM with a radial basis function (RBF) kernel was used to develop all models. The most recent version, LibSVM v. 3.20, was used [<xref ref-type="bibr" rid="CR46">46</xref>
].</p>
</sec>
<sec id="Sec19"><title>Optimization of LibSVM parameters</title>
<p>The application of the SVM method required the optimization of three parameters: C, γ and ε. The LibSVM manual proposes a grid search based on an internal cross-validation (CV) procedure to optimize them. This grid optimization procedure is implemented as part of OCHEM. The full run includes 1693 individual LibSVM calculations using different combinations of the three parameters. This step requires significant computational time. Moreover, it is itself parameterized: to speed up the search, the user indicates which fraction of the data should be used for the optimization. When using a randomly selected 1 % of the training set we found that, surprisingly, the same parameters (C = 64, γ = 1, ε = 0.00391) were optimal for 10 out of 13 descriptor sets. However, parameters selected with such a small data subset could be suboptimal for the whole dataset. Considering that the selection of optimal parameters for this dataset was practically independent of the descriptors used, we decided to perform the optimization using 50 % of the training set for only one descriptor set. The EFG were selected as the set having the smallest number of non-zero values (Table <xref rid="Tab2" ref-type="table">2</xref>
). The optimization required about 15,000 core-hours (>600 days of calculations on a single-core computer) and identified another set of parameters (C = 256, γ = 1, ε = 16), which was used for the final analysis. This second set of parameters provided, on average, smaller training and validation set errors and produced models with a smaller number of support vectors. For example, models based on Dragon descriptors were 316 and 219 Mb (in a zipped file format) when developed with the first and the second set of SVM parameters, respectively.</p>
</sec>
<sec id="Sec20"><title>Unsupervised descriptors selection</title>
<p>Before the development of models, descriptors that had two or fewer non-zero values for the whole training set were eliminated. Moreover, descriptors that were inter-correlated with a linear correlation coefficient of R<sup>2</sup>
 > 0.95 were grouped together and only one descriptor from the group was selected for model development. This unsupervised filtering does not use any information about the target property and thus does not introduce selection bias [<xref ref-type="bibr" rid="CR47">47</xref>
], which could provide chance correlations.</p>
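<p>A minimal sketch of this two-step unsupervised filter is given below. The function names and the dictionary-of-columns data layout are hypothetical, and the choice of which descriptor represents a correlated group (here, the first one encountered) is an assumption, since the text does not specify it.</p>

```python
def pearson_r2(x, y):
    """Squared Pearson correlation of two equally long value lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # a constant column is treated as uncorrelated
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def filter_descriptors(columns, min_nonzero=3, r2_threshold=0.95):
    """Return the names of the descriptors kept for modeling.

    columns maps a descriptor name to its values over the training set.
    No information about the target property is used.
    """
    # Step 1: drop descriptors with two or fewer non-zero values.
    candidates = [name for name, col in columns.items()
                  if sum(1 for v in col if v != 0) >= min_nonzero]
    # Step 2: keep one representative per group of inter-correlated descriptors.
    kept = []
    for name in candidates:
        correlated = any(pearson_r2(columns[name], columns[k]) > r2_threshold
                         for k in kept)
        if not correlated:
            kept.append(name)
    return kept
```

<p>The pairwise comparison against already kept descriptors is quadratic in the number of descriptors and serves only to make the logic explicit; it is not meant for matrices of the size discussed in this study.</p>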
</sec>
<sec id="Sec21"><title>Validation of models</title>
<p>The models developed using the PATENTS dataset were validated using fivefold CV as described in detail elsewhere [<xref ref-type="bibr" rid="CR48">48</xref>
]. In this approach each model is built using 4/5 of the compounds from the initial training set. The remaining 20 % of compounds are predicted and are used to estimate the accuracy of the models. By repeating the model building five times one can calculate predictions for all molecules from the initial dataset. These predictions are used to estimate the CV accuracy of the model. The final model is built using all training set data.</p>
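<p>The fivefold procedure can be sketched as follows. This is an illustration only: the <italic>fit</italic> and <italic>predict</italic> callables are hypothetical placeholders for any learning method, and the random assignment of molecules to folds is an assumption.</p>

```python
import random


def cross_validated_predictions(x, y, fit, predict, n_folds=5, seed=0):
    """Predict every sample with a model that did not see it during training."""
    indices = list(range(len(x)))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    predictions = [None] * len(x)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in indices if i not in held]
        # Each model is built using (n_folds - 1) / n_folds of the data.
        model = fit([x[i] for i in train], [y[i] for i in train])
        for i in held_out:
            predictions[i] = predict(model, x[i])
    return predictions


def rmse(y_true, y_pred):
    """Root mean squared error between two equally long value lists."""
    return (sum((a - b) ** 2 for a, b in zip(y_true, y_pred)) / len(y_true)) ** 0.5
```

<p>The CV accuracy is then obtained by calling <italic>rmse</italic> on the experimental values and the pooled out-of-fold predictions; the final model is fitted separately on all training data.</p>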
<p>For classification of molecules that melt and decompose we used bagging validation [<xref ref-type="bibr" rid="CR49">49</xref>
]. Since the number of decomposing molecules was ca. 6 % of the dataset and thus much smaller than the number of non-decomposing ones, a specific implementation, so-called stratified bagging learning, was selected [<xref ref-type="bibr" rid="CR50">50</xref>
]. It is one of the most successful methods for working with imbalanced datasets. In the stratified bagging approach the molecules of the smaller class are selected using sampling with replacement to form a set of the same size as that class. The same procedure is used for the larger class, but the number of selected samples is limited to the size of the smaller class. The resulting training set is thus twice the size of the smaller class. The selection of samples is repeated for each model in the bagging protocol. The predictions are calculated for samples that were not included in the respective training sets and are averaged over all calculated models. The bagging ensembles comprised N = 64 models.</p>
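<p>The sampling scheme of stratified bagging can be sketched as below. The function names and the <italic>fit</italic> and <italic>predict</italic> callables are hypothetical, and the sketch is not the OCHEM implementation; it only illustrates balanced sampling with replacement and out-of-bag averaging.</p>

```python
import random


def stratified_bag(labels, rng):
    """Draw one balanced bag; return (in_bag_indices, out_of_bag_indices)."""
    by_class = {}
    for i, label in enumerate(labels):
        by_class.setdefault(label, []).append(i)
    n_small = min(len(members) for members in by_class.values())
    in_bag = []
    for members in by_class.values():
        # Sampling with replacement, limited to the size of the smaller class.
        in_bag.extend(rng.choice(members) for _ in range(n_small))
    chosen = set(in_bag)
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return in_bag, out_of_bag


def bagging_oob_predictions(x, labels, fit, predict, n_models=64, seed=0):
    """Average predictions over the models that did not train on each sample."""
    rng = random.Random(seed)
    sums = [0.0] * len(x)
    counts = [0] * len(x)
    for _ in range(n_models):
        in_bag, out_of_bag = stratified_bag(labels, rng)
        model = fit([x[i] for i in in_bag], [labels[i] for i in in_bag])
        for i in out_of_bag:
            sums[i] += predict(model, x[i])
            counts[i] += 1
    # A sample that was in every bag has no out-of-bag prediction.
    return [s / c if c else None for s, c in zip(sums, counts)]
```

<p>With two classes, each bag contains twice as many samples as the smaller class, so with ca. 6 % decomposing molecules most of the larger class remains out of bag in every replica.</p>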
</sec>
<sec id="Sec22"><title>Consensus modeling</title>
<p>Consensus modeling was shown in the previous study to be essential for achieving high prediction accuracy [<xref ref-type="bibr" rid="CR11">11</xref>
]. A simple average of models<disp-formula id="Equ1"><label>1</label>
<alternatives><tex-math id="M5">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$\bar{y} = \frac{1}{n}\sum y_{i}$$\end{document}</tex-math>
<mml:math id="M6" display="block"><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow><mml:mo stretchy="false">¯</mml:mo>
</mml:mrow>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mfrac><mml:mn>1</mml:mn>
<mml:mi>n</mml:mi>
</mml:mfrac>
<mml:mo>∑</mml:mo>
<mml:msub><mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
</mml:math>
<graphic xlink:href="13321_2016_113_Article_Equ1.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where <italic>n</italic>
 is the total number of models and <italic>y</italic>
<sub><italic>i</italic>
</sub>
 is an individual prediction, was used to develop the consensus model in that study. In this study each individual model was developed with one of the 13 sets of descriptors described above. This approach yielded highly predictive models, as reported in previous studies [<xref ref-type="bibr" rid="CR11">11</xref>
, <xref ref-type="bibr" rid="CR25">25</xref>
, <xref ref-type="bibr" rid="CR26">26</xref>
, <xref ref-type="bibr" rid="CR51">51</xref>
–<xref ref-type="bibr" rid="CR53">53</xref>
], including Rank-I submission models [<xref ref-type="bibr" rid="CR52">52</xref>
, <xref ref-type="bibr" rid="CR53">53</xref>
] for the ToxCast challenges organized by EPA and NIH.</p>
<p>In this study two additional methods were also analyzed. The first approach was averaging weighted by model accuracy<disp-formula id="Equ2"><label>2</label>
<alternatives><tex-math id="M7">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$\bar{y} = \sum \frac{{w_{i} y_{i} }}{{\sqrt {\sum (w_{i} )^{2} } }};\quad w_{i} = 1/RMSE$$\end{document}</tex-math>
<mml:math id="M8" display="block"><mml:mrow><mml:mover accent="true"><mml:mrow><mml:mi>y</mml:mi>
</mml:mrow>
<mml:mrow><mml:mo stretchy="false">¯</mml:mo>
</mml:mrow>
</mml:mover>
<mml:mo>=</mml:mo>
<mml:mo>∑</mml:mo>
<mml:mfrac><mml:mrow><mml:msub><mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:msub><mml:mi>y</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
</mml:mrow>
<mml:msqrt><mml:mrow><mml:mo>∑</mml:mo>
<mml:msup><mml:mrow><mml:mo stretchy="false">(</mml:mo>
<mml:msub><mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo stretchy="false">)</mml:mo>
</mml:mrow>
<mml:mn>2</mml:mn>
</mml:msup>
</mml:mrow>
</mml:msqrt>
</mml:mfrac>
<mml:mo>;</mml:mo>
<mml:mspace width="1em"></mml:mspace>
<mml:msub><mml:mi>w</mml:mi>
<mml:mi>i</mml:mi>
</mml:msub>
<mml:mo>=</mml:mo>
<mml:mn>1</mml:mn>
<mml:mo stretchy="false">/</mml:mo>
<mml:mi>R</mml:mi>
<mml:mi>M</mml:mi>
<mml:mi>S</mml:mi>
<mml:mi>E</mml:mi>
</mml:mrow>
</mml:math>
<graphic xlink:href="13321_2016_113_Article_Equ2.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where RMSE was the root mean squared error of the model. In the second approach a consensus model was developed using the predictions of individual models as descriptors for a multiple linear regression model (MLRA).</p>
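<p>The simple average of Eq. 1 and an accuracy-weighted average can be sketched as follows. The function names are hypothetical. Note that the weighted variant below normalizes the weights w<sub>i</sub> = 1/RMSE by their sum, which is the conventional choice for a weighted mean and is an assumption of this sketch; it is not a transcription of the normalization term printed in Eq. 2. The MLRA-based consensus is not shown.</p>

```python
def simple_consensus(predictions):
    """Eq. 1: arithmetic mean over models.

    predictions is a list with one list of predicted values per model.
    """
    n = len(predictions)
    return [sum(values) / n for values in zip(*predictions)]


def weighted_consensus(predictions, rmses):
    """Accuracy-weighted mean with weights 1/RMSE.

    Assumption: weights are normalized by their sum (see the text above).
    """
    weights = [1.0 / r for r in rmses]
    total = sum(weights)
    return [sum(w * v for w, v in zip(weights, values)) / total
            for values in zip(*predictions)]
```

<p>When all models have the same RMSE the two functions return the same values, which is consistent with the observation that weighting changes little when the accuracies of the individual models are similar.</p>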
</sec>
</sec>
<sec id="Sec23"><title>Handling of intervals and ranges</title>
<p>A majority of MP values were reported as intervals or ranges. We used the average or threshold value for the development of the LibSVM models.</p>
</sec>
<sec id="Sec24"><title>Reproducibility of models</title>
<p>The OCHEM web site was developed with the aim of delivering full reproducibility of modeling efforts. Thus, each model stores the details of the configuration that was used to create it. The configuration includes options for data standardization, descriptor calculation and pre-processing as well as the parameters of the machine learning method, e.g. LibSVM in this study. The configuration can be exported in a human-readable XML format using the “Export configuration XML” link available on the model profile. A user who wishes to reproduce the model exactly can upload the exported configuration to the model development web page (OCHEM menu: “Models”/“Create a model”) using “Import an XML model template”, or simply reuse the configuration of the previous model (“Use another model as a template”). Once one of these options is used, the model can be submitted for calculation without specifying any other parameters and will use exactly the same workflow as the original model. The only exception is the Consensus model, which requires the steps used for the model development to be repeated manually (the options from the XML configuration are pre-set automatically), due to technical differences in the implementation of this model. It should be noted that the calculation of large models requires significant CPU resources. Users are therefore allowed to submit tasks with a maximum number of molecules that is proportional to the number of bonus points they have collected (i.e. by registering, uploading data, developing and publishing models and participating in data moderation). The limit on the number of molecules per task also helps to prevent problems caused by inexperienced users who initiate very large calculations by mistake. For example, non-registered users and validated registered users can submit models with up to 1000 and 10,000 molecules per task, respectively. 
It is always possible to contact the web administrator (first author of the manuscript) to increase this limit for some specific projects. A detailed protocol used for the development of the consensus MP model for the PATENTS dataset is provided as Additional file <xref rid="MOESM1" ref-type="media">1</xref>.
</p>
</sec>
<sec id="Sec25"><title>Automatic filtering of outliers</title>
<p>OCHEM provides tools for the automated recognition and filtering of errors. It assumes that the errors, i.e. the differences between predicted and experimental values, follow a Gaussian distribution N(0, σ) with σ = RMSE. Molecules with large differences between the predicted and experimental MP values are unlikely under this distribution and are considered to be outliers. The probability of finding a molecule with an error larger than 2σ is p < 0.05. For a dataset with N = 229k molecules one can expect 22.9 ≈ 23 molecules for p = 0.0001. For a model with RMSE = 40 °C this p-value corresponds to errors larger than about 3.8σ, i.e. about 150 °C between predicted and experimental values. If instead of N = 23 we detect, e.g., N = 163 molecules with such large errors, we can assume that the vast majority of these outliers are due to experimental errors or to problems with the data or with the model itself. If the outliers are indeed errors then their exclusion can improve the quality of the models. Of course, by removing the outliers we will also remove a number of “good” data points (in this case N = 23), which have large errors due to the statistical properties of the dataset. In contrast to the removal of erroneous data, the removal of “good” molecules decreases the data set size and thus decreases the quality of the model. The ratio of the number of identified outliers to the number expected by chance corresponds to the signal-to-noise ratio (SNR). For the considered example the SNR is 163/23 ≈ 7, i.e. out of seven molecules identified for this p-value, only one can be explained by the statistical properties of the data. Thus, for this SNR the removal of seven outlying points will also remove one “good” data point.</p>
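<p>The arithmetic of this example can be reproduced with a short sketch. The function name is hypothetical, and the use of a two-sided Gaussian quantile is an assumption; under that assumption p = 0.0001 corresponds to a threshold of roughly 3.9σ, close to the rounded figure quoted above.</p>

```python
from statistics import NormalDist


def outlier_summary(n_molecules, p_value, rmse, n_detected):
    """Expected chance outliers, error threshold and signal-to-noise ratio.

    Assumption: errors follow N(0, sigma) with sigma = RMSE and the
    p-value is two-sided.
    """
    expected = n_molecules * p_value
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)
    return {
        "expected_by_chance": expected,  # "good" points beyond the threshold
        "threshold_sigma": z,            # threshold in units of sigma
        "threshold_error": z * rmse,     # threshold in the units of the RMSE
        "snr": n_detected / expected,    # detected / expected by chance
    }


summary = outlier_summary(n_molecules=229000, p_value=0.0001, rmse=40.0, n_detected=163)
```

<p>For the values used in the text the expected number of chance outliers is 22.9 and the SNR is about 7.</p>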
</sec>
<sec id="Sec26"><title>SetCompare</title>
<p>This utility uses a hyper-geometric distribution to estimate the probability that the observed ratios of a particular feature (e.g. an alert) in two analyzed sets could occur by chance [<xref ref-type="bibr" rid="CR25">25</xref>
].</p>
</sec>
</sec>
<sec id="Sec27" sec-type="results"><title>Results</title>
<p>The modeling of large datasets presents many challenges with respect to the required computational time, the storage of descriptors and of the calculated model, and the selection of appropriate machine learning algorithms that can handle such data. The descriptor packages analyzed in this study calculated different numbers of descriptors (see Table <xref rid="Tab2" ref-type="table">2</xref>
). The largest matrix was contributed by MolPrint descriptors. It had an initial size of 688,634 × 197,367 ≈ 200 × 10<sup>9</sup>
 (0.2 trillion points), which decreased to 60 billion after the unsupervised filtering. The training of a model with hundreds of thousands of descriptors is infeasible with computational algorithms that operate on the full matrix. Examples of such algorithms include neural networks, multiple linear regression analysis and partial least squares.</p>
<p>However, the matrix produced by MolPrint descriptors is a very sparse one: only one out of more than 7000 descriptor entries was non-zero (Table <xref rid="Tab2" ref-type="table">2</xref>
). This matrix had the third smallest number of non-zero descriptors after the EFG and QNPR. Such sparse data can be analyzed using kernel-based methods. These approaches deal with the pairwise similarity of molecules and thus can efficiently work with sparse data by performing calculations using non-zero entries only. The support of a sparse data format is efficiently realized in LibSVM making this method easily applicable to this type of data. OCHEM software also supports a sparse data format thus making it possible to fully utilize the power of the LibSVM method.</p>
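<p>The reason why kernel methods cope well with sparse data can be illustrated with a minimal sketch of an RBF kernel evaluated on sparse vectors. The dictionary representation and function names are hypothetical and do not reproduce LibSVM internals; the point is only that the cost depends on the number of non-zero entries, not on the nominal dimensionality.</p>

```python
import math


def sparse_sq_distance(a, b):
    """Squared Euclidean distance between two sparse vectors.

    a and b map a descriptor index to its non-zero value; all other
    entries are implicitly zero and are never visited.
    """
    keys = set(a) | set(b)
    return sum((a.get(k, 0.0) - b.get(k, 0.0)) ** 2 for k in keys)


def rbf_kernel(a, b, gamma=1.0):
    """RBF kernel exp(-gamma * ||a - b||^2) on sparse vectors."""
    return math.exp(-gamma * sparse_sq_distance(a, b))
```

<p>Two molecules with a handful of non-zero descriptors each are compared in a handful of operations, even if the descriptor space has hundreds of thousands of dimensions.</p>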
<p>The EFG, despite their high dimensionality, had only 3.1 million non-zero values and provided the fastest calculations. The calculation of one model for these descriptors (without optimization of LibSVM parameters) required about 120 core-hours. The Dragon descriptors contained the largest number of non-zero values (>183 million) and required the longest calculation time of more than 1000 core-hours.</p>
<sec id="Sec28"><title>Comparison of the accuracy of models developed using PATENTS dataset with previous models</title>
<p>As in our previous study [<xref ref-type="bibr" rid="CR11">11</xref>
] the model developed using E-state indices calculated the lowest RMSE for the training set and provided one of the best results for the four validation sets (see Additional file <xref rid="MOESM2" ref-type="media">2</xref>
: Table S1, Table <xref rid="Tab3" ref-type="table">3</xref>
). The largest errors of models were calculated using ECFP4, MolPrint and Inductive descriptors, which had a cross-validation RMSE of >50 °C for the training set.<table-wrap id="Tab3"><label>Table 3</label>
<caption><p>RMSE of models for prediction of different sets</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Method</th>
<th align="left">PATENTS set</th>
<th align="left">Bergström</th>
<th align="left">Bradley</th>
<th align="left">OCHEM</th>
<th align="left">Enamine</th>
<th align="left">COMBINED</th>
</tr>
</thead>
<tbody><tr><td align="left">PATENTS E-state</td>
<td align="left">38.3 ± 0.1 (36.1)<sup>a</sup>
</td>
<td char="(" align="char">34 ± 1 (31)</td>
<td char="(" align="char">62 ± 1 (33.7)</td>
<td align="left">48.5 ± 0.4 (36.2)</td>
<td align="left">40.8 ± 0.3 (35.2)</td>
<td align="left">45.9 (35.6)</td>
</tr>
<tr><td align="left">PATENTS Consensus all ten models</td>
<td align="left">37.8 ± 0.1 (34.1)</td>
<td char="(" align="char">34 ± 1 (31)</td>
<td char="(" align="char">78 ± 1 (32.2)</td>
<td align="left">54.2 ± 0.4 (34.1)</td>
<td align="left">40.4 ± 0.3 (33.9)</td>
<td align="left">50.5 (33.9)</td>
</tr>
<tr><td align="left">PATENTS Consensus five best models</td>
<td align="left">37.0 ± 0.1 (33.7)</td>
<td char="(" align="char">33 ± 1 (31)</td>
<td char="(" align="char">71 ± 1 (31.3)</td>
<td align="left">50.1 ± 0.4 (33.8)</td>
<td align="left">39.4 ± 0.3 (33.5)</td>
<td align="left">46.9 (33.6)</td>
</tr>
<tr><td align="left">OCHEM consensus</td>
<td align="left">–</td>
<td char="(" align="char">34 ± 1 (31)</td>
<td char="(" align="char">33.9 ± 0.6 (33.1)</td>
<td align="left">–</td>
<td align="left">40.1 ± 0.3 (34.6)</td>
<td align="left">–</td>
</tr>
<tr><td align="left">Enamine consensus</td>
<td align="left">–</td>
<td char="(" align="char">36 ± 2 (33)</td>
<td char="(" align="char">73 ± 1 (33.9)</td>
<td align="left">51.9 ± 0.4 (36.6)</td>
<td align="left">–</td>
<td align="left">–</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
Values in parentheses are calculated for compounds with experimental MP values in the [50; 250] °C “drug-like” interval. They had the same or lower confidence intervals, which are not indicated</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>A consensus model was built as a simple average of all models with the exception of the three aforementioned models, which had CV RMSE >50 °C. It decreased the RMSE for CV and test set predictions by 1–2 °C compared to the results based on E-state descriptors. In its design (a simple consensus average of ten individual models) this model was the best match to the model developed in our previous study, thus allowing a straightforward comparison. The new model provided similar or lower errors for the drug-like subsets compared to the consensus models developed with the individual OCHEM or Enamine sets. For example, a consensus model developed with the OCHEM dataset predicted the drug-like subsets of the Bradley and Enamine sets with RMSEs of 33.1 and 34.6 °C, respectively. The model developed with the PATENTS dataset predicted them with RMSEs of 32.2 and 33.9 °C, respectively. For the Enamine dataset the absolute difference in RMSE of 0.7 °C is significant (p < 0.05) due to the large number of molecules in this set.</p>
<p>The accuracy of the consensus model developed using the PATENTS dataset was low for the whole Bradley set despite its low RMSE for the drug-like subset of this set. This result was due to the near absence of molecules with MP <0 °C in the PATENTS set. Indeed, there were only a few molecules with an MP <0 °C in this set and, moreover, most, if not all, of these molecules were likely to be experimental errors (see below the section regarding the filtering of outliers). Because of the insufficient coverage of this region of values the model was unable to predict molecules with low MP values, which constituted about 25 % of the molecules of the Bradley set. The low accuracy of the model for the Bradley set (RMSE = 78 °C) is in agreement with the similar error (RMSE = 73 °C) of a consensus model based on the Enamine dataset (see Table <xref rid="Tab3" ref-type="table">3</xref>
) [<xref ref-type="bibr" rid="CR11">11</xref>
]. The Enamine set also did not have compounds with MP <0 °C and a model based on this set failed to predict the whole Bradley set despite the fact that it had excellent prediction ability for its “drug like” subset (Table <xref rid="Tab3" ref-type="table">3</xref>
).</p>
<p>About 5 and 6 % of the molecules in the PATENTS and COMBINED sets, respectively, had MP >250 °C. For these subsets the PATENTS consensus model calculated a large CV RMSE = 61 °C for the PATENTS set and an even higher RMSE = 74 °C for the COMBINED set. The prediction of compounds in this temperature range therefore remains a challenging task.</p>
<p>This analysis demonstrates that models developed using text-mined MP data from PATENTS provide an excellent prediction performance, similar or even significantly better than the results based on manually curated data used in previous studies. Since the patent corpus continues to grow quickly we can envisage that if the workflow and data processing pipeline is applied on an ongoing basis then the dataset will continue to grow and it will be at a much faster rate than manual extraction and curation will allow. While this procedure has only been reported for MP extraction and modeling in this work we can imagine utilizing the same procedure for other physicochemical properties such as multi-solvent solubilities, logP and other available parameters. The success for these parameters is yet to be proven.</p>
</sec>
<sec id="Sec29"><title>Analysis of different methods to perform consensus averaging</title>
<p>A simple consensus averaging is often used by many researchers, including ourselves, to improve the quality of models by agglomerating predictions of individual models [<xref ref-type="bibr" rid="CR11">11</xref>
, <xref ref-type="bibr" rid="CR48">48</xref>
, <xref ref-type="bibr" rid="CR54">54</xref>
]. Is it possible to achieve even better results by using more sophisticated averaging methods? We analyzed the three approaches described in the methods section by applying them to the n best models, which were ranked in order of increasing CV RMSE. The accuracy of predictions of the various models was estimated for the drug-like subsets of the PATENTS and COMBINED sets.</p>
<p>The average of five models based on E-state, Fragmentor, CDK, ChemAxon and QNPR descriptors calculated the lowest RMSE of 33.7 °C for drug-like subsets of the PATENTS and COMBINED datasets and thus provided an improvement of 0.4 °C compared to the results obtained when averaging ten models.</p>
<p>The models calculated using the weighted average had exactly the same performance for all subsets up to the average of five models. Indeed, since the accuracies of the individual models were very similar, their weighted combination did not improve results compared to the simple average. For combinations of a larger number of models, the weighted average sometimes provided smaller RMSEs of about 0.01 log units, which was not significantly different compared to the simple average.</p>
<p>The application of MLRA regression on the predicted values did not improve this result and the same RMSE was calculated for the drug-like subset of the COMBINED set. Thus, neither of the two studied strategies provided an improvement compared to the use of a simple arithmetic average of models.</p>
<p>The important result of this analysis was that averaging a few models with the highest prediction ability could improve results compared to averaging all models.</p>
</sec>
<sec id="Sec30"><title>Analysis of compounds, which decompose during melting</title>
<p>A number of data points (see Table <xref rid="Tab1" ref-type="table">1</xref>
) from the PATENTS collection contained annotations about the thermal decomposition (pyrolysis) of chemical structures. The MW and number of non-hydrogen atoms of decomposing structures were practically identical to those of other molecules. The CV RMSE for the subset of molecules that decomposed was 47.7 °C, i.e. significantly larger than the 36.5 °C calculated for the subset of molecules without decomposition. The median MP for decomposing compounds was 210 °C as compared to 155 °C for the whole dataset (Table <xref rid="Tab1" ref-type="table">1</xref>
). Thus, the lower prediction accuracy for these compounds could partially be due to the higher average MP, which is more difficult to predict.</p>
<p>The SetCompare tool identified that molecules containing acids (carboxylic, phosphonic and α-amino acids), primary amines, tetrazoles, and a number of other groups were overrepresented in the group of compounds that decomposed on heating. The identified overrepresented groups are available for review online at <ext-link ext-link-type="uri" xlink:href="http://ochem.eu/article/99826">http://ochem.eu/article/99826</ext-link>
. Phosphonic acids and α-amino acids were among the most overrepresented groups in the set of decomposing compounds. They were present in only 0.4 % of compounds (0.1 % phosphonic and 0.3 % α-amino acids) in the whole set but accounted for about 4 % of all compounds in the decomposing set. Thus, the presence of one of these groups increased the probability that a compound decomposes by more than ten times. Compounds with a nitroso group were also about 9 times overrepresented in the decomposing set. The propensity of these three groups to decompose is well known. Dunn and Brophy [<xref ref-type="bibr" rid="CR55">55</xref>
] studied the decomposition of the amino acids and their contribution to the uncertainty associated with determination of their MPs. The decomposition of phosphonic acid and its esters has been actively studied in toxicological chemistry since it results in the release of highly toxic phosphine, PH<sub>3</sub>
 [<xref ref-type="bibr" rid="CR56">56</xref>
]. Compounds with nitroso groups are well known for their ability to decompose with a release of high energy, which makes them very important for the development of explosives (including dynamite).</p>
<p>The propensity of a compound to decompose and its propensity to melt are different properties. We therefore expect that better models can be obtained by modeling each property independently.</p>
</sec>
<sec id="Sec31"><title>Modeling to predict compound decomposition</title>
<p>A model to predict the decomposition point of molecules was developed using the same protocol and SVM parameters selected for the whole PATENTS set. As with the analysis of the whole set of compounds, the best accuracy of an individual model was obtained using E-state descriptors (RMSE = 43.2 °C). A consensus model based on the average of five models gave the lowest RMSE = 42.3 °C. This error was lower than the CV RMSE of 47.7 °C calculated for the same molecules by the model developed with the whole PATENTS set (Table <xref rid="Tab4" ref-type="table">4</xref>
). Indeed, this dataset was more internally consistent, thus contributing to a better prediction ability of the models. However, the higher CV errors calculated for the decomposition point indicated that this property is even more difficult to predict than MP. This model gave a higher RMSE for the prediction of molecules from both the COMBINED set and the non-decomposing PATENTS subset. This result was expected since the two properties describe different physical effects.<table-wrap id="Tab4"><label>Table 4</label>
<caption><p>RMSE of the consensus models developed with different subsets of the PATENTS set</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left" rowspan="2">PATENTS subset used to train model</th>
<th align="left" colspan="2">PATENTS subsets</th>
<th align="left" rowspan="2">COMBINED</th>
</tr>
<tr><th align="left">Decomposing</th>
<th align="left">Non-decomposing</th>
</tr>
</thead>
<tbody><tr><td align="left">Non-decomposing + decomposing</td>
<td char="(" align="char">47.7 (40.5)<sup>a</sup>
</td>
<td char="(" align="char">36.5 (33.4)</td>
<td char="(" align="char">46.9 (33.6)</td>
</tr>
<tr><td align="left">Decomposing</td>
<td char="(" align="char">42.3 (38.9)</td>
<td char="(" align="char">64.3 (62.9)</td>
<td char="(" align="char">94.9 (70.4)</td>
</tr>
<tr><td align="left">Non-decomposing</td>
<td char="(" align="char">51 (43.1)</td>
<td char="(" align="char">36.3 (33.3)</td>
<td char="(" align="char">46.5 (33.3)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
Values in parentheses are calculated for compounds with experimental MP values in the [50; 250] °C “drug like” interval</p>
</table-wrap-foot>
</table-wrap>
</p>
</sec>
<sec id="Sec32"><title>Prediction of MP of compounds cleaned from decomposing molecules</title>
<p>The models developed for the MP set after exclusion of decomposing molecules gave a lower CV RMSE and also a lower RMSE for predictions of the COMBINED set molecules. The increase in accuracy of 0.1–0.3 °C for both sets was not statistically significant. As expected, these models gave a higher RMSE for the prediction of decomposing molecules.</p>
<p>These results indicate that the separation of molecules into two classes, i.e. those that decompose and those that do not, allowed the development of better predictive models for each property. Unfortunately, whether a new molecule decomposes is generally unknown. A classification model that distinguishes compounds that decompose during melting from those that do not could help to identify both classes of compounds. Moreover, such information can also be useful for the handling of chemical compounds.</p>
</sec>
<sec id="Sec33"><title>Classification model to predict decomposing compounds</title>
<p>A model was developed using the same sets of descriptors for all molecules from the PATENTS database, which were classified into non-decomposing and decomposing classes. We used stratified undersampling bagging [<xref ref-type="bibr" rid="CR50">50</xref>
] since the decomposing molecules accounted for only 5.5 % of the set and thus the dataset was highly imbalanced. This approach has also demonstrated high prediction power for the analysis of large chemical datasets [<xref ref-type="bibr" rid="CR25">25</xref>
, <xref ref-type="bibr" rid="CR56">56</xref>
, <xref ref-type="bibr" rid="CR57">57</xref>
]. Since each training dataset contained just twice the number of decomposing molecules, the SVM calculations were fast. Three descriptor sets (E-state, CDK and Fragmentor) had balanced accuracy above 75.5 % with the best one, E-state, having 78.1 %. The use of the WEKA [<xref ref-type="bibr" rid="CR58">58</xref>
] implementation of decision trees (J48) improved balanced accuracy for Fragmentor descriptors from 75.8 to 77.6 %. A consensus model based on this decision tree and two SVM models achieved an accuracy of 79.6 ± 0.2 % for the whole set. The E-state model or the consensus model can be used to predict the fate of the molecules.</p>
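<p>The two ingredients of this protocol, balanced training bags and the balanced accuracy measure, can be sketched as follows. This is an illustration under our own assumptions rather than the OCHEM implementation: the function names are hypothetical, each bag is drawn with replacement, and the toy labels only mimic the 5.5 % fraction of decomposing compounds.</p>
<preformat><![CDATA[
```python
import random


def undersampled_bag(labels, rng):
    """Return indices of one balanced bag with equal draws from each class."""
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    n = min(len(pos), len(neg))
    # Bootstrap draw (with replacement) of n items per class: an assumption.
    return rng.choices(pos, k=n) + rng.choices(neg, k=n)


def balanced_accuracy(y_true, y_pred):
    """Mean of sensitivity and specificity for binary labels 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = len(y_true) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


labels = [1] * 55 + [0] * 945  # toy set with 5.5 % decomposing compounds
bag = undersampled_bag(labels, random.Random(0))
print(len(bag))  # twice the number of decomposing compounds
# A trivial "never decomposes" predictor is not rewarded by this measure.
print(balanced_accuracy(labels, [0] * len(labels)))
```
]]></preformat>
<p>Balanced accuracy is used here because plain accuracy would be close to 94.5 % for a model that never predicts decomposition.</p>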
</sec>
<sec id="Sec34"><title>Detection of outlying molecules</title>
<p>The automated data extraction from PATENTS resulted in a number of systematic errors in the data, which needed to be cleaned and filtered. As mentioned in the Methods section, considerable effort was devoted to cleaning up the data set during extraction from the literature. Data modeling was very useful during this step. Following data upload to OCHEM we performed modeling and reviewed outlier molecules. After finding and correcting a common pattern that was leading to errors, data extraction was repeated.</p>
<p>For example, many records in PATENTS had an MP reported as “235-2360” (i.e. the decimal point after 236 was missed). Such a record would be filtered out both because the range is implausibly large and because one of the values is implausibly high. Other “errors”, which could easily be corrected by a human, e.g. “159-62”, “160-2”, “82-82,5” (i.e. a comma instead of a dot), were addressed by introducing rules to handle these non-standard forms. In fact, the reporting of MP values as intervals, thus having two values instead of a single reported value, was beneficial for finding and eliminating errors in the data.</p>
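<p>Rules of this kind can be sketched in a few lines. The example below is not the extraction code used in this study: the function name and both plausibility limits are hypothetical, and it only covers the four patterns quoted above for non-negative values written with a plain hyphen.</p>
<preformat><![CDATA[
```python
import re

MAX_MP = 500.0    # hypothetical upper plausibility limit, deg C
MAX_RANGE = 50.0  # hypothetical limit on the width of a reported interval

_NUM = r"(\d+(?:[.,]\d+)?)"


def parse_mp_range(text):
    """Parse 'low-high' into (low, high) in deg C, or None if implausible."""
    m = re.fullmatch(rf"\s*{_NUM}\s*-\s*{_NUM}\s*", text)
    if m is None:
        return None
    # Accept a decimal comma, as in '82-82,5'.
    low_s, high_s = (s.replace(",", ".") for s in m.groups())
    low, high = float(low_s), float(high_s)
    if high < low:
        # Abbreviated upper bound: '159-62' means 159-162, '160-2' means 160-162.
        int_low = low_s.split(".")[0]
        int_high = high_s.split(".")[0]
        if len(int_high) < len(int_low):
            prefix = int_low[: len(int_low) - len(int_high)]
            high = float(prefix + high_s)
    if high < low or high > MAX_MP or high - low > MAX_RANGE:
        return None  # e.g. '235-2360' with a lost decimal point
    return low, high


for raw in ("159-62", "160-2", "82-82,5", "235-2360"):
    print(raw, parse_mp_range(raw))
```
]]></preformat>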
<p>Some of the problems with collected values were difficult to recognize and eliminate. They could originate from rare types of errors and/or simply be misprints. For example, one of the obviously erroneous reported values was “Mp. −383 °C”, in which the minus sign was a misprint. Another case included missing or incorrect decimal points in the MP values, e.g. “Mp. 236” or “Mp. 23.6” instead of “Mp. 23.6” and “Mp. 236”, respectively, which contributed noise to the MP values in the high or low temperature region.</p>
<p>Table <xref rid="Tab3" ref-type="table">3</xref>
 indicates the performance of consensus models and individual sub-models calculated for different numbers of excluded outlying molecules, as described in the Methods section. The RMSEs for the PATENTS set were reported for the whole set, i.e. also including the outlying molecules that were excluded for different p-values. This was done to simplify the comparison of results. Thus, the results for the PATENTS set combined the predictions for molecules from the validation sets with the predictions for the outlying molecules made using the final models developed with the respective training set. The reported results and tendencies did not change if we used the PATENTS sets with molecules excluded for different p-values.</p>
<p>The filtering of outlying compounds for <italic>p</italic>
 in the range of 0.001–0.01 improved prediction accuracies of individual models for both the PATENTS and the COMBINED sets (Table <xref rid="Tab5" ref-type="table">5</xref>
). The RMSE of most individual models decreased by about 0.1–1 °C for both sets. The degree of improvement depended on the descriptors used. Thus, the exclusion of outlying molecules, which distorted the training procedures, yielded models with higher prediction accuracies.</p>
<p>The improvement in model performance for the whole COMBINED set was larger than that calculated for the drug-like subsets. The MPs of the excluded outlying molecules filtered using, e.g., p = 0.001 had a bimodal distribution with peaks at 60 and 280 °C, i.e. in regions outside or on the border of the drug-like region. Thus, increasing the data quality contributed to the higher prediction accuracy of models for these regions of chemical space.</p>
<p>It is interesting that similar to our previous study [<xref ref-type="bibr" rid="CR11">11</xref>
] the removal of outlying compounds practically did not affect the performance of the consensus models for the drug-like subsets. Thus, a combination of individual models cancelled the biases of individual models introduced by noise in the experimental data. This result confirms that consensus averaging is a powerful method to increase the accuracy of individual models.</p>
<p>The consensus model provided an improvement, ΔRMSE = 1 °C, for the prediction of molecules outside of the drug-like space for the COMBINED set thus confirming the aforementioned conclusions about the influence of the outlier filtering on the data quality for molecules with this range of MP values.</p>
<p>The number of outlying molecules identified for <italic>p</italic>
 = 0.1 (N = 21,928) was less than expected for this <italic>p</italic>
 value, N = 22,208 for a Gaussian distribution. Thus, the majority of outliers identified for this threshold could appear simply due to the statistical properties of the data, and their removal can lead to deterioration of the model quality. This can be seen from the fact that the RMSEs calculated for the “drug-like” and the whole COMBINED set start to increase for this p-value.</p>
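<p>The number of outliers expected by chance can be reproduced with a short calculation. In the sketch below the function names are ours, the test is assumed to be two-sided, and the total number of molecules is not taken from the text but is the value implied by N = 22,208 expected at p = 0.1.</p>
<preformat><![CDATA[
```python
from statistics import NormalDist


def expected_outliers(n_total: int, p: float) -> float:
    """Expected number of points flagged at level p for purely Gaussian errors."""
    return n_total * p


def z_threshold(p: float) -> float:
    """Absolute z-score exceeded with two-sided probability p (assumed test)."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


n_total = 222_080  # assumption: implied by 22,208 expected outliers at p = 0.1
for p in (0.001, 0.01, 0.1):
    print(p, round(expected_outliers(n_total, p)), round(z_threshold(p), 2))
```
]]></preformat>
<p>Under the same assumption, roughly 222 and 2221 outliers would be expected by chance for p = 0.001 and p = 0.01, far fewer than the 1414 and 4727 molecules actually filtered (Table 5), whereas for p = 0.1 the observed number falls below the expected one.</p>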
<p>The CV RMSE for the PATENTS set, 36.3 °C, was in good agreement with the estimated experimental accuracy of σ = 35 °C. Moreover, for the drug-like region the estimated accuracy, σ = 33.3 °C, and the calculated error, CV RMSE = 33 °C, were also very similar. Thus, the developed consensus model achieved the experimental accuracy of the MP data (Fig. <xref rid="Fig6" ref-type="fig">6</xref>
). <fig id="Fig5"><label>Fig. 5</label>
<caption><p>The experimental accuracy of the data as a function of the MP temperature. Each point averaged at least 50 measurements. The graph was built using N = 18,058 differences in the MP temperatures and was rescaled to match the average experimental accuracy of σ = 35 °C. Compounds with MP &lt;0 °C, most of which were data processing errors, were excluded</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig5_HTML" id="MO7"></graphic>
</fig>
</p>
</sec>
<sec id="Sec35"><title>Analysis of the final model</title>
<p>The final models were developed by pooling data from all five datasets analyzed in this study. The outlying molecules were filtered using p = 0.01. The final consensus model was compared (Table <xref rid="Tab6" ref-type="table">6</xref>
) with the model developed using the COMBINED set in our previous publication [<xref ref-type="bibr" rid="CR29">29</xref>
].</p>
<p>The combination of the PATENTS and COMBINED sets decreased the RMSE by 0.6 °C for the drug-like subset of the COMBINED set and also decreased it for the four individual subsets from the previous study. Thus, enlargement of the training set increased the prediction power of the models according to the CV protocol. The RMSE calculated for the Bergström set is the lowest published value for this set and is about 30 % smaller than the 44.6 °C reported in the original study of Bergström et al. [<xref ref-type="bibr" rid="CR17">17</xref>
].</p>
<p>Figure <xref rid="Fig6" ref-type="fig">6</xref>
 shows that the CV RMSE of the individual subsets increases at high temperatures (MP >250 °C) for all sets. This decrease in the accuracy of predictions for this region is qualitatively similar for all five analyzed datasets and is in agreement with the decrease of the experimental accuracy of MP data as estimated for the PATENTS set. Thus, the accuracy of prediction of MP for the high temperature region was limited by the accuracy of experimental data.<fig id="Fig6"><label>Fig. 6</label>
<caption><p>The CV RMSE for different subsets of the final model as a function of MP. Each point on the plot is an average of at least N = 100 predictions with the exception of the Bergström set (N = 20)</p>
</caption>
<graphic xlink:href="13321_2016_113_Fig6_HTML" id="MO8"></graphic>
</fig>
</p>
<p>The experimental accuracy of data was also the limiting factor for the prediction accuracy of the model for PATENTS set for MP &lt;50 °C. The predictions of MP for the Bradley dataset were of higher accuracy in this region. This result can be explained by the different quality of data measurements for this data set. Indeed, Dr. Bradley collected for this dataset only measurements that had highly reproducible published MP values: the values were only kept if there were multiple measurements and the range of values was between 0.01 and 5 °C inclusive. It is also interesting that only a few compounds in this set had MP values of >250 °C, thus indicating the difficulties of identifying reproducible measurements for high MP values.</p>
<p>The developed consensus model estimates both the applicability domain [<xref ref-type="bibr" rid="CR59">59</xref>
] and the accuracy of the prediction for new compounds based on the CONSENSUS-STD distance to model [<xref ref-type="bibr" rid="CR44">44</xref>
, <xref ref-type="bibr" rid="CR59">59</xref>
]. This distance to model corresponds to the disagreement (standard deviation) of the individual predictions of models in the consensus model [<xref ref-type="bibr" rid="CR44">44</xref>
]. It was found to be the most reliable approach to estimate the accuracy of predictions in several benchmarking studies [<xref ref-type="bibr" rid="CR44">44</xref>
, <xref ref-type="bibr" rid="CR60">60</xref>
, <xref ref-type="bibr" rid="CR61">61</xref>
].</p>
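<p>The CONSENSUS-STD distance to model described above can be illustrated with a minimal sketch. The function name and the five predicted values are hypothetical, and the use of the population (rather than sample) standard deviation is our assumption.</p>
<preformat><![CDATA[
```python
from statistics import mean, pstdev


def consensus_prediction(predictions):
    """Return (consensus value, CONSENSUS-STD) for one compound.

    predictions: MP values in deg C given by the individual models.
    """
    return mean(predictions), pstdev(predictions)


# Hypothetical predictions of five individual models for one compound.
mp, std = consensus_prediction([148.0, 155.0, 151.0, 160.0, 146.0])
print(f"consensus MP = {mp:.1f} deg C, CONSENSUS-STD = {std:.1f} deg C")
```
]]></preformat>
<p>A larger disagreement between the individual models indicates a compound further from the applicability domain and, on average, a larger expected prediction error.</p>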
</sec>
<sec id="Sec36"><title>Analysis of models based on few descriptors</title>
<p>The MP is used as a parameter for several models, e.g. solubility assessment [<xref ref-type="bibr" rid="CR19">19</xref>
] or as a parameter of multiple solvent models to simulate the accumulation and degradation of chemicals in different solvents, based on a number of explicit mathematical models for the transfer and degradation of molecules [<xref ref-type="bibr" rid="CR62">62</xref>
, <xref ref-type="bibr" rid="CR63">63</xref>
]. In the absence of the MP values a default value is frequently used, e.g. Syngenta [<xref ref-type="bibr" rid="CR64">64</xref>
] uses MP = 125 °C for their solubility model. In the PATENTS dataset the average MP value was 155 °C, which can probably be used as a better estimate of MP for drug-like compounds. The use of this value as a model prediction for all compounds gave an RMSE = 65.7 °C, which can be used as the null hypothesis for MP prediction. The MW and the number of carbon atoms (nC) had significant linear Pearson correlation coefficients, R = 0.172 and R = 0.136 respectively, relative to MP. The MLRA model developed with both of these descriptors, MP = 117 + 0.142 × MW − 0.79 × nC, achieved an RMSE = 64.7 °C. This model, however, can hardly be considered an improvement over the null hypothesis model for any practical application.</p>
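<p>For illustration, the constant (null hypothesis) model and the two-descriptor MLRA equation given above can be written as follows. The function names and the example compound (MW = 300, 20 carbon atoms) are hypothetical; the coefficients are those reported in the text.</p>
<preformat><![CDATA[
```python
from math import sqrt


def mp_null() -> float:
    """Null model: the average MP of the PATENTS set, in deg C."""
    return 155.0


def mp_mlra(mw: float, n_c: int) -> float:
    """Two-descriptor MLRA model: MP = 117 + 0.142 * MW - 0.79 * nC."""
    return 117.0 + 0.142 * mw - 0.79 * n_c


def rmse(observed, predicted):
    """Root mean squared error between two equally long sequences."""
    return sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


# Hypothetical compound with MW = 300 and 20 carbon atoms.
print(mp_null(), round(mp_mlra(300.0, 20), 1))
```
]]></preformat>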
<p>For another analysis we calculated the Pearson coefficient of correlation between MP and descriptors. The highest negative R = −0.484 and positive correlation R = 0.481 with MP were obtained for the of0ug and ef0ug 3D descriptors [<xref ref-type="bibr" rid="CR65">65</xref>
, <xref ref-type="bibr" rid="CR66">66</xref>
] calculated using the Mera program [<xref ref-type="bibr" rid="CR37">37</xref>
]. The first descriptor corresponds to the number of electrons participating in the orbital overlap of the carbon atoms. The second one is its complement, which indicates the number of free electrons for the carbon atoms, which do not participate in the overlap. Thus, both of them measure the degree of hybridization of the molecules. The use of the single descriptor of0ug in the model MP = 513 − 1260 × of0ug produced an RMSE of 57 °C. It can be proposed as a single descriptor model for the estimation of MP of compounds.</p>
<p>The nAtomP descriptor, which counts the number of atoms in the largest π-chain, was the most highly correlated descriptor, R = 0.371, provided by the CDK package [<xref ref-type="bibr" rid="CR34">34</xref>
]. The second descriptor of the same package, f<sub>MF</sub>
, R = 0.357, characterizes the complexity of the molecules. This descriptor is calculated as the fraction of the size of the molecular Bemis and Murcko framework [<xref ref-type="bibr" rid="CR67">67</xref>
] versus the size of the whole molecule and was introduced to predict the promiscuity of chemical compounds [<xref ref-type="bibr" rid="CR68">68</xref>
]. This descriptor is defined in a [0, 1] interval and it is equal to 1 if a molecule does not have side chains.</p>
<p>The two best descriptors calculated by Adriana.CODE [<xref ref-type="bibr" rid="CR28">28</xref>
] 2DACorr_PiEN_3 and 2DACorr_PiEN_4 are 2D π electronegativity-weighted autocorrelation descriptors calculated for topological distances 3 and 4 [<xref ref-type="bibr" rid="CR69">69</xref>
]. Both of these descriptors had R = 0.359.</p>
<p>The number of rings and resonance counts (number of resonance structures of a molecule) were also two highly correlated descriptors (R = 0.355 and R = 0.354) calculated using ChemAxon. Unsaturation and saturation indexes Ui (R = 0.349) and Uc (R = 0.325) were the two most highly correlated molecular property descriptors calculated by the Dragon software. The MP also correlated with simpler descriptors, such as the number of nitrogen atoms (R = 0.322).</p>
<p>The analysis of the most correlated descriptors indicates that many of them are strongly related to the π-system of electrons and thus to the ability of molecules to interact through π-interactions. For example, the presence of side chains decreases f<sub>MF</sub>
 and thus the ability to perform such interactions and the formation of crystal structures. Possibly, the same effect contributes to formation of agglomerates in solution thus leading to the promiscuity of chemical substances observed by Yang et al. [<xref ref-type="bibr" rid="CR68">68</xref>
]. The same change decreases the number of rings, the number of atoms in the largest π-chain (relative to the overall size of the molecule), and other electronic parameters of the molecule.</p>
<p>However, the aforementioned effect is not the only one contributing to the MP of compounds. Indeed, we built linear MLRA models using the best 100 and the best ten descriptors. The RMSEs of these models, 48.1 ± 0.1 °C (100 descriptors) and 53.6 ± 0.1 °C (ten descriptors), were more than 10 °C higher than those calculated using SVM methods. Thus, while analysis of the individual descriptors is important for understanding the major effects influencing the property, their non-linear interactions, as captured by the machine learning methods, are important for deriving predictive models.</p>
<p>To some extent the comparison of results associated with the MLRA and SVM methods, and the conclusion about the advantage of the latter approach, could be biased by the different numbers of descriptors used by the two approaches. In order to evaluate this better, we developed SVM models using the descriptors selected with MLRA for the PATENTS set for the five descriptor sets contributing to the consensus model. The RMSEs of SVM models developed using exactly the same descriptors as those used in MLRA models were on average 7 ± 1 °C lower than the RMSEs of the MLRA models. Thus, the difference in the prediction performances of the SVM and MLRA models was mainly due to the ability of the SVM approach to better handle the non-linearity of data.</p>
</sec>
<sec id="Sec37"><title>Models and data availability</title>
<p>The final models based on the E-state descriptors (the best single individual set of descriptors) and consensus models for decomposition and MPs are publicly available on the OCHEM web site. The patent-mined data from this study are publicly downloadable from the same web site as well as available from FigShare [<xref ref-type="bibr" rid="CR70">70</xref>
] under a CC-BY license [<xref ref-type="bibr" rid="CR71">71</xref>
]. Users of the data are, however, strongly encouraged to cite this article as well as the data utilized, since this work describes the details of the extraction process and data cleaning.</p>
</sec>
</sec>
<sec id="Sec38"><title>Discussion and conclusion</title>
<p>We have collected from the literature a large number of MPs and decomposition points of compounds. A number of technical challenges were solved to curate the data and transform the information from the text to computer readable formats. Many of these challenges were related to the ambiguous representation of information within chemical PATENTS.</p>
<p>As the abstraction used for text mining only requires a list of headings and paragraphs, application of the same methodology to other structured text such as journal articles and other properties would be a straightforward extension.</p>
<p>We have shown that models based on the data collected from PATENTS provided similar or higher prediction ability compared to the results from our previous study. This indicates the high quality of patent-mined data, which is similar to that of manually curated data from the literature.</p>
<p>About 5.5 % of the compounds in the PATENTS data decompose during melting. The separation of the data into subsets of compounds that decompose and do not decompose during melting increased the accuracy of the individual models for both properties. The use of the SetCompare tool allowed the identification of chemical features that are important for the pyrolysis of chemical compounds. Moreover, a classification model, which can predict whether a compound will decompose during MP measurement, was also developed.</p>
<p>In our previous study [<xref ref-type="bibr" rid="CR11">11</xref>
] we suggested that the 691 outlying molecules could be enriched with decomposing structures. The classification model predicted 28 % of these molecules as decomposing while only 21 % were predicted for the rest of the COMBINED set. Thus, indeed, the outlying structures contained a significantly higher percentage of decomposing compounds. The model also predicted 22 and 14 % decomposing compounds for the set of 4.7k outliers (identified with <italic>p</italic>
 = 0.01, see Table <xref rid="Tab5" ref-type="table">5</xref>
) and the remaining compounds of the PATENTS set, respectively. The outlying compounds were therefore again enriched with decomposing compounds. This result suggests that the PATENTS dataset may still contain decomposing compounds that were not annotated in the PATENTS literature. The presence of such unannotated decomposing compounds among the molecules labeled as non-decomposing in the training set of the pyrolysis classification model could decrease its accuracy. The difference between the average percentages of compounds predicted as decomposing for the COMBINED and PATENTS sets was about 6 %, that is, about the percentage of decomposing compounds annotated in the PATENTS literature. Thus, the COMBINED set has about the same percentage of decomposing compounds as the PATENTS set. The decomposing compounds were excluded from the development of the MP model, but for better comparison with the previous model the analysis in Table <xref rid="Tab3" ref-type="table">3</xref>
 was performed without separation of both classes.<table-wrap id="Tab5"><label>Table 5</label>
<caption><p>RMSE of models developed with filtering of outliers</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left"></th>
<th align="left">No filtering</th>
<th align="left">0.001 (N = 1414)<sup>a</sup>
</th>
<th align="left">0.01 (N = 4727)</th>
<th align="left">0.1 (N = 21,928)</th>
</tr>
</thead>
<tbody><tr><td align="left" colspan="5">PATENTS</td>
</tr>
<tr><td align="left"> CDK</td>
<td char="(" align="char">38.9 (36.2)</td>
<td char="(" align="char">38.9 (36.1)</td>
<td char="(" align="char">38.8 (36.1)</td>
<td char="(" align="char">38.9 (36.1)</td>
</tr>
<tr><td align="left"> Isida Fragmentor</td>
<td char="(" align="char">38.5 (35.5)</td>
<td char="(" align="char">38.4 (35.4)</td>
<td char="(" align="char">38.3 (35.2)</td>
<td char="(" align="char">38.2 (35.2)</td>
</tr>
<tr><td align="left"> ChemAxon</td>
<td char="(" align="char">40.1 (37.1)</td>
<td char="(" align="char">40 (37.1)</td>
<td char="(" align="char">40.1 (37.1)</td>
<td char="(" align="char">40.1 (37.2)</td>
</tr>
<tr><td align="left"> QNPR</td>
<td char="(" align="char">39.7 (36.6)</td>
<td char="(" align="char">39.7 (36.3)</td>
<td char="(" align="char">39.4 (36)</td>
<td char="(" align="char">39.2 (35.9)</td>
</tr>
<tr><td align="left"> E-state</td>
<td char="(" align="char">38.3 (35.6)</td>
<td char="(" align="char">38.1 (35.6)</td>
<td char="(" align="char">38.1 (35.5)</td>
<td char="(" align="char">38.0 (35.5)</td>
</tr>
<tr><td align="left"> Consensus</td>
<td char="(" align="char">36.3 (33.3)</td>
<td char="(" align="char">36.2 (33.3)</td>
<td char="(" align="char">36.3 (33.2)</td>
<td char="(" align="char">36.4 (33.5)</td>
</tr>
<tr><td align="left" colspan="5">COMBINED</td>
</tr>
<tr><td align="left"> CDK</td>
<td char="(" align="char">51.6 (35.6)</td>
<td char="(" align="char">51.3 (35.5)</td>
<td char="(" align="char">50.8 (35.5)</td>
<td char="(" align="char">49.9 (35.4)</td>
</tr>
<tr><td align="left"> Isida Fragmentor</td>
<td char="(" align="char">47.6 (35.9)</td>
<td char="(" align="char">47.5 (35.6)</td>
<td char="(" align="char">47.2 (35.3)</td>
<td char="(" align="char">47.3 (35.4)</td>
</tr>
<tr><td align="left"> ChemAxon</td>
<td char="(" align="char">49.7 (36.5)</td>
<td char="(" align="char">49.6 (36.5)</td>
<td char="(" align="char">49.5 (36.4)</td>
<td char="(" align="char">49 (36.5)</td>
</tr>
<tr><td align="left"> QNPR</td>
<td char="(" align="char">50.2 (38.1)</td>
<td char="(" align="char">50.5 (37.8)</td>
<td char="(" align="char">49.9 (37.7)</td>
<td char="(" align="char">49.5 (37.6)</td>
</tr>
<tr><td align="left"> E-state</td>
<td char="(" align="char">45.9 (35.4)</td>
<td char="(" align="char">46.1 (35.4)</td>
<td char="(" align="char">45.8 (35.3)</td>
<td char="(" align="char">45.8 (35.2)</td>
</tr>
<tr><td align="left"> Consensus</td>
<td char="(" align="char">46.5 (33.4)</td>
<td char="(" align="char">46.3 (33.4)</td>
<td char="(" align="char">46.2 (33.3)</td>
<td char="(" align="char">46.1 (33.4)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
The numbers in parentheses indicate the number of molecules detected as outliers and filtered from the PATENTS set. The RMSE values for the PATENTS set are calculated for all molecules in this set (including the outliers)</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Using the repeated measurements in PATENTS we estimated the experimental error of MP measurements as σ = 35 °C for the PATENTS set. We showed that the estimated accuracy varied as a function of temperature and achieved the lowest error of σ = 32 °C for the drug-like region of the dataset. Problems such as difficulties with experimental measurements at high temperatures, errors in reporting these values (e.g. the use of thresholds or typing errors), as well as polymorphism and the purity of analyzed chemical compounds likely contribute to the measurement error. The final consensus model achieved a precision similar to the estimated experimental accuracy. Thus, contrary to previous studies, which indicated that the accuracy of models for physicochemical properties is limited by insufficient descriptors [<xref ref-type="bibr" rid="CR72">72</xref>
, <xref ref-type="bibr" rid="CR73">73</xref>
], we can conclude that our results were rather limited by the experimental data accuracy.</p>
<p>A comparison of MLRA and SVM results developed using exactly the same sets of descriptors indicated significantly higher accuracy of the SVM models. This result suggests high non-linearity and interactions of descriptors, which are better modeled by the SVM method.</p>
<p>Because of limited computational resources, the grid search to select SVM parameters was done using only one set of descriptors, EFG, which contained the smallest number of non-zero values. Even these calculations required about 15,000 core-hours. It is possible that selection of SVM parameters for each set could yield better models. Considering that the grid search does not always provide the optimal set of parameters [<xref ref-type="bibr" rid="CR74">74</xref>
], more sophisticated algorithms based on evolutionary programming could be used, thus yielding even more accurate models. Such a study, however, is beyond the scope of this article.</p>
<p>The final consensus model developed in this study provided the best published prediction accuracy for the Bergström subset, RMSE = 31 °C, which is a 3 °C improvement over the result from our previous study (see Table <xref rid="Tab6" ref-type="table">6</xref>
) [<xref ref-type="bibr" rid="CR11">11</xref>
] and this corresponds to an almost 15 °C improvement of results from the original study [<xref ref-type="bibr" rid="CR17">17</xref>
] and other earlier studies using this set [<xref ref-type="bibr" rid="CR15">15</xref>
, <xref ref-type="bibr" rid="CR73">73</xref>
].<table-wrap id="Tab6"><label>Table 6</label>
<caption><p>RMSE of the final consensus models developed in this and in the previous study [<xref ref-type="bibr" rid="CR29">29</xref>
]</p>
</caption>
<table frame="hsides" rules="groups"><thead><tr><th align="left">Method</th>
<th align="left">PATENTS set</th>
<th align="left">Bergström</th>
<th align="left">Bradley</th>
<th align="left">OCHEM</th>
<th align="left">Enamine</th>
<th align="left">COMBINED</th>
</tr>
</thead>
<tbody><tr><td align="left">PATENTS + COMBINED</td>
<td char="(" align="char">36.5 ± 0.1 (33.7)</td>
<td char="(" align="char">31 ± 1 (29)</td>
<td char="(" align="char">32.2 ± 0.6 (32.2)</td>
<td char="(" align="char">37.9 ± 0.3 (33)</td>
<td char="(" align="char">36.3 ± 0.3 (31.1)</td>
<td char="(" align="char">36.8 ± 0.3 (32)</td>
</tr>
<tr><td align="left">COMBINED</td>
<td char="(" align="char">44.6 ± 0.1 (40.9)<sup>a</sup>
</td>
<td char="(" align="char">34 ± 1 (31)</td>
<td char="(" align="char">32.6 ± 0.6 (33.1)</td>
<td char="(" align="char">38 ± 0.3 (33.7)</td>
<td char="(" align="char">36.8 ± 0.3 (31.5)</td>
<td char="(" align="char">37.1 ± 0.3 (32.6)</td>
</tr>
</tbody>
</table>
<table-wrap-foot><p><sup>a</sup>
The results of the prediction of the PATENTS set using the model developed in our previous study [<xref ref-type="bibr" rid="CR29">29</xref>
]</p>
</table-wrap-foot>
</table-wrap>
</p>
<p>Further progress in the prediction of MPs can be advanced by improvement in the accuracy of experimental measurements, as well as by prediction of MP for different polymorphic and amorphous forms. This work, however, is unlikely to happen in the near future since it will require rather different approaches to the collection and handling of experimental MP data.</p>
<p>The prediction of MP itself has limited practical value. The main interest in this property is because of its possible use for the estimation of the solubility of chemical compounds using the general solubility equation (GSE) [<xref ref-type="bibr" rid="CR19">19</xref>
]<disp-formula id="Equ3"><label>3</label>
<alternatives><tex-math id="M9">\documentclass[12pt]{minimal}
				\usepackage{amsmath}
				\usepackage{wasysym} 
				\usepackage{amsfonts} 
				\usepackage{amssymb} 
				\usepackage{amsbsy}
				\usepackage{mathrsfs}
				\usepackage{upgreek}
				\setlength{\oddsidemargin}{-69pt}
				\begin{document}$$\log\,{\text{S}} = 0.5 - 0.01({\text{MP-25}}) - \log{\text{P}}$$\end{document}</tex-math>
<mml:math id="M10" display="block"><mml:mrow><mml:mo>log</mml:mo>
<mml:mspace width="0.166667em"></mml:mspace>
<mml:mtext>S</mml:mtext>
<mml:mo>=</mml:mo>
<mml:mn>0.5</mml:mn>
<mml:mo>-</mml:mo>
<mml:mn>0.01</mml:mn>
<mml:mo stretchy="false">(</mml:mo>
<mml:mtext>MP-25</mml:mtext>
<mml:mo stretchy="false">)</mml:mo>
<mml:mo>-</mml:mo>
<mml:mo>log</mml:mo>
<mml:mtext>P</mml:mtext>
</mml:mrow>
</mml:math>
<graphic xlink:href="13321_2016_113_Article_Equ3.gif" position="anchor"></graphic>
</alternatives>
</disp-formula>
where logS is the logarithm of the intrinsic molar solubility, MP is the melting point in °C and logP is the logarithm of the octanol/water partition coefficient. According to this equation, an MP prediction with an RMSE of 30 °C contributes 0.01 × 30 = 0.3 log units to the error of the solubility prediction. An estimate of logS with an error below 0.5 log units is on the level of the experimental measurement accuracy [<xref ref-type="bibr" rid="CR75">75</xref>
] and thus is very valuable for the pharma industry. Unfortunately, as indicated by the recent benchmarking study of 18 approaches contributed by leading academic groups and chemical software providers [<xref ref-type="bibr" rid="CR76">76</xref>
], the estimation of logP is more challenging and can contribute an error of about one log unit. This can limit the application of the GSE to new chemicals. However, if the applicability domain [<xref ref-type="bibr" rid="CR59">59</xref>
] of models is carefully addressed and extended with new measurements, the error of logP predictions could be as low as 0.35 log units for about 60 % of the 96k analyzed compounds [<xref ref-type="bibr" rid="CR77">77</xref>
]. Such an approach could enable widespread use of the GSE to estimate the solubility of chemical compounds.</p>
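<p>The arithmetic behind these error estimates can be sketched in a few lines of Python. This is a minimal illustration, not the implementation used in this study: the function names are hypothetical, the example compound is invented, and the combination of the MP and logP errors in quadrature assumes that the two errors are independent, which the text above does not state.</p>
<preformat preformat-type="code">
```python
import math


def gse_log_s(mp_celsius: float, log_p: float) -> float:
    """General solubility equation: logS = 0.5 - 0.01 * (MP - 25) - logP."""
    return 0.5 - 0.01 * (mp_celsius - 25.0) - log_p


def mp_error_contribution(mp_rmse_celsius: float) -> float:
    """Error in logS caused by an MP error, from the 0.01 slope of the GSE."""
    return 0.01 * mp_rmse_celsius


def combined_error(mp_rmse_celsius: float, log_p_rmse: float) -> float:
    """Assumption: MP and logP errors are independent and add in quadrature."""
    return math.hypot(mp_error_contribution(mp_rmse_celsius), log_p_rmse)


if __name__ == "__main__":
    # Hypothetical compound: MP = 125 degrees C, logP = 2.0
    print(round(gse_log_s(125.0, 2.0), 2))
    # Contribution of a 30 degrees C MP error to the logS error
    print(round(mp_error_contribution(30.0), 2))
    # Combined error for logP errors of 0.35 and 1.0 log units
    print(round(combined_error(30.0, 0.35), 2))
    print(round(combined_error(30.0, 1.0), 2))
```
</preformat>
<p>Under the independence assumption, an MP error of 30 °C combined with a logP error of 0.35 log units gives roughly 0.46 log units, whereas a logP error of about one log unit dominates the total error.</p>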
<p>As an illustrative example, we applied the GSE to predict logS for N = 1311 molecules from our previous study [<xref ref-type="bibr" rid="CR78">78</xref>
]. The logP values were obtained using the ALOGPS 2.1 program [<xref ref-type="bibr" rid="CR79">79</xref>
], which is also available as part of the OCHEM descriptors. Equation (<xref rid="Equ3" ref-type="">3</xref>
) gave a calculated RMSE of 0.84 ± 0.02. The same accuracy was obtained regardless of whether the consensus model or the model based on the E-state descriptors was used to predict MP. Although this error was higher than the RMSE of 0.62 calculated for the data in the original study, the results obtained in this study did not use any information about the target property. The GSE water solubility model, which is based on E-state indices and thus requires fewer computational resources, was made publicly available on the OCHEM web site.</p>
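<p>An RMSE with an uncertainty estimate of this kind can be computed as sketched below. This is only an illustration: the function names and the toy values are hypothetical, and the bootstrap procedure is an assumption about how such an uncertainty may be estimated, not a description of the OCHEM implementation.</p>
<preformat preformat-type="code">
```python
import math
import random


def rmse(predicted, observed):
    """Root mean squared error between two equally long sequences."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    total = sum((p - o) ** 2 for p, o in zip(predicted, observed))
    return math.sqrt(total / len(predicted))


def bootstrap_rmse_std(predicted, observed, n_rounds=1000, seed=0):
    """Standard deviation of the RMSE over bootstrap resamples (assumed procedure)."""
    rng = random.Random(seed)
    n = len(predicted)
    values = []
    for _ in range(n_rounds):
        # Resample compound indices with replacement
        idx = [rng.randrange(n) for _ in range(n)]
        values.append(rmse([predicted[i] for i in idx],
                           [observed[i] for i in idx]))
    mean = sum(values) / n_rounds
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n_rounds - 1))


if __name__ == "__main__":
    # Hypothetical logS values, for illustration only
    gse_predictions = [-2.5, -1.0, -4.2, -3.1]
    experimental = [-2.0, -1.4, -3.5, -3.3]
    print(rmse(gse_predictions, experimental))
    print(bootstrap_rmse_std(gse_predictions, experimental))
```
</preformat>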
<p>Developing computational models from the increasing volume of publicly available data mined from the published literature, and making these models publicly available, is important for building better QSAR/QSPR models and for their wider acceptance by academia, industry and chemical authorities [<xref ref-type="bibr" rid="CR80">80</xref>
].</p>
</sec>
</body>
<back><app-group><app id="App1"><sec id="Sec39"><title>Additional files</title>
<p><media position="anchor" xlink:href="13321_2016_113_MOESM1_ESM.pdf" id="MOESM1"><caption><p>10.1186/s13221-016-0113-2 Protocols used to develop the melting point consensus model.</p>
</caption>
</media>
<media position="anchor" xlink:href="13321_2016_113_MOESM2_ESM.docx" id="MOESM2"><caption><p>10.1186/s13221-016-0113-2 RMSE of LibSVM models calculated with different sets of descriptors.</p>
</caption>
</media>
</p>
</sec>
</app>
</app-group>
<glossary><title>Abbreviations</title>
<def-list><def-item><term>ADMET</term>
<def><p>absorption, distribution, metabolism, excretion and toxicity</p>
</def>
</def-item>
<def-item><term>ASNN</term>
<def><p>associative neural network</p>
</def>
</def-item>
<def-item><term>CV</term>
<def><p>cross-validation</p>
</def>
</def-item>
<def-item><term>logP</term>
<def><p>octanol/water partition coefficient</p>
</def>
</def-item>
<def-item><term>logS</term>
<def><p>water solubility</p>
</def>
</def-item>
<def-item><term>MLRA</term>
<def><p>multiple linear regression analysis</p>
</def>
</def-item>
<def-item><term>MP</term>
<def><p>melting point</p>
</def>
</def-item>
<def-item><term>OCHEM</term>
<def><p>on-line chemical database and modeling environment, <ext-link ext-link-type="uri" xlink:href="http://ochem.eu">http://ochem.eu</ext-link>
</p>
</def>
</def-item>
<def-item><term>QSAR</term>
<def><p>quantitative structure activity relationship</p>
</def>
</def-item>
<def-item><term>QSPR</term>
<def><p>quantitative structure property relationship</p>
</def>
</def-item>
<def-item><term>RMSE</term>
<def><p>root mean squared error</p>
</def>
</def-item>
<def-item><term>SGML</term>
<def><p>standard generalized markup language</p>
</def>
</def-item>
<def-item><term>SVM</term>
<def><p>support vector machine</p>
</def>
</def-item>
<def-item><term>USPTO</term>
<def><p>United States Patent and Trademark Office</p>
</def>
</def-item>
<def-item><term>XML</term>
<def><p>extensible markup language</p>
</def>
</def-item>
</def-list>
</glossary>
<ack><title>Authors’ contributions</title>
<p>AJW initiated the study; DL extracted and curated the data; IVT performed the modeling and statistical analysis. All authors read and approved the final manuscript.</p>
<sec id="FPar1"><title>Acknowledgements</title>
<p>We thank ChemAxon (<ext-link ext-link-type="uri" xlink:href="http://www.chemaxon.com">http://www.chemaxon.com</ext-link>
), Molecular Networks GmbH (<ext-link ext-link-type="uri" xlink:href="http://www.molecular-networks.com">http://www.molecular-networks.com</ext-link>
), Talete Srl (<ext-link ext-link-type="uri" xlink:href="http://www.talete.mi.it">http://www.talete.mi.it</ext-link>
) and ChemoSophia (<ext-link ext-link-type="uri" xlink:href="http://www.chemosophia.com">http://www.chemosophia.com</ext-link>
) for contributing their software tools used in this study.</p>
</sec>
<sec id="FPar2"><title>Competing interests</title>
<p>DL is an employee of NextMove Software Ltd., which develops and licenses the text-mining software tools used in this study. IVT is CEO of BigChem GmbH, which licenses the OCHEM software.</p>
</sec>
</ack>
<ref-list id="Bib1"><title>References</title>
<ref id="CR1"><label>1.</label>
<mixed-citation publication-type="other">Tetko IV (2007) Prediction of physicochemical properties. In: Ekins S (ed) Computational toxicology: risk assessment for pharmaceutical and environmental chemicals, vol 1. Wiley, Hoboken, pp 241–275</mixed-citation>
</ref>
<ref id="CR2"><label>2.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dearden</surname>
<given-names>JC</given-names>
</name>
<name><surname>Rotureau</surname>
<given-names>P</given-names>
</name>
<name><surname>Fayet</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>QSPR prediction of physico-chemical properties for REACH</article-title>
<source>SAR QSAR Environ Res</source>
<year>2013</year>
<volume>24</volume>
<fpage>279</fpage>
<lpage>318</lpage>
<pub-id pub-id-type="doi">10.1080/1062936X.2013.773372</pub-id>
<pub-id pub-id-type="pmid">23521394</pub-id>
</element-citation>
</ref>
<ref id="CR3"><label>3.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Lowe</surname>
<given-names>DM</given-names>
</name>
</person-group>
<source>Extraction of chemical structures and reactions from the literature</source>
<year>2012</year>
<publisher-loc>Cambridge</publisher-loc>
<publisher-name>University of Cambridge</publisher-name>
</element-citation>
</ref>
<ref id="CR4"><label>4.</label>
<mixed-citation publication-type="other">Predicting temperature-dependent solubility for solvent selection.
<ext-link ext-link-type="uri" xlink:href="http://usefulchem.blogspot.com/2011/02/predicting-temperature-dependent.html">http://usefulchem.blogspot.com/2011/02/predicting-temperature-dependent.html</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR5"><label>5.</label>
<mixed-citation publication-type="other">Open Notebook Science Challenge. <ext-link ext-link-type="uri" xlink:href="http://onschallenge.wikispaces.com">http://onschallenge.wikispaces.com</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR6"><label>6.</label>
<mixed-citation publication-type="other">My talk at SLA on Trust in Science and Open Melting Point Collections. <ext-link ext-link-type="uri" xlink:href="http://usefulchem.blogspot.com/2011/06/my-talk-at-sla-on-trust-in-science-and.html">http://usefulchem.blogspot.com/2011/06/my-talk-at-sla-on-trust-in-science-and.html</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR7"><label>7.</label>
<mixed-citation publication-type="other">Open Melting Point Collection Book Edition 1. <ext-link ext-link-type="uri" xlink:href="http://usefulchem.blogspot.com/2011/08/open-melting-point-collection-book.html">http://usefulchem.blogspot.com/2011/08/open-melting-point-collection-book.html</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR8"><label>8.</label>
<mixed-citation publication-type="other">Melting Point Web Services. <ext-link ext-link-type="uri" xlink:href="http://onswebservices.wikispaces.com/meltingpoint">http://onswebservices.wikispaces.com/meltingpoint</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR9"><label>9.</label>
<mixed-citation publication-type="other">Open modeling of melting point data. <ext-link ext-link-type="uri" xlink:href="http://usefulchem.blogspot.com/2011/03/open-modeling-of-melting-point-data.html">http://usefulchem.blogspot.com/2011/03/open-modeling-of-melting-point-data.html</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR10"><label>10.</label>
<mixed-citation publication-type="other">Jean-Claude Bradley Open Melting Point Dataset. <ext-link ext-link-type="uri" xlink:href="http://figshare.com/articles/Jean_Claude_Bradley_Open_Melting_Point_Datset/1031637">http://figshare.com/articles/Jean_Claude_Bradley_Open_Melting_Point_Datset/1031637</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR11"><label>11.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Sushko</surname>
<given-names>Y</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Patiny</surname>
<given-names>L</given-names>
</name>
<name><surname>Kondratov</surname>
<given-names>I</given-names>
</name>
<name><surname>Petrenko</surname>
<given-names>AE</given-names>
</name>
<name><surname>Charochkina</surname>
<given-names>L</given-names>
</name>
<name><surname>Asiri</surname>
<given-names>AM</given-names>
</name>
</person-group>
<article-title>How accurately can we predict the melting points of drug-like compounds?</article-title>
<source>J Chem Inf Model</source>
<year>2014</year>
<volume>54</volume>
<fpage>3320</fpage>
<lpage>3329</lpage>
<pub-id pub-id-type="doi">10.1021/ci5005288</pub-id>
<pub-id pub-id-type="pmid">25489863</pub-id>
</element-citation>
</ref>
<ref id="CR12"><label>12.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bhhatarai</surname>
<given-names>B</given-names>
</name>
<name><surname>Teetz</surname>
<given-names>W</given-names>
</name>
<name><surname>Liu</surname>
<given-names>T</given-names>
</name>
<name><surname>Öberg</surname>
<given-names>T</given-names>
</name>
<name><surname>Jeliazkova</surname>
<given-names>N</given-names>
</name>
<name><surname>Kochev</surname>
<given-names>N</given-names>
</name>
<name><surname>Pukalov</surname>
<given-names>O</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Kovarich</surname>
<given-names>S</given-names>
</name>
<name><surname>Papa</surname>
<given-names>E</given-names>
</name>
<name><surname>Gramatica</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>CADASTER QSPR Models for predictions of melting and boiling points of perfluorinated chemicals</article-title>
<source>Mol Inform</source>
<year>2011</year>
<volume>30</volume>
<fpage>189</fpage>
<lpage>204</lpage>
<pub-id pub-id-type="doi">10.1002/minf.201000133</pub-id>
</element-citation>
</ref>
<ref id="CR13"><label>13.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chu</surname>
<given-names>KA</given-names>
</name>
<name><surname>Yalkowsky</surname>
<given-names>SH</given-names>
</name>
</person-group>
<article-title>An interesting relationship between drug absorption and melting point</article-title>
<source>Int J Pharm</source>
<year>2009</year>
<volume>373</volume>
<fpage>24</fpage>
<lpage>40</lpage>
<pub-id pub-id-type="doi">10.1016/j.ijpharm.2009.01.026</pub-id>
<pub-id pub-id-type="pmid">19429285</pub-id>
</element-citation>
</ref>
<ref id="CR14"><label>14.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
<name><surname>Kireeva</surname>
<given-names>N</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Baskin</surname>
<given-names>II</given-names>
</name>
<name><surname>Solov’ev</surname>
<given-names>VP</given-names>
</name>
</person-group>
<article-title>Exhaustive QSPR studies of a large diverse set of ionic liquids: how accurately can we predict melting points?</article-title>
<source>J Chem Inf Model</source>
<year>2007</year>
<volume>47</volume>
<fpage>1111</fpage>
<lpage>1122</lpage>
<pub-id pub-id-type="doi">10.1021/ci600493x</pub-id>
<pub-id pub-id-type="pmid">17381081</pub-id>
</element-citation>
</ref>
<ref id="CR15"><label>15.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Nigsch</surname>
<given-names>F</given-names>
</name>
<name><surname>Bender</surname>
<given-names>A</given-names>
</name>
<name><surname>van Buuren</surname>
<given-names>B</given-names>
</name>
<name><surname>Tissen</surname>
<given-names>J</given-names>
</name>
<name><surname>Nigsch</surname>
<given-names>E</given-names>
</name>
<name><surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Melting point prediction employing k-nearest neighbor algorithms and genetic parameter optimization</article-title>
<source>J Chem Inf Model</source>
<year>2006</year>
<volume>46</volume>
<fpage>2412</fpage>
<lpage>2422</lpage>
<pub-id pub-id-type="doi">10.1021/ci060149f</pub-id>
<pub-id pub-id-type="pmid">17125183</pub-id>
</element-citation>
</ref>
<ref id="CR16"><label>16.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Jain</surname>
<given-names>A</given-names>
</name>
<name><surname>Yalkowsky</surname>
<given-names>SH</given-names>
</name>
</person-group>
<article-title>Estimation of melting points of organic compounds-II</article-title>
<source>J Pharm Sci</source>
<year>2006</year>
<volume>95</volume>
<fpage>2562</fpage>
<lpage>2618</lpage>
<pub-id pub-id-type="doi">10.1002/jps.20634</pub-id>
<pub-id pub-id-type="pmid">17034051</pub-id>
</element-citation>
</ref>
<ref id="CR17"><label>17.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bergstrom</surname>
<given-names>CA</given-names>
</name>
<name><surname>Norinder</surname>
<given-names>U</given-names>
</name>
<name><surname>Luthman</surname>
<given-names>K</given-names>
</name>
<name><surname>Artursson</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>Molecular descriptors influencing melting point and their role in classification of solid drugs</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2003</year>
<volume>43</volume>
<fpage>1177</fpage>
<lpage>1185</lpage>
<pub-id pub-id-type="doi">10.1021/ci020280x</pub-id>
<pub-id pub-id-type="pmid">12870909</pub-id>
</element-citation>
</ref>
<ref id="CR18"><label>18.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Boethling</surname>
<given-names>RS</given-names>
</name>
<name><surname>Mackay</surname>
<given-names>D</given-names>
</name>
</person-group>
<source>Handbook of property estimation methods for chemicals: environmental and health sciences</source>
<year>2000</year>
<publisher-loc>Boca Raton</publisher-loc>
<publisher-name>Lewis</publisher-name>
<fpage>xxii</fpage>
</element-citation>
</ref>
<ref id="CR19"><label>19.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Ran</surname>
<given-names>Y</given-names>
</name>
<name><surname>Yalkowsky</surname>
<given-names>SH</given-names>
</name>
</person-group>
<article-title>Prediction of drug solubility by the general solubility equation (GSE)</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2001</year>
<volume>41</volume>
<fpage>354</fpage>
<lpage>357</lpage>
<pub-id pub-id-type="doi">10.1021/ci000338c</pub-id>
<pub-id pub-id-type="pmid">11277722</pub-id>
</element-citation>
</ref>
<ref id="CR20"><label>20.</label>
<mixed-citation publication-type="other">Reed Tech USPTO Data Portal. <ext-link ext-link-type="uri" xlink:href="http://patents.reedtech.com/">http://patents.reedtech.com/</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR21"><label>21.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Lowe</surname>
<given-names>DM</given-names>
</name>
<name><surname>Sayle</surname>
<given-names>RA</given-names>
</name>
</person-group>
<article-title>LeadMine: a grammar and dictionary driven approach to entity recognition</article-title>
<source>J Cheminform</source>
<year>2015</year>
<volume>7</volume>
<fpage>S5</fpage>
<pub-id pub-id-type="doi">10.1186/1758-2946-7-S1-S5</pub-id>
<pub-id pub-id-type="pmid">25810776</pub-id>
</element-citation>
</ref>
<ref id="CR22"><label>22.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hawizy</surname>
<given-names>L</given-names>
</name>
<name><surname>Jessop</surname>
<given-names>DM</given-names>
</name>
<name><surname>Adams</surname>
<given-names>N</given-names>
</name>
<name><surname>Murray-Rust</surname>
<given-names>P</given-names>
</name>
</person-group>
<article-title>ChemicalTagger: a tool for semantic text-mining in chemistry</article-title>
<source>J Cheminform</source>
<year>2011</year>
<volume>3</volume>
<fpage>17</fpage>
<pub-id pub-id-type="doi">10.1186/1758-2946-3-17</pub-id>
<pub-id pub-id-type="pmid">21575201</pub-id>
</element-citation>
</ref>
<ref id="CR23"><label>23.</label>
<mixed-citation publication-type="other">Distributed Structure-Searchable Toxicity (DSSTox) Database. <ext-link ext-link-type="uri" xlink:href="http://www.epa.gov/ncct/dsstox/MoreonSDF.html">http://www.epa.gov/ncct/dsstox/MoreonSDF.html</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR24"><label>24.</label>
<mixed-citation publication-type="other">Bradley J-C, Lang A, Williams AJ (2014) Jean-Claude Bradley double plus good (highly curated and validated) melting point dataset. <ext-link ext-link-type="uri" xlink:href="https://figshare.com/articles/Jean_Claude_Bradley_Double_Plus_Good_Highly_Curated_and_Validated_Melting_Point_Dataset/1031638">https://figshare.com/articles/Jean_Claude_Bradley_Double_Plus_Good_Highly_Curated_and_Validated_Melting_Point_Dataset/1031638</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR25"><label>25.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Vorberg</surname>
<given-names>S</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Modeling the biodegradability of chemical compounds using the online CHEmical modeling environment (OCHEM)</article-title>
<source>Mol Inform</source>
<year>2014</year>
<volume>33</volume>
<fpage>73</fpage>
<lpage>85</lpage>
<pub-id pub-id-type="doi">10.1002/minf.201300030</pub-id>
</element-citation>
</ref>
<ref id="CR26"><label>26.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Ivanov</surname>
<given-names>V</given-names>
</name>
<name><surname>Petrenko</surname>
<given-names>AE</given-names>
</name>
<name><surname>Dieden</surname>
<given-names>R</given-names>
</name>
<name><surname>Lebon</surname>
<given-names>F</given-names>
</name>
<name><surname>Mathieu</surname>
<given-names>B</given-names>
</name>
</person-group>
<article-title>Development of dimethyl sulfoxide solubility models using 163 000 molecules: using a domain applicability metric to select more reliable predictions</article-title>
<source>J Chem Inf Model</source>
<year>2013</year>
<volume>53</volume>
<fpage>1990</fpage>
<lpage>2000</lpage>
<pub-id pub-id-type="doi">10.1021/ci400213d</pub-id>
<pub-id pub-id-type="pmid">23855787</pub-id>
</element-citation>
</ref>
<ref id="CR27"><label>27.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Todeschini</surname>
<given-names>R</given-names>
</name>
<name><surname>Consonni</surname>
<given-names>V</given-names>
</name>
</person-group>
<source>Handbook of molecular descriptors</source>
<year>2000</year>
<publisher-loc>Weinheim</publisher-loc>
<publisher-name>Wiley-VCH</publisher-name>
<fpage>667</fpage>
</element-citation>
</ref>
<ref id="CR28"><label>28.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Gasteiger</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Of molecules and humans</article-title>
<source>J Med Chem</source>
<year>2006</year>
<volume>49</volume>
<fpage>6429</fpage>
<lpage>6434</lpage>
<pub-id pub-id-type="doi">10.1021/jm0608964</pub-id>
<pub-id pub-id-type="pmid">17064061</pub-id>
</element-citation>
</ref>
<ref id="CR29"><label>29.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Korner</surname>
<given-names>R</given-names>
</name>
<name><surname>Pandey</surname>
<given-names>AK</given-names>
</name>
<name><surname>Rupp</surname>
<given-names>M</given-names>
</name>
<name><surname>Teetz</surname>
<given-names>W</given-names>
</name>
<name><surname>Brandmaier</surname>
<given-names>S</given-names>
</name>
<name><surname>Abdelaziz</surname>
<given-names>A</given-names>
</name>
<name><surname>Prokopenko</surname>
<given-names>VV</given-names>
</name>
<name><surname>Tanchuk</surname>
<given-names>VY</given-names>
</name>
<name><surname>Todeschini</surname>
<given-names>R</given-names>
</name>
<name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
<name><surname>Marcou</surname>
<given-names>G</given-names>
</name>
<name><surname>Ertl</surname>
<given-names>P</given-names>
</name>
<name><surname>Potemkin</surname>
<given-names>V</given-names>
</name>
<name><surname>Grishina</surname>
<given-names>M</given-names>
</name>
<name><surname>Gasteiger</surname>
<given-names>J</given-names>
</name>
<name><surname>Schwab</surname>
<given-names>C</given-names>
</name>
<name><surname>Baskin</surname>
<given-names>II</given-names>
</name>
<name><surname>Palyulin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Radchenko</surname>
<given-names>EV</given-names>
</name>
<name><surname>Welsh</surname>
<given-names>WJ</given-names>
</name>
<name><surname>Kholodovych</surname>
<given-names>V</given-names>
</name>
<name><surname>Chekmarev</surname>
<given-names>D</given-names>
</name>
<name><surname>Cherkasov</surname>
<given-names>A</given-names>
</name>
<name><surname>Aires-de-Sousa</surname>
<given-names>J</given-names>
</name>
<name><surname>Zhang</surname>
<given-names>QY</given-names>
</name>
<name><surname>Bender</surname>
<given-names>A</given-names>
</name>
<name><surname>Nigsch</surname>
<given-names>F</given-names>
</name>
<name><surname>Patiny</surname>
<given-names>L</given-names>
</name>
<name><surname>Williams</surname>
<given-names>A</given-names>
</name>
<name><surname>Tkachenko</surname>
<given-names>V</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Online chemical modeling environment (OCHEM): web platform for data storage, model development and publishing of chemical information</article-title>
<source>J Comput Aided Mol Des</source>
<year>2011</year>
<volume>25</volume>
<fpage>533</fpage>
<lpage>554</lpage>
<pub-id pub-id-type="doi">10.1007/s10822-011-9440-2</pub-id>
<pub-id pub-id-type="pmid">21660515</pub-id>
</element-citation>
</ref>
<ref id="CR30"><label>30.</label>
<mixed-citation publication-type="other">OCHEM Molecular descriptors. <ext-link ext-link-type="uri" xlink:href="http://docs.ochem.eu/display/MAN/Molecular%2bdescriptors">http://docs.ochem.eu/display/MAN/Molecular+descriptors</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR31"><label>31.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hall</surname>
<given-names>LH</given-names>
</name>
<name><surname>Kier</surname>
<given-names>LB</given-names>
</name>
</person-group>
<article-title>Electrotopological state indexes for atom types—a novel combination of electronic, topological, and valence state information</article-title>
<source>J Chem Inf Comput Sci</source>
<year>1995</year>
<volume>35</volume>
<fpage>1039</fpage>
<lpage>1045</lpage>
<pub-id pub-id-type="doi">10.1021/ci00028a014</pub-id>
</element-citation>
</ref>
<ref id="CR32"><label>32.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
<name><surname>Fourches</surname>
<given-names>D</given-names>
</name>
<name><surname>Horvath</surname>
<given-names>D</given-names>
</name>
<name><surname>Klimchuk</surname>
<given-names>O</given-names>
</name>
<name><surname>Gaudin</surname>
<given-names>C</given-names>
</name>
<name><surname>Vayer</surname>
<given-names>P</given-names>
</name>
<name><surname>Solov’ev</surname>
<given-names>V</given-names>
</name>
<name><surname>Hoonakker</surname>
<given-names>F</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Marcou</surname>
<given-names>G</given-names>
</name>
</person-group>
<article-title>ISIDA—platform for virtual screening based on fragment and pharmacophoric descriptors</article-title>
<source>Curr Comput Aided Drug Des</source>
<year>2008</year>
<volume>4</volume>
<fpage>191</fpage>
<lpage>198</lpage>
<pub-id pub-id-type="doi">10.2174/157340908785747465</pub-id>
</element-citation>
</ref>
<ref id="CR33"><label>33.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Skvortsova</surname>
<given-names>MI</given-names>
</name>
<name><surname>Baskin</surname>
<given-names>II</given-names>
</name>
<name><surname>Skvortsov</surname>
<given-names>LA</given-names>
</name>
<name><surname>Palyulin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Zefirov</surname>
<given-names>NS</given-names>
</name>
<name><surname>Stankevich</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Chemical graphs and their basis invariants</article-title>
<source>J Mol Struct</source>
<year>1999</year>
<volume>466</volume>
<fpage>211</fpage>
<lpage>217</lpage>
<pub-id pub-id-type="doi">10.1016/S0166-1280(98)00467-9</pub-id>
</element-citation>
</ref>
<ref id="CR34"><label>34.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Steinbeck</surname>
<given-names>C</given-names>
</name>
<name><surname>Han</surname>
<given-names>Y</given-names>
</name>
<name><surname>Kuhn</surname>
<given-names>S</given-names>
</name>
<name><surname>Horlacher</surname>
<given-names>O</given-names>
</name>
<name><surname>Luttmann</surname>
<given-names>E</given-names>
</name>
<name><surname>Willighagen</surname>
<given-names>E</given-names>
</name>
</person-group>
<article-title>The chemistry development kit (CDK): an open-source Java library for chemo- and bio-informatics</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2003</year>
<volume>43</volume>
<fpage>493</fpage>
<lpage>500</lpage>
<pub-id pub-id-type="doi">10.1021/ci025584y</pub-id>
<pub-id pub-id-type="pmid">12653513</pub-id>
</element-citation>
</ref>
<ref id="CR35"><label>35.</label>
<mixed-citation publication-type="other">ChemAxon Kft. <ext-link ext-link-type="uri" xlink:href="http://www.chemaxon.com">http://www.chemaxon.com</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR36"><label>36.</label>
<mixed-citation publication-type="other">Online Chemical e-Laboratory. <ext-link ext-link-type="uri" xlink:href="http://www.chemosophia.com">http://www.chemosophia.com</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR37"><label>37.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Potemkin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Grishina</surname>
<given-names>MA</given-names>
</name>
<name><surname>Bartashevich</surname>
<given-names>EV</given-names>
</name>
</person-group>
<article-title>Modeling of drug molecule orientation within a receptor cavity in the BiS algorithm framework</article-title>
<source>J Struct Chem</source>
<year>2007</year>
<volume>48</volume>
<fpage>155</fpage>
<lpage>160</lpage>
<pub-id pub-id-type="doi">10.1007/s10947-007-0023-y</pub-id>
</element-citation>
</ref>
<ref id="CR38"><label>38.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Salmina</surname>
<given-names>E</given-names>
</name>
<name><surname>Potemkin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Poda</surname>
<given-names>G</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>ToxAlerts: a web server of structural alerts for toxic chemicals and compounds with potential adverse reactions</article-title>
<source>J Chem Inf Model</source>
<year>2012</year>
<volume>52</volume>
<fpage>2310</fpage>
<lpage>2316</lpage>
<pub-id pub-id-type="doi">10.1021/ci300245q</pub-id>
<pub-id pub-id-type="pmid">22876798</pub-id>
</element-citation>
</ref>
<ref id="CR39"><label>39.</label>
<mixed-citation publication-type="other">Salmina E, Haider N, Tetko IV (2016) Extended functional groups (EFG): an efficient set for chemical characterization and structure-activity relationship studies of chemical compounds. Molecules 21:1 doi:10.3390/molecules21010001</mixed-citation>
</ref>
<ref id="CR40"><label>40.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Haider</surname>
<given-names>N</given-names>
</name>
</person-group>
<article-title>Functionality pattern matching as an efficient complementary structure/reaction search tool: an open-source approach</article-title>
<source>Molecules</source>
<year>2010</year>
<volume>15</volume>
<fpage>5079</fpage>
<lpage>5092</lpage>
<pub-id pub-id-type="doi">10.3390/molecules15085079</pub-id>
<pub-id pub-id-type="pmid">20714286</pub-id>
</element-citation>
</ref>
<ref id="CR41"><label>41.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Rogers</surname>
<given-names>D</given-names>
</name>
<name><surname>Hahn</surname>
<given-names>M</given-names>
</name>
</person-group>
<article-title>Extended-connectivity fingerprints</article-title>
<source>J Chem Inf Model</source>
<year>2010</year>
<volume>50</volume>
<fpage>742</fpage>
<lpage>754</lpage>
<pub-id pub-id-type="doi">10.1021/ci100050t</pub-id>
<pub-id pub-id-type="pmid">20426451</pub-id>
</element-citation>
</ref>
<ref id="CR42"><label>42.</label>
<mixed-citation publication-type="other">BIOVIA Pipeline Pilot Overview. <ext-link ext-link-type="uri" xlink:href="http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/">http://accelrys.com/products/collaborative-science/biovia-pipeline-pilot/</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR43"><label>43.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bender</surname>
<given-names>A</given-names>
</name>
<name><surname>Mussa</surname>
<given-names>HY</given-names>
</name>
<name><surname>Glen</surname>
<given-names>RC</given-names>
</name>
<name><surname>Reiling</surname>
<given-names>S</given-names>
</name>
</person-group>
<article-title>Similarity searching of chemical databases using atom environment descriptors (MOLPRINT 2D): evaluation of performance</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2004</year>
<volume>44</volume>
<fpage>1708</fpage>
<lpage>1718</lpage>
<pub-id pub-id-type="doi">10.1021/ci0498719</pub-id>
<pub-id pub-id-type="pmid">15446830</pub-id>
</element-citation>
</ref>
<ref id="CR44"><label>44.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Pandey</surname>
<given-names>AK</given-names>
</name>
<name><surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name><surname>Tropsha</surname>
<given-names>A</given-names>
</name>
<name><surname>Papa</surname>
<given-names>E</given-names>
</name>
<name><surname>Oberg</surname>
<given-names>T</given-names>
</name>
<name><surname>Todeschini</surname>
<given-names>R</given-names>
</name>
<name><surname>Fourches</surname>
<given-names>D</given-names>
</name>
<name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Critical assessment of QSAR models of environmental toxicity against <italic>Tetrahymena pyriformis</italic>
: focusing on applicability domain and overfitting by variable selection</article-title>
<source>J Chem Inf Model</source>
<year>2008</year>
<volume>48</volume>
<fpage>1733</fpage>
<lpage>1746</lpage>
<pub-id pub-id-type="doi">10.1021/ci800151m</pub-id>
<pub-id pub-id-type="pmid">18729318</pub-id>
</element-citation>
</ref>
<ref id="CR45"><label>45.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Chang</surname>
<given-names>C-C</given-names>
</name>
<name><surname>Lin</surname>
<given-names>C-J</given-names>
</name>
</person-group>
<article-title>LIBSVM: a library for support vector machines</article-title>
<source>ACM Trans Intell Syst Technol</source>
<year>2011</year>
<volume>2</volume>
<fpage>1</fpage>
<lpage>27</lpage>
<pub-id pub-id-type="doi">10.1145/1961189.1961199</pub-id>
</element-citation>
</ref>
<ref id="CR46"><label>46.</label>
<mixed-citation publication-type="other">LIBSVM: a library for support vector machines. <ext-link ext-link-type="uri" xlink:href="http://www.csie.ntu.edu.tw/%7ecjlin/libsvm">http://www.csie.ntu.edu.tw/~cjlin/libsvm</ext-link>
 (10 Nov 2015)</mixed-citation>
</ref>
<ref id="CR47"><label>47.</label>
<mixed-citation publication-type="other">Tetko IV, Baskin II, Varnek A (2008) Tutorial on machine learning. Part 2. Descriptor selection bias. In: Strasbourg summer school on chemoinformatics: cheminfoS3. Obernai. <ext-link ext-link-type="uri" xlink:href="https://www.researchgate.net/publication/236651951_Tutorial_on_Machine_Learning_Part_2_Descriptor_Selection_Bias">https://www.researchgate.net/publication/236651951_Tutorial_on_Machine_Learning_Part_2_Descriptor_Selection_Bias</ext-link>
 (5 Aug 2015)</mixed-citation>
</ref>
<ref id="CR48"><label>48.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Solov’ev</surname>
<given-names>VP</given-names>
</name>
<name><surname>Antonov</surname>
<given-names>AV</given-names>
</name>
<name><surname>Yao</surname>
<given-names>X</given-names>
</name>
<name><surname>Doucet</surname>
<given-names>JP</given-names>
</name>
<name><surname>Fan</surname>
<given-names>B</given-names>
</name>
<name><surname>Hoonakker</surname>
<given-names>F</given-names>
</name>
<name><surname>Fourches</surname>
<given-names>D</given-names>
</name>
<name><surname>Jost</surname>
<given-names>P</given-names>
</name>
<name><surname>Lachiche</surname>
<given-names>N</given-names>
</name>
<name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
</person-group>
<article-title>Benchmarking of linear and nonlinear approaches for quantitative structure–property relationship studies of metal complexation with ionophores</article-title>
<source>J Chem Inf Model</source>
<year>2006</year>
<volume>46</volume>
<fpage>808</fpage>
<lpage>819</lpage>
<pub-id pub-id-type="doi">10.1021/ci0504216</pub-id>
<pub-id pub-id-type="pmid">16563012</pub-id>
</element-citation>
</ref>
<ref id="CR49"><label>49.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Breiman</surname>
<given-names>L</given-names>
</name>
</person-group>
<article-title>Bagging predictors</article-title>
<source>Mach Learn</source>
<year>1996</year>
<volume>24</volume>
<fpage>123</fpage>
<lpage>140</lpage>
</element-citation>
</ref>
<ref id="CR50"><label>50.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Kotsiantis</surname>
<given-names>SB</given-names>
</name>
<name><surname>Kanellopoulos</surname>
<given-names>D</given-names>
</name>
<name><surname>Pintelas</surname>
<given-names>PE</given-names>
</name>
</person-group>
<article-title>Handling imbalanced datasets: a review</article-title>
<source>Int Trans Comput Sci Eng</source>
<year>2006</year>
<volume>30</volume>
<fpage>25</fpage>
<lpage>36</lpage>
</element-citation>
</ref>
<ref id="CR51"><label>51.</label>
<mixed-citation publication-type="other">Tetko IV, Varbanov H, Galanski M, Platts J, Gabano E (2016) Prediction of logP for Pt(II) and Pt(IV) complexes: comparison of statistical and quantum-chemistry based approaches. J Inorg Biochem 156:1-13</mixed-citation>
</ref>
<ref id="CR52"><label>52.</label>
<mixed-citation publication-type="other">Novotarskyi S, Sushko Y, Abdelaziz A, Korner R, Vogt J, Tetko IV (2016) Why rank-I submission of the ToxCast EPA in vitro to in vivo challenge to predict lowest effect level (LEL) does not use in vitro measurements? Chem Res Toxicol <bold>(submitted)</bold>
</mixed-citation>
</ref>
<ref id="CR53"><label>53.</label>
<mixed-citation publication-type="other">Abdelaziz A, Spahn-Langguth H, Schramm KW, Tetko IV (2016) Consensus approach for modeling HTS assays using in silico descriptors. Front Environ Sci. doi:10.3389/fenvs.2016.00002</mixed-citation>
</ref>
<ref id="CR54"><label>54.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Zhu</surname>
<given-names>H</given-names>
</name>
<name><surname>Tropsha</surname>
<given-names>A</given-names>
</name>
<name><surname>Fourches</surname>
<given-names>D</given-names>
</name>
<name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
<name><surname>Papa</surname>
<given-names>E</given-names>
</name>
<name><surname>Gramatica</surname>
<given-names>P</given-names>
</name>
<name><surname>Oberg</surname>
<given-names>T</given-names>
</name>
<name><surname>Dao</surname>
<given-names>P</given-names>
</name>
<name><surname>Cherkasov</surname>
<given-names>A</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Combinatorial QSAR modeling of chemical toxicants tested against <italic>Tetrahymena pyriformis</italic>
</article-title>
<source>J Chem Inf Model</source>
<year>2008</year>
<volume>48</volume>
<fpage>766</fpage>
<lpage>784</lpage>
<pub-id pub-id-type="doi">10.1021/ci700443v</pub-id>
<pub-id pub-id-type="pmid">18311912</pub-id>
</element-citation>
</ref>
<ref id="CR55"><label>55.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Dunn</surname>
<given-names>MS</given-names>
</name>
<name><surname>Brophy</surname>
<given-names>TW</given-names>
</name>
</person-group>
<article-title>Decomposition points of the amino acids</article-title>
<source>J Biol Chem</source>
<year>1932</year>
<volume>99</volume>
<fpage>221</fpage>
<lpage>229</lpage>
</element-citation>
</ref>
<ref id="CR56"><label>56.</label>
<element-citation publication-type="book"><person-group person-group-type="author"><name><surname>Manahan</surname>
<given-names>SE</given-names>
</name>
</person-group>
<source>Toxicological chemistry and biochemistry</source>
<year>2003</year>
<edition>3</edition>
<publisher-loc>Boca Raton</publisher-loc>
<publisher-name>Lewis</publisher-name>
<fpage>425</fpage>
</element-citation>
</ref>
<ref id="CR57"><label>57.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Korner</surname>
<given-names>R</given-names>
</name>
<name><surname>Pandey</surname>
<given-names>AK</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>A comparison of different QSAR approaches to modeling CYP450 1A2 inhibition</article-title>
<source>J Chem Inf Model</source>
<year>2011</year>
<volume>51</volume>
<fpage>1271</fpage>
<lpage>1280</lpage>
<pub-id pub-id-type="doi">10.1021/ci200091h</pub-id>
<pub-id pub-id-type="pmid">21598906</pub-id>
</element-citation>
</ref>
<ref id="CR58"><label>58.</label>
<mixed-citation publication-type="other">Hall M, Frank E, Holmes G, Pfahringer B, Reutemann P, Witten IH (2009) The WEKA data mining software: an update. In: SIGKDD explorations, p 11</mixed-citation>
</ref>
<ref id="CR59"><label>59.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Bruneau</surname>
<given-names>P</given-names>
</name>
<name><surname>Mewes</surname>
<given-names>HW</given-names>
</name>
<name><surname>Rohrer</surname>
<given-names>DC</given-names>
</name>
<name><surname>Poda</surname>
<given-names>GI</given-names>
</name>
</person-group>
<article-title>Can we estimate the accuracy of ADME–Tox predictions?</article-title>
<source>Drug Discov Today</source>
<year>2006</year>
<volume>11</volume>
<fpage>700</fpage>
<lpage>707</lpage>
<pub-id pub-id-type="doi">10.1016/j.drudis.2006.06.013</pub-id>
<pub-id pub-id-type="pmid">16846797</pub-id>
</element-citation>
</ref>
<ref id="CR60"><label>60.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Körner</surname>
<given-names>R</given-names>
</name>
<name><surname>Pandey</surname>
<given-names>AK</given-names>
</name>
<name><surname>Kovalishyn</surname>
<given-names>VV</given-names>
</name>
<name><surname>Prokopenko</surname>
<given-names>VV</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Applicability domain for in silico models to achieve accuracy of experimental measurements</article-title>
<source>J Chemom</source>
<year>2010</year>
<volume>24</volume>
<fpage>202</fpage>
<lpage>208</lpage>
<pub-id pub-id-type="doi">10.1002/cem.1296</pub-id>
</element-citation>
</ref>
<ref id="CR61"><label>61.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Sushko</surname>
<given-names>I</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Korner</surname>
<given-names>R</given-names>
</name>
<name><surname>Pandey</surname>
<given-names>AK</given-names>
</name>
<name><surname>Cherkasov</surname>
<given-names>A</given-names>
</name>
<name><surname>Li</surname>
<given-names>J</given-names>
</name>
<name><surname>Gramatica</surname>
<given-names>P</given-names>
</name>
<name><surname>Hansen</surname>
<given-names>K</given-names>
</name>
<name><surname>Schroeter</surname>
<given-names>T</given-names>
</name>
<name><surname>Muller</surname>
<given-names>KR</given-names>
</name>
<name><surname>Xi</surname>
<given-names>L</given-names>
</name>
<name><surname>Liu</surname>
<given-names>H</given-names>
</name>
<name><surname>Yao</surname>
<given-names>X</given-names>
</name>
<name><surname>Oberg</surname>
<given-names>T</given-names>
</name>
<name><surname>Hormozdiari</surname>
<given-names>F</given-names>
</name>
<name><surname>Dao</surname>
<given-names>P</given-names>
</name>
<name><surname>Sahinalp</surname>
<given-names>C</given-names>
</name>
<name><surname>Todeschini</surname>
<given-names>R</given-names>
</name>
<name><surname>Polishchuk</surname>
<given-names>P</given-names>
</name>
<name><surname>Artemenko</surname>
<given-names>A</given-names>
</name>
<name><surname>Kuz’min</surname>
<given-names>V</given-names>
</name>
<name><surname>Martin</surname>
<given-names>TM</given-names>
</name>
<name><surname>Young</surname>
<given-names>DM</given-names>
</name>
<name><surname>Fourches</surname>
<given-names>D</given-names>
</name>
<name><surname>Muratov</surname>
<given-names>E</given-names>
</name>
<name><surname>Tropsha</surname>
<given-names>A</given-names>
</name>
<name><surname>Baskin</surname>
<given-names>I</given-names>
</name>
<name><surname>Horvath</surname>
<given-names>D</given-names>
</name>
<name><surname>Marcou</surname>
<given-names>G</given-names>
</name>
<name><surname>Muller</surname>
<given-names>C</given-names>
</name>
<name><surname>Varnek</surname>
<given-names>A</given-names>
</name>
<name><surname>Prokopenko</surname>
<given-names>VV</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>Applicability domains for classification problems: benchmarking of distance to models for Ames mutagenicity set</article-title>
<source>J Chem Inf Model</source>
<year>2010</year>
<volume>50</volume>
<fpage>2094</fpage>
<lpage>2111</lpage>
<pub-id pub-id-type="doi">10.1021/ci100253r</pub-id>
<pub-id pub-id-type="pmid">21033656</pub-id>
</element-citation>
</ref>
<ref id="CR62"><label>62.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Sopasakis</surname>
<given-names>P</given-names>
</name>
<name><surname>Kunwar</surname>
<given-names>P</given-names>
</name>
<name><surname>Brandmaier</surname>
<given-names>S</given-names>
</name>
<name><surname>Novotarskyi</surname>
<given-names>S</given-names>
</name>
<name><surname>Charochkina</surname>
<given-names>L</given-names>
</name>
<name><surname>Prokopenko</surname>
<given-names>V</given-names>
</name>
<name><surname>Peijnenburg</surname>
<given-names>WJ</given-names>
</name>
</person-group>
<article-title>Prioritisation of polybrominated diphenyl ethers (PBDEs) by using the QSPR-THESAURUS web tool</article-title>
<source>Altern Lab Anim</source>
<year>2013</year>
<volume>41</volume>
<fpage>127</fpage>
<lpage>135</lpage>
<pub-id pub-id-type="pmid">23614549</pub-id>
</element-citation>
</ref>
<ref id="CR63"><label>63.</label>
<mixed-citation publication-type="other">den Hollander HA, Van de Meent D (2004) SimpleBox 3.0: a multimedia mass balance model for evaluating the environmental fate of chemicals. RIVM report 601200003. RIVM, National Institute of Public Health and the Environment, Bilthoven</mixed-citation>
</ref>
<ref id="CR64"><label>64.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Delaney</surname>
<given-names>JS</given-names>
</name>
</person-group>
<article-title>Predicting aqueous solubility from structure</article-title>
<source>Drug Discov Today</source>
<year>2005</year>
<volume>10</volume>
<fpage>289</fpage>
<lpage>295</lpage>
<pub-id pub-id-type="doi">10.1016/S1359-6446(04)03365-3</pub-id>
<pub-id pub-id-type="pmid">15708748</pub-id>
</element-citation>
</ref>
<ref id="CR65"><label>65.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Potemkin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Bartashevich</surname>
<given-names>EV</given-names>
</name>
<name><surname>Belik</surname>
<given-names>AV</given-names>
</name>
</person-group>
<article-title>A new approach to predicting the thermodynamic parameters of substances from molecular characteristics</article-title>
<source>Russ J Phys Chem A</source>
<year>1996</year>
<volume>70</volume>
<fpage>411</fpage>
<lpage>415</lpage>
</element-citation>
</ref>
<ref id="CR66"><label>66.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Potemkin</surname>
<given-names>VA</given-names>
</name>
<name><surname>Bartashevich</surname>
<given-names>EV</given-names>
</name>
<name><surname>Belik</surname>
<given-names>AV</given-names>
</name>
</person-group>
<article-title>A model for calculating the atomic volumetric characteristics in molecular systems</article-title>
<source>Zh Fiz Khim</source>
<year>1998</year>
<volume>72</volume>
<fpage>650</fpage>
<lpage>656</lpage>
</element-citation>
</ref>
<ref id="CR67"><label>67.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bemis</surname>
<given-names>GW</given-names>
</name>
<name><surname>Murcko</surname>
<given-names>MA</given-names>
</name>
</person-group>
<article-title>The properties of known drugs. 1. Molecular frameworks</article-title>
<source>J Med Chem</source>
<year>1996</year>
<volume>39</volume>
<fpage>2887</fpage>
<lpage>2893</lpage>
<pub-id pub-id-type="doi">10.1021/jm9602928</pub-id>
<pub-id pub-id-type="pmid">8709122</pub-id>
</element-citation>
</ref>
<ref id="CR68"><label>68.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Yang</surname>
<given-names>Y</given-names>
</name>
<name><surname>Chen</surname>
<given-names>H</given-names>
</name>
<name><surname>Nilsson</surname>
<given-names>I</given-names>
</name>
<name><surname>Muresan</surname>
<given-names>S</given-names>
</name>
<name><surname>Engkvist</surname>
<given-names>O</given-names>
</name>
</person-group>
<article-title>Investigation of the relationship between topology and selectivity for druglike molecules</article-title>
<source>J Med Chem</source>
<year>2010</year>
<volume>53</volume>
<fpage>7709</fpage>
<lpage>7714</lpage>
<pub-id pub-id-type="doi">10.1021/jm1008456</pub-id>
<pub-id pub-id-type="pmid">20942392</pub-id>
</element-citation>
</ref>
<ref id="CR69"><label>69.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Bauerschmidt</surname>
<given-names>S</given-names>
</name>
<name><surname>Gasteiger</surname>
<given-names>J</given-names>
</name>
</person-group>
<article-title>Overcoming the limitations of a connection table description: a universal representation of chemical species</article-title>
<source>J Chem Inf Comput Sci</source>
<year>1997</year>
<volume>37</volume>
<fpage>705</fpage>
<lpage>714</lpage>
<pub-id pub-id-type="doi">10.1021/ci9704423</pub-id>
</element-citation>
</ref>
<ref id="CR70"><label>70.</label>
<mixed-citation publication-type="other">Williams A, Lowe D, Tetko I (2015) Melting point and pyrolysis point data for tens of thousands of chemicals. <ext-link ext-link-type="uri" xlink:href="https://figshare.com/articles/Melting_Point_and_Pyrolysis_Point_Data_for_Tens_of_Thousands_of_Chemicals/2007426">https://figshare.com/articles/Melting_Point_and_Pyrolysis_Point_Data_for_Tens_of_Thousands_of_Chemicals/2007426</ext-link>
 (9 Dec 2015)</mixed-citation>
</ref>
<ref id="CR71"><label>71.</label>
<mixed-citation publication-type="other">Creative Commons. Attribution 3.0 Unported (CC BY 3.0). <ext-link ext-link-type="uri" xlink:href="https://creativecommons.org/licenses/by/3.0/">https://creativecommons.org/licenses/by/3.0/</ext-link>
 (24 Nov 2015)</mixed-citation>
</ref>
<ref id="CR72"><label>72.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Palmer</surname>
<given-names>DS</given-names>
</name>
<name><surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Is experimental data quality the limiting factor in predicting the aqueous solubility of druglike molecules?</article-title>
<source>Mol Pharm</source>
<year>2014</year>
<volume>11</volume>
<fpage>2962</fpage>
<lpage>2972</lpage>
<pub-id pub-id-type="doi">10.1021/mp500103r</pub-id>
</element-citation>
</ref>
<ref id="CR73"><label>73.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Hughes</surname>
<given-names>LD</given-names>
</name>
<name><surname>Palmer</surname>
<given-names>DS</given-names>
</name>
<name><surname>Nigsch</surname>
<given-names>F</given-names>
</name>
<name><surname>Mitchell</surname>
<given-names>JB</given-names>
</name>
</person-group>
<article-title>Why are some properties more difficult to predict than others? A study of QSPR models of solubility, melting point, and Log P</article-title>
<source>J Chem Inf Model</source>
<year>2008</year>
<volume>48</volume>
<fpage>220</fpage>
<lpage>232</lpage>
<pub-id pub-id-type="doi">10.1021/ci700307p</pub-id>
<pub-id pub-id-type="pmid">18186622</pub-id>
</element-citation>
</ref>
<ref id="CR74"><label>74.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Üstün</surname>
<given-names>B</given-names>
</name>
<name><surname>Melssen</surname>
<given-names>WJ</given-names>
</name>
<name><surname>Oudenhuijzen</surname>
<given-names>M</given-names>
</name>
<name><surname>Buydens</surname>
<given-names>LMC</given-names>
</name>
</person-group>
<article-title>Determination of optimal support vector regression parameters by genetic algorithms and simplex optimization</article-title>
<source>Anal Chim Acta</source>
<year>2005</year>
<volume>544</volume>
<fpage>292</fpage>
<lpage>305</lpage>
<pub-id pub-id-type="doi">10.1016/j.aca.2004.12.024</pub-id>
</element-citation>
</ref>
<ref id="CR75"><label>75.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Balakin</surname>
<given-names>KV</given-names>
</name>
<name><surname>Savchuk</surname>
<given-names>NP</given-names>
</name>
<name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>In silico approaches to prediction of aqueous and DMSO solubility of drug-like compounds: trends, problems and solutions</article-title>
<source>Curr Med Chem</source>
<year>2006</year>
<volume>13</volume>
<fpage>223</fpage>
<lpage>241</lpage>
<pub-id pub-id-type="doi">10.2174/092986706775197917</pub-id>
<pub-id pub-id-type="pmid">16472214</pub-id>
</element-citation>
</ref>
<ref id="CR76"><label>76.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Poda</surname>
<given-names>GI</given-names>
</name>
<name><surname>Ostermann</surname>
<given-names>C</given-names>
</name>
<name><surname>Mannhold</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Accurate in silico logP predictions: one can’t embrace the unembraceable</article-title>
<source>QSAR Comb Sci</source>
<year>2009</year>
<volume>28</volume>
<fpage>845</fpage>
<lpage>849</lpage>
<pub-id pub-id-type="doi">10.1002/qsar.200960003</pub-id>
</element-citation>
</ref>
<ref id="CR77"><label>77.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Poda</surname>
<given-names>GI</given-names>
</name>
<name><surname>Ostermann</surname>
<given-names>C</given-names>
</name>
<name><surname>Mannhold</surname>
<given-names>R</given-names>
</name>
</person-group>
<article-title>Large-scale evaluation of log P predictors: local corrections may compensate insufficient accuracy and need of experimentally testing every other compound</article-title>
<source>Chem Biodivers</source>
<year>2009</year>
<volume>6</volume>
<fpage>1837</fpage>
<lpage>1844</lpage>
<pub-id pub-id-type="doi">10.1002/cbdv.200900075</pub-id>
<pub-id pub-id-type="pmid">19937825</pub-id>
</element-citation>
</ref>
<ref id="CR78"><label>78.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Tanchuk</surname>
<given-names>VY</given-names>
</name>
<name><surname>Kasheva</surname>
<given-names>TN</given-names>
</name>
<name><surname>Villa</surname>
<given-names>AEP</given-names>
</name>
</person-group>
<article-title>Estimation of aqueous solubility of chemical compounds using E-state indices</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2001</year>
<volume>41</volume>
<fpage>1488</fpage>
<lpage>1493</lpage>
<pub-id pub-id-type="doi">10.1021/ci000392t</pub-id>
<pub-id pub-id-type="pmid">11749573</pub-id>
</element-citation>
</ref>
<ref id="CR79"><label>79.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
<name><surname>Tanchuk</surname>
<given-names>VY</given-names>
</name>
</person-group>
<article-title>Application of associative neural networks for prediction of lipophilicity in ALOGPS 2.1 program</article-title>
<source>J Chem Inf Comput Sci</source>
<year>2002</year>
<volume>42</volume>
<fpage>1136</fpage>
<lpage>1145</lpage>
<pub-id pub-id-type="doi">10.1021/ci025515j</pub-id>
<pub-id pub-id-type="pmid">12377001</pub-id>
</element-citation>
</ref>
<ref id="CR80"><label>80.</label>
<element-citation publication-type="journal"><person-group person-group-type="author"><name><surname>Tetko</surname>
<given-names>IV</given-names>
</name>
</person-group>
<article-title>The perspectives of computational chemistry modeling</article-title>
<source>J Comput Aided Mol Des</source>
<year>2012</year>
<volume>26</volume>
<fpage>135</fpage>
<lpage>136</lpage>
<pub-id pub-id-type="doi">10.1007/s10822-011-9513-2</pub-id>
<pub-id pub-id-type="pmid">22160554</pub-id>
</element-citation>
</ref>
</ref-list>
</back>
</pmc>
</record>

To manipulate this document under Unix (Dilib)

EXPLOR_STEP=$WICRI_ROOT/Wicri/Informatique/explor/SgmlV1/Data/Pmc/Corpus

HfdSelect -h $EXPLOR_STEP/biblio.hfd -nk 000017  | SxmlIndent | more

HfdSelect -h $EXPLOR_AREA/Data/Pmc/Corpus/biblio.hfd -nk 000017  | SxmlIndent | more

To link to this page within the Wicri network

{{Explor lien
   |wiki=    Wicri/Informatique
   |area=    SgmlV1
   |flux=    Pmc
   |étape=   Corpus
   |type=    RBID
   |clé=     
   |texte=   
}}

This area was generated with Dilib version V0.6.33.
Data generation: Mon Jul 1 14:26:08 2019. Site generation: Wed Apr 28 21:40:44 2021

	SGML exploration server
	Warning: this site is under development! It is generated automatically from raw corpora, so the information has not been validated.
