Thursday, February 3, 2011

An acronym in cheminformatics: SMILES for simplified molecular input line entry system

SMILES stands for simplified molecular input line entry system [1]. SMILES is a chemical notation system with a small chemical grammar for the encoding of molecular structures based on the principles of molecular graph theory.

SMILES is a user-friendly chemical language that allows input of molecular structures in molecular editors, databases, search engines and property estimation tools. For example, ChemSpider [2] accepts SMILES entries, as well as nomenclature-based names, registry numbers and InChI strings. Looking for aspirin (2-(acetyloxy)benzoic acid)? Type the SMILES notation CC(=O)Oc1ccccc1C(=O)O into the search field. Notice that the six carbon atoms of the benzene ring are entered in lower case to identify them as aromatic-ring members. Also, hydrogen atoms are not explicitly specified, since their number is deduced from valence rules.
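The lower-case convention can be checked with a few lines of Python. This is only a minimal sketch: the function name classify_atoms and the regular expression are illustrative assumptions, and the pattern covers just the unbracketed "organic subset" atoms, not bracket atoms such as [Na+].

```python
import re
from collections import Counter

# Organic-subset atom symbols only (an assumption of this sketch).
# Two-letter symbols come first so "Cl" is not read as "C" followed by "l".
ATOM_PATTERN = re.compile(r"Cl|Br|[BCNOPSFI]|[bcnops]")

def classify_atoms(smiles):
    """Count atoms per (element, kind), where kind is 'aromatic' or 'aliphatic'."""
    counts = Counter()
    for symbol in ATOM_PATTERN.findall(smiles):
        kind = "aromatic" if symbol.islower() else "aliphatic"
        counts[(symbol.capitalize(), kind)] += 1
    return counts

aspirin = classify_atoms("CC(=O)Oc1ccccc1C(=O)O")
print(aspirin[("C", "aromatic")])   # 6 ring carbons written in lower case
print(aspirin[("C", "aliphatic")])  # 3 carbons outside the ring
print(aspirin[("O", "aliphatic")])  # 4 oxygens
```

No hydrogen shows up in these counts, because none is written in the notation.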

What is “hiding” behind the notation FC12C3(F)C4(F)C1(F)C5(F)C4(F)C3(F)C25F?
Correct, it is perfluorocubane (1,2,3,4,5,6,7,8-octafluorocubane). ChemSpider finds it either way; whether you type the SMILES notation or a name is your choice.

Computers “love” SMILES since they can automatically derive a connection table from the linear notation, which is essential for drawing the associated structure or calculating molecular descriptors. A name such as 2-(acetyloxy)benzoic acid, by contrast, requires automatic structure interpretation on a much higher level, which is not so easy for a computer. And giving it just the name aspirin will cause a computer headache, speaking in human terms.
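To illustrate how a connection table falls out of the linear notation, here is a minimal Python sketch. It is not a full SMILES parser: the names parse_smiles and implicit_hydrogens are hypothetical, only unbracketed organic-subset atoms, the bond symbols -, = and #, branches and single-digit ring closures are handled, and aromatic bonds are simply given order 1.5, a crude assumption that works for isolated benzene rings but not for aromatic systems in general.

```python
# Normal valences used to fill in implicit hydrogens (organic subset only).
NORMAL_VALENCE = {"B": 3, "C": 4, "N": 3, "O": 2, "P": 3, "S": 2,
                  "F": 1, "Cl": 1, "Br": 1, "I": 1}
AROMATIC_SYMBOLS = set("bcnops")
BOND_ORDER = {"-": 1, "=": 2, "#": 3}

def parse_smiles(smiles):
    """Return (atoms, bonds) for a SMILES string from the supported subset."""
    atoms = []           # (element, is_aromatic), position = atom index
    bonds = []           # (index_a, index_b, order); 1.5 marks an aromatic bond
    branch_stack = []    # atoms to return to when a branch closes
    open_rings = {}      # ring digit -> (atom index, explicit order or None)
    previous = None      # atom that the next atom will be bonded to
    pending_order = None # order set by an explicit bond symbol

    def add_bond(a, b, order):
        if order is None:  # no bond symbol: aromatic between aromatic atoms, else single
            order = 1.5 if atoms[a][1] and atoms[b][1] else 1
        bonds.append((a, b, order))

    i = 0
    while i < len(smiles):
        token = smiles[i:i + 2] if smiles[i:i + 2] in ("Cl", "Br") else smiles[i]
        i += len(token)
        if token == "(":
            branch_stack.append(previous)
        elif token == ")":
            previous = branch_stack.pop()
        elif token in BOND_ORDER:
            pending_order = BOND_ORDER[token]
        elif token.isdigit():
            if token in open_rings:      # second occurrence closes the ring
                partner, order = open_rings.pop(token)
                add_bond(partner, previous, pending_order or order)
            else:                        # first occurrence opens it
                open_rings[token] = (previous, pending_order)
            pending_order = None
        elif token in NORMAL_VALENCE or token in AROMATIC_SYMBOLS:
            atoms.append((token.capitalize(), token.islower()))
            if previous is not None:
                add_bond(previous, len(atoms) - 1, pending_order)
            previous = len(atoms) - 1
            pending_order = None
        else:
            raise ValueError(f"unsupported SMILES token: {token!r}")
    return atoms, bonds

def implicit_hydrogens(atoms, bonds):
    """Hydrogens per atom: normal valence minus the bond orders already used."""
    used = [0.0] * len(atoms)
    for a, b, order in bonds:
        used[a] += order
        used[b] += order
    return [max(0, int(NORMAL_VALENCE[element] - used[index]))
            for index, (element, _) in enumerate(atoms)]

atoms, bonds = parse_smiles("CC(=O)Oc1ccccc1C(=O)O")
print(len(atoms), len(bonds))                 # 13 13
print(sum(implicit_hydrogens(atoms, bonds)))  # 8, matching C9H8O4
```

Fed the perfluorocubane string FC12C3(F)C4(F)C1(F)C5(F)C4(F)C3(F)C25F, the same sketch yields 16 atoms, 20 bonds and no hydrogens. Production toolkits do far more (bracket atoms, stereochemistry, aromaticity perception), so this is meant only to show why the step from notation to connection table is mechanical.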

Keywords: molecular graphs, molecular connectivity, molecular data exchange

[1] David Weininger: SMILES, a Chemical Language and Information System. 1. Introduction to Methodology and Encoding Rules. J. Chem. Inf. Comput. Sci. 1988, 28, 31-36.
DOI: 10.1021/ci00057a005.
[2] ChemSpider starting guide: How do you find compounds?

