Document to Structure Developer's Guide

Version 5.11.4

Introduction
Basic API usage
Configuring behavior
Monitoring progress
Command line usage

Introduction

The Document to Structure product finds chemical structures in documents. Chemical names in the text of a document, structures embedded in Office documents, and image drawings of structures are all supported (see the user documentation for more details). The structures can then be exported to any supported molecule format, or manipulated in memory.

Basic API usage

Document to Structure plugs into the generic IO API of ChemAxon. This means that documents can be used exactly as other molecular formats (sdf, ...) as a source for importing structures.

Example usage:

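import java.io.File;
import chemaxon.formats.MolImporter;
import chemaxon.struc.Molecule;
// DocumentToStructure must be imported as well (see the API documentation for its package)

// We have a document to process
File document = new File("document.pdf");

// Open the document with the Document to Structure ("d2s") import format
MolImporter importer = new MolImporter(document, "d2s");

// Iterate through the hits
Molecule m;
while ((m = importer.read()) != null) {
  String smiles = m.toFormat("smiles");  // structure exported as SMILES
  String name = m.getName();             // cleaned-up name
  String sourceText = m.getProperty(DocumentToStructure.SOURCE_TEXT);  // name as written in the document
  ...
}

// Release the underlying file once all hits have been read
importer.close();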

The exact same code can be used to import an XML file, a Microsoft Office document, or any other supported document type. The document format is detected automatically.

The list of all available properties can be found in the API documentation. Which properties are available depends on the document format. For instance, in text formats like xml, html and txt, the number of characters since the beginning of the file is available as DocumentToStructure.CHARACTER, whereas this property has no value for binary formats.
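As an illustration of what a character position can be used for, the sketch below extracts the text surrounding a hit from the original document text. It is self-contained and does not call the ChemAxon API: HitContext and around are hypothetical names, and the sketch assumes that the value of DocumentToStructure.CHARACTER has already been read and parsed into a 0-based int offset, which should be checked against the API documentation.

```java
// Hypothetical helper, not part of the ChemAxon API.
// Assumption: 'offset' is a 0-based character position obtained from the
// DocumentToStructure.CHARACTER property of a hit and parsed to an int.
final class HitContext {

    // Returns up to 'radius' characters on each side of the given offset.
    static String around(String text, int offset, int radius) {
        // Clamp the offset so that out-of-range values cannot cause an exception
        int pos = Math.max(0, Math.min(text.length(), offset));
        int start = Math.max(0, pos - radius);
        int end = Math.min(text.length(), pos + radius);
        return text.substring(start, end);
    }

    public static void main(String[] args) {
        String text = "The sample contained aspirin and caffeine.";
        // Stand-in for the offset that would be reported for the hit "aspirin"
        int offset = text.indexOf("aspirin");
        System.out.println(around(text, offset, 10));
    }
}
```

Clamping the offset keeps the helper safe if the text being searched differs from the text that was actually processed, for example after the file has been edited.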

Note that SOURCE_TEXT contains the name exactly as it appears in the source document. A cleaned version, with possible OCR errors and typos corrected, can be retrieved with m.getName().

Processing text directly

When the text to convert is given as a String object, the MolImporter object can be constructed with:

String text = ...;
MolImporter importer = DocumentToStructure.process(text);

Configuring behavior

Document to Structure accepts options to configure how it behaves. All Name to Structure options can be used with Document to Structure as well, to configure which name conversions are attempted. For instance, elements and ions are not converted by default when using d2s, as they occur often in documents and are not always useful. Their conversion can be enabled using:

MolImporter importer = new MolImporter(document, "d2s:+elements,+ions");

Document to Structure has specific options as well:

cas: enable the conversion of CAS numbers (uses a web service, off by default).
smiles: enable the conversion of SMILES strings (on by default).
inchi: enable the conversion of InChI strings (on by default).
OSRA: enable the conversion of structure drawings by the OSRA external tool (on by default if OSRA is installed).
text: enable the conversion of all text based formats: name, SMILES, InChI, CAS (on by default).
startPage=N: start processing the document at page N (can be combined with endPage to process a range of pages).
endPage=N: stop processing the document at page N.
insideTag=<tag>: for markup formats, enable the conversion only inside the given tag (typically insideTag=body for HTML). Off by default.

Each option can be preceded by a minus sign - (for instance -smiles) to disable it. Both forms smiles and +smiles are accepted to enable an option.
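When several options are combined, it can be convenient to assemble the format string programmatically. The following is a minimal sketch of such a helper. D2sOptions is a hypothetical class and not part of the ChemAxon API; it also assumes that valued options such as startPage=N are separated by commas in the same way as the +elements,+ions example above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the ChemAxon API: assembles a "d2s:..."
// format string from individual option tokens.
// Assumption: all options, including valued ones, are comma-separated.
final class D2sOptions {
    private final List<String> tokens = new ArrayList<>();

    // Enables an option, e.g. enable("elements") adds "+elements"
    D2sOptions enable(String option) {
        tokens.add("+" + option);
        return this;
    }

    // Disables an option, e.g. disable("smiles") adds "-smiles"
    D2sOptions disable(String option) {
        tokens.add("-" + option);
        return this;
    }

    // Sets a valued option, e.g. set("startPage", "2") adds "startPage=2"
    D2sOptions set(String option, String value) {
        tokens.add(option + "=" + value);
        return this;
    }

    // Returns plain "d2s" when no option was given
    String build() {
        return tokens.isEmpty() ? "d2s" : "d2s:" + String.join(",", tokens);
    }

    public static void main(String[] args) {
        String format = new D2sOptions()
            .enable("elements")
            .disable("smiles")
            .set("startPage", "2")
            .set("endPage", "5")
            .build();
        // prints "d2s:+elements,-smiles,startPage=2,endPage=5"
        System.out.println(format);
    }
}
```

The resulting string would then be passed as the second argument of the MolImporter constructor, in place of the literal "d2s:+elements,+ions" used above.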

Monitoring progress

To estimate the progress of a document conversion, use the standard method MolImporter.estimateNumRecords().
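One possible way to turn such an estimate into a progress figure is sketched below. The helper is self-contained and hypothetical (ProgressEstimate is not a ChemAxon class); it assumes the caller counts the structures returned by importer.read() and passes that count together with the value returned by importer.estimateNumRecords().

```java
// Hypothetical helper, not part of the ChemAxon API.
// Assumption: 'recordsRead' is counted by the caller in the read loop and
// 'estimatedTotal' is the value returned by MolImporter.estimateNumRecords().
final class ProgressEstimate {

    // Returns a percentage between 0 and 100.
    static int percent(int recordsRead, int estimatedTotal) {
        if (estimatedTotal <= 0) {
            return 0; // no usable estimate yet
        }
        long pct = 100L * recordsRead / estimatedTotal;
        // The total is only an estimate, so the count may overshoot it
        return (int) Math.max(0, Math.min(100, pct));
    }

    public static void main(String[] args) {
        // Stand-in values; in real code these come from the importer
        int estimatedTotal = 20;
        for (int read = 5; read <= 25; read += 5) {
            System.out.println(read + " hits read: "
                + percent(read, estimatedTotal) + "%");
        }
    }
}
```

Because the total is an estimate, the result is clamped to 100 rather than trusted to stay in range.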

Command line usage

Document to Structure can be used like any other import file format. For instance, it can be used from the command line by running MolConverter on a document format supported by Document to Structure:

molconvert sdf document.doc -o structures.sdf