Training of the correction library for pKa calculations

If you have experimental data that could improve on the default pKa calculator, you can use the supervised pKa learning method built into the calculator. Particular structural features may affect pKa in ways the built-in method does not fully capture, so a correction library based on experimental data for your compound family helps the pKa calculator increase its prediction accuracy.

How to improve the accuracy of the pKa calculation?

First, identify which ionization center(s) the pKa calculator predicted inaccurately, and collect experimental data for those center(s). The learning algorithm is based on linear regression analysis, so you need a sufficient amount of experimental pKa data; with too little data the regression analysis will fail. There is no strict rule for how large a pool of data is required for reliable pKa training. If your goal is only a local model for a certain type of chemical environment around the ionization center, a few representative structures may be enough. A more robust model, however, requires as many diverse structures and pKa values for the ionization center in question as possible.
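As a rough illustration of why a regression needs more than a single measurement, the sketch below fits a straight-line correction by ordinary least squares. It is not the pKa calculator's actual training algorithm, whose descriptors and model are not described here; the function name `fit_linear_correction` and the numbers are hypothetical and chosen only to make the arithmetic easy to follow.

```python
def fit_linear_correction(predicted, experimental):
    """Least-squares line: experimental ~ slope * predicted + intercept.

    Toy model only; the real training method is not documented here.
    """
    n = len(predicted)
    if n != len(experimental):
        raise ValueError("predicted and experimental must have equal length")
    if n < 2:
        # One point cannot determine both a slope and an intercept.
        raise ValueError("need at least two data points to fit a line")
    mean_x = sum(predicted) / n
    mean_y = sum(experimental) / n
    sxx = sum((x - mean_x) ** 2 for x in predicted)
    if sxx == 0:
        raise ValueError("predicted values are all identical; slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(predicted, experimental))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


if __name__ == "__main__":
    # Made-up values lying exactly on y = 2x + 1 (not real pKa data).
    print(fit_linear_correction([1.0, 2.0, 3.0], [3.0, 5.0, 7.0]))
```

With a single data point, or with points that all share the same predicted value, the fit is undefined and the function raises an error. The same failure occurs in any regression when the training pool is too small or too uniform.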

The first step of the training process is to enter the collected data into an sdf file. Next, run the training algorithm, which creates a correction library from your data and stores it on your computer. You can then use this correction library in MarvinSketch, cxcalc, and Chemical Terms.

How to create a training set and generate a correction library

  1. Create a training set in sdf file (.sdf) format.
    This can be done easily in the graphical user interface of Instant JChem. Your sdf file must contain the following fields: the structure, an experimental pKa value (pKa1), and the index of the atom that value belongs to (ID1). Additional pKa fields are optional but recommended for multiprotic compounds, for example pKa2 with ID2, and so on. Defining a single pKa value is enough to apply the training data, but providing more values for multiprotic compounds makes the pKa training more reliable.

    Example
    The picture below shows the details of the training set (pKa_trainingset.sdf). ID1 is the index of the atom to which the experimental pKa1 value belongs (ID2 would be the index of the atom for the second measured value, pKa2, and so on).

    mypkadata

  2. Generate the correction library
    Execute the following command from command line:
    cxtrain pka -i [library name] [training file] 
    Example
    cxtrain pka -i mypka mydata.sdf
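If you prefer to assemble the training file with a script instead of a graphical tool, the following sketch shows one possible way to attach the pKa1 and ID1 fields to an existing mol block using the standard SD-file data-item layout. The helper names `sd_record` and `cxtrain_command` are hypothetical. The mol block is assumed to come from your own structure export, and the field values in the usage lines are placeholders, not measured data.

```python
def sd_record(molblock, fields):
    """Return one SD-file record: mol block, data items, then '$$$$'."""
    lines = [molblock.rstrip("\n")]
    for name, value in fields.items():
        lines.append(f">  <{name}>")   # data header line
        lines.append(str(value))       # data value line
        lines.append("")               # blank line ends the data item
    lines.append("$$$$")               # record terminator
    return "\n".join(lines) + "\n"


def cxtrain_command(library_name, training_file):
    """Argument list matching 'cxtrain pka -i [library name] [training file]'."""
    return ["cxtrain", "pka", "-i", library_name, training_file]


if __name__ == "__main__":
    # Placeholder mol block: replace with a real one exported from your tool.
    molblock = "placeholder\n\n\n...atom and bond lines...\nM  END\n"
    # Placeholder values: pKa1 is the measured pKa, ID1 the atom index.
    print(sd_record(molblock, {"pKa1": 9.5, "ID1": 7}), end="")
    print(" ".join(cxtrain_command("mypka", "mydata.sdf")))
```

The argument list can be passed to `subprocess.run` if cxtrain is available on your PATH. Building it as a list rather than a single string avoids quoting problems with file names that contain spaces.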

Usage of the pKa plugin with correction library

    MarvinSketch

  1. In MarvinSketch, select Tools > Protonation > pKa.
  2. Check the 'Use correction library' box to activate the training option (see figure below).
  3. If you have created multiple training sets, choose the most accurate one from the dropdown list below the checkbox.

    pKa options panel in Marvin

    The next figure shows the results with (I) and without (II) applying the correction library.

    I. pKa calculation with training data II. pKa calculation without training data

    cxcalc

    To apply your corrections to the pKa calculation, use the --correctionlibrary parameter (or its short form, -L).
    cxcalc pKa --correctionlibrary [library name] [input file/string]
    Example
    $ cxcalc pKa --correctionlibrary mypka "CSC1=NC2=C(N1)C=NC(O)=N2"
    Result
     id      apKa1   apKa2   bpKa1   bpKa2   atoms
     1       11.19   16.01   2.34    -2.59   7,11,9,4

    If you use cxcalc pKa calculation without the correction library, the results will be calculated with the built-in dataset.
    Example
    $ cxcalc pKa "CSC1=NC2=C(N1)C=NC(O)=N2"
    Result
     id      apKa1   apKa2   bpKa1   bpKa2   atoms
     1       8.34    16.01   2.34    -2.59   7,11,9,4
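To compare the two result tables above programmatically, the sketch below parses this output into a mapping from column name to atom index and value, then reports the shift introduced by the correction library. It assumes that the entries in the atoms column follow the order of the value columns and that every column is filled, as in the examples above. The function names `parse_cxcalc_pka` and `pka_shift` are hypothetical.

```python
def parse_cxcalc_pka(output):
    """Map each pKa column to (atom index, value) from a header + one row."""
    header, row = [line.split() for line in output.strip().splitlines()[:2]]
    record = dict(zip(header, row))
    atoms = [int(a) for a in record["atoms"].split(",")]
    # Assumption: the atoms list is ordered like the value columns.
    value_columns = [name for name in header if name not in ("id", "atoms")]
    return {name: (atom, float(record[name]))
            for name, atom in zip(value_columns, atoms)}


def pka_shift(trained, untrained):
    """Per-column difference (trained minus untrained), rounded to 2 places."""
    return {name: round(trained[name][1] - untrained[name][1], 2)
            for name in trained}


if __name__ == "__main__":
    trained = parse_cxcalc_pka(
        "id apKa1 apKa2 bpKa1 bpKa2 atoms\n"
        "1 11.19 16.01 2.34 -2.59 7,11,9,4\n")
    untrained = parse_cxcalc_pka(
        "id apKa1 apKa2 bpKa1 bpKa2 atoms\n"
        "1 8.34 16.01 2.34 -2.59 7,11,9,4\n")
    print(pka_shift(trained, untrained))
```

For the example molecule, only the apKa1 column changes between the two runs. Splitting on whitespace is a simplification: a row with empty columns would need a delimiter-aware parser.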

    For more options see this page.

    Chemical Terms

    A pKa calculation that applies a correction library can be run via Chemical Terms, either from the Evaluator command line or from Instant JChem.

    Chemical Terms Evaluator

    The Chemical Terms Evaluator is designed to evaluate mathematical expressions on molecules. To use your correction library, type the following expression on the command line.
    evaluate -e "pKa('correctionlibrary:[library name]')" "[input file/string]"
    Example
    evaluate -e "pKa('correctionlibrary:mypka')" "CSC1=NC2=C(N1)C=NC(O)=N2"
    Result
    ;;;-2,59;;;11,19;;2,34;;16,01;
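The result line above appears to hold one semicolon-separated slot per atom, with empty slots for atoms that have no pKa value: the example molecule has 12 atoms, and the filled slots (4, 7, 9 and 11) match the atoms column of the cxcalc output. The sketch below parses such a line under that assumption. The decimal commas presumably reflect the locale of the machine that produced the output, so they are converted to points. The function name `parse_per_atom_pka` is hypothetical.

```python
def parse_per_atom_pka(result):
    """Map 1-based atom index -> pKa from a semicolon-separated result line."""
    values = {}
    for atom_index, token in enumerate(result.strip().split(";"), start=1):
        if token:  # empty slot: no pKa value reported for this atom
            values[atom_index] = float(token.replace(",", "."))
    return values


if __name__ == "__main__":
    print(parse_per_atom_pka(";;;-2,59;;;11,19;;2,34;;16,01;"))
```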

    For more details see this page.

    Chemical Terms in Instant JChem

    Instant JChem is an out-of-the-box tool that allows scientists to create, manage and analyze chemical structures and their data. You can also apply your pKa correction library via Chemical Terms in it.
  1. Click the 'New Chemical Terms Field' icon on the panel on the right side.
  2. Type the Chemical Terms expression into the window, using the correctionlibrary:[library name] parameter. Do not forget to set the Name, the Type and the DB Column Name.

    Example
    The following figure shows the use of pKa training in the 'New Chemical Terms' window. The expression pKa ('correctionlibrary:mypKa type:acidic','1') specifies that the plugin uses the correction library named mypKa and calculates the strongest acidic pKa of the molecule(s).


    New Chemical Terms window in Instant JChem


    Part of the results of this calculation is shown in the next figure. You can see the difference between the untrained (column 5, Strongest acidic pKa) and trained (column 6, Trained strongest acidic pKa) pKa values.

    New Chemical Terms window in Instant JChem