Dissociation constant
Parsing and extraction utilities for PubChem dissociation constant data.
This module processes raw annotation entries from the PubChem PUG-View API for the "Dissociation Constants" heading. Each entry contains one or more free-text pKa strings deposited by data providers, which are highly inconsistent in format. The parsing pipeline normalizes these strings into structured records suitable for downstream analysis.
Typical usage
Fetch data via PubChemAPI and pass it directly to DissociationConstantData.from_page:
from pcdigitizer import PubChemAPI, Annotation
from pcdigitizer.data import DissociationConstantData
raw = PubChemAPI.get_data(Annotation.DISSOCIATION_CONSTANTS)
df = DissociationConstantData.from_page(raw)
_OUTPUT_SCHEMA = {'cid': pl.Int64, 'sid': pl.Int64, 'pclid': pl.Int64, 'pka_label': pl.String, 'pka_value': pl.Float64, 'temperature_C': pl.Float64, 'comment': pl.String}
Expected output schema (used to return an empty DataFrame safely)
DissociationConstantData
Bases: AnnotationProcessor
Parse and assemble PubChem dissociation constant annotation data.
All methods are static or class methods. The primary public interface is from_page, which converts a raw list of PubChem annotation entries into a tidy polars DataFrame.
The parsing pipeline for free-text pKa strings works as follows:
- parse_value is the top-level dispatcher. It first checks for the "pKa values are X, Y, and Z" sentence form via _parse_multi_value_sentence. If that does not match, it splits the input on semicolons and delegates each segment to _parse_part.
- _parse_part tries each compiled pattern in _PATTERNS in priority order via _try_patterns, returning the first successful ParsedPKa or None.
- Patterns are compiled once at class definition time and reused across all calls.
The following free-text forms are recognized (case-insensitive):
- pKa values are 3.25, 4.76, and 6.17 (multi-value sentence)
- pKa3 = -2.03
- pK1 = 2.36 (SRC: carboxylic acid)
- pKa = 10.4 at 40 °C (tertiary amine)
- Weak acid. pK (25 °C): 3.35
- pKa = 0.7 (caffeine cation)
- pKa = 20
Lines that cannot be matched (e.g. density or solubility values mistakenly deposited under this heading) are logged at WARNING level and excluded from the output.
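The dispatch order described above can be sketched in a few lines. This is a minimal illustration, not the module's actual code: `parse_value_sketch`, `SENTENCE`, `SEGMENT`, and the dict layout are hypothetical names, and `SEGMENT` is a simplified stand-in for the real pattern list.

```python
import logging
import re

logger = logging.getLogger(__name__)

# Sentence form, copied from the documented _PATTERN_MULTI_VALUE.
SENTENCE = re.compile(r"pKa values are\s+([\d\.,\sand-]+)", re.IGNORECASE)
# Hypothetical, simplified stand-in for the labeled patterns.
SEGMENT = re.compile(
    r"(?P<label>pKa?\d*)\s*[=:]\s*(?P<value>-?\d+(?:\.\d+)?)", re.IGNORECASE
)


def parse_value_sketch(line: str) -> list[dict]:
    """Check the sentence form first, then fall back to semicolon-split segments."""
    sentence = SENTENCE.search(line)
    if sentence:
        # Multi-value sentences carry no label information.
        return [
            {"pka_label": None, "pka_value": float(tok)}
            for tok in re.findall(r"-?\d+(?:\.\d+)?", sentence.group(1))
        ]
    records = []
    for part in line.split(";"):
        part = part.strip().strip("'\"")
        if not part:
            continue  # ignore empty segments, e.g. after a trailing semicolon
        match = SEGMENT.search(part)
        if match is None:
            # Unmatched segments are logged and excluded from the output.
            logger.warning("Could not parse pKa segment %r from %r", part, line)
            continue
        records.append(
            {"pka_label": match.group("label"), "pka_value": float(match.group("value"))}
        )
    return records
```

Under these assumptions, `parse_value_sketch("pK1 = 2.36; pK2 = 9.6")` yields two labeled records, while a misfiled string such as `"density 1.05 g/mL"` yields an empty list and one warning.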
_PATTERNS = [_PATTERN_LABELED, _PATTERN_TEMP_PREFIX, _PATTERN_FALLBACK]
Ordered list used by _try_patterns; higher-specificity patterns first.
_PATTERN_FALLBACK = re.compile('\n ^\\s*\n (?P<value>-?[\\d.]+)\n (?:\\s+in\\s+(?P<env>.+))? # optional "in water / solvent" context\n ', re.IGNORECASE | re.VERBOSE)
Fallback for bare numeric values in an environment context. Matches:
4.24 in water, -1.34
Only used when a label group is present from one of the above patterns. This pattern intentionally has no label group, so bare numbers without any pK context are always rejected (see _parse_part for the label guard).
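To see why a bare number never passes the label guard, the fallback pattern can be exercised on its own. The regular expression below is copied from the definition above; `passes_label_guard` is a hypothetical helper that mirrors the documented "non-empty label group" rule and is not taken from the module.

```python
import re

FALLBACK = re.compile(
    r"""
    ^\s*
    (?P<value>-?[\d.]+)
    (?:\s+in\s+(?P<env>.+))?    # optional "in water / solvent" context
    """,
    re.IGNORECASE | re.VERBOSE,
)


def passes_label_guard(match: re.Match) -> bool:
    """Hypothetical guard: accept only matches that captured a non-empty label."""
    return bool(match.groupdict().get("label"))


with_env = FALLBACK.match("4.24 in water")
bare = FALLBACK.match("-1.34")
```

Both strings match, but neither match has a `label` group, so `passes_label_guard` returns `False` for each of them.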
_PATTERN_LABELED = re.compile('\n ^\\s*\n (?P<label>pK(?:a)?\\d*) # pK, pKa, pK1, pKa2, etc.\n \\s*(?:=|:)\\s*\n (?P<value>-?[\\d.]+) # numeric value, allow negative\n (?:\\s*\\(SRC:\\s*(?P<comment>[^)]+)\\))? # optional (SRC: ...)\n (?:\\s+at\\s+(?P<temp>[\\d.]+)\\s*°?C)? # optional "at XX °C"\n ', re.IGNORECASE | re.VERBOSE)
Labeled pKa with optional SRC comment and optional temperature. Matches:
pKa3 = -2.03, pK1 = 2.36 (SRC: carboxylic acid), pKa = 10.4 at 40 °C
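The named groups are easiest to follow on concrete inputs. The sketch below recompiles the pattern shown above and applies it with `match`; `labeled_groups` is a hypothetical helper, and whether the module calls `match` or `search` is an assumption.

```python
import re

LABELED = re.compile(
    r"""
    ^\s*
    (?P<label>pK(?:a)?\d*)                   # pK, pKa, pK1, pKa2, etc.
    \s*(?:=|:)\s*
    (?P<value>-?[\d.]+)                      # numeric value, allow negative
    (?:\s*\(SRC:\s*(?P<comment>[^)]+)\))?    # optional (SRC: ...)
    (?:\s+at\s+(?P<temp>[\d.]+)\s*°?C)?      # optional "at XX °C"
    """,
    re.IGNORECASE | re.VERBOSE,
)


def labeled_groups(text: str):
    """Return the captured groups as a dict, or None when the text does not match."""
    match = LABELED.match(text)
    return match.groupdict() if match else None
```

Because the pattern is anchored at the start of the string, input with leading prose such as `Weak acid. pK (25 °C): 3.35` does not match here and is left to the temperature-prefix pattern.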
_PATTERN_MULTI_VALUE = re.compile('pKa values are\\s+([\\d\\.,\\sand-]+)', re.IGNORECASE)
Pre-compiled pattern for the multi-value sentence form.
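The pattern captures the whole run of numbers, commas, and "and" as a single group, so the individual values still have to be pulled out of it. A minimal sketch, assuming a second numeric pattern is used for that step (`NUMBER` and `multi_values` are hypothetical names):

```python
import re

MULTI_VALUE = re.compile(r"pKa values are\s+([\d\.,\sand-]+)", re.IGNORECASE)
# Hypothetical helper pattern for pulling individual numbers out of the group.
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def multi_values(line: str):
    """Return the listed pKa values as floats, or None if the sentence form is absent."""
    match = MULTI_VALUE.search(line)
    if match is None:
        return None
    return [float(token) for token in NUMBER.findall(match.group(1))]
```

A string without the sentence form, such as `pKa = 20`, returns `None`, which lets the caller move on to semicolon splitting.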
_PATTERN_TEMP_PREFIX = re.compile('\n (?P<label>pK(?:a)?) # pK or pKa\n \\s*\n (?:\\(\\s*(?P<temp>[\\d.]+)\\s*°?C\\))? # optional (25 °C)\n \\s*[:=]\\s*\n (?P<value>-?[\\d.]+) # numeric value\n ', re.IGNORECASE | re.VERBOSE)
Temperature-prefix form. Matches:
Weak acid. pK (25 °C): 3.35, pK (25 °C) = 4.5
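Unlike the labeled pattern, this one has no start anchor, so it can skip leading prose when applied with `search`. The sketch below recompiles the pattern shown above; `temp_prefix_groups` is a hypothetical helper, and the use of `search` is an assumption about how the module applies it.

```python
import re

TEMP_PREFIX = re.compile(
    r"""
    (?P<label>pK(?:a)?)                      # pK or pKa
    \s*
    (?:\(\s*(?P<temp>[\d.]+)\s*°?C\))?       # optional (25 °C)
    \s*[:=]\s*
    (?P<value>-?[\d.]+)                      # numeric value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def temp_prefix_groups(text: str):
    """Return the captured groups as a dict, or None when nothing matches."""
    match = TEMP_PREFIX.search(text)
    return match.groupdict() if match else None
```

Numbered labels such as `pKa1 = 2.3` do not match this pattern, because the label group does not allow trailing digits.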
_extract_ids(entry)
Extract the CID and SID from a PubChem annotation entry.
Both identifiers must be present for the entry to be usable. If either is missing the entry is malformed and should be skipped.
| PARAMETER | DESCRIPTION |
|---|---|
| entry | A single annotation entry dict as returned by the PUG-View API. |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[int, int] \| None | A |
_extract_pclid(datum)
Extract the PCLID from a single datum's ExtendedReference.
The PCLID (PubChem Live Data Identifier) links a specific measurement to its source record. It is optional: not all depositors provide it.
| PARAMETER | DESCRIPTION |
|---|---|
| datum | A single data point dict from a PubChem annotation entry. |

| RETURNS | DESCRIPTION |
|---|---|
| int \| None | The integer PCLID if present, or |
_extract_string_value(datum)
Extract the raw pKa string from a datum's Value field.
| PARAMETER | DESCRIPTION |
|---|---|
| datum | A single data point dict from a PubChem annotation entry. |

| RETURNS | DESCRIPTION |
|---|---|
| str \| None | The raw string value if present, or |
_parse_multi_value_sentence(line)
Attempt to parse the "pKa values are X, Y, and Z" sentence form.
This form lists multiple unlabeled pKa values in a single prose sentence. When matched, individual values are extracted and returned without temperature or label information.
| PARAMETER | DESCRIPTION |
|---|---|
| line | The raw input string to test. |

| RETURNS | DESCRIPTION |
|---|---|
| list[ParsedPKa] \| None | A list of |
_parse_part(part, original_line)
Parse a single semicolon-split segment of a pKa string.
Delegates to _try_patterns and logs a warning when no pattern matches, so that parse_value stays free of logging concerns.
| PARAMETER | DESCRIPTION |
|---|---|
| part | A single segment, already stripped of leading/trailing whitespace and surrounding quotes. |
| original_line | The full original input line, passed through to |

| RETURNS | DESCRIPTION |
|---|---|
| ParsedPKa \| None | A |
_try_patterns(part, original_line)
Try each compiled pattern against a single text segment.
Patterns are attempted in priority order. A match is only accepted when it captures a non-empty label group, which filters out numeric strings that are not actually pKa values (e.g. density or solubility values deposited under the wrong heading).
| PARAMETER | DESCRIPTION |
|---|---|
| part | A single semicolon-split segment of the original input, with leading/trailing whitespace and quotes stripped. |
| original_line | The full original input line, retained verbatim for the |

| RETURNS | DESCRIPTION |
|---|---|
| ParsedPKa \| None | A |
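The priority loop with its label guard might look roughly like the sketch below. It is a simplified illustration: `try_patterns`, `SPECIFIC`, and `BARE` are hypothetical names, the two patterns are reduced stand-ins for the documented ones, and moving on to the next pattern after a label-less match is one plausible reading of the guard rather than confirmed behavior.

```python
import re
from typing import Optional, Sequence


def try_patterns(part: str, patterns: Sequence[re.Pattern]) -> Optional[dict]:
    """Return the groups of the first pattern that matches with a non-empty label."""
    for pattern in patterns:
        match = pattern.search(part)
        if match is None:
            continue
        groups = match.groupdict()
        if not groups.get("label"):
            continue  # label guard: reject numbers without pK context
        return groups
    return None


# Hypothetical stand-ins, ordered from most to least specific.
SPECIFIC = re.compile(
    r"^\s*(?P<label>pKa?\d*)\s*[=:]\s*(?P<value>-?[\d.]+)", re.IGNORECASE
)
BARE = re.compile(r"^\s*(?P<value>-?[\d.]+)")
```

With these stand-ins, `try_patterns("pKa2 = 4.76", [SPECIFIC, BARE])` returns the labeled groups, while a misfiled density string such as `"1.05 g/mL"` matches only `BARE`, fails the guard, and returns `None`.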
from_page(annotation_data)
Convert a list of PubChem annotation entries into a tidy DataFrame.
Each entry in annotation_data corresponds to a single depositor record for one compound. This method extracts the compound and source identifiers, then parses every free-text pKa string within the entry into structured FlatPKaRecord rows.
Entries with missing CID or SID are skipped with a WARNING log. Individual data points whose string value cannot be extracted or parsed are skipped with a WARNING log. Broad exception types are never swallowed: only specific, expected failure modes are handled.
| PARAMETER | DESCRIPTION |
|---|---|
| annotation_data | A list of annotation entry dicts as returned by |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A polars DataFrame with one row per parsed pKa value and the following columns: cid, sid, pclid, pka_label, pka_value, temperature_C, comment. Returns an empty DataFrame with the above schema when no valid rows could be extracted from |

| RAISES | DESCRIPTION |
|---|---|
| TypeError | If |
parse_value(line)
Parse a free-text pKa string into a list of structured records.
The input may contain one or more pKa values separated by semicolons, or a prose sentence listing multiple values. Each recognized value is returned as a ParsedPKa record. Segments that cannot be matched to any known format are logged at WARNING level and excluded from the output.
| PARAMETER | DESCRIPTION |
|---|---|
| line | A raw pKa string as deposited in PubChem, for example: |

| RETURNS | DESCRIPTION |
|---|---|
| list[ParsedPKa] | A list of |
FlatPKaRecord
ParsedPKa
Bases: TypedDict
A single parsed pKa value extracted from a free-text string.
comment
The original source line with commas replaced by semicolons,
retained for provenance. None for multi-value sentence parses.
pka_label
The label identifying which ionization site this value belongs to
(e.g. "pKa1", "pK2"). None when the source text does not include a label.
pka_value
The numeric pKa value.
temperature_C
The temperature in degrees Celsius at which the measurement was made, if stated.
None when not specified.
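For orientation, the four fields could be declared along the following lines. This is a sketch only: the class name `ParsedPKaSketch` is hypothetical, and the field types are assumptions inferred from _OUTPUT_SCHEMA and the None cases described above, not copied from the module.

```python
from typing import Optional, TypedDict


class ParsedPKaSketch(TypedDict):
    """Hypothetical mirror of ParsedPKa; field types are inferred, not confirmed."""

    pka_label: Optional[str]         # e.g. "pKa1"; None when the text has no label
    pka_value: float                 # the numeric pKa value
    temperature_C: Optional[float]   # None when no temperature is stated
    comment: Optional[str]           # source line kept for provenance; None for sentence parses


record: ParsedPKaSketch = {
    "pka_label": "pK1",
    "pka_value": 2.36,
    "temperature_C": None,
    "comment": "pK1 = 2.36 (SRC: carboxylic acid)",
}
```

A TypedDict adds no runtime validation; it only lets a static type checker flag missing or misspelled keys.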