Dissociation constant

Parsing and extraction utilities for PubChem dissociation constant data.

This module processes raw annotation entries from the PubChem PUG-View API for the "Dissociation Constants" heading. Each entry contains one or more free-text pKa strings deposited by data providers, which are highly inconsistent in format. The parsing pipeline normalizes these strings into structured records suitable for downstream analysis.

Typical usage

Fetch data via PubChemAPI and pass it directly to DissociationConstantData.from_page:

from pcdigitizer import PubChemAPI, Annotation
from pcdigitizer.data import DissociationConstantData

raw = PubChemAPI.get_data(Annotation.DISSOCIATION_CONSTANTS)
df = DissociationConstantData.from_page(raw)

`_OUTPUT_SCHEMA = {'cid': pl.Int64, 'sid': pl.Int64, 'pclid': pl.Int64, 'pka_label': pl.String, 'pka_value': pl.Float64, 'temperature_C': pl.Float64, 'comment': pl.String}`

Expected output schema (used to return an empty DataFrame safely).

`DissociationConstantData`

Bases: AnnotationProcessor

Parse and assemble PubChem dissociation constant annotation data.

All methods are static or class methods. The primary public interface is from_page, which converts a raw list of PubChem annotation entries into a tidy polars DataFrame.

The parsing pipeline for free-text pKa strings works as follows:

parse_value is the top-level dispatcher. It first checks for the "pKa values are X, Y, and Z" sentence form via _parse_multi_value_sentence. If that form does not match, it splits the input on semicolons and delegates each segment to _parse_part.
_parse_part tries each compiled pattern in _PATTERNS in priority order via _try_patterns, returning the first successful ParsedPKa or None.
Patterns are compiled once at class definition time and reused across all calls.
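The dispatch order described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the module's implementation: `parse_value_sketch`, `parse_part_sketch`, and the `NUMBER` helper are hypothetical names, a single labeled regex stands in for the full `_PATTERNS` list, and records are plain dicts rather than `ParsedPKa`.

```python
import re

# Simplified stand-ins for the module's compiled patterns (assumptions).
MULTI_VALUE = re.compile(r"pKa values are\s+([\d\.,\sand-]+)", re.IGNORECASE)
LABELED = re.compile(
    r"(?P<label>pKa?\d*)\s*[:=]\s*(?P<value>-?\d+(?:\.\d+)?)", re.IGNORECASE
)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def parse_part_sketch(part):
    """Return one record for a segment, or None when nothing matches."""
    match = LABELED.search(part)
    if match is None:
        return None
    return {"pka_label": match["label"], "pka_value": float(match["value"])}


def parse_value_sketch(line):
    """Check the multi-value sentence first, then split on semicolons."""
    sentence = MULTI_VALUE.search(line)
    if sentence is not None:
        return [
            {"pka_label": None, "pka_value": float(num)}
            for num in NUMBER.findall(sentence.group(1))
        ]
    records = []
    for part in line.split(";"):
        # Strip whitespace and surrounding quotes before matching.
        record = parse_part_sketch(part.strip().strip('"'))
        if record is not None:
            records.append(record)
    return records
```

Checking the sentence form before splitting matters because the sentence contains no labels, so its values would otherwise be rejected segment by segment.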

The following free-text forms are recognized (case-insensitive):

pKa values are 3.25, 4.76, and 6.17 (multi-value sentence)
pKa3 = -2.03
pK1 = 2.36 (SRC: carboxylic acid)
pKa = 10.4 at 40 °C (tertiary amine)
Weak acid. pK (25 °C): 3.35
pKa = 0.7 (caffeine cation)
pKa = 20

Lines that cannot be matched (e.g. density or solubility values mistakenly deposited under this heading) are logged at WARNING level and excluded from the output.

`_PATTERNS = [_PATTERN_LABELED, _PATTERN_TEMP_PREFIX, _PATTERN_FALLBACK]`

Ordered list used by _try_patterns, with higher-specificity patterns first.

`_PATTERN_FALLBACK = re.compile('\n ^\\s*\n (?P<value>-?[\\d.]+)\n (?:\\s+in\\s+(?P<env>.+))? # optional "in water / solvent" context\n ', re.IGNORECASE | re.VERBOSE)`

Fallback for bare numeric values, optionally followed by an environment context. Matches:

4.24 in water, -1.34

This pattern intentionally has no label group. Because a match is accepted only when it captures a non-empty label (see _try_patterns for the label guard), bare numbers without any pK context are always rejected.

`_PATTERN_LABELED = re.compile('\n ^\\s*\n (?P<label>pK(?:a)?\\d*) # pK, pKa, pK1, pKa2, etc.\n \\s*(?:=|:)\\s*\n (?P<value>-?[\\d.]+) # numeric value, allow negative\n (?:\\s*\\(SRC:\\s*(?P<comment>[^)]+)\\))? # optional (SRC: ...)\n (?:\\s+at\\s+(?P<temp>[\\d.]+)\\s*°?C)? # optional "at XX °C"\n ', re.IGNORECASE | re.VERBOSE)`

Labeled pKa with optional SRC comment and optional temperature. Matches:

pKa3 = -2.03, pK1 = 2.36 (SRC: carboxylic acid), pKa = 10.4 at 40 °C
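As a quick check of the labeled form, the sketch below runs the three example strings through a verbose regex. The pattern is an assumed reconstruction: the `*` quantifiers after `\s` and `\d` are inferred from the inline comments (`pK, pKa, pK1, pKa2`) and may differ from the module's source. `match_labeled` is a hypothetical helper.

```python
import re

# Assumed reconstruction of the labeled pattern; quantifiers are inferred.
PATTERN_LABELED = re.compile(
    r"""
    ^\s*
    (?P<label>pK(?:a)?\d*)                  # pK, pKa, pK1, pKa2, etc.
    \s*(?:=|:)\s*
    (?P<value>-?[\d.]+)                     # numeric value, allow negative
    (?:\s*\(SRC:\s*(?P<comment>[^)]+)\))?   # optional (SRC: ...)
    (?:\s+at\s+(?P<temp>[\d.]+)\s*°?C)?     # optional "at XX °C"
    """,
    re.IGNORECASE | re.VERBOSE,
)


def match_labeled(text):
    """Return the named groups as a dict, or None when the text does not match."""
    match = PATTERN_LABELED.search(text)
    return match.groupdict() if match else None


for text in (
    "pKa3 = -2.03",
    "pK1 = 2.36 (SRC: carboxylic acid)",
    "pKa = 10.4 at 40 °C",
):
    print(text, "->", match_labeled(text))
```

Groups that do not take part in a match come back as `None`, so a caller can map `temp` and `comment` straight onto nullable fields.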

`_PATTERN_MULTI_VALUE = re.compile('pKa values are\\s+([\\d\\.,\\sand-]+)', re.IGNORECASE)`

Pre-compiled pattern for the multi-value sentence form.

`_PATTERN_TEMP_PREFIX = re.compile('\n (?P<label>pK(?:a)?) # pK or pKa\n \\s*\n (?:\\(\\s*(?P<temp>[\\d.]+)\\s*°?C\\))? # optional (25 °C)\n \\s*[:=]\\s*\n (?P<value>-?[\\d.]+) # numeric value\n ', re.IGNORECASE | re.VERBOSE)`

Temperature-prefix form, in which the temperature appears in parentheses between the label and the value. Matches:

Weak acid. pK (25 °C): 3.35, pK (25 °C) = 4.5
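The temperature-prefix form can be exercised in the same way. The sketch below writes the pattern with its `\s*` quantifiers spelled out and applies it with `search`, since the pattern is not anchored and may be preceded by prose such as "Weak acid."; `match_temp_prefix` is a hypothetical helper, not part of the module.

```python
import re

PATTERN_TEMP_PREFIX = re.compile(
    r"""
    (?P<label>pK(?:a)?)                     # pK or pKa
    \s*
    (?:\(\s*(?P<temp>[\d.]+)\s*°?C\))?      # optional (25 °C)
    \s*[:=]\s*
    (?P<value>-?[\d.]+)                     # numeric value
    """,
    re.IGNORECASE | re.VERBOSE,
)


def match_temp_prefix(text):
    """Return the named groups as a dict, or None when the text does not match."""
    match = PATTERN_TEMP_PREFIX.search(text)
    return match.groupdict() if match else None


print(match_temp_prefix("Weak acid. pK (25 °C): 3.35"))
print(match_temp_prefix("pK (25 °C) = 4.5"))
```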

`_extract_ids(entry)`

Extract the CID and SID from a PubChem annotation entry.

Both identifiers must be present for the entry to be usable. If either is missing, the entry is malformed and should be skipped.

PARAMETER	DESCRIPTION
`entry`	A single annotation entry dict as returned by the PUG-View API. TYPE: `AnnotationEntry`

RETURNS	DESCRIPTION
`tuple[int, int] | None`	A `(cid, sid)` tuple of integers, or `None` if either key is absent or the `CID` list is empty.

`_extract_pclid(datum)`

Extract the PCLID from a single datum's ExtendedReference.

The PCLID (PubChem Live Data Identifier) links a specific measurement to its source record. It is optional: not all depositors provide it.

PARAMETER	DESCRIPTION
`datum`	A single data point dict from a PubChem annotation entry. TYPE: `PubChemDatum`

RETURNS	DESCRIPTION
`int | None`	The integer PCLID if present, or `None` if the key path does not exist.

`_extract_string_value(datum)`

Extract the raw pKa string from a datum's Value field.

PARAMETER	DESCRIPTION
`datum`	A single data point dict from a PubChem annotation entry. TYPE: `PubChemDatum`

RETURNS	DESCRIPTION
`str | None`	The raw string value if present, or `None` if the expected key path does not exist or is empty.
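A minimal sketch of this kind of defensive lookup is shown below. The key path `Value` → `StringWithMarkup` → first item → `String` is an assumption based on the usual PUG-View JSON layout and is not confirmed by this page; `extract_string_value_sketch` is a hypothetical name.

```python
def extract_string_value_sketch(datum):
    """Return the raw string for a datum, or None if the path is missing or empty."""
    try:
        # Assumed PUG-View layout: {"Value": {"StringWithMarkup": [{"String": ...}]}}
        value = datum["Value"]["StringWithMarkup"][0]["String"]
    except (KeyError, IndexError, TypeError):
        # Only the specific lookup failures are handled; anything else propagates.
        return None
    return value or None


datum = {"Value": {"StringWithMarkup": [{"String": "pKa = 4.76"}]}}
print(extract_string_value_sketch(datum))
print(extract_string_value_sketch({"Value": {}}))
```

Catching only `KeyError`, `IndexError`, and `TypeError` keeps the lookup tolerant of incomplete entries without hiding unrelated errors.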

`_parse_multi_value_sentence(line)`

Attempt to parse the "pKa values are X, Y, and Z" sentence form.

This form lists multiple unlabeled pKa values in a single prose sentence. When matched, individual values are extracted and returned without temperature or label information.

PARAMETER	DESCRIPTION
`line`	The raw input string to test. TYPE: `str`

RETURNS	DESCRIPTION
`list[ParsedPKa] | None`	A list of `ParsedPKa` records, one per numeric value found in the sentence, or `None` if this sentence form is not present in `line`.
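The sentence form can be handled in two steps: locate the value list with the multi-value pattern, then pull the individual numbers out of the captured text. The sketch below assumes a second regex, `NUMBER`, for that extraction step, which this page does not describe; `parse_multi_value_sentence_sketch` is a hypothetical name and records are plain dicts.

```python
import re

PATTERN_MULTI_VALUE = re.compile(r"pKa values are\s+([\d\.,\sand-]+)", re.IGNORECASE)
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")  # hypothetical helper pattern


def parse_multi_value_sentence_sketch(line):
    """Return one record per number in the sentence, or None if the form is absent."""
    match = PATTERN_MULTI_VALUE.search(line)
    if match is None:
        return None
    return [
        # No label, temperature, or comment is available for this form.
        {"pka_label": None, "pka_value": float(num), "temperature_C": None, "comment": None}
        for num in NUMBER.findall(match.group(1))
    ]


records = parse_multi_value_sentence_sketch("pKa values are 3.25, 4.76, and 6.17")
print([r["pka_value"] for r in records])
```

Returning `None` rather than an empty list for a non-matching line lets the caller distinguish "not this form" from "this form, but no values".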

`_parse_part(part, original_line)`

Parse a single semicolon-split segment of a pKa string.

Delegates to _try_patterns and logs a warning when no pattern matches, so that parse_value stays free of logging concerns.

PARAMETER	DESCRIPTION
`part`	A single segment, already stripped of leading/trailing whitespace and surrounding quotes. TYPE: `str`
`original_line`	The full original input line, passed through to `_try_patterns` for provenance. TYPE: `str`

RETURNS	DESCRIPTION
`ParsedPKa | None`	A `ParsedPKa` record, or `None` if the segment could not be matched to any known pKa format.

`_try_patterns(part, original_line)`

Try each compiled pattern against a single text segment.

Patterns are attempted in priority order. A match is only accepted when it captures a non-empty label group, which filters out numeric strings that are not actually pKa values (e.g. density or solubility values deposited under the wrong heading).

PARAMETER	DESCRIPTION
`part`	A single semicolon-split segment of the original input, with leading/trailing whitespace and quotes stripped. TYPE: `str`
`original_line`	The full original input line, retained verbatim for the `comment` field (with commas replaced by semicolons). TYPE: `str`

RETURNS	DESCRIPTION
`ParsedPKa | None`	A `ParsedPKa` record if a pattern matches and a label is present, or `None` if no pattern yields a valid match.
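The priority loop and label guard described above might look roughly like the sketch below. It is illustrative only: the two patterns are simplified stand-ins for the module's list, `try_patterns_sketch` is a hypothetical name, and whether the real code continues to the next pattern or stops after a label-less match is an assumption.

```python
import re

# Simplified stand-ins: the first pattern captures a label, the second does not.
LABELED = re.compile(
    r"^\s*(?P<label>pKa?\d*)\s*[:=]\s*(?P<value>-?\d+(?:\.\d+)?)", re.IGNORECASE
)
FALLBACK = re.compile(r"^\s*(?P<value>-?[\d.]+)(?:\s+in\s+(?P<env>.+))?", re.IGNORECASE)
PATTERNS = [LABELED, FALLBACK]  # most specific first


def try_patterns_sketch(part, original_line):
    """Return a record for the first pattern that matches with a label, else None."""
    for pattern in PATTERNS:
        match = pattern.search(part)
        if match is None:
            continue
        label = match.groupdict().get("label")
        if not label:
            # Label guard: a bare number is not accepted as a pKa value.
            continue
        return {
            "pka_label": label,
            "pka_value": float(match["value"]),
            "temperature_C": None,
            # Commas become semicolons in the provenance comment.
            "comment": original_line.replace(",", ";"),
        }
    return None


print(try_patterns_sketch("pKa3 = -2.03", "pKa3 = -2.03, in water"))
print(try_patterns_sketch("4.24 in water", "4.24 in water"))
```

Using `groupdict().get("label")` lets patterns without a label group share the same loop without raising an error.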

`from_page(annotation_data)`

Convert a list of PubChem annotation entries into a tidy DataFrame.

Each entry in annotation_data corresponds to a single depositor record for one compound. This method extracts the compound and source identifiers, then parses every free-text pKa string within the entry into structured FlatPKaRecord rows.

Entries with missing CID or SID are skipped with a WARNING log. Individual data points whose string value cannot be extracted or parsed are skipped with a WARNING log. Broad exception types are never swallowed: only specific, expected failure modes are handled.

PARAMETER	DESCRIPTION
`annotation_data`	A list of annotation entry dicts as returned by `get_data` for the `"Dissociation Constants"` heading. Each entry is expected to conform to `AnnotationEntry`. TYPE: `list[AnnotationEntry]`

RETURNS	DESCRIPTION
`DataFrame`	A polars DataFrame with one row per parsed pKa value and the following columns:

cid (Int64): PubChem Compound Identifier.
sid (Int64): PubChem Substance Identifier.
pclid (Int64, nullable): PubChem Live Data Identifier.
pka_label (String, nullable): Ionization-site label.
pka_value (Float64): The numeric pKa value.
temperature_C (Float64, nullable): Measurement temperature.
comment (String, nullable): Original source line.

Returns an empty DataFrame with the above schema when no valid rows could be extracted from annotation_data.

RAISES	DESCRIPTION
`TypeError`	If `annotation_data` is not a list.

`parse_value(line)`

Parse a free-text pKa string into a list of structured records.

The input may contain one or more pKa values separated by semicolons, or a prose sentence listing multiple values. Each recognized value is returned as a ParsedPKa record. Segments that cannot be matched to any known format are logged at WARNING level and excluded from the output.

PARAMETER	DESCRIPTION
`line`	A raw pKa string as deposited in PubChem, for example: `"pKa = 10.4 at 40 °C (tertiary amine)"`, `"pKa1 = 3.25; pKa2 = 4.76"`, `"pKa values are 3.25, 4.76, and 6.17"`, `"Weak acid. pK (25 °C): 3.35"`, `"pKa = 0.7 (caffeine cation)"`, `"pKa = 20"`. TYPE: `str`

RETURNS	DESCRIPTION
`list[ParsedPKa]`	A list of `ParsedPKa` records, one per recognized pKa value. Returns an empty list if no values could be parsed.

`FlatPKaRecord`

Bases: TypedDict

A fully resolved pKa record with compound and source identifiers.

`cid`

PubChem Compound Identifier.

`comment`

See ParsedPKa.

`pclid`

PubChem Live Data Identifier linking to the specific measurement record, if available.

`pka_label`

See ParsedPKa.

`pka_value`

See ParsedPKa.

`sid`

PubChem Substance Identifier for the depositing source record.

`temperature_C`

See ParsedPKa.

`ParsedPKa`

Bases: TypedDict

A single parsed pKa value extracted from a free-text string.

`comment`

The original source line with commas replaced by semicolons, retained for provenance. None for multi-value sentence parses.

`pka_label`

The label identifying which ionization site this value belongs to (e.g. "pKa1", "pK2"). None when the source text does not include a label.

`pka_value`

The numeric pKa value.

`temperature_C`

The temperature in degrees Celsius at which the measurement was made, if stated. None when not specified.
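To show how the two record types relate, the sketch below declares both as `TypedDict` classes and merges a parsed value with its identifiers. The field types are inferred from the output schema and the descriptions on this page, and `flatten_sketch` is a hypothetical helper rather than a function of the module.

```python
from typing import Optional, TypedDict


class ParsedPKa(TypedDict):
    pka_label: Optional[str]
    pka_value: float
    temperature_C: Optional[float]
    comment: Optional[str]


class FlatPKaRecord(TypedDict):
    cid: int
    sid: int
    pclid: Optional[int]
    pka_label: Optional[str]
    pka_value: float
    temperature_C: Optional[float]
    comment: Optional[str]


def flatten_sketch(parsed: ParsedPKa, cid: int, sid: int, pclid: Optional[int] = None) -> FlatPKaRecord:
    """Attach compound and source identifiers to one parsed pKa value."""
    return FlatPKaRecord(cid=cid, sid=sid, pclid=pclid, **parsed)


parsed: ParsedPKa = {
    "pka_label": "pKa1",
    "pka_value": 3.25,
    "temperature_C": 25.0,
    "comment": "pKa1 = 3.25 at 25 °C",
}
print(flatten_sketch(parsed, cid=1234, sid=5678))
```

A `TypedDict` is an ordinary dict at runtime, so a list of such records can be passed to a DataFrame constructor without conversion.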
