Dissociation constant

Parsing and extraction utilities for PubChem dissociation constant data.

This module processes raw annotation entries from the PubChem PUG-View API for the "Dissociation Constants" heading. Each entry contains one or more free-text pKa strings deposited by data providers, which are highly inconsistent in format. The parsing pipeline normalizes these strings into structured records suitable for downstream analysis.

Typical usage

Fetch data via PubChemAPI and pass it directly to DissociationConstantData.from_page:

from pcdigitizer import PubChemAPI, Annotation
from pcdigitizer.data import DissociationConstantData

raw = PubChemAPI.get_data(Annotation.DISSOCIATION_CONSTANTS)
df = DissociationConstantData.from_page(raw)

_OUTPUT_SCHEMA = {'cid': pl.Int64, 'sid': pl.Int64, 'pclid': pl.Int64, 'pka_label': pl.String, 'pka_value': pl.Float64, 'temperature_C': pl.Float64, 'comment': pl.String}

Expected output schema (used to return an empty DataFrame safely)

DissociationConstantData

Bases: AnnotationProcessor

Parse and assemble PubChem dissociation constant annotation data.

All methods are static or class methods. The primary public interface is from_page, which converts a raw list of PubChem annotation entries into a tidy polars DataFrame.

The parsing pipeline for free-text pKa strings is the following.

  1. parse_value is the top-level dispatcher. It first checks for the "pKa values are X, Y, and Z" sentence form via _parse_multi_value_sentence. If that form does not match, it splits the input on semicolons and delegates each segment to _parse_part.

  2. _parse_part tries each compiled pattern in _PATTERNS in priority order via _try_patterns, returning the first successful ParsedPKa or None.

  3. Patterns are compiled once at class definition time and reused across all calls.

The following free-text forms are recognized (case-insensitive):

  • pKa values are 3.25, 4.76, and 6.17 (multi-value sentence)
  • pKa3 = -2.03
  • pK1 = 2.36 (SRC: carboxylic acid)
  • pKa = 10.4 at 40 °C (tertiary amine)
  • Weak acid. pK (25 °C): 3.35
  • pKa = 0.7 (caffeine cation)
  • pKa = 20

Lines that cannot be matched (e.g. density or solubility values mistakenly deposited under this heading) are logged at WARNING level and excluded from the output.
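The dispatch order described above can be sketched with the standard library alone. This is a minimal illustration, not the module's implementation: MULTI_VALUE is copied from the documented _PATTERN_MULTI_VALUE, while LABELED is a simplified stand-in for the _PATTERNS list, and the number-extraction regex and the dict-based records are assumptions.

```python
import re

# Copied from the documented _PATTERN_MULTI_VALUE.
MULTI_VALUE = re.compile(r"pKa values are\s+([\d\.,\sand-]+)", re.IGNORECASE)
# Simplified stand-in for the documented _PATTERNS list (assumption).
LABELED = re.compile(
    r"(?P<label>pKa?\d*)\s*[=:]\s*(?P<value>-?\d+(?:\.\d+)?)", re.IGNORECASE
)


def parse_value(line: str) -> list:
    """Try the sentence form first, then fall back to semicolon segments."""
    sentence = MULTI_VALUE.search(line)
    if sentence:
        numbers = re.findall(r"-?\d+(?:\.\d+)?", sentence.group(1))
        return [{"pka_label": None, "pka_value": float(n)} for n in numbers]

    records = []
    for part in line.split(";"):
        part = part.strip().strip("\"'")  # drop whitespace and surrounding quotes
        match = LABELED.search(part)
        if match is None:
            continue  # the documented module logs a WARNING at this point
        records.append(
            {"pka_label": match["label"], "pka_value": float(match["value"])}
        )
    return records
```

With this sketch, "pKa1 = 3.25; pKa2 = 4.76" yields two labeled records, the sentence form yields unlabeled records, and a stray string such as "density 1.2 g/mL" yields an empty list.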

_PATTERNS = [_PATTERN_LABELED, _PATTERN_TEMP_PREFIX, _PATTERN_FALLBACK]

Ordered list used by _try_patterns; higher-specificity patterns first.

_PATTERN_FALLBACK = re.compile('\n ^\\s*\n (?P<value>-?[\\d.]+)\n (?:\\s+in\\s+(?P<env>.+))? # optional "in water / solvent" context\n ', re.IGNORECASE | re.VERBOSE)

Fallback for bare numeric values in an environment context. Matches:

  • 4.24 in water, -1.34

This pattern intentionally has no label group. Because a match is only accepted when it captures a non-empty label, bare numbers without any pK context are always rejected (see _try_patterns for the label guard).
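The absence of a label can be checked by exercising the pattern on its own. The sketch below copies the documented regex into a local name (FALLBACK is chosen for illustration) and inspects what it captures.

```python
import re

# Local copy of the documented _PATTERN_FALLBACK, for illustration only.
FALLBACK = re.compile(
    r"""
    ^\s*
    (?P<value>-?[\d.]+)
    (?:\s+in\s+(?P<env>.+))?   # optional "in water / solvent" context
    """,
    re.IGNORECASE | re.VERBOSE,
)

match = FALLBACK.match("4.24 in water")
groups = match.groupdict()

# The pattern defines only "value" and "env", so a label lookup yields None.
label = groups.get("label")
```

Here groups holds the value "4.24" and the environment "water", and label is None, which is the condition a label guard would test for.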

_PATTERN_LABELED = re.compile('\n ^\\s*\n (?P<label>pK(?:a)?\\d*) # pK, pKa, pK1, pKa2, etc.\n \\s*(?:=|:)\\s*\n (?P<value>-?[\\d.]+) # numeric value, allow negative\n (?:\\s*\\(SRC:\\s*(?P<comment>[^)]+)\\))? # optional (SRC: ...)\n (?:\\s+at\\s+(?P<temp>[\\d.]+)\\s*°?C)? # optional "at XX °C"\n ', re.IGNORECASE | re.VERBOSE)

Labeled pKa with optional SRC comment and optional temperature. Matches:

  • pKa3 = -2.03, pK1 = 2.36 (SRC: carboxylic acid), pKa = 10.4 at 40 °C
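As a quick check of the three documented examples, the regex can be copied into a local name (LABELED, chosen for illustration) and applied with match, which suits the leading ^ anchor. How the module itself invokes the pattern is not shown in this reference, so treat the call style as an assumption.

```python
import re

# Local copy of the documented _PATTERN_LABELED, for illustration only.
LABELED = re.compile(
    r"""
    ^\s*
    (?P<label>pK(?:a)?\d*)                   # pK, pKa, pK1, pKa2, etc.
    \s*(?:=|:)\s*
    (?P<value>-?[\d.]+)                      # numeric value, allow negative
    (?:\s*\(SRC:\s*(?P<comment>[^)]+)\))?    # optional (SRC: ...)
    (?:\s+at\s+(?P<temp>[\d.]+)\s*°?C)?      # optional "at XX °C"
    """,
    re.IGNORECASE | re.VERBOSE,
)

examples = [
    "pKa3 = -2.03",
    "pK1 = 2.36 (SRC: carboxylic acid)",
    "pKa = 10.4 at 40 °C",
]
# One dict of named groups per example; unmatched optional groups are None.
captured = [LABELED.match(text).groupdict() for text in examples]
```

All captured values are still strings at this stage; converting value and temp to floats is left to the caller.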

_PATTERN_MULTI_VALUE = re.compile('pKa values are\\s+([\\d\\.,\\sand-]+)', re.IGNORECASE)

Pre-compiled pattern for the multi-value sentence form.

_PATTERN_TEMP_PREFIX = re.compile('\n (?P<label>pK(?:a)?) # pK or pKa\n \\s*\n (?:\\(\\s*(?P<temp>[\\d.]+)\\s*°?C\\))? # optional (25 °C)\n \\s*[:=]\\s*\n (?P<value>-?[\\d.]+) # numeric value\n ', re.IGNORECASE | re.VERBOSE)

Temperature-prefix form. Matches:

  • Weak acid. pK (25 °C): 3.35, pK (25 °C) = 4.5
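Unlike the labeled pattern, this regex has no ^ anchor, so it can find a pK token that follows leading prose. The sketch below assumes the pattern is applied with search (the call style is not stated in this reference) and uses a local copy named TEMP_PREFIX for illustration.

```python
import re

# Local copy of the documented _PATTERN_TEMP_PREFIX, for illustration only.
TEMP_PREFIX = re.compile(
    r"""
    (?P<label>pK(?:a)?)                   # pK or pKa
    \s*
    (?:\(\s*(?P<temp>[\d.]+)\s*°?C\))?    # optional (25 °C)
    \s*[:=]\s*
    (?P<value>-?[\d.]+)                   # numeric value
    """,
    re.IGNORECASE | re.VERBOSE,
)

text = "Weak acid. pK (25 °C): 3.35"
match = TEMP_PREFIX.search(text)  # search skips the "Weak acid." prefix

record = {
    "pka_label": match["label"],
    "pka_value": float(match["value"]),
    "temperature_C": float(match["temp"]),
}
```

Calling TEMP_PREFIX.match on the same text returns None, because match only tests the start of the string.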

_extract_ids(entry)

Extract the CID and SID from a PubChem annotation entry.

Both identifiers must be present for the entry to be usable. If either is missing the entry is malformed and should be skipped.

PARAMETER DESCRIPTION

entry

A single annotation entry dict as returned by the PUG-View API.

TYPE: AnnotationEntry

RETURNS DESCRIPTION
tuple[int, int] | None

A (cid, sid) tuple of integers, or None if either key is absent or the CID list is empty.
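The contract above (both identifiers required, empty CID list treated as missing) might look like the following. The key names "LinkedRecords", "CID", and "SID" are assumptions made for this sketch; the actual AnnotationEntry layout is not documented here and may differ.

```python
from typing import Any, Dict, Optional, Tuple


def extract_ids(entry: Dict[str, Any]) -> Optional[Tuple[int, int]]:
    """Return (cid, sid), or None when the entry lacks either identifier."""
    # Hypothetical key layout: the real AnnotationEntry keys may differ.
    cids = entry.get("LinkedRecords", {}).get("CID", [])
    sid = entry.get("SID")
    if not cids or sid is None:
        return None  # malformed entry; the caller is expected to skip it
    return int(cids[0]), int(sid)
```

Returning None rather than raising lets the caller skip malformed entries without wrapping every call in exception handling.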

_extract_pclid(datum)

Extract the PCLID from a single datum's ExtendedReference.

The PCLID (PubChem Live Data Identifier) links a specific measurement to its source record. It is optional: not all depositors provide it.

PARAMETER DESCRIPTION

datum

A single data point dict from a PubChem annotation entry.

TYPE: PubChemDatum

RETURNS DESCRIPTION
int | None

The integer PCLID if present, or None if the key path does not exist.

_extract_string_value(datum)

Extract the raw pKa string from a datum's Value field.

PARAMETER DESCRIPTION

datum

A single data point dict from a PubChem annotation entry.

TYPE: PubChemDatum

RETURNS DESCRIPTION
str | None

The raw string value if present, or None if the expected key path does not exist or is empty.
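A defensive lookup of this kind can be sketched as below. The key path Value, StringWithMarkup, first element, String is an assumption about the datum layout, and only the specific lookup failures are caught, in keeping with the module's stated policy of not swallowing broad exceptions.

```python
from typing import Any, Dict, Optional


def extract_string_value(datum: Dict[str, Any]) -> Optional[str]:
    """Return the raw string, or None if the key path is missing or empty."""
    try:
        # Assumed key path; the real PubChemDatum layout may differ.
        value = datum["Value"]["StringWithMarkup"][0]["String"]
    except (KeyError, IndexError, TypeError):
        return None
    return value or None  # treat an empty string as absent
```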

_parse_multi_value_sentence(line)

Attempt to parse the "pKa values are X, Y, and Z" sentence form.

This form lists multiple unlabeled pKa values in a single prose sentence. When matched, individual values are extracted and returned without temperature or label information.

PARAMETER DESCRIPTION

line

The raw input string to test.

TYPE: str

RETURNS DESCRIPTION
list[ParsedPKa] | None

A list of ParsedPKa records, one per numeric value found in the sentence, or None if this sentence form is not present in line.

_parse_part(part, original_line)

Parse a single semicolon-split segment of a pKa string.

Delegates to _try_patterns and logs a warning when no pattern matches, so that parse_value stays free of logging concerns.

PARAMETER DESCRIPTION

part

A single segment, already stripped of leading/trailing whitespace and surrounding quotes.

TYPE: str

original_line

The full original input line, passed through to _try_patterns for provenance.

TYPE: str

RETURNS DESCRIPTION
ParsedPKa | None

A ParsedPKa record, or None if the segment could not be matched to any known pKa format.

_try_patterns(part, original_line)

Try each compiled pattern against a single text segment.

Patterns are attempted in priority order. A match is only accepted when it captures a non-empty label group, which filters out numeric strings that are not actually pKa values (e.g. density or solubility values deposited under the wrong heading).

PARAMETER DESCRIPTION

part

A single semicolon-split segment of the original input, with leading/trailing whitespace and quotes stripped.

TYPE: str

original_line

The full original input line, retained verbatim for the comment field (with commas replaced by semicolons).

TYPE: str

RETURNS DESCRIPTION
ParsedPKa | None

A ParsedPKa record if a pattern matches and a label is present, or None if no pattern yields a valid match.
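The priority loop and label guard can be sketched as follows. PATTERNS holds two simplified stand-ins rather than the module's compiled patterns, temperature capture is omitted, and the dict-based record is an assumption; only the guard and the comma-to-semicolon rewrite of the comment follow the description above.

```python
import re
from typing import Optional

# Two stand-in patterns (simplified assumptions, not the module's own).
PATTERNS = [
    re.compile(r"^\s*(?P<label>pKa?\d*)\s*[=:]\s*(?P<value>-?[\d.]+)", re.IGNORECASE),
    re.compile(r"^\s*(?P<value>-?[\d.]+)"),  # label-free fallback
]


def try_patterns(part: str, original_line: str) -> Optional[dict]:
    """Return the first labeled match as a record, or None."""
    for pattern in PATTERNS:
        match = pattern.search(part)
        if match is None:
            continue
        label = match.groupdict().get("label")
        if not label:
            continue  # label guard: reject bare numbers
        return {
            "pka_label": label,
            "pka_value": float(match["value"]),
            "temperature_C": None,  # temperature capture omitted in this sketch
            "comment": original_line.replace(",", ";"),
        }
    return None
```

A segment such as "1.05" matches the fallback pattern but is still rejected, because no label was captured.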

from_page(annotation_data)

Convert a list of PubChem annotation entries into a tidy DataFrame.

Each entry in annotation_data corresponds to a single depositor record for one compound. This method extracts the compound and source identifiers, then parses every free-text pKa string within the entry into structured FlatPKaRecord rows.

Entries with missing CID or SID are skipped with a WARNING log. Individual data points whose string value cannot be extracted or parsed are skipped with a WARNING log. Broad exception types are never swallowed: only specific, expected failure modes are handled.

PARAMETER DESCRIPTION

annotation_data

A list of annotation entry dicts as returned by get_data for the "Dissociation Constants" heading. Each entry is expected to conform to AnnotationEntry.

TYPE: list[AnnotationEntry]

RETURNS DESCRIPTION
DataFrame

A polars DataFrame with one row per parsed pKa value and the following columns:

  • cid (Int64): PubChem Compound Identifier.
  • sid (Int64): PubChem Substance Identifier.
  • pclid (Int64, nullable): PubChem Live Data Identifier.
  • pka_label (String, nullable): Ionization-site label.
  • pka_value (Float64): The numeric pKa value.
  • temperature_C (Float64, nullable): Measurement temperature.
  • comment (String, nullable): Original source line.

Returns an empty DataFrame with the above schema when no valid rows could be extracted from annotation_data.

RAISES DESCRIPTION
TypeError

If annotation_data is not a list.
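The row-assembly logic can be illustrated without polars. In this sketch, flatten_entries, the flat "cid", "sid", and "strings" keys, and the injected parse_value callable are all hypothetical simplifications; the documented method reads the nested PUG-View layout and returns a polars DataFrame instead of a list of dicts.

```python
from typing import Any, Callable, Dict, List


def flatten_entries(
    annotation_data: Any,
    parse_value: Callable[[str], List[Dict[str, Any]]],
) -> List[Dict[str, Any]]:
    """Build one flat row per parsed pKa value."""
    if not isinstance(annotation_data, list):
        raise TypeError(
            f"annotation_data must be a list, got {type(annotation_data).__name__}"
        )
    rows = []
    for entry in annotation_data:
        cid, sid = entry.get("cid"), entry.get("sid")  # hypothetical flat keys
        if cid is None or sid is None:
            continue  # the documented method logs a WARNING here
        for raw in entry.get("strings", []):
            for parsed in parse_value(raw):
                rows.append({"cid": cid, "sid": sid, "pclid": None, **parsed})
    return rows
```

Passing the parser in as an argument keeps the sketch self-contained; an empty result corresponds to the case where the documented method returns an empty DataFrame with the expected schema.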

parse_value(line)

Parse a free-text pKa string into a list of structured records.

The input may contain one or more pKa values separated by semicolons, or a prose sentence listing multiple values. Each recognized value is returned as a ParsedPKa record. Segments that cannot be matched to any known format are logged at WARNING level and excluded from the output.

PARAMETER DESCRIPTION

line

A raw pKa string as deposited in PubChem, for example:

  • "pKa = 10.4 at 40 °C (tertiary amine)"
  • "pKa1 = 3.25; pKa2 = 4.76"
  • "pKa values are 3.25, 4.76, and 6.17"
  • "Weak acid. pK (25 °C): 3.35"
  • "pKa = 0.7 (caffeine cation)"
  • "pKa = 20"

TYPE: str

RETURNS DESCRIPTION
list[ParsedPKa]

A list of ParsedPKa records, one per recognized pKa value. Returns an empty list if no values could be parsed.

FlatPKaRecord

Bases: TypedDict

A fully resolved pKa record with compound and source identifiers.

cid

PubChem Compound Identifier.

comment

See ParsedPKa.

pclid

PubChem Live Data Identifier linking to the specific measurement record, if available.

pka_label

See ParsedPKa.

pka_value

See ParsedPKa.

sid

PubChem Substance Identifier for the depositing source record.

temperature_C

See ParsedPKa.

ParsedPKa

Bases: TypedDict

A single parsed pKa value extracted from a free-text string.

comment

The original source line with commas replaced by semicolons, retained for provenance. None for multi-value sentence parses.

pka_label

The label identifying which ionization site this value belongs to (e.g. "pKa1", "pK2"). None when the source text does not include a label.

pka_value

The numeric pKa value.

temperature_C

The temperature in degrees Celsius at which the measurement was made, if stated. None when not specified.
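For orientation, the two record types could be declared roughly as below. The field names come from this page; the field types are inferred from the output schema and the field descriptions, so they are assumptions rather than the module's actual annotations.

```python
from typing import Optional, TypedDict


class ParsedPKa(TypedDict):
    """One pKa value parsed from free text (types inferred, not confirmed)."""

    pka_label: Optional[str]        # e.g. "pKa1"; None when unlabeled
    pka_value: float
    temperature_C: Optional[float]  # None when not stated
    comment: Optional[str]          # source line, commas replaced by semicolons


class FlatPKaRecord(TypedDict):
    """A parsed value joined with its compound and source identifiers."""

    cid: int
    sid: int
    pclid: Optional[int]
    pka_label: Optional[str]
    pka_value: float
    temperature_C: Optional[float]
    comment: Optional[str]


parsed = ParsedPKa(pka_label="pKa1", pka_value=3.25, temperature_C=None, comment=None)
flat = FlatPKaRecord(cid=1, sid=2, pclid=None, **parsed)
```

A TypedDict is an ordinary dict at runtime, so flat can be passed anywhere a plain mapping of column names to values is expected.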