Pubchem

PubChemAPI

Client for the PubChem PUG-REST and PUG-View APIs.

All methods are class methods or static methods, so no instance state is required. Network access is centralized in make_request, which lets tests substitute a mock session for live HTTP calls.

ALLOWED_NAMESPACES = {'compound': {'cid', 'name', 'smiles', 'inchi', 'inchikey', 'formula', 'listkey'}, 'substance': {'sid', 'sourceid', 'sourceall', 'name', 'xref', 'listkey'}, 'assay': {'aid', 'listkey', 'type', 'sourceall', 'target', 'activity'}, 'gene': {'geneid', 'genesymbol', 'synonym'}, 'protein': {'accession', 'gi', 'synonym'}, 'pathway': {'pwacc'}, 'taxonomy': {'taxid', 'synonym'}, 'cell': {'cellacc', 'synonym'}, 'annotations': {'sourcename', 'headings', 'heading'}}

BASE_URL = 'https://pubchem.ncbi.nlm.nih.gov/rest/'

_IDENTIFIER_PATTERN = re.compile('^[\\w,\\.\\- ]+$')

_VALID_PUG_ENDPOINTS = frozenset(('pug', 'pug_view', 'pug_soap'))

_build_path(domain, encoded_key, encoded_val, encoded_identifiers, operation, output_format)

Assemble the URL path segments into a single slash-joined string.

PARAMETER DESCRIPTION

domain

The PubChem domain (e.g. "compound").

TYPE: str

encoded_key

The percent-encoded namespace key.

TYPE: str

encoded_val

The percent-encoded namespace value, or an empty string when absent.

TYPE: str

encoded_identifiers

The percent-encoded identifier string, or an empty string when absent.

TYPE: str

operation

An optional operation string (e.g. "property/MolecularWeight"). Slashes within this string are preserved as path separators.

TYPE: str | None

output_format

The desired output format (e.g. "JSON", "CSV").

TYPE: str

RETURNS DESCRIPTION
str

A slash-joined path string ending with the output format segment, suitable for appending to a PUG base URL.
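As a rough illustration of the joining rule described above, the following standalone sketch drops empty segments and appends the output format last. The name `join_path_segments` is hypothetical, and the segment order is an assumption inferred from the parameter list, not the actual implementation.

```python
def join_path_segments(domain, encoded_key, encoded_val,
                       encoded_identifiers, operation, output_format):
    """Join non-empty path segments with slashes (illustrative only)."""
    # Assumed order: domain, key, value, identifiers, operation, format.
    segments = [domain, encoded_key, encoded_val, encoded_identifiers]
    if operation:
        # The operation keeps its own slashes, so it may span several segments.
        segments.append(operation)
    segments.append(output_format)
    # Empty strings (absent value or identifiers) are dropped before joining.
    return "/".join(segment for segment in segments if segment)


print(join_path_segments("compound", "cid", "", "2244",
                         "property/MolecularWeight", "JSON"))
```

Filtering out empty strings avoids doubled slashes when the namespace has no value or the identifier string is absent.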

_encode_namespace(namespace)

Split and percent-encode a namespace string.

A namespace may be a bare key (e.g. "cid") or a key with a /-separated value (e.g. "sourcename/ChEBI"). Both segments are percent-encoded for safe inclusion in a URL path.

PARAMETER DESCRIPTION

namespace

The namespace string to encode, with an optional /-separated value component.

TYPE: str

RETURNS DESCRIPTION
tuple[str, str]

A two-tuple (encoded_key, encoded_value) where encoded_value is an empty string when no value component is present.
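The split-and-encode behavior can be approximated with the standard library. This is a minimal sketch, assuming the split happens at the first slash and that `urllib.parse.quote` with no safe characters is an acceptable encoding; `encode_namespace_sketch` is a hypothetical name, not the method itself.

```python
from urllib.parse import quote


def encode_namespace_sketch(namespace):
    """Return (encoded_key, encoded_value) for a namespace string."""
    # partition() yields an empty value when no "/" is present.
    key, _, value = namespace.partition("/")
    # safe="" (an assumption) percent-encodes every reserved character,
    # including any further slashes inside the value.
    return quote(key, safe=""), quote(value, safe="")


print(encode_namespace_sketch("cid"))
print(encode_namespace_sketch("sourcename/ChEBI"))
```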

_process_annotations(raw_annotations)

Group annotation names by their type.

PARAMETER DESCRIPTION

raw_annotations

A list of annotation records as returned by the PubChem API, each containing at minimum a "Type" key and a "Heading" key.

TYPE: list[AnnotationEntry]

RETURNS DESCRIPTION
dict[str, list[str]]

A dict mapping each annotation type (e.g. "Compound") to a list of heading names belonging to that type, in the order they were encountered.
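The grouping described above might look like the sketch below. The function name `group_headings_by_type` and the sample records are made up for illustration; only the "Type" and "Heading" keys come from the description.

```python
def group_headings_by_type(raw_annotations):
    """Map each annotation type to its headings, keeping encounter order."""
    grouped = {}
    for record in raw_annotations:
        # setdefault creates the list the first time a type is seen.
        grouped.setdefault(record["Type"], []).append(record["Heading"])
    return grouped


# Illustrative records, not real API output.
sample = [
    {"Type": "Compound", "Heading": "Dissociation Constants"},
    {"Type": "Gene", "Heading": "Example Gene Heading"},
    {"Type": "Compound", "Heading": "Example Compound Heading"},
]
print(group_headings_by_type(sample))
```

A plain dict preserves insertion order in Python 3.7+, so both the type keys and the headings within each type appear in the order they were encountered.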

_validate_components(domain, namespace, identifiers)

Validate the domain, namespace key, and identifier string.

The namespace may contain a / separator (e.g. sourcename/ChEBI); only the portion before the first / is checked against the allowed set for the given domain.

PARAMETER DESCRIPTION

domain

The PubChem domain (e.g. "compound", "annotations").

TYPE: str

namespace

The namespace string, optionally including a /-separated value (e.g. "sourcename/ChEBI").

TYPE: str

identifiers

The identifier string to look up. May be empty only for the "annotations" domain.

TYPE: str

RAISES DESCRIPTION
ValueError

If domain is not in ALLOWED_NAMESPACES.

ValueError

If the namespace key is not valid for the given domain.

ValueError

If identifiers is empty for a domain that requires it.

ValueError

If identifiers contains characters outside the set allowed by _IDENTIFIER_PATTERN: word characters (letters, digits, underscore), commas, periods, hyphens, and spaces.
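The four checks listed above can be sketched as follows. This is an approximation under stated assumptions: `ALLOWED` is a trimmed-down stand-in for ALLOWED_NAMESPACES, the error messages are invented, and the check order is a guess based on the order of the RAISES entries.

```python
import re

# Trimmed-down stand-in for ALLOWED_NAMESPACES (illustration only).
ALLOWED = {
    "compound": {"cid", "name", "smiles"},
    "annotations": {"sourcename", "headings", "heading"},
}
# Same pattern as _IDENTIFIER_PATTERN, written as a raw string.
IDENTIFIER_PATTERN = re.compile(r"^[\w,\.\- ]+$")


def validate_components_sketch(domain, namespace, identifiers):
    """Raise ValueError when any component fails its check."""
    if domain not in ALLOWED:
        raise ValueError(f"Unknown domain: {domain!r}")
    # Only the part before the first "/" is checked against the allowed set.
    key = namespace.split("/", 1)[0]
    if key not in ALLOWED[domain]:
        raise ValueError(f"Namespace {key!r} is not valid for {domain!r}")
    if not identifiers:
        # Empty identifiers are tolerated for the annotations domain only.
        if domain != "annotations":
            raise ValueError(f"Domain {domain!r} requires identifiers")
        return
    if not IDENTIFIER_PATTERN.match(identifiers):
        raise ValueError(f"Illegal characters in identifiers: {identifiers!r}")


validate_components_sketch("compound", "cid", "2244,3672")
validate_components_sketch("annotations", "sourcename/ChEBI", "")
```

Rejecting characters such as `/`, `?`, and `#` at this stage keeps user-supplied identifiers from altering the path or query structure of the final URL.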

_validate_url(url)

Verify that a URL is a safe, well-formed PubChem PUG endpoint.

PARAMETER DESCRIPTION

url

The URL string to validate.

TYPE: str

RAISES DESCRIPTION
ValueError

If the URL does not use HTTPS, does not point to pubchem.ncbi.nlm.nih.gov, or has a path that does not start with /rest/pug.
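One way to implement checks like these is to parse the URL and compare its components rather than matching on string prefixes. The sketch below is an assumption about approach, using `urllib.parse.urlsplit`; `validate_url_sketch` and its messages are hypothetical.

```python
from urllib.parse import urlsplit

EXPECTED_HOST = "pubchem.ncbi.nlm.nih.gov"


def validate_url_sketch(url):
    """Raise ValueError unless url is an HTTPS PubChem PUG URL."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("URL must use HTTPS")
    # Comparing the parsed hostname rejects look-alikes such as
    # pubchem.ncbi.nlm.nih.gov.example.com.
    if parts.hostname != EXPECTED_HOST:
        raise ValueError(f"URL must point to {EXPECTED_HOST}")
    if not parts.path.startswith("/rest/pug"):
        raise ValueError("URL path must start with /rest/pug")


validate_url_sketch("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/JSON")
```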

build_url(domain, namespace, pug='pug', identifiers='', operation=None, output_format='JSON', options=None)

Construct, validate, and return a fully formed PUG-REST URL.

PARAMETER DESCRIPTION

domain

The PubChem domain to query (e.g. "compound", "annotations"). Must be a key in ALLOWED_NAMESPACES.

TYPE: str

namespace

The namespace within the domain, optionally with a /-separated value (e.g. "name", "sourcename/ChEBI").

TYPE: str

pug

The PUG endpoint variant to use. Must be one of "pug", "pug_view", or "pug_soap". Defaults to "pug".

TYPE: str DEFAULT: 'pug'

identifiers

The record identifier(s) to look up, as a comma-separated string. May be empty for the "annotations" domain only.

TYPE: str DEFAULT: ''

operation

An optional operation to perform on the matched records (e.g. "property/MolecularWeight"). Slashes are treated as path separators.

TYPE: str | None DEFAULT: None

output_format

The response format requested from PubChem. Defaults to "JSON".

TYPE: str DEFAULT: 'JSON'

options

Optional query-string parameters appended to the URL (e.g. {"page": 2}).

TYPE: dict[str, str | int] | None DEFAULT: None

RETURNS DESCRIPTION
str

A fully constructed, validated HTTPS URL string ready to pass to make_request.

RAISES DESCRIPTION
ValueError

If pug is not a recognized endpoint variant.

ValueError

If domain, namespace, or identifiers fail validation (propagated from _validate_components).

ValueError

If the resulting URL fails the safety check (propagated from _validate_url).
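To show how the parameters map onto the final URL, here is a simplified standalone sketch. It covers only the endpoint check, path assembly, and query string; component and URL validation are omitted. The name `build_url_sketch`, the choice to leave commas unencoded in identifiers, and the use of `urlencode` for options are all assumptions.

```python
from urllib.parse import quote, urlencode

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/"
VALID_PUG_ENDPOINTS = frozenset(("pug", "pug_view", "pug_soap"))


def build_url_sketch(domain, namespace, pug="pug", identifiers="",
                     operation=None, output_format="JSON", options=None):
    """Assemble a PUG URL from its parts (validation steps omitted)."""
    if pug not in VALID_PUG_ENDPOINTS:
        raise ValueError(f"Unknown PUG endpoint: {pug!r}")
    key, _, value = namespace.partition("/")
    segments = [
        domain,
        quote(key, safe=""),
        quote(value, safe=""),
        quote(identifiers, safe=","),  # assumed: commas separate identifiers
        operation,
        output_format,
    ]
    # Skip absent parts (empty strings and None) to avoid doubled slashes.
    path = "/".join(segment for segment in segments if segment)
    url = f"{BASE_URL}{pug}/{path}"
    if options:
        url += "?" + urlencode(options)
    return url


print(build_url_sketch("compound", "cid", identifiers="2244",
                       operation="property/MolecularWeight"))
print(build_url_sketch("annotations", "heading", pug="pug_view",
                       identifiers="Dissociation Constants",
                       options={"page": 2}))
```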

get_annotations(session=None)

Retrieve all annotation headings available in PubChem.

Fetches and processes the results of /rest/pug/annotations/headings/JSON. The returned dict normally contains the following type keys: Assay, Cell, Compound, Element, Gene, Pathway, Protein, Taxonomy.

PARAMETER DESCRIPTION

session

An optional Session forwarded to make_request. Pass a mock during testing to avoid live network calls.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
dict[str, list[str]]

A dict mapping annotation type strings to lists of heading names belonging to that type.

RAISES DESCRIPTION
RuntimeError

If the HTTP request fails.

KeyError

If the response does not contain the expected InformationList.Annotation structure.

get_data(annotation, page=None, session=None)

Retrieve all records for a specific annotation.

Fetches data from the PUG-View annotations/heading/<heading> endpoint. Without a page argument this returns every result, which can be slow for popular headings.

PARAMETER DESCRIPTION

annotation

The PubChem annotation heading to download (e.g. Annotation.DISSOCIATION_CONSTANTS).

TYPE: Annotation

page

If provided, fetch only this specific page of results. Must be a positive integer. If None, all results are returned.

TYPE: int | None DEFAULT: None

session

An optional Session forwarded to make_request. Pass a mock during testing to avoid live network calls.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
list[AnnotationEntry]

A list of AnnotationEntry dicts for the requested data, in the order returned by the API.

RAISES DESCRIPTION
ValueError

If page is provided but is less than 1.

RuntimeError

If the HTTP request fails.

KeyError

If the response does not contain the expected Annotations structure.
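The difference between fetching one page and fetching everything can be sketched with a loop over page numbers. This is a hedged illustration, not the method's actual logic: `fetch_page` is a hypothetical callable that returns the parsed JSON for one page, and the `"Annotation"` and `"TotalPages"` keys inside `"Annotations"` are assumptions about the response layout.

```python
def collect_annotation_pages(fetch_page, page=None):
    """Return entries for one page, or for all pages when page is None."""
    if page is not None:
        if page < 1:
            raise ValueError("page must be a positive integer")
        return list(fetch_page(page)["Annotations"]["Annotation"])

    entries = []
    current = 1
    while True:
        # A missing "Annotations" key surfaces as KeyError.
        block = fetch_page(current)["Annotations"]
        entries.extend(block["Annotation"])
        # Assumed: the response reports how many pages exist in total.
        if current >= block.get("TotalPages", 1):
            return entries
        current += 1


# Fake two-page response set, standing in for live API calls.
fake_pages = {
    1: {"Annotations": {"Annotation": [{"id": 1}], "TotalPages": 2}},
    2: {"Annotations": {"Annotation": [{"id": 2}], "TotalPages": 2}},
}
print(collect_annotation_pages(fake_pages.__getitem__))
```

Because the all-pages path issues one request per page, passing `page` explicitly is the cheaper option when only a sample of the data is needed.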

get_source_annotations(source_name, session=None)

Fetch all annotation headings deposited by a specific source.

Retrieves annotations from the annotations/sourcename/<source> endpoint and groups them by type via _process_annotations.

Note

output_format is not exposed as a parameter here because make_json always expects a JSON response. To retrieve raw non-JSON data from this endpoint, use build_url and make_request directly.

PARAMETER DESCRIPTION

source_name

The PubChem depositor source name to query (e.g. "ChEBI"). Forward slashes are replaced with periods as required by the PubChem API.

TYPE: str

session

An optional Session forwarded to make_request. Pass a mock during testing to avoid live network calls.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
dict[str, list[str]]

A dict mapping annotation type strings (e.g. "Compound") to lists of heading names provided by the given source.

RAISES DESCRIPTION
RuntimeError

If the HTTP request fails.

KeyError

If the response does not contain the expected InformationList.Annotation structure.

get_sources(session=None)

Fetch the full PubChem depositor source table as a DataFrame.

Retrieves the CSV source table from /rest/pug/sourcetable/all/CSV, which lists every organization that has deposited data into PubChem along with associated metadata.

PARAMETER DESCRIPTION

session

An optional Session forwarded to make_request. Pass a mock during testing to avoid live network calls.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
DataFrame

A polars.DataFrame with one row per depositor source.

RAISES DESCRIPTION
RuntimeError

If the HTTP request fails.

make_json(url, session=None)

Fetch a URL and parse the response body as JSON.

PARAMETER DESCRIPTION

url

The fully constructed PubChem REST URL to fetch.

TYPE: str

session

An optional Session forwarded to make_request. See that method for details.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
dict[str, object]

The top-level JSON object as a plain dict.

RAISES DESCRIPTION
RuntimeError

If the HTTP request fails (propagated from make_request).

JSONDecodeError

If the response body is not valid JSON.
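The parsing step can be illustrated in isolation. The sketch below assumes the raw body arrives as bytes and is decoded with the standard `json` module; `parse_json_body` is a hypothetical helper, not part of the class.

```python
import json


def parse_json_body(body):
    """Parse a raw response body (bytes or str) into Python objects."""
    # json.loads accepts bytes directly and detects the UTF encoding.
    return json.loads(body)


print(parse_json_body(b'{"InformationList": {"Annotation": []}}'))

try:
    parse_json_body(b"<html>Service unavailable</html>")
except json.JSONDecodeError as exc:
    # JSONDecodeError subclasses ValueError, so either can be caught.
    print(f"Not JSON: {exc.msg}")
```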

make_request(url, session=None)

Perform an HTTP GET and return the raw response body.

PARAMETER DESCRIPTION

url

The fully constructed PubChem REST URL to fetch.

TYPE: str

session

An optional Session to use for the request. When None, the module-level requests.get function is used. Pass a session (or a mock) during testing to avoid live network calls.

TYPE: Session | None DEFAULT: None

RETURNS DESCRIPTION
bytes

The raw response body.

RAISES DESCRIPTION
RuntimeError

If the server returns a non-200 HTTP status code.
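The session parameter makes the request path testable without network access. The following sketch shows one way that could work; `FakeResponse`, `FakeSession`, and `fetch_bytes` are hypothetical names, and the error message format is invented. The only assumption about a real Session is that it exposes `get(url)` and that responses carry `status_code` and `content`.

```python
class FakeResponse:
    """Minimal stand-in for a requests.Response."""

    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content


class FakeSession:
    """Records requested URLs and returns a canned response."""

    def __init__(self, response):
        self.response = response
        self.requested = []

    def get(self, url):
        self.requested.append(url)
        return self.response


def fetch_bytes(url, session=None):
    """GET a URL and return the raw body, raising on non-200 status."""
    if session is None:
        import requests  # imported lazily so the mocked path needs no install
        response = requests.get(url)
    else:
        response = session.get(url)
    if response.status_code != 200:
        raise RuntimeError(f"Request failed with status {response.status_code}")
    return response.content


session = FakeSession(FakeResponse(200, b'{"ok": true}'))
print(fetch_bytes("https://pubchem.ncbi.nlm.nih.gov/rest/pug/x", session=session))
```

Recording the requested URLs on the fake session lets a test assert both the returned body and the exact URL that was fetched.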