Pubchem
PubChemAPI
Client for the PubChem PUG-REST and PUG-View APIs.
All methods are class methods or static methods; no instance state is required. Network access is centralized in make_request so that it can be replaced with a mock session during testing.
ALLOWED_NAMESPACES = {'compound': {'cid', 'name', 'smiles', 'inchi', 'inchikey', 'formula', 'listkey'}, 'substance': {'sid', 'sourceid', 'sourceall', 'name', 'xref', 'listkey'}, 'assay': {'aid', 'listkey', 'type', 'sourceall', 'target', 'activity'}, 'gene': {'geneid', 'genesymbol', 'synonym'}, 'protein': {'accession', 'gi', 'synonym'}, 'pathway': {'pwacc'}, 'taxonomy': {'taxid', 'synonym'}, 'cell': {'cellacc', 'synonym'}, 'annotations': {'sourcename', 'headings', 'heading'}}
BASE_URL = 'https://pubchem.ncbi.nlm.nih.gov/rest/'
_IDENTIFIER_PATTERN = re.compile('^[\\w,\\.\\- ]+$')
_VALID_PUG_ENDPOINTS = frozenset(('pug', 'pug_view', 'pug_soap'))
_build_path(domain, encoded_key, encoded_val, encoded_identifiers, operation, output_format)
Assemble the URL path segments into a single slash-joined string.

| PARAMETER | DESCRIPTION |
|---|---|
| domain | The PubChem domain (e.g. …). |
| encoded_key | The percent-encoded namespace key. |
| encoded_val | The percent-encoded namespace value, or an empty string when absent. |
| encoded_identifiers | The percent-encoded identifier string, or an empty string when absent. |
| operation | An optional operation string (e.g. …). |
| output_format | The desired output format (e.g. …). |

| RETURNS | DESCRIPTION |
|---|---|
| str | A slash-joined path string ending with the output format segment, suitable for appending to a PUG base URL. |
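As a rough illustration of the joining step described above, the sketch below skips absent segments so that no empty path component appears. The function name build_path is a stand-in, and treating both empty strings and None as "absent" is an assumption; the real _build_path may differ.

```python
def build_path(domain, encoded_key, encoded_val, encoded_identifiers,
               operation, output_format):
    """Join the non-empty path segments with slashes (illustrative sketch only)."""
    segments = [domain, encoded_key, encoded_val, encoded_identifiers,
                operation, output_format]
    # Assumption: empty strings and None both mean "segment absent".
    return "/".join(segment for segment in segments if segment)


print(build_path("compound", "cid", "", "2244", None, "JSON"))
# compound/cid/2244/JSON
print(build_path("annotations", "headings", "", "", None, "JSON"))
# annotations/headings/JSON
```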
_encode_namespace(namespace)
Split and percent-encode a namespace string.
A namespace may be a bare key (e.g. "cid") or a key with a /-separated value (e.g. "sourcename/ChEBI"). Both segments are percent-encoded for safe inclusion in a URL path.

| PARAMETER | DESCRIPTION |
|---|---|
| namespace | The namespace string to encode, with an optional … |

| RETURNS | DESCRIPTION |
|---|---|
| tuple[str, str] | A two-tuple … |
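The split-and-encode behavior described above could look roughly like the following. This is a sketch under assumptions: the name encode_namespace is a stand-in, the tuple order (key, value) is inferred from the description, and splitting on the first slash with no characters left unescaped is a guess at the real behavior.

```python
from urllib.parse import quote


def encode_namespace(namespace):
    """Split on the first "/" and percent-encode both parts (illustrative sketch)."""
    key, _, value = namespace.partition("/")
    # safe="" encodes every reserved character, including "/" inside the value.
    return quote(key, safe=""), quote(value, safe="")


print(encode_namespace("cid"))                     # ('cid', '')
print(encode_namespace("sourcename/ChEBI"))        # ('sourcename', 'ChEBI')
print(encode_namespace("sourcename/Some Source"))  # ('sourcename', 'Some%20Source')
```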
_process_annotations(raw_annotations)
Group annotation names by their type.

| PARAMETER | DESCRIPTION |
|---|---|
| raw_annotations | A list of annotation records as returned by the PubChem API, each containing at minimum a … |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, list[str]] | A dict mapping each annotation type (e.g. … |
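The grouping step can be sketched as follows. The record keys "Type" and "Heading" are assumptions, since the field names are not visible in this page, and the sample records are hand-written for illustration rather than real API output.

```python
from collections import defaultdict


def process_annotations(raw_annotations):
    """Group heading names by annotation type (illustrative sketch)."""
    grouped = defaultdict(list)
    for record in raw_annotations:
        # Assumed record shape: {"Type": ..., "Heading": ...}
        grouped[record["Type"]].append(record["Heading"])
    return dict(grouped)


# Hand-written sample records, not real PubChem output.
sample = [
    {"Heading": "Heading A", "Type": "Compound"},
    {"Heading": "Heading B", "Type": "Gene"},
    {"Heading": "Heading C", "Type": "Compound"},
]
print(process_annotations(sample))
# {'Compound': ['Heading A', 'Heading C'], 'Gene': ['Heading B']}
```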
_validate_components(domain, namespace, identifiers)
Validate the domain, namespace key, and identifier string.
The namespace may contain a / separator (e.g. sourcename/ChEBI); only the portion before the first / is checked against the allowed set for the given domain.

| PARAMETER | DESCRIPTION |
|---|---|
| domain | The PubChem domain (e.g. …). |
| namespace | The namespace string, optionally including a … |
| identifiers | The identifier string to look up. May be empty only for the … |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If … |
| ValueError | If the namespace key is not valid for the given domain. |
| ValueError | If … |
| ValueError | If … |
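A minimal sketch of these checks, reusing two entries of ALLOWED_NAMESPACES and the identifier pattern listed at the top of this page. It is a simplification: the function name is a stand-in, the error messages are invented, and an empty identifier string is accepted for every domain here, whereas the real method restricts when it may be empty.

```python
import re

# Subset of ALLOWED_NAMESPACES from this page, trimmed for brevity.
ALLOWED_NAMESPACES = {
    "compound": {"cid", "name", "smiles", "inchi", "inchikey", "formula", "listkey"},
    "annotations": {"sourcename", "headings", "heading"},
}
IDENTIFIER_PATTERN = re.compile(r"^[\w,\.\- ]+$")


def validate_components(domain, namespace, identifiers):
    """Raise ValueError for an unknown domain, namespace key, or unsafe identifier."""
    if domain not in ALLOWED_NAMESPACES:
        raise ValueError(f"Unknown domain: {domain!r}")
    # Only the part before the first "/" is checked against the allowed set.
    key = namespace.split("/", 1)[0]
    if key not in ALLOWED_NAMESPACES[domain]:
        raise ValueError(f"Namespace {key!r} is not valid for domain {domain!r}")
    # Simplification: empty identifiers are always accepted in this sketch.
    if identifiers and not IDENTIFIER_PATTERN.match(identifiers):
        raise ValueError(f"Identifier contains disallowed characters: {identifiers!r}")


validate_components("compound", "cid", "2244")             # passes silently
validate_components("annotations", "sourcename/ChEBI", "")  # passes silently
```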
_validate_url(url)
Verify that a URL is a safe, well-formed PubChem PUG endpoint.

| PARAMETER | DESCRIPTION |
|---|---|
| url | The URL string to validate. |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If the URL does not use HTTPS, does not point to … |
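One way such a safety check might be written with the standard library is shown below. The exact rules are assumptions: the expected host is taken from BASE_URL, and requiring the path to start with /rest/ followed by one of the _VALID_PUG_ENDPOINTS values is inferred from the constants on this page, not confirmed by it.

```python
from urllib.parse import urlsplit

VALID_PUG_ENDPOINTS = frozenset(("pug", "pug_view", "pug_soap"))
EXPECTED_HOST = "pubchem.ncbi.nlm.nih.gov"  # host portion of BASE_URL


def validate_url(url):
    """Raise ValueError unless the URL is an HTTPS PubChem /rest/<pug>/ URL (sketch)."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        raise ValueError("URL must use HTTPS")
    # Compare the whole hostname so look-alike suffixes are rejected.
    if parts.hostname != EXPECTED_HOST:
        raise ValueError("URL must point to the PubChem host")
    segments = parts.path.strip("/").split("/")
    if len(segments) < 2 or segments[0] != "rest" or segments[1] not in VALID_PUG_ENDPOINTS:
        raise ValueError("URL must target a known PUG endpoint under /rest/")


validate_url("https://pubchem.ncbi.nlm.nih.gov/rest/pug/compound/cid/2244/JSON")  # passes
```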
build_url(domain, namespace, pug='pug', identifiers='', operation=None, output_format='JSON', options=None)
Construct, validate, and return a fully formed PUG-REST URL.

| PARAMETER | DESCRIPTION |
|---|---|
| domain | The PubChem domain to query (e.g. …). |
| namespace | The namespace within the domain, optionally with a … |
| pug | The PUG endpoint variant to use. Must be one of … |
| identifiers | The record identifier(s) to look up, as a comma-separated string. May be empty for the … |
| operation | An optional operation to perform on the matched records (e.g. …). |
| output_format | The response format requested from PubChem. Defaults to … |
| options | Optional query-string parameters appended to the URL (e.g. …). |

| RETURNS | DESCRIPTION |
|---|---|
| str | A fully constructed, validated HTTPS URL string ready to pass to … |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If … |
| ValueError | If … |
| ValueError | If the resulting URL fails the safety check (propagated from …). |
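To illustrate how optional query-string parameters might be appended to an already assembled URL, here is a small sketch using urllib.parse.urlencode. The helper name append_options and the option names heading and page are hypothetical; how build_url actually serializes options is not shown on this page.

```python
from urllib.parse import urlencode

BASE_URL = "https://pubchem.ncbi.nlm.nih.gov/rest/"


def append_options(url, options=None):
    """Append a query string built from a dict of options (illustrative sketch)."""
    if not options:
        return url
    # urlencode percent-encodes values and joins pairs with "&".
    return f"{url}?{urlencode(options)}"


url = BASE_URL + "pug_view/annotations/heading/JSON"
print(append_options(url))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON
print(append_options(url, {"heading": "Boiling Point", "page": 2}))
# https://pubchem.ncbi.nlm.nih.gov/rest/pug_view/annotations/heading/JSON?heading=Boiling+Point&page=2
```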
get_annotations(session=None)
Retrieve all annotation headings available in PubChem.
Fetches and processes the results of /rest/pug/annotations/headings/JSON. The returned dict normally contains the following type keys: Assay, Cell, Compound, Element, Gene, Pathway, Protein, Taxonomy.

| PARAMETER | DESCRIPTION |
|---|---|
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, list[str]] | A dict mapping annotation type strings to lists of heading names belonging to that type. |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the HTTP request fails. |
| KeyError | If the response does not contain the expected … |
get_data(annotation, page=None, session=None)
Retrieve all records for a specific annotation.
Fetches data from the PUG-View annotations/heading/<heading> endpoint. Without a page argument this returns every result, which can be slow for popular headings.

| PARAMETER | DESCRIPTION |
|---|---|
| annotation | The PubChem annotation heading to download (e.g. …). |
| page | If provided, fetch only this specific page of results. Must be a positive integer. If … |
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| list[AnnotationEntry] | A list of … |

| RAISES | DESCRIPTION |
|---|---|
| ValueError | If … |
| RuntimeError | If the HTTP request fails. |
| KeyError | If the response does not contain the expected Annotations structure. |
get_source_annotations(source_name, session=None)
Fetch all annotation headings deposited by a specific source.
Retrieves annotations from the annotations/sourcename/<source> endpoint and groups them by type via _process_annotations.

Note: output_format is not exposed as a parameter here because make_json always expects a JSON response. To retrieve raw non-JSON data from this endpoint, use build_url and make_request directly.

| PARAMETER | DESCRIPTION |
|---|---|
| source_name | The PubChem depositor source name to query (e.g. …). |
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, list[str]] | A dict mapping annotation type strings (e.g. … |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the HTTP request fails. |
| KeyError | If the response does not contain the expected … |
get_sources(session=None)
Fetch the full PubChem depositor source table as a DataFrame.
Retrieves the CSV source table from /rest/pug/sourcetable/all/CSV, which lists every organization that has deposited data into PubChem along with associated metadata.

| PARAMETER | DESCRIPTION |
|---|---|
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| DataFrame | A … |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the HTTP request fails. |
make_json(url, session=None)
Fetch a URL and parse the response body as JSON.

| PARAMETER | DESCRIPTION |
|---|---|
| url | The fully constructed PubChem REST URL to fetch. |
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| dict[str, object] | The top-level JSON object as a plain … |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the HTTP request fails (propagated from …). |
| JSONDecodeError | If the response body is not valid JSON. |
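The fetch-then-parse pattern described above can be sketched with the standard json module. Here the network call is replaced by a hypothetical fetch callable (URL in, bytes out) standing in for make_request, so the example runs offline; the real method takes a session argument instead.

```python
import json


def make_json(url, fetch):
    """Fetch raw bytes via the supplied callable and decode them as JSON (sketch)."""
    body = fetch(url)
    # json.loads accepts bytes and raises json.JSONDecodeError on invalid input.
    return json.loads(body)


# Stand-in fetchers that return canned bodies instead of touching the network.
print(make_json("https://example.invalid/ok", lambda url: b'{"a": 1}'))
# {'a': 1}

try:
    make_json("https://example.invalid/bad", lambda url: b"<html>")
except json.JSONDecodeError as exc:
    print("not JSON:", exc.msg)
```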
make_request(url, session=None)
Perform an HTTP GET and return the raw response body.

| PARAMETER | DESCRIPTION |
|---|---|
| url | The fully constructed PubChem REST URL to fetch. |
| session | An optional … |

| RETURNS | DESCRIPTION |
|---|---|
| bytes | The raw response body. |

| RAISES | DESCRIPTION |
|---|---|
| RuntimeError | If the server returns a non-200 HTTP status code. |
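Because all network access goes through this method, tests can pass in a fake session. The sketch below assumes the session exposes a requests-style get(url) method returning an object with status_code and content attributes; that interface is not stated on this page, and FakeSession, FakeResponse, and this make_request are illustrative stand-ins.

```python
class FakeResponse:
    """Minimal stand-in for an HTTP response (assumed attributes)."""

    def __init__(self, status_code, content):
        self.status_code = status_code
        self.content = content


class FakeSession:
    """Records requested URLs and returns a canned response."""

    def __init__(self, response):
        self.response = response
        self.requested = []

    def get(self, url):
        self.requested.append(url)
        return self.response


def make_request(url, session):
    """Return the raw body, or raise RuntimeError on a non-200 status (sketch)."""
    response = session.get(url)
    if response.status_code != 200:
        raise RuntimeError(f"Request to {url} failed with HTTP {response.status_code}")
    return response.content


url = "https://pubchem.ncbi.nlm.nih.gov/rest/pug/sourcetable/all/CSV"
ok_session = FakeSession(FakeResponse(200, b"payload"))
print(make_request(url, ok_session))  # b'payload'
print(ok_session.requested == [url])  # True
```

Injecting the session keeps the tests offline and lets them assert on which URLs were requested.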