Project

Project(array_store, table_store, label='prj')

Bases: Container

The project is the core interface you have for interacting with data through atomea.

Cadence

All data derived from atomistic calculations fall on one of two cadences: ensemble or microstate. Both concepts come from statistical mechanics.

[!NOTE] We do not support ensembles in which the number of particles changes (e.g., grand canonical).

Both are defined in the Cadence enum in atomea.

from atomea.data import Cadence

Microstate
Cadence.MICROSTATE

A microstate is a single, distinguishable configuration of particles (e.g., atoms) in which the system's thermodynamic variables are unchanged. In our calculations, a microstate could be a:

  • frame of a molecular dynamics simulation trajectory;
  • protein-ligand pose in a docking calculation;
  • transition state of a chemical reaction.

Data that could change between microstates, such as atomistic coordinates, energy, docking score, instantaneous temperature or pressure, dipole moment, electronic state, etc., are given a cadence of MICROSTATE.

Ensemble
Cadence.ENSEMBLE

An ensemble is a collection of microstates where "thermodynamic variables" (e.g., Hamiltonian, temperature, number of particles) are constant. A change in any of these variables changes the ensemble. This also extends to calculation parameters that could, intentionally or not, change properties of that atomistic system (e.g., force field, integration algorithm, docking scoring algorithm, barostat set point).
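
To make the distinction concrete, the sketch below tags a few quantities with the cadence on which they can change. It is self-contained and uses a stand-in enum so it runs without atomea installed; the enum values and the field names are assumptions chosen for illustration, not part of atomea's API.

```python
from enum import Enum


class Cadence(Enum):
    """Illustrative stand-in; atomea's own enum is imported from atomea.data."""

    ENSEMBLE = "ensemble"      # value is an assumption
    MICROSTATE = "microstate"  # value is an assumption


# Hypothetical field names mapped to the cadence on which they can change.
FIELD_CADENCE = {
    "coordinates": Cadence.MICROSTATE,
    "potential_energy": Cadence.MICROSTATE,
    "docking_score": Cadence.MICROSTATE,
    "force_field": Cadence.ENSEMBLE,
    "thermostat_temperature": Cadence.ENSEMBLE,
}


def fields_with(cadence):
    """Return the field names recorded on the given cadence, sorted by name."""
    return sorted(name for name, c in FIELD_CADENCE.items() if c is cadence)


microstate_fields = fields_with(Cadence.MICROSTATE)
ensemble_fields = fields_with(Cadence.ENSEMBLE)
```

Anything that differs from one frame, pose, or step to the next lands in `microstate_fields`; anything that defines the sampling conditions lands in `ensemble_fields`.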

Stores

Dimensionality of data determines how we represent, store, and analyze it. We define two working categories: scalars and n-dimensional.

Scalars

Values that are scalar with respect to each microstate (one value per microstate) are always stored in tables, using DataFrames, each in its own column. This includes energies, thermodynamic variables, calculation parameters, and other relevant factors. Data cadence has no influence on storage.

All data should be stored in a way that assumes multiple ensembles and microstates will be present. Each table item must include:

  • ens_id (str): A unique identification label for an ensemble. This can be "1", "default", "exp3829", etc.
  • run_id (str): A unique, independent run within the same ensemble. This often arises when running multiple independent molecular simulation trajectories with different random seeds.
  • micro_id (uint): An index specifying a microstate with some relationship to order. This can be a frame in a molecular simulation trajectory, docking scores from best to worst, optimization steps, etc.
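
A minimal sketch of what rows keyed this way might look like follows. Plain dictionaries stand in for DataFrame rows so the example runs with the standard library alone; the `energy` column, the run labels, and the rule that each (ens_id, run_id, micro_id) triple appears only once are assumptions made for illustration.

```python
REQUIRED_KEYS = ("ens_id", "run_id", "micro_id")

# Hypothetical scalar table: two runs in one ensemble, one run in another.
rows = [
    {"ens_id": "default", "run_id": "seed1", "micro_id": 0, "energy": -10.2},
    {"ens_id": "default", "run_id": "seed1", "micro_id": 1, "energy": -10.5},
    {"ens_id": "default", "run_id": "seed2", "micro_id": 0, "energy": -9.8},
    {"ens_id": "exp3829", "run_id": "seed1", "micro_id": 0, "energy": -12.1},
]


def validate(rows):
    """Check that each row has the three keys with sensible types.

    Also rejects repeated key triples, which is an assumption of this sketch.
    Returns the number of distinct microstate keys.
    """
    seen = set()
    for row in rows:
        # Raises KeyError if any required key is missing.
        key = tuple(row[name] for name in REQUIRED_KEYS)
        if not (isinstance(key[0], str) and isinstance(key[1], str)):
            raise TypeError("ens_id and run_id must be strings")
        if not (isinstance(key[2], int) and key[2] >= 0):
            raise ValueError("micro_id must be a non-negative integer")
        if key in seen:
            raise ValueError(f"duplicate microstate key: {key}")
        seen.add(key)
    return len(seen)


def select(rows, ens_id, run_id):
    """Return the rows of one run, ordered by micro_id."""
    picked = [r for r in rows if r["ens_id"] == ens_id and r["run_id"] == run_id]
    return sorted(picked, key=lambda r: r["micro_id"])


n_microstates = validate(rows)
seed1_energies = [r["energy"] for r in select(rows, "default", "seed1")]
```

Carrying all three keys on every row means a single table can hold several ensembles and runs without ambiguity.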

N-dimensional

Data that can hold more than one value per microstate must be stored in arrays with the appropriate number of dimensions for multiple values, even if only one value is present. Data for all runs of an ensemble are stored in a single array, since they are theoretically sampled from the same ensemble.

Data must also be stored in the same order as the table indices of that Container. Thus, row indices from tables can be used to slice arrays. Note that row indices can change between Containers since not all data is collected on the same cadence.
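
The relationship between table rows and array entries can be sketched as follows. Nested lists stand in for the array backend and a list of tuples stands in for the table, so nothing here reflects atomea's actual storage classes; the run labels and coordinate values are made up for illustration.

```python
# Key columns of a hypothetical table; the list position is the row index.
table = [
    ("default", "seed1", 0),
    ("default", "seed1", 1),
    ("default", "seed2", 0),
    ("default", "seed2", 1),
]

# One entry per table row, in the same order. Each entry has shape
# (n_atoms, 3); with a single atom the atom axis is still kept.
coordinates = [
    [[0.0, 0.0, 0.0]],
    [[0.1, 0.0, 0.0]],
    [[0.0, 0.2, 0.0]],
    [[0.0, 0.3, 0.0]],
]


def row_indices(table, run_id):
    """Return the row indices of every microstate belonging to run_id."""
    return [i for i, (_, rid, _) in enumerate(table) if rid == run_id]


# Row indices found in the table select the matching array entries.
seed2_rows = row_indices(table, "seed2")
seed2_coords = [coordinates[i] for i in seed2_rows]
```

Both runs live in one array because they belong to the same ensemble; the table is what tells them apart.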

PARAMETER DESCRIPTION

array_store

Storage backend for all arrays.

TYPE: ArrayStore

table_store

Storage backend for all tables.

TYPE: TableStore

id

Unique ID for this container.

energy = Energy(self)

label = label

quantum = Quantum(self)

time = Time(self)

__getitem__(ens_id)

__repr__()

add_ensemble(ens_id)

Create and register a new Ensemble with given ID, using the project's stores.

PARAMETER DESCRIPTION

ens_id

Unique label for the ensemble.

TYPE: str

RETURNS DESCRIPTION
Ensemble

The newly created Ensemble.

get_ensemble(ens_id)

Retrieve an existing Ensemble by its ID, or None if not found.

list_ensembles()

Return all ensemble IDs managed by this project.

remove_ensemble(ens_id)

Remove an Ensemble from the project by its ID.

TODO: Need to implement dropping tables and arrays.

RAISES DESCRIPTION
KeyError

If the ensemble does not exist.
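
The documented behavior of the four ensemble methods can be mimicked with a small stand-in registry. `ProjectStub` and `EnsembleStub` are hypothetical classes written for this sketch, not atomea's implementation; they only reproduce what the reference states (lookup returns None when missing, removal raises KeyError). What `add_ensemble` does with an ID that already exists is not specified above, so the stub simply overwrites.

```python
class EnsembleStub:
    """Hypothetical placeholder for an Ensemble; holds only its ID."""

    def __init__(self, ens_id):
        self.ens_id = ens_id


class ProjectStub:
    """Hypothetical registry mirroring the documented method names."""

    def __init__(self):
        self._ensembles = {}

    def add_ensemble(self, ens_id):
        # Assumption: an existing ID is overwritten; the reference does not say.
        ensemble = EnsembleStub(ens_id)
        self._ensembles[ens_id] = ensemble
        return ensemble

    def get_ensemble(self, ens_id):
        # Returns None when the ID is unknown.
        return self._ensembles.get(ens_id)

    def list_ensembles(self):
        return list(self._ensembles)

    def remove_ensemble(self, ens_id):
        # Raises KeyError when the ID is unknown.
        del self._ensembles[ens_id]


prj = ProjectStub()
prj.add_ensemble("default")
prj.add_ensemble("exp3829")
ids_before = prj.list_ensembles()

prj.remove_ensemble("default")
missing = prj.get_ensemble("default")

try:
    prj.remove_ensemble("default")
    raised = False
except KeyError:
    raised = True
```

Callers that are unsure whether an ensemble exists can check `get_ensemble` for None first, or wrap `remove_ensemble` in a `try`/`except KeyError` as shown.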