Data Access Overview¶
The bluesky ecosystem provides several modes for accessing data:
Access Central DataBroker via a Generic Remote Client — This includes Remote Desktop, Jupyter, and SSH.
Portable DataBroker with Local Data — Let users use databroker on their laptops and/or on servers at their home institutions, with all the relevant data copied locally and no need for a network connection.
Portable DataBroker with Remote Data — Let users use databroker on their laptops and/or on servers at their home institutions, pulling data from an HTTP server on demand and optionally caching it locally.
Traditional File Export — Export data to files for existing software that expects files in a certain format named a certain way.
Access Central DataBroker via a Generic Remote Client¶
In this mode, users do not install databroker locally. They use any remote client—such as Remote Desktop, Jupyter, or SSH—to access a Python environment on the source machine, and use databroker there, which presumably has fast access to the data storage and some compute resources.
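Once connected, usage looks like any local databroker session. A minimal sketch, assuming the facility has configured a catalog under the hypothetical name "central" in that environment:

    # A minimal sketch, assuming a facility-configured catalog named
    # "central" (a hypothetical name) in this Python environment.
    from databroker import catalog

    cat = catalog["central"]   # the facility's central catalog
    run = cat[-1]              # the most recent run
    data = run.primary.read()  # load the 'primary' stream as an xarray Dataset
    print(data)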
Portable DataBroker with Local Data¶
DataBroker is not itself a data store; it is a Python library for accessing data across a variety of data stores. Therefore, it can be run on a laptop without network connectivity, accessing data stored in ordinary files or in a local database. Both are officially supported.
The process involves:
1. Identify a subset of the data to be copied locally from the source institution, given as a query (e.g. a time range) or a list of unique identifiers.
2. Export the documents into a file-based format (typically msgpack).
3. Copy any of the large “external” files (e.g. TIFF or HDF5 files generated by large detectors).
4. Transfer all of this to the target machine, perhaps via rsync or Globus.
5. Place a configuration file discoverable by databroker that points to the location where the files were transferred (see the sketch after this list).
6. Install the Python library databroker on the target machine using pip or conda.
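Whether the configuration file is discoverable depends on where databroker searches on the target platform; the library can report those locations. A small sketch:

    # A small sketch: print the directories that databroker searches for
    # catalog configuration (YAML) files on this machine.
    import databroker

    print(databroker.catalog_search_path())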
DataBroker can work on top of a directory of ordinary files just fine; it even supports the same queries that it would normally run on a database—just less efficiently. Optionally, ingest the documents into a local database to support more efficient queries.
The small utility databroker-pack streamlines the process of “packing” some data from databroker into portable files and “unpacking” them at their destination.
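After unpacking, the packed data appears as just another catalog. A sketch, in which the catalog name "my_packed_data" is hypothetical:

    # A sketch, assuming databroker-unpack registered a catalog under the
    # hypothetical name "my_packed_data".
    from databroker import catalog

    print(list(catalog))               # the unpacked catalog is listed here
    cat = catalog["my_packed_data"]
    run = cat[-1]                      # access works just as it does centrally
    print(run.metadata["start"]["scan_id"])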
Portable DataBroker with Remote Data¶
In this mode, data copying would happen invisibly to the user and only on demand. The process involves:
1. Install the Python library databroker on the target machine using pip or conda.
2. Provide databroker with the URL of a remote data catalog running at the source facility.
The user experience from there is exactly the same whether the data happens to be local or remote. Thus, users could write code in one mode and seamlessly transition to the other.
Data is downloaded on demand, and it may be cached locally so that it need not be repeatedly downloaded. This requires a stable URL and a reliable network connection. There are no instances of this mode known at this time, but all the software pieces to achieve it exist. It is on the project roadmap.
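As a speculative sketch of how this could look: databroker's v1 API is built on intake, which can open a catalog served over the network. The URL below is hypothetical.

    # A speculative sketch; the server URL is hypothetical.
    from intake import open_catalog

    remote = open_catalog("intake://data.example-facility.org:5000")
    print(list(remote))  # entries exposed by the facility's server
    # From here, access would look the same as in the local case.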
Traditional File Export¶
Export the data to files (e.g. TIFFs and/or CSVs) with the metadata of your choice encoded in filenames. This mode forfeits much of the power of databroker and the bluesky ecosystem generally, but it is important for supporting existing workflows and software that expects files in a certain format named a certain way.
We expect this mode to become less useful as data sizes increase and scientific software literacy grows over time. It is a bridge.
Streaming Export¶
This means exporting the data during data acquisition such that partial results are available for reading. The bluesky suitcase project provides a pattern for doing this and ready-to-use implementations for popular formats.
The streaming export tools may also be used after data acquisition.
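As an illustration of the during-acquisition case, a Serializer from one of the suitcase packages can be subscribed to the RunEngine so that files are written as documents are emitted. A minimal sketch, assuming the suitcase-csv, bluesky, and ophyd packages are installed:

    # A minimal sketch of streaming export during acquisition, assuming
    # the suitcase-csv, bluesky, and ophyd packages are installed.
    from bluesky import RunEngine
    from bluesky.plans import count
    from ophyd.sim import det
    from suitcase.csv import Serializer

    RE = RunEngine()
    serializer = Serializer("exported_data/")  # directory to write into
    RE.subscribe(serializer)                   # receives each (name, document) pair
    RE(count([det], num=5))                    # rows are written as events stream in
    serializer.close()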
Prompt Export¶
This means exporting the data at the end of data acquisition. (To be precise, at the end of each “Bluesky Run”. The scope of a “Run” is up to the details of the data acquisition procedure.) This is typically much simpler than streaming export and can be implemented ad hoc by accessing the data from databroker and writing out a file using the relevant Python I/O library.
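For instance, an ad hoc exporter might look like this sketch, in which the catalog name "central" and the field names "motor" and "I0" are hypothetical:

    # A minimal sketch of ad hoc "prompt" export, run after a scan completes.
    # The catalog name and field names are hypothetical.
    from databroker import catalog

    run = catalog["central"][-1]                # the run that just completed
    ds = run.primary.read()                     # 'primary' stream as an xarray Dataset
    scan_id = run.metadata["start"]["scan_id"]  # metadata for the filename

    # Write selected columns to a CSV named after the scan.
    ds[["motor", "I0"]].to_dataframe().to_csv(f"scan_{scan_id}.csv")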