Administrator Documentation

When databroker is imported, it discovers catalogs available on the system. Users can list the discovered catalogs by importing the special global databroker.catalog object and listing its entries:

from databroker import catalog
list(catalog)  # a list of strings, names of sub-catalogs

Each sub-catalog can be accessed by name:

catalog['SOME_SUB_CATALOG']

DataBroker assembles this list of catalogs by looking for:

  1. Old-style “databroker v0.x” YAML configuration files, for backward compatibility

  2. Intake-style catalog YAML files, which have different fields

  3. Python packages that advertise catalogs via the intake.catalogs entrypoint

Old-style databroker configuration files

DataBroker v0.x used a custom YAML-based configuration file format. See Configuration. For backward compatibility, configuration files specifying MongoDB storage will be discovered and included in databroker.catalog.
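
Once discovered, such a configuration behaves like any other entry. A minimal sketch, in which 'xyz' is a hypothetical name defined by an old-style configuration file:

from databroker import catalog

# 'xyz' is a hypothetical name taken from an old-style configuration file
# that specifies MongoDB storage.
db = catalog['xyz']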

Migrating sqlite or HDF5 storage

The implementation in databroker.v0 interfaces with storage in MongoDB, sqlite, or HDF5. The implementations in databroker.v1 and databroker.v2 drop support for sqlite and HDF5 and add support for JSONL (newline-delimited JSON) and msgpack. For binary file-based storage, we recommend using msgpack. Data can be migrated from sqlite or HDF5 to msgpack like so:

from databroker import Broker
import suitcase.msgpack

# If the config file associated with YOUR_BROKER_NAME specifies sqlite or
# HDF5 storage, then this will return a databroker.v0.Broker instance.
db = Broker.named(YOUR_BROKER_NAME)
# Loop through every run in the old Broker.
for run in db():
    # Load all the documents out of this run from their existing format and
    # write them into one file located at
    # `<DESTINATION_DIRECTORY>/<uid>.msgpack`.
    suitcase.msgpack.export(run.documents(), DESTINATION_DIRECTORY)

In the next section, we’ll create a “catalog YAML file” to make this data discoverable by databroker.

Intake-style Catalog YAML Files

Search Path

The convenience function catalog_search_path() returns the locations that are searched. Place catalog YAML files in one of these locations to make them discoverable by intake and, in turn, by databroker.

from databroker import catalog_search_path
catalog_search_path()  # result will vary depending on OS and environment

Structure

The general structure of a catalog YAML file is a nested dictionary of data “sources”. Each source name is mapped to information for accessing that data: the type of “driver” to use and keyword arguments to pass to it. A “driver” is generally associated with a particular storage format.

sources:
  SOME_NAME:
    driver: SOME_DRIVER
    args:
      SOME_PARAMETER: VALUE
      ANOTHER_PARAMETER: VALUE
  ANOTHER_NAME:
    driver: SOME_DRIVER
    args:
      SOME_PARAMETER: VALUE
      ANOTHER_PARAMETER: VALUE

As shown, multiple sources can be specified in one file. All sources found in all the YAML files in the search path will be included as top-level entries in databroker.catalog.
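
Assuming the file above is placed on the search path, each source becomes a top-level entry; a minimal sketch:

from databroker import catalog

# Both sources defined in the YAML file above appear as entries.
catalog['SOME_NAME']
catalog['ANOTHER_NAME']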

Arguments

All databroker “drivers” accept the following arguments:

  • handler_registry — If omitted or None, the result of discover_handlers() is used. See External Assets for background on the role of “handlers”.

  • root_map — This is passed to event_model.Filler() to account for temporarily moved/copied/remounted files. Any resources which have a root matching a key in root_map will be loaded using the mapped value in root_map.

  • transforms — A dict that maps any subset of the keys {start, stop, resource, descriptor} to a function that accepts a document of the corresponding type and returns it, potentially modified. This feature is for patching up erroneous metadata. It is intended for quick, temporary fixes that may later be applied permanently to the data at rest (e.g., via a database migration). A sketch of a transform function and a root_map is shown after this list.
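
As a rough illustration of root_map and transforms, here is a minimal sketch in which every name and path is hypothetical:

# A transform accepts a document of the corresponding type (here, a
# 'start' document) and returns it, potentially modified.
def fix_start(doc):
    doc = dict(doc)  # work on a copy rather than mutating the original
    doc.setdefault('sample', 'unknown')  # hypothetical metadata patch
    return doc

transforms = {'start': fix_start}

# root_map: resources whose root matches a key are loaded from the mapped
# value instead, e.g. after files were moved, copied, or remounted.
root_map = {'/original/root': '/mnt/moved/root'}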

Specific drivers require format-specific arguments, shown in the following subsections.

Msgpack Example

Msgpack is a binary file format.

sources:
  ENTRY_NAME:
    driver: bluesky-msgpack-catalog
    args:
      paths:
        - "DESTINATION_DIRECTORY/*.msgpack"

where ENTRY_NAME is the name of the entry that will appear in databroker.catalog, and DESTINATION_DIRECTORY is a directory of msgpack files generated by suitcase-msgpack, as illustrated in the previous section.

Note that the value of paths is a list. Multiple directories can be grouped into one “source”.
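
To try out a catalog file without placing it on the search path, intake can load it directly; a minimal sketch, assuming the YAML above is saved as catalog.yml:

import intake

catalog = intake.open_catalog('catalog.yml')
list(catalog)  # ['ENTRY_NAME']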

JSONL (Newline-delimited JSON) Example

JSONL is a text-based format in which each line is a valid JSON document. Unlike ordinary JSON, it is suitable for streaming. This storage is much slower than msgpack, but the files are human-readable.

sources:
  ENTRY_NAME:
    driver: bluesky-jsonl-catalog
    args:
      paths:
        - "DESTINATION_DIRECTORY/*.jsonl"

where ENTRY_NAME is the name of the entry that will appear in databroker.catalog, and DESTINATION_DIRECTORY is a directory of newline-delimited JSON files generated by suitcase-jsonl.

Note that the value of paths is a list. Multiple directories can be grouped into one “source”.

MongoDB Example

MongoDB is the recommended storage format for large-scale deployments because it supports fast search.

sources:
  ENTRY_NAME:
    driver: bluesky-mongo-normalized-catalog
    args:
      metadatastore_db: mongodb://HOST:PORT/MDS_DATABASE_NAME
      asset_registry_db: mongodb://HOST:PORT/ASSETS_DATABASE_NAME

where ENTRY_NAME is the name of the entry that will appear in databroker.catalog, and the mongodb://... URIs point to MongoDB databases with documents inserted by suitcase-mongo.

The driver’s name, bluesky-mongo-normalized-catalog, differentiates it from bluesky-mongo-embedded-catalog, an experimental alternative way of mapping original bluesky documents into MongoDB documents and collections. It is still under evaluation and not yet recommended for use in production.

Python packages

To distribute catalogs to users, it may be more convenient to provide an installable Python package, rather than placing YAML files in specific locations on the user’s machine. To achieve this, a Python package can advertise catalog objects using the 'intake.catalogs' entrypoint. Here is a minimal example:

# setup.py
from setuptools import setup

setup(name='example',
      entry_points={'intake.catalogs':
          ['ENTRY_NAME = example:catalog_instance']},
      py_modules=['example'])

# example.py

# Create an object named `catalog_instance` which is referenced in the
# setup.py, and will be discovered by databroker. How the instance is
# created, and what type of catalog it is, is completely up to the
# implementation. This is just one possible example.

import intake

# Look up a driver class by its name in the registry.
catalog_class = intake.registry['bluesky-mongo-normalized-catalog']

catalog_instance = catalog_class(
    metadatastore_db='mongodb://...', asset_registry_db='mongodb://...')

The entry_points parameter in setup(...) is a feature supported by Python packaging. When this package is installed, a special file inside the distribution, entry_points.txt, will advertise that it has catalogs. DataBroker will discover these and add them to databroker.catalog. Note that databroker does not need to actually import the package to discover its catalogs. The package will only be imported if and when the catalog is accessed, so the overhead of this discovery process is low.
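
To inspect what a given environment advertises, the entry points can be listed with the standard library; a minimal sketch, assuming Python 3.10 or later:

from importlib.metadata import entry_points

# List every catalog advertised via the 'intake.catalogs' entry point,
# without importing the packages that provide them.
for ep in entry_points(group='intake.catalogs'):
    print(ep.name, ep.value)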

Important

Some critical details of Python’s entrypoints feature:

  • Note the unusual syntax of the entrypoints. Each item is given as one long string, with the = as part of the string. Modules are separated by ., and the final object name is preceded by :.

  • The right-hand side of the equals sign must point to where the object is actually defined. If catalog_instance is defined in foo/bar.py and imported into foo/__init__.py, you might expect foo:catalog_instance to work, but it does not. You must spell out foo.bar:catalog_instance, as sketched below.
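
To make the last point concrete, here is a sketch in which the package layout and all names are hypothetical:

# Layout:
#   foo/__init__.py  (contains: from .bar import catalog_instance)
#   foo/bar.py       (defines catalog_instance)

# In setup.py, spell out the defining module:
entry_points={'intake.catalogs':
    ['ENTRY_NAME = foo.bar:catalog_instance']}

# 'ENTRY_NAME = foo:catalog_instance' would not be discovered correctly,
# even though `from foo import catalog_instance` works in ordinary code.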