Tutorial

The databroker is a tool to access data from many sources through a unified interface. It emphasizes rich searching capabilities and handling multiple concurrent “streams” of data in an organized way.

Basic Walkthrough

Get a Broker

List the names of available configurations.

In [1]: from databroker import list_configs

In [2]: list_configs()
Out[2]: ['example']

If this list is empty, no one has created any configuration files yet. See the section on Configuration.

Make a databroker using one of the configurations.

In [3]: from databroker import Broker

In [4]: db = Broker.named('example')

Load Data as a Table

Load the most recently saved run.

In [5]: header = db[-1]

The result, a Header, encapsulates the metadata from this run. Loading the data itself can be a longer process, so it’s a separate step. For scalar data, the most convenient method is:

In [6]: header.table()
Out[6]: 
                                 time       det  motor  motor_setpoint
seq_num                                                               
1       2020-11-04 22:11:59.017866850  0.606531    1.0             1.0
2       2020-11-04 22:11:59.022908449  0.135335    2.0             2.0
3       2020-11-04 22:11:59.026233912  0.011109    3.0             3.0
4       2020-11-04 22:11:59.029233932  0.000335    4.0             4.0
5       2020-11-04 22:11:59.032141685  0.000004    5.0             5.0

This object is a DataFrame, a spreadsheet-like object provided by the pandas library.

Note

For Python novices we point out that header above is an arbitrary variable name. It could have been:

h = db[-1]
h.table()

or even in one line:

db[-1].table()

Do Analysis or Export

DataFrames can be used to perform fast computations on labeled data, such as

In [7]: t = header.table()

In [8]: t.mean(numeric_only=True)
Out[8]: 
det               0.150663
motor             3.000000
motor_setpoint    3.000000
dtype: float64

In [9]: t['det'] / t['motor']
Out[9]: 
seq_num
1    6.065307e-01
2    6.766764e-02
3    3.702999e-03
4    8.386566e-05
5    7.453306e-07
dtype: float64

or export to a file.

In [10]: t.to_csv('data.csv')
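To make explicit what the two DataFrame expressions above compute, here is the same arithmetic in plain Python. This is only a sketch: the numbers are copied from the det and motor columns shown in this tutorial, and the variable names are chosen for illustration.

```python
from statistics import mean

# Values copied from the 'det' and 'motor' columns shown in this tutorial.
det = [0.6065306597126334, 0.1353352832366127, 0.011108996538242306,
       0.00033546262790251185, 3.726653172078671e-06]
motor = [1.0, 2.0, 3.0, 4.0, 5.0]

# Counterpart of the 'det' entry in t.mean(): the average of one column.
det_mean = mean(det)

# Counterpart of t['det'] / t['motor']: element-wise division, row by row.
ratio = [d / m for d, m in zip(det, motor)]

print(det_mean)
print(ratio)
```

The DataFrame versions do the same work in one expression and keep the seq_num labels attached to each result.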

Load Data Lazily (Good for Image Data)

The Header.table method is just one way to load the data. Another is Header.data, which loads data for one specific field (i.e., one column of the table) in a “lazy”, streaming fashion.

In [11]: data = header.data('det')

In [12]: data  # This is a 'generator' that will load data when we loop through it.
Out[12]: <generator object Header.data at 0x7fe8bcff8de0>

In [13]: for point in data:
   ....:     print(point)
   ....: 
0.6065306597126334
0.1353352832366127
0.011108996538242306
0.00033546262790251185
3.726653172078671e-06

The Header.data method is suitable for loading image data. See the API Documentation for more methods.
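Why does laziness help with images? The sketch below uses a plain Python generator to show the behavior. Note that read_frames is a hypothetical stand-in written for this illustration, not a databroker function, and the "images" are tiny nested lists.

```python
loaded = []  # records which frames have actually been produced


def read_frames(n_frames):
    """Hypothetical stand-in for a slow image loader (not a databroker function)."""
    for i in range(n_frames):
        loaded.append(i)
        # A tiny fake 2x2 "image" filled with the frame number.
        yield [[i, i], [i, i]]


frames = read_frames(1000)  # Creating the generator loads nothing.
print(loaded)               # Still empty at this point.

first = next(frames)        # Only now is frame 0 produced.
second = next(frames)       # ...and then frame 1.
print(loaded)               # Only the frames we asked for were loaded.
```

Because each item is produced on demand, only one frame needs to be held in memory at a time while looping.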

Explore Metadata

Everything recorded at the start of the run is in header.start.

In [14]: header.start
Out[14]: 
{'uid': 'bb805845-5757-4642-b3ef-7ab14e9fd31e',
 'time': 1604527919.0073605,
 'versions': {'ophyd': '1.5.4', 'bluesky': '1.6.7'},
 'scan_id': 5,
 'plan_type': 'generator',
 'plan_name': 'scan',
 'detectors': ['det'],
 'motors': ['motor'],
 'num_points': 5,
 'num_intervals': 4,
 'plan_args': {'detectors': ["SynGauss(prefix='', name='det', read_attrs=['val'], configuration_attrs=['Imax', 'center', 'sigma', 'noise', 'noise_multiplier'])"],
  'num': 5,
  'args': ["SynAxis(prefix='', name='motor', read_attrs=['readback', 'setpoint'], configuration_attrs=['velocity', 'acceleration'])",
   1,
   5],
  'per_step': 'None'},
 'hints': {'dimensions': [[['motor'], 'primary']]},
 'plan_pattern': 'inner_product',
 'plan_pattern_module': 'bluesky.plan_patterns',
 'plan_pattern_args': {'num': 5,
  'args': ["SynAxis(prefix='', name='motor', read_attrs=['readback', 'setpoint'], configuration_attrs=['velocity', 'acceleration'])",
   1,
   5]}}

Information only knowable at the end, like the exit status (success, abort, fail), is stored in header.stop.

In [15]: header.stop
Out[15]: 
{'run_start': 'bb805845-5757-4642-b3ef-7ab14e9fd31e',
 'time': 1604527919.0332673,
 'uid': 'fe00f574-61a1-4ddc-9f9d-2560a41f01ee',
 'exit_status': 'success',
 'reason': '',
 'num_events': {'primary': 5}}
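The 'time' entries in both documents can be combined to find out when a run started and how long it took. The sketch below assumes that these values are seconds since the Unix epoch, and it copies only the two 'time' values from the outputs above into plain dictionaries rather than using a live header.

```python
from datetime import datetime, timezone

# Only the 'time' entries, copied from header.start and header.stop above.
start = {'time': 1604527919.0073605}
stop = {'time': 1604527919.0332673}

# Assumption: 'time' is seconds since the Unix epoch.
started_at = datetime.fromtimestamp(start['time'], tz=timezone.utc)
duration = stop['time'] - start['time']

print(started_at.isoformat())
print(f"run lasted {duration:.4f} s")
```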

Metadata about the devices involved and their configuration is stored in header.descriptors, but that is quite a lot to dig through, so it’s useful to start with some convenience methods that extract the list of devices or the fields that they reported:

In [16]: header.devices()
Out[16]: {'det', 'motor'}

In [17]: header.fields()
Out[17]: {'det', 'motor', 'motor_setpoint'}

To extract configuration data recorded by a device:

In [18]: header.config_data('motor')
Out[18]: {'primary': [{'motor_velocity': 1, 'motor_acceleration': 1}]}

(A realistic device might report, for example, exposure_time or a zero point.)
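The value in Out[18] is nested: a dictionary keyed by stream name, holding a list of dictionaries of configuration values. Here is a small sketch of walking that structure, using the value copied from Out[18] rather than a live header. Whether the list can hold more than one entry is not shown in this tutorial, so the loop handles that case without assuming it.

```python
# Copied from Out[18] above.
config = {'primary': [{'motor_velocity': 1, 'motor_acceleration': 1}]}

# Walk every stream, every entry in its list, and every recorded value.
for stream_name, entries in config.items():
    for entry in entries:
        for key, value in entry.items():
            print(f"{stream_name}: {key} = {value}")

# Pull out a single value directly.
velocity = config['primary'][0]['motor_velocity']
```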

Searching

The “slicing” (square bracket) syntax is a quick way to search by relative position, by unique ID, or by the integer scan_id. Examples:

# Get the most recent run.
header = db[-1]

# Get the fifth most recent run.
header = db[-5]

# Get a list of the five most recent runs, using Python slicing syntax.
headers = db[-5:]

# Get a run whose unique ID ("RunStart uid") begins with 'x39do5'.
header = db['x39do5']

# Get a run whose integer scan_id is 42. Note that this might not be
# unique. In the event of duplicates, the most recent match is returned.
header = db[42]
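The rules above can be mimicked with an ordinary Python list, which may help build intuition for them. This is a toy model only: the records, their values, and the helper latest_with_scan_id are hypothetical and are not how the Broker is implemented.

```python
# Hypothetical run records, oldest first; real Run Start documents hold more keys.
runs = [
    {'uid': 'a1', 'scan_id': 41, 'time': 100.0},
    {'uid': 'b2', 'scan_id': 42, 'time': 200.0},
    {'uid': 'c3', 'scan_id': 42, 'time': 300.0},  # scan_id reused later
]


def latest_with_scan_id(records, scan_id):
    """Toy lookup: return the most recent record with this scan_id."""
    matches = [r for r in records if r['scan_id'] == scan_id]
    if not matches:
        # Raising KeyError is this sketch's own choice.
        raise KeyError(scan_id)
    return max(matches, key=lambda r: r['time'])


most_recent = runs[-1]   # analogous to db[-1]
last_two = runs[-2:]     # analogous to db[-2:]
match = latest_with_scan_id(runs, 42)  # analogous to db[42]
```

Two records share scan_id 42 here, and the lookup returns the later one, mirroring the "most recent match" rule described in the comments above.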

Calling a Broker like a function (with parentheses) accesses richer searches. Common search parameters include plan_name, motor, and detectors. Any user-provided metadata can be used in a search. Examples:

# Search by plan name.
headers = db(plan_name='scan')

# Search for runs involving a motor with the name 'eta'.
headers = db(motor='eta')

# Search for runs operated by a given user---assuming this metadata was
# recorded in the first place!
headers = db(operator='Dan')

# Search by time range. (These keywords have a special meaning.)
headers = db(since='2015-03-05', until='2015-03-10')

Full-text search is also supported for MongoDB-backed deployments. (Other deployments raise NotImplementedError if you try this.)

# Perform text search on all values in the Run Start document.
headers = db('keyword')

Note that partial words are not matched, but partial phrases are. For example, ‘good’ will match ‘good sample’ but ‘goo’ will not.
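The whole-word rule can be pictured with a toy matcher. The function contains_word below is hypothetical and written only for this illustration; it is not the algorithm the database uses, and real full-text engines typically do more than this.

```python
import re


def contains_word(word, text):
    """Toy check: True if `word` appears in `text` as a complete word."""
    words_in_text = set(re.findall(r"\w+", text.lower()))
    return word.lower() in words_in_text


print(contains_word('good', 'good sample'))
print(contains_word('goo', 'good sample'))
```

The first call finds a complete word and the second does not, which is the distinction the note above describes.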

Unlike the “slicing” (square bracket) queries, rich searches can return an unbounded number of results. To avoid slowness, the results are loaded “lazily,” only as needed. Here’s an example of what works and what doesn’t.

In [19]: headers = db(plan_name='scan')

In [20]: headers
Out[20]: <databroker.v1.Results at 0x7fe8bd067a20>

In [21]: headers[2]  # Fails! The results are not a list.
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-21-6a220047f4ab> in <module>
----> 1 headers[2]  # Fails! The results are not a list.

TypeError: 'Results' object does not support indexing

In [22]: list(headers)[2]  # This works, but might be slow if the results are large.
Out[22]: <databroker.v1.Header at 0x7fe8bc8c04e0>

Looping through them loads one at a time, conserving memory.

In [23]: for header in headers:
   ....:     print(header.table()['det'].mean())
   ....: 
0.1506628257537126
0.1506628257537126
0.1506628257537126
0.1506628257537126
0.1506628257537126
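When only a few results are needed, itertools.islice from the standard library can stop the iteration early instead of building the whole list. The sketch below uses a hypothetical generator, fake_results, as a stand-in for a lazy result set; it assumes only that the results can be looped over, as shown above.

```python
from itertools import islice

fetched = []  # records which results were actually produced


def fake_results():
    """Hypothetical stand-in for a lazy, possibly very long result set."""
    n = 0
    while True:
        fetched.append(n)
        yield f"header-{n}"
        n += 1


# Take the result at index 2 without materializing everything.
third = next(islice(fake_results(), 2, 3))

# Take the first three results as a small list.
first_three = list(islice(fake_results(), 3))

print(third)
print(first_three)
print(len(fetched))  # how many results were produced in total
```

Even though fake_results never ends, each call produces only the items up to the requested position, so the cost depends on how far into the results you look rather than on how many exist.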