Retrieve metadata, tabular data, and image data

Problem

Retrieve metadata, tabular data, or image data for analysis, processing, or export.

Approach

Use the databroker, the all-in-one interface to saved data.

Retrieve the metadata for the run(s) of interest. Retrieve the data itself in three different modes:

  • a general-purpose method, which provides maximum flexibility and performance
  • a convenient method for retrieving tabular data
  • a convenient method for retrieving image data

Example Solution

The first step is always retrieving the metadata; from there, we can retrieve the data itself.

We’ll preface this example by running a scan to generate some example data.

In [1]: uid, = RE(scan([det], motor, -10, 10, 15))

The unique id of the data set has been stashed in the variable uid. We can use that to retrieve the data from the databroker.

In [2]: h = db[uid]

What we get back is a header, which contains all of the metadata from the run. For example, we can review the names of the detector(s) involved:

In [3]: h['start']['detectors']

There is a lot of information in h. See How metadata is organized: understand the contents of the header.

If we don’t know the uid, we can search for the metadata in other ways. One of the most common is recency: db[-1] retrieves the header of the most recent scan; db[-5] means “five scans ago”; db[-5:] retrieves the last five scans together. See this section of the databroker documentation for more.
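
The recency lookups follow the same convention as negative indexing on a Python sequence. The sketch below uses a plain list as a hypothetical stand-in for the databroker (the `runs` list and its labels are invented for illustration). It shows only the indexing convention, not what `db` actually returns.

```python
# Hypothetical stand-in for saved runs, ordered oldest to newest.
# A real databroker returns headers, not strings.
runs = ["run-{}".format(i) for i in range(1, 11)]

most_recent = runs[-1]   # the position db[-1] refers to: the latest run
five_ago = runs[-5]      # the position db[-5] refers to: five runs ago
last_five = runs[-5:]    # the span db[-5:] refers to: the last five runs

print(most_recent)
print(five_ago)
print(last_five)
```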

Now, what about the data itself?

General-Purpose Method

In [4]: events = db.get_events(h)

In the variable events, we now have a collection of documents (dictionary-like mappings of names to values). Each event corresponds to a single data point, a row in a table.

For performance reasons, the data has not actually been loaded yet. The data is loaded one point at a time if we loop through events. (This is very useful for applications where we don’t need to load the entire data set.)
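
A Python generator behaves in a similar way, so it can serve as an analogy for this deferred loading. The sketch below is not databroker code: `iter_events` and `load_log` are hypothetical names, and the event contents are invented. It shows that nothing is loaded until the loop asks for it, and that stopping early skips the remaining points.

```python
def iter_events(num_points, load_log):
    """Yield hypothetical event documents one at a time.

    Each document is built ("loaded") only when the consumer asks for it;
    load_log records which points were actually loaded.
    """
    for seq_num in range(1, num_points + 1):
        load_log.append(seq_num)
        yield {"seq_num": seq_num, "data": {"det": float(seq_num)}}


load_log = []
events = iter_events(15, load_log)   # nothing has been loaded yet
loaded_before_loop = list(load_log)

for event in events:
    if event["seq_num"] == 3:
        break  # stop early; points 4 through 15 are never loaded

print(loaded_before_loop)
print(load_log)
```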

To load the entire data set at once, convert events to a list.

In [5]: events = list(events)  # for large data sets, this takes a while

Let’s look at all the data in the events.

In [6]: [event['data'] for event in events]

You might be thinking, “Just give me data!” As promised, the general-purpose method is flexible, but it lacks terseness. For more direct methods, read on!

To learn more about the structure of an event, refer to the overview of the document model.
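
As a rough illustration of working with events directly, the sketch below pulls one field out of a list of event-like dictionaries. The events here are hand-written stand-ins with invented values, and `column` is a hypothetical helper, not part of databroker. The `data` and `timestamps` keys mirror the event structure, but consult the document model overview for the authoritative layout.

```python
# Hypothetical events shaped like the documents described above.
# All values are invented for illustration.
events = [
    {"seq_num": 1, "time": 100.0,
     "data": {"motor": -10.0, "det": 0.0},
     "timestamps": {"motor": 100.0, "det": 100.1}},
    {"seq_num": 2, "time": 101.0,
     "data": {"motor": 0.0, "det": 1.0},
     "timestamps": {"motor": 101.0, "det": 101.1}},
    {"seq_num": 3, "time": 102.0,
     "data": {"motor": 10.0, "det": 0.0},
     "timestamps": {"motor": 102.0, "det": 102.1}},
]


def column(events, field):
    """Collect the value of one data field from every event, in order."""
    return [event["data"][field] for event in events]


motor_positions = column(events, "motor")
det_readings = column(events, "det")

print(motor_positions)
print(det_readings)
```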

Retrieving a Table

In [7]: db.get_table(h)

The result is a DataFrame. One can access individual columns like so:

In [8]: table = db.get_table(h)

In [9]: table['det']

perform fast array computations using numpy

In [10]: import numpy as np

In [11]: np.mean(table)

and much, much more.

Note

The variable table here is a pandas DataFrame, scientific Python’s answer to the spreadsheet. Read the pandas documentation for more. It’s an extremely powerful package for analyzing tabular data.

Narrowing the Results

The get_table method accepts several optional arguments which can be used to filter the results (and correspondingly speed up the retrieval). Examples:

In [12]: db.get_table(h, ['det'])  # just include the 'det' column

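
To make the effect of a field filter concrete, here is a minimal sketch on plain Python dictionaries. It is not how get_table is implemented: `select_fields` is a hypothetical helper and the rows are invented. It only shows what "just include the 'det' column" means for the result.

```python
def select_fields(rows, fields=None):
    """Return copies of rows limited to the requested fields.

    With fields=None, every field is kept.
    """
    if fields is None:
        return [dict(row) for row in rows]
    return [{name: row[name] for name in fields if name in row}
            for row in rows]


# Invented rows standing in for a table with 'motor' and 'det' columns.
rows = [
    {"motor": -10.0, "det": 0.0},
    {"motor": 0.0, "det": 1.0},
    {"motor": 10.0, "det": 0.0},
]

det_only = select_fields(rows, ["det"])
everything = select_fields(rows)

print(det_only)
```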

Retrieving Images

Our example data above did not include images, so get_table served our purposes. It is not as suitable for image data, so a separate method is available.

If the scan includes image data, use the get_images method. You will need to specify the field name with which the image data was labeled. If you aren’t sure what this is, you can review all the field names using get_fields.

from databroker import get_fields
get_fields(h)  # returns the field names

Common choices are just 'image' or 'detector_name_image'.

images = db.get_images(h, 'image_field_name')

Plot individual images using matplotlib.

# These imports may not be necessary; they may already be in your config.
%matplotlib
import matplotlib.pyplot as plt

first_img = images[0]

# First, print the image dimensions and check that they make sense.
print(first_img.shape)

# Plot.
plt.imshow(first_img)

The imshow (i.e., “image show”) function has many useful optional parameters. Refer to this section of the matplotlib documentation for more.