DataBroker Basics

What is the Data Broker?

The Data Broker provides one interface for retrieving data from all sources.

You, the user, don’t have to know where the data is stored or in what format it is stored. The Data Broker returns all the data in one simply-structured bundle. All measurements are given as standard Python data types (integer, float, or string) or numpy arrays.

Quick Start

This demonstrates the basic usage with minimal explanation. To understand what is being done, read the next section.

In [1]: from dataportal.broker import DataBroker

In [2]: from dataportal.muxer import DataMuxer

In [3]: header = DataBroker[-1]  # get most recent run

In [4]: events = DataBroker.fetch_events(header)

In [5]: dm = DataMuxer.from_events(events)

In [6]: dm.sources  # to review list of data sources

You can plot individual data sources against time...

In [7]: dm['Tsam'].plot()  # or the name of any data source

Or bin the data to plot sources against each other...

In [8]: binned_data = dm.bin_on('point_det')

In [9]: binned_data

In [10]: binned_data.plot(x='Tsam', y='point_det')

You can also export to common formats, such as CSV and Excel:

In [11]: binned_data.to_csv('myfile.csv')

In [12]: binned_data.to_excel('myfile.xlsx')

Basic Example: Plotting The Most Recent Scan

Looking at a Scan

Let’s inspect the most recent run. To get the Nth most recent run, type DataBroker[-N].

In [13]: from dataportal.broker import DataBroker

In [14]: header = DataBroker[-1]

What we get is a Header, a dictionary-like (for C programmers, struct-like) object with all the information pertaining to a run.

In [15]: header

We can view its complete contents with print(header); str(header) returns the same text as a string.

In [16]: print(header)

You can access the contents like a Python dictionary

In [17]: header['owner']

or, equivalently, an attribute. In IPython, use tab-completion to explore.

In [18]: header.owner
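
One way to picture how a single object can support both access styles is a dictionary that also exposes its keys as attributes. The sketch below is not the Header implementation; `HeaderSketch` and the example values are hypothetical, chosen only to illustrate the idea.

```python
class HeaderSketch(dict):
    """Hypothetical stand-in for a Header: a dict whose keys are also attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None

    def __dir__(self):
        # Including the keys here is what lets tab-completion suggest them.
        return sorted(set(super().__dir__()) | set(self))


header = HeaderSketch(owner="nedbrainard", scan_id=4)  # invented example values
print(header["owner"])  # dictionary-style access
print(header.owner)     # attribute-style access, same value
```

Both lookups read from the same underlying mapping, so they cannot disagree.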

Getting the Data in its Rawest Form

The Header does not contain any of the actual measurements from a run. To get the data itself, pass header (or a list of several Headers) to fetch_events:

In [19]: events = DataBroker.fetch_events(header)

The result is a list of Events, each one representing a measurement or measurements that took place at a given time. (Exactly what we mean by “Event” and “a given time” is documented elsewhere in both medium and excruciating detail.)

Consider this an intermediate step. The data is structured in a generic way that is wonderfully flexible but not especially convenient. To get a more useful view of the data, read on.

Putting the Data into a More Useful Form

One level above the DataBroker sits the DataMuxer, an object for merging and aligning streams of Events from many sources into a table. Build a DataMuxer like so:

In [20]: from dataportal.muxer import DataMuxer

In [21]: dm = DataMuxer.from_events(events)

The events can come from one scan or from many scans together. The simplest task is then to look at the data from one source – say, sample temperature.

In [22]: dm['Tsam']

Incidentally, to save a little typing, dm.Tsam accomplishes the same thing. At any rate, the output gives the measured data at each time.

Next, let’s obtain a table showing data from multiple sources. Strictly speaking, measurements recorded by different equipment are not in general synchronized, but in practice one usually ignores small differences in time. For instance, we might want to plot “temperature” versus “intensity” even if the temperature and intensity sensors never happened to take a simultaneous measurement. Doing so, we would be implicitly binning those measurements in time.

Therefore, plotting one dependent variable against another usually requires binning to effectively “align” the measurements against each other in time. This is the problem that DataMuxer is designed to solve. On the simplest level, it takes the stream of events and creates the table of data you probably expected in the first place. But it is also capable of fully exploiting the asynchronous stream of measurements, grouping them in different ways to answer different questions.

To begin, we bin the data by centering one bin at each point_det measurement.

In [23]: binned_data = dm.bin_on('point_det')

In [24]: binned_data

Wherever there is a point_det measurement but no Tsam measurement within the time window, NaN indicates the missing data. (You may object that “NaN” is not really the same as “missing.” This is a convention borrowed from the widely-used pandas package, and the reasons for using NaN to mean “missing” have to do with the limitations of numpy in handling missing data.)

The count sub-column indicates the number of Tsam measurements in each bin. There is exactly one point_det measurement in every bin, by definition, so no count is shown there.
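
To make the binning idea concrete, here is a minimal pure-Python sketch. It is not how DataMuxer is implemented: the readings are invented, `bin_on_anchor` is a hypothetical helper, and both the bin-edge rule (midway between neighboring point_det times) and the choice to keep the last reading in a bin are assumptions made for illustration.

```python
import math

# Invented (time, value) readings from two unsynchronized sources.
point_det = [(1.0, 10.0), (2.0, 12.0), (3.0, 11.0)]
tsam = [(0.9, 300.0), (1.1, 302.0), (3.2, 310.0)]


def bin_on_anchor(anchor, other):
    """Center one bin on each anchor reading; edges sit midway between neighbors."""
    times = [t for t, _ in anchor]
    rows = []
    for i, (t, value) in enumerate(anchor):
        lo = (times[i - 1] + t) / 2 if i > 0 else -math.inf
        hi = (t + times[i + 1]) / 2 if i < len(times) - 1 else math.inf
        in_bin = [v for (s, v) in other if lo <= s < hi]
        rows.append({
            "time": t,
            "point_det": value,
            # Assumption: keep the last reading in the bin; NaN marks an empty bin.
            "Tsam": in_bin[-1] if in_bin else math.nan,
            "Tsam_count": len(in_bin),
        })
    return rows


table = bin_on_anchor(point_det, tsam)
for row in table:
    print(row)
```

With these invented readings the middle bin contains no Tsam reading, so its value is NaN and its count is 0, while the first bin holds two readings.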

Sometimes, one can interpolate the missing values according to some rule, such as linear interpolation.

In [25]: binned_data = dm.bin_on('point_det', interpolation={'Tsam': 'linear'})

In [26]: binned_data

The count column, still present, indicates which values are measured (1) and which are interpolated (0).
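
The arithmetic behind linear interpolation, and the meaning of the 1/0 count, can be sketched in a few lines. This is an illustration under simplifying assumptions, not the DataMuxer code: the data are invented, `fill_linear` is a hypothetical helper, and a reading counts as "measured" only when its timestamp coincides exactly with the bin center.

```python
from bisect import bisect_left

tsam_times = [1.0, 3.0]        # invented Tsam timestamps, sorted ascending
tsam_values = [300.0, 310.0]   # invented Tsam readings
point_det_times = [1.0, 2.0, 3.0]


def fill_linear(t):
    """Return (value, count): count is 1 if measured at t, 0 if interpolated."""
    if t in tsam_times:
        return tsam_values[tsam_times.index(t)], 1
    i = bisect_left(tsam_times, t)
    if i == 0 or i == len(tsam_times):
        # Outside the measured range there is nothing to interpolate between.
        return float("nan"), 0
    t0, t1 = tsam_times[i - 1], tsam_times[i]
    v0, v1 = tsam_values[i - 1], tsam_values[i]
    return v0 + (v1 - v0) * (t - t0) / (t1 - t0), 0


filled = [fill_linear(t) for t in point_det_times]
print(filled)
```

The value at time 2.0 lies halfway between the two surrounding readings and carries a count of 0, marking it as interpolated rather than measured.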

If instead we bin the other way, defining one bin per Tsam data point, we must provide a rule for combining multiple point_det measurements in the same bin into one representative value. Now, along with the count sub-column, other summary statistics are automatically generated.

In [27]: binned_data = dm.bin_on('Tsam', agg={'point_det': np.mean})  # requires: import numpy as np

In [28]: binned_data
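
The role of the aggregation rule can be sketched without any of the real machinery. In the example below the readings are invented, and assigning each point_det reading to the nearest Tsam time is an assumption made for illustration; the actual bin boundaries used by DataMuxer may differ.

```python
from statistics import mean

tsam_times = [1.0, 2.0]  # invented bin centers, one per Tsam reading
point_det = [(0.8, 10.0), (1.1, 14.0), (1.9, 20.0)]  # invented (time, value) readings

# Assumption: each point_det reading goes to the bin with the nearest center.
bins = {center: [] for center in tsam_times}
for t, value in point_det:
    nearest = min(tsam_times, key=lambda center: abs(center - t))
    bins[nearest].append(value)

# Reduce each bin to one representative value plus a count.
summary = {
    center: {"val": mean(values), "count": len(values)}
    for center, values in bins.items()
}
print(summary)
```

Swapping `mean` for another reducer such as `max` or `statistics.median` changes the representative value without touching the binning step.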

To discard the extra statistics and keep the values only, use this syntax. (xs stands for cross-section, a sophisticated pandas method.)

In [29]: binned_data.xs('val', level=1, axis=1)

Exporting the Data for Use Outside of Python

The tabular results from the DataMuxer are DataFrames, objects from the widely used and well-documented package pandas, and there are many convenient methods for exporting them to common formats. For example:

In [30]: binned_data.to_csv('myfile.csv')

In [31]: binned_data.to_excel('myfile.xlsx')

More methods are described in the pandas documentation, and can easily be explored by typing binned_data.to_<tab>.

This quick-and-dirty export is really only useful if the data of interest is scalar (e.g., not images) and not very large. For other applications, different tools should be used. As of this writing, these tools are in development and not yet documented.

More Ways to Look Up Scans

To quickly look up recent scans, use the standard Python slicing syntax for indexing from the end of a list.

In [32]: header = DataBroker[-1]  # most recent scan

In [33]: header.scan_id

In [34]: header = DataBroker[-2]  # next to last scan

In [35]: header.scan_id

In [36]: headers = DataBroker[-5:]  # all of the last five scans

In [37]: [h.scan_id for h in headers]
Out[37]: []

In [38]: headers = DataBroker[-1000::100]  # sample every 100th of the last (up to) 1000 scans
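
These lookups follow ordinary Python list semantics, which can be checked on a plain list. In this sketch `scan_ids` is an invented stand-in for the stored runs (oldest first), not a DataBroker object.

```python
scan_ids = list(range(1, 251))  # pretend 250 runs exist, oldest first

most_recent = scan_ids[-1]
next_to_last = scan_ids[-2]
last_five = scan_ids[-5:]

# Only 250 runs exist, so the start of the slice is clamped to the first run.
sampled = scan_ids[-1000::100]

# A slice of an empty list is simply empty; indexing it would raise IndexError.
nothing = [][-5:]

print(most_recent, next_to_last, last_five, sampled, nothing)
```

The clamping is why the slice samples "up to" 1000 scans: asking for more runs than exist is not an error for a slice, whereas a single negative index past the start is.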

Or give the scan ID, which is always a positive integer.

In [39]: header = DataBroker[4]  # scan ID 4

In [40]: header.scan_id

If you know the unique id (uid) of a Header, you can use the first few characters to find it.

In [41]: header = DataBroker['a5fbde']
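
A partial uid only identifies a run if it matches exactly one. The sketch below shows that rule on invented uids; `find_by_uid_prefix` and `describe` are hypothetical helpers written for this illustration, not part of the DataBroker API.

```python
known_uids = ["a5fbde3c", "a5f01b77", "c90d2e14"]  # invented, shortened uids


def find_by_uid_prefix(uids, prefix):
    """Return the single uid starting with prefix; raise ValueError otherwise."""
    matches = [uid for uid in uids if uid.startswith(prefix)]
    if not matches:
        raise ValueError("No such run found.")
    if len(matches) > 1:
        raise ValueError("That partial uid matches multiple runs.")
    return matches[0]


def describe(prefix):
    """Look up a prefix and report either the match or the error message."""
    try:
        return find_by_uid_prefix(known_uids, prefix)
    except ValueError as err:
        return "error: {}".format(err)


for prefix in ("a5fbde", "a5f", "zzz"):
    print(prefix, "->", describe(prefix))
```

In practice this means a few characters are usually enough, and a longer prefix resolves any ambiguity.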

For advanced searches, use find_headers.

In [42]: neds_headers = DataBroker.find_headers(owner='nedbrainard')

In [43]: headers_measuring_temperature = DataBroker.find_headers(data_key='Tsam')

Any of these results, whether a single Header or a list of Headers, can be passed to DataBroker.fetch_events() as shown in the previous sections.