DataBroker Basics
What is the Data Broker?
The Data Broker provides one interface for retrieving data from all sources.
You, the user, don’t have to know where the data is stored or in what format it is stored. The Data Broker returns all the data in one simply-structured bundle. All measurements are given as standard Python data types (integer, float, or string) or numpy arrays.
Quick Start
This demonstrates the basic usage with minimal explanation. To understand what is being done, read the next section.
In [1]: from dataportal.broker import DataBroker
In [2]: from dataportal.muxer import DataMuxer
In [3]: header = DataBroker[-1] # get most recent run
In [4]: events = DataBroker.fetch_events(header)
In [5]: dm = DataMuxer.from_events(events)
In [6]: dm.sources # to review list of data sources
You can plot individual data sources against time...
In [7]: dm['Tsam'].plot() # or the name of any data source
Or bin the data to plot sources against each other...
In [8]: binned_data = dm.bin_on('point_det')
In [9]: binned_data
In [10]: binned_data.plot(x='Tsam', y='point_det')
And you can easily export to common formats. Among them:
In [11]: binned_data.to_csv('myfile.csv')
In [12]: binned_data.to_excel('myfile.xlsx')
Basic Example: Plotting The Most Recent Scan
Looking at a Scan
Let’s inspect the most recent run. To get the Nth most recent run, type DataBroker[-N].
In [13]: from dataportal.broker import DataBroker
In [14]: header = DataBroker[-1]
What we get is a Header, a dictionary-like (for C programmers, struct-like) object with all the information pertaining to a run.
In [15]: header
We can view its complete contents with print or, equivalently, str(header).
In [16]: print(header)
You can access the contents like a Python dictionary
In [17]: header['owner']
or, equivalently, an attribute. In IPython, use tab-completion to explore.
In [18]: header.owner
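The dual access style (key or attribute) can be illustrated with a small stand-in class. This is only a sketch of the pattern, not the actual Header implementation: the class name HeaderSketch and the field values are hypothetical.

```python
class HeaderSketch(dict):
    """Hypothetical stand-in for a Header: a dict whose keys double as attributes."""

    def __getattr__(self, name):
        # Called only when normal attribute lookup fails, so dict methods still work.
        try:
            return self[name]
        except KeyError:
            raise AttributeError(name) from None


# Illustrative values only; a real Header carries many more fields.
sketch = HeaderSketch(owner="nedbrainard", scan_id=4)
same_value = sketch["owner"] == sketch.owner  # both spellings reach the same entry
fields = sorted(sketch)  # iterating a dict yields its keys
```

Listing the keys, as in the last line, is a plain-Python alternative to tab-completion when working outside IPython.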
Getting the Data in its Rawest Form
The Header does not contain any of the actual measurements from a run. To get the data itself, pass header (or a list of several Headers) to fetch_events:
In [19]: events = DataBroker.fetch_events(header)
The result is a list of Events, each one representing a measurement or measurements that took place at a given time. (Exactly what we mean by “Event” and “a given time” is documented elsewhere in both medium and excruciating detail.)
Consider this an intermediate step. The data is structured in a generic way that is wonderfully flexible but not especially convenient. To get a more useful view of the data, read on.
Putting the Data into a More Useful Form
One level above the DataBroker sits the DataMuxer, an object for merging and aligning streams of Events from many sources into a table. Build a DataMuxer like so:
In [20]: from dataportal.muxer import DataMuxer
In [21]: dm = DataMuxer.from_events(events)
The events can be from one scan or from many scans together. The simplest task is then to look at the data from one source, say, sample temperature.
In [22]: dm['Tsam']
Incidentally, to save a little typing, dm.Tsam accomplishes the same thing. At any rate, the output gives the measured data at each time.
Next, let’s obtain a table showing data from multiple sources. Strictly speaking, measurements recorded by different equipment are not in general synchronized, but in practice one usually ignores small differences in time. For instance, we might want to plot “temperature” versus “intensity” even if the temperature and intensity sensors never happened to take a simultaneous measurement. Doing so, we would be implicitly binning those measurements in time.
Therefore, plotting one dependent variable against another usually requires binning to effectively “align” the measurements against each other in time. This is the problem that DataMuxer is designed to solve. On the simplest level, it takes the stream of events and creates the table of data you probably expected in the first place. But it is also capable of fully exploiting the asynchronous stream of measurements, grouping them in different ways to answer different questions.
To begin, we bin the data by centering one bin at each point_det measurement.
In [23]: binned_data = dm.bin_on('point_det')
In [24]: binned_data
Wherever there is a point_det measurement but no Tsam measurement within the time window, NaN indicates the missing data. (You may object that “NaN” is not really the same as “missing.” This is a convention borrowed from the widely-used pandas package, and the reasons for using NaN to mean “missing” have to do with the limitations of numpy in handling missing data.)
The count sub-column indicates the number of Tsam measurements in each bin. There is exactly one point_det measurement in every bin, by definition, so no count is shown there.
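The idea of centering one bin on each point_det measurement can be sketched in plain Python. This is a simplified illustration, not the DataMuxer algorithm: the function name bin_on_sketch, the fixed half_width window, and the sample readings are all assumptions made for the example.

```python
import math


def bin_on_sketch(anchor_times, other, half_width):
    """Center one bin on each anchor time and summarize another source inside it.

    anchor_times: timestamps of the anchoring source (e.g. point_det)
    other: (timestamp, value) pairs from another source (e.g. Tsam)
    half_width: hypothetical half-width of each bin, in the same time units
    """
    rows = []
    for t0 in anchor_times:
        in_bin = [v for t, v in other if abs(t - t0) <= half_width]
        if len(in_bin) > 1:
            # Several readings in one bin need an explicit combining rule.
            raise ValueError("more than one reading in a bin")
        value = in_bin[0] if in_bin else math.nan  # NaN marks a missing reading
        rows.append({"time": t0, "val": value, "count": len(in_bin)})
    return rows


# Made-up readings: four point_det times, two Tsam measurements.
point_det_times = [0.0, 1.0, 2.0, 3.0]
tsam_readings = [(0.1, 300.0), (2.05, 302.0)]
rows = bin_on_sketch(point_det_times, tsam_readings, half_width=0.25)
counts = [row["count"] for row in rows]
```

Because NaN never compares equal to anything, including itself, missing entries have to be detected with math.isnan (or the pandas equivalent) rather than with ==.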
Sometimes, one can interpolate the missing values according to some rule, such as linear interpolation.
In [25]: binned_data = dm.bin_on('point_det', interpolation={'Tsam': 'linear'})
In [26]: binned_data
The count column, still present, indicates which values are measured (1) and which are interpolated (0).
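As a rough illustration of what linear interpolation does to the missing entries, here is a plain-Python sketch. It is not the DataMuxer implementation: the function name fill_linear and the sample numbers are hypothetical, and leaving values outside the measured range as NaN is an assumption of this sketch.

```python
import math


def fill_linear(times, values):
    """Fill NaN entries by linear interpolation in time between measured neighbours.

    Assumes `times` is sorted in ascending order. Entries before the first or
    after the last measured value stay NaN, since there is nothing to
    interpolate between.
    """
    known = [(t, v) for t, v in zip(times, values) if not math.isnan(v)]
    filled = []
    for t, v in zip(times, values):
        if not math.isnan(v):
            filled.append(v)  # measured values pass through unchanged
            continue
        before = [(tk, vk) for tk, vk in known if tk < t]
        after = [(tk, vk) for tk, vk in known if tk > t]
        if not before or not after:
            filled.append(math.nan)
            continue
        (t1, v1), (t2, v2) = before[-1], after[0]
        filled.append(v1 + (v2 - v1) * (t - t1) / (t2 - t1))
    return filled


times = [0.0, 1.0, 2.0, 3.0]
measured = [300.0, math.nan, 302.0, math.nan]
counts = [0 if math.isnan(v) else 1 for v in measured]  # 1 = measured, 0 = not
filled = fill_linear(times, measured)
```

The counts list is computed from the original measurements, which is why it can still distinguish measured values from interpolated ones after the gaps are filled.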
If instead we bin the other way, defining one bin per Tsam data point, we must provide a rule for combining multiple point_det measurements in the same bin into one representative value. Now, along with the count sub-column, other summary statistics are automatically generated. (Here np refers to numpy, imported with import numpy as np.)
In [27]: binned_data = dm.bin_on('Tsam', agg={'point_det': np.mean})
In [28]: binned_data
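The role of the combining rule can be sketched without any of the dataportal machinery. In this illustration the function name bin_and_aggregate, the half_width window, and the readings are all hypothetical, and statistics.mean stands in for np.mean.

```python
from statistics import mean


def bin_and_aggregate(anchor_times, other, half_width, agg):
    """Center one bin on each anchor time and reduce the readings inside it with `agg`."""
    rows = []
    for t0 in anchor_times:
        in_bin = [v for t, v in other if abs(t - t0) <= half_width]
        value = agg(in_bin) if in_bin else float("nan")  # empty bins stay missing
        rows.append({"time": t0, "val": value, "count": len(in_bin)})
    return rows


# Made-up data: two slow Tsam readings, five faster point_det readings.
tsam_times = [0.0, 10.0]
point_det_readings = [(-1.0, 4.0), (0.5, 6.0), (9.0, 10.0), (10.0, 12.0), (11.0, 14.0)]
rows = bin_and_aggregate(tsam_times, point_det_readings, half_width=2.0, agg=mean)
```

Any callable that reduces a list of values to one value (max, min, a median function) could be passed as agg in this sketch.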
To discard the extra statistics and keep the values only, use this syntax. (xs stands for cross-section, a sophisticated pandas method.)
In [29]: binned_data.xs('val', level=1, axis=1)
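To see what the cross-section does in isolation, the following sketch builds a small table with two-level column labels by hand and then applies xs. The column layout (a source name on the first level, 'val' and 'count' on the second) is an assumption modelled on the description above, and the numbers are made up.

```python
import pandas as pd

# Two-level column labels: (data source, sub-column).
columns = pd.MultiIndex.from_tuples(
    [("Tsam", "val"), ("Tsam", "count"), ("point_det", "val")]
)
table = pd.DataFrame(
    [[300.0, 1, 5.0], [301.0, 0, 6.0]],
    columns=columns,
)

# Keep only the 'val' sub-columns; the second label level is dropped from the result.
values_only = table.xs("val", level=1, axis=1)
remaining_columns = list(values_only.columns)
```

The result has ordinary single-level columns, which is usually the more convenient shape for plotting or exporting.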
Exporting the Data for Use Outside of Python
The tabular results from the DataMuxer are DataFrames, objects from the widely used and well-documented package pandas, and there are many convenient methods for exporting them to common formats. For example:
In [30]: binned_data.to_csv('myfile.csv')
In [31]: binned_data.to_excel('myfile.xlsx')
More methods are described in the pandas documentation, and can easily be explored by typing binned_data.to_<tab>.
This quick-and-dirty export is really only useful if the data of interest is scalar (e.g., not images) and not very large. For other applications, different tools should be used. As of this writing, these tools are in development and not yet documented.
More Ways to Look Up Scans
To quickly look up recent scans, use the standard Python slicing syntax for indexing from the end of a list.
In [32]: header = DataBroker[-1] # most recent scan
In [33]: header.scan_id
In [34]: header = DataBroker[-2] # next to last scan
In [35]: header.scan_id
In [36]: headers = DataBroker[-5:] # all of the last five scans
In [37]: [h.scan_id for h in headers]
In [38]: headers = DataBroker[-1000::100] # sample every 100th of the last (up to) 1000 scans
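These lookups borrow the semantics of negative indices and slices on an ordinary Python list, which can be tried out without any stored runs. The list of scan IDs below is made up, and the order in which DataBroker returns the results of a slice is not something this sketch attempts to reproduce.

```python
scan_ids = list(range(1, 251))  # pretend 250 runs exist, oldest first

most_recent = scan_ids[-1]
next_to_last = scan_ids[-2]
last_five = scan_ids[-5:]

# A slice start that reaches back further than the list is clamped to the
# beginning, so this takes every 100th entry of all 250 runs.
every_100th = scan_ids[-1000::100]
```

This clamping is why the slice is described as covering "up to" 1000 scans: asking for more runs than exist is not an error for a slice, unlike for a single index.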
Or give the scan ID, which is always a positive integer.
In [39]: header = DataBroker[4] # scan ID 4
In [40]: header.scan_id
If you know the unique id (uid) of a Header, you can use the first few characters to find it.
In [41]: header = DataBroker['a5fbde']
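A partial uid is only useful if it identifies exactly one run. One plausible way to express that rule is sketched below; the function name find_by_partial_uid and the uid strings are hypothetical, and this is not the DataBroker code.

```python
def find_by_partial_uid(uids, prefix):
    """Return the single uid starting with `prefix`; fail if none or several match."""
    matches = [uid for uid in uids if uid.startswith(prefix)]
    if len(matches) < 1:
        raise ValueError("No such run found.")
    if len(matches) > 1:
        raise ValueError("That partial uid matches multiple runs.")
    return matches[0]


# Made-up uids for illustration.
example_uids = ["a5fbde01", "a5fc9912", "b7710c3e"]
found = find_by_partial_uid(example_uids, "a5fbde")
```

In this sketch, a shorter prefix such as "a5f" would match two entries and raise an error, so a few more characters are needed to disambiguate.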
For advanced searches, use find_headers.
In [42]: neds_headers = DataBroker.find_headers(owner='nedbrainard')
In [43]: headers_measuring_temperature = DataBroker.find_headers(data_key='Tsam')
Any of these results, whether a single Header or a list of Headers, can be passed to DataBroker.fetch_events() as shown in the previous sections.