.. ipython:: python
   :suppress:

   # Do this so Quick Start does not include example-generation.
   from metadatastore.utils.testing import mds_setup
   from filestore.utils.testing import fs_setup
   mds_setup()
   fs_setup()
   from dataportal.examples.sample_data import temperature_ramp
   from metadatastore.api import insert_run_start, insert_beamline_config
   # Insert five example runs; each run is populated with temperature-ramp data.
   rs_uid = insert_run_start(time=0., scan_id=1, uid='a5fbde',
                             owner='nedbrainard', beamline_id='example',
                             beamline_config=insert_beamline_config({}, time=0.))
   temperature_ramp.run(run_start_uid=rs_uid)
   rs_uid = insert_run_start(time=1., scan_id=2,
                             owner='nedbrainard', beamline_id='example',
                             beamline_config=insert_beamline_config({}, time=0.))
   temperature_ramp.run(run_start_uid=rs_uid)
   rs_uid = insert_run_start(time=2., scan_id=3,
                             owner='nedbrainard', beamline_id='example',
                             beamline_config=insert_beamline_config({}, time=0.))
   temperature_ramp.run(run_start_uid=rs_uid)
   rs_uid = insert_run_start(time=3., scan_id=4,
                             owner='nedbrainard', beamline_id='example',
                             beamline_config=insert_beamline_config({}, time=0.))
   temperature_ramp.run(run_start_uid=rs_uid)
   rs_uid = insert_run_start(time=4., scan_id=5,
                             owner='nedbrainard', beamline_id='example',
                             beamline_config=insert_beamline_config({}, time=0.))
   temperature_ramp.run(run_start_uid=rs_uid)

*****************
DataBroker Basics
*****************

What is the Data Broker?
========================

The Data Broker provides one interface for retrieving data from all sources.

You, the user, don't have to know where the data is stored or in what format
it is stored. The Data Broker returns all the data in one simply-structured
bundle. All measurements are given as standard Python data types (integer,
float, or string) or numpy arrays.

Quick Start
===========

This demonstrates the basic usage with minimal explanation. To understand what
is being done, read the next section.

.. 
ipython:: python

   from dataportal.broker import DataBroker
   from dataportal.muxer import DataMuxer
   header = DataBroker[-1]  # get most recent run
   events = DataBroker.fetch_events(header)
   dm = DataMuxer.from_events(events)
   dm.sources  # to review list of data sources

You can plot individual data sources against time...

.. ipython:: python

   dm['Tsam'].plot()  # or the name of any data source

Or bin the data to plot sources against each other...

.. ipython:: python

   binned_data = dm.bin_on('point_det')
   binned_data
   binned_data.plot(x='Tsam', y='point_det')

And you can easily export to common formats. Among them:

.. ipython:: python

   binned_data.to_csv('myfile.csv')
   binned_data.to_excel('myfile.xlsx')

Basic Example: Plotting The Most Recent Scan
============================================

Looking at a Scan
-----------------

Let's inspect the most recent run. To get the Nth most recent run, type
``DataBroker[-N]``.

.. ipython:: python

   from dataportal.broker import DataBroker
   header = DataBroker[-1]

What we get is a Header, a dictionary-like (for C programmers, struct-like)
object with all the information pertaining to a run.

.. ipython:: python

   header

We can view its complete contents with ``print`` or, equivalently,
``str(header)``.

.. ipython:: python

   print(header)

You can access the contents like a Python dictionary:

.. ipython:: python

   header['owner']

or, equivalently, as an attribute. In IPython, use tab-completion to explore.

.. ipython:: python

   header.owner

Getting the Data in its Rawest Form
-----------------------------------

The Header does not contain any of the actual measurements from a run. To get
the data itself, pass ``header`` (or a list of several Headers) to
``fetch_events``:

.. ipython:: python

   events = DataBroker.fetch_events(header)

The result is a list of Events, each one representing a measurement or
measurements that took place at a given time.
(Exactly what we mean by "Event" and "a given time" is documented elsewhere in
both medium and excruciating detail.) Consider this an intermediate step. The
data is structured in a generic way that is wonderfully flexible but not
especially convenient. To get a more useful view of the data, read on.

Putting the Data into a More Useful Form
----------------------------------------

One level above the DataBroker sits the DataMuxer, an object for merging and
aligning streams of Events from many sources into a table. Build a DataMuxer
like so:

.. ipython:: python

   from dataportal.muxer import DataMuxer
   dm = DataMuxer.from_events(events)

The ``events`` can be from one scan or from many scans together.

The simplest task is to look at the data from one source -- say, sample
temperature.

.. ipython:: python

   dm['Tsam']

Incidentally, to save a little typing, ``dm.Tsam`` accomplishes the same thing.
At any rate, the output gives the measured data at each time.

Next, let's obtain a table showing data from multiple sources. Strictly
speaking, measurements recorded by different equipment are not in general
synchronized, but in practice one usually ignores small differences in time.

For instance, we might want to plot "temperature" versus "intensity" even if
the temperature and intensity sensors never happened to take a simultaneous
measurement. Doing so, we would be implicitly *binning* those measurements in
time. Therefore, plotting one dependent variable against another usually
requires binning to effectively "align" the measurements against each other in
time.

This is the problem that DataMuxer is designed to solve. On the simplest
level, it takes the stream of events and creates the table of data you probably
expected in the first place. But it is also capable of fully exploiting the
asynchronous stream of measurements, grouping them in different ways to answer
different questions.

To begin, we bin the data by centering one bin at each ``point_det``
measurement.

.. 
ipython:: python

   binned_data = dm.bin_on('point_det')
   binned_data

Wherever there is a ``point_det`` measurement but no ``Tsam`` measurement
within the time window, NaN indicates the missing data. (You may object that
"NaN" is not really the same as "missing." This is a convention borrowed from
the widely-used pandas package, and the reasons for using NaN to mean
"missing" have to do with the limitations of numpy in handling missing data.)

The ``count`` sub-column indicates the number of ``Tsam`` measurements in each
bin. There is exactly one ``point_det`` measurement in every bin, by
definition, so no ``count`` is shown there.

Sometimes, one can interpolate the missing values according to some rule, such
as linear interpolation.

.. ipython:: python

   binned_data = dm.bin_on('point_det', interpolation={'Tsam': 'linear'})
   binned_data

The ``count`` column, still present, indicates which values are measured (1)
and which are interpolated (0).

If instead we bin the other way, defining one bin per ``Tsam`` data point, we
must provide a rule for combining multiple ``point_det`` measurements in the
same bin into one representative value. Now, along with the ``count``
sub-column, other summary statistics are automatically generated.

.. ipython:: python

   import numpy as np
   binned_data = dm.bin_on('Tsam', agg={'point_det': np.mean})
   binned_data

To discard the extra statistics and keep the values only, use this syntax.
(``xs`` stands for cross-section, a sophisticated pandas method.)

.. ipython:: python

   binned_data.xs('val', level=1, axis=1)

Exporting the Data for Use Outside of Python
--------------------------------------------

The tabular results from the DataMuxer are DataFrames, objects from the
widely-used and well-documented package pandas, and there are many convenient
methods for exporting them to common formats. For example:

.. 
ipython:: python

   binned_data.to_csv('myfile.csv')
   binned_data.to_excel('myfile.xlsx')

More methods are described in the pandas documentation, and can easily be
explored by typing ``binned_data.to_`` and using tab-completion.

This quick-and-dirty export is really only useful if the data of interest is
scalar (e.g., not images) and not very large. For other applications,
different tools should be used. As of this writing, these tools are in
development and not yet documented.

.. ipython:: python
   :suppress:

   # Cleanup
   !rm myfile.csv
   !rm myfile.xlsx

More Ways to Look Up Scans
==========================

To quickly look up recent scans, use the standard Python slicing syntax for
indexing from the end of a list.

.. ipython:: python

   header = DataBroker[-1]  # most recent scan
   header.scan_id
   header = DataBroker[-2]  # next to last scan
   header.scan_id
   headers = DataBroker[-5:]  # all of the last five scans
   [h.scan_id for h in headers]
   headers = DataBroker[-1000::100]  # sample every 100th of the last (up to) 1000 scans

Or give the scan ID, which is always a positive integer.

.. ipython:: python

   header = DataBroker[4]  # scan ID 4
   header.scan_id

If you know the unique id (uid) of a Header, you can use the first few
characters to find it.

.. ipython:: python

   header = DataBroker['a5fbde']

For advanced searches, use ``find_headers``.

.. ipython:: python

   neds_headers = DataBroker.find_headers(owner='nedbrainard')
   headers_measuring_temperature = DataBroker.find_headers(data_key='Tsam')

Any of these results, whether a single Header or a list of Headers, can be
passed to ``DataBroker.fetch_events()`` as shown in the previous sections.
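The negative-index and slice lookups in this section follow ordinary Python
sequence semantics. As a minimal sketch, the snippet below applies the same
expressions to a plain list of hypothetical scan IDs (``scan_ids`` is an
illustrative stand-in, not a DataBroker object) to show which entries each form
selects. It assumes the list is ordered oldest first; the order in which
DataBroker itself returns Headers is not something this sketch demonstrates.

```python
# Hypothetical scan IDs 1..250, oldest first. This plain list only stands in
# for the broker to illustrate slicing; it does not use DataBroker.
scan_ids = list(range(1, 251))

most_recent = scan_ids[-1]       # last element
next_to_last = scan_ids[-2]      # second element from the end
last_five = scan_ids[-5:]        # final five elements

# A start index further back than the list is long is clipped to the
# beginning rather than raising an error, hence "up to" 1000 scans.
sampled = scan_ids[-1000::100]   # every 100th element from the clipped start

print(most_recent, next_to_last)
print(last_five)
print(sampled)
```

With only 250 entries available, the ``[-1000::100]`` slice starts from the
first entry, which is why the comment in the lookup example describes it as
sampling the last (up to) 1000 scans.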