.. currentmodule:: bluesky.plans

====================
Recording Metadata
====================

Capturing useful metadata is the main objective of bluesky. The more
information you can provide about what you are doing and why you are doing it,
the more useful bluesky and downstream data search and analysis tools can be.

The term "metadata" can be controversial: one scientist's "data" is another's
"metadata", and the classification is context-dependent. The same exact
information can be "data" in one experiment, but "metadata" in a different
experiment done on the exact same hardware. The `Document Model `_ provides a
framework for deciding *where* to record a particular piece of information.

There are some things that we know *a priori* before doing an experiment: where
are we? who is the user? what sample are we looking at? what did the user just
ask us to do? These are all things that we can, in principle, know independent
of the control system. These are the prime candidates for inclusion in the
`Start Document `_. Downstream, DataBroker provides tools to do rich searches
on this data. The more information you can include the better.

There is some information that we need that is nominally independent of any
particular device but that we must consult the control system about: for
example, the location of important but un-scanned motors, or the configuration
of beam attenuators. If the values *should* be fixed over the course of the
experiment, then this is a good candidate for being a "baseline device", either
via the `Supplemental pre-processor `_ or explicitly in custom plans. This will
put the readings in a separate stream (which is a peer to the "primary" data).
In principle, these values *could* be read from the control system once and put
into the Start document along with the *a priori* information; however, that
has several drawbacks:

1. There is only ever one reading of the values, so if they do drift during
   data acquisition, you will never know.
2. We cannot automatically capture information about the device like we do for
   data in Events. This includes things like the datatype, units, and shape of
   the value and any configuration information about the hardware it is being
   read from.

A third class of information that can be called "metadata" is configuration
information of pieces of hardware. These are things like the velocity of a
motor or the integration time of a detector. These readings are embedded in the
`Descriptor `_ and are extracted from the hardware via the
`read_configuration `_ method of the hardware. We expect that these values
will not change over the course of the experiment, so we only read them once.

Information that does not fall into one of these categories, because you expect
it to change during the experiment, should be treated as "data", either as an
explicit part of the experimental plan or via :ref:`async_monitoring`.

Adding to the Start Document
============================

When the RunEngine mints a Start document it includes structured data. That
information can be injected via several mechanisms:

1. entered interactively by the user at execution time
2. provided in the code of the *plan*
3. automatically inferred
4. entered by user once and stashed for reuse on all future plans

If there is a conflict between these sources, the higher entry in this list
wins. The "closer" to the user the information originated, the higher its
priority.

1. Interactively, for One Use
-----------------------------

Suppose we are executing some custom plan called ``plan``.

.. code-block:: python

    RE(plan())

If we give arbitrary extra keyword arguments to ``RE``, they will be
interpreted as metadata.

.. code-block:: python

    RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

The :ref:`run(s) ` --- i.e., datasets --- generated by ``plan()`` will include
the custom metadata:

.. code-block:: python

    ...
    'sample_id': 'A',
    'purpose': 'calibration',
    'operator': 'Dan',
    ...

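
The precedence rule stated above (interactive keywords first, the stash in
``RE.md`` last) can be pictured with plain dictionaries. The following is only
a minimal sketch: the four dictionaries are hypothetical stand-ins for the four
sources, and ``collections.ChainMap`` is used purely for illustration, not as a
description of the RunEngine's internals.

.. code-block:: python

    from collections import ChainMap

    # Hypothetical stand-ins for the four sources, highest priority first.
    per_call = {'purpose': 'test'}                               # 1. keywords given to RE(...)
    from_plan = {'purpose': 'calibration', 'plan_name': 'count'}  # 2. md provided by the plan
    inferred = {'plan_name': 'plan', 'plan_type': 'generator'}    # 3. inferred automatically
    stashed = {'operator': 'Dan', 'purpose': 'commissioning'}     # 4. stashed for reuse

    # ChainMap searches its mappings left to right, so on a key conflict
    # the mapping listed earlier wins.
    start_md = dict(ChainMap(per_call, from_plan, inferred, stashed))

    print(start_md['purpose'])    # taken from per_call
    print(start_md['plan_name'])  # taken from from_plan
    print(start_md['operator'])   # only present in stashed

Keys that appear in only one source are always kept; the ordering matters only
when the same key appears in more than one source.
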

If ``plan`` generates more than one run, all the runs will get this metadata.
For example, this plan generates three different runs.

.. code-block:: python

    from bluesky.plans import count, scan
    from ophyd.sim import det, motor  # simulated detector, motor

    def plan():
        yield from count([det])
        yield from scan([det], motor, 1, 5, 5)
        yield from count([det])

If executed as above:

.. code-block:: python

    RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

each run will get a copy of the sample_id, purpose and operator metadata.

2. Through a plan
-----------------

Revisiting the previous example:

.. code-block:: python

    def plan():
        yield from count([det])
        yield from scan([det], motor, 1, 5, 5)
        yield from count([det])

we can pass different metadata for each run. Every
:ref:`built-in pre-assembled plan ` accepts a parameter ``md``, which you can
use to inject metadata that applies only to that plan.

.. code-block:: python

    def plan():
        yield from count([det], md={'purpose': 'calibration'})  # one
        yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data'})  # two
        yield from count([det], md={'purpose': 'sanity check'})  # three

The metadata passed into ``RE`` is combined with the metadata passed in to each
plan. Thus, calling

.. code-block:: python

    RE(plan(), sample_id='A', operator='Dan')

generates these three sets of metadata:

.. code-block:: python

    # one
    ...
    'sample_id': 'A',
    'purpose': 'calibration',
    'operator': 'Dan',
    ...

    # two
    ...
    'sample_id': 'A',
    'purpose': 'good data',
    'operator': 'Dan',
    ...

    # three
    ...
    'sample_id': 'A',
    'purpose': 'sanity check',
    'operator': 'Dan',
    ...

If there is a conflict, ``RE`` keywords take precedence. So

.. code-block:: python

    RE(plan(), purpose='test')

would override the individual 'purpose' metadata from the plan, marking all
three as purpose=test.

If you define your own plans, it is best practice to have them take a
keyword-only argument ``md=None``. This allows the hard-coded metadata to be
overridden later:

.. code-block:: python

    def plan(*, md=None):
        md = md or {}  # handle the default case
        # unpacking **md at the end means it "wins":
        # if the user calls
        #    yield from plan(md={'purpose': 'bob'})
        # it will override these values
        yield from count([det], md={'purpose': 'calibration', **md})
        yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data', **md})
        yield from count([det], md={'purpose': 'sanity check', **md})

This is consistent with all of the :ref:`preassembled_plans`.

For more on injecting metadata via plans, refer to :ref:`this section ` of the
tutorial.

.. note::

    All of the built-in plans provide certain metadata automatically. Custom
    plans are not *required* to provide any of this, but it is a nice pattern
    to follow.

    * plan_name --- e.g., ``'scan'``
    * detectors --- a list of the names of the detectors
    * motors --- a list of the names of the motors
    * plan_args --- dict of keyword arguments passed to the plan
    * plan_pattern --- function used to create the trajectory
    * plan_pattern_module --- Python module where ``plan_pattern`` is defined
    * plan_pattern_args --- dict of keyword arguments passed to
      ``plan_pattern`` to create the trajectory

    The ``plan_name`` and ``plan_args`` together should provide sufficient
    information to recreate the plan. The ``detectors`` and ``motors`` are
    convenient keys to search on later.

    The ``plan_pattern*`` entries provide lower-level, more explicit
    information about the *trajectory* ("pattern") generated by the plan,
    separate from the specific detectors and motors involved. For complex
    trajectories like spirals, this is especially useful. As a simple example,
    here is the pattern-related metadata for :func:`scan`.

    .. code-block:: python

        ...
        'plan_pattern': 'linspace',
        'plan_pattern_module': 'numpy',
        'plan_pattern_args': dict(start=start, stop=stop, num=num)
        ...

    Thus, one can re-create the "pattern" (trajectory) like so:

    .. code-block:: python

        numpy.linspace(**dict(start=start, stop=stop, num=num))

3. Automatically
----------------

For each run, the RunEngine automatically records:

* 'time' --- In this context, the start time. (Other times are also recorded.)
* 'uid' --- a globally unique ID for this run
* 'plan_name' --- the function or class name of ``plan`` (e.g., 'count')
* 'plan_type' --- the Python type of ``plan`` (e.g., 'generator')

The last two can be overridden by any of the methods above. The first two
cannot be overridden by the user.

.. note::

    If some custom plan does not specify a 'plan_name' and 'plan_type', the
    RunEngine infers them as follows:

    .. code-block:: python

        plan_type = type(plan).__name__
        plan_name = getattr(plan, '__name__', '')

    These may be more or less informative depending on what ``plan`` is. They
    are just heuristics to provide *some* information by default if the plan
    itself and the user do not provide it.

4. Interactively, for Repeated Use
----------------------------------

Each time a plan is executed, the current contents of ``RE.md`` are copied into
the metadata for all runs generated by the plan. To enter metadata once to
reuse on all plans, add it to ``RE.md``.

.. code-block:: python

    RE.md['proposal_id'] = 123456
    RE.md['project'] = 'flying cars'
    RE.md['dimensions'] = (5, 3, 10)

View its current contents,

.. code-block:: python

    RE.md

delete a key you want to stop using,

.. code-block:: python

    del RE.md['project']  # delete a key

or use any of the standard methods that apply to `dictionaries in Python `_.

.. warning::

    In general we recommend against putting device readings in the Start
    document. (The Start document is for who/what/why/when, things you know
    before you start communicating with hardware.) It is *especially* critical
    that you do not put device readings in the ``RE.md`` dictionary. The value
    will remain until you change it and will not track the state of the
    hardware. This will result in recording out-of-date, incorrect data!

    This can be particularly dangerous if ``RE.md`` is backed by a persistent
    data store (see next section) because out-of-date readings will last
    across sessions.

The ``scan_id``, an integer that the RunEngine automatically increments at the
beginning of each scan, is stored in ``RE.md['scan_id']``.

.. warning::

    Clearing all keys, like so:

    .. code-block:: python

        RE.md.clear()  # clear *all* keys

    will reset the ``scan_id``. The next time a plan is executed, the
    RunEngine will start with a ``scan_id`` of 1 and set

    .. code-block:: python

        RE.md['scan_id'] = 1

    Some readers may prefer to reset the scan ID to 1 at the beginning of a
    new experiment; others may wish to maintain a single unbroken sequence of
    scan IDs forever.

    From a technical standpoint, it is fine to have duplicate scan IDs. All
    runs also have a randomly-generated 'uid' ("unique ID") which is globally
    unique forever.

.. _md_persistence:

Persistence Between Sessions
----------------------------

We provide a way to save the contents of the metadata stash ``RE.md`` between
sessions (e.g., exiting and re-opening IPython).

In general, the ``RE.md`` attribute may be anything that supports the
dictionary interface. The simplest is just a plain Python dictionary.

.. code-block:: python

    RE.md = {}

To persist metadata between sessions, bluesky recommends
:class:`bluesky.utils.PersistentDict` --- a Python dictionary synced with a
directory of files on disk. Any changes made to ``RE.md`` are synced to those
files, so the contents of ``RE.md`` can persist between sessions.

.. code-block:: python

    from bluesky.utils import PersistentDict
    RE.md = PersistentDict('some/path/here')

Bluesky does not provide a strong recommendation on that path; that is a
detail left to the local deployment.

Bluesky formerly recommended using :class:`~historydict.HistoryDict` --- a
Python dictionary backed by a sqlite database file. This approach proved
problematic with the threading introduced in bluesky v1.6.0, so it is no longer
recommended.
If you have been following that recommendation, you should migrate
your metadata from :class:`~historydict.HistoryDict` to
:class:`~bluesky.utils.PersistentDict`. First, update your configuration to
make ``RE.md`` a :class:`~bluesky.utils.PersistentDict` as shown above. Then,
migrate like so:

.. code-block:: python

    from bluesky.utils import get_history
    old_md = get_history()
    RE.md.update(old_md)

The :class:`~bluesky.utils.PersistentDict` object has been back-ported to
bluesky v1.5.6 as well. It is not available in 1.4.x or older, so once you move
to the new system, you must run bluesky v1.5.6 or higher.

.. warning::

    The ``RE.md`` object can also be set when the RunEngine is instantiated:

    .. code-block:: python

        # This:
        RE = RunEngine(...)

        # is equivalent to this:
        RE = RunEngine({})
        RE.md = ...

    As we stated :ref:`at the start of the tutorial `, if you are using
    bluesky at a user facility or with shared configuration, your ``RE`` may
    already be configured, and defining a new ``RE`` as above can result in
    data loss! If you aren't sure, it's safer to use ``RE.md = ...``.

Allowed Data Types
------------------

Custom metadata keywords can be mapped to:

* strings --- e.g., ``task='calibration'``
* numbers --- e.g., ``attempt=5``
* lists or tuples --- e.g., ``dimensions=[1, 3]``
* (nested) dictionaries --- e.g., ``dimensions={'width': 1, 'height': 3}``

Required Fields
---------------

The fields:

* **uid**
* **time**

are reserved by the document model and cannot be set by the user.

In current versions of bluesky, **no fields are universally required by bluesky
itself**. It is possible to specify your own required fields in local
configuration. See :ref:`md_validator`. (At NSLS-II, there are facility-wide
requirements coming soon.)

Special Fields
--------------

Arbitrary custom fields are allowed --- you can invent any names that are
useful to you.

But certain fields are given special significance by bluesky's document model,
and are either disallowed or required to be a certain type.

The fields:

* **owner**
* **group**
* **project**

are optional but, to facilitate searchability, if they are not blank they must
be strings. A non-string, like ``owner=5``, will produce an error that will
interrupt scan execution immediately after it starts.

Similarly, the keyword **sample** has special significance. It must be either a
string or a dictionary.

The **scan_id** field is expected to be an integer, and it is automatically
incremented between runs. If a scan_id is not provided by the user or stashed
in the persistent metadata from the previous run, it defaults to 1.

.. _md_validator:

Validation
----------

Additional, customized metadata validation can be added to the RunEngine. For
example, to ensure that a run will not be executed unless the parameter
'sample_number' is specified, define a function that accepts a dictionary
argument and raises if 'sample_number' is not found.

.. code-block:: python

    def ensure_sample_number(md):
        if 'sample_number' not in md:
            raise ValueError("You forgot the sample number.")

Apply this function by setting

.. code-block:: python

    RE.md_validator = ensure_sample_number

The function will be executed immediately before each new run is opened.
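
The same pattern extends to several required keys at once. The following is a
minimal sketch in plain Python; the helper name ``require_keys`` and the
wording of its error message are hypothetical and not part of bluesky. It
builds a validator function of the kind described above and exercises it on
ordinary dictionaries.

.. code-block:: python

    def require_keys(*required):
        """Build a validator that raises if any required key is missing."""
        def validator(md):
            # Collect every missing key so the user sees them all at once.
            missing = [key for key in required if key not in md]
            if missing:
                raise ValueError(
                    f"Missing required metadata: {', '.join(missing)}")
        return validator

    validate = require_keys('sample_number', 'operator')

    # A dictionary with every required key passes silently.
    validate({'sample_number': 3, 'operator': 'Dan'})

    # A dictionary lacking 'sample_number' raises ValueError.
    try:
        validate({'operator': 'Dan'})
    except ValueError as err:
        message = str(err)

Under these assumptions, setting
``RE.md_validator = require_keys('sample_number', 'operator')`` would apply the
check in the same way as ``ensure_sample_number``.
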