.. currentmodule:: bluesky.plans

====================
 Recording Metadata
====================

Capturing useful metadata is the main objective of bluesky. The more
information you can provide about what you are doing and why you are
doing it, the more useful bluesky and downstream data search and
analysis tools can be.

The term "metadata" can be controversial: one scientist's "data" is another's
"metadata", and the classification is context-dependent.
The same exact information can be "data" in one
experiment but "metadata" in a different experiment done on the exact
same hardware.
The `Document Model
<https://blueskyproject.io/event-model/data-model.html>`_ provides a framework
for deciding *where* to record a particular piece of information.

There are some things that we know *a priori* before doing an experiment:
Where are we? Who is the user? What sample are we looking at? What did
the user just ask us to do?  These are all things that we can, in
principle, know independent of the control system.  These are the
prime candidates for inclusion in the `Start Document
<https://blueskyproject.io/event-model/data-model.html#run-start-document>`_.
Downstream, DataBroker provides tools to do rich searches on this data.
The more information you can include, the better.

There is some information that is nominally independent of any particular
device but that we need to consult the control system about, for example
the location of important but un-scanned motors
or the configuration of beam attenuators.  If the values *should* be fixed over
the course of the experiment, then this is a good candidate for
being a "baseline device" either via the `Supplemental pre-processor
<https://blueskyproject.io/bluesky/tutorial.html#baseline-readings-and-other-supplemental-data>`_
or explicitly in custom plans.  This will put the readings in a separate stream
(which is a peer to the "primary" data).  In principle, these values *could* be
read from the control system once and put into the Start document along with
the *a priori* information; however, that has several drawbacks:

1. There is only ever one reading of the values, so if they do drift during
   data acquisition, you will never know.
2. We cannot automatically capture information about the device like
   we do for data in Events.  This includes things like the datatype,
   units, and shape of the value and any configuration information about the
   hardware it is being read from.

A third class of information that can be called "metadata" is
configuration information of pieces of hardware.  These are things
like the velocity of a motor or the integration time of a detector.
These readings are embedded in the `Descriptor
<https://blueskyproject.io/event-model/data-model.html#event-descriptor>`_
and are extracted from the hardware via the `read_configuration
<https://blueskyproject.io/bluesky/hardware.html#ReadableDevice.read_configuration>`_
method of the hardware.  We expect that these values will not change over
the course of the experiment, so they are read only once.

Information that does not fall into one of these categories, because
you expect it to change during the experiment,
should be treated as "data", either as an explicit part of the
experimental plan or via :ref:`async_monitoring`.


Adding to the Start Document
============================

When the RunEngine mints a Start document it includes structured data.  That
information can be injected via several mechanisms:

1. entered interactively by the user at execution time
2. provided in the code of the *plan*
3. automatically inferred
4. entered by user once and stashed for reuse on all future plans

If there is a conflict between these sources, the higher entry in this
list wins.  The "closer" to a user the information originated the
higher priority it has.
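
The precedence rule can be pictured with a plain-Python sketch. This is not the
RunEngine's implementation; it only models the order listed above using
``collections.ChainMap``, where earlier mappings shadow later ones. All of the
variable names and values are made up for illustration.

.. code-block:: python

    from collections import ChainMap

    # Hypothetical metadata from each source, highest priority first.
    from_re_call = {'purpose': 'test'}                             # 1. RE(plan(), purpose='test')
    from_plan = {'purpose': 'calibration', 'plan_name': 'count'}   # 2. md=... in the plan
    inferred = {'plan_name': 'plan', 'plan_type': 'generator'}     # 3. inferred automatically
    stashed = {'operator': 'Dan', 'purpose': 'commissioning'}      # 4. stashed for reuse

    # ChainMap returns the value from the first mapping that contains the key.
    start_md = dict(ChainMap(from_re_call, from_plan, inferred, stashed))

Here ``start_md['purpose']`` comes from the first mapping, while ``operator``
is only present in the last one and so is taken from there.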


1. Interactively, for One Use
-----------------------------

Suppose we are executing some custom plan called ``plan``.

.. code-block:: python

    RE(plan())

If we give arbitrary extra keyword arguments to ``RE``, they will be
interpreted as metadata.

.. code-block:: python

    RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

The :ref:`run(s) <run_overview>` --- i.e., datasets --- generated by ``plan()``
will include the custom metadata:

.. code-block:: python

    ...
    'sample_id': 'A',
    'purpose': 'calibration',
    'operator': 'Dan',
    ...

If ``plan`` generates more than one run, all the runs will get this metadata.
For example, this plan generates three different runs.

.. code-block:: python

    from bluesky.plans import count, scan
    from ophyd.sim import det, motor  # simulated detector, motor

    def plan():
        yield from count([det])
        yield from scan([det], motor, 1, 5, 5)
        yield from count([det])

If executed as above:

.. code-block:: python

    RE(plan(), sample_id='A', purpose='calibration', operator='Dan')

each run will get a copy of the sample_id, purpose and operator metadata.

2. Through a plan
-----------------

Revisiting the previous example:

.. code-block:: python

    def plan():
        yield from count([det])
        yield from scan([det], motor, 1, 5, 5)
        yield from count([det])

we can pass different metadata for each run. Every
:ref:`built-in pre-assembled plan <preassembled_plans>` accepts a parameter
``md``, which you can use to inject metadata that applies only to that plan.

.. code-block:: python

    def plan():
        yield from count([det], md={'purpose': 'calibration'})  # one
        yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data'})  # two
        yield from count([det], md={'purpose': 'sanity check'})  # three

The metadata passed into ``RE`` is combined with the metadata passed in to each
plan. Thus, calling

.. code-block:: python

    RE(plan(), sample_id='A', operator='Dan')

generates these three sets of metadata:

.. code-block:: python

    # one
    ...
    'sample_id': 'A',
    'purpose': 'calibration',
    'operator': 'Dan',
    ...

    # two
    ...
    'sample_id': 'A',
    'purpose': 'good data',
    'operator': 'Dan',
    ...

    # three
    ...
    'sample_id': 'A',
    'purpose': 'sanity check',
    'operator': 'Dan',
    ...

If there is a conflict, ``RE`` keywords take precedence. So

.. code-block:: python

    RE(plan(), purpose='test')

would override the individual 'purpose' metadata from the plan, marking all
three as ``purpose='test'``.

If you define your own plans, it is best practice to have them take a
keyword-only argument ``md=None``.  This allows the hard-coded metadata to be
overridden later:

.. code-block:: python

    def plan(*, md=None):
        md = md or {}  # handle the default case
        # unpacking **md at the end means its values "win",
        # so if the user calls
        #    yield from plan(md={'purpose': 'bob'})
        # it will override these values
        yield from count([det], md={'purpose': 'calibration', **md})
        yield from scan([det], motor, 1, 5, 5, md={'purpose': 'good data', **md})
        yield from count([det], md={'purpose': 'sanity check', **md})

This is consistent with all of the :ref:`preassembled_plans`.
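
The position of ``**md`` matters because of ordinary Python dictionary
behavior: when a key appears more than once while building a dict, the last
occurrence wins. A small illustration that needs no hardware or RunEngine
follows; the ``merged`` helper is hypothetical and not part of bluesky.

.. code-block:: python

    def merged(md=None):
        """Combine a hard-coded default with caller-supplied metadata."""
        md = md or {}
        # Later entries win, so the caller's values replace the default.
        return {'purpose': 'calibration', **md}

    default_md = merged()
    overridden_md = merged({'purpose': 'bob', 'sample_id': 'A'})

With no argument the default ``purpose`` survives; when the caller supplies
``purpose``, the caller's value replaces it and any extra keys are kept.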

For more on injecting metadata via plans, refer to
:ref:`this section <tutorial_plan_metadata>` of the tutorial.

.. note::

    All of the built-in plans provide certain metadata automatically. Custom
    plans are not *required* to provide any of this, but it is a nice pattern
    to follow.

    * plan_name --- e.g., ``'scan'``
    * detectors --- a list of the names of the detectors
    * motors --- a list of the names of the motors
    * plan_args --- dict of keyword arguments passed to the plan
    * plan_pattern --- function used to create the trajectory
    * plan_pattern_module --- Python module where ``plan_pattern`` is defined
    * plan_pattern_args --- dict of keyword arguments passed to
      ``plan_pattern`` to create the trajectory

    The ``plan_name`` and ``plan_args`` together should provide sufficient
    information to recreate the plan. The ``detectors`` and ``motors`` are
    convenient keys to search on later.

    The ``plan_pattern*`` entries provide lower-level, more explicit
    information about the *trajectory* ("pattern") generated by the plan,
    separate from the specific detectors and motors involved. For complex
    trajectories like spirals, this is especially useful. As a simple example,
    here is the pattern-related metadata for :func:`scan`.

    .. code-block:: python

        ...
        'plan_pattern': 'linspace',
        'plan_pattern_module': 'numpy',
        'plan_pattern_args': dict(start=start, stop=stop, num=num)
        ...

    Thus, one can re-create the "pattern" (trajectory) like so:

    .. code-block:: python

        numpy.linspace(**dict(start=start, stop=stop, num=num))

3. Automatically
----------------

For each run, the RunEngine automatically records:

* 'time' --- In this context, the start time. (Other times are also recorded.)
* 'uid' --- a globally unique ID for this run
* 'plan_name' --- the function or class name of ``plan`` (e.g., 'count')
* 'plan_type' --- the Python type of ``plan`` (e.g., 'generator')

The last two can be overridden by any of the methods above. The first two
cannot be overridden by the user.

.. note::

    If some custom plan does not specify a 'plan_name' and 'plan_type', the
    RunEngine infers them as follows:

    .. code-block:: python

        plan_type = type(plan).__name__
        plan_name = getattr(plan, '__name__', '')

    These may be more or less informative depending on what ``plan`` is. They
    are just heuristics to provide *some* information by default if the plan
    itself and the user do not provide it.
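
To see what these heuristics give for an ordinary generator-based plan, the
type name and the ``__name__`` attribute can be evaluated on a plain Python
generator. ``my_plan`` is a stand-in defined here for illustration; it does not
yield anything meaningful to a RunEngine.

.. code-block:: python

    def my_plan():
        # A stand-in generator; a real plan would yield bluesky messages.
        yield

    plan = my_plan()  # calling a generator function returns a generator object

    plan_type = type(plan).__name__            # name of the object's Python type
    plan_name = getattr(plan, '__name__', '')  # name of the generator function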

4. Interactively, for Repeated Use
----------------------------------

Each time a plan is executed, the current contents of ``RE.md`` are copied into
the metadata for all runs generated by the plan.  To enter metadata once to
reuse on all plans, add it to ``RE.md``.

.. code-block:: python

    RE.md['proposal_id'] = 123456
    RE.md['project'] = 'flying cars'
    RE.md['dimensions'] = (5, 3, 10)

View its current contents,

.. code-block:: python

    RE.md

delete a key you want to stop using,

.. code-block:: python

    del RE.md['project']   # delete a key

or use any of the standard methods that apply to
`dictionaries in Python <https://docs.python.org/3/library/stdtypes.html#typesmapping>`_.

.. warning::


   In general we recommend against putting device readings in the Start
   document. (The Start document is for who/what/why/when, things you
   know before you start communicating with hardware.) It is *especially*
   critical that you do not put device readings in the ``RE.md`` dictionary.
   The value will remain until you change it and will not track the state of
   the hardware.  This will result in recording out-of-date, incorrect data!

   This can be particularly dangerous if ``RE.md`` is backed by a
   persistent data store (see next section) because out-of-date readings will
   last across sessions.


The ``scan_id``, an integer that the RunEngine automatically increments at the
beginning of each scan, is stored in ``RE.md['scan_id']``.

.. warning::

    Clearing all keys, like so:

    .. code-block:: python

        RE.md.clear()  # clear *all* keys

    will reset the ``scan_id``. The next time a plan is executed, the
    RunEngine will start with a ``scan_id`` of 1 and set

    .. code-block:: python

        RE.md['scan_id'] = 1

    Some readers may prefer to reset the scan ID to 1 at the beginning of a new
    experiment; others may wish to maintain a single unbroken sequence of scan
    IDs forever.

    From a technical standpoint, it is fine to have duplicate scan IDs. All
    runs also have a randomly generated 'uid' ("unique ID"), which is globally
    unique forever.
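
The increment-and-reset behavior described above can be modeled in a few lines.
This is a sketch under the assumption that the next ID is simply the stashed
value plus one, with a missing value treated as 0. ``next_scan_id`` is a
hypothetical helper, and a plain dict stands in for ``RE.md``.

.. code-block:: python

    def next_scan_id(md):
        """Return the scan_id for the next run, starting from 1 if none is stashed."""
        return md.get('scan_id', 0) + 1

    md = {'scan_id': 41, 'project': 'flying cars'}  # stand-in for RE.md

    md['scan_id'] = next_scan_id(md)
    before_clear = md['scan_id']  # continues the existing sequence

    md.clear()  # removes *all* keys, including scan_id
    md['scan_id'] = next_scan_id(md)
    after_clear = md['scan_id']  # sequence starts over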

.. _md_persistence:

Persistence Between Sessions
----------------------------

We provide a way to save the contents of the metadata stash ``RE.md`` between
sessions (e.g., exiting and re-opening IPython).

In general, the ``RE.md`` attribute may be anything that supports the
dictionary interface. The simplest is just a plain Python dictionary.

.. code-block:: python

    RE.md = {}

To persist metadata between sessions, bluesky recommends
:class:`bluesky.utils.PersistentDict` --- a Python dictionary synced with a
directory of files on disk. Any changes made to ``RE.md`` are synced to disk,
so the contents of ``RE.md`` can persist between sessions.

.. code-block:: python

    from bluesky.utils import PersistentDict
    RE.md = PersistentDict('some/path/here')

Bluesky does not provide a strong recommendation on that path; that is a detail
left to the local deployment.

Bluesky formerly recommended using :class:`~historydict.HistoryDict` --- a
Python dictionary backed by a sqlite database file. This approach proved
problematic with the threading introduced in bluesky v1.6.0, so it is no longer
recommended. If you have been following that recommendation, you should migrate
your metadata from :class:`~historydict.HistoryDict` to
:class:`~bluesky.utils.PersistentDict`. First, update your configuration to
make ``RE.md`` a :class:`~bluesky.utils.PersistentDict` as shown above. Then,
migrate like so:

.. code-block:: python

   from bluesky.utils import get_history
   old_md = get_history()
   RE.md.update(old_md)

The :class:`~bluesky.utils.PersistentDict` object has been back-ported to
bluesky v1.5.6 as well. It is not available in 1.4.x or older, so once you move
to the new system, you must run bluesky v1.5.6 or higher.

.. warning::

    The ``RE.md`` object can also be set when the RunEngine is instantiated:

    .. code-block:: python

        # This:
        RE = RunEngine(...)

        # is equivalent to this:
        RE = RunEngine({})
        RE.md = ...

    As we stated
    :ref:`at the start of the tutorial <tutorial_run_engine_setup>`, if you are
    using bluesky at a user facility or with shared configuration, your
    ``RE`` may already be configured, and defining a new ``RE`` as above can
    result in data loss! If you aren't sure, it's safer to use ``RE.md = ...``.


Allowed Data Types
------------------

Custom metadata keywords can be mapped to:

* strings --- e.g., ``task='calibration'``
* numbers --- e.g., ``attempt=5``
* lists or tuples --- e.g., ``dimensions=[1, 3]``
* (nested) dictionaries --- e.g., ``dimensions={'width': 1, 'height': 3}``
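
If you want to check a metadata dictionary against this list before a run, a
recursive helper along these lines could be used. ``is_allowed`` is hypothetical
and not part of bluesky. It encodes only the types listed above, assumes that
dictionary keys are strings, and counts ``bool`` as a number because Python
does.

.. code-block:: python

    from numbers import Number

    def is_allowed(value):
        """Return True if value uses only the types listed above."""
        if isinstance(value, (str, Number)):
            return True
        if isinstance(value, (list, tuple)):
            return all(is_allowed(item) for item in value)
        if isinstance(value, dict):
            # Assumption: keys are strings; values are checked recursively.
            return all(isinstance(k, str) and is_allowed(v)
                       for k, v in value.items())
        return False

    ok = is_allowed({'task': 'calibration', 'dimensions': {'width': 1, 'height': 3}})
    not_ok = is_allowed({'dimensions': {1, 3}})  # a set is not in the list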


Required Fields
---------------

The fields:

* **uid**
* **time**

are reserved by the document model and cannot be set by the user.

In current versions of bluesky, **no fields are universally required by bluesky
itself**. It is possible to specify your own required fields in local
configuration. See :ref:`md_validator`. (At NSLS-II, there are facility-wide
requirements coming soon.)


Special Fields
--------------

Arbitrary custom fields are allowed --- you can invent any names that are
useful to you.

But certain fields are given special significance by bluesky's document model,
and are either disallowed or required to be a certain type.

The fields:

* **owner**
* **group**
* **project**

are optional but, to facilitate searchability, if they are not blank they must
be strings. A non-string, like ``owner=5``, will produce an error that will
interrupt scan execution immediately after it starts.

Similarly, the keyword **sample** has special significance. It must be either a
string or a dictionary.

The **scan_id** field is expected to be an integer, and it is automatically
incremented between runs. If a scan_id is not provided by the user or stashed
in the persistent metadata from the previous run, it defaults to 1.


.. _md_validator:

Validation
----------

Additional, customized metadata validation can be added to the RunEngine.
For example, to ensure that a run will not be executed unless the parameter
'sample_number' is specified, define a function that accepts a dictionary
argument and raises if 'sample_number' is not found.

.. code-block:: python

    def ensure_sample_number(md):
        if 'sample_number' not in md:
            raise ValueError("You forgot the sample number.")

Apply this function by setting

.. code-block:: python

    RE.md_validator = ensure_sample_number

The function will be executed immediately before each new run is opened.
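
When several fields are required, the same idea can be generalized with a
small factory. The sketch below is one possible pattern rather than a bluesky
feature; ``require_keys`` and the example key names are illustrative.

.. code-block:: python

    def require_keys(*keys):
        """Build a validator that raises if any of the given keys is missing."""
        def validator(md):
            missing = [key for key in keys if key not in md]
            if missing:
                raise ValueError(
                    f"Missing required metadata: {', '.join(missing)}")
        return validator

    ensure_sample_and_operator = require_keys('sample_number', 'operator')

The resulting function accepts a dictionary and raises on failure, so it can be
assigned to ``RE.md_validator`` in the same way as ``ensure_sample_number``.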