DataBroker mini-hackathon
Dates: Feb 26 - Mar 2
1000 - 1800 M-Th, 1000 - 1600 Fri
Goals
produce planning artifacts to support DOE hackathon in June:
high level design documents (requirements, goals) for modules - use cases
low level design documents (proposed API documentation)
sort out priority and interdependence of high
where do we have duplicate effort between facilities?
where do we have gaps in effort?
what can we get ‘good enough’ to un-block other work?
where can we work in parallel?
where must we work serially?
High-level Requirements
*-as-a-service
as first pass, preserve python API, rip out guts
worry about nice REST/RPC interface later
wire protocol for small data
json, bson, msgpack
wire protocol for big data
msgpack, epics (v3 or v7), custom, h5serv
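To make the small-data wire protocol option concrete, here is a minimal sketch that round-trips a start-like document through JSON using only the standard library. The field names (`uid`, `time`, `plan_name`) and the helpers `encode_document` / `decode_document` are assumptions for illustration, not an agreed schema. bson or msgpack could sit behind the same two functions.

```python
import json


def encode_document(name, doc):
    """Serialize a (name, document) pair to UTF-8 encoded JSON bytes."""
    return json.dumps({"name": name, "doc": doc}, sort_keys=True).encode("utf-8")


def decode_document(payload):
    """Inverse of encode_document: bytes -> (name, document)."""
    message = json.loads(payload.decode("utf-8"))
    return message["name"], message["doc"]


# Hypothetical start document; the keys are illustrative only.
start = {"uid": "abc123", "time": 1519650000.0, "plan_name": "count"}
payload = encode_document("start", start)
name, doc = decode_document(payload)
```

Keeping the document name next to the payload lets a receiver dispatch on document type without inspecting the body.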
ops
ansible / kubernetes
authorization
do we need to look at how ‘role based’ auth works?
probably want to develop a common set of nouns / tools for our use and then shim against home institution systems
user groupings
user (ex 1 person)
proposal (application to get beam time)
one or more PI
zero or more ‘users’
beamtime (actually data collection period)
one or more PI
zero or more ‘users’
1 or more beamtime per proposal
exactly one proposal per beamtime
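The user groupings above could be modelled roughly as follows. This is only a sketch: the class and attribute names (`proposal_id`, `beamtime_id`, `members`) are hypothetical, and a real implementation would be a shim over each home institution's system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class User:
    """Exactly one person."""
    name: str


@dataclass
class Proposal:
    """Application to get beam time: one or more PI, zero or more users."""
    proposal_id: str
    pis: List[User]
    users: List[User] = field(default_factory=list)
    # Filled in as beamtimes are created; excluded from repr and comparison
    # to avoid following the proposal <-> beamtime cycle.
    beamtimes: List["Beamtime"] = field(
        default_factory=list, repr=False, compare=False
    )

    def __post_init__(self):
        if not self.pis:
            raise ValueError("a proposal needs one or more PI")


@dataclass
class Beamtime:
    """Data collection period, tied to exactly one proposal."""
    beamtime_id: str
    proposal: Proposal
    pis: List[User]
    users: List[User] = field(default_factory=list)

    def __post_init__(self):
        if not self.pis:
            raise ValueError("a beamtime needs one or more PI")
        # Register with the parent so a proposal can list its beamtimes.
        self.proposal.beamtimes.append(self)

    def members(self):
        """Everyone attached to this beamtime, PIs and users alike."""
        return set(self.pis) | set(self.users)
```

The "1 or more beamtime per proposal" rule cannot be checked when the proposal is created, since the beamtimes are scheduled later, so this sketch leaves it unenforced.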
where to do authorization?
proposal or beamtime?
filter at runstart / header level
once you have header object, assume you _should_ have it
if it gets pickled and shipped, maybe re-validate on first access?
authorize on each call to header methods / attributes
integration with filesystem users / groups?
how much to segment data?
database per ACL entity?
rely on external tools for authentication
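One way to picture the "filter at runstart / header level" option is a default-deny filter over start documents. This is a sketch under stated assumptions: the `beamtime_id` key in the start document and the shape of the `acl` mapping are hypothetical, and authentication of `user` is assumed to have happened elsewhere.

```python
def visible_headers(start_docs, user, acl):
    """Yield only the start documents that `user` is allowed to see.

    `acl` maps a beamtime id to the set of user names allowed to read it.
    Documents with no beamtime id, or an unknown one, are never yielded.
    """
    for start in start_docs:
        allowed = acl.get(start.get("beamtime_id"), set())
        if user in allowed:
            yield start


# Illustrative data only.
starts = [
    {"uid": "a", "beamtime_id": "BT-1"},
    {"uid": "b", "beamtime_id": "BT-2"},
    {"uid": "c"},
]
acl = {"BT-1": {"alice", "bob"}, "BT-2": {"alice"}}
bob_uids = [s["uid"] for s in visible_headers(starts, "bob", acl)]
```

Filtering here means a header object is only ever handed to someone entitled to it, which matches the "once you have header object, assume you should have it" option.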
slicing (by stream / field / event)
streaming-by-chunks
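Streaming-by-chunks can be sketched as a generator that groups any iterable of events into bounded lists. The name `iter_chunks` is hypothetical, and the sketch does not address how chunks would travel over the wire.

```python
from itertools import islice


def iter_chunks(events, chunk_size):
    """Yield lists of at most `chunk_size` events from any iterable."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    iterator = iter(events)
    while True:
        chunk = list(islice(iterator, chunk_size))
        if not chunk:
            return
        yield chunk
```

Because the source is consumed lazily, only one chunk needs to be held in memory at a time.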
alternate implementations
diverse backends
bucket-of-files - single point-of-entry for ingest - hdf5 / cbf / tiff / …
databases - sqlite / mongo / postgres / elastic / cloud databases?
arrow?
fully local implementations
catalog services - related to federated storage - related to provenance - integration with simulation data
user annotations and tags - fully mutable - fully searchable - local or remote - shareable or private
integration with GUI / web - Xicam - glueviz - databrokerbrowser - BEC-as-a-service
federated storage systems - a given Broker object may have more than one (*Source)
provenance - derived-data storage - naming schemes - related to federated storage - there is an on-going LDRD about this at BNL
data ‘muxing’ - align multiple streams in one header - merge events from multiple headers - split events from one stream in one header into multiple streams (?) - pivot “event with a time series” to “series of events” - stack “series of events” to “event with a time series” - resampling
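The pivot and stack operations listed under muxing can be sketched on plain dictionaries. The layout used here (per-field lists under `data` and `timestamps`) is an assumption for illustration, and `pivot_to_events` / `stack_to_event` are hypothetical names.

```python
def pivot_to_events(event, field):
    """Split one event holding a time series into a series of scalar events."""
    values = event["data"][field]
    stamps = event["timestamps"][field]
    if len(values) != len(stamps):
        raise ValueError("values and timestamps must have equal length")
    return [
        {"seq_num": i, "time": t, "data": {field: v}}
        for i, (v, t) in enumerate(zip(values, stamps), start=1)
    ]


def stack_to_event(events, field):
    """Inverse operation: collect a series of events into one event."""
    return {
        "data": {field: [e["data"][field] for e in events]},
        "timestamps": {field: [e["time"] for e in events]},
    }


# Illustrative event with a three-point time series.
event = {
    "data": {"temp": [1.0, 2.0, 3.0]},
    "timestamps": {"temp": [10.0, 11.0, 12.0]},
}
series = pivot_to_events(event, "temp")
```

In this layout the two functions are inverses of each other, so `stack_to_event(series, "temp")` gives back the original event.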
pipelines - how do they interact with the databases? - what goes between the edges?
search - maintaining rich start search across implementations
graphql http://graphql.org/
mongo https://docs.mongodb.com/manual/tutorial/query-documents/
postgres https://www.postgresql.org/docs/10/static/functions-json.html
fuzzy-search?
search into stop documents?
search into descriptors
use of particular hardware
configuration of detector
“all runs where
search in data sets
“show me all runs where the theta motor was between [0, 5]”
“show me all runs where the sum of the fccd images > 1000”
first-class query object?
provide & and | operators for building complex queries
manage turning user intent into input to different HeaderSource implementations
manage serialization of query for re-use later
support GUI based search tools
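A first-class query object along these lines might look like the sketch below. The `Query` class and its methods are hypothetical, and only equality tests are covered. It shows `&` / `|` composition, serialization for later re-use, translation to one backend's syntax (a MongoDB-style filter), and an in-memory fallback.

```python
import json


class Query:
    """A tiny composable, serializable query tree."""

    def __init__(self, op, *args):
        self.op = op      # "eq", "and" or "or"
        self.args = args  # (key, value) for "eq", sub-queries otherwise

    @classmethod
    def eq(cls, key, value):
        return cls("eq", key, value)

    def __and__(self, other):
        return Query("and", self, other)

    def __or__(self, other):
        return Query("or", self, other)

    def to_dict(self):
        """Plain-dict form, suitable for json.dumps."""
        if self.op == "eq":
            key, value = self.args
            return {"op": "eq", "key": key, "value": value}
        return {"op": self.op, "args": [a.to_dict() for a in self.args]}

    @classmethod
    def from_dict(cls, d):
        if d["op"] == "eq":
            return cls.eq(d["key"], d["value"])
        return cls(d["op"], *(cls.from_dict(a) for a in d["args"]))

    def to_mongo(self):
        """Translate to a MongoDB-style filter document."""
        if self.op == "eq":
            key, value = self.args
            return {key: value}
        return {"$" + self.op: [a.to_mongo() for a in self.args]}

    def matches(self, doc):
        """Evaluate against an in-memory document (e.g. a local backend)."""
        if self.op == "eq":
            key, value = self.args
            return doc.get(key) == value
        combine = all if self.op == "and" else any
        return combine(a.matches(doc) for a in self.args)


# Parentheses matter: & binds more tightly than | in Python.
q = (Query.eq("plan_name", "count") | Query.eq("plan_name", "scan")) & Query.eq(
    "beamline", "CSX"
)
saved = json.dumps(q.to_dict())
restored = Query.from_dict(json.loads(saved))
```

Each HeaderSource implementation would add its own translation method next to `to_mongo`, so user intent is expressed once and mapped per backend.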
name standardization - use nexus dictionary
alternative returns - xarray? - numpy array? - dask? - VTK/ITK primitives?
configuration management - some progress, but needs to be richer
feeding computation services - dask - SHED - paws - staging to traditional HPC
data export - suitcase - round-trip with ‘common’ formats - write nexus definition - publish to MDF / whatever
Agenda
Day 1
current architecture and API (60 minute talk by DAMA)
Document types - Start - Stop - Descriptor - Event - Resource - Datum - BulkEvent - BulkDatum
concept of ‘stream’
Header class
holds aggregated meta-data (Start, Stop, Descriptor, Resource)
access method to full data set
manages filling
top-level Broker class
search, direct header access, ‘most recent’ access
storage factoring
HeaderSource
EventSource
AssetRegistry
high level requirements
work planning - break into 2-3 person cross-facility teams
Day 2
Philip Cloud visit
he presents arrow / pandas 2
present to him what we are doing
applications of arrow to *-as-a-service?
arrow as a column store?
are we using pandas right?
Facility tour
working in groups
Day 3/4
working in groups
Day 5
working in groups
presentation of progress made (1200-1400)
wrap-up by 1400-1600