DataBroker mini-hackathon
=========================

Dates: Feb 26 - Mar 2

| 1000 - 1800 M-Th
| 1000 - 1600 Fri

Goals
*****

- produce planning artifacts to support DOE hackathon in June:

  1. high level design documents (requirements, goals) for modules

     - use cases

  2. low level design documents (proposed API documentation)

- sort out priority and interdependence of high-level requirements

  - where do we have duplicate effort between facilities?
  - where do we have gaps in effort?
  - what can we get 'good enough' to unblock other work?
  - where can we work in parallel?
  - where must we work serially?

High-level Requirements
***********************

1. \*-as-a-service

   - as first pass, preserve python API, rip out guts
   - worry about nice REST/RPC interface later
   - wire protocol for small data

     - json, bson, msgpack

   - wire protocol for big data

     - msgpack, epics (v3 or v7), custom, h5serv

   - ops

     - ansible / kubernetes

2. authorization

   - do we need to look at how 'role based' auth works?
   - probably want to develop a common set of nouns / tools for our
     use and then shim against home institution systems
   - user groupings

     - user (e.g. 1 person)
     - proposal (application to get beam time)

       - one or more PI
       - zero or more 'users'

     - beamtime (actual data collection period)

       - one or more PI
       - zero or more 'users'

     - 1 or more beamtime per proposal
     - exactly one proposal per beamtime

   - where to do authorization?

     - proposal or beamtime?
     - filter at runstart / header level

       - once you have header object, assume you *should* have it
       - if it gets pickled and shipped, maybe re-validate on first
         access?

     - authorize on each call to header methods / attributes
     - integration with filesystem users / groups?

   - how much to segment data?

     - database per ACL entity?

   - rely on external tools for authentication

3. slicing (by stream / field / event)

   - streaming-by-chunks

4. alternate implementations

   - diverse backends

     - bucket-of-files

       - single point-of-entry for ingest
       - hdf5 / cbf / tiff / ...

     - databases

       - sqlite / mongo / postgres / elastic / cloud databases?

     - arrow?

   - fully local implementations

5. catalog services

   - related to federated storage
   - related to provenance
   - integration with simulation data

6. user annotations and tags

   - fully mutable
   - fully searchable
   - local or remote
   - shareable or private

7. integration with GUI / web

   - Xicam
   - glueviz
   - databrokerbrowser
   - BEC-as-a-service

8. federated storage systems

   - a given Broker object may have more than one (\*Source)

9. provenance

   - derived-data storage
   - naming schemes
   - related to federated storage
   - there is an on-going LDRD about this at BNL

10. data 'muxing'

    - align multiple streams in one header
    - merge events from multiple headers
    - split events from one stream in one header into multiple
      streams (?)
    - pivot "event with a time series" to "series of events"
    - stack "series of events" to "event with a time series"
    - resampling

11. pipelines

    - how do they interact with the databases?
    - what goes between the edges?

12. search

    - maintaining rich start search across implementations

      - graphql
      - mongo
      - postgres

    - fuzzy-search?
    - search into stop documents?
    - search into descriptors

      - use of particular hardware
      - configuration of detector
      - "all runs where ..."

    - search in data sets

      - "show me all runs where the theta motor was between [0, 5]"
      - "show me all runs where the sum of the fccd images > 1000"

    - first-class query object?

      - provide ``&`` and ``|`` operators for building complex queries
      - manage turning user intent into input to different
        HeaderSource implementations
      - manage serialization of query for re-use later
      - support GUI based search tools

13. name standardization

    - use nexus dictionary

14. alternative returns

    - xarray?
    - numpy array?
    - dask?
    - VTK/ITK primitives?

15. configuration management

    - some progress, but needs to be richer

16. feeding computation services

    - dask
    - SHED
    - paws
    - staging to traditional HPC

17. data export

    - suitcase
    - round-trip with 'common' formats
    - write nexus definition
    - publish to MDF / whatever

Agenda
******

- Day 1

  - current architecture and API (60 minute talk by DAMA)

    - Document types

      - Start
      - Stop
      - Descriptor
      - Event
      - Resource
      - Datum
      - BulkEvent
      - BulkDatum

    - concept of 'stream'
    - Header class

      - holds aggregated meta-data (Start, Stop, Descriptor, Resource)
      - access method to full data set
      - manages filling

    - top-level Broker class

      - search, direct header access, 'most recent' access

    - storage factoring

      - HeaderSource
      - EventSource
      - AssetRegistry

  - high level requirements
  - work planning

    - break into 2-3 person cross-facility teams

- Day 2

  - Phillip Cloud visit

    - he presents arrow / pandas 2
    - present to him what we are doing
    - applications of arrow to \*-as-a-service?
    - arrow as a column store?
    - are we using pandas right?

  - Facility tour
  - working in groups

- Day 3/4

  - working in groups

- Day 5

  - working in groups
  - presentation of progress made (1200-1400)
  - wrap-up (1400-1600)
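
As a discussion aid for the "first-class query object" item under the search requirement, here is a minimal sketch of what such an object could look like. It is not an existing DataBroker API: the class names (``Query``, ``Eq``, ``Between``, ``And``, ``Or``), the ``to_mongo`` / ``to_dict`` / ``from_dict`` helpers, and the metadata keys used in the example are all hypothetical, and only a mongo-style translation is shown; other HeaderSource implementations would need their own.

```python
import json
from dataclasses import dataclass


class Query:
    """Hypothetical base class: & and | combine queries into a tree."""

    def __and__(self, other):
        return And(self, other)

    def __or__(self, other):
        return Or(self, other)


@dataclass(frozen=True)
class Eq(Query):
    key: str
    value: object

    def to_mongo(self):
        return {self.key: self.value}

    def to_dict(self):
        return {"op": "eq", "key": self.key, "value": self.value}


@dataclass(frozen=True)
class Between(Query):
    key: str
    low: float
    high: float

    def to_mongo(self):
        # Inclusive range on both ends.
        return {self.key: {"$gte": self.low, "$lte": self.high}}

    def to_dict(self):
        return {"op": "between", "key": self.key,
                "low": self.low, "high": self.high}


@dataclass(frozen=True)
class And(Query):
    left: Query
    right: Query

    def to_mongo(self):
        return {"$and": [self.left.to_mongo(), self.right.to_mongo()]}

    def to_dict(self):
        return {"op": "and", "left": self.left.to_dict(),
                "right": self.right.to_dict()}


@dataclass(frozen=True)
class Or(Query):
    left: Query
    right: Query

    def to_mongo(self):
        return {"$or": [self.left.to_mongo(), self.right.to_mongo()]}

    def to_dict(self):
        return {"op": "or", "left": self.left.to_dict(),
                "right": self.right.to_dict()}


def from_dict(d):
    """Rebuild a query tree from its plain-dict (JSON-friendly) form."""
    op = d["op"]
    if op == "eq":
        return Eq(d["key"], d["value"])
    if op == "between":
        return Between(d["key"], d["low"], d["high"])
    if op in ("and", "or"):
        cls = And if op == "and" else Or
        return cls(from_dict(d["left"]), from_dict(d["right"]))
    raise ValueError(f"unknown query op: {op!r}")


# Build a query from user intent (keys are illustrative only).
q = (Eq("plan_name", "count") & Between("scan_id", 100, 200)) | Eq("sample", "Cu")

# Translate for one backend, and round-trip through JSON for later re-use.
mongo_query = q.to_mongo()
restored = from_dict(json.loads(json.dumps(q.to_dict())))
```

Keeping the query as a small tree of plain objects, rather than a backend-specific dict, is what would let each HeaderSource implementation supply its own translation, and would let a GUI search tool build or display the same tree.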