============== Mobile files ============== .. contents:: :local: Status ====== FEPs go through a number of phases in their lifetime: - **Partially Implemented**: The FEP is being actively discussed, a sub set has been implemented. Branches and Pull requests ========================== - https://github.com/NSLS-II/filestore/pull/58 Abstract ======== This FEP adds the ability for filestore to copy / move files around the file system and keep track of those changes. Detailed description ==================== This FEP will provide API to - database to keep track of the full history of file locations *implemented* - make a copy of all data from a resource from one location in the file system to another and update all relevant entries *implemented* - This may be trouble for some usage patterns where multiple resources point to same file - move files from one place to another *implemented* - delete files *implemented* - delete resources - verify data at both file system and Datum level Implementation ============== General Requirements -------------------- - implement Datum-level hashing - this should be a new collection which is keyed on DatumID and contains the hash (sha1 or md5) of the values - may contain additional statistics, proprieties about datum - shape, dtype, (min, max, mean, histogram ?) - may want to stats as separate transient DB - each file spec needs class/handler that will, given a resource, produce a list of all the files that are needed *partial, need to flesh out handlers* - implement resource < - > absolute path mapping collection - this is transient as it can always be re-generated - need a way to flag as 'alive' or not - implement hashing of files - maybe implement a chroot, as well as path into Resource *implemented, but not as described* - this is so that you can say ``change_root(resource_id, new_root)`` and then the files along with the folder structure would be moved. - without doing this we could do something like ``change_root(resource_id, n_base, new_root)`` where n_base is how many layers of directory to strip off, but this requires knowing a fair amount about the actually paths involved in the - Could also do something like ``change_path(path_mutation_func, resource_id)`` where ``path_mutation_func`` is a str -> str mapping function which is general, but is not great in terms of keeping this a controlled process and puts a big burden on the user. - if there are multiple copies of the same file be able to control which version gets hit - this needs to be controllable based on which computer the compute is running on API proposal ------------ Currently Implemented ********************* Limited API :: def change_root(resource, new_root, remove_origin=True, verify=False): '''Change the root directory of a given resource The registered handler must have a `get_file_list` method and the process running this method must have read/write access to both the source and destination file systems. Parameters ---------- resource_or_uid : Document or str The resource to move the files of new_root : str The new 'root' to copy the files into remove_origin : bool, optional (True) If the source files should be removed verify : bool, optional (False) Verify that the move happened correctly. This currently is not implemented and will raise if ``verify == True``. ''' def shift_root(self, resource_or_uid, shift): '''Shift directory levels between root and resource_path This is useful because the root can be change via `change_root`. Parameters ---------- resource_or_uid : Document or str The resource to change the root/resource_path allocation of absolute path. shift : int The amount to shift the split. Positive numbers move more levels into the root and negative values move levels into the resource_path ''' def insert_resource(self, spec, resource_path, resource_kwargs, root=''): additional public API *draft*:: def get_resources_by_root(root, partial=False): pass def get_resources_by_path(path, partial=False): pass def get_resources_by_spec(spec): pass def get_resource_by_uid(uid): pass extended schema :: resource_update = { resource: uid, old: original_resource_doc, new: updated_serouce_doc, time: timestamp (posix time), cmd: str, the command that generated the insertion cmd_kwargs: dict, the inputs to cmd } resource = { spec: str, root: str, resource_path: str, resource_kwargs: dict, uid: str } Full proposal ************* New python API :: def copy_resource(resource_id, new_root, old_root=None): """Copy all the files of a resource Parameters ---------- resource_id : uuid The unique id of the resource to work on new_root : str The path to the location in the filesystem to cop the files into. The full existing directory structure will be replicated on top of the now root old_root : str, optional If there exists more than one copy already, select which one to use """ def move_resource(resource_id, old_root, new_root): """Move all files for a resource to a new location This is the same as copy then delete. Because of the delete step users must be explicit about source path. Parameters ---------- resource_id : uuid The unique id of the resource to work on old_root : str If there exists more than one copy already, select which one to use new_root : str The path to the location in the filesystem to cop the files into. The full existing directory structure will be replicated on top of the now root """ def remove_resource(resource_id, old_root, force_last=False): """Delete all files associated with a resource Parameters ---------- resource_id : uuid The unique id of the resource to work on old_root : str Which set of files to delete force_last : bool, optional If False, will raise RuntimeError rather than delete the last copy of the files. """ def insert_resource(spec, resource_root, resource_path, resource_kwargs=None): """ Parameters ---------- spec : str spec used to determine what handler to use to open this resource. resource_path, resource_root : str or None Url to the physical location of this resource resource_kwargs : dict, optional resource_kwargs name/value pairs of additional kwargs to be passed to the handler to open this resource. """ def retrieve(eid, root_preference=None) """ Given a resource identifier return the data. The root_preference allows control over which copy of the data is used if there is more than one available. Parameters ---------- eid : str The resource ID (as stored in MDS) root_preference : list, optional A list of preferred root locations to pull data from in descending order. If None, fall back to configurable default. Returns ------- data : ndarray The requested data as a numpy array """ New DB schema:: class Resource(Document): """ Parameters ---------- spec : str spec used to determine what handler to use to open this resource. resource_path : str Url to the physical location of the resource resource_kwargs : dict name/value pairs of additional kwargs to be passed to the handler to open this resource. """ spec = StringField(required=True, unique=False) path = StringField(required=True, unique=False) kwargs = DictField(required=False) uid = StringField(required=True, unique=True) meta = {'indexes': ['-_id', 'resource_root'], 'db_alias': ALIAS} class ResourceRoots(DynamicDocument): """ Many to one mapping between Resource documents and chroot paths. The idea is that the absolute path of a file contains two parts, the root, which is set by details of how the file system is mounted, and the relative path which is set by some sort of semantics. For example in the path :: /mnt/DATA/2015/05/06/my_data.h5 ``/mnt/DATA/`` is the root and ``2015/05/06/my_data.h5`` is the relative path. In the case of a URL this would be :: http://data.nsls-ii.bnl.gov/xf11id/2015/05/06/my_data.h5 the root would be ``http://data.nsls-ii.bnl.gov/`` and the relative path would be ``xf11id/2015/05/06/my_data.h5`` Parameters ---------- root : str The chroot of the resource. resource_uid : str The uid of the resource this is associated with """ root = StringField(required=True, unique=False) resource_uid = StringField(required=True, unique=False) class File(Document): """ This is 'semi-transient', everything in here can be rebuilt if needed from Resource, Datum, and their helper code, but the hash can be used for validation """ resource_uid = StringField(required=True, unique=False) root = StringField(required=True, unique=False) uid = StringField(required=True, unique=True) abs_path = StringField(required=True, unique=True) sha1_hash = StringField(required=True) size = FloatField(required=True) exists = Bool(required=True) class DatumStats(DynamicDocument): datum_uid = StringField(required=True, unique=True) sha1_hash = StringField(required=True) shape = ListField(field=IntField()) class CommandJournal(Document): command = StringField(required=True) args = ListField() kwargs = DictField() success = Bool(required=True) In a departure from our standard design protocol let File have the 'exists' field be updated. Or have a collection which is just a (resource_uid, root) create/delete journal. Another option is to allow ``remove`` to delete entries from `File` collection. Backward compatibility ====================== This will require a DB migration and breaks all of the AD instances that insert into FS. Alternatives ============ None yet