Mobile files

Status

FEPs go through a number of phases in their lifetime:

  • Partially Implemented: The FEP is being actively discussed; a subset has been implemented.

Abstract

This FEP adds the ability for filestore to copy / move files around the file system and keep track of those changes.

Detailed description

This FEP will provide an API to

  • keep a database that tracks the full history of file locations (implemented)

  • copy all data of a resource from one location in the file system to another and update all relevant entries (implemented)

    • This may cause trouble for usage patterns where multiple resources point to the same file

  • move files from one place to another (implemented)

  • delete files (implemented)

  • delete resources

  • verify data at both file system and Datum level
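File-system-level verification could be built on a chunked hash of each file. The following is a minimal sketch, not filestore code: `sha1_of_file` and `files_match` are hypothetical names, and SHA-1 is used only because this proposal lists it as one candidate hash.

```python
import hashlib
import os


def sha1_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-1 digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # iter() with a sentinel stops at the empty bytes returned at EOF
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def files_match(src, dst):
    """Check that a copy has the same size and SHA-1 digest as its source."""
    if os.path.getsize(src) != os.path.getsize(dst):
        return False
    return sha1_of_file(src) == sha1_of_file(dst)
```

Comparing sizes first avoids reading both files when the copy is obviously truncated.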

Implementation

General Requirements

  • implement Datum-level hashing

    • this should be a new collection which is keyed on DatumID and contains the hash (sha1 or md5) of the values

    • may contain additional statistics and properties of the datum

      • shape, dtype, (min, max, mean, histogram ?)

      • may want to keep the stats in a separate transient DB

  • each file spec needs a class/handler that will, given a resource, produce a list of all the files that are needed (partial; handlers still need to be fleshed out)

  • implement resource <-> absolute path mapping collection

    • this is transient as it can always be re-generated

    • need a way to flag as ‘alive’ or not

  • implement hashing of files

  • maybe implement a chroot, as well as a path, in Resource (implemented, but not as described)

    • this is so that you can say change_root(resource_id, new_root) and then the files along with the folder structure would be moved.

    • without doing this we could do something like change_root(resource_id, n_base, new_root), where n_base is how many layers of directory to strip off, but this requires knowing a fair amount about the actual paths involved.

    • Could also do something like change_path(path_mutation_func, resource_id) where path_mutation_func is a str -> str mapping function which is general, but is not great in terms of keeping this a controlled process and puts a big burden on the user.

    • if there are multiple copies of the same file, be able to control which copy gets used

      • this needs to be controllable based on which computer the computation is running on
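One way to control which copy gets used is an ordered list of preferred roots, with a default that can be configured per machine. The sketch below assumes this; `pick_root` and its arguments are hypothetical, and raising `LookupError` when nothing matches is an assumption rather than decided behaviour.

```python
def pick_root(available_roots, root_preference=None, default_preference=()):
    """Return the most preferred root that actually holds a copy.

    available_roots : roots where a copy of the resource exists
    root_preference : ordered roots, most preferred first, or None
    default_preference : fallback ordering, e.g. from per-machine config
    """
    preference = default_preference if root_preference is None else root_preference
    for root in preference:
        if root in available_roots:
            return root
    # Assumption: fail loudly rather than silently pick an arbitrary copy.
    raise LookupError("no preferred root holds a copy of this resource")
```

A compute node with fast local storage could list that root first in its default, while other machines would list only the shared mount.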

API proposal

Currently Implemented

Limited API

def change_root(resource_or_uid, new_root, remove_origin=True, verify=False):
    '''Change the root directory of a given resource

    The registered handler must have a `get_file_list` method and the
    process running this method must have read/write access to both the
    source and destination file systems.

    Parameters
    ----------
    resource_or_uid : Document or str
        The resource to move the files of

    new_root : str
        The new 'root' to copy the files into

    remove_origin : bool, optional (True)
        If the source files should be removed

    verify : bool, optional (False)
        Verify that the move happened correctly.  This currently
        is not implemented and will raise if ``verify == True``.
    '''

def shift_root(self, resource_or_uid, shift):
    '''Shift directory levels between root and resource_path

    This is useful because the root can be changed via `change_root`.

    Parameters
    ----------
    resource_or_uid : Document or str
        The resource to change the root/resource_path allocation
        of absolute path.

    shift : int
        The amount to shift the split.  Positive numbers move more
        levels into the root and negative values move levels into
        the resource_path
    '''

def insert_resource(self, spec, resource_path, resource_kwargs, root=''):
    pass
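The effect of `shift` on the root/resource_path split can be illustrated on plain strings. This is a sketch of the semantics described in the docstring, not the filestore implementation; `shift_split` is a hypothetical helper and it assumes POSIX-style paths.

```python
from pathlib import PurePosixPath


def shift_split(root, resource_path, shift):
    """Move `shift` directory levels between root and resource_path.

    Positive values move leading levels of resource_path into root;
    negative values move trailing levels of root into resource_path.
    The joined absolute path is unchanged.
    """
    root_parts = list(PurePosixPath(root).parts)
    path_parts = list(PurePosixPath(resource_path).parts)
    if shift > 0:
        if shift >= len(path_parts):
            raise ValueError("cannot shift the file name itself into the root")
        root_parts = root_parts + path_parts[:shift]
        path_parts = path_parts[shift:]
    elif shift < 0:
        n = -shift
        # The leading "/" anchor is not a directory level that can be moved.
        movable = [p for p in root_parts if p != "/"]
        if n > len(movable):
            raise ValueError("root does not have that many levels")
        path_parts = root_parts[-n:] + path_parts
        root_parts = root_parts[:-n]
    new_root = str(PurePosixPath(*root_parts)) if root_parts else ""
    return new_root, str(PurePosixPath(*path_parts))
```

For example, `shift_split("/mnt/DATA", "2015/05/06/my_data.h5", 1)` gives `("/mnt/DATA/2015", "05/06/my_data.h5")`, and a shift of `-1` gives `("/mnt", "DATA/2015/05/06/my_data.h5")`.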

Additional public API draft:

def get_resources_by_root(root, partial=False):
    pass


def get_resources_by_path(path, partial=False):
    pass


def get_resources_by_spec(spec):
    pass


def get_resource_by_uid(uid):
    pass

Extended schema

resource_update = {
    resource: uid,
    old: original_resource_doc,
    new: updated_resource_doc,
    time: timestamp (posix time),
    cmd: str, the command that generated the insertion
    cmd_kwargs: dict, the inputs to cmd
    }

resource = {
     spec: str,
     root: str,
     resource_path: str,
     resource_kwargs: dict,
     uid: str
     }
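A `resource_update` entry following the schema above could be assembled as follows. This is a sketch only: `make_resource_update` is a hypothetical helper, and storing deep copies of the documents is an assumption made so later edits to the inputs cannot alter the recorded history.

```python
import copy
import time


def make_resource_update(old_resource, new_resource, cmd, cmd_kwargs):
    """Build a resource_update record for one change to a resource."""
    if old_resource["uid"] != new_resource["uid"]:
        raise ValueError("old and new documents must describe the same resource")
    return {
        "resource": old_resource["uid"],
        "old": copy.deepcopy(old_resource),
        "new": copy.deepcopy(new_resource),
        "time": time.time(),  # posix time
        "cmd": cmd,
        "cmd_kwargs": dict(cmd_kwargs),
    }
```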

Full proposal

New Python API

def copy_resource(resource_id, new_root, old_root=None):
    """Copy all the files of a resource

    Parameters
    ----------
    resource_id : uuid
        The unique id of the resource to work on

    new_root : str
        The path to the location in the filesystem to copy
        the files into.  The full existing directory structure
        will be replicated on top of the new root

    old_root : str, optional
        If there exists more than one copy already, select
        which one to use

    """

def move_resource(resource_id, old_root, new_root):
    """Move all files for a resource to a new location


    This is the same as copy then delete.  Because of the
    delete step users must be explicit about source path.

    Parameters
    ----------
    resource_id : uuid
        The unique id of the resource to work on

    old_root : str
        If there exists more than one copy already, select
        which one to use

    new_root : str
        The path to the location in the filesystem to copy
        the files into.  The full existing directory structure
        will be replicated on top of the new root

    """

def remove_resource(resource_id, old_root, force_last=False):
    """Delete all files associated with a resource

    Parameters
    ----------
    resource_id : uuid
        The unique id of the resource to work on

    old_root : str
        Which set of files to delete

    force_last : bool, optional
        If False, will raise RuntimeError rather than
        delete the last copy of the files.


    """

def insert_resource(spec, resource_root, resource_path, resource_kwargs=None):
    """
    Parameters
    ----------

    spec : str
        spec used to determine what handler to use to open this
        resource.

    resource_path, resource_root : str or None
        URL to the physical location of this resource

    resource_kwargs : dict, optional
        name/value pairs of additional kwargs to be
        passed to the handler to open this resource.

    """

def retrieve(eid, root_preference=None):
    """
    Given a resource identifier return the data.

    The root_preference allows control over which copy
    of the data is used if there is more than one available.

    Parameters
    ----------
    eid : str
        The resource ID (as stored in MDS)

    root_preference : list, optional
        A list of preferred root locations to pull data from in
        descending order.

        If None, fall back to configurable default.

    Returns
    -------
    data : ndarray
        The requested data as a numpy array
    """

New DB schema:

class Resource(Document):
    """

    Parameters
    ----------

    spec : str
        spec used to determine what handler to use to open this
        resource.

    resource_path : str
        Url to the physical location of the resource

    resource_kwargs : dict
        name/value pairs of additional kwargs to be
        passed to the handler to open this resource.

    """

    spec = StringField(required=True, unique=False)
    path = StringField(required=True, unique=False)
    kwargs = DictField(required=False)
    uid = StringField(required=True, unique=True)

    meta = {'indexes': ['-_id', 'resource_root'], 'db_alias': ALIAS}


class ResourceRoots(DynamicDocument):
    """
    Many to one mapping between Resource documents and chroot paths.

    The idea is that the absolute path of a file contains two
    parts, the root, which is set by details of how the file
    system is mounted, and the relative path which is set by some
    sort of semantics.  For example in the path ::

        /mnt/DATA/2015/05/06/my_data.h5

    ``/mnt/DATA/`` is the root and ``2015/05/06/my_data.h5`` is
    the relative path.

    In the case of a URL this would be ::

      http://data.nsls-ii.bnl.gov/xf11id/2015/05/06/my_data.h5

    the root would be ``http://data.nsls-ii.bnl.gov/`` and the
    relative path would be ``xf11id/2015/05/06/my_data.h5``

    Parameters
    ----------
    root : str
        The chroot of the resource.

    resource_uid : str
        The uid of the resource this is associated with

    """
    root = StringField(required=True, unique=False)
    resource_uid = StringField(required=True, unique=False)


class File(Document):
    """
    This is 'semi-transient', everything in here can be rebuilt
    if needed from Resource, Datum, and their helper code, but
    the hash can be used for validation
    """
    resource_uid = StringField(required=True, unique=False)
    root = StringField(required=True, unique=False)

    uid = StringField(required=True, unique=True)
    abs_path = StringField(required=True, unique=True)
    sha1_hash = StringField(required=True)
    size = FloatField(required=True)
    exists = BooleanField(required=True)


class DatumStats(DynamicDocument):
    datum_uid = StringField(required=True, unique=True)
    sha1_hash = StringField(required=True)
    shape = ListField(field=IntField())

class CommandJournal(Document):
    command = StringField(required=True)
    args = ListField()
    kwargs = DictField()
    success = BooleanField(required=True)

In a departure from our standard design protocol, allow the ‘exists’ field of File to be updated in place. Alternatively, keep a collection that is just a (resource_uid, root) create/delete journal. Another option is to allow remove to delete entries from the File collection.
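The journal option can be sketched with an append-only list of events, where the set of live copies is derived by replaying it. Everything here is illustrative: the `action` key and the function names are assumptions, and a real version would live in a DB collection rather than a Python list.

```python
def record_create(journal, resource_uid, root):
    """Append a 'create' event for a copy of the resource under root."""
    journal.append({"resource_uid": resource_uid, "root": root, "action": "create"})


def live_roots(journal, resource_uid):
    """Replay the journal to find the roots that currently hold a copy."""
    roots = set()
    for entry in journal:
        if entry["resource_uid"] != resource_uid:
            continue
        if entry["action"] == "create":
            roots.add(entry["root"])
        elif entry["action"] == "delete":
            roots.discard(entry["root"])
    return roots


def record_delete(journal, resource_uid, root, force_last=False):
    """Append a 'delete' event, refusing to drop the last copy by default."""
    current = live_roots(journal, resource_uid)
    if root not in current:
        raise ValueError("no live copy of this resource under that root")
    if current == {root} and not force_last:
        raise RuntimeError("refusing to delete the last copy of the files")
    journal.append({"resource_uid": resource_uid, "root": root, "action": "delete"})
```

Because entries are only ever appended, this keeps the usual insert-only discipline while still answering "does this copy exist?".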

Backward compatibility

This will require a DB migration and will break all of the AD instances that insert into FS.

Alternatives

None yet