Mobile files¶
Status¶
FEPs go through a number of phases in their lifetime:
Partially Implemented: The FEP is being actively discussed; a subset has been implemented.
Abstract¶
This FEP adds the ability for filestore to copy / move files around the file system and keep track of those changes.
Detailed description¶
This FEP will provide an API to (see the usage sketch after this list):

- keep a database tracking the full history of file locations (implemented)
- make a copy of all data from a resource from one location in the file system to another and update all relevant entries (implemented)

  - this may be trouble for some usage patterns where multiple resources point to the same file

- move files from one place to another (implemented)
- delete files (implemented)
- delete resources
- verify data at both the file system and Datum level
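As a rough, hypothetical walk-through of the workflow this FEP targets, the sketch below strings the proposed calls together.  The function names follow the full proposal further down; ``fs`` (a FileStore-like object), the spec name, the returned uid, and all paths are placeholders, not part of the proposal::

    # register a resource: root + semantic resource_path
    # (full-proposal signature: spec, resource_root, resource_path, resource_kwargs)
    rid = fs.insert_resource('AD_HDF5', '/mnt/DATA',
                             '2015/05/06/my_data.h5')   # assume the uid is returned

    # copy the backing files to a second root; the change is journaled
    fs.copy_resource(rid, new_root='/mnt/archive')

    # later, drop the original copy; this refuses to delete the last
    # remaining copy unless force_last=True
    fs.remove_resource(rid, old_root='/mnt/DATA')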
Implementation¶
General Requirements¶
- implement Datum-level hashing (see the sketch after this list)

  - this should be a new collection which is keyed on DatumID and contains the hash (sha1 or md5) of the values
  - may contain additional statistics / properties about the datum

    - shape, dtype, (min, max, mean, histogram?)

  - may want to keep the stats as a separate transient DB

- each file spec needs a class/handler that will, given a resource, produce a list of all the files that are needed (partial, need to flesh out handlers)
- implement a resource <-> absolute path mapping collection

  - this is transient as it can always be re-generated
  - need a way to flag entries as 'alive' or not

- implement hashing of files
- maybe implement a chroot, as well as a path, into Resource (implemented, but not as described)

  - this is so that you can say ``change_root(resource_id, new_root)`` and then the files, along with the folder structure, would be moved.  Without doing this we could do something like ``change_root(resource_id, n_base, new_root)``, where ``n_base`` is how many layers of directory to strip off, but this requires knowing a fair amount about the actual paths involved.  We could also do something like ``change_path(path_mutation_func, resource_id)``, where ``path_mutation_func`` is a str -> str mapping function; this is general, but is not great in terms of keeping this a controlled process and puts a big burden on the user.

- if there are multiple copies of the same file, be able to control which version gets hit

  - this needs to be controllable based on which computer the compute is running on
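A minimal sketch of what Datum-level hashing could look like, assuming the datum values are numpy arrays returned by the registered handler; the helper name and the choice of statistics are illustrative, not part of the proposal::

    import hashlib
    import numpy as np

    def datum_stats(datum_uid, value):
        """Build a DatumStats-style record for one datum value.

        ``value`` is assumed to be the ndarray the handler returned for
        ``datum_uid``; the sha1 is taken over the raw bytes in C order.
        """
        arr = np.ascontiguousarray(value)
        return {
            'datum_uid': datum_uid,
            'sha1_hash': hashlib.sha1(arr.tobytes()).hexdigest(),
            'shape': list(arr.shape),
            'dtype': str(arr.dtype),
            # optional extra statistics mentioned in the requirements
            'min': float(arr.min()),
            'max': float(arr.max()),
            'mean': float(arr.mean()),
        }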
API proposal¶
Currently Implemented¶
Limited API::

    def change_root(resource_or_uid, new_root, remove_origin=True, verify=False):
        '''Change the root directory of a given resource

        The registered handler must have a `get_file_list` method and the
        process running this method must have read/write access to both the
        source and destination file systems.

        Parameters
        ----------
        resource_or_uid : Document or str
            The resource to move the files of

        new_root : str
            The new 'root' to copy the files into

        remove_origin : bool, optional (True)
            If the source files should be removed

        verify : bool, optional (False)
            Verify that the move happened correctly.  This currently
            is not implemented and will raise if ``verify == True``.
        '''

    def shift_root(self, resource_or_uid, shift):
        '''Shift directory levels between root and resource_path

        This is useful because the root can be changed via `change_root`.

        Parameters
        ----------
        resource_or_uid : Document or str
            The resource to change the root/resource_path allocation
            of absolute path.

        shift : int
            The amount to shift the split.  Positive numbers move more
            levels into the root and negative values move levels into
            the resource_path
        '''

    def insert_resource(self, spec, resource_path, resource_kwargs, root=''):
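As a usage illustration of the limited API, assuming ``fs`` is the FileStore instance exposing the methods above and ``rid`` an existing resource uid (both placeholders)::

    # copy the files under a new root and delete the originals
    fs.change_root(rid, '/mnt/archive', remove_origin=True)

    # move one directory level from resource_path into root, e.g.
    # root='/mnt/archive', resource_path='2015/05/06/my_data.h5'
    # becomes root='/mnt/archive/2015', resource_path='05/06/my_data.h5'
    fs.shift_root(rid, 1)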
Additional public API draft::

    def get_resources_by_root(root, partial=False):
        pass

    def get_resources_by_path(path, partial=False):
        pass

    def get_resources_by_spec(spec):
        pass

    def get_resource_by_uid(uid):
        pass
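A possible way these queries might be used; the behaviour of ``partial`` (prefix matching) and the spec name are assumptions on top of the draft::

    # all resources whose files currently live under this exact root
    under_data = fs.get_resources_by_root('/mnt/DATA')

    # prefix match: anything rooted anywhere below /mnt
    under_mnt = fs.get_resources_by_root('/mnt', partial=True)

    # query by handler spec or by uid
    hdf_resources = fs.get_resources_by_spec('AD_HDF5')
    resource = fs.get_resource_by_uid(rid)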
Extended schema::

    resource_update = {
        resource: uid,
        old: original_resource_doc,
        new: updated_resource_doc,
        time: timestamp (posix time),
        cmd: str, the command that generated the insertion,
        cmd_kwargs: dict, the inputs to cmd
    }

    resource = {
        spec: str,
        root: str,
        resource_path: str,
        resource_kwargs: dict,
        uid: str
    }
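For concreteness, a ``resource_update`` document as it might be journaled after a ``change_root`` call could look like the following (all values are made up for illustration)::

    {'resource': '0f3e6a9c-...',          # uid of the resource that changed
     'old': {'spec': 'AD_HDF5',
             'root': '/mnt/DATA',
             'resource_path': '2015/05/06/my_data.h5',
             'resource_kwargs': {},
             'uid': '0f3e6a9c-...'},
     'new': {'spec': 'AD_HDF5',
             'root': '/mnt/archive',
             'resource_path': '2015/05/06/my_data.h5',
             'resource_kwargs': {},
             'uid': '0f3e6a9c-...'},
     'time': 1430918400.0,                # posix time of the update
     'cmd': 'change_root',
     'cmd_kwargs': {'new_root': '/mnt/archive', 'remove_origin': True}}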
Full proposal¶
New Python API::

    def copy_resource(resource_id, new_root, old_root=None):
        """Copy all the files of a resource

        Parameters
        ----------
        resource_id : uuid
            The unique id of the resource to work on

        new_root : str
            The path to the location in the filesystem to copy
            the files into.  The full existing directory structure
            will be replicated on top of the new root

        old_root : str, optional
            If there exists more than one copy already, select
            which one to use
        """

    def move_resource(resource_id, old_root, new_root):
        """Move all files for a resource to a new location

        This is the same as copy then delete.  Because of the
        delete step, users must be explicit about the source path.

        Parameters
        ----------
        resource_id : uuid
            The unique id of the resource to work on

        old_root : str
            If there exists more than one copy already, select
            which one to use

        new_root : str
            The path to the location in the filesystem to copy
            the files into.  The full existing directory structure
            will be replicated on top of the new root
        """

    def remove_resource(resource_id, old_root, force_last=False):
        """Delete all files associated with a resource

        Parameters
        ----------
        resource_id : uuid
            The unique id of the resource to work on

        old_root : str
            Which set of files to delete

        force_last : bool, optional
            If False, will raise RuntimeError rather than
            delete the last copy of the files.
        """

    def insert_resource(spec, resource_root, resource_path, resource_kwargs=None):
        """
        Parameters
        ----------
        spec : str
            spec used to determine what handler to use to open this
            resource.

        resource_path, resource_root : str or None
            Url to the physical location of this resource

        resource_kwargs : dict, optional
            name/value pairs of additional kwargs to be
            passed to the handler to open this resource.
        """

    def retrieve(eid, root_preference=None):
        """
        Given a resource identifier return the data.

        The root_preference allows control over which copy
        of the data is used if there is more than one available.

        Parameters
        ----------
        eid : str
            The resource ID (as stored in MDS)

        root_preference : list, optional
            A list of preferred root locations to pull data from, in
            descending order.

            If None, fall back to a configurable default.

        Returns
        -------
        data : ndarray
            The requested data as a numpy array
        """
New DB schema::

    from mongoengine import (Document, DynamicDocument, StringField,
                             DictField, ListField, IntField, FloatField,
                             BooleanField)

    class Resource(Document):
        """
        Parameters
        ----------
        spec : str
            spec used to determine what handler to use to open this
            resource.

        resource_path : str
            Url to the physical location of the resource

        resource_kwargs : dict
            name/value pairs of additional kwargs to be
            passed to the handler to open this resource.
        """
        spec = StringField(required=True, unique=False)
        path = StringField(required=True, unique=False)
        kwargs = DictField(required=False)
        uid = StringField(required=True, unique=True)

        # ALIAS is the database alias constant defined elsewhere in filestore
        meta = {'indexes': ['-_id', 'resource_root'], 'db_alias': ALIAS}

    class ResourceRoots(DynamicDocument):
        """
        Many-to-one mapping between Resource documents and chroot paths.

        The idea is that the absolute path of a file contains two
        parts: the root, which is set by details of how the file
        system is mounted, and the relative path, which is set by some
        sort of semantics.  For example, in the path ::

            /mnt/DATA/2015/05/06/my_data.h5

        ``/mnt/DATA/`` is the root and ``2015/05/06/my_data.h5`` is
        the relative path.

        In the case of a URL this would be ::

            http://data.nsls-ii.bnl.gov/xf11id/2015/05/06/my_data.h5

        where the root would be ``http://data.nsls-ii.bnl.gov/`` and the
        relative path would be ``xf11id/2015/05/06/my_data.h5``.

        Parameters
        ----------
        root : str
            The chroot of the resource.

        resource_uid : str
            The uid of the resource this is associated with
        """
        root = StringField(required=True, unique=False)
        resource_uid = StringField(required=True, unique=False)

    class File(Document):
        """
        This is 'semi-transient'; everything in here can be rebuilt
        if needed from Resource, Datum, and their helper code, but
        the hash can be used for validation.
        """
        resource_uid = StringField(required=True, unique=False)
        root = StringField(required=True, unique=False)
        uid = StringField(required=True, unique=True)
        abs_path = StringField(required=True, unique=True)
        sha1_hash = StringField(required=True)
        size = FloatField(required=True)
        exists = BooleanField(required=True)

    class DatumStats(DynamicDocument):
        datum_uid = StringField(required=True, unique=True)
        sha1_hash = StringField(required=True)
        shape = ListField(field=IntField())

    class CommandJournal(Document):
        command = StringField(required=True)
        args = ListField()
        kwargs = DictField()
        success = BooleanField(required=True)
In a departure from our standard design protocol, let File have the ``exists`` field be updated.  Alternatively, have a collection which is just a (resource_uid, root) create/delete journal.  Another option is to allow ``remove`` to delete entries from the File collection.
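A minimal sketch of the (resource_uid, root) create/delete journal option, in the same mongoengine style as above; the class and field names are illustrative, not part of the proposal::

    class FileJournal(Document):
        """Append-only record of file creation/deletion per (resource, root).

        The 'exists' state of the File collection can then be rebuilt by
        replaying this journal in time order.
        """
        resource_uid = StringField(required=True)
        root = StringField(required=True)
        action = StringField(required=True, choices=('create', 'delete'))
        time = FloatField(required=True)   # posix timestamp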
Backward compatibility¶
This will require a DB migration and breaks all of the AD (areaDetector) instances that insert into FS (filestore).
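A hedged sketch of the simplest form the migration could take, assuming the existing documents live in a pymongo collection named ``resource`` and that pre-existing resources keep their full path in ``resource_path`` with an empty root, matching the ``root=''`` default of ``insert_resource`` in the limited API (collection name and layout are assumptions and must be checked against the real database)::

    # One-off migration sketch (pymongo)
    from pymongo import MongoClient

    db = MongoClient('localhost')['filestore']

    # give every pre-existing resource document an (empty) root field
    db.resource.update_many({'root': {'$exists': False}},
                            {'$set': {'root': ''}})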
Alternatives¶
None yet