API Reference
File System Accessors
- pfio.v2.open_url(url: str, mode: str = 'r', **kwargs) → Iterator[IOBase] [source]
  Opens a file regardless of the backend FS type.
  url must be compliant with the URL standard at https://url.spec.whatwg.org/ . As this function implements the context manager protocol, the file object can be used as:

      with open_url("s3://bucket.example.com/path/your-file.txt", 'r') as f:
          f.read()

  Note
  Some FS resources won't be closed when using this function. See from_url for keyword arguments.
  - Returns:
    a FileObject that must be closed.
- pfio.v2.from_url(url: str, **kwargs) → FS [source]
  Factory pattern implementation; creates an FS from a URL.
  If force_type is set to an archive type rather than a scheme, the suffix is ignored and the specified archive format is tried by opening the blob file. If force_type is set to a scheme type, the FS will be built from it accordingly. The URL path is supposed to be a directory for file systems, or a path prefix for S3.
  Warning
  When opening an hdfs://... URL, be careful about the forking context. See Hdfs for discussion.
  - Arguments:
    url (str): A URL string compliant with RFC 1738.
    - force_type (str): Force the type of FS to be returned.
      One of "zip", "hdfs", "s3", or "file". Default is "file".
    - create (bool): Create the specified path if it doesn't exist.
    - http_cache (str): Prefix url of HTTP cache entries.
      On a filesystem with http_cache specified, all read access is hooked and the content is uploaded to the url with the given prefix. For details, please refer to pfio.v2.HTTPCachedFS. (experimental feature)
  Note
  Some FS resources won't be closed when using this function.
  Note
  Pickling the FS object may or may not work correctly depending on the implementation.
- pfio.v2.lazify(init_func, lazy_init=True, recreate_on_fork=True)[source]
Makes FS initialization lazy and recreates the FS on fork.
Deprecated since version 2.2.0: This will be removed in 2.3.0.
- class pfio.v2.fs.FS[source]
FS access abstraction
- abstract exists(path: str) → bool [source]
  Returns the existence of the path.
  When the path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- glob(pattern: str) → Iterator[FileStat | str] [source]
  Returns the files and directories that match the glob pattern.
- abstract isdir(file_path: str) → bool [source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- abstract list(path_or_prefix: str | None = None, recursive=False, detail=False) → Iterator[FileStat | str] [source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- abstract makedirs(file_path: str, mode: int = 511, exist_ok: bool = False) → None [source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- abstract mkdir(file_path: str, mode: int = 511, *args, dir_fd: int | None = None) → None [source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- abstract remove(file_path: str, recursive: bool = False) → None [source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- abstract rename(src: str, dst: str) → None [source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
Local file system
- class pfio.v2.Local(cwd=None, create=False, **_)[source]
- exists(file_path: str)[source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(path: str)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(file_path: str, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
HDFS (Hadoop File System)
- class pfio.v2.Hdfs(cwd=None, create=False, **_)[source]
  Hadoop FileSystem wrapper
  To use HDFS, PFIO requires $HADOOP_HOME to be defined before initialization. If it is not defined, ARROW_LIBHDFS_DIR must be defined instead. $CLASSPATH will be needed in case the hdfs command is not available from $PATH.
  Warning
  It is strongly discouraged to use Hdfs under multiprocessing. Once the object detects that the process id has changed (which means it has been forked), the object raises ForkedError before doing anything. If you do need forking, for example PyTorch DataLoader with multiple workers for performance, it is strongly recommended not to instantiate Hdfs before forking. Details are described in PFIO issue #123. A simple workaround is to set the multiprocessing start method to 'forkserver' and start the very first child process before everything else:

      import multiprocessing
      multiprocessing.set_start_method('forkserver')
      p = multiprocessing.Process()
      p.start()
      p.join()

  Note
  With the environment variable KRB5_KTNAME=path/to/your.keytab set, the hdfs handler automatically and periodically updates the Kerberos ticket using krbticket. The update frequency is every 10 minutes by default.
  Note
  Only the username in the first entry of the keytab will be used to update the Kerberos ticket.
- exists(path: str)[source]
  Returns the existence of the path.
  When the path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(path: str | None)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(path: str, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(path, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- rename(src, dst)[source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
S3 (AWS S3)
- class pfio.v2.S3(bucket, prefix=None, endpoint=None, create_bucket=False, aws_access_key_id=None, aws_secret_access_key=None, mpu_chunksize=33554432, buffering=-1, create=False, _skip_connect=None, **_)[source]
  S3 FileSystem wrapper
  Takes three arguments as well as environment variables for the constructor. The priority is (1) the arguments, (2) the environment variables, (3) boto3's defaults. The available pairs are:
  - aws_access_key_id, AWS_ACCESS_KEY_ID
  - aws_secret_access_key, AWS_SECRET_ACCESS_KEY
  - endpoint, S3_ENDPOINT
  It supports buffering when opening a file in binary read mode ("rb"). When buffering is set to -1 (default), the buffer size will be the size of the file or pfio.v2.S3.DEFAULT_MAX_BUFFER_SIZE, whichever is smaller. buffering=0 disables buffering, and buffering>0 forcibly sets the specified value as the buffer size in bytes.
- exists(file_path: str)[source]
  Returns the existence of objects.
  For common prefixes, it does nothing. See the discussion in isdir().
- isdir(file_path: str)[source]
  Imitates isdir by handling a common prefix ending with "/" as a directory.
  AWS S3 does not have a concept of a directory tree, but this class imitates other file systems to increase compatibility.
- list(prefix: str | None = '', recursive=False, detail=False)[source]
  Lists all objects (and prefixes).
  Although there is no concept of a directory in the AWS S3 API, common prefixes show up like directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Does nothing
  Note
  See the discussion in mkdir().
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Does nothing
  Note
  AWS S3 does not have a concept of a directory tree; what should this function (and makedirs()) do and return? Strictly speaking, it would be straightforward to raise an io.UnsupportedOperation exception. But that would break users' applications that expect quasi-compatible behaviour. Thus, imitating other file systems by returning None is nicer.
- open(path, mode='r', **kwargs)[source]
Opens an object accessor for read or write
Note
Multi-part upload is not yet available.
- Arguments:
path (str): relative path from basedir
mode (str): open mode
- remove(file_path: str, recursive=False)[source]
Removes an object
It raises a FileNotFoundError when the specified file doesn’t exist.
Zip Archive
- class pfio.v2.Zip(backend, file_path, mode='r', create=False, local_cache=False, local_cachedir=None, **kwargs)[source]
- exists(file_path: str)[source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(file_path: str)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path_or_prefix: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(file_path, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- rename(*args)[source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
HTTPCachedFS
- class pfio.v2.HTTPCachedFS(url: str, fs: FS, max_cache_size: int = 1073741824, bearer_token_path: str | None = None)[source]
HTTP-based cache system
Stores cache data in an HTTP server with PUT and GET methods. Each cache entry corresponds to the url suffixed by _canonical_name in pfio.v2.fs.FS.
- Arguments:
  - url (string):
    Prefix url of cache entries. Each entry corresponds to the url suffixed by each normalized path.
  - fs (pfio.v2.FS):
    Underlying filesystem.
    Read operations are hooked by HTTPCachedFS to send a request to the cache system. If the object is found in the cache, it is returned from the cache without a request to the underlying fs. Therefore, after a file in the underlying fs is updated, users have to change the url to avoid reading stale data from the cache.
    Other operations, including writes, are not hooked; they are transferred to the underlying filesystem immediately.
  - max_cache_size (int):
    Files larger than max_cache_size will not be cached. max_cache_size is 1 GiB by default.
  - bearer_token_path (string):
    Path to an HTTP bearer token if authorization is required. HTTPCachedFS supports refreshing the bearer token by periodical reloading.
Note
This feature is experimental.
- exists(*args, **kwargs) → bool [source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- glob(pattern: str) → Iterator[FileStat | str] [source]
  Returns the files and directories that match the glob pattern.
- isdir(*args, **kwargs) → bool [source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(*args, **kwargs) → Iterator[FileStat | str] [source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(*args, **kwargs) → None [source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(*args, **kwargs) → None [source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(*args, **kwargs) → None [source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
Error
- class pfio.v2.fs.ForkedError[source]
  An error raised when PFIO detects that the process has been forked.
  If an FS object is not "lazy", any usage of the object detects a process fork and raises this ForkedError as soon as possible in the child process. The parent process may or may not keep running well, depending on the FS implementation.
Pathlib-like API
The PFIO v2 API has a utility module that behaves like pathlib in Python's standard library. Paths can be manipulated like this:
from pfio.v2 import from_url
from pfio.v2.pathlib import Path

with from_url('s3://your-bucket') as s3:
    p = Path('foo', fs=s3)
    p2 = p / 'bar'
    with p2.open() as fp:
        # yields s3://your-bucket/foo/bar
        fp.read()
It tries to be compatible with pathlib.Path
as much as possible,
but several methods are not yet implemented.
- class pfio.v2.pathlib.Path(*args: str, fs: FS, scheme: str | None = None)[source]
pathlib.PosixPath compatible interface.
- Args:
  args: components used to construct the path. fs: target file system. scheme: URL scheme (used by the as_uri method).
- Note:
  Many methods raise NotImplementedError because they require features that FS does not support.
  Several methods behave slightly differently:
  - stat returns a FileStat object instead of os.stat_result.
  - glob, rglob, and iterdir do not return directory-type objects.
Sparse File Cache
Removed in 2.8.
Cache API
PFIO provides an experimental cache API to improve the performance of repetitive access to a data collection.
Example
Here let us suppose we have a file that includes a list of paths to images.
/path/to/image1.jpg
/path/to/image2.jpg
...
/path/to/imageN.jpg
A PyTorch Dataset class using NaiveCache as an example
can be implemented as follows.
import cv2
import torch
import torch.utils.data

from pfio.cache import NaiveCache

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        self.paths = image_paths
        self.cache = NaiveCache(len(image_paths), do_pickle=True)

    def __len__(self):
        return len(self.paths)

    def _read_image(self, i):
        return cv2.imread(self.paths[i]).transpose(2, 0, 1)

    def __getitem__(self, i):
        x = self.cache.get_and_cache(i, self._read_image)
        # This is equivalent to:
        # x = self.cache.get(i)
        # if x is None:
        #     x = self._read_image(i)
        #     self.cache.put(i, x)
        return torch.Tensor(x)
By calling get_and_cache of the cache in the __getitem__ method,
it checks whether the data for the specified index is already cached.
If it is, the data is read from the cache and returned;
otherwise, the actual data loading function is called, its result is
added to the cache, and then returned.
Therefore the data is loaded from storage only when necessary,
which is at the first access to each datum.
The PFIO cache API provides NaiveCache, FileCache, and
MultiprocessFileCache.
They all share the same core idea and interface;
the difference is how the cached data is managed.
NaiveCache keeps everything in memory,
making its overhead virtually zero.
The cache capacity is limited by the memory size,
so it is not suitable for large-scale datasets.
FileCache and MultiprocessFileCache both
store the cached data in a filesystem.
FileCache is designed for single-process data loading.
For parallelized data loading, which is relatively common in
deep learning workloads, consider using MultiprocessFileCache.
These file-based caches also support cache data persistence.
Once the cache is completely built, it can be kept as files by calling
FileCache.preserve, and recovered from the preserved files by calling
FileCache.preload.
This is useful for reusing a cache already built in a previous workload.
Currently, deletion of data from a cache is not supported.
- class pfio.cache.Cache[source]
  Abstract class that defines the Cache class interface
  This could be an instance of collections.abc.Sequence, but so far it is just a single interface definition. Note that this is an experimental feature.
  - get_and_cache(i, backend_get: Callable[[int], bytes]) → bytes [source]
    Gets data from the cache, otherwise from the backend with caching
    First tries to get the data from the cache. If it is not found, it gets the data from the backend callable, storing the result in the cache.
  - abstract property multiprocess_safe: bool
    Returns multiprocess safety.
  - abstract property multithread_safe: bool
    Returns multithread safety.
- class pfio.cache.NaiveCache(length, multithread_safe=False, do_pickle=False)[source]
A naive in-memory cache backed by a dict.
- property multiprocess_safe
Returns multiprocess safety.
- property multithread_safe
Returns multithread safety.
- class pfio.cache.FileCache(length, multithread_safe=False, do_pickle=False, dir=None, cache_size_limit=None, verbose=False)[source]
  Cache system on a local filesystem
  Stores cache data in a local temporary file created in $XDG_CACHE_HOME/pfio by default. If it is unset, $HOME/.cache/pfio will be the cache destination. Cache data is automatically deleted after the object is collected. When this object is not correctly closed (e.g., the process is killed by SIGTERM), the cache remains after the death of the process.
  Note
  This feature requires the stat(1) command from GNU coreutils.
  - Arguments:
    length (int): Length of the cache array.
    - multithread_safe (bool): Defines multithread safety. If this is True, a reader-writer locking system based on threading.Lock is introduced behind the cache management. The major use case is Chainer's MultithreadIterator.
    - do_pickle (bool):
      Do automatic pickle and unpickle inside the cache.
    - dir (str): The path to the directory to place cache data in, in case the home directory is not backed by a fast storage device. Must not be on NFS.
    - cache_size_limit (None or int): Limit on the cache size in bytes. If the total amount of cached data reaches the limit, the cache becomes frozen and no longer accepts further additions. Data already stored in the cache can be accessed normally. None (default) and 0 mean unlimited.
    - verbose (bool):
      Print detailed logs of the cache.
- preload(name)[source]
  Loads the cache saved by preserve()
  name is the prefix of the persistent files in the cache directory. To use the cache in a multiprocessing environment, call this method in every forked process, except the process that called preserve(). After the preload, no data can be added to the cache.
  When it succeeds, it returns True. If there is no cache file with the specified name in the cache directory, it does nothing and returns False.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- preserve(name, overwrite=False)[source]
  Preserves the cache as a persistent file on disk
  Saves the current cache into files named with the prefix name in the cache directory. Once the cache is preserved, the cache files will not be removed at cache close. To read data from the preserved files, use the preload() method. After preservation, no data can be added to the cache.
  When it succeeds, it returns True. If a cache file with the same name already exists in the cache directory, it does nothing and returns False.
  The preserved cache can also be preloaded by MultiprocessFileCache.
  - Arguments:
    - name (str): Prefix of the preserved file names. (name).cachei and (name).cached are created. The files are created in the same directory as the cache (the dir option to __init__).
    - overwrite (bool): Overwrite if the files already exist.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- class pfio.cache.MultiprocessFileCache(length, do_pickle=False, dir=None, cache_size_limit=None, verbose=False)[source]
  Multiprocess-safe cache system on a local filesystem
  Stores cache data in a local temporary file, created in ~/.pfio/cache by default. It automatically deletes the cache data after the object is collected. When this object is not correctly closed (e.g., the process is killed by SIGKILL), the cache remains after the process's death.
  This class supports handling a cache from multiple processes. A MultiprocessFileCache object can be handed over to another process through pickle. Calling get and put in each process looks into the same cache file with flock-based locking. The temporary cache file persists as long as the MultiprocessFileCache object is alive in the original process that created it. Therefore, even after destroying the worker processes, the MultiprocessFileCache object can still be passed to another process.
  Example
  Using MultiprocessFileCache is similar to NaiveCache and FileCache.

      from pfio.cache import MultiprocessFileCache

      class MyDataset(torch.utils.data.Dataset):
          def __init__(self, image_paths):
              self.paths = image_paths
              self.cache = MultiprocessFileCache(len(image_paths), do_pickle=True)
          ...

  When iterating over the dataset, it is common to load the data concurrently to hide the file IO bottleneck by setting a higher num_workers in the PyTorch DataLoader (https://pytorch.org/docs/stable/data.html).

      image_paths = open('/path/to/image_list.txt').read().splitlines()
      dataset = MyDataset(image_paths)
      loader = DataLoader(dataset, batch_size=64, num_workers=8)  # Parallel data loading

      for epoch in range(10):
          for batch in loader:
              ...

  In this case, the dataset is distributed to each worker process, i.e., __getitem__ of the dataset is called in a different process from the one that initialized it. The MultiprocessFileCache object held by the dataset in each worker looks at the same cache file and handles concurrent access based on the flock system call. Therefore, data inserted into the cache by one worker process can be accessed from another worker process safely.
  In case your task does not require concurrent data loading, i.e., num_workers=0 in DataLoader, consider using FileCache as it has less overhead for concurrency control.
  The persisted cache file created by preserve() can be used for FileCache.preload() and vice versa.
and vice versa.- Arguments:
length (int): Length of the cache array.
- do_pickle (bool):
Do automatic pickle and unpickle inside the cache.
- dir (str): The path to the directory to place cache data in
case home directory is not backed by fast storage device. Must not be an NFS.
- cache_size_limit (None or int): Limit on the cache size in bytes.
  If the total amount of cached data reaches the limit, the cache becomes frozen and no longer accepts further additions. Data already stored in the cache can be accessed normally. None (default) and 0 mean unlimited.
- verbose (bool):
Print detailed logs of the cache.
- preload(name)[source]
  Loads the cache saved by preserve()
  After loading the file, no data can be added to the cache. name is the name of the persistent file in the cache directory.
  When it succeeds, it returns True. If there is no cache file with the specified name in the cache directory, it does nothing and returns False.
  Note that preload() can be called only by the master process, i.e., the process where __init__() was called, in order to prevent inconsistency. When using it in a multiprocessing environment, first create a MultiprocessFileCache object, call its preload(), and then pass it to the worker processes.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- preserve(name, overwrite=False)[source]
  Preserves the cache as a persistent file on disk
  Once the cache is preserved, the cache files will not be removed at cache close. To read data from the preserved files, use the preload() method. After preservation, no data can be added to the cache. name is the name of the persistent files saved into the cache directory.
  When it succeeds, it returns True. If a cache file with the same name already exists in the cache directory, it does nothing and returns False.
  Note that preserve() can be called only by the master process, i.e., the process where __init__() was called, in order to prevent inconsistency.
  The preserved cache can also be preloaded by FileCache.
  - Arguments:
    - name (str): Prefix of the preserved file names. (name).cachei and (name).cached are created. The files are created in the same directory as the cache (the dir option to __init__).
    - overwrite (bool): Overwrite if the files already exist.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- class pfio.cache.HTTPCache(length: int, url: str, bearer_token_path=None, do_pickle=False)[source]
  HTTP-based cache system
  Stores cache data in an HTTP server with PUT and GET methods. Each cache entry corresponds to the url suffixed by the index i.
  - Arguments:
    - length (int):
      Length of the cache.
    - url (string):
      Prefix url of cache entries. Each entry corresponds to the url suffixed by its index. A user must specify a url that is globally unique across the cache system on the server side, because HTTPCache does not suffix the url with user or dataset information. Therefore, a user should include the user and dataset in the url to avoid conflicting cache entries.
      For example, assume the given url is http://cache.example.com/some/{user}/{dataset-id}/. Then put(123) and get(123) correspond to http://cache.example.com/some/{user}/{dataset-id}/123.
    - bearer_token_path (string):
      Path to an HTTP bearer token if authorization is required. HTTPCache supports refreshing the bearer token by periodical reloading.
    - do_pickle (bool):
      Do automatic pickle and unpickle inside the cache.
  Note
  This feature is experimental.
- property multiprocess_safe
Returns multiprocess safety.
- property multithread_safe
Returns multithread safety.
Toplevel Functions in v1 (deprecated)
Note
Toplevel functions will be deprecated in 2.0 and removed in 2.1. Please use V2 API instead.