API Reference
File System Accessors
- pfio.v2.open_url(url: str, mode: str = 'r', **kwargs) → Iterator[IOBase] [source]
  Opens a file regardless of the backend FS type.
  url must be compliant with the URL standard at https://url.spec.whatwg.org/ . As this function implements the context manager protocol, the file object can be used as:

      with open_url("s3://bucket.example.com/path/your-file.txt", 'r') as f:
          f.read()

  Note
  Some FS resources won't be closed when using this function. See from_url for keyword arguments.
  - Returns:
    a FileObject that must be closed.
- pfio.v2.from_url(url: str, **kwargs) → FS [source]
  Factory pattern implementation; creates an FS from a URL.
  If force_type is set to an archive type rather than a scheme, the suffix is ignored and the specified archive format is tried by opening the blob file. If force_type is set to a scheme type, the FS will be built from it accordingly. The URL path is supposed to be a directory for file systems, or a path prefix for S3.
  Warning
  When opening an hdfs://... URL, be careful about the forking context. See Hdfs for discussion.
  - Arguments:
    url (str): A URL string compliant with RFC 1738.
    - force_type (str): Force the type of FS to be returned.
      One of "zip", "hdfs", "s3", or "file". Default is "file".
    - create (bool): Create the specified path if it doesn't exist.
    - http_cache (str): Prefix url of HTTP cache entries.
      On a filesystem with http_cache specified, all read access is hooked and the content is uploaded to the url with the given prefix. For details, please refer to pfio.v2.HTTPCachedFS. (experimental feature)
  Note
  Some FS resources won't be closed when using this function.
  Note
  Pickling the FS object may or may not work correctly depending on the implementation.
- pfio.v2.lazify(init_func, lazy_init=True, recreate_on_fork=True)[source]
Makes FS initialization lazy and recreates the FS on fork.
Deprecated since version 2.2.0: This will be removed in 2.3.0.
- class pfio.v2.fs.FS[source]
FS access abstraction
- abstract exists(path: str) → bool [source]
  Returns the existence of the path.
  When the path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- glob(pattern: str) → Iterator[FileStat | str] [source]
  Returns the files and directories that match the glob pattern.
- abstract isdir(file_path: str) → bool [source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- abstract list(path_or_prefix: str | None = None, recursive=False, detail=False) → Iterator[FileStat | str] [source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- abstract makedirs(file_path: str, mode: int = 511, exist_ok: bool = False) → None [source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- abstract mkdir(file_path: str, mode: int = 511, *args, dir_fd: int | None = None) → None [source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- abstract remove(file_path: str, recursive: bool = False) → None [source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- abstract rename(src: str, dst: str) → None [source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
Local file system
- class pfio.v2.Local(cwd=None, create=False, **_)[source]
- exists(file_path: str)[source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(path: str)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(file_path: str, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
HDFS (Hadoop File System)
- class pfio.v2.Hdfs(cwd=None, create=False, **_)[source]
  Hadoop FileSystem wrapper
  To use HDFS, PFIO requires $HADOOP_HOME to be defined before initialization. If it is not defined, ARROW_LIBHDFS_DIR must be defined instead. $CLASSPATH will be needed in case the hdfs command is not available from $PATH.
  Warning
  It is strongly discouraged to use Hdfs under multiprocessing. Once the object detects that the process id has changed (which means it has been forked), the object raises ForkedError before doing anything. If you do need forking, for example PyTorch DataLoader with multiple workers for performance, it is strongly recommended not to instantiate Hdfs before forking. Details are described in PFIO issue #123. A simple workaround is to set the multiprocessing start method to 'forkserver' and start the very first child process before everything else:

      import multiprocessing
      multiprocessing.set_start_method('forkserver')
      p = multiprocessing.Process()
      p.start()
      p.join()

  Note
  With the environment variable KRB5_KTNAME=path/to/your.keytab set, the hdfs handler automatically and periodically updates the Kerberos ticket using krbticket. The update frequency is every 10 minutes by default.
  Note
  Only the username in the first entry of the keytab will be used to update the Kerberos ticket.
- exists(path: str)[source]
  Returns the existence of the path.
  When the path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(path: str | None)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(path: str, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(path, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- rename(src, dst)[source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
S3 (AWS S3)
- class pfio.v2.S3(bucket, prefix=None, endpoint=None, create_bucket=False, aws_access_key_id=None, aws_secret_access_key=None, mpu_chunksize=33554432, buffering=-1, create=False, _skip_connect=None, **_)[source]
  S3 FileSystem wrapper
  Takes three arguments as well as environment variables for the constructor. The priority is (1) the arguments, (2) the environment variables, (3) boto3's defaults. The available pairs are:
  - aws_access_key_id, AWS_ACCESS_KEY_ID
  - aws_secret_access_key, AWS_SECRET_ACCESS_KEY
  - endpoint, S3_ENDPOINT
  It supports buffering when opening a file in binary read mode ("rb"). When buffering is set to -1 (default), the buffer size will be the size of the file or pfio.v2.S3.DEFAULT_MAX_BUFFER_SIZE, whichever is smaller. buffering=0 disables buffering, and buffering>0 forcibly sets the specified value as the buffer size in bytes.
- exists(file_path: str)[source]
  Returns the existence of objects.
  For common prefixes, it does nothing. See the discussion in isdir().
- isdir(file_path: str)[source]
  Imitates isdir by handling a common prefix ending with "/" as a directory.
  AWS S3 does not have a concept of a directory tree, but this class imitates other file systems to increase compatibility.
- list(prefix: str | None = '', recursive=False, detail=False)[source]
  Lists all objects (and prefixes).
  Although there is no concept of a directory in the AWS S3 API, common prefixes show up like directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Does nothing
  Note
  See the discussion in mkdir().
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Does nothing
  Note
  AWS S3 does not have a concept of a directory tree; what should this function (and makedirs()) do and return? Strictly speaking, it would be straightforward to raise an io.UnsupportedOperation exception. But that would break users' applications that expect quasi-compatible behaviour. Thus, imitating other file systems by returning None is nicer.
- open(path, mode='r', **kwargs)[source]
Opens an object accessor for read or write
Note
Multi-part upload is not yet available.
- Arguments:
path (str): relative path from basedir
mode (str): open mode
- remove(file_path: str, recursive=False)[source]
Removes an object
It raises a FileNotFoundError when the specified file doesn’t exist.
Zip Archive
- class pfio.v2.Zip(backend, file_path, mode='r', create=False, local_cache=False, local_cachedir=None, **kwargs)[source]
- exists(file_path: str)[source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- isdir(file_path: str)[source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(path_or_prefix: str | None = '', recursive=False, detail=False)[source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(file_path: str, mode=511, exist_ok=False)[source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(file_path: str, mode=511, *args, dir_fd=None)[source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(file_path, recursive=False)[source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
- rename(*args)[source]
  Renames the file from src to dst.
  On systems and situations where rename functionality is provided, it renames the file or the directory.
  - Args:
    src (str): the current name of the file or directory.
    dst (str): the name to rename to.
HTTPCachedFS
- class pfio.v2.HTTPCachedFS(url: str, fs: FS, max_cache_size: int = 1073741824, bearer_token_path: str | None = None)[source]
HTTP-based cache system
Stores cache data in an HTTP server with PUT and GET methods. Each cache entry corresponds to the url suffixed by _canonical_name in pfio.v2.fs.FS.
- Arguments:
  - url (string):
    Prefix url of cache entries. Each entry corresponds to the url suffixed by each normalized path.
  - fs (pfio.v2.FS):
    Underlying filesystem.
    Read operations are hooked by HTTPCachedFS to send a request to the cache system. If the object is found in the cache, it is returned from the cache without a request to the underlying fs. Therefore, after a file in the underlying fs is updated, users have to change the url to avoid reading stale data from the cache.
    Other operations, including writes, are not hooked; they are transferred to the underlying filesystem immediately.
  - max_cache_size (int):
    Files larger than max_cache_size will not be cached. max_cache_size is 1 GiB by default.
  - bearer_token_path (string):
    Path to an HTTP bearer token if authorization is required. HTTPCachedFS supports refreshing the bearer token by periodical reloading.
Note
This feature is experimental.
- exists(*args, **kwargs) → bool [source]
  Returns the existence of the path.
  When the file_path points to a symlink, the return value depends on the file the link points to rather than the link itself.
- glob(pattern: str) → Iterator[FileStat | str] [source]
  Returns the files and directories that match the glob pattern.
- isdir(*args, **kwargs) → bool [source]
  Returns True if the path is an existing directory.
  - Args:
    path (str): the path to the target directory
  - Returns:
    True when the path points to a directory, False when it does not.
- list(*args, **kwargs) → Iterator[FileStat | str] [source]
  Lists all the files and directories under the given path_or_prefix.
  - Args:
    - path_or_prefix (str): The path to list against.
      With the default value, list shows the contents of the working directory. If a path_or_prefix is given, it shows only the files and directories under that path_or_prefix.
    - recursive (bool): When this is True, list files and directories recursively.
    - detail (bool): When this is True, the return values are detailed information on each file or directory.
  - Returns:
    An iterator that iterates through the files and directories.
- makedirs(*args, **kwargs) → None [source]
  Makes directories recursively with mode.
  Also creates all the missing parents of the given path.
  - Args:
    path (str): the path to the directory to make.
    mode (int): the mode of the directory.
    exist_ok (bool): By default, a FileExistsError is raised when the target directory exists.
- mkdir(*args, **kwargs) → None [source]
  Makes a directory with mode.
  - Args:
    path (str): the path to the directory to make
    mode (int): the mode of the new directory
- remove(*args, **kwargs) → None [source]
  Removes a file or directory.
  - Args:
    path (str): the target path to remove. The path can be a regular file or a directory.
    recursive (bool): When the given path is a directory, all the files and directories under it are removed. When the path is a file, this option is ignored.
Error
- class pfio.v2.fs.ForkedError[source]
  An error raised when PFIO detects that the process has been forked.
  If an FS object is not "lazy", any usage of the object detects a process fork and raises this ForkedError as soon as possible in the child process. The parent process may or may not keep running well, depending on the FS implementation.
Pathlib-like API
The PFIO v2 API has a utility module that behaves like pathlib in Python's standard library. Paths can be manipulated like this:
from pfio.v2 import from_url
from pfio.v2.pathlib import Path

with from_url('s3://your-bucket') as s3:
    p = Path('foo', fs=s3)
    p2 = p / 'bar'
    with p2.open() as fp:
        # yields s3://your-bucket/foo/bar
        fp.read()
It tries to be compatible with pathlib.Path
as much as possible,
but several methods are not yet implemented.
- class pfio.v2.pathlib.Path(*args: str, fs: FS, scheme: str | None = None)[source]
pathlib.PosixPath compatible interface.
- Args:
  args: components used to construct the path. fs: target file system. scheme: URL scheme (used by the as_uri method).
- Note:
  Many methods raise NotImplementedError because they require features that FS does not support.
  Several methods behave slightly differently:
  - stat returns a FileStat object instead of os.stat_result.
  - glob, rglob, and iterdir do not return directory-type objects.
Sparse File Cache
Removed in 2.8.
Cache API
PFIO provides an experimental cache API to improve the performance of repetitive access to a data collection.
Example
Here let us suppose we have a file that includes a list of paths to images.
/path/to/image1.jpg
/path/to/image2.jpg
...
/path/to/imageN.jpg
A PyTorch Dataset class using NaiveCache as an example
can be implemented as follows.
import cv2
import torch
import torch.utils.data

from pfio.cache import NaiveCache

class MyDataset(torch.utils.data.Dataset):
    def __init__(self, image_paths):
        self.paths = image_paths
        self.cache = NaiveCache(len(image_paths), do_pickle=True)

    def __len__(self):
        return len(self.paths)

    def _read_image(self, i):
        return cv2.imread(self.paths[i]).transpose(2, 0, 1)

    def __getitem__(self, i):
        x = self.cache.get_and_cache(i, self._read_image)
        # This is equivalent to:
        # x = self.cache.get(i)
        # if x is None:
        #     x = self._read_image(i)
        #     self.cache.put(i, x)
        return torch.Tensor(x)
By calling get_and_cache of the cache in the __getitem__ method,
it checks whether the data for the specified index is already cached.
If it is, the data is read from the cache and returned;
otherwise, the actual data loading function is called, its result is
added to the cache, and then returned.
Therefore the data is loaded from storage only when necessary,
which is at the first access to each datum.
The PFIO cache API provides NaiveCache, FileCache, and
MultiprocessFileCache.
They all share the same core idea and interface;
the difference is how the cached data is managed.
NaiveCache keeps everything in memory,
making its overhead virtually zero.
The cache capacity is limited by the memory size,
so it is not suitable for large-scale datasets.
FileCache and MultiprocessFileCache both
store the cached data in a filesystem.
FileCache is designed for single-process data loading.
For parallelized data loading, which is relatively common in
deep learning workloads, consider using MultiprocessFileCache.
These file-based caches also support cache data persistence.
Once the cache is completely built, it can be kept as files by calling
FileCache.preserve, and recovered from the preserved files by calling
FileCache.preload.
This is useful for reusing a cache already built in a previous workload.
Currently, deletion of data from a cache is not supported.
- class pfio.cache.Cache[source]
  Abstract class that defines the Cache class interface
  This could be an instance of collections.abc.Sequence, but so far it is just a single interface definition. Note that this is an experimental feature.
  - get_and_cache(i, backend_get: Callable[[int], bytes]) → bytes [source]
    Gets data from the cache, otherwise from the backend with caching
    First tries to get the data from the cache. If it is not found, it gets the data from the backend callable, storing the result in the cache.
  - abstract property multiprocess_safe: bool
    Returns multiprocess safety.
  - abstract property multithread_safe: bool
    Returns multithread safety.
- class pfio.cache.NaiveCache(length, multithread_safe=False, do_pickle=False)[source]
A naive in-memory cache backed by a dict.
- property multiprocess_safe
Returns multiprocess safety.
- property multithread_safe
Returns multithread safety.
- class pfio.cache.FileCache(length, multithread_safe=False, do_pickle=False, dir=None, cache_size_limit=None, verbose=False)[source]
  Cache system on a local filesystem
  Stores cache data in a local temporary file created in $XDG_CACHE_HOME/pfio by default. If it is unset, $HOME/.cache/pfio will be the cache destination. Cache data is automatically deleted after the object is collected. When this object is not correctly closed (e.g., the process is killed by SIGTERM), the cache remains after the death of the process.
  Note
  This feature requires the stat(1) command from GNU coreutils.
  - Arguments:
    length (int): Length of the cache array.
    - multithread_safe (bool): Defines multithread safety. If this is True, a reader-writer locking system based on threading.Lock is introduced behind the cache management. The major use case is Chainer's MultithreadIterator.
    - do_pickle (bool):
      Do automatic pickle and unpickle inside the cache.
    - dir (str): The path to the directory to place cache data in, in case the home directory is not backed by a fast storage device. Must not be on NFS.
    - cache_size_limit (None or int): Limit on the cache size in bytes. If the total amount of cached data reaches the limit, the cache becomes frozen and no longer accepts further additions. Data already stored in the cache can be accessed normally. None (default) and 0 mean unlimited.
    - verbose (bool):
      Print detailed logs of the cache.
- preload(name)[source]
  Loads the cache saved by preserve()
  name is the prefix of the persistent files in the cache directory. To use the cache in a multiprocessing environment, call this method in every forked process, except the process that called preserve(). After the preload, no data can be added to the cache.
  When it succeeds, it returns True. If there is no cache file with the specified name in the cache directory, it does nothing and returns False.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- preserve(name, overwrite=False)[source]
  Preserves the cache as a persistent file on disk
  Saves the current cache into files named with the prefix name in the cache directory. Once the cache is preserved, the cache files will not be removed at cache close. To read data from the preserved files, use the preload() method. After preservation, no data can be added to the cache.
  When it succeeds, it returns True. If a cache file with the same name already exists in the cache directory, it does nothing and returns False.
  The preserved cache can also be preloaded by MultiprocessFileCache.
  - Arguments:
    - name (str): Prefix of the preserved file names. (name).cachei and (name).cached are created. The files are created in the same directory as the cache (the dir option to __init__).
    - overwrite (bool): Overwrite if the files already exist.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- class pfio.cache.MultiprocessFileCache(length, do_pickle=False, dir=None, cache_size_limit=None, verbose=False)[source]
  Multiprocess-safe cache system on a local filesystem
  Stores cache data in a local temporary file, created in ~/.pfio/cache by default. It automatically deletes the cache data after the object is collected. When this object is not correctly closed (e.g., the process is killed by SIGKILL), the cache remains after the process's death.
  This class supports handling a cache from multiple processes. A MultiprocessFileCache object can be handed over to another process through pickle. Calling get and put in each process looks into the same cache file with flock-based locking. The temporary cache file persists as long as the MultiprocessFileCache object is alive in the original process that created it. Therefore, even after destroying the worker processes, the MultiprocessFileCache object can still be passed to another process.
  Example
  Using MultiprocessFileCache is similar to NaiveCache and FileCache.

      from pfio.cache import MultiprocessFileCache

      class MyDataset(torch.utils.data.Dataset):
          def __init__(self, image_paths):
              self.paths = image_paths
              self.cache = MultiprocessFileCache(len(image_paths), do_pickle=True)
          ...

  When iterating over the dataset, it is common to load the data concurrently to hide the file IO bottleneck by setting a higher num_workers in the PyTorch DataLoader (https://pytorch.org/docs/stable/data.html).

      image_paths = open('/path/to/image_list.txt').read().splitlines()
      dataset = MyDataset(image_paths)
      loader = DataLoader(dataset, batch_size=64, num_workers=8)  # Parallel data loading

      for epoch in range(10):
          for batch in loader:
              ...

  In this case, the dataset is distributed to each worker process, i.e., __getitem__ of the dataset is called in a different process from the one that initialized it. The MultiprocessFileCache object held by the dataset in each worker looks at the same cache file and handles concurrent access based on the flock system call. Therefore, data inserted into the cache by one worker process can be accessed from another worker process safely.
  In case your task does not require concurrent data loading, i.e., num_workers=0 in DataLoader, consider using FileCache as it has less overhead for concurrency control.
  The persisted cache file created by preserve() can be used for FileCache.preload() and vice versa.
and vice versa.- Arguments:
length (int): Length of the cache array.
- do_pickle (bool):
Do automatic pickle and unpickle inside the cache.
- dir (str): The path to the directory to place cache data in
case home directory is not backed by fast storage device. Must not be an NFS.
- cache_size_limit (None or int): Limit on the cache size in bytes.
  If the total amount of cached data reaches the limit, the cache becomes frozen and no longer accepts further additions. Data already stored in the cache can be accessed normally. None (default) and 0 mean unlimited.
- verbose (bool):
Print detailed logs of the cache.
- preload(name)[source]
  Loads the cache saved by preserve()
  After loading the file, no data can be added to the cache. name is the name of the persistent file in the cache directory.
  When it succeeds, it returns True. If there is no cache file with the specified name in the cache directory, it does nothing and returns False.
  Note that preload() can be called only by the master process, i.e., the process where __init__() was called, in order to prevent inconsistency. When using it in a multiprocessing environment, first create a MultiprocessFileCache object, call its preload(), and then pass it to the worker processes.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- preserve(name, overwrite=False)[source]
  Preserves the cache as a persistent file on disk
  Once the cache is preserved, the cache files will not be removed at cache close. To read data from the preserved files, use the preload() method. After preservation, no data can be added to the cache. name is the name of the persistent files saved into the cache directory.
  When it succeeds, it returns True. If a cache file with the same name already exists in the cache directory, it does nothing and returns False.
  Note that preserve() can be called only by the master process, i.e., the process where __init__() was called, in order to prevent inconsistency.
  The preserved cache can also be preloaded by FileCache.
  - Arguments:
    - name (str): Prefix of the preserved file names. (name).cachei and (name).cached are created. The files are created in the same directory as the cache (the dir option to __init__).
    - overwrite (bool): Overwrite if the files already exist.
  - Returns:
    bool: Returns True on success.
  Note
  This feature is experimental.
- class pfio.cache.HTTPCache(length: int, url: str, bearer_token_path=None, do_pickle=False)[source]
  HTTP-based cache system
  Stores cache data in an HTTP server with PUT and GET methods. Each cache entry corresponds to the url suffixed by the index i.
  - Arguments:
    - length (int):
      Length of the cache.
    - url (string):
      Prefix url of cache entries. Each entry corresponds to the url suffixed by its index. A user must specify a url that is globally unique across the cache system on the server side, because HTTPCache does not suffix the url with user or dataset information. Therefore, a user should include the user and dataset in the url to avoid conflicting cache entries.
      For example, assume the given url is http://cache.example.com/some/{user}/{dataset-id}/. Then put(123) and get(123) correspond to http://cache.example.com/some/{user}/{dataset-id}/123.
    - bearer_token_path (string):
      Path to an HTTP bearer token if authorization is required. HTTPCache supports refreshing the bearer token by periodical reloading.
    - do_pickle (bool):
      Do automatic pickle and unpickle inside the cache.
  Note
  This feature is experimental.
- property multiprocess_safe
Returns multiprocess safety.
- property multithread_safe
Returns multithread safety.
Toplevel Functions in v1 (deprecated)
Note
Toplevel functions will be deprecated in 2.0 and removed in 2.1. Please use V2 API instead.