Redundant metadata API calls in Azure Blob Storage operations #533

@pavelyanu

Problem Description

The current CloudPath implementation makes redundant metadata API calls during common operations such as open(), download_to(), and copy(). Each call to exists(), is_file(), is_dir(), or stat() triggers a separate _get_metadata() request to Azure Blob Storage, even though a single metadata response already contains all of these properties.

What happens during an open() call

  • open() calls exists() + is_file()
  • _refresh_cache() calls stat()
  • download_to() calls exists() + is_file() again

On Azure, all of these end up calling the same AzureBlobClient._get_metadata(), which returns all the necessary information (existence, file/directory status, size, last modified time) in a single API call.
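To make the call pattern above concrete, here is a minimal sketch that counts metadata fetches during one open(). CountingClient and SketchPath are hypothetical stand-ins, not the real cloudpathlib classes, and the sketch assumes that _refresh_cache() is what invokes download_to().

```python
from dataclasses import dataclass


@dataclass
class CountingClient:
    """Hypothetical stand-in for AzureBlobClient that counts metadata fetches."""

    calls: int = 0

    def _get_metadata(self, path):
        self.calls += 1  # each call models one round trip to the service
        return {"exists": True, "is_dir": False, "size": 1024}


class SketchPath:
    """Hypothetical path whose helpers each fetch metadata independently."""

    def __init__(self, client):
        self.client = client

    def exists(self):
        return self.client._get_metadata(self)["exists"]

    def is_file(self):
        meta = self.client._get_metadata(self)
        return meta["exists"] and not meta["is_dir"]

    def stat(self):
        return self.client._get_metadata(self)

    def download_to(self):
        # Repeats the checks that open() has already performed.
        return self.exists() and self.is_file()

    def _refresh_cache(self):
        self.stat()
        self.download_to()  # assumption: the cache refresh triggers the download

    def open(self):
        if self.exists() and not self.is_file():
            raise IsADirectoryError("cannot open a directory")
        self._refresh_cache()


client = CountingClient()
SketchPath(client).open()
print(client.calls)  # 5 metadata fetches for a single open()
```

In this model a single open() issues five metadata requests, although the first response already carries everything the later checks need.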

Performance Impact

After removing the redundant calls, I was able to achieve:

  • ~2× speedup for 1 MB downloads
  • ~1.5× speedup for 10 MB downloads

Proposal

There are two possible solutions:

Option 1: Azure-specific optimization

Optimize this in AzureBlobClient and AzureBlobPath.

Implementation:

  • Add _get_blob_properties() to AzureBlobClient that returns all the needed information in one call
  • Store the result of AzureBlobClient._get_blob_properties() at the start of e.g. AzureBlobPath.open()
  • Pass metadata between internal methods to avoid redundant calls
  • Alternatively implement metadata caching/invalidation logic

Example:

def open(self, mode="r", **kwargs):
    meta = self.client._get_blob_properties(self)  # Single call
    if meta.exists and meta.is_directory:
        raise CloudPathIsADirectoryError(...)
    if mode == "x" and meta.exists:
        raise CloudPathFileExistsError(...)
    self._refresh_cache_with_meta(meta, **kwargs)  # Reuse metadata
    # ... rest of implementation
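The caching/invalidation alternative mentioned in the list above could look roughly like the following. This is only a sketch under stated assumptions: BlobMeta, MetadataCache, the TTL policy, and fake_fetch are all hypothetical names introduced for illustration, and nothing here is part of the existing client.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Tuple


@dataclass(frozen=True)
class BlobMeta:
    """Hypothetical container for what one metadata response provides."""

    exists: bool
    is_directory: bool
    size: Optional[int] = None


class MetadataCache:
    """Hypothetical cache with time-based expiry and explicit invalidation."""

    def __init__(self, fetch: Callable[[str], BlobMeta], ttl_seconds: float = 1.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._entries: Dict[str, Tuple[float, BlobMeta]] = {}

    def get(self, key: str) -> BlobMeta:
        now = time.monotonic()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1]  # still fresh: no request is made
        meta = self._fetch(key)
        self._entries[key] = (now, meta)
        return meta

    def invalidate(self, key: str) -> None:
        # Call after any operation that changes the blob (upload, delete, move).
        self._entries.pop(key, None)


fetches = []  # records every simulated request


def fake_fetch(key: str) -> BlobMeta:
    fetches.append(key)
    return BlobMeta(exists=True, is_directory=False, size=1024)


cache = MetadataCache(fake_fetch, ttl_seconds=60.0)
cache.get("container/blob.txt")         # fetched
cache.get("container/blob.txt")         # served from the cache
cache.invalidate("container/blob.txt")  # e.g. after a write
cache.get("container/blob.txt")         # fetched again
```

The trade-off compared with passing metadata between internal methods is staleness: a cache can serve outdated results if the blob changes outside this process, so the expiry window and the invalidation points need care.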

Option 2: CloudPath optimization

Change the Client API and optimize CloudPath.

  • Modify the Client API to explicitly require a _get_metadata() method that fetches all the required data in one call
  • Apply the same optimization to CloudPath as described in Option 1
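One possible shape for Option 2 is sketched below. SketchClient, PathMetadata, and InMemoryClient are hypothetical names and do not reflect the real cloudpathlib Client interface. The idea is that every backend must implement a single metadata fetch, and the generic predicates accept an already-fetched result instead of fetching again.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PathMetadata:
    """Hypothetical result type shared by all backends."""

    exists: bool
    is_dir: bool
    size: Optional[int] = None


class SketchClient(ABC):
    """Hypothetical base class; not the real cloudpathlib Client."""

    @abstractmethod
    def _get_metadata(self, path: str) -> PathMetadata:
        """Return all metadata for path in a single backend request."""

    def exists(self, path: str, meta: Optional[PathMetadata] = None) -> bool:
        # Reuse the caller's metadata when provided; fetch only as a fallback.
        meta = meta if meta is not None else self._get_metadata(path)
        return meta.exists

    def is_file(self, path: str, meta: Optional[PathMetadata] = None) -> bool:
        meta = meta if meta is not None else self._get_metadata(path)
        return meta.exists and not meta.is_dir


class InMemoryClient(SketchClient):
    """Toy backend used only to exercise the sketch."""

    def __init__(self, blobs):
        self.blobs = blobs
        self.requests = 0

    def _get_metadata(self, path: str) -> PathMetadata:
        self.requests += 1
        if path in self.blobs:
            return PathMetadata(True, False, len(self.blobs[path]))
        return PathMetadata(False, False)


client = InMemoryClient({"a.txt": b"hello"})
meta = client._get_metadata("a.txt")  # the only request
checks = (client.exists("a.txt", meta), client.is_file("a.txt", meta))
```

Making the fetch abstract forces each backend to provide it, while the optional meta argument keeps existing call sites working without changes.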

A PR for Option 1 is coming.
