ListFiles does not stream results

Description

Some protocols, such as S3, return directory listings in pages. However, IOperationHandler.list() returns a List, so it has to read all the files before it can return anything. It could return java.nio.file.DirectoryStream<T> instead.

This would reduce the memory needed to hold huge directory listings and improve performance.
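As a rough illustration of the idea, a paged listing can be exposed as a lazy DirectoryStream that fetches the next page only when the current one is exhausted. This is a minimal sketch, not the actual implementation: Page, PagedDirectoryStream and the token-based fetch function are hypothetical names introduced here.

```java
import java.nio.file.DirectoryStream;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;
import java.util.function.Function;

/** Hypothetical type: one page of a listing; nextToken == null marks the last page. */
final class Page<T> {
    final List<T> items;
    final String nextToken;

    Page(List<T> items, String nextToken) {
        this.items = items;
        this.nextToken = nextToken;
    }
}

/** Hypothetical adapter: exposes a paged listing as a lazily evaluated DirectoryStream. */
final class PagedDirectoryStream<T> implements DirectoryStream<T> {
    // Maps a continuation token to a page; a null token requests the first page.
    private final Function<String, Page<T>> fetchPage;
    private boolean iteratorReturned;

    PagedDirectoryStream(Function<String, Page<T>> fetchPage) {
        this.fetchPage = fetchPage;
    }

    @Override
    public Iterator<T> iterator() {
        // A DirectoryStream supports only a single iterator.
        if (iteratorReturned) {
            throw new IllegalStateException("Iterator already obtained");
        }
        iteratorReturned = true;
        return new Iterator<T>() {
            private Iterator<T> current = Collections.emptyIterator();
            private String token;    // token of the next page to fetch
            private boolean started; // false until the first page has been requested

            @Override
            public boolean hasNext() {
                // Fetch another page only when the current one is exhausted.
                while (!current.hasNext() && (!started || token != null)) {
                    Page<T> page = fetchPage.apply(token);
                    started = true;
                    token = page.nextToken;
                    current = page.items.iterator();
                }
                return current.hasNext();
            }

            @Override
            public T next() {
                if (!hasNext()) {
                    throw new NoSuchElementException();
                }
                return current.next();
            }
        };
    }

    @Override
    public void close() {
        // Nothing to release in this sketch; a real handler might close a connection here.
    }
}
```

With this shape, at most one page is held in memory at a time. For S3, the fetch function might wrap the batch-listing call mentioned in this issue, but that wiring is not shown here.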

We shouldn't break backward compatibility. We could probably add a new interface for handlers that allows streaming of directory listings.
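One possible shape for such an opt-in interface is sketched below. All names here (ListingHandler, StreamingListingHandler, Listings) are hypothetical stand-ins with simplified String-based signatures; the real IOperationHandler contract is richer.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.util.Iterator;
import java.util.List;

/** Hypothetical stand-in for the existing handler contract. */
interface ListingHandler {
    List<String> list(String directory) throws IOException;
}

/** Hypothetical opt-in interface for handlers that can stream their listing. */
interface StreamingListingHandler extends ListingHandler {
    DirectoryStream<String> directoryStream(String directory) throws IOException;
}

final class Listings {
    private Listings() {
    }

    /** Uses streaming when the handler supports it, otherwise wraps the eager list. */
    static DirectoryStream<String> open(ListingHandler handler, String directory) throws IOException {
        if (handler instanceof StreamingListingHandler) {
            return ((StreamingListingHandler) handler).directoryStream(directory);
        }
        // Fallback for existing handlers: the whole listing is still read up front.
        final List<String> all = handler.list(directory);
        return new DirectoryStream<String>() {
            @Override
            public Iterator<String> iterator() {
                return all.iterator();
            }

            @Override
            public void close() {
                // Nothing to release for an in-memory list.
            }
        };
    }
}
```

Existing handlers keep compiling unchanged, and only protocols that benefit from streaming need to implement the extra interface.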

  • S3: com.amazonaws.services.s3.AmazonS3.listNextBatchOfObjects(ObjectListing)

  • local files: java.nio.file.Files.newDirectoryStream(Path)

  • SMB2: com.hierynomus.smbj.share.Directory.iterator(Class<F>, String)
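For the local-file case listed above, the JDK call can be used roughly as follows. This is a minimal sketch; LocalListing and countEntries are hypothetical names chosen for illustration.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class LocalListing {
    private LocalListing() {
    }

    /** Counts directory entries one at a time, without building a list of all of them. */
    static long countEntries(Path dir) throws IOException {
        long count = 0;
        // try-with-resources closes the underlying directory handle.
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
            for (Path entry : stream) {
                count++;
            }
        }
        return count;
    }
}
```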

We should rewrite both directory listing and wildcard resolution to support streaming. For wildcard resolution, it should be sufficient to test whether the result is empty; we don't need to know the exact size.
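The emptiness test needs only the first element of the stream. The sketch below shows the idea using the JDK's glob-filtered directory stream for local files as a stand-in; WildcardCheck and matchesAnything are hypothetical names, and the product's own wildcard syntax may differ from JDK globs.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

final class WildcardCheck {
    private WildcardCheck() {
    }

    /** Returns true if at least one entry in dir matches the glob; stops at the first match. */
    static boolean matchesAnything(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            // hasNext() only has to locate one matching entry, not the full result.
            return stream.iterator().hasNext();
        }
    }
}
```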

Steps to reproduce

None

Activity

Milan Krivanek December 6, 2019 at 11:58 AM
Edited

Summary:

  • ListFiles rewritten to support streaming, including wildcard resolution.

  • Readers rewritten to support streaming, including wildcard resolution.

  • Implemented streaming for S3 protocol in ListFiles and readers.

Done:

  • FileManager.list(CloverURI, ListParameters) rewritten to call FileManager.directoryStream() and convert it to a list in order to keep backward compatibility with callers and to reuse existing tests.

  • FileManager.resolve(CloverURI, ResolveParameters) rewritten to call FileManager.wildcardDirectoryStream() and convert it to a list.

  • FileManager.defaultResolve(SingleCloverURI) rewritten to call FileManager.defaultWildcardDirectoryStream() and convert it to a list.

    • FileManager.expand(Info, String, boolean) rewritten to use the new streaming API.

  • IOperationHandler.list() - preserved. Added a new default method IOperationHandler.directoryStream() that calls IOperationHandler.list() and converts it to a stream to avoid reimplementing directory listing in all protocols.

  • IOperationHandler.resolve(SingleCloverURI, ResolveParameters) - preserved. Added a new default method IOperationHandler.wildcardDirectoryStream() that calls IOperationHandler.resolve() and converts it to a stream to avoid reimplementing wildcard resolution in all protocols.

  • DefaultOperationHandler.copyInternal(SingleCloverURI, SingleCloverURI, CopyParameters) - no change, deferred

  • DefaultOperationHandler.move(SingleCloverURI, SingleCloverURI, MoveParameters) - no change, deferred

  • AbstractOperationHandler - no change, deferred

  • ListFiles component - rewritten to use FileManager.directoryStream()

  • streaming support in readers: WildcardDirectoryStream.newDirectoryStream(String)

    • added a new default method CustomPathResolver.wildcardDirectoryStream() that delegates to FileManager.wildcardDirectoryStream()
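The compatibility pattern described in the list above (a default method that adapts the existing list() call to a stream, plus a list-returning facade that drains the stream) might look roughly like this. SimpleHandler and StreamToList are hypothetical, simplified stand-ins; the real methods take CloverURI and parameter objects and are not reproduced here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Hypothetical, simplified stand-in for a handler interface. */
interface SimpleHandler {
    List<String> list(String uri) throws IOException;

    /** Default adapter: wraps the eager listing, so existing handlers need no changes. */
    default DirectoryStream<String> directoryStream(String uri) throws IOException {
        final List<String> all = list(uri);
        return new DirectoryStream<String>() {
            @Override
            public Iterator<String> iterator() {
                return all.iterator();
            }

            @Override
            public void close() {
                // Nothing to release for an in-memory list.
            }
        };
    }
}

final class StreamToList {
    private StreamToList() {
    }

    /** Drains a stream into a list and closes it, as a list-returning facade might do. */
    static <T> List<T> toList(DirectoryStream<T> stream) throws IOException {
        try (DirectoryStream<T> s = stream) {
            List<T> result = new ArrayList<>();
            for (T item : s) {
                result.add(item);
            }
            return result;
        }
    }
}
```

A protocol that supports paging can then override directoryStream() with a truly lazy implementation, while callers of the list-based methods see no change in behavior.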

Milan Krivanek December 6, 2019 at 11:57 AM

Merged to release-5-5.

Kevin Scott November 5, 2019 at 12:27 PM
Edited

A prospective customer came across this issue when trying to catalog the contents of an S3 bucket with more than 3 million files.

Fixed

Details

QA Testing

UNDECIDED

Created October 15, 2019 at 8:43 AM
Updated September 12, 2023 at 8:44 AM
Resolved December 6, 2019 at 11:57 AM