ListFiles does not stream results
Activity
Milan Krivanek December 6, 2019 at 1:02 PM
Milan Krivanek December 6, 2019 at 11:58 AM (edited)
Summary:
ListFiles rewritten to support streaming, including wildcard resolution.
Readers rewritten to support streaming, including wildcard resolution.
Implemented streaming for S3 protocol in ListFiles and readers.
Done:
FileManager.list(CloverURI, ListParameters) - rewritten to call FileManager.directoryStream() and convert the result to a list, in order to keep backward compatibility with callers and to reuse existing tests.
FileManager.resolve(CloverURI, ResolveParameters) - rewritten to call FileManager.wildcardDirectoryStream() and convert the result to a list.
FileManager.defaultResolve(SingleCloverURI) - rewritten to call FileManager.defaultWildcardDirectoryStream() and convert the result to a list.
FileManager.expand(Info, String, boolean) - rewritten to use the new streaming API.
IOperationHandler.list() - preserved. Added a new default method IOperationHandler.directoryStream() that calls IOperationHandler.list() and converts the result to a stream, to avoid reimplementing directory listing in all protocols.
IOperationHandler.resolve(SingleCloverURI, ResolveParameters) - preserved. Added a new default method IOperationHandler.wildcardDirectoryStream() that calls IOperationHandler.resolve() and converts the result to a stream, to avoid reimplementing wildcard resolution in all protocols.
DefaultOperationHandler.copyInternal(SingleCloverURI, SingleCloverURI, CopyParameters) - no change, deferred.
DefaultOperationHandler.move(SingleCloverURI, SingleCloverURI, MoveParameters) - no change, deferred.
AbstractOperationHandler - no change, deferred.
ListFiles component - rewritten to use FileManager.directoryStream().
Streaming support in readers:
WildcardDirectoryStream.newDirectoryStream(String)
Added a new default method CustomPathResolver.wildcardDirectoryStream() that delegates to FileManager.wildcardDirectoryStream().
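For illustration, the backward-compatible default-method approach described above could look roughly like the sketch below. The Handler interface, its String-based signatures and the toList() helper are hypothetical stand-ins, not the actual IOperationHandler or FileManager code; only java.nio.file.DirectoryStream is a real API here.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class DefaultStreamSketch {

    /** Hypothetical stand-in for IOperationHandler; the real signatures differ. */
    public interface Handler {
        // Existing eager method, preserved for backward compatibility.
        List<String> list(String uri) throws IOException;

        // New default method: wraps list() so that protocols which do not
        // override it still expose a stream without reimplementing listing.
        default DirectoryStream<String> directoryStream(String uri) throws IOException {
            final List<String> entries = list(uri);
            return new DirectoryStream<String>() {
                private boolean iteratorReturned = false;

                @Override
                public Iterator<String> iterator() {
                    // DirectoryStream allows its iterator to be obtained only once.
                    if (iteratorReturned) {
                        throw new IllegalStateException("Iterator already obtained");
                    }
                    iteratorReturned = true;
                    return entries.iterator();
                }

                @Override
                public void close() {
                    // Nothing to release: the list is already in memory.
                }
            };
        }
    }

    /** Caller-side adapter: drains a stream into a list and closes it. */
    public static List<String> toList(DirectoryStream<String> stream) throws IOException {
        try (DirectoryStream<String> s = stream) {
            List<String> result = new ArrayList<>();
            for (String entry : s) {
                result.add(entry);
            }
            return result;
        }
    }

    public static void main(String[] args) throws IOException {
        // A "legacy" handler that only implements list().
        Handler legacy = uri -> Arrays.asList(uri + "/a.txt", uri + "/b.txt");
        System.out.println(toList(legacy.directoryStream("s3://bucket")));
    }
}
```

A handler that can really stream (S3, for example) would override directoryStream() in this sketch, while all other handlers keep working unchanged through the default method.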
Milan Krivanek December 6, 2019 at 11:57 AM
Merged to release-5-5.
Kevin Scott November 5, 2019 at 12:27 PM (edited)
A prospective customer came across this issue when trying to catalog the contents of an S3 bucket with more than 3 million files.
Some protocols, for example S3, use paging to return the directory listing. But IOperationHandler.list() returns a List, so it needs to read all the files first. It would be possible to return java.nio.file.DirectoryStream<T> instead. This would save the memory required to store huge directory listings and improve performance.
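As a rough sketch of the idea, a paged backend can be exposed as a DirectoryStream whose iterator fetches the next page only when the current one is exhausted. The Page and PageSource types below are hypothetical and merely stand in for a real paged client such as the S3 SDK; this is not the actual handler code.

```java
import java.nio.file.DirectoryStream;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

public class PagedListingSketch {

    /** One page of results plus the token of the next page (null on the last page). Hypothetical. */
    public static final class Page {
        final List<String> items;
        final String nextToken;

        public Page(List<String> items, String nextToken) {
            this.items = items;
            this.nextToken = nextToken;
        }
    }

    /** Hypothetical paged backend; the token is null for the first page. */
    public interface PageSource {
        Page fetch(String token);
    }

    /** Streams entries page by page, holding only one page in memory at a time. */
    public static DirectoryStream<String> stream(final PageSource source) {
        return new DirectoryStream<String>() {
            @Override
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    private Iterator<String> current = Collections.emptyIterator();
                    private String token = null;
                    private boolean exhausted = false;

                    @Override
                    public boolean hasNext() {
                        // Fetch further pages lazily, skipping empty ones.
                        while (!current.hasNext() && !exhausted) {
                            Page page = source.fetch(token);
                            current = page.items.iterator();
                            token = page.nextToken;
                            exhausted = (token == null);
                        }
                        return current.hasNext();
                    }

                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        return current.next();
                    }
                };
            }

            @Override
            public void close() {
                // Nothing to release in this sketch.
            }
        };
    }

    public static void main(String[] args) {
        // Two fake pages: the second is requested only after the first is consumed.
        PageSource fake = token -> token == null
                ? new Page(Arrays.asList("a", "b"), "page2")
                : new Page(Arrays.asList("c"), null);
        for (String name : stream(fake)) {
            System.out.println(name);
        }
    }
}
```

With this shape, memory use is bounded by the page size rather than by the total number of files in the directory.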
We shouldn't break backward compatibility. We could probably add a new interface for handlers that support streaming of directory listings.
S3: com.amazonaws.services.s3.AmazonS3.listNextBatchOfObjects(ObjectListing)
local files: java.nio.file.Files.newDirectoryStream(Path)
SMB2: com.hierynomus.smbj.share.Directory.iterator(Class<F>, String)
We should rewrite both directory listing and wildcard resolution to support streaming. For wildcard resolution, it should be sufficient to test whether the result is empty; we don't need to know the exact size.
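A minimal sketch of the emptiness test for local files, using the standard java.nio.file.Files.newDirectoryStream(Path, String) glob overload. The hasMatch() helper name is hypothetical; the point is only that hasNext() on the stream's iterator answers the question without collecting or counting all matches.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;

public class WildcardMatchSketch {

    /** Returns true if at least one entry in dir matches the glob, without collecting the rest. */
    public static boolean hasMatch(Path dir, String glob) throws IOException {
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir, glob)) {
            return stream.iterator().hasNext();
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("wildcard-sketch");
        Files.createFile(dir.resolve("data1.csv"));
        System.out.println(hasMatch(dir, "*.csv"));  // true
        System.out.println(hasMatch(dir, "*.json")); // false
    }
}
```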