Is there any faster way for downloading multiple files from s3 to local folder?

I am trying to download 12,000 files from an S3 bucket using a Jupyter notebook, and the estimated time to completion is 21 hours. This is because each file is downloaded one at a time. Can we run multiple downloads in parallel so I can speed up the process?

Currently, I am using the following code to download all the files:

### Get unique full-resolution image basenames
images = df['full_resolution_image_basename'].unique()
print(f'No. of unique full-resolution images: {len(images)}')

### Create a folder for full-resolution images
images_dir = './images/'
os.makedirs(images_dir, exist_ok=True)

### Download images
images_str = "','".join(images)
limiting_clause = (f"CONTAINS(ARRAY['{images_str}'], "
                   "full_resolution_image_basename)")
_ = download_full_resolution_images(images_dir,
                                    limiting_clause=limiting_clause)

1 Answer

  • See the code below. This will only work with Python 3.6+, because of the f-strings (PEP 498); use a different method of string formatting for older versions of Python.

    Provide the relative_path, bucket_name and s3_object_keys. In addition, max_workers is optional; if not provided, ThreadPoolExecutor defaults to 5 times the number of processors on the machine.

    Most of the code for this answer came from an answer to How to create an async generator in Python?, which in turn is based on the ThreadPoolExecutor example in the concurrent.futures documentation.

    import boto3
    import os
    from concurrent import futures
    
    
    relative_path = './images'
    bucket_name = 'bucket_name'
    s3_object_keys = [] # List of S3 object keys
    max_workers = 5
    
    abs_path = os.path.abspath(relative_path)
    s3 = boto3.client('s3')
    
    def fetch(key):
        file = f'{abs_path}/{key}'
        # Create any missing parent directories for the file, not the file path itself
        os.makedirs(os.path.dirname(file), exist_ok=True)
        with open(file, 'wb') as data:
            s3.download_fileobj(bucket_name, key, data)
        return file
    
    
    def fetch_all(keys):
    
        with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
            future_to_key = {executor.submit(fetch, key): key for key in keys}
    
            print("All URLs submitted.")
    
            for future in futures.as_completed(future_to_key):
    
                key = future_to_key[future]
                exception = future.exception()
    
                if not exception:
                    yield key, future.result()
                else:
                    yield key, exception
    
    
    for key, result in fetch_all(s3_object_keys):
        print(f'key: {key}  result: {result}')
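    The yield-on-completion pattern in fetch_all can be exercised without AWS credentials by swapping in a stub download function; fake_fetch and the keys below are hypothetical stand-ins, not part of the original answer:

```python
from concurrent import futures

def fake_fetch(key):
    # Stand-in for the S3 download; raises for one key to show error handling.
    if key == 'bad-key':
        raise ValueError('simulated download failure')
    return f'/tmp/{key}'

def fetch_all(keys, fetch, max_workers=5):
    # Same structure as the answer's fetch_all: submit everything, then
    # yield (key, result-or-exception) pairs as each download finishes.
    with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
        future_to_key = {executor.submit(fetch, key): key for key in keys}
        for future in futures.as_completed(future_to_key):
            key = future_to_key[future]
            exception = future.exception()
            if exception is None:
                yield key, future.result()
            else:
                yield key, exception

results = dict(fetch_all(['a.jpg', 'b.jpg', 'bad-key'], fake_fetch))
print(results['a.jpg'])          # /tmp/a.jpg
print(type(results['bad-key']))  # <class 'ValueError'>
```

    Note that a failed download surfaces as the exception object in the yielded pair rather than aborting the whole batch, which is what lets the loop over fetch_all report per-key outcomes.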
    
