Speed up image file transfer by zipping

Compressing multiple files into one archive file allows for faster file transfer. Image files in JPEG format (.jpg) are already compressed, so compressing them into a zip file does not significantly reduce the file size further. However, it does reduce the number of files to a single archive file, and this makes file transfer faster, even including the time to decompress the zip file at the destination.
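As a rough illustration of why zipping JPEG files barely changes their size, the sketch below compresses two buffers with zlib, which implements the DEFLATE method that zip archives typically use. Random bytes stand in for already-compressed image data (an assumption for illustration, not real JPEG content), and a repetitive buffer stands in for uncompressed data.

```python
import os
import zlib

size = 100_000

# Stand-in for already-compressed data such as a JPEG: random bytes
# leave no redundancy for DEFLATE to remove.
dense = os.urandom(size)

# Stand-in for uncompressed data: highly repetitive bytes.
repetitive = b"abcd" * (size // 4)

dense_ratio = len(zlib.compress(dense)) / size
repetitive_ratio = len(zlib.compress(repetitive)) / size

print(f"random data compresses to {dense_ratio:.1%} of its original size")
print(f"repetitive data compresses to {repetitive_ratio:.1%} of its original size")
```

The random buffer should stay at roughly its original size, while the repetitive one shrinks to a small fraction. Real JPEG files behave much more like the first case.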

Zip files are usually used to reduce the amount of disk space required for storage. Here the benefit is speed: this article demonstrates that transferring a large number of image files to Google Colab as a single zip file and decompressing it there is over 300 times faster than transferring the same data as individual files.



Google Colab (Colaboratory)

Google released Google Colab in 2017 as an online collaboration tool for machine learning with cloud storage. It is an impressive resource that is free to use. It presents a Jupyter-like notebook interface that is easily accessed through a browser using a Google account. The Colab Jupyter notebooks are saved automatically to the Google Drive associated with the account. The notebooks can be shared, copied, reviewed and opened on other machines.

This is a great free resource, but there are some constraints: inactive sessions time out and are shut down, and all local variables and loaded data are lost when the session ends. In addition, I found that uploading numerous files is very time consuming, even from a mapped Google Drive. This is where the use of zip files significantly improved file transfer times.



Image dataset

This experiment was done with some of the images from the dogs-vs-cats data from Kaggle, but it could be done with any set of images. The images were sampled randomly and split into multiple datasets of increasing size. Each dataset was stored in its own directory, and each directory was also compressed into a zip file. The table shows that zipping the directories does not reduce the size much. These zip files and directories were uploaded to Google Drive for easy access from Google Colab. Note that it can take a couple of hours for the files to upload to Google Drive.

Table: Comparison of directory sizes and zip file size for sample images

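| Number of files | Directory size (KB) | Zip file size (KB) |
|---|---|---|
| 10 | 200 | 199 |
| 20 | 398 | 397 |
| 30 | 623 | 621 |
| 40 | 850 | 848 |
| 50 | 1097 | 1094 |
| 60 | 1320 | 1317 |
| 70 | 1571 | 1567 |
| 80 | 1802 | 1797 |
| 90 | 2043 | 2038 |
| 100 | 2209 | 2203 |
| 200 | 4478 | 4462 |
| 300 | 6664 | 6640 |
| 400 | 8811 | 8778 |
| 500 | 10929 | 10888 |
| 600 | 13254 | 13203 |
| 700 | 15520 | 15461 |
| 800 | 17719 | 17650 |
| 900 | 20147 | 20069 |
| 1000 | 22418 | 22329 |
| 2000 | 45587 | 45393 |
| 3000 | 68260 | 67971 |
| 4000 | 91057 | 90675 |
| 5000 | 114013 | 113539 |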
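Sizes like those in the table can be measured with a small helper. The following is a minimal sketch, not the method used to produce the table: dir_size_kb is a hypothetical helper name, the files are random bytes standing in for real JPEG images, and a KB is assumed to be 1024 bytes.

```python
import os
import shutil
import tempfile


def dir_size_kb(directory):
    """Total size in KB of all files under directory."""
    total = 0
    for root, _, filenames in os.walk(directory):
        for filename in filenames:
            total += os.path.getsize(os.path.join(root, filename))
    return total / 1024


with tempfile.TemporaryDirectory() as tmp:
    image_dir = os.path.join(tmp, "images_5")
    os.makedirs(image_dir)
    # Hypothetical stand-in files: random bytes instead of real JPEGs
    for i in range(5):
        with open(os.path.join(image_dir, f"img_{i}.jpg"), "wb") as f:
            f.write(os.urandom(20 * 1024))

    # The archive is written next to the directory as images_5.zip
    zip_path = shutil.make_archive(image_dir, "zip", image_dir)

    directory_kb = dir_size_kb(image_dir)
    zip_kb = os.path.getsize(zip_path) / 1024

print(f"directory: {directory_kb:.0f} KB, zip: {zip_kb:.0f} KB")
```

With incompressible stand-in data the two sizes come out nearly identical, which matches the pattern in the table: the zip file saves almost no space.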

Code to create the sample data from the Cats & Dogs image files.

import os
import shutil

import pandas as pd


def get_file_names_from_dir(directory, ext=".jpg"):
    """Return paths of all non-empty files under directory with the extension."""
    file_list = []
    for root, dirnames, filenames in os.walk(directory):
        # Only load image files
        files = [f for f in filenames if f.lower().endswith(ext)]
        for filename in files:
            file_path = os.path.join(root, filename)
            if os.path.getsize(file_path) > 0:
                file_list.append(file_path)
            else:
                print(f"{filename} is zero length, so ignoring.")
    return file_list


def make_dir_and_zip(num):
    """Copy a random sample of num images into a new directory and zip it."""
    root_dir = "/tmp/dogs-vs-cats/split"
    dest_dir = f"{root_dir}/images_{num}"
    # Start from an empty directory each time
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)
    os.makedirs(dest_dir)

    for a in animals_df.sample(num, random_state=42)["animals"]:
        shutil.copy(a, dest_dir)

    # Creates images_{num}.zip next to the images_{num} directory
    shutil.make_archive(dest_dir, "zip", dest_dir)


animals = get_file_names_from_dir("/tmp/dogs-vs-cats")
animals_df = pd.DataFrame({"animals": animals})

# Creates samples of 10 to 90, 100 to 900 and 1000 to 9000 files
for i in range(10, 100, 10):
    print("- ", end="")
    make_dir_and_zip(i)
    make_dir_and_zip(i * 10)
    make_dir_and_zip(i * 100)


Transfer image files individually

The following function copies all the files in a directory using the copytree function from shutil. It uses the time module to measure the time taken, and returns that time in milliseconds along with the number of files copied.

import os
import shutil
import time


def copy_directory_and_files(source_path, dest_path):
    print("Copying files")
    start = time.time()
    # Copy the content of source to destination
    shutil.copytree(source_path, dest_path)
    copy_dir_time = (time.time() - start) * 1000
    print(f"directory copied in {copy_dir_time:.0f} milliseconds")
    num_copied = len(os.listdir(dest_path))
    return copy_dir_time, num_copied


# image_remote and local_image_dir are path prefixes ending in "/",
# set elsewhere in the notebook
num = 10
source_dir = f"{image_remote}images_{num}"
dest_dir = f"{local_image_dir}images_{num}"
copy_dir_time, num_copied = copy_directory_and_files(source_dir, dest_dir)

"""
Copying files
directory copied in 4021 milliseconds
"""

copy_dir_time
"""
4021
"""

num_copied
"""
10
"""
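The start/stop timing pattern used above can also be wrapped in a context manager so that it is written only once. This is an optional sketch, not part of the original experiment: timed_ms is a hypothetical helper name, and it uses time.perf_counter in place of time.time because it is a monotonic clock intended for measuring intervals.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed_ms(label):
    """Time the enclosed block and record the elapsed milliseconds."""
    result = {"label": label, "ms": None}
    start = time.perf_counter()
    try:
        yield result
    finally:
        result["ms"] = (time.perf_counter() - start) * 1000
        print(f"{label} took {result['ms']:.0f} milliseconds")


with timed_ms("sleep") as t:
    time.sleep(0.05)  # placeholder for a call such as shutil.copytree(...)
```

After the block exits, t["ms"] holds the elapsed time and can be stored alongside the other results.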


Transfer zip file and unzip image files

A similar function is created to copy the zip files and extract the files from the zip file into a directory.

import os
import shutil
import time
import zipfile


def copy_and_extract_data(zip_filename, source_path, dest_path, extract_path):
    source_zip = source_path + zip_filename
    local_zip = dest_path + zip_filename

    start = time.time()
    # Copy the zip file from source to destination
    shutil.copyfile(source_zip, local_zip)
    copy_time = (time.time() - start) * 1000
    print(f"Zip file copied in {copy_time:.0f} milliseconds")

    start = time.time()
    # Unzip the file; the with block closes the archive when done
    with zipfile.ZipFile(local_zip, "r") as zip_ref:
        zip_ref.extractall(extract_path)
    unzip_time = (time.time() - start) * 1000
    print(f"Zip file Extracted in {unzip_time:.0f} milliseconds")

    num_extracted = len(os.listdir(extract_path))

    return copy_time, unzip_time, num_extracted


num = 10
zip_filename = f"images_{num}.zip"
extract_path = f"{local_image_dir_zips}images_{num}"
copy_time, unzip_time, num_extracted = copy_and_extract_data(
    zip_filename, image_remote, local_image_dir_zips, extract_path
)

"""
Zip file copied in 740 milliseconds
Zip file Extracted in 5 milliseconds
"""

copy_time, unzip_time, num_extracted
"""
740, 5, 10
"""

Figure: Time to copy 10 files. Copying the zip file and extracting it is 5 times faster than copying the files individually

Copying just 10 files from Google Drive to a local directory in Google Colab takes 4021 milliseconds. Copying the zip file with the same images takes just 740 milliseconds, plus only 5 milliseconds to decompress it. I found it surprising that even with just 10 small images (ranging from 5 to 40 KB), copying the zip file and extracting it is significantly faster. Using the zip file is over five times faster.
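The speed-up factor follows directly from the three timings reported for 10 files. A short worked calculation, using only those numbers:

```python
# Timings reported above for 10 files, in milliseconds
copy_files_ms = 4021
copy_zip_ms = 740
unzip_ms = 5

# The zip route pays for both the copy and the extraction
zip_total_ms = copy_zip_ms + unzip_ms
times_faster = copy_files_ms / zip_total_ms

print(f"zip route: {zip_total_ms} ms, individual files: {copy_files_ms} ms")
print(f"times faster: {times_faster:.1f}")
```

The ratio works out to about 5.4.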



Transfer all files

The following code transfers all the zip files and extracts the images, and also transfers the directories of image files. The resulting times are gathered into a dataframe for analysis. Note that this is time-consuming to run, so the results are saved on each iteration in case the connection to Google Colab is lost and everything gets reset.

import pandas as pd


def copy_twice(num):
    """Time both transfer methods for one dataset size; return a one-row dataframe."""
    print()
    print("*" * 30)
    print(f"num = {num}")
    zip_filename = f"images_{num}.zip"
    extract_path = f"{local_image_dir_zips}images_{num}"
    copy_time, unzip_time, num_extracted = copy_and_extract_data(
        zip_filename, image_remote, local_image_dir_zips, extract_path
    )
    print()

    source_dir = f"{image_remote}images_{num}"
    dest_dir = f"{local_image_dir}images_{num}"
    copy_dir_time, num_copied = copy_directory_and_files(source_dir, dest_dir)

    r_df = pd.DataFrame(
        [[num, copy_time, unzip_time, num_extracted, copy_dir_time, num_copied]],
        columns=[
            "num",
            "copy_zip_time",
            "unzip_time",
            "num_extracted",
            "copy_files_time",
            "num_copied",
        ],
    )
    return r_df


# Create the results dataframe
result_df = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0]],
    columns=[
        "num",
        "copy_zip_time",
        "unzip_time",
        "num_extracted",
        "copy_files_time",
        "num_copied",
    ],
)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# 10 to 100 files
for num in range(10, 101, 10):
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")

# 200 to 1000 files
for i in range(20, 101, 10):
    num = i * 10
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")

# 2000 to 5000 files
for i in range(20, 51, 10):
    num = i * 100
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")
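Because the results are written out after every iteration, a lost session does not have to start from scratch. The sketch below shows one way a restarted run could work out which dataset sizes are still missing from a saved CSV. It is an assumption about how the checkpoints might be reused, not code from the original experiment: completed_sizes and remaining_sizes are hypothetical helpers, the standard csv module is used instead of pandas to keep the example self-contained, and the demo file contains only stand-in rows.

```python
import csv
import os
import tempfile


def completed_sizes(results_csv):
    """Return the set of dataset sizes already recorded in a results CSV."""
    if not os.path.exists(results_csv):
        return set()
    with open(results_csv, newline="") as f:
        # int(float(...)) tolerates values written as "10" or "10.0"
        sizes = {int(float(row["num"])) for row in csv.DictReader(f)}
    sizes.discard(0)  # ignore the all-zero placeholder row
    return sizes


def remaining_sizes(all_sizes, results_csv):
    """Return the sizes from all_sizes that have no saved result yet."""
    done = completed_sizes(results_csv)
    return [n for n in all_sizes if n not in done]


# Demo with a stand-in checkpoint file (index column plus "num" only)
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "timing_results.csv")
    with open(checkpoint, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "num"])
        writer.writerow([0, 0])
        writer.writerow([0, 10])
        writer.writerow([0, 20])
    todo = remaining_sizes(range(10, 51, 10), checkpoint)

print(todo)
```

In the demo, sizes 10 and 20 are already recorded, so only 30, 40 and 50 remain to be timed.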


Compare transfer performance

The results confirm what was seen with the initial sample of 10 files. As the number of files increases, the difference between using a zip file and transferring files individually increases. The Times faster column is the time to copy the files individually divided by the sum of the time to copy the zip file and the time to unzip it. This factor increases as the number of files grows: with 4,000 or more files, using a zip file to transfer the files is over 300 times faster.

Table: Transfer times of increasing number of files using zip compared to transferring individually

| Number of files | Directory size (KB) | Copy zip (sec) | Unzip (sec) | Copy files (sec) | Times faster |
|---|---|---|---|---|---|
| 10 | 200 | 0.7 | 0.0 | 4.0 | 5.4 |
| 20 | 398 | 1.1 | 0.0 | 7.0 | 6.4 |
| 30 | 623 | 0.6 | 0.0 | 8.3 | 13.0 |
| 40 | 850 | 0.9 | 0.0 | 11.6 | 13.1 |
| 50 | 1097 | 1.1 | 0.0 | 13.5 | 12.2 |
| 60 | 1320 | 1.2 | 0.0 | 16.5 | 13.1 |
| 70 | 1571 | 0.7 | 0.0 | 18.0 | 24.7 |
| 80 | 1802 | 1.3 | 0.0 | 19.7 | 15.3 |
| 90 | 2043 | 1.2 | 0.0 | 23.1 | 19.2 |
| 100 | 2209 | 1.2 | 0.0 | 24.9 | 20.3 |
| 200 | 4478 | 1.4 | 0.1 | 64.5 | 44.8 |
| 300 | 6664 | 1.3 | 0.1 | 96.8 | 68.5 |
| 400 | 8811 | 1.5 | 0.1 | 133.1 | 81.9 |
| 500 | 10929 | 1.4 | 0.1 | 160.4 | 102.8 |
| 600 | 13254 | 1.2 | 0.2 | 196.2 | 139.9 |
| 700 | 15520 | 1.5 | 0.2 | 225.8 | 136.2 |
| 800 | 17719 | 1.5 | 0.2 | 256.2 | 151.8 |
| 900 | 20147 | 1.1 | 0.2 | 290.5 | 219.9 |
| 1000 | 22418 | 1.5 | 0.3 | 326.2 | 176.7 |
| 2000 | 45587 | 2.3 | 0.6 | 678.1 | 235.8 |
| 3000 | 68260 | 2.5 | 0.9 | 1003.0 | 296.4 |
| 4000 | 91057 | 3.1 | 1.2 | 1330.0 | 312.9 |
| 5000 | 114013 | 3.1 | 1.4 | 1654.1 | 371.5 |

The following charts show that the time to transfer files individually increases linearly with the number of files. Although the transfer time of the zip files also increases, it seems almost flat in comparison to the increase in individual file transfer time.

import matplotlib.pyplot as plt

# Assumption: res_df is result_df indexed by the number of files, without the
# all-zero placeholder row, plus the combined zip copy and unzip time
res_df = result_df[result_df["num"] > 0].set_index("num")
res_df["copy_unzip_time"] = res_df["copy_zip_time"] + res_df["unzip_time"]


def plot_copy_time_vs_num_chart(df, title):
    fig, ax = plt.subplots(figsize=(8, 5), facecolor=plt.cm.Blues(0.2))
    fig.suptitle(title, fontsize="x-large", fontweight="bold")
    ax.set_facecolor(plt.cm.Blues(0.2))
    # Times are stored in milliseconds; plot them in seconds
    ax.plot(df.index, df.copy_files_time / 1000, label="Copy files")
    ax.plot(df.index, df.copy_unzip_time / 1000, label="Copy zip file and unzip")
    ax.legend(facecolor=plt.cm.Blues(0.1))
    ax.set_xlabel("Number of Files", fontsize="large")
    ax.set_ylabel("Time to transfer (seconds)", fontsize=14)
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    return fig


fig = plot_copy_time_vs_num_chart(
    res_df.loc[range(10, 101, 10), :],
    "Time taken to transfer files vs number of files\n(10 to 100 files)",
)

fig = plot_copy_time_vs_num_chart(
    res_df,
    "Time taken to transfer files vs number of files\n(10 to 5000 files)",
)

Figure: Time taken to transfer files as zip files or individually, for file groups of 10 to 100

Figure: Time taken to transfer files as zip files or individually, for file groups of 10 to 5000
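The linear growth of the individual file transfer time can also be checked numerically. The sketch below fits a straight line by ordinary least squares to five rows taken from the results table (1,000 to 5,000 files). The choice of rows and the hand-written fit are illustrative assumptions, not part of the original analysis.

```python
# (number of files, seconds to copy the files individually) from the results table
points = [
    (1000, 326.2),
    (2000, 678.1),
    (3000, 1003.0),
    (4000, 1330.0),
    (5000, 1654.1),
]

n = len(points)
mean_x = sum(x for x, _ in points) / n
mean_y = sum(y for _, y in points) / n

# Ordinary least squares for y = slope * x + intercept
sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
sxx = sum((x - mean_x) ** 2 for x, _ in points)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

print(f"about {slope:.2f} seconds per additional file")
```

The slope comes out at roughly a third of a second per file, which suggests that a fixed per-file cost dominates the transfer time for these small images.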

The difference between transferring files individually and transferring the same files in a zip file is so large that the zip file transfer time looks constant. This is not the case, as can be seen by charting the zip file transfer on its own.

Figure: Time to copy zip files. Looking at just the zip file transfer shows an increase in transfer time as the number of files increases

Using a zip file to transfer files is faster by a factor that grows as the number of files increases.

Figure: The factor by which zip file transfer is faster increases as the number of files increases



Conclusion

The purpose of compressing these images is to save time when transferring the files, not to save storage space. This article showed that storing files in zip file archives allows for much faster file transfer. It is interesting to see that this is true even for a small number of files, such as 10, where the use of a zip file is over 5 times faster than transferring the files individually. The benefits of using a zip file increase significantly as the number of files grows. With 5,000 files to transfer, using a zip file reduces the transfer time by a factor of over 370. In that case, it took 27 minutes and 34 seconds to copy the 5,000 images directly, but only 3 seconds to copy the zip file and 1.4 seconds to unzip it.