Speed up image file transfer by zipping
Compressing multiple files into one archive file allows for faster file transfer. Image files in JPG format are already compressed, and compressing them into a zip file does not significantly reduce their size further. However, it does reduce the number of files to a single archive, and this makes the transfer faster, even including the time to decompress the zip file at the destination.
Zip files are often used to reduce the disk space needed for storage, but that is not the goal here. This article demonstrates that, for several thousand image files, copying a single zip file to Google Colab and decompressing it there is over 300 times faster than copying the same images as individual files.
Google Colab (Colaboratory)
Google released Google Colab in 2017 as an online collaboration tool for machine learning with cloud storage. It is an impressive resource that is free to use and presents a Jupyter-like notebook interface that is easily accessed through a browser using a Google account. The Colab Jupyter notebooks are saved automatically to the Google Drive associated with the account. The notebooks can be shared, copied, reviewed and opened on other machines.
This is a great free resource, but there are some constraints: inactive sessions time out and are shut down, and all local variables and loaded data are lost when a session ends. In addition, I found that uploading numerous files is very time consuming, even from a mapped Google Drive. This is where using zip files significantly improved file transfer times.
Image dataset
This experiment was done with some of the images from the dogs-vs-cats data from Kaggle, but it could be done with any set of images. The images were sampled randomly to build datasets of the increasing sizes listed below. Each dataset was stored as a directory, and each directory was also compressed into a zip file. The table shows that zipping the directories does not reduce the size much. The zip files and directories were uploaded to Google Drive for easy access from Google Colab. Note that it can take a couple of hours for the files to upload to Google Drive.
Table: Comparison of directory sizes and zip file size for sample images
Number of files | Directory size (KB) | Zip file size (KB) |
---|---|---|
10 | 200 | 199 |
20 | 398 | 397 |
30 | 623 | 621 |
40 | 850 | 848 |
50 | 1097 | 1094 |
60 | 1320 | 1317 |
70 | 1571 | 1567 |
80 | 1802 | 1797 |
90 | 2043 | 2038 |
100 | 2209 | 2203 |
200 | 4478 | 4462 |
300 | 6664 | 6640 |
400 | 8811 | 8778 |
500 | 10929 | 10888 |
600 | 13254 | 13203 |
700 | 15520 | 15461 |
800 | 17719 | 17650 |
900 | 20147 | 20069 |
1000 | 22418 | 22329 |
2000 | 45587 | 45393 |
3000 | 68260 | 67971 |
4000 | 91057 | 90675 |
5000 | 114013 | 113539 |
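In the table, zipping the 5,000-image directory saves only about 0.4% (114013 KB down to 113539 KB). The effect can be reproduced without the image data. The sketch below is an illustration, not the code used for the article: it zips random bytes in memory as a stand-in for JPEG data (an assumption, since both look incompressible to the DEFLATE algorithm) and contrasts them with highly repetitive text. The helper name `zip_size_ratio` and the file names are made up for the example.

```python
import io
import os
import zipfile


def zip_size_ratio(payloads):
    """Return (raw_bytes, zip_bytes) for a list of byte strings zipped in memory."""
    raw_bytes = sum(len(p) for p in payloads)
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        for i, payload in enumerate(payloads):
            zf.writestr(f"file_{i}.bin", payload)
    return raw_bytes, len(buffer.getvalue())


# Random bytes stand in for JPEG data (assumption: both are incompressible).
random_files = [os.urandom(20_000) for _ in range(10)]
# Repetitive text compresses well and is included only for contrast.
text_files = [b"the same line of text\n" * 1_000 for _ in range(10)]

for label, files in [("random", random_files), ("text", text_files)]:
    raw, zipped = zip_size_ratio(files)
    print(f"{label}: {raw} bytes raw, {zipped} bytes zipped ({zipped / raw:.1%})")
```

The random payload should come out at roughly its original size (slightly larger, because of the zip headers), while the text shrinks to a small fraction. Real JPEG files sit close to the first case, which matches the small savings in the table.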
Code to create the sample data from the Cats & Dogs image files:
import os
import shutil

import pandas as pd


def get_file_names_from_dir(directory, ext=".jpg"):
    """Return paths of all non-empty files under `directory` ending in `ext`."""
    file_list = []
    for root, dirnames, filenames in os.walk(directory):
        # Only load image files
        files = [f for f in filenames if f.lower().endswith(ext)]
        for filename in files:
            file_path = os.path.join(root, filename)
            if os.path.getsize(file_path) > 0:
                file_list.append(file_path)
            else:
                print(f"{filename} is zero length, so ignoring.")
    return file_list


def make_dir_and_zip(num):
    """Copy a random sample of `num` images into a directory and zip it."""
    root_dir = r"/tmp/dogs-vs-cats/split"
    dest_dir = f"{root_dir}/images_{num}"
    # Start from an empty destination directory
    if os.path.exists(dest_dir):
        shutil.rmtree(dest_dir)
    os.makedirs(dest_dir)

    for a in animals_df.sample(num, random_state=42)["animals"]:
        shutil.copy(a, dest_dir)

    # Creates images_{num}.zip next to the images_{num} directory
    shutil.make_archive(dest_dir, "zip", dest_dir)


animals = get_file_names_from_dir(r"/tmp/dogs-vs-cats")
animals_df = pd.DataFrame({"animals": animals})

for i in range(10, 100, 10):
    print("- ", end="")
    make_dir_and_zip(i)
    make_dir_and_zip(i * 10)
    make_dir_and_zip(i * 100)
Transfer image files individually
The following function copies all files in a directory using the copytree function from shutil. It uses the time module to measure the time taken and returns that time in milliseconds along with the number of files copied.
import os
import shutil
import time


def copy_directory_and_files(source_path, dest_path):
    print("Copying files")
    start = time.time()
    # Copy the content of source to destination
    shutil.copytree(source_path, dest_path)
    copy_dir_time = (time.time() - start) * 1000
    print(f"directory copied in {copy_dir_time:.0f} milliseconds")
    num_copied = len(os.listdir(dest_path))
    return copy_dir_time, num_copied

# image_remote (the mapped Google Drive folder) and local_image_dir are assumed
# to be directory paths ending in "/"; their definitions are not shown here.
num = 10
source_dir = f"{image_remote}images_{num}"
dest_dir = f"{local_image_dir}images_{num}"
copy_dir_time, num_copied = copy_directory_and_files(source_dir, dest_dir)

"""
Copying files
directory copied in 4021 milliseconds
"""

copy_dir_time
"""
4021
"""

num_copied
"""
10
"""
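The function above measures elapsed time with `time.time()`, which reads the wall clock. For timing intervals, Python also provides `time.perf_counter()`, a monotonic high-resolution clock. As an optional variation (not what the article's measurements used), the timing logic could be wrapped in a small context manager. The name `timed` and the `timings` dictionary are made up for this sketch, and `time.sleep` stands in for the copy call.

```python
import time
from contextlib import contextmanager


@contextmanager
def timed(label, results):
    """Time the enclosed block and store the elapsed milliseconds in results[label]."""
    start = time.perf_counter()
    try:
        yield
    finally:
        results[label] = (time.perf_counter() - start) * 1000


timings = {}
with timed("sleep", timings):
    time.sleep(0.05)  # stand-in for shutil.copytree(source_dir, dest_dir)

print(f"sleep took {timings['sleep']:.0f} milliseconds")
```

The same `with timed(...)` block could wrap the zip copy and the extraction separately, which keeps the timing bookkeeping out of the copy functions.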
Transfer zip file and unzip image files
A similar function is created to copy the zip files and extract the files from the zip file into a directory.
import zipfile


def copy_and_extract_data(zip_filename, source_path, dest_path, extract_path):
    source_zip = source_path + zip_filename
    local_zip = dest_path + zip_filename

    start = time.time()
    # Copy the zip file from source to destination
    shutil.copyfile(source_zip, local_zip)
    copy_time = (time.time() - start) * 1000
    print(f"Zip file copied in {copy_time:.0f} milliseconds")

    start = time.time()
    # Unzip the file; the context manager closes the archive afterwards
    with zipfile.ZipFile(local_zip, "r") as zip_ref:
        zip_ref.extractall(extract_path)
    unzip_time = (time.time() - start) * 1000
    print(f"Zip file Extracted in {unzip_time:.0f} milliseconds")

    num_extracted = len(os.listdir(extract_path))

    return copy_time, unzip_time, num_extracted

num = 10
zip_filename = f"images_{num}.zip"
extract_path = f"{local_image_dir_zips}images_{num}"
copy_time, unzip_time, num_extracted = copy_and_extract_data(
    zip_filename, image_remote, local_image_dir_zips, extract_path
)

"""
Zip file copied in 740 milliseconds
Zip file Extracted in 5 milliseconds
"""

copy_time, unzip_time, num_extracted
"""
740, 5, 10
"""
Copying zip file and extracting is 5 times faster than copying files individually
Copying just 10 files from Google Drive to a local directory in Google Colab takes 4021 milliseconds. Copying the zip file with the same images takes just 740 milliseconds, plus only 5 milliseconds to decompress it. I found it surprising that, even with just 10 small images (ranging from 5 to 40 KB), copying the zip file and extracting it is significantly faster: using the zip file is over five times faster.
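The "five times faster" figure is the individual copy time divided by the combined copy-and-unzip time. A minimal check of that arithmetic with the three measurements quoted above (the function name `times_faster` is chosen for this illustration):

```python
def times_faster(copy_files_ms, copy_zip_ms, unzip_ms):
    """How many times faster the zip route is than copying files one by one."""
    return copy_files_ms / (copy_zip_ms + unzip_ms)


speedup = times_faster(copy_files_ms=4021, copy_zip_ms=740, unzip_ms=5)
print(f"{speedup:.1f} times faster")  # prints "5.4 times faster"
```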
Transfer all files
The following code transfers all the zip files and extracts the images, and also transfers the directories of individual image files. The resulting times are gathered into a dataframe for analysis. Note that this is time-consuming to run, so the results are saved on each iteration in case the connection to Google Colab is lost and everything gets reset.
import pandas as pd


def copy_twice(num):
    """Time both transfer methods for one dataset size; return a one-row dataframe."""
    print()
    print("*" * 30)
    print(f"num = {num}")
    zip_filename = f"images_{num}.zip"
    extract_path = f"{local_image_dir_zips}images_{num}"
    copy_time, unzip_time, num_extracted = copy_and_extract_data(
        zip_filename, image_remote, local_image_dir_zips, extract_path
    )
    print()

    source_dir = f"{image_remote}images_{num}"
    dest_dir = f"{local_image_dir}images_{num}"
    copy_dir_time, num_copied = copy_directory_and_files(source_dir, dest_dir)

    r_df = pd.DataFrame(
        [[num, copy_time, unzip_time, num_extracted, copy_dir_time, num_copied]],
        columns=[
            "num",
            "copy_zip_time",
            "unzip_time",
            "num_extracted",
            "copy_files_time",
            "num_copied",
        ],
    )
    return r_df

# Create the results dataframe
result_df = pd.DataFrame(
    [[0, 0, 0, 0, 0, 0]],
    columns=[
        "num",
        "copy_zip_time",
        "unzip_time",
        "num_extracted",
        "copy_files_time",
        "num_copied",
    ],
)

# DataFrame.append was removed in pandas 2.0, so pd.concat is used instead.
# The results are written to Google Drive after every run so they survive a reset.
for num in range(10, 101, 10):
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")

for i in range(20, 101, 10):
    num = i * 10
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")

for i in range(20, 51, 10):
    num = i * 100
    result_df = pd.concat([result_df, copy_twice(num)])
    result_df.to_csv(f"/content/gdrive/MyDrive/image_tests/timing_results_10_{num}.csv")
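Because the results are saved after every run, a session that is reset part-way through does not have to start again from 10 files. The article does not show how a run was resumed, so the following is only a sketch of one possibility: read the `num` column from the most recent CSV and skip the sizes already recorded. The function names and the demonstration file contents are hypothetical, and only the standard library is used so the snippet runs on its own.

```python
import csv
import os
import tempfile


def completed_sizes(csv_path):
    """Return the set of `num` values already recorded in a checkpoint CSV."""
    if not os.path.exists(csv_path):
        return set()
    with open(csv_path, newline="") as f:
        return {int(float(row["num"])) for row in csv.DictReader(f)}


def remaining_sizes(all_sizes, csv_path):
    """Sizes that still need to be timed, in their original order."""
    done = completed_sizes(csv_path)
    return [n for n in all_sizes if n not in done]


# Demonstration with a throwaway checkpoint file (hypothetical contents,
# laid out like a dataframe saved with to_csv: an unnamed index column first).
with tempfile.TemporaryDirectory() as tmp:
    checkpoint = os.path.join(tmp, "timing_results.csv")
    with open(checkpoint, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["", "num", "copy_zip_time"])
        writer.writerow([0, 0, 0])  # the all-zero placeholder row
        writer.writerow([0, 10, 740])
        writer.writerow([0, 20, 1100])
    todo = remaining_sizes(range(10, 101, 10), checkpoint)

print(todo)
```

In the notebook, the timing loop would then iterate over the remaining sizes rather than the full range. The local copies made before the reset are gone anyway, so no extra clean-up should be needed.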
Compare transfer performance
The results confirm what was seen with the initial sample of 10 files. As the number of files increases, the difference between using a zip file and transferring files individually increases. The Times faster column is the time to copy the files individually divided by the sum of the time to copy the zip file and the time to unzip it. This factor increases as the number of files grows; with 4,000 or more files, using a zip file is over 300 times faster.
Table: Transfer times of increasing number of files using zip compared to transferring individually
Number of files | Directory size (KB) | Copy zip (sec) | Unzip (sec) | Copy files (sec) | Times faster |
---|---|---|---|---|---|
10 | 200 | 0.7 | 0.0 | 4.0 | 5.4 |
20 | 398 | 1.1 | 0.0 | 7.0 | 6.4 |
30 | 623 | 0.6 | 0.0 | 8.3 | 13.0 |
40 | 850 | 0.9 | 0.0 | 11.6 | 13.1 |
50 | 1097 | 1.1 | 0.0 | 13.5 | 12.2 |
60 | 1320 | 1.2 | 0.0 | 16.5 | 13.1 |
70 | 1571 | 0.7 | 0.0 | 18.0 | 24.7 |
80 | 1802 | 1.3 | 0.0 | 19.7 | 15.3 |
90 | 2043 | 1.2 | 0.0 | 23.1 | 19.2 |
100 | 2209 | 1.2 | 0.0 | 24.9 | 20.3 |
200 | 4478 | 1.4 | 0.1 | 64.5 | 44.8 |
300 | 6664 | 1.3 | 0.1 | 96.8 | 68.5 |
400 | 8811 | 1.5 | 0.1 | 133.1 | 81.9 |
500 | 10929 | 1.4 | 0.1 | 160.4 | 102.8 |
600 | 13254 | 1.2 | 0.2 | 196.2 | 139.9 |
700 | 15520 | 1.5 | 0.2 | 225.8 | 136.2 |
800 | 17719 | 1.5 | 0.2 | 256.2 | 151.8 |
900 | 20147 | 1.1 | 0.2 | 290.5 | 219.9 |
1000 | 22418 | 1.5 | 0.3 | 326.2 | 176.7 |
2000 | 45587 | 2.3 | 0.6 | 678.1 | 235.8 |
3000 | 68260 | 2.5 | 0.9 | 1003.0 | 296.4 |
4000 | 91057 | 3.1 | 1.2 | 1330.0 | 312.9 |
5000 | 114013 | 3.1 | 1.4 | 1654.1 | 371.5 |
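The Copy files column grows almost in proportion to the number of files, which suggests a roughly constant cost per file. As a rough check (not part of the original analysis), a least-squares line can be fitted to a few rows of the table. The selection of rows and the helper `fit_line` are choices made for this sketch.

```python
# (number of files, seconds to copy individually), taken from the table above
rows = [
    (100, 24.9),
    (500, 160.4),
    (1000, 326.2),
    (2000, 678.1),
    (3000, 1003.0),
    (4000, 1330.0),
    (5000, 1654.1),
]


def fit_line(points):
    """Ordinary least-squares fit y = slope * x + intercept."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


slope, intercept = fit_line(rows)
print(f"about {slope:.2f} seconds per file")
```

The slope comes out at roughly a third of a second per file. By contrast, the whole 5,000-image zip file is copied in about 3 seconds, so the per-file overhead dominates the individual transfers.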
The following charts show that the time to transfer files individually increases linearly with the number of files. Although the transfer time of the zip files also increases, it seems almost flat in comparison.
import matplotlib.pyplot as plt


def plot_copy_time_vs_num_chart(df, title):
    fig, ax = plt.subplots(figsize=(8, 5), facecolor=plt.cm.Blues(0.2))
    fig.suptitle(title, fontsize="x-large", fontweight="bold")
    ax.set_facecolor(plt.cm.Blues(0.2))
    # Times are recorded in milliseconds; convert to seconds for plotting
    ax.plot(df.index, df.copy_files_time / 1000, label="Copy files")
    ax.plot(df.index, df.copy_unzip_time / 1000, label="Copy zip file and unzip")
    ax.legend(facecolor=plt.cm.Blues(0.1))
    ax.set_xlabel("Number of Files", fontsize="large")
    ax.set_ylabel("Time to transfer (seconds)", fontsize="large")
    ax.spines["right"].set_visible(False)
    ax.spines["top"].set_visible(False)
    return fig


# res_df is assumed to hold the timing results indexed by the number of files,
# with copy_unzip_time = copy_zip_time + unzip_time (its construction is not shown).
fig = plot_copy_time_vs_num_chart(
    res_df.loc[range(10, 101, 10), :],
    "Time taken to transfer files vs number of files\n(10 to 100 files)",
)

fig = plot_copy_time_vs_num_chart(
    res_df,
    "Time taken to transfer files vs number of files\n(10 to 5000 files)",
)
Figure: Time taken to transfer files as zip files or individually with file groups of 10 to 100
Figure: Time taken to transfer files as zip files or individually with file groups of 10 to 5000
The difference between transferring files individually and transferring the same files in a zip file is so large that the zip file transfer time looks constant. This is not the case, as charting the zip file transfer on its own shows.
Figure: Looking at just zip file transfer shows an increase in transfer time as the number of files increases
The speed advantage of the zip file grows as the number of files increases.
Figure: The factor by which zip file transfer is faster increases as the number of files increases
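The charts plot a combined copy-and-unzip time and a speed-up factor, but the code that derives those columns from the raw timings is not shown. One plausible derivation is sketched below, assuming the raw results have the columns produced by the timing loop. The sample frame holds a placeholder row plus two rows: the 10-file timings quoted earlier and the 5,000-file row rounded from the results table, so its factor differs slightly from the 371.5 in the table. The column name `times_faster` is made up for this sketch.

```python
import pandas as pd

# Timings in milliseconds. In the notebook this would be the result_df
# built by the timing loop; these rows are a small stand-in.
result_df = pd.DataFrame(
    {
        "num": [0, 10, 5000],
        "copy_zip_time": [0, 740, 3100],
        "unzip_time": [0, 5, 1400],
        "copy_files_time": [0, 4021, 1654100],
    }
)

res_df = (
    result_df[result_df["num"] > 0]  # drop the all-zero placeholder row
    .set_index("num")
    .assign(copy_unzip_time=lambda d: d["copy_zip_time"] + d["unzip_time"])
    .assign(times_faster=lambda d: d["copy_files_time"] / d["copy_unzip_time"])
)
print(res_df[["copy_unzip_time", "times_faster"]].round(1))
```

Indexing by `num` is what lets the plotting code select rows with `res_df.loc[range(10, 101, 10), :]`.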
Conclusion
The purpose of compressing these images is to save time when transferring the files, not to save storage space. This article showed that storing files in zip archives allows for much faster file transfer. It is interesting to see that this is true even for a small number of files, such as 10, where using a zip file is over 5 times faster than transferring the files individually. The benefits of using a zip file increase significantly as the number of files grows. With 5,000 files to transfer, using a zip file reduces the transfer time by a factor of over 370: it took 27 minutes and 34 seconds to copy the 5,000 images directly, but only about 3 seconds to copy the zip file and 1.4 seconds to unzip it.