Automating file downloads can save a lot of time. There are several ways to automate file downloads in Python. The easiest is a simple Python loop that iterates through a list of URLs to download. This serial approach can work well with a few small files, but if you are downloading many files or large files, you’ll want a parallel approach that downloads several files at once instead of waiting on each one in turn.
With a parallel file download routine, you can download multiple files simultaneously and save a considerable amount of time. This tutorial demonstrates how to develop a generic file download function in Python and apply it to download multiple files with serial and parallel approaches. The code uses the third-party requests package together with modules from the Python standard library, so requests is the only installation required.
For this example, we only need the requests and multiprocessing Python modules to download files in parallel. The multiprocessing module is available from the Python standard library, but requests is a third-party package, so you’ll need to install it (for example, with pip install requests) if you don’t already have it.
We’ll also import the time module to keep track of how long it takes to download individual files and compare performance between the serial and parallel download routines. The time module is also part of the Python standard library.
import requests
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool
Define URLs and filenames
I’ll demonstrate parallel file downloads in Python using gridMET NetCDF files that contain daily precipitation data for the United States.
Here, I specify the URLs to four files in a list. In other applications, you may programmatically generate a list of files to download.
urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1981.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1982.nc']
Each URL must be associated with its download location. Here, I’m downloading the files to the Windows ‘Downloads’ directory. I’ve hardcoded the filenames in a list for simplicity and transparency. Depending on your application, you may want to write code that parses the input URL and saves the file to a specific directory.
fns = [r'C:\Users\konrad\Downloads\pr_1979.nc',
       r'C:\Users\konrad\Downloads\pr_1980.nc',
       r'C:\Users\konrad\Downloads\pr_1981.nc',
       r'C:\Users\konrad\Downloads\pr_1982.nc']
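If you would rather not hardcode the filenames, the download paths can be derived from the URLs. The following is a minimal sketch, assuming each URL ends in the filename you want to keep. The helper name `url_to_path` and the `downloads` output directory are hypothetical and not part of the original code.

```python
import os
from urllib.parse import urlparse


def url_to_path(url, out_dir):
    """Build a local path from the last segment of a URL (hypothetical helper)."""
    # urlparse(...).path drops the scheme, host, and any query string
    name = os.path.basename(urlparse(url).path)
    return os.path.join(out_dir, name)


urls = ['https://www.northwestknowledge.net/metdata/data/pr_1979.nc',
        'https://www.northwestknowledge.net/metdata/data/pr_1980.nc']

# 'downloads' is an assumed output directory; replace it with your own
fns = [url_to_path(url, 'downloads') for url in urls]
print(fns)
```

This only builds the path strings. It does not create the output directory, so make sure the directory exists before downloading.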
The ThreadPool mapping method we’ll use passes a single argument to the function for each item it processes (there are some workarounds, but we won’t get into that here). To download a file we’ll need to pass two arguments, a URL and a filename. So we’ll zip the urls and fns lists together to get a list of tuples. Each tuple in the list will contain two elements: a URL and the download filename for the URL. This way we can pass a single argument (the tuple) that contains two pieces of information.
# list() makes the pairs reusable for both the serial and the parallel run
inputs = list(zip(urls, fns))
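One detail worth knowing: in Python 3, zip returns a one-shot iterator rather than a list, so it is empty after the first pass. Because this tutorial iterates over the inputs twice (once in serial and once in parallel), wrapping the call in list() keeps the pairs reusable. A small sketch with made-up placeholder URLs and filenames:

```python
urls = ['https://example.com/a.nc', 'https://example.com/b.nc']  # placeholder URLs
fns = ['a.nc', 'b.nc']  # placeholder filenames

# A bare zip object can only be iterated once
pairs = zip(urls, fns)
first_pass = [p for p in pairs]
second_pass = [p for p in pairs]  # nothing is left to iterate over

# Materializing the pairs as a list allows repeated passes
inputs = list(zip(urls, fns))

print(len(first_pass), len(second_pass), len(inputs))
```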
Function to download a URL
Now that we have specified the URLs to download and their associated filenames, we need a function to download the URLs (download_url).
We’ll pass one argument (args) to download_url. This argument will be an iterable (list or tuple) where the first element is the URL to download (url) and the second element is the filename (fn). The elements are assigned to variables (url and fn) for readability.
Now create a try statement in which the URL is retrieved and written to the file after it is created. When the file is written, the URL and download time are returned. If an exception occurs, a message is printed.
def download_url(args):
    t0 = time.time()
    # Unpack the URL and the output filename from the single argument
    url, fn = args[0], args[1]
    try:
        # Retrieve the URL and write the response body to the file
        r = requests.get(url)
        with open(fn, 'wb') as f:
            f.write(r.content)
        return url, time.time() - t0
    except Exception as e:
        print('Exception in download_url():', e)
The download_url function is the meat of our code. It does the actual work of downloading and file creation. We can now use this function to download files in serial (using a loop) and in parallel. Let’s go through those examples.
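Before moving on, one note for very large files: r.content holds the whole response in memory before it is written. A possible alternative is to copy the response to disk in chunks. The sketch below is an illustration only and swaps in urllib.request from the standard library in place of requests, so that it runs without extra installs. The function name `download_url_streamed` is hypothetical.

```python
import shutil
import time
import urllib.request


def download_url_streamed(args):
    """Stream a URL to disk in chunks (hypothetical variant of download_url)."""
    t0 = time.time()
    url, fn = args[0], args[1]
    try:
        # urlopen returns a file-like response; copyfileobj copies it in chunks
        with urllib.request.urlopen(url) as response, open(fn, 'wb') as f:
            shutil.copyfileobj(response, f)
        return url, time.time() - t0
    except Exception as e:
        print('Exception in download_url_streamed():', e)
```

It takes the same (url, filename) tuple and returns the same (url, seconds) result, so it could stand in for download_url in the loops that follow.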
Download multiple files with a Python loop
To download the list of URLs to the associated files, loop through the iterable (inputs) that we created, passing each element to download_url. After each download is complete we will print the downloaded URL and the time it took to download.
The total time to download all URLs will print after all downloads have been completed.
t0 = time.time()
for i in inputs:
    result = download_url(i)
    # result is a (url, seconds) tuple
    print('url:', result[0], 'time:', result[1])
print('Total time:', time.time() - t0)
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time: 16.381176710128784
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time: 11.475878953933716
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time: 13.059367179870605
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time: 12.232381582260132
Total time: 53.15849542617798
It took between 11 and 16 seconds to download the individual files. The total download time was a little less than one minute. Your download times will vary based on your specific network connection.
Let’s compare this serial (loop) approach to the parallel approach below.
Download multiple files in parallel with Python
To start, create a function (download_parallel) to handle the parallel download. The function will take one argument, an iterable containing URLs and associated filenames (the inputs variable we created earlier).
Next, get the number of CPUs available for processing. This will determine the number of threads to run in parallel.
Now use the ThreadPool to map the inputs to the download_url function. Here we use the imap_unordered method of ThreadPool and pass it the download_url function and the input arguments to download_url (the inputs variable). The imap_unordered method will run download_url simultaneously for the number of specified threads (i.e., parallel download) and yields each result as soon as its download finishes, so the results may not come back in the same order as the inputs.
Thus, if we have four files and four threads, all files can be downloaded at the same time instead of waiting for one download to finish before the next starts. This can save a considerable amount of processing time.
In the final part of the download_parallel function, the downloaded URLs and the time required to download each URL are printed.
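def download_parallel(args):
    cpus = cpu_count()
    # Leave one CPU free, but always use at least one thread
    with ThreadPool(max(cpus - 1, 1)) as pool:
        results = pool.imap_unordered(download_url, args)
        for result in results:
            # result is a (url, seconds) tuple
            print('url:', result[0], 'time (s):', result[1])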
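To see how imap_unordered behaves without touching the network, here is a minimal sketch that replaces the download with a short sleep. The `fake_download` function and the task labels are hypothetical stand-ins, and the thread count of 3 is chosen only to match the number of placeholder tasks.

```python
import time
from multiprocessing.pool import ThreadPool


def fake_download(args):
    """Stand-in for download_url: sleeps instead of fetching (hypothetical)."""
    t0 = time.time()
    name, seconds = args[0], args[1]
    time.sleep(seconds)  # simulate waiting on the network
    return name, time.time() - t0


# Placeholder tasks: (label, simulated download time in seconds)
tasks = [('a', 0.6), ('b', 0.2), ('c', 0.4)]

t0 = time.time()
with ThreadPool(3) as pool:
    # Results are yielded as each task finishes, not in input order
    finished = [name for name, _ in pool.imap_unordered(fake_download, tasks)]
total = time.time() - t0

print('finish order:', finished)
print('total time (s):', round(total, 2))
```

With one thread per task, the shortest task should be reported first, and the total time should be close to the longest single sleep instead of the sum of all three.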
Once inputs and download_parallel are defined, the files can be downloaded in parallel with a single line of code.
download_parallel(inputs)
url: https://www.northwestknowledge.net/metdata/data/pr_1980.nc time (s): 14.641696214675903
url: https://www.northwestknowledge.net/metdata/data/pr_1981.nc time (s): 14.789752960205078
url: https://www.northwestknowledge.net/metdata/data/pr_1979.nc time (s): 15.052601337432861
url: https://www.northwestknowledge.net/metdata/data/pr_1982.nc time (s): 23.287317752838135
Total time: 23.32273244857788
Notice that it took longer to download each individual file with the parallel approach. This may be a result of changing network speed, or overhead required to map the downloads to their respective threads. Even though the individual files took longer to download, the parallel method reduced the total download time by about 56% (from roughly 53 seconds to roughly 23 seconds).
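The size of the reduction can be checked directly from the two totals printed above. This is a small worked calculation; the variable names are only for illustration.

```python
serial_total = 53.15849542617798    # total time from the serial run above
parallel_total = 23.32273244857788  # total time from the parallel run above

# Percent decrease in total time, and the equivalent speedup factor
decrease = (serial_total - parallel_total) / serial_total * 100
speedup = serial_total / parallel_total

print(f'decrease: {decrease:.1f}%')
print(f'speedup: {speedup:.2f}x')
```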
You can see how parallel processing can greatly reduce processing time for multiple files. As the number of files increases, you will save much more time by using a parallel download approach.
Automating file downloads in your development and analysis routines can save you a lot of time. As demonstrated by this tutorial, implementing a parallel download routine can greatly decrease file acquisition time if you require many files or large files.