Download Multiple Files (or URLs) in Parallel with Python
40K views
May 13, 2022
Learn how to use Python to download multiple files (or URLs) in parallel. Downloading files in parallel is very easy with Python. This tutorial will demonstrate how to speed up your processing time when you have a lot of files to download. Code available: https://opensourceoptions.com/blog/use-python-to-download-multiple-files-or-urls-in-parallel/ Sign up for email notifications (https://opensourceoptions.com/subscribe) and get $5 off any course at https://opensourceoptions.com/course-list
0:00
Welcome to Open Source Options. In this tutorial, I'm going to show you how you can use Python to download multiple files in parallel
0:10
And I just want to point out, all the code for this tutorial will be up on the website, opensourceoptions.com, and there's a link in the description
0:19
So let's go ahead and get started. I'm using a Jupyter notebook inside Visual Studio Code for this tutorial
0:26
We're going to start out by just doing some imports. So we're going to import time, just to time things
0:35
We're going to import requests, and this is going to give us the functionality to download URLs
0:43
And then, from multiprocessing, we're going to import cpu_count
0:52
and from multiprocessing, we're going to import ThreadPool. I need to type import
1:01
import ThreadPool. I'm going to double-check that with auto-complete, but I'm pretty sure that's correct
1:10
I made one slight mistake: it's from multiprocessing.pool import, and this should auto-complete, and we'll just double-check it
1:18
ThreadPool, there we go. Okay, so we've got time for timing
1:23
requests for downloading URLs, multiprocessing to give us the number of CPU cores we have, and a ThreadPool to parallelize things with
1:34
Now, time and multiprocessing are part of the Python standard library. requests is a third-party package, but it comes with Anaconda, so you shouldn't have to do any installs here
1:40
I'm going to hit Shift Enter to run this cell, and I'm going to run with a base Anaconda environment
1:47
All right, so that's going to take just a sec to run and get those imports, and now we can start writing some actual code
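For reference, the import cell described above looks like this:

```python
# time: for timing the downloads (standard library)
import time

# requests: third-party HTTP library used to fetch the URLs
import requests

# cpu_count: reports how many CPU cores the machine has (standard library)
from multiprocessing import cpu_count

# ThreadPool: a thread-based worker pool, well suited to I/O-bound downloads
from multiprocessing.pool import ThreadPool
```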
1:53
So the next thing I'm going to do is define some URLs to download
1:58
I'm just going to paste these over. These are for GridMet data, which provide climate and weather data
2:05
and so we're looking at precipitation for four different years here. These files are on the order of tens of megabytes in size
2:13
Maybe it's hundreds, I don't remember. They're not small files, but they're not ginormous files either
2:18
And then I'm going to get another list here, I'm going to copy it over. It's going to be the file names to download these files to
2:23
So let's paste those file names over here. And I'm just downloading these to my downloads directory
2:32
So you can see here that I have a list, urls, of the URLs and a list, fns, of the file names
2:40
And these are what I want to download. So let's just go ahead here and hit Shift Enter
2:46
And now we have those defined. Now, to make things a little easier to work with, I want to combine the URLs and
2:53
file names into a single iterable. And what I mean by that is, that will give me
3:00
just one variable, one iterable that I can pass into a function, and that'll make things simpler
3:05
when we parallelize. And let me show you what I mean here. So let's make an inputs variable
3:10
and we're going to say that it is equal to zip, and we're going to zip together the URLs
3:16
and the file names. Okay, I'm going to double check my code here, make sure I have it right
3:21
and then we'll run it. Okay, this should be correct. Before I run it, though, let's just do: for i in inputs, print(i)
3:35
and so we'll run this and then we'll print it out so you can see what it looks like. So if I hit Shift Enter, you can see what I get here is an iterable of tuples. So I have a tuple, and in that tuple I have the URL and the file associated with that URL
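The pattern being described can be sketched like this. The URLs and file names below are hypothetical placeholders; the video uses four GridMet precipitation files and paths in the Downloads directory:

```python
# Hypothetical placeholder lists -- substitute the real GridMet URLs
# and your own output paths.
urls = [
    "https://example.com/pr_1979.nc",
    "https://example.com/pr_1980.nc",
]
fns = [
    "pr_1979.nc",
    "pr_1980.nc",
]

# zip pairs each URL with its file name, giving one iterable of tuples
inputs = zip(urls, fns)
for i in inputs:
    print(i)  # each i is a (url, filename) tuple
```

Note that a zip object is a one-shot iterator: after looping over it once, it is exhausted and must be rebuilt before it can be used again.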
3:54
And those are all contained now in this inputs variable. So now that we know what our input data look like, we can start to build some functions
4:03
to download these URLs to the associated file names. So let's get started on that next
4:09
So what I want to do is I'm going to make a function called download_url
4:15
And I'm going to be able to use this function to download any URL to a file name
4:20
I'm going to input args, arguments. Okay. Now, my arguments are going to be url and fn, and they're going to equal args[0] and args[1]
4:35
Because what I'm going to do is I'm going to pass in this iterable
4:39
with a tuple. So that one tuple is going to be an input
4:43
It has two elements. The first one is the URL. The second one is the file name
4:47
Okay. And then I'm going to put in a try accept statement here
4:51
So we're going to try: r = requests.get(url). So we're going to try to download that
5:01
And if we don't, we're going to type in except Exception as e
5:09
and we're going to print out problem downloading and then we'll print out the error message we get
5:19
Okay, this isn't the full function; we're going to go back and fill some things in here so that it's a little easier to time and understand. So we're going to give it a t0 start time, which is going to be time.time(). All right, so that will start our timer, and then at the end we're going to get the time again
5:39
And so we're going to come down here, and we're going to print the URL, and then we're going to print the time
5:56
And this is going to be in seconds. So then we're going to print time.time() minus
6:09
t0, and that will tell us how long it took to download each specific URL
6:18
Now I forgot one important thing here, and we have to actually create the file and write to it
6:24
And so we're going to do with open, and we're going to give it the file name
6:30
We're going to give it "wb", which is write binary, as f
6:35
and we're going to do f.write(r.content), so it's going to write the content from the
6:44
URL. And what I actually want to do here is, instead of printing this, I want to return
6:51
So we're going to return the URL and the time, so we'll know the URL and how long it took
7:00
and we can print that out when we get to the other function. So let me just double-check my
7:05
notes, make sure I have this function correct, and then we'll move on from here
7:10
Okay, so we should be good with this function. Let's go ahead and run this cell
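Putting the steps just described together, the download function can be sketched like this (a sketch of what is narrated above, not necessarily identical to the code on the website):

```python
import time
import requests

def download_url(args):
    """Download one URL to a file; args is a (url, filename) tuple."""
    t0 = time.time()                      # start the timer
    url, fn = args[0], args[1]
    try:
        r = requests.get(url)             # fetch the URL
        with open(fn, "wb") as f:         # "wb": write the response bytes to disk
            f.write(r.content)
        # return which URL was downloaded and how long it took
        return (url, time.time() - t0)
    except Exception as e:
        print("problem downloading", url)
        print(e)
```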
7:15
So our function is now defined, and let's try to download all of these URLs using our download_url function
7:22
Now this is not parallel yet, but we can use this function to parallelize it later
7:27
And by doing this we'll get an idea of how long it takes. So let's do: for i in inputs, and let's do result = download_url
7:41
We'll pass in i, and then let's print the URL, result[0], because we're returning a tuple as our result
7:52
And then let's also print the time in seconds, result[1]
8:06
And then let's come up here and let's add another t0 variable
8:11
t0 = time.time(). And then let's come down here and print the total download time
8:23
This is going to be in seconds again; this will be, once again
8:30
time.time() minus t0, and so we'll print that out. So this is going to loop through all our input files, it's going to download each one
8:41
and then we're going to print out how long it took to download that specific file, and when we're finished we're going to print out the total time it took. So let's go ahead and do this. This is not in parallel, it's in serial. I'm going to click run and pause the video while it runs, because it's going to take probably about a minute. So let's hit Shift Enter to run that... and we get an error
9:00
Let's see, total time. Oh, it's because I need to put parentheses on this function call
9:10
Let's try that again. Alright, here we go, and something happened. That..
9:22
It didn't run. Let me just check this real quick. All right, let's make one quick change to our code
9:33
Let's get rid of this line where we loop through the inputs because we don't actually need that
9:39
And then come up here and let's restart our kernel and clear the outputs of all cells
9:46
And then let's go ahead and just run through our code again. We'll do the imports, get the URLs, we'll define the inputs, we'll define the function
9:54
and now we'll loop through all the inputs to download. So let's go ahead and hit Shift Enter. And you can see here that now we've printed
10:04
out. I added this one line just to print which file is being downloaded, just to see that it's
10:09
working. And you can see that we're starting to download here. And this should take just a few seconds
10:14
to download the first file. Then I'll pause the video while it downloads the rest of the files
10:19
so that you don't have to wait for that. So you can see we downloaded our first URL
10:24
It took about 20 seconds. I'm now going to pause the video while the downloads finish, and we can see what the total download time is
10:32
Okay, so the file download finished, you can see that it took just under two minutes to complete this in a loop
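The serial loop just demonstrated has this shape. To keep the sketch runnable without a network connection, download_url here is a simulated stand-in for the real function defined earlier:

```python
import time

# Simulated stand-in for the real download_url, which calls requests.get
# and writes the file; this one just pretends each download takes a moment.
def download_url(args):
    url, fn = args
    t0 = time.time()
    time.sleep(0.05)                      # pretend to download
    return (url, time.time() - t0)

# Hypothetical placeholder lists
urls = ["https://example.com/pr_1979.nc", "https://example.com/pr_1980.nc"]
fns = ["pr_1979.nc", "pr_1980.nc"]

t0 = time.time()
for i in zip(urls, fns):
    result = download_url(i)              # note the parentheses: call the function
    print("url:", result[0])
    print("time (s):", result[1])
total = time.time() - t0
print("Total download time (s):", total)
```

Because each file is downloaded one after another, the total time is the sum of the individual download times.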
10:40
Now, instead of performing a traditional loop, we're going to write another function that will use our download URL function and apply it in parallel
10:49
so we can download all four of these files simultaneously, and it will greatly decrease the total download time
11:00
So let's go ahead and do that now. So def: we're going to call this download_parallel
11:05
It's going to take the same arguments as download_url. Now let's go ahead and get the CPU count, and so we're going to say cpus = cpu_count(), which we call, and then we need to set up a
11:29
ThreadPool. And this ThreadPool, we're going to tell it how many threads to run, which we're
11:33
going to derive from cpu_count, and then we are going to tell it which function to parallelize
11:41
and which arguments to apply to that function and keep in mind that
11:48
download_url, when we call it inside download_parallel, gives us this return type
11:53
which is going to be a URL and a time. And so we want to do results =
12:02
ThreadPool. We're going to do cpus minus one, just to make
12:10
sure that we have enough computation left for other things if we're running other
12:14
things. And then we're going to use imap_unordered. This means the order that the arguments are run in does not matter
12:24
And then we're going to give it the function, which is download_url, and we're going to give it the arguments
12:33
And now we can loop through the results and print them out, so we can do: for result in results
12:42
And this is just going to tell us each URL and how long it took to download
12:48
So let's print the URL, result[0]. And then let's print the time in seconds, result[1]
13:07
Okay. And that should be the function we need. So let me just go ahead and make sure I have this right
13:14
I'll check my notes and we'll come back and get this thing running. All right. Let's hit shift enter to run this cell
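The parallel function just described can be sketched as follows. Again, download_url is a simulated stand-in here so the cell runs offline; the max(1, ...) guard is an addition to keep the pool valid on a single-core machine:

```python
import time
from multiprocessing import cpu_count
from multiprocessing.pool import ThreadPool

# Simulated stand-in for the real download_url
def download_url(args):
    url, fn = args
    t0 = time.time()
    time.sleep(0.05)                      # pretend to download
    return (url, time.time() - t0)

def download_parallel(args):
    cpus = cpu_count()
    # cpus - 1 threads, leaving capacity for other work;
    # imap_unordered yields results in completion order, not submission order
    results = ThreadPool(max(1, cpus - 1)).imap_unordered(download_url, args)
    for result in results:
        print("url:", result[0])
        print("time (s):", result[1])

download_parallel(zip(["https://example.com/a.nc", "https://example.com/b.nc"],
                      ["a.nc", "b.nc"]))
```

A thread pool (rather than a process pool) is a good fit here because downloading is I/O-bound: the threads spend most of their time waiting on the network, so the GIL is not a bottleneck.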
13:20
Make sure there are no errors there. And now let's go ahead and run this. We're going to time it again
13:24
So let's do t0 = time.time()
13:33
And then I'm going to do inputs = zip(urls, fns), just to make sure that's a clean inputs iterable
13:43
and then we're going to just run download_parallel with inputs
13:52
and then we'll print the total download time in seconds, which is time.time()
14:04
minus t0. All right, and that should be everything we need to run this in parallel
14:11
So let me hit Shift Enter and we'll see how much faster this runs
14:16
All right, I'm going to pause the video while this finishes running and then we'll take a look at the output
14:22
Okay, the files finished downloading and you'll notice that it only took 33 seconds
14:27
So it was almost four times faster than our other method, which took 113 seconds
14:33
Notice if we add up all these numbers, the download time of each individual file here, we get pretty close to 113
14:40
but if we add up the individual download times here for the parallel version, we get a number much higher than 33
14:46
In fact, 33 is the time it took for the longest file to download, so you can see that these were all downloading in parallel
14:54
and it printed them out in the order in which they completed. So there you have it: how you can download files in parallel
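The effect described above, where wall time tracks the slowest download rather than the sum of all downloads, can be demonstrated with a small simulation using sleeps in place of real downloads:

```python
import time
from multiprocessing.pool import ThreadPool

def fake_download(seconds):
    time.sleep(seconds)   # stand-in for a download taking this many seconds
    return seconds

t0 = time.time()
# Four "downloads" run at once, so wall time is close to the slowest
# task (~0.4 s), not the serial total (1.0 s).
done = list(ThreadPool(4).imap_unordered(fake_download, [0.1, 0.2, 0.3, 0.4]))
elapsed = time.time() - t0
print("individual times:", sorted(done), "wall time:", elapsed)
```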
15:01
I hope you found this useful. It's something I use all the time when I need to download a large number of files
15:06
I hope this can help you out in your programming work
#Computer Education
#Computer Science
#Open Source