This is the first part of a series explaining how to write a
simple C# application to download every book cover from O'Reilly's
website.
O'Reilly's website contains a complete
list of every book still in
print. Basically, all we have to do is write a simple parser to pull out
catalog IDs and then download the appropriate image for each one. I know
C# is not the best language for text parsing, but I know it well, so I
wrote my tool in C#. The concepts in this tutorial, however, can easily
be extended to your own favorite language.
If you go to the complete list, you'll see it's separated into four
pages: "A-D", "E-J", "K-P", and "Q-Z". The first thing you'll want to do
is download the source of all of those pages somewhere on your hard drive.
download the source of all of those pages somewhere on your hard drive.
I put mine here:
C:\Downloads\complete.html
C:\Downloads\complete2.html
C:\Downloads\complete3.html
C:\Downloads\complete4.html
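If you'd rather not save the pages by hand, a few lines with WebClient can fetch them for you. This is just a sketch: the list-page URLs below are placeholders I made up, since the real addresses aren't shown here, so substitute the actual URLs of the four pages.

```csharp
using System;
using System.Net;

class PageFetcher
{
    //the same local paths used throughout this article
    public static string[] LocalPaths()
    {
        return new string[]
        {
            @"C:\Downloads\complete.html",
            @"C:\Downloads\complete2.html",
            @"C:\Downloads\complete3.html",
            @"C:\Downloads\complete4.html"
        };
    }

    static void Main()
    {
        //placeholder addresses -- substitute the real URLs of the four
        //"complete list" pages from O'Reilly's site
        string[] urls =
        {
            "http://oreilly.com/store/complete.html",
            "http://oreilly.com/store/complete2.html",
            "http://oreilly.com/store/complete3.html",
            "http://oreilly.com/store/complete4.html"
        };

        string[] paths = LocalPaths();
        using (WebClient client = new WebClient())
        {
            for (int i = 0; i < urls.Length; i++)
            {
                //save each page straight to disk
                client.DownloadFile(urls[i], paths[i]);
                Console.WriteLine("Saved " + paths[i]);
            }
        }
    }
}
```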
Now that we have some HTML to parse, we need to take a look at it to
figure out where the catalog IDs are. Every link in the complete list
embeds the ID. Here's a snippet containing one of those links from
O'Reilly's website.
<a class="tt" id="0596007574" href="http://oreilly.com/catalog/9780596007577">.NET Compact Framework Pocket Guide
</a>
</td>
<td valign="top" nowrap="nowrap">
May 2004</td>
<td valign="top" align="right">
$9.95
</td>
Looking at this code, what we really care about are the numbers after
"/catalog/". Now we know what to look for. Here's a simple parse
function that will extract all ids based on that fact.
private void _btnParse_Click(object sender, RoutedEventArgs e)
{
    //set up the paths to the downloaded HTML files
    string[] files = new string[]
    {
        @"C:\Downloads\complete.html",
        @"C:\Downloads\complete2.html",
        @"C:\Downloads\complete3.html",
        @"C:\Downloads\complete4.html"
    };

    //create a list to hold all of the found ids
    List<ulong> ids = new List<ulong>();

    //loop through each downloaded file
    foreach (string file in files)
    {
        using (StreamReader reader =
            new StreamReader(new FileStream(file, FileMode.Open)))
        {
            //get all of the file's contents
            string contents = reader.ReadToEnd();
            int index = 0;
            int catStart;
            int catEnd;
            ulong id;
            do
            {
                index = contents.IndexOf("/catalog/", index);
                if (index == -1)
                    break;

                //ids will end with either a quote or a slash:
                //  ".../catalog/123456789" or
                //  ".../catalog/123456789/..."
                catStart = index + 9;
                catEnd = contents.IndexOfAny(new char[] { '\"', '/' }, catStart);
                if (catEnd == -1)
                    break;

                //use TryParse to make sure this is actually a number;
                //some links are ".../catalog/somefile.pdf"
                if (ulong.TryParse(contents.Substring(catStart,
                    catEnd - catStart), out id))
                {
                    //ids are duplicated, so only add each one once
                    if (!ids.Contains(id))
                        ids.Add(id);
                }

                //move index past this occurrence of "/catalog/"
                index += 9;
            }
            while (index != -1);
        }
    }

    //show all the ids
    _lbResults.ItemsSource = ids;
    //download all the images
    DownloadBookCovers(ids);
}
There's a lot of code here, but it's really simple. At its core, it's
looking for every occurrence of "/catalog/", then pulling whatever comes
after that but before a quote (") or slash (/). This is because links
containing ids will be in one of two formats:
".../catalog/123456789"
".../catalog/123456789/..."
Once it finds something, I use
ulong.TryParse
since sometimes what it
pulls out is a filename instead of an id. My display contains a ListBox,
which I populate with the results just so I can see them.
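If you'd rather not walk the string by hand, the same extraction can be done with a regular expression. This is just an alternative sketch, not the code used in this article; the `\d+` group rejects non-numeric matches like ".../catalog/somefile.pdf" automatically, so no TryParse is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class IdExtractor
{
    //pull every numeric id that follows "/catalog/" and ends at a
    //quote or a slash, skipping duplicates -- the same rules as the
    //hand-rolled loop
    public static List<ulong> ExtractIds(string contents)
    {
        List<ulong> ids = new List<ulong>();
        foreach (Match m in Regex.Matches(contents, @"/catalog/(\d+)[""/]"))
        {
            ulong id = ulong.Parse(m.Groups[1].Value);
            if (!ids.Contains(id))
                ids.Add(id);
        }
        return ids;
    }
}
```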
All right, now we're on to actually downloading files. Fortunately,
O'Reilly keeps every book cover in the same location:
http://oreilly.com/catalog/covers/[id]_lrg.jpg
Since we've already extracted every id, all that's left to do is loop
through them and make a simple web request for the file. Here's a
function that does that.
private void DownloadBookCovers(List<ulong> ids)
{
    for (int i = 0; i < ids.Count; i++)
    {
        ulong id = ids[i];

        //print some status information
        Console.WriteLine("Getting image " + (i + 1) + " of " + ids.Count);

        string requestUri = "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";
        WebRequest request = WebRequest.Create(requestUri);
        WebResponse response;
        try
        {
            response = request.GetResponse();
        }
        catch (WebException)
        {
            //the file probably didn't exist; just skip to the next one
            continue;
        }

        //dispose of the response and image once the file is saved
        using (response)
        using (System.Drawing.Image image =
            System.Drawing.Image.FromStream(response.GetResponseStream()))
        {
            image.Save(@"C:\Downloads\OReillyCovers\" + id + ".jpg");
        }
    }
}
Again, this is all pretty straightforward. I create a WebRequest object
for the URL of the image I want to download and then call
GetResponse
to perform the request. If the file doesn't exist, GetResponse throws a
WebException. When I ran the downloader, a little over half of the ids
actually had covers (around 800 in total). Lastly, I create an Image
object from the response stream and save it to my desired location. Be
warned: this function will take a few minutes to run.
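Since the request URL is just string concatenation, pulling it into a small helper makes it easy to test. As a sketch of an alternative, WebClient.DownloadFile can also save the bytes straight to disk in place of the WebRequest/Image pair, which skips the JPEG decode-and-re-encode that Image.Save performs (the helper names here are my own, not from the article's code):

```csharp
using System;
using System.IO;
using System.Net;

class CoverDownloader
{
    //build the cover URL for a catalog id, following the pattern
    //http://oreilly.com/catalog/covers/[id]_lrg.jpg
    public static string BuildCoverUrl(ulong id)
    {
        return "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";
    }

    //alternative to the WebRequest/Image approach: write the raw
    //bytes to disk without re-encoding the image
    public static void Download(ulong id, string folder)
    {
        using (WebClient client = new WebClient())
        {
            client.DownloadFile(BuildCoverUrl(id),
                Path.Combine(folder, id + ".jpg"));
        }
    }
}
```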
That's it! Once the program runs, you'll be the proud holder of hundreds
of O'Reilly book covers.
I wrapped up my code in a little WPF app just so I had a button to press
and a way to view the extracted ids. Here's the XAML for that.
<Window x:Class="OReillyImageDownloader.Window1"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="Window1" Height="522" Width="796">
    <Grid>
        <Grid.RowDefinitions>
            <RowDefinition Height="*" />
            <RowDefinition Height="Auto" />
        </Grid.RowDefinitions>
        <ListBox x:Name="_lbResults" />
        <Button x:Name="_btnParse" Content="Parse HTML"
            Grid.Row="1" Click="_btnParse_Click" />
    </Grid>
</Window>
Just in case you don't have Visual Studio but still want
to run the program, I've also attached the executable. To run it, you'll
have to make sure you've placed the downloaded HTML files in the same
place I have and that the folder "C:\Downloads\OReillyCovers" exists.
I also added a progress bar so you know something's happening while it's
downloading.
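If you'd rather not create the output folder by hand, one call before the download loop takes care of it; Directory.CreateDirectory is a no-op when the folder already exists. A minimal sketch (the helper name is my own):

```csharp
using System.IO;

class FolderSetup
{
    //make sure the output folder exists before saving any covers;
    //Directory.CreateDirectory does nothing if it's already there
    public static string EnsureOutputFolder(string path)
    {
        Directory.CreateDirectory(path);
        return path;
    }
}
```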
Source Files: