This is the first part of a series explaining how to write a
simple C# application to download every book cover from O'Reilly's
website.
O'Reilly's website contains a complete
list of every book still in
print. Basically, all we have to do is write a simple parser to pull out
catalog IDs and then download the appropriate image for each one. I know
C# is not the best language for text parsing, but I know it well, so I
wrote my tool in C#. The concepts in this tutorial, however, can easily
be extended to your own favorite language.
If you go to the complete list, you'll see it's separated into four
pages: "A-D", "E-J", "K-P", and "Q-Z". The first thing you'll want to do
is download the source of all of those pages somewhere on your hard drive.
download the source of all of those pages somewhere on your hard drive.
I put mine here:
C:\Downloads\complete.html
C:\Downloads\complete2.html
C:\Downloads\complete3.html
C:\Downloads\complete4.html
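If you'd rather not save the pages by hand, a few lines with WebClient can fetch them for you. This is just a sketch: the list-page URLs below are placeholders I made up, since the real addresses aren't shown here, so substitute the actual URLs of the four pages.

```csharp
using System;
using System.Net;

class PageFetcher
{
    //the same local paths used throughout this article
    public static string[] LocalPaths()
    {
        return new string[]
        {
            @"C:\Downloads\complete.html",
            @"C:\Downloads\complete2.html",
            @"C:\Downloads\complete3.html",
            @"C:\Downloads\complete4.html"
        };
    }

    static void Main()
    {
        //placeholder addresses -- substitute the real URLs of the four
        //"complete list" pages from O'Reilly's site
        string[] urls =
        {
            "http://oreilly.com/store/complete.html",
            "http://oreilly.com/store/complete2.html",
            "http://oreilly.com/store/complete3.html",
            "http://oreilly.com/store/complete4.html"
        };

        string[] paths = LocalPaths();
        using (WebClient client = new WebClient())
        {
            for (int i = 0; i < urls.Length; i++)
            {
                //save each page straight to disk
                client.DownloadFile(urls[i], paths[i]);
                Console.WriteLine("Saved " + paths[i]);
            }
        }
    }
}
```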
Now that we have some HTML to parse, we need to take a look at it to
figure out where the catalog IDs are. Every link in the complete list
embeds the ID. Here's a snippet containing one of those links from
O'Reilly's website.
<a class="tt" id="0596007574" href="http://oreilly.com/catalog/9780596007577">.NET Compact Framework Pocket Guide
</a>
</td>
<td valign="top" nowrap="nowrap">
May 2004</td>
<td valign="top" align="right">
$9.95
</td>
Looking at this code, what we really care about are the numbers after
"/catalog/". Now we know what to look for. Here's a simple parse
function that will extract all ids based on that fact.
private void _btnParse_Click(object sender, RoutedEventArgs e)
{
    //set up the paths to the downloaded HTML files
    string[] files = new string[]
    {
        @"C:\Downloads\complete.html",
        @"C:\Downloads\complete2.html",
        @"C:\Downloads\complete3.html",
        @"C:\Downloads\complete4.html"
    };

    //create a list to hold all of the found ids
    List<ulong> ids = new List<ulong>();

    //loop through each downloaded file
    foreach (string file in files)
    {
        using (StreamReader reader =
            new StreamReader(new FileStream(file, FileMode.Open)))
        {
            //get all of the file's contents
            string contents = reader.ReadToEnd();
            int index = 0;
            int catStart;
            int catEnd;
            ulong id;
            do
            {
                index = contents.IndexOf("/catalog/", index);
                if (index == -1)
                    break;

                //ids will end with either a quote or a slash:
                //  ".../catalog/123456789" or
                //  ".../catalog/123456789/..."
                catStart = index + 9;
                catEnd = contents.IndexOfAny(new char[] { '\"', '/' }, catStart);
                if (catEnd == -1)
                    break;

                //use TryParse to make sure this is actually a number;
                //some links are ".../catalog/somefile.pdf"
                if (ulong.TryParse(contents.Substring(catStart,
                    catEnd - catStart), out id))
                {
                    //ids are duplicated, so only add each one once
                    if (!ids.Contains(id))
                        ids.Add(id);
                }

                //move index past this occurrence of "/catalog/"
                index += 9;
            }
            while (index != -1);
        }
    }

    //show all the ids
    _lbResults.ItemsSource = ids;
    //download all the images
    DownloadBookCovers(ids);
}
There's a lot of code here, but it's really simple. At its core, it's
looking for every occurrence of "/catalog/", then pulling whatever comes
after that but before a quote (") or slash (/). This is because links
containing ids will be in one of two formats:
".../catalog/123456789"
".../catalog/123456789/..."
Once it finds something, I use
ulong.TryParse
since sometimes what it
pulls out is a filename instead of an id. My display contains a ListBox,
which I populate with the results just so I can see them.
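If you'd rather not walk the string by hand, the same extraction can be done with a regular expression. This is just an alternative sketch, not the code used in this article; the `\d+` group rejects non-numeric matches like ".../catalog/somefile.pdf" automatically, so no TryParse is needed.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

class IdExtractor
{
    //pull every numeric id that follows "/catalog/" and ends at a
    //quote or a slash, skipping duplicates -- the same rules as the
    //hand-rolled loop
    public static List<ulong> ExtractIds(string contents)
    {
        List<ulong> ids = new List<ulong>();
        foreach (Match m in Regex.Matches(contents, @"/catalog/(\d+)[""/]"))
        {
            ulong id = ulong.Parse(m.Groups[1].Value);
            if (!ids.Contains(id))
                ids.Add(id);
        }
        return ids;
    }
}
```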
All right, now we're on to actually downloading files. Fortunately,
O'Reilly keeps every book cover in the same location:
http://oreilly.com/catalog/covers/[id]_lrg.jpg
Since we've already extracted every id, all that's left to do is loop
through them and make a simple web request for the file. Here's a
function that does that.
private void DownloadBookCovers(List<ulong> ids)
{
    for (int i = 0; i < ids.Count; i++)
    {
        ulong id = ids[i];

        //print some status information
        Console.WriteLine("Getting image " + (i + 1) + " of " + ids.Count);

        string requestUri = "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";
        WebRequest request = WebRequest.Create(requestUri);
        WebResponse response;
        try
        {
            response = request.GetResponse();
        }
        catch (WebException)
        {
            //the file probably didn't exist; just skip to the next one
            continue;
        }

        //dispose of the response and image once the file is saved
        using (response)
        using (System.Drawing.Image image =
            System.Drawing.Image.FromStream(response.GetResponseStream()))
        {
            image.Save(@"C:\Downloads\OReillyCovers\" + id + ".jpg");
        }
    }
}
Again, this is all pretty straightforward. I create a WebRequest object
for the URL of the image I want to download and then call
GetResponse
to perform the request. If the file doesn't exist, GetResponse throws a
WebException. When I ran the downloader, a little over half of the ids
actually had covers (around 800 in total). Lastly, I create an Image
object from the response stream and save it to my desired location. Be
warned: this function will take a few minutes to run.
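Since the request URL is just string concatenation, pulling it into a small helper makes it easy to test. As a sketch of an alternative, WebClient.DownloadFile can also save the bytes straight to disk in place of the WebRequest/Image pair, which skips the JPEG decode-and-re-encode that Image.Save performs (the helper names here are my own, not from the article's code):

```csharp
using System;
using System.IO;
using System.Net;

class CoverDownloader
{
    //build the cover URL for a catalog id, following the pattern
    //http://oreilly.com/catalog/covers/[id]_lrg.jpg
    public static string BuildCoverUrl(ulong id)
    {
        return "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";
    }

    //alternative to the WebRequest/Image approach: write the raw
    //bytes to disk without re-encoding the image
    public static void Download(ulong id, string folder)
    {
        using (WebClient client = new WebClient())
        {
            client.DownloadFile(BuildCoverUrl(id),
                Path.Combine(folder, id + ".jpg"));
        }
    }
}
```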
That's it! Once the program runs, you'll be the proud holder of hundreds
of O'Reilly book covers.
I wrapped up my code in a little WPF app just so I had a button to press
and a way to view the extracted ids. Here's the XAML for that.
<Window x:Class="OReillyImageDownloader.Window1"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="Window1" Height="522" Width="796">
    <Grid>
        <Grid.RowDefinitions>
            <RowDefinition Height="*" />
            <RowDefinition Height="Auto" />
        </Grid.RowDefinitions>
        <ListBox x:Name="_lbResults" />
        <Button x:Name="_btnParse" Content="Parse HTML"
            Grid.Row="1" Click="_btnParse_Click" />
    </Grid>
</Window>
Just in case you don't have Visual Studio but still want
to run the program, I've also attached the executable. To run it, you'll
have to make sure you've placed the downloaded HTML files in the same
place I have and that the folder "C:\Downloads\OReillyCovers" exists.
I also added a progress bar so you know something's happening while it's
downloading.
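If you'd rather not create the output folder by hand, one call before the download loop takes care of it; Directory.CreateDirectory is a no-op when the folder already exists. A minimal sketch (the helper name is my own):

```csharp
using System.IO;

class FolderSetup
{
    //make sure the output folder exists before saving any covers;
    //Directory.CreateDirectory does nothing if it's already there
    public static string EnsureOutputFolder(string path)
    {
        Directory.CreateDirectory(path);
        return path;
    }
}
```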
Source Files: