
C# - How to Download Every O'Reilly Book Cover [Beginner]


This is the first part of the series, which will explain how to write a simple C# application to download every book cover from O'Reilly's website.

O'Reilly's website contains a complete list of every book still in print. Basically, all we have to do is write a simple parser to pull out catalog IDs and then download the appropriate image. I know C# is not the best language for text parsing, but I know it well so I wrote my tool in C#. The concepts in this tutorial, however, can easily be extended to your own favorite language.

[Image: O'Reilly website complete book list]

If you go to the complete list, you'll see it's separated into 4 pages - "A-D", "E-J", "K-P", and "Q-Z". The first thing you'll want to do is download the source of all of those pages somewhere on your hard drive. I put mine here:
C:\Downloads\complete.html
C:\Downloads\complete2.html
C:\Downloads\complete3.html
C:\Downloads\complete4.html
 
Now that we have some HTML to parse, we need to take a look at it to figure out where the catalog IDs are. Every link in the complete list has the id as part of the link. Here's a snippet that contains one of the links from O'Reilly's website.
<a class="tt" id="0596007574" href="http://oreilly.com/catalog/9780596007577">.NET Compact Framework Pocket Guide
</a> 

</td>
<td valign="top" nowrap="nowrap">
May 2004</td>
<td valign="top" align="right">
$9.95
</td>
 
Looking at this code, what we really care about are the numbers after "/catalog/". Now we know what to look for. Here's a simple parse function that will extract all ids based on that fact.
//needs: using System.Collections.Generic; using System.IO; using System.Windows;
void _btnParse_Click(object sender, RoutedEventArgs e)
{
  //set up the paths to the downloaded HTML files
  string[] files = new string[] 
  {
    @"C:\Downloads\complete.html",
    @"C:\Downloads\complete2.html",
    @"C:\Downloads\complete3.html",
    @"C:\Downloads\complete4.html"
  };

  //create a list to hold all of the found ids
  List<ulong> ids = new List<ulong>();

  //loop through each downloaded file
  foreach (string file in files)
  {
    using (StreamReader reader = 
      new StreamReader(new FileStream(file, FileMode.Open)))
    {
      //get all of the file's contents
      string contents = reader.ReadToEnd();

      int index = 0;
      int catStart;
      int catEnd;
      ulong id;

      do
      {
        index = contents.IndexOf("/catalog/", index);

        if (index == -1)
          break;

        //Ids will either end with quotes or a slash
        // ".../catalog/123456789" or
        // ".../catalog/123456789/..."
        catStart = index + 9;
        catEnd = contents.IndexOfAny(new char[] { '\"', '/' }, catStart);

        if (catEnd == -1)
          break;

        //use parse to make sure this is actually a number
        //some links are ".../catalog/somefile.pdf"
        if (ulong.TryParse(contents.Substring(catStart,
          (catEnd - catStart)), out id))
        {
          //ids are duplicated, only add it once
          if (!ids.Contains(id))
            ids.Add(id);
        }

        //increment index past this occurrence of "/catalog/"
        index += 9;
      }
      while (index != -1);
    }
  }

  //show all the ids
  _lbResults.ItemsSource = ids;

  //download all the images
  DownloadBookCovers(ids);
}
 
There's a lot of code here, but it's really simple. At its core, it looks for every occurrence of "/catalog/", then pulls out whatever comes after that but before a quote (") or slash (/). This is because links containing ids will be in one of two formats:
".../catalog/123456789"
".../catalog/123456789/..."
 
Once it finds something, I use ulong.TryParse since sometimes what it pulls out is a filename instead of an id. My display contains a ListBox which I populate with the results just so I can see them.
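
For comparison, the same extraction can be sketched with a regular expression instead of manual index arithmetic. This is not the code I used above, just an alternative under the same assumptions: an id is a run of digits after "/catalog/" that ends at a quote or a slash. The class name CatalogIdParser and the method name ExtractIds are made up for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper, not part of the original tool.
public static class CatalogIdParser
{
  // Matches "/catalog/" followed by digits, but only when the digits are
  // immediately followed by a quote or a slash (the two link formats above).
  private static readonly Regex CatalogId =
    new Regex(@"/catalog/(\d+)(?=[""/])", RegexOptions.Compiled);

  // Returns each distinct id in the order it first appears.
  public static List<ulong> ExtractIds(string html)
  {
    var seen = new HashSet<ulong>();
    var ids = new List<ulong>();

    foreach (Match match in CatalogId.Matches(html))
    {
      ulong id;

      // TryParse still guards against digit runs too long for a ulong;
      // seen.Add returns false for duplicates, so each id is added once.
      if (ulong.TryParse(match.Groups[1].Value, out id) && seen.Add(id))
        ids.Add(id);
    }

    return ids;
  }
}
```

Inside the file loop this would be called as ExtractIds(contents). The HashSet keeps the duplicate check cheap, where List.Contains has to rescan the whole list for every id it finds.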

All right, now we're on to actually downloading files. Fortunately, O'Reilly keeps every book cover in the same location:
http://oreilly.com/catalog/covers/[id]_lrg.jpg
 
Since we've already extracted every id, all that's left to do is loop through them and make a simple web request for the file. Here's a function that does that.
//needs: using System.Net; and a project reference to System.Drawing
private void DownloadBookCovers(List<ulong> ids)
{
  for (int i = 0; i < ids.Count; i++)
  {
    ulong id = ids[i];

    //Print some status information (i is zero-based, so add 1)
    Console.WriteLine("Getting image " + (i + 1) + " of " + ids.Count);

    string requestUri = "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";

    WebRequest request = WebRequest.Create(requestUri);

    try
    {
      //dispose the response and image so connections and handles are released
      using (WebResponse response = request.GetResponse())
      using (System.Drawing.Image image =
        System.Drawing.Image.FromStream(response.GetResponseStream()))
      {
        image.Save(@"C:\Downloads\OReillyCovers\" + id + ".jpg");
      }
    }
    catch (WebException)
    {
      //file probably didn't exist, just skip to the next one
      continue;
    }
  }
}
 
Again, this is all pretty straightforward. I create a WebRequest object for the URL of the image I want to download. I then call GetResponse to perform the request. If the file doesn't exist, GetResponse will throw an exception, and I simply skip that id. When I ran the downloader, a little over half of the ids actually had covers (around 800 in total). Lastly, I create an Image object from the response stream and save it to my desired location. Be warned, this function will take a few minutes to run.
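
If you don't need an Image object at all, here is a minimal sketch of a variation that writes the downloaded bytes straight to disk with WebClient.DownloadFile. It is an alternative to the function above, not the code in the attached program. The names CoverDownloader, BuildCoverUri, and TryDownloadCover are made up for this example; the URL pattern is the one given earlier.

```csharp
using System;
using System.IO;
using System.Net;

// Hypothetical helper, not part of the original tool.
public static class CoverDownloader
{
  // Builds the cover address from the pattern described in this tutorial.
  public static string BuildCoverUri(ulong id)
  {
    return "http://oreilly.com/catalog/covers/" + id + "_lrg.jpg";
  }

  // Tries to save one cover into targetFolder.
  // Returns false if the web request fails (for example, no cover exists).
  public static bool TryDownloadCover(ulong id, string targetFolder)
  {
    string path = Path.Combine(targetFolder, id + ".jpg");

    try
    {
      using (WebClient client = new WebClient())
      {
        // writes the response body to disk without decoding it as an image
        client.DownloadFile(BuildCoverUri(id), path);
      }

      return true;
    }
    catch (WebException)
    {
      // the cover probably doesn't exist, so the caller can skip this id
      return false;
    }
  }
}
```

A loop over the ids would then call TryDownloadCover(id, @"C:\Downloads\OReillyCovers") and count how many calls return true. This version also avoids the System.Drawing reference.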

That's it! Once the program runs, you'll be the proud holder of hundreds of O'Reilly book covers.

[Image: Ferret book cover]

I wrapped up my code in a little WPF app just so I had a button to press and a way to view the extracted ids. Here's the XAML for that.
<Window x:Class="OReillyImageDownloader.Window1"
    xmlns="http://schemas.microsoft.com/winfx/2006/xaml/presentation"
    xmlns:x="http://schemas.microsoft.com/winfx/2006/xaml"
    Title="Window1" Height="522" Width="796">
  <Grid>
    <Grid.RowDefinitions>
      <RowDefinition Height="*" />
      <RowDefinition Height="Auto" />
    </Grid.RowDefinitions>
    <ListBox x:Name="_lbResults" />
    <Button x:Name="_btnParse" Content="Parse HTML" 
            Grid.Row="1" Click="_btnParse_Click" />
  </Grid>
</Window>
 
Just in case you don't have Visual Studio, but still want to run the program, I've also attached the executable. To run it, you'll have to make sure you've placed the downloaded HTML files in the same place I have and that the folder "C:\Downloads\OReillyCovers" exists. I also added a progress bar so you know something's happening while it's downloading.
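
If you are building from the source instead, one way to avoid creating the output folder by hand is to call Directory.CreateDirectory before the download loop. This is a small sketch, not something the attached executable does; the class name OutputFolder and the method name Ensure are made up for this example.

```csharp
using System.IO;

// Hypothetical helper, not part of the original tool.
public static class OutputFolder
{
  // Creates the folder (and any missing parent folders) if it isn't there yet.
  // Directory.CreateDirectory does nothing when the folder already exists.
  public static string Ensure(string path)
  {
    Directory.CreateDirectory(path);
    return path;
  }
}
```

Calling OutputFolder.Ensure(@"C:\Downloads\OReillyCovers") at the top of DownloadBookCovers would be enough for the paths used in this tutorial.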


Source Files:
