Skip to main content

...

....

C# - Generate large XML sitemaps from a database [Beginner]


I wrote a quick utility awhile ago to generate large XML sitemaps from a database. There are many sitemap generating tools out there, such as xml-sitemaps.com which are now even made to be web tools (you don't even have to download anything).

The problem with this is that these tools are usually crawler-based. That means that you basically point it at a starting URL, and it crawls that page looking for links, adds them to the sitemap, and then continues onward to all of those pages, crawling for those links, etc...

I never really understood this. Google is already doing that (and believe me, they are doing it BETTER). More importantly, these crawlers are usually not as forgiving to your web server as google is. Some of them will blindly hit your server with as many connections as they can throw at it, and not only will your server come to a screeching halt - your sitemap will take FOREVER to create - and might error out on the bagillionth page and then corrupt the whole file or something.

Now if you are the supposed "webmaster" for this web site, isn't it safe to assume that you have more intimate knowledge about what pages you want to include in your sitemap than a crawler does? If your website is pulling content from a database of some sort, the answer is likely yes.

Anyway, I created this little tool to make it easy for me to generate large XML SiteMap files from a database query of some sort so that I could pack these SiteMap files chock-full of good links without even sending one HTTP Request.

The code is a single file using an XmlTextWriter:
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

class SiteMapWriter
{
    private const string xmlFormatString = @"<?xml version=""1.0"" encoding=""UTF-8""?>";
    private const string xmlStylesheetString = @"<?xml-stylesheet type=""text/xsl"" href=""http://www.intelligiblebabble.com/files/style.xsl""?>";
    private const string xmlns = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private const string xmlns_xsi = "http://www.w3.org/2001/XMLSchema-instance";
    private const string xsi_schemaLocation = "http://www.sitemaps.org/schemas/sitemap/0.9\nhttp://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd";

    private int MaxUrls; //urls per file

    private XmlTextWriter writer;
    private XmlTextWriter parentwriter;
    private readonly HashSet<string> urls;
    private int NumberOfFiles;
    private int NumberOfUrls;

    private string BaseUri;
    private string FileDirectory;
    private string BaseFileName;

    /// <summary>
    /// Create SiteMapWriter Instance
    /// </summary>
    /// <param name="fileDirectory">the directory to generate the sitemap files to</param>
    /// <param name="baseFileName">filename to prefix all generated sitemap files</param>
    /// <param name="baseUri">URL to prefix all generated URLs.  leave blank if you want relative.</param>
    public SiteMapWriter(string fileDirectory, string baseFileName, string baseUri = "", int maxUrlsPerFile = 30000)
    {
        urls = new HashSet<string>();

        NumberOfFiles = 1;
        NumberOfUrls = 0;

        BaseUri = baseUri;
        BaseFileName = baseFileName;
        FileDirectory = fileDirectory;
        MaxUrls = maxUrlsPerFile;    

        var f = string.Format("{0}{1}.xml", fileDirectory, baseFileName);

        if (File.Exists(f)) File.Delete(f);
        parentwriter = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };

        parentwriter.WriteRaw(xmlFormatString);
        parentwriter.WriteRaw("\n");
        parentwriter.WriteRaw(xmlStylesheetString);


        parentwriter.WriteStartElement("sitemapindex");
        parentwriter.WriteAttributeString("xmlns", xmlns);
        parentwriter.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        parentwriter.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);

        CreateUrlSet();

    }

    /// <summary>
    /// Add Url to SiteMap
    /// </summary>
    /// <param name="loc">relative path to page</param>
    /// <param name="changefreq">how often the file changes. either "daily", "weekly", or "monthly". leaves out if empty.</param>
    /// <param name="priority">the priority of the page (double between 0 and 1). defaults to 0.5</param>
    public void AddUrl(string loc, double priority = 0.5, string changefreq = null)
    {
        if (urls.Contains(loc)) return;

        writer.WriteStartElement("url");
        writer.WriteElementString("loc", loc);
        if(changefreq != null) {writer.WriteElementString("changefreq", changefreq);}
        writer.WriteElementString("priority",string.Format("{0:0.0000}",priority));
        writer.WriteEndElement();

        urls.Add(loc);
        NumberOfUrls++;
        if(NumberOfUrls % 2000 == 0) Console.WriteLine(string.Format("Urls Processed: {0}",NumberOfUrls));
        if (NumberOfUrls >= MaxUrls) LimitIsMet();

    }

    private void LimitIsMet()
    {
        //close out current file
        CloseWriter();
        NumberOfFiles++;
        CreateUrlSet();
        //create and start new file
        NumberOfUrls = 0;
    }

    private void CreateUrlSet()
    {
        string f = string.Format("{0}{1}_{2}.xml", FileDirectory, BaseFileName, NumberOfFiles);

        if(File.Exists(f)) File.Delete(f);
        writer = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };

        writer.WriteRaw(xmlFormatString);
        writer.WriteRaw("\n");
        writer.WriteRaw(xmlStylesheetString);


        writer.WriteStartElement("urlset");
        writer.WriteAttributeString("xmlns", xmlns);
        writer.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        writer.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);
        AddSiteMapFile(string.Format("{0}_{1}.xml", BaseFileName, NumberOfFiles));

    }

    private void AddSiteMapFile(string filename)
    {
        parentwriter.WriteStartElement("sitemap");
        parentwriter.WriteElementString("loc", string.Concat(BaseUri,filename));
        parentwriter.WriteEndElement();
    }

    private void CloseWriter()
    {
        writer.WriteEndElement();
        writer.Flush();
        writer.Close();
    }

    /// <summary>
    /// Flushed and closes all open writers.
    /// </summary>
    public void Finish()
    {
        CloseWriter();

        parentwriter.WriteEndElement();
        parentwriter.Flush();
        parentwriter.Close();
    }
}
 
Not the most elegant code in the world, but it gets the job done. Note, as mentioned above, this is NOT for crawling pages for URLs. The upside though is that this will write your SiteMaps essentially as quick as you can pull the data from the database. I recently used it to generate > 4 Million URLs in around 30 seconds.

You can try it out yourself like so
public static void Main(string[] args)
{
    var writer = new SiteMapWriter(
            @"C:\SiteMapDirectory\", 
            "SiteMap", 
            "http://www.intelligiblebabble.com/");

    for(int i = 0; i < 300000; i++)
    {
        writer.AddUrl(
            string.Format("{0}/index.html",Guid.NewGuid()));
    }
    writer.Finish();
}
 
Here I am generating 300,000 (fake) URLs. Google will not accept sitemap files with more than 50,000 URLs (any more than that and these files are going to be getting pretty large). As a result, this script will create a parent SiteMap file which will link to children sitemap files - each less than 50,000 URLs long. I have set the default max number of URLS in each file to be 30,000 just to be on the safe side - but you can override this in the constructor with the maxUrlsPerFile param.

The Sitemaps generated will generally look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.intelligiblebabble.com/files/style.xsl"?>
<urlset 
    xmlns="http://www.sitemaps.org/schemas/sitemap/0.9" 
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
    xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9&#xA;http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>2b55897a-43d1-4d3f-bec0-22ff50226a9a/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>a4cb2c7a-3080-4e77-a5e8-27a869e33afc/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>614f281b-c60b-4dca-9c5c-9c59f1fff1fe/index.html</loc>
    <priority>0.5000</priority>
  </url>
</urlset>
 
Some other things to note are:
  • A HashSet is used to maintain that there are no duplicate URLs inserted into the file.
  • the AddUrl method accepts a changefreq parameter, and priority parameter to set the corresponding XML attributes
  • if you are generating the base URL from the database, or you would like to keep the URLs relative, you can leave off the baseUri parameter from the constructor
That's pretty much it, please let me know if this was helpful to anyone.

Comments

Popular posts from this blog

C# Snippet - Shuffling a Dictionary [Beginner]

Randomizing something can be a daunting task, especially with all the algorithms out there. However, sometimes you just need to shuffle things up, in a simple, yet effective manner. Today we are going to take a quick look at an easy and simple way to randomize a dictionary, which is most likely something that you may be using in a complex application. The tricky thing about ordering dictionaries is that...well they are not ordered to begin with. Typically they are a chaotic collection of key/value pairs. There is no first element or last element, just elements. This is why it is a little tricky to randomize them. Before we get started, we need to build a quick dictionary. For this tutorial, we will be doing an extremely simple string/int dictionary, but rest assured the steps we take can be used for any kind of dictionary you can come up with, no matter what object types you use. Dictionary < String , int > origin = new Dictionary < string , int >();

C# WPF Printing Part 2 - Pagination [Intermediate]

About two weeks ago, we had a tutorial here at SOTC on the basics of printing in WPF . It covered the standard stuff, like popping the print dialog, and what you needed to do to print visuals (both created in XAML and on the fly). But really, that's barely scratching the surface - any decent printing system in pretty much any application needs to be able to do a lot more than that. So today, we are going to take one more baby step forward into the world of printing - we are going to take a look at pagination. The main class that we will need to do pagination is the DocumentPaginator . I mentioned this class very briefly in the previous tutorial, but only in the context of the printing methods on PrintDialog , PrintVisual (which we focused on last time) and PrintDocument (which we will be focusing on today). This PrintDocument function takes a DocumentPaginator to print - and this is why we need to create one. Unfortunately, making a DocumentPaginator is not as easy as

C# WPF Tutorial - Implementing IScrollInfo [Advanced]

The ScrollViewer in WPF is pretty handy (and quite flexible) - especially when compared to what you had to work with in WinForms ( ScrollableControl ). 98% of the time, I can make the ScrollViewer do what I need it to for the given situation. Those other 2 percent, though, can get kind of hairy. Fortunately, WPF provides the IScrollInfo interface - which is what we will be talking about today. So what is IScrollInfo ? Well, it is a way to take over the logic behind scrolling, while still maintaining the look and feel of the standard ScrollViewer . Now, first off, why in the world would we want to do that? To answer that question, I'm going to take a an example from a tutorial that is over a year old now - Creating a Custom Panel Control . In that tutorial, we created our own custom WPF panel (that animated!). One of the issues with that panel though (and the WPF WrapPanel in general) is that you have to disable the horizontal scrollbar if you put the panel in a ScrollV