I wrote a quick utility a while ago to generate large XML sitemaps
from a database. There are plenty of sitemap-generating tools out there,
such as xml-sitemaps.com, and many of them now run as web tools (you
don't even have to download anything).
The problem is that these tools are usually crawler-based.
That means you point one at a starting URL, it crawls that page
looking for links, adds them to the sitemap, and then continues onward
to all of those pages, crawling for their links, and so on.
I never really understood this. Google is already doing
that (and believe me, they are doing it BETTER). More importantly,
these crawlers are usually not as forgiving to your web server as Google
is. Some of them will blindly hit your server with as many connections
as they can throw at it. Not only will your server come to a
screeching halt and your sitemap take FOREVER to create, it might also
error out on the bagillionth page and corrupt the whole file.
Now, if you are the supposed "webmaster" of this website, isn't it
safe to assume that you know better than a crawler which pages
you want to include in your sitemap? If your
website pulls its content from a database of some sort, the answer is
likely yes.
Anyway, I created this little tool to make it easy for me to generate
large XML SiteMap files from a database query of some sort, so that I
could pack them chock-full of good links without
sending a single HTTP request.
The code is a single file using an XmlTextWriter:
using System;
using System.Collections.Generic;
using System.Text;
using System.Xml;
using System.IO;

class SiteMapWriter
{
    private const string xmlFormatString = @"<?xml version=""1.0"" encoding=""UTF-8""?>";
    private const string xmlStylesheetString = @"<?xml-stylesheet type=""text/xsl"" href=""http://www.intelligiblebabble.com/files/style.xsl""?>";
    private const string xmlns = "http://www.sitemaps.org/schemas/sitemap/0.9";
    private const string xmlns_xsi = "http://www.w3.org/2001/XMLSchema-instance";
    private const string xsi_schemaLocation = "http://www.sitemaps.org/schemas/sitemap/0.9\nhttp://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd";

    private int MaxUrls;                    //urls per file
    private XmlTextWriter writer;           //writer for the current child sitemap file
    private XmlTextWriter parentwriter;     //writer for the parent sitemap index file
    private readonly HashSet<string> urls;  //tracks URLs already added, to skip duplicates
    private int NumberOfFiles;
    private int NumberOfUrls;
    private string BaseUri;
    private string FileDirectory;
    private string BaseFileName;
    /// <summary>
    /// Create SiteMapWriter Instance
    /// </summary>
    /// <param name="fileDirectory">the directory to generate the sitemap files to</param>
    /// <param name="baseFileName">filename to prefix all generated sitemap files</param>
    /// <param name="baseUri">URL to prefix all generated URLs. leave blank if you want relative.</param>
    /// <param name="maxUrlsPerFile">maximum number of URLs written to each child sitemap file</param>
    public SiteMapWriter(string fileDirectory, string baseFileName, string baseUri = "", int maxUrlsPerFile = 30000)
    {
        urls = new HashSet<string>();
        NumberOfFiles = 1;
        NumberOfUrls = 0;
        BaseUri = baseUri;
        BaseFileName = baseFileName;
        FileDirectory = fileDirectory;
        MaxUrls = maxUrlsPerFile;

        //start the parent sitemap index file
        var f = string.Format("{0}{1}.xml", fileDirectory, baseFileName);
        if (File.Exists(f)) File.Delete(f);
        parentwriter = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };
        parentwriter.WriteRaw(xmlFormatString);
        parentwriter.WriteRaw("\n");
        parentwriter.WriteRaw(xmlStylesheetString);
        parentwriter.WriteStartElement("sitemapindex");
        parentwriter.WriteAttributeString("xmlns", xmlns);
        parentwriter.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        parentwriter.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);

        //start the first child sitemap file
        CreateUrlSet();
    }
    /// <summary>
    /// Add Url to SiteMap
    /// </summary>
    /// <param name="loc">relative path to page</param>
    /// <param name="priority">the priority of the page (double between 0 and 1). defaults to 0.5</param>
    /// <param name="changefreq">how often the page changes, e.g. "daily", "weekly", or "monthly". left out if null.</param>
    public void AddUrl(string loc, double priority = 0.5, string changefreq = null)
    {
        //skip duplicate URLs
        if (urls.Contains(loc)) return;

        writer.WriteStartElement("url");
        writer.WriteElementString("loc", loc);
        if (changefreq != null) { writer.WriteElementString("changefreq", changefreq); }
        writer.WriteElementString("priority", string.Format("{0:0.0000}", priority));
        writer.WriteEndElement();

        urls.Add(loc);
        NumberOfUrls++;
        if (NumberOfUrls % 2000 == 0) Console.WriteLine(string.Format("Urls Processed: {0}", NumberOfUrls));

        //roll over to a new child sitemap file once the limit is reached
        if (NumberOfUrls >= MaxUrls) LimitIsMet();
    }
    private void LimitIsMet()
    {
        //close out current file
        CloseWriter();
        //create and start new file
        NumberOfFiles++;
        CreateUrlSet();
        NumberOfUrls = 0;
    }
    private void CreateUrlSet()
    {
        //start a new child sitemap file with the standard urlset header
        string f = string.Format("{0}{1}_{2}.xml", FileDirectory, BaseFileName, NumberOfFiles);
        if (File.Exists(f)) File.Delete(f);
        writer = new XmlTextWriter(f, Encoding.UTF8) { Formatting = Formatting.Indented };
        writer.WriteRaw(xmlFormatString);
        writer.WriteRaw("\n");
        writer.WriteRaw(xmlStylesheetString);
        writer.WriteStartElement("urlset");
        writer.WriteAttributeString("xmlns", xmlns);
        writer.WriteAttributeString("xmlns:xsi", xmlns_xsi);
        writer.WriteAttributeString("xsi:schemaLocation", xsi_schemaLocation);

        //register the new file in the parent sitemap index
        AddSiteMapFile(string.Format("{0}_{1}.xml", BaseFileName, NumberOfFiles));
    }
    private void AddSiteMapFile(string filename)
    {
        //write a <sitemap> entry in the index pointing to a child sitemap file
        parentwriter.WriteStartElement("sitemap");
        parentwriter.WriteElementString("loc", string.Concat(BaseUri, filename));
        parentwriter.WriteEndElement();
    }

    private void CloseWriter()
    {
        //close </urlset> and flush the current child file to disk
        writer.WriteEndElement();
        writer.Flush();
        writer.Close();
    }
    /// <summary>
    /// Flushes and closes all open writers.
    /// </summary>
    public void Finish()
    {
        CloseWriter();
        parentwriter.WriteEndElement();
        parentwriter.Flush();
        parentwriter.Close();
    }
}
Not the most elegant code in the world, but it gets the job done.
Note, as mentioned above, this is NOT for crawling pages for URLs. The upside, though, is that it will write your SiteMaps essentially
as quickly as you can pull the data from the database. I recently used it
to generate over 4 million URLs in around 30 seconds.
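The intended use is to feed it straight from a database query. Here is a minimal sketch of what that could look like - the connection string, the Pages table, and its Slug column are hypothetical placeholders for whatever schema and query you actually have:

using System.Data.SqlClient;

public static void GenerateFromDatabase()
{
    var writer = new SiteMapWriter(
        @"C:\SiteMapDirectory\",
        "SiteMap",
        "http://www.intelligiblebabble.com/");

    //stream rows out of the database and into the sitemap writer
    using (var conn = new SqlConnection("Server=.;Database=MySite;Integrated Security=true"))
    using (var cmd = new SqlCommand("SELECT Slug FROM Pages", conn))
    {
        conn.Open();
        using (var reader = cmd.ExecuteReader())
        {
            while (reader.Read())
            {
                //each row becomes one <url> entry, e.g. "some-page/index.html"
                writer.AddUrl(reader.GetString(0));
            }
        }
    }

    writer.Finish();
}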
You can also try it out without a database, like so:
public static void Main(string[] args)
{
    var writer = new SiteMapWriter(
        @"C:\SiteMapDirectory\",
        "SiteMap",
        "http://www.intelligiblebabble.com/");

    for (int i = 0; i < 300000; i++)
    {
        writer.AddUrl(
            string.Format("{0}/index.html", Guid.NewGuid()));
    }

    writer.Finish();
}
Here I am generating 300,000 (fake) URLs. Google will not accept
sitemap files with more than 50,000 URLs (and anywhere near that many,
the files start getting pretty large anyway). As a result, this script
creates a parent SiteMap index file that links to child sitemap
files, each containing fewer than 50,000 URLs. I have set the default
maximum number of URLs per file to 30,000 just to be on the safe side,
but you can override this in the constructor with the maxUrlsPerFile
param.
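For example, if you wanted each file to hold up to 45,000 URLs instead (just an arbitrary number below Google's limit):

var writer = new SiteMapWriter(
    @"C:\SiteMapDirectory\",
    "SiteMap",
    "http://www.intelligiblebabble.com/",
    maxUrlsPerFile: 45000);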
The Sitemaps generated will generally look like this:
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.intelligiblebabble.com/files/style.xsl"?>
<urlset
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
  http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <url>
    <loc>2b55897a-43d1-4d3f-bec0-22ff50226a9a/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>a4cb2c7a-3080-4e77-a5e8-27a869e33afc/index.html</loc>
    <priority>0.5000</priority>
  </url>
  <url>
    <loc>614f281b-c60b-4dca-9c5c-9c59f1fff1fe/index.html</loc>
    <priority>0.5000</priority>
  </url>
</urlset>
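The parent index file (SiteMap.xml in this example) ties the child files together. Based on what the constructor and AddSiteMapFile write, it contains one sitemap entry per child file and should look roughly like this:

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="http://www.intelligiblebabble.com/files/style.xsl"?>
<sitemapindex
  xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://www.sitemaps.org/schemas/sitemap/0.9
  http://www.sitemaps.org/schemas/sitemap/0.9/sitemap.xsd">
  <sitemap>
    <loc>http://www.intelligiblebabble.com/SiteMap_1.xml</loc>
  </sitemap>
  <sitemap>
    <loc>http://www.intelligiblebabble.com/SiteMap_2.xml</loc>
  </sitemap>
</sitemapindex>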
Some other things to note are:
- A HashSet is used to ensure that no duplicate URLs are inserted into the file.
- The AddUrl method accepts a changefreq parameter and a priority parameter to set the corresponding XML elements (see the example after this list).
- If you are generating the base URL from the database, or you would like to keep the URLs relative, you can leave off the baseUri parameter from the constructor.
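For example, to mark a (made-up) page as weekly-updated with a higher priority:

writer.AddUrl("products/index.html", 0.8, "weekly");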
That's pretty much it. Please let me know if this was helpful to anyone.