At DEQ, we are doing a project to move our external website out of OpenCMS and into DotNetNuke. One of my tasks is to create a tool that will take a bunch of HTML files exported from OpenCMS and import them into DotNetNuke. It’s not a simple transfer - the HTML needs to be parsed to extract metadata fields, and modified in various ways to fit into the new system.
The tool is a console application written in C#. It takes a list of files from an Excel spreadsheet (which our team is maintaining as the project moves forward), runs each file through an HTML parser, modifies the page, and then calls a series of SQL Server stored procedures to complete the import into DotNetNuke.
The HTML parsing and modification is handled by the excellent HTML Agility Pack. The biggest part of it is that the links in each page need to be rewritten to conform to our DotNetNuke system’s URL format.
Where this gets tricky is that according to our rules, a page’s new URL isn’t known until the page
is parsed. It could be based on the content of the <title> element, or on the first <h1> element.
As a result, it has to have (at least partially) parsed all the pages before the first one can be completed.
It doesn’t have enough memory to hold the content for all the pages (about two thousand), so it is doing the
parsing in two passes - once to determine all of the metadata (including the new URL), and once, just before
import, to retrieve and modify the body content.
During the first parsing pass, the tool builds a collection of all the pages and their metadata fields.
During the second, it traverses this collection, parsing, rewriting, and then importing each page in turn.
My initial (naïve) implementation used a standard List<T>, to preserve the order of the pages. This is important
because in DotNetNuke’s model, each page may have a parent page, which needs to be imported
before any of its children. So it builds the page hierarchy in breadth-first order.
However, this resulted in absolutely horrendous link rewriting performance. To find the new URL of a page by its current URL, the list needs to be traversed in order until a match is found. This has to happen for every single internal link on every single page on the site. A single run of the import tool was taking over six hours to complete, with the vast majority of the time spent in the link rewriting step.
Obviously, a List<T> is not the right data structure for this problem. A Dictionary<K,V> would be better
for the link rewriting step, since it would provide nearly instantaneous retrieval, but it wouldn’t preserve
the item order.
This past Valentine’s Day, I decided to fix this problem. One way to approach it would be to maintain two data structures - a list for importing, and a dictionary for URL lookups. But I decided to do some research to see if there was an existing collection class in the .NET framework that could preserve order while providing O(1) retrieval by key.
Turns out, there is: KeyedCollection<K,V>. It’s an
abstract class that allows both index- and key-based retrieval in (near-)constant time. To achieve this, it
maintains both a list of items for iteration and index-based retrieval, and a lookup dictionary for key-based
retrieval. Using it is as simple as extending it, and providing an implementation of GetKeyForItem that
describes how the key should be determined for each item. For example:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 | |
As always, there are a few caveats. First, and perhaps obviously, KeyedCollection uses additional memory.
Instead of a List and a Dictionary, you now have both. There’s still only one copy of each value (if it’s a
reference type), but the additional collection itself is going to consume more memory. Second,
KeyedCollection<K,V> doesn’t actually implement IDictionary<K,V>, so you can’t use it interchangeably with
code that expects a dictionary. (It does implement the same interfaces as List<T>, so it’s best to think of it
as a list with key lookups rather than a dictionary with index lookups.) Third, when you use the collection’s
Contains method, watch out: the override that takes the key parameter uses the fast dictionary lookup, but the
override that takes the item parameter will cause an expensive iteration over the list items. And finally,
serialization performance can be slow because it serializes both the list and the dictionary. (Source for the
last two items)
After switching to KeyedCollection for my page list, I had improved the tool’s run time from six hours to
three minutes. Yes, three minutes. Now, this enormous improvement wouldn’t have been possible if I hadn’t
started with a ridiculously awful implementation to begin with, but it’s still a testament to the awesomeness of
this collection class.
And that’s my Valentine’s love note to KeyedCollection.