The generic collection classes provided by the .Net Framework are well thought-out and perform quite well for the great majority of a developer’s collection handling needs. The only real issues arise when the contents of a collection get too large and you
want to use a method like Contains to see if an element exists in the collection. The Contains method is an
O(n) operation, meaning that it takes proportionally longer to execute the larger the collection becomes.
Many times the solution to the problem is to use a collection backed by a
HashSet. So long as all of the items in the collection are unique, an index can be built and organized using the hash value for each element in the collection making the Contains method of a HashSet an O(1) operation (it takes the same amount of time to
execute regardless of the number of elements in the collection).
However, the HashSet can have its limitations as well. Depending on the objects being used it may be possible to generate a hash collision (two objects in the collection produce the same hash value), though this generally won’t happen unless you are using
custom types. There’s also the fact that there is only one hash value per element to search on.
For a collection of Strings, in particular, this could be a problem. In the case of strings, you may want the ability to search for a full string in the collection, or just the starting characters to find all strings that begin with those characters. A
HashSet would not be suitable for this because there is no way to search for the substrings without enumerating the entire collection.
There is another solution to fast searches over a large collection of strings, but it comes with a cost. It is possible to build a collection that offers performance approaching O(1) for any number of strings in a collection; however, to do so we must be
willing to sacrifice memory consumption for performance. If we are willing to consume more memory than the raw contents of the collection requires, we can create a collection of strings backed by a character dictionary of character dictionaries.
String is, by definition, an
Characters. So the string “Hello World” is also an array containing the characters:
(0) = ‘H’
(1) = ‘e’
(2) = ‘l’
(3) = ‘l’
(4) = ‘o’
(5) = ‘ ‘
(6) = ‘W’
(7) = ‘o’
(8) = ‘r’
(9) = ‘l’
(10) = ‘d’
Instead of using one array to hold the characters of the string, we can reorganize a string such that each letter is contained in a Dictionary(Of Char, Dictionary(Of Char)) creating a tree-like structure of characters. For example, if we added the words
“Hello” and “Help” to the collection, we would have a tree of dictionaries like:
(0) = ‘H’
(0) = ‘e’
(0) = ‘l’
(0) = ‘l’
(0) = ‘o’
(1) = ‘p’
So the letter “H” is the first key in the root dictionary. The value for the key “H” is another Dictionary(Of Char) and its first key is the letter “e”, and so on to the “o” in “Hello”. For the word “Help”, the “H”, “e”, and “l” are already in the tree,
so the first “l” now gets a second child of the letter “p”.
By this logic, we also have the word “Hell” in our collection, but we didn’t add that word… so how do we tell the difference between words explicitly added to the collection and those added accidentally? We can use Chr(0) as a terminator character to indicate
the end of an intended string. In this way, a letter with a child of Chr(0) will indicate the end of a complete word. So our previous example then becomes:
(0) = ‘H’
(0) = ‘e’
(0) = ‘l’
(0) = ‘l’
(0) = ‘o’
(0) = Chr(0)
(1) = ‘p’
(0) = Chr(0)
As you can begin to see, with an extra character on each string and a new dictionary for each new letter in a word, the amount of memory consumed is much greater than is required to merely represent the characters in the string. But at the same time, because
we have organized the characters into a tree structure, the greatest number of iterations required to find any word in the collection is equal to the longest word in the collection; so searches become very fast. And if we choose to execute a search that does
not look for the termination character, we can quickly find all words in the collection that start with a given set of letters.
Although the basic premise of the collection is fairly straightforward, the code implementation can be a bit tricky in that our tree of characters must still be able to expose itself as a collection of complete strings. To achieve this, we need to create
a custom enumerator for our collection that knows how to find each complete string within the tree of characters.
We can begin by creating a “SearchableStringCollection” class which implements
ICollection(Of String) and declares a backing store of
Dictionary(Of Char, SearchableStringCollection):
There are a few simple class members that we can implement without first defining our actual character storage and retrieval mechanisms. First we can implement the Count property by simply defining a private _Count field to hold the number of full strings
in the collection. We’ll want to track this total manually as items are added to and removed from the collection because our backing store will not have this information readily at hand (we would have to enumerate the entire collection to find and count all
of the complete strings, and that would have undesirable performance).
_Count = 0
array(arrayIndex) = entry
arrayIndex += 1
arrayIndex = array.Length
The primary functionality methods Add, Contains, and Remove will require helper methods to assist in their execution, and we’ll look at each more closely later in this article. We’ll also want to add a “ContainsStartingWith” method to see if a partial match
exists in the collection, as well as a “Find” method which can return all strings beginning with a given set of characters.
The only remaining class members then are the implementations of GetEnumerator. The implementation coming from IEnumerable can be easily coded to just return the strongly-typed implementation coming from IEnumerable (Of String). We can also change the
access level of this member to Protected so that the class only exposes a single GetEnumerator method.
However, the actual implementation of GetEnumerator from IEnumerable(Of String) will need to return an instance of the custom enumerator class which is capable of finding all complete strings within the tree of characters.
Let’s take a look at the definitions of the four primary helper methods used to add, remove, and find strings in the collection:
As you can see, each method takes a character array and an index as parameters. The reason for this design is to facilitate calling each of these methods in a recursive fashion. A recursive method is simply a function or subroutine which continues to
call itself while some condition is met. Recursion works well with tree structures of data because you need to “walk the branches” of the tree whenever you work with the data. This means that the primary worker methods only need to know a starting position
within the tree at which they should begin their work. Each method can then call itself, passing the original character array it received along with an updated index indicating the next position at which processing needs to continue.
The OnAdd method will provide the functionality of walking the character tree, looking for existing characters and adding new branches when needed. The code for this method consists of four conditional checks with one opportunity for recursion:
characters.Length > 0
index = characters.Length - 1
index < characters.Length - 1
result = _ListCore(characters(index)).OnAdd(characters, index + 1)
The method begins by declaring a result variable of the returning type (Boolean). This value will be returned at the end of the method to indicate whether or not a new item was added to the collection (if the provided complete string already exists in the
collection, no new string will be added). Next it checks the trivial case that the provided character array is empty. Providing that some characters were supplied, the method then begins to look for existing characters or add new branches to the tree.
The method checks to see if the specified character exists in the local character dictionary. If that character does not yet exist, a new branch is created and the character is added to the collection. It is important to note that when this method is first
called from an instance of SearchableStringCollection, the _ListCore field is pointing to the “root” dictionary of the tree, however, since each branch of the tree contains another SearchableStringCollection, in subsequent recursive calls the _ListCore field
will point to the dictionary for the particular branch of the tree that execution has walked to.
Once the current character has been processed, the method repeats itself for each remaining character in the originally provided string. However, it is again important to note that the target of this second call to OnAdd is the branch stemming from the
root collection; it is not a recursive call into the same object instance. This is how each helper method recursively walks the character tree.
_ListCore(characters(index)).OnRemove(characters, index + 1)
_ListCore(characters(index))._ListCore.Count = 0
_ListCore(characters(index)).OnContains(characters, index + 1)
result.AddRange(_ListCore(characters(index)).OnFind(characters, index + 1))
List(Of IEnumerator(Of KeyValuePair(Of
_BaseEnumerator = baseEnumerator
_Current.Length > 0
_Current.Length -= 1
_Enumerators.Count > 0
_Enumerators(_Enumerators.Count - 1).MoveNext()
_Current.Append(_Enumerators(_Enumerators.Count - 1).Current.Key)
_Current(_Current.Length - 1) = Chr(0)
_Enumerators.Add(_Enumerators(_Enumerators.Count - 1).Current.Value._ListCore.GetEnumerator)
result = BuildCurrent()
_Enumerators.RemoveAt(_Enumerators.Count - 1)
OnAdd((item & Chr(0)).ToArray, 0)
_Count += 1
= OnRemove((item & Chr(0)).ToCharArray, 0)
_Count -= 1
OnContains((item & Chr(0)).ToCharArray, 0)
''' Represents a collection of strings similar in performance to a HashSet(Of String) that offers partial string searches.
''' Adds the specified string to the collection if it does not already exist.
''' <param name="item">The string to add to the collection.</param>
''' Clears all strings from the collection.
''' Gets a value indicating whether the collection contains the specified string.
''' <param name="item">The exact string to find.</param>
''' <returns>True if the collection contains the specified string, otherwise false.</returns>
''' Gets a value indicating whether the collection contains any strings that begin with the specified characters.
''' <param name="startsWith">The starting characters to search for.</param>
''' <returns>True if the collection contains one or more strings beginning with the specified characters, otherwise false.</returns>
''' Copies the contents of the collection to the specified array.
''' <param name="array">The array of strings to copy the collection into.</param>
''' <param name="arrayIndex">The location in the array at which copying begins.</param>
''' Gets the number of complete strings contained in the collection.
''' Gets an array of strings that begin with the specified characters.
''' <param name="startsWith">The starting characters to find.</param>
''' <returns>An array of strings beginning with the specified characters.</returns>
''' Gets an enumerator that iterates over the strings in the collection.
''' Returns false indicating that the collection is not read-only.
''' Removes the specified string from the collection.
''' <param name="item">The string to remove from the collection.</param>
''' <returns>True if the string is found and removed, otherwise false.</returns>
''' Enumerates the elements of a SearchableStringCollection.
''' Gets the element in the collection at the current position of the enumerator.
''' Advances the enumerator to the next element in the collection.
''' Sets the enumerator to its initial position, which is before the first element in the collection.
#Region "IDisposable Support"