
Order of Complexity 101

Here at RDC Aviation, we have some large datasets to work with, so a formal understanding of time complexity is crucial.

What is Order of Complexity?

Order of complexity allows you to predict how an algorithm will behave as the size of its input data changes. You can assign more developer time to optimising things which are going to deal with large data sets, and less to things which aren't going to be a problem.

It can also show you, within a larger system which is performing badly, where to look for potential optimisations.

What it can't really do is tell you how many seconds you'll wait for something to complete. The terms in use here are abstract, without any units. If you know how long something takes in real-world units for a known amount of input data, order of complexity will only let you predict how it will behave with a different size of data.

Order of complexity is often expressed using Big-O Notation, for example O(n) or O(n²).

Some examples

When plotted on a graph, three of the most common time complexities start to make sense: each Big-O notation matches a shape.

– Three common orders of complexity: Constant - O(1), Linear - O(n), Quadratic - O(n²)

Constant Time - O(1)

Algorithms with this time complexity are a kind of ideal for people who work with large data sets. If you can achieve this, you won't need to worry about your system becoming slow in the future. This shape tells you that, whether you have one item or a million, the time taken will not change.

Fetching a single element from a dictionary or hash table is usually a constant time operation.
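
A minimal sketch of that kind of lookup (the method and dictionary here are invented for illustration):

// fetching by key is O(1) - one hash computation and one bucket access,
// whether the dictionary holds ten entries or ten million
static string GetCategoryName(Dictionary<int, string> categoriesById, int id)
{
    return categoriesById[id];
}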

Linear - O(n)

Algorithms with this time complexity get slower as the data set grows, but they do so in a predictable fashion. If you double the amount of data, you will roughly double the time the algorithm takes.

Finding the position of a blog post within a bigger list of posts, by a brute force scan, is a linear time operation.
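
A minimal sketch of that brute force scan (the method and list of titles are invented for illustration):

// a brute force scan - in the worst case we visit every element once, so O(n)
static int FindPostIndex(List<string> postTitles, string title)
{
    for (var i = 0; i < postTitles.Count; i++)
    {
        if (postTitles[i] == title)
        {
            return i;
        }
    }

    return -1; // not found
}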

Quadratic - O(n²)

Algorithms with this time complexity are troublemakers, to say the least. Their running time has a habit of exploding. During development, when you have a few sample records to work with, you might not notice; but in production, when your users are creating lots of data, these algorithms can become a major problem.

Scanning a series of items and, for each one, finding all other items which are related to it by brute force is probably going to be O(n²).

How do we fix a slow algorithm?

There are two ways you could go about this. A higher value of n results in more time spent, so we could control the size of n. But if that isn't possible, you need to think about rearranging the structure and switching to a less time-complex algorithm.

Example: turning quadratic into linear time, using a hashing container

Consider this code. Its task is to tell us, true or false, whether the given list contains any duplicate values.

// compare every single number against every other number.
static bool AnyDuplicates(List<int> bigList)
{
    for (var i = 0; i < bigList.Count; i++)
    {
        for (var j = 0; j < bigList.Count; j++)
        {
            if (i != j && bigList[i] == bigList[j])
            {
                // we found a duplication
                return true;
            }
        }
    }

    return false;
}

Calculating the time complexity

We can break the code down into its component parts, assign each a time complexity, then combine them: sequential operations are added together, and nested operations are multiplied.

In the above code sample, we can see two nested loops. The outer loop is O(n). The inner is also O(n). Because there is one iteration of the inner loop for each iteration of the outer loop, we have to multiply our complexities.

Therefore, the time complexity of this algorithm is O(n * n), or rather O(n²): quadratic time complexity. It will perform OK with small data sets but will rapidly start to waste our users' time when we have more data.

Fixing the problem

If we rearrange the structure, we can improve it. If we target one of the components, such as the second O(n), we can reduce our overall complexity.

In C# we can use a HashSet. Checking whether a value is already present does a hash lookup, an O(1) operation on average, followed by a direct access into the hash table, also an O(1) operation.

// using a hash set instead
static bool AnyDuplicatesFaster(List<int> bigList)
{
    // HashSet (int capacity) will preallocate, so that we can do Add in O(1) time
    var existing = new HashSet<int>(bigList.Count); 

    // the outer loop is still O(n)
    for (var i = 0; i < bigList.Count; i++) 
    {
        // Contains is an O(1) operation (on average)
        if (existing.Contains(bigList[i])) 
        {
            // we found a duplication.
            return true;
        }

        // Add is O(1)
        existing.Add(bigList[i]); 
    }

    return false;
}

Our time complexity is now O(n * (1 + 1)), which is equivalent to O(n). A big improvement. It will still slow down as the data grows, but nowhere near as severely.
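
As an aside, HashSet<T>.Add returns false when the value is already present, so the lookup and the insert can be collapsed into a single O(1) call per element. A terser variant of the same idea:

// Add returns false if the value was already in the set - still O(n) overall
static bool AnyDuplicatesTerser(List<int> bigList)
{
    var existing = new HashSet<int>(bigList.Count);

    foreach (var value in bigList)
    {
        if (!existing.Add(value))
        {
            return true;
        }
    }

    return false;
}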

How can we use this principle in practice?

Because order of complexity lets you predict how expensive an algorithm will be to run in a real environment, you can use it to make decisions. For example, it can help you decide between repeated database lookups and keeping a local cache.

If you have a small dataset in a database, and you're doing a lot of linear O(n) operations on it, you might consider caching it in memory so that you can work on it via brute force and avoid a database hit every time you need something. For example, if you have 10 categories on your blog and want to display human-readable tags, you know you aren't going to have a million categories any time soon, so you can pull the whole set into memory and leave it there.
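
A rough sketch of that idea, with invented names (LoadCategories and the Id/Name fields stand in for whatever data access and model you already have):

// LoadCategories() stands in for a single database query, run once at startup
static readonly Dictionary<int, string> CategoryTags =
    LoadCategories().ToDictionary(c => c.Id, c => c.Name);

// after that, displaying a tag is an O(1) in-memory lookup - no database hit
static string GetCategoryTag(int categoryId)
{
    return CategoryTags[categoryId];
}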

Conversely, you might have a very large dataset which would use a lot of memory. If the operations you're performing are O(1), such as finding the appropriate blog post for a particular URI, you might consider fetching each post only when you need it, then caching the individual result.
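
A sketch of that lazy, per-item version (LoadPostFromDatabase and BlogPost are again invented stand-ins); ConcurrentDictionary.GetOrAdd only invokes the loader the first time a given key is requested:

// cache each post the first time its URI is requested, rather than loading everything up front
static readonly ConcurrentDictionary<string, BlogPost> PostsByUri =
    new ConcurrentDictionary<string, BlogPost>();

static BlogPost GetPost(string uri)
{
    // GetOrAdd only calls LoadPostFromDatabase the first time this uri is seen
    return PostsByUri.GetOrAdd(uri, LoadPostFromDatabase);
}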

You might find that you've got a large dataset with lots of complex algorithms - in this case you might decide to spend time reducing the time complexity. Displaying a chart with hundreds of airlines and destination countries might fall into this category. It would be impractical to cache every possible query and result, but some judicious use of memory here and there could optimise it so the actual delay is much smaller.

Edge Cases (caveat emptor)

It may also be that, in practice, your algorithm doesn't behave the way the theoretical order of complexity says it should. These are some things to look out for; it's not by any means an exhaustive list.

Regression

Some algorithms behave themselves most of the time but, due to hidden details of their implementation, start to decay into a worse complexity class when you give them massive data sets. A good example is a hash table: a search within a hash table is an O(1) operation. However, if the data set is large enough you start to hit hardware limitations - a large number of CPU cache misses, or trips to the page file, can cause our fast operations to decay towards linear time. The code is still picking the exact bucket to read the result from, but it is then having to load that data from virtual memory.

This can also occur when performing linear operations on an array. If you aren't iterating in the same order that the elements are stored in memory, you can start to make chaotic use of the CPU cache. Scanning a 2D array in the wrong order can trigger this kind of behaviour. Your algorithm is still O(n), but it can experience a sudden drop in performance which gets worse as the data grows.
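
A small sketch of the 2D array case: both methods below are O(n) in the number of cells, but the first walks memory in storage order while the second jumps between rows on every step, which tends to cost far more in cache misses on large arrays.

// both of these are O(n) over the number of cells, but they touch memory very differently
static long SumRowMajor(int[,] grid)
{
    long total = 0;
    for (var row = 0; row < grid.GetLength(0); row++)
    {
        for (var col = 0; col < grid.GetLength(1); col++)
        {
            total += grid[row, col]; // walks memory in the order it is stored
        }
    }
    return total;
}

static long SumColumnMajor(int[,] grid)
{
    long total = 0;
    for (var col = 0; col < grid.GetLength(1); col++)
    {
        for (var row = 0; row < grid.GetLength(0); row++)
        {
            total += grid[row, col]; // jumps a whole row's worth of memory each step
        }
    }
    return total;
}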

Unexpected input data

Sometimes your input data can produce wildly different real-world results, even though the algorithm matches its theoretical time complexity on average.

Suppose you have this list of numbers:

[1,2,3,4,5,6,7]

If you use a brute force search for 7, you will eventually find it - but only after 7 iterations. If, however, the list looks like this:

[7,6,5,4,3,2,1]

This time we will be lucky - it will complete on the first iteration. For this reason, the time complexity should only be considered an approximation: our brute force search is O(n), but the real-world time varies wildly.
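
A minimal sketch of such a brute force search; the number of iterations is decided entirely by where the target happens to sit.

// O(n) in the worst case, but the real cost depends on where the target sits
static int BruteForceSearch(int[] values, int target)
{
    for (var i = 0; i < values.Length; i++)
    {
        if (values[i] == target)
        {
            return i; // for [7,6,5,4,3,2,1], searching for 7 returns on the first iteration
        }
    }

    return -1; // not found - the genuine worst case, n iterations
}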

Conclusions

Order of complexity is a powerful tool, but it needs to be used correctly. It can tell you how much time you need to spend optimising a system before it is released to the wider world, and it can help you find places to optimise if you have a slow system in practice - but theory cannot replace practical experience and testing.

At RDC, we apply this logic to our vast proprietary datasets in order to ensure optimal performance for our customers. It's all part of making that important data analysis smooth and simple every time.

 
An Article by Gavin Burton

As a core member of our development team, Gavin draws on his spectrum of experience in web application development for a variety of industries to play a key role in the build of RDC's innovative products.