Monday 10 August 2020

Linq vs Regex

I recently had to write a password validator as part of a technical test for a prospective employer, where three of the conditions were:

  • Passwords must contain an upper case character
  • Passwords must contain a lower case character
  • Passwords must contain a number

Having recently had occasion to learn that you can treat a string as a collection of characters and then use Linq to run set operations across the collection, in the test I used this code:

password.Any(char.IsUpper)

password.Any(char.IsLower)

password.Any(char.IsDigit)

But I worked through the same test again yesterday as an exercise with a friend, and we used Regex instead, so the code became:

new Regex(@"[A-Z]").IsMatch(password)

new Regex(@"[a-z]").IsMatch(password)

new Regex(@"[0-9]").IsMatch(password)

Plainly either version works; I don't see myself as a great Regex developer, which is partly why I reached for the Linq solution first. But seeing the solution written both ways I got to wondering if there was a performance benefit one way or the other. Clearly, there's only one way to find out...

I wrote up a console app that would use a Stopwatch to time each operation, and I ran it over 100 iterations and reported the average number of ticks taken (I started off measuring the number of milliseconds taken but this was zero...). 
And we can see that in two cases the Regex outperforms the Linq version, and in the third it's more or less even. Job done!

But... what happens if we up the number of iterations? Here's the same code run across 1 000, 10 000 and 1 000 000 iterations.

Across more iterations, the Linq version performs faster!

So which should you use? Well, as with most things programming, it depends... It's important to remember that we're talking about differences of only a tick or two, measurements so small that your user won't notice. So with that in mind, we should think about other considerations, like our relative skill levels between Linq and Regex, and which version we find more accessible and readable. So for me, if I was making the choice I'd opt for the Linq versions; I tend to find Regex somewhat impenetrable. But you might make a different choice, and that's fine too.

If you want to try my code (and the performance might be different on your machine), my code is at https://github.com/philpursglove/LinqVsRegex

(Aside: I was very pleasantly surprised to find the tooling support for writing Regexes in Visual Studio has massively improved, when you start writing your expression now you get Intellisense that shows you some of the options, and you get colour-coding inside your expression that helps you identify the parts of your expression)


Update: After I posted this, Steve got in touch to say that compiled Regexes may offer a performance benefit. You specify that a Regex is compiled by adding an option into the constructor:
new Regex("[A-Z]", RegexOptions.Compiled)
When a Regex is compiled, it is converted to MSIL and executed by the JIT compiler, and the MSIL is then cached by the regular expression engine. The price you pay for this is a longer startup time as the compilation happens, but then you get a faster execution. (Microsoft has a whole article on best practices for Regexes that covers this and some other considerations, and there's another piece about compilation and reuse of Regexes).
So let's kick the tyres on this... I added a switch to my code that allows you to switch compilation on and off. Here's some numbers over 100 iterations with compilation on.
We can see immediately that the startup impact of compiling the Regex adds an order of magnitude of overhead to our code (but remember that the practical impact of this is probably still only a matter of milliseconds), and over a relatively small number of iterations this means it will be significantly outperformed by the Linq solution. If we up the number of iterations, does paying that upfront cost return us any kind of performance dividend? Here's the same code over 1 000 000 iterations.
Over more iterations, the upfront cost of compiling the Regex gets amortised down, but (on average) it still gets beaten by Linq (and I've also tried it over 10m and 100m iterations and this pattern remains consistent). 
So Linq is consistently better, yes? Maybe, maybe not... For one thing, the Regexes I'm using are pretty simple - I have a sneaking suspicion that a compiled, more complex Regex would turn out to beat a number of chained Linq expressions. Some of the other options for Regex may also have an impact.

Steve also pointed me to Benchmarkdotnet to look at other aspects such as memory usage to get a fuller picture not only of which method looks better from the raw timings but also in terms of memory usage etc, which I may look at in a future post...