jMar"s Blog DevSmash Developer Portal

Thursday, April 3, 2008

Big Regex Improvements for Firefox 3 Beta 5

Having just posted my latest project, Asciible, that relies heavily (well, entirely) on client-side regular expressions, I've been paying closer attention to efficiency concerns. I don't know if I'd really consider myself to be a Firefox "fan boy", but it is my browser of choice, and I have to admit that I was surprised to find that IE7 was kicking FF2's butt when it came to executing String.replace().

To give you a real world context for my tests, Asciible simply searches through a text input and replaces special characters with their ASCII equivalents. Even all the custom options rely solely on simple String.replace()'s. So in fairness, be aware that this limited test does not represent an overall analysis of the regular expression engine. With that said, in this limited context IE was taking about 10% of the time it was taking Firefox to complete. I was even able to achieve nearly the same efficiency in FF2 by using a simple for loop and a character hash table - yuck.

After noticing this, I went and downloaded Firefox 3, Beta 4 and ran my tests again. I was disappointed to find only marginal improvements.

So time advances until earlier today when I was taking a look at the release notes for Firefox 3 Beta 5, and I couldn't help but notice they were touting some major performance improvements - specifically to the JavaScript engine. So with curiosity in hand I downloaded, installed, and ran my tests again. To spare you the suspense - no, Firefox did not put IE to shame, and in truth it was still the slower of the two. However it did show BIG improvements - and I'm happy to see that the regular expression engine has been given some serious attention. All in all, FF3 Beta 5 was completing the test in less than half the time it was taking FF2. For visualization purposes:

String.replace() comparison for IE7, FF2, and FF3 Beta 5

This is good news, especially since FF3 is still in beta. There's no guarantee that there will be further improvements in the final release, but the optimizations thus far are already outstanding, and certainly enough to leave me with my fingers crossed.



17 comments:

Schrep said...

Can you share the test you are running by posting a bug at bugzilla.mozilla.org - we'd love to take at look at further optimizations..

Anonymous said...

so what about Opera?

Jeremy Martin said...

@Anonymous
Running Opera 9.27 on the same machine I was averaging mid 80's with the same 250,000 character input - so that's not bad at all.

Incidentally, Safari 3 for Windows came in at around 280 milliseconds.

Jeremy Martin said...

@Schrep
I have submitted an enhancement request to bugzilla as you suggested. The issue can be found here: https://bugzilla.mozilla.org/show_bug.cgi?id=427073

Joshua Paine said...

You can get a huge performance win and make your code more flexible. I made a test string by duplicating your sample text until it was almost 300k characters. On FF3b5 on Ubuntu on my computer, your code handles that in about 280 ms. My version runs in about 32 ms. Mine also should work on all extended characters, not just the 100ish you named explicitly.

To test my version replace the 100ish lines of one-by-one character replace()ing with this:

var subTable = {};

markup = markup.replace(
/[^A-Za-z0-9\s!#$%&\(\)*+,\-.\/\\:;=?@\[\]\^_`{}|~]/g,
function(c){
return subTable[c] || (subTable[c] = '&#'+c.charCodeAt(0)+';');
});

(I apologize for the formatting--I can't figure out a way to make Blogger treat it nicely.)

That code will run slower if there are many, many different special characters that are not repeated, but in the real world I think that's an uncommon case.

I can't test this in IE or anything else at the moment, but it should run and not be excessively slow.

Jeremy Martin said...

@Joshua
Thank you for that suggestion - it is definitely easy to see how that would speed things up. Truth be told I was wrestling with an approach like that myself, but I wanted to try and follow the W3C recommendations and convert only the ISO 8859-1 character set (as per http://www.w3.org/TR/html401/sgml/entities.html). I may also expand the script to include other ISO10646 char sets.

I would be interested in seeing if an approach like yours could be used, but with the ranges specified by the charcode instead - that would allow me to be a little more specific. I think I've heard of matching based on charcode, but I'm not familiar with that technique - you?

Joshua Paine said...

@jeremy
The principle speed advantage of mine is that it only traverses 300k characters once instead of 100 times and only allocates a new 300k+ string once instead of 100 times. (Strings are immutable in JS, so every time you run .replace you're allocating a new string.) If you really wanted to convert only certain characters, you could just list them all in a character class in the regex (instead of a negated character class like I did). You should even be able to use ranges for weird characters just like one commonly does with /A-Za-z0-9/. The replacing function could stay the same or be rewritten with a 100-element hash table pre-built for likely somewhat better speed but a lot more code.

But I think you misread the w3c on this. The only way that that set of characters is special is that someone went to the trouble of making *named* entities for them (e.g., "). Since you're using the numeric version and not the named ones anyway, there's no reason to restrict yourself to that set.

Also, I fixed my VirtualBox after a kernel upgrade broke it, and I'm happy to report that while FF2 still takes nigh unto forever to run my version, FF3b5 is almost twice as fast as IE7. (For this post, all times based on running each browser by itself in a virtualized Windows XP.)

Jeremy Martin said...

@Joshua
Well I can see what you're saying. It's true that yours only traverses the input once, but if you consider your basic operation to be the comparison, then both algorithms are in the same order of magnitude. Mine traverses X times looking for 1 character, while yours traverses 1 time looking for X characters. If I understand correctly, both result in N*X comparisions (N being input chars, X being special chars). You have me on creating new String objects though. My first approach was basically what you suggested (i.e., markup.replace(/<|>|&|.../), which perhaps I mistakenly assumed is roughly the same as defining a character class. Is this not so? Anyway, using that approach FF2 (did not test beta) was marginally faster, but IE was MUCH slower. For whatever reason, doing all the distinct replaces seemed to stumble on some highly optimized feature of the IE regular expression engine.

I guess you're also right on defining those special characters. Is there a definitive list of "safe" characters that I can use to define the character class? Perhaps you already based your first example above based on such a list.

Anonymous said...

This is off topic, but I have to ask...

What tool made that graph? It's one of the best looking ones I've ever seen.

Jeremy Martin said...

@Anonymous
Haha, get ready for it... MS Word. I had to play around with the fill colors for awhile to make it look decent (it's all about the gradients...)

Steve said...

As noted a few days ago on Reddit (http://reddit.com/info/6ek5i/comments/c03mq23), here's a quick rewrite of the code for efficiency and brevity: http://stevenlevithan.com/test/asciible.js

Jeremy Martin said...

@Steve and Joshua
Thanks again for the help with my regex. I hadn't particularly cared for my solution, so I'm grateful for the help. I'll be sure to update the code and provide you both with due credit!

Steve said...

BTW, if the alternative solution is indeed slower at all in IE, that would indicate (without proper testing on my end) that IE has significant optimization for a couple things you were previously doing a lot more of: compiling regex objects, and global single-character replacements.

Jeremy Martin said...

@Steve
That would make sense - I was thinking about trying it again with all the Regex objects already built, rather than letting it happen implicitly. That would be more of an academic exercise at this point though, and thus one for less busy times :)
I suppose in the first example, even all the String creations could be a noteworthy factor...

Steve said...

Are you talking about the difference between using regex literals and the RegExp constructor?

BTW, in my last comment I meant literal character (not single character) replacements. But again, it was just a guess. :)

Jeremy Martin said...

@Steve
I meant creating all the RegExp objects on page load, rather than when the convert function is called.

For example:
// on page load
var re1 = /x/g;
var re2 = /y/g;
...

// on function call
String.replace(re1, someFunc);
String.replace(re2, someFunc);
...

This would at least help determine if compiling the RegExp objects (as you suggested) was partly to blame.

terminals-blocks said...

it's very professional!