Shareware Beach

Monday, 11 June 2007

Emulating Bugs in Regex Flavors

Filed under: Just Great Software — Jan @ 7:50

RegexBuddy user Steven Levithan blogged about the RegexBuddy 3 beta. He asked whether RegexBuddy would also emulate bugs as part of its new ability to emulate other regular expression flavors.

Whether RegexBuddy will mimic a bug depends on whether the developers of the library intend to fix it. During development of RegexBuddy 3, I discovered that PCRE treated \Q..\E in character classes differently than Perl does. I reported this to PCRE’s author, who responded that he considered it a bug that should be fixed. At that point I decided RegexBuddy wouldn’t emulate PCRE’s incorrect behavior, since doing so would make RegexBuddy incorrect down the road. In fact, Philip Hazel has already fixed the bug in PCRE 7.0.

A bug that RegexBuddy does emulate is that in Ruby, (?m) does what (?s) does in Perl and all other Perl-style regex flavors: it makes the dot match line breaks. Ruby does not support Perl’s (?m) at all. I consider this a serious bug in Ruby, since (?s) and (?m) are confusing enough without Ruby switching their meaning. (The confusion comes from their poor naming; the actual features are easy to understand.) However, there’s no way this will be fixed in Ruby, since doing so would break all Ruby scripts that use (?m). All Ruby’s developers could do is make both (?s) and (?m) do the same as (?s) in Perl. So in this case, RegexBuddy interprets (?m) the way Ruby does when you select the Ruby flavor.
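To make the (?m) mix-up concrete, here is a minimal sketch in Python, whose re module follows the Perl meaning of these modes. The helper ruby_flags_to_python is a hypothetical name invented for this illustration; it is not part of RegexBuddy or of any library. It also assumes that Ruby's ^ and $ always match at line breaks, so the Perl-style multi-line mode is switched on unconditionally. Treat it as an approximation of the idea, not a complete flavor translation.

```python
import re


def ruby_flags_to_python(ruby_flags: str) -> int:
    """Hypothetical helper: map Ruby-style mode letters to Python re flags."""
    # Assumption: Ruby's ^ and $ always match at line breaks, which is what
    # Perl-style (?m) turns on, so MULTILINE is always set here.
    flags = re.MULTILINE
    if "m" in ruby_flags:
        # Ruby's (?m) makes the dot match line breaks: Perl-style (?s).
        flags |= re.DOTALL
    return flags


text = "a\nb"
# Without Ruby's m mode the dot does not match the line break.
print(bool(re.search(r"a.b", text, ruby_flags_to_python(""))))   # False
# With Ruby's m mode the dot matches the line break, like (?s) in Perl.
print(bool(re.search(r"a.b", text, ruby_flags_to_python("m"))))  # True
```

A tool that emulates the Ruby flavor has to apply this kind of mapping whenever it reads (?m), which is the behavior described above.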

Steven is right, of course, that RegexBuddy can never be 100% accurate in its emulation. Even different versions of the same tool aren’t 100% compatible. For example, PCRE 7.0 treats \Q..\E inside character classes differently from all previous versions of PCRE. This particular bug is pretty obscure, though, so I doubt anyone will run into it in practice. It occurs when \Q..\E is used as the start or end of a range, e.g. [a-\Qz\E]. In Perl and PCRE 7, this matches the range a-z, like [a-z]. In earlier versions of PCRE, it matches the three characters a, - and z, like [-az].
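The difference between the two readings is easy to see by testing a few characters. Python's re module does not support \Q..\E, so this small sketch does not run the original pattern; it only compares the two equivalent character classes named above, [a-z] and [-az].

```python
import re

perl_style = re.compile(r"[a-z]")  # how Perl and PCRE 7 read [a-\Qz\E]
old_pcre = re.compile(r"[-az]")    # how earlier PCRE versions read it

for ch in ["a", "m", "z", "-"]:
    # Columns: character, matched by [a-z], matched by [-az]
    print(repr(ch), bool(perl_style.fullmatch(ch)), bool(old_pcre.fullmatch(ch)))
```

Only "m" and "-" tell the two interpretations apart: "m" is in the range a-z but is not one of the three literal characters, and the hyphen is the reverse.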

2 Comments

  1. Interesting… thanks for the details. By the way, JavaScript is an example of a “flavor” which is a bit hard to pin down, since not only do you have to worry about previous versions of software, but also entirely different browsers (as one example, IE considers an unescaped, leading “]” within a character class as a non-terminating, literal character, while Firefox does not). Of course, the majority of users will probably not run into the few bugs here and there within stable, widely-adopted regex engines, so it’s more of an issue for people like you who try to pull off fancy features like flavor emulation.

    Comment by Steven Levithan — Monday, 11 June 2007 @ 19:33

  2. [...] new feature which further separates RegexBuddy from every competing tool. (Update: Jan Goyvaerts responded to the question of mimicking bugs on his [...]

    Pingback by RegexBuddy 3.0 Beta — Monday, 11 June 2007 @ 21:33
