Month: June 2011

How to fix a preg_match bug

I wanted to parse the host of a URL with a regular expression to get its third-level domain:

$pattern = '@(?:([^.]+)\.)*[^.]+\.[^.]+@';

Let’s test the general case with http://www.dnr.state.oh.us/

preg_match($pattern, parse_url('http://www.dnr.state.oh.us/', PHP_URL_HOST), $matches);
var_dump($matches);

array(2) {
  [0]=>
  string(19) "www.dnr.state.oh.us"
  [1]=>
  string(5) "state"
}

Pretty good. And now let’s test the edge case with http://google.com/

preg_match($pattern, parse_url('http://google.com/', PHP_URL_HOST), $matches);
var_dump($matches);

array(1) {
  [0]=>
  string(10) "google.com"
}

WTF, where is my empty submatch? Since when is an optional submatch not a submatch just because it's empty?

I googled it and found that a bug has already been filed. It was resolved as "won't fix"! They cite backward compatibility, but I cannot imagine how fixing it would break any existing code.

  • If I expect 3 submatches from my pattern but get 2, then I know (because of the bug) that the missing submatch is the last one and that it would have been an empty string. So I add it to the submatches array myself. Would a programmer do anything different to work around this bug?
  • If the bug were fixed globally, my old code would always get 3 submatches from that pattern. My individual fix would never be triggered, and since the last submatch would have the same value (an empty string) that my fix would have added, nothing would break. I'd only be left with a bit of stale, unused code.
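
The per-call workaround described in the two points above can be sketched roughly as follows. This is only an illustration of the padding idea, not the ando_preg_match wrapper: the helper name padded_preg_match and its extra $groupCount parameter are hypothetical, and the sketch assumes a pattern with numbered groups only (named groups add extra keys and need more care).

```php
<?php
// Hypothetical helper for illustration; not the ando_preg_match wrapper.
// The caller states how many capturing groups the pattern has.
function padded_preg_match($pattern, $subject, &$matches, $groupCount)
{
    $result = preg_match($pattern, $subject, $matches);
    if ($result === 1) {
        // preg_match drops trailing groups that took no part in the match.
        // Put them back as empty strings; index 0 holds the whole match.
        $matches = array_pad($matches, $groupCount + 1, '');
    }
    return $result;
}

$pattern = '@(?:([^.]+)\.)*[^.]+\.[^.]+@';
padded_preg_match($pattern, 'google.com', $matches, 1);
var_dump($matches);
```

If preg_match ever started returning the trailing group itself, array_pad would have nothing left to add, which is the "stale but harmless" situation from the second point.
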

To cleanly fix it myself once and for all, I've written a wrapper, ando_preg_match, that has the same signature and returns the expected results.

EDIT: There were some bugs in my own fix to the preg_match bug. For the code, please see the new post.

In the edge case I now get

array(2) {
  [0]=>
  string(10) "google.com"
  [1]=>
  string(0) ""
}

Unfortunately the wrapper is more complex than I'd like, because PHP allows regular expressions with named groups, and handling them requires a lot of additional code. Anyway, I've been able to do it all in a single function that can easily be dropped into any project.

Here is a test with a pattern with named groups, just in case you were wondering what it looks like

$pattern = '@(?:(?<subdomain>[^.]+)\.)*[^.]+\.[^.]+@';
ando_preg_match($pattern, parse_url('http://google.com/', PHP_URL_HOST), $matches);
var_dump($matches);

array(3) {
  [0]=>
  string(10) "google.com"
  ["subdomain"]=>
  string(0) ""
  [1]=>
  string(0) ""
}

Actually, this last example lets me show that my wrapper really returns the expected result. Just by adding a final group that always matches to the previous pattern, the empty subdomain group is no longer the last one, so the original, buggy preg_match works just fine

$pattern = '@(?:(?<subdomain>[^.]+)\.)*([^.]+\.[^.]+)@';
preg_match($pattern, parse_url('http://google.com/', PHP_URL_HOST), $matches);
var_dump($matches);

array(4) {
  [0]=>
  string(10) "google.com"
  ["subdomain"]=>
  string(0) ""
  [1]=>
  string(0) ""
  [2]=>
  string(10) "google.com"
}

Of course you’ll get the same result using the wrapper.
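
For readers on a newer PHP than the one this post was written for: as far as I know, since PHP 7.4 the built-in PREG_UNMATCHED_AS_NULL flag makes preg_match keep trailing unmatched groups, reporting them as null instead of dropping them. A minimal sketch under that assumption; falling back to an empty string with ?? is just one way to get the value the wrapper returns, not something the wrapper itself does.

```php
<?php
// Assumes PHP 7.4 or later, where PREG_UNMATCHED_AS_NULL also covers
// trailing unmatched groups.
$pattern = '@(?:([^.]+)\.)*[^.]+\.[^.]+@';
preg_match($pattern, 'google.com', $matches, PREG_UNMATCHED_AS_NULL);

// Group 1 took no part in the match, so it is present but null.
// Map it to an empty string to mirror the wrapper's output.
$subdomain = $matches[1] ?? '';
var_dump($subdomain);
```

Note that the flag reports unmatched groups as null rather than as empty strings, so code that compares against '' needs a fallback like the one above.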

© 2017 Notes Log
