Counting the matches to expect from a PHP regular expression means accounting for four kinds of capturing groups, which can all occur in the same pattern: numbered, named, duplicate-numbered and duplicate-named.

Counting Numbered Groups

Numbered groups of a regular expression are introduced by the pattern ´(regex)´.

Subpatterns are delimited by parentheses (round brackets), which can be nested.

Opening parentheses are counted from left to right (starting from 1) to obtain numbers for the capturing subpatterns.

If an opening parenthesis is followed by a question mark and a colon, the subpattern does not do any capturing, and is not counted when computing the number of any subsequent capturing subpatterns.

Whitespace characters may never appear within special character sequences in a pattern, for example within the sequence (?( that introduces a conditional subpattern.

Example:

the ((?:red|white) (king|queen))
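
As a quick check of the numbering rules above, here is a minimal sketch that runs the example against a sample subject (the subject string and the variable names are my own choice, for illustration only). The non-capturing ´(?:red|white)´ gets no number, so only two groups should be reported.

```php
<?php
// Sample subject chosen for illustration only.
$subject = 'the white queen';
$regex   = '/the ((?:red|white) (king|queen))/';

preg_match($regex, $subject, $matches);

// $matches[0] is the whole match; the capturing groups follow, numbered
// by the position of their opening parenthesis.
print_r($matches);
// [0] => the white queen, [1] => white queen, [2] => queen
```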

Of course, escaped opening parentheses must not be counted, and neither must opening parentheses inside character classes (where escaping is automatic). To keep things simple, I use several regular expressions instead of one, applied in a specific order so that each one takes advantage of the previous one.

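// Erase explicitly escaped characters: a backslash followed by any character.
$find_explicitly_escaped = '/\\\\./';
$pattern = preg_replace($find_explicitly_escaped, '%', $pattern);

// Erase character classes, where parentheses are implicitly escaped.
$find_implicitly_escaped = '/\[[^\]]*\]/';
$pattern = preg_replace($find_implicitly_escaped, '%', $pattern);

// Count conditions such as (?(1)...): their inner parenthesis does not capture.
// Conditions that are assertions, like (?(?=...)...), are left out because the
// next pattern does not count their inner parenthesis in the first place.
$find_conditions = '/\(\?\((?!\?)/';
$conditions_count = preg_match_all($find_conditions, $pattern, $dummy);

// Count opening parentheses that are not followed by a question mark.
$find_numbered_groups  = '/\((?!\?)/';
$numbered_groups_count = preg_match_all($find_numbered_groups, $pattern, $numbered_groups);

$numbered_groups_count -= $conditions_count;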

Explicitly and implicitly escaped characters are all erased from the pattern, not just opening parentheses, by replacing each of them with a % character. This is not a problem; it actually makes the regular expressions simpler while preserving the order and the groups. Erasing explicitly escaped characters before implicitly escaped ones lets the pattern that finds the latter ignore any escaped closing bracket.
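
To make the effect of the two erasing steps visible, here is a small sketch on a made-up pattern (the sample pattern and the ´$step1´/´$step2´ names are assumptions of mine, not part of the original code). It contains an escaped parenthesis, a character class holding both a parenthesis and an escaped closing bracket, and a single real group.

```php
<?php
// Sample pattern (an assumption for this sketch): it has an escaped
// parenthesis, a character class containing "(" and "\]", and one real group.
$pattern = 'a\(b[(\]]c(d)';

// Step 1: erase explicitly escaped characters (a backslash plus one character).
$step1 = preg_replace('/\\\\./', '%', $pattern);

// Step 2: erase character classes; no escaped "]" can be left inside them now.
$step2 = preg_replace('/\[[^\]]*\]/', '%', $step1);

// Step 3: count the opening parentheses not followed by a question mark.
$count = preg_match_all('/\((?!\?)/', $step2);

echo "$step1\n$step2\n$count\n";
// a%b[(%]c(d)
// a%b%c(d)
// 1
```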

Counting Named Groups

Named groups of a regular expression are introduced by the patterns ´(?<name>regex)´, ´(?'name'regex)´, or ´(?P<name>regex)´.

PCRE supports the use of named as well as numbered capturing parentheses. The names are just an additional way of identifying the parentheses, which still acquire numbers.

Example:

(?<date>(?<year>(\d\d)?\d\d)-(?<month>\d\d)-(?<day>\d\d))

This is quite simple.

$find_named_groups  = '/\(\?P?(?:(?:<([^>]+)>)|(?:\'([^\']+)\'))/'; 
$named_groups_count = preg_match_all($find_named_groups, $pattern, $named_groups);

$numbered_groups_count += $named_groups_count;  //named groups also add as many numbered groups
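
To see why a named group counts once as a name and once as a number, this sketch matches the example against a sample date (the subject string is my own choice, for illustration only) and looks at the keys PHP reports.

```php
<?php
// Sample subject chosen for illustration only.
$subject = '2024-05-17';
$regex   = '/(?<date>(?<year>(\d\d)?\d\d)-(?<month>\d\d)-(?<day>\d\d))/';

preg_match($regex, $subject, $matches);

// Each named group is reported twice: once by name, once by number.
echo $matches['year'], ' ', $matches[2], "\n"; // same group under two keys

// 5 numbered entries + 4 named entries, not counting index 0.
echo count($matches) - 1, "\n";
```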

Counting Groups with Duplicate Names

Duplicate names in a regular expression are introduced by the pattern ´(?J:regex)´ or ´(?J)regex´.

By default, a name must be unique within a pattern, but it is possible to relax this constraint by setting the PCRE_DUPNAMES option at compile time. (Duplicate names are also always permitted for subpatterns with the same number, set up as described in the section on duplicate numbers below.)

Example:

(?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)

Duplicate names are easy to spot by looping through the matches returned in the ´$named_groups´ array after matching the ´$find_named_groups´ pattern with the flags ´PREG_SET_ORDER | PREG_OFFSET_CAPTURE´.

$dupnames = array();
foreach ($named_groups as $named_group)
{
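    // The name is in group 1 for <name> and in group 2 for 'name'.
    $name = (isset($named_group[2]) && '' !== $named_group[2][0]) ? $named_group[2][0] : $named_group[1][0];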
    if (isset($dupnames[$name]))
    {
        $dupnames[$name] += 1;
    }
    else 
    {
        $dupnames[$name] = 0;
    }
}
$dup_count = array_sum($dupnames);

$named_groups_count -= $dup_count;  //duplicate names are added only once
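
The indexes used in the loop depend on the shape that those two flags give to the result. The following sketch, run on a shortened version of the example (the shortening is mine, to keep the output small), shows that each match holds one [text, offset] pair per group, so the name written in angle brackets is the first element of the pair at index 1.

```php
<?php
$find_named_groups = '/\(\?P?(?:(?:<([^>]+)>)|(?:\'([^\']+)\'))/';

// Shortened sample pattern, for illustration only.
$pattern = '(?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?)';

preg_match_all($find_named_groups, $pattern, $named_groups,
    PREG_SET_ORDER | PREG_OFFSET_CAPTURE);

// One entry per match; inside it, one [text, offset] pair per group.
// $named_groups[0][1] is the pair for the name of the first match.
list($name, $offset) = $named_groups[0][1];

echo count($named_groups), " $name $offset\n"; // 2 DN 7
```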

Counting Groups with Duplicate Numbers

Duplicate numbers in a regular expression are introduced by the pattern ´(?|regex)´.

Perl 5.10 introduced a feature whereby each alternative in a subpattern uses the same numbers for its capturing parentheses.

Inside a ´(?|´ group, parentheses are numbered as usual, but the number is reset at the start of each branch. The numbers of any capturing parentheses that follow the subpattern start after the highest number used in any branch.

Example:

(?|(Sat)ur|(Sun))day
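
A short sketch of the branch reset at work (the two subject strings are my own choice): whichever alternative matches, its capture is stored under number 1, so the whole pattern exposes a single group.

```php
<?php
$regex = '/(?|(Sat)ur|(Sun))day/';

// Both alternatives store their capture under number 1.
preg_match($regex, 'Saturday', $first);
preg_match($regex, 'Sunday', $second);

echo $first[1], ' ', $second[1], "\n";         // Sat Sun
echo count($first), ' ', count($second), "\n"; // 2 2
```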

Groups with duplicate numbers (I’ll call them hellternations, for brevity) can also be nested and used together with any other available grouping. This is a real hell, mostly because of the branch reset.

For this reason, I now separate all the previous group counting from this one and put it into the following function (I’ll call it CGIH, for brevity).

/**
 * Returns how many groups (numbered or named) there are in the given $pattern, 
 * ignoring hellternations (?|...|...)
 * 
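 * @param string $pattern         passed by reference; on return, escaped characters and character classes are replaced by %
 * @param array  $named_groups    receives the matches of the named groups
 * @param array  $numbered_groups receives the matches of the numbered groups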
 * 
 * @return integer
 */
function ando_preg_count_groups_ignoring_hellternations( &$pattern, &$named_groups, &$numbered_groups )
{
    $find_explicitly_escaped = '/\\\\./';
    $pattern = preg_replace($find_explicitly_escaped, '%', $pattern);
    
    $find_implicitly_escaped = '/\[[^\]]*\]/';
    $pattern = preg_replace($find_implicitly_escaped, '%', $pattern);
    
    $find_numbered_groups  = '/\((?!\?)/'; 
    $numbered_groups_count = preg_match_all($find_numbered_groups, $pattern, $numbered_groups, PREG_SET_ORDER | PREG_OFFSET_CAPTURE); 
    
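    // Only conditions whose inner parenthesis was counted above are subtracted;
    // assertion conditions like (?(?=...)...) were never counted.
    $find_conditions = '/\(\?\((?!\?)/';
    $conditions_count = preg_match_all($find_conditions, $pattern, $dummy);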
    
    $numbered_groups_count -= $conditions_count;
    
    
    $find_named_groups  = '/\(\?P?(?:(?:<([^>]+)>)|(?:\'([^\']+)\'))/'; 
    $named_groups_count = preg_match_all($find_named_groups, $pattern, $named_groups, PREG_SET_ORDER | PREG_OFFSET_CAPTURE);
    
    $numbered_groups_count += $named_groups_count;  //named groups also add as many numbered groups
    
    
    $dupnames = array();
    foreach ($named_groups as $named_group)
    {
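        // The name is in group 1 for <name> and in group 2 for 'name'.
        $name = (isset($named_group[2]) && '' !== $named_group[2][0]) ? $named_group[2][0] : $named_group[1][0];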
        if (isset($dupnames[$name]))
        {
            $dupnames[$name] += 1;
        }
        else 
        {
            $dupnames[$name] = 0;
        }
    }
    $dupnames_count = array_sum($dupnames);
    
    $named_groups_count -= $dupnames_count;  //duplicate names are added only once
    
    
    $result = $numbered_groups_count + $named_groups_count;
    return $result;
}

If a pattern does not contain any hellternation, then CGIH gives the correct result. If it does contain some, I apply CGIH anyway, because I also need to count the groups that lie outside of any hellternation.

In the general case, there will be non-hellternations as well as hellternations, all of them siblings at the same level. In that case I can apply CGIH to the current pattern, and later adjust that count for each hellternation. First I have to subtract the contribution (1) of the hellternation to the total, and then add the maximum count (2) over all of its alternatives. The number (1) is obtained by applying CGIH to the hellternation, and the number (2) by recursively applying everything said in this paragraph to each alternative.
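
The arithmetic of this adjustment can be sketched on a small pattern with plain groups only. The sample pattern, the hand-split pieces and the helper ´sketch_count_plain_groups´ are assumptions made for this illustration; the helper is a stand-in that only counts plain parentheses, not the full CGIH.

```php
<?php
// Hypothetical helper for this sketch only: counts plain capturing
// parentheses, i.e. opening parentheses not followed by a question mark.
function sketch_count_plain_groups($pattern)
{
    return preg_match_all('/\((?!\?)/', $pattern);
}

// Sample pattern with one hellternation, split by hand for illustration.
$whole        = '(a)(?|(b)(c)|(d))(e)';
$inside       = '(b)(c)|(d)';            // content of the hellternation
$alternatives = array('(b)(c)', '(d)');  // its alternatives

$total = sketch_count_plain_groups($whole);   // 5: every plain group
$easy  = sketch_count_plain_groups($inside);  // 3: contribution (1)
$max   = max(array_map('sketch_count_plain_groups', $alternatives)); // 2: count (2)

$result = $total - $easy + $max;
echo $result, "\n"; // 4
```

The result agrees with the numbering rules: ´(a)´ is group 1, the alternatives share groups 2 and 3, and ´(e)´ is group 4.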

/**
 * Returns how many groups (numbered or named) there are in the given $pattern
 * 
 * @param string $pattern
 * @param array  $named_groups
 * @param array  $numbered_groups
 * 
 * @return integer
 */
function ando_preg_count_groups( $pattern, &$named_groups, &$numbered_groups )
{
    $result = ando_preg_count_groups_ignoring_hellternations($pattern, $named_groups, $numbered_groups);
    $hellternations = ando_preg_find_hellternations($pattern);
    if (empty($hellternations))
    {
        return $result;
    }
    foreach ($hellternations as $hellternation)
    {
        $count = array();
        $pieces = ando_preg_explode_alternation($hellternation);
        foreach ($pieces as $piece)
        {
            $count[] = ando_preg_count_groups($piece, $dummy, $dummy);
        }
        $max = max($count);
        $easy = ando_preg_count_groups_ignoring_hellternations($hellternation, $dummy, $dummy);
        $result = $result - $easy + $max;
    }
    return $result;
}

Utilities

The previous function relies on the following two utilities, both of which work by tracking balanced parentheses in a regular expression.

/**
 * Returns the hellternations which are siblings to each other.
 * NOTE: the given $pattern is assumed to not contain escaped parentheses.
 * 
 * @param string $pattern a string of alternations wrapped into (?|...)
 * 
 * @return array
 * 
 * @throws Exception
 */
function ando_preg_find_hellternations( $pattern )
{
    $result = array();
    $token = '(?|';
    $token_len = strlen($token);
    $offset = 0;
    do 
    {
        $start = strpos($pattern, $token, $offset);
        if (FALSE === $start)
        {
            return $result;
        }
        $open = 1;
        $start += $token_len;
        for ($i = $start, $iTop = strlen($pattern); $i < $iTop; $i++)
        {
            //$current = $pattern[$i];
            if ($pattern[$i] == '(')
            {
                $open += 1;
            }
            elseif ($pattern[$i] == ')')
            {
                $open -= 1;
                if (0 == $open)
                {
                    $result[$start] = substr($pattern, $start, $i - $start);
                    $offset = $i + 1;
                    break;
                }
            }
        }
    }
    while ($i < $iTop);
    if (0 != $open)
    {
        throw new Exception('Unbalanced parentheses.');
    }
    return $result;
}

/**
 * Explodes an alternation on outer pipes.
 * NOTE: the given $pattern is assumed to not contain escaped parentheses nor escaped pipes.
 * 
 * @param string $pattern a string with balanced (possibly nested) parentheses and pipes
 * 
 * @return array
 * 
 * @throws Exception
 */
function ando_preg_explode_alternation( $pattern )
{
    $result = array();
    $token = '|'; 
    $open = 0;
    $start = 0;
    for ($i = $start, $iTop = strlen($pattern); $i < $iTop; $i++)
    {
        //$current = $pattern[$i];
        if ($pattern[$i] == '(')
        {
            $open += 1;
        }
        elseif ($pattern[$i] == ')')
        {
            $open -= 1;
        }
        elseif ($pattern[$i] == '|')
        {
            if (0 == $open)
            {
                $result[$start] = substr($pattern, $start, $i - $start);
                $start = $i + 1;
            }
        }
    }
    $result[$start] = substr($pattern, $start);
    if (0 != $open)
    {
        throw new Exception('Unbalanced parentheses.');
    }
    return $result;
}

Minimal Test

foreach (array(
	'numbered' => 'the ((?:red|white) (king|queen))',
	'named' => '(?<date>(?<year>(\d\d)?\d\d)-(?<month>\d\d)-(?<day>\d\d))',
	'duplicate names' => '(?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)',
	'duplicate numbers' => '(?|(Sat)ur|(Sun))day',
) as $type => $regex)
{
    $count = ando_preg_count_groups($regex, $dummy, $dummy);
    echo "type  = $type\nregex = $regex\ncount = $count\n\n";
}

Output, where each count is the number of numbered entries plus the number of distinct names:

type  = numbered
regex = the ((?:red|white) (king|queen))
count = 2

type  = named
regex = (?<date>(?<year>(\d\d)?\d\d)-(?<month>\d\d)-(?<day>\d\d))
count = 9

type  = duplicate names
regex = (?J:(?<DN>Mon|Fri|Sun)(?:day)?|(?<DN>Tue)(?:sday)?|(?<DN>Wed)(?:nesday)?|(?<DN>Thu)(?:rsday)?|(?<DN>Sat)(?:urday)?)
count = 6

type  = duplicate numbers
regex = (?|(Sat)ur|(Sun))day
count = 1