Month: January 2011

Full UTF-8 support in WordPress

A few days ago I discovered that WordPress doesn’t support full UTF-8 strings, whose characters are 1 to 4 bytes long. Instead, it only supports the Unicode code points belonging to the BMP (Basic Multilingual Plane), whose UTF-8 characters are 1 to 3 bytes long.

This WordPress defect is “caused by” MySQL 5, which only supports UTF-8 characters in the BMP. Apparently, MySQL 6 will be fully UTF-8 compliant.
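To make the 1 to 4 byte claim concrete, here is a minimal sketch that reads the length of a UTF-8 sequence from its first byte. The function name utf8_sequence_length is hypothetical and the sample characters are my own choice; they are written as hex escapes so the encoding of the source file doesn't matter.

```php
<?php
// Hypothetical helper: expected length in bytes of a UTF-8 sequence,
// derived from the bit pattern of its first byte.
function utf8_sequence_length($char)
{
    $lead = ord($char[0]);
    if ($lead < 0x80)           return 1; // 0xxxxxxx, ASCII
    if (($lead & 0xE0) == 0xC0) return 2; // 110yyyyy
    if (($lead & 0xF0) == 0xE0) return 3; // 1110zzzz, rest of the BMP
    if (($lead & 0xF8) == 0xF0) return 4; // 11110www, beyond the BMP
    return 0; // continuation byte or invalid lead byte
}

$samples = array(
    'A (U+0041)'       => "\x41",
    'e acute (U+00E9)' => "\xC3\xA9",
    'mizu (U+6C34)'    => "\xE6\xB0\xB4",
    'G clef (U+1D11E)' => "\xF0\x9D\x84\x9E",
);
foreach ($samples as $name => $char) {
    echo $name . ': ' . utf8_sequence_length($char) . " byte(s)\n";
}
```

Only the last sample needs 4 bytes, and that is exactly the kind of character a BMP-only database column cannot store.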

This morning, with the help of the UTF-8 class I recently developed, I wrote a new WordPress plugin that adds full UTF-8 support to WordPress.

And this is the same sentence by Douglas Crockford, from the RFC4627 I cited in the previous post:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”

Windows users may see a rectangle (it’s a Windows feature), but they should see the following:

Image 1

You should note that the G clef above (not the one in the picture ;-) appears in the HTML not as an entity but as a common UTF-8 character, entered as is in the WordPress editor. You can see it for yourself by comparing the source code of this post (1) with that of the previous one (2).

  1. <blockquote><p>a string containing only the G clef character [<a href="http://www.fileformat.info/info/unicode/char/1d11e/index.htm" target="_blank"><span style="font-size: 2em;">𝄞</span></a>] may be represented as “\uD834\uDD1E”</p></blockquote>
  2. <blockquote><p>a string containing only the G clef character [<a href="http://www.fileformat.info/info/unicode/char/1d11e/index.htm" target="_blank"><span style="font-size: 2em;">&#119070;</span></a>] may be represented as &#8220;\uD834\uDD1E&#8221;</p></blockquote>
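The difference between the two sources can also be checked from PHP. This is only a sketch, using the standard html_entity_decode function: it decodes the numeric entity from the previous post and shows that the result is the 4-byte UTF-8 character stored as is in this post.

```php
<?php
// The numeric entity used in the previous post.
$entity = '&#119070;';

// Decode it to a raw UTF-8 string.
$char = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

echo dechex(119070), "\n"; // 1d11e, the code point of the G clef
echo strlen($char), "\n";  // number of bytes of the UTF-8 form
echo bin2hex($char), "\n"; // the bytes themselves, in hex
```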

Note that, since version 2.0.0, my plugin works for any character written to and read from the database. Anyway, I’ve just opened a ticket about this issue in the WordPress Trac: please drop by and comment :-)

What follows is the code of my Zend_Utf8 class, which I included in the plugin, after de-Zend-ifying all of it, for safe distribution in the wild.

<?php
/*
Copyright (c) 2011, Andrea Ercolino (http://noteslog.com)
All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:
    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    * Neither the name of the <organization> nor the
      names of its contributors may be used to endorse or promote products
      derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
DISCLAIMED. IN NO EVENT SHALL <COPYRIGHT HOLDER> BE LIABLE FOR ANY
DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
*/



/**
 * @package    Ando_Utf8
 */
class Ando_Utf8_Exception extends Exception
{}



/**
 * Basic UTF-8 support
 * 
 * @link http://noteslog.com/
 * @link http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
 * 
 * @package    Ando_Utf8
 */
class Ando_Utf8
{
    /**
     * Escape UTF-8 characters using the given options
     * 
     * About the write.callback option
     * -- it receives 
     * -- -- the given write.arguments 
     * -- -- the unicode of the current UTF-8 character
     * -- -- the current (unescaped) UTF-8 character
     * -- it must return the current escaped UTF-8 character
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $value
     * @param  array $options
     *   'escapeControlChars'   => boolean (default: TRUE),
     *   'escapePrintableASCII' => boolean (default: FALSE),
     *   'write'                => array(
     *       'callback'  => callable (default: 'sprintf'),
     *       'arguments' => array    (default: array('\u%04x')),
     *   ),
     *   'extendedUseSurrogate' => boolean (default: true),
     * 
     * @throws Ando_Utf8_Exception If the code point of any char in $value is 
     *                             not unicode
     * @return string
     */
    public static function escape($value, array $options = array())
    {
        $options = array_merge(array(
            'escapeControlChars'   => true,
            'escapePrintableASCII' => false,
            'write'                => array(
                'callback'  => 'sprintf',
                'arguments' => array('\u%04x'),
            ),
            'extendedUseSurrogate' => true,
        ), $options);
        if (! self::isCallable($options['write']))
        {
            throw new Ando_Utf8_Exception('Expected a valid write handler (callable, array).');
        }
        if (self::validateFilters($options) && isset($options['filters']['before-write']))
        {
            $value = self::call($options['filters']['before-write'], $value);
        }
        
        $result = "";
        $length = strlen($value);
        for($i = 0; $i < $length; $i++) {
            $ord_var_c = ord($value[$i]);
            
            switch (true) {
                case ($ord_var_c < 0x20):
                    // code points 0x00000000..0x0000001F, mask 0xxxxxxx
                    $utf8Char = $value[$i];
                    $result .= $options['escapeControlChars']
                        ? self::call($options['write'], array($ord_var_c, $utf8Char))
                        : $value[$i];
                break;

                case ($ord_var_c < 0x80):
                    // code points 0x00000020..0x0000007F, mask 0xxxxxxx
                    $utf8Char = $value[$i];
                    $result .= $options['escapePrintableASCII'] 
                        ? self::call($options['write'], array($ord_var_c, $utf8Char))
                        : $value[$i];
                break;

                case (($ord_var_c & 0xE0) == 0xC0):
                    // code points 0x00000080..0x000007FF, mask 110yyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 2); $i += 1;
                    $code = self::utf8CharToCodePoint($utf8Char);
                    $result .= self::call($options['write'], array($code, $utf8Char));
                break;

                case (($ord_var_c & 0xF0) == 0xE0):
                    // code points 0x00000800..0x0000FFFF, mask 1110zzzz 10yyyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 3); $i += 2;
                    $code = self::utf8CharToCodePoint($utf8Char);
                    $result .= self::call($options['write'], array($code, $utf8Char));
                break;

                case (($ord_var_c & 0xF8) == 0xF0):
                    // code points 0x00010000..0x0010FFFF, mask 11110www 10zzzzzz 10yyyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 4); $i += 3;
                    if ($options['extendedUseSurrogate'])
                    {
                        list($upper, $lower) = self::utf8CharToSurrogatePair($utf8Char);
                        $result .= self::call($options['write'], array($upper, $utf8Char));
                        $result .= self::call($options['write'], array($lower, $utf8Char));
                    }
                    else 
                    {
                        $code = self::utf8CharToCodePoint($utf8Char);
                        $result .= self::call($options['write'], array($code, $utf8Char));
                    }
                break;

                default:
                    //no more cases in unicode, whose range is 0x00000000..0x0010FFFF
                    throw new Ando_Utf8_Exception('Expected a valid UTF-8 character.');
                break;
            }
        }

        return $result;
     }
     
    /**
     * Compute the code point of a given UTF-8 character
     *
     * If available, use the multibyte string function mb_convert_encoding
     * TODO reject overlong sequences in $utf8Char
     *
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $utf8Char
     * @throws Ando_Utf8_Exception If the code point of $utf8Char is not unicode
     * @return integer
     */
    public static function utf8CharToCodePoint($utf8Char)
    {
        if (function_exists('mb_convert_encoding')) 
        {
            $utf32Char = mb_convert_encoding($utf8Char, 'UTF-32', 'UTF-8');
        } 
        else 
        {
            $bytes = array('C*');
            list(, $utf8Int) = unpack('N', str_repeat(chr(0), 4 - strlen($utf8Char)) . $utf8Char);
            switch (strlen($utf8Char)) 
            {
                case 1:
                    //Code points U+0000..U+007F
                    //mask   0xxxxxxx (7 bits)
                    //map to 00000000 00000000 00000000 0xxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int;
                break;
    
                case 2:
                    //Code points U+0080..U+07FF
                    //mask   110yyyyy 10xxxxxx (5 + 6 = 11 bits)
                    //map to 00000000 00000000 00000yyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 10 & 0x07;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
    
                case 3:
                    //Code points U+0800..U+D7FF and U+E000..U+FFFF
                    //mask   1110zzzz 10yyyyyy 10xxxxxx (4 + 6 + 6 = 16 bits)
                    //map to 00000000 00000000 zzzzyyyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 12 & 0xF0 | $utf8Int >> 10 & 0x0F;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
                             
                case 4:
                    //Code points U+10000..U+10FFFF
                    //mask   11110www 10zzzzzz 10yyyyyy 10xxxxxx (3 + 6 + 6 + 6 = 21 bits)
                    //map to 00000000 000wwwzz zzzzyyyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 22 & 0x1C | $utf8Int >> 20 & 0x03;
                    $bytes[] = $utf8Int >> 12 & 0xF0 | $utf8Int >> 10 & 0x0F;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
                
                default:
                    //no more cases in unicode, whose range is 0x00000000 - 0x0010FFFF
                    throw new Ando_Utf8_Exception('Expected a valid UTF-8 character.');
                break;
            }
            $utf32Char = call_user_func_array('pack', $bytes);
        }
        list(, $result) = unpack('N', $utf32Char); //unpack returns an array with base 1
        if (0xD800 <= $result && $result <= 0xDFFF) 
        {
            //reserved for UTF-16 surrogates
            throw new Ando_Utf8_Exception('Expected a valid UTF-8 character.');
        }
        if (0xFFFE == $result || 0xFFFF == $result) 
        {
            //reserved
            throw new Ando_Utf8_Exception('Expected a valid UTF-8 character.');
        }
        
        return $result;
    }
    
    /**
     * Compute the surrogate pair of a given extended UTF-8 character
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * @link http://en.wikipedia.org/wiki/UTF-16/UCS-2
     * 
     * @param  string $utf8Char
     * @throws Ando_Utf8_Exception If the code point of $utf8Char is not extended unicode
     * @return array
     */
    public static function utf8CharToSurrogatePair($utf8Char) 
    {
        $codePoint = self::utf8CharToCodePoint($utf8Char);
        if ($codePoint < 0x10000) 
        {
            throw new Ando_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        $codePoint -= 0x10000;
        $upperSurrogate = 0xD800 + ($codePoint >> 10);
        $lowerSurrogate = 0xDC00 + ($codePoint & 0x03FF);
        $result = array($upperSurrogate, $lowerSurrogate);
        
        return $result;
    }
    
    /**
     * Unescape UTF-8 characters from a given escape format
     * 
     * About the read.pattern option
     * -- no delimiters and no modifiers allowed
     * -- for back references, your groups start at 3.
     * About the read.callback option
     * -- it receives 
     * -- -- the given read.arguments 
     * -- -- the current match of the pattern with all submatches
     * -- it must return the current unicode integer
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $value
     * @param  array $options
     *   'read'                 => array(
     *       'pattern'   => preg     (default: '\\\\u([0-9A-Fa-f]{4})'),
     *       'callback'  => callable (default: create_function('$all, $code', 'return hexdec($code);')),
     *       'arguments' => array    (default: array()),
     *   ),
     *   'extendedUseSurrogate' => boolean (default: TRUE),
     * 
     * @throws Ando_Utf8_Exception If the code point of any char in $value is 
     *                             not unicode
     * 
     * @return string
     */
    public static function unescape($value, array $options = array())
    {
        $options = array_merge(array(
            'read'                 => array(
                'pattern'   => '\\\\u([0-9A-Fa-f]{4})',
                'callback'  => create_function('$all, $code', 'return hexdec($code);'),
                'arguments' => array(),
            ),
            'extendedUseSurrogate' => true,
        ), $options);
        if (! self::isCallable($options['read']))
        {
            throw new Ando_Utf8_Exception('Expected a valid read handler (callable, array).');
        }
        $thereAreFilters = self::validateFilters($options);
        
        $result = "";
        $length = strlen($value);
        $pattern = '@([\w\W]*?)(' . $options['read']['pattern'] . ')|([\w\W]+)@';
        $offset = 0;
        while (preg_match($pattern, $value, $matches, 0, $offset))
        {
            if (! $matches[2])
            {
                //no more escape patterns
                $result .= $matches[0];
                $offset += strlen($matches[0]);
            }
            else 
            {
                //one more escape pattern
                $result .= $matches[1];
                $offset += strlen($matches[0]);
                $args = array_splice($matches, 2, count($matches) - 1);
                $unicode = self::call($options['read'], $args);
                if ($options['extendedUseSurrogate'] && (0xD800 <= $unicode && $unicode < 0xDC00))
                {
                    $upperSurrogate = $unicode;
                    preg_match($pattern, $value, $matches, 0, $offset);
                    if (! $matches[2])
                    {
                        throw new Ando_Utf8_Exception('Expected an extended UTF-8 character.');
                    }
                    $offset += strlen($matches[0]);
                    $args = array_splice($matches, 2, count($matches) - 1);
                    $unicode = self::call($options['read'], $args); // the lower surrogate
                    $utf8Char = self::utf8CharFromSurrogatePair(array($upperSurrogate, $unicode));
                }
                else 
                {
                    $utf8Char = self::utf8CharFromCodePoint($unicode);
                }
                $result .= $utf8Char;
            }
        }
        if ($thereAreFilters && isset($options['filters']['after-read']))
        {
            $result = self::call($options['filters']['after-read'], $result);
        }
        
        return $result;
     }
     
    /**
     * Compute the UTF-8 character of a given code point
     *
     * If available, use the multibyte string function mb_convert_encoding
     *
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  integer $codePoint
     * @throws Ando_Utf8_Exception if the code point is not unicode
     * @return string
     */
    public static function utf8CharFromCodePoint($codePoint)
    {
        if (0xD800 <= $codePoint && $codePoint <= 0xDFFF) 
        {
            //reserved for UTF-16 surrogates
            throw new Ando_Utf8_Exception('Expected a valid code point.');
        }
        if (0xFFFE == $codePoint || 0xFFFF == $codePoint) 
        {
            //reserved
            throw new Ando_Utf8_Exception('Expected a valid code point.');
        }
        
        if (function_exists('mb_convert_encoding')) 
        {
            $utf32Char = pack('N', $codePoint);
            $result = mb_convert_encoding($utf32Char, 'UTF-8', 'UTF-32');
        } 
        else 
        {
            $bytes = array('C*');
            switch (true)
            {
                case ($codePoint < 0x80):
                    //Code points U+0000..U+007F
                    //mask     0xxxxxxx (7 bits)
                    //map from xxxxxxx
                    $bytes[] = $codePoint;
                break;
                
                case ($codePoint < 0x800):
                    //Code points U+0080..U+07FF
                    //mask     110yyyyy 10xxxxxx (5 + 6 = 11 bits)
                    //map from yyy yyxxxxxx
                    $bytes[] = 0xC0 | $codePoint >> 6;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                case ($codePoint < 0x10000):
                    //Code points U+0800..U+D7FF and U+E000..U+FFFF
                    //mask     1110zzzz 10yyyyyy 10xxxxxx (4 + 6 + 6 = 16 bits)
                    //map from zzzzyyyy yyxxxxxx
                    $bytes[] = 0xE0 | $codePoint >> 12;
                    $bytes[] = 0x80 | $codePoint >> 6  & 0x3F;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                case ($codePoint < 0x110000):
                    //Code points U+10000..U+10FFFF
                    //mask     11110www 10zzzzzz 10yyyyyy 10xxxxxx (3 + 6 + 6 + 6 = 21 bits)
                    //map from wwwzz zzzzyyyy yyxxxxxx
                    $bytes[] = 0xF0 | $codePoint >> 18;
                    $bytes[] = 0x80 | $codePoint >> 12 & 0x3F;
                    $bytes[] = 0x80 | $codePoint >> 6  & 0x3F;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                default:
                    throw new Ando_Utf8_Exception('Expected a valid code point.');
                break;
            }
            $result = call_user_func_array('pack', $bytes);
        }
        return $result;
    }
    
    /**
     * Compute the extended UTF-8 character of a given surrogate pair
     * 
     * @link   http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * @link http://en.wikipedia.org/wiki/UTF-16/UCS-2
     * 
     * @param array $surrogatePair
     * @throws Ando_Utf8_Exception If the surrogate pair is not extended unicode
     * @return string
     */
    public static function utf8CharFromSurrogatePair($surrogatePair) 
    {
        list($upperSurrogate, $lowerSurrogate) = $surrogatePair;
        if (! (0xD800 <= $upperSurrogate && $upperSurrogate < 0xDC00))
        {
            throw new Ando_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        if (! (0xDC00 <= $lowerSurrogate && $lowerSurrogate < 0xE000))
        {
            throw new Ando_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        $codePoint = ($upperSurrogate & 0x03FF) << 10 | ($lowerSurrogate & 0x03FF);
        $codePoint += 0x10000;
        $result = self::utf8CharFromCodePoint($codePoint);
        
        return $result;
    }
    
    /**
     * A little calling interface: validation
     * 
     * @param  array  $handler
     * @return boolean
     */
    private static function isCallable($handler)
    {
        $result = is_callable($handler['callback']) && is_array($handler['arguments']);
        return $result;
    }
    
    /**
     * A little calling interface: call
     * 
     * @param  array  $handler
     * @param  mixed  $args
     * @return mixed
     */
    private static function call($handler, $args)
    {
        $args = array_merge($handler['arguments'], is_array($args) ? $args : array($args));
        $result = call_user_func_array($handler['callback'], $args);
        return $result;
    }
    
    /**
     * Validate filters. If there are filters return true, else false
     * 
     * @param array $options
     * @throws Ando_Utf8_Exception If there are malformed filters
     * @return boolean
     */
    protected static function validateFilters($options)
    {
        if (isset($options['filters']))
        {
            if (! is_array($options['filters']))
            {
                throw new Ando_Utf8_Exception('Expected valid filters.');
            }
            foreach ($options['filters'] as $key => $value)
            {
                if (! self::isCallable($value))
                {
                    throw new Ando_Utf8_Exception("Expected a valid $key handler.");
                }
            }
            return true;
        }
        return false;
    }
    
}

Escaping and unescaping UTF-8 characters in PHP

The Zend_Json_Encoder class implements three clearly different functionalities: encoding PHP values, encoding PHP classes, and escaping UTF-8 characters. Even in the latest release, 1.11.2, the UTF-8 escaping feature doesn’t take into account all possible UTF-8 characters: it lacks any support for the so-called extended Unicode characters, those with a code point between U+10000 and U+10FFFF. Douglas Crockford gives an example of how to escape extended characters in RFC 4627 by means of surrogate pairs:

a string containing only the G clef character [𝄞] may be represented as “\uD834\uDD1E”
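The arithmetic behind that surrogate pair is short enough to sketch on its own. The function name code_point_to_surrogates is hypothetical and not part of any framework; it only applies the standard UTF-16 rule: subtract 0x10000, then split the remaining 20 bits into two halves of 10 bits each.

```php
<?php
// Hypothetical helper: split an extended code point (U+10000..U+10FFFF)
// into its UTF-16 surrogate pair.
function code_point_to_surrogates($codePoint)
{
    if ($codePoint < 0x10000 || $codePoint > 0x10FFFF) {
        throw new InvalidArgumentException('Expected an extended code point.');
    }
    $value = $codePoint - 0x10000;          // 20 significant bits remain
    $upper = 0xD800 + ($value >> 10);       // high 10 bits
    $lower = 0xDC00 + ($value & 0x03FF);    // low 10 bits
    return array($upper, $lower);
}

list($upper, $lower) = code_point_to_surrogates(0x1D11E); // G clef
echo sprintf('\u%04X\u%04X', $upper, $lower), "\n";       // \uD834\uDD1E
```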

The encodeUnicodeString() method and its ancillary _utf82utf16() come directly from the Solar Framework. I think it’s fair to copy code from one open source project to another, but some insight is needed to tell what is well made from what is not. In this case, encodeUnicodeString() supports characters of up to 6 bytes and _utf82utf16() supports characters of up to 3 bytes, but UTF-8 characters are 1 to 4 bytes long! And I can’t believe that those two functions destroy any character they cannot process!

Encoding PHP values to some other string format, like JSON, may require escaping UTF-8 characters, and likewise decoding may require unescaping. I think that sufficiently justifies the existence of a class for basic UTF-8 support in the Zend Framework, so I wrote Zend_Utf8, which I’m going to briefly introduce.

Zend_Utf8 exposes six static functions: two are the main functions for escaping and unescaping strings and four are the ancillary functions for mapping UTF-8 characters to unicode integers and the other way around. Usage of the ancillary functions is well documented by the main functions, so I’ll describe only usage of the latter.
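To give an idea of what the ancillary mapping functions do, here is a compact standalone sketch of the direction from code point to UTF-8 bytes. The function code_point_to_utf8 is hypothetical and is not the class method; unlike the class, it doesn't reject surrogates or the reserved code points U+FFFE and U+FFFF.

```php
<?php
// Hypothetical standalone helper, not part of Zend_Utf8:
// encode a Unicode code point as a UTF-8 byte string.
function code_point_to_utf8($cp)
{
    if ($cp < 0x80) {
        // 0xxxxxxx
        return chr($cp);
    }
    if ($cp < 0x800) {
        // 110yyyyy 10xxxxxx
        return chr(0xC0 | ($cp >> 6))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x10000) {
        // 1110zzzz 10yyyyyy 10xxxxxx
        return chr(0xE0 | ($cp >> 12))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    if ($cp < 0x110000) {
        // 11110www 10zzzzzz 10yyyyyy 10xxxxxx
        return chr(0xF0 | ($cp >> 18))
             . chr(0x80 | (($cp >> 12) & 0x3F))
             . chr(0x80 | (($cp >> 6) & 0x3F))
             . chr(0x80 | ($cp & 0x3F));
    }
    throw new InvalidArgumentException('Code point out of Unicode range.');
}

echo bin2hex(code_point_to_utf8(0x6C34)), "\n";  // e6b0b4
echo bin2hex(code_point_to_utf8(0x1D11E)), "\n"; // f09d849e
```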

Example

Here is a simple program that shows basic usage:

function show($string)
{
    $escape = Zend_Utf8::escape($string);
    $unescape = Zend_Utf8::unescape($escape);
    $json_encode = trim(json_encode($string), '"');
    $json_decode = json_decode('"'.$escape.'"');
    print_r($string."\n");
    print_r($escape."\n");
    print_r('-- escaped'  . ($string   === $unescape    ? " and "   : " BUT NOT ")      . 'unescaped as expected'."\n");
    print_r('-- escape'   . ($escape   === $json_encode ? " works " : " DOESN'T WORK ") . 'like in json_encode'."\n");
    print_r('-- unescape' . ($unescape === $json_decode ? " works " : " DOESN'T WORK ") . 'like in json_decode'."\n");
    print_r("\n");
}

show("The white space inside brackets [	] is a common tab.");
show("The kanji inside brackets [水] is read mizu and means water in Japanese.");
show("The symbol inside brackets [] is NOT a G clef.");

And this is the output:

The white space inside brackets [	] is a common tab.
The white space inside brackets [\u0009] is a common tab.
-- escaped and unescaped as expected
-- escape DOESN'T WORK like in json_encode
-- unescape works like in json_decode

The kanji inside brackets [水] is read mizu and means water in Japanese.
The kanji inside brackets [\u6c34] is read mizu and means water in Japanese.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

The symbol inside brackets [] is NOT a G clef.
The symbol inside brackets [] is NOT a G clef.
-- escaped and unescaped as expected
-- escape works like in json_encode
-- unescape works like in json_decode

Please note:

  1. in the first case it says “escape DOESN’T WORK like in json_encode” because json_encode replaces all the special control characters supported by JSON with their respective short escapes; here a tab control character is replaced by \t
  2. in the third case it says “is NOT” because I’ve just found out that MySQL doesn’t support extended UTF-8 characters, so WordPress silently breaks. In the excerpt from RFC 4627 I cited above I was able to display the G clef by means of the HTML entity [&#119070;], but in the example I can’t use an entity; for this reason here is a screen capture of the output as it appears in the debug window of Zend Studio
  3. after the WordPress fix I’m going to develop to make it handle a G clef seamlessly, I’ll introduce advanced usage, changing the default options to other interesting values, such as how to escape for HTML. Meanwhile, please refer to the Zend Framework proposal
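The first note can be reproduced with nothing but PHP’s own json_encode and json_decode. This is a small sketch; the characters are written as hex escapes, so the G clef survives even where it can’t be stored as is.

```php
<?php
$tab   = "\t";
$mizu  = "\xE6\xB0\xB4";     // U+6C34
$gclef = "\xF0\x9D\x84\x9E"; // U+1D11E

echo json_encode($tab), "\n";   // "\t"  (short escape, not \u0009)
echo json_encode($mizu), "\n";  // "\u6c34"
echo json_encode($gclef), "\n"; // "\ud834\udd1e"  (surrogate pair)

// Decoding accepts the long form too, so unescaping agrees with json_decode.
var_dump(json_decode('"\u0009"') === $tab); // bool(true)
```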

Class

<?php
/**
 * Basic UTF-8 support
 * 
 * @link http://noteslog.com/
 * @link http://www.cl.cam.ac.uk/~mgk25/unicode.html#utf-8
 */
class Zend_Utf8
{
    /**
     * Escape UTF-8 characters using the given options
     * 
     * About the write.callback option, it receives the given write.arguments 
     * option plus a unicode integer, and must return a string.
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $value
     * @param  array $options
     *   'escapeControlChars'   => boolean (default: TRUE),
     *   'escapePrintableASCII' => boolean (default: FALSE),
     *   'write'                => array(
     *       'callback'  => callable (default: 'sprintf'),
     *       'arguments' => array    (default: array('\u%04x')),
     *   ),
     *   'extendedUseSurrogate' => boolean (default: true),
     * 
     * @throws Zend_Utf8_Exception If the code point of any char in $value is 
     *                             not unicode
     * @return string
     */
    public static function escape($value, array $options = array())
    {
        $options = array_merge(array(
            'escapeControlChars'   => true,
            'escapePrintableASCII' => false,
            'write'                => array(
                'callback'  => 'sprintf',
                'arguments' => array('\u%04x'),
            ),
            'extendedUseSurrogate' => true,
        ), $options);
        if (! self::isCallable($options['write']))
        {
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid write (callable, array).');
        }
        if (self::validateFilters($options) && isset($options['filters']['before-write']))
        {
            $value = self::call($options['filters']['before-write'], $value);
        }
        
        $result = "";
        $length = strlen($value);
        for($i = 0; $i < $length; $i++) {
            $ord_var_c = ord($value[$i]);
            
            switch (true) {
                case ($ord_var_c < 0x20):
                    // code points 0x00000000..0x0000001F, mask 0xxxxxxx
                    $result .= $options['escapeControlChars']
                        ? self::call($options['write'], $ord_var_c)
                        : $value[$i];
                break;

                case ($ord_var_c < 0x80):
                    // code points 0x00000020..0x0000007F, mask 0xxxxxxx
                    $result .= $options['escapePrintableASCII'] 
                        ? self::call($options['write'], $ord_var_c)
                        : $value[$i];
                break;

                case (($ord_var_c & 0xE0) == 0xC0):
                    // code points 0x00000080..0x000007FF, mask 110yyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 2); $i += 1;
                    $code = self::utf8CharToCodePoint($utf8Char);
                    $result .= self::call($options['write'], $code);
                break;

                case (($ord_var_c & 0xF0) == 0xE0):
                    // code points 0x00000800..0x0000FFFF, mask 1110zzzz 10yyyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 3); $i += 2;
                    $code = self::utf8CharToCodePoint($utf8Char);
                    $result .= self::call($options['write'], $code);
                break;

                case (($ord_var_c & 0xF8) == 0xF0):
                    // code points 0x00010000..0x0010FFFF, mask 11110www 10zzzzzz 10yyyyyy 10xxxxxx
                    $utf8Char = substr($value, $i, 4); $i += 3;
                    if ($options['extendedUseSurrogate'])
                    {
                        list($upper, $lower) = self::utf8CharToSurrogatePair($utf8Char);
                        $result .= self::call($options['write'], $upper);
                        $result .= self::call($options['write'], $lower);
                    }
                    else 
                    {
                        $code = self::utf8CharToCodePoint($utf8Char);
                        $result .= self::call($options['write'], $code);
                    }
                break;

                default:
                    //no more cases in unicode, whose range is 0x00000000..0x0010FFFF
                    require_once 'Zend/Utf8/Exception.php';
                    throw new Zend_Utf8_Exception('Expected a valid UTF-8 character.');
                break;
            }
        }

        return $result;
     }
     
    /**
     * Compute the code point of a given UTF-8 character
     *
     * If available, use the multibyte string function mb_convert_encoding
     * TODO reject overlong sequences in $utf8Char
     *
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $utf8Char
     * @throws Zend_Utf8_Exception If the code point of $utf8Char is not unicode
     * @return integer
     */
    public static function utf8CharToCodePoint($utf8Char)
    {
        if (function_exists('mb_convert_encoding')) 
        {
            $utf32Char = mb_convert_encoding($utf8Char, 'UTF-32', 'UTF-8');
        } 
        else 
        {
            $bytes = array('C*');
            list(, $utf8Int) = unpack('N', str_repeat(chr(0), 4 - strlen($utf8Char)) . $utf8Char);
            switch (strlen($utf8Char)) 
            {
                case 1:
                    //Code points U+0000..U+007F
                    //mask   0xxxxxxx (7 bits)
                    //map to 00000000 00000000 00000000 0xxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int;
                break;
    
                case 2:
                    //Code points U+0080..U+07FF
                    //mask   110yyyyy 10xxxxxx (5 + 6 = 11 bits)
                    //map to 00000000 00000000 00000yyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 10 & 0x07;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
    
                case 3:
                    //Code points U+0800..U+D7FF and U+E000..U+FFFF
                    //mask   1110zzzz 10yyyyyy 10xxxxxx (4 + 6 + 6 = 16 bits)
                    //map to 00000000 00000000 zzzzyyyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 12 & 0xF0 | $utf8Int >> 10 & 0x0F;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
                             
                case 4:
                    //Code points U+10000..U+10FFFF
                    //mask   11110www 10zzzzzz 10yyyyyy 10xxxxxx (3 + 6 + 6 + 6 = 21 bits)
                    //map to 00000000 000wwwzz zzzzyyyy yyxxxxxx
                    $bytes[] = 0;
                    $bytes[] = $utf8Int >> 22 & 0x1C | $utf8Int >> 20 & 0x03;
                    $bytes[] = $utf8Int >> 12 & 0xF0 | $utf8Int >> 10 & 0x0F;
                    $bytes[] = $utf8Int >>  2 & 0xC0 | $utf8Int       & 0x3F;
                break;
                
                default:
                    //no more cases in unicode, whose range is 0x00000000 - 0x0010FFFF
                    require_once 'Zend/Utf8/Exception.php';
                    throw new Zend_Utf8_Exception('Expected a valid UTF-8 character.');
                break;
            }
            $utf32Char = call_user_func_array('pack', $bytes);
        }
        list(, $result) = unpack('N', $utf32Char); //unpack returns an array with base 1
        if (0xD800 <= $result && $result <= 0xDFFF) 
        {
            //reserved for UTF-16 surrogates
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid UTF-8 character.');
        }
        if (0xFFFE == $result || 0xFFFF == $result) 
        {
            //reserved
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid UTF-8 character.');
        }
        
        return $result;
    }
    
    /**
     * Compute the surrogate pair of a given extended UTF-8 character
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * @link http://en.wikipedia.org/wiki/UTF-16/UCS-2
     * 
     * @param  string $utf8Char
     * @throws Zend_Utf8_Exception If the code point of $utf8Char is not extended unicode
     * @return array
     */
    public static function utf8CharToSurrogatePair($utf8Char) 
    {
        $codePoint = self::utf8CharToCodePoint($utf8Char);
        if ($codePoint < 0x10000) 
        {
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        $codePoint -= 0x10000;
        $upperSurrogate = 0xD800 + ($codePoint >> 10);
        $lowerSurrogate = 0xDC00 + ($codePoint & 0x03FF);
        $result = array($upperSurrogate, $lowerSurrogate);
        
        return $result;
    }
    
    /**
     * Unescape UTF-8 characters from a given escape format
     * 
     * About the read.pattern option
     * -- no delimiters and no modifiers allowed
     * -- for back references, your groups start at 3.
     * About the read.callback option
     * -- it receives the given read.arguments option plus all the matches
     * -- it must return a unicode integer.
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  string $value
     * @param  array $options
     *   'read'                 => array(
     *   	 'pattern'   => preg     (default: '\\\\u([0-9A-Fa-f]{4})'),
     *       'callback'  => callable (default: create_function('$all, $code', 'return hexdec($code);')),
     *       'arguments' => array    (default: array()),
     *   ),
     *   'extendedUseSurrogate' => boolean (default: TRUE),
     * 
     * @throws Zend_Utf8_Exception If the code point of any char in $value is 
     *                             not unicode
     * 
     * @return string
     */
    public static function unescape($value, array $options = array())
    {
        $options = array_merge(array(
            'read'                 => array(
                'pattern'   => '\\\\u([0-9A-Fa-f]{4})',
                'callback'  => create_function('$all, $code', 'return hexdec($code);'),
                'arguments' => array(),
            ),
            'extendedUseSurrogate' => true,
        ), $options);
        if (! self::isCallable($options['read']))
        {
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid read (callable, array).');
        }
        $thereAreFilters = self::validateFilters($options);
        
        $result = "";
        $length = strlen($value);
        $pattern = '@([\w\W]*?)(' . $options['read']['pattern'] . ')|([\w\W]+)@';
        $offset = 0;
        while (preg_match($pattern, $value, $matches, 0, $offset))
        {
            if (! $matches[2])
            {
                //no more escape patterns
                $result .= $matches[0];
                $offset += strlen($matches[0]);
            }
            else 
            {
                //one more escape pattern
                $result .= $matches[1];
                $offset += strlen($matches[0]);
                $args = array_splice($matches, 2, count($matches) - 1);
                $unicode = self::call($options['read'], $args);
                if ($options['extendedUseSurrogate'] && (0xD800 <= $unicode && $unicode < 0xDC00))
                {
                    $upperSurrogate = $unicode;
                    $found = preg_match($pattern, $value, $matches, 0, $offset);
                    //the lower surrogate escape must immediately follow the upper one
                    if (! $found || ! $matches[2] || '' !== $matches[1])
                    {
                        require_once 'Zend/Utf8/Exception.php';
                        throw new Zend_Utf8_Exception('Expected an extended UTF-8 character.');
                    }
                    $offset += strlen($matches[0]);
                    $args = array_splice($matches, 2, count($matches) - 1);
                    $unicode = self::call($options['read'], $args);
                    $utf8Char = self::utf8CharFromSurrogatePair(array($upperSurrogate, $unicode));
                }
                else 
                {
                    $utf8Char = self::utf8CharFromCodePoint($unicode);
                }
                $result .= $utf8Char;
            }
        }
        if ($thereAreFilters && isset($options['filters']['after-read']))
        {
            $result = self::call($options['filters']['after-read'], $result);
        }
        
        return $result;
    }
     
    /**
     * Compute the UTF-8 character of a given code point
     *
     * If available, use the multibyte string function mb_convert_encoding
     *
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * 
     * @param  integer $codePoint
     * @throws Zend_Utf8_Exception if the code point is not unicode
     * @return string
     */
    public static function utf8CharFromCodePoint($codePoint)
    {
        if (0xD800 <= $codePoint && $codePoint <= 0xDFFF) 
        {
            //reserved for UTF-16 surrogates
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid code point.');
        }
        if (0xFFFE == $codePoint || 0xFFFF == $codePoint) 
        {
            //reserved
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected a valid code point.');
        }
        
        if (function_exists('mb_convert_encoding')) 
        {
            $utf32Char = pack('N', $codePoint);
            $result = mb_convert_encoding($utf32Char, 'UTF-8', 'UTF-32');
        } 
        else 
        {
            $bytes = array('C*');
            switch (true)
            {
                case ($codePoint < 0x80):
                    //Code points U+0000..U+007F
                    //mask     0xxxxxxx (7 bits)
                    //map from xxxxxxx
                    $bytes[] = $codePoint;
                break;
                
                case ($codePoint < 0x800):
                    //Code points U+0080..U+07FF
                    //mask     110yyyyy 10xxxxxx (5 + 6 = 11 bits)
                    //map from yyy yyxxxxxx
                    $bytes[] = 0xC0 | $codePoint >> 6;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                case ($codePoint < 0x10000):
                    //Code points U+0800..U+D7FF and U+E000..U+FFFF
                    //mask     1110zzzz 10yyyyyy 10xxxxxx (4 + 6 + 6 = 16 bits)
                    //map from zzzzyyyy yyxxxxxx
                    $bytes[] = 0xE0 | $codePoint >> 12;
                    $bytes[] = 0x80 | $codePoint >> 6  & 0x3F;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                case ($codePoint < 0x110000):
                    //Code points U+10000..U+10FFFF
                    //mask     11110www 10zzzzzz 10yyyyyy 10xxxxxx (3 + 6 + 6 + 6 = 21 bits)
                    //map from wwwzz zzzzyyyy yyxxxxxx
                    $bytes[] = 0xF0 | $codePoint >> 18;
                    $bytes[] = 0x80 | $codePoint >> 12 & 0x3F;
                    $bytes[] = 0x80 | $codePoint >> 6  & 0x3F;
                    $bytes[] = 0x80 | $codePoint       & 0x3F;
                break;
                
                default:
                    require_once 'Zend/Utf8/Exception.php';
                    throw new Zend_Utf8_Exception('Expected a valid code point.');
                break;
            }
            $result = call_user_func_array('pack', $bytes);
        }
        return $result;
    }
    
    /**
     * Compute the extended UTF-8 character of a given surrogate pair
     * 
     * @link http://noteslog.com/post/escaping-and-unescaping-utf-8-characters-in-php/
     * @link http://en.wikipedia.org/wiki/UTF-16/UCS-2
     * 
     * @param array $surrogatePair
     * @throws Zend_Utf8_Exception If the surrogate pair is not extended unicode
     * @return string
     */
    public static function utf8CharFromSurrogatePair($surrogatePair) 
    {
        list($upperSurrogate, $lowerSurrogate) = $surrogatePair;
        if (! (0xD800 <= $upperSurrogate && $upperSurrogate < 0xDC00))
        {
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        if (! (0xDC00 <= $lowerSurrogate && $lowerSurrogate < 0xE000))
        {
            require_once 'Zend/Utf8/Exception.php';
            throw new Zend_Utf8_Exception('Expected an extended UTF-8 character.');
        }
        $codePoint = ($upperSurrogate & 0x03FF) << 10 | ($lowerSurrogate & 0x03FF);
        $codePoint += 0x10000;
        $result = self::utf8CharFromCodePoint($codePoint);
        
        return $result;
    }
    
    /**
     * A little calling interface: validation
     * 
     * @param  array  $handler
     * @return boolean
     */
    private static function isCallable($handler)
    {
        $result = is_callable($handler['callback']) && is_array($handler['arguments']);
        return $result;
    }
    
    /**
     * A little calling interface: call
     * 
     * @param  array  $handler
     * @param  mixed  $args
     * @return mixed
     */
    private static function call($handler, $args)
    {
        $args = array_merge($handler['arguments'], is_array($args) ? $args : array($args));
        $result = call_user_func_array($handler['callback'], $args);
        return $result;
    }
    
    /**
     * Validate filters. If there are filters return true, else false
     * 
     * @param array $options
     * @throws Zend_Utf8_Exception If there are malformed filters
     * @return boolean
     */
    protected static function validateFilters($options)
    {
        if (isset($options['filters']))
        {
            if (! is_array($options['filters']))
            {
                require_once 'Zend/Utf8/Exception.php';
                throw new Zend_Utf8_Exception('Expected valid filters.');
            }
            foreach ($options['filters'] as $key => $value)
            {
                if (! self::isCallable($value))
                {
                    require_once 'Zend/Utf8/Exception.php';
                    throw new Zend_Utf8_Exception("Expected a valid $key filter.");
                }
            }
            return true;
        }
        return false;
    }
    
}
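
As a worked check of the surrogate and 4-byte arithmetic used in the class above, here is a minimal sketch applied to the G clef (U+1D11E). The three helper functions are hypothetical stand-alone versions written for this example: they repeat the bit operations of utf8CharToSurrogatePair(), utf8CharFromSurrogatePair() and the 4-byte branch of utf8CharFromCodePoint(), but without any validation.

```php
<?php
// Hypothetical helpers for illustration only; they are not part of Zend_Utf8
// and they skip every range check the class performs.

// Split a code point above U+FFFF into a UTF-16 surrogate pair.
function toSurrogatePair($codePoint)
{
    $offset = $codePoint - 0x10000;      // at most 20 significant bits remain
    return array(
        0xD800 + ($offset >> 10),        // upper surrogate: top 10 bits
        0xDC00 + ($offset & 0x03FF),     // lower surrogate: low 10 bits
    );
}

// Rebuild the code point from a surrogate pair.
function fromSurrogatePair($upper, $lower)
{
    return ((($upper & 0x03FF) << 10) | ($lower & 0x03FF)) + 0x10000;
}

// Encode a code point in U+10000..U+10FFFF as 4 UTF-8 bytes
// (mask 11110www 10zzzzzz 10yyyyyy 10xxxxxx).
function toUtf8FourBytes($codePoint)
{
    return pack('C*',
        0xF0 | ($codePoint >> 18),
        0x80 | (($codePoint >> 12) & 0x3F),
        0x80 | (($codePoint >> 6) & 0x3F),
        0x80 | ($codePoint & 0x3F)
    );
}

$gClef = 0x1D11E;
list($upper, $lower) = toSurrogatePair($gClef);
printf("\\u%04X\\u%04X\n", $upper, $lower);               // \uD834\uDD1E
printf("U+%X\n", fromSurrogatePair($upper, $lower));      // U+1D11E
echo strtoupper(bin2hex(toUtf8FourBytes($gClef))), "\n";  // F09D849E
```

The escaped form matches the “\uD834\uDD1E” quoted from RFC 4627.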

Detecting recursive dependencies in PHP composite values

A great deal of complexity in the Zend_Json_Encoder class (like having a static method that instantiates the hosting class) is due to the implementation of a recursive dependency check.

Composite values can contain parts that contain the whole. In general that is a useful feature of data, but it lets a program exhaust its resources by entering infinite recursion. Every developer knows that infinite recursion MUST be avoided, so a recursive dependency check is a requirement of any trustworthy PHP to JSON encoder, and the Developer of the Zend_Json_Encoder class put a lot of effort into solving the problem satisfactorily.

Issue 1: False positives

The implementation of the recursive dependency check goes like this: if an object is found twice while visiting a value, then it’s considered a recursive dependency. Unfortunately that’s a necessary but not sufficient condition: a recursive dependency implies that some object is met twice, but meeting an object twice doesn’t imply a recursive dependency. In fact, as a reported bug made clear, it’s very easy (and useful) to craft a composite value with the same object twice, and no recursive dependency involved.

The bug fix was quite wacky. First, the Developer added a switch for turning the check on, and made it off by default. Second, they added another switch for turning off the exception, so that when a recursive dependency is detected a standard string is returned instead of encoding the recurring object again. The problem is that this mechanism still suffers from the same issue: it cuts recursive dependencies as well as simple repetitions (false positives).

Nobody should ever turn on the recursive dependency check in the Zend_Json_Encoder class. The only reason for doing so would be to get the encoding performed no matter what, which can be accomplished by turning the check on and exceptions off. But the presence of false positives makes it a bad choice anyway.
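
The difference between repetition and recursion can be sketched with two small walkers. Both functions below are hypothetical illustrations, not the Zend_Json_Encoder code: they only look at public properties, and they assume that arrays themselves don’t recur.

```php
<?php
// Hypothetical helpers for illustration; not the Zend_Json_Encoder code.
// Both assume that arrays do not recur (a recursive array would loop forever).

// Naive check: any object met twice during the walk counts as recursion.
function seenTwice($value, array &$seen = array())
{
    if (is_object($value)) {
        $id = spl_object_hash($value);
        if (isset($seen[$id])) {
            return true;
        }
        $seen[$id] = true;                 // shared by the whole walk
        $value = get_object_vars($value);  // public properties only
    }
    if (is_array($value)) {
        foreach ($value as $item) {
            if (seenTwice($item, $seen)) {
                return true;
            }
        }
    }
    return false;
}

// Path-based check: only an object that is its own ancestor counts.
function isOwnAncestor($value, array $ancestors = array())
{
    if (is_object($value)) {
        $id = spl_object_hash($value);
        if (isset($ancestors[$id])) {
            return true;
        }
        $ancestors[$id] = true;            // passed by value: siblings don't see it
        $value = get_object_vars($value);
    }
    if (is_array($value)) {
        foreach ($value as $item) {
            if (isOwnAncestor($item, $ancestors)) {
                return true;
            }
        }
    }
    return false;
}

// The same object twice, no cycle.
$shared = new stdClass();
$repeated = new stdClass();
$repeated->member1 = $shared;
$repeated->member2 = $shared;

// Two objects that point at each other.
$parent = new stdClass();
$child = new stdClass();
$parent->child = $child;
$child->parent = $parent;

var_dump(seenTwice($repeated));      // bool(true): a false positive
var_dump(isOwnAncestor($repeated));  // bool(false)
var_dump(seenTwice($parent));        // bool(true)
var_dump(isOwnAncestor($parent));    // bool(true)
```

The only difference between the two is whether the registry of visited objects is shared by the whole walk or restricted to the current path.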

Issue 2: Arrays recur too

The implementation of the recursive dependency check takes objects into account, but arrays can recur too, and no check for arrays is available. This creates false expectations: a developer who turned the check on would expect all recursive dependencies to be detected, even if some could be false positives, but this is not the case. With the check on, only recurring or repeated objects are found.

Observations

Let’s now run a few small experiments with recurring arrays and objects, and see how PHP itself prints, serializes and json_encodes them.

function show(&$var, $name)
{
    print_r("\n\n-------------------- $name --------------------\n\n");
    print_r($var);
    print_r("\nserialized -> ");
    print_r(serialize($var));
    print_r("\njson_encoded -> ");
    print_r(json_encode($var));
    print_r("\n---");
}

//arrays with recursive dependency
$array1 = array();
$array2 = array();
$array1[] = &$array2;
$array2[] = $array1;
show($array1, '$array1 with recursive dependency');

//objects with recursive dependency
$object1 = new stdClass();
$object2 = new stdClass();
$object1->member1 = $object2;
$object2->member2 = $object1;
show($object1, '$object1 with recursive dependency');

//arrays without recursive dependency
$array1 = array();
$array2 = array();
$array1[] = $array2;
$array2[] = $array1;
show($array1, '$array1 without recursive dependency');

//objects without recursive dependency
$object1 = new stdClass();
$object2 = new stdClass();
$object1->member1 = $object2;
$object1->member2 = $object2;
show($object1, '$object1 without recursive dependency');

-------------------- $array1 with recursive dependency --------------------

Array
(
    [0] => Array
        (
            [0] => Array
                (
                    [0] => Array
 *RECURSION*
                )

        )

)

serialized -> a:1:{i:0;a:1:{i:0;a:1:{i:0;R:2;}}}
json_encoded -> 
Warning: json_encode(): recursion detected in /Users/aercolino/...
[[[null]]]
---

-------------------- $object1 with recursive dependency --------------------

stdClass Object
(
    [member1] => stdClass Object
        (
            [member2] => stdClass Object
 *RECURSION*
        )

)

serialized -> O:8:"stdClass":1:{s:7:"member1";O:8:"stdClass":1:{s:7:"member2";r:1;}}
json_encoded -> 
Warning: json_encode(): recursion detected in /Users/aercolino/...
{"member1":{"member2":{"member1":null}}}
---

-------------------- $array1 without recursive dependency --------------------

Array
(
    [0] => Array
        (
        )

)

serialized -> a:1:{i:0;a:0:{}}
json_encoded -> [[]]
---

-------------------- $object1 without recursive dependency --------------------

stdClass Object
(
    [member1] => stdClass Object
        (
        )

    [member2] => stdClass Object
        (
        )

)

serialized -> O:8:"stdClass":2:{s:7:"member1";O:8:"stdClass":0:{}s:7:"member2";r:2;}
json_encoded -> {"member1":{},"member2":{}}
---

Comparing the output in the cases with a recursive dependency, we see that PHP by itself detects recursion (we trust PHP, so we knew it MUST avoid recursion):

  • print_r() shows a *RECURSION* label
  • json_encode() shows a Warning
  • serialize() shows a metadata element labeled R: for arrays and r: for objects
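
The Warning and the partial output shown above come from the PHP version used for these experiments. As far as I know, since PHP 5.5 json_encode() returns false in this situation and json_last_error() reports JSON_ERROR_RECURSION, so the same detection can be read programmatically. A minimal sketch, assuming PHP 5.5 or later:

```php
<?php
// Assumes PHP 5.5+, where JSON_ERROR_RECURSION and json_last_error_msg() exist.

// Same recursive array as in the experiment above.
$array1 = array();
$array2 = array();
$array1[] = &$array2;
$array2[] = $array1;

$encoded = json_encode($array1);

if ($encoded === false && json_last_error() === JSON_ERROR_RECURSION) {
    echo "recursion detected: ", json_last_error_msg(), "\n";
} else {
    echo $encoded, "\n";
}
```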

Comparing the output in the cases without a recursive dependency, we also see that PHP doesn’t detect recursion. Hmm, almost.

  • print_r() doesn’t show any *RECURSION* label
  • json_encode() doesn’t show any Warning
  • serialize() doesn’t show any metadata element labeled R: for arrays… BUT does show r: for objects

In fact, the last case is a simpler version of the example given for reproducing the bug of the Zend_Json_Encoder class: a composite value with the same object twice.

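It might seem that the serialize function of PHP is affected by the same bug, but it isn’t: r: is a back-reference to a value that has already been serialized (here, the same object met again) and R: marks a PHP reference, so both stand for repetition rather than recursion. That is how unserialize can restore shared objects and references. Anyway, the R/r markers of serialize cannot be trusted for telling recursion from repetition.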
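
One way to probe what the r: marker means is a round trip through unserialize(). The sketch below reuses the last experiment (the same object twice, no recursion) and checks whether the two members are still one object after the round trip.

```php
<?php
// The same object twice, no recursive dependency.
$shared = new stdClass();
$object1 = new stdClass();
$object1->member1 = $shared;
$object1->member2 = $shared;

$serialized = serialize($object1);
echo $serialized, "\n";
// O:8:"stdClass":2:{s:7:"member1";O:8:"stdClass":0:{}s:7:"member2";r:2;}

// If r:2 means "the value already serialized at position 2", then both
// members must be the very same object again after unserializing.
$copy = unserialize($serialized);
var_dump($copy->member1 === $copy->member2); // bool(true)
```

So r: records that an object is repeated, which is exactly the information unserialize() needs to rebuild the sharing.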

Solution

We’re going to exploit the fact that print_r() emits a *RECURSION* label. If we look for the label in the string returned by print_r() and find none, then there is no recursion. But if we get a match, it could be a false positive, because user data could contain the *RECURSION* substring.

So we need a means of discounting the user data substrings that could pollute our matches. Luckily, serialize() doesn’t use a *RECURSION* label in its metadata, so any occurrence in its output is user data.

function hasRecursiveDependency($value)
{
    //if PHP detects recursion in $value, then the printed $value
    //contains at least one match for the pattern /\*RECURSION\*/
    $printed = print_r($value, true);
    $recursionMetaUser = preg_match_all('@\*RECURSION\*@', $printed, $matches);
    if ($recursionMetaUser == 0)
    {
        return false;
    }
    //the serialized $value never contains the *RECURSION* label as
    //metadata, so any match here can only come from user data
    $serialized = serialize($value);
    $recursionUser = preg_match_all('@\*RECURSION\*@', $serialized, $matches);
    //matches in the printed $value that are user data rather than
    //metadata must be ignored
    $result = $recursionMetaUser > $recursionUser;
    return $result;
}
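
A quick way to exercise hasRecursiveDependency() is to feed it the values from the experiments, plus some user data that merely contains the label. In the sketch below the function is repeated, in compact form, only so that the snippet runs on its own; the variable names $repeated and $userData are made up for the example.

```php
<?php
// Copy of the function above, so that the snippet is self-contained.
function hasRecursiveDependency($value)
{
    $printed = print_r($value, true);
    $recursionMetaUser = preg_match_all('@\*RECURSION\*@', $printed, $matches);
    if ($recursionMetaUser == 0) {
        return false;
    }
    $serialized = serialize($value);
    $recursionUser = preg_match_all('@\*RECURSION\*@', $serialized, $matches);
    return $recursionMetaUser > $recursionUser;
}

// Arrays with recursive dependency (through a reference).
$array1 = array();
$array2 = array();
$array1[] = &$array2;
$array2[] = $array1;

// Objects with recursive dependency.
$object1 = new stdClass();
$object2 = new stdClass();
$object1->member1 = $object2;
$object2->member2 = $object1;

// The same object twice, no recursive dependency.
$repeated = new stdClass();
$repeated->member1 = $repeated->member2 = new stdClass();

// User data that merely contains the label.
$userData = array('note' => 'the *RECURSION* label is just text here');

var_dump(hasRecursiveDependency($array1));   // bool(true)
var_dump(hasRecursiveDependency($object1));  // bool(true)
var_dump(hasRecursiveDependency($repeated)); // bool(false)
var_dump(hasRecursiveDependency($userData)); // bool(false)
```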

This solution has many advantages:

  1. it is trustworthy (PHP does the job)
  2. it works for objects as well as arrays
  3. it is short and simple to understand
  4. it can be applied before walking a value

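And a few disadvantages:

  1. it doesn’t tell where the recursion occurs
  2. it could be slow for complex structures
  3. it can still report a false positive when an object whose data contains the *RECURSION* substring is repeated, because print_r() prints that data once per occurrence while serialize() writes it once, followed by an r: back-reference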

© 2017 Notes Log
