Month: December 2010

Translating a string from PHP to JSON

Based on my understanding of this subject, I’ve come up with the following function for translating a string from PHP to JSON in strict conformance with RFC 4627.

function json_string($string)
{
    //http://www.ietf.org/rfc/rfc4627.txt
    $replacements = array(
        '@[\\\\"]@'        => '\\\\$0',                        //escape backslashes and double quotes
        '@\n@'             => '\n',                            //convert new lines to their alias for readability
        '@\r@'             => '\r',                            //convert carriage returns to their alias for readability
        '@\t@'             => '\t',                            //convert tabs to their alias for readability
        '@[[:cntrl:]]@e'   => 'sprintf("\\u%04x", ord("$0"))', //escape control characters
        '@</([A-Z])@i'     => '<\/$1',                         //escape slashes that could fool browsers
    );
    $result = preg_replace(array_keys($replacements), array_values($replacements), $string);
    $result = '"' . $result . '"';
    return $result;
}
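The /e modifier used for the control characters was deprecated in PHP 5.5 and removed in PHP 7, so the function above only runs on older versions. The following is a minimal sketch of the same escaping rules built on preg_replace_callback; the name json_string_cb is hypothetical, and the explicit character range is assumed to be equivalent to [[:cntrl:]] for single bytes.

```php
<?php
// Hypothetical name: a sketch of the same rules without the /e modifier.
function json_string_cb($string)
{
    // Step 1: escape backslashes and double quotes.
    $result = preg_replace('@[\\\\"]@', '\\\\$0', $string);
    // Step 2: use the short aliases for new line, carriage return and tab.
    $result = strtr($result, array("\n" => '\n', "\r" => '\r', "\t" => '\t'));
    // Step 3: escape the remaining control characters (0x00-0x1F and 0x7F) as \u00xx.
    $result = preg_replace_callback(
        '@[\x00-\x1f\x7f]@',
        function ($match) {
            return sprintf('\u%04x', ord($match[0]));
        },
        $result
    );
    // Step 4: escape the slash in "</x" so that browsers do not see a closing tag.
    $result = preg_replace('@</([A-Z])@i', '<\/$1', $result);
    return '"' . $result . '"';
}
```

The steps run in the same order as the entries of the $replacements array, so the backslashes introduced in steps 2 to 4 are never escaped a second time.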

A simple test like this

$test = array(
    'a null: '.chr(0).'; a new line: '.chr(10).'; a carriage return: '.chr(13).';',
    'a js regex: /(["\'])\w+\1/',
    'a script element: <script type="test/javascript" src="http://example.com/all.js"></script>',
    'a japanese word: みず'
);

echo '<pre>';
echo 'Zend_Json_Encoder::_encodeString: ', htmlspecialchars(print_r(array_map('_encodeString', $test), true));
echo 'json_string: ', htmlspecialchars(print_r(array_map('json_string', $test), true));
echo '</pre>';

yields (in comparison to the _encodeString method of the Zend_Json_Encoder class of Zend Framework)

Zend_Json_Encoder::_encodeString: Array
(
    [0] => "a null: ; a new line: \n; a carriage return: \r;"
    [1] => "a js regex: \/([\"'])\\w+\\1\/"
    [2] => "a script element: <script type=\"test\/javascript\" src=\"http:\/\/example.com\/all.js\"><\/script>"
    [3] => "a japanese word: \u307f\u305a"
)
json_string: Array
(
    [0] => "a null: \u0000; a new line: \n; a carriage return: \r;"
    [1] => "a js regex: /([\"'])\\w+\\1/"
    [2] => "a script element: <script type=\"test/javascript\" src=\"http://example.com/all.js\"><\/script>"
    [3] => "a japanese word: みず"
)

The Solidus Issue

Recently I’ve been studying the code of JSON encoders for PHP strings, and I discovered the solidus issue.

As a side note, this was the first time I saw a slash called a solidus, and a backslash called a reverse solidus: I always learn something new ;-)

So the solidus issue is: am I required to escape every slash in a JSON string?

Let’s see what Douglas Crockford specifies in RFC 4627:

2.5.  Strings

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".

   Alternatively, there are two-character sequence escape
   representations of some popular characters.  So, for example, a
   string containing only a single reverse solidus character may be
   represented more compactly as "\\".

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

Crockford                    Informational                      [Page 4]

RFC 4627                          JSON                         July 2006

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

I must say that the above string grammar is perfect. It tells you everything you need to know about valid JSON strings.

By contrast, the introductory notes are a bit confusing. I think the whole Strings section could be rewritten like this:

2.5 Strings

The representation of strings is similar to conventions used in the C family of programming languages.

A string is a sequence of characters wrapped in double quotes. A backslash is always related to the following character. Only a few characters can follow a backslash: some retain their literal meaning, some do not.

All the valid sequences of a backslash followed by a character (except Unicode escapes) are:

\"  which means the same as \u0022 (double quote)
\\  which means the same as \u005C (backslash)
\/  which means the same as \u002F (slash)
\b  which means the same as \u0008 (backspace)
\f  which means the same as \u000C (form feed)
\n  which means the same as \u000A (line feed)
\r  which means the same as \u000D (carriage return)
\t  which means the same as \u0009 (tab)

Any character inside the Unicode Basic Multilingual Plane (U+0000 through U+FFFF) may also appear as a sequence of six characters: a backslash, followed by the lowercase letter u, followed by four hexadecimal digits (upper or lowercase) for the character’s code point. So, for example, a string containing only a single backslash may appear as “\u005C”.

Any character outside the Unicode Basic Multilingual Plane may also appear as a sequence of twelve characters, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may appear as “\uD834\uDD1E”.

In the following grammar, assume that %x introduces a character whose hexadecimal Unicode code point follows %x.

string = "*char"
   " = %x22
   char = escaped | standard | unicode
      escaped = \same | \special
         \ = %x5C
         same = " | \ | /
            / = %x2F
         special = b | f | n | r | t
            b = %x62
            f = %x66
            n = %x6E
            r = %x72
            t = %x74
      standard = %x20 | %x21 | %x23 .. %x5B | %x5D .. %x10FFFF
      unicode = \u0000 .. \uFFFF
         u = %x75
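The twelve-character form for characters outside the Basic Multilingual Plane follows from the UTF-16 surrogate-pair arithmetic. Here is a small sketch that reproduces the G clef example; the function name json_unicode_escape is hypothetical and is not part of any library.

```php
<?php
// Hypothetical helper: returns the \u escape(s) for one Unicode code point.
function json_unicode_escape($codePoint)
{
    if ($codePoint <= 0xFFFF) {
        // Inside the Basic Multilingual Plane: one six-character sequence.
        return sprintf('\u%04X', $codePoint);
    }
    // Outside the BMP: split the code point into a UTF-16 surrogate pair.
    $offset = $codePoint - 0x10000;
    $high = 0xD800 + ($offset >> 10);   // top 10 bits
    $low  = 0xDC00 + ($offset & 0x3FF); // bottom 10 bits
    return sprintf('\u%04X\u%04X', $high, $low);
}

echo json_unicode_escape(0x5C), "\n";    // \u005C
echo json_unicode_escape(0x1D11E), "\n"; // \uD834\uDD1E
```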

Now it should be clear that no backslash is required before a slash in a JSON string, but if a backslash is provided it’s still a valid string. This is very clear if we look at the example that Douglas Crockford gives in the same RFC, where no slash is escaped in the given Url value:

 

8. Examples

   This is a JSON object:

   {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  "100"
          },
          "IDs": [116, 943, 234, 38793]
        }
   }

   Its Image member is an object whose Thumbnail member is an object
   and whose IDs member is an array of numbers.

   This is a JSON array containing two objects:

   [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
   ]

The slash may be escaped so that the JSON substring “</script>” can be embedded safely in HTML. By writing “<\/script>” one can be sure that the browser won’t mistake it for the closing script tag of the current embedded script.
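As a quick check with PHP’s built-in JSON functions, this sketch shows that both spellings decode to the same string, and that json_encode escapes slashes unless the JSON_UNESCAPED_SLASHES flag (available since PHP 5.4) is passed. The variable names are only illustrative.

```php
<?php
// Both spellings decode to the same PHP string.
$escaped   = json_decode('"<\/script>"');
$unescaped = json_decode('"</script>"');
var_dump($escaped === $unescaped); // bool(true)

// json_encode() escapes slashes by default; JSON_UNESCAPED_SLASHES turns that off.
$default = json_encode('</script>');                         // "<\/script>"
$plain   = json_encode('</script>', JSON_UNESCAPED_SLASHES); // "</script>"
echo $default, "\n", $plain, "\n";
```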

Hard to understand XHTML validation errors

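In the following examples, tags are written in uppercase for clarity; in real XHTML, use lowercase tags.

  • No DIV allowed in A (a block element cannot be nested in an inline element)
    do not write <A><DIV>Hello</DIV></A>
    write instead <A><SPAN>Hello</SPAN></A>
  • No INPUT allowed directly in FORM (XHTML Strict requires a block-level wrapper)
    do not write <FORM><INPUT /></FORM>
    write instead <FORM><DIV><INPUT /></DIV></FORM>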

Dealing with Zend Studio validations

I struggled for quite a while this afternoon to make Zend Studio behave as expected, and I got it!

Zend Studio was marking many warnings in WSDL files that it wrongfully interpreted as HTML rather than XML.

I wanted to disable validation only for those files and couldn’t find out how. I had already excluded their parent folder from the Build Path, but it didn’t work for warnings of this kind.

Those files have a php extension, and HTML validation is twofold: one for HTML files and one for PHP files.

So I changed the settings for the line reading HTML Syntax Validator (for PHP Files) from

to

and after a clean build I got

Call to undefined function ‘output_cache_disable’

This is a Zend Studio warning that has been bugging me lately.

The culprit is the file dummy.php, distributed with Zend Debugger. As per the installation instructions provided by the README file, one should “4. Copy the dummy.php file to your document root directory.” but the project I’m currently reviewing (FengOffice 1.7.3.1) has three copies of that file in three different folders. So a build in Zend Studio always gives that warning three times.

I know that in this particular case, I could simply erase all the dummy.php copies in the project, and forget the issue. And that’s exactly what I’m going to do.

But the issue is caused by a common programming idiom in PHP.

<?php
@ini_set('zend_monitor.enable', 0);
if(@function_exists('output_cache_disable')) {
	@output_cache_disable();
}
if(isset($_GET['debugger_connect']) && $_GET['debugger_connect'] == 1) {
	if(function_exists('debugger_connect'))  {
		debugger_connect();
		exit();
	} else {
		echo "No connector is installed.";
	}
}
?>

The code calls a contextual function only if it exists (lines 3-5). This is certainly correct, but it fools the simple Zend Studio validator.

It would be better if the code defined the expected contextual function when it doesn’t exist, and then called the function in any case (lines 3-6).

<?php
@ini_set('zend_monitor.enable', 0);
if(! @function_exists('output_cache_disable')) {
	function output_cache_disable() {}
}
@output_cache_disable();
if(isset($_GET['debugger_connect']) && $_GET['debugger_connect'] == 1) {
	if(function_exists('debugger_connect'))  {
		debugger_connect();
		exit();
	} else {
		echo "No connector is installed.";
	}
}
?>

This makes Zend Studio give no more warnings.

Anyway, all in all, I think that Zend Studio could be a little smarter and not give warnings for function calls inside function_exists blocks.

I know that this could be quite difficult to implement, so an alternative could be an option for telling Zend Studio not to show a specific warning, narrowed down to the file and line, i.e. to a single issue in the Problems panel.

© 2017 Notes Log
