Recently I’ve been studying the code of JSON encoders for PHP strings, and I came across the solidus issue.

As a side note, this was the first time I saw a slash called a solidus, and a backslash called a reverse solidus: I always learn something new ;-)

So the solidus issue is: am I required to escape every slash in a JSON string?

Let’s see what Douglas Crockford specifies in RFC 4627:

2.5.  Strings

   The representation of strings is similar to conventions used in the C
   family of programming languages.  A string begins and ends with
   quotation marks.  All Unicode characters may be placed within the
   quotation marks except for the characters that must be escaped:
   quotation mark, reverse solidus, and the control characters (U+0000
   through U+001F).

   Any character may be escaped.  If the character is in the Basic
   Multilingual Plane (U+0000 through U+FFFF), then it may be
   represented as a six-character sequence: a reverse solidus, followed
   by the lowercase letter u, followed by four hexadecimal digits that
   encode the character's code point.  The hexadecimal letters A though
   F can be upper or lowercase.  So, for example, a string containing
   only a single reverse solidus character may be represented as
   "\u005C".

   Alternatively, there are two-character sequence escape
   representations of some popular characters.  So, for example, a
   string containing only a single reverse solidus character may be
   represented more compactly as "\\".

   To escape an extended character that is not in the Basic Multilingual
   Plane, the character is represented as a twelve-character sequence,
   encoding the UTF-16 surrogate pair.  So, for example, a string
   containing only the G clef character (U+1D11E) may be represented as
   "\uD834\uDD1E".

Crockford                    Informational                      [Page 4]

RFC 4627                          JSON                         July 2006

         string = quotation-mark *char quotation-mark

         char = unescaped /
                escape (
                    %x22 /          ; "    quotation mark  U+0022
                    %x5C /          ; \    reverse solidus U+005C
                    %x2F /          ; /    solidus         U+002F
                    %x62 /          ; b    backspace       U+0008
                    %x66 /          ; f    form feed       U+000C
                    %x6E /          ; n    line feed       U+000A
                    %x72 /          ; r    carriage return U+000D
                    %x74 /          ; t    tab             U+0009
                    %x75 4HEXDIG )  ; uXXXX                U+XXXX

         escape = %x5C              ; \

         quotation-mark = %x22      ; "

         unescaped = %x20-21 / %x23-5B / %x5D-10FFFF

I must say that the above string grammar is perfect. It tells you everything you need to know about valid JSON strings.
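To see how little the grammar leaves open, it can be transcribed almost line by line into a regular expression. The following is only a sketch in Python: the names `JSON_STRING` and `is_json_string` are my own, and the pattern checks a single string literal, not a whole JSON document.

```python
import re

# Regex transcription of the RFC 4627 string grammar quoted above.
# JSON_STRING and is_json_string are illustrative names, not part of any library.
JSON_STRING = re.compile(
    r'"'                                    # quotation-mark
    r'(?:'
    r'[^"\\\x00-\x1f]'                      # unescaped: anything but ", \ and U+0000..U+001F
    r'|\\(?:["\\/bfnrt]|u[0-9A-Fa-f]{4})'   # escape, then one of the allowed characters
    r')*'
    r'"'                                    # quotation-mark
)


def is_json_string(text: str) -> bool:
    """Return True if text is one complete JSON string literal, quotes included."""
    return JSON_STRING.fullmatch(text) is not None
```

With this sketch both `is_json_string(r'"a/b"')` and `is_json_string(r'"a\/b"')` are true, while a backslash followed by any other character, as in `r'"a\qb"'`, is rejected.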

By contrast, the introductory notes are a bit confusing. I think the whole Strings section could be rewritten like this:

2.5 Strings

The representation of strings is similar to conventions used in the C family of programming languages.

A string is a sequence of characters wrapped in double quotes. A backslash always applies to the character that follows it. Only a few characters can follow a backslash: some retain their literal meaning, some do not.

All the valid sequences of a backslash followed by a character (other than Unicode escapes) are:

\"  which means the same as \u0022 (double quote)
\\  which means the same as \u005C (backslash)
\/  which means the same as \u002F (slash)
\b  which means the same as \u0008 (backspace)
\f  which means the same as \u000C (form feed)
\n  which means the same as \u000A (line feed)
\r  which means the same as \u000D (carriage return)
\t  which means the same as \u0009 (tab)

Any character inside the Unicode Basic Multilingual Plane (U+0000 through U+FFFF) may also appear as a sequence of six characters: a backslash, followed by the lowercase letter u, followed by four hexadecimal digits (upper or lowercase) for the character’s code point. So, for example, a string containing only a single backslash may appear as “\u005C”.

Any character outside the Unicode Basic Multilingual Plane may also appear as a sequence of twelve characters, encoding the UTF-16 surrogate pair. So, for example, a string containing only the G clef character (U+1D11E) may appear as “\uD834\uDD1E”.

In the following grammar, assume that %x introduces a Unicode character whose hexadecimal code point follows %x.

string = "*char"
   " = %x22
   char = escaped | standard | unicode
      escaped = \same | \special
         \ = %x5C
         same = " | \ | /
            / = %x2F
         special =  b | f | n | r | t
            b = %x62
            f = %x66
            n = %x6E
            r = %x72
            t = %x74
      standard = %x20 | %x21 | %x23 .. %x5B | %x5D .. %x10FFFF
      unicode = \u0000 .. \uFFFF
         u = %x75
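The six-character and twelve-character forms described above are easy to compute by hand. Here is a small Python sketch of the arithmetic; `utf16_escape` is a name I made up for illustration, and I use Python's standard `json` module only to check that a decoder reads the escape back as the original character.

```python
import json


def utf16_escape(ch: str) -> str:
    """Return the six- or twelve-character escape for one character (illustrative helper)."""
    cp = ord(ch)
    if cp <= 0xFFFF:
        # Inside the Basic Multilingual Plane: one escape with four hex digits.
        return f"\\u{cp:04X}"
    # Outside the BMP: split the code point into a UTF-16 surrogate pair.
    cp -= 0x10000
    high = 0xD800 + (cp >> 10)    # top 10 bits
    low = 0xDC00 + (cp & 0x3FF)   # bottom 10 bits
    return f"\\u{high:04X}\\u{low:04X}"


g_clef = "\U0001D11E"
escaped = utf16_escape(g_clef)                # the twelve-character form
round_trip = json.loads('"' + escaped + '"')  # decodes back to the G clef
```

For the G clef this gives the same surrogate pair as the RFC example, D834 followed by DD1E, and for a backslash it gives the six-character form ending in 005C.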

Now it should be clear that no backslash is required before a slash in a JSON string, but if a backslash is provided it’s still a valid string. This is very clear if we look at the example that Douglas Crockford gives in the same RFC, where no slash is escaped in the given Url value:
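A quick way to convince yourself is to feed both spellings to a JSON decoder. This sketch uses Python's standard `json` module rather than a PHP encoder, and the variable names are mine:

```python
import json

# The same URL written with and without escaped slashes.
with_backslashes = r'"http:\/\/www.example.com\/image\/481989943"'
without_backslashes = '"http://www.example.com/image/481989943"'

decoded_escaped = json.loads(with_backslashes)
decoded_plain = json.loads(without_backslashes)

# Both literals decode to the very same string.
same_value = decoded_escaped == decoded_plain

# Python's encoder leaves slashes unescaped when writing JSON.
encoded = json.dumps(decoded_plain)
```

Here `same_value` is true, and `encoded` contains no backslash at all, which is just as valid as the escaped form.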

 

8. Examples

   This is a JSON object:

   {
      "Image": {
          "Width":  800,
          "Height": 600,
          "Title":  "View from 15th Floor",
          "Thumbnail": {
              "Url":    "http://www.example.com/image/481989943",
              "Height": 125,
              "Width":  "100"
          },
          "IDs": [116, 943, 234, 38793]
        }
   }

   Its Image member is an object whose Thumbnail member is an object
   and whose IDs member is an array of numbers.

   This is a JSON array containing two objects:

   [
      {
         "precision": "zip",
         "Latitude":  37.7668,
         "Longitude": -122.3959,
         "Address":   "",
         "City":      "SAN FRANCISCO",
         "State":     "CA",
         "Zip":       "94107",
         "Country":   "US"
      },
      {
         "precision": "zip",
         "Latitude":  37.371991,
         "Longitude": -122.026020,
         "Address":   "",
         "City":      "SUNNYVALE",
         "State":     "CA",
         "Zip":       "94085",
         "Country":   "US"
      }
   ]

The reason the slash is allowed to be escaped is to make it safe to embed JSON containing the substring “</script>” in HTML. By writing “<\/script>” one can be sure that the browser won’t mistake it for the closing script tag of the current embedded script.
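As a rough illustration of that trick, here is a Python sketch that serializes a value and then rewrites every “</” as “<\/”. The helper name `json_for_script_tag` is hypothetical, and the sketch only covers the closing-tag case discussed here, not every HTML embedding concern.

```python
import json


def json_for_script_tag(value) -> str:
    """Serialize value as JSON with the slash escaped after every '<' (hypothetical helper)."""
    # In JSON output, '<' can only occur inside a string, so inserting a
    # backslash before the following slash keeps the text valid JSON.
    return json.dumps(value).replace("</", "<\\/")


payload = {"html": "</script><script>alert(1)</script>"}
embedded = json_for_script_tag(payload)
decoded = json.loads(embedded)
```

The resulting text no longer contains “</script>”, yet a JSON decoder still reads back exactly the original value, because the escaped and unescaped slash mean the same character.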

References