PHP: Valid Javascript identifiers (or function names) for JSONP

Published:

After another reply to a question I've had on StackOverflow for a while, I decided that I perhaps should add another level of security to my method of providing JSONP from PHP. The way I did it before, I didn't do any checking on the provided callback. This means that someone could technically put whatever they wanted in there, including malicious code. So, therefore it might be a good idea to check if the callback, which should be a function name, actually is a valid function name. But ...

What is valid?

To figure that out, we need a look in the ECMAScript Language Specification. In chapter 13 on functions, we find that a function name is a so-called identifier, which is described in chapter 7.6. There we can find the following facts:

IdentifierIdentifierName, but not ReservedWord
IdentifierNameIdentifierStart IdentifierName IdentifierPart
IdentifierStartUnicodeLetter $ _ \ UnicodeEscapeSequence
IdentifierPart

IdentifierStart UnicodeCombiningMark UnicodeDigit UnicodeConnectorPunctuation ZWNJ ZWJ

UnicodeLetter

Uppercase letter (Lu) Lowercase letter (Ll) Titlecase letter (Lt) Modifier letter (Lm) Other letter (Lo) Letter number (Nl)

UnicodeCombiningMarkNon-spacing mark (Mn) Combining spacing mark (Mc)
UnicodeDigitDecimal number (Nd)
UnicodeConnectorPunctuationConnector punctuation (Pc)
UnicodeEscapeSequence

The definitions of the nonterminal UnicodeEscapeSequence is given in 7.8.4

ZWNJ

U+200C ( Zero-width non-joiner)

ZWJU+200D (Zero-width joiner)
ReservedWordKeyword FutureReservedWord NullLiteral BooleanLiteral
Keyword

break, do, instanceof, typeof, case, else, new, var, catch, finally, return, void, continue, for, switch, while, debugger, function, this, with, default, if, throw, delete, in, try

FutureReservedWord

class, enum, extends, super, const, export, import implements, let, private, public, yield, interface, package, protected, static

NullLiteralnull
BooleanLiteraltrue, false

Looks long, but not too complicated.

Checking if a string is valid

To check if a string is a valid identifier is now pretty easy. We just need to make sure the string matches the allowed syntax, and that it's not a reserved word. The first we can solve with a regular expression and the second with a simple white list array. For example, something along the following lines.

function is_valid_identifier($subject)
{
    $identifier_syntax
      = '/^[$_\p{L}][$_\p{L}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\x{200C}\x{200D}]*+$/u';

    $reserved_words = new array('break', 'do', 'instanceof', 'typeof', 'case',
      'else', 'new', 'var', 'catch', 'finally', 'return', 'void', 'continue',
      'for', 'switch', 'while', 'debugger', 'function', 'this', 'with',
      'default', 'if', 'throw', 'delete', 'in', 'try', 'class', 'enum',
      'extends', 'super', 'const', 'export', 'import', 'implements', 'let',
      'private', 'public', 'yield', 'interface', 'package', 'protected',
      'static', 'null', 'true', 'false');

    return preg_match($identifier_syntax, $subject)
        && ! in_array(mb_strtolower($subject, 'UTF-8'), $reserved_words);
}

Not too complex, although the regular expression might look a bit nuts at first because of all the Unicode character groups. You might find regular expressions other places to do this that uses a-z for the letters, but as you saw from the specification that won't cover much of what's actually valid.

To build the regex I used RegexBuddy which is a very helpful tool to both write, test, and understand regular expressions.

Complete code

<?php

/**
 * Checks if the given subject is a valid js callback function.
 *
 * @author     Torleif Berger
 * @link       http://www.geekality.net/?p=1739
 * @return     TRUE if the subject is valid; otherwise FALSE.
 */
function is_valid_callback($subject)
{
    $identifier_syntax
        = '/^[$_\p{L}][$_\p{L}\p{Mn}\p{Mc}\p{Nd}\p{Pc}\x{200C}\x{200D}]*+$/u';

    $reserved_words = array('break', 'do', 'instanceof', 'typeof', 'case',
        'else', 'new', 'var', 'catch', 'finally', 'return', 'void', 'continue',
        'for', 'switch', 'while', 'debugger', 'function', 'this', 'with',
        'default', 'if', 'throw', 'delete', 'in', 'try', 'class', 'enum',
        'extends', 'super', 'const', 'export', 'import', 'implements', 'let',
        'private', 'public', 'yield', 'interface', 'package', 'protected',
        'static', 'null', 'true', 'false');

    return preg_match($identifier_syntax, $subject)
        && ! in_array(mb_strtolower($subject, 'UTF-8'), $reserved_words);
}

📝 I have ignored the issue with the Unicode escape sequences for now as I'm not quite sure how to best handle those. From the specification:

A UnicodeEscapeSequence cannot be used to put a character into an IdentifierName that would otherwise be illegal. In other words, if a \UnicodeEscapeSequence sequence were replaced by its UnicodeEscapeSequence's CV, the result must still be a valid IdentifierName that has the exact same sequence of characters as the original IdentifierName. All interpretations of identifiers within this specification are based upon their actual characters regardless of whether or not an escape sequence was used to contribute any particular characters.

So, I'm not sure if there is a way to just convert those sequences into actual characters or if this is done automatically by PHP as they come in as GET parameters or what. Either way, my code above there is ignoring them. This means, identifiers with escape sequences will not be considered valid. If you have some good ideas on how to handle it, please leave a comment. 🙂