JavaScript regex to match a JavaScript string literal

Question

I need a regular expression that matches a JavaScript string: that is, a string enclosed in single quotes, in which a single quote may appear only if it is escaped with a backslash.

Examples of strings that I want to match:

'abcdefg'
'abc\'defg'
'abc\'de\'fg'
Solution

This regex matches every valid JavaScript string literal enclosed in single quotes (') and rejects all invalid ones. Strict mode is assumed. To validate a whole input rather than search inside a longer text, anchor the regex with ^ and $.

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'/

Or a shorter version:

/'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'/

The regex above is based on the definition of StringLiteral (ignoring the double-quoted version) in the ECMAScript Language Specification, 5.1 Edition, published in June 2011.

The regex for a JavaScript string literal enclosed in double quotes (") is almost the same:

/"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"/
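As a quick sanity check, the shorter regexes can be anchored and run against the text of some candidate literals. This is only a minimal sketch: the ^ and $ anchors and the names singleQuoted, doubleQuoted and isSingleQuotedLiteral are assumptions added for illustration. Backslashes are doubled in the sample strings because the samples are themselves written as JavaScript string literals.

```javascript
// Anchored copies of the shorter regexes, so the whole input must be one literal.
const singleQuoted = /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*'$/;
const doubleQuoted = /^"(?:[^"\\\n\r\u2028\u2029]|\\(?:[^\n\rxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}|\n|\r\n?))*"$/;

// Hypothetical helper: true if `text` is exactly one single-quoted literal.
function isSingleQuotedLiteral(text) {
  return singleQuoted.test(text);
}

// The three examples from the question.
console.log(isSingleQuotedLiteral("'abcdefg'"));       // true
console.log(isSingleQuotedLiteral("'abc\\'defg'"));    // true
console.log(isSingleQuotedLiteral("'abc\\'de\\'fg'")); // true

// Invalid literals.
console.log(isSingleQuotedLiteral("'abc'defg'"));      // false: unescaped quote inside
console.log(isSingleQuotedLiteral("'abc\\x4'"));       // false: \x needs two hex digits

// The double-quoted variant.
console.log(doubleQuoted.test('"say \\"hi\\""'));      // true
```

Without the anchors, the same regexes find the first literal inside a longer piece of text instead of validating the whole input.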

Let's dissect the monster (the longer version, since it is a direct translation of the grammar):

  • A StringLiteral (ignoring the double-quoted version) starts and ends with ', as can be seen in the regex. Between the quotes is an optional sequence of SingleStringCharacter, which explains the *: zero or more characters.

  • SingleStringCharacter is defined as:

    SingleStringCharacter ::
           SourceCharacter but not one of ' or \ or LineTerminator
           \ EscapeSequence
           LineContinuation
    

    [^'\\\n\r\u2028\u2029] corresponds to the first rule

    \\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4}) corresponds to the second rule

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the third rule
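To make the mapping from the three grammar rules to the regex more concrete, here is a sketch that keeps each rule as its own small regex and glues them back together through the source property. The variable names and the ^ and $ anchors are my own additions for illustration, not part of the original regex.

```javascript
// The three alternatives of SingleStringCharacter, as separate pieces.
const normalChar = /[^'\\\n\r\u2028\u2029]/;
const escapeSequence = /\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})/;
const lineContinuation = /\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029])/;

// Opening quote, zero or more characters (any of the three rules), closing quote.
// Anchored so that the whole input has to be a single literal.
const composedLiteral = new RegExp(
  "^'(?:" +
    [normalChar, escapeSequence, lineContinuation].map((r) => r.source).join("|") +
    ")*'$"
);

console.log(composedLiteral.test("''"));            // true: the sequence is optional
console.log(composedLiteral.test("'abc\\'defg'"));  // true
console.log(composedLiteral.test("'abc"));          // false: no closing quote
```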

  • Let's look at the first rule: SourceCharacter but not one of ' or \ or LineTerminator. This first rule deals with "normal" characters.

    SourceCharacter is any Unicode code unit.

    LineTerminator is Line Feed <LF> (\u000A or \n), Carriage Return <CR> (\u000D or \r), Line Separator <LS> (\u2028) or Paragraph Separator <PS> (\u2029).

    So we will just use a negative character class to represent this rule: [^'\\\n\r\u2028\u2029].
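A small sketch of the first rule on its own, tested one character at a time. The anchors and the name normalChar are assumptions for illustration only.

```javascript
// First rule only: exactly one "normal" character.
const normalChar = /^[^'\\\n\r\u2028\u2029]$/;

const samples = [
  ["letter a", "a"],
  ["double quote", '"'],
  ["single quote", "'"],
  ["backslash", "\\"],
  ["line feed", "\n"],
  ["line separator", "\u2028"],
];

for (const [label, ch] of samples) {
  console.log(label, normalChar.test(ch));
}
// letter a and double quote are accepted; the other four are rejected.
```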

  • For the second rule, which deals with escape sequences, you can see \ before EscapeSequence, as it appears in the regex. As for EscapeSequence, this is its grammar (strict mode):

    EscapeSequence ::
            CharacterEscapeSequence
            0 [lookahead ∉ DecimalDigit]
            HexEscapeSequence
            UnicodeEscapeSequence
    

    ['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9] is the regex for CharacterEscapeSequence. It can actually be simplified to [^\n\r\u2028\u2029xu0-9]

    The first part is SingleEscapeCharacter, which covers ', ", \ and the control-character escapes b, f, n, r, t and v.

    The second part is NonEscapeCharacter, which is SourceCharacter but not one of EscapeCharacter or LineTerminator. EscapeCharacter is defined as SingleEscapeCharacter, DecimalDigit or x (for hex escape sequence) or u (for unicode escape sequence).

    0(?![0-9]) is the regex for the second rule of EscapeSequence. It specifies the null character, \0.

    x[0-9a-fA-F]{2} is the regex for HexEscapeSequence

    u[0-9a-fA-F]{4} is the regex for UnicodeEscapeSequence
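The escape-sequence alternative can be tried in isolation, and the claimed simplification of the CharacterEscapeSequence part can be checked by brute force over every UTF-16 code unit. This is a sketch under my own assumptions: the anchors and all variable names are illustrative, and the loop relies on the regexes having no u flag, so that they work on code units.

```javascript
// Second rule only: a backslash followed by one EscapeSequence.
const escapeSequence = /^\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})$/;

console.log(escapeSequence.test("\\n"));     // true:  SingleEscapeCharacter
console.log(escapeSequence.test("\\q"));     // true:  NonEscapeCharacter
console.log(escapeSequence.test("\\0"));     // true:  null character
console.log(escapeSequence.test("\\01"));    // false: 0 followed by a digit
console.log(escapeSequence.test("\\7"));     // false: other digits are not allowed
console.log(escapeSequence.test("\\x41"));   // true:  HexEscapeSequence
console.log(escapeSequence.test("\\x4"));    // false: needs two hex digits
console.log(escapeSequence.test("\\u00e9")); // true:  UnicodeEscapeSequence
console.log(escapeSequence.test("\\u12"));   // false: needs four hex digits

// Check that the two character classes together equal the simplified class.
const twoClasses = /^(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9])$/;
const simplified = /^[^\n\r\u2028\u2029xu0-9]$/;

let mismatches = 0;
for (let code = 0; code <= 0xffff; code++) {
  const ch = String.fromCharCode(code);
  if (twoClasses.test(ch) !== simplified.test(ch)) mismatches++;
}
console.log(mismatches); // 0
```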

  • The third rule deals with strings that span multiple lines. Let's look at the grammar of LineContinuation and the related LineTerminatorSequence:

    LineContinuation ::
            \ LineTerminatorSequence
    
    LineTerminatorSequence :: 
            <LF> 
            <CR> [lookahead ∉ <LF> ]
            <LS>
            <PS>
            <CR> <LF>
    

    \\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]) corresponds to the above grammar.
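Finally, a sketch of the third rule on its own, followed by the longer regex applied to a literal that spans two lines. As before, the anchors and the names lineContinuation and longLiteral are assumptions for illustration. In the sample strings, "\\\n" is a backslash followed by a real line feed, while "\\n" is a backslash followed by the letter n.

```javascript
// Third rule only: a backslash followed by one LineTerminatorSequence.
const lineContinuation = /^\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029])$/;

console.log(lineContinuation.test("\\\n"));     // true:  backslash + <LF>
console.log(lineContinuation.test("\\\r\n"));   // true:  backslash + <CR><LF>
console.log(lineContinuation.test("\\\r"));     // true:  backslash + lone <CR>
console.log(lineContinuation.test("\\\u2029")); // true:  backslash + <PS>
console.log(lineContinuation.test("\\n"));      // false: that is an escape sequence

// The longer regex, anchored, on multi-line input.
const longLiteral = /^'(?:[^'\\\n\r\u2028\u2029]|\\(?:['"\\bfnrtv]|[^\n\r\u2028\u2029'"\\bfnrtvxu0-9]|0(?![0-9])|x[0-9a-fA-F]{2}|u[0-9a-fA-F]{4})|\\(?:\n|\r\n|\r(?!\n)|[\u2028\u2029]))*'$/;

console.log(longLiteral.test("'abc\\\ndef'")); // true:  line break preceded by a backslash
console.log(longLiteral.test("'abc\ndef'"));   // false: bare line break inside the literal
```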