Changeset 255134 in webkit


Ignore:
Timestamp:
Jan 26, 2020 3:28:44 PM (4 years ago)
Author:
Alexey Shvayka
Message:

Invalid ranges in character classes should be banned in unicode patterns
https://bugs.webkit.org/show_bug.cgi?id=206768

Reviewed by Darin Adler.

JSTests:

  • test262/expectations.yaml: Mark 18 test cases as passing.

Source/JavaScriptCore:

In ES5, grammar of CharacterRange was ambiguous, resulting in invalid ranges
like /[\d-a]/ being allowed. As of ES2015, invalid ranges are SyntaxError in
unicode patterns, yet still allowed in regular ones to avoid breaking the web.
(https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors-annexb)

This change adds SyntaxError for unicode patterns and updates explanatory
comments. ErrorCode::CharacterClassOutOfOrder is renamed for consistency
with newly added error code and ErrorCode::ParenthesesTypeInvalid.

  • yarr/YarrErrorCode.cpp:

(JSC::Yarr::errorMessage):
(JSC::Yarr::errorToThrow):

  • yarr/YarrErrorCode.h:
  • yarr/YarrParser.h:

(JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::CharacterClassParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::parseCharacterClass):

Location:
trunk
Files:
6 edited

Legend:

Unmodified
Added
Removed
  • trunk/JSTests/ChangeLog

    r255112 r255134  
     12020-01-26  Alexey Shvayka  <shvaikalesh@gmail.com>
     2
     3        Invalid ranges in character classes should be banned in unicode patterns
     4        https://bugs.webkit.org/show_bug.cgi?id=206768
     5
     6        Reviewed by Darin Adler.
     7
     8        * test262/expectations.yaml: Mark 18 test cases as passing.
     9
    1102020-01-24  Mark Lam  <mark.lam@apple.com>
    211
  • trunk/JSTests/test262/expectations.yaml

    r254757 r255134  
    12411241  default: 'SyntaxError: Invalid regular expression: invalid group specifier name'
    12421242  strict mode: 'SyntaxError: Invalid regular expression: invalid group specifier name'
    1243 test/built-ins/RegExp/property-escapes/character-class-range-end.js:
    1244   default: 'Test262: This statement should not be evaluated.'
    1245   strict mode: 'Test262: This statement should not be evaluated.'
    1246 test/built-ins/RegExp/property-escapes/character-class-range-no-dash-end.js:
    1247   default: 'Test262: This statement should not be evaluated.'
    1248   strict mode: 'Test262: This statement should not be evaluated.'
    1249 test/built-ins/RegExp/property-escapes/character-class-range-no-dash-start.js:
    1250   default: 'Test262: This statement should not be evaluated.'
    1251   strict mode: 'Test262: This statement should not be evaluated.'
    1252 test/built-ins/RegExp/property-escapes/character-class-range-start.js:
    1253   default: 'Test262: This statement should not be evaluated.'
    1254   strict mode: 'Test262: This statement should not be evaluated.'
    12551243test/built-ins/RegExp/property-escapes/generated/Alphabetic.js:
    12561244  default: 'Test262Error: `\p{Alphabetic}` should match U+001CFA (`ᳺ`)'
     
    16101598  default: 'Test262Error: RegExp("]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
    16111599  strict mode: 'Test262Error: RegExp("]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
    1612 test/built-ins/RegExp/unicode_restricted_character_class_escape.js:
    1613   default: 'Test262Error: RegExp("[\d-a]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
    1614   strict mode: 'Test262Error: RegExp("[\d-a]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
    16151600test/built-ins/RegExp/unicode_restricted_identity_escape.js:
    16161601  default: "Test262Error: Invalid IdentityEscape in AtomEscape: '\\"
     
    34243409  default: 'Test262: This statement should not be evaluated.'
    34253410  strict mode: 'Test262: This statement should not be evaluated.'
    3426 test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-a.js:
    3427   default: 'Test262: This statement should not be evaluated.'
    3428   strict mode: 'Test262: This statement should not be evaluated.'
    3429 test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-ab.js:
    3430   default: 'Test262: This statement should not be evaluated.'
    3431   strict mode: 'Test262: This statement should not be evaluated.'
    3432 test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-b.js:
    3433   default: 'Test262: This statement should not be evaluated.'
    3434   strict mode: 'Test262: This statement should not be evaluated.'
    3435 test/language/literals/regexp/u-invalid-non-empty-class-ranges.js:
    3436   default: 'Test262: This statement should not be evaluated.'
    3437   strict mode: 'Test262: This statement should not be evaluated.'
    34383411test/language/literals/regexp/u-invalid-oob-decimal-escape.js:
    34393412  default: 'Test262: This statement should not be evaluated.'
  • trunk/Source/JavaScriptCore/ChangeLog

    r255120 r255134  
     12020-01-26  Alexey Shvayka  <shvaikalesh@gmail.com>
     2
     3        Invalid ranges in character classes should be banned in unicode patterns
     4        https://bugs.webkit.org/show_bug.cgi?id=206768
     5
     6        Reviewed by Darin Adler.
     7
     8        In ES5, grammar of CharacterRange was ambiguous, resulting in invalid ranges
     9        like /[\d-a]/ being allowed. As of ES2015, invalid ranges are SyntaxError in
     10        unicode patterns, yet still allowed in regular ones to avoid breaking the web.
     11        (https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors-annexb)
     12
     13        This change adds SyntaxError for unicode patterns and updates explanatory
     14        comments. ErrorCode::CharacterClassOutOfOrder is renamed for consistency
     15        with newly added error code and ErrorCode::ParenthesesTypeInvalid.
     16
     17        * yarr/YarrErrorCode.cpp:
     18        (JSC::Yarr::errorMessage):
     19        (JSC::Yarr::errorToThrow):
     20        * yarr/YarrErrorCode.h:
     21        * yarr/YarrParser.h:
     22        (JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
     23        (JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
     24        (JSC::Yarr::Parser::CharacterClassParserDelegate::atomBuiltInCharacterClass):
     25        (JSC::Yarr::Parser::parseCharacterClass):
     26
    1272020-01-24  Mark Lam  <mark.lam@apple.com>
    228
  • trunk/Source/JavaScriptCore/yarr/YarrErrorCode.cpp

    r251425 r255134  
    3636    // The order of this array must match the ErrorCode enum.
    3737    static const char* errorMessages[] = {
    38         nullptr,                                                              // NoError
    39         REGEXP_ERROR_PREFIX "regular expression too large",                   // PatternTooLarge
    40         REGEXP_ERROR_PREFIX "numbers out of order in {} quantifier",          // QuantifierOutOfOrder
    41         REGEXP_ERROR_PREFIX "nothing to repeat",                              // QuantifierWithoutAtom
    42         REGEXP_ERROR_PREFIX "number too large in {} quantifier",              // QuantifierTooLarge
    43         REGEXP_ERROR_PREFIX "missing )",                                      // MissingParentheses
    44         REGEXP_ERROR_PREFIX "unmatched parentheses",                          // ParenthesesUnmatched
    45         REGEXP_ERROR_PREFIX "unrecognized character after (?",                // ParenthesesTypeInvalid
    46         REGEXP_ERROR_PREFIX "invalid group specifier name",                   // InvalidGroupName
    47         REGEXP_ERROR_PREFIX "duplicate group specifier name",                 // DuplicateGroupName
    48         REGEXP_ERROR_PREFIX "missing terminating ] for character class",      // CharacterClassUnmatched
    49         REGEXP_ERROR_PREFIX "range out of order in character class",          // CharacterClassOutOfOrder
    50         REGEXP_ERROR_PREFIX "\\ at end of pattern",                           // EscapeUnterminated
    51         REGEXP_ERROR_PREFIX "invalid unicode {} escape",                      // InvalidUnicodeEscape
    52         REGEXP_ERROR_PREFIX "invalid backreference for unicode pattern",      // InvalidBackreference
    53         REGEXP_ERROR_PREFIX "invalid escaped character for unicode pattern",  // InvalidIdentityEscape
    54         REGEXP_ERROR_PREFIX "invalid property expression",                    // InvalidUnicodePropertyExpression
    55         REGEXP_ERROR_PREFIX "too many nested disjunctions",                   // TooManyDisjunctions
    56         REGEXP_ERROR_PREFIX "pattern exceeds string length limits",           // OffsetTooLarge
    57         REGEXP_ERROR_PREFIX "invalid flags"                                   // InvalidRegularExpressionFlags
     38        nullptr,                                                                    // NoError
     39        REGEXP_ERROR_PREFIX "regular expression too large",                         // PatternTooLarge
     40        REGEXP_ERROR_PREFIX "numbers out of order in {} quantifier",                // QuantifierOutOfOrder
     41        REGEXP_ERROR_PREFIX "nothing to repeat",                                    // QuantifierWithoutAtom
     42        REGEXP_ERROR_PREFIX "number too large in {} quantifier",                    // QuantifierTooLarge
     43        REGEXP_ERROR_PREFIX "missing )",                                            // MissingParentheses
     44        REGEXP_ERROR_PREFIX "unmatched parentheses",                                // ParenthesesUnmatched
     45        REGEXP_ERROR_PREFIX "unrecognized character after (?",                      // ParenthesesTypeInvalid
     46        REGEXP_ERROR_PREFIX "invalid group specifier name",                         // InvalidGroupName
     47        REGEXP_ERROR_PREFIX "duplicate group specifier name",                       // DuplicateGroupName
     48        REGEXP_ERROR_PREFIX "missing terminating ] for character class",            // CharacterClassUnmatched
     49        REGEXP_ERROR_PREFIX "range out of order in character class",                // CharacterClassRangeOutOfOrder
     50        REGEXP_ERROR_PREFIX "invalid range in character class for unicode pattern", // CharacterClassRangeInvalid
     51        REGEXP_ERROR_PREFIX "\\ at end of pattern",                                 // EscapeUnterminated
     52        REGEXP_ERROR_PREFIX "invalid unicode {} escape",                            // InvalidUnicodeEscape
     53        REGEXP_ERROR_PREFIX "invalid backreference for unicode pattern",            // InvalidBackreference
     54        REGEXP_ERROR_PREFIX "invalid escaped character for unicode pattern",        // InvalidIdentityEscape
     55        REGEXP_ERROR_PREFIX "invalid property expression",                          // InvalidUnicodePropertyExpression
     56        REGEXP_ERROR_PREFIX "too many nested disjunctions",                         // TooManyDisjunctions
     57        REGEXP_ERROR_PREFIX "pattern exceeds string length limits",                 // OffsetTooLarge
     58        REGEXP_ERROR_PREFIX "invalid flags"                                         // InvalidRegularExpressionFlags
    5859    };
    5960
     
    7778    case ErrorCode::DuplicateGroupName:
    7879    case ErrorCode::CharacterClassUnmatched:
    79     case ErrorCode::CharacterClassOutOfOrder:
     80    case ErrorCode::CharacterClassRangeOutOfOrder:
     81    case ErrorCode::CharacterClassRangeInvalid:
    8082    case ErrorCode::EscapeUnterminated:
    8183    case ErrorCode::InvalidUnicodeEscape:
  • trunk/Source/JavaScriptCore/yarr/YarrErrorCode.h

    r251425 r255134  
    4646    DuplicateGroupName,
    4747    CharacterClassUnmatched,
    48     CharacterClassOutOfOrder,
     48    CharacterClassRangeOutOfOrder,
     49    CharacterClassRangeInvalid,
    4950    EscapeUnterminated,
    5051    InvalidUnicodeEscape,
  • trunk/Source/JavaScriptCore/yarr/YarrParser.h

    r250005 r255134  
    5555    class CharacterClassParserDelegate {
    5656    public:
    57         CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
     57        CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err, bool isUnicode)
    5858            : m_delegate(delegate)
    5959            , m_errorCode(err)
     60            , m_isUnicode(isUnicode)
    6061            , m_state(Empty)
    6162            , m_character(0)
     
    8687            switch (m_state) {
    8788            case AfterCharacterClass:
    88                 // Following a builtin character class we need look out for a hyphen.
     89                // Following a built-in character class we need look out for a hyphen.
    8990                // We're looking for invalid ranges, such as /[\d-x]/ or /[\d-\d]/.
    90                 // If we see a hyphen following a charater class then unlike usual
     91                // If we see a hyphen following a character class then unlike usual
    9192                // we'll report it to the delegate immediately, and put ourself into
    92                 // a poisoned state. Any following calls to add another character or
    93                 // character class will result in an error. (A hypen following a
    94                 // character-class is itself valid, but only  at the end of a regex).
     93                // a poisoned state. In a unicode pattern, any following calls to add
     94                // another character or character class will result in syntax error.
     95                // A hypen following a character class is itself valid, but only at
     96                // the end of a regex.
    9597                if (hyphenIsRange && ch == '-') {
    9698                    m_delegate.atomCharacterClassAtom('-');
     
    117119            case CachedCharacterHyphen:
    118120                if (ch < m_character) {
    119                     m_errorCode = ErrorCode::CharacterClassOutOfOrder;
     121                    m_errorCode = ErrorCode::CharacterClassRangeOutOfOrder;
    120122                    return;
    121123                }
     
    124126                return;
    125127
    126                 // See coment in atomBuiltInCharacterClass below.
    127                 // This too is technically an error, per ECMA-262, and again we
    128                 // we chose to allow this.  Note a subtlely here that while we
    129                 // diverge from the spec's definition of CharacterRange we do
    130                 // remain in compliance with the grammar.  For example, consider
    131                 // the expression /[\d-a-z]/.  We comply with the grammar in
    132                 // this case by not allowing a-z to be matched as a range.
     128                // If we hit this case, we have an invalid range like /[\d-a]/.
     129                // See coment in atomBuiltInCharacterClass() below.
    133130            case AfterCharacterClassHyphen:
     131                if (m_isUnicode) {
     132                    m_errorCode = ErrorCode::CharacterClassRangeInvalid;
     133                    return;
     134                }
    134135                m_delegate.atomCharacterClassAtom(ch);
    135136                m_state = Empty;
     
    152153            case Empty:
    153154            case AfterCharacterClass:
     155                m_delegate.atomCharacterClassBuiltIn(classID, invert);
    154156                m_state = AfterCharacterClass;
    155                 m_delegate.atomCharacterClassBuiltIn(classID, invert);
    156157                return;
    157158
    158159                // If we hit either of these cases, we have an invalid range that
    159                 // looks something like /[x-\d]/ or /[\d-\d]/.
    160                 // According to ECMA-262 this should be a syntax error, but
    161                 // empirical testing shows this to break teh webz.  Instead we
    162                 // comply with to the ECMA-262 grammar, and assume the grammar to
    163                 // have matched the range correctly, but tweak our interpretation
    164                 // of CharacterRange.  Effectively we implicitly handle the hyphen
    165                 // as if it were escaped, e.g. /[\w-_]/ is treated as /[\w\-_]/.
     160                // looks something like /[a-\d]/ or /[\d-\d]/.
     161                // Since ES2015, this should be syntax error in a unicode pattern,
     162                // yet gracefully handled in a regular regex to avoid breaking the web.
     163                // Effectively we handle the hyphen as if it was (implicitly) escaped,
     164                // e.g. /[\d-a-z]/ is treated as /[\d\-a\-z]/.
     165                // See usages of CharacterRangeOrUnion abstract op in
     166                // https://tc39.es/ecma262/#sec-regular-expression-patterns-semantics
    166167            case CachedCharacterHyphen:
    167168                m_delegate.atomCharacterClassAtom(m_character);
     
    169170                FALLTHROUGH;
    170171            case AfterCharacterClassHyphen:
     172                if (m_isUnicode) {
     173                    m_errorCode = ErrorCode::CharacterClassRangeInvalid;
     174                    return;
     175                }
    171176                m_delegate.atomCharacterClassBuiltIn(classID, invert);
    172177                m_state = Empty;
     
    202207        Delegate& m_delegate;
    203208        ErrorCode& m_errorCode;
     209        bool m_isUnicode;
    204210        enum CharacterClassConstructionState {
    205211            Empty,
     
    596602        consume();
    597603
    598         CharacterClassParserDelegate characterClassConstructor(m_delegate, m_errorCode);
     604        CharacterClassParserDelegate characterClassConstructor(m_delegate, m_errorCode, m_isUnicode);
    599605
    600606        characterClassConstructor.begin(tryConsume('^'));
Note: See TracChangeset for help on using the changeset viewer.