Context Navigation

← Previous Changeset
Next Changeset →

Changeset 255134 in webkit

Timestamp:

Jan 26, 2020 3:28:44 PM (4 years ago)

Author:

Alexey Shvayka

Message:

Invalid ranges in character classes should be banned in unicode patterns
https://bugs.webkit.org/show_bug.cgi?id=206768

Reviewed by Darin Adler.

JSTests:

test262/expectations.yaml: Mark 18 test cases as passing.

Source/JavaScriptCore:

In ES5, grammar of CharacterRange was ambiguous, resulting in invalid ranges
like /[\d-a]/ being allowed. As of ES2015, invalid ranges are SyntaxError in
unicode patterns, yet still allowed in regular ones to avoid breaking the web.
(https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors-annexb)

This change adds SyntaxError for unicode patterns and updates explanatory
comments. ErrorCode::CharacterClassOutOfOrder is renamed for consistency
with newly added error code and ErrorCode::ParenthesesTypeInvalid.

yarr/YarrErrorCode.cpp:

(JSC::Yarr::errorMessage):
(JSC::Yarr::errorToThrow):

yarr/YarrErrorCode.h:
yarr/YarrParser.h:

(JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
(JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
(JSC::Yarr::Parser::CharacterClassParserDelegate::atomBuiltInCharacterClass):
(JSC::Yarr::Parser::parseCharacterClass):

Location:

trunk

Files:

: 6 edited

JSTests/ChangeLog (modified) (1 diff)
JSTests/test262/expectations.yaml (modified) (3 diffs)
Source/JavaScriptCore/ChangeLog (modified) (1 diff)
Source/JavaScriptCore/yarr/YarrErrorCode.cpp (modified) (2 diffs)
Source/JavaScriptCore/yarr/YarrErrorCode.h (modified) (1 diff)
Source/JavaScriptCore/yarr/YarrParser.h (modified) (8 diffs)

Legend:

: Unmodified
: Added
: Removed

trunk/JSTests/ChangeLog

-                      r255112
+                      r255134
+-01-26  Alexey Shvayka  <shvaikalesh@gmail.com>
+        Invalid ranges in character classes should be banned in unicode patterns
+        https://bugs.webkit.org/show_bug.cgi?id=206768
+        Reviewed by Darin Adler.
+        * test262/expectations.yaml: Mark 18 test cases as passing.
 -01-24  Mark Lam  <mark.lam@apple.com>

trunk/JSTests/test262/expectations.yaml

-                      r254757
+                      r255134
   default: 'SyntaxError: Invalid regular expression: invalid group specifier name'
   strict mode: 'SyntaxError: Invalid regular expression: invalid group specifier name'
-test/built-ins/RegExp/property-escapes/character-class-range-end.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/built-ins/RegExp/property-escapes/character-class-range-no-dash-end.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/built-ins/RegExp/property-escapes/character-class-range-no-dash-start.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/built-ins/RegExp/property-escapes/character-class-range-start.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
 test/built-ins/RegExp/property-escapes/generated/Alphabetic.js:
   default: 'Test262Error: `\p{Alphabetic}` should match U+001CFA (`ᳺ`)'
 …
   default: 'Test262Error: RegExp("]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
   strict mode: 'Test262Error: RegExp("]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
-test/built-ins/RegExp/unicode_restricted_character_class_escape.js:
-  default: 'Test262Error: RegExp("[\d-a]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
-  strict mode: 'Test262Error: RegExp("[\d-a]", "u"):  Expected a SyntaxError to be thrown but no exception was thrown at all'
 test/built-ins/RegExp/unicode_restricted_identity_escape.js:
   default: "Test262Error: Invalid IdentityEscape in AtomEscape: '\\"
 …
   default: 'Test262: This statement should not be evaluated.'
   strict mode: 'Test262: This statement should not be evaluated.'
-test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-a.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-ab.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/language/literals/regexp/u-invalid-non-empty-class-ranges-no-dash-b.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
-test/language/literals/regexp/u-invalid-non-empty-class-ranges.js:
-  default: 'Test262: This statement should not be evaluated.'
-  strict mode: 'Test262: This statement should not be evaluated.'
 test/language/literals/regexp/u-invalid-oob-decimal-escape.js:
   default: 'Test262: This statement should not be evaluated.'

trunk/Source/JavaScriptCore/ChangeLog

-                      r255120
+                      r255134
+-01-26  Alexey Shvayka  <shvaikalesh@gmail.com>
+        Invalid ranges in character classes should be banned in unicode patterns
+        https://bugs.webkit.org/show_bug.cgi?id=206768
+        Reviewed by Darin Adler.
+        In ES5, grammar of CharacterRange was ambiguous, resulting in invalid ranges
+        like /[\d-a]/ being allowed. As of ES2015, invalid ranges are SyntaxError in
+        unicode patterns, yet still allowed in regular ones to avoid breaking the web.
+        (https://tc39.es/ecma262/#sec-patterns-static-semantics-early-errors-annexb)
+        This change adds SyntaxError for unicode patterns and updates explanatory
+        comments. ErrorCode::CharacterClassOutOfOrder is renamed for consistency
+        with newly added error code and ErrorCode::ParenthesesTypeInvalid.
+        * yarr/YarrErrorCode.cpp:
+        (JSC::Yarr::errorMessage):
+        (JSC::Yarr::errorToThrow):
+        * yarr/YarrErrorCode.h:
+        * yarr/YarrParser.h:
+        (JSC::Yarr::Parser::CharacterClassParserDelegate::CharacterClassParserDelegate):
+        (JSC::Yarr::Parser::CharacterClassParserDelegate::atomPatternCharacter):
+        (JSC::Yarr::Parser::CharacterClassParserDelegate::atomBuiltInCharacterClass):
+        (JSC::Yarr::Parser::parseCharacterClass):
 -01-24  Mark Lam  <mark.lam@apple.com>

trunk/Source/JavaScriptCore/yarr/YarrErrorCode.cpp

-                      r251425
+                      r255134
     // The order of this array must match the ErrorCode enum.
     static const char* errorMessages[] = {
+        nullptr,                                                              // NoError
+        REGEXP_ERROR_PREFIX "regular expression too large",                   // PatternTooLarge
+        REGEXP_ERROR_PREFIX "numbers out of order in {} quantifier",          // QuantifierOutOfOrder
+        REGEXP_ERROR_PREFIX "nothing to repeat",                              // QuantifierWithoutAtom
+        REGEXP_ERROR_PREFIX "number too large in {} quantifier",              // QuantifierTooLarge
+        REGEXP_ERROR_PREFIX "missing )",                                      // MissingParentheses
+        REGEXP_ERROR_PREFIX "unmatched parentheses",                          // ParenthesesUnmatched
+        REGEXP_ERROR_PREFIX "unrecognized character after (?",                // ParenthesesTypeInvalid
+        REGEXP_ERROR_PREFIX "invalid group specifier name",                   // InvalidGroupName
+        REGEXP_ERROR_PREFIX "duplicate group specifier name",                 // DuplicateGroupName
+        REGEXP_ERROR_PREFIX "missing terminating ] for character class",      // CharacterClassUnmatched
+        REGEXP_ERROR_PREFIX "range out of order in character class",          // CharacterClassOutOfOrder
+        REGEXP_ERROR_PREFIX "\\ at end of pattern",                           // EscapeUnterminated
+        REGEXP_ERROR_PREFIX "invalid unicode {} escape",                      // InvalidUnicodeEscape
+        REGEXP_ERROR_PREFIX "invalid backreference for unicode pattern",      // InvalidBackreference
+        REGEXP_ERROR_PREFIX "invalid escaped character for unicode pattern",  // InvalidIdentityEscape
+        REGEXP_ERROR_PREFIX "invalid property expression",                    // InvalidUnicodePropertyExpression
+        REGEXP_ERROR_PREFIX "too many nested disjunctions",                   // TooManyDisjunctions
+        REGEXP_ERROR_PREFIX "pattern exceeds string length limits",           // OffsetTooLarge
+        REGEXP_ERROR_PREFIX "invalid flags"                                   // InvalidRegularExpressionFlags
+        nullptr,                                                                    // NoError
+        REGEXP_ERROR_PREFIX "regular expression too large",                         // PatternTooLarge
+        REGEXP_ERROR_PREFIX "numbers out of order in {} quantifier",                // QuantifierOutOfOrder
+        REGEXP_ERROR_PREFIX "nothing to repeat",                                    // QuantifierWithoutAtom
+        REGEXP_ERROR_PREFIX "number too large in {} quantifier",                    // QuantifierTooLarge
+        REGEXP_ERROR_PREFIX "missing )",                                            // MissingParentheses
+        REGEXP_ERROR_PREFIX "unmatched parentheses",                                // ParenthesesUnmatched
+        REGEXP_ERROR_PREFIX "unrecognized character after (?",                      // ParenthesesTypeInvalid
+        REGEXP_ERROR_PREFIX "invalid group specifier name",                         // InvalidGroupName
+        REGEXP_ERROR_PREFIX "duplicate group specifier name",                       // DuplicateGroupName
+        REGEXP_ERROR_PREFIX "missing terminating ] for character class",            // CharacterClassUnmatched
+        REGEXP_ERROR_PREFIX "range out of order in character class",                // CharacterClassRangeOutOfOrder
+        REGEXP_ERROR_PREFIX "invalid range in character class for unicode pattern", // CharacterClassRangeInvalid
+        REGEXP_ERROR_PREFIX "\\ at end of pattern",                                 // EscapeUnterminated
+        REGEXP_ERROR_PREFIX "invalid unicode {} escape",                            // InvalidUnicodeEscape
+        REGEXP_ERROR_PREFIX "invalid backreference for unicode pattern",            // InvalidBackreference
+        REGEXP_ERROR_PREFIX "invalid escaped character for unicode pattern",        // InvalidIdentityEscape
+        REGEXP_ERROR_PREFIX "invalid property expression",                          // InvalidUnicodePropertyExpression
+        REGEXP_ERROR_PREFIX "too many nested disjunctions",                         // TooManyDisjunctions
+        REGEXP_ERROR_PREFIX "pattern exceeds string length limits",                 // OffsetTooLarge
+        REGEXP_ERROR_PREFIX "invalid flags"                                         // InvalidRegularExpressionFlags
     };
 …
     case ErrorCode::DuplicateGroupName:
     case ErrorCode::CharacterClassUnmatched:
+    case ErrorCode::CharacterClassOutOfOrder:
+    case ErrorCode::CharacterClassRangeOutOfOrder:
+    case ErrorCode::CharacterClassRangeInvalid:
     case ErrorCode::EscapeUnterminated:
     case ErrorCode::InvalidUnicodeEscape:

trunk/Source/JavaScriptCore/yarr/YarrErrorCode.h

-                      r251425
+                      r255134
     DuplicateGroupName,
     CharacterClassUnmatched,
+    CharacterClassOutOfOrder,
+    CharacterClassRangeOutOfOrder,
+    CharacterClassRangeInvalid,
     EscapeUnterminated,
     InvalidUnicodeEscape,

trunk/Source/JavaScriptCore/yarr/YarrParser.h

-                      r250005
+                      r255134
     class CharacterClassParserDelegate {
     public:
         CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
+        CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err, bool isUnicode)
             : m_delegate(delegate)
             , m_errorCode(err)
+            , m_isUnicode(isUnicode)
             , m_state(Empty)
             , m_character(0)
 …
             switch (m_state) {
             case AfterCharacterClass:
                 // Following a builtin character class we need look out for a hyphen.
+                // Following a built-in character class we need look out for a hyphen.
                 // We're looking for invalid ranges, such as /[\d-x]/ or /[\d-\d]/.
                 // If we see a hyphen following a charater class then unlike usual
+                // If we see a hyphen following a character class then unlike usual
                 // we'll report it to the delegate immediately, and put ourself into
+                // a poisoned state. Any following calls to add another character or
+                // character class will result in an error. (A hypen following a
+                // character-class is itself valid, but only  at the end of a regex).
+                // a poisoned state. In a unicode pattern, any following calls to add
+                // another character or character class will result in syntax error.
+                // A hypen following a character class is itself valid, but only at
+                // the end of a regex.
                 if (hyphenIsRange && ch == '-') {
                     m_delegate.atomCharacterClassAtom('-');
 …
             case CachedCharacterHyphen:
                 if (ch < m_character) {
                     m_errorCode = ErrorCode::CharacterClassOutOfOrder;
+                    m_errorCode = ErrorCode::CharacterClassRangeOutOfOrder;
                     return;
+                }
 …
                 return;
+                // See coment in atomBuiltInCharacterClass below.
+                // This too is technically an error, per ECMA-262, and again we
+                // we chose to allow this.  Note a subtlely here that while we
+                // diverge from the spec's definition of CharacterRange we do
+                // remain in compliance with the grammar.  For example, consider
+                // the expression /[\d-a-z]/.  We comply with the grammar in
+                // this case by not allowing a-z to be matched as a range.
+                // If we hit this case, we have an invalid range like /[\d-a]/.
+                // See coment in atomBuiltInCharacterClass() below.
             case AfterCharacterClassHyphen:
+                if (m_isUnicode) {
+                    m_errorCode = ErrorCode::CharacterClassRangeInvalid;
+                    return;
+                }
                 m_delegate.atomCharacterClassAtom(ch);
                 m_state = Empty;
 …
             case Empty:
             case AfterCharacterClass:
+                m_delegate.atomCharacterClassBuiltIn(classID, invert);
                 m_state = AfterCharacterClass;
-                m_delegate.atomCharacterClassBuiltIn(classID, invert);
                 return;
                 // If we hit either of these cases, we have an invalid range that
                 // looks something like /[x-\d]/ or /[\d-\d]/.
                 // According to ECMA-262 this should be a syntax error, but
                 // empirical testing shows this to break teh webz.  Instead we
                 // comply with to the ECMA-262 grammar, and assume the grammar to
                 // have matched the range correctly, but tweak our interpretation
                 // of CharacterRange.  Effectively we implicitly handle the hyphen
                 // as if it were escaped, e.g. /[\w-_]/ is treated as /[\w\-_]/.
+                // looks something like /[a-\d]/ or /[\d-\d]/.
+                // Since ES2015, this should be syntax error in a unicode pattern,
+                // yet gracefully handled in a regular regex to avoid breaking the web.
+                // Effectively we handle the hyphen as if it was (implicitly) escaped,
+                // e.g. /[\d-a-z]/ is treated as /[\d\-a\-z]/.
+                // See usages of CharacterRangeOrUnion abstract op in
+                // https://tc39.es/ecma262/#sec-regular-expression-patterns-semantics
             case CachedCharacterHyphen:
                 m_delegate.atomCharacterClassAtom(m_character);
 …
                 FALLTHROUGH;
             case AfterCharacterClassHyphen:
+                if (m_isUnicode) {
+                    m_errorCode = ErrorCode::CharacterClassRangeInvalid;
+                    return;
+                }
                 m_delegate.atomCharacterClassBuiltIn(classID, invert);
                 m_state = Empty;
 …
         Delegate& m_delegate;
         ErrorCode& m_errorCode;
+        bool m_isUnicode;
         enum CharacterClassConstructionState {
             Empty,
 …
         consume();
         CharacterClassParserDelegate characterClassConstructor(m_delegate, m_errorCode);
+        CharacterClassParserDelegate characterClassConstructor(m_delegate, m_errorCode, m_isUnicode);
         characterClassConstructor.begin(tryConsume('^'));

Note: See TracChangeset for help on using the changeset viewer.