Changeset 259262 in webkit


Timestamp:
Mar 30, 2020 6:27:10 PM (4 years ago)
Author:
Alexey Shvayka
Message:

Add support in named capture group identifiers for direct surrogate pairs
https://bugs.webkit.org/show_bug.cgi?id=178174

Reviewed by Darin Adler and Michael Saboff.

JSTests:

  • test262/expectations.yaml: Mark 2 test cases as passing.

Source/JavaScriptCore:

This change:

a) Adds support for unescaped astral symbols in RegExp identifier names [1],
   aligning JSC with V8.

b) Rewords InvalidUnicodeEscape error code to be used for \uXXXX escapes in
   Unicode patterns and named groups/references instead of InvalidIdentityEscape,
   matching error messages in V8 and SpiderMonkey.

c) Adds hasError() checks after tryConsumeGroupName() so errors generated in
   tryConsumeIdentifierCharacter() would not get overridden.

d) Removes code duplication by using tryConsumeUnicodeEscape() for parsing \u
   in parseEscape(); cleans up parsing \u{} escapes a bit, preferring ASSERTs
   over hasError() checks.

[1]: https://tc39.es/ecma262/#prod-RegExpIdentifierName
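
As a rough illustration of (a), the sketch below builds a Unicode-mode pattern whose group name is a single unescaped astral symbol and reads the capture back through `groups`. The choice of U+1D49C as the identifier and the helper name `matchWithAstralGroup` are assumptions made for this example and do not come from the changeset. `new RegExp` is used instead of a literal so that an engine without support fails at run time rather than at parse time.

```javascript
// Hypothetical helper (not part of the changeset): match `input` against a
// pattern whose named group is identified by an unescaped astral symbol,
// i.e. a character stored as a surrogate pair in UTF-16.
function matchWithAstralGroup(input) {
    // U+1D49C MATHEMATICAL SCRIPT CAPITAL A, chosen here as an example of an
    // ID_Start character outside the Basic Multilingual Plane.
    const name = "\u{1D49C}";
    const pattern = new RegExp(`(?<${name}>a+)b`, "u");
    const match = pattern.exec(input);
    return match ? match.groups[name] : null;
}

console.log(matchWithAstralGroup("xaaab"));
```

The group name occupies two UTF-16 code units, which is why the parser has to consume a possible surrogate pair rather than a single code unit when reading an identifier character.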

  • yarr/YarrErrorCode.cpp:

(JSC::Yarr::errorMessage):
(JSC::Yarr::errorToThrow):

  • yarr/YarrErrorCode.h:
  • yarr/YarrParser.h:

(JSC::Yarr::Parser::parseEscape):
(JSC::Yarr::Parser::parseParenthesesBegin):
(JSC::Yarr::Parser::tryConsumeUnicodeEscape):
(JSC::Yarr::Parser::tryConsumeIdentifierCharacter):

LayoutTests:

Adjusted tests for error message changes and added coverage for messages
of syntax errors due to invalid \u escapes inside named groups/references.

  • js/regexp-named-capture-groups-expected.txt:
  • js/regexp-unicode-expected.txt:
  • js/regress-158080-expected.txt:
  • js/script-tests/regexp-named-capture-groups.js:
  • js/script-tests/regexp-unicode.js:
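
The invalid patterns covered by these tests can be tried out with a small sketch like the one below. The helper name `syntaxErrorFor` is hypothetical. Message wording differs between engines, so the sketch only reports whether a SyntaxError was thrown and returns its message for inspection instead of comparing it to a fixed string.

```javascript
// Hypothetical helper: compile `source` with `flags` and return the
// SyntaxError message, or null if the engine accepts the pattern.
function syntaxErrorFor(source, flags) {
    try {
        new RegExp(source, flags);
        return null;
    } catch (error) {
        if (error instanceof SyntaxError)
            return error.message;
        throw error; // anything else is unexpected, so let it propagate
    }
}

// Pattern sources taken from the tests touched by this changeset.
const invalidPatterns = [
    "(?<\\u>.)",    // bare \u inside a group name
    "\\k<\\uzzz>",  // non-hex characters after \u in a named reference
    "(?<\\u{>.)",   // unterminated \u{} inside a group name
    "\\k<\\u{0>",   // unterminated \u{} inside a named reference
    "\\u{110000}",  // code point above U+10FFFF
];

for (const source of invalidPatterns)
    console.log(JSON.stringify(source), "->", syntaxErrorFor(source, "u"));
```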
Location:
trunk
Files:
12 edited

  • trunk/JSTests/ChangeLog

    r259246 r259262  
     12020-03-30  Alexey Shvayka  <shvaikalesh@gmail.com>
     2
     3        Add support in named capture group identifiers for direct surrogate pairs
     4        https://bugs.webkit.org/show_bug.cgi?id=178174
     5
     6        Reviewed by Darin Adler and Michael Saboff.
     7
     8        * test262/expectations.yaml: Mark 2 test cases as passing.
     9
    1102020-03-30  Ross Kirsling  <ross.kirsling@sony.com>
    211
  • trunk/JSTests/test262/expectations.yaml

    r259246 r259262  
    12471247  default: 'Test262Error: Expected [Symbol(b), Symbol(a)] and [Symbol(a), Symbol(b)] to have the same contents. '
    12481248  strict mode: 'Test262Error: Expected [Symbol(b), Symbol(a)] and [Symbol(a), Symbol(b)] to have the same contents. '
    1249 test/built-ins/RegExp/named-groups/unicode-property-names.js:
    1250   default: 'SyntaxError: Invalid regular expression: invalid group specifier name'
    1251   strict mode: 'SyntaxError: Invalid regular expression: invalid group specifier name'
    12521249test/built-ins/RegExp/property-escapes/generated/Alphabetic.js:
    12531250  default: 'Test262Error: `\p{Alphabetic}` should match U+001CFA (`ᳺ`)'
  • trunk/LayoutTests/ChangeLog

    r259261 r259262  
     12020-03-30  Alexey Shvayka  <shvaikalesh@gmail.com>
     2
     3        Add support in named capture group identifiers for direct surrogate pairs
     4        https://bugs.webkit.org/show_bug.cgi?id=178174
     5
     6        Reviewed by Darin Adler and Michael Saboff.
     7
     8        Adjusted tests for error messages changes and added coverage for messages
     9        of syntax errors due to invalid \u escapes inside named groups/references.
     10
     11        * js/regexp-named-capture-groups-expected.txt:
     12        * js/regexp-unicode-expected.txt:
     13        * js/regress-158080-expected.txt:
     14        * js/script-tests/regexp-named-capture-groups.js:
     15        * js/script-tests/regexp-unicode.js:
     16
    1172020-03-30  Devin Rousso  <drousso@apple.com>
    218
  • trunk/LayoutTests/js/regexp-named-capture-groups-expected.txt

    r259026 r259262  
    6262PASS let r = new RegExp("/(?<‌groupName1>abc)/u") threw exception SyntaxError: Invalid regular expression: invalid group specifier name.
    6363PASS let r = new RegExp("/(?<‍groupName1>abc)/u") threw exception SyntaxError: Invalid regular expression: invalid group specifier name.
     64PASS /(?<\u>.)/u threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
     65PASS /\k<\uzzz>/u threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
     66PASS /(?<\u{>.)/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     67PASS /\k<\u{0>/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
    6468PASS "XzzXzz".match(/\k<z>X(?<z>z*)X\k<z>/) is ["XzzXzz", "zz"]
    6569PASS "XzzXzz".match(/\k<z>X(?<z>z*)X\k<z>/u) is ["XzzXzz", "zz"]
  • trunk/LayoutTests/js/regexp-unicode-expected.txt

    r258976 r259262  
    179179PASS "this is ba test".match(/is b\cha test/u)[0].length is 11
    180180PASS new RegExp("\\/", "u").source is "\\/"
    181 PASS r = new RegExp("\\u{110000}", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
     181PASS r = new RegExp("\\u{110000}", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
    182182PASS r = new RegExp("𐐅{2147483648}", "u") threw exception SyntaxError: Invalid regular expression: pattern exceeds string length limits.
    183183PASS /{/u threw exception SyntaxError: Invalid regular expression: incomplete {} quantifier for Unicode pattern.
     
    191191PASS r = new RegExp("\\x", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
    192192PASS r = new RegExp("[\\x]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
    193 PASS r = new RegExp("\\u", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
    194 PASS r = new RegExp("[\\u]", "u") threw exception SyntaxError: Invalid regular expression: invalid escaped character for Unicode pattern.
    195 PASS r = new RegExp("\\u{", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    196 PASS r = new RegExp("\\u{\udead", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
     193PASS r = new RegExp("\\u", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
     194PASS r = new RegExp("[\\u]", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode \u escape.
     195PASS r = new RegExp("\\u{", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     196PASS r = new RegExp("\\u{\udead", "u") threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
    197197PASS /\1/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
    198198PASS /\2/u threw exception SyntaxError: Invalid regular expression: invalid backreference for Unicode pattern.
  • trunk/LayoutTests/js/regress-158080-expected.txt

    r255452 r259262  
    44
    55
    6 PASS let r = /\u{|abc/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    7 PASS let r = /\u{/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    8 PASS let r = /\u{1/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    9 PASS let r = /\u{12/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    10 PASS let r = /\u{123/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    11 PASS let r = /\u{1234/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    12 PASS let r = /\u{abcde/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    13 PASS let r = /\u{abcdef/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    14 PASS let r = /\u{1111111}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    15 PASS let r = /\u{fedbca98}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
    16 PASS let r = /\u{1{123}}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode {} escape.
     6PASS let r = /\u{|abc/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     7PASS let r = /\u{/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     8PASS let r = /\u{1/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     9PASS let r = /\u{12/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     10PASS let r = /\u{123/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     11PASS let r = /\u{1234/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     12PASS let r = /\u{abcde/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     13PASS let r = /\u{abcdef/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     14PASS let r = /\u{1111111}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     15PASS let r = /\u{fedbca98}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
     16PASS let r = /\u{1{123}}/u threw exception SyntaxError: Invalid regular expression: invalid Unicode code point \u{} escape.
    1717PASS successfullyParsed is true
    1818
  • trunk/LayoutTests/js/script-tests/regexp-named-capture-groups.js

    r259026 r259262  
    105105shouldThrow('let r = new RegExp("/(?<\u200dgroupName1>abc)/u")', '"SyntaxError: Invalid regular expression: invalid group specifier name"');
    106106
     107// Check that invalid \u escape errors are not get overriden.
     108shouldThrow('/(?<\\u>.)/u', '"SyntaxError: Invalid regular expression: invalid Unicode \\\\u escape"');
     109shouldThrow('/\\k<\\uzzz>/u', '"SyntaxError: Invalid regular expression: invalid Unicode \\\\u escape"');
     110shouldThrow('/(?<\\u{>.)/u', '"SyntaxError: Invalid regular expression: invalid Unicode code point \\\\u{} escape"');
     111shouldThrow('/\\k<\\u{0>/u', '"SyntaxError: Invalid regular expression: invalid Unicode code point \\\\u{} escape"');
     112
    107113// Check the named forward references work
    108114shouldBe('"XzzXzz".match(/\\\k<z>X(?<z>z*)X\\\k<z>/)', '["XzzXzz", "zz"]');
  • trunk/LayoutTests/js/script-tests/regexp-unicode.js

    r258976 r259262  
    228228// Check that invalid unicode patterns throw exceptions
    229229shouldBe('new RegExp("\\\\/", "u").source', '"\\\\/"');
    230 shouldThrow('r = new RegExp("\\\\u{110000}", "u")', '"SyntaxError: Invalid regular expression: invalid Unicode {} escape"');
     230shouldThrow('r = new RegExp("\\\\u{110000}", "u")', '"SyntaxError: Invalid regular expression: invalid Unicode code point \\\\u{} escape"');
    231231shouldThrow('r = new RegExp("\u{10405}{2147483648}", "u")', '"SyntaxError: Invalid regular expression: pattern exceeds string length limits"');
    232232shouldThrow('/{/u', '"SyntaxError: Invalid regular expression: incomplete {} quantifier for Unicode pattern"');
     
    251251shouldThrowInvalidEscape("\\\\x");
    252252shouldThrowInvalidEscape("[\\\\x]");
    253 shouldThrowInvalidEscape("\\\\u");
    254 shouldThrowInvalidEscape("[\\\\u]");
    255 
    256 shouldThrowInvalidEscape("\\\\u{", '"SyntaxError: Invalid regular expression: invalid Unicode {} escape"');
    257 shouldThrowInvalidEscape("\\\\u{\\udead", '"SyntaxError: Invalid regular expression: invalid Unicode {} escape"');
     253shouldThrowInvalidEscape("\\\\u", '"SyntaxError: Invalid regular expression: invalid Unicode \\\\u escape"');
     254shouldThrowInvalidEscape("[\\\\u]", '"SyntaxError: Invalid regular expression: invalid Unicode \\\\u escape"');
     255
     256shouldThrowInvalidEscape("\\\\u{", '"SyntaxError: Invalid regular expression: invalid Unicode code point \\\\u{} escape"');
     257shouldThrowInvalidEscape("\\\\u{\\udead", '"SyntaxError: Invalid regular expression: invalid Unicode code point \\\\u{} escape"');
    258258
    259259// Check that invalid backreferences in unicode patterns throw exceptions.
  • trunk/Source/JavaScriptCore/ChangeLog

    r259246 r259262  
     12020-03-30  Alexey Shvayka  <shvaikalesh@gmail.com>
     2
     3        Add support in named capture group identifiers for direct surrogate pairs
     4        https://bugs.webkit.org/show_bug.cgi?id=178174
     5
     6        Reviewed by Darin Adler and Michael Saboff.
     7
     8        This change:
     9
     10        a) Adds support for unescaped astral symbols in RegExp identifier names [1],
     11           aligning JSC with V8.
     12
     13        b) Rewords InvalidUnicodeEscape error code to be used for \uXXXX escapes in
     14           Unicode patterns and named groups/references instead of InvalidIdentityEscape,
     15           matching error messages in V8 and SpiderMonkey.
     16
     17        c) Adds hasError() checks after tryConsumeGroupName() so errors generated in
     18           tryConsumeIdentifierCharacter() would not get overriden.
     19
     20        d) Removes code duplication by using tryConsumeUnicodeEscape() for parsing \u
     21           in parseEscape(); cleans up parsing \u{} escapes a bit, preferring ASSERTs
     22           over hasError() checks.
     23
     24        [1]: https://tc39.es/ecma262/#prod-RegExpIdentifierName
     25
     26        * yarr/YarrErrorCode.cpp:
     27        (JSC::Yarr::errorMessage):
     28        (JSC::Yarr::errorToThrow):
     29        * yarr/YarrErrorCode.h:
     30        * yarr/YarrParser.h:
     31        (JSC::Yarr::Parser::parseEscape):
     32        (JSC::Yarr::Parser::parseParenthesesBegin):
     33        (JSC::Yarr::Parser::tryConsumeUnicodeEscape):
     34        (JSC::Yarr::Parser::tryConsumeIdentifierCharacter):
     35
    1362020-03-30  Ross Kirsling  <ross.kirsling@sony.com>
    237
  • trunk/Source/JavaScriptCore/yarr/YarrErrorCode.cpp

    r259026 r259262  
    5252        REGEXP_ERROR_PREFIX "invalid range in character class for Unicode pattern", // CharacterClassRangeInvalid
    5353        REGEXP_ERROR_PREFIX "\\ at end of pattern",                                 // EscapeUnterminated
    54         REGEXP_ERROR_PREFIX "invalid Unicode {} escape",                            // InvalidUnicodeEscape
     54        REGEXP_ERROR_PREFIX "invalid Unicode \\u escape",                           // InvalidUnicodeEscape
     55        REGEXP_ERROR_PREFIX "invalid Unicode code point \\u{} escape",              // InvalidUnicodeCodePointEscape
    5556        REGEXP_ERROR_PREFIX "invalid backreference for Unicode pattern",            // InvalidBackreference
    5657        REGEXP_ERROR_PREFIX "invalid \\k<> named backreference",                    // InvalidNamedBackReference
     
    8889    case ErrorCode::EscapeUnterminated:
    8990    case ErrorCode::InvalidUnicodeEscape:
     91    case ErrorCode::InvalidUnicodeCodePointEscape:
    9092    case ErrorCode::InvalidBackreference:
    9193    case ErrorCode::InvalidNamedBackReference:
  • trunk/Source/JavaScriptCore/yarr/YarrErrorCode.h

    r259026 r259262  
    5252    EscapeUnterminated,
    5353    InvalidUnicodeEscape,
     54    InvalidUnicodeCodePointEscape,
    5455    InvalidBackreference,
    5556    InvalidNamedBackReference,
  • trunk/Source/JavaScriptCore/yarr/YarrParser.h

    r259026 r259262  
    439439            if (!inCharacterClass && tryConsume('<')) {
    440440                auto groupName = tryConsumeGroupName();
     441                if (hasError(m_errorCode))
     442                    break;
     443
    441444                if (groupName) {
    442445                    if (m_captureGroupNames.contains(groupName.value())) {
     
    488491        // UnicodeEscape
    489492        case 'u': {
    490             consume();
    491             if (atEndOfPattern()) {
    492                 if (isIdentityEscapeAnError('u'))
    493                     break;
    494 
    495                 delegate.atomPatternCharacter('u');
    496                 break;
    497             }
    498 
    499             if (m_isUnicode && peek() == '{') {
    500                 consume();
    501                 UChar32 codePoint = 0;
    502                 do {
    503                     if (atEndOfPattern() || !isASCIIHexDigit(peek())) {
    504                         m_errorCode = ErrorCode::InvalidUnicodeEscape;
    505                         break;
    506                     }
    507 
    508                     codePoint = (codePoint << 4) | toASCIIHexValue(consume());
    509 
    510                     if (codePoint > UCHAR_MAX_VALUE)
    511                         m_errorCode = ErrorCode::InvalidUnicodeEscape;
    512                 } while (!atEndOfPattern() && peek() != '}');
    513                 if (!atEndOfPattern() && peek() == '}')
    514                     consume();
    515                 else if (!hasError(m_errorCode))
    516                     m_errorCode = ErrorCode::InvalidUnicodeEscape;
    517                 if (hasError(m_errorCode))
    518                     return false;
    519 
    520                 delegate.atomPatternCharacter(codePoint);
    521                 break;
    522             }
    523             int u = tryConsumeHex(4);
    524             if (u == -1) {
    525                 if (isIdentityEscapeAnError('u'))
    526                     break;
    527 
    528                 delegate.atomPatternCharacter('u');
    529             } else {
    530                 // If we have the first of a surrogate pair, look for the second.
    531                 if (U16_IS_LEAD(u) && m_isUnicode && (patternRemaining() >= 6) && peek() == '\\') {
    532                     ParseState state = saveState();
    533                     consume();
    534                    
    535                     if (tryConsume('u')) {
    536                         int surrogate2 = tryConsumeHex(4);
    537                         if (U16_IS_TRAIL(surrogate2)) {
    538                             u = U16_GET_SUPPLEMENTARY(u, surrogate2);
    539                             delegate.atomPatternCharacter(u);
    540                             break;
    541                         }
    542                     }
    543 
    544                     restoreState(state);
    545                 }
    546                 delegate.atomPatternCharacter(u);
    547             }
     493            int codePoint = tryConsumeUnicodeEscape<UnicodeEscapeContext::CharacterEscape>();
     494            if (hasError(m_errorCode))
     495                break;
     496
     497            delegate.atomPatternCharacter(codePoint == -1 ? 'u' : codePoint);
    548498            break;
    549499        }
     
    673623            case '<': {
    674624                auto groupName = tryConsumeGroupName();
     625                if (hasError(m_errorCode))
     626                    break;
     627
    675628                if (groupName) {
    676629                    if (m_kIdentityEscapeSeen) {
     
    1010963    }
    1011964
     965    enum class UnicodeEscapeContext : uint8_t { CharacterEscape, IdentifierName };
     966
     967    template<UnicodeEscapeContext context>
    1012968    int tryConsumeUnicodeEscape()
    1013969    {
    1014         if (!tryConsume('u'))
     970        ASSERT(!hasError(m_errorCode));
     971
     972        if (!tryConsume('u') || atEndOfPattern()) {
     973            if (m_isUnicode || context == UnicodeEscapeContext::IdentifierName)
     974                m_errorCode = ErrorCode::InvalidUnicodeEscape;
    1015975            return -1;
     976        }
    1016977
    1017978        if (m_isUnicode && tryConsume('{')) {
     
    1019980            do {
    1020981                if (atEndOfPattern() || !isASCIIHexDigit(peek())) {
    1021                     m_errorCode = ErrorCode::InvalidUnicodeEscape;
     982                    m_errorCode = ErrorCode::InvalidUnicodeCodePointEscape;
    1022983                    return -1;
    1023984                }
     
    1026987
    1027988                if (codePoint > UCHAR_MAX_VALUE) {
    1028                     m_errorCode = ErrorCode::InvalidUnicodeEscape;
     989                    m_errorCode = ErrorCode::InvalidUnicodeCodePointEscape;
    1029990                    return -1;
    1030991                }
    1031992            } while (!atEndOfPattern() && peek() != '}');
    1032             if (!atEndOfPattern() && peek() == '}')
    1033                 consume();
    1034             else if (!hasError(m_errorCode))
     993
     994            if (!tryConsume('}')) {
     995                m_errorCode = ErrorCode::InvalidUnicodeCodePointEscape;
     996                return -1;
     997            }
     998
     999            return codePoint;
     1000        }
     1001
     1002        int codeUnit = tryConsumeHex(4);
     1003        if (codeUnit == -1) {
     1004            if (m_isUnicode || context == UnicodeEscapeContext::IdentifierName)
    10351005                m_errorCode = ErrorCode::InvalidUnicodeEscape;
    1036             if (hasError(m_errorCode))
    1037                 return -1;
    1038 
    1039             return codePoint;
    1040         }
    1041 
    1042         int u = tryConsumeHex(4);
    1043         if (u == -1)
    10441006            return -1;
     1007        }
    10451008
    10461009        // If we have the first of a surrogate pair, look for the second.
    1047         if (U16_IS_LEAD(u) && m_isUnicode && (patternRemaining() >= 6) && peek() == '\\') {
     1010        if (U16_IS_LEAD(codeUnit) && m_isUnicode && patternRemaining() >= 6 && peek() == '\\') {
    10481011            ParseState state = saveState();
    10491012            consume();
     
    10511014            if (tryConsume('u')) {
    10521015                int surrogate2 = tryConsumeHex(4);
    1053                 if (U16_IS_TRAIL(surrogate2)) {
    1054                     u = U16_GET_SUPPLEMENTARY(u, surrogate2);
    1055                     return u;
    1056                 }
     1016                if (U16_IS_TRAIL(surrogate2))
     1017                    return U16_GET_SUPPLEMENTARY(codeUnit, surrogate2);
    10571018            }
    10581019
     
    10601021        }
    10611022
    1062         return u;
     1023        return codeUnit;
    10631024    }
    10641025
    10651026    int tryConsumeIdentifierCharacter()
    10661027    {
    1067         int ch = peek();
    1068 
    1069         if (ch == '\\') {
    1070             consume();
    1071             ch = tryConsumeUnicodeEscape();
    1072         } else
    1073             consume();
    1074 
    1075         return ch;
     1028        if (tryConsume('\\'))
     1029            return tryConsumeUnicodeEscape<UnicodeEscapeContext::IdentifierName>();
     1030
     1031        return consumePossibleSurrogatePair();
    10761032    }
    10771033
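
The surrogate-pair step at the end of tryConsumeUnicodeEscape() can be mirrored in JavaScript for experimentation. This is an illustrative sketch, not a port of the Yarr code: `isLead`, `isTrail`, `combineSurrogates`, and `decodeEscapedUnits` are hypothetical names, the first three standing in for the ICU macros U16_IS_LEAD, U16_IS_TRAIL, and U16_GET_SUPPLEMENTARY used in the diff above.

```javascript
// Stand-ins for the ICU macros; valid for 16-bit code unit values.
const isLead = (unit) => (unit & 0xFC00) === 0xD800;  // 0xD800..0xDBFF
const isTrail = (unit) => (unit & 0xFC00) === 0xDC00; // 0xDC00..0xDFFF

// Combine a lead/trail pair into a supplementary code point.
function combineSurrogates(lead, trail) {
    return ((lead - 0xD800) << 10) + (trail - 0xDC00) + 0x10000;
}

// Decode the values of two consecutive \uXXXX escapes roughly the way a
// Unicode-mode parser would: a lead followed by a trail becomes one code
// point, anything else stays as two separate code units.
function decodeEscapedUnits(first, second) {
    if (isLead(first) && isTrail(second))
        return [combineSurrogates(first, second)];
    return [first, second];
}

console.log(decodeEscapedUnits(0xD801, 0xDC05).map((cp) => cp.toString(16)));
```

In the parser the second escape is only consumed when it turns out to be a trail surrogate; otherwise the saved parse state is restored and the lead surrogate is returned on its own.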