Changeset 64817 in webkit


Timestamp:
Aug 5, 2010 10:36:28 PM
Author:
ap@apple.com
Message:

Reviewed by Darin Adler.

https://bugs.webkit.org/show_bug.cgi?id=43554
Way too many encoding aliases are treated as valid

<rdar://problem/7863399> Garbage characters displayed in some yesky.com pages.

<rdar://problem/7859068> Garbage characters displayed for most text at ceping.zhaopin.com

Test: http/tests/misc/bad-charset-alias.html

  • loader/TextResourceDecoder.cpp: (WebCore::TextResourceDecoder::checkForCSSCharset): Fix encoding name length computation. Previously, a trailing quote was ignored by TextEncodingRegistry.


  • platform/text/TextCodecICU.cpp: (WebCore::TextCodecICU::registerExtendedEncodingNames): Added dashes to alias names that didn't have them. Added aliases prompted by regression tests.
  • platform/text/TextCodecLatin1.cpp: (WebCore::TextCodecLatin1::registerEncodingNames): Don't register 8859-1, other browsers do not support this encoding name.
  • platform/text/TextEncoding.cpp: (WebCore::Latin1Encoding): "Latin-1" is not a real encoding name, it's not known to Firefox or IE.
  • platform/text/TextEncodingRegistry.cpp:
      (WebCore::TextEncodingNameHash::equal): Changed to no longer ignore non-alphanumeric characters. There is a good chance that we'll be missing support for some necessary alias names, but other browsers don't ignore any characters when matching names.
      (WebCore::TextEncodingNameHash::hash): Ditto.
      (WebCore::checkExistingName): Re-formatted a line.
      (WebCore::isUndesiredAlias): Added a filter to reject "8859_1" and any names containing commas.
      (WebCore::addToTextEncodingNameMap): Used it.
      (WebCore::atomicCanonicalTextEncodingName): Changed to no longer ignore non-alphanumeric characters.
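
The matching change described above can be illustrated with a minimal sketch. The helper names below (`isAsciiAlnum`, `asciiLower`, `looseEqual`, `strictEqual`) are hypothetical stand-ins chosen for this example, not WebKit API; the two comparison functions only mirror the shape of TextEncodingNameHash::equal before and after this changeset.

```cpp
// Hypothetical stand-ins for the WTF ASCII helpers used in the real code.
static bool isAsciiAlnum(char c)
{
    return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}

static char asciiLower(char c)
{
    return (c >= 'A' && c <= 'Z') ? static_cast<char>(c + ('a' - 'A')) : c;
}

// Old behavior: skip every non-alphanumeric character, then compare
// case-insensitively. Dashes, underscores and stray quotes are all ignored.
static bool looseEqual(const char* s1, const char* s2)
{
    char c1;
    char c2;
    do {
        do
            c1 = *s1++;
        while (c1 && !isAsciiAlnum(c1));
        do
            c2 = *s2++;
        while (c2 && !isAsciiAlnum(c2));
        if (asciiLower(c1) != asciiLower(c2))
            return false;
    } while (c1 && c2);
    return !c1 && !c2;
}

// New behavior: every character counts; only ASCII case is folded.
static bool strictEqual(const char* s1, const char* s2)
{
    char c1;
    char c2;
    do {
        c1 = *s1++;
        c2 = *s2++;
        if (asciiLower(c1) != asciiLower(c2))
            return false;
    } while (c1 && c2);
    return !c1 && !c2;
}
```

Under the loose comparison, "utf8" and a name with a trailing quote such as `utf-8"` both match "UTF-8"; under the strict comparison neither does. That is presumably why this changeset also registers "utf8" as an explicit alias and fixes the length computation in checkForCSSCharset.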
Location:
trunk
Files:
3 added
9 edited

  • trunk/LayoutTests/ChangeLog

    r64815 r64817  
    @@ -1,2 +1,22 @@
    +2010-08-05  Alexey Proskuryakov  <ap@apple.com>
    +
    +        Reviewed by Darin Adler.
    +
    +        https://bugs.webkit.org/show_bug.cgi?id=43554
    +        Way too many encoding aliases are treated as valid
    +
    +        <rdar://problem/7863399> Garbage characters displayed in some yesky.com pages.
    +
    +        <rdar://problem/7859068> Garbage characters displayed for most text at ceping.zhaopin.com
    +
    +        * fast/encoding/char-encoding-expected.txt:
    +        * fast/encoding/char-encoding.html:
    +        Use a correct name for GB_2312-80. At least Firefox doesn't know GB-2312-80.
    +
    +        * http/tests/misc/bad-charset-alias-expected.txt: Added.
    +        * http/tests/misc/bad-charset-alias.html: Added.
    +        * http/tests/misc/resources/bad-charset-alias.php: Added.
    +        Check that certain encoding names are unknown. Both Firefox and IE don't know these.
    +
     2010-08-05  W. James MacLean  <wjmaclean@chromium.org>
     
  • trunk/LayoutTests/fast/encoding/char-encoding-expected.txt

    r39787 r64817  
    @@ -7,9 +7,9 @@
     PASS encode('GBK', 'U+00A5') is '%A3%A4'
     PASS encode('gb2312', 'U+00A5') is '%A3%A4'
    -PASS encode('GB-2312-80', 'U+00A5') is '%A3%A4'
    +PASS encode('GB_2312-80', 'U+00A5') is '%A3%A4'
     PASS encode('EUC-CN', 'U+00A5') is '%A3%A4'
     PASS encode('GBK', 'U+20AC') is '%80'
     PASS encode('gb2312', 'U+20AC') is '%80'
    -PASS encode('GB-2312-80', 'U+20AC') is '%80'
    +PASS encode('GB_2312-80', 'U+20AC') is '%80'
     PASS encode('EUC-CN', 'U+20AC') is '%80'
     PASS encode('GBK', 'U+01F9') is '%A8%BF'
  • trunk/LayoutTests/fast/encoding/char-encoding.html

    r51088 r64817  
    @@ -26,10 +26,10 @@
     testEncode('GBK', 'U+00A5', '%A3%A4');
     testEncode('gb2312', 'U+00A5', '%A3%A4');
    -testEncode('GB-2312-80', 'U+00A5', '%A3%A4');
    +testEncode('GB_2312-80', 'U+00A5', '%A3%A4');
     testEncode('EUC-CN', 'U+00A5', '%A3%A4');
     //Euro symbol in gbk
     testEncode('GBK', 'U+20AC', '%80');
     testEncode('gb2312', 'U+20AC', '%80');
    -testEncode('GB-2312-80', 'U+20AC', '%80');
    +testEncode('GB_2312-80', 'U+20AC', '%80');
     testEncode('EUC-CN', 'U+20AC', '%80');
     //Misc symbols from TEC specific GBK translation
  • trunk/WebCore/ChangeLog

    r64816 r64817  
    @@ -1,2 +1,38 @@
    +2010-08-05  Alexey Proskuryakov  <ap@apple.com>
    +
    +        Reviewed by Darin Adler.
    +
    +        https://bugs.webkit.org/show_bug.cgi?id=43554
    +        Way too many encoding aliases are treated as valid
    +
    +        <rdar://problem/7863399> Garbage characters displayed in some yesky.com pages.
    +
    +        <rdar://problem/7859068> Garbage characters displayed for most text at ceping.zhaopin.com
    +
    +        Test: http/tests/misc/bad-charset-alias.html
    +
    +        * loader/TextResourceDecoder.cpp: (WebCore::TextResourceDecoder::checkForCSSCharset):
    +        Fix encoding name length computation. Previously, a trailing quote was ignored by
    +        TextEncodingRegistry.
    +
    +        * platform/text/TextCodecICU.cpp: (WebCore::TextCodecICU::registerExtendedEncodingNames):
    +        Added dashes to alias names that didn't have them. Added aliases prompted by regression tests.
    +
    +        * platform/text/TextCodecLatin1.cpp: (WebCore::TextCodecLatin1::registerEncodingNames):
    +        Don't register 8859-1, other browsers do not support this encoding name.
    +
    +        * platform/text/TextEncoding.cpp: (WebCore::Latin1Encoding):
    +        "Latin-1" is not a real encoding name, it's not known to Firefox or IE.
    +
    +        * platform/text/TextEncodingRegistry.cpp:
    +        (WebCore::TextEncodingNameHash::equal): Changed to no longer ignore non-alphanumeric characters.
    +        There is a good chance that we'll be missing support for some necessary alias names, but other
    +        browsers don't ignore any characters when matching names.
    +        (WebCore::TextEncodingNameHash::hash): Ditto.
    +        (WebCore::checkExistingName): Re-formatted a line.
    +        (WebCore::isUndesiredAlias): Added a filter to reject "8859_1" and any names containing commas.
    +        (WebCore::addToTextEncodingNameMap): Used it.
    +        (WebCore::atomicCanonicalTextEncodingName): Changed to no longer ignore non-alphanumeric characters.
    +
     2010-08-05  Simon Hausmann  <simon.hausmann@nokia.com>
     
  • trunk/WebCore/loader/TextResourceDecoder.cpp

    r62551 r64817  
    @@ -489,5 +489,5 @@
                         return false;
     
    -                int encodingNameLength = pos - dataStart + 1;
    +                int encodingNameLength = pos - dataStart;
     
                     ++pos;
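
The off-by-one fixed in this file can be shown in isolation. The following is a minimal sketch, not the WebKit implementation: `cssCharsetName` is a hypothetical helper, and the requirement that a semicolon immediately follows the closing quote is an assumption of this example. The point it illustrates is that the name length is the distance from the first name character to the closing quote, with no `+ 1`, so the quote itself is never part of the name.

```cpp
#include <cstddef>
#include <optional>
#include <string>
#include <string_view>

// Hypothetical helper: returns the encoding name from a leading
// `@charset "name";` rule, or nullopt if the rule is absent or malformed.
std::optional<std::string> cssCharsetName(std::string_view data)
{
    constexpr std::string_view prefix = "@charset \"";
    if (data.substr(0, prefix.size()) != prefix)
        return std::nullopt;

    const std::size_t nameStart = prefix.size();
    const std::size_t quotePos = data.find('"', nameStart);
    if (quotePos == std::string_view::npos)
        return std::nullopt;

    // The name ends just before the closing quote. Adding 1 here would
    // include the quote in the name, which is the bug this changeset fixes.
    const std::size_t nameLength = quotePos - nameStart;

    // Assumption of this sketch: a semicolon must follow the closing quote.
    if (quotePos + 1 >= data.size() || data[quotePos + 1] != ';')
        return std::nullopt;

    return std::string(data.substr(nameStart, nameLength));
}
```

With the old `+ 1` computation the extracted name would have carried the closing quote. That went unnoticed only because the registry lookup used to skip non-alphanumeric characters.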
  • trunk/WebCore/platform/text/TextCodecICU.cpp

    r56825 r64817  
    @@ -71,8 +71,4 @@
     }
     
    -// FIXME: Registering all the encodings we get from ucnv_getAvailableName
    -// includes encodings we don't want or need. For example, all
    -// the encodings with commas and version numbers.
    -
     void TextCodecICU::registerExtendedEncodingNames(EncodingNameRegistrar registrar)
     {
    @@ -137,24 +133,25 @@
         // Perhaps we can prove these are not used on the web and remove them.
         // Or perhaps we can get them added to ICU.
    -    registrar("xmacroman", "macintosh");
    -    registrar("xmacukrainian", "x-mac-cyrillic");
    -    registrar("cnbig5", "Big5");
    -    registrar("xxbig5", "Big5");
    -    registrar("cngb", "GBK");
    +    registrar("x-mac-roman", "macintosh");
    +    registrar("x-mac-ukrainian", "x-mac-cyrillic");
    +    registrar("cn-big5", "Big5");
    +    registrar("x-x-big5", "Big5");
    +    registrar("cn-gb", "GBK");
         registrar("csgb231280", "GBK");
    -    registrar("xeuccn", "GBK");
    -    registrar("xgbk", "GBK");
    -    registrar("csISO88598I", "ISO_8859-8-I");
    +    registrar("x-euc-cn", "GBK");
    +    registrar("x-gbk", "GBK");
    +    registrar("csISO88598I", "ISO-8859-8-I");
         registrar("koi", "KOI8-R");
         registrar("logical", "ISO-8859-8-I");
         registrar("unicode11utf8", "UTF-8");
         registrar("unicode20utf8", "UTF-8");
    -    registrar("xunicode20utf8", "UTF-8");
    +    registrar("x-unicode20utf8", "UTF-8");
         registrar("visual", "ISO-8859-8");
         registrar("winarabic", "windows-1256");
         registrar("winbaltic", "windows-1257");
         registrar("wincyrillic", "windows-1251");
    -    registrar("iso885911", "windows-874");
    -    registrar("dos874", "windows-874");
    +    registrar("iso-8859-11", "windows-874");
    +    registrar("iso8859-11", "windows-874");
    +    registrar("dos-874", "windows-874");
         registrar("wingreek", "windows-1253");
         registrar("winhebrew", "windows-1255");
    @@ -162,14 +159,32 @@
         registrar("winturkish", "windows-1254");
         registrar("winvietnamese", "windows-1258");
    -    registrar("xcp1250", "windows-1250");
    -    registrar("xcp1251", "windows-1251");
    -    registrar("xeuc", "EUC-JP");
    -    registrar("xwindows949", "windows-949");
    -    registrar("xuhc", "windows-949");
    +    registrar("x-cp1250", "windows-1250");
    +    registrar("x-cp1251", "windows-1251");
    +    registrar("x-euc", "EUC-JP");
    +    registrar("x-windows-949", "windows-949");
    +    registrar("x-uhc", "windows-949");
    +    registrar("utf8", "UTF-8");
     
         // These aliases are present in modern versions of ICU, but use different codecs, and have no standard names.
         // They are not present in ICU 3.2.
    -    registrar("dos720", "cp864");
    +    registrar("dos-720", "cp864");
         registrar("jis7", "ISO-2022-JP");
    +
    +    // Alternative spelling of ISO encoding names.
    +    registrar("ISO8859-1", "ISO-8859-1");
    +    registrar("ISO8859-2", "ISO-8859-2");
    +    registrar("ISO8859-3", "ISO-8859-3");
    +    registrar("ISO8859-4", "ISO-8859-4");
    +    registrar("ISO8859-5", "ISO-8859-5");
    +    registrar("ISO8859-6", "ISO-8859-6");
    +    registrar("ISO8859-7", "ISO-8859-7");
    +    registrar("ISO8859-8", "ISO-8859-8");
    +    registrar("ISO8859-8-I", "ISO-8859-8-I");
    +    registrar("ISO8859-9", "ISO-8859-9");
    +    registrar("ISO8859-10", "ISO-8859-10");
    +    registrar("ISO8859-13", "ISO-8859-13");
    +    registrar("ISO8859-14", "ISO-8859-14");
    +    registrar("ISO8859-15", "ISO-8859-15");
    +    registrar("ISO8859-16", "ISO-8859-16");
     }
     
  • trunk/WebCore/platform/text/TextCodecLatin1.cpp

    r56825 r64817  
    @@ -80,5 +80,4 @@
         registrar("ibm-1252_P100-2000", "windows-1252");
     
    -    registrar("8859-1", "ISO-8859-1");
         registrar("CP819", "ISO-8859-1");
         registrar("IBM819", "ISO-8859-1");
  • trunk/WebCore/platform/text/TextEncoding.cpp

    r56825 r64817  
    @@ -249,5 +249,5 @@
     const TextEncoding& Latin1Encoding()
     {
    -    static TextEncoding globalLatin1Encoding("Latin-1");
    +    static TextEncoding globalLatin1Encoding("latin1");
         return globalLatin1Encoding;
     }
  • trunk/WebCore/platform/text/TextEncodingRegistry.cpp

    r63036 r64817  
    @@ -62,8 +62,5 @@
     const size_t maxEncodingNameLength = 63;
     
    -// Hash for all-ASCII strings that does case folding and skips any characters
    -// that are not alphanumeric. If passed any non-ASCII characters, depends on
    -// the behavior of isalnum -- if that returns false as it does on OS X, then
    -// it will properly skip those characters too.
    +// Hash for all-ASCII strings that does case folding.
     struct TextEncodingNameHash {
     
    @@ -73,10 +70,6 @@
             char c2;
             do {
    -            do
    -                c1 = *s1++;
    -            while (c1 && !isASCIIAlphanumeric(c1));
    -            do
    -                c2 = *s2++;
    -            while (c2 && !isASCIIAlphanumeric(c2));
    +            c1 = *s1++;
    +            c2 = *s2++;
                 if (toASCIILower(c1) != toASCIILower(c2))
                     return false;
    @@ -92,14 +85,11 @@
             unsigned h = WTF::stringHashingStartValue;
             for (;;) {
    -            char c;
    -            do {
    -                c = *s++;
    -                if (!c) {
    -                    h += (h << 3);
    -                    h ^= (h >> 11);
    -                    h += (h << 15);
    -                    return h;
    -                }
    -            } while (!isASCIIAlphanumeric(c));
    +            char c = *s++;
    +            if (!c) {
    +                h += (h << 3);
    +                h ^= (h >> 11);
    +                h += (h << 15);
    +                return h;
    +            }
                 h += toASCIILower(c);
                 h += (h << 10);
    @@ -155,13 +145,28 @@
                 && strcasecmp(atomicName, "iso-8859-8") == 0)
             return;
    -    LOG_ERROR("alias %s maps to %s already, but someone is trying to make it map to %s",
    -        alias, oldAtomicName, atomicName);
    -}
    -
    -#endif
    +    LOG_ERROR("alias %s maps to %s already, but someone is trying to make it map to %s", alias, oldAtomicName, atomicName);
    +}
    +
    +#endif
    +
    +static bool isUndesiredAlias(const char* alias)
    +{
    +    // Reject aliases with version numbers that are supported by some back-ends (such as "ISO_2022,locale=ja,version=0" in ICU).
    +    for (const char* p = alias; *p; ++p) {
    +        if (*p == ',')
    +            return true;
    +    }
    +    // 8859_1 is known to (at least) ICU, but other browsers don't support this name - and having it caused a compatibility
    +    // problem, see bug 43554.
    +    if (0 == strcmp(alias, "8859_1"))
    +        return true;
    +    return false;
    +}
     
     static void addToTextEncodingNameMap(const char* alias, const char* name)
     {
         ASSERT(strlen(alias) <= maxEncodingNameLength);
    +    if (isUndesiredAlias(alias))
    +        return;
         const char* atomicName = textEncodingNameMap->get(name);
         ASSERT(strcmp(alias, name) == 0 || atomicName);
    @@ -301,9 +306,7 @@
         for (size_t i = 0; i < length; ++i) {
             UChar c = characters[i];
    -        if (isASCIIAlphanumeric(c)) {
    -            if (j == maxEncodingNameLength)
    -                return 0;
    -            buffer[j++] = c;
    -        }
    +        if (j == maxEncodingNameLength)
    +            return 0;
    +        buffer[j++] = c;
         }
         buffer[j] = 0;