Changeset 78451 in webkit


Timestamp:
Feb 13, 2011 7:28:37 PM (13 years ago)
Author:
Darin Adler
Message:

2011-02-12 Darin Adler <darin@apple.com>

Reviewed by Alexey Proskuryakov.

Add built-in decoder for UTF-8 for improved performance
https://bugs.webkit.org/show_bug.cgi?id=53898

Covered by existing tests; not adding new tests at this time.

This patch now handles errors in the same way the existing codecs do,
and so passes our tests. The previous version failed some tests because
of incorrect error handling.
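
The error-handling policy this patch settles on is stated in its code comments: each error consumes one byte and generates one replacement character. The following is a minimal sketch of that policy, not the WebKit code. It assumes the whole input is available at once (as if flush were always true), decodes to UTF-32 instead of UChar, and every name in it (decodeOneSketch, decodeWithReplacementSketch, replacementCharacterSketch) is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical stand-in for WTF's replacementCharacter (U+FFFD).
const char32_t replacementCharacterSketch = 0xFFFD;

// Tries to decode one sequence starting at p, with n bytes remaining.
// Returns the sequence length on success, or 0 if the bytes at p do not
// form a valid, complete sequence.
static std::size_t decodeOneSketch(const uint8_t* p, std::size_t n, char32_t& out)
{
    assert(n > 0);
    uint8_t first = p[0];
    if (first < 0x80) {
        out = first;
        return 1;
    }
    std::size_t length;
    uint8_t low = 0x80;  // Allowed range for the second byte; narrowed below
    uint8_t high = 0xBF; // for the lead bytes that need it.
    if (first >= 0xC2 && first <= 0xDF) {
        length = 2;
        out = first & 0x1F;
    } else if (first >= 0xE0 && first <= 0xEF) {
        length = 3;
        out = first & 0x0F;
        if (first == 0xE0)
            low = 0xA0;  // Rejects overlong encodings.
        if (first == 0xED)
            high = 0x9F; // Rejects surrogates.
    } else if (first >= 0xF0 && first <= 0xF4) {
        length = 4;
        out = first & 0x07;
        if (first == 0xF0)
            low = 0x90;  // Rejects overlong encodings.
        if (first == 0xF4)
            high = 0x8F; // Rejects values above U+10FFFF.
    } else
        return 0; // 0x80-0xC1 and 0xF5-0xFF cannot start a valid sequence.
    if (length > n)
        return 0; // Truncated at the end of the input.
    for (std::size_t i = 1; i < length; ++i) {
        uint8_t lo = i == 1 ? low : 0x80;
        uint8_t hi = i == 1 ? high : 0xBF;
        if (p[i] < lo || p[i] > hi)
            return 0;
        out = (out << 6) | (p[i] & 0x3F);
    }
    return length;
}

// Decodes a complete buffer. Each error consumes exactly one byte and
// produces exactly one replacement character; decoding then resumes at
// the next byte.
std::u32string decodeWithReplacementSketch(const std::string& bytes)
{
    std::u32string result;
    const uint8_t* p = reinterpret_cast<const uint8_t*>(bytes.data());
    std::size_t i = 0;
    while (i < bytes.size()) {
        char32_t character;
        std::size_t length = decodeOneSketch(p + i, bytes.size() - i, character);
        if (!length) {
            result.push_back(replacementCharacterSketch);
            ++i;
            continue;
        }
        result.push_back(character);
        i += length;
    }
    return result;
}
```

Under this sketch a truncated sequence such as the two bytes E2 82 yields two replacement characters, one per byte, which matches the patch's sizing comment that each input byte might turn into a character.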

  • platform/text/TextCodecICU.cpp:
      (WebCore::create): Renamed from newTextCodecICU, made a static member function, and added a call to adoptPtr.
      (WebCore::TextCodecICU::registerEncodingNames): Renamed from registerExtendedEncodingNames since this class is no longer used for base codecs. Removed aliases for UTF-8; now handled by TextCodecUTF8.
      (WebCore::TextCodecICU::registerCodecs): Renamed.
      (WebCore::fallbackForGBK): Renamed to conform to our current style.
  • platform/text/TextCodecICU.h: Updated for above changes. Changed indentation. Made most functions private, including virtual function overrides. Marked ICUConverterWrapper noncopyable.
  • platform/text/TextCodecUTF8.cpp:
      (WebCore::TextCodecUTF8::registerEncodingNames): Added the UTF-8 aliases that were formerly added by TextCodecICU.
      (WebCore::nonASCIISequenceLength): Fixed bug where this would return 4 for bytes F5-FF instead of failing.
      (WebCore::decodeNonASCIISequence): Tweaked coding style.
      (WebCore::appendCharacter): Added. Makes it easier to share code between the partial-character handling and main loop.
      (WebCore::TextCodecUTF8::decode): Fixed buffer size computation for case where there is a partial sequence. Fixed partial sequence handling so that goto is no longer needed, since compilers sometimes make poor code when goto is involved. Added a loop for partial sequences since we consume only one byte when a partial sequence is invalid. Fixed logic in main decoding loop so goto is not needed. Used early-exit style in both loops so the main flow is not nested inside if statements. Added correct error handling for flush when a partial sequence remains, which involved wrapping the function in yet another loop.
  • platform/text/TextCodecUTF8.h: Made virtual function overrides private.
  • platform/text/TextEncodingRegistry.cpp:
      (WebCore::buildBaseTextCodecMaps): Added calls to TextCodecUTF8. Removed calls to TextCodecICU. Added FIXMEs for other codecs that no longer need to be included here.
      (WebCore::extendTextCodecMaps): Updated for the name change of the TextCodecICU functions.
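
To make the nonASCIISequenceLength fix concrete, here is a compact sketch that returns the same values as the 256-entry table in the patch, written with range checks instead of a lookup. The name sequenceLengthSketch and the small check function are hypothetical and only for illustration; the patch itself uses a static table indexed by the first byte.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper that mirrors the values of the patch's lookup table.
// A result of 0 means no multi-byte sequence can start with this byte.
inline int sequenceLengthSketch(uint8_t firstByte)
{
    if (firstByte < 0x80)
        return 0; // ASCII is handled on a separate fast path.
    if (firstByte < 0xE0)
        return 2; // 0x80-0xDF; first bytes below 0xC2 are rejected later, while decoding.
    if (firstByte < 0xF0)
        return 3; // 0xE0-0xEF
    if (firstByte < 0xF5)
        return 4; // 0xF0-0xF4
    return 0;     // 0xF5-0xFF: the old switch returned 4 here.
}

// A few spot checks of the behavior described in the changelog entry.
inline void checkSequenceLengthSketch()
{
    assert(sequenceLengthSketch(0xC3) == 2);
    assert(sequenceLengthSketch(0xE2) == 3);
    assert(sequenceLengthSketch(0xF4) == 4);
    assert(sequenceLengthSketch(0xF5) == 0);
    assert(sequenceLengthSketch(0xFF) == 0);
}
```

Returning 0 rather than a length lets the caller treat the byte as an error immediately, without waiting for continuation bytes that could never complete a valid sequence.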
Location:
trunk/Source/WebCore
Files:
6 edited

  • trunk/Source/WebCore/ChangeLog

    r78450 r78451  
     12011-02-12  Darin Adler  <darin@apple.com>
     2
     3        Reviewed by Alexey Proskuryakov.
     4
     5        Add built-in decoder for UTF-8 for improved performance
     6        https://bugs.webkit.org/show_bug.cgi?id=53898
     7
     8        Covered by existing tests; not adding new tests at this time.
     9
     10        This patch now handles errors in the same way the existing codecs do,
     11        and so passes our tests. The previous version failed some tests because
     12        of incorrect error handling.
     13
     14        * platform/text/TextCodecICU.cpp:
     15        (WebCore::create): Renamed from newTextCodecICU, made a static member
     16        function, and added a call to adoptPtr.
     17        (WebCore::TextCodecICU::registerEncodingNames): Renamed from
     18        registerExtendedEncodingNames since this class is no longer used for
     19        base codecs. Removed aliases for UTF-8; now handled by TextCodecUTF8.
     20        (WebCore::TextCodecICU::registerCodecs): Renamed.
     21        (WebCore::fallbackForGBK): Renamed to conform to our current style.
     22
     23        * platform/text/TextCodecICU.h: Updated for above changes. Changed
     24        indentation. Made most functions private, including virtual function
     25        overrides. Marked ICUConverterWrapper noncopyable.
     26
     27        * platform/text/TextCodecUTF8.cpp:
     28        (WebCore::TextCodecUTF8::registerEncodingNames): Added the UTF-8 aliases
     29        that were formerly added by TextCodecICU.
     30        (WebCore::nonASCIISequenceLength): Fixed bug where this would return 4 for
     31        bytes F5-FF instead of failing.
     32        (WebCore::decodeNonASCIISequence): Tweaked coding style.
     33        (WebCore::appendCharacter): Added. Makes it easier to share code between
     34        the partial-character handling and main loop.
     35        (WebCore::TextCodecUTF8::decode): Fixed buffer size computation for case
     36        where there is a partial sequence. Fixed partial sequence handling so that
     37        goto is no longer needed, since compilers sometimes make poor code when
     38        goto is involved. Added a loop for partial sequences since we consume only
     39        one byte when a partial sequence is invalid. Fixed logic in main decoding
     40        loop so goto is not needed. Used early-exit style in both loops so the main
     41        flow is not nested inside if statements. Added correct error handling for
     42        flush when a partial sequence remains, which involved wrapping the function
     43        in yet another loop.
     44
     45        * platform/text/TextCodecUTF8.h: Made virtual function overrides private.
     46
     47        * platform/text/TextEncodingRegistry.cpp:
     48        (WebCore::buildBaseTextCodecMaps): Added calls to TextCodecUTF8. Removed
     49        calls to TextCodecICU. Added FIXMEs for other codecs that no longer need
     50        to be included here.
     51        (WebCore::extendTextCodecMaps): Updated for the name change of the
     52        TextCodecICU functions.
     53
    1542011-02-13  Mark Rowe  <mrowe@apple.com>
    255
  • trunk/Source/WebCore/platform/text/TextCodecICU.cpp

    r77849 r78451  
    11/*
    2  * Copyright (C) 2004, 2006, 2007, 2008 Apple Inc. All rights reserved.
     2 * Copyright (C) 2004, 2006, 2007, 2008, 2011 Apple Inc. All rights reserved.
    33 * Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com>
    44 *
     
    2828#include "TextCodecICU.h"
    2929
    30 #include "PlatformString.h"
    3130#include "ThreadGlobalData.h"
    3231#include <unicode/ucnv.h>
    3332#include <unicode/ucnv_cb.h>
    3433#include <wtf/Assertions.h>
    35 #include <wtf/text/CString.h>
    36 #include <wtf/PassOwnPtr.h>
    3734#include <wtf/StringExtras.h>
    3835#include <wtf/Threading.h>
     36#include <wtf/text/CString.h>
    3937#include <wtf/unicode/CharacterNames.h>
    4038
     
    5654}
    5755
    58 static PassOwnPtr<TextCodec> newTextCodecICU(const TextEncoding& encoding, const void*)
    59 {
    60     return new TextCodecICU(encoding);
    61 }
    62 
    63 void TextCodecICU::registerBaseEncodingNames(EncodingNameRegistrar registrar)
    64 {
    65     registrar("UTF-8", "UTF-8");
    66 }
    67 
    68 void TextCodecICU::registerBaseCodecs(TextCodecRegistrar registrar)
    69 {
    70     registrar("UTF-8", newTextCodecICU, 0);
    71 }
    72 
    73 void TextCodecICU::registerExtendedEncodingNames(EncodingNameRegistrar registrar)
     56PassOwnPtr<TextCodec> TextCodecICU::create(const TextEncoding& encoding, const void*)
     57{
     58    return adoptPtr(new TextCodecICU(encoding));
     59}
     60
     61void TextCodecICU::registerEncodingNames(EncodingNameRegistrar registrar)
    7462{
    7563    // We register Hebrew with logical ordering using a separate name.
     
    144132    registrar("koi", "KOI8-R");
    145133    registrar("logical", "ISO-8859-8-I");
    146     registrar("unicode11utf8", "UTF-8");
    147     registrar("unicode20utf8", "UTF-8");
    148     registrar("x-unicode20utf8", "UTF-8");
    149134    registrar("visual", "ISO-8859-8");
    150135    registrar("winarabic", "windows-1256");
     
    164149    registrar("x-windows-949", "windows-949");
    165150    registrar("x-uhc", "windows-949");
    166     registrar("utf8", "UTF-8");
    167151    registrar("shift-jis", "Shift_JIS");
    168152
     
    191175}
    192176
    193 void TextCodecICU::registerExtendedCodecs(TextCodecRegistrar registrar)
     177void TextCodecICU::registerCodecs(TextCodecRegistrar registrar)
    194178{
    195179    // See comment above in registerEncodingNames.
    196     registrar("ISO-8859-8-I", newTextCodecICU, 0);
     180    registrar("ISO-8859-8-I", create, 0);
    197181
    198182    int32_t numEncodings = ucnv_countAvailable();
     
    207191                continue;
    208192        }
    209         registrar(standardName, newTextCodecICU, 0);
     193        registrar(standardName, create, 0);
    210194    }
    211195}
     
    301285        }
    302286    }
     287
    303288private:
    304289    UConverter* m_converter;
     
    355340
    356341// We need to apply these fallbacks ourselves as they are not currently supported by ICU and
    357 // they were provided by the old TEC encoding path
    358 // Needed to fix <rdar://problem/4708689>
    359 static UChar getGbkEscape(UChar32 codePoint)
    360 {
    361     switch (codePoint) {
    362         case 0x01F9:
    363             return 0xE7C8;
    364         case 0x1E3F:
    365             return 0xE7C7;
    366         case 0x22EF:
    367             return 0x2026;
    368         case 0x301C:
    369             return 0xFF5E;
    370         default:
    371             return 0;
    372     }
     342// they were provided by the old TEC encoding path. Needed to fix <rdar://problem/4708689>.
     343static UChar fallbackForGBK(UChar32 character)
     344{
     345    switch (character) {
     346    case 0x01F9:
     347        return 0xE7C8;
     348    case 0x1E3F:
     349        return 0xE7C7;
     350    case 0x22EF:
     351        return 0x2026;
     352    case 0x301C:
     353        return 0xFF5E;
     354    }
     355    return 0;
    373356}
    374357
     
    376359// characters. See the declaration of TextCodec::encode for more.
    377360static void urlEscapedEntityCallback(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
    378                                      UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
     361    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
    379362{
    380363    if (reason == UCNV_UNASSIGNED) {
     
    390373// Substitutes special GBK characters, escaping all other unassigned entities.
    391374static void gbkCallbackEscape(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
    392                               UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
     375    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
    393376{
    394377    UChar outChar;
    395     if (reason == UCNV_UNASSIGNED && (outChar = getGbkEscape(codePoint))) {
     378    if (reason == UCNV_UNASSIGNED && (outChar = fallbackForGBK(codePoint))) {
    396379        const UChar* source = &outChar;
    397380        *err = U_ZERO_ERROR;
     
    404387// Combines both gbkUrlEscapedEntityCallback and GBK character substitution.
    405388static void gbkUrlEscapedEntityCallack(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
    406                                        UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
     389    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
    407390{
    408391    if (reason == UCNV_UNASSIGNED) {
    409         if (UChar outChar = getGbkEscape(codePoint)) {
     392        if (UChar outChar = fallbackForGBK(codePoint)) {
    410393            const UChar* source = &outChar;
    411394            *err = U_ZERO_ERROR;
     
    420403
    421404static void gbkCallbackSubstitute(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
    422                                   UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
     405    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
    423406{
    424407    UChar outChar;
    425     if (reason == UCNV_UNASSIGNED && (outChar = getGbkEscape(codePoint))) {
     408    if (reason == UCNV_UNASSIGNED && (outChar = fallbackForGBK(codePoint))) {
    426409        const UChar* source = &outChar;
    427410        *err = U_ZERO_ERROR;
     
    487470}
    488471
    489 
    490472} // namespace WebCore
  • trunk/Source/WebCore/platform/text/TextCodecICU.h

    r77831 r78451  
    11/*
    2  * Copyright (C) 2004, 2006, 2007 Apple Inc. All rights reserved.
     2 * Copyright (C) 2004, 2006, 2007, 2011 Apple Inc. All rights reserved.
    33 * Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com>
    44 *
     
    3030#include "TextCodec.h"
    3131#include "TextEncoding.h"
    32 
    3332#include <unicode/utypes.h>
    3433
     
    3938    class TextCodecICU : public TextCodec {
    4039    public:
    41         static void registerBaseEncodingNames(EncodingNameRegistrar);
    42         static void registerBaseCodecs(TextCodecRegistrar);
     40        static void registerEncodingNames(EncodingNameRegistrar);
     41        static void registerCodecs(TextCodecRegistrar);
    4342
    44         static void registerExtendedEncodingNames(EncodingNameRegistrar);
    45         static void registerExtendedCodecs(TextCodecRegistrar);
     43        virtual ~TextCodecICU();
    4644
     45    private:
    4746        TextCodecICU(const TextEncoding&);
    48         virtual ~TextCodecICU();
     47        static PassOwnPtr<TextCodec> create(const TextEncoding&, const void*);
    4948
    5049        virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
    5150        virtual CString encode(const UChar*, size_t length, UnencodableHandling);
    5251
    53     private:
    5452        void createICUConverter() const;
    5553        void releaseICUConverter() const;
     
    6866
    6967    struct ICUConverterWrapper {
    70         ICUConverterWrapper()
    71             : converter(0)
    72         {
    73         }
     68        ICUConverterWrapper() : converter(0) { }
    7469        ~ICUConverterWrapper();
    7570
    7671        UConverter* converter;
     72
     73        WTF_MAKE_NONCOPYABLE(ICUConverterWrapper);
    7774    };
    7875
  • trunk/Source/WebCore/platform/text/TextCodecUTF8.cpp

    r77819 r78451  
    2929#include <wtf/text/CString.h>
    3030#include <wtf/text/StringBuffer.h>
    31 #include <wtf/unicode/UTF8.h>
     31#include <wtf/unicode/CharacterNames.h>
    3232
    3333using namespace WTF::Unicode;
     
    3535
    3636namespace WebCore {
     37
     38const int nonCharacter = -1;
    3739
    3840// Assuming that a pointer is the size of a "machine word", then
     
    9496{
    9597    registrar("UTF-8", "UTF-8");
     98
     99    // Additional aliases that originally were present in the encoding
     100    // table in WebKit on Macintosh, and subsequently added by
     101    // TextCodecICU. Perhaps we can prove some are not used on the web
     102    // and remove them.
     103    registrar("unicode11utf8", "UTF-8");
     104    registrar("unicode20utf8", "UTF-8");
     105    registrar("utf8", "UTF-8");
     106    registrar("x-unicode20utf8", "UTF-8");
    96107}
    97108
     
    101112}
    102113
    103 static inline int nonASCIISequenceLength(unsigned char firstByte)
    104 {
    105     ASSERT(!isASCII(firstByte));
    106     switch (firstByte >> 4) {
    107     case 0xF:
    108         return 4;
    109     case 0xE:
    110         return 3;
    111     }
    112     return 2;
    113 }
    114 
    115 static inline int decodeNonASCIISequence(const unsigned char* sequence, unsigned length)
     114static inline int nonASCIISequenceLength(uint8_t firstByte)
     115{
     116    static const uint8_t lengths[256] = {
     117        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     118        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     119        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     120        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     121        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     122        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     123        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     124        0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
     125        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     126        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     127        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     128        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     129        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     130        2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
     131        3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
     132        4, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
     133    };
     134    return lengths[firstByte];
     135}
     136
     137static inline int decodeNonASCIISequence(const uint8_t* sequence, unsigned length)
    116138{
    117139    ASSERT(!isASCII(sequence[0]));
     
    119141        ASSERT(sequence[0] <= 0xDF);
    120142        if (sequence[0] < 0xC2)
    121             return -1;
     143            return nonCharacter;
    122144        if (sequence[1] < 0x80 || sequence[1] > 0xBF)
    123             return -1;
     145            return nonCharacter;
    124146        return ((sequence[0] << 6) + sequence[1]) - 0x00003080;
    125147    }
     
    129151        case 0xE0:
    130152            if (sequence[1] < 0xA0 || sequence[1] > 0xBF)
    131                 return -1;
     153                return nonCharacter;
    132154            break;
    133155        case 0xED:
    134156            if (sequence[1] < 0x80 || sequence[1] > 0x9F)
    135                 return -1;
     157                return nonCharacter;
    136158            break;
    137159        default:
    138160            if (sequence[1] < 0x80 || sequence[1] > 0xBF)
    139                 return -1;
     161                return nonCharacter;
    140162        }
    141163        if (sequence[2] < 0x80 || sequence[2] > 0xBF)
    142             return -1;
     164            return nonCharacter;
    143165        return ((sequence[0] << 12) + (sequence[1] << 6) + sequence[2]) - 0x000E2080;
    144166    }
     
    148170    case 0xF0:
    149171        if (sequence[1] < 0x90 || sequence[1] > 0xBF)
    150             return -1;
     172            return nonCharacter;
    151173        break;
    152174    case 0xF4:
    153175        if (sequence[1] < 0x80 || sequence[1] > 0x8F)
    154             return -1;
     176            return nonCharacter;
    155177        break;
    156178    default:
    157179        if (sequence[1] < 0x80 || sequence[1] > 0xBF)
    158             return -1;
     180            return nonCharacter;
    159181    }
    160182    if (sequence[2] < 0x80 || sequence[2] > 0xBF)
    161         return -1;
     183        return nonCharacter;
    162184    if (sequence[3] < 0x80 || sequence[3] > 0xBF)
    163         return -1;
     185        return nonCharacter;
    164186    return ((sequence[0] << 18) + (sequence[1] << 12) + (sequence[2] << 6) + sequence[3]) - 0x03C82080;
    165187}
    166188
     189static inline UChar* appendCharacter(UChar* destination, int character)
     190{
     191    ASSERT(character != nonCharacter);
     192    ASSERT(!U_IS_SURROGATE(character));
     193    if (U_IS_BMP(character))
     194        *destination++ = character;
     195    else {
     196        *destination++ = U16_LEAD(character);
     197        *destination++ = U16_TRAIL(character);
     198    }
     199    return destination;
     200}
     201
    167202String TextCodecUTF8::decode(const char* bytes, size_t length, bool flush, bool stopOnError, bool& sawError)
    168203{
    169     StringBuffer buffer(length);
     204    // Each input byte might turn into a character.
     205    // That includes all bytes in the partial-sequence buffer because
     206    // each byte in an invalid sequence will turn into a replacement character.
     207    StringBuffer buffer(m_partialSequenceSize + length);
    170208
    171209    const uint8_t* source = reinterpret_cast<const uint8_t*>(bytes);
     
    174212    UChar* destination = buffer.characters();
    175213
    176     int count;
    177     int character;
    178 
    179     if (m_partialSequenceSize) {
    180         count = nonASCIISequenceLength(m_partialSequence[0]);
    181         ASSERT(count > m_partialSequenceSize);
    182         if (count - m_partialSequenceSize > end - source) {
    183             memcpy(m_partialSequence + m_partialSequenceSize, source, end - source);
    184             m_partialSequenceSize += end - source;
    185             source = end;
    186         } else {
     214    do {
     215        while (m_partialSequenceSize) {
     216            int count = nonASCIISequenceLength(m_partialSequence[0]);
     217            ASSERT(count > m_partialSequenceSize);
     218            ASSERT(count >= 2);
     219            ASSERT(count <= 4);
     220            if (count - m_partialSequenceSize > end - source) {
     221                if (!flush) {
     222                    // We have an incomplete partial sequence, so put it all in the partial
     223                    // sequence buffer, and break out of this loop so we can exit the function.
     224                    memcpy(m_partialSequence + m_partialSequenceSize, source, end - source);
     225                    m_partialSequenceSize += end - source;
     226                    source = end;
     227                    break;
     228                }
     229                // We have an incomplete partial sequence at the end of the buffer.
     230                // That is an error.
     231                sawError = true;
     232                if (stopOnError) {
     233                    source = end;
     234                    break;
     235                }
     236                // Each error consumes one byte and generates one replacement character.
     237                --m_partialSequenceSize;
     238                memmove(m_partialSequence, m_partialSequence + 1, m_partialSequenceSize);
     239                *destination++ = replacementCharacter;
     240                continue;
     241            }
    187242            uint8_t completeSequence[U8_MAX_LENGTH];
    188243            memcpy(completeSequence, m_partialSequence, m_partialSequenceSize);
    189244            memcpy(completeSequence + m_partialSequenceSize, source, count - m_partialSequenceSize);
    190245            source += count - m_partialSequenceSize;
     246            int character = decodeNonASCIISequence(completeSequence, count);
     247            if (character == nonCharacter) {
     248                sawError = true;
     249                if (stopOnError) {
     250                    source = end;
     251                    break;
     252                }
     253                // Each error consumes one byte and generates one replacement character.
     254                memcpy(m_partialSequence, completeSequence + 1, count - 1);
     255                m_partialSequenceSize = count - 1;
     256                *destination++ = replacementCharacter;
     257                continue;
     258            }
    191259            m_partialSequenceSize = 0;
    192             character = decodeNonASCIISequence(completeSequence, count);
    193             goto decodedNonASCII;
     260            destination = appendCharacter(destination, character);
    194261        }
    195     }
    196 
    197     while (source < end) {
    198         if (isASCII(*source)) {
    199             // Fast path for ASCII. Most UTF-8 text will be ASCII.
    200             if (isAlignedToMachineWord(source)) {
    201                 while (source < alignedEnd) {
    202                     MachineWord chunk = *reinterpret_cast_ptr<const MachineWord*>(source);
    203                     if (chunk & NonASCIIMask<sizeof(MachineWord)>::value()) {
    204                         if (isASCII(*source))
     262
     263        while (source < end) {
     264            if (isASCII(*source)) {
     265                // Fast path for ASCII. Most UTF-8 text will be ASCII.
     266                if (isAlignedToMachineWord(source)) {
     267                    while (source < alignedEnd) {
     268                        MachineWord chunk = *reinterpret_cast_ptr<const MachineWord*>(source);
     269                        if (chunk & NonASCIIMask<sizeof(MachineWord)>::value())
    205270                            break;
    206                         goto nonASCII;
     271                        UCharByteFiller<sizeof(MachineWord)>::copy(destination, source);
     272                        source += sizeof(MachineWord);
     273                        destination += sizeof(MachineWord);
    207274                    }
    208                     UCharByteFiller<sizeof(MachineWord)>::copy(destination, source);
    209                     source += sizeof(MachineWord);
    210                     destination += sizeof(MachineWord);
    211                 }
    212                 if (source == end)
    213                     break;
    214             }
    215             *destination++ = *source++;
    216         } else {
    217 nonASCII:
    218             count = nonASCIISequenceLength(*source);
    219             ASSERT(count >= 2);
    220             ASSERT(count <= 4);
    221             if (count > end - source) {
    222                 ASSERT(end - source <= static_cast<ptrdiff_t>(sizeof(m_partialSequence)));
    223                 ASSERT(!m_partialSequenceSize);
    224                 m_partialSequenceSize = end - source;
    225                 memcpy(m_partialSequence, source, m_partialSequenceSize);
    226                 break;
    227             }
    228             character = decodeNonASCIISequence(source, count);
     275                    if (source == end)
     276                        break;
     277                    if (!isASCII(*source))
     278                        continue;
     279                }
     280                *destination++ = *source++;
     281                continue;
     282            }
     283            int count = nonASCIISequenceLength(*source);
     284            int character;
     285            if (!count)
     286                character = nonCharacter;
     287            else {
     288                ASSERT(count >= 2);
     289                ASSERT(count <= 4);
     290                if (count > end - source) {
     291                    ASSERT(end - source <= static_cast<ptrdiff_t>(sizeof(m_partialSequence)));
     292                    ASSERT(!m_partialSequenceSize);
     293                    m_partialSequenceSize = end - source;
     294                    memcpy(m_partialSequence, source, m_partialSequenceSize);
     295                    break;
     296                }
     297                character = decodeNonASCIISequence(source, count);
     298            }
     299            if (character == nonCharacter) {
     300                sawError = true;
     301                if (stopOnError)
     302                    break;
     303                // Each error consumes one byte and generates one replacement character.
     304                ++source;
     305                *destination++ = replacementCharacter;
     306                continue;
     307            }
    229308            source += count;
    230 decodedNonASCII:
    231             if (character < 0) {
    232                 if (stopOnError) {
    233                     sawError = true;
    234                     break;
    235                 }
    236             } else {
    237                 ASSERT(!U_IS_SURROGATE(character));
    238                 if (U_IS_BMP(character))
    239                     *destination++ = character;
    240                 else {
    241                     *destination++ = U16_LEAD(character);
    242                     *destination++ = U16_TRAIL(character);
    243                 }
    244             }
     309            destination = appendCharacter(destination, character);
    245310        }
    246     }
     311    } while (flush && m_partialSequenceSize);
    247312
    248313    buffer.shrink(destination - buffer.characters());
    249 
    250     if (flush && m_partialSequenceSize)
    251         sawError = true;
    252314
    253315    return String::adopt(buffer);
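
The appendCharacter helper added in this file writes a decoded code point as either one UTF-16 code unit or a lead/trail surrogate pair, using ICU's U_IS_BMP, U16_LEAD and U16_TRAIL macros. The sketch below shows the same arithmetic without ICU, appending to a std::u16string rather than advancing a UChar pointer. The name appendCodePointSketch is hypothetical, and this is an illustration of the technique, not the code in the patch.

```cpp
#include <cassert>
#include <string>

// Hypothetical equivalent of appendCharacter, using the standard library only.
void appendCodePointSketch(std::u16string& destination, char32_t character)
{
    assert(character <= 0x10FFFF);
    assert(character < 0xD800 || character > 0xDFFF); // Surrogates are never valid input.
    if (character <= 0xFFFF) {
        // Basic Multilingual Plane: one code unit.
        destination.push_back(static_cast<char16_t>(character));
        return;
    }
    // Supplementary planes: split the 20-bit offset into two 10-bit halves.
    char32_t offset = character - 0x10000;
    destination.push_back(static_cast<char16_t>(0xD800 + (offset >> 10)));   // Lead surrogate.
    destination.push_back(static_cast<char16_t>(0xDC00 + (offset & 0x3FF))); // Trail surrogate.
}
```

For example, U+1F600 becomes the pair D83D DE00, which is why a four-byte UTF-8 sequence can produce two output code units.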
  • trunk/Source/WebCore/platform/text/TextCodecUTF8.h

    r77819 r78451  
    3636    static void registerCodecs(TextCodecRegistrar);
    3737
    38     virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
    39     virtual CString encode(const UChar*, size_t length, UnencodableHandling);
    40 
    4138private:
    4239    static PassOwnPtr<TextCodec> create(const TextEncoding&, const void*);
    4340    TextCodecUTF8() : m_partialSequenceSize(0) { }
     41
     42    virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
     43    virtual CString encode(const UChar*, size_t length, UnencodableHandling);
    4444
    4545    int m_partialSequenceSize;
  • trunk/Source/WebCore/platform/text/TextEncodingRegistry.cpp

    r77831 r78451  
    5959#endif
    6060
     61#include <wtf/CurrentTime.h>
     62#include <wtf/text/CString.h>
     63
    6164using namespace WTF;
    6265
     
    221224    TextCodecLatin1::registerCodecs(addToTextCodecMap);
    222225
     226    TextCodecUTF8::registerEncodingNames(addToTextEncodingNameMap);
     227    TextCodecUTF8::registerCodecs(addToTextCodecMap);
     228
    223229    TextCodecUTF16::registerEncodingNames(addToTextEncodingNameMap);
    224230    TextCodecUTF16::registerCodecs(addToTextCodecMap);
     
    227233    TextCodecUserDefined::registerCodecs(addToTextCodecMap);
    228234
    229 #if USE(ICU_UNICODE)
    230     TextCodecICU::registerBaseEncodingNames(addToTextEncodingNameMap);
    231     TextCodecICU::registerBaseCodecs(addToTextCodecMap);
    232 #endif
    233 
    234235#if USE(GLIB_UNICODE)
     236    // FIXME: This is not needed. The code above covers all the base codecs.
    235237    TextCodecGtk::registerBaseEncodingNames(addToTextEncodingNameMap);
    236238    TextCodecGtk::registerBaseCodecs(addToTextCodecMap);
     
    238240
    239241#if USE(BREWMP_UNICODE)
     242    // FIXME: This is not needed. The code above covers all the base codecs.
    240243    TextCodecBrew::registerBaseEncodingNames(addToTextEncodingNameMap);
    241244    TextCodecBrew::registerBaseCodecs(addToTextCodecMap);
     
    243246
    244247#if OS(WINCE) && !PLATFORM(QT)
     248    // FIXME: This is not needed. The code above covers all the base codecs.
    245249    TextCodecWinCE::registerBaseEncodingNames(addToTextEncodingNameMap);
    246250    TextCodecWinCE::registerBaseCodecs(addToTextCodecMap);
     
    304308{
    305309#if USE(ICU_UNICODE)
    306     TextCodecICU::registerExtendedEncodingNames(addToTextEncodingNameMap);
    307     TextCodecICU::registerExtendedCodecs(addToTextCodecMap);
     310    TextCodecICU::registerEncodingNames(addToTextEncodingNameMap);
     311    TextCodecICU::registerCodecs(addToTextCodecMap);
    308312#endif
    309313