Changeset 78451 in webkit

Timestamp:

Feb 13, 2011 7:28:37 PM (13 years ago)

Author:

Darin Adler

Message:

2011-02-12 Darin Adler <darin@apple.com>

Reviewed by Alexey Proskuryakov.

Add built-in decoder for UTF-8 for improved performance
https://bugs.webkit.org/show_bug.cgi?id=53898

Covered by existing tests; not adding new tests at this time.

This patch now handles errors in the same way the existing codecs do,
and so passes our tests. The previous version failed some tests because
of incorrect error handling.

platform/text/TextCodecICU.cpp:
(WebCore::create): Renamed from newTextCodecICU, made a static member function, and added a call to adoptPtr.
(WebCore::TextCodecICU::registerEncodingNames): Renamed from registerExtendedEncodingNames since this class is no longer used for base codecs. Removed aliases for UTF-8; now handled by TextCodecUTF8.
(WebCore::TextCodecICU::registerCodecs): Renamed.
(WebCore::fallbackForGBK): Renamed to conform to our current style.

platform/text/TextCodecICU.h: Updated for above changes. Changed indentation. Made most functions private, including virtual function overrides. Marked ICUConverterWrapper noncopyable.

platform/text/TextCodecUTF8.cpp:
(WebCore::TextCodecUTF8::registerEncodingNames): Added the UTF-8 aliases that were formerly added by TextCodecICU.
(WebCore::nonASCIISequenceLength): Fixed bug where this would return 4 for bytes F5-FF instead of failing.
(WebCore::decodeNonASCIISequence): Tweaked coding style.
(WebCore::appendCharacter): Added. Makes it easier to share code between the partial-character handling and main loop.
(WebCore::TextCodecUTF8::decode): Fixed buffer size computation for case where there is a partial sequence. Fixed partial sequence handling so that goto is no longer needed, since compilers sometimes make poor code when goto is involved. Added a loop for partial sequences since we consume only one byte when a partial sequence is invalid. Fixed logic in main decoding loop so goto is not needed. Used early-exit style in both loops so the main flow is not nested inside if statements. Added correct error handling for flush when a partial sequence remains, which involved wrapping the function in yet another loop.

platform/text/TextCodecUTF8.h: Made virtual function overrides private.

platform/text/TextEncodingRegistry.cpp:
(WebCore::buildBaseTextCodecMaps): Added calls to TextCodecUTF8. Removed calls to TextCodecICU. Added FIXMEs for other codecs that no longer need to be included here.
(WebCore::extendTextCodecMaps): Updated for the name change of the TextCodecICU functions.
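
To illustrate the nonASCIISequenceLength fix mentioned in the message above, here is a minimal sketch of lead-byte classification written as range checks instead of a 256-entry table. It is not the WebKit code: the name `sequenceLengthForLeadByte` is hypothetical, and the ranges come from the UTF-8 definition (RFC 3629) rather than from this patch. Unlike the patch, which rejects lead bytes below 0xC2 later in decodeNonASCIISequence, this sketch returns 0 for them immediately.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not the WebKit function: returns the expected length
// of a UTF-8 sequence given its first byte, or 0 if the byte cannot start one.
static int sequenceLengthForLeadByte(std::uint8_t firstByte)
{
    if (firstByte < 0x80)
        return 1; // ASCII
    if (firstByte < 0xC2)
        return 0; // continuation byte (0x80-0xBF) or overlong lead (0xC0, 0xC1)
    if (firstByte < 0xE0)
        return 2;
    if (firstByte < 0xF0)
        return 3;
    if (firstByte < 0xF5)
        return 4;
    return 0; // 0xF5-0xFF never appear in valid UTF-8
}

// Small usage example: the bytes the old switch-based code misclassified.
static void checkLeadBytes()
{
    assert(sequenceLengthForLeadByte(0xF4) == 4); // highest valid 4-byte lead
    assert(sequenceLengthForLeadByte(0xF5) == 0); // previously reported as 4
    assert(sequenceLengthForLeadByte(0xFF) == 0);
}
```

A table lookup, as used in the patch, trades 256 bytes of static data for a branch-free answer; the range checks here are only meant to make the byte ranges easy to read.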

Location:

trunk/Source/WebCore

Files:

6 edited

ChangeLog (modified) (1 diff)
platform/text/TextCodecICU.cpp (modified) (14 diffs)
platform/text/TextCodecICU.h (modified) (4 diffs)
platform/text/TextCodecUTF8.cpp (modified) (8 diffs)
platform/text/TextCodecUTF8.h (modified) (1 diff)
platform/text/TextEncodingRegistry.cpp (modified) (6 diffs)

trunk/Source/WebCore/ChangeLog

-                      r78450
+                      r78451
+2011-02-12  Darin Adler  <darin@apple.com>
+        Reviewed by Alexey Proskuryakov.
+        Add built-in decoder for UTF-8 for improved performance
+        https://bugs.webkit.org/show_bug.cgi?id=53898
+        Covered by existing tests; not adding new tests at this time.
+        This patch now handles errors in the same way the existing codecs do,
+        and so passes our tests. The previous version failed some tests because
+        of incorrect error handling.
+        * platform/text/TextCodecICU.cpp:
+        (WebCore::create): Renamed from newTextCodecICU, made a static member
+        function, and added a call to adoptPtr.
+        (WebCore::TextCodecICU::registerEncodingNames): Renamed from
+        registerExtendedEncodingNames since this class is no longer used for
+        base codecs. Removed aliases for UTF-8; now handled by TextCodecUTF8.
+        (WebCore::TextCodecICU::registerCodecs): Renamed.
+        (WebCore::fallbackForGBK): Renamed to conform to our current style.
+        * platform/text/TextCodecICU.h: Updated for above changes. Changed
+        indentation. Made most functions private, including virtual function
+        overrides. Marked ICUConverterWrapper noncopyable.
+        * platform/text/TextCodecUTF8.cpp:
+        (WebCore::TextCodecUTF8::registerEncodingNames): Added the UTF-8 aliases
+        that were formerly added by TextCodecICU.
+        (WebCore::nonASCIISequenceLength): Fixed bug where this would return 4 for
+        bytes F5-FF instead of failing.
+        (WebCore::decodeNonASCIISequence): Tweaked coding style.
+        (WebCore::appendCharacter): Added. Makes it easier to share code between
+        the partial-character handling and main loop.
+        (WebCore::TextCodecUTF8::decode): Fixed buffer size computation for case
+        where there is a partial sequence. Fixed partial sequence handling so that
+        goto is no longer needed, since compilers sometimes make poor code when
+        goto is involved. Added a loop for partial sequences since we consume only
+        one byte when a partial sequence is invalid. Fixed logic in main decoding
+        loop so goto is not needed. Used early-exit style in both loops so the main
+        flow is not nested inside if statements. Added correct error handling for
+        flush when a partial sequence remains, which involved wrapping the function
+        in yet another loop.
+        * platform/text/TextCodecUTF8.h: Made virtual function overrides private.
+        * platform/text/TextEncodingRegistry.cpp:
+        (WebCore::buildBaseTextCodecMaps): Added calls to TextCodecUTF8. Removed
+        calls to TextCodecICU. Added FIXMEs for other codecs that no longer need
+        to be included here.
+        (WebCore::extendTextCodecMaps): Updated for the name change of the
+        TextCodecICU functions.
 2011-02-13  Mark Rowe  <mrowe@apple.com>

trunk/Source/WebCore/platform/text/TextCodecICU.cpp

-                      r77849
+                      r78451
 /*
- * Copyright (C) 2004, 2006, 2007, 2008 Apple Inc. All rights reserved.
+ * Copyright (C) 2004, 2006, 2007, 2008, 2011 Apple Inc. All rights reserved.
  * Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com>
+ *
 …
 #include "TextCodecICU.h"
-#include "PlatformString.h"
 #include "ThreadGlobalData.h"
 #include <unicode/ucnv.h>
 #include <unicode/ucnv_cb.h>
 #include <wtf/Assertions.h>
-#include <wtf/text/CString.h>
-#include <wtf/PassOwnPtr.h>
 #include <wtf/StringExtras.h>
 #include <wtf/Threading.h>
+#include <wtf/text/CString.h>
 #include <wtf/unicode/CharacterNames.h>
 …
+}
-static PassOwnPtr<TextCodec> newTextCodecICU(const TextEncoding& encoding, const void*)
-{
-    return new TextCodecICU(encoding);
-}
-void TextCodecICU::registerBaseEncodingNames(EncodingNameRegistrar registrar)
-{
-    registrar("UTF-8", "UTF-8");
-}
-void TextCodecICU::registerBaseCodecs(TextCodecRegistrar registrar)
-{
-    registrar("UTF-8", newTextCodecICU, 0);
-}
-void TextCodecICU::registerExtendedEncodingNames(EncodingNameRegistrar registrar)
+PassOwnPtr<TextCodec> TextCodecICU::create(const TextEncoding& encoding, const void*)
+{
+    return adoptPtr(new TextCodecICU(encoding));
+}
+void TextCodecICU::registerEncodingNames(EncodingNameRegistrar registrar)
+{
     // We register Hebrew with logical ordering using a separate name.
 …
     registrar("koi", "KOI8-R");
     registrar("logical", "ISO-8859-8-I");
-    registrar("unicode11utf8", "UTF-8");
-    registrar("unicode20utf8", "UTF-8");
-    registrar("x-unicode20utf8", "UTF-8");
     registrar("visual", "ISO-8859-8");
     registrar("winarabic", "windows-1256");
 …
     registrar("x-windows-949", "windows-949");
     registrar("x-uhc", "windows-949");
-    registrar("utf8", "UTF-8");
     registrar("shift-jis", "Shift_JIS");
 …
+}
-void TextCodecICU::registerExtendedCodecs(TextCodecRegistrar registrar)
+void TextCodecICU::registerCodecs(TextCodecRegistrar registrar)
+{
     // See comment above in registerEncodingNames.
-    registrar("ISO-8859-8-I", newTextCodecICU, 0);
+    registrar("ISO-8859-8-I", create, 0);
     int32_t numEncodings = ucnv_countAvailable();
 …
                 continue;
+        }
-        registrar(standardName, newTextCodecICU, 0);
+        registrar(standardName, create, 0);
+    }
+}
 …
+        }
+    }
 private:
     UConverter* m_converter;
 …
 // We need to apply these fallbacks ourselves as they are not currently supported by ICU and
-// they were provided by the old TEC encoding path
-// Needed to fix <rdar://problem/4708689>
-static UChar getGbkEscape(UChar32 codePoint)
-{
-    switch (codePoint) {
-        case 0x01F9:
-            return 0xE7C8;
-        case 0x1E3F:
-            return 0xE7C7;
-        case 0x22EF:
-            return 0x2026;
-        case 0x301C:
-            return 0xFF5E;
-        default:
-            return 0;
-    }
+// they were provided by the old TEC encoding path. Needed to fix <rdar://problem/4708689>.
+static UChar fallbackForGBK(UChar32 character)
+{
+    switch (character) {
+    case 0x01F9:
+        return 0xE7C8;
+    case 0x1E3F:
+        return 0xE7C7;
+    case 0x22EF:
+        return 0x2026;
+    case 0x301C:
+        return 0xFF5E;
+    }
+    return 0;
+}
 …
 // characters. See the declaration of TextCodec::encode for more.
 static void urlEscapedEntityCallback(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
-                                     UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+{
     if (reason == UCNV_UNASSIGNED) {
 …
 // Substitutes special GBK characters, escaping all other unassigned entities.
 static void gbkCallbackEscape(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
-                              UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+{
     UChar outChar;
-    if (reason == UCNV_UNASSIGNED && (outChar = getGbkEscape(codePoint))) {
+    if (reason == UCNV_UNASSIGNED && (outChar = fallbackForGBK(codePoint))) {
         const UChar* source = &outChar;
         *err = U_ZERO_ERROR;
 …
 // Combines both gbkUrlEscapedEntityCallback and GBK character substitution.
 static void gbkUrlEscapedEntityCallack(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
-                                       UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+{
     if (reason == UCNV_UNASSIGNED) {
-        if (UChar outChar = getGbkEscape(codePoint)) {
+        if (UChar outChar = fallbackForGBK(codePoint)) {
             const UChar* source = &outChar;
             *err = U_ZERO_ERROR;
 …
 static void gbkCallbackSubstitute(const void* context, UConverterFromUnicodeArgs* fromUArgs, const UChar* codeUnits, int32_t length,
-                                  UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+    UChar32 codePoint, UConverterCallbackReason reason, UErrorCode* err)
+{
     UChar outChar;
-    if (reason == UCNV_UNASSIGNED && (outChar = getGbkEscape(codePoint))) {
+    if (reason == UCNV_UNASSIGNED && (outChar = fallbackForGBK(codePoint))) {
         const UChar* source = &outChar;
         *err = U_ZERO_ERROR;
 …
+}
 } // namespace WebCore

trunk/Source/WebCore/platform/text/TextCodecICU.h

-                      r77831
+                      r78451
 /*
- * Copyright (C) 2004, 2006, 2007 Apple Inc. All rights reserved.
+ * Copyright (C) 2004, 2006, 2007, 2011 Apple Inc. All rights reserved.
  * Copyright (C) 2006 Alexey Proskuryakov <ap@nypop.com>
+ *
 …
 #include "TextCodec.h"
 #include "TextEncoding.h"
 #include <unicode/utypes.h>
 …
     class TextCodecICU : public TextCodec {
     public:
-        static void registerBaseEncodingNames(EncodingNameRegistrar);
-        static void registerBaseCodecs(TextCodecRegistrar);
-        static void registerExtendedEncodingNames(EncodingNameRegistrar);
-        static void registerExtendedCodecs(TextCodecRegistrar);
+        static void registerEncodingNames(EncodingNameRegistrar);
+        static void registerCodecs(TextCodecRegistrar);
+        virtual ~TextCodecICU();
+    private:
         TextCodecICU(const TextEncoding&);
         virtual ~TextCodecICU();
+        static PassOwnPtr<TextCodec> create(const TextEncoding&, const void*);
         virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
         virtual CString encode(const UChar*, size_t length, UnencodableHandling);
-    private:
         void createICUConverter() const;
         void releaseICUConverter() const;
 …
     struct ICUConverterWrapper {
+        ICUConverterWrapper()
+            : converter(0)
+        {
+        }
-        ICUConverterWrapper() : converter(0) { }
         ~ICUConverterWrapper();
         UConverter* converter;
+        WTF_MAKE_NONCOPYABLE(ICUConverterWrapper);
     };

trunk/Source/WebCore/platform/text/TextCodecUTF8.cpp

-                      r77819
+                      r78451
 #include <wtf/text/CString.h>
 #include <wtf/text/StringBuffer.h>
 #include <wtf/unicode/UTF8.h>
+#include <wtf/unicode/CharacterNames.h>
 using namespace WTF::Unicode;
 …
 namespace WebCore {
+const int nonCharacter = -1;
 // Assuming that a pointer is the size of a "machine word", then
 …
+{
     registrar("UTF-8", "UTF-8");
+    // Additional aliases that originally were present in the encoding
+    // table in WebKit on Macintosh, and subsequently added by
+    // TextCodecICU. Perhaps we can prove some are not used on the web
+    // and remove them.
+    registrar("unicode11utf8", "UTF-8");
+    registrar("unicode20utf8", "UTF-8");
+    registrar("utf8", "UTF-8");
+    registrar("x-unicode20utf8", "UTF-8");
+}
 …
+}
-static inline int nonASCIISequenceLength(unsigned char firstByte)
-{
-    ASSERT(!isASCII(firstByte));
-    switch (firstByte >> 4) {
-    case 0xF:
-        return 4;
-    case 0xE:
-        return 3;
-    }
-    return 2;
-}
-static inline int decodeNonASCIISequence(const unsigned char* sequence, unsigned length)
+static inline int nonASCIISequenceLength(uint8_t firstByte)
+{
+    static const uint8_t lengths[256] = {
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2,
+, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3, 3,
+, 4, 4, 4, 4, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0
+    };
+    return lengths[firstByte];
+}
+static inline int decodeNonASCIISequence(const uint8_t* sequence, unsigned length)
+{
     ASSERT(!isASCII(sequence[0]));
 …
         ASSERT(sequence[0] <= 0xDF);
         if (sequence[0] < 0xC2)
-            return -1;
+            return nonCharacter;
         if (sequence[1] < 0x80 || sequence[1] > 0xBF)
-            return -1;
+            return nonCharacter;
         return ((sequence[0] << 6) + sequence[1]) - 0x00003080;
+    }
 …
         case 0xE0:
             if (sequence[1] < 0xA0 || sequence[1] > 0xBF)
-                return -1;
+                return nonCharacter;
             break;
         case 0xED:
             if (sequence[1] < 0x80 || sequence[1] > 0x9F)
-                return -1;
+                return nonCharacter;
             break;
         default:
             if (sequence[1] < 0x80 || sequence[1] > 0xBF)
-                return -1;
+                return nonCharacter;
+        }
         if (sequence[2] < 0x80 || sequence[2] > 0xBF)
-            return -1;
+            return nonCharacter;
         return ((sequence[0] << 12) + (sequence[1] << 6) + sequence[2]) - 0x000E2080;
+    }
 …
     case 0xF0:
         if (sequence[1] < 0x90 || sequence[1] > 0xBF)
-            return -1;
+            return nonCharacter;
         break;
     case 0xF4:
         if (sequence[1] < 0x80 || sequence[1] > 0x8F)
-            return -1;
+            return nonCharacter;
         break;
     default:
         if (sequence[1] < 0x80 || sequence[1] > 0xBF)
-            return -1;
+            return nonCharacter;
+    }
     if (sequence[2] < 0x80 || sequence[2] > 0xBF)
-        return -1;
+        return nonCharacter;
     if (sequence[3] < 0x80 || sequence[3] > 0xBF)
-        return -1;
+        return nonCharacter;
     return ((sequence[0] << 18) + (sequence[1] << 12) + (sequence[2] << 6) + sequence[3]) - 0x03C82080;
+}
+static inline UChar* appendCharacter(UChar* destination, int character)
+{
+    ASSERT(character != nonCharacter);
+    ASSERT(!U_IS_SURROGATE(character));
+    if (U_IS_BMP(character))
+        *destination++ = character;
+    else {
+        *destination++ = U16_LEAD(character);
+        *destination++ = U16_TRAIL(character);
+    }
+    return destination;
+}
 String TextCodecUTF8::decode(const char* bytes, size_t length, bool flush, bool stopOnError, bool& sawError)
+{
-    StringBuffer buffer(length);
+    // Each input byte might turn into a character.
+    // That includes all bytes in the partial-sequence buffer because
+    // each byte in an invalid sequence will turn into a replacement character.
+    StringBuffer buffer(m_partialSequenceSize + length);
     const uint8_t* source = reinterpret_cast<const uint8_t*>(bytes);
 …
     UChar* destination = buffer.characters();
-    int count;
-    int character;
-    if (m_partialSequenceSize) {
-        count = nonASCIISequenceLength(m_partialSequence[0]);
-        ASSERT(count > m_partialSequenceSize);
-        if (count - m_partialSequenceSize > end - source) {
-            memcpy(m_partialSequence + m_partialSequenceSize, source, end - source);
-            m_partialSequenceSize += end - source;
-            source = end;
-        } else {
+    do {
+        while (m_partialSequenceSize) {
+            int count = nonASCIISequenceLength(m_partialSequence[0]);
+            ASSERT(count > m_partialSequenceSize);
+            ASSERT(count >= 2);
+            ASSERT(count <= 4);
+            if (count - m_partialSequenceSize > end - source) {
+                if (!flush) {
+                    // We have an incomplete partial sequence, so put it all in the partial
+                    // sequence buffer, and break out of this loop so we can exit the function.
+                    memcpy(m_partialSequence + m_partialSequenceSize, source, end - source);
+                    m_partialSequenceSize += end - source;
+                    source = end;
+                    break;
+                }
+                // We have an incomplete partial sequence at the end of the buffer.
+                // That is an error.
+                sawError = true;
+                if (stopOnError) {
+                    source = end;
+                    break;
+                }
+                // Each error consumes one byte and generates one replacement character.
+                --m_partialSequenceSize;
+                memmove(m_partialSequence, m_partialSequence + 1, m_partialSequenceSize);
+                *destination++ = replacementCharacter;
+                continue;
+            }
             uint8_t completeSequence[U8_MAX_LENGTH];
             memcpy(completeSequence, m_partialSequence, m_partialSequenceSize);
             memcpy(completeSequence + m_partialSequenceSize, source, count - m_partialSequenceSize);
             source += count - m_partialSequenceSize;
+            int character = decodeNonASCIISequence(completeSequence, count);
+            if (character == nonCharacter) {
+                sawError = true;
+                if (stopOnError) {
+                    source = end;
+                    break;
+                }
+                // Each error consumes one byte and generates one replacement character.
+                memcpy(m_partialSequence, completeSequence + 1, count - 1);
+                m_partialSequenceSize = count - 1;
+                *destination++ = replacementCharacter;
+                continue;
+            }
             m_partialSequenceSize = 0;
-            character = decodeNonASCIISequence(completeSequence, count);
-            goto decodedNonASCII;
+            destination = appendCharacter(destination, character);
+        }
+    }
-    while (source < end) {
-        if (isASCII(*source)) {
-            // Fast path for ASCII. Most UTF-8 text will be ASCII.
-            if (isAlignedToMachineWord(source)) {
-                while (source < alignedEnd) {
-                    MachineWord chunk = *reinterpret_cast_ptr<const MachineWord*>(source);
-                    if (chunk & NonASCIIMask<sizeof(MachineWord)>::value()) {
-                        if (isASCII(*source))
+        while (source < end) {
+            if (isASCII(*source)) {
+                // Fast path for ASCII. Most UTF-8 text will be ASCII.
+                if (isAlignedToMachineWord(source)) {
+                    while (source < alignedEnd) {
+                        MachineWord chunk = *reinterpret_cast_ptr<const MachineWord*>(source);
+                        if (chunk & NonASCIIMask<sizeof(MachineWord)>::value())
                             break;
-                        goto nonASCII;
+                        UCharByteFiller<sizeof(MachineWord)>::copy(destination, source);
+                        source += sizeof(MachineWord);
+                        destination += sizeof(MachineWord);
+                    }
-                    UCharByteFiller<sizeof(MachineWord)>::copy(destination, source);
-                    source += sizeof(MachineWord);
-                    destination += sizeof(MachineWord);
-                }
-                if (source == end)
-                    break;
-            }
-            *destination++ = *source++;
-        } else {
-nonASCII:
-            count = nonASCIISequenceLength(*source);
-            ASSERT(count >= 2);
-            ASSERT(count <= 4);
-            if (count > end - source) {
-                ASSERT(end - source <= static_cast<ptrdiff_t>(sizeof(m_partialSequence)));
-                ASSERT(!m_partialSequenceSize);
-                m_partialSequenceSize = end - source;
-                memcpy(m_partialSequence, source, m_partialSequenceSize);
-                break;
-            }
-            character = decodeNonASCIISequence(source, count);
+                    if (source == end)
+                        break;
+                    if (!isASCII(*source))
+                        continue;
+                }
+                *destination++ = *source++;
+                continue;
+            }
+            int count = nonASCIISequenceLength(*source);
+            int character;
+            if (!count)
+                character = nonCharacter;
+            else {
+                ASSERT(count >= 2);
+                ASSERT(count <= 4);
+                if (count > end - source) {
+                    ASSERT(end - source <= static_cast<ptrdiff_t>(sizeof(m_partialSequence)));
+                    ASSERT(!m_partialSequenceSize);
+                    m_partialSequenceSize = end - source;
+                    memcpy(m_partialSequence, source, m_partialSequenceSize);
+                    break;
+                }
+                character = decodeNonASCIISequence(source, count);
+            }
+            if (character == nonCharacter) {
+                sawError = true;
+                if (stopOnError)
+                    break;
+                // Each error consumes one byte and generates one replacement character.
+                ++source;
+                *destination++ = replacementCharacter;
+                continue;
+            }
             source += count;
-decodedNonASCII:
-            if (character < 0) {
-                if (stopOnError) {
-                    sawError = true;
-                    break;
-                }
-            } else {
-                ASSERT(!U_IS_SURROGATE(character));
-                if (U_IS_BMP(character))
-                    *destination++ = character;
-                else {
-                    *destination++ = U16_LEAD(character);
-                    *destination++ = U16_TRAIL(character);
-                }
-            }
+            destination = appendCharacter(destination, character);
+        }
+    }
+    } while (flush && m_partialSequenceSize);
     buffer.shrink(destination - buffer.characters());
-    if (flush && m_partialSequenceSize)
-        sawError = true;
     return String::adopt(buffer);
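
The error policy in the decode changes above ("each error consumes one byte and generates one replacement character", with partial sequences carried between calls and reported as errors on flush) can be sketched in standalone C++. This is an illustration under assumptions, not the WebKit implementation: the class `Utf8StreamDecoder` and its members are hypothetical, it uses std::u16string instead of StringBuffer, it has no stopOnError or sawError handling, and it keeps pending bytes in a vector rather than a fixed-size partial-sequence buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical sketch; the class and member names are not WebKit's.
class Utf8StreamDecoder {
public:
    // Decodes one chunk of bytes to UTF-16. Pass flush = true for the final
    // chunk so that a dangling partial sequence is treated as an error.
    std::u16string decode(const std::string& chunk, bool flush)
    {
        m_pending.insert(m_pending.end(), chunk.begin(), chunk.end());
        std::u16string out;
        std::size_t i = 0;
        while (i < m_pending.size()) {
            std::size_t available = m_pending.size() - i;
            std::size_t length = sequenceLength(m_pending[i]);
            if (length > available && !flush)
                break; // Incomplete sequence: keep the bytes for the next call.
            char32_t character = 0;
            if (!length || length > available || !decodeOne(&m_pending[i], length, character)) {
                // Each error consumes one byte and generates one replacement character.
                out.push_back(u'\uFFFD');
                ++i;
                continue;
            }
            append(out, character);
            i += length;
        }
        m_pending.erase(m_pending.begin(), m_pending.begin() + static_cast<std::ptrdiff_t>(i));
        return out;
    }

private:
    // Returns 0 for bytes that cannot start a sequence (including 0xF5-0xFF).
    static std::size_t sequenceLength(std::uint8_t lead)
    {
        if (lead < 0x80) return 1;
        if (lead < 0xC2) return 0;
        if (lead < 0xE0) return 2;
        if (lead < 0xF0) return 3;
        return lead < 0xF5 ? 4 : 0;
    }

    // Validates continuation bytes, overlong forms, surrogates, and range.
    static bool decodeOne(const std::uint8_t* s, std::size_t length, char32_t& result)
    {
        if (length == 1) {
            result = s[0];
            return true;
        }
        char32_t c = s[0] & (0xFF >> (length + 1));
        for (std::size_t k = 1; k < length; ++k) {
            if ((s[k] & 0xC0) != 0x80)
                return false;
            c = (c << 6) | (s[k] & 0x3F);
        }
        static const char32_t minimum[] = { 0, 0, 0x80, 0x800, 0x10000 };
        if (c < minimum[length] || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return false;
        result = c;
        return true;
    }

    // Appends one code point as UTF-16, using a surrogate pair outside the BMP.
    static void append(std::u16string& out, char32_t c)
    {
        if (c < 0x10000) {
            out.push_back(static_cast<char16_t>(c));
            return;
        }
        c -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (c >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (c & 0x3FF)));
    }

    std::vector<std::uint8_t> m_pending;
};

// Usage example: a euro sign (E2 82 AC) split across two chunks.
static void exampleUsage()
{
    Utf8StreamDecoder decoder;
    assert(decoder.decode("\xE2\x82", false).empty());
    assert(decoder.decode("\xAC", true) == u"\u20AC");
}
```

Consuming a single byte per error, rather than the whole suspect sequence, means a valid character that follows a bad lead byte is still decoded instead of being swallowed along with it.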

trunk/Source/WebCore/platform/text/TextCodecUTF8.h

-                      r77819
+                      r78451
     static void registerCodecs(TextCodecRegistrar);
-    virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
-    virtual CString encode(const UChar*, size_t length, UnencodableHandling);
 private:
     static PassOwnPtr<TextCodec> create(const TextEncoding&, const void*);
     TextCodecUTF8() : m_partialSequenceSize(0) { }
+    virtual String decode(const char*, size_t length, bool flush, bool stopOnError, bool& sawError);
+    virtual CString encode(const UChar*, size_t length, UnencodableHandling);
     int m_partialSequenceSize;

trunk/Source/WebCore/platform/text/TextEncodingRegistry.cpp

-                      r77831
+                      r78451
 #endif
+#include <wtf/CurrentTime.h>
+#include <wtf/text/CString.h>
 using namespace WTF;
 …
     TextCodecLatin1::registerCodecs(addToTextCodecMap);
+    TextCodecUTF8::registerEncodingNames(addToTextEncodingNameMap);
+    TextCodecUTF8::registerCodecs(addToTextCodecMap);
     TextCodecUTF16::registerEncodingNames(addToTextEncodingNameMap);
     TextCodecUTF16::registerCodecs(addToTextCodecMap);
 …
     TextCodecUserDefined::registerCodecs(addToTextCodecMap);
-#if USE(ICU_UNICODE)
-    TextCodecICU::registerBaseEncodingNames(addToTextEncodingNameMap);
-    TextCodecICU::registerBaseCodecs(addToTextCodecMap);
-#endif
 #if USE(GLIB_UNICODE)
+    // FIXME: This is not needed. The code above covers all the base codecs.
     TextCodecGtk::registerBaseEncodingNames(addToTextEncodingNameMap);
     TextCodecGtk::registerBaseCodecs(addToTextCodecMap);
 …
 #if USE(BREWMP_UNICODE)
+    // FIXME: This is not needed. The code above covers all the base codecs.
     TextCodecBrew::registerBaseEncodingNames(addToTextEncodingNameMap);
     TextCodecBrew::registerBaseCodecs(addToTextCodecMap);
 …
 #if OS(WINCE) && !PLATFORM(QT)
+    // FIXME: This is not needed. The code above covers all the base codecs.
     TextCodecWinCE::registerBaseEncodingNames(addToTextEncodingNameMap);
     TextCodecWinCE::registerBaseCodecs(addToTextCodecMap);
 …
+{
 #if USE(ICU_UNICODE)
-    TextCodecICU::registerExtendedEncodingNames(addToTextEncodingNameMap);
-    TextCodecICU::registerExtendedCodecs(addToTextCodecMap);
+    TextCodecICU::registerEncodingNames(addToTextEncodingNameMap);
+    TextCodecICU::registerCodecs(addToTextCodecMap);
 #endif
