Changeset 228306 in webkit


Timestamp:
Feb 8, 2018 6:13:01 PM (6 years ago)
Author:
fpizlo@apple.com
Message:

Experiment with alternative implementation of memcpy/memset
https://bugs.webkit.org/show_bug.cgi?id=182563

Reviewed by Michael Saboff and Mark Lam.

Source/bmalloc:

Add a faster x86_64-specific implementation of memcpy and memset. Ideally, this would just be
implemented in WTF, but we have to copy it into bmalloc since bmalloc sits below WTF on the
stack.

  • bmalloc/Algorithm.h:

(bmalloc::fastCopy):
(bmalloc::fastZeroFill):

  • bmalloc/Allocator.cpp:

(bmalloc::Allocator::reallocate):

  • bmalloc/Bits.h:

(bmalloc::BitsWordOwner::operator=):
(bmalloc::BitsWordOwner::clearAll):
(bmalloc::BitsWordOwner::set):

  • bmalloc/IsoPageInlines.h:

(bmalloc::IsoPage<Config>::IsoPage):

  • bmalloc/Vector.h:

(bmalloc::Vector<T>::reallocateBuffer):

Source/JavaScriptCore:

This adopts new fastCopy/fastZeroFill calls for calls to memcpy/memset that do not take a
constant size argument.

  • assembler/AssemblerBuffer.h:

(JSC::AssemblerBuffer::append):

  • runtime/ArrayBuffer.cpp:

(JSC::ArrayBufferContents::tryAllocate):
(JSC::ArrayBufferContents::copyTo):
(JSC::ArrayBuffer::createInternal):

  • runtime/ArrayBufferView.h:

(JSC::ArrayBufferView::zeroRangeImpl):

  • runtime/ArrayConventions.cpp:
  • runtime/ArrayConventions.h:

(JSC::clearArray):

  • runtime/ArrayPrototype.cpp:

(JSC::arrayProtoPrivateFuncConcatMemcpy):

  • runtime/ButterflyInlines.h:

(JSC::Butterfly::tryCreate):
(JSC::Butterfly::createOrGrowPropertyStorage):
(JSC::Butterfly::growArrayRight):
(JSC::Butterfly::resizeArray):

  • runtime/GenericTypedArrayViewInlines.h:

(JSC::GenericTypedArrayView<Adaptor>::create):

  • runtime/JSArray.cpp:

(JSC::JSArray::appendMemcpy):
(JSC::JSArray::fastSlice):

  • runtime/JSArrayBufferView.cpp:

(JSC::JSArrayBufferView::ConstructionContext::ConstructionContext):

  • runtime/JSGenericTypedArrayViewInlines.h:

(JSC::JSGenericTypedArrayView<Adaptor>::set):

  • runtime/JSObject.cpp:

(JSC::JSObject::constructConvertedArrayStorageWithoutCopyingElements):
(JSC::JSObject::shiftButterflyAfterFlattening):

  • runtime/PropertyTable.cpp:

(JSC::PropertyTable::PropertyTable):

Source/WTF:

Adds a faster x86_64-specific implementation of memcpy and memset. These versions go by
different names than memcpy/memset and have a different API:

WTF::fastCopy<T>(T* dst, T* src, size_t N): copies N values of type T from src to dst.
WTF::fastZeroFill(T* dst, size_t N): writes N * sizeof(T) zeroes to dst.

There are also *Bytes variants that take void* for dst and src and size_t numBytes. Those are
most appropriate in places where the code is already computing bytes.

These will just call memcpy/memset on platforms where the optimized versions are not supported.
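
As a rough illustration of the API shape described above, here is a minimal sketch of what the portable fallback could look like. It is not the code in wtf/FastCopy.h or wtf/FastZeroFill.h: the `sketch` namespace and the `const` qualifiers on the source pointers are assumptions made for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch { // hypothetical namespace, not WTF

// Byte-oriented variants: the caller has already computed a size in bytes.
inline void fastCopyBytes(void* dst, const void* src, std::size_t numBytes)
{
    assert(!numBytes || (dst && src));
    if (numBytes)
        std::memcpy(dst, src, numBytes);
}

inline void fastZeroFillBytes(void* dst, std::size_t numBytes)
{
    assert(!numBytes || dst);
    if (numBytes)
        std::memset(dst, 0, numBytes);
}

// Typed variants: the count is in elements, so call sites no longer
// multiply by sizeof(T) themselves.
template<typename T>
inline void fastCopy(T* dst, const T* src, std::size_t count)
{
    fastCopyBytes(dst, src, count * sizeof(T));
}

template<typename T>
inline void fastZeroFill(T* dst, std::size_t count)
{
    fastZeroFillBytes(dst, count * sizeof(T));
}

} // namespace sketch
```

With this shape, a call such as `memcpy(dst, src, sizeof(double) * n)` becomes `fastCopy(dst, src, n)`, which is the pattern visible in the JavaScriptCore hunks of this changeset.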

These new functions are not known to the compiler to be memcpy/memset. This has the effect that
the compiler will not try to replace them with anything else. This could be good or bad:

  • It's *good* if the size is *not known* at compile time. In that case, by my benchmarks, these versions are faster than either the memcpy/memset call or whatever else the compiler could emit. This is because of a combination of inlining and the algorithm itself (see below).


  • It's *bad* if the size is *known* at compile time. In that case, the compiler could potentially emit a fully unrolled memcpy/memset. That might not happen if the size is large (even if it's known), but in this patch I avoid replacing any memcpy/memset calls when the size is a constant. In particular, this totally avoids the call overhead -- if the size is small, then the compiler will emit a nice inlined copy or set. If the size is large, then the best thing to do is emit the shortest piece of code possible, and that's a call to memcpy/memset.


It's unfortunate that you have to choose between them on your own. One way to avoid that might
have been to override the memcpy/memset symbols, so that the compiler can still do its
reasoning. But that's not quite right, since then we would lose inlining in the unknown-size
case. Also, it's possible that for some unknown-size cases, the compiler could choose to emit
something on its own because it might think that some property of aliasing or alignment could
help it. I think it's a bit better to use our own copy/set implementations even in those cases.
Another way that I tried avoiding this is to detect inside fastCopy/fastZeroFill if the size is
constant. But there is no good way to do that in C++. There is a builtin for doing that inside a
macro, but that feels janky, so I didn't want to do it in this patch.
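
The message does not name the builtin. Assuming it means something like GCC/Clang's `__builtin_constant_p`, the macro-based dispatch that was considered and rejected might look roughly like the sketch below. The macro name `SKETCH_COPY_BYTES` and the helper `copyBytesUnknownSize` are hypothetical, and the byte loop merely stands in for the real inlined fast path.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

namespace sketch { // hypothetical namespace, not WTF

// Stand-in for the inlined copy used when the size is only known at run time.
inline void copyBytesUnknownSize(void* dst, const void* src, std::size_t numBytes)
{
    unsigned char* d = static_cast<unsigned char*>(dst);
    const unsigned char* s = static_cast<const unsigned char*>(src);
    for (std::size_t i = 0; i < numBytes; ++i)
        d[i] = s[i];
}

} // namespace sketch

#if defined(__GNUC__) || defined(__clang__)
// Constant size: leave the call as memcpy so the compiler can unroll it or
// emit a plain call. Otherwise: use the custom path.
#define SKETCH_COPY_BYTES(dst, src, numBytes) \
    (__builtin_constant_p(numBytes) \
        ? (void)std::memcpy((dst), (src), (numBytes)) \
        : sketch::copyBytesUnknownSize((dst), (src), (numBytes)))
#else
// No such builtin assumed on other compilers: always use memcpy.
#define SKETCH_COPY_BYTES(dst, src, numBytes) \
    ((void)std::memcpy((dst), (src), (numBytes)))
#endif
```

The check has to live in a macro because, once the size is passed through an ordinary function parameter, whether the builtin still sees a constant depends on inlining and optimization level.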

The reason why these new fastCopy/fastZeroFill functions are faster is that:

  • They can be inlined. There is no function call. Only a few registers get clobbered. So, the impact on the quality of the code surrounding the memcpy/memset is smaller.


  • They use type information to select the implementation. For sizes that are multiples of 2, 4, or 8, the resulting code performs dramatically better on small arrays than memcpy because it uses fewer cycles. The difference is greatest for 2 and 4 byte types, since memcpy usually handles small arrays by tiering from an 8-byte word copy loop to a byte copy loop. So, for 2 or 4 byte arrays, we use an algorithm that tiers from 8-byte word down to a 2-byte or 4-byte (depending on type) copy loop. So, for example, when copying a 16-bit string that has 1, 2, or 3 characters, this means doing 1, 2, or 3 word copies rather than 2, 4, or 6 byte copies. For 8-byte types, the resulting savings are mainly that there is no check to see if a tier-down to the byte-copy loop is needed -- so really that means reducing code size. 1-byte types don't get this inherent advantage over memcpy/memset, but they still benefit from all of the other advantages of these functions. Of course, this advantage isn't inherent to our approach. The compiler could also notice that the arguments to memcpy/memset have some alignment properties. It could do it even more generally than we do - for example a copy over bytes where the size is a multiple of 4 can use the 4-byte word algorithm. But based on my tests, the compiler does not do this (even though it does other things, like turn a memset call with a zero value argument into a bzero call).


  • They use a very nicely written word copy/set loop for small arrays. I spent a lot of time getting the assembly just right. When we use memcpy/memset, sometimes we would optimize the call by having a fast path word copy loop for small sizes. That's not necessary with this implementation, since the assembly copy loop gets inlined.


  • They use rep movs or rep stos for copies of 200 bytes or more. This decision benchmarks poorly on every synthetic memcpy/memset benchmark I have built, and so unsurprisingly, it's not what system memcpy/memset does. Most system memcpy/memset implementations end up doing some SSE for medium-sized copies. However, I previously found that the system memset's approach is bad for one of the memset calls in GC (see clearArray() and friends in ArrayConventions.h|cpp) - I was able to make the overhead of that call virtually disappear by doing rep stos more aggressively. The theory behind this change is that it's not just the GC that prefers a smaller rep threshold and no SSE. I am betting that reping more is better when the heap gets chaotic and the data being copied is used in interesting ways -- hence, synthetic memcpy/memset benchmarks think it's bad (they don't do enough chaotic memory accesses) while it's good for real-world uses. Also, when I previously worked on JVMs, I had found that the best memcpy/memset heuristics when dealing with GC'd objects in a crazy heap were different than any memcpy/memset in any system library.
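
To make the type-based tiering in the second bullet concrete, here is a portable C++ sketch of the idea for 2-byte and 4-byte element types. The real implementation is hand-written inline assembly, so this only shows the control flow; the name `tieredCopy` and the `sketch` namespace are made up for the example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <type_traits>

namespace sketch { // hypothetical namespace, not WTF

// Copies 8 bytes at a time while at least one full 8-byte word remains, then
// finishes with element-sized copies instead of dropping to a byte loop.
template<typename T>
inline void tieredCopy(T* dst, const T* src, std::size_t count)
{
    static_assert(std::is_trivially_copyable<T>::value, "sketch assumes trivially copyable elements");
    static_assert(sizeof(T) == 2 || sizeof(T) == 4, "sketch covers 2- and 4-byte types only");

    constexpr std::size_t elementsPerWord = sizeof(std::uint64_t) / sizeof(T);
    std::size_t i = 0;

    // Word tier: memcpy of a fixed 8 bytes is normally lowered to one load
    // and one store, and it avoids alignment and aliasing problems.
    for (; i + elementsPerWord <= count; i += elementsPerWord) {
        std::uint64_t word;
        std::memcpy(&word, src + i, sizeof(word));
        std::memcpy(dst + i, &word, sizeof(word));
    }

    // Element tier: at most elementsPerWord - 1 copies of sizeof(T) bytes.
    for (; i < count; ++i)
        dst[i] = src[i];
}

} // namespace sketch
```

For a 16-bit string of 1, 2, or 3 characters the word tier never runs and the tail performs 1, 2, or 3 two-byte copies, which matches the example given in the bullet.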


This appears to be a 0.9% speed-up on PLT. I'm not sure if it's more because of the inlining or
the rep. I think it's both. I'll leave figuring out the exact tuning for future patches.
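
For reference, the `rep stos` strategy discussed above can be sketched as follows. The inline assembly mirrors the `rep stosq` block that this changeset removes from ArrayConventions.cpp, and the 200-byte threshold is the figure quoted in the message. The function name, the simple loop used below the threshold, and the memset fallback are assumptions for illustration, not the contents of wtf/FastZeroFill.h.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

namespace sketch { // hypothetical namespace, not WTF

constexpr std::size_t repThresholdBytes = 200; // threshold quoted in the message

inline void zeroFillWords(std::uint64_t* dst, std::size_t count)
{
    // Small fills: a plain word loop (the real code uses tuned assembly).
    if (count * sizeof(std::uint64_t) < repThresholdBytes) {
        for (std::size_t i = 0; i < count; ++i)
            dst[i] = 0;
        return;
    }

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
    // rep stosq stores RAX to [RDI] RCX times, advancing RDI by 8 each time.
    std::uint64_t zero = 0;
    asm volatile(
        "rep stosq"
        : "+D"(dst), "+c"(count)
        : "a"(zero)
        : "memory");
#else
    // Other targets: defer to the system memset.
    std::memset(dst, 0, count * sizeof(std::uint64_t));
#endif
}

} // namespace sketch
```

Which threshold works best is left open in the message, so the constant here should be read as a starting point rather than a tuned value.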

  • wtf/BitVector.cpp:

(WTF::BitVector::setSlow):
(WTF::BitVector::clearAll):
(WTF::BitVector::resizeOutOfLine):

  • wtf/BitVector.h:

(WTF::BitVector::wordCount):
(WTF::BitVector::OutOfLineBits::numWords const):

  • wtf/ConcurrentBuffer.h:

(WTF::ConcurrentBuffer::growExact):

  • wtf/FastBitVector.h:

(WTF::FastBitVectorWordOwner::operator=):
(WTF::FastBitVectorWordOwner::clearAll):
(WTF::FastBitVectorWordOwner::set):

  • wtf/FastCopy.h: Added.

(WTF::fastCopy):
(WTF::fastCopyBytes):

  • wtf/FastMalloc.cpp:

(WTF::fastZeroedMalloc):
(WTF::fastStrDup):
(WTF::tryFastZeroedMalloc):

  • wtf/FastZeroFill.h: Added.

(WTF::fastZeroFill):
(WTF::fastZeroFillBytes):

  • wtf/MD5.cpp:
  • wtf/OSAllocator.h:

(WTF::OSAllocator::reallocateCommitted):

  • wtf/StringPrintStream.cpp:

(WTF::StringPrintStream::increaseSize):

  • wtf/Vector.h:
  • wtf/persistence/PersistentDecoder.cpp:

(WTF::Persistence::Decoder::decodeFixedLengthData):

  • wtf/persistence/PersistentEncoder.cpp:

(WTF::Persistence::Encoder::encodeFixedLengthData):

  • wtf/text/CString.cpp:

(WTF::CString::init):
(WTF::CString::copyBufferIfNeeded):

  • wtf/text/LineBreakIteratorPoolICU.h:

(WTF::LineBreakIteratorPool::makeLocaleWithBreakKeyword):

  • wtf/text/StringBuilder.cpp:

(WTF::StringBuilder::allocateBuffer):
(WTF::StringBuilder::append):

  • wtf/text/StringConcatenate.h:
  • wtf/text/StringImpl.h:

(WTF::StringImpl::copyCharacters):

  • wtf/text/icu/UTextProvider.cpp:

(WTF::uTextCloneImpl):

  • wtf/text/icu/UTextProviderLatin1.cpp:

(WTF::uTextLatin1Clone):
(WTF::openLatin1UTextProvider):

  • wtf/threads/Signals.cpp:
Location:
trunk/Source
Files:
2 added
43 edited

  • trunk/Source/JavaScriptCore/ChangeLog

    r228302 r228306  
     12018-02-08  Filip Pizlo  <fpizlo@apple.com>
     2
     3        Experiment with alternative implementation of memcpy/memset
     4        https://bugs.webkit.org/show_bug.cgi?id=182563
     5
     6        Reviewed by Michael Saboff and Mark Lam.
     7       
     8        This adopts new fastCopy/fastZeroFill calls for calls to memcpy/memset that do not take a
     9        constant size argument.
     10
     11        * assembler/AssemblerBuffer.h:
     12        (JSC::AssemblerBuffer::append):
     13        * runtime/ArrayBuffer.cpp:
     14        (JSC::ArrayBufferContents::tryAllocate):
     15        (JSC::ArrayBufferContents::copyTo):
     16        (JSC::ArrayBuffer::createInternal):
     17        * runtime/ArrayBufferView.h:
     18        (JSC::ArrayBufferView::zeroRangeImpl):
     19        * runtime/ArrayConventions.cpp:
     20        * runtime/ArrayConventions.h:
     21        (JSC::clearArray):
     22        * runtime/ArrayPrototype.cpp:
     23        (JSC::arrayProtoPrivateFuncConcatMemcpy):
     24        * runtime/ButterflyInlines.h:
     25        (JSC::Butterfly::tryCreate):
     26        (JSC::Butterfly::createOrGrowPropertyStorage):
     27        (JSC::Butterfly::growArrayRight):
     28        (JSC::Butterfly::resizeArray):
     29        * runtime/GenericTypedArrayViewInlines.h:
     30        (JSC::GenericTypedArrayView<Adaptor>::create):
     31        * runtime/JSArray.cpp:
     32        (JSC::JSArray::appendMemcpy):
     33        (JSC::JSArray::fastSlice):
     34        * runtime/JSArrayBufferView.cpp:
     35        (JSC::JSArrayBufferView::ConstructionContext::ConstructionContext):
     36        * runtime/JSGenericTypedArrayViewInlines.h:
     37        (JSC::JSGenericTypedArrayView<Adaptor>::set):
     38        * runtime/JSObject.cpp:
     39        (JSC::JSObject::constructConvertedArrayStorageWithoutCopyingElements):
     40        (JSC::JSObject::shiftButterflyAfterFlattening):
     41        * runtime/PropertyTable.cpp:
     42        (JSC::PropertyTable::PropertyTable):
     43
    1442018-02-08  Don Olmstead  <don.olmstead@sony.com>
    245
  • trunk/Source/JavaScriptCore/assembler/AssemblerBuffer.h

    r206525 r228306  
    11/*
    2  * Copyright (C) 2008, 2012, 2014 Apple Inc. All rights reserved.
     2 * Copyright (C) 2008-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    277277                grow(size);
    278278
    279             memcpy(m_storage.buffer() + m_index, data, size);
     279            fastCopyBytes(m_storage.buffer() + m_index, data, size);
    280280            m_index += size;
    281281        }
  • trunk/Source/JavaScriptCore/heap/LargeAllocation.cpp

    r227721 r228306  
    4646   
    4747    // Make sure that the padding does not contain useful things.
    48     memset(static_cast<char*>(space) + sizeBeforeDistancing, 0, distancing);
     48    fastZeroFillBytes(static_cast<char*>(space) + sizeBeforeDistancing, distancing);
    4949   
    5050    if (scribbleFreeCells())
  • trunk/Source/JavaScriptCore/heap/MarkedBlock.cpp

    r228149 r228306  
    494494        return;
    495495   
    496     memset(&block(), 0, endAtom * atomSize);
     496    fastZeroFillBytes(&block(), endAtom * atomSize);
    497497    m_securityOriginToken = securityOriginToken;
    498498}
  • trunk/Source/JavaScriptCore/runtime/ArrayBuffer.cpp

    r221439 r228306  
    114114   
    115115    if (policy == ZeroInitialize)
    116         memset(m_data.get(), 0, size);
     116        fastZeroFillBytes(m_data.get(), size);
    117117
    118118    m_sizeInBytes = numElements * elementByteSize;
     
    142142    if (!other.m_data)
    143143        return;
    144     memcpy(other.m_data.get(), m_data.get(), m_sizeInBytes);
     144    fastCopyBytes(other.m_data.get(), m_data.get(), m_sizeInBytes);
    145145    other.m_sizeInBytes = m_sizeInBytes;
    146146}
     
    247247    ASSERT(!byteLength || source);
    248248    auto buffer = adoptRef(*new ArrayBuffer(WTFMove(contents)));
    249     memcpy(buffer->data(), source, byteLength);
     249    fastCopyBytes(buffer->data(), source, byteLength);
    250250    return buffer;
    251251}
  • trunk/Source/JavaScriptCore/runtime/ArrayBufferView.h

    r225123 r228306  
    216216   
    217217    uint8_t* base = static_cast<uint8_t*>(baseAddress());
    218     memset(base + byteOffset, 0, rangeByteLength);
     218    fastZeroFillBytes(base + byteOffset, rangeByteLength);
    219219    return true;
    220220}
  • trunk/Source/JavaScriptCore/runtime/ArrayConventions.cpp

    r205611 r228306  
    3232
    3333#if USE(JSVALUE64)
    34 void clearArrayMemset(WriteBarrier<Unknown>* base, unsigned count)
    35 {
    36 #if CPU(X86_64) && COMPILER(GCC_OR_CLANG)
    37     uint64_t zero = 0;
    38     asm volatile (
    39         "rep stosq\n\t"
    40         : "+D"(base), "+c"(count)
    41         : "a"(zero)
    42         : "memory"
    43         );
    44 #else // not CPU(X86_64)
    45     memset(base, 0, count * sizeof(WriteBarrier<Unknown>));
    46 #endif // generic CPU
    47 }
    48 
    4934void clearArrayMemset(double* base, unsigned count)
    5035{
  • trunk/Source/JavaScriptCore/runtime/ArrayConventions.h

    r222384 r228306  
    118118
    119119#if USE(JSVALUE64)
    120 JS_EXPORT_PRIVATE void clearArrayMemset(WriteBarrier<Unknown>* base, unsigned count);
    121120JS_EXPORT_PRIVATE void clearArrayMemset(double* base, unsigned count);
    122121#endif // USE(JSVALUE64)
     
    125124{
    126125#if USE(JSVALUE64)
    127     const unsigned minCountForMemset = 100;
    128     if (count >= minCountForMemset) {
    129         clearArrayMemset(base, count);
    130         return;
    131     }
    132 #endif
    133    
     126    fastZeroFill(base, count);
     127#else
    134128    for (unsigned i = count; i--;)
    135129        base[i].clear();
     130#endif
    136131}
    137132
  • trunk/Source/JavaScriptCore/runtime/ArrayPrototype.cpp

    r228266 r228306  
    13421342    if (type == ArrayWithDouble) {
    13431343        double* buffer = result->butterfly()->contiguousDouble().data();
    1344         memcpy(buffer, firstButterfly->contiguousDouble().data(), sizeof(JSValue) * firstArraySize);
    1345         memcpy(buffer + firstArraySize, secondButterfly->contiguousDouble().data(), sizeof(JSValue) * secondArraySize);
     1344        fastCopy(buffer, firstButterfly->contiguousDouble().data(), firstArraySize);
     1345        fastCopy(buffer + firstArraySize, secondButterfly->contiguousDouble().data(), secondArraySize);
    13461346    } else if (type != ArrayWithUndecided) {
    13471347        WriteBarrier<Unknown>* buffer = result->butterfly()->contiguous().data();
     
    13491349        auto copy = [&] (unsigned offset, void* source, unsigned size, IndexingType type) {
    13501350            if (type != ArrayWithUndecided) {
    1351                 memcpy(buffer + offset, source, sizeof(JSValue) * size);
     1351                fastCopy(buffer + offset, static_cast<WriteBarrier<Unknown>*>(source), size);
    13521352                return;
    13531353            }
    13541354           
    1355             for (unsigned i = size; i--;)
    1356                 buffer[i + offset].clear();
     1355            clearArray(buffer + offset, size);
    13571356        };
    13581357       
  • trunk/Source/JavaScriptCore/runtime/ButterflyInlines.h

    r227617 r228306  
    11/*
    2  * Copyright (C) 2012-2017 Apple Inc. All rights reserved.
     2 * Copyright (C) 2012-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    9494    if (hasIndexingHeader)
    9595        *result->indexingHeader() = indexingHeader;
    96     memset(result->propertyStorage() - propertyCapacity, 0, propertyCapacity * sizeof(EncodedJSValue));
     96    fastZeroFill(result->propertyStorage() - propertyCapacity, propertyCapacity);
    9797    return result;
    9898}
     
    130130    Butterfly* result = createUninitialized(
    131131        vm, intendedOwner, preCapacity, newPropertyCapacity, hasIndexingHeader, indexingPayloadSizeInBytes);
    132     memcpy(
     132    fastCopyBytes(
    133133        result->propertyStorage() - oldPropertyCapacity,
    134134        oldButterfly->propertyStorage() - oldPropertyCapacity,
    135135        totalSize(0, oldPropertyCapacity, hasIndexingHeader, indexingPayloadSizeInBytes));
    136     memset(
     136    fastZeroFill(
    137137        result->propertyStorage() - newPropertyCapacity,
    138         0,
    139         (newPropertyCapacity - oldPropertyCapacity) * sizeof(EncodedJSValue));
     138        newPropertyCapacity - oldPropertyCapacity);
    140139    return result;
    141140}
     
    169168    if (!newBase)
    170169        return nullptr;
    171     // FIXME: This probably shouldn't be a memcpy.
    172     memcpy(newBase, theBase, oldSize);
     170    fastCopyBytes(newBase, theBase, oldSize);
    173171    return fromBase(newBase, 0, propertyCapacity);
    174172}
     
    200198        totalSize(0, propertyCapacity, oldHasIndexingHeader, oldIndexingPayloadSizeInBytes),
    201199        totalSize(0, propertyCapacity, newHasIndexingHeader, newIndexingPayloadSizeInBytes));
    202     memcpy(to, from, size);
     200    fastCopyBytes(to, from, size);
    203201    return result;
    204202}
  • trunk/Source/JavaScriptCore/runtime/GenericTypedArrayViewInlines.h

    r212535 r228306  
    11/*
    2  * Copyright (C) 2013, 2016 Apple Inc. All rights reserved.
     2 * Copyright (C) 2013-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    5353{
    5454    RefPtr<GenericTypedArrayView> result = create(length);
    55     memcpy(result->data(), array, length * sizeof(typename Adaptor::Type));
     55    fastCopy(result->data(), array, length);
    5656    return result;
    5757}
  • trunk/Source/JavaScriptCore/runtime/JSArray.cpp

    r227906 r228306  
    554554        }
    555555    } else if (type == ArrayWithDouble)
    556         memcpy(butterfly()->contiguousDouble().data() + startIndex, otherArray->butterfly()->contiguousDouble().data(), sizeof(JSValue) * otherLength);
     556        fastCopy(butterfly()->contiguousDouble().data() + startIndex, otherArray->butterfly()->contiguousDouble().data(), otherLength);
    557557    else
    558         memcpy(butterfly()->contiguous().data() + startIndex, otherArray->butterfly()->contiguous().data(), sizeof(JSValue) * otherLength);
     558        fastCopy(butterfly()->contiguous().data() + startIndex, otherArray->butterfly()->contiguous().data(), otherLength);
    559559
    560560    return true;
     
    762762        auto& resultButterfly = *resultArray->butterfly();
    763763        if (arrayType == ArrayWithDouble)
    764             memcpy(resultButterfly.contiguousDouble().data(), butterfly()->contiguousDouble().data() + startIndex, sizeof(JSValue) * count);
     764            fastCopy(resultButterfly.contiguousDouble().data(), butterfly()->contiguousDouble().data() + startIndex, count);
    765765        else
    766             memcpy(resultButterfly.contiguous().data(), butterfly()->contiguous().data() + startIndex, sizeof(JSValue) * count);
     766            fastCopy(resultButterfly.contiguous().data(), butterfly()->contiguous().data() + startIndex, count);
    767767        resultButterfly.setPublicLength(count);
    768768
  • trunk/Source/JavaScriptCore/runtime/JSArrayBufferView.cpp

    r227874 r228306  
    9595        return;
    9696    if (mode == ZeroFill)
    97         memset(m_vector.get(), 0, size);
     97        fastZeroFillBytes(m_vector.get(), size);
    9898   
    9999    vm.heap.reportExtraMemoryAllocated(static_cast<size_t>(length) * elementSize);
  • trunk/Source/JavaScriptCore/runtime/JSGenericTypedArrayViewInlines.h

    r227874 r228306  
    247247    const ClassInfo* ci = object->classInfo(vm);
    248248    if (ci->typedArrayStorageType == Adaptor::typeValue) {
    249         // The super fast case: we can just memcpy since we're the same type.
     249        // The super fast case: we can just memmove since we're the same type.
    250250        JSGenericTypedArrayView* other = jsCast<JSGenericTypedArrayView*>(object);
    251251        length = std::min(length, other->length());
  • trunk/Source/JavaScriptCore/runtime/JSObject.cpp

    r227906 r228306  
    11791179        vm, this, 0, propertyCapacity, true, ArrayStorage::sizeFor(neededLength));
    11801180   
    1181     memcpy(
     1181    fastCopy(
    11821182        newButterfly->propertyStorage() - propertySize,
    11831183        m_butterfly->propertyStorage() - propertySize,
    1184         propertySize * sizeof(EncodedJSValue));
     1184        propertySize);
    11851185   
    11861186    ArrayStorage* newStorage = newButterfly->arrayStorage();
     
    35813581    void* newBase = newButterfly->base(0, outOfLineCapacityAfter);
    35823582
    3583     memcpy(newBase, currentBase, Butterfly::totalSize(0, outOfLineCapacityAfter, hasIndexingHeader, indexingPayloadSizeInBytes));
     3583    fastCopyBytes(newBase, currentBase, Butterfly::totalSize(0, outOfLineCapacityAfter, hasIndexingHeader, indexingPayloadSizeInBytes));
    35843584   
    35853585    setButterfly(vm, newButterfly);
  • trunk/Source/JavaScriptCore/runtime/PropertyTable.cpp

    r217108 r228306  
    7575    ASSERT(isPowerOf2(m_indexSize));
    7676
    77     memcpy(m_index, other.m_index, dataSize());
     77    fastCopyBytes(m_index, other.m_index, dataSize());
    7878
    7979    iterator end = this->end();
  • trunk/Source/WTF/ChangeLog

    r228260 r228306  
     12018-02-08  Filip Pizlo  <fpizlo@apple.com>
     2
     3        Experiment with alternative implementation of memcpy/memset
     4        https://bugs.webkit.org/show_bug.cgi?id=182563
     5
     6        Reviewed by Michael Saboff and Mark Lam.
     7       
     8        Adds a faster x86_64-specific implementation of memcpy and memset. These versions go by
     9        different names than memcpy/memset and have a different API:
     10       
     11        WTF::fastCopy<T>(T* dst, T* src, size_t N): copies N values of type T from src to dst.
     12        WTF::fastZeroFill(T* dst, size_t N): writes N * sizeof(T) zeroes to dst.
     13       
     14        There are also *Bytes variants that take void* for dst and src and size_t numBytes. Those are
     15        most appropriate in places where the code is already computing bytes.
     16       
     17        These will just call memcpy/memset on platforms where the optimized versions are not supported.
     18       
     19        These new functions are not known to the compiler to be memcpy/memset. This has the effect that
     20        the compiler will not try to replace them with anything else. This could be good or bad:
     21       
     22        - It's *good* if the size is *not known* at compile time. In that case, by my benchmarks, these
     23          versions are faster than either the memcpy/memset call or whatever else the compiler could
     24          emit. This is because of a combination of inlining and the algorithm itself (see below).
     25       
     26        - It's *bad* if the size is *known* at compile time. In that case, the compiler could
     27          potentially emit a fully unrolled memcpy/memset. That might not happen if the size is large
     28          (even if it's known), but in this patch I avoid replacing any memcpy/memset calls when the
     29          size is a constant. In particular, this totally avoids the call overhead -- if the size is
     30          small, then the compiler will emit a nice inlined copy or set. If the size is large, then the
     31          most optimal thing to do is emit the shortest piece of code possible, and that's a call to
     32          memcpy/memset.
     33       
     34        It's unfortunate that you have to choose between them on your own. One way to avoid that might
     35        have been to override the memcpy/memset symbols, so that the compiler can still do its
     36        reasoning. But that's not quite right, since then we would lose inlining in the unknown-size
     37        case. Also, it's possible that for some unknown-size cases, the compiler could choose to emit
     38        something on its own because it might think that some property of aliasing or alignment could
     39        help it. I think it's a bit better to use our own copy/set implementations even in those cases.
     40        Another way that I tried avoiding this is to detect inside fastCopy/fastZeroFill if the size is
     41        constant. But there is no good way to do that in C++. There is a builtin for doing that inside a
     42        macro, but that feels janky, so I didn't want to do it in this patch.
     43       
     44        The reason why these new fastCopy/fastZeroFill functions are faster is that:
     45       
     46        - They can be inlined. There is no function call. Only a few registers get clobbered. So, the
     47          impact on the quality of the code surrounding the memcpy/memset is smaller.
     48       
     49        - They use type information to select the implementation. For sizes that are multiples of 2, 4,
     50          or 8, the resulting code performs dramatically better on small arrays than memcpy because it
     51          uses fewer cycles. The difference is greatest for 2 and 4 byte types, since memcpy usually
     52          handles small arrays by tiering from an 8-byte word copy loop to a byte copy loop. So, for 2
     53          or 4 byte arrays, we use an algorithm that tiers from 8-byte word down to a 2-byte or 4-byte
     54          (depending on type) copy loop. So, for example, when copying a 16-bit string that has 1, 2, or
     55          3 characters, this means doing 1, 2, or 3 word copies rather than 2, 4, or 6 byte copies. For
     56          8-byte types, the resulting savings are mainly that there is no check to see if a tier-down to
     57          the byte-copy loop is needed -- so really that means reducing code size. 1-byte types don't
     58          get this inherent advantage over memcpy/memset, but they still benefit from all of the other
     59          advantages of these functions. Of course, this advantage isn't inherent to our approach. The
     60          compiler could also notice that the arguments to memcpy/memset have some alignment properties.
     61          It could do it even more generally than we do - for example a copy over bytes where the size
     62          is a multiple of 4 can use the 4-byte word algorithm. But based on my tests, the compiler does
     63          not do this (even though it does other things, like turn a memset call with a zero value
     64          argument into a bzero call).
     65       
     66        - They use a very nicely written word copy/set loop for small arrays. I spent a lot of time
     67          getting the assembly just right. When we use memcpy/memset, sometimes we would optimize the
     68          call by having a fast path word copy loop for small sizes. That's not necessary with this
     69          implementation, since the assembly copy loop gets inlined.
     70       
     71        - They use `rep movs` or `rep stos` for copies of 200 bytes or more. This decision benchmarks
     72          poorly on every synthetic memcpy/memset benchmark I have built, and so unsurprisingly, it's
     73          not what system memcpy/memset does. Most system memcpy/memset implementations end up doing
     74          some SSE for medium-sized copies. However, I previously found that this decision is bad for
     75          one of the memset calls in GC (see clearArray() and friends in ArrayConventions.h|cpp) - I was
     76          able to make the overhead of that call virtually disappear by doing `rep stos` more
     77          aggressively. The theory behind this change is that it's not just the GC that prefers a smaller
     78          `rep` threshold and no SSE. I am betting that `rep`ing more is better when the heap gets
     79          chaotic and the data being copied is used in interesting ways -- hence, synthetic
     80          memcpy/memset benchmarks think it's bad (they don't do enough chaotic memory accesses) while
     81          it's good for real-world uses. Also, when I previously worked on JVMs, I had found that the
     82          best memcpy/memset heuristics when dealing with GC'd objects in a crazy heap were different
     83          than any memcpy/memset in any system library.
     84       
     85        This appears to be a 0.9% speed-up on PLT. I'm not sure if it's more because of the inlining or
     86        the `rep`. I think it's both. I'll leave figuring out the exact tuning for future patches.
     87
     88        * wtf/BitVector.cpp:
     89        (WTF::BitVector::setSlow):
     90        (WTF::BitVector::clearAll):
     91        (WTF::BitVector::resizeOutOfLine):
     92        * wtf/BitVector.h:
     93        (WTF::BitVector::wordCount):
     94        (WTF::BitVector::OutOfLineBits::numWords const):
     95        * wtf/ConcurrentBuffer.h:
     96        (WTF::ConcurrentBuffer::growExact):
     97        * wtf/FastBitVector.h:
     98        (WTF::FastBitVectorWordOwner::operator=):
     99        (WTF::FastBitVectorWordOwner::clearAll):
     100        (WTF::FastBitVectorWordOwner::set):
     101        * wtf/FastCopy.h: Added.
     102        (WTF::fastCopy):
     103        (WTF::fastCopyBytes):
     104        * wtf/FastMalloc.cpp:
     105        (WTF::fastZeroedMalloc):
     106        (WTF::fastStrDup):
     107        (WTF::tryFastZeroedMalloc):
     108        * wtf/FastZeroFill.h: Added.
     109        (WTF::fastZeroFill):
     110        (WTF::fastZeroFillBytes):
     111        * wtf/MD5.cpp:
     112        * wtf/OSAllocator.h:
     113        (WTF::OSAllocator::reallocateCommitted):
     114        * wtf/StringPrintStream.cpp:
     115        (WTF::StringPrintStream::increaseSize):
     116        * wtf/Vector.h:
     117        * wtf/persistence/PersistentDecoder.cpp:
     118        (WTF::Persistence::Decoder::decodeFixedLengthData):
     119        * wtf/persistence/PersistentEncoder.cpp:
     120        (WTF::Persistence::Encoder::encodeFixedLengthData):
     121        * wtf/text/CString.cpp:
     122        (WTF::CString::init):
     123        (WTF::CString::copyBufferIfNeeded):
     124        * wtf/text/LineBreakIteratorPoolICU.h:
     125        (WTF::LineBreakIteratorPool::makeLocaleWithBreakKeyword):
     126        * wtf/text/StringBuilder.cpp:
     127        (WTF::StringBuilder::allocateBuffer):
     128        (WTF::StringBuilder::append):
     129        * wtf/text/StringConcatenate.h:
     130        * wtf/text/StringImpl.h:
     131        (WTF::StringImpl::copyCharacters):
     132        * wtf/text/icu/UTextProvider.cpp:
     133        (WTF::uTextCloneImpl):
     134        * wtf/text/icu/UTextProviderLatin1.cpp:
     135        (WTF::uTextLatin1Clone):
     136        (WTF::openLatin1UTextProvider):
     137        * wtf/threads/Signals.cpp:
     138
    11392018-02-06  Darin Adler  <darin@apple.com>
    2140
  • trunk/Source/WTF/WTF.xcodeproj/project.pbxproj

    r227701 r228306  
    207207                0F60F32D1DFCBD1B00416D6C /* LockedPrintStream.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = LockedPrintStream.cpp; sourceTree = "<group>"; };
    208208                0F60F32E1DFCBD1B00416D6C /* LockedPrintStream.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = LockedPrintStream.h; sourceTree = "<group>"; };
     209                0F62A8A6202CCC14007B8623 /* FastCopy.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = FastCopy.h; sourceTree = "<group>"; };
     210                0F62A8A7202CCC15007B8623 /* FastZeroFill.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = FastZeroFill.h; sourceTree = "<group>"; };
    209211                0F66B2801DC97BAB004A1D3F /* ClockType.cpp */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.cpp.cpp; path = ClockType.cpp; sourceTree = "<group>"; };
    210212                0F66B2811DC97BAB004A1D3F /* ClockType.h */ = {isa = PBXFileReference; fileEncoding = 4; lastKnownFileType = sourcecode.c.h; path = ClockType.h; sourceTree = "<group>"; };
     
    865867                                0F7C5FB51D885CF20044F5E2 /* FastBitVector.cpp */,
    866868                                0FD81AC4154FB22E00983E72 /* FastBitVector.h */,
     869                                0F62A8A6202CCC14007B8623 /* FastCopy.h */,
    867870                                A8A472A1151A825A004123FF /* FastMalloc.cpp */,
    868871                                A8A472A2151A825A004123FF /* FastMalloc.h */,
    869872                                0F79C7C31E73511800EB34D1 /* FastTLS.h */,
     873                                0F62A8A7202CCC15007B8623 /* FastZeroFill.h */,
    870874                                B38FD7BC168953E80065C969 /* FeatureDefines.h */,
    871875                                0F9D335B165DBA73005AD387 /* FilePrintStream.cpp */,
  • trunk/Source/WTF/wtf/BitVector.cpp

    r225668 r228306  
    11/*
    2  * Copyright (C) 2011 Apple Inc. All rights reserved.
     2 * Copyright (C) 2011-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    3030#include <string.h>
    3131#include <wtf/Assertions.h>
     32#include <wtf/FastCopy.h>
    3233#include <wtf/FastMalloc.h>
     34#include <wtf/FastZeroFill.h>
    3335#include <wtf/StdLibExtras.h>
    3436
     
    4244    else {
    4345        OutOfLineBits* newOutOfLineBits = OutOfLineBits::create(other.size());
    44         memcpy(newOutOfLineBits->bits(), other.bits(), byteCount(other.size()));
     46        fastCopy(newOutOfLineBits->bits(), other.bits(), wordCount(other.size()));
    4547        newBitsOrPointer = bitwise_cast<uintptr_t>(newOutOfLineBits) >> 1;
    4648    }
     
    7072        m_bitsOrPointer = makeInlineBits(0);
    7173    else
    72         memset(outOfLineBits()->bits(), 0, byteCount(size()));
     74        fastZeroFill(outOfLineBits()->bits(), wordCount(size()));
    7375}
    7476
     
    9496        // Make sure that all of the bits are zero in case we do a no-op resize.
    9597        *newOutOfLineBits->bits() = m_bitsOrPointer & ~(static_cast<uintptr_t>(1) << maxInlineBits());
    96         memset(newOutOfLineBits->bits() + 1, 0, (newNumWords - 1) * sizeof(void*));
     98        fastZeroFill(newOutOfLineBits->bits() + 1, newNumWords - 1);
    9799    } else {
    98100        if (numBits > size()) {
    99101            size_t oldNumWords = outOfLineBits()->numWords();
    100             memcpy(newOutOfLineBits->bits(), outOfLineBits()->bits(), oldNumWords * sizeof(void*));
    101             memset(newOutOfLineBits->bits() + oldNumWords, 0, (newNumWords - oldNumWords) * sizeof(void*));
     102            fastCopy(newOutOfLineBits->bits(), outOfLineBits()->bits(), oldNumWords);
     103            fastZeroFill(newOutOfLineBits->bits() + oldNumWords, newNumWords - oldNumWords);
    102104        } else
    103             memcpy(newOutOfLineBits->bits(), outOfLineBits()->bits(), newOutOfLineBits->numWords() * sizeof(void*));
     105            fastCopy(newOutOfLineBits->bits(), outOfLineBits()->bits(), newOutOfLineBits->numWords());
    104106        OutOfLineBits::destroy(outOfLineBits());
    105107    }
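The resizeOutOfLine hunk above switches from byte counts to element counts: the old words are copied, then the tail of the new buffer is zeroed. A minimal sketch of that pattern follows, assuming portable memcpy/memset stand-ins. The names copyElements, zeroElements and growWords are hypothetical and are not part of the patch.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// Hypothetical stand-in for an element-counted copy: length is a number of
// T elements, not a number of bytes.
template<typename T>
void copyElements(T* dst, const T* src, size_t length)
{
    assert(dst && src);
    memcpy(dst, src, length * sizeof(T));
}

// Hypothetical stand-in for an element-counted zero fill.
template<typename T>
void zeroElements(T* dst, size_t length)
{
    assert(dst);
    memset(dst, 0, length * sizeof(T));
}

// Grow a word array: keep the old words, then zero every word added after them.
inline void growWords(uintptr_t* newWords, size_t newNumWords, const uintptr_t* oldWords, size_t oldNumWords)
{
    assert(newNumWords >= oldNumWords);
    copyElements(newWords, oldWords, oldNumWords);
    zeroElements(newWords + oldNumWords, newNumWords - oldNumWords);
}
```

With element counts, the `* sizeof(void*)` factor disappears from every call site, which is what the hunk above does.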
  • trunk/Source/WTF/wtf/BitVector.h

    r225524 r228306  
    355355    }
    356356
     357    static size_t wordCount(uintptr_t bits)
     358    {
     359        return (bits + bitsInPointer() - 1) / bitsInPointer();
     360    }
     361   
    357362    static uintptr_t makeInlineBits(uintptr_t bits)
    358363    {
     
    419424    public:
    420425        size_t numBits() const { return m_numBits; }
    421         size_t numWords() const { return (m_numBits + bitsInPointer() - 1) / bitsInPointer(); }
     426        size_t numWords() const { return wordCount(m_numBits); }
    422427        uintptr_t* bits() { return bitwise_cast<uintptr_t*>(this + 1); }
    423428        const uintptr_t* bits() const { return bitwise_cast<const uintptr_t*>(this + 1); }
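The wordCount helper added above rounds a bit count up to a whole number of pointer-sized words. A standalone sketch of the same arithmetic is shown below. It assumes bitsInPointer() is simply the pointer size in bits, which is an assumption made for this illustration rather than a copy of the WTF definition.

```cpp
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

// Assumed definition for illustration: number of bits in one pointer-sized word.
constexpr size_t bitsInPointer() { return sizeof(void*) * CHAR_BIT; }

// Ceiling division: how many pointer-sized words are needed to hold `bits` bits.
constexpr size_t wordCount(uintptr_t bits)
{
    return (bits + bitsInPointer() - 1) / bitsInPointer();
}

// A partially used word still counts as a full word.
static_assert(wordCount(0) == 0, "no bits need no words");
static_assert(wordCount(1) == 1, "one bit needs one word");
static_assert(wordCount(bitsInPointer()) == 1, "a full word fits exactly");
static_assert(wordCount(bitsInPointer() + 1) == 2, "one extra bit spills into a second word");
```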
  • trunk/Source/WTF/wtf/CMakeLists.txt

    r228136 r228306  
    6060    ExportMacros.h
    6161    FastBitVector.h
     62    FastCopy.h
    6263    FastMalloc.h
    6364    FastTLS.h
     65    FastZeroFill.h
    6466    FeatureDefines.h
    6567    FilePrintStream.h
  • trunk/Source/WTF/wtf/ConcurrentBuffer.h

    r225831 r228306  
    2727
    2828#include <wtf/Atomics.h>
     29#include <wtf/FastCopy.h>
    2930#include <wtf/FastMalloc.h>
    3031#include <wtf/HashFunctions.h>
     
    6667        // This allows us to do ConcurrentBuffer<std::unique_ptr<>>.
    6768        if (array)
    68             memcpy(newArray->data, array->data, sizeof(T) * array->size);
     69            fastCopy(newArray->data, array->data, array->size);
    6970        for (size_t i = array ? array->size : 0; i < newSize; ++i)
    7071            new (newArray->data + i) T();
  • trunk/Source/WTF/wtf/FastBitVector.h

    r208209 r228306  
    11/*
    2  * Copyright (C) 2012, 2013, 2016 Apple Inc. All rights reserved.
     2 * Copyright (C) 2012-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    2828#include <string.h>
    2929#include <wtf/Atomics.h>
     30#include <wtf/FastCopy.h>
    3031#include <wtf/FastMalloc.h>
     32#include <wtf/FastZeroFill.h>
    3133#include <wtf/PrintStream.h>
    3234#include <wtf/StdLibExtras.h>
     
    9698            setEqualsSlow(other);
    9799        else {
    98             memcpy(m_words, other.m_words, arrayLength() * sizeof(uint32_t));
     100            fastCopy(m_words, other.m_words, arrayLength());
    99101            m_numBits = other.m_numBits;
    100102        }
     
    116118    void clearAll()
    117119    {
    118         memset(m_words, 0, arrayLength() * sizeof(uint32_t));
     120        fastZeroFill(m_words, arrayLength());
    119121    }
    120122   
     
    122124    {
    123125        ASSERT_WITH_SECURITY_IMPLICATION(m_numBits == other.m_numBits);
    124         memcpy(m_words, other.m_words, arrayLength() * sizeof(uint32_t));
     126        fastCopy(m_words, other.m_words, arrayLength());
    125127    }
    126128   
  • trunk/Source/WTF/wtf/FastMalloc.cpp

    r220118 r228306  
    11/*
    22 * Copyright (c) 2005, 2007, Google Inc. All rights reserved.
    3  * Copyright (C) 2005-2017 Apple Inc. All rights reserved.
     3 * Copyright (C) 2005-2018 Apple Inc. All rights reserved.
    44 * Redistribution and use in source and binary forms, with or without
    55 * modification, are permitted provided that the following conditions
     
    3232#include <string.h>
    3333#include <wtf/DataLog.h>
     34#include <wtf/FastCopy.h>
     35#include <wtf/FastZeroFill.h>
    3436
    3537#if OS(WINDOWS)
     
    7981{
    8082    void* result = fastMalloc(n);
    81     memset(result, 0, n);
     83    fastZeroFillBytes(result, n);
    8284    return result;
    8385}
     
    8789    size_t len = strlen(src) + 1;
    8890    char* dup = static_cast<char*>(fastMalloc(len));
    89     memcpy(dup, src, len);
     91    fastCopy(dup, src, len);
    9092    return dup;
    9193}
     
    9698    if (!tryFastMalloc(n).getValue(result))
    9799        return 0;
    98     memset(result, 0, n);
     100    fastZeroFillBytes(result, n);
    99101    return result;
    100102}
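The FastMalloc.cpp hunks above work on raw `void*` or `char*` buffers, so their sizes are byte counts, and fastStrDup copies `strlen(src) + 1` bytes so that the terminator is included. The sketch below illustrates both points using plain `malloc`, `memset` and `memcpy` as stand-ins. The function names are hypothetical, and unlike fastMalloc, the sketch reports allocation failure by returning null.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

// Hypothetical byte-counted zeroing allocation, using malloc as a stand-in.
inline void* zeroedMallocSketch(size_t n)
{
    void* result = malloc(n);
    if (!result)
        return nullptr; // Assumption: failure is reported to the caller.
    memset(result, 0, n);
    return result;
}

// Hypothetical string duplication: len includes the trailing '\0'.
inline char* strDupSketch(const char* src)
{
    assert(src);
    size_t len = strlen(src) + 1;
    char* dup = static_cast<char*>(malloc(len));
    if (!dup)
        return nullptr;
    memcpy(dup, src, len);
    return dup;
}
```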
  • trunk/Source/WTF/wtf/OSAllocator.h

    r227951 r228306  
    11/*
    2  * Copyright (C) 2010 Apple Inc. All rights reserved.
     2 * Copyright (C) 2010-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    2828
    2929#include <algorithm>
     30#include <wtf/FastCopy.h>
    3031#include <wtf/VMTags.h>
    3132
     
    9192{
    9293    void* newBase = reserveAndCommit(newSize, usage, writable, executable);
    93     memcpy(newBase, oldBase, std::min(oldSize, newSize));
     94    fastCopyBytes(newBase, oldBase, std::min(oldSize, newSize));
    9495    decommitAndRelease(oldBase, oldSize);
    9596    return static_cast<T*>(newBase);
  • trunk/Source/WTF/wtf/StringPrintStream.cpp

    r225618 r228306  
    2929#include <stdarg.h>
    3030#include <stdio.h>
     31#include <wtf/FastCopy.h>
    3132#include <wtf/FastMalloc.h>
    3233
     
    120121    // we can't realloc the inline buffer.
    121122    char* newBuffer = static_cast<char*>(fastMalloc(m_size));
    122     memcpy(newBuffer, m_buffer, m_next + 1);
     123    fastCopy(newBuffer, m_buffer, m_next + 1);
    123124    if (m_buffer != m_inlineBuffer)
    124125        fastFree(m_buffer);
  • trunk/Source/WTF/wtf/Vector.h

    r226068 r228306  
    2828#include <utility>
    2929#include <wtf/CheckedArithmetic.h>
     30#include <wtf/FastCopy.h>
    3031#include <wtf/FastMalloc.h>
     32#include <wtf/FastZeroFill.h>
    3133#include <wtf/Forward.h>
    3234#include <wtf/MallocPtr.h>
     
    8789    static void initialize(T* begin, T* end)
    8890    {
    89         memset(begin, 0, reinterpret_cast<char*>(end) - reinterpret_cast<char*>(begin));
     91        fastZeroFill(begin, end - begin);
    9092    }
    9193};
     
    127129    static void move(const T* src, const T* srcEnd, T* dst)
    128130    {
    129         memcpy(dst, src, reinterpret_cast<const char*>(srcEnd) - reinterpret_cast<const char*>(src));
     131        fastCopy(dst, src, srcEnd - src);
    130132    }
    131133    static void moveOverlapping(const T* src, const T* srcEnd, T* dst)
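The Vector.h hunks above replace a byte span computed through `char*` casts with a plain pointer difference, because the new helpers take an element count. The two forms describe the same range, as this small sketch shows. The helper names byteSpan and elementSpan are hypothetical.

```cpp
#include <assert.h>
#include <stddef.h>

// Size of [begin, end) in bytes, computed the way the old memset/memcpy calls did.
template<typename T>
size_t byteSpan(const T* begin, const T* end)
{
    assert(begin <= end);
    return static_cast<size_t>(reinterpret_cast<const char*>(end) - reinterpret_cast<const char*>(begin));
}

// Size of [begin, end) in elements, the form passed to the element-counted helpers.
template<typename T>
size_t elementSpan(const T* begin, const T* end)
{
    assert(begin <= end);
    return static_cast<size_t>(end - begin);
}
```

For any range, `byteSpan(begin, end)` equals `elementSpan(begin, end) * sizeof(T)`, so the two call styles touch the same memory.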
  • trunk/Source/WTF/wtf/persistence/PersistentDecoder.cpp

    r220574 r228306  
    5353        return false;
    5454
    55     memcpy(data, m_bufferPosition, size);
     55    fastCopy(data, m_bufferPosition, size);
    5656    m_bufferPosition += size;
    5757
  • trunk/Source/WTF/wtf/persistence/PersistentEncoder.cpp

    r220574 r228306  
    5959
    6060    uint8_t* buffer = grow(size);
    61     memcpy(buffer, data, size);
     61    fastCopy(buffer, data, size);
    6262}
    6363
  • trunk/Source/WTF/wtf/text/CString.cpp

    r225463 r228306  
    11/*
    2  * Copyright (C) 2003-2017 Apple Inc. All rights reserved.
     2 * Copyright (C) 2003-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    2929
    3030#include <string.h>
     31#include <wtf/FastCopy.h>
    3132#include <wtf/text/StringHasher.h>
    3233#include <wtf/text/StringMalloc.h>
     
    6768
    6869    m_buffer = CStringBuffer::createUninitialized(length);
    69     memcpy(m_buffer->mutableData(), str, length);
     70    fastCopy(m_buffer->mutableData(), str, length);
    7071    m_buffer->mutableData()[length] = '\0';
    7172}
     
    9798    size_t length = buffer->length();
    9899    m_buffer = CStringBuffer::createUninitialized(length);
    99     memcpy(m_buffer->mutableData(), buffer->data(), length + 1);
     100    fastCopy(m_buffer->mutableData(), buffer->data(), length + 1);
    100101}
    101102
  • trunk/Source/WTF/wtf/text/LineBreakIteratorPoolICU.h

    r218594 r228306  
    11/*
    2  * Copyright (C) 2011 Apple Inc. All Rights Reserved.
     2 * Copyright (C) 2011-2018 Apple Inc. All Rights Reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    2727
    2828#include <unicode/uloc.h>
     29#include <wtf/FastCopy.h>
     30#include <wtf/FastZeroFill.h>
    2931#include <wtf/HashMap.h>
    3032#include <wtf/NeverDestroyed.h>
     
    5254            return locale;
    5355        Vector<char> scratchBuffer(utf8Locale.length() + 11, 0);
    54         memcpy(scratchBuffer.data(), utf8Locale.data(), utf8Locale.length());
     56        fastCopy(scratchBuffer.data(), utf8Locale.data(), utf8Locale.length());
    5557
    5658        const char* keywordValue = nullptr;
     
    7678        if (status == U_BUFFER_OVERFLOW_ERROR) {
    7779            scratchBuffer.grow(lengthNeeded + 1);
    78             memset(scratchBuffer.data() + utf8Locale.length(), 0, scratchBuffer.size() - utf8Locale.length());
     80            fastZeroFill(scratchBuffer.data() + utf8Locale.length(), scratchBuffer.size() - utf8Locale.length());
    7981            status = U_ZERO_ERROR;
    8082            int32_t lengthNeeded2 = uloc_setKeywordValue("lb", keywordValue, scratchBuffer.data(), scratchBuffer.size(), &status);
  • trunk/Source/WTF/wtf/text/StringBuilder.cpp

    r221330 r228306  
    100100    // Copy the existing data into a new buffer, set result to point to the end of the existing data.
    101101    auto buffer = StringImpl::createUninitialized(requiredLength, m_bufferCharacters8);
    102     memcpy(m_bufferCharacters8, currentCharacters, static_cast<size_t>(m_length) * sizeof(LChar)); // This can't overflow.
     102    fastCopy(m_bufferCharacters8, currentCharacters, m_length);
    103103   
    104104    // Update the builder state.
     
    115115    // Copy the existing data into a new buffer, set result to point to the end of the existing data.
    116116    auto buffer = StringImpl::createUninitialized(requiredLength, m_bufferCharacters16);
    117     memcpy(m_bufferCharacters16, currentCharacters, static_cast<size_t>(m_length) * sizeof(UChar)); // This can't overflow.
     117    fastCopy(m_bufferCharacters16, currentCharacters, m_length);
    118118   
    119119    // Update the builder state.
     
    277277        }
    278278
    279         memcpy(m_bufferCharacters16 + m_length, characters, static_cast<size_t>(length) * sizeof(UChar));
     279        fastCopy(m_bufferCharacters16 + m_length, characters, length);
    280280        m_length = requiredLength;
    281281    } else
    282         memcpy(appendUninitialized<UChar>(length), characters, static_cast<size_t>(length) * sizeof(UChar));
     282        fastCopy(appendUninitialized<UChar>(length), characters, length);
    283283    ASSERT(m_buffer->length() >= m_length);
    284284}
     
    292292    if (m_is8Bit) {
    293293        LChar* dest = appendUninitialized<LChar>(length);
    294         if (length > 8)
    295             memcpy(dest, characters, static_cast<size_t>(length) * sizeof(LChar));
    296         else {
    297             const LChar* end = characters + length;
    298             while (characters < end)
    299                 *(dest++) = *(characters++);
    300         }
     294        fastCopy(dest, characters, length);
    301295    } else {
    302296        UChar* dest = appendUninitialized<UChar>(length);
  • trunk/Source/WTF/wtf/text/StringConcatenate.h

    r225824 r228306  
    2828
    2929#include <string.h>
     30#include <wtf/FastCopy.h>
    3031
    3132#ifndef AtomicString_h
     
    158159    void writeTo(UChar* destination) const
    159160    {
    160         memcpy(destination, m_characters, m_length * sizeof(UChar));
     161        fastCopy(destination, m_characters, m_length);
    161162    }
    162163
  • trunk/Source/WTF/wtf/text/StringImpl.h

    r227691 r228306  
    10671067        return;
    10681068    }
    1069     memcpy(destination, source, numCharacters * sizeof(CharacterType));
     1069    fastCopy(destination, source, numCharacters);
    10701070}
    10711071
  • trunk/Source/WTF/wtf/text/icu/UTextProvider.cpp

    r203038 r228306  
    2929#include <algorithm>
    3030#include <string.h>
     31#include <wtf/FastCopy.h>
    3132
    3233namespace WTF {
     
    5657    int32_t flags = destination->flags;
    5758    int sizeToCopy = std::min(source->sizeOfStruct, destination->sizeOfStruct);
    58     memcpy(destination, source, sizeToCopy);
     59    fastCopyBytes(destination, source, sizeToCopy);
    5960    destination->pExtra = extraNew;
    6061    destination->flags = flags;
    61     memcpy(destination->pExtra, source->pExtra, extraSize);
     62    fastCopyBytes(destination->pExtra, source->pExtra, extraSize);
    6263    fixPointer(source, destination, destination->context);
    6364    fixPointer(source, destination, destination->p);
  • trunk/Source/WTF/wtf/text/icu/UTextProviderLatin1.cpp

    r225117 r228306  
    2828
    2929#include "UTextProvider.h"
     30#include <wtf/FastZeroFill.h>
    3031#include <wtf/text/StringImpl.h>
    3132
     
    8384    result->pFuncs = &uTextLatin1Funcs;
    8485    result->chunkContents = (UChar*)result->pExtra;
    85     memset(const_cast<UChar*>(result->chunkContents), 0, sizeof(UChar) * UTextWithBufferInlineCapacity);
     86    fastZeroFill(const_cast<UChar*>(result->chunkContents), UTextWithBufferInlineCapacity);
    8687
    8788    return result;
     
    229230    text->pFuncs = &uTextLatin1Funcs;
    230231    text->chunkContents = (UChar*)text->pExtra;
    231     memset(const_cast<UChar*>(text->chunkContents), 0, sizeof(UChar) * UTextWithBufferInlineCapacity);
     232    fastZeroFill(const_cast<UChar*>(text->chunkContents), UTextWithBufferInlineCapacity);
    232233
    233234    return text;
  • trunk/Source/WTF/wtf/threads/Signals.cpp

    r219760 r228306  
    173173    RELEASE_ASSERT(signal != Signal::Unknown);
    174174
    175     memcpy(outState, inState, inStateCount * sizeof(inState[0]));
     175    fastCopy(outState, inState, inStateCount);
    176176    *outStateCount = inStateCount;
    177177
  • trunk/Source/bmalloc/ChangeLog

    r228108 r228306  
     12018-02-08  Filip Pizlo  <fpizlo@apple.com>
     2
     3        Experiment with alternative implementation of memcpy/memset
     4        https://bugs.webkit.org/show_bug.cgi?id=182563
     5
     6        Reviewed by Michael Saboff and Mark Lam.
     7       
     8        Add a faster x86_64-specific implementation of memcpy and memset. Ideally, this would just be
     9        implemented in WTF, but we have to copy it into bmalloc since bmalloc sits below WTF on the
     10        stack.
     11
     12        * bmalloc/Algorithm.h:
     13        (bmalloc::fastCopy):
     14        (bmalloc::fastZeroFill):
     15        * bmalloc/Allocator.cpp:
     16        (bmalloc::Allocator::reallocate):
     17        * bmalloc/Bits.h:
     18        (bmalloc::BitsWordOwner::operator=):
     19        (bmalloc::BitsWordOwner::clearAll):
     20        (bmalloc::BitsWordOwner::set):
     21        * bmalloc/IsoPageInlines.h:
     22        (bmalloc::IsoPage<Config>::IsoPage):
     23        * bmalloc/Vector.h:
     24        (bmalloc::Vector<T>::reallocateBuffer):
     25
    1262018-02-05  JF Bastien  <jfbastien@apple.com>
    227
  • trunk/Source/bmalloc/bmalloc/Algorithm.h

    r225701 r228306  
    181181}
    182182
     183template<typename T>
     184void fastCopy(T* dst, T* src, size_t length)
     185{
     186#if BCPU(X86_64)
     187    uint64_t tmp = 0;
     188    size_t count = length * sizeof(T);
     189    if (!(sizeof(T) % sizeof(uint64_t))) {
     190        asm volatile (
     191            "cmpq $200, %%rcx\n\t"
     192            "jb 1f\n\t"
     193            "shrq $3, %%rcx\n\t"
     194            "rep movsq\n\t"
     195            "jmp 2f\n\t"
     196            "3:\n\t"
     197            "movq (%%rsi, %%rcx), %%rax\n\t"
     198            "movq %%rax, (%%rdi, %%rcx)\n\t"
     199            "1:\n\t"
     200            "subq $8, %%rcx\n\t"
     201            "jae 3b\n\t"
     202            "2:\n\t"
     203            : "+D"(dst), "+S"(src), "+c"(count), "+a"(tmp)
     204            :
     205            : "memory"
     206            );
     207        return;
     208    }
     209    if (!(sizeof(T) % sizeof(uint32_t))) {
     210        asm volatile (
     211            "cmpq $200, %%rcx\n\t"
     212            "jb 1f\n\t"
     213            "shrq $2, %%rcx\n\t"
     214            "rep movsl\n\t"
     215            "jmp 2f\n\t"
     216            "3:\n\t"
     217            "movq (%%rsi, %%rcx), %%rax\n\t"
     218            "movq %%rax, (%%rdi, %%rcx)\n\t"
     219            "1:\n\t"
     220            "subq $8, %%rcx\n\t"
     221            "jae 3b\n\t"
     222            "cmpq $-8, %%rcx\n\t"
     223            "je 2f\n\t"
     224            "addq $4, %%rcx\n\t" // FIXME: This isn't really a loop. https://bugs.webkit.org/show_bug.cgi?id=182617
     225            "4:\n\t"
     226            "movl (%%rsi, %%rcx), %%eax\n\t"
     227            "movl %%eax, (%%rdi, %%rcx)\n\t"
     228            "subq $4, %%rcx\n\t"
     229            "jae 4b\n\t"
     230            "2:\n\t"
     231            : "+D"(dst), "+S"(src), "+c"(count), "+a"(tmp)
     232            :
     233            : "memory"
     234            );
     235        return;
     236    }
     237    if (!(sizeof(T) % sizeof(uint16_t))) {
     238        asm volatile (
     239            "cmpq $200, %%rcx\n\t"
     240            "jb 1f\n\t"
     241            "shrq $1, %%rcx\n\t"
     242            "rep movsw\n\t"
     243            "jmp 2f\n\t"
     244            "3:\n\t"
     245            "movq (%%rsi, %%rcx), %%rax\n\t"
     246            "movq %%rax, (%%rdi, %%rcx)\n\t"
     247            "1:\n\t"
     248            "subq $8, %%rcx\n\t"
     249            "jae 3b\n\t"
     250            "cmpq $-8, %%rcx\n\t"
     251            "je 2f\n\t"
     252            "addq $6, %%rcx\n\t"
     253            "4:\n\t"
     254            "movw (%%rsi, %%rcx), %%ax\n\t"
     255            "movw %%ax, (%%rdi, %%rcx)\n\t"
     256            "subq $2, %%rcx\n\t"
     257            "jae 4b\n\t"
     258            "2:\n\t"
     259            : "+D"(dst), "+S"(src), "+c"(count), "+a"(tmp)
     260            :
     261            : "memory"
     262            );
     263        return;
     264    }
     265    asm volatile (
     266        "cmpq $200, %%rcx\n\t"
     267        "jb 1f\n\t"
     268        "rep movsb\n\t"
     269        "jmp 2f\n\t"
     270        "3:\n\t"
     271        "movq (%%rsi, %%rcx), %%rax\n\t"
     272        "movq %%rax, (%%rdi, %%rcx)\n\t"
     273        "1:\n\t"
     274        "subq $8, %%rcx\n\t"
     275        "jae 3b\n\t"
     276        "cmpq $-8, %%rcx\n\t"
     277        "je 2f\n\t"
     278        "addq $7, %%rcx\n\t"
     279        "4:\n\t"
     280        "movb (%%rsi, %%rcx), %%al\n\t"
     281        "movb %%al, (%%rdi, %%rcx)\n\t"
     282        "subq $1, %%rcx\n\t"
     283        "jae 4b\n\t"
     284        "2:\n\t"
     285        : "+D"(dst), "+S"(src), "+c"(count), "+a"(tmp)
     286        :
     287        : "memory"
     288        );
     289#else
     290    memcpy(dst, src, length * sizeof(T));
     291#endif
     292}
     293
     294template<typename T>
     295void fastZeroFill(T* dst, size_t length)
     296{
     297#if BCPU(X86_64)
     298    uint64_t zero = 0;
     299    size_t count = length * sizeof(T);
     300    if (!(sizeof(T) % sizeof(uint64_t))) {
     301        asm volatile (
     302            "cmpq $200, %%rcx\n\t"
     303            "jb 1f\n\t"
     304            "shrq $3, %%rcx\n\t"
     305            "rep stosq\n\t"
     306            "jmp 2f\n\t"
     307            "3:\n\t"
     308            "movq %%rax, (%%rdi, %%rcx)\n\t"
     309            "1:\n\t"
     310            "subq $8, %%rcx\n\t"
     311            "jae 3b\n\t"
     312            "2:\n\t"
     313            : "+D"(dst), "+c"(count)
     314            : "a"(zero)
     315            : "memory"
     316            );
     317        return;
     318    }
     319    if (!(sizeof(T) % sizeof(uint32_t))) {
     320        asm volatile (
     321            "cmpq $200, %%rcx\n\t"
     322            "jb 1f\n\t"
     323            "shrq $2, %%rcx\n\t"
     324            "rep stosl\n\t"
     325            "jmp 2f\n\t"
     326            "3:\n\t"
     327            "movq %%rax, (%%rdi, %%rcx)\n\t"
     328            "1:\n\t"
     329            "subq $8, %%rcx\n\t"
     330            "jae 3b\n\t"
     331            "cmpq $-8, %%rcx\n\t"
     332            "je 2f\n\t"
     333            "addq $4, %%rcx\n\t" // FIXME: This isn't really a loop. https://bugs.webkit.org/show_bug.cgi?id=182617
     334            "4:\n\t"
     335            "movl %%eax, (%%rdi, %%rcx)\n\t"
     336            "subq $4, %%rcx\n\t"
     337            "jae 4b\n\t"
     338            "2:\n\t"
     339            : "+D"(dst), "+c"(count)
     340            : "a"(zero)
     341            : "memory"
     342            );
     343        return;
     344    }
     345    if (!(sizeof(T) % sizeof(uint16_t))) {
     346        asm volatile (
     347            "cmpq $200, %%rcx\n\t"
     348            "jb 1f\n\t"
     349            "shrq $1, %%rcx\n\t"
     350            "rep stosw\n\t"
     351            "jmp 2f\n\t"
     352            "3:\n\t"
     353            "movq %%rax, (%%rdi, %%rcx)\n\t"
     354            "1:\n\t"
     355            "subq $8, %%rcx\n\t"
     356            "jae 3b\n\t"
     357            "cmpq $-8, %%rcx\n\t"
     358            "je 2f\n\t"
     359            "addq $6, %%rcx\n\t"
     360            "4:\n\t"
     361            "movw %%ax, (%%rdi, %%rcx)\n\t"
     362            "subq $2, %%rcx\n\t"
     363            "jae 4b\n\t"
     364            "2:\n\t"
     365            : "+D"(dst), "+c"(count)
     366            : "a"(zero)
     367            : "memory"
     368            );
     369        return;
     370    }
     371    asm volatile (
     372        "cmpq $200, %%rcx\n\t"
     373        "jb 1f\n\t"
     374        "rep stosb\n\t"
     375        "jmp 2f\n\t"
     376        "3:\n\t"
     377        "movq %%rax, (%%rdi, %%rcx)\n\t"
     378        "1:\n\t"
     379        "subq $8, %%rcx\n\t"
     380        "jae 3b\n\t"
     381        "cmpq $-8, %%rcx\n\t"
     382        "je 2f\n\t"
     383        "addq $7, %%rcx\n\t"
     384        "4:\n\t"
     385        "movb %%al, (%%rdi, %%rcx)\n\t"
     386        "sub $1, %%rcx\n\t"
     387        "jae 4b\n\t"
     388        "2:\n\t"
     389        : "+D"(dst), "+c"(count)
     390        : "a"(zero)
     391        : "memory"
     392        );
     393#else
     394    memset(dst, 0, length * sizeof(T));
     395#endif
     396}
     397
    183398} // namespace bmalloc
    184399
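The inline assembly above is dense, so here is a portable C++ rendering of the control flow of the byte-granularity fastCopy variant. Below 200 bytes, the assembly copies 8-byte words starting from the end of the buffer and then finishes the remaining low bytes one at a time. At 200 bytes or more it uses `rep movsb`. This is only a sketch: memcpy stands in for the `rep` path, and the names copyBytesSketch and repThresholdInBytes are hypothetical.

```cpp
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

// The value 200 comes from the `cmpq $200` comparison in the diff.
constexpr size_t repThresholdInBytes = 200;

inline void copyBytesSketch(void* dstVoid, const void* srcVoid, size_t count)
{
    assert((dstVoid && srcVoid) || !count);
    unsigned char* dst = static_cast<unsigned char*>(dstVoid);
    const unsigned char* src = static_cast<const unsigned char*>(srcVoid);

    if (count >= repThresholdInBytes) {
        // The patch issues `rep movsb` here; memcpy is a portable stand-in.
        memcpy(dst, src, count);
        return;
    }

    // Labels 1 and 3 in the assembly: peel 8-byte words off the end of the range.
    while (count >= 8) {
        count -= 8;
        uint64_t word;
        memcpy(&word, src + count, sizeof(word));
        memcpy(dst + count, &word, sizeof(word));
    }

    // Label 4 in the assembly: copy the remaining low bytes, highest offset first.
    while (count) {
        --count;
        dst[count] = src[count];
    }
}
```

In the 4-byte variant, at most one 4-byte unit can remain after the word loop, which is why the FIXME in the diff notes that the tail there is not really a loop.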
  • trunk/Source/bmalloc/bmalloc/Allocator.cpp

    r220352 r228306  
    126126    void* result = allocate(newSize);
    127127    size_t copySize = std::min(oldSize, newSize);
    128     memcpy(result, object, copySize);
     128    fastCopy(static_cast<char*>(result), static_cast<char*>(object), copySize);
    129129    m_deallocator.deallocate(object);
    130130    return result;
  • trunk/Source/bmalloc/bmalloc/Bits.h

    r224537 r228306  
    8181    BitsWordOwner& operator=(const BitsWordOwner& other)
    8282    {
    83         memcpy(m_words, other.m_words, arrayLength() * sizeof(uint32_t));
     83        fastCopy(m_words, other.m_words, arrayLength());
    8484        return *this;
    8585    }
     
    9292    void clearAll()
    9393    {
    94         memset(m_words, 0, arrayLength() * sizeof(uint32_t));
     94        fastZeroFill(m_words, arrayLength());
    9595    }
    9696   
    9797    void set(const BitsWordOwner& other)
    9898    {
    99         memcpy(m_words, other.m_words, arrayLength() * sizeof(uint32_t));
     99        fastCopy(m_words, other.m_words, arrayLength());
    100100    }
    101101   
  • trunk/Source/bmalloc/bmalloc/IsoPageInlines.h

    r225125 r228306  
    11/*
    2  * Copyright (C) 2017 Apple Inc. All rights reserved.
     2 * Copyright (C) 2017-2018 Apple Inc. All rights reserved.
    33 *
    44 * Redistribution and use in source and binary forms, with or without
     
    4848    , m_index(index)
    4949{
    50     memset(m_allocBits, 0, sizeof(m_allocBits));
     50    fastZeroFill(m_allocBits, bitsArrayLength(numObjects));
    5151}
    5252
  • trunk/Source/bmalloc/bmalloc/Vector.h

    r220352 r228306  
    204204    T* newBuffer = vmSize ? static_cast<T*>(vmAllocate(vmSize)) : nullptr;
    205205    if (m_buffer) {
    206         std::memcpy(newBuffer, m_buffer, m_size * sizeof(T));
     206        fastCopy(newBuffer, m_buffer, m_size);
    207207        vmDeallocate(m_buffer, bmalloc::vmSize(m_capacity * sizeof(T)));
    208208    }