Profiling of QtWebKit
Basic Principles
Performance has many aspects, including binary size, memory usage, CPU usage and execution speed, and in most cases optimizing for one property affects another. For example, optimizing for execution speed may mean switching to a data structure that needs more storage. It is therefore important to look at more than one property when making a change, which is why we focus on the following items:
- Execution speed
  - How long does it take to load a page, to scroll, or to paint? How large is the latency to start a network job? How long does it take to decode an image?
  - In Qt, the time can be measured using the QBENCHMARK { CODE } macro in QtTest test cases.
- CPU usage
  - What is using the CPU, and how often is it used?
  - There are many different tools for many different architectures. On Linux x86 there is callgrind; for the other architectures supported by Linux there is OProfile. The biggest difference between the two is that OProfile works by collecting samples, while callgrind executes the program on a virtual machine.
- Memory usage
  - How much memory is consumed? How does this change over time?
  - There are multiple levels at which this can be tracked. One is to monitor how many pages the kernel has allocated for the process, another is to look at the requested address space (sbrk), and the third is to look at calls to malloc/free. The memprof and memusage utilities keep track of malloc/free calls.
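The first of these levels, the pages the kernel has allocated for a process, can be observed on Linux without extra tools. The following is a minimal sketch, assuming a Linux /proc filesystem; the helper name mem_snapshot is made up for illustration and is not part of memprof, memusage or Qt.

```shell
#!/bin/sh
# mem_snapshot PID
# Hypothetical helper: print the address-space size (VmSize) and the
# resident set size (VmRSS) that the kernel reports for a process.
# Both values come from /proc/PID/status and are given in kB.
mem_snapshot() {
    pid="$1"
    grep -E '^(VmSize|VmRSS):' "/proc/$pid/status"
}

# Example: take a snapshot of the current shell itself.
mem_snapshot "$$"
```

Calling the helper periodically with the pid of the test application gives a rough picture of how memory consumption changes over time.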
Selecting target hardware
To generate a performance baseline, the hardware and base system need to be selected. I picked an ARM system for the profiling, specifically the beagleboard. The ARM architecture was picked because it is frequently used in mobile devices and future netbooks, which are the target for QtWebKit. The beagleboard was selected because it provides full access to the hardware (serial, JTAG), the tools are freely available (gcc, gdb, OpenOCD), the Linux OMAP community has created good support for the SoC and follows mainline (in contrast to a horrible vendor port), the Cortex-A8 will be used in many future devices, and the price of the beagleboard is quite low. One problem with the Cortex-A8 is that it might be too fast compared to previous ARM cores (e.g. VFPv3 FPU, NEON co-processor, bigger cache size, higher clock), but the advantages above outweigh this.
Tool selection
The version of Qt used is Qt Embedded Linux. The primary reason is that with Qt Embedded Linux everything from event handling to painting is under the control of Qt or the kernel. This greatly improves the ability to profile and to attribute latencies, as they are introduced by Qt or the kernel and by no other layer.
The Angstrom Linux distribution was selected as the operating system. Angstrom works well on the beagleboard, additional software can easily be installed from the repository provided by the distribution, and with the Qt Embedded Linux external toolchain one can easily compile Qt and other software for Angstrom.
OProfile was chosen for profiling as it is included in the kernel by default and has ARMv7 support. One requirement for generating call traces for userspace applications is that they are compiled with the -fno-omit-frame-pointer switch. Angstrom is built with OpenEmbedded, which makes it easy to recompile the distribution with different compiler flags. In this case the whole distribution was built with -fno-omit-frame-pointer. OProfile works by interrupting the execution, checking which application and instruction were being executed, and then generating a call stack for that point. The oprofiled daemon reads these samples from the kernel and stores them on disk. Because the tool works on samples, it will not tell you exactly how often certain methods were executed. More information can be found in the OProfile manual, and some hints on how to import an archived OProfile run on the desktop can be found in the Poky Linux Manual.
For testing execution speed, the built-in benchmarking support of QtTest was used. It provides different event counters and options to control the number of iterations of a test. These options can and should be used when a single execution of the test shows too much variation.
For looking into memory consumption, the libmemusage.so library of glibc was selected because it should be provided by every glibc installation. The library can be preloaded using LD_PRELOAD=/lib/libmemusage.so and it can be instructed to save a memory trace with MEMUSAGE_OUTPUT=/my.trace. The trace can be analyzed using the memusagestat tool of glibc or the performance suite.
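The two environment variables can be combined in a small wrapper. This is only a sketch: the function name memusage_run and the MEMUSAGE_LIB override are assumptions made for illustration, and the default library path is the one quoted in the text, which may differ between distributions.

```shell
#!/bin/sh
# memusage_run TRACE COMMAND [ARGS...]
# Hypothetical wrapper: run COMMAND with glibc's libmemusage.so preloaded
# and ask the library to write its malloc/free trace to TRACE.
# MEMUSAGE_LIB is an assumed override for the library location.
memusage_run() {
    if [ "$#" -lt 2 ]; then
        echo "usage: memusage_run TRACE COMMAND [ARGS...]" >&2
        return 2
    fi
    trace="$1"
    shift
    # The assignments apply only to the command being started.
    LD_PRELOAD="${MEMUSAGE_LIB:-/lib/libmemusage.so}" \
    MEMUSAGE_OUTPUT="$trace" \
        "$@"
}

# Possible usage (not executed here):
#   memusage_run /my.trace ./tst_something
#   memusagestat /my.trace /my.trace.png
```

Keeping the variables on the command itself, rather than exporting them, avoids tracing the shell and any helper programs started around the test.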
Beagleboard setup
The beagleboard was configured to boot via nfsroot over USB Ethernet. This makes it easy to share files between the beagleboard and the host, e.g. for installing new software. A possible disadvantage is that new parts of the binary may need to be loaded during the tests, in which case the HTTP download and NFS have to share the network.
The /etc/hosts file was adjusted to point to the server, and a custom-built version of Qt with -fno-omit-frame-pointer and the test applications were installed.
The rootfs, kernel, u-boot environment and external toolchain (amd64) can be provided on demand, or published if someone is willing to host the binaries and sources.
Running the tests
- Run the tests with oprofile
    # prepare the setup
    $ rm -rf /var/lib/oprofile
    $ opcontrol --start-daemon -p library -c 10
    # run the app once to force loading it from nfs into the cache
    $ ./tst_something
    # start profiling
    $ opcontrol --start
    $ ./tst_something -iterations enough
    # stop profiling
    $ opcontrol -h
    # generate reports
    $ opreport -l
    $ opreport -c ...
    # archiving it, it needs to be opimported on a x86 system
    $ oparchive -o /some/archive
- Running a test with memusage
    # preload libmemusage.so so that the trace file is actually written
    $ LD_PRELOAD=/lib/libmemusage.so MEMUSAGE_OUTPUT=/var/tmp/memusage.out.NUMBER ./tst_something -iterations