The mirror utility
mirror is part of the QtWebKit Performance Utilities and can be used to mirror web content to a local SQLite3 database to be later used by the http_server utility.
Using the mirror application to mirror websites
For benchmarking we want to use real web content, but we do not want the results to depend on changing versions of websites (e.g. when they are dynamically created) or on varying network bandwidth and latency. The mirror application can be used to store the downloaded content in a local SQLite3 database, and the http_server can then serve this content.
It can also be used when a user is seeing a problem that cannot be reproduced locally. In this case the user should use the mirror application to create a copy and forward the database to the developer.
Building the mirror application
The mirror application is built with qmake and works best with version 4.6 of Qt.
Example of building and running:
$ cd host-tools/mirror
$ qmake
$ make
$ ./mirror -h
./mirror options [url]
    -c cookies.ini  Use the cookies from this file. The cookie file is compatible with Arora.
    -v              Show the WebView when running
    -k              Keep the application running.
Using the mirror application to mirror content
The mirror application is able to use the Arora CookieJar. On a GNU/Linux system this file is normally located in $HOME/.local/share/data/Arora/cookies.ini. The benefit of using a cookie file is that one can log in to websites like gmail.com or facebook.com using Arora and then mirror pages in the logged-in state.
By default the mirror application loads the page and then exits. One can use the -k option to keep the application running. This is useful for pages that rely heavily on AJAX and load more resources even after the initial load has finished. This option was used on the Nokia benchmarking content for the gmail.com website.
The -v option can be used to make the QWebView used to download the pages visible. This can be used to see which sites got downloaded or to manually crawl the web.
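The options described above can also be combined from a script. The following is a minimal sketch, not part of the utilities: the helper names build_mirror_command and run_mirror are hypothetical, and the default binary path ./mirror assumes the script runs from the build directory. It uses only the -c, -v and -k options listed in the help output.

```python
import os
import subprocess


def build_mirror_command(url, cookies=None, show_view=False,
                         keep_running=False, binary="./mirror"):
    """Assemble the argument list for one mirror run (hypothetical helper).

    Options come before the URL, matching "./mirror options [url]".
    """
    command = [binary]
    if show_view:
        command.append("-v")      # show the WebView while mirroring
    if keep_running:
        command.append("-k")      # keep running after the initial load
    if cookies is not None:
        # Expand "~" so a path such as ~/.local/share/... can be passed in.
        command.extend(["-c", os.path.expanduser(cookies)])
    command.append(url)
    return command


def run_mirror(command):
    """Run the assembled command and return its exit status."""
    return subprocess.call(command)


if __name__ == "__main__":
    # Only print the command here; call run_mirror(cmd) to actually start it.
    cmd = build_mirror_command("http://www.gmail.com",
                               show_view=True, keep_running=True)
    print(" ".join(cmd))
```

Building the argument list separately from running it makes it easy to check the command before starting a long mirroring session.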
Step through to mirror gmail.com
- Build the mirror application as shown above.
- Use Arora to log in to the gmail.com service. Select "stay signed in", as this stores a cookie that the mirror application can use.
- Use
./mirror -v -k -c $HOME/.local/share/data/Arora/cookies.ini http://www.gmail.com
to start mirroring.
- Wait until you are logged in and the site has loaded completely.
- crawl_db.db now contains a copy of gmail.com. It can be served with the http_server.
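Before handing the database to the http_server or to a developer, it can help to confirm that the mirror run actually stored something. The sketch below assumes nothing about the layout of crawl_db.db: it lists whatever tables the file contains and counts their rows. The function name summarize_database is hypothetical.

```python
import os
import sqlite3


def summarize_database(path):
    """Return {table_name: row_count} for every table in a SQLite file."""
    connection = sqlite3.connect(path)
    try:
        # sqlite_master is SQLite's built-in catalogue of schema objects.
        cursor = connection.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name")
        tables = [row[0] for row in cursor.fetchall()]
        counts = {}
        for table in tables:
            # Quote the identifier, since table names are not known up front.
            quoted = '"' + table.replace('"', '""') + '"'
            query = "SELECT COUNT(*) FROM " + quoted
            counts[table] = connection.execute(query).fetchone()[0]
        return counts
    finally:
        connection.close()


if __name__ == "__main__":
    # Check for the file first: sqlite3.connect would create an empty one.
    if os.path.exists("crawl_db.db"):
        for name, rows in summarize_database("crawl_db.db").items():
            print("%s: %d rows" % (name, rows))
    else:
        print("crawl_db.db not found in the current directory")
```

An empty result, or tables with zero rows, suggests that the page had not finished loading when the mirror application exited.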
Generate more stable loading times
Some web pages use Math.random() or the current date to randomize which content is displayed. In the case of Wikipedia this can be one of the various announcements; in the case of the apple.com website it is the image displayed on the front page and the advertisement query. The problem with this randomisation is that the loading time may vary from page view to page view, so the measured loading time is not stable. One way to deal with it is to replace all calls to Math.random() with a constant.
The manipulate-content.py script was added to remove these sources of randomness and replace them with constants. Currently, Math.random() and new Date().getTime() are replaced with constants in the database.
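The kind of substitution described above can be sketched as follows. This is an illustration only and not the code of manipulate-content.py: the function name freeze_random_sources, the regular expressions and the two constant values are assumptions, and reading from and writing back to the database is left out because the schema is not described here.

```python
import re

# Patterns for the two sources of randomness named above.
RANDOM_CALL = re.compile(r"Math\.random\(\s*\)")
TIME_CALL = re.compile(r"new\s+Date\(\s*\)\.getTime\(\s*\)")

# Assumed replacement values; any fixed values would do.
FIXED_RANDOM = "0.5"
FIXED_TIME = "1262304000000"  # 2010-01-01 00:00:00 UTC in milliseconds


def freeze_random_sources(script):
    """Replace random and time lookups in JavaScript text with constants."""
    script = RANDOM_CALL.sub(FIXED_RANDOM, script)
    script = TIME_CALL.sub(FIXED_TIME, script)
    return script


if __name__ == "__main__":
    sample = "var pick = Math.floor(Math.random() * 3);"
    print(freeze_random_sources(sample))
```

A plain textual replacement like this only catches the literal spellings of the two calls. Scripts that alias Math.random or build a Date in another way are left unchanged.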
Another similar source of trouble is content that uses the current date to fetch resources, or that fetches resources depending on the user agent. Currently there are no hints on how to deal with that. The problem is that on different dates or different platforms a 404 might be returned instead of the real content, making a comparison hard.