wiki:QtWebKitMirrorGuide

Version 7 (modified by zecke@selfish.org, 14 years ago)

Mention the manipulate-content.py script for post-processing.

Using the mirror application to mirror websites

For benchmarking we want to use real web content, but we do not want to be subject to different versions of websites (e.g. when they are generated dynamically) or to varying network bandwidth and latency. The mirror application can be used to store the downloaded content in a local SQLite3 database, and the http_server can then serve this content.

It can also be used in cases where a user sees a problem that cannot be reproduced locally. In that case the user should use the mirror application to create a copy of the site and forward the database to the developer.
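The idea of serving mirrored content from a database can be pictured as a single table mapping URLs to stored response bodies. This is only a minimal sketch of the concept; the table and column names are assumptions, and the actual schema of the mirror application's crawl_db.db may differ:

```python
import sqlite3

# Hypothetical schema: one row per mirrored URL. The real crawl_db.db
# layout used by the mirror application and http_server may differ.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE responses (url TEXT PRIMARY KEY, data BLOB)")
conn.execute("INSERT INTO responses VALUES (?, ?)",
             ("http://example.org/", b"<html>mirrored content</html>"))
conn.commit()

def lookup(url):
    """Return the stored body for a URL, or None if it was not mirrored."""
    row = conn.execute("SELECT data FROM responses WHERE url = ?",
                       (url,)).fetchone()
    return row[0] if row else None
```

A server like http_server can then answer requests purely from such lookups, which is what makes the page loads reproducible.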

Building the mirror application

The mirror application uses qmake to build and works best with Qt 4.6.

Example of building and running:

$ cd host-tools/mirror
$ qmake
$ make
$ ./mirror -h
./mirror options [url]
        -c cookies.ini  Use the cookies from this file.
                        The cookie file is compatible with Arora.
        -v              Show the WebView when running
        -k              Keep the application running.

Using the mirror application to mirror content

The mirror application is able to use the Arora CookieJar. On a GNU/Linux system this file is normally located at $HOME/.local/share/data/Arora/cookies.ini. The benefit of using a cookie file is that one can log in to websites such as gmail.com or facebook.com with Arora and then mirror pages in the logged-in state.

By default the mirror application loads the page and then exits. The -k option keeps the application running. This is useful for pages that make heavy use of AJAX and continue loading resources even after the initial load has finished. This option was used for the Nokia benchmarking content for the gmail.com website.

The -v option makes the QWebView used to download the pages visible. This can be used to see which sites were downloaded or to crawl the web manually.

Step through to mirror gmail.com

  1. Build the mirror application as shown above.
  2. Use Arora to log in to the gmail.com service. Click "Stay signed in", as this stores a cookie that the mirror application can use.
  3. Use ./mirror -v -k -c $HOME/.local/share/data/Arora/cookies.ini http://www.gmail.com to start mirroring.
  4. Wait until you are logged in and the site has loaded completely.
  5. crawl_db.db now contains a copy of gmail.com. It can be served with the http_server.
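The invocation in step 3 can be sketched as a small helper that assembles the mirror command line from the options documented by ./mirror -h. The mirror_command function is a hypothetical convenience, not part of the tool; only the -c, -v, and -k flags come from the source:

```python
import os
import subprocess

def mirror_command(url, cookies=None, visible=False, keep=False):
    """Build the argument list for the mirror application.

    Hypothetical helper: flag names (-c, -v, -k) match ./mirror -h,
    but this wrapper itself is not part of the tool.
    """
    cmd = ["./mirror"]
    if cookies:
        cmd += ["-c", cookies]     # use cookies from this Arora-compatible file
    if visible:
        cmd.append("-v")           # show the WebView while running
    if keep:
        cmd.append("-k")           # keep the application running after load
    cmd.append(url)
    return cmd

cookie_path = os.path.expanduser("~/.local/share/data/Arora/cookies.ini")
cmd = mirror_command("http://www.gmail.com",
                     cookies=cookie_path, visible=True, keep=True)
# subprocess.run(cmd)  # uncomment to actually start mirroring
```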

Generate more stable loading times

Some webpages use Math.random() or the current date to randomize which content is displayed. In the case of Wikipedia this can be one of the various announcements; in the case of the apple.com website it is the image displayed on the front page and the advertisement query. The problem with this randomization is that the loading time may vary from page view to page view, so the resulting loading time is not stable. One way to deal with this is to replace all calls to Math.random() with a constant.

The manipulate-content.py script was added to remove these random sources and replace them with constants. Currently Math.random() and new Date().getTime() are replaced with constants in the db.

Another similar source of trouble comes from using the current date to fetch resources, or from fetching resources depending on the user agent. Currently there are no hints on how to deal with that. The problem is that on different dates, or on different platforms, a 404 may be returned instead of the real content, making a comparison hard.

Step through to mirror gmail.com and everything on a screen cast

  • A video on mirroring gmail.com can be seen here.
  • A video on running the do_mirror.sh script can be seen here.