wiki:commit-identifiers

One of the biggest issues some of WebKit's contributors have with git is that commits are referred to by hash rather than by an ordered identifier, as they are in Subversion. Defending the performance and correctness of the WebKit project has historically relied heavily on Subversion revisions being ordered. A few other large projects, such as Chromium, have adopted revision systems on top of git that look much like Subversion. One thing we would like to improve on from Subversion, though, is that revisions are applied across every branch (in git, this would be every protected branch). Given two revisions, for example r1234 and r1244, the intervening revisions may all lie on the same branch as them, or they may all have landed on a different branch, in which case r1234 and r1244 are actually sequential commits on the same branch. WebKit is looking for a system that fulfills the following criteria:

  • Monotonically increasing integer identifier per-commit on protected branches
  • Identifiers which function without additional metadata inside each commit message or in the repository
  • System assumes no changes (or no history rewriting) are ever to happen on protected branches
  • System will not canonically identify non-protected branches (i.e. development branches)
  • Identifiers should be human readable and optimized for bisection
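
To illustrate why ordered integer identifiers suit bisection, here is a minimal Python sketch. The function name `first_bad` and the `is_bad` callback are hypothetical and are not part of any WebKit tooling. It assumes the identifiers all lie on one branch and that every commit after the first failing one also fails.

```python
def first_bad(good, bad, is_bad):
    """Return the smallest identifier in (good, bad] for which is_bad is true.

    Assumes `good` passes, `bad` fails, and every commit after the first
    failing commit also fails.
    """
    while bad - good > 1:
        middle = (good + bad) // 2  # plain integer arithmetic, no hash lookup
        if is_bad(middle):
            bad = middle
        else:
            good = middle
    return bad


# Hypothetical regression introduced at identifier 1240 on one branch.
print(first_bad(1234, 1244, lambda number: number >= 1240))
```

With hashes alone, choosing the midpoint would require asking git for the commit list first. With ordered identifiers, the midpoint is just an integer.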

The WebKit Operations team has developed a system of commit identifiers which we think fulfills these requirements. That work is tracked in https://bugs.webkit.org/show_bug.cgi?id=216404. A brief outline of the idea:

Commit identifiers are of the form:

<branch-point>.<number>@<branch>

If no branch is specified, the default branch (main) is assumed. Commits on the default branch have no branch point.
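
The format can be sketched as a small Python parser. This is an illustration only, not the implementation used by WebKit's scripts: the names `Identifier`, `parse` and `DEFAULT_BRANCH` are hypothetical, and the regular expression is an assumption about what the format permits (it accepts a negative number, which the scheme allows on branches).

```python
import re
from typing import NamedTuple, Optional

DEFAULT_BRANCH = "main"  # assumed when the identifier names no branch

# <branch-point>.<number>@<branch>; the branch point and branch are optional.
IDENTIFIER = re.compile(
    r"^(?:(?P<point>\d+)\.)?(?P<number>-?\d+)(?:@(?P<branch>\S+))?$"
)


class Identifier(NamedTuple):
    branch_point: Optional[int]  # None for commits on the default branch
    number: int
    branch: str

    def __str__(self):
        prefix = "" if self.branch_point is None else f"{self.branch_point}."
        return f"{prefix}{self.number}@{self.branch}"


def parse(text):
    """Split an identifier string into its three parts."""
    match = IDENTIFIER.match(text.strip())
    if match is None:
        raise ValueError(f"not a commit identifier: {text!r}")
    point = match.group("point")
    return Identifier(
        branch_point=None if point is None else int(point),
        number=int(match.group("number")),
        branch=match.group("branch") or DEFAULT_BRANCH,
    )


print(parse("101.2@branch-b"))
print(parse("1234"))  # no branch given, so the default branch is assumed
```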

A commit identifier on main is the number of commits from the very beginning of the repository up to and including that commit, with the first commit in the repository (referred to as the primordial commit) having an identifier of 1@main.
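
One way to obtain such a count is `git rev-list --count`, which prints the number of commits reachable from a given commit. The Python sketch below is an assumption about how this could be done, not a description of how WebKit's tooling does it. The function name `identifier_on_main` and its `run` parameter are hypothetical, and it assumes the commit really is on the default branch.

```python
import subprocess


def identifier_on_main(ref="HEAD", run=subprocess.check_output):
    """Return '<count>@main' for a commit assumed to be on the default branch.

    `git rev-list --count <ref>` prints how many commits are reachable from
    <ref>, so the primordial commit yields 1.
    """
    output = run(["git", "rev-list", "--count", ref])
    return f"{int(output.decode().strip())}@main"
```

On a history that contains merge commits, reachability also counts the merged-in commits. Whether that matches the scheme is not specified here. Adding git's `--first-parent` option would restrict the count to the branch's own line of commits.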

A commit identifier on a branch combines the branch point (the identifier of the commit on the default branch from which the branch diverged) with the number of commits since that divergence, like so:

                              ———— o ————————————— o
                            /      |               |
                           /  101.2@branch-b  101.3@branch-b
                  ——————— o ———————————— o
                /         |              |
               /   101.1@branch-a  101.2@branch-a
    o ———————— o ———————— o ——————— o
    |          |          |         |
 100@main   101@main   102@main  103@main

Under this architecture, commits have multiple valid identifiers. For example, 101.1@branch-a could also be referred to as 101.1@branch-b. Commit identifiers can even be negative (since they just describe a commit's relationship with the default branch), so 101.-1@branch-a would refer to 100@main.
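
The toy Python model below mirrors the diagram to show how one commit can carry several identifiers and how a negative number resolves to a commit on main. Everything in it is hypothetical: the commit names (`m100`, `a1`, `b2`) stand in for hashes, and `identify` and `resolve` are illustrative helpers rather than functions of git-webkit.

```python
# Each branch is its full line of commits, oldest first.
MAIN = [f"m{n}" for n in range(1, 104)]  # m1 .. m103
BRANCHES = {
    "main": MAIN,
    "branch-a": MAIN[:101] + ["a1", "a2"],        # diverges after 101@main
    "branch-b": MAIN[:101] + ["a1", "b2", "b3"],  # shares a1 with branch-a
}


def branch_point(line):
    """Number of leading commits a branch shares with main."""
    shared = 0
    for ours, theirs in zip(line, MAIN):
        if ours != theirs:
            break
        shared += 1
    return shared


def identify(commit, branch):
    """Describe a commit relative to the given branch."""
    line = BRANCHES[branch]
    position = line.index(commit) + 1  # 1-based count from the primordial commit
    point = branch_point(line)
    if position <= point:
        return f"{position}@main"  # the commit actually lives on main
    return f"{point}.{position - point}@{branch}"


def resolve(identifier):
    """Map an identifier back to a commit in the toy history."""
    numbers, _, branch = identifier.partition("@")
    line = BRANCHES[branch or "main"]
    if "." in numbers:
        point, offset = (int(part) for part in numbers.split(".", 1))
        index = point + offset - 1
    else:
        index = int(numbers) - 1
    if not 0 <= index < len(line):
        raise ValueError(f"no such commit: {identifier}")
    return line[index]


print(identify("a1", "branch-a"), identify("a1", "branch-b"))
print(identify(resolve("101.-1@branch-a"), "main"))
```

In this model, `identify` already reports commits shared with main as `<number>@main`, which is one simple form of normalization. It cannot tell whether `a1` originally belonged to branch-a or branch-b.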

The canonical identifier for a commit is the one that names its original branch. When non-canonical identifiers are passed to scripts, we will do our best to normalize them to their canonical identifiers, but since git has no way of knowing which branch was the original branch, we cannot guarantee that the normalized identifier will be the canonical one. When a commit is made to a protected branch, we intend to annotate the commit message with the canonical identifier.

It's important to stress that these identifiers are not required to interact with a repository that supports them. Our intention is to have a more human-friendly system of interacting with the project. CI will support and display these identifiers in addition to commit hashes, not in place of them.

The biggest drawback of this scheme is that it relies heavily on the default branch always remaining the default branch (even if it starts to be known by a different alias). To use the above example, if one decided that branch-b was to be the default, the historical identifier scheme would be broken. If we wanted to retain the ability to make a new branch the default branch, we could have identifiers from all branches count from the very first commit on the default branch, but this would come at the cost of making a branch's deviation point from the default branch much less obvious. Because Subversion already applies a similar restriction on 'trunk', we believe that making the branch point from the default branch clear is more important than the ability to switch which branch is the default branch.

As of r268433 (or, if you prefer, 230428@trunk), the git-webkit script has landed, introducing basic support for identifiers:

% git-webkit find
Title: [webkitscmpy] Add `git-webkit find`
Author: jbedard@apple.com <jbedard@apple.com>
Identifier: 230428@trunk
Date: Tue Oct 13 09:53:34 2020
Revision: 268433
Last modified on Oct 13, 2020 5:16:57 PM