Tdb is a hashtable database with multiple concurrent writer and external
record lock support. For speed reasons, wherever possible tdb uses a shared
memory mapped area for data access. In its currently released form, it uses
fcntl byte-range locks to coordinate access to the data itself.

The tdb data is organized as a hashtable. Hash collisions are dealt with by
forming a linked list of records that share a hash value. The individual
linked lists are protected across processes with 1-byte fcntl locks on the
starting pointer of the linked list representing a hash value.

The external locking API of tdb allows one to lock individual records. Instead of
really locking individual records, the tdb API locks a complete linked list
with a fcntl lock.

The external locking API of tdb also allows one to lock the complete database, and
ctdb uses this facility to freeze databases during a recovery. While the
so-called allrecord lock is held, all linked lists and all individual records
are frozen alltogether. Tdb achieves this by locking the complete file range
with a single fcntl lock. Individual 1-byte locks for the linked lists
conflict with this. Access to records is prevented by the one large fnctl byte
range lock.

Fcntl locks have been chosen for tdb for two reasons: First they are portable
across all current unixes. Secondly they provide auto-cleanup. If a process
dies while holding a fcntl lock, the lock is given up as if it was explicitly
unlocked. Thus fcntl locks provide a very robust locking scheme, if a process
dies for any reason the database will not stay blocked until reboot. This
robustness is very important for long-running services, a reboot is not an
option for most users of tdb.

Unfortunately, during stress testing, fcntl locks have turned out to be a major
problem for performance. The particular problem that was seen happens when
ctdb on a busy server does a recovery. A recovery means that ctdb has to
freeze all tdb databases for some time, usually a few seconds. This is done
with the allrecord lock. During the recovery phase on a busy server many smbd
processes try to access the tdb file with blocking fcntl calls. The specific
test in question easily reproduces 7,000 processes piling up waiting for
1-byte fcntl locks. When ctdb is done with the recovery, it gives up the
allrecord lock, covering the whole file range. All 7,000 processes waiting for
1-byte fcntl locks are woken up, trying to acquire their lock. The special
implementation of fcntl locks in Linux (up to 2013-02-12 at least) protects
all fcntl lock operations with a single system-wide spinlock. If 7,000 process
waiting for the allrecord lock to become released this leads to a thundering
herd condition, all CPUs are spinning on that single spinlock.

Functionally the kernel is fine, eventually the thundering herd slows down and
every process correctly gets his share and locking range, but the performance
of the system while the herd is active is worse than expected.

The thundering herd is only the worst case scenario for fcntl lock use. The
single spinlock for fcntl operations is also a performance penalty for normal
operations. In the cluster case, every read and write SMB request has to do
two fcntl calls to provide correct SMB mandatory locks. The single spinlock
is one source of serialization for the SMB read/write requests, limiting the
parallelism that can be achieved in a multi-core system.

While trying to tune his servers, Ira Cooper, Samba Team member, found fcntl
locks to be a problem on Solaris as well. Ira pointed out that there is a
potential alternative locking mechanism that might be more scalable: Process
shared robust mutexes, as defined by Posix 2008 for example via

http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setpshared.html
http://pubs.opengroup.org/onlinepubs/9699919799/functions/pthread_mutexattr_setrobust.html

Pthread mutexes provide one of the core mechanisms in posix threads to protect
in-process data structures from concurrent access by multiple threads. In the
Linux implementation, a pthread_mutex_t is represented by a data structure in
user space that requires no kernel calls in the uncontended case for locking
and unlocking. Locking and unlocking in the uncontended case is implemented
purely in user space with atomic CPU instructions and thus are very fast.

The setpshared functions indicate to the kernel that the mutex is about to be
shared between processes in a common shared memory area.

The process shared posix mutexes have the potential to replace fcntl locking
to coordinate mmap access for tdbs. However, they are missing the criticial
auto-cleanup property that fcntl provides when a process dies. A process that
dies hard while holding a shared mutex has no chance to clean up the protected
data structures and unlock the shared mutex. Thus with a pure process shared
mutex the mutex will remain locked forever until the data structures are
re-initialized from scratch.

With the robust mutexes defined by Posix the process shared mutexes have been
extended with a limited auto-cleanup property. If a mutex has been declared
robust, when a process exits while holding that mutex, the next process trying
to lock the mutex will get the special error message EOWNERDEAD. This informs
the caller that the data structures the mutex protects are potentially corrupt
and need to be cleaned up.

The error message EOWNERDEAD when trying to lock a mutex is an extension over
the fcntl functionality. A process that does a blocking fcntl lock call is not
informed about whether the lock was explicitly freed by a process still alive
or due to an unplanned process exit. At the time of this writing (February
2013), at least Linux and OpenSolaris also implement the robustness feature of
process-shared mutexes.

Converting the tdb locking mechanism from fcntl to mutexes has to take care of
both types of locks that are used on tdb files.

The easy part is to use mutexes to replace the 1-byte linked list locks
covering the individual hashes. Those can be represented by a mutex each.

Covering the allrecord lock is more difficult. The allrecord lock uses a fcntl
lock spanning all hash list locks simultaneously. This basic functionality is
not easily possible with mutexes. A mutex carries 1 bit of information, a
fcntl lock can carry an arbitrary amount of information.

In order to support the allrecord lock, we have an allrecord_lock variable
protected by an allrecord_mutex. The coordination between the allrecord lock
and the chainlocks works like this:

- Getting a chain lock works like this:

  1. get chain mutex
  2. return success if allrecord_lock is F_UNLCK (not locked)
  3. return success if allrecord_lock is F_RDLCK (locked readonly)
     and we only need a read lock.
  4. release chain mutex
  5. wait for allrecord_mutex
  6. unlock allrecord_mutex
  7. goto 1.

- Getting the allrecord lock:

  1. get the allrecord mutex
  2. return error if allrecord_lock is not F_UNLCK (it's locked)
  3. set allrecord_lock to the desired value.
  4. in a loop: lock(blocking) / unlock each chain mutex.
  5. return success.

- allrecord lock upgrade:

  1. check we already have the allrecord lock with F_RDLCK.
  3. set allrecord_lock to F_WRLCK
  4. in a loop: lock(blocking) / unlock each chain mutex.
  5. return success.