[[ch-latency]]

== Optimizing DRBD latency

This chapter deals with optimizing DRBD latency. It examines some
hardware considerations with regard to latency minimization, and
details tuning recommendations for that purpose.

[[s-latency-hardware]]
=== Hardware considerations

DRBD latency is affected by both the latency of the underlying I/O
subsystem (disks, controllers, and corresponding caches), and the
latency of the replication network.

.I/O subsystem latency
indexterm:[latency]I/O subsystem latency is primarily a function of
disk rotation speed. Thus, using fast-spinning disks is a valid
approach for reducing I/O subsystem latency.

Likewise, the use of a indexterm:[battery-backed write cache]
battery-backed write cache (BBWC) reduces write completion times, also
reducing write latency. Most reasonable storage subsystems come with
some form of battery-backed cache, and allow the administrator to
configure which portion of this cache is used for read and write
operations. The recommended approach is to disable the disk read cache
completely and use all cache memory available for the disk write
cache.

.Network latency
indexterm:[latency]Network latency is, in essence, the packet
round-trip time ( ) between hosts. It is influenced by a number of
factors, most of which are irrelevant on the dedicated, back-to-back
network connections recommended for use as DRBD replication
links. Thus, it is sufficient to accept that a certain amount of
latency always exists in Gigabit Ethernet links, which typically is on
the order of 100 to 200 microseconds (μs) packet RTT.

Network latency may typically be pushed below this limit only by using
lower-latency network protocols, such as running DRBD over Dolphin
Express using Dolphin SuperSockets.

[[s-latency-overhead-expectations]]
=== Latency overhead expectations

As for throughput, when estimating the latency overhead associated
with DRBD, there are some important natural limitations to consider:

* DRBD latency is bound by that of the raw I/O subsystem.
* DRBD latency is bound by the available network latency.

The _sum_ of the two establishes the theoretical latency _minimum_
incurred to DRBD. DRBD then adds to that latency a slight additional
latency overhead, which can be expected to be less than 1 percent.

* Consider the example of a local disk subsystem with a write latency
  of 3ms and a network link with one of 0.2ms. Then the expected DRBD
  latency would be 3.2 ms or a roughly 7-percent latency increase over
  just writing to a local disk.

NOTE: Latency may be influenced by a number of other factors,
including CPU cache misses, context switches, and others.

[[s-latency-tuning]]
=== Tuning recommendations

[[s-latency-tuning-cpu-mask]]
==== Setting DRBD's CPU mask

DRBD allows for setting an explicit CPU mask for its kernel
threads. This is particularly beneficial for applications which would
otherwise compete with DRBD for CPU cycles.

The CPU mask is a number in whose binary representation the least
significant bit represents the first CPU, the second-least significant
bit the second, and so forth. A set bit in the bitmask implies that
the corresponding CPU may be used by DRBD, whereas a cleared bit means
it must not. Thus, for example, a CPU mask of 1 (+00000001+) means
DRBD may use the first CPU only. A mask of 12 (+00001100+) implies
DRBD may use the third and fourth CPU.

An example CPU mask configuration for a resource may look like this:

[source,drbd]
----------------------------
resource <resource> {
  options {
    cpu-mask 2;
    ...
  }
  ...
}
----------------------------

IMPORTANT: Of course, in order to minimize CPU competition between
DRBD and the application using it, you need to configure your
application to use only those CPUs which DRBD does not use.

Some applications may provide for this via an entry in a configuration
file, just like DRBD itself. Others include an invocation of the
`taskset` command in an application init script.


[[s-latency-tuning-mtu-size]]
==== Modifying the network MTU

When a block-based (as opposed to extent-based) filesystem is layered
above DRBD, it may be beneficial to change the replication network's
maximum transmission unit (MTU) size to a value higher than the
default of 1500 bytes. Colloquially, this is referred to as
indexterm:[Jumbo frames] "enabling Jumbo frames".

NOTE: Block-based file systems include ext3, ReiserFS (version 3), and
GFS. Extent-based file systems, in contrast, include XFS, Lustre and
OCFS2. Extent-based file systems are expected to benefit from enabling
Jumbo frames only if they hold few and large files.

The MTU may be changed using the following commands:
----------------------------
ifconfig <interface> mtu <size>
----------------------------
or
----------------------------
ip link set <interface> mtu <size>
----------------------------

_<interface>_ refers to the network interface used for DRBD
replication. A typical value for _<size>_ would be 9000 (bytes).

[[s-latency-tuning-deadline-scheduler]]
==== Enabling the +deadline+ I/O scheduler

When used in conjunction with high-performance, write back enabled
hardware RAID controllers, DRBD latency may benefit greatly from using
the simple +deadline+ I/O scheduler, rather than the CFQ scheduler. The
latter is typically enabled by default in reasonably recent kernel
configurations (post-2.6.18 for most distributions).

Modifications to the I/O scheduler configuration may be performed via
the +sysfs+ virtual file system, mounted at +/sys+. The scheduler
configuration is in +/sys/block/<device>+, where <device> is the
backing device DRBD uses.

Enabling the +deadline+ scheduler works via the following command:
----------------------------
`echo deadline > /sys/block/<device>/queue/scheduler`
----------------------------

You may then also set the following values, which may provide
additional latency benefits:

* Disable front merges:
----------------------------
echo 0 > /sys/block/<device>/queue/iosched/front_merges
----------------------------

* Reduce read I/O deadline to 150 milliseconds (the default is 500ms):
----------------------------
echo 150 > /sys/block/<device>/queue/iosched/read_expire
----------------------------

* Reduce write I/O deadline to 1500 milliseconds (the default is
  3000ms):
----------------------------
 echo 1500 > /sys/block/<device>/queue/iosched/write_expire
----------------------------

If these values effect a significant latency improvement, you may want
to make them permanent so they are automatically set at system
startup. indexterm:[Debian GNU/Linux]Debian and indexterm:[Ubuntu
Linux]Ubuntu systems provide this functionality via the
+sysfsutils+ package and the +/etc/sysfs.conf+ configuration file.

You may also make a global I/O scheduler selection by passing the
+elevator+ option via your kernel command line. To do so, edit your
boot loader configuration (normally found in +/boot/grub/menu.lst+ if
you are using the GRUB bootloader) and add +elevator=deadline+ to your
list of kernel boot options.
