Context¶
Uevents and udev¶
The Linux kernel provides a way to send simple notification messages to userspace about changes in a device’s state. We call these udev events, or uevents for short.
The uevents are sent from the kernel to userspace using the netlink interface (man 7 netlink). The exact netlink type reserved for this purpose is NETLINK_KOBJECT_UEVENT. One or more userspace listeners can register to receive the events; if there is more than one listener, the events are multicast to all of them.
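As a minimal sketch of such a listener, the following Python code opens a netlink socket of the NETLINK_KOBJECT_UEVENT type and subscribes to the kernel's multicast group. It is an illustration, not how udevd itself is implemented. The numeric protocol value (15) and the group bitmask (1 for uevents sent by the kernel) are taken from the kernel and libudev headers rather than from this text, and the helper names are hypothetical.

```python
import socket

# Protocol number from <linux/netlink.h>; defined here because Python's
# socket module may not export this constant.
NETLINK_KOBJECT_UEVENT = 15

# Multicast group bitmask that carries uevents sent by the kernel.
KERNEL_GROUP = 1


def open_kernel_uevent_socket():
    """Return a netlink socket subscribed to kernel uevents (Linux only)."""
    sock = socket.socket(socket.AF_NETLINK, socket.SOCK_DGRAM,
                         NETLINK_KOBJECT_UEVENT)
    # A netlink address is a (pid, groups) pair; pid 0 lets the kernel
    # assign a unique port ID to this listener.
    sock.bind((0, KERNEL_GROUP))
    return sock


def receive_one(sock, bufsize=8192):
    """Block until one uevent datagram arrives and return its raw bytes."""
    return sock.recv(bufsize)
```

Several processes can run this at the same time; each one receives its own copy of every uevent.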
The currently supported uevent action names are:
- add: device added,
- change: device changed,
- remove: device removed,
- move: device moved to a new parent or device renamed,
- offline: device is put offline,
- online: device is put back online after being offline,
- bind: driver is bound to a device (since kernel version 4.14),
- unbind: driver is unbound from a device (since kernel version 4.14).
The most frequently used ones are add, change and remove.
Each kernel uevent contains a set of environment variables in KEY=VALUE format. The minimal, basic set of keys found in a kernel uevent, which is added by the kernel’s common uevent code, contains at least:
- ACTION: device’s action name,
- DEVPATH: device’s canonical path in sysfs (see also man 5 sysfs),
- SUBSYSTEM: subsystem the device belongs to,
- SEQNUM: this uevent’s sequence number.
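The KEY=VALUE environment can be illustrated with a small parser. This sketch assumes the wire layout the kernel uses for these messages, an "ACTION@DEVPATH" header followed by NUL-separated KEY=VALUE fields. That layout is not spelled out in this text, and the function names and the sample payload are hypothetical.

```python
BASIC_KEYS = ("ACTION", "DEVPATH", "SUBSYSTEM", "SEQNUM")


def parse_kernel_uevent(payload: bytes) -> dict:
    """Split a raw kernel uevent into its header and KEY=VALUE environment."""
    fields = [f for f in payload.split(b"\0") if f]
    if not fields:
        raise ValueError("empty uevent payload")
    header = fields[0].decode()            # assumed form: "add@/devices/..."
    env = {}
    for field in fields[1:]:
        key, sep, value = field.decode().partition("=")
        if sep:                            # ignore anything that is not KEY=VALUE
            env[key] = value
    return {"header": header, "env": env}


def missing_basic_keys(env: dict) -> list:
    """Return the basic keys that are absent from a uevent environment."""
    return [key for key in BASIC_KEYS if key not in env]


# Hypothetical payload, for illustration only.
SAMPLE = (b"add@/devices/virtual/block/loop0\0ACTION=add\0"
          b"DEVPATH=/devices/virtual/block/loop0\0SUBSYSTEM=block\0SEQNUM=1234\0")
```

Calling parse_kernel_uevent(SAMPLE)["env"] yields a dictionary with the four basic keys, so missing_basic_keys() returns an empty list for it.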
The Linux kernel’s driver core then adds further keys to extend the basic set, if values for these keys are available:
- MAJOR: device’s major number,
- MINOR: device’s minor number,
- DEVNAME: device’s canonical kernel name,
- DEVMODE: device’s permissions mode,
- DEVUID: device’s user ID (if not global root UID),
- DEVGID: device’s group ID (if not global root GID),
- DEVTYPE: device’s type name,
- DRIVER: device’s driver name.
Various device subsystems and device drivers in the kernel can add even more KEY=VALUE pairs. However, the overall size of the uevent is limited: the maximum number of KEY=VALUE pairs is 32 (the UEVENT_NUM_ENVP constant in the kernel source code) and the whole uevent sent from the kernel is limited to 2048 bytes (the UEVENT_BUFFER_SIZE constant in the kernel source code).
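The two limits can be expressed as a simple check. This is a rough sketch that uses the values quoted above. It assumes each pair is stored as "KEY=VALUE" plus one terminating NUL byte, which only approximates the kernel's own accounting, and the function name is hypothetical.

```python
UEVENT_NUM_ENVP = 32        # maximum number of KEY=VALUE pairs, as quoted in the text
UEVENT_BUFFER_SIZE = 2048   # maximum size of the whole uevent in bytes, as quoted in the text


def fits_uevent_limits(env: dict) -> bool:
    """Return True if the environment stays within both uevent limits."""
    if len(env) > UEVENT_NUM_ENVP:
        return False
    # Assumption: each pair occupies len("KEY=VALUE") bytes plus one NUL byte.
    size = sum(len(f"{key}={value}".encode()) + 1 for key, value in env.items())
    return size <= UEVENT_BUFFER_SIZE
```

A driver that wants to attach a long list of properties to a uevent can run into either limit first, depending on whether it adds many short pairs or a few long ones.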
For the purpose of this text, we will call the uevents which are sent from the kernel genuine kernel uevents (or kernel uevents for short).
In general, each uevent for a block device has the ACTION, DEVPATH, SUBSYSTEM, SEQNUM, MAJOR, MINOR, DEVNAME and DEVTYPE keys set in its uevent environment.
Besides genuine kernel uevents, which are generated by execution within the kernel and its drivers, there are also synthetic kernel uevents (or synthetic uevents for short). Even though these uevents are generated in the kernel, they are provoked directly from userspace by writing the uevent action name to the /sys/…/uevent file. Such uevents look exactly like genuine kernel uevents; the only difference is that their environment contains only the basic key set, without any additional keys that drivers may add. They are usually used to trigger reevaluation of device state in userspace once the synthetic uevent is received there.
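Provoking a synthetic uevent amounts to a single write. The sketch below assumes the caller supplies the full path of a device's uevent file in sysfs, and that the process has permission to write to it. The function name is hypothetical.

```python
# Action names as listed earlier in this text.
ACTIONS = ("add", "change", "remove", "move",
           "offline", "online", "bind", "unbind")


def trigger_synthetic_uevent(uevent_file, action="change"):
    """Write an action name to a device's uevent file to provoke a uevent."""
    if action not in ACTIONS:
        raise ValueError(f"unknown uevent action: {action!r}")
    # The kernel reacts to the write by generating a uevent for the device.
    with open(uevent_file, "w") as handle:
        handle.write(action)
```

The "change" action is the usual choice when the goal is only to have userspace reevaluate the device.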
It is also possible to send uevents directly from userspace back to userspace, hence providing a way to send messages between two or more userspace processes. We call these udev uevents.
The general term that encompasses all uevents and the logic of dynamic device management built on them in userspace is udev. On the userspace side, the important component is the udev daemon.
Udev daemon¶
One of the userspace uevent listeners has a primary role: the udev daemon, or udevd for short, which is part of the systemd project.
The way udevd processes uevents in userspace is driven by udev rules, which are usually placed in the /lib/udev/rules.d and /etc/udev/rules.d directories. A common set of rules is provided directly by upstream udevd; other rules are installed by other tools and system components.
Whenever a new kernel uevent (genuine or synthetic) arrives from the kernel and is received by udevd, udevd either creates a new process (called a udevd worker) to handle the event or reuses an existing worker if one is available. Udevd keeps a worker process alive if it has just finished processing an event and the queue is not yet empty, so the worker can immediately execute the rules for the next uevent, saving machine time and resources.
Only one event can be handled at a time for a single device. That means all processing of uevents that are issued for a single device is serialized, queued and processed one by one while uevents for different devices can still be processed in parallel. Devices are distinguished based on their canonical device path in sysfs.
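The serialization rule can be modelled with a small queue keyed by DEVPATH. This is a simplified sketch of the behaviour described above, not udevd's actual scheduler. The EventQueue class and its method names are hypothetical.

```python
from collections import OrderedDict, deque


class EventQueue:
    """Toy model: at most one event in flight per device (per DEVPATH)."""

    def __init__(self):
        self._pending = OrderedDict()   # DEVPATH -> deque of queued events
        self._busy = set()              # DEVPATHs with an event in progress

    def push(self, event):
        """Queue an event behind earlier events for the same device."""
        self._pending.setdefault(event["DEVPATH"], deque()).append(event)

    def start_ready(self):
        """Start the next event of every idle device; return the events started."""
        started = []
        for devpath, queue in self._pending.items():
            if devpath not in self._busy and queue:
                started.append(queue.popleft())
                self._busy.add(devpath)
        return started

    def finish(self, event):
        """Mark the device idle again so its next queued event can start."""
        self._busy.discard(event["DEVPATH"])
```

With two events queued for one device and one event for another, the first call to start_ready() starts one event per device, and the second event of the first device starts only after finish() has been called for its predecessor.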
There is a limit on the number of worker processes created to handle uevents in parallel. It is controlled by udevd’s --children-max command line option or by the udev.children_max argument on the kernel command line. This way, it is possible to control the degree of parallelism udevd uses. In the current implementation, the default value is computed using a simple formula based on the number of CPU cores available.
Udevd’s primary role is to collect any additional information that is needed to create various symlinks under /dev directory and to set permissions driven by instructions written in udev rules. Udevd has no control over device node names (with the exception of network devices). With devtmpfs filesystem in use, the device nodes are created directly by kernel and udevd only adjusts their permissions. Udev rules can also access information present in sysfs for the device that is being processed. To collect any other information, udev rules need to instruct udevd to execute external commands or to gather this information in a special way. This is accomplished by executing one of these rules:
- IMPORT: Executes a command that exports information in KEY=VALUE pairs, which is then imported into the udev context where further udev rules can access it; the call is made right at the time the rule is hit.
- RUN: Adds a command to the list of commands to be executed after all the rules are processed, so execution is delayed until the end of udev rule processing.
- PROGRAM: Executes a command whose string output can then be matched with the accompanying RESULT rule.
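A command called through IMPORT hands its results back as KEY=VALUE lines on standard output. The following sketch shows the output side of such a helper. The function name and the EXAMPLE_* keys are hypothetical and serve only as an illustration.

```python
def format_import_output(props: dict) -> str:
    """Render properties as KEY=VALUE lines, one per line, for an IMPORT helper."""
    lines = []
    for key, value in props.items():
        text = str(value)
        if "\n" in text or "=" in key:
            # A newline or a stray "=" in the key would corrupt the line format.
            raise ValueError(f"cannot export {key!r}")
        lines.append(f"{key}={text}\n")
    return "".join(lines)
```

A helper script would print the result of format_import_output({"EXAMPLE_STATE": "ready"}) and exit successfully, after which later rules can match on the imported key.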
The IMPORT and RUN rules can execute either an external command or one of udevd’s own builtin commands, selected by specifying IMPORT{builtin} or RUN{builtin}. Builtin commands have an advantage over external commands in that they do not require a new process to be created (forked), and they are initialized as soon as udevd is started. However, builtin commands need to be integrated directly into udevd’s code base; they are not designed as external modules loaded on udevd startup.
Udevd restricts the time available to execute all the udev rules for a particular uevent. Currently, the default value is 180 seconds. The default can be overridden with udevd’s --event-timeout option or on the kernel command line with the udev.event-timeout argument. The timer starts counting as soon as the worker process is forked or reused, and it is stopped when the main udevd process receives a message from the worker process that it has finished processing. A simplified list of the steps taken to run a udev worker on an incoming kernel uevent follows:
- kernel uevent is received by main udevd process
- udevd creates or reuses udevd worker process to handle the uevent
- udevd starts timer for the udevd worker
- udevd worker executes and applies udev rules
- udevd worker updates udev database
- udevd worker executes run queue (all the calls as instructed by RUN rule)
- udevd worker sends udev uevent
- udevd worker sends worker finished message to main udevd process
- udevd receives the worker finished message from worker
- udevd stops timer for the udevd worker
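The ordering of the worker steps, in particular that IMPORT takes effect immediately while RUN is deferred until all rules have been processed, can be illustrated with a toy model. This is a sketch under simplifying assumptions: rules are represented as (kind, callable) pairs, which is not how udev rules are actually written, and all names are hypothetical.

```python
def process_uevent(env, rules):
    """Toy model of a worker: apply rules, then run the deferred RUN queue."""
    env = dict(env)          # work on a copy of the kernel uevent environment
    run_queue = []
    log = []
    for kind, action in rules:
        if kind == "IMPORT":
            env.update(action(env))      # executed right when the rule is hit
            log.append("IMPORT executed")
        elif kind == "RUN":
            run_queue.append(action)     # only queued while rules are processed
            log.append("RUN queued")
        else:
            raise ValueError(f"unsupported rule kind: {kind!r}")
    log.append("database updated")
    for command in run_queue:
        command(env)                     # sees the final environment
        log.append("RUN executed")
    log.append("udev uevent sent")
    return env, log
```

Even if a RUN rule appears before an IMPORT rule, the queued command observes the imported values, because it executes only after rule processing has finished.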
The udev uevent, in contrast to the kernel uevent, is the uevent sent by udevd directly to all its listeners after all the rules have been processed. It therefore contains all the environment variables in KEY=VALUE format that have been added by executing and applying the udev rules. Udevd sends the udev uevent over the same netlink interface it used to receive the kernel uevent. Listeners for both udev uevents and kernel uevents usually subscribe using the libudev library (man 3 libudev), which wraps the uevents in a structure for easier manipulation and further processing with various libudev functions, and which hides the actual netlink usage from the library user.
The udev database is a simple filesystem-based database (usually stored in the /run/udev directory). It contains the current environment for each device, that is the KEY=VALUE pairs, and other information used and recorded by udevd: the list of symlinks, symlink priority, tags, and monitoring of device content changes requested by the OPTIONS+="watch" udev rule.
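To make the kinds of recorded information concrete, here is a sketch that parses one database record. It assumes a line-oriented layout with single-letter prefixes (E: for environment pairs, S: for symlinks, L: for symlink priority, G: for tags, W: for the watch handle), as commonly seen in the database files. This layout is an internal detail of udevd, not a stable interface, so treat it as an assumption. The function name and the sample values in the usage note are hypothetical.

```python
def parse_db_record(text: str) -> dict:
    """Parse one udev database record under the assumed prefix layout."""
    record = {"env": {}, "symlinks": [], "tags": [],
              "link_priority": None, "watch": None}
    for line in text.splitlines():
        prefix, sep, value = line.partition(":")
        if not sep:
            continue                       # skip lines without a prefix
        if prefix == "E":                  # environment: KEY=VALUE
            key, _, val = value.partition("=")
            record["env"][key] = val
        elif prefix == "S":                # symlink, relative to /dev
            record["symlinks"].append(value)
        elif prefix == "G":                # tag
            record["tags"].append(value)
        elif prefix == "L":                # symlink priority
            record["link_priority"] = int(value)
        elif prefix == "W":                # watch handle
            record["watch"] = value
    return record
```

For example, a record containing the lines "E:DEVTYPE=disk" and "G:systemd" produces an env dictionary with DEVTYPE set to disk and a tags list containing systemd. Tools should normally query the database through libudev or udevadm rather than read these files directly.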
Note
The OPTIONS+="watch" udev rule is internally implemented using the inotify monitoring mechanism (man 7 inotify). Whenever a monitored device is closed after having been open for writing, the udev daemon receives an inotify event and generates a synthetic uevent for the device based on it. The OPTIONS+="watch" udev rule is usually used when we expect that a write operation to the device can change its content in a way that also changes how udev rules are evaluated, which in turn can change the udev database content.
Block device uevent processing¶
Block device uevent processing is driven by udev rules provided by both upstream udev itself as well as block device subsystems.
Rules provided by udev itself¶
- 60-block.rules: Enables media presence polling, forwards SCSI events to the corresponding block device and sets OPTIONS+="watch" for selected block devices.
- 60-persistent-storage.rules: Imports parent information from udev database for partitions, calls ata_id, scsi_id, usb_id, path_id, blkid, sets device symlinks.
- 60-persistent-storage-tape.rules: Calls blkid, sets device symlinks.
- 60-cdrom_id.rules: Calls cdrom_id, sets device symlinks.
- 64-btrfs.rules: Calls btrfs_ready builtin command, marks device as not ready if needed and sets SYSTEMD_READY variable appropriately.
- 99-systemd.rules: Sets SYSTEMD_READY variable based on various other variables and/or sysfs content. It also includes handling of loop devices.
Rules provided by device-mapper (DM) subsystem¶
- 10-dm.rules: Calls dmsetup udevflags to decode flags out of DM_COOKIE variable, calls dmsetup info if needed, sets device symlinks, imports variables from previous udev database state if needed.
- 13-dm-disk.rules: Calls blkid, sets device symlinks.
- 95-dm-notify.rules: Calls dmsetup udevcomplete to notify waiting process about udev rule processing completion.
Rules provided by DM-LVM subsystem¶
- 11-dm-lvm.rules: Calls dmsetup splitname to split DM name into VG/LV/layer parts, imports variables from previous udev state if needed, sets device symlinks.
- 12-dm-lvm-permissions.rules: This is a template to add rules to set device permissions.
- 69-dm-lvm-metad.rules: Detects when the device is ready for use and schedules the lvm2-pvscan@<major>:<minor>.service systemd unit, which contains a pvscan --cache -a ay call to update lvmetad and to activate a VG once it is complete.
Rules provided by DM-multipath subsystem¶
- 11-dm-mpath.rules: Imports variables from previous udev database state if needed, marks the multipath device as ready or not ready and records whether scanning can be done on this device.
- 62-multipath.rules: Calls multipath -c and multipath -T to check for multipath components, imports variables from previous udev database state if needed, calls partx to remove partitions on multipath components and calls kpartx to create partition mappings on top of a multipath device.
Rules provided by multiple device (MD) subsystem¶
- 63-md-raid-arrays.rules: Handles arrays with external metadata (DDF and Intel Matrix RAID), calls mdadm --detail, calls blkid, creates device symlinks, schedules MD array monitoring.
- 65-md-incremental.rules: Calls mdadm -I for incremental addition or removal of a device to/from an MD array if the device is ready/removed, requests the mdadm-last-resort@<md_device>.timer systemd unit to be started to implement a timeout after which the MD device is started in degraded mode.
Rules provided by Ceph subsystem¶
- 50-rbd.rules: Calls ceph-rbdnamer and creates device symlinks based on results.
- 60-ceph-by-parttypeuuid.rules: Forwards SCSI events to corresponding block device, imports parent information from udev database for partitions, calls blkid, creates device symlinks for partitions.
- 95-ceph-osd.rules: Sets permissions, calls ceph-disk.
Rules provided by btrfs subsystem¶
- 64-btrfs-dm.rules: Calls btrfs ready to let btrfs subsystem know the underlying DM device is ready.
- 64-btrfs.rules: Calls btrfs ready to let btrfs subsystem know the underlying device is ready.
Problematic areas¶
Udevd was primarily designed to collect the additional information needed for a specific device and then to create additional symlinks in /dev and set proper permissions for the device node based on rules.
Although the majority of the rules handling block devices do set device node symlinks, the number of other calls within these rules has also risen over the years. Currently, the rules do not only collect additional information; they also provide other functionality, such as further activation and various helper calls that support specific aspects of block device subsystems. As a consequence, this approach has a number of problems and shortcomings that have become significant.
This section lists and briefly describes various problems and shortcomings in general which we have identified while trying to deploy storage-related solutions over time and then trying to integrate them with udev.
These problems are not completely discrete. Instead, they are very closely related to each other and a solution to one of these problems usually reduces degree of impact of other problematic parts.
Multistep activation¶
Some block devices have more complex nature when it comes to activation and detecting current device state.
This is mainly the case for subsystems like DM (including device-mapper multipath and the LVM subsystem) and MD, where devices are created first (which generates an add uevent) but may not be usable right away. Usually, one or more further steps are needed to make these devices ready for use (which generates further change uevents).
Notion of device groups and stack awareness¶
One of the most important features we also need to take into account is the fact that some block devices can be stacked on top of each other and they can form an abstraction over a set of devices which logically groups them together.
Udev has no direct notion of grouping or stack awareness within the device groups.
Intermediate steps during device management¶
Some subsystems also support conversions from one type to another which may require several deactivation and activation steps and transforming the device with intermediate steps in between.
Unless we mark the intermediate states with additional KEY=VALUE pairs within the uevents the kernel driver generates, or unless we use external information or a tool to decide what the current state is, we cannot distinguish these states within udev rules and we act as if this were a usual device activation, deactivation or generic change. The usual set of rules is executed even though the commands executed within those rules may interfere with the process of device transformation or conversion.
Also, such processing may not be efficient if the result is outdated right in the next step that follows and we are only interested in the overall result when the device is fully set up again and ready for use.
Recognizing uevents, device’s state and overloaded uevents¶
All block device subsystems use udevd to drive userspace actions based on uevents coming from the kernel, either originating in the kernel driver itself or synthesized in userspace by writing the /sys/…/uevent file.
Inherently, some of the special uevents that these block subsystems would need to have processed are mapped onto a single change uevent instead of distinct uevents directly describing the nature of the event.
This fact makes the udev rules complex because they need to deal with these device state transitions and they need to recognize uevents properly to know what the transition is exactly, possibly comparing udev’s environment (the KEY=VALUE pairs) with the previous environment stored in the udev database.
Udevd was not designed for this task. Even though there are rules to import previous udev database values (the IMPORT{db} udev rule), we cannot efficiently compare the previous and current values of keys in udev’s environment directly. Only simple string matching is available, so rules of the form ENV{KEY}=="direct_string_to_match" are possible, but rules of the form ENV{KEY1}==ENV{KEY2} are not. Udevd also does not support number comparisons directly within udev rules, because the only operator supported is a match against an explicit string value.
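The comparisons that udev rules cannot express are straightforward in a general-purpose language, which is one reason such logic tends to end up in external helpers. The sketch below shows the three kinds of checks mentioned above. The function names are hypothetical, and how the previous environment is obtained is left to the caller.

```python
def changed_keys(previous: dict, current: dict) -> list:
    """Return the keys whose values differ between two environments."""
    keys = set(previous) | set(current)
    return sorted(k for k in keys if previous.get(k) != current.get(k))


def keys_equal(env: dict, key1: str, key2: str) -> bool:
    """Equivalent of the unsupported ENV{KEY1}==ENV{KEY2} comparison."""
    return key1 in env and key2 in env and env[key1] == env[key2]


def number_greater(env: dict, key: str, threshold: int) -> bool:
    """Numeric comparison, which string matching alone cannot express."""
    try:
        return int(env[key]) > threshold
    except (KeyError, ValueError):
        return False   # missing or non-numeric values never satisfy the test
```

Note that a plain string match would consider "10" and "9" in lexical order, whereas number_greater() compares them as integers.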
Debugging and logging¶
As the rules get more and more complex, effective debugging becomes complicated whenever a problem appears: udevd does not report the current environment it is working with, nor does it support adding logging hooks into the rules directly. This makes it hard to track which path was actually taken when the udev rules were processed and what the actual states were.
This status quo is also a consequence of the fact that some device subsystems try to implement more complex logic with the udev rules than they were originally designed for.
Marking devices as ready¶
The udev rules are responsible for triggering device activation based on current state at proper time. This becomes even more prominent if we are considering device stacks where one block device subsystem is layered on top of another one and so on.
We need a proper and standard way of marking devices in the layer below as ready for any layer above. This standard is currently missing. Each subsystem has its own way of marking the device as ready: there are various KEY=VALUE pairs to check in udev’s environment (e.g. DM_ACTIVATION, MD_STARTED, SYSTEMD_READY, …).
The same problem arises when considering event subscribers using udev monitoring which have no standardized way to know whether a device is ready for use or not.
Amount of work in udevd context¶
Another problem that arises is related with the amount of work that needs to be done to process the uevent while processing udev rules.
As per udevd design, this extra work and processing needs to be minimized as much as possible and it should be restricted to acquiring the information needed to create all the required symlinks in /dev. That means all the rules and processing not related to collecting basic device identification and information should be moved out of udevd context and executed later or, if possible, in parallel to udevd.
Timeouts¶
Udevd sets up a timeout for processing each uevent. On a heavily loaded system, this can pose a problem because the default timeout may not be enough. The timeouts cannot be set at runtime: support for the OPTIONS="event_timeout" rule has been removed from udevd.
If the timeout occurs, the udevd worker, together with any of its child processes, is killed by udevd using the SIGTERM signal. For this reason, commands which may take longer to execute must be executed in the background. On systems with systemd, the command even needs to be instantiated as a service, completely out of udevd’s context and its control group. There is no special handling for these timeouts: if a timeout occurs and the udevd worker is killed, any udev uevent listener receives the uevent without any additional variables set, because udevd just relays the kernel uevent it received as a udev uevent to all its listeners.
If the timeout happens, we would need to let the listeners know or provide a possibility to define fallback actions to keep the system running and letting the user fix the configuration or increase timeouts if needed.
Synthetic uevents¶
Another problematic area is the source of uevents. Besides genuine uevents coming directly from the kernel, there are also synthetic uevents, as mentioned before. From the user’s perspective, there are three usual ways to trigger a synthetic uevent:
- by directly writing the event name to the /sys/.../uevent file,
- by calling the udevadm trigger command (which in turn writes to the /sys/.../uevent file),
- by using the OPTIONS="watch" udev rule for a device (whenever the device is opened for writing and then closed, udevd receives an inotify event and in turn writes to the /sys/.../uevent file).
If kernel driver does not provide any additional variables for the uevent it generates, the genuine uevent is indistinguishable from the synthetic one – this may make it harder to recognize which event is the one that makes the device ready for use. For a long time, udev’s position was that these two uevents should remain indistinguishable and uevent listeners and authors of udev rules should account for this fact.
However, our argument is that there is indeed a difference between these two types of uevents. The genuine kernel uevents notify userspace about a state change of the device itself (e.g. device addition, a change in the device’s configuration that the kernel itself is aware of, or device removal). The synthetic uevents, which originate in userspace actions, are used to refresh udev’s state in userspace (e.g. to repopulate the udev database if it was cleared or started afresh, or to notify uevent subscribers to reread information based on the uevent if needed).
Alternatively, synthetic uevents may be used to notify about changes in device’s content in general – the device’s content is something that is usually not tracked by kernel device drivers (e.g. subsystem or filesystem signatures are added to device or they are cleared).
At the moment, the two types of uevents are considered equal. This is a source of confusion when handling the uevents and it may also cause useless resource consumption due to excessive processing. Trying to solve this issue, at least partially, within udev context, requires writing even more complex udev rules to try to make a difference between these two uevent types.
Also, the synthetic event is completely asynchronous and we cannot synchronize with that at all at the moment. This creates considerable burden for any tools trying to access the device exclusively or even remove the device because synthetic uevents can happen in parallel.
Marking devices as private or public¶
Another problematic area is within identification of devices which are private for the subsystem. Such devices only act as building blocks to create a higher level device that is supposed to be the one used.
Also, we may need to initialize the device first before marking it as ready for use. For example, we need to erase any old signatures which may be left on the device from previous use. Again, there is no standard defining how such devices are marked (e.g. DM devices use flags in the DM_COOKIE uevent variable to handle this, while MD uses a temporary file /run/mdadm/creating-<md_device_name> in the filesystem to mark a device as not fully initialized yet).
Device initialization¶
We should be able to activate a device in private mode first (without doing scans), providing time for userspace tools to do any initialization steps and cleaning necessary to properly make the device ready for use.
After these initialization steps, userspace tool should be able to switch the device into ready state by issuing synthetic uevent that is properly recognized for this type of switch from private to public mode.
Eventually, the solution for this initialization and wiping part during device activation may be centralized and handled by a single external entity without a need for each subsystem to provide its own code to implement this. Such solution would be preferred, but it requires the central entity to have enough knowledge so that the initialization and wiping operation is safe to do at a specific time. The external entity needs to recognize this initialization state properly and that is already the problem we have identified before – with current scheme, we are not completely sure about states.