Introduction
An application's lifetime includes a number of events of interest.
These events happen because of the application's interaction with the
system in some well-defined frameworks. The Asynchronous I/O (AIO),
timer, and poll frameworks are all good examples. As shown in this
article, each one of these frameworks provides a solution to the
problem on which it is focused, but does not extend any further. Due to
the lack of crossover between these frameworks, application developers
do not have a general way to gather multiple events of differing types
using one framework.
This is the problem that the event completion framework shipped with
the Solaris 10 Operating System (OS) is designed to solve. This
framework provides a group of clients waiting on multiple objects (that
is, AIO transactions, timers, files, and user-defined events) with a
method to receive transaction completion events from different parts of
the system in a scalable and performant manner. Additionally, the
introduction of this framework enables developers with applications
that leverage an event completion API to migrate from other operating
systems to the Solaris 10 OS.
Within the Solaris 10 OS, the event completion framework focuses on
providing a scalable, performant, and extendable framework that can
incorporate new object and event types as they appear within the
system.
Motivation
Prior to the Solaris 10 OS, there wasn't a unified way for an
application to reap completion events. Within the Asynchronous I/O
framework, the status of an I/O transaction has to be collected, or
reaped, using the aio_error()
function. If the application needs to set up a timer to fire at some
point in the future, the application depends on the signal framework to
receive notification of the timer expiration. In addition, applications
commonly need to execute some form of I/O in order to read or write to
a group of files or the network. Due to system complexity, the resource
requested by the application might be busy, and thus the application
would have to wait. Traditionally, the poll(2) system call or the
poll(7D) driver (/dev/poll) was used by application programmers to
query, or poll, the system to see if the application could write to or
read from the pertinent resource (such as a pipe, socket, and so on).
Because all of these frameworks were built independently, no unified
methodology existed by which an application could gather events. For
example, the poll functions are not general enough to return AIO read
and write completion events or timer expiration events. In addition,
none of these frameworks allowed for a threaded application to send
user-defined events and payloads to a subset of the total amount of
threads within the application. These points -- as well as the widely
varying performance and scalability of the available frameworks to
deliver event completion -- spurred the developers at Sun Microsystems,
Inc. to develop a unified framework by which an application could reap
an event of interest using one API.
A classic example of historic work within this area surrounds the poll(2) and poll(7D) interfaces. These interfaces work by taking an array of pollfd
structures, which include file descriptors (fds) and a set of flags to
indicate what events the application is waiting for on the list of fds.
The poll(2) interface is a reasonable solution to the
problem but only if the set of fds is small, does not change
frequently, and the number of "active" fds is small in comparison to
the total number of fds. In addition, the poll(2) function blocks every time an fd is added to the list of fds to be monitored.
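To make the pollfd mechanics concrete, here is a minimal sketch of the poll(2) pattern described above. The function name read_first_ready and the MAX_FDS cap are illustrative assumptions, not part of any system API. Note how the whole array is rebuilt for each call and then scanned afterward, which is the cost the text refers to when the list of fds is large.

```c
#include <poll.h>
#include <unistd.h>

#define MAX_FDS 16   /* arbitrary cap chosen for this sketch */

/*
 * Wait up to timeout_ms for one of the nfds descriptors to become readable,
 * then read a single byte from it into *out.
 * Returns the index of the descriptor that was read, or -1 on timeout/error.
 */
int read_first_ready(const int *fds, int nfds, int timeout_ms, char *out)
{
    struct pollfd pfds[MAX_FDS];
    int i;

    if (nfds < 1 || nfds > MAX_FDS)
        return -1;

    /* The full pollfd array is rebuilt and handed to the kernel per call. */
    for (i = 0; i < nfds; i++) {
        pfds[i].fd = fds[i];
        pfds[i].events = POLLIN;      /* event of interest: readable data */
        pfds[i].revents = 0;
    }

    if (poll(pfds, (nfds_t)nfds, timeout_ms) <= 0)
        return -1;                    /* 0 means timeout, -1 means error */

    /* poll() returns only a count, so scan the array for the active fd. */
    for (i = 0; i < nfds; i++) {
        if ((pfds[i].revents & POLLIN) && read(pfds[i].fd, out, 1) == 1)
            return i;
    }
    return -1;
}
```

A caller would pass the read ends of its pipes or sockets and act on the returned index.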
To address these issues, the poll(7D) interface was
created. poll(7D) is a more performant option than
poll(2), and it should be leveraged in cases where there are
a large number of fds to monitor. That said, the poll(7D)
interface still has performance issues due to the limitations of the
infrastructure shared by the poll interfaces. Specifically, because of
the implementation, the response time is dependent on the number of fds
within the list to be monitored. This illustrates the history and
complexity of the problems that application developers needed to be
aware of when implementing event-aware applications.
In the age of fast, cheap, multiprocessor systems, scalability has
become a focus for many application developers. Due to the amount of
time that has passed since the design and implementation of the
frameworks mentioned above, most were based upon the idea that the
process was the fundamental unit of execution, as opposed to the
thread. For example, the AIO framework was designed to support AIO
transactions in a per-process manner, and thus it does not scale well
for highly multithreaded applications. With this in mind, event
completion ports were designed to be used by a single thread or a
subset of the threads in the application.
The architects of the event completion framework decided to build a
new event framework within the Solaris OS kernel in order to avoid the
gaps within the historic interfaces. In creating a new framework, they
focused on solving the issues listed previously in this section.
Solaris Event Completion API
To give the developer a general idea of how to use the event completion
framework, I would like to start out with a simple code example. The
fundamental piece of the event completion framework is the port.
Applications use ports to register and reap events on the objects of
interest. Code Sample 1 gives a basic example of how to use the general
event completion framework.
Code Sample 1: Example Event Completion Code
/* Create port to use for event completion */
int portfd = port_create();
...
/* Register, or associate, the objects and events you are
interested in */
port_associate(portfd, ... );
...
/* Block until a single event appears on the port */
port_get(portfd, ... );
Note that using this framework is as simple as creating a port,
registering the objects and events of interest, and then using a single
interface to reap a single event or multiple events from the previously
created port. As will be seen later, the port_associate() call can be
replaced with other initializing functions (such as timer_create(3RT),
aio_read(3RT), and so on) in order to use event completion ports with
other frameworks.
The Solaris 10 OS event completion API includes the functions listed in Code Sample 2.
Code Sample 2: Event Completion Function Specifics
int port_create(void);
int port_associate(int port, int source,
                   uintptr_t object, int events,
                   void *user);
int port_dissociate(int port, int source,
                    uintptr_t object);
int port_send(int port, int events, void *user);
int port_sendn(int ports[], int errors[],
               uint_t nent, int events,
               void *user);
int port_get(int port, port_event_t *pe,
             const timespec_t *timeout);
int port_getn(int port, port_event_t list[],
              uint_t max, uint_t *nget,
              const timespec_t *timeout);
int port_alert(int port, int flags, int events,
               void *user);
The port_create(3C) function creates a port by which
completion events can be delivered to a thread. This function returns a
non-negative integer representing the port's identifier.
The port_associate(3C) function associates an object (such as a
file, socket, timer, and so on) with a previously created port. The
first parameter is the port identifier, which was the return value of
the port_create() function. The second parameter identifies the source
of the events (see Code Sample 4), and the third parameter identifies
the object that will be monitored by the port. Depending on the type of
I/O the application is binding to the port, the object may be an aiocb
structure (found in aio.h), a timer_t timer (found in time.h),
an unsigned integer pointer to a user-defined variable/structure, or a
file descriptor. The fourth parameter selects the events of interest on
that object, and the fifth is a user-defined pointer that is handed
back when the event is reaped. Please note that an object is
automatically disassociated from the port once the object's event has
been reaped from the port. This is required because the poll interface
doesn't maintain any state. Thus, once an object's event has been
reaped, port_associate(3C) must be used to reassociate the object with
the port if there is still an interest in any events pertaining to that
object.
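The associate, reap, reassociate cycle described above can be sketched as follows. This is only an illustration under stated assumptions: the function name reap_pollin_events is hypothetical, and the #else branch is a stub added so the sketch also compiles on systems without <port.h>, where it simply reports ENOSYS. Ports are closed with close(2), like other file descriptors.

```c
#include <errno.h>
#include <poll.h>
#include <stdint.h>
#include <unistd.h>

#if defined(__sun)
#include <port.h>
#endif

/*
 * Reap `rounds` POLLIN events for fd, renewing the association each time.
 * Returns the number of events reaped, or -1 on failure.
 */
int reap_pollin_events(int fd, int rounds)
{
#if defined(__sun)
    port_event_t pe;
    int reaped = 0;
    int port = port_create();

    if (port < 0)
        return -1;

    while (reaped < rounds) {
        /* Each reaped event consumes the association, so renew it here. */
        if (port_associate(port, PORT_SOURCE_FD, (uintptr_t)fd,
            POLLIN, NULL) != 0)
            break;
        if (port_get(port, &pe, NULL) != 0)    /* NULL timeout: block */
            break;
        if (pe.portev_source != PORT_SOURCE_FD ||
            !(pe.portev_events & POLLIN))
            break;                             /* unexpected event: stop */
        reaped++;
    }
    (void) close(port);
    return (reaped == rounds) ? reaped : -1;
#else
    /* Stub for platforms without event ports (assumption of this sketch). */
    (void)fd;
    (void)rounds;
    errno = ENOSYS;
    return -1;
#endif
}
```

Because the data is never drained in this sketch, every renewed association reports POLLIN again; a real application would read from the fd before reassociating.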
The port_dissociate(3C) function takes the object
referenced by the third parameter out of the list of objects monitored
by the port specified by the first parameter. The second parameter
indicates the source of the events, which was indicated at the time of
port association.
The port_send(3C) and port_sendn(3C) functions put a user-defined event
onto the port indicated by the first parameter, port. The difference
between the two functions is that the port_sendn() function can send an
event to more than one port. The events value indicates what
user-defined event is being put on the port. The receiver can use this
value to decide how to process the event when the application has a
number of possible user-defined event types that could arrive on the
port. In addition, the pointer user represents the user-defined
payload that is delivered to the port for the receiver to consume. As
shown later, in the Examples section, this can be as complex a
structure as the application developer chooses.
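As a small illustration of the events value and the user payload, the following sketch posts one user-defined event on a fresh port and immediately reaps it. The names roundtrip_user_event and struct job are hypothetical, and the #else branch is a stub (returning ENOSYS) added only so the sketch compiles where <port.h> is unavailable.

```c
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

#if defined(__sun)
#include <port.h>
#endif

/* Hypothetical payload type; any structure can travel as the user pointer. */
struct job {
    int id;
};

/*
 * Send `event` (a non-negative value) with `payload` to a new port and reap
 * it again. On success, stores the received payload pointer in *out and
 * returns the received event value; returns -1 on failure.
 */
int roundtrip_user_event(int event, struct job *payload, struct job **out)
{
#if defined(__sun)
    port_event_t pe;
    int port = port_create();

    if (port < 0)
        return -1;

    if (port_send(port, event, payload) != 0 ||
        port_get(port, &pe, NULL) != 0 ||
        pe.portev_source != PORT_SOURCE_USER) {
        (void) close(port);
        return -1;
    }
    (void) close(port);

    *out = pe.portev_user;       /* the pointer passed to port_send() */
    return pe.portev_events;     /* the event value passed to port_send() */
#else
    /* Stub for platforms without event ports (assumption of this sketch). */
    (void)event;
    (void)payload;
    (void)out;
    errno = ENOSYS;
    return -1;
#endif
}
```

In a threaded application the sender and the receiver would normally be different threads sharing the same port identifier.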
The port_get(3C) and port_getn(3C) functions reap completed events from
the port indicated by the first parameter, port. The difference between
the two functions is that the port_getn() function can reap more than
one event from the port. Again, once an object's event has been reaped,
that object is disassociated from the port. When the port_getn() call
returns, the number of reaped events is reflected by the value of the
fourth parameter, nget. The timespec_t timeout parameter communicates
how long the functions should block waiting for an event to arrive on
the port. If there is an error (for example, if a timeout occurs), the
functions will return a value of -1.
When the port_get(3C) or port_getn(3C) functions return with reaped events, the second parameter is one or more (depending on the function used) port_event_t
structures filled with information to identify the event that took
place. The structure listed in Code Sample 3 includes the source of the
event, which can be found by using the values in Code Sample 4.
Code Sample 3: Event Completion Structure Listed in /usr/include/sys/port.h
typedef struct port_event {
    int       portev_events;  /* event data is source specific */
    ushort_t  portev_source;  /* event source */
    ushort_t  portev_pad;     /* port internal use */
    uintptr_t portev_object;  /* source specific object */
    void      *portev_user;   /* user cookie */
} port_event_t;
Code Sample 4: Event Sources Listed in /usr/include/sys/port.h
#define PORT_SOURCE_AIO 1
#define PORT_SOURCE_TIMER 2
#define PORT_SOURCE_USER 3
#define PORT_SOURCE_FD 4
#define PORT_SOURCE_ALERT 5
Depending on the source of the event, the portev_object member of the
structure refers to a different kind of object; the mapping is listed
on the port_create(3C) man page. For example, when the source of the
event is an AIO transaction, the portev_object is a pointer to the
aiocb structure. As can be seen in the Examples section, the portev_user
pointer can be used to consume the user-defined payload, which was
indicated at the time the object was associated with the port.
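A reaped event is typically dispatched on its portev_source member. The sketch below shows one way to do that; the function name event_source_name is hypothetical. So that it compiles on systems without <port.h>, the #else branch copies the structure and constants from Code Samples 3 and 4 (with ushort_t written as unsigned short); on the Solaris OS the real header is used instead.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__sun)
#include <port.h>              /* real definitions on the Solaris OS */
#else
/* Stand-ins copied from Code Samples 3 and 4 for other platforms. */
typedef struct port_event {
    int            portev_events;
    unsigned short portev_source;
    unsigned short portev_pad;
    uintptr_t      portev_object;
    void           *portev_user;
} port_event_t;
#define PORT_SOURCE_AIO   1
#define PORT_SOURCE_TIMER 2
#define PORT_SOURCE_USER  3
#define PORT_SOURCE_FD    4
#define PORT_SOURCE_ALERT 5
#endif

/* Map the source of a reaped event to a short label for logging. */
const char *event_source_name(const port_event_t *pe)
{
    switch (pe->portev_source) {
    case PORT_SOURCE_AIO:
        return "aio";      /* portev_object refers to the aiocb */
    case PORT_SOURCE_TIMER:
        return "timer";
    case PORT_SOURCE_USER:
        return "user";     /* posted with port_send() or port_sendn() */
    case PORT_SOURCE_FD:
        return "fd";       /* portev_object holds the file descriptor */
    case PORT_SOURCE_ALERT:
        return "alert";    /* the port was put into alert mode */
    default:
        return "?";
    }
}
```

A real handler would branch to source-specific code at each case rather than return a label.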
The port_alert(3C) function puts the port indicated by the first parameter, port, into alert mode by setting the third parameter, events, to a non-zero value. Once a port is put into alert mode, all of the threads waiting in the port_get() or port_getn() functions will awake with a PORT_SOURCE_ALERT event on the port. By setting the events parameter to 0, the port will be returned to a non-alert state.
When initiating an AIO transaction or arming a timer, the port_notify structure needs to be associated with the call (see Code Sample 5).
Code Sample 5: Event Notification Structure Listed in /usr/include/sys/port.h
typedef struct port_notify {
    int  portnfy_port;   /* bind request(s) to port */
    void *portnfy_user;  /* user defined */
} port_notify_t;
In the case of AIO and timers, the port_notify_t structure is pointed to by the signal event structure's sigev_value.sival_ptr member (see the timer and AIO example listings in the Appendix).
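For a timer, the wiring between the sigevent structure and port_notify_t might look like the sketch below. It is not the Appendix listing: the function name wait_for_timer_on_port is hypothetical, the SIGEV_PORT notification type is not named in this article and is taken from the Solaris timer_create(3RT) documentation, and the #else branch is a stub (returning ENOSYS) for systems without event ports.

```c
#include <errno.h>
#include <signal.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

#if defined(__sun)
#include <port.h>
#endif

/*
 * Arm a one-shot timer that reports its expiration on a port, then block
 * until that event is reaped. Returns 0 if the timer event arrived carrying
 * `cookie`, or -1 on failure.
 */
int wait_for_timer_on_port(long delay_ns, void *cookie)
{
#if defined(__sun)
    port_notify_t pn;
    struct sigevent sev;
    struct itimerspec its;
    port_event_t pe;
    timer_t tid;
    int rc = -1;
    int port;

    if (delay_ns <= 0) {           /* a zero it_value would disarm the timer */
        errno = EINVAL;
        return -1;
    }
    if ((port = port_create()) < 0)
        return -1;

    pn.portnfy_port = port;        /* deliver the expiration to this port */
    pn.portnfy_user = cookie;      /* returned later as portev_user */

    memset(&sev, 0, sizeof (sev));
    sev.sigev_notify = SIGEV_PORT; /* assumption: per timer_create(3RT) */
    sev.sigev_value.sival_ptr = &pn;

    if (timer_create(CLOCK_REALTIME, &sev, &tid) == 0) {
        memset(&its, 0, sizeof (its));
        its.it_value.tv_sec = delay_ns / 1000000000L;
        its.it_value.tv_nsec = delay_ns % 1000000000L;
        if (timer_settime(tid, 0, &its, NULL) == 0 &&
            port_get(port, &pe, NULL) == 0 &&
            pe.portev_source == PORT_SOURCE_TIMER &&
            pe.portev_user == cookie)
            rc = 0;
        (void) timer_delete(tid);
    }
    (void) close(port);
    return rc;
#else
    /* Stub for platforms without event ports (assumption of this sketch). */
    (void)delay_ns;
    (void)cookie;
    errno = ENOSYS;
    return -1;
#endif
}
```

The same port_notify_t arrangement applies to AIO, where the sigevent structure is the aio_sigevent member of the aiocb.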
Examples
To introduce the use of the event completion API, the following
subsections include educational examples. Each one of the subsections
takes one of the historic frameworks we have spoken about previously,
giving the reader a bit of background information concerning the
historic API and a pointer to a sample program that leverages the
Solaris 10 OS event completion framework. The expectation is that the
examples referenced here can help developers understand how the event
completion API can be used in each scenario.
Asynchronous I/O
Asynchronous I/O is a framework by which an application can
submit an I/O request that the system will handle without interacting
with the application until the I/O request is complete. Generally, AIO
is a framework that developers use to build applications that need to
continue execution without waiting until an I/O request is complete.
This need usually arises because an application has severe timing
constraints.
The AIO framework within the Solaris OS has been built upon the
aio_read(3RT) and aio_write(3RT) functions to submit the AIO requests.
In older versions of the operating system, an application could reap a
completed AIO transaction by using the aiowait(3AIO), aio_waitn(3RT),
or aio_suspend(3RT) functions. Using these functions works well for
processes with a few threads but not for highly multithreaded
applications.
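For comparison with the port-based approach, here is a minimal sketch of the historic style using only the POSIX calls aio_write(), aio_suspend(), aio_error(), and aio_return(). The function name aio_write_file is hypothetical, and the sketch waits on a single request, which sidesteps the per-process scaling issue rather than demonstrating it.

```c
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Asynchronously write len bytes from buf to the start of the file at path,
 * then wait for completion without using a port.
 * Returns the number of bytes written, or -1 on failure.
 */
ssize_t aio_write_file(const char *path, const void *buf, size_t len)
{
    struct aiocb cb;
    const struct aiocb *list[1];
    ssize_t done;
    int err;
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0600);

    if (fd < 0)
        return -1;

    memset(&cb, 0, sizeof (cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;
    cb.aio_nbytes = len;
    cb.aio_offset = 0;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no signal; we wait below */

    if (aio_write(&cb) != 0) {
        (void) close(fd);
        return -1;
    }

    /* Block until the request leaves the EINPROGRESS state. */
    list[0] = &cb;
    while ((err = aio_error(&cb)) == EINPROGRESS)
        (void) aio_suspend(list, 1, NULL);

    /* aio_return() yields the byte count and may be called only once. */
    done = aio_return(&cb);
    (void) close(fd);
    if (err != 0) {
        errno = err;
        return -1;
    }
    return done;
}
```

With ports, the wait loop is replaced by a port_get() or port_getn() call that any thread holding the port can make.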
To provide an alternative, the new event completion framework within
the Solaris 10 OS delivers the AIO event completion to a port. An
application can reap the AIO event completion information using the port_get(3C) or port_getn(3C)
functions. With the ability to create a port that is bound to a single
thread or a group of threads, the developer of a highly multithreaded
application can scale the AIO requests using the thread (as opposed to
the process) as a basis.
In the Appendix, Listing 1 provides a simple program that initiates
an AIO write and then reaps the status using the event completion
framework.
Please note that the historic functions that were used within the AIO
framework are still present within the Solaris 10 OS and function as
expected.
Poll
Prior to the Solaris 10 OS, the best method to check if an fd was ready for reading or writing was to use the poll(2) or poll(7D) interfaces. poll(2)
traditionally works well when the list of file descriptors is small and
all the file descriptors in the list return with events. As was noted
earlier, poll(7D) works well when the number of file descriptors does not change.
In the Solaris 10 OS, the event completion framework provides a way to
reap the status of the fds within an application. As is mentioned in
the poll(7D) man page, the event completion framework should be used in
any situation where a developer would historically have used the
poll(7D) interface. When using the event completion framework to reap
fd status, the port remembers the registered file descriptors (unlike
in the poll implementations). In addition, only new fds or fds that
have an event pending need to be reactivated.
In the Appendix, Listing 2 shows a sample program that illustrates how
to use the POLLIN event as the fourth parameter in the
port_associate(3C) call. This example shows how an application that
historically would have been implemented using only the poll()
interfaces can be written with ports.
Timers
Timers, created using the timer_create(3RT)
function, are used within applications to set up a timer that fires a
signal when the timer expires. The signal delivered to the application
is specified within the second parameter of the timer_create(3RT)
function. Using the port_notify_t structure, we can have the expiration
notification directed to a port of our choosing instead.
In the Appendix, Listing 3 provides a working sample of arming a timer
and catching the expiration of that timer using the event completion
framework.
User-Defined Events
In the Appendix, Listings 4 and 5 provide sample programs that
illustrate how to send and receive user-defined events and payloads
using a single thread and between processes. Please note that in
Listing 5 the port identifier was passed through a pipe from one
process to another in order for the processes to have access to the
port's events.
Also, note that the code in Listing 5 contains two source files (denoted by 5a and 5b), port_sendfd_example and port_rcvfd_example. In order to run this example, please execute the port_sendfd_example binary first and then execute the port_rcvfd_example binary.
Related Work
Several other operating systems have implemented an event
completion framework, to some extent. Within the following section I
will step through several popular operating systems and describe the
functionality they provide in comparison to the Solaris 10 OS event
completion framework.
Windows
On Windows, event completion functionality is provided by the I/O
Completion API in the Windows NT OS and the WaitForMultipleObjects API
in the Windows Win32 OS.
The Microsoft Developer Network (MSDN) describes the I/O Completion
framework as a pool of threads created when an application is started
in order to process asynchronous I/O requests [1].
The threads within this pool are solely used to asynchronously complete
I/O requests issued by the application. This framework consists of the
CreateIoCompletionPort, GetQueuedCompletionStatus, and
PostQueuedCompletionStatus functions.
The CreateIoCompletionPort() call sets up a port with
one or more file handles associated with it. When the I/O operations
(like read, write, and so on) complete on these file handles, those
events are posted to the port. In order to collect information about
those events, the application has to call GetQueuedCompletionStatus(),
which returns a key within the argument list to indicate the file that
completed some I/O transaction. As with the port_get()
function within the Solaris 10 OS, the argument list contains a timeout
interval that indicates the maximum amount of time the call will wait
for a completion event. And finally,
the PostQueuedCompletionStatus() call can be used to post
a completed I/O event into the port in lieu of the system. This last
function is very similar in nature to the port_send() functionality in
the Solaris 10 OS.
The WaitForMultipleObjects API provides a framework
that takes an array of objects and waits for one or all of them to
complete. This API can process objects of the following type: console
input, user event, memory resource notification, mutex, process,
semaphore, thread, and waitable timers. When the WaitForMultipleObjects() call is made and the completion of an event has not taken place, the calling thread enters the wait state.
The array of handles passed into the WaitForMultipleObjects
framework can consist of a heterogeneous set of these objects. However,
the array cannot contain multiple copies of the same handle. In
addition, if one of these handles is closed before the wait timeout
interval expires, the function's behavior is undefined.
The framework described here does not provide a simple, unified
interface to create and use completion ports for asynchronous I/O,
socket I/O, user events, and timers across the Windows OS variants.
FreeBSD, NetBSD, OpenBSD
FreeBSD, NetBSD, and OpenBSD provide the generic kqueue framework to take care of event completion [2]. The design of the kqueue
framework provides a method to determine if AIO transactions, signal
delivery, file transactions, process events (such as fork, exit, and so
on), and file system changes have completed. The design goals of the kqueue
project closely resemble those of the event completion framework within
the Solaris OS due to the interest in creating a scalable framework to
deliver events to threads. The architects of both frameworks decided
early in the design phase to build an extendable system that could
handle a growing number of objects (that is, files, pipes, sockets, and
so on) and events.
The kqueue API consists of the kqueue() and kevent() functions. The kqueue()
call creates a queue in which the application can register events of
interest, such as AIO reads and writes, and so on. Once the queue has
been created, the application has to register the events of interest
using the kevent() call. The same kevent() call also reaps the
completed events from the queue.
Linux Asynchronous I/O
The asynchronous I/O functionality has been integrated into Linux 2.6 [3]. For the last few years, prior to Linux 2.6, Ben LaHaise maintained an AIO patch for the 2.4 Linux kernel [4]. For our purposes, we will only examine the AIO functionality distributed within the standard Linux 2.6 kernel.
Within the Linux AIO framework, io_submit() and io_getevents()
are the functions an application developer can use to submit I/O
requests and reap the completion or status of these events,
respectively. This Linux AIO framework supports read() and write() operations on a raw disk and on files opened with O_DIRECT on the ext2, ext3, JFS, and XFS file systems. As of now, the Linux AIO framework does not support AIO fsync, AIO read()/write() on sockets and pipes, or files not opened with O_DIRECT.
AIO was not integrated within the standard Linux kernel until version
2.6. In addition, the AIO framework in Linux 2.6 has not been
implemented as a general framework from which an application developer
can use timers and user events.
Conclusion
In the past, developers had to rely on a group of frameworks to handle I/O events (such as AIO, poll(),
timers, and so on) within an application. None of these frameworks
allowed for an application thread to send an event with user-defined
payload to another set of threads within the same application.
With the advent of the event completion framework within the Solaris
10 OS, a general framework has been implemented so that application
developers can reap AIO, timer, poll(),
and user-defined events using the same methods. In addition to
extending functionality, the event completion framework has also
focused on providing a more scalable and performant solution for the
delivery of these events.
References
Acknowledgments
Thanks to Miguel Isenberg, Solaris 10 OS Event Completion Architect, for his invaluable documentation.
About the Author
Rob Benson is currently an engineer in the Market Development
Engineering organization of Sun Microsystems. His group is focused on
partner adoption of the Solaris OS, x86 Platform Edition.
Appendix: Code Example Listings
Listing 1: Example of using a port to reap the status of AIO
Listing 2: Threaded example of using ports to reap fd status using POLLIN events
Listing 3: Example of using a port to receive the firing of an expired timer
Listing 4: Example of using a port to send a user-defined payload
Listing 5a and Listing 5b: Examples of a port being shared between processes