About Network Kernel Extensions

Describes how to write a kernel extension for adding or modifying functionality of a networking stack (Mac OS X 10.3 and earlier).

NKE Implementation

Review of 4.4 BSD Network Architecture

Mac OS X is based on the 4.4BSD UNIX operating system. The following structures control the 4.4BSD network architecture:

socket structure, which the kernel uses to keep track of sockets. The socket structure is referenced by file descriptors from user mode.
domain structure, which describes protocol families.
protosw structure, which describes protocol handlers. (A protocol handler is the implementation of a particular protocol in a protocol family.)
ifnet structure, which describes a network device and contains pointers to interface device driver routines.

None of these structures is used uniformly throughout the 4.4BSD networking infrastructure. Instead, each structure is used at a specific level, as shown in Figure 1-1.

Figure 1-1 4.4BSD network architecture

The socket structure is used to manage the socket while the domain, protosw, and ifnet structures are used to manage packet delivery to and from the network device.

NKE Types

Making the 4.4BSD network architecture dynamically extensible requires several NKE types that are used at specific locations within the kernel.

socket NKEs, which reside between the network layer and protocol handlers and are invoked through a protosw structure. Socket NKEs use a new set of override dispatch vectors that intercept specific socket and socket buffer utility functions.
protocol family NKEs, which are collections of protocols that share a common addressing structure. Internally, a domain structure and a chain of protosw structures describe each protocol.
protocol handler NKEs, which process packets for a particular protocol within the context of a protocol family. A protosw structure describes a protocol handler and provides the mechanism by which the handler is invoked to process incoming and outgoing packets and for invoking various control functions.
data link NKEs, which are inserted below the protocol layer and above the network interface layer. This type of NKE can passively observe traffic as it flows in and out of the system (for example, a sniffer) or can modify the traffic (for example, by encrypting it or performing address translation). Data link NKEs can provide media support functions (performing demultiplexing, framing, and pre-output functions, such as ARP) and can act as "filters" that are inserted between a protocol stack and a device or above a device.

Figure 1-2 summarizes the NKE architecture.

Figure 1-2 NKE architecture

Global and Programmatic NKEs

Socket NKEs can operate in one of two modes: programmatic or global.

A global NKE is an NKE that is automatically enabled for sockets of the type specified for the NKE.

A programmatic NKE is a socket NKE that is enabled only under program control, using socket options, for a specific socket.

Data link "filters" are essentially global in that they can't be accessed by specific sockets.

Tracking NKE Usage

To support the dynamic addition and removal of NKEs in Mac OS X, the kernel keeps track of the use of NKEs by other parts of the system.

Use of protocol family NKEs is tracked by the dom_refs member of the domain structure, which has been added to support NKEs in Mac OS X. The kernel's socreate function increments dom_refs each time socreate is called to create a socket in an NKE domain. The socreate function is called when user-mode applications call socket or when sonewconn successfully connects to a local listening socket. The dom_refs member is decremented each time soclose is called to close a socket connection.

Use of protocol handler NKEs is tracked by the pr_refs member of the protosw structure, which has been added to support NKEs in Mac OS X. Like the dom_refs member of the domain structure, the pr_refs member of the protosw structure tracks the use of the protocol between calls to socreate and sonewconn to create a socket and soclose to close a socket.

The most important aspect of removing an NKE is ensuring that all references to NKE resources are eliminated and that all system resources allocated by the NKE are returned to the system. The NKE must track its use of resources, such as socket structures and protocol control blocks, so that the NKE's termination routine can eliminate references and return system resources.
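The reference counting described above can be sketched in user space. This is a minimal model, not kernel code: the ref_domain type and the ref_* functions are hypothetical stand-ins for the domain structure, socreate, soclose, and a removal check, and returning EBUSY while references remain is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for the kernel's domain structure. */
struct ref_domain {
    const char *dom_name;
    int         dom_refs;   /* sockets currently open in this domain */
};

/* Models socreate: each new socket takes a reference on its domain. */
static void ref_socreate(struct ref_domain *dp)
{
    assert(dp != NULL);
    dp->dom_refs++;
}

/* Models soclose: closing a socket drops the reference again. */
static void ref_soclose(struct ref_domain *dp)
{
    assert(dp != NULL);
    if (dp->dom_refs > 0)
        dp->dom_refs--;
}

/* Models a removal check: refuse to remove a domain that is still in use.
 * EBUSY is an assumed error code chosen for this sketch. */
static int ref_del_domain(const struct ref_domain *dp)
{
    return (dp->dom_refs > 0) ? EBUSY : 0;
}
```

In this model, removal succeeds only after every ref_socreate call has been balanced by a ref_soclose call, which mirrors the requirement that an NKE release all references before it terminates.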

Modifications to 4.4BSD Networking Architecture

To support NKEs in Mac OS X, the 4.4BSD domain and protosw structures were modified as follows:

The protosw array referenced by the domain structure is now a linked list, thereby removing the array's upper bound. The new dom_maxprotohdr member defines the maximum protocol header size for the domain. The new dom_refs member is a reference count that is incremented when a new socket for this address family is created and is decremented when a socket for this address family is closed.
The protosw structure is no longer an array. The pr_next member has been added to link the structures together. This change has implications for protox usage for AF_INET and AF_ISO input packet processing. The pr_flags member is an unsigned integer instead of a short. NKE hooks have been added to link NKE descriptors together (pr_sfilter).
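The move from an array to a linked list can be illustrated with a small user-space model. The list_domain and list_protosw types below are simplified, hypothetical versions of the real structures that keep only the members mentioned above (pr_next, pr_flags, dom_refs); the add and find helpers are illustrative and are not the kernel's own routines.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, hypothetical protosw: only the members discussed above. */
struct list_protosw {
    short                pr_type;      /* socket type */
    short                pr_protocol;  /* protocol number */
    unsigned int         pr_flags;     /* unsigned int rather than short */
    struct list_protosw *pr_next;      /* next protosw in the same domain */
};

/* Simplified, hypothetical domain holding the head of the protosw list. */
struct list_domain {
    int                  dom_family;
    struct list_protosw *dom_protosw;  /* list head; no fixed upper bound */
    int                  dom_refs;
};

/* Link a protocol at the head of the domain's list. */
static void list_add_proto(struct list_domain *dp, struct list_protosw *pp)
{
    assert(dp != NULL && pp != NULL);
    pp->pr_next = dp->dom_protosw;
    dp->dom_protosw = pp;
}

/* Walk the list to find a protocol by number and socket type. */
static struct list_protosw *list_find_proto(const struct list_domain *dp,
                                            int protocol, int type)
{
    struct list_protosw *pp;

    for (pp = dp->dom_protosw; pp != NULL; pp = pp->pr_next)
        if (pp->pr_protocol == protocol && pp->pr_type == type)
            return pp;
    return NULL;
}
```

Because lookups now walk pr_next rather than index into an array, code that relied on a fixed array position for a protocol has to search the list instead.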

PF_SYSTEM Domain

Mac OS X defines a new domain, the PF_SYSTEM domain, whose purpose is to provide a way for applications to configure and control NKEs. The PF_SYSTEM domain has two protocols, of which only one is of interest for communications with the NKE:

The SYSPROTO_CONTROL protocol is used for configuring and controlling all NKEs.

Internally, the PF_SYSTEM domain’s initialization function is called when the PF_SYSTEM domain is initially added to the system. The initialization function adds the SYSPROTO_CONTROL protocol to the domain’s protosw list and performs other initialization tasks.

In the NKE's start method, register a Kernel Controller structure using the ctl_register function. The ctl_register function is defined in <sys/kern_control.h>. The ctl_register call is prototyped as follows.

int ctl_register(struct kern_ctl_reg *userctl,
            void *userdata,
            kern_ctl_ref *ctlref);

The fields of the kern_ctl_reg structure are defined as follows.

ctl_id - unique 4 byte id for the controller. Enter a registered Creator ID. Go to the Apple Developer Creator ID web page to register a unique ID. See http://developer.apple.com/dev/cftype/ for more information.

ctl_unit - the unit number for the controller. A controller can be registered multiple times with the same ctl_id, but a different unit number must be used for each instance.

ctl_flags - set to CTL_FLAG_PRIVILEGED to require that the user have admin privileges to contact the controller.

ctl_sendsize - size of buffer reserved for sending messages. 0 = default value.

ctl_recvsize - size of buffer reserved for receiving messages. 0 = default value.

Dispatch Functions

ctl_connect - called when the client process calls connect on the socket with the id/unit number of the registered controller.

ctl_disconnect - called when the user client process closes the control socket.

ctl_write - called when the user client process writes data to the socket.

ctl_set - called when the user client process calls setsockopt to set the controller configuration.

ctl_get - called when the user client process calls getsockopt on the socket.

The following is a code example of this process.

Listing 1-1 Dispatch example

struct kern_ctl_reg     ep_ctl;
// Initialize controller
bzero(&ep_ctl, sizeof(ep_ctl));  // sets ctl_unit to 0
ep_ctl.ctl_id = kEPCommID; // should be unique -
                                   // use a registered Creator ID here
ep_ctl.ctl_flags = CTL_FLAG_PRIVILEGED;
ep_ctl.ctl_write = EPHandleWrite;
ep_ctl.ctl_get = EPHandleGet;
ep_ctl.ctl_set = EPHandleSet;
ep_ctl.ctl_connect = EPHandleConnect;
ep_ctl.ctl_disconnect = EPHandleDisconnect;
error = ctl_register(&ep_ctl, &gEPState, &gEPState.ctlHandle);
 
 
int EPHandleSet( kern_ctl_ref ctlref, void *userdata, int opt, void *data, size_t len )
{
    int    error = EINVAL;
#if DO_LOG
    log(LOG_ERR, "EPHandleSet opt is %d\n", opt);
#endif
 
    switch ( opt )
    {
        case kEPCommand1:               // program defined symbol
            error = Do_First_Thing();
            break;
 
        case kEPCommand2:               // program defined symbol
            error = Do_Command2();
            break;
    }
    return error;
}
 
int EPHandleGet( kern_ctl_ref ctlref, void *userdata, int opt, void *data, size_t *len )
{
    int    error = EINVAL;
#if DO_LOG
    log(LOG_ERR, "EPHandleGet opt is %d *****************\n", opt);
#endif
    return error;
}
 
int
EPHandleConnect(kern_ctl_ref ctlref, void *userdata)
{
#if DO_LOG
    log(LOG_ERR, "EPHandleConnect called\n");
#endif
    return (0);
}
 
void
EPHandleDisconnect(kern_ctl_ref ctlref, void *userdata)
{
#if DO_LOG
    log(LOG_ERR, "EPHandleDisconnect called\n");
#endif
    return;
}
 
int EPHandleWrite(kern_ctl_ref ctlref, void *userdata, struct mbuf *m)
{
#if DO_LOG
    log(LOG_ERR, "EPHandleWrite called\n");
#endif
    return (0);
}

Connection from the Client Process

After the NKE registers a Kernel Controller structure, the application-level process opens a PF_SYSTEM socket. The application-level process sets up the sockaddr_ctl structure with the parameters required to communicate with the NKE's Kernel Controller.

To communicate with the NKE, the client process opens a PF_SYSTEM socket using the socket call.

fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);

The client process uses the connect call with the file descriptor returned from the socket call to establish a connection with the NKE. In making the connect call, fill in the sockaddr_ctl structure as follows.

sc_len = sizeof(struct sockaddr_ctl);
sc_family = AF_SYSTEM;
ss_sysaddr = AF_SYS_CONTROL;
sc_id = set to the value of ctl_id registered by the NKE in the ctl_register call described above.
sc_unit = set to the unit number registered by the NKE in the ctl_register call described above.

The client process uses the setsockopt call to send commands to the NKE. Note that the option names are user defined. The NKE defines what option names it will respond to, and the client process must pass only supported option names to the NKE in the setsockopt call.

The client process uses the getsockopt call to get status information from the NKE. As with setsockopt, the option names are user defined. The NKE defines what option names it will respond to, and the client process must pass only supported option names to the NKE in the getsockopt call.

The following is a code example for opening a PF_SYSTEM socket to communicate with an NKE.

Listing 1-2 Opening a PF_SYSTEM socket

      struct sockaddr_ctl       addr;
      int                       fd;
      int                       result = -1;
 
      bzero(&addr, sizeof(addr)); // sets the sc_unit field to 0
      addr.sc_len = sizeof(addr);
      addr.sc_family = AF_SYSTEM;
      addr.ss_sysaddr = AF_SYS_CONTROL;
      addr.sc_id = kEPCommID;  // should be unique - use a registered Creator ID here
 
      fd = socket(PF_SYSTEM, SOCK_DGRAM, SYSPROTO_CONTROL);
      if (fd >= 0)   // socket returns -1 on failure; 0 is a valid descriptor
      {
        result = connect(fd, (struct sockaddr *)&addr, sizeof(addr));
        if (result)
           fprintf(stderr, "connect failed %d\n", result);
      }
      else
        fprintf(stderr, "failed to open socket\n");
 
      if (!result)   // only send commands once the connection succeeded
        {
        result = setsockopt( fd, SYSPROTO_CONTROL, kEPCommand1, NULL, 0);
        if (result)
          fprintf(stderr, "setsockopt failed on kEPCommand1 call - result was %d\n", result);
       etc.

Implementing a Preference File for NKE

The question arises as to how an NKE can open a preference file in its start method. Under the existing architecture, the NKE cannot reliably access a preference file. When the system starts the NKE, there are no APIs that the NKE can use to open a file and read preference information. While the NKE could access its info.plist, the system caches this information to expedite startup, so it assumes that the info.plist does not change between startups.

The proper way to dynamically configure an NKE is with a startup daemon or other application level process. The daemon finds the NKE using the communication method described above, and passes in configuration information that the NKE may require.
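One way the receiving side of such a configuration exchange might look is sketched below as a user-space model. Everything here is hypothetical: the kEPSetConfig option name, the ep_config layout, and the ep_apply_config helper are invented for illustration, and a real NKE would perform the same checks inside its ctl_set handler. The point of the sketch is that the handler validates the option name, the length, and the contents before accepting a configuration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define kEPSetConfig 3   /* hypothetical, program-defined option name */

/* Hypothetical configuration record passed in by the daemon. */
struct ep_config {
    unsigned int version;      /* layout version the daemon was built for */
    unsigned int log_level;
    unsigned int max_entries;
};

/* Defaults used until the daemon supplies a configuration. */
static struct ep_config g_config = { 1, 0, 16 };

/* Validate and apply a configuration; returns 0 or an errno value. */
static int ep_apply_config(int opt, const void *data, size_t len)
{
    struct ep_config incoming;

    if (opt != kEPSetConfig)
        return EINVAL;                 /* unsupported option name */
    if (data == NULL || len != sizeof(incoming))
        return EINVAL;                 /* wrong or missing payload */

    memcpy(&incoming, data, sizeof(incoming));
    if (incoming.version != 1 || incoming.max_entries == 0)
        return EINVAL;                 /* reject values we cannot honor */

    g_config = incoming;               /* commit only after all checks pass */
    assert(g_config.max_entries > 0);
    return 0;
}
```

A rejected request leaves the previous configuration in place, so a misbehaving or outdated daemon cannot leave the extension half configured.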

About Protocol Family NKEs

Adding and removing protocol family NKEs is accomplished by calling net_add_domain and net_del_domain, respectively. These calls are described in Protocol Family NKE Functions. For detailed information about implementing protocol families, see The Design and Implementation of the 4.4 BSD Operating System by M. K. McKusick et al. and TCP/IP Illustrated by W. Richard Stevens.

About Protocol Handler NKEs

Adding and removing protocol handler NKEs is accomplished by calling net_add_proto and net_del_proto, respectively. These calls are described in Protocol Handler NKE Functions. For detailed information about implementing protocol handlers, see The Design and Implementation of the 4.4 BSD Operating System by M. K. McKusick et al. and TCP/IP Illustrated by W. Richard Stevens.

About Socket NKEs

Socket NKEs are installed in the kernel by calling register_sockfilter, typically from the NKE's initialization routine. Each socket NKE provides a descriptor structure that is linked into a global list (nf_list). A second chain runs through the filter descriptor to link it to a protosw for global NKEs. Figure 1-3 shows the interconnections for these data structures.

Figure 1-3 Domain structure and protosw interconnections

When you call socreate to create a socket, any global NKEs associated with the corresponding protosw structure are attached to the socket structure using the so_ext field to link together kextcb structures that are allocated when the socket is created. (See Figure 1-3.) These kextcb structures are initialized to point to the extension descriptor and two dispatch vectors of intercept functions (one for socket operations and one for socket buffer utilities).

The filter descriptor for a programmatic NKE is linked into the nf_list in the same way as for global NKEs, but the filter descriptor does not appear in the list associated with a protosw. A program can call setsockopt (using socket option SO_NKE) to insert a programmatic NKE into its NKE chain in the same way that it would call setsockopt to insert a global NKE.

Each socket NKE has two dispatch vectors, a sockif structure and a sockutil structure, that contain pointers to the NKE's implementation of these functions. The functions are called when the corresponding socket and sockbuf functions are called. The dispatch vectors permit the NKE to selectively intercept socket and socket buffer utilities. Here is an example:

int (*sf_sobind)(struct socket *, struct mbuf *, struct kextcb *);

The kernel's sobind function calls the NKE's bind entry point with the arguments passed to sobind and the kextcb pointer for the NKE. The sockaddr structure contains the name of the local endpoint being bound.

Each of the intercept functions can return an integer value. A return value of zero is interpreted to mean that processing at the call site can continue. A non-zero return value is interpreted as an error (as defined in ) that causes the processing of the packet or operation to halt. If the return value is EJUSTRETURN, the calling function (for example, sobind) returns at that point with a value of zero. Otherwise, the function returns the non-zero error code. In this way, an NKE can "swallow" a packet or an operation. An NKE may reinject the packet at a later time. (Note that the injection mechanism is not yet defined.)
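The return-value rules above can be modeled in user space as follows. This is a sketch under stated assumptions: the chain_kextcb type, the chain_sobind function, and the sample intercepts are hypothetical, and MODEL_EJUSTRETURN is given the value -2 only for illustration. The sketch shows the three outcomes: zero lets processing continue, EJUSTRETURN ends the call with a result of zero, and any other value is returned as an error.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_EJUSTRETURN (-2)   /* assumed value, for this sketch only */

typedef int (*chain_intercept_fn)(void *arg);

/* Hypothetical per-socket control block linking one NKE into the chain. */
struct chain_kextcb {
    chain_intercept_fn   e_bind;   /* NULL means bind is not intercepted */
    struct chain_kextcb *e_next;
};

/* Models sobind: run each intercept, then the protocol's own bind. */
static int chain_sobind(struct chain_kextcb *chain, void *arg,
                        int (*proto_bind)(void *))
{
    struct chain_kextcb *kp;

    assert(proto_bind != NULL);
    for (kp = chain; kp != NULL; kp = kp->e_next) {
        int rv;

        if (kp->e_bind == NULL)
            continue;                  /* this NKE does not intercept bind */
        rv = kp->e_bind(arg);
        if (rv == MODEL_EJUSTRETURN)
            return 0;                  /* operation swallowed by the NKE */
        if (rv != 0)
            return rv;                 /* error halts processing */
    }
    return proto_bind(arg);            /* no NKE objected; continue normally */
}

/* Sample intercepts and a protocol bind that counts how often it runs. */
static int chain_pass(void *arg)    { (void)arg; return 0; }
static int chain_swallow(void *arg) { (void)arg; return MODEL_EJUSTRETURN; }
static int chain_reject(void *arg)  { (void)arg; return EACCES; }
static int chain_proto_bind(void *arg) { *(int *)arg += 1; return 0; }
```

In this model a swallowed operation never reaches the protocol's bind routine, yet the caller sees success, which is the behavior the text describes for EJUSTRETURN.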

A program can insert a socket NKE on an open socket by calling setsockopt as follows:

setsockopt(s, SOL_SOCKET, SO_NKE, &so_nke, sizeof (struct so_nke));

The so_nke structure is defined as follows:

struct so_nke {
    unsigned int nke_handle;
    unsigned int nke_where;
    int nke_flags;
};

The nke_handle specifies the NKE to be linked to the socket (with the so_ext link). It is the programmer's task to locate the appropriate NKE, assure that it is loaded, and retain the returned handle for use in the setsockopt call.

The nke_where value specifies an NKE assumed to be in this linked list. If nke_where is NULL, the NKE represented by nke_handle is linked at the beginning or end of the list, depending on the value of nke_flags.

The nke_flags value specifies where, relative to nke_where, the NKE represented by nke_handle will be placed. Possible values are NFF_BEFORE and NFF_AFTER defined in .
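The placement rules for nke_where and nke_flags can be sketched with an ordinary singly linked list. This is a user-space model with hypothetical names (place_ext, place_insert); the flag values 0x01 and 0x02 are assumptions, as is the choice that a zero nke_where places the NKE at the beginning for the "before" flag and at the end for the "after" flag.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

#define MODEL_NFF_BEFORE 0x01   /* assumed value */
#define MODEL_NFF_AFTER  0x02   /* assumed value */

/* Hypothetical chain entry identified by its NKE handle. */
struct place_ext {
    unsigned int      handle;
    struct place_ext *next;
};

/* Insert ext relative to the entry whose handle equals where.
 * where == 0 means "no reference entry": head for BEFORE, tail for AFTER. */
static int place_insert(struct place_ext **head, struct place_ext *ext,
                        unsigned int where, int flags)
{
    struct place_ext **link = head;

    assert(head != NULL && ext != NULL);
    if (flags != MODEL_NFF_BEFORE && flags != MODEL_NFF_AFTER)
        return EINVAL;

    if (where == 0) {
        if (flags == MODEL_NFF_AFTER)
            while (*link != NULL)          /* advance to the end of the list */
                link = &(*link)->next;
    } else {
        while (*link != NULL && (*link)->handle != where)
            link = &(*link)->next;
        if (*link == NULL)
            return ENOENT;                 /* reference NKE is not in the list */
        if (flags == MODEL_NFF_AFTER)
            link = &(*link)->next;         /* step past the reference entry */
    }

    ext->next = *link;                     /* splice ext in at the chosen spot */
    *link = ext;
    return 0;
}
```

Using a pointer to the link rather than to the node lets the same splice code handle insertion at the head, in the middle, and at the tail.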

The nke_handle and nke_where values are assigned by Apple Computer from the same name space as the type and creator codes used in Mac OS 8 and Mac OS 9 and using the same registration mechanism.

About Data Link NKEs

This section describes the programming interface for creating data link NKEs, which are inserted below the protocol layer and above the network interface layer. Data link NKEs depend on the Data Link Interface Layer (DLIL), shown in Figure 1-4, which provides a fixed point for the insertion of data link NKEs.

Figure 1-4 Data Link Interface Layer

DLIL Static Functions

The DLIL defines the following static functions, which are called by protocols and drivers:

dlil_attach_protocol, which attaches network protocol stacks to specific interfaces
dlil_detach_protocol, which detaches network protocol stacks from the interfaces to which they were previously attached
dlil_if_attach, which registers network interfaces with the DLIL
dlil_if_detach, which deregisters network interfaces that have been registered with the DLIL
dlil_ioctl, which sends ioctl commands to a network driver
dlil_input, which sends data to the DLIL from a network driver
dlil_output, which sends data to a network driver
dlil_event, which processes events from other parts of the network and from IOKit components. (Note that the event mechanisms are still under development.)

In Figure 1-5, the DLIL static functions are shown in relation to the DLIL, the protocol layer, and the network driver layer.

Figure 1-5 DLIL static functions

Changes to the ifnet and if_proto Structures

To support data link NKEs, the traditional ifnet structure has been extended in Mac OS X: the driver or software that supports the driver must allocate a separate ifnet structure for each logical interface. When an interface is attached to the DLIL (by calling dlil_if_attach), the DLIL receives a pointer to that interface's ifnet structure.

Each interface can transmit and receive packets for multiple network protocol families, so for each attached protocol family the DLIL creates an if_proto structure chained off the ifnet structure for that interface.

The if_proto structure contains function pointers that the DLIL uses to pass incoming packets and event information to the protocol stack, as well as a pointer to the protocol dependent "pre-output" function that performs protocol-family specific operations such as network address translation on outbound packets.

Figure 1-6 shows the ifnet and if_proto structures in relation to a generic protocol and a generic interface.

Figure 1-6 Sample ifnet structure in relation to a protocol and a network driver

Installing and Removing Data Link NKEs

To support the dynamic insertion of filters into the data and control streams between the network layer and the interface layer and the removal of inserted filters, the DLIL defines the following static functions:

dlil_attach_protocol_filter, which inserts an NKE between the DLIL and one of the attached protocols. Such an extension is known as a DLIL protocol filter. This type of NKE provides access to all function calls between the DLIL and the attached protocol for a specific protocol/interface pair.
dlil_attach_interface_filter, which inserts an NKE between the DLIL and an attached interface. Such a filter is known as a DLIL interface filter. This type of NKE provides access to all frames flowing to or from an interface.
dlil_detach_filter, which removes previously inserted DLIL protocol and interface filters.

Figure 1-7 shows the relationship of protocol and interface filters to the protocol stack layer, DLIL, and network driver layer.

Figure 1-7 Protocol and interface extensions in relation to the DLIL

Sending Data

Figure 1-8 shows the sequence of calls required to send an IP packet over the MACE Ethernet interface (en0).

Figure 1-8 Example of sending an IP packet

The following steps correspond to the numbers in Figure 1-8 and describe the process of sending a packet:

1. The ip_output routine in the IP protocol stack calls dlil_output, passing the dl_tag value for the stack's attachment to en0.
2. Using the dl_tag value, the dlil_output function locates the dl_pre_output pointer in the if_proto structure for IP.
3. The dlil_output function uses the dl_pre_output pointer in the if_proto structure to call IP's interface-specific output module. This module calls its arpresolve routine to resolve the target IP address into a media access control (MAC) address.
4. When IP's interface-specific output module returns, the dlil_output function uses the if_framer pointer in the ifnet structure to call the appropriate framing function in the DLIL interface module. The framing function prepends interface-specific frame data to the packet.
5. The dlil_output function calls the function pointed to by the if_output field in the ifnet structure for en0 and sends the frame to the MACE Ethernet driver.
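The send sequence above can be modeled in user space with function pointers. This is a simplified sketch, not the DLIL implementation: the tx_* types and functions are hypothetical, packets are flat buffers rather than mbuf chains, the destination address is a fixed placeholder instead of an ARP lookup, and the 14-byte header with type 0x0800 stands in for real Ethernet framing.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

#define MODEL_MAX_FRAME 64
#define MODEL_ETHER_HDR 14

struct tx_packet {
    unsigned char data[MODEL_MAX_FRAME];
    size_t        len;
    unsigned char dest_mac[6];            /* filled in by the pre-output step */
};

struct tx_ifnet {
    int (*if_framer)(struct tx_ifnet *, struct tx_packet *);
    int (*if_output)(struct tx_ifnet *, struct tx_packet *);
    unsigned long frames_sent;
};

struct tx_if_proto {
    unsigned long    dl_tag;              /* identifies protocol/interface pair */
    struct tx_ifnet *ifp;
    int (*dl_pre_output)(struct tx_ifnet *, struct tx_packet *);
};

/* Models dlil_output: locate if_proto by tag, then pre-output, frame, send. */
static int tx_dlil_output(struct tx_if_proto *protos, size_t nprotos,
                          unsigned long dl_tag, struct tx_packet *pkt)
{
    struct tx_if_proto *proto = NULL;
    size_t i;
    int rv;

    assert(pkt != NULL);
    for (i = 0; i < nprotos; i++)
        if (protos[i].dl_tag == dl_tag) { proto = &protos[i]; break; }
    if (proto == NULL)
        return ENOENT;

    if ((rv = proto->dl_pre_output(proto->ifp, pkt)) != 0) return rv;
    if ((rv = proto->ifp->if_framer(proto->ifp, pkt)) != 0) return rv;
    return proto->ifp->if_output(proto->ifp, pkt);
}

/* Placeholder for address resolution: always yields the same address. */
static int tx_sample_pre_output(struct tx_ifnet *ifp, struct tx_packet *pkt)
{
    static const unsigned char mac[6] = { 0x02, 0, 0, 0, 0, 0x01 };
    (void)ifp;
    memcpy(pkt->dest_mac, mac, sizeof(mac));
    return 0;
}

/* Prepends a 14-byte header: destination, zeroed source, type 0x0800. */
static int tx_sample_framer(struct tx_ifnet *ifp, struct tx_packet *pkt)
{
    (void)ifp;
    if (pkt->len + MODEL_ETHER_HDR > MODEL_MAX_FRAME)
        return EMSGSIZE;
    memmove(pkt->data + MODEL_ETHER_HDR, pkt->data, pkt->len);
    memcpy(pkt->data, pkt->dest_mac, 6);
    memset(pkt->data + 6, 0, 6);
    pkt->data[12] = 0x08;
    pkt->data[13] = 0x00;
    pkt->len += MODEL_ETHER_HDR;
    return 0;
}

/* Stands in for the driver: only counts the frames handed to it. */
static int tx_sample_output(struct tx_ifnet *ifp, struct tx_packet *pkt)
{
    (void)pkt;
    ifp->frames_sent++;
    return 0;
}
```

Keeping the pre-output pointer in the per-protocol structure and the framer and output pointers in the per-interface structure mirrors the split described in the steps: protocol-specific work is found through the tag, while framing and transmission belong to the interface.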

Receiving Data

Figure 1-9 shows the sequence of calls required to receive an IP packet from the MACE Ethernet interface (en0).

Figure 1-9 Example of receiving a packet

The following steps correspond to the numbers in Figure 1-9 and describe the process of receiving a packet:

1. The MACE Ethernet driver or its support code calls dlil_input with pointers to its ifnet structure and mbuf chain.
2. The dlil_input function uses the if_demux entry in the ifnet structure to call the demultiplexing function for the interface family (Ethernet in this case).
3. The demultiplexing function identifies the frame and returns an if_proto pointer to dlil_input.
4. The dlil_input function calls the protocol input module through the dl_input pointer in the if_proto structure.

Note: The Ethernet-specific module for IP receives the frame, removes the 802.2 or SNAP header (if any) and delivers the packet to the protocol's ipintr routine.
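The receive sequence can be modeled the same way. In this user-space sketch the rx_* types and functions are hypothetical, a frame is reduced to a type field plus payload instead of an mbuf chain, and demultiplexing matches on that type field only; returning ENOENT for an unclaimed frame is an assumption made for illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical received frame: type field plus payload. */
struct rx_frame {
    unsigned short       ethertype;
    const unsigned char *payload;
    size_t               len;
};

/* Hypothetical per-protocol attachment with its input entry point. */
struct rx_if_proto {
    unsigned short ethertype;             /* frame type this protocol claims */
    int (*dl_input)(struct rx_if_proto *, const struct rx_frame *);
    unsigned long  delivered;
};

/* Hypothetical interface with its attached protocols and demux function. */
struct rx_ifnet {
    struct rx_if_proto *protos;
    size_t              nprotos;
    struct rx_if_proto *(*if_demux)(struct rx_ifnet *, const struct rx_frame *);
};

/* Models dlil_input: demultiplex, then hand the frame to the protocol. */
static int rx_dlil_input(struct rx_ifnet *ifp, const struct rx_frame *frame)
{
    struct rx_if_proto *proto;

    assert(ifp != NULL && frame != NULL);
    proto = ifp->if_demux(ifp, frame);
    if (proto == NULL)
        return ENOENT;                    /* no attached protocol claims it */
    return proto->dl_input(proto, frame);
}

/* Sample demultiplexer: pick the protocol whose type matches the frame. */
static struct rx_if_proto *rx_ether_demux(struct rx_ifnet *ifp,
                                          const struct rx_frame *frame)
{
    size_t i;

    for (i = 0; i < ifp->nprotos; i++)
        if (ifp->protos[i].ethertype == frame->ethertype)
            return &ifp->protos[i];
    return NULL;
}

/* Sample protocol input: only counts the frames it receives. */
static int rx_sample_input(struct rx_if_proto *proto,
                           const struct rx_frame *frame)
{
    (void)frame;
    proto->delivered++;
    return 0;
}
```

The interface family owns the demultiplexing decision, so the input path itself needs no knowledge of frame formats; it only follows the pointer that the demultiplexer returns.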

For more information

The following sources provide additional information that may be of interest to developers of network kernel extensions:

The Design and Implementation of the 4.4 BSD Operating System. M. K. McKusick et al., Addison-Wesley, Reading, 1996.
Unix Network Programming, Second Edition, Volume 1. W. Richard Stevens, Prentice Hall, New York, 1998.
TCP/IP Illustrated, Volume 1, The Protocols. W. Richard Stevens, Addison-Wesley, Reading, 1994.
TCP/IP Illustrated, Volume 2, The Implementation. W. Richard Stevens and Gary R. Wright, Addison-Wesley, Reading, 1995.
TCP/IP Illustrated, Volume 3, Other Protocols. W. Richard Stevens, Addison-Wesley, Reading, 1996.

The following websites provide information about the Berkeley Software Distribution (BSD):