Virtualized Input/Output

Virtualized I/O is an optional feature of platforms that have hypervisor support. Virtual I/O (VIO) provides to a given partition the appearance of I/O adapters that do not have a one-to-one correspondence with a physical IOA. The hypervisor uses one of three techniques to realize a virtual IOA:

In the hypervisor simulated class, the hypervisor may totally simulate the adapter. For example, this technique is used in the virtual Ethernet (IEEE VLAN) support (see ). It is applicable to communications between partitions that are created by a single hypervisor instance.

In the partition managed class, a server partition provides the services of one of its IOAs to one or more partner partitions (one or more client partitions or one or more server partitions). In limited cases, a client may communicate directly with a client. A server partition provides support to interpret I/O requests from the partner partition, perform those requests on one or more of its devices targeting the partner partition’s DMA buffer areas (for example, by using the Remote DMA (RDMA) facilities), and pass I/O responses back to the partner partition. For example, see . Note: The term “hosted” is sometimes used for “client” and the term “hosting” is sometimes used for “server.” A server IOA or partition can sometimes also be a client, and vice versa, so the terminology “client” and “server” tends to be less confusing than hosted and hosting.

In the hypervisor managed class, the hypervisor may provide low level hardware management (error and sub-channel allocation) so that partition level code may directly manage its assigned sub-channels.

This chapter is organized from general to specific. The overall structure of this architecture is as shown in .
VIO Architecture Structure
Terminology used with VIO

Besides the general terminology defined on the first page of this chapter, the terminology below will assist the reader in understanding the content of this chapter.

VIO: Virtual I/O. General term for all virtual I/O classes and virtual IOAs.

ILLAN: Interpartition Logical LAN. This option uses the hypervisor simulated class of virtual I/O to provide partition-to-partition LAN facilities without a real LAN IOA. See .

VSCSI: Virtual SCSI. This option provides the facilities for sharing physical SCSI type IOAs between partitions. See .

Client, Client VIO model: This terminology is mainly used with the partition managed class of VIO. The client, or client partition, is an entity which generally requests of a server partition access to I/O to which it does not have direct access (that is, access to I/O which is under control of the server partition). Unlike the server, the client does not provide services to other partitions to share the I/O which resides in its partition. However, it is possible for the same partition to be both a server and a client partition, but under different virtual IOAs. The Client VIO model is one where the client partition maps part of its local memory into an RTCE table (as defined by the first window pane of the “ibm,my-dma-window” property), so that the server partition can get access to that client’s local memory. An example of this is the VSCSI client (see for more information).

Server, Server VIO model: This terminology is mainly used with the partition managed class of VIO. The server, or server partition, is an entity which provides a method of sharing the resources under its direct control with another partition, virtualizing those resources in the process. The following defines the Server VIO model: the server is a server to a client. An example of this is the VSCSI server (see ). In this case, the Server VIO model is one where the server gets access to the client partition’s local memory via what the client mapped into an RTCE table. This access is done through the second window pane of the server’s “ibm,my-dma-window” property, which is linked to the first window pane of the client’s “ibm,my-dma-window” property.

Partner partition: This is “the other” partition in a pair of partitions which are connected via a virtual IOA pair. For client partitions, the partner is generally the server (although, in limited cases, client-to-client connections may be possible). For server partitions, the partner can be a client partition or another server partition.

RTCE table: Remote DMA TCE table. TCE (Translation Control Entry) and RTCE tables are used to translate I/O DMA operations and to provide protection against improper operations (access to what should not be accessed, or improper access modes, like writing to a read-only page). More information on TCEs and TCE tables, which are used for physical IOAs, can be found in . The RTCE table for Remote DMA (RDMA) is analogous to the TCE table for physical IOAs. The RTCE table does, however, have a little more information in it (as placed there by the hypervisor) in order to, among other things, allow the hypervisor to create links to physical IOA TCEs that were created from the RTCE table TCEs. A TCE in an RTCE table is never accessed directly by the partition’s software; only through hypervisor hcall()s. For more information on RTCE tables and operations, see , and .
Window pane (“ibm,my-dma-window” property): The RTCE tables for VIO DMA are pointed to by the “ibm,my-dma-window” property in the device tree for each virtual device. This property can have one, two, or three triples, each consisting of a Logical I/O Bus Number (LIOBN), a phys which is 0, and a size. The LIOBN essentially points to a unique RTCE table (or a unique entry point into a single table). The phys value of 0 indicates that offsets start at 0. The size is the size of the available address space for mapping memory into the RTCE table. This architecture refers to these unique RTCE tables as window panes within the “ibm,my-dma-window” property. Thus, there can be up to three window panes for each virtual IOA, depending on the type of IOA. For more on usage of the window panes, see .

RDMA: Remote Direct Memory Access is a DMA transfer from the server to its client or from the server to its partner partition. DMA refers both to physical I/O to/from memory operations and to memory-to-memory move operations.

Copy RDMA: This term refers to when the hypervisor is used (possibly with hardware assist) to move data between server partition and client partition memories, or between server partition and partner partition memories. See .

Redirected RDMA: This term refers to when the TCE(s) for a physical IOA are set up through the use of the RTCE table manipulation hcall()s (for example, H_PUT_RTCE) such that the client or partner partition’s RTCE table (through the second window pane of the server partition) is used by the hypervisor during the processing of the hcall() to set up the TCE(s) for the physical IOA; the physical IOA then DMAs directly to or from the client or partner partition’s memory. See for more information.

LRDMA: Stands for Logical Remote DMA and refers to the set of facilities for synchronous RDMA operations. See also for more information. LRDMA is a separate option.

Command/Response Queue (CRQ): The CRQ is a facility which is used to communicate between partner partitions. Transport events which are signaled from the hypervisor to the partition are also reported in this queue.

Subordinate CRQ (Sub-CRQ): Similar to the CRQ, except with notable differences (see ).

Reliable Command/Response Transport: This is the CRQ facility used for synchronous VIO operations to communicate between partner partitions. Several hcall()s are defined which allow a partition to place an entry on the partner partition’s queue. The firmware can also place transport change-of-status messages into the queue to notify a partition when the connection has been lost (for example, due to the other partition crashing or deregistering its queue). See for more information.

Subordinate CRQ Transport: This is the Sub-CRQ facility used for synchronous VIO operations to communicate between partner partitions when the CRQ facility by itself is not sufficient. The Subordinate CRQ Transport never exists without a corresponding Reliable Command/Response Transport. See for more information.
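To make the triple layout concrete, the following is a minimal kernel-side sketch of reading the window panes from the device tree. It assumes the Linux of_get_property() interface and, for illustration only, a fixed layout of one 32-bit cell each for LIOBN, phys, and size; in practice the phys and size cell widths are governed by the parent node’s #address-cells and #size-cells values, so real code must derive them from those properties.

    #include <linux/of.h>
    #include <linux/errno.h>

    struct vio_window_pane {
        u32 liobn;   /* opaque handle naming one RTCE table window pane */
        u32 phys;    /* always 0: I/O addresses in the pane start at 0  */
        u32 size;    /* bytes of I/O address space in this pane         */
    };

    /* Read up to max_panes triples from "ibm,my-dma-window"; returns the
     * number of panes found (1, 2, or 3 depending on the IOA type). */
    static int read_dma_windows(struct device_node *np,
                                struct vio_window_pane *panes, int max_panes)
    {
        int len, n, i;
        const __be32 *prop = of_get_property(np, "ibm,my-dma-window", &len);

        if (!prop)
            return -ENODEV;

        n = len / (3 * sizeof(__be32));  /* one triple per window pane */
        if (n > max_panes)
            n = max_panes;

        for (i = 0; i < n; i++) {
            panes[i].liobn = be32_to_cpup(prop++);
            panes[i].phys  = be32_to_cpup(prop++);  /* expected to be 0 */
            panes[i].size  = be32_to_cpup(prop++);
        }
        return n;
    }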
VIO Architectural Infrastructure

VIO is used in conjunction with the Logical Partitioning option as described in . For each of a platform’s partitions, the platform definition specifies the number and type of VIO adapters along with the associated interpartition communication paths (if any are defined). These definitions take the architectural form of VIO adapters and are communicated to the partitions as device nodes in their OF device tree. Depending upon the specific virtual device, its device tree node may be found as a child of / (the root node) or in the VIO sub-tree (see below).

The VIO infrastructure provides several primitives that may be used to build connections between partitions for various purposes (that is, for various virtual IOA types). These primitives include:

A Command/Response Queue (CRQ) facility which provides a pipe between partitions. A partition can enqueue an entry on its partner’s CRQ for processing by that partner. The partition can set up the CRQ to receive an interrupt when the queue goes from empty to non-empty; hence this facility provides a method for an inter-partition interrupt.

A Subordinate CRQ (Sub-CRQ) facility that may be used in conjunction with the CRQ facility when the CRQ facility by itself is not sufficient; that is, when more than one queue with more than one interrupt is required by the virtual IOA.

An extended TCE table called the RTCE table, which allows a partition to provide “windows” into its memory to its partner partition, while maintaining addressing and access control to its memory.

Remote DMA services that allow a server partition to transfer data to a partner partition’s memory via the RTCE table window panes. This allows a device driver in a server partition to efficiently transfer data to and from a partner, which is key to sharing an IOA in the server partition with its partner partition.

In addition to the virtual IOAs themselves, this architecture defines a virtual host bridge and a virtual interrupt source controller. The virtual host bridge roots the VIO sub-tree. The virtual interrupt source controller provides the consistent syntax for communicating the interrupt numbers the partition’s OS sees when the virtual IOAs signal an interrupt. The general VIO infrastructure is defined in . There are additional infrastructure requirements for the partition managed class based on the Synchronous VIO model. See .
VIO Infrastructure - General

This section describes the general OF device tree structure for virtual IOAs, describes window panes in more detail, and describes the interrupt control aspects of virtual IOAs.
Properties of the /vdevice OF Tree Node

Most VIO adapters are represented in the OF device tree as children of the /vdevice node (child of the root node). While the vdevice sub-tree is the preferred architectural home for VIO adapters, selected devices, for historical reasons, are housed outside of the vdevice sub-tree.

R1--1. The /vdevice node must contain the properties as defined in .

Properties of the /vdevice Node (Property Name / Required? / Definition):

“name” (Y): Standard property name per , specifying the virtual device name; the value shall be “vdevice”.
“device_type” (Y): Standard property name per , specifying the virtual device type; the value shall be “vdevice”.
“model” (NA): Property not present.
“compatible” (Y): Standard property name per , specifying the virtual device programming models; the value shall include “IBM,vdevice”.
“used-by-rtas” (NA): Property not present.
“ibm,loc-code” (NA): The location code is meaningless unless one is doing dynamic reconfiguration, as in the children of this node.
“reg” (NA): Property not present.
“#size-cells” (Y): Standard property name per ; the value shall be 0. No child of this node takes space in the address map as seen by the owning partition.
“#address-cells” (Y): Standard property name per ; the value shall be 1.
“#interrupt-cells” (Y): Standard property name per ; the value shall be 2. The first cell contains the interrupt number as it will appear in the XIRR and is used as input to interrupt RTAS calls. The second cell contains the value 0, indicating a positive edge sense.
“interrupt-map-mask” (NA): Property not present.
“interrupt-ranges” (Y): Standard property name that defines the interrupt number(s) and range(s) handled by this unit.
“ranges”: These will probably be needed for IB virtual adapters.
“interrupt-map” (NA): Property not present.
“interrupt-controller” (Y): The /vdevice node appears to contain an interrupt controller.
“ibm,drc-indexes” (For DR): Refers to the DR slots; the number provided is the maximum number of slots that can be configured, which is limited by, among other things, the RTCE tables allocated by the hypervisor.
“ibm,drc-power-domains” (For DR): Value of -1 to indicate that no power manipulation is possible or needed.
“ibm,drc-types” (For DR): Value of “SLOT”. Any virtual IOA can fit into any virtual slot.
“ibm,drc-names” (For DR): The virtual location code (see ).
“ibm,drc-info” (For DR): When present, replaces the “ibm,drc-indexes”, “ibm,drc-power-domains”, “ibm,drc-types” and “ibm,drc-names” properties. This single property is a consolidation of the four pre-existing properties and contains all of the required information.
“ibm,max-virtual-dma-size” (See definition): The maximum transfer size for the H_SEND_LOGICAL_LAN and H_COPY_RDMA hcall()s. Applies to all VIO which are children of the /vdevice node. Minimum value is 128 KB.
RTCE Table and Properties of the Children of the /vdevice Node

This architecture defines an extended type of TCE table called a Remote DMA TCE (RTCE) table. An RTCE table is one that is not directly used by the hardware to translate an I/O adapter’s DMA addresses, but is used by the hypervisor to translate a partition’s I/O addresses. RTCE tables have extra data, compared with a standard TCE table, to help firmware manage the use of its mappings. A partition manages the entries in the RTCE table for its memory that is to be the target of I/O operations using the TCE manipulation hcall()s, depending on the type of window pane. More on this later in this section. On platforms implementing the CRQ and LRDMA options, these hcall()s are extended to understand the format of the RTCE table via the LIOBN parameter that is used to address the specific window pane within an RTCE table. (One could also think of each LIOBN pointing to a separate RTCE table, rather than to window panes within an RTCE table.)

Children of the /vdevice node that support operations which use RTCE tables (for example, RDMA) contain the “ibm,my-dma-window” property. This property contains one or more (logical-I/O-bus-number, phys, size) triple(s). Each triple represents one window pane in an RTCE table which is available to this virtual IOA. The phys value is 0, and hence the logical I/O bus number (LIOBN) points to a unique range of TCEs in the RTCE table which are assigned to this window pane (LIOBN), and hence the I/O addresses for that LIOBN begin at 0. The LIOBN is an opaque handle which references a window pane within an RTCE table. Since this handle is opaque, its internal structure is not architected, but left to the implementation’s discretion. However, it is the architectural intent that the LIOBN be an indirect reference to the RTCE table through a hypervisor table that contains management variables, allowing for movement of the RTCE table and for table format specific access methods. The partition uses an I/O address, as an offset relative to the beginning of the LIOBN, as part of any I/O request to the memory mapped by that RTCE table’s TCEs. A server partition references the partner partition’s RTCE table through the LIOBN which it received in the second entry of the “ibm,my-dma-window” property associated with the server partition’s virtual IOA’s device tree node (for example, see ). The mapping between the LIOBN in the second pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding partner partition IOA’s RTCE table is made when the CRQ successfully completes registration. The window panes, and the hcall()s that are applicable to those panes, are defined and used as indicated in .

VIO Window Pane Usage and Applicable Hcall()s

First window pane (first triple):
  Hypervisor Simulated Class: I/O address range which is available to map local partition memory for use by the hypervisor.
  Client VIO Model: I/O address range which is available to map local partition memory to make it available for hypervisor use (access to the CRQ and any Sub-CRQs). For clients which support RDMA operations from their partner partition to their local memory (for example, VSCSI), this I/O address range is available to map local partition memory to make it available to the server partition, and this pane gets mapped to the second window pane of the partner partition (client/server relationship).
  Server VIO Model: I/O address range which is available to map local partition memory for use by the hypervisor (for access by H_COPY_RDMA requests, and for access to the CRQ and any Sub-CRQs). This window is not available to any other partition.
  Applicable hcall()s: H_PUT_TCE, H_GET_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE.

Second window pane (second triple):
  Hypervisor Simulated Class: Does not exist.
  Client VIO Model: Does not exist.
  Server VIO Model: I/O address range which corresponds to a window pane of the partner partition: linked to the first window pane for Client/Server model connections. Used to get access to the partner partition’s memory from the hypervisor that services the local partition, for use as source or destination in Copy RDMA requests or for redirected DMA operations (for example, H_PUT_RTCE).
  Applicable hcall()s: H_PUT_RTCE, H_REMOVE_RTCE, and H_PUT_RTCE_INDIRECT.
The “ibm,my-dma-window” property is the per-device equivalent of the “ibm,dma-window” property found in nodes representing bus bridges. Children of the /vdevice node contain virtual location codes in their “ibm,loc-code” properties. The invariant assignment number is uniquely generated when the virtual IOA is assigned to the partition and remains invariably associated with that virtual IOA for the duration of the partition definition. For more information, see .
VIO Interrupt Control

There are two hcall()s that work in conjunction with the RTAS calls ibm,int-on, ibm,int-off, ibm,set-xive and ibm,get-xive, which manage the state of the interrupt presentation controller logic. These hcall()s provide the equivalent of the IOA control registers used to control IOA interrupts. The usage of these two hcall()s is summarized in . The detail of H_VIO_SIGNAL is shown after this table and the detail of the applicable H_VIOCTL subfunctions can be found in , , and .

VIO Interrupt Control hcall() Usage:

CRQ interrupt: H_VIO_SIGNAL when the virtual IOA definition does not include Sub-CRQs; H_VIO_SIGNAL or H_VIOCTL when it does. Interrupt number obtained from the OF device tree “interrupts” property.

Sub-CRQ interrupt: Not applicable when the virtual IOA definition does not include Sub-CRQs; H_VIOCTL when it does. Interrupt number obtained from the H_REG_SUB_CRQ hcall().
H_VIO_SIGNAL

This hcall() manages the interrupt mode of a virtual adapter’s CRQ interrupt signalling logic. There are two modes: Disabled and Enabled. The first interrupt of the “interrupts” property is for the CRQ.

Syntax:

Parameters:

unit-address: Unit address per the device tree node’s “reg” property.

mode: Bit 63 controls the first interrupt specifier given in the virtual IOA’s “interrupts” property, and bit 62 the second. High order bits not associated with an interrupt source as defined in the previous sentence should be set to zero by the caller and are ignored by the hypervisor. A bit value of 1 enables the specified interrupt; a bit value of 0 disables the specified interrupt.

Semantics:

Validate that the unit address belongs to the partition and to a vdevice IOA, else return H_Parameter.
Validate that the mode is one of those defined, else return H_Parameter.
Establish the specified mode.
Return H_Success.
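As an illustration, under the Linux pSeries hcall interface this operation reduces to a single call. The sketch below assumes the plpar_hcall_norets() primitive and the H_VIO_SIGNAL token from asm/hvcall.h; the mode encodings are written out per the bit definitions above, and the macro and function names are illustrative.

    #include <asm/hvcall.h>
    #include <linux/types.h>

    #define VIO_IRQ_DISABLE  0x0UL  /* bits 62:63 = 0b00: both sources off */
    #define VIO_IRQ_ENABLE   0x1UL  /* bit 63 = 1: first "interrupts"
                                       source (the CRQ) enabled            */

    /* Enable or disable the CRQ interrupt of the virtual IOA at
     * unit_address (taken from the node's "reg" property). */
    static long vio_crq_irq_set(unsigned long unit_address, bool enable)
    {
        return plpar_hcall_norets(H_VIO_SIGNAL, unit_address,
                                  enable ? VIO_IRQ_ENABLE : VIO_IRQ_DISABLE);
    }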
General VIO Requirements

R1--1. For all VIO options: The platform must be running in LPAR mode.

R1--2. For all VIO options: The platform’s OF device tree must include, as a child of the root node, a node of type vdevice as the parent of a sub-tree representing the virtual IOAs assigned to the partition (see for details).

R1--3. For all VIO options: The platform’s /vdevice node must contain properties as defined in .

R1--4. For all VIO options: If the platform is going to limit the size of virtual I/O data copy operations (for example, H_SEND_LOGICAL_LAN and H_COPY_RDMA), then the platform’s /vdevice node must contain the “ibm,max-virtual-dma-size” property, and the value of this property must be at least 128 KB.

R1--5. For all VIO options: The interrupt server numbers for all interrupt source numbers, virtual and physical, must come from the same name space and are defined by the “ibm,interrupt-buid-size” property in the PowerPC External Interrupt Presentation Controller node.

R1--6. For all VIO options: The virtual interrupts for all children of the /vdevice node must, upon transfer of control to the booted partition program, be masked as would be the result of an ibm,int-off RTAS call specifying the virtual interrupt source number.

R1--7. For all VIO options with the Reliable Command/Response option: The platform must specify the CRQ interrupt as the first interrupt in the “interrupts” property for a virtual IOA.

R1--8. For the SMC options: The platform must specify the ASQ interrupt as the second interrupt in the “interrupts” property for a virtual IOA.

R1--9. For all VIO options: The platform must implement the H_VIO_SIGNAL hcall() as defined in .

R1--10. For all VIO options: The platform must assign an invariant virtual location code to each virtual IOA as described in .

R1--11. (Requirement Number Reserved For Compatibility)

R1--12. For all VIO options: The phys of each “ibm,my-dma-window” property triple (window pane) must have a value of zero, and the LIOBN must be unique.

Implementation Note: While the architectural definition of LIOBN would allow the definition of one logical I/O bus number (LIOBN) for all RTCE tables (IOBA ranges separating IOAs), such an implementation is not permitted for the VIO option, which requires a unique LIOBN (at least per partition, preferably platform wide) for each virtual IOA window pane. Such designs allow the LIOBN handle to be used to validate access rights, and allow each subsequent I/O bus address range to start at zero, providing maximum accommodation for 32-bit OSs.

R1--13. For the VSCSI option: For the server partition, there must exist two triples (two window panes) in the “ibm,my-dma-window” property, and the size field of the second triple (second window pane) of an “ibm,my-dma-window” property must be equal to the size field of the corresponding first triple (first window pane) of the associated partner partition’s “ibm,my-dma-window” property.

R1--14. For the SMC option: There must exist three triples (three window panes) in the “ibm,my-dma-window” property of all partitions which contain an SMC virtual IOA, and the size field of the second triple (second window pane) of an “ibm,my-dma-window” property must be equal to the size field of the corresponding third triple (third window pane) of the associated partner partition’s “ibm,my-dma-window” property.

R1--15.
For all VIO options: RTCE tables for virtual IOAs, as pointed to by the partitions’ first window pane of the “ibm,my-dma-window” property, and the TCEs that they contain (as built by the TCE hcall()s), must be persistent across partner partition reboots and across partner partition deregister (free)/re-register operations, even when the partition which connects after one deregisters is a different partition, and must be available to have TCEs built in them by said partition, as long as that partition still owns the corresponding virtual IOA (an LRDR operation which removes the IOA also removes the RTCE table).

R1--16. For all VIO options: The connection between the second window pane of the “ibm,my-dma-window” property for a partition and its corresponding window pane in the partner partition (first window pane) must be broken by the platform when either partition deregisters its CRQ or when either partition terminates, and the platform must invalidate any redirected TCEs copied from the said second window pane (for information on invalidation of TCEs, see ).

R1--17. For all VIO options: The following window panes of the “ibm,my-dma-window” property, when they exist, must support the following specified hcall()s, when they are implemented:

For the first window pane: H_PUT_TCE, H_GET_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE.

For the second window pane: H_PUT_RTCE, H_REMOVE_RTCE, H_PUT_RTCE_INDIRECT.

R1--18. For all VIO options: The platform must not prohibit the server and partner partition, or client and partner partition, from being the same partition, unless the user interface used to set up the virtual IOAs specifically disallows such configurations.

R1--19. For all VIO options: Any child node of the /vdevice node that is not defined by this architecture must contain the “used-by-rtas” property.

Implementation Notes: Relative to Requirement , partner partitions being the same partition makes sense from a product development standpoint. The ibm,partner-control RTAS call does not make sense if the partner partitions are the same partition.

R1--20. For all VIO options: The platform must implement the H_VIOCTL hcall() following the syntax of and semantics specified by .
Shared Logical Resources

The hcall()s described in this section control the sharing, within the boundaries of a single coherence domain, of resources owned by a partition owning a server virtual IOA with its clients (those owning the associated client virtual IOAs). The owning partition retains control of and access to the resources, and can ask for their return or indeed force it. Refer to for a graphic representation of the state transitions involved in sharing logical resources.
Shared Logical Resource State Transitions
Owners of resources can grant, to one or more client partitions, access to any of their resources. A client partition is defined as a partition with which the resource owner is authorized to register a CRQ, as denoted via an OF device tree virtual IOA node. Granting access is accomplished by requesting that the hypervisor generate a specific cookie for that resource for a specific sharing partition. The cookie value thus generated is unique only within the context of the partition being granted the resource and is unusable for gaining access to the resource by any other partition. This unique cookie is then communicated via some inter-partition communications channel, most likely the authorized Command/Response Queue. The partner partition then accepts the logical resource (mapping it into the accepting partition’s logical address space). The owning partition may grant shared access of the same logical resource to several clients (by generating separate cookies for each client). During the time the resource is shared, both the owner and the sharer(s) have access to the logical resource; the software running in these partitions uses private protocols to synchronize control access.

Once the resource has been accepted into the client’s logical address space, the resource can be used by the client in any way it wishes, including granting it to one of its own clients. When the client no longer needs access to the shared logical resource, it destroys any virtual mappings it may have created for the logical resource and returns the logical resource, thus unmapping it from its logical address space. The client program could subsequently accept the logical resource again (given that the cookie is still valid). To complete the termination of sharing, the owner partition rescinds the cookie describing the shared resource. Normally a rescind operation succeeds only if the client has returned the resource; however, the owner can force the rescind in cases where it suspects that the client is incapable of gracefully returning the resource. In the case of a forced rescind, the hypervisor marks the client partition’s logical address map location corresponding to the shared logical resource such that any future hcall() that specifies the logical address fails with an H_RESCINDED return code. The hypervisor then ensures that the client partition’s translation tables contain no references to a physical address of the shared logical resource.

Should the server partition fail, the hypervisor automatically notifies client partitions of the fact via the standard CRQ event message. In addition, the hypervisor recovers any outstanding shared logical resources prior to restarting the server partition. This recovery is preceded by a minimum of two seconds of delay to allow the client partitions time to gracefully return the shared logical resources; then the hypervisor performs the equivalent of a forced rescind operation on all the server partition’s outstanding shared logical resources.

This architecture does not specify a method of implementation; however, for the sake of clarifying the specified function, the following example implementation is given; refer to . Assume that the hypervisor maintains for each partition a logical to physical translation table (2) (used to verify the partition’s virtual to logical mapping requests).
Each logical resource (4) mapped within the logical to real translation table has associated with it a logical resource control structure (3) (some of the contents of this control structure are defined in the following text). The original logical resource control structures (3) describe the standard logical resources allocated to the partition due to the partition’s definition, such as one per Logical Memory Block (LMB), etc. The platform firmware, when creating the OF device tree for a given partition, knows the specific configuration of virtual IOAs with the associated quantity of the various types of logical resources for each virtual IOA. From that knowledge, it understands the number and type of resources that must be shared between the server and client partitions and therefore the number of control structures that will be needed. When an owning partition grants access to a subset of one of its logical resources to another partition, the hypervisor chooses a logical resource control structure to describe this newly granted resource (6) (as stated above, the required number of these control structures were allocated when the client virtual IOA was defined) and attaches it to the grantee’s base partition control structure (5). This logical resource control structure is linked (9) to the base logical resource control structure (3) of the resource owner. Subsequently the grantee’s OS may accept the shared logical resource (4), mapping it (7) into the grantee’s partition logical to physical map table (8). This same set of operations may subsequently be performed for other partition(s) (10). The shared resource is always a subset (potentially the complete subset) of the original. Once a partition (10) has accepted a resource, it may subsequently grant a subset of that resource to yet another partition (14); here the hypervisor creates a logical resource control structure (13) and links it (12) to the logical resource control structure (11) of the granting partition (10), which is in turn linked (9) to the owner’s logical resource control structure (3).
Example Implementation of Control Structures for Shared Logical Resources
For the OS to return the logical resource represented by control structure (11), the grant represented by control structure (13) needs to be rescinded. This is normally accomplished only after the OS that is running partition (14) performs a return operation, either because it has finished using the logical resource, or in response to a request (delivered by an inter-partition communications channel) from the owner. The exceptions are the case where either partition terminates (the return operation is performed by the hypervisor) and that of a non-responsive client (when the granter performs a forced rescind). A return operation is much like a logical resource dynamic reconfiguration isolate operation: the hypervisor removes the logical resource from the partition’s logical to physical map table, to prevent new virtual to physical mappings of the logical resource, then ensures that no virtual to physical mappings of the logical resource are outstanding (this can be accomplished either synchronously, by checking map counts etc., or asynchronously, prior to the completion of the rescind operation).

R1--1. For the Shared Logical Resource option: The platform must implement the hcall-logical-resource function set following the syntax and semantics of the included hcall()s as specified in: , , , and .

R1--2. For the Shared Logical Resource option: In the event that the partition owning a granted shared logical resource fails, the platform must wait for a minimum of 2 seconds after notifying the client partitions before recovering the shared resources via an automatic H_RESCIND_LOGICAL (forced) operation.
H_GRANT_LOGICAL

This hcall() creates a cookie that represents the specific instance of the shared object; that is, the specific subset of the original logical resource to be shared with the specific receiver partition. The owning partition makes this hcall() in preparation for the sharing of the logical resource subset with the receiver partition. The resulting cookie is only valid for the specified receiver partition. The caller needs to understand the bounds of the logical resource being granted, such as, for example, the logical address range of a given LMB. The generated cookie does not span multiple elemental logical resources (that is, resources represented by their own Dynamic Reconfiguration Connector). If the owner wishes to share a range of resources that does span multiple elemental logical resources, then the owner uses a series of H_GRANT_LOGICAL calls to generate a set of cookies, one for each subset of each elemental logical resource to be shared. The “logical” parameter identifies the starting “address” of the subset of the logical resource to be shared. The form of this “address” is resource dependent, and is given in .

Syntax:

Parameters:

Format of the H_GRANT_LOGICAL flags subfunction code (bits 16-23):

Access Restriction (bits 16-19). The defined bits in this field have independent meanings, and may appear in combination with all other bits unless specifically disallowed (an x in the binary field indicates that the bit can take on a value of 0 or 1):

0b1xxx: Read access inhibited (the grantee may not read from, or grant access to read from, this logical resource).
0bx1xx: Write access inhibited (the grantee may not write to, or grant access to write to, this logical resource).
0bxx1x: Re-grant rights inhibited (the grantee may not grant access to this logical resource to a subsequent client).
0bxxx1: Reserved; calling software should set this bit to zero. Firmware returns H_Parameter if set.

Logical Resource type (bits 20-23), with the supported Access Restriction combinations and the “address” and “length” descriptions:

System Memory (supported combinations 0bxxx0; value 0x1): Logical Address (as would be used in H_ENTER) in logical-lo; logical-hi not used (should be 0). Length in bytes, in units of 4 K on 4 K boundaries (low order 12 bits = 0).

MMIO Space (supported combinations 0bxxx0; value 0x2): Logical Address (as would be used in H_ENTER) in logical-lo; logical-hi not used (should be 0). Length in bytes, in units of 4 K on 4 K boundaries (low order 12 bits = 0).

Interrupt Source (supported combinations 0b00x0; value 0x4): 24-bit interrupt number (as would be used in ibm,get-xive) in the low order 3 bytes of logical-lo; logical-hi not used (should be 0). Length value = 1 (the logical resource is one indivisible unit).

DMA Window Pane (supported combinations 0b00x0; value 0x5): 32-bit LIOBN in logical-hi, with a 64-bit IOBA in logical-lo. Length in bytes of IOBA, in units of 4 K on 4 K boundaries (low order 12 bits = 0). Note: The DMA window only refers to physical DMA windows, not virtual DMA windows. Virtual DMA windows can be directly created with a client virtual IOA definition and need not be shared with those of the server.

Interprocessor Interrupt Port (supported combinations 0b00x0; value 0x6): Processor Number (as from the processor’s Unit ID) in logical-lo; logical-hi not used (should be 0). Length value = 1 (the logical resource is one indivisible unit).
unit-address: The unit address of the virtual IOA associated with the shared logical resource, and thus the partner partition that is to share the logical resource.
Semantics:

Verify that the flags parameter specifies a supported logical resource type, else return H_Parameter.

Verify that the logical address validly identifies a logical resource of the type specified by the flags parameter and owned/shared by the calling partition, else return H_Parameter; unless:

If the logical address’s page number represents a page that has been rescinded by the owner, return H_RESCINDED.
If there exists a grant restriction on the logical resource, return H_Permission.

Verify that the length parameter is of valid form for the resource type specified by the flags parameter and that it represents a subset (up to the full size) of the logical resource specified by the logical address parameter, else return H_Parameter.

Verify that the unit-address is for a virtual IOA owned by the calling partition, else return H_Parameter.

If the partner partition’s client virtual IOA has sufficient resources, generate hypervisor structures to represent, for hypervisor management purposes (including any grant restrictions), the specified shared logical resource, else return H_NOMEM.

Generate a cookie associated with the hypervisor structures created in the previous step, which the partner partition associated with the unit-address can use to reference said structures via the H_ACCEPT_LOGICAL and H_RETURN_LOGICAL hcall()s, and which the calling partition can use to reference said structures via the H_RESCIND_LOGICAL hcall().

Place the cookie generated in the previous step in R4 and return H_Success.
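To make the flags encoding concrete, here is a sketch of composing the flags word in C. The shift follows from the big-endian register bit numbering used above (bit 0 is the most significant bit of the 64-bit register, so bits 16-23 occupy bits 47..40 counting from the least significant bit); the macro and function names are illustrative, not architected.

    /* Access restriction nibble (architecture bits 16-19) */
    #define GRANT_READ_INHIBIT     0x8  /* 0b1xxx */
    #define GRANT_WRITE_INHIBIT    0x4  /* 0bx1xx */
    #define GRANT_REGRANT_INHIBIT  0x2  /* 0bxx1x */

    /* Logical resource type nibble (architecture bits 20-23) */
    #define GRANT_TYPE_SYSMEM      0x1  /* System Memory             */
    #define GRANT_TYPE_MMIO        0x2  /* MMIO Space                */
    #define GRANT_TYPE_INTR        0x4  /* Interrupt Source          */
    #define GRANT_TYPE_DMA_PANE    0x5  /* DMA Window Pane           */
    #define GRANT_TYPE_IPI_PORT    0x6  /* Interprocessor Intr. Port */

    static inline unsigned long grant_flags(unsigned int restrict_bits,
                                            unsigned int type)
    {
        /* BE bits 16-23 sit at LSB-relative bits 47..40: shift by 63-23 */
        return ((unsigned long)((restrict_bits << 4) | type)) << 40;
    }

    /* Example: grant a read-only (write-inhibited), non-regrantable
     * slice of System Memory:
     *   flags = grant_flags(GRANT_WRITE_INHIBIT | GRANT_REGRANT_INHIBIT,
     *                       GRANT_TYPE_SYSMEM);
     */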
H_RESCIND_LOGICAL

This hcall() invalidates a logical sharing as created by H_GRANT_LOGICAL above. This operation may be subject to significant delays in certain circumstances. Callers may experience an extended series of H_PARTIAL returns prior to successful completion of this operation. If the sharer of the logical resource has not successfully completed the H_RETURN_LOGICAL operation on the shared logical resource represented by the specified cookie, the H_RESCIND_LOGICAL hcall() fails with the H_Resource return code unless the “force” flag is specified. The use of the “force” flag increases the likelihood that a series of H_PARTIAL returns will be experienced prior to successful completion. The “force” flag also causes the hcall() to recursively rescind any and all cookies that represent subsequent sharing of the logical resource. That is, if the original client subsequently granted access to any or all of the logical resource to a client, those cookies and any other subsequent grants are also rescinded.

Syntax:

Parameters:

flags: In the flags subfunction code field (bits 16-23), two values are defined: 0x00 “normal”, and 0x01 “forced”.

cookie: The handle returned by H_GRANT_LOGICAL representing the logical resource to be rescinded.

Semantics:

Verify that the cookie parameter references an outstanding instance of a shared logical resource owned/accepted by the calling partition, else return H_Parameter.

Verify that the flags parameter is one of the supported values, else return H_Parameter. (Implementations should provide mechanisms to ensure that reserved flag field bits are zero; to improve performance, implementations may choose to activate this checking only in “debug” mode. The mechanism for activating an implementation dependent debug mode is outside of the scope of this architecture.)

If the “force” flag is specified, then perform the functions of H_RETURN_LOGICAL (cookie) as if called by the client partition. Note this involves forcibly rescinding any cookies generated by the client partition that refer to the logical resource referenced by the original cookie being rescinded.

If the resource referenced by cookie is available for mapping via the client partition’s logical to physical mapping table (the resource was accepted and not returned), return H_Resource.

Verify that the resource referenced by cookie is not mapped by the client partition, else return H_PARTIAL.

The hypervisor reclaims the control structures referenced by cookie and returns H_Success.
H_ACCEPT_LOGICAL

The H_ACCEPT_LOGICAL hcall() maps the granted logical resource into the client partition’s logical address space. To provide the most compact client logical address space, the hypervisor maps the resource into the lowest applicable logical address for the referenced logical resource type, consistent with the resource’s size and the resource type’s constraints upon alignment etc. The chosen logical address for the starting point of the logical mapping is returned in register R4.

Syntax:

Parameters:

cookie: The handle returned by H_GRANT_LOGICAL representing the logical resource to be accepted.

Semantics:

Verify that the cookie parameter is valid for the calling partition, else return H_Parameter.
If the cookie represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
Map the resource represented by the cookie parameter, with any attendant access restrictions, into the lowest available logical address of the calling partition consistent with constraints of size and alignment, and place the selected logical address into R4.
Return H_Success.
H_RETURN_LOGICAL

The H_RETURN_LOGICAL hcall() unmaps the logical resource from the client partition’s logical address space. Prior to calling H_RETURN_LOGICAL, the client partition should have destroyed all virtual mappings to the section of the logical address space to which the logical resource is mapped; that is, unmapping virtual addresses for MMIO and System Memory space, invalidating TCEs mapping a shared DMA window pane, and disabling/masking shared interrupt sources and/or interprocessor interrupts. Failing to do so may result in parameter errors for other hcall()s and H_Resource from the H_RETURN_LOGICAL hcall(). Part of the semantic of this call is to determine that no such active mapping exists. Implementations may be able to determine this quickly if, for example, they maintain map counts for various logical resources; if an implementation searches a significant amount of platform tables, then the hcall() may return H_Busy and maintain internal state to continue the scan on subsequent calls using the same cookie parameter. The cookie parameter remains valid for the calling client partition until the server partition successfully executes the H_RESCIND_LOGICAL hcall().

Syntax:

Parameters:

cookie: The handle returned by H_GRANT_LOGICAL representing the logical resource to be returned.

Semantics:

Verify that the cookie parameter references an outstanding instance of a shared logical resource accepted by the calling partition, else return H_Parameter.
Remove the referenced logical resource from the calling partition’s logical address map.
Verify that no virtual to logical mappings exist for the referenced resource, else return H_Resource. This operation may require extensive processing; in some cases the hcall() may return H_Busy to allow for improved system responsiveness. In these cases the state of the mapping scan is retained in the hypervisor’s state structures such that after some number of repeated calls the function is expected to finish.
Return H_Success.
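The four hcall()s above form one lifecycle: grant, hand the cookie to the partner (normally over the CRQ), accept, return, rescind. The sketch below strings them together on the Linux pSeries hcall primitives (plpar_hcall() and plpar_hcall_norets() from asm/hvcall.h); the opcode tokens and the argument ordering are placeholders, since the architected numeric tokens and syntax figures are not reproduced in this text.

    #include <asm/hvcall.h>
    #include <linux/types.h>

    #define H_GRANT_LOGICAL    0x0UL  /* placeholder token */
    #define H_ACCEPT_LOGICAL   0x0UL  /* placeholder token */
    #define H_RETURN_LOGICAL   0x0UL  /* placeholder token */
    #define H_RESCIND_LOGICAL  0x0UL  /* placeholder token */

    /* Owner: generate a cookie for one subset of one elemental resource. */
    static long owner_grant(unsigned long flags, unsigned long logical_lo,
                            unsigned long length, unsigned long unit_address,
                            unsigned long *cookie)
    {
        unsigned long ret[PLPAR_HCALL_BUFSIZE];
        long rc = plpar_hcall(H_GRANT_LOGICAL, ret, flags,
                              0UL /* logical-hi */, logical_lo, length,
                              unit_address);
        if (rc == H_SUCCESS)
            *cookie = ret[0];        /* cookie comes back in R4 */
        return rc;                   /* cookie then travels over the CRQ */
    }

    /* Client: accept the resource into the logical address space. */
    static long client_accept(unsigned long cookie, unsigned long *logical)
    {
        unsigned long ret[PLPAR_HCALL_BUFSIZE];
        long rc = plpar_hcall(H_ACCEPT_LOGICAL, ret, cookie);
        if (rc == H_SUCCESS)
            *logical = ret[0];       /* R4 = chosen logical address */
        return rc;
    }

    /* Client: return after destroying all virtual mappings. */
    static long client_return(unsigned long cookie)
    {
        long rc;
        do {                         /* mapping scan may report H_Busy */
            rc = plpar_hcall_norets(H_RETURN_LOGICAL, cookie);
        } while (rc == H_BUSY);
        return rc;
    }

    /* Owner: rescind; "forced" places 0x01 in flags bits 16-23. */
    static long owner_rescind(unsigned long cookie, bool force)
    {
        unsigned long flags = force ? (0x01UL << 40) : 0UL;
        long rc;
        do {                         /* forced rescinds may be partial */
            rc = plpar_hcall_norets(H_RESCIND_LOGICAL, flags, cookie);
        } while (rc == H_PARTIAL || rc == H_BUSY);
        return rc;
    }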
H_VIOCTL

The H_VIOCTL hypervisor call allows the partition to manipulate or query certain virtual IOA behaviors.
Command Overview

Syntax:

Parameters:

unit-address: As specified.
subfunction: Specific subfunction to perform; see .
parm-1: Specified in subfunction semantics.
parm-2: Specified in subfunction semantics.
parm-3: Specified in subfunction semantics.

Semantics:

Validate that the subfunction is implemented, else return H_Not_Found.
Validate the unit-address, else return H_Parameter.
Validate that the subfunction is valid for the given virtual IOA, else return H_Parameter.
Refer to to determine the semantics for the given subfunction.

H_VIOCTL subfunction parameter values (Subfunction Number / Subfunction Name / Required?):

0x0: (Reserved)
0x1 GET_VIOA_DUMP_SIZE: For all VIO options.
0x2 GET_VIOA_DUMP: For all VIO options.
0x3 GET_ILLAN_NUMBER_VLAN_IDS: For the ILLAN option.
0x4 GET_ILLAN_VLAN_ID_LIST: For the ILLAN option.
0x5 GET_ILLAN_SWITCH_ID: For the ILLAN option.
0x6 DISABLE_MIGRATION: For all vscsi-server and vfc-server.
0x7 ENABLE_MIGRATION: For all vscsi-server and vfc-server.
0x8 GET_PARTNER_INFO: For all vscsi-server and vfc-server.
0x9 GET_PARTNER_WWPN_LIST: For all vfc-server.
0xA DISABLE_ALL_VIO_INTERRUPTS: For the Subordinate CRQ Transport option.
0xB DISABLE_VIO_INTERRUPT: For the Subordinate CRQ Transport option.
0xC ENABLE_VIO_INTERRUPT: For the Subordinate CRQ Transport option.
0xD GET_ILLAN_MAX_VLAN_PRIORITY: No.
0xE GET_ILLAN_NUMBER_MAC_ACLS: No.
0xF GET_MAC_ACLS: No.
0x10 GET_PARTNER_UUID: For the UUID option.
0x11 FW_RESET: For the VNIC option.
0x12 GET_ILLAN_SWITCHING_MODE: For any ILLAN adapter with the “ibm,trunk-adapter” property.
0x13 DISABLE_INACTIVE_TRUNK_RECEPTION: For any ILLAN adapter with the “ibm,trunk-adapter” property.
0x14 GET_MAX_REDIRECTED_MAPPINGS: For platforms that support more than a single Redirected RDMA mapping per virtual TCE.
0x18 VNIC_SERVER_STATUS: For VNIC servers.
0x19 GET_SESSION_TOKEN: For VNIC clients.
0x1A SESSION_ERROR_DETECTED: For VNIC clients.
0x1B GET_VNIC_SERVER_INFO: For VNIC servers.
0x1C ILLAN_MAC_SCAN: For any ILLAN adapter with the “ibm,trunk-adapter” property.
0x1D ENABLE_PREPARE_FOR_SUSPEND: For all vscsi-server and vfc-server.
0x1E READY_FOR_SUSPEND: For all vscsi-server and vfc-server.
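Since every subfunction shares the same five-argument shape and many return a single value in R4, a thin wrapper is convenient. The sketch below assumes the Linux plpar_hcall() primitive and the H_VIOCTL token from asm/hvcall.h; the wrapper name is illustrative.

    #include <asm/hvcall.h>

    #define GET_VIOA_DUMP_SIZE  0x1UL  /* subfunction number per the table */

    /* Issue H_VIOCTL and, when the caller asks for it, hand back R4. */
    static long h_vioctl(unsigned long unit_address, unsigned long subfunc,
                         unsigned long parm1, unsigned long parm2,
                         unsigned long parm3, unsigned long *r4)
    {
        unsigned long ret[PLPAR_HCALL_BUFSIZE];
        long rc = plpar_hcall(H_VIOCTL, ret, unit_address, subfunc,
                              parm1, parm2, parm3);
        if (r4)
            *r4 = ret[0];
        return rc;
    }

    /* Example: ask how large a buffer a VIOA dump needs. */
    static long vioa_dump_size(unsigned long unit_address, unsigned long *size)
    {
        return h_vioctl(unit_address, GET_VIOA_DUMP_SIZE, 0, 0, 0, size);
    }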
GET_VIOA_DUMP_SIZE Subfunction Semantics

Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.
The hypervisor calculates the size necessary for passing opaque firmware data describing current virtual IOA state to the partition for purposes of error logging and RAS, and returns H_Success, with the required size in R4.
GET_VIOA_DUMP Subfunction Semantics

If the given virtual IOA has an “ibm,my-dma-window” property in its device tree, then parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property. If the given virtual IOA has no “ibm,my-dma-window” property in its device tree, then parm-1 shall be a logical real, page-aligned address of a 4 K page used to return the virtual IOA dump.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.

If parm-1 is an output descriptor, then:
Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
Transfer as much opaque hypervisor data as fits into the output buffer as specified by the output descriptor. If all opaque data will not fit due to size, return H_Constrained, else return H_Success.

If parm-1 is a logical real address, then:
Validate the logical real address is valid for the partition, else return H_Parameter.
Transfer as much opaque hypervisor data as will fit into the passed logical real page, with a maximum of 4 K. If all opaque data will not fit in the page due to size, return H_Constrained, else return H_Success.
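The eight-byte output descriptor recurs in many of the subfunctions below, so it is worth writing out the packing once. This is a sketch per the byte layout just described; vio_out_desc() is a hypothetical helper, and the usage comment reuses the hypothetical h_vioctl() wrapper from the previous section.

    /* Pack an output descriptor: byte 0 (MSB) is the control byte and must
     * be zero, bytes 1-3 are the buffer length in bytes, and bytes 4-7 are
     * the TCE mapped I/O address of the buffer (mapped through the first
     * window pane of "ibm,my-dma-window"). */
    static inline unsigned long vio_out_desc(unsigned int len,
                                             unsigned int ioba)
    {
        return ((unsigned long)(len & 0xFFFFFFu) << 32) | ioba;
    }

    /* Example: dump into a 4 KB buffer previously TCE-mapped at I/O
     * address 0x2000 in the first window pane (0x2 = GET_VIOA_DUMP):
     *   rc = h_vioctl(unit_address, 0x2, vio_out_desc(4096, 0x2000),
     *                 0, 0, NULL);
     */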
GET_ILLAN_NUMBER_VLAN_IDS Subfunction Semantics

Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.
The hypervisor returns H_Success, with the number of VLAN IDs (PVID + VIDs) in R4.
This subfunction allows the partition to allocate the correct amount of space for the H_VIOCTL(GET_ILLAN_VLAN_ID_LIST) call.
GET_ILLAN_VLAN_ID_LIST Subfunction Semantics

parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.
Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
Transfer the VLAN ID list into the output buffer as specified by the output descriptor. The data will be an array of two byte values, where the first element of the array is the PVID, followed by all the VIDs. The format of the elements of the array is specified by IEEE VLAN documentation. Any unused space in the output buffer will be zeroed.
If all VLAN IDs do not fit due to size, return H_Constrained.
Return H_Success.
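Putting the last two subfunctions together gives the usual size-then-fetch pattern. A sketch, reusing the hypothetical h_vioctl() and vio_out_desc() helpers from above; the TCE mapping of the buffer through the first window pane is elided, and ioba stands for the I/O address at which the buffer was mapped.

    #include <asm/hvcall.h>

    #define GET_ILLAN_NUMBER_VLAN_IDS  0x3UL
    #define GET_ILLAN_VLAN_ID_LIST     0x4UL

    /* Ask how many IDs exist, then fetch them into a TCE-mapped buffer of
     * two-byte entries: entry 0 is the PVID, entries 1..n-1 are the VIDs. */
    static long fetch_vlan_ids(unsigned long unit_address, unsigned int ioba,
                               unsigned long *count)
    {
        long rc = h_vioctl(unit_address, GET_ILLAN_NUMBER_VLAN_IDS,
                           0, 0, 0, count);
        if (rc != H_SUCCESS)
            return rc;
        /* The caller must have TCE-mapped a buffer of at least
         * *count * 2 bytes at ioba before this second call. */
        return h_vioctl(unit_address, GET_ILLAN_VLAN_ID_LIST,
                        vio_out_desc((unsigned int)(*count * 2), ioba),
                        0, 0, NULL);
    }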
GET_ILLAN_SWITCH_ID Subfunction Semantics

parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.
Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
Transfer the switch identifier into the output buffer as specified by the output descriptor. The data will be a string of ASCII characters uniquely identifying the virtual switch to which the ILLAN adapter is connected. Any unused space in the output buffer will be zeroed.
If the switch identifier does not fit due to size, return H_Constrained.
Return H_Success.
DISABLE_MIGRATION Subfunction Semantics

When this subfunction is implemented, the “ibm,migration-control” property exists in the /vdevice OF device tree node.

Validate that parm-1, parm-2, and parm-3 are all set to zero, else return H_Parameter.
If no partner is connected, then return H_Closed.
Prevent the migration of the partner partition to the destination server until either the ENABLE_MIGRATION subfunction is called or H_FREE_CRQ is called.
Return H_Success.
ENABLE_MIGRATION Subfunction Semantics

When this subfunction is implemented, the “ibm,migration-control” property exists in the /vdevice OF device tree node.

Validate that parm-1, parm-2, and parm-3 are all set to zero, else return H_Parameter.
Validate that the migration of the partner partition to the destination server was previously prevented with the DISABLE_MIGRATION subfunction, else return H_Parameter.
Enable the migration of the partner partition.
Return H_Success.
GET_PARTNER_INFO Subfunction Semantics

parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.
Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
If the output buffer is not large enough to fit all the data, then return H_Constrained.
If no partner is connected and more than one possible partner exists, then return H_Closed.
Transfer the eight byte partner partition ID into the first eight bytes of the output buffer.
Transfer the eight byte unit address into the second eight bytes of the output buffer.
Transfer the NULL-terminated Converged Location Code associated with the partner unit address and partner partition ID immediately following the unit address.
Zero any remaining output buffer.
Return H_Success.
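The returned buffer has a simple fixed-plus-string layout, which can be described directly as a C structure; a sketch, with the struct name being illustrative. The fixed fields are big-endian on the wire, so a little-endian consumer would read them with be64_to_cpu().

    #include <linux/types.h>

    /* Layout of the GET_PARTNER_INFO output buffer. */
    struct vio_partner_info {
        __be64 partner_partition_id;  /* bytes 0-7                          */
        __be64 partner_unit_address;  /* bytes 8-15                         */
        char   loc_code[];            /* NULL-terminated Converged Location
                                         Code, starting at byte 16          */
    } __attribute__((packed));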
GET_PARTNER_WWPN_LIST Subfunction Semantics

This subfunction is used to get the WWPNs for the partner from the hypervisor. In this way, there is assurance that the WWPNs are accurate.

parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property.

Validate parm-2 and parm-3 are set to zero, else return H_Parameter.
Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter.
If the output buffer is not large enough to fit all the data, return H_Constrained.
If no partner is connected, return H_Closed.
Transfer the first eight byte WWPN, which is represented in the vfc-client node of the partner partition in the “ibm,port-wwn-1” parameter, into the first eight bytes of the output buffer.
Transfer the second eight byte WWPN, which is represented in the vfc-client node of the partner partition in the “ibm,port-wwn-2” parameter, into the second eight bytes of the output buffer.
Zero any remaining output buffer.
Return H_Success.
DISABLE_ALL_VIO_INTERRUPTS Subfunction Semantics

This subfunction is used to disable any and all CRQ and Sub-CRQ interrupts associated with the virtual IOA designated by the unit-address, for VIOs that define the use of Sub-CRQs. Software that controls a virtual IOA that does not define the use of Sub-CRQ facilities should use the H_VIO_SIGNAL hcall() to disable CRQ interrupts.

Programming Note: On platforms that implement the partition migration option, support for this subfunction might change after partition migration, and the caller should be prepared to receive an H_Not_Found return code indicating the platform does not implement this subfunction.

Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter.
Disable all CRQ and any Sub-CRQ interrupts associated with unit-address.
Return H_Success.
DISABLE_VIO_INTERRUPT Subfunction Semantics

This subfunction is used to disable a CRQ or Sub-CRQ interrupt, for VIOs that define the use of Sub-CRQs. The CRQ or Sub-CRQ is identified by the unit-address and parm-1. Software that controls a virtual IOA that does not define the use of Sub-CRQ facilities should use the H_VIO_SIGNAL hcall() to disable CRQ interrupts.

Programming Note: On platforms that implement the partition migration option, support for this subfunction might change after partition migration, and the caller should be prepared to receive an H_Not_Found return code indicating the platform does not implement this subfunction.

parm-1 is the interrupt number of the interrupt to be disabled. For an interrupt associated with a CRQ, this number is obtained from the “interrupts” property in the device tree. For an interrupt associated with a Sub-CRQ, this number is obtained during the registration of the Sub-CRQ (H_REG_SUB_CRQ).

Validate parm-1 is a valid interrupt number for a CRQ or Sub-CRQ for the virtual IOA defined by unit-address, and that parm-2 and parm-3 are set to zero, else return H_Parameter.
Disable the interrupt specified by parm-1.
Return H_Success.
ENABLE_VIO_INTERRUPT Subfunction Semantics

This subfunction is used to enable a CRQ or Sub-CRQ interrupt, for VIOs that define the use of Sub-CRQs. The CRQ or Sub-CRQ is identified by the unit-address and parm-1. Software that controls a virtual IOA that does not define the use of Sub-CRQ facilities should use the H_VIO_SIGNAL hcall() to enable CRQ interrupts.

Programming Note: On platforms that implement the partition migration option, support for this subfunction might change after partition migration, and the caller should be prepared to receive an H_Not_Found return code indicating the platform does not implement this subfunction.

parm-1 is the interrupt number of the interrupt to be enabled. For an interrupt associated with a CRQ, this number is obtained from the “interrupts” property in the device tree. For an interrupt associated with a Sub-CRQ, this number is obtained during the registration of the Sub-CRQ (H_REG_SUB_CRQ).

Validate parm-1 is a valid interrupt number for a CRQ or Sub-CRQ for the virtual IOA defined by unit-address, and that parm-2 and parm-3 are set to zero, else return H_Parameter.
Enable the interrupt specified by parm-1.
Return H_Success.
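A Sub-CRQ driver typically disables its queue interrupt while polling and re-arms it afterward. A sketch of that toggle, reusing the hypothetical h_vioctl() wrapper from earlier; irq is the interrupt number handed back by H_REG_SUB_CRQ at queue registration (or taken from the “interrupts” property for the CRQ itself).

    #define DISABLE_VIO_INTERRUPT  0xBUL
    #define ENABLE_VIO_INTERRUPT   0xCUL

    /* Enable or disable one CRQ/Sub-CRQ interrupt of the virtual IOA. */
    static long vio_irq_set(unsigned long unit_address, unsigned long irq,
                            bool enable)
    {
        return h_vioctl(unit_address,
                        enable ? ENABLE_VIO_INTERRUPT : DISABLE_VIO_INTERRUPT,
                        irq, 0, 0, NULL);
    }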
GET_ILLAN_MAX_VLAN_PRIORITY Subfunction Semantics Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. The hypervisor returns H_Success, with the maximum allowed IEEE 802.1Q VLAN priority for the virtual IOA returned in R4. If no priority limits are in place, the highest priority value defined by IEEE 802.1Q is returned in R4.
GET_ILLAN_NUMBER_MAC_ACLS Subfunction Semantics This subfunction allows the partition to allocate the correct amount of space for the GET_MAC_ACLS Subfunction call. Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. The hypervisor returns H_Success, with the number of allowed MAC addresses returned in R4. If no MAC access control limits are in place, 0 is returned in R4.
GET_MAC_ACLS Subfunction Semantics parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property. Validate parm-2 and parm-3 are set to zero, else return H_Parameter. Validate the I/O address range is in the required DMA window and is mapped by valid TCEs, else return H_Parameter. Transfer the allowed MAC addresses into the output buffer as specified by the output descriptor. The data is an array of 8 byte values, with the low order 6 bytes of each value containing the 6 byte allowed MAC address. Any unused space in the output buffer is zeroed. If not all allowed MAC addresses fit in the output buffer, return H_Constrained. Return H_Success.
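The two subfunctions above pair naturally: query the count, size the buffer, then fetch. A hedged C sketch of that sequence follows; h_vioctl(), the subfunction codes, and the vio_map_buffer() helper (which stands in for whatever TCE-mapping service the partition software actually has) are assumptions.

    #include <stdint.h>
    #include <stddef.h>

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunction,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3,
                         uint64_t *r4);
    extern const uint64_t SUBF_GET_ILLAN_NUMBER_MAC_ACLS; /* assumed codes */
    extern const uint64_t SUBF_GET_MAC_ACLS;
    #define H_SUCCESS 0

    /* Hypothetical helper: TCE-map "len" bytes of partition memory
     * through the first window pane; returns 0 on success and yields
     * the I/O address and the partition-local pointer. */
    extern int vio_map_buffer(uint64_t unit_address, size_t len,
                              uint32_t *ioba, uint8_t **local);

    /* Query how many MAC ACL entries exist, size a buffer to match, and
     * fetch them; each entry is 8 bytes with the MAC in the low 6 bytes. */
    long fetch_mac_acls(uint64_t unit_address)
    {
        uint64_t n_acls = 0;
        long rc = h_vioctl(unit_address, SUBF_GET_ILLAN_NUMBER_MAC_ACLS,
                           0, 0, 0, &n_acls);
        if (rc != H_SUCCESS || n_acls == 0)
            return rc;

        uint32_t ioba;
        uint8_t *buf;
        size_t len = n_acls * 8;
        if (vio_map_buffer(unit_address, len, &ioba, &buf))
            return -1;

        uint64_t desc = ((uint64_t)(len & 0x00FFFFFFu) << 32) | ioba;
        rc = h_vioctl(unit_address, SUBF_GET_MAC_ACLS, desc, 0, 0, NULL);
        /* H_Constrained means the count changed and not everything fit;
         * a caller can simply re-query the count and retry. */
        return rc;
    }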
GET_PARTNER_UUID Subfunction Semantics Validate parm-1, parm-2 and parm-3 are set to zero, else return H_Parameter. If no partner is connected and more than one possible partner exists, then return H_Closed. Transfer the UUID of the client partition that owns the virtual device into registers R4 (high order 8 bytes) and R5 (low order 8 bytes) (see for the format of the UUID). Return H_Success.
FW_Reset Subfunction Semantics This H_VIOCTL subfunction will reset the VNIC firmware associated with a VNIC client adapter, if currently active. This subfunction is useful when the associated firmware becomes unresponsive to other CRQ-based commands. For the case of vTPMs, the firmware is left inoperable until the client partition next boots. Semantics: Validate that parm-1, parm-2, and parm-3 are all set to zero, else return H_Parameter. If the firmware associated with the virtual adapter cannot be reset, return H_Constrained. Reset the firmware associated with the virtual adapter. Return H_Success.
GET_ILLAN_SWITCHING_MODE Subfunction Semantics Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. Validate that the given virtual IOA is an ILLAN adapter with the “ibm,trunk-adapter” property, else return H_Parameter. The hypervisor returns H_Success, with the current switching mode in R4. If the switching mode is VEB mode, R4 contains 0. If the switching mode is VEPA mode, R4 contains 1.
DISABLE_INACTIVE_TRUNK_RECEPTION Subfunction Semantics This subfunction is used to disable the reception of all packets for an ILLAN trunk adapter that is not the Active Trunk Adapter as set by the H_ILLAN_ATTRIBUTES hcall(). Note: The default value for this attribute is ENABLED. The value is reset on a successful H_FREE_LOGICAL_LAN hcall() or reboot/power change of the partition owning the ILLAN adapter. Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. Validate that the given virtual IOA is an ILLAN adapter with the “ibm,trunk-adapter” property, else return H_Parameter. The hypervisor disables reception of packets for this adapter when it is not the Active Trunk Adapter. Return H_Success.
GET_MAX_REDIRECTED_MAPPINGS Subfunction Semantics This subfunction retrieves the maximum number of additional redirected mappings for the specified adapter. Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. Validate that the given virtual IOA is an RDMA-capable server adapter, else return H_Parameter. Store the maximum number of additional redirected mappings for the LIOBN in R4. Store the maximum number of redirections per client IOBA in R5. Return H_Success.
VNIC_SERVER_STATUS Subfunction Semantics This subfunction is used to report the status of the physical backing device corresponding to a specific VNIC server adapter. Additionally, this subfunction is used as a heartbeat mechanism that the hypervisor utilizes to ensure the backing device associated with the virtual adapter is responsive. parm-1 is an enumerated value reflecting the physical backing device status. Validate that parm-1 is one of the following values: 0x1 (Operational), 0x2 (LinkDown), or 0x3 (AdapterError). Otherwise, return H_Parameter. parm-2 is a value, in milliseconds, that the caller utilizes to specify how long the hypervisor should wait for the next server status vioctl call. Validate that parm-3 is zero, else return H_Parameter. If the CRQ for the server adapter has not yet been registered, return H_State. Return H_Success.
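A minimal C sketch of the heartbeat call, assuming the hypothetical h_vioctl() wrapper and subfunction code used in the earlier sketches. The caller promises the next heartbeat within parm-2 milliseconds, so a server would invoke this periodically at a rate comfortably inside that interval.

    #include <stdint.h>
    #include <stddef.h>

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunction,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3,
                         uint64_t *r4);
    extern const uint64_t SUBF_VNIC_SERVER_STATUS; /* assumed code */

    /* Backing-device status values from the subfunction definition. */
    enum vnic_status { VNIC_OPERATIONAL = 0x1, VNIC_LINK_DOWN = 0x2,
                       VNIC_ADAPTER_ERROR = 0x3 };

    /* Report status and promise the next heartbeat within interval_ms.
     * H_State means the server CRQ has not been registered yet. */
    long vnic_heartbeat(uint64_t unit_address, enum vnic_status status,
                        uint64_t interval_ms)
    {
        return h_vioctl(unit_address, SUBF_VNIC_SERVER_STATUS,
                        (uint64_t)status, interval_ms, 0, NULL);
    }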
GET_SESSION_TOKEN Subfunction Semantics This subfunction is used to obtain a session token from a VNIC client adapter. This token is opaque to the caller and is intended to be used in tandem with the SESSION_ERROR_DETECTED vioctl subfunction. On platforms that implement the partition migration option, after partition migration the support for this subfunction might change, and the caller should be prepared to receive an H_Not_Found return code indicating the platform does not implement this subfunction. Validate that parm-1, parm-2, and parm-3 are 0, else return H_Parameter. Return H_Success, with the session token in R4.
SESSION_ERROR_DETECTED Subfunction Semantics This subfunction is used to report that the currently active backing device for a VNIC client adapter is behaving poorly, and that the hypervisor should attempt to fail over to a different backing device, if one is available. On platforms that implement the partition migration option, after partition migration the support for this subfunction might change, and the caller should be prepared to receive an H_Not_Found return code indicating the platform does not implement this subfunction. parm-1 is a VNIC session token. This token should be obtained from the GET_SESSION_TOKEN vioctl subfunction. Validate that parm-2 and parm-3 are 0, else return H_Parameter. Validate that the session token parameter corresponds to the current VNIC session, else return H_State. Validate that the active server status is Operational, else return H_Constrained. If the server status is Operational, change the server status to NetworkError and attempt to fail over to a different backing device. If there are no suitable servers to fail over to, return H_Constrained. If the client successfully failed over to another backing device as a result of this subfunction call, return H_Success.
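The token makes the error report race-free: if a failover happens between obtaining the token and reporting the error, the stale token is rejected with H_State and no spurious second failover occurs. A hedged C sketch of the two-call pattern (wrappers, subfunction codes, and return-code constants assumed, as before):

    #include <stdint.h>
    #include <stddef.h>

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunction,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3,
                         uint64_t *r4);
    extern const uint64_t SUBF_GET_SESSION_TOKEN;      /* assumed codes */
    extern const uint64_t SUBF_SESSION_ERROR_DETECTED;
    extern const long H_STATE; /* platform return-code value */
    #define H_SUCCESS 0

    /* Report a misbehaving backing device for this VNIC client. */
    long report_vnic_session_error(uint64_t unit_address)
    {
        uint64_t token;
        long rc = h_vioctl(unit_address, SUBF_GET_SESSION_TOKEN,
                           0, 0, 0, &token);
        if (rc != H_SUCCESS)
            return rc;

        rc = h_vioctl(unit_address, SUBF_SESSION_ERROR_DETECTED,
                      token, 0, 0, NULL);
        if (rc == H_STATE) {
            /* Session changed between the two calls: a failover already
             * occurred, so treat the reported error as handled. */
            rc = H_SUCCESS;
        }
        return rc; /* H_Constrained: no suitable server to fail over to */
    }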
GET_VNIC_SERVER_INFO Subfunction Semantics This subfunction is used to fetch information about a VNIC server adapter. parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property. Validate that parm-2 and parm-3 are 0, else return H_Parameter. Populate the TCE mapped buffer with the following information. Note that if the buffer descriptor (parm-1) describes an output buffer that is not large enough to hold the following information, the server information will be truncated to the size of the output buffer and the buffer will be populated with the truncated information.

VNIC Server Information Fields
  Version (byte offset 0, size 8): The format version of the provided information. The first supported version is 1.
  Active (byte offset 8, size 8): Boolean value describing whether or not this server adapter is currently the active server for the client adapter it is mapped to.
    0x0 - The server is not currently active.
    0x1 - The server is currently the active backing device for the client.
  Status (byte offset 16, size 8): Enumeration value corresponding to the current virtual adapter status.
    0x1 - Operational - The server adapter is working as expected.
    0x2 - LinkDown - The SR-IOV backing device's physical link is down.
    0x3 - AdapterError - The SR-IOV adapter is undergoing EEH.
    0x4 - PoweredOff - The virtual server adapter or its hosting partition is powered off.
    0x5 - NetworkError - The VNIC client detected a network issue with this adapter.
    0x6 - Unresponsive - The hypervisor is not reliably receiving VNIC_SERVER_STATUS vioctl calls from the VNIC server.
  Priority (byte offset 24, size 1): The current priority of this server adapter. Lower values take precedence over larger values.
  Reserved (byte offset 25, size 7): This field is reserved and must be zero.

Return H_Success.
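The field table above maps directly onto a 32-byte structure. The following C sketch is one possible in-memory view of the output buffer; the struct name is an invention of this sketch, and byte order of the multi-byte fields is assumed to match the platform's.

    #include <stdint.h>

    /* Layout of the GET_VNIC_SERVER_INFO output buffer per the field
     * table above (a sketch, not a normative definition). */
    struct vnic_server_info {
        uint64_t version;     /* format version; first supported is 1 */
        uint64_t active;      /* 0x0 = not active, 0x1 = active server */
        uint64_t status;      /* 0x1 Operational ... 0x6 Unresponsive */
        uint8_t  priority;    /* lower value = higher precedence */
        uint8_t  reserved[7]; /* must be zero */
    };

    _Static_assert(sizeof(struct vnic_server_info) == 32,
                   "buffer layout is 32 bytes");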
ILLAN_MAC_SCAN Subfunction Semantics parm-1 is an eight byte output descriptor. The high order byte of an output descriptor is control, the next three bytes are a length field of the buffer in bytes, and the low order four bytes are a TCE mapped I/O address of the start of a buffer in I/O address space. The high order control byte must be set to zero. The TCE mapped I/O address is mapped via the first window pane of the “ibm,my-dma-window” property. Parm-2 and parm-3 should be set to the opaque continuation tokens CT1 and CT2, respectively. These values are returned by the hypervisor through the ILLAN_MAC_SCAN Buffer header when a scan cannot be completed within a single vioctl call. See for more information about the values CT1 and CT2. Parm-2 and parm-3 should be set to zero when starting a new ILLAN_MAC_SCAN call sequence. Validate that the unit-address corresponds to an active ILLAN trunk adapter, else return H_Parameter. Validate parm-2 and parm-3 are both set to zero, or contain valid continuation tokens, else return H_Parameter. Validate that the I/O address range supplied by parm-1 is large enough to hold the header information for the ILLAN_MAC_SCAN Buffer detailed in , else return H_Parameter. Validate that the I/O address supplied by parm-1 is 8-byte aligned, else return H_Parameter. If any data transfers to the I/O address range supplied by parm-1 fail, return H_Permission. Iterate over all VLAN ids associated with the specified trunk adapter. For each associated VLAN id: Iterate over all ILLAN adapters, barring any adapters with the “ibm,trunk-adapter” property, belonging to the current VLAN id. For each non-trunk ILLAN adapter belonging to the current VLAN id, add a 64-bit value containing the current 12-bit VLAN id and the 48-bit MAC address of the ILLAN adapter to the next vacant entry in the MAC/VID Buffer. Each MAC/VID pair in the buffer is formatted as shown in . Note that when handling H_IN_PROGRESS return codes, the caller should either copy information from the buffer, immediately process the information in the buffer, or modify the output descriptor to utilize a new, non-overlapping buffer I/O address range after each call. Otherwise, the buffer data will be overwritten on consecutive calls.

MAC/VID Pair Entry Format
  RESERVED (bit offset 0, bit length 4)
  802.1Q VLAN ID (bit offset 4, bit length 12)
  Adapter MAC Address (bit offset 16, bit length 48)
If at any point during iteration the vioctl call exceeds the maximum allotted time interval, or if the MAC/VID buffer is filled to capacity, do the following: Store CT1 and CT2 in the buffer header so the operation can be continued on the next call. Set 'Num Entries' in the buffer header to the number of valid MAC/VID pairs in the output buffer. Set 'Reconfiguration Occurred' based on the rules described in the Dynamic Reconfigurations description below. Return H_IN_PROGRESS.

ILLAN_MAC_SCAN Buffer Format
  Header:
    CT1 (byte offset 0, length 8): Continuation token 1. This value should be used as parm-2 for sequential calls to ILLAN_MAC_SCAN when handling H_IN_PROGRESS return codes.
    CT2 (byte offset 8, length 8): Continuation token 2. This value should be used as parm-3 for sequential calls to ILLAN_MAC_SCAN when handling H_IN_PROGRESS return codes.
    Reserved (byte offset 16, length 15): This field is reserved and should be set to zero.
    Reconfiguration Occurred (byte offset 31, length 1): 0: The data in this buffer is guaranteed to be consistent with the virtual adapter configuration at the point of return. 1: The data in this buffer may contain data inconsistencies due to reconfiguration of ILLAN adapters between consecutive calls to ILLAN_MAC_SCAN. See the Dynamic Reconfigurations description below.
    Num Entries (byte offset 32, length 8): The number of valid 64-bit MAC/VID pairs in this buffer.
  Data:
    MAC/VID Buffer Start (byte offset 40, length variable): A variably-sized, contiguous array of MAC/VID pairs formatted according to . The number of valid entries in this array is specified by the “Num Entries” field.
If all MAC addresses were successfully scanned for all VLAN ids on the trunk adapter, do the following: Set 'CT1' and 'CT2' to zero in the buffer header. Set 'Num Entries' in the buffer header to the number of valid MAC/VID pairs in the output buffer. Set 'Reconfiguration Occurred' based on the rules described in the Dynamic Reconfigurations description below. Return H_SUCCESS. Note that the buffer header and data are only valid if this vioctl returns H_IN_PROGRESS or H_SUCCESS. Note that any unused buffer space outside of the range determined by the 'Num Entries' field in the ILLAN_MAC_SCAN buffer header is undefined by this architecture. Dynamic Reconfigurations: If the 'Reconfiguration Occurred' field in the ILLAN_MAC_SCAN buffer header is TRUE (1), the data in all MAC/VID buffers in a call sequence may contain inconsistencies due to dynamic reconfiguration events for the trunk adapter itself or any ILLAN adapters associated with the trunk adapter. In this case, all data collected from the call sequence should be used with caution, or re-queried. Possible inconsistencies arising from dynamic reconfiguration include the following:
  MAC addresses in the buffer may correspond to ILLAN adapters that have been removed from the switch due to partition suspension, hibernation, adapter disablement, or DLPAR operations.
  MAC addresses corresponding to ILLAN adapters that were added to the virtual switch due to partition resumption, adapter enablement, or DLPAR operations may not be included in the buffer.
  ILLAN adapters may have their VLAN memberships reconfigured, in which case certain VID/MAC pairs in the buffer may no longer be valid, and some valid VID/MAC pairs for ILLAN adapters may not be included in the buffer at all.
  ILLAN adapters may have their MAC address reconfigured. Both the old and new MAC addresses for the adapter may be included in the buffer, or neither may be included in the buffer at all.
Note that even if the value of the 'Reconfiguration Occurred' field is FALSE (0), ILLAN adapter reconfigurations may have occurred immediately after the vioctl completed, and the information in the buffer could be outdated.
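Putting the pieces together, a caller drives the scan as a continuation loop: start with zero tokens, consume the buffered entries after every call, and feed CT1/CT2 back until the hypervisor stops returning H_IN_PROGRESS. The following C sketch assumes the hypothetical h_vioctl() wrapper and an 8-byte-aligned, TCE-mapped buffer; the header struct mirrors the buffer format above.

    #include <stdint.h>
    #include <string.h>

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunction,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3,
                         uint64_t *r4);
    extern const uint64_t SUBF_ILLAN_MAC_SCAN; /* assumed code */
    extern const long H_IN_PROGRESS;           /* platform value */
    #define H_SUCCESS 0

    /* Header layout per the ILLAN_MAC_SCAN Buffer Format above. */
    struct mac_scan_hdr {
        uint64_t ct1;          /* continuation token 1 (parm-2 on resume) */
        uint64_t ct2;          /* continuation token 2 (parm-3 on resume) */
        uint8_t  reserved[15];
        uint8_t  reconfig;     /* 1 = data may be inconsistent */
        uint64_t num_entries;  /* valid 64-bit MAC/VID pairs that follow */
    };

    /* Scan all MAC/VID pairs behind a trunk adapter, calling "emit"
     * for each 64-bit pair.  "buf"/"buf_ioba" describe a TCE-mapped,
     * 8-byte aligned buffer of "buf_len" bytes. */
    long illan_mac_scan(uint64_t unit_address, uint8_t *buf,
                        uint32_t buf_ioba, uint32_t buf_len,
                        void (*emit)(uint64_t pair), int *saw_reconfig)
    {
        struct mac_scan_hdr hdr;
        uint64_t ct1 = 0, ct2 = 0; /* zero starts a new call sequence */
        uint64_t desc = ((uint64_t)(buf_len & 0x00FFFFFFu) << 32) | buf_ioba;
        long rc;

        *saw_reconfig = 0;
        do {
            rc = h_vioctl(unit_address, SUBF_ILLAN_MAC_SCAN,
                          desc, ct1, ct2, NULL);
            if (rc != H_SUCCESS && rc != H_IN_PROGRESS)
                return rc; /* header/data only valid on these two codes */

            memcpy(&hdr, buf, sizeof(hdr));
            if (hdr.reconfig)
                *saw_reconfig = 1; /* caller may want to re-query */

            /* Consume entries before the next call reuses the buffer. */
            const uint64_t *pairs = (const uint64_t *)(buf + 40);
            for (uint64_t i = 0; i < hdr.num_entries; i++)
                emit(pairs[i]);

            ct1 = hdr.ct1;
            ct2 = hdr.ct2;
        } while (rc == H_IN_PROGRESS);

        return rc;
    }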
ENABLE_PREPARE_FOR_SUSPEND Subfunction Semantics This subfunction is used to enable the “Prepare For Suspend” transport event on the VSCSI or VFC server adapter for which this function is called. If enabled, when a client partition is about to be migrated, the “Prepare For Suspend” transport event will be enqueued on the server's command queue, if active. The server should then quiesce all I/O to the backing store and respond when ready by calling the H_VIOCTL READY_FOR_SUSPEND subfunction. This subfunction should be called after each H_REG_CRQ, as H_REG_CRQ disables the support. When enabled, a “Resume” transport event will be enqueued on the server's command queue, if active, when the client partition resumes, regardless of whether it successfully suspended. Parm-1 is an eight byte timeout value in milliseconds. The timeout value specifies the maximum amount of time the hypervisor should wait after enqueueing the “Prepare For Suspend” transport event on the server's command queue until receiving the H_VIOCTL READY_FOR_SUSPEND subfunction from the server. The timeout value should take into account the maximum amount of time to quiesce I/O operations prior to migration of the client partition. Validate that the unit-address corresponds to a VSCSI or VFC server adapter, else return H_Parameter. Validate parm-1 is less than or equal to 30,000 milliseconds, else return H_Parameter. Validate parm-2 and parm-3 are set to zero, else return H_Parameter. Verify the server adapter is not configured for “Any client can connect”, else return H_Constrained. If this subfunction is not supported by the hypervisor, return H_Not_Found. If “Prepare For Suspend” is successfully enabled, H_Success is returned and an opaque hypervisor version is placed in R4.
READY_FOR_SUSPEND Subfunction Semantics This subfunction is used to respond to the “Prepare For Suspend” transport event on the VSCSI or VFC server adapter for which this function is called. If enabled via the H_VIOCTL ENABLE_PREPARE_FOR_SUSPEND subfunction, the server should call this H_VIOCTL READY_FOR_SUSPEND subfunction after receiving the “Prepare For Suspend” transport event and quiescing I/O operations to the backing store. If the server is unable to call this subfunction within the timeout specified in the H_VIOCTL ENABLE_PREPARE_FOR_SUSPEND subfunction, the migration operation on the client partition will be aborted. Validate that the unit-address corresponds to a VSCSI or VFC server adapter, else return H_Parameter. Validate parm-1, parm-2, and parm-3 are set to zero, else return H_Parameter. Validate that the server has previously called the H_VIOCTL ENABLE_PREPARE_FOR_SUSPEND subfunction after the most recent H_REG_CRQ, else return H_Constrained. If this subfunction is not supported by the hypervisor, return H_Not_Found.
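A server-side sketch of this two-subfunction handshake, in C. The wrappers, subfunction codes, and the quiesce_backing_store() helper are hypothetical; the flow (enable after every H_REG_CRQ, respond from the CRQ handler after quiescing) follows the semantics above.

    #include <stdint.h>
    #include <stddef.h>

    extern long h_vioctl(uint64_t unit_address, uint64_t subfunction,
                         uint64_t parm1, uint64_t parm2, uint64_t parm3,
                         uint64_t *r4);
    extern const uint64_t SUBF_ENABLE_PREPARE_FOR_SUSPEND; /* assumed codes */
    extern const uint64_t SUBF_READY_FOR_SUSPEND;

    /* Hypothetical helper that quiesces all outstanding I/O to the
     * backing store; must complete well inside the enabled timeout. */
    extern void quiesce_backing_store(uint64_t unit_address);

    /* Call once after every H_REG_CRQ (registration clears the enable).
     * timeout_ms must be <= 30,000 per the semantics above. */
    long enable_prepare_for_suspend(uint64_t unit_address, uint64_t timeout_ms)
    {
        uint64_t hyp_version; /* opaque hypervisor version, returned in R4 */
        return h_vioctl(unit_address, SUBF_ENABLE_PREPARE_FOR_SUSPEND,
                        timeout_ms, 0, 0, &hyp_version);
    }

    /* Call from the CRQ handler when the Prepare For Suspend transport
     * event is dequeued, after I/O to the backing store is quiesced. */
    long on_prepare_for_suspend(uint64_t unit_address)
    {
        quiesce_backing_store(unit_address);
        return h_vioctl(unit_address, SUBF_READY_FOR_SUSPEND, 0, 0, 0, NULL);
    }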
Partition Managed Class Infrastructure - General In addition to the general requirements for all VIO described in , the architecture for the partition managed class of VIO defines several other infrastructures: A Command/Response Queue (CRQ) that allows communications back and forth between the server partition and its partner partition (see ). A Subordinate CRQ (Sub-CRQ) facility that may be used in conjunction with the CRQ facility, when the CRQ facility by itself is not sufficient. That is, when more than one queue with more than one interrupt is required by the virtual IOA. See . A mechanism for doing RDMA, which includes: A mechanism called Copy RDMA that can be used by the device driver to move blocks of data between memory of the server and partner partitions A mechanism for Redirected RDMA that allows the device driver to direct DMA of data from the server partition’s physical IOA to or from the partner partition’s memory (see ). The mechanisms for the synchronous type VIO are described as follows:  
Command/Response Queue (CRQ) The CRQ facility provides ordered delivery of messages between authorized partitions. The facility is reliable in the sense that messages are delivered in sequence, that the sender of a message is notified if the transport facility is unable to deliver the message to the receiver’s queue, and that a notification message is delivered (provided that there is free space on the receive queue) if the partner partition either fails or deregisters its half of the transport connection. The CRQ facility does not police the contents of the payload portions (after the 1 byte header) of messages that are exchanged between the communicating pairs; however, this architecture does provide means (via the Format byte) for self-describing messages, so that the definitions of the content and protocol between using pairs may evolve over time without change to the CRQ architecture or its implementation. The CRQ is used to hold received messages from the partner partition. The CRQ owner may optionally choose to be notified via an interrupt when a message is added to its queue.
CRQ Format and Registration The CRQ is built of one or more 4 KB pages aligned on a 4 KB boundary within partition memory. The queue is organized as a circular buffer of 16 byte long elements. The queue is mapped into contiguous I/O addresses via the TCE mechanism and RTCE table (first window pane). The I/O address and length of the queue are registered by . This registration process tells the hypervisor where to find the virtual IOA’s CRQ.
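A minimal C sketch of queue preparation and registration. The h_reg_crq() wrapper signature is an assumption of this sketch (the architecture defines the registration process, not a C binding); the pre-registration zeroing of header bytes follows the entry-processing rules described later in this chapter.

    #include <stdint.h>
    #include <string.h>

    /* Hypothetical wrapper for H_REG_CRQ: registers "len" bytes of
     * TCE-mapped queue memory at I/O address "queue_ioba" for the
     * virtual IOA "unit_address". */
    extern long h_reg_crq(uint64_t unit_address, uint64_t queue_ioba,
                          uint64_t len);
    #define CRQ_PAGE 4096

    /* Prepare and register a one-page CRQ.  "queue" must be a 4 KB page
     * on a 4 KB boundary, already mapped to contiguous I/O addresses
     * via the first window pane of "ibm,my-dma-window". */
    long register_crq(uint64_t unit_address, uint8_t *queue,
                      uint64_t queue_ioba)
    {
        /* All header bytes must read 0x00 (entry invalid) before
         * registration, so the platform starts enqueueing at offset 0. */
        memset(queue, 0, CRQ_PAGE);
        return h_reg_crq(unit_address, queue_ioba, CRQ_PAGE);
    }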
CRQ Entry Format Each CRQ entry consists of a 16 byte element. The first byte of a CRQ entry is the Header byte and is defined in .

CRQ Entry Header Byte Values
  0x00: Element is unused -- all other bytes in the element are undefined.
  0x01 - 0x7F: Reserved.
  0x80: Valid Command/Response entry -- the second byte defines the entry format (for example, see ).
  0x81 - 0xBF: Reserved.
  0xC0: Valid Initialization Command/Response entry -- the second byte defines the entry format. See .
  0xC1 - 0xFE: Reserved.
  0xFF: Valid Transport Event -- the second byte defines the specific transport event. See .
The platform (transport mechanism) ignores the contents of all non-header bytes in all CRQ entries. Valid Command/Response entries (Header byte 0x80) are used to carry data between communicating partners, transparently to the platform. The second byte of the entry is reserved for a Format byte to enable the definitions of the content and protocol between using pairs to evolve over time. The definition of the second byte of the Valid Command/Response entry is beyond the scope of this architecture. presents example VSCSI format byte values. The Valid Initialization Command/Response entry (Header byte 0xC0) is used during virtual IOA initialization sequences. The second byte of this entry type is architected and is as defined in . This format is used for initialization operations between communicating partitions. The remaining bytes (byte three and beyond) of the Valid Initialization Command/Response entry are available for definition by the communicating entities.

Initialization Command/Response Entry Format Byte Definitions
  0x00: Unused
  0x01: Initialize
  0x02: Initialization Complete
  0x03 - 0xFE: Reserved
  0xFF: Reserved for Expansion
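In C terms, an element and the initialization handshake might look like the sketch below. The struct and the h_send_crq() wrapper signature are assumptions (the actual hcall() passes the element by value in registers on some implementations); the header and format byte values come from the tables above.

    #include <stdint.h>

    /* 16-byte CRQ element.  Only the header byte (and, for
     * initialization and transport entries, the format byte) is
     * architected; the payload is protocol-specific. */
    struct crq_entry {
        uint8_t header;      /* 0x00 free, 0x80 cmd/resp, 0xC0 init, 0xFF event */
        uint8_t format;      /* init entries: 0x01 Initialize, 0x02 Init Complete */
        uint8_t payload[14]; /* defined by the communicating pair */
    };

    /* Hypothetical wrapper for H_SEND_CRQ: enqueues one 16-byte entry
     * on the partner's CRQ. */
    extern long h_send_crq(uint64_t unit_address, const struct crq_entry *e);

    /* Kick off the initialization handshake: send Initialize (0xC0 0x01);
     * the partner answers with Initialization Complete (0xC0 0x02). */
    long send_initialize(uint64_t unit_address)
    {
        struct crq_entry e = { .header = 0xC0, .format = 0x01 };
        return h_send_crq(unit_address, &e);
    }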
Valid Transport Events (Header byte 0xFF) are used by the platform to notify communicating partners of conditions associated with the transport channel, such as the failure of the partner’s partition or the deregistration of the partner’s queue. The partner’s queue may be deregistered as a means of resetting the transport channel or simply to terminate the connection. When the Header byte of the queue entry specifies a Valid Transport Event, the second byte of the CRQ entry defines the type of transport event. The Format byte (second byte) of a Valid Transport Event queue entry is architected and is as defined in .

Transport Event Codes
  0: Unused
  1: Partner partition failed
  2: Partner partition deregistered CRQ
  3:
  4: Partner partition terminated via the ibm,partner-control RTAS call (for the Partner Control option; see Requirement R1--16)
  5:
  6: Partner partition suspended (for the Partition Suspension option)
  0x07 - 0x08: Reserved
  0x09: Prepare for Client Adapter Suspend. See .
  0x0A: Client Adapter Resume. See .
  0x0B - 0xFF: Reserved
The “partner partition suspended” transport event disables the associated CRQ, such that any H_SEND_CRQ hcall() (see ) to the associated CRQ returns H_Closed until the CRQ has been explicitly enabled using the H_ENABLE_CRQ hcall() (see ).
CRQ Entry Processing Prior to the partition software registering the CRQ, the partition software sets all the header bytes to zero (entry invalid). After registration, the first valid entry is placed in the first element, and the process proceeds to the end of the queue, then wraps around to the first entry again (given that the entry has been subsequently marked as invalid). This allows both the partition software and the transport firmware to maintain independent pointers to the next element each will be using. A sender uses an infrastructure dependent method to enter a 16 byte message on its partner’s queue (see ). Prior to enqueueing an entry on the CRQ, the platform first checks that the session to the partner’s queue is open and that there is a free entry; if not, it returns an error. If the checks succeed, the platform copies the contents of the message into the next free queue element, potentially notifies the receiver, and returns a successful status to the caller. At the receiver’s option, it may be notified via an interrupt when an element is enqueued to its CRQ. See . When the receiver has finished processing a queue entry, it writes the header byte to the value 0x00 to invalidate the entry and free it for future use. Should the receiver wish to terminate or reset the communication channel, it deregisters the queue, and if it needs to re-establish communications, proceeds to register either the same or a different section of memory as the new queue, with the queue pointers reset to the first entry.
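The receiver side reduces to a small drain loop: inspect the element at a private cursor, process it, free it by zeroing the header, and advance. A hedged C sketch, using the illustrative crq_entry layout from the earlier sketch:

    #include <stdint.h>

    struct crq_entry {
        volatile uint8_t header; /* 0x00 = free; written last by the sender */
        uint8_t format;
        uint8_t payload[14];
    };

    #define CRQ_ENTRIES (4096 / 16) /* one 4 KB page of 16-byte elements */

    /* Drain every valid entry currently on the CRQ.  "cursor" is the
     * receiver's private index of the next element it expects; the
     * platform fills elements in order and wraps, so the receiver only
     * ever needs to look at one position. */
    void crq_drain(struct crq_entry *q, unsigned *cursor,
                   void (*handle)(const struct crq_entry *))
    {
        while (q[*cursor].header != 0x00) {
            handle(&q[*cursor]);
            /* Freeing the element (header = 0x00) lets the platform
             * reuse it on the next wrap of the circular buffer. */
            q[*cursor].header = 0x00;
            *cursor = (*cursor + 1) % CRQ_ENTRIES;
        }
        /* Per the interrupt-notification discussion that follows, the
         * queue should be drained again after H_EOI: an enqueue racing
         * the EOI would otherwise wait for the next interrupt. */
    }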
CRQ Facility Interrupt Notification The receiver can set the virtual interrupt associated with its CRQ to one of two modes: Disabled (an enqueue interrupt is not signaled). Enabled (an enqueue interrupt is signaled on every enqueue). Note: An enqueue is considered a pulse, not a level. The pulse then sets the memory element within the emulated interrupt source controller. This allows the resetting of the interrupt condition by simply issuing the H_EOI hcall(), as is done with the PCI MSI architecture, rather than having to do an explicit interrupt reset as in the case with the PCI Level Sensitive Interrupt (LSI) architecture. The interrupt mechanism is capable of presenting only one interrupt signal at a time from any given interrupt source. Therefore, no additional interrupts from a given source are ever signaled until the previous interrupt has been processed through to the issuance of an H_EOI hcall(). Specifically, even if the interrupt mode is enabled, the effect is to interrupt on an empty to non-empty transition of the queue. However, as with any asynchronous posting operation, race conditions are to be expected. That is, an enqueue can happen in a window around the H_EOI hcall(). Therefore, the receiver should poll the CRQ after an H_EOI to prevent losing the initiative. See for information about interrupt control.
Extensions to Other hcall()s for CRQ
H_MIGRATE_DMA Since the CRQ is RTCE table mapped, the H_MIGRATE_DMA hcall() may be requested to move a page that is part of the CRQ. The OS owner of the queue is responsible for preventing its processors from modifying the page during the migrate operation (as is standard practice with this hcall()); however, the H_MIGRATE_DMA hcall() serializes with the CRQ hcall()s to direct new elements to the migrated target page.
H_XIRR, H_EOI The CRQ facility utilizes a virtual interrupt source number to notify the queue owner of new element enqueues. The standard H_XIRR and H_EOI hcall()s are extended to support this virtual interrupt mechanism, emulating the standard PowerPC Interrupt hardware with respect to the virtual interrupt source number.
CRQ Facility Requirements
R1--1. For the CRQ facility: The platform must implement the CRQ as specified in .
R1--2. For the CRQ facility: The platform must reject CRQ definitions that are not 4 KB aligned.
R1--3. For the CRQ facility: The platform must reject CRQ definitions that are not a multiple of 4 KB long.
R1--4. For the CRQ facility: The platform must reject CRQ definitions that are not mapped relative to the TCE mapping defined by the first window pane of the virtual IOA’s “ibm,my-dma-window” property.
R1--5. For the CRQ facility: The platform must start enqueueing Commands/Responses to the newly registered CRQ starting at offset zero and proceeding as in a circular buffer, each entry being 16 byte aligned.
R1--6. For the CRQ facility: The platform must enqueue Commands/Responses only if the 16 byte entry is free (header byte contains 0x00), else the enqueue operation fails.
R1--7. For the CRQ facility: The platform must enqueue the 16 bytes specified in the validated enqueue request as specified in Requirement except as required by Requirement .
R1--8. For the CRQ facility: The platform must not enqueue command/response entries if the CRQ has not been registered successfully, or if, after a successful registration, the owner has subsequently deregistered the CRQ.
R1--9. For the CRQ facility: The platform (transport mechanism) must ignore and must not modify the contents of all non-header bytes in all CRQ entries.
R1--10. For the CRQ facility: The first byte of a CRQ entry must be the Header byte and must be as defined in .
R1--11. For the CRQ facility: The Format byte (second byte) of a Valid Initialization CRQ entry must be as defined in .
R1--12. For the CRQ facility: The Format byte (second byte) of a Valid Transport Event queue entry must be as defined in .
R1--13. For the CRQ facility: If the partner partition fails, then the platform must enqueue a 16 byte entry starting with 0xFF01 (last 14 bytes unspecified) as specified in Requirement except as required by Requirements and .
R1--14. For the CRQ facility: If the partner partition deregisters its corresponding CRQ, then the platform must enqueue a 16 byte entry starting with 0xFF02 (last 14 bytes unspecified) as specified in Requirement except as required by Requirements and .
R1--15. Reserved
R1--16. For the CRQ facility with the Partner Control option: If the partner partition is terminated by request of this partition via the ibm,partner-control RTAS call, then the platform must enqueue a 16 byte entry starting with 0xFF04 (last 14 bytes unspecified) as specified in Requirement except as required by Requirements and when the partner partition has been successfully terminated.
R1--17. Reserved
R1--18. For the CRQ facility option: Platforms that implement the H_MIGRATE_DMA hcall() must implement that function for pages mapped for use by the CRQ.
R1--19. For the CRQ facility: The platform must emulate the standard PowerPC External Interrupt Architecture for the interrupt source numbers associated with the virtual devices via the standard RTAS and hypervisor interrupt calls, and must extend the H_XIRR and H_EOI hcall()s as appropriate for CRQ interrupts.
R1--20. For the CRQ facility: The platform’s OF must disable interrupts from the using virtual IOA before initially passing control to the booted partition program.
R1--21. For the CRQ facility: The platform’s OF must disable interrupts from the using virtual IOA upon registering the IOA’s CRQ.
R1--22. For the CRQ facility: The platform’s OF must disable interrupts from the using virtual IOA upon deregistering the IOA’s CRQ.
R1--23. For the CRQ facility: The platform must present (as appropriate per RTAS control of the interrupt source number) to the partition owning a CRQ the appearance of an interrupt, from the interrupt source number associated, through the OF device tree node, with the virtual device, when a new entry is enqueued to the virtual device’s CRQ and when the last interrupt mode set was “Enabled”, unless a previous interrupt from the interrupt source number is still outstanding.
R1--24. For the CRQ facility: The platform must not present to the partition owning a CRQ the appearance of an interrupt, from the interrupt source number associated, through the OF device tree node, with the virtual device, if the last interrupt mode set was “Disabled”, unless a previous interrupt from the interrupt source number is still outstanding.
Redirected RDMA (Using H_PUT_RTCE and H_PUT_RTCE_INDIRECT) A server partition uses the hypervisor function H_PUT_RTCE, which takes as parameters the opaque handle (LIOBN) of the partner partition’s RTCE table (second window pane of “ibm,my-dma-window”), an offset in the RTCE table, the handle of one of the server partition's I/O adapter TCE tables, plus an offset within the I/O adapter's TCE table. H_PUT_RTCE then copies the appropriate contents of the partner partition's RTCE table into the server partition's I/O adapter TCE table. In effect, this hcall() allows the server partition's I/O adapter to have access to a specific section of the partner partition's memory as if it were the server partition's memory. However, the partner partition, through the hypervisor, maintains control over exactly which areas of the partner partition's memory are made available to the server partition, without the overhead of the hypervisor having to directly handle each byte of the shared data. The H_PUT_RTCE_INDIRECT hcall(), if implemented, takes as an input parameter a pointer to a list of offsets into the RTCE table, and builds the TCEs similarly to H_PUT_RTCE, described above. A server partition uses the hypervisor function H_REMOVE_RTCE to back out TCEs generated by the H_PUT_RTCE and H_PUT_RTCE_INDIRECT hcall()s. The following rules guide the definition of the RTCE table entries and the implementation of the H_PUT_RTCE, H_PUT_RTCE_INDIRECT, H_REMOVE_RTCE, H_MASS_MAP_TCE, H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE hcall()s. Other implementations that provide the same external appearance as these rules are acceptable. The architectural intent is to provide RDMA performance essentially equivalent to direct TCE operations. The partner partition's RTCE table is itself never directly accessed by an I/O Adapter (IOA); it is only accessed by the hypervisor, and therefore it can be a bigger structure than the regular TCE table as accessed by hardware (more fields). When a server partition asks (via an H_PUT_RTCE or H_PUT_RTCE_INDIRECT hcall()) to have an RTCE table TCE copied to one of the server partition's physical IOA's TCEs, or asks (via an H_REMOVE_RTCE) to have an RTCE table entry removed from one of the server partition’s physical IOA’s TCEs, the hypervisor atomically, with respect to all RTCE table readers, sets (H_PUT_RTCE or H_PUT_RTCE_INDIRECT) or removes (H_REMOVE_RTCE) a field in the copied RTCE table entry. (This is an example of where the earlier statement “Other implementations that provide the same external appearance as these rules are acceptable” comes into effect. For example, for an RTCE table that is mapped with H_MASS_MAP_TCE, the pointer may not be in a field of the actual TCE in the RTCE table, but could, for example, be in a linked list or other such structure, due to the fact that there is not a one-to-one correspondence from the RTCE to the physical IOA TCE in that case; H_MASS_MAP_TCE can map up to an LMB into one TCE, while physical IOA TCEs map only 4 KB.) This field is a pointer to the copy of the RTCE table TCE in the server partition’s IOA’s TCE table. (A per-TCE lock in the RTCE table is one method for the atomic setting of the copied RTCE table TCE link pointer.) A server partition is guaranteed that it can create one redirected mapping per RTCE table entry. By default, if the server partition tries to create another copy of the same RTCE table TCE, it gets an error return.
Platforms that support the H_VIOCTL hcall() might support multiple redirected RTCE table mappings, provided that they do not duplicate existing mappings (the mappings are for different I/O operations); if they do, the total number of such multiple mappings per LIOBN and per RTCE page is communicated by the GET_MAX_REDIRECTED_MAPPINGS subfunction of the H_VIOCTL hcall(). If the GET_MAX_REDIRECTED_MAPPINGS subfunction of the H_VIOCTL hcall() is not implemented, then only the default single copy is supported. Multiple mappings of the same physical page are always allowed, as long as they originate from different RTCE table TCEs, just like with physical IOA TCEs. When the partner partition issues an H_PUT_TCE, H_PUT_TCE_INDIRECT, H_STUFF_TCE, or H_MASS_MAP_TCE hcall() to change its RTCE table, the hypervisor finds the targeted TCE in one of several states. A number of these states represent unusual conditions that can arise from timing windows or error conditions. The hypervisor rules for handling these cases are chosen to minimize its overhead while preventing one partition’s errors from corrupting another partition’s state.
  The RTCE table TCE is not currently in use: Clear/invalidate the TCE copy pointer and enter the RTCE table TCE mapping per the input parameters to the hcall().
  The RTCE table TCE contains a valid mapping and the TCE copy pointer is invalid (NULL or other implementation dependent value) (the previous mapping was never used for Redirected RDMA): Enter the RTCE table TCE mapping per the input parameters to the hcall().
  The RTCE table TCE contains a valid mapping and the TCE copy pointers reference TCEs that do not contain a valid copy of the previous mapping in the RTCE table TCE (the previous mapping was used for Redirected RDMA; however, the server partition has moved on and is no longer targeting the page represented by the old RTCE table TCE mapping): Clear/invalidate the TCE copy pointers and enter the RTCE table TCE mapping per the input parameters to the hcall().
  The RTCE table TCE contains a valid mapping and the TCE copy pointers reference TCEs that do contain a valid copy of the previous mapping in the RTCE table TCE (the previous mapping is still potentially in use for Redirected RDMA; however, the partner partition has moved on and is no longer interested in the previous I/O operation): The server partition’s IOA may still target a DMA operation against the TCE containing the copy of the RTCE table TCE mapping. The assumption is that any such targeting is the result of a timing window in the recovery of resources in the face of errors. Therefore, the server partition’s TCE is considered invalid, but the server partition may or may not be able to immediately invalidate the TCEs. For more information on invalidation of TCEs, see . The H_Resource return from an H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE may be used to hold off invalidation in this case.
If a server partition terminates, the partner partition’s device drivers time out the operations and resource recovery code recovers the RTCE table TCEs. If the partner partition terminates, the hypervisor scans the RTCE table and eventually invalidates all active copies of RTCE table TCEs. For more information on invalidation of TCEs, see . The server partition may use any of the supported hcall()s (see ) to manage the TCE tables used by its IOAs. No extra restrictions are made to changes of the server partition's TCE table besides those stated in 2 above.
The server partition can only target its own memory or the explicitly granted partner partition’s memory. The H_MIGRATE_DMA hcall() made by a partner partition migrates the page referenced by the RTCE table TCE, but follows the RTCE table TCE copy pointer, if valid, to the server partition’s IOA’s TCE table to determine which IOA’s DMA to disable, thus allowing migration of partner partition pages underneath server partition DMA activity. In this case, however, the H_MIGRATE_DMA algorithm is modified such that the server partition’s IOA’s TCE table is atomically updated, after the page migration but prior to enabling the IOA’s DMA, only when its contents are still a valid copy of the partner partition’s RTCE table TCE contents. The H_MIGRATE_DMA hcall() also serializes with H_PUT_RTCE so that new copies of the RTCE table TCE are not made during the migration of the underlying page. The server partition should never call H_MIGRATE_DMA for any Redirected RDMA mapped pages; however, as a check, the H_MIGRATE_DMA hcall() is enhanced to check the Logical Memory Block (LMB) owner in the TCE and reject the call if the LMB does not belong to the requester.
H_PUT_RTCE This hcall() maps “count” number of contiguous TCEs in an RTCE table to the same number of contiguous IOA TCEs. The H_REMOVE_RTCE hcall() is used to back out TCEs built with the H_PUT_RTCE hcall(); see for that hcall().
Syntax:
Parameters:
  r-liobn: Handle of RDMA RTCE table
  r-ioba: IO address per RDMA RTCE table
  liobn: Logical I/O Bus Number of server TCE table
  ioba: I/O address as seen by server IOA
  count: Number of consecutive 4 KB pages to map
Semantics:
  Validate r-liobn is from the second triple (second window pane) of the server partition’s “ibm,my-dma-window” property, else return H_Parameter.
  Validate r-ioba plus (count * 4 KB) is within the range of the RTCE table as specified by the window pane selected by r-liobn, else return H_Parameter.
  Validate that the TCE table associated with liobn is owned by the calling partition, else return H_Parameter. If the Shared Logical Resource option is implemented and the LIOBN represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  Validate that ioba plus (count * 4 KB) is within the range of the TCE table specified by liobn, else return H_Parameter. If the Shared Logical Resource option is implemented and the IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  For count entries (the following is done in a critical section with respect to updates to the r-ioba entry of the RTCE table TCE):
    Check that the r-ioba entry of the RTCE table contains a valid mapping (this requires having a completed partner connection), else return H_R_Parm with the value of the loop count in R4.
    Prevent more redirected mappings of the same r-ioba than the platform supports and/or duplicates: If the r-ioba entry of the RTCE table TCE contains a valid pointer, and if that pointer references a TCE that is a clone of the r-ioba entry of the RTCE table TCE, then an additional redirected mapping, if supported, is used; else return H_Resource with the value of the loop count in R4.
    Validate the liobn and ioba are not already mapped for this entry, else return H_IN_USE with the value of the loop count in R4.
    Validate the number of redirected mappings for the r-ioba does not exceed the “ibm,max-rtce-mappings” value for any of the adapters mapped by the RTCE, else return H_Resource with the value of the loop count in R4.
    Validate the number of redirected mappings for the r-ioba does not exceed the per client IOBA value returned from the H_VIOCTL GET_MAX_REDIRECTED_MAPPINGS subfunction, else return H_Resource with the value of the loop count in R4.
    Validate the new entry will not cause the number of additional redirected mappings that have already been made for this r-liobn to exceed the maximum retrieved by the H_VIOCTL GET_MAX_REDIRECTED_MAPPINGS subfunction, else return H_Resource with the value of the loop count in R4.
    Copy the DMA address mapping from the r-ioba entry of the r-liobn RTCE table to the ioba entry of the liobn TCE table, and save a pointer to the ioba entry of the liobn TCE table in the r-ioba entry of the r-liobn RTCE table, or in a separate structure associated with the r-liobn RTCE table.
  End Loop (the critical section lasts for one iteration of the loop)
  Return H_Success
Implementation Note: The PA requires the OS to issue a sync instruction to precede the signalling of an IOA to start an I/O operation involving DMA, to guarantee the global visibility of both DMA and TCE data. This hcall() does not include a sync instruction to guarantee global visibility of TCE data and in no way diminishes the requirement for the OS to issue it.
Implementation Note: The execution time for this hcall() is expected to be a linear function of the count parameter. Excessive size of the count parameter may cause an extended delay.
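Because partial failures return the failing loop count in R4, a server-side caller can back out exactly the entries that were established. A hedged C sketch (wrapper signatures follow the parameter lists above but are assumptions of this sketch):

    #include <stdint.h>

    /* Hypothetical wrappers; "r4" receives the loop count on
     * partial-failure returns. */
    extern long h_put_rtce(uint64_t r_liobn, uint64_t r_ioba,
                           uint64_t liobn, uint64_t ioba,
                           uint64_t count, uint64_t *r4);
    extern long h_remove_rtce(uint64_t r_liobn, uint64_t r_ioba,
                              uint64_t liobn, uint64_t ioba,
                              uint64_t count, uint64_t tce_value);
    #define H_SUCCESS 0

    /* Map "count" client pages for Redirected RDMA; on a partial
     * failure, back out the entries established before the failing one. */
    long map_for_redirected_rdma(uint64_t r_liobn, uint64_t r_ioba,
                                 uint64_t liobn, uint64_t ioba,
                                 uint64_t count)
    {
        uint64_t done = 0;
        long rc = h_put_rtce(r_liobn, r_ioba, liobn, ioba, count, &done);
        if (rc != H_SUCCESS && done > 0)
            h_remove_rtce(r_liobn, r_ioba, liobn, ioba, done,
                          0 /* tce-value: leave pages no-access */);
        return rc;
    }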
H_PUT_RTCE_INDIRECT This hcall() maps “count” number of potentially non-contiguous TCEs in an RTCE table to the same number of contiguous IOA TCEs. The H_REMOVE_RTCE hcall() is used to back out TCEs built with the H_PUT_RTCE_INDIRECT hcall(); see for that hcall().
Syntax:
Parameters:
  buff-addr: The logical address of a page (4 KB, on a 4 KB boundary) containing a list of r-ioba values to be mapped using the r-liobn RTCE table
  r-liobn: Handle of the RTCE table to be used with the r-ioba entries in the indirect buffer (second window pane from the server’s “ibm,my-dma-window” property)
  liobn: Logical I/O Bus Number of server TCE table
  ioba: I/O address as seen by server IOA
  count: Number of consecutive IOA bus 4 KB pages to map (number of entries in the buffer)
Semantics:
  Validate r-liobn is from the second triple (second window pane) of the server partition’s “ibm,my-dma-window” property, else return H_Parameter.
  Validate buff-addr points to the beginning of a 4 KB page owned by the calling partition, else return H_Parameter. If the Shared Logical Resource option is implemented and the logical address’s page number represents a page that has been rescinded by the owner, return H_RESCINDED.
  Validate that the TCE table associated with liobn is owned by the calling partition, else return H_Parameter. If the Shared Logical Resource option is implemented and the LIOBN represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  Validate that ioba plus (count * 4 KB) is within the range of the TCE table specified by liobn, else return H_Parameter. If the Shared Logical Resource option is implemented and the IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  If the count field is greater than 512, return H_Parameter.
  Copy (count * 8) bytes from the page specified by buff-addr to a temporary hypervisor page for contents verification and processing (this avoids the problem of the caller changing call-by-reference values after they are checked).
  For count entries: Validate the r-ioba entry in the temporary page is within the range of the RTCE table as specified by r-liobn, else place the count number in R4 and return H_R_Parm. End loop.
  For count validated entries in the hypervisor's temporary page (the following is done in a critical section with respect to updates to the r-ioba entry of the RTCE table):
    Check that the r-ioba entry of the r-liobn RTCE table contains a valid mapping (this requires having a completed partner connection), else return H_R_Parm with the count number in R4.
    Prevent more redirected mappings of the same r-ioba than the platform supports and/or duplicates: If the r-ioba entry of the RTCE table TCE contains a valid pointer, and if that pointer references a TCE that is a clone of the r-ioba entry of the RTCE table TCE, then an additional redirected mapping, if supported, is used; else return H_Resource with the value of the loop count in R4.
    Validate the liobn and ioba are not already mapped for this entry, else return H_IN_USE with the value of the loop count in R4.
    Validate the number of redirected mappings for the r-ioba does not exceed the “ibm,max-rtce-mappings” value for any of the adapters mapped by the RTCE, else return H_Resource with the value of the loop count in R4.
    Validate the number of redirected mappings for the r-ioba does not exceed the per client IOBA value returned from the H_VIOCTL GET_MAX_REDIRECTED_MAPPINGS subfunction, else return H_Resource with the value of the loop count in R4.
    Validate the new entry will not cause the number of additional redirected mappings that have already been made for this r-liobn to exceed the maximum retrieved by the H_VIOCTL GET_MAX_REDIRECTED_MAPPINGS subfunction, else return H_Resource with the value of the loop count in R4.
    Copy the DMA address mapping from the r-ioba entry of the r-liobn RTCE table to the ioba entry of the liobn TCE table, and save a pointer to the ioba entry of the liobn TCE table in the r-ioba entry of the r-liobn RTCE table, or into a separate structure associated with the r-liobn RTCE table.
  End Loop (the critical section lasts for one iteration of the loop)
  Return H_Success
Implementation Note: The PA requires the OS to issue a sync instruction to precede the signalling of an IOA to start an I/O operation involving DMA, to guarantee the global visibility of both DMA and TCE data. This hcall() does not include a sync instruction to guarantee global visibility of TCE data and in no way diminishes the requirement for the OS to issue it.
Implementation Note: The execution time for this hcall() is expected to be a linear function of the count parameter. Excessive size of the count parameter may cause an extended delay.
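The indirect form trades one hcall() per page for one page-sized list per hcall(). A hedged C sketch of building the list (wrapper signature assumed, per the parameter list above); note the hypervisor copies the list to its own page, so the caller may reuse the page immediately after the call returns:

    #include <stdint.h>
    #include <stddef.h>

    /* Hypothetical wrapper; parameters follow the list above. */
    extern long h_put_rtce_indirect(uint64_t buff_addr, uint64_t r_liobn,
                                    uint64_t liobn, uint64_t ioba,
                                    uint64_t count, uint64_t *r4);
    #define RTCE_IND_MAX 512 /* at most 512 eight-byte entries per call */

    /* Map a scatter list of client pages to a contiguous IOA DMA range.
     * "ind_page" is a caller-owned 4 KB page whose logical address is
     * "ind_page_la"; it is filled with the r-ioba list before the call. */
    long map_scatter_list(uint64_t *ind_page, uint64_t ind_page_la,
                          uint64_t r_liobn, uint64_t liobn, uint64_t ioba,
                          const uint64_t *r_iobas, uint64_t count)
    {
        if (count > RTCE_IND_MAX)
            return -1; /* caller must split into multiple hcalls */
        for (uint64_t i = 0; i < count; i++)
            ind_page[i] = r_iobas[i]; /* hypervisor validates each entry */
        return h_put_rtce_indirect(ind_page_la, r_liobn, liobn, ioba,
                                   count, NULL);
    }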
H_REMOVE_RTCE The H_REMOVE_RTCE hcall() is used to back out TCEs built with the H_PUT_RTCE and H_PUT_RTCE_INDIRECT hcall()s. That is, it removes the TCEs from the IOA TCE table and the links put into the RTCE table as a result of the H_PUT_RTCE or H_PUT_RTCE_INDIRECT hcall()s.
Syntax:
Parameters:
  r-liobn: Handle of RDMA RTCE table
  r-ioba: IO address per RDMA RTCE table
  liobn: Logical I/O Bus Number of server TCE table
  ioba: I/O address as seen by server IOA
  count: Number of consecutive 4 KB pages to unmap
  tce-value: TCE value to be put into the IOA TCE(s) after setting the “Page Mapping and Control” bits to “Page fault (no access)”
Semantics:
  Validate r-liobn is from the second triple (second window pane) of the server partition’s “ibm,my-dma-window” property, else return H_Parameter.
  Validate r-ioba plus (count * 4 KB) is within the range of the RTCE table as specified by the window pane selected by r-liobn, else return H_Parameter.
  Validate that the TCE table associated with liobn is owned by the calling partition, else return H_Parameter. If the Shared Logical Resource option is implemented and the LIOBN represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  Validate that ioba plus (count * 4 KB) is within the range of the TCE table specified by liobn, else return H_Parameter. If the Shared Logical Resource option is implemented and the IOBA represents a logical resource that has been rescinded by the owner, return H_RESCINDED.
  For count entries (the following is done in a critical section with respect to updates to the r-ioba entry of the RTCE table TCE):
    If it exists, invalidate the pointer in the r-ioba entry of the r-liobn RTCE table (or in a separate structure associated with the r-liobn RTCE table).
    Replace the ioba entry of the liobn TCE table with tce-value after setting the “Page Mapping and Control” bits to “Page fault (no access)”.
  End Loop (the critical section lasts for one iteration of the loop)
  Return H_Success
Implementation Note: The execution time for this hcall() is expected to be a linear function of the count parameter. Excessive size of the count parameter may cause an extended delay.
Redirected RDMA TCE Recovery and In-Flight DMA There are certain error or error recovery scenarios that may attempt to unmap a TCE in an IOA’s TCE table prior to the completion of the operation which set up the TCE. For example:
  A client attempts an H_PUT_TCE to its DMA window pane, which is mapped to the second window pane of the server’s DMA window, and the TCE in the RTCE table which is the target of the H_PUT_TCE already points to a valid TCE in an IOA’s TCE table.
  A client attempts an H_FREE_CRQ and the server’s second window pane for that virtual IOA contains a TCE which points to a valid TCE in an IOA’s TCE table.
  A client partition attempts to reboot (which essentially is an H_FREE_CRQ).
  A server attempts an H_FREE_CRQ and the server’s second window pane for that virtual IOA contains a TCE which points to a valid TCE in an IOA’s TCE table.
In such error and error recovery situations, the hypervisor attempts to prevent the changing of an IOA’s TCE to a value that would cause a non-recoverable IOA error. One method that the hypervisor may use to accomplish this is, on a TCE invalidation operation, to set the value of the read and write enable bits in the TCE to allow DMA writes but not reads, and to change the real page number in the TCE to target a dummy page. In this case the IOA receives an error (Target Abort) on attempts to read, while DMA writes (which were for a defunct operation) are silently dropped. This works well when all the following are true:
  The platform supports separate TCE read and write enable bits in the TCE.
  EEH is enabled and the DD can recover from the MMIO Stopped and DMA Stopped states.
  The IOA and the IOA’s DD can recover gracefully from Target Aborts (which are received on a read to a page where the read enable bit is off).
If these conditions are not true, then the hypervisor will need to try to prevent or delay invalidation of the TCEs. The H_Resource return from the H_FREE_CRQ, H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE can be used to hold off the invalidation until such time as the IOA can complete the operation and the server can invalidate the IOA’s TCE. In addition, the Bit Bucket Allowed LIOBN attribute and the H_LIOBN_ATTRIBUTES hcall() can be used to help enhance the recoverability in these error scenarios (see and for more information).
LIOBN Attributes There are certain LIOBN attributes that are made visible to and can be manipulated by partition software. The H_LIOBN_ATTRIBUTES hcall() is used to read and modify the attributes (see ). defines the attributes that are visible and manipulatable.

LIOBN Attributes
  Bits 0-62: Reserved.
  Bit 63: Bit Bucket Allowed.
    1: For an indirect IOA TCE invalidation operation (that is, via an operation other than an H_PUT_TCE directly to the TCE by the partition owning the IOA), the platform may set the value of the read and write enable bits in the TCE to allow DMA writes but not reads, and change the real page number in the TCE to target a dummy page (the IOA receives an error (Target Abort) on attempts to read, while DMA writes (which were for a defunct operation) are silently dropped).
    0: The platform must reasonably attempt to prevent an indirect (that is, via an operation other than an H_PUT_TCE directly to the TCE by the partition owning the IOA) modification of an IOA’s valid TCE, so that a possible in-flight DMA does not cause a non-recoverable error.
Software Implementation Notes: Changing this field when there are valid TCEs for the LIOBN may produce unexpected results. The hypervisor is not required to prevent such an operation. Therefore, the H_LIOBN_ATTRIBUTES call to change the value of this field should be made when there are no valid TCEs in the table for the IOA. This field may be implemented but not changeable (the actual value is returned in R4 as a result of the H_LIOBN_ATTRIBUTES hcall() regardless, with a status of H_Constrained if not changeable).
H_LIOBN_ATTRIBUTES
R1--1. If the H_LIOBN_ATTRIBUTES hcall is implemented, then it must implement the attributes as they are defined in and the syntax and semantics as defined in .
R1--2. The H_LIOBN_ATTRIBUTES hcall must ignore bits in the set-mask and reset-mask which are not implemented, must process as an exception those which cannot be changed (H_Constrained returned), and must return the following for the LIOBN Attributes in R4: a value of 0 for unimplemented bit positions, and the resultant field values for implemented fields.
Syntax:
Parameters:
  liobn: The LIOBN on which this attribute modification is to be performed.
  reset-mask: The bit-significant mask of bits to be reset in the LIOBN’s Attributes (the reset-mask bit definition aligns with the bit definition of the LIOBN’s Attributes, as defined in ). The complement of the reset-mask is ANDed with the LIOBN’s Attributes prior to applying the set-mask. See the semantics for more details on any field-specific actions needed during the reset operations. If a particular field position in the LIOBN Attributes is not implemented, then the corresponding bit(s) in the reset-mask are ignored.
  set-mask: The bit-significant mask of bits to be set in the LIOBN’s Attributes (the set-mask bit definition aligns with the bit definition of the LIOBN’s Attributes, as defined in ). The set-mask is ORed with the LIOBN’s Attributes after applying the reset-mask. See the semantics for more details on any field-specific actions needed during the set operations. If a particular field position in the LIOBN Attributes is not implemented, then the corresponding bit(s) in the set-mask are ignored.
Semantics:
  Validate that liobn belongs to the partition, else return H_Parameter.
  If the Bit Bucket Allowed field of the specified LIOBN’s Attributes is implemented and changeable, then set it to the result of: the Bit Bucket Allowed field contents ANDed with the complement of the corresponding bits of the reset-mask, and then ORed with the corresponding bits of the set-mask.
  Load R4 with the value of the LIOBN’s Attributes, with any unimplemented bits set to 0, and if all requested changes were made, then return H_Success; otherwise return H_Constrained.
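A small C sketch of the reset-then-set mask discipline for the one architected field. The wrapper signature is an assumption; the bit position follows the attribute table above (bit 63 in the architecture's big-endian bit numbering is the least significant bit of the 64-bit value).

    #include <stdint.h>

    /* Hypothetical wrapper; "r4" receives the resulting attributes. */
    extern long h_liobn_attributes(uint64_t liobn, uint64_t reset_mask,
                                   uint64_t set_mask, uint64_t *r4);
    extern const long H_CONSTRAINED; /* platform return-code value */
    #define H_SUCCESS 0

    /* Bit 63 (big-endian bit order) = least significant bit. */
    #define LIOBN_ATTR_BIT_BUCKET_ALLOWED 0x1ULL

    /* Enable Bit Bucket Allowed; per the Software Implementation Notes,
     * call only while no TCEs are valid for this LIOBN.  Returns 1 if
     * the bit is now set, 0 if the platform would not change it, or a
     * negative hcall status on other failures. */
    int enable_bit_bucket(uint64_t liobn)
    {
        uint64_t attrs = 0;
        long rc = h_liobn_attributes(liobn, 0 /* reset-mask */,
                                     LIOBN_ATTR_BIT_BUCKET_ALLOWED, &attrs);
        if (rc != H_SUCCESS && rc != H_CONSTRAINED)
            return (int)rc;
        return (attrs & LIOBN_ATTR_BIT_BUCKET_ALLOWED) ? 1 : 0;
    }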
Extensions to Other hcall()s for Redirected RDMA
H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE These hcall()s are only valid for the first window pane of the “ibm,my-dma-window” property. See for information about window pane types. The following are extensions that apply to the H_PUT_TCE, H_PUT_TCE_INDIRECT, and H_STUFF_TCE hcall()s in their use against an RTCE table. Recognize the validated (owned by the calling partition, else return H_Parameter) LIOBN as referring to an RTCE table (first window pane) and access accordingly:
  If the TCE is not from the first triple (first window pane) of the calling partition’s “ibm,my-dma-window” property, return H_Parameter.
  If the TCE is not currently in use: Clear/invalidate the TCE copy pointer and enter the TCE mapping per the input parameters to the hcall().
  If the TCE contains a valid mapping and the TCE copy pointer is invalid: Enter the TCE mapping per the input parameters to the hcall().
  If the TCE contains a valid mapping and the TCE copy pointer references a TCE that does not contain a valid copy of the previous mapping in the TCE: Clear/invalidate the TCE copy pointer and enter the TCE mapping per the input parameters to the hcall().
  If the TCE contains a valid mapping and the TCE copy pointer references a TCE that does contain a valid copy of the previous mapping in the TCE, then: If the Bit Bucket Allowed attribute of the LIOBN containing the TCE is 1, invalidate the copied TCE and enter the TCE mapping per the input parameters to the hcall(). If the Bit Bucket Allowed attribute of the LIOBN containing the TCE is 0, then return H_Resource or perform some other platform-specific error recovery.
H_MIGRATE_DMA Check that the pages referenced by the TCEs specified in the mappings to be migrated belong to the calling partition, else H_Parameter. If the mapping being migrated is via an RTCE table (that is, the LIOBN points to an RTCE table), then follow the valid redirected TCE pointer and migrate the redirected page (if the redirected TCE mapping is still a clone of the original RTCE table entry). If the mapping being migrated is via an RTCE table and the RTCE table TCEs were built with the H_MASS_MAP_TCE hcall(), then expand each mass mapped area into smaller 4 KB granularities, as necessary to avoid performance and locking issues, during the migration process. Insert checks, and potentially delays, to allow IOAs to make forward progress between successive DMA disables caused by multiple partner partitions making simultaneous uncoordinated calls to H_MIGRATE_DMA targeting the same IOA.
Subordinate Command/Response Queue (Sub-CRQ) The Sub-CRQ facility is used in conjunction with the CRQ facility, for some virtual IOA types, when more than one queue is needed for the virtual IOA. For information on the CRQ facility, see . For information on which virtual IOAs may use the Sub-CRQ facilities, see the applicable sections for the virtual IOAs. See for a comparison of the differences in the queue structures between CRQs and Sub-CRQs. In addition to the hcall()s specified in , all of the following hcall()s and RTAS calls are applicable to both CRQs and Sub-CRQs: H_XIRR, H_EOI, ibm,int-on, ibm,int-off, ibm,set-xive, ibm,get-xive.

CRQ and Sub-CRQ Comparison (Characteristic: CRQ / Sub-CRQ)
Queue entry size: 16 bytes / 32 bytes.
Transport and initialization events: Applicable / Not applicable (coordinated through the CRQ that is associated with the Sub-CRQ).
Registration: H_REG_CRQ / H_REG_SUB_CRQ.
Deregistration: H_FREE_CRQ / H_FREE_SUB_CRQ. Note: H_FREE_CRQ for the associated CRQ implicitly deregisters the associated Sub-CRQs.
Enable: H_ENABLE_CRQ / Not applicable.
Interrupt number: Obtained from the “interrupts” property / Obtained from H_REG_SUB_CRQ.
Interrupt enable/disable: H_VIO_SIGNAL / H_VIOCTL subfunction. For virtual IOAs that define the use of Sub-CRQs, the interrupt associated with the CRQ, as defined by the “interrupts” property in the OF device tree, may be enabled or disabled with either the H_VIOCTL or the H_VIO_SIGNAL hcall(). The CRQ interrupt associated with a CRQ of a virtual IOA that does not define the use of Sub-CRQs should be enabled and disabled by use of the H_VIO_SIGNAL hcall().
hcall() used to place entry on queue: H_SEND_CRQ / H_SEND_SUB_CRQ or H_SEND_SUB_CRQ_INDIRECT.
Number of queues per virtual IOA: One / Zero or more, depending on virtual IOA architecture, implementation, and client/server negotiation.
Sub-CRQ Format and Registration Each Sub-CRQ is built of one or more 4 KB pages aligned on a 4 KB boundary within partition memory, and is organized as a circular buffer of 32 byte long elements. Each queue is mapped into contiguous I/O addresses via the TCE mechanism and RTCE table (first window pane). The I/O address and length of each queue is registered by the process defined in . This registration process tells the hypervisor where to find the virtual IOA’s Sub-CRQ(s).
Sub-CRQ Entry Format Each Sub-CRQ entry consists of a 32 byte element. The first byte of a Sub-CRQ entry is the Header byte and is defined in .

Sub-CRQ Entry Header Byte Values (Header Value: Description)
0x00: Element is unused -- all other bytes in the element are undefined.
0x01 - 0x7F: Reserved.
0x80: Valid Command/Response entry.
0x81 - 0xFF: Reserved.
The platform (transport mechanism) ignores the contents of all non-header bytes in all Sub-CRQ entries. The operational state of any Sub-CRQ follows the operational state of the CRQ to which the Sub-CRQ is associated. That is, the CRQ transport is required to be operational in order for any associated Sub-CRQs to be operational (for example, if an H_SEND_CRQ hcall() would not succeed for any reason other than lack of space in the CRQ, then an H_SEND_SUB_CRQ or H_SEND_SUB_CRQ_INDIRECT hcall() to the associated Sub-CRQ would also fail). Hence, the Sub-CRQ transport does not implement the transport and initialization events that are implemented by the CRQ facility.
Sub-CRQ Entry Processing During the Sub-CRQ registration (H_REG_SUB_CRQ), the platform firmware sets all the header bytes of the Sub-CRQ being registered to zero (entry invalid). After registration, the first valid entry is placed in the first element, and the process proceeds to the end of the queue and then wraps around to the first entry again (given that the entry has been subsequently marked as invalid). This allows both the partition software and the transport firmware to maintain independent pointers to the next element each will be using. A sender uses an H_SEND_SUB_CRQ hcall() to enter one 32 byte message on its partner’s Sub-CRQ. Prior to enqueueing an entry on the Sub-CRQ, the platform first checks that the session to the partner’s associated CRQ is open and that there is enough free space on the Sub-CRQ; if not, it returns an error. If the checks succeed, the contents of the message are copied into the next free queue element, the receiver is potentially notified, and a successful status is returned to the caller. The caller may also insert more than one entry on the queue with one hcall() using H_SEND_SUB_CRQ_INDIRECT. Use of this hcall() requires that there be enough space on the queue for all the entries, otherwise none of the entries are placed onto the Sub-CRQ. At the receiver’s option, it may be notified via an interrupt when an element is enqueued to its Sub-CRQ. See . When the receiver has finished processing a Sub-CRQ entry, it writes the header byte to the value 0x00 to invalidate the entry and free it for future entries. Should the receiver wish to terminate or reset the communication channel, it deregisters the Sub-CRQ (H_FREE_SUB_CRQ) and, if it needs to re-establish communications, proceeds to register (H_REG_SUB_CRQ) either the same or a different section of memory as the new queue, with the queue pointers reset to the first entry. Deregistering a CRQ (H_FREE_CRQ) is an implicit deregistration of any Sub-CRQs associated with the CRQ.
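A hedged sketch of the receiver side of this processing follows, assuming a C layout for the 32 byte element and a driver-maintained cursor; process_entry() is a hypothetical consumer hook.

    #include <stdint.h>

    #define SUB_CRQ_VALID 0x80            /* header byte: valid entry */

    struct sub_crq_entry {                /* 32 byte circular-buffer element */
        volatile uint8_t header;          /* 0x00 = free, 0x80 = valid */
        uint8_t payload[31];
    };

    extern void process_entry(struct sub_crq_entry *e);  /* hypothetical */

    /* Drain every valid entry, freeing each element behind us. */
    static void service_sub_crq(struct sub_crq_entry *q, unsigned n,
                                unsigned *cur)
    {
        while (q[*cur].header == SUB_CRQ_VALID) {
            process_entry(&q[*cur]);
            q[*cur].header = 0x00;        /* free the element for future entries */
            *cur = (*cur + 1) % n;        /* wrap around the circular buffer */
        }
    }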
Sub-CRQ Facility Interrupt Notification The receiver can set the virtual interrupt associated with its Sub-CRQ to one of two modes. These are: Disabled (an enqueue interrupt is not signaled). Enabled (an enqueue interrupt is signaled on every enqueue). Note: An enqueue is considered a pulse, not a level. The pulse then sets the memory element within the emulated interrupt source controller. This allows the resetting of the interrupt condition by simply issuing the H_EOI hcall(), as is done with the PCI MSI architecture, rather than having to do an explicit interrupt reset as in the case with the PCI Level Sensitive Interrupt (LSI) architecture. The interrupt mechanism is capable of presenting only one interrupt signal at a time from any given interrupt source. Therefore, no additional interrupts from a given source are ever signaled until the previous interrupt has been processed through to the issuance of an H_EOI hcall(). Specifically, even if the interrupt mode is enabled, the effect is to interrupt on an empty to non-empty transition of the queue. However, as with any asynchronous posting operation, race conditions are to be expected. That is, an enqueue can happen in a window around the H_EOI hcall(). Therefore, the receiver should poll the Sub-CRQ (that is, look at the header byte of the next queue entry to see if the entry is valid) after an H_EOI to prevent losing initiative. The hcall() used to enable and disable this Sub-CRQ interrupt notification is H_VIO_SIGNAL (see ).
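The poll-after-H_EOI rule can be captured in a few lines. This sketch reuses service_sub_crq() and struct sub_crq_entry from the sketch above; h_eoi() is an assumed wrapper for the H_EOI hcall().

    extern void h_eoi(uint32_t xirr);   /* assumed wrapper for H_EOI */

    static void sub_crq_interrupt(uint32_t xirr, struct sub_crq_entry *q,
                                  unsigned n, unsigned *cur)
    {
        service_sub_crq(q, n, cur);     /* drain all valid entries */
        h_eoi(xirr);                    /* reset the edge-style interrupt */
        service_sub_crq(q, n, cur);     /* re-poll: an enqueue may have raced
                                           the H_EOI and gone unsignaled */
    }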
Extensions to Other hcall()s for Sub-CRQ
H_MIGRATE_DMA Since Sub-CRQs are RTCE table mapped, the H_MIGRATE_DMA hcall() may be requested to move a page that is part of a Sub-CRQ. The OS owner of the queue is responsible for preventing its processors from modifying the page during the migrate operation (as is standard practice with this hcall()); the H_MIGRATE_DMA hcall(), however, serializes with the Sub-CRQ hcall()s to direct new elements to the migrated target page.
H_XIRR, H_EOI The Sub-CRQ facility utilizes a virtual interrupt source number to notify the queue owner of new element enqueues. The standard H_XIRR and H_EOI hcall()s are extended to support this virtual interrupt mechanism, emulating the standard PowerPC Interrupt hardware with respect to the virtual interrupt source number.
Sub-CRQ Facility Requirements R1--1. For the Sub-CRQ facility: The platform must implement the Sub-CRQ as specified in . R1--2. For the Sub-CRQ facility: The platform must start enqueueing Commands/Responses to the newly registered Sub-CRQ starting at offset zero and proceeding as in a circular buffer, each entry being 32 byte aligned. R1--3. For the Sub-CRQ facility: The platform must enqueue Commands/Responses only if the 32 byte entry is free (header byte contains 0x00), else the enqueue operation fails. R1--4. For the Sub-CRQ facility: The first byte of a Sub-CRQ entry must be the Header byte and must be as defined in . R1--5. For the Sub-CRQ facility option: Platforms that implement the H_MIGRATE_DMA hcall() must implement that function for pages mapped for use by the Sub-CRQ. R1--6. For the Sub-CRQ facility: The platform must emulate the standard PowerPC External Interrupt Architecture for the interrupt source numbers associated with the virtual devices via the standard RTAS and hypervisor interrupt calls and must extend the H_XIRR and H_EOI hcall()s as appropriate for Sub-CRQ interrupts.
Partition Managed Class - Synchronous Infrastructure The architectural intent of the Synchronous VIO infrastructure is for platforms where the communicating partitions are under the control of the same hypervisor. Operations between the partitions are via synchronous hcall() operations. The Synchronous VIO infrastructure defines three options: the Reliable Command/Response Transport option (see ), the Subordinate CRQ Transport option (see ), and the Logical Remote DMA (LRDMA) option (see ).
Reliable Command/Response Transport Option For the synchronous infrastructure, the CRQ facility defined in is implemented via the Reliable Command/Response Transport option. The synchronous nature of this infrastructure allows for the capability to immediately (synchronously) notify the sender of the message whether the message was delivered successfully or not.
Reliable CRQ Format and Registration The format of the CRQ is as defined in . The I/O address and length of the queue are registered using the H_REG_CRQ hcall(). See .
Reliable CRQ Entry Format See .
Reliable CRQ Entry Processing A sender uses the H_SEND_CRQ hcall() to enter a 16 byte message on its partner’s queue. The hcall() takes the entire message as input parameters in two registers. See .
Reliable Command/Response Transport Interrupt Notification The receiver can enable and disable the virtual interrupt associated with its CRQ. See .
Reliable Command/Response Transport hcall()s The H_REG_CRQ and H_FREE_CRQ hcall()s are used by both client and server virtual IOA device drivers. It is the architectural intent that the hypervisor maintains a connection control structure for each defined partner/server connection. The H_REG_CRQ and its corresponding H_FREE_CRQ register and deregister partition resources with that connection control structure. However, several conditions can arise architecturally with this connection process (the design of an implementation may preclude some of these conditions): The connection to the partner virtual IOA is not defined (H_Not_Found). The CRQ registration function fails; the CRQ is not registered with the hypervisor. The partner virtual IOA may not have registered its CRQ (H_Closed). The CRQ is registered with the hypervisor and the connection; however, the connection is incomplete because the partner has not registered. The partner virtual IOA may already be connected to another partner virtual IOA (H_Resource). The CRQ registration function fails; the CRQ is not registered with the hypervisor or the connection. The reaction of the virtual IOA device driver to these conditions differs somewhat depending upon whether the calling device driver is for a client or a server IOA. Server IOAs in many cases register prior to their partner IOAs, since they are servers and subsequently wait for service requests from their clients. Therefore, the H_Closed return code is to be expected when the DD’s CRQ has been registered with the connection and is just waiting for the partner to register. Should a partner DD register its CRQ in the future, higher level protocol messages (via the Initialization Command/Response CRQ entry) can notify the server DD when the connection is established. If a client IOA registers and receives a return code of H_Closed, it may choose to deregister the CRQ and fail, since the client IOA would not be in a position to successfully send service requests using the CRQ facility, or it may wait and rely upon higher level CRQ messages (via the Initialization Command/Response CRQ entry) to tell it when its partner has registered. The reaction of virtual IOA DDs to H_Not_Found and H_Resource is dependent upon the functionality of higher level platform and system management policies. While the current registration has failed, higher level system and/or platform management actions may allow a future registration request to succeed. When registration succeeds, an association is made between the partner partition’s LIOBN (RTCE table) and the second window pane of the server partition. This association is dropped when either partner deregisters or terminates. However, on deregistration or termination, the RTCE tables associated with the local partition (first window pane) remain intact for that partition (see Requirement ).
H_REG_CRQ This hcall() registers the RTCE table mapped memory that contains the CRQ. Syntax: Parameters: unit-address: Unit Address per device tree node “reg” property queue: I/O address (offset into the RTCE table) of the CRQ buffer (starting on a 4 KB boundary). len: Length of the CRQ in bytes (a multiple of 4 KB) Semantics: Validate unit-address, else H_Parameter Validate queue, which is the I/O address of the CRQ (I/O addresses for entire buffer length starting at the specified I/O address are translated by the RTCE table, is 4 KB aligned, and length, len, is a multiple of 4 KB), else H_Parameter Validate that there is an authorized connection to another partition associated with the Unit Address, else H_Not_Found. Validate that the authorized connection to another partition associated with the Unit Address is free, else H_Resource. Initialize the CRQ enqueue pointer and length variables. These variables are kept in terms of I/O addresses so that page migration works and any remapping of TCEs is effective. Disable CRQ interrupts. Allow for Logical Remote DMA, when applicable, with associated partner partition when partner registers. If partner is already registered, then return H_Success, else return H_Closed.
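The following C sketch shows how a device driver might issue the registration and react to the return codes discussed above; h_reg_crq() is an assumed wrapper for the hcall(), and the return code values shown are illustrative.

    #include <stdint.h>

    enum {                        /* PAPR return codes (illustrative values) */
        H_SUCCESS = 0, H_BUSY = 1, H_CLOSED = 2,
        H_NOT_FOUND = -15, H_RESOURCE = -16
    };

    extern long h_reg_crq(uint64_t unit_address, uint64_t queue_ioba,
                          uint64_t len);   /* assumed wrapper */

    long register_crq(uint64_t unit_address, uint64_t queue_ioba, uint64_t len)
    {
        long rc;

        do {
            rc = h_reg_crq(unit_address, queue_ioba, len);
        } while (rc == H_BUSY);            /* transient; retry */

        switch (rc) {
        case H_SUCCESS:    /* partner already registered: connection is up */
            return 0;
        case H_CLOSED:     /* registered; wait for the partner's
                              Initialization Command/Response message */
            return 0;
        case H_NOT_FOUND:  /* connection to the partner is not defined */
        case H_RESOURCE:   /* partner already connected elsewhere */
        default:
            return -1;     /* defer to higher-level management or fail open */
        }
    }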
H_FREE_CRQ This hcall() deregisters the RTCE table mapped memory that contains the CRQ. In addition, if there are any Sub-CRQs associated with the CRQ, the H_FREE_CRQ has the effect of releasing the Sub-CRQs. Syntax: Parameters: unit-address: Unit Address per device tree node “reg” property Semantics: Validate unit-address, else H_Parameter Mark the connection to the associated partner partition as closed (so that send hcall()s from the partner partition fail). Mark the CRQ enqueue pointer and length variables as invalid. For any and all Sub-CRQs associated with the CRQ, do the following: Mark the connection to the associated partner partition as closed for the Sub-CRQ (so that send hcall()s from the partner partition fail). Mark the Sub-CRQ enqueue pointer and length variables for the Sub-CRQ as invalid. Disable Sub-CRQ interrupts for the Sub-CRQ. Disable CRQ interrupts. If there exist any Redirected TCEs in the local TCE tables associated with this Virtual IOA, and all of those tables have a Bit Bucket Allowed attribute of 1, then disable Logical Remote DMA with the associated partner partition, if enabled, invalidating any Redirected TCEs in the local TCE tables (for information on invalidation of TCEs, see ). If there exist any Redirected TCEs in the local TCE tables associated with this Virtual IOA, and any of those tables have a Bit Bucket Allowed attribute of 0, then return H_Resource or perform some other platform-specific error recovery. Send a partner terminated message to the partner queue (if it is still registered), overlaying the last valid entry in the queue if the CRQ is full. Return H_Success. Implementation Note: If the hypervisor returns an H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must call H_FREE_CRQ again with the same parameters. Software may choose to treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy. The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, will have placed the virtual adapter in a state that will cause it to not accept any new work nor surface any new virtual interrupts (no new entries will be placed on the CRQ).
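A minimal sketch of the retry loop the Implementation Note calls for, assuming h_free_crq() and msleep() wrappers and illustrative return code values:

    #include <stdint.h>

    enum { H_BUSY = 1,                        /* illustrative encodings */
           H_LONG_BUSY_ORDER_1_MSEC = 9900,
           H_LONG_BUSY_ORDER_10_MSEC = 9901 };

    extern long h_free_crq(uint64_t unit_address);  /* assumed wrapper */
    extern void msleep(unsigned ms);                /* assumed sleep primitive */

    void free_crq(uint64_t unit_address)
    {
        long rc;

        do {                                  /* same parameters on every retry */
            rc = h_free_crq(unit_address);
            if (rc == H_LONG_BUSY_ORDER_1_MSEC)
                msleep(1);
            else if (rc == H_LONG_BUSY_ORDER_10_MSEC)
                msleep(10);
        } while (rc == H_BUSY ||
                 rc == H_LONG_BUSY_ORDER_1_MSEC ||
                 rc == H_LONG_BUSY_ORDER_10_MSEC);
    }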
H_SEND_CRQ This hcall() sends one 16 byte entry to the partner partition’s registered CRQ. Syntax: Parameters: unit-addr: Unit Address per device tree node “reg” property msg-high: header: high order bit is on -- a header value of 0xFF is reserved for transport error and is invalid for input. format: not checked by the firmware. msg-low: not checked by the firmware -- should be consistent with the definition of the format byte. Semantics: Validate the Unit Address, else return H_Parameter Validate that the msg header byte has its high order bit on and that it is not = 0xFF, else return H_Parameter. Validate that there is an authorized connection to another partition associated with the Unit Address and that the associated CRQ is enabled, else return H_Closed. Enter Critical Section on target CRQ Validate that there is room on the receive queue for the message and allocate that message, else exit Critical Section and return H_Dropped. Store msg-low into the second 8 bytes of the allocated queue element. Store order barrier Store msg-high into the first 8 bytes of the allocated queue element (setting the header valid bit). Exit Critical Section If receiver queue interrupt mode == enabled, then signal interrupt Return H_Success.
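A hedged sketch of a send, including the header check the firmware performs; h_send_crq() is an assumed wrapper that passes the message in two doubleword parameters.

    #include <stdint.h>

    enum { H_PARAMETER = -4 };                      /* illustrative encoding */

    extern long h_send_crq(uint64_t unit_address,   /* assumed wrapper */
                           uint64_t msg_high, uint64_t msg_low);

    long send_crq(uint64_t unit_address, uint64_t msg_high, uint64_t msg_low)
    {
        uint8_t header = msg_high >> 56;  /* high order byte of msg-high */

        /* Firmware requires the valid bit on and rejects the reserved 0xFF. */
        if (!(header & 0x80) || header == 0xFF)
            return H_PARAMETER;

        /* H_Dropped means the partner's queue had no free element; the
         * caller may retry or escalate per its higher-level protocol. */
        return h_send_crq(unit_address, msg_high, msg_low);
    }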
H_ENABLE_CRQ This hcall() explicitly enables a CRQ that has been disabled due to a Partner partition suspended transport event. As a side effect of this hcall(), all pages that are mapped via the logical TCE table associated with the first pane of “ibm,my-dma-window” property of the associated virtual IOA are restored prior to successful completion of the hcall(). It is the architectural intent that this hcall() is made while the logical TCE contains mappings for all the pages that will be involved in the recovery of the outstanding I/O operations at the time of the partition migration. Further, it is the architectural intent that this hcall() is made from a processing context that can handle the expected busy wait return code without blocking the processor. Syntax: Parameters: unit-addr: Unit Address per device tree node “reg” property Semantics: Validate the Unit Address, else return H_Parameter Test that all pages mapped through the logical TCE table associated with the first pane of the “ibm,my-dma-window” property associated with the unit-address parameter are present; else return H_LongBusyOrder10mSec. Set the status of the CRQ associated with the unit-address parameter to enabled. Return H_Success.
Reliable Command/Response Transport Option Requirements R1--1. For the Reliable Command/Response Transport option: The platform must implement the CRQ facility, as defined in . R1--2. For the Reliable Command/Response Transport option: The platform must implement the H_REG_CRQ hcall(). See . R1--3. For the Reliable Command/Response Transport option: The platform must implement the H_FREE_CRQ hcall(). See . R1--4. For the Reliable Command/Response Transport option: The platform must implement the H_SEND_CRQ hcall(). See . R1--5. For the Reliable Command/Response Transport option: The platform must implement the H_ENABLE_CRQ hcall(). See .
Logical Remote DMA (LRDMA) Option The Logical Remote Direct Memory Access (LRDMA) option allows a server partition to securely target memory pages within a partner partition for VIO operations. This architecture defines two modes of RDMA: Copy RDMA is used to have the hypervisor copy data between a buffer in the server partition’s memory and a buffer in the partner partition’s memory. See for more information on Copy RDMA with respect to LRDMA. Redirected RDMA allows a server partition to securely target its I/O adapter's DMA operations directly at the memory pages of the partner partition. The platform overhead of Copy RDMA is generally greater than that of Redirected RDMA, but this overhead may be offset if the server partition’s DMA buffer is being used as a data cache for multiple VIO operations. See for more information on Redirected RDMA with respect to LRDMA. The mapping between the LIOBN in the second pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding partner IOA’s RTCE table is made when the CRQ successfully completes registration. The partner partition is not aware of whether the server partition is using Copy RDMA or Redirected RDMA. The server partition uses the Logical RDMA mode that best suits its needs for a given VIO operation. See for more information on RTCE tables.
Copy RDMA The Copy RDMA hcall()s are used to request that the hypervisor move data between partitions. The specific implementation is optimized to the platform’s hardware features. There are calls for when both source and destination buffers are RTCE table mapped (H_COPY_RDMA) and when only the remote buffers are mapped (H_WRITE_RDMA and H_READ_RDMA).
H_COPY_RDMA This hcall() copies data from an RTCE table mapped buffer in one partition to an RTCE table mapped buffer in another partition, with the length of the transfer being specified by the transfer length parameter in the hcall(). The “ibm,max-virtual-dma-size” property, if it exists (in the /vdevice node), specifies the maximum length of the transfer (the minimum value of this property is 128 KB). Syntax: Parameters: len: Length of transfer (length not to exceed the value in the “ibm,max-virtual-dma-size” property, if it exists) s-liobn: LIOBN (RTCE table handle) of V-DMA source buffer s-ioba: I/O address of V-DMA source buffer d-liobn: LIOBN (RTCE table handle) of V-DMA destination buffer d-ioba: I/O address of V-DMA destination buffer Semantics: Serialize access to RTCE tables with H_MIGRATE_DMA. If the “ibm,max-virtual-dma-size” property exists in the /vdevice node of the device tree, then if the value of len is greater than the value of this property, return H_Parameter. Source and destination LIOBNs are checked for authorization per the “ibm,my-dma-window” property, else return H_S_Parm or H_D_Parm, respectively. Source and destination ioba’s and length are checked for valid ranges per the “ibm,my-dma-window” property, else return H_S_Parm or H_D_Parm, respectively. The access bits of the associated TCEs are checked for authorization, else return H_Permission. Copy len number of bytes from the buffer starting at the specified source address to the buffer starting at the specified destination address, then return H_Success.
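Because a single transfer is bounded by “ibm,max-virtual-dma-size”, a server would typically loop over larger buffers. A sketch under an assumed h_copy_rdma() wrapper:

    #include <stdint.h>

    enum { H_SUCCESS = 0 };

    extern long h_copy_rdma(uint64_t len,           /* assumed wrapper */
                            uint64_t s_liobn, uint64_t s_ioba,
                            uint64_t d_liobn, uint64_t d_ioba);

    /* max_vdma is the "ibm,max-virtual-dma-size" value (at least 128 KB). */
    long copy_rdma_all(uint64_t len, uint64_t s_liobn, uint64_t s_ioba,
                       uint64_t d_liobn, uint64_t d_ioba, uint64_t max_vdma)
    {
        while (len > 0) {
            uint64_t chunk = len < max_vdma ? len : max_vdma;
            long rc = h_copy_rdma(chunk, s_liobn, s_ioba, d_liobn, d_ioba);
            if (rc != H_SUCCESS)
                return rc;
            s_ioba += chunk;
            d_ioba += chunk;
            len    -= chunk;
        }
        return H_SUCCESS;
    }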
H_WRITE_RDMA This hcall() copies up to 48 bytes of data from a set of input parameters to an RTCE table mapped buffer in another partition. Syntax: Parameters: len: Length of transfer d-liobn: LIOBN (RTCE table handle) of V-DMA destination buffer d-ioba: I/O address of V-DMA destination buffer data1: Source data data2: Source data data3: Source data data4: Source data data5: Source data data6: Source data Semantics: Check that the len parameter is >= 0 and <= 48, else return H_Parameter. The destination LIOBN is checked for authorization per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_D_Parm. The destination ioba and length are checked for valid ranges per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_D_Parm. Serialize access to the destination RTCE table with H_MIGRATE_DMA. The access bits of the associated RTCE table TCEs are checked for authorization, else return H_Permission. Copy len number of bytes from the data parameters, starting at the high order byte of data1 and proceeding toward the low order byte of data6, into the buffer starting at the specified destination address, then return H_Success.
H_READ_RDMA This hcall() copies up to 72 bytes of data from an RTCE table mapped buffer into a set of return registers. Syntax: Parameters: len: Length of transfer s-liobn: LIOBN (RTCE table handle) of V-DMA source buffer s-ioba: I/O address of V-DMA source buffer Semantics: Check that the len parameter is >= 0 and <= 72, else return H_Parameter. The source LIOBN is checked for authorization per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_S_Parm. The source ioba and length are checked for valid ranges per the remote triple of the calling partition’s “ibm,my-dma-window” property, else return H_S_Parm. Serialize access to the source RTCE table with H_MIGRATE_DMA. The access bits of the associated RTCE table TCEs are checked for authorization, else return H_Permission. Copy len number of bytes from the source data buffer specified by s-liobn starting at s-ioba into registers R4 through R12, starting with the high order byte of R4 and proceeding toward the low order byte of R12, then return H_Success.
Logical Remote DMA Option Requirements R1--1. For the Logical Remote DMA option: The platform must implement the H_PUT_RTCE hcall() as specified in . R1--2. For the Logical Remote DMA option: The platform must implement the extensions to the H_PUT_TCE hcall() as specified in . R1--3. For the Logical Remote DMA option: The platform must implement the extensions to the H_MIGRATE_DMA hcall() as specified in . R1--4. For the Logical Remote DMA option: The platform must implement the H_COPY_RDMA hcall() as specified in . R1--5. For the Logical Remote DMA option: The platform must disable Logical Remote DMA operations that target an inactive partition (one that has terminated), including the H_COPY_RDMA hcall() and the H_PUT_RTCE hcall(). Implementation Note: It is expected that as part of meeting Requirement , all of the terminating partition’s TCE table entries (regular and RTCE) are invalidated along with any clones (for information on invalidation of TCEs, see ). While other mechanisms are available for meeting this requirement in the case of H_COPY_RDMA, this is the only method for Redirected RDMA, and since it works in both cases, it is expected that implementations will use this single mechanism.
Subordinate CRQ Transport Option For the synchronous infrastructure, in addition to the CRQ facility defined in , the Subordinate CRQ Transport option may also be implemented in conjunction with the CRQ facility. That is, the Subordinate CRQ Transport option requires that the Reliable Command/Response Transport option also be implemented. For this option, the Sub-CRQ facility defined in is implemented.
Sub-CRQ Format and Registration The format of the Sub-CRQ is as defined in . The I/O address and length of the queue are registered using the H_REG_SUB_CRQ hcall(). See .
Sub-CRQ Entry Format See .
Sub-CRQ Entry Processing A sender uses the H_SEND_SUB_CRQ or H_SEND_SUB_CRQ_INDIRECT hcall() to enter one or more 32 byte messages on its partner’s queue. See and .
Sub-CRQ Transport Interrupt Notification The receiver can enable and disable the virtual interrupt associated with its Sub-CRQ using the H_VIOCTL hcall(), with the appropriate subfunction. See . The interrupt number that is used in the H_VIOCTL call is obtained from the H_REG_SUB_CRQ call that is made to register the Sub-CRQ.
Sub-CRQ Transport hcall()s The H_REG_SUB_CRQ and H_FREE_SUB_CRQ hcall()s are used by both client and server virtual IOA device drivers. It is the architectural intent that the hypervisor maintains a connection control structure for each defined partner/server connection. The H_REG_SUB_CRQ and its corresponding H_FREE_SUB_CRQ register and deregister partition resources with that connection control structure. However, several conditions can arise architecturally with this connection process (the design of an implementation may preclude some of these conditions): The connection to the partner virtual IOA is not defined (H_Not_Found). The partner virtual IOA CRQ connection may not have been completed (H_Closed). The partner may deregister its CRQ, which also deregisters any associated Sub-CRQs.
H_REG_SUB_CRQ This hcall() registers the RTCE table mapped memory that contains the Sub-CRQ. Multiple Sub-CRQ registrations may be attempted for each virtual IOA. If resources are not available to establish a Sub-CRQ, the H_REG_SUB_CRQ call fails with H_Resource. Programming Note: On platforms that implement the partition migration option, after partition migration the support for this hcall() might change, and the caller should be prepared to receive an H_Function return code indicating the platform does not implement this hcall(). If this architecture requires the presence of this hcall() for a virtual IOA that exists in the device tree after the migration, then it can be expected that the hcall() will exist as well. Syntax: Parameters: unit-address: Unit Address per device tree node “reg” property. Sub-CRQ-ioba: I/O address (offset into the RTCE table, as specified by the first window pane of the virtual IOA’s “ibm,my-dma-window” property) of the Sub-CRQ buffer (starting on a 4 KB boundary). Sub-CRQ-length: Length of the Sub-CRQ in bytes (a multiple of 4 KB). Semantics: Validate unit-address, else H_Parameter. Validate Sub-CRQ-ioba, which is the I/O address of the Sub-CRQ (I/O addresses for the entire buffer length starting at the specified I/O address are translated by the RTCE table, the address is 4 KB aligned, and the length, Sub-CRQ-length, is a multiple of 4 KB), else H_Parameter. Validate that there are sufficient resources associated with the Unit Address to allocate the Sub-CRQ, else H_Resource. Initialize the Sub-CRQ enqueue pointer and length variables. These variables are kept in terms of I/O addresses so that page migration works and any remapping of TCEs is effective. Initialize all Sub-CRQ entry header bytes to 0x00 (invalid). Disable Sub-CRQ interrupts. Place a cookie representing the Sub-CRQ number (to be used in H_SEND_SUB_CRQ, H_SEND_SUB_CRQ_INDIRECT, and H_FREE_SUB_CRQ) in R4. Place the interrupt number (the same as will be returned by H_XIRR or H_IPOLL for the interrupt from this Sub-CRQ) in R5. If the CRQ connection is already complete, then return H_Success, else return H_Closed.
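Since this hcall() returns two values (the queue cookie in R4 and the interrupt number in R5), a driver-side wrapper must expose both. A hedged sketch, with h_reg_sub_crq() as an assumed wrapper:

    #include <stdint.h>

    struct sub_crq_handle {
        uint64_t cookie;  /* R4: used by H_SEND_SUB_CRQ* and H_FREE_SUB_CRQ */
        uint64_t irq;     /* R5: later presented by H_XIRR/H_IPOLL */
    };

    /* Assumed wrapper exposing the hcall()'s R4/R5 outputs via pointers. */
    extern long h_reg_sub_crq(uint64_t unit_address, uint64_t ioba,
                              uint64_t len, uint64_t *cookie, uint64_t *irq);

    long register_sub_crq(uint64_t unit_address, uint64_t ioba, uint64_t len,
                          struct sub_crq_handle *h)
    {
        /* H_Closed: the Sub-CRQ is allocated but the parent CRQ connection
         * is not yet complete; H_Resource: no room for another queue. */
        return h_reg_sub_crq(unit_address, ioba, len, &h->cookie, &h->irq);
    }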
H_FREE_SUB_CRQ This hcall() deregisters the RTCE table mapped memory that contains the Sub-CRQ. Note that the H_FREE_CRQ hcall() also deregisters any Sub-CRQs associated with the CRQ being deregistered by that hcall(). Programming Note: On platforms that implement the partition migration option, after partition migration the support for this hcall() might change, and the caller should be prepared to receive an H_Function return code indicating the platform does not implement this hcall(). If this architecture requires the presence of this hcall() for a virtual IOA that exists in the device tree after the migration, then it can be expected that the hcall() will exist as well. Syntax: Parameters: unit-address: Unit Address per device tree node “reg” property. Sub-CRQ-num: The queue # cookie returned from the H_REG_SUB_CRQ hcall() at queue registration time. Semantics: Validate unit-address and Sub-CRQ-num, else H_Parameter Mark the connection to the associated partner partition as closed for the specified Sub-CRQ (so that send hcall()s from the partner partition fail). Mark the Sub-CRQ enqueue pointer and length variables for the specified Sub-CRQ as invalid. Disable Sub-CRQ interrupts for the specified Sub-CRQ. Return H_Success.
H_SEND_SUB_CRQ This hcall() sends one 32 byte entry to the partner partition’s registered Sub-CRQ. Programming Note: On platforms that implement the partition migration option, after partition migration the support for this hcall() might change, and the caller should be prepared to receive an H_Function return code indicating the platform does not implement this hcall(). If this architecture requires the presence of this hcall() for a virtual IOA that exists in the device tree after the migration, then it can be expected that the hcall() will exist as well. Syntax: Parameters: unit-addr: Unit Address per device tree node “reg” property. Sub-CRQ-num: The queue # cookie returned from the H_REG_SUB_CRQ hcall() at queue registration time. msg-dword0: firmware checks only the high order byte. msg-dword1, msg-dword2, msg-dword3: the rest of the message; firmware does not validate. Semantics: Validate the Unit Address, else return H_Parameter. Validate that the Sub-CRQ, as specified by Sub-CRQ-num, is properly registered by the partner, else return H_Parameter. Validate that the message header byte (high order byte of msg-dword0) is 0x80, else return H_Parameter. Validate that there is an authorized CRQ connection to another partition associated with the Unit Address and that the associated CRQ is enabled, else return H_Closed. Enter Critical Section on target Sub-CRQ. Validate that there is room on the specified Sub-CRQ for the message and allocate that message, else exit Critical Section and return H_Dropped. Store msg-dword1 into bytes 8-15 of the allocated queue element. Store msg-dword2 into bytes 16-23 of the allocated queue element. Store msg-dword3 into bytes 24-31 of the allocated queue element. Store order barrier. Store msg-dword0 into bytes 0-7 of the allocated queue element (this sets the valid bit in the header byte). Exit Critical Section. If receiver queue interrupt mode is enabled, then signal interrupt. Return H_Success.
H_SEND_SUB_CRQ_INDIRECT This hcall() sends one or more 32 byte entries to the partner partition’s registered Sub-CRQ. On H_Success, all of the entries have been put onto the Sub-CRQ. On any return code other than H_Success, none of the entries have been put onto the Sub-CRQ. Programming Note: On platforms that implement the partition migration option, after partition migration the support for this hcall() might change, and the caller should be prepared to receive an H_Function return code indicating the platform does not implement this hcall(). If this architecture requires the presence of this hcall() for a virtual IOA that exists in the device tree after the migration, then it can be expected that the hcall() will exist as well. The maximum num-entries has increased on some platforms from 16 to 128. On platforms that implement the partition migration option, after partition migration the support for this hcall() might change, and the caller should be prepared to receive an H_Parameter return code in the situation where more than 16 num-entries have been sent, indicating the platform does not support more than 16 num-entries. Syntax: Parameters: unit-addr: Unit Address per device tree node “reg” property. Sub-CRQ-num: The Sub-CRQ # cookie returned from the H_REG_SUB_CRQ hcall() at queue registration time. ioba: The I/O address of the TCE-mapped page which contains the entries to be placed onto the specified Sub-CRQ. num-entries: Number of entries to be placed onto the specified Sub-CRQ from the TCE mapped page starting at ioba (the maximum number of entries is 16, or 128 on platforms that support the larger limit, in order to bound the hcall() time). Semantics: Validate the Unit Address, else return H_Parameter. Validate that the Sub-CRQ, as specified by Sub-CRQ-num, is properly registered by the partner, else return H_Parameter. If ioba is outside of the range of the calling partition assigned values, then return H_Parameter. If num-entries is not in the range of 1 to the platform maximum (16, or 128 on platforms that support the larger limit), then return H_Parameter. Validate that there is an authorized CRQ connection to another partition associated with the Unit Address and that the associated CRQ is enabled, else return H_Closed. Copy (num-entries * 32) bytes from the page specified starting at ioba to a temporary hypervisor page for contents verification and processing (this avoids the problem of the caller changing call by reference values after they are checked). Validate that the message header bytes for num-entries starting at ioba are 0x80, else return H_Parameter. Enter Critical Section on target Sub-CRQ. Validate that there is room on the specified Sub-CRQ for num-entries messages and allocate those messages, else exit Critical Section and return H_Dropped. For each of the num-entries starting at ioba: Store entry bytes 1-31 into bytes 1-31 of the allocated queue element. Store order barrier. Store entry byte 0 into byte 0 of the allocated queue element (this sets the valid bit in the header byte). Loop. Exit Critical Section. If receiver queue interrupt mode is enabled, then signal interrupt. Return H_Success.
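A hedged sketch of a batched send follows; h_send_sub_crq_indirect() is an assumed wrapper, the entry layout matches the earlier Sub-CRQ sketch, and the fallback on H_Parameter for batches above 16 reflects the Programming Note above.

    #include <stdint.h>

    enum { H_SUCCESS = 0, H_PARAMETER = -4 };  /* illustrative encodings */

    struct sub_crq_entry {                     /* as in the earlier sketch */
        volatile uint8_t header;
        uint8_t payload[31];
    };

    extern long h_send_sub_crq_indirect(uint64_t unit_address,  /* assumed */
                                        uint64_t cookie, uint64_t ioba,
                                        uint64_t num_entries);

    /* Stage n entries (header 0x80) in one TCE-mapped page, then send.
     * Either all land on the Sub-CRQ (H_Success) or none do. */
    long send_batch(uint64_t unit_address, uint64_t cookie,
                    struct sub_crq_entry *page /* TCE-mapped */,
                    uint64_t page_ioba, unsigned n)
    {
        for (unsigned i = 0; i < n; i++)
            page[i].header = 0x80;             /* mark each entry valid */

        long rc = h_send_sub_crq_indirect(unit_address, cookie, page_ioba, n);
        if (rc == H_PARAMETER && n > 16)
            return -1;  /* platform may cap batches at 16; split and resend */
        return rc;
    }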
Subordinate CRQ Transport Option Requirements R1--1. For the Subordinate CRQ Transport option: The platform must implement the Reliable Command/Response Transport option, as defined in . R1--2. For the Subordinate CRQ Transport option: The platform must implement the Sub-CRQ facility, as defined in . R1--3. For the Subordinate CRQ Transport option: The platform must implement the H_REG_SUB_CRQ hcall(). See . R1--4. For the Subordinate CRQ Transport option: The platform must implement the H_FREE_SUB_CRQ hcall(). See . R1--5. For the Subordinate CRQ Transport option: The platform must implement the H_SEND_SUB_CRQ hcall(). See . R1--6. For the Subordinate CRQ Transport option: The platform must implement the H_SEND_SUB_CRQ_INDIRECT hcall(). See . R1--7. For the Subordinate CRQ Transport option: The platform must implement all of the following subfunctions of the H_VIOCTL hcall() ( ): DISABLE_ALL_VIO_INTERRUPTS DISABLE_VIO_INTERRUPT ENABLE_VIO_INTERRUPT
Interpartition Logical LAN (ILLAN) Option The Interpartition Logical LAN (ILLAN) option provides the functionality of IEEE VLAN between LPAR partitions. Partitions are configured to participate in the ILLAN. The participating partitions have one or more logical IOAs in their device tree. The hypervisor emulates the functionality of an IEEE VLAN switch; that functionality is defined in IEEE 802.1Q. The following information on IEEE VLAN switch functionality is provided for informative reference only, with the referenced document being normative. Logical Partitions may have one or more Logical LAN IOAs, each of which appears to be connected to one and only one Logical LAN Switch port of the single Logical LAN Switch implemented by the hypervisor. Each Logical LAN Switch port is configured (by platform dependent means) as to whether the attached Logical LAN IOA supports IEEE VLAN headers or not, and the allowable VLAN numbers that the port may use (a single number if VLAN headers are not supported, an implementation dependent number if VLAN headers are supported). When a message arrives at a Logical LAN Switch port from a Logical LAN IOA, the hypervisor caches the message’s source MAC address (second 6 bytes) to use as a filter for future messages to the IOA. Then the hypervisor processes the message differently depending upon whether the port is configured for IEEE VLAN headers or not. If the port is configured for VLAN headers, the VLAN header (byte offsets 12 and 13 in the message) is checked against the port’s allowable VLAN list. If the message-specified VLAN is not in the port’s configuration, the message is dropped. Once the message passes the VLAN header check, it passes on to destination MAC address processing below. If the port is NOT configured for VLAN headers, the hypervisor (conceptually) inserts a two byte VLAN header (based upon the port’s configured VLAN number) after byte offset 11 in the message. Next, the destination MAC address (first 6 bytes of the message) is processed by searching the table of cached MAC addresses (built from messages received at Logical LAN Switch ports, see above). If a match for the MAC address is not found and there is no Trunk Adapter defined for the specified VLAN number, then the message is dropped; otherwise, if a match for the MAC address is not found and there is a Trunk Adapter defined for the specified VLAN number, then the message is passed on to the Trunk Adapter. If a MAC address match is found, then the associated switch port’s configured allowable VLAN number table is scanned for a match to the VLAN number contained in the message’s VLAN header. If a match is not found, the message is dropped. Next, the VLAN header configuration of the destination Switch Port is checked, and if the port is configured for VLAN headers, the message is delivered to the destination Logical LAN IOA including any inserted VLAN header. If the port is configured for no VLAN headers, the VLAN header is removed before the message is delivered to the destination Logical LAN IOA. The Logical LAN IOA’s device tree entry includes Unit Address and “ibm,my-dma-window” properties. The “ibm,my-dma-window” property contains a LIOBN field that represents the RTCE table used by the Logical IOA. The Logical LAN hcall()s use the Unit Address field to imply the LIOBN and, therefore, the RTCE table to reference. 
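The switch behavior just described can be summarized in a short C sketch. This is purely illustrative: the port structure and all helpers (MAC cache, VLAN list checks, trunk lookup, enqueue) are hypothetical.

    #include <stdint.h>
    #include <stddef.h>

    struct llan_port {
        int vlan_headers;   /* port configured for IEEE VLAN headers? */
        /* allowable VLAN list, MAC cache, etc. omitted */
    };

    extern void cache_src_mac(struct llan_port *p, const uint8_t *mac);
    extern int  port_allows(const struct llan_port *p, uint16_t vlan);
    extern struct llan_port *lookup_cached_mac(const uint8_t *mac);
    extern struct llan_port *trunk_adapter_for(uint16_t vlan);
    extern uint16_t port_default_vlan(const struct llan_port *p);
    extern int  enqueue(struct llan_port *dst, const uint8_t *frame,
                        size_t len, int keep_vlan_header);

    /* Returns nonzero if delivered, 0 if the frame is dropped. */
    static int llan_switch(struct llan_port *src, const uint8_t *frame,
                           size_t len)
    {
        cache_src_mac(src, frame + 6);              /* source MAC: bytes 6-11 */

        uint16_t vlan;
        if (src->vlan_headers) {
            vlan = (frame[12] << 8) | frame[13];    /* VLAN header check */
            if (!port_allows(src, vlan))
                return 0;                           /* drop */
        } else {
            vlan = port_default_vlan(src);          /* conceptual insertion */
        }

        struct llan_port *dst = lookup_cached_mac(frame); /* dest: bytes 0-5 */
        if (!dst)
            dst = trunk_adapter_for(vlan);          /* may be NULL */
        if (!dst || !port_allows(dst, vlan))
            return 0;                               /* drop */

        /* Deliver; keep or strip the VLAN header per dst's configuration. */
        return enqueue(dst, frame, len, dst->vlan_headers);
    }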
When the logical IOA is opened, the device driver registers with the hypervisor, as the “Buffer List”, a TCE mapped page of partition I/O mapped memory that contains the receive buffer descriptors. These receive buffers are mapped via a TCE mechanism from partition memory into contiguous I/O DMA space. The first descriptor in the buffer list page is that of the receive queue buffer. The rest of the descriptors are for a number of buffer pools organized by increasing size of receive buffer. The format of the descriptor is a 1 byte control field and a 3 byte buffer length, followed by a 4 byte I/O address. The number of buffer pools is determined by the device driver (up to an architected maximum of 254). The control field in all unused descriptors is 0x00. The last 8 bytes are reserved for statistics. When a new message is received by the logical IOA, the list of buffer pools is scanned, starting from the second descriptor in the buffer list, looking for the first available buffer that is equal to or greater than the received message. That buffer is removed from the pool, filled with the incoming message, and an entry is placed on the receive queue noting the buffer status, message length, starting data offset, and the buffer correlator. The sender of a logical LAN message uses an hcall() that takes as parameters the Unit Address and a list of up to 6 buffer descriptors (length, starting I/O address pairs). The sending hcall(), after verifying that the sender owns the Unit Address, correlates the Unit Address with its associated Logical LAN Switch port and copies the message from the send buffer(s) into a receive buffer, as described above, for each target logical LAN IOA that is a member of the specified VLAN. If a given logical IOA does not have a suitable receive buffer, the message is dropped for that logical IOA (a return code indicates that one or more destinations did not receive the message, allowing for a reliable datagram service). The logical LAN facility uses the standard H_GET_TCE and H_PUT_TCE hcall()s to manage the I/O translation tables, along with H_MIGRATE_DMA to aid in dynamic memory reconfiguration.
Logical LAN IOA Data Structures The Logical LAN IOA defines certain data structures as described in the following paragraphs. outlines the inter-relationships between several of these structures. Since multiple hcall()s as well as multiple partitions access the same structures, careful serialization is essential. Implementation Note: During shutdown or migration of TCE mapped pages, implementations may choose to atomically maintain, within a single two-field variable, a usage count of processors currently sending data through the Logical LAN IOA, combined with a quiesce request field set to the number of the processor that is requesting the quiesce (if no quiesce is requested, the value of this field is some reserved value). Then a protocol, such as the following, can manage the quiesce of Logical LAN DMA. A new sender atomically checks the DMA channel management variable -- spinning if the quiesce field is set and subsequently incrementing the usage count field when the quiesce field is not set. The sender atomically decreases the use count when through with Logical Remote DMA copying. A quiesce requester, after atomically setting the quiesce field with its processor number (as in a lock), waits for the usage count to go to zero before proceeding.
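The Implementation Note's protocol maps naturally onto a single 64 bit atomic. The following C11 sketch is one possible rendering, not from the specification; the field packing and the NO_QUIESCE reserved value are assumptions.

    #include <stdint.h>
    #include <stdatomic.h>

    /* Quiesce owner in the high 32 bits (NO_QUIESCE when free), sender
     * usage count in the low 32 bits. */
    #define NO_QUIESCE 0xFFFFFFFFull

    static _Atomic uint64_t dma_state = NO_QUIESCE << 32;

    static void sender_enter(void)          /* before Remote DMA copying */
    {
        uint64_t old = atomic_load(&dma_state);
        for (;;) {
            if ((old >> 32) != NO_QUIESCE) {     /* quiesce pending: spin */
                old = atomic_load(&dma_state);
                continue;
            }
            if (atomic_compare_exchange_weak(&dma_state, &old, old + 1))
                return;                          /* usage count incremented */
        }
    }

    static void sender_exit(void)           /* after Remote DMA copying */
    {
        atomic_fetch_sub(&dma_state, 1);
    }

    static void quiesce(uint32_t cpu)       /* e.g., before page migration */
    {
        uint64_t old = atomic_load(&dma_state);
        for (;;) {                               /* claim the quiesce field */
            if ((old >> 32) != NO_QUIESCE) {
                old = atomic_load(&dma_state);
                continue;
            }
            uint64_t claimed = ((uint64_t)cpu << 32) | (old & 0xFFFFFFFFull);
            if (atomic_compare_exchange_weak(&dma_state, &old, claimed))
                break;
        }
        while (atomic_load(&dma_state) & 0xFFFFFFFFull)
            ;                                    /* wait for senders to drain */
    }

    static void unquiesce(void)
    {
        atomic_store(&dma_state, NO_QUIESCE << 32);
    }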
Logical LAN IOA Structures
Buffer Descriptor The buffer descriptor is an 8 byte quantity, on an 8 byte boundary (so that it can be written atomically). The high order byte is control, the next 3 bytes consist of a length field of the buffer in bytes, and the low order 4 bytes are a TCE mapped I/O address of the start of the buffer in I/O address space. Bit 0 of the control field is the valid indicator; 0 means not valid and 1 means valid. Bit 1 is used in the receive queue descriptor as the valid toggle if the descriptor specifies the receive queue, else it is reserved. If the valid toggle is a 0, then newly enqueued receive buffer descriptors have a valid bit value of 1; if the valid toggle is a 1, then newly enqueued receive buffer descriptors have a valid bit value of 0. The hypervisor flips the value of the valid toggle bit each time it cycles from the bottom of the receive queue to the top. Bits 2-4 are reserved. Bit 5 is the Large Send Indication bit and indicates that this packet is a large-send packet. See for more information on the usage of this bit. Bit 6 is the No Checksum bit and indicates that there is no checksum in this packet. See for more information on the usage of this bit. Bit 7 is the Checksum Good bit and indicates that the checksum in this packet has already been verified. See for more information on the usage of this bit.
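A C sketch of this packing, using IBM bit numbering (bit 0 is the most significant bit of the control byte); the macro names are hypothetical.

    #include <stdint.h>

    #define BD_VALID        (1ull << 63)  /* bit 0: descriptor valid */
    #define BD_VALID_TOGGLE (1ull << 62)  /* bit 1: valid toggle (rx queue) */
    #define BD_LARGE_SEND   (1ull << 58)  /* bit 5: Large Send Indication */
    #define BD_NO_CSUM      (1ull << 57)  /* bit 6: No Checksum */
    #define BD_CSUM_GOOD    (1ull << 56)  /* bit 7: Checksum Good */

    /* control byte | 3-byte length | 4-byte TCE-mapped I/O address */
    static uint64_t make_buf_desc(uint32_t len, uint32_t ioba)
    {
        return BD_VALID | ((uint64_t)(len & 0xFFFFFF) << 32) | ioba;
    }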
Buffer List This structure is used to record buffer descriptors of various types used by the Logical LAN IOA. Additionally, running statistics about the logical LAN adapter are maintained at the end of the structure. It consists of one 4 KB aligned TCE mapped page. By TCE mapping the page, the H_MIGRATE_DMA hcall() is capable of migrating this structure. The first buffer descriptor (at offset 0) contains the buffer descriptor for the receive queue. The second buffer descriptor (at offset 8) contains the buffer descriptor for the MAC multicast filter table. It is the architectural intent that all subsequent buffer descriptors in the list head a pool of buffers of a given size. Further, it is the architectural intent that descriptors are ordered in increasing size of the buffers in their respective pools. The rest of the description of the ILLAN option is written assuming this intent. However, the contents of these descriptors are architecturally opaque; none of these descriptors are manipulated by code above the architected interfaces. This allows implementations to select the most appropriate serialization techniques for buffer enqueue/dequeue, migration, and buffer pool addition and subsequent garbage collection. The final 8 bytes in the buffer list are a counter of frames dropped because there was not a buffer in the buffer list capable of holding the frame.
Receive Queue The receive queue is a circular buffer used to store received message descriptors. The device driver sizes the buffer used for the receive queue in multiples of 16 bytes, starting on a 16 byte boundary (to allow atomic store operations), with at least one more 16 byte entry than the maximum number of possible outstanding receive buffers. Failure to have enough receive queue entries may result in received messages and their buffers being lost, since the logical IOA assumes that there are always empty receive queue elements and does not check. When the device driver registers the receive queue buffer, the buffer contents should be all zeros; this ensures that the valid bits are all off. If a message is received successfully, the next 16 byte area (starting with the area at offset 0 for the first message received after the registration of the receive queue and looping back to the top after the last area is used) in the receive queue is written with a message descriptor as shown in . Either the entire entry is atomically written, or the write order is serialized such that the control field is globally visible after all other fields are visible.

Receive Queue Entry (Field Name / Byte Offset / Length / Definition)
Control / 0 / 1 / Bit 0 = the appropriate valid indicator. Bit 1 = 1 if the buffer contains a valid message; Bit 1 = 0 if the buffer does not contain a valid message, in which case the device driver recycles the buffer. Bits 2-4: Reserved. Bit 5: Large Send Indication bit; if a 1, this indicates the packet is a large-send packet. Bit 6: No Checksum bit; if a 1, this indicates that there is no checksum in this packet (see for more information on the usage of this bit). Bit 7: Checksum Good bit; if a 1, this indicates that the checksum in this packet has already been verified (see for more information on the usage of this bit).
Reserved / 1 / 1 / Reserved for future use.
Message Offset / 2 / 2 / The byte offset to the start of the received message. The minimum offset is 8 (to bypass the message correlator field); larger offsets may be used to allow for optimized data copy operations.
Message Length / 4 / 4 / The byte length of the received message.
Opaque handle / 8 / 8 / Copy of the first 8 bytes contained in the message buffer as passed by the device driver.
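For reference, the entry maps onto the following C structure (a sketch; big-endian PowerPC layout assumed):

    #include <stdint.h>

    /* 16 byte receive queue entry, per the table above. */
    struct rx_queue_entry {
        volatile uint8_t control;  /* bit 0 valid; bit 1 message valid;
                                      bits 5-7 large-send/checksum flags */
        uint8_t  reserved;
        uint16_t msg_offset;       /* >= 8: skips the opaque handle field */
        uint32_t msg_length;
        uint64_t opaque_handle;    /* first 8 bytes of the receive buffer */
    };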
So that the device driver never has to write into the receive queue, the VLAN logical IOA alternates the value of the valid bit on each pass through the receive queue buffer. On the first pass following registration, the valid bit value is written as a 1, on the next as a 0, on the third as a 1, and so on. To allow the device driver to follow the state of the valid bit, the Logical LAN IOA maintains a valid bit toggle in bit 1 of the receive queue descriptor control byte. The Logical LAN IOA increments its enqueue pointer after each enqueue. If the pointer increment (modulo the buffer size) loops to the top, the valid toggle bit alternates state. Following the write of the message descriptor, if enqueue interrupts are enabled and there is not an outstanding interrupt signaled from the Logical LAN IOA’s interrupt source number, an interrupt is signaled. It is the architectural intent that the first 8 bytes of the buffer are a device driver supplied opaque handle that is copied into the receive queue entry. One possible format of the opaque handle is the OS effective address of the buffer control block that precedes the buffer as seen by the VLAN Logical IOA. Within this control block might be stored the total length of the buffer, the 8 byte buffer descriptor (used to enqueue this buffer using the H_ADD_LOGICAL_LAN_BUFFER hcall()), and other control fields as deemed necessary by the device driver. When servicing the receive interrupt, it is the architectural intent that the device driver starts to search the receive queue using a device driver maintained receive queue service pointer (initially starting, after buffer registration, at offset zero of the receive queue), servicing all receive queue entries with the appropriate valid bit, until reaching the first invalid receive queue entry. The receive queue service pointer is post incremented, modulo the receive queue buffer length, and the device driver’s notion of the valid bit state is toggled/read from the receive queue descriptor’s valid bit toggle bit on each cycle through the circular buffer. After all valid receive queue entries are serviced, the device driver resets the interrupt. See . After the interrupt reset, the device driver again scans from the new value of the receive queue service pointer to pick up any entries that may have been enqueued during the interrupt reset window.
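The intended service loop might look like the following sketch, which reuses struct rx_queue_entry from the sketch above; consume() and recycle_buffer() are hypothetical driver hooks.

    extern void consume(struct rx_queue_entry *e);       /* hypothetical */
    extern void recycle_buffer(uint64_t opaque_handle);  /* hypothetical */

    /* q: receive queue base; n: entry count; cur: service pointer;
     * expect_valid: driver's notion of the current valid-bit value (0/1). */
    static void service_rx_queue(struct rx_queue_entry *q, unsigned n,
                                 unsigned *cur, uint8_t *expect_valid)
    {
        for (;;) {
            uint8_t c = q[*cur].control;
            if (((c >> 7) & 1) != *expect_valid)
                break;                    /* first invalid entry: done */
            if (c & 0x40)                 /* bit 1: buffer holds a message */
                consume(&q[*cur]);
            else
                recycle_buffer(q[*cur].opaque_handle);
            if (++*cur == n) {            /* wrap: IOA flips the valid bit */
                *cur = 0;
                *expect_valid ^= 1;
            }
        }
    }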
MAC Multicast Filter List This one 4 KB page (aligned on a 4 KB boundary) opaque data structure is used by firmware to contain multicast filter MAC addresses. The table is initialized by firmware via the H_REGISTER_LOGICAL_LAN hcall(). Any modification of this table by the partition software (OS or device driver) is likely to corrupt its contents, which may corrupt/affect the OS’s partition but not other partitions; that is, the hypervisor must not experience significant performance degradation due to table corruption. However, for the partition that corrupted its filter list, the hypervisor may deliver multicast address packets that had previously been requested to be filtered out, or it may fail to deliver multicast address packets that had been requested to be delivered.
Receive Buffers The Logical LAN firmware requires that the minimum size receive buffer is 16 bytes, aligned on a 4 byte boundary, so that stores of the linkage pointer may be atomic. Minimum IP message sizes and message padding areas force a larger minimum size buffer. The first 8 bytes of the receive buffer are reserved for a device driver defined opaque handle that is written into the receive queue entry when the buffer is filled with a received message. Firmware never modifies the first 8 bytes of the receive buffer. From the time of buffer registration via the H_ADD_LOGICAL_LAN_BUFFER hcall() until the buffer is posted onto the receive queue, the entire buffer other than the first 8 bytes is subject to modification by the firmware. Any modification of the buffer contents during this time by non-firmware code subjects receive data within the partition to corruption. However, any data corruption caused by errors in partition code does not escape the offending partition, except to the extent that the corruption involves the data in Logical LAN send buffers. Provisions are incorporated in the receive buffer format for a beginning pad field to allow firmware to make use of data transfer hardware that may be alignment sensitive. While the contents of the Pad fields are undefined, firmware is not allowed to make visible to the receiver more data than was specifically included by the sender in the transfer message, so as to avoid a covert channel between the communicating partitions.

Receive Buffer Format (Field Name / Byte Offset / Length / Definition)
Opaque Handle / 0 / 8 / Per design of the device driver.
Pad 1 / 8 / 0 to L1 cache line size / This field, containing undefined data, may be included by the firmware to align data for optimized transfers.
Message / as defined by the “Message Offset” field of the Receive Queue Entry / 12-N / The destination and source MAC addresses are the first two 6 byte fields of the message, followed by the message payload.
Pad 2 / (follows the Message field) / To end of buffer / Buffer contents after the Message field are undefined.
Logical LAN Device Tree Node The Logical LAN device tree node is a child of the vdevice node, which itself is a child of / (the root node). There exists one such node for each logical LAN virtual IOA instance. Additionally, Logical LAN device tree nodes have associated packages such as obp-tftp and load method, as appropriate to the specific virtual IOA configuration, as would the node for a physical IOA of type network.

Properties of the Logical LAN OF Device Tree Node (Property Name / Required? / Definition)
“name” / Y / Standard property name per , specifying the virtual device name; the value shall be “l-lan”.
“device_type” / Y / Standard property name per , specifying the virtual device type; the value shall be “network”.
“model” / NA / Property not present.
“compatible” / Y / Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,l-lan”.
“used-by-rtas” / See definition column / Present if appropriate.
“ibm,loc-code” / Y / Property name specifying the unique and persistent location code associated with this virtual IOA; the value shall be of the form defined in .
“reg” / Y / Standard property name per , specifying the unit address (unit ID) associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells” (a virtual “reg” property used for the unit address; no actual locations are used, therefore, the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).
“ibm,my-dma-window” / Y / Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int.
“local-mac-address” / Y / Standard property name per , specifying the logical IOA’s intrinsic MAC address; this number is guaranteed to be unique within the scope of the Logical LAN. Locally administered MAC addresses are denoted by having the low order two bits of the high order byte be 0b10.
“mac-address” / See definition column / Initial MAC address (may be changed by the H_CHANGE_LOGICAL_LAN_MAC hcall()). Note: There have been requests for a globally unique mac address per logical LAN IOA. However, that would require the platform to ship with an unbounded set of reserved globally unique addresses -- which clearly cannot work -- and this, plus the availability of IP routing for external connectivity, has overridden those requests.
“ibm,mac-address-filters” / Y / Property name specifying the number of non-broadcast multicast MAC filters supported by this implementation (between 0 and 255), presented as an encoded array encoded as with encode-int.
“interrupts” / Y / Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().
“ibm,my-drc-index” / For DR /
“ibm,vserver” / Y / Property name specifying that this is a virtual server node.
“ibm,trunk-adapter” / See definition column / Property name specifying that this is a Trunk Adapter. This property must be provided when the node is a Trunk Adapter node.
“ibm,illan-options” / See definition column / This property is required when any of the ILLAN sub-options are implemented (see ). 
The existence of this property indicates that the H_ILLAN_ATTRIBUTES hcall() is implemented, and that hcall() is then used to determine which ILLAN options are implemented. “supported-network-types” Y Standard property name as per . Reports possible types of “network” the device can support. “chosen-network-type” Y Standard property name as per . Reports the type of “network” this device is supporting. “max-frame-size” Y Standard property name per , to indicate maximum packet size. “address-bits” Y Standard property name per , to indicate network address length. “ibm,#dma-size-cells” See definition column Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in . “ibm,#dma-address-cells” See definition column Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
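For reference, the sketch below shows how partition software might decode the three-value “ibm,my-dma-window” property (LIOBN, phys, size) from its big-endian cell stream; the helper and the simplifying assumption of one cell per field are illustrative only, since the actual cell counts are governed by “ibm,#dma-address-cells” and “ibm,#dma-size-cells”.

```c
#include <stdint.h>
#include <arpa/inet.h>   /* ntohl(): OF property cells are big-endian */

/* Illustrative decode of an "ibm,my-dma-window" property, assuming
 * the simple case of one cell each for LIOBN, phys, and size. */
struct dma_window {
    uint32_t liobn;  /* logical I/O bus number (RTCE table handle) */
    uint64_t phys;   /* starting I/O address of the window */
    uint64_t size;   /* window length in bytes */
};

static struct dma_window parse_my_dma_window(const uint32_t *cells)
{
    struct dma_window w;
    w.liobn = ntohl(cells[0]);
    w.phys  = ntohl(cells[1]);
    w.size  = ntohl(cells[2]);
    return w;
}
```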
Logical LAN hcall()s

The receiver can set the virtual interrupt associated with its Receive Queue to one of two modes using the H_VIO_SIGNAL hcall(). These are:
- Disabled (an enqueue interrupt is not signaled)
- Enabled (an enqueue interrupt is signaled on every enqueue)

Note: An enqueue is considered a pulse, not a level. The pulse then sets the memory element within the emulated interrupt source controller. This allows the interrupt condition to be reset by simply issuing the H_EOI hcall(), as is done with the PCI MSI architecture, rather than having to do an explicit interrupt reset as in the case of the PCI LSI architecture. The interrupt mechanism, however, is capable of presenting only one interrupt signal at a time from any given interrupt source. Therefore, no additional interrupts from a given source are ever signaled until the previous interrupt has been processed through to the issuance of an H_EOI hcall(). Specifically, even if the interrupt mode is enabled, the effect is to interrupt on an empty to non-empty transition of the queue.
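A minimal sketch of the resulting driver interrupt flow, assuming hypothetical wrappers h_vio_signal() and h_eoi() around the hcall() interface and illustrative mode encodings (the real entry points and values are platform-specific):

```c
/* Hypothetical hcall() wrappers; actual bindings are platform-specific. */
extern long h_vio_signal(unsigned long unit_address, unsigned long mode);
extern long h_eoi(unsigned long xirr);
extern void drain_receive_queue(void);   /* driver-specific */

#define VIO_IRQ_DISABLE 0UL   /* illustrative encodings */
#define VIO_IRQ_ENABLE  1UL

/* Because an enqueue is a pulse that latches in the emulated interrupt
 * source controller, a driver typically drains the receive queue and
 * then issues H_EOI; a new interrupt is presented only on the next
 * empty-to-non-empty transition of the queue. */
static void llan_interrupt(unsigned long unit_address, unsigned long xirr)
{
    h_vio_signal(unit_address, VIO_IRQ_DISABLE); /* quiesce while polling */
    drain_receive_queue();
    h_vio_signal(unit_address, VIO_IRQ_ENABLE);
    h_eoi(xirr);                                 /* incorporates interrupt reset */
}
```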
H_REGISTER_LOGICAL_LAN

Syntax:

Parameters:
- unit-address: As specified in the Logical LAN device tree node “reg” property.
- buf-list: I/O address of a 4 KB (aligned) page used to record registered input buffers.
- rec-queue: Buffer descriptor of a receive queue, specifying a receive queue which is a multiple of 16 bytes in length and is 16 byte aligned.
- filter-list: I/O address of a 4 KB aligned page containing the broadcast MAC address filter list.
- mac-address: The receive filter MAC address.

Semantics:
- Validate the unit address, else H_Parameter.
- Validate that the I/O addresses of the buf-list and filter-list are translated by the RTCE table and are 4 KB aligned, else H_Parameter.
- Validate the buffer descriptor of the receive queue buffer (I/O addresses for the entire buffer length starting at the specified I/O address are translated by the RTCE table, the length is a multiple of 16 bytes, and alignment is on a 16 byte boundary), else H_Parameter.
- Initialize the one page buffer list.
- Enqueue the receive queue buffer (set the valid toggle to 0).
- Initialize the hypervisor’s receive queue enqueue pointer and length variables for the virtual IOA associated with the unit address. These variables are kept in terms of DMA addresses so that page migration works and any remapping of TCEs is effective.
- Disable receive queue interrupts.
- Record the low order 6 bytes of mac-address for filtering future incoming messages.
- Return H_Success.
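The sequence below sketches how a driver might bring up the virtual IOA with this hcall(); the wrapper prototype is an assumption (Linux's ibmveth driver issues the equivalent call through its platform hcall interface), and llan_make_desc() is a hypothetical helper whose layout follows the buffer descriptor requirements later in this chapter.

```c
#include <stdint.h>

/* Hypothetical wrapper; the real binding is platform-specific. */
extern long h_register_logical_lan(unsigned long unit_address,
                                   uint64_t buf_list_ioaddr,
                                   uint64_t rec_queue_desc,
                                   uint64_t filter_list_ioaddr,
                                   uint64_t mac_address);

/* Builds a buffer descriptor: valid bit in the control byte, 3 byte
 * length, 4 byte I/O address (see the ILLAN requirements below). */
extern uint64_t llan_make_desc(uint32_t io_addr, uint32_t len, int valid);

static long llan_register(unsigned long unit_address,
                          uint64_t buf_list, uint64_t rq_ioaddr,
                          uint32_t rq_len, uint64_t filter_list,
                          uint64_t mac)
{
    /* buf_list and filter_list must each be a TCE-mapped, 4 KB aligned
     * page; the receive queue must be 16 byte aligned and a multiple
     * of 16 bytes long, or the hypervisor returns H_Parameter. */
    uint64_t rq_desc = llan_make_desc((uint32_t)rq_ioaddr, rq_len, 1);
    return h_register_logical_lan(unit_address, buf_list, rq_desc,
                                  filter_list, mac);
}
```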
H_FREE_LOGICAL_LAN

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property.

Semantics:
- Validate the unit address, else H_Parameter.
- Interlock/carefully manipulate tables so that H_SEND_LOGICAL_LAN performs safely.
- Clear the associated page buffer list; prevent further consumption of receive buffers and generation of receive interrupts.
- Return H_Success.

H_FREE_LOGICAL_LAN is the only valid mechanism to reclaim the memory pages registered via H_REGISTER_LOGICAL_LAN.

Implementation Note: If the hypervisor returns H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must call H_FREE_LOGICAL_LAN again with the same parameters. Software may choose to treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy. The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, will have placed the virtual adapter in a state that causes it to not accept any new work nor surface any new virtual interrupts (no new frames will arrive, etc.).
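The retry obligation in the implementation note maps naturally onto a loop like the following; the wrapper prototype, delay helper, and return-code values are illustrative assumptions:

```c
/* Hypothetical wrapper and return-code values; real values come from
 * the platform's hypervisor call binding. */
extern long h_free_logical_lan(unsigned long unit_address);
extern void delay_microseconds(unsigned long usec);

#define H_BUSY                 1L   /* illustrative */
#define H_LONG_BUSY_ORDER_1MS  2L
#define H_LONG_BUSY_ORDER_10MS 3L

static long llan_free(unsigned long unit_address)
{
    long rc;

    /* Busy returns mean the adapter is already quiesced; simply call
     * again with the same parameters until the hcall() completes. */
    do {
        rc = h_free_logical_lan(unit_address);
        if (rc == H_LONG_BUSY_ORDER_1MS)
            delay_microseconds(1000);
        else if (rc == H_LONG_BUSY_ORDER_10MS)
            delay_microseconds(10000);
    } while (rc == H_BUSY || rc == H_LONG_BUSY_ORDER_1MS ||
             rc == H_LONG_BUSY_ORDER_10MS);

    return rc;
}
```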
H_ADD_LOGICAL_LAN_BUFFER

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property.
- buf: Buffer descriptor of the new I/O buffer.

Semantics:
- Check that the unit address is valid, else H_Parameter.
- Check that the I/O address is within the range of the DMA window.
- Scan the buffer list for a pool of buffers of the length specified in the descriptor. If one does not exist and there is still room in the buffer list, create a new pool entry; else return H_Resource.
- Use an enqueue procedure that is compatible with the H_SEND_LOGICAL_LAN hcall()’s dequeue procedure.

Implementation Note: Since the buffer queue is based upon I/O addresses that are checked by H_SEND_LOGICAL_LAN, it is only necessary to ensure that the enqueue/dequeue are internally consistent. If the owning OS corrupts its buffer descriptors or buffer queue pointers, this is caught by H_SEND_LOGICAL_LAN and/or the corruption is contained within the OS’s partition.

Architecture Note: Consideration was given to defining the enqueue algorithm and having the DD do the enqueue itself. However, no designs presented themselves that eliminated the timing windows caused by adding and removing pool lists without the introduction of OS/FW interlocks.
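A sketch of the descriptor-packing helper assumed by the other sketches in this chapter, consistent with the buffer descriptor layout required later (control byte with the valid bit as the high order bit, bytes 2-4 length, bytes 5-8 I/O address); the function and wrapper names are assumptions:

```c
#include <stdint.h>

extern long h_add_logical_lan_buffer(unsigned long unit_address,
                                     uint64_t desc);  /* hypothetical */

/* Pack a logical LAN buffer descriptor: byte 1 is the control byte
 * (high order bit = valid), bytes 2-4 are the 3 byte length, and
 * bytes 5-8 are the 4 byte I/O address, per the ILLAN requirements. */
uint64_t llan_make_desc(uint32_t io_addr, uint32_t len, int valid)
{
    uint64_t desc = 0;

    desc |= (uint64_t)(valid ? 0x80 : 0x00) << 56;  /* control byte   */
    desc |= ((uint64_t)len & 0xFFFFFF) << 32;       /* 3 byte length  */
    desc |= io_addr;                                /* 4 byte I/O addr */
    return desc;
}

/* Replenish one receive buffer of the given size. */
static long llan_post_buffer(unsigned long unit_address,
                             uint32_t io_addr, uint32_t len)
{
    return h_add_logical_lan_buffer(unit_address,
                                    llan_make_desc(io_addr, len, 1));
}
```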
H_FREE_LOGICAL_LAN_BUFFER

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property.
- bufsize: The size of the buffer that is being requested to be removed from the receive buffer pool.

Semantics:
- Check that the unit address is valid, else return H_Parameter.
- Scan the buffer list for a pool of buffers of the length specified in bufsize, and return H_Not_Found if one does not exist.
- Place an entry on the receive queue for a buffer of the specified size, with Control field Bit 1 set to 0, and return H_Success.
H_SEND_LOGICAL_LAN

Syntax:

The H_Dropped return code indicates to the sender that one or more intended receivers did not receive the message.

Parameters:
- unit-address: Unit address per the device tree node “reg” property.
- buff-1 through buff-6: Buffer descriptors #1 through #6.
- continue-token: Used to continue a transfer if H_Busy is returned. Set to 0 on the first call. If H_Busy is returned, then call again, but use the value returned in R4 from the previous call as the value of continue-token.

Semantics:
- If continue-token is non-zero, then do the appropriate checks to see that parameters and buffers are still valid, and pick up where the previous transfer left off for the specified unit address, based on the value of the continue-token.
- If continue-token is zero, and the previous H_SEND_LOGICAL_LAN for the specified unit address was suspended with H_Busy and never completed, then clean up the state from the previously suspended call before proceeding.
- Verify the VLAN number, else H_Parameter.
- Proceed down the 6 buffer descriptors until the first one that has a length of 0.
- If the “ibm,max-virtual-dma-size” property exists in the /vdevice node of the device tree, and the length is greater than the value of this property, return H_Parameter.
- For the length of the buffer: verify that the I/O buffer addresses translate through the sender’s RTCE table, else H_Parameter.
- Verify the destination MAC address for the VLAN:
  - If the MAC address is not cached and there exists a Trunk Adapter for the VLAN, then flag the message as destined for the Trunk Adapter and continue processing.
  - If the MAC address is not cached and a Trunk Adapter does not exist for the VLAN, then drop the message (H_Dropped).
- For each destination MAC address (a broadcast MAC address turns into a multicast to all destinations on the specified VLAN):
  - In the case of multicast MAC addresses, the following algorithm defines the members of the receiver class for a given VLAN. For each logical LAN IOA that would be a target for a broadcast from the source IOA:
    If the receiving IOA is not enabled for non-broadcast multicast frames, then continue.
    If the receiving IOA is not enabled for filtering non-broadcast multicast frames, then copy the frame to the IOA's receive buffer.
    Else if lookup_filter(table index), then copy the frame to the IOA's receive buffer.
    Else if the receiving IOA is not enabled for filtering non-broadcast multicast frames, then copy the frame to the IOA's receive buffer. /* allows for races on filter insertion */
    (lookup_filter(table index) is a firmware implementation defined algorithm.)
  - Search the receiver’s receive queue for a suitable buffer and atomically dequeue it. If no suitable buffer is found, the receiver’s dropped packet counter (last 8 bytes of the buffer list) is incremented and processing proceeds to the next receiver, if any.
  - Copy the send data into the selected receive buffer, build a receive queue entry, and generate an interrupt to the receiver if the interrupt is enabled.
- If any frames were dropped, return H_Dropped; else return H_Success.

Firmware Implementation Note: If during the processing of the H_SEND_LOGICAL_LAN call it becomes necessary to temporarily suspend the processing of the call (for example, due to the length of time it is taking to process the call), the firmware may return a continuation token in R4, along with the return code of H_Busy.
The value of the continuation token is up to the firmware, and it will be passed back by the software as the continue-token parameter on the next call of H_SEND_LOGICAL_LAN. This hcall() interlocks with H_MIGRATE_DMA to allow for migration of TCE mapped DMA pages.

Note: It is possible for either or both of the sending and receiving OSs to modify their RTCE tables so as to affect the TCE translations being actively used by H_SEND_LOGICAL_LAN. This is an error condition on the part of the OS. Implementations need only ensure that such conditions do not corrupt memory in innocent partitions, and should not add path length to protect guilty partitions. Above all, the path length of H_GET_TCE and H_PUT_TCE should not be increased. If reasonably possible, without significant path length addition, implementations should:
- On send buffer translation corruption, return H_Parameter to the sender and either totally drop the packet prior to reception or, if the receive buffer has been processed past the point of transparent recycling, mark the receive buffer as received in error in the receive queue.
- On receive buffer translation corruption, terminate the data copy to the receive buffer and mark the buffer as received in error in the receive queue.
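The continue-token protocol implies a send loop of the following shape; the wrapper (returning the R4 value through an out parameter) and the return-code constants are illustrative assumptions:

```c
#include <stdint.h>

/* Hypothetical wrapper: returns the hcall() status and stores the
 * R4 continuation token through *r4. */
extern long h_send_logical_lan(unsigned long unit_address,
                               const uint64_t descs[6],
                               uint64_t continue_token,
                               uint64_t *r4);

#define H_BUSY    1L   /* illustrative values */
#define H_DROPPED 2L

static long llan_send(unsigned long unit_address, const uint64_t descs[6])
{
    uint64_t token = 0;   /* must be 0 on the first call */
    uint64_t r4;
    long rc;

    /* On H_Busy, call again with the token the firmware returned in
     * R4 so the transfer resumes where it was suspended. */
    do {
        rc = h_send_logical_lan(unit_address, descs, token, &r4);
        token = r4;
    } while (rc == H_BUSY);

    /* H_Dropped means at least one intended receiver missed the
     * message; higher-level protocols are expected to recover. */
    return rc;
}
```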
H_MULTICAST_CTRL

This hcall() controls the reception of non-broadcast multicast packets (those with the high order address byte being odd, but not the all 1’s address). All implementations support the enabling and disabling of the reception of all multicast packets on their VLAN. Additionally, through this call the l-lan device driver may ask the firmware to filter multicast packets for it; that is, receive packets only if they contain multicast addresses specified by the device driver. The number of simultaneous multicast packet filters supported is implementation dependent, and is specified in the “ibm,mac-address-filters” property of the l-lan device tree node. Therefore, the device driver must be prepared to have any filter request fail, and to fall back to enabling reception of all multicast packets and filtering them in the device driver.

Semantically, the device driver may ask that the reception of multicast packets be enabled or disabled; further, if reception is enabled, packets may be filtered by only allowing reception of those whose MAC address matches one of the entries in the filter table. The call also manages the contents of the MAC address filter table: individual MAC addresses may be added or removed, and the filter table may be cleared. If the filter table is modified by a call, there is the possibility that a packet may be improperly filtered (one that was to be filtered out may get through, or one that should have gotten through may be dropped); this is accepted in order to avoid adding extra locking to the packet processing code. In most cases, higher level protocols will handle the condition (since firmware filtering is simply a performance optimization). If, however, a specific condition requires complete accuracy, the device driver can disable filtering prior to an update, do its own filtering (as would be required if the number of receivers exceeded the number of filters in the filter table), update the filter table, and then reenable filtering.

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property.
- flags: Only bits 44-47 and 62-63 are defined; all other bits should be zero.
- multi-cast-address: Multicast MAC address, if flag bits 62 and 63 are 01 or 10; else this parameter is ignored.

Return value in register R4: state of the enables and the count of MAC filters in the table. Format:
- R = the value of the Receipt Enable bit.
- F = the value of the Filter Enable bit.
- MAC Filter Count: 16 bit count of the number of MAC filters in the multicast filter table.

Semantics:
- Validate the unit-address parameter, else return H_Parameter.
- Validate that no reserved flag bit = 1, else return H_Parameter.
- If any bits are on in the high order two bytes of the MAC parameter, return H_Parameter.
- Modify the enables per the specification, if requested.
- Modify the filter table per the specification, if requested (filtering is disabled during any filter table modification, and the filter enable state is restored after the modification):
  - If “don’t modify”: RC=H_Success.
  - If “clear all”: initialize the filter table, RC=H_Success.
  - If “add”: if there is room in the table, insert the new MAC filter entry, MAC Filter Count++, RC=H_Success; else RC=H_Constrained (duplicates are silently dropped -- the filter count stays the same and RC=H_Success).
  - If “remove”: locate the specified entry in the MAC filter table; if found, remove the entry, MAC Filter Count--, RC=H_Success; else RC=H_Not_Found.
- Load the enable bits into R4 bits 46 and 47.
- Load the MAC Filter Count into R4 bits 48-63.
- Return RC.
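Since this architecture numbers bits big-endian (bit 0 is the most significant of the 64), the R4 fields above decode as in this sketch; the struct and function names are illustrative:

```c
#include <stdint.h>

/* Decoded H_MULTICAST_CTRL R4 value. Bit numbering in the text is
 * big-endian: bit 46 is the Receipt Enable bit, bit 47 the Filter
 * Enable bit, and bits 48-63 the MAC Filter Count, so bit n maps to
 * mask 1ULL << (63 - n). */
struct mcast_state {
    int      receipt_enable;  /* R: bit 46 */
    int      filter_enable;   /* F: bit 47 */
    uint16_t filter_count;    /* bits 48-63 (low 16 bits) */
};

static struct mcast_state decode_mcast_r4(uint64_t r4)
{
    struct mcast_state s;
    s.receipt_enable = (int)((r4 >> (63 - 46)) & 1);  /* >> 17 */
    s.filter_enable  = (int)((r4 >> (63 - 47)) & 1);  /* >> 16 */
    s.filter_count   = (uint16_t)(r4 & 0xFFFF);
    return s;
}
```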
H_CHANGE_LOGICAL_LAN_MAC

This hcall() allows the changing of the virtual IOA’s MAC address.

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property.
- mac-address: The new receive filter MAC address.

Semantics:
- Validate the unit address, else H_Parameter.
- Record the low order 6 bytes of mac-address for filtering future incoming messages.
- Return H_Success.
H_ILLAN_ATTRIBUTES

There are certain ILLAN attributes that are made visible to, and can be manipulated by, partition software. The H_ILLAN_ATTRIBUTES hcall is used to read and modify these attributes (see ). The following defines the attributes that are visible and manipulatable.

ILLAN Attributes
- Bits 0-45 -- Reserved.
- Bit 46 -- Checksum Offload Non-zero Checksum Field Support: This bit is implemented when the hypervisor supports sending TCP packets with a non-zero TCP checksum field when bit 6 of the buffer descriptor (the “No Checksum” bit) is set. This bit indicates that R1–17.3.6.2.2–3 is not required.
- Bit 47 -- Reserved.
- Bit 48 -- Large Send Indication Supported: This bit is implemented when the large send indication bit in the I/O descriptor passed to H_SEND_LOGICAL_LAN is supported by firmware. 0: Software must not request large send indication by setting bit 5 of the buffer descriptor. 1: Software may request large send indication by setting bit 5 of the buffer descriptor.
- Bit 49 -- Port Disabled: When the bit is a 1, the port is disabled. When the port is disabled, no Ethernet traffic is permitted to be transmitted or received. H_Parameter is returned if this bit is turned on in either the reset or set masks. On firmware that does not support this function, bit 49 is reserved and required to be 0, from which the OS can infer that the port is enabled.
- Bit 50 -- Checksum Offload Padded Packet Support: This bit is implemented when the ILLAN Checksum Offload Padded Packet Support option is implemented. See . 0: Software must not request checksum offload, by setting bit 6 of the buffer descriptor (the No Checksum bit), for packets that have been padded. 1: Software may request checksum offload, by setting bit 6 of the buffer descriptor (the No Checksum bit), for packets that have been padded.
- Bit 51 -- Buffer Size Control: This bit is implemented when the ILLAN Buffer Size Control option is implemented. It allows the partition software to inhibit the use of too large a buffer for incoming packets when a reasonably sized buffer is not available. The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and the time it is deregistered by an H_FREE_LOGICAL_LAN. See also . 1: The hypervisor keeps a history of what buffer sizes have been registered. When a packet arrives, the history is searched to find the smallest buffer size that will contain the packet. If that buffer size is depleted, then the packet is dropped by the hypervisor (H_Dropped) instead of searching for the next larger available buffer. 0: This is the initial value. When a packet arrives, the available buffers are searched for the smallest available buffer that will hold the packet, and the packet is not dropped unless no buffer is available in which the packet will fit.
- Bits 52-55 -- Trunk Adapter Priority: This field is implemented for a VIOA whenever the ILLAN Backup Trunk Adapter option is implemented and the VIOA is a Trunk Adapter (the Active Trunk Adapter bit will also be implemented in this case). If this field is a 0, then either the ILLAN Backup Trunk Adapter option is not implemented, or it is implemented but this VIOA is not a Trunk Adapter. A non-0 value in this field reflects the priority of the node in the backup Trunk Adapter hierarchy, with a value of 1 being the highest (most favored) priority, a value of 2 being the next highest priority, and so on. This field may or may not be changeable by the partition firmware via the H_ILLAN_ATTRIBUTES hcall() (platform implementation dependent). If not changeable, then attempts to change this field result in a return code of H_Constrained. See also .
- Bits 56-60 -- Reserved.
- Bit 61 -- TCP Checksum Offload Support for IPv6: This bit is implemented for a VIOA whenever the ILLAN Checksum Offload Support option is implemented for TCP, the IPv6 protocol, and the following extension headers: Hop-by-Hop Options, Routing, Destination Options, Authentication, and Mobility. This bit is initially set to 0 by the firmware, and the ILLAN DD may attempt to set it to a 1 by use of the H_ILLAN_ATTRIBUTES hcall() if the DD supports the option for TCP and IPv6. Firmware does not allow changing the state of this bit if it does not support Checksum Offload Support for TCP for IPv6 for the VIOA (H_Constrained is returned in this case from the H_ILLAN_ATTRIBUTES hcall() when this bit is a 1 in the set-mask). The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and the time it is deregistered by an H_FREE_LOGICAL_LAN. See for more information. 1: The partition software has indicated that it supports the ILLAN Checksum Offload Support option for the TCP and IPv6 protocols and for the above stated extension headers by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, and the firmware has verified that it supports this protocol for the option for the VIOA. 0: The partition software has not indicated such support via the set-mask, or it has, but the firmware does not support the option, or supports the option but not for this protocol or for this VIOA.
- Bit 62 -- TCP Checksum Offload Support for IPv4: This bit is implemented for a VIOA whenever the ILLAN Checksum Offload Support option is implemented for TCP and the IPv4 protocol. This bit is initially set to 0 by the firmware, and the ILLAN DD may attempt to set it to a 1 by use of the H_ILLAN_ATTRIBUTES hcall() if the DD supports the option for TCP and IPv4. Firmware does not allow changing the state of this bit if it does not support Checksum Offload Support for TCP or IPv4 for the VIOA (H_Constrained is returned in this case from the H_ILLAN_ATTRIBUTES hcall() when this bit is a 1 in the set-mask). The state of this bit cannot be changed between the time that the ILLAN is registered by an H_REGISTER_LOGICAL_LAN and the time it is deregistered by an H_FREE_LOGICAL_LAN. See for more information. 1: The partition software has indicated that it supports the ILLAN Checksum Offload Support option for the TCP and IPv4 protocols by using the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask, and the firmware has verified that it supports this protocol for the option for the VIOA. 0: The partition software has not indicated such support via the set-mask, or it has, but the firmware does not support the option, or supports the option but not for this protocol or for this VIOA.
- Bit 63 -- Active Trunk Adapter: This bit is implemented for a VIOA whenever the ILLAN Backup Trunk Adapter option is implemented and the VIOA is a Trunk Adapter (the Trunk Adapter Priority field will also be implemented in this case). This bit is initially set to 0 by the firmware for an inactive Trunk Adapter, and initially set to 1 by the firmware for an active Trunk Adapter. This bit is changed from a 0 to a 1 when all of the following are true: (1) the partition software (via the H_ILLAN_ATTRIBUTES hcall() with this bit set to a 1 in the set-mask) attempts to set this bit to a 1, (2) the firmware supports the Backup Trunk Adapter option, and (3) the VIOA is a Trunk Adapter. This bit is changed from a 1 to a 0 by the firmware when another Trunk Adapter has had its Active Trunk Adapter bit changed from a 0 to a 1. See for more information. 1: The VIOA is the active Trunk Adapter. 0: The VIOA is not an active Trunk Adapter, or is not a Trunk Adapter at all.
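A sketch of mask constants for these attribute bits, using the big-endian bit numbering of the table (bit n corresponds to mask 1ULL << (63 - n)); the macro names are illustrative:

```c
#include <stdint.h>

/* Big-endian bit n of a 64-bit ILLAN Attributes value. */
#define ILLAN_BIT(n)  (1ULL << (63 - (n)))

#define ILLAN_ATTR_NONZERO_CSUM_FIELD  ILLAN_BIT(46)
#define ILLAN_ATTR_LARGE_SEND_IND      ILLAN_BIT(48)
#define ILLAN_ATTR_PORT_DISABLED       ILLAN_BIT(49)
#define ILLAN_ATTR_CSUM_PADDED_PKT     ILLAN_BIT(50)
#define ILLAN_ATTR_BUFFER_SIZE_CTRL    ILLAN_BIT(51)
#define ILLAN_ATTR_TRUNK_PRIO_MASK     (0xFULL << (63 - 55)) /* bits 52-55 */
#define ILLAN_ATTR_TCP_CSUM_IPV6       ILLAN_BIT(61)
#define ILLAN_ATTR_TCP_CSUM_IPV4       ILLAN_BIT(62)
#define ILLAN_ATTR_ACTIVE_TRUNK        ILLAN_BIT(63)

/* Extract the 4-bit Trunk Adapter Priority field (bits 52-55);
 * 0 means not a Trunk Adapter or option not implemented. */
static inline unsigned int illan_trunk_priority(uint64_t attrs)
{
    return (unsigned int)((attrs >> (63 - 55)) & 0xF);
}
```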
R1--1. If the H_ILLAN_ATTRIBUTES hcall is implemented, then it must implement the attributes as they are defined in and the syntax and semantics as defined in .

R1--2. The H_ILLAN_ATTRIBUTES hcall must ignore bits in the set-mask and reset-mask which are not implemented for the specified unit-address, must process as an exception those which cannot be changed for the specified unit-address (H_Constrained returned), and must return the following for the ILLAN Attributes in R4:
- A value of 0 for unimplemented bit positions.
- The resultant field values for implemented fields.

Syntax:

Parameters:
- unit-address: Unit address per the device tree node “reg” property. The ILLAN unit address on which this attribute modification is to be performed.
- reset-mask: The bit-significant mask of bits to be reset in the ILLAN’s Attributes (the reset-mask bit definition aligns with the bit definition of the ILLAN’s Attributes, as defined in ). The complement of the reset-mask is ANDed with the ILLAN’s Attributes prior to applying the set-mask. See the semantics for more details on any field-specific actions needed during the reset operations. If a particular field position in the ILLAN Attributes is not implemented, then the corresponding bit(s) in the reset-mask are ignored.
- set-mask: The bit-significant mask of bits to be set in the ILLAN’s Attributes (the set-mask bit definition aligns with the bit definition of the ILLAN’s Attributes, as defined in ). The set-mask is ORed with the ILLAN’s Attributes after applying the reset-mask. See the semantics for more details on any field-specific actions needed during the set operations. If a particular field position in the ILLAN Attributes is not implemented, then the corresponding bit(s) in the set-mask are ignored.

Semantics:
- Validate that the unit address belongs to the partition, else H_Parameter.
- Reset/set the bits in the ILLAN Attributes as indicated by the reset-mask and set-mask, except as indicated in the following conditions.
- If the Buffer Size Control bit is being changed from a 0 to a 1 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
  - The firmware does not support the ILLAN Buffer Size Control option.
- If the Buffer Size Control bit is being changed from a 1 to a 0 and the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
- If either the TCP Checksum Offload Support for IPv4 bit or the TCP Checksum Offload Support for IPv6 bit is being changed from a 0 to a 1 and any of the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
  - The firmware does not support the ILLAN Checksum Offload Support option, or supports it but not for the specified protocol(s), or does not support it for this VIOA.
- If the TCP Checksum Offload Support for IPv4 bit or the TCP Checksum Offload Support for IPv6 bit is being changed from a 1 to a 0 and the following is true, then do not allow the change (H_Constrained will be returned):
  - The ILLAN is active. That is, the ILLAN has been registered (H_REGISTER_LOGICAL_LAN) but has not been deregistered (H_FREE_LOGICAL_LAN).
- If the Active Trunk Adapter bit is being changed from a 0 to a 1 and the following is true, then do not allow the change (H_Constrained will be returned):
  - The firmware does not support the ILLAN Backup Trunk Adapter option, or this VIOA is not a Trunk Adapter.
- If the Active Trunk Adapter bit is being changed from a 1 to a 0, then return H_Parameter.
- If the Active Trunk Adapter bit is changed from a 0 to a 1 for a VIOA, then also set any previously active Trunk Adapter’s Active Trunk Adapter bit from a 1 to a 0.
- If the Trunk Adapter Priority field is being changed from 0 to a non-0 value, then return H_Parameter.
- If the Trunk Adapter Priority field is being changed from a non-0 value to another non-0 value, and either the parameter is not changeable or the change is not within the platform allowed limits, then do not allow the change (H_Constrained will be returned).
- Load R4 with the value of the ILLAN’s Attributes, with any unimplemented bits set to 0. If all requested changes were made, then return H_Success; otherwise return H_Constrained.
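The mask arithmetic described above reduces to a single expression; this sketch also shows the all-zero-mask call used purely as a query (the wrapper prototype is an assumption):

```c
#include <stdint.h>

/* Hypothetical wrapper: applies the masks and returns the resulting
 * ILLAN Attributes (the R4 value) through *attrs. */
extern long h_illan_attributes(unsigned long unit_address,
                               uint64_t reset_mask, uint64_t set_mask,
                               uint64_t *attrs);

/* The architected update rule: clear the reset-mask bits first,
 * then OR in the set-mask bits. */
static inline uint64_t illan_apply_masks(uint64_t attrs,
                                         uint64_t reset_mask,
                                         uint64_t set_mask)
{
    return (attrs & ~reset_mask) | set_mask;
}

/* Query without modification: both masks all-0 simply reads back the
 * current attributes in R4. */
static long illan_query_attributes(unsigned long unit_address,
                                   uint64_t *attrs)
{
    return h_illan_attributes(unit_address, 0, 0, attrs);
}
```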
Other hcall()s extended or used by the Logical LAN Option
H_VIO_SIGNAL

The H_VIO_SIGNAL hcall() is used by multiple VIO options.
H_EOI

The H_EOI hcall(), when specifying an interrupt source number associated with an interpartition logical LAN IOA, incorporates the interrupt reset function.
H_XIRR

This call is extended to report the interrupt source number associated with virtual interrupts from an ILLAN IOA.
H_PUT_TCE

This standard hcall() is used to manage the ILLAN IOA’s I/O translations.
H_GET_TCE

This standard hcall() is used to manage the ILLAN IOA’s I/O translations.
H_MIGRATE_DMA

This hcall() is extended to serialize with the H_SEND_LOGICAL_LAN hcall() to allow for migration of TCE mapped DMA pages.
RTAS Calls Extended or Used by the Logical LAN Option

Platforms may combine the Logical LAN option with most other LoPAR options, such as dynamic reconfiguration, by including the appropriate OF properties and extending the associated firmware calls. However, the ibm,set-xive, ibm,get-xive, ibm,int-off, and ibm,int-on RTAS calls are extended as part of the base support.
Interpartition Logical LAN Requirements

The following requirements are mandated for platforms implementing the ILLAN option.

R1--1. For the ILLAN option: The platform must interpret logical LAN buffer descriptors as defined in .
R1--2. For the ILLAN option: The platform must reject logical LAN buffer descriptors that are not 8 byte aligned.
R1--3. For the ILLAN option: The platform must interpret the first byte of a logical LAN buffer descriptor as a control byte, the high order bit being the valid bit.
R1--4. For the ILLAN option: The platform must set the next to high order bit of the control byte of the logical LAN buffer descriptor for the receive queue to the inverse of the value currently being used to indicate a valid receive queue entry.
R1--5. For the ILLAN option: The platform must interpret the 2nd through 4th bytes of a logical LAN buffer descriptor as the binary length of the buffer in I/O space (relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property).
R1--6. For the ILLAN option: The platform must interpret the 5th through 8th bytes of a logical LAN buffer descriptor as the binary beginning address of the buffer in I/O space (relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property).
R1--7. For the ILLAN option: The platform must interpret logical LAN buffer lists as defined in .
R1--8. For the ILLAN option: The platform must reject logical LAN buffer lists that are not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.
R1--9. For the ILLAN option: The platform must reject logical LAN buffer lists that are not 4 KB aligned.
R1--10. For the ILLAN option: The platform must interpret the first 8 bytes of a logical LAN buffer list as a buffer descriptor for the logical IOA’s receive queue.
R1--11. For the ILLAN option: The platform must interpret the logical LAN receive queue as defined in .
R1--12. For the ILLAN option: The platform must reject a logical LAN receive queue that is not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.
R1--13. For the ILLAN option: The platform must reject a logical LAN receive queue that is not aligned on a 4 byte boundary.
R1--14. For the ILLAN option: The platform must reject a logical LAN receive queue that is not an exact multiple of 12 bytes long.
R1--15. For the ILLAN option: The platform must manage the logical LAN receive queue as a circular buffer.
R1--16. For the ILLAN option: The platform must enqueue 12 byte logical LAN receive queue entries when a new message is received.
R1--17. For the ILLAN option: The platform must set the last 8 bytes of the logical LAN receive queue entry to the value of the user supplied correlator found in the first 8 bytes of the logical LAN receive buffer used to contain the message, before setting the first 4 bytes of the logical LAN receive queue entry.
R1--18. For the ILLAN option: The platform must set the first 4 bytes of the logical LAN receive queue entry such that the first byte contains the control field (high order bit the inverse of the valid toggle in the receive queue buffer descriptor, next bit a one if the message payload is valid) and the last 3 bytes contain the receive message length, after setting the correlator field in the last 8 bytes per Requirement .
R1--19. For the ILLAN option: The platform must invert the value of the valid toggle in the receive queue buffer descriptor when crossing from the end of the logical LAN receive queue back to the beginning.
R1--20. For the ILLAN option: The platform’s OF must disable interrupts from the logical LAN IOA before initially passing control to the booted client program.
R1--21. For the ILLAN option: The platform must present (as appropriate per RTAS control of the interrupt source number) to the partition owning a logical LAN receive queue the appearance of an interrupt, from the interrupt source number associated through the OF device tree node with the virtual device, when a new entry is enqueued to the logical LAN receive queue and the last interrupt mode set via H_VIO_SIGNAL was “Enabled”, unless a previous interrupt from the interrupt source number is still outstanding.
R1--22. For the ILLAN option: The platform must NOT present to the partition owning a logical LAN receive queue the appearance of an interrupt, from the interrupt source number associated through the OF device tree node with the virtual device, if the last interrupt mode set via H_VIO_SIGNAL was “Disabled”, unless a previous interrupt from the interrupt source number is still outstanding.
R1--23. For the ILLAN option: The platform must interpret logical LAN receive buffers as defined in .
R1--24. For the ILLAN option: The platform must reject a logical LAN receive buffer that is not mapped relative to the TCE mapping table defined by the logical IOA’s “ibm,my-dma-window” property.
R1--25. For the ILLAN option: The platform must reject a logical LAN receive buffer that is not aligned on a 4 byte boundary.
R1--26. For the ILLAN option: The platform must reject a logical LAN receive buffer that is not a minimum of 16 bytes long.
R1--27. For the ILLAN option: The platform must not modify the first 8 bytes of a logical LAN receive buffer; this area is reserved for a user supplied correlator value.
R1--28. For the ILLAN option: The platform must not allow corruption caused by a user modifying the logical LAN receive buffer to escape the user partition (except as a side effect of some other user partition I/O operation).
R1--29. For the ILLAN option: The platform’s l-lan OF device tree node must contain properties as defined in . (Other standard I/O adapter properties are permissible as appropriate.)
R1--30. For the ILLAN option: The platform must implement the H_REGISTER_LOGICAL_LAN hcall() as defined in .
R1--31. For the ILLAN option: The platform must implement the H_FREE_LOGICAL_LAN hcall() as defined in .
R1--32. For the ILLAN option: The platform must implement the H_ADD_LOGICAL_LAN_BUFFER hcall() as defined in .
R1--33. For the ILLAN option: The platform must implement the H_SEND_LOGICAL_LAN hcall() as defined in .
R1--34. For the ILLAN option: The platform must implement the H_SEND_LOGICAL_LAN hcall() such that an OS requested modification to an active RTCE table entry cannot corrupt memory in other partitions (except indirectly as a result of some other of the partition’s I/O operations).
R1--35. For the ILLAN option: The platform must implement the H_CHANGE_LOGICAL_LAN_MAC hcall() as defined in .
R1--36. For the ILLAN option: The platform must implement the H_VIO_SIGNAL hcall() as defined in .
R1--37. For the ILLAN option: The platform must implement the extensions to the H_EOI hcall() as defined in .
R1--38. For the ILLAN option: The platform must implement the extensions to the H_XIRR hcall() as defined in .
R1--39. For the ILLAN option: The platform must implement the H_PUT_TCE hcall().
R1--40. For the ILLAN option: The platform must implement the H_GET_TCE hcall().
R1--41. For the ILLAN option: The platform must implement the extensions to the H_MIGRATE_DMA hcall() as defined in .
R1--42. For the ILLAN option: The platform must emulate the standard PowerPC External Interrupt Architecture for the interrupt source numbers associated with the virtual devices via the standard RTAS and hypervisor interrupt calls.
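Requirements R1--16 through R1--18 pin down a 12 byte receive queue entry layout, which might be overlaid as in this sketch (struct name illustrative; note the platform writes the correlator before the first 4 bytes, so a driver polls the control byte to detect a new entry):

```c
#include <stdint.h>

/* 12 byte logical LAN receive queue entry per R1--16 through R1--18.
 * Byte 0: control field (high order bit = inverse of the current
 * valid toggle, next bit = message payload valid). Bytes 1-3: the
 * receive message length. Bytes 4-11: the correlator copied from the
 * first 8 bytes of the receive buffer. */
struct llan_rq_entry {
    uint8_t  control;
    uint8_t  length[3];   /* 3 byte big-endian message length */
    uint64_t correlator;  /* driver's opaque handle */
} __attribute__((packed));

static inline uint32_t llan_rq_length(const struct llan_rq_entry *e)
{
    return ((uint32_t)e->length[0] << 16) |
           ((uint32_t)e->length[1] << 8)  |
            (uint32_t)e->length[2];
}
```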
Logical LAN Options

The ILLAN option has several sub-options. The hypervisor reports to the partition software when it supports one or more of these options, and potentially other information about those option implementations, via the implementation of the appropriate bits in the ILLAN Attributes, which can be ascertained by the H_ILLAN_ATTRIBUTES hcall(). The same hcall() may be used by the partition software to communicate back to the firmware the level of support for those options where the firmware needs to know the level of partition software support. The “ibm,illan-options” property will exist in the VIOA’s Device Tree node, indicating that the H_ILLAN_ATTRIBUTES hcall() is implemented and, therefore, that one or more of the options are implemented. The following sections give more details.
ILLAN Backup Trunk Adapter Option

The ILLAN Backup Trunk Adapter option allows the platform to provide one or more backups to a Trunk Adapter, for reliability purposes. Implementation of the ILLAN Backup Trunk Adapter option is indicated to the partition by the existence of the “ibm,illan-options” property in the VIOA’s Device Tree node and a non-0 value in the ILLAN Attributes Trunk Adapter Priority field. A Trunk Adapter becomes the active Trunk Adapter by calling the H_ILLAN_ATTRIBUTES hcall() and setting its Active Trunk Adapter bit. Only one Trunk Adapter is active for a VLAN at a time. The protocol that determines which Trunk Adapter is active at any particular time is beyond the scope of this architecture.

R1--1. For the ILLAN Backup Trunk Adapter option: The platform must implement the ILLAN option.
R1--2. For the ILLAN Backup Trunk Adapter option: The platform must implement the H_ILLAN_ATTRIBUTES hcall().
R1--3. For the ILLAN Backup Trunk Adapter option: The platform must implement the “ibm,illan-options” and “ibm,trunk-adapter” properties in all the Trunk Adapter nodes of the Device Tree.
R1--4. For the ILLAN Backup Trunk Adapter option: The platform must implement the Active Trunk Adapter bit and the Trunk Adapter Priority field in the ILLAN Attributes, as defined in , for all Trunk Adapter VIOAs.
R1--5. For the ILLAN Backup Trunk Adapter option: The platform must allow only one Trunk Adapter to be active for a VLAN at any given time, and must:
- Make the determination of which one is active by whichever was the most recent one to set its Active Trunk Adapter bit in its ILLAN Attributes.
- Turn off the Active Trunk Adapter bit in the ILLAN Attributes for a Trunk Adapter when it is removed from the active Trunk Adapter state.
ILLAN Checksum Offload Support Option

This option allows for the support of IOAs that do checksum offload processing. It allows for support at one end (client or server) but not the other, on a per-protocol basis, with the hypervisor generating the checksum when the client supports offload but the server does not and the operation is a send from the client.
General

The H_ILLAN_ATTRIBUTES hcall is used to establish the common set of checksum offload protocols to be supported between the firmware and the partition software. The firmware indicates support for H_ILLAN_ATTRIBUTES via the “ibm,illan-options” property in the VIOA’s Device Tree node. The partition software can determine which of the Checksum Offload protocols (if any) the firmware supports either by attempting to set the bits in the ILLAN Attributes for the protocols that the partition software supports, or by calling the hcall() with reset-mask and set-mask parameters of all-0’s (the latter being just a query and not a request to support anything between the partition and the firmware).

Two bits in the control field of the first buffer descriptor specify which operations do not contain a checksum and which have had their checksum already verified. See . These two bits are transferred to the corresponding control field of the Receive Queue Entry, with the exception that the H_SEND_LOGICAL_LAN hcall will sometimes set them to 0b00 (see ).

R1--1. For the ILLAN Checksum Offload Support option: The platform must do all the following:
- Implement the ILLAN option.
- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the “ibm,illan-options” property in the VIOA’s Device Tree node.
- Implement the appropriate Checksum Offload Support bit(s) of the ILLAN Attributes, as defined in .

Software Implementation Note: Fragmentation and encryption are not supported when the No Checksum bit of the buffer descriptor is set to a 1.
H_SEND_LOGICAL_LAN Semantic Changes

There are several H_SEND_LOGICAL_LAN semantic changes required for the ILLAN Checksum Offload Support option. See for the base semantics.

R1--1. For the ILLAN Checksum Offload Support option: The H_SEND_LOGICAL_LAN semantics must be changed as follows:
- As shown in ; for multicast operations, the determination in the table must be applied for each destination.
- If the No Checksum bit is set to a 1 in the first buffer descriptor, the adapter is not a Trunk Adapter, and the source MAC address does not match the adapter’s MAC address, then drop the packet.

Summary of H_SEND_LOGICAL_LAN Semantics with Checksum Offload. Each row gives: whether the sender has set the appropriate Checksum Offload Support bit in the ILLAN Attributes for the protocol being used; whether the receiver has set it; the values of the No Checksum and Checksum Good bits in the buffer descriptor; the additional H_SEND_LOGICAL_LAN semantics; and the additional receiver DD requirements.
- Sender: no; Receiver: any; bits 0/0: No additional semantics; no additional receiver DD requirements.
- Sender: no; Receiver: any; either bit non-0: Return H_Parameter.
- Sender: yes; Receiver: any; bits 0/0: No additional semantics; no additional receiver DD requirements.
- Sender: yes; Receiver: no; bits 0/1: Set the No Checksum and Checksum Good bits to 00 on transfer; no additional receiver DD requirements.
- Sender: yes; Receiver: no; bits 1/1: Generate the checksum and set the No Checksum and Checksum Good bits to 00 on transfer; no additional receiver DD requirements.
- Sender: yes; Receiver: yes; bits 0/1: No additional semantics; the receiver DD does not need to do checksum checking.
- Sender: yes; Receiver: yes; bits 1/1: No additional semantics; the receiver DD does not need to do checksum checking, but must generate the checksum if the packet is to be passed on to an external LAN (this may be done by the IOA or by the DD).
- Sender: any; Receiver: any; bits 1/0: Return H_Parameter.
- Sender: yes; Receiver: any; bits 01 or 11, and the packet type is not supported by the hypervisor (as indicated by the value returned by the H_ILLAN_ATTRIBUTES hcall()): Return H_Parameter.
R1--2. For the ILLAN Checksum Offload Support option: The Receiver DD Additional Requirements shown in must be implemented.
R1--3. For the ILLAN Checksum Offload Support option: When the caller of H_SEND_LOGICAL_LAN has set the No Checksum bit in the Control field to a 1, it must also have set the checksum field in the packet to 0, unless bit 46 in the ILLAN Attributes (the “Checksum Offload Non-zero Checksum Field Support” bit) is set.
Checksum Offload Padded Packet Support Option

Firmware may or may not support checksum offload for IPv4 packets that have been padded. The Checksum Offload Padded Packet Support bit of the ILLAN Attributes specifies whether or not this option is supported.

R1--1. For the Checksum Offload Padded Packet Support option: The platform must do all the following:
- Implement the ILLAN Checksum Offload Support option.
- Implement the Checksum Offload Padded Packet Support bit of the ILLAN Attributes, as defined in , and set that bit to a value of 1.
ILLAN Buffer Size Control Option

It is the partition software’s responsibility to keep the firmware supplied with enough buffers to keep packets from being dropped. The ILLAN Buffer Size Control option gives the partition software a way to prevent a flood of small packets from consuming buffers that have been allocated for larger packets. When this option is implemented and the Buffer Size Control bit in the ILLAN Attributes is set to a 1 for the VLAN, the hypervisor keeps a history of which buffer sizes have been registered. Then, when a packet arrives, the history is searched to find the smallest buffer size that will contain the packet. If that buffer size is depleted, then the packet is dropped by the hypervisor (H_Dropped) instead of searching for the next larger available buffer.
General

The following are the general requirements for this option. For H_SEND_LOGICAL_LAN changes, see .

R1--1. For the ILLAN Buffer Size Control option: The platform must do all the following:
- Implement the ILLAN option.
- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the “ibm,illan-options” property in the VIOA’s Device Tree node.
- Implement the Buffer Size Control bit of the ILLAN Attributes, as defined in .
H_SEND_LOGICAL_LAN Semantic Changes

The following are the required semantic changes to the H_SEND_LOGICAL_LAN hcall().

R1--1. For the ILLAN Buffer Size Control option: When the Buffer Size Control bit of the target of an H_SEND_LOGICAL_LAN hcall() is set to a 1, the firmware for the H_SEND_LOGICAL_LAN hcall() must not search for just any available buffer into which the packet will fit; it must instead place the packet into the receiver’s buffer only if there is an available buffer of the smallest size, previously registered by the receiver, which will fit the packet, and must drop the packet for that target otherwise. A sketch contrasting the two policies follows.
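To make the two policies concrete, this sketch contrasts the default search with the Buffer Size Control behavior over an ascending list of registered pool sizes; it is illustrative logic, not the firmware's actual data structures:

```c
#include <stddef.h>
#include <stdint.h>

/* One pool per registered buffer size, sorted ascending by size. */
struct buf_pool {
    uint32_t size;   /* registered buffer size */
    unsigned avail;  /* buffers currently available in this pool */
};

/* Returns the index of the pool to use, or -1 for H_Dropped.
 * strict != 0 models Buffer Size Control = 1: only the smallest
 * registered size that fits may be used, even if it is depleted. */
static int select_pool(const struct buf_pool *pools, size_t n,
                       uint32_t pkt_len, int strict)
{
    for (size_t i = 0; i < n; i++) {
        if (pools[i].size < pkt_len)
            continue;                 /* too small for this packet */
        if (pools[i].avail > 0)
            return (int)i;            /* smallest fitting pool with buffers */
        if (strict)
            return -1;                /* depleted: drop instead of widening */
    }
    return -1;                        /* nothing fits: drop */
}
```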
ILLAN Large Send Indication Option

This option allows the virtual device to send an indication to the receiver that the data being sent by H_SEND_LOGICAL_LAN contains a large send packet.
General

The following are the general requirements for this option. For H_SEND_LOGICAL_LAN changes, see .

R1--1. For the ILLAN Large Send Indication option: The platform must do all the following:
- Implement the H_ILLAN_ATTRIBUTES hcall().
- Implement the Large Send Indication bit of the ILLAN Attributes, as defined in .
H_SEND_LOGICAL_LAN Semantic Changes

The following are the required semantic changes to the H_SEND_LOGICAL_LAN hcall().

R1--1. For the ILLAN Large Send Indication option: When the Large Send Indication bit of the first buffer descriptor is set to 1, the firmware for the H_SEND_LOGICAL_LAN hcall() must set the Large Send Indication bit in the receiver's receive queue entry to 1 when the packet is copied to the destination receive buffer.
Virtual SCSI (VSCSI)

Virtual SCSI (VSCSI) support is provided by code running in a server partition that uses the mechanisms of the Reliable Command/Response Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to service I/O requests for code running in a client partition, such that the client partition appears to enjoy the services of its own SCSI adapter (see ). The terms server and client partitions refer to platform partitions that are, respectively, servers and clients of requests, usually I/O operations, using the physical I/O adapters (IOAs) that are assigned to the server partition. This allows a platform to have more client partitions than it has physical I/O adapters, because the client partitions share I/O adapters via the server partition. The VSCSI architecture is built upon the architecture specified in the following sections:
VSCSI General

This section contains an informative outline of the architectural intent of the use of the Synchronous VIO Infrastructure to provide VSCSI support, along with a few architectural requirements. Other implementations of the server and client partition code, consistent with this architecture, are possible and may be preferable.

The architectural metaphor for the VSCSI subsystem is that the server partition provides the virtual equivalent of a single SCSI DASD/Media string via each VSCSI server virtual IOA, while the client partition provides the virtual equivalent of a single port SCSI adapter via each VSCSI client IOA. The platform, through the partition definition, provides the means for defining the set of virtual IOAs owned by each partition and their respective location codes. The platform also provides, through the partition definition, instructions to connect each client partition’s VSCSI client IOA to a specific server partition’s VSCSI server IOA; that is, the equivalent of connecting the adapter cable to the specific DASD/Media string. The mechanism for specifying this partition definition is beyond the scope of this architecture. The human readable handle associated with the partition definition of virtual IOAs and their associated interconnection and resource configuration is the virtual location code. The OF unit address (unit ID) remains the invariant handle upon which the OS builds its “physical to logical” configuration.

The client partition’s device tree contains one or more nodes notifying the partition that it has been assigned one or more virtual adapters. The node’s “device_type” and “compatible” properties notify the partition that the virtual adapter is a VSCSI adapter. The unit address of the node is used by the client partition to map the virtual device(s) to the OS’s corresponding logical representations. The “ibm,my-dma-window” property communicates the size of the RTCE table window panes that the hypervisor has allocated. The node also specifies the interrupt source number that has been assigned to the Reliable Command/Response Transport connection and the RTCE range that the client partition device driver may use to map its memory for access by the server partition via Logical Remote DMA. The client partition uses the four hcall()s associated with the Reliable Command/Response Transport facility to register and deregister its CRQ, manage notification of responses, and send command requests to the server partition.

The server partition’s device tree contains one or more nodes notifying the partition that it is requested to supply VSCSI services for one or more client partitions. The unit address (unit ID) of the node is used by the server partition to map to the local logical devices that are represented by this VSCSI device. The node also specifies the interrupt source number that has been assigned to the Reliable Command/Response Transport connection and the RTCE range that the server partition device driver may use for its copy Logical Remote DMA. The server partition uses the four hcall()s associated with the Reliable Command/Response Transport facility to register and deregister its command request queue, manage notification of new requests, and send responses back to the client partition. In addition, the server partition uses the hcall()s of the Logical Remote DMA facility to manage the movement of commands and data associated with the client requests.
The client partition, upon noting the device tree entry for the virtual adapter, loads the device driver associated with the value of the “compatible” property. The device driver, when configured and opened, allocates memory for its CRQ (an array, large enough for all possible responses, of 16 byte elements), pins the queue, and maps it into the I/O space of the RTCE window specified in the “ibm,my-dma-window” property using the standard kernel mapping services that subsequently use the H_PUT_TCE hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next, I/O request control blocks (within which the I/O request commands are built) are allocated, pinned, and mapped into I/O address space. Finally, the device driver registers to receive control when the interrupt source specified in the virtual IOA’s device tree node signals.

Once the CRQ is set up, the device driver queues an Initialization Command/Response with the second byte set to “Initialize” in order to tell the hosting side that everything is set up on the hosted side. The response to this send may be that the send has been dropped or that it has successfully been sent. If successful, the sender should expect back an Initialization Command/Response with a second byte of “Initialization Complete,” at which time the communication path can be deemed to be open. If dropped, then the sender waits for the receipt of an Initialization Command/Response with a second byte of “Initialize,” at which time an “Initialization Complete” message is sent; if that message is sent successfully, then the communication path can be deemed to be open.

When the VSCSI adapter device driver receives an I/O request from one of the SCSI device head drivers, it executes the following sequence. First, an I/O request control block is allocated. Then the driver builds the SCSI request within the control block, adds a correlator field (to be returned in the subsequent response), I/O maps any target memory buffers, and places their DMA descriptors into the I/O request control block. With the request constructed in the I/O request control block, the driver constructs a DMA descriptor (starting offset and length) representing the I/O request within the I/O request control block. Lastly, the driver passes the I/O request’s DMA descriptor to the server partition using the H_SEND_CRQ hcall(). Provided that the H_SEND_CRQ hcall() succeeds, the VSCSI adapter device driver returns, waiting for the response interrupt indicating that a response has been posted by the server partition to the device driver’s response queue. The response queue entry contains the summary status and request correlator. From the request correlator, the device driver accesses the I/O request control block, and from the summary status, the device driver determines how to complete the processing of the I/O request.

Notice that the client partition only uses the Reliable Command/Response Transport primitives; it does not use the Logical Remote DMA primitives. Since the server partition’s RTCE tables are not authorized for access by the client partition, any attempt by the client partition to modify server partition memory would be prevented by the hypervisor. RTCE table access is granted on a connection by connection basis (client/server virtual device pair). If a client partition happens to also be serving some other logical device, then that partition is entitled to use Logical Remote DMA for the virtual devices that it is serving.
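A compressed sketch of the client bring-up handshake described above, assuming hypothetical wrappers for the Reliable Command/Response Transport hcall()s; the header and initialization byte values are placeholders keyed to the CRQ format tables later in this section, not normative encodings:

```c
#include <stdint.h>

/* Hypothetical wrappers for the Reliable Command/Response Transport. */
extern long h_reg_crq(unsigned long unit_address,
                      uint64_t queue_ioaddr, uint64_t queue_len);
extern long h_send_crq(unsigned long unit_address,
                       uint64_t word0, uint64_t word1);

#define H_SUCCESS          0L    /* illustrative */
#define CRQ_INIT_HEADER    0xC0  /* placeholder: transport/initialization
                                    header per the CRQ format tables */
#define CRQ_INIT_CMD       0x01  /* "Initialize" (placeholder value) */
#define CRQ_INIT_COMPLETE  0x02  /* "Initialization Complete" (placeholder) */

/* Register the 16-byte-element CRQ (already pinned and TCE-mapped at
 * queue_ioaddr) and offer the "Initialize" handshake. If the send is
 * dropped, the driver instead waits for the partner's "Initialize"
 * and answers with "Initialization Complete". */
static long vscsi_client_open(unsigned long unit_address,
                              uint64_t queue_ioaddr, uint64_t queue_len,
                              uint8_t init_byte)
{
    long rc = h_reg_crq(unit_address, queue_ioaddr, queue_len);
    if (rc != H_SUCCESS)
        return rc;

    /* Byte 0 is the element header; byte 1 carries the init code. */
    uint64_t word0 = ((uint64_t)CRQ_INIT_HEADER << 56) |
                     ((uint64_t)init_byte << 48);
    return h_send_crq(unit_address, word0, 0);
}
```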
The server partition, upon noting the device tree entry for the virtual adapter, loads the device driver associated with the value of the “compatible” property. The device driver, when configured and opened, allocates memory for its request queue (an array, large enough for all possible outstanding requests, of 16 byte elements). The driver then pins the queue and maps it into I/O space, via the kernel’s I/O mapping services that invoke the H_PUT_TCE hcall(), using the first window pane specified in the “ibm,my-dma-window” property. The queue is then registered using the H_REG_CRQ hcall(). Next, I/O request control blocks (within which the I/O request commands are built) are allocated, pinned, and I/O mapped. Finally, the device driver registers to receive control when the interrupt source specified in the virtual IOA’s device tree node signals.

Once the CRQ is set up, the device driver queues an Initialization Command/Response with the second byte set to “Initialize” in order to tell the hosted side that everything is set up on the hosting side. The response to this send may be that the send has been dropped or that it has successfully been sent. If successful, the sender should expect back an Initialization Command/Response with a second byte of “Initialization Complete,” at which time the communication path can be deemed to be open. If dropped, then the sender waits for the receipt of an Initialization Command/Response with a second byte of “Initialize,” at which time an “Initialization Complete” message is sent; if that message is sent successfully, then the communication path can be deemed to be open.

When the server partition’s device driver receives an I/O request from its corresponding client partition’s VSCSI adapter driver, it is notified via the interrupt registered for above. The server partition’s device driver selects an I/O request control block for the requested operation. It then uses the DMA descriptor from the request queue element to transfer the SCSI request from the client partition’s I/O request control block to its own (allocated above), using the H_COPY_RDMA hcall() through the second window pane specified in the “ibm,my-dma-window” property. The server partition’s device driver then uses kernel services, which have been extended, to register the I/O request’s DMA descriptors into extended capacity cross memory descriptors (ones capable of recording the DMA descriptors). These cross memory descriptors are later mapped by the server partition’s physical device drivers into the physical I/O DMA address space of the physical I/O adapters using the kernel services, which have been similarly extended to call the H_PUT_RTCE hcall(), based upon the value of the LIOBN field referenced by the cross memory descriptor. At this point, the server partition’s VSCSI device driver delivers what appears to be a SCSI request to be decoded and routed through the server partition’s file sub-system for processing. When the request completes, the server partition’s VSCSI device driver is called by the file sub-system; it packages the summary status along with the request correlator into a response message that it sends to the client partition using the H_SEND_CRQ hcall(), then recycles the resources recorded in the I/O request control block, and the block itself.
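The server side of each exchange reduces to a copy-in, process, respond loop; the wrapper prototypes are illustrative assumptions, and the element encoding follows the example VSCSI response format below:

```c
#include <stdint.h>

/* Hypothetical wrappers for Logical Remote DMA and the CRQ send. */
extern long h_copy_rdma(uint64_t len,
                        uint32_t src_liobn, uint64_t src_ioba,
                        uint32_t dst_liobn, uint64_t dst_ioba);
extern long h_send_crq(unsigned long unit_address,
                       uint64_t word0, uint64_t word1);

/* On a new CRQ element, pull the client's I/O request control block
 * into a local, TCE-mapped buffer through the second window pane
 * (client_liobn), process it, then post the response element. */
static long vscsi_serve_one(unsigned long unit_address,
                            uint32_t client_liobn, uint64_t req_ioba,
                            uint32_t req_len,
                            uint32_t local_liobn, uint64_t local_ioba,
                            uint32_t status, uint64_t correlator)
{
    long rc = h_copy_rdma(req_len, client_liobn, req_ioba,
                          local_liobn, local_ioba);
    if (rc != 0)
        return rc;

    /* ... decode the SCSI request and route it through the server's
     * file sub-system; on completion build the response element:
     * header 0x80, format 0x02, summary status, correlator. */
    uint64_t word0 = ((uint64_t)0x80 << 56) | ((uint64_t)0x02 << 48) |
                     (uint64_t)status;
    return h_send_crq(unit_address, word0, correlator);
}
```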
If, for some reason, the physical location of the client partition’s RTCE table changes or it becomes invalid, this level of indirection allows the hypervisor to determine the current target without changing the LIOBN number as seen by the server partition. The H_PUT_TCE and H_PUT_RTCE hcall()s do not map server partition memory into the second window pane; the second window pane is only available for use by the server partition via the Logical RDMA services to reference memory mapped into it by the client partition’s IOA. This architecture does not specify the payload format of the requests or responses. However, the architectural intent is supplied in the following tables for reference.

General Form of Reliable CRQ Element

Byte Offset   Field Name   Subfield Name                 Description
0             Header                                     Contains Element Valid Bit plus Event Type Encodings (see ).
1             Payload      Format/Transport Event Code   For Valid Command Response Entries, see . For Transport Event Codes, see .
2-15                                                     Format Dependent.
Example Reliable CRQ Entry Format Byte Definitions for VSCSI

Format Byte Value   Definition
0x00                Unused
0x01                VSCSI Requests
0x02                VSCSI Responses
0x03 - 0xFE         Reserved
0xFF                Reserved for Expansion
Example VSCSI Command Queue Element

Byte Offset   Field Value   Description
0             0x80          Valid Header
1             0x01          VSCSI Request Format
2-3           NA            Reserved
4-7                         Length of the request block to be transferred
8-15                        I/O address of beginning of request
Example VSCSI Response Queue Element

Byte Offset   Field Value   Description
0             0x80          Valid Header
1             0x02          VSCSI Response Format
2-3           NA            Reserved
4-7                         Summary Status
8-15                        8 byte command correlator
See also .
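The example element layouts above can also be rendered as C structures. This is an informative transcription; the field names are illustrative, and all multi-byte fields are big-endian on the wire.

    #include <stdint.h>

    /* Example VSCSI command queue element (request), per the table above. */
    struct vscsi_cmd_elem {
        uint8_t  header;       /* byte 0: 0x80, valid */
        uint8_t  format;       /* byte 1: 0x01, VSCSI Request */
        uint8_t  reserved[2];  /* bytes 2-3 */
        uint32_t req_len;      /* bytes 4-7: length of the request block */
        uint64_t req_ioba;     /* bytes 8-15: I/O address of the request */
    };

    /* Example VSCSI response queue element, per the table above. */
    struct vscsi_rsp_elem {
        uint8_t  header;       /* byte 0: 0x80, valid */
        uint8_t  format;       /* byte 1: 0x02, VSCSI Response */
        uint8_t  reserved[2];  /* bytes 2-3 */
        uint32_t status;       /* bytes 4-7: summary status */
        uint64_t correlator;   /* bytes 8-15: 8 byte command correlator */
    };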
Virtual SCSI Requirements
This normative section provides the general requirements for the support of VSCSI.
R1--1. For the VSCSI option: The platform must implement the Reliable Command/Response Transport option as defined in .
R1--2. For the VSCSI option: The platform must implement the Logical Remote DMA option as defined in .
In addition to the firmware primitives, and the structures they define, the partition’s OS needs to know specific information regarding the configuration of the virtual IOAs that it has been assigned so that it can load and configure the correct device driver code. This information is provided by the OF device tree node associated with the virtual IOA (see and ).
Client Partition Virtual SCSI Device Tree Node
Client partition VSCSI device tree nodes have associated packages such as disk-label, deblocker, iso-13346-files, and iso-9660-files, as well as children nodes such as block and byte, as appropriate to the specific virtual IOA configuration, as would the node for a physical IOA of type scsi-3.
R1--1. For the VSCSI option: The platform’s OF device tree for client partitions must include as a child of the /vdevice node, a node of name “v-scsi” as the parent of a sub-tree representing the virtual IOAs assigned to the partition.
R1--2. For the VSCSI option: The platform’s v-scsi OF node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the VSCSI Node in the Client Partition

“name” (Required: Y) Standard property name per , specifying the virtual device name; the value shall be “v-scsi”.
“device_type” (Required: Y) Standard property name per , specifying the virtual device type; the value shall be “vscsi”.
“model” (Required: NA) Property not present.
“compatible” (Required: Y) Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,v-scsi”. “IBM,v-scsi-2” precedes “IBM,v-scsi” if it is included in the value of this property.
“used-by-rtas” (Required: See definition) Present if appropriate.
“ibm,loc-code” (Required: Y) Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in .
“reg” (Required: Y) Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length determined by the “#address-cells” property. The value is the virtual IOA’s unit address (the virtual “reg” property provides only a unit address; no actual locations are used, and therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“ibm,my-dma-window” (Required: Y) Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int.
“interrupts” (Required: Y) Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().
“ibm,my-drc-index” (Required: For DR) Present if the platform implements DR for this node.
“ibm,#dma-size-cells” (Required: See definition) Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .
“ibm,#dma-address-cells” (Required: See definition) Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
R1--3. For the VSCSI option: The platform’s v-scsi node must have as children the appropriate block (disk) and byte (tape) nodes.
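Software Implementation Note: The following C sketch shows how a client driver might decode one (LIOBN, phys, size) triple of the “ibm,my-dma-window” property, given the cell counts taken from “ibm,#dma-address-cells” and “ibm,#dma-size-cells” (or their derived defaults). The function and structure names are illustrative, and the property cells are assumed to be already converted to host byte order.

    #include <stdint.h>

    struct dma_window {
        uint32_t liobn;  /* single cell */
        uint64_t phys;   /* "ibm,#dma-address-cells" cells */
        uint64_t size;   /* "ibm,#dma-size-cells" cells */
    };

    /* Decode one window pane starting at 'cells'; returns a pointer just
     * past it (a server node's "ibm,my-dma-window" has a second pane
     * immediately following the first). */
    static const uint32_t *decode_dma_window(const uint32_t *cells,
                                             int addr_cells, int size_cells,
                                             struct dma_window *w)
    {
        w->liobn = *cells++;
        w->phys = 0;
        for (int i = 0; i < addr_cells; i++)
            w->phys = (w->phys << 32) | *cells++;
        w->size = 0;
        for (int i = 0; i < size_cells; i++)
            w->size = (w->size << 32) | *cells++;
        return cells;
    }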
Server Partition Virtual SCSI Device Tree Node
Server partition VSCSI IOA nodes have no children nodes.
R1--1. For the VSCSI option: The platform’s OF device tree for server partitions must include as a child of the /vdevice node, a node of name “v-scsi-host” as the parent of a sub-tree representing the virtual IOAs assigned to the partition.
R1--2. For the VSCSI option: The platform’s v-scsi-host node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the VSCSI Node in the Server Partition

“name” (Required: Y) Standard property name per , specifying the virtual device name; the value shall be “v-scsi-host”.
“device_type” (Required: Y) Standard property name per , specifying the virtual device type; the value shall be “v-scsi-host”.
“model” (Required: NA) Property not present.
“compatible” (Required: Y) Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,v-scsi-host”. “IBM,v-scsi-host-2” precedes “IBM,v-scsi-host” if it is included in the value of this property.
“used-by-rtas” (Required: See definition) Present if appropriate.
“ibm,loc-code” (Required: Y) Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in .
“reg” (Required: Y) Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length determined by the “#address-cells” property. The value is the virtual IOA’s unit address (the virtual “reg” property provides only a unit address; no actual locations are used, and therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“ibm,my-dma-window” (Required: Y) Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of two sets (two window panes) of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. Of these two triples, the first describes the window pane used to map server partition memory; the second is the window pane through which the client partition maps the memory that it makes available to the server partition. (Note that the mapping between the LIOBN in the second window pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding client IOA’s RTCE table is made when the CRQ successfully completes registration. See for more information on window panes.)
“interrupts” (Required: Y) Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().
“ibm,my-drc-index” (Required: For DR) Present if the platform implements DR for this node.
“ibm,vserver” (Required: Y) Property name specifying that this is a virtual server node.
“ibm,#dma-size-cells” (Required: See definition) Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .
“ibm,#dma-address-cells” (Required: See definition) Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
Virtual Terminal (Vterm)
This section defines the Virtual Terminal (Vterm) options (the Client Vterm option and the Server Vterm option). Vterm IOAs are of the hypervisor simulated class of VIO. See also .
Vterm General
This section contains an informative outline of the architectural intent of the use of Vterm support. The architectural metaphor for the Vterm IOA is that of an Async IOA. The connection at the other end of the Async “cable” may be another Vterm IOA in a server partition, the hypervisor, or the HMC.

A partition’s device tree contains one or more nodes notifying the partition that it has been assigned one or more Vterm client adapters (each LoPAR partition has at least one). The node’s “type” and “compatible” properties notify the partition that the virtual adapter is a Vterm client adapter. The unit address of the node is used by the partition to map the virtual device(s) to the OS’s corresponding logical representations. The node’s “interrupts” property, if it exists, specifies the interrupt source number that has been assigned to the client Vterm IOA for receive data. The partition uses the H_GET_TERM_CHAR and H_PUT_TERM_CHAR hcall()s to receive data from and send data to the client Vterm IOA. If the node contains the “interrupts” property, the partition may optionally use the ibm,int-on, ibm,int-off, ibm,set-xive, and ibm,get-xive RTAS calls, and the H_VIO_SIGNAL hcall(), to manage the client Vterm IOA interrupt.

A partition’s device tree may also contain one or more nodes notifying the partition that it is requested to supply server Vterm IOA services for one or more client Vterm IOAs. The node’s “type” and “compatible” properties notify the partition that the virtual adapter is a server Vterm IOA. The unit address (Unit ID) of the node is used by the partition to map the virtual device(s) to the OS’s corresponding logical representations. The node’s “interrupts” property specifies the interrupt source number that has been assigned to the server Vterm IOA for receive data. The partition uses the H_VTERM_PARTNER_INFO hcall() to find out which unit address(es) in which partition(s) it is allowed to attach to (that is, to which client Vterm IOAs it is allowed to attach). The partition then uses H_REGISTER_VTERM to set up the connection between a server and a client Vterm IOA, and uses the H_GET_TERM_CHAR and H_PUT_TERM_CHAR hcall()s to receive data from and send data to the server Vterm IOA. In addition, the partition may optionally use the ibm,int-on, ibm,int-off, ibm,set-xive, and ibm,get-xive RTAS calls, and the H_VIO_SIGNAL hcall(), to manage the server Vterm IOA interrupt. shows a comparison between the client and server versions of Vterm.

Client Vterm versus Server Vterm Comparison

Client:
- These hcall()s apply: H_PUT_TERM_CHAR, H_GET_TERM_CHAR, H_VIO_SIGNAL (optional use with Client).
- H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, and H_FREE_VTERM are not applicable (N/A).
- vty node.
- The “reg” property of the vty node(s) enumerates the valid client Vterm IOA unit address(es).
- “interrupts” property optional: the platform may or may not provide it; if provided, the Vterm driver may or may not use it.

Server:
- These hcall()s apply: H_PUT_TERM_CHAR, H_GET_TERM_CHAR, H_VIO_SIGNAL.
- These hcall()s are also valid: H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, H_FREE_VTERM.
- vty-server node.
- The “reg” property of the vty-server node(s) enumerates the valid server Vterm IOA unit address(es). H_VTERM_PARTNER_INFO is used to get the valid client Vterm IOA partition ID(s) and corresponding unit address(es) to which the server Vterm IOA is allowed to connect.
- “interrupts” property required: the platform must provide it; the Vterm driver may or may not use it.
Vterm Requirements
This normative section provides the general requirements for the support of Vterm.
R1--1. For the LPAR option: The Client Vterm option must be implemented.
Character Put and Get hcall()s
The following hcall()s are used to send data to and get data from both the client and server Vterm IOAs.
H_GET_TERM_CHAR
Syntax:
Parameters:
termno: The unit address of the Vterm IOA, from the “reg” property of the Vterm IOA.
Semantics:
The hypervisor checks the termno parameter for validity against the Vterm IOA unit addresses assigned to the partition; if invalid, it returns H_Parameter.
The hypervisor returns H_Hardware if it detects that the virtual console terminal physical connection is not working.
The hypervisor returns H_Closed if it detects that the virtual console associated with the termno parameter is not open (in the case of connection to a server Vterm IOA, this means that the server code has not made the connection to this specific client Vterm IOA).
The hypervisor returns H_Success in all other cases, returning the maximum number of characters available in the partition’s virtual console terminal input buffer (up to 16); a len value of 0 indicates that the input buffer is empty.
Upon return with H_Success, register R4 contains the number of bytes (if any) returned in registers R5 and R6. The returned character string starts in the high order byte of register R5 and proceeds toward the low order byte of register R6 for the number of characters specified in R4. The contents of all other byte locations of registers R5 and R6 are undefined.
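Software Implementation Note: The following C sketch drains the Vterm input buffer using the register packing defined above (string starting at the high order byte of R5 and running toward the low order byte of R6). The h_get_term_char() wrapper is a hypothetical stand-in that returns the hcall() status and exposes R4 as *len and R5/R6 as regs[0]/regs[1].

    #include <stdint.h>
    #include <stddef.h>

    #define H_SUCCESS 0  /* illustrative value */

    extern long h_get_term_char(uint64_t termno, uint64_t *len,
                                uint64_t regs[2]);

    /* Read up to bufsize bytes; to avoid dropping dequeued data, callers
     * should size buf in multiples of 16. */
    static size_t vterm_read(uint64_t termno, char *buf, size_t bufsize)
    {
        size_t total = 0;
        while (total < bufsize) {
            uint64_t len, regs[2];
            if (h_get_term_char(termno, &len, regs) != H_SUCCESS || len == 0)
                break;
            for (uint64_t i = 0; i < len && total < bufsize; i++)
                buf[total++] = (char)(regs[i / 8] >> (56 - 8 * (i % 8)));
            if (len < 16)
                break;  /* input buffer drained for now */
        }
        return total;
    }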
H_PUT_TERM_CHAR
Syntax:
Parameters:
termno: The unit address of the Vterm IOA, from the “reg” property of the Vterm IOA.
len: The length of the string to transmit through the virtual terminal port. Valid values are in the range 0-16.
char0_7 and char8_15: The string starts in the high order byte of register R6 and proceeds toward the low order byte of register R7.
Semantics:
The hypervisor checks the termno parameter for validity against the Vterm IOA unit addresses assigned to the partition; if invalid, it returns H_Parameter.
The hypervisor returns H_Hardware if it detects that the virtual console terminal physical connection is not working.
The hypervisor returns H_Closed if it detects that the virtual console session is not open (in the case of connection to a server Vterm IOA, this means that the server code has not made the connection to this specific client Vterm IOA).
If the len parameter is outside of the range 0-16, the hypervisor immediately returns H_Parameter with no other action.
If the partition’s virtual console terminal buffer has room for the entire string, the hypervisor queues the output string and returns H_Success. Note: There is always room for a zero length string (a zero length write can be used to test the virtual console terminal connection).
If the buffer cannot hold the entire string, no data is enqueued and the return code is H_Busy.
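Software Implementation Note: A matching C sketch for the transmit side, packing up to 16 bytes starting at the high order byte of R6 as described above. The wrapper and return code values are hypothetical; the H_Busy handling follows the semantics above (nothing was enqueued, so the caller retries the whole string later).

    #include <stdint.h>
    #include <stddef.h>

    #define H_SUCCESS 0  /* illustrative values */
    #define H_BUSY    1

    extern long h_put_term_char(uint64_t termno, uint64_t len,
                                uint64_t char0_7, uint64_t char8_15);

    /* Returns the number of bytes queued (0 if the hypervisor buffer was
     * full), or -1 on any other error. */
    static long vterm_write(uint64_t termno, const char *s, size_t len)
    {
        if (len > 16)
            len = 16;
        uint64_t regs[2] = { 0, 0 };
        for (size_t i = 0; i < len; i++)
            regs[i / 8] |= (uint64_t)(uint8_t)s[i] << (56 - 8 * (i % 8));

        long rc = h_put_term_char(termno, len, regs[0], regs[1]);
        if (rc == H_BUSY)
            return 0;   /* no data enqueued; retry the whole string later */
        return rc == H_SUCCESS ? (long)len : -1;
    }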
Interrupts
The interrupt source number is presented in the “interrupts” property of the Vterm node when receive queue interrupts are implemented for the Vterm. The ibm,int-on, ibm,int-off, ibm,set-xive, and ibm,get-xive RTAS calls, and the H_VIO_SIGNAL hcall(), are used to manage the interrupt. Interrupts and the “interrupts” property are always implemented for the server Vterm IOA, and may be implemented for the client Vterm IOA. The interrupt mechanism is edge-triggered and is capable of presenting only one interrupt signal at a time from any given interrupt source. Therefore, no additional interrupts from a given source are ever signaled until the previous interrupt has been processed through to the issuance of an H_EOI hcall(). Specifically, even if the interrupt mode is enabled, the effect is to interrupt on an empty to non-empty transition of the receive queue or upon the closing of the connection between the server and client. However, as with any asynchronous posting operation, race conditions are to be expected. That is, an enqueue can happen in a window around the H_EOI hcall(), so the receiver should poll the receive queue using H_GET_TERM_CHAR after an H_EOI, to prevent losing the initiative. A sketch of this pattern follows the requirements below.
R1--1. For the Server Vterm option: The platform must implement the “interrupts” property in all server Vterm device tree nodes (vty-server), and must set the interrupt in that property for the receive data interrupt for the IOA.
R1--2. For the Client Vterm and Server Vterm options: When implemented, the characteristics of the Vterm interrupts must be as follows: All must be edge-triggered. The receive interrupt must be activated when the Vterm receive queue goes from empty to non-empty. The receive interrupt must be activated when the Vterm connection from the client to the server goes from open to closed.
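Software Implementation Note: The following C sketch shows the poll-after-EOI pattern just described. vterm_read() is the receive sketch from the H_GET_TERM_CHAR section; h_eoi() and deliver() are hypothetical stand-ins for the OS’s end-of-interrupt path and data consumer.

    #include <stdint.h>
    #include <stddef.h>

    size_t vterm_read(uint64_t termno, char *buf, size_t bufsize);
    extern void h_eoi(uint64_t xirr);                /* H_EOI path */
    extern void deliver(const char *buf, size_t n);  /* hypothetical consumer */

    static void vterm_rx_interrupt(uint64_t termno, uint64_t xirr)
    {
        char buf[64];
        size_t n;
        while ((n = vterm_read(termno, buf, sizeof(buf))) > 0)
            deliver(buf, n);
        h_eoi(xirr);
        /* An enqueue can race with the H_EOI: poll once more afterwards so
         * a byte that arrived in that window does not lose the initiative. */
        while ((n = vterm_read(termno, buf, sizeof(buf))) > 0)
            deliver(buf, n);
    }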
Client Vterm Device Tree Node (vty)
All platforms that implement LPAR also implement at least one client Vterm IOA per partition.
R1--1. For the Client Vterm option: The H_GET_TERM_CHAR and H_PUT_TERM_CHAR hcall()s must be implemented.
R1--2. For the Client Vterm option: The platform’s OF device tree must include as a child of the /vdevice node, one or more nodes of name “vty”; one for each client Vterm IOA.
R1--3. For the Client Vterm option: The platform’s vty OF node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the vty Node (Client Vterm IOA)

“name” (Required: Y) Standard property name per , specifying the virtual device name. The value shall be “vty”.
“device_type” (Required: Y) Standard property name per , specifying the virtual device type. The value shall be “serial”.
“model” (Required: NA) Property not present.
“compatible” (Required: Y) Standard property name per , specifying the programming models that are compatible with this virtual IOA. The value shall include “hvterm1” when the virtual IOA will connect to a server with no special protocol, and shall include “hvterm-protocol” when the virtual IOA will connect to a server that requires a protocol to control modems or hardware control signals.
“used-by-rtas” (Required: NA) Property not present.
“ibm,loc-code” (Required: Y) Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in .
“reg” (Required: Y) Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length determined by the “#address-cells” property. The value is the virtual IOA’s unit address (the virtual “reg” property provides only a unit address; no actual locations are used, and therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“interrupts” (Required: See definition) Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall(). If provided, this property presents one interrupt: the receive data interrupt.
“ibm,my-drc-index” (Required: For DR) Present if the platform implements DR for this node.
R1--4. For the Client Vterm option: If the “compatible” property in the vty node includes “hvterm-protocol”, then the protocol that the client must use is defined in the document entitled Protocol for Support of Physical Serial Port Using a Virtual TTY Interface.
Server Vterm
Server Vterm IOAs allow a partition to serve a partner partition’s client Vterm IOA.
Server Vterm Device Tree Node (vty-server) and Other Requirements
R1--1. For the Server Vterm option: The H_GET_TERM_CHAR, H_PUT_TERM_CHAR, H_VTERM_PARTNER_INFO, H_REGISTER_VTERM, and H_FREE_VTERM hcall()s must be implemented.
R1--2. For the Server Vterm option: The platform’s OF device tree for partitions implementing server Vterm IOAs must include as a child of the /vdevice node, one or more nodes of name “vty-server”; one for each server Vterm IOA.
R1--3. For the Server Vterm option: The platform’s vty-server node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the vty-server Node (Server Vterm IOA)

“name” (Required: Y) Standard property name per , specifying the virtual device name. The value shall be “vty-server”.
“device_type” (Required: Y) Standard property name per , specifying the virtual device type. The value shall be “serial-server”.
“model” (Required: NA) Property not present.
“compatible” (Required: Y) Standard property name per , specifying the programming models that are compatible with this virtual IOA. The value shall include “hvterm2”.
“used-by-rtas” (Required: NA) Property not present.
“ibm,loc-code” (Required: Y) Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in .
“reg” (Required: Y) Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length determined by the “#address-cells” property. The value is the virtual IOA’s unit address (the virtual “reg” property provides only a unit address; no actual locations are used, and therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“interrupts” (Required: Y) Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall(). This property presents one interrupt: the receive data interrupt.
“ibm,my-drc-index” (Required: For DR) Present if the platform implements DR for this node.
“ibm,vserver” (Required: Y) Property name specifying that this is a virtual server node.
Server Vterm hcall()s
The following hcall()s are unique to the server Vterm IOA.
H_VTERM_PARTNER_INFO
This hcall() is used to retrieve the list of Vterms to which the specified server Vterm IOA is permitted to connect. The list is retrieved by making repeated calls, each returning one triple: partner partition ID, partner unit address, and partner location code. Passing in the previously returned value returns the next value in the list of allowed connections. Passing in a value of 0xFF...FF returns the first value in the list.
Syntax:
Parameters:
unit-address: Virtual IOA’s unit address, as specified in the IOA’s device tree node.
partner-partition-id: The partner partition ID of the last partner partition ID and partner unit address pair returned. If a value of 0xFF...FF is specified, the call returns the first item in the list.
partner-unit-addr: The partner unit address of the last partner partition ID and partner unit address pair returned. If a value of 0xFF...FF is specified, the call returns the first item in the list.
buffr-addr: The logical address of a single page in memory, belonging to the calling partition, which is used to return the next triple of information (partner partition ID, partner unit address, and Converged Location Code). The calling partition must not migrate the page for the duration of the call; otherwise the call fails.
Buffer format on return with H_Success:
First eight bytes: Eight byte partner partition ID of the partner partition ID and partner unit address pair from the list, or 0xFF...FF if the partner partition ID and partner unit address passed in the input parameters was the last item in the list of valid connections.
Second eight bytes: Eight byte partner unit address associated with the partner partition ID (as returned in the first 8 bytes of the buffer), or 0xFF...FF if the partner partition ID and partner unit address passed in the input parameters was the last item in the list of valid connections.
Beginning at the 17th byte of the buffer: NULL-terminated Converged Location Code associated with the partner unit address and partner partition ID (as returned in the first 16 bytes of the buffer), or just a NULL string if the partner partition ID and partner unit address passed in the input parameters was the last item in the list of valid connections.
Semantics:
Validate that unit-address belongs to the partition and to a server Vterm IOA, else return H_Parameter.
If partner-partition-id and partner-unit-addr together do not match a valid partner partition ID and partner unit address pair in the list of valid connections for this unit-address, then return H_Parameter.
If the 4 KB page associated with buffr-addr does not belong to the calling partition, then return H_Parameter.
If the buffer associated with buffr-addr does not begin on a 4 KB boundary, then return H_Parameter.
If the calling partition attempts to migrate the buffer page associated with buffr-addr during the H_VTERM_PARTNER_INFO call, then return H_Parameter.
If partner-partition-id is equal to 0xFF...FF, then select the first item from the list of valid connections, format the buffer as specified above for this item, and return H_Success.
If partner-partition-id and partner-unit-addr together match a valid partner partition ID and partner unit address pair in the list of valid connections, and this is the last valid connection in the list, then format the buffer as specified above with the partner partition ID and partner unit address both set to 0xFF...FF and the Converged Location Code set to the NULL string, and return H_Success.
If partner-partition-id and partner-unit-addr together match a valid partner partition ID and partner unit address pair in the list of valid connections, then select the next item from the list of valid connections, format the buffer as specified above, and return H_Success.
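Software Implementation Note: The following C sketch walks the list of allowed connections as described above, starting with the 0xFF...FF sentinel and stopping when the sentinel is returned. The h_vterm_partner_info() wrapper is hypothetical; the buffer layout (ID at offset 0, unit address at offset 8, NULL-terminated Converged Location Code at offset 16) is from this section, and a big-endian host plus identity-mapped logical addressing are assumed for brevity.

    #include <stdint.h>
    #include <stdio.h>
    #include <string.h>

    #define H_SUCCESS 0                              /* illustrative value */
    #define VTERM_END 0xFFFFFFFFFFFFFFFFull          /* 0xFF...FF sentinel */

    /* Hypothetical wrapper; buf_addr must name a single 4 KB page, 4 KB
     * aligned, owned by the caller and not migrated during the call. */
    extern long h_vterm_partner_info(uint64_t unit_address,
                                     uint64_t partner_id,
                                     uint64_t partner_unit,
                                     uint64_t buf_addr);

    static void list_partners(uint64_t unit_address, uint8_t *page)
    {
        uint64_t id = VTERM_END, unit = VTERM_END;   /* start of list */
        for (;;) {
            if (h_vterm_partner_info(unit_address, id, unit,
                                     (uint64_t)(uintptr_t)page) != H_SUCCESS)
                break;
            memcpy(&id, page, 8);                    /* partner partition ID */
            memcpy(&unit, page + 8, 8);              /* partner unit address */
            if (id == VTERM_END && unit == VTERM_END)
                break;                               /* end of list reached */
            printf("partner %llu unit 0x%llx loc %s\n",
                   (unsigned long long)id, (unsigned long long)unit,
                   (const char *)(page + 16));       /* location code */
        }
    }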
H_REGISTER_VTERM
This hcall() has the effect of “opening” the connection to the client Vterm IOA in the specified partition ID and which has the specified unit address. The architectural metaphor for this is the connecting of the cable between two Async IOAs. The hcall() fails if the partition does not have the authority to connect to the requested partition/unit address pair. The hcall() also fails if the specified partition/unit address is already in use (for example, by another partition or the HMC).
Syntax:
Parameters:
unit-address: The server virtual IOA’s unit address, as specified in the IOA’s device tree node.
partner-partition-id: The partition ID of the partition ID and unit address pair to which to be connected.
partner-unit-addr: The unit address of the partition ID and unit address pair to which to be connected.
Semantics:
Validate that unit-address belongs to the partition and to a server Vterm IOA and that there does not exist a valid connection between this server Vterm IOA and a partner, else return H_Parameter.
If partner-partition-id and partner-unit-addr together do not match a valid partition ID and unit address pair in the list of valid connections for this unit-address, then return H_Parameter; else make the connection between the server Vterm IOA specified by unit-address and the client Vterm IOA specified by the partner-partition-id and partner-unit-addr pair, allowing future H_PUT_TERM_CHAR and H_GET_TERM_CHAR operations to flow between the two Vterm IOAs, and return H_Success.
Software Implementation Note: An H_Parameter will be returned by H_REGISTER_VTERM if a DLPAR operation has been performed that changes the list of possible server to client Vterm connections. After a DLPAR operation that affects a partition’s server Vterm IOA connection list, a call to H_VTERM_PARTNER_INFO is needed to get the current list of possible connections.
H_FREE_VTERM
This hcall() has the effect of “closing” the connection to the partition/unit address pair. The architectural metaphor for this is the removal of the cable between two Async IOAs. After closing, the partner partition’s client Vterm IOA is available for serving by another partner (for example, another partition or the HMC).
Syntax:
Parameters:
unit-address: Virtual IOA’s unit address, as specified in the IOA’s device tree node.
Semantics:
Validate that the unit address belongs to the partition and to a server Vterm IOA and that there exists a valid connection between this server Vterm IOA and a partner, else return H_Parameter.
Break the connection between the server Vterm IOA specified by the unit address and the client Vterm IOA, preventing further H_PUT_TERM_CHAR and H_GET_TERM_CHAR operations between the two Vterm IOAs (until a future successful H_REGISTER_VTERM operation), and return H_Success.
Implementation Note: If the hypervisor returns H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, software must call H_FREE_VTERM again with the same parameters. Software may choose to treat H_LongBusyOrder1mSec and H_LongBusyOrder10mSec the same as H_Busy. The hypervisor, prior to returning H_Busy, H_LongBusyOrder1mSec, or H_LongBusyOrder10mSec, will have placed the virtual adapter in a state that will cause it to not accept any new work nor surface any new virtual interrupts.
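Software Implementation Note: The following C sketch applies the retry rule from the Implementation Note above: H_FREE_VTERM is reissued with the same parameters while the hypervisor reports a busy condition. The wrapper names, the delay helper, and the busy return code values are hypothetical.

    #include <stdint.h>

    /* Illustrative return code values. */
    #define H_SUCCESS              0
    #define H_BUSY                 1
    #define H_LONG_BUSY_ORDER_1MS  9900
    #define H_LONG_BUSY_ORDER_10MS 9901

    extern long h_free_vterm(uint64_t unit_address);
    extern void delay_ms(unsigned ms);  /* hypothetical OS delay helper */

    /* "Close" a server Vterm connection, retrying on busy as required. */
    static long vterm_close(uint64_t unit_address)
    {
        for (;;) {
            long rc = h_free_vterm(unit_address);
            if (rc != H_BUSY && rc != H_LONG_BUSY_ORDER_1MS &&
                rc != H_LONG_BUSY_ORDER_10MS)
                return rc;
            delay_ms(rc == H_LONG_BUSY_ORDER_10MS ? 10 : 1);
        }
    }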
Virtual Management Channel (VMC)
PAPR Virtual Management Channel (VMC) support is provided by code running in a logical partition that uses the mechanisms of the Reliable Command/Response Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to service and to send requests to platform code. The purpose of this interface is to communicate platform management information between a designated logical partition and the platform. The VMC architecture is built upon the architecture specified in the following sections:
Virtual Management Channel Requirements
This normative section provides the general requirements for the support of VMC.
R1--1. For the VMC option: The platform must implement the Reliable Command/Response Transport option as defined in .
R1--2. For the VMC option: The platform must implement the Logical Remote DMA option as defined in .
In addition to the firmware primitives, and the structures they define, the partition’s OS needs to know specific information regarding the configuration of the virtual IOAs that it has been assigned so that it can load and configure the correct device driver code. This information is provided by the Open Firmware device tree node associated with the virtual IOA (see ).
Partition Virtual Management Channel Device Tree Node
Partition VMC IOA nodes have no children nodes.
R1--1. For the VMC option: The platform’s Open Firmware device tree for client partitions must include as a child of the /vdevice node, a node of name “ibm,vmc” as the parent of a sub-tree representing the virtual IOAs assigned to the partition.
R1--2. For the VMC option: The platform’s vmc Open Firmware node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the VMC Node in the Client Partition

“name” (Required: Y) Standard property name per IEEE 1275, specifying the virtual device name; the value shall be “ibm,vmc”.
“device_type” (Required: Y) Standard property name per IEEE 1275, specifying the virtual device type; the value shall be “ibm,vmc”.
“model” (Required: NA) Property not present.
“compatible” (Required: Y) Standard property name per IEEE 1275, specifying the programming models that are compatible with this virtual IOA; the value shall be “IBM,vmc”.
“used-by-rtas” (Required: See definition) Present if appropriate.
“ibm,loc-code” (Required: Y) Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in the information on Virtual Card Connector Location Codes.
“reg” (Required: Y) Standard property name per IEEE 1275, specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA, presented as an encoded array as with encode-phys of length determined by the “#address-cells” property. The value is the virtual IOA’s unit address (the virtual “reg” property provides only a unit address; no actual locations are used, and therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“ibm,my-dma-window” (Required: Y) Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of two sets (two window panes) of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. Of these two triples, the first describes the window pane used to map server partition (the designated management partition) memory; the second is the window pane through which the client partition (the platform partition) maps the memory that it makes available to the server partition.
“interrupts” (Required: Y) Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().
“ibm,my-drc-index” (Required: For DR) Present if the platform implements DR for this node.
“ibm,#dma-size-cells” (Required: See definition) Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in the section on System Bindings.
“ibm,#dma-address-cells” (Required: See definition) Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in the section on System Bindings.
Virtual Asynchronous Services Interface (VASI)
The PAPR Virtual Asynchronous Services Interface (VASI) allows an authorized virtual server partition (VSP) to safely access the internal state of a specific partition. The access provided by VASI enables high level administrative services such as partition migration, hibernation, and virtualization of partition logical real memory. VASI uses the mechanisms of the Reliable Command/Response Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to service requests. VASI is built upon the architecture specified in the following sections:
VASI Overview
VASI Streams, Services and States
A single VASI virtual IOA may be capable of supporting multiple streams of operations (up to the number presented in the “ibm,#vasi-streams” property, see ), each representing a specific high level operation such as an individual logical partition migration or a unique logical partition hibernation. The hypervisor and the various logical partitions use the VASI_Stream_ID as a handle to associate the services that each provides to the specific high level function. Similarly, a single VASI virtual IOA may be capable of supporting multiple service sessions (opens) for each VASI_Stream_ID (up to the number negotiated by the #Opens field of the capabilities string, see ). VASI streams and individual service sessions may be in one of several states. Refer to the specific high level function description in for the state descriptions and transition triggers that are defined for each high level function.
VASI Handles
VASI defines several kinds of handles. The VASI Stream ID is used to associate the elements of the same high level function (such as a specific partition migration operation). In this case, the various partitions are assigned roles and a VASI Stream ID. By opening a VASI virtual IOA with a given VASI Stream ID and Service parameter, the partition declares its intent to perform the specified service for the specific high level operation. By means outside the scope of PAPR, the platform is told to expect such service from the specific partition; thus, when the match happens, the high level operation is enabled. At open time, the platform and partition negotiate a pair of convenient handles to use as a substitute for the architecturally opaque VASI Stream ID. This pair of 4 byte handles is called the TOP/BOTTOM. The TOP field is used by the partition to denote its operations for the specific VASI Stream ID, while the BOTTOM field provides that function for the platform firmware. The first 8 bytes of a VASI data buffer are reserved for Virtual Server Partition (VSP) use as the buffer correlator field. The buffer correlator field is architecturally opaque. The architectural intent is that the buffer correlator field is a VSP handle to the data buffer control block.
Semantic Conventions
The convention for the specification of VASI CRQ message processing semantics is a specifically ordered sequence of operations. Implementations are not required to code these sequences, but are required to appear as if they did. In general, parameters and operational state are first verified, followed by the operation to be performed if all the parameter/state checks succeed. If a check fails, a response is generated at that point and further processing of the message is abandoned. Note that standard CRQ processing operations (message enqueue and dequeue processing, such as finding the next valid message and resetting the valid message bit when processing is complete; see ) are assumed and not explicitly included in the semantic specification.
VASI Data Buffers (Normative)
Data buffers used by VASI are defined as for ILLAN (see ). VASI references data buffers via a valid buffer descriptor (Control Byte = 0x80), as for ILLAN (see ), relative to the first pane of the VASI virtual IOA’s “ibm,my-dma-window” property. The first 8 bytes of a data buffer are reserved for an OS opaque handle. A filled data buffer contains either a VASI Download Request Specifier or a VASI Operation Request Specifier (refer to or , respectively) following the opaque handle. Buffers are supplied to the VASI virtual IOA via the VASI Add Buffer CRQ request message, and returned to the VASI device driver in operation requests such as the VASI Operation CRQ request message or, for those that have not been used by operation requests, via responses to the VASI Free Buffer CRQ request message. Closing a VASI service session releases buffers queued for that service session in the VASI virtual IOA, while deregistering the VASI virtual IOA CRQ does the same for all of the VASI virtual IOA service sessions.
R1--1. For the VASI option: The platform must implement the Reliable Command/Response Transport option (see ).
R1--2. For the VASI option: The storage for VASI data buffers to be used by a given VASI virtual IOA must be TCE mapped within the first pane of the “ibm,my-dma-window” property as presented in the VASI virtual IOA’s Open Firmware device tree node.
R1--3. For the VASI option: Firmware must not modify the first 8 bytes (Buffer Correlator field) of a VASI data buffer.
R1--4. For the VASI option: Immediately following the first 8 bytes of a filled VASI data buffer must be either a VASI Download Request Specifier or a VASI Operation Request Specifier.
R1--5. For the VASI option: The VASI Download Request Specifier must be formatted as per .
R1--6. For the VASI option: The VASI Operation Request Specifier must be formatted as per .
VASI Download Request Specifier
The VASI Download Request Specifier is presented in . It is used with the VASI Download Request message; see .
VASI Download Request Specifier structure
VASI Operation Request Specifier
The VASI Operation Request Specifier is presented in . The TOP/BOTTOM (8 bytes) field is a pair of 4 byte opaque handles, as negotiated by the VASI Open Request/Response pair; see .
Expected semantics of VASI operation requests (see the encoding sketch after the structure figure below):
Operation length is communicated by the summation of the lengths of the buffer segment control structures following the operation correlator field.
Operations that write at the end of the file normally extend the file. If extending the file is not possible due to resource constraints, the operation is aborted at the end of the file, and the VASI operation response message carries the “End of File” status with the Residual field containing the number of requested bytes that were not transferred (a Residual value of all ones indicates Residual field overflow).
Read operations that access beyond the end of the file are aborted at the end of the file. The VASI operation response message carries the “End of File” status with the Residual field containing the number of requested bytes that were not transferred (a Residual value of all ones indicates Residual field overflow).
Sequential writes deliver the input stream of bytes to the receiver in the same order, but not necessarily in the same blocking as originated by the sender.
Indexed operations carry the additional semantic over the corresponding sequential operation that they are a collection of one or more sub-operations of the same type (read/write). Each sub-operation specification starts with a control field encoding of 0xC0 that carries the 512 byte file block index of the start of the operation. The file cursor can then be adjusted within the block using a control field of 0x40 followed by a 3 byte binary offset (legal values 0-511). This offset allows operations to begin on any byte boundary within the specified 512 byte block index. The remainder of each sub-operation specification is a scatter/gather list. The sub-operation length is defined by the number of bytes of data/buffer supplied in the sub-operation scatter/gather list.
The “Hardware” status code indicates a failure due to any hardware problem, including physical I/O.
The “Invalid Buffer Correlator” status code is reserved for failure to find the operation buffer.
The “Invalid VASI Operation Request Specifier” status code is used for any failure in the decode of the operation buffer not specifically called out by a previous semantic.
The first control field of a scatter/gather list may be a byte offset encoded with a control field of 0x40 and followed by a 3 byte binary offset (legal values 0-511). This offset allows operations to begin on any byte boundary within the specified 512 byte block index.
The control field encoding 0xC0 indicates that the original operation is conjoined with a second indexed operation of the same direction starting at a new 512 byte block index (as indicated in the following 7 bytes). The conjoined indexed operation has its own scatter/gather list, optionally starting with a byte offset, followed by one or more data buffers.
Operation Modifiers:
000: Base Operation
001: Server Takeover Warning: informs the targeted VASI server that another VASI server had previously hosted the operation stream and that it may need to perform additional steps to process this request.
010 - 111: Reserved
VASI Operation Request Specifier structure
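Software Implementation Note: The following C helpers illustrate the sub-operation control field encodings described above: a 0xC0 control field followed by a 7 byte big-endian 512 byte block index, and a 0x40 control field followed by a 3 byte binary offset (legal values 0-511). The helper names and the caller’s buffer management are hypothetical; the data buffer entries of the scatter/gather list themselves use the ILLAN-style descriptors (Control Byte = 0x80) referenced earlier.

    #include <stdint.h>

    /* Append a sub-operation start: 0xC0 plus a 7 byte block index. */
    static uint8_t *put_block_index(uint8_t *p, uint64_t block_index)
    {
        *p++ = 0xC0;                                  /* 512 byte block index */
        for (int shift = 48; shift >= 0; shift -= 8)
            *p++ = (uint8_t)(block_index >> shift);   /* big-endian, 7 bytes */
        return p;
    }

    /* Append a cursor adjustment: 0x40 plus a 3 byte offset (0-511). */
    static uint8_t *put_byte_offset(uint8_t *p, uint32_t offset)
    {
        *p++ = 0x40;
        *p++ = (uint8_t)(offset >> 16);               /* big-endian, 3 bytes */
        *p++ = (uint8_t)(offset >> 8);
        *p++ = (uint8_t)offset;
        return p;
    }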
VASI CRQ Message Definition (Normative)
For the VASI interface, all CRQ messages are defined to use the following base format:

General Form of VASI Reliable CRQ Element

Byte Offset   Field Name                     Description
0             Header                         Contains Element Valid Bit plus Event Type Encodings ( ).
1             Format/Transport Event Code    For Valid Command Response Entries, see . For Transport Event Codes, see .
2-15          Payload                        Format dependent.
General format of a CRQ element
R1--1. For the VASI option: The format byte of VASI CRQ messages must be as defined in .

Reliable CRQ Entry Format Byte Definitions for VASI (Header=0x80)

Format Byte Value   Definition
0x00                Unused
0x01                VASI Capabilities Request
0x02                VASI Open Request
0x03                VASI Close Request
0x04                VASI Add Buffer Request
0x05                VASI Free Buffer Request
0x06                VASI Download Request
0x07                VASI Operation Request
0x08                VASI Signal Request
0x09                VASI State Request
0x0A - 0x0F         Reserved
0x10                VASI Progress Request
0x11 - 0x80         Reserved
0x81                VASI Capabilities Response
0x82                VASI Open Response
0x83                VASI Close Response
0x84                VASI Add Buffer Response
0x85                VASI Free Buffer Response
0x86                VASI Download Response
0x87                VASI Operation Response
0x88                VASI Signal Response
0x89 - 0x8F         Reserved
0x90                VASI Progress Response
0x91 - 0xFF         Reserved
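The format byte values above can be transcribed as a C enumeration for use by a VSP device driver. This is informative only; the identifier names are illustrative.

    /* VASI CRQ format byte values, transcribed from the table above.
     * A response value is its request value with the high order bit set. */
    enum vasi_crq_format {
        VASI_CAPABILITIES_REQ = 0x01,
        VASI_OPEN_REQ         = 0x02,
        VASI_CLOSE_REQ        = 0x03,
        VASI_ADD_BUFFER_REQ   = 0x04,
        VASI_FREE_BUFFER_REQ  = 0x05,
        VASI_DOWNLOAD_REQ     = 0x06,
        VASI_OPERATION_REQ    = 0x07,
        VASI_SIGNAL_REQ       = 0x08,
        VASI_STATE_REQ        = 0x09,
        VASI_PROGRESS_REQ     = 0x10,
        VASI_CAPABILITIES_RSP = 0x81,
        VASI_OPEN_RSP         = 0x82,
        VASI_CLOSE_RSP        = 0x83,
        VASI_ADD_BUFFER_RSP   = 0x84,
        VASI_FREE_BUFFER_RSP  = 0x85,
        VASI_DOWNLOAD_RSP     = 0x86,
        VASI_OPERATION_RSP    = 0x87,
        VASI_SIGNAL_RSP       = 0x88,
        VASI_PROGRESS_RSP     = 0x90
    };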
R1--2. For the VASI option: The status byte of VASI CRQ response messages must be as defined in .

VASI Reliable CRQ Response Status Values

Status Value   Definition
0x00           Success
0x01           Hardware Error
0x02           Invalid Stream ID
0x03           Stream ID Abort
0x04           Invalid Buffer Descriptor: Either the IOBA is too large for the LIOBN or its logical TCE does not contain a valid logical address mapping.
0x05           Invalid buffer length: Either the buffer is less than the minimum useful buffer size or it does not match one of the first “ibm,#buffer-pools” sizes that were added.
0x06           Empty: The request could not be satisfied because the buffer pool was empty.
0x07           Invalid VASI Download Request Specifier
0x08           Invalid VASI Download data: The download data format is invalid.
0x09           Invalid Buffer Correlator: Does not correspond to an outstanding data buffer.
0x0A           Invalid VASI Operation Request Specifier
0x0B           Invalid Service Specifier
0x0C           Too many opens
0x0D           Invalid BOTTOM
0x0E           Invalid TOP
0x0F           End of File
0x10           Invalid Format
0x11           Unknown Reserved Value
0x12           Invalid State Transition
0x13           Race Lost
0x14           Invalid Signal Code
0x15 - 0xFF    Reserved
VASI Request/Response Pairs
R1--1. For the VASI option: The platform must validate the format byte in all VASI messages that it receives.
R1--2. For the VASI option: The platform must initiate the processing of VASI messages in the order received on a given CRQ.
R1--3. For the VASI option: If the format byte value of a received VASI message, as specified in , is “Unused”, “Reserved”, “VASI Operation Request”, or a response other than “VASI Operation Response”, the platform must declare the format byte invalid.
R1--4. For the VASI option: If the format byte value is invalid, then the platform must generate a response message on the corresponding CRQ by copying the received message with the high order format byte bit set to one and the status byte set to the “Invalid Format” status code, and discard the received CRQ message.
R1--5. For the VASI option: The platform must fill in all reserved fields in VASI messages that it generates with zeros.
R1--6. For the VASI option: The platform must check that all reserved fields in a VASI message that it receives, except for the Capability String of the VASI Exchange Capabilities message, are filled with zeros, else return the corresponding VASI reply message with a status of “Unknown Reserved Value”.
R1--7. For the VASI option: The VASI Exchange Capabilities message must be as defined in .
R1--8. For the VASI option: The VASI Open message must be as defined in .
R1--9. For the VASI option: The platform must process the VASI Open Request message per the semantics described in .
R1--10. For the VASI option: The VASI Close message must be as defined in .
R1--11. For the VASI option: The platform must process the VASI Close Request message per the semantics described in .
R1--12. For the VASI option: The VASI Add Buffer message must be as defined in .
R1--13. For the VASI option: The platform must process the VASI Add Buffer Request message per the semantics described in .
R1--14. For the VASI option: The VASI Free Buffer message must be as defined in .
R1--15. For the VASI option: The platform must process the VASI Free Buffer Request message per the semantics described in .
R1--16. For the VASI option: The platform must process the VASI Download Request message per the semantics described in .
R1--17. For the VASI option: The VASI Download message must be as defined in .
R1--18. For the VASI option: The platform must process the VASI Operation Response message per the semantics described in .
R1--19. For the VASI option: The VASI Operation message must be as defined in .
R1--20. For the VASI option: The platform must process the VASI State Request message per the semantics described in .
R1--21. For the VASI option: The VASI State message must be as defined in .
R1--22. For the VASI option: The platform must process the VASI Progress Request message per the semantics described in .
R1--23. For the VASI option: The VASI Progress message must be as defined in .
R1--24. For the VASI option: The platform must process the VASI Signal Request message per the semantics described in .
R1--25. For the VASI option: The VASI Signal message must be as defined in .
R1--26. For the VASI option: To avoid a return code of “Invalid TOP” or “Invalid BOTTOM”, the VASI Progress, VASI Add Buffer, VASI Free Buffer, VASI Download, VASI Operation, VASI Signal, and VASI State request messages must only be sent after a successful VASI Open and prior to a VASI Close.
VASI Exchange Capabilities
The VASI Exchange Capabilities command/response pair is used to negotiate run time characteristics of the VASI virtual IOA. The using partition issues one VASI Exchange Capabilities request message for each service that it plans to support, filling in the Capability String field of the exchange capabilities request (see ) with the values that it plans to use for that service, and enqueues the request. The VASI virtual IOA copies to the response Capability String the values from the request Capability String that it can support. The Capability String boolean fields are defined such that zero indicates that the characteristic is not supported. Capability String fields that represent numeric values may be reduced by the VASI virtual IOA from the requested value to the supported value, with the numeric value zero being possible.
Status Values defined for the VASI Exchange Capabilities response message:
Success
Hardware
Format of the VASI Exchange Capabilities CRQ elements
Capability String Fields

Service (location 3:0 - 3:7): Supported Services code; see .
Reserved 1 (location 4:1 - 13:5): Reserved for future expansion.
Supported Download Forms (13:6 Immediate, 13:7 Indirect): The forms of VASI Download that are supported. This is a bit field, so any combination is possible to represent. Immediate and indirect refer to the buffer placement: either directly following in the operation specifier, or at a location specified by an address.
Supported Operations (14:0 Read Sequential Immediate, 14:1 Read Sequential Indirect, 14:2 Read Indexed Immediate, 14:3 Read Indexed Indirect, 14:4 Write Sequential Immediate, 14:5 Write Sequential Indirect, 14:6 Write Indexed Immediate, 14:7 Write Indexed Indirect): The forms of VASI Operations that are supported. This is a bit field, so any combination is possible to represent. Sequential and indexed refer to the starting point of the operation (following the last operation, or at a specific block offset). Immediate and indirect refer to the buffer placement: either directly following in the operation specifier, or at a location specified by an address.
#Opens (location 15:0 - 15:7): Number of opens (unique TOP/BOTTOM pairs) per VASI stream that are supported on this VASI virtual IOA. Valid values: 1-255.
VASI Open
The VASI Open Command message (see ) indicates to the hypervisor that the originator VASI device driver is prepared to provide the indicated processing service (role) for the indicated VASI stream. The VASI Open Response message indicates to the originating VASI device driver that the hypervisor is prepared to proceed with the indicated VASI stream.
Status Values defined for the VASI Open response message:
Success
Hardware
Invalid Stream ID: The Stream ID parameter is not currently valid for this VASI virtual device.
Stream ID Aborted
Too many opens
Invalid Service Specifier: Either a reserved value or a service not defined for this VASI stream.
Semantics for the VASI Open Request message:
Construct the VASI Open Response message prototype (including the Service parameter from the request). Copy the low order 8 bytes from the request message to the response prototype.
Verify that the Stream ID parameter of the VASI Open Request message is valid for the caller, else respond with the status “Invalid Stream ID”.
Verify that the Service parameter of the VASI Open Request message is valid for the caller plus Stream ID pair, else respond with the status “Invalid Service Specifier”. Note that the valid Service values vary with the specific high level function being performed (see ) and the role assigned to the calling partition by mechanisms outside of the scope of PAPR.
If the state of the VASI stream specified by the Stream ID of a VASI Open Request message is “Aborted”, respond with the status value “Stream ID Aborted”.
If the maximum number of opens has not been reached, then allocate control structures to maintain the state of this open instance and associate them with a unique BOTTOM parameter, copying the BOTTOM parameter to the response message; else respond with “Too many opens”.
Record the associated TOP parameter value for use in subsequent VASI response and operation request messages.
Respond with Success.
Format of the VASI Open CRQ elements
VASI Close
The VASI Close Command message (see ) requests the receiver to close the indicated BOTTOM instance of the VASI stream. Note that other BOTTOM instances remain open. The VASI Close Response message indicates that the VASI Close command receiver has processed the associated VASI Close Command message and all previously enqueued messages for the BOTTOM instance. No further CRQ messages will be enqueued by the closed BOTTOM service, and all enqueued buffers are forgotten.
Status Values defined for the VASI Close response message:
Success
Hardware
Invalid BOTTOM
Semantics for the VASI Close Request message:
Construct the VASI Close Response message prototype (copy the low order 14 bytes from the request message).
Validate that the BOTTOM parameter is valid for the caller, else respond “Invalid BOTTOM”.
Transition the service for the specified VASI stream instance to the “Closed” state; this process ends after all previously initiated VASI request messages for the BOTTOM instance have completed.
Insert the TOP recorded at open time for the specified BOTTOM into the response prototype.
Free queued buffers and deallocate the control structures associated with the BOTTOM parameter, then respond Success.
Format of the VASI Close CRQ elements
VASI Add Buffer
The VASI Add Buffer Command message (see ) indicates to the hypervisor that the originator VSP device driver is providing the hypervisor with an empty buffer for the specific BOTTOM instance. The hypervisor organizes its input buffers into N buffer pools per service, by size, as indicated by the buffer descriptor. The VASI “ibm,#buffer-pools” device tree property relates how many buffer size pools the firmware implements. The first N different sizes supplied by the device driver specify the sizes of the N buffer size pools; buffers of other sizes are rejected. The VASI Add Buffer Response message indicates to the originating VASI device driver that the hypervisor has processed the associated VASI Add Buffer Command message. All VASI Add Buffer CRQ messages generate a VASI Add Buffer Response message to provide feedback to the VASI device driver for flow control of the firmware’s VASI CRQ. The successful VASI Add Buffer Response CRQ message indicates the buffer size of the pool upon which the buffer was enqueued, and the number of free buffers on the indicated buffer size pool after the add (to indicate buffer utilization).
Status Values defined for the VASI Add Buffer response message:
Success
Hardware
Invalid BOTTOM
Invalid Buffer Descriptor
Invalid Buffer Length
Semantics for the VASI Add Buffer Request message:
Construct the VASI Add Buffer Response message prototype (copy the low order 14 bytes from the request message to the response prototype).
Validate the BOTTOM field, else respond “Invalid BOTTOM”.
Insert the TOP recorded for the open BOTTOM into the response prototype.
Validate that the high order Buffer Descriptor bit is 0b1, else respond with “Invalid Buffer Descriptor”.
Validate that the Buffer Descriptor address translates through the LIOBN of the first pane of the “ibm,my-dma-window” property, else respond with “Invalid Buffer Descriptor”.
Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype.
If the Buffer Descriptor length field does not match the buffer length of one of the buffer pools, then: if buffer lengths are assigned to all buffer pools, respond with “Invalid Buffer Length”; else select an unassigned buffer pool and assign its length to match the length field of the Buffer Descriptor.
Enqueue the buffer descriptor onto the per service session (BOTTOM) pool whose buffer length matches the length field of the Buffer Descriptor, increment the Free Buffers in Pool count for the pool, insert the result into the response prototype along with the buffer size, clear the reserved fields, and respond with “Success”.
Format of the VASI Add Buffer CRQ elements
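As an informative illustration, the following C sketch shows the pool-assignment rule described above: the first N distinct buffer lengths supplied by the device driver claim the N pools, and buffers of any other size are rejected with “Invalid Buffer Length”. The data structures are assumptions for illustration, not the platform's internal representation.

    #include <stddef.h>
    #include <stdint.h>

    #define MAX_POOLS 8   /* N, as advertised by "ibm,#buffer-pools" */

    struct buffer_pool {
        uint32_t length;      /* 0 means the pool is still unassigned */
        uint32_t free_count;  /* free buffers currently enqueued */
    };

    /* Returns the pool the buffer was added to, or NULL when the size
     * must be rejected ("Invalid Buffer Length"). */
    static struct buffer_pool *add_buffer(struct buffer_pool pools[MAX_POOLS],
                                          uint32_t buf_len)
    {
        struct buffer_pool *unassigned = NULL;

        for (int i = 0; i < MAX_POOLS; i++) {
            if (pools[i].length == buf_len) {  /* matching pool exists */
                pools[i].free_count++;
                return &pools[i];
            }
            if (pools[i].length == 0 && unassigned == NULL)
                unassigned = &pools[i];
        }
        if (unassigned == NULL)
            return NULL;  /* all pools assigned: reject this size */

        unassigned->length = buf_len;  /* first use of a new size claims a pool */
        unassigned->free_count = 1;
        return unassigned;
    }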
VASI Free Buffer The VASI Free Buffer Command message, see , requests the hypervisor to return an empty data buffer of the specified size to the originating VSP device driver. This call is used to recover buffers, for example at the completion of a VASI operation stream. All buffers added to a VASI virtual IOA service session (“BOTTOM”) are forgotten by the virtual IOA when the service session is closed or the IOA’s CRQ is deregistered. The VASI Free Buffer Response message indicates to the originating VASI device driver that the hypervisor has processed the associated VASI Free Buffer Command message. All VASI Free Buffer CRQ messages generate a VASI Free Buffer Response message. If the Status field of the VASI Free Buffer Response CRQ message is “Success”, then the last 8 bytes contain the Buffer Correlator (first 8 bytes of the data buffer) of the selected empty data buffer. The last 8 bytes of the VASI Free Buffer Response CRQ message are undefined for any non-“Success” Status value. Status values defined for the VASI Free Buffer response message: Success, Hardware, Invalid BOTTOM, Invalid Buffer Length, Empty. Semantics for VASI Free Buffer Request Message: Construct VASI Free Buffer Response message prototype with the Buffer Correlator field zero. Validate the BOTTOM field, else respond “Invalid BOTTOM”. Insert the TOP recorded for the open BOTTOM into the response prototype. If the request message Buffer Length field does not match one of the pool lengths, then respond “Invalid Buffer Length”. If the buffer pool associated with the Buffer Length field is empty, then respond “Empty”. Dequeue a Buffer Descriptor from the buffer pool associated with the Buffer Length field. Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype and respond “Success”.
Format of the VASI Free Buffer CRQ elements
VASI Download The VASI Download Command message, see , requests the hypervisor to process the VASI Download data buffer specified by the originating VSP device driver. The VASI Download Response message indicates to the originating VSP that the hypervisor has processed the associated VASI Download Command message. Unless the Status field of the VASI Download Response CRQ message is “Invalid Buffer Descriptor”, the last 8 bytes contain the Buffer Correlator (first 8 bytes of the data buffer) of the data buffer specified by the Buffer Descriptor field of the VASI Download Command CRQ message. The last 8 bytes of the VASI Download Response CRQ message are undefined for the “Invalid Buffer Descriptor” Status value. Status values defined for the VASI Download response message: Success, Hardware, Invalid BOTTOM, Invalid Buffer Descriptor, Invalid VASI Download Request Specifier, Invalid VASI Download data. Semantics for VASI Download Request Message: Construct VASI Download Response message prototype (copy low order 14 bytes from Request message to response prototype). Validate the BOTTOM field, else respond “Invalid BOTTOM”. Insert the TOP recorded for the open BOTTOM into the response prototype. Validate that the high order Buffer Descriptor bit is 0b1, else respond with “Invalid Buffer Descriptor”. Validate that the Buffer Descriptor address translates through the LIOBN of the first pane of the “ibm,my-dma-window” property, else respond with “Invalid Buffer Descriptor”. Copy the first 8 bytes at the translated Buffer Descriptor address into the low order 8 bytes of the response prototype. Verify that the BOTTOM parameter of the buffer’s VASI Download Request Specifier is valid for the caller and that the Download service for the associated Stream ID is open by the caller, else respond with “Invalid VASI Download Request Specifier”. The Download service processes the buffer data; if an error is detected in the buffer data, respond with “Invalid VASI Download data”, else respond with “Success”.
Format of the VASI Download CRQ elements
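As an informative illustration, the following C sketch shows the buffer-descriptor checks shared by the Add Buffer and Download semantics above: the high order bit of the descriptor must be 0b1, and the descriptor address must translate through the LIOBN of the first window pane. The descriptor layout and the translation helper are assumptions for illustration.

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    struct buffer_desc {
        uint64_t ctrl_len;  /* bit 63 (high order) valid bit; low bits: length */
        uint64_t ioba;      /* I/O bus address within the DMA window */
    };

    /* Hypothetical stand-in for the platform's TCE lookup through the
     * first pane of "ibm,my-dma-window"; returns NULL if untranslatable. */
    extern void *liobn_translate(uint32_t liobn, uint64_t ioba);

    static bool descriptor_valid(uint32_t first_pane_liobn,
                                 const struct buffer_desc *d)
    {
        if (!(d->ctrl_len >> 63))   /* high order bit must be 0b1, else */
            return false;           /* "Invalid Buffer Descriptor"      */
        return liobn_translate(first_pane_liobn, d->ioba) != NULL;
    }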
VASI Operation The VASI Operation Request message, see Figure 47, “Format of the VASI Operation CRQ elements,” on page 731, requests the receiving VSP to process the VASI Operation specified in the data buffer indicated by the Buffer Correlator field. The Buffer Correlator field is copied from the first 8 bytes of the data buffer as supplied by the VSP using the VASI Add Buffer request. VASI Operation Requests are used to upload data on migration/hibernation (Write Sequential) and for VPM paging requests (using indexed Read/Write). The VASI Operation Response message indicates to the hypervisor that the VSP has processed the associated VASI Operation Command message. Unless the Status field of the VASI Operation Response CRQ message is “Invalid Buffer Correlator”, the last 8 bytes contain the Operation Correlator (fourth 8 bytes of the data buffer) of the data buffer that the hypervisor selected for this operation. The last 8 bytes of the VASI Operation Response CRQ message are undefined for the “Invalid Buffer Correlator” Status value. The VSP validates that the TOP parameter corresponds to an open instance against a VASI stream ID, else it responds “Invalid TOP”. Similarly, the VSP validates the format of the remainder of the buffer, else it responds “Invalid VASI Operation Request Specifier”. Status values defined for the VASI Operation response message: Success, Hardware, Invalid Buffer Correlator, Invalid TOP, Invalid VASI Operation Request Specifier, Stream ID Aborted, End of File. Semantics for VASI Operation Response Message: Verify that the Operation Correlator references a valid outstanding VASI Operation, else discard the message. NOTE: while an invalid operation correlator is a very serious error, there is no obvious instance to which to deliver the error. Mark the operation control block referenced by the Operation Correlator with the Status and Residual values (refer to ) from the Response message, and mark the response message as being processed. Further processing of the operation control block is covered in the specification for the specific VASI Operation Stream. See .
Format of the VASI Operation CRQ elements
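As an informative illustration, the following C sketch shows the response-side semantics above: locate the outstanding operation by its Operation Correlator, record the Status and Residual values, and otherwise discard the message. The control-block table and field names are assumptions for illustration.

    #include <stddef.h>
    #include <stdint.h>

    struct vasi_op {
        uint64_t correlator;  /* matches the fourth 8 bytes of the buffer */
        int      outstanding;
        uint8_t  status;      /* recorded from the Response message */
        uint64_t residual;
    };

    static void handle_op_response(struct vasi_op *ops, size_t nops,
                                   uint64_t correlator, uint8_t status,
                                   uint64_t residual)
    {
        for (size_t i = 0; i < nops; i++) {
            if (ops[i].outstanding && ops[i].correlator == correlator) {
                ops[i].status = status;      /* further processing is per */
                ops[i].residual = residual;  /* the specific stream spec  */
                ops[i].outstanding = 0;
                return;
            }
        }
        /* Invalid correlator: serious, but there is no instance to which
         * the error can be delivered, so the message is discarded. */
    }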
VASI Signal The VASI Signal Command message (see ) informs the VASI Virtual IOA of the VASI Stream associated with the BOTTOM parameter of the condition specified by the “Signal Code” parameter; optionally, a non-zero reason code may be associated with the event so that firmware may record the event using facilities and methods that are outside the scope of this architecture. The valid signal codes and reason codes are unique to the specific VASI operation stream. See and respectively for more details. Status values defined for the VASI Signal response message: Success, Hardware, Invalid BOTTOM, Invalid Signal Code. Semantics for processing the VASI Signal Request Message: Construct VASI Signal Response message prototype (copy the low order 14 bytes from the Request message to the response prototype). Validate the BOTTOM parameter for the caller, else respond “Invalid BOTTOM”. Insert the TOP recorded for the open BOTTOM into the response prototype. Determine the VASI stream associated with the BOTTOM parameter. If the “Signal” parameter represents an invalid signal code for the VASI operation stream represented by the BOTTOM parameter (refer to ), then respond “Invalid Signal Code”. Initiate the VASI stream event processing for the VASI operation stream represented by the BOTTOM parameter, as defined under for the current state and the condition represented by the “Signal” parameter, record the value of the “Reason” parameter, and respond “Success”.
Format of the VASI Signal CRQ elements
VASI State The VASI virtual IOA generates a VASI State Request message, see , to each VASI open session instance (TOP) that is associated (through a VASI Open) with the VASI Stream ID, each time the VASI stream changes state. Such state change request messages may include an optional non-zero reason code. No VASI State Response message is defined; the VASI State Request message is informational, and the receiver does not generate a response. The valid states, state transitions, and reason codes are unique to the specific VASI operation stream, see . Semantics for the VASI State Request Message, sent only after all other VASI stream state transition processing completes: For each TOP opened for the VASI stream that changed state: Construct the VASI State Request message prototype. Fill in the TOP from the values recorded at VASI open time. Fill in the “Reason” and “To” fields per the completed transition. Enqueue the request message to the CRQ used to open the associated TOP. Mark the VASI stream state transition complete.
Format of the VASI State CRQ elements
VASI Progress The VASI Progress Command message, see , is applicable to Migration and Hibernation high level operations. It requests the hypervisor to report the number of bytes of partition state that remain to be processed for the VASI migration/hibernation stream associated with the “BOTTOM” parameter. If this request is made prior to any state transfer requests, the reported value represents the total size of the partition state data. If the Status field of the VASI Progress Response CRQ message is “Invalid BOTTOM”, the last 8 bytes of the VASI Progress Response CRQ message are simply copied from the corresponding VASI Progress Request message. Status values defined for the VASI Progress response message: Success, Hardware, Invalid BOTTOM, Invalid Service Specifier. Semantics for VASI Progress Request Message: Construct VASI Progress Response message prototype (copy the low order 14 bytes from Request message to response prototype). Validate the BOTTOM parameter for the caller, else respond “Invalid BOTTOM”. Insert the TOP recorded for the open BOTTOM into the response prototype. Validate that the operation stream associated with the BOTTOM parameter is either a migration or a hibernation, else respond “Invalid Service Specifier”. Estimate the number of bytes left to transfer (this is best effort, since the number may constantly change), place the value into the “Number of Bytes Left” field, and respond “Success”. For the source side of an operation, the estimate of the number of bytes left is the number of bytes of dirty status. For the destination side of an operation, the estimate of the number of bytes left is the number of expected status bytes that the destination knows are not valid (either they were never sent or there has been an indication that they were subsequently made invalid).
Format of the VASI Progress CRQ elements
VASI Virtual IOA Device Tree Properties of the VASI Node in a Partition:

“name” (required): IBM,VASI.

“device_type” (required): IBM,VASI-1.

“model” (not applicable): Property not present.

“compatible” (required): IBM,VASI-1.

“used-by-rtas” (not required): Property not present.

“ibm,loc-code” (required): Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in information on Virtual Card Connector Location Codes.

“reg” (required): Standard property name per IEEE 1275 specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA presented as an encoded array as with encode-phys of length “#address-cells”; value shall be 0xwhatever (virtual “reg” property used for unit address, no actual locations used; therefore, the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (required): Property name specifying the DMA window associated with this virtual IOA presented as an encoded array of triples, each triple consisting of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int respectively.

“interrupts” (required): Standard property name specifying the interrupt source number and sense code associated with this virtual IOA presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0 indicating positive edge triggered. The interrupt source number being the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (for DR): Present if the platform implements DR for this node.

“ibm,#dma-size-cells” (not required): Property name to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. If the “ibm,#dma-size-cells” property is missing, the default value is the “#size-cells” property for the parent package.

“ibm,#dma-address-cells” (not required): Property name to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of child's dma-window properties. If the “ibm,#dma-address-cells” property is missing, the default value is the “#address-cells” property for the parent package.

“ibm,#buffer-pools” (required): Property name to define the number, encoded as with encode-int, of different data buffer size pools supported by the VASI virtual IOA service sessions.

“ibm,crq-size” (required): Property name to define the size, in bytes, of the VASI virtual IOA CRQ, encoded as with encode-int.

“ibm,#vasi-streams” (required): Property name to define the number of simultaneous unique VASI stream IDs that may be supported by the VASI virtual IOA CRQ, encoded as with encode-int.
VASI Support hcall()s The hcall()s of this section support the VASI option. H_DONOR_OPERATION supplies the hypervisor with processor cycles to perform administrative services. H_VASI_SIGNAL allows partitions to signal anomalous conditions, such as the need to abort the administrative service stream, without having to have an open VASI virtual IOA. H_VASI_STATE allows partitions that do not have an open VASI virtual IOA for a given VASI stream ID to poll the state of their administrative service streams.
H_DONOR_OPERATION This hcall() supplies donor partition cycles to perform hypervisor operations for a given VASI Stream. The TOP/BOTTOM parameter indicates the VASI stream, and thus a specific operating context relative to the caller and callee. The cycles donated by any and all TOP/BOTTOMs associated with the VASI Stream are combined by the platform to perform the needed processing for the stream. A platform may use the cycles from different TOP/BOTTOM pairs to create parallel processes to improve the stream performance. Syntax: Parameters: TOP/BOTTOM_ID (the opaque handles of a specific VASI operation stream relative to the caller and callee). Semantics: If the TOP/BOTTOM_ID parameter is invalid relative to the calling partition, return H_Parameter. If the VASI stream is in the aborted state, return H_Aborted. Perform the next operation associated with the specified VASI stream. Note that the amount of processing performed on any one call is limited by the interrupt hold-off constraints of standard hypervisor calls. (The format of the platform operation state structure is outside of the scope of this architecture.) If the specific VASI stream operation is fully complete, return H_Success. If the specific VASI stream requires more processing to fully complete the platform operation and is not blocked waiting for asynchronous agent(s), return H_IN_PROGRESS. If the VASI stream is blocked waiting for asynchronous agent(s), return H_LongBusyOrder* (where * is the appropriate expected waiting time). R1--1. For the VASI option: The platform must implement the H_DONOR_OPERATION hcall() following the syntax and semantics of .
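As an informative illustration, the following C sketch shows a donor-partition loop over H_DONOR_OPERATION consistent with the semantics above: keep donating cycles while the stream reports more work, back off on long-busy returns, and stop on success or abort. The wrapper functions and the numeric return values are assumptions for illustration; only the return-code names come from the text. (Per the migration requirements later in this chapter, software driving multiple streams should run such a loop against the TOP/BOTTOMs of each stream so that none is starved.)

    #include <stdint.h>

    #define H_SUCCESS     0
    #define H_IN_PROGRESS (-100)  /* illustrative values only */
    #define H_ABORTED     (-101)
    #define H_PARAMETER   (-4)

    extern long h_donor_operation(uint64_t top_bottom_id);  /* hypothetical */
    extern int  hcall_is_long_busy(long rc);                /* hypothetical */
    extern void delay_for(long rc);  /* sleep for the hinted interval */

    static long donate_until_done(uint64_t top_bottom_id)
    {
        for (;;) {
            long rc = h_donor_operation(top_bottom_id);

            if (rc == H_IN_PROGRESS)
                continue;                  /* more work, not blocked */
            if (hcall_is_long_busy(rc)) {  /* blocked on async agents */
                delay_for(rc);
                continue;
            }
            return rc;  /* H_SUCCESS, H_ABORTED, or H_PARAMETER */
        }
    }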
H_VASI_SIGNAL This hcall() signals the condition specified by the “signal code” parameter to the VASI Virtual IOA of the VASI Stream associated with the “handle” parameter; optionally, a non-zero “reason code” may be associated with the signal code so that firmware may record the signal using facilities and methods that are outside the scope of this architecture. Syntax: Parameters: handle -- the VASI Stream ID (the opaque handle of a specific VASI operation stream). signal_code -- one of the values listed in , right justified with high order bytes zero. reason_code -- the code the user gives as the reason for the signal, right justified with high order bytes zero; the value is simply transported, not checked by the platform. Semantics: If the “handle” parameter is invalid relative to the calling partition, then return H_Parameter. If the “signal_code” is invalid based upon the values listed in , then return H_P2. If the “signal_code” is valid for the current VASI stream state, initiate the processing defined for the “signal_code”; else return H_NOOP.

VASI Signal Codes (Name / Value / Description / VASI Operation Stream / Valid for VASI Signal Message / Valid for H_VASI_SIGNAL):

Null / 0x0 / Not used (invalid) / All / N / N

Cancel / 0x1 / Gracefully cancel processing if possible / Partition Migration, Partition Hibernation / Y / Y

Abort / 0x2 / Immediately halt function / Partition Migration, Partition Hibernation / Y / N

Suspend / 0x3 / Suspend target partition / Partition Migration, Partition Hibernation / Y / N

Complete / 0x4 / Complete paging operation / Paging / Y / N

Enable / 0x5 / Enabling paging operation / Paging / Y / N

Reserved / 0x6-0xFFFF / Reserved / All / N / N
R1--1. For the VASI option: The platform must implement the H_VASI_SIGNAL hcall() following the syntax and semantics of .
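As an informative illustration, the following C sketch signals a graceful cancel for a stream via H_VASI_SIGNAL, using the Cancel (0x1) code from the table above. The wrapper signature is a hypothetical stand-in for the platform hcall interface.

    #include <stdint.h>

    #define VASI_SIGNAL_CANCEL 0x1  /* from the VASI Signal Codes table */

    /* Hypothetical wrapper for the H_VASI_SIGNAL hcall(). */
    extern long h_vasi_signal(uint64_t stream_id, uint64_t signal_code,
                              uint64_t reason_code);

    static long cancel_stream(uint64_t stream_id, uint32_t reason)
    {
        /* signal and reason codes are right justified with high order
         * bytes zero; the reason is transported opaquely, not checked */
        return h_vasi_signal(stream_id, VASI_SIGNAL_CANCEL,
                             (uint64_t)reason);
    }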
H_VASI_STATE This hcall() returns the state of the specified VASI operation stream. Syntax: Parameters: handle -- the VASI Stream ID (the opaque handle of a specific VASI operation stream relative to the caller and callee). Semantics: If the “handle” parameter is invalid relative to the calling partition, return H_Parameter. Else enter the value of the VASI state variable (see ) for the indicated stream into R4 and return H_Success. R1--1. For the VASI option: The platform must implement the H_VASI_STATE hcall() following the syntax and semantics of .
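As an informative illustration, the following C sketch polls a stream's state via H_VASI_STATE, as a partition without an open VASI virtual IOA would (for example, a migrating partition waiting to see “Completed”). The wrapper is a hypothetical stand-in; the real hcall returns the state in R4. The state values come from the migration state table below.

    #include <stdint.h>

    #define H_SUCCESS            0
    #define VASI_STATE_ABORTED   2  /* migration state table below */
    #define VASI_STATE_COMPLETED 6

    /* Hypothetical wrapper returning the R4 state value through *state. */
    extern long h_vasi_state(uint64_t stream_id, uint64_t *state);

    static int wait_for_final_state(uint64_t stream_id)
    {
        uint64_t state;

        do {
            if (h_vasi_state(stream_id, &state) != H_SUCCESS)
                return -1;  /* invalid handle */
            /* a real caller would sleep between polls */
        } while (state != VASI_STATE_COMPLETED &&
                 state != VASI_STATE_ABORTED);

        return (int)state;
    }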
VASI Operation Stream Specifications This section defines the usage of VASI to accomplish specific administrative services. Each section specifies the valid VASI state codes, state transitions, and reason codes; the action of the VASI virtual IOA in each state; and the expected behavior of the VASI device driver in order to achieve the operational goal.

VASI Stream Services for Partition Migration (Name / Value / Description):

Unused / 0 / (none)

Source Mover for Partition Migration / 1 / VASI device will be used to extract partition state from the source platform to the target platform, using VASI Operations (Write sequential) to extract partition state and VASI Download commands to give the source platform paging requests. See .

Target Mover for Partition Migration / 2 / VASI device will be used to insert the migrating partition’s state into the target platform. VASI Download requests will be used to give platform firmware partition state, and VASI Operations (Write sequential) will be used by platform firmware to give paging requests to the Mover partition to deliver to the source platform. See .

Pager for the CMO option / 3 / VASI device will be used to handle CMO paging requests. See .
Partition Migration defines the VASI Services for Partition Migration for use in the VASI Open CRQ request, as defined in . Requirements: R1--1. For the Partition Migration Option: If any partition code uses the value of the processor PVR to modify its operation, then to ensure valid operation after the resume from suspension, and prior to executing any such modified operation code, partition code must reread the PVR value and be prepared to remodify its operation. R1--2. For the Partition Migration Option: In order that LAN communication partners may learn of the new MAC address that may be associated with a migrated partition, the migrated partition must generate “gratuitous ARP” messages. It is suggested that these “gratuitous ARP” messages be sent at the rate of once per second between the time that the migrating partition resumes and the time that the H_VASI_STATE hcall() responds with “Completed”. R1--3. For the Partition Migration Option: To maintain the platform capability to perform live firmware updates, the OS must call the ibm,activate-firmware RTAS service after waking from a migration suspension. R1--4. For the Partition Migration Option: The platform must implement the ILLAN option (see ). R1--5. For the Partition Migration Option: Platform firmware must support both immediate and indirect data in its VASI Download data buffers. R1--6. For the Partition Migration Option: If multiple partition migrations are being performed using a single VASI device, then to ensure that none of the migrations are starved, partition software must call H_DONOR_OPERATION with TOP/BOTTOMs associated with each migration (VASI Stream ID). R1--7. For the Partition Migration Option: If the platform detects any unrecoverable error in processing a VASI Download command, it must transition the associated VASI stream to the “Aborted” state. R1--8. For the Partition Migration Option: The VASI stream ID for the specific high level migration function must be the same value in both the source and target platforms. Programming Note: If partition software wishes to get an accurate count of the number of bytes to be transferred, the VASI Progress CRQ message should be issued immediately following a VASI Open and before any cycles are donated for that migration via H_DONOR_OPERATION.
Partition Migration Abort Reason Codes defines the Abort reason code layout for Partition Migration for use with the H_VASI_SIGNAL hypervisor call and the VASI Signal and State CRQ requests, as defined in .

Partition Migration Abort Reason Codes (Name / Byte / Description):

Aborting Entity / 0 / 1 = Orchestrator; 2 = VSP providing VASI partition source migration service; 3 = Partition Firmware; 4 = Platform Firmware; 5 = VSP providing VASI partition target migration service; 6 = Migrating partition.

Detailed Error / 1-3 / Bytes one through three of the reason code are opaque values, architected by individual aborting entities.
Partition Migration VASI States This section defines the partition migration VASI states as used in the VASI State request CRQ message and as returned from the H_VASI_STATE hcall().

VASI Migration Session States (Name / Value / Description):

Invalid / 0x0 / This state is defined on both the source and destination platform and indicates either that the specified Stream ID is not valid (or is no longer valid) or that the invoking partition is not authorized to utilize the Stream ID.

Enabled / 1 / This state is defined on both the source and destination platform and indicates that the partition has been enabled for migration but has not progressed beyond this initial state. The transition to this state is triggered by events outside of the scope of PAPR. The partition on the source server transitions to this state first, and then the partition on the destination server.

Aborted / 2 / This state is defined on both the source and the destination platform and indicates that the abort processing has completed. If the migration has been aborted, this is the final state of the migration, and platform firmware ensures that all interested partitions see this state at least once. Platform firmware continues to return this state until events outside of the scope of PAPR terminate the operation and all interested partitions have seen this final state. In this state, VASI Download request information is flushed, returning success status. VASI Signal requests other than “Abort” are treated as an invalid state transition. The transition to this state occurs on the two servers independently, and thus it is a race condition which server transitions to this state first.

Suspending / 3 / This state is defined only on the source platform and indicates that the partition is in the process of suspending itself. When the migrating partition sees this state, it enters its suspension sequence, which concludes with the call to ibm,suspend-me. The transition to this state occurs when the source VSP directs platform firmware to suspend the partition via a VASI Signal request (Signal Code = Suspend) on the VASI device.

Suspended / 4 / This state is defined only on the source platform and indicates that the partition has suspended itself via the ibm,suspend-me RTAS call. This is the point in the sequence where platform firmware will reject attempts by the user to abort the migration.

Resumed / 5 / This state is defined on both the source and destination platform and indicates that the partition has resumed execution on the destination platform. The transition to this state occurs on the destination platform first, when it receives the dirty page bitmap from the source platform firmware. It is at this point that the virtual processors of the migrating partition are dispatched on the destination platform.

Completed / 6 / This state is defined on both the source and destination platform and indicates that the migration has completed and all partition state has been transferred to the destination platform. This is the final state of the migration, and platform firmware ensures that all interested partitions see this state at least once. Platform firmware continues to return this state until events outside of the scope of PAPR terminate the operation and all interested partitions have seen this final state. The transition to this state occurs on the source platform first, as soon as all of the residual state of the migrating partition has been successfully transferred to the destination platform. The transition to this state on the destination platform occurs when all of the partition state has been received from the source platform. For an inactive migration, the partition is transferred as a single unit, so the partition in the destination platform just moves from Enabled to Completed on a successful inactive migration.
Partition Hibernation R1--1. For the Partition Hibernation Option: The platform must ensure that all hibernating partition dynamic reconfiguration operations are complete prior to signaling suspension of the partition. R1--2. For the Partition Hibernation Option: If any partition code uses the value of the processor PVR to modify its operation, then after the resume from suspension, but prior to executing any such modified operation code, it must reread the PVR value and be prepared to remodify its operation. R1--3. For the Partition Hibernation Option: In order that LAN communication partners may learn of the new MAC address that may be associated with a hibernated partition, the hibernated partition must generate “gratuitous ARP” messages. It is suggested that these “gratuitous ARP” messages be sent at the rate of once per second between the time that the hibernated partition resumes and the time that the H_VASI_STATE hcall() responds with “Completed”. R1--4. For the Partition Hibernation Option: To maintain the platform capability to perform live firmware updates, the OS must call the ibm,activate-firmware RTAS service after waking from a hibernation suspension. R1--5. For the Partition Hibernation Option: The platform must implement the ILLAN option (see ). R1--6. For the Partition Hibernation Option: The VASI stream ID for the specific high level hibernation function must be the same value for both the suspend and wake phases.
Cooperative Memory Overcommitment The CMO option defines the stream service value 3 for “Pager”. The Pager VASI device is used to page out paging partition state to the VASI Server Partition (VSP) using VASI Operation requests (Write indexed), and to page in partition state from the VSP using VASI Operation requests (Read indexed). The Pager VASI service utilizes a subset of the VASI Operation request architecture. Specifically, in the VASI Operation Request Specifier structure, the File offset of the start for indexed operations field (Bytes 9:15) is not used (value = 0x00). The scatter/gather list is a series of 1 - N sub operation specifications, each starting with the positioning of the file cursor using a type 0xC0 control element to establish the file block location, optionally followed by a type 0x40 control element to position the file cursor to a byte within the established file block; this is followed by one and only one type 0x80 control element per sub operation to transfer the sub operation data. The VASI Operation Request Specifier structure terminates, as always, with a type 0x00 control element with a segment length field of 0x000000. When a Pager VASI service aborts, the reason code returned is per . The Pager Service VASI States, as used in the state request CRQ message and as returned from the H_VASI_STATE hcall(), are as defined in .

CMO VASI Abort Reason Codes (Name / Byte / Description):

Entity (who is issuing state change or signal) / 0 / 1 = VASI; 2 = I/O Provider; 3 = Platform Firmware.

Detailed Reason / 1-3 / Bytes one through three of the reason code are opaque values, architected by individual entities.
CMO VASI States (Name / Value / Description):

Invalid / 0x0 / This state indicates that the specified Stream ID is not valid (or is no longer valid) or the invoking partition is not authorized to utilize the Stream ID.

Disabled / 1 / This state indicates that the specified Stream ID is valid, but the stream has not yet been opened by the VSP providing VASI paging service. The transition to this state is triggered by events outside of the scope of PAPR.

Suspended / 2 / This state indicates that the specified Stream ID is valid, but the client partition has not yet been powered on.

Enabled / 3 / This state indicates that the stream has been opened by the VSP providing VASI paging service and the client partition is powered on.

Stopped / 4 / This state indicates that the specified Stream ID is valid; however, platform firmware is no longer using the stream to perform paging. The transition to this state is triggered by events outside of the scope of PAPR.

Completed / 5 / This state indicates that paging has been terminated for this stream by a request to halt paging for this Stream ID.
Virtual Fibre Channel (VFC) using NPIV N_Port ID Virtualization (NPIV) is part of the Fibre Channel (FC) standards. NPIV allows multiple World Wide Port Names (WWPNs) to be mapped to a single physical port of a FC adapter. This section defines a Virtual Fibre Channel (VFC) interface to a server partition interfacing to a physical NPIV adapter that allows multiple partitions to share a physical port using different WWPNs. The implementation support is provided by code running in a server partition that uses the mechanisms of the Reliable Command/Response Transport and Logical Remote DMA of the Synchronous VIO Infrastructure to service I/O requests for code running in a client partition. The client partition appears to enjoy the services of its own FC adapter (see ) with a WWPN visible to the FC fabric. The terms server and client partitions refer to platform partitions that are respectively servers and clients of requests, usually I/O operations, using the physical I/O adapters (IOAs) that are assigned to the server partition. This allows a platform to have more client partitions than it may have physical I/O adapters because the client partitions share I/O adapters via the server partition. The VFC model makes use of Remote DMA which is built upon the architecture specified in the following sections:
VFC and NPIV General This section contains an informative outline of the architectural intent of the use of VFC and NPIV; it assumes the reader is familiar with the VSCSI architecture (see ) and with the FC standards. Other implementations of the server and client partition code, consistent with this architecture, are possible and may be preferable. The client partition provides the virtual equivalent of a single port FC adapter via each VFC client IOA. The platform, through the partition definition, provides means for defining the set of virtual IOAs owned by each partition and their respective location codes. The platform also provides, through partition definition, instructions to connect each client partition’s VFC client IOA to a specific server partition’s VFC server IOA. The mechanism for specifying this partition definition is beyond the scope of this architecture. The human readable handle associated with the partition definition of virtual IOAs and their associated interconnection and resource configuration is the virtual location code. The OF unit address (Unit ID) remains the invariant handle upon which the OS builds its “physical to logical” configuration. The platform also provides a method to assign unique WWPNs for each VFC client adapter. The port names are used by a SAN administrator to grant access to storage to a client partition. The mechanism for allocating port names is beyond the scope of this architecture. The client partition’s device tree contains one or more nodes notifying the partition that it has been assigned one or more virtual adapters. The node’s “type” and “compatible” properties notify the partition that the virtual adapter is a VFC adapter. The unit address of the node is used by the client partition to map the virtual device(s) to the OS’s corresponding logical representations. The “ibm,my-dma-window” property communicates the size of the RTCE table window panes that the hypervisor has allocated. The node also specifies the interrupt source number that has been assigned to the Reliable Command/Response Transport connection and the RTCE range that the client partition device driver may use to map its memory for access by the server partition via Logical Remote DMA. The client partition also reads its WWPNs from the device tree. Two WWPNs are presented to the client in the properties “ibm,port-wwn-1” and “ibm,port-wwn-2”, and the server tells the client, through a CRQ protocol exchange, which one of the two to use. The client partition uses the four hcall()s associated with the Reliable Command/Response Transport facility to register and deregister its CRQ, manage notification of responses, and send command requests to the server partition. The server partition’s device tree contains one or more node(s) notifying the partition that it is requested to supply VFC services for one or more client partitions. The unit address (Unit ID) of the node is used by the server partition to map to the local logical devices that are represented by this VFC device. The node also specifies the interrupt source number that has been assigned to the Reliable Command/Response Transport connection and the RTCE range that the server partition device driver may use for its copy Logical Remote DMA. The server partition uses the four hcall()s associated with the Reliable Command/Response Transport facility to register and deregister its command request queue, manage notification of new requests, and send responses back to the client partition.
In addition, the server partition uses the hcall()s of the Logical Remote DMA facility to manage the movement of commands and data associated with the client requests. The client partition, upon noting the device tree entry for the virtual adapter, loads the device driver associated with the value of the “compatible” property. The device driver, when configured and opened, allocates memory for its CRQ (an array, large enough for all possible responses, of 16 byte elements), pins the queue, and maps it into the I/O space of the RTCE window specified in the “ibm,my-dma-window” property using the standard kernel mapping services that subsequently use the H_PUT_TCE hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next, I/O request control blocks (within which the I/O request commands are built) are allocated, pinned, and mapped into I/O address space. Finally, the device driver registers to receive control when the interrupt source specified in the virtual IOA’s device tree node signals. Once the CRQ is set up, the device driver queues an Initialization Command/Response with the second byte of “Initialize” in order to attempt to tell the hosting side that everything is set up on the hosted side. The response to this send may be that the send has been dropped or has successfully been sent. If successful, the sender should expect back an Initialization Command/Response with a second byte of “Initialization Complete,” at which time the communication path can be deemed to be open. If dropped, then the sender waits for the receipt of an Initialization Command/Response with a second byte of “Initialize,” at which time an “Initialization Complete” message is sent, and if that message is sent successfully, then the communication path can be deemed to be open. When the VFC Adapter device driver receives an I/O request from one of the FC device head drivers, it executes the following sequence. First, an I/O request control block is allocated. Then it builds the FC Information Unit (FC IU) request within the control block, adds a correlator field (to be returned in the subsequent response), I/O maps any target memory buffers, and places their DMA descriptors into the I/O request control block. With the request constructed in the I/O request control block, the driver constructs a DMA descriptor (starting offset and length) representing the FC IU within the I/O request control block. It also constructs a DMA descriptor for the FC Response Unit. Lastly, the driver passes the I/O request’s DMA descriptor to the server partition using the H_SEND_CRQ hcall(). Provided that the H_SEND_CRQ hcall() succeeds, the VFC Adapter device driver returns, waiting for the response interrupt indicating that a response has been posted by the server partition to the device driver’s response queue. The response queue entry contains the summary status and request correlator. From the request correlator, the device driver accesses the I/O request control block, the summary status, and the FC Response Unit, and determines how to complete the processing of the I/O request. Notice that the client partition only uses the Reliable Command/Response Transport primitives; it does not use the Logical Remote DMA primitives. Since the server partition’s RTCE tables are not authorized for access by the client partition, any attempt by the client partition to modify server partition memory would be prevented by the hypervisor. RTCE table access is granted on a connection by connection basis (client/server virtual device pair).
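As an informative illustration, the following C sketch condenses the client-side setup sequence described above: map the pinned CRQ into the first window pane, register it, then send the Initialize message. The kernel-service and hcall wrappers are hypothetical stand-ins for the facilities named in the text (H_PUT_TCE, H_REG_CRQ, H_SEND_CRQ), and the byte encoding of the Initialization Command/Response (0xC0 header, second byte 0x01 Initialize / 0x02 Initialization Complete) is an assumption for illustration.

    #include <stdint.h>

    #define H_SUCCESS 0
    #define H_DROPPED (-2)  /* illustrative value only */

    /* hypothetical: pins the buffer and TCE-maps it via H_PUT_TCE */
    extern uint64_t io_map(uint32_t liobn, void *addr, uint64_t len);
    extern long h_reg_crq(uint64_t unit_addr, uint64_t ioba, uint64_t len);
    extern long h_send_crq(uint64_t unit_addr, uint64_t msg_hi,
                           uint64_t msg_lo);

    static long client_open(uint64_t unit_addr, uint32_t liobn,
                            void *crq, uint64_t crq_len)
    {
        /* map the pinned CRQ pages into the first window pane, register */
        uint64_t ioba = io_map(liobn, crq, crq_len);
        long rc = h_reg_crq(unit_addr, ioba, crq_len);
        if (rc != H_SUCCESS)
            return rc;

        /* send Initialize: byte 0 = 0xC0, byte 1 = 0x01 (assumed) */
        rc = h_send_crq(unit_addr, (0xC0ULL << 56) | (0x01ULL << 48), 0);
        /* On success, expect Initialization Complete on our CRQ; if the
         * send was dropped, wait for the partner's Initialize and answer
         * it with Initialization Complete before declaring the path open. */
        return rc;
    }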
If a client partition happens to be serving some other logical device, then the partition is entitled to use Logical Remote DMA for the virtual devices that it is serving. The server partition, upon noting the device tree entry for the virtual adapter, loads the device driver associated with the value of the “compatible” property. The device driver, when configured and opened, allocates memory for its request queue (an array, large enough for all possible outstanding requests, of 16 byte elements). The driver then pins the queue and maps it into I/O space, via the kernel’s I/O mapping services that invoke the H_PUT_TCE hcall(), using the first window pane specified in the “ibm,my-dma-window” property. The queue is then registered using the H_REG_CRQ hcall(). Next, I/O request control blocks (within which the I/O request commands are built) are allocated, pinned, and I/O mapped. Finally, the device driver registers to receive control when the interrupt source specified in the virtual IOA’s device tree node signals. Once the CRQ is set up, the device driver queues an Initialization Command/Response with the second byte of “Initialize” in order to attempt to tell the hosted side that everything is set up on the hosting side. The response to this send may be that the send has been dropped or has successfully been sent. If successful, the sender should expect back an Initialization Command/Response with a second byte of “Initialization Complete,” at which time the communication path can be deemed to be open. If dropped, then the sender waits for the receipt of an Initialization Command/Response with a second byte of “Initialize,” at which time an “Initialization Complete” message is sent, and if that message is sent successfully, then the communication path can be deemed to be open. When the server partition’s device driver receives an I/O request from its corresponding client partition’s VFC adapter drivers, it is notified via the interrupt registered for above. The server partition’s device driver selects an I/O request control block for the requested operation. It then uses the DMA descriptor from the request queue element to transfer the FC IU request from the client partition’s I/O request control block to its own (allocated above), using the H_COPY_RDMA hcall() through the second window pane specified in the “ibm,my-dma-window” property. The server partition’s device driver then uses kernel services, that are extended, to register the I/O request’s DMA descriptors into extended capacity cross memory descriptors (ones capable of recording the DMA descriptors). These cross memory descriptors are later mapped by the server partition’s physical device drivers into the physical I/O DMA address space of the physical I/O adapters using the kernel services, that have been similarly extended to call the H_PUT_RTCE hcall(), based upon the value of the LIOBN field referenced by the cross memory descriptor. At this point, the server partition’s VFC device driver delivers what appears to be a FC IU request to be routed through the server partition’s adapter driver. When the request completes, the server partition’s VFC device driver is called through a registered entry point and it packages the summary status along with the request correlator into a response message that it sends to the client partition using the H_SEND_CRQ hcall(), then recycles the resources recorded in the I/O request control block, and the block itself.
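As an informative illustration, the following C sketch shows the server-side handling described above: pull the client's FC IU through the second window pane with H_COPY_RDMA, hand it to the physical adapter, then return summary status plus the request correlator with H_SEND_CRQ. The wrappers are hypothetical stand-ins; the response packing follows the example response element layout given below.

    #include <stdint.h>

    #define H_SUCCESS 0

    extern long h_copy_rdma(uint64_t len,
                            uint32_t src_liobn, uint64_t src_ioba,
                            uint32_t dst_liobn, uint64_t dst_ioba);
    extern long h_send_crq(uint64_t unit_addr, uint64_t msg_hi,
                           uint64_t msg_lo);

    static long serve_request(uint64_t unit_addr,
                              uint32_t second_pane_liobn, /* client memory */
                              uint64_t client_iu_ioba, uint64_t iu_len,
                              uint32_t my_liobn, uint64_t my_iu_ioba,
                              uint64_t correlator, uint32_t summary_status)
    {
        /* copy the client's FC IU into a local I/O request control block */
        long rc = h_copy_rdma(iu_len, second_pane_liobn, client_iu_ioba,
                              my_liobn, my_iu_ioba);
        if (rc != H_SUCCESS)
            return rc;

        /* ... route the FC IU through the physical adapter driver ... */

        /* Response per the example layout below: byte 0 = 0x80 (valid),
         * byte 1 = 0x01 (VFC response), bytes 4-7 = summary status,
         * bytes 8-15 = the client's 8-byte command correlator. */
        uint64_t msg_hi = (0x80ULL << 56) | (0x01ULL << 48) | summary_status;
        return h_send_crq(unit_addr, msg_hi, correlator);
    }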
The LIOBN value in the second window pane of the server partition’s “ibm,my-dma-window” property is intended to be an indirect reference to the RTCE table of the client partition. If, for some reason, the physical location of the client partition’s RTCE table changes or it becomes invalid, this level of indirection allows the hypervisor to determine the current target without changing the LIOBN number as seen by the server partition. The H_PUT_TCE and H_PUT_RTCE hcall()s do not map server partition memory into the second window pane; the second window pane is only available for use by the server partition via the Logical RDMA services to reference memory mapped into it by the client partition’s IOA. This architecture does not specify the payload format of the requests or responses. However, the architectural intent is supplied in the following tables for reference.

General Form of Reliable CRQ Element (Byte Offset / Field Name / Description):

0 / Header / Contains Element Valid Bit plus Event Type Encodings (see ).

1 / Payload Format or Transport Event Code / For Valid Command Response Entries, see ; for Transport Event Codes, see .

2-15 / (format dependent) / Format dependent.
Example Reliable CRQ Entry Format Byte Definitions for VFC (Format Byte Value / Definition):

0x0 / Unused

0x01 / VFC Requests

0x02 - 0x03 / Reserved

0x04 / Management Datagram

0x05 - 0xFE / Reserved

0xFF / Reserved for Expansion
Example VFC Command Queue Element (Byte Offset / Field Value / Description):

0 / 0x80 / Valid Header

1 / 0x01 / VFC Requests

1 / 0x04 / Management Datagram

2-3 / NA / Reserved

4-7 / (data) / Length of the request block to be transferred

8-15 / (data) / I/O address of beginning of request
Example VFC Response Queue Element (Byte Offset / Field Value / Description):

0 / 0x80 / Valid Header

1 / 0x01 / VFC Response Format

1 / 0x02 / Asynchronous Event

1 / 0x04 / Management Datagram

2-3 / NA / Reserved

4-7 / (data) / Summary Status

8-15 / (data) / 8 byte command correlator
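As an informative illustration, the two example element layouts above can be rendered directly as C structures; the structs mirror the tables and add nothing to them. Multi-byte fields are assumed to be big-endian on the wire, as is conventional for this architecture.

    #include <stdint.h>

    struct vfc_crq_cmd {        /* Example VFC Command Queue Element */
        uint8_t  valid;         /* 0x80: valid header */
        uint8_t  format;        /* 0x01 VFC Requests, 0x04 Mgmt Datagram */
        uint8_t  reserved[2];   /* bytes 2-3 */
        uint32_t length;        /* bytes 4-7: request block length */
        uint64_t ioba;          /* bytes 8-15: I/O address of request */
    };

    struct vfc_crq_rsp {        /* Example VFC Response Queue Element */
        uint8_t  valid;         /* 0x80: valid header */
        uint8_t  format;        /* 0x01 response, 0x02 async, 0x04 mgmt */
        uint8_t  reserved[2];   /* bytes 2-3 */
        uint32_t status;        /* bytes 4-7: summary status */
        uint64_t correlator;    /* bytes 8-15: 8 byte command correlator */
    };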
VFC and NPIV Requirements This normative section provides the general requirements for the support of VFC. R1--1. For the VFC option: The platform must implement the Reliable Command/Response Transport option as defined in . R1--2. For the VFC option: The platform must implement the Logical Remote DMA option as defined in . R1--3. For the VFC option: The platform must allocate a WWPN pair for each VFC client and must present the WWPNs to the VFC clients in their OF device tree. In addition to the firmware primitives and the structures they define, the partition’s OS needs to know specific information regarding the configuration of the virtual IOAs that it has been assigned so that it can load and configure the correct device driver code. This information is provided by the OF device tree node associated with the virtual IOA (see and ).
Client Partition VFC Device Tree Node Client partition VFC device tree nodes have associated packages such as disk-label, deblocker, iso-13346-files, and iso-9660-files, as well as child nodes such as block and byte, as appropriate to the specific virtual IOA configuration, as would the node for a physical FC IOA. R1--1. For the VFC option: The platform’s OF device tree for client partitions must include, as a child of the /vdevice node, a node of name “vfc-client” as the parent of a sub-tree representing the virtual IOAs assigned to the partition. R1--2. For the VFC option: The platform’s vfc-client OF node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the VFC Node in the Client Partition:

“name” (required): Standard property name per , specifying the virtual device name; the value shall be “vfc-client”.

“device_type” (required): Standard property name per , specifying the virtual device type; the value shall be “fcp”.

“model” (not applicable): Property not present.

“compatible” (required): Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,vfc-client”.

“used-by-rtas” (see definition): Present if appropriate.

“ibm,loc-code” (required): Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form specified in .

“reg” (required): Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA presented as an encoded array as with encode-phys of length “#address-cells”; value shall be 0xwhatever (virtual “reg” property used for unit address, no actual locations used; therefore, the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (required): Property name specifying the DMA window associated with this virtual IOA presented as an encoded array of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int.

“interrupts” (required): Standard property name specifying the interrupt source number and sense code associated with this virtual IOA presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0 indicating positive edge triggered. The interrupt source number being the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (for DR): Present if the platform implements DR for this node.

“ibm,#dma-size-cells” (see definition): Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells” (see definition): Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .

“ibm,port-wwn-1” (see definition): Property that represents one of two WWPNs assigned to this VFC client node. This property is a prop-encoded-array, each cell encoded with encode-int. The array consists of the high order 32 bits and low order 32 bits of the WWPN such that (32 bits high | 32 bits low) is the 64 bit WWPN. The WWPN that the client is to use (“ibm,port-wwn-1” or “ibm,port-wwn-2”) is communicated to the client by the server as part of the client-server communications protocol.

“ibm,port-wwn-2” (see definition): Property that represents one of two WWPNs assigned to this VFC client node. This property is a prop-encoded-array, each cell encoded with encode-int. The array consists of the high order 32 bits and low order 32 bits of the WWPN such that (32 bits high | 32 bits low) is the 64 bit WWPN. The WWPN that the client is to use (“ibm,port-wwn-1” or “ibm,port-wwn-2”) is communicated to the client by the server as part of the client-server communications protocol.
R1--3. For the VFC option: The platform’s vfc-client node must have as children the appropriate block (disk) and byte (tape) nodes.
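As an informative illustration, the following C sketch assembles a 64-bit WWPN from the two encode-int cells of “ibm,port-wwn-1” or “ibm,port-wwn-2” as defined above: (32 bits high | 32 bits low). The device tree accessor is a hypothetical stand-in for the OS's property-fetch service.

    #include <stdint.h>

    /* hypothetical: reads a two-cell property; returns 0 on success */
    extern int dt_read_cells(const char *prop, uint32_t cells[2]);

    static uint64_t read_wwpn(const char *prop)
    {
        uint32_t c[2];

        if (dt_read_cells(prop, c) != 0)
            return 0;
        return ((uint64_t)c[0] << 32) | c[1];  /* high cell first */
    }

    /* Usage: the server tells the client, via the CRQ protocol exchange,
     * whether to use read_wwpn("ibm,port-wwn-1") or
     * read_wwpn("ibm,port-wwn-2"). */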
Server Partition VFC Device Tree Node Server partition VFC IOA nodes have no child nodes. R1--1. For the VFC option: The platform’s OF device tree for server partitions must include, as a child of the /vdevice node, a node of name “vfc-server” as the parent of a sub-tree representing the virtual IOAs assigned to the partition. R1--2. For the VFC option: The platform’s vfc-server node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the VFC Node in the Server Partition:

“name” (required): Standard property name per , specifying the virtual device name; the value shall be “vfc-server”.

“device_type” (required): Standard property name per , specifying the virtual device type; the value shall be “fcp”.

“model” (not applicable): Property not present.

“compatible” (required): Standard property name per , specifying the programming models that are compatible with this virtual IOA; the value shall include “IBM,vfc-server”.

“used-by-rtas” (see definition): Present if appropriate.

“ibm,loc-code” (required): Property name specifying the unique and persistent location code associated with this virtual IOA presented as an encoded array as with encode-string. The value shall be of the form .

“reg” (required): Standard property name per , specifying the register addresses, used as the unit address (unit ID), associated with this virtual IOA presented as an encoded array as with encode-phys of length “#address-cells”; value shall be 0xwhatever (virtual “reg” property used for unit address, no actual locations used; therefore, the size field has zero cells (does not exist) as determined by the value of the “#size-cells” property).

“ibm,my-dma-window” (required): Property name specifying the DMA window associated with this virtual IOA presented as an encoded array of two sets (two window panes) of three values (LIOBN, phys, size) encoded as with encode-int, encode-phys, and encode-int. Of these two triples, the first describes the window pane used to map server partition memory; the second is the window pane through which the client partition maps the memory that it makes available to the server partition. (Note: the mapping between the LIOBN in the second window pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding client IOA’s RTCE table is made when the CRQ successfully completes registration. See for more information on window panes.)

“interrupts” (required): Standard property name specifying the interrupt source number and sense code associated with this virtual IOA presented as an encoded array of two cells encoded as with encode-int, with the first cell containing the interrupt source number and the second cell containing the sense code 0 indicating positive edge triggered. The interrupt source number being the value returned by the H_XIRR or H_IPOLL hcall().

“ibm,my-drc-index” (for DR): Present if the platform implements DR for this node.

“ibm,vserver” (required): Property name specifying that this is a virtual server node.

“ibm,#dma-size-cells” (see definition): Property name, to define the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .

“ibm,#dma-address-cells” (see definition): Property name, to define the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
Virtual Network Interface Controller (VNIC) This section defines a Virtual Network Interface Controller (VNIC) interface to a server partition interfacing to a physical Network Interface Controller (NIC) adapter that allows multiple partitions to share a physical port. The implementation support is provided by code running in a server partition that uses the mechanisms of the Synchronous VIO Infrastructure (or equivalent thereof as seen by the client) to service I/O requests for code running in a client partition. The client partition appears to enjoy the services of its own NIC adapter. The terms server and client partitions refer to platform partitions that are respectively servers and clients of requests, usually I/O operations, using the physical NIC that is assigned to the server partition. This allows a platform to have more client partitions than it may have physical NICs because the client partitions share I/O adapters via the server partition. The VNIC model makes use of Remote DMA, which is built upon the architecture specified in the following sections: The use of Remote DMA implies that the physical NIC must be able to do some of its own virtualization; for example, for an Ethernet adapter, being able to route receive requests, via DMA, to the appropriate client partition based on the addressing in the incoming packet.
VNIC General This section contains an informative outline of the architectural intent of the use of VNIC. Other implementations of the server and client partition code, consistent with this architecture, are possible and may be preferable. The client partition provides the virtual equivalent of a single port NIC adapter via each VNIC client IOA. The platform, through the partition definition, provides means for defining the set of virtual IOAs owned by each partition and their respective location codes. The platform also provides, through partition definition, instructions to connect each client partition’s VNIC client IOA to a specific server partition’s VNIC server IOA. The mechanism for specifying this partition definition is beyond the scope of this architecture. The human readable handle associated with the partition definition of virtual IOAs and their associated interconnection and resource configuration is the virtual location code. The OF unit address (unit ID) remains the invariant handle upon which the OS builds its “physical to logical” configuration. The platform also provides a method to assign unique MAC addresses for each VNIC client adapter. The mechanism for allocating MAC addresses is beyond the scope of this architecture. The client partition’s device tree contains one or more nodes notifying the partition that it has been assigned one or more virtual adapters. The node’s “type” and “compatible” properties notify the partition that the virtual adapter is a VNIC. The unit address of the node is used by the client partition to map the virtual device(s) to the OS’s corresponding logical representations. The “ibm,my-dma-window” property communicates the size of the RTCE table window panes that the hypervisor has allocated. The node also specifies the interrupt source number that has been assigned to the Reliable Command/Response Transport connection and the RTCE range that the client partition device driver may use to map its memory for access by the server partition via Logical Remote DMA. The client partition uses the hcall()s associated with the Reliable Command/Response Transport facility to register and deregister its CRQ, manage notification of responses, and send command requests to the server partition. The client partition uses the hcall()s associated with the Subordinate CRQ Transport facility to register and deregister any sub-CRQs necessary for the operations of the VNIC. The client partition, upon noting the device tree entry for the virtual adapter, loads the device driver associated with the value of the “compatible” property. The device driver, when configured and opened, allocates memory for its CRQ (an array, large enough for all possible responses, of 16 byte elements), pins the queue, and maps it into the I/O space of the RTCE window specified in the “ibm,my-dma-window” property using the standard kernel mapping services that subsequently use the H_PUT_TCE hcall(). The queue is then registered using the H_REG_CRQ hcall(). Next, I/O request control blocks (within which the I/O request commands are built) are allocated, pinned, and mapped into I/O address space. Finally, the device driver registers to receive control when the interrupt source specified in the virtual IOA’s device tree node signals. Once the CRQ is set up, the device driver in the client queues an Initialization Command/Response with the second byte of “Initialize” in order to attempt to tell the hosting side that everything is set up on the hosted side.
The response to this send is either that the send was dropped or that it was successfully sent. If successful, the sender should expect back an Initialization Command/Response with a second byte of “Initialization Complete,” at which time the communication path can be deemed open. If dropped, the sender waits for receipt of an Initialization Command/Response with a second byte of “Initialize,” at which time it sends an “Initialization Complete” message; if that message is sent successfully, the communication path can be deemed open. Once the CRQ connection is complete between the client and the server, the client receives from the server the number of sub-CRQs that can be supported on the client side. The client allocates memory for the first sub-CRQ (an array, large enough for all possible responses, of 32-byte elements), pins the queue, and maps it into the I/O space of the RTCE window specified in the “ibm,my-dma-window” property using the standard kernel mapping services that subsequently use the H_PUT_TCE hcall(). The queue is then registered using the H_REG_SUB_CRQ hcall(). This process continues until all desired sub-CRQs are registered or until the H_REG_SUB_CRQ hcall() indicates that the resources allocated to the client for sub-CRQs for this virtual IOA have been exhausted (H_Resource returned). Interrupt numbers for the sub-CRQs that have been registered are returned by the H_REG_SUB_CRQ hcall() (see ). Once all the CRQs and sub-CRQs are set up, communications between the client and server device drivers may commence, first for further setup operations and then for normal I/O requests, error communications, and so on. The protocol for these communications is beyond the scope of this architecture.
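The sub-CRQ registration loop described above can be sketched in the same non-normative style. Here hv_reg_sub_crq() is a stub standing in for H_REG_SUB_CRQ; the H_RESOURCE value, the signature (returning a queue handle and an interrupt number), and the pretend platform limit are assumptions for illustration only.

    /*
     * Sketch of the sub-CRQ registration loop.  hv_reg_sub_crq() is an
     * illustrative stand-in for H_REG_SUB_CRQ; H_RESOURCE's value and
     * the signature are assumptions, not a normative interface.
     */
    #include <stdint.h>
    #include <stdio.h>

    #define H_SUCCESS    0
    #define H_RESOURCE   (-16)       /* illustrative value */
    #define MAX_SUB_CRQS 16

    static long hv_reg_sub_crq(uint64_t unit, uint64_t ioba, uint64_t len,
                               uint64_t *handle, uint32_t *irq)
    {
        static int granted;          /* pretend the platform allows 4 */
        (void)unit; (void)ioba; (void)len;
        if (granted >= 4)
            return H_RESOURCE;
        *handle = 0x100 + granted;
        *irq    = 0x2000 + granted;  /* interrupt number for this queue */
        granted++;
        return H_SUCCESS;
    }

    /* Register up to 'wanted' sub-CRQs (each a pinned, mapped array of
     * 32-byte elements); stop early when H_Resource says the client's
     * allocation for this virtual IOA is exhausted.                   */
    int vnic_register_sub_crqs(uint64_t unit, const uint64_t *ioba,
                               uint64_t len, int wanted,
                               uint64_t *handle, uint32_t *irq)
    {
        int n;
        for (n = 0; n < wanted && n < MAX_SUB_CRQS; n++) {
            long rc = hv_reg_sub_crq(unit, ioba[n], len, &handle[n], &irq[n]);
            if (rc == H_RESOURCE)
                break;               /* run with the n queues we got */
            if (rc != H_SUCCESS)
                return -1;
        }
        return n;                    /* number of sub-CRQs registered */
    }

    int main(void)
    {
        uint64_t ioba[8] = {0}, handle[8]; uint32_t irq[8];
        int n = vnic_register_sub_crqs(0x30000002, ioba, 4096, 8, handle, irq);
        printf("registered %d sub-CRQs\n", n);
        return 0;
    }

As the text implies, a driver would typically treat an early H_Resource return as a soft limit and simply operate with the sub-CRQs it obtained.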
VNIC Requirements This normative section provides the general requirements for the support of VNIC.

R1--1. For the VNIC option: The platform must implement the Reliable Command/Response Transport option as defined in .
R1--2. For the VNIC option: The platform must implement the Subordinate CRQ Transport option as defined in .
R1--3. For the VNIC option: The platform must implement the Logical Remote DMA option as defined in .
R1--4. For the VNIC option: The platform’s OF device tree for client partitions must include, as a child of the /vdevice node, at least one node of name “vnic”.
R1--5. For the VNIC option: The platform’s vnic OF node must contain properties as defined in (other standard I/O adapter properties are permissible as appropriate).

Properties of the vnic Node in the OF Device Tree

Property Name | Required for vnic? | Required for vnic-server? | Definition
“name” | Y; value = “ibm,vnic” | Y; value = “ibm,vnic-server” | Standard property name per , specifying the virtual device name.
“device_type” | Y | N | Standard property name per , specifying the virtual device type; the value shall be “network”.
“model” | NA | NA | Property not present.
“compatible” | Y; value includes “ibm,vnic” | Y; value includes “ibm,vnic-server” | Standard property name per , specifying the programming models that are compatible with this virtual IOA.
“used-by-rtas” | Present if appropriate | Present if appropriate |
“ibm,loc-code” | Y | Y | Property name specifying the unique and persistent location code associated with this virtual IOA.
“reg” | Y | Y | Standard property name per , specifying the unit address (unit ID) associated with this virtual IOA, presented as an encoded array as with encode-phys of length “#address-cells”; value shall be 0xwhatever (virtual “reg” property used for unit address; no actual locations are used, therefore the size field has zero cells (does not exist), as determined by the value of the “#size-cells” property).
“ibm,my-dma-window” | Y; value = a single triplet | Y; value = two triplets | Property name specifying the DMA window associated with this virtual IOA, presented as an encoded array of one or more sets of three values (triplets) (LIOBN, phys, size), encoded as with encode-int, encode-phys, and encode-int. For the vnic-server, the two triplets describe two window panes: the first describes the window pane used to map server partition memory; the second is the window pane through which the client partition maps the memory that it makes available to the server partition. (Note: the mapping between the LIOBN in the second window pane of a server virtual IOA’s “ibm,my-dma-window” property and the corresponding client IOA’s RTCE table is made when the CRQ successfully completes registration. See .)
“interrupts” | Y | Y | Standard property name specifying the interrupt source number and sense code associated with this virtual IOA, presented as an encoded array of two cells encoded as with encode-int, the first cell containing the interrupt source number and the second cell containing the sense code 0, indicating positive edge triggered. The interrupt source number is the value returned by the H_XIRR or H_IPOLL hcall().
“ibm,my-drc-index” | For DR | For DR | Present if the platform implements DR for this node.
“ibm,#dma-size-cells” | See definition column | See definition column | Property name defining the package’s dma address size format. The property value specifies the number of cells that are used to encode the size field of dma-window properties. This property is present when the dma address size format cannot be derived using the method described in the definition for the “ibm,#dma-size-cells” property in .
“ibm,#dma-address-cells” | See definition column | See definition column | Property name defining the package’s dma address format. The property value specifies the number of cells that are used to encode the physical address field of dma-window properties. This property is present when the dma address format cannot be derived using the method described in the definition for the “ibm,#dma-address-cells” property in .
“local-mac-address” | Y | NA | Standard property name per , specifying the locally administered MAC address; locally administered addresses are denoted by having the low order two bits of the high order byte equal to 0b10.
“mac-address” | Y | NA | Standard property name per , specifying the initial MAC address (may be changed by a VNIC CRQ command).
“supported-network-types” | Y | NA | Standard property name per . Reports the possible types of “network” the device can support.
“chosen-network-type” | Y | NA | Standard property name per . Reports the type of “network” this device is supporting.
“max-frame-size” | Y | NA | Standard property name per , indicating the maximum packet size.
“address-bits” | Y | NA | Standard property name per , indicating the network address length.
“interrupt-ranges” | Y | Y | Standard property name defining the interrupt number(s) and range(s) handled by this device. Subordinate CRQs associated with this VNIC use interrupt numbers from these ranges.
“ibm,vf-loc-code” | NA | Y | Vendor-unique property name identifying the physical device virtual function upon which the vnic-server runs. The value is that of the “ibm,loc-code” property of the physical device virtual function.
“ibm,vnic-mode” | NA | Y | Vendor-unique property that represents the operational mode in which the vnic-server runs.
“ibm,vnic-client-mac” | NA | Y | Vendor-unique property that represents a vNIC server’s client MAC address.
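To make the “ibm,my-dma-window” encoding in the table concrete, the following non-normative C sketch decodes one (LIOBN, phys, size) triplet. It assumes one cell each for the LIOBN and size fields and two cells for phys; real code must take the cell counts from “ibm,#dma-address-cells” and “ibm,#dma-size-cells” or the derivation rules referenced above, and must byte-swap OF's big-endian cells (assumed already done here).

    /*
     * Illustrative decoding of an "ibm,my-dma-window" triplet.  Cell
     * counts (1 LIOBN cell, 2 phys cells, 1 size cell) are assumptions
     * for this sketch; property cells are assumed already converted to
     * host byte order.
     */
    #include <stdint.h>
    #include <stdio.h>

    struct dma_window {
        uint32_t liobn;              /* logical I/O bus number */
        uint64_t phys;               /* window start           */
        uint64_t size;               /* window length          */
    };

    static struct dma_window parse_window(const uint32_t *cells)
    {
        struct dma_window w;
        w.liobn = cells[0];
        w.phys  = ((uint64_t)cells[1] << 32) | cells[2];
        w.size  = cells[3];
        return w;
    }

    int main(void)
    {
        /* A vnic client node carries a single triplet; a vnic-server
         * node would carry two (one per window pane).               */
        uint32_t prop[4] = { 0x10000002, 0, 0, 0x10000000 };
        struct dma_window w = parse_window(prop);
        printf("liobn=%x phys=%llx size=%llx\n", (unsigned)w.liobn,
               (unsigned long long)w.phys, (unsigned long long)w.size);
        return 0;
    }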