You cannot select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
Linux-Architecture-Reference/LoPAR/app_eeh_handling.xml

302 lines
13 KiB
XML

<?xml version="1.0" encoding="UTF-8"?>
<!--
Copyright (c) 2016, 2020 OpenPOWER Foundation
Licensed under the Apache License, Version 2.0 (the "License");
you may not use this file except in compliance with the License.
You may obtain a copy of the License at
http://www.apache.org/licenses/LICENSE-2.0
Unless required by applicable law or agreed to in writing, software
distributed under the License is distributed on an "AS IS" BASIS,
WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
See the License for the specific language governing permissions and
limitations under the License.
-->
<appendix xmlns="http://docbook.org/ns/docbook"
xmlns:xl="http://www.w3.org/1999/xlink"
xml:id="dbdoclet.50569381_46906"
version="5.0"
xml:lang="en">
<title>EEH Error Processing</title>
<para>This appendix describes the architectural intent for EEH error processing.
This appendix does not attempt to illustrate all possible scenarios, and other
implementations are possible.</para>
<section xml:id="dbdoclet.50569381_11923">
<title>General Scenarios</title>
<para>In general, the device driver recovery consists of issuing an
<emphasis>ibm,read-slot-reset-state2</emphasis> call prior to doing any recovery
to determine if (1) the IOA is in the MMIO Stopped and DMA Stopped state (that is,
that an error has occurred which has put it into this state), and (2) whether or
not the PE has been reset by the platform in the process of entering the MMIO
Stopped and DMA Stopped state, and then doing one of the following:</para>
<orderedlist>
<listitem>
<para>Simplest approach: </para>
<itemizedlist>
<listitem>
<para>Reset the PE</para>
</listitem>
<listitem>
<para>Reconfigure the PE</para>
</listitem>
</itemizedlist>
</listitem>
<listitem xml:id="dbdoclet.50569381_39262">
<para>Most general approach (detailed more in <xref linkend="dbdoclet.50569381_38095"/>): </para>
<itemizedlist>
<listitem>
<para>Release the PE for <emphasis>Load</emphasis> /<emphasis>Store</emphasis></para>
</listitem>
<listitem>
<para>Issue <emphasis>Load</emphasis> /<emphasis> Store</emphasis> instructions t
o get any desired state information from the IOA</para>
</listitem>
<listitem>
<para>Call the <emphasis>ibm,slot-error-detail</emphasis> RTAS call to get the
platform error information</para>
</listitem>
<listitem>
<para>Log the error information</para>
</listitem>
<listitem>
<para>Reset the PE</para>
</listitem>
<listitem>
<para>Reconfigure the PE</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para>Most robust (no reset unless necessary): </para>
<itemizedlist>
<listitem>
<para>Release the PE for <emphasis>Load</emphasis> /<emphasis>Store</emphasis></para>
</listitem>
<listitem>
<para>Issue <emphasis> Load</emphasis> /<emphasis> Store</emphasis> instructions
to get any desired state information from the IOA</para>
</listitem>
<listitem>
<para>Call the <emphasis> ibm,slot-error-detail</emphasis> RTAS call to get the
platform error information</para>
</listitem>
<listitem>
<para>Log the error information</para>
</listitem>
<listitem>
<para>Device driver does IOA cleanup</para>
</listitem>
<listitem>
<para>Release the PE for DMA and restart operations (no reset) </para>
</listitem>
</itemizedlist>
</listitem>
</orderedlist>
<para>In any scenario, after several retries of a recoverable operation, the OS
may determine that further recovery efforts should cease. In such a case,
calling <emphasis>ibm,slot-error-detail</emphasis> with <emphasis>Function</emphasis> 2
(Permanent Error), in addition to returning error information, marks that the PE is
no longer accessible due to previous errors.</para>
</section>
<section xml:id="dbdoclet.50569381_38095">
<title>More Detail on the Most General Approach</title>
<para>The following gives a more detailed look at scenario #<xref linkend="dbdoclet.50569381_39262"/>
in <xref linkend="dbdoclet.50569381_11923"/>. This will be broken up into two groups of operations:
error logging and error recovery.</para>
<para>These scenarios assume that:</para>
<orderedlist>
<listitem>
<para>The <emphasis> ibm,configure-pe</emphasis> RTAS call is implemented.</para>
</listitem>
<listitem>
<para>The attempts at recovery stop when Max_Retries_Exceeded is true.</para>
</listitem>
</orderedlist>
<section>
<title>Error Logging</title>
<orderedlist>
<listitem>
<para>If the device driver is going to capture internal IOA-specific information
as a part of the error logging process or if the IOA controlled by the device driver
requires a longer wait after reset than the normal PCI specified minimum wait time,
then the device driver determines whether its IOA has been reset as a result of entering
EEH Stopped State, by looking at the <emphasis>PE Recovery Info</emphasis> output of
the <emphasis>ibm,read-slot-reset-state2</emphasis> RTAS call.</para>
</listitem>
<listitem>
<para>The OS or device driver insures that all MMIOs to the IOA(s) in the PE are finished.</para>
</listitem>
<listitem>
<para>If the IOA requires longer wait after reset times than the specified minimum,
and the PE was reset (see step #1) as a result of the EEH event, then wait the
additional necessary time before continuing.</para>
</listitem>
<listitem xml:id="dbdoclet.50569381_59303">
<para>The OS or device driver
enables PE MMIOs by calling the <emphasis>ibm,set-eeh-option</emphasis> RTAS call
with <emphasis>Function</emphasis> 2. </para>
</listitem>
<listitem>
<para>The OS or device driver calls the <emphasis>ibm,configure-pe</emphasis> RTAS call. </para>
<orderedlist numeration="loweralpha">
<listitem>
<para>If the PCI fabric does not need configuring (the PE was not reset previous to the
call or was reset but was previously configured with <emphasis>ibm,configure-pe</emphasis> ),
then the call returns without doing anything, otherwise it attempts to configure
the fabric up to but not including the endpoint IOA configuration registers.</para>
</listitem>
<listitem>
<para>If an EEH event occurs as a result of probing during the
<emphasis>ibm,configure-pe</emphasis> RTAS call that results in a
reset of the PE, the PE will be returned in the PE state of 2. Software
does not necessarily need to check this on return from the call. The case
where this occurs is expected to be rare, and probably signals a non-transient error.
In this case the software can continue on with the recovery phase of the EEH processing,
and will eventually hit the same EEH event on further processing.</para>
</listitem>
</orderedlist>
</listitem>
<listitem xml:id="dbdoclet.50569381_74560">
<para>If the PE was
reset (see step #1) as a result of the EEH event, then if the device driver is
going to gather IOA-specific information for logging, it needs to finish the
configuration of the IOA PCI configuration registers, by restoring the PCI
configuration space registers of the IOA(s) in the PE (for example, BARs,
Memory Space Enable, etc.). </para>
</listitem>
<listitem xml:id="dbdoclet.50569381_29191">
<para>If desired, the
device driver gathers IOA-specific information via MMIOs, by doing MMIOs to its IOA.</para>
</listitem>
<listitem>
<para>The OS or device driver calls <emphasis>ibm,slot-error-detail</emphasis> .
Any data captured in step #<xref linkend="dbdoclet.50569381_29191"/> is passed in the
call. Note that maximum amount of data will be captured in some cases only when the
<emphasis>ibm,slot-error-detail</emphasis> call is made with PE not in the MMIO Stopped
State (as it should be in step #<xref linkend="dbdoclet.50569381_59303"/>). </para>
<orderedlist numeration="loweralpha">
<listitem>
<para>If Max_Retries_Exceeded is true, then call <emphasis>ibm,slot-error-detail</emphasis>
with <emphasis> Function</emphasis> 2 (Permanent Error).</para>
</listitem>
<listitem>
<para>If Max_Retries_Exceeded is not true, then call <emphasis>ibm,slot-error-detail</emphasis>
with <emphasis>Function</emphasis> 1(Temporary Error).</para>
</listitem>
</orderedlist>
</listitem>
<listitem>
<para>The <emphasis>ibm,slot-error-detail</emphasis> RTAS call captures whatever
PCI config space registers it can between the configuration address passed in the call
and the system (PHB), and including at the configuration address and at the PHB, and
returns them along with the device specific data in an error log in the return information
from the call. This call may encounter another EEH event, in which case it returns what
information it can in the call, with a <emphasis>Status</emphasis> of 0 (Success).</para>
</listitem>
<listitem>
<para>The OS or device driver logs the log entry returned from the
<emphasis>ibm,slot-error-detail</emphasis> RTAS call.</para>
</listitem>
<listitem>
<para>If Max_Retries_Exceeded is not true, then the next step is PE Recovery,
otherwise stop and mark the IOA(s) in the PE as unusable.</para>
</listitem>
</orderedlist>
</section>
<section>
<title>PE Recovery</title>
<orderedlist>
<listitem>
<para>OS or device driver does a PE reset sequence. Note that this step is
required even if the PE was reset as a result of the initial EEH event, because
the error logging steps (for example, the <emphasis>ibm,configure-pe</emphasis>
or <emphasis> ibm,slot-error-detail</emphasis> calls) could have encountered
another EEH event. </para>
<orderedlist numeration="loweralpha">
<listitem>
<para>The device driver or OS calls <emphasis>ibm,set-slot-reset</emphasis> with
<emphasis>Function</emphasis> 1 or 3 to activate the reset.</para>
</listitem>
<listitem>
<para>The minimum reset active time is waited.</para>
</listitem>
<listitem>
<para>The device driver or OS calls <emphasis>ibm,set-slot-reset</emphasis> with
<emphasis>Function</emphasis> 0 to deactivate the reset.</para>
</listitem>
<listitem>
<para>The minimum reset inactive to first configuration cycles is waited.
If the IOA requires more than the standard PCI specified time, then wait that
longer time, instead.</para>
</listitem>
</orderedlist>
</listitem>
<listitem>
<para>The OS or device driver calls <emphasis>ibm,configure-pe.</emphasis>
<note><para>If an EEH event occurs as a result of probing
during the <emphasis>ibm,configure-pe</emphasis> RTAS call that results in a reset
of the PE, the PE will be returned in the PE state of 2. Software does not necessarily
need to check this on return from the call. The case where this occurs is expected to
be rare, and probably signals a non-transient error. In this case the software can
continue on with the recovery phase of the EEH processing, and will eventually hit
the same EEH event on further processing.</para></note>
</para>
</listitem>
<listitem>
<para>The device driver restores the PCI configuration spaces of the IOA(s) in the PE.</para>
</listitem>
<listitem>
<para>The device driver initializes the IOA for operations.</para>
</listitem>
</orderedlist>
</section>
</section>
</appendix>