Most of this section is derived from Mike Christie's conversations with
an LSI engineer and a SUN engineer, and from his work on the driver
described in
http://www-1.ibm.com/support/docview.wss?rs=0&q1=linux+rdac&uid=psg1MIGR-54973&loc=en_US&cs=utf-8&cc=us&lang=en

(Stefan Bader) dm-multipath is the best place to support multipathing
because it is flexible and allows combining paths that do not share the
same transport.

(Mike Christie) This section is a TODO list detailing the DM, SCSI and
Block layer changes necessary to produce a multipath implementation for
2.6 that vendors can port their existing SCSI multipath drivers to. It
starts by describing the current DM-multipath implementation in the
Unstable DM (udm) patchset, then lists the items that at this point
appear to be needed for a more reliable and robust multipath framework.

------------
DM-multipath
------------

Device Assembly
---------------

dm-multipath performs device assembly in two stages. First, a userspace
tool performs device discovery and an initial assembly, in which it
organizes paths into dm-multipath Priority Groups (PGs) based on
characteristics such as controller states, responses to certain
commands, or other user-specified predefined groupings.

Next, the kernel component receives the device information through a DM
table. The arguments include the PG definitions, the Path Selector (PS)
type for each group, and Path information. It then uses this
information to construct dm-multipath's representation of the PGs, PSs
and Paths, and allows vendor-specific Path Selectors to initialize
themselves.

Managing PGs and IO
-------------------

As a simple example, where the storage device has two single-ported
controllers (one active, one passive) and the host system has two
single-ported HBAs, the paths to the active controller could be placed
in PG0 and the paths to the passive controller could be placed in PG1.
The paths in PG0 are then sent IO according to an algorithm defined in
the PS module.
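
To make the PG0/PG1 example concrete, here is a minimal Python sketch of
the grouping. It is only an illustration, not dm-multipath code: the
names Path, PriorityGroup and RoundRobinSelector are hypothetical, it
assumes each HBA can reach both controllers (four paths in total), and
round-robin stands in for whatever algorithm the PS module defines.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class Path:
    hba: str          # host adapter the path leaves through
    controller: str   # storage controller the path arrives at


@dataclass
class PriorityGroup:
    name: str
    paths: list


class RoundRobinSelector:
    """Hypothetical Path Selector: hands out the group's paths in turn."""

    def __init__(self, group):
        self._paths = cycle(group.paths)

    def select_path(self):
        return next(self._paths)


hbas = ["hba0", "hba1"]
# Assumption: every HBA reaches both controllers, giving four paths.
all_paths = [Path(h, c) for c in ("active", "passive") for h in hbas]

# Paths to the active controller go in PG0, the rest in PG1.
pg0 = PriorityGroup("PG0", [p for p in all_paths if p.controller == "active"])
pg1 = PriorityGroup("PG1", [p for p in all_paths if p.controller == "passive"])

# IO is spread over PG0 only; PG1 stays idle until a failover.
selector = RoundRobinSelector(pg0)
sent = [selector.select_path() for _ in range(4)]
```

Here `sent` alternates between the two HBAs while always targeting the
active controller; PG1 is never touched by the selector.
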
When the paths in PG0 have failed, the core dm-multipath code blocks
incoming IO, queues failed IO, and calls the PS initialization function
for PG1, giving that group of paths a chance to initialize themselves.

For failback, dm-multipath relies on the userspace path tester, which
sends SCSI TUR (Test Unit Ready) commands or performs a read of sector
zero. Currently, the multipath-tools must swap tables to initiate a
failover or failback, but an IOCTL interface is in the works that will
allow the Path and PG states to be manually modified from userspace.

----
TODO
----

DM-multipath ISSUES
-------------------

0. dm-multipath needs more complicated heuristics to determine when to
   failover and failback to paths and groups. For example, maybe we
   should only fail a path once it has had N failures within a certain
   time frame (today we fail the path on the first error). Or, if we
   are using a lower priority group and X paths are reactivated in a
   higher priority group, do we failback at that point, or is it better
   to wait for X+Y paths to be reactivated?

   (Tim Witham) Maybe it should be configurable? I can see cases where
   you have a tightly controlled environment and you know that if it
   fails once then you want to move over. And then again you could have
   a more normal environment where you get transient errors and you
   want to wait.

   (Mike Anderson) It should be configurable / replaceable to adjust to
   different storage ecosystems or pre-planned events.

   - There may be maintenance events (e.g., firmware updates) where you
     want to alter the normal failover policy, as all connectivity to a
     storage device may be interrupted for some small amount of time
     and you want IO to just suspend for this time interval and not
     have a failover start.
   - While optical storage networks have improved with newer laser
     technology, there are still cases where path testing is
     ineffective in determining the health of a path. Excessive
     failover / failback could create performance problems.
     So one may want a failover / failback policy that combines
     heuristics with a credit-based, decaying failback model, so that
     these intermittent paths are not used.

1. Vendor plugin? Currently the PS serves as a place where vendors can
   place HW-specific code such as failover commands. This unfortunately
   requires each vendor to reproduce a path selection algorithm for
   their module. There are patches that move the HW specifics to the
   Priority-Group data structure. Should this be moved to userspace
   (the previous maintainer wanted it in the kernel), or do we need a
   new abstraction?

   (Stefan Bader) It is not only that the code has to be duplicated. I
   could imagine that the method to reconfigure a device in a failover
   case is confidential (maybe the path selection method, too). So
   vendors would have to ship binary-only modules. While the
   implementation is simple, maintenance/service of such modules is a
   lot of work, since the vendor then has to provide a matching module
   for every distro or even every kernel release of a distro. Doing
   failover processing in userspace would prevent such problems, but
   time and memory pressure become more critical. (This leads to the
   question of how to prevent new I/O if lower layers know it can't be
   processed, which is not only a multipath issue.)

   (Mike Christie) I actually attempted this. When it is determined
   that a failover to another group is needed, the code today blocks
   new IO by just putting the requester to sleep, and then internally
   queues outstanding failed IO. At this point a userspace agent that
   listens for dm events (I broke this in the current patchset, just
   FYI if you are trying this at home) can send the failover command
   and, when it completes, send the results down to dm-multipath
   through the ioctl, allowing dm-multipath to continue using whatever
   group was activated in userspace. Ehh, it is not so bad. I still
   don't care which way we go, so you guys can argue about that here or
   at OLS.
   As you said, for scenarios such as OOM with swap on a multipath
   disk, you have to make sure that userspace and the kernel
   systems/interfaces it uses can allocate memory.

2. The Round Robin algorithm is not optimal. This is not in reference
   to the bio-based vs request-based discussions. Eventually every dm
   device will round-robin onto the same path, and for that interval
   (1000 bios' worth of processing time) global throughput drops.

   (Stefan Bader) The path selector should be called every time and
   make the decision to stay on a path by itself. If I understand
   correctly, NUMA would require this, as would a simple load balancer.
   On the other hand, I think I remember that some mp implementations
   reduced the complexity of the selection method, since selection took
   more time than was lost by occasionally not using the optimal
   path....

   (Mike Christie) The locking issues need to be thought about when
   doing path selection work for every bio. It does become a factor. If
   you benchmark the current multipath patchset against the older one
   (which only went to the PS every 1000 bios), the older version seems
   to be a little better. Another problem with the PS is that for small
   IOs, performance drops compared with just opening /dev/sdX and using
   that single path. If you just continue on the same path for maybe
   5000-10000 bios instead of 1000, the performance drop is not seen in
   this case. (This is for sequential IO tests.)

3. Finish ioctl support.

SCSI ISSUES
-----------

3. The vendor module does not receive the ASC and ASCQ values it may
   have previously used to determine special types of transient errors.
   For example, when performing online firmware upgrades the IBM Fastt
   may return values indicating the device is quiescing. A
   vendor-specific SCSI-ML module to decode and map this to an
   appropriate value may be possible, or we may try to pass the sense
   up through the request and bio layers to DM?
   Or, as Christophe has suggested, dm-multipath is built to
   failover/failback, so why not utilize this and simply fail over,
   then later fail back using the path tester?

   (Stefan Bader) I would be on Christophe's side in that case. It
   isn't that important for the generic layer to understand what
   exactly has happened. It is sufficient to know whether the error
   occurred on the device, the transport or the adapter, and maybe
   whether the driver thinks it is permanent or temporary. Generally I
   think it would be a good approach to provide a flexible interface in
   dm for interacting with targets (as well as load balancing/round
   robin code and simple failover without intervention). Vendors that
   require special interaction could configure dm to trigger events on
   failover and let userspace do the setup before resuming I/O.

4. Error Handler - SCSI-ML's error handler halts all IO on the host
   while in error recovery, but DM does not know the HBA state. There
   is a discussion on linux-scsi on how to modify the error handler.

5. (Block, SCSI and DM layers) Informative error values and
   fine-grained FAILFAST. SCSI returns 0 for I/O error and non-zero for
   uptodate. The bio layer then returns an -Exxx value. This does not
   convey enough information to make optimal IO resubmission decisions
   (all errors today are just sent to another path). Fast-failing all
   errors immediately upwards is also not always optimal. dm-multipath
   is better at handling transport errors, but most device errors could
   possibly be handled better at the SCSI/LLD level.
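
As a rough illustration of what more informative error values could
enable, the Python sketch below maps an error's origin (device,
transport or adapter) and a permanent/temporary flag to a resubmission
decision. Everything here is hypothetical: the Origin, Action and
IoError names and the policy in decide() are assumptions made for
illustration, not an existing kernel interface or an agreed design.

```python
from dataclasses import dataclass
from enum import Enum, auto


class Origin(Enum):
    DEVICE = auto()
    TRANSPORT = auto()
    ADAPTER = auto()


class Action(Enum):
    LEAVE_TO_LOWER_LAYER = auto()  # let SCSI/LLD deal with it, no path switch
    RETRY_ON_OTHER_PATH = auto()   # resubmit elsewhere, keep this path usable
    FAIL_PATH_AND_RETRY = auto()   # mark the path failed, resubmit elsewhere


@dataclass(frozen=True)
class IoError:
    origin: Origin
    permanent: bool   # the lower layer's guess: permanent or temporary


def decide(err):
    """Assumed policy: only transport/adapter errors cause a path switch."""
    if err.origin is Origin.DEVICE:
        # Another path reaches the same device, so switching may not help.
        return Action.LEAVE_TO_LOWER_LAYER
    if err.permanent:
        return Action.FAIL_PATH_AND_RETRY
    return Action.RETRY_ON_OTHER_PATH


transient_link_error = IoError(Origin.TRANSPORT, permanent=False)
dead_hba = IoError(Origin.ADAPTER, permanent=True)
media_error = IoError(Origin.DEVICE, permanent=True)
decisions = [decide(e) for e in (transient_link_error, dead_hba, media_error)]
```

With only a single -Exxx value available, all three of these cases
collapse into "send it to another path", which is the behaviour item 5
describes today.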