mm: manual page migration-rc1 -- overview
From: Ray Bryant <raybry@sgi.com>
To: Hirokazu Takahashi <taka@valinux.co.jp>, Andi Kleen <ak@suse.de>, Dave Hansen <haveblue@us.ibm.com>, Marcello Tosatti <marcello@cyclades.com>
Subject: [PATCH_FOR_REVIEW 2.6.12-rc1 0/3] mm: manual page migration-rc1 -- overview
Date: Tue, 5 Apr 2005 21:16:33 -0700 (PDT)
Cc: Ray Bryant <raybry@sgi.com>, Ray Bryant <raybry@austin.rr.com>, linux-mm <linux-mm@kvack.org>

Summary
-------

This set of patches is an initial implementation (hence the -rc1) of
the manual page migration facility that I proposed in February and
that was discussed on the linux-mm mailing list.  Rationale for the
manual page migration facility, etc, can be obtained from that thread,
available at the following URL:

http://marc.theaimsgroup.com/?l=linux-mm&m=1108179079...

and subsequently, at:

http://marc.theaimsgroup.com/?l=linux-mm&m=1108207168...

and the subsequent messages on that thread.  This material is
included, in a condensed and updated form, under "Background", below.

The implementation meets the interface that Andi Kleen and I agreed
on at that time, AFAIK.

This patch depends on the page migration patches from the Memory
Hotplug project.  This particular patchset is built on top of:

http://www.sr71.net/patches/2.6.12/2.6.12-rc1-mhp2/page_m...

but it may apply on subsequent page migration patches as well.  Thus,
in order for this patch to be meaningfully considered for merging,
the above patch needs to be merged first.  Alternatively, we may
decide to merge this patch (or pieces thereof) with the above patch.
That is a decision for the Memory Hotplug Project.  Either approach
is acceptable as far as I am concerned provided that the
functionality is eventually merged.  :-)

Interface Description
---------------------

After much discussion on the linux-mm mailing list, we have agreed
to use the following kernel interface:

    migrate_pages(pid, count, old_nodes, new_nodes);

The arguments are described as follows:

    pid       -- process id of the process to be migrated
    count     -- number of entries in the old_nodes, new_nodes arrays
    old_nodes -- array of short
    new_nodes -- array of short

The way the old_nodes[] and new_nodes[] arguments are interpreted is
as follows:  each migratable page (for a definition of that term,
see below) that is found on node "old_nodes[i]" is migrated to
"new_nodes[i]".
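
The mapping rule above can be read as a simple table lookup.  The
following is only a sketch in user-space C, not code from the
patchset; the helper name migration_target() is hypothetical and is
introduced here purely for illustration.

```c
#include <stddef.h>

/*
 * Hypothetical helper (not part of the patchset): return the node that a
 * migratable page currently on page_node would be moved to, or -1 if
 * page_node does not appear in old_nodes[] and the page stays in place.
 */
static short migration_target(short page_node, size_t count,
                              const short *old_nodes, const short *new_nodes)
{
    for (size_t i = 0; i < count; i++) {
        /* old_nodes[i] and new_nodes[i] form one (source, destination) pair. */
        if (old_nodes[i] == page_node)
            return new_nodes[i];
    }
    return -1;
}
```

For example, with old_nodes = {5, 6, 7, 8} and new_nodes = {1, 2, 3, 4},
a page found on node 6 would be moved to node 2, and a page found on
node 9 would be left alone.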

A page is migratable unless one of the following conditions is true:

(1)  The page is part of a mapped file and that file has the
     extended attribute "system.migration" set to "none".  In this
     case, none of the pages of the mapped file are migratable.

(2)  The page is part of a mapped file and that file has the
     extended attribute "system.migration" set to "libr", and the
     page is a shared page.  (Any page that has been written by the
     process is considered private data associated with the process
     and will be migrated.)

Note:  At the present time we only have a patch for XFS to support
the extended system attribute "migration".  Until we agree that this
is the correct approach, there is no point in creating patches for
other file systems.  See "Issues", below.

For this system call, the set of nodes specified by the old_nodes
and new_nodes lists must be disjoint.  It is the responsibility of a
user space library to convert a migration where the old_nodes and
new_nodes sets are not disjoint into a series of smaller migrations
for which the sets are disjoint.  The system call will return with
-EINVAL if the old_nodes and new_nodes sets are not disjoint.

The system call itself does not support a gather mode (previously we
had talked about using the special value -1 for old_node[0] to
indicate that all migratable pages found would be migrated to
new_node[0]).  Instead this functionality is supported by the user
space library.

Interaction of memory policies with migrate_pages()
---------------------------------------------------

As part of the execution of this system call, memory policy
structures are updated as they are encountered and these structures
are modified as needed to reflect the migration.  For example, if
the memory policy is MPOL_BIND and the bound node is found at
old_nodes[i], then the bound node is replaced by new_nodes[i].
(To preserve atomicity, actually what happens is that a new memory
policy structure is created with the new bound node and a pointer to
the new policy is stored in the process structure or vma struct; the
old mempolicy structure is released.)

If the memory policy is MPOL_DEFAULT, then (obviously) no update is
needed.  However, if the user wishes new allocations to occur on
new_nodes, then the process must be migrated to one of the cpus
associated with one of the new_nodes before the migrate_pages()
system call is issued; otherwise allocations can continue to occur
on the old_nodes after the migrate_pages() system call returns.

Special Considerations for Migrating non-Suspended processes
------------------------------------------------------------

While our usage of this system call assumes that the migrated
process has been suspended (see "Background", below), nothing in the
implementation specifically requires the process to be suspended.
(The page migration patch from the memory hotplug project supports
migration without suspending the process).

However, if the process being migrated is actively allocating pages
at the same time that migrate_pages() is executed, there are certain
edge conditions that can result in pages still remaining on the
old_nodes after the migrate_pages() system call returns.  This is
because the scan that looks for pages to be migrated is not atomic
with respect to page allocation and any page allocated in a vma
after the vma has been scanned will not be seen by migrate_pages().

For processes (or vma's) that use the memory policy MPOL_DEFAULT,
this problem can normally be overcome by first migrating the process
to a CPU associated with one of the new_nodes before calling
migrate_pages().  This can either be done by using
sched_setaffinity() or using cpusets.
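
As an illustration of the affinity step just described, the sketch
below pins a process to a caller-supplied list of CPUs with the
existing sched_setaffinity() call.  The helper name pin_to_cpus() is
hypothetical, and the sketch assumes the caller already knows which
CPUs belong to the destination nodes; it does not look that up.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stddef.h>
#include <sys/types.h>

/*
 * Hypothetical helper (not part of the patchset): restrict `pid` to the
 * CPUs listed in cpus[].  A pid of 0 means the calling thread.  Returns 0
 * on success, or -1 with errno set on failure.
 */
static int pin_to_cpus(pid_t pid, const int *cpus, size_t ncpus)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    for (size_t i = 0; i < ncpus; i++) {
        /* Reject CPU numbers that cannot be represented in a cpu_set_t. */
        if (cpus[i] < 0 || cpus[i] >= CPU_SETSIZE) {
            errno = EINVAL;
            return -1;
        }
        CPU_SET(cpus[i], &set);
    }
    return sched_setaffinity(pid, sizeof(set), &set);
}
```

A batch manager would call something like this for each process of
the job, with the CPUs of the new nodes, before issuing
migrate_pages().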

If the per process mempolicy or a vma mempolicy is other than
MPOL_DEFAULT, then, since the policy is updated before the process
(or vma) is scanned, in most cases no pages can be allocated on
old_nodes while the scan is in progress and no pages should be left
over on the old nodes.

There is one special case, however.  MPOL_INTERLEAVE uses a per
process variable (current->il_next) to specify which node is the
next node to allocate pages from.  This variable is updated after
each allocation and is separate from the mempolicy.  While updating
of mempolicies is atomic (a pointer to the new policy is stored in
the process structure or vma) there is no way to also atomically
update current->il_next.  While current->il_next is updated by the
migrate_pages() system call, if needed, this update is inherently
racy and if the process is not suspended before it is migrated,
there is no way to guarantee that one (or more?) pages won't be
allocated on some old_nodes at the same time that the
migrate_pages() system call is executing.  There appears to be no
way to fix this with the current mempolicy implementation.

Interaction with cpusets
------------------------

On a cpusets enabled system, additional checking is performed to
make sure that the pid specified is allowed to allocate pages on
each of the new nodes.

In addition, normal rules of memory allocation for cpusets require
that the process that invokes the migrate_pages() call is able to
allocate pages on each of the new nodes.  This is required because
the new pages allocated on the new nodes will be allocated using the
cpuset mems_allowed of the current process.

For our intended use of this system call, this restriction is not a
significant limitation, since the process issuing the
migrate_pages() system call will normally be a batch manager of some
kind that is managing job allocation to a number of cpusets.
The batch manager will normally be running in a cpuset that is a
parent cpuset of the managed cpusets; hence the batch manager will
have the necessary permissions to allocate pages in each of its
managed cpusets.

Using Extended Attributes to Control Migration
----------------------------------------------

Alternatives to using extended attributes to control page migration
have been proposed, e. g. fixing the dynamic loader so it will mark
libraries as such when they are mapped, thus requiring no file
system changes.  For files that should not be migrated, the proposal
would be to add a special mmap() flag (e. g. NOT_MIGRATABLE), and
require a trivial application to keep the file mmap()ed for as long
as it needs to be marked not-migratable.

This needs to be discussed further and a resolution reached.  The
current patchset implements the extended attribute approach for the
XFS file system.

Description of the patches in this patchset
-------------------------------------------

Patch 1: nathan_scott_extended_attributes-rc1.patch
    This patch, due to Nathan Scott at SGI, adds support to XFS for
    the system.migration extended attribute.

Patch 2: add-node_map-arg-to-try_to_migrate_pages-rc1.patch
    This patch adds an additional argument to
    try_to_migrate_pages().  The additional argument is of type
    short * and is named node_map.  If node_map is NULL, then
    try_to_migrate_pages() works as it used to.  If node_map is
    non-NULL, then it must point to an array of size MAX_NUMNODES.
    node_map[i] is either -1 (if pages found on node "i" are not to
    be migrated), or the new node number if pages on node "i" are to
    be migrated.

Patch 3: add-sys_migrate_pages-rc1.patch
    This is the patch that adds the migrate_pages() system call.
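
To make the relationship between the system call arguments and the
node_map argument of Patch 2 concrete, here is a rough user-space
sketch of how old_nodes[]/new_nodes[] could be turned into a node_map
array while enforcing the disjointness rule from the interface
description.  It is not taken from the patches.  The helper name
build_node_map(), the value chosen for MAX_NUMNODES, and the rejection
of a source node listed twice are all assumptions of this sketch.

```c
#include <errno.h>
#include <stddef.h>

/* Assumed value for this sketch only; the kernel's value depends on config. */
#define MAX_NUMNODES 64

/*
 * Hypothetical helper: fill node_map[] so that node_map[n] is the
 * destination for pages on node n, or -1 if node n is not migrated.
 * Returns 0 on success or -EINVAL if a node number is out of range, if the
 * old and new sets overlap, or (an assumption of this sketch) if a source
 * node is listed twice.  On failure node_map[] may be partially filled.
 */
static int build_node_map(size_t count, const short *old_nodes,
                          const short *new_nodes,
                          short node_map[MAX_NUMNODES])
{
    unsigned char is_new[MAX_NUMNODES] = {0};

    for (size_t n = 0; n < MAX_NUMNODES; n++)
        node_map[n] = -1;

    /* First pass: range-check every entry and record destination nodes. */
    for (size_t i = 0; i < count; i++) {
        if (old_nodes[i] < 0 || old_nodes[i] >= MAX_NUMNODES ||
            new_nodes[i] < 0 || new_nodes[i] >= MAX_NUMNODES)
            return -EINVAL;
        is_new[new_nodes[i]] = 1;
    }

    /* Second pass: fill the map, rejecting overlap and duplicate sources. */
    for (size_t i = 0; i < count; i++) {
        short from = old_nodes[i];

        if (is_new[from] || node_map[from] != -1)
            return -EINVAL;
        node_map[from] = new_nodes[i];
    }
    return 0;
}
```

With old_nodes = {5, 6, 7, 8} and new_nodes = {1, 2, 3, 4} this yields
node_map[5] = 1 through node_map[8] = 4 and -1 everywhere else, while
old_nodes = {1, 2} with new_nodes = {2, 3} is rejected because node 2
appears in both sets.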

Issues to be resolved:
----------------------

Here is a list (probably not comprehensive) of the issues that need
to be resolved:

(1)  Resolve whether the extended attribute approach for controlling
     which files are migrated is acceptable, and if not what the
     alternative approach should be.  The current patch includes the
     function:

         is_mapped_file_migratable()

     and any changes in this area should be confined to rewriting
     that function.

(2)  At the moment, there is no access protection checking built
     into the extended attributes implementation.  Given the
     discussion above, we propose to wait until the above is
     resolved before completing this part of the implementation.

(3)  We haven't done the extended attribute implementation for other
     file systems, for reasons similar to those of (2).

(4)  This implementation has chosen (arbitrarily) to use system call
     number 1279 as the system call number for sys_migrate_pages().
     Obviously, this system call number will need to be assigned.

(5)  We haven't resolved the "permissions" model -- i. e. which
     processes can migrate which threads.  Here are two
     possibilities:

     (a)  Only root processes are able to call migrate_pages().
          (Equivalently, we could define a CAP_MIGRATION capability
          and require the calling process to have that capability.)

     (b)  A process is allowed to call migrate_pages(pid,...) for
          any pid that the process could signal.

(6)  As part of the discussion with Andi Kleen, we agreed to provide
     some memory migration support under MPOL_MF_STRICT.  Currently,
     if one calls mbind() with the flag MPOL_MF_STRICT set, and
     pages are found that don't follow the memory policy, then the
     mbind() will return -EIO.  Andi would like to be able to cause
     those pages to be migrated to the correct nodes.  This feature
     is not yet part of this patchset.

Background
----------

The purpose of this set of patches is to introduce the necessary
kernel infrastructure to support "manual page migration".
That phrase is intended to describe a facility whereby some user
program (most likely a batch scheduler) is given the responsibility
of managing where jobs run on a large NUMA system.  If it turns out
that a job needs to be run on a different set of nodes from where it
is running now, then that user program would invoke this facility to
move the job to the new set of nodes.

We use the word "manual" here to indicate that the facility is
invoked in a way that the kernel is told where to move things; we
distinguish this approach from "automatic page migration" facilities
which have been proposed in the past.  To us, "automatic page
migration" implies using hardware counters to determine where pages
should reside and having the O/S automatically move misplaced pages.
The utility of such facilities, for example on IRIX, has been mixed,
and we are not currently proposing such a facility for Linux.

The normal sequence of events would be as follows:  A job is running
on, say, nodes 5-8, and a higher priority job arrives and the only
place it can be run, for whatever reason, is nodes 5-8.  Then the
scheduler would suspend the processes of the existing job (by, for
example, sending them a SIGSTOP) and start the new job on those
nodes.  At some point in the future, other nodes become available
for use, and at this point the batch scheduler would invoke the
manual page migration facility to move the processes of the
suspended job from nodes 5-8 to the new set of nodes.

Note that not all of the pages of all of the processes will need to
(or should) be moved.  For example, pages of shared libraries are
likely to be shared by many processes in the system; these pages
should not be moved merely because a few processes using these
libraries have been migrated.  As discussed above, we use the
extended attribute system.migration with value "lib" to identify
such files.
If a shared library file does not have this attribute set, or the
shared library is stored in a file system that does not support
extended attributes (e. g. NFS), then the entire shared library will
be migrated.

--
Best Regards,
Ray
-----------------------------------------------
Ray Bryant                       raybry@sgi.com
The box said: "Requires Windows 98 or better",
           so I installed Linux.
-----------------------------------------------