migration(5) migration(5) NAME migration - dynamic memory migration DESCRIPTION This document describes the dynamic memory migration system available in Origin 2000/200 systems. Dynamic migration is not available on other Origin platforms. Manual or User migration is available to all Origin platforms (see migration(3)). Introduction Dynamic page migration is a mechanism that provides adaptive memory locality for applications running on the Origin 2000/200 systems. The Origin 2000/200 hardware implements an algorithm based on comparing remote memory access counters to a local memory access counter; when the difference between the numbers of remote and local accesses goes beyond a preset threshold, an interrupt is generated to inform the operating system that a physical memory page is currently experiencing excessive remote accesses. Within the interrupt handler the operating system makes a final decision whether to migrate the page or not. If it decides to migrate the page, the migration is executed immediately. The system may decide not to execute the migration due to enforcement of a migration control policy or due to lack of resources. Note that use of the dynamic memory migration feature will cause a substantial amount of kernel memory to be consumed to support this operation. Page migration can also be explicitly requested by users, and in addition, it is used to assist the memory coalescing algorithms for multiple page size support. Migration Modules The migration subsystem is composed of the following modules: - Detection Module. This module monitors memory accesses issued by nodes in the system to each physical memory page. In Origin 2000/200 systems this module is mostly implemented in hardware. This detection module informs the Migration Control Module that a page is experiencing excessive remote accesses via an interrupt sent to the page's home node. - Migration Engine Module. This module carries out data movement from a current physical memory page to a new page in the node issuing the remote accesses. - Migration Control Module. This module decides whether the page should be migrated or not, based on migration control policies, defined by parameters such as migration threshold, bounce detection and prevention, dampening factor, and others. - Migration Control Periodic Operations Module. This module executes all periodic operations needed for the Migration Control Module. - Memory Management Control Interface Module (MMCI Module). This module provides an interface for users to tune the migration policy associated with an address space. Migration Detection Module The basic goal of memory migration is to minimize memory access latency. In a NUMA system where local memory access latency is smaller then remote memory access latency, we can achieve this latency minimization goal by moving the data to the node where most memory references are going to be issued from. It would be great to be able to move data to the node where it is going to be needed right before it is referenced. Unfortunately, we cannot predict the future. However, common programs usually have some amount of temporal and spatial locality, which allows us to heuristically predict future behavior based on recent past behavior. The usual procedure used to predict future memory accesses to a page is to count the memory references to this page issued by each node in the system. If the accumulated number of remote references becomes considerably greater than the number of accumulated local references, then it may be beneficial to migrate the page to the remote node issuing the references, especially if this remote node will continue accessing this same page for a long time. Origin 2000/200 systems have counters that continuously monitor all memory accesses issued by each node in the system to each physical memory page. In a 64-node Origin 2000 (128 processors), we have 64 memory access counters for every 4-KB low level physical page (4 KB is the size of a low level physical page size; software page sizes start at 16KB for Origin systems). For every memory access, the counter associated with the node issuing the reference is incremented; at the same time, this counter is compared to the counter that keeps track of local accesses, and if the remote counter exceeds the local counter by a threshold, an interrupt is generated advising the Operating System about the existence of a page with excessive remote accesses. Upon reception of the interrupt, the Migration Control Module in the Operating System decides whether to migrate the page or not. The threshold that determines how large the difference between remote and local counters needs to be in order for the interrupt to be generated is stored in a per-node hardware register, which is initialized by the Migration Control Module. The default system threshold defined in /var/sysgen/mtune/numa by the tunable variables numa_migr_default_threshold and numa_migr_threshold_reference (see Migration Tunables below), and the threshold specified by users as a parameter of a migration policy (mmci(5)), are not directly stored into this register due to the fact that different pages on the same node may have different migration thresholds. These thresholds are used to initialize the reference counters when a page is initialized. Migration Engine Module This module transparently moves a page from one physical frame to another. The migration engine first verifies the availability of all resources needed to realize the migration of a page. If all resources are not available, the operation is cancelled. The data transfer operation is performed using a processor to move memory. Translation lookaside buffer (TLB) shootdowns are done using inter-processor interrupts. TLB shootdowns are needed in order to avoid the use of stale translations that may be pointing to the physical memory page that contained the data before migration took place. Normally, a TLB shootdown operation is performed by sending interrupts to all processors in the system with a TLB that may have stale translation entries. Migration Control Module This module decides whether a page should be migrated or not after receiving a notification (via an interrupt) from the Migration Detection Module alerting that a page is experiencing excessive remote accesses. This decision is based on applicable migration control policies and resource availability. The basic idea behind controlling migration is that it is not always a good idea to migrate a page when the memory reference counters are telling us that a page is experiencing excessive remote accesses; the page may be bouncing back and forth due to poor application behavior, the counters may have accumulated too much past knowledge, making them unfit to predict near future behavior, the destination node may have little free memory, or the path needed to do the migration may be too busy. The Migration Control Module applies a series of filters to a reference counter notification or migration request, as enumerated below. All tunables mentioned in this list are found in /var/sysgen/mtune/numa. Node Distance Filter This filter rejects all migration requests where the distance between the source and the destination is less than numa_migr_min_distance in /var/sysgen/mtune/numa. All rejected requests result in the page being frozen in order to prevent this request from being re-issued too soon. Memory Pressure Filter This filter rejects migration requests to nodes where physical memory is low. The threshold for low memory is defined by the tunable numa_migr_memory_low_threshold, which defines the minimum percentage of physical memory that needs to be available in order for a page to be migrated there. This filter can be enabled and disabled using the tunable numa_migr_memory_low_enabled. Traffic Control Filter Experimental filter intended to throttle migration down when the Craylink Interconnect traffic reaches peak levels. Experiments have shown that this filter is unnecessary for Origin 2000/200 systems. Bounce Control Filter Sometimes pages may start bouncing due to poor application behavior or simple page level false sharing. This filter detects and freezes bouncing pages. The detection is done by keeping a count of the number of migrations per page in a counter that is aged (periodically decremented by a system daemon). If the count ever goes above a threshold, it is considered to be bouncing and is then frozen. Frozen pages start melting immediately, so after a period of time, they are unfrozen and migratable again. Note the the melting procedure is gradual, not instantaneous. The bounce control filter relies on operations executed periodically by the Migration Control Periodic Operations Module described below, for a) aging of the migration counters and b) melting of frozen pages. The period of these bounce control periodic operations is defined by the tunable numa_migr_bounce_control_interval. The default value for this tunable is 0, which translates into a period such that 4 physical pages are operated on per tick (10[ms] interval). Freezing can be enabled and disabled using the tunable numa_migr_freeze_enabled, and the freezing threshold can be set using the tunable numa_migr_freeze_threshold. This threshold is specified as a percentage of the maximum effective freezing threshold value, which is 7 for Origin 2000/200 systems. Melting can be enabled and disabled using the tunable numa_migr_melt_enabled, and the melting threshold can be set using the tunable numa_migr_melt_threshold. The melting threshold is expressed as a percentage of the maximum effective melting threshold value, which is 7 for Origin 2000/200 systems. Migration Dampening Filter This filter minimizes the amount of migration due to quick temporary remote memory accesses, such as those that occur when caches are loaded from a cold state, or when they are reloaded with a new context. We implement this dampening flter using a per- page migration request counter that is incremented every time we receive a migration request interrupt, and aged (periodically decremented) by the Migration Control Periodic Operations Module. We migrate a page only if the counter reaches a value greater than some dampening threshold. This will happen only for applications that continuously generate remote accesses to the same page during some interval of time. If the application experiences just a short transitory sequence of remote accesses, it is very unlikely that the migration request counter will reach the threshold value. This filter can be enabled and disabled using the tunable numa_migr_dampening_enabled, and the migration request count threshold can be set using the tunable numa_migr_dampening_factor. The memory reference counters are re-initialized to their startup values after every reference counter interrupt. Migration Control Periodic Operations Module The Migration Control Module relies on several periodic operations. These operations are listed below: - Bounce Control Operations. Age migration counter for freezing and melting. _ Unpegging. Reset memory reference counters that have reached a saturation level. - Queue Control Operations. Age queued outstanding migration requests. Experimental, always disabled for production systems. - Traffic Control Operations. Sample the state of the Craylink interconnect and correspondingly adjust the per-node migration threshold. Experimental, always disabled for production systems. These operations are executed in a loop, triggered once every mem_tick_base_period, a tunable that defines the migration control periodic period in terms of system ticks (a system tick is equivalent to 10 [ms] on Origin systems running IRIX 6.5). This loop of operations may be enabled and disabled using the tunable mem_tick_enabled. If migration is enabled or users are allowed to use migration, this loop must be enabled. In order to minimize interference with user processes, we limit the number of pages operated on in a loop to a few pages, trying to limit the time used to less than 20 [us]. Administrators can adjust the time dedicated to these periodic operations via the following tunables: + mem_tick_base_period + numa_migr_unpegging_control_interval + numa_migr_traffic_control_interval + numa_migr_bounce_control_interval Description of Periodic Operations The following list describes the Bounce Control Periodic Operations in detail: Aging Migration Counters In order to detect bouncing we keep track of the number of migrations per page using a counter that is periodically decremented (aged). When the counter goes beyond a threshold, we consider the page to be bouncing and freeze it. Aging Migration Request Counters In order to avoid excessive migration or bouncing due to short, transitory remote memry access sequences we have a migration dampening filter that needs to count several migration requests within a limited period of time before it actually lets a real page migration take place. The time factor is introduced in the filter by aging the migration request counters. Melting Frozen Pages When a page is frozen we want to eventually unfreeze it so that it becomes migratable again. This behavior is desirable because the events that cause a page to be frozen are usually temporary. As part of the periodic operations, we increment a counter per page to keep track of how long the page has been frozen. When the counter goes above a threshold, meaning that the page has been frozen for a sufficient time, we unfreeze the page, thereby making it migratable again. The Unpegging Periodic Operation consists of scanning all the memory reference counters looking for those counters that have pegged due to reaching their maximum count. When a pegged counter is found, all counters associated with that page are restarted. The current implementation of the Migration Control module does not execute Queue Control Periodic Operations or Traffic Control Periodic Operations. Page Migration Tunables This is a list of all the memory migration tunables in /var/sysgen/mtune/numa that define the default memory migration policy used by the system. * numa_migr_default_mode. This tunable defines the default migration mode. It can take the following values: 0: MIGR_DEFMODE_DISABLED Migration is completely disabled, users cannot use migration. 1: MIGR_DEFMODE_ENABLED Migration is always enabled, users cannot disable migration. 2: MIGR_DEFMODE_NORMOFF Migration is normally off, users can enable migration for an application. This is the default setting. 3: MIGR_DEFMODE_NORMON Migration is normally on, users can disable migration for an application. 4: MIGR_DEFMODE_LIMITED Migration is normally off for machine configurations with a maximum Numalink distance less than numa_migr_min_maxradius (defined below). Migration is normally on otherwise. Users can override this mode. * numa_migr_default_threshold. This threshold defines the minimum difference between the local and any remote counter needed to generate a migration request interrupt. if ((remote_counter - local_counter) >= ((numa_migr_threshold_reference_value / 100) * numa_migr_default_threshold)) { send_migration_request_intr(); } * numa_migr_threshold_reference. This parameter defines the pegging value for the memory reference counters. It is machine configuration dependent. For Origin 2000/200 systems, it can take the following values: 0: MIGR_THRESHREF_STANDARD = Threshold reference is 2048 (11 bit counters) Maximum threshold allowed for systems with STANDARD DIMMS. This is the default. 1: MIGR_THRESHREF_PREMIUM = Threshold reference is 524288 (19-bit counters) Maximum threshold allowed for systems with *all* PREMIUM SIMMS. * numa_migr_min_maxradius. This tunable is used if numa_migr_default_mode has been set to mode 4 (MIGR_DEFMODE_LIMITED). For this mode, migration is normally off for machine configurations with a maximum Craylink distance less than numa_migr_min_maxradius Migration is normally on otherwise. * numa_migr_auto_migr_mech. This tunable defines the migration execution mode for memory reference counter triggered migrations: 0 for immediate and 1 for delayed. Only the Immediate Mode (0) is currently available. * numa_migr_user_migr_mech. This tunables defines the migration execution mode for user requested migrations: 0 for immediate and 1 for delayed. Only the Immediate Mode (0) is currently available. * numa_migr_coaldmigr_mech . This tunables defines the migration execution mode for memory coalescing migrations: 0 for immediate and 1 for delayed. Only the Immediate Mode (0) is currently available. * numa_refcnt_default_mode. Extended counters are used in application profiling (see refcnt(5)) and to control automatic memory migration. This tunable defines the default extended reference counter mode. It can take the following values: 0: REFCNT_DEFMODE_DISABLED Extended reference counters are disabled, users cannot access the extended reference counters (refcnt(5)). In this case automatic memory migration will not be performed regardless of any other settings. 1: REFCNT_DEFMODE_ENABLED Extended reference counters are always enabled, users cannot disable them. 2: REFCNT_DEFMODE_NORMOFF Extended reference counters are normally disabled, users can disable or enable the counters for an application. 3: REFCNT_DEFMODE_NORMON Extended reference counters are normally enabled, users can disable or enable the counters for an application. Note that whenever the reference counters are enabled, a substantial amount of kernel memory is allocated internally to support the reference counter operations. * numa_refcnt_overflow_threshold This tunable defines the count at which the hardware reference counters notify the operating system of a counter overflow in order for the count to be transferred into the (software) extended reference counters. It is expresses as a percentage of the threshold reference value defined by numa_migr_threshold_reference. * numa_migr_min_distance Minimum distance required by the Node Distance Filter in order to accept a migration request. * numa_migr_memory_low_enabled Enable or disable the Memory Pressure Filter. * numa_migr_memory_low_threshold Threshold at which the Memory Pressure Filter starts rejecting migration requests to a node. This threshold is expressed as a percentage of the total amount of physical memory in a node. * numa_migr_freeze_enabled Enable or disable the freezing operation in the Bounce Control Filter. * numa_migr_freeze_threshold Threshold at which a page is frozen. This tunable is expressed as a percent of the maximum count supported by the migration counters (7 for Origin 2000). * numa_migr_melt_enabled Enable or disable the melting operation in the Bounce Control Filter. * numa_migr_melt_threshold When a migration counter goes below this threshold a page is unfrozen. This tunable is expressed as a percent of the maximum count supported by the migration counters (7 for Origin 2000). * numa_migr_bounce_control_interval This tunable defines the period for the loop that ages the migration counters and the dampening counters. It is expressed in terms of number of mem_ticks. The mem_tick unit is defined by mem_tick_base_period below. If it is set to 0, we process 4 pages per mem_tick. In this case, the actual period depends on the amount of physical memory present in a node. * numa_migr_dampening_enabled Enable or disable migration dampening. * numa_migr_dampening_factor The number of migration requests needed for a page before migration is actually executed. It is expressed as a percentage of the maximum count supported by the migration-request counters (3 for Origin 2000). * mem_tick_enabled Enable or disabled the loop that executes the Migration Control Periodic Operation. * mem_tick_base_period Number of 10[ms] system ticks in one mem_tick. * numa_migr_unpegging_control_enabled Enable or disable the unpegging periodic operation * numa_migr_unpegging_control_interval This tunable defines the period for the loop that unpegs the hardware memory reference counters. It is expressed in terms of number of mem_ticks. The mem_tick unit is defined by mem_tick_base_period above. If it is set to 0, we process 8 pages per mem_tick. In this case, the actual period depends on the amount of physical memory present in a node. * numa_migr_unpegging_control_threshold Hardware memory reference counter value at which we consider the counter to be pegged. It is expressed as a percent of the maximum count defined by numa_migr_threshold_reference. * numa_migr_traffic_control_enabled Enable or disable the Traffic Control Filter. This is an experimental module, and therefore it should always be disabled. * numa_migr_traffic_control_interval Traffic control period. Experimental module. * numa_migr_traffic_control_threshold Traffic control threshold for kicking the batch migration of enqueued migration requests. Experimental module. FILES /var/sysgen/mtune/numa SEE ALSO numa(5), mtune(4), refcnt(5), mmci(5), migration(3), nstats(1), sn(1), topology(1). Page 10