DPLACE(1)

NAME
     dplace - a NUMA memory placement tool

SYNOPSIS
     dplace [-place placement_file] [-data_pagesize n-bytes]
            [-data_lpage_wait [off|on|n]] [-stack_pagesize n-bytes]
            [-stack_lpage_wait [off|on|n]] [-text_pagesize n-bytes]
            [-text_lpage_wait [off|on|n]] [-migration [off|on|threshold]]
            [-migration_level threshold] [-propagate] [-mustrun]
            [-v[erbose]] program [program-arguments]

DESCRIPTION
     The given program is executed after placement policies are set up
     according to the command line arguments and the specifications
     described in placement_file.

OPTIONS
     -place placement_file
          Placement information is read from placement_file. If this
          argument is omitted, no input file is read. See dplace(5) for
          the correct placement file format.

     -data_pagesize n-bytes
          Data and heap pages will be of size n-bytes. Valid page sizes
          are 16k multiplied by a non-negative integer power of 4, up to
          a maximum size of 16m; that is, 16k, 64k, 256k, 1m, 4m, and
          16m.

     -data_lpage_wait [off|on|n]
          Normal behavior in the event of a large page shortage is to
          continue running using smaller pages instead. If this option is
          given a value n, the process will wait n seconds (or, if the
          option is on, one second) for a large page to become available
          for use by the data segment. This wait occurs for each page
          fault on a large page in the data segment that is not readily
          available. Setting the -data_lpage_wait option also changes the
          fallback policy for the data segment from FallbackDefault to
          FallbackLargepage. This policy looks on other nodes for large
          pages when the requesting node has exhausted its free large
          pages; if that is not successful, it searches for smaller page
          sizes on the requesting node, then on other nodes. The highest
          wait value allowed is 65535.

     -stack_pagesize n-bytes
          Stack pages will be of size n-bytes. Valid page sizes are 16k
          multiplied by a non-negative integer power of 4, up to a
          maximum size of 16m.
          That is, 16k, 64k, 256k, 1m, 4m, and 16m.

     -stack_lpage_wait [off|on|n]
          Normal behavior in the event of a large page shortage is to
          continue running using smaller pages instead. If this option is
          given a value n, the process will wait n seconds (or, if the
          option is on, one second) for a large page to become available
          for use by the stack segment. This wait occurs for each page
          fault on a large page in the stack segment that is not readily
          available. Setting the -stack_lpage_wait option also changes
          the fallback policy for the stack segment from FallbackDefault
          to FallbackLargepage. This policy looks on other nodes for
          large pages when the requesting node has exhausted its free
          large pages; if that is not successful, it searches for smaller
          page sizes on the requesting node, then on other nodes. The
          highest wait value allowed is 65535.

     -text_pagesize n-bytes
          Currently IRIX does not support large pages in the text
          segment, so this option has no effect.

     -text_lpage_wait [off|on|n]
          Currently IRIX does not support large pages in the text
          segment, so this option has no effect.

     -migration [off|on|threshold]
          Page migration is turned on or off. If a threshold is
          specified, page migration is turned on and the migration
          threshold is set in the same manner as when -migration_level is
          specified (see below). This capability is implemented only on
          Origin 2000/200 models; other Origin product lines do not
          support automatic memory locality management, and use of this
          option on those models has no effect.

     -migration_level threshold
          The page migration threshold is set to threshold. This value
          specifies the maximum percentage difference between the number
          of remote memory accesses and local memory accesses (relative
          to maximum counter values) for a given page before a migration
          request event occurs. A special argument of 0 turns page
          migration off.
          This option is provided for backward compatibility only; new
          scripts should use the -migration option (see above) instead.
          This capability is implemented only on Origin 2000/200 models;
          other Origin product lines do not support automatic memory
          locality management, and use of this option on those models
          has no effect.

     -propagate
          Migration and page size information is inherited by
          descendants which are exec'ed.

     -mustrun
          When threads are attached to memories or CPUs, the threads are
          attached to CPUs on the node using process_cpulink(3) with a
          request mode of mandatory. Consult the MP_MUSTRUN description
          in the sysmp(2) man page for additional information concerning
          threads attached or bound to CPUs.

     -verbose or -v
          Detailed diagnostic information is written to standard error.

EXAMPLE
     To place data according to the file placement_file for the
     executable a.out that would normally be run by:

          % a.out < in > out

     one would instead type:

          % dplace -place placement_file a.out < in > out

     An example placement file placement_file, when a.out is two
     threaded, might look like:

          # placement_file
          memories 2 in topology cube  # set up 2 memories which are close
          threads 2                    # number of threads
          run thread 0 on memory 1     # run the first thread on the 2nd memory
          run thread 1 on memory 0     # run the 2nd thread on the first memory

     This specification requests 2 nearby memories from the operating
     system. At creation, each thread is requested to run on an available
     CPU which is local to the specified memory. As data and stack space
     is touched or faulted in, physical memory is allocated from the
     memory which is local to the thread that initiated the fault.

     This can be written in a scalable way for a variable number of
     threads using the environment variable NP as follows:

          # scalable placement_file
          memories $NP in topology cube  # set up memories which are close
          threads $NP                    # number of threads
          # run the last thread on the first memory, etc.
          distribute threads $NP-1:0:-1 across memories

USING MPI
     Most MPI implementations use $MPI_NP+1 threads, where the first
     thread is mainly inactive, so one might use the placement file:

          # scalable placement_file for MPI
          memories ($MPI_NP + 1)/2 in topology cube  # set up memories which are close
          threads $MPI_NP + 1                        # number of threads
          # ignore the lazy thread
          distribute threads 1:$MPI_NP across memories

     When using MPI with dplace, syntax similar to the following should
     be used:

          mpirun -np <number_of_processes> dplace <dplace_args> a.out

LARGE PAGES
     Some applications run more efficiently using large pages. To run a
     program a.out using 64k pages for both stack and data, a placement
     file is not necessary; one need only invoke the command:

          % dplace -data_pagesize 64k -stack_pagesize 64k a.out

     from the shell. In default operation, the fallback policy of dplace
     is FallbackDefault. Setting the -data_lpage_wait option changes the
     fallback policy for the data segment from FallbackDefault to
     FallbackLargepage. This policy looks on other nodes for large pages
     when the local node has exhausted its free large pages, and then
     reverts to smaller page sizes, down to base pages, on the local
     node. Similar action is taken for the stack segment when the
     -stack_lpage_wait option is used.

PHYSICAL PLACEMENT
     Physical placement can also be accomplished using dplace. The
     following placement file:

          # physical placement_file for 3 specific memories and 6 threads
          memories 3 in topology physical near \
               /hw/module/2/slot/n4/node \
               /hw/module/3/slot/n2/node \
               /hw/module/4/slot/n3/node
          threads 6
          # the first two threads (0 & 1) will run on /hw/module/2/slot/n4/node
          # the second two threads (2 & 3) will run on /hw/module/3/slot/n2/node
          # the last two threads (4 & 5) will run on /hw/module/4/slot/n3/node
          distribute threads across memories

     specifies three physical nodes using the proper /hw path.
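Assuming the placement file above is saved as phys_place (a hypothetical file name), the program would then be launched with -place in the usual way, for example:

```shell
# phys_place is the placement file shown above (hypothetical name);
# a.out stands for the application binary.
dplace -place phys_place a.out
```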
     To find out the names of the memory nodes on the machine you are
     using, type "find /hw -name node -print" at the shell command
     prompt.

MUSTRUN
     The -mustrun option binds each thread to a particular CPU on its
     node. In cases where threads are distributed across memories, the
     CPU selection attempts to schedule CPUs so as to maximize CPU-to-
     memory bandwidth. For example, on an Origin 3000, CPUs sharing a
     memory are scheduled on the two separate processor busses available
     on the node, which increases the memory bandwidth available to each
     processor. The following placement file demonstrates this behavior:

          # placement file for 4 memories and 8 threads
          threads 8
          memories 4
          distribute threads across memories

     When this placement file is run with the -mustrun option, the CPU
     selection on each memory of an Origin 3000 attempts to select CPU
     numbers 0 and 2, 1 and 3, 0 and 3, or 1 and 2. The selection avoids
     pairing 0 and 1, or 2 and 3, as these CPUs share a processor
     interface bus. When an optimal selection as described cannot be
     made, the next available CPU, regardless of bus attachment, is
     selected.

DEFAULTS
     If command line arguments are omitted, dplace chooses the following
     set of defaults:

          place            /dev/null
          data_pagesize    16k
          stack_pagesize   16k
          text_pagesize    16k
          migration        off
          propagate        off
          mustrun          off
          verbose          off

RESTRICTIONS
     Programs must be dynamic executables; the behavior of non-shared
     executables is unaffected by dplace. Placement files affect only
     direct descendants of dplace. Parallel applications must be based
     on the sproc(2) or fork(2) mechanism. Page sizes for regions which
     are not stack, text, or data cannot be specified with dplace (e.g.,
     SYSV shared memory). Regions shared by multiple processes (e.g.,
     DSO text) are faulted in with the page size settings of the
     faulting process. dplace sets the environment variable _DSM_OFF,
     which disables libmp's own DSM directives and environment
     variables.
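The PAGESIZE_DATA and PAGESIZE_STACK environment variables described in the ENVIRONMENT section take values in kilobytes. A minimal sh-syntax sketch (the value 64 is illustrative, requesting 64k pages):

```shell
# Request 64k data and stack pages via the environment.
# Units are kilobytes, so 64 means 64k pages.
PAGESIZE_DATA=64
PAGESIZE_STACK=64
export PAGESIZE_DATA PAGESIZE_STACK
echo "$PAGESIZE_DATA"   # prints 64
```

A -data_pagesize or -stack_pagesize option on the dplace command line would override these settings.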
ENVIRONMENT
     dplace recognizes and uses the environment variables PAGESIZE_DATA,
     PAGESIZE_STACK, and PAGESIZE_TEXT. When using these variables it is
     important to note that the units are kilobytes. Command line options
     override these environment variable settings.

ERRORS
     If errors are encountered in the placement_file, the default
     behavior of dplace is to print a diagnostic message to standard
     error specifying where the error occurred in the placement_file and
     abort execution.

     If errors are encountered in the libdplace.so library during the
     run-time execution of program, a diagnostic message is sent to
     standard error, a default signal of SIGKILL is sent to all members
     of the process group, and execution is aborted.

     The mode signal expr statement allows selection of a specific
     signal number to be generated upon error. If mode signal expr is
     specified, then when libdplace.so detects a run-time error it sends
     the signal number derived from expr to the program invoked by
     dplace, and control is returned to the caller (that is, the
     program). The signal number can range from 1 to 32. An example of
     how to set the signal number:

          mode signal 16

     Upon detecting an error in libdplace.so at run time, signal 16
     (defined as SIGUSR1) is then sent to the calling process (in this
     case the program) and control is returned to the caller.

SEE ALSO
     dplace(5), mld(3), sysmp(2).