# Guidelines on summitdev

## modules needed

	module load pgi
	module load cuda
	module load petsc/3.6.4
	module load hypre/2.11.1
	module load lsf-tools

## Environment Variables needed

[//]: # (It is not necessary to set OMP_NUM_THREADS on the node submitting the jobs, but just in case...)
	export OMP_NUM_THREADS=1 

## Compiling

### change Makefile

Set the compiler near the top of the Makefile:

	COMPILER := portland

Update: the Makefile now seems to work fine on summitdev without hard-coding the compiler; "default" is correctly parsed and set to "portland" when summitdev is detected. But, just in case...

### compile
       
	make OPENMP=y [GPU=y [AMGX=y]] -j
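
If the automatic detection ever misbehaves, the compiler can still be forced from the command line. This is only a fallback sketch: it relies on the standard make rule that command-line variable assignments override plain assignments in the Makefile, and assumes the Makefile does not use the override directive on COMPILER.

	make COMPILER=portland OPENMP=y GPU=y -j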

## job submission

	./run.sh.gtc [nodes] [ppn] [ompthreads] [walltime] [gpu/cpu]

## explanation of the scripts

### run.sh.gtc

This is the main job submission script.

Example usage:

	$ ./run.sh.gtc 8 4 1 30 gpu

This will submit a run requesting 8 nodes, 4 MPI processes per node, 1 OpenMP thread per MPI process, a wall time of 30 minutes, and GPU mode. Note that the "gpu" flag only sets the OpenMP and other related settings in the job batch script; you still need to make sure the executable "./gtc" was compiled with GPU support enabled.

In the script, the following variables are set:

*nodes* reads in the total number of nodes. summitdev has 20 cores per node and 8 hyper-threads per core, so the logical IDs (LIDs) go from 0 to 159 on each node.

*ppn* reads in the number of MPI processes per node.

*walltime* reads in the wall time in minutes.

*nmpi* calculates the total number of MPI processes.

*ompthreads* reads in the number of OpenMP threads per MPI process.
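
For reference, the argument handling inside run.sh.gtc presumably looks something like the sketch below; the variable names are illustrative and the exact lines in the real script may differ.

	# hypothetical sketch of the argument parsing in run.sh.gtc
	nodes=$1                  # total number of nodes
	ppn=$2                    # MPI processes per node
	ompthreads=$3             # OpenMP threads per MPI process
	walltime=$4               # wall time in minutes
	mode=$5                   # "gpu" or "cpu"
	nmpi=$(( nodes * ppn ))   # total number of MPI processes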

The resources are calculated from these numbers and a job script is generated. Some of the BSUB control options are:

 * standard output: `#BSUB -o %J.out`
 * standard error: `#BSUB -e %J.err`
 * total number of nodes: `#BSUB -nnodes ${nodes}`
 * submission queue: `#BSUB -q batch`
 * project charged: `#BSUB -P csc190gtc`
 * wall time in minutes: `#BSUB -W ${walltime}`
 * set up environment variables and GPU access mode: `#BSUB -alloc_flags "all,gpudefault"`
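
Put together, the generated header of the batch script should look roughly like the sketch below (assembled from the options above, with ${nodes} and ${walltime} expanded to the requested values by the script; not a verbatim copy).

	#BSUB -o %J.out
	#BSUB -e %J.err
	#BSUB -nnodes ${nodes}
	#BSUB -q batch
	#BSUB -P csc190gtc
	#BSUB -W ${walltime}
	#BSUB -alloc_flags "all,gpudefault"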

Then the shell commands:

 * set OpenMP threads: `export OMP_NUM_THREADS=${ompthreads}`
 * set stack-size limit: `ulimit -s 10240`
 * create output directories: `mkdir -p *`
 * launch with jsrun (module lsf-tools is required):

        jsrun -n${nmpi} -g1 -a1 -lgpu-cpu -c${ompthreads} ./gtc
	
	* -n : number of Resource Sets
	* -g : GPUs per Resource Set
	* -a : MPI tasks per Resource Set
	* -l : latency_priority; controls the layout priority, can be cpu-cpu or gpu-cpu
	* -c : CPUs per Resource Set
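
As a concrete illustration, the example run above (8 nodes, 4 ranks per node, 1 thread per rank) would end up with a job body roughly like the following; the jsrun line is the one from the script with the variables filled in, and the surrounding lines are reconstructed from the list above.

	export OMP_NUM_THREADS=1
	ulimit -s 10240
	# 8 nodes * 4 ranks/node = 32 resource sets, each with 1 GPU and 1 MPI task
	jsrun -n32 -g1 -a1 -lgpu-cpu -c1 ./gtc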


### nvprof.run.sh.gtc

This script is very similar to the _run.sh.gtc_ script, but with nvprof profiling controls.
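
The usual pattern is to place nvprof between jsrun and the executable so that each rank writes its own profile; a hedged sketch of such an invocation is shown below (the output file name and flags are illustrative, not necessarily those used in the script).

	# hypothetical: one profile file per MPI rank, named after its global rank
	jsrun -n${nmpi} -g1 -a1 -lgpu-cpu -c${ompthreads} \
	    nvprof -o gtc.%q{OMPI_COMM_WORLD_RANK}.nvvp ./gtc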

### mps.run.sh.gtc

This script is very similar to the _run.sh.gtc_ script, but uses the *mps_helper.sh* script to set up the GPU MPS system.

### set_device_and_bind.sh

This is the helper script that sets the device (GPU) used by each MPI process and binds logical IDs (LIDs) to the OpenMP threads of each process.

#### set device
*ngpus* is the total number of GPUs on one node, currently 4 for summitdev.

*OMPI_COMM_WORLD_LOCAL_RANK* is the rank within the node.

*OMPI_COMM_WORLD_LOCAL_SIZE* is the total number of processes on the node.

The *mydevice* calculation determines the GPU ID assigned to this MPI process. Since there are 4 GPUs per node, the number of processes per node should ideally be a multiple of 4; otherwise the GPU load will not be uniform, and some GPUs will be assigned more MPI processes than others.

Currently, only one GPU is assigned to each MPI rank. The script could be made smarter so that runs with fewer MPI processes per node are handled well, i.e. one MPI process could be assigned more than one GPU. The syntax for assigning more than one GPU is: export CUDA_VISIBLE_DEVICES=0,1,2,3
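
A minimal sketch of the single-GPU assignment, assuming a simple round-robin over the node-local rank (the real script may differ in details):

	# hypothetical round-robin assignment of node-local ranks to GPUs
	ngpus=4                                 # GPUs per summitdev node
	lrank=${OMPI_COMM_WORLD_LOCAL_RANK}     # rank within the node
	mydevice=$(( lrank % ngpus ))           # GPU ID for this rank
	export CUDA_VISIBLE_DEVICES=${mydevice}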

#### bind 

This part of the helper script binds logical IDs (LIDs) to the OpenMP threads of a given MPI process. Linux does not automatically optimize the OpenMP thread distribution. summitdev has 160 LIDs per node (20 cores * 8 hyper-threads); OpenMP threads perform better when each thread uses LIDs from a single core.

*cpus_per_rank* calculates the number of LIDs available to each MPI rank.

*startproc* is the lowest LID of the given MPI rank.

*stride* is the number of LIDs per OpenMP thread.
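
A hedged sketch of how these quantities fit together is given below. The arithmetic follows the descriptions above; the final OMP_PLACES line is only one possible way to apply the computed layout, since the actual binding mechanism used by the script (numactl, taskset, OMP_PLACES, ...) is not reproduced here.

	# hypothetical binding arithmetic: 160 LIDs per node, split evenly over the local ranks
	lids_per_node=160
	lrank=${OMPI_COMM_WORLD_LOCAL_RANK}
	lsize=${OMPI_COMM_WORLD_LOCAL_SIZE}
	cpus_per_rank=$(( lids_per_node / lsize ))      # LIDs available to this rank
	startproc=$(( lrank * cpus_per_rank ))          # lowest LID of this rank
	stride=$(( cpus_per_rank / OMP_NUM_THREADS ))   # LIDs per OpenMP thread
	# one possible application: give each thread its own block of LIDs (one core's worth when stride=8)
	export OMP_PLACES="{${startproc}:${stride}}:${OMP_NUM_THREADS}:${stride}"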

### mps_helper.sh

This script helps set up the GPU Multi-Process Service (MPS). It was obtained from Bob Walkup's home directory: /ccs/home/walkup/bin/
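
For orientation, an MPS helper of this kind typically starts the MPS control daemon once per node before the application runs. The sketch below shows the usual minimal incantation; the directory names are illustrative and the actual script may do considerably more.

	# hypothetical minimal MPS setup, run once per node before launching ./gtc
	export CUDA_MPS_PIPE_DIRECTORY=/tmp/mps-pipe-$USER
	export CUDA_MPS_LOG_DIRECTORY=/tmp/mps-log-$USER
	mkdir -p $CUDA_MPS_PIPE_DIRECTORY $CUDA_MPS_LOG_DIRECTORY
	nvidia-cuda-mps-control -d          # start the MPS control daemon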
 
