Bohrium supports a broad range of front and back-ends.
The default backend is OpenMP. You can change which backend to use by defining the
BH_STACK environment variable:
- The CPU backend that makes use of OpenMP:
- The GPU backend that makes use of OpenCL:
- The GPU backend that makes use of CUDA:
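As a rough illustration, the stack can also be chosen per run from a Python launcher script by setting BH_STACK in the environment of a child process. This is only a sketch: `env_for_stack` is a hypothetical helper that is not part of Bohrium, and the stack names `openmp` and `opencl` are taken from the [stacks] section of the default configuration file shown further down this page.

```python
import os

def env_for_stack(stack, base=None):
    """Return a copy of the environment with BH_STACK set to `stack`.

    Hypothetical helper for illustration; it is not part of Bohrium.
    """
    env = dict(os.environ if base is None else base)
    env["BH_STACK"] = stack
    return env

# Prepare an environment that selects the OpenCL stack.
env = env_for_stack("opencl", base={"PATH": "/usr/bin"})
print(env["BH_STACK"])
```

The resulting dictionary can be passed as the `env` argument of `subprocess.run` when launching a script, which leaves the environment of the calling process unchanged.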
For debug information when running Bohrium, use the following environment variables:
BH_<backend>_PROF=true -- Prints a performance profile at the end of execution.
BH_<backend>_VERBOSE=true -- Prints a lot of information including the source of the JIT compiled kernels. Enables per-kernel profiling when used together with BH_OPENMP_PROF=true.
BH_SYNC_WARN=true -- Show Python warnings in all instances when copying data to Python.
BH_MEM_WARN=true -- Show warnings when memory accesses are problematic.
BH_<backend>_GRAPH=true -- Dump a dependency graph of the instructions sent to the back-ends (.dot file).
BH_<backend>_VOLATILE=true -- Declare temporary variables using `volatile`, which avoids precision differences because of Intel's use of 80-bit floats internally.
BH_<backend>_PROF=true is very useful to explore why Bohrium might not perform as expected:
BH_OPENMP_PROF=1 python -m bohrium heat_equation.py --size=4000*4000*100

heat_equation.py - target: bhc, bohrium: True, size: 4000*4000*100, elapsed-time: 6.446084

[OpenMP] Profiling:
Fuse cache hits:                 199/203 (98.0296%)
Codegen cache hits               299/304 (98.3553%)
Kernel cache hits                300/304 (98.6842%)
Array contractions:              700/1403 (49.8931%)
Outer-fusion ratio:              13/23 (56.5217%)

Max memory usage:                0 MB
Syncs to NumPy:                  99
Total Work:                      12800400099 operations
Throughput:                      1.9235e+09ops
Work below par-threshold (1000): 0%

Wall clock:                      6.65473s
Total Execution:                 6.04354s
  Pre-fusion:                    0.000761211s
  Fusion:                        0.00411354s
  Codegen:                       0.00192224s
  Compile:                       0.285544s
  Exec:                          4.91214s
  Copy2dev:                      0s
  Copy2host:                     0s
  Ext-method:                    0s
  Offload:                       0s
  Other:                         0.839052s

Unaccounted for (wall - total):  0.611198s
This tells us, among other things, that the execution of the compiled JIT kernels (Exec) takes 4.91 seconds, the JIT compilation (Compile) takes 0.29 seconds, and the time spent outside of Bohrium (Unaccounted for) takes 0.61 seconds.
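To compare runs, it can help to turn the timing lines of such a profile into numbers. The following is a minimal sketch that assumes the "Label: <seconds>s" layout of the output shown above; `parse_timings` is a hypothetical helper, not something Bohrium provides, and the layout may differ between Bohrium versions.

```python
import re

# A few timing lines copied from the profile above.
PROFILE = """\
Wall clock: 6.65473s
Total Execution: 6.04354s
Compile: 0.285544s
Exec: 4.91214s
Other: 0.839052s
"""

def parse_timings(text):
    """Return {label: seconds} for every line of the form 'Label: <number>s'."""
    timings = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?):\s*([0-9.eE+-]+)s\s*$", line)
        if match:
            timings[match.group(1)] = float(match.group(2))
    return timings

timings = parse_timings(PROFILE)
total = timings["Total Execution"]

# Share of the total execution time spent in each measured phase.
shares = {
    label: seconds / total
    for label, seconds in timings.items()
    if label not in ("Wall clock", "Total Execution")
}

# Time spent outside of Bohrium: wall clock minus total execution.
unaccounted = timings["Wall clock"] - total

for label, share in shares.items():
    print(f"{label}: {share:.1%}")
print(f"Unaccounted: {unaccounted:.3f}s")
```

For the run above, kernel execution accounts for roughly four fifths of the total execution time, so JIT compilation is not the bottleneck.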
Bohrium sorts all available devices by type (‘gpu’, ‘cpu’, or ‘accelerator’). Set the device number to the device Bohrium should use (0 means first):
In order to see all available devices, run:
python -m bohrium_api --info
You can also set the options in the configuration file under the
Also under the [opencl] section, you can set the OpenCL work group sizes:
# OpenCL work group sizes
work_group_size_1dx = 128
work_group_size_2dx = 32
work_group_size_2dy = 4
work_group_size_3dx = 32
work_group_size_3dy = 2
work_group_size_3dz = 2
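When tuning these values, it is useful to know how many work items each setting yields per work group, since that is the product of the per-axis sizes. The sketch below reads the values above with Python's standard `configparser` and multiplies them per rank. The helper `work_items` is hypothetical, and the [opencl] header is added here only so that the excerpt parses on its own.

```python
import configparser

# The work group settings from above, placed under their section header.
OPENCL_SECTION = """\
[opencl]
# OpenCL work group sizes
work_group_size_1dx = 128
work_group_size_2dx = 32
work_group_size_2dy = 4
work_group_size_3dx = 32
work_group_size_3dy = 2
work_group_size_3dz = 2
"""

parser = configparser.ConfigParser()
parser.read_string(OPENCL_SECTION)
opencl = parser["opencl"]

def work_items(rank):
    """Multiply the per-axis sizes configured for kernels of the given rank."""
    total = 1
    for axis in "xyz"[:rank]:
        total *= opencl.getint(f"work_group_size_{rank}d{axis}")
    return total

totals = {rank: work_items(rank) for rank in (1, 2, 3)}
print(totals)
```

With the values shown, every rank comes out at 128 work items per group. The maximum a device accepts is device-specific, so changed values should be checked against the device's limits.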
To configure the runtime setup of Bohrium, you must provide a configuration file. The installation of Bohrium installs a default configuration file in:
- /etc/bohrium/config.ini when doing a system-wide installation,
- ~/.bohrium/config.ini when doing a local installation, and
- <python library>/bohrium/config.ini when doing a pip installation.
At runtime Bohrium will search through the following prioritized list in order to find the configuration file:
- The environment variable
- The config within the Python package bohrium/config.ini (in the same directory as
- The home directory config
- The system-wide config
- The system-wide config
- The system-wide config
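A prioritized search like the one listed above amounts to returning the first candidate that exists. The sketch below only illustrates that idea and is not Bohrium's actual lookup code: `first_existing` is a hypothetical helper, and temporary files stand in for the real locations.

```python
import os
import tempfile

def first_existing(candidates):
    """Return the first candidate that is an existing file, or None.

    Entries that are None or empty (for example an unset environment
    variable) are skipped.
    """
    for path in candidates:
        if not path:
            continue
        expanded = os.path.expanduser(path)
        if os.path.isfile(expanded):
            return expanded
    return None

# Demonstration with temporary files standing in for the real locations.
with tempfile.TemporaryDirectory() as tmp:
    home_cfg = os.path.join(tmp, "home_config.ini")
    system_cfg = os.path.join(tmp, "system_config.ini")
    for path in (home_cfg, system_cfg):
        with open(path, "w") as handle:
            handle.write("[stacks]\n")

    package_cfg = os.path.join(tmp, "package_config.ini")  # never created
    found = first_existing([None, package_cfg, home_cfg, system_cfg])
    found_name = os.path.basename(found)

print(found_name)
```

Because the list is ordered by priority, a file earlier in the list shadows every later one. Here the stand-in for the home directory config wins over the stand-in for the system-wide config.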
The default configuration file looks similar to the config below:
#
# Stack configurations, which are a comma separated lists of components.
# NB: 'stacks' is a reserved section name and 'default'
#     is used when 'BH_STACK' is unset.
#     The bridge is never part of the list
#
[stacks]
default = bcexp, bccon, node, openmp
openmp = bcexp, bccon, node, openmp
opencl = bcexp, bccon, node, opencl, openmp

#
# Managers
#
[node]
impl = /usr/lib/libbh_vem_node.so
timing = false

[proxy]
address = localhost
port = 4200
impl = /usr/lib/libbh_vem_proxy.so

#
# Filters - Helpers / Tools
#
[pprint]
impl = /usr/lib/libbh_filter_pprint.so

#
# Filters - Bytecode transformers
#
[bccon]
impl = /usr/lib/libbh_filter_bccon.so
collect = true
stupidmath = true
muladd = true
reduction = false
find_repeats = false
timing = false
verbose = false

[bcexp]
impl = /usr/lib/libbh_filter_bcexp.so
powk = true
sign = false
repeat = false
reduce1d = 32000
timing = false
verbose = false

[noneremover]
impl = /usr/lib/libbh_filter_noneremover.so
timing = false
verbose = false

#
# Engines
#
[openmp]
impl = /usr/lib/libbh_ve_openmp.so
tmp_bin_dir = /usr/var/bohrium/object
tmp_src_dir = /usr/var/bohrium/source
dump_src = true
verbose = false
prof = false #Profiling statistics
compiler_cmd = "/usr/bin/x86_64-linux-gnu-gcc"
compiler_inc = "-I/usr/share/bohrium/include"
compiler_lib = "-lm -L/usr/lib -lbh"
compiler_flg = "-x c -fPIC -shared -std=gnu99 -O3 -march=native -Werror -fopenmp"
compiler_openmp = true
compiler_openmp_simd = false

[opencl]
impl = /usr/lib/libbh_ve_opencl.so
verbose = false
prof = false #Profiling statistics
# Additional options given to the opencl compiler. See documentation for clBuildProgram
compiler_flg = "-I/usr/share/bohrium/include"
serial_fusion = false # Topological fusion is default
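To inspect which components a given stack contains, the [stacks] section can be read with Python's standard `configparser`. This is a sketch for inspection only and says nothing about how Bohrium itself parses the file. `stack_components` is a hypothetical helper, and the fallback to `default` mirrors the comment in the file about an unset BH_STACK.

```python
import configparser

# The [stacks] section copied from the default configuration above.
STACKS = """\
[stacks]
default = bcexp, bccon, node, openmp
openmp = bcexp, bccon, node, openmp
opencl = bcexp, bccon, node, opencl, openmp
"""

parser = configparser.ConfigParser()
parser.read_string(STACKS)

def stack_components(name=None):
    """Return the ordered component list of stack `name`.

    Falls back to the 'default' stack when `name` is None or empty.
    """
    value = parser["stacks"][name or "default"]
    return [component.strip() for component in value.split(",")]

print(stack_components())
print(stack_components("opencl"))
```

When reading the complete file this way, note that `configparser` does not strip inline comments such as `#Profiling statistics` unless `inline_comment_prefixes=("#",)` is passed to `ConfigParser`.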
The configuration file consists of two things: components and the orchestration of components in stacks.
Components are marked with square brackets. For example, [opencl] is one of the components available to the runtime system.
The stacks define different default configurations of the runtime environment, and one can switch between them using the BH_STACK environment variable.
The configuration of a component can be overridden with environment variables using the naming convention BH_[COMPONENT]_[OPTION]. Below are a couple of examples that control the behavior of the CPU vector engine:
BH_OPENMP_PROF=true -- Prints a performance profile at the end of execution.
BH_OPENMP_VERBOSE=true -- Prints a lot of information including the source of the JIT compiled kernels. Enables per-kernel profiling when used together with BH_OPENMP_PROF=true.
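The naming convention can be sketched as a small function that maps a component and option to the variable name, plus a lookup that prefers the environment over the configuration file. Both helpers are hypothetical and only illustrate the convention. The upper-casing is inferred from the examples above, where the `prof` option of `[openmp]` becomes BH_OPENMP_PROF.

```python
def override_name(component, option):
    """Build the environment variable name BH_[COMPONENT]_[OPTION]."""
    return f"BH_{component.upper()}_{option.upper()}"

def effective_value(component, option, config_value, environ):
    """Return the environment override if present, else the config file value."""
    return environ.get(override_name(component, option), config_value)

name = override_name("openmp", "prof")

# The config file says prof = false, but the environment overrides it.
value = effective_value("openmp", "prof", "false", {"BH_OPENMP_PROF": "true"})

print(name, value)
```

Passing `os.environ` as the `environ` argument would apply the same lookup to the real process environment.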
Useful environment variables:
BH_SYNC_WARN=true -- Show Python warnings in all instances when copying data to Python.
BH_MEM_WARN=true -- Show warnings when memory accesses are problematic.
BH_<backend>_GRAPH=true -- Dump a dependency graph of the instructions sent to the back-ends (.dot file).
BH_<backend>_VOLATILE=true -- Declare temporary variables using `volatile`, which avoids precision differences because of Intel's use of 80-bit floats internally.