compiler_gym.envs.llvm¶
The compiler_gym.envs.llvm module contains datasets and API extensions for the LLVM environments. See LlvmEnv for the class definition.
Constructing Benchmarks¶
- compiler_gym.envs.llvm.make_benchmark(inputs: Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation, List[Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation]]], copt: Optional[List[str]] = None, system_includes: bool = True, timeout: int = 600) → compiler_gym.datasets.benchmark.Benchmark[source]¶
Create a benchmark for use by LLVM environments.
This function takes one or more inputs and uses them to create a benchmark that can be passed to compiler_gym.envs.LlvmEnv.reset().
For single-source C/C++ programs, you can pass the path of the source file:
>>> benchmark = make_benchmark('my_app.c')
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark=benchmark)
The clang invocation used is roughly equivalent to:
$ clang my_app.c -O0 -c -emit-llvm -o benchmark.bc
Additional compile-time arguments to clang can be provided using the copt argument:
>>> benchmark = make_benchmark('/path/to/my_app.cpp', copt=['-O2'])
If you need more fine-grained control over the options, you can directly construct a ClangInvocation to pass a list of arguments to clang:
>>> benchmark = make_benchmark(
...     ClangInvocation(['/path/to/my_app.c'], timeout=10)
... )
For multi-file programs, pass a list of inputs that will be compiled separately and then linked to a single module:
>>> benchmark = make_benchmark([
...     'main.c',
...     'lib.cpp',
...     'lib2.bc',
... ])
If you already have prepared bitcode files, those can be linked and used directly:
>>> benchmark = make_benchmark([
...     'bitcode1.bc',
...     'bitcode2.bc',
... ])
Text-format LLVM assembly can also be used:
>>> benchmark = make_benchmark('module.ll')
Note
LLVM bitcode compatibility is not guaranteed, so you must ensure that any precompiled bitcodes are compatible with the LLVM version used by CompilerGym, which can be queried using env.compiler_version.
- Parameters
inputs – An input, or list of inputs.
copt – A list of command line options to pass to clang when compiling source files.
system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().
timeout – The maximum number of seconds to allow clang to run before terminating.
- Returns
A Benchmark instance.
- Raises
FileNotFoundError – If any input sources are not found.
TypeError – If the inputs are of unsupported types.
OSError – If a compilation job fails.
TimeoutExpired – If a compilation job exceeds timeout seconds.
- class compiler_gym.envs.llvm.ClangInvocation(args: List[str], system_includes: bool = True, timeout: int = 600)[source]¶
Class to represent a single invocation of the clang compiler.
- __init__(args: List[str], system_includes: bool = True, timeout: int = 600)[source]¶
Create a clang invocation.
- Parameters
args – The list of arguments to pass to clang.
system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().
timeout – The maximum number of seconds to allow clang to run before terminating.
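As a rough sketch of what such an invocation amounts to (not the library's actual implementation), think of it as assembling a clang command line from the user's arguments plus an -isystem flag per system include directory. The helper name below is made up for illustration:

```python
from pathlib import Path
from typing import List

def build_clang_command(args: List[str], system_include_dirs: List[Path]) -> List[str]:
    # Hypothetical helper: combine user-supplied clang arguments with
    # -isystem flags for each system header directory, roughly the way a
    # ClangInvocation-style wrapper would before running the compiler.
    cmd = ["clang"] + list(args)
    for d in system_include_dirs:
        cmd += ["-isystem", str(d)]
    return cmd

cmd = build_clang_command(
    ["my_app.c", "-O0", "-c", "-emit-llvm", "-o", "out.bc"],
    [Path("/usr/include")],
)
```

The timeout parameter is then a matter of bounding how long the assembled command is allowed to run, e.g. via subprocess with a timeout.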
- compiler_gym.envs.llvm.get_system_includes() → List[pathlib.Path][source]¶
Determine the system include paths for C/C++ compilation jobs.
This uses the system compiler to determine the search paths for C/C++ system headers. By default, c++ is invoked. This can be overridden by setting os.environ["CXX"].
- Returns
A list of paths to system header directories.
- Raises
OSError – If the compiler fails, or if the search paths cannot be determined.
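The mechanism described above boils down to running the compiler in verbose preprocessor mode and parsing the search-path block from its output. A minimal sketch, assuming the "#include <...> search starts here:" / "End of search list." format that clang and GCC print with -v (the function names here are illustrative, not the library's):

```python
import os
import subprocess
from pathlib import Path
from typing import List

def parse_include_search_paths(compiler_output: str) -> List[Path]:
    # Extract the directories listed between the search-list markers in
    # the compiler's -v output.
    paths, in_block = [], False
    for line in compiler_output.splitlines():
        if line.startswith("#include <...> search starts here:"):
            in_block = True
        elif line.startswith("End of search list."):
            break
        elif in_block:
            paths.append(Path(line.strip()))
    return paths

def system_includes_sketch(cxx: str = "") -> List[Path]:
    # Invoke the system C++ compiler on empty input in verbose preprocessor
    # mode and parse its stderr, honoring an os.environ["CXX"] override.
    compiler = cxx or os.environ.get("CXX", "c++")
    proc = subprocess.run(
        [compiler, "-E", "-x", "c++", "-", "-v"],
        input="", capture_output=True, text=True,
    )
    return parse_include_search_paths(proc.stderr)
```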
Datasets¶
- compiler_gym.envs.llvm.datasets.get_llvm_datasets(site_data_base: Optional[pathlib.Path] = None) → Iterable[compiler_gym.datasets.dataset.Dataset][source]¶
Instantiate the builtin LLVM datasets.
- Parameters
site_data_base – The root of the site data path.
- Returns
An iterable sequence of Dataset instances.
- class compiler_gym.envs.llvm.datasets.AnghaBenchDataset(site_data_base: pathlib.Path, sort_order: int = 0, manifest_url: Optional[str] = None, manifest_sha256: Optional[str] = None, deprecated: Optional[str] = None, name: Optional[str] = None)[source]¶
A dataset of C programs curated from GitHub source code.
The dataset is from:
da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhaes, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimaraes, and Fernando Magno Quinão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.
And is available at:
The AnghaBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.
- class compiler_gym.envs.llvm.datasets.BlasDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.CLgenDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
The CLgen dataset contains 1000 synthetically generated OpenCL kernels.
The dataset is from:
Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. “Synthesizing benchmarks for predictive modeling.” In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017.
And is available at:
The CLgen dataset consists of OpenCL kernels that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from OpenCL to bitcode. This is a one-off cost. Compiling OpenCL to bitcode requires third party headers that are downloaded on the first call to install().
- class compiler_gym.envs.llvm.datasets.CsmithDataset(site_data_base: pathlib.Path, sort_order: int = 0, csmith_bin: Optional[pathlib.Path] = None, csmith_includes: Optional[pathlib.Path] = None)[source]¶
A dataset which uses Csmith to generate programs.
Csmith is a tool that can generate random conformant C99 programs. It is described in the publication:
Yang, Xuejun, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers.” In Proceedings of the 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), pp. 283-294. 2011.
For up-to-date information about Csmith, see:
Note that Csmith is a tool that is used to find errors in compilers. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.
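Because a generated program may fail to initialize, callers typically guard reset() and skip to another benchmark. A hypothetical sketch of that pattern; the BenchmarkInitError class below is a stand-in defined for illustration, not the real compiler_gym exception:

```python
class BenchmarkInitError(Exception):
    # Stand-in for compiler_gym's BenchmarkInitError, for illustration only.
    pass

def reset_or_skip(env, benchmark) -> bool:
    # Return True if the benchmark initialized successfully, False if it
    # should be skipped and another benchmark tried instead.
    try:
        env.reset(benchmark=benchmark)
        return True
    except BenchmarkInitError:
        return False
```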
- class compiler_gym.envs.llvm.datasets.GitHubDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.LinuxDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.LlvmStressDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
A dataset which uses llvm-stress to generate programs.
llvm-stress is a tool for generating random LLVM-IR files.
This dataset forces reproducible results by setting the input seed to the generator. The benchmark’s URI is the seed, e.g. “generator://llvm-stress-v0/10” is the benchmark generated by llvm-stress using seed 10. The total number of unique seeds is 2^32 - 1.
Note that llvm-stress is a tool that is used to find errors in LLVM. As such, there is a higher likelihood that the benchmark cannot be used for an environment and that env.reset() will raise BenchmarkInitError.
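Since the URI scheme embeds the seed directly, benchmark URIs for this dataset can be constructed by hand. A small illustrative helper (the URI format is taken from the description above; the function name is made up):

```python
def llvm_stress_uri(seed: int) -> str:
    # llvm-stress seeds are 32-bit values; the URI embeds the seed directly,
    # e.g. seed 10 -> "generator://llvm-stress-v0/10".
    if not 0 <= seed < 2**32:
        raise ValueError("seed must fit in 32 bits")
    return f"generator://llvm-stress-v0/{seed}"
```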
- class compiler_gym.envs.llvm.datasets.MibenchDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.NPBDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.OpenCVDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
- class compiler_gym.envs.llvm.datasets.POJ104Dataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]¶
The POJ-104 dataset contains 52000 C++ programs implementing 104 different algorithms with 500 examples of each.
The dataset is from:
Lili Mou, Ge Li, Lu Zhang, Tao Wang, Zhi Jin. “Convolutional neural networks over tree structures for programming language processing.” To appear in Proceedings of 30th AAAI Conference on Artificial Intelligence, 2016.
And is available at:
Miscellaneous¶
- compiler_gym.envs.llvm.compute_observation(observation_space: compiler_gym.views.observation_space_spec.ObservationSpaceSpec, bitcode: pathlib.Path, timeout: float = 300)[source]¶
Compute an LLVM observation.
This is a utility function that uses a standalone C++ binary to compute an observation from an LLVM bitcode file. It is intended for use cases where you want to compute an observation without the overhead of initializing a full environment.
Example usage:
>>> env = compiler_gym.make("llvm-v0")
>>> space = env.observation.spaces["Ir"]
>>> bitcode = Path("bitcode.bc")
>>> observation = llvm.compute_observation(space, bitcode, timeout=30)
Warning
This is not part of the core CompilerGym API and may change in a future release.
- Parameters
observation_space – The observation space that is to be computed.
bitcode – The path of an LLVM bitcode file.
timeout – The maximum number of seconds to allow the computation to run before timing out.
- Raises
ValueError – If computing the observation fails.
TimeoutError – If computing the observation times out.
FileNotFoundError – If the given bitcode does not exist.