compiler_gym.envs.llvm

The compiler_gym.envs.llvm module contains datasets and API extensions for the LLVM Environments. See LlvmEnv for the class definition.

Constructing Benchmarks

compiler_gym.envs.llvm.make_benchmark(inputs: Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation, List[Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation]]], copt: Optional[List[str]] = None, system_includes: bool = True, timeout: int = 600) → compiler_gym.datasets.benchmark.Benchmark[source]

Create a benchmark for use by LLVM environments.

This function takes one or more inputs and uses them to create a benchmark that can be passed to compiler_gym.envs.LlvmEnv.reset().

For single-source C/C++ programs, you can pass the path of the source file:

>>> import gym
>>> from compiler_gym.envs.llvm import make_benchmark
>>> benchmark = make_benchmark('my_app.c')
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark=benchmark)

The clang invocation used is roughly equivalent to:

$ clang my_app.c -O0 -c -emit-llvm -o benchmark.bc

Additional compile-time arguments to clang can be provided using the copt argument:

>>> benchmark = make_benchmark('/path/to/my_app.cpp', copt=['-O2'])

If you need more fine-grained control over the options, you can directly construct a ClangInvocation to pass a list of arguments to clang:

>>> benchmark = make_benchmark(
...     ClangInvocation(['/path/to/my_app.c'], timeout=10)
... )

For multi-file programs, pass a list of inputs that will be compiled separately and then linked to a single module:

>>> benchmark = make_benchmark([
...     'main.c',
...     'lib.cpp',
...     'lib2.bc',
... ])

If you already have prepared bitcode files, those can be linked and used directly:

>>> benchmark = make_benchmark([
...     'bitcode1.bc',
...     'bitcode2.bc',
... ])

Text-format LLVM assembly can also be used:

>>> benchmark = make_benchmark('module.ll')

Note

LLVM bitcode compatibility is not guaranteed, so you must ensure that any precompiled bitcodes are compatible with the LLVM version used by CompilerGym, which can be queried using env.compiler_version.

Parameters
  • inputs – An input, or list of inputs.

  • copt – A list of command line options to pass to clang when compiling source files.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().

  • timeout – The maximum number of seconds to allow clang to run before terminating.

Returns

A Benchmark instance.

Raises
  • FileNotFoundError – If any input sources are not found.

  • TypeError – If the inputs are of unsupported types.

  • OSError – If a compilation job fails.

  • TimeoutExpired – If a compilation job exceeds timeout seconds.

class compiler_gym.envs.llvm.ClangInvocation(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Class to represent a single invocation of the clang compiler.

__init__(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Create a clang invocation.

Parameters
  • args – The list of arguments to pass to clang.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().

  • timeout – The maximum number of seconds to allow clang to run before terminating.
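For example, an invocation can disable system includes and set a short timeout before being passed to make_benchmark() (the source path is illustrative):

>>> invocation = ClangInvocation(['-O1', 'my_app.c'], system_includes=False, timeout=30)
>>> benchmark = make_benchmark(invocation)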

compiler_gym.envs.llvm.get_system_includes() → List[pathlib.Path][source]

Determine the system include paths for C/C++ compilation jobs.

This uses the system compiler to determine the search paths for C/C++ system headers. By default, c++ is invoked. This can be overridden by setting os.environ["CXX"].

Returns

A list of paths to system header directories.

Raises

OSError – If the compiler fails, or if the search paths cannot be determined.
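The discovery mechanism can be illustrated with a short sketch. A standard technique is to run the compiler with verbose preprocessing (e.g. c++ -E -v) and parse the block printed between "#include <...> search starts here:" and "End of search list.". The parser and sample output below are illustrative only, not the library's actual implementation:

```python
from pathlib import Path
from typing import List


def parse_include_paths(compiler_stderr: str) -> List[Path]:
    """Extract system include directories from compiler -v output.

    Collects the lines between '#include <...> search starts here:'
    and 'End of search list.' (a sketch of the usual technique).
    """
    paths, capturing = [], False
    for line in compiler_stderr.splitlines():
        stripped = line.strip()
        if stripped == "#include <...> search starts here:":
            capturing = True
        elif stripped == "End of search list.":
            break
        elif capturing and stripped:
            paths.append(Path(stripped))
    return paths


# Example -v output, abbreviated for illustration.
sample = """\
ignoring nonexistent directory "/usr/local/missing"
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/11/include
 /usr/local/include
 /usr/include
End of search list.
"""

print(parse_include_paths(sample))
```

Respecting os.environ["CXX"], as the real function does, would simply change which compiler binary is invoked before parsing.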

Datasets

compiler_gym.envs.llvm.datasets.get_llvm_datasets(site_data_base: Optional[pathlib.Path] = None) → Iterable[compiler_gym.datasets.dataset.Dataset][source]

Instantiate the builtin LLVM datasets.

Parameters

site_data_base – The root of the site data path.

Returns

An iterable sequence of Dataset instances.
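For example, the builtin datasets can be enumerated by name (a sketch; the exact attribute access assumes the Dataset API described elsewhere in the CompilerGym documentation):

>>> from compiler_gym.envs.llvm.datasets import get_llvm_datasets
>>> for dataset in get_llvm_datasets():
...     print(dataset.name)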

class compiler_gym.envs.llvm.datasets.AnghaBenchDataset(site_data_base: pathlib.Path, sort_order: int = 0, manifest_url: Optional[str] = None, manifest_sha256: Optional[str] = None, deprecated: Optional[str] = None, name: Optional[str] = None)[source]

A dataset of C programs curated from GitHub source code.

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.

And is available at:

The AnghaBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.

class compiler_gym.envs.llvm.datasets.BlasDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.CBenchDataset(site_data_base: pathlib.Path)[source]
class compiler_gym.envs.llvm.datasets.CLgenDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

The CLgen dataset contains 1000 synthetically generated OpenCL kernels.

The dataset is from:

Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. “Synthesizing benchmarks for predictive modeling.” In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017.

And is available at:

The CLgen dataset consists of OpenCL kernels that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from OpenCL to bitcode. This is a one-off cost. Compiling OpenCL to bitcode requires third party headers that are downloaded on the first call to install().

class compiler_gym.envs.llvm.datasets.CsmithDataset(site_data_base: pathlib.Path, sort_order: int = 0, csmith_bin: Optional[pathlib.Path] = None, csmith_includes: Optional[pathlib.Path] = None)[source]

A dataset which uses Csmith to generate programs.

Csmith is a tool that can generate random conformant C99 programs. It is described in the publication:

Yang, Xuejun, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers.” In Proceedings of the 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), pp. 283-294. 2011.

For up-to-date information about Csmith, see:

Note that Csmith is a tool for finding errors in compilers. As such, there is a higher-than-usual likelihood that a generated benchmark cannot be used by an environment and that env.reset() will raise BenchmarkInitError.
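A defensive reset can skip such cases. A sketch, assuming a generator-style benchmark URI of the form generator://csmith-v0/<seed> (illustrative) and that BenchmarkInitError has been imported into scope:

>>> try:
...     env.reset(benchmark="generator://csmith-v0/0")
... except BenchmarkInitError:
...     pass  # this program failed to initialize; try another seed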

class compiler_gym.envs.llvm.datasets.GitHubDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.LinuxDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.LlvmStressDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

A dataset which uses llvm-stress to generate programs.

llvm-stress is a tool for generating random LLVM-IR files.

This dataset forces reproducible results by setting the input seed to the generator. The benchmark’s URI is the seed, e.g. “generator://llvm-stress-v0/10” is the benchmark generated by llvm-stress using seed 10. The total number of unique seeds is 2^32 - 1.

Note that llvm-stress is a tool for finding errors in LLVM. As such, there is a higher-than-usual likelihood that a generated benchmark cannot be used by an environment and that env.reset() will raise BenchmarkInitError.
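Since each benchmark is addressed by its seed, URIs for this dataset can be generated programmatically. A minimal sketch using the "generator://llvm-stress-v0/<seed>" format described above; the bounds check reflects the stated 2^32 - 1 seed space and is an assumption about which seed values are accepted:

```python
MAX_SEEDS = 2**32 - 1  # total number of unique llvm-stress seeds


def llvm_stress_uri(seed: int) -> str:
    """Format a benchmark URI for the llvm-stress-v0 generator."""
    if not 0 <= seed < MAX_SEEDS:  # assumed valid range
        raise ValueError(f"seed {seed} out of range")
    return f"generator://llvm-stress-v0/{seed}"


# The resulting URI can be passed to env.reset(benchmark=...).
print(llvm_stress_uri(10))  # → generator://llvm-stress-v0/10
```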

class compiler_gym.envs.llvm.datasets.MibenchDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.NPBDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.OpenCVDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.POJ104Dataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

The POJ-104 dataset contains 52000 C++ programs implementing 104 different algorithms with 500 examples of each.

The dataset is from:

Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. “Convolutional neural networks over tree structures for programming language processing.” In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

And is available at:

class compiler_gym.envs.llvm.datasets.TensorFlowDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

Miscellaneous

compiler_gym.envs.llvm.compute_observation(observation_space: compiler_gym.views.observation_space_spec.ObservationSpaceSpec, bitcode: pathlib.Path, timeout: float = 300)[source]

Compute an LLVM observation.

This is a utility function that uses a standalone C++ binary to compute an observation from an LLVM bitcode file. It is intended for use cases where you want to compute an observation without the overhead of initializing a full environment.

Example usage:

>>> import compiler_gym
>>> from compiler_gym.envs import llvm
>>> from pathlib import Path
>>> env = compiler_gym.make("llvm-v0")
>>> space = env.observation.spaces["Ir"]
>>> bitcode = Path("bitcode.bc")
>>> observation = llvm.compute_observation(space, bitcode, timeout=30)

Warning

This is not part of the core CompilerGym API and may change in a future release.

Parameters
  • observation_space – The observation that is to be computed.

  • bitcode – The path of an LLVM bitcode file.

  • timeout – The maximum number of seconds to allow the computation to run before timing out.

Raises
  • ValueError – If computing the observation fails.

  • TimeoutError – If computing the observation times out.

  • FileNotFoundError – If the given bitcode does not exist.