compiler_gym.envs.llvm

The compiler_gym.envs.llvm module contains datasets and API extensions for the LLVM Environments. See LlvmEnv for the class definition.

Constructing Benchmarks

compiler_gym.envs.llvm.make_benchmark(inputs: Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation, List[Union[str, pathlib.Path, compiler_gym.envs.llvm.llvm_benchmark.ClangInvocation]]], copt: Optional[List[str]] = None, system_includes: bool = True, timeout: int = 600) → compiler_gym.datasets.benchmark.Benchmark[source]

Create a benchmark for use by LLVM environments.

This function takes one or more inputs and uses them to create a benchmark that can be passed to compiler_gym.envs.LlvmEnv.reset().

For single-source C/C++ programs, you can pass the path of the source file:

>>> import gym
>>> from compiler_gym.envs.llvm import make_benchmark
>>> benchmark = make_benchmark('my_app.c')
>>> env = gym.make("llvm-v0")
>>> env.reset(benchmark=benchmark)

The clang invocation used is roughly equivalent to:

$ clang my_app.c -O0 -c -emit-llvm -o benchmark.bc

Additional compile-time arguments to clang can be provided using the copt argument:

>>> benchmark = make_benchmark('/path/to/my_app.cpp', copt=['-O2'])

If you need more fine-grained control over the options, you can directly construct a ClangInvocation to pass a list of arguments to clang:

>>> benchmark = make_benchmark(
...     ClangInvocation(['/path/to/my_app.c'], timeout=10)
... )

For multi-file programs, pass a list of inputs that will be compiled separately and then linked to a single module:

>>> benchmark = make_benchmark([
...     'main.c',
...     'lib.cpp',
...     'lib2.bc',
... ])

If you already have prepared bitcode files, those can be linked and used directly:

>>> benchmark = make_benchmark([
...     'bitcode1.bc',
...     'bitcode2.bc',
... ])

Text-format LLVM assembly can also be used:

>>> benchmark = make_benchmark('module.ll')

Note

LLVM bitcode compatibility is not guaranteed, so you must ensure that any precompiled bitcodes are compatible with the LLVM version used by CompilerGym, which can be queried using env.compiler_version.

Parameters
  • inputs – An input, or list of inputs.

  • copt – A list of command line options to pass to clang when compiling source files.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().

  • timeout – The maximum number of seconds to allow clang to run before terminating.

Returns

A Benchmark instance.

Raises
  • FileNotFoundError – If any input sources are not found.

  • TypeError – If the inputs are of unsupported types.

  • OSError – If a compilation job fails.

  • TimeoutExpired – If a compilation job exceeds timeout seconds.

class compiler_gym.envs.llvm.ClangInvocation(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Class to represent a single invocation of the clang compiler.

__init__(args: List[str], system_includes: bool = True, timeout: int = 600)[source]

Create a clang invocation.

Parameters
  • args – The list of arguments to pass to clang.

  • system_includes – Whether to include the system standard libraries during compilation jobs. This requires a system toolchain. See get_system_includes().

  • timeout – The maximum number of seconds to allow clang to run before terminating.
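For example, an invocation can disable system includes and set a short timeout before being passed to make_benchmark() (the source path is illustrative):

>>> invocation = ClangInvocation(['-O1', 'my_app.c'], system_includes=False, timeout=30)
>>> benchmark = make_benchmark(invocation)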

compiler_gym.envs.llvm.get_system_includes() → List[pathlib.Path][source]

Determine the system include paths for C/C++ compilation jobs.

This uses the system compiler to determine the search paths for C/C++ system headers. By default, c++ is invoked. This can be overridden by setting os.environ["CXX"].

Returns

A list of paths to system header directories.

Raises

OSError – If the compiler fails, or if the search paths cannot be determined.
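The discovery mechanism can be illustrated with a short sketch. A standard technique is to run the compiler with verbose preprocessing (e.g. c++ -E -v) and parse the block printed between "#include <...> search starts here:" and "End of search list.". The parser and sample output below are illustrative only, not the library's actual implementation:

```python
from pathlib import Path
from typing import List


def parse_include_paths(compiler_stderr: str) -> List[Path]:
    """Extract system include directories from compiler -v output.

    Collects the lines between '#include <...> search starts here:'
    and 'End of search list.' (a sketch of the usual technique).
    """
    paths, capturing = [], False
    for line in compiler_stderr.splitlines():
        stripped = line.strip()
        if stripped == "#include <...> search starts here:":
            capturing = True
        elif stripped == "End of search list.":
            break
        elif capturing and stripped:
            paths.append(Path(stripped))
    return paths


# Example -v output, abbreviated for illustration.
sample = """\
ignoring nonexistent directory "/usr/local/missing"
#include <...> search starts here:
 /usr/lib/gcc/x86_64-linux-gnu/11/include
 /usr/local/include
 /usr/include
End of search list.
"""

print(parse_include_paths(sample))
```

Respecting os.environ["CXX"], as the real function does, would simply change which compiler binary is invoked before parsing.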

Datasets

compiler_gym.envs.llvm.datasets.get_llvm_datasets(site_data_base: Optional[pathlib.Path] = None) → Iterable[compiler_gym.datasets.dataset.Dataset][source]

Instantiate the builtin LLVM datasets.

Parameters

site_data_base – The root of the site data path.

Returns

An iterable sequence of Dataset instances.
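For example, the builtin datasets can be enumerated by name (a sketch; the exact attribute access assumes the Dataset API described elsewhere in the CompilerGym documentation):

>>> from compiler_gym.envs.llvm.datasets import get_llvm_datasets
>>> for dataset in get_llvm_datasets():
...     print(dataset.name)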

class compiler_gym.envs.llvm.datasets.AnghaBenchDataset(site_data_base: pathlib.Path, sort_order: int = 0, manifest_url: Optional[str] = None, manifest_sha256: Optional[str] = None, deprecated: Optional[str] = None, name: Optional[str] = None)[source]

A dataset of C programs curated from GitHub source code.

The dataset is from:

da Silva, Anderson Faustino, Bruno Conde Kind, José Wesley de Souza Magalhães, Jerônimo Nunes Rocha, Breno Campos Ferreira Guimarães, and Fernando Magno Quintão Pereira. “ANGHABENCH: A Suite with One Million Compilable C Benchmarks for Code-Size Reduction.” In 2021 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 378-390. IEEE, 2021.

And is available at:

The AnghaBench dataset consists of C functions that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from C to bitcode. This is a one-off cost.

class compiler_gym.envs.llvm.datasets.BlasDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.CBenchDataset(site_data_base: pathlib.Path)[source]
class compiler_gym.envs.llvm.datasets.CLgenDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

The CLgen dataset contains 1000 synthetically generated OpenCL kernels.

The dataset is from:

Cummins, Chris, Pavlos Petoumenos, Zheng Wang, and Hugh Leather. “Synthesizing benchmarks for predictive modeling.” In 2017 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), pp. 86-99. IEEE, 2017.

And is available at:

The CLgen dataset consists of OpenCL kernels that are compiled to LLVM-IR on-demand and cached. The first time each benchmark is used there is an overhead of compiling it from OpenCL to bitcode. This is a one-off cost. Compiling OpenCL to bitcode requires third party headers that are downloaded on the first call to install().

class compiler_gym.envs.llvm.datasets.CsmithDataset(site_data_base: pathlib.Path, sort_order: int = 0, csmith_bin: Optional[pathlib.Path] = None, csmith_includes: Optional[pathlib.Path] = None)[source]

A dataset which uses Csmith to generate programs.

Csmith is a tool that can generate random conformant C99 programs. It is described in the publication:

Yang, Xuejun, Yang Chen, Eric Eide, and John Regehr. “Finding and understanding bugs in C compilers.” In Proceedings of the 32nd ACM SIGPLAN conference on Programming Language Design and Implementation (PLDI), pp. 283-294. 2011.

For up-to-date information about Csmith, see:

Note that Csmith is a tool for finding errors in compilers. As such, there is a higher-than-usual likelihood that a generated benchmark cannot be used by an environment and that env.reset() will raise BenchmarkInitError.
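A defensive reset can skip such cases. A sketch, assuming a generator-style benchmark URI of the form generator://csmith-v0/<seed> (illustrative) and that BenchmarkInitError has been imported into scope:

>>> try:
...     env.reset(benchmark="generator://csmith-v0/0")
... except BenchmarkInitError:
...     pass  # this program failed to initialize; try another seed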

class compiler_gym.envs.llvm.datasets.GitHubDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.LinuxDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.LlvmStressDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

A dataset which uses llvm-stress to generate programs.

llvm-stress is a tool for generating random LLVM-IR files.

This dataset forces reproducible results by setting the input seed to the generator. The benchmark’s URI is the seed, e.g. “generator://llvm-stress-v0/10” is the benchmark generated by llvm-stress using seed 10. The total number of unique seeds is 2^32 - 1.

Note that llvm-stress is a tool for finding errors in LLVM. As such, there is a higher-than-usual likelihood that a generated benchmark cannot be used by an environment and that env.reset() will raise BenchmarkInitError.
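Since each benchmark is addressed by its seed, URIs for this dataset can be generated programmatically. A minimal sketch using the "generator://llvm-stress-v0/<seed>" format described above; the bounds check reflects the stated 2^32 - 1 seed space and is an assumption about which seed values are accepted:

```python
MAX_SEEDS = 2**32 - 1  # total number of unique llvm-stress seeds


def llvm_stress_uri(seed: int) -> str:
    """Format a benchmark URI for the llvm-stress-v0 generator."""
    if not 0 <= seed < MAX_SEEDS:  # assumed valid range
        raise ValueError(f"seed {seed} out of range")
    return f"generator://llvm-stress-v0/{seed}"


# The resulting URI can be passed to env.reset(benchmark=...).
print(llvm_stress_uri(10))  # → generator://llvm-stress-v0/10
```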

class compiler_gym.envs.llvm.datasets.MibenchDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.NPBDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.OpenCVDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]
class compiler_gym.envs.llvm.datasets.POJ104Dataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

The POJ-104 dataset contains 52000 C++ programs implementing 104 different algorithms with 500 examples of each.

The dataset is from:

Lili Mou, Ge Li, Lu Zhang, Tao Wang, and Zhi Jin. “Convolutional neural networks over tree structures for programming language processing.” In Proceedings of the 30th AAAI Conference on Artificial Intelligence, 2016.

And is available at:

class compiler_gym.envs.llvm.datasets.TensorFlowDataset(site_data_base: pathlib.Path, sort_order: int = 0)[source]

Miscellaneous

compiler_gym.envs.llvm.compute_observation(observation_space: compiler_gym.views.observation_space_spec.ObservationSpaceSpec, bitcode: pathlib.Path, timeout: float = 300)[source]

Compute an LLVM observation.

This is a utility function that uses a standalone C++ binary to compute an observation from an LLVM bitcode file. It is intended for use cases where you want to compute an observation without the overhead of initializing a full environment.

Example usage:

>>> import compiler_gym
>>> from compiler_gym.envs import llvm
>>> from pathlib import Path
>>> env = compiler_gym.make("llvm-v0")
>>> space = env.observation.spaces["Ir"]
>>> bitcode = Path("bitcode.bc")
>>> observation = llvm.compute_observation(space, bitcode, timeout=30)

Warning

This is not part of the core CompilerGym API and may change in a future release.

Parameters
  • observation_space – The observation that is to be computed.

  • bitcode – The path of an LLVM bitcode file.

  • timeout – The maximum number of seconds to allow the computation to run before timing out.

Raises
  • ValueError – If computing the observation fails.

  • TimeoutError – If computing the observation times out.

  • FileNotFoundError – If the given bitcode does not exist.