ReFrame Tutorial

This tutorial will cover the basic concepts of ReFrame and will get you started with the framework. For more specific topics, you should refer to “ReFrame How Tos” as well as to the “Advanced Topics” for an in-depth understanding of some of the framework’s concepts.

Requirements

To run this tutorial you need Docker for the local examples and Docker Compose for the examples emulating a Slurm cluster. Note that the Docker daemon must be running.

The tutorial container images already come with the latest ReFrame version installed. For installing a stand-alone version of ReFrame, please refer to the “Getting Started” guide.

All tutorial examples are located under the reframe-examples directory inside the container’s working directory.

Running the local examples

To run the local examples, launch the single-node tutorial container, bind-mounting the examples:

git clone https://github.com/reframe-hpc/reframe.git
cd reframe
docker build -t reframe-tut-singlenode:latest -f examples/tutorial/dockerfiles/singlenode.Dockerfile .
docker run -h myhost -it --mount type=bind,source=$(pwd)/examples/,target=/home/user/reframe-examples reframe-tut-singlenode:latest /bin/bash

Running the multi-node examples

To run the multi-node examples, you first need to launch a Slurm pseudo cluster using the provided Docker compose file:

git clone https://github.com/reframe-hpc/reframe.git
cd reframe
docker compose --project-directory=$(pwd) -f examples/tutorial/dockerfiles/slurm-cluster/docker-compose.yml up --abort-on-container-exit --exit-code-from frontend

Once the Docker compose stack is up, execute the following from a different terminal window in order to “log in” to the frontend container:

docker exec -it $(docker ps -f name=frontend -q) /bin/bash

# Inside the container
cd reframe-examples/tutorial/

Once done, press Ctrl-D in the frontend container and Ctrl-C in the Docker compose console window.

Note

All examples use the single-node container unless otherwise noted.

Modifying the examples

In both cases, the tutorial examples are bind-mounted in the container, so you can make changes directly on your host and they will be reflected inside the container, and vice versa.

Writing your first test

We will start with the STREAM benchmark. This is a standard benchmark for measuring DRAM bandwidth. The tutorial container already contains a pre-compiled OpenMP version of the benchmark. Our test will run the STREAM executable, validate its output and extract its figures of merit. Here is the full ReFrame test:

../examples/tutorial/stream/stream_runonly.py
import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class stream_test(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'stream.x'

    @sanity_function
    def validate(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @performance_function('MB/s')
    def triad_bw(self):
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)

ReFrame tests are specially decorated classes that ultimately derive from the RegressionTest class. Since we only want to run an executable in this first test, we derive from the RunOnlyRegressionTest class, which essentially short-circuits the “compile” stage of the test. The @simple_test decorator registers a test class with the framework and makes it available for running.

Every ReFrame test must define the valid_systems and valid_prog_environs variables. These describe the test’s constraints and the framework will automatically filter out tests on systems and environments that do not match the constraints. We will describe the system and environments abstractions later in this tutorial. For this first example, the * symbol denotes that this test is valid for any system or environment. A RunOnlyRegressionTest must also define an executable to run.

A test must also define a validation function, which is decorated with the @sanity_function decorator. This function will be used to validate the test’s output after it has finished. ReFrame, by default, makes no assumption about whether a test is successful or not; it is the test’s responsibility to define its validation. The framework provides a rich set of utility functions that help match patterns and extract values from the test’s output. The stdout here refers to the name of the file where the test’s standard output is stored.

Finally, a test may optionally define a set of performance functions that extract figures of merit for the test. These are simple test methods decorated with the @performance_function decorator that return the figure of merit. In this example, we extract the Copy and Triad bandwidth values and convert them to float. These figures of merit, or performance variables in ReFrame’s nomenclature, receive special treatment: they are logged in the test’s performance logs and a reference value per system may also be assigned to them. If that reference value is not met within some user-defined thresholds, the test will fail.

Running a test

Running our test is very straightforward:

Run in the single-node container.
cd reframe-examples/tutorial
reframe -c stream/stream_runonly.py -r

The -c option defines the check search path and can be specified multiple times. It specifies the locations, either directories or files, where ReFrame will look for tests. In this case, we simply pass the path to our test file. The -r option instructs ReFrame to run the selected tests:

[ReFrame Setup]
  version:           4.5.0-dev.1
  command:           '/usr/local/share/reframe/bin/reframe -c stream/stream_runonly.py -r'
  launched by:       user@myhost
  working directory: '/home/user'
  settings files:    '<builtin>'
  check search path: '/home/user/reframe-examples/tutorial/stream/stream_runonly.py'
  stage directory:   '/home/user/stage'
  output directory:  '/home/user/output'
  log files:         '/tmp/rfm-mzynqhye.log'

[==========] Running 1 check(s)
[==========] Started on Mon Nov 27 20:55:17 2023

[----------] start processing checks
[ RUN      ] stream_test /2e15a047 @generic:default+builtin
[       OK ] (1/1) stream_test /2e15a047 @generic:default+builtin
P: copy_bw: 19538.4 MB/s (r:0, l:None, u:None)
P: triad_bw: 14883.4 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Mon Nov 27 20:55:25 2023
Log file(s) saved in '/tmp/rfm-mzynqhye.log'

The verbosity of the output can be increased using the -v option or decreased using the -q option. By default, a log file is generated in the system’s temporary directory that contains detailed debug information. The Logging section describes in more detail how logging can be configured. Once a performance test finishes, its figures of merit are printed immediately using the P: prefix. This can be suppressed by increasing the level at which this information is logged using the RFM_PERF_INFO_LEVEL environment variable.

Run reports and performance logging

Once a test session finishes, ReFrame generates a detailed JSON report under $HOME/.reframe/reports. Every time ReFrame is run, a new report is generated automatically. The latest one is always symlinked as latest.json, unless the --report-file option is given.

For performance tests, in particular, an additional CSV file is generated with all the relevant information. These files are located by default under perflogs/<system>/<partition>/<testname>.log. In our example, this translates to perflogs/generic/default/stream_test.log. The information that is being logged is fully configurable and we will cover this in the Logging section.

Finally, you can also use the --performance-report option, which prints a summary of the results of the performance tests that have run in the current session:

[stream_test /2e15a047 @generic:default:builtin]
  num_tasks: 1
  performance:
    - copy_bw: 22704.4 MB/s (r: 0 MB/s l: -inf% u: +inf%)
    - triad_bw: 16040.9 MB/s (r: 0 MB/s l: -inf% u: +inf%)

Inspecting the test artifacts

When ReFrame executes tests, it first copies all of the test resources (if any) to a stage directory, from which it executes the test. Upon successful execution, the test artifacts are copied to the output directory for archiving. The default artifacts for every test are the generated test script as well as the test’s standard output and standard error. The default locations for the stage and output directories are ./stage and ./output, respectively. These can be changed with the -s and -o options or the more general --prefix option. The test artifacts of our first example can be found in the following location:

Run in the single-node container.
ls output/generic/default/builtin/stream_test/
rfm_job.err  rfm_job.out  rfm_job.sh

The rfm_job.sh file is the actual test script that was generated and executed and, as you can see, it is quite simple in this case:

#!/bin/bash
stream.x

Inspecting test failures

When a test fails, ReFrame will not move its artifacts to the output directory and will keep everything inside the stage directory. For each failed test, a summary will be printed at the end that contains details about the reason for the failure and the location of the test’s stage directory. Here is an example failure that we induced artificially by changing the validation regular expression:

FAILURE INFO for stream_test (run: 1/1)
  * Description:
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /home/user/stage/generic/default/builtin/stream_test
  * Node list: myhost
  * Job type: local (id=19)
  * Dependencies (conceptual): []
  * Dependencies (actual): []
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /2e15a047 -p builtin --system generic:default -r'
  * Reason: sanity error: pattern 'Slution Validates' not found in 'rfm_job.out'
--- rfm_job.out (first 10 lines) ---
-------------------------------------------------------------
STREAM version $Revision: 5.10 $
-------------------------------------------------------------
This system uses 8 bytes per array element.
-------------------------------------------------------------
Array size = 100000000 (elements), Offset = 0 (elements)
Memory per array = 762.9 MiB (= 0.7 GiB).
Total memory required = 2288.8 MiB (= 2.2 GiB).
Each kernel will be executed 10 times.
 The *best* time for each kernel (excluding the first iteration)
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
--- rfm_job.err ---

Adding performance references

For each performance variable defined in the test, we can add a reference value and set thresholds of acceptable variations. Here is an example for our STREAM benchmark:

@rfm.simple_test
class stream_test(rfm.RunOnlyRegressionTest):
    ...
    reference = {
        'myhost:baseline': {
            'copy_bw': (23_890, -0.10, 0.30, 'MB/s'),
            'triad_bw': (17_064, -0.05, 0.50, 'MB/s'),
        }
    }

The reference test variable is a multi-level dictionary that defines the expected performance for each of the test’s performance variables on all supported systems. Not every performance variable and not every system needs to have a reference. If a reference value is not found, the obtained performance will be logged, but no performance validation will be performed. The reference value is essentially a three- or four-element tuple of the form (target_perf, lower_thres, upper_thres, unit). The unit is optional, as it is already defined in the @performance_function definitions. The lower and upper thresholds are deviations from the target reference expressed as fractional numbers. In our example, we allow copy_bw to be up to 10% lower than the target reference and no more than 30% higher. Sometimes, especially in microbenchmarks, it is good practice to set an upper threshold to denote an absolute maximum that cannot be exceeded.
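
To make the thresholds concrete, here is a quick calculation in plain Python (not part of the test itself) of the acceptable window for copy_bw given the reference above:

target, lower, upper = 23_890, -0.10, 0.30
print(target * (1 + lower), target * (1 + upper))   # 21501.0 31057.0

Any measured copy_bw outside this window would make the test fail its performance check.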

Dry-run mode

ReFrame also provides a dry-run mode for tests, which can be enabled by passing --dry-run as the action option (instead of -r, which runs the tests). In this mode, ReFrame will generate the test script to be executed in the stage directory, but it will neither run the test nor perform the sanity and performance checking, nor will it attempt to extract any figures of merit. Tests can also modify their behaviour if run in dry-run mode by calling the is_dry_run() method. Here is an example dry-run of our first version of the STREAM benchmark:
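
As a minimal, hypothetical sketch of the latter, a hook could skip an expensive preparation step when the test is only dry-run:

    @run_before('run')
    def generate_input(self):
        # Skip the (potentially expensive) input generation in dry-run mode
        if self.is_dry_run():
            return

        # ... generate the benchmark input here ...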

Run in the single-node container.
reframe -c stream/stream_runonly.py --dry-run
[==========] Running 1 check(s)
[==========] Started on Wed Jan 10 22:45:49 2024+0000

[----------] start processing checks
[ DRY      ] stream_test /2e15a047 @generic:default+builtin
[       OK ] (1/1) stream_test /2e15a047 @generic:default+builtin
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Jan 10 22:45:49 2024+0000

Note that the RUN message is replaced by DRY in the dry-run mode. You can also check the generated test script in stage/generic/default/builtin/stream_test/rfm_job.sh.

Systems and environments

The first version of our STREAM test assumes that the environment it runs in already provides the benchmark executable. This is totally fine if we want to run a set of run-only tests in a single environment, but it can become a maintenance burden if we want to run our test on different systems and environments. In this section, we will introduce how we can define system and environment configurations in ReFrame and match them to tests.

For ReFrame, a system is an abstraction of an HPC system that is managed by a workload manager. A system can comprise multiple partitions, which are collections of nodes with similar characteristics. It is entirely up to the user how to define the system partitions.

An environment is an abstraction of the environment where a test will run and it is a collection of environment variables, environment modules and compiler definitions. The following picture depicts this architecture.

[Figure: ReFrame’s system architecture]

Tests are associated with systems and environments through their valid_systems and valid_prog_environs variables.

Let’s limit the scope of our test by making it require a specific environment, since running it requires an environment that provides STREAM. We can do that simply by setting the valid_prog_environs as follows:

self.valid_prog_environs = ['+stream']

This tells ReFrame that this test is valid only for environments that define the stream feature. If we try to run the test now, nothing will be run:

Run in the single-node container.
reframe -c stream/stream_runonly.py -r
[  PASSED  ] Ran 0/0 test case(s) from 0 check(s) (0 failure(s), 0 skipped, 0 aborted)

This happens because ReFrame by default defines a generic system and environment. You may have noticed in our first run the @generic:default+builtin notation printed after the test name. This is the system partition name (generic:default) and the environment name (builtin) where the test is run. The generic system and the builtin environment come predefined in ReFrame. They make the minimum possible assumptions:

  • The generic system defines a single partition named default, which launches test jobs locally.

  • The builtin environment assumes only that the cc compiler is available.

Note

ReFrame will not complain if a compiler is not installed until your test tries to build something.

Let’s define our own system and baseline environment in a ReFrame configuration file (reframe-examples/tutorial/config/baseline.py):

../examples/tutorial/config/baseline.py

site_configuration = {
    'systems': [
        {
            'name': 'tutorialsys',
            'descr': 'Example system',
            'hostnames': ['myhost'],
            'partitions': [
                {
                    'name': 'default',
                    'descr': 'Example partition',
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['baseline']
                }
            ]
        }
    ],
    'environments': [
        {
            'name': 'baseline',
            'features': ['stream']
        }
    ]
}

This configuration defines a system named tutorialsys with a single partition named default and an environment named baseline. Let’s look at some key elements of the configuration:

  • Each system, partition and environment requires a unique name. The name must contain only alphanumeric characters, underscores or dashes.

  • The hostnames option defines a set of hostname patterns which ReFrame will try to match against the current system’s hostname. The first matching system will become the current system and ReFrame will load the corresponding configuration.

  • The scheduler partition option defines the job scheduler backend to use on this partition. ReFrame supports many job schedulers. The local scheduler that we use here is the simplest one: it simply spawns a process that executes the generated test script.

  • The launcher partition option defines the parallel launcher to use for spawning parallel programs. ReFrame supports all the major parallel launchers.

  • The environs partition option is a list of environments to test on this partition. Their definitions are resolved in the environments section.

  • Every partition and environment can define a set of arbitrary features or key/value pairs in the features and extras options, respectively. ReFrame will try to match system partitions and environments to a test based on the test’s specification in valid_systems and valid_prog_environs (see the sketch below).
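
For example, here is a minimal sketch of feature-based selection inside a test; the gpu and openmp feature names are illustrative and would have to exist in your configuration:

    valid_systems = ['+gpu']
    valid_prog_environs = ['+openmp']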

There are many options that can be defined for systems, partitions and environments. We will cover several of them as we go through the tutorial, but for the complete reference you should refer to the Configuration Reference.

Note

ReFrame supports splitting the configuration into multiple files that can be loaded simultaneously. In fact, the builtin configuration is always loaded, therefore the generic system as well as the builtin environment are always defined. Additionally, the builtin configuration provides a baseline logging configuration that should cover a wide range of use cases. See Managing the configuration for more details.

Let’s try running the constrained version of our STREAM test with the configuration file that we have just created:

Run in the single-node container.
reframe -C config/baseline.py -c stream/stream_runonly.py -r
[ReFrame Setup]
  version:           4.5.0-dev.1
  command:           '/usr/local/share/reframe/bin/reframe -C config/baseline.py -c stream/stream_runonly.py -r'
  launched by:       user@myhost
  working directory: '/home/user'
  settings files:    '<builtin>', 'reframe-examples/tutorial/config/baseline.py'
  check search path: '/home/user/reframe-examples/tutorial/stream/stream_runonly.py'
  stage directory:   '/home/user/stage'
  output directory:  '/home/user/output'
  log files:         '/tmp/rfm-dz8m5nfz.log'

<...>

[----------] start processing checks
[ RUN      ] stream_test /2e15a047 @tutorialsys:default+baseline
[       OK ] (1/1) stream_test /2e15a047 @tutorialsys:default+baseline
P: copy_bw: 23135.4 MB/s (r:0, l:None, u:None)
P: triad_bw: 16600.5 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)

The -C option specifies the configuration file that ReFrame should load. Note that ReFrame has loaded two configuration files: first the <builtin> and then the one we supplied.

Note also that the system and environment specification in the test run output is now @tutorialsys:default+baseline. ReFrame has determined that the default partition and the baseline environment satisfy the test constraints and thus it has run the test with this partition/environment combination.

Compiling the test code

You can also use ReFrame to compile the test’s code. To demonstrate this, we will write a different test version of the STREAM benchmark that will also compile the benchmark’s source code.

../examples/tutorial/stream/stream_build_run.py
@rfm.simple_test
class stream_build_test(rfm.RegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['+openmp']
    build_system = 'SingleSource'
    sourcepath = 'stream.c'
    executable = './stream.x'

    @run_before('compile')
    def prepare_build(self):
        omp_flag = self.current_environ.extras.get('omp_flag')
        self.build_system.cflags = ['-O3', omp_flag]

    @sanity_function
    def validate(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @performance_function('MB/s')
    def triad_bw(self):
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)

The key difference in this test is that it derives from RegressionTest instead of RunOnlyRegressionTest and specifies how the test’s code should be built. ReFrame uses a build system abstraction for building source code. Based on the build system backend used, it will emit the appropriate build instructions. All the major build systems are supported, as well as the EasyBuild build automation tool and the Spack package manager.

In this case, we use the SingleSource build system, which is suitable for compiling a single source file. The sourcepath variable is used to specify the source file to compile. The path is relative to the test’s resource directory.
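
As a hedged sketch of an alternative, assuming the benchmark shipped with a Makefile instead of a single source file, the Make build system could drive it (the CC=cc option is illustrative):

class stream_make_build(rfm.CompileOnlyRegressionTest):
    build_system = 'Make'

    @run_before('compile')
    def set_build_options(self):
        # Options passed to the make invocation
        self.build_system.options = ['CC=cc']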

Test resources

The resource directory is a directory associated with the test where its static resources are stored. During execution, the contents of this directory are copied to the test’s stage directory and the test executes from there. Here is the directory structure:

stream
├── stream_build_run.py
└── src
    └── stream.c

By default, the test’s resources directory is named src/ and is located next to the test’s file. It can be set to a different location inside the test using the sourcesdir variable.
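
For instance, a minimal sketch (with a hypothetical path) that overrides the default location:

    sourcesdir = 'benchmarks/stream-src'

If sourcesdir points to a Git repository URL, ReFrame will clone it into the stage directory instead.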

Pipeline hooks

The prepare_build() test method in our example is a pipeline hook that will execute just before the compilation phase and will set the compilation flags based on the current environment. Pipeline hooks are a fundamental tool in ReFrame for customizing the test execution. Let’s explain the concept in more detail.

When executed, every test in ReFrame goes through the following stages: (a) setup, (b) compile, (c) run, (d) sanity, (e) performance and (f) cleanup. This is the test pipeline and a test can assign arbitrary functions to run before or after any of these stages using the @run_before and @run_after decorators. There is also a pseudo-stage called init that denotes the instantiation/initialization of the test. The How ReFrame Executes Tests page describes in detail every stage, but the most important stages in terms of the test’s lifetime are the “init” and the “setup” stages.

The “init” stage is where the test object is actually instantiated and for this reason you cannot define pre-init hooks. At this stage, the system partition and the environment where the test will run are not yet determined, therefore the current_partition and current_environ variables are not set. This happens during the “setup” stage, by which point all the test’s dependencies (if any) have also been executed and their resources can be safely accessed (we will cover test dependencies later in this tutorial). Technically, all pipeline hooks could be attached to those two stages, but it is good programming practice to attach them close to the phase that they manipulate, as this makes their intent clearer.
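
As a minimal sketch of what becomes available after the “setup” stage, a post-setup hook can inspect the partition and environment selected for the current test case (the hook name is illustrative):

    @run_after('setup')
    def report_selection(self):
        print(self.current_partition.fullname, self.current_environ.name)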

For a detailed description of the pipeline hook API, you may refer to the Pipeline Hooks guide.

Disabling pipeline hooks

New in version 3.2.

Any pipeline hook can be disabled from the command line using the --disable-hook command line option. This can be useful to temporarily disable a functionality of the test, e.g., a workaround.

You can view the list of all the hooks of a test using the --describe option:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_variables.py --describe | jq .[].pipeline_hooks
{
  "post_setup": [
    "set_executable"
  ],
  "pre_run": [
    "set_num_threads"
  ]
}

We could disable the set_num_threads hook by passing --disable-hook=set_num_threads:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_variables.py --disable-hook=set_num_threads --describe | jq .[].pipeline_hooks
{
  "post_setup": [
    "set_executable"
  ]
}

The --disable-hook option can be passed multiple times to disable multiple hooks at the same time.

Environment features and extras

We have already shown in the first example the use of features in the system partition and environment definitions of the configuration file. These can be used in the valid_systems and valid_prog_environs specifications to help ReFrame pick the right system/environment combinations for running the test.

In addition to features, the configuration of a system partition or environment can include extra properties that can be accessed from the test and also be used as constraints in the valid_systems and valid_prog_environs. The following shows the use of extras in the baseline_environs.py file for defining the compiler flag that enables OpenMP compilation:

../examples/tutorial/config/baseline_environs.py
    'environments': [
        {
            'name': 'baseline',
            'features': ['stream']
        },
        {
            'name': 'gnu',
            'cc': 'gcc',
            'cxx': 'g++',
            'features': ['openmp'],
            'extras': {'omp_flag': '-fopenmp'}
        },
        {
            'name': 'clang',
            'cc': 'clang',
            'cxx': 'clang++',
            'features': ['openmp'],
            'extras': {'omp_flag': '-fopenmp'}
        }
    ]

The extras option is a simple key/value dictionary, where the values can have any type; they are accessible in the test through the current_environ property, as shown in the examples above.

Execution policies

Having explained the key concepts behind compiled tests as well as the test pipeline, it’s time to run our updated test. However, there is still a small tweak that we need to introduce.

ReFrame executes tests concurrently. More precisely, the “compile” and “run” stages of a test execute asynchronously and ReFrame will schedule other tests for running. Once any of those stages finishes, it will resume execution of the test. However, this is problematic for our local benchmarks since ReFrame would schedule the GNU-based and the Clang-based tests concurrently and therefore the tests would exhibit lower performance. For this reason, we will force ReFrame to execute the tests serially with --exec-policy=serial:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_build_run.py --exec-policy=serial -r
[----------] start processing checks
[ RUN      ] stream_build_test /6c084d40 @tutorialsys:default+gnu-11.4.0
[       OK ] (1/2) stream_build_test /6c084d40 @tutorialsys:default+gnu-11.4.0
P: copy_bw: 22273.9 MB/s (r:0, l:None, u:None)
P: triad_bw: 16492.8 MB/s (r:0, l:None, u:None)
[ RUN      ] stream_build_test /6c084d40 @tutorialsys:default+clang-14.0.0
[       OK ] (2/2) stream_build_test /6c084d40 @tutorialsys:default+clang-14.0.0
P: copy_bw: 22747.9 MB/s (r:0, l:None, u:None)
P: triad_bw: 16541.7 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 2/2 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)

Test fixtures

Often a test needs some preparation to be done before it runs, but this preparation may need to run only once per system partition or environment and not every time the test runs. A typical example is when we want to build the code of the test once per environment and reuse the executable in multiple different tests. We can achieve this in ReFrame using test fixtures. Test fixtures are normal ReFrame tests like any other, but they have a scope associated with them and can be fully accessed by the tests that define them. When test A is a fixture of test B, then A will run before B and B will have access not only to anything that A produced, but also to all of its attributes.

Let’s see fixtures in practice by separating our compile-and-run STREAM version into two tests: a compile-only test that simply builds the benchmark and a run-only version that uses the former as a fixture.

../examples/tutorial/stream/stream_fixtures.py
import os
import reframe as rfm
import reframe.utility.sanity as sn


class build_stream(rfm.CompileOnlyRegressionTest):
    build_system = 'SingleSource'
    sourcepath = 'stream.c'
    executable = './stream.x'

    @run_before('compile')
    def prepare_build(self):
        omp_flag = self.current_environ.extras.get('omp_flag')
        self.build_system.cflags = ['-O3', omp_flag]


@rfm.simple_test
class stream_test(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['+openmp']
    stream_binary = fixture(build_stream, scope='environment')

    @run_after('setup')
    def set_executable(self):
        self.executable = os.path.join(self.stream_binary.stagedir, 'stream.x')

    @sanity_function
    def validate(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @performance_function('MB/s')
    def triad_bw(self):
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)

A test fixture is defined with the fixture() builtin:

../examples/tutorial/stream/stream_fixtures.py
    stream_binary = fixture(build_stream, scope='environment')

The first argument is a standard ReFrame test which encompasses the fixture logic and will be executed before the current test. Note that there is no need to decorate a fixture with @simple_test as it will run anyway as part of the test that is using it. You could still decorate it, though, if you would like to run it independently.

Each fixture is associated with a scope, which determines how often it will run. The following scopes are available:

  • session: The fixture will run once for the whole run session.

  • partition: The fixture will run once per system partition.

  • environment: The fixture will run once per system partition and environment combination.

  • test: The fixture will run every time that the calling test is run.

Finally, the fixture() builtin returns a handle which can be used to access the target test once it has finished. This can only be done after the “setup” stage of the current test. Any attribute of the target test can be accessed through the fixture handle and, in our example, we use the target test’s stagedir to construct the final executable.

Note

Compile-only tests do not require a validation check, since the test will fail anyway if the compilation fails. But if one is provided, it will be used.
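
For instance, here is a hedged sketch of such an optional check for our build fixture, failing the build if the compiler printed any warnings:

    @sanity_function
    def assert_no_warnings(self):
        return sn.assert_not_found(r'warning', self.stderr)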

Before running the new test, let’s try to list it first:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_fixtures.py -l
[List of matched checks]
- stream_test /2e15a047
    ^build_stream ~tutorialsys:default+gnu-11.4.0 'stream_binary /2ed36672
    ^build_stream ~tutorialsys:default+clang-14.0.0 'stream_binary /d19d2d86
Found 1 check(s)

We will describe in more detail the listing output later in this tutorial, but at the moment it is enough to show that it gives us all the essential information about the test fixtures: their scope and the test variable that they are bound to. Note also that due to the environment scope, a separate fixture is created for every environment that will be tested.

We can now run the benchmarks in parallel to demonstrate that the execution order of the fixtures is respected:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_fixtures.py -r
[----------] start processing checks
[ RUN      ] build_stream ~tutorialsys:default+gnu-11.4.0 /2ed36672 @tutorialsys:default+gnu-11.4.0
[ RUN      ] build_stream ~tutorialsys:default+clang-14.0.0 /d19d2d86 @tutorialsys:default+clang-14.0.0
[       OK ] (1/4) build_stream ~tutorialsys:default+gnu-11.4.0 /2ed36672 @tutorialsys:default+gnu-11.4.0
[       OK ] (2/4) build_stream ~tutorialsys:default+clang-14.0.0 /d19d2d86 @tutorialsys:default+clang-14.0.0
[ RUN      ] stream_test /2e15a047 @tutorialsys:default+gnu-11.4.0
[ RUN      ] stream_test /2e15a047 @tutorialsys:default+clang-14.0.0
[       OK ] (3/4) stream_test /2e15a047 @tutorialsys:default+gnu-11.4.0
P: copy_bw: 8182.4 MB/s (r:0, l:None, u:None)
P: triad_bw: 9174.3 MB/s (r:0, l:None, u:None)
[       OK ] (4/4) stream_test /2e15a047 @tutorialsys:default+clang-14.0.0
P: copy_bw: 7974.4 MB/s (r:0, l:None, u:None)
P: triad_bw: 18494.1 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 4/4 test case(s) from 3 check(s) (0 failure(s), 0 skipped, 0 aborted)

Note that the two STREAM test cases are still independent of each other, so they run in parallel, hence the lower performance.

We will cover more aspects of the fixtures in the following sections, but you are advised to read the API docs of fixture() for a detailed description of all their capabilities.

Test variables

Tests can define variables that can be set from the command line. These are essentially knobs that allow you to change the test’s behaviour on-the-fly. All the test’s pre-defined attributes that we have seen so far are defined as variables. A test variable is defined with the variable() builtin. Let’s augment our STREAM example by adding a variable to control the number of threads to use.

../examples/tutorial/stream/stream_variables.py
import os
import reframe as rfm
import reframe.utility.sanity as sn


class build_stream(rfm.CompileOnlyRegressionTest):
    build_system = 'SingleSource'
    sourcepath = 'stream.c'
    executable = './stream.x'

    @run_before('compile')
    def prepare_build(self):
        omp_flag = self.current_environ.extras.get('omp_flag')
        self.build_system.cflags = ['-O3', omp_flag]


@rfm.simple_test
class stream_test(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['+openmp']
    stream_binary = fixture(build_stream, scope='environment')
    num_threads = variable(int, value=0)

    @run_after('setup')
    def set_executable(self):
        self.executable = os.path.join(self.stream_binary.stagedir, 'stream.x')

    @run_before('run')
    def set_num_threads(self):
        if self.num_threads:
            self.env_vars['OMP_NUM_THREADS'] = self.num_threads

    @sanity_function
    def validate(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @performance_function('MB/s')
    def triad_bw(self):
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)

We define a new test variable with the following line:

../examples/tutorial/stream/stream_variables.py
    num_threads = variable(int, value=0)

Variables are typed and any attempt to assign them a value of a different type will cause a TypeError. Variables can also have a default value, as in this case, where it is set to 0. If a variable is not given a value, it is considered undefined. Any attempt to read an undefined variable will cause an error. It is not necessary for a variable to be assigned a value along with its declaration; this can happen anytime before it is accessed. Variables are also inherited, which is why we can set the standard variables of a ReFrame test, such as valid_systems, valid_prog_environs etc., in our subclasses.
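
A minimal sketch of both flavours (the variable names are illustrative):

    label = variable(str)                 # no default: undefined until assigned
    num_repeats = variable(int, value=3)  # typed, with a default value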

Variables are accessed inside the test as normal class attributes. In our example, we use the num_threads variable to set the OMP_NUM_THREADS environment variable accordingly.

../examples/tutorial/stream/stream_variables.py
    @run_before('run')
    def set_num_threads(self):
        if self.num_threads:
            self.env_vars['OMP_NUM_THREADS'] = self.num_threads

Variables can be set from the command-line using the -S option as -S var=value:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_variables.py -S num_threads=2 -r

We will not list the command output here, but you could verify that the variable was set by inspecting the generated run script:

Run in the single-node container.
cat output/tutorialsys/default/clang-14.0.0/stream_test/rfm_job.sh
#!/bin/bash
export OMP_NUM_THREADS=2
/home/user/reframe-examples/tutorial/stage/tutorialsys/default/clang-14.0.0/build_stream_d19d2d86/stream.x

Another thing to notice in the output is the following warning:

WARNING: test 'build_stream': the following variables were not set: 'num_threads'

When setting a variable as -S var=value, ReFrame will try to set it on all the selected tests, including any fixtures. If the requested variable is not part of the test, the above warning will be issued. You can scope the variable assignment on the command line by prefixing the variable name with the test’s name, as follows: -S stream_test.num_threads=2. In this case, the num_threads variable will be set only in the stream_test test.

Setting variables in fixtures

As we have already mentioned, fixtures are normal ReFrame tests, so they can also define their own variables. In our example, it makes sense to define a variable in the build_stream fixture to control the size of the arrays involved in the computation. Here is the updated build_stream fixture:

../examples/tutorial/stream/stream_variables_fixtures.py
class build_stream(rfm.CompileOnlyRegressionTest):
    build_system = 'SingleSource'
    sourcepath = 'stream.c'
    executable = './stream.x'
    array_size = variable(int, value=0)

    @run_before('compile')
    def prepare_build(self):
        omp_flag = self.current_environ.extras.get('omp_flag')
        self.build_system.cflags = ['-O3', omp_flag]
        if self.array_size:
            self.build_system.cppflags = [f'-DARRAY_SIZE={self.array_size}']

Note

The cppflags attribute of the build system refers to the preprocessor flags, not the C++ flags, which are set with cxxflags instead.

We can set the array_size variable inside the build fixture of our final test through the fixture handle (remember that the fixture handle name is printed in the test listing). Here is an example:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_variables_fixtures.py --exec-policy=serial -S stream_test.stream_binary.array_size=50000000 -r

If you check the generated build script, you will notice the emitted -D flag:

Run in the single-node container.
cat output/tutorialsys/default/clang-14.0.0/build_stream_d19d2d86/rfm_build.sh
#!/bin/bash

_onerror()
{
    exitcode=$?
    echo "-reframe: command \`$BASH_COMMAND' failed (exit code: $exitcode)"
    exit $exitcode
}

trap _onerror ERR

clang -DARRAY_SIZE=50000000 -O3 -fopenmp stream.c -o ./stream.x

Test parameterization

It is often the case that we want to test different variants of the same test, such as varying the number of tasks in order to perform a scaling analysis on a parallel program. ReFrame offers a powerful multi-dimensional test parameterization mechanism that automatically generates variants of your tests with different parameter values. Let’s elaborate on this using the STREAM example. Suppose we want to scale over the number of threads and also try different thread placements. Here is the updated, parameterized stream_test:

../examples/tutorial/stream/stream_parameters.py
@rfm.simple_test
class stream_test(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['+openmp']
    stream_binary = fixture(build_stream, scope='environment')
    num_threads = parameter([1, 2, 4, 8])
    thread_placement = parameter(['close', 'cores', 'spread'])

    @run_after('setup')
    def set_executable(self):
        self.executable = os.path.join(self.stream_binary.stagedir, 'stream.x')

    @run_before('run')
    def setup_threading(self):
        self.env_vars['OMP_NUM_THREADS'] = self.num_threads
        self.env_vars['OMP_PROC_BIND'] = self.thread_placement

    @sanity_function
    def validate(self):
        return sn.assert_found(r'Solution Validates', self.stdout)

    @performance_function('MB/s')
    def copy_bw(self):
        return sn.extractsingle(r'Copy:\s+(\S+)', self.stdout, 1, float)

    @performance_function('MB/s')
    def triad_bw(self):
        return sn.extractsingle(r'Triad:\s+(\S+)', self.stdout, 1, float)

Parameters are defined in ReFrame using the parameter() builtin. This builtin simply takes a list of values for the parameter being defined. Each parameter is independent and defines a new dimension in the parameterization space. Parameters can also be inherited and filtered from base classes; a short sketch of inheritance follows the run output below. For each point in the final parameterization space, ReFrame will instantiate a different test. In our example, we expect 12 stream_test variants. Given that we have two valid programming environments and a build fixture with an environment scope, we expect ReFrame to generate and run 26 test cases in total (including the fixtures):

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_parameters.py --exec-policy=serial -r
[----------] start processing checks
[ RUN      ] build_stream ~tutorialsys:default+gnu-11.4.0 /2ed36672 @tutorialsys:default+gnu-11.4.0
[       OK ] ( 1/26) build_stream ~tutorialsys:default+gnu-11.4.0 /2ed36672 @tutorialsys:default+gnu-11.4.0
[ RUN      ] build_stream ~tutorialsys:default+clang-14.0.0 /d19d2d86 @tutorialsys:default+clang-14.0.0
[       OK ] ( 2/26) build_stream ~tutorialsys:default+clang-14.0.0 /d19d2d86 @tutorialsys:default+clang-14.0.0
[ RUN      ] stream_test %num_threads=8 %thread_placement=spread /3c8af82c @tutorialsys:default+gnu-11.4.0
[       OK ] ( 3/26) stream_test %num_threads=8 %thread_placement=spread /3c8af82c @tutorialsys:default+gnu-11.4.0
P: copy_bw: 24020.6 MB/s (r:0, l:None, u:None)
P: triad_bw: 15453.1 MB/s (r:0, l:None, u:None)
<...omitted...>
[----------] all spawned checks have finished

[  PASSED  ] Ran 26/26 test case(s) from 14 check(s) (0 failure(s), 0 skipped, 0 aborted)

Note how the fixture mechanism of ReFrame prevents the recompilation of the STREAM source code in every test variant: the source code is compiled only once per toolchain.
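
As mentioned above, parameters are inherited by subclasses. Here is a hedged sketch, with a hypothetical subclass that keeps the num_threads and thread_placement parameters of stream_test and adds a new dimension on top of them:

@rfm.simple_test
class stream_test_schedule(stream_test):
    omp_schedule = parameter(['static', 'dynamic'])

    @run_before('run')
    def set_schedule(self):
        self.env_vars['OMP_SCHEDULE'] = self.omp_schedule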

Parameterizing existing test variables

We can also parameterize a test on any of its existing variables directly from the command line using the -P option. For example, we could parameterize the STREAM version in stream_variables_fixtures.py on num_threads as follows:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_variables_fixtures.py -P num_threads=1,2,4,8 --exec-policy=serial -r

Parameterizing a fixture

Fixtures can also be parameterized. In this case, the tests that use them are implicitly parameterized as well. Let’s see an example by parameterizing the build fixture of the STREAM benchmark, adding a parameter for the element type (float or double):

../examples/tutorial/stream/stream_parameters_fixtures.py
class build_stream(rfm.CompileOnlyRegressionTest):
    build_system = 'SingleSource'
    sourcepath = 'stream.c'
    executable = './stream.x'
    array_size = variable(int, value=0)
    elem_type = parameter(['float', 'double'])

    @run_before('compile')
    def prepare_build(self):
        omp_flag = self.current_environ.extras.get('omp_flag')
        self.build_system.cflags = ['-O3', omp_flag]
        if self.array_size:
            self.build_system.cppflags = [f'-DARRAY_SIZE={self.array_size}',
                                          f'-DELEM_TYPE={self.elem_type}']

As expected, parameters in fixtures are no different from parameters in a normal test. The difference shows when you list or run the final stream_test, which now has twice as many variants:

Run in the single-node container.
reframe -C config/baseline_environs.py -c stream/stream_parameters_fixtures.py -l
- stream_test %num_threads=8 %thread_placement=spread %stream_binary.elem_type=double /ffbd00f1
    ^build_stream %elem_type=double ~tutorialsys:default+gnu-11.4.0 'stream_binary /099a4f75
    ^build_stream %elem_type=double ~tutorialsys:default+clang-14.0.0 'stream_binary /7bd4e3bb
    <...omitted...>
- stream_test %num_threads=1 %thread_placement=close %stream_binary.elem_type=float /bc1f32c2
    ^build_stream %elem_type=float ~tutorialsys:default+gnu-11.4.0 'stream_binary /2ed36672
    ^build_stream %elem_type=float ~tutorialsys:default+clang-14.0.0 'stream_binary /d19d2d86
Found 24 check(s)

Note that the test variant name now contains the parameter coming from the fixture. In total, 52 test cases (24 tests x 2 environments + 2 fixtures x 2 environments) will be run from this simple combination of parameterized tests!

Pruning the parameterization space

Sometimes parameters are not independent of each other and, as a result, some parameter combinations may be invalid for the test at hand. There are two ways to overcome this:

  1. Skip the test if the parameter combination is invalid.

  2. Use parameter packs.

Let’s see those two methods in practice with a fictitious test. The first method defines two parameters and uses the skip_if() test method to skip the test if the two parameters have the same value. The test will be skipped just after it is initialized and the supplied message will be printed as a warning.

../examples/tutorial/dummy/params.py
@rfm.simple_test
class echo_test_v0(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'echo'
    x = parameter([0, 1])
    y = parameter([0, 1])

    @run_after('init')
    def skip_invalid(self):
        self.skip_if(self.x == self.y, 'invalid parameter combination')

    @run_after('init')
    def set_executable_opts(self):
        self.executable_opts = [f'{self.x}', f'{self.y}']

    @sanity_function
    def validate(self):
        x = sn.extractsingle(r'(\d) (\d)', self.stdout, 1, int)
        y = sn.extractsingle(r'(\d) (\d)', self.stdout, 2, int)
        return sn.and_(sn.assert_eq(x, self.x), sn.assert_eq(y, self.y))

The second method uses a single parameter that packs the valid combinations of the x and y parameters. Note that we also use the optional fmt argument to provide a more compact formatting for the combined parameter. Instead of the skip hook, we simply unpack the combined parameters in the set_executable_opts() hook.

../examples/tutorial/dummy/params.py
@rfm.simple_test
class echo_test_v1(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    executable = 'echo'
    xy = parameter([(0, 1), (1, 0)], fmt=lambda val: f'{val[0]}{val[1]}')

    @run_after('init')
    def set_executable_opts(self):
        self.x, self.y = self.xy
        self.executable_opts = [f'{self.x}', f'{self.y}']

    @sanity_function
    def validate(self):
        x = sn.extractsingle(r'(\d) (\d)', self.stdout, 1, int)
        y = sn.extractsingle(r'(\d) (\d)', self.stdout, 2, int)
        return sn.and_(sn.assert_eq(x, self.x), sn.assert_eq(y, self.y))

The advantage of using parameter packs instead of skipping explicitly the test is that we do not get a warning message and the test is more compact.

Note

In these tests, we also introduced two more utility functions used in sanity checking: and_(), which performs a logical AND of its arguments, and assert_eq(), which asserts that both of its arguments are equal. We could have simply written return x == self.x and y == self.y and the test would still validate, but the utility functions provide more context in case of validation errors. In fact, we could also provide a custom message to be printed in case of errors, which can be helpful in real-world scenarios.
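
As a hedged sketch, the validation above could be rewritten with custom messages, assuming the msg argument with {0}/{1} placeholders referring to the compared values:

        return sn.and_(
            sn.assert_eq(x, self.x, msg='x mismatch: got {0}, expected {1}'),
            sn.assert_eq(y, self.y, msg='y mismatch: got {0}, expected {1}')
        )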

Mastering sanity and performance checking

The sanity and performance checking in our STREAM example are simple, but they do represent the most commonly used patterns. There are cases, however, where we need more elaborate sanity checking or where extracting the performance measure is not so straightforward. The sanity and performance functions (see @sanity_function and @performance_function) allow us to write arbitrary code to perform the task at hand, but there are a couple of things to keep in mind:

  • Both sanity and performance functions execute from the test’s stage directory. All relative paths will be resolved against it.

  • A sanity function must return a boolean or raise a SanityError with a message. Raising a SanityError is the preferred way to denote a sanity error, and this is exactly what the utility sanity functions do.

  • A performance function must return the value of the extracted figure of merit or raise a SanityError in case this is not possible.

Understanding the builtin sanity functions

All the utility functions provided by the framework for sanity checking, as well as the stdout and stderr test attributes, are lazily evaluated: when you call these functions or access these attributes, you are not getting their final value, but instead a special object, named a deferred expression, which is similar in concept to a future or promise. You can include these objects in arbitrary expressions and a new deferred expression will be produced. In fact, both sanity and performance functions can return a deferred expression, which yields a boolean (for sanity) or the figure of merit (for performance) when evaluated. This is what our STREAM sanity and performance functions actually return.

A deferred expression can be evaluated explicitly by calling its evaluate() method or by passing it to the evaluate() utility function; a short sketch of this follows the list below. For example, to retrieve the actual stdout value, we should do self.stdout.evaluate() or sn.evaluate(self.stdout). Deferred expressions are evaluated implicitly in the following situations:

  1. When trying to iterate over them in a for loop.

  2. When trying to include them in an if expression.

  3. When calling str() on them.
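
Here is the sketch referred to above: a hook (the name is illustrative) that evaluates a deferred expression explicitly after the run stage, when the standard output file exists:

    @run_after('run')
    def show_stdout_path(self):
        print(sn.evaluate(self.stdout))   # equivalently: self.stdout.evaluate()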

The “Understanding the Mechanism of Deferrable Functions” page contains details about the underlying mechanism of deferred expressions and gives also concrete examples.

Tip

If you are in doubt about the evaluation of a deferred expression, always call evaluate() on it. At the point where the test’s @sanity_function is called, all of the test’s attributes are safe to access.

Note

Why deferred expressions?

In ReFrame versions prior to 3.7, the sanity and performance checking were defined using the sanity_patterns and perf_patterns expressions at the test’s initialization. In that case, a lazily evaluated expression was necessary, since the test had not yet been executed. The use of the sanity_patterns and perf_patterns attributes is still valid today, but it may be deprecated in the future.

Interacting with workload managers

ReFrame integrates with many HPC workload managers (batch job schedulers), including Slurm, PBS Pro, Torque and others. The complete list of scheduler backends can be found here. Tests in ReFrame are scheduler-agnostic in that they do not need to include any scheduler-specific information. Instead, schedulers are associated with system partitions. Each system partition in the configuration file defines the scheduler backend to use, along with any scheduler-specific options that are needed to grant access to the desired nodes.

HPC systems also come with parallel program launchers, which are responsible for launching parallel programs onto multiple nodes. ReFrame supports all major parallel launchers and allows users to easily define their own custom ones. Similarly to batch job schedulers, each system partition is associated with a parallel launcher, which will be used to launch the test’s executable.

In the following, we define a configuration for the Slurm-based pseudo cluster of the tutorial. We will focus only on the new system configuration as the rest of the configuration remains the same.

../examples/tutorial/config/cluster.py
        {
            'name': 'pseudo-cluster',
            'descr': 'Example Slurm-based pseudo cluster',
            'hostnames': ['login'],
            'partitions': [
                {
                    'name': 'login',
                    'descr': 'Login nodes',
                    'scheduler': 'local',
                    'launcher': 'local',
                    'environs': ['gnu', 'clang']
                },
                {
                    'name': 'compute',
                    'descr': 'Compute nodes',
                    'scheduler': 'squeue',
                    'launcher': 'srun',
                    'access': ['-p all'],
                    'environs': ['gnu', 'clang']
                }
            ]
        }

We define two partitions: one named login, where we run tests locally (emulating the login nodes of an HPC cluster), and another named compute (emulating the compute nodes of an HPC cluster), where we submit test jobs with Slurm and srun. We use the squeue scheduler backend, because our Slurm installation does not have job accounting, so we instruct ReFrame to use the squeue command for querying the job state. If your Slurm installation has job accounting enabled, you should prefer the slurm backend, which uses sacct to retrieve the job state and is more reliable.

Another important parameter is access, which denotes the job scheduler options needed to access the desired nodes. In our example, defining it is redundant, since the all partition is the default, but in most real cases you will have to define the access options.

Let’s run our STREAM example with the new configuration:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster.py -c stream/stream_variables_fixtures.py -r
[----------] start processing checks
[ RUN      ] build_stream ~pseudo-cluster:login+gnu /c5e9e6a0 @pseudo-cluster:login+gnu
[ RUN      ] build_stream ~pseudo-cluster:login+clang /d0622327 @pseudo-cluster:login+clang
[ RUN      ] build_stream ~pseudo-cluster:compute+gnu /3f5dbfe2 @pseudo-cluster:compute+gnu
[ RUN      ] build_stream ~pseudo-cluster:compute+clang /78c4801e @pseudo-cluster:compute+clang
[       OK ] (1/8) build_stream ~pseudo-cluster:login+gnu /c5e9e6a0 @pseudo-cluster:login+gnu
[       OK ] (2/8) build_stream ~pseudo-cluster:login+clang /d0622327 @pseudo-cluster:login+clang
[       OK ] (3/8) build_stream ~pseudo-cluster:compute+gnu /3f5dbfe2 @pseudo-cluster:compute+gnu
[       OK ] (4/8) build_stream ~pseudo-cluster:compute+clang /78c4801e @pseudo-cluster:compute+clang
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:login+gnu
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:login+clang
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:compute+gnu
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:compute+clang
[       OK ] (5/8) stream_test /2e15a047 @pseudo-cluster:login+gnu
P: copy_bw: 9062.2 MB/s (r:0, l:None, u:None)
P: triad_bw: 8344.9 MB/s (r:0, l:None, u:None)
[       OK ] (6/8) stream_test /2e15a047 @pseudo-cluster:login+clang
P: copy_bw: 25823.0 MB/s (r:0, l:None, u:None)
P: triad_bw: 12732.2 MB/s (r:0, l:None, u:None)
[       OK ] (7/8) stream_test /2e15a047 @pseudo-cluster:compute+clang
P: copy_bw: 11215.5 MB/s (r:0, l:None, u:None)
P: triad_bw: 7960.5 MB/s (r:0, l:None, u:None)
[       OK ] (8/8) stream_test /2e15a047 @pseudo-cluster:compute+gnu
P: copy_bw: 10300.7 MB/s (r:0, l:None, u:None)
P: triad_bw: 9647.1 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 8/8 test case(s) from 5 check(s) (0 failure(s), 0 skipped, 0 aborted)

Note how the test runs on every partition and environment combination. For the login partition the generated script is the same as for local execution, whereas for the compute partition ReFrame generates a job script, which it submits with sbatch:

Run with the Docker compose setup.
cat /scratch/rfm-stage/output/pseudo-cluster/compute/gnu/stream_test/rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_stream_test"
#SBATCH --ntasks=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH -p all
srun /scratch/rfm-stage/stage/pseudo-cluster/compute/gnu/build_stream_3f5dbfe2/stream.x

You may have noticed that we use the --prefix option when running ReFrame this time. This option changes the prefix of the stage and output directories. All scheduler backends, except ssh, require the test’s stage directory to be shared between the local and remote nodes, so we set it to point under the shared /scratch volume.

Note

For running the Slurm-based examples, make sure to follow the instructions in Running the multi-node examples for bringing up and accessing this cluster.

Selecting specific partitions or environments to run

ReFrame can generate many test cases if you have many partitions and environments, so you will most likely need to scope down the test space. You can use the --system and -p options to restrict a test to a single partition and/or a single environment. To run only the GCC tests on the compute partition, you could do the following:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster.py -c stream/stream_variables_fixtures.py \
        --system=pseudo-cluster:compute -p gnu -r

Compiling remotely

By default, ReFrame compiles the test’s source code locally on the node where it runs. This may be problematic in cases where cross-compilation is not possible and the test’s code needs to be compiled on the remote nodes. You can achieve this by setting the test’s build_locally attribute to False, e.g., with -S build_locally=0. In this case, ReFrame will also generate a job script for the compilation and submit it for execution.
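If you prefer to make remote compilation permanent for a test instead of passing -S on the command line, you can set the attribute in the test body. A minimal sketch, assuming a hypothetical subclass defined next to the build_stream fixture of our STREAM example:

    class build_stream_remote(build_stream):
        # Compile on the remote nodes; same effect as `-S build_locally=0`
        build_locally = False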

Passing additional scheduler options

There are two ways to pass additional options to the backend scheduler: either by modifying the Job instance associated with the test or by defining an extra resource in the partition configuration and requesting it from the test. Let’s see both methods.

Modifying the test job’s options

This method is quite straightforward: you simply define a pre-run hook and set self.job.options. For example, to pass the --mem Slurm option to the job submitted by the test, you could do the following:

@run_before('run')
def set_mem_constraint(self):
    self.job.options = ['--mem=1000']

The advantage of this method is its simplicity, but it injects system-specific information into the test, tying it to the system’s scheduler. You can make the test more robust, however, by restricting it to system partitions that use Slurm, setting valid_systems accordingly:

valid_systems = [r'%scheduler=slurm', r'%scheduler=squeue']

Defining extra scheduler resources

This method comprises two steps. First, we need to define a resource in the partition configuration:

../examples/tutorial/config/cluster_resources.py
                    'resources': [
                        {
                            'name': 'memory',
                            'options': ['--mem={size}']
                        }
                    ]

Each resource has a name and a list of scheduler options that will be emitted in the job script when the resource is requested by the test. The scheduler options specification can contain placeholders that will be filled in from the test.

Now we can use this resource in the test simply by setting its extra_resources:

extra_resources = {
    'memory': {'size': '1000'}
}

The advantage of this method is that it is completely scheduler-agnostic. If the system partition where the test runs does not define the resource, the request is simply ignored.
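To illustrate the placeholders a bit more, here is a hedged sketch of a hypothetical gpu resource with two placeholders, defined next to the memory resource, and how a test would request it; the option syntax and values are made up:

    'resources': [
        {
            'name': 'memory',
            'options': ['--mem={size}']
        },
        {
            'name': 'gpu',
            'options': ['--gres=gpu:{model}:{num_gpus}']
        }
    ]

    # In the test
    extra_resources = {
        'memory': {'size': '1000'},
        'gpu': {'model': 'a100', 'num_gpus': '2'}
    }

Each placeholder name in the options corresponds to a key of the inner dictionary passed in extra_resources.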

Both methods of setting additional job options are valid and you may use whichever of the two best fits your use case.

Modifying the launch command

Sometimes it’s useful to modify the launch command itself by prepending another program, such as a debugger or profiler. You can achieve this by setting the modifier and modifier_options of the test job’s launcher:

@run_before('run')
def run_with_gdb(self):
    self.job.launcher.modifier = 'gdb'
    self.job.launcher.modifier_options = ['-batch', '-ex run', '-ex bt', '--args']

Replacing the launch command

Sometimes you may want to completely replace the launcher associated with the partition where the test will run. You can do that with the following hook:

from reframe.core.backends import getlauncher
...

@run_before('run')
def replace_launcher(self):
    self.job.launcher = getlauncher('local')()

The getlauncher() utility function returns the type that implements the launcher with the given name. The supported launcher names are those registered with the framework, i.e., all the names listed here as well as any user-registered launchers. Once we have the launcher type, we instantiate it and replace the job’s launcher.

Multiple job steps

A job step is a command launched with the parallel launcher. ReFrame will launch only the executable as a job step. You can launch multiple job steps by leveraging the prerun_cmds or postrun_cmds test attributes. These are commands to be executed before or after the main executable and, normally, they are not job steps: they are plain Bash commands. However, you can use the reframe.core.launcher.JobLauncher API to emit the parallel launch command and convert such a command into a job step, as shown in the following example:

../examples/tutorial/stream/stream_multistep.py
    @run_before('run')
    def hostname_step(self):
        launch_cmd = self.job.launcher.run_command(self.job)
        self.prerun_cmds = [f'{launch_cmd} hostname']

Here we invoke the job launcher’s run_command() method, which is responsible for emitting the launcher prefix based on the current partition.
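The same technique works for commands that should run after the main executable. A minimal sketch, assuming we simply want to launch hostname once more as a post-run job step:

    @run_before('run')
    def hostname_poststep(self):
        launch_cmd = self.job.launcher.run_command(self.job)
        self.postrun_cmds = [f'{launch_cmd} hostname']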

Generally, ReFrame generates the job shell scripts using the following pattern:

#!/bin/bash -l
{job_scheduler_preamble}
{prepare_cmds}
{env_load_cmds}
{prerun_cmds}
{parallel_launcher} {executable} {executable_opts}
{postrun_cmds}

The job_scheduler_preamble contains the backend job scheduler directives that control the job allocation. The prepare_cmds are commands that can be emitted before the test environment commands; these can be specified with the prepare_cmds partition configuration option. The env_load_cmds are the commands necessary for setting up the environment of the test, including any modules or environment variables set at the system partition level or at the test level. Then the commands specified in prerun_cmds follow, while those specified in postrun_cmds come after the launch of the parallel job. The parallel launch itself consists of three parts:

  1. The parallel launcher program (e.g., srun, mpirun etc.) with its options,

  2. the test executable as specified in the executable attribute and

  3. the options to be passed to the executable as specified in the executable_opts attribute.

Accessing CPU topology information

New in version 3.7.

Sometimes a test may need to access processor topology information for the partition it runs on, so as to better set up the run. Of course, you could hard-code the information in the test, but that would not be portable. ReFrame auto-detects the local host topology and it can also auto-detect the topology of remote hosts. It makes this information available to the test through the current_partition’s processor attribute.

Let’s use this feature to set the number of threads of our STREAM benchmark to the host’s number of cores, if it is not defined otherwise.

../examples/tutorial/stream/stream_cpuinfo.py
    @run_before('run')
    def set_num_threads(self):
        if not self.num_threads:
            self.skip_if_no_procinfo()
            proc = self.current_partition.processor
            self.num_threads = proc.num_cores

        self.env_vars['OMP_NUM_THREADS'] = self.num_threads

Note also the use of the skip_if_no_procinfo() function, which will cause ReFrame to skip the test if no processor information is available.
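The processor object exposes more topology attributes than just num_cores. The following is a hedged sketch, assuming the num_sockets and num_cores_per_socket attributes (check the processor configuration reference for the exact list), that places one task per socket; our STREAM example, though, only needs the OMP_NUM_THREADS setup shown above:

    @run_before('run')
    def set_task_placement(self):
        self.skip_if_no_procinfo()
        proc = self.current_partition.processor
        # One task per socket, each task using all the cores of its socket
        self.num_tasks_per_node = proc.num_sockets
        self.num_cpus_per_task = proc.num_cores_per_socket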

Let’s try running the test on our pseudo-cluster:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster.py -c stream/stream_cpuinfo.py -p gnu -r
[==========] Running 3 check(s)
[==========] Started on Mon Feb 12 21:55:54 2024+0000

[----------] start processing checks
[ RUN      ] build_stream ~pseudo-cluster:login+gnu /c5e9e6a0 @pseudo-cluster:login+gnu
[ RUN      ] build_stream ~pseudo-cluster:compute+gnu /3f5dbfe2 @pseudo-cluster:compute+gnu
[       OK ] (1/4) build_stream ~pseudo-cluster:login+gnu /c5e9e6a0 @pseudo-cluster:login+gnu
[       OK ] (2/4) build_stream ~pseudo-cluster:compute+gnu /3f5dbfe2 @pseudo-cluster:compute+gnu
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:login+gnu
[ RUN      ] stream_test /2e15a047 @pseudo-cluster:compute+gnu
[     SKIP ] (3/4) no topology information found for partition 'pseudo-cluster:compute'
[       OK ] (4/4) stream_test /2e15a047 @pseudo-cluster:login+gnu
P: copy_bw: 36840.6 MB/s (r:0, l:None, u:None)
P: triad_bw: 18338.8 MB/s (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 3/4 test case(s) from 3 check(s) (0 failure(s), 1 skipped, 0 aborted)

Indeed, for the login partition, the generated script contains the correct number of threads:

Run with the Docker compose setup.
 cat /scratch/rfm-stage/output/pseudo-cluster/login/gnu/stream_test/rfm_job.sh
#!/bin/bash
export OMP_NUM_THREADS=8
/scratch/rfm-stage/stage/pseudo-cluster/login/gnu/build_stream_c5e9e6a0/stream.x

As you may have noticed, however, the compute partition was skipped because no topology information was found. By default, ReFrame does not try to auto-detect remote partitions, because this could be time-consuming. To enable remote host auto-detection, we should set the RFM_REMOTE_DETECT environment variable or the equivalent remote_detect configuration option.

Run with the Docker compose setup.
RFM_REMOTE_WORKDIR=/scratch/rfm-stage RFM_REMOTE_DETECT=1 reframe --prefix=/scratch/rfm-stage/ -C config/cluster.py -c stream/stream_cpuinfo.py -p gnu -r
...
Detecting topology of remote partition 'pseudo-cluster:compute': this may take some time...
...
[       OK ] (3/4) stream_test /2e15a047 @pseudo-cluster:compute+gnu
P: copy_bw: 19288.6 MB/s (r:0, l:None, u:None)
P: triad_bw: 15243.0 MB/s (r:0, l:None, u:None)
...

Note

In our setup we also need to set RFM_REMOTE_WORKDIR, since the current volume (/home) is not shared with the head node.

ReFrame caches the result of host auto-detection, so that it avoids re-detecting the topology every time. For a detailed description of the process, refer to the documentation of the processor configuration option.

Device information

ReFrame cannot currently auto-detect device information, such as attached accelerators, NICs etc. You can, however, manually add any device of interest to the configuration and it will be accessible from inside the test through the current_partition. For more information, check the documentation of the devices configuration parameter.
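A hedged sketch of what such a partition entry might look like is shown below; consult the devices configuration reference for the exact schema, as the keys and values here are only indicative:

    'devices': [
        {
            'type': 'gpu',
            'arch': 'sm_80',
            'num_devices': 4
        }
    ]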

Multi-node tests

Multi-node tests are quite straightforward in ReFrame. All you need to do is specify the task setup; the scheduler backend and the parallel launcher will emit the right options.

The following tests download, compile and launch the OSU benchmarks.

../examples/tutorial/mpi/osu.py

import os
import reframe as rfm
import reframe.utility.typecheck as typ
import reframe.utility.sanity as sn


class fetch_osu_benchmarks(rfm.RunOnlyRegressionTest):
    descr = 'Fetch OSU benchmarks'
    version = variable(str, value='7.3')
    executable = 'wget'
    executable_opts = [
        f'http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-{version}.tar.gz'  # noqa: E501
    ]
    local = True

    @sanity_function
    def validate_download(self):
        return sn.assert_eq(self.job.exitcode, 0)


class build_osu_benchmarks(rfm.CompileOnlyRegressionTest):
    descr = 'Build OSU benchmarks'
    build_system = 'Autotools'
    build_prefix = variable(str)
    osu_benchmarks = fixture(fetch_osu_benchmarks, scope='session')

    @run_before('compile')
    def prepare_build(self):
        tarball = f'osu-micro-benchmarks-{self.osu_benchmarks.version}.tar.gz'
        self.build_prefix = tarball[:-7]  # remove .tar.gz extension

        fullpath = os.path.join(self.osu_benchmarks.stagedir, tarball)
        self.prebuild_cmds = [
            f'cp {fullpath} {self.stagedir}',
            f'tar xzf {tarball}',
            f'cd {self.build_prefix}'
        ]
        self.build_system.max_concurrency = 8


class osu_base_test(rfm.RunOnlyRegressionTest):
    '''Base class of OSU benchmarks runtime tests'''

    valid_systems = ['*']
    valid_prog_environs = ['+mpi']
    num_tasks = 2
    num_tasks_per_node = 1
    osu_binaries = fixture(build_osu_benchmarks, scope='environment')
    kind = variable(str)
    benchmark = variable(str)
    metric = variable(typ.Str[r'latency|bandwidth'])

    @run_before('run')
    def prepare_run(self):
        self.executable = os.path.join(
            self.osu_binaries.stagedir,
            self.osu_binaries.build_prefix,
            'c', 'mpi', self.kind, self.benchmark
        )
        self.executable_opts = ['-x', '100', '-i', '1000']

    @sanity_function
    def validate_test(self):
        return sn.assert_found(r'^8', self.stdout)

    def _extract_metric(self, size):
        return sn.extractsingle(rf'^{size}\s+(\S+)', self.stdout, 1, float)

    @run_before('performance')
    def set_perf_vars(self):
        make_perf = sn.make_performance_function
        if self.metric == 'latency':
            self.perf_variables = {
                'latency': make_perf(self._extract_metric(8), 'us')
            }
        else:
            self.perf_variables = {
                'bandwidth': make_perf(self._extract_metric(1048576), 'MB/s')
            }


@rfm.simple_test
class osu_latency_test(osu_base_test):
    descr = 'OSU latency test'
    kind = 'pt2pt/standard'
    benchmark = 'osu_latency'
    metric = 'latency'
    executable_opts = ['-x', '3', '-i', '10']


@rfm.simple_test
class osu_bandwidth_test(osu_base_test):
    descr = 'OSU bandwidth test'
    kind = 'pt2pt/standard'
    benchmark = 'osu_bw'
    metric = 'bandwidth'
    executable_opts = ['-x', '3', '-i', '10']


@rfm.simple_test
class osu_allreduce_test(osu_base_test):
    descr = 'OSU Allreduce test'
    kind = 'collective/blocking'
    benchmark = 'osu_allreduce'
    metric = 'bandwidth'
    executable_opts = ['-m', '8', '-x', '3', '-i', '10']

Notice the assignment of num_tasks and num_tasks_per_node in the base test class osu_base_test. The RegressionTest base class offers many more attributes for specifying the placement of tasks on the nodes.
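For example, a hypothetical variant of the Allreduce test that spreads more tasks across the nodes could set a few of these attributes as follows; the numbers are made up and would need tuning for a real cluster:

    class osu_allreduce_large_test(osu_allreduce_test):
        num_tasks = 16
        num_tasks_per_node = 4
        num_cpus_per_task = 2
        exclusive_access = True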

Unrelated to their multi-node nature, these examples showcase some other interesting aspects of ReFrame tests:

  • Fixtures can use other fixtures.

  • The session scope of the osu_benchmarks fixture makes the fetch_osu_benchmarks test, which downloads the benchmarks, run only once at the beginning of the session. Similarly, the environment scope of the osu_binaries fixture makes the build_osu_benchmarks test execute once per partition and environment combination.

  • Instead of using the @performance_function decorator to define performance variables, we can directly set the perf_variables test attribute. This is useful when we want to generate a test’s performance variables programmatically.

Here is how to execute the tests. Note that we are using another configuration file, which defines an MPI-enabled environment so that we can compile the OSU benchmarks:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster_mpi.py -c mpi/osu.py --exec-policy=serial -r
[----------] start processing checks
[ RUN      ] fetch_osu_benchmarks ~pseudo-cluster /d20db00e @pseudo-cluster:compute+gnu-mpi
[       OK ] (1/5) fetch_osu_benchmarks ~pseudo-cluster /d20db00e @pseudo-cluster:compute+gnu-mpi
[ RUN      ] build_osu_benchmarks ~pseudo-cluster:compute+gnu-mpi /be044b23 @pseudo-cluster:compute+gnu-mpi
[       OK ] (2/5) build_osu_benchmarks ~pseudo-cluster:compute+gnu-mpi /be044b23 @pseudo-cluster:compute+gnu-mpi
[ RUN      ] osu_allreduce_test /63dd518c @pseudo-cluster:compute+gnu-mpi
[       OK ] (3/5) osu_allreduce_test /63dd518c @pseudo-cluster:compute+gnu-mpi
P: bandwidth: 38618.05 MB/s (r:0, l:None, u:None)
[ RUN      ] osu_bandwidth_test /026711a1 @pseudo-cluster:compute+gnu-mpi
[       OK ] (4/5) osu_bandwidth_test /026711a1 @pseudo-cluster:compute+gnu-mpi
P: bandwidth: 144.96 MB/s (r:0, l:None, u:None)
[ RUN      ] osu_latency_test /d2c978ad @pseudo-cluster:compute+gnu-mpi
[       OK ] (5/5) osu_latency_test /d2c978ad @pseudo-cluster:compute+gnu-mpi
P: latency: 12977.31 us (r:0, l:None, u:None)
[----------] all spawned checks have finished

[  PASSED  ] Ran 5/5 test case(s) from 5 check(s) (0 failure(s), 0 skipped, 0 aborted)

Note

The parameters passed to the OSU benchmarks are adapted for the purposes of the tutorial. You should adapt them if running on an actual parallel cluster.

Managing the run session

ReFrame offers a rich command line interface that allows users to manage and execute their test suite. In this section, we will briefly discuss the most important command line options. For a complete reference, users are referred to the Command Line Reference.

Test listing

This is probably the most important action besides the test execution itself. We have already seen the -l option, which lists the tests to be executed. It is always good practice to first list the tests before running them, in order to avoid surprises. By default, -l lists only the tests to be executed:

Run with the Docker compose setup.
reframe -C config/cluster.py -c stream/stream_build_run.py -l
[List of matched checks]
- stream_build_test /6c084d40
Found 1 check(s)

However, on a system with multiple partitions and environments, the test will run on all the supported combinations. In ReFrame’s terminology, these are called test cases. You can list the actual (concretized) test cases that will eventually run with -lC:

Run with the Docker compose setup.
reframe -C config/cluster.py -c stream/stream_build_run.py -lC
[List of matched checks]
- stream_build_test /6c084d40 @pseudo-cluster:login+gnu
- stream_build_test /6c084d40 @pseudo-cluster:login+clang
- stream_build_test /6c084d40 @pseudo-cluster:compute+gnu
- stream_build_test /6c084d40 @pseudo-cluster:compute+clang
Concretized 4 test case(s)

Notice the @pseudo-cluster:login+gnu notation that is appended to each test case: this is the exact combination of partition and environment that the test will run for.

You can also opt for a detailed listing with the -L option, which also accepts the C argument for producing the concretized test cases.

Run with the Docker compose setup.
reframe -C config/cluster.py -c stream/stream_build_run.py -LC
[List of matched checks]
- stream_build_test /6c084d40 @pseudo-cluster:login+gnu [variant: 0, file: '/home/admin/reframe-examples/tutorial/stream/stream_build_run.py']
- stream_build_test /6c084d40 @pseudo-cluster:login+clang [variant: 0, file: '/home/admin/reframe-examples/tutorial/stream/stream_build_run.py']
- stream_build_test /6c084d40 @pseudo-cluster:compute+gnu [variant: 0, file: '/home/admin/reframe-examples/tutorial/stream/stream_build_run.py']
- stream_build_test /6c084d40 @pseudo-cluster:compute+clang [variant: 0, file: '/home/admin/reframe-examples/tutorial/stream/stream_build_run.py']
Concretized 4 test case(s)

This listing prints the test variant (each parameter value in a parameterized test generates a new variant) and the file from where this test was loaded.

There are several parts and symbols in a full test case listing. The following figure explains them all in detail.

_images/test-naming.svg

Test naming scheme.

Test discovery

This is the phase where ReFrame looks for tests and loads them. By default, it looks inside the ./checks directory, unless the -c option is specified. We have already used this option to load tests from specific files. The -c option can be used multiple times to load more files, or its argument may be a directory, in which case all the ReFrame test files found there will be loaded. Note that ReFrame will refuse to load tests with the same name. Finally, this option may be combined with -R to descend recursively into a directory.

Test filtering

ReFrame offers several options for filtering the tests loaded during test discovery. We have already seen the -p option, which selects tests that support a specific environment.

Perhaps the most important filtering option is the -n option, which filters tests by name. Its argument can take different forms that help in different scenarios:

  • It can be a regular expression that will be searched inside the full test’s name, including the parameters. For example, in order to select only the test variants that have the num_threads parameter set to 1 in the stream/stream_parameters_fixtures.py, we can do the following:

    Run with the Docker compose setup.
      reframe -C config/cluster.py -c stream/stream_parameters_fixtures.py -l -n '%num_threads=1'
    
  • It can be of the form /<hash> in which case the exact test with the specified hash will be selected

  • It can be of the form test_name@<variant_num> in which case a specific variant of a parameterized test will be selected.

Another useful filtering option is the -t option, which selects tests by tag. You can assign multiple tags to a test by setting its tags attribute; tags can be used to effectively categorize tests.
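For instance, a test could declare its tags in its class body as a set of strings (the tag names here are made up), and you could then select it with something like -t maintenance:

    # Inside the test's class body
    tags = {'benchmark', 'maintenance'}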

Finally, a powerful selection option is -E. This allows you to filter tests by evaluating an expression over their variables or parameters. For example, we could select all tests with 4 threads and a spread placement as follows:

Run with the Docker compose setup.
reframe -C config/cluster.py -c stream/stream_parameters_fixtures.py -l -E 'num_threads==4 and thread_placement=="spread"'

The syntax of the expression must be valid Python.

Execution modes

ReFrame allows you to group command-line options into so-called execution modes. Execution modes are defined in the modes configuration section and are merely a collection of command-line options that will be passed to ReFrame when the mode is selected with the --mode option. Here is an example:

../examples/tutorial/config/cluster.py
    'modes': [
        {
            'name': 'singlethread',
            'options': ['-E num_threads==1']
        }
    ]

We can now select this mode with --mode=singlethread and get only the tests where num_threads=1. Obviously, modes become more useful when we need to abbreviate many options.

Run with the Docker compose setup.
reframe -C config/cluster.py -c stream/stream_parameters_fixtures.py --mode=singlethread -l

You can use any ReFrame option in an execution mode except the -C and --system options, since these determine the configuration file and the system configuration to load.
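As a slightly richer, purely hypothetical example, a maintenance mode could bundle a tag filter with other options and be limited to a single system; the mode and tag names are made up:

    'modes': [
        {
            'name': 'maintenance',
            'options': [
                '-t maintenance',
                '-p gnu',
                '--prefix=/scratch/rfm-stage'
            ],
            'target_systems': ['pseudo-cluster']
        }
    ]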

Retrying tests

You can instruct ReFrame to retry failing tests with the --max-retries option. The retries happen at the end of the session and not immediately after a test fails. Each retried test is staged in a separate directory. If a test passes on retry, its result is reported as “success”, with a note that it passed in retries.

The --max-retries option has an effect only in the current run session. In order to rerun the failed tests of a previous session, you should instead use the --restore-session --failed options. This is particularly useful when a failed test has long-running dependencies that have succeeded in the previous run. In this case, the dependencies will be restored and only the failed tests will be rerun.

Let’s see an artificial example that uses the following test dependency graph.

_images/deps-complex.svg

Complex test dependency graph. Nodes in red are set to fail.

Tests T2 and T8 are set to fail. Let’s run the whole test DAG:

Run in the single-node container.
cd reframe-examples/tutorial/
reframe -c deps/deps_complex.py -r
[ReFrame Setup]
  version:           4.6.0-dev.2
  command:           '/usr/local/share/reframe/bin/reframe -c deps/deps_complex.py -r --nocolor'
  launched by:       user@myhost
  working directory: '/home/user/reframe-examples/tutorial'
  settings files:    '<builtin>'
  check search path: '/home/user/reframe-examples/tutorial/deps/deps_complex.py'
  stage directory:   '/home/user/reframe-examples/tutorial/stage'
  output directory:  '/home/user/reframe-examples/tutorial/output'
  log files:         '/tmp/rfm-01gkxmq0.log'

[==========] Running 10 check(s)
[==========] Started on Tue Apr 16 21:35:34 2024+0000

[----------] start processing checks
[ RUN      ] T0 /c9c2be9f @generic:default+builtin
[       OK ] ( 1/10) T0 /c9c2be9f @generic:default+builtin
[ RUN      ] T4 /11ee5e9a @generic:default+builtin
[       OK ] ( 2/10) T4 /11ee5e9a @generic:default+builtin
[ RUN      ] T5 /020d01e5 @generic:default+builtin
[       OK ] ( 3/10) T5 /020d01e5 @generic:default+builtin
[ RUN      ] T1 /1f93603d @generic:default+builtin
[       OK ] ( 4/10) T1 /1f93603d @generic:default+builtin
[ RUN      ] T8 /605fc1d6 @generic:default+builtin
[     FAIL ] ( 5/10) T8 /605fc1d6 @generic:default+builtin
==> test failed during 'setup': test staged in '/home/user/reframe-examples/tutorial/stage/generic/default/builtin/T8'
[     FAIL ] ( 6/10) T9 /78a78a4e @generic:default+builtin
==> test failed during 'startup': test staged in None
[ RUN      ] T6 /6dbdaf93 @generic:default+builtin
[       OK ] ( 7/10) T6 /6dbdaf93 @generic:default+builtin
[ RUN      ] T2 /0f617ba9 @generic:default+builtin
[ RUN      ] T3 /5dd67f7f @generic:default+builtin
[     FAIL ] ( 8/10) T2 /0f617ba9 @generic:default+builtin
==> test failed during 'sanity': test staged in '/home/user/reframe-examples/tutorial/stage/generic/default/builtin/T2'
[     FAIL ] ( 9/10) T7 /f005e93d @generic:default+builtin
==> test failed during 'startup': test staged in None
[       OK ] (10/10) T3 /5dd67f7f @generic:default+builtin
[----------] all spawned checks have finished

[  FAILED  ] Ran 10/10 test case(s) from 10 check(s) (4 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Tue Apr 16 21:35:36 2024+0000
================================================================================
SUMMARY OF FAILURES
--------------------------------------------------------------------------------
FAILURE INFO for T8 (run: 1/1)
  * Description: 
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /home/user/reframe-examples/tutorial/stage/generic/default/builtin/T8
  * Node list: 
  * Job type: local (id=None)
  * Dependencies (conceptual): ['T1']
  * Dependencies (actual): [('T1', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: setup
  * Rerun with '-n /605fc1d6 -p builtin --system generic:default -r'
  * Reason: exception
Traceback (most recent call last):
  File "/usr/local/share/reframe/reframe/frontend/executors/__init__.py", line 317, in _safe_call
    return fn(*args, **kwargs)
  File "/usr/local/share/reframe/reframe/core/hooks.py", line 111, in _fn
    getattr(obj, h.__name__)()
  File "/usr/local/share/reframe/reframe/core/hooks.py", line 38, in _fn
    func(*args, **kwargs)
  File "/home/user/reframe-examples/tutorial/deps/deps_complex.py", line 180, in fail
    raise Exception
Exception

--------------------------------------------------------------------------------
FAILURE INFO for T9 (run: 1/1)
  * Description: 
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: None
  * Node list: 
  * Job type: local (id=None)
  * Dependencies (conceptual): ['T8']
  * Dependencies (actual): [('T8', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: startup
  * Rerun with '-n /78a78a4e -p builtin --system generic:default -r'
  * Reason: task dependency error: dependencies failed
--------------------------------------------------------------------------------
FAILURE INFO for T2 (run: 1/1)
  * Description: 
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: /home/user/reframe-examples/tutorial/stage/generic/default/builtin/T2
  * Node list: myhost
  * Job type: local (id=23)
  * Dependencies (conceptual): ['T6']
  * Dependencies (actual): [('T6', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: sanity
  * Rerun with '-n /0f617ba9 -p builtin --system generic:default -r'
  * Reason: sanity error: 31 != 30
--- rfm_job.out (first 10 lines) ---
--- rfm_job.out ---
--- rfm_job.err (first 10 lines) ---
--- rfm_job.err ---
--------------------------------------------------------------------------------
FAILURE INFO for T7 (run: 1/1)
  * Description: 
  * System partition: generic:default
  * Environment: builtin
  * Stage directory: None
  * Node list: 
  * Job type: local (id=None)
  * Dependencies (conceptual): ['T2']
  * Dependencies (actual): [('T2', 'generic:default', 'builtin')]
  * Maintainers: []
  * Failing phase: startup
  * Rerun with '-n /f005e93d -p builtin --system generic:default -r'
  * Reason: task dependency error: dependencies failed
--------------------------------------------------------------------------------
Log file(s) saved in '/tmp/rfm-01gkxmq0.log'

You can restore the run session and run only the failed test cases as follows:

Run in the single-node container.
reframe --restore-session --failed -r

Of course, as expected, the run will fail again, since these tests were designed to fail.

Instead of running the failed test cases of a previous run, you might simply want to rerun a specific test. This has little meaning if you don’t use dependencies, because it would be equivalent to running it separately using the -n option. However, if a test was part of a dependency chain, using --restore-session will not rerun its dependencies, but it will rather restore them. This is useful in cases where the test that we want to rerun depends on time-consuming tests. There is a little tweak, though, for this to work: you need to have run with --keep-stage-files in order to keep the stage directory even for tests that have passed. This is due to two reasons: (a) if a test needs resources from its parents, it will look into their stage directories and (b) ReFrame stores the state of a finished test case inside its stage directory and it will need that state information in order to restore a test case.

Let’s try to rerun the T6 test from the previous test dependency chain:

Run in the single-node container.
reframe -c deps/deps_complex.py --keep-stage-files -r
reframe --restore-session --keep-stage-files -n T6 -r

Notice how only the T6 test was rerun and none of its dependencies, since they were simply restored:

[ReFrame Setup]
  version:           4.6.0-dev.2
  command:           '/usr/local/share/reframe/bin/reframe --restore-session --keep-stage-files -n T6 -r --nocolor'
  launched by:       user@myhost
  working directory: '/home/user/reframe-examples/tutorial'
  settings files:    '<builtin>'
  check search path: '/home/user/reframe-examples/tutorial/deps/deps_complex.py'
  stage directory:   '/home/user/reframe-examples/tutorial/stage'
  output directory:  '/home/user/reframe-examples/tutorial/output'
  log files:         '/tmp/rfm-5nhx1_74.log'

[==========] Running 1 check(s)
[==========] Started on Wed Apr  3 21:40:44 2024+0000

[----------] start processing checks
[ RUN      ] T6 /6dbdaf93 @generic:default+builtin
[       OK ] (1/1) T6 /6dbdaf93 @generic:default+builtin
[----------] all spawned checks have finished

[  PASSED  ] Ran 1/1 test case(s) from 1 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Apr  3 21:40:44 2024+0000
Log file(s) saved in '/tmp/rfm-5nhx1_74.log'

If we tried to run T6 without restoring the session, we would have to rerun also the whole dependency chain, i.e., also T5, T1, T4 and T0.

Run in the single-node container.
reframe -c deps/deps_complex.py -n T6 -r
[ReFrame Setup]
  version:           4.6.0-dev.2
  command:           '/usr/local/share/reframe/bin/reframe -c deps/deps_complex.py -n T6 -r --nocolor'
  launched by:       user@myhost
  working directory: '/home/user/reframe-examples/tutorial'
  settings files:    '<builtin>'
  check search path: '/home/user/reframe-examples/tutorial/deps/deps_complex.py'
  stage directory:   '/home/user/reframe-examples/tutorial/stage'
  output directory:  '/home/user/reframe-examples/tutorial/output'
  log files:         '/tmp/rfm-umx3ijmp.log'

[==========] Running 5 check(s)
[==========] Started on Wed Apr  3 21:41:17 2024+0000

[----------] start processing checks
[ RUN      ] T0 /c9c2be9f @generic:default+builtin
[       OK ] (1/5) T0 /c9c2be9f @generic:default+builtin
[ RUN      ] T4 /11ee5e9a @generic:default+builtin
[       OK ] (2/5) T4 /11ee5e9a @generic:default+builtin
[ RUN      ] T5 /020d01e5 @generic:default+builtin
[       OK ] (3/5) T5 /020d01e5 @generic:default+builtin
[ RUN      ] T1 /1f93603d @generic:default+builtin
[       OK ] (4/5) T1 /1f93603d @generic:default+builtin
[ RUN      ] T6 /6dbdaf93 @generic:default+builtin
[       OK ] (5/5) T6 /6dbdaf93 @generic:default+builtin
[----------] all spawned checks have finished

[  PASSED  ] Ran 5/5 test case(s) from 5 check(s) (0 failure(s), 0 skipped, 0 aborted)
[==========] Finished on Wed Apr  3 21:41:19 2024+0000
Log file(s) saved in '/tmp/rfm-umx3ijmp.log'

Running continuously

You can instruct ReFrame to rerun the whole test session multiple times or for a specific duration, using the --reruns or --duration options. These options will repeat the session once the first round of tests has finished. For example, the following command will run the STREAM benchmark repeatedly for 30 minutes:

Run in the single-node container.
 reframe -c stream_runonly.py --duration=30m -r

Generating tests on-the-fly

ReFrame can generate new tests dynamically from the already selected tests. We have already seen the -P option, which parameterizes the selected tests on a specific variable.

Another very useful test generation option for Slurm-based partitions is --distribute. This distributes the selected tests across all the available idle nodes of their valid system partitions. A separate test variant is created for every available node and is pinned to it. The option also accepts an optional argument to distribute the tests either on all the nodes of the partition, regardless of their state, or only on the nodes that are in a specific state. It can be combined with the -J option to further restrict the node selection, e.g., -J reservation=maint to submit to all the nodes in the maint reservation. The following example will run our STREAM test on all the nodes of our pseudo-cluster:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster.py -c stream/stream_fixtures.py -p gnu --system=pseudo-cluster:compute --distribute=all -r

Note that similarly to the -P option, --distribute parameterizes the leaf tests and not their fixtures, so the build fixture of the STREAM benchmark will be executed once and only the binary will run on every node, which is the desired behavior in our case.

By inspecting the generated script files, you will notice that ReFrame emits the --nodelist option to pin each test to its cluster node:

Run with the Docker compose setup.
cat /scratch/rfm-stage/output/pseudo-cluster/compute/gnu/stream_test_pseudo-cluster_compute_8de19aca/rfm_job.sh
#!/bin/bash
#SBATCH --job-name="rfm_stream_test_pseudo-cluster_compute_8de19aca"
#SBATCH --ntasks=1
#SBATCH --output=rfm_job.out
#SBATCH --error=rfm_job.err
#SBATCH --nodelist=nid01
#SBATCH -p all
srun /scratch/rfm-stage/stage/pseudo-cluster/compute/gnu/build_stream_3f5dbfe2/stream.x

Another useful test generation option is --repeat, which repeats the selected tests a specified number of times. In contrast to the --reruns option explained before, the repeated tests will be executed concurrently. In practice, this option clones the selected tests N times and submits them all at once, unless the serial execution policy is used.

Aborting the test session

ReFrame does careful error handling. If a test fails due to a programming error, or even if it tries to explicitly call sys.exit(), ReFrame will simply mark the test as a failure and continue with the rest of the session.

You can kill the current session gracefully by pressing Ctl-C or sending the SIGINT signal to the ReFrame process. In this case, ReFrame will cancel all spawned jobs, either local or remote, so as to avoid wasting resources.

Another useful option that finishes the session prematurely is --maxfail, which causes ReFrame to stop after a specified number of failures.

Managing the configuration

Adding more systems to the ReFrame configuration file will soon make it quite big. ReFrame can build its final configuration by combining several smaller configuration files. For example, you could maintain a configuration file per system and keep logging and general settings in a different file. You can chain configuration files by passing the -C option multiple times. Alternatively, you can set the RFM_CONFIG_PATH variable to specify the directories where ReFrame will search for configuration files; in this case, the configuration files must be named settings.py.

In the following, we have split cluster_perflogs.py into three different configuration files as follows:

config/multifile/
├── common
│   └── settings.py
├── environments
│   └── settings.py
└── pseudo-cluster
    └── settings.py

Since the configuration file names are normalized, we could use the RFM_CONFIG_PATH environment variable instead of the -C option:

Run with the Docker compose setup.
export RFM_CONFIG_PATH=$(pwd)/config/multifile/common:$(pwd)/config/multifile/environments:$(pwd)/config/multifile/pseudo-cluster

Inspecting the loaded configuration

ReFrame offers the very convenient --show-config option, which allows you to inspect the actually loaded configuration or query configuration values. Having set the RFM_CONFIG_PATH environment variable, running

Run with the Docker compose setup.
reframe --show-config

will show us the current configuration. Note that the loaded configuration resolves to the auto-detected system. Even if the configuration defines multiple systems or is built from multiple files, the --show-config option will show only the configuration of the current system:

Run with the Docker compose setup.
reframe -C :config/baseline.py --show-config

Notice that tutorialsys was not matched and therefore the current system is the generic one.

Note

Using : before the configuration filename passed to the -C option instructs ReFrame to drop any configuration built so far from the RFM_CONFIG_PATH.

The --show-config option also takes an optional argument, which allows you to select a specific configuration parameter:

Run with the Docker compose setup.
reframe --show-config=systems/0/name
"pseudo-cluster"

You can also use the --show-config option to retrieve the default value of a configuration parameter, even if this is not defined in the configuration file:

Run with the Docker compose setup.
reframe --show-config=general/0/trap_job_errors
false

Scoping configuration options

ReFrame allows you to limit the effect of configuration options to certain systems only. Every top-level configuration object, except systems, has an optional target_systems parameter that accepts a list of systems for which this object is valid. For example, if we wanted to set trap_job_errors only for the pseudo-cluster system, we could add the following to our configuration:

'general': [
    {
        'trap_job_errors': True,
        'target_systems': ['pseudo-cluster']
    }
]
Run with the Docker compose setup.
reframe -C :config/cluster.py --show-config=general/0/trap_job_errors
true

Logging

There are two types of logs that ReFrame produces:

  1. Activity logs, which log the activities of the framework and in their detailed version can be useful for debugging.

  2. Performance logs, which record the performance metrics obtained from performance tests.

Activity logging

By default, ReFrame generates a debug log file in the system’s temporary directory. This is quite a detailed log. Logging can be configured in the logging section of the configuration file. Multiple logging handlers can be registered to log messages to different sinks at different levels. Let’s see an example of how to set up ReFrame to save its output in a reframe_<timestamp>.out file and a detailed debug output in reframe_<timestamp>.log:

../examples/tutorial/config/cluster_logging.py
    'logging': [
        {
            'handlers': [
                {
                    'type': 'file',
                    'name': 'reframe.log',
                    'timestamp': '%FT%T',
                    'level': 'debug2',
                    'format': ('[%(asctime)s] %(levelname)s: '
                               '%(check_info)s: %(message)s'),
                    'append': False
                },
                {
                    'type': 'file',
                    'name': 'reframe.out',
                    'timestamp': '%FT%T',
                    'level': 'info',
                    'format': '%(message)s',
                    'append': False
                }
            ]
        }
    ]

Controlling output verbosity

You may control the output verbosity using -v to increase it or -q to decrease it. Both options can be specified multiple times to further increase or decrease verbosity.

The following table shows the available verbosity levels and the effect of the above options:

ReFrame’s verbosity levels

  Option       Level
  -qq          error
  -q           warning
  (default)    info
  -v           verbose
  -vv          debug
  -vvv         debug2

Performance logging

We have talked briefly about performance logging in Run reports and performance logging but here we will present in more detail the information displayed and how you can configure it.

By default, ReFrame stores the performance data obtained from performance tests in a CSV file. A performance test is a test that defines at least one figure of merit (see Writing your first test). The default location of the performance logs is <prefix>/perflogs/<system>/<partition>/<test_base_name>.log, where <prefix> is the output prefix as specified by the --prefix or --perflogdir options and <test_base_name> refers to the test’s class name. Every time that a variant of the test is run, a new line will be appended to this file. Here is the performance log file for the stream_test on our pseudo-cluster:compute partition:

Run with the Docker compose setup.
cat /scratch/rfm-stage/perflogs/pseudo-cluster/compute/stream_test.log
result|job_completion_time|descr|env_vars|environ|exclusive_access|extra_resources|job_completion_time_unix|job_exitcode|job_nodelist|job_submit_time|jobid|modules|name|num_cpus_per_task|num_gpus_per_node|num_tasks|num_tasks_per_core|num_tasks_per_node|num_tasks_per_socket|partition|copy_bw_value|copy_bw_unit|copy_bw_ref|copy_bw_lower_thres|copy_bw_upper_thres|triad_bw_value|triad_bw_unit|triad_bw_ref|triad_bw_lower_thres|triad_bw_upper_thres|short_name|system|unique_name|use_multithreading
pass|2024-02-21T22:51:16||{}|gnu|false|{}|1708555876.746763|null|None|1708555874.6122677|65||stream_test|null|null|1|null|null|null|compute|21116.7|MB/s|0|None|None|14813.0|MB/s|0|None|None|stream_test|pseudo-cluster|stream_test|null
pass|2024-02-21T22:51:19||{}|clang|false|{}|1708555879.4456542|null|None|1708555877.3607173|66||stream_test|null|null|1|null|null|null|compute|18405.7|MB/s|0|None|None|14997.1|MB/s|0|None|None|stream_test|pseudo-cluster|stream_test|null
pass|2024-02-25T20:45:17||{}|gnu|false|{}|1708893917.8675761|null|None|1708893915.3461528|69||stream_test|null|null|1|null|null|null|compute|11429.4|MB/s|0|None|None|10674.8|MB/s|0|None|None|stream_test|pseudo-cluster|stream_test|null
pass|2024-02-25T20:45:17||{}|clang|false|{}|1708893917.9110286|null|None|1708893915.4608803|70||stream_test|null|null|1|null|null|null|compute|11909.9|MB/s|0|None|None|8325.0|MB/s|0|None|None|stream_test|pseudo-cluster|stream_test|null

The first line serves as a header. If ReFrame determines that the header must change due to a change in the test (e.g., new variables, new figure of merits, etc.), it will back up the existing file to <test_base_name>.log.h0 and will create a new file to hold the current performance data.

You may have noticed that ReFrame logs a lot of information along with the test’s performance. The reason is to cover a wide range of use cases, but you might not be interested in all that information, especially if your test setup is fixed. Let’s see how we can change the perflog format for our example:

../examples/tutorial/config/cluster_perflogs.py
    'logging': [
        {
            'handlers_perflog': [
                {
                    'type': 'filelog',
                    'prefix': '%(check_system)s/%(check_partition)s',
                    'level': 'info',
                    'format': ('%(check_result)s,'
                               '%(check_job_completion_time)s,'
                               '%(check_system)s:%(check_partition)s,'
                               '%(check_environ)s,'
                               '%(check_perfvalues)s'),
                    'format_perfvars': ('%(check_perf_value)s,'
                                        '%(check_perf_unit)s,'),
                    'append': True
                }
            ]
        }
    ]

The handlers_perflog configuration section defines a list of log handlers to which the performance data of every finished test will be sent. The filelog handler manages the writing of performance data to per-test files, as described above. Let’s walk briefly through the most important parts of its configuration:

  • The prefix is an additional directory prefix under the global prefix (see --prefix option) where the perflogs will be saved. The formatting placeholders are described below.

  • The format specifies how the log record will be formatted. Each placeholder of the form %(placeholder)s is replaced by the actual value at runtime. All placeholders starting with check_ refer to test attributes. You can check the complete list of supported placeholders in the configuration reference guide. In this particular example, we log only the test result, the (formatted) completion time, the system and partition, the environment and the obtained values of the test’s performance variables.

  • The format_perfvars specifies how the performance values (the %(check_perfvalues)s placeholder) will be formatted. In this case, we will only log the obtained value and its unit. Note that ReFrame will repeat this pattern for all the performance variables defined in the test and this is why we need to end the pattern with the separator, here the ,.

Let’s rerun our STREAM example using the new configuration:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster_perflogs.py -c stream/stream_fixtures.py -r
cat /scratch/rfm-stage/perflogs/pseudo-cluster/compute/stream_test.log
result,job_completion_time,system:partition,environ,copy_bw_value,copy_bw_unit,triad_bw_value,triad_bw_unit
pass,2024-02-26T22:39:52,pseudo-cluster:compute,gnu,11527.4,MB/s,10110.8,MB/s
pass,2024-02-26T22:39:52,pseudo-cluster:compute,clang,11798.5,MB/s,8370.0,MB/s
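If you also wanted, say, the node list in every record, you could extend the format accordingly. This is a hedged sketch that assumes a check_job_nodelist placeholder, corresponding to the job_nodelist field seen in the default log earlier; check the configuration reference for the exact placeholder names:

    'format': ('%(check_result)s,'
               '%(check_job_completion_time)s,'
               '%(check_system)s:%(check_partition)s,'
               '%(check_environ)s,'
               '%(check_job_nodelist)s,'
               '%(check_perfvalues)s'),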

Sending performance data to an HTTP endpoint

You can instruct ReFrame to send the performance logs directly to an HTTP endpoint. An example is sending test performance records to Elastic. This is handled by the httpjson handler and an example configuration is the following:

../examples/tutorial/config/cluster_perflogs_httpjson.py
                {
                    'type': 'httpjson',
                    'url': 'https://httpjson-server:12345/rfm',
                    'level': 'info',
                    'debug': True,
                    'extra_headers': {'Authorization': 'Token YOUR_API_TOKEN'},
                    'extras': {
                        'facility': 'reframe',
                        'data-version': '1.0'
                    },
                    'ignore_keys': ['check_perfvalues'],
                    'json_formatter': (_format_record
                                       if os.getenv('CUSTOM_JSON') else None)
                }

The url key refers to the service endpoint, the extra_headers are additional headers to be included in the POST request, whereas the extras and the ignore_keys are additional keys to send or keys to exclude, respectively. Normally, ReFrame sends the whole log record, which contains all of the test’s variables prefixed with check_.

Note that in this example we also set debug to True, so that ReFrame will simply dump the JSON record and will not attempt to send it. Also, the json_formatter is optional; we will cover it in the next section.

Let’s rerun our STREAM benchmark:

Run with the Docker compose setup.
reframe --prefix=/scratch/rfm-stage/ -C config/cluster_perflogs_httpjson.py -c stream/stream_fixtures.py -r

Notice that there is one JSON file produced per test run, named httpjson_record_<timestamp>.json. You can inspect it to see the exact JSON record that ReFrame would send to the HTTP endpoint:

Run with the Docker compose setup.
jq . httpjson_record_<timestamp>.json

Customizing the JSON record

It might be the case that the remote endpoint imposes a schema on the incoming JSON blob. In this case, ReFrame’s record will most likely be rejected. However, you can directly format the log record to be sent to the server by setting the json_formatter configuration option to a callable that will generate the JSON payload from ReFrame’s log record. In the following example, we remove the check_ prefix from the test’s attributes.

../examples/tutorial/config/cluster_perflogs_httpjson.py
def _format_record(record, extras, ignore_keys):
    data = {}
    for attr, val in record.__dict__.items():
        if attr in ignore_keys or attr.startswith('_'):
            continue

        if attr.startswith('check_'):
            data[attr[6:]] = val
        else:
            data[attr] = val

    data.update(extras)
    return json.dumps(data)

The formatter function takes the raw log record, the extras and the keys to ignore as specified in the handler configuration, and returns a JSON string. Since we can’t know the exact log record attributes, we iterate over its __dict__ items and format the record keys as we go. Also note that we ignore all private fields of the record, i.e., those starting with _. Rerunning the previous example with CUSTOM_JSON=1 will generate the modified JSON record.
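To get a feel for what the formatter produces, here is a toy illustration outside of ReFrame that feeds _format_record (as defined above) a stand-in record object; the attribute names are made up:

    import json
    import types

    # Stand-in for a log record: any object with a __dict__ of attributes
    rec = types.SimpleNamespace(check_result='pass',
                                check_system='pseudo-cluster',
                                osversion='linux',
                                _private='ignored')

    print(_format_record(rec, extras={'facility': 'reframe'}, ignore_keys=[]))
    # {"result": "pass", "system": "pseudo-cluster", "osversion": "linux", "facility": "reframe"}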