Correlating a test result with the source code

As part of a CI loop, the results of the LAVA test job may indicate a bug or regression in the source code which initiated the CI loop. These issues are distinct from infrastructure or job errors, and reporting them is a customised process for each team involved.

The details of how and why the test failed will typically be essential to identifying how to fix the issue, so developers need help from test writers and from LAVA to provide information, logs and build artefacts to be able to reproduce the issue.

However, it is common for a test failure to occur due to an earlier failure in the test job, e.g. changes in dependencies. It is also common for tests to report the error briefly at one point within the log and then provide more verbose content at another point.

So the first problem can be correlating the test output with the actual failure. Test writers often need to modify how the original test behaves, to be able to identify which pieces of output are relevant to any particular test failure. Each test is different and uses different ways to describe, summarise, report and fail test operations. Test writers already need to write customised wrappers to run different tests in similar ways. To be able to relate the failures back to the source code, a lot more customisation is likely to be required.

Overall, LAVA can only be one part of the effort to triage test failures and debug the original source code. Results need to be presented to developers using a frontend, test writers need to write scripts to wrap test suites, and enough other tests need to be run that developers have a reliable way of knowing all the details leading up to the failure.

Problems within test suites

Avoid reliance on the total count

Test suites which discover the list of tests automatically can be a particular problem. Each test job could potentially add, remove or skip test results differently to previous test jobs, based on the same source code changes that triggered the test in the first place. Test writers may need to take control of the list of tests which will be executed, adding new tests individually and highlighting tests which were run in previous jobs but which are now missing.

For example, if a developer is waiting for a large number of CI results, automated test suites which add one test whilst removing another could easily mislead the developer into thinking that a particular test passed when it was actually omitted. This is made worse if the test suite has wide coverage as the developer might not be aware of the context or purpose of the added test result.

The LAVA Charts are only intended as a generic summary of the results. It is all too easy to miss a test being replaced if the report sent to the developer only tracks the number of passes over time.
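One way to avoid relying on the total count is to compare the names of the tests, not the number of passes. The following is a minimal sketch, assuming the results of each job are already available as a mapping of test name to result; the function name and the test names are hypothetical and are not part of LAVA.

```python
def compare_test_lists(previous, current):
    """Return the tests which disappeared and appeared between two jobs."""
    prev_names = set(previous)
    curr_names = set(current)
    return {
        "missing": sorted(prev_names - curr_names),
        "added": sorted(curr_names - prev_names),
    }


# Hypothetical results: both jobs report three passes, so a chart of
# the pass count would show no change at all.
previous_job = {"boot": "pass", "network-up": "pass", "usb-enumerate": "pass"}
current_job = {"boot": "pass", "network-up": "pass", "usb-hotplug": "pass"}

changes = compare_test_lists(previous_job, current_job)
print(changes)
```

A report which includes the "missing" and "added" lists makes the replaced test visible even though the pass count is unchanged.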

Control the test operations

  • Keep the test itself stable. This includes the wrappers and the reporting.
  • Use staging instances for all components, including LAVA and any frontends, so that each and every change is tested against known working components.
  • Avoid downloading from third-party URLs. Use tools from the existing base system or build known working versions of the tools into the base system so that every test always uses the same tools.
    • Use checksums on all downloaded content if this is not implemented by the base system itself. (For example, apt and dpkg use checksums and other cryptographic methods extensively, to ensure that downloads are from verified locations and of verified content.)
  • Push your changes upstream. Avoid the burden of forks by working with each upstream to improve the tools and test scripts themselves.
  • Split the test operations into logical blocks. A combined test job can still be run separately but there are advantages to running more test jobs, each of shorter duration:
    • test jobs can be run in parallel across a pool of devices.
    • logs are smaller and easier to triage.
    • failures are easier to reproduce.
    • shorter test jobs can make it easier to build and run the full matrix of jobs which results from only changing one element at a time. Not all tests need to be run to know that the firmware is working correctly.
  • Use descriptive commit messages in the test shell version control and use code review.
  • Consider formal bug tracking for the test shell scripts, distinct from other bugs.
  • Implement ways to resubmit after infrastructure failures, using the same automated submitter, metadata, artefacts and tests.
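The advice in the list above about checksums on downloaded content can be sketched as follows. This is an illustration only, assuming that the expected SHA256 value has been published alongside the artefact by a trusted source; the function names are hypothetical.

```python
import hashlib


def sha256_of(path, chunk_size=65536):
    """Calculate the SHA256 hex digest of a file without loading it all at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file does not match the published checksum."""
    actual = sha256_of(path)
    if actual != expected_sha256.strip().lower():
        raise ValueError(f"checksum mismatch for {path}: got {actual}")
    return True
```

Failing early on a mismatch keeps a corrupted or substituted download from being reported later as a test failure.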

Control the output

Established test suites often lack any standard way of reporting the progress of the tests as they run, the format of errors and the layout of the result summary.

Each of these elements may need to be taken over by the test writer to allow the developer a way to identify a specific test and the section of the LAVA logs to which it relates.

This can cause issues if, for example, a wrapper has to wait until the end of the test process to obtain the relevant information. The test job may appear to stall and later produce a flood of output. If the wrapper or the underlying test fails in an unexpected way, it is very easy to produce a LAVA test job with no useful output for any of the results.
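A wrapper can reduce both problems by echoing output as it arrives and by always emitting a final result line. The sketch below is one possible approach, not a LAVA interface: the marker format and the function name are hypothetical, chosen only so that each section of the log can be matched to a named test.

```python
import subprocess
import sys


def run_streamed(name, command):
    """Run one test command, echoing each output line as soon as it arrives."""
    # Hypothetical marker format, used only to delimit this test in the log.
    print(f"<<<TEST-START {name}>>>", flush=True)
    try:
        with subprocess.Popen(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,
            text=True,
            bufsize=1,
        ) as process:
            for line in process.stdout:
                # Prefix every line so it can be related back to the test.
                print(f"[{name}] {line.rstrip()}", flush=True)
        returncode = process.returncode
    except OSError as exc:
        # The command could not be started: still report a result.
        print(f"<<<TEST-END {name} result=fail reason={exc}>>>", flush=True)
        return "fail"
    result = "pass" if returncode == 0 else "fail"
    print(f"<<<TEST-END {name} result={result}>>>", flush=True)
    return result


if __name__ == "__main__":
    run_streamed("python-version", [sys.executable, "--version"])
```

The underlying test may still buffer its own output, which the wrapper cannot control, so some test suites will need changes of their own before the log becomes useful.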

To be able to properly correlate the test results to the source code, it may become necessary to rewrite the test suite itself and then consider pushing the changes upstream.

LAVA is investigating ways to help test writers standardise the ways of running tests to be able to provide more benefit from automated log files. Talk to us if you have ideas for or experience of such changes.

Control the base system

Most tests require a running base system and some dependencies installed within that system. The choice of which system to use can impact the triage of the results obtained.

  • If the system is continuously changing (at the source code level), then results from last month may be completely invalid for comparing with the most recent failure.
  • If the system is based on a distribution which supports reproducing an identical system at a later time, this may make it much simpler to triage failures and bisect regressions.

Consider the impact of the base system carefully - triage and bisection may require weeks of historical data to be able to identify the root of any reported issues. Test one thing at a time.

Control the build system

  • Avoid changing the name of files between builds unless those files have actually changed.
  • Avoid reliance on build numbers when not everything in the build has changed.
    • Use version strings which relate directly to the versions used by the source code for that binary.
  • Make changelogs available for the components that have changed between builds.
  • Always publish checksums for all build artefacts.

This is to make it easier, during triage, to use known working versions of each component whilst changing just one component. It can be very difficult to relate a build number from a URL to an upstream code change, especially if the build system removes build URLs after a period of time.

Remember that every component has its own upstream team and its own upstream source code versioning. If a bug is found in one component, locating the source code for that component will involve knowing the exact upstream version string that was actually used in the test.

Control the list of tests

It may be necessary to remove the auto-detection support within the test suite and explicitly set which tests are to be run and which are skipped.

Avoid executing tests which are known to fail. Developers reading the final report need to be able to pick out which tests have failed without the distraction of then filtering out tests which have never passed.

Avoid hiding the list of tests inside test scripts. Ensure that the report sent to developers discloses the tests which were submitted and the tests which were skipped. Provide changelogs when the lists are changed.

Review the list of skipped tests regularly. This can be done by submitting LAVA test jobs which only execute tests which are skipped in other test jobs. Again, ensure that only one element is changed at a time, so choose the most stable kernel, root filesystem and firmware available as the base for executing these skipped tests on an occasional basis.
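The points in this section can be sketched as an explicit plan which sits outside the test scripts and is included in the report. This is a minimal illustration, assuming the test suite can still list the tests it discovers; all test names, the skip reason and the function name are hypothetical.

```python
# Explicit lists, kept in version control alongside a changelog.
RUN = ["boot", "network-up", "usb-enumerate"]
SKIP = {"suspend-resume": "has never passed on this device type"}


def plan_tests(discovered, run, skip):
    """Decide what to execute and what to disclose in the report."""
    known = set(run) | set(skip)
    return {
        # Executed in the declared order, not the discovery order.
        "run": [name for name in run if name in discovered],
        # Skipped tests are reported with a reason, never hidden.
        "skipped": dict(skip),
        # Listed to run but no longer offered by the test suite.
        "missing": [name for name in run if name not in discovered],
        # Newly discovered tests are reported but not run until reviewed.
        "unlisted": sorted(set(discovered) - known),
    }


discovered = {"boot", "network-up", "suspend-resume", "gpu-render"}
plan = plan_tests(discovered, RUN, SKIP)
print(plan)
```

A separate, occasional job could pass the keys of SKIP as the run list to review whether any skipped tests have started to pass.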

Distinguish between CI tests and functional tests

CI tests need a lot of supporting information to relate the results back to the reason for running the test in the first place.

Functional tests exist to test the elements outside the test job and include things like health checks and sample jobs used for unit tests.

The objective of a CI test job is to test the changes made by developers.

The objective of a functional test job is to test the functionality of the CI system.

Health checks are not the only functional tests - sometimes there is functionality which cannot be put into a health check. For example, if additional hardware is available on some devices of a particular device type, the health check may report a failure when run on the devices without that hardware. This may need to be taken into account when deciding what qualifies as a new device type. Functional tests can be submitted automatically, using notifications to alert admins to failures of additional hardware.

Manage testing of complete software stacks

It is possible to test a complete software stack in automation. However, unpicking that stack to isolate a problem can consume very large amounts of engineering time. This only gets worse when the problem itself is intermittent, due to the inherent complexity of identifying which component is at fault.

Wherever possible, break up the stack and test each change independently, building the stack vertically from the lowest base able to run a test.

  • Boot test the kernel with an unchanging root filesystem and a known working build of firmware. Ensure that each kernel build is boot tested before functional tests are submitted.
  • Test the modified root filesystem with a known working kernel and known working firmware.
    • Test with and without installing the dependencies required for the later tests. Check that the system works reliably to be able to prepare the dependencies.
  • Break the test into components and test each block separately.
  • Only change the “gold standard” files when absolutely essential. This includes firmware, kernel, root filesystem and any dependencies required by the test as well as the code running the test itself.
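The principle in the list above, changing one element at a time against a known working set, can be sketched as a small job matrix. This is an illustration under stated assumptions: the component names follow the list, but every version string and the function name are hypothetical.

```python
# Hypothetical "gold standard" set of known working components.
GOLD = {
    "firmware": "fw-1.0",
    "kernel": "linux-6.1.0",
    "rootfs": "rootfs-2024.01",
}

# Hypothetical new builds waiting to be tested.
CANDIDATES = {
    "kernel": ["linux-6.1.1"],
    "rootfs": ["rootfs-2024.02"],
}


def one_change_matrix(gold, candidates):
    """Build one job per candidate, each differing from gold in one component."""
    jobs = []
    for component, versions in candidates.items():
        for version in versions:
            components = dict(gold)
            components[component] = version
            jobs.append({"changed": component, "components": components})
    return jobs


jobs = one_change_matrix(GOLD, CANDIDATES)
for job in jobs:
    print(job["changed"], job["components"])
```

Because each job differs from the gold standard in exactly one component, a failure points at that component without further bisection of the stack.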

Metadata

See also: Metadata.

Any link between a test result in LAVA and a line of source code will rely on metadata, including:

  • the pre-installed dependencies of the test, including versions and original source. Using a reproducible distribution for this can provide confidence that the test result arises from the tests and not the base operating system.
  • the git commit hash of the source code used in the build.
  • the git commit hash of the test code executing the tests, as this is often external to the source code being tested. LAVA provides the commit hash of the Lava Test Shell Definition but scripts executed by LAVA will need to be tracked separately.
  • the filename of the code running the test. (Remember that the result of any test may be due to a bug in the function running the test, as well as a bug in the code being executed outside the test function.)
  • the filename(s) within the source code for each error produced by the test. (Most test suites do not have this support or may only infer it via the name of the test function. The affected code could easily be moved to a different file without changing the test function name.)
  • the location of the source code.
    • how to construct a URL to the file at the specified version at the location. This differs according to the chosen web service for the repository.

Control the metadata and the queries which use it. Users and admins will frequently copy and paste job submissions to retry particular issues. Always ensure that queries and reports look at the metadata only from a known automated submitter.
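Constructing a URL to a file at a specific version, as mentioned above, might look like the following sketch. It assumes the repository is hosted on a GitHub-style or GitLab-style web service and uses the URL layouts those services use for browsing a file at a commit; the function name, the template table and the example repository are hypothetical, and other services will need their own templates.

```python
from urllib.parse import quote

# URL layouts differ between web services; extend this table as needed.
URL_TEMPLATES = {
    "github": "{base}/blob/{commit}/{path}",
    "gitlab": "{base}/-/blob/{commit}/{path}",
}


def source_url(service, base, commit, path, line=None):
    """Build a link to one file at one commit, optionally at a given line."""
    template = URL_TEMPLATES[service]
    url = template.format(
        base=base.rstrip("/"),
        commit=commit,
        path=quote(path),  # keeps "/" but escapes spaces and similar
    )
    if line is not None:
        url += f"#L{line}"
    return url


# Hypothetical repository, commit hash and file, taken from job metadata.
print(source_url("github", "https://github.com/example/project",
                 "abc123", "src/main.c", line=42))
```

Using the commit hash from the metadata, not a branch name, keeps the link pointing at the code which was actually tested.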

Reproducing test jobs

LAVA can support developers who want to reproduce a test job locally but the details depend a lot on the actual device being used. Some devices will need significant amounts of (sometimes expensive or difficult to obtain) support hardware. However, once an alternative rig is assembled, developers can use lava-run to re-run the test job locally.

Other options include:

  • emulation - depending on the nature of the failure, it may be possible to emulate the test job locally and in LAVA.

  • local workers - if devices are available locally, a worker can be configured to run test jobs using a remote master.

  • portability - the best option is when the issue can be reproduced without needing the original hardware. If the scripts used in LAVA are portable, developers can run the test process without needing automation.