Check the preview of the 2nd version of this platform, being developed by the open MLCommons taskforce on automation and reproducibility as a free, open-source and technology-agnostic on-prem platform.
A Collective Knowledge workflow for collaborative research into multi-objective autotuning and machine learning techniques
We thank the Raspberry Pi Foundation and cTuning Foundation for supporting this proof-of-concept project to automatically generate "live" papers with portable CK workflows and reusable components that can be reproduced and built upon by the community.
Corresponding author: Grigori Fursin

[ Auto-generated ArXiv PDF ] [ CK project overview ]
[ Optimization repository ] [ GCC ] [ LLVM ] [ MILEPOST optimization predictor ]
[ Andrew Ng's ML course ] [ Google ML crash course ] [ Facebook Field Guide to ML ]

Developing efficient software and hardware has never been harder, whether it is for a tiny IoT device or an Exascale supercomputer. Apart from the ever-growing design and optimization complexity, there are even more fundamental problems such as the lack of interdisciplinary knowledge required for effective software/hardware co-design and a growing technology transfer gap between academia and industry.

We introduce our new educational initiative to tackle these problems by developing Collective Knowledge (CK), a unified experimental framework for computer systems research and development. We use CK to teach the community how to make their research artifacts and experimental workflows portable, reproducible, customizable and reusable while enabling sustainable R&D and facilitating technology transfer. We also demonstrate how to redesign multi-objective autotuning and machine learning as a portable and extensible CK workflow. Such workflows enable researchers to experiment with different applications, data sets and tools; crowdsource experimentation across diverse platforms; share experimental results, models, visualizations; gradually expose more design and optimization choices using a simple JSON API; and ultimately build upon each other's findings.

As the first practical step, we have implemented customizable compiler autotuning, crowdsourced optimization of diverse workloads across Raspberry Pi 3 devices, reduced the execution time and code size by up to 40%, and applied machine learning to predict optimizations. We hope such an approach will help teach students how to build upon each other's work to enable an efficient and self-optimizing software/hardware/model stack for emerging workloads.

1  Introduction

Many recent international roadmaps for computer systems research appeal to reinvent computing [1, 2, 3]. Indeed, developing, benchmarking, optimizing and co-designing hardware and software has never been harder, no matter if it is for embedded and IoT devices, or data centers and Exascale supercomputers. This is caused by both physical limitations of existing technologies and an unmanageable complexity of continuously changing computer systems which already have too many design and optimization choices and objectives to consider at all software and hardware levels [4], as conceptually shown in Figure 1. That is why most of these roadmaps now agree with our vision that such problems should be solved in a close collaboration between industry, academia and end-users [5, 6].

However, after we initiated artifact evaluation (AE) [7, 8] at several premier ACM and IEEE conferences to reproduce and validate experimental results from published papers, we noticed an even more fundamental problem: a growing technology transfer gap between academic research and industrial development. After evaluating more than 100 artifacts from the leading computer systems conferences in the past 4 years, we noticed that only a small fraction of research artifacts could be easily customized, ported to other environments and hardware, reused, and built upon. We have grown to believe that this is due to a lack of a common workflow framework that could simplify implementation and sharing of artifacts and workflows as portable, customizable and reusable components with some common API and meta information vital for open science [9].

At the same time, companies are always under pressure and rarely have time to dig into numerous academic artifacts shared as CSV/Excel files and "black box" VM and Docker images, or to adapt numerous ad-hoc scripts to realistic and ever-changing workloads, software and hardware. That is why promising techniques may remain in academia for decades while being only incrementally improved, put on the shelf when leading students graduate, and "reinvented" from time to time.

Autotuning is one such example: this very popular technique has been actively researched since the 1990s to automatically explore large optimization spaces and improve the efficiency of computer systems [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. Every year, dozens of autotuning papers get published to optimize some components of computer systems, improve and speed up exploration and co-design strategies, and enable run-time adaptation. Yet, when trying to make autotuning practical (in particular, by applying machine learning), we faced numerous challenges with integrating such published techniques into a real, complex and continuously evolving software and hardware stack [5, 4, 6, 9].

Eventually, these problems motivated us to develop a common experimental framework and methodology similar to those in physics and other natural sciences to collaboratively improve autotuning and other techniques. As part of this educational initiative, we implemented an extensible, portable and technology-agnostic workflow for autotuning using the open-source Collective Knowledge framework (CK) [34, 35]. Such workflows help researchers to reuse already shared applications, kernels, data sets and tools, or add their own using a common JSON API and meta-description [36]. Moreover, such workflows can automatically adapt compilation and execution to a given environment on a given device using an integrated cross-platform package manager.

Our approach takes advantage of a powerful and holistic top-down methodology successfully used in physics and other sciences when studying complex systems. The key idea is to let novice researchers first master simple compiler flag autotuning scenarios while learning interdisciplinary techniques including machine learning and statistical analysis. Researchers can then gradually increase complexity to enable automatic and collaborative co-design of the whole SW/HW stack by exposing more design and optimization choices and multiple optimization objectives (execution time, code size, power consumption, memory usage, platform cost, accuracy, etc.), crowdsourcing autotuning across diverse devices provided by volunteers (similar to SETI@home [37]), continuously exchanging and discussing optimization results, and eventually building upon each other's results.

We use our approach to optimize diverse kernels and real workloads such as zlib in terms of speed and code size by crowdsourcing compiler flag autotuning across Raspberry Pi 3 devices using the default GCC 4.9.2 and the latest GCC 7.1.0 compilers. We have been able to achieve up to 50% reductions in code size and speedups from 15% to 8x across different workloads over the "-O3" baseline. Our CK workflow and all related artifacts are available on GitHub to allow researchers to compare and improve various exploration strategies (particularly those based on machine learning algorithms such as KNN, GA, SVM and deep learning, though further documentation of APIs is still required) [24, 6]. We have also shared all experimental results in our open repository of optimization knowledge [38, 39] to be validated and reproduced by the community.

We hope that our approach will serve as a practical foundation for open, reproducible and sustainable computer systems research by connecting students, scientists, end-users, hardware designers and software developers to learn together how to co-design the next generation of efficient and self-optimizing computer systems, particularly via reproducible competitions such as ReQuEST [40].

This technical report is organized as follows. Section 2 introduces the Collective Knowledge framework (CK) and the concept of sharing artifacts as portable, customizable and reusable components. Section 3 describes how to implement a customizable, multi-dimensional and multi-objective autotuning as a CK workflow. Section 4 shows how to optimize compiler flags using our universal CK autotuner. Section 5 presents a snapshot of the latest optimization results from collaborative tuning of GCC flags for numerous shared workloads across Raspberry Pi 3 devices. Section 6 shows optimization results of zlib and other realistic workloads for GCC 4.9.2 and GCC 7.1.0 across Raspberry Pi 3 devices. Section 7 describes how to implement and crowdsource fuzzing of compilers and systems for various bugs using our customizable CK autotuning workflow. Section 8 shows how to predict optimizations via CK for previously unseen programs using machine learning. Section 9 demonstrates how to select and autotune models and features to improve optimization predictions while reducing complexity. Section 10 shows how to enable efficient, input-aware and adaptive libraries and programs via CK. Section 11 presents CK as an open platform to support reproducible and Pareto-efficient co-design competitions of the whole software/hardware/model stack for emerging workloads such as deep learning and quantum computing. We present future work in Section 12. We also include an Artifact Appendix to allow students to try our framework, participate in collaborative autotuning, gradually document APIs and improve experimental workflows.

2  Converting ad-hoc artifacts to portable and reusable components with JSON API

Artifact sharing and reproducible experimentation are key to our collaborative approach to machine-learning-based optimization and co-design of computer systems, which was first prototyped during the EU-funded MILEPOST project [5, 41, 24]. Indeed, it is difficult and time-consuming, if not impossible, to build useful predictive models without large and diverse training sets (programs, data sets), and without crowdsourcing design and optimization space exploration across diverse hardware [9, 6].

While we have been actively promoting artifact sharing for the past 10 years since the MILEPOST project [5, 42], it is still relatively rare in the computer systems community. We have begun to understand possible reasons for that through our Artifact Evaluation initiative [7, 8] at PPoPP, CGO, PACT, SuperComputing and other leading ACM and IEEE conferences, which has attracted over a hundred artifacts in the past few years.

Unfortunately, nearly all the artifacts have been shared simply as zip archives, GitHub/GitLab/Bitbucket repositories, or VM/Docker images, with many ad-hoc scripts to prepare, run and visualize experiments, as shown in Figure 2a. While this is a good step towards reproducibility, such ad-hoc artifacts are hard to reuse and customize as they do not provide a common API and meta information.

Some popular and useful services such as Zenodo [43] and FigShare [44] allow researchers to upload individual artifacts to specific websites while assigning DOI [45] and providing some meta information. This helps the community to discover the artifacts, but does not necessarily make them easy to reuse.

After an ACM workshop on reproducible research methodologies (TRUST'14) [46] and a Dagstuhl Perspectives workshop on Artifact Evaluation [8], we concluded that computer systems research lacked a common experimental framework, in contrast with other sciences [47].

Together with our fellow researchers, we also assembled the following wish-list for such a framework:

       ●  it should be able to help researchers quickly organize their local code and data into discoverable and reusable components with a unique ID, common API and unified meta information, rather than being forced to upload them to the web from the start;

       ●  it should be open-source with a permissive license to simplify technology transfer;

       ●  it should be portable, simple to install and use from the command line;

       ●  it should allow users to assemble experimental workflows by simply plugging in shared components;

       ●  it should support native, non-virtualized execution of such workflows (i.e. not only via Virtual Machines [48] and Docker [49]), which is critical for empirical program optimization and hardware co-design experiments;

       ●  it should be able to adapt to continuously evolving software environments and support different versions of tools such as rapidly evolving compilers and libraries;

       ●  it should include a local web server to simplify crowdsourcing of experiments and visualization of results in workgroups.

Since there was no available open-source framework with all these features, we decided to develop such a framework, Collective Knowledge (CK) [34, 35], from scratch with initial support from the EU-funded TETRACOM project [50]. CK is implemented as a small and portable Python module with a command line front-end to assist users in converting their local objects (code and data) into searchable, reusable and shareable directory entries with user-friendly aliases and auto-generated Unique ID, JSON API and JSON meta information [36], as described in [35, 51] and conceptually shown in Figure 2b.

The user first creates a new local CK repository as follows:

$ ck add repo:new-ck-repo

Initially, it is just an empty directory:

$ ck find repo:new-ck-repo
$ ls `ck find repo:new-ck-repo`

Now, the user starts adding research artifacts as CK components with extensible APIs. For example, after noticing that we always perform 3 common actions on all our benchmarks during our experiments, "compile", "run" and "autotune", we want to provide a common API for these actions and benchmarks, rather than writing ad-hoc scripts. The user can provide such an API with actions by adding a new CK module to a CK repository as follows:

$ ck add new-ck-repo:module:program
CK will then create two levels of directories module and program in the new-ck-repo and will add a dummy module.py where common object actions can be implemented later. CK will also create a sub-directory .cm (collective meta) with an automatically generated Unique ID of this module and various pre-defined descriptions in JSON format (date and time of module creation, author, license, etc) to document provenance of the CK artifacts.

Users can now create holders (directories) for such objects sharing common CK module and an API as follows:

$ ck add new-ck-repo:program:new-benchmark
CK will again create two levels of directories: the first one specifying the CK module used (program) and the second one with the alias new-benchmark to hold objects. CK will also create three files in an internal .cm directory:

       ●  meta.json - an empty JSON file which can be gradually extended to describe a given object (such as added program in our example);

       ●  info.json - a JSON file with the date and time of the last modification as well as license, copyright and author information to keep attribution of all updates for open research;

       ●  desc.json - an empty JSON file to describe types of keys in meta.json (useful for automatic type checking) and their value ranges (useful for autotuning as we will show later in this report).

Users can then find a path to a newly created object holder (CK entry) using the ck find program:new-benchmark command and then copy all files and sub-directories related to the given object using standard OS shell commands.

This makes it possible to get rid of ad-hoc scripts by implementing actions inside reusable CK Python modules, as shown in Figure 3. For example, the user can add an action such as compile to a given module as follows:

$ ck add_action module:program --func=compile
CK will create a dummy function body with an input dictionary i inside module.py in the CK module:program entry. Whenever this function is invoked via CK using the following format:
$ ck compile program:some_entry --param1=val1
the command line will be converted to the i dictionary and printed to the console to help novice users understand the CK API. The user can now substitute this dummy function with a specific action on a specific entry (some program in our example, based on its meta information) as conceptually shown in Figure 3. The above example shows how to call CK functions from Python modules rather than from the command line using the ck.access function. It also demonstrates how to find the path to a given program entry and load its meta information and unique ID. For the reader's convenience, Figure 4 lists several important CK commands.
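For illustration, the following sketch shows what such an action and its programmatic invocation might look like; the exact dummy body generated by CK may differ, and some_entry and param1 are just placeholders.

# Inside the CK module:program entry (module.py): an action receives a single
# dictionary 'i' with all command-line parameters and must return a dictionary
# with an integer 'return' code (0 on success, >0 together with an 'error' string).
def compile(i):
    flags = i.get('param1', '')   # e.g. passed as --param1=val1 on the command line
    # ... implement compilation here using the entry's meta information ...
    return {'return': 0}

# Calling the same action programmatically instead of "ck compile program:some_entry":
import ck.kernel as ck

r = ck.access({'action': 'compile',
               'module_uoa': 'program',
               'data_uoa': 'some_entry',
               'param1': 'val1'})
if r['return'] > 0:
    print('CK error: ' + r['error'])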

This functionality should be enough to start implementing unified compilation and execution of shared programs. For example, the program module can read instructions about how to compile and run a given program from the JSON meta data of related entries, prepare and execute portable sub-scripts, collect various statistics, and embed them to the output dictionary in a unified way. This can be also gradually extended to include extra tools into compilation and execution workflow such as code instrumentation and profiling.

Here we immediately face another problem common in computer systems research: how to support multiple versions of various and continuously evolving tools and libraries? However, since we no longer hardwire calls to specific tools directly in scripts but invoke them from higher-level CK modules, we can detect all required tools and set up their environment before execution. To support this concept even better, we have developed a cross-platform package manager as the ck-env repository [52] with several CK modules including soft, env, package, os and platform. These modules allow the community to describe various operating systems (Linux, Windows, MacOS, Android); detect platform features (ck detect platform); detect multiple versions of already installed software (ck detect soft:compiler.gcc); prepare CK entries with their environments for a given OS and platform using the env module (ck show env), thus allowing easy co-existence of multiple versions of a given tool; install missing software using package modules; describe software dependencies using simple tags in a program meta description (such as compiler,gcc or lib,caffe); and ask the user to select an appropriate version during program compilation when multiple software versions are registered in CK, as shown in Figure 5.
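To make the last point more concrete, a dependency on a compiler could be expressed with such tags in a program's meta description roughly as sketched below (rendered as a Python dictionary for readability); the key names are assumptions for illustration and may differ from the exact schema used by the real program module.

# Illustrative sketch of the dependency section of a program's meta.json,
# shown as a Python dictionary; key names are assumptions, not the verbatim schema.
program_meta = {
    'compile_deps': {
        'compiler': {'tags': 'compiler,gcc'},   # resolved via "ck detect soft" or "ck install package"
        'library':  {'tags': 'lib,caffe'}       # optional extra dependency
    }
}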

Such an approach extends the concept of package managers including Spack [53] and EasyBuild [54] by integrating them directly with experimental CK workflows while using the unified CK API, supporting any OS and platform, and allowing the community to gradually extend existing detection or installation procedures via CK Python scripts and CK meta data.

Note that this CK approach encourages reuse of all such existing CK modules from shared CK repositories rather than writing numerous ad-hoc scripts. It should indeed be possible to substitute most of the ad-hoc scripts from public research projects (Figure 2) with just a few of the above modules and entries (Figure 6), and then collaboratively extend them, thus dramatically improving research productivity. For this reason, we keep track of all publicly shared modules and their repositories on this wiki page. The user just needs to add or update a .ckr.json file in the root directory of a given CK repository to describe a dependency on other existing CK repositories with required modules or entries. Since it is possible to uniquely reference any CK entry by two Unique IDs (module UID:object UID), we also plan to develop a simple web service to automatically index and discover all modules, similar to DOI.

The open, file-based format of CK repositories allows researchers to continue editing entries and their meta information directly using their favourite editors. It also simplifies the exchange of these entries using Git repositories, zip archives, Docker images and any other popular tool. At the same time, schema-free and human-readable Python dictionaries and JSON files help users collaboratively extend actions, APIs and meta information while keeping backward compatibility. Such an approach should let the community gradually and collaboratively convert and cross-link all existing ad-hoc code and data into unified components with extensible APIs and meta information. This, in turn, allows users to organize their own research while reusing existing artifacts, building upon them, improving them and continuously contributing back to Collective Knowledge, similar to Wikipedia.

We also noticed that CK can help students reduce preparation time for Artifact Evaluation [7] at conferences while automating the preparation and validation of experiments, since all artifacts, workflows and repositories are immediately ready to be shared, ported and plugged into research workflows.

For example, the highest ranked artifact from the CGO'17 article [55] was implemented and shared using the CK framework [56]. That is why CK is now used and publicly extended by leading companies [35], universities [55] and organizations [57] to encourage, support and simplify technology transfer between academia and industry.

3  Assembling portable and customizable autotuning workflow

Autotuning combined with various run-time adaptation, genetic and machine learning techniques is a popular approach in computer systems research to automatically explore multi-dimensional design and optimization spaces [10, 58, 59, 60, 11, 12, 13, 61, 14, 62, 15, 16, 17, 18, 19, 63, 20, 64, 65, 66, 67, 68, 69, 70, 29, 71, 72, 73].

CK makes it possible to unify such techniques by implementing a common, universal, portable, customizable, multi-dimensional and multi-objective autotuning workflow as a CK module (the pipeline module from the public ck-autotuning repository with the autotune function). This allows us to abstract autotuning by decoupling it from the autotuned objects such as "program". Users just need to provide a compatible "pipeline" function with a specific API in the CK module which they want to be autotuned, including the following keys in both input and output (a minimal sketch of such a function follows the list):

       ●  dependencies to describe software dependencies via portable package manager from the CK;

       ●  choices to expose various design and optimization knobs c such as algorithmic parameters, model topology, source-to-source transformations, compiler flags, hardware configurations, etc.;

       ●  characteristics to monitor optimized behavior b such as execution time, code size, compilation time, energy, memory usage, accuracy, resiliency, costs, etc.;

       ●  features to expose various object features f such as semantic program and data set features, hardware counters, platform properties, etc.;

       ●  state to define run-time system state s such as hardware frequencies, network status, cache state, etc.
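For illustration only, such a pipeline function inside a user's CK module might look as sketched below; everything beyond the key names listed above (in particular the measure_run placeholder and the characteristic names) is an assumption, not part of the CK API.

def measure_run(deps, choices, features, state):
    # Placeholder: a real module would build and run the object with the selected
    # choices and collect real measurements; here we just return dummy numbers.
    return 1.0, 12345

# Hypothetical "pipeline" action in a user's module.py, consuming the unified keys
# described above so that the generic CK autotuner can explore 'choices' without
# knowing anything about the autotuned object itself.
def pipeline(i):
    deps     = i.get('dependencies', {})   # resolved via the CK package manager
    choices  = i.get('choices', {})        # design and optimization knobs c
    features = i.get('features', {})       # object features f (e.g. data set size)
    state    = i.get('state', {})          # run-time state s (e.g. CPU frequency)

    exec_time, binary_size = measure_run(deps, choices, features, state)

    return {'return': 0,
            'characteristics': {           # monitored behavior b
                'execution_time': exec_time,
                'binary_size': binary_size}}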

Autotuning can now be implemented as a universal and extensible workflow applied to any object with a matching JSON API by chaining together related CK modules with various exploration strategies, program transformation tools, compilers, program compilation and execution pipeline, architecture simulators, statistical analysis, Pareto frontier filter and other components, as conceptually shown in Figure 7. Researchers can also use unified machine learning CK modules (wrappers to R and scikit-learn [74]) to model the relationship between c, f, s and the observed behavior b, increase coverage, speed up (focus) exploration, and predict efficient optimizations [4, 6]. They can also take advantage of a universal complexity reduction module which can automatically simplify found solutions without changing their behavior, reduce models and features without sacrificing accuracy, localize performance issues via differential analysis [75], reduce programs to localize bugs, and so on.

Even more importantly, our concept of a universal autotuning workflow, knowledge sharing and artifact reuse can help teach students how to apply a well-established holistic and top-down experimental methodology from natural sciences to continuously learn and improve the behavior of complex computer systems [35, 4]. Researchers can continue exposing more design and optimization knobs c, behavioral characteristics b, static and dynamic features f, and run-time state s to optimize and model the behavior of various interconnected objects from the workflow depending on their research interests and autotuning scenarios.

Such scenarios are also implemented as CK modules and describe which sets of choices to select, how to autotune them and which multiple characteristics to trade off. For example, existing scenarios include "autotuning OpenCL parameters to improve execution time", "autotuning GCC flags to balance execution time and code size", "autotuning LLVM flags to reduce execution time", "automatically fuzzing compilers to detect bugs", "exploring CPU and GPU frequency in terms of execution time and power consumption", "autotuning deep learning algorithms in terms of speed, accuracy, energy, memory usage and costs", and so on.

You can see some of the autotuning scenarios using the following commands:

$ ck pull repo:ck-crowdtuning
$ ck search module --tags="program optimization"
$ ck list program
They can then be invoked from the command line as follows:
$ ck autotune program:[CK program alias] --scenario=[above CK scenario alias]

4  Implementing universal compiler flag autotuning

In this section we would like to show how to customize our universal autotuning workflow to tackle an old but still unsolved problem: finding the most efficient combination of compiler flags which minimizes program size and execution time.

Indeed, the rising complexity of ever-changing hardware has made compiler development very challenging. Popular GCC and LLVM compilers nowadays include hundreds of optimizations (Figure 8) and often fail to produce efficient code (in terms of execution time and code size) on realistic workloads within a reasonable compilation time [10, 58, 76, 77, 4]. Such large design and optimization spaces mean that hardware and compiler designers can afford to explore only a tiny fraction of the whole optimization space, using just a few ad-hoc benchmarks and data sets on a few architectures, in a tough mission to assemble -O3, -Os and other optimization levels across all supported architectures and workloads.

Our idea is to keep the compiler as a simple collection of code analysis and transformation routines and separate it from optimization heuristics. In that case we can use the CK autotuning workflow to collaboratively optimize multiple shared benchmarks and realistic workloads across diverse hardware, exchange optimization results, and continuously learn and update compiler optimization heuristics for a given hardware as a compiler plugin. We demonstrate this approach by randomly autotuning compiler flags for the susan corners program with the aging GCC 4.9.2 and the latest GCC 7.1.0, and comparing them with Clang 3.8.1. We have been monitoring and optimizing the execution time and code size of this popular image processing application across different compilers and platforms for many years [24]. That is why we are interested to see whether we can still improve it with the CK autotuner on the latest Raspberry Pi 3 (Model B) devices (RPi3), which are extensively used for educational purposes.

First of all, we added the susan program with the corners algorithm to the ctuning-programs repository with the JSON meta information describing compilation and execution, as shown in Figure 9.

We can then test its compilation and execution by invoking the program pipeline as follows:

$ ck pipeline program:cbench-automotive-susan

The CK program pipeline will first attempt to detect platform features (OS, CPU, GPU) and embed them into the input dictionary under the features key. Note that, in the case of cross-compilation for a target platform different from the host one (Android, a remote platform via SSH, etc.), it is possible to specify such a platform using CK os entries and the --target_os= flag.

For example, it is possible to compile and run a given CK program for Android via adb as follows:

$ ck ls os
$ ck pipeline program:cbench-automotive-susan --target_os=android21-arm64

Next, CK will try to resolve software dependencies and prepare the environment for compilation by detecting already installed compilers using CK soft:compiler.* entries, or installing new ones using CK package:compiler.* if none are found. Each installed compiler for each target will have an associated CK entry with a prepared environment, letting computer systems researchers work with different versions of different tools:

$ ck show env
$ ck show env --target_os=android21-arm64
$ ck show env --tags=compiler

The automatically detected version of a selected compiler is used by CK to find and preload all available optimization flags from the related compiler:* entries into the choices key of the pipeline input. An example of such flags and tags in the CK JSON format for GCC 4.9 is shown in Figure 10. The community can continue extending such descriptions for different compilers including GCC, LLVM, Julia, Open64, PathScale, Java, MVCC, ICC and PGI using either the public ck-autotuning repository or their own ones.

Finally, the CK program pipeline compiles a given program, runs it on a target platform and fills in the characteristics sub-dictionary of the common pipeline dictionary with the compilation time, object and binary sizes, the MD5 sum of the binary, the execution time, the consumed energy (if supported by the platform), and all other obtained measurements.
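To give a feel for how such measurements can be consumed programmatically, the sketch below invokes the program pipeline via ck.access and reads the characteristics sub-dictionary described above; the exact nesting of keys in the real pipeline output may differ, so treat the key paths and names as assumptions.

import ck.kernel as ck

# Roughly equivalent to "ck pipeline program:cbench-automotive-susan"
r = ck.access({'action': 'pipeline',
               'module_uoa': 'program',
               'data_uoa': 'cbench-automotive-susan',
               'out': 'con'})               # print pipeline output to the console
if r['return'] > 0:
    print('CK error: ' + r['error'])
else:
    # Key paths below are assumptions based on the description above.
    ch = r.get('characteristics', {})
    print('execution time (s):', ch.get('execution_time'))
    print('binary size (bytes):', ch.get('binary_size'))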

We are now ready to implement universal compiler flag autotuning coupled with this program pipeline. As a proof of concept, we implemented a GCC compiler flag exploration strategy which automatically generates N random combinations of compiler flags, compiles a given program with each combination, runs it, and records all results (the inputs and outputs of the pipeline) in a reproducible form in a local CK repository using the experiment module from the ck-analytics repository:

$ ck pull repo:ck-crowdtuning
$ ck info module:experiment.tune.compiler.flags.gcc

The JSON meta information of this module describes which keys to select in the program pipeline, how to tune them, and which characteristics to monitor and record, as shown in Figure 11. Note that a string starting with ## is used to reference any key in a complex, nested JSON or Python dictionary (a CK flat key [4]). Such a flat key always starts with # followed by #key if it is a dictionary key or @position_in_a_list if it is a value in a list. CK also supports wildcards in such flat keys, such as "##compiler_flags#*" and "##characteristics#*", to select multiple sub-keys, dictionaries and lists in a given dictionary.
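The short self-contained sketch below shows one way such flat keys can be produced from a nested dictionary; it is meant only to clarify the notation, not to reproduce the exact CK implementation.

# Build CK-style "flat keys" for a nested structure: the key starts with '#',
# '#name' descends into a dictionary and '@N' indexes into a list.
def flatten(obj, prefix='#'):
    flat = {}
    if isinstance(obj, dict):
        for k, v in obj.items():
            flat.update(flatten(v, prefix + '#' + str(k)))
    elif isinstance(obj, list):
        for n, v in enumerate(obj):
            flat.update(flatten(v, prefix + '@' + str(n)))
    else:
        flat[prefix] = obj
    return flat

pipeline = {'choices': {'compiler_flags': ['-O3', '-fno-ivopts']},
            'characteristics': {'run': {'execution_time': 3.4}}}
print(flatten(pipeline))
# -> {'##choices#compiler_flags@0': '-O3',
#     '##choices#compiler_flags@1': '-fno-ivopts',
#     '##characteristics#run#execution_time': 3.4}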

We can now invoke this CK experimental scenario from the command line as follows:

$ ck autotune program:cbench-automotive-susan --iterations=300 --repetitions=3 
  --scenario=experiment.tune.compiler.flags.gcc
  --cmd_key=corners --record_uoa=tmp-susan-corners-gcc4-300-rnd

CK will generate 300 random combinations of compiler flags, compile the susan corners program with each combination, run each produced binary 3 times to check variation, and record the results in the experiment:tmp-susan-corners-gcc4-300-rnd entry. We can now visualize these autotuning results using the following command:

$ ck plot graph:tmp-susan-corners-gcc4-300-rnd

Figure 12 shows a manually annotated graph with the outcome of GCC 4.9.2 random compiler flag autotuning applied to susan corners on an RPi3 device in terms of execution time (with variation) and code size. Each blue point on this graph corresponds to one combination of random compiler flags. The red line highlights the frontier of all autotuning results (not necessarily Pareto optimal) which trade off execution time and code size during multi-objective optimization. We also plotted points for the default GCC compilation without any flags and for the -O3 and -Os optimization levels. Finally, we decided to compare these optimization results with Clang 3.8.1, also available on RPi3.


Figure 12



ID Compiler Time (sec.) Size (bytes) Flags
A1 GCC 4.9.2 11.7 ± 0.0 60560
A2 GCC 4.9.2 4.3 ± 0.1 36360 -O3
A3 GCC 4.9.2 6.2 ± 0.1 32184 -Os
A4R GCC 4.9.2 4.2 ± 0.0 32448 -O3 -fno-guess-branch-probability -fno-if-conversion -fno-ivopts -fno-schedule-insns -fsingle-precision-constant --param max-unswitch-insns=5
A5R GCC 4.9.2 3.7 ± 0.1 33376 -O3 -fbranch-probabilities -fno-ivopts -fno-sched-dep-count-heuristic
A6R GCC 4.9.2 3.4 ± 0.0 33804 -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
A7 CLANG 3.8.1 11.1 ± 0.1 58368
A8 CLANG 3.8.1 4.5 ± 0.1 35552 -O3



Results of GCC 4.9.2 random compiler flag autotuning of susan corners program on Raspberry Pi 3 (Model B) device using CK with a highlighted frontier (trading-off execution time and code size) and best found combinations of flags on this frontier.

Besides showing that GCC -O3 (optimization choice A2) and Clang -O3 (optimization choice A8) produce very similar code, these results confirm that it is indeed possible to automatically reach the execution time and binary size of the -O3 and -Os levels, starting from non-optimized code, within tens to hundreds of autotuning iterations (green improvement vectors with ~3.6x execution time speedup and ~1.6x binary size improvement). The graph also shows that it is possible to improve upon the best optimization level -O3 much further and obtain a ~1.3x execution time speedup (optimization solution A6R), or obtain an 11% binary size improvement without sacrificing the original execution time (optimization solution A4R). Such automatic squeezing of binary size without sacrificing performance can be very useful for future IoT devices.

Note that it is possible to browse all results in a user-friendly way via web browser using the following command:

$ ck browse experiment:tmp-susan-corners-gcc4-300-rnd

CK will then start the internal CK web server available in the ck-web repository, launch the default web browser, and open a web page with all given experimental results. Each experiment on this page has an associated button with a command line to replay it via CK, such as:

$ ck replay experiment:7b41a4ac1b3b4f2b --point=00e81f4e4abb371d

CK will then attempt to reproduce this experiment using the same input and report any differences in the output. This simplifies the validation of shared experimental results (optimizations, models, bugs) by the community, possibly with a different software and hardware setup (CK will automatically adapt the workflow to the user's platform).

We also provide support to help researchers visualize their results as interactive graphs using the popular D3.js library, as demonstrated at this link.

Looking at the above optimization results, one may notice that one of the original optimization solutions on the frontier (A4) has ~40 optimization flags, while A4R has only 7, as shown in Table 1. The natural reason is that not all randomly selected flags contribute to improvements. That is why we developed a simple and universal complexity reduction algorithm: it iteratively and randomly removes choices from a found solution one by one if they do not influence the monitored characteristics, such as execution time and code size in our example.
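In essence, the reduction works as sketched below; this is a simplified, self-contained illustration (with a toy evaluate function standing in for a full pipeline run), not the actual CK module.

import random

def reduce_solution(flags, evaluate, tolerance=0.02):
    # Iteratively drop flags that do not influence the monitored characteristics.
    # 'evaluate' stands in for one run of the program pipeline: it takes a list of
    # flags and returns a tuple of characteristics, e.g. (execution_time, code_size).
    baseline = evaluate(flags)
    remaining = list(flags)
    for flag in random.sample(remaining, len(remaining)):     # random removal order
        trial = [f for f in remaining if f != flag]
        result = evaluate(trial)
        # Keep the removal only if no monitored characteristic degrades beyond
        # the allowed measurement-noise tolerance.
        if all(r <= b * (1.0 + tolerance) for r, b in zip(result, baseline)):
            remaining = trial
    return remaining

# Toy example where only two of the four flags actually matter:
def toy_evaluate(flags):
    time = 4.2 - (0.5 if '-fno-ivopts' in flags else 0.0)
    size = 36360 - (3000 if '-fsome-size-flag' in flags else 0)
    return (time, size)

print(reduce_solution(['-fno-ivopts', '-fsome-size-flag', '-fgcse-sm', '-fno-web'], toy_evaluate))
# -> ['-fno-ivopts', '-fsome-size-flag']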


Table 1

ID Flags
A4 -O3 -fira-algorithm=priority -fcaller-saves -fno-devirtualize-speculatively -fno-function-cse -fgcse-sm -fno-guess-branch-probability -fno-if-conversion -fno-inline-functions-called-once -fipa-reference -fno-ira-loop-pressure -fira-share-save-slots -fno-isolate-erroneous-paths-dereference -fno-ivopts -floop-nest-optimize -fmath-errno -fmove-loop-invariants -fsched-last-insn-heuristic -fsched2-use-superblocks -fno-schedule-insns -fno-signed-zeros -fsingle-precision-constant -fno-tree-sink -fno-unsafe-loop-optimizations --param asan-instrument-reads=1 --param gcse-unrestricted-cost=5 --param l1-cache-size=11 --param large-function-growth=33 --param loop-invariant-max-bbs-in-loop=636 --param max-completely-peel-loop-nest-depth=7 --param max-delay-slot-live-search=163 --param max-gcse-insertion-ratio=28 --param max-inline-insns-single=282 --param max-inline-recursive-depth-auto=0 --param max-jump-thread-duplication-stmts=6 --param max-last-value-rtl=4062 --param max-pipeline-region-insns=326 --param max-sched-region-blocks=17 --param max-tail-merge-iterations=2 --param max-unswitch-insns=5 --param max-vartrack-expr-depth=6 --param min-spec-prob=1 --param omega-eliminate-redundant-constraints=1 --param omega-max-keys=366 --param omega-max-wild-cards=36 --param sms-dfa-history=0
A4R -O3 -fno-guess-branch-probability -fno-if-conversion -fno-ivopts -fno-schedule-insns -fsingle-precision-constant --param max-unswitch-insns=5


One of the original optimization solutions found after autotuning with random selection of compiler flags (A4) and the reduced optimization solution (A4R), which delivers the same or better execution time and code size.

Such complexity reduction (pruning) of an existing solution can be invoked as follows (the --prune_md5 flag tells CK to exclude a given choice without running the code if the MD5 sum of the produced binary did not change, thus considerably speeding up flag pruning):

$ ck replay experiment:93974bf451f957eb --point=74e9c9f14b424ba7 --prune --prune_md5 @prune.json

The prune.json file describes the conditions on program pipeline keys under which a given choice should be removed, as shown in Figure 13.

Such a universal complexity reduction approach helps software engineers better understand the individual contribution of each flag to improvements or degradations of all monitored characteristics, such as execution time and code size, as shown in Figure 14.

At the request of compiler developers, we also provided an extension to our complexity reduction module to explicitly turn off all available optimization choices one by one if they do not influence the found optimization result. Table 2 demonstrates this approach and shows all compiler optimizations contributing to the found optimization solution. It can help improve internal optimization heuristics and global optimization levels such as -O3, and improve machine-learning-based optimization predictions. This extension can be invoked by adding the flags --prune_invert --prune_invert_do_not_remove_key when reducing the complexity of a given solution, for example:

$ ck replay experiment:93974bf451f957eb --point=74e9c9f14b424ba7 --prune --prune_md5 --prune_invert --prune_invert_do_not_remove_key @prune.json


Table 2

ID Flags
A6R -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
A6RI -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-bit-ccp -fno-tree-partial-pre -fno-tree-pta -fno-associative-math -fno-auto-inc-dec -fno-branch-probabilities -fno-branch-target-load-optimize -fno-branch-target-load-optimize2 -fno-caller-saves -fno-check-data-deps -fno-combine-stack-adjustments -fno-conserve-stack -fno-compare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fno-cse-skip-blocks -fno-cx-limited-range -fno-data-sections -fdce -fno-delayed-branch -fno-devirtualize -fno-devirtualize-speculatively -fno-early-inlining -fno-ipa-sra -fno-expensive-optimizations -fno-fat-lto-objects -fno-fast-math -fno-finite-math-only -fno-float-store -fforward-propagate -fno-function-sections -fno-gcse-after-reload -fno-gcse-las -fno-gcse-lm -fno-graphite-identity -fno-gcse-sm -fno-hoist-adjacent-loads -fno-if-conversion -fif-conversion2 -fno-indirect-inlining -fno-inline-functions -fno-inline-functions-called-once -fno-ipa-cp -fno-ipa-cp-clone -fno-ipa-pta -fipa-pure-const -fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-loop-pressure -fno-ira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fno-isolate-erroneous-paths-attribute -fno-keep-inline-functions -fno-keep-static-consts -fno-live-range-shrinkage -fno-loop-block -fno-loop-interchange -fno-loop-strip-mine -fno-loop-nest-optimize -fno-loop-parallelize-all -fno-lto -fno-merge-all-constants -fno-merge-constants -fno-modulo-sched -fno-modulo-sched-allow-regmoves -fmove-loop-invariants -fno-branch-count-reg -fno-defer-pop -fno-function-cse -fguess-branch-probability -finline -fmath-errno -fno-peephole -fpeephole2 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss -fomit-frame-pointer -fno-optimize-sibling-calls -fno-partial-inlining -fno-peel-loops -fno-predictive-commoning -fno-prefetch-loop-arrays -fno-ree -fno-rename-registers -freorder-blocks -fno-reorder-blocks-and-partition -fno-rerun-cse-after-loop -fno-reschedule-modulo-scheduled-loops -fno-rounding-math -fno-sched2-use-superblocks -fsched-pressure -fno-sched-spec-load -fno-sched-spec-load-dangerous -fno-sched-group-heuristic -fsched-critical-path-heuristic -fno-sched-spec-insn-heuristic -fno-sched-rank-heuristic -fno-sched-dep-count-heuristic -fschedule-insns -fschedule-insns2 -fno-section-anchors -fno-selective-scheduling -fno-selective-scheduling2 -fno-sel-sched-pipelining -fno-sel-sched-pipelining-outer-loops -fno-shrink-wrap -fno-signaling-nans -fno-single-precision-constant -fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-aliasing -fstrict-overflow -fno-tracer -fno-tree-builtin-call-dce -fno-tree-ccp -ftree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop -ftree-copyrename -ftree-dce -ftree-dominator-opts -fno-tree-dse -ftree-forwprop -fno-tree-fre -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -ftree-loop-im -fno-tree-phiprop -fno-tree-loop-distribution -fno-tree-loop-distribute-patterns -fno-tree-loop-linear -ftree-loop-optimize -fno-tree-loop-vectorize -fno-tree-pre -ftree-reassoc -fno-tree-sink -ftree-slsr -ftree-sra -fno-tree-switch-conversion -fno-tree-tail-merge -ftree-ter -fno-tree-vectorize -ftree-vrp -fno-unit-at-a-time -fno-unroll-all-loops -fno-unroll-loops -fno-unsafe-loop-optimizations -fno-unsafe-math-optimizations -fno-unswitch-loops -fno-variable-expansion-in-unroller -fno-vect-cost-model -fno-vpt -fno-web -fno-whole-program -fno-wpa -fexcess-precision=standard -ffp-contract=off -fira-algorithm=CB -fira-region=all


Explicitly switching off all compiler flags one by one if they do not influence the optimization result - useful to understand all compiler optimizations which contributed to the found solution.

We have been analyzing the already aging GCC 4.9.2 because it is still the default compiler of the Debian Jessie distribution on RPi3. However, we would also like to check how our universal autotuner works with the latest GCC 7.1.0.

Since there is not yet a standard Debian GCC 7.1.0 package available for RPi3, we need to build it from scratch. This is not a straightforward task since we have to pick the correct configuration flags to adapt the GCC build to the rather outdated RPi3 libraries. However, once we manage to do it, we can automate this process using the CK package module.

We created a public ck-dev-compilers repository to automate the building and installation of various compilers, including GCC and LLVM, via CK. It is therefore possible to install GCC 7.1.0 on RPi3 as follows (see the Appendix or the GitHub repository README for more details):

$ ck pull repo:ck-dev-compilers 

$ ck install package:compiler-gcc-any-src-linux-no-deps --env.PARALLEL_BUILDS=1 --env.GCC_COMPILE_CFLAGS=-O0 --env.GCC_COMPILE_CXXFLAGS=-O0 --env.EXTRA_CFG_GCC=--disable-bootstrap --env.RPI3=YES --force_version=7.1.0

This CK package has an install.sh script which is customized using environment variables or --env flags to build GCC for a target platform. The JSON meta data of this CK package lists optional software dependencies which CK has to resolve before installation (similar to CK compilation). If the installation succeeds, you should be able to see two prepared environments for GCC 4.9.2 and GCC 7.1.0, which now co-exist in the system:

$ ck show env --tags=gcc

Whenever we now invoke CK autotuning, the CK software and package manager will detect multiple available versions of a required software dependency and let us choose which compiler version to use.

Let us now autotune the same susan corners program by generating 300 random combinations of GCC 7.1.0 compiler flags and recording the results in the experiment:tmp-susan-corners-gcc7-300-rnd entry:

$ ck autotune program:cbench-automotive-susan --iterations=300 --repetitions=3 
  --scenario=experiment.tune.compiler.flags.gcc
  --cmd_key=corners --record_uoa=tmp-susan-corners-gcc7-300-rnd

Figure 15 shows the results of such GCC 7.1.0 compiler flag autotuning (B points) and compares them against GCC 4.9.2 (A points). Note that this graph is also available in interactive form online.


Figure 15



ID Compiler Time (sec.) Size (bytes) Flags
A2 GCC 4.9.2 4.3 ± 0.1 36360 -O3
A5R GCC 4.9.2 3.7 ± 0.1 33376 -O3 -fbranch-probabilities -fno-ivopts -fno-sched-dep-count-heuristic
A6R GCC 4.9.2 3.4 ± 0.0 33804 -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
B1 GCC 7.1.0 11.5 ± 0.0 58008
B2 GCC 7.1.0 3.2 ± 0.0 34432 -O3
B3 GCC 7.1.0 4.4 ± 0.0 29980 -Os
B4 GCC 7.1.0 3.1 ± 0.1 31460 -O3 -fno-cx-fortran-rules -fno-devirtualize -fno-expensive-optimizations -fno-if-conversion -fira-share-save-slots -fno-ira-share-spill-slots -fno-ivopts -fno-loop-strip-mine -finline -fno-math-errno -frounding-math -fno-sched-rank-heuristic -fno-sel-sched-pipelining-outer-loops -fno-semantic-interposition -fsplit-wide-types -fno-tree-ccp -ftree-dse
B4R GCC 7.1.0 3.1 ± 0.1 31420 -O3 -fno-expensive-optimizations -fno-ivopts -fno-math-errno



Results of GCC 7.1.0 random compiler flag autotuning of susan corners program on Raspberry Pi 3 (Model B) device using CK with a highlighted frontier (trading-off execution time and code size), best combinations of flags on this frontier, and comparison with the results from GCC 4.9.2.

It is interesting to see a considerable improvement in the execution time of susan corners when moving from GCC 4.9 to GCC 7.1 with the best optimization level -O3. This graph also shows that new optimizations added during the past 3 years opened up many new opportunities, thus considerably expanding the autotuning frontier (light red dashed line versus dark red dashed line). Autotuning, however, only managed to achieve a modest improvement of a few percent over -O3.

On the other hand, GCC -O3 and -Os are still far from achieving the best trade-offs between execution time and code size. For example, it is still possible to improve the program binary size by ~10% (reduced solution B4R) without degrading the best execution time achieved with the -O3 level, or to improve the execution time of the -Os level by ~28% while slightly degrading code size by ~5%.

Note that for readers' convenience we added scripts to reproduce and validate all results from this section to the following CK entries:

$ ck pull repo:ck-rpi-optimization-results 

$ ck find script:rpi3-susan*

These results confirm that it is difficult to manually prepare a compiler optimization heuristic which can deliver good trade-offs between execution time and code size in such large design and optimization spaces. They also suggest that either susan corners or similar code was eventually added to the compiler regression testing suite, or some engineer checked it manually and fixed the compiler heuristic. However, there is no guarantee that future GCC versions will still perform well on the susan corners program, nor do these results guarantee that GCC 7.1.0 will perform well on other realistic workloads or devices.

5  Crowdsourcing autotuning

We use our universal CK autotuning workflow to teach students and end-users how to automatically find good trade offs between multiple characteristics for any individual program, data set, compiler, environment and hardware. At the same time, automatically tuning many realistic workloads is very costly and can easily take from days to weeks and months [24].

Common experimental frameworks can help tackle this problem too by crowdsourcing autotuning across diverse hardware provided by volunteers and combining it with online classification, machine learning and run-time adaptation [5, 66, 6]. However, our previous frameworks either did not cope well with the "big data" problem (the cTuning framework [5, 9] based on a MySQL database) or were too "heavy" (Collective Mind, aka the cTuning 3 framework [4]).

The extensible CK workflow framework, combined with our cross-platform package manager, internal web server and machine learning modules, helped solve most of the above issues. For example, we introduced the notion of a remote repository in CK: whenever such a repository is accessed, CK simply forwards all JSON requests to the appropriate web server.

CK always has a default remote repository, remote-ck, connected with the public optimization repository running the CK web server at cKnowledge.org/repo:

$ ck load repo:remote-ck --min

For example, one can list publicly available experiments from the command line as follows:

$ ck list remote-ck:experiment:* | sort

Such an organization allows one to crowdsource autotuning, i.e. to distribute autotuning of shared workloads in a cloud or across diverse platforms simply by using remote repositories instead of local ones. On the other hand, it does not address the problem of optimizing larger applications with multiple hot spots. Nor does it solve the "big data" problem when a large amount of data from multiple participants, needed for reproducibility, is continuously aggregated on a CK server.

However, we have already been addressing the first problem by either instrumenting, monitoring and optimizing hot code regions in large applications using our small "XOpenME" library, or even extracting such code regions from a large application together with a run-time data set and registering them in CK as standalone programs (codelets or computational species), as shown in Figure 16 [4].

In the MILEPOST project [24] we used a proprietary "codelet extractor" tool from CAPS Entreprise (now dissolved) to automatically extract such hot spots with their data sets from several real software projects and 8 popular benchmark suites including NAS, MiBench, SPEC2000, SPEC2006, Powerstone, UTDSP and SNU-RT. We shared those with a permissive license as CK programs in the ctuning-programs repository to be compatible with the presented CK autotuning workflow. We continue adding real, open-source applications and libraries as CK program entries (GEMM, HOG, SLAM, convolutions) or manually extracting and sharing interesting code regions from them with the help of the community. Such a large collection of diverse and realistic workloads should help make computer systems research more applied and practical.

Like many other scientists, we also faced a big data problem when continuously aggregating large amounts of raw optimization data during crowd-tuning for further processing including machine learning [9]. We managed to solve this problem in CK by using online pre-processing of raw data and online classification to record only the most efficient optimization solutions (on a frontier in the case of multi-objective autotuning) along with unexpected behavior (bugs and numerical instability) [6]. It is now possible to invoke crowd-tuning of GCC compiler flags (improving execution time) in CK as follows:

$ ck crowdtune program --iterations=50 --scenario=8289e0cf24346aa7

or

$ ck crowdsource program.optimization --iterations=50 --scenario=8289e0cf24346aa7

In contrast with traditional autotuning, CK will first query the remote-ck repository to obtain the most efficient optimization choices, aka solutions (combinations of random compiler flags in our example), for a given trade-off scenario (GCC compiler flag tuning to minimize execution time), compiler version, platform and OS. CK will then select a random CK program (computational species), compile and run it with all these top optimizations, and then try N extra random optimizations (random combinations of GCC flags) to continue increasing design and optimization space coverage. CK will then send the highest improvements of the monitored characteristics (execution time in our example) achieved for each optimization solution, as well as the worst degradations, back to the public server. If a new optimization solution is found during random autotuning, CK will assign it a unique ID (solution_uid) and record it in the public repository. On the public server side, CK will merge the improvements and degradations for a given program from a participant with the global statistics, while recording how many programs achieved the highest improvement (best species) or worst degradation (worst species) for a given optimization, as shown in Figure 17.
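Conceptually, the client-side logic of this crowd-tuning loop can be sketched as follows; all helper functions below (fetch_shared_solutions, random_flags, run_pipeline, report_to_server) are hypothetical stand-ins for the real CK actions and are stubbed here only to illustrate the protocol described above.

import random

# --- Hypothetical stand-ins for the real CK actions (placeholders only) ------
def fetch_shared_solutions():
    # In reality: query the remote-ck repository for the most efficient flag
    # combinations for this scenario, compiler version, platform and OS.
    return ['-O3 -flto', '-O3 -funroll-loops']

def random_flags():
    return '-O3 ' + ' '.join(random.sample(['-fno-ivopts', '-fgcse-sm', '-ftracer'], 2))

def run_pipeline(flags):
    # In reality: compile and run the selected CK program via the program pipeline
    # and return its execution time; here we just simulate a measurement.
    return random.uniform(3.0, 5.0)

def report_to_server(stats):
    print('reporting to the public repository:', stats)

# --- Conceptual client-side crowd-tuning loop ---------------------------------
def crowdtune(n_extra=50):
    candidates = fetch_shared_solutions() + [random_flags() for _ in range(n_extra)]
    reference = run_pipeline('-O3')                     # baseline (-O3) execution time
    speedups = {flags: reference / run_pipeline(flags) for flags in candidates}
    best = max(speedups, key=speedups.get)              # highest improvement
    worst = min(speedups, key=speedups.get)             # worst degradation
    report_to_server({'best': (best, speedups[best]), 'worst': (worst, speedups[worst])})

crowdtune()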


Figure 17



Solution Pruned flags (complexity reduction) Best species Worst species
1 -O3 -flto 6 3
2 -O3 -fno-inline -flto 1 1
3 -O3 -fno-if-conversion2 -funroll-loops 2 1
4 -O3 -fpeel-loops -ftracer 1 3
5 -O3 -floop-nest-optimize -fno-sched-interblock -fno-tree-copy-prop -funroll-all-loops 4 1
6 -O3 -funroll-loops 2 3
7 -O3 -floop-strip-mine -funroll-loops 1 1
8 -O3 -fno-inline -fno-merge-all-constants -fno-tree-ccp -funroll-all-loops 2 3
9 -O3 -fno-tree-loop-if-convert -funroll-all-loops 3 2
10 -O3 -fno-section-anchors -fselective-scheduling2 -fno-tree-forwprop -funroll-all-loops 2 2
11 -O3 -fno-ivopts -funroll-loops 4 1
12 -O3 -fno-tree-ch -funroll-all-loops 1 1
13 -O3 -fno-move-loop-invariants -fno-tree-ch -funroll-loops 1 2
14 -O3 -fira-algorithm=priority -fno-ivopts 1 2
15 -O3 -fno-ivopts 2 4
16 -O3 -fno-sched-spec -fno-tree-ch 1 2
17 -O3 -fno-ivopts -fselective-scheduling -fwhole-program 1 1
18 -O3 -fno-omit-frame-pointer -fno-tree-loop-optimize 1 4
19 -O3 -fno-auto-inc-dec -ffinite-math-only 1 2
20 -O3 -fno-guess-branch-probability -fira-loop-pressure -fno-toplevel-reorder 1 5
21 -O3 -fselective-scheduling2 -fno-tree-pre 2 2
22 -O3 -fgcse-sm -fno-move-loop-invariants -fno-tree-forwprop -funroll-all-loops -fno-web 1 0
23 -O3 -fno-schedule-insns -fselective-scheduling2 1 2


[ Latest live results in online repository and replay info ]


Snapshot of top performing combinations of GCC 4.9.2 compiler flags together with highest speedups and worst degradations achieved across all shared CK workloads on RPi3.

This figure shows a snapshot of public optimization results with the top performing combinations of GCC 4.9.2 compiler flags on RPi3 devices which minimize the execution time of shared CK workloads (programs and data sets) in comparison with the -O3 optimization level. It also shows the highest speedup and the worst degradation achieved across all CK workloads for a given optimization solution, as well as the number of workloads where this solution was the best or the worst (online classification of all optimization solutions). Naturally, this snapshot, automatically generated from the public repository at the time of publication, may slightly differ from the continuously updated live optimization results available at this link. These results confirm that GCC 4.9.2 misses many optimization opportunities not covered by the -O3 optimization level.


Figure 18



Solution Pruned flags (complexity reduction) Best species Worst species
1 -O3 -fno-delayed-branch -flto -fno-selective-scheduling2 -fno-whole-program 6 0
2 -O3 -flto 4 1
3 -O3 -fno-inline -flto 2 1
4 -O3 -fno-cprop-registers -flto -funroll-all-loops 3 1
5 -O3 -fno-tree-fre -funroll-all-loops 2 1
6 -O3 -fno-predictive-commoning -fno-schedule-insns -funroll-loops 3 3
7 -O3 -funroll-loops 3 0
8 -O3 -fno-tree-ter -funroll-all-loops 3 1
9 -O3 -fno-merge-all-constants -fselective-scheduling2 -funroll-loops 1 0
10 -O3 -fno-devirtualize-at-ltrans -fno-predictive-commoning -fno-tree-pre 1 2
11 -O3 -fcheck-data-deps -fira-loop-pressure -fno-isolate-erroneous-paths-dereference -fno-sched-dep-count-heuristic -fsection-anchors -fsemantic-interposition -fno-tree-ch -fno-tree-loop-linear -fno-tree-partial-pre 2 2
12 -O3 -fno-schedule-insns -ftracer 2 3
13 -O3 -fno-auto-inc-dec -fguess-branch-probability -fipa-pure-const -freorder-blocks -fselective-scheduling2 -ftree-ccp -fno-tree-pre -ftree-tail-merge 1 1


[ Latest live results in online repository and replay info ]


Snapshot of top performing combinations of GCC 7.1.0 compiler flags together with highest speedups and worst degradations achieved across all shared CK workloads on RPi3.

Figure 18 with optimization results for GCC 7.1.0 also confirms that this version was considerably improved in comparison with GCC 4.9.2 (the latest live results are available in our public optimization repository at this link): fewer efficient optimization solutions were found during crowd-tuning (14 vs. 23), showing the overall improvement of the -O3 optimization level.

Nevertheless, GCC 7.1.0 still misses many optimization opportunities, simply because our long-term experience suggests that it is infeasible to prepare one universal and efficient optimization heuristic with good multi-objective trade-offs for all continuously evolving programs, data sets, libraries, optimizations and platforms. That is why we hope that our approach of combining a common workflow framework adaptable to software and hardware changes, a public repository of optimization knowledge, universal and collaborative autotuning across multiple hardware platforms (e.g. provided by volunteers or by HPC providers), and community involvement will help make optimization and testing of compilers more automatic and sustainable [6, 35]. Rather than spending a considerable amount of time writing their own autotuning and crowd-tuning frameworks, students and researchers can quickly reuse shared workflows, reproduce and learn from already existing optimizations, try to improve optimization heuristics, and have their results validated by the community.

Furthermore, besides using -Ox compiler levels, academic and industrial users can immediately take advantage of the various shared optimization solutions automatically found by volunteers for a given compiler and hardware via CK using the solution_uid flag. For example, users can test the most efficient combination of compiler flags which achieved the highest speedup for GCC 7.1.0 on RPi3 (see "Copy CID to clipboard" for a given optimization solution at this link) on their own programs using CK:

$ ck benchmark program:{new program} --shared_solution_cid=27bc42ee449e880e:79bca2b76876b5c6-8289e0cf24346aa7-f49649288ab0accd

or

$ ck benchmark program:{new program} -O27bc42ee449e880e:79bca2b76876b5c6-8289e0cf24346aa7-f49649288ab0accd

6  Autotuning and crowd-tuning real workloads

In this section we would like to show how we can apply universal autotuning and collaboratively found optimization solutions to several popular workloads used by the RPi community: zlib decode, zlib encode, 7z encode, aubio, ccrypt, gzip decode, gzip encode, minigzip decode, minigzip encode, rhash, sha512sum, unrar. We added the latest versions of these real programs to CK, describing how to compile and run them using CK JSON meta data:

$ ck ls ck-rpi-optimization:program:*

We can now autotune any of these programs via CK as described in Section 4. For example, the following command will autotune zlib decode workload with 150 random combinations of compiler flags including parametric and architecture specific ones, and will record results in a local repository:

$ ck autotune program:zlib --cmd_key=decode
  --iterations=150 --repetitions=3 
  --scenario=experiment.tune.compiler.flags.gcc
  --parametric_flags --cpu_flags --base_flags 
  --record_uoa=tmp-rpi3-zlib-decode-gcc4-150bpc-rnd

Figure 19 (link with interactive graph) shows a manually annotated graph with the outcome of such autotuning when using GCC 4.9.2 compiler on RPi3 device in terms of execution time with variation and code size. Each blue point on this graph is related to one combination of random compiler flags. The red line highlights the frontier of all autotuning results to let users trade off execution time and code size during multi-objective optimization. Similar to graphs in Section 4, we also plotted points when using several main GCC and Clang optimization levels.

In contrast with the susan corners workload, autotuning did not improve the execution time of zlib decode over the -O3 level, most likely because this algorithm is present in many benchmarking suites. On the other hand, autotuning impressively improved code size over -O3 by nearly 2x without sacrificing execution time, and by ~1.5x with an 11% execution time improvement over -Os (reduced optimization solution A4R), showing that code size optimization is still a second-class citizen.

Since local autotuning can still be quite costly (150 iterations to achieve the above results), we can instead first check the 10 to 20 most efficient combinations of compiler flags already found and shared by the community for this compiler and hardware (Figure 17). Note that the programs from this section did not participate in crowd-tuning, giving us a fair evaluation of the influence of shared optimizations on these programs, similar to leave-one-out cross-validation in machine learning.

Figure 20 shows the "reactions" of zlib decode to these optimizations in terms of execution time and code size (see the online interactive graph). We can see that the crowd-tuning solutions indeed cluster in a relatively small area close to -O3, with one collaborative solution (C1) close to the best optimization solution found during lengthy autotuning (A4R), thus providing a good trade-off between autotuning time, execution time and code size.

Autotuning zlib decode using GCC 7.1.0 reveals even more interesting results in comparison with susan corners, as shown in Figure 21 (see the online interactive graph). While there is practically no execution time improvement when switching from GCC 4.9.2 to GCC 7.1.0 at the -O3 and -Os optimization levels, GCC 7.1.0 -O3 considerably degraded code size by nearly 20%. Autotuning also shows fewer opportunities with GCC 7.1.0 in comparison with GCC 4.9.2: the best found optimization B4R is worse in terms of code size than A4R, also by around 20%. These results highlight the issues which both end-users and compiler designers face when searching for efficient combinations of compiler flags or preparing the default optimization levels -Ox.


Figure 21



ID Compiler Time (sec.) Size (bytes) Flags
A2 GCC 4.9.2 12.2 ± 0.0 101448 -O3
A4R GCC 4.9.2 12.1 ± 0.1 54272 -O2 -flto -fno-tree-fre
B1 GCC 7.1.0 41.3 ± 0.0 128376
B2 GCC 7.1.0 11.7 ± 0.1 119084 -O3
B3 GCC 7.1.0 13.7 ± 0.1 74280 -Os
B4R GCC 7.1.0 11.9 ± 0.1 78700 -O2 -fno-early-inlining -fno-tree-fre