Developing efficient software and hardware has never been harder, whether it is for a tiny IoT device or an Exascale supercomputer. Apart from the ever-growing design and optimization complexity, there are even more fundamental problems, such as the lack of the interdisciplinary knowledge required for effective software/hardware co-design, and a growing technology transfer gap between academia and industry.
We introduce our new educational initiative to tackle these problems by developing Collective Knowledge (CK), a unified experimental framework for computer systems research and development. We use CK to teach the community how to make their research artifacts and experimental workflows portable, reproducible, customizable and reusable while enabling sustainable R&D and facilitating technology transfer. We also demonstrate how to redesign multi-objective autotuning and machine learning as a portable and extensible CK workflow. Such workflows enable researchers to experiment with different applications, data sets and tools; crowdsource experimentation across diverse platforms; share experimental results, models, visualizations; gradually expose more design and optimization choices using a simple JSON API; and ultimately build upon each other's findings.
As a first practical step, we have implemented customizable compiler autotuning, crowdsourced optimization of diverse workloads across Raspberry Pi 3 devices, reduced execution time and code size by up to 40%, and applied machine learning to predict optimizations. We hope that this approach will help teach students how to build upon each other's work to enable an efficient and self-optimizing software/hardware/model stack for emerging workloads.
1 Introduction
Many recent international roadmaps for computer systems research
call for reinventing computing [1, 2, 3].
Indeed, developing, benchmarking, optimizing and co-designing hardware and software
has never been harder, whether for embedded and IoT devices
or for data centers and Exascale supercomputers.
This is caused by both physical limitations of existing technologies
and an unmanageable complexity of continuously changing computer systems
which already have too many design and optimization choices and objectives
to consider at all software and hardware levels [4],
as conceptually shown in Figure 1.
That is why most of these roadmaps now agree with our vision
that such problems should be solved in a close collaboration
between industry, academia and end-users [5, 6].
However, after we initiated artifact evaluation (AE) [7, 8] at several premier ACM and IEEE conferences to reproduce and validate experimental results from published papers, we noticed an even more fundamental problem: a growing technology transfer gap between academic research and industrial development. After evaluating more than 100 artifacts from the leading computer systems conferences in the past 4 years, we noticed that only a small fraction of research artifacts could be easily customized, ported to other environments and hardware, reused, and built upon. We have grown to believe that this is due to the lack of a common workflow framework that could simplify the implementation and sharing of artifacts and workflows as portable, customizable and reusable components with a common API and the meta information vital for open science [9].
At the same time, companies are always under pressure and rarely have time to dig into numerous academic artifacts shared as CSV/Excel files and ``black box'' VM and Docker images, or adapt numerous ad-hoc scripts to realistic and ever changing workloads, software and hardware. That is why promising techniques may remain in academia for decades while just being incrementally improved, put on the shelf when leading students graduate, and ``reinvented'' from time to time.
Autotuning is one such example: this very popular technique has been actively researched since the 1990s to automatically explore large optimization spaces and improve efficiency of computer systems [10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33]. Every year, dozens of autotuning papers get published to optimize some components of computer systems, improve and speed up exploration and co-design strategies, and enable run-time adaptation. Yet, when trying to make autotuning practical (in particular, by applying machine learning) we faced numerous challenges with integrating such published techniques into real, complex and continuously evolving software and hardware stack [5, 4, 6, 9].
Eventually, these problems motivated us to develop a common experimental framework and methodology, similar to those used in physics and other natural sciences, to collaboratively improve autotuning and other techniques. As part of this educational initiative, we implemented an extensible, portable and technology-agnostic workflow for autotuning using the open-source Collective Knowledge framework (CK) [34, 35]. Such workflows help researchers reuse already shared applications, kernels, data sets and tools, or add their own using a common JSON API and meta-description [36]. Moreover, such workflows can automatically adapt compilation and execution to a given environment on a given device using the integrated cross-platform package manager.
Our approach takes advantage of a powerful and holistic top-down methodology successfully used in physics and other sciences when studying complex systems. The key idea is to let novice researchers first master simple compiler flag autotuning scenarios while learning interdisciplinary techniques including machine learning and statistical analysis. Researchers can then gradually increase complexity to enable automatic and collaborative co-design of the whole SW/HW stack by exposing more design and optimization choices and multiple optimization objectives (execution time, code size, power consumption, memory usage, platform cost, accuracy, etc.), crowdsourcing autotuning across diverse devices provided by volunteers (similar to SETI@home [37]), continuously exchanging and discussing optimization results, and eventually building upon each other's findings.
We use our approach to optimize diverse kernels and real workloads such as
zlib
in terms of speed and code size by crowdsourcing compiler flag autotuning
across Raspberry Pi 3 devices using the default GCC 4.9.2 and the latest GCC
7.1.0 compilers.
We have been able to achieve up to 50% reductions in code size and from 15%
to 8 times speedups across different workloads over the ``-O3'' baseline.
Our CK workflow and all related artifacts are available on GitHub
to allow researchers to compare and improve various exploration strategies
(particularly those based on machine learning algorithms such as KNN, GA, SVM
and deep learning, though further documentation of the APIs is still
required) [24, 6].
We have also shared all experimental results in our open repository of
optimization knowledge [38, 39] to be
validated and reproduced by the community.
We hope that our approach will serve as a practical foundation for open, reproducible and sustainable computer systems research by connecting students, scientists, end-users, hardware designers and software developers to learn together how to co-design the next generation of efficient and self-optimizing computer systems, particularly via reproducible competitions such as ReQuEST [40].
This technical report is organized as follows. Section 2 introduces the Collective Knowledge framework (CK) and the concept of sharing artifacts as portable, customizable and reusable components. Section 3 describes how to implement a customizable, multi-dimensional and multi-objective autotuning as a CK workflow. Section 4 shows how to optimize compiler flags using our universal CK autotuner. Section 5 presents a snapshot of the latest optimization results from collaborative tuning of GCC flags for numerous shared workloads across Raspberry Pi 3 devices. Section 6 shows optimization results of zlib and other realistic workloads for GCC 4.9.2 and GCC 7.1.0 across Raspberry Pi 3 devices. Section 7 describes how to implement and crowdsource fuzzing of compilers and systems for various bugs using our customizable CK autotuning workflow. Section 8 shows how to predict optimizations via CK for previously unseen programs using machine learning. Section 9 demonstrates how to select and autotune models and features to improve optimization predictions while reducing complexity. Section 10 shows how to enable efficient, input-aware and adaptive libraries and programs via CK. Section 11 presents CK as an open platform to support reproducible and Pareto-efficient co-design competitions for the whole software/hardware/model stack for emerging workloads such as deep learning and quantum computing. We present future work in Section 12. We also include an Artifact Appendix to allow students to try our framework, participate in collaborative autotuning, gradually document APIs and improve experimental workflows.
2 Converting ad-hoc artifacts to portable and reusable components with a JSON API
Artifact sharing and reproducible experimentation are key
for our collaborative approach to machine-learning based optimization
and co-design of computer systems, which was first prototyped
during the EU-funded MILEPOST project [5, 41, 24].
Indeed, it is difficult and time consuming, if not impossible, to build useful predictive models
without large and diverse training sets (programs, data sets),
and without crowdsourcing design and optimization space exploration
across diverse hardware [9, 6].
While we have been actively promoting artifact sharing for the past 10 years since the MILEPOST project [5, 42], it is still relatively rare in the computer systems community. We have begun to understand possible reasons for that through our Artifact Evaluation initiative [7, 8] at PPoPP, CGO, PACT, SuperComputing and other leading ACM and IEEE conferences, which has attracted over a hundred artifacts in the past few years.
Unfortunately, nearly all the artifacts have been shared simply as zip archives, GitHub/GitLab/Bitbucket repositories, or VM/Docker images, with many ad-hoc scripts to prepare, run and visualize experiments, as shown in Figure 2a. While this is a good step towards reproducibility, such ad-hoc artifacts are difficult to reuse and customize as they do not provide a common API and meta information.
Some popular and useful services such as Zenodo [43] and FigShare [44] allow researchers to upload individual artifacts to specific websites while assigning DOI [45] and providing some meta information. This helps the community to discover the artifacts, but does not necessarily make them easy to reuse.
After an ACM workshop on reproducible research methodologies (TRUST'14) [46] and a Dagstuhl Perspectives workshop on Artifact Evaluation [8], we concluded that computer systems research lacked a common experimental framework, in contrast with other sciences [47].
Together with our fellow researchers, we also assembled the following wish-list for such a framework:
●
●
●
●
●
●
●
Since there was no available open-source framework with all these features, we decided to develop such a framework, Collective Knowledge (CK) [34, 35], from scratch with initial support from the EU-funded TETRACOM project [50]. CK is implemented as a small and portable Python module with a command line front-end to assist users in converting their local objects (code and data) into searchable, reusable and shareable directory entries with user-friendly aliases and auto-generated Unique ID, JSON API and JSON meta information [36], as described in [35, 51] and conceptually shown in Figure 2b.
The user first creates a new local CK repository as follows:
$ ck add repo:new-ck-repo
Initially, it is just an empty directory:
$ ck find repo:new-ck-repo
$ ls `ck find repo:new-ck-repo`
Now, the user starts adding research artifacts as CK components with extensible APIs. For example, after noticing that during our experiments we always perform 3 common actions on all our benchmarks ("compile", "run" and "autotune"), we may want to provide a common API for these actions and benchmarks rather than writing ad-hoc scripts. The user can provide such an API with actions by adding a new CK module to a CK repository as follows:
$ ck add new-ck-repo:module:program
CK will then create two levels of directories, module and program, in new-ck-repo and will add a dummy module.py where common object actions can be implemented later. CK will also create a sub-directory .cm (collective meta) with an automatically generated Unique ID of this module and various pre-defined descriptions in JSON format (date and time of module creation, author, license, etc.) to document the provenance of CK artifacts.
Users can now create holders (directories) for such objects sharing a common CK module and API as follows:
$ ck add new-ck-repo:program:new-benchmark
CK will again create two levels of directories: the first one specifying the used CK module (program) and the second one with the alias new-benchmark to keep the objects. CK will also create three files in an internal .cm directory:
●
●
●
Users can then find a path to a newly created object holder (CK entry) using the ck find program:new-benchmark command and then copy all files and sub-directories related to the given object using standard OS shell commands.
This makes it possible to get rid of ad-hoc scripts by implementing actions inside reusable CK Python modules, as shown in Figure 3. For example, the user can add an action such as compile to a given module as follows:
$ ck add_action module:program --func=compile
CK will create a dummy function body with an input dictionary i inside module.py in the CK module:program entry. Whenever this function is invoked via CK using the following format:
$ ck compile program:some_entry --param1=val1
the command line will be converted into the i dictionary and printed to the console to help novice users understand the CK API. The user can now substitute this dummy function with a specific action on a specific entry (some program, in our example, selected via its meta information), as conceptually shown in Figure 3. The example in Figure 3 also shows how to call CK functions from Python modules rather than from the command line using the ck.access function, and how to find a path to a given program entry and load its meta information and unique ID. For the reader's convenience, Figure 4 lists several important CK commands.
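To illustrate the points above, a filled-in action could look roughly as follows. This is a minimal sketch: the only strict requirements are that a CK module function accepts an input dictionary i and returns a dictionary with a return code; the body shown here is illustrative rather than the actual implementation.

import ck.kernel as ck

def compile(i):
    # i holds all command-line parameters converted into a dictionary,
    # e.g. i.get('param1') for --param1=val1.
    data_uoa = i.get('data_uoa', '')

    # Load the program entry to obtain its path, meta information and unique ID.
    r = ck.access({'action': 'load',
                   'module_uoa': 'program',
                   'data_uoa': data_uoa})
    if r['return'] > 0:
        return r                 # propagate the CK error dictionary

    path = r['path']             # directory of the CK entry holding the program
    meta = r['dict']             # JSON meta information (how to compile, run, etc.)
    uid  = r['data_uid']         # automatically generated unique ID

    # ... compile the program according to its meta information ...

    return {'return': 0}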
This functionality should be enough to start implementing unified compilation and execution of shared programs. For example, the program module can read instructions about how to compile and run a given program from the JSON meta data of the related entries, prepare and execute portable sub-scripts, collect various statistics, and embed them into the output dictionary in a unified way. This can also be gradually extended to include extra tools in the compilation and execution workflow, such as code instrumentation and profiling.
Here we immediately face another problem common in computer systems research: how to support multiple versions of various and continuously evolving tools and libraries? However, since we no longer hardwire calls to specific tools directly in scripts but invoke them from higher-level CK modules, we can detect all required tools and set up their environment before execution. To support this concept even better, we have developed a cross-platform package manager as the ck-env repository [52] with several CK modules including soft, env, package, os and platform. These modules allow the community to describe various operating systems (Linux, Windows, MacOS, Android); detect platform features (ck detect platform); detect multiple versions of already installed software (ck detect soft:compiler.gcc); prepare CK entries with their environments for a given OS and platform using the env module (ck show env), thus allowing easy co-existence of multiple versions of a given tool; install missing software using package modules; describe software dependencies using simple tags in a program meta description (such as compiler,gcc or lib,caffe); and ask the user to select an appropriate version during program compilation when multiple software versions are registered in CK, as shown in Figure 5.
Such an approach extends the concept of package managers such as Spack [53] and EasyBuild [54] by integrating them directly with experimental CK workflows, using a unified CK API, supporting any OS and platform, and allowing the community to gradually extend existing detection or installation procedures via CK Python scripts and CK meta data.
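Since every such command maps onto a module action with a JSON input, the same detection can also be scripted. A minimal Python sketch for platform detection follows; the structure of the returned dictionary is illustrative here.

import ck.kernel as ck

# Detect features of the host platform (the scripted equivalent of "ck detect platform").
r = ck.access({'action': 'detect',
               'module_uoa': 'platform'})
if r['return'] > 0:
    print('CK error: ' + r['error'])
    exit(1)

print(r.get('features', {}))   # OS, CPU, GPU and other detected platform features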
Note that this CK approach encourages the reuse of all such existing CK modules from shared CK repositories rather than writing numerous ad-hoc scripts. It should indeed be possible to substitute most of the ad-hoc scripts from public research projects (Figure 2) with just a few of the above modules and entries (Figure 6), and then collaboratively extend them, thus dramatically improving research productivity. For this reason, we keep track of all publicly shared modules and their repositories on this wiki page. The user just needs to add or update a .ckr.json file in the root directory of a given CK repository to describe a dependency on other existing CK repositories with the required modules or entries. Since it is possible to uniquely reference any CK entry by two Unique IDs (module UID:object UID), we also plan to develop a simple web service to automatically index and discover all modules, similarly to DOI.
The open, file-based format of CK repositories allows researchers to continue editing entries and their meta information directly using their favourite editors. It also simplifies the exchange of these entries using Git repositories, zip archives, Docker images and any other popular tool. At the same time, schema-free and human-readable Python dictionaries and JSON files help users collaboratively extend actions, APIs and meta information while keeping backward compatibility. Such an approach should let the community gradually and collaboratively convert and cross-link all existing ad-hoc code and data into unified components with an extensible API and meta information. This, in turn, allows users to organize their own research while reusing existing artifacts, building upon them, improving them and continuously contributing back to Collective Knowledge, similarly to Wikipedia.
We also noticed that CK can help students reduce the preparation time for Artifact Evaluation [7] at conferences while automating the preparation and validation of experiments, since all artifacts, workflows and repositories are immediately ready to be shared, ported and plugged into research workflows.
For example, the highest ranked artifact from the CGO'17 article [55] was implemented and shared using the CK framework [56]. That is why CK is now used and publicly extended by leading companies [35], universities [55] and organizations [57] to encourage, support and simplify technology transfer between academia and industry.
3 Assembling portable and customizable autotuning workflow
Autotuning combined with various run-time adaptation,
genetic and machine learning techniques is a popular approach
in computer systems research to automatically explore multi-dimensional
design and optimization spaces [10, 58, 59, 60, 11, 12, 13, 61, 14, 62, 15, 16, 17, 18, 19, 63, 20, 64, 65, 66, 67, 68, 69, 70, 29, 71, 72, 73].
CK allows us to unify such techniques by implementing a common, universal, portable, customizable, multi-dimensional and multi-objective autotuning workflow as a CK module (the pipeline module from the public ck-autotuning repository with the autotune function). This allows us to abstract autotuning by decoupling it from the autotuned objects such as "program". Users just need to provide a compatible "pipeline" function in the CK module which they want to be autotuned, with a specific API including the following keys in both its input and output:
● dependencies on the software environment required by the object (compilers, libraries, etc.);
● choices exposing all available design and optimization knobs (c);
● characteristics describing the monitored behavior (b), such as execution time and code size;
● features exposing static and dynamic properties of the object (f);
● state describing the run-time state of the platform and the object (s).
Autotuning can now be implemented as a universal and extensible workflow applied to any object with a matching JSON API by chaining together related CK modules with various exploration strategies, program transformation tools, compilers, program compilation and execution pipeline, architecture simulators, statistical analysis, Pareto frontier filter and other components, as conceptually shown in Figure 7. Researchers can also use unified machine learning CK modules (wrappers to R and scikit-learn [74]) to model the relationship between c, f, s and the observed behavior b, increase coverage, speed up (focus) exploration, and predict efficient optimizations [4, 6]. They can also take advantage of a universal complexity reduction module which can automatically simplify found solutions without changing their behavior, reduce models and features without sacrificing accuracy, localize performance issues via differential analysis [75], reduce programs to localize bugs, and so on.
Even more importantly, our concept of a universal autotuning workflow, knowledge sharing and artifact reuse can help teach students how to apply a well-established holistic and top-down experimental methodology from the natural sciences to continuously learn and improve the behavior of complex computer systems [35, 4]. Researchers can continue exposing more design and optimization knobs c, behavioral characteristics b, static and dynamic features f, and run-time state s to optimize and model the behavior of various interconnected objects from the workflow depending on their research interests and autotuning scenarios.
Such scenarios are also implemented as CK modules and describe which sets of choices to select, how to autotune them and which multiple characteristics to trade off. For example, existing scenarios include "autotuning OpenCL parameters to improve execution time", "autotuning GCC flags to balance execution time and code size", "autotuning LLVM flags to reduce execution time", "automatically fuzzing compilers to detect bugs", "exploring CPU and GPU frequency in terms of execution time and power consumption", "autotuning deep learning algorithms in terms of speed, accuracy, energy, memory usage and costs", and so on.
You can see some of the autotuning scenarios using the following commands:
$ ck pull repo:ck-crowdtuning
$ ck search module --tags="program optimization"
$ ck list program
They can then be invoked from the command line as follows:
$ ck autotune program:[CK program alias] --scenario=[above CK scenario alias]
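The same scenarios can also be driven from Python through the unified CK API. A minimal sketch follows, reusing the susan corners benchmark and the GCC compiler flag tuning scenario that appear later in this report; all values can be adjusted.

import ck.kernel as ck

# Programmatic equivalent of "ck autotune program:cbench-automotive-susan ..."
r = ck.access({'action': 'autotune',
               'module_uoa': 'program',
               'data_uoa': 'cbench-automotive-susan',
               'scenario': 'experiment.tune.compiler.flags.gcc',
               'cmd_key': 'corners',
               'iterations': 300,
               'repetitions': 3,
               'record_uoa': 'tmp-susan-corners-gcc4-300-rnd',
               'out': 'con'})
if r['return'] > 0:
    print('CK error: ' + r['error'])
    exit(1)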
Indeed, the rising complexity of ever-changing hardware
has made the development of compilers very challenging.
Popular GCC and LLVM compilers nowadays include hundreds
of optimizations (Figure 8)
and often fail to produce efficient code (in terms of execution time and code size)
on realistic workloads within a reasonable compilation time [10, 58, 76, 77, 4].
Such large design and optimization spaces mean
that hardware and compiler designers can afford to explore
only a tiny fraction of the whole optimization space
using just a few ad-hoc benchmarks and data sets on a few architectures
in a tough mission to assemble -O3, -Os and other
optimization levels across all supported architectures and workloads.
Our idea is to keep the compiler as a simple collection of code analysis and transformation routines
and to separate it from optimization heuristics.
In that case we can use the CK autotuning workflow to collaboratively optimize multiple
shared benchmarks and realistic workloads across diverse hardware, exchange optimization results,
and continuously learn and update compiler optimization heuristics for a given hardware
as a compiler plugin.
We will demonstrate this approach by randomly optimizing compiler flags
for the susan corners program with the aging GCC 4.9.2 and the latest GCC 7.1.0,
and by comparing them with Clang 3.8.1.
We have been monitoring and optimizing the execution time and code size of this popular image processing
application across different compilers and platforms for many years [24].
That is why we are interested to see whether we can still improve it with the CK autotuner
on the latest Raspberry Pi 3 (Model B) devices (RPi3), which are extensively used for educational purposes.
First of all, we added the susan program with the corners algorithm
to the ctuning-programs repository, together with the JSON meta information
describing its compilation and execution,
as shown in Figure 9.
We can then test its compilation and execution by invoking the program pipeline as follows:
The CK program pipeline will first attempt to detect platform features
(OS, CPU, GPU) and embed them into the input dictionary under the features key.
Note that in the case of cross-compilation for a target platform different from the host one
(Android, a remote platform via SSH, etc.),
it is possible to specify such a platform using CK os entries and the --target_os= flag.
For example, it is possible to compile and run a given CK program for Android via adb as follows:
Next, CK will try to resolve software dependencies and prepare the environment for compilation
by detecting already installed compilers using CK soft:compiler.* entries,
or by installing new ones using CK package:compiler.* entries if none is found.
Each installed compiler for each target
will have an associated CK entry with a prepared environment
to let computer systems researchers work with different
versions of different tools:
The automatically detected version of the selected compiler is used by CK
to find and preload all available optimization flags
from the related compiler:* entries into the choices key
of the pipeline input.
An example of such flags and tags in the CK JSON format
for GCC 4.9 is shown in Figure 10.
The community can continue extending such descriptions for different compilers,
including GCC, LLVM, Julia, Open64, PathScale, Java, MSVC, ICC and PGI,
using either the public ck-autotuning repository
or their own ones.
Finally, the CK program pipeline compiles a given program, runs it on a target platform
and fills in the characteristics sub-dictionary
with the compilation time, object and binary sizes, MD5 sum of the binary, execution time,
consumed energy (if supported by the platform), and all other obtained measurements
in the common pipeline dictionary.
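For illustration, a heavily simplified fragment of such a pipeline dictionary might look as follows; the key names are purely illustrative placeholders, and the actual keys are defined by the program pipeline module.

pipeline_output = {
    'characteristics': {
        'compile': {
            'compilation_time': 2.1,     # seconds
            'binary_size': 36360,        # bytes
            'md5_sum': 'e5a7...'         # later used to speed up flag pruning
        },
        'run': {
            'execution_time': 4.3        # seconds
        }
    }
}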
We are now ready to implement universal compiler flag autotuning coupled with
this program pipeline.
As a proof of concept, we implemented a GCC compiler flag exploration strategy
which automatically generates N random combinations of compiler flags,
compiles a given program with each combination, runs it and records all results
(the inputs and outputs of the pipeline) in a reproducible form in a local
CK repository using the experiment module from
the ck-analytics
repository:
The JSON meta information of this module describes which keys to select
in the program pipeline, how to tune them, and which characteristics to monitor
and record as shown in Figure 11.
Note that a string starting with ## is used to reference any key
in a complex, nested JSON or Python dictionary (a CK flat key [4]).
Such a flat key always starts with #
followed by #key if it is a dictionary key or
@position_in_a_list if it is a value in a list.
CK also supports wild cards in such flat keys,
such as "##compiler_flags#*" and "##characteristics#*",
to select multiple sub-keys, dictionaries
and lists in a given dictionary.
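For instance, given the following simplified and purely illustrative fragment of a pipeline dictionary, the flat keys in the comments reference its nested values:

pipeline = {
    'compiler_flags': {'-O3': 'on', '-funroll-loops': 'on'},
    'characteristics': {'run': {'execution_time': 4.3}}
}

# '##compiler_flags#*'                    matches both compiler flag entries
# '##characteristics#run#execution_time'  references the value 4.3
# '##characteristics#*'                   matches everything under 'characteristics'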
We can now invoke this CK experimental scenario from the command line as follows:
CK will generate 300 random combinations of compiler flags, compile the susan corners program
with each combination, run each produced binary 3 times to check variation, and record the
results in the experiment:tmp-susan-corners-gcc4-300-rnd entry.
We can now visualize these autotuning results using the following command line:
Figure 12 shows a manually annotated graph
with the outcome of GCC 4.9.2 random compiler flag autotuning
applied to susan corners on an RPi3 device in terms of execution
time (with variation) and code size.
Each blue point on this graph corresponds to one combination of random compiler flags.
The red line highlights the frontier of all autotuning results (not necessarily Pareto optimal)
which trade off execution time and code size during multi-objective optimization.
We also plotted the points obtained when the default GCC compilation is used without any flags
or with the -O3 and -Os optimization levels.
Finally, we decided to compare these optimization results with Clang 3.8.1, also available on RPi3.
Besides showing that GCC -O3 (optimization choice A2)
and Clang -O3 (optimization choice A8) produce very similar code,
these results confirm that it is indeed possible to automatically reach the execution time
and binary size of the -O3 and -Os levels, starting from non-optimized code,
within tens to hundreds of autotuning iterations (green improvement vectors with a 3.6x execution time speedup
and a 1.6x binary size improvement).
The graph also shows that it is possible to improve upon the best optimization level -O3
and obtain a further 1.3x execution time speedup (optimization solution A6R),
or obtain an 11% binary size improvement without sacrificing the original execution time
(optimization solution A4R).
Such automatic squeezing of binary size without sacrificing performance
can be very useful for future IoT devices.
Note that it is possible to browse all results in a user-friendly way
via a web browser using the following command:
CK will then start the internal CK web server
available in the ck-web
repository, launch the default web browser, and
open a web page with all the given experimental results.
Each experiment on this page has an associated button
with a command line to replay it via CK, such as:
CK will then attempt to reproduce this experiment using the same input
and report any differences in the output.
This simplifies validation of shared experimental results
(optimizations, models, bugs) by the community,
possibly with a different software and hardware setup
(CK will automatically adapt the workflow to the user's platform).
We also provided support to help researchers
visualize their results as interactive graphs
using the popular D3.js library, as demonstrated via this
link.
Looking at the above optimization results, one may notice
that one of the original optimization solutions on the frontier (A4)
has 40 optimization flags, while A4R has only 7, as shown in Table 1.
The natural reason is that not all randomly selected flags contribute to the improvements.
That is why we developed a simple and universal complexity reduction algorithm.
It iteratively and randomly removes choices from a found solution one by one
if they do not influence the monitored characteristics, such as execution time and code size
in our example.
Such complexity reduction (pruning) of an existing solution can be invoked as follows
(the --prune_md5 flag tells CK to exclude a given choice without running the code
if the MD5 sum of the produced binary did not change, thus considerably speeding up flag pruning):
The prune.json file describes conditions on the program pipeline keys
under which a given choice should be removed, as shown in Figure 13.
Such a universal complexity reduction approach helps software engineers better understand
the individual contribution of each flag to improvements or degradations of all monitored
characteristics, such as execution time and code size, as shown in Figure 14.
At the request of compiler developers, we also provided an extension to our complexity reduction module
to explicitly turn off all available optimization choices one by one
if they do not influence the found optimization result.
Table 2 demonstrates this approach and shows all compiler optimizations contributing to the found optimization solution.
It can help improve internal optimization heuristics and global optimization levels such as -O3,
and improve machine-learning-based optimization predictions.
This extension can be invoked by adding the flags --prune_invert --prune_invert_do_not_remove_key
when reducing the complexity of a given solution, such as:
We have been analyzing the already aging GCC 4.9.2 because
it is still the default compiler of the Debian Jessie distribution on RPi3.
However, we would also like to check how our universal autotuner
works with the latest GCC 7.1.0.
Since there is not yet a standard Debian GCC 7.1.0 package available for RPi3,
we need to build it from scratch.
This is not a straightforward task since we have to pick the correct
configuration flags which adapt the GCC build to the rather outdated RPi3 libraries.
However, once we manage to do it, we can automate this process
using the CK package module.
We created a public ck-dev-compilers repository
to automate the building and installation of various compilers, including GCC and LLVM, via CK.
It is therefore possible to install GCC 7.1.0 on RPi3 as follows
(see the Appendix or the GitHub repository README for more details):
This CK package has an install.sh script which is customized
using environment variables or --env flags to build GCC for a target platform.
The JSON meta data of this CK package describes optional software dependencies
which CK has to resolve before installation (similarly to CK compilation).
If the installation succeeds, you should be able to see two prepared environments
for GCC 4.9.2 and GCC 7.1.0, which now co-exist in the system.
Whenever we now invoke CK autotuning, the CK software and package manager
will detect multiple available versions of a required software dependency
and will let you choose which compiler version to use.
Let us now autotune the same susan corners program
by generating 300 random combinations of GCC 7.1.0 compiler flags
and recording the results in experiment:tmp-susan-corners-gcc7-300-rnd:
Figure 15 shows the results of such GCC 7.1.0
compiler flag autotuning (B points) and compares them
against GCC 4.9.2 (A points).
Note that this graph is also available in interactive form online.
It is interesting to see a considerable improvement in the execution time of susan corners
when moving from GCC 4.9 to GCC 7.1 with the best optimization level -O3.
This graph also shows that the new optimizations added during the past 3 years opened up
many new opportunities, thus considerably expanding the autotuning frontier (light red
dashed line versus dark red dashed line).
Autotuning only managed to achieve a modest improvement of a few percent over -O3.
On the other hand, GCC -O3 and -Os are still far from achieving
the best trade-offs between execution time and code size.
For example, it is still possible to reduce the program binary size
by 10% (reduced solution B4R) without degrading the best execution time
achieved with the -O3 level, or to improve the
execution time of the -Os level by 28% while slightly degrading the code size by 5%.
Note that for readers' convenience we added scripts to reproduce and validate
all results from this section to the following CK entries:
These results confirm that it is difficult to manually prepare a compiler optimization
heuristic which can deliver good trade-offs between execution time and code size
in such large design and optimization spaces.
They also suggest that either susan corners or similar code
was eventually added to the compiler regression testing suite,
or that some engineer checked it manually and fixed the compiler heuristic.
However, there is no guarantee that future GCC versions will still
perform well on the susan corners program.
Nor do these results guarantee that GCC 7.1.0
will perform well on other realistic workloads or devices.
Common experimental frameworks can help tackle this problem too by
crowdsourcing autotuning across diverse hardware provided by volunteers and combining it with online
classification, machine learning and run-time adaptation [5, 66, 6].
However, our previous frameworks either did not cope well with the "big data" problem
(the cTuning framework [5, 9] based on a MySQL database)
or were too "heavy" (the Collective Mind aka cTuning 3 framework [4]).
The extensible CK workflow framework, combined with our cross-platform package manager,
internal web server and machine learning modules, helped solve most of the above issues.
For example, we introduced the notion of a remote repository in CK:
whenever such a repository is accessed, CK simply forwards all JSON requests
to an appropriate web server.
CK always has a default remote repository remote-ck connected
with a public optimization repository running the CK web server
at cKnowledge.org/repo.
One can, for instance, see the publicly available experiments from the command line as follows:
Such an organization allows one to crowdsource autotuning, i.e. to distribute autotuning
of shared workloads in a cloud or across diverse platforms simply by using remote
repositories instead of local ones.
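For example, the experiments aggregated in the public repository can be queried in the same way as local ones. A minimal Python sketch for listing them (equivalent to "ck list remote-ck:experiment:*"; the output keys follow the CK kernel API, but treat this as a sketch) could look as follows:

import ck.kernel as ck

# List experiment entries in the public remote-ck repository; CK transparently
# forwards the JSON request to the remote CK web server.
r = ck.access({'action': 'list',
               'module_uoa': 'experiment',
               'repo_uoa': 'remote-ck'})
if r['return'] > 0:
    print('CK error: ' + r['error'])
    exit(1)

for e in r['lst']:                 # one dictionary per found entry
    print(e['data_uoa'])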
On the other hand, this does not address the problem of optimizing larger applications
with multiple hot spots.
Nor does it solve the "big data" problem
when a large amount of data from multiple participants,
needed for reproducibility, is continuously aggregated on a CK server.
However, we have already been addressing the first problem by either
instrumenting, monitoring and optimizing hot code regions in large applications
using our small "XOpenME" library, or even extracting such code regions
from a large application together with their run-time data sets and registering them
in CK as standalone programs (codelets or computational species),
as shown in Figure 16 [4].
In the MILEPOST project [24]
we used a proprietary "codelet extractor" tool from CAPS Entreprise
(now dissolved) to automatically extract such hot spots with their data sets
from several real software projects and 8 popular benchmark suites
including NAS, MiBench, SPEC2000, SPEC2006, Powerstone, UTDSP and SNU-RT.
We shared those with a permissive license as CK programs
in the ctuning-programs repository
to be compatible with the presented CK autotuning workflow.
We continue adding real, open-source applications and libraries as CK program entries
(GEMM, HOG, SLAM, convolutions) or manually extracting and sharing interesting code
regions from them with the help of the community.
Such a large collection of diverse and realistic workloads
should help make computer systems research more applied and practical.
Like many other scientists, we also faced a big data problem when continuously
aggregating large amounts of raw optimization data during crowd-tuning
for further processing, including machine learning [9].
We managed to solve this problem in CK by using
online pre-processing of raw data and online classification
to record only the most efficient optimization solutions
(those on the frontier in the case of multi-objective autotuning)
along with unexpected behavior (bugs and numerical
instability) [6].
It is now possible to invoke crowd-tuning of GCC compiler flags (improving execution time) in CK as follows:
or
In contrast with traditional autotuning, CK will first query the remote-ck repository
to obtain the most efficient optimization choices, aka solutions (combinations of random compiler flags in our example),
for a given trade-off scenario (GCC compiler flag tuning to minimize execution time), compiler version,
platform and OS.
CK will then select a random CK program (computational species),
compile and run it with all these top optimizations,
and then try N extra random optimizations (random combinations of GCC flags)
to continue increasing the design and optimization space coverage.
CK will then send the highest improvements of the monitored characteristics
(execution time in our example) achieved for each optimization solution, as well as the worst degradations,
back to a public server.
If a new optimization solution is found during random autotuning,
CK will assign it a unique ID (solution_uid)
and will record it in the public repository.
On the public server side, CK will merge the improvements and degradations for a given
program from a participant with the global statistics, while recording how many programs
achieved the highest improvement (best species) or the worst degradation (worst species) for a given optimization,
as shown in Figure 17.
This figure shows a snapshot of the public optimization results
with the top performing combinations of GCC 4.9.2 compiler flags
on RPi3 devices which minimize the execution time of shared CK workloads
(programs and data sets) in comparison with the -O3 optimization level.
It also shows the highest speedup and the worst degradation achieved
across all CK workloads for a given optimization solution, as well
as the number of workloads where this solution was the best or the worst
(online classification of all optimization solutions).
Naturally, this snapshot, automatically generated from the public repository
at the time of publication, may slightly differ from the continuously updated
live optimization results available at this link.
These results confirm that GCC 4.9.2 misses many optimization opportunities
not covered by the -O3 optimization level.
Figure 18 with the optimization results
for GCC 7.1.0 also confirms that this version was considerably improved
in comparison with GCC 4.9.2
(the latest live results are available in our public optimization repository
at this link):
fewer efficient optimization solutions were found during crowd-tuning
(14 vs. 23), showing the overall improvement of the -O3 optimization level.
Nevertheless, GCC 7.1.0 still misses many optimization opportunities,
simply because our long-term experience suggests that it is infeasible
to prepare one universal and efficient optimization heuristic
with good multi-objective trade-offs for all continuously
evolving programs, data sets, libraries, optimizations and platforms.
That is why we hope that our approach of combining a common workflow framework
adaptable to software and hardware changes, a public repository of optimization knowledge,
universal and collaborative autotuning across multiple hardware platforms
(e.g. provided by volunteers or by HPC providers), and community involvement
should help make the optimization and testing of compilers
more automatic and sustainable [6, 35].
Rather than spending a considerable amount of time writing their own autotuning and crowd-tuning
frameworks, students and researchers can quickly reuse shared workflows,
reproduce and learn from already existing optimizations, try to improve the optimization heuristics,
and have their results validated by the community.
Furthermore, besides using the -Ox compiler levels, academic and industrial users
can immediately take advantage of the various shared optimization solutions automatically
found by volunteers for a given compiler and hardware via CK using the solution_uid flag.
For example, users can test the most efficient combination of compiler flags
which achieved the highest speedup for GCC 7.1.0 on RPi3
(see "Copy CID to clipboard" for a given optimization solution at this
link)
on their own programs using CK:
or
We can now autotune any of these programs via CK as described in Section 4.
For example, the following command will autotune the zlib decode workload
with 150 random combinations of compiler flags, including parametric and architecture-specific
ones, and will record the results in a local repository:
Figure 19
(link with interactive graph)
shows a manually annotated graph with the outcome of such autotuning
when using the GCC 4.9.2 compiler on an RPi3 device
in terms of execution time (with variation) and code size.
Each blue point on this graph corresponds to one combination of random compiler flags.
The red line highlights the frontier of all autotuning results
to let users trade off execution time and code size
during multi-objective optimization.
Similarly to the graphs in Section 4, we also plotted points
for several main GCC and Clang optimization levels.
In contrast with the susan corners workload, autotuning did not improve the execution time
of zlib decode over the -O3 level, most likely because this algorithm is present
in many benchmarking suites.
On the other hand, autotuning impressively improved the code size over -O3
by nearly 2x without sacrificing execution time, and over -Os by 1.5x with an 11% execution time
improvement (reduced optimization solution A4R),
showing that code size optimization is still a second-class citizen.
Since local autotuning can still be quite costly (150 iterations to achieve the above results),
we can instead first check the 10 to 20 most efficient combinations of compiler flags
already found and shared by the community for this compiler and hardware
(Figure 17).
Note that the programs from this section did not participate in crowd-tuning,
which lets us fairly evaluate the influence of the shared optimizations
on these programs, similar to leave-one-out cross-validation in machine learning.
Figure 20 shows "reactions"
of zlib decode to these optimizations in terms of execution time and code size
(the online interactive graph).
We can see that the crowd-tuning solutions indeed cluster in a relatively small area
close to -O3, with one collaborative solution (C1) close to the
best optimization solution found during lengthy autotuning (A4R),
thus providing a good trade-off between autotuning time, execution time and code size.
Autotuning zlib decode using GCC 7.1.0 reveals even more interesting results
in comparison with susan corners, as shown in Figure 21
(the online interactive graph).
While there is practically no execution time improvement when switching from GCC 4.9.2 to GCC 7.1.0
at the -O3 and -Os optimization levels, GCC 7.1.0 -O3 considerably degraded the code size by nearly 20%.
Autotuning also shows fewer opportunities with GCC 7.1.0 in comparison with GCC 4.9.2:
the best found optimization B4R is worse than A4R in terms of code size, also by around 20%.
These results highlight the issues which both end-users and compiler designers face
when searching for efficient combinations of compiler flags or preparing the
default optimization levels -Ox.
4 Implementing universal compiler flag autotuning
In this section we would like to show how to
customize our universal autotuning workflow
to tackle an old but still unsolved problem:
finding the most efficient
selection of compiler flags which minimizes
program size and execution time.
$ ck pipeline program:cbench-automotive-susan
$ ck ls os
$ ck pipeline program:cbench-automotive-susan --target_os=android21-arm64
$ ck show env
$ ck show env --target_os=android21-arm64
$ ck show env --tags=compiler
$ ck pull repo:ck-crowdtuning
$ ck info module:experiment.tune.compiler.flags.gcc
$ ck autotune program:cbench-automotive-susan --iterations=300 --repetitions=3 --scenario=experiment.tune.compiler.flags.gcc --cmd_key=corners --record_uoa=tmp-susan-corners-gcc4-300-rnd
$ ck plot graph:tmp-susan-corners-gcc4-300-rnd
Figure 12
ID | Compiler | Time (sec.) | Size (bytes) | Flags
A1 | GCC 4.9.2 | 11.7 ± 0.0 | 60560 | (default, no flags)
A2 | GCC 4.9.2 | 4.3 ± 0.1 | 36360 | -O3
A3 | GCC 4.9.2 | 6.2 ± 0.1 | 32184 | -Os
A4R | GCC 4.9.2 | 4.2 ± 0.0 | 32448 | -O3 -fno-guess-branch-probability -fno-if-conversion -fno-ivopts -fno-schedule-insns -fsingle-precision-constant --param max-unswitch-insns=5
A5R | GCC 4.9.2 | 3.7 ± 0.1 | 33376 | -O3 -fbranch-probabilities -fno-ivopts -fno-sched-dep-count-heuristic
A6R | GCC 4.9.2 | 3.4 ± 0.0 | 33804 | -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
A7 | CLANG 3.8.1 | 11.1 ± 0.1 | 58368 | (default, no flags)
A8 | CLANG 3.8.1 | 4.5 ± 0.1 | 35552 | -O3
Results of GCC 4.9.2 random compiler flag autotuning of the susan corners program on a Raspberry Pi 3 (Model B) device using CK, with a highlighted frontier (trading off execution time and code size) and the best found combinations of flags on this frontier.
$ ck browse experiment:tmp-susan-corners-gcc4-300-rnd
$ ck replay experiment:7b41a4ac1b3b4f2b --point=00e81f4e4abb371d
Table 1
A4:
-O3 -fira-algorithm=priority -fcaller-saves -fno-devirtualize-speculatively -fno-function-cse -fgcse-sm -fno-guess-branch-probability -fno-if-conversion -fno-inline-functions-called-once -fipa-reference -fno-ira-loop-pressure -fira-share-save-slots -fno-isolate-erroneous-paths-dereference -fno-ivopts -floop-nest-optimize -fmath-errno -fmove-loop-invariants -fsched-last-insn-heuristic -fsched2-use-superblocks -fno-schedule-insns -fno-signed-zeros -fsingle-precision-constant -fno-tree-sink -fno-unsafe-loop-optimizations --param asan-instrument-reads=1 --param gcse-unrestricted-cost=5 --param l1-cache-size=11 --param large-function-growth=33 --param loop-invariant-max-bbs-in-loop=636 --param max-completely-peel-loop-nest-depth=7 --param max-delay-slot-live-search=163 --param max-gcse-insertion-ratio=28 --param max-inline-insns-single=282 --param max-inline-recursive-depth-auto=0 --param max-jump-thread-duplication-stmts=6 --param max-last-value-rtl=4062 --param max-pipeline-region-insns=326 --param max-sched-region-blocks=17 --param max-tail-merge-iterations=2 --param max-unswitch-insns=5 --param max-vartrack-expr-depth=6 --param min-spec-prob=1 --param omega-eliminate-redundant-constraints=1 --param omega-max-keys=366 --param omega-max-wild-cards=36 --param sms-dfa-history=0
A4R:
-O3 -fno-guess-branch-probability -fno-if-conversion -fno-ivopts -fno-schedule-insns -fsingle-precision-constant --param max-unswitch-insns=5
One of the original optimization solutions found after autotuning with a random selection of compiler flags (A4), and the reduced optimization solution (A4R) which results in the same or better execution time and code size.
$ ck replay experiment:93974bf451f957eb --point=74e9c9f14b424ba7 --prune --prune_md5 @prune.json
$ ck replay experiment:93974bf451f957eb --point=74e9c9f14b424ba7 --prune --prune_md5 --prune_invert --prune_invert_do_not_remove_key @prune.json
Table 2
A6R:
-O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
A6RI:
-O3 -fno-inline-small-functions -fno-ivopts -fno-tree-bit-ccp -fno-tree-partial-pre -fno-tree-pta -fno-associative-math -fno-auto-inc-dec -fno-branch-probabilities -fno-branch-target-load-optimize -fno-branch-target-load-optimize2 -fno-caller-saves -fno-check-data-deps -fno-combine-stack-adjustments -fno-conserve-stack -fno-compare-elim -fcprop-registers -fcrossjumping -fcse-follow-jumps -fno-cse-skip-blocks -fno-cx-limited-range -fno-data-sections -fdce -fno-delayed-branch -fno-devirtualize -fno-devirtualize-speculatively -fno-early-inlining -fno-ipa-sra -fno-expensive-optimizations -fno-fat-lto-objects -fno-fast-math -fno-finite-math-only -fno-float-store -fforward-propagate -fno-function-sections -fno-gcse-after-reload -fno-gcse-las -fno-gcse-lm -fno-graphite-identity -fno-gcse-sm -fno-hoist-adjacent-loads -fno-if-conversion -fif-conversion2 -fno-indirect-inlining -fno-inline-functions -fno-inline-functions-called-once -fno-ipa-cp -fno-ipa-cp-clone -fno-ipa-pta -fipa-pure-const -fno-ipa-reference -fno-ira-hoist-pressure -fno-ira-loop-pressure -fno-ira-share-save-slots -fira-share-spill-slots -fisolate-erroneous-paths-dereference -fno-isolate-erroneous-paths-attribute -fno-keep-inline-functions -fno-keep-static-consts -fno-live-range-shrinkage -fno-loop-block -fno-loop-interchange -fno-loop-strip-mine -fno-loop-nest-optimize -fno-loop-parallelize-all -fno-lto -fno-merge-all-constants -fno-merge-constants -fno-modulo-sched -fno-modulo-sched-allow-regmoves -fmove-loop-invariants -fno-branch-count-reg -fno-defer-pop -fno-function-cse -fguess-branch-probability -finline -fmath-errno -fno-peephole -fpeephole2 -fno-sched-interblock -fno-sched-spec -fno-signed-zeros -fno-toplevel-reorder -fno-trapping-math -fno-zero-initialized-in-bss -fomit-frame-pointer -fno-optimize-sibling-calls -fno-partial-inlining -fno-peel-loops -fno-predictive-commoning -fno-prefetch-loop-arrays -fno-ree -fno-rename-registers -freorder-blocks -fno-reorder-blocks-and-partition -fno-rerun-cse-after-loop -fno-reschedule-modulo-scheduled-loops -fno-rounding-math -fno-sched2-use-superblocks -fsched-pressure -fno-sched-spec-load -fno-sched-spec-load-dangerous -fno-sched-group-heuristic -fsched-critical-path-heuristic -fno-sched-spec-insn-heuristic -fno-sched-rank-heuristic -fno-sched-dep-count-heuristic -fschedule-insns -fschedule-insns2 -fno-section-anchors -fno-selective-scheduling -fno-selective-scheduling2 -fno-sel-sched-pipelining -fno-sel-sched-pipelining-outer-loops -fno-shrink-wrap -fno-signaling-nans -fno-single-precision-constant -fno-split-ivs-in-unroller -fno-split-wide-types -fno-strict-aliasing -fstrict-overflow -fno-tracer -fno-tree-builtin-call-dce -fno-tree-ccp -ftree-ch -fno-tree-coalesce-vars -fno-tree-copy-prop -ftree-copyrename -ftree-dce -ftree-dominator-opts -fno-tree-dse -ftree-forwprop -fno-tree-fre -fno-tree-loop-if-convert -fno-tree-loop-if-convert-stores -ftree-loop-im -fno-tree-phiprop -fno-tree-loop-distribution -fno-tree-loop-distribute-patterns -fno-tree-loop-linear -ftree-loop-optimize -fno-tree-loop-vectorize -fno-tree-pre -ftree-reassoc -fno-tree-sink -ftree-slsr -ftree-sra -fno-tree-switch-conversion -fno-tree-tail-merge -ftree-ter -fno-tree-vectorize -ftree-vrp -fno-unit-at-a-time -fno-unroll-all-loops -fno-unroll-loops -fno-unsafe-loop-optimizations -fno-unsafe-math-optimizations -fno-unswitch-loops -fno-variable-expansion-in-unroller -fno-vect-cost-model -fno-vpt -fno-web -fno-whole-program -fno-wpa -fexcess-precision=standard -ffp-contract=off -fira-algorithm=CB -fira-region=all
Explicitly switching off all compiler flags one by one if they do not influence the optimization result; this is useful for understanding all compiler optimizations which contributed to the found solution.
$ ck pull repo:ck-dev-compilers
$ ck install package:compiler-gcc-any-src-linux-no-deps --env.PARALLEL_BUILDS=1 --env.GCC_COMPILE_CFLAGS=-O0 --env.GCC_COMPILE_CXXFLAGS=-O0 --env.EXTRA_CFG_GCC=--disable-bootstrap --env.RPI3=YES --force_version=7.1.0
$ ck show env --tags=gcc
$ ck autotune program:cbench-automotive-susan --iterations=300 --repetitions=3 --scenario=experiment.tune.compiler.flags.gcc --cmd_key=corners --record_uoa=tmp-susan-corners-gcc7-300-rnd
Figure 15
ID | Compiler | Time (sec.) | Size (bytes) | Flags
A2 | GCC 4.9.2 | 4.3 ± 0.1 | 36360 | -O3
A5R | GCC 4.9.2 | 3.7 ± 0.1 | 33376 | -O3 -fbranch-probabilities -fno-ivopts -fno-sched-dep-count-heuristic
A6R | GCC 4.9.2 | 3.4 ± 0.0 | 33804 | -O3 -fno-inline-small-functions -fno-ivopts -fno-tree-partial-pre
B1 | GCC 7.1.0 | 11.5 ± 0.0 | 58008 | (default, no flags)
B2 | GCC 7.1.0 | 3.2 ± 0.0 | 34432 | -O3
B3 | GCC 7.1.0 | 4.4 ± 0.0 | 29980 | -Os
B4 | GCC 7.1.0 | 3.1 ± 0.1 | 31460 | -O3 -fno-cx-fortran-rules -fno-devirtualize -fno-expensive-optimizations -fno-if-conversion -fira-share-save-slots -fno-ira-share-spill-slots -fno-ivopts -fno-loop-strip-mine -finline -fno-math-errno -frounding-math -fno-sched-rank-heuristic -fno-sel-sched-pipelining-outer-loops -fno-semantic-interposition -fsplit-wide-types -fno-tree-ccp -ftree-dse
B4R | GCC 7.1.0 | 3.1 ± 0.1 | 31420 | -O3 -fno-expensive-optimizations -fno-ivopts -fno-math-errno
Results of GCC 7.1.0 random compiler flag autotuning of the susan corners program on a Raspberry Pi 3 (Model B) device using CK, with a highlighted frontier (trading off execution time and code size), the best combinations of flags on this frontier, and a comparison with the results from GCC 4.9.2.
$ ck pull repo:ck-rpi-optimization-results
$ ck find script:rpi3-susan*
5 Crowdsourcing autotuning
We use our universal CK autotuning workflow to teach students and end-users
how to automatically find good trade-offs between multiple characteristics
for any individual program, data set, compiler, environment and hardware.
At the same time, automatically tuning many realistic workloads
is very costly and can easily take from days to weeks and months [24].
$ ck load repo:remote-ck --min
$ ck list remote-ck:experiment:* | sort
$ ck crowdtune program --iterations=50 --scenario=8289e0cf24346aa7
$ ck crowdsource program.optimization --iterations=50 --scenario=8289e0cf24346aa7
Figure 17
Solution | Pruned flags (complexity reduction) | Best species | Worst species
1 | -O3 -flto | 6 | 3
2 | -O3 -fno-inline -flto | 1 | 1
3 | -O3 -fno-if-conversion2 -funroll-loops | 2 | 1
4 | -O3 -fpeel-loops -ftracer | 1 | 3
5 | -O3 -floop-nest-optimize -fno-sched-interblock -fno-tree-copy-prop -funroll-all-loops | 4 | 1
6 | -O3 -funroll-loops | 2 | 3
7 | -O3 -floop-strip-mine -funroll-loops | 1 | 1
8 | -O3 -fno-inline -fno-merge-all-constants -fno-tree-ccp -funroll-all-loops | 2 | 3
9 | -O3 -fno-tree-loop-if-convert -funroll-all-loops | 3 | 2
10 | -O3 -fno-section-anchors -fselective-scheduling2 -fno-tree-forwprop -funroll-all-loops | 2 | 2
11 | -O3 -fno-ivopts -funroll-loops | 4 | 1
12 | -O3 -fno-tree-ch -funroll-all-loops | 1 | 1
13 | -O3 -fno-move-loop-invariants -fno-tree-ch -funroll-loops | 1 | 2
14 | -O3 -fira-algorithm=priority -fno-ivopts | 1 | 2
15 | -O3 -fno-ivopts | 2 | 4
16 | -O3 -fno-sched-spec -fno-tree-ch | 1 | 2
17 | -O3 -fno-ivopts -fselective-scheduling -fwhole-program | 1 | 1
18 | -O3 -fno-omit-frame-pointer -fno-tree-loop-optimize | 1 | 4
19 | -O3 -fno-auto-inc-dec -ffinite-math-only | 1 | 2
20 | -O3 -fno-guess-branch-probability -fira-loop-pressure -fno-toplevel-reorder | 1 | 5
21 | -O3 -fselective-scheduling2 -fno-tree-pre | 2 | 2
22 | -O3 -fgcse-sm -fno-move-loop-invariants -fno-tree-forwprop -funroll-all-loops -fno-web | 1 | 0
23 | -O3 -fno-schedule-insns -fselective-scheduling2 | 1 | 2
[ Latest live results in online repository and replay info ]
Snapshot of the top performing combinations of GCC 4.9.2 compiler flags together with the highest speedups and worst degradations achieved across all shared CK workloads on RPi3.
Figure 18
Solution | Pruned flags (complexity reduction) | Best species | Worst species
1 | -O3 -fno-delayed-branch -flto -fno-selective-scheduling2 -fno-whole-program | 6 | 0
2 | -O3 -flto | 4 | 1
3 | -O3 -fno-inline -flto | 2 | 1
4 | -O3 -fno-cprop-registers -flto -funroll-all-loops | 3 | 1
5 | -O3 -fno-tree-fre -funroll-all-loops | 2 | 1
6 | -O3 -fno-predictive-commoning -fno-schedule-insns -funroll-loops | 3 | 3
7 | -O3 -funroll-loops | 3 | 0
8 | -O3 -fno-tree-ter -funroll-all-loops | 3 | 1
9 | -O3 -fno-merge-all-constants -fselective-scheduling2 -funroll-loops | 1 | 0
10 | -O3 -fno-devirtualize-at-ltrans -fno-predictive-commoning -fno-tree-pre | 1 | 2
11 | -O3 -fcheck-data-deps -fira-loop-pressure -fno-isolate-erroneous-paths-dereference -fno-sched-dep-count-heuristic -fsection-anchors -fsemantic-interposition -fno-tree-ch -fno-tree-loop-linear -fno-tree-partial-pre | 2 | 2
12 | -O3 -fno-schedule-insns -ftracer | 2 | 3
13 | -O3 -fno-auto-inc-dec -fguess-branch-probability -fipa-pure-const -freorder-blocks -fselective-scheduling2 -ftree-ccp -fno-tree-pre -ftree-tail-merge | 1 | 1
[ Latest live results in online repository and replay info ]
Snapshot of the top performing combinations of GCC 7.1.0 compiler flags together with the highest speedups and worst degradations achieved across all shared CK workloads on RPi3.
$ ck benchmark program:{new program} --shared_solution_cid=27bc42ee449e880e:79bca2b76876b5c6-8289e0cf24346aa7-f49649288ab0accd
$ ck benchmark program:{new program} -O27bc42ee449e880e:79bca2b76876b5c6-8289e0cf24346aa7-f49649288ab0accd
6 Autotuning and crowd-tuning real workloads
In this section we would like to show how we can apply universal autotuning
and collaboratively found optimization solutions to several popular workloads
used by the RPi community: zlib decode, zlib encode,
7z encode, aubio, ccrypt, gzip decode, gzip encode, minigzip decode,
minigzip encode, rhash, sha512sum, unrar.
We added the latest versions of these real programs
to CK, describing how to compile and run them
using the CK JSON meta data:
$ ck ls ck-rpi-optimization:program:*
$ ck autotune program:zlib --cmd_key=decode --iterations=150 --repetitions=3 --scenario=experiment.tune.compiler.flags.gcc --parametric_flags --cpu_flags --base_flags --record_uoa=tmp-rpi3-zlib-decode-gcc4-150bpc-rnd
Figure 19
ID | Compiler | Time (sec.) | Size (bytes) | Flags
A1 | GCC 4.9.2 | 41.3 ± 0.0 | 131140 | (default, no flags)
A2 | GCC 4.9.2 | 12.2 ± 0.0 | 101448 | -O3
A3 | GCC 4.9.2 | 13.6 ± 0.0 | 78116 | -Os
A4R | GCC 4.9.2 | 12.1 ± 0.1 | 54272 | -O2 -flto -fno-tree-fre
A5 | CLANG 3.8.1 | 38.5 ± 0.0 | 132080 | (default, no flags)
A6 | CLANG 3.8.1 | 12.9 ± 0.1 | 90076 | -O3
Results of GCC 4.9.2 random compiler flag autotuning of the zlib decode workload on an RPi3 device using CK, with a highlighted frontier (trading off execution time and code size) and the best found combinations of flags on this frontier.
Figure 20
ID | Compiler | Time (sec.) | Size (bytes) | Flags
A2 | GCC 4.9.2 | 12.2 ± 0.0 | 101448 | -O3
A4R | GCC 4.9.2 | 12.1 ± 0.1 | 54272 | -O2 -flto -fno-tree-fre
C1 | GCC 4.9.2 | 12.2 ± 0.1 | 64184 | -O3 -fno-inline -flto
Speeding up GCC 4.9.2 autotuning of the zlib decode workload on an RPi3 device using the 10 to 20 best performing combinations of compiler flags already found and shared by the community during crowd-tuning.
Figure 21
ID | Compiler | Time (sec.) | Size (bytes) | Flags
A2 | GCC 4.9.2 | 12.2 ± 0.0 | 101448 | -O3
A4R | GCC 4.9.2 | 12.1 ± 0.1 | 54272 | -O2 -flto -fno-tree-fre
B1 | GCC 7.1.0 | 41.3 ± 0.0 | 128376 | (default, no flags)
B2 | GCC 7.1.0 | 11.7 ± 0.1 | 119084 | -O3
B3 | GCC 7.1.0 | 13.7 ± 0.1 | 74280 | -Os
B4R | GCC 7.1.0 | 11.9 ± 0.1 | 78700 | -O2 -fno-early-inlining -fno-tree-fre