NETWORK OF EXCELLENCE ON HIGH PERFORMANCE AND EMBEDDED ARCHITECTURE AND COMPILATION

AUTUMN COMPUTING SYSTEMS WEEK, SEPTEMBER 21-23, 2015, MILANO, ITALY

WELCOME TO ACACES’15, JULY 12-18, 2015, FIUGGI, ITALY

Follow us on LinkedIn! hipeac.net/linkedin

info 43
appears quarterly, july 2015




MESSAGE FROM THE HIPEAC COORDINATOR

CONTENT

intro

This spring, a new company, miDiagnostics, was launched in Belgium. That in itself is not special, but its product captured my attention. It will commercialize a disposable chip called miLab. This chip will be able to automatically carry out a sophisticated blood analysis (DNA, proteins, viruses, blood cells, …) in ten to fifteen minutes, starting from one drop of blood. The process will be as simple as the routine glucose test already used at home by millions of people diagnosed with diabetes. One chip will cost between 10 and 20 euro, and it will be for sale everywhere. The chip integrates a complete lab in just a few square centimeters, with the outcome of the analysis sent wirelessly to a smartphone. If this product is successful, it will change the way doctors work, because they will no longer have to wait for the outcome of a blood analysis. It will bring an affordable diagnostic tool to the poorest regions of the world, and it will disrupt the clinical microbiology market. Beyond that, police officers will be able to analyze the blood of suspected drunk drivers without having to call a doctor to take a blood sample, and monthly blood checks at home will become feasible, leading to the early detection of medical conditions long before they become life-threatening. The chip also has applications in other sectors where wet laboratories are used: the environment, the food industry, sports, and many more. By creating a new class of (disposable) devices, it also creates opportunities for the computing systems industry. Maybe one day we will buy disposable transistors, rather than transistors that have to last for a couple of years. The future will tell whether this technology becomes a game changer, and whether Europe will lead in this domain. The investors of miDiagnostics believe it will: the company starts with an investment of 60 M€, which makes it the highest-capitalized startup ever in Belgium. In May, HiPEAC underwent its third review. The reviewers concluded that the project has made very good progress during the third reporting period and that the con-

hipeac activity
THEMATIC SESSIONS IN THE OSLO CSW (MAY 5-7, 2015)

hipeac announce
COMPUTING: THE CURRENT AND ITS PROBABILITY BASED FUTURE
CEVELOP C++ IDE RELEASED

hipeac news
JUNIPER NETWORKS AND MAXELER TECHNOLOGIES ANNOUNCE NEW COMPUTE-INTEGRATED NETWORK SWITCH
EUROPEAN LLVM CONFERENCE 2015

in the spotlight
SAFER TRAVELS AND IMPLANTS WITH DESYRE SYSTEMS
SOCKETS OVER RDMA AND SHARED PERIPHERALS FOR ARM MICROSERVERS
THE AEGLE PROJECT
SUCCESSFUL CONCLUSION TO EU PARAPHRASE PROJECT
COLLECTIVE KNOWLEDGE: A FRAMEWORK FOR SYSTEMATIC PERFORMANCE ANALYSIS AND OPTIMIZATION

hipeac students
COLLABORATION GRANT: HECTOR ORTEGA
COLLABORATION GRANT: LUCIA G. MENEZO
COLLABORATION GRANT: ROEL JORDANS
COLLABORATION GRANT: JAIME ESPINOSA GARCIA
COLLABORATION GRANT: ALEJANDRO VALERO
COLLABORATION GRANT: ERKAN DIKEN
INTERNSHIP REPORT: SOMNATH MAZUMDAR
INTERNSHIP REPORT: MICHELE SCANDALE
INTERNSHIP REPORT: TURKEY ALSALKINI
INTERNSHIP REPORT: ANOUK VAN LAER

phd news

upcoming events

sortium certainly has the capacity and resources to continue delivering value to European stakeholders and to fully achieve the stated objectives. We are very happy with this outcome, and we are committed to continuing to support the European computing systems community in the future. This newsletter is the summer school issue. The summer school also marks the beginning of the summer break for me and for the HiPEAC staff. We wish you a relaxing summer with your family and friends, and we hope to see you again after the summer holiday in good health and full of plans for the year to come.

Koen De Bosschere_________


intro

MESSAGE FROM THE PROJECT OFFICER

I am just back from the “Computer Systems Week and block review” organised by HiPEAC with the University of Oslo, and I am really satisfied with what I saw: HiPEAC is a growing, active and hardworking community, with impressive know-how in all areas of computer science. What I found a bit less exciting in the same community is its capability to go beyond the purely technological work; for example, explaining the potential of the developed technologies to the wider public, or identifying and pursuing exploitation opportunities. We know that computers are going to change the world (they have already done it, actually), but a part of our community seems so busy with the details of the technology that they risk overlooking the possible applications. Well, this is not the best approach: advanced computing is a very powerful tool in our hands, and we should develop a vision of how this “magic wand” can create new useful applications in every existing sector of the economy, and even in new market segments that do not yet exist. And then we should be able to communicate this vision and make it understandable to everybody, not only to computer science majors, because it is not they who will finance our ideas and make them happen. This does not mean that every researcher should become a salesman: on the contrary, research should not be confused with product development, and we should fight for the freedom to fail while exploring untracked roads, because this is the only way to innovate; personally, I will continue to support this approach internally in the European Commission. What we all need, however, is the far-sighted vision and open mind needed to look beyond the computer screen and the lab walls, in order to understand how our technology can really make a difference. It is much more fun to change the world than to play in the lab, and it is also what the European Union asks as a condition of giving us money. The Horizon 2020 programme is described as “research and innovation to boost growth and jobs in Europe”, and future work programmes will increasingly ask for the creation of platforms, ecosystems, and technical/economic communities which can put technology into use, create value, and make advanced computing technologies really relevant in the world. The application of these technologies to the digitalization of European industry, to the Internet of Things and to autonomous vehicles and robots can radically change the world as we know it. Our role is not only to develop the technologies that make this change possible, but also to create applications which are socially and ethically acceptable, and to make sure that European industry can benefit from the change in terms of jobs and growth. If you do a good job – and I am sure that the HiPEAC community can do a really good job – we will have such visible results that I will finally be able to explain my work to my mother-in-law.

ACACES'14 Group Photo


The 2015 Spring Computing Systems Week took place in Oslo, Norway, from May 5 to May 7. During the week, several thematic sessions were organized by HiPEAC participants. Below, we summarize the topic and organizers of each session. Most presentations are available on the HiPEAC website at https://www.hipeac.net/csw/2015/oslo/schedule

hipeac activity

1. SECURITY INTELLIGENCE
Organized by Michael Vinov and Omer Boehm, from IBM Research Haifa, this session explored the most recent advances in computer security and the problems that still need to be solved. It looked at a wide range of recent computer security research, focusing on mitigation techniques and some of the future challenges we are facing. The session included two talks that presented real-world use cases:
• Vulnerability Detection using Symbolic Interpretation. Sharon Keidar-Barner (IBM Research Haifa)
• Preventing ROP Attacks. Omer Boehm (IBM Research Haifa)

2. EMBEDDED COMPUTER VISION
Organized by David Moloney (Movidius) and Oscar Deniz Suarez (University of Castilla-La Mancha), this session dealt with the increasing need for ‘intelligence’ and cognitive functions in embedded systems. Four invited speakers shared their experiences in this field with the audience:
• Next Generation Imaging Solutions for Smartphones. Peter Corcoran (FotoNation)
• Hyperspectral Imaging goes Embedded. Max Larin (XIMEA)
• Platforms and Applications for Embedded Computer Vision: Toys, Bees and Safety Devices. Emanuel Popovici (University College Cork)
• Accelerating OPENVX Applications on Embedded Many-core Accelerators. Giuseppe Tagliavini (University of Bologna)

3. PATTERNS OF PARALLELISM AND SOFTWARE ENGINEERING FOR MULTICORE/MANYCORE SYSTEMS
Organized by Kevin Hammond (Univ. of St. Andrews), this session explored the general problem of software engineering as it applies to multicore/manycore systems. Parallelism is increasingly important for software; it is not unreasonable to say that almost all future software development will need to consider it. Software engineering methodologies and practices are, however, firmly rooted in the single-core era, and the limited tools that exist do not cover all aspects of software development and are not integrated into coherent methodologies. The session included six talks:
• Advanced Parallel Programming with FastFlow. Marco Danelutto (Univ. of Pisa)
• Refactoring Parallel Programs. Chris Brown (Univ. of St. Andrews)
• Pattern-Based Approaches to Programming Heterogeneous Systems. Kevin Hammond (Univ. of St. Andrews)
• Concurrency and Parallelism in Modern C++. Daniel Garcia (Univ. Carlos III of Madrid)
• Using Machine Learning to Map Applications to Heterogeneous Parallel Systems. Vladimir Janjic (Univ. of St. Andrews)
• Experience with Programming Parallel Applications: an Industrial Perspective. Thomas Natschlager (SCCH)

4. FP7 PROJECTS HARPA, CLERECO AND EXCESS; CONVERGENCE, PERSPECTIVES AND JOINT VISION

This session was organized by Dimitrios Soudris (National Technical University of Athens), and included discussions on standardization, power vs. reliability, hardware vs. software reliability, data structures of HPC applications on embedded platforms, and exploration

THEMATIC SESSIONS IN THE OSLO CSW (MAY 5 - 7, 2015)

VisitOSLO/Normanns Kunstforlag/ Terje Bakke Pettersen


of synergies and common actions between the three involved projects. It included three talks and a panel discussion:
• HARPA: Harnessing Performance Variability. William Fornaciari (Politecnico di Milano)
• CLERECO: Cross-Layer Early Reliability Evaluation for the Computing cOntinuum. Stefano Dicarlo (Politecnico di Torino)
• EXCESS: Execution Models for Energy-Efficient Computing Systems. Christoph Kessler (Linköping University)
The panel discussion covered standardization, power vs. reliability, the debate between hardware and software reliability, how certain data structures affect dependability/reliability on embedded platforms, and ways to explore synergies between the EU projects.

5. TOWARDS PORTABLE LIBRARIES FOR HYBRID SYSTEMS
Organized by Christian Brugger and Christian De Schryver (Technical Univ. of Kaiserslautern), this session was about current challenges and feasible approaches for bundling hardware and software parts, with the required interconnect and runtime environment, into a library that runs on a wide range of compute platforms. It included four talks:
• Scalable Architecture and Shared-memory Programming for FPGA-based Heterogeneous Platforms. Paolo Burgio (Univ. Modena)
• Portable Libraries and Programming Environments. Jeronimo Castrillon (TU Dresden)
• Rule-based Program Transformation for Hybrid Architectures. Manuel Carro (IMDEA Software Institute)
• HW Flexibility & Runtime Optimizations. Ioannis Sourdis (Chalmers Univ. of Technology)

6. INTERNET OF THINGS (IOT): TECHNOLOGY AND APPLICATIONS FOR A GOOD SOCIETY

Organized by Donn Morrison and Lasse Natvig (Norwegian Univ. of Science and Technology), this session brought together experts from academia and industry who research and develop components and products dealing with the growth in the number of network-connected devices, known as the Internet of Things. It included six talks:
• Smarter Bees - Bees, IoT and Big Data. Torstein Dybdah (TD Research)
• Integrating Wireless Sensor Networks Into Internet of Things: Challenges. Yuming Jiang (NTNU)
• Developing Robust IoT Gateway Applications from Building. Frank Alexander Kraemer (Bitreactive)
• Robotics in IoT. Jim Tørresen (Univ. of Oslo)
• Engineering the IoT. Alf Syvertsen (Silicon Labs)
• Internet of Things - Marketing or Real Opportunity. Jo Uthus (ATMEL Norway)

7. BEYOND SELF-AWARE EMBEDDED COMPUTING
Organized by Stephan Wong (Delft Univ. of Technology), this session invited experts from the field of self-aware embedded computing to explore how their solutions can be further improved in a more interconnected world. It included three talks:
• Application Autotuning and Runtime Resource Management from Heterogeneous Manycore Architectures. William Fornaciari (Politecnico di Milano)
• Self-Awareness in Cyber-Physical Systems. Axel Jantsch (Technical Univ. Wien)
• Runtime support for self-awareness in interconnected CPS systems. Dionisios N. Pnevmatikatos (ICS-FORTH)
It ended with a panel discussion on “What is beyond self-aware embedded computing?”

8. EUROPEAN INITIATIVE ON RUNTIME SYSTEMS AND ARCHITECTURE CO-DESIGN

This session was organized by Miquel Moretó (BSC), Marc Casas (BSC), Vassilis Papaefstathiou (Chalmers) and Miquel Pericàs (Chalmers). It gathered representatives of the main European research groups in programming models and computer architecture co-design to discuss ways to achieve strengthened cooperation and improved interoperability of their runtime middlewares. The session included six talks:


• MECCA – Meeting the Challenges in Computer Architecture. Per Stenström (Chalmers Univ. of Technology)
• The Swan Task Dataflow Scheduler. Hans Vandierendonck (Queens Univ. of Belfast)
• Project Beehive: A HW/SW co-designed stack for runtime and architectural research. Christos Kotselidis (Univ. of Manchester)
• Task-based Runtimes for Multicore Architectures. Foivos S. Zakkak (FORTH-ICS)
• Runtime-Aware Architectures. Miquel Moretó (BSC)
• The StarPU Runtime System: Task Scheduling for Exploiting Heterogeneous Architectures. Olivier Aumage (INRIA Bordeaux)
The thematic session concluded with a panel during which the participants discussed the main scientific challenges addressed by their research, how their work could benefit from strengthened collaboration, and which instruments or funding vehicles could be used to improve collaboration across European research groups.

9. RISING VIRTUES OF HETEROGENEOUS SYSTEMS: RELIABILITY
This session was organized by Chris Fensch (Heriot-Watt University), Georgios Goumas (National Technical Univ. of Athens), and Marisa Gil (BSC-UPC Barcelona Tech). This new edition of the Programming Models Thematic Session started a series of meetings focusing on specific factors or components that influence performance but are also of concern to designers, programmers, and developers. The selected topic was reliability. The discussion centred on the factors that impact the resilience of a system, and on how overall resilience depends on the weakest link. Resilience is difficult to test, and simulators give less accurate results than real hardware. On the programming models side, a return to dataflow and task-based models (e.g. OpenMP 4.0 and OmpSs) looks promising. It included four talks and a panel discussion:
• Keynote: Heterogeneous Systems - a Blessing or a Curse for Massive Parallel Dependable Systems. Avi Mendelson (Technion - Israel Institute of Technology)
• Controlling Application Behavior in the Presence of Approximations and Errors. Christos Antonopoulos (CERTH)
• Variability-Aware Self-Adaptive Parallel Application in Many-core Chips. Fabien Chaix (FORTH)
• PID-Controlled DVFS for Absorbing Temporal Overheads of RAS Mechanisms. Dimitrios Rodopoulos (National Technical Univ. of Athens)
After the talks, the panel discussion dealt with the difficulty of testing resiliency implementations, how the programming model can improve resiliency, the trade-offs between low power consumption and reliability, and the lack of a benchmark set for evaluating resiliency techniques.

10. ERROR-AWARE SYSTEMS: OPPORTUNITIES AND CHALLENGES FOR HANDLING ERRORS AT MULTIPLE LEVELS

Organized by Dimitrios Nikolopoulos (Queen’s Univ. of Belfast) and Pedro Trancoso and Yiannakis Sazeides (Univ. of Cyprus), this session included an initial report on the topics covered during the discussions in the Thematic Session at the Athens CSW, talks reporting results from related EU projects, and an invited talk from BSC on related issues within the scope of the Mont-Blanc project. It included five talks:
• Report from the Error-Aware Thematic Session I. Yanos Sazeides (Univ. of Cyprus)
• Report from the EU Energy-Efficient Computing Systems Workshop. Koen De Bosschere (Ghent Univ.)
• Report from the 1st Workshop on Approximate Computing (WAPCO 2015). Georgios Karakonstantis (Queen’s Univ. of Belfast)
• Enablers and Roadblocks of Approximate and Error-Aware Computing. Christos Antonopoulos (Univ. of Thessaly)
• Understanding and Addressing the Resiliency Issues for Future Exascale Computing with the Mont-Blanc Prototype. Ferad Zyulkyarov (BSC)
The session concluded with a panel discussion._________


Where current computers struggle, and how probability-based computing can overcome this.

Developed in the REPARA project, the Eclipse-based Cevelop IDE for C++ aims to make programmers more productive by integrating automated refactoring and testing tools.

hipeac announce

Over the last decade, computing platforms have not progressed at the same rate as in previous years. This lack of progress is largely due to a combination of problems, ranging from silicon manufacturing issues to problems with the models being used, leading to, for example, the von Neumann bottleneck. When one takes a few steps back and surveys the situation, it becomes clear that current platforms have their limitations. Consequently, there is a need to start developing a new computing approach: one that is more biologically inspired and can deal with unreliable components, while at the same time offering more intelligent functionality. Probability-based computation has all the necessary characteristics to overcome the current problems, while also forming a better platform for machine learning approaches. There is still work to be done before these new systems become reality, but it is about time that people with knowledge and understanding of science and technology combine their knowledge and learn to deal with unreliability to ensure this brighter future does happen.

For more information please visit: http://users.telenet.be/wimmelis/professional/books/book_1.htm

Wim Melis, University of Greenwich_________

Cevelop is an integrated development environment (IDE) for C++ programmers that combines a variety of development tools into a one-stop download. Cevelop is based on the most recent version of the popular Eclipse C/C++ Development Tooling, combining a stable technical foundation with the ergonomics of a state-of-the-art IDE. The development of Cevelop was started in the FP7 REPARA project by the University of Applied Sciences Rapperswil, Switzerland. It provides the infrastructure to run REPARA’s static code analysis and transformation tools for heterogeneous parallelization refactorings. Cevelop ships with the CUTE unit testing framework, support for the SCons build system and new tools to refactor namespaces and macros. It also helps you upgrade your code to C++11/14 to automatically take advantage of new features such as initializer lists and smart pointers. Cevelop is free to use and is available for Windows, OS X, and Linux.

For more information please visit: https://www.cevelop.com/

Mirko Stocker, HSR University of Applied Sciences Rapperswil_________

COMPUTING: THE CURRENT AND ITS PROBABILITY BASED FUTURE

CEVELOP C++ IDE RELEASED


hipeac news

JUNIPER NETWORKS AND MAXELER TECHNOLOGIES ANNOUNCE NEW COMPUTE-INTEGRATED NETWORK SWITCH

Maxeler Technologies and Juniper Networks have joined forces and developed a compact, ground-breaking data center switch that integrates high-performance compute resources into the network. Juniper’s QFX5100-AA is a novel type of application acceleration switch that includes a jointly developed QFX-PFA packet flow accelerator module, capable of processing a large amount and variety of data streams at line rate.

Maxeler is pioneering a new dataflow-oriented approach to efficient high-performance computing, in which application experts in science, engineering or finance can develop and customise their algorithms in a high-level language, targeting Maxeler’s highly efficient dataflow systems. Maxeler multiscale dataflow technology exploits the inherent parallelism in applications, and improvements of one to two orders of magnitude in both throughput and power consumption, compared to standard servers of the same size, have been realised across a range of application domains, including finance, geology, weather modelling, genomics, and data analytics.

Including Maxeler’s technology in the latest Juniper QFX network switch product family vastly improves the performance of network applications that require complex processing. Network traffic can be decoded, transformed and re-encoded while sustaining line-rate throughput with minimal and highly predictable latency. Financial institutions will be able to instantly analyse and process massive quantities of information originating from various sources, in order to make better trading decisions, reduce risks or comply with the latest regulatory requirements. The QFX5100-AA switch enables market data decoding and risk analysis to take place directly inside the network infrastructure, with ultra-high throughput and predictable latency. Other application areas include, but are not limited to, the handling and analysis of social media feeds, line-rate video transcoding and big data analytics. The QFX-PFA packet flow accelerator is programmed and managed using Maxeler’s dataflow compiler and operating system. Full TCP/IP support is included.

More information about this product can be found at: http://newsroom.juniper.net/press-release/juniper-networks-delivers-unparalleled-application-performance-with-new-compute-integrated-switch

Tobias Becker, Maxeler Technologies_________

In April 2015, taking place in parallel with ETAPS, EuroLLVM attracted over 250 attendees!

EUROPEAN LLVM CONFERENCE 2015

2015 has seen the return of EuroLLVM to London, this year on April 13 and 14 at Goldsmiths College, University of London. Although better known for its Turner Prize winners, Goldsmiths has an active games masters programme that works, for example, on applying technology developed for triple-A games to the solution of high-performance computing problems.

This year, EuroLLVM not only had a great line-up of presentations, but it also hosted a Khronos UK Chapter meeting and was strategically organized in parallel with ETAPS 2015. Together, this attracted over 250 attendees, making it the largest EuroLLVM conference ever, with participation from industry, research and the LLVM community.

EuroLLVM had two inspiring keynotes, with Francesco Zappa Nardelli talking about the C++ memory model and Ivan Goddard presenting the Mill CPU, an unusual architecture that uses ‘bands’ instead of traditional registers. In addition to the keynotes, there was a diverse set of talks discussing SIMD challenges, such as mixed-width vector code generation and vectorization of control flow with the new AVX-512 instruction set, but also optimizations


such as loop fusion in the presence of control flow, high-level software pipelining and low-overhead Link-Time Optimization (LTO) with the ThinLTO framework. With Templight we saw a practical solution for interactive C++ template debugging; we also saw C++ code such as clang being executed in a web browser, a report on the use of LLVM in .NET’s CoreCLR, as well as talks on static analysis, partial evaluation and microprocessor emulation. The lightning talk sessions were as varied as ever, with talks on SPIR, Javascript, verification and symbolic execution, to name but a few. Besides presentations, there were several tea breaks and interactive sessions that allowed attendees to mingle and collaborate. The conference itself started with a half-day hacking session providing time for networking and free discussions. There was also a poster session, several developer BoFs, as well as tutorials on topics such as LLDB and LLVM’s debug info. Finally, the Khronos chapter presented demonstrations of Vulkan – a low-overhead API for graphics and compute similar to those found on game consoles.

Andy Thomason from Goldsmiths College, who organized EuroLLVM together with the LLVM community, said: “We are thankful for the support of Apple, ARM, Codeplay, Google, the HSA Foundation, Intel, the Khronos Group, the LLVM Foundation, Mentor, Qualcomm Innovation Center (QuIC), Solid Sands and Sony Computer Entertainment. Our partners have enabled us to support the growth of EuroLLVM while ensuring attendance remains affordable, particularly for enthusiasts and students.” If you now regret having missed out on EuroLLVM, you can look at the slides and video recordings of EuroLLVM 2015 (http://llvm.org/devmtg/2015-04/) and join the next meeting later this year in the Bay Area, or in Spring 2016 back in Europe!

For more information please visit: http://llvm.org/devmtg/2015-04/

Tobias Grosser, ETHz _________

hipeac news / in the spotlight

The DeSyRe project has developed a novel SoC architecture and underlying concepts for reliability. Such SoCs have been shown to typically use 28 percent less energy and 48 percent less chip area, while offering a nine-times-lower hardware failure rate. It is time to integrate this technology into safer cars and trains, more reliable medical devices, more advanced brain models, and embedded many-core systems.

The DeSyRe project brought together experts in fault-tolerant and self-repairing design. Industry partner YOGITECH helps silicon vendors and system integrators meet functional-safety challenges in the automotive, industrial automation, biomedical and railway markets. During the DeSyRe project, YOGITECH combined fault detection, fault diagnosis, and reconfiguration in safe and dependable architectures for reconfigurable embedded systems. An ARM-based demonstrator now shows the benefits of the newly developed IPs. Industry partner Recore Systems makes many-core programming easy. Their first application of DeSyRe’s fault-detection and reconfiguration techniques is for missions into deep space, where reliable and fault-tolerant systems are a must. The second application is in their FlexaWare many-core embedded platform. DeSyRe concepts of task-based

SAFER TRAVELS AND IMPLANTS WITH DESYRE SYSTEMS

Consortium Members:
Chalmers University of Technology (Sweden)
University of Bristol (UK)
EPFL (Switzerland)
FORTH (Greece)
Imperial College London (UK)
Neurasmus (The Netherlands)
Recore Systems (The Netherlands)
YOGITECH (Italy)

Project start date: 1 October 2011
Project end date: March 2015

Industry partner websites:
YOGITECH: www.yogitech.com
Recore Systems: www.flexaware.net
Neurasmus: neurasmus.com

Project coordinator: Chalmers University of Technology (Sweden)

Project website: http://www.desyre.eu


The Euroserver Prototype at FORTH-ICS


programming and runtime task migration are key to an intelligent many-core OS and crucial for the easy programming of many-cores. Neurasmus is an R&D company that develops new high-tech medical systems. DeSyRe stimulated the launch of two novel Neurasmus products: the Implant Toolbox – with fault-tolerance and security techniques for implantable medical devices – and BrainFrame – an intuitive-to-program high-performance platform for brain-modeling research and applications. Each DeSyRe idea spreads. One project benefits many!

Inès Nijman, Recore Systems_________

SOCKETS OVER RDMA AND SHARED PERIPHERALS FOR ARM MICROSERVERS

Communication among microservers in data centers must be carried out at low overhead, hence at low energy and low latency. The Euroserver project (www.euroserver-project.eu) builds microservers using low-power ARM cores that can be assembled in large numbers to maintain high multi-threaded performance. Because communication among server nodes is critical for the scalability of applications, it is essential to minimize small-message latency and maximize data transfer throughput. This is not feasible via traditional communication through the TCP/IP stack, which is not optimized for data center workloads and cannot take advantage of any underlying hardware mechanisms within the microserver.

In Crete, we implemented a first prototype of the Euroserver architecture (shown in the photograph) in the Computer Architecture and VLSI Laboratory of FORTH-ICS. It consists of eight compute nodes (MicroZed boards with ARM Cortex-A9 processors and FPGA logic), connected by Michael Ligerakis to a central FPGA board. A custom interconnection network by George Kalokairinos, built upon the AXI protocol and the ARM master/slave ports, allows remote memory access among the nodes via physical address translation. Furthermore, the prototype shares I/O resources, such as network interfaces (NICs) and storage devices, among the multiple compute nodes and their Linux OS instances.

We employ RDMA transfers over our custom interconnect instead of TCP-over-Ethernet communication. We completely bypass the TCP/IP stack of the Linux kernel to ensure low-latency transmission and reception of internal messages. To allow traditional socket applications to run without any modification, Dimitrios Poulios created our own library, which is invoked by intercepting socket-related system calls in the standard C library (libc). Transfers destined for the internal network pass through our RDMA driver, which handles internal connections. Remote notifications and control messages are delivered through the prototype’s hardware mailbox mechanism.

Communication to/from the external world is achieved through a shared virtualized 10 Gbps NIC, which resides in the central FPGA board. Kostas Harteros created dedicated Tx/Rx FIFOs per node in the 10 Gbps MAC layer, with frames transmitted round-robin. Incoming frames are routed to the appropriate nodes according to their destination MAC address. A device driver by John Velegrakis allows the Linux OS and the applications to view the NIC as a standard Ethernet device. The driver makes use of the underlying AXI DMA engine to send and receive frames to and from the MAC. Zero-copy between the kernel and the driver is achieved by maintaining rings of DMA descriptors and operating the DMA in scatter-gather mode.

For further information please contact: Manolis Marazakis, Iakovos Mavroidis, Manolis Katevenis {maraz,jacob,kateveni}@ics.forth.gr
FORTH-ICS, Heraklion, Crete, Greece
www.ics.forth.gr/carv

_________

HiPEAC info 43


in the spotlight

An Analytics Framework for Integrated and Personalized Healthcare Services in Europe

The AEGLE project aims to produce value out of big data in healthcare, with the goal of revolutionizing integrated and personalized healthcare services, offering analytic services at two levels, as shown in Figure 1. First, at the local level, AEGLE will focus on real-time processing of large volumes of raw data originating from patient monitoring services. Then, at the cloud level, AEGLE will offer an experimental big data research platform for data scientists, workers and data professionals across Europe. The platform consists of a large pool of semantically-annotated and anonymized healthcare data, state-of-the-art big data analytics methods and advanced visualisation tools, allowing data scientists to steer the analytics mechanisms with their own insights.

THE AEGLE PROJECT

Consortium Members:
Exodus S.A. (EXUS) - GR
Institute of Communications and Computer Systems (ICCS) - GR
Kingston University Higher Education Corporation (KINGSTON) - UK
Center for Research and Technology Hellas (CERTH) - GR
Maxeler Technologies (MAXELER) - UK
Uppsala University (UU) - SW
Universita' Vita-Salute San Raffaele (USR) - IT
Time.Lex (TML) - BE
Erasmus Universiteit Rotterdam (EUR) - NL
Croydon Health Services NHS Trust (CROYDON) - UK
Globaz Grupo S.A. (GLOBAZ) - PT
University Hospital of Heraklion (PAGNI) - GR
Gnúbila France (GN) - FR

Project Coordinator: Exodus S.A. (EXUS) - GR

Project website: http://www.aegle-uhealth.eu

Figure 1: The AEGLE infrastructure

AEGLE will go beyond the state of the art in big data services by introducing an integrated infrastructure that exploits FPGA-based dataflow acceleration across three different software levels:

• Algorithmic level: customized DataFlow Engines (DFEs) will accelerate the computation-intensive kernels found in the targeted big data analytics procedures, which will subsequently be mapped to MAXELER's devices. Advanced compiler- and datapath-level optimization techniques will be adapted for spatial computing with DFEs. Along with per-accelerator datapath optimization, maximizing the accelerator's scalability given the FPGA's computational and memory organization constraints will also be considered.

• MapReduce runtime level: specialized DFEs will be designed targeting the acceleration of the underlying MapReduce programming model. MapReduce allocates several resources from the software processors, reducing the overall performance of the big data application. In this case, acceleration targets the efficient implementation of internal procedures found in the MapReduce runtime. Early experimental analysis considering the implementation of a MapReduce-accelerated framework for FPGAs showed speedup gains of up to 32x with respect to a purely software solution. In addition, customized memory management schemes tailored to the memory hierarchy and organization of MAXELER's devices will be incorporated to efficiently handle the large number of key-value pairs usually generated by MapReduce semantics, as well as platform-specific task schedulers for balancing the load across the software processors and the DFEs.


• Storage and data management level: the database management system (DBMS) will be extended to support both adaptive data layout optimizations, e.g. a columnar versus row-wise storage model according to the type of queries, and query-specific hardware pipelines for dataflow-based acceleration. Many calculation-intensive operations executed by the DBMS to maintain the stored data, e.g. merging the update buffer into the main storage of an in-memory column store, can be efficiently accelerated through dataflow-based FPGA acceleration. Regarding DBMS acceleration for data analytics, MAXELER's in-memory capabilities are expected to fulfil the needs of highly demanding and fast data retrieval. Several DFE organizations and design options will be investigated, in order to tailor the hardware architecture of the DFEs to the set of most demanding queries.
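The row-wise versus columnar trade-off that drives the adaptive layout optimization can be illustrated with a small generic sketch (plain Python for illustration, not AEGLE code; the table contents are invented):

```python
# Sketch of the storage-layout trade-off: a column scan over row-wise
# storage touches every field of every record, while columnar storage
# lets the same query read one contiguous array.
rows = [(1, "alice", 30), (2, "bob", 25), (3, "carol", 41)]          # row-wise
columns = {"id": [1, 2, 3],
           "name": ["alice", "bob", "carol"],
           "age": [30, 25, 41]}                                      # columnar

def avg_age_rowwise(rows):
    return sum(r[2] for r in rows) / len(rows)       # strides over full rows

def avg_age_columnar(cols):
    ages = cols["age"]                               # contiguous column read
    return sum(ages) / len(ages)

assert avg_age_rowwise(rows) == avg_age_columnar(columns) == 32.0
```

Analytical scans favour the columnar form, while point queries and updates favour rows; this is why the DBMS adapts the layout to the query mix.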

Building upon the synergy of heterogeneous high performance computing, while exploiting reconfigurable architectures, cloud and big data computing technologies, AEGLE will provide a framework for big data analytics for healthcare that will enable and promote innovation activities, placing health in the spotlight.
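The "large number of key-value pairs" that the MapReduce runtime level must manage can be seen in a minimal, generic word-count sketch (plain Python for illustration; the project itself targets FPGA DataFlow Engines):

```python
# A minimal word-count in MapReduce style, showing the key-value pair
# traffic whose shuffling and memory management the accelerated runtime
# must handle efficiently. Generic illustration, not the AEGLE code.
from collections import defaultdict

def map_phase(records):
    """Emit one (key, 1) pair per word -- this is the pair traffic."""
    for rec in records:
        for word in rec.split():
            yield (word, 1)

def reduce_phase(pairs):
    """Group by key and sum the values."""
    groups = defaultdict(int)
    for key, value in pairs:
        groups[key] += value
    return dict(groups)

counts = reduce_phase(map_phase(["big data", "big compute"]))
assert counts == {"big": 2, "data": 1, "compute": 1}
```

Even this toy example produces one intermediate pair per input word, which is why tailored memory management for the pair streams is a target of the acceleration work.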

_________

Finding patterns of parallelism in industrial software for heterogeneous systems

SUCCESSFUL CONCLUSION TO EU PARAPHRASE PROJECT

A world-leading team of academic researchers and industrial experts from across Europe is celebrating the conclusion of a four-year research collaboration tackling the challenges posed by the fastest and most powerful computing systems on the planet.

The €4.2M ParaPhrase project brought together academic and industrial experts from across Europe to improve the programmability and performance of modern parallel computing technologies.

“Future computers will consist of thousands or even millions of processors, which poses a real problem to traditional programmers not used to thinking in parallel,” said project leader Professor Kevin Hammond of the University of St Andrews.

“The sheer complexity of these systems means that powerful tools are needed to develop software that runs stably and efficiently while making the most of the ability to process in parallel. The technologies we have developed in ParaPhrase make it possible now to really exploit the power of these new systems.”

The ParaPhrase researchers have developed an approach that allows large parallel programs to be constructed out of standard building blocks called patterns. A refactoring tool allows these patterns to be reassembled in optimal ways without changing the functionality of the overall program.
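The pattern idea can be sketched generically (ParaPhrase's actual tools target Erlang and C++/FastFlow; this Python version only illustrates the concept, with invented names):

```python
# A 'farm' pattern applies a worker function to a stream of inputs in
# parallel. Swapping a sequential map for a farm changes how the code
# runs, not what it computes -- the essence of pattern-based refactoring.
from concurrent.futures import ThreadPoolExecutor

def farm(worker, stream, nworkers=4):
    """Farm pattern: apply worker to each stream element in parallel."""
    with ThreadPoolExecutor(max_workers=nworkers) as pool:
        return list(pool.map(worker, stream))

square = lambda x: x * x
assert farm(square, range(5)) == [x * x for x in range(5)]  # same result
```

Because the pattern fixes the coordination structure, a refactoring tool can introduce or re-tune it (e.g. change the number of workers) while guaranteeing the program's observable behaviour is unchanged.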

Further tools developed on the project allow the program components to be run on the system in ways that make best use of the available processors, maximising throughput and minimising the run time of large programs. The tools can even adapt the program while it is running to improve performance.

Professor Hammond said, “It was important to us that our research could be directly exploited by industry and other researchers. That's why we applied ParaPhrase to several important industrial case studies during the project.”

Indeed, the project team has used its extensive industrial expertise to develop Use Case Scenarios in a range of application areas including industrial optimisation, scientific simulation and data mining.

The outputs of the project have been impressive. As well as producing over 80 publications in leading international conferences and journals and being demonstrated at over 100 international conferences and other events, the project has produced a range of new software tools and programming standards to support the growing global community in parallel programming.

A Streaming Parallel Skeleton Library for the Erlang programming language has recently been made available, and a new release of the FastFlow parallel programming framework has already seen thousands of downloads. Industrial partners are already applying the technology in their own operations, and three recently-launched spin-out companies are set to take full commercial advantage of the technologies produced.

Already, the project partners are looking to the future. A number of follow-on projects are underway and more are in the pipeline. “ParaPhrase has been a tremendous success, but significant challenges remain. In the future, parallel programs will need to self-adapt to computing architectures we haven't even thought of yet,” said Professor Hammond.

For more information please visit: http://www.paraphrase-ict.eu/

Kevin Hammond, University of St Andrews
_________


We present the outcome of a technology transfer project between the non-profit cTuning Foundation, France (Grigori Fursin) and ARM, UK (Anton Lokhmotov, Ed Plowman). The six-month project, supported by the TETRACOM Coordination Action, has resulted in the development from scratch of the Collective Knowledge framework, its validation on realistic use cases, and the formation of a startup called dividiti.


COLLECTIVE KNOWLEDGE: A FRAMEWORK FOR SYSTEMATIC PERFORMANCE ANALYSIS AND OPTIMIZATION

Designing, modeling and benchmarking of computer systems in terms of performance, power consumption, size, reliability and other characteristics is becoming extraordinarily complex and costly. This is due to a large and continuously growing number of available design and optimization choices, a lack of common performance analysis and optimization methodologies, and a lack of common ways to create, preserve and reuse vast design and optimization knowledge. As a result, optimal characteristics are achieved only for a few ad-hoc benchmarks, often leaving real-world applications underperforming. Eventually, these problems lead to a dramatic increase in development, optimization and maintenance costs, increasing time to market for new products, eroding return on investment (ROI), and slowing down innovation in computer engineering.

Since 2012, the non-profit cTuning Foundation and ARM have been engaging in discussions on systematic performance analysis and optimization using statistical analysis, machine learning and crowd-tuning techniques. In November 2014, we started a six-month technology transfer project supported by the FP7 TETRACOM Coordination Action. The cTuning technology comprises a framework and repositories for collaborative and reproducible experimentation combined with predictive analytics. This technology has been successfully used in several EU projects, including the FP6 MILEPOST project.

The TETRACOM grant has allowed us to design and develop from scratch the fourth version of the cTuning technology, which we called Collective Knowledge (or CK for short). CK is a Python-based framework, repository and web service, supporting JSON interfaces and standard Git services such as GitHub and Bitbucket. CK allows engineers and researchers to organize, describe, cross-reference and share their code, data, experimental setups and meta information as unified and reusable components. CK users can assemble various experimental workflows from components, quickly prototype ideas, crowdsource experiments using spare computer resources such as mobile phones, and more. Importantly, CK allows experimental results to be exposed to powerful predictive analytics packages such as scikit-learn and R, in order to speed up decision making via statistical analysis, data mining and machine learning.
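The "unified JSON components" idea can be illustrated in a few lines (a hedged sketch of the concept only, deliberately not using CK's actual API; the record fields and values are invented):

```python
# Each experiment is kept as a JSON-serializable record with meta
# information, so results can be shared across tools and then fed to
# simple statistical analysis -- the workflow CK automates at scale.
import json
import statistics

experiments = [
    {"benchmark": "matmul", "compiler": "gcc",   "time_s": 1.42},
    {"benchmark": "matmul", "compiler": "clang", "time_s": 1.38},
    {"benchmark": "matmul", "compiler": "gcc",   "time_s": 1.40},
]

serialized = json.dumps(experiments)          # shareable, tool-agnostic form
restored = json.loads(serialized)

gcc_times = [e["time_s"] for e in restored if e["compiler"] == "gcc"]
assert round(statistics.mean(gcc_times), 2) == 1.41   # simple analytics step
```

Keeping every artifact in a self-describing, serializable form is what makes experiments reproducible and lets predictive analytics packages consume them directly.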

During the project, we successfully applied the Collective Knowledge framework to perform systematic analysis, data mining and online/offline learning on vast amounts of benchmarking data available at ARM. Our technology showed good potential to automatically find various important correlations between numerous in-house benchmarks, data sets, hardware, performance, energy and run-time state. Such correlations can, in turn, help derive representative benchmarks and data sets, quickly detect unexpected behavior, suggest how to improve architectures and compilers, and speed up machine-learning-based multi-objective autotuning.

Furthermore, our technology has also shown potential to enable collaborative research and development within and across groups. Therefore, we have released the Collective Knowledge framework under a permissive BSD license, and expect to grow the user community.

Finally, our positive results have motivated us to establish a UK-based startup called dividiti to accelerate computer engineering and research by further developing our technology and applying it to real-world problems.

Acknowledgments: We would like to thank the TETRACOM Steering Committee for accepting the project proposal and the TETRACOM manager Eva Haas for simplifying the paperwork. We would also like to thank our ARM colleagues Marco Cornero, Alexis Mather and Jem Davies for encouraging and supporting the project.

Further resources:
• http://tetracom.eu
• http://ctuning.org
• http://github.com/ctuning/ck
• http://hal.inria.fr/hal-01054763
• http://www.dividiti.com

Grigori Fursin, cTuning Foundation
_________


hipeac students

Host Institution: University of St Andrews, UK
Title: Exploiting parallel data-flow tools for graph traversing

COLLABORATION GRANT: HECTOR ORTEGA

During the last decade, parallel processing architectures have become a powerful tool to deal with massively parallel problems that require high performance computing (HPC). In order to support the massive demand for HPC, the latest trends focus on the use of heterogeneous environments including computational units of different natures, such as common CPU cores, graphics processing units (GPUs) and other hardware accelerators. The exploitation of these environments offers higher peak performance and better efficiency compared to classical homogeneous cluster systems.

Part of my research has been focused on the development of a model, and its corresponding framework implementation, whose main objectives are: to simplify the programming of these heterogeneous systems, by hiding the details of synchronization, deployment, and tuning; and to maximize their efficiency, by automatically using all available resources. For various reasons, however, it is not always true that exploiting all available computational units leads to the fastest results.

One of the interests of the host research group at the University of St Andrews is to efficiently parallelize the solution of a problem, by automatically distributing the computation between different heterogeneous processors, following a particular mapping for the resource usage. Learning this knowledge, and working together, also with another HiPEAC grant holder from Chalmers, Josef Svenningsson, has led to new methods that predict optimal configurations of the resource usage, leading, therefore, to improvements in the automatic distribution of the work. In order to enrich the obtained results, experiments have been conducted using very different server machines from both universities, solving real-world problems with different characteristics. These activities have resulted in a willingness to continue the collaboration at a distance, in order to extend the experimentation and to improve our respective frameworks with the newly developed knowledge.

I would like to thank HiPEAC for giving me this opportunity, as well as all the people at the University of St Andrews and Josef, for their unsurpassable treatment in both academic and personal contexts, and finally my supervisors at Universidad de Valladolid for their support and encouragement to participate in this rewarding experience.

Hector Ortega, Universidad de Valladolid, Spain
_________

Thanks to this HiPEAC Collaboration Grant, I have had the pleasure of spending the last three months of the year at the IBM Thomas J. Watson Research Center, and this report summarizes the research work carried out during this period. It is well known that the capacity to extract as much knowledge as possible from the large volumes of data that our society generates plays an important role in the development of systems. Traditional methodologies are no longer valid for managing extremely large and complex data sets, and a new computation paradigm known as big data, or data analytics, is gaining ground. From current social networks to the future Internet of Things (IoT), all these systems share the same characteristics: the amount of data generated per time unit is vast, highly variable and schema-less. The complexity of the problem lies in the fact that the majority of these data do not include relevant information, and the filtering of the data that do include useful information has to be done in a limited time. This is necessary to reach useful decisions in time. For this reason, many of these systems have to be carefully designed and planned in order to fulfil the workload requirements. Under these circumstances, it is necessary to propose new architectures suitable for overcoming the obstacles of these information systems. To achieve such a goal, it is essential to know the impact that these applications will have on the memory system of actual architectures.

During this three-month research collaboration, we have focused on studying the type of databases used to manage these high volumes of data in current systems. These are no longer structured with relational databases; instead, NoSQL databases are becoming more popular, since their distributed and horizontally scalable characteristics make them the perfect tool to manage these types of data. More specifically, our main database has been the Apache Cassandra distribution. In order to work with it, it is necessary to run some kind of client. For this task, we decided to use the Yahoo Cloud Serving Benchmark (YCSB), which is a highly configurable benchmark that allows the user to set different parameters, in order to test the database under different behavior patterns. Apache Cassandra was configured as a cluster database in a three-node system, each node with two Intel Sandy Bridge quad-core chips and a memory hierarchy of 12 MB of cache each and 48 GB of main memory. Its configuration details were set to the

Host Institution: IBM Thomas J. Watson, USA
Title: Processor-memory effects analysis on BigData workloads

COLLABORATION GRANT: LUCIA G. MENEZO


During this collaboration project we have designed and implemented a high-level software pipelining method which operates in the target-independent optimization layer of the compiler. This allowed us to create a software pipelining optimization pass that can easily be reused for many different architectures. Our current implementation is targeted at the Movidius SHAVE processor architecture. From our initial experiments we found that this method does indeed provide benefits over naively unrolling loops, while requiring only very limited information about the processor architecture.
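Why software pipelining beats naive unrolling can be seen in a toy cycle-count model (illustrative only; the actual pass works on LLVM IR, and the stage counts here are invented):

```python
# Toy model of software pipelining: with three dependent stages per
# iteration, a non-pipelined loop takes 3 cycles per iteration, while a
# pipelined loop overlaps stages of different iterations after a short
# prologue, approaching 1 cycle per iteration.
def cycles_sequential(n, stages=3):
    return n * stages              # stages of one iteration run back-to-back

def cycles_pipelined(n, stages=3):
    return n + (stages - 1)        # one iteration completes per cycle after fill

assert cycles_sequential(100) == 300
assert cycles_pipelined(100) == 102   # prologue of 2 cycles, then 1/iteration
```

Unrolling alone replicates the loop body but keeps the stage-to-stage dependences; pipelining overlaps stages of different iterations, which is where the additional speedup comes from.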

The current work of this project focuses mostly on stabilizing the code so that we can run a larger set of test benchmarks and start tuning the required processor architecture model interface. Once this is completed we plan to demonstrate the portability of our approach by applying it to several processor architectures already supported by LLVM, such as Qualcomm's Hexagon DSP and AMD's R600 series of GPUs. Finally, we will also present our design together with the results of our experiments at the upcoming EuroLLVM meeting (April 2015) in London.

I am grateful to HiPEAC for providing me with the opportunity to perform this research through a collaboration grant. It has been a great pleasure to work in such an international collaboration, meet interesting people, and make use of the vast amount of experience available within industry. It was a valuable experience and hopefully the beginning of a long-term scientific contact between our institutions. In particular I would like to thank David Moloney and Martin O'Riordan for hosting me and sharing their knowledge of processor architecture and compiler design.

Roel Jordans, Eindhoven University of Technology, The Netherlands
_________

Host Institution: Movidius, Ireland
Title: An implementation of software pipelining in LLVM

COLLABORATION GRANT: ROEL JORDANS

The project developed during the internship of Mr. Jaime Espinosa at BSC-CNS, a HiPEAC member, has been successfully completed. Thanks to the collaboration between the different institutions, the expertise of the Fault-Tolerant Systems Research Group (STF) at the Polytechnic University of Valencia in fault injection has been of great use when performing the experimentation required in the work. Likewise, many lessons have been learned when applying injection in complex systems such as the one used in the host institution. Furthermore, it has been an enriching experience to work with people more closely related to industry, since the view of academia is not always so closely aligned with it. The outcomes of the program have been remarkable.

Firstly, a joint publication has been published at DAC 2015, entitled “Analysis and RTL Correlation of Instruction Set Simulators for Automotive Microcontroller Robustness Verification” by J. Espinosa, C. Hernández, J. Abella, D. de Andrés and J.C. Ruiz.

Secondly, a longer collaboration has been fostered thanks to the internship, which will probably yield a second, extended publication with more in-depth results and conclusions. A set of interesting contacts has been made with people from BSC, which may inspire new research work and further fruitful discussion in the area. I would like to personally thank all the people at HiPEAC who have made this internship possible, as well as the people at BSC for their support.

Jaime Espinosa Garcia, Universitat Politècnica de València, Spain
_________

Host Institution: Barcelona Supercomputing Center, Spain
Title: Correlating microarchitectural fault injection with RTL fault injection experiments

COLLABORATION GRANT: JAIME ESPINOSA GARCIA

default values, mainly regarding the replication strategy and the snitches' behavior. The Cassandra tool named nodetool was used for managing the constructed cluster, and the whole testing infrastructure was completed with monitoring tools to analyze the behavior of the system, specifically processor and memory behavior.

This collaboration will be continued by constructing benchmarks from this structure, in order to integrate them into a flexible and reliable framework such as the gem5 simulator, with which we already have experience. This will allow us to evaluate new hardware proposals for the coherence protocol, especially conceived to accelerate big data or data analytics workloads. Finally, I would like to thank HiPEAC for this opportunity and IBM for accepting me. I would also like to thank the Computer Architecture Department at IBM T.J. Watson for receiving me in such a warm way from the first day, and especially Ravi Nair for our discussions. In addition to the extraordinary experience I had, I hope and wish that this rewarding collaboration may continue in the future.

Lucia G. Menezo, University of Cantabria, Spain
_________


Host Institution: University of Cambridge, UK
Title: Implementing Microarchitectural Mechanisms to Slow Down Ageing in Current Microprocessors

COLLABORATION GRANT: ALEJANDRO VALERO

I have spent three months in the Computer Laboratory at the University of Cambridge, UK, under the guidance of Dr. Timothy M. Jones. During this period I have learned how variations in the manufacturing process of current microprocessors affect the lifetime of their transistors. Degradation of the different microprocessor components affects the overall system, depending on the importance of the component. For instance, cache memories are implemented with a huge quantity of transistors and are critical for system performance. Caches have usually been implemented with Static Random-Access Memory (SRAM) cells composed of six transistors. Over the lifetime of processors, Negative Bias Temperature Instability (NBTI) and Hot Carrier Injection (HCI) gradually increase the threshold voltage of the SRAM's PMOS and NMOS transistors, respectively, causing slower transistor switching, which in turn results in timing violations and faulty operation in such cells.
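A toy model makes the uneven-ageing problem concrete (illustrative only, not the mechanism developed in the collaboration; the access pattern and remap policy are invented):

```python
# Toy model of uneven cell ageing and wear-levelling: if one hot line
# takes most of the accesses, its cells age fastest; periodically
# remapping which physical line a logical line uses spreads the stress.
def stress(accesses, nlines, remap_period=None):
    """Count per-physical-line activity with an optional rotating remap."""
    wear = [0] * nlines
    for t, logical in enumerate(accesses):
        shift = (t // remap_period) if remap_period else 0
        wear[(logical + shift) % nlines] += 1
    return wear

hot = [0] * 8                       # logical line 0 is accessed 8 times
assert stress(hot, 4) == [8, 0, 0, 0]                  # one line ages 4x faster
assert stress(hot, 4, remap_period=2) == [2, 2, 2, 2]  # stress homogenized
```

Since lifetime is limited by the worst-aged cells, equalizing the stress across cells directly lengthens the usable life of the whole structure.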

In this collaboration we have focused on new architectural mechanisms to lengthen the lifetime of the processor caches. We have identified that some memory cells are much more affected by both NBTI and HCI effects than others. Based on these observations, our mechanisms attempt to mitigate these effects by ensuring homogeneous degradation across the memory cells. This collaboration has allowed me to explore a new research direction, and it is the beginning of exciting new research opportunities. Besides, this grant has permitted us to establish a solid relationship between the University of Cambridge and the Universitat Politècnica de València. We are still working together after the end of the mobility, and we plan to apply jointly for future research project calls. I would like to thank HiPEAC for giving me the opportunity to participate in this internship, as well as my host Tim Jones, who was extremely committed to our collaboration from the very beginning and provided useful hints to develop our techniques. Finally, I would also like to thank all the fellows in the lab, who made this experience unforgettable for me.

Alejandro Valero, Universitat Politècnica de València, Spain
_________

Massive parallelism in many communication and multimedia applications shows up as significant opportunities for data-level parallelism (DLP). DLP is usually exploited by single-instruction multiple-data (SIMD) execution units, due to their low-power architecture, which applies a single instruction across many processing elements. Nowadays, SIMD execution units exist in many modern embedded and mainstream processor architectures.

Although SIMD execution is one of the main enablers of computing efficiency, programming for SIMD architectures is still a challenge and a hot research topic. Moreover, the auto-vectorization capabilities of compilers are very limited, and vectorization mostly requires tweaks and instrumentation (e.g. pragmas, target-specific intrinsics, etc.) to be added to the source code by the programmer. Generating vector code for an architecture with a single vector width is already difficult. It gets much more challenging when the vector code is to be generated for a VLIW architecture with multiple native vector widths. The SHAVE VLIW vector processor of Movidius is an example of such an architecture. SHAVE is a unique VLIW processor that provides hardware support for both native 32-bit (short) and 128-bit (long) vector operations.

During my HiPEAC internship at Movidius, I was continuously challenged to improve the SIMD code generation of an LLVM-based commercial compiler targeting the SHAVE processor family. The compiler was capable of SIMD code generation for long (128-/64-bit) vector operations. I focused on the compiler back-end support for short (32-bit) vector code generation. More specifically, the work aimed at generating SIMD code for a short vector type that can be executed next to the long vector SIMD code. As a result, the compiler is now able to generate mixed-width assembly code consisting of both short and long SIMD operations. To our knowledge, we implemented the first (prototype) compiler producing such mixed-width SIMD code.
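The shape of mixed-width code generation can be illustrated with a simple simulation (conceptual only; the real work emits SHAVE assembly from LLVM IR, and the widths here just stand in for the long/short vector types):

```python
# Conceptual picture of mixed-width SIMD covering: a loop over n elements
# is covered with wide 'long' vector operations, and the remainder with
# narrower 'short' operations, so both widths appear in the emitted code.
def vector_add(a, b, long_width=4):
    out, ops = [], []
    i = 0
    while i + long_width <= len(a):                 # long-vector chunks
        out += [x + y for x, y in zip(a[i:i+long_width], b[i:i+long_width])]
        ops.append("long")
        i += long_width
    while i < len(a):                               # short-vector remainder
        out.append(a[i] + b[i])
        ops.append("short")
        i += 1
    return out, ops

res, ops = vector_add([1, 2, 3, 4, 5, 6], [10, 20, 30, 40, 50, 60])
assert res == [11, 22, 33, 44, 55, 66]
assert ops == ["long", "short", "short"]
```

Covering the remainder with short vector operations instead of scalar code is what lets the whole loop stay in SIMD form on a machine with two native vector widths.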

The research paper explaining the results of this work will be presented at the 26th IEEE International Conference on Application-specific Systems, Architectures and Processors (ASAP), 2015, in Toronto, Canada. Moreover, our experiences with LLVM compiler development were presented at the EuroLLVM 2015 conference in London, UK.

Finally, I would like to thank the team at Movidius, in particular Martin J. O’Riordan and David Moloney, for all their support during my stay in Dublin, as well as HiPEAC for making this internship possible.

Erkan Diken, Eindhoven University of Technology, The Netherlands
_________

Host Institution: Movidius, Ireland
Title: Mixed-width SIMD code generation in an LLVM-based compiler for the SHAVE VLIW Vector Processor

COLLABORATION GRANT: ERKAN DIKEN


I am a second-year PhD student at the University of Siena (UniSi), Siena, Italy. I have just completed my internship at the Ericsson Research Lab, Lund, Sweden. During my stay at Ericsson, I worked with the Cloud Computing Research Group, and I enjoyed working with Cloud Principal Researcher Johan Eker and his team. The main goal of the internship was to implement actor-based models on Parallella boards. To build a highly complex Cloud/HPC system, we now have many options, but in recent times there has been a sea change in the semiconductor industry. Each technology has its pros and cons. To build highly complex and scalable computer systems, there exist multiprocessors, multi-coprocessors, Accelerated Processing Units (APUs), GPUs and FPGA-based devices. But each device comes with 3P (Performance, Programming and Power) issues. To exploit the massive levels of parallelism of these processors, we must find the best-suited programming model for each. Dataflow models and actor-based models are examples of well-known programming models that have recently gained popularity for achieving high levels of parallelism. It was a nice experience for me to explore Parallella boards and to implement the functionalities of actor-based models on these boards.

I especially want to sincerely thank my supervisor Prof. Roberto Giorgi for his support and inspiration to apply for this internship in the first year of my PhD. I would also like to thank HiPEAC for giving me the opportunity to work with this group of highly skilled and inspiring professional researchers at Ericsson.
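The actor model mentioned above can be sketched minimally (a generic Python illustration of the concept, not the Ericsson/Parallella implementation):

```python
# Minimal actor sketch: each actor owns a private mailbox and reacts to
# messages one at a time, so actors share no state and can be distributed
# across cores or boards by a runtime.
from collections import deque

class Actor:
    def __init__(self, behavior):
        self.mailbox = deque()
        self.behavior = behavior
        self.state = []

    def send(self, msg):
        self.mailbox.append(msg)          # asynchronous message passing

    def run(self):
        while self.mailbox:               # process messages sequentially
            self.behavior(self.state, self.mailbox.popleft())

accumulator = Actor(lambda state, msg: state.append(msg * 2))
for value in (1, 2, 3):
    accumulator.send(value)
accumulator.run()
assert accumulator.state == [2, 4, 6]
```

Because all interaction happens through mailboxes, an actor runtime is free to place each actor on any core, which is what makes the model attractive for many-core boards like the Parallella.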

Somnath Mazumdar, University of Siena (UniSi), Siena, Italy
_________

I am a final year PhD student at Politecnico di Milano, Italy. For the HiPEAC Industrial Internship,I worked at the Media Pro­cessing Group (MPG) of ARM in Cambridge, from August to December of 2014, under the super vision of Marco Cornero. The topic of the internship was to investigate improvements in the pro grammability of tightly­integrated CPU­GPU sys tems. The cost of commu nication between CPUs and GPUs is decreasing, in particular thanks to multi­core chips and memory coherency tech niques. This trend can also be noticed in current programming models, for example in the new OpenCL 2.0 standard, which introduces shared virtual memory profiles, and in the HSA Foundation. The idea developed during the internship is to fully exploit the benefits provided by full shared virtual memory, in order to simplify hetero geneous systems programming, in parti cular the host­device interaction API. We defined a prototype programming model inspired by the Khronos SyCL provisional specification, which we exten­ded to support almost all the features of

the C++ language, on both CPU and GPU, with the least possible restric tions. The sharing of data and data pointers is implicit and almost free, thanks to shared virtual memory, while special care is needed for the executable code, given the presence of multiple instruction sets in the hetero geneous system. The main topic of the work was to provide full support for the C++ language also for the device kernels. We identified three main challen­ges: support for virtual member functions and representation of virtual tables for multiple devices, support for function poin ters, and support for generalized sys tem calls (host services accessed remotely from the devices). These features share the fact that they involve code for both the CPU and the GPU. Because of the different instruction sets, functions must be compiled for multiple targets. In this context, a device that invokes a function from a function pointer needs to access the version of the function that corres­ponds to its ISA, or otherwise if such a version is not present, it should remotely

invoke the function on the host. During the internship, various solutions for the aforementioned functionality, with different requirements, were evaluated and prototyped. A considerable effort was put into not breaking the host ABI, which has a strong impact on ensuring interoperability with existing codebases. Furthermore, I started the development of a prototype compiler based on the Clang/LLVM framework to support the new programming model. In addition, I actively contributed to the development of the compiler backend for ARM's next-generation GPUs, in order to support some of the features required by the heterogeneous programming model investigations.

Michele Scandale, Politecnico di Milano, Italy_________

INTERNSHIP REPORT: SOMNATH MAZUMDAR
Host Institution: Ericsson Research Lab, Sweden
Title: Programming Parallel Cloud Hardware

INTERNSHIP REPORT: MICHELE SCANDALE
Host Institution: ARM, UK
Title: Exploiting Tight CPU/GPU integration to Improve GPGPU Programming Models

HiPEAC info 43 17


hipeac students

Host Institution: Samsung R&D Institute (SRUK), United Kingdom
Title: On Heterogeneous Programming with MPSoC Application Programming Studio “MAPS”

INTERNSHIP REPORT: TURKEY ALSALKINI

I am a PhD student at Heriot-Watt University, working on a mechanism to balance the runtime load and optimize resource use on heterogeneous architectures. As part of a HiPEAC-sponsored industrial internship, I spent four months, part-time, at Samsung R&D Institute (SRUK), London, United Kingdom. Samsung R&D Institute focuses on investigating and optimising native libraries for embedded multi-core heterogeneous architectures. Android is a mobile operating system based on the Linux kernel, currently developed by Google. With a user interface based on direct manipulation, Android is designed primarily for touch-screen mobile devices such as smartphones and tablet computers. The number of cores in mobile devices is increasing to meet the needs of performance-demanding applications. To take advantage of such a

multi-core architecture, these resources should be fully utilized. The main task of this internship was to improve the start-up time of Android applications in order to reduce power consumption. We started by analysing the memory heap of Android applications to investigate memory usage, and we looked for memory leaks that might have an impact on application performance. Then we moved on to explore the Android application launch process. In this process, the current layout inflator serially instantiates the UI elements from the XML file. Instantiating a UI element does not take a long time unless the element has an image to decode. Our solution assigns the decoding of images to worker threads, in order to let the main thread inflate other components while the threads are decoding the images. Our modifications were to the Android

framework, through adding a task manager, runnable tasks and future results. We started the implementation by applying our changes to the Android framework, and we tested the results on the Emulator, the Samsung Nexus 10 and the Samsung Note 4. The results show that the inflator offloads a task to a worker thread whenever it encounters a UI element with an image to decode. As a result, the start-up time improved by utilizing the multi-core architecture of the mobile device. I would like to thank HiPEAC and Samsung for giving me this opportunity to meet and work with highly skilled people. I would also like to thank my supervisor Greg Michaelson for his support.

Turkey Alsalkini, Heriot-Watt University, Edinburgh, UK _________

As part of the HiPEAC industrial internship program for PhD students, I spent four months at ARM (Cambridge, UK) investigating the effects of false sharing.

False sharing arises in shared-memory multiprocessors with an invalidation-based coherence protocol. As multiple addresses are mapped onto the same cache line, two cores working on distinct addresses might end up sharing a cache line. This can cause false invalidations, where a write to address A by one core invalidates the copy of the cache line containing A cached at another location, even though the remote cache never accessed A and never will. In the worst-case scenario, both cores write to distinct addresses mapped onto the same cache line, causing a ping-pong effect where they continuously invalidate each other's copy. This means that false sharing causes

misses that could have been avoided. False sharing can be addressed in software by remapping the data structures that cause it, but this requires active collaboration from the programmer. Hardware can also be used to detect false sharing, but at the cost of additional logic in the cache controllers.

The goal of this internship was to investigate the severity of the false sharing problem using the computing system simulator gem5. I instrumented the cache controllers with additional code that assigns access vectors to all cache lines to track which words in a cache line have been used locally. By comparing these with the access vector of remote requests, false sharing can be detected. This also allowed other statistics to be gathered, such as the utilisation of cache lines. I really liked doing research in an industrial

environment, as it has a different focus than research in an academic setting. I felt really integrated into the group and its research, which allowed me to learn a lot during my time at ARM. The work I did also contributed to internal projects, which was a nice change from PhD work, which can at times be a bit solitary. ARM gave me the chance to extend my stay by another two months, which I took so that I could completely finish the project. I am very thankful to HiPEAC and ARM for offering these industrial internships: I not only learnt more about computer architecture, but I also think this experience will contribute to my becoming a better researcher. I would also like to thank everybody at ARM I worked with for an interesting and enjoyable experience.

Anouk Van Laer, University College London, UK _________

Host Institution: ARM, UK
Title: False Sharing Cache Misses Detection and Elimination

INTERNSHIP REPORT: ANOUK VAN LAER



PhD news

ADAPTIVE PARALLELISM MAPPING IN DYNAMIC ENVIRONMENTS USING MACHINE LEARNING

Murali Emani, University of Edinburgh, UK
Advisor: Prof. Michael O’Boyle
Graduation date: March 2015

Modern-day hardware platforms, ranging from mobiles to data centers, are parallel and diverse. The execution environment, composed of workloads and hardware resources, is dynamic and unpredictable, which makes efficiently matching program parallelism to machine parallelism under uncertainty hard. My thesis proposes solutions to the mapping of parallel programs in dynamic environments. It employs predictive modelling techniques to determine the best degree of parallelism. Firstly, this thesis proposes a machine-learning-based model that uses static code and dynamic runtime features as input to determine the optimal thread number for a target program. Next, this thesis proposes a novel solution to monitor the proposed offline model and adjust its decisions in response to any drastic environment changes. Furthermore, considering the multitude of potential execution scenarios, where no single policy is best suited in all cases, this work proposes an approach based on the idea of mixture of experts. It considers a number of offline experts, or mapping policies, each specialized for a given scenario, and learns online the expert that is optimal for the current execution. _________

TOWARDS DEPENDABLE NETWORK-ON-CHIP ARCHITECTURES

Changlin Chen, Delft University of Technology, NL
Advisor: Assoc. Prof. Sorin Cotofana
Graduation date: May 2015

In this dissertation, we propose several novel NoC-tailored mechanisms to tolerate faults induced by transistor miniaturization, as well as to efficiently utilize still-functional NoC components. We first introduce a low-cost method to allow for correct flit transmission even when soft errors occur in the router control plane. Then we propose a Flit Serialization (FS) strategy to tolerate broken link wires and to efficiently utilize the remaining link bandwidth. Within the FS framework, heavily defected links whose fault levels exceed a certain threshold value are deactivated. Moreover, we design a distributed-logic-based routing algorithm able to tolerate totally broken links as well as to efficiently utilize UnPaired Functional (UPF) links in partially defected interconnects. We also introduce a link-bandwidth-aware runtime task mapping algorithm to improve the mapping quality for newly injected applications in MPSoCs. Last but not least, we discuss the application of the aforementioned strategies in 3D NoC systems and propose a Bus Virtual channel Allocation (BVA) mechanism to enable vertical wormhole switching in 3D NoC-Bus hybrid systems. All proposals are evaluated in our NoC simulation platform, and their advantages over state-of-the-art counterparts are demonstrated by means of experimental results. _________



upcoming events

International Conference on Embedded Computer Systems: Architectures, Modeling and Simulation (SAMOS XV)
20-23 July 2015, Samos Island, Greece
http://samos-conference.com/

The 2015 International Conference on High Performance Computing & Simulation (HPCS 2015)
20-24 July 2015, Amsterdam, the Netherlands
http://hpcs2015.cisedu.info/

International Symposium on Low Power Electronics and Design (ISLPED)
22-24 July 2015, Rome, Italy
http://www.islped.org/2015/index.html

Euro-Par 2015
24-28 August 2015, Vienna, Austria
http://www.europar2015.org/

22nd European Conference on Circuit Theory and Design (ECCTD2015)
24-26 August 2015, Trondheim, Norway
http://www.ntnu.edu/ecctd2015/

ParCo2015
1-4 September 2015, Edinburgh, UK
http://www.parco.org/

International Conference on Field-programmable Logic and Applications (FPL 2015)
2-4 September 2015, London, UK
http://www.fpl2015.org/

IEEE 9th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC-15)
23-25 September 2015, Turin, Italy
http://mcsoc-forum.org/2015/

2015 IEEE Nordic Circuits and Systems Conference (NORCAS)
26-28 October 2015, Oslo, Norway
http://www.norcas.org/

22nd IEEE International Symposium on High Performance Computer Architecture (HPCA), 21st ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP), and 2016 International Symposium on Code Generation and Optimization (CGO)
12-16 March 2016, Barcelona, Spain
http://hpca22.site.ac.upc.edu/
http://conf.researchr.org/home/PPoPP-2016
http://cgo.org/cgo2016/

HIPEAC 2016 CONFERENCE, 18-20 JANUARY 2016, PRAGUE, CZECH REPUBLIC
WWW.HIPEAC.NET/2016/PRAGUE

info 43

hipeac info is a quarterly newsletter published by the hipeac network of excellence, funded by the 7th european framework programme (fp7) under contract no. fp7/ict 287759
website: https://www.hipeac.net/
subscriptions: https://www.hipeac.net/publications/newsletter/

contributions
If you are a HiPEAC member and would like to contribute to future HiPEAC newsletters, please visit https://www.hipeac.net/publications/newsletter/

design: www.magelaan.be