X‐Eye: A Profiling Tool for X10
Seisei Itahashi, Yoshiki Sato
Prof. Shigeru Chiba
Master's student in Chiba Lab, The University of Tokyo
Seisei Itahashi, Chiba Lab in U.Tokyo
Background
Programming languages such as X10 are evolving toward abstracting away low-level processing
Program behavior becomes more implicit and cannot be seen directly
Performance tuning can therefore be difficult
Motivation: Visualize Implicit Behavior
1. Implicit data transfer
   When and how much data is transferred?
2. Waste of CPU resources in synchronization
   Which activities cause long sync‐wait time?
3. Scattered synchronization code
   Which async corresponds to which finish? Is it possible to determine this statically?
1. Implicit data transfer
When is the value of a transferred to place 2? How much data is transferred? What route does the value of a take?
• Users want to know this easily
  – For performance tuning
• But it is hard to see at a glance
  – Harder still if the code is more complex
Variable a is initialized as an array of 10 elements, each with the value 1
It is moved 2 times in sequence
Only the first element is referenced
2. Waste of CPU resources in sync
• Sync and activity creation are easy in X10
  – Useless sync blocking can easily happen
finish {
    async { S1; }
    async { S2; }
    async { S3; }
}
[Figure: timeline of activities A, B, and C (S1, S2, S3) between sync start and sync end]
Blocking time may waste CPU resources
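The blocking time in the figure can be estimated from timestamps alone. The following is a minimal sketch in Java (not X10, and not X‐Eye's own code), assuming a hypothetical map from activity name to the time at which that activity ended. It treats the end of the slowest activity as the sync end and reports, for every activity, how long it sat idle before the finish could complete.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration only: derives per-activity idle time
// from the end timestamps recorded inside one finish block.
class SyncWait {
    // The finish can only end when the slowest activity ends, so the idle time
    // of each activity is (latest end time - its own end time).
    static Map<String, Long> idleTimes(Map<String, Long> activityEndMillis) {
        long syncEnd = 0;
        for (long end : activityEndMillis.values()) {
            syncEnd = Math.max(syncEnd, end);
        }
        Map<String, Long> idle = new LinkedHashMap<>();
        for (Map.Entry<String, Long> e : activityEndMillis.entrySet()) {
            idle.put(e.getKey(), syncEnd - e.getValue());
        }
        return idle;
    }

    public static void main(String[] args) {
        Map<String, Long> ends = new LinkedHashMap<>();
        ends.put("A", 40L);   // S1 ends 40 ms after sync start (made-up numbers)
        ends.put("B", 100L);  // S2 is the slowest activity
        ends.put("C", 25L);   // S3
        System.out.println(idleTimes(ends));
    }
}
```

The activity with an idle time of zero is the one that holds the finish open, which makes it the first candidate for tuning.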
3. Scattered synchronization code
• Which async corresponds to which finish?
  – The number of asyncs executed during the finish clause depends on the variables num and taskSet.
Jump to a separate source file
How many asyncs?
Moreover…
Run in parallel appropriately (the values of num and taskSet decide the number of activities)
Run sequentially (one activity runs all tasks)
Run in parallel (one activity per task)
Which class’s run() method?
X‐Eye: Profiler and Visualizer for X10 (1/2)
• Profiler: records X10-specific events at parallel and distribution constructs such as at and finish at runtime
  – With transferred data size and activity identifiers
[Figure: X‐Eye workflow]
X10 source code -> Profiler (inserts profiling code and compiles it) -> X10 binary code (both Managed and Native X10) -> run it and generate the log -> log data -> Visualizer (analyzes the log)
Profiler: extends the Polyglot implementation of the X10 compiler
Visualizer: based on JavaFX
X‐Eye: Profiler and Visualizer for X10 (2/2)
• Visualizer: not only visualizes the events but also lets the user interactively limit the scope of visualization with the Scoping DSL
  – The grammar is like the Stream API in Java 8
  – Filter conditions can be written as lambda expressions
  – It is possible to filter interactively while watching the interim result
Ex: eventStream = eventStream.filter(e -> e.execPlace == 0);
Run: display only the events that happened in Place 1
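To make the Stream-like style concrete, here is a minimal sketch in plain Java. The ScopedEvent class and its kind field are assumptions made for illustration; only the execPlace field and the first filter come from the slide. The second filter stands in for a refinement a user might add after looking at an interim result.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical event record: only the execPlace field appears on the slide.
class ScopedEvent {
    final String kind;    // assumed field: "at", "async", "finish", ...
    final int execPlace;  // place in which the event was executed

    ScopedEvent(String kind, int execPlace) {
        this.kind = kind;
        this.execPlace = execPlace;
    }
}

class ScopingSketch {
    // Narrows the stream in two steps, the way a user might refine the scope
    // after looking at an interim result.
    static List<ScopedEvent> scope(List<ScopedEvent> events) {
        Stream<ScopedEvent> eventStream = events.stream();
        eventStream = eventStream.filter(e -> e.execPlace == 0);        // filter from the slide
        eventStream = eventStream.filter(e -> e.kind.equals("async"));  // assumed second step
        return eventStream.collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<ScopedEvent> events = Arrays.asList(
            new ScopedEvent("async", 0),
            new ScopedEvent("at", 0),
            new ScopedEvent("async", 1));
        System.out.println(scope(events).size() + " event(s) remain in scope");
    }
}
```

Because each filter call returns a new stream, conditions can be stacked one at a time without rewriting the earlier ones.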
Demo
Implementation of X‐Eye
• Profiler
  – Extends the X10 compiler to insert the profiling code into the AST
    • Developed visitors for Polyglot
• Visualizer
  – The GUI is implemented using JavaFX
  – The Scoping DSL is Java code, so it is compiled in memory, output as a class file, loaded, and run
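The compile-load-run step for a DSL snippet can be sketched with the standard javax.tools API. This is not X‐Eye's implementation: the class name GeneratedFilter, the helper compilePlaceFilter, and the simplification of filtering on a bare Integer place number are all assumptions, and the sketch keeps the class bytes in memory instead of writing a class file. It needs a JDK, since ToolProvider.getSystemJavaCompiler() returns null when no compiler is available.

```java
import java.io.ByteArrayOutputStream;
import java.io.OutputStream;
import java.net.URI;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;
import javax.tools.FileObject;
import javax.tools.ForwardingJavaFileManager;
import javax.tools.JavaCompiler;
import javax.tools.JavaFileManager;
import javax.tools.JavaFileObject;
import javax.tools.SimpleJavaFileObject;
import javax.tools.ToolProvider;

class InMemoryFilterCompiler {
    // Source code held in a String instead of a file.
    static class StringSource extends SimpleJavaFileObject {
        private final String code;
        StringSource(String className, String code) {
            super(URI.create("string:///" + className + Kind.SOURCE.extension), Kind.SOURCE);
            this.code = code;
        }
        @Override public CharSequence getCharContent(boolean ignoreEncodingErrors) { return code; }
    }

    // Compiled class held in a byte array instead of a .class file.
    static class ByteCode extends SimpleJavaFileObject {
        final ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        ByteCode(String className) {
            super(URI.create("bytes:///" + className + Kind.CLASS.extension), Kind.CLASS);
        }
        @Override public OutputStream openOutputStream() { return bytes; }
    }

    // Wraps a condition over "execPlace" in a generated class, compiles it, loads it.
    @SuppressWarnings("unchecked")
    static Predicate<Integer> compilePlaceFilter(String condition) {
        String className = "GeneratedFilter"; // hypothetical name
        String source = "import java.util.function.Predicate;\n"
            + "public class " + className + " implements Predicate<Integer> {\n"
            + "  public boolean test(Integer execPlace) { return " + condition + "; }\n"
            + "}\n";
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) throw new IllegalStateException("a JDK is required");
        Map<String, ByteCode> outputs = new HashMap<>();
        JavaFileManager manager = new ForwardingJavaFileManager<JavaFileManager>(
                compiler.getStandardFileManager(null, null, null)) {
            @Override
            public JavaFileObject getJavaFileForOutput(JavaFileManager.Location location,
                    String name, JavaFileObject.Kind kind, FileObject sibling) {
                ByteCode out = new ByteCode(name); // capture the compiler output in memory
                outputs.put(name, out);
                return out;
            }
        };
        boolean ok = compiler.getTask(null, manager, null, null, null,
                Collections.singletonList(new StringSource(className, source))).call();
        if (!ok) throw new IllegalArgumentException("compilation failed: " + condition);
        ClassLoader loader = new ClassLoader(InMemoryFilterCompiler.class.getClassLoader()) {
            @Override protected Class<?> findClass(String name) throws ClassNotFoundException {
                ByteCode bc = outputs.get(name);
                if (bc == null) throw new ClassNotFoundException(name);
                byte[] b = bc.bytes.toByteArray();
                return defineClass(name, b, 0, b.length);
            }
        };
        try {
            return (Predicate<Integer>) loader.loadClass(className)
                    .getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Predicate<Integer> filter = compilePlaceFilter("execPlace == 0");
        System.out.println(filter.test(0) + " " + filter.test(1));
    }
}
```

Compiling the snippet as ordinary Java means the DSL needs no parser of its own, at the cost of running whatever code the user types.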
Data recorded by our tool
• Events of execution of
  – at, async, finish, clocked, etc.
• Size of transferred data
  – During compilation, the tool statically determines which values are transferred, then estimates the size at every at.
    • The size is estimated based on the implementation of the X10 runtime
  – At runtime, the profiling code records the estimated size of the transferred data at the beginning of the block of the at operation.
Data recorded by our tool (cont.)
• Activity identifier
  – Keeps track of which activity is executing.
  – The profiling code records the activity identifier
    • To trace the relation between async and finish.
  – It is also used to address the 2nd motivation: our tool inserts an extra parameter into every method. The parameter specifies an activity identifier.
The X10 API does not provide activity identifiers, because an activity is not always bound to a single thread. Instead, a worker handles multiple activities. But users want to distinguish activities, so we implemented them!
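The idea of threading an identifier through an extra parameter can be illustrated in plain Java. This is only a sketch under assumptions: the methods task and square, the "0-1" style identifiers, and the shared LOG list are hypothetical, and a single-threaded executor stands in for an X10 worker. Since one worker thread runs every activity here, the thread cannot tell the activities apart, but the extra parameter can.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

class ActivityIdSketch {
    static final List<String> LOG = new CopyOnWriteArrayList<>();

    // Instrumented form of a user method square(int n): the trailing
    // activityId parameter is the one a rewriting compiler would add.
    static int square(int n, String activityId) {
        LOG.add(activityId + " square(" + n + ")");
        return n * n;
    }

    // Instrumented form of task(int n); the id is passed down the call chain.
    static void task(int n, String activityId) {
        LOG.add(activityId + " task(" + n + ")");
        square(n, activityId);
    }

    // Spawns `count` activities on one worker and returns the recorded log.
    static List<String> runActivities(int count) {
        LOG.clear();
        ExecutorService worker = Executors.newSingleThreadExecutor();
        for (int i = 1; i <= count; i++) {
            final int n = i;
            final String activityId = "0-" + i; // assumed id format, fixed at spawn time
            worker.execute(() -> task(n, activityId));
        }
        worker.shutdown();
        try {
            if (!worker.awaitTermination(10, TimeUnit.SECONDS)) {
                throw new IllegalStateException("activities did not finish in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return new ArrayList<>(LOG);
    }

    public static void main(String[] args) {
        for (String line : runActivities(2)) {
            System.out.println(line);
        }
    }
}
```

A thread-local variable would not work as a substitute here, because it would report the worker rather than the activity.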
An example of profiling code of our tool
• We’ve modified the X10 compiler based on polyglot
Insert profiling code
An extra parameter added to each method
Insert profiling code
An example of JSON event log file
Our tool reveals tuning opportunities such as...
• KMeansDist.x10
  – Some activities move to the place where they already are (no need!)
  – Activity 0‐0 (main activity) moves to every place one‐by‐one to create a new activity
    • This can be parallelized
• NQueensPar.x10
  – Activities don't move. All activities are in the same place
  – Some activities cause long sync‐blocking times
Source Place and Target Place are the same
There are several D events (move events with data transfer)
You can also check this by following the transfer route of variable p
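The same check can be expressed as a query over logged move events. The sketch below assumes a hypothetical MoveEvent record; its field names are illustrative and are not the schema of the X‐Eye log. It selects the moves whose source and target place are equal and sums the data they carried.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical move event; the field names are assumptions for illustration.
class MoveEvent {
    final String activityId;
    final int sourcePlace;
    final int targetPlace;
    final long transferredBytes;

    MoveEvent(String activityId, int sourcePlace, int targetPlace, long transferredBytes) {
        this.activityId = activityId;
        this.sourcePlace = sourcePlace;
        this.targetPlace = targetPlace;
        this.transferredBytes = transferredBytes;
    }
}

class RedundantMoveSketch {
    // Moves whose target is the place the activity is already in.
    static List<MoveEvent> samePlaceMoves(List<MoveEvent> moves) {
        return moves.stream()
                .filter(m -> m.sourcePlace == m.targetPlace)
                .collect(Collectors.toList());
    }

    // Total size of the data carried by the given moves.
    static long totalBytes(List<MoveEvent> moves) {
        return moves.stream().mapToLong(m -> m.transferredBytes).sum();
    }

    public static void main(String[] args) {
        List<MoveEvent> moves = Arrays.asList(
            new MoveEvent("0-0", 0, 1, 400),   // made-up sample data
            new MoveEvent("0-1", 1, 1, 400),
            new MoveEvent("0-2", 2, 2, 80));
        List<MoveEvent> redundant = samePlaceMoves(moves);
        System.out.println(redundant.size() + " same-place moves, "
                + totalBytes(redundant) + " bytes");
    }
}
```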
The main activity visits the places one by one to spawn child activities
The program may be accelerated by parallelizing this part
NQueensPar.x10
No activity ever moves
Computing resources cannot be used efficiently even if the program runs on several places
Activity 0‐2 causes a long blocking time in sync
F: “finish” event
Overhead of Profiler (Result in X10 workshop ‘14)
• KMeansDist.x10
  – CPU: Intel Xeon E5‐2687W 3.10 GHz, 8 cores x 2; RAM: 64 GB
  – OS: CentOS release 6.2
  – X10 version: 2.4.0
  – Number of places: 2 to 10
Large overheads are caused by inter‐place communication for profiling
  – The profiler frequently moves to the first place to record the data
Overhead of Profiler (Latest result)
Overhead was reduced
  – The profiler collects the results only at the end of the program (= main function)
But the overhead still increases as the number of places increases
Number of places: 1 ~ 20
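Collecting the results only at the end of the program amounts to buffering events locally and merging the buffers once. The following Java sketch illustrates that idea under assumptions: the class BufferedLogSketch and its methods are hypothetical, places are modeled as list indices, and nothing here reflects how X‐Eye actually stores its log.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: one local buffer per place, merged once at the end.
class BufferedLogSketch {
    private final List<List<String>> perPlace = new ArrayList<>();
    private int collections = 0; // how many place buffers were pulled

    BufferedLogSketch(int places) {
        for (int p = 0; p < places; p++) {
            perPlace.add(new ArrayList<>());
        }
    }

    // Called by profiling code: appends to the local buffer only,
    // so recording an event needs no communication with another place.
    void record(int place, String event) {
        perPlace.get(place).add(event);
    }

    // Called once at the end of main: pulls every buffer exactly once.
    List<String> collectAtEnd() {
        List<String> merged = new ArrayList<>();
        for (List<String> buffer : perPlace) {
            merged.addAll(buffer);
            collections++;
        }
        return merged;
    }

    int collections() {
        return collections;
    }

    public static void main(String[] args) {
        BufferedLogSketch log = new BufferedLogSketch(3);
        log.record(1, "at");
        log.record(0, "finish");
        log.record(1, "async");
        System.out.println(log.collectAtEnd() + " after " + log.collections() + " collections");
    }
}
```

The merged list is grouped by place rather than ordered by time, so a real log would also need timestamps to restore the event order.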
We consider that the “at” event causes the overhead
  – Part of the inserted profiling code is transferred when an “at” happens
  – The number of “at” events increases as the number of places increases
Related Work
• Guiding X10 Programmers to Improve Runtime Performance
  – XAnalyzer
  – Detects code matching 8 predefined patterns
  – Suggests better code for each pattern
• Data‐centric Performance Analysis of PGAS Applications
  – Detects “read” and “write” of global data objects
  – Target language is Global Arrays
• Automatic Communications Performance Debugging in PGAS Languages
  – ti‐trend‐prof
  – Debugs the performance of remote reads and writes
  – Target language is Titanium
Conclusion
• Developed a tool that helps expose implicit behavior in X10
  – To make the implicit data transfer explicit
  – To capture the activities' behavior inside the synchronization even if…
    • Dynamic dispatching happens
    • The code inside the sync is scattered across files and pieces of code
• Developed interactive filtering functionality
  – Events can be filtered during visualization with the Scoping DSL
  – Filtering can continue while watching the interim results
Future Work
• Reduce the overhead in the case of many places
• Extend the event filtering functionality
  – Restrict the range of profiling at profiling time, not only at visualization time
    • This results in reduced overhead
• Interactive filtering based on execution contexts