41
1 ATHENa – Automated Tool for Hardware Evalua4oN: Toward Fair and Comprehensive Benchmarking of Cryptographic Hardware using FPGAs Kris Gaj , JensPeter Kaps, Venkata Amirineni, Marcin Rogawski, Ekawat Homsirikamol, Benjamin Y. Brewster Supported in part by the National Institute of Standards & Technology (NIST)

FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

  • Upload
    others

  • View
    0

  • Download
    0

Embed Size (px)

Citation preview

Page 1: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

1  

ATHENa  –  Automated  Tool  for  Hardware  Evalua4oN:    

Toward  Fair  and  Comprehensive  Benchmarking  of  Cryptographic  Hardware  

using  FPGAs  

Kris  Gaj,  Jens-­‐Peter  Kaps,    Venkata  Amirineni,  Marcin  Rogawski,  

Ekawat  Homsirikamol,    Benjamin  Y.  Brewster  

Supported in part by the National Institute of Standards & Technology (NIST)

Page 2: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa  Team  

Venkata “Vinny” MS CpE student

Ekawat “Ice”

PhD CpE student

Marcin

PhD ECE student

Rajesh

PhD ECE student

Michal PhD exchange student from

Slovakia

John

MS CpE student

Page 3: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

3

•  Motivation & Goals

•  Previous Work

•  ATHENa

•  Two Case Studies

•  Future Work & Conclusions

Outline

Page 4: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

4

Common Benchmarking Pitfalls

•  taking credit for improvements in technology

•  choosing a convenient (but not necessarily fair) performance measure

•  comparing designs with different functionality

•  comparing designs optimized using different optimization target

•  comparing clock frequency after synthesis vs. clock frequency after placing and routing

4

Page 5: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

5

Objective Benchmarking Difficulties

•  lack of standard interfaces •  influence of tools and their options

•  stand-alone performance vs. performance as a part of a bigger system

•  time & effort spent on optimization

5

Page 6: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

6

Goal of Our Project

•  spread knowledge and awareness needed to eliminate benchmarking pitfalls

•  develop methodology and tools required to overcome objective difficulties

6

Page 7: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

7

Why Cryptographic Algorithms?

•  well documented speed-ups (10x-10,000x)

•  security gains (e.g., key generation & storage)

•  constantly evolving standards

•  cryptographic standard contests

7

Page 8: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

8

Cryptographic Standard Contests

8 time 96 97 98 99 00 01 02 03 04 05 06 07 08 09 10 11 12

AES

NESSIE

CRYPTREC

eSTREAM

SHA-3

34 stream ciphers → 4 SW+4 HW winners

51 hash functions → 1 winner

15 block ciphers → 1 winner IX.1997 X.2000

I.2000 XII.2002

V.2008

X.2007 XII.2012

XI.2004

Page 9: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

9

Criteria Used to Evaluate Cryptographic Algorithms

9

Security

Software Efficiency

Hardware Efficiency

Flexibility

Page 10: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

10

Advanced Encryption Standard Contest

Speed in FPGAs Votes at the AES 3 conference

Round 2 of AES Contest, 2000

Page 11: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

11

Benchmarking in Cryptography

11

Software ASICs FPGAs

eBACS

D. Bernstein, T. Lange

? ?

Page 12: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

12

eBACS: ECRYPT Benchmarking of Cryptographic Systems

12

•  measurements on multiple machines (currently over 70) •  each implementation is recompiled multiple times

(currently over 1200 times) with various compiler options •  time measured in clock cycles/byte for multiple input/output sizes •  median, lower quartile (25th percentile), and upper quartile

(75th percentile) reported •  standardized function arguments (common API)

SUPERCOP - toolkit developed by D. Bernstein and T. Lange for measuring performance of cryptographic software

Page 13: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa – Automated Tool for Hardware EvaluatioN

13  

Benchmarking open-source tool, written in Perl, aimed at an

AUTOMATED generation of OPTIMIZED results for MULTIPLE hardware platforms

Currently under development at George Mason University.

http://cryptography.gmu.edu/athena

Page 14: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

Why Athena?

14  

"The Greek goddess Athena was frequently called upon to settle disputes between the gods or various mortals. Athena Goddess of Wisdom was known for her superb logic and intellect. Her decisions were usually well-considered, highly ethical, and seldom motivated by self-interest.”

from "Athena, Greek Goddess of Wisdom and Craftsmanship"

Page 15: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa Server

FPGA Synthesis and Implementation

Result Summary + Database Entries

2 3

HDL + scripts + configuration files

1

Database Entries

Download scripts and

configuration files8

Designer

4

HDL + FPGA Tools

User

Database query

Ranking of designs

5 6

Basic Dataflow of ATHENa

0 Interfaces

+ Testbenches 15  

Page 16: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

16  

synthesizable  source  files  

configura4on  files    

testbench  

constraint  files    

result  summary    

(user-­‐friendly)  

database  entries    

(machine-­‐  friendly)  

Page 17: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa  Major  Features  (1)  •  synthesis,  implementa3on,  and  3ming  analysis  in  batch  mode  

•  support  for  devices  and  tools  of  mul4ple  FPGA  vendors:    

•  genera3on  of  results  for  mul4ple  families  of  FPGAs  of  a  given  vendor  

•  automated  choice  of  a  best-­‐matching  device  within  a  given  family  

17  

Page 18: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa  Major  Features  (2)  

•  automated  verifica4on  of  designs  through  simula3on  in  batch  mode  

•  support  for  mul4-­‐core  processing  

•  automated  extrac4on  and  tabula4on  of  results  

•  several  op4miza4on  strategies  aimed  at  finding  

–  op3mum  op3ons  of  tools  

–  best  target  clock  frequency  

–  best  star3ng  point  of  placement  

OR

18  

Page 19: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

19  

Multi-Pass Place-and-Route Analysis GMU  SHA-­‐256,  Xilinx  Spartan  3  

100 runs for different placement starting points

The bigger the better

~ 15%

best

worst

19  

Page 20: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

20

Results for default target clock frequency

default COST_TABLE (1) preselected COST_TABLEs (21, 41, 61, 81)

Page 21: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

21

Results for target clock frequency = 80 MHz

Page 22: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

22

Results for target clock frequency = 85 MHz

default COST_TABLE (1) preselected COST_TABLEs (21, 41, 61, 81)

Page 23: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

23

Results for target clock frequency = 90 MHz

Page 24: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

24

Results for target clock frequency = 95 MHz

Page 25: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

25

Optimization Strategy Used for Xilinx Devices

1.  Frequency search Search for the highest requested clock frequency that is met

with a single run of tools. 2.  Exhaustive search

Search for the best combination of the following options •  optimization target for synthesis: area, speed •  optimization target for mapping: area, speed •  optimization effort level for placing and routing: medium, high

3.  Placement search Search for the best starting point for placement,

using 4 additional values of the COST_TABLE {21, 41, 61, 81}.

Total number of runs = 15-20

Page 26: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

26

Optimization Strategy Used for Altera Devices

1. Exhaustive search Search for the best combination of the following options:

•  Synthesis optimization: speed, area, balanced •  Optimization effort: auto, fast •  Implementation effort: standard, auto

2. Placement search Search for the best starting point of placement,

using 4 additional values of SEED {2001, 4001, 6001, 8001}.

Total number of runs = 16

Page 27: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

27

Benchmarking Goals Facilitated by ATHENa

1.  comparing multiple cryptographic algorithms 2.  comparing multiple hardware architectures or implementations

of the same cryptographic algorithm 3.  comparing various hardware platforms from the point of view

of their suitability for the implementation of a given algorithm, such as a choice of an FPGA device or FPGA board for implementing a particular cryptographic system

4.  comparing various tools and languages in terms of quality of results they generate (e.g. Verilog vs. VHDL, Synplicity Synplify Pro vs. Xilinx XST, ISE v. 10.2 vs. ISE v. 9.1)

Page 28: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

28

Algorithm Comparison: SHA-256 vs. Fugue on Xilinx Spartan 3

SHA 256 Fugue 256

0 200 400 600 800

1000 1200 1400

Single Optimized

625.9 694.9

(+11%)

1100.2 1283.2 (+17%)

Throughput (Mbit/s)

SHA 256 Fugue 256

0 500

1000 1500 2000 2500 3000 3500 4000

Single Optimized

1020 883 (-13%)

3987 3873 (- 3%)

Area (CLB slices)

Page 29: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

29

Architecture Comparison: SHA-256 Basic Loop vs. Rescheduling on Altera Cyclone II

0 100 200 300 400 500 600 700 800 900

Single Optimized

831

871.8 (+5%)

838.7 854.6 (+2%)

Throughput (Mbit/s)

0

500

1000

1500

2000

2500 2019

2015 (- .1 %)

2291 2216 (-3%)

Area (LE)

Rescheduling Basic Loop Rescheduling

Basic Loop Single Optimized

Page 30: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

30

Platform Comparison: SHA-256 Xilinx Spartan 3 vs. Altera Cyclone II

Spartan 3 Cyclone II

0 100 200 300 400 500 600 700 800 900

Single Optimized

625.9 694.9

(+11%)

831 871.8 (+5%)

Throughput (Mbit/s)

Spartan 3 Cyclone II

0

500

1000

1500

2000

2500

Single Optimized

2040

1776 (-13 %)

2019 2015 (-.1%)

Area (LC or LE)

Page 31: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

31

Tool Comparison: SHA-256 Rescheduling with ISE 9.1 vs. ISE 11.1

ISE 9.1 ISE 11.1

0

200

400

600

800

1000

1200

Single Optimized

1020

873 (-14%)

1020 883 (-13%)

Area (CLB Slices)

ISE 9.1 ISE 11.1

0 100 200 300 400 500 600 700 800

Single Optimized

613.4

729.2 (+19%)

625.9 694.9

(+11%)

Throughput (Mbit/s)

Page 32: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

32

•  Timeline

Benchmarking of 14 Round-2 SHA-3 Candidates

•  High-speed implementations of all 14 Round 2 SHA-3 candidates and the current standard SHA-2 developed and evaluated using ATHENa

•  Results reported at CHES 2010 and at the SHA-3 conference organized by NIST

51 candidates

Round 1 14

5-6 1-2 Round 2 Round 3

July 2009 End of 2010 Mid 2012 Oct. 2008

Page 33: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

33

•  batch mode of FPGA tools

•  ease of extraction and tabulation of results •  Excel, CSV (available), LaTeX (coming soon)

•  optimized choice of tool options •  GMU_Xilinx_optimization_1 strategy

Generation of Results Facilitated by ATHENa

vs.

Page 34: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

34

Relative Improvement of Results from Using ATHENa Virtex 5, 256-bit Variants of Hash Functions

0

0.5

1

1.5

2

2.5

Area Thr Thr/Area

Ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools

Page 35: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

35

Relative Improvement of Results from Using ATHENa Virtex 5, 512-bit Variants of Hash Functions

0

0.5

1

1.5

2

2.5

3

Area Thr Thr/Area

Ratios of results obtained using ATHENa suggested options vs. default options of FPGA tools

Page 36: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

36

Most Important Features of ATHENa

•  comprehensive

•  automated •  collaborative

•  practical •  distributed

•  optimized

•  with a single point of contact

With single point of contact:

Page 37: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

37

Our Environment will Serve

•  Researchers – fair, automated, and comprehensive comparison of new algorithms, architectures, and implementations with previously reported work

•  Designers – informed choice of technology (FPGA, ASIC, microprocessor) and a specific device within a given technology

•  Developers and Users of Tools – comprehensive comparison across various tools; optimization methodologies developed and comprehensively tested as a part of this project

•  Standardization Organizations (such as NIST) – evaluation of existing and emerging standards; support of contests for new standards; comprehensive and easy to locate database of results

Page 38: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

ATHENa Server

FPGA Synthesis and Implementation

Result Summary + Database Entries

2 3

HDL + scripts + configuration files

1

Database Entries

ATHENa scripts and configuration files8

Designer

4

HDL + FPGA Tools

User

Database query

Ranking of designs

5 6

Invita4on  to  Use  ATHENa  http://cryptography.gmu.edu/athena

0 Interfaces

+ Testbenches 38  

Page 39: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock
Page 40: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

40

Future Work

•  Additional FPGA Vendors

•  More Efficient and Effective Heuristic Optimization Algorithms •  Support for Linux

•  Graphical User Interface •  Application to Comparison and Optimization of Other Cryptographic Primitives (e.g., public key cryptosystems)

•  Adapting ATHENa to Other Application Domains (Digital Signal Processing, communications, etc.)

Page 41: FPL 2010 GMU ATHENakgaj/publications/conferences/GMU_FPL_2010_slides.pdf25 Optimization Strategy Used for Xilinx Devices 1. Frequency search Search for the highest requested clock

Questions?

Thank you!

41

Questions?

ATHENa: http:/cryptography.gmu.edu/athena