Supercomputing & Multi-core Have I/O Problems
That Compression Can Solve
Samplify Systems, Inc.
160 Saratoga Ave., Suite 150
Santa Clara, CA 95051
www.samplify.com
(888) LESS-BITS
+1 (408) 249-1500
27 Sep 2011
Outline
• Introduction to Samplify Systems
• Samplify Prism Compression
• Prism Results on Integers
• Prism Results on IEEE-754 Floats
• “Good Enough” Results & Uncertainty Quantification
• High-Performance Computing & Multi-core Bottlenecks
• Why Compression Can Help
• Samplify & NCAR Collaboration?
…simply the bits that matter® ©2011 Samplify Systems, Inc.
About Samplify
• Intellectual property company in Santa Clara, CA providing:
  • Intellectual property for leading FPGAs & ASICs
  • Semiconductors
  • Module and system level solutions
• Private company with >$22M raised from VCs & strategics (IDT & Schlumberger)
• Founded in March 2007
• 25 employees
Executive Team:
Al Wegener, Founder & CTO
• Industry-recognized compression expert
• Inventor of Samplify Prism compression
• TI, Graychip, Morphics, Studer ReVox
Tom Sparkman, CEO
• Sales and Marketing Semico Executive
• 19 years Maxim, Motorola
Richard Tobias, VP Engineering
• Engineering Semico Executive
• Toshiba Semi, Pixelworks, White Eagle (Quicksilver)
Allan Evans, VP Marketing
• Marketing & Technology Executive
• Successful exits at Savi (LMCO), Netro (NTRO), Stanford Telecom (Newbridge)
Applications for Samplify Technology
First Markets:
Ultrasound – Higher resolution ultrasound machines, lower power portables, enable U/S “ODM” model in China
CT – Double number of x-ray sensors in existing hardware. Lower cost of data transport and storage
New Markets:
High Speed Imaging – 2x frame rate, resolution
HPC – Supercomputing
Wireless Base Stations – Lowers cost of data transport in wireless infrastructure. Especially important for LTE.
Wireless Repeaters – Dual-band over existing copper infrastructure
Storage – 2x throughput & capacity
Broadcast – Reduce SDI coax links, long-range HDMI over UTP
Automotive – Driver assistance, collision avoidance, etc.
Samplify’s Prism™ Signal Compression
• No other solutions operate as fast as Samplify. We start where they stop.
• No psycho-visual/acoustic tricks. Samplify’s compression free from artifacts.
• Operates in real time. Latency is very low with only a few samples of delay.
• Validated by Experts: Herfkens (Stanford), Senzig (GE), several wireless OEMs
• Samplify holds granted patents on integrating any lossless and lossy compression into data converters (US 7,088,276) and in wireless base stations (US 8,005,152)
[Figure: sample-rate axis from 1 ksample/sec to 40 Gsample/sec. Existing codecs: speech (ADPCM, LPC, Q-CELP) around 10 ksps, audio around 100 ksps, video up to 50 Msps. Samplify spans 1 ksps to 40 Gsps.]
Samplify Prism Eliminates Signal Whitespace
[Figure: time-domain signal plot, sample index 0 to 4000, amplitude -150 to +150. Annotations: 12 bit resolution, 10.5 effective bits.]
• Time domain whitespace: peak to average ratio of signals
• Frequency domain whitespace: oversampling of narrowband signals
• Full resolution not delivered by ADCs and DSP algorithms
• No “a priori” signal information required
Using floating point does not repeal the Nyquist criterion!
Prism Compression Algorithm & Modes
[Block diagram: input samples pass through the Compression Engine (US 5,839,100) and leave as compressed packets. An Adaptation Engine with a bit rate monitor, the Samplify controller, and parameter tracking sits alongside. Labels: RateTrak, OptiBit, VeriBit. Table columns: MODE, CONTROL, RESULTS.]
Modes:
• LOSSLESS
• FIXED RATE
• FIXED QUALITY
Samplify’s Customer Signal Database
3000+ customer signal files; 700+ GB of data, including:
• Medical (CT, ultrasound, MRI, digital x-ray, PET)
• Wireless (GSM, W-CDMA, cdma2000, LTE, WiMax)
• Instrumentation (scopes, waveform generators, SerDes)
• Military/defense (radar, SAR, spectra)
• Automotive (RGB, infrared, ultrasound, radar)
• Geophysical (sonobuoys, oil/gas exploration)
• Video (NTSC, PAL, HD)
• Print and still images (CMYK, YCrCb, RGB, infrared)
• Floating-point data sets (seismic, drug discovery, molecular simulation, astrophysics, weather satellite, fluid dynamics)
Samplify Compression Results (Integers)
Signal type | Sample rate @ sample width | Lossless C.R. | Fixed-rate C.R. | Quality metrics | Typical customers
Wireless baseband (3G, LTE) | 30.72 Msamp/sec @ 16 bits I & Q | 1.2:1 – 1.5:1 | 1.6:1 – 2.3:1 | EVM, PCDE, ACLR | Ericsson, Huawei, ZTE
Wireless RF (3G, LTE) | 600 Msamp/sec @ 16 bits I & Q | 2:1 – 3:1 | 3:1 – 5:1 | EVM, PCDE, ACLR | Ericsson, Huawei, ZTE
Computed tomography | 320,000 chans, 5 ksamp/sec @ 20 bits | 1.6:1 – 2.7:1 | 3:1 – 4.5:1 | Radiologists & SSIM | GE, Philips, Toshiba
Ultrasound (ADC) | 64 - 256 chans, 50 Msamp/sec @ 12 bits | 1.5:1 – 2:1 | 2:1 – 3:1 | Sonographers & SSIM | GE, Siemens, Sonosite
Ultrasound (beamformer) | 4 beams, 12 Msamp/sec @ 18 bits | 2:1 – 3:1 | 3:1 – 4:1 | Sonographers & SSIM | GE, Siemens, Sonosite
Images & video | 60 frames/sec, 6 Msamp/sec @ 8 bits | 1.5:1 – 2:1 | 2:1 – 3:1 | Viewers, PSNR, SSIM | 1000+ frames/sec
Oscilloscope (SerDes & LVDS) | 60 Gsamp/sec @ 8 bits | 1.3:1 – 2:1 | 2:1 – 4:1 | BER, rise/fall time | Agilent, Tektronix
Radar | 3 Gsamp/sec @ 10 bits | 2:1 – 3:1 | 3:1 – 5:1 | pd, pfa | Lockheed, Northrop
Integer Compression: CT Scanners
Example 1:
Compression of CT X-ray Sensor Values (20-bit integers)
20 bits/sample x 3,000 samples/sec per detector
x 912 detectors per row
x 64 rows
= 3.5 Gbps
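The arithmetic behind the 3.5 Gbps figure can be checked with a few lines of Python. This is a minimal sketch; the function name `raw_rate_gbps` is an illustrative assumption, not part of any Samplify software.

```python
def raw_rate_gbps(bits_per_sample, samples_per_sec, channels):
    """Raw (uncompressed) data rate in Gbps, using 1 Gbps = 1e9 bits/sec."""
    return bits_per_sample * samples_per_sec * channels / 1e9

# CT numbers from the slide: 20-bit samples, 3,000 samples/sec per detector,
# 912 detectors per row, 64 rows.
ct_gbps = raw_rate_gbps(20, 3_000, 912 * 64)
print(f"CT raw rate: {ct_gbps:.2f} Gbps")
```

The same helper applies to the wireless and camera examples later in the deck by swapping in their sample widths, rates, and channel counts.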
Integer Compression: CT Scanners
Bottleneck #1: slip ring
Bottleneck #2: disk array
[Figure: CT scanner diagram showing the x-ray source, patient, and x-ray sensors, with a plot of x-ray count (10^3 to 10^5) versus sensor number (1 to 1000)]
Lossy Compression Methodology
[Flow diagram: CT projection data files are reconstructed into images twice. The original projection data goes through image reconstruction to give the original image. The “samplified” (compressed + decompressed) projection data goes through the same image reconstruction to give the “samplified” image. Compression ratios: 3:1, 4:1, etc. The two image panels, A and B, have axes from 0 to 500.]
Image Pair (SSIM_min = 0.9307)
Success: 3:1 Compression for CT
Of 419 image pairs, Dr. Herfkens correctly identified 17 “samplified” images:
Radiologist judgment | Number of images | Pct of images
“Left & right images look identical” | 402 of 419 | 95.9%
“Few minor streaks” | 1 of 419 | 0.2%
“Streaks in soft tissue” | 16 of 419 | 3.9%
…but no effect on the radiologist’s clinical diagnosis using images created from “samplified” x-rays!
Integer Compression: 4G Wireless
Example 2:
Compression of 4G Wireless Baseband Signals
16 bits/sample → 32 bits per (I, Q) sample pair
x 30.72 Msamples/sec per antenna-carrier
x 12 antenna-carriers per fiber-optic link
= 11.8 Gbps
LTE Requires Distributed Base Stations
• LTE requires up to 10 Gbps CPRI per sector
• Remote radio units required for macro-cell deployments
• To maintain coverage, LTE radio units are deployed over metro fiber
• MIMO technology for 4G makes passive antennas no longer feasible
• CPRI incompatible with SONET/SDH → dark fiber required: DWDM/CWDM/PON
• Each LTE RRU requires 8 wavelengths across DWDM (6 for CWDM)
• DWDM can support only 20 LTE RRUs; CWDM only 2
→ 10 Gbps CPRI links very expensive!
→ LTE fiber optic CAPEX & OPEX up to 12x greater than 3G!
[Diagram: an LTE BBU connected over DWDM/CWDM/PON, up to 10 km, to LTE RRUs and a hybrid 3G/4G RRU, one per sector]
LTE Requires Distributed Base Stations
• Samplify Prism IQ eliminates 10 Gbps CPRI links, saving CAPEX
• Samplify Prism IQ reduces OPEX of DWDM backhaul by 75%
• Quadruple number of LTE RRUs deployed across dark fiber
→ LTE fiber optic CAPEX & OPEX up to 12x greater than 3G!
[Diagram: an LTE BBU connected over DWDM, up to 10 km, to LTE RRUs and a hybrid 3G/4G RRU]
Success: Save ~$1500 per 4G CPRI Link
Component | No Compression | Compression
Fiber Optic Line Rate | 9.8 Gbps | 3.027 Gbps
Radio Head FPGA | Stratix IV GX | Cyclone IV GX
FPGA Price (1K, 2009) | $560.00 | $65.00
Fiber Optic Transceivers | $590.00 | $100.00
Baseband FPGA | Stratix IV GX | Cyclone IV GX
BB FPGA Price (1K, 2009) | $560.00 | $65.00
Total | $1,710.00 | $230.00
Cost Savings per Sector | $1,480.00 |
Callouts:
• 2 fibers at 6.144 Gbps required without compression
• 4x 6.144 Gbps SFP fiber optic modules
• Installation cost of 2nd fiber optic cable (150 ft)
→ SAM2308 enables deployment of LTE-capable RRUs today with single fiber optic cable
→ No tower climbing required to install second fiber optic cable to upgrade to LTE
Compression Saves Mobile Industry $13.5B for LTE Deployment
• Industry expects 1M LTE base stations to be deployed worldwide per year
• 3 sectors/CPRI links per base station
• LTE peak deployment years 2012-2014
→ Compression saves $13.5B over the three peak deployment years
Quantity | Value
# LTE base stations deployed per year | 1M
Number of sectors/CPRI links per base station | 3
Number of years of peak deployment | 3
Number of LTE CPRI links | 9M
Cost savings per link | $1,500
Total savings | $13.5B
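The savings estimate is a straight product of the slide's inputs. The sketch below reproduces it; the variable names are illustrative, and the inputs are the deck's own market assumptions rather than measured figures.

```python
# Inputs taken from the slide (market assumptions, not measurements).
base_stations_per_year = 1_000_000
links_per_base_station = 3        # sectors / CPRI links per base station
peak_years = 3                    # 2012-2014
savings_per_link_usd = 1_500

total_links = base_stations_per_year * links_per_base_station * peak_years
total_savings_usd = total_links * savings_per_link_usd
savings_per_year_usd = total_savings_usd / peak_years

print(f"CPRI links: {total_links / 1e6:.0f}M")
print(f"Total savings: ${total_savings_usd / 1e9:.1f}B")
print(f"Savings per peak year: ${savings_per_year_usd / 1e9:.1f}B")
```

Under these inputs the $13.5B total is spread across the three peak years, which works out to $4.5B per year.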
Virtually Lossless at 7.5 Effective Bits (2:1 compression)
Configuration:
• TD-LTE Downlink
• 20 MHz BW
• E-TM 3.1 per 3GPP TS 36.141
Results:
• EVM = 0.55% rms
→ Virtually lossless: Equivalent to Agilent test equipment
4:1 Compression for LTE (Downlink)
[Figure: EVM (%) versus effective number of bits, with 3 to 10 effective bits on the horizontal axis and 0 to 8% EVM on the vertical axis]
• No compression = 15 bits
• EVM limit for LTE Downlink at 64 QAM is 8%
• Prism IQ achieves 3.75 effective bits at 8% EVM = 4:1 compression
• At 7.5 effective bits (2:1 compression), EVM performance is equivalent to Agilent test equipment
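The compression ratios quoted on this slide follow from dividing the uncompressed sample width by the effective bits delivered. A minimal sketch, assuming that simple relationship (the helper name is hypothetical):

```python
def compression_ratio(uncompressed_bits, effective_bits):
    """Ratio of uncompressed bits per sample to bits per sample after compression."""
    return uncompressed_bits / effective_bits

UNCOMPRESSED_BITS = 15  # "No compression = 15 bits" on the slide

for effective_bits in (7.5, 3.75):
    ratio = compression_ratio(UNCOMPRESSED_BITS, effective_bits)
    print(f"{effective_bits} effective bits -> {ratio:.0f}:1")
```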
Integer Compression: Imaging
Example 3:
Compression for 40 Mpixel and 2k frames/sec Cameras
16 bits/pixel x 40 Mpixel/frame x 30 fps = 19 Gbps
16 bits/pixel x 1 Mpixel/frame x 2k fps = 32 Gbps
Prism Lossless Compression
• Lossless means bit-exact replica of original
• Samplify SignalZIP lossless compression achieved minimum 1.76:1 compression
• Algorithm operates in real time on FPGA
• Switch from lossless to lossy with a register setting
[Figure: four example images with lossless compression ratios 2.09:1, 1.83:1, 1.90:1, and 1.76:1]
9/28/2011 V1.1
Prism Fixed-Rate Compression
• Fixed rate provides high quality compression at a given rate
• Minimal image degradation between different steps of compression
• Algorithm operates in real time on FPGA
• Switch from lossless to lossy with a register setting
[Figure: original image alongside fixed-rate results at 2.65:1, 3.15:1, and 3.60:1]
Infrared Imaging
Across 40 infrared images, Prism HD achieved ~4:1 lossless (12 grayscale bits per pixel)
Bayer Matrix Image Results
File name | File size (bytes) | CR lossless | SSIM @ 2.0:1 | SSIM @ 2.5:1 | SSIM @ 3.0:1 | SSIM @ 3.5:1 | SSIM @ 4.0:1
Cam1-b.bin | 3956064 | 1.70 | 0.9968 | 0.9887 | 0.9760 | 0.9598 | 0.9413
Cam1-g1.bin | 3956064 | 1.62 | 0.9953 | 0.9858 | 0.9690 | 0.9473 | 0.9245
Cam1-g2.bin | 3956064 | 1.62 | 0.9954 | 0.9860 | 0.9690 | 0.9482 | 0.9248
Cam1-r.bin | 3956064 | 1.55 | 0.9951 | 0.9811 | 0.9596 | 0.9279 | 0.9127
Cam2-b.bin | 3956064 | 2.12 | 1.0000 | 0.9946 | 0.9919 | 0.9853 | 0.9775
Cam2-g1.bin | 3956064 | 1.90 | 0.9980 | 0.9929 | 0.9837 | 0.9699 | 0.9566
Cam2-g2.bin | 3956064 | 1.90 | 0.9979 | 0.9928 | 0.9842 | 0.9696 | 0.9590
Cam2-r.bin | 3956064 | 1.84 | 0.9967 | 0.9927 | 0.9827 | 0.9669 | 0.9469
Cam3-b.bin | 3956064 | 1.73 | 0.9960 | 0.9894 | 0.9775 | 0.9606 | 0.9449
Cam3-g1.bin | 3956064 | 1.65 | 0.9955 | 0.9858 | 0.9692 | 0.9480 | 0.9275
Cam3-g2.bin | 3956064 | 1.65 | 0.9958 | 0.9866 | 0.9688 | 0.9496 | 0.9282
Cam3-r.bin | 3956064 | 1.61 | 0.9950 | 0.9840 | 0.9651 | 0.9381 | 0.9180
Example: HD Video @ 2.5:1 compression
{-2, +5} {-3, +3} {-3, +6}
Compression of Floats: Prism FP*
Compression for High-Performance Computing (HPC)
• Compressing Integers and Floating-Pt Values
• For HPC Scientific, Technical & Multi-core Apps
* FP = floating point
Prism FP Compression for HPC
Prism FP features:
• User-selectable lossless & lossy modes
• Compresses integers and floating-point values
• Low complexity (“fits under a bond pad or two”)
• Low latency (< 6 clks to comp or decomp 4 numbers)
• Trade higher latency for better compression
• Scalable to PCIe Gen3, DDR3, & optical rates
• Targeted at HPC applications:
>> Prism FP solves multi-core I/O problems <<
Floating-point Basics
The ONLY Standard That Matters:
IEEE-754-2008
[Figure: IEEE-754 floating-point bit layout, with the “mantissa” field labeled]
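For readers who want to see the IEEE-754 fields directly, the sketch below splits a single-precision (binary32) value into its sign bit, unbiased exponent, and 23 stored mantissa bits using Python's standard `struct` module. The function name `float32_fields` is an illustrative assumption.

```python
import struct

def float32_fields(value):
    """Return (sign, unbiased exponent, 23-bit mantissa) of value stored as binary32.

    The unbiased exponent is only meaningful for normal numbers; zeros,
    denormals, infinities, and NaNs use the reserved exponent codes 0 and 255.
    """
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    sign = bits >> 31
    biased_exponent = (bits >> 23) & 0xFF
    mantissa = bits & 0x7FFFFF
    return sign, biased_exponent - 127, mantissa

print(float32_fields(1.0))    # sign 0, exponent 0, mantissa 0
print(float32_fields(-6.5))   # -6.5 = -1.625 x 2^2
```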
Prism FP Concept
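Using floating-point representation:
• doesn’t repeal the Nyquist criterion
• doesn’t reduce dynamic range requirements!
[Figure: 32-bit float dynamic range, shown in base 10 and base 2. The top of the range is exponent +127 (max exp), about 10^+38, with ± Inf and NaN above it. The bottom is exponent -128 (min exp), about 10^-38, with denorms and ± zero below it. 10^0 = 1.0000 sits at exponent 0. Each value carries a 23-bit mantissa window positioned by its exponent: exp = 5 covers bits {5 .. -17}, exp = -1 covers bits {-1 .. -23}, exp = -7 covers bits {-7 .. -29}. Example exponent sequence: 5 5 4 2 -1 -2 -3 -5 -5 -7 -9 … The bottom of the window is labeled as the equivalent “noise floor”.]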
Prism FP Results on Nvidia CUDA SDK
Signal & datatype | Prism real-time compression rate | Prism lossless comp ratio | Prism lossy comp ratio | Quality metrics
3G & 4G wireless, 16-bit integers | 3 to 10 Gbps | 1.2:1 – 1.5:1 | 1.6:1 – 2.3:1 | EVM, PCDE, ACLR
Computed tomography, 20-bit integers | 20 to 80 Gbps | 1.6:1 – 2.7:1 | 3:1 – 4.5:1 | Radiologists & SSIM
Medical ultrasound, 12-bit integers | 50 to 300 Gbps | 2:1 – 3:1 | 3:1 – 4:1 | Sonographers & SSIM
Image sensors, 12-bit integers | 0.6 to 10 Gbps | 1.5:1 – 2:1 | 2:1 – 3:1 | Viewers, PSNR, SSIM
Oscilloscopes, 8-bit integers | 100 to 600 Gbps | 1.3:1 – 2:1 | 2:1 – 4:1 | BER, rise/fall time
k-means clustering, 32-bit floats | 300 Mfloat/sec | 1.4:1 – 2:1 | 2:1 – 4.5:1 | SSIM, % error
Black-Scholes financial, 32-bit floats | 100 Mfloat/sec | 1.6:1 – 2.2:1 | 3:1 – 4:1 | % error of mean and std
3D wireframe model, 32-bit floats | 60 Mfloat/sec | 1.9:1 – 2.6:1 | 2:1 – 3.5:1 | Visual inspection, SSIM
Example 1
Example 2
k-means Clustering (from CUDA SDK)
At 2.5:1 compression, the resulting oval measurements:
• location (xi, yi) and
• axis length (Lx, Ly)
differ in the 6th significant digit, e.g.: 3.55873 vs. 3.55875
Graphics: FP Wireframe & Textures
original decompressed
2.75:1 compression, SSIM = 0.99
Geophysical Exploration Data Bottlenecks From Acquisition to Data Processing
1. Seismic sensor acquisition
2. Remote data transmission
3. Data storage & intermediate results
4. Computation
Formats:
• LIS, DLIS
• SEG-D, -Y
• WellLog ML
→ Data sets are petabytes in size!
Prism FP Results for HPC Seismic
Signal type | Signal description | Lossy comp ratio & quality metric or resolution
Images | Downhole imaging | 20:1 to 60:1 @ SSIM > 0.99
Acoustic traces | 5 acoustic files | 2:1 to 4:1 @ 80+ dB
Acoustic archives | Trace headers & signals | 2:1 @ 99.1 dB; 3:1 @ 69.6 dB
Earth models | Delta, epsilon, velocity | 2:1 @ 137 dB; 3:1 @ 70 dB
Forward path RTM | Reverse Time Migration intermediate signal | 3:1 to 4:1 @ 55 - 75 dB
Noise-reduced acoustic traces | Reverse Time Migration input signal | 2:1 to 4:1 @ 45 - 60 dB
Pressure (Type 1) | 4 pressure waveforms | 2.66:1 to 3.47:1 @ 0.01 psi; 5.24:1 to 6.57:1 @ 0.1 psi
Pressure (Type 2) | 1 pressure waveform | 4.33:1 @ 0.01 psi; 6.2:1 @ 0.1 psi
Temperature | 4 temperature waveforms | 15.9:1 to 19.3:1 @ 0.01 °C; 21.9:1 to 22.6:1 @ 0.1 °C
Objective Metrics of Signal Quality:
How to Quantify “Good Enough” Results
Prism Compression’s Effects on Results ?
Q: How does compression affect users’ signal quality?
A: IT’S COMPLICATED – JUST TRY IT!
• Medical imaging:
  • computed tomography (CT): SSIM + radiologists’ assessment
  • ultrasound: working with 10+ Asian and 2 US ultrasound mfrs (sonographer assessment)
• Wireless:
  • Measure EVM, ACLR, spectral emissions masks, PCDE
• Seismic:
  • Ask geophysicists to assess the quality of 3D Earth images
  • SSIM on 3D Earth “slices”
  • Try on both input signals (acoustic traces) and intermediate signals
Simple Signal Quality Metrics
x(i) = original signal
y(i) = decompressed signal
d(i) = x(i) – y(i) <<< difference signal
Some representative signal quality metrics include:
1. mean(d): error mean
2. std(d): error standard deviation
3. max(abs(d)): worst-case error
4. SNR(x) – SNR(y): decrease in SNR
5. 100 * rms(d) / rms(x): percent error
6. FFT(y) – FFT(x): spectral effects
CAVEAT: These quality metrics are easy to measure, BUT they don’t tell you how the final results are affected!
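Several of these metrics take only a few lines to compute. The sketch below covers metrics 1, 2, 3, and 5 in plain Python; it is an illustration and not Samplify's tooling. The function name and the choice of population (divide-by-N) standard deviation are assumptions, and the SNR and FFT metrics are left out because they need a noise model and a spectral library.

```python
import math

def rms(values):
    """Root-mean-square of a sequence of numbers."""
    return math.sqrt(sum(v * v for v in values) / len(values))

def quality_metrics(x, y):
    """Compare original x with decompressed y via the difference d = x - y."""
    d = [xi - yi for xi, yi in zip(x, y)]
    mean_d = sum(d) / len(d)
    # Population standard deviation (divide by N), an assumption of this sketch.
    std_d = math.sqrt(sum((di - mean_d) ** 2 for di in d) / len(d))
    return {
        "mean": mean_d,
        "std": std_d,
        "max_abs": max(abs(di) for di in d),
        "percent_error": 100.0 * rms(d) / rms(x),
    }

original = [1.0, 2.0, 3.0, 4.0]
decompressed = [1.0, 2.0, 3.0, 3.0]
metrics = quality_metrics(original, decompressed)
print(metrics)
```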
Image Quality Metrics
• Difference image: $D_{i,j} = O_{i,j} - P_{i,j}$
• HU diffs: $\min(D_{i,j})$ and $\max(D_{i,j})$, vs.
• Percentile-based HU diff thresholds
• Local contrast ratio: $\mathrm{Contrast}_{\mathrm{RMS}} = \sqrt{\operatorname{mean}\left(\sum (O_{i,j} - \bar{O})^2\right)}$
• Peak signal-to-noise ratio (PSNR) << not useful
• Just-noticeable differences (JND) << not available
• Masking effects (bone, air, etc.)
• Structural Similarity (SSIM) << next page
Structural Similarity Metric (SSIM)
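$$\mathrm{SSIM}(O, P) = l(O, P) \cdot c(O, P) \cdot s(O, P) = \left(\frac{2\mu_O \mu_P}{\mu_O^2 + \mu_P^2}\right) \cdot \left(\frac{2\sigma_O \sigma_P}{\sigma_O^2 + \sigma_P^2}\right) \cdot \left(\frac{\sigma_{OP}}{\sigma_O \sigma_P}\right)$$
• $l(O, P)$: Brightness ($\mu$)
• $c(O, P)$: Contrast ($\sigma$)
• $s(O, P)$: “Structure” (cross-correlation)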
Ref: Wang & Bovik, IEEE Signal Processing Magazine, Jan 2009
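The three-factor form on this slide can be evaluated directly. The sketch below computes it once over whole sequences of pixel values. Practical SSIM implementations instead average over local windows and add small stabilizing constants to each factor; both are omitted here, so this version is undefined for constant or zero-mean inputs. The function name `global_ssim` is an illustrative assumption.

```python
import math

def global_ssim(o, p):
    """Single-window SSIM of two equal-length pixel sequences, as on the slide.

    No stabilizing constants are used, so constant or zero-mean inputs
    cause a division by zero.
    """
    n = len(o)
    mu_o, mu_p = sum(o) / n, sum(p) / n
    var_o = sum((a - mu_o) ** 2 for a in o) / n
    var_p = sum((b - mu_p) ** 2 for b in p) / n
    cov = sum((a - mu_o) * (b - mu_p) for a, b in zip(o, p)) / n
    sd_o, sd_p = math.sqrt(var_o), math.sqrt(var_p)

    brightness = 2 * mu_o * mu_p / (mu_o ** 2 + mu_p ** 2)
    contrast = 2 * sd_o * sd_p / (var_o + var_p)
    structure = cov / (sd_o * sd_p)
    return brightness * contrast * structure

pixels = [1.0, 2.0, 3.0, 4.0]
print(global_ssim(pixels, pixels))                     # identical inputs give ~1.0
print(global_ssim(pixels, [2 * v for v in pixels]))    # a gain change lowers the score
```

In the second call the structure term stays at 1 because the two inputs are perfectly correlated; only the brightness and contrast terms drop.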
Uncertainty Quantification (1 of 2)
In general, uncertainty quantification has to incorporate research and development efforts in three key, irreducible technical areas:
(1) Characterization of uncertainty in system parameters and the external environment;
(2) Propagation of this uncertainty through large computational engineering models; and
(3) Verification and validation of the computational models and incorporating the uncertainty of the models themselves into the global uncertainty assessment.
Uncertainty Quantification (2 of 2)
“What a Long, Strange Trip It’s Been”
“Multi-core Needs Compression” – REALLY??
What is Numerical Data? Ints & Floats
[Figure 1: panels a), b), c), labeled “Prior Art”]
HPC is “Just” Numerical Processing
[Block diagram: NUMERICAL INPUT (INTS / FLOATS) → MULTI-CORE NUMERICAL PROCESSOR → NUMERICAL OUTPUT (INTS / FLOATS), with INTERMEDIATE RESULTS (INTS / FLOATS) attached to the processor]
Two kinds of HPC algs:
1. compute-bound
2. I/O-bound
Samplify accelerates I/O-bound applications
I/O Is A Real HPC & Multi-core Problem
GPU and multi-core trends:
• Cores scale (Moore’s Law), but I/O (pins, clks, mem speed) doesn’t
• Core utilization (% busy) keeps decreasing (e.g. < 20% in seismic)
• Nvidia GPUs with 16 lanes of PCIe Gen2 (8 GB/sec)
  • In 2007: 192 SMPs (GeForce) → 41 MB/sec per core
  • In 2011: 512 SMPs (Fermi) → 15 MB/sec per core
• Intel x86
  • In 2006: 500 MB/sec per core
  • In 2011: 2 GB/sec for 4 cores → still 500 MB/sec per core
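The per-core figures above are the shared bus bandwidth divided by the core count. A minimal sketch of that division, assuming 1 GB = 1000 MB (the helper name is hypothetical):

```python
def mb_per_sec_per_core(bus_gb_per_sec, cores):
    """Share of a bus's bandwidth available to each core, in MB/sec (1 GB = 1000 MB)."""
    return bus_gb_per_sec * 1000 / cores

# PCIe Gen2 x16 at 8 GB/sec, shared by the core counts quoted on the slide.
print(f"2007, 192 cores: {mb_per_sec_per_core(8, 192):.1f} MB/sec per core")
print(f"2011, 512 cores: {mb_per_sec_per_core(8, 512):.1f} MB/sec per core")
# Intel x86 in 2011: 2 GB/sec across 4 cores.
print(f"x86, 4 cores:    {mb_per_sec_per_core(2, 4):.1f} MB/sec per core")
```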
Int’l Supercomputing & Hot Chips Conferences:
• “Exascale is I/O-limited, while multi-core is easy” Jeffrey Vetter, DoE
• “Exascale is power-limited (20 MW/Exaflop)” Jack Dongarra, DoE
• “Communication-avoiding algorithms” Jim Demmel, UC Berkeley
Why Lossy Comp is OK for HPC (1 of 2)
1. The real world is inherently noisy:
• Real-world (vs. idealized) measurements contain noise
• Signal-to-noise ratio (SNR) measures what part of measurements are “useful” (ADC analogy: resolution vs. ENOB)
• “Simulated real-world” computations add noise on purpose (Monte Carlo)
2. The real world is inherently lowpass:
• To a DSP guy, 2D Nyquist rate → choosing grid/mesh size for HPC
• Time series of adjacent HPC grid/mesh points are correlated
• Distance and time attenuate signals, often to r^2 or r^3 (e.g. SerDes on backplanes, light in space, audio signals, etc.)
• 2 kinds of HPC problems:
  • those that can be validated against the real world, and
  • those that can’t (“theoretical” HPC problems…)
Why Lossy Comp is OK for HPC (2 of 2)
3. Application dyn range vs. Computational dyn range:
• The required dynamic range of HPC signals (input, intermediate, output) is typically lower than the dynamic range provided by 32/64-bit computational float engines
• 32-bit and 64-bit floats are arbitrary:
  • Why not 21-bit or 16-bit mantissas?
  • Why 8-bit and 11-bit exponents?
  • Why not 5-bit or 16-bit exponents…
Simple goal: “good enough” answers … sooner and faster!
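To get a feel for how much precision a shorter mantissa costs, the sketch below zeroes the low-order mantissa bits of a single-precision value and reports the relative error. This is plain truncation for illustration only and is not the Prism FP algorithm; the function name `truncate_mantissa` is an assumption.

```python
import struct

def truncate_mantissa(value, kept_bits):
    """Store value as binary32, then zero all but the top kept_bits mantissa bits."""
    if not 0 <= kept_bits <= 23:
        raise ValueError("kept_bits must be between 0 and 23")
    (bits,) = struct.unpack(">I", struct.pack(">f", value))
    # Clear the (23 - kept_bits) least significant mantissa bits.
    mask = 0xFFFFFFFF & ~((1 << (23 - kept_bits)) - 1)
    (truncated,) = struct.unpack(">f", struct.pack(">I", bits & mask))
    return truncated

x = 3.14159
for kept in (23, 16, 10):
    t = truncate_mantissa(x, kept)
    print(f"{kept:2d} mantissa bits: {t!r}  relative error {abs(x - t) / x:.2e}")
```

For normal numbers, truncating to k mantissa bits keeps the relative error below 2^-k, so a 16-bit mantissa still resolves roughly 5 significant decimal digits.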
Future: Prism 4 for Multi-Core Engines
[Block diagram: six x86 cores (Core 1 to Core 6; 3 GHz cores, 1200 – 2000 pins) on a QPI or HT ring (≤ 200 GB/sec, 256-bit bus), with a front-side bus to DDRx DIMM #1 and DIMM #2 (8 - 18 GB/sec) and a PCIe Gen2 bus (8 GB/sec). Compress (C) and decompress (D) blocks are marked on the cores and on the off-chip interfaces.]
x86 bottlenecks:
o DDR3 (off-chip RAM)
o PCIe (off-chip I/O)
o Inter-core communications
o QPI and HyperTransport
GPU bottlenecks:
o On-chip “shared RAM”
o GDDR5 (video RAM)
o PCIe (off-chip I/O)
Network bottlenecks:
o Infiniband
o 10 GbE, 40 GbE
o MPI
o RapidIO
How to Start? Send Samplify Signals, or Use Prism Software
Usual Samplify model: customers send Samplify (Al) their signals (> 700 GB to date)
Option 1: Existing Prism 3 (ints) and Prism FP (floats) SW:
• Prism 3 for Windows and Matlab
• Prism FP for Linux, Windows, and Matlab
Option 2: Easy Ports:
• fwrite_c, fread_c (for file I/O)
• memcpy_c (for memory moves)
Option 3: More work, but possible:
• MPI_SEND_C, MPI_RECV_C (MPI)
• What else?
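The “easy port” idea is that a compressing wrapper can replace a plain write or read call with little change to the calling code. The sketch below illustrates that shape only. The slide names `fwrite_c` and `fread_c` but does not give their signatures, so the signatures here are hypothetical, and Python's standard `zlib` stands in for Prism, whose API is not shown in this deck.

```python
import io
import struct
import zlib

def fwrite_c(data, stream):
    """Hypothetical stand-in: write one length-prefixed, zlib-compressed block."""
    packed = zlib.compress(data)
    stream.write(struct.pack("<I", len(packed)))  # 4-byte block length header
    stream.write(packed)
    return len(data)

def fread_c(stream):
    """Hypothetical stand-in: read one block written by fwrite_c and decompress it."""
    header = stream.read(4)
    if len(header) < 4:
        return b""  # end of stream
    (size,) = struct.unpack("<I", header)
    return zlib.decompress(stream.read(size))

# Round trip 1000 single-precision floats through an in-memory "file".
payload = struct.pack("<1000f", *([1.5] * 1000))
stream = io.BytesIO()
fwrite_c(payload, stream)
stream.seek(0)
restored = fread_c(stream)
print(len(payload), "bytes in,", len(stream.getvalue()), "bytes stored")
```

The caller still hands over a buffer and gets a buffer back; only the bytes on the wire or on disk change, which is what makes this kind of port low effort.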
Proposed Collaboration with NCAR
• Try Prism compression (Linux, Windows, Matlab)
• Quantify your application’s BW and/or storage bottlenecks
• Quantify your application’s sensitivity to input variations
• Quantify your application’s “good enough” results level
or
• Send Samplify your signals (in, intermediate, out) & we’ll do the work
Goal: publish collaboration results in 2012
Contact: Al [email protected]
408-221-1191