High Performance Computing in HFSS
Li Hao (李皓)
1. High-Performance Computing: Clusters, Grids, and Clouds
2. High-Performance Computing in HFSS
3. Examples and Discussion
© 2014 ANSYS Inc.
HPC TOP 1 systems by year:
2000        USA    IBM ASCI White       7.226 TFLOPS   Lawrence Livermore National Laboratory
2002        Japan  NEC Earth Simulator  35.86 TFLOPS   Earth Simulator Center, Japan
2004        USA    IBM Blue Gene/L      70.72 TFLOPS   US DOE / IBM
2005        USA                         136.8 TFLOPS   US DOE / NNSA / LLNL
            USA                         280.6 TFLOPS
2007        USA                         478.2 TFLOPS
2008        USA    IBM Roadrunner       1.026 PFLOPS   Los Alamos National Laboratory, New Mexico
            USA                         1.105 PFLOPS
2009        USA    ORNL Jaguar          1.759 PFLOPS   Oak Ridge National Laboratory
2010        China  Tianhe-1A            2.566 PFLOPS   National Supercomputing Center, Tianjin
2011        Japan  Fujitsu K computer   10.51 PFLOPS   Fujitsu, Japan
2012        USA    IBM Blue Gene/Q      16.32 PFLOPS   Lawrence Livermore National Laboratory
2012-11-13  USA    ORNL Titan           17.59 PFLOPS   Oak Ridge National Laboratory
HPC Hardware Architecture

Parallel programming standards by category:
• Data parallel: HPF, Fortran 90 (for SMP, DSM)
• Shared memory: OpenMP (for SMP, DSM)
• Message passing: MPI, PVM (for all parallel computers)
A computer cluster ("cluster" for short) is a computer system in which a group of loosely integrated computers, connected by software and/or hardware, cooperate closely to carry out computational work. In a sense, they can be viewed as a single computer. The individual computers in a cluster are usually called nodes and are typically connected by a local area network, though other interconnects are possible. Clusters are generally used to improve the computing speed and/or reliability available from a single computer, and usually offer a much better price/performance ratio than a single computer such as a workstation or supercomputer.

Clusters are either homogeneous or heterogeneous, depending on whether the member computers share the same architecture. By function and structure, clusters can be divided into the following categories:
• High-availability (HA) clusters
• Load balancing clusters
• High-performance computing (HPC) clusters
• Grid computing

High-performance computing clusters. An HPC cluster increases computing power by distributing computational tasks across different compute nodes, and is therefore used mainly in scientific computing. A popular HPC configuration runs Linux and other free software for parallel computation; this configuration is commonly called a Beowulf cluster. Such clusters typically run programs written to exploit the cluster's parallelism, usually built on special runtime libraries such as MPI libraries designed for scientific computing.

HPC clusters are especially well suited to jobs with heavy inter-node communication, for example when the intermediate results of one node affect the computations of other nodes.

Grid computing. Grid computing, or grid clusters, is a technology closely related to cluster computing. The main difference is that a grid connects a group of computers that do not necessarily trust each other, so it operates more like a public computing utility than a single computer; grids also typically support a more heterogeneous collection of computers than clusters do.

Grid computing is optimized for workloads made up of many independent jobs that do not need to share data during the computation. The grid mainly manages the assignment of jobs to computers that execute their work independently. Resources such as storage may be shared by all nodes, but the intermediate results of one job do not affect the progress of jobs on other grid nodes.

Cloud computing. The basic idea of the cloud is to automatically split a large computational task over the network into many smaller subtasks, have a large system of servers search, compute, and analyze them, and return the results to the user. With this technology, a remote service provider can process tens of millions or even hundreds of millions of pieces of information within seconds, delivering network services as powerful as a "supercomputer", capable of advanced computations such as analyzing DNA structure, sequencing gene maps, and studying cancer cells.
Differences between a cluster and distributed computing:
• Cluster: the same software runs on a group of computers; from the end user's point of view the cluster presents a single computing-resource interface, with strong organization.
• Distributed: a collection of interconnected independent computers; each node has its own storage, I/O, and operating system, and each computer handles different work, with weaker organization.
Simply put, distributed computing improves efficiency by shortening the execution time of a single task, while a cluster improves efficiency by increasing the number of tasks executed per unit of time.
High Performance Computing Cluster (HPC Cluster) refers to cluster technology whose purpose is to increase scientific computing capability. An HPC cluster is one way of implementing parallel processing: splitting an application into parts that can execute in parallel and assigning them to multiple processors. Many current computer systems support the SMP (symmetric multiprocessing) architecture and can parallelize work through process scheduling, but SMP scalability is quite limited; on current Intel architectures, for example, it extends to at most 8 CPUs. To satisfy "computation-hungry" scientific workloads, parallel computing clusters were introduced to the computing world; the famous "Deep Blue" computer is one implementation of a parallel computing cluster.

Because parallel computing clusters built on inexpensive, commodity platforms (such as Intel + Linux) offer an excellent price/performance ratio, this solution has become increasingly popular in recent years. For example, the 1024-node Linux HPC cluster of IBM xSeries servers used by Shell is one of the most computationally powerful machines in the world.

An HPC cluster presents the user with a single-computer interface. A front-end computer handles user interaction; after accepting a submitted job, it distributes the task to the compute nodes through a scheduler program, and when the run finishes, the results are returned to the user through the front end. Inter-process communication (IPC) during the run takes place over a dedicated network.
The core of high-performance computing is parallel and distributed processing.
1. High-Performance Computing: Clusters, Grids, and Clouds
2. High-Performance Computing in HFSS
 1. MP: multi-processing
 2. RSM: remote analysis
 3. Distributed Analysis
  (1) DSO: Distributed Solve Option
  (2) DDM: Domain Decomposition Method
  (3) MPI: Message Passing Interface
 4. FE-BI
3. Examples and Discussion
Multi-Processing (MP): faster
• The MP option is used for solving models on a single machine with multiple processors/cores that share RAM.
• Increases throughput by speeding up turn-around time for individual simulations.

Domain Decomposition (DDM): bigger (requires an HPC license)
• A distributed-memory parallel solver technique that distributes mesh sub-domains to a network of processors.
• This method is a hybrid iterative and direct solver technique that significantly increases simulation capacity by distributing the RAM usage across multiple computers.
• Enables the solution of higher-fidelity and larger models.

Spectral Domain Decomposition (SDM): even faster
• SDM enables frequency sweeps to be performed in parallel on distributed hardware.
• Increases throughput by speeding up turn-around time for individual simulations.
Ansys HPC license features:
• Domain Decomposition (HPC license feature): mesh-based FEM and hybrid solutions, matrix-based IE solutions, spectral frequency sweep
• Multi-Processing (HPC or MP license feature): HFSS FEM and IE solutions
• Distributed Solver Technology (DSO license feature): HFSS hybrid solutions; HFSS and HFSS-IE parametric/frequency sweeps; HFSS-Transient distributed excitations

Multi-Processing
• Takes advantage of multiple cores/CPUs on a single workstation to increase simulation throughput

Domain Decomposition
• Ability to partition a single problem into smaller sub-domains
 – Enables the problem to be distributed across networked computer resources
 – Network resources applied to the problem increase capacity and simulation throughput

Distributed Solver Technology
• Efficient solution of parametric or frequency sweep points by simultaneous solution of variations across networked computer resources
• Parallel solution of FEM and IE solution domains when a hybrid solve is performed
• Parallel solution of each excitation when a transient HFSS solution is applied
MP: not supported in the new release.
MP is mainly used in the iterative solver and is tied to the excitations: each excitation corresponds to one iterative process running on one processor. If there is only one excitation, hf3d uses only one processor, even when the local machine has several.
Multi-processing settings
• Number of Processors:
 – Controls the number of CPUs used when the solution is performed on the local computer
• Number of Processors, Distributed:
 – Controls the number of cores/CPUs used when the solution is performed on remote machine(s)
• HPC Licensing Options
 – Multi-processing capability can be enabled with two different licensing options: the MP and HPC licensing features
 – To use an MP license for multi-processing on a single machine, leave this option unchecked
 – To use an HPC license for multi-processing on a single machine, check this option
• Post Processing Options, Number of Processors:
 – Post-processing of solution data can also take advantage of multiple cores/CPUs
(Tools > Options > HFSS Options)
Multi-Processing
[Figure: solver time factor vs. RAM (GB) for v12 with MP1, MP2, MP4, and MP8.]
• Multi-Processing
– Single workstation solution to increase simulation throughput
– Takes advantage of multi-core and/or multi-processor computing resources
• FEM solution
– Direct Matrix Solver
• Takes advantage of multi-core and/or multi-processor computing resources
– Iterative Solver
– Parallelized matrix pre-conditioner
– Parallelized excitations
An MP license is needed.
HPC
The HPC License Type determines the type and number of licenses that will
be checked out for a given number of cores. For the HPC type, one license
will be checked out for each core in use. So a simulation with twenty cores
would require twenty HPC licenses.
HPC Pack
For the HPC Pack type, a single pack enables eight cores, and each
additional pack enables four times as many cores. So a simulation with
twenty cores would require two “HPC Pack” licenses, enabling up to 8x4,
or 32, cores.
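The two licensing schemes can be compared with a short calculation. This is just a sketch of the counting rules described above (one license per core for HPC; 8 cores for the first pack, 4x more for each additional pack):

```python
def hpc_licenses(cores):
    """HPC type: one license is checked out per core in use."""
    return cores

def pack_capacity(packs):
    """HPC Pack type: one pack enables 8 cores, and each
    additional pack enables four times as many cores."""
    return 8 * 4 ** (packs - 1)

def packs_needed(cores):
    """Smallest number of HPC Packs covering the requested cores."""
    packs = 1
    while pack_capacity(packs) < cores:
        packs += 1
    return packs

print(hpc_licenses(20))   # 20: a 20-core run needs 20 HPC licenses
print(packs_needed(20))   # 2: two packs enable up to 32 cores
```

So for the 20-core example in the text, the HPC scheme checks out 20 licenses while the HPC Pack scheme needs only two packs (with 12 enabled cores left unused).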
The difference between HPC and HPC Pack
Command Line that analyzes an HFSS project serially:
"C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe" -ng -local
-batchsolve \\shared_drive\projs\OptimTee12.hfss
Command Line that analyzes HFSS project serially and monitors analysis progress that is printed to
stdout/stderr: "C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe" -ng -monitor -local -batchsolve
\\shared_drive\projs\OptimTee12.hfss
Command Line that uses four cores for multi-processing of analysis:
"C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe" -ng -monitor
-local -batchoptions
"'Hfss/Preferences/NumberOfProcessors'=4"
-batchsolve \\shared_drive\projs\OptimTee12.hfss
Command Line that runs four distributed engines, on compute units allocated by the HPC Scheduler:
"C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe" -ng
-monitor -distributed -machinelist num=4
-batchsolve \\shared_drive\projs\OptimTee12.hfss
Command Line that runs four distributed engines, with each engine using four cores for multi-processing:
"C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe" -ng -monitor
-distributed -machinelist num=4 -batchoptions
" 'Hfss/Preferences/NumberOfProcessorsDistributed'=4
'Hfss/Preferences/NumberOfProcessors'=4" -batchsolve
\\shared_drive\projs\OptimTee12.hfss
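Batch invocations like the ones above can also be assembled programmatically. The sketch below only builds the argument list for the last (distributed, multi-processing) example; the executable path and project file are the same hypothetical ones used above, and no solve is actually launched:

```python
def hfss_batch_cmd(exe, project, num_engines=4, cores_per_engine=4):
    """Build an HFSS batch-solve command line that runs distributed
    engines with per-engine multi-processing, mirroring the examples."""
    batchoptions = (
        "'Hfss/Preferences/NumberOfProcessorsDistributed'={c} "
        "'Hfss/Preferences/NumberOfProcessors'={c}".format(c=cores_per_engine)
    )
    return [
        exe, "-ng", "-monitor",
        "-distributed", "-machinelist", "num={}".format(num_engines),
        "-batchoptions", batchoptions,
        "-batchsolve", project,
    ]

cmd = hfss_batch_cmd(
    r"C:\Program Files (x86)\AnsysEM\HFSS15.0\hfss.exe",
    r"\\shared_drive\projs\OptimTee12.hfss",
)
# On a machine with HFSS installed, subprocess.run(cmd) would start the solve.
print(" ".join(cmd))
```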
Remote Analysis
1. RSM in HFSS.
2. HFSS must be accessible from all remote machines as well as accessible on the local machine.
3. If you use RSM, it must be accessible from all remote machines. In addition, the HFSS engines must be registered with each installation of RSM. To do this, on each remote machine:
• On Windows, on the local and remote machines, click Start > Programs > ANSYS Electromagnetics > product > Register with RSM. You can also run RegisterEnginesWithRSM.exe, located in the product subdirectory (for example, C:\Program Files\AnsysEM\hfss15\Windows 64-bit\RegisterEnginesWithRSM.exe). In each case, you see a dialog confirming the registration; click OK to dismiss it.
• On Linux, run RegisterEnginesWithRSM.pl, located in the product installation directory (for example, /apps/ansyselectromagnetics/hfss15/RegisterEnginesWithRSM.pl).
If the RSM service cannot run due to permission issues with the configuration file, it issues an error message and exits. If your product is not registered with RSM, the analysis will run locally.
RSM: Remote Analysis
Remote: the simulation runs on a remote computer; the local machine does not take part in the computation.
When you run a simulation remotely, you should see a message in the Progress window identifying the design name and the specified remote machine. You will see progress messages as the simulation continues. When the simulation is complete, you will see a message in the Message window.
Settings for remote analysis
Remote Simulation Manager (RSM)
• RSM
– Manages communications between local and remote computers for HFSS simulations
– Used by DSO, DDM, and Remote Solve to communicate with networked workstations
• Improved installation setup for remote simulations
– Supports mixed operating system environments
– Supports LSF and Windows HPC
(Analysis location options: Local, Remote, Distributed; Selected Machine List; Edit Machine List)
Distributed Solve Option
DSO: Distributed Solve Option
Regular DSO
Large Scale DSO
• Setting DSO Configurations Using the User Interface
• DSO Configurations in the Registry
• Setting DSO Configurations Using UpdateRegistry
Each DSO configuration is identified by a unique
name.
DSO configuration
DSO tasks
Large Scale DSO to distribute parametric variations of an HFSS model
across the nodes of a cluster or to multiple cores of a single machine.
DSO example
DSO parametric sweep
desktopjob.exe -cmd dso -machinelist "list=m1:2,m2:2" -batchoptions
\\sjo7na\hfssprojs\hfssoptions.txt -batchsolve
"TeeModel:Optimetrics:ParametricSetup1" \\sjo7na\hfssprojs\OptimTee.hfss
where the file \\sjo7na\hfssprojs\hfssoptions.txt has the following contents:
$begin 'Config'
'HFSS/Preferences/NumberOfProcessors'=1
'HFSS/Preferences/NumberOfProcessorsDistributed'=1
'HFSS/Preferences/NumberOfProcessorsPostProc'=1
'HFSS/Preferences/UseHPCforMP'=0
'HFSS/Preferences/SaveBeforeSolving'=0
'HFSS/Preferences/MemLimitHard'=0
'HFSS/Preferences/MemLimitSoft'=0
'HFSS/Preferences/HPCLicenseType'='pack'
$end 'Config'
Launching DSO from the command line
desktopjob.exe -cmd dso -machinelist "list=shhhli:2,shhtech01:2" -batchsolve
"TeeModel:Optimetrics:ParametricSetup1"
Examples\RF_Microwave\OptimTee1.hfss
You have three options for post-processing the csv files:
• Import the Large Scale DSO dataset solution.
• Use Microsoft Excel or any other application with csv post-processing functionality.
• Parse the csv output into your custom program, for any downstream flow.
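For the third option (parsing the csv output in a custom program), a minimal sketch using Python's standard csv module follows. The column names here are hypothetical; a real Large Scale DSO csv would carry the project's own variable and output columns:

```python
import csv
import io

# Hypothetical DSO-style csv: one row per solved parametric variation.
sample = """variation,radius_mm,S11_dB
1,0.5,-12.3
2,0.6,-15.1
3,0.7,-18.4
"""

def best_variation(csv_text, column="S11_dB"):
    """Return the row with the lowest value in the given column
    (for S11, lower means a better match)."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return min(rows, key=lambda r: float(r[column]))

best = best_variation(sample)
print(best["variation"], best["S11_dB"])  # variation 3, S11 = -18.4 dB
```

In a real flow the csv text would be read from the results folder of the input project rather than from an inline string.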
DSO results processing
DSO result files
The extracted results are saved to the local storage. When the engine is
done with analysis of all variations, the extracted results are transferred
from local storage to the results folder of the input project.
Post-processing DSO results
Distributed parametric sweeps
• Distributed analysis used to quickly explore multi-dimensional design space
– Helix Antenna example, parameters may include wire radius, pitch spacing, helix radius
• DSO distributes frequency and parametric sweeps to network of processors
• Approximately linear increase in simulation throughput
• Highly scalable to large numbers of processors
• Multi-processor nodes can be utilized
DSO distributes frequency and parametric
sweeps to networked processors
HFSS 3D Rectangular plot
• Parametric sweep of helix wire radius: 8 computers with two dual-core CPUs each
• 32 nodes, 45 variations
 – 27x speed-up when running 32 parallel simulations using DSO, compared with running each parameter variation sequentially
DSO parametric sweeps: two examples
• Optimetrics analysis of circular waveguide phased array
• Parametric sweep over 45 scan angles
• 5X faster when distributed to 6 CPUs
• Optimetrics analysis of PIFA radiating
element
• Parametric sweep of antenna geometry
• 7.5X faster when distributed to 8 CPUs
[Figure: dB(S(P1,P1)) vs. frequency, 2.0 to 3.0 GHz, for the "S11 for Element 1" parametric sweep over extra_element_length.]
Domain Decomposition Method
Domain Decomposition Method (DDM)
• Applications
• Electrically Large RF/Antenna Designs
• Antenna Placement
• Radome Design
• Radar Cross-Section (RCS)
• EMC Analysis
• Industries
• Aerospace and Defense
• Wireless/Mobile Platforms
• Communications
• Healthcare
DDM and high-performance computing in HFSS
HPC distributes mesh sub-domains, FEM and discontinuous
IE domains, to networked processors and memory
FEM
Domain 1
FEM
Domain 2
FEM
Domain 3
FEM
Domain 4 IE Domain
• Distributes mesh sub-domains to network of processors
• FEM volume can be sub-divided into multiple domains
• IE Domains that are discontinuous will be distributed to separate nodes when they become large
• Significantly increases simulation capacity
• Multi-processor nodes can be utilized
Domain decomposition for finite arrays
– Each element in array treated as solution domain
– One compute engine can solve multiple element/domain in series
Distributes element sub-domains
to networked processors and memory
Example: 8x8 antenna array
• Direct solver with 12 cores: 5:05:14, 60.8 GB RAM
• Finite Array DDM with 12 cores: 00:44:53, 1.8 GB RAM
 – 6.8x faster, 33.8x less memory
(Solutions are equivalent for a finite array; the solution technique is the only difference.)
[Figure: speed factor vs. number of cores.] Additional performance benefits can be seen as additional cores are applied to this finite array with domain decomposition.
Notes on the Domain Decomposition Method (DDM):
1. The number of DDM sub-domains should not be too small; otherwise each sub-domain is large and the inter-domain coupling is strong, which reduces parallel efficiency. As a rule of thumb, use at least 4 sub-domains, with roughly 100,000 to 500,000 mesh elements per sub-domain, adjusted for the number of hardware cores and the specific problem.
2. HFSS optimizes the number of sub-domains based on the mesh size and the number of processor cores/machines available for the computation, and the DDM algorithm automatically partitions the finite-element mesh into sub-domains accordingly. The maximum number of domains is determined by the Distributed Analysis Machines setting under HFSS > Tools > Options > General Options > Analysis Options. For example, to solve a very large problem on the local machine (including a remotely logged-in server) with 8 domains, enter the IP address 127.0.0.1 and click "Add machine to list" eight times (i.e., specify 8 domains).
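The sizing rule in point 1 (at least 4 domains, roughly 100k to 500k mesh elements each) can be expressed as a small helper. This is only an illustration of the rule of thumb above, not an HFSS API; the 300k target per domain is an assumed midpoint of the stated range:

```python
def suggest_domain_count(mesh_elements, target_per_domain=300_000,
                         min_domains=4):
    """Pick a DDM sub-domain count so that each sub-domain holds
    roughly 100k-500k mesh elements, but never fewer than
    min_domains sub-domains (per the rule of thumb above)."""
    return max(min_domains, round(mesh_elements / target_per_domain))

print(suggest_domain_count(2_400_000))  # 8 domains of ~300k elements each
print(suggest_domain_count(500_000))    # 4: the minimum domain count applies
```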
Examples
Example: 4x4 antenna array
The number of tasks specifies
the total number of compute jobs
that will be run on that machine
simultaneously.
The Total Cores specifies the total
number of cores that will be used
on the given machine.
The purpose behind specifying Total
Cores at the machine level is to
allow you the flexibility of assigning
large amounts of multiprocessing to
some machines and smaller
amounts to others.
Schematic of a two-level distributed solve:
Level 1: parametric sweep, 3 tasks, machine list of 20
 Three parametric sweep processes are started simultaneously.
Level 2: domain decomposition, 20 tasks
 The 20 tasks are divided evenly among the three sweep processes:
 Sweep 1: 7, Sweep 2: 7, Sweep 3: 6
Number of tasks for Level 1
This control determines how many level 1
tasks to create during a two level
distribution. This indirectly determines how
many level 2 tasks for each level 1 task are
used: the total number of tasks is specified
by the list of enabled machines on the first
tab, and the software evenly distributes
resources among the L1 tasks which then
are used to spawn off level 2 tasks.
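The even split shown earlier (20 level-2 tasks over 3 level-1 sweeps gives 7/7/6) follows the usual remainder-spreading rule; a sketch of that distribution, not HFSS's actual scheduler code:

```python
def distribute_tasks(total_tasks, num_groups):
    """Divide total_tasks as evenly as possible among num_groups,
    giving the first (total_tasks % num_groups) groups one extra task."""
    base, extra = divmod(total_tasks, num_groups)
    return [base + 1 if i < extra else base for i in range(num_groups)]

print(distribute_tasks(20, 3))  # [7, 7, 6]
```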
Finite-array effects
Vivaldi antenna, 80x80 finite array, with composite excitation:
Gain characteristics of the 80x80 antenna array
Hybrid Method
The FE-BI hybrid method and its applications
FE-BI boundary: solved with the FEM
IE region: solved with the Method of Moments
Far-field pattern of a vehicle-mounted antenna (with an adjacent vehicle present)
Currents induced by the vehicle-mounted antenna on the host vehicle and the adjacent vehicle
Far-field radiation pattern
Message Passing Interface
Message passing is currently the most widely used approach to parallel computing. In the message-passing model, a computation consists of one or more processes, and inter-process communication is accomplished by calling library functions to send and receive messages; communication is a cooperative act.

Two major advantages of the message-passing model:
• High portability
• Explicit user control over the storage of the parallel program, in particular over each process's memory

MPICH is the most popular non-proprietary MPI implementation, jointly developed by Argonne National Laboratory and Mississippi State University.

MPI has been implemented on all mainstream parallel machines, IBM PCs, all major Unix workstations, and MS Windows. A C or Fortran parallel program that uses MPI for message passing runs unchanged on IBM PCs, MS Windows, Unix workstations, and parallel machines of all kinds. It is the most trusted platform for large-scale high-performance parallel computing, and a large body of scientific and engineering software (meteorology, petroleum, seismology, aerodynamics, nuclear, and so on) has been ported to MPI.

Compared with PVM, MPI is more powerful, higher performing, more broadly applicable, easier to use, and more scalable.

Message Passing Interface (MPI)
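MPI itself needs an MPI runtime (such as MPICH) and language bindings, but the send/receive model described above can be sketched with Python's standard multiprocessing module, which likewise passes messages between separate processes rather than sharing memory:

```python
from multiprocessing import Process, Pipe

def worker(conn):
    """Receive a message, compute on it, and send the result back
    (analogous to MPI_Recv followed by MPI_Send)."""
    data = conn.recv()                    # blocking receive
    conn.send(sum(x * x for x in data))   # send the result back
    conn.close()

if __name__ == "__main__":
    parent, child = Pipe()
    p = Process(target=worker, args=(child,))
    p.start()
    parent.send([1, 2, 3, 4])  # message to the worker process
    print(parent.recv())       # 30 = 1 + 4 + 9 + 16
    p.join()
```

As in MPI, the two processes coordinate only through explicit messages, so each one keeps full control of its own memory.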
Recommendation: it is more important to use memory
efficiently than to use all the processors
HFSS and HFSS-IE support different forms of distributed
analysis:
• Distributing rows of a parametric table, either as a
regular DSO, or as Large Scale DSO performed through
command line. Large Scale DSO generates a reduced
set of outputs.
• Distributing array solves.
• Distributing domain solves.
• Distributing a single or discrete interpolating sweep.
If a problem is too large to solve on one machine HFSS can automatically
partition a design into domains that can be solved by separate processes.
Before enabling solver domains, you must have the HPC license option, and
you must have allocated at least three distributed machines to the solve
pool. The number of domains that the solver creates will not exceed N-1,
where N is the number of machines listed in the pool (The first machine in
the pool acts as the head node and is responsible for domain assembly,
mesh refinement, and solution management). If more machines are present
in the solve pool than are needed, HFSS creates the number of domains
that leads to increased overall solver efficiency. Consequently, some
machines remain idle if the problem size does not justify their use.
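The N-1 rule above can be stated concretely. A sketch only, restating the text (the first machine in the pool is the head node and does not itself solve a domain, and at least three machines must be allocated before domains are enabled):

```python
def max_solver_domains(machines_in_pool):
    """Upper bound on the number of solver domains HFSS will create:
    N-1, where N is the number of machines in the solve pool, since
    the head node handles domain assembly, mesh refinement, and
    solution management rather than solving a domain."""
    if machines_in_pool < 3:
        raise ValueError("solver domains require at least 3 machines in the pool")
    return machines_in_pool - 1

print(max_solver_domains(8))  # up to 7 domains from an 8-machine pool
```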
Domain use can be invoked for a solve when
• The Enable Use of Solver Domains check box under the Solution
Setup Options tab is checked.
• You have the HPC License.
• You have provided at least three distributed machines in the pool.
• The solver determines that the problem is large enough (the mesh
has enough tets) to bother with domains.
• The design includes IE Regions and/or FEBI Radiation Boundaries.
If an HFSS problem involves solver domains or a finite array, then frequency
sweeps will not be done using DSO. Also, DSO for Optimetrics will not be
allowed.
Restrictions: with solver domains, the design and analysis setup cannot include:
• Master and slave boundaries.
• The Eigenmode solution type.
• Fast frequency sweeps.