
ECE 408 DIGITAL DESIGN WITH FPGAs

Winter Semester 2018

LECTURES 6 - 7: Design Flow

ΧΑΡΗΣ ΘΕΟΧΑΡΙΔΗΣ

(ttheocharides@ucy.ac.cy) Some slides adapted from Digital Integrated Circuits, Rabaey et al.

ECE 408 L06-7 Design Flow.2 © Theocharides, ECE, 2018

Design Process Steps (Review)

° Definition of system requirements
  • Example: ISA (instruction set architecture) for a CPU.
  • Includes software and hardware interfaces, including timing.
  • May also include cost, speed, reliability and maintainability specifications.

° Definition of system architecture
  • Example: high-level HDL (hardware description language) representation (not required in ECE 408 specifically, but done in the real world).
  • Useful for system validation and verification, and as a basis for lower-level design execution and validation or verification.

° Refinement of system architecture
  • In manual design: descent in the hierarchy, designing increasingly lower-level components.
  • In synthesized design: transformation of high-level HDL to "synthesizable" register transfer level (RTL) HDL.

° Logic design or synthesis
  • In manual or synthesized design: development of the logic design in terms of library components.
  • The result is a logic-level schematic, a netlist representation, or a combination of both.
  • Both manual design and synthesis typically involve optimization of cost, area, or delay.

° Implementation
  • Conversion of the logic design to a physical implementation.
  • Involves the processes of:
    - Mapping of logic to physical elements,
    - Placing of the resulting physical elements,
    - Routing of interconnections between the elements.
  • For SRAM-based FPGAs, the result is represented by the programming bitstream, which generates the physical implementation in the form of CLBs, IOBs and the interconnections between them.

° Validation (used at a number of steps in the process)
  • At architecture level: functional simulation of HDL.
  • At RTL level: functional simulation of RTL HDL.
  • At logic design or synthesis: functional simulation of the gate-level circuit (not usually done in ECE 408/664).
  • At implementation: timing simulation of the schematic, netlist or HDL with implementation-based timing information (functional simulation can also be useful here).
  • At programmed FPGA level: in-circuit test of function and timing.

Hardware design in general

[Figure: General Hardware Design Flow / Methodologies. Box labels recovered from the diagram:
  • Front end: logic (RTL) design, logic simulation, logic debugging, RTL code (Verilog).
  • ASIC path: RTL code & target library, logic synthesis, netlist (EDIF), gate-level simulation, gate-level debugging, placement & routing, netlist & gate delay (SDF), timing simulation, timing analysis, GDSII & test vectors, semiconductor fabrication.
  • FPGA path: RTL code & FPGA library, logic synthesis, placement & routing, FPGA bitstream, FPGA board, FPGA debugging.
  • H/W platform path: RTL compilation (logic synthesis, placement & routing), H/W platform.]

Xilinx HDL/Core Design Flow

DESIGN ENTRY

CORE GENERATION RTL HDL EDITING

RTL HDL-CORE SIMULATION

SYNTHESIS

IMPLEMENTATION

TIMING SIMULATION

FPGA PROGRAMMING & IN-CIRCUIT TEST

Xilinx HDL/Core Design Flow - HDL Editing

Language Construct Templates

HDL EDITOR

DESIGN WIZARD LANGUAGE ASSISTANT Accessed within HDL Editor

RTL HDL Files

HDL Module Frameworks

Xilinx HDL/core Design Flow – Core Generation

CORE GENERATOR

Select core and specify input parameters

HDL instantiation module for core_name

EDIF netlist for core_name

Other core_name files

Xilinx HDL/core Design Flow - HDL Functional Simulation

Compile HDL Files

Waveforms or List Files

Set Up and Map work library RTL HDL Files

Test Inputs or Force Files

HDL instantiation module for core_name

EDIF netlists for core_names

Functional Simulate

Testbench HDL Files

HDL SIMULATOR

All HDL Files

Gate/Primitive Netlist Files (EDIF or XNF)

Xilinx HDL Design Flow - Synthesis

Select Top Level

Select Target Device

Edit XST Synthesis Constraints

Synthesize

Synthesis/Implementation Constraints

Synthesis Report Files

EDIF netlists for core_names

Model Extraction

Xilinx HDL/core Design Flow - Implementation

Netlist Translation

Place & Route

BIT File

Create Bitstream

Timing Model Gen

Gate/Primitive Netlist Files (XNF or EDN)

Standard Delay Format File

HDL or EDIF for Implemented Design

XILINX DESIGN MANAGER

Xilinx HDL/core Design Flow- Timing Simulation

Test Inputs, Force Files

MODELSIM

Compile HDL Files

Waveforms or List Files

Set Up and Map work Directory

Compiled HDL

HDL Simulate

Standard Delay Format File HDL or EDIF for Implemented Design

Testbench HDL Files

Xilinx HDL Design Flow - Programming and In-circuit Verification

Bit File

FPGA Board

iMPACT

I/O Port

Input Byte

Human Inputs

Outputs

A Few Notes on Programming: Start up Sequence

° During an FPGA start-up, the device performs four operations:

1. The assertion of the DONE signal. If DONE fails to go High, the configuration data may not have loaded successfully.

2. The release of the Global Three State (GTS) signal. This activates all the I/Os.

3. The release of the Global Set Reset (GSR) signal. This allows all flip-flops to change state.

4. The assertion of the Global Write Enable (GWE) signal. This allows all RAMs and flip-flops to change state.

° By default, these operations are synchronized to the CCLK signal.

° The entire start-up sequence lasts eight cycles, called C0-C7, after which the loaded design is fully functional.

Serial Load Configuration

° There are two serial configuration modes.
° Master Serial mode
  • the FPGA controls the configuration process by driving CCLK as an output.
° Slave Serial mode
  • the FPGA passively receives CCLK as an input from an external agent (e.g., a microprocessor, CPLD, or second FPGA in master mode) that is controlling the configuration process.

° In both modes, the FPGA is configured by loading one bit per CCLK cycle.

° The MSB of each configuration data byte is always written to the DIN pin first.
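The MSB-first ordering can be illustrated with a minimal Python sketch. It assumes the configuration data is available as a byte sequence; the name `din_bit_stream` is hypothetical and is not part of any Xilinx tool.

```python
def din_bit_stream(config_bytes):
    """Yield the bits presented on DIN, one per CCLK cycle.

    Assumption for illustration: config_bytes is an iterable of
    integers in the range 0..255. The MSB of each byte goes first.
    """
    for byte in config_bytes:
        for bit_index in range(7, -1, -1):  # bit 7 (MSB) down to bit 0
            yield (byte >> bit_index) & 1


# Two example bytes need 16 CCLK cycles in either serial mode.
bits = list(din_bit_stream(bytes([0xA5, 0x01])))
print(bits)
```

Because one bit is loaded per CCLK cycle, the number of cycles spent shifting data equals eight times the number of configuration bytes.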

ASIC Design Flow

° ASIC
  • Application Specific Integrated Circuits
  • Custom design, usually from scratch or from pre-built components
  • Chip performs a particular function
  • Typically NOT general purpose

° Front End → Back End
  • Front End: Synthesis / Gate Level
  • Back End: Layout / Mask Generation

ASIC Design Flow – Typical flow

° ASIC Design Flow Steps
  • Specifications
  • Early Planning
  • Architecture
  • Design
  • Synthesis
  • Pre-Layout Static Timing Analysis
  • Layout
  • Post-Layout Static Timing Analysis
  • Pads Placement
  • Sent for Manufacturing

° VERIFICATION (shown alongside the steps)

Design Flow – Commercial (Example Tools)

[Figure: example commercial tool flow. Box labels recovered from the diagram:
  • Design representations: HDL model (VHDL / Verilog), RTL (register transfer level), Verilog gate-level netlist, DEF file.
  • Steps: Verilog simulation, static timing analysis, standard cell placement and routing, post-layout static timing analysis, pads placement.
  • Tools: ModelSim, Synopsys Design Compiler, PrimeTime (shown twice), Cadence Silicon Ensemble, Silicon Ensemble/Virtuoso.]

Other Commercial Tools

° RTL Verification with Specman e
° Gate-level simulation with ModelSim
° Logic Synthesis with Synopsys Design Compiler
° Static Timing Analysis with Synopsys PrimeTime
° Placement and Routing with Cadence Silicon Ensemble
° Running Silicon Ensemble in the GUI mode
° Clock Tree Generation with Cadence CTGen
° Integrating IP Block, DesignWare and Virage SRAM
° Power Estimation with Synopsys Power Compiler
° Code Revision Control with CVS

Starting A Design

° Must understand the specifications first
° Start by looking at the design as a black box
° e.g. Adder
  • F(X,Y) = X+Y
  • Takes two inputs, produces the sum of the inputs

[Figure: black box with inputs X and Y and output F(X,Y)]

Starting A Design

° SPECS → Architecture
  • Block Diagram
  • Brainstorming (if collaborating)
  • Feedback
  • I/O Specs
  • Architectural Decisions
    - Frequency?
    - Latency?
    - Power/Performance?
    - Reliability?
  • Architectural Optimizations
  • Finalizing the initial Design

From “Architecture” to RTL

° Create a block diagram of the design, with sub-blocks if necessary
° Create I/O specs for each block
  • e.g. adder
    - Sum generator: takes three inputs, produces one output
    - Carry generator: takes three inputs, produces one output
    - Interconnected?
  • Place boxes in functional order
    - i.e. can't generate the sum before the carry-in arrives!
  • Create pipeline flow
    - i.e. IF → ID → IX → IC → WB
  • Clocked signals/registers/latches
° Then proceed to code module by module
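Before writing any HDL, the adder's block-level I/O specs can be checked with a small behavioral model. The following is a minimal Python sketch, not course material: the function names and the 4-bit default width are assumptions chosen for illustration.

```python
def sum_generator(a, b, carry_in):
    """Three 1-bit inputs, one 1-bit output: the sum bit."""
    return a ^ b ^ carry_in


def carry_generator(a, b, carry_in):
    """Three 1-bit inputs, one 1-bit output: the carry-out bit."""
    return (a & b) | (a & carry_in) | (b & carry_in)


def ripple_carry_add(x, y, width=4):
    """Connect the two blocks bit by bit, in functional order.

    The sum bit of stage i depends on the carry produced by stage i-1,
    so the stages are evaluated from the least significant bit upward.
    """
    carry = 0
    result = 0
    for i in range(width):
        a = (x >> i) & 1
        b = (y >> i) & 1
        result |= sum_generator(a, b, carry) << i
        carry = carry_generator(a, b, carry)
    return result, carry


print(ripple_carry_add(9, 8))  # 4-bit result and carry-out
```

Each Python function corresponds to one box in the block diagram, which is the same decomposition that would later be coded module by module.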

Hierarchical Design

• Multiple modules
• Multiple instances
• Top-Level Design
  - Contains all sub-modules and connection information
• Sub-modules can be hierarchically built themselves

° Hardware Description Language
  • Verilog, VHDL, SystemC, etc.
° High level of design abstraction
  • ex:
    - Input A, B
    - Output C
    - Architecture entity of adder is C ← A + B
° Not going to talk in depth about HDL
  • Refer to multiple online resources
    - www.deeps.org
  • Behavioral vs. Structural
  • Code → Simulate, or Code → Synthesize (Compile) → Simulate

HDL - Tools

° Code programming
  • Just a text editor!
  • Today, fancy text editors with syntax highlighting are available for free (emacs, nedit, etc.)
° Simulation
  • Multiple free HDL simulators for simple designs
  • State-of-the-art simulators available at CSE
    - Modelsim
    - NCVHDL
    - NCVerilog
  • Simulated code is not necessarily synthesizable
° Synthesis (Compilation)
  • Need a target library of "standard" cells (i.e. AND, XOR, ADDER, etc.)
  • Synopsys Design Compiler

HDL Simulation / Verification

° After coding each block / module, we can simulate its functionality
° Use an HDL / RTL simulator
  • Event driven
  • Cycle driven
° The simulator reads the code and models its functionality based on clock cycles or events, e.g.
  • C ← A+B @ posedge clk
  • C ← A+B after 10 ns
° Tools available:
  • Modelsim, NCVHDL, NCVerilog, etc.
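The event-driven style can be sketched in a few lines of Python. This is a toy model under stated assumptions, not how ModelSim or NC-Verilog work internally: events are hypothetical `(time_ns, target, function)` tuples, and each right-hand side is evaluated at the moment its event fires.

```python
import heapq


def simulate(signals, events):
    """Process scheduled assignments in time order.

    signals: dict of signal name -> current value (updated in place)
    events:  list of (time_ns, target, function_of_signals)
    Returns a trace of (time_ns, target, new_value) tuples.
    """
    # The sequence number n breaks ties so functions are never compared.
    queue = [(t, n, target, fn) for n, (t, target, fn) in enumerate(events)]
    heapq.heapify(queue)
    trace = []
    while queue:
        t, _, target, fn = heapq.heappop(queue)
        signals[target] = fn(signals)
        trace.append((t, target, signals[target]))
    return trace


signals = {"A": 3, "B": 4, "C": 0}
trace = simulate(signals, [
    (10, "C", lambda s: s["A"] + s["B"]),  # C <- A+B after 10 ns
    (5, "A", lambda s: 7),                 # A changes at 5 ns
])
print(trace)
```

A cycle-driven simulator would instead evaluate every assignment once per clock edge and ignore the delays inside the cycle, which is why it needs no event queue.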

HDL → Synthesis → H/W

[Figure: steps from HDL to hardware: Synthesis, Placement, Routing, Timing Analysis]

Why learn about Logic Synthesis?

° Logic synthesis is the core of today's CAD flows for IC and system design
  • course covers many algorithms that are used in a broad range of CAD tools
  • basis for other optimization techniques, e.g. embedded software
  • basis for functional verification techniques

° Most algorithms are computationally hard
  • covered algorithms and flows are good examples of approaching hard algorithmic problems
  • course covers theory as well as implementation details
  • demonstrates an engineering approach based on theoretically solid but also practical solutions
    - very few research areas can offer this combination

Design of Integrated Systems

System Level

Register Transfer Level

Gate Level

Transistor Level

Layout Level

Mask Level

System Level

° Abstract algorithmic description of high-level behavior
  • e.g. C programming language
  • abstract because it does not contain any implementation details for timing or data
  • efficient to get a compact execution model as a first design draft
  • difficult to maintain throughout the project because there is no link to the implementation

Port* compute_optimal_route_for_packet(Packet_t *packet, Channel_t *channel)
{
    static Queue_t *packet_queue;
    packet_queue = add_packet(packet_queue, packet);
    ...
}

RTL Level

° Cycle-accurate model "close" to the hardware implementation
  • bit-vector data types and operations as abstraction from bit-level implementation
  • sequential constructs (e.g. if - then - else, while loops) to support modeling of complex control flow

module mark1;
  reg [31:0] m[0:8191];   // 8192 words, addressed by the 13-bit pc
  reg [12:0] pc;
  reg [31:0] acc;
  reg [15:0] ir;
  always begin
    ir = m[pc];
    if (ir[15:13] == 3'b000)
      pc = m[ir[12:0]];
    else if (ir[15:13] == 3'b010)
      acc = -m[ir[12:0]];
    ...
  end
endmodule

Gate Level

° Model on finite-state machine level
  • models function in Boolean logic using registers and gates
  • various delay models for gates and wires
  • in this lecture we will mostly deal with gate level

[Figure: gate-level circuit with example delay annotations of 4 ns and 3 ns]

Transistor Level

° Model on CMOS transistor level
  • depending on application, function modeled as resistive switches
    - used in functional equivalence checking
  • or full differential equations for circuit simulation
    - used in detailed timing analysis

Layout Level

° Transistors and wires are laid out as polygons in different technology layers such as diffusion, poly-silicon, metal, etc.

Design of Integrated Systems

[Figure: design levels, from System down to Transistor, plotted against project time]

- Design phases overlap to large degrees
- Parallel changes on multiple levels, multiple teams
- Tight scheduling constraints for product

Design Challenges

° Systems are becoming huge, design schedules are getting tighter
  • > 100 million gates becoming common for ASICs
  • > 0.4 million lines of C code to describe system behavior
  • > 5 million lines of RTL code

° Design teams are getting very large for big projects
  • several hundred people
  • differences in skills
  • concurrent work on multiple levels
  • management of design complexity and communication very difficult

° Design tools are becoming more complex but still inadequate
  • typical designer has to run ~50 tools on each component
  • tools have lots of bugs, interfaces do not line up etc.

Design Challenges

° Decision about design point very difficult
  • compromise between performance / costs / time-to-market
  • decision has to be made 2-3 years before the design is finished
  • design points are difficult to predict without actually doing the design
  • scheduling of product cycles

° Functional verification
  • simulation is still the main vehicle for functional verification but is inadequate because of the size of the design space
  • results in bugs in released hardware that are very expensive to recover from (different in software ;-)

Design Challenges

° Fundamental tradeoffs between different modeling levels:
  • modeling detail and team size to maintain model
    - high-level models can be maintained by one or two people
    - detailed models need to be partitioned, which results in a significant communication overhead
  • modeling accuracy versus modeling compactness
    - compact models omit details and give only crude estimations for implementation
    - detailed models are lengthy and difficult to adapt for major changes in design points
  • simulation speed versus hardware performance
    - high-level models can be simulated fast but cannot be implemented efficiently with automatic means
    - low-level models can be made to have a fast implementation but cannot be simulated very fast

General Design Approach

° How do engineers build a bridge?

° Divide and conquer!
  • partition design problem into many sub-problems which are manageable
  • define mathematical model for sub-problem and find an algorithmic solution
    - beware of model limitations and check them!
  • implement algorithm in individual design tools, define and implement general interfaces between the tools
  • implement checking tools for boundary conditions
  • concatenate design tools to general design flows which can be managed
  • see what doesn't work and start over

Design Automation

° Design Automation is one of the most advanced areas in practical computer science
  • many problems require sophisticated mathematical modeling
  • many algorithms are computationally hard and require advanced and fine-tuned heuristics to work on realistic problem sizes
  • boundary conditions need to be well declared and synchronized between different tools (patchwork to cover all holes)

° Two common pitfalls in CAD research
  • problem is looking for a solution:
    - problem scope is too big, makes modeling difficult or algorithms don't scale
    - problem scope is too small, solutions are not good enough
  • solution is looking for a problem:
    - model was oversimplified because real problem was too complex with too many boundary conditions

Key to Success

° Fine-tuned combination of Design Methodology and Tools
  • addresses algorithmic complexity by requiring
    - manual partitioning of the problem
    - manual input of hints/suggestions
    - manual iterations to drive tool application to best solution
  • makes CAD systems and design flows very complex and difficult to manage

[Figure: problem space versus the region where tools are applicable; practical combination through design methodology]

Examples of Divide and Conquer

° RTL cycle simulation evaluates only the next-state logic of the circuit; timing is assumed to be correct
  • combination of static timing analysis, formal equivalence checking, and cycle simulation allows separation of issues
  • cycle simulation avoids expensive event scheduling and processing and performs significantly faster

° However:
  • timing analysis is conservative with respect to the achievable clock cycle time

° Static timing analysis assumes simple gate delay models
  • complexity of static timing analysis becomes linear (simple longest and shortest path analysis in the circuit implementation)
  • very efficient implementation of incremental static timing analysis, which is needed in the inner loop of the technology-dependent part of logic synthesis

° However:
  • actual gate delay varies a lot in reality
    - models often assume average fan-out rather than actual gate load
  • delay model assumes ideal signals
    - slew dependency ignored

° Logic synthesis assumes ideal gates which are independent of the physical environment
  • standard cell place and route technology has made logic synthesis possible
    - gates are heavily over-designed to be functional in a wide variety of combinations (e.g. range of fan-out gates possible, different wire loads)
    - layout placement and routing are done in standard rows that minimize latch-up effects and optimize power and clock wiring

° However:
  - layout implementation remains sub-optimal because cells are designed for worst-case application and with large safety margins with respect to the environment

° Logic synthesis uses a crude model to estimate circuit area
  - literal count or simple table lookup for gate sizes allows fast comparison of different implementation choices

° However:
  - actual gate size can vary to a very large degree depending on load and timing requirement
  - area for wiring completely ignored

° Formal equivalence checking assumes identical state encoding of the two designs to be compared
  • reduces the general equivalence checking problem to combinational equivalence checking, which is computationally less complex
  • exploitation of structural similarities between designs to be compared makes tools applicable for huge (multi-million gate) designs
  • automatic algorithms for identifying register correspondence compensate to some extent for the limited model

° However:
  • combinational verification model cannot handle sequential verification problems

Full Custom Design Flow

° Application: ultra-high performance designs
  • general-purpose processors, DSPs, graphics chips, internet routers, games processors etc.

° Target: very large markets with high profit margins
  • e.g. PC business

° Complexity: very complex and labor intensive
  • involving large teams
  • high up-front investments and relatively high risks

° Role of Logic Synthesis:
  • limited to components that are not performance critical or that might change late in the design cycle (due to design bugs found late)
    - control logic
    - non-critical data-path logic
  • bulk of data-path components and fast control logic are manually crafted for optimal performance

Full Custom Design Flow

° Incomplete picture:

[Figure: full custom design flow. Design levels: ISA Specification, RTL Spec, Gate Level Netlist, Transistor Level Circuit, Layout. Steps between levels: Logic Synthesis, Manual or semi-automatic Design. Checks: Simulation, Formal Equivalence Checking, Circuit Simulation, Extract & Compare, Design Rule Checker.]

ASIC Design Flow

° Application: general IC market
  • peripheral chips in PCs, toys, handheld devices etc.

° Target: small to medium markets, tight design schedules
  • e.g. consumer electronics

° Complexity of design: standard design style, quite predictable
  • standard flows, standard off-the-shelf tools

° Role of Logic Synthesis:
  • used on a large fraction of the design except for special blocks such as RAMs, ROMs, analog components

ASIC Design Flow

° Incomplete picture:

[Figure: ASIC design flow. Design levels: Informal Specification, RTL Spec, Gate Level Netlist, Modified Gate Level Netlist, ASIC Foundry. Steps: Logic Synthesis, Test Logic Insertion, Manual Changes to fix timing. Checks: Simulation, Formal Equivalence Checking, Static Timing Analysis.]

What is Logic Synthesis?

[Figure: block diagram with input X, output Y, and the functions λ and δ]

Given: finite-state machine $F(X, Y, Z, \lambda, \delta)$ where:
  $X$: input alphabet
  $Y$: output alphabet
  $Z$: set of internal states
  $\lambda: X \times Z \rightarrow Z$ (next state function)
  $\delta: X \times Z \rightarrow Y$ (output function)

Target: circuit $C(G, W)$ where:
  $G$: set of circuit components $g \in$ {Boolean gates, flip-flops, etc.}
  $W$: set of wires connecting $G$

Objective Function for Synthesis

° Minimize area
  • in terms of literal count, cell count, register count, etc.

° Minimize power
  • in terms of switching activity in individual gates, deactivated circuit blocks, etc.

° Maximize performance
  • in terms of maximal clock frequency of synchronous systems, throughput for asynchronous systems

° Any combination of the above
  • combined with different weights
  • formulated as a constraint problem
    - "minimize area for a clock speed > 300 MHz"

° More global objectives
  • feedback from layout
    - actual physical sizes, delays, placement and routing
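The constrained formulation ("minimize area for a clock speed > 300 MHz") can be illustrated with a minimal Python sketch. The candidate implementations and their area and frequency numbers are invented for illustration; a real synthesis tool explores such trade-offs internally rather than choosing from a fixed list.

```python
# Hypothetical implementation choices for the same function.
candidates = [
    {"name": "small", "area": 1000, "fmax_mhz": 250},
    {"name": "medium", "area": 1400, "fmax_mhz": 320},
    {"name": "fast", "area": 2100, "fmax_mhz": 410},
]


def minimize_area(candidates, min_fmax_mhz):
    """Return the smallest candidate whose clock speed exceeds the bound.

    Returns None when no candidate satisfies the constraint.
    """
    feasible = [c for c in candidates if c["fmax_mhz"] > min_fmax_mhz]
    if not feasible:
        return None
    return min(feasible, key=lambda c: c["area"])


best = minimize_area(candidates, min_fmax_mhz=300)
print(best["name"])
```

The performance requirement acts as a filter and area is the only quantity minimized, which differs from a weighted combination where both terms enter one cost function.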

Constraints on Synthesis

° Given implementation style:
  • two-level implementation (PLA, CAMs)
  • multi-level logic
  • FPGAs

° Given performance requirements
  • minimal clock speed requirement
  • minimal latency, throughput

° Given cell library
  • set of cells in standard cell library
  • fan-out constraints (maximum number of gates connected to another gate)
  • cell generators

Why learn HDL coding styles for FPGAs?

° HDLs contain many complex constructs that are difficult to understand at first.
° Methods and examples included in HDL manuals do not always apply to the design of FPGA devices.
° If you currently use HDLs to design ASICs, your established coding style may unnecessarily increase the number of gates or CLB levels in FPGA designs.
° HDL synthesis tools implement logic based on the coding style of your design.

Naming Convention - Restrictions

° The following FPGA resource names are reserved and should not be used to name nets or components.
  • Components (Comps), Configurable Logic Blocks (CLBs), Input/Output Blocks (IOBs), Slices, basic elements (bels), clock buffers (BUFGs), tristate buffers (BUFTs), oscillators (OSC), CCLK, DP, GND, VCC, and RST
  • CLB names such as AA, AB, SLICE_R1C2, SLICE_X1Y2, X1Y2, and R1C2
  • Primitive names such as TD0, BSCAN, M0, M1, M2, or STARTUP
  • Do not use pin names such as P1 and A4 for component names
  • Do not use pad names such as PAD1 for component names

Use optional labels on flow control constructs

° Make the code structure more obvious
° Can slow execution in some simulators

/* Changing Latch into a D-Register
 * D_REGISTER.V
 */
module d_register (CLK, DATA, Q);
  input CLK;
  input DATA;
  output Q;
  reg Q;
  always @ (posedge CLK)
  begin: My_D_Reg   // optional label on the begin-end block
    Q <= DATA;
  end
endmodule

Coding for Synthesis

° Omit the Wait for XX ns Statement
  • XX specifies the number of nanoseconds that must pass before a condition is executed.
  • VHDL: wait for XX ns;
  • Verilog: #XX;

° Omit the ...After XX ns or Delay Statement
  • VHDL: (Q <=0 after XX ns)
  • Verilog: assign #XX Q=0;
  • This statement is usually ignored by the synthesis tool. In this case, the functionality of the simulated design does not match the functionality of the synthesized design.

° Omit Initial Values
  • VHDL: signal sum : integer := 0;
  • Verilog: initial sum = 1'b0;

° Order and Group Arithmetic Functions
  • ADD = A1 + A2 + A3 + A4; cascades three adders in series.
  • ADD = (A1 + A2) + (A3 + A4); two additions are evaluated in parallel and the results are combined with a third adder.
  • RTL simulation results are the same for both statements; however, the second statement results in a faster circuit after synthesis (depending on the bit width of the input signals).
  • When is the first construct preferred?
    - For example, if the A4 signal reaches the adder later than the other signals, the first statement produces a faster implementation because the cascaded structure creates fewer logic levels for A4. This structure allows A4 to catch up to the other signals. In this case, A1 is the fastest signal followed by A2 and A3; A4 is the slowest signal.

° Most synthesis tools can balance or restructure the arithmetic operator tree if timing constraints require it.

° However, Xilinx® recommends that you code your design for your selected structure.
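The effect of grouping on delay can be checked with a small arrival-time model. This Python sketch assumes every adder has the same fixed delay and ignores bit width and routing; the function names and the example numbers are illustrative, not measured values.

```python
def chained_arrival(arrivals, adder_delay):
    """Output arrival time for ((A1 + A2) + A3) + A4."""
    t = arrivals[0]
    for a in arrivals[1:]:
        # Each adder waits for its later input, then adds its own delay.
        t = max(t, a) + adder_delay
    return t


def balanced_arrival(arrivals, adder_delay):
    """Output arrival time for (A1 + A2) + (A3 + A4)."""
    a1, a2, a3, a4 = arrivals
    left = max(a1, a2) + adder_delay
    right = max(a3, a4) + adder_delay
    return max(left, right) + adder_delay


# All inputs arrive together: the balanced tree has fewer levels.
print(chained_arrival([0, 0, 0, 0], 1), balanced_arrival([0, 0, 0, 0], 1))

# A4 arrives late: the cascade puts only one adder after A4.
print(chained_arrival([0, 0, 0, 3], 1), balanced_arrival([0, 0, 0, 3], 1))
```

In the second case A4 passes through a single adder in the cascade but through two adders in the balanced tree, which is why the cascade finishes earlier.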

Comparing If Statement vs. Case Statement

° If statement generally produces priority-encoded logic.
° Case statement generally creates balanced logic.
° Use the Case statement for complex decoding and use the If statement for speed critical paths.
° Make sure that all outputs are defined in all branches of an if statement.
  • If not, it can create latches or long equations on the CE signal.
  • Have default values for all outputs before the if statements.
° Limiting the number of input signals into an if statement can reduce the number of logic levels.
° If there are a large number of input signals, see if some of them can be pre-decoded and registered before the if statement.
° Avoid bringing the dataflow into a complex if statement.
° Only control signals should be generated in complex if-else statements.

Implementation

Case vs IF

° Case implementation requires only one Virtex™ slice while the If construct requires two slices in some synthesis tools.

° In this case, design the multiplexer using the Case construct because fewer resources are used and the delay path is shorter.

Example – From XCELL

° The example compares two Verilog designs that describe the same function, one using the CASE construct and one using NESTED IF.

° The CASE construct reduces the delay by approximately 3 ns (using an XC4005E-2 part)

Source: http://www.xilinx.com/xcell/xl30/xl30_21.pdf

[Figure: implementation from IF construct]

[Figure: implementation from Case construct]

Implementing Latches and Registers

° Synthesizers infer latches from incomplete conditional expressions, such as an If statement without an Else clause.

° This can be problematic for FPGA designs because not all FPGA devices have latches available in the CLBs.

° In addition, you may think that a register is created, and the synthesis tool actually created a latch.

° The Spartan-II™, Spartan-3™ and Virtex™, Virtex-E™, Virtex-II™, Virtex-II Pro™ and Virtex-II Pro X™ FPGA devices do have registers that can be configured to act as latches.

° For these devices, synthesizers infer a dedicated latch from incomplete conditional expressions.

D Latch

module d_latch (GATE, DATA, Q);
  input GATE;
  input DATA;
  output Q;
  reg Q;
  // No else branch: Q holds its value while GATE is low, so a latch is inferred
  always @ (GATE or DATA)
  begin
    if (GATE == 1'b1)
      Q <= DATA;
  end // End Latch
endmodule

D register

module d_register (CLK, DATA, Q);
  input CLK;
  input DATA;
  output Q;
  reg Q;
  // Q changes only on the rising clock edge, so a register is inferred
  always @ (posedge CLK)
  begin: My_D_Reg
    Q <= DATA;
  end
endmodule

How to handle latches?

° With some synthesis tools you can determine the number of latches that are implemented in your design.

° You should convert all If statements without corresponding Else statements and without a clock edge to registers.

° Use the recommended register coding styles in the synthesis tool documentation to complete this conversion.

Resource Sharing

° Resource sharing is an optimization technique that uses a single functional block (such as an adder or comparator) to implement several operators in the HDL code.

° Use resource sharing to improve design performance by reducing the gate count and the routing congestion.

° If you do not use resource sharing, each HDL operation is built with separate circuitry.

° However, you may want to disable resource sharing for speed critical paths in your design.

Resource Sharing

module res_sharing (A1, B1, C1, D1, COND_1, Z1);
  input COND_1;
  input [7:0] A1, B1, C1, D1;
  output [7:0] Z1;
  reg [7:0] Z1;
  // Only one of the two sums is needed at a time, so one adder can be shared
  always @(A1 or B1 or C1 or D1 or COND_1)
  begin
    if (COND_1)
      Z1 <= A1 + B1;
    else
      Z1 <= C1 + D1;
  end
endmodule

With and Without Resource Sharing

Resource Sharing

° The following operators can be shared either with instances of the same operator or with an operator on the same line.

  • *
  • + –
  • > >= < <=

° For example, a + operator can be shared with instances of other + operators or with – operators.

° A * operator can be shared only with other * operators.

Resource Sharing

° You can implement arithmetic functions (+, –, magnitude comparators) with gates or with your synthesis tool’s module library.

° The library functions use modules that take advantage of the carry logic in CLBs/slices.

° Resource sharing of the module library automatically occurs in most synthesis tools if the arithmetic functions are in the same process.

° Resource sharing adds additional logic levels to multiplex the inputs to implement more than one function.

• Do not use it for arithmetic functions that are part of your design’s time critical path.

Using Preset Pin or Clear Pin

° Xilinx® FPGA devices consist of CLBs that contain function generators and flip-flops. Spartan-II™, Spartan-3™ , Virtex™, Virtex-E™, Virtex-II™, Virtex-II Pro™ and Virtex-II Pro X™ registers can be configured to have either or both preset and clear pins.

FlipFlop

module ff_example (RESET, SET, CLOCK, ENABLE, D_IN,
                   A_Q_OUT, B_Q_OUT, C_Q_OUT, D_Q_OUT, E_Q_OUT);
  input RESET;
  input SET;
  input CLOCK;
  input ENABLE;
  input [7:0] D_IN;
  output [7:0] A_Q_OUT;
  output [7:0] B_Q_OUT;
  output [7:0] C_Q_OUT;
  output [7:0] D_Q_OUT;
  output [7:0] E_Q_OUT;
  // Outputs are assigned inside always blocks, so they must be declared reg
  reg [7:0] A_Q_OUT, B_Q_OUT, C_Q_OUT, D_Q_OUT, E_Q_OUT;

  // D flip-flop
  always @(posedge CLOCK)
  begin
    A_Q_OUT <= D_IN;
  end // End FF

Asynchronous Reset

  always @(posedge CLOCK or posedge RESET)
  begin
    if (RESET == 1'b1)
      B_Q_OUT <= 8'b00000000;
    else
      B_Q_OUT <= D_IN;
  end

Asynchronous Set

  always @(posedge CLOCK or posedge SET)
  begin
    if (SET == 1'b1)
      C_Q_OUT <= 8'b11111111;
    else
      C_Q_OUT <= D_IN;
  end

What is this ?

  always @(posedge CLOCK or posedge RESET)
  begin
    if (RESET == 1'b1)
      D_Q_OUT <= 8'b00000000;
    else if (ENABLE == 1'b1)
      D_Q_OUT <= D_IN;
  end

Answer

° Flip-flop with asynchronous reset and clock enable

Flip-flop with asynchronous reset; asynchronous set and clock enable

  always @(posedge CLOCK or posedge RESET or posedge SET)
  begin
    if (RESET == 1'b1)
      E_Q_OUT <= 8'b00000000;
    else if (SET == 1'b1)
      E_Q_OUT <= 8'b11111111;
    else if (ENABLE == 1'b1)
      E_Q_OUT <= D_IN;
  end
endmodule

Using Clock Enable Pin Instead of Gated Clocks

° Use the CLB clock enable pin instead of gated clocks in your designs. Gated clocks can introduce glitches, increased clock delay, clock skew, and other undesirable effects

Gated Clock

module gate_clock (IN1, IN2, DATA, CLK, LOAD, OUT1);
  input IN1;
  input IN2;
  input DATA;
  input CLK;
  input LOAD;
  output OUT1;
  reg OUT1;
  wire GATECLK;
  // The clock passes through logic before reaching the flip-flop
  assign GATECLK = (IN1 & IN2 & CLK);
  always @(posedge GATECLK)
  begin
    if (LOAD == 1'b1)
      OUT1 <= DATA;
  end
endmodule

Gated Clock

BAD IDEA !!

Clock Enable

module clock_enable (IN1, IN2, DATA, CLK, LOAD, DOUT);
  input IN1, IN2, DATA;
  input CLK, LOAD;
  output DOUT;
  wire ENABLE;
  reg DOUT;
  // The gating condition drives the clock enable; CLK itself is untouched
  assign ENABLE = IN1 & IN2 & LOAD;
  always @(posedge CLK)
  begin
    if (ENABLE)
      DOUT <= DATA;
  end
endmodule

Clock Enable

PLACEMENT AND ROUTING

Post-Synthesis Implementation

Placement and routing

Two critical phases of layout design:
– placement of components on the chip;
– routing of wires between components.

Placement and routing interact, but separating layout design into phases helps us understand the problem and find good solutions.

Placement metrics

° Quality metrics for layout:
  • Area
  • Delay
  • Energy consumption

° Ideally placement and routing would be performed together
  • Both problems are NP-hard
  • For practical considerations placement and routing must be performed separately

° Design time may be important for FPGAs

Wire length as a quality metric

bad placement good placement

Wire length measures

Estimate wire length by the distance between components.

Possible distance measures, for components separated by x horizontally and y vertically:
– Euclidean distance: sqrt(x^2 + y^2);
– Manhattan distance: x + y.

Multi-point nets must be broken up into trees for good estimates.
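The two distance measures can be compared with a minimal Python sketch. Component positions are assumed to be `(x, y)` coordinate pairs, and the function names are illustrative.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two placed components."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def manhattan(p, q):
    """Distance along horizontal and vertical routing tracks."""
    return abs(q[0] - p[0]) + abs(q[1] - p[1])


a, b = (0, 0), (3, 4)
print(euclidean(a, b), manhattan(a, b))
```

The Manhattan estimate is never smaller than the Euclidean one, and it matches wires that can only run horizontally and vertically.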

Euclidean

Manhattan

Wiring trees

Steiner point

Placement techniques

Can construct an initial solution or improve an existing solution. Pairwise interchange is a simple improvement technique: – Interchange a pair, keep the swap if it helps wire length. – Heuristic determines which two components to swap.
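A minimal sketch of pairwise interchange, assuming two-pin nets, Manhattan wire length, and the simplest possible pair-selection heuristic (try every pair in turn). The component names and coordinates are hypothetical.

```python
import itertools

def wire_length(placement, nets):
    """Total Manhattan length of two-pin nets; placement maps name -> (x, y)."""
    total = 0
    for a, b in nets:
        (xa, ya), (xb, yb) = placement[a], placement[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total

def pairwise_interchange(placement, nets):
    """Swap pairs of components, keeping a swap only if wire length drops."""
    placement = dict(placement)
    best = wire_length(placement, nets)
    improved = True
    while improved:
        improved = False
        for a, b in itertools.combinations(sorted(placement), 2):
            placement[a], placement[b] = placement[b], placement[a]
            cost = wire_length(placement, nets)
            if cost < best:
                best = cost          # keep the swap
                improved = True
            else:
                # undo the swap: it did not help
                placement[a], placement[b] = placement[b], placement[a]
    return placement, best

nets = [("A", "B"), ("B", "C"), ("C", "D")]
start = {"A": (0, 0), "B": (3, 0), "C": (1, 0), "D": (2, 0)}
final, cost = pairwise_interchange(start, nets)
print(wire_length(start, nets), cost)  # 6 3
```

Because only improving swaps are kept, this is a pure hill-climbing method and can stop in a local minimum on larger problems.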

Placement by partitioning

Works well for components of fairly uniform size. Partition netlist to minimize total wire length using min-cut criterion. Partitioning may be interpreted as 1-D or 2-D layout.

Recursive partitioning

Min-cut bisecting partitioning

[Figure: partition 1 and partition 2, with 3 nets between them]

Min-cut bisecting partitioning, cont’d

Swapping A and B: – B drags 1 net; – A drags 3 nets; – total cut increase: 3 nets.

Conclusion: probably not a good swap, but must be compared with other pairs.
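The bookkeeping behind such a swap decision can be sketched as follows. The netlist below is hypothetical and is not the one drawn in the slide figure; `cut_size` and `swap_gain` are illustrative names.

```python
def cut_size(nets, side):
    """Number of nets with pins in both partitions; side maps node -> 0 or 1."""
    return sum(1 for net in nets if len({side[n] for n in net}) == 2)

def swap_gain(nets, side, a, b):
    """Reduction in cut size if a and b exchange partitions (negative = worse)."""
    trial = dict(side)
    trial[a], trial[b] = trial[b], trial[a]
    return cut_size(nets, side) - cut_size(nets, trial)

# Hypothetical netlist: A, C, D start in partition 0; B, E in partition 1.
side = {"A": 0, "C": 0, "D": 0, "B": 1, "E": 1}
nets = [("A", "C"), ("A", "D"), ("B", "E"), ("C", "E")]

print(cut_size(nets, side))             # 1
print(swap_gain(nets, side, "A", "B"))  # -3
```

A min-cut pass would evaluate the gain of many candidate pairs this way and apply only the most favorable swaps.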

Before Placement: Clustering

° Need to group BLEs (basic logic elements) into groups
° Goals:
  • Minimize number of clusters
  • Minimize inter-cluster wiring
  • Minimize critical path (timing-driven)
° How do we do this
  • Take advantage of cluster architecture

netlist with delay for each gate

Timing Analysis

arrival times

Source: David Pan

arrival time/required time

slack = required time - arrival time
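As a sketch of how arrival times, required times, and slack fit together, the following Python propagates arrival times forward and required times backward through a small gate network. The gate names, delays, and required output times are hypothetical, and interconnect delay is ignored here.

```python
def timing_analysis(delay, fanin, required_at_outputs):
    """delay: gate -> delay. fanin: gate -> list of driving gates, with the
    dict keys listed in topological order (drivers before the gates they feed).
    Times are measured at gate outputs."""
    order = list(fanin)
    arrival = {}
    for g in order:                      # forward pass
        arrival[g] = delay[g] + max((arrival[p] for p in fanin[g]), default=0)
    required = {g: float("inf") for g in order}
    required.update(required_at_outputs)
    for g in reversed(order):            # backward pass
        for p in fanin[g]:
            required[p] = min(required[p], required[g] - delay[g])
    slack = {g: required[g] - arrival[g] for g in order}
    return arrival, required, slack

delay = {"g1": 2, "g2": 1, "g3": 3, "g4": 1}
fanin = {"g1": [], "g2": [], "g3": ["g1", "g2"], "g4": ["g2"]}
arrival, required, slack = timing_analysis(delay, fanin, {"g3": 6, "g4": 6})
print(slack)  # {'g1': 1, 'g2': 2, 'g3': 1, 'g4': 4}
```

The gates with the smallest slack (here g1 and g3) lie on the critical path.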

Timing Analysis

Example with interconnect delay

[Figure: timing graph annotated with gate and interconnect delays]

Placement • Placement has a set of competing goals. • Can’t optimize locally and globally simultaneously. • Use heuristic approaches to evaluate quality.

[Figure: LUT1 and LUT2 with signals A, B, C, D, E]

Placement Algorithms

• Constructive methods: begin from netlist and generate an initial placement.
  - Partitioning methods: min-cut and Kernighan-Lin methods
  - Clustering
• Iterative improvement
  - Begin with random or constructive placement.
  - Iterate to improve it.
  - Hill-climbing

Iterative Placement Algorithms

• Pairwise interchange methods
• Force-directed methods
  - FD relaxation
  - FD pairwise exchange
• Simulated annealing
  - Often generates the best results
  - Can be time consuming
• Macro-based approaches
  - Genetic algorithms
  - Quad swaps

Iterative Improvement Algorithms

Force-directed: (classical mechanics)
- Force vector computed on each module corresponding to all nets
- Solve set of non-linear differential equations.

Simulated annealing: (statistical mechanics)
- Model a physical annealing process which optimizes energy.
- Analogous to slowly cooling heated metal; rapid cooling ("quenching") corresponds to greedy hill-climbing, which gets stuck in local minima.
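A minimal simulated-annealing placer, as a sketch under stated assumptions: moves are swaps of two blocks, cost is total Manhattan wire length over two-pin nets, and the temperature follows a geometric cooling schedule. The parameter values, block names, and grid are hypothetical; production placers use adaptive schedules and richer cost functions.

```python
import math
import random

def total_wire_length(pos, nets):
    return sum(abs(pos[a][0] - pos[b][0]) + abs(pos[a][1] - pos[b][1])
               for a, b in nets)

def anneal(pos, nets, t_start=10.0, t_end=0.01, alpha=0.9,
           moves_per_t=50, seed=0):
    rng = random.Random(seed)            # fixed seed for repeatable runs
    pos = dict(pos)
    names = sorted(pos)
    cost = total_wire_length(pos, nets)
    best_pos, best_cost = dict(pos), cost
    t = t_start
    while t > t_end:
        for _ in range(moves_per_t):
            a, b = rng.sample(names, 2)
            pos[a], pos[b] = pos[b], pos[a]
            delta = total_wire_length(pos, nets) - cost
            # Always accept improvements; accept uphill moves with
            # probability exp(-delta / t), which shrinks as t cools.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                cost += delta
                if cost < best_cost:
                    best_pos, best_cost = dict(pos), cost
            else:
                pos[a], pos[b] = pos[b], pos[a]   # reject: undo the swap
        t *= alpha
    return best_pos, best_cost

start = {"A": (0, 0), "B": (2, 1), "C": (1, 0),
         "D": (0, 1), "E": (2, 0), "F": (1, 1)}
nets = [("A", "B"), ("B", "C"), ("C", "D"), ("D", "E"), ("E", "F")]
best, best_cost = anneal(start, nets)
print(total_wire_length(start, nets), "->", best_cost)
```

Accepting some uphill moves at high temperature is what lets the method escape the local minima that trap plain pairwise interchange.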

Timing-driven Placement

• Take both wire length and critical path into account
• Problem
  - Critical path changes as I move blocks
  - How do I balance the two objectives
• How do we go about modeling routing delay during placement?

Determining Criticality

• Same basic approach as used for clustering criticality
• For each (i, j) connection from source i to sink j
  - Determine arrival times (pre-order BFS, forward from the inputs)
  - Determine required arrival times (post-order BFS, backward from the outputs)
  - Determine slack = required_arrival_time - arrival_time
  - Criticality(i, j) = 1 - slack(i, j) / (Max slack)

What is the purpose of the criticality exponent?
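One way to explore this question is with a short sketch. It assumes criticality is computed as 1 - slack / max slack, and it assumes a timing cost of the form delay × criticality^exponent, a form used by some timing-driven placers; the slides do not give the cost function, so treat both the form and the numbers as illustrative.

```python
def criticality(slack, max_slack):
    """1.0 for a connection with zero slack, 0.0 for the most relaxed one."""
    return 1.0 - slack / max_slack

def timing_cost(delay, slack, max_slack, exponent):
    """Assumed cost form: delay weighted by criticality ** exponent."""
    return delay * criticality(slack, max_slack) ** exponent

max_slack = 10.0
for exponent in (1, 8):
    critical = timing_cost(1.0, 0.0, max_slack, exponent)
    relaxed = timing_cost(1.0, 5.0, max_slack, exponent)
    print(exponent, critical, relaxed)
# exponent 1: critical 1.0, relaxed 0.5
# exponent 8: critical 1.0, relaxed 0.00390625
```

Under this assumed form, a larger exponent concentrates the timing cost on the connections closest to critical and makes the rest nearly free to move.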

Balancing Wiring and Timing Cost

• Need to determine relative changes in timing and wiring based on moves
• Idea: Use relative changes from previous calculation
  - Both values less than 1
  - Helps balance effect based on scaling parameter

This still doesn’t help address changes in delay
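A sketch of the normalization idea: each change is divided by the corresponding cost from the previous calculation, and a scaling parameter weighs the two terms. The exact formula and the name `tradeoff` are assumptions made for illustration; the slides describe the idea only in words.

```python
def delta_cost(d_timing, d_wiring, prev_timing, prev_wiring, tradeoff=0.5):
    """Combined relative change for one move.
    tradeoff = 1.0 considers timing only, 0.0 considers wiring only."""
    return (tradeoff * d_timing / prev_timing
            + (1.0 - tradeoff) * d_wiring / prev_wiring)

# A move that improves timing cost by 10 (out of 200) but adds 20 units
# of wiring (out of 1000): relative changes are -5% and +2%.
change = delta_cost(-10.0, 20.0, prev_timing=200.0, prev_wiring=1000.0)
print(change < 0)  # True: the move is a net improvement at tradeoff = 0.5
```

Dividing by the previous costs makes both terms dimensionless, so neither objective dominates simply because of its units.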

Routing

• Problem
  Given a placement and a fixed number of metal layers, find a valid pattern of horizontal and vertical wires that connect the terminals of the nets
  Levels of abstraction:
  o Global routing
  o Detailed routing
• Objectives
  Cost components:
  o Area (channel width): minimizing congestion in previous levels helps
  o Wire delays: timing minimization in previous levels
  o Number of layers (fewer layers are less expensive)
  o Additional cost components: number of bends, vias

Routing Anatomy

[Figure: top view and 3D view of metal layers 1, 2, and 3, in symbolic and layout form. Note: colors used in this slide are not standard.]

Global vs. Detailed Routing

• Global routing
  Input: detailed placement, with exact terminal locations
  Determine "channel" (routing region) for each net
  Objective: minimize area (congestion) and timing (approximate)
• Detailed routing
  Input: channels and approximate routing from the global routing phase
  Determine the exact route and layers for each net
  Objective: valid routing, minimize area (congestion), meet timing constraints
  Additional objectives: minimize vias, power

Channel graph

[Figure: channel graph made up of channels joined at switch boxes]

Routing Environment

• Routing regions
  Channel
  o Fixed height? (fixed number of tracks)
  o Fixed terminals on top and bottom
  o More constrained problem: switchbox. Terminals on four sides fixed
  Area routing
  o Wires can pass through any region not occupied by cells (exception: over-the-cell routing)
• Routing layers
  Could be pre-assigned (e.g., M1 horizontal, M2 vertical)
  Different weights might be assigned to layers

[Figure: channel routing example with numbered net terminals]

Routing Environment

• Chip architecture
  Full-custom:
  o No constraint on routing regions
  Standard cell:
  o Variable channel height?
  o Feed-through cells connect channels
  FPGA:
  o Fixed channel height
  o Limited switchbox connections
  o Prefabricated wire segments have different weights

[Figure: routing example labeled with channel, tracks, feedthroughs, a failed net, and a failed connection]

FPGA Programmable Switch Elements

• Used in connecting:
  The I/O of functional units to the wires
  A horizontal wire to a vertical wire
  Two wire segments to form a longer wire segment

FPGA Routing Channels Architecture

• Note: fixed channel widths (tracks)
• Should "predict" all possible connectivity requirements when designing the FPGA chip
• Channel -> track -> segment
• Segment length?
  Long: carry the signal longer, fewer "concatenation" switches, but might waste track
  Short: local connections, slow for longer connections

[Figure: a channel, a track within it, and a segment of that track]

FPGA Switch Boxes

• Ideally, provide switches for all possible connections
• Trade-off:
  Too many switches:
  o Large area
  o Complex to program
  Too few switches:
  o Cannot route signals

[Figure: Xilinx 4000 switch box, one possible solution]

FPGA Routing Architecture

° Island-style FPGA
° Row-based FPGA
° Sea-of-gates FPGA
° Hierarchical FPGA

Commercial FPGAs can be classified into the four groups, based on their routing architecture.

FPGA Architecture - Layout

• Island FPGAs
  Array of functional units
  Horizontal and vertical routing channels connecting the functional units
  Versatile switch boxes
  Example: Xilinx, Altera
• Row-based FPGAs
  Like standard cell design
  Rows of logic blocks
  Routing channels (fixed width) between rows of logic
  Example: Actel FPGAs

The Four Classes of FPGA

An Island – Based FPGA

Island-Style Devices

• Two dimensional problem • (X+Y)!/(X!Y!) possible paths • Restricted within bounding box
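The path count can be checked numerically. The sketch below assumes X and Y are the horizontal and vertical distances (in channel segments) between source and sink, and counts only shortest paths that stay inside the bounding box.

```python
import math

def shortest_paths(x, y):
    """Number of monotone routes across an x-by-y bounding box:
    (x + y)! / (x! * y!), i.e. choose which x of the x + y steps go horizontal."""
    return math.comb(x + y, x)

print(shortest_paths(2, 2))    # 6
print(shortest_paths(3, 2))    # 10
print(shortest_paths(10, 10))  # 184756
```

The count grows quickly with distance, which is why routers search these paths with heuristics rather than enumerating them.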

Example channel segmentation distribution

Virtex Routing Architecture

[Figure: Virtex device diagram with labels including 18Kb BRAM, multiplier, BLVDS, backplane, QDR SRAM, DDR SDRAM, shift registers, FIFO, PCI, and SONET/SDH]

Virtex II Architecture

Virtex II Routing Hierarchy

Virtex II Clock Distribution

FPGA Routing

• Routing resources pre-fabricated
  Must achieve 100% routability using existing channels
  If any net fails to route, redo placement
• FPGA architectural issues
  Careful balance between number of logic blocks and routing resources (100% logic area utilization?)
  Designing flexible switchboxes and channels (conflicts with high clock speeds)
• FPGA routing algorithms
  Graph search algorithms
  o Convert the wire segments to graph nodes, and switch elements to edges
  Bin packing heuristics (nets as objects, tracks as bins)
  Combination of maze routing and graph search algorithms
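The graph-search formulation can be sketched with a breadth-first search in which wire segments are nodes and programmable switches are edges. The segment names and the tiny routing fabric below are hypothetical; a real router would also track capacity and delay.

```python
from collections import deque

def route(switches, source, sink, used=frozenset()):
    """switches: iterable of (segment_a, segment_b) programmable connections.
    Returns the segment list using the fewest switches, or None if unroutable.
    Segments in `used` are already taken by other nets."""
    adj = {}
    for a, b in switches:
        adj.setdefault(a, []).append(b)
        adj.setdefault(b, []).append(a)
    parent = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == sink:
            path = []
            while node is not None:      # walk back to the source
                path.append(node)
                node = parent[node]
            return path[::-1]
        for nxt in adj.get(node, []):
            if nxt not in parent and nxt not in used:
                parent[nxt] = node
                queue.append(nxt)
    return None

switches = [("clb0_out", "h0"), ("h0", "v0"), ("v0", "h1"),
            ("h1", "clb1_in"), ("h0", "v1"), ("v1", "clb1_in")]
print(route(switches, "clb0_out", "clb1_in"))
# ['clb0_out', 'h0', 'v1', 'clb1_in']
print(route(switches, "clb0_out", "clb1_in", used={"v1"}))
# ['clb0_out', 'h0', 'v0', 'h1', 'clb1_in']
```

When `route` returns None for some net, the flow falls back to ripping up other nets or redoing placement.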

FPGA issues

Often want a fast answer. May be willing to accept lower quality result for less place/route time.
May be interested in knowing wirability without needing the final configuration.
Fast placement: constructive placement, iterative improvement through simulated annealing.

FPGA routing

Find a route within the given interconnection network. Global routing assigns nets to channels. Local routing selects the programming points used to make the connections.

FPGA routing techniques

Nair: route based on congestion, not distance. Route in two passes: – Estimate congestion. – Final routing.

Triptych: more gradual penalty for congestion.
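To illustrate routing by congestion rather than distance, here is a generic sketch: a shortest-path search in which entering a segment costs more the more nets already use it. This is not the Nair or Triptych algorithm itself; the cost form, the `penalty` parameter, and the small graph are assumptions made for illustration.

```python
import heapq

def congestion_route(adj, usage, source, sink, penalty=4.0):
    """Dijkstra search where entering a segment costs 1 + penalty * usage.
    On success, increments usage for every segment on the chosen path."""
    dist = {source: 0.0}
    parent = {source: None}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == sink:
            break
        if d > dist[node]:
            continue                      # stale heap entry
        for nxt in adj[node]:
            nd = d + 1.0 + penalty * usage.get(nxt, 0)
            if nd < dist.get(nxt, float("inf")):
                dist[nxt] = nd
                parent[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if sink not in parent:
        return None
    path, node = [], sink
    while node is not None:
        path.append(node)
        node = parent[node]
    path.reverse()
    for seg in path:
        usage[seg] = usage.get(seg, 0) + 1
    return path

# Both nets could use the short shared segment "m"; net 2 also has a detour.
adj = {"s1": ["m", "a1"], "a1": ["s1", "b1"], "b1": ["a1", "t1"],
       "t1": ["m", "b1"], "m": ["s1", "s2", "t1", "t2"],
       "s2": ["m", "a2"], "a2": ["s2", "b2"], "b2": ["a2", "t2"],
       "t2": ["m", "b2"]}
usage = {}
path1 = congestion_route(adj, usage, "s1", "t1")
path2 = congestion_route(adj, usage, "s2", "t2")
print(path1)  # ['s1', 'm', 't1']
print(path2)  # ['s2', 'a2', 'b2', 't2']
```

The second net takes the longer detour because the shared segment is already occupied, which is the behavior a congestion-driven cost is meant to produce.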

Xilinx XC4000 Routing

Altera Stratix Logic Array Blocks (Clusters)

Routing Connections

Based on the switch and wire parasitics, interconnect routes can be modeled as RC networks.

Other issues: Power

Routability

Timing-Driven Routing

• Add delay cost component to routing.
• Represent delay along path as RC chain. Buffering important here.
• Note that timing-driven routing selects most distant point for first route.
  - Sets upper bound on delay.
• Need for combined breadth-first congestion and timing-driven route.
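One common way to estimate the delay of an RC chain is the Elmore delay, used here as an assumed model since the slides do not name one. Each stage is a (resistance, capacitance) pair from driver to sink, in arbitrary consistent units.

```python
def elmore_delay(stages):
    """Elmore delay at the end of an RC chain.
    stages: list of (R, C) from driver to sink. Each capacitor is charged
    through all of the resistance between it and the driver."""
    delay = 0.0
    r_upstream = 0.0
    for r, c in stages:
        r_upstream += r
        delay += r_upstream * c
    return delay

segment = (1.0, 2.0)                 # hypothetical R and C per wire segment
print(elmore_delay([segment] * 3))   # 12.0
print(elmore_delay([segment] * 6))   # 42.0
```

Doubling the chain length more than triples the delay, because delay grows roughly with the square of the number of segments. Splitting the six-segment route into two buffered three-segment halves would give 2 × 12 = 24 plus the buffer delay, which is why buffering matters on long routes.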

Timing-Driven Routing

• Difficult to estimate remaining timing along a path

• Difficult to balance costs for each critical net

• Some routers attempt to “look-ahead” to anticipate congested or time-critical areas

• Optimal approaches have generally failed.

Combined Placement and Routing

• Used depth-first route to select initial connections
• Swap blocks and rip up attached nets
• Bias nets that span the bulk of device onto long-line resources.
• Took 16X longer than place and route
  - 8% to 15% improvement.

Optimizing your FPGA design

° Pinout and Area Constraints Editor (PACE) ° Implementation (Mapping, Placing, Routing)

• Constraints Editor • Text Editor (HDL source) • Floorplanner -- Placement • FPGA Editor – Routing

° Timing Constraints • Xilinx Constraints Editor

VHDL based synthesis

VHDL code

architecture RTL1 of RESOURCE is
begin
  seq : process (RSTn, CLOCK)
  begin
    if (RSTn = '0') then
      DOUT <= (others => '0');
    elsif (CLOCK'event and CLOCK = '1') then
      case SEL is
        when "00"   => DOUT <= unsigned(A) - 1;
        when "01"   => DOUT <= unsigned(B) - 1;
        when "10"   => DOUT <= unsigned(C) - 1;
        when others => DOUT <= unsigned(D) - 1;
      end case;
    end if;
  end process;
end RTL1;

Synthesized schematic for RTL1 of RESOURCE: delay 57 ns, area 65, number of flip-flops 16

4-bit Shift Register

HDL: Design Verification

Implement your design using VHDL or Verilog

[Figure: design flow HDL -> Synthesis -> Implementation -> Download, with verification steps behavioral simulation, functional simulation, timing simulation, and in-circuit verification]

Synthesis: Design Verification

Synthesize the design to create an FPGA netlist

[Figure: design flow Synthesis -> Implementation -> Download, with timing simulation]

Implementation: Design Verification

Translate, place and route, and generate a bitstream to download into the FPGA

[Figure: design flow HDL -> Synthesis -> Implementation -> Download, with behavioral simulation and timing simulation]

HDL: Summary

° Full VHDL/Verilog (RTL code)
• Advantages:
  - Portability
  - Complete control of the design implementation and tradeoffs
  - Easier to debug and understand code that you own
• Disadvantages:
  - Can be time consuming
  - Don't always have control over the synthesis tool
  - Need to be familiar with the algorithm and how to write it

But…

What about the custom ASIC case?

Layout – Back End Tools

° Once our design meets all static timing requirements, we move on
° Next step: Layout
° Objective: Receive an HDL gate-level netlist
° Create a custom cell (or a chip) using that netlist
° We use Cadence Silicon Ensemble now

A typical ASIC Design Flow

[Figure: flow from HDL through HDL simulation, HDL synthesis, netlist, functional simulation, physical implementation (floor planning, placement, routing, DRC/LVS), and timing simulation with STA and DRC/ERC/LVS, to fabrication]

A typical Layout Flow

Import Files (Design files & Libraries)

Floor Planning (Create cell rows)

Placement (Place IO & cells)

Routing • Power Ring Generation • Global Routing • Detailed Routing

Timing Data Generation (RC Extraction & Delay Calculation)

Design Rules Check (Antenna, connectivity, geometry)

Output Generation (GDSII, DEF, LEF, and SDF)

Clock Tree Synthesis

Sample Design
