Andes Embedded System Platform & Andes Embedded Processors
ANDES Confidential
What is a SoC?
SoC: a complex chip with the functionality of a whole system, with:
Generic modules: CPUs, memory controller, generic interfaces such as PCI/USB/UART/ROM.
Acceleration engines: video codec, crypto engines, etc.
IO interfaces: Ethernet, WiFi, USB, etc.
Interconnects: bus, switch, crossbar, etc.
Requires significant SW effort.
Initialization/Startup Sequence
For embedded systems without a full-blown OS:
You need to take care of the underlying CPU resources before doing anything else.
Examples of CPU resources to initialize:
Set up the address space (through CPU control registers)
• Cacheable or non-cacheable
• Interrupt and exception handlers
• Local (or scratchpad) memory
Initialize memory.
Remember to turn on caches if they exist.
Memory Hierarchy and Coherency
Memory hierarchy:
Storage elements: the smaller, the faster.
Level-0 memory: register file
Level-1 memory: caches or local memory
. . .
Level-n memory: global memory (DRAM)
Coherency:
CPU updates address X. LCDC reads address X. What would happen if
• X is cacheable?
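The coherency question above can be made concrete with a toy model. This is a minimal sketch, assuming a write-back CPU cache and an LCDC that reads DRAM directly as a bus master; the class and method names are hypothetical and not part of any Andes API.

```python
class ToySystem:
    """Toy model: CPU with a write-back cache; the LCDC reads DRAM directly."""

    def __init__(self):
        self.dram = {}    # address -> value held in main memory
        self.cache = {}   # address -> dirty value held only in the CPU cache

    def cpu_write(self, addr, value, cacheable):
        if cacheable:
            self.cache[addr] = value   # write-back: DRAM is not updated yet
        else:
            self.dram[addr] = value    # uncached: the write goes straight to DRAM

    def cache_writeback(self, addr):
        # Push the dirty line out to DRAM (done by software or hardware).
        if addr in self.cache:
            self.dram[addr] = self.cache.pop(addr)

    def lcdc_read(self, addr):
        # The LCDC cannot see the contents of the CPU cache.
        return self.dram.get(addr, 0)


X = 0x1000
system = ToySystem()
system.cpu_write(X, 0xAB, cacheable=True)
stale = system.lcdc_read(X)    # the LCDC still sees the old value
system.cache_writeback(X)
fresh = system.lcdc_read(X)    # after write-back the update is visible
```

In this model the LCDC reads stale data until the dirty line is written back, which is why such buffers are typically mapped non-cacheable or flushed before the device reads them.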
[Figure: CPU pipeline, register file, LCD controller, Ethernet]
Peripheral Control and DMA
How to control a peripheral (or a device)?
Through control/status registers
SW controlling peripherals: device drivers
Example: LCD controller (LCDC)
Preparation: CPU sets the display's resolution
Periodic data transfer
• scheme 1: CPU involved in all work
– CPU reads each pixel from the frame buffer and writes it to the LCDC's display register.
• scheme 2: CPU initiating the work
– CPU sets the frame buffer's base address.
– CPU sets the "GO" bit to start the whole operation, where the LCDC reads frame buffer data automatically and periodically.
Status:
• Record error status (e.g. data can't arrive at the LCDC on time).
DMA: scheme 2
Frees the CPU to do something else
• after setting up the batch job.
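Scheme 2 can be sketched as a small register-level model. This is only an illustration: the register names (`WIDTH`, `HEIGHT`, `FB_BASE`, `CTRL`, `STATUS`), the bit positions, and the out-of-range error condition are all assumptions made for the example, not the real LCDC register map.

```python
GO_BIT = 0x1    # hypothetical "GO" bit in the control register
ERR_BIT = 0x2   # hypothetical error flag in the status register


class ToyLcdc:
    """Toy LCD controller that fetches the frame buffer by itself (scheme 2)."""

    def __init__(self, memory):
        self.memory = memory
        self.regs = {"WIDTH": 0, "HEIGHT": 0, "FB_BASE": 0, "CTRL": 0, "STATUS": 0}
        self.displayed = []

    def write_reg(self, name, value):
        self.regs[name] = value

    def read_reg(self, name):
        return self.regs[name]

    def refresh(self):
        """One periodic refresh, performed without any CPU involvement."""
        if not self.regs["CTRL"] & GO_BIT:
            return
        base = self.regs["FB_BASE"]
        count = self.regs["WIDTH"] * self.regs["HEIGHT"]
        if base + count > len(self.memory):
            # Stand-in for an error such as "data can't arrive on time".
            self.regs["STATUS"] |= ERR_BIT
            return
        self.displayed = self.memory[base:base + count]


# Driver side: the CPU only programs registers, then is free to do other work.
memory = [0] * 64
memory[16:24] = [1, 2, 3, 4, 5, 6, 7, 8]   # a 4x2 frame buffer at address 16
lcdc = ToyLcdc(memory)
lcdc.write_reg("WIDTH", 4)
lcdc.write_reg("HEIGHT", 2)
lcdc.write_reg("FB_BASE", 16)
lcdc.write_reg("CTRL", GO_BIT)
lcdc.refresh()
```

The driver touches four registers in total, whereas scheme 1 would need one register write per pixel on every refresh.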
Andes Embedded System Platform Block Diagram
[Figure: block diagram with customer design, NCORE, INCTRL, CPU core]
Slaves
• All devices on AHB are slaves except N1213
APB bus
SRAM/SDRAM share the IO pins for address and data
Side Band
On Board Device
Xilinx XC5VLX110-1FF676 FPGA
144-pin SO-DIMM for SDRAM
32MB on-board NOR flash
10/100 Ethernet PHY
2 UART ports
X-Bus expansion
AHB bus connector
SD card slot
IDE connector
LCD I/F
AC97 Audio Codec
ADP-XC5FF676 Devices
Devices on AHB
• SDRAM controller (128MB)
• LCD controller
• DMA controller
• MAC controller
• USB 2.0 device controller
• SRAM controller (512KB)
• Flash controller (32MB)
• AHB Bus controller
Devices on APB
• AHB-APB bridge
• PMU
• I2C
• GPIO
• Interrupt controller
• Watchdog timer
• Timer
• RTC
• UART
• SSP
• I2S/AC97
• SD/MMC
[Figure: N12 platform with bus controller, 10/100 MAC, USB2.0, IrDA, ST UART, MMC, power manager]
ADP-XC5FF676 Profile
CPU frequency is 80 MHz for N1213; 40 MHz for N903
AHB clock is 40 MHz
Xilinx Virtex5 LX110
64MB SDRAM SO-DIMM
32MB NOR Flash
X-Bus for AIT Chip
10/100 Ethernet
SD card slot
2-digit debug port
AndesICE port
5 push buttons
2 UART ports
[Flowchart: boot flow. Nodes: START; CPU and DRAM controller initialization; GPIO button pushed? (Y/N); Check booting mode; Boot code initialization; Burn-in or demo; Default mode? (Y/N); Boot OS image or diagnosis/setup; Diagnosis/setup? (Y/N); END]
Boot procedures
Plug in DC power to the main board
Turn the power switch to 'ON'
Press push-button "SW4"
------------------------------------------------------------------------------
Andes Development Platform Diagnosis Menu, Built@Aug 25 2008 (release: 1.1)
------------------------------------------------------------------------------
( 1) SDRAM Test ( 2) Timer Test ( 3) DMA Test
( 5) UART Loopback Test ( 6) UART DMA Test ( 9) Watchdog Test
(10) Watchdog Reset Test(11) MAC Loopback Test (12) Flash Test
(13) SODIMM Sizing (14) SDRAM(bnk1,2) (17) AC97 Test
(18) AC97 DMA Test (21) LCD Test (23) Query RTC
(24) RTC Alarm Test (25) GPIO Test (55) CLI
(67) Set Console's UART (75) Burnin Test (93) Exec Img on LM(I/D)
(94) Dhrystone Test (95) Boot Selection (97) CopyImageFromCard
(99) Setup
AHB Extension Bus – Two Leopards Solution
SoC platform with N1213
SW development
AHB Extension Bus – Quick SOC Integration
User-defined circuit
Data Phase
Master/Slave read and write data
• hdata
Slave response
Ext. AHB Master
Master No. 5
• X_hm5_hbusreq
• X_hm5_hgrant
Ext. AHB Slave
Slave No.  Select signal  Memory size  Address map
13         X_hs13_hsel    1MB          0x90A0_0000
15         X_hs15_hsel    1MB          0x90C0_0000
17         X_hs17_hsel    1MB          0x90E0_0000
18         X_hs18_hsel    1MB          0x90F0_0000
19         X_hs19_hsel    1MB          0x9200_0000
21         X_hs21_hsel    256MB        0xA000_0000
22         X_hs22_hsel    128MB        0xB000_0000
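The external slave address map can be expressed as a small decode function, which is roughly what the bus decoder does when it asserts one `hsel` line. This is a sketch that assumes the slave numbers, sizes and base addresses listed above pair up in order; the function name `decode_slave` is hypothetical.

```python
MB = 1 << 20

# (slave number, base address, size in bytes), paired in the order listed above.
EXT_AHB_SLAVES = [
    (13, 0x90A0_0000, 1 * MB),
    (15, 0x90C0_0000, 1 * MB),
    (17, 0x90E0_0000, 1 * MB),
    (18, 0x90F0_0000, 1 * MB),
    (19, 0x9200_0000, 1 * MB),
    (21, 0xA000_0000, 256 * MB),
    (22, 0xB000_0000, 128 * MB),
]


def decode_slave(addr):
    """Return the external slave number selected by addr, or None if unmapped."""
    for number, base, size in EXT_AHB_SLAVES:
        if base <= addr < base + size:
            return number
    return None
```

For example, `decode_slave(0x90A0_0010)` selects slave 13, while an address in the gap between two windows, such as `0x90B0_0000`, selects none of the external slaves.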
Features:
Harvard architecture, 5-stage pipeline
16 general-purpose registers
Static branch prediction
Fast MAC
Hardware divider
Fully clock gated pipeline
2-level nested interrupt
External instruction/data local memory interface
Instruction/data cache
APB/AHB/AHB-Lite/AMI bus interface
Power management instructions
45K ~ 110K gate count
250MHz @ 130nm
Applications: MCU, storage, automotive control, toys
External Bus Interface
Applications: portable audio/media player, DVB/DMB baseband, DVD, DSC, toys, games
Applications: portable media player, MFP, networking, gateway/router, home entertainment, smartphone/mobile phone
AndesCore™ N1213-S (1/2)
I & D Local memory
Wide range support for internal/external local memory
• 4KB~1024KB
Provides fixed access latencies for internal local memory
Double buffer mode for D local memory
Optional external local memory interface
Bus
Synchronous/Asynchronous AHB
• 1D/2D DMA, load/store multiple
Efficient synchronization without locking the whole bus
• Load lock, store conditional instructions
Vectored interrupt to improve real-time performance
• 6 interrupt signals
MMU
For flexibility
Memory-mapped IO space
JTAG-based debug support
Optional embedded program trace interface
Performance monitors for performance tuning
Bi-endian modes to support flexible data input
Q & A
von Neumann architecture
Features:
Execution in multiple cycles
Serial fetch of instructions & data
Single memory structure
• Data and program can get mixed
• Data and instructions are the same size
Examples, von Neumann: PCs (Intel 80x86/Pentium), Motorola 68000, Motorola 68xx uC families
Computer architecture taxonomy (3/3)
Harvard architecture
Features:
Execution in 1 cycle
Parallel fetch of instructions & data
More complex H/W
• Instructions and data always separate
• Different code/data path widths (e.g. 14-bit instructions, 8-bit data)
Examples, Harvard: 8051, Microchip PIC families, Atmel AVR, AndesCore, Atom
CISC - Complex Instruction Set Computers:
Emphasis on hardware
Includes multi-clock complex instructions
Memory-to-memory
Sophisticated arithmetic (multiply, divide, trigonometry etc.)
Special instructions are added to optimize performance with particular compilers
RISC - Reduced Instruction Set Computers:
A very small set of primitive instructions
Fixed instruction format
Emphasis on software
All instructions execute in one cycle (Fast!)
Register to register (except Load/Store instructions)
Pipeline architecture
Coprocessor vs. Hardwired Engines
How can a divide operation be implemented?
In the main CPU pipeline
In a coprocessor separated from the main CPU
As a hardwired engine
User-defined instructions:
In-core instructions: short latency (1~5 cycles)
• Light semantics, such as SIMD instructions
Coprocessor instructions: longer latency (10~100 cycles)
• Heavier semantics, such as macroblock DCT/VLC or a crypto engine
Hardwired engines for comparison: >1000 cycles
• Used for even larger chunks of data processing (say, a slice or frame)
Most popular today.
Dual-core, multi-core, many-core: forms of multiprocessors in a single chip
Small-scale multiprocessors (2-4 cores):
Utilize task-level parallelism.
Task examples: audio decode, video decode, display control, network packet handling.
Large-scale multiprocessors (>32 cores):
nVidia's graphics chip: >128 cores
Sun's server chips: 64 threads
MPU, DSP, MCU
CPU: Central Processing Unit
DSP: Digital Signal Processing/Processor
MCU: Micro Control Unit
MPU: Micro Processor Unit
[Figure: pipeline stages: Instruction-Fetch First and Second, RF, Data Address Generation (AG), Data Access First and Second, EX, MAC1, MAC2]
F1 – Instruction Fetch First
• Instruction Tag/Data Arrays
• ITLB Address Translation
• Branch Target Buffer Prediction
F2 – Instruction Fetch Second
• Instruction Cache Hit Detection
• Cache Way Selection
• Instruction Alignment
[Figure: pipeline IF1 IF2 ID RF AG DA1 DA2 WB, with EX]
I1 – Instruction Issue First / Instruction Decode
• 32/16-Bit Instruction Decode
• Return Address Stack prediction
I2 – Instruction Issue Second / Register File Access
• Instruction Issue Logic
• Register File Access
Execution Stage
E1 – Instruction Execute First / Address Generation / MAC First
• Data Access Address Generation
• Multiply Operation (if MAC is present)
E2 – Instruction Execute Second / Data Access First / MAC Second / ALU Execute
• ALU
• Branch/Jump/Return Resolution
• Data Tag/Data arrays
• DTLB address translation
• Accumulation Operation (if MAC is present)
E3 – Instruction Execute Third / Data Access Second
• Data Cache Hit Detection
• Cache Way Selection
• Data Alignment
E4 – Instruction Execute Fourth / Write Back
• Interruption Resolution
• Instruction Retire
• Register File Write Back
Branch Prediction Overview
Why is branch prediction required?
A deep pipeline is required for high speed
Why dynamic branch prediction?
Static branch prediction
Dynamic branch prediction
Branch Prediction Unit
Branch Target Buffer (BTB)
128 entries of 2-bit saturating counters
128 entries, 32-bit predicted PC and 26-bit address tag
Return Address Stack (RAS)
Four entries
BTB and RAS updated by committing branches/jumps
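The behavior of one 2-bit saturating counter can be sketched as follows. This is a generic textbook model, not the N12 implementation: the state encoding (0-1 predict not taken, 2-3 predict taken) and the initial state are assumptions, and the names are hypothetical.

```python
class TwoBitCounter:
    """2-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state=1):   # assumed start: weakly not taken
        self.state = state

    def predict_taken(self):
        return self.state >= 2

    def update(self, taken):
        # Move one step toward the actual outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def count_mispredictions(outcomes):
    """Run one counter over a sequence of branch outcomes (True = taken)."""
    counter = TwoBitCounter()
    misses = 0
    for taken in outcomes:
        if counter.predict_taken() != taken:
            misses += 1
        counter.update(taken)
    return misses


# A loop branch taken 9 times and then falling through, executed twice.
loop_branch = [True] * 9 + [False]
misses = count_mispredictions(loop_branch * 2)
```

Because a single not-taken outcome only moves the counter from 3 to 2, the second run of the loop starts with a correct "taken" prediction; that hysteresis is the point of using two bits instead of one.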
BTB Instruction Prediction
Because BTB predictions are performed based on the previous PC instead of the actual instruction decoding information, the BTB may make the following two mistakes:
Wrongly predicts non-branch/jump instructions as branch/jump instructions
Wrongly predicts the instruction boundary (32-bit -> 16-bit)
If these cases are detected, the IFU will trigger a BTB instruction misprediction in the I1 stage and restart the program sequence from the recovered PC. This introduces a 2-cycle penalty.
[Figure: pipeline timing, F1 F2 I1]
RAS Prediction
When return instructions are present in the instruction sequence, RAS predictions are performed and the fetch sequence is changed to the predicted PC. Since the RAS prediction is performed in the I1 stage, there will be a 2-cycle penalty for return instructions because the sequential fetches in between will not be used.
[Figure: pipeline timing, F1 F2 I1]
Branch Misprediction
In the N12 processor core, the resolution of branch/return instructions is performed by the ALU in the E2 stage and is used by the IFU in the next (F1) stage. In this case, the misprediction penalty is 5 cycles.
[Figure: pipeline timing, F1 F2 I1 I2]
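The 2-cycle and 5-cycle penalties quoted on these slides follow from the position of the resolving stage in the pipeline. The sketch below uses a simplified model that assumes one instruction is fetched per cycle and that every instruction fetched behind the redirecting one is discarded; the function name is hypothetical.

```python
# Pipeline stages in order, as named on the slides.
STAGES = ["F1", "F2", "I1", "I2", "E1", "E2", "E3", "E4"]


def redirect_penalty(resolve_stage):
    """Cycles lost when the fetch redirect is decided in resolve_stage.

    When the redirecting instruction sits in stage k, the k stages before it
    hold sequentially fetched instructions that must be thrown away.
    """
    return STAGES.index(resolve_stage)


btb_or_ras_penalty = redirect_penalty("I1")   # redirect decided at I1
branch_penalty = redirect_penalty("E2")       # branch resolved by the ALU at E2
```

Under this model a redirect decided in I1 discards the fetches sitting in F1 and F2, and one decided in E2 discards those in F1 through E1, matching the 2-cycle and 5-cycle figures.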
[Figure: CPU, cache controller, L1 cache, L2 cache]
[Figure: I-Cache and D-Cache paths: uncached write/write-through, write back, D-Cache refill]
Cache operation
Many main memory locations are mapped onto one cache entry.
May have caches for:
instructions;
data;
data + instructions (unified).
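How many memory locations share one cache entry is easiest to see in a direct-mapped cache. This is a minimal sketch; the line size and number of lines are assumed values chosen for the example, not N12 parameters.

```python
LINE_SIZE = 32    # bytes per cache line (assumed)
NUM_LINES = 256   # number of cache entries (assumed), giving an 8 KB cache


def cache_slot(addr):
    """Return (index, tag) for a byte address in a direct-mapped cache."""
    line_addr = addr // LINE_SIZE
    index = line_addr % NUM_LINES   # which cache entry the line maps to
    tag = line_addr // NUM_LINES    # distinguishes lines sharing that entry
    return index, tag
```

Addresses `0x0000` and `0x2000` are 8 KB apart, so both map to index 0 and differ only in their tags; accessing them alternately would evict each other in a direct-mapped cache.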
Replacement policy
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
Random.
Least-recently used (LRU).
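The two strategies can be compared with a tiny fully associative cache model. This is a sketch for illustration; the class name, the `policy` strings and the seeded random generator are assumptions of the example.

```python
import random
from collections import OrderedDict


class TinyCache:
    """Fully associative cache with either LRU or random replacement."""

    def __init__(self, capacity, policy="lru", seed=0):
        self.capacity = capacity
        self.policy = policy
        self.entries = OrderedDict()     # least recently used entry comes first
        self.rng = random.Random(seed)   # seeded so runs are repeatable
        self.misses = 0

    def access(self, addr):
        """Access addr and return True on a hit, False on a miss."""
        if addr in self.entries:
            self.entries.move_to_end(addr)   # mark as most recently used
            return True
        self.misses += 1
        if len(self.entries) >= self.capacity:
            if self.policy == "lru":
                self.entries.popitem(last=False)   # evict least recently used
            else:
                victim = self.rng.choice(list(self.entries))
                del self.entries[victim]           # evict a random entry
        self.entries[addr] = True
        return False


lru = TinyCache(capacity=2, policy="lru")
for addr in ["A", "B", "A", "C", "A", "B"]:
    lru.access(addr)
```

With LRU, the access to `C` evicts `B` because `A` was used more recently, so the following access to `A` still hits. Random replacement needs no usage bookkeeping, which makes it cheaper in hardware at the cost of occasionally evicting a hot entry.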
Write operations
Write-through: immediately copy write to main memory.
Write-back: write to main memory only when location is removed from cache.
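The difference in main memory traffic between the two write policies can be counted with a one-line cache model. This is a simplified sketch that tracks a single cache line and ignores reads; all names are hypothetical.

```python
class OneLineCache:
    """Single-line cache that counts writes reaching main memory."""

    def __init__(self, policy):
        self.policy = policy     # "write-through" or "write-back"
        self.addr = None         # address currently held in the line
        self.dirty = False
        self.memory_writes = 0

    def store(self, addr):
        if self.addr != addr:
            self.evict()         # make room for the new location
            self.addr = addr
        if self.policy == "write-through":
            self.memory_writes += 1   # every store also goes to memory
        else:
            self.dirty = True         # memory is updated later, on eviction

    def evict(self):
        if self.policy == "write-back" and self.dirty:
            self.memory_writes += 1   # one write covers all earlier stores
        self.addr = None
        self.dirty = False


def count_memory_writes(policy, stores):
    cache = OneLineCache(policy)
    for addr in stores:
        cache.store(addr)
    cache.evict()                # flush whatever is left at the end
    return cache.memory_writes


stores = [0x100] * 10 + [0x200]
through = count_memory_writes("write-through", stores)
back = count_memory_writes("write-back", stores)
```

Ten repeated stores to the same location cost ten memory writes with write-through but only one with write-back, at the price of memory being out of date until the line is evicted.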
Goal: reduce the Average Memory Access Time (AMAT)
AMAT = Hit Time + Miss Rate × Miss Penalty
Approaches
Reduce Hit Time
Reduce Miss Penalty
Reduce Miss Rate
Notes
There may be conflicting goals
Keep track of clock cycle time, area, and power consumption
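A short worked example of the AMAT formula shows how the goals can conflict. The cycle counts and miss rates below are made-up numbers for illustration, not measurements of any Andes core.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, in the same unit as hit_time and miss_penalty."""
    return hit_time + miss_rate * miss_penalty


# Hypothetical baseline: 1-cycle hit, 5% misses, 40-cycle miss penalty.
baseline = amat(hit_time=1, miss_rate=0.05, miss_penalty=40)

# Lower miss rate with an unchanged hit time: a clear win.
fewer_misses = amat(hit_time=1, miss_rate=0.02, miss_penalty=40)

# A bigger cache that lowers the miss rate but costs an extra cycle per hit.
bigger_but_slower = amat(hit_time=2, miss_rate=0.03, miss_penalty=40)
```

With these numbers the baseline is about 3.0 cycles, the lower miss rate alone gives about 1.8 cycles, and the bigger but slower cache ends up at about 3.2 cycles, worse than the baseline despite missing less often.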
Tuning Cache Parameters
Size:
Must be large enough to fit the working set (temporal locality)
If too big, then hit time degrades
Associativity:
Needs to be large to avoid conflicts, but 4-8 way is as good as fully associative (FA)
If too big, then hit time degrades
Block:
Needs to be large to exploit spatial locality & reduce tag overhead
If too large, few blocks ⇒ higher misses & miss penalty
Configurable architecture allows designers to make the best performance/cost trade-offs
[Figure: the CPU sends a logical address to the memory management unit, which outputs a physical address]
[Figure: M-TLB entry index bit fields]
Memory protection (read/write/execute)
Different permission flags for kernel/user mode
OS typically runs in kernel mode
Applications run in user mode
Cache control (cached/uncached)
Accesses to peripherals and other processors need to be uncached.
[Figure: DMA controller between local memory and external memory]
Two channels
One active channel
Programmed using physical addressing
For both instruction and data local memory
External address can be incremented with stride
Optional 2-D Element Transfer (2DET) feature, which provides an easy way to transfer two-dimensional blocks from external memory.
Width byte stride (in DMA Setup register) = 1
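The idea behind a two-dimensional block transfer with a stride can be sketched in software. This only illustrates the access pattern: the function and parameter names are hypothetical and do not correspond to actual LMDMA register fields, and the model counts elements rather than bytes.

```python
def two_d_transfer(ext_mem, base, width, height, row_stride):
    """Gather a width x height block whose rows start row_stride elements apart.

    ext_mem models external memory as a flat list; the result models the
    block packed contiguously into local memory.
    """
    local = []
    for row in range(height):
        start = base + row * row_stride   # external address jumps by the stride
        local.extend(ext_mem[start:start + width])
    return local


# External memory holds an 8-wide image stored row by row.
image = list(range(8 * 4))
block = two_d_transfer(image, base=10, width=3, height=2, row_stride=8)
```

Here `block` holds the elements `[10, 11, 12, 18, 19, 20]`: a 3x2 window cut out of the 8-wide image and packed contiguously, which is the kind of transfer a single 2-D descriptor saves the CPU from issuing row by row.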
LMDMA Double Buffer Mode
[Figure: local memory bank 0, core pipeline, external memory]
Computation
Data movement
Bank switch between core and DMA engine
BIU introduction
The Bus Interface Unit (BIU) is responsible for off-CPU memory access, which includes:
System memory access
Instruction/data local memory access
Memory-mapped register access in devices
Bus Interface
Compliance with AHB/AHB-Lite/APB
High Speed Memory Port
Andes Memory Interface
External LM Interface
HSMP – High speed memory port
N12 also provides a high speed memory port interface, which has higher bus protocol efficiency and can run at a higher frequency to connect to a memory controller. The high speed memory port will be AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements.
WWW.ANDESTECH.COM