Andes Embedded System Platform & Andes Embedded Processors
ANDES Confidential
What is a SoC?
SoC: a complex chip with the functionality of a system, with:
• Generic modules: CPUs, memory controller, generic interfaces such as PCI/USB/UART/ROM.
• Acceleration engines: video codec, crypto engines, etc.
• IO interfaces: Ethernet, WiFi, USB, etc.
• Interconnects: bus, switch, crossbar, etc.
Requires significant SW effort.
Initialization/Startup Sequence
For embedded systems without a full-blown OS:
You need to take care of the underlying CPU resources before doing anything else.
Examples of CPU resources to initialize:
• Address space setup (through CPU control registers): cacheable or non-cacheable
• Interrupt and exception handlers
• Local (or scratchpad) memory
Initialize memory. Remember to turn on caches if they exist.
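The startup steps listed here can be sketched as a host-side C model. This is only an illustration: `struct cpu_model`, its fields, `vector_table`, `work_ram` and the ordering of the steps are assumptions made for this sketch, not the Andes control-register interface.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical model of CPU state; a real core exposes these settings
 * through its own control registers. */
#define NUM_REGIONS 2
enum { REGION_MEMORY = 0, REGION_DEVICES = 1 };

struct cpu_model {
    int       region_cacheable[NUM_REGIONS]; /* address-space attributes */
    uintptr_t vector_base;                   /* interrupt/exception handlers */
    int       local_memory_on;               /* scratchpad enable */
    int       caches_on;
};

static uint32_t vector_table[8];   /* stand-in for the handler table */
static uint32_t work_ram[64];      /* stand-in for RAM that must be cleared */

void startup(struct cpu_model *cpu, int has_caches)
{
    /* 1. Set up the address space: memory cacheable, device registers not. */
    cpu->region_cacheable[REGION_MEMORY]  = 1;
    cpu->region_cacheable[REGION_DEVICES] = 0;

    /* 2. Install interrupt and exception handlers. */
    cpu->vector_base = (uintptr_t)vector_table;

    /* 3. Enable local (scratchpad) memory. */
    cpu->local_memory_on = 1;

    /* 4. Initialize memory before any other code relies on it. */
    memset(work_ram, 0, sizeof work_ram);

    /* 5. Turn on caches (assumed last here), only if the core has them. */
    cpu->caches_on = has_caches ? 1 : 0;
}
```

On real hardware each assignment would be a control-register write, and this routine would run from the reset vector before `main()`.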
Memory Hierarchy and Coherency
Memory hierarchy:
• Storage elements: the smaller, the faster.
• Level-0 memory: register file
• Level-1 memory: caches or local memory
• . . .
• Level-n memory: global memory (DRAM)
Coherency: the CPU updates address X, then the LCDC reads address X. What would happen if
• X is cacheable?
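The coherency question can be explored with a toy model. The sketch below assumes a one-line write-back cache in front of DRAM and a device (such as the LCDC) that reads DRAM directly; all names are hypothetical, and the model assumes each address is always accessed with the same cacheability attribute.

```c
#include <stdbool.h>
#include <stdint.h>

#define MEM_WORDS 16

/* Toy system: DRAM plus a one-line write-back cache owned by the CPU. */
struct system_model {
    uint32_t dram[MEM_WORDS];
    uint32_t line_addr;
    uint32_t line_data;
    bool     line_valid;
    bool     line_dirty;
};

/* Copy the dirty line back to DRAM (a cache write-back operation). */
void cache_writeback(struct system_model *s)
{
    if (s->line_valid && s->line_dirty) {
        s->dram[s->line_addr] = s->line_data;
        s->line_dirty = false;
    }
}

void cpu_write(struct system_model *s, uint32_t addr, uint32_t value,
               bool cacheable)
{
    if (!cacheable) {            /* uncached: goes straight to DRAM */
        s->dram[addr] = value;
        return;
    }
    if (s->line_valid && s->line_addr != addr)
        cache_writeback(s);      /* evict the old line first */
    s->line_addr  = addr;
    s->line_data  = value;
    s->line_valid = true;
    s->line_dirty = true;        /* DRAM is now stale for this address */
}

/* A bus master such as the LCDC sees only DRAM, never the CPU cache. */
uint32_t device_read(const struct system_model *s, uint32_t addr)
{
    return s->dram[addr];
}
```

In this model a cacheable write is invisible to the device until `cache_writeback()` runs, while an uncached write is visible immediately.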
Peripheral Control and DMA
How to control a peripheral (or a device)?
• Through control/status registers
• SW controlling peripherals: device drivers
Example: LCD controller (LCDC)
• Preparation: the CPU sets the display's resolution
• Periodic data transfer
  – Scheme 1: the CPU is involved in all the work. The CPU reads each pixel from the frame buffer and writes it to the LCDC's display register.
  – Scheme 2: the CPU only initiates the work. The CPU sets the frame buffer's base address, then sets the "GO" bit to start the whole operation; the LCDC then reads frame buffer data automatically and periodically.
• Status: record error status (e.g. data can't arrive at the LCDC on time).
DMA: scheme 2
• Frees the CPU to do something else after setting up the batch job.
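Scheme 2 can be sketched as a small driver routine. The register layout below (`struct lcdc_regs`, `LCDC_CTRL_GO`, `LCDC_STAT_UNDERRUN`) is a hypothetical one invented for illustration, not the actual LCDC register map; on a target the struct pointer would be the controller's memory-mapped base address.

```c
#include <stdint.h>

/* Hypothetical LCDC control/status registers. */
struct lcdc_regs {
    volatile uint32_t  width;     /* display resolution */
    volatile uint32_t  height;
    volatile uintptr_t fb_base;   /* frame buffer base address */
    volatile uint32_t  control;   /* bit 0: GO */
    volatile uint32_t  status;    /* bit 0: data arrived too late */
};

#define LCDC_CTRL_GO        0x1u
#define LCDC_STAT_UNDERRUN  0x1u

/* Preparation plus "CPU initiating the work": after GO is set, the
 * controller fetches pixels from the frame buffer on its own. */
void lcdc_start(struct lcdc_regs *lcdc, const uint16_t *frame_buffer,
                uint32_t width, uint32_t height)
{
    lcdc->width   = width;
    lcdc->height  = height;
    lcdc->fb_base = (uintptr_t)frame_buffer;
    lcdc->control |= LCDC_CTRL_GO;
}

/* Status: report whether the controller recorded an error. */
int lcdc_has_error(const struct lcdc_regs *lcdc)
{
    return (lcdc->status & LCDC_STAT_UNDERRUN) != 0;
}
```

After `lcdc_start()` returns, the CPU does no per-pixel work; it only needs to check `lcdc_has_error()` occasionally.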
Andes Embedded System Platform Block Diagram
Figure: block diagram (labels: RJ45, VIRTEX, LED, Customer design, NCORE, INCTRL, CPU Core, APB bus, Side Band)
Slaves
• All devices on the AHB are slaves except the N1213
• SRAM and SDRAM share the IO pins for address and data
On Board Device
• Xilinx XC5VLX110-1FF676 FPGA
• 144-pin SO-DIMM for SDRAM
• 32MB on-board NOR flash
• 10/100 Ethernet PHY
• 2 UART ports
• X-Bus expansion
• AHB bus connector
• SD card slot
• IDE connector
• LCD I/F
• AC97 Audio Codec
ADP-XC5FF676 Devices
Devices on AHB
• SDRAM controller (128MB)
• LCD controller
• DMA controller
• MAC controller
• USB 2.0 device controller
• SRAM controller (512KB)
• Flash controller (32MB)
• AHB Bus controller
Devices on APB
• AHB-APB bridge
• PMU
• I2C
• GPIO
• Interrupt controller
• Watchdog timer
• Timer
• RTC
• UART
• SSP
• I2S/AC97
• SD/MMC
Figure: N12, Bus Controller, MAC 10/100, USB2.0, IrDA, ST UART, MMC, Power Manager
ADP-XC5FF676 Profile
• CPU frequency: 80 MHz (N1213); 40 MHz (N903)
• AHB clock: 40 MHz
• XILINX Virtex5 LX110
• 64MB SDRAM SO-DIMM
• 32MB NOR Flash
• X-Bus for AIT Chip
• 10/100 Ethernet
• SD card slot
• 2-digit debug port
• AndesICE port
• 5 push buttons
• 2 UART ports
Boot flowchart nodes:
• CPU and DRAM controller initialization
• GPIO button pushed? (Y/N)
• START
• Check booting mode
• Boot code initialization
• Burn-in or demo
• Default mode? (Y/N)
• END
• Boot OS image or diagnosis/setup
• Diagnosis/setup? (Y/N)
Boot procedures
• Plug in DC power to main board
• Turn power switch to ‘ON’
• Press push-button “SW4”
------------------------------------------------------------------------------
Andes Development Platform Diagnosis Menu, Built@Aug 25 2008 (release: 1.1)
------------------------------------------------------------------------------
( 1) SDRAM Test            ( 2) Timer Test          ( 3) DMA Test
( 5) UART Loopback Test    ( 6) UART DMA Test       ( 9) Watchdog Test
(10) Watchdog Reset Test   (11) MAC Loopback Test   (12) Flash Test
(13) SODIMM Sizing         (14) SDRAM(bnk1,2)       (17) AC97 Test
(18) AC97 DMA Test         (21) LCD Test            (23) Query RTC
(24) RTC Alarm Test        (25) GPIO Test           (55) CLI
(67) Set Console's UART    (75) Burnin Test         (93) Exec Img on LM(I/D)
(94) Dhrystone Test        (95) Boot Selection      (97) CopyImageFromCard
(99) Setup
AHB Extension Bus – Two Leopards Solution
• SOC platform with N1213
• SW development
AHB Extension Bus – Quick SOC Integration
User Define Circuit
Data Phase
Master/Slave read and write data
• hdata
Slave response
Ext. AHB Master
Master No. 5
• X_hm5_hbusreq
• X_hm5_hgrant
Ext. AHB Slave
Slave No.  Select signal  Memory size  Address map
13         X_hs13_hsel    1MB          0x90A0_0000
15         X_hs15_hsel    1MB          0x90C0_0000
17         X_hs17_hsel    1MB          0x90E0_0000
18         X_hs18_hsel    1MB          0x90F0_0000
19         X_hs19_hsel    1MB          0x9200_0000
21         X_hs21_hsel    256MB        0xA000_0000
22         X_hs22_hsel    128MB        0xB000_0000
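The external AHB slave address map can be expressed as a lookup table. The sketch below pairs each slave number with the base address and size listed above, in order; the function name `ext_ahb_decode` and the table layout are illustrative choices, not part of the platform software.

```c
#include <stddef.h>
#include <stdint.h>

struct ahb_slave {
    int      number;   /* external AHB slave number (hs13, hs15, ...) */
    uint32_t base;     /* first address of the window */
    uint32_t size;     /* window size in bytes */
};

#define MB (1024u * 1024u)

static const struct ahb_slave ext_slaves[] = {
    { 13, 0x90A00000u,   1u * MB },
    { 15, 0x90C00000u,   1u * MB },
    { 17, 0x90E00000u,   1u * MB },
    { 18, 0x90F00000u,   1u * MB },
    { 19, 0x92000000u,   1u * MB },
    { 21, 0xA0000000u, 256u * MB },
    { 22, 0xB0000000u, 128u * MB },
};

/* Return the slave number whose window contains addr, or -1 if none.
 * The unsigned subtraction wraps for addr < base, so one comparison
 * covers both bounds. */
int ext_ahb_decode(uint32_t addr)
{
    for (size_t i = 0; i < sizeof ext_slaves / sizeof ext_slaves[0]; i++) {
        if (addr - ext_slaves[i].base < ext_slaves[i].size)
            return ext_slaves[i].number;
    }
    return -1;
}
```

For example, `ext_ahb_decode(0xA0000000u)` selects slave 21, while an address in the gap after slave 13's 1MB window selects nothing.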
Features:
• Harvard architecture, 5-stage pipeline
• 16 general-purpose registers
• Static branch prediction
• Fast MAC
• Hardware divider
• Fully clock gated pipeline
• 2-level nested interrupt
• External instruction/data local memory interface
• Instruction/data cache
• APB/AHB/AHB-Lite/AMI bus interface
• Power management instructions
• 45K ~ 110K gate count
• 250MHz @ 130nm
Applications: MCU, Storage, Automotive control, Toys
External Bus Interface
Applications: Portable audio/media player, DVB/DMB baseband, DVD, DSC, Toys, Games
External Bus Interface
Applications: Portable media player, MFP, Networking, Gateway/Router, Home entertainment, Smartphone/Mobile phone
AndesCore™ N1213-S (1/2)
I & D local memory
• Wide range support for internal/external local memory: 4KB~1024KB
• Provides fixed access latencies for internal local memory
• Double buffer mode for D local memory
• Optional external local memory interface
Bus
• Synchronous/Asynchronous AHB
• 1D/2D DMA, load/store multiple
Efficient synchronization without locking the whole bus
• Load lock, store conditional instructions
Vectored interrupt to improve real-time performance
• 6 interrupt signals
MMU
For flexibility
Memory-mapped IO space
JTAG-based debug support
Optional embedded program trace interface
Performance monitors for performance tuning
Bi-endian modes to support flexible data input
Q & A
von Neumann architecture
Features of each:
• Execution in multiple cycles
• Serial fetch of instructions & data
• Single memory structure
  – Data and program can get mixed
  – Data/instructions same size
Examples, von Neumann: PCs (Intel 80x86/Pentium), Motorola 68000, Motorola 68xx uC families
Computer architecture taxonomy (3/3)
Harvard architecture
Features of each:
• Execution in 1 cycle
• Parallel fetch of instructions & data
• More complex H/W
  – Instructions and data always separate
  – Different code/data path widths (e.g. 14-bit instructions, 8-bit data)
Examples, Harvard: 8051, Microchip PIC families, Atmel AVR, AndesCore, Atom
CISC - Complex Instruction Set Computers:
• Emphasis on hardware
• Includes multi-clock complex instructions
• Memory-to-memory
• Sophisticated arithmetic (multiply, divide, trigonometry, etc.)
• Special instructions are added to optimize performance with particular compilers
RISC - Reduced Instruction Set Computers:
• A very small set of primitive instructions
• Fixed instruction format
• Emphasis on software
• All instructions execute in one cycle (Fast!)
• Register to register (except Load/Store instructions)
• Pipeline architecture
Coprocessor vs. Hardwired Engines
How can a divide operation be implemented?
• In the main CPU pipeline
• In a coprocessor separated from the main CPU
• As a hardwired engine
User-defined instructions:
• In-core instructions: short latency (1~5 cycles)
  – Light semantics, such as SIMD instructions
• Coprocessor instructions: longer latency (10~100 cycles)
  – Heavier semantics, such as macroblock DCT/VLC or a crypto engine
Hardwired engines for comparison: >1000 cycles
• Used for even larger chunks of data processing (say, slice or frame)
Most popular today.
Dual-core, multi-core, many-core: forms of multiprocessors in a single chip
Small-scale multiprocessors (2-4 cores):
• Utilize task-level parallelism
• Task examples: audio decode, video decode, display control, network packet handling
Large-scale multiprocessors (>32 cores):
• nVidia’s graphics chip: >128 cores
• Sun’s server chips: 64 threads
MPU, DSP, MCU
• CPU: Central Processing Unit
• DSP: Digital Signal Processing/Processor
• MCU: Micro Control Unit
• MPU: Micro Processor Unit
Figure: pipeline stages (Instruction-Fetch First and Second, RF, AG: Data Address Generation, Data Access First and Second, EX, MAC1, MAC2)
F1 – Instruction Fetch First
• Instruction tag/data arrays
• ITLB address translation
• Branch target buffer prediction
F2 – Instruction Fetch Second
• Instruction cache hit detection
• Cache way selection
• Instruction alignment
IF1 IF2 ID RF AG DA1 DA2 WB
EX
I1 – Instruction Issue First / Instruction Decode
• 32/16-bit instruction decode
• Return Address Stack prediction
I2 – Instruction Issue Second / Register File Access
• Instruction issue logic
• Register file access
Execution Stage
E1 – Instruction Execute First / Address Generation / MAC First
• Data access address generation
• Multiply operation (if MAC is present)
E2 – Instruction Execute Second / Data Access First / MAC Second / ALU Execute
• ALU
• Branch/Jump/Return resolution
• Data tag/data arrays
• DTLB address translation
• Accumulation operation (if MAC is present)
E3 – Instruction Execute Third / Data Access Second
• Data cache hit detection
• Cache way selection
• Data alignment
E4 – Instruction Execute Fourth / Write Back
• Interruption resolution
• Instruction retire
• Register file write back
Branch Prediction Overview
Why is branch prediction required?
• A deep pipeline is required for high speed
Why dynamic branch prediction?
• Static branch prediction
• Dynamic branch prediction
Branch Prediction Unit
Branch Target Buffer (BTB)
• 128 entries of 2-bit saturating counters
• 128 entries, 32-bit predicted PC and 26-bit address tag
Return Address Stack (RAS)
• Four entries
BTB and RAS are updated by committing branches/jumps
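A 2-bit saturating counter table like the one in the BTB can be modelled in a few lines. This is a generic textbook sketch: the 128-entry size comes from the slide, but the indexing function `btb_index` and the initial counter state are assumptions, since the actual N12 indexing scheme is not described here.

```c
#include <stdint.h>

#define BTB_ENTRIES 128

/* 2-bit states: 0 = strongly not taken, 1 = weakly not taken,
 *               2 = weakly taken,       3 = strongly taken. */
static uint8_t counters[BTB_ENTRIES];

/* Hypothetical indexing: PC bits above the halfword offset. */
static unsigned btb_index(uint32_t pc)
{
    return (pc >> 1) % BTB_ENTRIES;
}

/* Predict taken when the counter is in one of the two "taken" states. */
int predict_taken(uint32_t pc)
{
    return counters[btb_index(pc)] >= 2;
}

/* Update on a committed branch: count up if taken, down if not,
 * saturating at 0 and 3. */
void train(uint32_t pc, int taken)
{
    uint8_t *c = &counters[btb_index(pc)];
    if (taken) {
        if (*c < 3)
            (*c)++;
    } else {
        if (*c > 0)
            (*c)--;
    }
}
```

Because of the saturation, a single not-taken outcome in a long run of taken branches (a loop exit, for example) does not flip the prediction.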
BTB Instruction Prediction
Because BTB predictions are based on the previous PC instead of the actual instruction decoding information, the BTB may make the following two mistakes:
• Wrongly predicting a non-branch/jump instruction as a branch/jump instruction
• Wrongly predicting the instruction boundary (32-bit -> 16-bit)
If either case is detected, the IFU triggers a BTB instruction misprediction in the I1 stage and restarts the program sequence from the recovered PC. This introduces a 2-cycle penalty.
RAS Prediction
When a return instruction is present in the instruction sequence, a RAS prediction is performed and the fetch sequence is changed to the predicted PC. Since the RAS prediction is performed in the I1 stage, return instructions incur a 2-cycle penalty because the sequential fetches in between are not used.
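A four-entry return address stack can be sketched as a small circular buffer. The overflow behaviour (the oldest entry is overwritten when a fifth call is pushed) is an assumption of this sketch; the slides give only the entry count.

```c
#include <stdint.h>

#define RAS_ENTRIES 4

struct ras {
    uint32_t addr[RAS_ENTRIES];
    unsigned top;     /* next slot to write */
    unsigned count;   /* number of valid entries, at most RAS_ENTRIES */
};

/* On a call: remember the return address. */
void ras_push(struct ras *r, uint32_t return_addr)
{
    r->addr[r->top] = return_addr;
    r->top = (r->top + 1) % RAS_ENTRIES;
    if (r->count < RAS_ENTRIES)
        r->count++;
}

/* On a return: predict the target. Returns 1 and stores the prediction
 * in *predicted_pc, or returns 0 if the stack is empty. */
int ras_pop(struct ras *r, uint32_t *predicted_pc)
{
    if (r->count == 0)
        return 0;
    r->top = (r->top + RAS_ENTRIES - 1) % RAS_ENTRIES;
    *predicted_pc = r->addr[r->top];
    r->count--;
    return 1;
}
```

With call depth up to four, every return is predicted from the stack; deeper nesting loses the outermost return addresses.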
Branch Misprediction
In the N12 processor core, branch/return instructions are resolved by the ALU in the E2 stage, and the result is used by the IFU in the next (F1) stage. In this case, the misprediction penalty is 5 cycles.
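The three penalties quoted in these slides (2 cycles for a BTB instruction misprediction, 2 cycles for a RAS-predicted return, 5 cycles for a branch misprediction resolved in E2) can be combined into a rough cycle estimate. The event counts are hypothetical inputs, for example from performance monitors or a simulator.

```c
/* Penalties in cycles, as stated for the N12 pipeline. */
#define BTB_INSTR_MISPREDICT_PENALTY 2ul
#define RAS_RETURN_PENALTY           2ul
#define BRANCH_MISPREDICT_PENALTY    5ul

/* Estimate the cycles lost to control-flow events. */
unsigned long control_flow_penalty(unsigned long btb_instr_mispredicts,
                                   unsigned long ras_predicted_returns,
                                   unsigned long branch_mispredicts)
{
    return btb_instr_mispredicts * BTB_INSTR_MISPREDICT_PENALTY
         + ras_predicted_returns * RAS_RETURN_PENALTY
         + branch_mispredicts    * BRANCH_MISPREDICT_PENALTY;
}
```

For instance, 10 BTB instruction mispredictions, 100 returns and 40 branch mispredictions cost 10*2 + 100*2 + 40*5 = 420 cycles in this model.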
Figure: CPU, cache controller, L1 cache, L2 cache
Figure: I-Cache, D-Cache (uncached write/write-through, write back, D-Cache refill)
Cache operation
Many main memory locations are mapped onto one cache entry.
May have caches for:
• instructions;
• data;
• data + instructions (unified).
Replacement policy
Replacement policy: strategy for choosing which cache entry to throw out to make room for a new memory location.
Two popular strategies:
• Random.
• Least-recently used (LRU).
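LRU replacement can be illustrated for a single cache set. The sketch below assumes a 4-way set and tracks an age counter per way; this is one simple way to model LRU in software, and hardware implementations typically use cheaper approximations.

```c
#include <stdint.h>

#define WAYS 4

struct cache_set {
    uint32_t tag[WAYS];
    int      valid[WAYS];
    unsigned age[WAYS];   /* accesses since this way was last used */
};

/* Look up `tag` in one set; on a miss, replace the LRU way.
 * Returns 1 for a hit and 0 for a miss. */
int set_access(struct cache_set *s, uint32_t tag)
{
    int way = -1;

    for (int w = 0; w < WAYS; w++)
        if (s->valid[w] && s->tag[w] == tag)
            way = w;

    int hit = (way >= 0);
    if (!hit) {
        way = 0;
        for (int w = 0; w < WAYS; w++) {
            if (!s->valid[w]) {           /* use a free way first */
                way = w;
                break;
            }
            if (s->age[w] > s->age[way])  /* otherwise the oldest way */
                way = w;
        }
        s->tag[way]   = tag;
        s->valid[way] = 1;
    }

    /* Age every way, then mark the accessed way as most recent. */
    for (int w = 0; w < WAYS; w++)
        s->age[w]++;
    s->age[way] = 0;
    return hit;
}
```

After filling the set with tags 1 to 4 and touching tag 1 again, a new tag 5 evicts tag 2, which is then the least recently used entry.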
Write operations
• Write-through: immediately copy write to main memory.
• Write-back: write to main memory only when location is removed from cache.
Goal: reduce the Average Memory Access Time (AMAT)
AMAT = Hit Time + Miss Rate * Miss Penalty
Approaches
• Reduce Hit Time
• Reduce Miss Penalty
• Reduce Miss Rate
Notes
• There may be conflicting goals
• Keep track of clock cycle time, area, and power consumption
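The AMAT formula is easy to evaluate numerically. The figures in the usage note are made-up illustrative values, not measurements of any Andes core.

```c
/* AMAT = Hit Time + Miss Rate * Miss Penalty.
 * Times are in cycles; miss_rate is a fraction between 0 and 1. */
double amat(double hit_time, double miss_rate, double miss_penalty)
{
    return hit_time + miss_rate * miss_penalty;
}

/* Two-level hierarchy: the L1 miss penalty is the AMAT of the L2. */
double amat_two_level(double l1_hit, double l1_miss_rate,
                      double l2_hit, double l2_miss_rate,
                      double memory_penalty)
{
    return amat(l1_hit, l1_miss_rate,
                amat(l2_hit, l2_miss_rate, memory_penalty));
}
```

With a 1-cycle hit time, a 5% miss rate and a 20-cycle miss penalty, AMAT is 1 + 0.05 * 20 = 2 cycles; halving the miss rate brings it to 1.5 cycles, which shows why each of the three terms is a tuning target.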
Tuning Cache Parameters
Size
• Must be large enough to fit the working set (temporal locality)
• If too big, then hit time degrades
Associativity
• Needs to be large to avoid conflicts, but 4-8 way is as good as FA (fully associative)
• If too big, then hit time degrades
Block
• Needs to be large to exploit spatial locality & reduce tag overhead
• If too large, few blocks ⇒ higher misses & miss penalty
Configurable architecture allows designers to make the best performance/cost trade-offs
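The three parameters (size, associativity, block size) determine how an address is split into tag, set index and block offset. The sketch below assumes power-of-two parameters and 32-bit addresses; the 16KB, 4-way, 32-byte configuration in the usage note is an arbitrary example, not a statement about a specific core.

```c
#include <stdint.h>

struct cache_geom {
    uint32_t sets;
    unsigned offset_bits;   /* bits selecting a byte within a block */
    unsigned index_bits;    /* bits selecting a set */
};

static unsigned log2_u32(uint32_t x)
{
    unsigned n = 0;
    while (x > 1) {
        x >>= 1;
        n++;
    }
    return n;
}

/* All three parameters are assumed to be powers of two. */
struct cache_geom cache_geometry(uint32_t size_bytes, uint32_t ways,
                                 uint32_t block_bytes)
{
    struct cache_geom g;
    g.sets        = size_bytes / (ways * block_bytes);
    g.offset_bits = log2_u32(block_bytes);
    g.index_bits  = log2_u32(g.sets);
    return g;
}

uint32_t cache_index(struct cache_geom g, uint32_t addr)
{
    return (addr >> g.offset_bits) & (g.sets - 1);
}

uint32_t cache_tag(struct cache_geom g, uint32_t addr)
{
    return addr >> (g.offset_bits + g.index_bits);
}
```

A 16KB, 4-way cache with 32-byte blocks has 128 sets, so addresses 0x1000 and 0x2000 (4KB apart) share a set index and differ only in tag. Higher associativity lets more such conflicting blocks coexist.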
Figure: CPU, memory management unit, logical address, physical address
Figure: M-TLB entry index
Memory protection (read/write/execute)
• Different permission flags for kernel/user mode
• OS typically runs in kernel mode
• Applications run in user mode
Cache control (cached/uncached)
• Accesses to peripherals and other processors need to be uncached.
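A permission check of this kind can be sketched with per-page attributes. The `struct page_attr` layout and flag values are hypothetical and do not reflect the actual Andes page table entry format.

```c
#include <stdbool.h>
#include <stdint.h>

enum { PERM_R = 1, PERM_W = 2, PERM_X = 4 };

/* Hypothetical per-page attributes. */
struct page_attr {
    uint8_t kernel_perm;   /* PERM_* flags valid in kernel mode */
    uint8_t user_perm;     /* PERM_* flags valid in user mode */
    bool    cacheable;     /* false for peripheral registers */
};

/* True if every requested permission is granted for the current mode. */
bool access_allowed(const struct page_attr *page, bool user_mode,
                    uint8_t requested)
{
    uint8_t granted = user_mode ? page->user_perm : page->kernel_perm;
    return (granted & requested) == requested;
}
```

A device register page would typically be marked uncached and kernel-only, so a user-mode load or store to it fails the check and raises an exception.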
Figure: DMA Controller, Local Memory, Ext. Memory
• Two channels
• One active channel
• Programmed using physical addressing
• For both instruction and data local memory
• External address can be incremented with stride
• Optional 2-D Element Transfer (2DET) feature, which provides an easy way to transfer two-dimensional blocks from external memory.
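What a two-dimensional block transfer does can be shown with a software model. This is a plain C copy loop standing in for the DMA engine; the function name and parameters are illustrative and do not correspond to the LMDMA register interface.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Copy a width x height block of bytes from external memory, where
 * consecutive rows are ext_pitch bytes apart, into local memory,
 * where the rows are packed back to back. */
void dma_2d_copy(uint8_t *local, const uint8_t *ext,
                 size_t width, size_t height, size_t ext_pitch)
{
    for (size_t row = 0; row < height; row++)
        memcpy(local + row * width, ext + row * ext_pitch, width);
}
```

For a 4x4 image holding the values 0 to 15, copying the 2x2 block that starts at row 1, column 1 with a pitch of 4 yields 5, 6, 9, 10 in local memory. A hardware engine performs the same address stepping without occupying the core.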
Width byte stride (in DMA Setup register) = 1
LMDMA Double Buffer Mode
Figure: Local Memory Bank 0, Core Pipeline, External Memory
• Computation
• Data movement
• Bank switch between core and DMA engine
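The double-buffer (ping-pong) pattern can be sketched sequentially. In the model below the "DMA" is an ordinary copy and the "computation" is a sum, both hypothetical stand-ins; on hardware the fill of one bank would overlap in time with the computation on the other.

```c
#include <stddef.h>
#include <string.h>

#define BANK_WORDS 4

/* Stand-in for the DMA engine: move one block into a local bank. */
static void dma_fill(int *bank, const int *ext)
{
    memcpy(bank, ext, BANK_WORDS * sizeof *bank);
}

/* Stand-in for the core's computation on one bank. */
static long compute(const int *bank)
{
    long sum = 0;
    for (size_t i = 0; i < BANK_WORDS; i++)
        sum += bank[i];
    return sum;
}

/* Process n_blocks blocks of BANK_WORDS words from external memory. */
long process_stream(const int *ext, size_t n_blocks)
{
    int  bank[2][BANK_WORDS];
    int  dma_bank = 0;
    long total = 0;

    if (n_blocks == 0)
        return 0;

    dma_fill(bank[dma_bank], ext);          /* prime the first bank */
    for (size_t b = 0; b < n_blocks; b++) {
        int core_bank = dma_bank;           /* bank switch */
        dma_bank ^= 1;
        if (b + 1 < n_blocks)               /* data movement for next block */
            dma_fill(bank[dma_bank], ext + (b + 1) * BANK_WORDS);
        total += compute(bank[core_bank]);  /* computation on current block */
    }
    return total;
}
```

The core never waits for data after the first fill, provided the DMA finishes each transfer before the core finishes computing on the other bank.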
BIU introduction
The Bus Interface Unit is responsible for off-CPU memory access, which includes:
• System memory access
• Instruction/data local memory access
• Memory-mapped register access in devices
Bus Interface
• Compliance to AHB/AHB-Lite/APB
• High Speed Memory Port
• Andes Memory Interface
• External LM Interface
HSMP – High speed memory port
N12 also provides a high speed memory port interface, which has higher bus protocol efficiency and can run at a higher frequency to connect to a memory controller. The high speed memory port will be AMBA 3.0 (AXI) protocol compliant, but with reduced I/O requirements.
WWW.ANDESTECH.COM