Processors selection

Processing Elements and their selection

By Pradeep Shankhwar

Presentation layout

• Computing elements
• Processor architectures
• Processors
– Microcontroller
– PowerPC
– ARM
– MIPS
– DSPs
– GPU
• Selection
• Conclusion

Computing Elements

• Microprocessors
– ARM, Intel, AMD, PPC, Motorola, MIPS etc.
• Microcontrollers
– ARM, Intel, Atmel, Motorola etc.
• Digital Signal Processor (DSP)
– ADI DSPs and TI DSPs
• Graphics Processing Unit (GPU)
– Nvidia and ATI GPUs
• System on Chip (SoC)
– Freescale i.MX51/53, TI DaVinci platform
• Application Specific IC (ASIC)
– Crypto elements, Ethernet controller, USB controller, serial controller etc.
• FPGA

Computing Element Architecture

• Architecture is concerned with
– the internal structure of the processor and the interconnections between its ALU, control unit, address generator, and instruction decoder, and with the pipelined execution of instructions

Architecture defining parameters

• Number of ALUs/FPUs
• Number of memory units
• On-chip resources
• External I/O interfaces
• Number of cores
• Clock frequency of the chip
• Power requirement
• Endianness (big/little)
• Instruction set requirements
• Memory handling architecture: stack, register-memory, accumulator, load/store
• Complex?
• DSP capability: multiply/accumulate?
• Addressing modes and address space supported
• Width of the machine?
• Instruction pipelining support
• Computing pipelining support
• Cache size and levels

Kind of Architectures

Von Neumann vs. Harvard

Von Neumann
• Named after the mathematician and computer scientist John von Neumann.
• The computer has a single memory that holds both data and program.
• Instruction fetches and data accesses share one path to memory, so an instruction that touches data typically needs at least two memory cycles.
• Pipelining is harder with this architecture, because instruction fetch and data access compete for the same memory.
• This is the older of the two architectures; many modern designs adopt Harvard-style separate instruction and data paths instead.

Harvard
• Named after the "Harvard Mark I", an early relay-based computer.
• The computer has two separate memories, one for data and one for program.
• The processor can complete an instruction in one cycle if appropriate pipelining strategies are implemented.
• Most modern computing architectures are based on the Harvard architecture, although the number of pipeline stages varies from system to system.

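[Figure: Harvard architecture block diagram. The CPU, with its program counter (PC), connects to the program memory and to the data memory, each through its own address and data lines. Input and output blocks are marked in response to the slide's question "So where is the Input/Output?"]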

CPU Buses

Code Sequence C = A + B for Four Instruction Sets

Stack      Accumulator   Register (register-memory)   Register (load-store)
Push A     Load A        Load R1, A                   Load R1, A
Push B     Add B         Add R1, B                    Load R2, B
Add        Store C       Store C, R1                  Add R3, R1, R2
Pop C                                                 Store C, R3

General form of the add operation in each register-based style:
• Accumulator: acc = acc + mem[C]
• Register-memory: R1 = R1 + mem[C]
• Load-store: R3 = R1 + R2
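The code sequences for C = A + B can be tried out with a small simulation. The sketch below is illustrative only: the interpreter functions `run_stack` and `run_load_store`, the tuple encoding of instructions, and the dict-based memory are all assumptions made for this example, not part of any real ISA.

```python
# Hypothetical mini-interpreters for two of the four instruction set styles.
# Memory is a dict from symbolic address to value.

def run_stack(program, memory):
    """Run a zero-address (stack) program; operands are implicit on the stack."""
    stack = []
    for op, *args in program:
        if op == "Push":
            stack.append(memory[args[0]])
        elif op == "Add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == "Pop":
            memory[args[0]] = stack.pop()
        else:
            raise ValueError(f"unknown stack instruction: {op}")
    return memory


def run_load_store(program, memory):
    """Run a load-store program; only Load and Store touch memory."""
    regs = {}
    for op, *args in program:
        if op == "Load":       # Load Rd, addr
            regs[args[0]] = memory[args[1]]
        elif op == "Add":      # Add Rd, Rs1, Rs2
            regs[args[0]] = regs[args[1]] + regs[args[2]]
        elif op == "Store":    # Store addr, Rs
            memory[args[0]] = regs[args[1]]
        else:
            raise ValueError(f"unknown load-store instruction: {op}")
    return memory


stack_program = [("Push", "A"), ("Push", "B"), ("Add",), ("Pop", "C")]
load_store_program = [
    ("Load", "R1", "A"),
    ("Load", "R2", "B"),
    ("Add", "R3", "R1", "R2"),
    ("Store", "C", "R3"),
]

print(run_stack(stack_program, {"A": 2, "B": 3})["C"])            # 5
print(run_load_store(load_store_program, {"A": 2, "B": 3})["C"])  # 5
```

Both programs compute the same result; the difference is where the operands live (an implicit stack versus named registers) and which instructions are allowed to access memory.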

Addressing Modes

No.  Addressing mode    Example               Action
1    Register direct    Add R4, R3            R4 <- R4 + R3
2    Immediate          Add R4, #3            R4 <- R4 + 3
3    Displacement       Add R4, 100(R1)       R4 <- R4 + M[100 + R1]
4    Register indirect  Add R4, (R1)          R4 <- R4 + M[R1]
5    Indexed            Add R4, (R1 + R2)     R4 <- R4 + M[R1 + R2]
6    Direct             Add R4, (1000)        R4 <- R4 + M[1000]
7    Memory indirect    Add R4, @(R3)         R4 <- R4 + M[M[R3]]
8    Autoincrement      Add R4, (R2)+         R4 <- R4 + M[R2]; R2 <- R2 + d
9    Autodecrement      Add R4, (R2)-         R4 <- R4 + M[R2]; R2 <- R2 - d
10   Scaled             Add R4, 100(R2)[R3]   R4 <- R4 + M[100 + R2 + R3*d]
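To make the "Action" column concrete, the sketch below computes the second operand of `Add R4, <operand>` for most of the modes listed. It is a minimal illustration under stated assumptions: the function name `operand`, its parameters, and the dict-based registers and memory are hypothetical, and `d` is assumed to be the operand size in bytes (4 here).

```python
def operand(mode, regs, mem, base=None, index=None, const=0, d=4):
    """Return the second operand of 'Add R4, <operand>' for one addressing mode.

    regs maps register names to values, mem maps addresses to values,
    and d is the assumed operand size in bytes.
    """
    if mode == "register":           # Add R4, R3
        return regs[base]
    if mode == "immediate":          # Add R4, #3
        return const
    if mode == "displacement":       # Add R4, 100(R1)
        return mem[const + regs[base]]
    if mode == "register_indirect":  # Add R4, (R1)
        return mem[regs[base]]
    if mode == "indexed":            # Add R4, (R1 + R2)
        return mem[regs[base] + regs[index]]
    if mode == "direct":             # Add R4, (1000)
        return mem[const]
    if mode == "memory_indirect":    # Add R4, @(R3)
        return mem[mem[regs[base]]]
    if mode == "autoincrement":      # Add R4, (R2)+ : read, then bump the register
        value = mem[regs[base]]
        regs[base] += d
        return value
    if mode == "scaled":             # Add R4, 100(R2)[R3]
        return mem[const + regs[base] + regs[index] * d]
    raise ValueError(f"unknown addressing mode: {mode}")


regs = {"R1": 8, "R2": 4, "R3": 2}
mem = {2: 8, 4: 19, 8: 11, 12: 13, 108: 7, 112: 23, 1000: 17}

print(operand("displacement", regs, mem, base="R1", const=100))            # 7
print(operand("scaled", regs, mem, base="R2", index="R3", const=100))      # 23
print(operand("memory_indirect", regs, mem, base="R3"))                    # 11
```

Note that autoincrement is the only mode here with a side effect on the register file, which is why it mutates `regs`.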

What is CISC?

• CISC stands for Complex Instruction Set Computer.
• Instructions may require multiple clock cycles to execute.
• Instructions are variable length, and the length often varies with the addressing mode.
• There is a small number of general purpose registers.
• The goal is chips that are easy to program and that make efficient use of memory. Since the earliest machines were programmed in assembly language and memory was slow and expensive, the CISC philosophy made sense.
• CISC was developed to make compiler development simpler. It shifts most of the burden of generating machine instructions to the processor.

CISC contd…

• Several special purpose registers. Many CISC designs set aside special registers for the stack pointer, interrupt handling, and so on. This can simplify the hardware design somewhat, at the expense of making the instruction set more complex.
• Recent changes in software and hardware technology have forced a re-examination of CISC, and many modern CISC processors are hybrids that implement many RISC principles.
• Most common microprocessor designs, such as the Intel 80x86 and Motorola 68K series, followed the CISC philosophy.
• The philosophy was also implemented in larger computers such as the PDP-11 and the DECsystem 10 and 20 machines.
• E.g. the Pentium is considered a modern CISC processor.

CISC Disadvantages

• The instruction set and chip hardware become more complex with each generation of computers.
• Many specialized instructions are not used frequently enough to justify their existence: approximately 20% of the available instructions are used in a typical program.
• Condition codes are set as a side effect of many instructions. Not only does setting the condition codes take time, but programmers have to remember to examine the condition code bits before a subsequent instruction changes them.

What is RISC?

• RISC, or Reduced Instruction Set Computer, is a type of microprocessor architecture that uses a small, highly optimized set of instructions.
• RISC processors aim for a CPI (cycles per instruction) of one.
• Pipelining: a technique that overlaps the execution of parts, or stages, of successive instructions so that instructions are processed more efficiently.
• Large number of registers: the RISC design philosophy generally incorporates a larger number of registers to reduce the number of interactions with memory.
• Early examples: the IBM 801, Stanford MIPS, and Berkeley RISC I and II.

RISC contd…

• Less complex, simple instructions.
• Hardwired control unit and machine instructions.
• Few addressing schemes for memory operands, with only two basic instructions, LOAD and STORE.
• Many symmetric registers, organised into a register file.

Big & Little Endian

• Little endian machines store the least significant byte first (at the lowest address), followed by the more significant bytes.
• Big endian machines store the most significant byte first (at the lowest address).
• As an example, suppose we have the hexadecimal number 12345678.
• The big endian and little endian arrangements of its bytes, from the lowest address upward, are:
– Big endian: 12 34 56 78
– Little endian: 78 56 34 12
• Big endian:
– Is more natural.
– The sign of the number can be determined by looking at the byte at address offset 0.
– Strings and integers are stored in the same order.
• Little endian:
– Makes it easier to place values on non-word boundaries.
– Conversion from a 16-bit integer address to a 32-bit integer address does not require any arithmetic.
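The byte orders for the example value 12345678 (hexadecimal) can be checked directly. This is a small sketch using Python's built-in `int.to_bytes` and `int.from_bytes`; the helper name `byte_layout` is an assumption made for the example.

```python
import sys


def byte_layout(value, width, order):
    """Return the bytes of value as stored from the lowest address upward."""
    return value.to_bytes(width, byteorder=order)


value = 0x12345678
big = byte_layout(value, 4, "big")
little = byte_layout(value, 4, "little")

print(big.hex(" "))      # 12 34 56 78
print(little.hex(" "))   # 78 56 34 12

# Reading the same bytes with the wrong byte order gives a different number.
print(hex(int.from_bytes(big, byteorder="little")))  # 0x78563412

# Byte order of the machine running this script.
print(sys.byteorder)
```

The last line prints `little` on x86 machines; the value depends on the host, so no fixed output is shown for it.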

80x86 Instruction Frequency

Rank   Instruction      Frequency
1      load             22%
2      branch           20%
3      compare          16%
4      store            12%
5      add              8%
6      and              6%
7      sub              5%
8      register move    4%
9      call             1%
10     return           1%
       Total            96%

Microcontroller (uC)

• Program memory (ROM) and data memory (RAM)
• Provision for extension of memory
• Simple addressing modes: direct/indirect addressing
• Special function registers

Microcontroller architecture

• In addition to the processor:
– On-chip memory (RAM, ROM)
– Clocking
– I/O pins
– Interrupts
– Timers
– Peripherals
– Serial I/Os
– ADC inputs
– DAC outputs
– PWM outputs
• Meant for low-computation tasks
– Can handle industrial control applications
– Can also work as a supporting chip to the main processor
– Many peripheral controllers are built around microcontrollers: Ethernet, USB, serial, Wi-Fi, FireWire, Bluetooth etc.

Power Architecture

• POWER: Performance Optimization With Enhanced RISC
• IBM came first with the RISC System/6000 (RS/6000)
• The Power architecture incorporated many RISC attributes:
– fixed-length instructions
– register-to-register architecture
– simple addressing modes
– large general register file
– three-operand instruction format
• It also has characteristics taken from more complex ISAs:
– designed to be superscalar
– compound instructions
• The AIM alliance was formed, which resulted in PowerPC

PowerPC Architecture

• In order to maintain RS/6000 software compatibility, the PowerPC adapted the POWER architecture, and many enhancements were added to provide a low-cost, single-chip, superscalar, multiprocessor-capable, 64-bit processor
• Support for operation in both big-endian and little-endian modes
• Single and double precision floating-point arithmetic
• 64-bit architecture, backward compatible to 32-bit
• Complex string instructions were left out, consistent with the RISC philosophy
• Several bit/field instructions that use three source operands were eliminated to avoid the need for extra register ports
• Instructions whose operation depended on the value of a source operand were eliminated
• Precision shifts, integer multiplies, and divide-with-remainder instructions were omitted

PowerPC family

• PowerPC 601:
– includes a more sophisticated branch unit
– capable of dispatching three "out-of-order" instructions per cycle
– up to 8 instructions per cycle can be fetched directly into an eight-entry instruction queue (IQ), where they are decoded before being dispatched to the execution core
– medium-sized, medium-performance processor
– Branch folding: the instruction queue is used for detecting and dealing with branches. The branch unit scans the bottom four entries of the queue, identifying branch instructions and determining what type they are (conditional, unconditional).
• PowerPC 603:
– smaller die size than the 601
– smaller cache
– capable of dispatching three "out-of-order" instructions per cycle
• The 604 and 620 microprocessors were developed later in the PowerPC production line. Both aimed for higher performance. The 604 was based on the 32-bit architecture, while the 620 is a 64-bit architecture.

PowerPC family

– PowerPC e200: 32-bit Power Architecture microprocessor, speeds ranging up to 600 MHz, ideal for embedded applications
– PowerPC e300: similar to the e200, with speeds increased up to 667 MHz
– PowerPC e600: speeds up to 2 GHz, ideal for high performance routing and telecommunications applications
– POWER5: IBM, dual core μP
– POWER6: IBM, dual core μP. A notable difference from POWER5 is that the POWER6 executes instructions in-order instead of out-of-order
– PowerPC G3: Apple Macintosh computers such as the PowerBook G3, the multicolored iMacs, iBooks and several desktops, including both the Beige and the Blue and White Power Macintosh G3s
– PowerPC G4: a designation used by Apple Computer to describe a fourth generation of 32-bit PowerPC microprocessors
– PowerPC G5: 64-bit Power Architecture processors
– Xenon: based on IBM's PowerPC ISA, used in the Xbox 360 game console
– Broadway: based on IBM's PowerPC ISA, used in the Nintendo Wii gaming console
– Blue Gene/L: dual core PowerPC 440, 700 MHz, 2004
– Blue Gene/P: quad core PowerPC 450, 850 MHz, 2007

PowerPC G4e Pipelining

• Seven-stage pipeline
• Superscalar microprocessor: allows multiple instructions to be executed in parallel
• Nine execution units:
– BPU: Branch Processing Unit
– VPU: Vector Permute Unit
– VIU: Vector Integer Unit
– VCIU: Vector Complex Integer Unit
– VFPU: Vector Floating Point Unit
– FPU: Floating Point Unit
– IU: Integer Unit
– CIU: Complex Integer Unit
– LSU: Load/Store Unit

Pros and Cons

• Instruction set
– 200 machine instructions
– More complex than most RISC machines
– E.g. floating-point "multiply and add" instructions that take three input operands
– E.g. load and store instructions may automatically update the index register to contain the just-computed target address
• Pipelined execution
– More sophisticated than SPARC
• Input and output
– Two different modes:
– Direct-store segment: maps virtual address space to an external address space
– Normal virtual memory access
• Permits a range of implementations, from low cost controllers through high performance processors.

ARM (Advanced RISC Machine)

• ARM is a leading IP provider of high performance, low cost, power efficient processors, peripherals and SoCs, through involvement with the Virtual Socket Interface Alliance (VSIA) and the Virtual Component Exchange (VCX)

• Four major OS platforms are supported
– Embedded CE, Linux, Symbian and Palm OS

• ARM does not manufacture chips; it provides services to 40 licensed partners and finally validates test chips

• ARM's Global Technology Partner Network is the largest in the industry

ARM's solution

• ARM offers the industry more than hardened macros and synthesizable CPUs
• It provides the ASIC infrastructure in the form of AMBA, the PrimeCell peripherals, and models and modeling tools for the cores
• ARM also needs to pursue ports for RTOSs, develop debug hardware and software development tools, and, of course, embedded software for "off-the-shelf" integration
• ARM is a full-solutions provider, supporting a broad range of applications

ARM architecture

• Many SoCs are built around ARM
– Apple's A4/A5/A5X, Nvidia's Tegra
– Samsung's Exynos, TI's OMAP and DaVinci platforms, Freescale's i.MX51, i.MX53 etc.
– Qualcomm's Snapdragon series etc.

ARM architecture

• The ARM uses a modified Harvard, load/store architecture, i.e.,
– A single 32-bit data bus carries both instructions and data.
– Only the load/store instructions (and SWP) access memory.
• Memory is addressed as a 32-bit address space
• Most ARMs implement two instruction sets
– 32-bit ARM instruction set
– 16-bit Thumb instruction set
• Jazelle cores can also execute Java bytecode
• Execution states
– ARM state: the processor executes 32-bit ARM instructions
– Thumb state: the processor executes 16-bit Thumb instructions
– Jazelle state: the processor executes 8-bit Java bytecode
• DSP instructions (multiply-accumulate)

ARM block diagram

[Figure: ARM block diagram. Labelled blocks: ARM core, on-chip RAM, TIC, arbiter, decoder, reset, bus interface / external bus interface with external ROM and external RAM, bridge, timer, interrupt controller, remap/pause. Buses: system bus (AHB or ASB) and peripheral bus (APB).]

• AMBA
– Advanced Microcontroller Bus Architecture
• ADK
– Complete AMBA Design Kit
• ACT
– AMBA Compliance Testbench
• PrimeCell
– ARM's AMBA compliant peripherals

Thumb

• Thumb is a 16-bit instruction set
– Optimised for code density from C code (~65% of ARM code size)
– Improved performance from narrow memory
– Subset of the functionality of the ARM instruction set
• Core has an additional execution state, Thumb
– Switch between ARM and Thumb using the BX instruction

[Figure: the 32-bit ARM instruction ADDS r2,r2,#1 (bits 31 to 0) shown next to the equivalent 16-bit Thumb instruction ADD r2,#1 (bits 15 to 0).]

For most instructions generated by the compiler:
• Conditional execution is not used
• Source and destination registers are identical
• Only low registers are used
• Constants are of limited size

Microprocessor Without Interlocked Pipeline Stages (MIPS)

• Main memory used for composite data– Arrays, structures, dynamic data

• Memory is byte addressed– Each address identifies an 8-bit byte

• Words are aligned in memory– Address must be a multiple of 4

• MIPS is treated here as big endian (many implementations can be configured for either byte order)

• Reg 0 is the constant zero ($zero)

• The R10000 has three pipelines: a five-stage pipeline for integer instructions, a seven-stage pipeline for floating-point instructions, and a six-stage pipeline for LOAD/STORE instructions.

• In all MIPS ISAs, only the LOAD and STORE instructions can access memory

• The ISA uses only base addressing mode

• MIPS instruction sets: MIPS I/II/III/IV/V, MIPS32, MIPS64

• Processors: R2000/3000/4000 up to R16000 etc.

MIPS

• The stored-program concept:
– Instructions are represented as numbers
– Programs can be stored in memory to be read or written just like data
• MIPS
– ISA developed in the early 80's (RISC)
– Similar to other RISC architectures developed since the 1980's
– Almost 100 million MIPS processors manufactured in 2002
– Used by NEC, Nintendo, Cisco, Silicon Graphics, Sony, …
– Regular (32-bit instructions, small number of instruction formats)
– Relatively small number of instructions
– Register architecture (all instructions operate on registers)
– Load/store architecture (memory accessed only with load/store instructions, with few addressing modes)
– All arithmetic instructions have 3 operands
– Operand order is fixed

Design Principles for MIPS

• Simplicity favors regularity
– All instructions are 32 bits
– All arithmetic instructions have 3 operands
• Smaller is faster
– Only 32 registers
• Good design demands good compromises
– All instructions are the same length
– Limited number of instruction formats: R, I, J
• Make common cases fast
– 16-bit immediate constant
– Only two branch instructions
• Virtually every new ISA designed after 1980 is a load-store ISA (i.e. RISC, to simplify CPU design)

MIPS contribution

[Figure: stacked chart by year, 1998 to 2002, vertical axis 0 to 1400, with series ARM, IA-32, MIPS, Motorola 68K, PowerPC, Hitachi SH, SPARC, and Other.]

• Cable modems: 94%
• DSL modems: 40%
• VDSL modems: 93%
• IDTV: 40%
• Cable STBs: 76%
• DVD recorders: 75%
• Game consoles: 76%
• Office automation: 48%
• Color laser printers: 62%
• Commercial color copiers: 73%
• Source: website of MIPS Technologies, Inc., 2004.

Java Virtual Machine (JVM)

• Java runs on the JVM

• A JVM is written in a native language for a wide array of processors, including MIPS and Intel

• Like a real machine, the JVM has an ISA all of its own, called bytecode. This ISA was designed to be compatible with the architecture of any machine on which the JVM is running

• Java bytecode is a stack-based language.

• Most instructions are zero address instructions.

• The JVM has four registers that provide access to five regions of main memory.

• All references to memory are offsets from these registers. Java uses no pointers or absolute memory references.

• Java was designed for platform interoperability, not performance!

General DSP Architecture

• A good definition is hard to find; one is: changing or analyzing information that is measured as discrete sequences of numbers
• Most DSP applications share common features:
– They use a lot of maths (multiplying and adding signals)
– They deal with signals that come from the real world
– They require a response within a certain time

DSP

• DSP support for parallel moves
– Need to fetch the next coefficient and the next stored value at each step in the filter
– DSPs generally support a parallel move or fetch operation while the MAC is computed
– This design avoids an idle ALU and idle data buses
• DSP algorithms often have "multiply-accumulate" requirements: coef[n] * data[n], where two operands must be fetched
• A simple FIR filter is given by

\[ y[n] = \sum_{i=0}^{N-1} b_i \, x[n-i] \]

• Digital filters require an accumulated sum of products
• Multiple address generators handle separate memory spaces
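The FIR sum of products maps directly onto a multiply-accumulate (MAC) loop. The sketch below is a plain software illustration, not DSP code: the function name `fir_filter` is an assumption, and samples before the start of the input are assumed to be zero.

```python
def fir_filter(coeffs, samples):
    """Compute y[n] = sum over i of coeffs[i] * samples[n - i].

    Samples with a negative index are treated as zero (an assumption
    about the initial state of the filter's delay line).
    """
    n_taps = len(coeffs)
    output = []
    for n in range(len(samples)):
        acc = 0.0
        for i in range(n_taps):
            if n - i >= 0:
                # One multiply-accumulate: fetch a coefficient and a stored
                # sample, multiply them, and add the product to the accumulator.
                acc += coeffs[i] * samples[n - i]
        output.append(acc)
    return output


# Two-tap moving average.
print(fir_filter([0.5, 0.5], [2, 4, 6]))  # [1.0, 3.0, 5.0]
```

Each pass through the inner loop needs two operand fetches (a coefficient and a sample) per MAC, which is the access pattern that parallel moves and multiple address generators on a DSP are meant to serve.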

DSP performance comparison

Architectural overview

• Harvard architecture
• On-chip memory
• ALU
• Multiplier
• On-chip I/Os
• Separate address spaces for program memory, data memory, and I/O
• Pipelined operations
• Single-cycle multiply-accumulate capability
• Specialized addressing modes
• Specialized execution control
• Irregular instruction sets
• Support for complex instructions
• Multiple computing units to support data handling in parallel
• More registers for faster data access
• Higher bus bandwidth

Irregular Instruction Sets

Unlike general microprocessors, DSP instructions allow arithmetic operations to be carried out in parallel with data moves. Example: four instructions in one execution set.

Unit          Instruction
DALU Instr    MACR -D0, D1, D7
DALU Instr    AND D4, D5
AGU Instr     MOVE.L (R0)+N0, R6
AGU Instr     ADDA R2, R3

Specialized execution control

DSP processors provide a loop instruction for fast nesting of repetitive operations. This is usually implemented in hardware to increase speed.

Direct comparison

Processor     MHz   MIPS   DSP benchmarks   ISR latency   Power           Price   Dimensions (in)
Pentium MMX   233   233    49               1.38 us       4.25 W          $213    5.5 x 2.47 x .647
Pentium MMX   266   266    56               1.38 us       4.85 W          $348    5.5 x 2.47 x .647
TMS320C62     120   960    62               0.09 us       1.14 W (est.)   $25     1.3 x 1.3 x .07
TMS320C62     200   1600   103              0.09 us       1.9 W           $96     1.3 x 1.3 x .07

GPU

• In 1999, Nvidia introduced the GeForce 256, marketed as the first GPU; it was a fixed-function device
• ATI and 3dfx also made their own devices
• General architecture of the Nvidia 8800:
– 8 thread processing clusters (TPC)
– Each TPC has two streaming multiprocessors (SM)
– Each SM has 8 scalar processors (SP), giving 8 x 2 x 8 = 128 SPs in total
– Each SP is equipped with its own ALU and FPU

GPU Architecture

• GPUs focus on increasing raw compute power, so that more primitives (vertices, triangles, pixels) can be processed
• GPUs use ever smaller transistor sizes to dramatically increase the number of processors, aiming at ever-larger data throughput
• CPUs, by contrast, focus on instruction-level parallelism and on reducing latency
• A GPU contains more ALUs than a CPU, which implies higher arithmetic throughput, with less emphasis on cache and control unit
• Many parallel arithmetic operations means applying the same operation to a huge data set
• Graphics is the best example: pixels are rendered in parallel
• However, the programmer has to parallelize the application suitably
• When the square root of an array of numbers was taken on a quad-core Xeon (2.33 GHz) and an NVIDIA® Tesla C870 (1.35 GHz), the GPU emerged as ~400 times faster
• Memory access is restrictive compared to a CPU

GPU architecture example

Fermi architecture of GPU

Elaborated view of SM

Softcore processor

• They are used in the FPGA design flow
• They are used in SoC development
• They are available in various flavors
– PicoBlaze, MicroBlaze, ARM, Nios II, LEON3/4, CPU86, TSK3000A, TSK51/52, Cortex-M1, OpenRISC
• They can be programmed like a normal CPU
• CPU footprint is under user control
• Multiple instances can be created
• Ideal when the application demands both an embedded and an FPGA approach

Requirement analysis

• Study of dataset
– Is there parallelism?
• Timing requirement of application
– Soft real-time, hard real-time
• I/O bound or CPU bound applications
• Algorithmic complexity
• Multitasking or non-multitasking solution
– Scheduler based application
– Monolithic application
• Heterogeneous tasking solution
– Single card or multi card solution
– Bus based data sharing, or sharing through dedicated I/Os or interfaces

Requirement analysis

• Time to market
– Buying for R&D/learning purposes
– To be used in a field application
• Availability of the part in extended temperature range or MIL grade
• Overall cost of development
– In-house efforts
– Cost of customization
• Availability of development tools
– Open source supported
– Only proprietary tools

General Purpose Hardware

• PC based hardware is often called general purpose hardware
– Day-to-day documentation & presentation
– Offline data analysis
– Simulation of activities
– Gaming, database, multimedia applications
– Internet based applications: mail, browsing, e-transactions and online database applications
• No pressure of time
• More sequential processing
• When you need more interaction with the system
• Sometimes it works as a console for many systems
• As a development host
• A PC has powerful hardware, but it is highly underutilized as a PC
– E.g. an Intel or AMD processor based PC

Hardware for Multimedia App

• Video encoding, video decoding & image compression
– Possible with DSPs like the C64xx and C67xx from TI
– DaVinci devices like DM365, DM368, DM6467T, DM642 etc.
– Freescale i.MX51, i.MX53 etc.
• Applications
– Video transmission: LAN, WAN, Internet, surveillance, CCTV coverage
– Recording:
• CD, DVD, in-built recording in defence equipment
• Handheld cameras and camcorders
• DTH services, IPTV services

Hardware for video processing in defence equipment

• Single video processing
– DSPs are preferred
– OEM supported image/video processing APIs are provided as a development framework
– Convenient to use (single front end)
• Multi-video processing
– FPGAs are preferred
– GPUs can also be used
– The developer has to develop every module
– May take advantage of IP cores for complex processing modules
– A more compact solution is possible

Can we live with open source solution?

• Open source h/w architecture
– ARM (widely licensed, though not strictly open source)

• Open source mobile platform kernel
– Android is a big example

• Open source development tools
– Linux, Mozilla, Thunderbird, Java, MySQL, Tomcat server, Apache server, Qt etc.

• Open source APIs for dedicated purposes
– OpenCV, OpenGL, OpenCL, Live555, FFmpeg etc.

• Yes: we can definitely live with open source

RTOSes

• pSOS from Integrated Systems Inc.
• VxWorks from Wind River
• Integrity from Green Hills
• QNX
• RTLinux
• Pico Linux
• MontaVista Linux
• Embedded NT
• Etc.

Conclusion

• Identifying the computing and IO needs is first

• Find the availability of prototyping tools and hardware

• ………………………..• ………………………..• ………………………..

Thank you

Relative Frequency of Control Instructions

Operation      SPECint92   SPECfp92
Call/Return    13%         11%
Jumps          6%          4%
Branches       81%         87%

• Design hardware to handle branches quickly, since these occur most frequently

University of Pittsburgh: MIPS Instruction Set Architecture

MIPS Architecture

• Design “philosophies” for ISAs: RISC vs. CISC
• Execution time = instructions per program * cycles per instruction * seconds per cycle
• MIPS is an implementation of a RISC architecture
• MIPS R2000 ISA
– Designed for use with high-level programming languages: small set of instructions and addressing modes, easy for compilers
– Minimize/balance the amount of work (computation and data flow) per instruction: allows for parallel execution
– Load-store machine: large register set, minimize main memory access
– Fixed instruction width (32 bits), small set of uniform instruction encodings: minimize control complexity, allow for more registers
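The execution time relation (instructions per program * cycles per instruction * seconds per cycle) can be worked through with numbers. The figures below are made-up illustrative values, not measurements, and the function name `execution_time` is an assumption for this sketch.

```python
def execution_time(instruction_count, cpi, clock_hz):
    """Execution time in seconds.

    time = instructions/program * cycles/instruction * seconds/cycle,
    where seconds/cycle is 1 / clock_hz.
    """
    return instruction_count * cpi / clock_hz


# Hypothetical program: 1 billion instructions, average CPI of 1.5, 500 MHz clock.
baseline = execution_time(1_000_000_000, 1.5, 500_000_000)
print(baseline)  # 3.0 seconds

# Same program and clock, but with the CPI brought down to 1.0.
improved = execution_time(1_000_000_000, 1.0, 500_000_000)
print(improved)  # 2.0 seconds
```

The three factors trade off against each other: a simpler instruction set may raise the instruction count while lowering CPI and cycle time, which is the balance the RISC vs. CISC discussion is about.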



MIPS Instructions

• MIPS instructions fall into these classes:
– Arithmetic/logical/shift/comparison
– Control instructions (branch and jump)
– Load/store
– Other (exception, register movement to/from GP registers, etc.)
• Three instruction encoding formats:
– R-type (6-bit opcode, 5-bit rs, 5-bit rt, 5-bit rd, 5-bit shamt, 6-bit function code)
– I-type (6-bit opcode, 5-bit rs, 5-bit rt, 16-bit immediate)
– J-type (6-bit opcode, 26-bit pseudo-direct address)
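The three encoding formats can be illustrated by packing the fields into a 32-bit word. This is a minimal sketch: the helper names `encode_r`, `encode_i`, and `encode_j` are assumptions for the example, while the field widths follow the list above and the register numbers ($t0 = 8, $s1 = 17, $s2 = 18) and opcodes follow the standard MIPS32 conventions.

```python
def encode_r(rs, rt, rd, shamt, funct, opcode=0):
    """R-type: opcode(6) rs(5) rt(5) rd(5) shamt(5) funct(6)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (rd << 11) | (shamt << 6) | funct


def encode_i(opcode, rs, rt, imm):
    """I-type: opcode(6) rs(5) rt(5) immediate(16, two's complement)."""
    return (opcode << 26) | (rs << 21) | (rt << 16) | (imm & 0xFFFF)


def encode_j(opcode, target):
    """J-type: opcode(6) pseudo-direct address(26)."""
    return (opcode << 26) | (target & 0x3FFFFFF)


T0, S1, S2 = 8, 17, 18

# add $t0, $s1, $s2  (opcode 0, funct 0x20)
print(hex(encode_r(rs=S1, rt=S2, rd=T0, shamt=0, funct=0x20)))  # 0x2324020

# addi $t0, $t0, -1  (opcode 8)
print(hex(encode_i(opcode=8, rs=T0, rt=T0, imm=-1)))            # 0x2108ffff

# j to word address 0x100  (opcode 2)
print(hex(encode_j(opcode=2, target=0x100)))                    # 0x8000100
```

Because every format is 32 bits wide with the opcode in the same position, the decoder can locate the register fields before it knows which instruction it is looking at.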



MIPS ISA

• MIPS pipeline stages
– Fetch (F)
• read the next instruction from memory, increment the address counter
• assume 1 cycle to access memory
– Decode (D)
• read register operands, decode the instruction into control signals, compute the branch target
– Execute (E)
• execute arithmetic/resolve branches
– Memory (M)
• perform load/store accesses to memory, take branches
• assume 1 cycle to access memory
– Write back (W)
• write arithmetic results to the register file

Pipeline Implementation

• Idea:
– Goal of MIPS: CPI <= 1
– Some instructions take longer to execute than others
– Don't want cycle time to depend on the slowest instruction
– Want 100% hardware utilization
– Split execution of each instruction into several balanced “stages”
– Each stage is a block of combinational logic
– Latency of each stage fits within 1 clock cycle
– Insert registers between pipeline stages to hold intermediate results
– Execute each of these steps in parallel for a sequence of instructions
– “Assembly line”
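The "assembly line" idea can be quantified with a simple cycle count. The sketch below assumes an ideal pipeline: every stage takes exactly one cycle and there are no stalls or hazards. The function names are assumptions made for this example.

```python
def unpipelined_cycles(n_instructions, n_stages):
    """Each instruction passes through all stages before the next one starts."""
    return n_instructions * n_stages


def pipelined_cycles(n_instructions, n_stages):
    """Ideal pipeline: the first instruction needs n_stages cycles to fill
    the pipeline, then one instruction completes every cycle."""
    if n_instructions == 0:
        return 0
    return n_stages + n_instructions - 1


n, k = 100, 5  # 100 instructions on a five-stage pipeline
serial = unpipelined_cycles(n, k)
overlapped = pipelined_cycles(n, k)

print(serial)          # 500
print(overlapped)      # 104
print(overlapped / n)  # 1.04 cycles per instruction, approaching 1 as n grows
```

The effective CPI approaches 1 only for long instruction sequences without stalls; real pipelines fall short of this because of hazards.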

Hazards

• Hazards are data flow problems that arise as a result of pipelining
– They limit the amount of parallelism and sometimes induce “penalties” that prevent one instruction per clock cycle
• Structural hazards
– Two operations require a single piece of hardware
– Structural hazards can be overcome by adding additional hardware
• Control hazards
– Conditional control instructions are not resolved until late in the pipeline, requiring subsequent instruction fetches to be predicted
– Fetched instructions are flushed if the prediction does not hold (make sure there is no state change)
– Branch hazards can use dynamic prediction/speculation or a branch delay slot
• Data hazards
– An instruction in one pipeline stage is dependent on data computed in another pipeline stage
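A data hazard of the read-after-write kind can be spotted by checking whether an instruction reads a register that a nearby earlier instruction writes. The sketch below is a deliberate simplification: the function name `raw_hazards`, the (destination, sources) tuple encoding, and the `window` of two following instructions are all assumptions for illustration, and forwarding is ignored.

```python
def raw_hazards(program, window=2):
    """Find read-after-write dependences between nearby instructions.

    program is a list of (dest, sources) tuples. An instruction is flagged
    when it reads a register written by one of the `window` instructions
    immediately before it. Returns (producer_index, consumer_index, register).
    """
    hazards = []
    for i, (dest, _sources) in enumerate(program):
        if dest is None:
            continue
        last = min(i + 1 + window, len(program))
        for j in range(i + 1, last):
            if dest in program[j][1]:
                hazards.append((i, j, dest))
    return hazards


program = [
    ("$t0", ("$s1", "$s2")),  # 0: add $t0, $s1, $s2
    ("$t1", ("$t0", "$s3")),  # 1: sub $t1, $t0, $s3   reads $t0 from 0
    ("$t2", ("$s4", "$s5")),  # 2: and $t2, $s4, $s5   independent
    ("$t3", ("$t0", "$t1")),  # 3: or  $t3, $t0, $t1   reads $t1 from 1
]

print(raw_hazards(program))  # [(0, 1, '$t0'), (1, 3, '$t1')]
```

Instruction 3 also reads $t0, but its producer is three instructions earlier, outside the assumed window, so it is not flagged. A real pipeline would size this window from its stage count and forwarding paths.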

Terminology

• Hyper-Threading (HT)
• Turbo Boost/Turbo Core
• QuickPath Interconnect (QPI)/HyperTransport
• Tri-Gate (3D) transistor
• Cool'n'Quiet
• CoolCore
• Vector processing
• Superscalar architecture
• VLIW architecture

Technical point of view: RTOS vs OS

OS                                         RTOS
Multitasking and multiuser                 Multitasking but not multiuser
Kernel size in 10s of MB                   Kernel size from a few KB to 2 MB
All features are bundled                   Scalable feature set
Native GUI support                         3rd-party application is needed to support a GUI
User has no control over context switch    Context switch time is very short
Preemption is not guaranteed               Guaranteed preemption of tasks

Computing Hardware

• Dedicated & timed task
– DSP, dedicated SoC, or general CPU

• Parallelism in the dataset? Use parallel hardware like FPGA or GPGPU
– Image & video processing
– Weather forecasting
– Stock market prediction
– Bio-inspired computation