z-IRAM

Embed Size (px)

Citation preview

  • 8/3/2019 z-IRAM

    1/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 1

    SEMINAR

    ON

    INTELLIGENT RAM

  • 8/3/2019 z-IRAM

    2/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 2

    Abstract

    As the name suggests Intelligent RAM is the integration of Intelligence and RAM

    Intelligence stands for Microprocessor, and RAM, the random access memory which is the

    volatile memory that is an important part of computing systems from desktop computers t

    supercomputers. Intelligent RAM, or IRAM, merges processing and memory into a single

    chip to lower memory latency, increase memory bandwidth, and improve energy

    efficiency as well as to allow more flexible selection of memory size and organization. In

    addition, IRAM promises savings in power and board area. The seminar is genuine effor

    to introduce this new idea which covers the inspirations, advantages, architecture of IRAM

    and the technologies which makes possible the revolutionary idea Intelligent RAM.

  • 8/3/2019 z-IRAM

    3/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 3

    Contents

    1. Introduction2.

    Inspirations of IRAM

    3. IRAM - Architecture4. IRAM - Benchmarking5. Advantages of IRAM6. Disadvantages of IRAM7. Conclusion8. References

  • 8/3/2019 z-IRAM

    4/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 4

    1.Introduction

    The division of the semiconductor industry into microprocessor and memory

    camps provides many advantages. First and foremost, a fabrication line can be tailored to

    the needs of the device. Microprocessor fabrication lines offer fast transistors to make fas

    logic and many metal layers to accelerate communication and simplify power distribution

    while DRAM fabrications offer many polysilicon layers to achieve both small DRAM

    cells and low leakage current to reduce the DRAM refresh rate. Separate chips also mean

    separate packages, allowing microprocessors to use expensive packages that dissipate high

    power (5 to 50 watts) and provide hundreds of pins to make wide connections to externa

    memory, while allowing DRAMs to use inexpensive packages which dissipate low powe

    (1 watt) and use only a few dozen pins. Separate packages in turn mean computer

    designers can scale the number of memory chips independent of the number of processors

    most desktop systems have 1 processor and 4 to 32 DRAM chips, but most server system

    have 2 to 16 processors and 32 to 256 DRAMs. Memory systems have standardized on th

    Single In-line Memory Module (SIMM) or Dual In-line Memory Module (DIMM), which

    allows the end user to scale the amount of memory in a system. Quantitative evidence o

    the success of the industry is its size: in 1998 DRAMs were a $37B industry and

    microprocessors were a $20B industry. In addition to financial success, the technologies o

    these industries have improved at unparalleled rates. DRAM capacity has quadrupled on

    average every 3 years since 1986, while microprocessor speed has done the same sinc

    1996. The split into two camps has its disadvantages as well. Two trends call int

    question the current practice of microprocessors and DRAMs being fabricated as differen

    chips on different fabrication lines: 1) the gap between processor and DRAM speed i

    growing at 50% per year; and 2) the size and organization of memory on a single DRAM

    chip is becoming awkward to use in a system, yet size is growing at 60% per year. These

  • 8/3/2019 z-IRAM

    5/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 5

    disadvantages made the people to think of developing a new technology and as a result th

    integrated technology Intelligent RAM came into exist. Intelligent RAM, or IRAM

    merges processing and memory into a single chip to lower memory latency, increas

    memory bandwidth, and improve energy efficiency as well as to allow more flexibleselection of memory size and organization. In addition, IRAM promises savings in powe

    and board area.The IRAM technology was proposed by David Patterson, University o

    California, Berkeley, U S A as an alternative for current memory-processor combination

    which has many pit falls. It was implemented by a group of post graduate students lead by

    Patterson.

    2. Inspirations of IRAM

    Processor - Memory (DRAM) Gap or Latency

    It is the delay time between the performance of Microprocessor and RAM. Since

    processor and RAM are fabricated separately in two fabrication lines the clock speed o

    them will vary in two different rates creating a performance gap between them.

    Figure 2.1 shows that while microprocessor performance has been improving at

    rate of 60% per year, the access time to DRAM has been improving at less than 10% pe

    year. Hence computer designers are faced with an increasing Processor-Memory

    Performance Gap, which is now the primary obstacle to improved computer system

    performance. System architects have attempted to bridge the processor-memory

    performance gap by introducing deeper and deeper cache memory hierarchies

    unfortunately, this makes the memory latency even longer in the worst case. The main

    memory latency in the system is a factor of four larger than the raw DRAM access time

    this difference is due to the time to drive the address off the microprocessor, the time to

    multiplex the addresses to the DRAM, the time to turn around the bidirectional data bus

    the overhead of the memory controller, the latency of the SIMM connectors, and the time

    to drive the DRAM pins first with the address and then with the return data.

  • 8/3/2019 z-IRAM

    6/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 6

    Figure 2.1

    Processor - Memory Performance Gap Penalty

    To overcome the performance gap we have to provide a cache memory which

    needs to invest more money. Also because of the increase in the number of transistors thpower consumption also is increased. But the cache is inefficient as the performance gap i

    very high and is growing at a rate of 50% per year. The extraordinary delays in the

    memory hierarchy occur despite tremendous resources being spent trying the bridge the

    processor-memory performance gap. We call the percent of die area and transistor

    dedicated to caches and other memory latency-hiding hardware the Memory Ga

    Penalty. Table 2.1 quantifies the penalty; it has grown to 60% of the area and almost 90%

    of the transistors in several microprocessors. Also due to the rapid growth there may arise

    a situation where the degree of integration has to be increased and fabrication size o

    transistors need to be decreased which will further increase the cost of production which

    implies the penalty will increase with new generations of microprocessors.

  • 8/3/2019 z-IRAM

    7/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 7

    Year Processor On-ChipCache Size

    Memory GapPenalty:

    % Die Area

    Memory GapPenalty:

    Transistors

    Die Area(mm2)

    TotalTransistors

    1994 Digital

    Alpha

    21164

    I: 8 KB,

    D: 8 KB,

    L2: 96 KB

    37.4% 77.4% 298 9.3 M

    1996 Digital

    Strong-Arm

    SA-110

    I: 16 KB,

    D: 16 KB

    60.8% 94.5% 50 2.1 M

    1993 Intel

    Pentium

    I: 8 KB,

    D: 8 KB

    31.9% 32%

    300

    3.1 M

    1995 Intel

    Pentium Pro

    I: 8 KB,

    D: 8 KB,

    L2: 512 KB

    P: 18.5%

    +L2: 100%

    (Total:52.2%)

    P: 11.2%

    +L2: 100%

    (Total:76.5%)

    P: 226

    +L2:186

    P: 3.5 M

    +L2: 25.0M

    2000 Intel

    Pentium4

    I: 8 KB,

    D:12 KB,

    L2: 512 KB

    P: 22.5%

    +L2: 100%

    (Total: 64.2%)

    P: 18.2%

    +L2: 100%

    (Total:87.5%)

    P: 242

    +L2:282

    P: 5.5 M

    +L2: 31.0M

    2001 AMD

    Athlon

    I: 32 KB,

    D: 32 KB,

    L2: 512 KB

    P: 20.2%

    +L2: 100%

    (Total:62.0%)

    P: 20.2%

    +L2: 100%

    (Total:88.5%)

    P: 268

    +L2:312

    P: 6.5 M

    +L2: 33.0M

    Table 2.1

    Memory Revenue

    The memory revenue is decreasing rapidly nowadays. Even though the need fo

    DRAM chips is increasing the DRAM manufacturers are not getting the benefit because o

    the high cost of production. This makes the RAM manufactures to think of anothe

    alternative which can reduce the cost of production to maintain the revenue as well as thei

    business.

    Figure 2.2 shows that the DRAM revenue has been falling continuously from

    first quarter of the year 1999 after reaching a maximum of 16 Billion U S Dollars. Again

    in the first quarter of the year 2000 it showed a slight rise by reaching 7 Billion after which

    seems to be sliding down continuously for three consecutive years which has not ceased

    till now.

  • 8/3/2019 z-IRAM

    8/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 8

    Figure 2.2

    I/O Bus Performance Lag

    The parallel I/O bus is not efficient because it lags behind the processor and

    memory in band width. If we scale the bus by increasing clock speed and bus width fo

    increasing performance the packaging cost is increased. Scaling also results in increased

    number of pins. So the performance lag of parallel I/O bus points to the requirement of

    much more efficient technological implementation which rectifies the band width scarcity

    and increase in both cost of production and in number of pins while scaling it fo

    increasing the performance and efficiency.

    PCI Bits Pin Number16 ~2032 ~50

    64 ~90

    Table 2.2

  • 8/3/2019 z-IRAM

    9/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 9

    Table 2.2 shows the rapid increase in pin number which in turn increases the cos

    of production when the PCI bits are scaled for increasing the performance of the I/O

    system.

    Database Demand for Processing Power and Memory

    The database or software is demanding for more and more processing power and

    memory. But both of them are inadequate when compared to the actual demand. Also th

    demand is increasing rapidly because more and more high end applications lik

    multimedia applications which need high processing power and RAM. Also thei

    requirements are increasing by the release of their continuing versions which are capableof squeezing the final drop of performance from the computing system.

    Figure 2.3Figure 2.3 shows the database demand for processing power and memory i

    increasing by a multiple of 2 or becoming twice in 9 months according to Gregs law. Bu

    the microprocessor speed and DRAM speed becomes twice in 18 months and 120 month

    respectively. The microprocessor speed is increasing according to the Moores Law. So

  • 8/3/2019 z-IRAM

    10/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 10

    both microprocessor and memory is less when compared to the actual requiremen

    demanded by the database and other software applications making a Database - Processo

    performance gap and Database - Memory performance gap respectively. Also the

    performance gap is growing continuously and rapidly since the new software applicationare hungrier in matters of processing power and memory. So the database demand fo

    more processing power and memory made the computer experts to think of a technology

    which reduces the performance gap between the database demand and processor as well a

    the memory.

    Fewer DRAMs/System over TimeWhile the Processor-Memory Performance Gap has widened to the point where i

    is dominating performance for many applications, the cumulative effect of two decades o

    60% per year improvement in DRAM capacity has resulted in huge individual DRAM

    chips. This has put the DRAM industry in something of a bind. That is the DRAM width

    or the memory per DRAM is growing at a rate of 60% per year. But the minimum memory

    required per system is growing only at a rate of 25% - 30% per year.

    Figure 2.4

  • 8/3/2019 z-IRAM

    11/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 11

    Figure 2.4 shows that over time the number of DRAM chips required for a

    reasonably configured PC has been shrinking. The required minimum memory size

    reflecting application and operating system memory usage, has been growing at only abou

    half to three-quarters the rate of DRAM chip capacity. For example, consider a wordprocessor that requires 8MB; if its memory needs had increased at the rate of DRAM chip

    capacity growth, that word processor would have had to fit in 80KB in 1986 and 800 byte

    in 1976. The result of the prolonged rapid improvement in DRAM capacity is fewe

    DRAM chips needed per PC, to the point where soon many PC customers may requir

    only a single DRAM chip. Also unused memory bits increases effective cost. So customer

    may no longer automatically switch to the larger capacity DRAM as soon as the nex

    generation matches the same cost per bit in the same organization because 1) the minimum

    memory increment may be much larger than needed, 2) the larger capacity DRAM wil

    need to be in a wider configuration that is more expensive per bit than the narrow versio

    of the smaller DRAM, or 3) the wider capacity does not match the width needed for erro

    checking and hence results in even higher costs.

    3. IRAM - Architecture

    Key Technologies

    The Key Technologies behind the IRAM technology are,

    1) Vector Processing 2) Embedded DRAM and 3) Serial I/O

    Vector Processing

    High speed microprocessors rely on instruction level parallelism (ILP) i

    programs, which means the hardware has the potential short instruction sequences to

    execute in parallel. As mentioned above, these high speed microprocessors rely on getting

    hits in the cache to supply instructions and operands at a sufficient rate to keep th

    processors busy.

  • 8/3/2019 z-IRAM

    12/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 12

    An alternative model to exploiting ILP that does not rely on caches is vecto

    processing. It is a well established architecture and compiler model that was popularized

    by supercomputers, and it is considerably older than superscalar. Vector processors hav

    high-level operations that work on linear arrays of numbers.Advantages of vector computers and the vectorized programs on them include:

    1. Each result is independent of previous results, which enables deep pipelines and high

    clock rates in them.

    2. A single vector instruction does a great deal of work, which means fewer instruction

    fetches in general and fewer branch instructions and so fewer mispredicted branches.

    3. Vector instructions often access memory a block at a time, which allows memory

    latency to be amortized over, say, 64 elements.

    4. Vector instructions often access memory with regular (constant stride) patterns, which

    allows multiple memory banks to simultaneously supply operands.

    Figure 4.1

    Figure 4.1 shows the vector processing model in which the difference between

    scalar and vector instructions is schematically represented. In scalar processing th

    instructions are carried out sequentially while in vector processing a number o

  • 8/3/2019 z-IRAM

    13/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 13

    instructions are carried out in parallel which depends on the vector length of the processor

    So parallel processing is much faster than scalar processing.

    IRAM - Vector ArchitectureSince vector architecture deals with vector processing it represents only the

    processor architecture of IRAM. It helps to study about instruction level parallelism o

    parallel processing of IRAM. The parallel processing is carried out by virtual processing o

    the IRAM processor. Figure 4.2 shows the vector architectural model of IRAM. The

    parallel processing is carried out by the virtual processors VP0, VP1, .. ,VP$vlr-1. I

    consists of the 32 general purpose registers which are vr0, vr1, .., vr31. The genera

    purpose registers are for execution of general instructions.

    Figure 4.2

    The 32 flag registers vf0, vf1, , vf31 are for executing floating poin

    instructions. It also consists of 32 control registers vcr0, vcr1, ., vcr31 for the control o

    instruction execution carried out by the processor.

  • 8/3/2019 z-IRAM

    14/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 14

    Advantages of Vector Processing

    Advantages of vector processing are,

    1. It has high performance on demand for multimedia processing. That is it enhances th

    multimedia processing by means of parallel processing which makes it ideal processingmethod since multimedia processing plays a key role in todays computing.

    2. It has low power for issue of control logic. The control logic voltage is 1.2 V when

    compared to the 3V to 5V of other processing methods.

    3. It doesnt have much complexity in design. Due to the simple design the

    implementation becomes very easy and cost effective.

    4. It has a well understood programming model. The compiler instruction language is very

    simple and easy to understand. Also the instruction language is very efficient even

    though it is simple when compared to the other assembly level programming languages

    Embedded DRAM

    The embedded DRAM technology used in IRAM is by means of embedded

    technology. It is the technology by which a chip is embedded into a device for the contro

    and well execution of the operations of that particular device. Usually chip embedding i

    done in devices handled by common people where they dont need to interact with the chip

    directly, but by means of embedded chip he executes and controls the device.

    Figure 4.3

  • 8/3/2019 z-IRAM

    15/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 15

    Figure 4.3 shows how embedded technology is used in the manufacturing o

    IRAM. During the fabrication the memory chip is embedded into the microprocessor to

    produce IRAM. Thus IRAM becomes a single chip into which both memory and processo

    are integrated for high quality performance due to their coexistence.

    Advantages of Embedded DRAM

    The Advantages of Embedded DRAM are,

    1. It offers high bandwidth for vector processing. Due to the high memory bandwidth

    possessed by the DRAM chip it can enhance the performance of vector processo

    which needs high memory usage and bandwidth because of the abundant parallelism in

    vector processing.

    2. It has a low latency which makes the memory accesses much faster and efficient.

    3. The energy or power required for memory accesses is very low. Also the memory

    access frequency is less. So the power consumption of the DRAM chip is les

    compared to other memory chips which consume more power.

    4. The memory flexibility of IRAM is due to the embedded technology used in it

    manufacturing process. The designers can specify exact length and width of the DRAM

    since it is not restricted by powers of 2. So embedded DRAM offers system memory

    size benefits.

    Serial I/O

    Due to the poor performance of parallel I/O both in the case of band width and

    scaling processes the I/O system of IRAM is using a much more efficient and cos

    effective technology the Serial I/O system. It enhances the performance of IRAM

    without hindering the memory and processor performances by offering a smooth and faste

    path for data transfer.

  • 8/3/2019 z-IRAM

    16/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 16

    Figure 4.4

    Figure 4.4 shows the schematic representation of the interaction between IRAM

    and the I/O devices. The interaction is through the serial I/O lines implemented in it a

    shown in figure. Due to the high band width offered by the serial I/O lines the data transfe

    takes place much faster and efficiently in IRAM which enhances its performance.

    Advantages of Serial I/O

    The advantages of serial I/O are,

    1. Serial I/O offers very high band width which enhances both processing and memory

    intensive operations. The typical band width of serial I/O is of the order o

    Gigabits/Sec.

    2. The pin count of serial I/O is very less compared to parallel I/O. It requires only 1-2

    pins per unidirectional link while the parallel I/O requires 5-10 pins per unidirectiona

    link.

    3. Serial I/O band width can be incrementally scaled for increasing the band width and

    efficiency. Also the scaling will not cause any increase in pin number which implie

    that the scaling is very cost effective in the case of serial I/O while scaling in paralle

    I/O increases both the number of pins and cost of production.

    4. The power consumption of serial I/O system is very less when compared to that o

    parallel I/O system. So serial I/O enhances the performance of IRAM along with

  • 8/3/2019 z-IRAM

    17/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 17

    reduction in power consumption. The reduced power consumption of serial I/O system

    makes it suitable for low power consuming devices.

    IRAM - Floor Plan

    Each and every technological implementation needs a basic plan on which thdevice or circuit is built or implemented. Similarly IRAM too have a basic plan or floo

    plan for its implementation which includes the design specification and structure fo

    implementation. Figure 4.5 shows the floor plan of IRAM. It consists of 1024 1Megabi

    memory modules split into two memory zones each having 512 1Megabit modules. Th

    memory capacity is 64 Mbytes per memory zone. The 8 vector lanes are for multiple

    instances of parallel processing of the vector processor. The CPU and IO are the centra

    processing unit and input-output system of the IRAM chip. The crossbar switch acts as

    link between the memory, central processing unit and the IO system.

    Figure 4.5

  • 8/3/2019 z-IRAM

    18/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 18

    Floor Plan Specification

    The IRAM Floor plan specification is as follows, 0.13 Micron - The

    manufacturing size of the transistors. 1GHZ - The processing power of the IRAM chip.

    1GBit DRAM - The memory subsystem capacity in terms of memory modules. 128MB -Total memory capacity. 16GFLOPS (64b) - The number of floating point operations per

    second. 64GOPS (16b) - The number of operations per second.

    IRAM - Complete Architecture

    IRAM is implemented from the floor plan, according to the floor plan

    specifications. Figure 4.6 represents the complete architecture of IRAM implementation

    The 1024 1Mbit modules are embedded in IRAM as shown in figure. The 16Kilobyte L

    cache is split into two, 8K Instruction cache and 8K Data cache. The Instruction cache i

    for processor instructionoperations and Data cache is for the various memory load and

    store operations.

    Figure 4.6

  • 8/3/2019 z-IRAM

    19/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 19

    The 2-way superscalar processor is for scalar processing. It is called 2-way

    because the issue of control logic is different for FPU (Floating Point Unit) operations and

    LSC (Load/Store/Coprocessor) operations. The vector registers are meant for vecto

    instruction executions which register the vector instructions issued by the vector processorThe vector instructions are queued from 2-way superscalar processor to the vecto

    instruction queue unit. The arithmetic and logic unit is for the execution of variou

    arithmetic and logic instructions issued by both super scalar processor and vecto

    processor. The Load/Store unit is meant for the various memory load and store operations

    The serial I/O lines are for the interaction of IRAM with the various input and outpu

    devices. The memory crossbar switch acts as a link between the memory, processor and

    input-output devices for their mutual interactions during their operations. The integrated

    architecture makes the memory and processor to coexist and perform as a single unit

    Since there is high level of interaction between the memory, processor and the I/O device

    the performance of IRAM is very high compared to the separate chip processor -memory

    unit. This unified architecture from the same fabrication line is in fact the secret behind the

    excellent performance of IRAM in both processing and memory intensive operations.

    4. IRAM -Benchmarking

    Benchmarking is a process or group of processes by means of which one can tak

    a right decision upon the performance and efficiency of a product or technology in

    comparison with the other product or technology which took part in the event

    Benchmarking helps us to find the product or technology which is appropriate for ou

    requirements. So it is a solid proof for the performance and efficiency of a product otechnology in comparison with others.

    Benchmarking Environment

    Benchmarking environment is the environment where the whole processes ar

    carried out for arriving at a right decision. It may include the various products based on th

  • 8/3/2019 z-IRAM

    20/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 20

    same technology or different technology depending upon the type and requirement o

    benchmarking. In this case we are benchmarking the technology IRAM with othe

    processor-memory combinations for proving the real potential of IRAM. So th

    benchmarking environment will consist of the various processors from differenmanufacturers along with various memory modules from different sources to perform as

    single unit while the IRAM is itself a single unit manufactured from a single fabrication

    process. Table 5.1 shows the various processors selected for the benchmarking and th

    different memory modules used for the benchmarking process.

    P IRAM SPARC R10K P III P 4 EV 6

    Make Berkeley Sun Origin Intel Intel Alpha

    Clock 1GHz 833MHz 900MHz 950MHz 1.5GHz 966MHz

    L1 8+8KB 16+16KB 32+32KB 32KB 12+8KB 64+64KB

    L2 NA 2MB 1MB 256KB 256KB 2MB

    Memory 128 MB 256MB 1GB 256MB 1GB 512MB

    Table 5.1

    The various processors used for the benchmarking are SPARC from Sun, R10K

    from Origin, P III and P 4 from Intel, EV6 from Alpha and IRAM from Berkeley. The

    clock frequencies of the various processors, the L1 and L2 caches and the memory

    modules used with them are all mentioned in the table 5.1 given above. Also it can be

    noted that the processor with minimum L1 and L2 cache (L2 cache is not possessed by

    IRAM) and minimum memory capacity is IRAM. All other processors are having more L

    and L2 caches and memory capacity associated with them for their operations.

    Benchmarking Processes

    The various benchmarking processes carried out are,

  • 8/3/2019 z-IRAM

    21/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 21

    1. Transitive Closure:The first benchmark problem is to compute the transitive closure

    of a directed graph in a dense representation. The code taken from the DIS reference

    implementation used non-unit stride, but was easily changed to unit stride. This benchmark

    performs only 2 arithmetic operations (an addition and a subtraction) at each step, while i

    executes 2 loads and 1 store.

    2. Giga Updates per Second (GUPS): This benchmark is a synthetic problem, which

    measures giga-updates-per-second. It repeatedly reads and updates distinct, pseudo

    random memory locations. The inner loop contains 1 arithmetic operation, 2 loads, and

    store, but unlike transitive, the memory accesses are random. It contains abundant data

    parallelism because the addresses are pre-computed and free of duplicates.

    3. Sparse Matrix-Vector Multiplication (SPMV): This problem also requires random

    memory access patterns and a low number of arithmetic operations. It is common in

    scientific applications, and appears in both the DIS and NPB suites in the form of a

    Conjugate Gradient (CG) solver. We have a CG implementation for IRAM, which i

    dominated by SPMV, but here we focus on the kernel to isolate the memory system issues

    The matrices contain a pseudo-random pattern of non-zeros using a construction algorithm

    from the DIS specification, parameterized by the matrix dimension, n, and the number o

    non zeroes, m.

    4. Histogram:Computing a histogram of a set of integers can be used for sorting and in

    some image processing problems. Two important considerations govern the algorithmi

    choice: the number of buckets, b, and the likelihood of duplicates. For image processingthe number of buckets is large and collisions are common because there are typically many

    occurrences of certain colors (e.g.: white) in an image. Histogram is nearly identical to

    GUPS in its memory behavior, but differs due to the possibility of collisions, which limi

    parallelism and are particularly challenging in a data-parallel model.

  • 8/3/2019 z-IRAM

    22/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 22

    Mesh Adaptation: The final benchmark is a two dimensional unstructured mesh

    adaptation algorithm based on triangular elements. This benchmark is more complex than

    the others, and there is no single inner loop to characterize. The memory accesses include

    both random and unit stride, and the key problem is the complex control structure, since

    there are several different cases when inserting a new point into an existing mesh. Starting

    with a coarse-grained task parallel program, we performed significant code reorganization

    and data preprocessing to allow vectorization. The various benchmarking processes are

    selected specially for testing both the processing and memory handling of the variou

    processor-memory combinations and IRAM. These benchmarking processes are both

    processing and memory intensive which squeezes the final drop of performance from th

    different processor-memory systems.

    Transitive Closure

    The best-case scenario for both caches and vectors is a unit stride memory acces

    pattern, as found in the transitive closure benchmark. In this case, the main advantage fo

    IRAM is the size of its on-chip memory, since DRAM is denser than SRAM. IRAM ha

    12 MB of on-chip memory compared to 10s of KB for the L1 caches on the cache-based

    machines. IRAM is admittedly a large chip, but this is partly due to being an academi

    research project with a very small design team - the 2-3 orders of magnitude advantage in

    on-chip memory size is primarily due to the memory technology. Figure 5.1 shows the

    performance of the transitive closure benchmark. Results confirm the expected advantag

    for IRAM on a problem with abundant parallelism and a low arithmetic/memory operation

    ratio. Performance is relatively insensitive to graph size, although IRAM performs bette

    on larger problems due to the longer average vector length. The Pentium 4 has a simila

    effect, which may be due to improved branch prediction because of the sparse graph

    structure in the test problem.

  • 8/3/2019 z-IRAM

    23/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 23

    Figure 5.1

    Giga Updates per Second (GUPS)

    A more challenging memory access pattern is one with either non-unit strides o

    indexed loads and stores (scatter/gather operations). The first challenge for any machine i

    generating the addresses, since each address needs to be checked for validity and focollisions. IRAM can generate only 4 addresses per cycle, independent of the data width

    For 64-bit data, this is sufficient to load or store a value on every cycle, but if the dat

    width is halved to 32-bits, the 4 64-bit lanes perform arithmetic operations at the rate of 4

    32-bit lanes, and the arithmetic unit can more easily be starved for data. In addition, detail

    of the memory bank structure can become apparent, as multiple accesses to the sam

    DRAM bank require additional latency to charge the DRAM. The frequency of these

    bank-conflicts depends on the memory access pattern and the number of banks in th

    memory system.

    The GUPS benchmark results, shown in Figure 5.2, highlights the addres

    generation issue. Although performance improves slightly when moving from 64 to 32

    bits, after that performance is constant due to the limits for 4 address generators. Overall

  • 8/3/2019 z-IRAM

    24/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 24

    though, IRAM does very well on this benchmark, nearly doubling the performance of it

    nearest competitor, the Pentium 4, for 32 and 64 bit data. In fairness, GUPS was the one o

    the benchmarks in which the benchmarking conductors tidied up the compiler-generated

    assembly instructions for the inner loops, which produced a 20-60% speedup.

    Figure 5.2

    In addition to the MOP rate, it is interesting to observe the memory bandwidth

    consumed in this problem. GUPS achieves 1.77, 2.36, 3.54, and 4.87 GB/s memory

    bandwidth on IRAM at 8, 16, 32, and 64-bit data widths, respectively. This is relatively

    close to the peak memory bandwidth of 6.4 GB/s.

    Sparse Matrix-Vector Multiplication (SPMV)

    For the SPMV benchmark, the matrix dimension was set to 10,000 and the

    number of non zeroes to 177,782, i.e., there were about 18 non zeroes per row. The

    computation is done in single precision floating-point. The pseudorandom pattern of non

    zeroes is particularly challenging, and many matrices taken from real applications hav

  • 8/3/2019 z-IRAM

    25/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 25

    some structure that would have better locality, which would especially benefit cache-bas

    machines.

    Four different algorithms were considered for SPMV, reflecting the best practic

    for both cache-based and vector machines. The performance results are shown in Figur5.3. Compressed Row Storage (CRS) is the most common sparse matrix format, which

    stores an array of column indices and non-zero values for each row; SPMV is then

    performed as a series of sparse dot products. The performance on IRAM is better than

    some cache-based machines, but it suffers from lack of parallelism. The dot product i

    performed by recursive halving, so vectors start with an average of 18 elements and drop

    from there. Both the P4 and EV6 exceed IRAM performance for this reason. CRS-banded

    uses the same format and algorithm as CRS, but reflects a different nonzero structure tha

    would likely result from bandwidth reduction orderings, such as reverse Cuthill-McKe

    (RCM). This has little effect on IRAM, but improves the cache hit rate on some of th

    other machines.

    Figure 5.3

  • 8/3/2019 z-IRAM

    26/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 26

    The Ellpack (or Itpack) format forces all rows to have the same length by

    padding them with zeros. It still has indexed memory operations, but increases available

    data parallelism through vectorization across rows. The raw Ellpack performance i

    excellent, and this format should be used on IRAM and PIII for matrices with the longesrow length close to the average. If we instead measure the effective performance (eff)

    which discounts operations performed on padded zeros, the efficiency can be arbitrarily

    poor. Indeed, the randomly generated DIS matrix has an enormous increase in the matrix

    size and number of operations, making it impractical. The Segmented-sum algorithm wa

    first proposed for the Cray PVP. The data structure is an augmented form of the CRS

    format and the computational structure is similar to Ellpack, although there is additiona

    control complexity. The underlying Ellpack algorithm was modified that converts roughly

    2/3 of the memory accesses from a large stride to unit stride. The remaining 1/3 are stil

    indexed references. This was important on IRAM, because we are using 32-bit data an

    have only 4 address generators as discussed above.

    Histogram

    This benchmark builds a histogram for the pixels in a 500x500 image from the

    DIS Specification. The number of buckets depends on the number of bits in each pixel, so

    we use the base 2 logarithm (i.e., the pixel depth) as the parameter in our study

    Performance results for pixel depths of 7, 11, and 15 are shown in Figure 5.4. The first fiv

    sets are for IRAM, all but the second (Retry 0%) use this image data set. The first se

    (Retry) uses the compiler default vectorization algorithm, which vectorizes while ignoring

    duplicates, and corrects the duplicates in a serial phase at the end. This works well if there

    are few duplicates, but performs poorly for our case. The second set (Retry 0%) shows the

    performance when the same algorithm is used on data containing no duplicates. The third

    set (Priv) makes several private copies of the buckets with the copies merged at the end. I

    performs poorly due to the large number of buckets and gets worse as this numbe

    increases with the pixel depth.

  • 8/3/2019 z-IRAM

    27/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 27

    Figure 5.4

    The fourth and fifth algorithms use a more sophisticated sort-diff-find-dif

    algorithm that performs inregister sorting. Bitonic sort was used because the

    communication requirements are regular and it proved to be a good match for IRAM'

    butterfly permutation instructions, designed primarily for reductions and FFTs. Th

    compiler automatically generates in-register permutation code for reductions, but th

    sorting algorithm used here was hand-coded. The two sort algorithms differ on the allowe

    data width: one works when the width is less than 16 bits and the other when it is up to 32

    bits. The narrower width takes advantage of the higher arithmetic performance for narrow

    data on IRAM. Results show that on IRAM, the sort-based and privatized optimization

    methods consistently give the best performance over the range of bit depths. It alsodemonstrates the improvements that can be obtained when the algorithm is tailored to

    shorter bit depths. Overall, IRAM does not do as well as on the other benchmarks, becaus

    the presence of duplicates hurts vectorization, but can actually help improve cache hits on

    cache-based machines. We therefore see excellent timings for the histogram computation

  • 8/3/2019 z-IRAM

    28/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 28

    on these machines without any special optimizations. A memory system advantage start

    to be apparent for 15-bit pixels, where the histograms do not fit in cache, and at this poin

    IRAM's performance is comparable to the faster microprocessors.

    Mesh Adaptation

    This benchmark performs a single level of refinement starting with a mesh o

    4802 triangular elements, 2500 vertices, and 7301 edges. In this application, we use a

    different algorithm organization for the different machines: The original code wa

    designed for conventional processors and is used for those machines, while the vecto

    algorithm uses more memory bandwidth but contains smaller loop bodies, which helps the

    compiler perform vectorization. The vectorized code also pre-sorts the mesh points to

    avoid branches in the inner loop, as in Histogram. Although the branches negatively affec

    superscalar performance, presorting is too expensive on those machines. Mesh adaptation

    also requires indexed memory operations, so address generation again limits IRAM.

    Figure 5.5

    Figure 5.5 shows the performance of processors in Mesh Adaptation. It indicate

    that that IRAM has performed well in mesh adaptation when compared to othe

  • 8/3/2019 z-IRAM

    29/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 29

    processors. The only competitor was Intel Pentium 4. So IRAM has emerged a clea

    winner in the processing intensive benchmark, Mesh Adaptation.

    Summary of Benchmark Characteristics

    An underlying goal in the benchmark was to identify the limiting factor in these

    memory-intensive benchmarks. The graph in Figure 5.6 shows the memory bandwidth

    used on IRAM and the MOPS rate achieved on each of the benchmarks using the bes

    algorithm on the most challenging input. GUPS uses the 64-bit version of the problem

    SPMV uses the segmented sum algorithm, and Histogram uses the 16-bit sort. While all o

    these problems have low operation counts per memory operation, the memory and

    operation rates are quite different in practice. Of these benchmarks, GUPS is the mos

    memory-intensive, where as Mesh Adaptation is the least. Histogram, SPMV and

    Transitive Closure have roughly the same balance between computation and memory

    although their absolute performance varies dramatically due to differences in parallelism

    In particular, although GUPS and Histogram are nearly identical in the characteristics, the

    difference in parallelism results in a very different absolute performance as well as relativ

    bandwidth to operation rate.

    Figure 5.7 shows the summary of performance for each of the benchmarks acros

    machines. The y-axis is a log scale, and IRAM is significantly faster than the othe

    machines on all applications except SPMV and Histogram.

    An even more dramatic picture is seen from measuring the MOPS/Watt ratio, a

    shown in Figure 5.8. Most of the cache-based machines use a small amount of parallelism

    but spend a great deal of power on a high clock rate. Indeed a graph of Flops per machine

    cycle is very similar. Only the Pentium III, designed for portable machines, has a

    comparable power consumption of 4 Watts compared to IRAMs 2 Watts. The Pentium II

    cannot compete on performance, however, due to lack of parallelism.

  • 8/3/2019 z-IRAM

    30/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 30

    Figure 5.6

    Figure 5.7

    Figure 5.8

  • 8/3/2019 z-IRAM

    31/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 31

    5.Advantages of IRAM

    Low Latency

    To reduce latency, the wire length should be kept as short as possible. Thi

    suggests the fewer bits per block the better. In addition, the DRAM cells furthest away

    from the processor will be slower than the closest ones. Rather than restricting the acces

    timing to accommodate the worst case, the processor could be designed to be aware when

    it is accessing slow or fast memory. Some additional reduction in latency can b

    obtained simply by not multiplexing the address as there is no reason to do so on an

    IRAM. Also, being on the same chip with the DRAM, the processor avoids driving the of

    chip wires, potentially turning around the data bus, and accessing an external memorycontroller. In summary, the access latency of an IRAM processor does not need to be

    limited by the same constraints as a standard DRAM part. Much lower latency may be

    obtained by intelligent floor planning, utilizing faster circuit topologies, and redesigning

    the address/data bussing schemes.

    The potential memory latency for random addresses of less than 30 ns is possibl

    for a latency-oriented DRAM design on the same chip as the processor; this is as fast as

    second level caches. Recall that the memory latency on the Alpha Server 8400 is 253 ns.

    IRAM offers performance opportunities for applications with unpredictabl

    memory accesses and very large memory footprints, such as data bases, which may take

    advantage of the potential 5X to 10X decrease in IRAM latency. The lower latency o

    IRAM is due to the fact that it doesnt have any parallel DRAMs, memory controller

    longer bus to turn around and also it has lower number of pins. The latency of an IRAM

    chip is 10-15 ns for a 64-128 MB memory capacity which is very low compared to othe

    memory chips.

  • 8/3/2019 z-IRAM

    32/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 32

    High Bandwidth

    A DRAM naturally has extraordinary internal bandwidth, essentially fetching th

    square root of its capacity each DRAM clock cycle; an on-chip processor can tap tha

    bandwidth. The potential bandwidth of the gigabit DRAM is even greater than indicatedby its logical organization. Since it is important to keep the storage cell small, the norma

    solution is to limit the length of the bit lines, typically with 256 to 512 bits per sense amp

    This quadruples the number of sense amplifiers. To save die area, each block has a smal

    number of I/O lines, which reduces the internal bandwidth by a factor of about 5 to 10 bu

    still meets the external demand. One IRAM goal is to capture a larger fraction of the

    potential on-chip bandwidth. For example, two prototypes 1 gigabit DRAMs were

    presented at ISSCC in1996. As mentioned above, to cope with the long wires inherent in

    600 mm2 dies of the gigabit DRAMs, vendors are using more metal layers: 3 fo

    Mitsubishi and 4 for Samsung. The total number of memory modules on chip is 512 2

    Mbit modules and 1024 1-Mbit modules, respectively. Thus a gigabit IRAM might hav

    1024 memory modules each 1K bits wide. Not only would there be tremendous bandwidth

    at the sense amps of each block, the extra metal layers enable more cross-chip bandwidth

    Assuming a 1Kbit metal bus needs just 1mm, a 600 mm2

    IRAM might have 16 busse

    running at 50 to 100 MHz. Thus the internal IRAM bandwidth should be as high as 200

    300 GBytes/sec. For comparison, the sustained memory bandwidth of the Alpha Serve

    8400 which includes a 75 MHz, 256-bit memory bus is 1.2 Gbytes/sec. Cross bar switch

    in IRAM architecture delivers only 1/3 to 2/3 of the theoretical band width, so actual band

    width will be 100-200 GB/Sec. Applications with predictable memory accesses, such a

    matrix manipulations, may take advantage of the potential 50X to 100X increase in IRAMbandwidth.

    High Energy Efficiency

    Integrating a microprocessor and DRAM memory on the same die offers the

    potential for improving energy consumption of the memory system. DRAM is much

  • 8/3/2019 z-IRAM

    33/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 33

    denser than SRAM, which is traditionally used for on-chip memory. Therefore, an IRAM

    will have many fewer external memory accesses, which consume a great deal of energy to

    drive high capacitance off-chip buses. Even on-chip accesses will be more energy

    efficient, since DRAM consumes less energy than SRAM. Finally, an IRAM has thepotential for higher performance than a conventional approach. Since higher performance

    for some fixed energy consumption can be translated into equal performance at a lowe

    amount of energy, the performance advantages of IRAM can be translated into lowe

    energy consumption.

    Besides reducing the frequency of memory accesses IRAM also reduces the

    energy per instruction which is given by the equation,

    Energy per memory access = AEL1 + MRL1 x AEL2 + MRL2 x AEoff-chipwhereAE=accessenergyandMR=missrate

    The main contributing term in the above equation is the access energy of off-chip

    which vanishes along with the second term from the equation for energy per instruction fo

    IRAM since there is no L2 cache in IRAM. So the energy per memory access will be only

    the access energy of L1 cache which is also less when compared to the L1 cache acces

    energy of other microprocessor chips. So the energy consumption of IRAM chips is very

    low which is very good for low power consuming devices and portable devices.

    Memory Flexibility

    Another advantage of IRAM over conventional designs is the ability to adjus

    both the size and width of the on-chip DRAM. Rather than being limited by powers of 2 in

    length or width, as is conventional DRAM, IRAM designers can specify exactly the

    number of words and their width. This flexibility can improve the cost of IRAM solution

    versus memories made from conventional DRAMs.

  • 8/3/2019 z-IRAM

    34/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 34

    Low Cost of Production

    Fabrication of RAM and Processor is done in a single fabrication line. So cost i

    reduced due to unified fabrication. Tax is reduced since no need for individual taxes for

    RAM and Processor. Since both cost of production and tax is less the market price oIRAM will be less than the combined price for RAM and Processor. Thus it is suitable fo

    budget conscious customers. In fact it should be an obvious choice of performance

    conscious customers because it delivers high quality performance at bare minimum price.

    Small Board Area

    IRAM integrates several chips into One Chip. So board area is very much

    reduced due to integration. So it may be attractive in applications where board area i

    precious such as cellular phones or portable computers. Since the size and portability o

    devices is decreasing and increasing rapidly IRAM offers a well defined path for achieving

    the future goals.

    6.Disadvantages of IRAM

    No product or technology is hundred percent perfect. Sometimes it may hav

    defects or drawbacks. Similarly IRAM is also not a perfect technology. It also ha

    disadvantages. The disadvantages of IRAM chip are,

    1. Completely New Architecture: IRAM is a new technology which is entirely

    different from current technological implementations since it integrates the processor and

    memory into a single chip. For the acceptance of this new technology we have to discard

    our current products and technologies. That is we have to revamp the complete system

    from the scratch itself. So it may effect the wide acceptance of this technology even

    though the performance is excellent. But once it is accepted widely it wont be a problem

    as it becomes the current technology making other technologies obsolete.

  • 8/3/2019 z-IRAM

    35/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 35

    2. Non Upgradeability of Memory: Since the DRAM chips are embedded in the IRAM

    chip we will not be able to upgrade the memory further. This may limit the popularity o

    IRAM chips since the demand for more memory capacity in increasing rapidly

    Researches are going on for finding a solution for this problem. Sometimes the nexgeneration chips may have the provision for upgrading the memory capacity.

    3. High Cost of Testing: The testing cost of IRAM chip is high when compared to th

    other memory testing processes. This is because the cost of testing during manufacturing i

    significant for DRAMs. Adding a processor would significantly increase the test time on

    conventional DRAM testers. But once it establishes the testing cost will not be a problem

    in a long run since the revenue or profit will account for the high cost of testing

    Sometimes a new cost effective method of testing the DRAMs may emerge in course o

    time.

    4. Overheating: The high level of integration decreases the chip area considerably. So even

    though the heat produced is less when compared to that of current processors, it may

    overheat the chip due to the small area. But we can rectify the effect of heat by means of

    an efficient heat sink or cooling system which exhausts the heat produced to the externa

    environment.

  • 8/3/2019 z-IRAM

    36/37

    SEMINAR2006 IRAM

    DEPARTMENT OF ECE 36

    7.Conclusion

    Merging a microprocessor and DRAM on the same chip presents opportunities in

    performance, energy efficiency, and cost: a factor of 5 to 10 reduction in latency, a facto

    of 50 to 100 increase in bandwidth, a factor of 2 to 4 advantage in energy efficiency, and

    an unquantified cost savings by removing superfluous memory and by reducing board

    area. The surprise is that these claims are notbased on some exotic, unproven technology

    they based instead on tapping the potential of a technology in use for the last 10 years. Th

    popularity of IRAM is only limited by the amount of memory on-chip, which should

    expand by about 60% per year. A best case scenario would be for IRAM to expand it

    beachhead in graphics, which requires about 10 Mbits, to the game, embedded, andpersonal digital assistant markets, which require about 32 Mbits of storage. Such high

    volume applications could in turn justify creation of a process that is more friendly to

    IRAM, with DRAM cells that are a little bigger than in a DRAM fabrication but much

    more amenable to logic and SRAM. As IRAM grows to 128 to 256 Mbits of storage, an

    IRAM might be adopted by the network computer or portable PC markets. Such a succes

    could in turn entice either microprocessor manufacturers to include substantial DRAM on

    chip, or DRAM manufacturers to include processors on chip. Hence IRAM presents an

    opportunity to change the nature of the semiconductor industry. From the current division

    into logic and memory camps, a more homogeneous industry might emerge with historica

    microprocessor manufacturers shipping substantial amounts of DRAM - just as they ship

    substantial amounts of SRAM today - or historical DRAM manufacturers shipping

    substantial numbers of microprocessors. Both scenarios might even occur, with one set o

    manufacturers oriented towards high performance and the other towards low cost. Also

    IRAM with its potential can create a new generation of computers with increased

    portability, reduced size and power consumption without compromising on performanc

    and efficiency.

  • 8/3/2019 z-IRAM

    37/37

    SEMINAR2006 IRAM

    8.References

    IRAM - Chips that remember and compute, IEEE International Solid State

    Circuits Conference A Case for Intelligent RAM,IEEE Micro

    IRAM - the Industrial Setting, Applications, and Architectures, Computer

    Science Division, University of California, Berkeley

    Vector IRAM - ISA and Micro-architecture, Computer Science Division,

    University of California, Berkeley

    Vector IRAM - A Media-oriented Vector Processor with Embedded DRAM,

    Computer Science Division, University of California, Berkeley

    A Media-enhanced vector architecture for embedded memory systems,

    Computer Science Division, University of California, Berkeley

    Memory-Intensive Benchmarks: IRAM vs. Cache-Based Machines, Computer

    Science Division, University of California, Berkeley

    IRAM - Overcoming the I/O Bus Bottleneck, Denver, CO, USA

    The energy efficiency of IRAM architectures, 24th Annual Internationa

    Symposium on Computer Architecture

    http://iram.cs.berkeley.edu/