CAM Memory


Design and Implementation of CAM Memory for Dual-Core RAM

Abstract: A new information detection method is proposed for a very fast and efficient search engine. The method is implemented in hardware on an FPGA. We take advantage of Content Addressable Memory (CAM), which supports both storage and matching modes, in designing the system. The CAM blocks are built from the FPGA device's available memory blocks to reduce the access time of the whole system. The entire memory can return multiple matched results concurrently: the system uses the CAMs for pattern matching in parallel and returns the addresses of all matches at once. Because of these parallel multi-matching operations, the system can perform pattern matching under various required constraint conditions without using any search algorithm. Multi-matched results are returned in 60 ns at an operating frequency of 50 MHz, which increases the matching performance of any information detection system that uses this method as its core.

Keywords - FPGA, Information Detection Hardware System, Content Addressable Memory, Dual-port RAM, parallel operation, multiple matches.

Introduction: Nowadays, high-speed information detection plays an important role in many applications. It is a key element in pattern analysis and detection as well as data search, in areas such as network routers, telecommunications, image recognition, sound recognition, character recognition, full-text search, artificial intelligence in robots, bioinformatics and DNA computation. However, designing fast and efficient information detection in hardware remains a challenge for designers, and hardware information detection systems are still not widely enough deployed to satisfy industrial needs. Search efficiency is always a trade-off between the number of detection operations and system resources. Hardware search engines for various purposes have been presented, but their matching operations were implemented sequentially or with complicated algorithms. These approaches slow down information detection over large data sets and consume a large amount of system resources.

Content addressable memory: Content-addressable memory (CAM) is a special type of computer memory used in certain very-high-speed searching applications. It is also known as associative memory, associative storage, or associative array, although the last term is more often used for a programming data structure. Several custom computers, like the Goodyear STARAN, were built to implement CAM and were designated associative computers.

Hardware associative array: Unlike standard computer memory (random-access memory, or RAM), in which the user supplies a memory address and the RAM returns the data word stored at that address, a CAM is designed so that the user supplies a data word and the CAM searches its entire memory to see whether that data word is stored anywhere in it. If the data word is found, the CAM returns a list of one or more storage addresses where the word was found (and in some architectures it also returns the data word itself, or other associated pieces of data). Thus, a CAM is the hardware embodiment of what in software terms would be called an associative array. The data word recognition unit was proposed by Dudley Allen Buck in 1955.
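The lookup behavior described above can be sketched with a minimal software model (the function and variable names here are our own, purely for illustration): given a data word, return every address at which it is stored.

```python
def cam_search(memory, word):
    """Return every address whose stored word equals `word` (all matches at once)."""
    return [addr for addr, stored in enumerate(memory) if stored == word]

# A tiny 4-word memory; the word "1010" is stored at addresses 0 and 2.
mem = ["1010", "0110", "1010", "1111"]
print(cam_search(mem, "1010"))  # -> [0, 2]
```

A real CAM performs this comparison against all stored words simultaneously in hardware; the loop here is only a functional model.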

Standards for content addressable memories: A major interface definition for CAMs and other Network Search Elements (NSEs) was specified in an Interoperability Agreement called the Look-Aside Interface (LA-1 and LA-1B), developed by the Network Processing Forum, which later merged with the Optical Internetworking Forum (OIF). Numerous devices have been produced by Integrated Device Technology, Cypress Semiconductor, IBM, Broadcom and others to the LA interface agreement. On December 11, 2007, the OIF published the serial look-aside (SLA) interface agreement.

Semiconductor implementations: Because a CAM is designed to search its entire memory in a single operation, it is much faster than RAM in virtually all search applications. There are cost disadvantages to CAM, however. Unlike a RAM chip, which has simple storage cells, each individual memory bit in a fully parallel CAM must have its own associated comparison circuit to detect a match between the stored bit and the input bit. Additionally, the match outputs from each cell in the data word must be combined to yield a complete data-word match signal. The additional circuitry increases the physical size of the CAM chip, which increases manufacturing cost. The extra circuitry also increases power dissipation, since every comparison circuit is active on every clock cycle. Consequently, CAM is used only in specialized applications where the required search speed cannot be achieved by a less costly method. One successful early implementation was a General Purpose Associative Processor IC and System.
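The per-bit comparison just described, an XNOR of each stored bit against the search bit with the results combined into one word-match line, can be modeled as follows; the helper name is hypothetical:

```python
def word_match(stored_bits, search_bits):
    """1 if every bit position matches: per-bit XNOR, then AND-reduce."""
    per_bit = [1 - (s ^ q) for s, q in zip(stored_bits, search_bits)]  # XNOR per cell
    match = 1
    for b in per_bit:
        match &= b  # combine the cell outputs into the word-match signal
    return match

print(word_match([1, 0, 1], [1, 0, 1]))  # 1: all bits agree
print(word_match([1, 0, 1], [1, 1, 1]))  # 0: bit 1 differs
```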

Alternative implementations: To achieve a different balance between speed, memory size and cost, some implementations emulate the function of CAM by using standard tree-search or hashing designs in hardware, using hardware tricks like replication or pipelining to speed up effective performance. These designs are often used in routers.

Ternary CAMs: Binary CAM is the simplest type of CAM; it uses data search words consisting entirely of 1s and 0s. Ternary CAM (TCAM) allows a third matching state of "X" or "don't care" for one or more bits in the stored data word, thus adding flexibility to the search. For example, a ternary CAM might have a stored word of "10XX0" which will match any of the four search words "10000", "10010", "10100", or "10110". The added search flexibility comes at an additional cost over binary CAM, as the internal memory cell must now encode three possible states instead of two. This additional state is typically implemented by adding a mask bit ("care" or "don't care" bit) to every memory cell. Holographic associative memory provides a mathematical model for "don't care" integrated associative recollection using complex-valued representation.
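The ternary match rule above can be sketched with an explicit mask string, where '1' marks a "don't care" position (an encoding chosen just for this example: the stored word "10XX0" becomes data "10000" with mask "00110"):

```python
def tcam_match(data, mask, key):
    """True if `key` equals `data` at every 'care' position (mask bit '0')."""
    return all(m == "1" or d == k for d, m, k in zip(data, mask, key))

# Stored word "10XX0": matches the four keys from the example, rejects others.
for key in ("10000", "10010", "10100", "10110", "11000"):
    print(key, tcam_match("10000", "00110", key))
```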

Example applications: Content-addressable memory is often used in computer networking devices. For example, when a network switch receives a data frame on one of its ports, it updates an internal table with the frame's source MAC address and the port it was received on. It then looks up the destination MAC address in the table to determine which port the frame should be forwarded to, and sends it out on that port. The MAC address table is usually implemented with a binary CAM so that the destination port can be found very quickly, reducing the switch's latency.

Ternary CAMs are often used in network routers, where each address has two parts: the network address, which can vary in size depending on the subnet configuration, and the host address, which occupies the remaining bits. Each subnet has a network mask that specifies which bits of the address are the network address and which bits are the host address. Routing is done by consulting a routing table maintained by the router, which contains each known destination network address, the associated network mask, and the information needed to route packets to that destination. Without a CAM, the router compares the destination address of the packet to be routed with each entry in the routing table, performing a logical AND with the network mask and comparing the result with the network address. If they are equal, the corresponding routing information is used to forward the packet. Using a ternary CAM for the routing table makes the lookup process very efficient: the addresses are stored with "don't care" in the host part, so looking up the destination address in the CAM immediately retrieves the correct routing entry. Both the masking and the comparison are done by the CAM hardware.

Other CAM applications include:

CPU fully associative cache controllers and translation look-aside buffers. A CPU cache is a cache used by the central processing unit (CPU) of a computer to reduce the average time to access memory. The cache is a smaller, faster memory which stores copies of the data from frequently used main-memory locations. Most CPUs have several independent caches, including instruction and data caches, where the data cache is usually organized as a hierarchy of cache levels (L1, L2, etc.).

Database engines. A database is an organized collection of data, typically organized to model relevant aspects of reality in a way that supports processes requiring this information, for example, modeling the availability of rooms in hotels in a way that supports finding a hotel with vacancies. Database management systems (DBMSs) are specially designed software applications that interact with the user, other applications, and the database itself to capture and analyze data. A general-purpose DBMS is a software system designed to allow the definition, creation, querying, update, and administration of databases. Well-known DBMSs include MySQL, MariaDB, PostgreSQL, SQLite, Microsoft SQL Server, Oracle, SAP, dBASE, FoxPro, IBM DB2, LibreOffice Base and FileMaker Pro. A database is not generally portable across different DBMSs, but different DBMSs can interoperate by using standards such as SQL and ODBC or JDBC to allow a single application to work with more than one database.

Data compression hardware. In computer science and information theory, data compression, source coding, or bit-rate reduction involves encoding information using fewer bits than the original representation. Compression can be either lossy or lossless. Lossless compression reduces bits by identifying and eliminating statistical redundancy; no information is lost. Lossy compression reduces bits by identifying unnecessary information and removing it. The process of reducing the size of a data file is popularly referred to as data compression, although its formal name is source coding (coding done at the source of the data before it is stored or transmitted). Compression is useful because it helps reduce resource usage, such as data storage space or transmission capacity. Because compressed data must be decompressed before use, this extra processing imposes computational or other costs; the situation is far from a free lunch, and data compression is subject to a space-time complexity trade-off. For instance, a compression scheme for video may require expensive hardware for the video to be decompressed fast enough to be viewed as it is being decompressed, and the option to decompress the video in full before watching it may be inconvenient or require additional storage. The design of data compression schemes involves trade-offs among various factors, including the degree of compression, the amount of distortion introduced (e.g., when using lossy data compression), and the computational resources required to compress and uncompress the data. New alternatives to traditional systems (which sample at full resolution, then compress) provide efficient resource usage based on principles of compressed sensing, which circumvents the need for compression by sampling on a cleverly selected basis.

Artificial neural networks. In computer science and related fields, artificial neural networks are computational models inspired by animals' central nervous systems (in particular the brain) that are capable of machine learning and pattern recognition. They are usually presented as systems of interconnected "neurons" that can compute values from inputs by feeding information through the network. For example, in a neural network for handwriting recognition, a set of input neurons may be activated by the pixels of an input image representing a letter or digit. The activations of these neurons are then passed on, weighted and transformed by some function determined by the network's designer, to other neurons, and so on, until finally an output neuron is activated that determines which character was read. Like other machine-learning methods, neural networks have been used to solve a wide variety of tasks that are hard to solve with ordinary rule-based programming, including computer vision and speech recognition.

Intrusion prevention systems. An intrusion detection system (IDS) is a device or software application that monitors network or system activities for malicious activity or policy violations and produces reports to a management station. Some systems may attempt to stop an intrusion attempt, but this is neither required nor expected of a monitoring system. Intrusion detection and prevention systems (IDPS) are primarily focused on identifying possible incidents, logging information about them, and reporting attempts. In addition, organizations use IDPSs for other purposes, such as identifying problems with security policies, documenting existing threats, and deterring individuals from violating security policies. IDPSs have become a necessary addition to the security infrastructure of nearly every organization. They typically record information related to observed events, notify security administrators of important events, and produce reports. Many IDPSs can also respond to a detected threat by attempting to prevent it from succeeding, using techniques such as stopping the attack itself, changing the security environment (e.g., reconfiguring a firewall), or changing the attack's content.
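The linear routing-table walk described in the router example (AND the destination with each entry's mask, then compare with the network address) can be sketched in software; the table entries, addresses, and port names here are invented for illustration. A TCAM performs every row comparison at once instead of looping:

```python
routes = [  # (network, mask, next_hop) -- hypothetical table
    (0xC0A80100, 0xFFFFFF00, "port1"),  # 192.168.1.0/24
    (0x0A000000, 0xFF000000, "port2"),  # 10.0.0.0/8
]

def lookup(dest):
    """Return the next hop for `dest` by masked comparison against each row."""
    for network, mask, hop in routes:
        if dest & mask == network:  # logical AND with the mask, then compare
            return hop
    return "default"

print(lookup(0xC0A80105))  # 192.168.1.5 -> "port1"
print(lookup(0x0A010203))  # 10.1.2.3    -> "port2"
```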

Random-access memory: Random-access memory is a form of computer data storage. A random-access device allows stored data to be accessed directly in any order. In contrast, other data storage media such as hard disks, CDs, DVDs and magnetic tape, as well as early primary memory types such as drum memory, read and write data only in a predetermined, consecutive order because of mechanical design limitations; the time to access a given data location therefore varies significantly depending on its physical location. Today, random-access memory takes the form of integrated circuits. Strictly speaking, modern types of DRAM are not random access, as data is read in bursts, although the name DRAM/RAM has stuck. However, many types of SRAM, ROM, OTP, and NOR flash are still random access even in the strict sense. RAM is normally associated with volatile types of memory (such as DRAM memory modules), where the stored information is lost if power is removed, but many types of non-volatile memory are RAM as well, including most types of ROM and a type of flash memory called NOR flash. The first RAM modules came onto the market in 1951 and were sold until the late 1960s and early 1970s.

History: Early computers used relays or delay lines for "main" memory functions. Ultrasonic delay lines could only reproduce data in the order it was written. Drum memory could be expanded at low cost, but retrieval of non-sequential memory items required knowledge of the physical layout of the drum to optimize speed. Latches built out of vacuum-tube triodes, and later out of discrete transistors, were used for smaller and faster memories such as random-access register banks and registers. Such registers were relatively large, power-hungry and too costly to use for large amounts of data; generally only a few hundred or few thousand bits of such memory could be provided.

The first practical form of random-access memory was the Williams tube, starting in 1947. It stored data as electrically charged spots on the face of a cathode ray tube. Since the electron beam of the CRT could read and write the spots on the tube in any order, memory was random access. The capacity of the Williams tube was a few hundred to around a thousand bits, but it was much smaller, faster, and more power-efficient than using individual vacuum-tube latches. Developed at the University of Manchester in England, the Williams tube provided the medium on which the first electronically stored program was implemented in the Manchester Small-Scale Experimental Machine (SSEM) computer, which first successfully ran a program on 21 June 1948. In fact, rather than the Williams tube memory being designed for the SSEM, the SSEM was a testbed to demonstrate the reliability of the memory.

Magnetic-core memory was invented in 1947 and developed up until the mid-1970s. It became a widespread form of random-access memory, relying on an array of magnetized rings. By changing the sense of each ring's magnetization, data could be stored, with one bit per ring. Since every ring had a combination of address wires to select and read or write it, access to any memory location in any sequence was possible. Magnetic-core memory was the standard form of memory system until displaced by solid-state memory in integrated circuits, starting in the early 1970s. Robert H. Dennard invented dynamic random-access memory (DRAM) in 1968; this allowed replacement of a 4- or 6-transistor latch circuit by a single transistor for each memory bit, greatly increasing memory density at the cost of volatility. Data was stored in the tiny capacitance of each transistor and had to be periodically refreshed every few milliseconds before the charge could leak away. Prior to the development of integrated read-only memory (ROM) circuits, permanent (or read-only) random-access memory was often constructed using diode matrices driven by address decoders, or specially wound core-rope memory planes.

Types of RAM: The three main forms of modern RAM are static RAM (SRAM), dynamic RAM (DRAM) and phase-change memory (PRAM). In SRAM, a bit of data is stored using the state of a flip-flop. This form of RAM is more expensive to produce, but is generally faster and requires less power than DRAM; in modern computers it is often used as cache memory for the CPU. DRAM stores a bit of data using a transistor and capacitor pair, which together comprise a memory cell. The capacitor holds a high or low charge (1 or 0, respectively), and the transistor acts as a switch that lets the control circuitry on the chip read the capacitor's state of charge or change it. As this form of memory is less expensive to produce than static RAM, it is the predominant form of computer memory used in modern computers.

Both static and dynamic RAM are considered volatile, as their state is lost or reset when power is removed from the system. By contrast, read-only memory (ROM) stores data by permanently enabling or disabling selected transistors, such that the memory cannot be altered. Writeable variants of ROM (such as EEPROM and flash memory) share properties of both ROM and RAM, enabling data to persist without power and to be updated without special equipment. These persistent forms of semiconductor ROM include USB flash drives, memory cards for cameras and portable devices, and so on. ECC memory (which can be either SRAM or DRAM) includes special circuitry to detect and/or correct random faults (memory errors) in the stored data, using parity bits or an error-correcting code. In general, the term RAM refers solely to solid-state memory devices (either DRAM or SRAM), and more specifically to the main memory in most computers. In optical storage, the term DVD-RAM is somewhat of a misnomer since, unlike CD-RW or DVD-RW, it does not need to be erased before reuse. Nevertheless, a DVD-RAM behaves much like a hard disk drive, if somewhat slower.

Memory hierarchy: One can read and over-write data in RAM. Many computer systems have a memory hierarchy consisting of CPU registers, on-die SRAM caches, external caches, DRAM, paging systems, and virtual memory or swap space on a hard drive. This entire pool of memory may be referred to as "RAM" by many developers, even though the various subsystems can have very different access times, violating the original concept behind the random-access term in RAM. Even within a hierarchy level such as DRAM, the specific row, column, bank, rank, channel, or interleave organization of the components makes the access time variable, although not to the extent that access times on rotating storage media or tape are variable. The overall goal of a memory hierarchy is to obtain the highest possible average access performance while minimizing the total cost of the entire memory system (generally, the hierarchy follows access time, with the fast CPU registers at the top and the slow hard drive at the bottom). In many modern personal computers, the RAM comes in easily upgraded modules called memory modules or DRAM modules, about the size of a few sticks of chewing gum. These can quickly be replaced should they become damaged or when changing needs demand more storage capacity. As suggested above, smaller amounts of RAM (mostly SRAM) are also integrated in the CPU and other ICs on the motherboard, as well as in hard drives, CD-ROM drives, and several other parts of the computer system.

Other uses of RAM:In addition to serving as temporary storage and working space for the operating system and applications, RAM is used in numerous other ways.

Virtual memory: Most modern operating systems employ a method of extending RAM capacity known as "virtual memory". A portion of the computer's hard drive is set aside for a paging file or a scratch partition, and the combination of physical RAM and the paging file forms the system's total memory. (For example, if a computer has 2 GB of RAM and a 1 GB page file, the operating system has 3 GB of total memory available to it.) When the system runs low on physical memory, it can "swap" portions of RAM to the paging file to make room for new data, as well as read previously swapped information back into RAM. Excessive use of this mechanism results in thrashing and generally hampers overall system performance, mainly because hard drives are far slower than RAM.

RAM disk: Software can "partition" a portion of a computer's RAM, allowing it to act as a much faster hard drive, called a RAM disk. A RAM disk loses its stored data when the computer is shut down, unless the memory is arranged to have a standby battery source.

Shadow RAM: Sometimes the contents of a relatively slow ROM chip are copied to read/write memory to allow for shorter access times. The ROM chip is then disabled while the initialized memory locations are switched in on the same block of addresses (often write-protected). This process, sometimes called shadowing, is fairly common in both computers and embedded systems. As a common example, the BIOS in typical personal computers often has an option called "use shadow BIOS" or similar. When enabled, functions relying on data from the BIOS's ROM will instead use DRAM locations (most can also toggle shadowing of video card ROM or other ROM sections). Depending on the system, this may not result in increased performance and may cause incompatibilities; for example, some hardware may be inaccessible to the operating system if shadow RAM is used. On some systems the benefit may be hypothetical because the BIOS is not used after booting in favor of direct hardware access. Free memory is reduced by the size of the shadowed ROMs.

Memory wall: The "memory wall" is the growing disparity of speed between the CPU and memory outside the CPU chip. An important reason for this disparity is the limited communication bandwidth beyond chip boundaries. From 1986 to 2000, CPU speed improved at an annual rate of 55% while memory speed improved at only 10%. Given these trends, it was expected that memory latency would become an overwhelming bottleneck in computer performance. CPU speed improvements have since slowed significantly, partly due to major physical barriers and partly because current CPU designs have already hit the memory wall in some sense. Intel summarized these causes in a 2005 document: "First of all, as chip geometries shrink and clock frequencies rise, the transistor leakage current increases, leading to excess power consumption and heat... Secondly, the advantages of higher clock speeds are in part negated by memory latency, since memory access times have not been able to keep pace with increasing clock frequencies. Third, for certain applications, traditional serial architectures are becoming less efficient as processors get faster (due to the so-called Von Neumann bottleneck), further undercutting any gains that frequency increases might otherwise buy. In addition, partly due to limitations in the means of producing inductance within solid state devices, resistance-capacitance (RC) delays in signal transmission are growing as feature sizes shrink, imposing an additional bottleneck that frequency increases don't address." The RC delays in signal transmission were also noted in "Clock Rate versus IPC: The End of the Road for Conventional Micro-architectures", which projects a maximum of 12.5% average annual CPU performance improvement between 2000 and 2014. Data on Intel processors clearly shows a slowdown in performance improvements in recent processors. However, Intel's Core 2 Duo processors (codenamed Conroe) showed a significant improvement over previous Pentium 4 processors; due to a more efficient architecture, performance increased while clock rate actually decreased.

Dual-ported RAM: Dual-ported RAM (DPRAM) is a type of random-access memory that allows multiple reads or writes to occur at the same time, or nearly the same time, unlike single-ported RAM, which allows only one access at a time. Video RAM, or VRAM, is a common form of dual-ported dynamic RAM used mostly as video memory, allowing the CPU to draw the image at the same time the video hardware is reading it out to the screen. Apart from VRAM, most other types of dual-ported RAM are based on static RAM technology. Most CPUs implement the processor registers as a small dual-ported or multi-ported RAM.

Parallel computing: Parallel computing is a form of computation in which many calculations are carried out simultaneously, operating on the principle that large problems can often be divided into smaller ones, which are then solved concurrently ("in parallel"). There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has been employed for many years, mainly in high-performance computing, but interest in it has grown lately due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors to accelerate specific tasks. Parallel computer programs are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically some of the greatest obstacles to good parallel program performance. The maximum possible speed-up of a single program as a result of parallelization is known as Amdahl's law.
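Amdahl's law, mentioned above, bounds the achievable speed-up: if a fraction p of a program can be parallelized across n processors, the overall speed-up is 1 / ((1 - p) + p / n). A quick sketch (the function name is ours):

```python
def amdahl_speedup(p, n):
    """Maximum speed-up when fraction p of the work runs on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

print(amdahl_speedup(0.95, 8))    # about 5.9x on 8 processors
print(amdahl_speedup(0.95, 1e9))  # approaches 20x: the serial 5% dominates
```

Even with unlimited processors, the serial fraction caps the speed-up, which is one reason communication and synchronization overheads matter so much.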

Design approach: Our objective is to design a search engine on hardware such as an FPGA without increasing detection time or system complexity. A system that is fast, efficient, and low cost has become increasingly necessary. We propose a novel hardware information detection method in which the search data are processed in parallel. The architecture is based on Content Addressable Memory (CAM) and parallel structures, achieving fast detection and accelerating search performance while consuming as few system resources as possible. Because CAM can be implemented with digital circuits, the CAM and the whole system can be implemented entirely on a programmable hardware device such as an FPGA. This paper makes the following main contributions: 1) an efficient and fast CAM structure whose multi-match values are returned in parallel; 2) a system that operates with a parallel structure and produces multi-match results in parallel at high speed (60 ns on the FPGA); 3) a system designed for search purposes without using any search algorithm; 4) a design based on CAM blocks and simple logic circuits such as SHIFT and AND, without a CPU or complex computation.

CAM STRUCTURE ON FPGA: Content Addressable Memory (CAM) plays a very important part in our system. There are many approaches to CAM design; most focus mainly on circuit-design aspects. However, designing CAM on FPGA hardware has also received much attention. For example, a CAM can be implemented using look-up tables (LUTs) or the memory resources of FPGA devices.

Fig: DUAL PORT RAM structure

In this paper, we have designed the CAM using the memory resources of the Altera Cyclone IV E. The reasons are to save logic resources, use the available memory resources, and accelerate the match speed of the system. Because a CAM has a structure similar to a RAM, we have implemented the CAM on top of the dual-port RAM structure, taking advantage of the M9K memory structure of the Cyclone IV E device as shown in the figure. Each dual-port M9K block consists of 8192 memory bits plus 1024 bits for parity check codes. Port-A is used to store or erase data in the CAM and is therefore a write-only port; Port-B is used to look up matched data and is a read-only port.

Port-A is configured with a 13-bit (8 + 5) address and 1-bit data, giving 8192 words x 1 bit = 8192 bits. Port-B is configured with an 8-bit address (2^8 = 256 words) and a 32-bit data width, giving 256 words x 32 bits = 8192 bits. The size of the two ports is therefore always the same initial 8192 bits. At Port-B, the address port is treated as the match data and the data port as the match address: the 8-bit address becomes the 8-bit match data, while the 32-bit data word becomes a 32-bit one-hot match address. In other words, Port-B behaves like a RAM, but the address and data ports are swapped with each other to build the CAM function. The detailed swapping process is illustrated in Fig. 2. The expected match data of the CAM is applied to Port-B's address port, and the matched addresses are returned via Port-B's data port. The stored data are treated as memory addresses with the same indexes, so the data, also called the CAM addresses, are read out concurrently. In this case, one M9K block is configured as one 32-word x 8-bit CAM.
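The address/data swap described above can be modeled in software: a write sets one bit at the 13-bit address {8-bit value, 5-bit location}, and a read of the 32-bit word at an 8-bit value address yields a one-hot vector of matching CAM locations. All names here are our own; in hardware this is a single M9K block.

```python
bits = [0] * 8192                       # the shared 8192-bit memory array

def cam_write(location, value):
    """Port A: record 'value stored at location' by setting one bit."""
    bits[value * 32 + location] = 1     # 13-bit address = {8-bit value, 5-bit location}

def cam_lookup(value):
    """Port B: read the 32-bit word at address `value` -> one-hot match vector."""
    word = bits[value * 32 : value * 32 + 32]
    return [loc for loc, b in enumerate(word) if b]

cam_write(3, 0xAB)
cam_write(17, 0xAB)
print(cam_lookup(0xAB))  # [3, 17] -- both matching locations in one read
```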

Fig: CAM circuit based on dual-port RAM

A dual-port M9K block can be configured in various CAM sizes. Because our CAM is built from dual-port M9K blocks providing 8192 addressable bits, corresponding to 13 address bits, the bit length and the depth of one CAM unit can vary as long as the total number of address bits and data bits always equals 13. For instance, the CAM depth and data bit length can be flexibly configured as 32-, 16-, 8-, or 4-word depth with 8-, 9-, 10-, or 11-bit length, respectively. The CAM sizes with various bit lengths are depicted in Table I. In our research, however, we use only the 32-word x 8-bit CAM as the unit device for cascading, so we focus on that size alone. The output of our CAM is the one-hot match address register storing the multiple matched values in parallel. The match detection output is validated by the match signal output, which is the bitwise OR of all the matched one-hot addresses from the CAM.
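The match signal described above (valid if and only if some bit of the one-hot output is set) reduces to a bitwise OR across the word; a small sketch with hypothetical names:

```python
def match_signal(one_hot_word):
    """Return 1 if any bit of the 32-bit one-hot match word is set."""
    sig = 0
    for bit in one_hot_word:
        sig |= bit  # OR-reduce the one-hot match outputs
    return sig

print(match_signal([0] * 3 + [1] + [0] * 28))  # 1 -> at least one match
print(match_signal([0] * 32))                  # 0 -> no match
```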

Conclusion: A new information detection method for a very fast and efficient search engine has been implemented successfully in hardware on an FPGA. Based on parallel multi-matching operations, the information detection system can perform pattern matching against various defined search patterns without using any search algorithm, which improves the detection and search performance of information processing and detection systems. The most significant advantage of this system is its use of CAM blocks with parallel match outputs, parallel processing, and very simple logic circuits such as AND and SHIFT for very fast data detection. This reduces the number of detection operations and accelerates search performance. The multi-matched values of various search patterns are returned concurrently. All designs were implemented successfully on the FPGA, and the system produces the correct addresses in parallel.

4. INTRODUCTION TO XILINX ISE

Xilinx designs, develops, and markets programmable logic products, including integrated circuits (ICs), software design tools, predefined system functions delivered as intellectual property (IP) cores, design services, customer training, field engineering, and technical support. Xilinx provides both FPGA and CPLD programmable logic devices for electronic equipment manufacturers in markets such as communications, industrial, consumer, automotive, and data processing.

It is the world's largest supplier of programmable logic devices, the inventor of the field programmable gate array (FPGA) and the first semiconductor company with a fabless manufacturing model.

The Integrated Software Environment (ISE) Design Suite is the central electronic design automation (EDA) product family sold by Xilinx. The ISE Design Suite features include design entry and synthesis supporting Verilog or VHDL, place-and-route (PAR), completed verification and debug using ChipScope Pro tools, and creation of the bit files that are used to configure the chip.

The ISE 9.1i is a hands-on learning tool for new users of the ISE software and for users who wish to refresh their knowledge of the software. It demonstrates basic set-up and design methods available in the PC version of the ISE software. We will have a greater understanding of how to implement our own design flow using the ISE 9.1i software.

This section provides Xilinx PLD designers with a quick overview of the basic design process using ISE 9.1i. We will gain an understanding of how to create, verify, and implement a design. ISE controls all aspects of the design flow. Through the Project Navigator interface, you can access all of the design entry and design implementation tools, as well as the files and documents associated with your project.

THE FOLLOWING ARE THE STEPS IN A PROJECT:
1. Getting Started
2. Create a New Project
3. Create an HDL Source
4. Design Simulation
5. Create Timing Constraints
6. Implement Design and Verify Constraints
7. Reimplement Design and Verify Pin Locations
8. Download the code onto the Spartan-3 Demo Board

4.1 SOFTWARE REQUIREMENTS
We must install the following software: ISE 9.1i.

4.2 HARDWARE REQUIREMENTS
We must have the following hardware: the Spartan-3 Starter Kit, containing the Spartan-3 Starter Kit Demo Board (FPGA).

4.3 PROJECT NAVIGATOR MAIN WINDOW
The following figure shows the Project Navigator main window, which allows you to manage your design from design entry through device configuration.

Figure 5. Project Navigator Window

1. Toolbar
2. Sources window
3. Processes window
4. Workspace
5. Transcript window

4.4 USING THE SOURCES WINDOW
The first step in implementing your design for a Xilinx FPGA or CPLD is to assemble the design source files into a project. The Sources tab in the Sources window shows the source files you create and add to your project, as shown in the following figure.

Figure 6.Source Window

4.5 USING THE PROCESSES WINDOW
The Processes tab in the Processes window allows you to run actions, or "processes", on the source file you select in the Sources tab of the Sources window. The processes change according to the source file you select. The Processes tab shows the available processes in a hierarchical view. Processes are arranged in the order of a typical design flow: project creation, design entry, constraints management, synthesis, implementation, and programming file creation.

Figure 7. Process Window

PROCESS TYPES
The following types of processes are available as you work on your design:

Tasks: When you run a task process, the ISE software runs in "batch mode"; that is, the software processes your source file but does not open any additional software tools in the Workspace. Output from the process appears in the Transcript window.

Reports: Most tasks include report sub-processes, which generate a summary or status report, for example the Synthesis Report or Map Report. When you run a report process, the report appears in the Workspace.

Tools: When you run a tools process, the related tool launches in standalone mode or appears in the Workspace, where you can view or modify your design source files.

PROCESS STATUS
Project Navigator keeps track of the changes you make to source files and shows the status of each process with the following status icons:

Running: The process is running.

Up-to-date: The process ran successfully with no errors or warnings and does not need to be rerun. If the icon is next to a report process, the report is up-to-date; however, associated tasks may have warnings or errors.

Warnings reported: The process ran successfully, but warnings were encountered.

Errors reported: The process ran but encountered an error.

Out-of-date: You made design changes that require the process to be rerun. If this icon is next to a report process, you can rerun the associated task process to create an up-to-date version of the report.

No icon: The process was never run.

2. INTRODUCTION TO VLSI

VLSI stands for "Very Large Scale Integration". This is the field that involves packing more and more logic devices into smaller and smaller areas. Thanks to VLSI, circuits that would once have filled a whole board can now fit into a space a few millimeters across! This has opened up a big opportunity to do things that were not possible before. VLSI circuits are everywhere: your computer, your car, your brand-new state-of-the-art digital camera, your cell phone, and what have you. All this involves a lot of expertise on many fronts within the same field, which we will look at in later sections.

VLSI has been around for a long time; there is nothing new about it. But as a side effect of advances in the world of computers, there has been a dramatic proliferation of tools that can be used to design VLSI circuits. Alongside this, obeying Moore's law, the capability of an IC has increased exponentially over the years in terms of computation power, utilization of available area, and yield. The combined effect of these two advances is that people can now put diverse functionality into ICs, opening up new frontiers. Examples are embedded systems, where intelligent devices are put inside everyday objects, and ubiquitous computing, where small computing devices proliferate to such an extent that even the shoes you wear may actually do something useful, like monitoring your heartbeat! These two fields are related, and describing them could easily fill another article.

2.1 DEALING WITH VLSI CIRCUITS
Digital VLSI circuits are predominantly CMOS based. The way normal blocks like latches and gates are implemented differs from what students have seen so far, but the behavior remains the same. All this miniaturization brings new things to consider; a lot of thought has to go into the actual implementation as well as the design. Let us look at some of the factors involved:

1. Circuit delays. Large, complicated circuits running at very high frequencies have one big problem to tackle: delays in the propagation of signals through gates and wires, even across areas just a few micrometers wide. The operating speed is so high that as the delays add up, they can actually become comparable to the clock period.

2. Power. Another effect of high operating frequencies is increased power consumption. This has a two-fold effect: devices drain batteries faster, and heat dissipation increases. Coupled with the fact that surface areas have decreased, heat poses a major threat to the stability of the circuit itself.

3. Layout. Laying out the circuit components is a task common to all branches of electronics. What is special in our case is that there are many possible ways to do it: there can be multiple layers of different materials on the same silicon, different arrangements of the smaller parts of the same component, and so on. Power dissipation and speed present a trade-off; optimizing one affects the other. The choice between the two is determined by how we choose to lay out the circuit components. Layout also affects the fabrication of VLSI chips, making the components either easy or difficult to implement on the silicon.

2.2 THE VLSI DESIGN PROCESS
A typical digital design flow is as follows:

Specification

Architecture

RTL Coding

RTL Verification

Synthesis

Backend

Tape-out to foundry to get the end product: a wafer with a repeated number of identical ICs.

All modern digital designs start with a designer writing a hardware description of the IC, using a hardware description language (HDL) such as Verilog or VHDL. A Verilog or VHDL program essentially describes the hardware (logic gates, flip-flops, counters, etc.), the interconnect of the circuit blocks, and the functionality. Various CAD tools are available to synthesize a circuit based on the HDL. The most widely used synthesis tools come from two CAD companies, Synopsys and Cadence.

Without going into details, we can say that VHDL can be called the "C" of hardware design. VHDL stands for VHSIC Hardware Description Language, where VHSIC stands for "Very High Speed Integrated Circuit". This language is used to design circuits at a high level in two ways. It can either be a behavioral description, which describes what the circuit is supposed to do, or a structural description, which describes what the circuit is made of. There are other languages for describing circuits, such as Verilog, which work in a similar fashion. Both forms of description are then used to generate a very low-level description that actually spells out how all of this is to be fabricated on the silicon chip. This results in the manufacture of the intended IC.

2.3 A TYPICAL ANALOG DESIGN FLOW
In the case of analog design, the flow changes somewhat:

Specifications

Architecture

Circuit Design

SPICE Simulation

Layout

Parametric Extraction / Back Annotation

Final Design

Tape Out to foundry.

While digital design is highly automated now, only a very small portion of analog design can be automated. There is a hardware description language called AHDL, but it is not widely used, as it does not accurately capture the behavioral model of the circuit because of the complex effects of parasitics on analog behavior. Many analog chips are what are termed flat, or non-hierarchical, designs. This is true for small-transistor-count chips such as an operational amplifier, a filter, or a power management chip. For more complex analog chips such as data converters, the design is done at the transistor level, built up to the cell level, then the block level, and then integrated at the chip level. Not many CAD tools are available for analog design even today, and thus analog design remains a difficult art. SPICE remains the most useful simulation tool for analog as well as digital design.

2.4 MOST OF TODAY'S VLSI DESIGNS ARE CLASSIFIED INTO THREE CATEGORIES
2.4.1 ANALOG: Small-transistor-count precision circuits such as amplifiers, data converters, filters, phase-locked loops, sensors, etc.

2.4.2 ASICs OR APPLICATION-SPECIFIC INTEGRATED CIRCUITS: Progress in IC fabrication has enabled us to create fast and powerful circuits in smaller and smaller devices. This also means that we can pack a lot more functionality into the same area. The biggest application of this ability is found in the design of ASICs. These are ICs created for specific purposes: each device is created to do a particular job, and to do it well. The most common application areas are DSP signal filters, image compression, and so on. To go to extremes, consider the fact that a digital wristwatch normally consists of a single IC doing all the time-keeping jobs as well as extra features like games, a calendar, etc.

2.4.3 SoC OR SYSTEMS ON A CHIP: These are highly complex mixed-signal circuits (digital and analog all on the same chip). A network processor chip or a wireless radio chip is an example of an SoC.

5. INTRODUCTION TO FPGA (SPARTAN 3E) KIT

A field-programmable gate array (FPGA) is an integrated circuit designed to be configured by the customer or designer after manufacturing, hence "field-programmable". The FPGA configuration is generally specified using a hardware description language (HDL). FPGAs can be used to implement any logical function. The ability to update the functionality after shipping, and the low non-recurring engineering costs (notwithstanding the generally higher unit cost), offer advantages for many applications.

FPGAs contain programmable logic components called "logic blocks" and a hierarchy of reconfigurable interconnects that allow the blocks to be "wired together", somewhat like a one-chip programmable breadboard. Logic blocks can be configured to perform complex combinational functions, or merely simple logic gates like AND and XOR. In most FPGAs, the logic blocks also include memory elements, which may be simple flip-flops or more complete blocks of memory.
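As a hedged illustration of how a logic block realizes arbitrary combinational functions, the following Python sketch models a look-up table (LUT): the configuration bits are simply the truth table, and the inputs index into it. The function names here are illustrative, not any vendor API:

```python
# Model of an FPGA look-up table: a small memory whose contents are the
# truth table of the desired function, addressed by the input bits.
def make_lut(truth_table_bits):
    """truth_table_bits: integer whose bit i is the output for input index i."""
    def lut(*inputs):
        index = 0
        for bit in reversed(inputs):   # pack input bits into a table index
            index = (index << 1) | bit
        return (truth_table_bits >> index) & 1
    return lut

# "Configure" a 2-input LUT as XOR: outputs 0,1,1,0 for inputs 00,01,10,11.
xor_gate = make_lut(0b0110)
print([xor_gate(a, b) for a in (0, 1) for b in (0, 1)])  # [0, 1, 1, 0]
```

Reconfiguring the same block as AND is just a different truth-table constant (`make_lut(0b1000)`), which is the essence of field programmability.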

An alternative approach to using hard-macro processors is to make use of "soft" processor cores implemented within the FPGA logic. To define the behavior of the FPGA, the user provides a hardware description language (HDL) description or a schematic design. The HDL form is better suited to working with large structures, because it is possible to specify them numerically rather than having to draw every piece by hand.

As previously mentioned, many modern FPGAs can be reprogrammed at "run time", and this is leading to the idea of reconfigurable computing, or reconfigurable systems. Applications of FPGAs include digital signal processing, software-defined radio, aerospace and defense systems, medical imaging, computer vision, speech recognition, cryptography, bioinformatics, computer hardware emulation, radio astronomy, metal detection, and a growing range of other areas. Xilinx grew quickly and unchallenged from 1985 to the mid-1990s, when competitors sprouted up. Xilinx has two main FPGA families: the high-performance Virtex series and the high-volume Spartan series, with a cheaper EasyPath option for ramping to volume production.

5.1 SPARTAN FAMILY
The Spartan series targets applications with a low-power footprint, extreme cost sensitivity, and high volume, e.g. displays, set-top boxes, wireless routers, and other applications. The Spartan-3E consumes 70-90% less power in suspend mode and 40-50% less static power compared to standard devices.

Verilog HDL is a hardware description language used to design and document electronic systems. Verilog HDL allows designers to work at various levels of abstraction. It is the most widely used HDL, with a user community of more than 50,000 active designers.

Figure 8. FPGA Spartan 3e Starter Kit

5.2 Spartan 3e Architecture
Input/Output Blocks (IOBs) control the flow of data between the I/O pins and the internal logic of the device. Each IOB supports bidirectional data flow plus 3-state operation and a variety of signal standards, including four high-performance differential standards. Double Data-Rate (DDR) registers are included.

Block RAM provides data storage in the form of 18-Kbit dual-port blocks.

Multiplier blocks accept two 18-bit binary numbers as inputs and calculate the product.

Digital Clock Manager (DCM) blocks provide self-calibrating, fully digital solutions for distributing, delaying, multiplying, dividing, and phase-shifting clock signals.
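The multiplier block's arithmetic can be sketched in a few lines of Python. This is a behavioral model under the assumption of two's-complement inputs and a 36-bit product, not vendor code:

```python
# Behavioral sketch of an 18x18 multiplier block: two 18-bit
# two's-complement inputs produce a 36-bit product.
def to_signed(value, bits):
    """Interpret a raw bit pattern as a two's-complement integer."""
    value &= (1 << bits) - 1
    return value - (1 << bits) if value & (1 << (bits - 1)) else value

def mult18x18(a_bits, b_bits):
    """Return the raw 36-bit product of two 18-bit two's-complement inputs."""
    product = to_signed(a_bits, 18) * to_signed(b_bits, 18)
    return product & ((1 << 36) - 1)   # truncate to the 36-bit output bus

# -3 * 5 = -15, shown as a 36-bit two's-complement pattern.
raw = mult18x18(-3 & 0x3FFFF, 5)
print(hex(raw))                        # 0xffffffff1
print(to_signed(raw, 36))              # -15
```

The masking on input and output mirrors the fixed bus widths of the hardware block; in Python the masks are needed because integers are otherwise unbounded.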

These elements are organized as shown in Figure 9. A ring of IOBs surrounds a regular array of CLBs. Each device has two columns of block RAM, except for the XC3S100E, which has one column. Each RAM column consists of several 18-Kbit RAM blocks, and each block RAM is associated with a dedicated multiplier. The DCMs are positioned in the center, with two at the top and two at the bottom of the device. The XC3S100E has only one DCM each at the top and bottom, while the XC3S1200E and XC3S1600E add two DCMs in the middle of the left and right sides.

Figure 9 Spartan 3e Architecture

The Spartan-3E family features a rich network of traces that interconnect all five functional elements, transmitting signals among them. Each functional element has an associated switch matrix that permits multiple connections to the routing.

5.3 IMPLEMENTATION OVERVIEW FOR FPGAS
After synthesis, you run design implementation, which comprises the following steps:
1. Translate, which merges the incoming netlists and constraints into a Xilinx design file.
2. Map, which fits the design into the available resources on the target device.
3. Place and Route, which places and routes the design to meet the timing constraints.
4. Programming file generation, which creates a bitstream file that can be downloaded to the device.

5.4 FPGA PIN CONFIGURATION FOR SRL KIT

S.no | PIN NAME | H/W PIN NO | PURPOSE
1    | Switch_0 | P2         | i/p on board, SW1
2    | Switch_1 | P3         | i/p on board, SW2
3    | Switch_2 | P4         | i/p on board, SW3
4    | Switch_3 | P5         | i/p on board, SW4
5    | Switch_4 | P9         | i/p on board, SW5
6    | Switch_5 | P10        | i/p on board, SW6
7    | Switch_6 | P11        | i/p on board, SW7
8    | Switch_7 | P13        | i/p on board, SW8
9    | LED_0    | P12        | o/p on board, D8
10   | LED_1    | P15        | o/p on board, D7
11   | LED_2    | P16        | o/p on board, D6
12   | LED_3    | P17        | o/p on board, D5
13   | LED_4    | P18        | o/p on board, D4
14   | LED_5    | P22        | o/p on board, D3
15   | LED_6    | P23        | o/p on board, D2
16   | LED_7    | P98        | o/p on board, D1
17   | LEDDT    | P24        | (B_3 or pin 9)
18   | LEDQ1    | P27        | (T_1 or pin 1)
19   | LEDQ2    | P32        | (T_4 or pin 4)
20   | LEDQ3    | P33        | (T_5 or pin 5)
21   | LEDQ4    | P34        | (B_6 or pin 12)
22   | LEDA     | P35        | (T_2 or pin 2)
23   | LEDB     | P36        | (T_6 or pin 6)
24   | LEDC     | P40        | (B_4 or pin 10)
25   | LEDD     | P41        | (B_2 or pin 8)
26   | LEDE     | P47        | (B_1 or pin 7)
27   | LEDF     | P48        | (T_3 or pin 3)
28   | LEDG     | P49        | (B_2 or pin 11)
30   | GCLK     | P89        | Internal CLK
