[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex

  • 1. Using Hadoop/MapReduce with Solr/Lucene for Large Scale Distributed Search. SYSTEX Corp, Big Data. 2011.12.2, Hadoop in China 2011. James Chen

2. About Myself (James Chen)
  • Big Data, email, web security, Big Data killer apps
  • Hadoop: ecosystem, distributions, tools
  • End-to-end: web, CDR, Hadoop in production
  • Cloudera Certified Developer/Administrator for Apache Hadoop
  • Hadoop ecosystem, Solr/Lucene
  • Cloudera Certified Developer for Apache Hadoop (CCDH)

3. Agenda
  • Why search on big data?
  • Lucene, Solr, ZooKeeper
  • Distributed searching
  • Distributed indexing with Hadoop
  • Case study: web log categorization

4. Why Search on Big Data?
  • Structured and unstructured data
  • No schema limitation
  • Ad-hoc analysis
  • Keep everything

5. Lucene
  • A Java library for search
  • Mature, feature-rich, and high performance
  • Indexing, searching, highlighting, spell checking, etc.
  • Not a crawler: doesn't know anything about Adobe PDF, MS Word, etc.
  • Not fully distributed: does not support a distributed index

6. Solr
  • Easy setup
  • Without knowing Java!
  • HTTP/XML interface
  • Simply deploy in any web container
  • More features: distributed search, faceting, replication, highlighting, sharding, caching, multi-core
  • http://

7. Lucene Default Query Syntax
  Lucene Query Syntax [; sort specification]
  1. mission impossible; releaseDate desc
  2. +mission +impossible -actor:cruise
  3. "mission impossible" -actor:cruise
  4. title:spiderman^10 description:spiderman
  5. description:"spiderman movie"~10
  6. +HDTV +weight:[0 TO 100]
  7. Wildcard queries: te?t, te*t, test*

8. Solr Default Parameters
  Query arguments for HTTP GET/POST to /select

  param   default     description
  q                   The query
  start   0           Offset into the list of matches
  rows    10          Number of documents to return
  fl      *           Stored fields to return
  qt      standard    Query type; maps to query handler
  df      (schema)    Default field to search

9. Solr Search Results Example
  http://localhost:8983/solr/select?q=video&start=0&rows=2&fl=name,price

    <response>
      <responseHeader><status>0</status><QTime>1</QTime></responseHeader>
      <result numFound="16173" start="0">
        <doc>
          <str name="name">Apple 60 GB iPod with Video</str>
          <float name="price">399.0</float>
        </doc>
        <doc>
          <str name="name">ASUS Extreme N7800GTX/2DHTV</str>
          <float name="price">479.95</float>
        </doc>
      </result>
    </response>

10. ZooKeeper
  • Configuration, group service, synchronization
  • Sharing configuration across nodes
  • Master/leader election
  • Central management
  • Coordinating searches across shards, load balancing
  • Maintaining status about shards
  • Replication

11. ZooKeeper Architecture
  • All servers store a copy of the data (in memory)
  • A leader is elected at startup
  • Followers service clients; all updates go through the leader
  • Update responses are sent when a majority of servers have persisted the change
  [Diagram: one leader and several follower servers, each serving clients]

12. Distributed Search
  • Scalability: shard management, sharding algorithm
  • Performance: query distribution, load balancing
  • Availability: replication, server fail-over

13. SolrCloud: Distributed Solr
  • Cluster management: central configuration for the entire cluster; cluster state and layout stored in a central system
  • ZooKeeper integration: live node tracking; central repository for configuration and state
  • Removes external load balancing: automatic fail-over and load balancing
  • Clients don't need to know about shards at all: any node can serve any request

14. SolrCloud Architecture
  [Diagram: a collection of Solr shards (Shard 1, Shard 2, Shard 3), each with a replica; nodes get configuration from ZooKeeper and download shards from shared storage or a DFS]

15. Start a SolrCloud Node
  solr.xml:

    <solr persistent="false">
      <!--
      adminPath: RequestHandler path to manage cores.
      If null (or absent), cores will not be manageable via requesthandler
      -->
      <cores adminPath="/admin/cores" defaultCoreName="collection1">
        <core name="collection1" instanceDir="." shard="shard1"/>
      </cores>
    </solr>

  Start a Solr instance to serve a shard:

    java -Dbootstrap_confdir=./solr/conf \
         -Dcollection.configName=myconf \
         -DzkRun -jar start.jar

16. Live Node Tracking
  [Screenshot: shards and replication]

17. Distributed Requests of SolrCloud
  • Explicitly specify the addresses of shards:
      shards=localhost:8983/solr,localhost:7574/solr
  • Load balance and fail over:
      shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
  • Query all shards of a collection:
      http://localhost:8983/solr/collection1/select?distrib=true
  • Query specific shard ids:
      http://localhost:8983/solr/collection1/select?shards=shard_1,shard_2&distrib=true

18. More about SolrCloud
  • Still under development; not in an official release
  • Get SolrCloud: https:// dev/trunk
  • Wiki: http://

19. Katta: Distributed Lucene
  • Master-slave architecture
  • Master fail-over
  • Index shard management
  • Replication
  • Fault tolerance

20. Katta Architecture
  [Diagram: a master and a secondary master (fail-over) coordinate through ZooKeeper; the master assigns shards to nodes, nodes download shards from HDFS, and shards are replicated across nodes. A Java client API sends multicast queries, with distributed ranking and a pluggable selection policy.]

21. Katta Index
  • A Katta index is a folder with Lucene indexes
  • A shard index can be zipped

22. The Difference between Katta and SolrCloud
  • Katta is based on Lucene, not Solr.
  • Katta has a master-slave architecture; the master provides management commands and an API.
  • Katta automatically assigns shards and replication to Katta nodes.
  • Clients should use the Katta client API for querying.
  • The Katta client is a Java API and requires Java programming.
  • Katta supports reading indexes from HDFS; it plays well with Hadoop.

23. Configure Katta
  • conf/masters (required on the master)
      server0
  • conf/nodes (required for all nodes)
      server1
      server2
      server3
  • conf/katta.zk.properties (ZooKeeper can be embedded with the master or standalone; standalone ZooKeeper is required for master fail-over)
      zookeeper.server=server0:2181
      zookeeper.embedded=true
  • conf/katta-env.sh (JAVA_HOME required, KATTA_MASTER optional)
      export JAVA_HOME=/usr/lib/j2sdk1.5-sun
      export KATTA_MASTER=server0:/home/$USER/katta-distribution

24. Katta CLI
  • Search
      search <index name>[,<index name>,...] "<query>" [count]
  • Cluster management
      listIndexes
      listNodes
      startMaster
      startNode
      showStructure
  • Index management
      check
      addIndex <index name> <path to index> <lucene analyzer class> [<replication level>]
      removeIndex <index name>
      redeployIndex <index name>
      listErrors <index name>

25. Deploy Katta Index
  • Deploying a Katta index on the local filesystem:
      bin/katta addIndex <name of index> [file:///<path to index>] <rep lv>
  • Deploying a Katta index on HDFS:
      bin/katta addIndex <name of index> [hdfs://namenode:9000/path] <rep lv>

26. Building Lucene Index by MapReduce: Reduce-Side Indexing
  [Diagram: input HDFS blocks are read by map tasks, which emit K: shard id, V: doc to be indexed. After shuffle/sort, each reduce task uses Lucene to build one index shard (shard 1, shard 2, ...).]

27. Building Lucene Index by MapReduce: Map-Side Indexing
  [Diagram: input HDFS blocks are read by map tasks, which build Lucene indexes and emit K: shard id, V: path of indices. After shuffle/sort, each reduce task merges the indices into one index shard (shard 1, shard 2, ...).]

28. Mapper

    import java.io.IOException;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class IndexMapper extends Mapper<Object, Text, Text, Text> {
        private String[] shardList;
        int k = 0;

        protected void setup(Context context) {
            // Shard names come from the job configuration.
            shardList = context.getConfiguration().getStrings("shardList");
        }

        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Assign documents to shards round-robin.
            context.write(new Text(shardList[k++]), value);
            k %= shardList.length;
        }
    }

29. Reducer

    import java.io.IOException;
    import java.util.Iterator;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    public class IndexReducer extends Reducer<Text, Text, Text, IntWritable> {
        Configuration conf;

        public void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            conf = context.getConfiguration();
            String indexName = conf.get("indexName");
            String shard = key.toString();
            writeIndex(indexName, shard, values);
        }

        private void writeIndex(String indexName, String shard, Iterable<Text> values) {
            Iterator<Text> it = values.iterator();
            while (it.hasNext()) {
                Text doc = it.next();
                // Use Lucene here:
                // ...index doc using a Lucene index writer...
            }
        }
    }

30. Sharding Algorithm
  • Good document distribution across shards is important
  • Simple approach: hash(id) % numShards
  • Fine if the number of shards doesn't change or it is easy to reindex
  • Better: consistent hashing http://

31. Case Study: Web Log Categorization
  Customer background
  • Tier-1 telco in APAC
  • Daily web log of ~20 TB
  • Marketing would like to understand more about the browsing behavior of their customers
  Requirements
  • The logs collected from the network should be input through a big data solution
  • Provide a dashboard of URL category vs. hit count
  • Perform analysis based on the required attributes and algorithms

32. Hardware Spec
  • 1 master node (Name Node, Job Tracker, Katta Master)
  • 64 TB raw storage size (8 data nodes)
  • 64 cores (8 data nodes)
  • 8 Task Trackers
  • 3 Katta nodes
  • 5 ZooKeeper nodes
  • Software stack: HBase / Hive, MapReduce, Hadoop Distributed File System
  Master node
  • CPU: Intel Xeon Processor E5620 2.4G
  • RAM: 32G ECC-Registered
  • HDD: SAS 300G @ 15000 rpm (RAID-1)
  Data node
  • CPU: Intel Xeon Processor E5620 2.4G
  • RAM: 32G ECC-Registered
  • HDD: SATA 2T x 4, total 8T

33. System Overview
  [Diagram: a log aggregator writes to /raw_log on HDFS. MR jobs perform URL categorization (using a 3rd-party URL DB, output in /url_cat) and build the index shards (/shardA/url, /shardB/url, ... /shardN/url). The Katta master has the Katta nodes load the index shards. The dashboard and search front end go through a web gateway and web service API, which sends search queries via the Katta client API and returns XML responses. The Name Node and Katta Master run on the master node.]

34. Thank You
  We are hiring!!
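
As a follow-up to the Sharding Algorithm slide (30), the two approaches it names can be compared in a small sketch. This is not code from the talk: the class name ShardingSketch, the CRC32 hash, the shard names, and the choice of 100 virtual nodes per shard are all assumptions made for illustration. It shows how many document ids change shard when a fifth shard is added, first with hash(id) % numShards and then with a consistent-hashing ring.

```java
import java.nio.charset.StandardCharsets;
import java.util.TreeMap;
import java.util.zip.CRC32;

// Hypothetical sketch comparing modulo sharding with consistent hashing.
class ShardingSketch {

    // Stable, non-negative hash of a string (CRC32 is an assumption, chosen
    // only because it is in the JDK and gives the same value on every run).
    static long hash(String s) {
        CRC32 crc = new CRC32();
        crc.update(s.getBytes(StandardCharsets.UTF_8));
        return crc.getValue();
    }

    // Simple approach from the slide: hash(id) % numShards.
    static int moduloShard(String docId, int numShards) {
        return (int) (hash(docId) % numShards);
    }

    // Consistent hashing: every shard owns several points on a ring, and a
    // document belongs to the first shard point at or after its own hash.
    static final class Ring {
        private final TreeMap<Long, String> points = new TreeMap<>();
        private final int virtualNodes;

        Ring(int virtualNodes) {
            this.virtualNodes = virtualNodes;
        }

        void addShard(String shard) {
            for (int i = 0; i < virtualNodes; i++) {
                points.put(hash(shard + "#" + i), shard);
            }
        }

        String shardFor(String docId) {
            if (points.isEmpty()) {
                throw new IllegalStateException("no shards on the ring");
            }
            Long point = points.ceilingKey(hash(docId));
            // Wrap around to the first point when the hash is past the last one.
            return points.get(point != null ? point : points.firstKey());
        }
    }

    public static void main(String[] args) {
        int docs = 10_000;
        Ring ring = new Ring(100);
        for (int s = 1; s <= 4; s++) {
            ring.addShard("shard" + s);
        }
        String[] before = new String[docs];
        for (int i = 0; i < docs; i++) {
            before[i] = ring.shardFor("doc" + i);
        }

        ring.addShard("shard5"); // grow the cluster from 4 to 5 shards

        int movedModulo = 0;
        int movedRing = 0;
        for (int i = 0; i < docs; i++) {
            String id = "doc" + i;
            if (moduloShard(id, 4) != moduloShard(id, 5)) movedModulo++;
            if (!before[i].equals(ring.shardFor(id))) movedRing++;
        }
        System.out.println("moved with modulo: " + movedModulo + " of " + docs);
        System.out.println("moved with consistent hashing: " + movedRing + " of " + docs);
    }
}
```

With a uniform hash, going from 4 to 5 shards under the modulo scheme should relocate roughly four fifths of the ids, while the ring should relocate roughly one fifth, and only onto the new shard. That is why the modulo scheme is acceptable only when the shard count is fixed or reindexing is cheap.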