Greplin searches:
- Greplin helps you search all your personal information, wherever it is.
- As Michael Arrington of TechCrunch said, we've attacked "the other half of search."
- Greplin supports over a dozen services today, with more added constantly.
Requirements
Many inserts
Fewer searches
Low per-user cost
- We insert up to 5,000 documents/second.
- Average document size of 2KB-4KB.
- A fully loaded server is an Amazon c1.medium machine responsible for up to 80,000,000 3KB documents.
- Each machine has just 1.7GB of RAM!
- Overall, we handle about 50M documents per GB of RAM with median search latencies around 200ms.
Memory
Per doc: 2 longs + 1 int + 1 String (avg 5 letters) into the FieldCache, and an average of 10 norm'd fields/doc
27 bytes/doc * 50M docs = 1.3GB
- Ranking requires pulling a few field values and norms into memory.
- For 50M documents, this would require well over 1.3GB of memory.
- Assuming an optimized index, searching the number of docs we have per machine with 1GB of RAM is impossible without swapping.
- We benchmarked using a single-index + swapping: search times were multi-second.
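The arithmetic behind the memory estimate can be checked with a short sketch. The per-document cost, document counts, and RAM size are the figures quoted on the slides. The GiB conversion and the comparison against a fully loaded server are added here for illustration only.

```python
# Figures quoted on the slides.
BYTES_PER_DOC = 27                 # FieldCache entries plus norms per document
DOCS_BENCHMARK = 50_000_000        # document count used in the estimate
DOCS_FULL_SERVER = 80_000_000      # documents on a fully loaded c1.medium
RAM_BYTES = 1.7e9                  # 1.7GB of RAM per machine


def ranking_memory_bytes(docs, bytes_per_doc=BYTES_PER_DOC):
    """Memory needed to hold ranking data for `docs` documents."""
    return docs * bytes_per_doc


benchmark = ranking_memory_bytes(DOCS_BENCHMARK)
full_server = ranking_memory_bytes(DOCS_FULL_SERVER)

print(f"50M docs: {benchmark / 1e9:.2f} GB ({benchmark / 2**30:.2f} GiB)")
print(f"80M docs: {full_server / 1e9:.2f} GB, RAM available: {RAM_BYTES / 1e9:.1f} GB")
print("fits in RAM:", full_server <= RAM_BYTES)
```

With these inputs the ranking data alone for 80M documents exceeds the machine's physical memory, which is consistent with the claim that a single large index cannot be searched without swapping.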
"Virtual memory was meant to make it easier to program when data was larger than the physical memory, but people have still not caught on."
Poul-Henning Kamp, Varnish architect and coder, "What's Wrong With 1975 Programming"
http://www.varnish-cache.org/trac/wiki/ArchitectNotes
- Over the last decade, the trend has been to stop manually managing what goes on disk and what goes in RAM, instead trusting the operating system's virtual memory and paging systems to swap data in/out appropriately.
- For example, the caching HTTP proxy Varnish trusts the OS's virtual memory, and is thus significantly simpler and faster than Squid, which tries to manage the what-belongs-in-memory vs what-belongs-on-disk itself.
- This philosophy has been jokingly summarized as "You're not smarter than Linus, so don't try to be."
We're Smarter than Linus!*
* When we cheat
- Many signals (such as user logins) let us predict which users are likely to do searches better than the OS can.
- By keeping each user's data in a separate index, we save memory and improve performance.
- We only keep open IndexSearchers for users who are likely to do searches.
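One way to picture "only keep open IndexSearchers for users who are likely to do searches" is a bounded pool that is warmed by a signal such as a login and evicts the least recently used entry. This is a minimal sketch, not Greplin's implementation: `SearcherPool`, `warm`, and the `open_searcher`/`close_searcher` callbacks are hypothetical names, and least-recently-used eviction is an assumed policy.

```python
from collections import OrderedDict


class SearcherPool:
    """Keeps at most `capacity` per-user searchers open (hypothetical sketch)."""

    def __init__(self, capacity, open_searcher, close_searcher):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self._capacity = capacity
        self._open = open_searcher        # callable: user_id -> searcher
        self._close = close_searcher      # callable: searcher -> None
        self._searchers = OrderedDict()   # user_id -> searcher, oldest first

    def warm(self, user_id):
        """Call on a predictive signal (e.g. a login) to open the index early."""
        return self.get(user_id)

    def get(self, user_id):
        """Return the user's searcher, opening it and evicting others if needed."""
        if user_id in self._searchers:
            self._searchers.move_to_end(user_id)   # mark as most recently used
        else:
            self._searchers[user_id] = self._open(user_id)
            while len(self._searchers) > self._capacity:
                _, evicted = self._searchers.popitem(last=False)
                self._close(evicted)               # release that user's memory
        return self._searchers[user_id]

    def open_users(self):
        """User ids with an open searcher, least recently used first."""
        return list(self._searchers)
```

In this sketch a login handler would call `warm(user_id)`, so the first real query finds an open searcher, while users who have gone quiet drop out of the pool and cost no memory.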
Other Benefits
tar -cvzf user.tar.gz user && mv user.tar.gz
du -h
Smaller corruption domain
By keeping each user's index separate, we can:
- more easily move users between servers
- figure out their space usage
- ensure index corruption affects only one user
RAM Index, Deletion Filters, MultiSearcher, Flush planning
- Inspired by Zoie (http://sna-projects.com/zoie/)
- All incoming documents are first added to a RAM Index.
- A user search encompasses a filtered view of the RAM Index, the currently flushing index, plus their disk index.
- When the RAM index is full we create a new RAM index.
- We open IndexWriters for each user in turn and flush documents from RAM to disk.
- Interesting cases including updates and deletions are handled with temporary filters on the disk index.
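The flow described above can be sketched as a toy three-tier index in plain Python. Everything here is an assumption made for illustration: `TieredIndex` and its methods are hypothetical, lists and dicts stand in for Lucene indexes, and a `None` text value stands in for a deletion. It only shows how a search can combine the disk index, the flushing index, and the RAM index, with newer tiers masking stale disk documents.

```python
class TieredIndex:
    """Toy model of a shared RAM index in front of per-user disk indexes."""

    def __init__(self, ram_limit):
        if ram_limit < 1:
            raise ValueError("ram_limit must be at least 1")
        self.ram_limit = ram_limit
        self.ram = []        # active RAM index: (user_id, doc_id, text or None)
        self.flushing = []   # full RAM index waiting to be written to disk
        self.disk = {}       # user_id -> {doc_id: text}, one "index" per user

    def add(self, user_id, doc_id, text):
        """Add or update a document; all writes land in the RAM index first."""
        self.ram.append((user_id, doc_id, text))
        if len(self.ram) >= self.ram_limit:
            self.rotate()

    def delete(self, user_id, doc_id):
        """Record a deletion as a tombstone (text is None)."""
        self.add(user_id, doc_id, None)

    def rotate(self):
        """Start a new RAM index; the full one becomes the flushing index."""
        self.flush()                      # finish any earlier flush first
        self.flushing, self.ram = self.ram, []

    def flush(self):
        """Write the flushing index to disk, one user at a time."""
        by_user = {}
        for user_id, doc_id, text in self.flushing:
            by_user.setdefault(user_id, []).append((doc_id, text))
        for user_id, ops in by_user.items():
            index = self.disk.setdefault(user_id, {})
            for doc_id, text in ops:
                if text is None:
                    index.pop(doc_id, None)
                else:
                    index[doc_id] = text
        self.flushing = []

    def search(self, user_id, term):
        """Search disk, then flushing, then RAM; newer tiers override older."""
        view = dict(self.disk.get(user_id, {}))
        for tier in (self.flushing, self.ram):
            for uid, doc_id, text in tier:
                if uid == user_id:        # filtered view of the shared tier
                    view[doc_id] = text   # masks the stale disk copy
        return sorted(d for d, t in view.items() if t is not None and term in t)
```

The override step in `search` plays the role of the temporary filters mentioned above: an update or deletion still sitting in RAM hides the older copy on disk until the flush reaches that user.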
Amazon Cloud
Script everything
XFS+LVM expandability and snapshots are helpful
Some pain is unavoidable
[Chart: EBS Performance, in KB/sec (axis 0 to 150,000), for Seq. Write, Seq. Read, Random Read, and Random Write across Single EBS, RAID10 EBS, Instance Store, and RAID 0 EBS]
More info at: http://tech.blog.greplin.com/aws-best-practices-and-benchmarks
Other Cool Stuff
kill -9 any time with no data-loss via a Protocol Buffer Write Ahead Log
Detect duplicate documents with Bloom Filter
Dynamically sized SoftReference Cache
Custom MergeScheduler
Custom FieldCache for multi-valued or sparse fields
Efficient result clustering and faceting
Some of this is open source: https://github.com/Greplin
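For the duplicate-detection item, a Bloom filter answers "have we probably seen this document before?" in a fixed amount of memory, at the cost of occasional false positives. The sketch below is a generic illustration and not the code from the Greplin repositories: `BloomFilter`, `is_duplicate`, the SHA-256 double-hashing scheme, and the sizes used are all assumptions.

```python
import hashlib


class BloomFilter:
    """Fixed-size set membership with no false negatives (generic sketch)."""

    def __init__(self, num_bits, num_hashes):
        if num_bits < 1 or num_hashes < 1:
            raise ValueError("num_bits and num_hashes must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive all positions from two halves of one digest.
        digest = hashlib.sha256(item).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1   # force an odd step
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, item):
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))


def is_duplicate(seen, document):
    """Return True if `document` was probably seen before, then record it."""
    duplicate = seen.might_contain(document)
    seen.add(document)
    return duplicate
```

A "probably seen" answer can be wrong, so an indexer built on this sketch would either accept a small rate of skipped documents or confirm a hit against the index before discarding the insert.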