MongoDB https://www.mongodb.com/ Prutha Date ([email protected] ) Siraj Memon ([email protected] )

MongoDB Internals


Page 2: MongoDB Internals

Outline
• Introduction to MongoDB
• Storage Layout
• Data Management Features
• Performance Analysis
• Limitations
• Conclusion
• Demo
• References

Page 3: MongoDB Internals

What is MongoDB?
• MongoDB is a NoSQL document-oriented database.
• It provides a flexible, semi-structured schema.
• It provides high performance, high availability, and easy scalability.
• MongoDB is free and open source software.
• License: GNU Affero General Public License (AGPL) and Apache License
• MongoDB is a server process that runs on Linux, Windows and OS X. It can be run as either a 32-bit or a 64-bit application.

Page 4: MongoDB Internals

When to use MongoDB?

“Knowing when to use a hammer, and when to use a screwdriver.”
• Account and user profiles: can store arrays of addresses with ease (MetLife)
• Content Management Systems (CMS): the flexible schema of MongoDB is great for heterogeneous collections of content types (MongoPress)
• Form data: MongoDB makes it easy to evolve the structure of form data over time (ADP)
• Blogs / user-generated content: can keep data with complex relationships together in one object (Forbes, AOL)
• Messaging: vary message meta-data easily per message or message type without needing to maintain separate collections or schemas (Viber)
• System configuration: just a nice object graph of configuration values, which is very natural in MongoDB (Cisco)
• Log data of any kind: structured log data is the future (eBay)
• Location based systems: makes use of geospatial indices (Foursquare, City government of Chicago)

Page 5: MongoDB Internals

Terminologies – RDBMS vs MongoDB

*JSON – JavaScript Object Notation

Page 6: MongoDB Internals

Storage Internals - Directory Layout
• By default, the data directory is found at /data/db

Page 7: MongoDB Internals

Internal File Format

Page 8: MongoDB Internals

Extent Structure

Page 9: MongoDB Internals

Extents and Records

Page 10: MongoDB Internals

To Sum Up: Internal File Format
• Files on disk are broken into extents, which contain the documents.
• A collection has one or more extents.
• Extents grow exponentially in size, up to 2GB.
• Namespace entries in the ns (namespace) file point to the first extent for that collection.
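
The growth rule above can be sketched in a few lines. This is only an illustration: the slides say extents grow exponentially up to 2GB, but the doubling factor and the 4 KB first extent used here are assumptions made for the example, not values from the deck.

```javascript
// Sketch only: assumes each new extent doubles the previous one.
// The 4 KB starting size and the factor of 2 are illustrative assumptions.
const MAX_EXTENT_BYTES = 2 * 1024 ** 3; // 2GB cap stated on the slide

function extentSizes(firstExtentBytes, count) {
  const sizes = [];
  let size = firstExtentBytes;
  for (let i = 0; i < count; i++) {
    sizes.push(size);
    // Grow exponentially, but never past the cap.
    size = Math.min(size * 2, MAX_EXTENT_BYTES);
  }
  return sizes;
}

const sizes = extentSizes(4096, 25);
console.log(sizes.slice(0, 4));       // the first few extents
console.log(sizes[sizes.length - 1]); // later extents stay at the cap
```

Under these assumptions the cap is reached after 19 doublings, and every later extent has the same maximum size.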

Page 11: MongoDB Internals

Virtual Address Space

Page 12: MongoDB Internals

Storage Engine - MMAP (Memory Mapped)
• All data files are memory mapped to virtual memory by the OS.
• MongoDB just reads / writes to RAM in the filesystem cache.
• OS takes care of the rest!
• Virtual process size = total files size + overhead (connections, heap)
• Files are memory-mapped using the mmap() system call.

Page 13: MongoDB Internals

Storage Engine - WiredTiger
• Designed especially for write-intensive applications
• Document-level locking
• Compression and record-level locking
• Multi-version concurrency control (MVCC)
• Multi-document transactions
• Support for Log Structured Merge (LSM) trees for very high insert workloads

Page 14: MongoDB Internals

What makes MongoDB cool?

• Sharding
• Aggregation Framework and Map-Reduce
• Capped Collection
• GridFS
• Geo-Spatial Indexing

Page 15: MongoDB Internals

Sharding
• Horizontal scaling: divides the data set and distributes the data over multiple servers, or shards.
• Used to support deployments with very large data sets and high throughput operations.
• Sharded cluster components:
  • Shards: a mongod instance or a replica set
  • Config servers: multiple mongod instances
  • Routing instances: multiple mongos instances
• Data on the shards is divided into fixed-size chunks using ranges of shard key values.
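
Range-based chunking can be pictured as a lookup from a shard key value to the chunk whose range contains it. The sketch below is a toy model, not MongoDB's routing code; the chunk table, shard names and the `routeToShard` helper are all hypothetical.

```javascript
// Hypothetical chunk table: each chunk covers [min, max) of the shard key
// and lives on one shard. Ranges and shard names are made up for illustration.
const chunks = [
  { min: -Infinity, max: 100, shard: "shardA" },
  { min: 100, max: 200, shard: "shardB" },
  { min: 200, max: Infinity, shard: "shardA" },
];

// Roughly what a routing instance has to do: find the chunk whose range
// contains the shard key value, then send the operation to that shard.
function routeToShard(shardKeyValue) {
  const chunk = chunks.find(
    (c) => shardKeyValue >= c.min && shardKeyValue < c.max
  );
  return chunk.shard;
}

console.log(routeToShard(50));  // falls in the first range
console.log(routeToShard(150)); // falls in the second range
```

A query that includes the shard key can be sent to a single shard this way; a query without it has to be sent to every shard.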

Page 16: MongoDB Internals

Sharding Internals

Page 17: MongoDB Internals

Choosing a Shard key
The choice of shard key affects:
• Distribution of reads and writes
  • Uneven distribution of reads/writes across shards.
  • Solution: hashed ids
• Size of chunks
  • Jumbo chunks cause uneven distribution of data.
  • Moving data between shards becomes difficult.
  • Solution: multi-tenant compound index
• The number of shards each query hits
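
Why hashed ids help with write distribution can be seen in a small simulation. This is a sketch under stated assumptions: four shards, a range layout where each shard owns a block of 1000 ids, and a simple multiplicative hash standing in for whatever hash MongoDB actually applies to hashed shard keys.

```javascript
const SHARDS = 4;

// Assumed range layout: shard i owns ids in [i * 1000, (i + 1) * 1000),
// and the last shard owns everything above that.
function rangeShard(id) {
  return Math.min(Math.floor(id / 1000), SHARDS - 1);
}

// Stand-in hash (multiplicative hashing); NOT the hash MongoDB uses.
function hashedShard(id) {
  const h = Math.imul(id, 2654435761) >>> 0; // unsigned 32-bit hash
  return Math.floor((h / 2 ** 32) * SHARDS); // map hash to a shard index
}

function countPerShard(ids, pickShard) {
  const counts = new Array(SHARDS).fill(0);
  for (const id of ids) counts[pickShard(id)] += 1;
  return counts;
}

// 1000 new documents with monotonically increasing ids.
const newIds = Array.from({ length: 1000 }, (_, i) => 3000 + i);
const byRange = countPerShard(newIds, rangeShard);
const byHash = countPerShard(newIds, hashedShard);
console.log(byRange); // every insert lands on the last shard
console.log(byHash);  // inserts are spread across all shards
```

The trade-off is that hashing scatters adjacent key values, so a range query on the shard key is likely to hit every shard.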

Page 18: MongoDB Internals

Aggregation Framework
• Aggregation Pipeline
• Map-Reduce
• Single Purpose Aggregation Operations (deprecated in latest version)

Page 19: MongoDB Internals

Aggregation Pipeline
• The aggregation pipeline is a framework for performing aggregation tasks, modeled on the concept of data processing pipelines.
• Using this framework, MongoDB passes the documents of a single collection through a pipeline.
• The pipeline transforms the documents into aggregated results, and is accessed through the aggregate database command.
• Operators: $match, $project, $unwind, $sort, $limit
• The user chooses which operators to apply and in what order.
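
The pipeline idea, where each stage consumes the output of the previous one, can be imitated in memory. The executor below is a toy written for this example and is not how MongoDB implements aggregation: it supports only equality `$match`, field-inclusion `$project`, `$sort` and `$limit`, and the `orders` data is made up. The `pipeline` array has the same shape that would be passed to the `aggregate` command.

```javascript
// Toy stage implementations; each takes an array of documents and a stage spec.
const stages = {
  $match: (docs, spec) =>
    docs.filter((d) => Object.entries(spec).every(([k, v]) => d[k] === v)),
  $sort: (docs, spec) =>
    [...docs].sort((a, b) => {
      for (const [k, dir] of Object.entries(spec)) {
        if (a[k] < b[k]) return -dir;
        if (a[k] > b[k]) return dir;
      }
      return 0;
    }),
  $limit: (docs, n) => docs.slice(0, n),
  $project: (docs, spec) =>
    docs.map((d) =>
      Object.fromEntries(
        Object.keys(spec).filter((k) => spec[k]).map((k) => [k, d[k]])
      )
    ),
};

// Feed the documents through the stages in order.
function runPipeline(docs, pipeline) {
  return pipeline.reduce((current, stage) => {
    const [[op, spec]] = Object.entries(stage);
    return stages[op](current, spec);
  }, docs);
}

const orders = [
  { customer: "ann", status: "shipped", total: 40 },
  { customer: "bob", status: "pending", total: 25 },
  { customer: "cy", status: "shipped", total: 90 },
  { customer: "dee", status: "shipped", total: 10 },
];
const pipeline = [
  { $match: { status: "shipped" } },       // keep shipped orders
  { $sort: { total: -1 } },                // largest total first
  { $limit: 2 },                           // top two
  { $project: { customer: 1, total: 1 } }, // keep only these fields
];
const result = runPipeline(orders, pipeline);
console.log(result);
```

Stage order matters: placing `$match` first shrinks the set of documents that later stages have to sort and reshape.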

Page 20: MongoDB Internals

Aggregation Pipeline - Example

Page 21: MongoDB Internals

Continued…

Page 22: MongoDB Internals

Map-Reduce

Page 23: MongoDB Internals

Capped Collection
• A capped collection is a fixed-size collection.
• Use the db.createCollection command and mark the collection as capped.
• e.g. db.createCollection('logs', {capped: true, size: 2097152})
• When it reaches the size limit, old documents are automatically removed.
• Guarantees preservation of the insertion order.
• Maintains insertion order identical to the order on disk by prohibiting updates that increase document size.
• Allows the use of a tailable cursor to retrieve documents.
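
The eviction behaviour can be modelled with a small in-memory class. This is a sketch, not MongoDB's storage code: the `CappedCollection` class is hypothetical, and document size is approximated by the length of the JSON text, which is an assumption made only to keep the example short.

```javascript
// Toy model of a capped collection: a byte budget, insertion order preserved,
// oldest documents removed when the budget is exceeded.
class CappedCollection {
  constructor(maxBytes) {
    this.maxBytes = maxBytes;
    this.docs = [];
    this.usedBytes = 0;
  }

  // Assumption: size is approximated by the JSON text length.
  static sizeOf(doc) {
    return JSON.stringify(doc).length;
  }

  insert(doc) {
    const size = CappedCollection.sizeOf(doc);
    if (size > this.maxBytes) throw new Error("document larger than the cap");
    this.docs.push(doc);
    this.usedBytes += size;
    // Evict from the front (oldest first) until we fit again.
    while (this.usedBytes > this.maxBytes) {
      this.usedBytes -= CappedCollection.sizeOf(this.docs.shift());
    }
  }

  find() {
    return [...this.docs]; // always in insertion order
  }
}

const logs = new CappedCollection(60);
for (let i = 1; i <= 5; i++) logs.insert({ n: i, msg: "event" });
console.log(logs.find()); // only the most recent documents remain
```

With a 60-byte budget and 21-byte documents, only the two newest entries fit, so older ones drop off as new ones arrive.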

Page 24: MongoDB Internals

GridFS
• GridFS is a specification for storing and retrieving files that exceed the BSON (binary JSON) document size limit of 16MB.
• Instead of storing a file in a single document, GridFS divides a file into parts, or chunks, and stores each of those chunks as a separate document.
• By default GridFS limits chunk size to 255 kB.
• GridFS uses two collections to store files. One collection stores the file chunks, and the other stores file metadata.
• GridFS is useful not only for storing files that exceed 16MB but also for storing any files that you want to access without having to load the entire file into memory.
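
The chunking scheme can be sketched as a function that turns one file into a metadata document plus a list of chunk documents. This is an in-memory illustration rather than a driver API: `splitIntoChunks` and the sample file are hypothetical, and only a few of the fields a real GridFS document carries are shown.

```javascript
const DEFAULT_CHUNK_SIZE = 255 * 1024; // 255 kB default mentioned above

// Split a byte array into chunk documents plus one file metadata document.
function splitIntoChunks(fileId, filename, bytes, chunkSize = DEFAULT_CHUNK_SIZE) {
  const chunks = [];
  for (let offset = 0, n = 0; offset < bytes.length; offset += chunkSize, n++) {
    chunks.push({
      files_id: fileId, // links the chunk back to its file document
      n,                // position of the chunk within the file
      data: bytes.subarray(offset, offset + chunkSize),
    });
  }
  const fileDoc = { _id: fileId, filename, length: bytes.length, chunkSize };
  return { fileDoc, chunks };
}

// Which chunk holds a given byte offset; lets a reader fetch part of a file
// without loading all of it.
function chunkIndexFor(offset, chunkSize = DEFAULT_CHUNK_SIZE) {
  return Math.floor(offset / chunkSize);
}

const bytes = new Uint8Array(600 * 1024); // hypothetical 600 kB file
const { fileDoc, chunks } = splitIntoChunks(1, "video.bin", bytes);
console.log(fileDoc, chunks.length, chunkIndexFor(300 * 1024));
```

Because each chunk records its position, a reader can seek to the middle of a large file by fetching only the chunks that cover the requested byte range.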

Page 25: MongoDB Internals

GeoSpatial Indexing
• To support efficient queries of geospatial coordinate data, MongoDB provides two special indexes:
  • 2d indexes, which use planar geometry when returning results.
  • 2dsphere indexes, which use spherical geometry to return results.
• Store location data as GeoJSON objects with this coordinate-axis order: longitude, latitude.
• GeoJSON objects supported: Point, LineString, Polygon, etc.
• Query operations: inclusion, intersection, proximity.
• You cannot use a geospatial index as the shard key index.
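
The coordinate order and the idea behind a proximity query on spherical geometry can be shown with plain GeoJSON points and the haversine formula. This sketch does not use a geospatial index or any MongoDB API; the city coordinates are approximate, and `distanceKm` and `near` are hypothetical helpers.

```javascript
// GeoJSON points: coordinates are [longitude, latitude], in that order.
const chicago = { type: "Point", coordinates: [-87.6298, 41.8781] };
const newYork = { type: "Point", coordinates: [-74.006, 40.7128] };

const EARTH_RADIUS_KM = 6371;
const toRad = (deg) => (deg * Math.PI) / 180;

// Great-circle distance between two GeoJSON points (haversine formula).
function distanceKm(a, b) {
  const [lng1, lat1] = a.coordinates;
  const [lng2, lat2] = b.coordinates;
  const dLat = toRad(lat2 - lat1);
  const dLng = toRad(lng2 - lng1);
  const h =
    Math.sin(dLat / 2) ** 2 +
    Math.cos(toRad(lat1)) * Math.cos(toRad(lat2)) * Math.sin(dLng / 2) ** 2;
  return 2 * EARTH_RADIUS_KM * Math.asin(Math.sqrt(h));
}

// Brute-force proximity query: all places within maxKm of a center point.
function near(places, center, maxKm) {
  return places.filter((p) => distanceKm(p.location, center) <= maxKm);
}

const places = [
  { name: "Chicago", location: chicago },
  { name: "New York", location: newYork },
];
console.log(distanceKm(chicago, newYork));
console.log(near(places, chicago, 500).map((p) => p.name));
```

A geospatial index exists so that a query like `near` does not have to compute the distance to every document, as this brute-force version does.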

Page 26: MongoDB Internals

Performance Analysis
• Yahoo! Cloud Serving Benchmark (YCSB)
• Throughput (ops/second)

WORKLOADS                                     Cassandra   Couchbase   MongoDB
50% read, 50% update                            134,839     106,638   160,719
95% read, 5% update                             144,455     187,798   196,498
50% read, 50% update (Durability Optimized)       6,289       1,236    31,864

Page 27: MongoDB Internals

Limitations
• You need enough memory to fit your working set; otherwise performance might suffer.
• MapReduce and aggregation are single-threaded; more specifically, one thread per mongod.
• No joins across collections.
• On 32-bit systems, data is limited to 2.5 GB.
• Sharding has some unique exceptions. If you plan to shard your data, you need to shard early, as some things that are feasible on a single server are not feasible on a sharded collection.

Page 28: MongoDB Internals

Conclusion
• MongoDB is a semi-structured document-oriented NoSQL database.
• It has two storage engines: MMAP and WiredTiger.
• Multiple aggregation frameworks: Aggregation Pipeline and Map-Reduce
• Support for GridFS, GeoSpatial Indexing, Capped Collection
• Better performance as compared to Cassandra and Couchbase.
• On-going work: in-memory and HDFS support

Page 29: MongoDB Internals

DEMO

Page 31: MongoDB Internals

Questions?

Thank you!