A vendor-independent comparison of NoSQL databases: Cassandra, HBase, MongoDB, Riak
This story appeared on Network World at http://www.networkworld.com/news/tech/2012/102212-nosql-263595.html
Network World - "The more alternatives, the more difficult the choice." -- Abbé D'Allainval
In 2010, when the world became enchanted by the capabilities of cloud systems and new databases designed to serve them, a group of researchers from Yahoo decided to look into NoSQL. They developed the YCSB framework to assess the performance of new tools and find the best cases for their use. The results were published in the paper "Benchmarking Cloud Serving Systems with YCSB."

The Yahoo guys did a great job, but, like any paper, it could not include everything:
● The research did not provide all the information we needed for our own analysis.
● Though Cassandra, HBase, Yahoo's PNUTS, and a simple sharded MySQL implementation were analyzed, some of the databases we often work with were not covered.
● Yahoo used high-performance hardware, while most companies would find it more useful to see how these databases perform on average hardware.
As R&D engineers at Altoros Systems Inc., a big data specialist, we were inspired by Yahoo's endeavors and decided to add some effort of our own. This article is our vendor-independent analysis of NoSQL databases, based on performance measured under different system workloads.
What makes this research unique?
Often referred to as NoSQL, non-relational databases feature elasticity and scalability in combination with a capability to store big data and work with cloud computing systems, all of which make them extremely popular. NoSQL data management systems are inherently schema-free (with no excessive complexity and a flexible data model) and eventually consistent (complying with BASE rather than ACID). They have a simple API, serve huge amounts of data, and provide high throughput.

In 2012, the number of NoSQL products reached 120-plus and the figure is still growing. That variety makes it difficult to select the best tool for a particular case. Database vendors usually measure the productivity of their products with custom hardware and software settings designed to demonstrate the advantages of their solutions. We wanted to do independent and unbiased research to complement the work done by the folks at Yahoo.
Using Amazon virtual machines to ensure verifiable results and research transparency (which also helped minimize errors due to hardware differences), we have analyzed and evaluated the following NoSQL solutions:
● Cassandra, a column family store
● HBase (column-oriented, too)
● MongoDB, a document-oriented database
● Riak, a key-value store
We also tested MySQL Cluster and sharded MySQL, taking them as benchmarks.
After some of the results had been presented to the public, some observers said MongoDB should not be compared to other NoSQL databases because it is more targeted at working with memory directly. We certainly understand this, but the aim of this investigation is to determine the best use cases for different NoSQL products. Therefore, the databases were tested under the same conditions, regardless of their specifics.
Tools, libraries and methods
For benchmarking, we used the Yahoo Cloud Serving Benchmark (YCSB), which consists of the following components:
● a framework with a workload generator
● a set of workload scenarios
We have measured database performance under certain types of workloads. A workload was defined by different distributions assigned to the two main choices:
• which operation to perform
• which record to read or write
Operations against a data store were randomly selected and could be of the following types:
• Insert: Inserts a new record.
• Update: Updates a record by replacing the value of one field.
• Read: Reads a record, either one randomly selected field, or all fields.
• Scan: Scans records in order, starting at a randomly selected record key. The number of records to scan is also selected randomly from the range between 1 and 100.
Each workload was targeted at a table of 100,000,000 records; each record was 1,000 bytes in size and contained 10 fields. A primary key identified each record, which was a string, such as "user234123." Each field was named field0, field1, and so on. The values in each field were random strings of ASCII characters, 100 bytes each.
Database performance was defined by the speed at which a database computed basic operations. A basic operation is an action performed by the workload executor, which drives multiple client threads. Each thread executes a sequential series of operations by making calls to the database interface layer both to load the database (the load phase) and to execute the workload (the transaction phase). The threads throttle the rate at which they generate requests, so that we may directly control the offered load against the database. In addition, the threads measure the latency and achieved throughput of their operations and report these measurements to the statistics module.
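To make the workload executor mechanics more concrete, the following simplified sketch shows what such a client thread might look like. It is not the actual YCSB code: the DB, Workload, and Stats types, the throttling math, and all names are stand-ins introduced only for illustration.

    // Simplified sketch of a throttled client thread (not the actual YCSB implementation).
    interface DB { /* database interface layer: insert, update, read, scan */ }
    interface Workload { boolean isDone(); void doTransaction(DB db); }
    interface Stats { void record(long latencyNanos); }

    class ClientThread implements Runnable {
        private final DB db;                  // stand-in for the database interface layer
        private final Workload workload;      // decides which operation to run and which record to touch
        private final double targetOpsPerMs;  // per-thread throttle: desired request rate
        private final Stats stats;            // collects latency and throughput figures

        ClientThread(DB db, Workload workload, double targetOpsPerMs, Stats stats) {
            this.db = db;
            this.workload = workload;
            this.targetOpsPerMs = targetOpsPerMs;
            this.stats = stats;
        }

        @Override
        public void run() {
            long startMs = System.currentTimeMillis();
            long opsDone = 0;
            while (!workload.isDone()) {
                long opStart = System.nanoTime();
                workload.doTransaction(db);                 // insert, update, read or scan
                stats.record(System.nanoTime() - opStart);  // latency of this single operation
                opsDone++;
                // Throttle: if this thread is ahead of its target rate, sleep until it is back on schedule.
                long deadlineMs = startMs + (long) (opsDone / targetOpsPerMs);
                long aheadMs = deadlineMs - System.currentTimeMillis();
                if (aheadMs > 0) {
                    try { Thread.sleep(aheadMs); } catch (InterruptedException e) { return; }
                }
            }
        }
    }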
The performance of the system was evaluated under different workloads:
Workload A: Update heavily
Workload B: Read mostly
Workload C: Read only
Workload D: Read latest
Workload E: Scan short ranges
Workload F: Read-modify-write
Workload G: Write heavily
Each workload was defined by:
1) The number of records manipulated (read or written)
2) The number of columns per each record
3) The total size of a record or the size of each column
4) The number of threads used to load the system
This research also specifies configuration settings for each type of workload. We used the following default settings (a sample configuration sketch follows the list):
1) 100,000,000 records manipulated
2) The total size of a record equal to 1KB
3) 10 fields of 100 bytes each per record
4) Multithreaded communications with the system (100 threads)
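For reference, the defaults above roughly correspond to the core YCSB workload properties shown below. The snippet builds them as a Java Properties object, but the same keys normally live in a workload file passed to the YCSB client, with the thread count given on the command line (for example, -threads 100). Treat the values as a sketch of our setup rather than a copy of the exact configuration files we used.

    import java.util.Properties;

    // Sketch of YCSB CoreWorkload settings matching the defaults described above.
    public class WorkloadDefaults {
        public static Properties defaults() {
            Properties p = new Properties();
            p.setProperty("workload", "com.yahoo.ycsb.workloads.CoreWorkload");
            p.setProperty("recordcount", "100000000");     // 100,000,000 records
            p.setProperty("fieldcount", "10");             // 10 fields per record
            p.setProperty("fieldlength", "100");           // 100 bytes per field, about 1KB per record
            p.setProperty("requestdistribution", "zipfian");
            // Operation mix; each workload (A-G) overrides these proportions.
            p.setProperty("readproportion", "0.5");
            p.setProperty("updateproportion", "0.5");
            return p;
        }
    }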
Testing environment
To provide verifiable results, benchmarking was performed on Amazon Elastic Compute Cloud instances. The Yahoo Cloud Serving Benchmark client was deployed on one Amazon Large Instance:
• 7.5GB of memory
• four EC2 Compute Units (two virtual cores with two EC2 Compute Units each)
• 850GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (500Mbps)
• API name: m1.large
Each of the NoSQL databases was deployed on a four-node cluster in the same geographical region on Amazon Extra Large Instances:
• 15GB of memory
• eight EC2 Compute Units (four virtual cores with two EC2 Compute Units each)
• 1690GB of instance storage
• 64-bit platform
• high I/O performance
• EBS-Optimized (1000Mbps)
• API name: m1.xlarge
Amazon is often blamed for its high I/O wait times and comparatively slow EBS performance. To mitigate these drawbacks, the EBS disks were assembled into a RAID 0 array with striping, which provided up to two times higher performance.
The results
When we started our research into NoSQL databases, we wanted to get unbiased results that would show which solution is best suited to each particular task. That is why we decided to test the performance of each database under different types of loads and let users decide which product better suits their needs.

We started by measuring the load phase, during which 100 million records, each containing 10 fields of 100 randomly generated bytes, were imported to a four-node cluster.
HBase demonstrated by far the best writing speed. With pre-created regions and deferred log flush enabled, it reached 40K ops/sec. Cassandra also showed great performance during the loading phase with around 15K ops/sec. The data is first saved to the commit log, using the append method, which is a fast operation. Then it is written to a per-column-family memory store called a Memtable. Once the Memtable becomes full, the data is saved to disk as an SSTable. Incidentally, MySQL Cluster could show much better results in its "just in-memory" mode.
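Pre-creating regions matters for the bulk load because, without pre-splitting, every write initially lands in a single region on one server. Below is a hedged sketch of how a table could be created with pre-split regions and deferred log flush through the HBase Java API of that era (0.9x); the table name, column family, and split keys are illustrative, not our exact setup.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.HColumnDescriptor;
    import org.apache.hadoop.hbase.HTableDescriptor;
    import org.apache.hadoop.hbase.client.HBaseAdmin;
    import org.apache.hadoop.hbase.util.Bytes;

    // Creates a table with pre-split regions and deferred log flush (HBase 0.9x-era API).
    public class PreSplitTable {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            HBaseAdmin admin = new HBaseAdmin(conf);

            HTableDescriptor desc = new HTableDescriptor("usertable");  // illustrative table name
            desc.setDeferredLogFlush(true);                             // WAL edits are flushed asynchronously
            desc.addFamily(new HColumnDescriptor("family"));            // illustrative column family

            // Pre-split so that the initial load is spread across all region servers from the start.
            byte[][] splitKeys = {
                Bytes.toBytes("user2"), Bytes.toBytes("user4"),
                Bytes.toBytes("user6"), Bytes.toBytes("user8")
            };
            admin.createTable(desc, splitKeys);
        }
    }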
* Workload A: Update-heavy mode. Workload A is an update-heavy scenario that simulates a database recording the typical actions of an e-commerce user. Settings for the workload:
1) Read/update ratio: 50/50
2) Zipfian request distribution
During updates, HBase and Cassandra pulled far ahead of the rest of the group, with average response latencies not exceeding two milliseconds. HBase was even faster. The HBase client was configured with AutoFlush turned off, so updates were aggregated in the client-side buffer and pending writes were flushed asynchronously as soon as the buffer became full. To accelerate update processing on the server, deferred log flush was enabled and WAL edits were kept in memory during the flush period.
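The client-side part of this configuration might look roughly like the sketch below, using the HTable API of the same HBase generation; the buffer size and all names are illustrative.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.util.Bytes;

    // With autoFlush off, Puts accumulate in a client-side buffer and are sent to the
    // region servers only when the buffer fills up (or when flushCommits is called).
    public class BufferedUpdates {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "usertable");
            table.setAutoFlush(false);                  // do not send each Put in its own RPC
            table.setWriteBufferSize(12 * 1024 * 1024); // 12MB buffer; the value is illustrative

            Put put = new Put(Bytes.toBytes("user234123"));
            put.add(Bytes.toBytes("family"), Bytes.toBytes("field0"), Bytes.toBytes("new-value"));
            table.put(put);       // buffered locally, not yet on the server

            table.flushCommits(); // force out any pending writes
            table.close();
        }
    }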
Cassandra wrote each mutation to the commit log for durability and then updated the in-memory Memtable. This is a slower but safer approach compared with HBase's deferred log flushing.
* Workload A: Read. During reads, per-column-family compression gave HBase and Cassandra faster data access. HBase was configured with the native LZO codec and Cassandra with Google's Snappy codec. Although decompression takes extra CPU time, compression reduces the number of bytes read from disk.
* Workload B: Read-heavy mode. Workload B consisted of 95% reads and 5% writes. Content tagging can serve as an example of a task matching this workload: adding a tag is an update, but most operations involve reading tags. Settings for the "read-mostly" workload:
1) Read/update ratio: 95/5
2) Zipfian request distribution
Sharded MySQL showed the best performance in reads. MongoDB -- accelerated by the "memory mapped files" type of cache -- was close to that result. Memory-mapped files were used for all disk I/O in MongoDB. Cassandra's key and row caching enabled very fast access to frequently requested data. With the off-heap row caching feature added in Version 0.8, it showed excellent read performance while using less per-row memory. The key cache held locations of keys in memory on a per-column family basis and defined the offset for the SSTable where the rows were stored. With a key cache, there was no need to look for the position of the row in the SSTable index file. Thanks to the row cache, we did not have to read rows from the SSTable data file. In other words, each key cache hit saved us one disk seek and each row cache hit saved two disk seeks. In HBase, random read performance was slower. However, Cassandra and HBase can provide faster data access with per-column-family compression.
* Workload B: Update. Thanks to deferred log flushing, HBase showed very high throughput with extremely small latency under heavy writes. With deferred log flush turned on, the edits were first committed to the memstore, and the aggregated edits were then flushed to the HLog asynchronously. On the client side, the HBase write buffer cached writes with the autoFlush option set to false, which also improved performance greatly. For durability, HBase confirms a write only after its write-ahead log entry has reached a particular number of in-memory HDFS replicas. HBase's write latency with memory commit was roughly equal to the latency of data transmission over the network. Cassandra demonstrated great write throughput, since it first writes to the commit log -- using the append method, which is a pretty fast operation -- and then to a per-column-family memory store called a Memtable.
* Workload C: Read-only. Settings for the workload:
1) Read/update ratio: 100/0
2) Zipfian request distribution
This read-only workload simulated a data caching system. The data was stored outside the system, while the application was only reading it. Thanks to B-tree indexes, sharded MySQL became the winner in this competition.
* Workload E: Scanning short ranges. Settings for the workload:
1) Read/update/insert ratio: 95/0/5
2) Latest request distribution
3) Max scan length: 100 records
4) Scan length distribution: uniform
HBase performed better than Cassandra in range scans. HBase scanning is a form of hierarchical fast merge-sort performed by HRegionScanner. It merges the results received from HStoreScanners (one per family), which in turn merge the results received from HStoreFileScanners (one for each file in the family). If scanner caching is turned on, the server returns the specified number of records per call instead of forcing the client to go back to the HRegionServer for every record.
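Scanner caching is set per scan on the client side. A minimal sketch with the same-era HBase API follows; the start key and the caching value of 100 (matching the maximum scan length in workload E) are illustrative.

    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.client.HTable;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.ResultScanner;
    import org.apache.hadoop.hbase.client.Scan;
    import org.apache.hadoop.hbase.util.Bytes;

    // Short range scan starting at a given key, with scanner caching enabled so that
    // up to 100 rows come back per RPC instead of one row per round trip.
    public class ShortRangeScan {
        public static void main(String[] args) throws Exception {
            HTable table = new HTable(HBaseConfiguration.create(), "usertable");
            Scan scan = new Scan(Bytes.toBytes("user234123"));
            scan.setCaching(100);                           // rows fetched per RPC
            ResultScanner scanner = table.getScanner(scan);
            int scanned = 0;
            for (Result row : scanner) {
                if (++scanned >= 100) break;                // max scan length used in workload E
            }
            scanner.close();
            table.close();
        }
    }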
Cassandra's scan performance with a random partitioner has improved considerably compared to Version 0.6, where this feature was initially introduced.
Cassandra's SSTable (sorted strings table) is a file of key-value pairs sorted by key and written in key order. To achieve maximum performance during range scans, we would have had to use an order-preserving partitioner. Scanning over ranges of order-preserved rows is very fast, similar to moving a cursor through a continuous index. However, with an order-preserving partitioner the database cannot distribute individual keys and the corresponding rows evenly over the cluster, so a random partitioner -- the default partitioning strategy in Cassandra -- is used to ensure even data distribution. Random partitioning provides good load balancing, but gives up the extra range-scan speed of an order-preserving partitioner.
In MongoDB 2.5, the table scan triggered by the { "_id":{"$gte": startKey}} query showed a maximum throughput of 20 ops/sec with a latency of ≈ 1 sec.
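Through the MongoDB Java driver of that period, the same query might look roughly like this; the connection details, database, and collection names are assumptions.

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.DBCursor;
    import com.mongodb.Mongo;

    // Range scan over _id, equivalent to { "_id": { "$gte": startKey } }, capped at 100 records.
    public class MongoRangeScan {
        public static void main(String[] args) throws Exception {
            Mongo mongo = new Mongo("localhost", 27017);           // connection details are illustrative
            DBCollection usertable = mongo.getDB("ycsb").getCollection("usertable");

            BasicDBObject query = new BasicDBObject("_id",
                    new BasicDBObject("$gte", "user234123"));      // the start key is illustrative
            DBCursor cursor = usertable.find(query).limit(100);    // max scan length from workload E
            while (cursor.hasNext()) {
                cursor.next();                                     // process the record
            }
            cursor.close();
            mongo.close();
        }
    }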
The performance of MySQL Cluster was under 10 ops/sec with a latency of 400 ms. It is partitioned over the nodes in the cluster, so the system uses an optimizer to translate SQL commands into a query plan. The execution of this plan is divided among multiple nodes. For range scans, a B-tree index is used to make column comparisons in such expressions as >, <, or BETWEEN.
Sharded MySQL is based on key hashing on the connector side and does not support true range scans over a cluster. While a single shard did about 10 ops/sec, the whole sharded setup showed nearly 40 ops/sec with a latency of up to 400 ms. MyISAM caches index blocks but not data blocks, so there can be an overhead due to re-reading data blocks from the OS buffer cache.
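To illustrate what key hashing on the connector side means in practice, here is a purely hypothetical sketch: the shard is chosen by hashing the record key, so a single-key read hits exactly one MySQL server, while a range scan would have to query every shard and merge the results on the client. The JDBC URLs, credentials, table, and column names are assumptions.

    import java.sql.Connection;
    import java.sql.DriverManager;
    import java.sql.PreparedStatement;
    import java.sql.ResultSet;

    // Hypothetical connector-side sharding: the key's hash picks the shard.
    // A true range scan cannot be answered by a single shard: hashing destroys the key
    // order, so every shard would have to be scanned and the results merged client-side.
    public class ShardedReader {
        private static final String[] SHARD_URLS = {               // illustrative JDBC URLs
            "jdbc:mysql://shard0/ycsb", "jdbc:mysql://shard1/ycsb",
            "jdbc:mysql://shard2/ycsb", "jdbc:mysql://shard3/ycsb"
        };

        public static String readField0(String key) throws Exception {
            int shard = Math.abs(key.hashCode()) % SHARD_URLS.length;  // key hashing on the connector side
            try (Connection conn = DriverManager.getConnection(SHARD_URLS[shard], "user", "password");
                 PreparedStatement stmt =
                     conn.prepareStatement("SELECT field0 FROM usertable WHERE ycsb_key = ?")) {
                stmt.setString(1, key);
                try (ResultSet rs = stmt.executeQuery()) {
                    return rs.next() ? rs.getString("field0") : null;
                }
            }
        }
    }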
The Riak Bitcask storage engine does not support range scans. They can be emulated with secondary indexes on the eleveldb backend, using the special $key index that refers to the primary key. However, eleveldb showed insufficient performance, which started to degrade after 50,000,000 records had been imported, so we fell back to Bitcask.
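As far as we recall, the $key range query goes over Riak's HTTP secondary-index API (available only with the eleveldb backend). The sketch below reflects the Riak 1.x endpoint format as we remember it; the host, port, bucket, and key bounds are assumptions.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.HttpURLConnection;
    import java.net.URL;

    // Range query over the special $key secondary index, which refers to the primary key.
    public class RiakKeyRangeScan {
        public static void main(String[] args) throws Exception {
            // GET /buckets/<bucket>/index/$key/<start>/<end>  (Riak 1.x HTTP 2i endpoint)
            URL url = new URL("http://localhost:8098/buckets/usertable/index/$key/user1000/user1100");
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
            String line;
            while ((line = in.readLine()) != null) {
                System.out.println(line);    // JSON body listing the matching keys
            }
            in.close();
            conn.disconnect();
        }
    }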
* Workload G: Insert-mostly mode. Settings for the workload:
1) Insert/Read: 90/10
2) Latest request distribution
HBase showed the best results under a workload that included large volumes of writes. Cassandra was second. The NDB engine of MySQL Cluster also managed intensive writes perfectly well.
Conclusion
As you can see, there is no perfect NoSQL database. Every database has its advantages and disadvantages that become more or less important depending on your preferences and the types of tasks at hand.

For example, a database can demonstrate excellent performance, but once the number of records exceeds a certain limit, the speed falls dramatically. That means this particular solution can be good for moderate data loads and extremely fast computations, but it would not be suitable for jobs that require a lot of reads and writes. In addition, database performance also depends on the capacity of your hardware.
It was hardly possible to include all of the performance diagrams and describe everything in one article. You can download the full version of the research, which contains separate chapters dedicated to every database, YCSB and Amazon EC2 configuration details, and an appendix with other performance diagrams, at http://altoros.com/nosql-research.
We hope this research will be useful to both developers working with NoSQL solutions and customers trying to choose a database. Altoros's R&D team will regularly revise and update this research to cover new databases and new releases of the most popular products.
About the author: Sergey Bushik is a senior R&D engineer at Altoros. He has more than seven years of experience in implementation of Java-based projects that include big data processing, data mining and Hadoop computations. Sergey has a number of certificates in Java and is a Sun Certified Enterprise Architect for the Java Platform. He is a regular speaker at international conferences -- most recently, he delivered sessions at Big Data Meetup (Sunnyvale, Calif.), GOTO Copenhagen 2012, Hadoop Evening (Eastern Europe), etc.