You may wonder: what is Hive and why should I care? Straight from Wikipedia: 'Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.' As for why you should care? If all you're doing is small-data stuff on MySQL, you don't really have a reason to care. On the other hand, if you're dealing with massive amounts of data, you probably already know about Hive. What's really interesting, though, is the import speed Hive provides. Below are the results of some tests we ran to illustrate this point.
Referring back to our previous blog post about how to generate large amounts of test data using the TPC-DS data generation scripts, we ran the following tests on MySQL and Hive in order to get some (very) basic performance metrics.
First of all, we generated a 3GB data set using the following command:
dsdgen -scale 3 -force
which generated the following data files:
> ls -l
-rw-r--r-- 1 root root       3119 Sep 15 03:08 call_center.dat
-rw-r--r-- 1 root root    1631792 Sep 15 03:08 catalog_page.dat
-rw-r--r-- 1 root root  896286083 Sep 15 03:09 catalog_sales.dat
-rw-r--r-- 1 root root   65042990 Sep 15 03:09 catalog_returns.dat
-rw-r--r-- 1 root root   24950720 Sep 15 03:09 customer.dat
-rw-r--r-- 1 root root   10354456 Sep 15 03:09 customer_address.dat
-rw-r--r-- 1 root root   80660096 Sep 15 03:09 customer_demographics.dat
-rw-r--r-- 1 root root   10317438 Sep 15 03:09 date_dim.dat
-rw-r--r-- 1 root root        328 Sep 15 03:09 income_band.dat
-rw-r--r-- 1 root root     151653 Sep 15 03:09 household_demographics.dat
-rw-r--r-- 1 root root  576099667 Sep 15 03:10 inventory.dat
-rw-r--r-- 1 root root   10140270 Sep 15 03:10 item.dat
-rw-r--r-- 1 root root       1113 Sep 15 03:10 ship_mode.dat
-rw-r--r-- 1 root root       1401 Sep 15 03:10 reason.dat
-rw-r--r-- 1 root root      42649 Sep 15 03:10 promotion.dat
-rw-r--r-- 1 root root       8341 Sep 15 03:10 store.dat
-rw-r--r-- 1 root root 1180463121 Sep 15 03:12 store_sales.dat
-rw-r--r-- 1 root root   99674498 Sep 15 03:12 store_returns.dat
-rw-r--r-- 1 root root    5107780 Sep 15 03:12 time_dim.dat
-rw-r--r-- 1 root root        716 Sep 15 03:12 warehouse.dat
-rw-r--r-- 1 root root       8662 Sep 15 03:12 web_page.dat
-rw-r--r-- 1 root root   29889520 Sep 15 03:13 web_returns.dat
-rw-r--r-- 1 root root  445928016 Sep 15 03:13 web_sales.dat
-rw-r--r-- 1 root root       9305 Sep 15 03:13 web_site.dat
As per our dsdgen command, this generated about 3GB worth of data. For our tests, we'll take the largest file (store_sales.dat), which is roughly 1.1GB, or 8.6 million rows. First we tried MySQL (see this previous blog post on how to set up MySQL with InnoDB tables for good import performance, so you don't spend lots of time waiting just because your database writes to disk after every insert):
mysql> create table store_sales (
         ss_sold_date_sk integer,
         ss_sold_time_sk integer,
         ss_item_sk integer not null,
         ss_customer_sk integer,
         ss_cdemo_sk integer,
         ss_hdemo_sk integer,
         ss_addr_sk integer,
         ss_store_sk integer,
         ss_promo_sk integer,
         ss_ticket_number integer not null,
         ss_quantity integer,
         ss_wholesale_cost decimal(7,2),
         ss_list_price decimal(7,2),
         ss_sales_price decimal(7,2),
         ss_ext_discount_amt decimal(7,2),
         ss_ext_sales_price decimal(7,2),
         ss_ext_wholesale_cost decimal(7,2),
         ss_ext_list_price decimal(7,2),
         ss_ext_tax decimal(7,2),
         ss_coupon_amt decimal(7,2),
         ss_net_paid decimal(7,2),
         ss_net_paid_inc_tax decimal(7,2),
         ss_net_profit decimal(7,2),
         primary key (ss_item_sk, ss_ticket_number)
       );
mysql> LOAD DATA INFILE '/data/Splice/TPC-DS/tools/store_sales.dat'
       INTO TABLE store_sales
       FIELDS TERMINATED BY '|'
       LINES TERMINATED BY '\n';
Query OK, 8639377 rows affected, 65535 warnings (1 hour 7 min 55.59 sec)
Records: 8639377  Deleted: 0  Skipped: 0  Warnings: 8170060
mysql> select count(*) from store_sales;
+----------+
| count(*) |
+----------+
|  8639377 |
+----------+
1 row in set (22.73 sec)
1 hour 7 minutes: not bad, but not really stellar either. We know that MySQL import performance drops off after about 1M rows, and we could split our data file into several smaller files, but still: if you have a lot of data, you need to get it into MySQL somehow. Also, the reason the select count(*) query takes 22 seconds is that MySQL doesn't have the query cached yet. If we repeat the query, it comes back instantaneously, which is what we would expect from a cached query result. Just for kicks, we repeated the MySQL test after bumping up the buffer pool size to 2GB in my.cnf:
innodb_buffer_pool_size = 2048M
With this change to the MySQL configuration, the import finishes in 35 minutes, which is quite an improvement, but still not as fast as we would hope.
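As an aside, the file-splitting workaround mentioned above is easy to script with the standard split utility. Here is a minimal sketch; the chunk size and file names are illustrative, and we run it on a tiny generated sample so you can try it anywhere (for the real store_sales.dat you would use a chunk size of around a million lines):

```shell
# Stand-in for store_sales.dat: a tiny pipe-delimited sample file.
# With the real data file you would skip this step.
seq 1 25 | sed 's/$/|sample/' > store_sales_sample.dat

# Split into fixed-size chunks (use -l 1000000 for the real file).
# Produces store_sales_part_aa, store_sales_part_ab, ...
split -l 10 store_sales_sample.dat store_sales_part_

# Sanity check: 25 lines become chunks of 10, 10 and 5 lines.
wc -l store_sales_part_*
```

Each resulting chunk can then be loaded with its own LOAD DATA INFILE statement, keeping every individual import inside the range where MySQL still performs well.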
Now let's try the same thing in Hive (note that Hive generates lots of trace messages, which we've removed for the sake of clarity):
hive> create table store_sales
    > (
    >   ss_sold_date_sk int,
    >   ss_sold_time_sk int,
    >   ss_item_sk int,
    >   ss_customer_sk int,
    >   ss_cdemo_sk int,
    >   ss_hdemo_sk int,
    >   ss_addr_sk int,
    >   ss_store_sk int,
    >   ss_promo_sk int,
    >   ss_ticket_number int,
    >   ss_quantity int,
    >   ss_wholesale_cost double,
    >   ss_list_price double,
    >   ss_sales_price double,
    >   ss_ext_discount_amt double,
    >   ss_ext_sales_price double,
    >   ss_ext_wholesale_cost double,
    >   ss_ext_list_price double,
    >   ss_ext_tax double,
    >   ss_coupon_amt double,
    >   ss_net_paid double,
    >   ss_net_paid_inc_tax double,
    >   ss_net_profit double
    > );
hive> LOAD DATA LOCAL INPATH '/data/Splice/TPC-DS/tools/store_sales.dat' OVERWRITE INTO TABLE store_sales;
Copying data from file:/data/Splice/TPC-DS/tools/store_sales.dat
Copying file: file:/data/Splice/TPC-DS/tools/store_sales.dat
Loading data to table default.store_sales
Time taken: 5.406 seconds
hive> select count(*) from store_sales;
8639377
Time taken: 8.082 seconds
Wow! Hive imported our data file in 5.5 seconds, almost 400 times faster than the MySQL version with the extra-large buffer cache! Pretty amazing. Actually, this is unbelievably fast. It is instructive to look at the contents of the /var/lib/hive/ directory to see what happened: Hive simply copied the .dat file to this directory, and since the file had just been generated and thus was still cached in memory, this was blazingly fast. You may think "this is cheating", and it may be; however, since the data is now available for querying, we don't really care what Hive did, as for all practical purposes it did import the file.
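You can reproduce the page-cache effect directly in the shell, without Hive at all. The sketch below (file name and size are arbitrary) writes a file and immediately copies it: the copy reads straight from memory, which is essentially the only work Hive's LOAD DATA did on our freshly generated file:

```shell
# Write a 64MB file; the data lands in the OS page cache.
dd if=/dev/zero of=sample.dat bs=1048576 count=64 2>/dev/null

# Copying it right away reads from memory, not disk -- this is
# roughly what Hive does when it "imports" a local file.
time cp sample.dat sample_copy.dat

# Confirm the copy is byte-for-byte identical.
cmp sample.dat sample_copy.dat && echo "copy is identical"
```

Try the same copy again after dropping the caches (or after a reboot) and you'll see the difference the page cache makes.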
The basic select count(*) query was also considerably faster than in MySQL, by a factor of about 3.5. Now you may think that Hive is the answer to all your database worries and that you can just use it instead of MySQL. Not so fast! It is important to realize what Hive is, what it is not, and what its limitations are:
- Hive is not a database which will spit out query results with blazing speed. It's great at processing huge amounts of data, but for the type of usage which is typical for MySQL (i.e. lots of queries over relatively small amounts of data), Hive is not a good fit. The reason for this is that Hive is essentially a batch processing system, which incurs large overheads when launching its jobs.
- Hive is not a full SQL implementation but rather supports its own subset of SQL, which goes by the name of HQL. So if you have complex queries which work in MySQL or another RDBMS, chances are you can't just throw them at Hive and expect them to work.
- Due to its architecture, Hive doesn't do query caching the way MySQL does, which means that a query which took 10 seconds the first time will take about 10 seconds every subsequent time. So while Hive is great for large-scale data analysis, you can forget about using it to run your website.
So is Hive cool technology? Definitely yes. Its import speed is nothing short of amazing, and once you start dealing with *lots* of data (note that in terms of typical Hive/Hadoop usage, 8M rows is not a lot of data) and using its clustering capabilities to farm out your queries over lots of nodes, it can deliver some pretty amazing performance. So if you find yourself with nothing to do on a slow Sunday, play around with Hive for a few hours; it's a pretty interesting experience.