« Blog Overview

MySql vs Hive Import Speed: can you say 'Wow'?

You may wonder: what is Hive and why should I care? Straight from Wikipedia: 'Apache Hive is a data warehouse infrastructure built on top of Hadoop for providing data summarization, query, and analysis. While initially developed by Facebook, Apache Hive is now used and developed by other companies such as Netflix. Amazon maintains a software fork of Apache Hive that is included in Amazon Elastic MapReduce on Amazon Web Services.' As for why you should care? If all you're doing is small-data stuff on MySql, you don't really have a reason to care. On the other hand, if you're dealing with massive amounts of data, you probably already know about Hive. What's really interesting though, is the import speed Hive provides. Below are the results of some tests we've done to illustrate this point.

Referring back to our previous blog post about how to generate large amounts of test data using the TPC-DS data generation scripts, we ran the following tests on my MySql and Hive in order to get some (very) basic performance metrics.

First of all, we generated a 3GB data set using the following command:

dsdgen -scale 3 -force

which generated the following data files:

> ls -l
-rw-r--r-- 1 root root       3119 Sep 15 03:08 call_center.dat
-rw-r--r-- 1 root root    1631792 Sep 15 03:08 catalog_page.dat
-rw-r--r-- 1 root root  896286083 Sep 15 03:09 catalog_sales.dat
-rw-r--r-- 1 root root   65042990 Sep 15 03:09 catalog_returns.dat
-rw-r--r-- 1 root root   24950720 Sep 15 03:09 customer.dat
-rw-r--r-- 1 root root   10354456 Sep 15 03:09 customer_address.dat
-rw-r--r-- 1 root root   80660096 Sep 15 03:09 customer_demographics.dat
-rw-r--r-- 1 root root   10317438 Sep 15 03:09 date_dim.dat
-rw-r--r-- 1 root root        328 Sep 15 03:09 income_band.dat
-rw-r--r-- 1 root root     151653 Sep 15 03:09 household_demographics.dat
-rw-r--r-- 1 root root  576099667 Sep 15 03:10 inventory.dat
-rw-r--r-- 1 root root   10140270 Sep 15 03:10 item.dat
-rw-r--r-- 1 root root       1113 Sep 15 03:10 ship_mode.dat
-rw-r--r-- 1 root root       1401 Sep 15 03:10 reason.dat
-rw-r--r-- 1 root root      42649 Sep 15 03:10 promotion.dat
-rw-r--r-- 1 root root       8341 Sep 15 03:10 store.dat
-rw-r--r-- 1 root root 1180463121 Sep 15 03:12 store_sales.dat
-rw-r--r-- 1 root root   99674498 Sep 15 03:12 store_returns.dat
-rw-r--r-- 1 root root    5107780 Sep 15 03:12 time_dim.dat
-rw-r--r-- 1 root root        716 Sep 15 03:12 warehouse.dat
-rw-r--r-- 1 root root       8662 Sep 15 03:12 web_page.dat
-rw-r--r-- 1 root root   29889520 Sep 15 03:13 web_returns.dat
-rw-r--r-- 1 root root  445928016 Sep 15 03:13 web_sales.dat
-rw-r--r-- 1 root root       9305 Sep 15 03:13 web_site.dat

As per our dsdgen command, this generated about 3GB worth of data. We'll take the largest file (store_sales.dat), which is roughly 1.1GB or 8.6 million rows for our tests. First we tried MySql (see this previous blog post on how to setup MySql with InnoDB tables for good import performance so you don't spend lots of time waiting just because your database writes to disk after every insert):

mysql> create table store_sales
    ss_sold_date_sk           integer                       ,
    ss_sold_time_sk           integer                       ,
    ss_item_sk                integer               not null,
    ss_customer_sk            integer                       ,
    ss_cdemo_sk               integer                       ,
    ss_hdemo_sk               integer                       ,
    ss_addr_sk                integer                       ,
    ss_store_sk               integer                       ,
    ss_promo_sk               integer                       ,
    ss_ticket_number          integer               not null,
    ss_quantity               integer                       ,
    ss_wholesale_cost         decimal(7,2)                  ,
    ss_list_price             decimal(7,2)                  ,
    ss_sales_price            decimal(7,2)                  ,
    ss_ext_discount_amt       decimal(7,2)                  ,
    ss_ext_sales_price        decimal(7,2)                  ,
    ss_ext_wholesale_cost     decimal(7,2)                  ,
    ss_ext_list_price         decimal(7,2)                  ,
    ss_ext_tax                decimal(7,2)                  ,
    ss_coupon_amt             decimal(7,2)                  ,
    ss_net_paid               decimal(7,2)                  ,
    ss_net_paid_inc_tax       decimal(7,2)                  ,
    ss_net_profit             decimal(7,2)                  ,
    primary key (ss_item_sk, ss_ticket_number)
mysql> LOAD DATA INFILE '/data/Splice/TPC-DS/tools/store_sales.dat' INTO TABLE store_sales FIELDS TERMINATED BY '|' LINES TERMINATED BY '\n';
Query OK, 8639377 rows affected, 65535 warnings (1 hour 7 min 55.59 sec)
Records: 8639377  Deleted: 0  Skipped: 0  Warnings: 8170060
mysql> select count(*) from store_sales;
| count(*) |
|  8639377 |
1 row in set (22.73 sec)

1 hour 7 minutes, not bad but not really stellar either. We know that MySql import performance drops off after about 1M rows and we could split our data file into several smaller files, but still, if you have a lot of data, you need to get it into MySql somehow. Also, the reason the select count(*) query takes 22 seconds is that MySql doesn't have the query cached yet. If we repeat the query, it comes back instantaneous, which is what we would expect from a cached query result. Just for kicks, we repeated the MySql test after bumping up the buffer pool size to 2GB in my.cnf

innodb_buffer_pool_size = 2048M

With this change to the MySql configuration, the import finishes in 35 minutes, which is quite an improvement but still not as fast as we would hope.

Now lets try the same thing in Hive (note that Hive generates lots of trace messages which we've removed for the sake of clarity):

hive> create table store_sales
    > (
    >     ss_sold_date_sk           int                       ,
    >     ss_sold_time_sk           int                       ,
    >     ss_item_sk                int                       ,
    >     ss_customer_sk            int                       ,
    >     ss_cdemo_sk               int                       ,
    >     ss_hdemo_sk               int                       ,
    >     ss_addr_sk                int                       ,
    >     ss_store_sk               int                       ,
    >     ss_promo_sk               int                       ,
    >     ss_ticket_number          int                       ,
    >     ss_quantity               int                       ,
    >     ss_wholesale_cost         double                  ,
    >     ss_list_price             double                  ,
    >     ss_sales_price            double                  ,
    >     ss_ext_discount_amt       double                  ,
    >     ss_ext_sales_price        double                  ,
    >     ss_ext_wholesale_cost     double                  ,
    >     ss_ext_list_price         double                  ,
    >     ss_ext_tax                double                  ,
    >     ss_coupon_amt             double                  ,
    >     ss_net_paid               double                  ,
    >     ss_net_paid_inc_tax       double                  ,
    >     ss_net_profit             double
    > );
hive> LOAD DATA LOCAL INPATH '/data/Splice/TPC-DS/tools/store_sales.dat'  OVERWRITE INTO TABLE store_sales;
Copying data from file:/data/Splice/TPC-DS/tools/store_sales.dat
Copying file: file:/data/Splice/TPC-DS/tools/store_sales.dat
Loading data to table default.store_sales
Time taken: 5.406 seconds
hive> select count(*) from store_sales;
Time taken: 8.082 seconds

Wow! Hive imported our data file in 5.5 seconds, almost 400 times faster than the MySql version with the extra large buffer cache! Pretty amazing. Actually, this is unbelivably fast. It is instructive to look the contents of the /var/lib/hive/ directory to see what happened. Hive simply copied the dat file to this directory and since the file had just been generated and thus was still cached in memory, this was blazingly fast. You may think "this is cheating" and it may be, however, since it is now available for querying we don't really care what Hive did as for all practical purposes it did import the file

The basic select count(*) query also was considerably faster than in MySql by about a factor 3.5. Now you may think that Hive is the answer to all your database worries and that you can just use it instead of MySql. Not so fast! It is important to realize what Hive is and what it’s not and what it's limitations are:

  • Hive is not a database which will spit out query results with blazing speed. It's great at processing huge amounts of data, but for the type of usage which is typical for MySql (ie: lots of queries over relatively small amounts of data), Hive is not a good fit. The reason for this is that Hive is essentially a batch processing system which incurs large overheads for launching its jobs.

  • Hive is not a full SQL implmentation but rather supports it's own subset of SQL which goes by the name of HQL so if you have complex queries which work in MySql or another RDBMS, chances are you can't just throw them at Hive and expect it to work.

  • Due to it's architecture, Hive doesn't do query caching the way MySql does, which means that a query which took 10 seconds the first time, will take about 10 seconds every subsequent time. So while Hive is great for large-scale data analysis, you can forget about using it to run your website.

So is Hive cool technology? Definitely yes. It's import speed is nothing short of amazing and once you start dealing with *lots* of data (and you should note that in terms of typical Hive/Hadoop usages, 8M rows is not a lot of data) and using it's clustering capabilities to farm out your queries over lots of nodes, it can deliver some pretty amazing performance. So if you find yourself with nothing to do on a slow Sunday, play around with Hive for a few hours, it's a pretty interesting experience.

« Blog Overview