« Blog Overview

Using TPC-DS to generate RDBMS performance benchmark data

If you ever find yourself in need to generate massive quantities of benchmark data to test your database's data-import or query performance, the TPC (Transaction Processing Performance Council) provides a handy tool which can easily generate gigabytes of data. Yes, the data it generates and the queries it provides are geared towards decision support applications, but that doesn't prevent these scripts from being a good testing ground for your database; especially if you wish to compare performance on several database platforms.


While the TPC provides a whole range of benchmark suites for various purposes, the TPC-DS benchmark is probably the easiest to implement and use. Best of all, it's free (but does require you to submit your data for a download request) and on a modern Linux box, compiles out of the box without having to resort to any hacking. If you're using Mac OS X, it's not quite as easy as it will generate compile errors which you have to fix manually. So for the purposes of using the TPC-DS benchmark, do yourself a favour and use a Linux box to generate the data.


Here's what do need to do in order to set the TPC-DS benchmark up (on a Linux box):


- Download the DSGen utility (duh!)
- Extract the downloaded archive
- cd TPC-DS
- cd tools
- make (ignore the compile warnings, just ensure the build process completes successfully)

OK, you've now built the required utlities to use the benchmark. At this point it's probably useful to download and read the provided documentation in order to better understand the scope of what the benchmark provides, but here's a quick rundown of how to generate the testdata.


./dsdgen

will simply generate the test data (which is generated into | delimited text files with the extension *.dat) at the default scale factor (which is 1). Each scale factor corresponds to roughly 1GB of data, so, for example, the command


./dsdgen -scale 5 -force

will generate 5 GB of data and the -force option will overwrite previously generated data. Without the -force option, dsdgen will refuse to overwrite existing test data and simply do nothing.


Now that you have your test data ready, you can load it into your database. For MySql this rougly involves the following:


mysql -u <your_mysql_user> -p < tpcds.sql
</your_mysql_user>

Then for each *.dat file which was generated, do the following (see thispage for details:


LOAD DATA INFILE 'your_DAT_filename'
INTO TABLE table_the_DAT_file_is_for
FIELDS TERMINATED BY '|'
LINES TERMINATED BY '\n'

to load the data. See this previous blog post on how to setup MySql with InnoDB tables for good import performance so you don't spend lots of time waiting just because your database writes to disk after every insert. So this is your quick'n'dirty guide on how to use TPC-DS to generate and load lots of test data. Hopefully someone out there finds this useful. Enjoy.



« Blog Overview