Wednesday, April 25, 2012

MySQL Cluster: How to load it with data


After you have set up and properly configured MySQL Cluster, you typically want to load data into it. There are a couple of ways to do this, e.g., from dump files or from CSV files.

If you expect loading data into MySQL Cluster to be as fast as loading into MyISAM, then you have the wrong expectations, unless you parallelize the data loading.

Also, MyISAM/InnoDB store the data in local buffers/on local disk, and in one copy only (asynchronous replication does not count, as it is asynchronous), whereas MySQL Cluster (the NDBCLUSTER engine) stores two copies of the data. Synchronous replication between the two copies adds roughly 2x overhead, and there is a network hop between the MySQL servers and the data nodes.

So loading a dump file into MySQL Cluster is bound to be slower than into MyISAM/InnoDB. Loading data files can also be error prone.

Here is what we at Severalnines usually do to avoid trial and error with configuration options, and to make sure we don't need X attempts to load a Y GB dump file. We want to do it once and get it right the first time, simply because multiple iterations take time:
  • Load the schema into MyISAM/InnoDB on one of the MySQL servers connected to the data nodes. Please note that your data is NOT yet stored in the NDBCLUSTER storage engine and at this point is only accessible from the MySQL server where you imported the MyISAM/InnoDB tables.
  • Check for redundant indexes and remove them to reduce the storage footprint. Some ORMs have a tendency to add redundant indexes. Personally, I think it is faster to make these transformations while the tables are in MyISAM/InnoDB rather than in NDBCLUSTER, especially if there is data in the tables.
    You can use Percona's pt-duplicate-key-checker (wget percona.com/get/pt-duplicate-key-checker); see the example run at the end of this post. It makes it dead easy to find and drop redundant indexes. Removing redundant indexes is also good because writes will be faster (fewer indexes to update) and it reduces the risk of the optimizer picking the wrong index.
  • Check data types: do you have BLOB/TEXT where you could really use VARBINARY/VARCHAR? BLOBs/TEXTs can slow down performance by a factor of 2-3x (at least), and they are overused in many applications. See the query sketch after this list for finding candidate columns.
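To get a quick inventory of BLOB/TEXT columns worth reviewing, you can query information_schema. A minimal sketch (the schema name mydb is hypothetical):

SELECT table_name, column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'mydb'  -- replace with your schema
  AND data_type IN ('tinytext', 'text', 'mediumtext', 'longtext',
                    'tinyblob', 'blob', 'mediumblob', 'longblob');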
When you have washed your schema, you can simply do:

ALTER TABLE tablename ENGINE=ndbcluster;

If you get an error message such as:

mysql> alter table events  engine=ndbcluster;
ERROR 1297 (HY000): Got temporary error 1217 'Out of operation records in local data manager (increase MaxNoOfLocalOperations)' from NDBCLUSTER

then you can change ndb_batch_size (a lower batch size means the ALTER from MyISAM/InnoDB to NDBCLUSTER takes longer, because more round trips are needed between the MySQL server and the data nodes):

mysql> set  ndb_batch_size=8*1024*1024;
Query OK, 0 rows affected (0.00 sec)

mysql> alter table events  engine=ndbcluster;
Query OK, 67748 rows affected (4.00 sec)
Records: 67748  Duplicates: 0  Warnings: 0

32MB is the max value for ndb_batch_size.
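You can check which value is in effect for your session before running the ALTER:

mysql> SHOW SESSION VARIABLES LIKE 'ndb_batch_size';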

Now, to further increase the speed of the ALTERs, you can perform two or more of them in parallel from two or more connections:
mysql conn 1 > alter table events     engine=ndbcluster;
mysql conn 2 > alter table employees  engine=ndbcluster;
mysql conn 3 > alter table randomdata engine=ndbcluster;
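
If you are scripting the migration, one way to get this parallelism is to launch the ALTERs as background jobs from the shell. A sketch (host, credentials, and table names are hypothetical; you can also point the connections at different SQL nodes to spread the load):

# run the ALTERs in parallel as background jobs, then wait for all of them
mysql --host=mysqld1 --user=cmon --password=xxxxxxx mydb -e "ALTER TABLE events ENGINE=ndbcluster;" &
mysql --host=mysqld1 --user=cmon --password=xxxxxxx mydb -e "ALTER TABLE employees ENGINE=ndbcluster;" &
mysql --host=mysqld1 --user=cmon --password=xxxxxxx mydb -e "ALTER TABLE randomdata ENGINE=ndbcluster;" &
wait  # block until all background ALTERs have finished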

Finally do:
ANALYZE TABLE tablename;

on all tables to build index statistics (MySQL Cluster 7.2 will love you for this and you will hate MySQL Cluster 7.2 if you don't).
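If you have many tables, mysqlcheck (which ships with the MySQL server) can analyze all tables in a database in one go; host and credentials here are hypothetical:

mysqlcheck --analyze --host=mysqld1 --user=cmon --password=xxxxxxx mydb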

However, if you run too many ALTERs in parallel, you will most likely run into problems such as:
  • Redo log files are overloaded (your redo log is too small and your disks are too slow to handle the write load generated by the ALTER).
  • Out of redo buffer (your disks are too slow to handle the write load generated by the ALTER). Increasing the RedoBuffer will probably not help, as your disk subsystem is too slow and cannot "consume" what you "produce".
  • Out of operation records (increase MaxNoOfConcurrentOperations or MaxNoOfLocalOperations). At Severalnines we set this to 100000 by default. You can of course increase it, but it costs memory. If you increase it, change MaxNoOfConcurrentOperations; MaxNoOfLocalOperations will then automatically be set to 1.1 x the value of MaxNoOfConcurrentOperations. With MaxNoOfConcurrentOperations=100000 this takes 100MB of RAM on each data node; MaxNoOfConcurrentOperations=200000 takes 200MB of RAM on each data node. See the config.ini sketch after this list.
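For reference, these parameters live in the [ndbd default] section of config.ini on the management server. A minimal sketch with hypothetical values (tune them to your own hardware and workload):

[ndbd default]
# operation records cost RAM: ~100MB per data node at 100000 records
MaxNoOfConcurrentOperations=100000
# buffer in front of the redo log; raising it only helps if the disks can keep up
RedoBuffer=32M
# total redo log size = NoOfFragmentLogFiles x 4 x FragmentLogFileSize
NoOfFragmentLogFiles=16
FragmentLogFileSize=16M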
If you have a CSV file, you can do the same: load the data into MyISAM/InnoDB tables first and then ALTER.
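
Loading the CSV file itself is standard LOAD DATA INFILE; a sketch with hypothetical file and table names:

LOAD DATA INFILE '/tmp/events.csv'
  INTO TABLE events
  FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
  LINES TERMINATED BY '\n';
ALTER TABLE events ENGINE=ndbcluster;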

Why load the data into MyISAM/InnoDB first?
Well, I think it makes things easier and gives better control. If you have a 30GB dump file and many tables to load, it is easier to track errors and to tune the ndb_batch_size parameter so that the load actually completes without errors. That is, you will save time overall, even though you spend some extra time first loading the data into MyISAM/InnoDB.

Example run of ./pt-duplicate-key-checker:

./pt-duplicate-key-checker  --host=ubuntu02 --user=cmon --password=xxxxxxx --databases=test2 --tables=t1
# ##################################################################
# test2.t1                                                                
# ##################################################################
# a is a duplicate of PRIMARY
# Key definitions:
#   KEY `a` (`a`)
#   PRIMARY KEY (`a`),
# Column types:
#  `a` int(11) not null
# To remove this duplicate index, execute:
ALTER TABLE `test2`.`t1` DROP INDEX `a`;
# ##################################################################
# Summary of indexes                                                      
# ##################################################################

# Size Duplicate Indexes   0
# Total Duplicate Indexes  1
# Total Indexes            2

4 comments:

Jacky Shu said...

I'd use pt-find to alter tables when you have lots of tables to convert into ndb.

wwwted said...

Great post, very useful!

I have a question with regards to: "Now, to further increase the speed of the ALTERs, you can perform two or more of them in parallel from two or more connections". My question is - will not all schema operations be serialized inside NDB, or am I missing something here?

Kindest Regards,
Ted

Daniël van Eeden said...

So the 'helpful' comment is wrong in many cases?

joe said...

My god! The ALTER TABLE {table} ENGINE=ndbcluster; takes forever and a day... 2 HOURS!

The data nodes in my cluster are running at 220Mbps on gigabit iSCSI... so at least I know it's working... :)