Splice Machine

Today I will tell you about a startup called “Splice Machine”. They position themselves as “The Only Hadoop RDBMS”, which is quite a bold claim given the current boom in the SQL-on-Hadoop field: almost every big vendor has implemented its own “one and only” solution and claims it to be the best. But let’s take a look at its internals to see whether the design really lives up to the marketing slogan they’ve chosen.

Internally, it uses Apache HBase for data storage and a modified Apache Derby query engine to provide the SQL interface. Here’s the diagram from their official web site:

Splice Machine General

It gives a general understanding that their engine is embedded into HBase and distributed across the cluster with no single master node (except the HBase Master, of course). By “Splice Executor” they must mean HBase RegionServers. This design looks good, but the devil is in the details, so let’s take a closer look at this solution. If you want to skip the technical discussion, you can go directly to the summary at the end of the article.

The data is stored in HBase tables. The whole table row is packed into a single column of an HBase table in a binary representation. When you update a table row, the whole row is rewritten in the HBase table by creating a new version of it. Here’s an example of what it looks like:

splice> create table test (id int, value1 varchar(100), value2 varchar(100));
0 rows inserted/updated/deleted
splice> show tables;
TABLE_SCHEM |TABLE_NAME |CONGLOM_ID|REMARKS
------------------------------------------------------------

APP |TEST |1184 |
hbase(main):002:0> describe '1184'
DESCRIPTION
'1184', {NAME => 'V', ENCODE_ON_DISK => 'true', BLOOMFILTER => 'ROW', VERSIONS => '2147483647', IN_MEMORY => 'false', KEEP_DELETED_CELLS => 'false', DATA_BLOCK_ENCODING => 'NONE', TTL => '2147483647', COMPRESSION => 'NONE', MIN_VERSIONS => '0', BLOCKCACHE => 'true', BLOCKSIZE => '65536', REPLICATION_SCOPE => '0'}
1 row(s) in 0.9570 seconds
hbase(main):003:0> scan '1184'
ROW COLUMN+CELL
0 row(s) in 0.0810 seconds
splice> insert into test (id, value1, value2) values (1, 'foo', 'bar');
1 row inserted/updated/deleted
hbase(main):004:0> scan '1184'
ROW COLUMN+CELL
@t\x91}<\xD9p\x01 column=V:7, timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct
1 row(s) in 0.0700 seconds
splice> insert into test (id, value1, value2) values (2, 'bar', 'foo');
1 row inserted/updated/deleted
hbase(main):005:0> scan '1184'
ROW COLUMN+CELL
@t\x91}<\xD9p\x01 column=V:7, timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct
p\xB4\x91}N\x81P\x01 column=V:0, timestamp=48, value=\x00\x00\x00\x00\x00\x00\x001
p\xB4\x91}N\x81P\x01 column=V:7, timestamp=48, value=\x8F\x90\x00\x82\x00dct\x00hqq
2 row(s) in 0.0560 seconds
splice> insert into test (id, value1, value2) values (3, 'test', 'test');
1 row inserted/updated/deleted
hbase(main):006:0> scan '1184'
ROW COLUMN+CELL
@t\x91}<\xD9p\x01 column=V:7, timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct
p\xB4\x91}N\x81P\x01 column=V:0, timestamp=48, value=\x00\x00\x00\x00\x00\x00\x001
p\xB4\x91}N\x81P\x01 column=V:7, timestamp=48, value=\x8F\x90\x00\x82\x00dct\x00hqq
\xA0\xF4\x91}[~\xE0\x01 column=V:0, timestamp=51, value=\x00\x00\x00\x00\x00\x00\x004
\xA0\xF4\x91}[~\xE0\x01 column=V:7, timestamp=51, value=\x8F\x90\x00\x83\x00vguv\x00vguv
3 row(s) in 0.0470 seconds
splice> update test set id=4 where id=1;
1 row inserted/updated/deleted
hbase(main):007:0> scan '1184'
ROW COLUMN+CELL
@t\x91}<\xD9p\x01 column=V:0, timestamp=60, value=\x00\x00\x00\x00\x00\x00\x00=
@t\x91}<\xD9p\x01 column=V:7, timestamp=60, value=\x8E\x00\x84
p\xB4\x91}N\x81P\x01 column=V:0, timestamp=48, value=\x00\x00\x00\x00\x00\x00\x001
p\xB4\x91}N\x81P\x01 column=V:7, timestamp=48, value=\x8F\x90\x00\x82\x00dct\x00hqq
\xA0\xF4\x91}[~\xE0\x01 column=V:0, timestamp=51, value=\x00\x00\x00\x00\x00\x00\x004
\xA0\xF4\x91}[~\xE0\x01 column=V:7, timestamp=51, value=\x8F\x90\x00\x83\x00vguv\x00vguv
hbase(main):017:0> get '1184', "@t\x91}<\xD9p\x01", {COLUMN => 'V', VERSIONS => 100}
COLUMN CELL
V:0 timestamp=60, value=\x00\x00\x00\x00\x00\x00\x00=
V:7 timestamp=60, value=\x8E\x00\x84
V:7 timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct
3 row(s) in 0.0120 seconds
splice> update test set id=5 where id=4;
1 row inserted/updated/deleted
hbase(main):018:0> get '1184', "@t\x91}<\xD9p\x01", {COLUMN => 'V', VERSIONS => 100}
COLUMN CELL
V:0 timestamp=63, value=\x00\x00\x00\x00\x00\x00\x00@
V:0 timestamp=60, value=\x00\x00\x00\x00\x00\x00\x00=
V:7 timestamp=63, value=\x8E\x00\x85
V:7 timestamp=60, value=\x8E\x00\x84
V:7 timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct
splice> update test set id=6,value1='changed',value2='changed' where id=5;
1 row inserted/updated/deleted
hbase(main):019:0> get '1184', "@t\x91}<\xD9p\x01", {COLUMN => 'V', VERSIONS => 100}
COLUMN CELL
V:0 timestamp=66, value=\x00\x00\x00\x00\x00\x00\x00C
V:0 timestamp=63, value=\x00\x00\x00\x00\x00\x00\x00@
V:0 timestamp=60, value=\x00\x00\x00\x00\x00\x00\x00=
V:7 timestamp=66, value=\x8F\x90\x00\x86\x00ejcpigf\x00ejcpigf
V:7 timestamp=63, value=\x8E\x00\x85
V:7 timestamp=60, value=\x8E\x00\x84
V:7 timestamp=45, value=\x8F\x90\x00\x81\x00hqq\x00dct

This lets me assume that the data is stored as multiple versions of the same row in HBase. On insert, the binary value we wrote was large, while when we updated 1 column out of 3 the data written to the table was really small. This suggests that column “V:7” stores the changes applied to the row (a delta), while “V:0” seems to hold information related to the transaction ID.
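To make this deduction concrete, here is a minimal Python sketch of such a versioning scheme: an insert writes the full row, an update writes only a delta, and the current row is reconstructed by replaying versions in order. The function names and data structures are mine for illustration, not Splice Machine's actual binary format.

```python
# Hypothetical sketch: a logical row stored as a chain of cell versions,
# where an INSERT writes the full row and an UPDATE writes only the
# changed columns (a delta), as the scan output above suggests.

def write_insert(versions, txn_id, row):
    """Full-row write: the first version carries every column."""
    versions.append({"txn": txn_id, "delta": dict(row)})

def write_update(versions, txn_id, changed_columns):
    """Delta write: later versions carry only the changed columns."""
    versions.append({"txn": txn_id, "delta": dict(changed_columns)})

def materialize(versions):
    """Reconstruct the current row by replaying deltas oldest-to-newest."""
    row = {}
    for v in versions:
        row.update(v["delta"])
    return row

versions = []
write_insert(versions, txn_id=45, row={"id": 1, "value1": "foo", "value2": "bar"})
write_update(versions, txn_id=60, changed_columns={"id": 4})  # small delta, like V:7 at ts=60
write_update(versions, txn_id=63, changed_columns={"id": 5})  # small delta, like V:7 at ts=63
print(materialize(versions))  # {'id': 5, 'value1': 'foo', 'value2': 'bar'}
```

Note how the two updates cost only a few bytes each, matching the tiny `\x8E\x00\x84`-style cells seen in the scan, while the insert carries the whole row.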

OK, what about transactions? The transaction isolation level is not “serializable”, which can easily be tested with two concurrent sessions:

-- session 1
splice> autocommit off;
splice> select * from test where id=9;
splice> update test set value1='trans' where id=2;
splice> commit;

-- session 2 (running concurrently)
splice> autocommit off;
splice> select * from test where id=2;
splice> update test set value1='trans' where id=9;
splice> commit;

Both have finished successfully, which means that there are no read locks in Splice Machine. This is good in terms of performance and resource utilization.

-- session 1
splice> update test set value1='transaction' where id=2;
1 row inserted/updated/deleted
-- session 2 (before session 1 commits)
splice> update test set value1='transaction' where id=2;
ERROR 42Z80: serializable
ERROR XJ001: Java exception: 'null: com.splicemachine.si.impl.WriteConflict'.

This means that they use optimistic transactions instead of locking, and handling write-write conflicts is the responsibility of the application developer. Information about transactions is stored in a separate HBase table.
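A first-committer-wins check of this kind can be sketched in a few lines of Python. The `Store` class and timestamp logic below are my simplification of generic optimistic MVCC, not Splice Machine's actual implementation:

```python
# Hypothetical sketch of optimistic (first-committer-wins) conflict
# detection: a write commits only if no other transaction has committed
# a write to the same row since this transaction began. This mirrors
# the WriteConflict error shown above.

class WriteConflict(Exception):
    pass

class Store:
    def __init__(self):
        self.last_write_ts = {}  # row key -> commit timestamp of last writer
        self.clock = 0           # logical timestamp counter

    def begin(self):
        self.clock += 1
        return self.clock        # transaction's start timestamp

    def commit_write(self, start_ts, key):
        # Conflict if someone committed a write to this key after we started.
        if self.last_write_ts.get(key, 0) > start_ts:
            raise WriteConflict(key)
        self.clock += 1
        self.last_write_ts[key] = self.clock

store = Store()
t1 = store.begin()
t2 = store.begin()
store.commit_write(t1, "row-2")      # first writer wins
try:
    store.commit_write(t2, "row-2")  # concurrent second writer conflicts
except WriteConflict:
    print("write-write conflict detected")
```

Note that neither transaction ever blocks; the loser simply fails at write time, exactly as the second session's UPDATE failed above, and it is up to the application to retry.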
Now let’s take a look at how the indexes work. First, let’s create an index:

splice> create index test_idx on test(id);
0 rows inserted/updated/deleted
Now let’s see what is inside:
splice> show indexes;
TABLE_SCHEM |TABLE_NAME |INDEX_NAME |COLUMN_NAME |ORDINAL&|NON_UNIQUE|TYPE |ASC&|CONGLOM_NO
------------------------------------------------------------------------------------------
APP |TEST |TEST_IDX |ID |1 |true |BTREE|A |1201

This means that a separate HBase table was created to hold the index:

hbase(main):002:0> scan '1201'
ROW COLUMN+CELL
\x82\x00\xB8\xAD\x92\x97\xEA\xBA\x82\xD0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xB8\xAD\x92\x97\xEA\xBA\x82\xD0\x80\xC0
\x83\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0
\x89\x00\xA0\x9D\x92\x97\xE9\xF3\xB2\xF0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xA0\x9D\x92\x97\xE9\xF3\xB2\xF0\x80\xC0
3 row(s) in 0.1410 seconds
splice> update test set id=4 where id=3;
1 row inserted/updated/deleted
hbase(main):003:0> scan '1201'
ROW COLUMN+CELL
\x82\x00\xB8\xAD\x92\x97\xEA\xBA\x82\xD0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xB8\xAD\x92\x97\xEA\xBA\x82\xD0\x80\xC0
\x83\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:0, timestamp=101, value=\x00\x00\x00\x00\x00\x00\x00f
\x83\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:1, timestamp=101, value=
\x83\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0
\x84\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:0, timestamp=101, value=\x00\x00\x00\x00\x00\x00\x00f
\x84\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0 column=V:7, timestamp=101, value=\x84\x00\xD0\xBD\x92\x97\xEA\xED\xFD\xE0\x80\xC0
\x89\x00\xA0\x9D\x92\x97\xE9\xF3\xB2\xF0\x80\xC0 column=V:7, timestamp=99, value=\x84\x00\xA0\x9D\x92\x97\xE9\xF3\xB2\xF0\x80\xC0

So each index occupies a separate table and contains a mapping from the indexed field value to the row key. To use the index, you first query the index table to get the list of matching row keys, and only then fetch those rows from the data table itself. Quite straightforward.
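The two-step lookup can be modeled with two plain dictionaries standing in for the index table and the data table; the row keys and table contents below are made up for illustration:

```python
# Toy model of a secondary index stored as a separate key-value table:
# index table maps indexed value -> data-table row key,
# data table maps row key -> row.

data_table = {
    b"rk1": {"id": 2, "value1": "bar", "value2": "foo"},
    b"rk2": {"id": 3, "value1": "test", "value2": "test"},
}
index_table = {  # index on "id"
    2: b"rk1",
    3: b"rk2",
}

def select_by_id(id_value):
    """SELECT * FROM test WHERE id = ? via the secondary index."""
    row_key = index_table.get(id_value)   # step 1: probe the index table
    if row_key is None:
        return None
    return data_table[row_key]            # step 2: fetch from the data table

print(select_by_id(3))  # {'id': 3, 'value1': 'test', 'value2': 'test'}
```

The cost of this design is visible in the scans above: every indexed read is two key-value lookups, and every write to an indexed column must update both tables.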

Reading the documentation, you will find many places where the developers advise avoiding complex queries and specifying hints to force a particular join order, which lets me reasonably assume that the engine is not yet fully mature.

Given vs Developed

OK, so what does this solution take from the open-source community, and what do they develop themselves?

Given:

  • HDFS storage level and its redundancy. They do not have to care about data availability and replication;
  • HBase engine. They get a key-value store that is redundant and fast, so again they do not have to care much about the data storage layer;
  • Apache Derby SQL engine. This is an open-source embedded DBMS that gives them a query parser and a simple query executor, which by default works with files on the local FS.

Developed:

  • Transaction management. By default HBase does not support cross-table transactions, which is clearly a limitation for a solution claiming to be an RDBMS. To implement this they went the typical RDBMS way, implementing MVCC with deltas stored as separate row versions, plus a separate SPLICE_TXN table responsible for global transaction management;
  • Distributed query optimizer. Maintaining table statistics, reordering joins based on those statistics plus information about table distribution, query decorrelation, and other complex tasks. It usually takes years to implement a good query optimizer for a distributed system;
  • Distributed query engine. By default Derby lets you work within a single HBase RegionServer, but what if you need to join two tables that are not collocated? You first have to redistribute them so they are collocated and only then perform the join, which can be really tricky, especially for complicated queries;
  • GUI client, monitoring tools, ODBC/JDBC drivers, etc.
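To illustrate the redistribution problem from the distributed query engine bullet, here is a toy Python sketch that hash-partitions both sides of a join on the join key so matching rows land in the same partition, then joins each partition locally. The partition count, table names, and contents are invented for the example:

```python
# Toy model of a distributed join over non-collocated tables:
# step 1 - repartition both tables by hash of the join key (the
#          expensive network shuffle in a real cluster);
# step 2 - run an ordinary hash join inside each partition.

N_PARTITIONS = 4

def repartition(rows, key_fn):
    """Route each row to a partition by hash of its join key."""
    parts = [[] for _ in range(N_PARTITIONS)]
    for row in rows:
        parts[hash(key_fn(row)) % N_PARTITIONS].append(row)
    return parts

def local_hash_join(left, right, key_fn):
    """Plain in-memory hash join within one partition."""
    index = {}
    for row in left:
        index.setdefault(key_fn(row), []).append(row)
    return [(l, r) for r in right for l in index.get(key_fn(r), [])]

orders = [{"order_id": 1, "cust": 10}, {"order_id": 2, "cust": 20}]
customers = [{"cust": 10, "name": "foo"}, {"cust": 20, "name": "bar"}]

left_parts = repartition(orders, lambda r: r["cust"])
right_parts = repartition(customers, lambda r: r["cust"])
joined = []
for lp, rp in zip(left_parts, right_parts):  # each partition joins locally
    joined.extend(local_hash_join(lp, rp, lambda r: r["cust"]))
print(len(joined))  # 2 matched pairs
```

In a real cluster, step 1 means moving table data between RegionServers over the network before any join work can start, which is exactly why non-collocated joins are the hard part of a distributed query engine.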

Of course, this is only the tip of the iceberg; in practice there are many more complex problems to face in building a really good solution.

Summary

In summary, I will give the pros and cons of this solution in my opinion, and a guess at which market they target.

Pros

  • As a solution based on the HBase key-value store, this is almost the only option suitable for OLTP processing on top of Hadoop;
  • Splice Machine is integrated with the most popular enterprise Hadoop distributions, which means it can be easily adopted in the infrastructure of enterprises that already use Hadoop;
  • Using open-source technologies as the basis gives hope that the solution will integrate well with them; we should expect Pig, Hive, and MapReduce connectors to read Splice Machine data and process it outside of HBase, which would be a good option for many customers;
  • Splice Machine recently raised $15M in funding, which gives hope that the solution will finally hit the market.

Cons

  • HBase is good for OLTP processing, but OLAP usually needs full table scans, and this is where you will run into problems. HBase stores data in HFiles, in sorted order, as they are flushed from the Memstore. To read a table, you have to merge a number of HFiles and then merge the result with the transactional information stored in a separate table, which won’t be as efficient as a simple table scan in other SQL-on-Hadoop solutions. I would guess its full-table-scan performance is roughly 3x slower than other SQL-on-Hadoop solutions;
  • The distributed query engine and planner are still immature, and optimization is mostly left to the developer, which greatly increases the implementation cost of the solution;
  • Bulk inserts of data. If your input stream is intensive, you will surely face performance problems. When you load data into HBase, it first goes to the Memstore, and when the Memstore fills up the data is flushed to HDFS. That part is fine; the problem occurs when there are too many flushed files and HBase starts compacting them while you continue writing, which may finally cause HBase to stop accepting writes even though there is plenty of free memory. HBase has about 10-15 different settings responsible for memory utilization and compaction, and choosing them correctly is usually problematic, and almost impossible if the solution is used for ad-hoc analysis;
  • The solution is still in beta, and has been for more than a year.
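To illustrate the full-scan concern from the first bullet, here is a toy Python sketch of the k-way merge a scan has to perform over sorted HFiles, keeping only the newest version of each row key. The file contents are invented; real HBase does this in its native read path, but the extra per-row work compared to a flat columnar scan is the same in spirit:

```python
# Toy model of an HBase-style scan: several sorted HFiles (flushed at
# different times) must be k-way merged, and only the newest version of
# each row key is kept.
import heapq

# Each HFile: a list of (row_key, timestamp, value), sorted by row key.
# Newer flushes may shadow older cells for the same key.
hfile1 = [(b"a", 45, "v1"), (b"c", 51, "v1")]
hfile2 = [(b"a", 60, "v2"), (b"b", 48, "v1")]

def scan(hfiles):
    # Merge by (key, descending timestamp) so the newest cell per key
    # comes first; each input is already sorted this way.
    merged = heapq.merge(*hfiles, key=lambda c: (c[0], -c[1]))
    seen = set()
    for key, ts, value in merged:
        if key not in seen:  # first cell per key is the newest version
            seen.add(key)
            yield key, value

print(list(scan([hfile1, hfile2])))
# [(b'a', 'v2'), (b'b', 'v1'), (b'c', 'v1')]
```

And this sketch does not even include the additional merge with transactional visibility information that Splice Machine needs on top, which is why I expect full scans to lag behind scan-oriented SQL-on-Hadoop engines.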

Potential Users

  • Startups building their infrastructure from scratch that want to try something new;
  • Companies building specialized systems (not general-purpose and not ad-hoc) that need OLTP processing of huge amounts of data and also want to run some analytics on top of it;
  • Enterprises adopting SQL-on-Hadoop; but they won’t be ready for this solution for at least 1 – 1.5 more years, unless it becomes more mature sooner.
