Sunday, December 15, 2013

HIVE with HBASE, No more secrets...

Use the HBaseStorageHandler to register HBase tables with the Hive metastore. You can optionally specify the HBase table as EXTERNAL, in which case Hive will neither create nor drop that table directly; you'll have to use the HBase shell to do so.

CREATE [EXTERNAL] TABLE foo(...)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
TBLPROPERTIES ('hbase.table.name' = 'bar');
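For the EXTERNAL case, the backing HBase table must already exist before you register it in Hive. A sketch of creating it from the HBase shell (the table name 'bar' and column family 'f' match the examples here, but are otherwise arbitrary):

```
hbase shell
hbase(main):001:0> create 'bar', 'f'
```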


Registering the table is only the first step. As part of that registration, you also need to specify a column mapping. This is how you link Hive column names to the HBase table’s rowkey and columns. Do so using the hbase.columns.mapping SerDe property.


CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,f:c2')
TBLPROPERTIES ('hbase.table.name' = 'bar');


The values provided in the mapping property correspond one-for-one, in order, with the column names of the Hive table. HBase column names are fully qualified by column family, and the special token :key represents the rowkey.
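The mapping is purely positional, so columns from different HBase families can sit side by side in one Hive table. A hypothetical variant of the example above, spreading the Hive columns across two families f and g:

```sql
-- rowkey -> :key, a -> f:c1, b -> g:c1 (positional correspondence)
CREATE TABLE foo(rowkey STRING, a STRING, b STRING)
STORED BY 'org.apache.hadoop.hive.hbase.HBaseStorageHandler'
WITH SERDEPROPERTIES ('hbase.columns.mapping' = ':key,f:c1,g:c1')
TBLPROPERTIES ('hbase.table.name' = 'bar');
```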

With the column mappings defined, you can now access HBase data just like you would any other Hive data. Only simple query predicates are currently supported.

SELECT * FROM foo WHERE ...;
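For instance, a simple equality predicate on the mapped key column (using the hypothetical foo table from above; an exact match on the rowkey is the kind of predicate that can be pushed down to HBase):

```sql
-- Look up a single row by its HBase rowkey
SELECT a, b FROM foo WHERE rowkey = 'row1';
```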

You can also populate an HBase table using Hive. This works with both INTO and OVERWRITE clauses.

FROM source_hive_table INSERT INTO TABLE my_hbase_table
SELECT source_hive_table.* WHERE ...;
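The same can be written in the more familiar INSERT ... SELECT order. A sketch (column names are assumptions, following the earlier foo example); note that with an HBase-backed table, OVERWRITE does not truncate the table first, it simply writes puts for the selected rows:

```sql
INSERT OVERWRITE TABLE my_hbase_table
SELECT rowkey, a, b FROM source_hive_table WHERE b IS NOT NULL;
```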



There’s still a little finesse required to get everything wired up properly at runtime. The HBase interaction module is completely optional, so you have to make sure it and its HBase dependencies are available on Hive’s classpath.

The installation environment could do a better job of handling this for users, but for the time being you must manage it yourself. Ideally the hive bin script would detect the presence of HBase and automatically make the necessary CLASSPATH adjustments; that enhancement is tracked in HIVE-2055. The last mile is provided by the distribution itself, which ensures the environment variables are set for hive; that functionality is provided by BIGTOP-955.
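One way to manage it yourself is via HIVE_AUX_JARS_PATH before launching the CLI. A sketch (the jar paths and versions below are assumptions; adjust them to your installation):

```sh
# Put the HBase storage handler and its dependencies on Hive's classpath.
# Paths/versions are illustrative only.
export HIVE_AUX_JARS_PATH=/usr/lib/hive/lib/hive-hbase-handler-0.12.0.jar,\
/usr/lib/hbase/hbase-0.94.6.jar,\
/usr/lib/zookeeper/zookeeper-3.4.5.jar
hive
```

Alternatively, the same jars can be passed with hive's --auxpath option for a single session.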

PROBLEMS GENERALLY FACED

When creating a table with a large number of columns, a very common issue is the limited size of a column in the Hive metastore. To solve this problem, connect to the configured metastore database and increase the size of that column.

FAILED: Error in metadata: javax.jdo.JDODataStoreException: Add request failed : INSERT INTO COLUMNS (SD_ID,COMMENT,"COLUMN_NAME",TYPE_NAME,INTEGER_IDX) VALUES (?,?,?,?,?)
NestedThrowables:
java.sql.SQLDataException: A truncation error was encountered trying to shrink VARCHAR 'struct<prop1:int,prop2:int,prop3:int,prop4:int,prop5:int,pro&' to length 4000.
FAILED: Execution Error, return code 1 from org.apache.hadoop.hive.ql.exec.DDLTask

For a metastore configured with Derby (the default), connect to the Derby database using ij and increase the column length as required.
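Connecting with ij might look like the following (the database path is an assumption; point it at wherever your metastore_db lives, with Hive shut down so the embedded database isn't locked):

```sql
ij> CONNECT 'jdbc:derby:/var/lib/hive/metastore/metastore_db';
```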

ALTER TABLE columns ALTER type_name SET DATA TYPE VARCHAR(8000);