Transferring real-time data from our servers to Hive became a huge issue: our data was JSON, and running JSON serializers and deserializers over it is computationally intensive. We also had trouble finding a straight JSON -> ORC conversion, primarily because our JSON was schemaless while ORC is tied to the schema of a Hive table.
Then came node-avro-io. With it we can impose a schema on our JSON, and the library generates an Avro file that can be loaded into a Hive table. Sounds great? The reason it's so complicated in the first place is that:
- Hive doesn't support inserting complex structures like arrays and maps through plain HQL (Hive Query Language) INSERT statements (see the example after this list).
- JSON SerDe (Serializer/Deserializer) is computationally expensive, since every row has to be deserialized before a Hive job/query can run over it.
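To make the first point concrete: for a table with an array and a struct column (like the simple_users table we will create below), a statement along these lines is rejected by Hive, because, depending on the version, INSERT ... VALUES is either unavailable or explicitly disallows complex-typed columns:

-- Rejected: VALUES cannot populate array/struct columns
INSERT INTO TABLE testdw.simple_users
VALUES ('Vasu', array(1, 2, 3, 4), named_struct('zipcode', 112233, 'city', 'Delhi'));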
This is where Avro plays a huge role: it lets us define a schema for our JS objects and impose it on the data. Avro also supports structures, named structures, maps, and arrays!
Let's create a simple table simple_users(name, marks, address) and define a common schema between Node.js (JS objects) and the Hive table stored in HDFS. Given below is the schema of our table:
{ "type": "record", "name": "userRecord", "namespace": "com.user.record", "fields": [{ "name": "name", "type": "string" }, { "name": "marks", "type": { "type": "array", "items": "int" } }, { "name": "address", "type": { "type": "record", "name": "address_struct", "fields": [{ "name": "zipcode", "type": "int" }, { "name": "city", "type": "string" }] } }] }
Now we will use the exact same schema in our Hive CREATE TABLE statement.
CREATE TABLE testdw.simple_users
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.literal'='{
  "type": "record",
  "name": "userRecord",
  "namespace": "com.user.record",
  "fields": [
    { "name": "name", "type": "string" },
    { "name": "marks", "type": { "type": "array", "items": "int" } },
    { "name": "address", "type": {
        "type": "record",
        "name": "address_struct",
        "fields": [
          { "name": "zipcode", "type": "int" },
          { "name": "city", "type": "string" }
        ]
      }
    }
  ]
}');
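The AvroSerDe derives the Hive columns from the embedded schema, so DESCRIBE testdw.simple_users should report name string, marks array<int>, and address struct<zipcode:int,city:string>. If you'd rather not paste the JSON into every DDL statement, the SerDe also accepts avro.schema.url and reads the schema from a file instead (the HDFS path below is just a placeholder):

CREATE TABLE testdw.simple_users
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS
  INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///user/hduser/schemas/userRecord.avsc');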
Once your complex Avro-schema-based table is ready, we can take the same schema into our Node.js code and impose it on our JS objects.
// Create an Avro file container and impose the same schema we used for the Hive table
var avro = require('node-avro-io').DataFile.AvroFile();

var schema = {
    "type": "record",
    "name": "userRecord",
    "namespace": "com.user.record",
    "fields": [
        { "name": "name", "type": "string" },
        { "name": "marks", "type": { "type": "array", "items": "int" } },
        { "name": "address", "type": {
            "type": "record",
            "name": "address_struct",
            "fields": [
                { "name": "zipcode", "type": "int" },
                { "name": "city", "type": "string" }
            ]
        } }
    ]
};

// Open a deflate-compressed Avro container for writing
var writer = avro.open("test-output.avro", schema, { flags: 'w', codec: 'deflate' });

// Write a record that matches the schema and close the file
writer.end({ name: "Vasu", marks: [1, 2, 3, 4], address: { city: "Delhi", zipcode: 112233 } });
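In a real pipeline you would wrap this in a small helper so each batch of events ends up in its own Avro file, and so you know when the file is fully flushed before uploading it. A minimal sketch, assuming the writer returned by node-avro-io behaves like a standard Node writable stream (the 'error'/'close' handling is an assumption, not something taken from the library's docs):

// Hypothetical helper: write one record to an Avro file and invoke the
// callback once the file has been flushed to disk.
function writeAvroFile(path, schema, record, callback) {
    var avroFile = require('node-avro-io').DataFile.AvroFile();
    var writer = avroFile.open(path, schema, { flags: 'w', codec: 'deflate' });

    // Assumption: the writer is a Node writable stream, so these events exist
    writer.on('error', callback);
    writer.on('close', function () { callback(null, path); });

    writer.end(record);
}

// Usage
writeAvroFile('test-output.avro', schema, {
    name: 'Vasu',
    marks: [1, 2, 3, 4],
    address: { city: 'Delhi', zipcode: 112233 }
}, function (err, file) {
    if (err) return console.error('Avro write failed:', err);
    console.log('Wrote', file);
});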
Now we can either copy test-output.avro to the machine running the Hive client, or upload it to HDFS (node-webhdfs can handle this; see the sketch after the query). Then use the HQL below to load the Avro file into the data warehouse:
LOAD DATA LOCAL INPATH '/home/hduser/test-output.avro' OVERWRITE INTO TABLE testdw.simple_users;
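The LOCAL keyword tells Hive to read the file from the local filesystem of the machine issuing the statement; if you uploaded the file to HDFS instead, drop LOCAL and point at the HDFS path, e.g. LOAD DATA INPATH '/user/hduser/test-output.avro' OVERWRITE INTO TABLE testdw.simple_users;. For the upload itself, here is a minimal sketch using the webhdfs npm package (a close relative of the node-webhdfs module mentioned above; the API usage follows its README, and the host, port, and paths are placeholders):

var fs = require('fs');
var WebHDFS = require('webhdfs');

// Connect to the NameNode's WebHDFS endpoint (all values are placeholders)
var hdfs = WebHDFS.createClient({
    user: 'hduser',
    host: 'namenode.example.com',
    port: 50070,
    path: '/webhdfs/v1'
});

// Stream the local Avro file into HDFS
var localFileStream = fs.createReadStream('test-output.avro');
var remoteFileStream = hdfs.createWriteStream('/user/hduser/test-output.avro');

localFileStream.pipe(remoteFileStream);

remoteFileStream.on('error', function (err) {
    console.error('Upload failed:', err);
});
remoteFileStream.on('finish', function () {
    console.log('test-output.avro is now in HDFS');
});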
After tons of research, this seems to be close to an optimal solution for getting complex data structures into your data warehouse. To summarize, the entire end-to-end process between Node and Hadoop is:
- Generate your JS object
- Impose your schema and generate an Avro file
- Upload the Avro file to HDFS
- Make a JDBC connection to Hive
- Create a temporary Avro-backed table and load the file into it
- Transfer the contents of the temporary table into a more optimized ORC-based Hive table (see the sketch after this list)
- Close the JDBC connection
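The last two data steps are plain HQL issued over that JDBC connection. A minimal sketch, assuming a target table named simple_users_orc (the name is just an illustration; the Avro-backed simple_users table acts as the temporary staging table):

-- One-time setup: an ORC copy of the table for faster analytical queries
CREATE TABLE IF NOT EXISTS testdw.simple_users_orc (
  name    string,
  marks   array<int>,
  address struct<zipcode:int, city:string>
)
STORED AS ORC;

-- After each load into the Avro staging table, copy its contents across;
-- INSERT ... SELECT handles complex columns, unlike INSERT ... VALUES
INSERT INTO TABLE testdw.simple_users_orc
SELECT name, marks, address FROM testdw.simple_users;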
Unfortunately, this workaround is necessary for as long as Hive's INSERT INTO doesn't support complex structures.