So I am now at a point in the development of Xekmypic where I must start sending data to Hadoop. I already have a bot that simulates most end-user actions against the client API. This API will be modified to send data to Hadoop when certain events occur.
But the sending process is not straightforward. The plan is to publish a message with the event data onto a queue on a RabbitMQ server. Then a custom Java application will pick up messages from the queues and call the HDFS APIs to append to the file currently being written.
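To make the publish step concrete, here is a minimal sketch of what the API side might produce. The event field names are placeholders I made up, and an in-memory `BlockingQueue` stands in for the broker: with the real RabbitMQ Java client, the `put()` would instead be a `Channel.basicPublish()` call against a declared queue.

```java
import java.nio.charset.StandardCharsets;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

// Sketch of the publish side: serialize an event and hand it off.
// The BlockingQueue is a stand-in for a RabbitMQ channel.
public class EventPublisher {
    // Hypothetical payload format; field names are illustrative only.
    static String serialize(String eventType, String userId, long timestampMs) {
        return String.format("{\"event\":\"%s\",\"user\":\"%s\",\"ts\":%d}",
                eventType, userId, timestampMs);
    }

    public static void main(String[] args) throws InterruptedException {
        BlockingQueue<byte[]> queue = new ArrayBlockingQueue<>(1024);
        String payload = serialize("page_view", "user-42", 1700000000000L);
        // With the real client this would be:
        // channel.basicPublish("", "events", null, payload.getBytes(...));
        queue.put(payload.getBytes(StandardCharsets.UTF_8));
        System.out.println("queued " + queue.size() + " message(s): " + payload);
        // prints: queued 1 message(s): {"event":"page_view","user":"user-42","ts":1700000000000}
    }
}
```

Keeping the payload small and self-describing like this means the consumer can decide where in HDFS each event belongs without any extra lookups.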
There are a lot of implications for this consumer application. My current idea is to have a pool of consumers that block on the receipt of messages. When a message arrives, a thread is spawned with the code that writes the event data to HDFS. Some questions arise here. Is the append process in HDFS thread-safe? Can I open a connection to HDFS for filesystem operations and leave it open for long periods of time? I need some hierarchy of folders in HDFS so that I don't end up with too many files in a single directory. Should I start appending right from the beginning of a file, or should I write to the local filesystem up to a specific size and then append that to the HDFS file?
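The consumer-pool idea above can be sketched with plain JDK pieces. This is a simulation, not the real thing: the local filesystem stands in for HDFS, and a `BlockingQueue` stands in for the RabbitMQ subscription. It does illustrate two of the points raised: a date-based folder hierarchy to keep directories small, and the fact that appends to one file have to be serialized somehow. HDFS append in particular allows only a single writer per file, so the real version would likely route each file to exactly one worker rather than using a lock as done here.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.time.LocalDate;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of the consumer pool: workers block on a queue and append each
// event to a date-partitioned file. Local FS stands in for HDFS.
public class ConsumerPool {
    // Date-based hierarchy keeps any single directory small:
    // root/yyyy/MM/dd/
    static Path partitionFor(Path root, LocalDate date) {
        return root.resolve(String.format("%04d/%02d/%02d",
                date.getYear(), date.getMonthValue(), date.getDayOfMonth()));
    }

    public static void main(String[] args) throws Exception {
        BlockingQueue<String> queue = new LinkedBlockingQueue<>();
        Path dir = partitionFor(Files.createTempDirectory("events"),
                LocalDate.of(2024, 3, 9));
        Files.createDirectories(dir);
        Path file = dir.resolve("events.log");

        ExecutorService pool = Executors.newFixedThreadPool(4);
        for (int i = 0; i < 4; i++) {
            pool.submit(() -> {
                try {
                    while (true) {
                        String event = queue.take(); // blocks until a message arrives
                        // Appends to one file must be serialized; with real
                        // HDFS append there is a single-writer lease instead.
                        synchronized (ConsumerPool.class) {
                            Files.write(file,
                                    (event + "\n").getBytes(StandardCharsets.UTF_8),
                                    StandardOpenOption.CREATE,
                                    StandardOpenOption.APPEND);
                        }
                    }
                } catch (InterruptedException | IOException e) {
                    Thread.currentThread().interrupt();
                }
            });
        }

        for (int i = 0; i < 10; i++) queue.put("event-" + i);
        // Crude drain-then-stop, for the demo only.
        while (!queue.isEmpty()) Thread.sleep(10);
        Thread.sleep(200);
        pool.shutdownNow();
        System.out.println("lines written: " + Files.readAllLines(file).size());
    }
}
```

On the buffering question: this per-event append pattern is exactly what gets expensive on HDFS, which prefers large sequential writes, so buffering locally (or in memory) and flushing in bigger chunks is the direction I am leaning toward, though I have not settled it yet.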