Thursday, December 09, 2010

Small issue with Hadoop 0.20.2 and Hive 0.6

There is a small configuration problem when using Hadoop 0.20.2 together with Hive 0.6.

You need to make a small modification to the file bin/hadoop.

Change line 113

CLASSPATH="${HADOOP_CONF_DIR}"
to
CLASSPATH="${HADOOP_CONF_DIR}":$CLASSPATH

Otherwise Hive will not start correctly, because the original line discards the existing CLASSPATH and Hive can no longer find its classes on it.

Wednesday, November 17, 2010

How to use the Hive MetaStore Service

1. Start the Hive MetaStore Service by invoking the command: $HIVE_HOME/bin/hive --service metastore [port]
The default listening port is 9083 if you do not specify the port argument.

2. Enhance the model by invoking the command: ant model-jar, and put the hive-model-${version}.jar on your classpath.
In hive-0.6, you first have to copy ${basedir}/lib/data-nucleus*.jar and jdo.jar to ${basedir}/src/lib; this is a bug in hive-0.6.

3. Remember to put package.jdo on your classpath, otherwise you will get an exception.

Tuesday, November 09, 2010

Comparison of object serialization frameworks

If you are doing distributed systems work, object serialization is indispensable, because you need to communicate between different processes and different nodes. There are different ways to serialize objects. Here I will list some of them (I classify them into two main categories: binary and plain text).
Binary serialization has the advantage of being more compact, but it is not human-readable.
Plain text serialization has the advantage of being readable and easy to debug, but it takes more space.

Both have pros and cons; which one you should choose depends on your application.

Binary Format
1. Protobuf (Google)
2. Thrift (Facebook)
3. Avro (Hadoop plans to adopt it in the future)
4. Writable interface (Hadoop serialization)
5. Java built-in serialization

Plain Text Format
1. XML
2. JSON

You can also apply compression or encoding on top of these serialization frameworks. Again, it all depends on your application.

Regarding the performance of these serialization frameworks, you can refer here:
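To make the space trade-off concrete, here is a minimal sketch in Python that serializes the same record as JSON (plain text) and as a hand-rolled binary layout, then compresses a batch of JSON records. The record, the binary layout and the function name decode_binary are made up for illustration; this is not how Protobuf, Thrift or Avro actually encode data.

```python
import json
import struct
import zlib

# Hypothetical record, used only for illustration.
record = {"id": 12345, "score": 98.5, "name": "alice"}

# Plain text: human-readable JSON.
text_bytes = json.dumps(record).encode("utf-8")

# Binary: an ad hoc little-endian layout of
# int32 id + float64 score + uint16 name length, followed by the UTF-8 name.
HEADER = "<idH"
name_bytes = record["name"].encode("utf-8")
binary_bytes = struct.pack(
    HEADER, record["id"], record["score"], len(name_bytes)
) + name_bytes


def decode_binary(buf):
    """Reverse the ad hoc binary layout defined above."""
    rec_id, score, name_len = struct.unpack_from(HEADER, buf, 0)
    offset = struct.calcsize(HEADER)
    name = buf[offset:offset + name_len].decode("utf-8")
    return {"id": rec_id, "score": score, "name": name}


# Compression on top of a text format: a batch of 1000 identical records.
batch_text = json.dumps([record] * 1000).encode("utf-8")
batch_compressed = zlib.compress(batch_text)

print("json bytes:", len(text_bytes))
print("binary bytes:", len(binary_bytes))
print("batch json bytes:", len(batch_text))
print("batch compressed bytes:", len(batch_compressed))
```

The binary form is smaller but can only be read by code that knows the layout, while the JSON form can be inspected directly. Compression narrows the size gap for repetitive data at the cost of extra CPU time.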



Monday, November 08, 2010

Playing with Redis & MongoDB

Today I played with two NoSQL products, Redis and MongoDB. I won't talk about the technologies they use or compare the two products. I just want to say that the tutorials they provide left a deep impression on me. They provide a console-like UI on a web page that lets users try the products without installing anything. I believe this is very friendly to users, especially new users without much experience.

Teradata & Terracotta

Previously I did not realize that these are two different companies. Today I finally realized it.

Teradata focuses on data warehousing technology, while Terracotta specializes in Java clustering technology.

Sunday, November 07, 2010

How to handle skewed data in join operations

The following is the basic method for a parallel join operation.
Suppose we have two relations R and S, and N nodes.

1. Partition R across the N nodes based on hash function h1
2. On each node i, build a hash table from Ri using hash function h2 (if Ri does not fit entirely into memory, repartition Ri into buckets)
3. Partition S across the N nodes based on the same hash function h1 as in step 1
4. For each incoming tuple s from S, probe the hash table built in step 2 to see whether it joins with any tuples of Ri
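
The four steps can be sketched in Python as follows. This is only a single-process simulation under simplifying assumptions: the "nodes" are plain lists, a Python dict plays the role of the hash table built with h2, the join key is the first field of each tuple, and the overflow case (repartitioning Ri into buckets) is not modeled. All function names are made up for illustration.

```python
from collections import defaultdict


def h1(key, n_nodes):
    """Partitioning hash: decides which node a tuple is sent to."""
    return hash(key) % n_nodes


def partition(relation, n_nodes):
    """Steps 1 and 3: split a relation into one list per node using h1."""
    parts = [[] for _ in range(n_nodes)]
    for tup in relation:
        parts[h1(tup[0], n_nodes)].append(tup)
    return parts


def local_hash_join(r_part, s_part):
    """Steps 2 and 4 on one node: build a table on Ri, probe it with Si."""
    table = defaultdict(list)  # the dict's internal hashing stands in for h2
    for r in r_part:
        table[r[0]].append(r)
    out = []
    for s in s_part:
        for r in table.get(s[0], ()):
            out.append((r[0], r[1], s[1]))
    return out


def parallel_hash_join(R, S, n_nodes):
    """Run the local join on every simulated node and collect the results."""
    r_parts = partition(R, n_nodes)
    s_parts = partition(S, n_nodes)
    result = []
    for i in range(n_nodes):
        result.extend(local_hash_join(r_parts[i], s_parts[i]))
    return result


R = [(1, "a"), (2, "b"), (3, "c"), (3, "d")]
S = [(3, "x"), (1, "y"), (4, "z")]
print(sorted(parallel_hash_join(R, S, n_nodes=3)))
```

Because both relations are partitioned with the same h1, tuples with equal keys always land on the same node, so every node can join its partitions independently.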

Methods to avoid skew
1. Range partition
2. Subset replicate
3. Weighting
4. Virtual processor partitioning
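
As an illustration of the second method, here is a rough sketch of one way to read "subset replicate", assuming hash partitioning and a simple frequency threshold for detecting hot keys. Tuples of R with a hot key are spread over all nodes (each node gets a subset), and the matching tuples of S are replicated to every node. The threshold, the data and all function names are assumptions of mine, not taken from a specific system.

```python
from collections import Counter, defaultdict


def find_hot_keys(R, threshold):
    """Keys that occur more than `threshold` times in R."""
    counts = Counter(r[0] for r in R)
    return {key for key, count in counts.items() if count > threshold}


def hash_partition(relation, n_nodes):
    """Plain hash partitioning, shown for comparison."""
    parts = [[] for _ in range(n_nodes)]
    for tup in relation:
        parts[hash(tup[0]) % n_nodes].append(tup)
    return parts


def subset_replicate_partition(R, S, n_nodes, hot_keys):
    """Spread hot R tuples round-robin; replicate hot S tuples to all nodes."""
    r_parts = [[] for _ in range(n_nodes)]
    s_parts = [[] for _ in range(n_nodes)]
    next_node = 0
    for r in R:
        if r[0] in hot_keys:
            r_parts[next_node].append(r)  # each node holds a subset
            next_node = (next_node + 1) % n_nodes
        else:
            r_parts[hash(r[0]) % n_nodes].append(r)
    for s in S:
        if s[0] in hot_keys:
            for part in s_parts:  # replicate to every node
                part.append(s)
        else:
            s_parts[hash(s[0]) % n_nodes].append(s)
    return r_parts, s_parts


def local_join(r_part, s_part):
    """Build a hash table on the R partition and probe it with S."""
    table = defaultdict(list)
    for r in r_part:
        table[r[0]].append(r)
    return [(r[0], r[1], s[1]) for s in s_part for r in table.get(s[0], ())]


# Skewed input: key 1 dominates R.
R = [(1, "r%d" % i) for i in range(100)] + [(k, "x") for k in range(2, 12)]
S = [(1, "s0"), (1, "s1"), (5, "s5")]
N = 4

plain_load = max(len(p) for p in hash_partition(R, N))
hot = find_hot_keys(R, threshold=len(R) // N)
r_parts, s_parts = subset_replicate_partition(R, S, N, hot)
balanced_load = max(len(p) for p in r_parts)
joined = [row for i in range(N) for row in local_join(r_parts[i], s_parts[i])]

print("max R tuples per node, plain hash:", plain_load)
print("max R tuples per node, subset replicate:", balanced_load)
```

Each hot tuple of R lives on exactly one node and each hot tuple of S lives on all nodes, so every matching pair is still produced exactly once. The price is the extra copies of S.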

MapReduce Versus Parallel DBMS

Today I read the paper "MapReduce and Parallel DBMSs: Friends or Foes?", which compares MR and parallel DBMS technology. The following is extracted from this paper.

1. ETL and "read once" data sets. MR wins
2. Complex analytics. MR wins
3. Semi-structured data. MR wins
4. Quick-and-dirty analyses. MR wins
5. Limited-budget operations. MR wins
6. Query-intensive workloads (performance). Parallel DBMS wins (although query performance of a parallel DBMS is better than MR, loading the data into a parallel DBMS takes much more time)


Reasons why a parallel DBMS performs better than MR:
1. Repetitive record parsing (in MR).
2. Compression.
3. Pipelining. (A parallel DBMS pushes intermediate data directly to the target node instead of writing it to local disk, while MR writes intermediate data to disk.)
4. Scheduling.
5. Column-oriented storage.

Monday, November 01, 2010

Add an extra source folder in Maven

I wanted to add an additional source folder to a project built with Maven, and I found that the Build Helper Maven Plugin can do this.

When I tried it the first time, it failed. I finally found that I had to put ${basedir} in front of the source folder, as follows:


<plugin>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>build-helper-maven-plugin</artifactId>
  <version>1.5</version>
  <executions>
    <execution>
      <id>add-source</id>
      <phase>generate-sources</phase>
      <goals>
        <goal>add-source</goal>
      </goals>
      <configuration>
        <sources>
          <source>${basedir}/src/main/thrift/gen-java</source>
        </sources>
      </configuration>
    </execution>
  </executions>
</plugin>

Sunday, October 31, 2010

DataNode of HDFS

This post contains my notes on the DataNode of HDFS.

How the DataNode starts
  1. Check the datadir
  2. Get namenode proxy.
  3. Handshake with namenode (check the build version of namenode and datanode)
  4. Start ipcServer (Listen on 50020)
  5. Start DataXceiverServer (Listen on 50010)
  6. Start HttpServer (Listen on 50075)
  7. Register datanode (check build version with namenode)
  8. offerService
  • send heartbeat periodically (the DataNode acts on the response from the namenode)
  • send block report periodically
DataXceiver
DataXceiver is a thread spawned by DataXceiverServer. It first reads the first several bytes from the InputStream to decide what the client wants to do. The following are the operations a client can ask the data node to perform:
  • OP_READ_BLOCK: Get block id, offset and length, then call BlockSender (DFSClient calls this operation)
  • OP_WRITE_BLOCK: Get block id and generation timestamp, call BlockReceiver to write the block to the local data node and forward it to the next target, wait for the reply from the next target, then reply to the previous target (DFSClient and DataNode call this operation)
  • OP_READ_METADATA:
  • OP_REPLACE_BLOCK: For load balancing. Receive a block, write it to disk, and then notify the namenode to delete the source copy of this block (Balancer calls this operation)
  • OP_COPY_BLOCK: For load balancing. Read a block from disk and send it to the destination (BlockSender)
  • OP_BLOCK_CHECKSUM (DFSClient calls this operation)
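
The "read a few bytes, then decide what the client wants" pattern can be sketched in Python like this. It is not Hadoop's actual code: the numeric opcode values, the one-byte opcode header and the field layout of the read request (three big-endian 64-bit integers) are placeholders I chose for illustration, and the handlers only return a description instead of doing any I/O.

```python
import io
import struct

# Placeholder opcode values for illustration only; the real values are
# defined in Hadoop's data transfer protocol and may differ.
OP_WRITE_BLOCK = 1
OP_READ_BLOCK = 2
OP_READ_METADATA = 3
OP_REPLACE_BLOCK = 4
OP_COPY_BLOCK = 5
OP_BLOCK_CHECKSUM = 6


def handle_read_block(stream):
    """Parse block id, offset and length (assumed layout: three int64)."""
    block_id, offset, length = struct.unpack(">qqq", stream.read(24))
    return ("OP_READ_BLOCK",
            {"block_id": block_id, "offset": offset, "length": length})


def make_stub(name):
    """Handler that only reports which operation was requested."""
    def handler(stream):
        return (name, {})
    return handler


HANDLERS = {
    OP_READ_BLOCK: handle_read_block,
    OP_WRITE_BLOCK: make_stub("OP_WRITE_BLOCK"),
    OP_READ_METADATA: make_stub("OP_READ_METADATA"),
    OP_REPLACE_BLOCK: make_stub("OP_REPLACE_BLOCK"),
    OP_COPY_BLOCK: make_stub("OP_COPY_BLOCK"),
    OP_BLOCK_CHECKSUM: make_stub("OP_BLOCK_CHECKSUM"),
}


def dispatch(stream):
    """Read the opcode byte and hand the rest of the stream to its handler."""
    header = stream.read(1)
    if len(header) != 1:
        raise EOFError("stream closed before an opcode was received")
    (op,) = struct.unpack(">B", header)
    handler = HANDLERS.get(op)
    if handler is None:
        raise ValueError("unknown opcode: %d" % op)
    return handler(stream)


request = bytes([OP_READ_BLOCK]) + struct.pack(">qqq", 42, 0, 1024)
print(dispatch(io.BytesIO(request)))
```

In the real DataNode each accepted connection gets its own DataXceiver thread, so a dispatch like this would run once per connection.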