Tuesday, December 28, 2010

Fact Sheet for computer science

In this blog I will collect facts about algorithms and data structures.

Insertion Sort vs Merge Sort
Knowing which sort to use, and when, is an important practical matter; it is not always necessary to blindly implement the asymptotically best algorithm.
In practice, library sorts in languages like Java or Python are hybrids: they use merge sort or quicksort for large inputs and fall back to insertion sort for small ones.
Merge Sort - O(n lg n) - better once the input is large
Insertion Sort - O(n^2) - faster on small inputs, thanks to low overhead and cache-friendly access (the exact crossover depends on the implementation, typically a few dozen elements)
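As a sketch of how the two combine in practice, here is a hybrid merge sort that falls back to insertion sort below a cutoff. The cutoff of 32 is an illustrative assumption, not a measured optimum:

```java
import java.util.Arrays;

public class HybridSortDemo {
    static final int CUTOFF = 32; // illustrative threshold; real libraries tune this empirically

    // Sort a[lo..hi] (inclusive) with merge sort, switching to insertion sort on small ranges.
    static void sort(int[] a, int lo, int hi) {
        if (hi - lo + 1 <= CUTOFF) { insertionSort(a, lo, hi); return; }
        int mid = (lo + hi) >>> 1;
        sort(a, lo, mid);
        sort(a, mid + 1, hi);
        merge(a, lo, mid, hi);
    }

    static void insertionSort(int[] a, int lo, int hi) {
        for (int i = lo + 1; i <= hi; i++) {
            int key = a[i], j = i - 1;
            while (j >= lo && a[j] > key) { a[j + 1] = a[j]; j--; }
            a[j + 1] = key;
        }
    }

    static void merge(int[] a, int lo, int mid, int hi) {
        int[] tmp = Arrays.copyOfRange(a, lo, hi + 1);
        int i = 0, j = mid - lo + 1, k = lo;
        while (i <= mid - lo && j <= hi - lo)
            a[k++] = (tmp[i] <= tmp[j]) ? tmp[i++] : tmp[j++];
        while (i <= mid - lo) a[k++] = tmp[i++];
        while (j <= hi - lo) a[k++] = tmp[j++];
    }

    public static void main(String[] args) {
        int[] a = {9, 4, 7, 1, 8, 2, 6, 3, 5, 0};
        sort(a, 0, a.length - 1);
        System.out.println(Arrays.toString(a)); // [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
    }
}
```

The idea is that recursion and merging have fixed overhead per call, so once a subrange is small enough, the O(n^2) algorithm with tiny constants wins.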

Binary tree fanout:
The height of a tree is calculated as
h <= log_f n, where f is the fanout (number of children possible at each node). To reduce disk reads, the fanout can be increased, or even adjusted at runtime depending upon the number of nodes. With a fanout of 1000, a tree only two levels deep can contain a million nodes, which is usually quite sufficient.
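A quick sketch of that formula: the loop below finds the smallest height whose capacity f^h reaches n nodes (a rough model that counts only leaf capacity and ignores keys stored in internal nodes):

```java
public class FanoutHeight {
    // Smallest h such that a complete tree with the given fanout has
    // capacity f^h >= n; this is the ceiling of log_f(n).
    static int height(long n, int fanout) {
        int h = 0;
        long capacity = 1;
        while (capacity < n) { capacity *= fanout; h++; }
        return h;
    }

    public static void main(String[] args) {
        System.out.println(height(1_000_000L, 2));    // 20 levels for a binary tree
        System.out.println(height(1_000_000L, 1000)); // 2 levels with fanout 1000
    }
}
```

Going from fanout 2 to fanout 1000 cuts the height, and hence the number of disk reads per lookup, from 20 to 2.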

Use a C wrapper for computation:
C is fast. But when? And why isn't everyone using C, then?
There are issues with C with respect to memory management, reading files, DB connections, finding developers, the learning curve, and rapid application development (RAD). Still, I agree that for computation C is the best. C can give on the order of a 20X improvement when the load is computational. When keys must be computed very fast (as in large systems, e.g. calculating keys for a DHT), a C wrapper can improve performance depending upon the load profile.

===============================================================================

Registers & CPU cache (1 nanosecond): These are small, expensive and very fast memory slots. Memory controllers try mightily to keep this space populated with the data the CPU needs. A cache miss means a ~100X speed penalty. Even with a 95% hit rate, CPU cache misses can waste more than half the total time.
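The arithmetic behind that claim can be checked directly; the 1 ns hit and 100 ns miss figures below are illustrative assumptions:

```java
public class CacheMissMath {
    public static void main(String[] args) {
        // Assumed illustrative numbers: 1 ns on a cache hit, 100 ns on a miss.
        double hit = 1.0, miss = 100.0, hitRate = 0.95;
        double avg = hitRate * hit + (1 - hitRate) * miss;   // 5.95 ns average access
        double missShare = ((1 - hitRate) * miss) / avg;     // fraction of time spent on misses
        System.out.printf("avg access: %.2f ns, time lost to misses: %.0f%%%n",
                avg, missShare * 100);
    }
}
```

With these numbers the 5% of accesses that miss account for roughly 84% of the total time, so "half the time" is, if anything, conservative.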

Main memory (10^2 nanoseconds): If your computer were an office, RAM would be the desk scattered with manuals and scraps of paper. The kernel is there, reserving Papal land-grant-sized chunks of memory for its own mysterious purposes. So are the programs that are either running or waiting to run, network packets you are receiving, data the kernel thinks it's going to need, and (if you want your program to run fast) your working set. RAM is hundreds of times slower than a register but still orders of magnitude faster than anything else. That's why server people go to such lengths to jam more and more RAM in.

Solid-state drive (10^5 nanoseconds): SSDs can greatly improve the performance of systems with working sets too large to fit into main memory. Being "only" one thousand times slower than RAM, solid-state devices can be used as ersatz memory. It will take a few more years for SSDs to replace magnetic disks. And then we'll have to rewrite software tuned for the RAM / magnetic gap and not for the new reality.

Magnetic disk (10^7 nanoseconds): Magnetic storage can handle large, contiguous streams of data very well. Random disk access is what kills performance. The latency gap between RAM and magnetic disks is so great that it's hard to overstate its importance. It's like the difference between having a dollar in your wallet and having your mom send you a dollar in the mail. The other important fact is that access time varies wildly. You can get at any part of RAM or SSD in about the same time, but a hard disk has a physical metal arm that swings around to reach the right part of the magnetic platter.

Network (10^6 to 10^9 nanoseconds): Other computers. Unless you control that computer too, and it's less than a hundred feet away, network calls should be a last resort.
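To put the hierarchy side by side, a tiny sketch using the order-of-magnitude figures above:

```java
public class LatencyRatios {
    public static void main(String[] args) {
        // Approximate figures from the section above (orders of magnitude only).
        double register = 1, ram = 1e2, ssd = 1e5, disk = 1e7;
        System.out.println("RAM vs register: " + (long) (ram / register) + "x");
        System.out.println("SSD vs RAM:      " + (long) (ssd / ram) + "x");
        System.out.println("disk vs RAM:     " + (long) (disk / ram) + "x");
    }
}
```

One disk access costs about as much as a hundred thousand RAM accesses, which is why random disk I/O dominates everything else in a latency budget.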

Saturday, December 25, 2010

Google File System

Here is my understanding of the Google File System, based on a YouTube video.

Google File System
- provides a file system abstraction for scalable systems
- Google File System has a single master server (Hadoop's HDFS, which follows this design, calls it the NameNode)
- the master holds and coordinates all file system metadata

- Files are divided into 64 MB chunks (why? Because large systems have a very high read load, mostly big streaming reads, so large chunks are a good fit: a read is usually served from a single chunk, and the metadata stays small)
- Due to the high frequency of hard disk failures, data needs to be replicated, so each chunk is replicated (three copies by default)
- The locations of the chunks are stored by the master, and this metadata fits entirely in memory

Writing files
- Coordinated by, but not executed by, the master
- The master grants a lease on one of the chunk's replicas, making it the primary
- For all writes to that chunk, the same replica stays primary while the lease lasts
- The master write-ahead logs every metadata change (its operation log)
- The client pushes the data to all chunk servers, and each chunk server sends an ack back to the primary replica server
- Once the data has been received at every chunk server, the primary commands the replicas to apply the writes
- When the writes are done, the ack goes from the primary replica back to the client (not to the master)

Appending records at the end of the file:
- A record-append operation is provided
- Records can be appended atomically at the end of the file

Deleting files
- Deletes are lazy
- Files are marked as deleted and later garbage collected
- Deletion is not treated as a mainline operation, so space can be reclaimed in the background

GFS also puts some responsibility on clients/programs. A record append may create duplicate or padded records in the file, and those records are returned to the client/program. The program reading the file may therefore see duplicate or stale records; it must recognize them and handle the discrepancy. Consider this a trade-off between consistency and scalability.
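A minimal sketch of that client-side responsibility, assuming (hypothetically) that each appended record carries a writer-chosen unique id that the reader can deduplicate on:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class RecordReader {
    // Hypothetical record layout: an id chosen by the writer plus a payload.
    record Rec(String id, String payload) {}

    // Keep the first occurrence of each id and drop repeats; this mirrors the
    // client-side deduplication that GFS expects applications to perform.
    static List<Rec> dedup(List<Rec> raw) {
        Set<String> seen = new HashSet<>();
        List<Rec> out = new ArrayList<>();
        for (Rec r : raw)
            if (seen.add(r.id())) out.add(r);
        return out;
    }

    public static void main(String[] args) {
        List<Rec> raw = List.of(
                new Rec("a1", "first"), new Rec("a1", "first"), // duplicate from a retried append
                new Rec("b2", "second"));
        System.out.println(dedup(raw).size()); // 2
    }
}
```

The file system stays simple and scalable; the small amount of extra logic lands in every reader instead.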

For more information refer to the Google File System paper.

My Understanding of Network File System (NFS)

NFS
> Stands for Network File System
> Client-server mechanism
> NFS is hosted by one single server machine (not suitable for scaling)
> NFS works over TCP/IP.

Mounting NFS
> The user boots up the machine and NFS is mounted as a file system
> The OS makes the same file system calls as usual, and the NFS mount handles them

Interaction between client and server
> The client makes a request for a file, which is translated into NFS calls
> The client uses TCP/IP to communicate
> The requested file is locked and sent back to the client
> The client caches a copy of the file (or part of the file)
> The lock is a time-based entity: say the lock stays alive for the client for 5 minutes
> While accessing the file, the client can release the lock early, or re-request the lock (to restart the time quota); if the client goes down, NFS times out and releases the lock

Cache Consistency or Coherence of files
> As mentioned above, files are stored in the local cache (temporal locality)
> If the client needs the same file again, a request goes back to NFS to check whether the cache is still consistent with the file on the server (i.e., still caching the same copy)
> If the server copy has been updated, the new version of the file is fetched; otherwise the client can use the cached copy
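A toy sketch of that revalidation step, using modification times as the consistency check. The maps standing in for the server and the client cache are illustrative, not the real NFS protocol:

```java
import java.util.HashMap;
import java.util.Map;

public class CacheValidation {
    // Toy stand-ins: the "server" side holds the authoritative data and mtimes,
    // the "cached" side holds the client's local copies.
    static Map<String, Long> serverMtime = new HashMap<>();
    static Map<String, String> serverData = new HashMap<>();
    static Map<String, Long> cachedMtime = new HashMap<>();
    static Map<String, String> cachedData = new HashMap<>();

    // Revalidate the cached copy by comparing modification times,
    // refetching from the server only when the copy is stale or missing.
    static String read(String path) {
        Long cached = cachedMtime.get(path);
        Long server = serverMtime.get(path);
        if (cached != null && cached.equals(server)) {
            return cachedData.get(path);       // cache is consistent: use it
        }
        String fresh = serverData.get(path);   // stale or missing: refetch
        cachedData.put(path, fresh);
        cachedMtime.put(path, server);
        return fresh;
    }

    public static void main(String[] args) {
        serverData.put("/f", "v1"); serverMtime.put("/f", 1L);
        System.out.println(read("/f"));        // v1 (fetched and cached)
        serverData.put("/f", "v2"); serverMtime.put("/f", 2L);
        System.out.println(read("/f"));        // v2 (newer mtime invalidates the cache)
    }
}
```

The gap described in the next section fits this model: between two revalidations, another client can change the file, so the cached copy can be stale in the meantime.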

Why cache consistency is not 100% in NFS
> The client has already cached a file
> The client requests the file, and the cache is consistent with the NFS copy
> While the client is using the cached copy, another client (client0) requests a lock on that file from NFS
> NFS grants the lock to client0.
> Client0 updates the file while the first client is reading it
> So there is a chance that the first client reads a stale version

For more information check NFS @ Wiki.

Friday, December 24, 2010

Tree traversals

When preparing for interviews, always go through the different tree traversals. Wikipedia has an excellent article on tree traversal.



The tree traversals are:
> InOrder (sorted output for a binary search tree)
> PreOrder (root first)
> PostOrder (children first)
> LevelOrder (top level first; the first three are depth-first traversals, level order is breadth-first)

InOrder Code:

void inOrder(Node root) {
    if (root == null) return;   // also handles an empty tree
    inOrder(root.left);
    System.out.println(root.value);
    inOrder(root.right);
}


PreOrder Code:

void preOrder(Node root) {
    if (root == null) return;
    System.out.println(root.value);
    preOrder(root.left);
    preOrder(root.right);
}


PostOrder Code:

void postOrder(Node root) {
    if (root == null) return;
    postOrder(root.left);
    postOrder(root.right);
    System.out.println(root.value);
}


LevelOrder Code:

void levelOrder(Node root) {
    // use a FIFO queue; add() enqueues, poll() removes and returns the head
    Queue<Node> q = new LinkedList<>();
    if (root != null) q.add(root);

    while (!q.isEmpty()) {
        Node cur = q.poll();
        System.out.println(cur.value);

        if (cur.left != null) q.add(cur.left);
        if (cur.right != null) q.add(cur.right);
    }
}


Except for level-order traversal, the traversal methods above can be rewritten iteratively using an explicit stack. Remember that you may need to push and pop nodes on the stack rather than calling the function recursively. For post-order, a visited flag (or an extra pointer) is sometimes needed to make sure the function does not process the same node twice.
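For example, here is in-order traversal rewritten with an explicit stack instead of recursion. No visited flag is needed for in-order: push the left spine, visit, then move right:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class IterativeInOrder {
    static class Node {
        int value;
        Node left, right;
        Node(int v) { value = v; }
    }

    // In-order without recursion: descend the left spine pushing nodes,
    // then pop, visit, and switch to the right subtree.
    static List<Integer> inOrder(Node root) {
        List<Integer> out = new ArrayList<>();
        Deque<Node> stack = new ArrayDeque<>();
        Node cur = root;
        while (cur != null || !stack.isEmpty()) {
            while (cur != null) { stack.push(cur); cur = cur.left; }
            cur = stack.pop();
            out.add(cur.value);
            cur = cur.right;
        }
        return out;
    }

    public static void main(String[] args) {
        //      2
        //     / \
        //    1   3
        Node root = new Node(2);
        root.left = new Node(1);
        root.right = new Node(3);
        System.out.println(inOrder(root)); // [1, 2, 3]
    }
}
```

The explicit stack plays exactly the role the call stack plays in the recursive version.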

Have a great fight !!

How to switch between remote host and localhost while SSHing?


localhost$ssh -l username remotehost.com
-put your password here
remotehost$
-now you are connected to remote host
remotehost$ ~^Z
-press Enter, then type ~ followed by Ctrl-Z; the ssh session is suspended and the localhost prompt shows up
localhost$
-to see the suspended remote host session
localhost$jobs
[1]+ Stopped ssh -l username remotehost.com
-to go back use fg command
localhost$fg %1
ssh -l username remotehost.com
remotehost$
-now you are back to remotehost session

Wednesday, December 22, 2010

Installing Apache, MySQL Server and PHP on Ubuntu

Ubuntu gives the beautiful functionality of installing the Apache HTTP web server, MySQL server and PHP with a single command: sudo apt-get install.

To start, before installation just update the OS's package index.

$ sudo apt-get update


I tried installing Apache and it worked fine. Here is what I did.

$ sudo apt-get install apache2

Has anyone ever looked at which language the Apache Web Server is written in?
It is C. That made me a bit worried, because C does not have garbage collection.
Might it run into trouble when a lot of load is generated and resources need to be freed quickly?

Anyway, let's get back to the installation. To check that Apache has been installed properly, open a web browser and go to http://localhost/
The page you see there is self-explanatory.

Now to install php :

$ sudo apt-get install php5 libapache2-mod-php5

Since libapache2-mod-php5 has been installed, restart Apache for the change to take effect.

$ sudo /etc/init.d/apache2 restart

OK. So now PHP should be up and running. How to check?

$ sudo gedit /var/www/phpinfo.php

Type in:

<?php phpinfo(); ?>

Save the file, then browse to http://localhost/phpinfo.php. If you see a page showing the PHP configuration report and no errors, PHP works. NICE !!

Now it's time to install MySQL:
$ sudo apt-get install mysql-server mysql-client

This should finish the install. Remember the password you set here.

Now it's time to check whether the MySQL login is working fine.

$ mysql -u root -p

Here root is the admin user created during installation. The MySQL service will ask for the password after you press Enter, and then you will see the mysql prompt. Type exit to leave:

mysql> exit

Now some more installation for PHP
$ sudo apt-get install libapache2-mod-auth-mysql php5-mysql phpmyadmin

This installation may or may not show you some Yes/No dialog boxes.
I got a few, but in this changing world, if you don't get them, don't worry, just move on.

If you do get such Yes/No pop-ups, read the text and choose accordingly.

The next thing to do is to restart Apache2.

$ sudo /etc/init.d/apache2 restart

This is good enough for now. If you run into any issues, just Google them.

Have great day / night ahead.