Counting Word Occurrences From Huge Files

Suppose we are asked to count the occurrences of every word across a million pages within one day (24 hours). How can we do that? Here is a possible design:

  1. Pick a dictionary that contains every word that exists. Count its words to get the word-count.
  2. Keep a file that holds the links to all of the million pages to be counted.
  3. Run the program to count the occurrences of any one word and note the time it takes. From this we work out the number of threads, the thread-count. If the time is nearly a day, we plan to create word-count threads. If it is more than a day, we create enough threads for each word that the program finishes within a day. If it is less than half a day, we use the same thread for two words, and so on. In this way we arrive at the thread-count. In the meantime, note down the memory required as well.
  4. Now look for a parallel programming system that can run word-count * thread-count processes at the same time (like a CUDA system with an enormous number of cores assembled together!). And yes, from the trial run in step 3 we would also know the amount of memory required.
  5. Every other constraint should be suitably assumed, recorded and added to the process (for example, the master machine that coordinates all this, the program that assigns and calls the threads, etc.).
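
The sizing rule in step 3 and the per-word assignment can be sketched in a few lines of Python. This is only a rough illustration under several assumptions, not a definitive implementation: the names threads_per_word, total_threads, count_one_word and count_all are hypothetical, the pages are assumed to be already downloaded into strings (fetching the links from step 2 is left out), and a word is matched case-insensitively as a whole token. Here thread-count is read as "threads per word", so 1.0 means one thread per word and 0.5 means one thread shared by two words.

```python
import math
import re
from concurrent.futures import ThreadPoolExecutor

DAY_SECONDS = 24 * 60 * 60


def threads_per_word(seconds_for_one_word, deadline_seconds=DAY_SECONDS):
    """Step 3: threads needed per word to meet the deadline.

    1.0 means one thread per word, 2.0 means two threads per word,
    0.5 means one thread can be shared by two words.
    """
    return seconds_for_one_word / deadline_seconds


def total_threads(word_count, seconds_for_one_word, deadline_seconds=DAY_SECONDS):
    """Step 4: word-count * thread-count, rounded up to a whole thread."""
    needed = word_count * seconds_for_one_word / deadline_seconds
    return max(1, math.ceil(needed))


def count_one_word(word, pages):
    """Count whole-token, case-insensitive occurrences of one word in all pages."""
    pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
    return sum(len(pattern.findall(page)) for page in pages)


def count_all(dictionary, pages, workers):
    """Give each dictionary word to a pool of `workers` threads.

    When there are fewer workers than words, the pool reuses a thread for
    several words, as in the "same thread for two words" case of step 3.
    This sketch does not split a single word across several threads.
    """
    words = sorted(dictionary)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        counts = pool.map(lambda word: count_one_word(word, pages), words)
        return dict(zip(words, counts))


if __name__ == "__main__":
    pages = ["the cat and the hat", "The dog"]
    dictionary = {"the", "cat", "dog", "fish"}
    # Pretend the trial run for one word took half a day.
    workers = total_threads(len(dictionary), DAY_SECONDS / 2)
    print(workers, count_all(dictionary, pages, workers))
```

Note that in CPython, threads do not speed up CPU-bound work such as regex matching, so the pool above only illustrates how words are assigned to workers. A real run at this scale would need separate processes or machines, as step 4 suggests.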

