
A faster, Non-recursive Algorithm to compute all Combinations of a String

Imagine you're me: you studied Permutations and Combinations in high school maths and, many years later, you realise that to solve a certain problem you need to apply Combinations.

You do your revision and confidently open your favourite IDE to code; after typing some usual lines, you pause and think, then you do the next best thing - search the Internet. You find a nice recursive solution, which does the job well, like the following:

import java.util.ArrayList;
import java.util.Date;

public class Combination {
   // Returns every arrangement of restOfVals: each value in turn is placed
   // first, followed by every arrangement of the remaining values
   public ArrayList<ArrayList<String>> compute (ArrayList<String> restOfVals) {
      if (restOfVals.size () < 2) {
         // Base case: an empty or single-value list has only one arrangement
         ArrayList<ArrayList<String>> c = new ArrayList<ArrayList<String>> ();
         c.add (restOfVals);
         return c;
      }
      else {
         ArrayList<ArrayList<String>> newList = new ArrayList<ArrayList<String>> ();
         for (String o : restOfVals) {
            // Copy the list without o, then recurse on what is left
            ArrayList<String> rest = new ArrayList<String> (restOfVals);
            rest.remove (o);
            newList.addAll (prependToEach (o, compute (rest)));
         }
         return newList;
      }
   }

   // Inserts v at the front of every list in vals (modifies the lists in place)
   private ArrayList<ArrayList<String>> prependToEach (String v, ArrayList<ArrayList<String>> vals) {
      for (ArrayList<String> o : vals)
         o.add (0, v);
      return vals;
   }

   public static void main (String args[]) {
      ArrayList<String> i = new ArrayList<String> ();
      i.add ("a");
      i.add ("b");
      i.add ("c");
      long start = new Date ().getTime ();
      Combination c = new Combination ();
      c.compute (i);
      System.out.println ("Elapsed Time: " + (new Date ().getTime () - start));
   }
}

So, if the above does what we need, what problem are we addressing? Well! Try passing the letters of "acknowledgement" to this function and enjoy your cup of coffee, because there is no way your program will finish execution in realistic time; in fact, it may even crash due to low memory. The reason is that the number of results explodes with the length of the input: a string of n characters has 2^n combinations, and the routine above actually builds every ordering of its input, which is n! lists. So as the length of the string increases, the time hikes exponentially or worse. The graph below illustrates this very well (input is on x-axis and time on y-axis).


Image Ref: http://www.regentsprep.org
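
To put numbers on that growth, here is a small sketch. The class GrowthDemo and its two helper methods are my own names for illustration, not part of the code above. It counts how many results exist for n characters: 2^n subsets when order does not matter, and n! orderings, which is what the recursive routine above builds.

```java
import java.math.BigInteger;

class GrowthDemo {
   // Number of subsets of n items (every size, including the empty one): 2^n
   static BigInteger subsetCount (int n) {
      return BigInteger.ONE.shiftLeft (n);
   }

   // Number of orderings of n items: n!
   static BigInteger orderingCount (int n) {
      BigInteger result = BigInteger.ONE;
      for (int i = 2; i <= n; i++)
         result = result.multiply (BigInteger.valueOf (i));
      return result;
   }

   public static void main (String[] args) {
      for (int n : new int[] {3, 10, 15})
         System.out.println (n + " chars: " + subsetCount (n) + " subsets, " + orderingCount (n) + " orderings");
   }
}
```

For the 15 letters of "acknowledgement" this reports 32768 subsets but 1307674368000 orderings, which is why the recursive version has no hope of finishing.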

What's wrong with the current approach is recursion. As your program starts branching, the tree becomes gigantic and your memory requirement grows exponentially too. While you cannot reduce the time it takes to compute all combinations, you can certainly do some tinkering to reduce the memory consumption, thus reducing the additional overhead.

In order to mitigate this issue, we look for a non-recursive solution. Now, in my case, I couldn't really find any (you might be luckier). So here is what I did:

Recall the table you once wrote in college that maps each hexadecimal digit to its 4-digit binary value?
0 = 0000
1 = 0001
2 = 0010
. . .
. . .
F = 1111

What's so interesting about these binary numbers? Choose the length 4, set all digits to 0 and keep adding 1, and you'll get ALL combinations that can come out of a 4-digit string of binary digits. Now, think of a string stored in a character array: if you print only the characters at the indices where a binary-digit array holds a 1, you can get any combination from this string. This is exactly what we do here. We step a binary counter through every value up to the maximum number of combinations and bang! You get a non-recursive method to discover all possible combinations from a string. Here is the code in Java:

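import java.util.Date;
import java.util.SortedSet;
import java.util.TreeSet;

public class Combinations {
   public static void main (String[] args) {
      long start = new Date ().getTime ();
      String[] result = combination ("teststring");
      System.out.println (result.length + " distinct combinations");
      System.out.println ("Elapsed Time: " + (new Date ().getTime () - start));
   }

   public static String[] combination (String str) {
      // A sorted set drops the duplicates that appear when str repeats a character
      SortedSet<String> list = new TreeSet<String> ();
      int length = str.length ();
      // 2^length - 1 is the counter value with every bit set (needs length < 31)
      int total = (1 << length) - 1;
      for (int i = 0; i < total; i++) {
         String tmp = "";
         // Reverse the binary digits so that position j corresponds to str.charAt (j)
         char[] charArray = new StringBuilder (Integer.toBinaryString (i)).reverse ().toString ().toCharArray ();
         for (int j = 0; j < charArray.length; j++)
            if (charArray[j] == '1')
               tmp += str.charAt (j);
         list.add (tmp);
      }
      // The all-ones counter value selects the whole string
      list.add (str);
      return list.toArray (new String[] {});
   }
}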
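
To see the counter at work on a tiny input, here is a minimal sketch of the same idea that tests the bits directly instead of going through a binary string. MaskDemo and subsets are names I made up for this illustration, and it assumes the string is shorter than 31 characters so the counter fits in an int.

```java
import java.util.ArrayList;
import java.util.List;

class MaskDemo {
   // Returns one string per counter value, in counter order (0, 1, 2, ...).
   // Assumes str.length () < 31 so that 1 << n does not overflow an int.
   static List<String> subsets (String str) {
      int n = str.length ();
      List<String> result = new ArrayList<String> ();
      for (int mask = 0; mask < (1 << n); mask++) {
         StringBuilder sb = new StringBuilder ();
         for (int j = 0; j < n; j++)
            if ((mask & (1 << j)) != 0)   // bit j is set, so keep character j
               sb.append (str.charAt (j));
         result.add (sb.toString ());
      }
      return result;
   }

   public static void main (String[] args) {
      // prints [, a, b, ab, c, ac, bc, abc]
      System.out.println (subsets ("abc"));
   }
}
```

Counter values 0 to 7 are 000 to 111 in binary, and each one picks a different subset of "abc". Unlike the TreeSet version, this sketch keeps counter order and does not remove duplicates when the input repeats a character.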

And here's the comparison of both the algorithms:

On the x-axis, we have the length of the string, ranging between 5 and 21, and on the y-axis the time in milliseconds. The recursive algorithm refused to proceed after length 10, throwing an OutOfMemoryError.

Note: to fit both curves on one chart, I plotted the log of the time taken by each algorithm.

Another plus with this algorithm is that it can be interrupted. A recursive function keeps reducing the problem to a simpler one and does not start producing results until it reaches the bottom, i.e. until it cannot reduce the input any further, so you cannot stop it in the middle and ask for the values it has computed so far. Here, on the other hand, you can stop the program at any stage and fetch the values it has already computed.
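
One way to make use of that is to hand out the combinations one at a time. The following is only a sketch under my own assumptions: CombinationIterator is a hypothetical class, not part of the code above, it keeps no set of results (so duplicates are not removed), and it assumes the string is shorter than 63 characters so the counter fits in a long.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

class CombinationIterator implements Iterator<String> {
   private final String str;
   private final long total;   // number of counter values, 2^n
   private long mask = 0;      // next counter value to hand out

   // Assumes str.length () < 63 so that 1L << n does not overflow a long.
   CombinationIterator (String str) {
      this.str = str;
      this.total = 1L << str.length ();
   }

   public boolean hasNext () {
      return mask < total;
   }

   public String next () {
      if (!hasNext ())
         throw new NoSuchElementException ();
      StringBuilder sb = new StringBuilder ();
      for (int j = 0; j < str.length (); j++)
         if ((mask & (1L << j)) != 0)   // bit j is set, so keep character j
            sb.append (str.charAt (j));
      mask++;
      return sb.toString ();
   }

   public static void main (String[] args) {
      Iterator<String> it = new CombinationIterator ("acknowledgement");
      // Stop whenever we like: only the first five results are ever built
      for (int i = 0; i < 5 && it.hasNext (); i++)
         System.out.println ("[" + it.next () + "]");
   }
}
```

The only state is the counter itself, so the caller can stop after any number of results, and memory use stays small no matter how long the string is.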

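From a pure algorithmic aspect, both approaches are exponential, and they have to be: there are 2^n combinations of n characters, so no algorithm can list them all in polynomial time. The recursive approach does even more work, because it builds all n! orderings. In our solution the loop still executes about 2^n times, but the work inside each iteration is only polynomial in n (reading at most n bits, building a string of at most n characters and inserting it into the sorted set). Please note that this does not mean that we have turned an exponential problem into a polynomial one. The number of times the loop executes still grows exponentially; it is the execution time within the loop that we have reduced.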

Realistically speaking, a common word in English will be under 20 characters. I mean, how often do you use Internationalization, really? But then this isn't only about English, right?!

Ending note: I'm new to theoretical computing and might be mistaken here; please correct me if I have misinterpreted the results...
