jMar"s Blog DevSmash Developer Portal

Friday, March 7, 2008

Efficient Tag Cloud Algorithm

Hey all - since tag clouds are so popular these days, I thought I'd make a quick post about the algorithm I'm using to generate my tag cloud. I found a few sample algorithms already out there, but many of them seemed either inefficient, or didn't return the size that I was expecting.

Considering that the algorithm I'm using for my blog is running on the client side, I wanted to make sure that it was especially efficient. Here's what I ended up coming up with.

First, let's define the variables that are being used:

maxPercent
The font size is set as a percentage. This is the font-size percentage that the largest (most frequent) tag should be set to.
minPercent
This is the font-size percentage that the smallest (least frequent) tag should be set to.
max
This is the number of occurences for the most frequent tag.
min
This is the number of occurences for the least frequent tag.
count
This variable should be set inside of the link iterator. It refers to the number of occurences for the current tag.

Before I show some sample code for this, let's look at the actual algorithm for calculating the size:

var size = minPercent + ((max-(max-(count-min)))*(maxPercent-minPercent)/(max-min));

Since count is the only variable that changes during each iteration, we can extract the expression (maxPercent-minPercent)/(max-min) into its own constant and define it outside of the loop. This will save us two subtractions and a division operation for every loop. This now yields the following algorithm:

var multiplier = (maxPercent-minPercent)/(max-min);
var size = minPercent + ((max-(max-(count-min)))*multiplier);

The mathematical brilliance (sorry) behind this simple function isn't immediately obvious by looking at it, but the size returned will adhere to the following rules:

  1. The least occuring tag(s) will have a font-size of minPercent.
  2. The most occuring tag(s) will have a font-size of maxPercent.
  3. Tags with occurence counts in the middle will be scaled linearly.

Ok, let's look at this algorithm in action. The sample below uses a few jQuery methods (you all saw it coming), but this algorithm can certainly survive apart from any JS library. Notice that in the code below, I don't know min or max at the start. This means I have to loop through all of the tags twice, calculating these statistics the first time.

To give this code a real life context, I am generating a tag cloud out of the categories list that blogger generates. The markup that I am looping through is in the following format:

<div class="widget-content" id="categories">
 <ul>
  <li>
   <a dir="ltr" href="http://jmar777.blogspot.com/search/label/jQuery">jQuery</a>
   (<span dir="ltr">6</span>)
  </li>
  <li>
   <a dir="ltr" href="http://jmar777.blogspot.com/search/label/jTruncate">jTruncate</a>
   (<span dir="ltr">2</span>)
  </li>
  [etc...]
 </ul>
</div>

And here's the heavily commented javascript:

$().ready(function() {
 var catContainer = $('div#categories');
 // get an array of all the <li>'s
 var categories = catContainer.find('ul li');
 var cloudMarkup = '';
 // set maxPercent/minPercent preferences
 var maxPercent = 150, minPercent = 100;
 // note that max is initialized to a number that I know is lower than the max count
 // and that min is set to a number larger than the known min count
 var max = 1, min = 999, count = 0;
 // loop through each li and calculate statistics
 categories.each(function(i) {
  count = parseInt($(this).find('span').text());
  max = (count > max ? count : max);
  min = (min > count ? count : min);
 });
 var total, link, size;
 var multiplier = (maxPercent-minPercent)/(max-min);
 // loop through each li and generate the markup for the new tag cloud
 categories.each(function(i) {
  count = parseInt($(this).find('span').text());
  link = $(this).find('a');
  size = minPercent + ((max-(max-(count-min)))*multiplier) + '%';
  cloudMarkup += '<a class="cloud_link" style="font-size:' + size + '" href="' + link.attr('href') + '">' + link.text() + '</a> ';
 });
 // replace the html content of the parent container with the new tag cloud
 catContainer.html(cloudMarkup);
});

For you jQuery users out there - I am considering turning this into a plugin. I would like to hear from you what the expected markup should look like. For example, <a href="#" rel="[count]">tag</a> or <a href="#">tag ([count])</a>. Hopefully we can come up with something that doesn't completely violate the purpose of the rel tag, like that example! Thanks for your input!



20 comments:

Anonymous said...

Wonderful and a jquery plugin would be awesome.I personally feel you can choose any markup that makes things simple for you.

Anonymous said...

Why give it to the plugin users choice and let him define via plugin constructor the way he wants his tag to be.

Just to think for. It could not be this vast effort to implement it this way.

Best regards,
Skyborg

Jeremy Martin said...

I agree that it would be ideal to let the user determine the markup - just not sure yet on how to let them specify that in a flexible way...

Anonymous said...

While you're optimising your loop why don't you simplify
((max-(max-(count-min)))

to just

(count - min)?

Or do you have a typo in there somewhere?

Jeremy Martin said...

@Anonymous
Since count is updated during each iteration, I can't extract any expressions that contain count outside of the loop.

Anonymous said...

I would be very happy, if you could add an alternative routine to get the data for the cloud.

Like a text-file or data from JSON. The reason is, that not all people who use jQuery use it for HTML ;-)
Personally I use it for XUL (Firefox extension, etc.) as well.

Right now I am trying to create a GTD add-on for an application based on Mozilla and thus I am in need for a tag-clound jQuery plugin, that reads the tags from a database and then creates the tag-cloud from it.

Ken said...

Jeremy--

Thanks for this post. I had the same thought as the March 21 anonymous regarding the ((max-(max-(count-min))). You said efficiency was a concern, so here's my two bits on optimzing the size calculation in the loop:

Just distribute the minuses out to the terms in the parenthesis, and you'll see it works out, even though "count" changes on each iteration. Here it is step-by-step (perhaps a little more verbose than needed).

a) ((max - (max - (count - min)))
b) ((max - (max - count + min))
c) (max - max + count - min)

"C" will give you the exact same result as "A", and since the "max-max" cancels out, you can further simiplify it to be:

d) count - min

That leaves your code as the following:

var multiplier = (maxPercent-minPercent)/(max-min);
var size = minPercent + (count-min)*multiplier;

The nice thing about the above code is that math-minded folks will instantly recognize you're doing a linear interpolation of "count" between two numbers ("maxPercent" and "minPercent"). That was a bit obscured in the original code.

With a bit of explanation to someone of what interpolation is, you can even easily visualize in your minds-eye what is happening in the above code.

This is the easiest to understand version of the code, but you can introduce another variable optimize the loop further:

a) var size = minPercent + (count-min)*multiplier;
b) var size = minPercent + count*multiplier - min*multiplier;
c) var size = (minPercent - min*multiplier) + count*multiplier;
d) var offset = minPercent - min*multiplier; var size = offset + count*multiplier;

with "offset" being calculated outside of the loop (note that no terms in "offset" use "count" so this is safe to do). This means that inside of the loop we have 1 multiplication and 1 addition, whereas it used to be 4 additions/subtractions and 1 multiplication. Notice that the only things we're pulling out of the loop are the constants. Anything involving "count" stays in the loop.

This leaves your code as the following

var multiplier = (maxPercent-minPercent)/(max-min);
var offset = minPercent - min*multiplier;
var size = count*multiplier + offset;


A little harder to visualize what's going on in this version, but it is fine with a comment.

Try this; it will work. What your algorithm is basically doing is a linear interpolation.

Jeremy Martin said...

@Ken
Thanks for that very thorough explanation! I guess I was too hasty in my response to the 3/21 comment. Didn't we learn that stuff in like 5th grade or something...? I'll add this to my growing list of updates - too many smart people correcting me, not enough time...

jacob said...

Your algorithm has a problem if all of the tags have exactly the same number of occurrences. In this case max=min and you would get a divide by zero error.

Jeremy Martin said...

@Jacob
Uh... so it does. I swear, this post is taking me to school... Thanks for pointing that out!

rapin said...

This is great. Playing with it now. As for markup, maybe something simple like class="cloudable" rel="rank"?

I Personally prefer the rank/weight indicator to be in an attrib.

ege said...

Thank you very much for the article. It took me about 25 mins to apply it, not on the client side but server side using php. Your article is the simplest tutorial that I have found on the net. Special thanks go to Ken, as I used his refined version.

terminals-blocks said...

Just to think for. It could not be this vast effort to implement it this way.

webbod said...

I liked the idea, but I have a large set of data, the distribution is such that I have a lot of rare tags and a few very common ones - with hardly anything in between - I'm looking at product sales from a large inventory.

Some products sell 1 unit, others 50, 100, 500, 1000, etc... but one product sells 250,000 which means your algorithm gives me a few enormous tag and maps everything else down to the mininium size.

I need to be able to pull out the detail at both ends of the scale.

Using Log10(Count) worked really well and produces really compelling sales charts.

Thanks for the idea - 10 minute metrics - Mwhahahaha!

toothygoose said...

Here is a mootools/functional programming variant. Tidies away all the variables into its own namespace:


var TagCloud = {
apply: function(selector, options) {
options = options || {};

var maxPercent = options['maxPercent'] || 150;
var minPercent = options['minPercent'] || 100;
var retrieveCount = options['retrieveCount'] || function(element){element.getAttribute('count').toInt()};
var apply = options['apply'] || function(element, size){element.setStyle('font-size', size + '%')};

var max = null;
var min = null;

var tagElements = $$(selector);

tagElements.each(function(element)
{
count = retrieveCount(element);
max = (max == null || count > max ? count : max);
min = (min == null || min > count ? count : min);
});

var multiplier = (maxPercent-minPercent)/(max-min);

tagElements.each(function(element)
{
count = retrieveCount(count);
size = (minPercent + (count - min) * multiplier);
apply(element, size);
});
}
};

TagCloud.apply('ul#tags > li');

toothygoose said...

Whoops...

make that:

var retrieveCount = options['retrieveCount'] || function(element){return element.getProperty('count').toInt()};

Keep forgetting that the return statement is required in JavaScript. Lazy!

Anonymous said...

big think it was exactly what I searched, with a min and max!

I did it in PHP but will perhaps look after to make in JQuery because it's display work

think again!

nodejs said...

Nice Plug-in and demonstration!

Joseph Buarao said...

great demonstration... can i use this to my premium template?

Jonas said...

Thanks! I have been trying to come up with a good way to do this for a long time.