Under Pressure
Friday 25 January 2013
More Numbers Every Awesome Programmer Must Know
http://highscalability.com/blog/2013/1/15/more-numbers-every-awesome-programmer-must-know.html
Colin Scott, a Berkeley researcher, updated Jeff Dean’s famous Numbers Everyone Should Know with his Latency Numbers Every Programmer Should Know interactive graphic. The interactive aspect is cool because it has a slider that lets you see numbers from as early as 1990 to the far, far future of 2020.
Colin explained his motivation for updating the numbers:
The other day, a friend mentioned a latency number to me, and I realized that it was an order of magnitude smaller than what I had memorized from Jeff’s talk. The problem, of course, is that hardware performance increases exponentially! After some digging, I actually found that the numbers Jeff quotes are over a decade old.
Since numbers without interpretation are simply data, take a look at Google Pro Tip: Use Back-Of-The-Envelope-Calculations To Choose The Best Design. The idea is that back-of-the-envelope calculations are estimates you create using a combination of thought experiments and common performance numbers to get a good feel for which designs will meet your requirements.
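To make that concrete, here’s the kind of sum it means, sketched in C# for a page of 30 image thumbnails. The figures are assumed round numbers of the sort these tables quote (10ms per disk seek, 30MB/s sequential disk read), not measurements:

// Back-of-the-envelope: time to build a page of 30 thumbnails from 256KB images.
// Assumed round figures: 10ms per disk seek, 30MB/s sequential disk read.
double seekMs = 10.0;
double readMsPerMb = 1000.0 / 30.0;   // ~33ms to read 1MB sequentially
int images = 30;
double imageMb = 0.25;                // 256KB per image

// Design 1: read the images one after another from a single disk.
double serialMs = images * (seekMs + imageMb * readMsPerMb);   // ~550ms

// Design 2: issue the 30 reads in parallel across disks.
double parallelMs = seekMs + imageMb * readMsPerMb;            // ~18ms

Console.WriteLine($"serial: {serialMs:F0}ms, parallel: {parallelMs:F0}ms");

Two sums like that are often enough to choose between designs before you write any code.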
And given most of these measures are in nanoseconds, to better understand the nanosecond you can do no better than Grace Hopper To Programmers: Mind Your Nanoseconds! Her famous prop was a piece of wire 11.8 inches long: the distance light travels in a nanosecond, a billionth of a second.
Colin's post inspired some great threads on Reddit and on Hacker News. Here are some comments I found particularly juicy:
To the idea that these numbers are inaccurate, Beckneard counters:
Wednesday 12 December 2012
Stop VS2012 shouting!
- Start PowerShell
- Copy/paste the following:
Set-ItemProperty -Path HKCU:\Software\Microsoft\VisualStudio\11.0\General -Name SuppressUppercaseConversion -Type DWord -Value 1
Next time you start VS the menus will be in mixed case.
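If you ever want the shouting back, deleting the value should restore the default (same key as above, untested sketch):
Remove-ItemProperty -Path HKCU:\Software\Microsoft\VisualStudio\11.0\General -Name SuppressUppercaseConversion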
See http://blogs.msdn.com/b/zainnab/archive/2012/06/14/turn-off-the-uppercase-menu-in-visual-studio-2012.aspx for more detail.
Monday 10 December 2012
LuceneNet Sparse Faceted Search
I recently had a use for faceted search in a project using Lucene.Net. The contrib project for Lucene.Net provides SimpleFacetedSearch, which is great, but… it becomes very inefficient when your index has lots of values in the given field. So, for example, if you have a product category code that only has a handful of values then SimpleFS is fine. If you facet against date and you have significant history then SimpleFS will eat all your memory very quickly.
SimpleFS represents a facet as a bitmap for each value. Each bitmap is one bit per document in your index (so numDocs/8 bytes). A quick calculation based on your index will give you very big numbers if you have large numbers of documents, values or both. In our case we have 100K values and 3.5M documents: 100K * (3.5M/8) = around 41GB!
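That sum, written out (the figures are from our index; plug in your own):

// SimpleFacetedSearch bitmap cost: one bitmap per facet value,
// each bitmap one bit per document in the index.
long numValues = 100000;                       // distinct facet values
long numDocs = 3500000;                        // documents in the index
long totalBytes = numValues * (numDocs / 8);
Console.WriteLine(totalBytes / (1024.0 * 1024 * 1024)); // ~40.7 GiB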
SparseFacetedSearch to the rescue. SparseFS can deal with facets with many thousands of values (hundreds of thousands in my case) while using much less memory. It is based on SimpleFacetedSearch but uses DocID lists instead of bitmaps. It’s suitable for high-cardinality, sparsely populated facets, i.e. there are a large number of facet values and each facet value is hit in a small percentage of documents.
The memory usage is related to the number of hits each document has. In the product category example this would be exactly 1. The break-even point is at around 32 values: a DocID costs 32 bits, whereas a bitmap costs one bit per document per value. In our case we generally have between 1 and 5 values per document, and our memory usage is around 600MB. Quite a saving on 41GB!
There is a bunch of math that goes into explaining how that works out.
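Roughly, it comes down to this (the hits-per-document figure below is an assumption, and real usage is higher than the raw IDs because of per-list overheads):

// DocID-list cost for a sparse facet: one 32-bit DocID per hit.
long numDocs = 3500000;                      // documents in the index
long numValues = 100000;                     // distinct facet values
long hitsPerDoc = 3;                         // assumed: ours range from 1 to 5

long listBytes = numDocs * hitsPerDoc * 4;   // ~42MB of raw DocIDs
// versus numValues * (numDocs / 8) ≈ 41GB of bitmaps, as above.

// Bitmaps cost numValues bits per document; lists cost 32 bits per hit.
// So DocID lists win whenever numValues > 32 * hitsPerDoc.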
Wednesday 29 August 2012
IEqualityComparer trick
But this isn’t actually completely true.
The code for Distinct looks like this…
public static IEnumerable<TSource> Distinct<TSource>(
    this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
{
    if (source == null)
        throw Error.ArgumentNull("source");
    else
        return Enumerable.DistinctIterator<TSource>(source, comparer);
}

private static IEnumerable<TSource> DistinctIterator<TSource>(
    IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
{
    Set<TSource> set = new Set<TSource>(comparer);
    foreach (TSource source1 in source)
    {
        if (set.Add(source1))
            yield return source1;
    }
}
So if the item can be added to the Set<T> then it is returned, and duplicate items are ignored. Quite sneaky!
Notice the Set is created with a comparer. Inside set.Add it looks like this…
private bool Find(TElement value, bool add)
{
    int hashCode = this.InternalGetHashCode(value);
    for (int index = this.buckets[hashCode % this.buckets.Length] - 1;
         index >= 0;
         index = this.slots[index].next)
    {
        if (this.slots[index].hashCode == hashCode &&
            this.comparer.Equals(this.slots[index].value, value))
            return true;
    }
    // other stuff trimmed
}
It’s the if statement that is interesting: if two entries have the same hash code, the set falls back on comparer.Equals to tell the values apart.
What if we supply a comparer that always returns a fixed value (like 0) from GetHashCode? Then the set always has to use Equals.
public class AlwaysEqualsEqualityComparer<T> : IEqualityComparer<T>
{
    private readonly Func<T, T, bool> comparer;

    public AlwaysEqualsEqualityComparer(Func<T, T, bool> comparer)
    {
        this.comparer = comparer;
    }

    public bool Equals(T x, T y)
    {
        return comparer(x, y);
    }

    public int GetHashCode(T obj)
    {
        // Constant hash code forces the set to compare every pair with Equals.
        return 0;
    }
}
So, given I have a collection of these…
public class MyThing
{
    public string ID { get; set; }
    public string Name { get; set; }
    public string Type { get; set; }
    public string Colour { get; set; }
}
I can use the new comparer like this…
var distinctList = thingList.Distinct(new AlwaysEqualsEqualityComparer<MyThing>((x, y) => x.Colour == y.Colour));
…and get back a collection with one MyThing per distinct Colour.
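For example, with some made-up things (hypothetical data, just to show the shape of the result):

var thingList = new List<MyThing>
{
    new MyThing { ID = "1", Name = "Ball",  Type = "Toy",   Colour = "Red"  },
    new MyThing { ID = "2", Name = "Box",   Type = "Crate", Colour = "Blue" },
    new MyThing { ID = "3", Name = "Brick", Type = "Toy",   Colour = "Red"  },
};
// distinctList holds the Ball and the Box: one representative per Colour.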
Here’s the caveat… You should not use this with large collections.
Sets (and other hashed collections) use a system of “buckets” (generally implemented with arrays) to store the items. Which bucket your item goes in is determined by the hash code. You can see some of this in the for statement of the Set.Find method above. So if every item returns the same hash code, every item will go in the same bucket, and every Add has to scan that one bucket for a duplicate. Scanning gets slower the larger the bucket grows, so building the set degrades to O(n²) comparisons.
So only use this trick when you know that the source collection is small (say less than a few thousand).