Wednesday, 12 December 2012

Stop VS2012 shouting!

  1. Start PowerShell
  2. copy/paste the following
    Set-ItemProperty -Path HKCU:\Software\Microsoft\VisualStudio\11.0\General -Name SuppressUppercaseConversion -Type DWord -Value 1

Next time you start VS the menus will be in mixed case.

See http://blogs.msdn.com/b/zainnab/archive/2012/06/14/turn-off-the-uppercase-menu-in-visual-studio-2012.aspx for more detail.

Monday, 10 December 2012

LuceneNet Sparse Faceted Search

I recently had a use for faceted search in a project using Lucene.net. The contrib project for Lucene.Net provides SimpleFacetedSearch which is great, but… it becomes very inefficient when your index has lots of values in the given field. So, for example, if you have a product category code that only has a hand full of values then SimpleFS is fine. If you facet against date and you have significant history then SimpleFS will eat all your memory very quickly.

SimpleFS represents a facet as a bitmap for each value. The size of the bitmap is equal to the number of documents in your index in bits (so numDocs/8 bytes). A quick calculation based on your index will give you very big numbers if you have large numbers of documents, values or both. In our case we have 100K values and 3.5M documents = 100K * (3.5M/8) = around 41GB!

SparseFacetedSearch to the rescue. SparseFS can deal with facets with many thousands of values (100’s of thousands in my case). Using much less memory. It is based on SimpleFacetedSearch but uses DocID lists instead of bitmaps. It’s suitable for high cardinality, sparsely populated facets. i.e. There are a large number of facet values and each facet value is hit in a small percentage of documents.
The memory usage is related to the number of hits each document has. In the product category example this would be exactly 1. The break even point is if you have more than 32 values. In our case we generally have between 1 and 5 values per document. Our memory usage is around 600MB. Quite a saving on 41GB!

There is a bunch of math that goes to explaining how that works out.

Wednesday, 29 August 2012

IEqualityComparer trick

In a previous post I showed that Distinct  uses GetHashCode to spot different values and used GroupBy to workaround.
But this isn’t actually completely true.
The code for Distinct looks like this…
public static IEnumerable<TSource> Distinct<TSource>(this IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
{
  if (source == null)
    throw Error.ArgumentNull("source");
  else
    return Enumerable.DistinctIterator<TSource>(source, comparer);
}

private static IEnumerable<TSource> DistinctIterator<TSource>(IEnumerable<TSource> source, IEqualityComparer<TSource> comparer)
{
  Set<TSource> set = new Set<TSource>(comparer);
  foreach (TSource source1 in source)
  {
    if (set.Add(source1))
      yield return source1;
  }
}

So if it can add the item to a Set<T> then item will be returned. So duplicate items are ignored. Quite sneaky!

Notice the Set is created with a comparer. Inside set.Add it looks like this…

private bool Find(TElement value, bool add)
{
  int hashCode = this.InternalGetHashCode(value);
  for (int index = this.buckets[hashCode % this.buckets.Length] - 1; index >= 0; index = this.slots[index].next)
  {
    if (this.slots[index].hashCode == hashCode && this.comparer.Equals(this.slots[index].value, value))
      return true;
  }
  // other stuff trimmed
}

It’s the if statement that is interesting. What it implies is that if the hash code is the same as another entry then it will separate the two values using comparer.Equals.

What if we supply a comparer that always returns a fixed value for GetHashCode (like 0). Then it will always have to use Equals.

public class AlwaysEqualsEqualityComparer<T> : IEqualityComparer<T>
{
    public AlwaysEqualsEqualityComparer(Func<T, T, bool> comparer)
    {
        this.comparer = comparer;
    }

    private readonly Func<T, T, bool> comparer;

    public bool Equals(T x, T y)
    {
        return comparer(x, y);
    }

    public int GetHashCode(T obj)
    {
        return 0;
    }
}

So, given I have a collection of these…

public class MyThing
{
  public string ID { get; set; }
  public string Name { get; set; }
  public string Type { get; set; }
  public string Colour { get; set; }
}

I can use the new comparer like this…

var distinctList = thingList.Distinct(new AlwayEqualsEqualityComparer<MyThing>((x,y) => x.Colour == y.Colour);

…and get a new collection of Colours.


Here’s the caveat… You should not use this with large collections.

Sets (and other collections) use a system of “buckets” (generally implemented with arrays) to store the items. Which array your item is in is determined with the hash code. You can see some of this in the for statement of the Set.Find method above. So if every item returns the same hash code, every item will go in the same bucket. Scanning a bucket for a duplicate value will get slower the large the bucket.

So only use this trick when you know that the source collection is small (say less than a few thousand).

Distinct - Doesn’t work the way you think it does

I recently wanted to retrieve the set of unique values out of a list of objects. Say the object looked like this…
public class MyThing
{
  public string ID { get; set; }
  public string Name { get; set; }
  public string Type { get; set; }
  public string Colour { get; set; }
}

I have 100 of these in a List<MyThing>. How do I get a list of unique Type/Colour combinations. You might think that the obvious way would be to use the Distinct linq operator. It gets a little complicated because you need to define an IEqualityComparer<MyThing> to define which objects are equal (ie where Type and Colour are the same). So you need one of these…

public class MyThingComparer : IEqualityComparer<MyThing>
{
  public bool Equals(MyThing x, MyThing y)
  {
    // Object.ReferenceEquals and null check left out for clarity

    return x.Type == y.Type && x.Colour == y.Colour;
  }

  public int GetHashCode(MyThing obj)
  {
    return obj.GetHashCode();
  }
}

Seems straight forward enough. We’re checking that the Type and Colour properties are the same in two objects. So we try…

var distinctList = thingList.Distinct(new MyThingComparer());

…expecting a nice short list of unique Type/Colour combinations. What we actually get is the whole list. Much head scratching and ranting ensues.

Then you run into a post from 2008 http://blog.jordanterrell.com/post/LINQ-Distinct()-does-not-work-as-expected.aspx and you learn that it isn’t using you’re Equals function. It’s using GetHashCode to determine the objects with the same hash code are equal. Once you calm down and think about it, it kind of makes sense. But is a little counter intuitive. To quote from the MSDN page for Object.GetHashCode
“A hash code is a numeric value that is used to identify an object during equality testing.”
It goes on to describe IEqualityComparer<T> as a “hash code provider”.

So we could try to figure out some algorithm to calculate a hash code that is equal and unique for our combination of property values. But this is incredibly difficult to get provably correct.

There is another way…

var distinctList = thingList.GroupBy(x => new { x.Type, x.Colour }).First();

We’re using an anonymous object to group by the required properties then taking the first out of the group.
Yay, it works! But let’s dig a little.

Enumerable.GroupBy says…
“returns a collection of IGrouping<TKey, TElement> objects, one for each distinct key that was encountered.”

“The default equality comparer Default is used to compare keys.”
Oops, there’s that “distinct” word again. Read a little further and we discover that this “Default” thing is a IEqualityComparer that uses Object.GetHashCode. Aren’t we back where we started? Why does the GroupBy seem to work as expected? A little further research leads to http://odetocode.com/blogs/scott/archive/2008/03/25/and-equality-for-all-anonymous-types.aspx Where the summary states…
“Turns out the C# compiler overrides Equals and GetHashCode for anonymous types. The implementation of the two overridden methods uses all the public properties on the type to compute an object's hash code and test for equality. If two objects of the same anonymous type have all the same values for their properties – the objects are equal. This is a safe strategy since anonymously typed objects are essentially immutable (all the properties are read-only). Fiddling with the hash code of a mutable type gets a bit dicey.”
So an anonymous type implements GetHashCode as a combination of each of the properties of the type. So we can rely on a bunch of clever engineers to figure out how to figure out the algorithm.

Friday, 24 August 2012

WebAPI - AddWithoutValidation method not found

If you are using WebAPI and have recently installed VS2012. Your WebAPI stuff will be broken.
Your controller method will be called then you’ll just get a 500 response. Poking VS got it to give up the underlying exception which is "Method not found System.Net.Http.HttpHeaders.AddWitoutValidation".

Several hours of spelunking, trying framework source stepping, break on exception, beating it with a big stick and compiling from a command line with “msbuild /v:d” which shows assembly reolution resulted in realising that VS was compiling against the correct assemblies (ie I’d previously grabbed the RC from nuget). But…

Using the Fusion Log Viewer (fuslogvw) showed that when I ran the project the System.Net.Http assembly was being redirected to the new framework version instead of the file reference to my copy of the dll.

Here’s my solution: Add your own assembly redirect to ensure the right version of the assembly is used.
Simply add the following section to your app.config (if you are self hosting) or to web.config (if you are hosting in IIS).

<runtime>
<assemblyBinding xmlns="urn:schemas-microsoft-com:asm.v1">
<dependentAssembly>
<assemblyIdentity name="System.Net.Http" publicKeyToken="b03f5f7f11d50a3a" culture="neutral" />
<bindingRedirect oldVersion="1.0.0.0 - 2.0.0.0" newVersion="2.0.0.0"/>
</dependentAssembly>
</assemblyBinding>
</runtime>

Some will ask “Why bother? Why not just use the RTM?”. Well, I have a new version of our product about to RTM. I really don’t want to have to refactor a bunch of code and have it all go back through QA and regression testing.
We will move on to WebAPI RTM but not until our next version.

This is the risk that the decision to make 4.5 an in-place upgrade exposes us to. What other subtle changes in the framework are there?

I’m glad we test and deploy from build servers and not from a developers machine. You should check that your build environment is compatible with your production/runtime environment.