Thursday, March 27, 2008

Expression Trees: Why LINQ to SQL is Better than NHibernate

In my last post I described how the Where() function works for LINQ to Objects via extension methods and the yield statement. That was interesting. But where things get crazy is how the other LINQ technologies, like LINQ to SQL use extension methods. In particular it’s their use of a new C# 3 feature called expression trees that makes them extremely powerful. And it’s an advantage that more traditional technologies like NHibernate will never touch until they branch out from being a simple port of a Java technology. In this post I’ll explain the inherent advantage conferred on LINQ technologies by expression trees and attempt to describe how the magic works.

What’s so Magic about LINQ to SQL?

LINQ to SQL (and it’s more powerful unreleased cousin LINQ to Entities) is a new Object Relational Mapping (ORM) technology from Microsoft. It allows you to write something like the following:

IEnumerable<Product> products = northwindDataContext.Products.Where(
      p => p.Category.CategoryName == "Beverages"
      );

Which as you’d expect returns products from the database whose category is Beverages. But wait, aren’t you impressed? If not read over that code again, you should be very impressed. In the background that C# code is converted into the following SQL:

SELECT [t0].[ProductID], [t0].[ProductName], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1]
ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE [t1].[CategoryName] = @p0

In other words it’s pretty smart. It isn’t just returning all products and filtering them in memory using the LINQ to Objects version of Where() I discussed previously.

Doing something like that using NHibernate Criteria would require something like this:

ICriteria c = session.CreateCriteria(typeof(Product));
c.Add(Expression.Eq("Category.CategoryName", "Beverages"));
IEnumerable<Product> products = c.List<Product>();

You could use HQL too, but both NHibernate options suffer from the same problem. Did you spot it?

The LINQ to SQL version is taking actual strongly typed C# code and somehow smartly converting it to useful SQL. The NHibernate version does the same thing, but always using a weakly typed alternative. In other words the column “CategoryName” in NHibernate is a string. If it or its data type change in NHibernate you won’t find out until runtime. And that is the beauty of LINQ to SQL: you’ll find more errors at compile time. And if you’re like me you want the compiler to find your mistakes before the unit tests that you (or your fellow developers) may or may not have written do.

So you’re probably now wondering if you can put strongly typed C# in your where clause and it somehow magically gets converted to SQL, what’s the limit? If you put in a String.ToLower() or StartsWith() will it get converted to equivalent SQL? What about a loop or conditional? A function call? A recursive function call? At some point it has to break down and either return all products and filter them in memory or just fail right? Before answering those questions we need to understand what’s going on.

Understanding the Magic

The Magic happens in a class called Expression<T>. Expression takes a generic argument that must be a delegate and is usually one of the built in Func methods.  However the class can only be instantiated to a lambda expression. That’s right, not a delegate or anonymous method, only a Lambda expression. So in my deferred execution post where I explained what Lambda expression are, I said they were essentially syntactic sugar for an anonymous methods. Well, the emphasis is on the essentially, because they really aren’t sugar at all. When you assign a lambda expression to an Expression, the compiler, rather than generating the IL to evaluate the expression, generates IL that constructs an abstract syntax tree (AST) for the expression! You can then parse the tree and perform actions based on the code in the lambda expression.

Below is an example adapted from the .Net Developer’s guide on MSDN that shows how this works:

// convert the lambda expression to an abstract syntax tree
Expression<Func<int, bool>> expression = i => i < 5;

ParameterExpression param = (ParameterExpression)expression.Parameters[0];
// this next line would fail if we change the Lambda expression much
BinaryExpression operation = (BinaryExpression)expression.Body;
ParameterExpression left = (ParameterExpression)operation.Left;
ConstantExpression right = (ConstantExpression)operation.Right;

Console.WriteLine("Decomposed expression: {0} => {1} {2} {3}",
      param.Name,
      left.Name,
      operation.NodeType,
      right.Value
      );

This outputs “Decomposed expression: i => i LessThan 5”. The first line is the most important. It defines an Expression that takes a delegate with a single int parameter and a return type of bool. It then instantiates the Expression to a simple lambda expression.  Incidentally this would also work if we defined our own Delegate:

public delegate bool LessThanFive(int i);

public static void DoStuff() {
      Expression<LessThanFive> expression = i => i < 5;
}

It would, however, not work if we used an anonymous method:

Expression<Func<int, bool>> expression = delegate(int i) { return i < 5; };

While that looks legal it actually results in the compile time error “An anonymous method expression cannot be converted to an expression tree.”

There is a lot of complexity in parsing the AST, far beyond the scope of this article. However, the MSDN does have a nice diagram that helps explain how the following slightly more complicated Lambda expression that determines if a string has more letters than a number:

Expression<Func<string, int, bool>> expression =
    (str, num) => num > str.Length;

How Deep Does The Rabbit Hole Go?

So LINQ to SQL uses this Expression Tree technique to parse a plethora of possible code that you could throw at it and turn it into smart SQL. For instance check out a couple of the following conversions that LINQ to SQL will (or will not) perform:

p => p.Category.CategoryName.ToLower() == "beverages"

Results In:

SELECT [t0].[ProductID], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1] ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE LOWER([t1].[CategoryName]) = @p0

Not bad, huh? How about:

p => p.Category.CategoryName.Contains("everage")

That results in the following SQL snippet:

WHERE [t1].[CategoryName] LIKE @p0

And it sets @p0 to “%everage%”. Pretty cool. Ok this will get it to fail though, right?

public static string GetCat() {
    return "Beverages";
}

IEnumerable<Product> products = northwindDataContext.Products.Where(
      p => p.Category.CategoryName == GetCat()
      );

It turns out that LINQ to SQL will look inside of other functions! Alright, there’s no way it can do complicated conditionals:

p => p.Category.CategoryName ==
    "Beverages" ? p.UnitsInStock < 5 : !p.Discontinued

This should only pick up Beverages that have fewer than 5 items in stock regardless of whether they are discontinued and any other products that aren’t discontinued. Would you believe that it runs a single SQL statement:

SELECT [t0].[ProductID], ...
FROM [dbo].[Products] AS [t0]
LEFT OUTER JOIN [dbo].[Categories] AS [t1] ON [t1].[CategoryID] = [t0].[CategoryID]
WHERE (
    (CASE
        WHEN [t1].[CategoryName] = @p0 THEN
            (CASE
                WHEN [t0].[UnitsInStock] < @p1 THEN 1
                WHEN NOT ([t0].[UnitsInStock] < @p1) THEN 0
                ELSE NULL
             END)
        ELSE CONVERT(Int,
            (CASE
                WHEN NOT ([t0].[Discontinued] = 1) THEN 1
                WHEN NOT NOT ([t0].[Discontinued] = 1) THEN 0
                ELSE NULL
             END))
     END)) = 1

Wow, it sure isn’t pretty, but it scales to multiple conditionals, and most importantly it didn’t return all products and process them in memory. Not bad.

Conclusion

I asserted up front that using expression trees and the strong typing that comes with them is the reason LINQ to SQL is inherently better that NHibernate. I really can’t make that claim without admitting one of LINQ to SQL’s biggest shortcomings: It currently does not support multiple table inheritance. Ultimately, however, it’s a short term fault since the forthcoming LINQ to Entities does. And I stand by my claim because from a long term perspective as long as technologies like NHibernate remain pure ports of Java code they will never realize the full benefits of equivelant LINQ technologies that take advantage of .Net's native strengths: like expression trees.

Friday, March 14, 2008

How System.Linq.Where() Really Works

After writing my last blog entry on Deferred Execution in LINQ I had a conversation with Seth Schroeder who rightly pointed out among other things that I really didn't show how LINQ's deferred execution works internally. So in this post I wanted to implement my own LINQ Where() extension method based off of the one in the System.Linq namespace. So I'll show you the code, explain interesting parts of how it works including collection initializiers and extension methods, and then explain where the deferred execution behavior comes from (i.e. the yield statement). I will only explain in the context of LINQ to Objects since that's far simpler than other Linq's. I will implement a Where() like LINQ to SQL does in a later blog post (that's where things get really crazy).

Implementing MyWhere()

Let's start out with some code. The first question is does this compile?

using System;
using System.Collections.Generic;
using MyExtensionMethods;

namespace PlayingWithLinq {
    public class LinqToObjects {
        public static void DoStuff() {
            IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};

            IEnumerable<int> result = ints.MyWhere(i => i < 5);

            foreach (int i in result) {
                Console.WriteLine(i);
            }
        }
    }
}

namespace MyExtensionMethods {
    public static class ExtensionMethods {
        public static IEnumerable<TSource> MyWhere<TSource>(
            this IEnumerable<TSource> source,
            Func<TSource, bool> predicate
            ) {

            foreach (TSource element in source) {
                if (predicate(element)) {
                    yield return element;
                }
            }
        }
    }
}

Side note: putting two namespaces in on file is far from a best practice, but yes that is allowed.

Lambdas and Collection Initializers

If you're new to C# 3.5 then your first thought may be that:

IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};

is not allowed. Actually it is. It's the collection initializer syntax that I initially whined about in my post C# 3.0: The Sweet and Sour of Syntactic Sugar (ironically I actually like this syntax the more I use it.)

Your next thought may be that:

i => i < 5

is not legitimate. This is in fact a Lambda Expression, and as I explained in Deferred Execution, The Elegance of LINQ it conceptually compiles down to an anonymous method. Incidentally those that know Groovy (myself not included) or Lisp may know this as a closure since as we'll see later it can access local variables.

Extension Methods

Ok, the .Net Framework certainly has no MyWhere() function on the List object so this certainly wouldn't compile in C# 2. But that's where C# 3's Extension Methods come in. The "this" in:

MyWhere<TSource>(this IEnumerable<TSource> source,

says that MyWhere() can be applied to any generic IEnumerable. If you want to, you can still call MyWhere() normally:

IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};
ExtensionMethods.MyWhere(ints, i => i < 5);

And in fact this is what the compiler does in the background when you call MyWhere() off of an IEnumerable. But now with extension methods you don't have to.

But does MyWhere() now exist on all IEnumerable objects everywhere? No, it turns out you only get MyWhere() when you import the namespace it exists in (MyExtensionMethods). Incidentally unlike Groovy and Ruby there is no way to add an extension method to a class itself, only to instances.

Whose got the Func()?

The last two questionable parts of the code are the Func<TSource, bool> and the yield. Func is pretty easy. It's simply one of several new predefined delegates (method signatures) that comes with the .Net framework off of the System namespace. The two generic argument one above will match any function that returns the second generic argument and takes the first generic argument as a parameter. It looks like this:

delegate TResult Func<T, TResult>(T arg1);

So rather than using a Lambda expression in my initial example I could have been very explicit about the delegate instance (myFunc):

public static void DoStuff() {
      IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};

      Func<int, bool> myFunc = IsSmall;
      IEnumerable<int> result = ints.MyWhere<int>(myFunc);

      foreach (int i in result) {
            Console.WriteLine(i);
      }
}

public static bool IsSmall(int i) {
      return i < 5;
}

And that would have done the same thing. Notice I had to specify the generic type on the call to MyWhere() since the compiler can't infer the type in this example.

Yield

Now the really interesting part: yield. Yield is what makes deferred execution work. It actually was introduced with C# 2.0, but I don't think anyone really used it (I didn't know about it until recently). So because MyWhere() returns an IEnumerable (and because it isn't anonymous and doesn't have ref or out parameters) it is allowed to use the yield statement. When a method has a yield return (or yield break) statement, then execution of the method doesn't even begin until a calling method first iterates over the resulting IEnumerable. Execution then begins in the method and runs to the first yield statement, returns a result, and passes execution back to the caller. When the calling method iterates to the next value execution continues in the method where it left off until it gets to the next yield statement and then it passes execution back to the caller again and so on. Weird huh? Joshua Flanagan has a nice article that explains this in more detail along with some of the nice benefits like a smaller memory footprint.

So here's a quiz. What happens when you execute the following code?

IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};

IEnumerable<int> result = ints.MyWhere<int>(i => i < 4);

ints.Add(0);

foreach (int i in result) {
      Console.WriteLine(i);
}

Without the yield you'd get the numbers 3 through 1 since you added 0 after the call to MyWhere(). But since the yield in MyWhere() (and the Where() in System.Linq) defers execution until the foreach statement, you actually get 3 through 0. Ready for a little more mind bending? How about this:

IList<int> ints = new List<int>() {9,8,7,6,5,4,3,2,1};

int j = 4;

IEnumerable<int> result = ints.MyWhere<int>(i => i < j);

ints.Add(0);
j = 3;

foreach (int i in result) {
      Console.WriteLine(i);
}

Does the state of j get captured? My intuition would say yes. If so you'd expect 3 through 0. Well, the closure part of anonymous methods and lambdas work by keeping a reference to their calling object (this). So consequently they always get the most up to date value of a variable. So if your intuition works like mine you'd be wrong. You actually get the numbers 2 through 0. Crazy huh? And definitely something I hope I won't run into in someone's code (JetBrains ReSharper actually warns you if you do something crazy like this).

Conclusion

If this made sense then you should have a pretty solid grasp of how most of Linq to Objects works. Understanding extension methods, Func delegates, and yield statements should form the majority of what Linq does. Well, except for expression trees. But that's a topic for another post. Please post if this doesn't make sense or if I got it all wrong, I'd love to hear from you.