How can I perform queries across sets or sequences of items in C#?

Using LINQ (Language Integrated Query).

Sequences, LINQ, Rx, & Reaqtor Part 2: LINQ

By Ian Griffiths, Technical Fellow, endjin. 13th May 2021

To understand Reaqtor, it is necessary to understand Rx (the Reactive Extensions for .NET). And to understand Rx, it is necessary to understand how C# works with sequences of items. In this series I will outline the ideas at the heart of Reaqtor, and how they are handled in C#.

Version 3.5 of the .NET Framework added a set of features which, amongst other things, made it significantly easier to work with sets of items. The features were known, collectively, as LINQ: Language Integrated Query. The headline capability of LINQ didn't obviously seem to have anything to do with IEnumerable<T>: its most touted feature was the ability to write database queries directly in C#, e.g.:

var openOrderIds =
    from order in dbContext.Orders
    where order.Status == OrderStatus.Open
    select order.Id;

Such queries could be translated at runtime into SQL. So although this code uses C#'s == operator in a way that appears to be inspecting the Status field of every order, in practice that will be converted into a SQL WHERE clause, meaning that the filtering will in fact be performed by the database server (which is generally what you want—downloading an entire copy of a table and then picking out the items you want is almost always a mistake).

However, while the marketing at the time put the spotlight on relational database support, the real power of LINQ came from the fact that none of the features added to support it had anything to do with databases. The new language and library features (new in 2008, that is) at work in that code snippet are:

A query syntax
A set of standard 'operators' for performing operations over sets of data
Lambdas, i.e. the ability to write an expression whose value is a function
The option to have the compiler translate lambdas into data structures describing those expressions instead of runnable code

It's the first of these that makes examples like the one above most resemble database queries. It's arguably also the least interesting, because you could remove it from the language without losing any of the functionality LINQ enables. Moreover, it obscures things a little—it's not obvious that the other features in that list are in use because of the way query syntax hides certain details. (Some people insist on not using query syntax for this reason. I think that's overkill because there are some situations in which query syntax enables simpler code that is easier to read, particularly in scenarios where let clauses are useful.)
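As an illustration of where let clauses help, here is a minimal sketch assuming C# 9 top-level statements and an in-memory array. The words array and the variable names are hypothetical, and the method-call form is only roughly what the compiler generates for the query: it threads both values through an anonymous type.

```csharp
using System;
using System.Linq;

// Hypothetical sample data, not from the article.
var words = new[] { "sequence", "query", "rx", "operator" };

// Query syntax: 'let' names an intermediate value once and reuses it.
var viaLet =
    from word in words
    let length = word.Length
    where length > 2
    orderby length
    select $"{word} ({length})";

// Roughly equivalent method calls: the intermediate value has to be
// carried along explicitly in an anonymous type.
var viaMethods = words
    .Select(word => new { word, length = word.Length })
    .Where(x => x.length > 2)
    .OrderBy(x => x.length)
    .Select(x => $"{x.word} ({x.length})");

Console.WriteLine(string.Join(", ", viaLet)); // query (5), sequence (8), operator (8)
Console.WriteLine(viaLet.SequenceEqual(viaMethods)); // True
```

Both forms produce the same results; the query form simply hides the plumbing.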

The C# compiler translates queries into a series of method calls at compile time, with each call being invoked on the return value of the preceding one, leading to a chain of method calls. The query shown above is equivalent to this:

var openOrderIds = dbContext
    .Orders
    .Where(order => order.Status == OrderStatus.Open)
    .Select(order => order.Id);

If you were to inspect the output of the compiler for this and the previous snippet you would not be able to tell which of the results came from which input—the compiler simply translates the first form into the second form before going on to compile the code. This is why the query syntax isn't hugely interesting—it just provides a different look for your code.
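The equivalence can be tried out without a database. This is a minimal sketch, assuming C# 9 top-level statements; the orders array of tuples with a string Status is a hypothetical in-memory stand-in for dbContext.Orders, used only to keep the snippet self-contained.

```csharp
using System;
using System.Linq;

// Hypothetical in-memory stand-in for dbContext.Orders.
var orders = new[]
{
    (Id: 1, Status: "Open"),
    (Id: 2, Status: "Closed"),
    (Id: 3, Status: "Open"),
};

// Query syntax...
var viaQuerySyntax =
    from order in orders
    where order.Status == "Open"
    select order.Id;

// ...and the chain of method calls it is translated into.
var viaMethodCalls = orders
    .Where(order => order.Status == "Open")
    .Select(order => order.Id);

Console.WriteLine(viaQuerySyntax.SequenceEqual(viaMethodCalls)); // True
```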

In this second form, we can now see the operators more clearly: the Where and Select methods are examples of LINQ operators. A LINQ operator is not a language feature. It's more of a convention: for each operator there is a standard name for the method, some expected behaviour, and a particular method signature. (In some cases, there might be multiple standard signatures, in which case the methods will have overloads for each form.)

For example, take the Where operator. The expected behaviour is that this will filter the items: the operator takes a filter function which must accept a single input—an item of whatever type the data set contains—and return a bool. If this function returns true for any particular item, that item will be included, and if it returns false, that item will be filtered out.
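A minimal sketch of that behaviour using LINQ to Objects, assuming C# 9 top-level statements; the numbers array and the isEven name are hypothetical.

```csharp
using System;
using System.Linq;

int[] numbers = { 1, 2, 3, 4, 5, 6 };

// The filter function: takes one item of the element type, returns a bool.
Func<int, bool> isEven = n => n % 2 == 0;

// Items for which the function returns true are kept; the rest are dropped.
int[] evens = numbers.Where(isEven).ToArray();

Console.WriteLine(string.Join(", ", evens)); // 2, 4, 6
```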

You might be wondering at this point "How is this a 'convention'? Isn't this just a method?" The answer is that there are many different implementations of the Where operator. The one you get depends entirely on the type of data set that you invoke it on. In the snippets above we are invoking Where on dbContext.Orders, and since I've not yet made it clear what that actually is, it's not yet possible to tell exactly which implementation of LINQ we're going to get.

So let's get clear.

Imagine dbContext is an Entity Framework context object, and its Orders property represents a particular table in the database. In that case, Orders would implement the IQueryable<Order> interface, an instance of the generic IQueryable<T> interface. This represents something slightly more abstract than IEnumerable<T>: it is some set of items of type T. One important distinction is that unlike with IEnumerable<T> there is no presumption of ordering. (This matters if the data set is ultimately a table in a database, because queries may well be executed in parallel, or in some cunning highly optimized way that means the ordering of elements is essentially unknowable while the query is in progress.) And the other important distinction is that an IQueryable<T> gets to decide exactly how queries are ultimately processed.

To expand on that last point, it's important to understand that there are several different LINQ implementations. Microsoft seems not to use these terms so much any more, but when LINQ was first introduced, they talked about "LINQ to Objects", "LINQ to SQL", and "LINQ to XML". LINQ was designed explicitly to be open, so that anyone could implement their own "LINQ to Something".

The particular implementation of LINQ you get depends first of all on the static type of the object on which you invoke the operator. For example, if you invoke Where on something that implements IEnumerable<T>, this ends up invoking the Enumerable.Where extension method, and you are now using LINQ to Objects. If on the other hand you invoke it on something that implements IQueryable<T> then you end up invoking the Queryable.Where extension method, at which point the target object can decide at runtime exactly how to process the query. (And if you are writing your own "LINQ to Whatever" then you get to implement the Where method yourself, as either a member of, or an extension method for the target type.)
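The effect of the static type can be seen by wrapping the same array two ways. This is a minimal sketch, assuming C# 9 top-level statements; AsQueryable is the standard helper that exposes an in-memory collection as an IQueryable<T>, and the variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] numbers = { 1, 2, 3, 4 };

// Same underlying data, two different static types.
IEnumerable<int> asEnumerable = numbers;
IQueryable<int> asQueryable = numbers.AsQueryable();

// Binds to Enumerable.Where: the lambda becomes a delegate.
IEnumerable<int> fromEnumerable = asEnumerable.Where(n => n % 2 == 0);

// Binds to Queryable.Where: the lambda becomes an expression tree,
// and the result carries a description of the call.
IQueryable<int> fromQueryable = asQueryable.Where(n => n % 2 == 0);

Console.WriteLine(fromEnumerable is IQueryable<int>);  // False
Console.WriteLine(fromQueryable.Expression.NodeType);  // Call
```

The source text of the two Where calls is identical; only the declared type of the variable they are invoked on differs.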

IQueryable<T> is a special case, and a slightly odd one on first inspection: if it ends up letting the target object decide exactly what to do, then what's the point of it? Why not just have the target object implement Where (and any other operators it wishes to support) directly? Well, part of the answer is that by having a standard interface, it becomes possible to write code that does things with types that support LINQ, without needing to know what those types are. But in that case, why doesn't absolutely everything use it? What's the point of IEnumerable<T> if we have the more general purpose IQueryable<T>?

To understand why, it's worth looking at a difference between how standard operators look on these two types. Here's the method signature of the LINQ to Objects version of Where, supplied by the Enumerable class:

public static IEnumerable<T> Where<T>(
    this IEnumerable<T> source,
    Func<T, bool> predicate)

The this keyword on the first parameter makes this an extension method, meaning the C# compiler will let you invoke this method directly on anything that implements IEnumerable<T> as though it were a member of that interface, even though it isn't really. The interesting part is the second parameter: the predicate that determines whether each item should be let through. This is of type Func<T, bool>, which is an ordinary delegate type, so the argument you pass is essentially a reference to a function.

So if I write this:

var openOrders =
    someIEnumerableOfOrders
    .Where(order => order.Status == OrderStatus.Open);

the compiler will turn that into something like this:

var openOrders =
    someIEnumerableOfOrders
    .Where(FilterMethod);

...

private static bool FilterMethod(Order order)
{
    return order.Status == OrderStatus.Open;
}

So it takes that lambda expression I wrote as the argument to Where, turns it into a method, and then passes a reference to that method. So this is ordinary executable code. Now let's look at the IQueryable<T> version of Where:

public static IQueryable<T> Where<T>(
    this IQueryable<T> source,
    Expression<Func<T, bool>> predicate)

Aside from the obvious difference that this deals with IQueryable<T> and not IEnumerable<T>, there's a more subtle difference: the predicate argument is now of type Expression<Func<T, bool>>. The significance of this is that it tells the compiler to do something completely different. Instead of wanting a reference to a method, this form of Where operator says that it wants a data structure describing the expression. So the compiler turns this:

var openOrders =
    someIQueryableOfOrders
    .Where(order => order.Status == OrderStatus.Open);

into something more like this:

var orderParam = Expression.Parameter(typeof(Order), "order");
var predicate = Expression.Lambda<Func<Order, bool>>(
    Expression.Equal(
        Expression.Property(orderParam, "Status"),
        Expression.Constant(OrderStatus.Open)),
    orderParam);
var openOrders =
    someIQueryableOfOrders.Where(predicate);

In fact it's slightly more complex than that because of the way enum comparisons get handled (the compiler converts the enum operands to their underlying integer type before comparing them), but this illustrates the basic idea: the compiler generates code which, when executed at runtime, creates a tree of objects representing the structure of the expression you originally wrote.

So in this case, our order => order.Status == OrderStatus.Open gets turned into an object representing a lambda expression with a single parameter called order of type Order, and whose body is an object representing an equality comparison that compares the result of fetching the order parameter's Status property with the constant value OrderStatus.Open.
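The same structure can be inspected directly. This is a minimal sketch, assuming C# 9 top-level statements; it uses string.Length in place of the Order.Status property so that it needs no extra types, and all variable names are hypothetical.

```csharp
using System;
using System.Linq.Expressions;

// The compiler builds the tree for us when the target type is Expression<...>.
Expression<Func<string, bool>> fromLambda = word => word.Length == 4;

// Building the same shape by hand: parameter, property access, constant,
// equality comparison, and finally the lambda that wraps them.
ParameterExpression wordParam = Expression.Parameter(typeof(string), "word");
Expression<Func<string, bool>> handBuilt = Expression.Lambda<Func<string, bool>>(
    Expression.Equal(
        Expression.Property(wordParam, "Length"),
        Expression.Constant(4)),
    wordParam);

// The tree is data that can be inspected...
var body = (BinaryExpression)fromLambda.Body;
Console.WriteLine(body.NodeType);       // Equal
Console.WriteLine(body.Left.NodeType);  // MemberAccess
Console.WriteLine(body.Right.NodeType); // Constant

// ...and it can still be turned into runnable code on request.
Func<string, bool> compiled = handBuilt.Compile();
Console.WriteLine(compiled("LINQ"));    // True
```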

Anything implementing IQueryable<T> will end up with code like this when you use LINQ operators that take lambdas—they all compile into the expression tree form. And then the source object implementing IQueryable<T> gets to decide how to process that expression tree.

If the source represents some table in a relational database as in our original query example, it will convert this expression into a SQL query which it will then send to the database.

Objects representing other sources could do something similar. For example, something representing a collection in a Cosmos DB database could convert an expression into a query that Cosmos DB can process.
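To give a feel for what such a conversion involves, here is a deliberately tiny sketch, assuming C# 9 top-level statements. ToSql and SqlOperator are hypothetical helpers written for this illustration, not part of any real LINQ provider; they handle only the handful of node types in the sample predicate, whereas real providers are far more elaborate.

```csharp
using System;
using System.Linq.Expressions;

// A predicate captured as data rather than as executable code.
Expression<Func<int, bool>> predicate = quantity => quantity > 10 && quantity != 42;

// Hypothetical toy translator: walks the tree and emits SQL-like text.
// It throws on any node type it does not recognise.
string ToSql(Expression node) => node switch
{
    LambdaExpression lambda => ToSql(lambda.Body),
    BinaryExpression binary =>
        $"({ToSql(binary.Left)} {SqlOperator(binary.NodeType)} {ToSql(binary.Right)})",
    ParameterExpression parameter => parameter.Name ?? "?",
    ConstantExpression constant => constant.Value?.ToString() ?? "NULL",
    _ => throw new NotSupportedException($"Unsupported node: {node.NodeType}")
};

string SqlOperator(ExpressionType type) => type switch
{
    ExpressionType.Equal => "=",
    ExpressionType.NotEqual => "<>",
    ExpressionType.GreaterThan => ">",
    ExpressionType.LessThan => "<",
    ExpressionType.AndAlso => "AND",
    ExpressionType.OrElse => "OR",
    _ => throw new NotSupportedException($"Unsupported operator: {type}")
};

string whereClause = "WHERE " + ToSql(predicate);
Console.WriteLine(whereClause); // WHERE ((quantity > 10) AND (quantity <> 42))
```

The predicate is never executed here: the translator only reads its structure, which is exactly what the Expression<Func<T, bool>> parameter type makes possible.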

The upshot of all this is that LINQ defines a set of operators (which goes well beyond the simple filtering of the Where operator shown in this example—you can also do grouping, ordering, existential predicates, and numerous other jobs). Any type that supports LINQ can then offer some or all of these operators, and it can then implement them in whatever fashion it sees fit—it might simply run code on collections of objects (which is what LINQ to Objects does) but it might also translate queries into a different language to be executed remotely.

When Execution Occurs

In general, LINQ operators don't do anything immediately. When using either IEnumerable<T> or IQueryable<T> they only do work when you ask them for objects. For IEnumerable<T> this is important because it makes it possible to work with infinite sequences. If a Where clause attempted to perform filtering the moment you call the Where method, it wouldn't work on an infinite sequence because it would never finish. But in fact LINQ to Objects works on demand—when you chain together a series of operators, you end up with an IEnumerable<T> that only starts work when you obtain an enumerator and then call MoveNext().
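A minimal sketch of on-demand evaluation over an infinite sequence, assuming C# 9 top-level statements; NaturalNumbers is a hypothetical iterator written for this illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// An iterator that never ends: 1, 2, 3, ...
IEnumerable<int> NaturalNumbers()
{
    for (int i = 1; ; i++)
    {
        yield return i;
    }
}

// Calling Where does no filtering yet; it only describes the work.
IEnumerable<int> multiplesOfSeven = NaturalNumbers().Where(n => n % 7 == 0);

// Items are produced only as Take pulls them through the chain,
// so enumeration stops once three matches have been found.
List<int> firstThree = multiplesOfSeven.Take(3).ToList();

Console.WriteLine(string.Join(", ", firstThree)); // 7, 14, 21
```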

In general, IQueryable<T> implementations also defer their work although they have an additional motivation: in cases where they translate the query into something else (e.g. SQL) they want to wait until they know you've finished building the query. Until you actually start asking for results, a LINQ provider has no way of knowing if you're planning to append any more operators. Take this example from earlier:

var openOrderIds =
    dbContext.Orders
    .Where(order => order.Status == OrderStatus.Open)
    .Select(order => order.Id);

We could also have written this thus:

var orders = dbContext.Orders;
var whereQuery = orders.Where(order => order.Status == OrderStatus.Open);
var openOrderIds = whereQuery.Select(order => order.Id);

This has exactly the same effect: we've just put the result of each operator into a named variable rather than going on to invoke a method on it directly. If operators ran immediately, the Where operator would have run on its own before the Select was even applied, fetching entire rows. That's bad: what we actually want is a single SQL query that combines the Where and Select, and returns only the ids.
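The way separate steps accumulate into one query description can be observed with an in-memory IQueryable<T>. This is a minimal sketch, assuming C# 9 top-level statements; it uses AsQueryable over an array rather than a real database provider, and the variable names are hypothetical.

```csharp
using System;
using System.Linq;
using System.Linq.Expressions;

IQueryable<int> source = new[] { 1, 2, 3, 4 }.AsQueryable();

// Each step returns a new IQueryable<int>; nothing has been evaluated yet.
IQueryable<int> whereQuery = source.Where(n => n > 2);
IQueryable<int> selectQuery = whereQuery.Select(n => n * 10);

// The final query's expression tree describes the whole chain, so a provider
// that translates it sees the Where and the Select together.
var outerCall = (MethodCallExpression)selectQuery.Expression;
var innerCall = (MethodCallExpression)outerCall.Arguments[0];

Console.WriteLine(outerCall.Method.Name); // Select
Console.WriteLine(innerCall.Method.Name); // Where
```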

So when you use a LINQ operator, you don't normally get actual results back. You usually get a new instance of the same sort of data set you applied the operator to. For example, applying an operator to an IEnumerable<T> produces an IEnumerable<R> (where R may or may not be the same as T because some operators change the type—Select does that, for example).

This is called deferred execution: LINQ queries (for all widely used LINQ providers) don't do any work until you start asking them for results, which you typically do either by using a foreach loop on them, or by using LINQ's ToList operator, which evaluates the query and returns the results in a List<T>. (ToList is one of the standard LINQ operators that does in fact execute immediately; others that return a concrete result rather than another sequence, such as ToArray, Count, and First, behave the same way.)
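Deferred execution is easy to observe by counting how often the predicate runs. This is a minimal sketch using LINQ to Objects, assuming C# 9 top-level statements; the counter and variable names are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int[] numbers = { 1, 2, 3, 4, 5 };
int predicateCalls = 0;

// Building the query runs nothing: the predicate has not been called yet.
IEnumerable<int> evens = numbers.Where(n =>
{
    predicateCalls++;
    return n % 2 == 0;
});
int callsAfterBuilding = predicateCalls;

// ToList pulls every item through the query, which finally runs the predicate.
List<int> results = evens.ToList();
int callsAfterToList = predicateCalls;

Console.WriteLine(callsAfterBuilding); // 0
Console.WriteLine(callsAfterToList);   // 5
```

Enumerating evens a second time would run the predicate five more times, because the query is re-evaluated on each enumeration; the list returned by ToList is a snapshot that is not.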

In short, with the LINQ providers we've seen so far, operators do work when you pull information out of them. However, there is another way.

In the third part of this series, we'll delve into Rx.
