One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Guthrie's LINQ to SQL series), but for me anyway, it took some time to understand well enough to use effectively.
For instance, the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that have been newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. In this post I'll explain my naïve solution, describe how LINQ's deferred execution works (via lambda expressions), explain the problems with my solution, and then give the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.
Downloading the Latest Entries
Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom, RSS 2.0, RSS 1.0, and potentially other future formats, the class structure should be set up appropriately:
The Newspaper class primarily exists to enumerate feeds:
public class Newspaper {
    public void DownloadNow() {
        foreach (Feed objFeed in Settings.Feeds) {
            objFeed.DownloadRecentEntries(...);
        }
    }
}
The Feed class is abstract and at runtime is either an RssFeed or an AtomFeed. The relevant method, Feed.DownloadRecentEntries(), calls the abstract Feed.GetEntries() method, which returns a group of Entry objects.
public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);
        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}
The Feed subclasses, RssFeed and AtomFeed, then implement GetEntries() as follows:
public class RssFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                         >= this.LastDownloaded)
                     || this.LastDownloaded == null
               select (Entry)new RssEntry(item, this);
    }
}
public class AtomFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants(_atomNamespace + "entry")
               where (DateParser.ParseDateTime(
                         item.Element(_atomNamespace + "published").Value)
                         >= this.LastDownloaded)
                     || this.LastDownloaded == null
               select (Entry)new AtomEntry(item, this);
    }
}
Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.
What is this Deferred Execution Stuff?
If you already understand LINQ and how deferred execution works, feel free to skip this section. For everyone else, it's important to understand that the following query:
from item in rssfeed.Descendants(_atomNamespace + "entry")
where (DateParser.ParseDateTime(
          item.Element(_atomNamespace + "published").Value)
          >= this.LastDownloaded)
      || this.LastDownloaded == null
select (Entry)new AtomEntry(item, this);
is actually just syntactic sugar for the following set of statements:
rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(item => (DateParser.ParseDateTime(
                        item.Element(_atomNamespace + "published").Value)
                        >= this.LastDownloaded)
                   || this.LastDownloaded == null)
    .Select(item => (Entry)new AtomEntry(item, this));
Now XDocument.Descendants() returns IEnumerable<XElement>, and Where() and Select() are extension methods defined on IEnumerable<T>, which is why they can be chained like this.
But more important for the topic of deferred execution is the => operator, which introduces a lambda expression and is also new to C# 3.0. The best way to understand lambdas is that they are essentially syntactic sugar for an anonymous method (i.e. a type-safe function pointer to code). So we could again rewrite our code as follows:
rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(delegate(XElement item) {
        return (DateParser.ParseDateTime(
                   item.Element(_atomNamespace + "published").Value)
                   >= this.LastDownloaded)
               || this.LastDownloaded == null;
    })
    .Select(delegate(XElement item) {
        return (Entry)new AtomEntry(item, this);
    });
Back in familiar territory yet? If not, you probably aren't familiar with C# 2.0. In the background the compiler takes the anonymous methods above, turns them into methods on the current class, instantiates new delegates of the correct type that point to them, and passes those delegates to the Where() and Select() methods.
The key thing to note is that the arguments to Where() and Select() are delegates, and so when those delegates are executed is beyond our control. In fact, if you put a Console.WriteLine or a breakpoint inside the AtomEntry constructor, it won't get hit until the resulting IEnumerable is enumerated, specifically at the following line from the first code sample:
foreach (Entry objEntry in lstRecentPosts) {
So that's deferred execution. But understanding how it works and how to use it are completely different things.
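To see this concretely, here's a minimal, self-contained sketch of deferred execution (the names and data are invented for illustration; this is not code from Daily RSS Download). The predicate passed to Where() records each call, which lets us observe that nothing runs until the query is enumerated:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var log = new List<int>();
int[] numbers = { 1, 2, 3, 4 };

// Build the query. The lambda is just a delegate handed to Where();
// it has NOT been invoked yet.
IEnumerable<int> evens = numbers.Where(n => { log.Add(n); return n % 2 == 0; });

// Nothing has executed -- the log is still empty.
Console.WriteLine(log.Count); // prints 0

// Enumerating the query (here via ToList) finally runs the predicate.
var results = evens.ToList();
Console.WriteLine(log.Count);                 // prints 4
Console.WriteLine(string.Join(",", results)); // prints 2,4
```

If you set a breakpoint inside the lambda, it won't be hit until the ToList() line, for exactly the same reason the AtomEntry constructor above isn't hit until the foreach.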
The Inelegant Solution
Getting back to my code sample, you may have picked up that my where clause is the mistake. I implemented it this way because RSS and Atom have different field names for the published date. But the way I wrote it, I'd have to make two changes if I wanted to change which entries to download. OK, big deal; I'm extremely unlikely to make changes to that where clause, right? Or so I thought, until I wanted to set some defaults based on the average length of posts prior to downloading them. Basically:
public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
    IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
    double intAveragePostSize = lstRecentEntries.Average(
        i => i.Description.Length);
    // if the feed's posts are typically small then include the
    // description field in the summary and download the content
    // for the main article from the link
    if (intAveragePostSize < 1000) {
        ...
    } else {
        ...
    }
}
Except this now ties me to the where clause, when what I'd really like is just the average post size for the last couple of posts. The problem is that GetEntries() isn't generic enough.
The Elegant Solution
The solution, then, is to normalize out (excuse the database terminology) the where clause into the two methods that use GetEntries(). So GetEntries() becomes simple:
public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
    return from item in rssfeed.Descendants(_atomNamespace + "entry")
           select (Entry)new AtomEntry(item, this);
}
And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become slightly more complicated:
public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
        IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
        // get the five most recent posts
        IEnumerable<Entry> lstRecentEntries =
            from entry in lstEntries.Take(5)
            select entry;
        double intAveragePostSize = lstRecentEntries.Average(
            i => i.Description.Length);
        if (intAveragePostSize < 1000) {
            ...
        } else {
            ...
        }
    }

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstEntries = GetEntries(xdocFeed);
        // get newly published posts
        IEnumerable<Entry> lstRecentPosts =
            from entry in lstEntries
            where (entry.Published >= this.LastDownloaded)
                  || this.LastDownloaded == null
            select entry;
        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}
Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing has been executed yet, we're just building up the statement that will eventually run when the resulting IEnumerable is enumerated. So we've now spread our LINQ statements across an inheriting class and a base class, and in the process we've made GetEntries() extremely generic.
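The composition trick can be shown in miniature with a hypothetical sketch (the names and data here are made up, not from the real application): a "base" query with no where clause, and a second query layered on top of it. A counter in the first query's lambda proves that neither query executes until the final foreach:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

int calls = 0;
int[] ids = { 5, 1, 9, 3 };

// The generic "GetEntries-style" base query -- no where clause.
IEnumerable<int> baseQuery = ids.Select(i => { calls++; return i * 10; });

// A second query composed on top of the first, adding the filter.
IEnumerable<int> filtered = from x in baseQuery
                            where x >= 30
                            select x;

Console.WriteLine(calls); // prints 0 -- nothing has run yet

foreach (int x in filtered)
    Console.WriteLine(x); // prints 50, 90, 30

Console.WriteLine(calls); // prints 4 -- the whole pipeline ran once, on enumeration
```

Even though the filter lives in a different statement (and could live in a different class entirely), the two queries fuse into a single pipeline that runs once.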
Conclusion
So what's the big deal? The big deal is that we can spread our data access statements across multiple classes, and because of deferred execution we don't need to worry about the performance of generic methods closer to the data that don't contain a where clause. This may not be a huge deal in this example, but it becomes extremely powerful when the user interface tier can tack on "order by" statements or filters BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.
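As a final illustration, here's a hypothetical two-layer sketch (invented names, not from the real application): a generic "data access" method that returns an unfiltered query, which a "UI" layer then narrows and orders before anything runs:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Data-access tier: a deliberately generic query with no where clause.
static IEnumerable<string> GetTitles(string[] source) =>
    source.Select(s => s.ToUpper());

string[] store = { "beta", "alpha", "gamma" };

// UI tier tacks on a filter and an order by -- still nothing executed.
var query = GetTitles(store)
    .Where(t => t != "GAMMA")
    .OrderBy(t => t);

// Execution happens here, once, with the filter and ordering applied.
var page = query.ToList();
Console.WriteLine(string.Join(",", page)); // prints ALPHA,BETA
```

With a provider like LINQ to SQL, that same late composition would let the filter and ordering be translated into the SQL sent to the database, rather than applied after the fact in memory.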