Wednesday, February 20, 2008

Deferred Execution, the Elegance of LINQ

One of the things I love about LINQ is its deferred execution model. It's the type of thing that makes sense academically when you first read about it (e.g. in Part Three of Scott Guthrie's LINQ to SQL series), but for me anyway, it took some time to understand well enough to use effectively.

For instance, the Daily RSS Download open source application that I wrote about last week needs to download entries (posts) that have been newly published since the last download. While it isn't a complicated problem, my first attempt at a solution didn't use the power of LINQ correctly. I'll explain my naïve solution in this post, describe how LINQ's deferred execution works (and the lambda expressions behind it), explain the problems with my solution, then give the elegant solution that is only possible because of LINQ's deferred execution model. See if you can spot my error along the way.

Downloading the Latest Entries

Downloading the latest entries would be a ridiculously simple problem if there weren't multiple formats for RSS. But since the solution needs to support Atom and RSS 2.0 and 1.0 and potentially other future formats, the class structure should be set up appropriately:

The Newspaper class primarily exists to enumerate feeds:

public class Newspaper {
    public void DownloadNow() {
        foreach (Feed objFeed in Settings.Feeds) {
            objFeed.DownloadRecentEntries(...);
        }
    }
}

The Feed class is abstract and at runtime is either an RssFeed or an AtomFeed. The relevant method, Feed.DownloadRecentEntries(), calls the abstract Feed.GetEntries() method, which returns a sequence of Entry objects.

public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstRecentPosts = GetEntries(xdocFeed);

        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}

The Feed subclasses, RssFeed and AtomFeed, then implement GetEntries() as follows:

public class RssFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select (Entry)new RssEntry(item, this);
    }
}

public class AtomFeed : Feed {
    public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
        return from item in rssfeed.Descendants(_atomNamespace + "entry")
               where (DateParser.ParseDateTime(
                   item.Element(_atomNamespace + "published").Value)
                    >= this.LastDownloaded)
                    || this.LastDownloaded == null
               select (Entry)new AtomEntry(item, this);
    }
}

Yes, that's all LINQ to XML in there. It looks a lot like SQL, but as you'll see in a second it's really just glorified syntactic sugar. Expressive though, isn't it? While the astute reader may have already spotted the inelegance of my solution, for those unfamiliar with LINQ, let's first describe what AtomFeed.GetEntries() does.

What is this Deferred Execution Stuff?

If you already understand LINQ and how deferred execution works, feel free to skip this section. For everyone else, it's important to understand that the following query:

from item in rssfeed.Descendants("item")
               where (DateParser.ParseDateTime(item.Element("pubDate").Value)
                   >= this.LastDownloaded)
                   || this.LastDownloaded == null
               select
(Entry)new RssEntry(item, this);

is actually just syntactic sugar for the following set of statements:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where( item => (DateParser.ParseDateTime(
        item.Element(_atomNamespace + "published").Value)
        >= this.LastDownloaded)
        || this.LastDownloaded == null)
    .Select( item => (Entry)new AtomEntry(item, this));

Now XDocument.Descendants() returns an IEnumerable<XElement>, which most definitely does not have a Where() method on it. And if you look at the return type of Where() in this context, it also returns an IEnumerable<XElement>, which definitely does not have a Select() method on it. That's because Where() and Select() are extension methods, meaning you can attach them to just about any type. They're new to C# 3.0 and are mostly beyond the scope of this article.
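If you haven't run into extension methods before, here's a minimal sketch of the shape of the feature (the class and method names are made up purely for illustration); the this modifier on the first parameter is what lets you call the method as if it were defined on IEnumerable<XElement>:

// Illustration only; requires using System.Collections.Generic
// and System.Xml.Linq. The "this" keyword on the first parameter
// makes CountWithValue callable on any IEnumerable<XElement>.
public static class MyXElementExtensions {
    public static int CountWithValue(this IEnumerable<XElement> source, string value) {
        int count = 0;
        foreach (XElement element in source) {
            if (element.Value == value) {
                count++;
            }
        }
        return count;
    }
}

// usage: rssfeed.Descendants("item").CountWithValue("hello");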

But more important for the topic of deferred execution is the => operator, which introduces a lambda expression and is also new to C# 3.0. The best way to understand lambdas is that they are essentially syntactic sugar for an anonymous method (i.e. a type-safe function pointer to code). So we could again rewrite our code as follows:

rssfeed
    .Descendants(_atomNamespace + "entry")
    .Where(delegate(XElement item) {
        return (DateParser.ParseDateTime(
            item.Element(_atomNamespace + "published").Value)
            >= this.LastDownloaded) || this.LastDownloaded == null;
    })
    .Select(delegate(XElement item) {
        return (Entry)new AtomEntry(item, this);
    });

Back in familiar territory yet? If not, you probably aren't familiar with C# 2.0. Behind the scenes the compiler takes the anonymous methods above, turns them into methods on the current class, instantiates new delegates of the correct type that point to them, and passes those delegates to Where() and Select().
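In other words, the delegate version above ends up looking roughly like the following (the method names here are invented for illustration; the real compiler-generated names are unpronounceable):

// Roughly what the compiler produces, with invented, readable names:
// the anonymous method bodies become methods on the class, and
// delegates pointing at them are handed to Where() and Select().
private bool IsRecentlyPublished(XElement item) {
    return (DateParser.ParseDateTime(
        item.Element(_atomNamespace + "published").Value)
        >= this.LastDownloaded) || this.LastDownloaded == null;
}

private Entry CreateEntry(XElement item) {
    return (Entry)new AtomEntry(item, this);
}

public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
    return rssfeed
        .Descendants(_atomNamespace + "entry")
        .Where(new Func<XElement, bool>(IsRecentlyPublished))
        .Select(new Func<XElement, Entry>(CreateEntry));
}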

The key thing to note is that the arguments to Where() and Select() are delegates, so when those delegates execute is entirely up to those methods. In fact, if you put a Console.WriteLine or a breakpoint inside the AtomEntry constructor, it won't be hit until the resulting IEnumerable is enumerated, specifically at the following line from the first code sample:

foreach (Entry objEntry in lstRecentPosts) {

So that's deferred execution. But understanding how it works and knowing how to use it well are two different things.
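Before we get to that, if you want to see the deferred behavior in isolation, here's a tiny standalone sketch (nothing to do with the feed classes) that you can drop into a console app:

// Requires using System, System.Collections.Generic and System.Linq.
// Nothing prints from the Select until the foreach enumerates the query.
int[] numbers = { 1, 2, 3 };

IEnumerable<int> query = numbers.Select(n => {
    Console.WriteLine("projecting " + n);
    return n * 10;
});

Console.WriteLine("query built, nothing has executed yet");

foreach (int result in query) {    // the "projecting" lines print here
    Console.WriteLine("got " + result);
}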

The Inelegant Solution

Getting back to my code sample, you may have picked up that the where clause is the mistake. I implemented it that way because RSS and Atom use different element names for the published date. But as written, I'd have to make two changes if I ever wanted to change which entries get downloaded. Ok, big deal, I'm extremely unlikely to change that where clause, right? Or so I thought, until I wanted to set some defaults based on the average length of a feed's posts before downloading anything. Basically:

public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
    IEnumerable<Entry> lstRecentEntries = feed.GetEntries(rssfeed);
    double intAveragePostSize = lstRecentEntries.Average(
        i => i.Description.Length);
    // if the feed's posts are typically small then include the
    // description field in the summary and download the content
    // for the main article from the link
    if (intAveragePostSize < 1000) {
        ...
    } else {
        ...
    }
}

Except this now ties me to that where clause, when what I'd really like is just the average post size of the last few posts. The problem is that GetEntries() isn't generic enough.

The Elegant Solution

The solution, then, is to normalize the where clause out (excuse the database terminology) into the two methods that use GetEntries(). GetEntries() becomes simple:

public override IEnumerable<Entry> GetEntries(XDocument rssfeed) {
    return from item in rssfeed.Descendants(_atomNamespace + "entry")
           select (Entry)new AtomEntry(item, this);
}

And then Feed.CreateFeed() and Feed.DownloadRecentEntries() become a little more complicated:

public abstract class Feed {
    public abstract IEnumerable<Entry> GetEntries(XDocument rssfeed);

    public static Feed CreateFeed(string strUrl, int intDisplayOrder) {
        IEnumerable<Entry> lstEntries = feed.GetEntries(rssfeed);
        // get the five most recent posts
        IEnumerable<Entry> lstRecentEntries =
            from entry in lstEntries.Take(5)
            select entry;

        double intAveragePostSize = lstRecentEntries.Average(
            i => i.Description.Length);
        if (intAveragePostSize < 1000) {
            ...
        } else {
            ...
        }
    }

    public void DownloadRecentEntries(...) {
        XDocument xdocFeed = XDocument.Load(Url);
        IEnumerable<Entry> lstEntries = GetEntries(xdocFeed);
        // get newly published posts
        IEnumerable<Entry> lstRecentPosts = from entry in lstEntries
            where (entry.Published >= this.LastDownloaded)
                || this.LastDownloaded == null
            select entry;

        foreach (Entry objEntry in lstRecentPosts) {
            objEntry.Download(...);
        }
    }
}

Note that we now have a second LINQ statement that runs against the results of the LINQ statement in GetEntries(). But since nothing has executed yet, we're just building up the statement that will eventually run when the resulting IEnumerable is enumerated. We've now spread our LINQ statements across a base class and its subclasses, and in the process we've made GetEntries() extremely generic.

Conclusion

So what's the big deal? The big deal is that we can spread our data access statements across multiple classes, and because of deferred execution we don't pay a performance penalty for the generic, unfiltered methods that sit closer to the data. That may not matter much in this example, but it becomes extremely powerful when, say, the user interface tier can tack on "order by" clauses or filters BEFORE anything is executed against your data store. And that, for me, is at the heart of the beauty of LINQ.
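To make that last point concrete, here's a hypothetical sketch (not actual Daily RSS Download code) of a UI tier composing its own filter and ordering on top of the generic query a lower tier built:

// Hypothetical sketch (requires using System, System.Collections.Generic,
// System.Linq): a lower tier hands back a generic, unfiltered query...
IEnumerable<Entry> entries = feed.GetEntries(xdocFeed);

// ...and the UI tier tacks on its own where and order by. Because of
// deferred execution nothing has run yet; we're still just composing.
IEnumerable<Entry> view =
    from entry in entries
    where entry.Published >= DateTime.Today.AddDays(-7)
    orderby entry.Published descending
    select entry;

// Only here, when the composed query is enumerated, does any of it
// (including the projection inside GetEntries) actually execute.
foreach (Entry entry in view) {
    Console.WriteLine(entry.Published);
}

Nothing touches the XML until that final foreach, and the UI tier never had to know how GetEntries() was implemented.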

Thursday, February 14, 2008

Daily RSS Download

I published Daily RSS Download, my first open source project, on CodePlex* today. It's not going to change the world, but if you need this kind of functionality there's a decided lack of decent products that provide it. In this post I'll give a little background on why I wrote it and explain what it does and how to use it. Besides needing the functionality, I also wrote it to learn LINQ to Objects and LINQ to XML, but I'll cover the more interesting implementation details in a later post.

Why I Wrote It

For Christmas I received an IRex Iliad, which is an e-book reader combined with a Wacom tablet. It's an awesome product that lets you read PDFs (among other formats) and write on them. It's pricey, but the ability to jot notes on technical documents (in addition to recipes, guitar tablature, etc.) as you read is invaluable for me. I now read about twice as much as I did before. It supports Wi-Fi and, in particular, can connect to a computer on a regular basis to download files you put in a specific directory.

So theoretically it could download a customized newspaper for me every morning, right? I could have today's world news, national news, local news, technical news, weather, and my RSS feeds like Scott Guthrie's all in one place while I eat my cereal! And then I could cancel my Washington Post subscription, and after about 7.5 years I would have recouped the cost of the Iliad. Sweet.

The problem is that the product doesn't come with any way to download RSS feeds. Well, you can use software from MobiPocket, but it's a pain to set up and use, and I couldn't figure out how to have it run automatically on a daily basis. Furthermore, it can't grab the real content from the website if the RSS feed only contains an abstract (e.g. washingtonpost.com). I searched, and there was some software out there, but none of it did what I wanted. And of course none of it generated a manifest.xml file, which is an Iliad-specific file that links HTML pages together and gives names to groups of content (i.e. grouping the files in a directory to make a "book" called "My Daily News for February 13").

So what a great opportunity to write it myself and learn LINQ to XML and LINQ to Objects in the process.

What It Does

The end result (or the index page anyway) looks something like this:

The images are local, the links go to a full page of content, and on the Iliad, because Daily RSS Download generates a manifest.xml, the next and previous buttons can move you to the next or previous article and you can see at a glance how many articles there are.

If you want to recreate the screenshot above, first head over to the Releases page of Daily RSS Download, where you can download the MSI and install the application. When you open "Daily RSS Download Config" you'll see a home page like this:

You can type in an RSS or Atom URL and click Add Feed. The application will try to connect to the website, download the title, and set some configuration options based on the average length of posts (specifically if you put in a feed from the washingtonpost.com website it will detect that the average post size is small and determine that it should download the main content from the website).

You can click on any of the feeds you've added and you'll get a Feed Settings page like below:

The fields are mostly self-explanatory, but here are three of the more interesting settings:

Summary Source Values:

This setting determines where the abstract (summary) on the index page should come from. There are three options:

No Summary – Does not display a summary on the index.html page. This is what Scott Guthrie's feed was set to in the first screenshot.

Extract from the content – This takes the first 300 characters from the main content as the summary. This was set for the washingtonpost.com feed in the first screenshot (although Use the RSS description field would actually have been more appropriate).

Use the RSS description field – This uses the entire description field from the RSS (or Atom) feed. This is what the WeatherBug feed was set to in the first screenshot. Obviously this is a bad choice for a Scott Guthrie type of RSS entry, since he posts everything in the description field.

Content Source Values:

This setting determines where the main content page should get its value. There are three options:

No content, summary only – If you set a feed to this, then Daily RSS Download won't generate a content file. This would be a good choice for the weather feed in the example.

Use the RSS description field – The content file will be created from the RSS description field. This would be a good choice for a Scott Guthrie type of feed.

Download from the referenced web page – Daily RSS Download will download the page referenced by the RSS or Atom feed. This would be a good option for a washingtonpost.com type of feed.

Content Start/End Markers

These are regular expressions that are used if you set content source to download the referenced web page. You can leave them blank or you can set them if you want to try to strip out header, footer, navigation bars, etc. The content start marker in the screenshot:

\<div id=\"article_body\"[^\>]*\>

That says to match '<div id="article_body"' up through the next '>'. Both markers are exclusive (the text you're matching on won't be included in the results).
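For the curious, applying a start and an end marker boils down to something like the following sketch (this is just an illustration of the idea, not necessarily the exact code inside Daily RSS Download):

// Illustration only (requires using System.Text.RegularExpressions):
// find the start marker, find the end marker after it, and keep what's
// in between. Both markers are exclusive.
static string ExtractContent(string html, string startPattern, string endPattern) {
    Match start = Regex.Match(html, startPattern);
    if (!start.Success) {
        return html;                         // no start marker, keep everything
    }

    int contentStart = start.Index + start.Length;
    Match end = Regex.Match(html.Substring(contentStart), endPattern);
    if (!end.Success) {
        return html.Substring(contentStart); // no end marker, keep the rest
    }

    return html.Substring(contentStart, end.Index);
}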

Customizing the CSS

So that's it for the general settings and use. You can click "Download Now" on the main config page to download your feeds, and you can have it run on a recurring basis (it will only download new content) by setting up a scheduled task that runs "DailyRssDownload.exe DownloadNow". The only other thing of interest is making the content prettier.

The generated HTML is CSS customizable, so in order to get the two column look above (and/or make it look pretty on an Iliad) you can customize the CSS as below:

h1
{
      margin-top: 0px;
      /* A pretty linux script font since the Iliad has a linux kernel */
      font-family: Zapf Chancery;
      font-size: 30pt;
      margin-bottom: 0px;
}
h2
{
}
.NewsHeader
{
      border-bottom: solid 1px black;
      text-align: center;
}
.DailyRss_Date
{
      text-align: center;
}
.DailyRss_Feed
{
}
.DailyRss_Entry
{
}
.DailyRss_EvenEntry
{
}
.DailyRss_OddEntry
{
}

/* LEFT COLUMN */
#ScottGusBlog
{
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}
#washingtonpostcom-TodaysHighlights
{
      clear: both;
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}

/* RIGHT COLUMN */
#WeatherBugLocalWeatherfor20190
{
      margin-left: 50%;
}

So basically just use the old float left, width 50%, margin-left 50% trick to get the pretty two-column look (without tables).

Conclusion

I hope you find the Daily RSS Download open source project useful. Please feel free to submit suggestions, feature requests, defects or preferably defects AND patches on the project's CodePlex home page.

* In case you aren't aware, CodePlex is an open source project hosting website from Microsoft. It's similar to SourceForge, except there's no approval process for new projects and it integrates nicely with Visual Studio.