Thursday, February 14, 2008

Daily RSS Download

I published Daily RSS Download, my first open source project, on CodePlex* today. It's not going to change the world, but if you need what it does, there is a decided lack of decent products that do it. In this post I'll give a little background on why I wrote it, explain what it does, and show how to use it. Besides needing the functionality, I also wrote it to learn LINQ to Objects and LINQ to XML, but I'll cover the more interesting implementation details in a later post.

Why I Wrote It

For Christmas I received an IRex Iliad, which is an e-book reader combined with a Wacom tablet. It's an awesome product that allows reading PDFs (among other formats) and writing on them. It's pricey, but the ability to jot notes on technical documents (in addition to recipes, guitar tablature, etc.) as you read is invaluable to me. I now read about twice as much as I did before. It supports Wi-Fi, and in particular can connect to a computer on a regular basis to download files you put in a specific directory.

So theoretically it could download a customized newspaper every morning for me, right? I could have today's world news, national news, local news, technical news, weather, and my RSS feeds like Scott Guthrie's all in one place while I eat my cereal! And then I could cancel my Washington Post subscription and after about 7.5 years I would have recouped the cost of the Iliad. Sweet.

The problem is that the product doesn't come with any way to download RSS feeds. Well, you can use software from MobiPocket, but it's a pain to set up and use, and I couldn't figure out how to have it run automatically on a daily basis. Furthermore, it can't grab the real content from the website if the RSS feed only contains an abstract (e.g. washingtonpost.com). I searched and there was some software out there, but none of it did what I wanted. And of course none of it generated a manifest.xml file, which is an Iliad-specific file that links HTML pages together and gives names to groups of content (i.e. grouping the files in a directory to make a “book” called “My Daily News for February 13”).

So what a great opportunity to write it myself and learn LINQ to XML and LINQ to Objects in the process.

What It Does

The end result (or the index page anyway) looks something like this:

The images are local, the links go to a full page of content, and on the Iliad, because Daily RSS Download generates a manifest.xml, the next and previous buttons can move you to the next or previous article and you can see at a glance how many articles there are.

If you want to recreate the screenshot above, first head over to the Releases page of Daily RSS Download, where you can download the MSI and install the application. When you open “Daily RSS Download Config” you'll see a home page like this:

You can type in an RSS or Atom URL and click Add Feed. The application will try to connect to the website, download the title, and set some configuration options based on the average length of posts. For example, if you put in a feed from the washingtonpost.com website, it will detect that the average post size is small and determine that it should download the main content from the website.
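The detection step could be sketched roughly like this in Python. This is only an illustration, not the code Daily RSS Download actually uses: the 500-character threshold and the function name are assumptions, since the post does not say what counts as a "small" post.

```python
from statistics import mean

# Hypothetical cutoff: the application's real threshold is not documented here.
SHORT_POST_THRESHOLD = 500


def choose_content_source(descriptions, threshold=SHORT_POST_THRESHOLD):
    """Guess a Content Source setting from the average description length."""
    if not descriptions:
        # Assumption: with no entries to measure, fall back to the feed itself.
        return "Use the RSS description field"
    average_length = mean(len(text) for text in descriptions)
    if average_length < threshold:
        # Short descriptions are probably abstracts, so fetch the real page.
        return "Download from the referenced web page"
    return "Use the RSS description field"


# An abstract-only feed versus a feed that posts full articles.
print(choose_content_source(["A two line teaser."] * 5))
print(choose_content_source(["x" * 4000] * 5))
```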

You can click on any of the feeds you've added and you'll get a Feed Settings page like below:

The fields are mostly self-explanatory, but here are three of the more interesting settings:

Summary Source Values:

This setting determines where the abstract (summary) on the index page should come from. There are three options:

No Summary – Does not display a summary on the index.html page. This is what Scott Guthrie's feed was set to in the first screenshot.

Extract from the content – This takes the first 300 characters from the main content as the summary. This was set for the washingtonpost.com feed in the first screenshot (although Use the RSS description field would actually have been more appropriate).

Use the RSS description field – This uses the entire description field from the RSS (or Atom) feed. This is what the weatherbug feed was set to in the first screenshot. Obviously this is a bad choice for a Scott Guthrie type of RSS entry since he posts everything in the description field.
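As a rough illustration of the "Extract from the content" option, here is a minimal Python sketch. The 300-character limit comes from the description above; stripping tags and collapsing whitespace first is an assumption about what a sensible implementation would do, and the function name is hypothetical.

```python
import re
from html import unescape

SUMMARY_LENGTH = 300  # the limit described for "Extract from the content"


def extract_summary(content_html, length=SUMMARY_LENGTH):
    """Return the first `length` characters of the content as plain text."""
    text = re.sub(r"<[^>]+>", " ", content_html)  # assumption: drop HTML tags
    text = unescape(text)                         # decode entities like &amp;
    text = " ".join(text.split())                 # collapse runs of whitespace
    return text[:length]


print(extract_summary("<p>Hello <b>world</b></p>"))
```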

Content Source Values:

This setting determines where the main content page should get its value. There are three options:

No content, summary only – If you set a feed to this, then Daily RSS Download won't generate a content file. This would be a good choice for the weather feed in the example.

Use the RSS description field – The content file will be created from the RSS description field. This would be a good choice for a Scott Guthrie type of feed.

Download from the referenced web page – Daily RSS Download will download the page referenced by the RSS or Atom feed. This would be a good option for a washingtonpost.com type of feed.

Content Start/End Markers

These are regular expressions that are used if you set the content source to download the referenced web page. You can leave them blank, or you can set them if you want to try to strip out headers, footers, navigation bars, etc. The content start marker in the screenshot:

\<div id=\"article_body\"[^\>]*\>

says to match ‘<div id="article_body"’ up through to the next ‘>’. Both markers are exclusive (the thing you're matching on won't be included in the results).
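To make the exclusive behavior concrete, here is a small Python sketch of how start and end markers like these could be applied. It is not the application's own code (which is .NET): the function name, the sample page, and the end marker are made up for illustration, and keeping the page as-is when a marker doesn't match is an assumption.

```python
import re


def extract_between(page, start_marker="", end_marker=""):
    """Return the text between two regex markers, excluding the markers."""
    start = 0
    if start_marker:
        match = re.search(start_marker, page)
        if match:
            start = match.end()  # exclusive: begin after the matched text
    end = len(page)
    if end_marker:
        # Only look for the end marker after the start position.
        match = re.compile(end_marker).search(page, start)
        if match:
            end = match.start()  # exclusive: stop before the matched text
    return page[start:end]


# Hypothetical page and end marker; the start marker is the one shown above.
page = ('<div id="nav">menu</div>'
        '<div id="article_body" class="x">Story text</div>'
        '<div id="footer">footer</div>')
start = r'\<div id=\"article_body\"[^\>]*\>'
end = r'</div>\s*<div id="footer"'
print(extract_between(page, start, end))
```

Leaving both markers blank returns the whole page, which matches the "you can leave them blank" behavior described above.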

Customizing the CSS

So that's it for the general settings and use. You can click “Download Now” on the main config page to download your feeds, and you can set it up to run on a recurring basis (it will only download new content) by setting a recurring task to run “DailyRssDownload.exe DownloadNow”. The only other thing of interest is making the content prettier.
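One way to create that recurring task is with the Windows schtasks command. The Python sketch below only builds the command line; the install path, task name, and 6:00 AM start time are all assumptions, so adjust them to match your machine.

```python
import subprocess


def build_schtasks_command(exe_path, start_time="06:00",
                           task_name="Daily RSS Download"):
    """Build a schtasks command that runs the download once a day."""
    return [
        "schtasks", "/Create",
        "/SC", "DAILY",                         # run every day
        "/TN", task_name,                       # hypothetical task name
        "/TR", f'"{exe_path}" DownloadNow',     # program plus its argument
        "/ST", start_time,                      # 24-hour HH:mm start time
    ]


# Hypothetical install location; check where the MSI actually put the exe.
command = build_schtasks_command(
    r"C:\Program Files\Daily RSS Download\DailyRssDownload.exe")
print(subprocess.list2cmdline(command))
```

On Windows you could pass the list to subprocess.run(command, check=True) to actually register the task.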

The generated HTML is CSS customizable, so in order to get the two column look above (and/or make it look pretty on an Iliad) you can customize the CSS as below:

h1
{
      margin-top: 0px;
      /* A pretty linux script font since the Iliad has a linux kernel */
      font-family: Zapf Chancery;
      font-size: 30pt;
      margin-bottom: 0px;
}
h2
{
}
.NewsHeader
{
      border-bottom: solid 1px black;
      text-align: center;
}
.DailyRss_Date
{
      text-align: center;
}
.DailyRss_Feed
{
}
.DailyRss_Entry
{
}
.DailyRss_EvenEntry
{
}
.DailyRss_OddEntry
{
}

/* LEFT COLUMN */
#ScottGusBlog
{
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}
#washingtonpostcom-TodaysHighlights
{
      clear: both;
      float: left;
      width: 49%;
      border-right: solid 1px gray;
}

/* RIGHT COLUMN */
#WeatherBugLocalWeatherfor20190
{
      margin-left: 50%;
}

So basically just use the old float left, width 50%, margin-left 50% trick to get the pretty two-column look (without tables).
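The element IDs in the CSS above (#ScottGusBlog and friends) look like feed titles with everything except letters, digits, and hyphens removed. That rule is inferred from the three IDs in the example rather than taken from the source code, and the feed titles below are guesses, so treat this Python sketch as an approximation for working out the ID of your own feed.

```python
import re


def feed_title_to_css_id(title):
    """Guess the element ID for a feed: keep letters, digits, and hyphens."""
    return re.sub(r"[^A-Za-z0-9-]", "", title)


# Guessed feed titles that would produce the IDs used in the stylesheet above.
for title in ("ScottGu's Blog",
              "washingtonpost.com - Today's Highlights",
              "WeatherBug Local Weather for 20190"):
    print(feed_title_to_css_id(title))
```

If the guess doesn't match, view the source of the generated index.html to find the real ID.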

Conclusion

I hope you find the Daily RSS Download open source project useful. Please feel free to submit suggestions, feature requests, defects, or preferably defects AND patches on the project's CodePlex home page.

* In case you aren't aware, CodePlex is an open source project hosting website from Microsoft. It's similar to SourceForge, except there is no approval process for new projects and it integrates nicely with Visual Studio.

6 comments:

Ted Jardine said...

If I had an IRex, would be leaping with joy after reading your post.

Lee Richardson said...

:) I suppose the audience that would use it is fairly small. Besides e-book readers people who travel a lot might find it useful. Anyone else have any ideas on uses?

G Sanders said...

This sounds like a wonderful app, but I use a Mac. Do you have any plans to port it to Mac OS X?

Lee Richardson said...

Sorry, .Net doesn't work with the Mac so well.

Anonymous said...

Why does the download not start when the app runs via scheduled tasks?

Arlo said...

This is great, there doesn't seem to be anything else like it on the internet.

I was browsing around 4Chan.org and thought, boy wouldn't it be great to just see the images and skip all the talk? Turns out they have feeds so I created a Yahoo Pipe to change the content to just the image.

Then I thought, wouldn't it be neat if you could download all of the images from one of those 4chan galleries automatically onto your computer? Your program was the only one I could find.

Unfortunately, I was hoping to have it save all of the content into a single directory, but it's saving it in multiple directories by time, and then by post. Is there any way to get it to just dump all the content into one directory?

Also, I ran the download twice in about 5 minutes and even though I had "only new" checked, it downloaded duplicates of all the images. Any ideas?

Thanks.