You might think a class library like this would be readily available: download a web page and all of its assets, then alter the page to reference the downloaded copies.  When I needed to build this functionality for 4teaspoons, I searched high and low, and not just for .NET code but for any code whatsoever, and came up empty.  No one else seems to have needed this, even though it amounts to creating a cache of a web page.  Maybe it’s so simple that no one thinks to create a shared library.  When that is the case, it’s time to contribute the code back to the community.

In my case, I wanted to download a given page and save the HTML file and all of its assets to Amazon S3, so this library comes with an S3 provider.  A provider class can be created to persist the assets just about anywhere: a database, MongoDB, the filesystem, etc.  Where the assets end up is really up to the developer.  I’ve taken care of the heavy lifting of parsing the pages and doing the transformation.
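
To make the provider idea concrete, here is a minimal sketch of the pattern in Python.  It is not the library’s actual code or API (the library targets .NET); every name in it (`AssetProvider`, `InMemoryProvider`, `AssetCollector`, `cache_page`) is hypothetical and made up for illustration.  It assumes assets are referenced by `img`/`script` `src` and `link` `href` attributes written in double quotes, and it takes the download function as a parameter so the storage and network pieces stay swappable.

```python
import hashlib
import posixpath
from html.parser import HTMLParser
from typing import Callable, Dict, List, Protocol
from urllib.parse import urljoin, urlparse


class AssetProvider(Protocol):
    """Hypothetical provider contract: store bytes, return the URL to link to."""

    def save(self, key: str, data: bytes) -> str:
        ...


class InMemoryProvider:
    """Stand-in for an S3, database, or filesystem provider (illustration only)."""

    def __init__(self, base_url: str) -> None:
        self.base_url = base_url.rstrip("/")
        self.stored: Dict[str, bytes] = {}

    def save(self, key: str, data: bytes) -> str:
        self.stored[key] = data
        return f"{self.base_url}/{key}"


class AssetCollector(HTMLParser):
    """Collects attribute values that point at assets, in document order."""

    ASSET_ATTRS = {"img": "src", "script": "src", "link": "href"}

    def __init__(self) -> None:
        super().__init__()
        self.refs: List[str] = []

    def handle_starttag(self, tag, attrs):
        wanted = self.ASSET_ATTRS.get(tag)
        for name, value in attrs:
            if name == wanted and value and value not in self.refs:
                self.refs.append(value)


def cache_page(
    html: str,
    page_url: str,
    fetch: Callable[[str], bytes],
    provider: AssetProvider,
) -> str:
    """Download each asset, hand it to the provider, and rewrite the page."""
    collector = AssetCollector()
    collector.feed(html)
    for ref in collector.refs:
        absolute = urljoin(page_url, ref)  # resolve relative references
        data = fetch(absolute)
        # Key assets by a hash of their URL, keeping the file extension.
        ext = posixpath.splitext(urlparse(absolute).path)[1]
        key = hashlib.sha1(absolute.encode("utf-8")).hexdigest() + ext
        new_url = provider.save(key, data)
        # Simplification: plain string replacement of the double-quoted value.
        # It would also rewrite any other attribute holding the same value.
        html = html.replace(f'"{ref}"', f'"{new_url}"')
    return html
```

Swapping `InMemoryProvider` for a class that uploads to S3 or writes to disk would leave `cache_page` untouched, which is the point of keeping persistence behind a provider.  A real implementation would also need to handle single-quoted and entity-escaped attributes, CSS `url(...)` references, and download failures, all of which this sketch skips.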

I hope this contribution proves worthwhile and gets used.

https://code.google.com/p/ontheheap-websucker/
