Posts Tagged ‘RegExp’

Creating a Scraper for Multiple URLs Using Regular Expressions

Wednesday, November 19th, 2008

Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application which are accessible from the Help menu.
You should run these to discover the Hub.

NOTE: This tutorial was created using version 0.8.2. The Scraper Editor interface has changed a long time ago. More features were included and some controls now have a new name. The following can still be a good complement to get acquainted with scrapers. The Sraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’ but the principle remains funamentally the same.

In this example we’ll redo the scraper from the previous lesson using Regular Expressions.  This will allow us to create a more precise scraper, which we can then apply to many URLs.  When working with RegExps you can always reference a list of basic expressions and a tutorial by selecting ‘Help’ in the menu bar.

Recap: For complex web pages or specific needs, when the automatic data extraction functions (table, list, guess) don’t provide you with exactly what you are looking for, you can extract data manually by creating your own scraper.  Scrapers will be saved on your computer then can be reapplied or shared with other users, as desired.

(more…)