Posts Tagged ‘extract data’

A New Kernel

Monday, November 16th, 2009

We have been extremely busy over the last few weeks with a complete refactoring of the OutWit Kernel, paving the way for the advanced automation features of OutWit Hub Pro. The coming version 0.8.9 will be the first to use our new core library. You will not see very radical changes yet, except for the scraper editor, which should make many of you happy. Here are the changes you will find:

The brand new scraper manager and editor

The big red Stop button that many of you have been asking for (which, by the way, also lets you abort ‘Apply Scraper to URLs’ processes)

A few changes in the interface, preparing the integration of new automators in upcoming versions.


General overview of the OutWit programs

Tuesday, July 28th, 2009

OutWit’s collection technology is organized around three simple concepts:

  1. The programs dissect the Web page into data elements and enable users to see only the type of data they are looking for (images, links, email addresses, RSS news…).
  2. They offer a universal collection basket, the “Catch”, into which users can manually drag and drop, or automatically collect, structured or unstructured data, links or media as they surf the Web.
  3. They also know how to automatically browse through series of pages, allowing users to harvest all sorts of information objects in a single click.

With simple, intuitive features as well as sophisticated scraping functions and data structure recognition, the OutWit programs target a broad range of users.


Creating a Scraper for Multiple URLs Using Regular Expressions

Wednesday, November 19th, 2008

Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application which are accessible from the Help menu.
You should run these to discover the Hub.

NOTE: This tutorial was created using version 0.8.2. The Scraper Editor interface has changed since then: more features have been added and some controls have been renamed. The following can still be a good complement to get acquainted with scrapers. The Scraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’, but the principle remains fundamentally the same.

In this example we’ll redo the scraper from the previous lesson using regular expressions. This will allow us to create a more precise scraper, which we can then apply to many URLs. When working with RegExps, you can always reference a list of basic expressions and a tutorial by selecting ‘Help’ in the menu bar.

Recap: For complex web pages or specific needs, when the automatic data extraction functions (table, list, guess) don’t provide you with exactly what you are looking for, you can extract data manually by creating your own scraper. Scrapers are saved on your computer and can then be reapplied or shared with other users, as desired.
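OutWit Hub lets you do all of this without writing a line of code, but if you are curious about what a RegExp-based scraper does under the hood, here is a minimal Python sketch. The pattern, the CSS class names and the sample page are invented for illustration; the capture groups play the role of the scraper’s fields, and the same pattern can be reapplied to as many pages as you like:

```python
import re

# Hypothetical page structure: name/price cells in a table.
# The two capture groups are the scraper's output columns.
PATTERN = re.compile(r'<td class="name">(.*?)</td>\s*<td class="price">(.*?)</td>')

def scrape(html):
    """Return a list of (name, price) tuples found in the page source."""
    return PATTERN.findall(html)

sample = """
<table>
  <tr><td class="name">Widget</td><td class="price">9.99</td></tr>
  <tr><td class="name">Gadget</td><td class="price">4.50</td></tr>
</table>
"""
rows = scrape(sample)
# rows == [('Widget', '9.99'), ('Gadget', '4.50')]

# Applying the same "scraper" to a whole list of URLs would just mean
# fetching each page first, e.g. with urllib.request.urlopen(url).
```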


Create your First Web Scraper to Extract Data from a Web Page

Friday, August 22nd, 2008


Find a simple but more up-to-date version of this tutorial here

This tutorial was created using version 0.8.2. The Scraper Editor interface has changed since then: many more features have been added and some controls have been renamed. The following can still be a good complement to get acquainted with scrapers. The Scraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’, but the principle remains fundamentally the same.

In many cases, the automatic data extraction functions (tables, lists, guess) will be enough, and you will manage to extract and export the data in just a few clicks.

If, however, the page is too complex, or if your needs are more specific, there is a way to extract data manually: create your own scraper.

Scrapers will be saved to your personal database and you will be able to re-apply them on the same URL or on other URLs starting, for instance, with the same domain name.

A scraper can even be applied to whole lists of URLs.

You can also export your scrapers and share them with other users.

Let’s get acquainted with this feature by creating a simple one.
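A scraper in the Hub is essentially a set of fields, each defined by the text that comes before and after the data you want. To make the concept concrete, here is a minimal Python sketch of that idea; the helper function, the markers and the sample page are hypothetical, not OutWit Hub’s actual implementation:

```python
def extract_field(source, before, after):
    """Collect every substring of `source` found between the
    `before` and `after` markers, in page order."""
    results, pos = [], 0
    while True:
        start = source.find(before, pos)
        if start == -1:
            return results
        start += len(before)
        end = source.find(after, start)
        if end == -1:
            return results
        results.append(source[start:end])
        pos = end + len(after)

# One "field" of a scraper: everything between <h2> and </h2>.
page = '<h2>First Post</h2> some text <h2>Second Post</h2> more text'
titles = extract_field(page, '<h2>', '</h2>')
# titles == ['First Post', 'Second Post']
```

Re-applying such a definition to another URL with the same page structure is what makes a saved scraper reusable.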


Grab HTML Tables to Excel Spreadsheets

Saturday, July 19th, 2008


While surfing the Web, you may have come across interesting data that you wanted to use offline. You then faced the tiresome task of copying and pasting the information row by row, column by column. OutWit Hub’s “Data” views can do this for you automatically.

In this tutorial, we are going to learn how to grab structured data from a Web page with the “Table” view and export it to an Excel spreadsheet.
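The “Table” view does the work for you inside the Hub, but the underlying task, walking the HTML rows and cells and writing them out as a spreadsheet-friendly file, can be sketched in a few lines of Python using only the standard library. The sample table and file name here are invented for illustration:

```python
import csv
from html.parser import HTMLParser

class TableGrabber(HTMLParser):
    """Collect the rows of every HTML table in a page, cell by cell."""
    def __init__(self):
        super().__init__()
        self.rows, self.row, self.in_cell = [], None, False

    def handle_starttag(self, tag, attrs):
        if tag == 'tr':
            self.row = []
        elif tag in ('td', 'th'):
            self.in_cell = True
            self.row.append('')

    def handle_endtag(self, tag):
        if tag == 'tr' and self.row is not None:
            self.rows.append(self.row)
            self.row = None
        elif tag in ('td', 'th'):
            self.in_cell = False

    def handle_data(self, data):
        if self.in_cell:
            self.row[-1] += data.strip()

html = ('<table><tr><th>City</th><th>Pop.</th></tr>'
        '<tr><td>Paris</td><td>2.1M</td></tr></table>')
grabber = TableGrabber()
grabber.feed(html)

# Write the rows to a CSV file, which Excel opens directly.
with open('table.csv', 'w', newline='') as f:
    csv.writer(f).writerows(grabber.rows)
```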
