Archive for April, 2011

Advanced Tips: Hierarchical Scraping

Friday, April 22nd, 2011

You may need, at times, to extract hierarchical data without losing the structure:

  • Flight #AW345: New York - Paris:
    • Departure time: 04:45 pm (local)
    • Arrival time: 07:15 am (local)
  • Flight #SG45: Paris - Rome:
    • Departure time: 10:05 am (local)
    • Arrival time: 11:55 am (local)
  • Flight #SG46: Rome - Paris:
    • Departure time: 06:25 am (local)
    • Arrival time: 08:55 am (local)
  • Flight #AW346: Paris - New York:
    • Departure time: 09:20 am (local)
    • Arrival time: 08:35 pm (local)

In many of these cases, the page will lack distinctive markers to tell the parent elements apart (the legs of the trip, in this case). If you just grab the time information with a simple scraper, you are likely to lose the leg each time belongs to. Although we haven’t yet implemented a complete recursive/hierarchical scraping system, you can often already get the result you want using the ‘Separator’ and ‘Labels’ fields in the scraper (pro version only). Making such scrapers often requires a good understanding of complex regular expressions.

The Separator is a delimiter which you can use to split an extracted string into several data fields.
The List of Labels is the series of headers to be used respectively for each destination column, when the result is split into several fields using a Separator.

In the Separator field, you can use either a literal string like “,” or “;”, a tag like “</ul>” or a regular expression. When splitting the result with a Separator, use the List of Labels to assign a field name to each part of the data. Separate the labels with a comma.
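
To make the Separator/Labels idea concrete, here is a minimal Python sketch of the same principle. It is only an illustration, not the Hub’s actual implementation: the function name split_with_labels, its is_regex parameter and the sample strings are all hypothetical.

```python
import re

def split_with_labels(text, separator, labels, is_regex=False):
    """Split text on a separator and pair each part with a label.

    Hypothetical helper for illustration only; it mimics the idea of the
    Separator and List of Labels fields, not the Hub's internal code.
    """
    # A literal separator is escaped so that characters like "." or "|"
    # are not interpreted as regular-expression syntax.
    pattern = separator if is_regex else re.escape(separator)
    parts = [part.strip() for part in re.split(pattern, text)]
    # Labels are given as one comma-separated string, as in the scraper.
    label_names = [name.strip() for name in labels.split(",")]
    return dict(zip(label_names, parts))

# Literal separator: a plain ";"
literal_row = split_with_labels("AW345; New York; Paris", ";", "Flight,From,To")

# Regular-expression separator: a colon or a dash surrounded by spaces
regex_row = split_with_labels(
    "AW345: New York - Paris", r":\s*|\s+-\s+", "Flight,From,To", is_regex=True
)

print(literal_row)
print(regex_row)
```

Both calls produce the same three labeled fields, which shows why a regular expression is handy when the delimiters in the source are not uniform.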

The parent block of data is extracted between the Marker Before and the Marker After, and this block is then split into several fields. This way, you keep the data that belong together in the same row. Separator/Labels are very helpful, in general, when the strings you want to extract are not surrounded by distinctive markers.
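
The two-step logic (extract the parent block, then split it) can be sketched in Python as follows. This is an illustration under assumptions, not the Hub’s code: the HTML in PAGE is an invented guess at how the flight list might be marked up, and the marker, separator and label values are examples chosen to fit it.

```python
import re

# Hypothetical source code of the page (assumed markup, two legs only).
PAGE = """
<li>Flight #AW345: New York - Paris:
  <ul><li>Departure time: 04:45 pm (local)</li>
      <li>Arrival time: 07:15 am (local)</li></ul></li>
<li>Flight #SG45: Paris - Rome:
  <ul><li>Departure time: 10:05 am (local)</li>
      <li>Arrival time: 11:55 am (local)</li></ul></li>
"""

MARKER_BEFORE = "Flight #"            # literal text preceding each parent block
MARKER_AFTER = "</ul>"                # literal text closing each parent block
SEPARATOR = r"\s*(?:<[^>]+>\s*)+"     # regex: any run of tags and whitespace
LABELS = "Leg,Departure,Arrival"      # comma-separated column headers

def scrape_rows(source):
    """Return one dict per parent block, with one labeled field per part."""
    block_pattern = re.escape(MARKER_BEFORE) + r"(.*?)" + re.escape(MARKER_AFTER)
    label_names = [name.strip() for name in LABELS.split(",")]
    rows = []
    # Step 1: grab each parent block between the two markers.
    for block in re.findall(block_pattern, source, flags=re.DOTALL):
        # Step 2: split the block so its parts stay together in one row.
        parts = [part.strip() for part in re.split(SEPARATOR, block)]
        # zip stops at the last label, dropping the empty trailing part.
        rows.append(dict(zip(label_names, parts)))
    return rows

rows = scrape_rows(PAGE)
for row in rows:
    print(row)
```

Each leg ends up in its own row, with its departure and arrival times in the neighboring columns, so the parent/child relationship survives the extraction.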

The Hub’s Help is… really helpful.

Saturday, April 2nd, 2011

Tutorials are better than in-line help; that’s a fact. We know this, and we are also very conscious that we’ve been promising a new set of tutorials for over a year now. They will come (eventually and progressively) in version 1.1. In the meantime, the 50+ pages of OutWit Hub’s built-in Help detail all the main functions of the application.

Expanding the help sections

The first time you open the help, all sections are collapsed. To expand them and explore their content, click on the + (or triangle) signs to the left of each section title. Don’t forget to have a look there before creating a ticket in the support system, as we update the help articles regularly. If you do not find an answer in the help, of course, don’t hesitate to ask us for assistance.