Advanced Tips: Hierarchical Scraping
Friday, April 22nd, 2011
At times, you may need to extract hierarchical data without losing the structure:
- Flight #AW345: New York - Paris:
  - Departure time: 04:45 pm (local)
  - Arrival time: 07:15 am (local)
- Flight #SG45: Paris - Rome:
  - Departure time: 10:05 am (local)
  - Arrival time: 11:55 am (local)
- Flight #SG46: Rome - Paris:
  - Departure time: 06:25 am (local)
  - Arrival time: 08:55 am (local)
- Flight #AW346: Paris - New York:
  - Departure time: 09:20 am (local)
  - Arrival time: 08:35 pm (local)
In many of these cases, you will lack distinctive markers to identify the parent elements (the legs of the trip, in this case). If you just grab the time information with a simple scraper, you are likely to lose track of the leg each time belongs to. Although we haven’t yet implemented a complete recursive/hierarchical scraping system, you can often already get the result you want in such cases by using the ‘Separator’ and ‘Labels’ fields in the scraper (pro version only). Building such scrapers often requires a good understanding of complex regular expressions.
The Separator is a delimiter which you can use to split an extracted string into several data fields.
The List of Labels is the series of headers assigned, in order, to the destination columns when the result is split into several fields using a Separator.
In the Separator field, you can use either a literal string like “,” or “;”, a tag like “</ul>” or a regular expression. When splitting the result with a Separator, use the List of Labels to assign a field name to each part of the data. Separate the labels with a comma.
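To make the idea concrete outside the application, here is a minimal Python sketch of splitting one extracted string with a separator and pairing the parts with a comma-separated list of labels. The function name `split_with_labels`, the `is_regex` flag, and the sample string are hypothetical, chosen only for illustration; this is not how the scraper itself is implemented.

```python
import re

def split_with_labels(extracted, separator, labels, is_regex=False):
    """Split `extracted` on `separator` and pair each part with a label.

    Hypothetical helper: `labels` is a comma-separated string, as in the
    List of Labels field; `separator` is a literal unless is_regex is True.
    """
    pattern = separator if is_regex else re.escape(separator)
    parts = [part.strip() for part in re.split(pattern, extracted)]
    label_names = [name.strip() for name in labels.split(",")]
    # zip stops at the shorter sequence, so surplus parts or labels are dropped.
    return dict(zip(label_names, parts))

row = split_with_labels(
    "Flight #AW345; 04:45 pm; 07:15 am",
    separator=";",
    labels="Flight,Departure,Arrival",
)
print(row)
```

Each label ends up as the field name for the part in the same position, which is the pairing the Separator and List of Labels fields describe.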
The parent block of data is extracted between the Marker Before and the Marker After, and this block is then split into several fields. This way, the data that belong together stay in the same row. Separator/Labels are especially helpful when the strings you want to extract are not surrounded by distinctive markers.
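The two-step process (grab each parent block between the two markers, then split it into labeled fields) can be sketched in Python as follows. The page source, the marker strings, and the function name `scrape_rows` are assumptions made up for this example; the real page markup and the scraper's internals may differ.

```python
import re

# Hypothetical page source: each <ul> holds one leg of the trip.
page = (
    "<ul><li>Flight #AW345: New York - Paris</li>"
    "<li>04:45 pm</li><li>07:15 am</li></ul>"
    "<ul><li>Flight #SG45: Paris - Rome</li>"
    "<li>10:05 am</li><li>11:55 am</li></ul>"
)

def scrape_rows(source, marker_before, marker_after, separator, labels):
    """Extract every block between the markers and split it into one row."""
    # Non-greedy match so each parent block is captured separately.
    block_pattern = re.escape(marker_before) + r"(.*?)" + re.escape(marker_after)
    rows = []
    for block in re.findall(block_pattern, source, flags=re.DOTALL):
        # The separator is treated as a regular expression here.
        parts = [part.strip() for part in re.split(separator, block)]
        rows.append(dict(zip(labels, parts)))
    return rows

rows = scrape_rows(
    page,
    marker_before="<ul><li>",
    marker_after="</li></ul>",
    separator=r"</li><li>",
    labels=["Flight", "Departure", "Arrival"],
)
for row in rows:
    print(row)
```

Because the times are split out of the same block as the flight name, each departure and arrival stays attached to its own leg instead of ending up in an unrelated list of times.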