Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application which are accessible from the Help menu.
You should run these to discover the Hub.
This tutorial was created using version 0.8.2. The Scraper Editor interface has changed considerably since then: many more features have been added and some controls have been renamed. The following can still be a useful complement to get acquainted with scrapers. The Scraper Editor can now be found in the ‘Scrapers’ view instead of ‘Source’, but the principle remains fundamentally the same.
Now that we’ve learned how to create a scraper for a single URL, let’s try something a little more advanced. In this lesson we’ll learn how to create a scraper which can be applied to a whole list of URLs using a simple method suited for beginners. In the next lesson a more complex scraper utilizing regular expressions will be demonstrated for our tech savvy users. Geeks, feel free to skip to: Creating a Scraper for Multiple URLs using Regular Expressions.
Recap: For complex web pages or specific needs, when the automatic data extraction functions (table, list, guess) don’t provide exactly what you are looking for, you can extract data manually by creating your own scraper. Scrapers are saved on your computer and can be reapplied or shared with other users, as desired.
2. Choose your source. Today let’s use: http://en.wikipedia.org/wiki/Leading_firms_by_activity
Make sure this address is displayed in the address bar. In the Page view, you will see a list of leading firms by activity. Today we will make a report detailing the company information for each of these firms.
Traditionally, you’d have to click on each link, then copy and paste the information into an Excel spreadsheet, but with the scraper function we’re going to save a lot of time and energy.
3. Choose the information you want to extract and create your scraper:
If you click on the List view you can see all the URLs and their related companies:
Select a random company to get started, say Toyota. Double-click on the link for Toyota and you will be redirected to the Toyota article in the Page view. On the right, there is a box with the company information and logo. This is what we’ll use to populate our spreadsheet for each company contained in the original list.
Click Source in the left-hand view list to open the Scraper Editor. Then click New and the following box will appear. (If you have already created a scraper, you will see its information populated on the right.)
You will be asked to enter a URL that satisfies: “scraper will be applied to URLs starting with.” For this example use: http://en.wikipedia.org/wiki/ since all the companies’ URLs will begin with this address. (Unfortunately, in the case of Wikipedia, all articles start with the same string.) You will then be asked to name the scraper; then click OK.
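The “URLs starting with” rule is a simple prefix match. As a rough sketch (this is an assumption about how the matching works internally, not OutWit Hub’s actual code):

```python
# Hypothetical sketch of the "applied to URLs starting with" rule.
# The prefix is the one entered in the dialog above.
PREFIX = "http://en.wikipedia.org/wiki/"

def scraper_applies(url):
    """True if the scraper's URL prefix matches this page."""
    return url.startswith(PREFIX)

print(scraper_applies("http://en.wikipedia.org/wiki/Toyota"))  # True
print(scraper_applies("http://en.wikipedia.org/w/index.php"))  # False
```

This is why every Wikipedia article matches the same scraper: they all share that prefix.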
If you have already created a scraper for this URL you will see the following error:
You can create multiple scrapers for the same URL, but you can only have one loaded at a time in OutWit Hub. In order to create a new scraper for the same URL you will have to click the Delete button and then create your new scraper. If you do click delete, however, all of your information will be lost unless you have exported the scraper. At the end of this tutorial you will learn how to export/import scrapers.
The first column will be named Company, so enter that in the description field. In the Source Code box you can see Toyota in black; this is the value we want to populate the field with. The text strings in black are the values that appear on the website, and when we apply the scraper they are the only values that will appear in our results. We need to choose the most logical markers both preceding and following Toyota. There is no exact rule for how much of the surrounding text to use, but the Marker Before should be specific and not repeated elsewhere in the document. This ensures you will get the correct value when applying the scraper to multiple pages. In our example, firstHeading"> is quite specific, so copy and paste that into the Marker Before field. If the Marker Before is unique, the Marker After can be less specific; </h1> will do the trick.
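To see why the markers matter, here is a minimal sketch of the extraction principle: grab whatever text sits between the Marker Before and the Marker After. (This is an illustration of the idea, not OutWit Hub’s implementation; the sample HTML fragment is a simplified stand-in for the article’s real source code.)

```python
def scrape_between(html, marker_before, marker_after):
    """Return the text found between the two markers, or None."""
    start = html.find(marker_before)
    if start == -1:
        return None
    start += len(marker_before)
    end = html.find(marker_after, start)
    if end == -1:
        return None
    return html[start:end].strip()

# Simplified fragment of the article's source code
sample = '<h1 id="firstHeading">Toyota</h1>'
print(scrape_between(sample, 'firstHeading">', '</h1>'))  # Toyota
```

If the Marker Before appeared more than once, the extraction would silently pick the first occurrence, which is exactly why a unique marker is important.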
To be absolutely sure the marker is unique, you can use the Find command to count the occurrences of firstHeading"> in the document. (If Cmd-F/Ctrl-F does not open the Find box, click in the source code box to give it the focus and try again.)
Try to see if you can complete the next few fields by yourself. The column names will be: Type, Founded, Founder, Headquarters, Industry, and Revenue. Remember you should check your work periodically by clicking: Save-Scraper-Refresh.
We have successfully created our scraper, but now we need to apply it to all the URLs from our list. In the Page view, reload our initial site: http://en.wikipedia.org/wiki/Leading_firms_by_activity. Then select the List view, highlight the links for the companies, right click, and select “Apply Scraper to Selected URLs.”
In the List view select only the values that relate to the links for the companies. In this case it would be rows 62-199.
Now we can go to the Scraper view and see our list loading.
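Conceptually, applying the scraper to the selected URLs means running the same set of marker pairs against each page and collecting one row per page. A rough sketch of that loop, using hypothetical, heavily simplified page sources in place of the articles OutWit Hub would actually download:

```python
def scrape_between(html, before, after):
    """Return the text between two markers, or None if absent."""
    start = html.find(before)
    if start == -1:
        return None
    start += len(before)
    end = html.find(after, start)
    return None if end == -1 else html[start:end].strip()

# Hypothetical stand-ins for the downloaded pages (real pages are far larger)
pages = {
    "http://en.wikipedia.org/wiki/Toyota":
        '<h1 id="firstHeading">Toyota</h1><td>Automotive</td>',
    "http://en.wikipedia.org/wiki/Sony":
        '<h1 id="firstHeading">Sony</h1><td>Electronics</td>',
}

# One (Marker Before, Marker After) pair per column, as in the editor
scraper = {
    "Company": ('firstHeading">', '</h1>'),
    "Industry": ('<td>', '</td>'),
}

# One result row per URL
rows = [
    {column: scrape_between(html, before, after)
     for column, (before, after) in scraper.items()}
    for html in pages.values()
]
print(rows)
```

Each row in the Scraper view corresponds to one URL from the selection, which is why the list fills in gradually as the pages load.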
4. Export the data:
To save the results in a spreadsheet, select all (Ctrl-A or Cmd-A). Then select File and “Export Selection As…” Finally, choose a name and destination for your file, and click Save.
You can see that OutWit Hub has successfully created our spreadsheet.
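The exported spreadsheet is essentially the scraped rows written out with a header line. If you ever wanted to produce the same kind of file outside the Hub, a minimal sketch (with hypothetical rows and column names) would look like this:

```python
import csv

# Hypothetical rows, as the scraper might have collected them
rows = [
    {"Company": "Toyota", "Industry": "Automotive"},
    {"Company": "Sony", "Industry": "Electronics"},
]

# Write a header line followed by one line per scraped page
with open("companies.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["Company", "Industry"])
    writer.writeheader()
    writer.writerows(rows)
```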
Notice that in the results there are unwanted characters: (▼   ). In the more advanced tutorial, we will learn how to remove the unwanted characters using RegExp. Click here to skip to the next tutorial.
We have one last and very important step to complete. Return to the Source view and click Export. Then choose a name and destination for your file.
It is very important that you export your scraper if you plan to use it again or to create additional scrapers for the same URL. Remember that when we were creating our scraper we applied it to the URL: http://en.wikipedia.org/wiki/. In the future, if you create another scraper for Wikipedia you will also have to use this URL. If you were to do so, the new scraper would replace the existing one and all your work would be lost. By exporting and then importing your work, you can create many scrapers for the same URL.
To reload a scraper click Import, select your file, then Open.
Occasionally, after importing your scraper, it may not appear. You will then have to select it from the drop-down box where the Scraper Name is located.
This completes our tutorial, and, as always, we welcome any questions or feedback. We couldn’t do this without you.