Grab Documents With OutWit Hub

In this tutorial we are going to learn how to download all the documents (.pdf, .doc, .xls,…) from a webpage with OutWit Hub.

Important Note: The tutorials you will find on this blog may become outdated with new versions of the program. We have now added a series of built-in tutorials in the application which are accessible from the Help menu.
You should run these to discover the Hub.

Some webpages contain links to many different kinds of documents. Hunting for each link by hand would be tedious: with OutWit Hub, you can automatically list all the links to documents, see the name and extension of each file, and download them to your hard disk (also see OutWit Docs).

1. Launch OutWit Hub

If you haven’t installed OutWit Hub yet, please refer to the Getting Started with OutWit Hub tutorial.

Begin by launching OutWit Hub from Firefox. Open Firefox, then click the OutWit Hub button in the toolbar.

If the icon is not visible, go to the menu bar and select Tools -> OutWit -> OutWit Hub.

OutWit Hub will open, displaying the Web page currently loaded in Firefox.

2. Go to the Desired Web Page

In the address bar, type the URL of the Website. You can also type any string to search and OutWit Hub will look for it using the preferred search engine selected in Firefox.


We’ll use the Wikipedia article on the United States: http://en.wikipedia.org/wiki/United_states

Go to the Page view where you can see that OutWit Hub displays the Web page as it would appear in a traditional browser.


Now, select “Links” from the view list.


In the “Links” widget, OutWit Hub displays all the links from the current page and allows you to sort them. Check the “Documents” box from the filter controls.


If the “Links” view is blank, reload the page.


Checking the “Documents” box in the “Links” view lets you see all the documents linked from the webpage.
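To give an idea of what such a document filter does, here is a minimal Python sketch. It is not OutWit Hub's own code: the `DOC_EXTENSIONS` list, the `LinkCollector` class and the `document_links` function are hypothetical names chosen for this illustration, and the sketch simply assumes that a "document" is any link whose filename ends with one of the listed extensions.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse
import posixpath

# Assumed list of document extensions; OutWit Hub's own list may differ.
DOC_EXTENSIONS = {".pdf", ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".csv"}


class LinkCollector(HTMLParser):
    """Collects the absolute URL of every <a href="..."> on a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                # Resolve relative links against the page address.
                self.links.append(urljoin(self.base_url, href))


def document_links(html, base_url):
    """Return (url, filename, extension) for each link that looks like a document."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    docs = []
    for url in collector.links:
        filename = posixpath.basename(urlparse(url).path)
        extension = posixpath.splitext(filename)[1].lower()
        if extension in DOC_EXTENSIONS:
            docs.append((url, filename, extension))
    return docs


# A tiny made-up page: two document links and one ordinary link.
sample = (
    '<a href="/files/report.pdf">Report</a> '
    '<a href="/about.html">About</a> '
    '<a href="data/pop.XLS">Data</a>'
)
for row in document_links(sample, "http://example.com/stats/"):
    print(row)
```

The filename and extension columns produced here play the same role as the columns of the “Links” view.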

If you want more information about the documents, you can hit the “Pick a Column” button to show or hide columns with additional details (HTTP headers, last seen on, filename, etc.).


3. Select the Documents You Want

The data in the “Links” view can be edited, filtered, sorted and moved to the “Catch”.

Let’s filter the current documents in order to select only the Excel spreadsheets.

In the search tool of the filter controls, use the rule Select row if Filename contains “xls”.


The .xls documents are now highlighted. To move them to the Catch, just hit the Catch button.
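The “Filename contains” selection used in this step can be approximated with a few lines of Python. This is only a sketch: the `select_rows` function and the sample rows are made up for the example, and the case-insensitive match is an assumption, not a documented behavior of OutWit Hub.

```python
def select_rows(rows, column, needle):
    """Return the rows whose value in `column` contains `needle`.

    The comparison ignores case (an assumption made for this sketch).
    """
    needle = needle.lower()
    return [row for row in rows if needle in row.get(column, "").lower()]


# Made-up rows standing in for the contents of the "Links" view.
rows = [
    {"Filename": "population.xls", "URL": "http://example.com/population.xls"},
    {"Filename": "report.pdf", "URL": "http://example.com/report.pdf"},
    {"Filename": "Budget.XLS", "URL": "http://example.com/Budget.XLS"},
]

selected = select_rows(rows, "Filename", "xls")
for row in selected:
    print(row["Filename"])
```

Note that a plain “contains” test also matches names such as “budget.xlsx”, or any filename with “xls” in the middle of it.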


4. Save the Documents You Like

If you want to save the selected documents to your hard disk, check the “Save incoming files” box in the Catch before catching them.


When hitting the Catch button, you will have to choose a destination folder on your hard-disk for the selected documents. Download time may vary depending on the number of documents and their size.
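For readers who want to see what saving a list of documents to a destination folder involves, here is a rough Python sketch using only the standard library. The `save_documents` function is a hypothetical helper, not part of OutWit Hub. To keep the demo runnable offline, it downloads from a temporary local file through a `file://` address; a real `http://` address would be passed in the same way.

```python
import shutil
import tempfile
import urllib.request
from pathlib import Path
from urllib.parse import unquote, urlparse


def save_documents(urls, dest_dir):
    """Download each URL into dest_dir and return the list of saved paths."""
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for url in urls:
        # Use the last part of the URL path as the local filename.
        name = Path(unquote(urlparse(url).path)).name or "download"
        target = dest / name
        with urllib.request.urlopen(url) as response, open(target, "wb") as out:
            shutil.copyfileobj(response, out)
        saved.append(target)
    return saved


# Offline demo: a temporary local file stands in for a remote document.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source" / "population.xls"
    source.parent.mkdir()
    source.write_bytes(b"demo spreadsheet bytes")

    saved = save_documents([source.as_uri()], Path(tmp) / "downloads")
    print([path.name for path in saved])
```

Two files with the same name would overwrite each other in this sketch; a more careful version would rename duplicates.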

5. Application Examples

Documents are common on Web pages and simple to extract with OutWit Hub’s “Links” view. Let’s say you want to get statistics about US inhabitants. Open OutWit Hub and go to the US Census webpage.


In the “Links” view, check Show Documents and Catch Selection, and uncheck Empty on page load. This way, each time a page loads, the “Links” widget will show the documents of the current page and catch them, while the documents of the previous pages remain displayed.

In the Catch, check the Save Incoming files box so that each file that is caught will be saved to your hard disk.


Now click the Dig button (depth: 0) and take a rest: the Hub will browse through all the links of the page for you, find the ones that link to documents, and save those directly into the selected directory.
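The idea behind this kind of automatic browsing can be sketched in Python. Everything here is an illustration rather than OutWit Hub's actual algorithm: `dig`, `page_links` and `is_document` are hypothetical names, the extension list is assumed, and the sketch assumes that depth 0 means "visit the start page and the pages it links to, but go no further". Pages come from an in-memory dictionary so the example runs without a network connection.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse

# Assumed document extensions; OutWit Hub's own list may differ.
DOC_EXTENSIONS = (".pdf", ".doc", ".xls")


class _Links(HTMLParser):
    """Collects the raw href of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def page_links(html, base_url):
    """Return the absolute URLs of all links found in the HTML."""
    parser = _Links()
    parser.feed(html)
    return [urljoin(base_url, href) for href in parser.hrefs]


def is_document(url):
    return urlparse(url).path.lower().endswith(DOC_EXTENSIONS)


def dig(start_url, fetch, depth=0):
    """Collect document URLs level by level.

    Assumption: depth 0 visits the start page plus the pages it links to;
    each additional level of depth follows links one step further.
    `fetch` is any function that takes a URL and returns the page HTML.
    """
    documents, seen = [], {start_url}
    level = [start_url]
    for _ in range(depth + 2):
        next_level = []
        for page in level:
            for url in page_links(fetch(page), page):
                if url in seen:
                    continue
                seen.add(url)
                if is_document(url):
                    documents.append(url)
                else:
                    next_level.append(url)
        level = next_level
    return documents


# A made-up three-page site held in memory.
SITE = {
    "http://example.com/": '<a href="a.html">A</a> <a href="top.pdf">Top</a>',
    "http://example.com/a.html": '<a href="data.xls">Data</a> <a href="b.html">B</a>',
    "http://example.com/b.html": '<a href="deep.doc">Deep</a>',
}


def fake_fetch(url):
    return SITE.get(url, "")


print(dig("http://example.com/", fake_fetch, depth=0))
```

With a real site, `fake_fetch` would be replaced by a function that downloads the page, and the resulting list of URLs would then be saved to disk.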

You can also practice automatic document downloads on the Biomass database of the US Department of Energy.
