Knowledge is knowing that we cannot know: How to write a web crawler (theory)

This tutorial will go through the understanding, challenges and design decisions you face when implementing a Java web crawler.
I will try to cover as much as possible in one page itself.

Basic Understanding
The task of the crawler is to keep on getting information from the internet into the database of the search engine. It literally crawls over the internet from page to page, link by link and downloads all the information to the database.
A search engine is made up of basically four parts:

Web Crawler
Database
Search Algorithm
Search system that binds all the above together

For more information on crawler visit the wiki page for web crawlers.
When implementing a web crawler in Java you have a few major design possibilities to choose from. The major design possibilities are:

Singlethreaded, synchronous crawler
Multithreaded, concurrent crawler
Singlethreaded, nio based crawler
Multithreaded, nio based crawler

Each of these design possibilities can be implemented with extra variations, so that the total number of designs is somewhat larger. The main design requirements are speed and memory consumption. The speed of the downloading and parsing, and the memory consumption of the parsed documents. Design choices are most often made to increase speed or reduce memory consumption. Of course, the readability and maintainability of the java web crawler code is also an issue.

Singlethreaded, Synchronous Web Crawler

A singlethreaded, synchronous Java web crawler is a simple component. It carries out the following simple actions:

a line diagram to show the actions

Download a HTML page from a URL.
Parse the HTML page.
Extract needed information from HTML page.
Extract link URL's from HTML page.
Repeat from 1 for each URL to crawl.

Better Design :
A better design is to parse the HTML page while it is being downloaded, and extract the info and links from the page at the same time. That way, the downloading of the second page may start before the first page is fully downloaded and parsed. This will utilize the bandwidth and CPU of the crawler's computer better. In order to actually implement this design which starts the second download before the first is fully finished, you will need either a singlethreaded NIO based design, or a multithreaded IO based design.

A NIO Based Java Web Crawler

A Java NIO based web crawler can download multiple pages using a single thread, and parse the pages as they are downloaded.
A Java NIO based web crawler would use NIO's channels and selectors to open connections, and manage multiple open connections using a single thread. Here is a diagram illustrating a Java web crawler design based on NIO:

Each open connection (channel) is registered with a selector. The thread polls the selector for channels that have data ready for reading. The thread then reads and processes the data, and polls the selector for the next channel with data ready for reading. Thus a single thread can download multiple pages, and process them.
The data available from a channel (connetion) may not be the full page. It may just be part of the page that is ready for reading. Therefore the thread parsing the data must be able to parse a partial page, and leave the data in a partially parsed state, waiting for when the the rest of the page data arrives.
A simpler NIO based design would be a single thread that downloads multiple pages, but do not start parsing them until the full page is downloaded. But then the CPU and bandwidth is not fully utilized. The parsing of the page does not start until the full page is downloaded, and no new pages can start downloading until parsing starts.

A Multithreaded Java Web Crawler

It is simpler to implement a multithreaded IO based design, in which each thread is only downloading and parsing a single page at a time. That will both utilize the bandwidth and CPU of the crawling computer better than the second NIO based design, and it is easier to implement than the first NIO based design.
Here is a diagram illustrating a multithreaded Java web crawler:

A coordinating thread passes URL's to process to worker threads (processing threads in the diagram). The worker threads download the HTML pages, parse them, and extract information and links from the page. The link URL's are given back to the coordinating thread. The coordinating thread wraps the URL's in a crawl job, and passes the crawl job to a worker thread.
The worker threads are located in a thread pool. The crawl job object implements either the Runnable or Callable interface, depending on the concrete design of the crawler.
When each page is downloaded and processed by different threads, then multiple pages can be downloaded and processed at the same time. This utilizes both the bandwidth and CPU better. Neither is idle much time. Both CPU and bandwidth is most of the time used to either download or parse pages.

Online Resources :
http://ruby.bastardsbook.com/chapters/web-crawling/
A simple webcrawler in Python

Knowledge is knowing that we cannot know

Friday, June 6, 2014

How to write a web crawler (theory) - Part I

No comments:

Post a Comment