Knowledge is knowing that we cannot know: How to write a web crawler

In this tutorial we build a simple web crawler to understand how one works. Feel free to change the code however you like.

We will develop the crawler in phases:

To begin with, we will write a trivial crawler that only downloads the single URL we hand to it.
Then we will give the crawler the ability to extract URLs from the downloaded web page.
Next we will add a queue that keeps track of the URLs still waiting to be downloaded.
After that we will add the ability to extract only the user-visible text from a web page.
Thereafter we will make the downloader multi-threaded so that it uses as much of our network bandwidth as possible.
Finally we will add some kind of front end to it, probably in PHP.
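
To make the second and third phases a little more concrete, here is a rough sketch of what URL extraction and a download queue might look like. It is only an illustration, not the code from the later parts of this series: the class name Frontier, its methods, and the regular expression are all assumptions made for this example, and a regular expression is a naive stand-in for a real HTML parser (it only catches absolute http/https links inside href attributes).

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: a queue of URLs still to be downloaded.
public class Frontier {
    // Naive pattern (an assumption): absolute http(s) links in href attributes.
    private static final Pattern HREF = Pattern.compile(
            "href\\s*=\\s*[\"'](https?://[^\"'#\\s]+)", Pattern.CASE_INSENSITIVE);

    private final Queue<String> pending = new ArrayDeque<>(); // URLs waiting to be fetched
    private final Set<String> seen = new HashSet<>();         // every URL ever queued

    // Returns all absolute links found in the HTML, duplicates included.
    public static List<String> extractLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            links.add(m.group(1));
        }
        return links;
    }

    // Queues a URL unless it was queued before; returns true if it was added.
    public boolean add(String url) {
        if (seen.add(url)) {
            pending.add(url);
            return true;
        }
        return false;
    }

    // Next URL to download, or null when the queue is empty.
    public String next() {
        return pending.poll();
    }

    // Number of URLs still to be downloaded.
    public int size() {
        return pending.size();
    }
}
```

A crawler loop would call next(), download that page, pass the HTML to extractLinks(), and add() every link it finds. The seen set is what stops the crawler from downloading the same page twice.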

In this first part, we write a simple Java crawler that downloads a single page from the internet.
Create a new project in NetBeans, or any editor you are comfortable with, and save it under a name such as “WebC” or “w1”.
Put the following code in its Main class. We will keep working on this class, and add new classes, as we go along.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class Main {
    public static void main(String[] args) {
        try {
            // The page to download
            URL my_url = new URL("http://en.wikipedia.org/wiki/Web_crawler");
            // Wrap the response stream so it can be read line by line
            BufferedReader br = new BufferedReader(new InputStreamReader(my_url.openStream()));
            String strTemp = "";
            // readLine() returns null once the whole page has been read
            while (null != (strTemp = br.readLine())) {
                System.out.println(strTemp);
            }
            br.close();
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}

There is your first baby crawler :)
Watch the output when you run it: if it runs successfully, it prints the HTML source of the page.
To play with it, replace the URL with any other one and see what comes out.
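
If you would rather not edit the source every time you try a new URL, one option is to read the URL from the command line. The sketch below is just one way to do it: the class name FetchPage and the helper readAll are made up for this example, and it assumes the page is encoded as UTF-8, which is not true of every site.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.io.UncheckedIOException;
import java.net.URL;
import java.nio.charset.StandardCharsets;

// Hypothetical variant of the baby crawler that takes its URL as an argument.
public class FetchPage {
    // Reads every line from the reader and joins the lines with '\n'.
    public static String readAll(Reader in) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(in)) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line).append('\n');
            }
        } catch (IOException ex) {
            throw new UncheckedIOException(ex);
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Fall back to the tutorial's URL when no argument is given.
        String address = args.length > 0 ? args[0] : "http://en.wikipedia.org/wiki/Web_crawler";
        URL url = new URL(address);
        // Assumption: the page is UTF-8 encoded.
        System.out.print(readAll(new InputStreamReader(url.openStream(), StandardCharsets.UTF_8)));
    }
}
```

In NetBeans the argument goes in the Arguments field of the project's Run properties; from a terminal it would be something like: java FetchPage http://example.com/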

Troubleshooting the Web Crawler

You may run into errors on the first run, most likely network errors caused by the proxy settings of NetBeans and the JVM. In that case, set the proxy IP and port for NetBeans at Tools >> Options >> General >> Proxy Settings. You may also need to pass the same settings to the JVM on the command line; in NetBeans this is done at File >> 'w1' Properties >> Run >> VM Options.
Enter the following in that text box:
-Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<proxy port>
Example: -Dhttp.proxyHost=172.16.3.1 -Dhttp.proxyPort=3128
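
The -D flags simply set JVM system properties, so the same thing can be done from inside the program, as long as it happens before the first connection is opened. Here is a small sketch of that idea. The class name ProxyConfig and the method useHttpProxy are invented for this example, and the address is the sample one from above, so substitute your own. Note that these two properties only cover http:// URLs; https:// URLs read the separate https.proxyHost and https.proxyPort properties.

```java
// Hypothetical helper that sets the proxy from code instead of VM options.
public class ProxyConfig {
    // Sets the same system properties that the -D flags would set.
    public static void useHttpProxy(String host, int port) {
        System.setProperty("http.proxyHost", host);
        System.setProperty("http.proxyPort", Integer.toString(port));
    }

    public static void main(String[] args) {
        // Example values from this post; replace them with your own proxy.
        useHttpProxy("172.16.3.1", 3128);
        // Show what the JVM will now use for http:// connections.
        System.out.println(System.getProperty("http.proxyHost")
                + ":" + System.getProperty("http.proxyPort"));
    }
}
```

Calling useHttpProxy(...) as the first line of the crawler's main() should have the same effect as the VM options, which is handy when you run the program outside NetBeans.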

Once this is clear, try to understand the code at the links below. Run it and play with it.
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler/
http://code.google.com/p/crawler4j/

Happy surfing!


Friday, June 6, 2014

How to write a web crawler - Part II
