Friday, June 6, 2014

How to write a web crawler - Part II

In this tutorial a simple web crawler is developed for the understanding. The reader can change the code as per their wish.

A crawler development can be planned out into phases as we will be doing.
  1. To begin with, we would develop a very trivial crawler that will just crawl the url spoon fed to it.
  2. Then we will make a crawler with capability to extract urls from the downloaded web page.
  3. Next we can also make a queue system in the crawler that will track no of urls still to be downloaded.
  4. We can then add capability to the crawler to extract only the user visible text from the web page.
  5. There after we will make a multi-threaded downloader that will utilize our network bandwidth to the maximum.
  6. And we will also add some kind of front end to it, probably in php.
In this, a simple java crawler is demonstrated which will crawl a single page over the internet.
Make a new project in Net-beans or any editor you are comfortable with and save it by the name something like “WebC” or “w1”,etc.
Write the following code in it’s main() function. This class will later be worked upon and new classes will be added once we get going.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
     public class Main {
         public static void main(String[] args) {
             try {
               URL my_url = new URL("http://en.wikipedia.org/wiki/Web_crawler
");
               BufferedReader br = new BufferedReader(new InputStreamReader my_url.openStream()));
               String strTemp = "";
             while(null != (strTemp = br.readLine())){
               System.out.println(strTemp);
                }
              }
             catch (Exception ex) {
                  ex.printStackTrace();
              }
        }
}



there is your first baby crawler :)
Watch the output when you first run it, when runing successfully it will show you the HTML code.
For play, enter any url and see the output.

Trouble Shooting in Web Crawler
It may give some hiccups or may stumble upon some errors, most probably network errors related to proxy settings on your Net-beans and JVM. In such a case you can change the proxy IP & port for the Net-beans at Tools>>options>>general>>proxy settings.Also you may need to feed the same to the JVM via command line, that can be done in Net-Beans at File>>’w1′ Properties>>Run>>VM options:
write the following in the text box over there.
-Dhttp.proxyHost=<your proxy IP> -Dhttp.proxyPort=<port for the same>
example: -Dhttp.proxyHost=172.16.3.1 -Dhttp.proxyPort=3128

If this is clear, try to understand the code given in the link below. Try to run it and play with it.
http://code.google.com/p/crawler4j/source/browse/src/test/java/edu/uci/ics/crawler4j/examples/imagecrawler/
http://code.google.com/p/crawler4j/

Happy Surfing !!!

How to write a web crawler (theory) - Part I

 This tutorial will go through the understanding, challenges and design decisions you face when implementing a Java web crawler.
I will try to cover as much as possible in one page itself.

Basic Understanding
The task of the crawler is to keep on getting information from the internet into the database of the search engine. It literally crawls over the internet from page to page, link by link and downloads all the information to the database.
A search engine is made up of basically four parts:
  1. Web Crawler
  2. Database
  3. Search Algorithm
  4. Search system that binds all the above together
For more information on crawler visit the wiki page for web crawlers.
When implementing a web crawler in Java you have a few major design possibilities to choose from. The major design possibilities are:
  • Singlethreaded, synchronous crawler
  • Multithreaded, concurrent crawler
  • Singlethreaded, nio based crawler
  • Multithreaded, nio based crawler
Each of these design possibilities can be implemented with extra variations, so that the total number of designs is somewhat larger. The main design requirements are speed and memory consumption. The speed of the downloading and parsing, and the memory consumption of the parsed documents. Design choices are most often made to increase speed or reduce memory consumption. Of course, the readability and maintainability of the java web crawler code is also an issue.

Singlethreaded, Synchronous Web Crawler
 
A singlethreaded, synchronous Java web crawler is a simple component. It carries out the following simple actions:
a line diagram to show the actions
  1. Download a HTML page from a URL.
  2. Parse the HTML page.
  3. Extract needed information from HTML page.
  4. Extract link URL's from HTML page.
  5. Repeat from 1 for each URL to crawl.
Better Design :
A better design is to parse the HTML page while it is being downloaded, and extract the info and links from the page at the same time. That way, the downloading of the second page may start before the first page is fully downloaded and parsed. This will utilize the bandwidth and CPU of the crawler's computer better. In order to actually implement this design which starts the second download before the first is fully finished, you will need either a singlethreaded NIO based design, or a multithreaded IO based design.

A NIO Based Java Web Crawler
A Java NIO based web crawler can download multiple pages using a single thread, and parse the pages as they are downloaded.
A Java NIO based web crawler would use NIO's channels and selectors to open connections, and manage multiple open connections using a single thread. Here is a diagram illustrating a Java web crawler design based on NIO:
Each open connection (channel) is registered with a selector. The thread polls the selector for channels that have data ready for reading. The thread then reads and processes the data, and polls the selector for the next channel with data ready for reading. Thus a single thread can download multiple pages, and process them.
The data available from a channel (connetion) may not be the full page. It may just be part of the page that is ready for reading. Therefore the thread parsing the data must be able to parse a partial page, and leave the data in a partially parsed state, waiting for when the the rest of the page data arrives.
A simpler NIO based design would be a single thread that downloads multiple pages, but do not start parsing them until the full page is downloaded. But then the CPU and bandwidth is not fully utilized. The parsing of the page does not start until the full page is downloaded, and no new pages can start downloading until parsing starts.

A Multithreaded Java Web Crawler
It is simpler to implement a multithreaded IO based design, in which each thread is only downloading and parsing a single page at a time. That will both utilize the bandwidth and CPU of the crawling computer better than the second NIO based design, and it is easier to implement than the first NIO based design.
Here is a diagram illustrating a multithreaded Java web crawler:
A coordinating thread passes URL's to process to worker threads (processing threads in the diagram). The worker threads download the HTML pages, parse them, and extract information and links from the page. The link URL's are given back to the coordinating thread. The coordinating thread wraps the URL's in a crawl job, and passes the crawl job to a worker thread.
The worker threads are located in a thread pool. The crawl job object implements either the Runnable or Callable interface, depending on the concrete design of the crawler.
When each page is downloaded and processed by different threads, then multiple pages can be downloaded and processed at the same time. This utilizes both the bandwidth and CPU better. Neither is idle much time. Both CPU and bandwidth is most of the time used to either download or parse pages.

Online Resources :
http://ruby.bastardsbook.com/chapters/web-crawling/
A simple webcrawler in Python

Monday, June 2, 2014

Mathematical Thinking !!


What is mathematical thinking, is it the same as doing mathematics, if it is not, is it important, and if it is different from doing math and important, then why is it important?
The answers are, in order, (1) I’ll tell you, (2) no, (3) yes, and (4) I’ll give you an example that concerns the safety of the nation.
What is mathematical thinking?
To people whose experience of mathematics does not extend far, if at all, beyond the high school math class, I think it’s actually close to impossible for them to really grasp what mathematical thinking is. I used to try to convey the distinction with an analogy. “K-12 mathematics is like a series of courses in digging trenches, pouring concrete, bricklaying, carpentry, plumbing, electrical wiring, roofing, and glazing,” I would say.
I would continue, “Mathematical thinking is the equivalent of architecting. You need all of those individual house-building skills to build a house. But putting those skills together and making use of them requires a higher-order form of thinking. You need someone who can design the building and oversee its construction.”
It is a great analogy. I felt sure it would convey the essence of mathematical thinking. But many conversations and email exchanges over the years eventually convinced me it was not working. Saying A is to B as C is to D works fine when the recipient has good understanding of A, B, and C and some understanding of D. But if they have not even a clue about D, or even worse, if they believe that D actually is C, then the analogy simply does not work. It’s one of those analogies that is brilliant if you are sufficiently familiar with all four components, but hopeless as a way to explain one in terms of the other three.
[Mathematical thinking is more than being able to do arithmetic or solve algebra problems. In fact, it is possible to think like a mathematician and do fairly poorly when it comes to balancing your checkbook. Mathematical thinking is a whole way of looking at things, of stripping them down to their numerical, structural, or logical essentials, and of analyzing the underlying patterns. Moreover, it involves adopting the identity of a mathematical thinker.]
[For instance] like most people, when I am doing something routine, I rarely reflect on my actions. But if I’m doing mathematics and I step back for a moment and think about it, I see myself [not just as someone who can do math, but] as a mathematician.
“Well,” I hear you saying. “You are a mathematician.” By which I assume you mean that I have credentials in the field and am paid to do math. But I have a similar feeling when I am riding my bicycle. I’m a fairly serious cyclist. I wear skintight Lycra clothing and ride a $4,000, ultralight, carbon fiber, racing-type bike with drop handlebars, skinny tires, and a saddle that resembles a razor blade. I try to ride for at least an hour at a time four or five days a week, and on weekends I often take part in organized events in which I ride virtually nonstop for 100 miles or more. Yet I’m not a professional cyclist, and I would have trouble keeping up with the Tour de France racers even during their early morning warm-up while they are riding along with a newspaper in one hand and a latte in the other. Being a bike rider is part of who I am. When I am out on my bike, I feel like a cyclist. And you know, I’d be willing to bet that the feeling I have for the activity is not very different from [the professional bike racers].



It’s very different for me when it comes to, say, tennis. I don’t have the proper gear, and I have never played enough to become even competent. When I do pick up a (borrowed) racket and play, as I do from time to time, it always feels like I’m just dabbling. I never feel like a tennis player. I feel like an outsider who is just sticking his toe in the tennis waters. I do not know what it feels like to be a real tennis player. As a consequence of these two very different mental attitudes, I have become a pretty good cyclist, as average-Joe cyclists go, but I am terrible at tennis. The same is true for anyone and pretty much any human activity. Unless you get inside the activity and identify with it, you are not going to be good at it.
If you want to be good at activity X, you have to start to see yourself as an X-er – to act like an X-er.
A large part of becoming an X-er is joining a community of other X-ers. This often involves joining up with other X-ers, but it does not need to. It’s more an attitude of mind than anything else, though most of us find that it’s a lot easier when we team up with others. The centuries-old method of learning a craft or trade by a process of apprenticeship was based on this idea.
Learning to X competently means becoming part of the semiotic domain associated with X. Moreover, if you don’t become part of that semiotic domain you won’t achieve competency in X. Notice that I’m not talking here about becoming an expert. In some domains, it may be that few people are born with the natural talent to become world class. Rather, the point we are both making is that a crucial part of becoming competent at some activity is to enter the semiotic domain of that activity. This is why we have schools and universities, and this is why distance education will never replace spending a period of months or years in a social community of experts and other learners. Schools and universities are environments in which people can learn to become X-ers for various X activities – and a large part of that is learning to think and act like an X-er and to see yourself as an X-er. They are only secondarily places where you can learn the facts of X-ing; the part you can also acquire online or learn from a book.
The social aspect of learning that goes with entering a semiotic domain is often overlooked when educational issues are discussed, particularly when discussed by policy makers rather than professional teachers. Yet it is a huge factor.
But my focus here is describing mathematical thinking.
In many cases, the real value of being a mathematical thinker, both to the individual and to society, lies in the things the individual does automatically, without conscious thought or effort. The things they take for granted – because they have become part of who they are.
My brief was to look at ways that reasoning and decision making are influenced by the context in which the data arises. Which information do you regard as more significant? How do you weight, and then combine, information coming from different sources.
When I start a project, my task is to find a way of analyzing how context influences data analysis and reasoning in highly complex domains. The task seemed impossibly daunting (and still does). Nevertheless, I took the oh-so-obvious (to me) first step. “I need to write down as precise a mathematical definition as possible of what a context is,” I said to myself. It took me a couple of days mulling it over in the back of my mind while doing other things, then maybe an hour or so of drafting some preliminary definitions on paper. I can’t say I was totally satisfied with it, and would have been unable to defend it as “the right definition.” But it was the best I could do, and it did at least give me a firm base on which to start to develop some rudimentary mathematical ideas. (Think Euclid writing down definitions and axioms for what had hitherto been intuition-based geometry.)
When the project got completed, what I had given them was, first, I asked the question “What is a context?” Since each person in the room besides me had a good working concept of context – different ones, as I just noted – they never thought to write down a formal definition. It was not part of what they did. And second, by presenting them with a formal definition, I gave them a common reference point from which they could compare and contrast their own notions.



As a mathematician, I had done nothing special, nothing unusual. It was an obvious first step when someone versed in mathematical thinking approaches a new problem. Identify the key parameters and formulate formal definitions of them. But it was not at all an obvious thing for anyone else on the project. They each had their own “obvious things.” Some of them seemed really clever to me. Others seemed superficially very similar to mine, but on closer inspection they set about things in importantly different ways.
I’ve had a number of similar experiences over the years, and though they appear on the surface to be widely different (from analyzing children’s fairy stories to looking at communication breakdown in the workplace to trying to predict the endings of movies like Memento to trying to make sense of the modern battlefield), at their (mathematical) heart they all have the same general pattern. That then, is mathematical thinking. How do you teach it? Well, you can’t teach it; in fact there is very little anyone can teach anyone. People have to learn things for themselves; the best a “teacher” can do is help them to learn.
The most efficient domain to learn mathematical thinking is, perhaps not surprisingly (though it’s not such a slam-dunk as you might think) mathematics itself. Particularly well suited parts of mathematics for this purpose are algebra, formal logic, basic set theory, elementary number theory, and beginning real analysis, etc. Other topics could serve the same purpose, but would require more background knowledge on the part of the student.
But it’s not about the topic. It’s the thinking required that is important.
One of the features of mathematical thinking that often causes beginners immense difficulty is the logical precision required in mathematical writing, frequently leading to sentence constructions that read awkwardly compared to everyday text and take considerable effort to parse. (The standard definition of continuity is an excellent example, but mathematical writing is rife with instances.)
Mathematical thinking is a highly complex activity, and a great deal has been written and studied about it. I will give several examples of mathematical thinking, and to demonstrate two pairs of processes through which mathematical thinking very often proceeds:
- Specializing and Generalizing

- Conjecturing and Convincing.
If students are to become good mathematical thinkers, then mathematical thinking needs to be a prominent part of their education. In addition, however, students who have an understanding of the components of mathematical thinking will be able to use these abilities independently to make sense of mathematics that they are learning. For example, if they do not understand what a question is asking, they should decide themselves to try an example (specialise) to see what happens, and if they are oriented to constructing convincing arguments, then they can learn from reasons rather than rules. Experiences like the exploration above, at an appropriate level build these dispositions.

Enjoy Mathematical Thinking!