Research Proposal

Web Scraper

By

Ivan Stoyanov


 

Background Information

 

Internet users sometimes like to read and view web sites in their entirety, but they do not always have the patience or the time to spend hours on end clicking on links while connected to the Internet, wasting time online and money on their phone bill. The main reason behind this project is to make it easier for users to read and look at such web sites at a time and pace of their own choosing. Once a web page has been downloaded and saved to the user's hard drive, he or she can look at it at any time without being connected to the Internet.

 

Some background information on HTTP: HTTP stands for Hypertext Transfer Protocol. Its definition, as stated in the HTTP/1.0 specification, is:

 

The Hypertext Transfer Protocol (HTTP) is an application-level protocol with the lightness and speed necessary for distributed, collaborative, hypermedia information systems. It is a generic, stateless, object-oriented protocol which can be used for many tasks, such as name servers and distributed object management systems, through extension of its request methods (commands). A feature of HTTP is the typing and negotiation of data representation, allowing systems to be built independently of the data being transferred.

 

So the research question, as it stands, is: what can be done to facilitate effective offline browsing and customisable web content retrieval while minimising a user's connection time?

 

 

Main Goal

 

The main goal of this project is to develop a so-called web scraper: a program that allows a user to download a web page from the Internet to local storage, such as a hard drive, so that the page can be viewed offline.

 

This program will retrieve web pages over the Internet using the HTTP (Hypertext Transfer Protocol) and TCP/IP protocols. Once the retrieval process is completed, the downloaded web page can be viewed offline, i.e. without being connected to the Internet.
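As a rough illustration of what retrieval over HTTP and TCP/IP involves, the sketch below opens a TCP connection to a web server, sends an HTTP/1.0 GET request and saves the response body to disk. Python is used here purely for illustration (the implementation language has not yet been chosen), and the host name and output file name are placeholders.

import socket

def fetch_page(host, path="/", port=80):
    """Retrieve one web page with a minimal HTTP/1.0 GET over a TCP socket."""
    with socket.create_connection((host, port)) as sock:
        request = f"GET {path} HTTP/1.0\r\nHost: {host}\r\n\r\n"
        sock.sendall(request.encode("ascii"))
        response = b""
        while chunk := sock.recv(4096):        # read until the server closes the connection
            response += chunk
    headers, _, body = response.partition(b"\r\n\r\n")
    return headers.decode("iso-8859-1"), body

if __name__ == "__main__":
    headers, body = fetch_page("example.com")  # placeholder host
    with open("page.html", "wb") as out:       # the saved copy can be viewed offline
        out.write(body)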

 

The user should also be able to select the type of content he/she wants to download; for example, only text, or only images. The program should provide for this.
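One simple way to implement such a filter (again only a sketch; the category names below are assumptions rather than part of the design) is to compare the Content-Type header of each retrieved resource against the categories the user has selected:

# Hypothetical mapping from user-selectable categories to MIME type prefixes.
CATEGORIES = {
    "text":   ("text/",),
    "images": ("image/",),
    "video":  ("video/",),
    "audio":  ("audio/",),
}

def wanted(content_type, selected):
    """Return True if a resource's Content-Type falls in a category the user selected."""
    return any(content_type.startswith(prefix)
               for name in selected
               for prefix in CATEGORIES.get(name, ()))

# Example: a user who ticked only "images" skips HTML but keeps JPEGs.
assert not wanted("text/html", ["images"])
assert wanted("image/jpeg", ["images"])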

 

 

Sub Goals Of The Project

 

·        Investigation of the processes associated with the automated downloading of web pages: the HTTP protocol provides all of the functionality discussed in the problem setting and much more. For example, the protocol defines the so-called GET method, which allows the client to specify exactly what content it needs to download from a web site, and HTTP also allows for MIME content filtering (a sketch of such a request appears below).
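For instance, MIME preferences can be expressed through the Accept header of a GET request. The sketch below only builds the request text (whether a given server honours the Accept header varies from server to server, and the host and path are placeholders):

def build_get_request(host, path, accept="*/*"):
    """Build an HTTP/1.0 GET request; the Accept header states the MIME types we want."""
    return (f"GET {path} HTTP/1.0\r\n"
            f"Host: {host}\r\n"
            f"Accept: {accept}\r\n"
            f"\r\n")

# Ask only for image types; a server that honours Accept may refuse other content.
print(build_get_request("example.com", "/photo", accept="image/gif, image/jpeg"))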

 

 

 

Scope And Delimitations

 

The program has to contain user-selectable options that restrict what content of a given web page should be downloaded. Content refers to the files available on the web page, ranging from text and graphics to movies, music and music videos. The program should also support a “depth of downloading” setting: the user should be able to select how deep (how many links) the web scraper should follow.
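A depth-limited download could be organised as a breadth-first traversal of the page's links, roughly as sketched below (Python's standard library is used only for illustration; the real program would also apply the content filter described above before saving each page):

from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin
from urllib.request import urlopen

class LinkCollector(HTMLParser):
    """Collect the href targets of <a> tags on a page."""
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def crawl(start_url, max_depth):
    """Breadth-first traversal that follows links no deeper than max_depth."""
    seen, queue = {start_url}, deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        html = urlopen(url).read().decode("utf-8", errors="replace")
        yield url, html                          # the caller saves each page to disk
        if depth < max_depth:
            collector = LinkCollector()
            collector.feed(html)
            for link in collector.links:
                absolute = urljoin(url, link)
                if absolute not in seen:
                    seen.add(absolute)
                    queue.append((absolute, depth + 1))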

 

The program should also contain a feature that establishes the structure of the web page: a listing of the files the page refers to, be they images, text, videos or other. Looking at this list, the user should then be able to select exactly which files he/she wants to download.
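One way of building such a listing (again only a sketch; the table of tags and attributes is an assumption about which references matter) is to parse the page's HTML and record every referenced file together with the kind of tag that refers to it:

from html.parser import HTMLParser

# Assumed mapping of HTML tags to the attribute naming the file they refer to.
RESOURCE_ATTRS = {"img": "src", "a": "href", "script": "src",
                  "link": "href", "embed": "src", "frame": "src"}

class StructureLister(HTMLParser):
    """List every file a page refers to, together with the tag that refers to it."""
    def __init__(self):
        super().__init__()
        self.resources = []              # (tag, url) pairs to show to the user
    def handle_starttag(self, tag, attrs):
        wanted_attr = RESOURCE_ATTRS.get(tag)
        if wanted_attr:
            for name, value in attrs:
                if name == wanted_attr and value:
                    self.resources.append((tag, value))

lister = StructureLister()
lister.feed('<html><img src="logo.gif"><a href="page2.html">next</a></html>')
for tag, url in lister.resources:
    print(f"{tag:6} {url}")              # the user ticks the files to download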

 

The second twist of this project is that it has to work on a client-server basis. What this means is that the client program goes online, connects to a PC running the server and requests that a certain web page be downloaded. Having done that, the client side can go offline and let the server side do all the downloading.
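A minimal version of this exchange might look as follows (a sketch only; the port number and the one-URL-per-connection request format are assumptions rather than a defined protocol):

import socket

PORT = 9000                                        # assumed port for the scraper service

def run_server():
    """Server side: accept one URL per connection and queue it for downloading."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("", PORT))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            with conn:
                url = conn.recv(4096).decode("utf-8").strip()
                print("queued for download:", url)  # the real server would start the crawl here
                conn.sendall(b"ACCEPTED\n")          # the client may now disconnect and go offline

def request_download(server_host, url):
    """Client side: hand the URL to the server, read the acknowledgement, then disconnect."""
    with socket.create_connection((server_host, PORT)) as sock:
        sock.sendall(url.encode("utf-8") + b"\n")
        return sock.recv(64).decode("utf-8").strip()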

 

The idea behind this is that both client and server run on some sort of LAN. Otherwise there would be little point in the client-server model at all: if the client had to access the server over a dial-up connection, it would be too slow. With both client and server sitting on a LAN, once the server has finished all the downloading it can transfer the web page to the client side over the network at a fairly high speed.

 

Has this sort of thing been done before? Yes, it has. However, the incorporation of a client-server interface in my project is something new. After searching on the Internet, I found a few web scraping tools, such as Teleport Pro, Web Stripper and Grab-a-Site; none of them incorporates the client-server model.

 

Project Delimitations: This program is not for educational purposes. It will not give the user an understanding of how the Internet works or how HTTP and TCP/IP function; these processes happen in the background and are not visible to the user.

 

 

Research Methods

 

The current state of the technologies related to this project, as well as previous research, will be investigated through a dedicated literature survey.

 

The development of the program will follow the so-called engineering research approach and profiling.

 

 

Research Aspect

 

By no means does this research proposal state how the project should be solved; it merely tries to find alternative ways of approaching the given problem. The project is not just about coding the program: other work has to be done as well, such as choosing a programming language, designing and implementing a user-friendly interface, and learning about HTTP and TCP/IP.

 

As established above, a similar “web scraper” has been built before, but never on a client-server basis. What can be done is a review of all the features previous web scrapers present their users with; ideas can then be combined and incorporated into this particular project.

 

 

Formulation Of Work Breakdown

 

 

 

 

Relevant Sources

 

1. http://www.wdvl.com

2. http://www.ics.uci.edu/pub/ietf/http/

3. http://www.w3.org/Architecture/

4. Berners-Lee, T. and D. Connolly, “Hypertext Markup Language – 2.0”, RFC 1866, MIT/W3C, November 1995.