Programming Spiders, Bots, and Aggregators in Java

by Sybex

$59.99
Average Rating: 3.5 stars
Sales Rank:685576 (lower is better)
Price Used:$52.00
Shipping:Free Shipping on most orders over $25*
Label:Sybex
UPC:025211440407
Pages:512
Binding:Paperback
Publication Date:2002-02
Published By:Sybex
ASIN:0782140408
Category:Book

Editorial Reviews and Product Descriptions

Product Description

The content and services available on the web continue to be accessed mostly through direct human control. But this is changing. Increasingly, users rely on automated agents that save them time and effort by programmatically retrieving content, performing complex interactions, and aggregating data from diverse sources. Programming Spiders, Bots, and Aggregators in Java teaches you how to build and deploy a wide variety of these agents, from single-purpose bots to exploratory spiders to aggregators that present a unified view of information from multiple user accounts.

You will build on your basic knowledge of Java to quickly master the techniques that are essential to this specialized world of programming, including parsing HTML, interpreting data, working with cookies, reading and writing XML, and managing high-volume workloads. You'll also learn about the ethical issues associated with bot use and the limitations imposed by some websites.

This book offers two levels of instruction, both of which are focused on the library of routines provided on the companion CD. If your main concern is adding ready-made functionality to an application, you'll achieve your goals quickly thanks to step-by-step instructions and sample programs that illustrate effective implementations. If you're interested in the technologies underlying these routines, you'll find in-depth explanations of how they work and the techniques required for customization.

Customer Reviews

Lots of working code but not much of a tutorial - Reviewed on 2006-07-16
* * * *
9 customers found this review helpful.

Bots are the simplest form of Internet-aware programs in that they simply carry out a repetitive task once unleashed on the web. A spider travels the web in a complex fashion, moving from one part of the World Wide Web to another collecting information from one site and then jumping to another based on that information. An aggregator is a bot that is designed to log into several user accounts and retrieve similar information.

If you need a complete bot, spider, or aggregator written in Java, complete with source code and a detailed manual about that source code so that you can customize it to suit your needs, this is a five star book. However, if you are looking for a book about information storage and retrieval and network programming that focuses on the theory of operation of such software with application code written in Java, you will be sorely disappointed.

The author did such a fine job of documenting his work, with excellent diagrams, comments, and a book that reads like a user's manual, that I easily took his Web spider code and modified it to perform many additional tasks that his basic package does not do. All of the hooks are available in his code for you to modify it to collect or examine just about any kind of data accessible via the web.

I highly recommend this book if you are taking an information storage and retrieval class and you would like to read and study something applied on spiders, bots, and aggregators versus the theory you get in most textbooks. Just understand you are getting code plus a user's manual, not a tutorial. You are definitely going to need other resources on Java network programming if you want to study, understand, or modify the included source code. I suggest the latest edition of "Java Network Programming" by Elliotte Rusty Harold for help with the network programming part of bots, spiders, and aggregators. I also suggest you look at "Spidering Hacks", which has many good ideas for features you can add to your web spider.

Not much information for such a long book - Reviewed on 2004-06-24
* *
9 customers found this review helpful, 1 did not.

The essence of this book could probably have been compressed into a few chapters. I read the whole thing in about a day, skimming over many sections (e.g. the structure of HTML, including discussion of anchor tags) that I, like most programmers, already know well. I think I would have preferred a focussed tutorial on Heaton's Bot package instead of a detailed but boring treatment of every technology (however elementary) used in the process of constructing spiders and bots.

Aside from this, Heaton is not a great writer. Attempting to be particularly organized and structured, he comes off as excessively stiff; I stopped counting the number of times he wrote "I will now show how to..."

I purchased this book expecting the process of constructing a spider or bot to draw on a range of specialized skills, but it appears to be quite simple: basic knowledge of Java network programming (i.e. sockets), HTTP, HTML and XML parsing would appear to suffice. I'm sure there is all sorts of complex stuff Heaton does not talk about, but I wish he had!

At the moment I'm wondering whether this book deserves a space on my finite bookshelf.

Create an Object-Oriented Bot Package Step by Step - Reviewed on 2004-04-26
* * * * *
8 customers found this review helpful.

I use this book as a supplement to a class that I teach, as it gives the students the necessary skills to programmatically spider, and generally access, information on the Net.

As some of the other reviewers point out, this book does center around the creation of a "bot package". However, I see this as one of the book's greatest strengths. The author explains step by step how to take basic concepts, continually build upon them, progressing onward to more complex spiders and bots. Specifically:

1. Create an advanced HTTP object that overcomes many of the shortcomings of the one built into Java (namely cookie support, referrer support, HTTP authentication, and more).
2. Add forms/page processing on top of the HTTP object. You are shown step by step how to process the data you collect from step 1.
3. Create a bot that wields the page/form processing created in step 2.
4. Create a spider, that, using steps 1-3, can access pages across an entire site.
5. Expand the spider to support thread pooling and a JDBC database.
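
To make step 4 concrete, here is a minimal sketch of a site-restricted spider. It is an illustration only and does not use the book's com.heaton.bot package: the class name SpiderSketch, the methods extractLinks and crawl, and the fetcher parameter are all hypothetical names chosen for this example. The fetcher stands in for the HTTP layer from step 1, so the sketch can run against an in-memory map of pages instead of the network, and the regular expression is a deliberately naive substitute for real HTML parsing.

```java
import java.net.URI;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Illustrative sketch only; this is NOT the book's com.heaton.bot API. */
class SpiderSketch {

    // Naive href matcher (assumption: good enough for a demo; a real spider
    // would use a proper HTML parser).
    private static final Pattern HREF = Pattern.compile(
            "<a\\s[^>]*href\\s*=\\s*[\"']([^\"'#]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the links found in html, resolved against the page's own URL. */
    static List<URI> extractLinks(URI base, String html) {
        List<URI> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            try {
                links.add(base.resolve(m.group(1).trim()).normalize());
            } catch (IllegalArgumentException e) {
                // Malformed href: skip it rather than abort the crawl.
            }
        }
        return links;
    }

    /**
     * Breadth-first crawl that stays on the start URL's host.
     * The fetcher returns a page's HTML, or null if it cannot be retrieved.
     */
    static List<URI> crawl(URI start, Function<URI, String> fetcher, int maxPages) {
        Set<URI> seen = new LinkedHashSet<>();   // URLs already queued
        List<URI> visited = new ArrayList<>();   // URLs actually fetched, in order
        Deque<URI> queue = new ArrayDeque<>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty() && visited.size() < maxPages) {
            URI page = queue.remove();
            String html = fetcher.apply(page);
            if (html == null) {
                continue;                        // unreachable page
            }
            visited.add(page);
            for (URI link : extractLinks(page, html)) {
                boolean sameHost = start.getHost() != null
                        && start.getHost().equalsIgnoreCase(link.getHost());
                if (sameHost && seen.add(link)) {
                    queue.add(link);             // first time we see this URL
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // Hypothetical three-page site held in memory instead of fetched over HTTP.
        Map<String, String> site = Map.of(
                "http://example.com/",
                "<a href=\"/a.html\">A</a> <a href='http://other.example/x'>X</a>",
                "http://example.com/a.html",
                "<a href=\"b.html\">B</a> <a href=\"/\">home</a>",
                "http://example.com/b.html",
                "no links here");
        List<URI> visited =
                crawl(URI.create("http://example.com/"), u -> site.get(u.toString()), 10);
        visited.forEach(System.out::println);
    }
}
```

Passing the fetcher as a function keeps the traversal logic separate from the HTTP code, which loosely mirrors the layering the list describes. Step 5 would replace the single loop with a pool of worker threads and move the seen set and queue into a database.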

Rather than providing a bunch of disjoint code samples, like many books do, the author guides you step by step through the above path, revealing the techniques at every step. Readers who do not care about the intricate nature of bot programming (sadly, some of my students) can skip to the API documentation and get right on to creating their own bots. You can also download updated versions of the "bot package" from the author's site. I actually did this before buying the book.

The downside to the book is the example programs' use of GUIs. I would rather every example had been straight console; for a book targeting bot programming, the GUI only gets in the way. Also, the author very annoyingly puts an underscore in front of every class-instance variable, which gives some of the code something of a C++ look, I suppose.

If you are already programming bots and spiders of your own, I don't think you will get much more from this book than you are likely already doing.

But for someone who wants to get started in this exciting area, there is nothing else like it, and I highly recommend it.

Misleading Title - Reviewed on 2003-12-23
* *
11 customers found this review helpful, 1 did not.

As another reviewer commented, this book should be called "Using the com.heaton.bot Package API Reference". All you learn is how to use this package of Java classes, not how to actually create spiders, bots, or aggregators from the ground up. I feel the title is misleading for such an expensive book. The only way I will learn what I want is to read the author's source code, which btw is very ugly, however functional.

happy - Reviewed on 2003-11-07
* * * * *
2 customers found this review helpful.

Visual Cafe produces the Swing so one can view the examples from the book. So what?

When beginning to program with HTTP protocols, it's easy to enter incorrect methods and parameters that lead to dead-ends and frustration. As I learn about and use the Heaton API, I am pleasantly surprised with the methods available and how easily they're implemented and that they lead to success.

The source code is included on the CD with updated versions at the Heaton Website.


* - See Amazon Product Page for shipping and pricing details.
