TERAGRAM WEB CRAWLER

Teragram Crawler is a powerful tool that enables you to automatically download documents from the Internet. Starting at a user-specified URL, the crawler follows hyperlinks across the web, repeatedly sending HTTP requests to retrieve the corresponding HTML content and to extract any URLs found within that content.

Teragram Crawler is designed to be a fast, polite, and easy-to-use web crawling system with high information coverage. It is a standalone tool that enables you to locate desired web information, and it is also a key component of other Teragram products.

The Teragram crawler is designed for organizations that must collect information from the Internet, ranging from search engines to a wide variety of business intelligence specialists.

Key Benefits

  • High performance crawling: The Teragram crawler runs in multithreaded mode, and you can set the number of threads to suit your requirements. For example, specify more than 1,000 threads for large-scale crawling; in that case the download speed can reach 10 MB per second on a single machine with good bandwidth. The crawler can also be deployed in a distributed cluster environment.
  • Distributed crawling: Teragram Crawler provides a distributed running mode to make crawling faster. When multiple crawlers run simultaneously, each crawler forwards the links it discovers to the crawler responsible for them. Links are sent in batches to make communication between the crawlers more efficient, and the batch size is configurable.
  • Page quality: When the number of target pages is very large, the Teragram crawler downloads the highest-quality pages first. Duplicate URLs and duplicate page contents are removed automatically.
  • Friendly downloads: The crawler can be configured to download politely, which helps avoid complaints or access blocking from crawled sites. To ensure friendly downloads you can:
    • Specify the minimum interval between consecutive downloads from each site.
    • Specify the maximum number of parallel connections to each site or domain.
    • Specify the maximum number of times to retry each failed HTTP request.
  • Easy management: Crawling can be restricted in several ways, for example:
    • Entry points: Specify a list of URLs as seeds to start the crawling, and define the number of pages to crawl from each seed.
    • Portal list: Define a list of URLs to download without extracting new URLs.
    • Link-following restrictions: Define link-following rules with regular expressions to restrict the crawling area. For example, restrict crawling to a directory, a server, or a domain.
    • Excluded paths: Provide a list of URL paths to exclude from crawling. Any URL that is not an entry point will not be extracted if it contains an excluded pattern.
    • Search mode: Set the mode to crawl multiple sites depth-first (follow the links within a URL before proceeding to other URLs) or breadth-first (visit all entries first, then follow their links).
    • File format limitation: Configure which file formats are crawled (for example, htm and html) and which are not (for example, jpeg and css).
  • Easy integration with the Teragram Linguistic suite: Teragram Crawler can easily be integrated with Teragram Linguistic software:
    • Results selection with Teragram TK240: The crawled pages can be analyzed on the fly by Teragram CatCon Server to extract entities or categories for a domain-related selection. With this feature, only information in the categories you specify is saved.
    • Integration with Teragram Search Engine: Submit the result documents to Teragram Search Engine, an advanced search engine, to build an index, a question-answering engine, and so forth.
    • Integration with Teragram concept extraction: The result pages can be submitted to Teragram LITI, a powerful linguistic tool, to extract important entities, e.g., persons, locations, organizations, noun groups, and so forth.
    • Easy language recognition and encoding conversion: Crawling results can be run through the Teragram automatic language identifier to identify the language and character mapping of a document. Encoding conversion can also be performed for documents in, for example, UTF-8, Unicode, and GB18030.
    • Document conversion and hyperlink extraction: A range of document formats can be processed efficiently. For example, PDF and MS Word files can be parsed and their hyperlinks extracted.
  • Incremental crawling: Enable this mode for continuous downloading. While it runs, only pages that are newly created or have changed recently are saved to the local disk. This saves considerable resources when the target sites contain a large number of pages.
  • JavaScript parsing: The Teragram crawler supports URL extraction from JavaScript, where content is often deeply embedded. The powerful linguistic technologies of the Teragram crawler make it uniquely well suited to this type of data extraction.
  • Cookie support and logon to password-protected websites: The Teragram crawler supports cookie-based crawling after it logs in to websites. This is important because many sites restrict public access and grant access only to registered users. Using the Teragram crawler you can automatically retrieve password-protected content from the web.
  • Simple configuration: All configuration settings are specified in a single XML file, which is the only input to the crawler. To change the behavior of the Teragram crawler, you need only edit the XML file and restart the crawler. The crawler also runs a web-based administrative interface that is used to modify the configuration during crawling and to display the crawler’s statistics.
  • Easy operation: The crawler produces log files according to your specifications so that you can check its status and progress.
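
To illustrate the friendly-download settings listed above, here is a minimal Python sketch of a per-site minimum access interval and a bounded retry count. It is not Teragram code: the names PolitenessGate, delay_for, and fetch_with_retries are hypothetical, and the choice to treat OSError as a failed request is an assumption made for the example. The limit on parallel connections is not covered here.

```python
import time
from urllib.parse import urlsplit


class PolitenessGate:
    """Tracks, per host, the earliest time the next request may start."""

    def __init__(self, min_interval_s, clock=time.monotonic):
        self.min_interval_s = min_interval_s
        self._clock = clock            # injectable so the logic can be tested
        self._next_allowed = {}        # host -> earliest permitted start time

    def delay_for(self, url):
        """Return how long to wait before fetching url, and reserve that slot."""
        host = urlsplit(url).hostname or ""
        now = self._clock()
        start = max(now, self._next_allowed.get(host, now))
        # The following request to this host must wait one more interval.
        self._next_allowed[host] = start + self.min_interval_s
        return start - now


def fetch_with_retries(fetch, url, max_retries):
    """Call fetch(url); on OSError, retry up to max_retries more times."""
    if max_retries < 0:
        raise ValueError("max_retries must be non-negative")
    last_error = None
    for _attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except OSError as exc:         # assumption: failures surface as OSError
            last_error = exc
    raise last_error
```

A worker thread would call delay_for(url), sleep for the returned number of seconds, and then pass its own download function to fetch_with_retries. Because each call reserves the next slot for its host, requests to one site are spaced out while requests to other sites proceed without delay.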

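The crawl restrictions described under Easy management (link-following rules, excluded paths, and file format limitation) can be pictured as a single URL filter. The following Python sketch is only an illustration under stated assumptions: the CrawlScope class, its parameters, and the example.com URLs are hypothetical, and the actual Teragram XML configuration is not shown. It assumes, following the description above, that entry points bypass the excluded-path check, and further assumes that they remain subject to the other rules.

```python
import posixpath
import re
from urllib.parse import urlsplit


class CrawlScope:
    """Decides whether a discovered URL falls inside the crawling area."""

    def __init__(self, follow_patterns, excluded_paths,
                 blocked_extensions, entry_points=()):
        self.follow = [re.compile(p) for p in follow_patterns]
        self.excluded = list(excluded_paths)
        self.blocked = {e.lower().lstrip(".") for e in blocked_extensions}
        self.entry_points = set(entry_points)

    def allows(self, url):
        path = urlsplit(url).path
        extension = posixpath.splitext(path)[1].lstrip(".").lower()
        # File format limitation: skip formats that must not be crawled.
        if extension in self.blocked:
            return False
        # Excluded paths apply to every URL that is not an entry point.
        if url not in self.entry_points:
            if any(fragment in url for fragment in self.excluded):
                return False
        # Link-following rules: at least one regular expression must match.
        return any(pattern.search(url) for pattern in self.follow)


scope = CrawlScope(
    follow_patterns=[r"^https?://example\.com/docs/"],   # stay in one directory
    excluded_paths=["/docs/private/"],
    blocked_extensions=["jpeg", "css"],
    entry_points=["http://example.com/docs/private/index.html"],
)
print(scope.allows("http://example.com/docs/intro.html"))
print(scope.allows("http://example.com/docs/private/secret.html"))
```

Keeping all three checks in one predicate means a crawler can apply it to every extracted link before queuing it, so out-of-scope URLs never consume a download slot.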
 
 

Copyright © 2008 SAS Institute Inc. All rights reserved.