Apache Nutch Crawl Management
A workflow capability for managing end-to-end web crawl pipelines with Apache Nutch. It covers job lifecycle management, configuration control, seed list management, and CrawlDB querying, and is intended for web crawl engineers and data engineers.
What You Can Do
MCP Tools
get-server-status
Get the current status of the Apache Nutch server, including running jobs and known configurations.
list-configs
List all known Nutch configuration identifiers.
create-config
Create a new Nutch crawl configuration with custom properties.
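To illustrate what "custom properties" might look like, the sketch below assembles a configuration request as a plain dictionary. The field names `configId` and `params` and the helper `build_config_request` are assumptions made for this example, not the tool's documented input schema. `http.agent.name` and `fetcher.server.delay` are standard Nutch property names.

```python
def build_config_request(config_id, properties):
    """Assemble a hypothetical create-config request body."""
    if not config_id:
        raise ValueError("config_id must be non-empty")
    # Coerce every value to a string, since Hadoop-style configuration
    # properties are stored as text.
    params = {key: str(value) for key, value in properties.items()}
    # "configId" and "params" are assumed field names, not a documented schema.
    return {"configId": config_id, "params": params}


request = build_config_request(
    "polite-crawl",
    {"http.agent.name": "example-crawler", "fetcher.server.delay": 5.0},
)
```

Keeping the properties in a dictionary makes it easy to review a configuration before it is created and to compare it later with the output of get-config.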
get-config
Get all configuration properties for a specific Nutch configuration.
list-jobs
List all Nutch crawl jobs, optionally filtered by crawl ID.
create-job
Create and start a new Nutch crawl job. Job types include INJECT, GENERATE, FETCH, PARSE, UPDATEDB, INDEX, DEDUP, INVERTLINKS.
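The job types are usually run in sequence rather than in isolation. The sketch below lays out one common ordering: a single INJECT, then one GENERATE, FETCH, PARSE, UPDATEDB cycle per crawl round, followed by INVERTLINKS, DEDUP, and INDEX. The helper `plan_crawl` is hypothetical, and running the final steps once after all rounds (rather than inside every round) is a simplifying assumption.

```python
ROUND_STEPS = ["GENERATE", "FETCH", "PARSE", "UPDATEDB"]
FINAL_STEPS = ["INVERTLINKS", "DEDUP", "INDEX"]


def plan_crawl(rounds):
    """Return the ordered list of job types for a crawl of `rounds` rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    plan = ["INJECT"]  # seed URLs enter the CrawlDB once, up front
    for _ in range(rounds):
        plan.extend(ROUND_STEPS)  # each round fetches one generated segment
    plan.extend(FINAL_STEPS)
    return plan


plan = plan_crawl(2)
```

Each entry in the plan would correspond to one create-job call, issued only after the previous job has finished.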
get-job-info
Get the current state and details for a specific Nutch crawl job.
stop-job
Stop a running Nutch crawl job gracefully.
abort-job
Abort a Nutch crawl job immediately without waiting for graceful shutdown.
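The difference between stop-job and abort-job suggests an escalation pattern: request a graceful stop, wait for a bounded time, and abort only if the job is still running. The sketch below shows that pattern under stated assumptions. The callables `stop_job`, `abort_job`, and `get_state` are hypothetical stand-ins for the corresponding tools, and the terminal state names are assumed rather than taken from this document.

```python
import time

# Assumed names for states in which a job is no longer running.
TERMINAL_STATES = {"FINISHED", "FAILED", "KILLED"}


def stop_then_abort(job_id, stop_job, abort_job, get_state,
                    timeout_s=30.0, poll_s=1.0):
    """Try a graceful stop first; abort if the job outlives the timeout."""
    stop_job(job_id)
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if get_state(job_id) in TERMINAL_STATES:
            return "stopped"
        time.sleep(poll_s)
    # The graceful stop did not complete in time, so escalate.
    abort_job(job_id)
    return "aborted"
```

Passing the three operations in as callables keeps the escalation logic independent of how the tools are actually invoked.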
list-seed-lists
List all available seed URL lists in the Nutch server.
create-seed-list
Create a new seed URL list for initializing a crawl.
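Seed lists often come from hand-edited text files, so it can help to clean them before creating the list. The sketch below is one possible pre-processing step, not part of the tool itself: the helper `clean_seed_urls` is hypothetical, and restricting seeds to `http` and `https` is an assumption that may be too strict for crawls using other protocols.

```python
from urllib.parse import urlsplit


def clean_seed_urls(lines):
    """Drop blanks, comments, non-HTTP(S) URLs, and exact duplicates."""
    seen = set()
    cleaned = []
    for line in lines:
        url = line.strip()
        if not url or url.startswith("#"):
            continue  # skip empty lines and comment lines
        parts = urlsplit(url)
        if parts.scheme not in ("http", "https") or not parts.netloc:
            continue  # assumption: only absolute HTTP(S) URLs are seeds
        if url not in seen:
            seen.add(url)
            cleaned.append(url)  # preserve first-seen order
    return cleaned


seeds = clean_seed_urls([
    "https://nutch.apache.org/",
    "",
    "# internal notes",
    "ftp://example.org/file",
    "https://nutch.apache.org/",
    "http://example.org/docs",
])
```

The cleaned list can then be supplied as the URLs for create-seed-list.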
query-crawldb
Query the Apache Nutch CrawlDB for statistics, data dumps, or specific URL status.
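The three kinds of CrawlDB query (statistics, dumps, and single-URL status) can be modelled as one request with a type field. The sketch below builds such a request. The type names `stats`, `dump`, and `url`, the field names `type`, `crawlId`, and `args`, and the helper `build_crawldb_query` are all assumptions for illustration, not the tool's documented input schema.

```python
# Assumed names for the three query kinds described above.
QUERY_TYPES = {"stats", "dump", "url"}


def build_crawldb_query(query_type, crawl_id, url=None):
    """Assemble a hypothetical query-crawldb request body."""
    if query_type not in QUERY_TYPES:
        raise ValueError(f"unknown query type: {query_type!r}")
    if query_type == "url" and not url:
        raise ValueError("a URL is required for a 'url' query")
    query = {"type": query_type, "crawlId": crawl_id, "args": {}}
    if query_type == "url":
        query["args"]["url"] = url  # only URL lookups carry a target URL
    return query


stats_query = build_crawldb_query("stats", "crawl-01")
url_query = build_crawldb_query("url", "crawl-01", url="https://nutch.apache.org/")
```

Validating the query type before the call surfaces mistakes locally instead of as a server-side error.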
query-fetchdb
Query the Apache Nutch FetchDB for node fetch history and statistics.