Apache Nutch · Capability

Apache Nutch Crawl Management

Workflow capability for managing end-to-end web crawl pipelines with Apache Nutch. Covers job lifecycle management, configuration control, seed list management, and CrawlDB querying for web crawl engineers and data engineers.

Tags: Apache Nutch, Web Crawler, Crawl Management, Data Engineering

What You Can Do

GET /v1/admin/status
Get status: Get Apache Nutch server status.

GET /v1/configs
List configs: List all available configurations.

POST /v1/configs
Create config: Create a new crawl configuration.

GET /v1/configs/{configId}
Get config: Get all properties for a configuration.

DELETE /v1/configs/{configId}
Delete config: Delete a configuration.

GET /v1/jobs
List jobs: List all crawl jobs.

POST /v1/jobs
Create job: Create and start a crawl job.

GET /v1/jobs/{id}
Get job: Get job status and info.

POST /v1/jobs/{id}/stop
Stop job: Stop a running crawl job.

GET /v1/seeds
List seeds: List all seed URL lists.

POST /v1/seeds
Create seed: Create a new seed URL list.

POST /v1/db/crawldb
Query crawldb: Query the CrawlDB for stats or URL lookups.

POST /v1/db/fetchdb
Query fetchdb: Query the FetchDB for node information.
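
The endpoints listed above can be called from any HTTP client. The following is a minimal Python sketch that only builds the requests, using paths taken from the list. The base URL, the `job-123` identifier, and the `{"type": "stats"}` request body are assumptions made for illustration; this page does not document the host, the ID format, or the request schemas.

```python
import json
import urllib.request

# Assumption: the capability is reachable at this base URL; adjust for your deployment.
BASE_URL = "http://localhost:8080"


def build_request(method, path, body=None, base_url=BASE_URL):
    """Build (but do not send) an HTTP request for one of the listed endpoints."""
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        # JSON bodies are an assumption; the page does not specify a content type.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return urllib.request.Request(
        base_url + path, data=data, headers=headers, method=method
    )


# Paths come from the endpoint list; "job-123" is a hypothetical job ID.
status_req = build_request("GET", "/v1/admin/status")
job_req = build_request("GET", "/v1/jobs/{id}".format(id="job-123"))
stop_req = build_request("POST", "/v1/jobs/{id}/stop".format(id="job-123"))

# Hypothetical body: the real CrawlDB query schema is not given on this page.
crawldb_req = build_request("POST", "/v1/db/crawldb", {"type": "stats"})
```

To send one of these requests, pass it to `urllib.request.urlopen`, for example `urllib.request.urlopen(status_req)`, and decode the response body according to whatever format the server actually returns.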

MCP Tools

get-server-status (read-only)

Get the current status of the Apache Nutch server, including running jobs and known configurations.

list-configs (read-only)

List all known Nutch configuration identifiers.

create-config

Create a new Nutch crawl configuration with custom properties.

get-config (read-only)

Get all configuration properties for a specific Nutch configuration.

list-jobs (read-only)

List all Nutch crawl jobs, optionally filtered by crawl ID.

create-job

Create and start a new Nutch crawl job. Job types include INJECT, GENERATE, FETCH, PARSE, UPDATEDB, INDEX, DEDUP, and INVERTLINKS.

get-job-info (read-only)

Get the current state and details for a specific Nutch crawl job.

stop-job

Stop a running Nutch crawl job gracefully.

abort-job

Abort a Nutch crawl job immediately without waiting for graceful shutdown.

list-seed-lists (read-only)

List all available seed URL lists in the Nutch server.

create-seed-list

Create a new seed URL list for initializing a crawl.

query-crawldb (read-only)

Query the Apache Nutch CrawlDB for statistics, data dumps, or specific URL status.

query-fetchdb (read-only)

Query the Apache Nutch FetchDB for node fetch history and statistics.
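
The job types named under create-job are usually run in sequence rather than in isolation. The sketch below plans one common ordering: INJECT once, then GENERATE, FETCH, PARSE, and UPDATEDB for each crawl round, followed by INVERTLINKS, DEDUP, and INDEX. This ordering is an assumption modeled on the usual Nutch crawl cycle, not something this page prescribes. The payload field names `crawlId`, `confId`, and `type`, and the functions `plan_crawl` and `job_payloads`, are likewise hypothetical and should be checked against the actual create-job schema.

```python
# Job type names are taken from the create-job tool description.
PER_ROUND = ["GENERATE", "FETCH", "PARSE", "UPDATEDB"]
FINISH = ["INVERTLINKS", "DEDUP", "INDEX"]


def plan_crawl(rounds):
    """Return an ordered list of job types for a crawl with the given number of rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    plan = ["INJECT"]  # seed URLs are injected once, before the first round
    for _ in range(rounds):
        plan.extend(PER_ROUND)
    plan.extend(FINISH)  # link inversion, deduplication, and indexing run last
    return plan


def job_payloads(crawl_id, conf_id, rounds):
    """Build one hypothetical create-job payload per planned job, in order."""
    return [
        {"crawlId": crawl_id, "confId": conf_id, "type": job_type}
        for job_type in plan_crawl(rounds)
    ]


payloads = job_payloads("demo-crawl", "default", rounds=2)
```

Each payload would be submitted with create-job, and the next job started only after get-job-info reports that the previous one has completed.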

APIs Used

nutch