Apache Nutch · Capability

Apache Nutch Crawl Management

Workflow capability for managing end-to-end web crawl pipelines with Apache Nutch. Covers job lifecycle management, configuration control, seed list management, and CrawlDB and FetchDB querying, for web crawl engineers and data engineers.

Apache Nutch · Web Crawler · Crawl Management · Data Engineering

What You Can Do

GET /v1/admin/status
Get status: Get Apache Nutch server status.

GET /v1/configs
List configs: List all available configurations.

POST /v1/configs
Create config: Create a new crawl configuration.

GET /v1/configs/{configId}
Get config: Get all properties for a configuration.

DELETE /v1/configs/{configId}
Delete config: Delete a configuration.

GET /v1/jobs
List jobs: List all crawl jobs.

POST /v1/jobs
Create job: Create and start a crawl job.

GET /v1/jobs/{id}
Get job: Get job status and info.

POST /v1/jobs/{id}/stop
Stop job: Stop a running crawl job.

GET /v1/seeds
List seeds: List all seed URL lists.

POST /v1/seeds
Create seed: Create a new seed URL list.

POST /v1/db/crawldb
Query crawldb: Query the CrawlDB for stats or URL lookups.

POST /v1/db/fetchdb
Query fetchdb: Query the FetchDB for node information.
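
As an illustration of how the endpoints listed above might be called, here is a minimal Python sketch that builds (but does not send) HTTP requests using only the standard library. The base URL, the helper names `build_request` and `job_path`, and the request body field `type` are assumptions made for this example; the page does not document host, port, authentication, or payload schemas.

```python
import json
from urllib.parse import quote
from urllib.request import Request

# Assumed host and port; replace with the address of your own deployment.
BASE_URL = "http://localhost:8081"


def build_request(method, path, body=None):
    """Build (without sending) a request for one of the listed endpoints."""
    data = None
    headers = {"Accept": "application/json"}
    if body is not None:
        # The JSON payload shape is hypothetical; check the actual schema.
        data = json.dumps(body).encode("utf-8")
        headers["Content-Type"] = "application/json"
    return Request(BASE_URL + path, data=data, headers=headers, method=method)


def job_path(job_id, action=None):
    """Return /v1/jobs/{id} or /v1/jobs/{id}/{action}, URL-escaping the id."""
    path = "/v1/jobs/" + quote(job_id, safe="")
    if action is not None:
        path += "/" + action
    return path


# Example requests for three of the operations listed above.
status_req = build_request("GET", "/v1/admin/status")
create_req = build_request("POST", "/v1/jobs", {"type": "INJECT"})  # field name assumed
stop_req = build_request("POST", job_path("job-42", "stop"))  # "job-42" is a made-up id
```

Passing any of these objects to `urllib.request.urlopen` would send it. Building the request separately from sending it keeps the sketch testable without a running server.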

MCP Tools

get-server-status

Get the current status of the Apache Nutch server including running jobs and known configurations.

read-only

list-configs

List all known Nutch configuration identifiers.

read-only

create-config

Create a new Nutch crawl configuration with custom properties.

get-config

Get all configuration properties for a specific Nutch configuration.

read-only

list-jobs

List all Nutch crawl jobs, optionally filtered by crawl ID.

read-only

create-job

Create and start a new Nutch crawl job. Job types include INJECT, GENERATE, FETCH, PARSE, UPDATEDB, INDEX, DEDUP, INVERTLINKS.

get-job-info

Get the current state and details for a specific Nutch crawl job.

read-only

stop-job

Stop a running Nutch crawl job gracefully.

abort-job

Abort a Nutch crawl job immediately without waiting for graceful shutdown.

list-seed-lists

List all available seed URL lists in the Nutch server.

read-only

create-seed-list

Create a new seed URL list for initializing a crawl.

query-crawldb

Query the Apache Nutch CrawlDB for statistics, data dumps, or specific URL status.

read-only

query-fetchdb

Query the Apache Nutch FetchDB for node fetch history and statistics.

read-only
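
The job types named in the create-job description are usually run in sequence rather than in isolation. The sketch below orders them into a crawl plan: one INJECT step followed by a repeated round of the remaining steps. The ordering follows the conventional Nutch crawl cycle and is an assumption about how this capability would be driven, not something this page specifies; `plan_crawl` is a hypothetical helper name.

```python
# Job type names as listed in the create-job tool description.
JOB_TYPES = {
    "INJECT", "GENERATE", "FETCH", "PARSE",
    "UPDATEDB", "INDEX", "DEDUP", "INVERTLINKS",
}

# Seeding happens once; the remaining steps repeat every crawl round.
# This order is an assumption based on the conventional Nutch crawl cycle.
SEED_STEP = "INJECT"
ROUND_STEPS = [
    "GENERATE", "FETCH", "PARSE", "UPDATEDB",
    "INVERTLINKS", "DEDUP", "INDEX",
]


def plan_crawl(rounds):
    """Return the ordered list of job types for a crawl with `rounds` rounds."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    plan = [SEED_STEP]
    for _ in range(rounds):
        plan.extend(ROUND_STEPS)
    return plan


# Each entry would be submitted with create-job, and the next one started
# only after get-job-info reports that the previous job has finished.
plan = plan_crawl(2)
```

Deployments that do not index or deduplicate on every round could drop those entries from `ROUND_STEPS`.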

APIs Used

nutch