Intro to Crawl Workflows
Crawl workflows are the bread and butter of automated browser-based crawling. A crawl workflow enables you to specify how and what the crawler should capture on a website.
A finished crawl results in an archived item that can be downloaded and shared. To easily identify and find archived items within your org, you can automatically name and tag archived items through custom workflow metadata.
You can create, view, search for, and run crawl workflows from the Crawling page.
Create a Crawl Workflow
Create new crawl workflows from the Crawling page, or the Create New ... shortcut from Overview.
Choose what to crawl
The first step in creating a new crawl workflow is to choose what you'd like to crawl by defining a Crawl Scope. Crawl scopes are categorized as a Page Crawl or Site Crawl.
Page Crawl
Choose one of these crawl scopes if you know the URL of every page you'd like to crawl and don't need the crawler to discover pages beyond one hop out.
Page Crawl workflows are simpler to configure, since you don't need to worry about excluding parts of the website that you may not want to archive.
Site Crawl
Choose one of these crawl scopes to have the crawler automatically find pages based on a domain name, start page URL, or directory on a website.
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
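To make the distinction concrete, here is an illustrative sketch of how the scope choices above might map onto crawler scope settings. The `scopeType` values (`page`, `prefix`, `host`, `domain`, `custom`) are the scope types used by browsertrix-crawler, but the exact label-to-config mapping shown here, including the `extraHops` field for In-Page Links, is an assumption for illustration only.

```python
# Hypothetical mapping of UI crawl scope labels to crawler scope settings.
# The scopeType values come from browsertrix-crawler; the mapping itself
# is an illustrative assumption, not the product's actual configuration.

PAGE_CRAWL_SCOPES = {
    "Single Page": {"scopeType": "page"},
    "Page List": {"scopeType": "page"},  # one seed per listed URL
    "In-Page Links": {"scopeType": "page", "extraHops": 1},  # one hop out
}

SITE_CRAWL_SCOPES = {
    "Pages in Same Directory": {"scopeType": "prefix"},
    "Pages on Same Domain": {"scopeType": "host"},
    "Pages on Same Domain + Subdomains": {"scopeType": "domain"},
    "Custom Page Prefix": {"scopeType": "custom"},
}


def scope_config(label: str) -> dict:
    """Look up the crawler scope settings for a UI scope label."""
    return {**PAGE_CRAWL_SCOPES, **SITE_CRAWL_SCOPES}[label]


print(scope_config("Pages on Same Domain"))
```

The key practical difference: Page Crawl scopes enumerate their seeds up front, while Site Crawl scopes define a boundary (directory, host, or domain) within which the crawler discovers pages on its own.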
After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the workflow settings guide.
Run Crawl
Run a crawl workflow by selecting Run Crawl from the workflow's actions menu in the crawl workflow list, or by clicking the Run Crawl button on the workflow's details page.
While crawling, the Watch Crawl section displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue. You can modify the crawl live by adding URL exclusions or changing the number of crawling instances.
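URL exclusions are regular expressions: a queued URL that matches any exclusion pattern is skipped. The helper below is a minimal sketch of that matching behavior, not the crawler's actual implementation.

```python
import re


def is_excluded(url: str, exclusions: list[str]) -> bool:
    """Return True if the URL matches any exclusion pattern.

    Exclusions are regular expressions; a queued URL matching any
    of them is dropped from the crawl. Illustrative sketch only.
    """
    return any(re.search(pattern, url) for pattern in exclusions)


exclusions = [r"/login", r"\?sort="]
print(is_excluded("https://example.com/login?next=/", exclusions))  # True
print(is_excluded("https://example.com/about", exclusions))         # False
```

Because patterns are applied to the live queue, adding an exclusion mid-crawl removes matching URLs that haven't been visited yet.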
Re-running a crawl workflow can be useful to capture a website as it changes over time, or to run with an updated crawl scope.
Status
Finished crawl workflows inherit the status of the last archived item they created. Crawl workflows that are in progress maintain their own statuses.