* 1.6 docs update

### Changes

- Adds note in style guide about referencing actions in the app
- Adds page for Browser Profiles
- Adds callout for uploads in the context of combining items from multiple sources
- Adds page for Collections
- Adds page for Crawl Workflows
- Updates index to link to new dedicated Crawl Workflow page in addition to the Crawl Workflow Setup page
- Updates Org Settings page action styling in accordance with new rules
- Updates Crawl Workflow Setup page with links to the new pages and a hierarchy fix for the first item
- Updates user guide navigation with a new section for crawling related items

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Crawl Workflows
Crawl workflows consist of a list of configuration options that tell the crawler what to capture.
Creating and Editing Crawl Workflows
New crawl workflows can be created from the Crawling page. A detailed breakdown of available settings can be found on the Crawl Workflow Setup page.
Running Crawl Workflows
Crawl workflows can be run from the actions menu of the workflow in the crawl workflow list, or by clicking the Run Crawl button on the workflow's details page.
While a crawl is running, the Watch Crawl page displays the queue of URLs waiting to be visited, and streams the current state of the browser windows as they visit pages from the queue.
Re-running a crawl workflow that has already run successfully is useful for capturing content as it changes over time, or for crawling with an updated Crawl Scope.
Live Exclusion Editing
Exclusions can be set before running a crawl workflow, but during a crawl the crawler may discover parts of the site that weren't previously known about and shouldn't be crawled. It may also get stuck in "crawler traps": parts of a website that automatically generate URLs.
If the crawl queue is filled with URLs that should not be crawled, use the Edit Exclusions button on the Watch Crawl page to tell the crawler which pages should be excluded from the queue.
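To make the idea of a crawler trap concrete, here is a minimal sketch in Python. It assumes a hypothetical site whose calendar pages each link to the following month, so the queue never empties on its own. The names `links_for`, `crawl`, and `page_limit` are illustrative only and do not describe how the crawler is actually implemented.

```python
import re
from collections import deque


def links_for(url: str) -> list[str]:
    """Hypothetical site: each calendar page links to the following month."""
    if url == "https://example.com/":
        return [
            "https://example.com/about",
            "https://example.com/calendar?month=1",
        ]
    match = re.search(r"calendar\?month=(\d+)$", url)
    if match:
        next_month = int(match.group(1)) + 1
        return [f"https://example.com/calendar?month={next_month}"]
    return []


def crawl(seed: str, page_limit: int, exclusions: list[str]) -> list[str]:
    """Breadth-first crawl that never queues a URL matching an exclusion."""
    patterns = [re.compile(p) for p in exclusions]
    queue = deque([seed])
    seen = {seed}
    visited = []
    while queue and len(visited) < page_limit:
        url = queue.popleft()
        visited.append(url)
        for link in links_for(url):
            if link in seen or any(p.search(link) for p in patterns):
                continue
            seen.add(link)
            queue.append(link)
    return visited


# Without an exclusion, the calendar fills the crawl up to the page limit.
trapped = crawl("https://example.com/", page_limit=50, exclusions=[])
# With an exclusion for the calendar, the crawl finishes after two pages.
fixed = crawl("https://example.com/", page_limit=50, exclusions=[r"/calendar\?"])
print(len(trapped), len(fixed))
```

In this sketch the first crawl only ends because it reaches the page limit, while the second one ends because its queue is empty.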
Exclusions added during a crawl are saved to the same exclusion table as the one in the workflow's settings, and will be used the next time the crawl workflow is run unless they are manually removed.
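The two effects of a live exclusion, removing matching URLs from the current queue and saving the rule for later runs, can be sketched as follows. This is an illustration under stated assumptions rather than the app's actual code: `apply_live_exclusion` is a hypothetical helper, exclusions are treated as regular expressions, and both the queue and the saved table are modeled as plain Python lists.

```python
import re


def apply_live_exclusion(queue, saved_exclusions, pattern):
    """Drop queued URLs matching `pattern` and record it in the saved table."""
    regex = re.compile(pattern)
    remaining = [url for url in queue if not regex.search(url)]
    # The same table is reused on the next run, so avoid duplicate entries.
    if pattern not in saved_exclusions:
        saved_exclusions.append(pattern)
    return remaining


# Hypothetical exclusion table saved with the workflow before the crawl.
workflow_exclusions = [r"/login"]

queue = [
    "https://example.com/blog/post-1",
    "https://example.com/search?q=a",
    "https://example.com/search?q=b",
    "https://example.com/blog/post-2",
]

queue = apply_live_exclusion(queue, workflow_exclusions, r"/search\?")
print(queue)
print(workflow_exclusions)
```

After the call, only the two blog posts remain queued, and the new pattern sits alongside the original one until someone removes it.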
Ending a Crawl
If a crawl workflow is not crawling websites as intended, it may be preferable to end crawling operations and update the crawl workflow's settings before trying again. There are two operations to end crawls, available both on the workflow's details page and in the actions menu in the workflow list.
Stopping
Stopping a crawl will throw away the crawl queue but otherwise gracefully end the process and save anything that has been collected. Stopped crawls show up in the list of Archived Items and can be used like any other item in the app.
Canceling
Canceling a crawl will throw away all data collected and immediately end the process. Canceled crawls do not show up in the list of Archived Items, though a record of the runtime and workflow settings can be found in the crawl workflow's list of crawls.
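The difference between the two operations can be summarized in a small model. This is a sketch of the behavior described above, not the app's data model: the `Crawl` and `Org` classes and their fields are hypothetical, and recording a stopped crawl in the workflow's list of crawls is an assumption (only the canceled case is stated explicitly).

```python
from dataclasses import dataclass, field


@dataclass
class Crawl:
    queue: list[str]
    pages: list[str] = field(default_factory=list)
    state: str = "running"


@dataclass
class Org:
    archived_items: list[Crawl] = field(default_factory=list)
    workflow_crawls: list[Crawl] = field(default_factory=list)


def stop(crawl: Crawl, org: Org) -> None:
    """Discard the queue, keep collected pages, and archive the result."""
    crawl.queue.clear()
    crawl.state = "stopped"
    org.archived_items.append(crawl)
    org.workflow_crawls.append(crawl)  # assumption: stopped crawls are listed too


def cancel(crawl: Crawl, org: Org) -> None:
    """Discard the queue and all collected pages; keep only a record."""
    crawl.queue.clear()
    crawl.pages.clear()
    crawl.state = "canceled"
    org.workflow_crawls.append(crawl)


org = Org()
stopped = Crawl(queue=["https://example.com/c"], pages=["https://example.com/a"])
canceled = Crawl(queue=["https://example.com/d"], pages=["https://example.com/b"])

stop(stopped, org)
cancel(canceled, org)

print(len(org.archived_items), len(org.workflow_crawls))
print(stopped.pages, canceled.pages)
```

In short, stopping keeps what was captured and makes it usable as an archived item, while canceling leaves nothing behind except the record of the run.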