From c01e3dd88b9ea75941a0ab723db0e87be7e26aa9 Mon Sep 17 00:00:00 2001
From: sua yoo
Date: Mon, 9 Sep 2024 16:42:47 -0700
Subject: [PATCH] feat: Improve UX of choosing new workflow crawl type (#2067)

Resolves https://github.com/webrecorder/browsertrix/issues/2066

### Changes

- Allows directly choosing new "Page List" or "Site Crawl" from the workflow list
- Reverts terminology introduced in https://github.com/webrecorder/browsertrix/pull/2032

---
 docs/user-guide/crawl-workflows.md            | 10 +--
 docs/user-guide/getting-started.md            |  4 +-
 docs/user-guide/workflow-setup.md             | 16 ++---
 frontend/src/components/ui/config-details.ts  | 20 +++---
 .../crawl-workflows/new-workflow-dialog.ts    | 56 ++++++++---------
 .../crawl-workflows/workflow-editor.ts        |  4 +-
 frontend/src/pages/org/index.ts               |  4 ++
 frontend/src/pages/org/workflows-list.ts      | 62 ++++++++++++++-----
 frontend/src/pages/org/workflows-new.ts       |  8 +--
 9 files changed, 111 insertions(+), 73 deletions(-)

diff --git a/docs/user-guide/crawl-workflows.md b/docs/user-guide/crawl-workflows.md
index 78bbe3f6..8b3620aa 100644
--- a/docs/user-guide/crawl-workflows.md
+++ b/docs/user-guide/crawl-workflows.md
@@ -12,19 +12,19 @@ Create new crawl workflows from the **Crawling** page, or the _Create New ..._
 
 ### Choose what to crawl
 
-The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **URL List** or **Seeded Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
+The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **Page List** or **Site Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
 
-#### Known URLs `URL List`{ .badge-blue }
+#### Page List
 
 Choose this option if you already know the URL of every page you'd like to crawl. The crawler will visit every URL specified in a list, and optionally every URL linked on those pages.
 
-A URL list is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
+A Page List workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
 
-#### Automated Discovery `Seeded Crawl`{ .badge-orange }
+#### Site Crawl
 
 Let the crawler automatically discover pages based on a domain or start page that you specify.
 
-Seeded crawls are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
+Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
 
 After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the [workflow settings guide](workflow-setup.md).
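To make the two crawl types above concrete, here is a minimal TypeScript sketch of how a crawl-type choice might map onto crawler seed settings. The names here (`CrawlType`, `scopeType`, `includeLinkedPages`) are illustrative assumptions for this note, not the actual Browsertrix config schema or anything introduced by this patch.

```ts
// Hypothetical sketch only: field names are stand-ins, not the real schema.
type CrawlType = "page-list" | "site-crawl";

interface SeedConfig {
  seeds: string[];
  // "page" visits only the listed URLs; "prefix" follows links under the
  // start URL's path. Rough stand-ins for the crawler's real scope options.
  scopeType: "page" | "prefix";
  includeLinkedPages: boolean;
}

function makeSeedConfig(type: CrawlType, urls: string[]): SeedConfig {
  if (type === "page-list") {
    // Page List: crawl exactly the URLs the user supplied; following links
    // found on those pages stays opt-in.
    return { seeds: urls, scopeType: "page", includeLinkedPages: false };
  }
  // Site Crawl: one start URL, with in-scope pages discovered automatically.
  return {
    seeds: urls.slice(0, 1),
    scopeType: "prefix",
    includeLinkedPages: true,
  };
}

console.log(
  makeSeedConfig("page-list", [
    "https://example.com/a",
    "https://example.com/b",
  ]),
);
```

The point of the split is visible in the shape of the config: a Page List workflow is fully specified by its URLs, while a Site Crawl needs scope rules to bound automatic discovery.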
diff --git a/docs/user-guide/getting-started.md b/docs/user-guide/getting-started.md
index 677959b5..a99974be 100644
--- a/docs/user-guide/getting-started.md
+++ b/docs/user-guide/getting-started.md
@@ -15,8 +15,8 @@ To start crawling with hosted Browsertrix, you'll need a Browsertrix account. [S
 
 Once you've logged in you should see your org [overview](overview.md). If you land somewhere else, navigate to **Overview**.
 
 1. Tap the _Create New..._ shortcut and select **Crawl Workflow**.
-2. Choose **Known URLs**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
-3. Enter the URL of the webpage that you noted earlier in **Crawl URL(s)**.
+2. Choose **Page List**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
+3. Enter the URL of the webpage that you noted earlier in **Page URL(s)**.
 4. Tap _Review & Save_.
 5. Tap _Save Workflow_.
 6. You should now see your new crawl workflow. Give the crawler a few moments to warm up, and then watch as it archives the webpage!

diff --git a/docs/user-guide/workflow-setup.md b/docs/user-guide/workflow-setup.md
index 5dae9c82..8a55393a 100644
--- a/docs/user-guide/workflow-setup.md
+++ b/docs/user-guide/workflow-setup.md
@@ -8,17 +8,17 @@ Crawl settings are shown in the crawl workflow detail **Settings** tab and in th
 
 ## Crawl Scope
 
-Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Known URLs_ (crawl type of **URL List**) or _Automated Discovery_ (crawl type of **Seeded Crawl**) when creating a new workflow.
+Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Page List_ or _Site Crawl_ when creating a new workflow.
 
 ??? example "Crawling with HTTP basic auth"
-    Both URL List and Seeded crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
+    Both Page List and Site Crawl workflows support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
 
     **These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.
 
-### Crawl Type: URL List
+### Crawl Type: Page List
 
-#### Crawl URL(s)
+#### Page URL(s)
 
 A list of one or more URLs that the crawler should visit and capture.
 
@@ -26,14 +26,14 @@ A list of one or more URLs that the crawler should visit and capture.
 
 When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field.
 
-??? example "Crawling tags & search queries with URL List crawls"
+??? example "Crawling tags & search queries with Page List crawls"
     This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page.
 
 #### Fail Crawl on Failed URL
 
 When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed".
 
-### Crawl Type: Seeded Crawl
+### Crawl Type: Site Crawl
 
 #### Crawl Start URL
 
@@ -84,7 +84,7 @@ This can be useful for discovering and capturing pages on a website that aren't
 
 ### Exclusions
 
-The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in URL List crawls when _Include Any Linked Page_ is enabled. 
+The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when _Include Any Linked Page_ is enabled.
 
 This can be useful for avoiding crawler traps — sites that may automatically generate pages such as calendars or filter options — or other pages that should not be crawled according to their URL.
 
@@ -228,7 +228,7 @@ Describe and organize your crawl workflow and the resulting archived items.
 
 ### Name
 
-Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For URL List crawls, the workflow's name will be set to the first URL present in the _Crawl URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
+Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For Page List crawls, the workflow's name will be set to the first URL present in the _Page URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
 
 ### Description
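The exclusion behavior described in the hunk above (a link is skipped when an exclusion matches all or part of its URL) can be sketched in a few lines. This toy version assumes every rule is a regular expression; the real crawler also supports plain text rules, which behave like substring matches.

```ts
// Toy model of exclusion filtering, assuming each rule is a regex source
// string. A plain string without metacharacters works as a substring match.
function filterLinks(links: string[], exclusions: string[]): string[] {
  const patterns = exclusions.map((rule) => new RegExp(rule));
  // Keep a link only if no exclusion pattern matches any part of it.
  return links.filter((link) => !patterns.some((re) => re.test(link)));
}

// Example: skip calendar pages, a classic crawler trap.
const discovered = [
  "https://example.com/about",
  "https://example.com/events/calendar?month=1999-01",
];
console.log(filterLinks(discovered, ["/calendar"]));
// -> ["https://example.com/about"]
```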
diff --git a/frontend/src/components/ui/config-details.ts b/frontend/src/components/ui/config-details.ts
index 21d130e8..4c051312 100644
--- a/frontend/src/components/ui/config-details.ts
+++ b/frontend/src/components/ui/config-details.ts
@@ -333,7 +333,7 @@ export class ConfigDetails extends LiteElement {
     return html`
       ${this.renderSetting(
-        msg("Crawl URL(s)"),
+        msg("Page URL(s)"),
         html`
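As a worked example of the default-naming rule from the **Name** section above: the first page URL plus a `(+x)` suffix. This hypothetical helper follows the doc's wording, where `x` is the total number of URLs in the list; the frontend's actual logic may differ in edge cases such as a single-URL list.

```ts
// Hypothetical helper mirroring the documented default-name rule for
// Page List workflows; not code from this patch.
function defaultWorkflowName(pageUrls: string[]): string {
  if (pageUrls.length === 0) return "";
  // Assumption: a single URL needs no count suffix.
  if (pageUrls.length === 1) return pageUrls[0];
  return `${pageUrls[0]} (+${pageUrls.length})`;
}

console.log(
  defaultWorkflowName([
    "https://example.com",
    "https://example.com/news",
    "https://example.com/blog",
  ]),
);
// -> "https://example.com (+3)"
```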