feat: Improve UX of choosing new workflow crawl type (#2067)

Resolves https://github.com/webrecorder/browsertrix/issues/2066

### Changes
- Allows directly choosing the new "Page List" or "Site Crawl" type from the
workflow list
- Reverts terminology introduced in
https://github.com/webrecorder/browsertrix/pull/2032
sua yoo 2024-09-09 16:42:47 -07:00 committed by GitHub
parent b4e34d1c3c
commit c01e3dd88b
9 changed files with 111 additions and 73 deletions

View File

@@ -12,19 +12,19 @@ Create new crawl workflows from the **Crawling** page, or the _Create New ..._
### Choose what to crawl
The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **URL List** or **Seeded Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
The first step in creating a new crawl workflow is to choose what you'd like to crawl. This determines whether the crawl type will be **Page List** or **Site Crawl**. Crawl types can't be changed after the workflow is created—you'll need to create a new crawl workflow.
#### Known URLs `URL List`{ .badge-blue }
#### Page List
Choose this option if you already know the URL of every page you'd like to crawl. The crawler will visit every URL specified in a list, and optionally every URL linked on those pages.
A URL list is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
A Page List workflow is simpler to configure, since you don't need to worry about configuring the workflow to exclude parts of the website that you may not want to archive.
#### Automated Discovery `Seeded Crawl`{ .badge-orange }
#### Site Crawl
Let the crawler automatically discover pages based on a domain or start page that you specify.
Seeded crawls are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
Site Crawl workflows are great for advanced use cases where you don't need (or want) to know every single URL of the website that you're archiving.
After deciding what type of crawl you'd like to run, you can begin to set up your workflow. A detailed breakdown of available settings can be found in the [workflow settings guide](workflow-setup.md).

View File

@@ -15,8 +15,8 @@ To start crawling with hosted Browsertrix, you'll need a Browsertrix account. [S
Once you've logged in you should see your org [overview](overview.md). If you land somewhere else, navigate to **Overview**.
1. Tap the _Create New..._ shortcut and select **Crawl Workflow**.
2. Choose **Known URLs**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
3. Enter the URL of the webpage that you noted earlier in **Crawl URL(s)**.
2. Choose **Page List**. We'll get into the details of the options [later](./crawl-workflows.md), but this is a good starting point for a simple crawl.
3. Enter the URL of the webpage that you noted earlier in **Page URL(s)**.
4. Tap _Review & Save_.
5. Tap _Save Workflow_.
6. You should now see your new crawl workflow. Give the crawler a few moments to warm up, and then watch as it archives the webpage!

View File

@@ -8,17 +8,17 @@ Crawl settings are shown in the crawl workflow detail **Settings** tab and in th
## Crawl Scope
Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Known URLs_ (crawl type of **URL List**) or _Automated Discovery_ (crawl type of **Seeded Crawl**) when creating a new workflow.
Specify the range and depth of your crawl. Different settings will be shown depending on whether you chose _Page List_ or _Site Crawl_ when creating a new workflow.
??? example "Crawling with HTTP basic auth"
Both URL List and Seeded crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
Both Page List and Site Crawls support [HTTP Basic Auth](https://developer.mozilla.org/en-US/docs/Web/HTTP/Authentication) which can be provided as part of the URL, for example: `https://username:password@example.com`.
**These credentials WILL BE WRITTEN into the archive.** We recommend exercising caution and only archiving with dedicated archival accounts, changing your password or deleting the account when finished.
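As a quick illustration of why the warning above matters (this uses the standard WHATWG `URL` API, not Browsertrix code): credentials supplied this way are ordinary components of the URL itself.

```ts
// Illustration only, using the standard URL API: credentials embedded in
// a seed URL are plain URL components, which is why they get written into
// the archive along with everything else the crawler requests.
const seed = new URL("https://username:password@example.com/");
console.log(seed.username); // "username"
console.log(seed.password); // "password"
console.log(seed.href);     // "https://username:password@example.com/"
```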
### Crawl Type: URL List
### Crawl Type: Page List
#### Crawl URL(s)
#### Page URL(s)
A list of one or more URLs that the crawler should visit and capture.
@@ -26,14 +26,14 @@ A list of one or more URLs that the crawler should visit and capture.
When enabled, the crawler will visit all the links it finds within each page defined in the _Crawl URL(s)_ field.
??? example "Crawling tags & search queries with URL List crawls"
??? example "Crawling tags & search queries with Page List crawls"
This setting can be useful for crawling the content of specific tags or search queries. Specify the tag or search query URL(s) in the _Crawl URL(s)_ field, e.g.: `https://example.com/search?q=tag`, and enable _Include Any Linked Page_ to crawl all the content present on that search query page.
#### Fail Crawl on Failed URL
When enabled, the crawler will fail the entire crawl if any of the provided URLs are invalid or unsuccessfully crawled. The resulting archived item will have a status of "Failed".
### Crawl Type: Seeded Crawl
### Crawl Type: Site Crawl
#### Crawl Start URL
@@ -84,7 +84,7 @@ This can be useful for discovering and capturing pages on a website that aren't
### Exclusions
The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in URL List crawls when _Include Any Linked Page_ is enabled.
The exclusions table will instruct the crawler to ignore links it finds on pages where all or part of the link matches an exclusion found in the table. The table is only available in Page List crawls when _Include Any Linked Page_ is enabled.
This can be useful for avoiding crawler traps — sites that may automatically generate pages such as calendars or filter options — or other pages that should not be crawled according to their URL.
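A minimal sketch of the matching idea described here (simplified and hypothetical, not the crawler's actual implementation; real exclusion rules are configured in the workflow editor):

```ts
// Simplified sketch of exclusion matching, not Browsertrix's implementation:
// ignore any discovered link whose URL matches an exclusion pattern.
const exclusions: RegExp[] = [/\/calendar\//, /[?&]filter=/];

const shouldSkip = (url: string): boolean =>
  exclusions.some((pattern) => pattern.test(url));

shouldSkip("https://example.com/calendar/2024/01"); // true (a crawler trap)
shouldSkip("https://example.com/about");            // false
```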
@@ -228,7 +228,7 @@ Describe and organize your crawl workflow and the resulting archived items.
### Name
Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For URL List crawls, the workflow's name will be set to the first URL present in the _Crawl URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
Allows a custom name to be set for the workflow. If no name is set, the workflow's name will be set to the _Crawl Start URL_. For Page List crawls, the workflow's name will be set to the first URL present in the _Crawl URL(s)_ field, with an added `(+x)` where `x` represents the total number of URLs in the list.
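As a sketch of the naming rule just described (a hypothetical helper, not the actual implementation):

```ts
// Hypothetical helper illustrating the default-name rule above; per the
// docs, "x" is the total number of URLs in the list.
function defaultWorkflowName(pageUrls: string[]): string {
  if (pageUrls.length <= 1) return pageUrls[0] ?? "";
  return `${pageUrls[0]} (+${pageUrls.length})`;
}

defaultWorkflowName(["https://example.com", "https://example.org"]);
// => "https://example.com (+2)"
```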
### Description

View File

@@ -333,7 +333,7 @@ export class ConfigDetails extends LiteElement {
return html`
${this.renderSetting(
msg("Crawl URL(s)"),
msg("Page URL(s)"),
html`
<ul>
${this.seeds?.map(
@@ -375,14 +375,16 @@ export class ConfigDetails extends LiteElement {
primarySeedConfig?.include || seedsConfig.include || [];
return html`
${this.renderSetting(
msg("Primary Seed URL"),
html`<a
class="text-blue-600 hover:text-blue-500 hover:underline"
href="${primarySeedUrl!}"
target="_blank"
rel="noreferrer"
>${primarySeedUrl}</a
>`,
msg("Crawl Start URL"),
primarySeedUrl
? html`<a
class="text-blue-600 hover:text-blue-500 hover:underline"
href="${primarySeedUrl}"
target="_blank"
rel="noreferrer"
>${primarySeedUrl}</a
>`
: undefined,
true,
)}
${this.renderSetting(

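The change above swaps an unconditional link (note the `primarySeedUrl!` non-null assertion it removes) for a guard that only renders the anchor when the URL exists. The pattern in isolation, as a sketch (`renderSetting`'s actual signature isn't shown in this diff):

```ts
import { html, type TemplateResult } from "lit";

// Sketch of the guard introduced above: build the anchor only when a
// primary seed URL exists; otherwise return undefined so the setting
// renders empty instead of linking to an undefined href.
function seedLink(url?: string): TemplateResult | undefined {
  if (!url) return undefined;
  return html`<a href=${url} target="_blank" rel="noreferrer">${url}</a>`;
}
```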
View File

@@ -43,14 +43,15 @@ export class NewWorkflowDialog extends TailwindElement {
src=${urlListSvg}
/>
<figcaption class="p-1">
<div
class="my-2 text-lg font-semibold leading-none transition-colors group-hover:text-primary-700"
>
${msg("Known URLs")}
<div class="leading-none my-2 font-semibold">
<div class="transition-colors group-hover:text-primary-700">
${msg("Page List")}:
</div>
<div class="text-lg">${msg("One or more URLs")}</div>
</div>
<p class="text-balance leading-normal text-neutral-700">
<p class="leading-normal text-neutral-700">
${msg(
"Choose this option to crawl a single page, or if you already know the URL of every page you'd like to crawl.",
"Choose this option if you know the URL of every page you'd like to crawl and don't need to include any additional pages beyond one hop out.",
)}
</p>
</figcaption>
@@ -72,14 +73,15 @@ export class NewWorkflowDialog extends TailwindElement {
src=${seededCrawlSvg}
/>
<figcaption class="p-1">
<div
class="my-2 text-lg font-semibold leading-none transition-colors group-hover:text-primary-700"
>
${msg("Automated Discovery")}
<div class="leading-none my-2 font-semibold">
<div class="transition-colors group-hover:text-primary-700">
${msg("Site Crawl")}:
</div>
<div class="text-lg">${msg("Website or directory")}</div>
</div>
<p class="text-balance leading-normal text-neutral-700">
<p class="leading-normal text-neutral-700">
${msg(
"Let the crawler automatically discover pages based on a domain or start page that you specify.",
"Specify a domain name, start page URL, or path on a website and let the crawler automatically find pages within that scope.",
)}
</p>
</figcaption>
@@ -92,32 +94,28 @@
@sl-after-hide=${this.stopProp}
>
<p class="mb-3">
${msg(
html`Choose <strong>Known URLs</strong> (aka a "URL List" crawl
type) if:`,
)}
${msg(html`Choose <strong>Page List</strong> if:`)}
</p>
<ul class="mb-3 list-disc pl-5">
<li>${msg("You want to archive a single page on a website")}</li>
<li>
${msg("You're archiving just a few specific pages on a website")}
${msg("You have a list of URLs that you can copy-and-paste")}
</li>
<li>
${msg("You have a list of URLs that you can copy-and-paste")}
${msg(
"You want to include URLs with different domain names in the same crawl",
)}
</li>
</ul>
<p class="mb-3">
${msg(
html`A URL list is simpler to configure, since you don't need to
worry about configuring the workflow to exclude parts of the
website that you may not want to archive.`,
html`A Page List workflow is simpler to configure, since you don't
need to worry about configuring the workflow to exclude parts of
the website that you may not want to archive.`,
)}
</p>
<p class="mb-3">
${msg(
html`Choose <strong>Automated Discovery</strong> (aka a "Seeded
Crawl" crawl type) if:`,
)}
${msg(html`Choose <strong>Site Crawl</strong> if:`)}
</p>
<ul class="mb-3 list-disc pl-5">
<li>${msg("You want to archive an entire website")}</li>
@@ -136,10 +134,10 @@ export class NewWorkflowDialog extends TailwindElement {
</ul>
<p class="mb-3">
${msg(
html`Seeded crawls are great for advanced use cases where you
don't need to know every single URL that you want to archive. You
can configure reasonable crawl limits and page limits so that you
don't crawl more than you need to.`,
html`Site Crawl workflows are great for advanced use cases where
you don't need to know every single URL that you want to archive.
You can configure reasonable crawl limits and page limits so that
you don't crawl more than you need to.`,
)}
</p>
<p>

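One small detail: the dialog registers `@sl-after-hide=${this.stopProp}`. The handler's body isn't shown in this diff, but judging by its name it presumably does no more than stop the Shoelace hide event from bubbling, e.g.:

```ts
// Presumed behavior of stopProp (inferred from its name; the actual body
// is not shown in this diff): keep sl-after-hide fired by a nested
// element from bubbling up and being mistaken for the dialog closing.
function stopProp(e: Event): void {
  e.stopPropagation();
}
```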
View File

@@ -713,7 +713,7 @@ export class WorkflowEditor extends BtrixElement {
<sl-textarea
name="urlList"
class="textarea-wrap"
label=${msg("Crawl URL(s)")}
label=${msg("Page URL(s)")}
rows="10"
autocomplete="off"
inputmode="url"
@@ -1105,7 +1105,7 @@ https://example.net`}
${inputCol(html`
<sl-textarea
name="urlList"
label=${msg("Crawl URL(s)")}
label=${msg("Page URL(s)")}
rows="3"
autocomplete="off"
inputmode="url"

View File

@@ -536,6 +536,10 @@ export class Org extends LiteElement {
return html`<btrix-workflows-list
@select-new-dialog=${this.onSelectNewDialog}
@select-job-type=${(e: SelectJobTypeEvent) => {
this.openDialogName = undefined;
this.navTo(`${this.orgBasePath}/workflows?new&jobType=${e.detail}`);
}}
></btrix-workflows-list>`;
};

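For context, `SelectJobTypeEvent` is imported in the workflows-list changes below from `new-workflow-dialog`; its declaration isn't part of this diff, but from how it's dispatched and consumed here the shape is presumably along these lines:

```ts
// Presumed shape of SelectJobTypeEvent, inferred from the dispatch site in
// workflows-list below; the actual declaration in new-workflow-dialog is
// not shown in this diff.
export type SelectJobTypeEvent = CustomEvent<"url-list" | "seed-crawl">;
```

With that shape, `e.detail` in the handler above is exactly the `jobType` value the query string expects.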
View File

@@ -1,5 +1,5 @@
import { localized, msg, str } from "@lit/localize";
import type { SlCheckbox } from "@shoelace-style/shoelace";
import type { SlCheckbox, SlSelectEvent } from "@shoelace-style/shoelace";
import { type PropertyValues } from "lit";
import { customElement, state } from "lit/decorators.js";
import { ifDefined } from "lit/directives/if-defined.js";
@@ -13,6 +13,7 @@ import type { SelectNewDialogEvent } from ".";
import { CopyButton } from "@/components/ui/copy-button";
import type { PageChangeEvent } from "@/components/ui/pagination";
import { type SelectEvent } from "@/components/ui/search-combobox";
import type { SelectJobTypeEvent } from "@/features/crawl-workflows/new-workflow-dialog";
import { pageHeader } from "@/layouts/pageHeader";
import type { APIPaginatedList, APIPaginationQuery } from "@/types/api";
import { isApiError } from "@/utils/api";
@@ -208,21 +209,54 @@ export class WorkflowsList extends LiteElement {
${when(
this.appState.isCrawler,
() => html`
<sl-button
variant="primary"
size="small"
?disabled=${this.org?.readOnly}
@click=${() => {
this.dispatchEvent(
new CustomEvent("select-new-dialog", {
detail: "workflow",
}) as SelectNewDialogEvent,
);
<sl-dropdown
distance="4"
placement="bottom-end"
@sl-select=${(e: SlSelectEvent) => {
const { value } = e.detail.item;
if (value) {
this.dispatchEvent(
new CustomEvent<SelectJobTypeEvent["detail"]>(
"select-job-type",
{
detail: value as SelectJobTypeEvent["detail"],
},
),
);
} else {
this.dispatchEvent(
new CustomEvent("select-new-dialog", {
detail: "workflow",
}) as SelectNewDialogEvent,
);
}
}}
>
<sl-icon slot="prefix" name="plus-lg"></sl-icon>
${msg("New Workflow")}
</sl-button>
<sl-button
slot="trigger"
size="small"
variant="primary"
caret
?disabled=${this.org?.readOnly}
>
<sl-icon slot="prefix" name="plus-lg"></sl-icon>
${msg("New Workflow...")}
</sl-button>
<sl-menu>
<sl-menu-item value="url-list">
${msg("Page List")}
</sl-menu-item>
<sl-menu-item value="seed-crawl">
${msg("Site Crawl")}
</sl-menu-item>
<sl-divider> </sl-divider>
<sl-menu-item>
<sl-icon slot="prefix" name="question-circle"></sl-icon>
${msg("Help me decide")}
</sl-menu-item>
</sl-menu>
</sl-dropdown>
`,
)}
`,

View File

@@ -1,4 +1,4 @@
import { localized, msg } from "@lit/localize";
import { localized, msg, str } from "@lit/localize";
import { mergeDeep } from "immutable";
import type { LitElement } from "lit";
import { customElement, property } from "lit/decorators.js";
@@ -59,8 +59,8 @@ export class WorkflowsNew extends LiteElement {
initialWorkflow?: WorkflowParams;
private readonly jobTypeLabels: Record<JobType, string> = {
"url-list": msg("URL List"),
"seed-crawl": msg("Seeded Crawl"),
"url-list": msg("Page List"),
"seed-crawl": msg("Site Crawl"),
custom: msg("Custom"),
};
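These labels feed the heading change in the next hunk via `@lit/localize`'s `msg`/`str` helpers; a self-contained sketch mirroring the code in this file:

```ts
import { msg, str } from "@lit/localize";

type JobType = "url-list" | "seed-crawl" | "custom";

const jobTypeLabels: Record<JobType, string> = {
  "url-list": msg("Page List"),
  "seed-crawl": msg("Site Crawl"),
  custom: msg("Custom"),
};

// For jobType = "seed-crawl" the heading below renders as
// "New Site Crawl Workflow".
const jobType: JobType = "seed-crawl";
const heading = msg(str`New ${jobTypeLabels[jobType]} Workflow`);
```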
@@ -98,7 +98,7 @@ return html`
return html`
<div class="mb-5">${this.renderBreadcrumbs()}</div>
<h2 class="mb-6 text-xl font-semibold">
${msg("New")} ${this.jobTypeLabels[jobType]}
${msg(str`New ${this.jobTypeLabels[jobType]} Workflow`)}
</h2>
${when(this.org, (org) => {
const initialWorkflow = mergeDeep(