feat: Merge workflow job types (#2068)

Resolves https://github.com/webrecorder/browsertrix/issues/2073

### Changes

- Removes the "URL List" and "Seeded Crawl" job type distinction; both are
now offered as additional crawl scope types instead.
- 'New Workflow' button defaults to Single Page
- 'New Workflow' dropdown includes Page Crawl (Single Page, Page List, In-Page Links) and Site Crawl (Page in Same Directory, Page on Same Domain, + Subdomains and Custom Page Prefix)
- Enables specifying `DOCS_URL` in `.env`
- Additional follow-ups in #2090, #2091

2024-09-25 10:37:18 -04:00

1.9 KiB


Your First Crawl

Let’s crawl your first webpage! Start by opening up a webpage that you'd like to crawl, and note the URL for later.

Logging in

To start crawling with hosted Browsertrix, you'll need a Browsertrix account. Sign up for an account and log in.

!!! note "Self-hosting"

    If you'd like to try Browsertrix before signing up, or you have specialized hosting requirements, you can host Browsertrix yourself. [Set up Browsertrix](../deploy/index.md) on your system and log in as your admin user.

Starting the crawl

Once you've logged in, you should see your org overview. If you land somewhere else, navigate to Overview.

1. Tap the Create New... shortcut and select Crawl Workflow.
2. Choose Page List. We'll get into the details of the options later, but this is a good starting point for a simple crawl.
3. Enter the URL of the webpage that you noted earlier in Page URL(s).
4. Tap Review & Save.
5. Tap Save Workflow.
6. You should now see your new crawl workflow. Give the crawler a few moments to warm up, and then watch as it archives the webpage!
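
If you noted more than one URL, it can help to tidy the list before pasting it into Page URL(s). The following is a minimal sketch, not part of Browsertrix: the helper name `clean_page_urls` is hypothetical, and it assumes you keep one URL per line and only want absolute `http` or `https` addresses.

```python
from urllib.parse import urlsplit


def clean_page_urls(raw: str) -> list[str]:
    """Return the unique absolute http(s) URLs in `raw`, in their original order.

    Hypothetical helper: assumes one URL per line. Raises ValueError on the
    first line that is not an absolute http(s) URL.
    """
    seen: set[str] = set()
    urls: list[str] = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # ignore blank lines
        parts = urlsplit(candidate)
        # An absolute web URL needs both a scheme and a host.
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not an absolute http(s) URL: {candidate!r}")
        if candidate not in seen:
            seen.add(candidate)
            urls.append(candidate)
    return urls


if __name__ == "__main__":
    noted = """
    https://example.com/
    https://example.com/about
    https://example.com/
    """
    # Prints each unique URL on its own line, ready to paste.
    print("\n".join(clean_page_urls(noted)))
```

Raising on the first malformed line, rather than silently dropping it, makes a mistyped address visible before the crawl starts.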

Next steps

After running your first crawl, check out the following to learn more about Browsertrix's features:

- A detailed list of crawl workflow setup options.
- Adding exclusions to limit your crawl's scope and evading crawler traps by editing exclusion rules while crawling.
- Best practices for crawling with browser profiles to capture content only available when logged in to a website.
- Managing archived items, including uploading previously archived content.
- Organizing and combining archived items with collections for sharing and export.
- Inviting collaborators to your org.
