Reorganizes user guide to be more solutions based --------- Co-authored-by: Henry Wilkinson <henry@wilkinson.graphics> Co-authored-by: Emma Segal-Grossman <hi@emma.cafe> Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
Your First Crawl
Let’s crawl your first webpage! Start by opening up a webpage that you'd like to crawl, and note the URL for later.
Logging in
To start crawling with hosted Browsertrix, you'll need a Browsertrix account. Sign up for an account and log in.
!!! note "Self-hosting"
    If you'd like to try Browsertrix before signing up, or you have specialized hosting requirements, you can host Browsertrix yourself. [Set up Browsertrix](../deploy/index.md) on your system and log in as your admin user.
Starting the crawl
Once you've logged in, you should see your org overview. If you land somewhere else, navigate to Overview.
- Tap the Create New... shortcut and select Crawl Workflow.
- Choose Known URLs. We'll get into the details of the options later, but this is a good starting point for a simple crawl.
- Enter the URL of the webpage that you noted earlier in Crawl URL(s).
- Tap Review & Save.
- Tap Save Workflow.
- You should now see your new crawl workflow. Give the crawler a few moments to warm up, and then watch as it archives the webpage!
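
If you collect several candidate URLs before pasting them into Crawl URL(s), a quick local check can catch incomplete entries first. The sketch below is a hypothetical helper, not part of Browsertrix. It assumes you keep one URL per line in a text file or string, and it only confirms that each entry is a full `http` or `https` URL. The name `clean_crawl_urls` is made up for illustration.

```python
from urllib.parse import urlsplit


def clean_crawl_urls(raw: str) -> list[str]:
    """Return de-duplicated http(s) URLs from newline-separated text.

    Hypothetical helper for tidying a URL list before pasting it into a
    crawl workflow form; it does not talk to Browsertrix at all.
    """
    urls: list[str] = []
    for line in raw.splitlines():
        candidate = line.strip()
        if not candidate:
            continue  # skip blank lines
        parts = urlsplit(candidate)
        # Require an explicit scheme and host, e.g. https://example.com/
        if parts.scheme not in ("http", "https") or not parts.netloc:
            raise ValueError(f"not a full http(s) URL: {candidate!r}")
        if candidate not in urls:
            urls.append(candidate)  # keep first occurrence, preserve order
    return urls


if __name__ == "__main__":
    notes = "https://example.com/\n\nhttps://example.com/\nhttp://example.org/a"
    print("\n".join(clean_crawl_urls(notes)))
```

Entries without a scheme, such as `example.com`, raise a `ValueError`, so you can correct them before starting the crawl.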
Next steps
After running your first crawl, check out the following to learn more about Browsertrix's features:
- A detailed list of crawl workflow setup options.
- Adding exclusions to limit your crawl's scope and evading crawler traps by editing exclusion rules while crawling.
- Best practices for crawling with browser profiles to capture content only available when logged in to a website.
- Managing archived items, including uploading previously archived content.
- Organizing and combining archived items with collections for sharing and export.
- Inviting collaborators to your org.