docs: Adds information about 1.6 features to documentation (#1086)

* 1.6 docs update

### Changes

- Adds note in style guide about referencing actions in the app
- Adds page for Browser Profiles
  - Adds callout for uploads in the context of combining items from multiple sources
- Adds page for Collections
- Adds page for Crawl Workflows
- Updates index to link to new dedicated Crawl Workflow page in addition to the Crawl Workflow Setup page
- Updates Org Settings page action styling in accordance with new rules
- Updates Crawl Workflow Setup page with links to the new pages and a hierarchy fix for the first item
- Updates user guide navigation with a new section for crawling related items
---------

Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
Henry Wilkinson 2023-08-19 00:55:20 -04:00 committed by GitHub
parent 726a070ca9
commit 02a01e7abb
8 changed files with 106 additions and 10 deletions

View File

@ -67,7 +67,9 @@ All headings should be set in [title case](https://en.wikipedia.org/wiki/Title_c
Controls with multiple options should have their options referenced as `in-line code blocks`.
Setting names referenced outside of a heading should be Capitalized and set in _italics_.
Setting names referenced outside of a heading should be Title Cased and set in _italics_.
Actions with text (buttons in the app) should also be Title Cased and set in _italics_.
##### Example

View File

@ -0,0 +1,31 @@
# Browser Profiles
Browser Profiles are saved instances of a web browsing session that can be reused to crawl websites as they were configured, with any cookies or saved login sessions. They are especially useful for crawling websites as a logged-in user or for accepting cookie consent pop-ups.
Using a pre-created profile means that paywalled content can be archived without archiving the actual login credentials.
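The underlying idea is that a profile stores session state (such as cookies) rather than the credentials themselves. As a rough sketch of that idea in plain Python (illustrative only, not how Browsertrix implements profiles; the file path and URLs are hypothetical):

```python
# Illustrative sketch only; not Browsertrix code. It shows the general
# "log in once, save the session state, reuse it later" idea behind profiles.
import pickle
import requests

PROFILE_PATH = "profile-cookies.pkl"  # hypothetical location for the saved session state

def create_profile(login_url: str, credentials: dict) -> None:
    """Log in once and persist the resulting cookies; the credentials themselves are not saved."""
    session = requests.Session()
    session.post(login_url, data=credentials)
    with open(PROFILE_PATH, "wb") as f:
        pickle.dump(session.cookies, f)

def fetch_as_logged_in(url: str) -> str:
    """Reuse the saved cookies so the page responds as it would for the logged-in user."""
    session = requests.Session()
    with open(PROFILE_PATH, "rb") as f:
        session.cookies.update(pickle.load(f))
    return session.get(url).text
```

In the app, the equivalent of `create_profile` happens interactively in the profile creator, and the saved session state is what crawl workflows reuse.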
??? info "Best practice: Create and use web archiving-specific accounts"
Some websites may rate limit or lock your account if they deem crawling-related activity to be suspicious, such as logging in from a new location.
While your login information (username, password) is not archived, *other* data such as cookies, location, etc. may be part of the logged-in content (after all, personalized content is often the goal of paywalls).
Due to the nature of social media especially, existing accounts may contain personally identifiable information, even when accessing otherwise public content.
For these reasons, we recommend creating dedicated accounts for archiving anything that is paywalled but otherwise public, especially on social media platforms.
Of course, there are exceptions -- such as when the goal is to archive personalized or private content accessible only from designated accounts.
## Creating New Browser Profiles
New browser profiles can be created on the Browser Profiles page by pressing the _New Browser Profile_ button and providing a starting URL. Once in the profile creator, log in to any websites that should behave as logged in while crawling and accept any pop-ups that require interaction from the user to proceed with using the website.
Press the _Next_ button to save the browser profile with a _Name_ and _Description_ of what is logged in or otherwise notable about this browser session.
## Editing Existing Browser Profiles
Sometimes websites will log users out or expire cookies after a period of time. In these cases, the browser profile can still be loaded when crawling, but it may not behave as it did when it was initially set up.
To update the profile, go to the profile's details page and press the _Edit Browser Profile_ button to load and interact with the sites that need to be re-configured. When finished, press the _Save Browser Profile_ button to return to the profile's details page.

View File

@ -0,0 +1,22 @@
# Collections
Collections are the primary way of organizing and combining archived items into groups for presentation.
!!! tip "Tip — Combining items from multiple sources"
If the crawler has not captured every resource or interaction on a webpage, the [ArchiveWeb.page browser extension](https://archiveweb.page/) can be used to manually capture missing content and upload it directly to your org.
After adding the crawl and the upload to a collection, the content from both will become available in the replay viewer.
## Adding Content to Collections
Crawls and uploads can be added to a collection as part of the initial creation process or after creation by selecting _Edit Collection_ from the collection's actions menu.
A crawl workflow can also be set to [automatically add any completed archived items to a collection](../workflow-setup/#collection-auto-add) in the workflow's settings.
## Sharing Collections
Collections are private by default, but can be made public by marking them as shareable in the Metadata step of collection creation, or by toggling the _Collection is Shareable_ switch in the share collection dialogue.
After a collection has been made public, it can be shared with others using the public URL available in the share collection dialogue. The collection can also be embedded into other websites using the provided embed code. Unsharing the collection will break any previously shared links.
For further resources on embedding archived web content into your own website, see the [ReplayWeb.page docs page on embedding](https://replayweb.page/docs/embedding).

View File

@ -0,0 +1,35 @@
# Crawl Workflows
Crawl workflows consist of a list of configuration options that instruct the crawler what it should capture.
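To give a rough sense of what a workflow bundles together, here is an illustrative sketch of the kinds of settings involved (the field names below are made up for this example and are not the app's actual schema; see [Crawl Workflow Setup](../workflow-setup) for the real options):

```python
# Hypothetical sketch of the kinds of settings a crawl workflow groups together.
# Field names are illustrative only; the real options are described in Crawl Workflow Setup.
example_workflow = {
    "crawl_type": "URL List",              # fixed when the workflow is created
    "seed_urls": ["https://example.com/"],
    "include_linked_pages": True,
    "exclusions": [r"/calendar/"],         # patterns removed from the crawl queue
    "browser_profile": "Logged-in news account",
    "block_ads_by_domain": True,
    "tags": ["news"],
    "auto_add_collections": ["News Archive"],
}
```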
## Creating and Editing Crawl Workflows
New crawl workflows can be created from the Crawling page. A detailed breakdown of the available settings can be found on the [Crawl Workflow Setup](../workflow-setup) page.
## Running Crawl Workflows
Crawl workflows can be run from the actions menu of the workflow in the crawl workflow list, or by clicking the _Run Crawl_ button on the workflow's details page.
While crawling, the Watch Crawl page displays a list of queued URLs that will be visited, and streams the current state of the browser windows as they visit pages from the queue.
Running a crawl workflow that has successfully run previously can be useful for capturing content as it changes over time, or for crawling again with an updated [Crawl Scope](../workflow-setup/#scope).
### Live Exclusion Editing
While [exclusions](../workflow-setup/#exclusions) can be set before running a crawl workflow, the crawler may discover new parts of the site that weren't previously known about and shouldn't be crawled, or it may get stuck in parts of a website that automatically generate URLs, known as ["crawler traps"](https://en.wikipedia.org/wiki/Spider_trap).
If the crawl queue is filled with URLs that should not be crawled, use the _Edit Exclusions_ button on the Watch Crawl page to instruct the crawler which pages should be excluded from the queue.
Exclusions added while crawling are applied to the same exclusion table saved in the workflow's settings and will be used the next time the crawl workflow is run unless they are manually removed.
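Conceptually, each exclusion is a pattern matched against queued URLs, and anything that matches is dropped from the queue. A minimal sketch of that idea (illustrative only; the patterns and URLs are made up and this is not the crawler's actual implementation):

```python
# Illustrative sketch of how exclusion patterns prune a crawl queue.
import re

exclusions = [r"/tag/", r"\?page=\d{3,}"]  # e.g. skip tag pages and runaway pagination

queue = [
    "https://example.com/articles/solar-power",
    "https://example.com/tag/energy",
    "https://example.com/archive?page=1042",
]

filtered_queue = [
    url for url in queue
    if not any(re.search(pattern, url) for pattern in exclusions)
]
print(filtered_queue)  # only the article URL remains
```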
## Ending a Crawl
If a crawl workflow is not crawling websites as intended, it may be preferable to end the crawl and update the workflow's settings before trying again. There are two ways to end a crawl, available both on the workflow's details page and in the actions menu in the workflow list.
### Stopping
Stopping a crawl will throw away the crawl queue but otherwise gracefully end the process and save anything that has been collected. Stopped crawls show up in the list of Archived Items and can be used like any other item in the app.
### Canceling
Canceling a crawl will throw away all data collected and immediately end the process. Canceled crawls do not show up in the list of Archived Items, though a record of the runtime and workflow settings can be found in the crawl workflow's list of crawls.

View File

@ -2,8 +2,6 @@
## Signup
!!! warning "For all signup options the Name field cannot currently be changed later."
### Invite Link
If you have been sent an [invite](org-settings#members), enter a password and name to create a new account. Your account will be added to the organization you were invited to by an organization admin.
@ -12,8 +10,10 @@ If you have been sent an [invite](org-settings#members), enter a password and na
If the server has enabled signups and you have been given a registration link, enter your email address, password, and name to create a new account. Your account will be added to the server's default organization.
!!! info "At this time, the name field is not yet editable."
---
## Automated Crawling
## Start Crawling!
A Workflow must be created in order to crawl websites automatically. Workflows can be created on the Crawling page found in the main navigation menu. A detailed list of all available workflow configuration options can be found on the [Crawl Workflow Setup](workflow-setup) page.
A [Crawl Workflow](crawl-workflows) must be created in order to crawl websites automatically. A detailed list of all available workflow configuration options can be found on the [Crawl Workflow Setup](workflow-setup) page.

View File

@ -10,6 +10,6 @@ This page lets you change the organization's name. This name must be unique.
This page lists all current members who have access to the organization, as well as any invited members who have not yet accepted an invitation to join the organization. In the _Active Members_ table, Admins can change the permission level of all users in the organization, including other Admins. At least one user per organization must be an Admin. Admins can also remove members by pressing the trash button.
Admins can add new members to the organization by pressing the `Invite New Member` button. Enter the email address associated with the user, select the appropriate role, and press `Invite` to send a link to join the organization via email.
Admins can add new members to the organization by pressing the _Invite New Member_ button. Enter the email address associated with the user, select the appropriate role, and press _Invite_ to send a link to join the organization via email.
Sent invites can be invalidated by pressing the trash button in the relevant _Pending Invites_ table row.

View File

@ -1,6 +1,8 @@
# Crawl Workflow Setup
The first step in creating a new crawl workflow is to choose what type of crawl you want to run. Crawl types are fixed and cannot be converted or changed later.
## Crawl Type
The first step in creating a new [crawl workflow](../crawl-workflows) is to choose what type of crawl you want to run. Crawl types are fixed and cannot be converted or changed later.
`URL List`{ .badge-blue }
: The crawler visits every URL specified in a list, and optionally every URL linked on those pages.
@ -138,7 +140,7 @@ Waits on the page for a set period of time after any behaviors have finished run
### Browser Profile
Sets the _Browser Profile_ to be used for this crawl.
Sets the [_Browser Profile_](../browser-profiles) to be used for this crawl.
### Block Ads by Domain
@ -197,4 +199,4 @@ Apply tags to the workflow. Tags applied to the workflow will propagate to every
### Collection Auto-Add
Search for and specify collections that this crawl workflow should automatically add content to as soon as crawls finish running. Cancelled and Failed crawls will not be automatically added to collections.
Search for and specify [collections](../collections) that this crawl workflow should automatically add content to as soon as crawling finishes. Canceled and Failed crawls will not be automatically added to collections.

View File

@ -56,7 +56,11 @@ nav:
- develop/docs.md
- User Guide:
- user-guide/index.md
- user-guide/workflow-setup.md
- Crawling:
- user-guide/crawl-workflows.md
- user-guide/workflow-setup.md
- user-guide/browser-profiles.md
- user-guide/collections.md
- user-guide/org-settings.md
markdown_extensions: