Adds QA features to user docs (#1784)
Fixes #1695

### Changes

- Adds Crawl Review user docs
- Adds Quality Assurance section to the Archived Items page
- Adds note in the user roles list on crawl review not being available for viewers

Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
parent
94d57b98ce
commit
1a668fe82f
@ -22,7 +22,7 @@ Because <span class="status-warning">:bootstrap-x-octagon-fill: Canceled</span>
|
||||
|
||||
## Archived Item Details
|
||||
|
||||
The archived item details page is composed of five sections, though the Crawl Settings tab is only available for crawls and not uploads.
|
||||
The archived item details page is composed of the following sections, though some are only available for crawls and not uploads.
|
||||
|
||||
### Overview
|
||||
|
||||
@ -30,11 +30,29 @@ The Overview tab displays the item's metadata and statistics associated with its
|
||||
|
||||
Metadata can be edited by pressing the pencil icon at the top right of the metadata section to edit the item's description, tags, and collections it is associated with.
|
||||
|
||||
### Quality Assurance
|
||||
|
||||
The Quality Assurance tab displays crawl quality information collected from analysis runs and user assessment of pages. This is where you can start new analysis runs, view quality metrics from older runs, and delete previous analysis runs. This tab is not available for uploaded archived items and is not accessible to users with [viewer permissions](org-settings.md#permission-levels).
|
||||
|
||||
The pages list provides a record of all pages within the archived item, as well as any ratings or notes given to each page during review. If analysis has been run, clicking a page in the pages list opens that page in the review interface.
|
||||
|
||||
#### Crawl Analysis
|
||||
|
||||
Running crawl analysis will re-visit all pages within the archived item, comparing the data collected during analysis with the data collected during crawling. Crawl analysis runs with the same workflow limit settings used during crawling.
|
||||
|
||||
Crawl analysis can be run multiple times, though results should only differ if the crawler version has been updated between runs. The analysis process is being constantly improved and future analysis runs should produce better results. Analysis run data can be downloaded or deleted from the _Analysis Runs_ tab. While they are stored as WACZ files, analysis run WACZs only contain analysis data and may not open correctly or be useful in other programs that replay archived content.
|
||||
|
||||
Once a crawl has been analyzed, either fully or partially, it can be reviewed by pressing the _Review Crawl_ button. For more on reviewing crawls and how to interpret analysis data, see [Crawl Review](review.md).
|
||||
|
||||
`Paid Feature`{ .badge-green }
|
||||
|
||||
Like running a crawl workflow, running crawl analysis also uses execution time. Crawls and crawl analysis share the same concurrent crawling limit, but crawl analysis runs will be paused in favor of new crawls if the concurrent crawling limit is reached.
|
||||
|
||||
### Replay
|
||||
|
||||
The Replay tab displays the web content contained within the archived item.
|
||||
|
||||
For more details on navigating web archives within ReplayWeb.page, see the [ReplayWeb.page user documentation.](https://replayweb.page/docs/exploring)
|
||||
For more details on navigating web archives within ReplayWeb.page, see the [ReplayWeb.page user documentation.](https://replayweb.page/docs/user-guide/exploring/)
|
||||
|
||||
### Exporting Files
|
||||
|
||||
@ -50,4 +68,4 @@ All log entries with that were recorded in the creation of the Archived Item can
|
||||
|
||||
### Crawl Settings
|
||||
|
||||
The Crawl Settings tab displays the crawl workflow configuration options that were used to generate the resulting archived item.
|
||||
The Crawl Settings tab displays the crawl workflow configuration options that were used to generate the resulting archived item. Many of these settings also apply when running crawl analysis.
|
||||
|
@ -17,7 +17,7 @@ Sent invites can be invalidated by pressing the trash button in the relevant _Pe
|
||||
### Permission Levels
|
||||
|
||||
`Viewer`
|
||||
: Users with the viewer role have read-only access to all material within the organization. They cannot create or edit archived items, crawl workflows, browser profiles, or collections.
|
||||
: Users with the viewer role have read-only access to all material within the organization. They cannot create or edit archived items, crawl workflows, browser profiles, or collections. They also do not have access to any crawl analysis or review tools.
|
||||
|
||||
`Crawler`
|
||||
: Users with the crawler role can create crawl workflows and collections, but they cannot delete existing archived items that they were not responsible for creating.
|
||||
|
48
docs/user-guide/review.md
Normal file
@ -0,0 +1,48 @@
|
||||
# Crawl Review
|
||||
|
||||
The Crawl Review page provides a streamlined interface for assessing the capture quality of pages within an archived item using the heuristics collected during crawl analysis.
|
||||
|
||||
Crawls can only be reviewed once [crawl analysis](archived-items.md#crawl-analysis) has been run. If multiple analysis runs have been completed, the page analysis heuristics are taken from the selected analysis run, which is displayed next to the archived item name. The most recent analysis run is selected by default, but you can choose to display data from any other completed or stopped analysis run.
|
||||
|
||||
## Heuristics
|
||||
|
||||
Crawl analysis generates comparisons across three heuristics that can indicate which pages may be the most problematic.
|
||||
|
||||
### Screenshot Comparison
|
||||
|
||||
Screenshots are compared by measuring the perceived difference between color samples and by the intensity of difference between pixels. These metrics are provided by the open-source tool [Pixelmatch](https://observablehq.com/@mourner/pixelmatch-demo).
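As a rough illustration of what a perceptual pixel comparison involves, the sketch below counts the fraction of pixels whose colour difference in YIQ space exceeds a threshold. It is a simplified stand-in and not the Pixelmatch implementation: it skips anti-aliasing detection and alpha blending, and the function names, the image representation (rows of `(r, g, b)` tuples), and the default threshold are assumptions made for this example.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]


def yiq_delta(p1: Pixel, p2: Pixel) -> float:
    """Weighted squared colour difference between two RGB pixels in YIQ space."""
    r, g, b = (c1 - c2 for c1, c2 in zip(p1, p2))
    # Standard NTSC RGB -> YIQ conversion, applied to the channel differences.
    y = 0.299 * r + 0.587 * g + 0.114 * b
    i = 0.596 * r - 0.274 * g - 0.322 * b
    q = 0.211 * r - 0.523 * g + 0.312 * b
    # Luma differences are weighted more heavily than chroma differences.
    return 0.5053 * y * y + 0.299 * i * i + 0.1957 * q * q


def mismatch_ratio(image_a: Image, image_b: Image, threshold: float = 0.1) -> float:
    """Fraction of pixels that differ perceptibly (0.0 = identical, 1.0 = all differ)."""
    if len(image_a) != len(image_b) or any(
        len(row_a) != len(row_b) for row_a, row_b in zip(image_a, image_b)
    ):
        raise ValueError("images must have the same dimensions")
    # Assumed scale: 35215 is taken here as the largest possible delta value.
    max_delta = 35215.0 * threshold * threshold
    total = 0
    differing = 0
    for row_a, row_b in zip(image_a, image_b):
        for pixel_a, pixel_b in zip(row_a, row_b):
            total += 1
            if yiq_delta(pixel_a, pixel_b) > max_delta:
                differing += 1
    return differing / total if total else 0.0
```

A lower ratio means the crawl and replay screenshots look more alike. How Browsertrix turns such a measurement into the score shown in the review interface is not described here.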
|
||||
|
||||
Discrepancies between crawl and replay screenshots may occur because resources aren't loaded or rendered properly (usually indicating a replay issue).
|
||||
|
||||
!!! Tip "Caveats"
|
||||
If many similar pages exhibit similarly poor screenshot comparison scores but look fine in the replay tab, the pages may not have been given enough time to load during analysis.
|
||||
|
||||
Some websites may take more time to load than others, including on replay! Because crawl analysis uses the same workflow limit settings as crawling, a page may not have been given enough time to load during analysis. In that case, increasing the [_Delay After Page Load_ workflow setting](workflow-setup.md#delay-after-page-load) may yield better screenshot analysis scores, at the cost of extra execution time.
|
||||
|
||||
### Extracted Text Comparison
|
||||
|
||||
Text extracted during crawl analysis is compared to the text extracted during crawling. Text is compared on the basis of [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
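To make the metric concrete, here is a minimal sketch of Levenshtein distance together with one possible way to turn it into a 0 to 1 similarity score. The normalization by the longer text's length and the function names are assumptions for illustration; the exact scoring used by crawl analysis is not specified here.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            substitution_cost = 0 if char_a == char_b else 1
            current.append(
                min(
                    previous[j] + 1,                      # deletion
                    current[j - 1] + 1,                   # insertion
                    previous[j - 1] + substitution_cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


def text_similarity(crawl_text: str, replay_text: str) -> float:
    """Hypothetical normalization: 1.0 means identical, 0.0 means entirely different."""
    longest = max(len(crawl_text), len(replay_text))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(crawl_text, replay_text) / longest
```

For example, `levenshtein("kitten", "sitting")` is 3: two substitutions and one insertion.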
|
||||
|
||||
Resources not loaded properly on replay may display ReplayWeb.page's `Archived Page Not Found` error within the extracted text.
|
||||
|
||||
### Resource Comparison
|
||||
|
||||
The resource comparison tab displays a table of resource types and, for each type, the count of [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes) grouped into "good" and "bad". Status codes in the 2xx and 3xx ranges are counted as "good"; status codes in the 4xx and 5xx ranges are counted as "bad". Bad status codes on crawl indicate that a resource was not successfully captured. Bad status codes on replay for resources that were marked good when crawling usually indicate a replay issue.
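The grouping described above can be sketched as follows. This is an illustrative example only: the function names, the `(resource_type, status)` input shape, and the `"other"` bucket for codes outside the 2xx to 5xx ranges are assumptions and do not reflect the Browsertrix implementation.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def classify_status(status: int) -> str:
    """Label an HTTP status code: 2xx and 3xx are good, 4xx and 5xx are bad."""
    if 200 <= status < 400:
        return "good"
    if 400 <= status < 600:
        return "bad"
    return "other"  # assumption: handling of other codes is not documented


def count_by_type(resources: Iterable[Tuple[str, int]]) -> Dict[str, Counter]:
    """Count good and bad responses per resource type."""
    counts: Dict[str, Counter] = {}
    for resource_type, status in resources:
        counts.setdefault(resource_type, Counter())[classify_status(status)] += 1
    return counts


# Hypothetical sample data for one page.
crawl_counts = count_by_type(
    [("image", 200), ("image", 404), ("script", 301), ("script", 503), ("script", 200)]
)
```

Running the same function over the resources seen on replay and comparing the two results per resource type mirrors what the table shows side by side.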
|
||||
|
||||
!!! Tip "Caveats"
|
||||
The number of resources may be higher on replay due to how components of ReplayWeb.page re-write certain request types. A discrepancy alone may not be an indicator that the page is broken, though generally it is a positive sign when the counts are equal.
|
||||
|
||||
Due to the complicated nature of resource count comparison, this is not available as a sorting option in the pages list.
|
||||
|
||||
## Page Review
|
||||
|
||||
The pages list can be sorted using analysis heuristics to determine which pages are likely more important to review and which might require less attention. After selecting a page to review, looking over the analysis heuristics, and checking them against replay, decide whether the page capture was successful or unsuccessful and leave a note about what worked well or what might be problematic.
|
||||
|
||||
??? Question "Should I review every page? (Spoiler alert: probably not!)"
|
||||
When reviewing a crawl of a site that has many similar pages, all of which exhibit the same error and have similar heuristic scores, it's likely that they are all similarly broken, and you can _probably_ save yourself the trouble. Depending on the website, the heuristic scores may not always be an accurate predictor of quality, but in our testing they are fairly consistent, and consistency is the important factor for this tool. It is up to you, the curator, to make the final quality judgement!
|
||||
|
||||
Our recommended workflow is as follows: run crawl analysis, examine the most severe issues as highlighted, examine some key examples of common layouts, review any other key pages, and score the crawl accordingly!
|
||||
|
||||
## Finish Review
|
||||
|
||||
Once a satisfactory number of pages have been reviewed, press the _Finish Review_ button to give the archived item an overall quality score ranging from "Excellent!" to "Bad". You can add any additional notes or considerations in the archived item description, which can be edited during this step.
|
@ -70,6 +70,7 @@ nav:
|
||||
- user-guide/browser-profiles.md
|
||||
- user-guide/overview.md
|
||||
- user-guide/archived-items.md
|
||||
- user-guide/review.md
|
||||
- user-guide/collections.md
|
||||
- user-guide/org-settings.md
|
||||
- user-guide/user-settings.md
|
||||
|