Docs: Add docs for quality assurance (#2769)
This commit is contained in:
parent
993f82a49b
commit
4c5185d973
@@ -58,4 +58,4 @@ Translations are managed through Weblate, a web-based and open source translatio
Browsertrix is made available under the [AGPLv3 License](https://github.com/webrecorder/browsertrix?tab=AGPL-3.0-1-ov-file#readme).

Documentation is made available under the Creative Commons Attribution 4.0 International License.

71 frontend/docs/docs/user-guide/qa-review.md Normal file

@@ -0,0 +1,71 @@
# Review Crawl
## Overview of Crawl Quality
In a QA analysis, Browsertrix collects data in two stages: first during the initial crawl, and then again during the replay. Rather than comparing the replay to the live site, we compare it to the data captured during the crawl. This ensures that the web archive that is to be downloaded, added to a collection, or shared provides a high quality replay.
When reviewing the page, you will be able to analyze specific elements beginning with the Starting URL.
You will be able to review the crawled pages by:
- **Screenshots**: A static visual snapshot of a section of the captured page
- **Text**: A full transcript of the text within the page
- **Resources**: Web documents (i.e. HTML, stylesheets, fonts, etc.) that make up the page
!!! note "Navigation Prevented in Replay within QA"

    To navigate through the captured website, use the Replay feature in the Crawling section. Links will not be clickable when using the Replay tab within the Analysis view.

!!! note "Limited View in Default Mode"

    The screenshot, text, and resource comparison views are only available for analyzed crawls. You'll need to run an analysis to view and compare all quality metrics.
## QA on Your Web Archive
When you run an analysis, you'll have a comparison view of the data collected. If multiple analysis runs have been completed, page data is taken from the selected analysis run, which is displayed next to the archived item name. The most recent analysis run is selected by default, but you can choose to display data from any other completed or stopped analysis run as well.
The depth of your page review may vary depending on available time and the complexity of the page. For automated support, crawl analysis can generate comparisons across three key factors to help highlight potentially problematic pages. If you prefer a manual approach, you can still assess crawls without running an analysis: review page quality by hand, leave comments, provide ratings, and examine the screenshots, text, and resources.
### Screenshot Comparison
Screenshots are compared by measuring the perceived difference between color samples and by the intensity of difference between pixels. These metrics are provided by the open source tool [Pixelmatch](https://observablehq.com/@mourner/pixelmatch-demo).
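Pixelmatch's actual algorithm is more involved (it uses a perceptual YIQ color metric and anti-aliasing detection), but the core idea of a screenshot match score can be sketched in a few lines. This is an illustrative sketch only, not Browsertrix's or Pixelmatch's implementation; the function name and threshold are invented:

```python
# Illustrative sketch only: count pixels whose color difference exceeds
# a threshold, then report the matching fraction. Real tools use a
# perceptual color metric rather than this crude per-channel sum.

def match_score(crawl_pixels, replay_pixels, threshold=30):
    """Return the fraction of pixels considered matching (0.0 to 1.0).

    Each argument is a list of (r, g, b) tuples of equal length.
    """
    if len(crawl_pixels) != len(replay_pixels):
        raise ValueError("screenshots must have the same dimensions")
    mismatched = 0
    for (r1, g1, b1), (r2, g2, b2) in zip(crawl_pixels, replay_pixels):
        # Sum of per-channel intensity differences for this pixel pair.
        if abs(r1 - r2) + abs(g1 - g2) + abs(b1 - b2) > threshold:
            mismatched += 1
    return 1 - mismatched / len(crawl_pixels)
```

A score near 1.0 suggests the replay renders close to what was captured during the crawl; a low score flags the page for manual review.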
Discrepancies between crawl and replay screenshots may occur because resources aren't loaded or rendered properly (usually indicating a replay issue).
!!! tip "Caveats"

    If many similar pages exhibit similarly poor screenshot comparison scores but look fine in the Replay tab, the pages may not have been given enough time to load during analysis.

    Some websites may take more time to load than others, including on replay! Because crawl analysis uses the same workflow limit settings as crawling, increasing the [_Delay After Page Load_ workflow setting](workflow-setup.md#delay-after-page-load) may yield better screenshot analysis scores, at the cost of extra execution time.
### Extracted Text Comparison
Text extracted during crawl analysis is compared to the text extracted during crawling. Text is compared on the basis of [Levenshtein distance](https://en.wikipedia.org/wiki/Levenshtein_distance).
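For intuition, the Levenshtein distance between two extracted-text strings can be computed with the classic dynamic-programming recurrence. This sketch is illustrative; Browsertrix's own implementation may differ:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string `a` into string `b`."""
    # prev[j] holds the distance between a[:i-1] and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete ca
                curr[j - 1] + 1,           # insert cb
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[len(b)]
```

A small distance relative to the page's text length suggests the replayed text closely matches what was extracted during the crawl.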
Resources not loaded properly on replay may display ReplayWeb.page's `Archived Page Not Found` error within the extracted text.
### Resource Comparison
The resource comparison tab displays a table of resource types and counts of their [HTTP status codes](https://en.wikipedia.org/wiki/List_of_HTTP_status_codes), grouped into "good" and "bad". Status codes in the 2xx and 3xx ranges are counted as "good"; status codes in the 4xx and 5xx ranges are counted as "bad". Bad status codes during crawling indicate that a resource was not successfully captured. Bad status codes on replay for resources that were marked good during crawling usually indicate a replay issue.
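The good/bad grouping described above can be sketched as follows; the function name and input shape are invented for illustration and are not Browsertrix's actual code:

```python
from collections import defaultdict

def tally_resources(resources):
    """Group (resource_type, status_code) pairs into good/bad counts.

    2xx and 3xx status codes count as "good"; 4xx and 5xx count as "bad".
    """
    table = defaultdict(lambda: {"good": 0, "bad": 0})
    for resource_type, status in resources:
        bucket = "good" if 200 <= status < 400 else "bad"
        table[resource_type][bucket] += 1
    return dict(table)
```

For example, `tally_resources([("stylesheet", 200), ("font", 404)])` returns `{"stylesheet": {"good": 1, "bad": 0}, "font": {"good": 0, "bad": 1}}`.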
!!! tip "Caveats"

    The number of resources may be higher on replay due to how components of ReplayWeb.page re-write certain request types. A discrepancy alone may not be an indicator that the page is broken, though generally it is a positive sign when the counts are equal.

    Due to the complicated nature of resource count comparison, this is not available as a sorting option in the pages list.
## Page Review
The pages from the crawl are listed so you can select pages based on particular interest, comparison rating, or random spot checks as part of your workflow.
<!-- ### sort by approval

### leave comments

### rate -->
??? question "Should I review every page?"

    Probably not! When reviewing a crawl of a site that has many similar pages, all of which exhibit the same error and have similar heuristic scores, it's likely that they are all similarly broken, and you can _probably_ save yourself the trouble. Depending on the website, the heuristic scores may not always be an accurate predictor of quality, but in our testing they are fairly consistent; consistency is the important quality of this tool. It is up to you, the curator, to make the final quality judgement!

    Our recommended workflow is as follows: run crawl analysis, examine the most severe issues as highlighted, examine some key examples of common layouts, review any other key pages, and score the crawl accordingly!
## Finish Review
Once a satisfactory number of pages has been reviewed, press the Finish Review button to give the archived item an overall quality score ranging from "Excellent!" to "Bad". You can add any additional notes or considerations in the archived item description, which can be edited during this step.
### Rate This Crawl
This quality score helps others in your organization understand how well the pages were captured and whether the item needs to be recaptured. You can choose from the following ratings:
- **Excellent!** This archived item perfectly replicates the original pages
- **Good.** Looks and functions nearly the same as the original pages
- **Fair.** Similar to the original pages, but may be missing non-critical content or functionality
- **Bad.** Missing all content and functionality from original pages
### Update Crawl Metadata
You can include additional metadata in the provided text area. There is a maximum of 500 characters for this section.
18 frontend/docs/docs/user-guide/qa-run-analysis.md Normal file

@@ -0,0 +1,18 @@
# Run Analysis
automate the QA

stop analysis

cancel analysis

rerun analysis
## Crawl Results
- HTML pages
- Non-HTML pages
- Failed Pages
## Review Crawl
go to review crawl
33 frontend/docs/docs/user-guide/quality-assurance.md Normal file

@@ -0,0 +1,33 @@
# Introduction to QA
Quality assurance (QA) in web archiving is the systematic process of verifying that archived web content is accurate, complete, and usable. It often involves checking for broken links and missing content, and ensuring the archived version matches the website as it appeared at the time it was crawled, especially for sites with dynamic and interactive elements.
Quality assurance has often been performed manually, typically by visually comparing crawl results to the live site and clicking on the hyperlinks of a crawled web page. This can be tedious and prone to issues if some interactive elements are overlooked, especially if the live site has changed since the time it was crawled and archived. Browsertrix addresses these potential issues through QA tools that provide immediate feedback on the capture quality of the crawl, so that crawl or replay issues can be identified and resolved promptly.
## Overview of Quality Assurance
With assisted QA, you can analyze any web archive crawled through Browsertrix to compare, replay, and review pages in the web archive.
!!! note "Types of crawls you can review"

    You can review and analyze crawls that have been completed or stopped. You cannot review or run analysis on cancelled, paused, or uploaded crawls.
At a quick glance, you can tell:
- **Analysis Status**: By default, the status of your analysis is shown as *Not Analyzed* because QA analysis does not run automatically. You will need to run analysis if you want an HTML page match analysis and side-by-side screenshot comparisons.
- **QA Rating**: Users can rate crawls as Excellent, Good, Fair, Poor, or Bad depending on capture quality.
- **Total Analysis Time**: Like Execution Time (crawl running time), total analysis time is measured in minutes as the total runtime of the analysis, scaled by the Browser Windows value used during the crawl.
### Crawl Results
You will automatically be given a summary of your crawl, even without running analysis.
You will get a count of all the HTML (HyperText Markup Language) files captured, as well as non-HTML files. Non-HTML files include PDFs, Word and text files, images, and other downloadable content that the crawler discovers through clickable links on a page. These files are not analyzed, as they are standalone assets with no comparable web elements. Failed pages did not respond when the crawler tried to visit them.
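The three-way split described above can be sketched as follows; the tuple shape and the exact content-type check are assumptions for illustration, not Browsertrix's data model:

```python
def summarize_crawl(pages):
    """Count HTML pages, non-HTML files, and failed pages.

    `pages` is an iterable of (content_type, status) pairs, where a
    status of None means the page never responded.
    """
    counts = {"html": 0, "non_html": 0, "failed": 0}
    for content_type, status in pages:
        if status is None:
            counts["failed"] += 1   # page did not respond when visited
        elif content_type == "text/html":
            counts["html"] += 1     # analyzable web pages
        else:
            counts["non_html"] += 1 # PDFs, images, and other assets
    return counts
```

Only the "html" bucket is eligible for screenshot, text, and resource analysis.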
### Pages
A page refers to a web page: a web document accessed in a web browser, typically linked together with other web pages to form a website.
You will see a list of all the web pages crawled, featuring each page's title, URL, and any approval ratings and comments left by users of the org.
<!-- ## Run Analysis

From the Quality Assurance tab in the Crawl overview page, you will be able to [*Run Analysis*](./qa-run-analysis.md) or *Rerun Analysis* depending on which step of the workflow you are at. -->
## Review Crawl Analysis
From the Quality Assurance tab in the Crawl overview page, you will be able to [*Review Crawl*](./qa-review.md) when you are ready to analyze the quality of the pages from the crawl.
@@ -66,6 +66,9 @@ nav:
- user-guide/collection.md
- user-guide/public-collections-gallery.md
- user-guide/presentation-sharing.md
- Quality Assurance:
- user-guide/quality-assurance.md
- user-guide/qa-review.md
- Browser Profiles:
- user-guide/browser-profiles.md
- Org Settings:
@@ -91,7 +94,6 @@ nav:
- develop/index.md
- develop/local-dev-setup.md
- develop/docs.md
- UI Development:
- develop/frontend-dev.md
- develop/ui/components.md
@@ -146,4 +148,4 @@ plugins:
- search
- redirects:
redirect_maps:
"user-guide/collections.md": "user-guide/collection.md"