Commit Graph

18 Commits

Author SHA1 Message Date
Ilya Kreymer
94e985ae13
optimize org quota lookups (#1973)
- instead of looking up storage and exec min quotas from oid, and
loading an org each time, load org once and then check quotas on the org
object - many times the org was already available, and was looked up
again
- storage and exec quota checks become sync
- rename can_run_crawl() to more generic can_write_data(), optionally
also checks exec minutes
- typing: get_org_by_id() always returns org, or throws, adjust methods
accordingly (don't check for none, catch exception)
- typing: fix typo in BaseOperator, catch type errors in operator
'org_ops'
- operator quota check: use up-to-date 'status.size' for current job,
ignore current job in all jobs list to avoid double-counting
- follow up to #1969
2024-07-25 14:00:16 -07:00
Ilya Kreymer
8c0321bdea
Pydantic 2.x update + type fixes + python 3.12 (#1947)
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates

Fixes #1940
2024-07-22 17:23:03 -07:00
Ilya Kreymer
335700e683
Additional typing cleanup (#1938)
Misc typing fixes, including in profiles and time functions

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-17 10:49:22 -07:00
Tessa Walsh
d41647e6c2
Document all API endpoints with response models (#1928)
Fixes #1920 

Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.

Follow-up work will involve adding generics to some of the response models
2024-07-16 12:48:38 -07:00
Tessa Walsh
aaf18e70a0
Add created date to Organization and fix datetimes across backend (#1921)
Fixes #1916

- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
2024-07-15 19:46:32 -07:00
Ilya Kreymer
3bd714ea9d
QA stats aggregation: exclude isFile / isError pages from stats (#1879)
Follow-up to: #1868, exclude pages that have isFile or isError set to
true from the stats aggregation.
2024-06-25 08:54:42 -07:00
Tessa Walsh
879e509b39 Backend: Move page file and error counts to crawl replay.json endpoint (#1868)
Backend work for #1859

- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
2024-06-20 19:02:57 -07:00
Tessa Walsh
8b0d1432af
Show QA meter while analysis is running (#1854)
Fixes #1846 

- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)

Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-06-12 12:32:01 -04:00
Tessa Walsh
a85f9496b0
Include number of Identical Files in QA stats and meter (#1848)
This PR adds Identical Files to the QA Page Match Analysis meter bars.
To do this, the backend calculates the number of non-HTML pages once and
includes it under the key `Files` in each of the `screenshotMatch` and
`textMatch` QA stats return arrays.

The backend additionally removes the file count from "No Data" to
prevent these from being counted twice.

---------

Co-authored-by: emma <hi@emma.cafe>
2024-06-06 13:15:19 -04:00
sua yoo
1915274e26
Fix QA review comments (#1723)
Fixes https://github.com/webrecorder/browsertrix/issues/1710

Fixes date and deletion for newly added comments.
2024-04-23 16:31:52 -04:00
Tessa Walsh
30ab139ff2
Add QA run aggregate stats API endpoint (#1682)
Fixes #1659 

Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.

Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
2024-04-17 13:24:18 -04:00
Tessa Walsh
87e0873f1a
Add mime field to Page model (#1678) 2024-04-17 00:57:49 -04:00
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also make sure other get page endpoint without qa uses PageOut model
2024-03-27 14:57:59 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by page has approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
2024-03-21 21:33:07 -07:00
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running,`QARun` data stored in a single `qa` object, while
finished qa runs are added to `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
Ilya Kreymer
09a0d51843
pages: set page status to 200 if unset and loadState != 0 (#1563)
Follow up to #1516, ensure page status is set to 200 if no status is
provided, if loadState is not 0
2024-02-29 15:15:17 -08:00
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00