Commit Graph

33 Commits

Author SHA1 Message Date
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order to better show
pages in the QA view and get better results
- bgjob: account for parallelism in bgjobs, add logging if the
succeeded count does not match parallelism
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
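
A minimal sketch of the pagesSearch / pages split described in #2417 above, assuming a FastAPI router and a Motor `pages` collection; handler names, paths, and fields here are illustrative, not the actual Browsertrix code:

```python
# Sketch only: illustrative endpoints, not the real Browsertrix handlers.
from fastapi import APIRouter
from motor.motor_asyncio import AsyncIOMotorClient

router = APIRouter()
pages = AsyncIOMotorClient()["browsertrix"]["pages"]  # assumed collection


@router.get("/orgs/{oid}/crawls/{crawl_id}/pagesSearch")
async def pages_search(oid: str, crawl_id: str, page: int = 1, pageSize: int = 25):
    # faster variant: no total, so no extra count query
    cursor = (
        pages.find({"oid": oid, "crawl_id": crawl_id}, {"_id": 0})
        .skip((page - 1) * pageSize)
        .limit(pageSize)
    )
    return {"items": await cursor.to_list(pageSize)}


@router.get("/orgs/{oid}/crawls/{crawl_id}/pages")
async def pages_with_total(oid: str, crawl_id: str, page: int = 1, pageSize: int = 25):
    # backwards-compatible variant: includes the (slower) total count
    query = {"oid": oid, "crawl_id": crawl_id}
    total = await pages.count_documents(query)
    cursor = pages.find(query, {"_id": 0}).skip((page - 1) * pageSize).limit(pageSize)
    return {"items": await cursor.to_list(pageSize), "total": total}
```

The slower /pages variant pays for an extra count query on every request, which is exactly the cost the pagesSearch variant avoids.
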
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
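
A sketch of the two $regex replacements called out in #2412 above, assuming a Motor `pages` collection with a plain index on `url` and a text index on `title`; the field names and prefix handling are assumptions for illustration:

```python
# Sketch only: illustrative query shapes, not the actual Browsertrix queries.
from motor.motor_asyncio import AsyncIOMotorCollection


async def top_url_prefix_matches(
    pages: AsyncIOMotorCollection, prefix: str, limit: int = 10
) -> list[str]:
    # $gte plus an index-backed sort replaces an anchored $regex: strings
    # starting with `prefix` sort at/after it, so the first `limit` results
    # of the range scan are the top prefix matches.
    cursor = (
        pages.find({"url": {"$gte": prefix}}, {"_id": 0, "url": 1})
        .sort("url", 1)
        .limit(limit)
    )
    return [
        doc["url"]
        for doc in await cursor.to_list(limit)
        if doc["url"].startswith(prefix)
    ]


def title_text_query(search: str) -> dict:
    # $text uses the text index on title instead of a $regex collection scan
    return {"$text": {"$search": search}}
```
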
Tessa Walsh
6c2d8c88c8
Modify page upload migration (#2400)
Related to #2396 

Changes to migration 0037:
- Re-adds pages in the migration rather than in a background job to avoid a race
condition with later migrations
- Re-adds pages for all uploads in all orgs

Fix for re-adding pages for an org:
- Ensure the org filter is applied!
- Fix wrong type
- Remove distinct, use an iterator to loop over crawls faster.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-17 16:47:58 -08:00
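
A sketch of the "remove distinct, use iterator" fix from #2400 above, assuming a Motor `crawls` collection and a hypothetical `readd_crawl_pages()` helper:

```python
# Sketch only: hypothetical helper, not the actual migration code.
async def readd_pages_for_org(crawls, oid: str) -> None:
    # apply the org filter and walk the cursor directly instead of first
    # materializing all crawl ids with a distinct() call
    async for crawl in crawls.find({"oid": oid, "type": "upload"}, {"_id": 1}):
        await readd_crawl_pages(crawl["_id"], oid)  # hypothetical per-crawl re-add


async def readd_crawl_pages(crawl_id, oid: str) -> None:
    ...  # re-read pages from the crawl's WACZs (out of scope for this sketch)
```
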
Ilya Kreymer
5bebb6161a
Issue 2396 readd pages fixes (#2398)
readd pages fixes:
- add additional memory to the background job
- copy page QA data to a separate temp collection when re-adding pages, then
merge back in
2025-02-17 13:52:11 -08:00
Ilya Kreymer
e112f96614
Upload Fixes: (#2397)
- ensure upload pages are always added with a new uuid, to avoid any
duplicates with existing uploads, even if the uploaded WACZ is actually a
crawl from a different browsertrix instance, etc.
- clean up upload names with slugify, which also replaces spaces, fixing
uploads of WACZ filenames that contain spaces
- part of fix for #2396

---------

Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-17 13:05:33 -08:00
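
A sketch of the two fixes from #2397 above, using Python's uuid module and the python-slugify package; the helper names are illustrative:

```python
# Sketch only: illustrative helpers, not the actual Browsertrix upload code.
from uuid import uuid4

from slugify import slugify


def new_upload_page_id() -> str:
    # always mint a fresh id for upload pages so a WACZ that started life as
    # a crawl on another browsertrix instance cannot collide with existing pages
    return str(uuid4())


def clean_upload_name(filename: str) -> str:
    # slugify replaces spaces and other unsafe characters,
    # e.g. "My Crawl (copy).wacz" -> "my-crawl-copy.wacz"
    stem, _, ext = filename.rpartition(".")
    return f"{slugify(stem)}.{ext}" if stem else slugify(filename)
```
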
Ilya Kreymer
4516268a70
misc fixes: cors + disable buffering for uploads (#2395)
- ensure the pages endpoint supports CORS for local dev
- disable proxy request buffering to support large uploads
2025-02-13 19:38:20 -08:00
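
The CORS half of #2395 above can be illustrated with FastAPI's standard middleware (the origin below is a placeholder); the upload-buffering half is an nginx/ingress setting (`proxy_request_buffering off`) rather than Python code:

```python
# Sketch only: standard FastAPI CORS middleware with a placeholder origin.
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:9870"],  # placeholder local-dev origin
    allow_credentials=True,
    allow_methods=["GET", "HEAD", "OPTIONS"],
    allow_headers=["*"],
)
```
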
Tessa Walsh
7f1af9bb31
Mark all pages from pages.jsonl as seeds (#2390)
Fixes #2389 

All pages from `pages/pages.jsonl` files now have `isSeed: True` in the
database, in addition to any pages that explicitly have `seed` set to
true in the actual JSONL.

Tests have been added to ensure that all pages from our fixture uploads
have `isSeed: True`.

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-13 16:54:30 -08:00
Ilya Kreymer
7b2932c582
Add initial pages + pagesQuery endpoint to /replay.json APIs (#2380)
Fixes #2360 

- Adds `initialPages` to the /replay.json response for collections, returning
up to 25 pages (seed pages first, then sorted by capture time).
- Adds `pagesQueryUrl` to /replay.json
- Adds a public pages search endpoint to support public collections.
- Adds `preloadResources`, including list of WACZ files that should
always be loaded, to /replay.json

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2025-02-13 16:53:47 -08:00
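
A sketch of how the `initialPages` list from #2380 above could be assembled (seed pages first, then by capture time, capped at 25); the collection and field names are assumptions:

```python
# Sketch only: illustrative query, not the actual /replay.json implementation.
async def get_initial_pages(pages, crawl_ids: list[str], limit: int = 25) -> list[dict]:
    cursor = (
        pages.find({"crawl_id": {"$in": crawl_ids}}, {"_id": 0})
        # seed pages first (isSeed descending), then earliest capture time
        .sort([("isSeed", -1), ("ts", 1)])
        .limit(limit)
    )
    return await cursor.to_list(limit)
```
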
Tessa Walsh
98a45b0d85
Add collection page list/search endpoint (#2354)
Fixes #2353

Adds a new endpoint to list pages in a collection, with filtering
available on `url` (exact match), `ts`, `urlPrefix`, `isSeed`, and
`depth`, as well as accompanying tests. Additional sort options have
been added as well.

These same filters and sort options have also been added to the crawl
pages endpoint.

Also fixes an issue where `isSeed` wasn't being set in the database when
false but only added on serialization, which was preventing filtering
from working as expected.
2025-02-10 16:44:37 -08:00
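
A sketch of how the filters listed in #2354 above might translate into a single MongoDB query; the filter-building helper and field names are assumptions based on the commit text:

```python
# Sketch only: hypothetical filter builder, not the actual endpoint code.
from typing import Optional


def build_page_filter(
    crawl_ids: list[str],
    url: Optional[str] = None,
    url_prefix: Optional[str] = None,
    ts: Optional[str] = None,
    is_seed: Optional[bool] = None,
    depth: Optional[int] = None,
) -> dict:
    query: dict = {"crawl_id": {"$in": crawl_ids}}
    if url is not None:
        query["url"] = url                    # exact match
    elif url_prefix is not None:
        query["url"] = {"$gte": url_prefix}   # prefix match via index range scan
    if ts is not None:
        query["ts"] = ts
    if is_seed is not None:
        query["isSeed"] = is_seed             # works now that False is stored
    if depth is not None:
        query["depth"] = depth
    return query
```
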
Tessa Walsh
0e9e70f3a3
Add WACZ filename, depth, favIconUrl, isSeed to pages (#2352)
Adds `filename` to pages, pointing to the WACZ file each page comes
from, as well as depth, favIconUrl, and isSeed. Also adds an idempotent
migration to backfill this information for existing pages, and increases
the backend container's startupProbe time to 24 hours to give it sufficient
time to finish the migration.
---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-02-05 15:50:04 -05:00
Tessa Walsh
763c654484
feat: Update collection sorting, metadata, stats (#2327)
- Refactors dashboard and org profile preview to use private API
endpoint, to fix public collections not showing when the org
visibility is hidden
- Adds additional sorting options for collections
- Adds unique page url counts for archived items, collections, and
organizations to backend and exposes this in collections
- Shows collection period (i.e. `dateEarliest` to `dateLatest`) in
collections list
- Shows same collection metadata in private and public views, updates
private view info bar
- Fixes "Update Org Profile" action item showing for crawler roles

---------

Co-authored-by: sua yoo <sua@webrecorder.org>
Co-authored-by: sua yoo <sua@suayoo.com>
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2025-01-23 13:32:23 -05:00
Tessa Walsh
6797b41de0
Add pageCount to crawls and uploads and use in frontend for page counts (#2315)
Fixes #2257 

This is a follow-up to the public collections work, which adds pages to
the database for uploads. All crawls and uploads now have a `pageCount`
field which is populated when the item is successfully added. A new
migration is also added to populate the field for existing archived
items that don't have it set yet.

OrgMetrics have also been modified to include `crawlPageCount` and
`uploadPageCount`, and to include the total of both in `pageCount`, and
all three included in the frontend org dashboard.

The frontend has been updated to use `pageCount` rather than
`stats.done` wherever appropriate, meaning that in archived item lists
and details we now have a consistent page count for both crawls and
uploads.

### New functionality

- Deploy this branch
- Create new crawls and uploads and verify that page count appears
correctly throughout the frontend for all new crawls and uploads

### Migration

- Deploy from latest main
- Create some crawls and uploads
- Change to this branch and re-deploy
- Verify migration ran without errors in backend logs
- Verify that page count has been populated successfully by checking
archived items lists, crawl and upload detail pages, and dashboard to
ensure there are no longer any missing page counts.

---------

Co-authored-by: emma <hi@emma.cafe>
2025-01-16 14:41:14 -08:00
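
A sketch of the `pageCount` backfill migration described in #2315 above, assuming Motor collections named `crawls` and `pages`; the query shapes are illustrative:

```python
# Sketch only: hypothetical backfill, not the actual migration code.
async def backfill_page_counts(crawls, pages) -> None:
    # only touch archived items that don't have pageCount set yet (missing or null)
    async for item in crawls.find({"pageCount": None}, {"_id": 1}):
        count = await pages.count_documents({"crawl_id": item["_id"]})
        await crawls.update_one({"_id": item["_id"]}, {"$set": {"pageCount": count}})
```
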
Tessa Walsh
a031fab313
Backend work for public collections (#2198)
Fixes #2182 

This rather large PR adds the rest of what should be needed for public
collections work in the frontend.

New API endpoints include:

- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail

Changes to existing API endpoints include:

- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.

Other big changes:

- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
2025-01-13 15:15:48 -08:00
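
A sketch of the namespace-as-filter behavior of the reAdd endpoint described in #2198 above; the route handler and job-launcher names are hypothetical:

```python
# Sketch only: hypothetical route, not the actual Browsertrix handler.
from typing import Optional

from fastapi import APIRouter

router = APIRouter()


@router.post("/orgs/{oid}/{namespace}/all/pages/reAdd")
async def readd_org_pages(oid: str, namespace: str) -> dict:
    # "crawls" or "uploads" narrows the background job to that item type;
    # "all-crawls" re-adds pages for every archived item in the org
    crawl_type: Optional[str] = namespace if namespace in ("crawls", "uploads") else None
    job_id = await start_readd_pages_job(oid, crawl_type)  # hypothetical launcher
    return {"started": job_id}


async def start_readd_pages_job(oid: str, crawl_type: Optional[str]) -> str:
    # kick off the background job that re-adds pages and then updates
    # collection date ranges, counts, and tags (out of scope for this sketch)
    return "job-id"
```
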
Tessa Walsh
123705c53f
Serialize datetimes with Z suffix (#2058)
Use timezone aware datetimes instead of timezone naive datetimes:
- Update mongodb client to use tz-aware conversion
- Convert dt_now() to return timezone aware UTC date
- Rename to_k8s_date -> date_to_str, just returns ISO UTC date with 'Z'
(instead of '+00:00' suffix)
- Rename from_k8s_date -> str_to_date, returns timezone aware date from
str
- Standardize all string<->date conversion to use either date_to_str or
str_to_date
- Update frontend to assume iso date, not append 'Z' directly
- Update tests to check for 'Z' suffix on some dates

---------
Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-09-12 16:16:13 -07:00
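
A sketch of the tz-aware helpers named in #2058 above; the function names follow the commit, while the bodies are illustrative:

```python
# Sketch only: illustrative implementations of the helpers named above.
from datetime import datetime, timezone


def dt_now() -> datetime:
    # timezone-aware UTC "now"
    return datetime.now(timezone.utc).replace(microsecond=0)


def date_to_str(dt: datetime) -> str:
    # ISO 8601 UTC with a 'Z' suffix instead of '+00:00'
    return dt.isoformat().replace("+00:00", "Z")


def str_to_date(string: str) -> datetime:
    # accept either 'Z' or '+00:00' and return a timezone-aware datetime
    return datetime.fromisoformat(string.replace("Z", "+00:00"))
```
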
Ilya Kreymer
5f53db75ee
fix resetting of invalid logins: (#2002)
* Fixes issue in FailedLogin model:
- fix data-model to remove nested 'attempted.attempted'
- migrate existing data to remove nested field

* Also, avoid setting dt_now() in the model as that results in a fixed date for
all objects:
- update FailedLogin to update 'attempted' date on every attempt
- also update PageNote object to set date in constructor

* Update text for too many logins to make it clear it is set only if it's a
valid email

* fixes #2001
2024-08-07 12:36:06 -07:00
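
A minimal illustration of the fixed-date pitfall fixed in #2002 above: a default evaluated at class-definition time is shared by every object, while `default_factory` runs per instance (the models are simplified, not the real FailedLogin):

```python
# Sketch only: simplified models showing the fixed-default pitfall.
from datetime import datetime, timezone

from pydantic import BaseModel, Field


def dt_now() -> datetime:
    return datetime.now(timezone.utc)


class FailedLoginBroken(BaseModel):
    email: str
    # BUG: dt_now() runs once at import time, so every object shares one date
    attempted: datetime = dt_now()


class FailedLoginFixed(BaseModel):
    email: str
    # default_factory is called per instance, so each attempt gets a fresh date
    attempted: datetime = Field(default_factory=dt_now)
```
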
Ilya Kreymer
94e985ae13
optimize org quota lookups (#1973)
- instead of looking up storage and exec minute quotas by oid and
loading the org each time, load the org once and then check quotas on the org
object - often the org was already available and was looked up
again
- storage and exec quota checks become sync
- rename can_run_crawl() to more generic can_write_data(), optionally
also checks exec minutes
- typing: get_org_by_id() always returns org, or throws, adjust methods
accordingly (don't check for none, catch exception)
- typing: fix typo in BaseOperator, catch type errors in operator
'org_ops'
- operator quota check: use up-to-date 'status.size' for current job,
ignore current job in all jobs list to avoid double-counting
- follow up to #1969
2024-07-25 14:00:16 -07:00
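
An illustrative shape of the quota refactor in #1973 above: quota checks become plain synchronous methods over an already-loaded org instead of async lookups by oid (fields and method names are assumptions, not the real Organization model):

```python
# Sketch only: illustrative model, not the actual Browsertrix Organization.
class Org:
    def __init__(self, bytes_stored: int, storage_quota: int,
                 exec_mins_used: int, exec_mins_quota: int):
        self.bytes_stored = bytes_stored
        self.storage_quota = storage_quota      # 0 means unlimited (assumption)
        self.exec_mins_used = exec_mins_used
        self.exec_mins_quota = exec_mins_quota

    def storage_quota_reached(self) -> bool:
        return self.storage_quota > 0 and self.bytes_stored >= self.storage_quota

    def exec_mins_quota_reached(self) -> bool:
        return self.exec_mins_quota > 0 and self.exec_mins_used >= self.exec_mins_quota

    def can_write_data(self, include_exec_minutes: bool = False) -> bool:
        # sync checks on the org object already in hand; no re-fetch by oid
        if self.storage_quota_reached():
            return False
        return not (include_exec_minutes and self.exec_mins_quota_reached())
```
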
Ilya Kreymer
8c0321bdea
Pydantic 2.x update + type fixes + python 3.12 (#1947)
* updates pydantic to 2.x
* also update to python 3.12
* additional type fixes:
- all Optional[] types must have a default value
- update to constrained types
- URL types converted from str
- test updates

Fixes #1940
2024-07-22 17:23:03 -07:00
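
A small before/after sketch of the Optional-default rule from the pydantic 2.x upgrade in #1947 above (the models are illustrative):

```python
# Sketch only: illustrative of the pydantic 2.x changes, not real models.
from typing import Optional

from pydantic import BaseModel, HttpUrl


class ExampleV1Style(BaseModel):
    # under pydantic 1.x a bare Optional[...] field implicitly defaulted to
    # None; under 2.x it becomes a required field unless a default is given
    description: Optional[str]


class ExampleV2Style(BaseModel):
    # pydantic 2.x: make the None default explicit to keep the old behavior
    description: Optional[str] = None
    # URL-typed fields move from plain str to constrained URL types
    homepage: Optional[HttpUrl] = None
```
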
Ilya Kreymer
335700e683
Additional typing cleanup (#1938)
Misc typing fixes, including in profiles and time functions

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-07-17 10:49:22 -07:00
Tessa Walsh
d41647e6c2
Document all API endpoints with response models (#1928)
Fixes #1920 

Adds response models to all API endpoints that were missing them,
documenting current behavior without making any changes at this stage to
standardize responses.

Follow-up work will involve adding generics to some of the response models
2024-07-16 12:48:38 -07:00
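
The pattern from #1928 above is FastAPI's standard `response_model` argument; a minimal example with illustrative names:

```python
# Sketch only: the response_model documentation pattern, names illustrative.
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()


class DeletedResponse(BaseModel):
    deleted: bool


@router.delete("/orgs/{oid}/crawls/{crawl_id}", response_model=DeletedResponse)
async def delete_crawl(oid: str, crawl_id: str) -> DeletedResponse:
    # documents the existing return shape in the OpenAPI schema
    # without changing the endpoint's behavior
    return DeletedResponse(deleted=True)
```
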
Tessa Walsh
aaf18e70a0
Add created date to Organization and fix datetimes across backend (#1921)
Fixes #1916

- Add `created` field to Organization and OrgOut, set on org creation
- Add migration to backfill `created` dates from first workflow
`created`
- Replace `datetime.now()` and `datetime.utcnow()` across app with
consistent timezone-aware `utils.dt_now` helper function, which now uses
`datetime.now(timezone.utc)`. This is in part to ensure consistency in
how we handle datetimes, and also to get ahead of timezone naive
datetime creation methods like `datetime.utcnow()` being deprecated in
Python 3.12. For more, see:
https://blog.miguelgrinberg.com/post/it-s-time-for-a-change-datetime-utcnow-is-now-deprecated
2024-07-15 19:46:32 -07:00
Ilya Kreymer
3bd714ea9d
QA stats aggregation: exclude isFile / isError pages from stats (#1879)
Follow-up to: #1868, exclude pages that have isFile or isError set to
true from the stats aggregation.
2024-06-25 08:54:42 -07:00
Tessa Walsh
879e509b39
Backend: Move page file and error counts to crawl replay.json endpoint (#1868)
Backend work for #1859

- Remove file count from qa stats endpoint
- Compute isFile or isError per page when page is added
- Increment filePageCount and errorPageCount per crawl to count number of isFile or isError pages
- Add file and error counts to crawl replay.json endpoint (filePageCount and errorPageCount)
- Add migration 0028 to set isFile / isError for each page, aggregate filePageCount / errorPageCount per crawl
- Determine if page is a file based on loadState == 2, mime type or status code and lack of title
2024-06-20 19:02:57 -07:00
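
A hedged reading of the file-detection heuristic in the last bullet of #1868 above; the exact mime and status handling in Browsertrix may differ:

```python
# Sketch only: an illustrative heuristic, not the exact Browsertrix logic.
HTML_MIME_TYPES = ("text/html", "application/xhtml+xml")  # assumed HTML mimes


def page_is_file(load_state: int, mime: str, status: int, title: str) -> bool:
    # per the commit: loadState == 2 combined with a non-HTML mime type (or a
    # non-200 status -- the precise status rule is an assumption here) and a
    # missing title marks the page as a direct file download
    non_html_mime = bool(mime) and not any(mime.startswith(m) for m in HTML_MIME_TYPES)
    return load_state == 2 and (non_html_mime or status != 200) and not title
```
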
Tessa Walsh
8b0d1432af
Show QA meter while analysis is running (#1854)
Fixes #1846 

- Ensure meter auto-updates as new stats are ready
- Switch meter to new QA run when new analysis run is started
- Remove Files from QA meter (files and errors will be reported separately)

Co-authored-by: emma <hi@emma.cafe>
Co-authored-by: sua yoo <sua@webrecorder.org>
2024-06-12 12:32:01 -04:00
Tessa Walsh
a85f9496b0
Include number of Identical Files in QA stats and meter (#1848)
This PR adds Identical Files to the QA Page Match Analysis meter bars.
To do this, the backend calculates the number of non-HTML pages once and
includes it under the key `Files` in each of the `screenshotMatch` and
`textMatch` QA stats return arrays.

The backend additionally removes the file count from "No Data" to
prevent these from being counted twice.

---------

Co-authored-by: emma <hi@emma.cafe>
2024-06-06 13:15:19 -04:00
sua yoo
1915274e26
Fix QA review comments (#1723)
Fixes https://github.com/webrecorder/browsertrix/issues/1710

Fixes date and deletion for newly added comments.
2024-04-23 16:31:52 -04:00
Tessa Walsh
30ab139ff2
Add QA run aggregate stats API endpoint (#1682)
Fixes #1659 

Takes an arbitrary set of thresholds for text and screenshot matches as
a comma-separated list of floats.

Returns a list of groupings for each that include the lower boundary and
count for all thresholds passed in.
2024-04-17 13:24:18 -04:00
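
A sketch of the threshold grouping described in #1682 above, assuming match scores between 0 and 1 and a comma-separated thresholds parameter; prepending a 0.0 lower bound is an assumption:

```python
# Sketch only: hypothetical bucketing helper, not the actual endpoint code.
def bucket_by_thresholds(scores: list[float], thresholds_param: str) -> list[dict]:
    # e.g. thresholds_param "0.5,0.9" -> lower bounds 0.0, 0.5, 0.9
    if not thresholds_param:
        return []
    bounds = sorted(float(t) for t in thresholds_param.split(","))
    if bounds[0] > 0.0:
        bounds = [0.0] + bounds
    buckets = [{"lowerBoundary": lb, "count": 0} for lb in bounds]
    for score in scores:
        # count each score in the highest bucket whose lower bound it reaches
        for bucket in reversed(buckets):
            if score >= bucket["lowerBoundary"]:
                bucket["count"] += 1
                break
    return buckets
```
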
Tessa Walsh
87e0873f1a
Add mime field to Page model (#1678)
2024-04-17 00:57:49 -04:00
Tessa Walsh
00ced6dd6b
Add single page QA GET endpoint (#1635)
Fixes #1634 

Also make sure the other GET page endpoint (without QA) uses the PageOut model
2024-03-27 14:57:59 -07:00
Tessa Walsh
e9895e78a2
Add additional filters to page list endpoints (#1622)
Fixes #1617 

Filters added:

- reviewed: filter by whether a page has an approval or at least one note (true) or
neither (false)
- approved: filter by approval value (accepts list of strings,
comma-separated, each of which are coerced into True, False, or None, or
ignored if they are invalid values)
- hasNotes: filter by has at least one note (true) or not (false)

Tests have also been added to ensure that results are as expected.
2024-03-21 21:33:07 -07:00
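
A sketch of the `approved` value coercion described in #1622 above; the exact mapping of strings to True/False/None is an assumption from the commit text:

```python
# Sketch only: assumed coercion rules, not the exact Browsertrix mapping.
from typing import Optional


def parse_approved_filter(param: str) -> list[Optional[bool]]:
    # e.g. "true,false,none" -> [True, False, None]; invalid entries are ignored
    mapping = {"true": True, "false": False, "none": None}
    values: list[Optional[bool]] = []
    for raw in param.split(","):
        key = raw.strip().lower()
        if key in mapping:
            values.append(mapping[key])
    return values
```

The resulting list could then feed a query along the lines of `{"approved": {"$in": values}}`.
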
Ilya Kreymer
4f676e4e82
QA Runs Initial Backend Implementation (#1586)
Supports running QA Runs via the QA API!

Builds on top of the `issue-1498-crawl-qa-backend-support` branch, fixes
#1498

Also requires the latest Browsertrix Crawler 1.1.0+ (from
webrecorder/browsertrix-crawler#469 branch)

Notable changes:
- QARun objects contain info about QA runs, which are crawls
performed on data loaded from existing crawls.

- Various crawl db operations can be performed on either the crawl or
`qa.` object, and core crawl fields have been moved to CoreCrawlable.

- While running, `QARun` data is stored in a single `qa` object, while
finished QA runs are added to the `qaFinished` dictionary on the Crawl. The
QA list API returns data from the finished list, sorted by most recent
first.

- Includes additional type fixes / type safety, especially around
BaseCrawl / Crawl / UploadedCrawl functionality, also creating specific
get_upload(), get_basecrawl(), get_crawl() getters for internal use and
get_crawl_out() for API

- Support filtering and sorting pages via `qaFilterBy` (screenshotMatch, textMatch) 
along with `gt`, `lt`, `gte`, `lte` params to return pages based on QA results.

---------
Co-authored-by: Tessa Walsh <tessa@bitarchivist.net>
2024-03-20 22:42:16 -07:00
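
A simplified sketch of the `qa` / `qaFinished` bookkeeping described in #1586 above (models reduced to a few fields for illustration):

```python
# Sketch only: simplified shape of the QA run bookkeeping, not the real models.
from typing import Dict, Optional

from pydantic import BaseModel


class QARun(BaseModel):
    id: str
    state: str
    started: Optional[str] = None
    finished: Optional[str] = None


class CrawlWithQA(BaseModel):
    id: str
    # the active QA run lives in `qa`; completed runs move into `qaFinished`,
    # keyed by run id, and the QA list API reads from the finished dict,
    # sorted by most recent first
    qa: Optional[QARun] = None
    qaFinished: Dict[str, QARun] = {}
```
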
Tessa Walsh
21ae38362e
Add endpoints to read pages from older crawl WACZs into database (#1562)
Fixes #1597

New endpoints (replacing old migration) to re-add crawl pages to db from
WACZs.

After a few implementation attempts, we settled on using
[remotezip](https://github.com/gtsystem/python-remotezip) to handle
parsing of the zip files and streaming their contents line-by-line for
pages. I've also modified the sync log streaming to use remotezip as
well, which allows us to remove our own zip module and let remotezip
handle the complexity of parsing zip files.

Database inserts for pages from WACZs are batched 100 at a time to help
speed up the endpoint, and the task is kicked off using
asyncio.create_task so as not to block before giving a response.

StorageOps now contains a method for streaming the bytes of any file in
a remote WACZ, requiring only the presigned URL for the WACZ and the
name of the file to stream.
2024-03-19 14:14:21 -07:00
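
A sketch of the remotezip approach from #1562 above, streaming `pages/pages.jsonl` out of a presigned WACZ URL and batching inserts 100 at a time; the surrounding function is illustrative:

```python
# Sketch only: illustrative use of remotezip, not the actual StorageOps code.
import json

from remotezip import RemoteZip


async def import_pages_from_wacz(pages_coll, presigned_url: str, crawl_id: str) -> None:
    batch: list[dict] = []
    with RemoteZip(presigned_url) as wacz:            # fetches only needed byte ranges
        with wacz.open("pages/pages.jsonl") as fh:    # stream the member line by line
            for line in fh:
                page = json.loads(line)
                if "url" not in page:
                    continue                          # skip the jsonl header record
                page["crawl_id"] = crawl_id
                batch.append(page)
                if len(batch) >= 100:                 # batch inserts 100 at a time
                    await pages_coll.insert_many(batch)
                    batch = []
    if batch:
        await pages_coll.insert_many(batch)
```
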
Ilya Kreymer
09a0d51843
pages: set page status to 200 if unset and loadState != 0 (#1563)
Follow-up to #1516, ensure page status is set to 200 if no status is
provided and loadState is not 0
2024-02-29 15:15:17 -08:00
Tessa Walsh
14189b7cfb
Add crawl pages and related API endpoints (#1516)
Fixes #1502 

- Adds pages to database as they get added to Redis during crawl
- Adds migration to add pages to database for older crawls from
pages.jsonl and extraPages.jsonl files in WACZ
- Adds GET, list GET, and PATCH update endpoints for pages
- Adds POST (add), PATCH, and POST (delete) endpoints for page notes,
each with their own id, timestamp, and user info in addition to text
- Adds page_ops methods for 1. adding resources/urls to page, and 2.
adding automated heuristics and supplemental info (mime, type, etc.) to
page (for use in crawl QA job)
- Modifies `Migration` class to accept kwargs so that we can pass in ops
classes as needed for migrations
- Deletes WACZ files and pages from database for failed crawls during
crawl_finished process
- Deletes crawl pages when a crawl is deleted

Note: Requires a crawler version 1.0.0 beta3 or later, with support for
`--writePagesToRedis` to populate pages at crawl completion. Beta 4 is
configured in the test chart, which should be upgraded to stable 1.0.0
when it's released.

Connected to https://github.com/webrecorder/browsertrix-crawler/pull/464

---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
2024-02-28 12:11:35 -05:00
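
A sketch of draining crawler-written pages from Redis into the database as described in #1516 above; the Redis key name and page handling are assumptions about the `--writePagesToRedis` output:

```python
# Sketch only: hypothetical ingestion loop, not the actual operator code.
import json


async def drain_pages_from_redis(redis, pages_coll, crawl_id: str) -> None:
    # pop page records pushed by the crawler and insert them as they arrive
    while True:
        raw = await redis.lpop(f"{crawl_id}:pages")  # key name is an assumption
        if not raw:
            break
        page = json.loads(raw)
        page["crawl_id"] = crawl_id
        await pages_coll.insert_one(page)
```
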