Commit Graph

5 Commits

Author SHA1 Message Date
Ilya Kreymer
8a507f0473
Consolidate list page endpoints + better QA sorting + optimize pages fix (#2417)
- consolidate list_pages() and list_replay_query_pages() into
list_pages()
- to keep backwards compatibility, add <crawl>/pagesSearch that does not
include page totals, keep <crawl>/pages with page total (slower)
- qa frontend: add default 'Crawl Order' sort order, to better show
pages in QA view
- bgjob: account for parallelism in bgjobs, add logging if succeeded
mismatches parallelism
- QA sorting: default to 'crawl order' by default to get better results.
- Optimize pages job: also cover crawls that may not have any pages but have pages listed in done stats
- Bgjobs: give custom op jobs more memory
2025-02-21 13:47:20 -08:00
Tessa Walsh
f8fb2d2c8d
Rework crawl page migration + MongoDB Query Optimizations (#2412)
Fixes #2406 

Converts migration 0042 to launch a background job (parallelized across
several pods) to migrate all crawls by optimizing their pages and
setting `version: 2` on the crawl when complete.

Also Optimizes MongoDB queries for better performance.

Migration Improvements:

- Add `isMigrating` and `version` fields to `BaseCrawl`
- Add new background job type to use in migration with accompanying
`migration_job.yaml` template that allows for parallelization
- Add new API endpoint to launch this crawl migration job, and ensure
that we have list and retry endpoints for superusers that work with
background jobs that aren't tied to a specific org
- Rework background job models and methods now that not all background
jobs are tied to a single org
- Ensure new crawls and uploads have `version` set to `2`
- Modify crawl and collection replay.json endpoints to only include
fields for replay optimization (`initialPages`, `pageQueryUrl`,
`preloadResources`) if all relevant crawls/uploads have `version` set to
`2`
- Remove `distinct` calls from migration pathways
- Consolidate collection recompute stats

Query Optimizations:
- Remove all uses of $group and $facet
- Optimize /replay.json endpoints to precompute preload_resources, avoid
fetching crawl list twice
- Optimize /collections endpoint by not fetching resources 
- Rename /urls -> /pageUrlCounts and avoid $group, instead sort with
index, either by seed + ts or by url to get top matches.
- Use $gte instead of $regex to get prefix matches on URL
- Use $text instead of $regex to get text search on title
- Remove total from /pages and /pageUrlCounts queries by not using
$facet
- frontend: only call /pageUrlCounts when dialog is opened.


---------

Co-authored-by: Ilya Kreymer <ikreymer@gmail.com>
Co-authored-by: Emma Segal-Grossman <hi@emma.cafe>
Co-authored-by: Ilya Kreymer <ikreymer@users.noreply.github.com>
2025-02-20 15:26:11 -08:00
Ilya Kreymer
5bebb6161a
Issue 2396 readd pages fixes (#2398)
readd pages fixes:
    - add additional mem to background job
- copy page qa data to separate temp coll when re-adding pages, then
merge back in
2025-02-17 13:52:11 -08:00
Tessa Walsh
a031fab313
Backend work for public collections (#2198)
Fixes #2182 

This rather large PR adds the rest of what should be needed for public
collections work in the frontend.

New API endpoints include:

- Public collections endpoints: GET, streaming download
- Paginated list of URLs in collection with snapshot (page) info for
each
- Collection endpoint to set home URL
- Collection endpoint to upload thumbnail as stream
- DELETE endpoint to remove collection thumbnail

Changes to existing API endpoints include:

- Paginating public collection list results
- Several `pages` endpoints that previously only supported `/crawls/` in
their path, e.g. `/orgs/{oid}/crawls/all/pages/reAdd`, now support
`/uploads/` and `/all-crawls/` namespaces as well. This is necessitated
by adding pages for uploads to the database (see below). For
`/orgs/{oid}/namespace/all/pages/reAdd`, `crawls` or `uploads` will
serve as a filter to only affect crawls of that given type. Other
endpoints are more liberal at this point, and will perform the same
action regardless of the namespace used in the route (we'll likely want
to change this in a follow-up to be more consistent).
- `/orgs/{oid}/namespace/all/pages/reAdd` now kicks off a background job
rather than doing all of the computation in an asyncio task in the
backend container. The background job additionally updates collection
date ranges, page/size counts, and tags for each collection in the org
after pages have been (re)added.

Other big changes:

- New uploads will now have their pages read into the database!
Collection page counts now also include uploads
- A migration was added to start a background job for each org that will
add the pages for previously-uploaded WACZ files to the database and
update collections accordingly
- Adds a new `ImageFile` subclass of `BaseFile` for thumbnails that we
can use for other user-uploaded image files moving forward, with
separate output models for authenticated and public endpoints
2025-01-13 15:15:48 -08:00
Tessa Walsh
1b1819ba5a
Move org deletion to background job with access to backend ops classes (#2098)
This PR introduces background jobs that have full access to the backend
ops classes and moves the delete org job to a background job.
2024-10-10 14:41:05 -04:00