Fix incorrect total page count when any of the seeds is a redirect (#1649)

Following the changes in webrecorder/browsertrix-crawler#475 and
webrecorder/browsertrix-crawler#509, the crawler also adds a redirected
seed to the seen list. To account for this, the number of extra seeds is
subtracted from the size of the seen list to get the total page count.
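
As a rough illustration of the adjustment described above, here is a minimal sketch that models the seen list as a plain Python set and the extra seeds as a plain list. The function name `adjusted_pages_found` and the example URLs are hypothetical and are not part of the operator code, which reads these counts from Redis instead.

```python
def adjusted_pages_found(seen: set[str], extra_seeds: list[str]) -> int:
    """Return the page total after discounting redirected-seed entries.

    Hypothetical helper for illustration only: each redirected seed is
    assumed to add one extra entry to the seen set, so the number of
    extra seeds is subtracted from the set size.
    """
    return len(seen) - len(extra_seeds)


# Hypothetical crawl state: one seed that redirects, plus one linked page.
seen = {
    "http://example.com/",            # original seed URL
    "https://www.example.com/",       # redirect target, recorded as an extra seed
    "https://www.example.com/about",  # page discovered during the crawl
}
extra_seeds = ["https://www.example.com/"]

print(adjusted_pages_found(seen, extra_seeds))  # 2
```

Without the subtraction, this crawl would report three pages found even though only two distinct pages exist.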
Ilya Kreymer 2024-04-04 15:55:44 -07:00 committed by GitHub
parent 83c9203a11
commit 5c08c9679c


@@ -1178,6 +1178,11 @@ class CrawlOperator(BaseOperator):
pages_done = await redis.llen(f"{crawl_id}:d")
pages_found = await redis.scard(f"{crawl_id}:s")
# account for extra seeds and subtract from seen list
extra_seeds = await redis.llen(f"{crawl_id}:extraSeeds")
if extra_seeds:
    pages_found -= extra_seeds
sizes = await redis.hgetall(f"{crawl_id}:size")
archive_size = sum(int(x) for x in sizes.values())