From 251aef3ac19fd8c08172a422a97ddb6e261c743e Mon Sep 17 00:00:00 2001
From: Henry Wilkinson
Date: Thu, 30 May 2024 14:50:10 -0400
Subject: [PATCH] Docs: Elaborates on using user agents (#1841)

- Provides a link to Mozilla's page explaining what they are (good for folks new to the concept)
- Provides a link to useragents.me, the same site we link to in the app
- Provides two examples of situations where they may be helpful to get around content restrictions
---
 docs/user-guide/workflow-setup.md | 15 ++++++++++++++-
 1 file changed, 14 insertions(+), 1 deletion(-)

diff --git a/docs/user-guide/workflow-setup.md b/docs/user-guide/workflow-setup.md
index f6d91eeb..e2668386 100644
--- a/docs/user-guide/workflow-setup.md
+++ b/docs/user-guide/workflow-setup.md
@@ -168,7 +168,20 @@ Will prevent any content from the domains listed in [Steven Black's Unified Host
 
 ### User Agent
 
-Sets the browser's user agent in outgoing requests to the specified value. If left blank, the crawler will use the browser's default user agent.
+Sets the browser's [user agent](https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/User-Agent) in outgoing requests to the specified value. If left blank, the crawler will use the Brave browser's default user agent. For a list of common user agents, see [useragents.me](https://www.useragents.me/).
+
+??? example "Using custom user agents to get around restrictions"
+    Although it is against best practices, some websites block specific browsers based on their user agent: a string of text that browsers send to web servers to identify what type of browser or operating system is requesting content. If Brave is blocked, using the user agent string of a different browser (such as Chrome or Firefox) may be enough to convince the website that a different browser is being used.
+
+    User agents can also be used to voluntarily identify your crawling activity, which can be useful when working with a website's owner to ensure crawls can be completed successfully. We recommend using a user agent string similar to the following, replacing the `orgname` and URL comment with your own:
+
+    ```
+    Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/125.0.0.0 Safari/537.3 orgname.browsertrix (+https://example.com/crawling-explanation-page)
+    ```
+
+    If you have no webpage identifying your organization or describing your crawling activities to link to, omit the parenthetical comment at the end entirely.
+
+    This string must be provided to the website's owner so they can allowlist Browsertrix and prevent it from being blocked.
 
 ### Language
 
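Before relying on a custom user agent for a large crawl, it can help to confirm that the target site responds normally when that string is sent. The sketch below is not part of Browsertrix or this patch; it is a minimal check using only Python's standard library, and the URL and user agent values are placeholders to replace with the site you plan to crawl and the string configured in your workflow.

```python
# Minimal sketch: send one request with a custom User-Agent header and report
# whether the target site accepts or rejects it. The URL and user agent string
# below are placeholders; substitute your own values.
import urllib.error
import urllib.request

USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) "
    "Chrome/125.0.0.0 Safari/537.3 orgname.browsertrix "
    "(+https://example.com/crawling-explanation-page)"
)

request = urllib.request.Request(
    "https://example.com/",  # placeholder: page you intend to crawl
    headers={"User-Agent": USER_AGENT},
)

try:
    with urllib.request.urlopen(request, timeout=30) as response:
        print("Accepted:", response.status)
except urllib.error.HTTPError as err:
    # A 403 or similar status here suggests this user agent is being blocked.
    print("Rejected:", err.code)
```

A single check like this is only indicative: some sites vary their blocking by path, request rate, or IP address rather than by user agent alone.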