Rotate IP Address and User-agent to Scrape Data

Sadman Kabir Soumik
Published in Geek Culture · Feb 20, 2022

When you run a web crawler that sends too many requests to the target site within a short time from the same IP address and device, the site might serve a reCAPTCHA challenge, or even block your IP address to stop you from scraping data.


In this article, I will show you two different methods you can apply in your web crawler to avoid such problems using Python:
1. Rotate your IP address
2. Rotate User-agent

Rotate IP address

You can provide a proxy with each request. If you keep using one particular IP, the site might detect and block it. To solve this problem, you can rotate your IP and send a different IP address with each request. This will make your program a bit slower, but it may help you avoid getting blocked by the target site. You could use the Tor Browser and configure Tor proxies accordingly, but here we will use a Python Tor client called torpy that doesn't require you to download the Tor Browser on your system. The library is available on GitHub.

You can install the library using the following command:

pip install torpy

Let’s say we want to send requests to the following sites:

urls = [
    "https://www.google.com",
    "https://www.facebook.com",
    "https://www.youtube.com",
    "https://www.amazon.com",
    "https://www.reddit.com",
    "https://www.instagram.com",
    "https://www.linkedin.com",
    "https://www.wikipedia.org",
    "https://www.twitter.com",
]

So, we are going to write a function that starts a new session with each URL request, then loop through all the URLs and pass each URL to a new session.

code snippet 1
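A minimal sketch of code snippet 1, assuming torpy's TorRequests and get_session API (the helper name fetch_with_new_session is just illustrative):

from torpy.http.requests import TorRequests  # may require: pip install torpy[requests]

def fetch_with_new_session(url):
    # Each TorRequests context builds a fresh Tor circuit, so the target
    # site sees a different exit-node IP address for every request
    with TorRequests() as tor_requests:
        with tor_requests.get_session() as session:
            # Print the IP address the target site sees for this session
            print(session.get("https://httpbin.org/ip").json())
            return session.get(url, timeout=40).text

for url in urls:  # urls is the list defined above
    fetch_with_new_session(url)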

We can check our IP address from this site: https://httpbin.org/ip
In the snippet, we print the IP address of each session. If we execute the program, we get the IP address used for each request. In my case, the output looks like the following:

{'origin': '107.189.7.175'}
{'origin': '185.220.101.162'}
{'origin': '185.220.101.79'}
{'origin': '103.236.201.88'}
{'origin': '185.220.100.242'}
{'origin': '209.141.53.20'}
{'origin': '198.98.62.79'}
{'origin': '184.105.220.24'}
{'origin': '193.218.118.167'}

As you can see, the IP address is different for each request.

Rotate User-agent

Most websites block requests that come without valid browser information. So, we usually pass the browser information in the form of a User-Agent header with each request, like below:

import requests

USER_AGENT = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_2) AppleWebKit/600.4.10 (KHTML, like Gecko) Version/8.0.4 Safari/600.4.10"
HEADERS = {"User-Agent": USER_AGENT}

html_content = requests.get(url, headers=HEADERS, timeout=40).text

A User-Agent string usually contains information about the application type, operating system, software version, and so on.

When you keep the User-Agent unchanged, as in the above code snippet, the target site can detect that all the requests your program sends are coming from the same device. We can avoid that by sending a valid but different user-agent with each request. You can find many valid user-agent strings from this site.

The idea is to make a list of valid User-Agents and then randomly choose one of them with each request. So, let's make a list of valid user agents:

import random

AGENT_LIST = [
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/44.0.2403.157 Safari/537.36",
    "Mozilla/5.0 (X11; Ubuntu; Linux i686; rv:24.0) Gecko/20100101 Firefox/24.0",
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) HeadlessChrome/91.0.4472.114 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_5) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/13.1.1 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:77.0) Gecko/20100101 Firefox/77.0",
]
USER_AGENT = random.choice(AGENT_LIST)

Now, let's randomize our user-agents in code snippet 1, where we rotated the IP address. The following program changes both your IP address and your user-agent with each request.
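Putting the two together might look like the sketch below, again assuming torpy's TorRequests API (the helper name fetch is illustrative):

import random

from torpy.http.requests import TorRequests

def fetch(url):
    # Pick a different user-agent for every request
    headers = {"User-Agent": random.choice(AGENT_LIST)}  # AGENT_LIST from above
    # A new TorRequests context means a new Tor circuit, and therefore a new IP
    with TorRequests() as tor_requests:
        with tor_requests.get_session() as session:
            print(session.get("https://httpbin.org/ip", headers=headers).json())
            return session.get(url, headers=headers, timeout=40).text

for url in urls:  # urls is the list defined earlier
    fetch(url)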

Another simple approach is to add a time.sleep() call before each request to avoid reCAPTCHA problems, like below:
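A minimal sketch of that idea (the 1 to 3 second delay is just an example):

import random
import time

import requests

for url in urls:  # urls and HEADERS are defined in the earlier snippets
    # Sleep for a random 1-3 seconds before each request so the traffic
    # looks less like an automated burst
    time.sleep(random.randint(1, 3))
    html_content = requests.get(url, headers=HEADERS, timeout=40).text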

Here, we call time.sleep() with a random number of seconds between 1 and 3 before each request.

Remember, all of the above methods will make your web crawling slower than usual, but they help you avoid getting blocked by the target site and bypass reCAPTCHA issues.
