How We Obtained Whois Information for 500 Thousand Domains in 10 Days Using Python and Tor
Before reading, take a look at “Do you know the difference between Registrar, Registry, and Registrant?” to get a better understanding of this post.
In my latest project, I had to retrieve Whois information for a total of 500 thousand domains with various TLDs.
When I fetched Whois information for some domains, I realized it didn’t provide complete information. That’s when I became familiar with the concepts of thick Whois and thin Whois.
Now, what are these?
When you fetch Whois for a domain under a TLD like com or net, and that domain is registered through a Registrar, you won’t get all the information. Instead, you’ll only receive the name and address of the Registrar’s Whois server (the Registrar Whois Server), along with the domain’s registration and expiration dates. To obtain the full record, including contact information, you have to fetch the domain’s Whois from that Registrar Whois Server. This was the first challenge of the project.
The next challenge was that when we made many requests to a Whois server within a short period, the server returned an error stating that our quota for that time interval was exhausted and we had to retry later. The situation was worse with some Whois servers, which returned no error at all and simply kept the connection alive without responding.
To solve this problem, we had to use a proxy that changes the IP within a specified interval and set a timeout to reject the request if the Whois server did not respond quickly.
Now, let’s move on to coding.
Given these challenges, we needed a library that could fetch a domain’s Whois and, if the record pointed to a Registrar Whois Server, make a second request to that server to obtain the complete information. Most of the available libraries had issues.
I took matters into my own hands.
To fetch the Whois of a domain, these libraries use a pre-prepared list that maps each TLD to the address of its Whois server. As you might guess, none of these lists were complete. The next problem was that none of the libraries accepted proxies.
So, I started coding.
To find the Whois server of a TLD, we can connect to whois.iana.org over a socket, send the TLD, and read back the Whois server’s address.
Some TLDs don’t have Whois servers at all, and you need to visit the registry’s website to get Whois information, like the az TLD.
For multi-part TLDs, the Whois server is the same as that of the last part. For example, the Whois server for the co.uk TLD is the same as for the uk TLD.
To set up the socket connection, we used the PySocks library instead of Python’s standard socket library, since it allows socket connections through a proxy.
To use a proxy and change the IP, we used Tor.
To automatically change the IP, you need to put the following line in the torrc configuration file, located at /etc/tor/torrc:
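The Tor option matching this description is `MaxCircuitDirtiness`, which controls how long Tor reuses a circuit before building a new one:

```
MaxCircuitDirtiness 10
```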
With this config, the IP changes every 10 seconds. Note that 10 seconds is the minimum value for this setting.
Another way to change the IP in Tor is to send the HUP signal to it, like the following command:
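One common form of that command, assuming a single tor process (on systemd machines, `systemctl reload tor` has the same effect):

```shell
sudo kill -HUP "$(pidof tor)"
```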
The number of requests allowed in a given time interval varied between Whois servers. Therefore, we set the IP-change period to the minimum allowed value, i.e., 10 seconds.
Now, most steps have been completed. To execute the task, we used three servers, each of which took part of the domain list, fetched the Whois records, and stored them.
A crucial point: since Tor exit IPs are public, other users may already have exhausted a Whois server’s quota for a given IP in that time interval. So even after changing the IP, the Whois server might not respond with a real record. To address this, we checked each Whois response we obtained. If it was shorter than 100 characters, it most likely contained a message saying the quota was exhausted; in that case, we didn’t save it, and in the next round we fetched that domain again.
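The check-and-retry loop can be sketched like this; `fetch_whois` and `save` stand in for the project’s actual fetching and storage code, and the 100-character threshold is the one described above:

```python
def looks_rate_limited(whois_text: str, min_length: int = 100) -> bool:
    # Heuristic: genuine Whois records are long, while quota-exhausted
    # messages are short one-liners.
    return len(whois_text.strip()) < min_length


def fetch_round(domains, fetch_whois, save):
    """Fetch each domain's Whois; return the domains to retry next round."""
    retry = []
    for domain in domains:
        text = fetch_whois(domain)
        if looks_rate_limited(text):
            retry.append(domain)  # don't save; fetch again in the next round
        else:
            save(domain, text)
    return retry
```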
Using this method, we obtained an average of 50 thousand Whois records per day, and within 10 days we had 500 thousand records in our database.