Scrapy is a complete web crawling and web scraping framework whose effectiveness can be improved using free proxies. You can easily set up a free proxy with Scrapy, enabling you to extract data from websites without being blocked for sending too many requests. Paid proxies increase the effectiveness even further, as they do not suffer from the issues associated with free proxies. This article discusses how to set up free proxies with Scrapy so you can learn how to scrape data from websites.
What is Scrapy?
Scrapy is a free, open-source framework for writing web crawlers that search, find, index, and extract data from web pages. It’s a framework for web scraping, data extraction, and web crawling rather than a library. Written in Python, Scrapy is a complete package. It has built-in Selectors, which are mechanisms for extracting data from pages using CSS or XPath expressions. Other built-in tools include Spiders, Link Extractors, Requests, Responses, Items, Item Loaders, Item Pipelines, and Feed Exports, to name a few.
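As a quick illustration of Selectors, here is a minimal sketch (the sample HTML is made up for the example):

```python
from scrapy.selector import Selector

html = "<html><body><span>Hello, Scrapy!</span></body></html>"

# Selectors extract data with CSS (or XPath) expressions.
text = Selector(text=html).css("span::text").get()
print(text)  # Hello, Scrapy!
```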
But while Scrapy is a complete package that enables you to crawl and scrape websites, it is still affected by a common issue that plagues web scrapers: blacklisting. Typically, web scraping involves sending many requests to collect the data contained in the responses.
When the requests exceed a given threshold or are deemed to follow bot-like patterns, the web server’s defense mechanisms kick in. One of the most common defenses against bot-like activity is blocking requests from the associated IP address. To ensure smooth and uninterrupted web scraping with Scrapy, it’s advisable to take advantage of the online anonymity offered by proxy servers.
Understanding free proxies
Free proxies are readily available at no cost, which makes them ideal if you intend to keep data extraction costs to a minimum. However, they are neither as reliable nor as fast as paid proxies, for a few reasons. First, they are almost always shared among multiple users: anyone can find them by simply typing ‘free proxy list’ into a search engine. The result is that tens, hundreds, or even thousands of users share the limited bandwidth, leading to slow speeds.

Second, free proxies are unreliable because they do not offer high availability (uptime). It’s not uncommon to find a free proxy with an uptime of less than 50%, with some dipping below 30%. They can also be unstable, as the slow speeds often lead to request timeouts. And given that they are run by unknown operators, they carry an inherent security risk.
These limitations confine free proxies to just a few use cases. For instance, you can use them to test websites and other web tools across different jurisdictions. They are also perfect for learning how to scrape websites, which you can do by integrating a free proxy server with a web scraping framework like Scrapy. This article focuses on this second application of free proxies.
Using free proxies with Scrapy
There are two methods of integrating Scrapy with a free proxy:
- Creating a custom middleware
- Adding a meta parameter to the request
Custom middleware
The custom middleware is a middle layer that routes Scrapy requests through the free proxy. It is a handy approach when you are working with multiple spiders because it ensures that the requests from all of them pass through this layer. First, create the middleware (i.e., write Python code in the middlewares.py file) and then register it in Scrapy’s settings.py, as sketched below.
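Here is a minimal sketch of such a middleware. The class name FreeProxyMiddleware and the proxy address are illustrative; substitute your own project name and a working free proxy.

```python
# middlewares.py
class FreeProxyMiddleware:
    """Routes every outgoing request through a free proxy."""

    PROXY = "http://52.13.248.29:3128"  # example free proxy; replace with your own

    def process_request(self, request, spider):
        # Setting meta["proxy"] tells Scrapy's built-in HttpProxyMiddleware
        # to send this request through the proxy.
        request.meta["proxy"] = self.PROXY
```

Then register it in settings.py so that Scrapy applies it to every spider in the project (project_name is the placeholder project name):

```python
# settings.py
DOWNLOADER_MIDDLEWARES = {
    "project_name.middlewares.FreeProxyMiddleware": 350,
}
```

The priority of 350 makes the middleware run before Scrapy’s built-in HttpProxyMiddleware (priority 750), which reads the meta["proxy"] value.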

Keep in mind that when you create a new Scrapy project using the command scrapy startproject project_name (where project_name is a placeholder for the name of your project), Scrapy creates several configuration and Python files. The Python files include items.py, pipelines.py, middlewares.py, and settings.py.
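For reference, a freshly generated project looks like this:

```
project_name/
    scrapy.cfg            # deploy configuration
    project_name/         # the project's Python module
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py
```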
Creating custom middleware is a more advanced way of setting up free proxies with Scrapy. Fortunately, a more beginner-friendly method exists.

Meta parameter
As stated earlier, Scrapy has several built-in tools, including Requests. A Request object accepts a number of parameters, such as url, meta, method, callback, body, headers, and cookies. The meta parameter method, as its name suggests, uses the meta parameter.
To set up a free proxy, simply pass the proxy’s address through the request’s meta parameter, as shown in the code block below. We have used the free proxy 52.13.248.29:3128, which supports HTTPS.
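Below is a minimal, self-contained spider illustrating the technique. The spider name and target URL are placeholders for illustration; point it at whatever site you are scraping.

```python
import scrapy


class ProxySpider(scrapy.Spider):
    # Hypothetical spider name and start URL; replace with your own.
    name = "proxy_spider"

    def start_requests(self):
        # The http:// scheme describes how to reach the proxy itself;
        # the proxy can still fetch HTTPS target URLs.
        proxy = "http://52.13.248.29:3128"
        yield scrapy.Request(
            url="https://example.com/",
            callback=self.parse,
            meta={"proxy": proxy},  # route this single request through the proxy
        )

    def parse(self, response):
        self.logger.info("Fetched %s through the proxy", response.url)
```

Unlike the middleware approach, meta={"proxy": ...} applies only to the individual request, so you can mix proxied and direct requests within the same spider.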

Conclusion
Free proxies are a great tool for testing web apps in different jurisdictions, and when paired with Scrapy, they let you learn how to extract data from websites. A proxy assigns your requests a different IP address, preventing a web server from blocking your own IP due to a high number of requests. However, free proxies have a few limitations that restrict their utility: not only are they slow and unreliable, but they are also unsafe, given their inherent security vulnerabilities.