Intro to web scraping
Put simply, web scraping is the process of pulling information from a website when no other option, such as an API endpoint or an export feature, is available. It is a very popular way of automating tasks that a user would otherwise perform manually in the browser.
Ruby is an excellent language for writing scraping bots: it offers a friendly, readable syntax and a number of libraries that support performing requests and parsing data. This article introduces web scraping with Ruby, based on my ten years of experience in writing automation scripts. It’s just the beginning, though; there is a lot more to discover later.
Different approaches to website scraping
At the beginning of my journey with scraping, I thought that this process was simple as it consisted only of two steps: making the request and parsing the response. I was right about the steps, but I was wrong about the complexity of the process.
Unless we only want to pull information from a landing page, most websites are not just simple HTML pages. Many of them are built with frameworks and contain dynamic elements that modify the page structure. Many of them are protected by a login and password. There is also a good chance that the content is loaded dynamically in the background, so there is nothing valuable in the page source when you pull it.
To handle all of those cases, I distinguish three approaches to website scraping:
- Perform request directly - if you know what endpoint you can request and what params you can pass, you can send a direct HTTP request and receive the response. It’s possible to achieve this if the endpoint is not secured or you have the credentials to call the endpoint.
- Replicate the user’s behavior - if the content is loaded dynamically or the website requires interaction, such as filling in a login form, you can drive a browser programmatically and perform the same steps a user would.
- Combined approach - there are cases when you have to combine the above approaches. First, you can replicate the user’s behavior to authenticate, and then you can perform direct requests with the credentials you obtained in the previous step.
There is no golden rule here. It all depends on the type of website you want to scrape and its complexity. It’s easy to pull information from a landing page, but it becomes problematic when the website is secured and relies on iframes and AJAX calls.
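As a minimal sketch of the direct-request approach, the snippet below builds a URL from query parameters and wraps the request in a small helper, using only Ruby’s standard library. The endpoint, the parameter names, and the helpers `build_uri` and `fetch` are assumptions made up for illustration; they do not come from any real website.

```ruby
require "net/http"
require "uri"

# Hypothetical endpoint, used only for illustration.
BASE_URL = "https://example.com/api/products"

# Builds the full URL for a direct request from a hash of query params.
def build_uri(base_url, params)
  uri = URI(base_url)
  uri.query = URI.encode_www_form(params)
  uri
end

# Sends a GET request and returns the body, raising on non-2xx responses.
def fetch(uri)
  response = Net::HTTP.get_response(uri)
  raise "Request failed with status #{response.code}" unless response.is_a?(Net::HTTPSuccess)

  response.body
end

uri = build_uri(BASE_URL, { category: "books", page: 2 })
puts uri # https://example.com/api/products?category=books&page=2

# body = fetch(uri) # uncomment to actually send the request
```

Keeping the URL building separate from the network call makes it easy to check that the request looks right before anything is sent.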
Fortunately, Ruby has some great libraries that can help us implement the scraping process.
Libraries to the rescue
As I mentioned before, the web scraping process can be generalized into two steps: requesting and parsing.
Of course, we don’t need any external gems to send requests, but it is easier to use a good wrapper than to handle every case on our own with Ruby’s standard library.
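To give a feel for what handling things on our own means, here is a rough sketch of passing cookies from a login response to a later request with nothing but Net::HTTP. The URL, the cookie values, and the helpers `cookie_header` and `authenticated_request` are hypothetical, and the cookie handling is deliberately naive (it ignores expiry, domain, and path rules).

```ruby
require "net/http"
require "uri"

# Turns the Set-Cookie values of a response into a single Cookie header value.
# Attributes such as Path or HttpOnly are dropped; only name=value pairs are kept.
def cookie_header(set_cookie_values)
  set_cookie_values.map { |value| value.split(";").first.strip }.join("; ")
end

# Builds (but does not send) a GET request that carries the session cookies.
def authenticated_request(url, set_cookie_values)
  request = Net::HTTP::Get.new(URI(url))
  request["Cookie"] = cookie_header(set_cookie_values)
  request
end

# In a real script these values would come from the login response:
#   login_response.get_fields("set-cookie")
set_cookies = [
  "session_id=abc123; Path=/; HttpOnly",
  "locale=en; Path=/"
]

request = authenticated_request("https://example.com/account/orders", set_cookies)
puts request["Cookie"] # session_id=abc123; locale=en
```

This is also the core of the combined approach: authenticate once, keep the credentials, and reuse them in direct requests.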
We can easily handle more complex params, responses, cookies, and session parameters with a well-written library for requests. Here are my favorite Ruby HTTP clients that helped me many times when dealing with web scraping:
- Faraday - https://github.com/lostisland/faraday
- Rest Client - https://github.com/rest-client/rest-client
- HTTParty - https://github.com/jnunemaker/httparty
Each of them deserves its own article about its options and flexibility, so I won’t focus on the implementation details right now. For the purposes of this introduction, though, the list is a good starting point for finding the client that best fits your preferences and use cases.
There are two ways to do the parsing: manually or with a parsing library. When you receive a JSON response, you can parse it manually by writing a few parsing procedures. When you receive HTML or XML, you can either parse it manually with regular expressions or use a library that provides a powerful but easy-to-use parsing interface.
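Both manual options can be sketched with the standard library alone. The JSON body, the HTML fragment, and the helper names `parse_products` and `extract_title` below are invented for illustration, and the regular expression only copes with this very simple markup.

```ruby
require "json"

# Hypothetical JSON body, as an API endpoint might return it.
json_body = '{"products":[{"name":"Ruby book","price":"29.99"},{"name":"Sticker","price":"1.50"}]}'

# Converts the raw JSON into an array of hashes with typed values.
def parse_products(body)
  JSON.parse(body).fetch("products").map do |product|
    { name: product.fetch("name"), price: Float(product.fetch("price")) }
  end
end

# Hypothetical HTML fragment, parsed by hand with a regular expression.
html_body = '<h1 class="title">  Ruby book </h1>'

# Returns the text of the first <h1> element, or nil when there is none.
def extract_title(html)
  match = html.match(%r{<h1[^>]*>(.*?)</h1>}m)
  match && match[1].strip
end

puts parse_products(json_body).first[:name] # Ruby book
puts extract_title(html_body)               # Ruby book
```

The regular expression works here, but it breaks easily once the markup gets nested or irregular, which is where a parsing library pays off.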
The best example of such a library is Nokogiri, which provides a powerful interface for parsing HTML and XML documents. I have been using it for many years, and so far, I haven’t found anything better. Parsing becomes a lot easier with that gem, so it’s a standard element of my approach to scraping websites.
The investigation process
Before you start writing a scraping script, you have to inspect the website you want to scrape to understand its structure and confirm that the information you are looking for is accessible.
The best way to investigate the website’s behavior and structure is to navigate it manually and observe what requests are sent when performing specific actions. For me, the best tool that helps in the investigation is the developer console provided by Google Chrome.
I usually start by watching the types of requests that are sent. For example, in the image above, you can see that we have plenty of information at our disposal: general information, response, request, and query parameters. This information, along with the website source code that you can check in every browser, is usually enough to tell what parameters you should send and what type of request would be valid.
The General and Request Headers information is helpful at the requesting stage, while the Response Headers and the response itself are helpful when building the parsing code. It’s as simple as that.
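Once a request has been observed in the developer console, it can be rebuilt in Ruby. The sketch below assumes a hypothetical POST request; the URL, the headers, the form fields, and the helpers `build_post` and `perform` are placeholders for whatever the console actually shows for your website.

```ruby
require "net/http"
require "uri"

# Values copied by hand from the developer console. Everything here is hypothetical.
observed = {
  url: "https://example.com/search",
  headers: { "Accept" => "application/json", "X-Requested-With" => "XMLHttpRequest" },
  form_data: { "query" => "ruby", "page" => "1" }
}

# Rebuilds the observed request without sending it.
def build_post(observed)
  request = Net::HTTP::Post.new(URI(observed[:url]))
  observed[:headers].each { |name, value| request[name] = value }
  request.set_form_data(observed[:form_data])
  request
end

# Sends a prepared request and returns the response object.
def perform(request)
  uri = request.uri
  Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    http.request(request)
  end
end

request = build_post(observed)
puts request.body # query=ruby&page=1

# response = perform(request) # uncomment to actually send the request
```

Comparing the rebuilt request with the one in the console, header by header, is usually the quickest way to find out why a scripted request gets a different response than the browser.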
The next steps
When you know how to plan the scraping process and what tools to use to make it efficient, you can start building code that will pull information from almost any website.
Of course, this article is just an introduction to the topic, so it stays close to the theory. With that theory in place, we can quickly move on to more practical material, and you will be able to understand the code and extend it yourself.