Have you ever scraped a website? No? Well, here is your chance to get started. If you have never heard of web scraping before, this post will walk you through the basics. Let's explore the wonderful world of scraping a website using the Ruby gem Nokogiri.
Here is the gist. The site has newsletters and each newsletter has many articles.
1. Create a config directory with an environment.rb file. This file will load our dependencies and give us a sandbox to test in.
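A minimal environment.rb might look like the following. The exact require list is an assumption based on the files created later in this walkthrough:

```ruby
# config/environment.rb -- a sketch; file names under lib/ are guesses
require "nokogiri"
require "open-uri"
require "pry"

require_relative "../lib/article"
require_relative "../lib/newsletter"
require_relative "../lib/scraper"
```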
2. Create a Gemfile. The Gemfile will include the gems the project needs:
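The original gem list isn't shown here, but a minimal Gemfile for a Nokogiri scraping project might look like this (the gems beyond nokogiri are assumptions):

```ruby
# Gemfile (a guess at a minimal setup; pry and rspec are assumptions)
source "https://rubygems.org"

gem "nokogiri" # HTML parsing
gem "pry"      # interactive console for poking at scraped data
gem "rspec"    # testing
```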
3. Create a lib directory and add three files to it.
4. Create a Rakefile to automate project tasks.
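One common way to set this up (a sketch, not necessarily the author's actual Rakefile) is a console task that loads the environment:

```ruby
# Rakefile (sketch; assumes config/environment.rb and the pry gem)
require_relative "./config/environment"

desc "Open a Pry console with the environment loaded"
task :console do
  Pry.start
end
```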
Here is what your project directory should look like after setup.
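The lib file names below are assumptions based on the classes discussed in this post:

```
├── Gemfile
├── Rakefile
├── config
│   └── environment.rb
└── lib
    ├── article.rb
    ├── newsletter.rb
    └── scraper.rb
```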
The Article class has a title and a url. This sets up our first model, with getter and setter methods on the Article class:
attr_accessor :title, :url
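Fleshed out, the model might look like this (the initializer signature is an assumption; the post only shows the accessor line):

```ruby
# lib/article.rb -- a minimal sketch of the Article model
class Article
  attr_accessor :title, :url

  # Allow building an article with or without details up front (assumption).
  def initialize(title = nil, url = nil)
    @title = title
    @url   = url
  end
end
```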
The Newsletter class has articles, an issue number, and an issue date. It includes a getter and setter for :issue_number and :issue_date, and a getter for :articles. We also initialize an instance variable @articles to store the newsletter's articles.
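Put together, the model described above might look like this sketch (the initializer signature is an assumption):

```ruby
# lib/newsletter.rb -- a sketch of the Newsletter model
class Newsletter
  attr_accessor :issue_number, :issue_date
  attr_reader   :articles

  # Assumption: a newsletter is created from its issue number, and
  # @articles starts as an empty array for the scraper to fill in.
  def initialize(issue_number = nil)
    @issue_number = issue_number
    @articles     = []
  end
end
```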
- Create an instance of the newsletter for a specific issue number.
- Scrape the details of that newsletter.
- Scrape the individual articles of the newsletter and add them to the newsletter instance.
Hop into the terminal and type:
Success! Issue number 303 is now instantiated.
=> #<Newsletter:0x007fa9a14804e8 @articles=[], @issue_number=303>
The next step is to scrape the specific details we want to add to our newsletter object. First, we need to add the issue date.
Inspect the site to see which elements hold the data we want.
The issue number and date are stored in a table with the class “gowide lonmo”.
Let’s test this in our console to see if we scraped it properly.
s = n_303
Ah, we are getting warmer!
Let’s try adding .text to the end of our search to return just the text.
Interested in seeing the source code? Head over to GitHub to check it out.