How To Scrape A Website

Have you ever scraped a website? No? Well, here is your chance to get started. If you haven’t heard of it before, web scraping is the practice of programmatically extracting data from web pages. Let’s explore the wonderful world of scraping a website using the Ruby gem Nokogiri.

I decided to scrape the javascriptweekly.com site because, in case you didn’t know, JavaScript is one of the most popular languages in use today.

Here is the gist: the site publishes newsletters, and each newsletter has many articles.

Setup:

1. Create a config directory with an environment.rb file. The environment.rb file loads everything we need and gives us a sandbox to test in.

require 'bundler'
Bundler.require
require 'open-uri'
require_all './lib'

2. Create a Gemfile

Gemfile will include:

source 'https://rubygems.org'

gem 'rake'
gem 'pry'
gem 'require_all'
gem 'nokogiri'

3. Create a lib directory and make three files in this directory.

1. article.rb

2. newsletter.rb

3. javascriptweekly_scraper.rb

4. Create a Rakefile to automate the testing task

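The original screenshot of the Rakefile is no longer available, but a minimal version that matches the rake console command used later in this post might look like this (the task name and the use of Pry are assumptions, not the exact original):

```ruby
# Rakefile -- load the environment and drop into a Pry console
require_relative './config/environment'

desc "Open a Pry console with the project loaded"
task :console do
  Pry.start
end
```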

Here is what your project structure should look like after setup:

├── Gemfile
├── Rakefile
├── config
│   └── environment.rb
└── lib
    ├── article.rb
    ├── javascriptweekly_scraper.rb
    └── newsletter.rb

Models:

The Article class has a title and a url. This sets up our first model, which includes getter and setter methods on the Article class.

attr_accessor :title, :url

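The screenshot is gone, but based on the description above, the Article class is likely just a few lines (a sketch, not the exact original):

```ruby
# lib/article.rb
# An Article has a title and a url, exposed through getter and
# setter methods generated by attr_accessor.
class Article
  attr_accessor :title, :url
end
```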

The Newsletter class has articles, an issue number, and an issue date. It includes a getter and setter for :issue_number and :issue_date, and a getter for :articles.

We will also initialize an instance variable @articles to store the articles for our newsletter.

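Here is a sketch of what lib/newsletter.rb might contain, reconstructed from the description and from the console output later in this post (the constructor signature is an assumption):

```ruby
# lib/newsletter.rb
# A Newsletter has an issue number, an issue date, and a list of articles.
class Newsletter
  attr_accessor :issue_number, :issue_date
  attr_reader :articles

  def initialize(issue_number)
    @issue_number = issue_number
    @articles = []   # will hold Article instances as we scrape them
  end
end
```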

Next, let’s create a JavaScript Weekly scraper class. The purpose of this class is to scrape javascriptweekly.com. The objectives are:

      1. Create an instance of the newsletter for a specific issue number.
      2. Scrape the details of that newsletter.
      3. Scrape the individual articles of the newsletter and add them to the newsletter instance.


Hop into the terminal and type rake console to test whether the JavascriptWeeklyScraper scrapes an issue number correctly. The goal is to scrape issue #303 and create an instance of that newsletter.

n_303 = JavascriptWeeklyScraper.new(303)


Success! Issue number 303 is now instantiated.

n_303.newsletter

=> #<Newsletter:0x007fa9a14804e8 @articles=[], @issue_number=303>

The next step is to scrape the specific details we want to add to our newsletter object, starting with the issue date.




Inspect the site to see what we need to focus on to get the data.

The issue number and date are stored in a table with the class “gowide lonmo”.




Let’s test this in our console to see if we scraped it properly.

s = n_303

s.doc.search("table.gowide.lonmo")




Ah, we are getting warmer!

Let’s try adding .text at the end of our search to return just the text.

s.doc.search("table.gowide.lonmo").text


Much better!

Let’s add our new scrape-details method to our JavascriptWeeklyScraper class.

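The screenshot with that method is missing too. Here is a sketch of what it might look like, reopening the scraper class — the method names, the shape of the table text, and the parse_date helper are all assumptions:

```ruby
class JavascriptWeeklyScraper
  # Pull the issue details out of the "gowide lonmo" table and
  # store the issue date on the newsletter.
  def scrape_details
    details = doc.search("table.gowide.lonmo").text
    newsletter.issue_date = parse_date(details)
  end

  # Extract a date like "October 6, 2016" from the table text.
  # The exact format of the scraped text is an assumption.
  def parse_date(text)
    text[/\w+ \d{1,2}, \d{4}/]
  end
end
```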

Resources:

Nokogiri Tutorial

Interested in seeing the source code? Head over to GitHub to check it out.

