Image download with Scrapy

Hello, I am working on a GAN network that generates images in the NES style. But first, I need to prepare a dataset. I found the website http://www.vgmuseum.com/, which contains a lot of NES pictures (~10,000). That's why I created a script to scrape those images. To start, I need to install Scrapy and Pillow.

pip install scrapy
pip install pillow

Once Scrapy is installed, create a project:

scrapy startproject vgmuseum
cd vgmuseum
scrapy genspider nes www.vgmuseum.com

After the command executes, the directory structure will be created. Before writing code, Scrapy needs to be configured to download images. Open settings.py, add an IMAGES_STORE variable with the path to the folder where you would like to store pictures, and enable ImagesPipeline by adding or editing ITEM_PIPELINES.

ITEM_PIPELINES = {"scrapy.pipelines.images.ImagesPipeline": 1}
IMAGES_STORE = "images"

Then open items.py. Here I will create two fields, images and image_urls, where information about the images will be stored. The fields must be named images and image_urls; these names are required by ImagesPipeline, but they can be changed in settings.py.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class VgmuseumItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    images = scrapy.Field()
    image_urls = scrapy.Field()
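As a side note, if you prefer different field names, Scrapy lets you point ImagesPipeline at them via the IMAGES_URLS_FIELD and IMAGES_RESULT_FIELD settings. A minimal sketch for settings.py (the field names below are hypothetical examples, not part of this project):

```python
# settings.py -- optional: tell ImagesPipeline to read/write custom
# field names instead of the default image_urls/images.
# These names are hypothetical examples for illustration.
IMAGES_URLS_FIELD = "nes_image_urls"
IMAGES_RESULT_FIELD = "nes_images"
```

With these settings in place, the item class would declare nes_image_urls and nes_images instead. For this tutorial I will stick with the defaults.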

Finally, I am ready to create the spider. Open the spiders folder and then nes.py. Start with the imports: scrapy, Request from scrapy.http, and the VgmuseumItem class created in items.py.

import scrapy
from scrapy.http import Request

from vgmuseum.items import VgmuseumItem

In the NesSpider class, at the beginning, I need to specify a name, allowed domains, and start URLs.

class NesSpider(scrapy.Spider):
    name = "nes"
    allowed_domains = ["www.vgmuseum.com"]
    start_urls = ["http://www.vgmuseum.com/nes_b.html"]

Then I need to create a parse method. When I execute the spider, it will start from the parse method. The parse method takes two arguments: self and response.

This is what the website looks like. Using XPath syntax, I need to fetch all URLs from this page and then download all images from each URL. Also, there is a "Back to the top" link which needs to be ignored.

def parse(self, response):
    game_urls = response.xpath("//ol/li/a/@href").extract()
    for url in game_urls:
        if url != "#top":
            yield Request(
                response.urljoin(url), callback=self.parse_images
            )

game_urls contains a list of all URLs from the page. Then I loop through the URLs, filter out the #top anchors, and schedule a request for each with parse_images as the callback.
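Under the hood, response.urljoin() resolves relative hrefs against the page URL, the same way the standard urllib.parse.urljoin does. A quick stdlib sketch of what happens to the links from this listing page (the example hrefs are assumed for illustration):

```python
from urllib.parse import urljoin

# The listing page from start_urls.
page_url = "http://www.vgmuseum.com/nes_b.html"

# A mix of hrefs as they might appear on the page
# (these example values are assumed for illustration).
hrefs = ["images/nes/b/batman.html", "#top", "images/nes/b/bubble.html"]

# Skip in-page anchors and resolve the rest against the page URL,
# mirroring the filter + response.urljoin() in parse().
game_urls = [urljoin(page_url, h) for h in hrefs if h != "#top"]
print(game_urls)
```

Relative paths are resolved against the directory of the page, so images/nes/b/batman.html becomes http://www.vgmuseum.com/images/nes/b/batman.html, and the #top anchor never turns into a request.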
This is an example of a page for a game:

Here I need to get the image names and form the URL to each image. Then I append the URLs to a list and assign this list to the image_urls field of the VgmuseumItem class.

def parse_images(self, response):
    item = VgmuseumItem()
    image_urls = []
    image_name = response.xpath("//center/img/@src").extract()
    for name in image_name:
        image_urls.append(f"{response.url.rsplit('/', 1)[0]}/{name}")
    item["image_urls"] = image_urls
    yield item

Create an instance of the VgmuseumItem class, get all image names from the page, form the URLs, and add them to the image_urls list. Assign image_urls to the field from VgmuseumItem and yield it. That's it. Now I am ready to execute the spider. Make sure you are in the vgmuseum folder and run:
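The rsplit trick in parse_images just strips the page's filename and appends the image name. The same result can be obtained with urllib.parse.urljoin, which is what response.urljoin() uses internally. A quick sketch with assumed example values:

```python
from urllib.parse import urljoin

# Example game-page URL and a relative image src (assumed values).
page_url = "http://www.vgmuseum.com/images/nes/b/batman.html"
name = "batman1.png"

# Manual approach used in the spider: drop the trailing filename...
manual = f"{page_url.rsplit('/', 1)[0]}/{name}"

# ...which matches standard relative-URL resolution.
resolved = urljoin(page_url, name)

print(manual == resolved)
```

Both produce http://www.vgmuseum.com/images/nes/b/batman1.png, so parse_images could equally build the list with response.urljoin(name); I kept the explicit rsplit version above to show what is actually being concatenated.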

scrapy crawl nes

The images folder will be created automatically and pictures will start to appear there. Also, it is a good idea to add a DOWNLOAD_DELAY variable in settings.py. This can prevent you from getting banned from the website.
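The throttling is just a couple of extra lines in settings.py. A minimal sketch (the one-second delay is an assumed starting point, not a value recommended by the site):

```python
# settings.py -- be polite to the server.
# 1 second between requests is an assumed starting point; tune it
# for the site you are crawling.
DOWNLOAD_DELAY = 1
# Scrapy can also randomize the wait between 0.5x and 1.5x of
# DOWNLOAD_DELAY, which makes the traffic look less mechanical.
RANDOMIZE_DOWNLOAD_DELAY = True
```
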