How I Scraped Goodreads with Python

Abdelmalek Merouan
4 min read · Nov 26, 2023


Hello everyone! As the title “How I Scraped Goodreads with Python” suggests, in this article I will show you how I scraped Goodreads book ratings with Python.

In this project, I use three Python libraries: requests, BeautifulSoup, and csv.

The project is clean, simple, and beginner-friendly.

Photo by Fotis Fotopoulos on Unsplash

What data did you scrape?

I scraped a list of 100 books, including each book's title, author, and rating.

What do we need to start?

First, you need to install Python on your machine.

If you don't have Python, you can install it from the official Python website: https://www.python.org/downloads/

Done? Great!

Now you need to install these libraries.

To install them, run the following commands in your command prompt or terminal:

For BeautifulSoup (bs4):

pip install bs4

For requests:

pip install requests

Note: the csv library comes with Python, so you don't need to install it.

Coding Part

STEP.1

Import the libraries we're going to use in this project.

Like this:

import requests
from bs4 import BeautifulSoup
import csv

STEP.2

Declare a list

books = []

The “books” list will store all the data we scrape; later, we will turn this list into a CSV file.
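To give a picture of where this is heading, each entry in “books” will end up as a dictionary holding one book's ranking, title, author, and rating. The values below are made up, just to show the shape of the data:

books_example = [
    {"Book Ranking": "1",
     "Book Title": "Some Title",
     "Book Author": "Some Author",
     "Book Rating": "4.60 avg rating"},
]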

STEP.3

Declare another variable named “page” and send an HTTP request to the URL:

page = requests.get("https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings").content

Why .content? The .content attribute returns the raw bytes of the response, which here is the HTML of the webpage. (response.text would give the same HTML as a decoded string; BeautifulSoup accepts either.)
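If you want to see this yourself, here is a small sketch using the same URL; the printed types are what the requests library documents for these attributes:

import requests

url = "https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings"
response = requests.get(url)

print(response.status_code)    # 200 means the request succeeded
print(type(response.content))  # <class 'bytes'> - the raw HTML bytes
print(type(response.text))     # <class 'str'> - the same HTML, decoded to a string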

STEP.4

Now we need to parse the HTML content of the webpage.

Like this:

soup = BeautifulSoup(page,"html.parser")

Once you have the HTML content of the webpage, you need to parse it to extract the data you need. This is where BeautifulSoup comes in.

Why do we need to parse it?

In short, parsing transforms the data from its raw, unstructured form into an organized structure that we can search and extract from.
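As a quick illustration of what parsing gives you, here is a tiny, made-up HTML fragment (not taken from the Goodreads page) parsed with BeautifulSoup:

from bs4 import BeautifulSoup

html = '<tr><td class="number">1</td><a class="bookTitle">Example Book</a></tr>'
demo = BeautifulSoup(html, "html.parser")

print(demo.find("td", class_="number").text)    # 1
print(demo.find("a", class_="bookTitle").text)  # Example Book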

STEP.5

Extract table rows using BeautifulSoup

After parsing the HTML content of the webpage with BeautifulSoup, you need to extract the specific data you want. In this case, that means the table rows, which we store in a variable named “cards”.

Like this:

cards = soup.find("table",class_="tableList js-dataTooltip").find_all("tr")
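Before moving on, you can sanity-check that the selector actually found the table. The exact count depends on the page's current markup, but for this list I would expect roughly 100 rows:

print(len(cards))  # roughly 100 if the table was found and parsed correctly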

STEP.6

After extracting all the rows (cards), we need a for loop that visits each card in “cards”.

Like this:

for card in cards:
    book_ranking = card.find("td",class_="number").text.strip()
    book_title = card.find("a",class_="bookTitle").text.strip()
    book_author = card.find("a",class_="authorName").text.strip()
    book_rating = card.find("span",class_="minirating").text.strip()
  • book_ranking: The ranking of the book on Goodreads.
  • book_title: The title of the book.
  • book_author: The author of the book.
  • book_rating: The rating of the book on Goodreads.
book_ranking = card.find("td",class_="number").text.strip()

This line extracts the book's ranking from the current card. The find() method searches the card's HTML for a <td> element with the class "number". The .text attribute then gives the element's text content, and strip() removes any leading or trailing whitespace from it.
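To make the chain concrete, here is a small standalone example with a made-up fragment showing each step of find() -> .text -> strip():

from bs4 import BeautifulSoup

cell = BeautifulSoup('<td class="number">  1  </td>', "html.parser")
tag = cell.find("td", class_="number")

print(tag)               # <td class="number">  1  </td> - a Tag object
print(tag.text)          # '  1  ' - the text still has surrounding whitespace
print(tag.text.strip())  # '1' - whitespace removed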

STEP.7

Add the extracted data to the “books” list we created earlier. This append stays inside the for loop from STEP.6, so every book gets added:

Like that:

    books.append({"Book Ranking":book_ranking,
                  "Book Title":book_title,
                  "Book Author":book_author,
                  "Book Rating":book_rating})

keys = books[0].keys()

keys: The books[0].keys() expression returns the keys of the first dictionary in the “books” list. In this case, that will be ['Book Ranking', 'Book Title', 'Book Author', 'Book Rating'].
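One small caution of my own (not part of the original walkthrough): books[0] assumes that something was actually scraped. If Goodreads changes the page markup and the list ends up empty, that line raises an IndexError, so a quick guard can save some confusion:

if not books:
    raise SystemExit("No books were scraped - check the URL and the CSS class names.")

keys = books[0].keys()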

STEP.8

Write data to a CSV file

with open("books.csv","w",newline="",encoding="UTF-8") as f:
    writer = csv.DictWriter(f,keys)
    writer.writeheader()
    writer.writerows(books)

print("Done!")
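If you want to double-check the output, a quick way (my addition, not part of the original walkthrough) is to read books.csv back with csv.DictReader:

import csv

with open("books.csv", encoding="UTF-8") as f:
    rows = list(csv.DictReader(f))

print(len(rows))  # should match the number of scraped books
print(rows[0])    # the first book as a dictionary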

FULL CODE

import requests
from bs4 import BeautifulSoup
import csv

books = []
page = requests.get("https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings").content

soup = BeautifulSoup(page,"html.parser")

cards = soup.find("table",class_="tableList js-dataTooltip").find_all("tr")

for card in cards:
    book_ranking = card.find("td",class_="number").text.strip()
    book_title = card.find("a",class_="bookTitle").text.strip()
    book_author = card.find("a",class_="authorName").text.strip()
    book_rating = card.find("span",class_="minirating").text.strip()

    books.append({"Book Ranking":book_ranking,
                  "Book Title":book_title,
                  "Book Author":book_author,
                  "Book Rating":book_rating})

keys = books[0].keys()

with open("books.csv","w",newline="",encoding="UTF-8") as f:
    writer = csv.DictWriter(f,keys)
    writer.writeheader()
    writer.writerows(books)

print("Done!")
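The script above is the approach described in this article, and it worked for this list at the time of writing. If you hit problems running it, two optional tweaks I would suggest (my own hedged additions, not part of the original code) are sending a browser-like User-Agent header, since some sites respond differently to the default requests header, and skipping rows where an expected element is missing so .text is never called on None:

import requests
from bs4 import BeautifulSoup
import csv

url = "https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings"
headers = {"User-Agent": "Mozilla/5.0"}  # optional: present a browser-like client

page = requests.get(url, headers=headers).content
soup = BeautifulSoup(page, "html.parser")

books = []
table = soup.find("table", class_="tableList js-dataTooltip")
rows = table.find_all("tr") if table else []

for card in rows:
    ranking = card.find("td", class_="number")
    title = card.find("a", class_="bookTitle")
    author = card.find("a", class_="authorName")
    rating = card.find("span", class_="minirating")
    if not (ranking and title and author and rating):
        continue  # skip rows that are not complete book entries
    books.append({"Book Ranking": ranking.text.strip(),
                  "Book Title": title.text.strip(),
                  "Book Author": author.text.strip(),
                  "Book Rating": rating.text.strip()})

if books:
    with open("books.csv", "w", newline="", encoding="UTF-8") as f:
        writer = csv.DictWriter(f, books[0].keys())
        writer.writeheader()
        writer.writerows(books)

print(f"Done! Scraped {len(books)} books.")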

Article’s Sponsor

If I were starting to learn coding from the beginning, I would start with Educative.

Why Educative?

Learn Interactively

Our courses include built-in coding playgrounds that let you learn new things without any setup.

Learn Faster

All of our learning products are text-based. You get to learn at your own pace. No pauses.

Personalize Your Learning

Achieve your goals faster with a path designed just for you. Personalized Paths are customized and focused on your individual learning needs and career goals.

Build Real-World Projects

Complete a real-world programming project designed to exercise practical skills used in the workplace. Everything you learn from a Project will help you practice skills that are in demand, useful, and highly relevant.

If you want to start your learning journey with Educative, sign up with this LINK.

I will get a small commission if you sign up with my LINK.

Thanks for reading! If you liked this article, destroy the clap button and subscribe here to receive my articles in your email. Have a great day, friend ❤
