How I Scraped Goodreads with Python
Hello everyone! As the title says, in this article I’ll show you how I scraped Goodreads book ratings with Python.
In this project, I use three Python libraries: “requests”, “BeautifulSoup”, and “csv”.
The project is clean, simple, and beginner-friendly.
What data did you scrape?
I scraped a list of 100 books, including each book’s title, author, and rating.
What do we need to start?
First, you need to install Python on your machine.
If you don’t have Python, you can install it from the official Python website: https://www.python.org/downloads/
Done? Great!
Now you need to install these libraries.
To install them, run these commands in your command prompt or terminal:
For BeautifulSoup (bs4):
pip install beautifulsoup4
For requests:
pip install requests
Note: the csv module ships with Python, so you don’t need to install it.
Coding Part
STEP.1
Import the libraries we’re going to use in this project.
Like this:
import requests
from bs4 import BeautifulSoup
import csv
STEP.2
Declare a list
books = []
The “books” list will store all the data we scrape; later, we’ll turn it into a CSV file.
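As a quick aside (a toy example with made-up titles, not part of the scraper), this is the list-of-dictionaries shape we are building toward, where each scraped book becomes one row:

```python
# Toy example: each scraped book is stored as one dictionary in the list.
books = []
books.append({"Book Title": "Example Book", "Book Author": "Jane Doe"})
books.append({"Book Title": "Another Book", "Book Author": "John Roe"})

print(len(books))              # 2 rows collected so far
print(books[0]["Book Title"])  # fields are looked up by key
```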
STEP.3
Declare another variable named “page” that holds the result of an HTTP GET request to the list’s URL:
page = requests.get("https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings").content
Why .content? The .content attribute holds the raw body of the response (the page’s HTML) as bytes, which BeautifulSoup can parse directly.
STEP.4
Now we need to parse the HTML content of the webpage.
Like this:
soup = BeautifulSoup(page,"html.parser")
Once you have the HTML content of the webpage, you need to parse it to extract the data you want. This is where BeautifulSoup comes in.
Why do we need to parse it?
In short, parsing transforms the data from its raw, unstructured form into a more organized, usable structure.
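To see what parsing buys us, here is a minimal sketch on a made-up HTML snippet (the title and snippet are invented for illustration, but the class name matches the one Goodreads uses):

```python
from bs4 import BeautifulSoup

# A tiny stand-in for the raw HTML string that requests returns.
html = '<html><body><a class="bookTitle">The Hobbit</a></body></html>'

soup = BeautifulSoup(html, "html.parser")
# After parsing, we can query by tag name and class
# instead of searching one long string by hand.
title = soup.find("a", class_="bookTitle").text
print(title)  # The Hobbit
```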
STEP.5
Extract table rows using BeautifulSoup
After parsing the HTML content of the webpage using BeautifulSoup, you need to extract the specific data you want. In this case, you want to extract the data from the table rows.
and store them in a variable named “cards”.
Like this:
cards = soup.find("table",class_="tableList js-dataTooltip").find_all("tr")
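On a simplified, made-up table with the same class names, the find(...).find_all("tr") chain behaves like this:

```python
from bs4 import BeautifulSoup

# A made-up two-row table mimicking the Goodreads list structure.
html = """
<table class="tableList js-dataTooltip">
  <tr><td class="number">1</td></tr>
  <tr><td class="number">2</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# find() returns the first matching <table>;
# find_all("tr") then returns a list of its row elements.
rows = soup.find("table", class_="tableList js-dataTooltip").find_all("tr")
print(len(rows))  # 2
```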
STEP.6
After extracting all the rows (“cards”), we need a for loop that visits each card in “cards”.
Like this:
for card in cards:
    book_ranking = card.find("td", class_="number").text.strip()
    book_title = card.find("a", class_="bookTitle").text.strip()
    book_author = card.find("a", class_="authorName").text.strip()
    book_rating = card.find("span", class_="minirating").text.strip()
book_ranking: The ranking of the book on Goodreads.
book_title: The title of the book.
book_author: The author of the book.
book_rating: The rating of the book on Goodreads.
book_ranking = card.find("td",class_="number").text.strip()
This line of code extracts the book’s ranking from the current card. The find() method searches the card’s HTML for a <td> element with the class name "number". The .text attribute (note: an attribute, not a method) then gives the element’s text content, and strip() removes any leading or trailing whitespace from it.
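Here is the same chain on a single made-up cell, so you can see what each part contributes:

```python
from bs4 import BeautifulSoup

# One made-up table cell with stray whitespace around the text.
card = BeautifulSoup('<td class="number">  7  </td>', "html.parser")

element = card.find("td", class_="number")  # locate the element
raw = element.text                          # text content, whitespace and all
clean = element.text.strip()                # whitespace removed
print(repr(raw), repr(clean))  # '  7  ' '7'
```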
STEP.7
Add the extracted data to the “books” list we created.
Like this:
books.append({"Book Ranking": book_ranking,
              "Book Title": book_title,
              "Book Author": book_author,
              "Book Rating": book_rating})
keys = books[0].keys()
keys: The books[0].keys() expression returns a view of the keys of the first dictionary in the “books” list. Here, that view contains ['Book Ranking', 'Book Title', 'Book Author', 'Book Rating'].
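A quick stdlib-only check of what .keys() actually returns (a dict view object, which csv.DictWriter accepts as its field names), using a sample row shaped like the scraper’s output:

```python
# A sample row shaped like the dictionaries the scraper appends.
books = [{"Book Ranking": "1", "Book Title": "Example",
          "Book Author": "Jane Doe", "Book Rating": "4.5"}]

keys = books[0].keys()
print(type(keys).__name__)  # dict_keys (a view, not a plain list)
print(list(keys))           # ['Book Ranking', 'Book Title', 'Book Author', 'Book Rating']
```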
STEP.8
Write data to a CSV file
with open("books.csv", "w", newline="", encoding="UTF-8") as f:
    writer = csv.DictWriter(f, keys)
    writer.writeheader()
    writer.writerows(books)
print("Done!")
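If you want to sanity-check the writing logic without creating a file, the same DictWriter calls work on an in-memory buffer (a sketch with made-up sample data):

```python
import csv
import io

# Made-up sample rows shaped like the scraper's output.
books = [{"Book Title": "Example Book", "Book Author": "Jane Doe"}]
keys = books[0].keys()

buffer = io.StringIO()                 # behaves like an open text file
writer = csv.DictWriter(buffer, keys)
writer.writeheader()                   # first line: the column names
writer.writerows(books)                # one CSV line per dictionary

print(buffer.getvalue())
```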
FULL CODE
import requests
from bs4 import BeautifulSoup
import csv

books = []

page = requests.get("https://www.goodreads.com/list/show/153860.Goodreads_Top_100_Highest_Rated_Books_on_Goodreads_with_at_least_10_000_Ratings").content
soup = BeautifulSoup(page, "html.parser")
cards = soup.find("table", class_="tableList js-dataTooltip").find_all("tr")

for card in cards:
    book_ranking = card.find("td", class_="number").text.strip()
    book_title = card.find("a", class_="bookTitle").text.strip()
    book_author = card.find("a", class_="authorName").text.strip()
    book_rating = card.find("span", class_="minirating").text.strip()
    books.append({"Book Ranking": book_ranking,
                  "Book Title": book_title,
                  "Book Author": book_author,
                  "Book Rating": book_rating})

keys = books[0].keys()

with open("books.csv", "w", newline="", encoding="UTF-8") as f:
    writer = csv.DictWriter(f, keys)
    writer.writeheader()
    writer.writerows(books)

print("Done!")
Article’s Sponsor
If I wanted to start learning to code from the beginning, I would start with Educative.
Why Educative?
Learn Interactively
Our courses include built-in coding playgrounds that let you learn new things without any setup.
Learn Faster
All of our learning products are text-based. You get to learn at your own pace. No pauses.
Personalize Your Learning
Achieve your goals faster with a path designed just for you. Personalized Paths are customized and focused on your individual learning needs and career goals.
Build Real-World Projects
Complete a real-world programming project designed to exercise practical skills used in the workplace. Everything you learn from a Project will help you practice skills that are in demand, useful, and highly relevant.
If you want to start your learning journey with Educative, sign up with this LINK.
I will get a small commission if you sign up with my LINK.
Thanks for reading! If you liked this article, destroy the clap button and subscribe here to receive my articles in your email. Have a great day, friend ❤