TDM 20200: Project 02 — 2024
Motivation: Web scraping is the process of taking content off of the internet. Typically this goes hand-in-hand with parsing or processing the data. In general, scraping data from websites has always been a popular topic in The Data Mine. We will use the website of "books.toscrape.com" to practice scraping skills
Context: In the previous project we gently introduced XML and XPath to parse a XML document. In this project, we will introduce web scraping. We will learn some basic web scraping skills using BeautifulSoup.
Scope: Python, web scraping, BeautifulSoup
Readings and Resources
Dr Ward created 9 videos to help with this project. You might want to take a look at these videos here: |
Questions
Question 1 (2 points)
-
Please use BeautifulSoup to get and display the website’s HTML source code https://books.toscrape.com
-
Review the website’s HTML source code. What is the title for that webpage?
You may refer to the following to import libraries, modify the code to fit into yours
|
Question 2 (2 points)
-
Please use the BeautifulSoup library to get and display all categories' names from the homepage of the website.
|
Question 3 (2 points)
-
Now, instead of only getting the names of the categories, get all of the category links from the homepage as well.
-
Review the homepage source code, to explore where are the category links located. You may use
find
to get thediv
tags.
soup.find('div',class_ = 'side_categories')
-
Under a
div
tag, the links can be found in the "a" tag. You may usefind_all
to get all category links.find_all('a')
-
You may refer to the following code, to exclude "books" from the category list, since they are not part of the categories.
-
Assume
link
holds the link information for a category.
link.get('href').startswith("catalogue/category/books/")
-
The format of the output for this question should look like:
Category: Travel,link: catalogue/category/books/travel_2/index.html Category: Mystery,link: catalogue/category/books/mystery_3/index.html ....
-
-
Update the code from question 3a to get (only) the links for books with the category "Romance".
romance_url is https://books.toscrape.com/catalogue/category/books/romance_8/index.html |
Question 4 (2 points)
-
Use the "Romance" link https://books.toscrape.com/catalogue/category/books/romance_8/index.html from Question 3b to get the webpage source code for the Romance category web page.
-
Display all book titles in the first page of the romance category.
Question 5 (2 points)
If you look at this page:
you can see, in the lower-right-hand corner, that the link to the second page is:
Now temporarily forget that you know this fact! We want you to try to find this page-2 link in the Romance book page.
-
Starting with http://books.toscrape.com/catalogue/category/books/romance_8/index.html, please find the page 2 link the from Romance category web page using BeautifulSoup.
The following is some sample code, for your reference.
# need to remove last part from basic url url= "http://books.toscrape.com/catalogue/category/books/romance_8/index.html" url=url.rsplit('/',1)[0] # Assume you get next hyperlink ""category/page-2.html" as the next page, you need to only keep the last part next_link = next.split("/")[-1] #Combine next_url=url+"/"+next_link
-
List the titles of all of the books from the second page of the "Romance" category.
Project 02 Assignment Checklist
-
Jupyter Lab notebook with your code, comments and output for the assignment
-
firstname-lastname-project02.ipynb
-
-
Python file with code and comments for the assignment
-
firstname-lastname-project02.py
-
-
Submit files through Gradescope
Please make sure to double check that your submission is complete, and contains all of your code and output before submitting. If you are on a spotty internet connection, it is recommended to download your submission after submitting it to make sure what you think you submitted, was what you actually submitted. In addition, please review our submission guidelines before submitting your project. |