Amazon Web Scraping Using Python | Data Analyst Portfolio Project

Summary notes created by Deciphr AI

https://www.youtube.com/watch?v=HiOtQMcI5wg&t=5s
Abstract

In this video, the host demonstrates a data analyst portfolio project focused on web scraping Amazon using Python. The project, aimed at intermediate Python users, involves using libraries such as BeautifulSoup and Requests to extract data from Amazon product pages. The host explains how to clean and format the scraped data, create a CSV file, and automate the data collection process using Python's time module. Additionally, the host touches on the potential to expand the project by tracking price changes over time and sending email alerts for significant price drops. The project serves as an introduction to web scraping and data automation.

Summary Notes

Introduction to Data Analyst Portfolio Project

  • The project involves scraping data from Amazon using Python.
  • Web scraping is not a mandatory skill for data analysts but is a useful and interesting skill to learn.
  • The project is classified as intermediate; beginners in Python might find it challenging but can still learn by following along.

"Do I need to know web scripting to become a data analyst? The answer is no, you absolutely don't need to know it, but it is a very cool skill to learn."

  • Web scraping can be used to create custom datasets and has various other applications.

Project Setup and Tools

  • The project will be conducted using Anaconda and Jupyter Notebooks.
  • Instructions to download and set up Anaconda and Jupyter Notebooks are provided.
  • The project will involve scraping data from Amazon as voted by the audience.

"If you didn't watch the last project, I had people download Anaconda. We use Jupyter Notebooks, and I'll show you how to get to that in just a second."

  • The project will focus on scraping data from a specific Amazon item page.
  • A more advanced follow-up project will involve scraping multiple items across different pages.

Initial Setup and Importing Libraries

  • Create a new Python 3 notebook and name it "Amazon Web Scraper Project."
  • Import the necessary libraries: BeautifulSoup (from bs4), requests, time, datetime, and optionally smtplib for sending emails; a minimal import sketch appears at the end of this section.

"The first thing that we need to do or that we should do is upload or import our libraries."

  • The code for the project is available on a GitHub page, and it is recommended to write the code yourself for better learning.

"I do recommend writing it all yourself because you will learn it much better, I promise."

Connecting to the Website

  • Define the URL of the Amazon page to be scraped.
  • Set up headers for the HTTP request using a user-agent string, which can be obtained from a specified link.

"What we are going to need is something called headers. Now again, you will never ever ever need to know this."

  • Use the Requests library to send an HTTP GET request to the specified URL with the headers.

"We are now connecting using our computer using this URL, and then what we want to write is you want to write page."

Using Beautiful Soup to Parse HTML

  • Beautiful Soup library will be used to parse the HTML content of the page.
  • The initial HTML content retrieved from the page will be quite "dirty" and needs to be cleaned and processed.

"We are actually going to start using the Beautiful Soup library."

  • The project will involve printing out intermediate results to understand the structure of the HTML and how to extract useful data.

This concludes the study notes for the initial part of the data analyst portfolio project focused on web scraping. The notes cover the introduction, setup, and initial steps for connecting to the website and setting up the environment. Further details on parsing and extracting data using Beautiful Soup will be covered in subsequent sections of the project.

Introduction to Beautiful Soup

  • Initial Setup:
    • Importing Beautiful Soup to parse HTML.
    • Basic syntax: soup = BeautifulSoup(page.content, 'html.parser').

"You guessed it, you're going to say Beautiful Soup and then in parentheses we're going to do page.content."

  • Purpose:
    • Extracting HTML content from a webpage.
    • Using html.parser to parse the HTML content.

"We're just pulling in the content from the page, that's really all we're doing right now, and it comes in as HTML."

Inspecting HTML Elements

  • Inspecting a Webpage:
    • Using browser tools (right-click and inspect or Ctrl+Shift+I) to view HTML structure.
    • Highlighting elements to understand their HTML tags and attributes.

"If you come here this is a static page basically written in HTML...I did right-click and inspect or Ctrl+Shift+I whichever one works better for you."

  • Selecting Elements:
    • Using the 'Select Element' tool to choose specific parts of the HTML, such as headers or titles.

"Let's say we want this title, what I can do is I can click select element, go right here and then we can select like a type the header or the title of the page."

Parsing and Formatting HTML with Beautiful Soup

  • Creating Soup Objects:
    • soup1 and soup2 as different stages of parsing.
    • Using BeautifulSoup to re-parse existing soup objects for better formatting.

"Let's do soup two, we're just gonna do a very, you know, an upgrade to soup one basically."

  • Using Prettify:
    • The prettify method for better HTML formatting.

"We'll do beautiful soup again and then we're going to do soup one...dot prettify...it just makes things look better."

Extracting Specific Data

  • Finding Elements by ID:
    • Using soup.find to locate elements with specific IDs (e.g., productTitle).

"Let's say title, that's what we're going to be getting, and we're going to do soup 2...find...we want to find that id where it's equal to product title."

  • Extracting Text Content:
    • Using .get_text() to extract the text from the found elements.

"We're going to do .get_text and then we'll do open parentheses so now let's print the title and see what we get."

Handling Multiple Data Points

  • Extracting Additional Data:
    • Example of extracting price data using a similar method as for the title.

"We don't only want the title, we are also going to be pulling in the price...id equals price block underscore our price."

  • Printing Extracted Data:
    • Verifying the extracted data by printing it.

"Let's print the title and print the price now let's see what we get."

Cleaning Extracted Data

  • Stripping Unwanted Characters:
    • Using .strip() to clean up whitespace and unwanted characters from the data.

"What we want to do is let's start with the price...price.strip and that's just going to take basically the junk off of either side."

  • Removing Specific Characters:
    • Removing the dollar sign from the price to keep only the numeric value.

"I don't want that dollar sign, I just want the numeric value."

Preparing Data for CSV

  • Creating Headers and Data Lists:
    • Defining headers and data for CSV export.

"In a CSV what you want is you want headers and then you want the data...we're gonna do a bracket and let's make the first one a title."

  • Converting Strings to Lists:
    • Ensuring the data is in a list format for CSV compatibility.

"These are strings and that's important to know...this is a string...what we're going to do is make this a list."

Automating Data Extraction

  • Future Steps:
    • Plan to automate the data extraction process over time and append to the CSV.

"What I'm going to show you is basically doing it over time and just having it automated in the background."

  • CSV Creation and Appending:
    • Creating CSV files, inserting data, and appending new data.

"We need to create the CSV, insert it into the CSV, and then create a process to append more data into that CSV."

These notes provide a detailed and structured overview of the key points discussed in the transcript, ensuring a comprehensive understanding of using Beautiful Soup for web scraping and data extraction.

Data Types and Their Importance

  • Understanding the type of data (list, array, dictionary) is crucial for data manipulation.
  • Different data types impact how you handle and process the data.

"It's really important to remember what's what type, um how do I say this how your data is, is it a list, is it an array, is it a dictionary, um you know what is it these things are important they do play a big impact especially with this type of stuff."

  • Knowing the data type helps avoid issues during data processing and ensures proper handling.

Creating a CSV File

  • Steps to create a CSV file using Python:
    • Open a file with write mode ('w') and specify encoding ('utf-8').
    • Use csv.writer to write data into the file.
    • Write the header row and data rows.

"So what we're going to do is we're going to say with and we're going to say open and now we're going to name our file you can name this whatever you want I'm going to call it amazon web scraper data set that's real long dot csv and we're gonna do underscore w and that means write."

  • The open function with 'w' mode creates or overwrites the file.
  • The csv.writer function is used to write rows into the CSV file.

Writing Data to CSV

  • Write the header row first, followed by the data rows.
  • Prevent blank lines between rows by passing newline='' when opening the file.

"We're going to do writer is dot sorry dot write row and this is just for the initial um the initial import or or um not import the initial insertion of the data into this csv."

  • The writerow method is used to insert the header and data rows.
  • Proper formatting of data rows is crucial for accurate data representation.
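
Putting the two sections above together, a sketch of the initial write; the filename is an assumption:

  # Create the CSV (write mode overwrites any existing file) and insert the
  # header row plus the first data row. newline='' prevents blank lines
  # between rows on Windows.
  with open('AmazonWebScraperDataset.csv', 'w', newline='', encoding='UTF8') as f:
      writer = csv.writer(f)
      writer.writerow(header)
      writer.writerow(data)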

Handling Errors and Debugging

  • Addressing issues when data appears incorrect or incomplete.
  • Re-running processes to ensure data integrity.

"Oh geez this isn't good can't verify my um my subscription uh why does it say 6.99 I'm gonna go back and look but I think I know the issue."

  • Debugging involves identifying and correcting errors in data or code.
  • Re-running scripts can help verify corrections and ensure proper data output.

Adding Timestamps

  • Importance of adding timestamps to data for tracking and historical reference.
  • Using the datetime module to capture the current date.

"What you can do is you can do date let me get date time and you do dot date dot today open parenthesis and that is going to give us this right here and so we're just going to do today that's what we'll call it is equal to this and we'll say print today."

  • Timestamps provide context and help in tracking data changes over time.
  • The datetime module is used to fetch the current date and add it to the data.
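
A sketch of the timestamp step, extending the header and data lists from the earlier sketch:

  # Capture today's date so each row records when the price was observed.
  today = datetime.date.today()
  print(today)

  header = ['Title', 'Price', 'Date']
  data = [title, price, today]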

Using Pandas for Data Verification

  • Using the pandas library to read and verify CSV data.
  • Simplifies the process of checking data without manually opening the file.

"What we can do just to check the data without having to open up the data every single time which is super annoying because we're going to use pandas again I should have imported this at the top."

  • pandas.read_csv function reads CSV files into DataFrames for easy manipulation and verification.
  • Automates the process of checking data integrity and structure.
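
A sketch of the pandas check, reusing the filename assumed in the CSV sketch above:

  # Read the CSV back into a DataFrame to spot-check its contents without
  # opening the file by hand.
  df = pd.read_csv('AmazonWebScraperDataset.csv')
  print(df)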

Appending Data to CSV

  • Changing file mode to 'a+' to append data instead of overwriting.
  • Ensuring new data is added to the next available row.

"We are ignoring the data and we're now going to the next nearest free row in appending data which means to add on data to that and so if I run this which I'm not going to right now I mean why not I can I can run it."

  • Appending mode ensures existing data is preserved while new data is added.
  • Important for continuous data collection and tracking over time.
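
A sketch of the append step; on these later runs only the data row is written:

  # 'a+' appends new rows after the existing ones instead of overwriting.
  with open('AmazonWebScraperDataset.csv', 'a+', newline='', encoding='UTF8') as f:
      writer = csv.writer(f)
      writer.writerow(data)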

Automating Data Collection

  • Automating the data collection process to run without manual intervention.
  • Using functions to encapsulate the data collection logic.

"We want a way where it does it while we sleep it does it in the background of our laptop um and is easy to do right I don't want to come in here every single morning with an alarm on my phone every single morning come in here I want to automate this."

  • Automation reduces manual effort and ensures consistent data collection.
  • Functions help in organizing code and making it reusable for automation.

Encapsulating Logic in Functions

  • Using functions to encapsulate the data scraping and saving logic.
  • Facilitates automation and improves code readability.

"So now what we're going to do is we're going to put this all into uh this check underscore price now you may never have used oh geez what are these things called oh my gosh super used all the time you'll know what I what it is not a function I don't even remember what it's called maybe there's a function."

  • Functions encapsulate specific tasks, making the code modular and easier to manage.
  • Essential for automating repetitive tasks and improving code maintainability.
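
A sketch of the check_price function described above, folding in the earlier steps; the URL, headers, element ids, and filename are the assumptions carried over from the previous sketches:

  def check_price():
      # One full pass: request the page, parse it, clean the values,
      # timestamp the row, and append it to the CSV.
      page = requests.get(URL, headers=headers)
      soup = BeautifulSoup(page.content, 'html.parser')

      title = soup.find(id='productTitle').get_text().strip()
      price = soup.find(id='priceblock_ourprice').get_text().strip()[1:]
      today = datetime.date.today()

      data = [title, price, today]
      with open('AmazonWebScraperDataset.csv', 'a+', newline='', encoding='UTF8') as f:
          writer = csv.writer(f)
          writer.writerow(data)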

Summary

  • Understanding data types is crucial for proper data handling.
  • Creating and writing to CSV files involves specific steps and methods.
  • Debugging and re-running scripts are essential for ensuring data integrity.
  • Adding timestamps provides context and historical tracking.
  • Using pandas simplifies data verification.
  • Appending data to CSV files is important for continuous data collection.
  • Automating data collection reduces manual effort and ensures consistency.
  • Encapsulating logic in functions improves code organization and reusability.

Automating Data Collection Using Python

  • Objective: Automate the process of checking and recording the price of an item at regular intervals using Python.
  • Tools Used: Python, Time Library, Jupyter Notebooks, Visual Studio Code.
  • Key Steps:
    • Writing a function to check the price.
    • Using a timer to run the function at specified intervals.
    • Storing the collected data in a CSV file.
    • Running the script continuously in the background.

"So now we have our header and our data, and then we want to pull this in right here... Everything that we just wrote out, we are now putting into this check price."

  • Explanation: Setting up the initial structure for the data and the function to check the price.

"This is how we are going to do that... We had something called time, this library time right here, that's what we're going to use right now."

  • Explanation: Introduction to using the time library to automate the function execution.

"So we're going to say while true... every 5 seconds it is going to run through this entire process."

  • Explanation: Using a while loop and time.sleep to repeatedly execute the price check function.
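
A sketch of the timer loop; five seconds matches the demo, while a once-a-day scrape would use time.sleep(86400):

  # Run check_price on a fixed interval until the process is stopped.
  while True:
      check_price()
      time.sleep(5)   # seconds between runs; 86400 for once a day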

"I guess I ran for 20 seconds... we can put this as long or as short as you want."

  • Explanation: Demonstrating the flexibility in setting the interval for the price check.

"This is the entire point of this project... we want to create our own data set."

  • Explanation: Emphasizing the goal of the project is to collect data over time.

"You can do this on any item you could ever imagine on Amazon... the code itself will be nice to put in a project."

  • Explanation: Highlighting the wide applicability of the script for different items and the utility of the code.

Running the Script in the Background

  • Execution: Running the script in the background to continuously collect data.
  • Tools: Visual Studio Code, Jupyter Notebooks.
  • Considerations:
    • Preference for Visual Studio Code for automation tasks.
    • Jupyter Notebooks used for demonstration purposes.

"You can run it every second if you want... you can do some type of time series with."

  • Explanation: Flexibility in execution intervals and potential for time series analysis with the collected data.

"I personally when I did this... I did something similar and I put this in Visual Studio Code."

  • Explanation: Personal preference for using Visual Studio Code for such tasks.

"If you restart your computer just come back in here and restart running this process."

  • Explanation: Instructions on how to resume the process after a system restart.

Enhancing the Project with Email Notifications

  • Objective: Extending the project to include email notifications for price drops.
  • Tools Used: Python, the smtplib library for sending emails.
  • Key Steps:
    • Writing a script to send an email when the price drops below a certain threshold.
    • Connecting to an email server and configuring the message.

"If the price is lower than... it would then send an email."

  • Explanation: Setting up a condition to trigger email notifications for price drops.

"We're sending a mail, we're connecting to a server, we're using Gmail, we're logging into our account."

  • Explanation: Steps involved in configuring the email notification system.
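
A sketch of the email step; the addresses, password, and price threshold are placeholders, and a Gmail login via smtplib typically requires an app password:

  def send_mail():
      # Connect to Gmail over SSL, log in, and send a simple alert message.
      server = smtplib.SMTP_SSL('smtp.gmail.com', 465)
      server.ehlo()
      server.login('youremail@gmail.com', 'your-app-password')

      subject = 'The item you are watching dropped below your target price!'
      body = 'Time to check the Amazon link and grab it.'
      msg = f'Subject: {subject}\n\n{body}'

      server.sendmail('youremail@gmail.com', 'youremail@gmail.com', msg)

  # Inside check_price, a condition like this would trigger the alert
  # (the 15-dollar threshold is purely illustrative):
  # if float(price) < 15:
  #     send_mail()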

"I have used this and I used it and was able to buy a watch that was like... on a Black Friday sale."

  • Explanation: Personal anecdote demonstrating the practical utility of the email notification feature.

Conclusion and Future Projects

  • Summary: The project provides a basic yet powerful introduction to web scraping and automation using Python.
  • Future Plans: More complex web scraping projects to be introduced in subsequent tutorials.
  • Encouragement: Viewers are encouraged to experiment and modify the code for their own use cases.

"I hope that this was instructional... I hope that this is useful."

  • Explanation: Reinforcing the educational objective of the tutorial.

"In this next one, it gets quite a bit more difficult... just much more technical or coding heavy."

  • Explanation: Teasing more advanced projects to come, indicating a progression in difficulty.

"Thank you so much for watching... I really appreciate it."

  • Explanation: Closing remarks expressing gratitude and encouraging viewer engagement.
