Help for different types of layouts #6
-
Hi, I really appreciate your project! I've been testing it on my Telegram channel for a few days and everything is working perfectly. However, so far I have only used the home Offers page (on the Italian store), never managing to do the same with other, more specific pages. In detail, I am working for now on this page: My idea was to search within a given category, which is impossible for me since the Bot, as I understand it, searches the menus on the left, such as "50% discount", which is a selectable option to filter offers. So, if I wanted to scrape in this category, what should I change?
Replies: 8 comments
-
Hi! Thanks for your interest in our project!

Amazon pages have many different layouts, and products are represented in different ways in the HTML source code. In the link you provided, https://amzn.to/3J5cGBT, the products are contained in `<div>` elements that have, among their classes, the class `d4xojt-0`. For this reason, it's necessary to find all such elements and then look for the product URLs in the `<a>` elements inside them.

Your example also made us notice that, for dynamically loaded pages, not all products are present from the beginning, so it's necessary to scroll the page to make new products load. Here's the modified function `get_all_deals_ids()`:

```python
def get_all_deals_ids():
    deals_page = "https://amzn.to/3J5cGBT"
    selenium_driver = start_selenium()
    print("Starting taking all urls")
    try:
        selenium_driver.get(deals_page)
        # products are dynamically loaded, so all of them must be loaded in order not to lose any;
        # scrolling the page a few times makes sure that all products have been loaded
        deals_urls = []
        for _ in range(3):
            selenium_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)  # 3 seconds is a good compromise; it may be necessary to increase the delay on a slow connection
            # products are contained in a <div> with class 'd4xojt-0';
            # inside each <div> there is an <a> element with the link of a product; there are no submenus.
            # it's fine if some urls are saved multiple times because, at the end, the ids are deduplicated as a set
            deals_urls += [div.find_elements(By.TAG_NAME, "a")[0].get_attribute("href") for div in
                           selenium_driver.find_elements(By.CSS_SELECTOR, "div[class*='d4xojt-0']")]
        # keep only the product ids because they can easily be used to rebuild the link for a product
        product_ids = [extract_product_id(url) for url in deals_urls if
                       extract_product_id(url) is not None and extract_product_id(url) != '']
        selenium_driver.quit()  # close everything that was created; better not to keep the driver open for long
        return [*set(product_ids)]  # remove duplicates
    except Exception as e:
        print(e)
        selenium_driver.quit()  # close everything that was created; better not to keep the driver open for long
        return []  # error, no ids taken
```
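The helper `extract_product_id` used above is part of the project's source code. As a rough illustration of what such a helper does (the repository's actual implementation may differ), a minimal sketch that pulls the ASIN, the 10-character Amazon product id, out of a URL with a regular expression could look like this:

```python
import re

def extract_product_id(url):
    """Return the ASIN contained in an Amazon product URL, or None.

    Assumption (for this sketch): the id appears after a '/dp/' or
    '/gp/product/' path segment, e.g. https://www.amazon.it/dp/B0ABCD1234/ref=...
    """
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_product_id("https://www.amazon.it/dp/B0ABCD1234/ref=deal"))  # → B0ABCD1234
print(extract_product_id("https://www.amazon.it/"))                        # → None
```

Once you have the id, the product link can be rebuilt as `https://www.amazon.it/dp/<id>`, which is why ids alone are enough to deduplicate and store the deals.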
-
Good evening! First of all, thank you for also writing the new code for me, I really appreciate it. A further "problem" occurred to me with another layout that I wanted to try, specifically the one on this page: I tried to modify the code you wrote as follows

```python
deals_urls += [div.find_elements(By.CLASS_NAME, "sg-col-inner")[0].get_attribute("href") for div in
```

but did not obtain the desired result. Would you be so kind as to review this additional code for me? Thank you in advance.
-
The idea is correct, but you are searching for a class name that is too broad, meaning that it is used in many more places on the webpage than just the product cards. In the future, when looking for a suitable class to use, search for it in the source code of the page and make sure that it is present (almost) only in the elements you want. On the page you provided (https://amzn.to/43rBtcC), it's easier to search for the product links directly:

```python
deals_urls += [e.get_attribute("href") for e in
               selenium_driver.find_elements(By.CSS_SELECTOR, "a[class*='a-link-normal']")]
```

Although the class `a-link-normal` is itself quite broad, to make sure that the code can also handle searches that match broadly like this one, I updated the source code, and I invite you to download the latest version before trying the code snippet I provided.
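A broad selector like this is tolerable precisely because of the final deduplication step: every scraped url is reduced to a product id, non-product urls are dropped, and the ids are collapsed into a set. A small self-contained sketch of that step, using made-up urls and a simplified stand-in for the project's `extract_product_id` helper:

```python
import re

# hypothetical urls matched by a broad selector; several point to the same product
scraped_urls = [
    "https://www.amazon.it/dp/B0AAAA1111/ref=deal_1",
    "https://www.amazon.it/dp/B0AAAA1111/ref=deal_2",   # duplicate product
    "https://www.amazon.it/dp/B0BBBB2222/ref=deal_3",
    "https://www.amazon.it/gp/help/customer",           # not a product link
]

def simple_extract_id(url):
    # simplified stand-in for the project's extract_product_id helper
    m = re.search(r"/dp/([A-Z0-9]{10})", url)
    return m.group(1) if m else None

ids = [simple_extract_id(u) for u in scraped_urls]
unique_ids = [*set(i for i in ids if i)]  # drop non-products, collapse duplicates
print(sorted(unique_ids))  # → ['B0AAAA1111', 'B0BBBB2222']
```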
-
Hello again! Despite the changes, I get the following error at line 54: Why does it appear now if before, with similar code, I had no problems? Thanks again for your patience.
-
It is probably caused by an invisible character that was introduced when copying. See this stackoverflow question.
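For context (this is our guess, not something confirmed from your traceback): characters such as the zero-width space (U+200B) or the non-breaking space (U+00A0) often sneak in when copying code from a browser, and they produce confusing `SyntaxError`/`NameError` messages even though the line looks identical to the original. A quick way to check a pasted line:

```python
# report characters outside printable ASCII (position, codepoint) in a pasted snippet
def find_suspicious_chars(line):
    return [(i, hex(ord(c))) for i, c in enumerate(line)
            if ord(c) > 126 or (ord(c) < 32 and c not in "\t\n")]

clean = 'deals_urls = []'
pasted = 'deals_urls\u200b = []'   # same line with a hidden zero-width space

print(find_suspicious_chars(clean))   # → []
print(find_suspicious_chars(pasted))  # → [(10, '0x200b')]
```

If the check reports anything, retyping the offending line by hand is usually the fastest fix.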
-
Good afternoon. I apologize for the previous request; it was probably late and I wasn't thinking clearly... In any case, I'm continuing to tinker with your code, trying to add nice information to my posts. In detail, I wanted to extract two pieces of data:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import html
import time

# initialize the WebDriver browser (Chrome example)
driver = webdriver.Chrome('/path/to/chromedriver')
url = "https://www.example.com"  # replace with your desired URL
driver.get(url)
time.sleep(5)  # add a pause to give the pages time to load

product_pages = driver.find_elements(By.TAG_NAME, "article")
for product_page in product_pages:
    page_html = product_page.find_element(By.TAG_NAME, "div").get_attribute("outerHTML")
    product_page_content = html.fromstring(page_html)  # parse the HTML so xpath() can be used
    # find the desired values in the product page content
    best_seller_texts = product_page_content.xpath('//*[@id="zeitgeistBadge_feature_div"]/div/a/i/text()')
    best_seller = best_seller_texts[0].strip() if best_seller_texts else ""
    category = product_page_content.xpath('//*[@id="zeitgeistBadge_feature_div"]/div/a/span/span/text()')[0].strip()
    # add the "Top Seller" badge to the final string only if present
    prodott1 = f'{best_seller} {category}'
    print(prodott1)
# close the browser after processing
driver.quit()
```

I didn't quite understand how to do it, while for the shipping information I found the correct expression: `shipping = product_page_content.xpath('//*[@id="merchantInfoFeature_feature_div"]/div[2]/div/span/text()')[0].strip()` Hoping not to have disturbed you too much, thank you again.
-
```python
seller = product_page_content.xpath('//*[@id="merchantInfoFeature_feature_div"]/div[2]/div/span/a/text()')[0].strip()
```
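To illustrate how a positional xpath like this walks a stored HTML fragment, here is a self-contained sketch using only the standard library's `xml.etree.ElementTree` (whose limited XPath supports attribute and positional predicates, with `.text` in place of `text()`). The fragment below is invented for the example; Amazon's real markup varies from page to page:

```python
import xml.etree.ElementTree as ET

# invented fragment loosely mimicking a merchant-info block; real Amazon markup differs
fragment = """
<div id="root">
  <div id="merchantInfoFeature_feature_div">
    <div>shipping info</div>
    <div><div><span>Sold by <a>SomeShop</a></span></div></div>
  </div>
</div>
"""

root = ET.fromstring(fragment)
# div[2] selects the second <div> child, then descend div/span/a and read its text
seller = root.find(".//*[@id='merchantInfoFeature_feature_div']/div[2]/div/span/a").text.strip()
print(seller)  # → SomeShop
```

The same navigation in lxml's full XPath would end in `/a/text()`, which is exactly the shape of the expression above.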
-
@Alex87a Honestly, we cannot know the classes of all Amazon pages, since they differ on every page: Amazon is not meant to be scraped the way we are doing it. This is basically a project about scraping, so you should refer to some guides on that topic. Also, it does not make sense to build a lot of scraping-based (and therefore unstable) information into this tool, like best-seller and seller data, since there are official, stable Amazon APIs for those who are verified affiliates. Lastly, I notice that you are not using our code. For these reasons, and because your issue is not really related to this project, I'll close this discussion.