Help for different types of layouts #6
-
Hi, I really appreciate your project! I've been testing it on my Telegram channel for a few days and everything is working perfectly. However, so far I have only used the home Offers page (on the Italian store), never managing to do the same with other, more specific pages. In detail, I am working for now on this page: My idea was to search within a given category, which is impossible for me since the Bot, as I understand it, searches the menus on the left, such as "50% discount", which is a selectable option to filter offers. So, if I wanted to scrape in this category, what should I change?
Replies: 8 comments
-
Hi! Thanks for your interest in our project!

Amazon pages have many different layouts, and products are represented in different ways in the HTML source code. In the link you provided, https://amzn.to/3J5cGBT, the products are contained in `<div>` elements that have, among their classes, the class `d4xojt-0`. For this reason, it's necessary to find all such elements and then look for the product URLs in the `<a>` elements inside them.

Your example also made us notice that, for dynamically loaded pages, not all products are present from the beginning, so it's necessary to scroll the page to make new products load. Here's the modified function `get_all_deals_ids()`:

```python
def get_all_deals_ids():
    deals_page = "https://amzn.to/3J5cGBT"
    selenium_driver = start_selenium()
    print("Starting taking all urls")
    try:
        selenium_driver.get(deals_page)
        # products are dynamically loaded, so all of them must be loaded in order not to lose any;
        # scrolling the page a few times makes sure that all products have been loaded
        deals_urls = []
        for _ in range(3):
            selenium_driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
            time.sleep(3)  # 3 seconds is a good compromise; it may be necessary to increase the delay on a slow connection
            # products are contained in a <div> with class 'd4xojt-0';
            # inside each <div> there is an <a> element with the link of a product; there are no submenus.
            # it's fine if some urls are saved multiple times because, at the end, the ids are deduplicated as a set
            deals_urls += [div.find_elements(By.TAG_NAME, "a")[0].get_attribute("href") for div in
                           selenium_driver.find_elements(By.CSS_SELECTOR, "div[class*='d4xojt-0']")]
        # keep only the product ids because they can easily be used to rebuild the link for a product
        product_ids = [extract_product_id(url) for url in deals_urls if
                       extract_product_id(url) is not None and extract_product_id(url) != '']
        selenium_driver.quit()  # close everything that was created; better not to keep the driver open for long
        return [*set(product_ids)]  # remove duplicates
    except Exception as e:
        print(e)
        selenium_driver.quit()  # close everything that was created; better not to keep the driver open for long
        return []  # error, no ids taken
```
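The helper `extract_product_id` used above is part of the project's source code. As a rough illustration of what such a helper does (the repository's actual implementation may differ), a minimal sketch that pulls the ASIN, the 10-character Amazon product id, out of a URL with a regular expression could look like this:

```python
import re

def extract_product_id(url):
    """Return the ASIN contained in an Amazon product URL, or None.

    Assumption (for this sketch): the id appears after a '/dp/' or
    '/gp/product/' path segment, e.g. https://www.amazon.it/dp/B0ABCD1234/ref=...
    """
    match = re.search(r"/(?:dp|gp/product)/([A-Z0-9]{10})", url)
    return match.group(1) if match else None

print(extract_product_id("https://www.amazon.it/dp/B0ABCD1234/ref=deal"))  # → B0ABCD1234
print(extract_product_id("https://www.amazon.it/"))                        # → None
```

Once you have the id, the product link can be rebuilt as `https://www.amazon.it/dp/<id>`, which is why ids alone are enough to deduplicate and store the deals.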
-
Good evening! First of all, thank you for also writing the new code for me, I really appreciate it. A further "problem" occurred to me with another layout that I wanted to try, specifically the one on this page: I tried to modify the code you wrote as follows

```python
deals_urls += [div.find_elements(By.CLASS_NAME, "sg-col-inner")[0].get_attribute("href") for div in
```

but did not obtain the desired result. Would you be so kind as to review this additional code for me? Thank you in advance.
-
The idea is correct, but you are searching for a class name that is too broad, meaning that it is used in many more places on the webpage than just the product cards. In the future, when looking for a suitable class to use, search for it in the source code of the page and make sure that it is present (almost) only in the elements you want. On the page you provided (https://amzn.to/43rBtcC), it's easier to search for the product links directly:

```python
deals_urls += [e.get_attribute("href") for e in
               selenium_driver.find_elements(By.CSS_SELECTOR, "a[class*='a-link-normal']")]
```

Although the class `a-link-normal` is itself quite broad, to make sure that the code can also handle searches that match broadly like this one, I updated the source code, and I invite you to download the latest version before trying the code snippet I provided.
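A broad selector like this is tolerable precisely because of the final deduplication step: every scraped url is reduced to a product id, non-product urls are dropped, and the ids are collapsed into a set. A small self-contained sketch of that step, using made-up urls and a simplified stand-in for the project's `extract_product_id` helper:

```python
import re

# hypothetical urls matched by a broad selector; several point to the same product
scraped_urls = [
    "https://www.amazon.it/dp/B0AAAA1111/ref=deal_1",
    "https://www.amazon.it/dp/B0AAAA1111/ref=deal_2",   # duplicate product
    "https://www.amazon.it/dp/B0BBBB2222/ref=deal_3",
    "https://www.amazon.it/gp/help/customer",           # not a product link
]

def simple_extract_id(url):
    # simplified stand-in for the project's extract_product_id helper
    m = re.search(r"/dp/([A-Z0-9]{10})", url)
    return m.group(1) if m else None

ids = [simple_extract_id(u) for u in scraped_urls]
unique_ids = [*set(i for i in ids if i)]  # drop non-products, collapse duplicates
print(sorted(unique_ids))  # → ['B0AAAA1111', 'B0BBBB2222']
```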
-
Hello again! Despite the changes, I get the following error at line 54: Why does it appear now if before, with similar code, I had no problems? Thanks again for your patience.
-
It is probably caused by an invisible character that was introduced when copying. See this stackoverflow question.
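For context (this is our guess, not something confirmed from your traceback): characters such as the zero-width space (U+200B) or the non-breaking space (U+00A0) often sneak in when copying code from a browser, and they produce confusing `SyntaxError`/`NameError` messages even though the line looks identical to the original. A quick way to check a pasted line:

```python
# report characters outside printable ASCII (position, codepoint) in a pasted snippet
def find_suspicious_chars(line):
    return [(i, hex(ord(c))) for i, c in enumerate(line)
            if ord(c) > 126 or (ord(c) < 32 and c not in "\t\n")]

clean = 'deals_urls = []'
pasted = 'deals_urls\u200b = []'   # same line with a hidden zero-width space

print(find_suspicious_chars(clean))   # → []
print(find_suspicious_chars(pasted))  # → [(10, '0x200b')]
```

If the check reports anything, retyping the offending line by hand is usually the fastest fix.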
-
Good afternoon. I apologize for the previous request; it was probably late and I wasn't thinking clearly... In any case, I'm continuing to tinker with your code, trying to add nice information to my posts. In detail, I wanted to extract two pieces of data:

```python
from selenium import webdriver
from selenium.webdriver.common.by import By
from lxml import html
import time

# initialize the WebDriver browser (Chrome example)
driver = webdriver.Chrome('/path/to/chromedriver')
url = "https://www.example.com"  # replace with your desired URL
driver.get(url)
time.sleep(5)  # add a pause to give the pages time to load

product_pages = driver.find_elements(By.TAG_NAME, "article")
for product_page in product_pages:
    page_html = product_page.find_element(By.TAG_NAME, "div").get_attribute("outerHTML")
    product_page_content = html.fromstring(page_html)  # parse the HTML so xpath() can be used
    # find the desired values in the product page content
    best_seller_texts = product_page_content.xpath('//*[@id="zeitgeistBadge_feature_div"]/div/a/i/text()')
    best_seller = best_seller_texts[0].strip() if best_seller_texts else ""
    category = product_page_content.xpath('//*[@id="zeitgeistBadge_feature_div"]/div/a/span/span/text()')[0].strip()
    # add the "Top Seller" badge to the final string only if present
    prodott1 = f'{best_seller} {category}'
    print(prodott1)
# close the browser after processing
driver.quit()
```

I didn't quite understand how to do it, while for the shipping information I found the correct expression: `shipping = product_page_content.xpath('//*[@id="merchantInfoFeature_feature_div"]/div[2]/div/span/text()')[0].strip()` Hoping not to have disturbed you too much, thank you again.
-
```python
seller = product_page_content.xpath('//*[@id="merchantInfoFeature_feature_div"]/div[2]/div/span/a/text()')[0].strip()
```
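To illustrate how a positional xpath like this walks a stored HTML fragment, here is a self-contained sketch using only the standard library's `xml.etree.ElementTree` (whose limited XPath supports attribute and positional predicates, with `.text` in place of `text()`). The fragment below is invented for the example; Amazon's real markup varies from page to page:

```python
import xml.etree.ElementTree as ET

# invented fragment loosely mimicking a merchant-info block; real Amazon markup differs
fragment = """
<div id="root">
  <div id="merchantInfoFeature_feature_div">
    <div>shipping info</div>
    <div><div><span>Sold by <a>SomeShop</a></span></div></div>
  </div>
</div>
"""

root = ET.fromstring(fragment)
# div[2] selects the second <div> child, then descend div/span/a and read its text
seller = root.find(".//*[@id='merchantInfoFeature_feature_div']/div[2]/div/span/a").text.strip()
print(seller)  # → SomeShop
```

The same navigation in lxml's full XPath would end in `/a/text()`, which is exactly the shape of the expression above.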
-
@Alex87a Honestly, we cannot know the classes of all Amazon pages, since they differ on every page: Amazon is not meant to be scraped the way we are doing it. This is basically a project about scraping, so you should refer to some guides on that topic. Also, it does not make sense to build a lot of scraping-based (and therefore unstable) information into this tool, like best-seller and seller data, since there are official, stable Amazon APIs for those who are verified affiliates. Lastly, I notice that you are not using our code. For these reasons, and because your issue is not really related to this project, I'll close this discussion.