So I have a script that scrapes mineral names + prices from 14 pages (so far) and saves them to a .txt file. I first tried it with only Page1, then I added more pages to get more data. But then the code scraped something it shouldn't have: a random name/string. I didn't expect it to grab that, but it did, and it assigned the wrong price to it! And after the mineral with this "unexpected name", the whole rest of the list gets the wrong prices. See the image below:
So, because that string is different from the others, the code further down can't split it and raises an error:
cutted2 = split2.pop(1)
^^^^^^^^^^^^^
IndexError: pop index out of range
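For illustration, the failure can be reproduced in isolation (the entry strings here are made up; only the presence or absence of the ": " separator matters):

```python
# A normal scraped entry contains ": ", so split(": ") yields two parts
normal = "TOURMALINE with ALBITE: EUR85 / USD99".split(": ")
print(normal)         # ['TOURMALINE with ALBITE', 'EUR85 / USD99']
print(normal.pop(1))  # pop(1) works fine here

# An unexpected entry has no ": ", so split(": ") yields a single part
odd = "some stray string".split(": ")
print(odd)            # ['some stray string']
try:
    odd.pop(1)
except IndexError as exc:
    print(exc)        # pop index out of range
```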
I tried to ignore these errors with one of the approaches used in other Stack Overflow answers:
try:
    cutted2 = split2.pop(1)
except IndexError:
    continue
It did work, no error appeared... but then it assigned the wrong prices to the wrong minerals (as I noticed)! How do I change the code so it skips these "strange" names and continues through the list? The full code is below; as I remember, it stopped on URL5 with that pop index error:
import requests
from bs4 import BeautifulSoup
import re
def collecter(URL):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(URL, headers=headers).text, "lxml")
    names = [n.getText(strip=True) for n in soup.select("table tr td font a")]
    prices = [
        p.getText(strip=True).split("Price:")[-1] for p
        in soup.select("table tr td font font")
    ]
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]
    with open("Minerals.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            try:
                cutted2 = split2.pop(1)
            except IndexError:
                continue
            two_prices = cutted2+" "+split1.pop(0)+"\n"
            file.write(two_prices)
URL1 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=0"
URL2 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=25"
URL3 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=50"
URL4 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=75"
URL5 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=100"
URL6 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=125"
URL7 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=150"
URL8 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=175"
URL9 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=200"
URL10 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=225"
URL11 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=250"
URL12 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=275"
URL13 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=300"
URL14 = "https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First=325"
collecter(URL1)
collecter(URL2)
collecter(URL3)
collecter(URL4)
collecter(URL5)
collecter(URL6)
collecter(URL7)
collecter(URL8)
collecter(URL9)
collecter(URL10)
collecter(URL11)
collecter(URL12)
collecter(URL13)
collecter(URL14)
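Since the fourteen URLs above differ only in the First offset (0, 25, ..., 325), they can be generated in a loop instead of written out by hand (a sketch; the URLs are printed here instead of fetched, with the collecter call left as a comment):

```python
# Base query string taken verbatim from the URLs above; only First= varies
BASE = ("https://www.fabreminerals.com/search_results.php?LANG=EN"
        "&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality="
        "&PriceRange=&checkbox=enventa&First=")

urls = [BASE + str(offset) for offset in range(0, 350, 25)]  # 14 pages
for url in urls:
    print(url)
    # collecter(url)  # the scraper function defined above
```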
EDIT: here is the fully working code below, thanks to the people who helped!
import requests
from bs4 import BeautifulSoup
import re
for URL in range(0,2569,25):
    headers = {"User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"}
    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")
    names = [n.getText(strip=True) for n in soup.select("table tr td font>a")]
    prices = [p.getText(strip=True).split("Price:")[-1] for p in soup.select("table tr td font>font")]
    names[:] = [" ".join(n.split()) for n in names if not n.startswith("[")]
    prices[:] = [p for p in prices if p]
    with open("MineralsList.txt", "a+", encoding='utf-8') as file:
        for name, price in zip(names, prices):
            # print(f"{name}\n{price}")
            # print("-" * 50)
            filename = str(name)+" "+str(price)+"\n"
            split1 = filename.split(' / ')
            cutted1 = split1.pop(0)
            split2 = cutted1.split(": ")
            cutted2 = split2.pop(1)
            try:
                two_prices = cutted2+" "+split1.pop(0)+"\n"
            except IndexError:
                two_prices = cutted2+"\n"
            file.write(two_prices)
But after some changes it stops with a new error: it can't find the string by the given attribute, so I get "IndexError: pop from empty list"... even soup.select("table tr td font>font"), which helped for the names, didn't fix it.
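One way to avoid both the misalignment and the pop errors is to validate each combined string before writing it, skipping anything that lacks the expected "Name: price" shape (a sketch with a hypothetical parse_line helper, not the final scraper code):

```python
def parse_line(line):
    """Return 'first_price second_price' for a well-formed entry,
    or None when the ': ' separator is missing (an unexpected name)."""
    first, _, second = line.partition(" / ")  # split off the second currency, if any
    _, sep, value = first.partition(": ")     # 'Name: EUR85' -> 'EUR85'
    if not sep:
        return None                           # malformed entry -> caller skips it
    return f"{value} {second}".strip()

print(parse_line("MAGNIFICENT TOURMALINE: EUR85 / USD99"))  # EUR85 USD99
print(parse_line("stray string without separator"))         # None
```

Entries with only one price also come out cleanly (parse_line("X: EUR85") gives "EUR85"), which matches what the except-branch in the edited code produces.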
You can try the next example, along with pagination:
import requests
from bs4 import BeautifulSoup

for URL in range(0,100,25):
    headers = {"User-Agent": "Mozilla/5.0"}
    soup = BeautifulSoup(requests.get(f'https://www.fabreminerals.com/search_results.php?LANG=EN&SearchTerms=&submit=Buscar&MineralSpeciment=&Country=&Locality=&PriceRange=&checkbox=enventa&First={URL}', headers=headers).text, "lxml")
    names = [x.get_text(strip=True) for x in soup.select('table tr td font a')][:25]
    print(names)
    prices = [x.get_text(strip=True) for x in soup.select('table tr td font:nth-child(3)')][:25]
    print(prices)
    # with open("Minerals.txt", "a+", encoding='utf-8') as file:
    #     for name, price in zip(names, prices):
    #         # print(f"{name}\n{price}")
    #         # print("-" * 50)
    #         filename = str(name)+" "+str(price)+"\n"
    #         split1 = filename.split(' / ')
    #         cutted1 = split1.pop(0)
    #         split2 = cutted1.split(": ")
    #         try:
    #             cutted2 = split2.pop(1)
    #         except IndexError:
    #             continue
    #         two_prices = cutted2+" "+split1.pop(0)+"\n"
    #         file.write(two_prices)
Output:
You just need to make the CSS selector more specific, so that it only matches links that sit directly inside a font element (rather than several levels down):
soup.select("table tr td font>a")
Adding a further condition, that the link points at a single item rather than at the next/previous-page links at the bottom of the page, will also help:
soup.select("table tr td font>a[href*='CODE']")
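The effect of the two selectors can be seen on a minimal, made-up HTML fragment (the real page structure may differ; this only illustrates the child combinator and the href filter):

```python
from bs4 import BeautifulSoup

# Made-up fragment: one item link directly inside <font>,
# one pagination link nested one level deeper
html = """
<table><tr><td><font>
  <a href="specimen.php?CODE=TH46AE9">Tourmaline with Albite</a>
  <b><a href="search_results.php?First=25">Next page</a></b>
</font></td></tr></table>
"""
soup = BeautifulSoup(html, "html.parser")

# Plain descendant selector matches both links:
print([a.get_text() for a in soup.select("table tr td font a")])
# Direct-child selector drops the nested pagination link:
print([a.get_text() for a in soup.select("table tr td font>a")])
# The href filter keeps only links that point at a single item:
print([a.get_text() for a in soup.select("table tr td font>a[href*='CODE']")])
```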