Upload a CSV file of URLs through an HTML page and read the URLs to scrape with Flask
P粉799885311 2023-09-07 11:22:35
[HTML discussion group]

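I need to build a web-based system that accepts an uploaded CSV file containing a list of URLs. After the upload, the system reads the URLs line by line and passes them to the next step, scraping. The scraping requires logging in to the website first, and I already have the source code for the login. The problem is that I want to connect the HTML page named "upload_page.html" to the Flask file named "upload_csv.py". Where in the Flask file should the login and scraping code go?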
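The "read the URLs line by line" step can be tried on its own before any login or scraping is wired in. The following is only a minimal sketch using the standard library instead of pandas. The helper name `read_links` is hypothetical; the `Link` column name is taken from the `row['Link']` lookup in upload_csv.py, and UTF-8 encoding of the upload is an assumption.

```python
import csv
import io


def read_links(stream, column="Link"):
    """Return the non-empty values of `column` from a binary CSV stream.

    `read_links` is a hypothetical helper; UTF-8 input is assumed.
    """
    # Decode the whole upload; utf-8-sig also drops a leading BOM if present.
    text = stream.read().decode("utf-8-sig")
    reader = csv.DictReader(io.StringIO(text, newline=""))
    # fieldnames is None for an empty file, otherwise the header row.
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise ValueError(f"CSV has no {column!r} column")
    links = []
    for row in reader:
        # Short rows yield None for missing cells, so fall back to "".
        value = (row[column] or "").strip()
        if value:
            links.append(value)
    return links
```

Inside the Flask route this could presumably be called as `read_links(request.files['file'].stream)`, which keeps the parsing step testable without a browser or a running server.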

upload_page.html

<div class="upload">
    <h2>Upload a CSV file</h2>
    <form action="/upload" method="post" enctype="multipart/form-data">
        <input type="file" name="file" accept=".csv">
        <br>
        <br>
        <button type="submit">Upload</button>
    </form>
</div>

upload_csv.py

from flask import Flask, request, render_template
import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
import json
import time
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.chrome.options import Options

app = Flask(__name__)

@app.route('/')
def index():
    return render_template('upload_page.html')

# Code to log in to the website


@app.route('/upload', methods=['POST'])
def upload():
    # Read the uploaded file
    csv_file = request.files['file']
    # Load the CSV data into a DataFrame
    df = pd.read_csv(csv_file)
    final_data = []
    # Loop over the rows in the DataFrame and scrape each link
    for index, row in df.iterrows():
        link = row['Link']
        response = requests.get(link)
        soup = BeautifulSoup(response.content, 'html.parser')
        start = time.time()
        # will be used in the while loop
        initialScroll = 0
        finalScroll = 1000

        while True:
            driver.execute_script(f"window.scrollTo({initialScroll},{finalScroll})")
            # this command scrolls the window starting from the pixel value stored in the initialScroll
            # variable to the pixel value stored at the finalScroll variable
            initialScroll = finalScroll
            finalScroll += 1000

            # we will stop the script for 2 seconds so that the data can load
            time.sleep(2)
            end = time.time()
            # We will scroll for 20 seconds.
            if round(end - start) > 20:
                break

        src = driver.page_source
        soup = BeautifulSoup(driver.page_source, 'html.parser')
        # print(soup.prettify())

        # Code to scrape the website

    return render_template('index.html', message='Scraped all data')


if __name__ == '__main__':
    app.run(debug=True)

Are my login and scraping code in the right place? As it stands, the code does not work: after I click the Upload button, nothing is processed.


All replies (1)
P粉207969787
# Body of upload() in upload_csv.py
csv_file = request.files['file']
# Load the CSV data into a DataFrame
df = pd.read_csv(csv_file)
final_data = []
# Initialize the web driver
chrome_options = Options()
chrome_options.add_argument("--headless")
chrome_options.add_argument("--disable-gpu")
driver = webdriver.Chrome(options=chrome_options)
# Log in to the website once, before the loop, so every link reuses the session
# Replace this with your own login code
driver.get("https://example.com/login")
# Selenium 4 removed find_element_by_name; use find_element(By.NAME, ...) instead
username_field = driver.find_element(By.NAME, "username")
password_field = driver.find_element(By.NAME, "password")
username_field.send_keys("myusername")
password_field.send_keys("mypassword")
password_field.send_keys(Keys.RETURN)
# Wait for the login to complete
WebDriverWait(driver, 10).until(EC.url_changes("https://example.com/login"))
# Loop over the rows in the DataFrame and scrape each link
for index, row in df.iterrows():
    link = row['Link']
    # Scrape the website
    driver.get(link)
    start = time.time()
    # Target vertical scroll position in pixels, advanced inside the while loop
    finalScroll = 1000

    while True:
        # window.scrollTo(x, y): keep x at 0 and scroll down to finalScroll
        driver.execute_script(f"window.scrollTo(0, {finalScroll})")
        finalScroll += 1000

        # Pause for 2 seconds so that the data can load
        time.sleep(2)
        end = time.time()
        # Scroll for about 20 seconds
        if round(end - start) > 20:
            break

# Close the browser once all links have been processed
driver.quit()
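
The timed scroll loop in this reply can also be pulled into a small helper, so the route body stays short and the loop can be checked without launching Chrome. This is only a sketch under stated assumptions: the name `scroll_for` and its `clock` and `sleep` parameters are hypothetical, added so a fake clock can be injected in tests. It compares the elapsed time directly rather than rounding it first, and it passes 0 as the x coordinate because `window.scrollTo` takes (x, y).

```python
import time


def scroll_for(driver, seconds=20, step=1000, pause=2,
               clock=time.time, sleep=time.sleep):
    """Scroll down in `step`-pixel jumps until `seconds` have elapsed.

    `driver` only needs an execute_script(script) method, as Selenium's
    WebDriver has. Returns the number of scroll commands issued.
    """
    start = clock()
    position = 0
    count = 0
    while True:
        position += step
        # window.scrollTo(x, y): keep x at 0 and move y down the page
        driver.execute_script(f"window.scrollTo(0, {position})")
        count += 1
        # Give lazily loaded content time to appear
        sleep(pause)
        if clock() - start > seconds:
            break
    return count
```

With this helper, the inner `while True` block for each link would reduce to `scroll_for(driver)` followed by reading `driver.page_source`.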