You are asking quite a lot. The page is dynamic, so you need to drive an automated browser: navigate to the search page, enter the service number, press "Search", and check whether any item indicates a digital copy.
One way to do this is with Python, a free programming language. You can find Python here and a "how to install" tutorial here.
Make sure you install Python together with pip, because pip makes it very easy to install additional free modules. To install extra packages, type cmd into the Windows search field; this opens the command-line interface (a black window). Once Python with pip is installed, you install a package XYZ simply by typing pip install XYZ (and pressing Enter), where XYZ is the name of the package.
You also need to download the Selenium Chrome driver (its version must match the version number of your current Chrome browser). This free program lets Python drive the Chrome browser automatically. The Chrome driver is a single file; copy it to the location where Python is installed on your computer (or any folder on your PATH).
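If Python still cannot find the driver, you can point Selenium 4 at the chromedriver file explicitly. This is a minimal sketch; the path is an assumption, adjust it to wherever you saved the file. Recent Selenium releases (4.6+) can also download a matching driver automatically, in which case this step is unnecessary.

```python
from selenium import webdriver
from selenium.webdriver.chrome.service import Service

# Hypothetical path to the downloaded chromedriver executable -- adjust as needed.
service = Service("D:/chromedriver.exe")
browser = webdriver.Chrome(service=service)  # opens a Chrome window
browser.quit()
```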
The script below does exactly what you ask. It works through the list of numbers from the Excel file one by one on the NAA search page. For each request, it looks for digital documents and records how many were found. At the end, the results are saved to an Excel file. My dummy results look like this:
servnr digicopy
0 1234 2
1 6516 15
2 51651 2
3 51651651 0
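A results frame like the one above can be built by collecting one dict per service number and constructing the DataFrame in one go (newer pandas versions, 2.0 and up, removed the old DataFrame.append method). A minimal sketch with made-up numbers:

```python
import pandas as pd

# One dict per processed service number (dummy values for illustration)
rows = [
    {'servnr': 1234, 'digicopy': 2},
    {'servnr': 6516, 'digicopy': 15},
]
results = pd.DataFrame(rows)
print(results)
# results.to_excel("D:/myresults.xlsx", index=False)  # requires openpyxl
```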
Perhaps a friend or family member can help you install Python; this is usually a bit intimidating for people without programming experience. If you build on the script below, you could even download the digital documents or extract information from them (e.g., with OCR software such as tesseract).
The basic code for the web scraper:
# Import required additional packages (to be installed via pip first)
import time
import pandas as pd
from bs4 import BeautifulSoup
# Chrome WebDriver: https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.webdriver.common.by import By

# If you see something like "ModuleNotFoundError: No module named 'openpyxl'",
# you need to install the missing pip package, e.g. pip install openpyxl
# (type cmd in the search field in Windows bottom left, open the console, type pip ...)

# Load the Excel file containing the service numbers
# (change the path to the file if needed)
df = pd.read_excel('D:/service_numbers.xlsx')

# Start the Chrome driver https://chromedriver.chromium.org/downloads
options = webdriver.ChromeOptions()
# options.add_argument('headless')  # run the browser in the background
options.add_argument('log-level=3')
browser = webdriver.Chrome(options=options)

# Collect one result row per service number
rows = []

# Loop over each number (aka row) in the Excel file (column name 'snr')
for s in df['snr']:
    browser.get("https://www.naa.gov.au/")
    time.sleep(0.5)
    # Go to record search
    browser.find_element(By.XPATH, "//*[@id='block-naa-mainpagecontent']/div/div[4]/div[2]/div/div/div/div/a").click()
    time.sleep(0.5)
    # Find the search field, send number "s", press Enter
    print("Loading number %s" % s)
    searchfield = browser.find_element(By.XPATH, "//*[@id='ContentPlaceHolderSNR_tbxKeyword']")
    searchfield.send_keys(s)     # enter the search term (the service number)
    searchfield.send_keys('\n')  # press Enter
    time.sleep(0.5)
    # Parse the content of the current page
    # (I do not iterate over all results but only look at the first page here)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    table = soup.find("table", id="ContentPlaceHolderSNR_tblItemDetails")
    # You can also print the entire table, or go directly to the digital content (if present)
    # print(table.text)
    # Count the "images" indicating digital content
    imgs = soup.find_all('img', {'class': 'digital_copy'})
    numberofdigiitems = len(imgs)
    print("There are %s digital copies" % numberofdigiitems)
    # Append the result
    rows.append({'servnr': s, 'digicopy': numberofdigiitems})

results = pd.DataFrame(rows)
print(results.head())

# Save to Excel
results.to_excel("D:/myresults.xlsx", index=False)
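The counting step in the loop can be tried out without a browser. This is a sketch against a hypothetical snippet of result-page HTML (the real NAA page is more complex, but the class name matches what the script looks for):

```python
from bs4 import BeautifulSoup

# Hypothetical fragment of a result page with two digital-copy markers
html = """
<table id="ContentPlaceHolderSNR_tblItemDetails">
  <tr><td><img class="digital_copy" src="a.png"></td></tr>
  <tr><td><img class="digital_copy" src="b.png"></td></tr>
  <tr><td>no image here</td></tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
# Count the images marking items with a digital copy
imgs = soup.find_all('img', {'class': 'digital_copy'})
print(len(imgs))  # prints 2
```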