
Loading keywords and evaluating information

data mining, evaluation, search

2022-03-07 10:39:05

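I am a veteran and need help with a small program for my Rats of Tobruk project, to evaluate archive information. I usually do this by hand, but my list has grown to 11,663 records that need verifying.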

I have put the following together as a guide to what needs to happen. I am hoping someone with experience can help me complete a program.

Automated lookup and evaluation of service records

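I have generated an Excel list from my National Rats of Tobruk database, consisting of two (2) columns: Service Number and NAA.
- a) Service Number will contain the list of service numbers to be looked up and evaluated in order.
- b) NAA will initially be blank, but will be changed to (Y) if the condition is found to be "True" for the loaded service number.
Open a browser (Firefox or Chrome)
- a) Browse to ( https://recordsearch.naa.gov.au/SearchNRetrieve/Interface/DetailsReports/SeriesDetail.aspx?series_no=B883 )
- b) Set the button to "List report"
- c) Click the tab = "Basic search"
From the Excel Service Number list, enter the first "Service Number" into the field = "Keywords", then press "Enter" or click "Search"
- a) If the "Digitised item" column is blank - loop back - load the next service number from the Excel Service Number list and evaluate it.
- b) If the "Digitised item" column shows a "document icon" or anything else, then condition = True - go to the Excel Service Number list and put "Y" in the NAA column for that loaded service number.
On the web page, click "New search" and load the next service number from the Excel Service Number list, repeating the evaluation for each service number until the end of the list is reached.
When the Service Number list is complete, save the Excel file to Desktop/ROTWEB/Service Number NAA Test. (ROTWEB is a folder on my desktop)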
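
The loop described in the steps above can be sketched in a few lines of Python. This is only an illustration of the logic, not a finished program: the rows are plain dictionaries rather than an Excel sheet, the service numbers are made up, and has_digital_copy is a hypothetical stand-in for the browser lookup.

```python
from typing import Callable, Dict, List


def mark_digitised(rows: List[Dict[str, str]],
                   has_digital_copy: Callable[[str], bool]) -> int:
    """Set NAA to 'Y' for every row whose service number has a digitised item.

    Returns the number of rows that were marked.
    """
    marked = 0
    for row in rows:
        if has_digital_copy(row["Service Number"]):
            row["NAA"] = "Y"
            marked += 1
        # Otherwise NAA stays blank and the loop moves to the next number.
    return marked


# Stand-in lookup for demonstration only (assumption): two numbers are "digitised".
fake_digitised = {"NX1234", "VX5678"}
rows = [
    {"Service Number": "NX1234", "NAA": ""},
    {"Service Number": "QX0001", "NAA": ""},
    {"Service Number": "VX5678", "NAA": ""},
]
count = mark_digitised(rows, lambda number: number in fake_digitised)
print(count, [row["NAA"] for row in rows])
```

Keeping the lookup as a separate function means the marking logic can be checked without opening a browser, and the real lookup can be swapped in later.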

I hope you can follow my thinking. If not, please contact me by replying to this post or by email (ocar23@iinet.net.au).

1 Answer

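That is quite a lot to ask. The web page is dynamic, so you need to run an automated browser, navigate to the search page, enter the service number, press "Search" and check whether any item indicates a digital copy.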

One way to do this is with Python, a free programming language. You will need to download Python and follow a "how to install" tutorial.

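Make sure you install Python with pip, because it makes installing additional free modules very easy. To install additional packages, type "cmd" in the Windows search field. This opens the command-line interface (a black window). Once Python is installed with pip, you just type "pip install XYZ" (and press Enter) to install package XYZ (where XYZ is the name of the package).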
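
Before running a longer script it can help to check which modules are still missing. The sketch below uses only the standard library; the function name missing_modules is made up for this example, and the list of names is an assumption based on the imports the script in this answer uses. Note that the import name and the pip name can differ: bs4 is installed with "pip install beautifulsoup4".

```python
import importlib.util
from typing import Iterable, List


def missing_modules(import_names: Iterable[str]) -> List[str]:
    """Return the top-level module names that Python cannot find."""
    return [name for name in import_names
            if importlib.util.find_spec(name) is None]


# Import names (not pip names) that the scraping script relies on.
needed = ["bs4", "pandas", "selenium", "openpyxl"]
for name in missing_modules(needed):
    print("Not installed:", name)
```

If nothing is printed, all four modules were found.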

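You also need to download the Selenium Chrome driver (its version must match the version number of your current Chrome browser). This free software lets Python control Chrome browser pages automatically. The Chrome driver is a single file, which must be copied to the location where Python is installed on your computer.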
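
As a rough illustration of the version rule, the helper below compares the leading (major) part of two version strings. It assumes that agreeing on the major number is a good enough first check, and the version strings are examples only. Recent Selenium releases (4.6 and later) include Selenium Manager, which can fetch a matching driver automatically, so this check mainly matters for manual installs.

```python
def major_version(version: str) -> int:
    """Return the leading (major) component of a dotted version string."""
    return int(version.strip().split(".")[0])


def versions_match(chrome_version: str, driver_version: str) -> bool:
    """True when browser and driver share the same major version."""
    return major_version(chrome_version) == major_version(driver_version)


# Example version strings (assumptions, not taken from a real installation).
print(versions_match("98.0.4758.102", "98.0.4758.80"))
print(versions_match("98.0.4758.102", "99.0.4844.51"))
```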

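The script below does what you are asking for. It searches the NAA page for each "number" in the Excel file, one at a time. For each query the script looks for digital documents and records how many it found. At the end, the results are saved to an Excel file. My dummy results look like this: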

     servnr  digicopy
0      1234         2
1      6516        15
2     51651         2
3  51651651         0

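Perhaps you have a friend or family member who can help you install Python. It is often a bit daunting for people without programming experience. If you build on my script below, you could even download the digital documents or extract information from them (for example with OCR software such as tesseract).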

Basic code for the web scraper:

# Import required additional packages (to be installed via pip first)
from bs4 import BeautifulSoup
import time
import pandas as pd
# Chrome Webdriver https://chromedriver.chromium.org/downloads
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.by import By

# When you see something like "ModuleNotFoundError: No module named 'openpyxl'"
# You need to install the pip package, e.g. pip install openpyxl
# Install pip packages in the console window (type cmd in the search field in Windows bottom left, open it, type pip...)

# Load the Excel file containing the service numbers
# Change the path to the file if needed
df = pd.read_excel('D:/service_numbers.xlsx')

# Start chrome driver https://chromedriver.chromium.org/downloads
options = webdriver.ChromeOptions()
#options.add_argument('headless') # in background option
options.add_argument('log-level=3')
browser = webdriver.Chrome(options=options)  # "chrome_options" was removed in newer Selenium releases

# Set up empty data frame to store results
results = pd.DataFrame([])

# Loop over each number (aka row) in the excel file (column name 'snr')
for s in df['snr']:
    browser.get("https://www.naa.gov.au/")
    time.sleep(0.5)
    # Go to record search 
    browser.find_element(By.XPATH, "//*[@id=\"block-naa-mainpagecontent\"]/div/div[4]/div[2]/div/div/div/div/a").click()
    time.sleep(0.5)

    # Find the search field, type the service number "s" and press Enter
    print("Loading number %s" % s)
    searchfield = browser.find_element(By.ID, "ContentPlaceHolderSNR_tbxKeyword")  # find field
    searchfield.send_keys(str(s))  # enter the search term (the service number)
    searchfield.send_keys(Keys.ENTER)  # press Enter
    time.sleep(0.5)

    # Load the content of the current page (I do not iterate over all results, only the first page)
    soup = BeautifulSoup(browser.page_source, "html.parser")
    table = soup.find("table", id="ContentPlaceHolderSNR_tblItemDetails")
    # You can also load/print the entire table, or even go directly to the digital content (if present)
    #print(table.text)

    # Count the "images" that indicate digital content
    imgs = soup.find_all('img', {'class': 'digital_copy'})
    numberofdigiitems = len(imgs)  # find_all returns an empty list when nothing matches
    print("There are %s digital copies" % numberofdigiitems)

    # Append result to data frame (DataFrame.append was removed in pandas 2.0, so use pd.concat)
    results = pd.concat([results, pd.DataFrame({'servnr': s, 'digicopy': numberofdigiitems}, index=[0])], ignore_index=True)

print(results.head())

# Save to excel
results.to_excel("D:/myresults.xlsx", index=False)
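
The counting step can be tried offline before running the full script against the website. The sketch below is a dependency-free variant that uses Python's built-in html.parser instead of BeautifulSoup. The class name digital_copy and the table id come from the script above; the HTML fragment and the names DigitalCopyCounter and count_digital_copies are made up for illustration, and the real results page will look different.

```python
from html.parser import HTMLParser


class DigitalCopyCounter(HTMLParser):
    """Count <img> tags whose class attribute includes 'digital_copy'."""

    def __init__(self):
        super().__init__()
        self.count = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        # attrs is a list of (name, value) pairs; value may be None.
        classes = (dict(attrs).get("class") or "").split()
        if "digital_copy" in classes:
            self.count += 1


def count_digital_copies(html: str) -> int:
    """Return the number of digital-copy icons found in an HTML string."""
    parser = DigitalCopyCounter()
    parser.feed(html)
    return parser.count


# Made-up fragment (assumption): two rows with an icon, one without.
sample = """
<table id="ContentPlaceHolderSNR_tblItemDetails">
  <tr><td>Item A</td><td><img class="digital_copy" src="icon.gif"></td></tr>
  <tr><td>Item B</td><td></td></tr>
  <tr><td>Item C</td><td><img class="icon digital_copy" src="icon.gif"/></td></tr>
</table>
"""
print(count_digital_copies(sample))
```

A count greater than zero corresponds to the "Y" the question asks for in the NAA column.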
