破解中国移动教育技术论文批量下载

天黑 发布于 28 天前 110 次阅读


打开单篇论文网址,发现论文由pdf.js加载,但是原生的下载按钮被隐藏了,这个好解决

打开f12审查元素,找到<button id="download" class="toolbarButton download hiddenMediumView" style="display:none;" title="下载" tabindex="34" data-l10n-id="download"> <span data-l10n-id="download_label">下载</span></button>style="display:none;"注释掉即可显示下载按钮。如果不喜欢这个方法,在“网络”里用筛选器筛选出.pdf后缀文件,也一样能下载。

但是这样太麻烦了,每次点开链接,再点击“查看作品”,pdf才会出现,再审查元素,前后30秒才能下载单篇,有没有可以批量下载的方法?
一般移动端浏览器查看pdf文件,前端兼容性会比较好(简陋),果然,分析移动端网址,发现打开后直接就是pdf.js在加载pdf。而且网址好简单哈哈哈哈吼吼吼,直接用论文编号(productid)就可以打开文件,那只要批量输入ID,岂不是很容易实现批量下载?!


于是CTRL+U查看网页源代码:

<!DOCTYPE html>
<script type="text/javascript" src="/activity/resources/js/jquery.js"></script>

<html dir="ltr" mozdisallowselectionprint moznomarginboxes>
  <head>
    <meta charset="utf-8">
    <meta name="viewport" content="width=device-width, initial-scale=1, maximum-scale=1">
<!--#if GENERIC || CHROME-->
    <meta name="google" content="notranslate">
<!--#endif-->
    <title>人工智能引领教育教学变革的探究</title>

    <link rel="stylesheet" href="/activity/H5/subjectResources/pdf/web/viewer.css"/>
<!--#if !PRODUCTION-->
    <link rel="resource" type="application/l10n" href="locale/locale.properties"/>
<!--#endif-->

<!--#if !PRODUCTION-->
  </head>
  <body >
    <div style=" height: 100%;">
      <iframe frameborder="0" name="Iframe1" src="https://media.andedu.net/andedu/plugins/pdf.js/web/viewer.html?docurl=aHR0cHM6Ly9kbWVkaWEuYW5kZWR1Lm5ldC8yMDI1MDgwMTIwMzEvNWEyYWUwMDZhOTRmNDIzMjM4MTk1YWVjNDMzYzYwNmUvcHJvZHVjdC9jcF9saWFvbmluZy9ET0NVTUVOVC8yNTA3MjgvTC05MDQ2ODEzMS0wLTBDMDMyMTc0NUQvNTg4MDc4OS5wZGY=" width="100%" height="100%"></iframe>
    </div>
  </body>
</html>

发现源代码里的PDF链接是嵌入在iframe的src属性中:

<iframe frameborder="0" name="Iframe1" src="https://media.andedu.net/andedu/plugins/pdf.js/web/viewer.html?docurl=aHR0cHM6Ly9kbWVkaWEuYW5kZWR1Lm5ldC8yMDI1MDgwMTE4MDgvZjYwMmIzYzc5ODg5Y2RhMTAxZjUyODAzYjg2NmM0ZDkvcHJvZHVjdC9jcF9saWFvbmluZy9ET0NVTUVOVC8yNTA3MjgvTC05MDQ2ODEzMS0wLTBDMDMyMTc0NUQvNTg4MDc4OS5wZGY=" width="100%" height="100%"></iframe>

URL参数用Base64编码,想到python的base64库很容易实现编码解码,还有内置的HTTP请求库urllib可以基本实现批量下载功能。

第一波简单写了小DEMO后发现报错“urllib3.exceptions.SSLError: [SSL: UNSAFE_LEGACY_RENEGOTIATION_DISABLED] unsafe legacy renegotiation disabled (_ssl.c:1020)”
发现是SSL安全校验出现问题,可能是Python 3.13版本太严格了?降级可太麻烦,于是采用urllib禁止警告,就可以正常请求了:
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

完整代码拆解如下:

1.导入必要的库和模块

import urllib.request
import base64
import re
from bs4 import BeautifulSoup
import ssl
import urllib3
import time
import os
urllib3.disable_warnings(urllib3.exceptions.InsecureRequestWarning)

2.定义下载函数,忽略SSL安全认证,并启用不安全的遗留重新协商(因为之前SSL报错)

def download_pdf(product_id):
    try:
        context = ssl._create_unverified_context()
        context.options |= 0x00040000  # OP_ALLOW_UNSAFE_LEGACY_RENEGOTIATION

        url = f"https://edu.10086.cn/activity/activity/thesis/h5SeePDF?productId={product_id}"
        print(f"正在处理产品ID: {product_id}")

3.创建一个HTTP请求,设置User-Agent头,然后发送请求并获取响应,读取内容后解码为UTF-8文本

req = urllib.request.Request(
            url, 
            headers={
                "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
            }
        )

        response = urllib.request.urlopen(req, context=context, timeout=30)
        html_content = response.read().decode('utf-8')

4.使用BeautifulSoup解析HTML,查找iframe标签提取src:https://media.andedu.net/andedu/plugins/pdf.js/web/viewer.html?docurl=aHR0cHM6Ly9kbWVkaWEuYW5kZWR1Lm5ldC8yMDI1MDgwMTIwMzEvNWEyYWUwMDZhOTRmNDIzMjM4MTk1YWVjNDMzYzYwNmUvcHJvZHVjdC9jcF9saWFvbmluZy9ET0NVTUVOVC8yNTA3MjgvTC05MDQ2ODEzMS0wLTBDMDMyMTc0NUQvNTg4MDc4OS5wZGY=,再用正则表达式提取docurl参数的值;'aHR0cHM6Ly9kbWVkaWEuYW5kZWR1Lm5ldC8yMDI1MDgwMTIwMzEvNWEyYWUwMDZhOTRmNDIzMjM4MTk1YWVjNDMzYzYwNmUvcHJvZHVjdC9jcF9saWFvbmluZy9ET0NVTUVOVC8yNTA3MjgvTC05MDQ2ODEzMS0wLTBDMDMyMTc0NUQvNTg4MDc4OS5wZGY=',最后解码得到PDF下载链接

soup = BeautifulSoup(html_content, 'html.parser')
        iframe = soup.find('iframe')
        if iframe and 'src' in iframe.attrs:
            iframe_src = iframe['src']

            match = re.search(r'docurl=([^&]+)', iframe_src)
            if match:
                encoded_url = match.group(1)
                actual_url = base64.b64decode(encoded_url).decode('utf-8')
                print(f"解码后的PDF URL: {actual_url}")

5.创建新HTTP请求来访问实际url,响应状态码200就创建下载目录,从url提取文件名+产品Id(论文编号),方便后续核对。然后然后以二进制模式打开文件并写入内容。

pdf_req = urllib.request.Request(
                    actual_url,
                    headers={
                        "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
                    }
                )
                pdf_response = urllib.request.urlopen(pdf_req, context=context, timeout=30)

                if pdf_response.status == 200:
                    download_dir = "downloaded_pdfs"
                    if not os.path.exists(download_dir):
                        os.makedirs(download_dir)

                    original_filename = actual_url.split('/')[-1]
                    filename = f"{product_id}_{original_filename}"
                    file_path = os.path.join(download_dir, filename)

                    with open(file_path, 'wb') as f:
                        f.write(pdf_response.read())
                    print(f"PDF已成功下载为: {file_path}")
                    return True

6.错误处理

else:
                    print(f"下载失败,状态码: {pdf_response.status}")
            else:
                print(f"产品ID {product_id}: 未找到docurl参数")
        else:
            print(f"产品ID {product_id}: 未找到iframe标签")
    except Exception as e:
        print(f"处理产品ID {product_id} 时出错: {str(e)}")

    return False

7.主函数提示用户输入多个产品ID,每行一个(因为我的ocr识别自动换行),输入空行或"end"表示结束输入。还要设置下载间隔时间,避免请求频繁导致被服务器封锁(虽然我也不知道什么时候会被封锁),然后批量下载。

def main():
    print("PDF批量下载工具 - 中国移动教育论文")
    print("=" * 50)

    print("\n请输入多个产品ID,每行一个。输入完成后输入一个空行或输入'end'表示结束:")
    product_ids = []

    while True:
        line = input()
        if not line.strip() or line.strip().lower() == 'end':
            break
        if line.strip():
            product_ids.append(line.strip())
    if not product_ids:
        print("没有有效的产品ID!")
        return

    print(f"共找到 {len(product_ids)} 个产品ID,开始下载...")

    delay = input("请设置下载间隔秒数 (推荐2-5秒,避免请求过于频繁): ")
    try:
        delay = float(delay)
    except:
        delay = 3.0

    success_count = 0
    fail_count = 0

    for i, product_id in enumerate(product_ids):
        print(f"\n[{i+1}/{len(product_ids)}] 处理产品ID: {product_id}")
        if download_pdf(product_id):
            success_count += 1
        else:
            fail_count += 1

        if i < len(product_ids) - 1:
            print(f"等待 {delay} 秒后继续下一个下载...")
            time.sleep(delay)

    print(f"\n批量下载完成! 成功: {success_count}, 失败: {fail_count}")
if __name__ == "__main__":
    main()

最终效果:

此作者没有提供个人介绍。
最后更新于 2025-08-01