Update arXiv fetch #279

Open
TimidRobot wants to merge 57 commits into main from arxiv-fetch


@TimidRobot TimidRobot commented Feb 4, 2026

Fixes

Description

Update arXiv fetch

  • Includes changes from Update arXiv fetch to use OAI-PMH API #243 by @Opsmithe
    • Switch from Atom API to OAI-PMH API
      • OAI-PMH: Open Archives Initiative Protocol for Metadata Harvesting
  • Query API for category names (instead of using hard coded values)
  • Document all categories instead of just the first
  • Improve speed by using third-party lxml instead of standard library xml
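
As a rough illustration of the category-name lookup described above, the sketch below parses an OAI-PMH ListSets response into a mapping of setSpec to setName. It is not the code in this PR: the function name parse_set_names and the sample payload are assumptions made for illustration, and the payload is not real arXiv output. Only the element names and the namespace come from the OAI-PMH 2.0 specification. The import falls back to the standard library when lxml is not installed, since the two expose compatible calls for this use.

```python
# Sketch only: parse an OAI-PMH ListSets response into {setSpec: setName}.
# parse_set_names and SAMPLE_LIST_SETS are hypothetical names, not from the PR.
try:
    from lxml import etree  # third-party parser, typically faster
except ImportError:
    import xml.etree.ElementTree as etree  # standard library fallback

# Namespace defined by the OAI-PMH 2.0 specification
NS = {"oai": "http://www.openarchives.org/OAI/2.0/"}

# Illustrative payload (assumed values, not real arXiv output)
SAMPLE_LIST_SETS = b"""<?xml version="1.0" encoding="UTF-8"?>
<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListSets>
    <set><setSpec>cs</setSpec><setName>Computer Science</setName></set>
    <set><setSpec>math</setSpec><setName>Mathematics</setName></set>
  </ListSets>
</OAI-PMH>
"""


def parse_set_names(xml_bytes):
    """Return a dict mapping each setSpec to its setName."""
    root = etree.fromstring(xml_bytes)
    names = {}
    for set_el in root.iterfind(".//oai:set", namespaces=NS):
        spec = set_el.findtext("oai:setSpec", namespaces=NS)
        name = set_el.findtext("oai:setName", namespaces=NS)
        # Skip incomplete entries rather than guessing a name
        if spec and name:
            names[spec.strip()] = name.strip()
    return names


if __name__ == "__main__":
    print(parse_set_names(SAMPLE_LIST_SETS))
```

A real harvest would also need to fetch the response over HTTP and follow any resumptionToken the server returns; both are left out here to keep the sketch self-contained.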

Tests

Command:

pipenv run ./dev/test_scripts_help.sh

Output:

Loading .env environment variables...
✅ scripts/1-fetch/arxiv_fetch.py
✅ scripts/1-fetch/europeana_fetch.py
✅ scripts/1-fetch/gcs_fetch.py
✅ scripts/1-fetch/github_fetch.py
✅ scripts/1-fetch/openverse_fetch.py
✅ scripts/1-fetch/smithsonian_fetch.py
✅ scripts/1-fetch/wikipedia_fetch.py
✅ scripts/2-process/gcs_process.py
✅ scripts/2-process/github_process.py
✅ scripts/2-process/wikipedia_process.py
✅ scripts/3-report/gcs_report.py
✅ scripts/3-report/github_report.py
✅ scripts/3-report/wikipedia_report.py
✅ scripts/3-report/zzz-notes.py
exit status: 0

Checklist

  • I have read and understood the Developer Certificate of Origin (DCO), below, which covers the contents of this pull request (PR).
  • My pull request doesn't include code or content generated with AI (also see Avoiding generative AI development tools — Creative Commons Open Source).
  • My pull request has a descriptive title (not a vague title like Update index.md).
  • My pull request targets the default branch of the repository (main or master).
  • My commit messages follow best practices.
  • My code follows the established code style of the repository.
  • I added or updated unit tests and/or test scripts for the changes I made (if applicable).
  • I added or updated documentation (if applicable).
  • I tried running the project locally and verified that there are no
    visible errors.

Developer Certificate of Origin

For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
1 Letterman Drive
Suite D4700
San Francisco, CA, 94129

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.


Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure

Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis
…nformation

- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility

Test run processed 10 journals and 1 article sample successfully.

- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis

New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions

Test data shows successful extraction of BY, NC, SA flags and license URLs.

- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail

Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research
… integration

- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification

Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary
information. Boolean flags add complexity without providing additional analytical
value for commons quantification purposes.
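
To make the rationale above concrete, here is a minimal sketch of how the individual components could be recovered from a license type string on demand. The helper license_components is hypothetical and is not the extract_license_type() referenced in these commits. It assumes license strings of the form 'CC BY-NC', as in the example above.

```python
# Hypothetical helper for illustration; not the PR's extract_license_type().
CC_COMPONENTS = ("BY", "NC", "ND", "SA")


def license_components(license_type):
    """Return the CC components named in a string such as 'CC BY-NC'."""
    parts = license_type.upper().replace("-", " ").split()
    # Only strings that start with the token 'CC' are treated as CC licenses
    if not parts or parts[0] != "CC":
        return frozenset()
    return frozenset(p for p in parts[1:] if p in CC_COMPONENTS)


if __name__ == "__main__":
    print(sorted(license_components("CC BY-NC")))
```

Because the components can be derived this way whenever they are needed, storing them as separate boolean columns would duplicate information already present in the type string.
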
- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)

This branch now focuses exclusively on ArXiv-related improvements.
All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.
@TimidRobot TimidRobot self-assigned this Feb 4, 2026
@TimidRobot TimidRobot requested review from a team as code owners February 4, 2026 07:49
@TimidRobot TimidRobot requested review from possumbilities and removed request for a team February 4, 2026 07:49
@github-project-automation github-project-automation bot moved this to Triage in TimidRobot Feb 4, 2026
@TimidRobot TimidRobot marked this pull request as draft February 4, 2026 07:50
@TimidRobot TimidRobot moved this from Triage to In review in TimidRobot Feb 4, 2026
@TimidRobot TimidRobot requested a review from oree-xx February 5, 2026 06:15
@TimidRobot TimidRobot marked this pull request as ready for review February 5, 2026 06:26

Projects

Status: In review

Development

Successfully merging this pull request may close these issues.

arXiv fetch is unreliable

3 participants