- Updated base URL from v3 to v4 API endpoint
- Added publisher information extraction (name and country)
- Added article sampling functionality for license analysis
- Enhanced CSV output with new publisher and article count files
- Improved error handling and logging for v4 API structure
- Updated provenance tracking to include API version
- Maintained backward compatibility with existing data structure

Benefits of v4 migration:
- Access to richer metadata including publisher details
- Better structured response format with pagination info
- Enhanced license information extraction capabilities
- Improved data quality for commons quantification analysis
…nformation
- Generated doaj_6_count_by_publisher.csv with publisher name and country data
- Added doaj_5_article_count.csv for article sampling statistics
- Updated provenance.yaml to track API v4 usage and enhanced data collection
- Publisher data includes institutions from IR, PL, CL, GB, RU, BR, ID countries
- Article sampling demonstrates new capability to analyze article-level data
- All existing data files (count, subject, language, year) maintained compatibility

Test run processed 10 journals and 1 article sample successfully.
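The publisher count file described above could be produced along these lines. This is only a sketch under stated assumptions: the record shape (`bibjson.publisher.name` and `bibjson.publisher.country`), the sample records, the CSV column names, and both function names are hypothetical and are not taken from doaj_fetch.py.

```python
import csv
import io
from collections import Counter

# Hypothetical records; the bibjson.publisher layout is an assumption
# about the API response, not something confirmed by the commit message.
SAMPLE_JOURNALS = [
    {"bibjson": {"publisher": {"name": "Example University Press", "country": "GB"}}},
    {"bibjson": {"publisher": {"name": "Example University Press", "country": "GB"}}},
    {"bibjson": {"publisher": {"name": "Instituto Exemplo", "country": "BR"}}},
    {"bibjson": {}},  # record without a publisher block
]


def count_by_publisher(journals):
    """Count journals per (publisher name, country) pair."""
    counts = Counter()
    for journal in journals:
        publisher = journal.get("bibjson", {}).get("publisher") or {}
        name = publisher.get("name") or "Unknown"
        country = publisher.get("country") or "Unknown"
        counts[(name, country)] += 1
    return counts


def write_publisher_csv(counts, handle):
    """Write the counts as CSV rows (column names are illustrative)."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["PUBLISHER", "COUNTRY", "COUNT"])
    for (name, country), count in sorted(counts.items()):
        writer.writerow([name, country, count])


if __name__ == "__main__":
    buffer = io.StringIO()
    write_publisher_csv(count_by_publisher(SAMPLE_JOURNALS), buffer)
    print(buffer.getvalue())
```

Records with a missing publisher block are counted under "Unknown" rather than dropped, so the per-publisher totals still sum to the number of journals processed.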
- Extract detailed license flags (BY, NC, ND, SA) from DOAJ v4 API response
- Add doaj_7_license_details.csv to capture license component breakdown
- Enhanced extract_license_type() to return both license type and detailed components
- Updated data processing pipeline to handle granular license information
- Added license URL tracking for verification and compliance analysis

New capabilities:
- Identify specific Creative Commons license components used by journals
- Track license URLs for direct reference to legal terms
- Enable analysis of license component combinations and trends
- Support more precise commons quantification based on usage restrictions

Test data shows successful extraction of BY, NC, SA flags and license URLs.
- Document complete migration process from v3 to v4 API
- Detail all enhanced data collection capabilities
- Provide technical implementation overview
- Include validation results and test data analysis
- Document new CSV file schemas and data structures
- Outline future enhancement opportunities
- Reference all related commits for audit trail

Key documentation sections:
- API endpoint changes and migration rationale
- Enhanced license component analysis capabilities
- Publisher and geographic data collection
- Article processing implementation
- Data quality improvements and validation
- Performance optimizations and error handling
- Impact on commons quantification research
… integration
- Remove boolean license component extraction (BY, NC, ND, SA flags)
- Remove doaj_7_license_details.csv file generation
- Simplify extract_license_type() to return only license type string
- Remove license_details_counts processing from data pipeline
- Maintain focus on meaningful license type classification

Rationale: License type string (e.g., 'CC BY-NC') already contains all necessary information. Boolean flags add complexity without providing additional analytical value for commons quantification purposes.
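To illustrate the rationale, the component flags can be recovered from the type string whenever they are needed. The helper below is a hypothetical sketch, not part of the PR; it assumes type strings follow the 'CC BY-NC' pattern quoted above and treats any string that does not start with "CC" as having no Creative Commons components.

```python
import re

FLAGS = ("BY", "NC", "ND", "SA")


def license_components(license_type):
    """Derive BY/NC/ND/SA flags from a type string such as 'CC BY-NC'.

    Hypothetical helper: non-CC strings yield all-False flags.
    """
    text = license_type.strip().upper()
    if not text.startswith("CC"):
        return {flag: False for flag in FLAGS}
    # Match components as whole words so 'BY-NC' yields BY and NC
    found = set(re.findall(r"\b(BY|NC|ND|SA)\b", text))
    return {flag: flag in found for flag in FLAGS}


if __name__ == "__main__":
    print(license_components("CC BY-NC"))
```

Because the mapping from string to flags is a pure function, storing only the type string loses nothing that a later analysis could not recompute.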
- Remove doaj_fetch.py script (moved to feature/doaj branch)
- Remove all DOAJ data files (moved to feature/doaj branch)
- Remove DOAJ_V4_MIGRATION.md documentation (moved to feature/doaj branch)

This branch now focuses exclusively on ArXiv-related improvements. All DOAJ v4 migration work has been moved to dedicated feature/doaj branch.
…iptive identifiers
- avoid switching back and forth between string and Element objects
- check updated date first, then created date
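The date ordering in this commit might look roughly like the sketch below, which keeps the record as an Element throughout and prefers the updated date over the created date. The namespace URI, the `created` and `updated` element names, the sample record, and the `record_year` function are assumptions made for illustration; the fallback import exists only so the sketch runs where lxml is not installed.

```python
try:
    from lxml import etree  # the library the PR switches to
except ImportError:
    # Fallback so the sketch still runs without lxml; the calls used
    # below (fromstring, findtext) behave the same way in both.
    import xml.etree.ElementTree as etree

# Assumed namespace for the arXiv metadata format
NS = {"arxiv": "http://arxiv.org/OAI/arXiv/"}

# Hypothetical record, not real data
SAMPLE = b"""<arXiv xmlns="http://arxiv.org/OAI/arXiv/">
  <id>0000.00000</id>
  <created>2020-01-15</created>
  <updated>2021-03-02</updated>
</arXiv>"""


def record_year(metadata):
    """Return the year from <updated> if present, else from <created>."""
    for tag in ("updated", "created"):
        text = metadata.findtext(f"arxiv:{tag}", namespaces=NS)
        if text and text.strip():
            return text.strip()[:4]
    return None


if __name__ == "__main__":
    print(record_year(etree.fromstring(SAMPLE)))
```

Parsing once and passing the Element around avoids the repeated serialize and re-parse round trips that the first bullet refers to.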
Update arXiv fetch to use OAI-PMH API
possumbilities approved these changes on Feb 5, 2026
Fixes
Description
Update arXiv fetch
- Atom API to OAI-PMH API
- lxml instead of standard library xml

Tests
Command:
Output:
Checklist
- … Update index.md).
- … main or master).
- … visible errors.
Developer Certificate of Origin
For the purposes of this DCO, "license" is equivalent to "license or public domain dedication," and "open source license" is equivalent to "open content license or public domain dedication."