This repository was archived by the owner on Sep 9, 2025. It is now read-only.

InstructLab and Deepsearch #106

Open

jjasghar wants to merge 1 commit into instructlab:main from jjasghar:jjasghar/deepsearch

Contributor

jjasghar commented Jun 25, 2024

This is the proposal to start integrating Deepsearch, the document conversion system from IBM Research, with InstructLab.

jjasghar force-pushed the jjasghar/deepsearch branch 5 times, most recently from 2787217 to ba8aaed Compare

June 25, 2024 21:29


          InstructLab and Deepsearch

c2ec3e1

This is the proposal to start integrating the document conversion
system Deepsearch from IBM Research and InstructLab

Co-authored-by: Ming Zhao <mingzhao@ibm.com>
Co-authored-by: BJ Hargrave <hargrave@us.ibm.com>
Signed-off-by: JJ Asghar <awesome@ibm.com>

jjasghar force-pushed the jjasghar/deepsearch branch from ba8aaed to c2ec3e1 Compare

June 25, 2024 21:31

hickeyma suggested changes


Member

hickeyma left a comment

Thanks @jjasghar for pushing the proposal. Suggestions added inline.

docs/instructlab-deepsearch-integration.md


		# DeepSearch + InstructLab Integration Proposal

		<https://github.com/DS4SD>

Member

hickeyma Jul 4, 2024

Moved below where DeepSearch is first mentioned

docs/instructlab-deepsearch-integration.md

Comment on lines +8 to +12

+              Managing submissions for the open-source InstructLab project has revealed a significant bottleneck in processing
+              knowledge documents. For the InstructLab backend to effectively utilize these documents, they must be in markdown
+              format. Currently, we only accept Wikipedia articles, but the built-in conversion tool is inadequate. Internally at
+              IBM, and other companies, many knowledge submissions are in multiple document formats, including PDF format,
+              necessitating conversion to markdown before being used in InstructLab.

Member

hickeyma Jul 4, 2024

Suggested change

      
            Managing taxonomy submissions in InstructLab has revealed an issue in processing knowledge documents. For InstructLab to handle these documents, they must be in [markdown](https://en.wikipedia.org/wiki/Markdown) format. However, many knowledge submissions arrive in other document formats, including PDF, which necessitates conversion to markdown before they can be used by InstructLab.

docs/instructlab-deepsearch-integration.md

Comment on lines +14 to +16

+              Existing open-source methods, such as PanDoc, are inconsistent. While they preserve text, they struggle with parsing
+              tables and special symbols, as evidenced by issues in PR #1154 of the taxonomy repo in the InstructLab project. Other
+              open-source solutions have similar shortcomings.

Member

hickeyma Jul 4, 2024

Suggested change

      
            Existing open source tools, such as [Pandoc](https://pandoc.org/), are inconsistent. While they preserve text, they struggle with parsing tables and special symbols, as evidenced by issues in [PR #1154](https://github.com/instructlab/taxonomy/pull/1154) of the InstructLab taxonomy repo. Other open source solutions have similar shortcomings.

docs/instructlab-deepsearch-integration.md

Comment on lines +20 to +23

+              IBM's DeepSearch software excels in document conversion, outperforming traditional open-source methods. Utilizing a
+              computer vision model layer, it accurately parses content in the files, including titles, headers, and tables.
+              Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in
+              the future.

Member

hickeyma Jul 4, 2024

Suggested change

      
            IBM's [DeepSearch](https://github.com/DS4SD) software excels in document conversion, outperforming traditional open source methods. Utilizing a computer vision model layer, it accurately parses content in the files, including titles, headers, and tables. Additionally, it automatically implements RAG layers for models, which could benefit the InstructLab process in the future.

docs/instructlab-deepsearch-integration.md

		@@ -0,0 +1,52 @@

		# DeepSearch + InstructLab Integration Proposal

Member

hickeyma Jul 4, 2024

Suggested change

      
            # Document Conversion Proposal

docs/instructlab-deepsearch-integration.md

Comment on lines +30 to +33

+              ### Open-Source Conversion
+              - Implement a basic document conversion tool in the UI using an open-source method such as PanDoc. This tool will be
+              lightweight and easily hosted, ensuring it can be used and improved by the community.

Member

hickeyma Jul 4, 2024

Suggested change

      
            - Open source conversion: Implement a basic document conversion tool in the InstructLab UI using an open source tool such as [Pandoc](https://pandoc.org/). This tool will be lightweight and easily hosted, ensuring it can be used and improved by the community.
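
As a rough illustration of the open source conversion path in this suggestion, here is a minimal sketch that shells out to the Pandoc CLI. The function names are hypothetical and not part of InstructLab; the sketch assumes `pandoc` is installed and that the input is a format Pandoc can read (Pandoc does not accept PDF as an input format, which is part of the gap this proposal describes).

```python
import shutil
import subprocess
from pathlib import Path


def build_pandoc_command(source: Path, target: Path, to_format: str = "gfm") -> list[str]:
    """Build the Pandoc command line for a markdown conversion.

    `gfm` is Pandoc's GitHub-Flavored Markdown writer; the helper name and
    default are illustrative assumptions, not an InstructLab API.
    """
    return ["pandoc", str(source), "--to", to_format, "--output", str(target)]


def convert_to_markdown(source: Path, target: Path) -> Path:
    """Convert `source` to markdown with Pandoc and return the output path."""
    if shutil.which("pandoc") is None:
        raise RuntimeError("pandoc is not installed or not on PATH")
    # check=True raises CalledProcessError if Pandoc reports a failure.
    subprocess.run(build_pandoc_command(source, target), check=True)
    return target
```

Keeping command construction separate from execution makes the wrapper easy to exercise without Pandoc installed, for example `build_pandoc_command(Path("notes.docx"), Path("notes.md"))`.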

docs/instructlab-deepsearch-integration.md

Comment on lines +35 to +39

+              ### DeepSearch Integration
+              - Enable the UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for
+              backend use. This approach maintains an open-source version while benefiting from DeepSearch's superior
+              conversion capabilities.

Member

hickeyma Jul 4, 2024

Suggested change

      
            - [DeepSearch](https://github.com/DS4SD) conversion: Enable the InstructLab UI to switch the conversion endpoint to DeepSearch, allowing high-fidelity markdown conversions for backend use. This approach uses the open source version of DeepSearch while benefiting from DeepSearch's superior conversion capabilities.
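
One way to picture the endpoint switch described here is a small settings object that decides which conversion backend the UI calls. This is only a sketch under assumptions: `ConversionSettings`, `deepsearch_endpoint`, and `select_backend` are hypothetical names, and no real DeepSearch API is called or implied.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ConversionSettings:
    # Hypothetical setting: URL of a hosted DeepSearch conversion endpoint, if any.
    deepsearch_endpoint: Optional[str] = None


def select_backend(settings: ConversionSettings) -> str:
    """Return the name of the conversion backend the UI should use.

    Falls back to the open source Pandoc path when no DeepSearch
    endpoint is configured.
    """
    if settings.deepsearch_endpoint:
        return "deepsearch"
    return "pandoc"
```

With this shape, a deployment without access to a hosted endpoint keeps working through the open source path, and switching is a configuration change rather than a code change.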

docs/instructlab-deepsearch-integration.md

Comment on lines +41 to +43

+              IBM Research and the DeepSearch team will host the DeepSearch endpoint for the open-source community. This
+              arrangement benefits the community by streamlining contributions and provides data and exposure for the DeepSearch
+              project. IBM's contribution underscores its commitment to supporting and improving open-source projects.

Member

hickeyma Jul 4, 2024

Suggested change

      
            IBM Research and the DeepSearch team will host the DeepSearch endpoint for the InstructLab community. This arrangement benefits the community by providing a means to handle different document formats.

docs/instructlab-deepsearch-integration.md

Comment on lines +45 to +48

+              This integration will highlight the value of DeepSearch, highlighting their potential for those integrating
+              InstructLab into their workflows. If the volume of community requests becomes unsustainable for the DeepSearch team,
+              we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the
+              open-source versions will have improved sufficiently, or the value of the integration will justify continued support.

Member

hickeyma Jul 4, 2024

Suggested change

      
            If the volume of community requests becomes unsustainable for the DeepSearch team to manage, we hope for ample notification to allow the community to find alternative solutions. By then, we anticipate that the open source conversion tools will have improved sufficiently, or the value of the integration will justify continued support.

docs/instructlab-deepsearch-integration.md

Comment on lines +50 to +52

+              By adopting this two-pronged approach, we ensure the integrity of the open-source project while leveraging IBM's
+              advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology,
+              fostering innovation and improvement in document processing for the InstructLab project.

Member

hickeyma Jul 4, 2024

Suggested change

      
            By adopting this two-pronged approach, we ensure the integrity of the open source project while leveraging advanced DeepSearch capabilities. This strategy balances community collaboration with innovative technology, fostering innovation and improvement in document processing for the InstructLab community.

nathan-weinberg mentioned this pull request

Knowledge doc ingestion #148

Closed

github-actions bot commented Feb 11, 2025

This pull request has been automatically marked as stale because it has not had activity within 30 days. It will be automatically closed if no further activity occurs within 7 days.

github-actions bot added the stale label
