Document Conversion

At Datascience9, we provide specialized solutions for transforming legacy documentation into modern, machine-readable formats. Our automated tools bridge the gap between static documents and dynamic data systems, ensuring that your valuable information is preserved, structured, and accessible.

Whether you are dealing with large archives of MS Word files or complex PDF reports, our open-source frameworks extract content to XML precisely and regenerate documents from that XML.

Projects & Case Studies

We developed an open-source tool that automatically converts legacy MS Word and PDF documents into XML. Where applicable, the tool can also regenerate PDF documents from the XML it produces.

The source code is available on GitHub.
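
To make the idea of "document to XML" concrete, the following is a minimal sketch, not the tool's actual code. It assumes the text has already been extracted from a Word or PDF file as a list of paragraphs, and it wraps those paragraphs in XML using only the Java standard library. The class name `ParagraphsToXml` and the element names `document` and `paragraph` are hypothetical, chosen for illustration; the real schema is defined in the GitHub project.

```java
import java.io.StringWriter;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper: illustrates the general idea only, not the tool's real API.
class ParagraphsToXml {

    // Wraps already-extracted paragraph text in a simple XML structure.
    // Element and attribute names here are assumptions for illustration.
    static String toXml(List<String> paragraphs) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .newDocument();

        Element root = doc.createElement("document");
        doc.appendChild(root);

        int index = 1;
        for (String text : paragraphs) {
            Element p = doc.createElement("paragraph");
            p.setAttribute("index", Integer.toString(index++));
            // setTextContent stores raw text; the serializer escapes &, <, > for us.
            p.setTextContent(text);
            root.appendChild(p);
        }

        // Serialize the DOM tree to a string without an XML declaration or indentation.
        Transformer transformer = TransformerFactory.newInstance().newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.INDENT, "no");

        StringWriter out = new StringWriter();
        transformer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(toXml(List.of("Quarterly Report", "Revenue & costs rose.")));
    }
}
```

Building the XML through a DOM tree rather than string concatenation means special characters in legacy text are escaped automatically, which matters when the source documents contain ampersands or angle brackets.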

Conversion Process Flow
Figure 1: MS Word Conversion Process Diagram

Technologies

  • ANTLR 4
  • StringTemplate 4
  • PDFBox
  • iText 2.1.7
  • Jsoup
  • dom4j
  • Gradle

Need Help?

Looking for a custom conversion solution? Contact us today to discuss your specific requirements.