Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research
dc.contributor.author | Seaton, Alexa | en_US |
dc.contributor.author | Xu, Yujie | en_US |
dc.contributor.author | Von Arx, Devin | en_US |
dc.contributor.author | Traylor, Jordan | en_US |
dc.contributor.author | Jin, Ying | en_US |
dc.contributor.author | Evans, Kenneth Mellinger | en_US |
dc.contributor.org | Baker Institute, Science and Technology Policy Program | en_US |
dc.date.accessioned | 2025-04-07T20:54:06Z | en_US |
dc.date.available | 2025-04-07T20:54:06Z | en_US |
dc.date.issued | 2025 | en_US |
dc.description.abstract | Born-digital records pose challenges for digital preservation due to their unstructured formats and noncompliance with accessibility standards. This project introduces a modular, open-source workflow to batch process large, mixed-media PDFs (many obtained through FOIA requests) by leveraging OCR, AI, and named-entity recognition. Built for the White House Scientists Archive, this system enhances discoverability and usability of digitized records across administrations and supports metadata extraction at scale. Key tools include Mistral AI for OCR, Apache Tika for entity recognition, and a fine-tuned Mistral model for metadata generation. | en_US |
dc.identifier.citation | Seaton, A., Xu, Y., Von Arx, D., Traylor, J., Jin, Y., & Evans, K. (2025). Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research. Rice University. https://doi.org/10.25611/AYBM-0F31 | en_US |
dc.identifier.doi | https://doi.org/10.25611/AYBM-0F31 | en_US |
dc.identifier.uri | https://hdl.handle.net/1911/118299 | en_US |
dc.language.iso | eng | en_US |
dc.publisher | Rice University | en_US |
dc.rights | Except where otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial (CC BY-NC) license. Permission to reuse, publish, or reproduce the work beyond the terms of the license or beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. | en_US |
dc.rights.uri | https://creativecommons.org/licenses/by-nc/4.0/ | en_US |
dc.subject.keyword | computer vision | en_US |
dc.subject.keyword | science policy | en_US |
dc.subject.keyword | digital humanities | en_US |
dc.title | Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research | en_US |
dc.type | Presentation | en_US |
dc.type.dcmi | Text | en_US |