Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research

dc.contributor.author: Seaton, Alexa [en_US]
dc.contributor.author: Xu, Yujie [en_US]
dc.contributor.author: Von Arx, Devin [en_US]
dc.contributor.author: Traylor, Jordan [en_US]
dc.contributor.author: Jin, Ying [en_US]
dc.contributor.author: Evans, Kenneth Mellinger [en_US]
dc.contributor.org: Baker Institute, Science and Technology Policy Program [en_US]
dc.date.accessioned: 2025-04-07T20:54:06Z [en_US]
dc.date.available: 2025-04-07T20:54:06Z [en_US]
dc.date.issued: 2025 [en_US]
dc.description.abstract: Born-digital records pose challenges for digital preservation due to their unstructured formats and noncompliance with accessibility standards. This project introduces a modular, open-source workflow to batch process large, mixed-media PDFs, many obtained through FOIA requests, by leveraging OCR, AI, and named-entity recognition. Built for the White House Scientists Archive, this system enhances discoverability and usability of digitized records across administrations and supports metadata extraction at scale. Key tools include Mistral AI for OCR, Apache Tika for entity recognition, and a fine-tuned Mistral model for metadata generation. [en_US]
dc.identifier.citation: Seaton, A., Xu, Y., Von Arx, D., Traylor, J., Jin, Y., & Evans, K. (2025). Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research. Rice University. https://doi.org/10.25611/AYBM-0F31 [en_US]
dc.identifier.doi: https://doi.org/10.25611/AYBM-0F31 [en_US]
dc.identifier.uri: https://hdl.handle.net/1911/118299 [en_US]
dc.language.iso: eng [en_US]
dc.publisher: Rice University [en_US]
dc.rights: Except where otherwise noted, this work is licensed under a Creative Commons Attribution-NonCommercial (CC BY-NC) license. Permission to reuse, publish, or reproduce the work beyond the terms of the license or beyond the bounds of fair use or other exemptions to copyright law must be obtained from the copyright holder. [en_US]
dc.rights.uri: https://creativecommons.org/licenses/by-nc/4.0/ [en_US]
dc.subject.keyword: computer vision [en_US]
dc.subject.keyword: science policy [en_US]
dc.subject.keyword: digital humanities [en_US]
dc.title: Employing ML Methods on Digitized FOIA Requests for Improved Discoverability and Policy Research [en_US]
dc.type: Presentation [en_US]
dc.type.dcmi: Text [en_US]
Files
Original bundle
Name: seaton-TXLA.pdf
Size: 1.36 MB
Format: Adobe Portable Document Format