An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval

Chirag Jindal; Satyam Gupta; Jyoti Mehra; Tushar Sharma; Pulkit Aggrawal

Research Article

Year 2024, Volume: 7 Issue: 2, 16 - 28

Abstract

A central problem in knowledge mining is how to extract correct and useful information from unstructured PDF documents. This paper proposes a new PDF querying system that combines an OpenAI LLM with LangChain for contextual understanding of PDF content. The system operates in multiple stages: extracting text, generating embeddings, and storing the embeddings in a vector database. Users pose natural language queries, which LangChain's conversational chain handles by retrieving text chunks, assembling context, and optimizing the prompt. OpenAI's LLM then processes the resulting input to produce factual and suitable output. Experiments on a variety of PDF materials show higher accuracy, more relevant search results, and greater user satisfaction than conventional keyword-based search. LangChain helps enrich the textual meaning obtained from the OpenAI model, and its contextual reasoning helps extract structured information from text efficiently. This approach has innovative use cases in science, law, and finance, where it allows easy access to the large amounts of information held in PDFs. By applying NLP, the proposed system enables effective search and enhanced learning from data that is rarely managed in structured form.
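
The stages named in the abstract (chunking extracted text, embedding, storing vectors, retrieving chunks for a query, and assembling a prompt) can be illustrated with a minimal sketch. This is not the authors' implementation and it does not call LangChain or the OpenAI API. It assumes the PDF text has already been extracted into strings, uses a toy bag-of-words vector in place of a learned embedding model, and uses an in-memory list in place of a vector database. All names (chunk_text, embed, InMemoryVectorStore, build_prompt), the chunk sizes, and the sample pages are hypothetical.

```python
import math
import re
from collections import Counter


def chunk_text(text, chunk_size=40, overlap=10):
    """Split text into overlapping word windows (sizes are illustrative)."""
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last window already reached the end of the text
    return chunks


def embed(text):
    """Toy embedding: token counts. ASSUMPTION: a stand-in for a learned
    embedding model; it captures word overlap, not semantics."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class InMemoryVectorStore:
    """ASSUMPTION: a stand-in for a vector database."""

    def __init__(self):
        self.items = []  # list of (chunk, vector) pairs

    def add(self, chunks):
        for chunk in chunks:
            self.items.append((chunk, embed(chunk)))

    def search(self, query, k=2):
        """Return the k stored chunks most similar to the query."""
        query_vec = embed(query)
        ranked = sorted(self.items,
                        key=lambda item: cosine(query_vec, item[1]),
                        reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Combine retrieved chunks and the question into a single LLM prompt."""
    context = "\n---\n".join(context_chunks)
    return ("Answer using only the context below.\n\n"
            f"Context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")


# Hypothetical page texts standing in for text extracted from a PDF.
pages = [
    "The lease term is five years starting in March.",
    "Payment is due on the first day of each month.",
    "Either party may terminate with ninety days notice.",
]

store = InMemoryVectorStore()
for page in pages:
    store.add(chunk_text(page))

question = "When is payment due?"
top_chunks = store.search(question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # in a full system this prompt would be sent to the LLM
```

In a full system along the lines the abstract describes, the embed function and the store would presumably be replaced by an embedding model and a vector database, and the assembled prompt would be passed to the LLM; the retrieval logic would keep the same shape.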

Keywords

PDF Querying, NLP, Language Models, Contextual AI, LangChain, Document Understanding, Semantic Search

Supporting Institution

Université 8 Mai 1945 Guelma

Details

Primary Language	English
Subjects	Natural Language Processing, Autonomous Agents and Multiagent Systems
Journal Section	Articles
Authors	Chirag Jindal, Satyam Gupta, Jyoti Mehra, Tushar Sharma, Pulkit Aggrawal
Early Pub Date	January 30, 2025
Publication Date
Submission Date	December 19, 2024
Acceptance Date	January 11, 2025
Published in Issue	Year 2024 Volume: 7 Issue: 2

Cite

APA	Jindal, C., Gupta, S., Mehra, J., Sharma, T., et al. (2025). An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval. International Journal of Informatics and Applied Mathematics, 7(2), 16-28.
AMA	Jindal C, Gupta S, Mehra J, Sharma T, Aggrawal P. An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval. IJIAM. January 2025;7(2):16-28.
Chicago	Jindal, Chirag, Satyam Gupta, Jyoti Mehra, Tushar Sharma, and Pulkit Aggrawal. “An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval”. International Journal of Informatics and Applied Mathematics 7, no. 2 (January 2025): 16-28.
EndNote	Jindal C, Gupta S, Mehra J, Sharma T, Aggrawal P (January 1, 2025) An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval. International Journal of Informatics and Applied Mathematics 7 2 16–28.
IEEE	C. Jindal, S. Gupta, J. Mehra, T. Sharma, and P. Aggrawal, “An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval”, IJIAM, vol. 7, no. 2, pp. 16–28, 2025.
ISNAD	Jindal, Chirag et al. “An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval”. International Journal of Informatics and Applied Mathematics 7/2 (January 2025), 16-28.
JAMA	Jindal C, Gupta S, Mehra J, Sharma T, Aggrawal P. An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval. IJIAM. 2025;7:16–28.
MLA	Jindal, Chirag et al. “An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval”. International Journal of Informatics and Applied Mathematics, vol. 7, no. 2, 2025, pp. 16-28.
Vancouver	Jindal C, Gupta S, Mehra J, Sharma T, Aggrawal P. An AI-Driven PDF Query System Leveraging OpenAI LLM and LangChain for Enhanced Data Retrieval. IJIAM. 2025;7(2):16-28.