· hands on
Analyzing PDFs with ChatGPT Using OpenAI's Vision API
Learn how to convert PDFs to images using Node.js and analyze them with OpenAI's Vision API. This process ensures privacy as images are deleted after analysis.
ChatGPT's web interface enables users to upload files from their computer, Google Drive, or Microsoft OneDrive. This functionality allows you to upload PDFs and ask questions about those documents. For instance, you can upload your insurance documents and ask whether you are covered in specific situations. The OpenAI API also lets you upload files, but you'll need to manage file retrieval and deletion yourself.
In this tutorial, I will present an alternative approach: converting a PDF to an image that can be analyzed using OpenAI's Vision API. The main advantage is that once the image is processed by the model, it is deleted from OpenAI servers and not retained.
Contents
- Convert PDFs to Images with Node.js
- Analyzing Images with ChatGPT
- Combine PDF Conversion and Image Analysis
Convert PDFs to Images with Node.js
While ChatGPT can handle PDFs directly in its web interface, using the API requires a bit of extra work. By converting a PDF into an image, you can take advantage of the Vision API to extract information from the document. After the image has been processed, it is deleted from OpenAI's servers, ensuring privacy and security.
To start, we'll convert a PDF (Portable Document Format) into a PNG (Portable Network Graphics) using the pdf-img-convert library, which is based on Mozilla's PDF.js.
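A minimal sketch of that conversion, assuming pdf-img-convert's convert() function and a placeholder sample.pdf in the working directory, could look like this:

```javascript
// convert-pdf.js: converts each PDF page to a PNG file on disk
// (sample.pdf and the output file names are placeholders)
const pdf2img = require('pdf-img-convert');
const fs = require('node:fs/promises');

async function convertPdfToPng(pdfPath) {
  // convert() resolves to an array with one PNG (as a Uint8Array) per page
  const pages = await pdf2img.convert(pdfPath);

  await Promise.all(
    pages.map((page, index) => fs.writeFile(`page-${index + 1}.png`, page))
  );
}

convertPdfToPng('sample.pdf').catch(console.error);
```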
This code snippet converts each page of a PDF document into a separate PNG image, saved locally.
Analyzing Images with ChatGPT
OpenAI offers vision capabilities to understand images. By using the Vision API, you can send image URLs or Base64-encoded images to ChatGPT. In return, you'll receive answers to your questions about the image.
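Here is a minimal sketch using the official openai Node.js library, assuming a publicly reachable image URL; the model name is a placeholder for any vision-capable model:

```javascript
// analyze-image.js: asks a question about an image hosted at a public URL
const OpenAI = require('openai');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function describeImage(imageUrl) {
  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // placeholder: any vision-capable model works
    messages: [
      {
        role: 'user',
        content: [
          { type: 'text', text: 'What do you see in this image?' },
          { type: 'image_url', image_url: { url: imageUrl } },
        ],
      },
    ],
  });

  return response.choices[0].message.content;
}

describeImage('https://example.com/document.png').then(console.log);
```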
Be sure to phrase your questions around the image's content. According to the "Managing images" documentation, images uploaded via the OpenAI API are not used to train OpenAI's models. Once an image has been processed, it is deleted from OpenAI's servers.
Combine PDF Conversion and Image Analysis
Let's combine the previous steps to create a tool that converts PDFs to images and then uses the chat completion functionality to analyze them. Since we don't have public URLs for our images, we'll encode them into Base64. We'll also ensure the images are high-resolution to improve analysis accuracy.
Game Plan:
- Convert the PDF to PNG images
- Encode the images to Base64
- Send the Base64-encoded images to OpenAI for analysis
PDF Conversion
The OpenAI documentation states that for high-resolution mode (detail: 'high' in the chat configuration), the image's short side should be less than 768px and the long side should be less than 2,000px. Since paper documents are taller than they are wide, we'll limit the height to 1,998px, which keeps it under the 2,000px limit while remaining an even number.
Additionally, we'll instruct pdf-img-convert to return a Base64-encoded version of our converted images.
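A sketch of the conversion step under those constraints, assuming pdf-img-convert's height and base64 conversion options (the file path is a placeholder):

```javascript
// pdf-to-base64.js: converts a PDF into Base64-encoded PNGs, one per page
const pdf2img = require('pdf-img-convert');

async function pdfToBase64Images(pdfPath) {
  // height: 1998 keeps the long side under the 2,000px limit;
  // base64: true makes each page a Base64 string instead of a Uint8Array
  return pdf2img.convert(pdfPath, {
    height: 1998,
    base64: true,
  });
}

module.exports = { pdfToBase64Images };
```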
Base64 Conversion
With the Base64-encoded images ready, we can now map them into OpenAI chat messages. To supply Base64-encoded data, we use the scheme data:image/png;base64,OURBASE64IMAGE. This is the syntax for inline images using Data URLs.
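The mapping might look like this minimal sketch, where buildImageMessages is a hypothetical helper name and detail: 'high' enables high-resolution mode:

```javascript
// messages.js: turns Base64-encoded PNGs into image_url content parts
function buildImageMessages(base64Images) {
  return base64Images.map((image) => ({
    type: 'image_url',
    image_url: {
      // inline each PNG as a Data URL
      url: `data:image/png;base64,${image}`,
      detail: 'high',
    },
  }));
}

module.exports = { buildImageMessages };
```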
PDF Analysis
With our prompts ready, we can now put everything together. Remember, our text message should ask about the images rather than the PDF: instead of "What do you see in this PDF?", ask "What do you see in these images?".
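Putting the pieces together might look like the following sketch; the model name, file path, and question are placeholders, and the conversion options are assumed from pdf-img-convert's config:

```javascript
// analyze-pdf.js: converts a PDF to images and asks the Vision API about them
const pdf2img = require('pdf-img-convert');
const OpenAI = require('openai');

const openai = new OpenAI(); // reads OPENAI_API_KEY from the environment

async function analyzePdf(pdfPath, question) {
  // 1. Convert the PDF to Base64-encoded PNGs, one per page
  const pages = await pdf2img.convert(pdfPath, { height: 1998, base64: true });

  // 2. Map each page into an image_url content part using a Data URL
  const imageParts = pages.map((page) => ({
    type: 'image_url',
    image_url: { url: `data:image/png;base64,${page}`, detail: 'high' },
  }));

  // 3. Send the question plus the images to the chat completions endpoint
  const response = await openai.chat.completions.create({
    model: 'gpt-4o', // placeholder: any vision-capable model works
    messages: [
      {
        role: 'user',
        content: [{ type: 'text', text: question }, ...imageParts],
      },
    ],
  });

  return response.choices[0].message.content;
}

analyzePdf('insurance.pdf', 'What do you see in these images? Am I covered for water damage?')
  .then(console.log)
  .catch(console.error);
```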