Agentic Document Extraction & Understanding
Extracting and reasoning over tabular data in unstructured (financial) documents remains a challenge in many real-world use cases. Bank statements, balance sheets, and income statements rarely follow a consistent layout: columns shift, headers are implicit, and critical figures may span multiple pages or be expressed in subtly different forms. Vanilla RAG systems also struggle here, not because retrieval fails, but because tables are not text. Chunking inevitably breaks structural relationships, and LLMs are then forced to “reconstruct” numbers and hierarchies from partial context, introducing silent arithmetic errors or outright hallucinations.
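To make the chunking problem concrete, here is a toy illustration (the table contents and chunk size are hypothetical): a fixed-size chunker splits a small Markdown table, and the later chunk ends up carrying bare numbers with no header row to give them meaning.

```python
# A toy Markdown table, as a PDF-to-Markdown converter might emit it.
table = (
    "| Metric | Q3 2024 | Q3 2025 |\n"
    "| --- | --- | --- |\n"
    "| Revenue | 192.6 | 271.0 |\n"
    "| Net income | 23.4 | 292.2 |\n"
)

# Naive fixed-size chunking, as used by many vanilla RAG pipelines.
chunk_size = 60
chunks = [table[i:i + chunk_size] for i in range(0, len(table), chunk_size)]

for i, chunk in enumerate(chunks):
    print(f"--- chunk {i} ---")
    print(chunk)

# Only chunk 0 contains the header row; the later chunk carries bare
# numbers whose column meaning ("Q3 2024" vs "Q3 2025") has been lost.
print("header in first chunk:", "Q3 2025" in chunks[0])   # True
print("header in last chunk:", "Q3 2025" in chunks[-1])   # False
```

A retriever that returns only the second chunk hands the LLM numbers stripped of their column semantics, which is exactly where silent errors creep in.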
An agentic approach tackles this problem at the root: instead of retrieving loosely related text, specialized AI agents explicitly detect tables, reconstruct their logical schemas, normalize financial line items, and operate directly on these structured representations. This enables precise querying, deterministic computations, and reasoning over structured financial data, turning arbitrary documents into machine-readable financial objects rather than probabilistic text snippets.
In this post, I will show how to use smolagents, the popular agentic AI framework developed by Hugging Face, to extract tabular data from unstructured documents and reason over it with much better precision and efficiency than a traditional RAG system. Throughout, I will use a single PDF document as a running example: Duolingo’s (my favorite learning app) Q3 2025 Shareholder Letter.
Convert PDF to Markdown
This post assumes PDF documents, since virtually any document can now be converted into PDF format. However, for a document to be readily read and processed by an LLM, it should ideally be in Markdown, the format that most closely resembles the text LLMs were trained on in the first place. One popular free and open-source package for converting PDF to Markdown, and the one I’m using in this post, is marker. Docling is another well-regarded alternative.
Once your PDF document has been converted into Markdown, you can load it in Python like this:
# Read the Markdown text file
with open(file_path, 'r', encoding='utf-8') as file:
    statement = file.read()
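Before handing the Markdown to an agent, a quick sanity check can confirm that the conversion preserved any table structure at all. Here is a minimal sketch; the pipe-counting heuristic below is my own illustration, not part of marker:

```python
import re

def count_table_rows(markdown: str) -> int:
    """Count lines that look like Markdown table rows (start and end with '|')."""
    return sum(
        1
        for line in markdown.splitlines()
        if re.fullmatch(r"\|.*\|", line.strip())
    )

# A tiny sample mixing free text and a table.
sample = "Some free text.\n| A | B |\n| --- | --- |\n| 1 | 2 |\n"
print(count_table_rows(sample))  # 3 table-like rows
```

If this count comes back as zero for a document you know contains tables, the conversion step (not the agent) is the place to debug.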
Extract data frames from Markdown
After converting the PDF into Markdown, the next step is to initiate an LLM-based AI agent and delegate the data frame extraction to it. In order to initiate an agent, we need an API token (or a self-hosted LLM). In this example, I’m using Anthropic’s Claude API, but you can easily switch to any other provider of choice (such as OpenAI). The code to initiate such an agent is given below.
from smolagents import LiteLLMModel, CodeAgent

# Run model via Anthropic API
model = LiteLLMModel(
    model_id="anthropic/claude-3-5-haiku-20241022",
)

agent = CodeAgent(
    model=model,
    tools=[],  # empty tools list
    max_steps=10,
    additional_authorized_imports=[
        "os", "os.path", "posixpath", "pandas", "numpy", "pathlib", "io",
        "math", "random", "re", "time", "itertools", "statistics", "queue",
        "collections", "datetime", "stat", "unicodedata",
    ],
    verbosity_level=1,  # lower verbosity for cleaner output
)
Note that in the code above, I’m using the Claude Haiku family of models for better efficiency (faster token generation) and lower cost, since the task at hand is relatively structured. Notice also that the agent is an instance of CodeAgent, a specialized agent type in the smolagents library that solves problems by generating Python code, powered here by Claude Haiku as the underlying LLM. In this instantiation, I deliberately give it an empty list of tools and, instead, a list of authorized Python packages it may import. A CodeAgent can write the Python code to solve the problem by itself (using the authorized imports) without relying on any external tools, and for this particular problem, no tools are really needed.
We are now ready to craft a prompt for the problem at hand: extracting tabular structures from unstructured text. The prompt below reflects my own approach; prompt engineering is highly personal and depends on how an individual frames the problem. Note that the converted Markdown text is inserted directly into the prompt, and that I specifically ask the agent to dump all extracted data frames as CSV files in a given directory for later use. Finally, because this document is a financial statement, I specifically ask the agent to “equalize” all the units mentioned for a more accurate representation.
task = f"""
The following financial statement, in Markdown format, contains a mix of free text and structured tabular data.
Ignore all free text data and extract all structured tabular data as pandas data frames.
Each extracted table should be represented by a separate pandas data frame.
Name the data frame meaningfully according to what it represents. If a column has no data at all,
you may omit it to simplify the data frame. On the other hand, if a column has full data but it has no name,
you might use the whole column as row indices for the data frame.
Use your sensible judgment, because the Markdown text might contain noise and errors after being converted from the original PDF document.
Last but not least, remember to equalize the units in the data frames. If the statement mentions that some numbers are in thousands
or millions, make sure to multiply those numbers accordingly so that all numbers in the data frames are in the same base units.
Here is the financial statement:
{statement}
After extracting the data frames:
1. Save all data frames to CSV files -- name the files according to the data frame names.
2. Print a summary of what was extracted
Save all CSV files to '<absolute-path-to-directory>' directory.
"""
result = agent.run(task=task)
Executing the prompt above can be quite lengthy and verbose, depending on the maximum number of “steps” (iterations) the agent is allowed in order to fully solve the problem (10 in this case). Nevertheless, after the run completes, I obtain the following log:
Execution logs:
Duolingo Financial Statement Data Extraction - Final Report:
Data Extraction:
Total Data Frames: 8
Total Rows Extracted: 76
CSV Files Created: 8
Key Highlights:
User Growth: Strong 36% increase in Daily Active Users
Financial Performance: Significant 41% revenue growth and 292% net income increase
Operational Metrics: Comprehensive capture of financial and operational data
File Location:
/absolute-path-to-directory/
Data Frames Created:
user_metrics.csv, financial_metrics.csv, summary_metrics.csv, guidance.csv, dilutive_securities.csv,
balance_sheet.csv, income_statement.csv, cash_flow_statement.csv
This tells us that 8 data frames (tables) were successfully extracted from the Markdown document. Notice how meaningfully the extracted CSV files are named as well. This tallies very well with the original document, except for the tables in the appendix.
Query the extracted data frames
Now it’s time for the real action: querying, and reasoning over, the extracted tabular data. But first, we need to load the data frames saved in the previous step. The following code does just that.
import os
import pandas as pd
extraction_dir = "<absolute-path-to-directory>"
csv_files = [f for f in os.listdir(extraction_dir) if f.endswith('.csv')]
print(f"Extracted CSV files: {csv_files}")
extracted_dataframes = {}
for csv_file in csv_files:
    df_name = os.path.splitext(csv_file)[0]
    df_path = os.path.join(extraction_dir, csv_file)
    df = pd.read_csv(df_path)
    extracted_dataframes[df_name] = df
    print(f"Data frame '{df_name}' loaded from '{csv_file}':")
    print(df.shape)
Suppose that we would like to ask the following 3 questions based on the loaded data frames stored in extracted_dataframes:
- How much did cash and cash equivalents increase from Dec 31, 2024 to Sep 30, 2025?
- What is Provision for (benefit from) income taxes in 2025?
- Explain why Net Income is $292.2M in Q3 2025 — what portion is one-time?
We can then craft a simple prompt as below and use the same agent to execute it.
task = """
Given the extracted data frames from a financial statement, answer the following questions:
- How much did cash and cash equivalents increase from Dec 31, 2024 to Sep 30, 2025?
- What is Provision for (benefit from) income taxes in 2025?
- Explain why Net Income is $292.2M in Q3 2025 — what portion is one-time?
For each question, give your justification (such as which data frame and column you used) and show your calculations.
"""
agent.run(
    task,
    additional_args=extracted_dataframes,
)
After the run finishes, the following logs are obtained:
Execution logs:
- Net Income Q3 2024: $23.360M
- Net Income Q3 2025: $292.195M
- Income Tax Benefit: $245.746M
- Operating Income Increase: $21.585M
From these, the agent arrives at the following answers to the questions asked:
- Cash and Cash Equivalents Increase: $225.751M
- Income Tax Provision: -245.746M
- Net Income explanation: Primarily driven by a one-time $245.746M income tax benefit, representing 84% of the Net Income increase from Q3 2024 to Q3 2025.
The interested reader is now encouraged to pause and verify these answers against the original document, to confirm whether these numbers are accurate.
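One part of that verification can be done deterministically with the figures from the execution log above: dividing the one-time tax benefit by Q3 2025 net income indeed yields roughly the 84% the agent reports.

```python
# Figures copied from the agent's execution log (in millions of USD).
net_income_q3_2024 = 23.360
net_income_q3_2025 = 292.195
income_tax_benefit = 245.746

# Share of Q3 2025 net income attributable to the one-time tax benefit.
share = income_tax_benefit / net_income_q3_2025
print(f"One-time benefit as share of net income: {share:.1%}")  # ~84.1%

# Year-over-year net income increase, for context.
increase = net_income_q3_2025 - net_income_q3_2024
print(f"Net income increase vs Q3 2024: ${increase:.3f}M")
```

This is the kind of arithmetic an agent can hand off to plain Python rather than performing it "in its head", which is precisely the appeal of operating on structured data.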
Compare with vanilla RAG
For comparison, I implemented a simple RAG system (with or without re-ranking) over the same financial statement (in Markdown format), using gte-small as the underlying embedding model and FAISS as the vector DB for efficient retrieval. The following notebook implements just that.
In this notebook, I’m using the OpenAI API with gpt-5.1 as the underlying LLM to answer the same 3 questions. This RAG implementation produces the following answers:
USER_QUERY: How much did cash and cash equivalents increase from Dec 31, 2024 to Sep 30, 2025?
ANSWER: $225,751 thousand. This is the net increase in cash, cash equivalents and restricted cash shown for the nine months ended September 30, 2025.
USER_QUERY: What is Provision for (benefit from) income taxes in 2025?
ANSWER: N/A
USER_QUERY: Explain why Net Income is $292.2M in Q3 2025 — what portion is one-time?
ANSWER: Net income of $292.2M in Q3 2025 includes a one-time income tax benefit of $222.7M from releasing the valuation allowance on deferred tax assets, so only about $69.5M is from recurring operations.
This is because the company explicitly discloses that the $222.7M tax benefit is a one-time item recorded in the quarter, inflating GAAP net income for Q3 2025.
As can be seen from the answers, questions 1 and 3 are answered correctly; in fact, the RAG answer to question 3 is even more accurate than the one produced by the agent (because the key figure appears not in the tables but in the text portions of the statement). The answer to question 2, however, is simply given as N/A (not found). The agentic approach insists that the one-time tax benefit is $245.746M (for question 3), since that is the only number it can find in the tables, whereas the more accurate answer, in true accounting terms, is $222.7M, which is stressed in other parts of the statement.
We can thus conclude that even though an agentic approach to this problem is superior to traditional RAG, it cannot completely replace a RAG system, since some answers are embedded not in the tabular parts of the statement but in the free text. A more comprehensive solution is therefore an intelligent hybrid of agents and RAG, so that both kinds of answers can be retrieved.
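As a closing sketch, such a hybrid could start with nothing more than a router that decides, per question, whether to consult the table-focused agent, the RAG pipeline, or both. Everything below (the function name, the keyword lists, the routing rule) is my own naive illustration, not part of smolagents or any RAG library:

```python
def route_question(question: str) -> str:
    """Naive keyword router: send a question to the table-focused agent
    ('agent'), the free-text RAG pipeline ('rag'), or 'both'."""
    q = question.lower()
    numeric_cues = ("how much", "increase", "total", "provision", "portion")
    narrative_cues = ("explain", "why", "one-time", "describe")

    wants_tables = any(cue in q for cue in numeric_cues)
    wants_text = any(cue in q for cue in narrative_cues)

    if wants_tables and wants_text:
        return "both"
    if wants_text:
        return "rag"
    return "agent"  # default to structured table lookup

# The three questions from this post, routed:
print(route_question("How much did cash and cash equivalents increase?"))           # agent
print(route_question("What is Provision for (benefit from) income taxes?"))         # agent
print(route_question("Explain why Net Income is $292.2M; what portion is one-time?"))  # both
```

In a fuller system, a "both" verdict would trigger both pipelines and a final LLM call to reconcile the two answers, which is exactly what question 3 above would have needed.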