run-llama · neerajprad · Feb 24, 2025
diff --git a/.gitignore b/.gitignore
@@ -3,3 +3,5 @@ __pycache__/
 *.pyc
 .DS_Store
 .idea
+.env*
+.ipynb_checkpoints*
diff --git a/README.md b/README.md
@@ -10,7 +10,7 @@ This includes:
 
 - [LlamaParse](./parse.md) - A GenAI-native document parser that can parse complex document data for any downstream LLM use case (Agents, RAG, data processing, etc.).
 - [LlamaReport (beta/invite-only)](./report.md) - A prebuilt agentic report builder that can be used to build reports from a variety of data sources.
-- [LlamaExtract (coming soon!)]() - A prebuilt agentic data extractor that can be used to transform data into a structured JSON representation.
+- [LlamaExtract (beta/invite-only)](./llama_cloud_services/extract/README.md) - A prebuilt agentic data extractor that can be used to transform data into a structured JSON representation.
 
 ## Getting Started
 
@@ -25,10 +25,11 @@ Then, get your API key from [LlamaCloud](https://cloud.llamaindex.ai/).
 Then, you can use the services in your code:
 
 ```python
-from llama_cloud_services import LlamaParse, LlamaReport
+from llama_cloud_services import LlamaParse, LlamaReport, LlamaExtract
 
 parser = LlamaParse(api_key="YOUR_API_KEY")
 report = LlamaReport(api_key="YOUR_API_KEY")
+extract = LlamaExtract(api_key="YOUR_API_KEY")
 ```
 
 See the quickstart guides for each service for more information:

diff --git a/examples/extract/data/resumes/ai_researcher.pdf b/examples/extract/data/resumes/ai_researcher.pdf
diff --git a/examples/extract/data/resumes/ml_engineer.pdf b/examples/extract/data/resumes/ml_engineer.pdf
diff --git a/examples/extract/data/resumes/software_architect.pdf b/examples/extract/data/resumes/software_architect.pdf
diff --git a/examples/extract/resume_screening.ipynb b/examples/extract/resume_screening.ipynb
diff --git a/llama_cloud_services/__init__.py b/llama_cloud_services/__init__.py
@@ -1,8 +1,11 @@
 from llama_cloud_services.parse import LlamaParse
 from llama_cloud_services.report import ReportClient, LlamaReport
+from llama_cloud_services.extract import LlamaExtract, ExtractionAgent
 
 __all__ = [
     "LlamaParse",
     "ReportClient",
     "LlamaReport",
+    "LlamaExtract",
+    "ExtractionAgent",
 ]
diff --git a/llama_cloud_services/extract/README.md b/llama_cloud_services/extract/README.md
@@ -0,0 +1,185 @@
+# LlamaExtract
+
+> **⚠️ EXPERIMENTAL**
+> This library is under active development with frequent breaking changes. APIs and functionality may change significantly between versions. If you're interested in being an early adopter, please contact us at [[email protected]](mailto:[email protected]) or join our [Discord](https://discord.com/invite/eN6D2HQ4aX).
+
+LlamaExtract provides a simple API for extracting structured data from unstructured documents like PDFs, text files and images (upcoming).
+
+## Quick Start
+
+```python
+from llama_cloud_services import LlamaExtract
+from pydantic import BaseModel, Field
+
+# Initialize client
+extractor = LlamaExtract()
+
+
+# Define schema using Pydantic
+class Resume(BaseModel):
+    name: str = Field(description="Full name of candidate")
+    email: str = Field(description="Email address")
+    skills: list[str] = Field(description="Technical skills and technologies")
+
+
+# Create extraction agent
+agent = extractor.create_agent(name="resume-parser", data_schema=Resume)
+
+# Extract data from document
+result = agent.extract("resume.pdf")
+print(result.data)
+```
+
+## Core Concepts
+
+- **Extraction Agents**: Reusable extractors configured with a specific schema and extraction settings.
+- **Data Schema**: Structure definition for the data you want to extract.
+- **Extraction Jobs**: Asynchronous extraction tasks that can be monitored.
+
+## Defining Schemas
+
+Schemas can be defined using either Pydantic models or JSON Schema:
+
+### Using Pydantic (Recommended)
+
+```python
+from pydantic import BaseModel, Field
+from typing import List, Optional
+
+
+class Experience(BaseModel):
+    company: str = Field(description="Company name")
+    title: str = Field(description="Job title")
+    start_date: Optional[str] = Field(description="Start date of employment")
+    end_date: Optional[str] = Field(description="End date of employment")
+
+
+class Resume(BaseModel):
+    name: str = Field(description="Candidate name")
+    experience: List[Experience] = Field(description="Work history")
+```
+
+### Using JSON Schema
+
+```python
+schema = {
+    "type": "object",
+    "properties": {
+        "name": {"type": "string", "description": "Candidate name"},
+        "experience": {
+            "type": "array",
+            "description": "Work history",
+            "items": {
+                "type": "object",
+                "properties": {
+                    "company": {
+                        "type": "string",
+                        "description": "Company name",
+                    },
+                    "title": {"type": "string", "description": "Job title"},
+                    "start_date": {
+                        "anyOf": [{"type": "string"}, {"type": "null"}],
+                        "description": "Start date of employment",
+                    },
+                    "end_date": {
+                        "anyOf": [{"type": "string"}, {"type": "null"}],
+                        "description": "End date of employment",
+                    },
+                },
+            },
+        },
+    },
+}
+
+agent = extractor.create_agent(name="resume-parser", data_schema=schema)
+```
+
+### Important restrictions on JSON/Pydantic Schema
+
+*LlamaExtract only supports a subset of the JSON Schema specification.* While limited, it should
+be sufficient for a wide variety of use-cases.
+
+  - All fields are required by default. Nullable fields must be explicitly marked as such,
+  using `"anyOf"` with a `"null"` type. See the `"start_date"` field above.
+  - The root node must be of type `"object"`.
+  - Schema nesting is limited to 5 levels.
+  - The fields that matter are key names/titles, types and descriptions. Fields for
+  formatting, default values, etc. are not supported.
+  - There are other restrictions on the number of keys, the size of the schema, etc. that you may
+  hit for complex extraction use cases. In such cases, consider restructuring
+  your extraction workflow to fit within these constraints, e.g. by extracting subsets of fields
+  and merging them afterwards.
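The root-type and nesting restrictions above can be checked locally before a schema is sent to the service. The following is a minimal sketch using only the standard library. The helper names `schema_depth` and `check_schema` are hypothetical and not part of LlamaExtract, and the way levels are counted (each `"object"` node is one level; arrays and `anyOf` branches add none) is an assumption, since the README does not say how the service counts them.

```python
from typing import Any

MAX_DEPTH = 5  # limit quoted in the README; the counting rule below is an assumption


def schema_depth(node: dict[str, Any]) -> int:
    """Return the object-nesting depth of a JSON Schema node.

    Assumption: every "object" node counts as one level, while arrays and
    anyOf branches are transparent (they add no level of their own).
    """
    children: list[dict[str, Any]] = []
    children.extend(node.get("properties", {}).values())
    if isinstance(node.get("items"), dict):
        children.append(node["items"])
    children.extend(b for b in node.get("anyOf", []) if isinstance(b, dict))
    deepest_child = max((schema_depth(c) for c in children), default=0)
    own_level = 1 if node.get("type") == "object" else 0
    return own_level + deepest_child


def check_schema(schema: dict[str, Any]) -> list[str]:
    """Collect readable problems instead of raising on the first one."""
    problems = []
    if schema.get("type") != "object":
        problems.append('root node must be of type "object"')
    depth = schema_depth(schema)
    if depth > MAX_DEPTH:
        problems.append(f"nesting depth {depth} exceeds {MAX_DEPTH}")
    return problems


# A trimmed version of the resume schema from the README.
resume_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Candidate name"},
        "experience": {
            "type": "array",
            "description": "Work history",
            "items": {
                "type": "object",
                "properties": {
                    "company": {"type": "string", "description": "Company name"},
                },
            },
        },
    },
}

print(check_schema(resume_schema))  # no problems expected for this schema
print(check_schema({"type": "array", "items": {"type": "string"}}))
```

A check like this only covers the two restrictions that are stated precisely; the limits on key count and schema size are not specified in the README, so they are left out.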
+
+## Other Extraction APIs
+
+### Batch Processing
+
+Process multiple files asynchronously:
+
+```python
+# Queue multiple files for extraction (top-level `await` works in a notebook;
+# in a script, call this from inside an async function)
+jobs = await agent.queue_extraction(["resume1.pdf", "resume2.pdf"])
+
+# Check job status
+for job in jobs:
+    status = agent.get_extraction_job(job.id).status
+    print(f"Job {job.id}: {status}")
+
+# Get results when complete
+results = [agent.get_extraction_run_for_job(job.id) for job in jobs]
+```
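The snippet above checks each job's status once. In practice a caller usually polls until every job has finished before fetching results. Below is a minimal, library-independent sketch of such a loop. `wait_for_jobs` is a hypothetical helper, and the status strings `"PENDING"`, `"SUCCESS"` and `"ERROR"` are assumptions for illustration; the values actually returned by the service may differ.

```python
import time
from typing import Callable, Iterable


def wait_for_jobs(
    job_ids: Iterable[str],
    get_status: Callable[[str], str],
    done_states: frozenset[str] = frozenset({"SUCCESS", "ERROR"}),  # assumed names
    poll_interval: float = 2.0,
    timeout: float = 300.0,
) -> dict[str, str]:
    """Poll get_status until every job reaches a state in done_states.

    Returns a mapping of job id to final status. Raises TimeoutError if
    some jobs are still pending once the timeout has elapsed.
    """
    pending = set(job_ids)
    final: dict[str, str] = {}
    deadline = time.monotonic() + timeout
    while pending:
        for job_id in list(pending):
            status = get_status(job_id)
            if status in done_states:
                final[job_id] = status
                pending.discard(job_id)
        if not pending:
            break
        if time.monotonic() >= deadline:
            raise TimeoutError(f"jobs still pending: {sorted(pending)}")
        time.sleep(poll_interval)
    return final


# Stand-in for the service: each job reports a scripted sequence of statuses.
_fake = {
    "job-1": iter(["PENDING", "SUCCESS"]),
    "job-2": iter(["PENDING", "PENDING", "ERROR"]),
}
statuses = wait_for_jobs(_fake, lambda job_id: next(_fake[job_id]), poll_interval=0.0)
print(statuses)
```

With a real agent, `get_status` could wrap the call shown in the README, for example `lambda job_id: agent.get_extraction_job(job_id).status`. If the status is an enum rather than a string, `done_states` would need to hold the matching enum members.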
+
+### Updating Schemas
+
+Schemas can be modified and updated after creation:
+
+```python
+# Update schema
+agent.data_schema = new_schema
+
+# Save changes
+agent.save()
+```
+
+### Managing Agents
+
+```python
+# List all agents
+agents = extractor.list_agents()
+
+# Get specific agent
+agent = extractor.get_agent(name="resume-parser")
+
+# Delete agent
+extractor.delete_agent(agent.id)
+```
+
+## Installation
+
+```bash
+pip install llama-cloud-services
+```
+
+## Tips & Best Practices
+
+1. **Schema Design**:
+   - Try to limit schema nesting to 3-4 levels.
+   - Make fields optional when data might not always be present. Having required fields may force the model
+   to hallucinate when these fields are not present in the documents.
+   - When you want to extract a variable number of entities, use an `array` type. Note that you cannot use
+   an `array` type for the root node.
+   - Use descriptive field names and detailed descriptions. Use descriptions to pass formatting
+   instructions or few-shot examples.
+   - Start simple and iteratively build your schema to incorporate requirements.
+
+2. **Running Extractions**:
+   - Note that reassigning `agent.data_schema` does not save the schema to the database
+   until you call `agent.save()`, but the new schema will be used for running extractions.
+   - Check job status prior to accessing results. Any extraction error should be available as
+   part of `job.error` or `extraction_run.error` fields for debugging.
+   - Consider async operations (`queue_extraction`) for large-scale extraction once you have finalized your schema.
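The tip about making fields optional can be applied to an existing JSON Schema mechanically, using the `"anyOf"` with `"null"` form that the README requires for nullable fields. The sketch below uses only the standard library. `make_nullable` is a hypothetical helper, not a LlamaExtract function, and it assumes the field lives directly under the root `"properties"`.

```python
import copy
from typing import Any


def make_nullable(schema: dict[str, Any], field: str) -> dict[str, Any]:
    """Return a copy of schema in which properties[field] also accepts null.

    The original type definition moves into an anyOf branch next to
    {"type": "null"}; the description stays at the property level, matching
    the start_date/end_date layout in the README. The input is not modified.
    """
    updated = copy.deepcopy(schema)
    prop = updated["properties"][field]
    branches = prop.get("anyOf")
    if branches is not None:
        # Already an anyOf: only add the null branch if it is missing.
        if not any(b.get("type") == "null" for b in branches):
            branches.append({"type": "null"})
        return updated
    description = prop.pop("description", None)
    type_branch = dict(prop)  # everything except the description
    prop.clear()
    prop["anyOf"] = [type_branch, {"type": "null"}]
    if description is not None:
        prop["description"] = description
    return updated


contact_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string", "description": "Full name of candidate"},
        "email": {"type": "string", "description": "Email address"},
    },
}

relaxed = make_nullable(contact_schema, "email")
print(relaxed["properties"]["email"])
```

Returning a copy rather than mutating in place keeps the original schema available for comparison while iterating on a design.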
+
+## Additional Resources
+
+- [Example Notebook](../../examples/extract/resume_screening.ipynb) - Detailed walkthrough of resume parsing
+- [Discord Community](https://discord.com/invite/eN6D2HQ4aX) - Get help and share feedback
diff --git a/llama_cloud_services/extract/__init__.py b/llama_cloud_services/extract/__init__.py
@@ -0,0 +1,3 @@
+from llama_cloud_services.extract.extract import LlamaExtract, ExtractionAgent
+
+__all__ = ["LlamaExtract", "ExtractionAgent"]