Install poppler on the system and make sure its binaries are available on the PATH. See the pdf2image documentation for platform-specific instructions.
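A quick way to confirm that poppler is reachable is to look up its command-line tools on the PATH. The sketch below is not part of pyzerox or pdf2image. It assumes the two poppler tools that matter are `pdftoppm` and `pdfinfo` (the ones pdf2image calls), and the helper names `find_poppler_tools` and `missing_poppler_tools` are hypothetical.

```python
import shutil

# Poppler command-line tools that pdf2image relies on.
# Assumption: checking these two is enough for a basic sanity check.
POPPLER_TOOLS = ("pdftoppm", "pdfinfo")


def find_poppler_tools(tools=POPPLER_TOOLS):
    """Map each tool name to its resolved path on PATH, or None if not found."""
    return {tool: shutil.which(tool) for tool in tools}


def missing_poppler_tools(tools=POPPLER_TOOLS):
    """Return the names of the tools that could not be found on PATH."""
    return [tool for tool, path in find_poppler_tools(tools).items() if path is None]


if __name__ == "__main__":
    missing = missing_poppler_tools()
    if missing:
        print("Poppler tools not found on PATH:", ", ".join(missing))
    else:
        print("Poppler tools found:", find_poppler_tools())
```

If any tool is reported missing, install poppler or add its `bin` directory to the PATH before running the conversion.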
The pyzerox.zerox function is an asynchronous API that performs OCR (Optical Character Recognition) on PDF files using vision models and returns the result as markdown. Set the environment variables for the model and the model provider before calling it.
Refer to the LiteLLM Documentation for setting up the environment and passing the correct model name.
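Because a missing or empty API key usually only shows up as an error once the first request is made, it can help to check the environment up front. The sketch below is an illustration, not part of pyzerox or LiteLLM. The `REQUIRED_ENV_VARS` mapping and the `missing_env_vars` helper are hypothetical, and the variable names are the ones used in this document's provider examples.

```python
import os

# Environment variables used by the provider examples in this document.
# Assumption: this mapping is illustrative; check the LiteLLM documentation
# for the authoritative list for your provider.
REQUIRED_ENV_VARS = {
    "openai": ["OPENAI_API_KEY"],
    "azure": ["AZURE_API_KEY", "AZURE_API_BASE", "AZURE_API_VERSION"],
    "gemini": ["GEMINI_API_KEY"],
    "anthropic": ["ANTHROPIC_API_KEY"],
}


def missing_env_vars(provider, environ=None):
    """Return the required variables for `provider` that are unset or empty."""
    environ = os.environ if environ is None else environ
    required = REQUIRED_ENV_VARS.get(provider, [])
    return [name for name in required if not environ.get(name)]


if __name__ == "__main__":
    for provider in REQUIRED_ENV_VARS:
        print(provider, "missing:", missing_env_vars(provider))
```

An empty string counts as missing here, which catches placeholder assignments such as `os.environ["OPENAI_API_KEY"] = ""` that were never filled in.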
from pyzerox import zerox
import os
import json
import asyncio

### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ###

## placeholder for additional model kwargs which might be required for some models
kwargs = {}

## system prompt to use for the vision model
custom_system_prompt = None

# to override
# custom_system_prompt = "For the below PDF page, do something..something..." ## example

## NOTE: keep only the provider block you need; each block below reassigns `model`,
## so the last one left in place wins.

###################### Example for OpenAI ######################
model = "gpt-4o-mini" ## openai model
os.environ["OPENAI_API_KEY"] = "" ## your-api-key

###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini" ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = "" # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = "" # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "" # "2023-05-15"

###################### Example for Gemini ######################
model = "gemini/gemini-1.5-flash" ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ['GEMINI_API_KEY'] = "" # your-gemini-api-key

###################### Example for Anthropic ######################
model = "claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = "" # your-anthropic-api-key

###################### Vertex ai ######################
model = "vertex_ai/gemini-1.5-flash-001" ## "vertex_ai/<model_name>" -> format <provider>/<model>
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'

# Load the JSON file
with open(file_path, 'r') as file:
    vertex_credentials = json.load(file)

# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)

vertex_credentials = vertex_credentials_json

## extra args
kwargs = {"vertex_credentials": vertex_credentials}

###################### For other providers refer: https://docs.litellm.ai/docs/providers ######################

# Define main async entrypoint
async def main():
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf" ## local filepath and file URL supported

    ## process only some pages or all
    select_pages = None ## None for all, but could be int or list(int) page numbers (1 indexed)

    output_dir = "./output_test" ## directory to save the consolidated markdown file
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
                         custom_system_prompt=custom_system_prompt, select_pages=select_pages, **kwargs)
    return result

# run the main function:
result = asyncio.run(main())

# print markdown result
print(result)
ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            page=1,
            content='| Type | Description | Wrapper Class |\n' +
                '|---------|--------------------------------------|---------------|\n' +
                '| byte | 8-bit signed 2s complement integer | Byte |\n' +
                '| short | 16-bit signed 2s complement integer | Short |\n' +
                '| int | 32-bit signed 2s complement integer | Integer |\n' +
                '| long | 64-bit signed 2s complement integer | Long |\n' +
                '| float | 32-bit IEEE 754 floating point number| Float |\n' +
                '| double | 64-bit floating point number | Double |\n' +
                '| boolean | may be set to true or false | Boolean |\n' +
                '| char | 16-bit Unicode (UTF-16) character | Character |\n\n' +
                'Table 26.2.: Primitive types in Java\n\n' +
                '### 26.3.1. Declaration & Assignment\n\n' +
                'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                'its identifier. For example:\n\n' +
                '```java\n' +
                'int numUnits;\n' +
                'double costPerUnit;\n' +
                'char firstInitial;\n' +
                'boolean isStudent;\n' +
                '```\n\n' +
                'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
                'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                'we can assign them values:\n\n' +
                '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
                'character (zero in the ASCII table).',
            content_length=2333
        )
    ]
)
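
The result shown above carries one `Page` entry per processed page, so a common next step is to stitch the page contents back into a single markdown string. The sketch below is one way to do that, under assumptions: the `Page` and `ZeroxOutput` dataclasses are local stand-ins that only mirror the field names visible in the printed output (they are not imported from pyzerox), and `pages_to_markdown` and `summarize` are hypothetical helpers.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Page:
    """Stand-in mirroring the fields shown in the example output."""
    page: int
    content: str
    content_length: int


@dataclass
class ZeroxOutput:
    """Stand-in mirroring the fields shown in the example output."""
    completion_time: float
    file_name: str
    input_tokens: int
    output_tokens: int
    pages: List[Page] = field(default_factory=list)


def pages_to_markdown(result, separator="\n\n"):
    """Join page contents in page order into one markdown string."""
    ordered = sorted(result.pages, key=lambda p: p.page)
    return separator.join(p.content for p in ordered)


def summarize(result):
    """Collect a few headline numbers from a result object."""
    return {
        "file_name": result.file_name,
        "page_count": len(result.pages),
        "total_tokens": result.input_tokens + result.output_tokens,
    }


if __name__ == "__main__":
    demo = ZeroxOutput(
        completion_time=1.0,
        file_name="demo",
        input_tokens=10,
        output_tokens=5,
        pages=[Page(2, "second page", 11), Page(1, "# Title", 7)],
    )
    print(pages_to_markdown(demo))
    print(summarize(demo))
```

The helpers only use attribute access (`.pages`, `.page`, `.content`), so they should also work on the object returned by `zerox` as long as it exposes the same field names as the printed output.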