files |
File | Blob | shared.Files |
✔️ |
The file to extract |
|
chunkingStrategy |
shared.ChunkingStrategy |
➖ |
Use one of the supported strategies to chunk the returned elements. Currently supports: 'basic', 'by_page', 'by_similarity', or 'by_title' |
|
combineUnderNChars |
number |
➖ |
If chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500 |
|
coordinates |
boolean |
➖ |
If true, return coordinates for each element. Default: false |
|
encoding |
string |
➖ |
The encoding method used to decode the text input. Default: utf-8 |
|
extractImageBlockTypes |
string[] |
➖ |
The types of elements to extract, for use in extracting image blocks as base64 encoded data stored in metadata fields |
|
gzUncompressedContentType |
string |
➖ |
If the file is gzipped, use this content type after unzipping |
|
hiResModelName |
string |
➖ |
The name of the inference model used when strategy is hi_res |
|
includeOrigElements |
boolean |
➖ |
When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as .metadata.orig_elements. Default: true |
|
includePageBreaks |
boolean |
➖ |
If True, the output will include page breaks if the filetype supports it. Default: false |
|
languages |
string[] |
➖ |
The languages present in the document, for use in partitioning and/or OCR |
|
maxCharacters |
number |
➖ |
If chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 500 |
|
multipageSections |
boolean |
➖ |
If chunking strategy is set, determines whether sections can span multiple pages. Default: true |
|
newAfterNChars |
number |
➖ |
If chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500 |
|
ocrLanguages |
string[] |
➖ |
The languages present in the document, for use in partitioning and/or OCR |
|
outputFormat |
shared.OutputFormat |
➖ |
The format of the response. Supported formats are application/json and text/csv. Default: application/json. |
|
overlap |
number |
➖ |
Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks where an oversized element is divided into multiple chunks by text-splitting. Default: 0 |
|
overlapAll |
boolean |
➖ |
When True, apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: False |
|
pdfInferTableStructure |
boolean |
➖ |
Deprecated! Use skip_infer_table_types to opt out of table extraction for any file type. If False and strategy=hi_res, no Table Elements will be extracted from pdf files regardless of skip_infer_table_types contents. |
|
similarityThreshold |
number |
➖ |
A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value only guarantees that two elements with similarity below the threshold will appear in separate chunks. |
|
skipInferTableTypes |
string[] |
➖ |
The document types for which table extraction should be skipped. Default: [] |
|
splitPdfConcurrencyLevel |
number |
➖ |
Maximum number of concurrent requests made when splitting a PDF. Ignored on backend. |
|
splitPdfPage |
boolean |
➖ |
Whether the PDF file should be split on the client. Ignored on backend. |
|
startingPageNumber |
number |
➖ |
When a PDF is split into pages before being sent to the API, providing this value allows page numbers to be assigned correctly. Introduced in 1.0.27. |
|
strategy |
shared.Strategy |
➖ |
The strategy to use for partitioning PDF/image. Options are fast, hi_res, auto. Default: auto |
auto |
uniqueElementIds |
boolean |
➖ |
When True, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 of element text is used. Default: False |
|
xmlKeepTags |
boolean |
➖ |
If True, will retain the XML tags in the output. Otherwise it will simply extract the text from within the tags. Only applies to partition_xml. |
|
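
Several descriptions in the table depend on other fields: the size limits only apply once a chunking strategy is set, and similarityThreshold is bounded to the range 0.0 to 1.0. The sketch below shows one way to check those constraints on the client before sending a request. It is an illustration only: `PartitionOptionsSketch` and `checkPartitionOptions` are hypothetical names covering a subset of the fields above, not types or functions from the SDK.

```typescript
// Local stand-in for a subset of the parameters in the table above.
// Hypothetical type for illustration; this is NOT the SDK's own request type.
type ChunkingStrategy = "basic" | "by_page" | "by_similarity" | "by_title";

interface PartitionOptionsSketch {
  chunkingStrategy?: ChunkingStrategy;
  combineUnderNChars?: number;
  maxCharacters?: number;
  newAfterNChars?: number;
  similarityThreshold?: number;
}

// Some of the fields the table describes as applying only
// "if chunking strategy is set".
const CHUNKING_ONLY_FIELDS = [
  "combineUnderNChars",
  "maxCharacters",
  "newAfterNChars",
] as const;

// Returns human-readable warnings; an empty array means no issue was found.
function checkPartitionOptions(opts: PartitionOptionsSketch): string[] {
  const warnings: string[] = [];

  // Size limits are only meaningful together with a chunking strategy.
  if (opts.chunkingStrategy === undefined) {
    for (const field of CHUNKING_ONLY_FIELDS) {
      if (opts[field] !== undefined) {
        warnings.push(`${field} has no effect unless chunkingStrategy is set`);
      }
    }
  }

  // The table bounds similarityThreshold to the range 0.0 to 1.0.
  const threshold = opts.similarityThreshold;
  if (threshold !== undefined && (threshold < 0 || threshold > 1)) {
    warnings.push("similarityThreshold must be between 0.0 and 1.0");
  }

  return warnings;
}

// Example: a size limit without a strategy, plus an out-of-range threshold.
const warnings = checkPartitionOptions({
  maxCharacters: 1000,
  similarityThreshold: 1.5,
});
console.log(warnings);
```

Returning warnings rather than throwing keeps the check advisory, which suits parameters the table marks as optional.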