# PartitionParameters

## Fields

| Field | Type | Required | Description | Example |
| --- | --- | --- | --- | --- |
| `files` | `File \| Blob \| shared.Files` | ✔️ | The file to extract. | |
| `chunkingStrategy` | `shared.ChunkingStrategy` | | Use one of the supported strategies to chunk the returned elements. Currently supports: 'basic', 'by_page', 'by_similarity', or 'by_title'. | |
| `combineUnderNChars` | `number` | | If a chunking strategy is set, combine elements until a section reaches a length of n chars. Default: 500 | |
| `coordinates` | `boolean` | | If true, return coordinates for each element. Default: false | |
| `encoding` | `string` | | The encoding method used to decode the text input. Default: utf-8 | |
| `extractImageBlockTypes` | `string[]` | | The types of elements to extract, for use in extracting image blocks as base64-encoded data stored in metadata fields. | |
| `gzUncompressedContentType` | `string` | | If the file is gzipped, use this content type after unzipping. | |
| `hiResModelName` | `string` | | The name of the inference model used when strategy is hi_res. | |
| `includeOrigElements` | `boolean` | | When a chunking strategy is specified, each returned chunk will include the elements consolidated to form that chunk as `.metadata.orig_elements`. Default: true | |
| `includePageBreaks` | `boolean` | | If true, the output will include page breaks if the filetype supports it. Default: false | |
| `languages` | `string[]` | | The languages present in the document, for use in partitioning and/or OCR. | |
| `maxCharacters` | `number` | | If a chunking strategy is set, cut off new sections after reaching a length of n chars (hard max). Default: 500 | |
| `multipageSections` | `boolean` | | If a chunking strategy is set, determines whether sections can span multiple pages. Default: true | |
| `newAfterNChars` | `number` | | If a chunking strategy is set, cut off new sections after reaching a length of n chars (soft max). Default: 1500 | |
| `ocrLanguages` | `string[]` | | The languages present in the document, for use in partitioning and/or OCR. | |
| `outputFormat` | `shared.OutputFormat` | | The format of the response. Supported formats are application/json and text/csv. Default: application/json | |
| `overlap` | `number` | | Specifies the length of a string ('tail') to be drawn from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this only applies to split-chunks, where an oversized element is divided into multiple chunks by text-splitting. Default: 0 | |
| `overlapAll` | `boolean` | | When true, also apply overlap between 'normal' chunks formed from whole elements and not subject to text-splitting. Use this with caution, as it entails a certain level of 'pollution' of otherwise clean semantic chunk boundaries. Default: false | |
| `pdfInferTableStructure` | `boolean` | | Deprecated! Use `skip_infer_table_types` to opt out of table extraction for any file type. If false and strategy=hi_res, no Table elements will be extracted from PDF files, regardless of the contents of `skip_infer_table_types`. | |
| `similarityThreshold` | `number` | | A value between 0.0 and 1.0 describing the minimum similarity two elements must have to be included in the same chunk. Note that similar elements may be separated to meet chunk-size criteria; this value can only guarantee that two elements with similarity below the threshold will appear in separate chunks. | |
| `skipInferTableTypes` | `string[]` | | The document types for which table extraction should be skipped. Default: [] | |
| `splitPdfConcurrencyLevel` | `number` | | Maximum number of concurrent requests made when splitting a PDF. Ignored on the backend. | |
| `splitPdfPage` | `boolean` | | Whether the PDF file should be split on the client. Ignored on the backend. | |
| `startingPageNumber` | `number` | | When a PDF is split into pages before being sent to the API, providing this value allows page numbers to be assigned correctly. Introduced in 1.0.27. | |
| `strategy` | `shared.Strategy` | | The strategy to use for partitioning PDF/image. Options are fast, hi_res, auto. Default: auto | auto |
| `uniqueElementIds` | `boolean` | | When true, assign UUIDs to element IDs, which guarantees their uniqueness (useful when using them as primary keys in a database). Otherwise a SHA-256 of the element text is used. Default: false | |
| `xmlKeepTags` | `boolean` | | If true, retain the XML tags in the output. Otherwise, only the text within the tags is extracted. Only applies to partition_xml. | |
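
Several of the fields above only take effect when `chunkingStrategy` is set, and `similarityThreshold` has a documented range. The following is a minimal sketch of a client-side sanity check built from those descriptions. The `ChunkingOptions` interface and `checkChunkingOptions` function are hypothetical helpers written for illustration; they are not part of the SDK, and only the field names are taken from the fields listed above.

```typescript
// Hypothetical local shape: a subset of the documented fields.
// It is NOT imported from the SDK; only the field names come from this page.
interface ChunkingOptions {
  chunkingStrategy?: string;
  combineUnderNChars?: number;
  includeOrigElements?: boolean;
  maxCharacters?: number;
  multipageSections?: boolean;
  newAfterNChars?: number;
  overlap?: number;
  overlapAll?: boolean;
  similarityThreshold?: number;
}

// Fields whose descriptions say they apply only when a chunking strategy is set.
const CHUNK_ONLY_FIELDS: (keyof ChunkingOptions)[] = [
  "combineUnderNChars",
  "includeOrigElements",
  "maxCharacters",
  "multipageSections",
  "newAfterNChars",
];

// Returns human-readable warnings; an empty array means nothing looked off.
function checkChunkingOptions(opts: ChunkingOptions): string[] {
  const warnings: string[] = [];

  if (opts.chunkingStrategy === undefined) {
    for (const field of CHUNK_ONLY_FIELDS) {
      if (opts[field] !== undefined) {
        warnings.push(`${field} is set but chunkingStrategy is not`);
      }
    }
  }

  // Documented range for similarityThreshold is 0.0 to 1.0.
  const threshold = opts.similarityThreshold;
  if (threshold !== undefined && (threshold < 0 || threshold > 1)) {
    warnings.push("similarityThreshold must be between 0.0 and 1.0");
  }

  // overlap defaults to 0, so overlapAll alone would carry over no text.
  if (opts.overlapAll === true && (opts.overlap ?? 0) <= 0) {
    warnings.push("overlapAll is set but overlap is 0");
  }

  return warnings;
}
```

For example, `checkChunkingOptions({ maxCharacters: 1000 })` reports that `maxCharacters` was supplied without a chunking strategy, while the same call with `chunkingStrategy: "by_title"` added returns no warnings.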
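
To make the `overlap` description concrete, here is a small illustration of drawing a tail from each chunk and prefixing it to the next one. This is only a sketch of the idea on plain strings: `applyOverlap` is a hypothetical function, it applies the overlap to every chunk (roughly the `overlapAll` case), and the service's actual behaviour, for example around whitespace or word boundaries, may well differ.

```typescript
// Illustration only: prefix each chunk with the last `overlap` characters
// of the ORIGINAL previous chunk. Not the service's implementation.
function applyOverlap(chunks: string[], overlap: number): string[] {
  // With overlap 0 (the documented default) chunks are returned unchanged.
  if (overlap <= 0) {
    return [...chunks];
  }
  return chunks.map((chunk, i) => {
    if (i === 0) {
      return chunk; // the first chunk has no predecessor to draw a tail from
    }
    const previous = chunks[i - 1] ?? "";
    const tail = previous.slice(-overlap);
    return tail + chunk;
  });
}

console.log(applyOverlap(["abcdef", "ghij", "kl"], 2));
// prints [ 'abcdef', 'efghij', 'ijkl' ]
```

Each tail is taken from the original previous chunk rather than from the already-prefixed one, so overlap text does not accumulate from chunk to chunk.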
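
The `startingPageNumber` field matters when a PDF is split before upload: each piece needs to know which page of the original document it begins at. The sketch below computes that value for each piece. `planPdfSplits` and `SplitPlan` are hypothetical names, the fixed `pagesPerSplit` size is an illustrative choice, and 1-based page numbering is an assumption that the field descriptions above do not confirm.

```typescript
// Hypothetical planning helper; it does not read or split any PDF itself.
interface SplitPlan {
  startingPageNumber: number; // assumed 1-based page of the original document
  pageCount: number; // number of pages in this piece
}

function planPdfSplits(totalPages: number, pagesPerSplit: number): SplitPlan[] {
  if (!Number.isInteger(totalPages) || totalPages < 0) {
    throw new RangeError("totalPages must be a non-negative integer");
  }
  if (!Number.isInteger(pagesPerSplit) || pagesPerSplit < 1) {
    throw new RangeError("pagesPerSplit must be a positive integer");
  }

  const plans: SplitPlan[] = [];
  for (let first = 1; first <= totalPages; first += pagesPerSplit) {
    plans.push({
      startingPageNumber: first,
      // The last piece may be shorter than pagesPerSplit.
      pageCount: Math.min(pagesPerSplit, totalPages - first + 1),
    });
  }
  return plans;
}

console.log(planPdfSplits(10, 4));
// pieces start at pages 1, 5 and 9 and contain 4, 4 and 2 pages
```

Each `startingPageNumber` from the plan would then accompany the request for the corresponding piece, so that page numbers in the returned elements line up with the original document.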