fix: parse URL response Content-Type according to RFC 9110 (#2950)
Currently, `file_and_type_from_url()` does not handle the `Content-Type` header correctly. It assumes that the header contains only the MIME type (e.g. `text/html`), but [RFC 9110](https://www.rfc-editor.org/rfc/rfc9110#field.content-type) allows additional parameters, notably `charset`, to follow it. As a result, loading a URL whose response has a Content-Type header such as `text/html; charset=UTF-8` raises a `ValueError`.

To reproduce the issue:

```python
from unstructured.partition.auto import partition

url = "https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/"
partition(url=url)
```

This results in the following exception:

```python
{
  "name": "ValueError",
  "message": "Invalid file. The FileType.UNK file type is not supported in partition.",
  "stack": "---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 4
      1 from unstructured.partition.auto import partition
      3 url = \"https://arstechnica.com/space/2024/04/nasa-still-doesnt-understand-root-cause-of-orion-heat-shield-issue/\"
----> 4 partition(url=url)

File ~/miniconda3/envs/ai-tasks/lib/python3.11/site-packages/unstructured/partition/auto.py:541, in partition(filename, content_type, file, file_filename, url, include_page_breaks, strategy, encoding, paragraph_grouper, headers, skip_infer_table_types, ssl_verify, ocr_languages, languages, detect_language_per_element, pdf_infer_table_structure, extract_images_in_pdf, extract_image_block_types, extract_image_block_output_dir, extract_image_block_to_payload, xml_keep_tags, data_source_metadata, metadata_filename, request_timeout, hi_res_model_name, model_name, date_from_file_object, starting_page_number, **kwargs)
    539 else:
    540     msg = \"Invalid file\" if not filename else f\"Invalid file {filename}\"
--> 541     raise ValueError(f\"{msg}. The {filetype} file type is not supported in partition.\")
    543 for element in elements:
    544     element.metadata.url = url

ValueError: Invalid file. The FileType.UNK file type is not supported in partition."
}
```

This PR fixes the issue by parsing the MIME type out of the `Content-Type` header string.

Closes #2257
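For illustration, the kind of parsing described above might look like the following minimal sketch. The helper name `mime_type_from_content_type` is hypothetical and is not taken from the `unstructured` codebase; the actual change in this PR may be implemented differently. The sketch relies only on the RFC 9110 rule that parameters follow the media type after a `;` and that the type and subtype are case-insensitive.

```python
def mime_type_from_content_type(content_type: str) -> str:
    """Return only the media type from a Content-Type header value.

    Hypothetical helper for illustration; not the library's actual function.
    Everything after the first ';' (e.g. charset=UTF-8) is treated as
    parameters and discarded.
    """
    # Split once on ';' and keep the part before it (the media type).
    media_type = content_type.split(";", 1)[0]
    # Trim optional whitespace and normalize case, since type/subtype
    # are case-insensitive.
    return media_type.strip().lower()


if __name__ == "__main__":
    print(mime_type_from_content_type("text/html; charset=UTF-8"))
```

With a helper like this, a header with or without parameters yields the same bare media type, which can then be mapped to a file type in the usual way.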