What kinds of models are people training with document data? [P]

A Reddit thread asks what kinds of models people are training on document data such as annotated PDFs and health forms. It raises privacy concerns about the personally identifiable information (PII) that such documents can contain.

The conversation also touches on how this kind of training is typically integrated into existing pipelines. Participants ask whether formats such as FUNSD, BIO, YOLO, Donut, and COCO suit their needs. There is also discussion of whether an SDK package on the Python Package Index (PyPI) would be useful, or whether users prefer to call APIs directly without additional tooling.

The exchange underscores how complex it is to train models on sensitive document images and text, particularly with respect to privacy and integration into existing workflows.
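
As a rough illustration of the privacy concern, the sketch below masks a few common PII patterns in extracted document text before it reaches a training pipeline. It is not taken from the thread: the pattern set, the `PII_PATTERNS` name, and the `redact_pii` helper are all hypothetical, and regular expressions alone will miss names, addresses, and most domain-specific identifiers.

```python
import re

# Hypothetical patterns for illustration only. Real PII handling needs
# broader coverage (names, addresses, record numbers) and human review.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "PHONE": re.compile(r"\b\d{3}[-.\s]\d{3}[-.\s]\d{4}\b"),
}


def redact_pii(text: str) -> str:
    """Replace each pattern match with a bracketed placeholder such as [EMAIL]."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text


sample = "Contact jane.doe@example.com or 555-123-4567. SSN 123-45-6789."
redacted = redact_pii(sample)
print(redacted)
```

Placeholders keep the sentence structure intact, which tends to matter more for model training than deleting the matched text outright.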

Takeaways:
– PII needs robust handling in model training.
– Various formats are in use, but there is uncertainty about their suitability.
– Users differ on whether they prefer an SDK package or direct API integration.
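
To make the format question more concrete, here is a minimal sketch of BIO tagging, one of the schemes named in the thread: the first token of a labeled span gets a `B-` tag, the remaining tokens get `I-` tags, and everything else is `O`. The `spans_to_bio` helper, the sample tokens, and the label names are illustrative assumptions and do not come from the discussion.

```python
from typing import List, Tuple


def spans_to_bio(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, label) token spans (end exclusive) to BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the span
        for i in range(start + 1, end):  # remaining tokens of the span
            tags[i] = f"I-{label}"
    return tags


# Hypothetical form snippet with illustrative labels.
tokens = ["Patient", "name", ":", "Jane", "Doe"]
spans = [(0, 2, "QUESTION"), (3, 5, "ANSWER")]
tags = spans_to_bio(tokens, spans)
print(list(zip(tokens, tags)))
```

Box-based formats such as COCO or YOLO describe regions of the page image instead of token sequences, so converting between the two families usually needs word-level bounding boxes as well.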
