r/promptoftheday

Accident reports to unified taxonomy: A multi-class-classification problem

Hello!

I'm here to brainstorm possible solutions for my labeling problem.

Core Data: I have ~4500 accident reports from paragliding incidents. The reports are unstructured text: some run over multiple pages and cover many aspects of the incident in detail, others are just a few lines.

My idea: Extract the semantically relevant information from each accident into one unified taxonomy, so that accident causes etc. can be analyzed further.

My approach: I want to use topic modeling to create a unified taxonomy that can capture virtually all relevant information about each accident. The taxonomy plus one accident report will then be combined into one API call. After ~4500 API calls, I should end up with all of my accidents represented in the same unified taxonomy.
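To make the "taxonomy + one accident per call" idea concrete, here is a minimal sketch of how a single request could be assembled. The function name `build_prompt`, the prompt wording, and the field names are all hypothetical, and the actual API call is deliberately left out, since that depends on the model and client library used.

```python
def build_prompt(taxonomy_paths: list[str], report: str) -> str:
    """Combine the taxonomy fields and one accident report into one prompt."""
    # One bullet per taxonomy field, so the model sees the exact allowed keys.
    fields = "\n".join(f"- {path}" for path in taxonomy_paths)
    return (
        "Extract the following fields from the accident report.\n"
        "Answer with a single JSON object whose keys are exactly the "
        "field names listed below.\n"
        "Use null for anything the report does not mention.\n\n"
        f"Fields:\n{fields}\n\n"
        f"Report:\n{report}\n"
    )


# Hypothetical example input; the real taxonomy would have ~150 fields.
example_prompt = build_prompt(
    ["Weather -> Wind -> Velocity", "Pilot -> Experience"],
    "Pilot launched in gusty conditions and collapsed shortly after takeoff.",
)
```

With ~4500 reports, this function would simply be called once per report, with the same taxonomy list each time.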

Example: The taxonomy has main categories such as weather, pilot experience, and surface conditions. These main categories are further subdivided, e.g., Weather -> Wind -> Velocity.
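One way to picture such a hierarchy is a nested dictionary that gets flattened into paths like the one above. This is only a sketch: apart from Weather -> Wind -> Velocity, the entries in `TAXONOMY` (for instance `Direction` and the exact naming of the pilot and surface branches) are made-up placeholders, and `leaf_paths` is a hypothetical helper.

```python
from typing import Iterator, Optional

# Placeholder taxonomy: None marks a leaf, i.e. a parameter to extract.
TAXONOMY: dict = {
    "Weather": {"Wind": {"Velocity": None, "Direction": None}},
    "Pilot": {"Experience": None},
    "Surface": {"Condition": None},
}


def leaf_paths(tree: dict, prefix: tuple = ()) -> Iterator[str]:
    """Yield every leaf of the taxonomy as a 'A -> B -> C' path string."""
    for name, child in tree.items():
        path = prefix + (name,)
        if isinstance(child, dict) and child:
            # Inner node: descend into the subcategories.
            yield from leaf_paths(child, path)
        else:
            # Leaf node: emit the full path from the root.
            yield " -> ".join(path)


paths = list(leaf_paths(TAXONOMY))
# paths[0] == "Weather -> Wind -> Velocity"
```

The flat list of paths is what would end up as the set of allowed keys, which also makes it easy to count how many parameters the taxonomy really has.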

Current State: My taxonomy is not finished yet, but I estimate that it will have roughly 150 parameters to look for in each accident. I worked on a similar problem a year ago, building a voice assistant with GPT. There, I used Davinci to transform spoken input into JSON with predefined actions. This worked decently for most scenarios, but I had to post-process the output because the format wasn't always right.
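The kind of post-processing meant here could look roughly like the sketch below: parse the model's answer as JSON, keep only keys that exist in the taxonomy, and collect everything else for review. The name `validate_output` and the sample keys are hypothetical, and this assumes the model was asked to return one flat JSON object keyed by taxonomy path.

```python
import json


def validate_output(raw: str, allowed: set[str]) -> tuple[dict, list[str]]:
    """Return (fields that are in the taxonomy, list of problems found)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # The answer was not valid JSON at all; flag it for a retry.
        return {}, ["<invalid JSON>"]
    if not isinstance(data, dict):
        return {}, ["<not a JSON object>"]
    # Keep only keys defined in the taxonomy; anything else was invented.
    kept = {key: value for key, value in data.items() if key in allowed}
    unknown = sorted(key for key in data if key not in allowed)
    return kept, unknown


allowed_keys = {"Weather -> Wind -> Velocity", "Pilot -> Experience"}
raw_answer = '{"Weather -> Wind -> Velocity": "35 km/h", "Glider -> Color": "red"}'
kept, unknown = validate_output(raw_answer, allowed_keys)
# kept    == {"Weather -> Wind -> Velocity": "35 km/h"}
# unknown == ["Glider -> Color"]
```

Counting how often `unknown` is non-empty across a test batch would give a rough measure of how often the model strays from the taxonomy.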

Currently, my concerns and questions are:

  • With many more categories now (150, compared to 14 in my voice assistant) and a much bigger text input (the voice assistant got one sentence; an accident report can be up to 8 pages), I worry that GPT will use categories other than those defined in the taxonomy, or hallucinate unpredictably.

  • How to effectively get structured output (here in the form of a taxonomy) from GPT?

  • Would my solution even work as intended?

  • Is this a smart way to approach my goal?

  • What are alternatives?

I am very grateful for any input and thoughts. Thanks in advance!