In this recording, I explored DocETL, an open-source package for declarative data processing powered by LLMs. It reminds me of the Hadoop days, when I used to write complex Java programs with custom input and output formats to find the schema in unstructured data. The approach looks similar but is more powerful with Gen AI.
I have modified the code a little to add the YouTube transcript parser to the pipeline as well. The revised code is in this repo:
github.com/raj...
Code used in the video:
_________________________
Extracting the transcript from the YouTube video:
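import json
from youtube_transcript_api import YouTubeTranscriptApi

# Fetch the transcript segments for the video ID.
# get_transcript is the call used in the video; newer library releases may differ.
segments = YouTubeTranscriptApi.get_transcript("dG9zjKpRmdY")

# Join the text of every segment into one string.
transcript = " ".join(segment["text"] for segment in segments)
print(transcript)

# Strip apostrophes, then save the transcript as JSON for the pipeline.
json_content = {"transcript": transcript.replace("'", "")}
with open("transcript.json", "w") as f:
    json.dump(json_content, f)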
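The script above writes a single JSON object. DocETL's documentation appears to describe file datasets as a JSON list of records, so if the pipeline fails to load transcript.json, wrapping the object in a list may help. This is an assumption to verify against the docs. A minimal sketch under that assumption follows; `segments_to_records` and the sample segments are hypothetical names for illustration, not part of the code used in the video:

```python
import json


def segments_to_records(segments):
    """Join transcript segments into a one-element list of records."""
    text = " ".join(segment["text"] for segment in segments)
    # Strip apostrophes, mirroring the original script.
    return [{"transcript": text.replace("'", "")}]


# Hypothetical sample in the shape the script iterates over:
# a list of dicts, each with a "text" key.
sample_segments = [
    {"text": "welcome to the video"},
    {"text": "today we'll look at DocETL"},
]

records = segments_to_records(sample_segments)
print(json.dumps(records, indent=2))
```

Writing `records` to transcript.json with `json.dump` instead of the single object would leave the rest of the pipeline unchanged, since each record still exposes a `transcript` field.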
And here is the pipeline_2.yaml for the data processing:
datasets:
  audio_transcripts:
    path: transcript.json
    type: file

default_model: gpt-4o-mini

operations:
  - name: extract_topics
    type: map
    output:
      schema:
        topics: list[str]
    prompt: |
      Analyze the following transcript:
      {{ input.transcript }}
      Extract and list all key topics mentioned in the transcript.
      If no topics are mentioned, return an empty list.

pipeline:
  steps:
    - name: analyze_video
      input: audio_transcripts
      operations:
        - extract_topics
  output:
    type: file
    path: audio_topics.json
    intermediate_dir: intermediate_results
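Once the pipeline has run (the DocETL command line is `docetl run pipeline_2.yaml`), the extracted topics should end up in audio_topics.json. A minimal sketch for reading them back, assuming the output file is a JSON list of records that each carry a `topics` list as declared in the schema; `collect_topics` is a hypothetical helper, and the sample file written here only stands in for real pipeline output:

```python
import json


def collect_topics(path):
    """Return the de-duplicated topics found in a DocETL-style output file."""
    with open(path) as f:
        data = json.load(f)
    # Accept a single record as well as a list of records.
    records = data if isinstance(data, list) else [data]
    topics = []
    for record in records:
        for topic in record.get("topics", []):
            if topic not in topics:
                topics.append(topic)
    return topics


# Stand-in output so the sketch runs without calling an LLM.
sample_output = [{"transcript": "...", "topics": ["DocETL", "ETL", "DocETL"]}]
with open("sample_audio_topics.json", "w") as f:
    json.dump(sample_output, f)

print(collect_topics("sample_audio_topics.json"))
```

Pointing `collect_topics` at audio_topics.json instead of the sample file would list the topics from an actual run, provided the output has the assumed shape.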
Reference: ucbepic.github...