Today was mostly conversations.
Are conversations useful?
If they lead to valuable action, then yes, for sure.
But if you’ve watched Stutz on Netflix, you’ll know that real world conversations – and hence socialising – are also vital for your own wellbeing and mental health. It’s more helpful if the conversations are interesting and not poisonous, but any social contact is better than none, funnily enough.
If you are working on your own a lot as a founder, it’s so important to have social contact to ground yourself in reality. It’s very good for your brain.
The sad reality of an AI future is that people will increasingly be communicating with AI-generated responses. But it’s down to each of us to make sure we keep on talking, especially technical founders who are introverted.
I was listening to someone the other day who said that those of us alive now are the final generation of humans who knew what life was like before AI (and robots) started to take over and drive the cost of knowledge and content creation down to almost zero.
Interesting new world!
In other news, did some more R&D on web crawling. It turns out you can use the Crawl4AI Python package in tandem with a language model, and it will automatically run the crawled content through your prompt. I’ll do a video on it another time, but for the moment here is my code. It basically rewrites the BBC article as an excited Arsenal fan.
# Example: using the LLM filter
url2 = "https://www.bbc.co.uk/sport/football/live/c8j00ke2r23t"
success2, content2, file2 = await crawl_url(
    url=url2,
    filter_type="llm",
    llm_instruction="""
    Rewrite this as if you are an excited Arsenal fan.
    Include:
    - Emotive descriptive language of the goals
    Exclude:
    - Navigation elements
    - Sidebars
    - Footer content
    Format the output as clean markdown with proper paragraphs and headers.
    """,
)
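One gotcha if you’re copying this: a top-level `await` like the one above only works in a notebook or IPython. In a plain script you need to wrap the call in a coroutine and hand it to `asyncio.run`. A minimal sketch of the pattern (the `main` wrapper is just my naming, and the stand-in coroutine is there so the snippet runs on its own; in practice you’d `await crawl_url(...)` inside it):

```python
import asyncio

async def main():
    # In a real script this would be:
    #   success, content, file = await crawl_url(url=url2, filter_type="llm", llm_instruction=...)
    # A stand-in await keeps this sketch self-contained.
    await asyncio.sleep(0)
    return "done"

if __name__ == "__main__":
    result = asyncio.run(main())  # drives the event loop to completion
    print(result)
```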
And this is the definition of my custom function:
import asyncio
import os

from dotenv import load_dotenv
from crawl4ai import (
    AsyncWebCrawler,
    BM25ContentFilter,
    CrawlerRunConfig,
    DefaultMarkdownGenerator,
    LLMContentFilter,
    PruningContentFilter,
)
from crawl4ai.async_configs import LlmConfig

async def crawl_url(url, filter_type="prune", query=None, llm_instruction=None):
    """
    Crawl a URL and apply a specified content filter.

    Args:
        url (str): The URL to crawl
        filter_type (str): Type of filter to use - "bm25", "prune", or "llm"
        query (str): Query for BM25 or Pruning filters
        llm_instruction (str): Instruction for LLM filter

    Returns:
        tuple: (success, markdown_content, output_filename)
    """
    # Load environment variables from .env file
    load_dotenv()
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")

    # Select the appropriate content filter based on filter_type
    if filter_type == "bm25" and query:
        content_filter = BM25ContentFilter(
            user_query=query,
            bm25_threshold=1.2,
            # use_stemming=True
        )
    elif filter_type == "prune" and query:
        content_filter = PruningContentFilter(
            user_query=query,
            threshold=0.5,
            threshold_type="fixed",  # or "dynamic"
            min_word_threshold=50,
        )
    elif filter_type == "llm" and llm_instruction:
        content_filter = LLMContentFilter(
            llmConfig=LlmConfig(provider="openai/gpt-4o-mini", api_token=OPENAI_API_KEY),
            instruction=llm_instruction,
            chunk_token_threshold=4096,
            verbose=True,
        )
    else:
        # Fall back to the pruning filter if no valid filter/arguments were given
        content_filter = PruningContentFilter(
            user_query=query or "",
            threshold=0.5,
            threshold_type="fixed",
            min_word_threshold=50,
        )

    md_generator = DefaultMarkdownGenerator(
        content_filter=content_filter,
        options={"ignore_links": True},
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url, config=config)

        if not result.success:
            print(f"Crawl failed: {result.error_message}")
            print(f"Status code: {result.status_code}")
            return False, None, None

        # Create a filename based on the URL:
        # remove the protocol and replace special characters
        filename = url.replace("https://", "").replace("http://", "").replace("/", "_").rstrip("_")
        output_file = f"{filename}.md"

        # Write the extracted content to a markdown file
        with open(output_file, "w", encoding="utf-8") as f:
            f.write(result.markdown.fit_markdown)

        print(f"Content successfully exported to {output_file}")
        return True, result.markdown.fit_markdown, output_file