
Web scraping


Casey Jeppesen

I do a lot of research online; so much so that I often forget where I found a piece of information. Moreover, the most tedious aspect of online research is scanning content for relevant info.

The most useful thing Curiosity could do for me (BY FAR) would be to scrape and/or index web pages, adding that content to the corpus of info available to the AI.

Ideal scenario:
1. Visited web pages are scraped/indexed for their main text content, along with the page URL & metadata (page title, description, etc.).
2. AI bots can be directed to prioritize web page content first, then sources x, y, z, then ChatGPT's default corpus.
3. All responses include a citation of the source and/or URL of the pages referenced.
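To make step 1 concrete, here is a minimal sketch of pulling the main text, title, and meta description out of a fetched page using only Python's standard library. The `PageScraper` class and `scrape` function are hypothetical names for illustration; a production integration would need much more robust content extraction (boilerplate removal, encoding handling, JS-rendered pages).

```python
from html.parser import HTMLParser

class PageScraper(HTMLParser):
    """Collects the page title, meta description, and visible body text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False
        self._skip_depth = 0  # nesting depth inside <script>/<style>
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def scrape(url, html):
    """Return a record suitable for indexing: URL, metadata, and main text."""
    parser = PageScraper()
    parser.feed(html)
    return {"url": url, "title": parser.title,
            "description": parser.description,
            "text": " ".join(parser.text_parts)}
```

The returned record (URL + title + description + text) is exactly the shape of data the scenario above would hand to the AI for citation-backed answers.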

Ideal options:
1. Global switch to "auto-index" in the browser app, per profile (automatically scrape every web page visited).
2. Optional browser extension and/or shortcut keys to: A) Scrape current page. B) Index current website.

Considerations:
1. Web scraping is challenging (there's really no single ruleset that works for all sites). It might be best to start with an app integration with something like scrapestorm.com or simplescraper.io.
2. Elasticsearch might be a better option, effectively off-loading the web-scraping component to its powerful indexing engine and making that data available as vectors via API. Although more expensive for end users (starting at ~$100/mo), it would provide a robust solution for heavy researchers. See: www.elastic.co/de/blog/chatgpt-elasticsearch-openai-meets-private-data
and: www.elastic.co/de/web-crawler
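If a hosted engine like Elasticsearch proves too heavy or costly, the indexing side of the ideal scenario can be prototyped locally. A toy sketch using only the Python standard library (the `PageIndex` class is hypothetical; it does exact keyword matching, not vector search):

```python
from collections import defaultdict

class PageIndex:
    """Toy keyword index over scraped pages; answers carry URL citations."""
    def __init__(self):
        self.pages = {}                   # url -> scraped record
        self.postings = defaultdict(set)  # lowercased word -> set of urls

    def add(self, record):
        """Index a record of the form {'url', 'title', 'text', ...}."""
        self.pages[record["url"]] = record
        for word in record["text"].lower().split():
            self.postings[word.strip(".,!?")].add(record["url"])

    def query(self, *terms):
        """Return (title, url) citation pairs for pages matching every term."""
        hits = None
        for term in terms:
            urls = self.postings.get(term.lower(), set())
            hits = urls if hits is None else hits & urls
        return [(self.pages[u]["title"], u) for u in sorted(hits or set())]
```

Every query result is a (title, URL) pair, which is what step 3 of the ideal scenario needs for source citations.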




Casey Jeppesen

Perhaps a simple/interim integration: memex.garden/
Although it doesn't scrape entire pages/sites, it does have a slick way to use ChatGPT to summarize pages and add that info as notes. If this DB could be incorporated into Curiosity, we could use ChatGPT in Curiosity to query all the content from all the websites we've saved.