
Web scraping


Casey Jeppesen

I do a lot of research online; so much so that I often forget where I found a piece of information. Moreover, the most tedious aspect of online research is scanning content for relevant info.

The most useful thing Curiosity could do for me (BY FAR) would be to scrape and/or index web pages, adding that content to the corpus of info available to the AI.

Ideal scenario:
1. Visited web pages are scraped/indexed for their main text content, along with the page URL & metadata (page title, description, etc.).
2. AI bots can be directed to prioritize web page content first, then sources x, y, z, then ChatGPT's default corpus.
3. All responses include a citation of the source and/or URL of the pages referenced.
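To make step 1 concrete, here is a minimal sketch of pulling the main text, title, and meta description out of a fetched page using only Python's standard library. The `PageScraper` class and `scrape` function are hypothetical names for illustration; a production integration would need much more robust content extraction (boilerplate removal, encoding handling, JS-rendered pages).

```python
from html.parser import HTMLParser

class PageScraper(HTMLParser):
    """Collects the page title, meta description, and visible body text."""
    def __init__(self):
        super().__init__()
        self.title = ""
        self.description = ""
        self._in_title = False
        self._skip_depth = 0  # nesting depth inside <script>/<style>
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta" and attrs.get("name") == "description":
            self.description = attrs.get("content", "")
        elif tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif not self._skip_depth and data.strip():
            self.text_parts.append(data.strip())

def scrape(url, html):
    """Return a record suitable for indexing: URL, metadata, and main text."""
    parser = PageScraper()
    parser.feed(html)
    return {"url": url, "title": parser.title,
            "description": parser.description,
            "text": " ".join(parser.text_parts)}
```

The returned record (URL + title + description + text) is exactly the shape of data the scenario above would hand to the AI for citation-backed answers.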

Ideal options:
1. Global switch to "auto-index" in the browser app, per profile (automatically scrape every web page visited).
2. Optional browser extension and/or shortcut keys to: A) Scrape current page. B) Index current website.

Considerations:
1. Web scraping is challenging (there's really no single ruleset that works for all sites). It might be best to start with an app integration with something like scrapestorm.com or simplescraper.io.
2. Elasticsearch might be a better option, effectively off-loading the web-scraping component to its powerful indexing engine and making that data available as vectors via API. Although more expensive for end users (starting at ~$100/mo), it would provide a robust solution for heavy researchers. See: www.elastic.co/de/blog/chatgpt-elasticsearch-openai-meets-private-data
and: www.elastic.co/de/web-crawler
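If a hosted engine like Elasticsearch proves too heavy or costly, the indexing side of the ideal scenario can be prototyped locally. A toy sketch using only the Python standard library (the `PageIndex` class is hypothetical; it does exact keyword matching, not vector search):

```python
from collections import defaultdict

class PageIndex:
    """Toy keyword index over scraped pages; answers carry URL citations."""
    def __init__(self):
        self.pages = {}                   # url -> scraped record
        self.postings = defaultdict(set)  # lowercased word -> set of urls

    def add(self, record):
        """Index a record of the form {'url', 'title', 'text', ...}."""
        self.pages[record["url"]] = record
        for word in record["text"].lower().split():
            self.postings[word.strip(".,!?")].add(record["url"])

    def query(self, *terms):
        """Return (title, url) citation pairs for pages matching every term."""
        hits = None
        for term in terms:
            urls = self.postings.get(term.lower(), set())
            hits = urls if hits is None else hits & urls
        return [(self.pages[u]["title"], u) for u in sorted(hits or set())]
```

Every query result is a (title, URL) pair, which is what step 3 of the ideal scenario needs for source citations.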




Casey Jeppesen

Perhaps a simple/interim integration: memex.garden/
Although it doesn't scrape entire pages/sites, it does have a slick way to use ChatGPT to summarize pages and add that info as notes. If this DB could be incorporated into Curiosity, we could use ChatGPT in Curiosity to query all the content from all the websites we've saved.