Digitaltrends

The AI-powered future of web data collection

E. Wilson, 1 hr ago

At the web scraping conference OxyCon 2024, its organizer Oxylabs revealed the first AI copilot for web scraping. It comes as a feature of Oxylabs' unified Web Scraper API, which serves as an all-in-one public web data scraping platform. Named OxyCopilot, the feature tackles one of the main challenges in the web scraping pipeline: building and maintaining custom data parsers with limited infrastructure, personnel, and time.

Oxylabs is a leading web intelligence platform with nearly a decade of experience in the field. With OxyCopilot, this is not the first time the company has led the way in applying AI to data scraping. In fact, from its inception, Oxylabs has put special emphasis on fostering innovation and research and development (R&D). This attitude allowed a once humble startup from Lithuania to become a leading force in web data extraction recognized by the world's top brands. However, it also meant encountering multiple challenges along the way.

What is OxyCopilot, and who is it for?

Integrating Oxylabs' scraper APIs into one platform — Web Scraper API — provides users a one-stop shop for all their public web data needs. OxyCopilot is an AI-powered feature of this unified platform that helps users create requests for the API and build custom parsers.

The copilot understands natural language prompts and only needs a URL, or multiple URLs belonging to the same domain, to create parsing instructions. In other words, users provide the URL and describe in plain language what data they want from that domain. The AI copilot then produces parsing instructions within minutes, and feeding these instructions to the API returns the parsed data. Anyone with a basic knowledge of web scraping can quickly learn to use it.

However, OxyCopilot was not planned in detail from the beginning. It did not come as a sudden stroke of genius. Instead, there were many ideas and iterations along the way, as well as multiple challenges and eureka moments in the history of Oxylabs that eventually led the company to create the unified Web Scraper API with an AI-powered assistant.

The path to innovation

Founded in 2015, Oxylabs faced the challenge common to early-stage startups: You must be inventive and dare to do things unconventionally to make a name for yourself. What amplified the challenge was that the web scraping and proxy solutions industry at the time was young, niche, and often overlooked or misunderstood.

Thus, it fell to industry pioneers to explain that only publicly available data is collected, to implement strict know-your-customer (KYC) policies, and to differentiate themselves from those who do not follow legal and ethical norms. Originality in technical solutions had to be balanced with, and limited by, following and promoting ethical conventions. Nevertheless, choosing this path proved right for Oxylabs. Over time, it gave the company proprietary knowledge about clients' needs and possible solutions that few other companies possessed.

This competitive advantage cemented Oxylabs' leading status. However, it brought new challenges, one of which is that such status can lead to deceptive comfort and inertia that stifle innovation.

Recognizing the risk, Oxylabs addressed it with policies that encourage thinking outside the box. These include an inventor's bonus available to any employee who proposes a feasible, innovative idea, full support for inventors during the patenting procedure, and regular meetings for innovation mining. Effective implementation of these measures has allowed Oxylabs to build a portfolio of over 100 patents and to be named one of the best workplaces for innovators.

Implementing AI for data scraping

While Oxylabs was maturing as a company, AI capabilities were developing rapidly. These developments presented an important research direction for the company at the forefront of the industry, aiming to address the most pressing web scraping challenges. Understanding this, in 2020, Oxylabs established an AI and machine learning (ML) advisory board consisting of AI researchers and developers who worked with the world's leading tech companies.

By 2021, Oxylabs had already experimented with AI in areas like proxy management, response recognition, and dynamic fingerprinting. These explorations led to important discoveries and implementations now unified in Web Scraper API. However, parsing, although on the surface a simpler task than web unblocking, was an area where AI's benefits were yet to be discovered. To understand why automating parsing with AI proved challenging, one needs to look closer at the peculiarities of the parsing process.

The struggle with parsing

Web data parsing is the process of extracting unstructured data from HTML and structuring it to make it analyzable. While parsing is done by specialized tools known as data parsers, building and maintaining these tools proves challenging. Website layouts change, which causes parsers to break, delaying important business procedures. According to a recent survey, this is why 57% of developers fix parsers several times a week, and 31% do it every day.

The survey, conducted by Oxylabs and Censuswide, covered scraping professionals in two of the biggest markets for public web data, the USA and the UK. It revealed that parsing processes take up 10 to 40 hours a week for 75% of scraping professionals. Meanwhile, if parsing is interrupted, 95% of businesses face a negative impact within 24 hours.
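To make the breakage problem concrete, here is a minimal sketch of a hand-written parser and the kind of layout change that silently breaks it. The markup, class names, and the `parse_product` helper are hypothetical and only illustrate the general failure mode; they are not taken from any Oxylabs product.

```python
import xml.etree.ElementTree as ET

# Hypothetical product page, kept well-formed so the stdlib XML parser can read it.
OLD_LAYOUT = """
<html><body>
  <div class="product">
    <h1 class="title">Espresso Machine</h1>
    <span class="price">199.99</span>
  </div>
</body></html>
"""

# The same data after a hypothetical redesign: the price now lives in a
# differently named element, so the old selector no longer matches.
NEW_LAYOUT = """
<html><body>
  <div class="product">
    <h1 class="title">Espresso Machine</h1>
    <p class="cost">199.99</p>
  </div>
</body></html>
"""


def parse_product(html):
    """Turn raw markup into a structured record using hard-coded selectors."""
    root = ET.fromstring(html.strip())
    title = root.find(".//h1[@class='title']")
    price = root.find(".//span[@class='price']")
    return {
        "title": title.text if title is not None else None,
        "price": float(price.text) if price is not None else None,
    }


before = parse_product(OLD_LAYOUT)  # both fields are found
after = parse_product(NEW_LAYOUT)   # the price selector misses after the redesign
```

The parser does not crash on the new layout; it simply returns a record with a missing price, which is why such breakages tend to surface downstream and need repeated manual fixes.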

The survey only confirmed what had been clear to Oxylabs for years: AI could bring great value to developers and businesses if it could automate parser building, generate instructions rapidly, and thus expand the range of people who can handle web scraping tasks. Even before the AI boom of 2022, Oxylabs released products capable of ML-driven adaptive parsing. Additionally, explorations in this direction led to Oxy Parser, an LLM-based open-source product that can automatically parse HTML into Pydantic models.

Building the first AI-powered copilot for parsing

Encountering tools like ChatGPT, capable of interacting with the user through natural language prompts, made Oxylabs' developers eager to utilize the large language models (LLMs) underlying these tools. However, creating such an assistant for data parsing was far from straightforward, even for some of the best minds in the public web data gathering industry.

The main problem that put the development of the AI copilot for parsing on hold at the end of 2023 was generating XPath expressions, which work as a roadmap to finding specific webpage elements in HTML. The goal of OxyCopilot was to enable clients to automatically generate parsing templates for the domains they want to scrape. With this functionality, they could have all the parsing done on Oxylabs' side, removing the need to use their own server resources. However, if the tool couldn't find the right XPath and put it in the parsing instructions template, there was no chance of automating this process.

In the spring of 2024, a solution was found, followed by three months of intense work by three seasoned ML engineers. OxyCopilot uses novel logic, which Oxylabs is currently patenting. The company already had proprietary technology for generating parsing templates, and it became the stepping stone for automating the parsing process without the excessive costs associated with using LLMs. Instead of calling LLMs for each request, Oxylabs' copilot generates parsing templates based on URLs and natural language prompts, and those templates are then reused.
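The general idea of a reusable parsing template can be sketched as a mapping from field names to XPath expressions that is built once and then applied to many pages of the same domain. The template format, the `apply_template` helper, and the sample pages below are assumptions made purely for illustration; OxyCopilot's actual template format and generation logic are proprietary and are not shown here.

```python
import xml.etree.ElementTree as ET

# Hypothetical template: field name -> XPath expression (limited to the XPath
# subset that ElementTree supports). In the workflow described above, something
# like this would be generated once per domain; here it is written by hand.
TEMPLATE = {
    "title": ".//h1[@class='title']",
    "price": ".//span[@class='price']",
}

# Hypothetical pages from the same domain that share one layout.
PAGES = [
    "<html><body><h1 class='title'>Espresso Machine</h1>"
    "<span class='price'>199.99</span></body></html>",
    "<html><body><h1 class='title'>Milk Frother</h1>"
    "<span class='price'>24.50</span></body></html>",
]


def apply_template(template, html):
    """Extract one record from a page by evaluating each XPath in the template."""
    root = ET.fromstring(html.strip())
    record = {}
    for field, xpath in template.items():
        node = root.find(xpath)
        # A field whose XPath matches nothing is reported as None.
        has_text = node is not None and node.text
        record[field] = node.text.strip() if has_text else None
    return record


# The template is created once and reused for every page; no model call is
# needed at this stage, only plain XPath evaluation.
records = [apply_template(TEMPLATE, page) for page in PAGES]
```

Under these assumptions, the costly step (working out the right XPath for each field) happens once per domain, while applying the template stays cheap no matter how many pages are parsed.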

To be continued

Facing a challenge leads to struggle, struggle leads to innovative solutions, and those solutions lead to new challenges. This was Oxylabs' path from its inception in 2015 to launching Web Scraper API, the all-in-one scraping platform, and its AI-powered OxyCopilot in 2024.

This path, however, has no end in sight. Instead, further improvements on OxyCopilot and the entire scraping platform, new AI applications for web scraping, and other innovations are on the horizon. Thus, web scraping professionals around the world have plenty to be excited about when thinking about the future.
