[Image: a woman standing in the middle of a room with lights. Image credit: Mahdis Mousavi.]

At Tribal we have been considering how AI and machine learning technologies can be incorporated into websites.

Here are some recent and ongoing advances we have been making in the machine learning and AI space. There is a mixture of new capabilities and new possibilities, which we are looking at incorporating into our Zenario platform.

Smarter document scanning

We have now incorporated Amazon Textract into the latest version of Zenario. Textract is a text-scanning and OCR tool. It gives us far better quality text extraction from PDFs.

So our forthcoming release of Zenario will be able to scan text from PDFs — even ones that are lacking a text layer (such as photographic scans of old documents) — and also JPEG and PNG images that contain text.

At its simplest, better quality text extraction means that on-site searching by users provides better results; but more importantly, it paves the way for machine learning and AI-type features.

Introducing Vector Databases

Websites conventionally store metadata about web pages, documents, images, videos and other materials in a relational database (Zenario uses MySQL). However, to achieve AI-type features it's necessary to also store data in a vector database.

A vector database is a different way of storing data; it does not replace the relational database, but adds to it. A vector may not be an exact or human-readable copy of the original data, but it stores that data in a way that preserves its patterns and meanings, so that various machine learning processes can be carried out on it.

Vector databases are an efficient way of storing high-dimensional data, that is, complex material of many kinds. They can scale to thousands, millions or billions of data points. Each data point is a vector, and vectors can be derived from complex data types such as natural language text, numerical statistics, images, audio clips and videos.
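To make the idea concrete, here is a toy sketch of how two pieces of text can be turned into vectors and compared. It uses simple word counts over a hand-picked vocabulary; real systems use learned embeddings with hundreds of dimensions, but the principle of comparing vectors by cosine similarity is the same.

```python
import math
from collections import Counter

def embed(text, vocabulary):
    # Toy embedding: count how often each vocabulary word appears.
    # Real systems use learned embeddings with hundreds of dimensions.
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]

def cosine_similarity(a, b):
    # Measures how closely two vectors point in the same direction.
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

vocabulary = ["plane", "flight", "airport", "train", "track", "station"]
doc_a = embed("the plane left the airport after a delayed flight", vocabulary)
doc_b = embed("the train left the station on the southern track", vocabulary)
query = embed("flight from the airport", vocabulary)

# The query vector is much closer to doc_a than to doc_b.
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```

A vector database stores millions of such vectors and can find the nearest ones to a query far more efficiently than this brute-force comparison.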

Here are some of the things on our development horizon that become possible when data is stored in vector form.

Topic detection

Websites and extranets are often large enough that content is split into topics or categories. When key topics have been established, it is possible to automatically scan documents and other material to see which topics they contain.

While a human could do this, it can be a mundane and repetitive task, which can be done faster and more accurately by machine learning.

For example, if a site has topics relating to industry sectors (transport, manufacturing, information technology, etc.), it should be possible for the system to automatically scan each uploaded document or page and establish which of those topics it relates to.

Having done that, the related topics (checked by a human first, if need be) can be stored in the document's metadata.
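In vector terms, this amounts to comparing each document's vector against a reference vector per topic and picking the closest. The sketch below uses the topic names from the example above; the vocabulary and topic descriptions are invented for illustration, and a real system would use learned embeddings rather than word counts.

```python
import math
from collections import Counter

# Hypothetical vocabulary, invented for illustration.
VOCAB = ["bus", "rail", "freight", "factory", "assembly",
         "server", "software", "network"]

def embed(text):
    counts = Counter(text.lower().split())
    return [counts[w] for w in VOCAB]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

# One reference vector per site topic, built from a short description.
TOPICS = {
    "transport": embed("bus rail freight"),
    "manufacturing": embed("factory assembly"),
    "information technology": embed("server software network"),
}

def detect_topic(document_text):
    # Return the topic whose reference vector is closest to the document's.
    vector = embed(document_text)
    return max(TOPICS, key=lambda name: cosine(vector, TOPICS[name]))

print(detect_topic("new rail freight terminal opens near the bus depot"))  # transport
```

The result could then be written into the document's metadata, either directly or via a human-approval step.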

Meaning-based searching

When a user performs a search for documents on a conventional site, Zenario scans the extracted text to look for matches of that text. It uses a high-quality full-text search, and it works well, but Zenario is still searching its database for relevant text rather than relevant meanings.

For example, it wouldn't understand that "aviation", "aeroplanes", "airplanes" and "planes" are similar things. So if a user types in "aeroplanes", they might miss some important matches.

Once a copy of the content is inside a vector database, a meaning-based search can be undertaken: a user can type in "aeroplanes" and get results that are relevant to the meaning of that term, not just its exact wording.
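The sketch below shows the idea, using the synonyms from the example above. Here the synonym relationships are a hand-built lookup table, purely for illustration; real embedding models learn such relationships from data rather than from a table.

```python
import math

# Hand-built synonym map, purely illustrative: words mapped to the same
# dimension count as the same concept.
CONCEPTS = {
    "aviation": 0, "aeroplanes": 0, "airplanes": 0, "planes": 0,
    "shipping": 1, "boats": 1,
    "railways": 2, "trains": 2,
}

def embed(text):
    vector = [0.0, 0.0, 0.0]
    for word in text.lower().split():
        if word in CONCEPTS:
            vector[CONCEPTS[word]] += 1.0
    return vector

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

documents = {
    "doc1": "planes and aviation safety",
    "doc2": "boats and shipping lanes",
}
query = embed("aeroplanes")
ranked = sorted(documents, key=lambda d: cosine(query, embed(documents[d])),
                reverse=True)
print(ranked[0])  # doc1, even though it never contains the word "aeroplanes"
```

Because the query and the documents meet in the same vector space, "aeroplanes" finds a document about "planes" and "aviation" that a plain text match would miss.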

Automatic summarisation

AI features are available which allow summaries to automatically be created from documents, web pages and other blocks of text. Our initial aim in this respect is that any document, once its text has been cleanly extracted, can be passed to the summariser.

The auto-generated summary may be accepted as-is, or it may go into a pending state so that a human can check, adjust and approve it, before it is published.
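Zenario's summariser will presumably use a generative AI model; as a minimal self-contained sketch of the workflow, here is a classic extractive approach that scores sentences by the average frequency of their words and keeps the highest scorers.

```python
import re
from collections import Counter

def summarise(text, max_sentences=1):
    # Split into sentences, score each by the average frequency of its words
    # across the whole text, then keep the top scorers in original order.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(re.findall(r"[a-z']+", text.lower()))

    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)

    top = set(sorted(sentences, key=score, reverse=True)[:max_sentences])
    return " ".join(s for s in sentences if s in top)

text = ("Zenario extracts text from documents. "
        "Zenario extracts text so searches work well. "
        "The office cat slept all afternoon.")
print(summarise(text))  # Zenario extracts text from documents.
```

In a real workflow the output would land in the pending state described above, ready for a human to check and approve.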

Answering questions

This is a more powerful AI feature, which lets a human ask a question in natural language, and the system responds with a real answer based on the content of the material it stores.

I expect that, to begin with, this may be limited to interrogating a single document, owing to the amount of processing power required. But this will no doubt improve in due course, with the system interrogating numerous documents, and perhaps web pages and other sources of information, so as to return a better-considered and more balanced response.

Numerical data

While the above concepts refer to text-heavy documents, there are tools available which help AI systems handle numeric and tabular information.

A user might ask a question such as, "What is the GDP of the UK over the past 5 years?", or "Show me a graph of the number of electric vehicles in the United States". An AI-based system will be able to interrogate a dataset and produce a sensible answer to these questions, in text or visual form.
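As a very small sketch of the shape of such a query (with arbitrary placeholder figures, not real statistics), a structured dataset can be sliced according to a "past N years" phrase in the question before being rendered as text or a chart. A real system would use an AI model to interpret far more varied questions.

```python
import re

# Placeholder time series: the figures are arbitrary and purely illustrative.
EV_COUNT_MILLIONS = {2019: 1.4, 2020: 1.8, 2021: 2.0, 2022: 3.0, 2023: 3.6}

def answer(question, series):
    # Pull a "past N years" window out of the question and return the
    # matching slice of the series, oldest year first.
    match = re.search(r"past (\d+) years", question.lower())
    n = int(match.group(1)) if match else len(series)
    years = sorted(series)[-n:]
    return {year: series[year] for year in years}

print(answer("Show the figures for the past 3 years", EV_COUNT_MILLIONS))
# → {2021: 2.0, 2022: 3.0, 2023: 3.6}
```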

Making sense of images

There are many tools available for handling images. A common need is to identify elements in images — not simply words, but what the image contains.

With these elements identified, the system can assist the user by automatically tagging images and by improving image search.

It will be possible to edit images, for example to remove an unwanted element or background, thereby making it easier to compile images for use on a website.

Translation and search

Combined with text extraction, this opens up translation between languages. So when a document has been uploaded, it should be possible to translate its text into a different language for indexing.

For example, if the site is presented in English, but a document is in (say) Spanish, that document's text extract can be translated into English. With that done, an English-speaking user should be able to do a search in English and find relevant Spanish documents. The user should then be able to view the resulting document's text in English, and to be able to download the original in Spanish if the original reference is needed.

Conclusion

This is an exciting time to be working in the internet space. 2023 saw tools like ChatGPT opened up to millions of users, but 2024 will see much greater changes, both broadly and in the details of thousands of activities as they become AI-enabled.