Unlocking Faster Insights: How Cloudera and Cohere can deliver Smarter Document Analysis

Today we are excited to announce the release of a new Cloudera Accelerator for Machine Learning (ML) Projects (AMP) for PDF document analysis, “Document Analysis with Command R and FAISS”. The AMP leverages Cohere’s Command R Large Language Model (LLM), the Cohere Toolkit for retrieval-augmented generation (RAG) applications, and Facebook AI Similarity Search (FAISS).

Document analysis is crucial for efficiently extracting insights from large volumes of text. It has wide-ranging applications including legal research, market analysis, and scientific research. For example, cancer researchers can use document analysis to quickly understand the key findings of thousands of research papers on a certain type of cancer, helping them identify trends and knowledge gaps needed to set new research priorities. 

Before the widespread use of LLMs, document analysis was primarily conducted through manual methods and rule-based systems. These methods were often time-consuming, labor-intensive, and limited in their ability to handle complex language nuances and unstructured data. 

The development of advanced LLMs, such as Cohere’s Command R, and AI platforms, such as Cloudera Artificial Intelligence (CAI), has made it easier than ever for enterprises to deploy high-impact document analysis applications. We created our “Document Analysis with Command R and FAISS” AMP to make that process even easier.

Cohere’s Command R family of models consists of advanced LLMs that leverage state-of-the-art transformer architectures to handle complex text generation and understanding tasks with high accuracy and speed, making them suitable for enterprise-level applications and real-time processing needs. They are designed to integrate easily into a variety of applications, offering scalability and flexibility for both small-scale and large-scale implementations. The Cohere Toolkit is a collection of pre-built components that enables developers to quickly build and deploy RAG applications.

CAI is a robust platform for data scientists and Artificial Intelligence (AI) practitioners to build, train, deploy, and manage models and applications at scale. AMPs are one-click deployments of commonly used AI/ML-based prototypes that reduce time to value by providing high-quality reference examples leveraging Cloudera’s research and expertise to showcase cutting-edge AI applications. 

This AMP is a single project launched from CAI that automatically deploys an application, loads vectors into a FAISS vector store, and enables interfacing with Cohere’s Command R LLM to perform document analysis. The image below illustrates the Retrieval-Augmented Generation (RAG) architecture used by the AMP, and how the components of Cohere, FAISS, the user’s knowledge base, and Streamlit work together to create a ready-to-use Generative AI use case.
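The retrieve-then-generate flow described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the AMP's code: `toy_embed` is a hypothetical bag-of-words stand-in for an embedding model, the brute-force ranking stands in for the FAISS vector store, and the final call to the LLM is left out. All function names here are invented for the example.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Hypothetical stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, passages, k=2):
    """Rank passages by similarity to the question and keep the top k."""
    q = toy_embed(question)
    ranked = sorted(passages, key=lambda p: cosine(q, toy_embed(p)), reverse=True)
    return ranked[:k]


def build_prompt(question, context):
    """Assemble the augmented prompt that would be sent to the LLM."""
    joined = "\n".join(f"- {c}" for c in context)
    return (
        "Answer using only the context below.\n\n"
        f"Context:\n{joined}\n\nQuestion: {question}"
    )


# A tiny illustrative knowledge base (assumed content, not from the AMP).
passages = [
    "FAISS is a library for similarity search over dense vectors.",
    "Streamlit provides the user interface for the application.",
    "Command R is a large language model from Cohere.",
]
question = "Which library handles similarity search?"
top = retrieve(question, passages, k=1)
prompt = build_prompt(question, top)
```

In a real deployment the embedding, vector search, and generation steps would each be handled by the corresponding component named above; the sketch only shows how retrieved passages end up in the prompt.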

This project brings several new themes to Cloudera’s AMP library, especially in terms of RAG. Facebook’s open-source FAISS is a library for efficient similarity search and clustering of dense vectors. It contains algorithms that search in sets of vectors of any size, including sets that may not fit in RAM. By leveraging it in this AMP, Cloudera demonstrates its flexibility in vector search applications and adds FAISS to the vector stores already used in its existing AMP catalog, including Milvus, Chroma, and Pinecone.
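To make "similarity search over dense vectors" concrete, here is a small pure-Python sketch of exact (brute-force) nearest-neighbor search by squared L2 distance. It is an assumption-laden illustration only: `ToyFlatIndex` is a hypothetical class written for this example, it does not use or reproduce the FAISS API, and it has none of the optimizations that make FAISS practical at scale.

```python
import heapq


class ToyFlatIndex:
    """Hypothetical illustration of exact vector search; not the FAISS API."""

    def __init__(self, dim):
        self.dim = dim
        self.vectors = []

    def add(self, vectors):
        """Store vectors, rejecting any with the wrong dimensionality."""
        for v in vectors:
            if len(v) != self.dim:
                raise ValueError(f"expected {self.dim} dimensions, got {len(v)}")
            self.vectors.append(list(v))

    def search(self, query, k):
        """Return (squared L2 distance, position) pairs for the k nearest vectors."""
        scored = (
            (sum((a - b) ** 2 for a, b in zip(query, v)), i)
            for i, v in enumerate(self.vectors)
        )
        return heapq.nsmallest(k, scored)


index = ToyFlatIndex(dim=3)
index.add([[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.9, 0.1, 0.0]])
hits = index.search([1.0, 0.0, 0.0], k=2)
```

Here `hits` lists the stored vector identical to the query first, followed by its closest neighbor. Comparing the query against every stored vector is the simplest possible strategy; a dedicated library is what keeps this step fast as the collection grows.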

Additionally, the AMP leverages LangChain’s AI toolkit, which uses custom connectors to Cohere and FAISS to enable advanced semantic search and summarization capabilities in a clean, easy-to-understand code base. It also uses Cohere’s embed-english-v3.0 model, which is tailor-made for generating high-quality text embeddings from English-language inputs and excels at capturing semantic nuances. By using Streamlit for the UI, the AMP gives users a simple starting template that can serve as the basis for a full-scale production deployment.

More on how the “Document Analysis with Command R and FAISS” AMP works and how to deploy it can be found in this GitHub repository.

Be on the lookout for more news from Cohere and Cloudera as we work together to make it easier than ever to deploy high-performance AI applications.

Kevin Talbert
Senior Solutions Engineer
Abhas Ricky
Chief Strategy Officer
Nashua Springberry
Principal Corporate Strategist
