Commit 457c20a8 authored by Durvesh Rajubhau Mahurkar

Upload New File

parent 6ea9ba55
%% Cell type:markdown id:3c146709-280d-4b17-abd2-dd4a3114156d tags:
<b>Case Study:</b> Imagine you work for GenAI, a cutting-edge AI company that specializes in natural language processing and document management. Your task is to create a coding exercise that demonstrates how to integrate multiple PDF and Excel documents by embedding their text and storing the embeddings in a vector database. You will also use LangChain for text summarization.
%% Cell type:markdown id:fd7d033a-2d71-4ddb-8903-d514d73c0b34 tags:
**ContractNLI** is a dataset for document-level natural language inference (NLI) on contracts. Its goal is to automate or support the time-consuming procedure of contract review. In this task, a system is given a contract and a set of hypotheses (such as “Some obligations of Agreement may survive termination.”). It must classify each hypothesis as entailed by, contradicted by, or not mentioned by (neutral to) the contract, and identify the evidence for each decision as spans in the contract.
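
The task described above pairs each hypothesis with one of three labels and a list of evidence spans. A minimal sketch of how such a judgement could be held in memory follows. The class names, the label strings, the use of character offsets for spans, and the sample contract sentence are all illustrative assumptions, not the dataset's actual schema or content.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Tuple


class NLILabel(Enum):
    # Label strings are assumed for illustration.
    ENTAILMENT = "Entailment"
    CONTRADICTION = "Contradiction"
    NOT_MENTIONED = "NotMentioned"


@dataclass
class HypothesisJudgement:
    hypothesis: str
    label: NLILabel
    # Assumed representation: (start, end) character offsets into the contract.
    evidence_spans: List[Tuple[int, int]] = field(default_factory=list)

    def evidence(self, contract_text: str) -> List[str]:
        """Return the contract substrings that support the label."""
        return [contract_text[start:end] for start, end in self.evidence_spans]


# Made-up contract sentence, used only to exercise the structure.
contract = "The Recipient's duty of confidentiality shall survive termination of this Agreement."
judgement = HypothesisJudgement(
    hypothesis="Some obligations of Agreement may survive termination.",
    label=NLILabel.ENTAILMENT,
    evidence_spans=[(0, len(contract))],
)
print(judgement.label.value)
print(judgement.evidence(contract))
```

Keeping the spans as offsets rather than copied text means the evidence can always be checked against the original contract.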
%% Cell type:markdown id:401d2079-1b79-4fab-bc43-ba8c43830170 tags:
**Extract Text from Multiple PDFs**
%% Cell type:code id:4696d663-8e7b-49d7-b087-aa011490f654 tags:
``` python
# Standard library
import json
import os
from typing import List, Optional

# Third-party libraries
import faiss
import fitz  # PyMuPDF
import numpy as np
import requests
from dotenv import load_dotenv
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# LangChain
from langchain import HuggingFaceHub
from langchain.chains import StuffDocumentsChain
from langchain.chains.llm import LLMChain
from langchain.llms.base import BaseLLM
from langchain.prompts import PromptTemplate
from langchain.schema import LLMResult, Generation
from langchain_google_genai import ChatGoogleGenerativeAI
from langchain_google_genai import GoogleGenerativeAIEmbeddings

load_dotenv()  # Load settings such as API keys from a local .env file
print(faiss.__version__)
```
%% Output
1.8.0
%% Cell type:code id:27c80419-1535-4494-aebf-cdcc2f09faaa tags:
``` python
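def extract_text_from_pdfs(directory):
    """Return a {filename: text} dict for every PDF file in `directory`."""
    pdf_texts = {}
    for filename in os.listdir(directory):
        if filename.lower().endswith(".pdf"):
            pdf_path = os.path.join(directory, filename)
            # Open with a context manager so the file handle is always closed
            with fitz.open(pdf_path) as doc:
                # Concatenate the plain text of every page
                text = "".join(page.get_text() for page in doc)
            pdf_texts[filename] = text
    return pdf_texts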
```
%% Cell type:code id:9ef8503e-74af-46c6-85c4-f90739629ee6 tags:
``` python
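def extract_text_from_json(json_path):
    """Load a JSON file and return its content re-serialized as an indented string."""
    with open(json_path, "r", encoding="utf-8") as file:
        data = json.load(file)
    return json.dumps(data, indent=4)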
```
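
`extract_text_from_json` returns the whole file as one string, so every document and label ends up in a single blob. A sketch of pulling out per-document text and hypotheses instead is shown here. It assumes a layout with a top-level `documents` list (each entry carrying `file_name` and `text`) and a top-level `labels` mapping (each entry carrying `hypothesis`). Those key names are an assumption about the ContractNLI files and should be checked against the actual data. The toy payload is made up.

```python
import json

# Toy payload mirroring the ASSUMED layout; not real dataset content.
toy_json = json.dumps({
    "documents": [
        {"id": 1, "file_name": "example-nda.pdf",
         "text": "Recipient shall keep Information confidential."},
        {"id": 2, "file_name": "other-nda.pdf",
         "text": "This Agreement terminates after two years."},
    ],
    "labels": {
        "nda-1": {"hypothesis": "Example hypothesis text."},
    },
})


def texts_by_file(raw_json: str) -> dict:
    """Map each document's file name to its plain text."""
    data = json.loads(raw_json)
    return {doc["file_name"]: doc["text"] for doc in data.get("documents", [])}


def hypotheses(raw_json: str) -> dict:
    """Map each label key to its hypothesis sentence."""
    data = json.loads(raw_json)
    return {key: entry["hypothesis"] for key, entry in data.get("labels", {}).items()}


print(texts_by_file(toy_json))
print(hypotheses(toy_json))
```

Both helpers take the string that `extract_text_from_json` produces, so they could be applied to it directly if the assumed keys match.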
%% Cell type:code id:308f1949-a628-4071-b160-2c5abbe2a7a2 tags:
``` python
pdf_texts = extract_text_from_pdfs(r"C:\Users\acer\Documents\contract-nli\contract-nli\raw")
json_train_text = extract_text_from_json(r"C:\Users\acer\Documents\contract-nli\contract-nli\train.json")
json_test_text = extract_text_from_json(r"C:\Users\acer\Documents\contract-nli\contract-nli\test.json")
```
%% Cell type:code id:01de7d7d-233f-4f31-8111-a1f4d9ca2502 tags:
``` python
#pdf_texts
```
%% Cell type:markdown id:6a96e9a3-b5d5-4d8c-8d7f-972c3fe79610 tags:
**Vectorize the Extracted Text**
%% Cell type:code id:b1e8117d-2f7d-4bce-a467-ce114bf1e693 tags:
``` python
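# Load the SentenceTransformer model
model = SentenceTransformer('all-MiniLM-L6-v2')


# Generate embeddings for multiple texts (from PDF and JSON)
def generate_embeddings(texts):
    """Encode each text in a {name: text} dict into a single embedding vector."""
    embeddings = {}
    for filename, text in texts.items():
        # Note: the model truncates input beyond its maximum sequence length,
        # so a long document is represented by its opening passage only.
        embeddings[filename] = model.encode(text)
    return embeddings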
```
%% Output
C:\Users\acer\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
%% Cell type:code id:802ff79c-da02-4a3f-b1ab-965e8b64a537 tags:
``` python
# generate_embeddings expects a {name: text} dict, so wrap each JSON string in one
json_train_text = {"train_file.json": json_train_text}
json_test_text = {"test_file.json": json_test_text}
```
%% Cell type:code id:6c30a3d4-843c-484e-bf09-2e11a81db4ce tags:
``` python
# Generate embeddings for the PDF texts
pdf_embeddings = generate_embeddings(pdf_texts)
# Generate embeddings for the train and test JSON texts
json_train_embeddings = generate_embeddings(json_train_text)
json_test_embeddings = generate_embeddings(json_test_text)
```
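
The cells above call `model.encode` once per document. The `all-MiniLM-L6-v2` model card states that input longer than 256 word pieces is truncated by default, so most of a long contract would not influence its vector. One common workaround is to split the text into overlapping chunks and embed each chunk separately. The helper below is a sketch of that splitting step only. The name `chunk_words`, the default sizes, and the use of whitespace-separated words as a rough stand-in for tokens are all assumptions.

```python
from typing import List


def chunk_words(text: str, chunk_size: int = 200, overlap: int = 40) -> List[str]:
    """Split text into chunks of at most `chunk_size` words.

    Consecutive chunks share `overlap` words so that a sentence cut at a
    boundary still appears whole in one of them.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks


sample = " ".join(str(i) for i in range(10))
print(chunk_words(sample, chunk_size=4, overlap=1))
```

Each chunk could then be passed to `model.encode` and stored under a key such as the file name plus the chunk number, so that a search hit points to a passage rather than to a whole file.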
%% Cell type:markdown id:89a47114-be32-4d0b-95e9-6e2bffab269a tags:
**Store Embeddings in a Vector Database**
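
The index used in this section, `faiss.IndexFlatL2`, performs an exhaustive search: the query is compared with every stored vector and the closest ones are returned, with distances reported as squared Euclidean distances. The pure-Python sketch below illustrates that idea on toy two-dimensional vectors. It is not how Faiss is implemented, and the function names are made up for the example.

```python
from typing import List, Sequence, Tuple


def squared_l2(a: Sequence[float], b: Sequence[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def flat_l2_search(
    vectors: Sequence[Sequence[float]], query: Sequence[float], k: int
) -> List[Tuple[float, int]]:
    """Return up to k (distance, index) pairs, nearest first."""
    scored = [(squared_l2(vec, query), i) for i, vec in enumerate(vectors)]
    scored.sort()  # ascending by distance, ties broken by index
    return scored[:k]


# Toy vectors standing in for document embeddings.
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
for distance, idx in flat_l2_search(vectors, [0.9, 0.1], k=2):
    print(idx, round(distance, 4))
```

Because every vector is scanned, results are exact, and the cost grows linearly with the number of stored documents.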
%% Cell type:code id:38aa7076-12cc-4ca7-adc0-653967650ea0 tags:
``` python
# Keep filenames in the same order as the rows of the embedding matrix
filenames = list(pdf_embeddings.keys())
embedding_list = np.array(list(pdf_embeddings.values())).astype('float32')
# Create a Faiss index
index = faiss.IndexFlatL2(embedding_list.shape[1])  # Exact search using L2 distance
index.add(embedding_list)  # Add embeddings to the index
# To search for similar embeddings
query_embedding = model.encode("Non Disclosure Agreement").astype('float32')
D, I = index.search(query_embedding.reshape(1, -1), k=5)  # Search for 5 nearest neighbors
# Retrieve filenames based on index
for idx in I[0]:
    if idx != -1:  # Faiss pads results with -1 when fewer than k vectors are indexed
        print(filenames[idx])  # Get corresponding filenames
```
%% Output
Non-Disclosure%20Agreement_5.pdf
Basic-Non-Disclosure-Agreement.pdf
NDA-Agreement-NPAF.pdf
NDA_6.pdf
Non-Disclosure-Agreement-NDA.pdf
%% Cell type:code id:d780048e-59b5-4d48-8b94-a90649503197 tags:
``` python
# To search for similar embeddings
query_embedding = model.encode("What is the primary purpose of this Confidentiality Agreement between BROOKS and the CLIENT?").astype('float32')
D, I = index.search(query_embedding.reshape(1, -1), k=5)  # Search for 5 nearest neighbors
# Retrieve filenames based on index
filenames = list(pdf_embeddings.keys())
for idx in I[0]:
    if idx != -1:  # Faiss pads results with -1 when fewer than k vectors are indexed
        print(filenames[idx])  # Get corresponding filenames
```
%% Output
183.pdf
BTS_NDA.pdf
BCG-Mutual-NDA.pdf
NMLS%20Accessibility%20NDA.pdf
5-Appendix-Non-Disclosure-Agreement-Mutual.pdf
%% Cell type:code id:e84264fc-d2fa-4cb4-a97f-1758f391b5d4 tags:
``` python
# Example text summarization using Hugging Face in LangChain context
sample_text = """
NON-DISCLOSURE AGREEMENT
The parties to this Agreement are MPD Technologies, Inc. ("Disclosing
Party") and the undersigned "Recipient". The parties desire that Disclosing
Party disclose certain Information or Items to Recipient, but Disclosing Party
desires to maintain the trade secret, proprietary or private nature of such
Information or Items.
As used herein, the following words have the indicated meanings:
(i) "Disclose" means to reveal, make known, make available, furnish, or
permit access to, whether or not intentionally.
(ii) "Information" means all oral, written, or other information
whatsoever, including information in documents and other recording media and
information embodied in any item, which in connection with the Matter, is (a)
obtained by Recipient from or through Disclosing Party, (b) obtained by or
through Recipient by an examination of any Item, or (c) created by or through
Recipient with the use of information in (a) or (b). It includes but is not
limited to ideas, inventions, discoveries, formulas, methods, designs, drawings,
specifications, engineering and manufacturing data. This information is limited
to trade secrets and other proprietary or private information of Disclosing
Party or of any third party if disclosed by or through Disclosing Party.
(iii) "Item" means any system, subsystem, assembly, subassembly, device,
components, product, or machine, work of authorship, or part thereof, or
substance which is disclosed by or through Disclosing Party hereunder, which
embodies trade secret or other proprietary or private information of Disclosing
Party or of any third party if disclosed by or through Disclosing Party.
(iv) "Matter" means the project or other matter in connection with which
this Agreement is executed. This matter is or relates to the potential
acquisition by the Recipient of a controlling ownership interest in the share
capital or business of the Disclosing Party.
"""
```
%% Cell type:code id:e757f6a8-073b-49b4-9cdc-16915e5cff90 tags:
``` python
# Create a Hugging Face summarization pipeline
summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
```
%% Output
C:\Users\acer\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
warnings.warn(
%% Cell type:code id:dca9e4ac-3917-4c35-b13d-4772e0fbb0ef tags:
``` python
class HuggingFaceSummarizer(BaseLLM):
    """Minimal LangChain LLM wrapper around the Hugging Face summarization pipeline."""

    def _generate(self, prompts, **kwargs):
        generations = []
        for prompt in prompts:
            summary = summarizer(prompt, max_length=480, min_length=150, do_sample=False)
            # Wrap the summary in the Generation object as required by LangChain
            generations.append([Generation(text=summary[0]['summary_text'])])
        # Return an LLMResult which includes generations
        return LLMResult(generations=generations)

    @property
    def _llm_type(self):
        return "huggingface"


# Instantiate the summarizer
llm = HuggingFaceSummarizer()
# Create a prompt template for summarization
prompt_template = PromptTemplate(
    input_variables=["text"],
    template="Please summarize the following text:\n{text}"
)
# Create the LLM chain for summarization
summarization_chain = LLMChain(llm=llm, prompt=prompt_template)
text_to_summarize = sample_text
summary = summarization_chain.run(text=text_to_summarize)
print("Summary:")
print(summary)
```
%% Output
Summary:
The parties to this Agreement are MPD Technologies, Inc. and the undersigned "Recipient" The parties desire that Disclosing Party disclose certain Information or Items to Recipient. The parties also desire to maintain the trade secret, proprietary or private nature of such information or Items. "Information" means all oral, written, or other information whatsoever. "Matter" means the project or other matter in connection with which this Agreement is executed. This matter is or relates to the potential acquisition by the Recipient of a controlling ownership interest in the share capital or business of the Discloses Party. This information is limited to trade secrets and other proprietary orPrivate information of Disclose Party or of any third party if disclosed by or through DisclOSE Party.
%% Cell type:code id:7e8bb0d9-cb4a-4e0c-97f9-16f5838cb097 tags:
``` python
```
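
The summarization cell above handles one short excerpt. Summarization models accept only a bounded input length, so a full contract would typically need to be split, summarized piece by piece, and the partial summaries summarized again. The sketch below shows that "summarize the summaries" loop with a pluggable `summarize` callable. The function name `summarize_long_text`, the word-based splitting, and the stand-in summarizer `first_words` are assumptions for illustration. In the notebook, the callable could wrap `summarization_chain.run`.

```python
from typing import Callable


def summarize_long_text(
    text: str, summarize: Callable[[str], str], max_words: int = 300
) -> str:
    """Summarize text of any length with a summarizer limited to `max_words` words."""
    words = text.split()
    if len(words) <= max_words:
        return summarize(text)

    # Split into consecutive pieces the summarizer can handle.
    pieces = [
        " ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)
    ]
    combined = " ".join(summarize(piece) for piece in pieces)

    # Guard against a summarizer that does not shorten its input.
    if len(combined.split()) >= len(words):
        return combined
    return summarize_long_text(combined, summarize, max_words)


def first_words(text: str, n: int = 5) -> str:
    """Stand-in summarizer for the example: keep the first n words."""
    return " ".join(text.split()[:n])


long_text = " ".join(f"w{i}" for i in range(20))
print(summarize_long_text(long_text, first_words, max_words=8))
```

The length guard stops the recursion if a pass fails to shrink the text, at the cost of returning a combined summary that may still exceed `max_words`.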