"<b>Case Study:</b> Imagine you are working for GenAI, a cutting-edge AI company that specializes in natural language processing and document management. Your task is to create a coding exercise that demonstrates the integration of multiple PDF and Excel documents using vector database text embedding. Additionally, you will leverage LangChain for text summarization."
]
},
{
"cell_type": "markdown",
"id": "fd7d033a-2d71-4ddb-8903-d514d73c0b34",
"metadata": {},
"source": [
"**ContractNLI** is a dataset for document-level natural language inference (NLI) on contracts whose goal is to automate/support a time-consuming procedure of contract review. In this task, a system is given a set of hypotheses (such as “Some obligations of Agreement may survive termination.”) and a contract, and it is asked to classify whether each hypothesis is entailed by, contradicting to or not mentioned by (neutral to) the contract as well as identifying evidence for the decision as spans in the contract."
"C:\\Users\\acer\\AppData\\Roaming\\Python\\Python311\\site-packages\\transformers\\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
"query_embedding = model.encode(\"What is the primary purpose of this Confidentiality Agreement between BROOKS and the CLIENT?\").astype('float32')\n",
"D, I = index.search(query_embedding.reshape(1, -1), k=5) # Search for 5 nearest neighbors\n",
"\n",
"# Retrieve filenames based on index\n",
"for idx in I[0]:\n",
" print(list(pdf_embeddings.keys())[idx]) # Get corresponding filenames"
]
},
{
"cell_type": "code",
"execution_count": 36,
"id": "e84264fc-d2fa-4cb4-a97f-1758f391b5d4",
"metadata": {},
"outputs": [],
"source": [
"# Example text summarization using Hugging Face in LangChain context\n",
"sample_text = \"\"\"\n",
"\n",
" NON-DISCLOSURE AGREEMENT\n",
"\n",
" The parties to this Agreement are MPD Technologies, Inc. (\"Disclosing\n",
"Party\") and the undersigned \"Recipient\". The parties desire that Disclosing\n",
"Party disclose certain Information or Items to Recipient, but Disclosing Party\n",
"desires to maintain the trade secret, proprietary or private nature of such\n",
"Information or Items.\n",
"\n",
" As used herein, the following words have the indicated meanings:\n",
"\n",
" (i) \"Disclose\" means to reveal, make known, make available, furnish, or\n",
"permit access to, whether or not intentionally.\n",
"\n",
" (ii) \"Information\" means all oral, written, or other information\n",
"whatsoever, including information in documents and other recording media and\n",
"information embodied in any item, which in connection with the Matter, is (a)\n",
"obtained by Recipient from or through Disclosing Party, (b) obtained by or\n",
"through Recipient by an examination of any Item, or (c) created by or through\n",
"Recipient with the use of information in (a) or (b). It includes but is not\n",
"limited to ideas, inventions, discoveries, formulas, methods, designs, drawings,\n",
"specifications, engineering and manufacturing data. This information is limited\n",
"to trade secrets and other proprietary or private information of Disclosing\n",
"Party or of any third party if disclosed by or through Disclosing Party.\n",
"\n",
" (iii) \"Item\" means any system, subsystem, assembly, subassembly, device,\n",
"components, product, or machine, work of authorship, or part thereof, or\n",
"substance which is disclosed by or through Disclosing Party hereunder, which\n",
"embodies trade secret or other proprietary or private information of Disclosing\n",
"Party or of any third party if disclosed by or through Disclosing Party.\n",
"\n",
" (iv) \"Matter\" means the project or other matter in connection with which\n",
"this Agreement is executed. This matter is or relates to the potential\n",
"acquisition by the Recipient of a controlling ownership interest in the share\n",
"capital or business of the Disclosing Party.\n",
"\"\"\""
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "e757f6a8-073b-49b4-9cdc-16915e5cff90",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"C:\\Users\\acer\\AppData\\Roaming\\Python\\Python311\\site-packages\\transformers\\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884\n",
" warnings.warn(\n"
]
}
],
"source": [
"# Create a Hugging Face summarization pipeline\n",
"The parties to this Agreement are MPD Technologies, Inc. and the undersigned \"Recipient\" The parties desire that Disclosing Party disclose certain Information or Items to Recipient. The parties also desire to maintain the trade secret, proprietary or private nature of such information or Items. \"Information\" means all oral, written, or other information whatsoever. \"Matter\" means the project or other matter in connection with which this Agreement is executed. This matter is or relates to the potential acquisition by the Recipient of a controlling ownership interest in the share capital or business of the Discloses Party. This information is limited to trade secrets and other proprietary orPrivate information of Disclose Party or of any third party if disclosed by or through DisclOSE Party.\n"
```python
# Generate embeddings for multiple texts (from PDF and JSON)
def generate_embeddings(texts):
    embeddings = {}
    for filename, text in texts.items():
        embedding = model.encode(text)
        embeddings[filename] = embedding
    return embeddings
```
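
Once per-file embeddings exist, the query step in this notebook encodes a question, asks the index for its five nearest neighbors, and maps the returned positions back to filenames. The sketch below illustrates that lookup under stated assumptions: it uses a brute-force squared-L2 search in plain Python as a stand-in for the vector index, and the `search` helper, the filenames, and the three-dimensional toy vectors are all hypothetical, chosen only to show how a ranked result ties back to a filename.

```python
from typing import Dict, List, Tuple


def search(
    embeddings: Dict[str, List[float]],
    query: List[float],
    k: int = 5,
) -> List[Tuple[str, float]]:
    """Return the k filenames closest to `query` by squared L2 distance."""
    scored = []
    for filename, vector in embeddings.items():
        # Squared Euclidean distance between the query and this document vector
        distance = sum((q - v) ** 2 for q, v in zip(query, vector))
        scored.append((filename, distance))
    # Smallest distance first, so the best match sits at position 0
    scored.sort(key=lambda pair: pair[1])
    return scored[:k]


# Toy vectors standing in for real sentence embeddings (hypothetical data)
toy_embeddings = {
    "nda_a.pdf": [1.0, 0.0, 0.0],
    "nda_b.pdf": [0.0, 1.0, 0.0],
    "prices.xlsx": [0.0, 0.0, 1.0],
}

results = search(toy_embeddings, query=[0.9, 0.1, 0.0], k=2)
for filename, distance in results:
    print(f"{filename}: {distance:.3f}")
```

Returning filenames directly avoids a separate position-to-name lookup. When a real index returns integer positions instead, the lookup is only correct if the vectors were added to the index in the same order as the dictionary keys.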
%% Output
C:\Users\acer\AppData\Roaming\Python\Python311\site-packages\transformers\tokenization_utils_base.py:1617: FutureWarning: `clean_up_tokenization_spaces` was not set. It will be set to `True` by default. This behavior will be deprecated in transformers v4.45, and will be then set to `False` by default. For more details check this issue: https://github.com/huggingface/transformers/issues/31884
The parties to this Agreement are MPD Technologies, Inc. and the undersigned "Recipient" The parties desire that Disclosing Party disclose certain Information or Items to Recipient. The parties also desire to maintain the trade secret, proprietary or private nature of such information or Items. "Information" means all oral, written, or other information whatsoever. "Matter" means the project or other matter in connection with which this Agreement is executed. This matter is or relates to the potential acquisition by the Recipient of a controlling ownership interest in the share capital or business of the Discloses Party. This information is limited to trade secrets and other proprietary orPrivate information of Disclose Party or of any third party if disclosed by or through DisclOSE Party.