# DE - Python Lateral Training Assignment

This notebook contains an Apache Beam data pipeline that reads data from a CSV file in Google Cloud Storage (GCS) and writes it to a BigQuery table.

## Setup

1. **Install the required packages.** Note that `apache-beam[gcp]` already includes the base `apache-beam` package along with the GCP I/O connectors:

   ```
   pip install "apache-beam[gcp]" google-cloud-storage
   ```

2. **Authenticate with Google Cloud.** Make sure Application Default Credentials are available, for example via `gcloud auth application-default login` on a local machine, or use your notebook environment's own authentication helper.

## Pipeline Steps

1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary keyed by the columns of the provided table schema.
3. **Remove invalid data:** Rows containing `nan` values are filtered out.
4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table.

A minimal sketch of these steps is included at the end of this README.

## Running the Pipeline

1. **Set pipeline options:** Configure the pipeline options, including the runner type and a temporary GCS location.
2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters (see the example at the end of this README).

## Note

* Ensure that you have the necessary permissions to access GCS and BigQuery.
* The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, use the `DataflowRunner` instead.
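
## Example: Pipeline Sketch

The notebook itself is not reproduced in this README, so the following is only a minimal sketch of the four pipeline steps listed above. The `parse_csv_to_dict` and `run_pipeline` names are taken from the notebook; the column list, BigQuery schema, and row-validation rule shown here are illustrative placeholders and may differ from the actual implementation.

```
import apache_beam as beam

# Hypothetical column list and BigQuery schema; the notebook derives these
# from its own table schema.
COLUMNS = ["id", "name", "score"]
TABLE_SCHEMA = "id:STRING,name:STRING,score:STRING"


def parse_csv_to_dict(line, columns=COLUMNS):
    """Convert one CSV line into a dictionary keyed by column name."""
    values = line.split(",")
    return dict(zip(columns, values))


def is_valid(row):
    """Return False for rows that contain empty or 'nan' values."""
    return all(value not in ("", "nan", "NaN") for value in row.values())


def run_pipeline(input_path, project, dataset, table, options):
    """Read a CSV from GCS, drop invalid rows, and write the rest to BigQuery."""
    table_spec = f"{project}:{dataset}.{table}"
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read data from GCS" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse CSV to dictionary" >> beam.Map(parse_csv_to_dict)
            | "Remove invalid data" >> beam.Filter(is_valid)
            | "Write data to BigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```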
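
## Example: Running the Pipeline

Again as a sketch only: the project ID, bucket, dataset, and table names below are placeholders, and the exact signature of `run_pipeline` in the notebook may differ from the one assumed here.

```
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace them with your own project, bucket, and table.
PROJECT_ID = "your-project-id"
INPUT_PATH = "gs://your-bucket/path/to/input.csv"
DATASET_ID = "your_dataset"
TABLE_NAME = "your_table"

# DirectRunner executes the pipeline locally.
options = PipelineOptions(
    runner="DirectRunner",
    project=PROJECT_ID,
    temp_location="gs://your-bucket/tmp",
)

# To run on Google Cloud Dataflow, switch the runner and add a region, e.g.:
# options = PipelineOptions(
#     runner="DataflowRunner",
#     project=PROJECT_ID,
#     region="us-central1",
#     temp_location="gs://your-bucket/tmp",
# )

run_pipeline(INPUT_PATH, PROJECT_ID, DATASET_ID, TABLE_NAME, options)
```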