# DE - Python Lateral Training Assignment

This notebook contains an Apache Beam data pipeline that reads data from a CSV file in Google Cloud Storage (GCS) and writes it to a BigQuery table.

## Setup

1. **Install the required packages.** Note that `apache-beam[gcp]` already includes the base `apache-beam` package along with the GCP I/O connectors:

   ```
   pip install "apache-beam[gcp]" google-cloud-storage
   ```

2. **Authenticate with Google Cloud.** Make sure Application Default Credentials are available, for example via `gcloud auth application-default login` on a local machine, or use your notebook environment's own authentication helper.

## Pipeline Steps

1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary keyed by the columns of the provided table schema.
3. **Remove invalid data:** Rows containing `nan` values are filtered out.
4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table.

A minimal sketch of these steps is included at the end of this README.

## Running the Pipeline

1. **Set pipeline options:** Configure the pipeline options, including the runner type and a temporary GCS location.
2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters (see the example at the end of this README).

## Note

* Ensure that you have the necessary permissions to access GCS and BigQuery.
* The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, use the `DataflowRunner` instead.
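
## Example: Pipeline Sketch

The notebook itself is not reproduced in this README, so the following is only a minimal sketch of the four pipeline steps listed above. The `parse_csv_to_dict` and `run_pipeline` names are taken from the notebook; the column list, BigQuery schema, and row-validation rule shown here are illustrative placeholders and may differ from the actual implementation.

```
import apache_beam as beam

# Hypothetical column list and BigQuery schema; the notebook derives these
# from its own table schema.
COLUMNS = ["id", "name", "score"]
TABLE_SCHEMA = "id:STRING,name:STRING,score:STRING"


def parse_csv_to_dict(line, columns=COLUMNS):
    """Convert one CSV line into a dictionary keyed by column name."""
    values = line.split(",")
    return dict(zip(columns, values))


def is_valid(row):
    """Return False for rows that contain empty or 'nan' values."""
    return all(value not in ("", "nan", "NaN") for value in row.values())


def run_pipeline(input_path, project, dataset, table, options):
    """Read a CSV from GCS, drop invalid rows, and write the rest to BigQuery."""
    table_spec = f"{project}:{dataset}.{table}"
    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            | "Read data from GCS" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse CSV to dictionary" >> beam.Map(parse_csv_to_dict)
            | "Remove invalid data" >> beam.Filter(is_valid)
            | "Write data to BigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```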
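
## Example: Running the Pipeline

Again as a sketch only: the project ID, bucket, dataset, and table names below are placeholders, and the exact signature of `run_pipeline` in the notebook may differ from the one assumed here.

```
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; replace them with your own project, bucket, and table.
PROJECT_ID = "your-project-id"
INPUT_PATH = "gs://your-bucket/path/to/input.csv"
DATASET_ID = "your_dataset"
TABLE_NAME = "your_table"

# DirectRunner executes the pipeline locally.
options = PipelineOptions(
    runner="DirectRunner",
    project=PROJECT_ID,
    temp_location="gs://your-bucket/tmp",
)

# To run on Google Cloud Dataflow, switch the runner and add a region, e.g.:
# options = PipelineOptions(
#     runner="DataflowRunner",
#     project=PROJECT_ID,
#     region="us-central1",
#     temp_location="gs://your-bucket/tmp",
# )

run_pipeline(INPUT_PATH, PROJECT_ID, DATASET_ID, TABLE_NAME, options)
```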