diff --git a/Python_DE_Lateral/README.md b/Python_DE_Lateral/README.md
new file mode 100644
index 0000000000000000000000000000000000000000..b3e706cef23275b830ea97feee235f576acf7e63
--- /dev/null
+++ b/Python_DE_Lateral/README.md
@@ -0,0 +1,34 @@
+# DE - Python Lateral Training Assignment
+
+This notebook contains a data pipeline that reads data from a CSV file in Google Cloud Storage (GCS) and writes it to a BigQuery table.
+
+## Setup
+
+1. **Install required packages:**
+```
+apache-beam
+google-cloud-storage
+apache-beam[gcp]
+```
+
+2. **Authenticate user**
+
+
+## Pipeline Steps
+
+1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
+2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary based on the provided table schema.
+3. **Remove invalid data:** Rows with `nan` values are filtered out.
+4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table.
+
+## Running the Pipeline
+
+1. **Set pipeline options:** Configure the pipeline options, including the runner type and temporary location.
+2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
+3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters.
+
+## Note
+
+* Ensure that you have the necessary permissions to access GCS and BigQuery.
+
+* The `DirectRunner` is used for local execution. For running the pipeline on Google Cloud Dataflow, use the `DataflowRunner`.
\ No newline at end of file
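
The four pipeline steps described in the README could be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the README names `parse_csv_to_dict` and `run_pipeline` but does not show their signatures, so the parameters used here (`field_names`, `table_spec`, `schema`) and the `has_no_nan` helper are assumptions. The sketch also assumes the CSV has a single header row and that `nan` appears either as a float NaN or as the literal string `nan`.

```python
import csv
import math


def parse_csv_to_dict(line, field_names):
    """Turn one CSV line into a dict keyed by the given column names.

    The signature is an assumption; the README only names the function.
    """
    values = next(csv.reader([line]))
    return dict(zip(field_names, values))


def has_no_nan(row):
    """Return True when no value in the row is NaN (hypothetical helper)."""
    for value in row.values():
        if isinstance(value, float) and math.isnan(value):
            return False
        if isinstance(value, str) and value.strip().lower() == "nan":
            return False
    return True


def run_pipeline(options, input_path, table_spec, schema, field_names):
    """Wire the four README steps together with Apache Beam.

    Beam is imported lazily so the helpers above can be used without it.
    `table_spec` is assumed to be a "project:dataset.table" string and
    `schema` a "name:TYPE,name:TYPE" string.
    """
    import apache_beam as beam

    with beam.Pipeline(options=options) as pipeline:
        (
            pipeline
            # Step 1: read text lines from GCS, skipping the assumed header row.
            | "ReadFromGCS" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            # Step 2: convert each line into a dictionary.
            | "ParseCSV" >> beam.Map(parse_csv_to_dict, field_names)
            # Step 3: drop rows that contain NaN values.
            | "DropNaN" >> beam.Filter(has_no_nan)
            # Step 4: append the remaining rows to the BigQuery table.
            | "WriteToBigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```

The two helpers can be tried locally without Beam or GCP credentials, for example `parse_csv_to_dict("alice,30", ["name", "age"])` followed by `has_no_nan(...)` on the result. Calling `run_pipeline` requires `apache-beam[gcp]`, an authenticated environment, and pipeline options that set the runner and a temporary location.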