# DE - Python Lateral Training Assignment

This notebook contains an Apache Beam data pipeline that reads a CSV file from Google Cloud Storage (GCS) and writes the parsed rows to a BigQuery table.

## Setup

1.  **Install required packages** (`apache-beam[gcp]` already includes the base `apache-beam` package):
```
pip install "apache-beam[gcp]" google-cloud-storage
```

2.  **Authenticate your user account** so the notebook can access GCS and BigQuery (a minimal sketch follows).
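
If you are working in Colab, a cell like the following is one way to authenticate; this is a minimal sketch assuming the Colab environment, and a local setup can use `gcloud auth application-default login` instead.
```
# Assumption: the notebook runs in Google Colab. For a local Jupyter setup,
# run `gcloud auth application-default login` in a terminal instead.
from google.colab import auth

auth.authenticate_user()
```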


## Pipeline Steps

1.  **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2.  **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary based on the provided table schema.
3.  **Remove invalid data:** Rows with `nan` values are filtered out.
4.  **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table (a sketch of the full pipeline follows this list).
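
A minimal sketch of these four steps, assuming the notebook's `parse_csv_to_dict` and `run_pipeline` helpers look roughly like this (their exact signatures, the column list, and the filtering rule are assumptions):
```
import apache_beam as beam


def parse_csv_to_dict(line, columns):
    # Pair each comma-separated value with its column name.
    return dict(zip(columns, line.split(",")))


def run_pipeline(pipeline_options, input_path, table_spec, table_schema, columns):
    with beam.Pipeline(options=pipeline_options) as p:
        (
            p
            | "Read from GCS" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse CSV to dict" >> beam.Map(parse_csv_to_dict, columns)
            | "Remove invalid data" >> beam.Filter(
                lambda row: all(v not in ("", "nan") for v in row.values())
            )
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=table_schema,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
            )
        )
```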

## Running the Pipeline

1.  **Set pipeline options:** Configure the pipeline options, including the runner type and temporary location.
2.  **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3.  **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters (see the sketch after this list).
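
Put together, a run might look like this; the project, bucket, table, and schema values are placeholders, and the `run_pipeline` signature matches the sketch above rather than necessarily the notebook's:
```
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: substitute your own project, bucket, dataset, and table.
project_id = "my-project"
input_path = "gs://my-bucket/data/input.csv"
table_spec = f"{project_id}:my_dataset.my_table"
table_schema = "id:INTEGER,name:STRING,value:FLOAT"  # illustrative schema

pipeline_options = PipelineOptions(
    runner="DirectRunner",                # local execution
    project=project_id,
    temp_location="gs://my-bucket/temp",  # staging area for BigQuery loads
)

run_pipeline(pipeline_options, input_path, table_spec, table_schema,
             columns=["id", "name", "value"])
```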

## Note

*   Ensure that you have the necessary permissions to access GCS and BigQuery.

*   The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, switch to the `DataflowRunner`; the sketch below shows the extra options that run needs.
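
A minimal sketch of Dataflow pipeline options, assuming placeholder project, region, and bucket values:
```
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values: substitute your own project, region, and bucket.
dataflow_options = PipelineOptions(
    runner="DataflowRunner",
    project="my-project",
    region="us-central1",
    temp_location="gs://my-bucket/temp",
    staging_location="gs://my-bucket/staging",
)
```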