This notebook contains an Apache Beam data pipeline that reads data from a CSV file in Google Cloud Storage (GCS) and writes it to a BigQuery table.
## Setup
1. **Install required packages:**
```
apache-beam[gcp]
google-cloud-storage
```
(`apache-beam[gcp]` already includes the base `apache-beam` package.)
2. **Authenticate with Google Cloud** so the notebook can read from GCS and write to BigQuery (see the sketch below).
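A minimal authentication sketch, assuming the notebook runs in Google Colab; in other environments, `gcloud auth application-default login` on the command line serves the same purpose:

```python
# Colab-only helper: opens an auth flow and stores application-default credentials.
from google.colab import auth

auth.authenticate_user()
```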
## Pipeline Steps
1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary based on the provided table schema.
3. **Remove invalid data:** Rows containing `nan` values are filtered out.
4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table. A sketch of these steps appears after this list.
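The sketch below shows one way these four steps can be wired together with Apache Beam. The table schema, the naive comma-splitting in `parse_csv_to_dict`, and the helper `is_valid_row` are illustrative assumptions, not the notebook's exact code:

```python
import apache_beam as beam

# Assumed schema: BigQuery column name -> type (placeholder columns).
TABLE_SCHEMA = {'id': 'INTEGER', 'name': 'STRING', 'score': 'FLOAT'}


def parse_csv_to_dict(line, schema=TABLE_SCHEMA):
    """Map one CSV line onto the schema's column names.

    Simplified: splits on ',' and does not handle quoted fields.
    """
    values = line.split(',')
    return dict(zip(schema.keys(), values))


def is_valid_row(row):
    """Keep only rows with no empty or 'nan' values."""
    return all(v not in ('', 'nan', 'NaN') for v in row.values())


def run_pipeline(input_path, project, dataset, table, options):
    # BigQuery schema string of the form 'col:TYPE,col:TYPE'.
    schema_str = ','.join(f'{k}:{v}' for k, v in TABLE_SCHEMA.items())
    with beam.Pipeline(options=options) as p:
        (p
         | 'ReadFromGCS' >> beam.io.ReadFromText(input_path, skip_header_lines=1)
         | 'ParseCSV' >> beam.Map(parse_csv_to_dict)
         | 'RemoveInvalid' >> beam.Filter(is_valid_row)
         | 'WriteToBigQuery' >> beam.io.WriteToBigQuery(
               table=f'{project}:{dataset}.{table}',
               schema=schema_str,
               write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
               create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED))
```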
## Running the Pipeline
1. **Set pipeline options:** Configure the pipeline options, including the runner type and temporary location.
2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters, as in the example below.
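A hypothetical invocation matching these steps; the bucket, project, dataset, and table names are placeholders, and `run_pipeline` is the sketch defined above:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Local execution; temp_location is still required for BigQuery file loads.
options = PipelineOptions(
    runner='DirectRunner',
    temp_location='gs://your-bucket/temp',
)

run_pipeline(
    input_path='gs://your-bucket/data/input.csv',
    project='your-project-id',
    dataset='your_dataset',
    table='your_table',
    options=options,
)
```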
## Note
* Ensure that you have the necessary permissions to access GCS and BigQuery.
* The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, use the `DataflowRunner` (see the options sketch below).
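Illustrative options for a Dataflow run; the project, region, and bucket values are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Dataflow needs a project, region, and GCS locations for temp and staging files.
dataflow_options = PipelineOptions(
    runner='DataflowRunner',
    project='your-project-id',
    region='us-central1',
    temp_location='gs://your-bucket/temp',
    staging_location='gs://your-bucket/staging',
)
```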