# DE - Python Lateral Training Assignment

This notebook contains a data pipeline that reads data from a CSV file in Google Cloud Storage (GCS) and writes it to a BigQuery table.

## Setup

1. **Install required packages:**

   ```
   apache-beam[gcp]
   google-cloud-storage
   ```

2. **Authenticate the user** (a setup sketch appears at the end of this README).

## Pipeline Steps

1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary based on the provided table schema.
3. **Remove invalid data:** Rows containing `nan` values are filtered out.
4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table.

A minimal sketch of these steps is given in the example section below.

## Running the Pipeline

1. **Set pipeline options:** Configure the pipeline options, including the runner type and temporary location.
2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters.

An example invocation is sketched at the end of this README.

## Notes

* Ensure that you have the necessary permissions to access GCS and BigQuery.
* The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, use the `DataflowRunner` instead.
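
## Example Sketches

The snippets below are minimal sketches rather than the notebook's actual code; the schema, helper names, and project/bucket values are illustrative assumptions.

For the setup steps, assuming the notebook runs in Google Colab (on a local machine, authenticate with `gcloud auth application-default login` instead):

```python
# Install the required packages from inside the notebook.
%pip install --quiet "apache-beam[gcp]" google-cloud-storage

# Authenticate the user (Colab-specific helper; assumes a Colab runtime).
from google.colab import auth
auth.authenticate_user()
```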
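
The pipeline steps could be wired together roughly as follows. The real `parse_csv_to_dict`, the table schema, and the `has_no_nan` filter in the notebook may differ; `TABLE_SCHEMA` here is a placeholder.

```python
import csv

import apache_beam as beam

# Illustrative schema; the notebook defines its own table schema.
TABLE_SCHEMA = "id:INTEGER,name:STRING,score:FLOAT"
FIELD_NAMES = [field.split(":")[0] for field in TABLE_SCHEMA.split(",")]


def parse_csv_to_dict(line, field_names=tuple(FIELD_NAMES)):
    """Convert one CSV line into a dict keyed by the schema's field names."""
    values = next(csv.reader([line]))
    return dict(zip(field_names, values))


def has_no_nan(row):
    """Return True only for rows without empty or 'nan' values."""
    return all(value not in ("", "nan", "NaN") for value in row.values())


def run_pipeline(input_file, project_id, dataset_id, table_name, pipeline_options):
    """Read a CSV from GCS, clean it, and write it to BigQuery."""
    table_spec = f"{project_id}:{dataset_id}.{table_name}"
    with beam.Pipeline(options=pipeline_options) as pipeline:
        (
            pipeline
            | "Read from GCS" >> beam.io.ReadFromText(input_file, skip_header_lines=1)
            | "Parse CSV to dict" >> beam.Map(parse_csv_to_dict)
            | "Remove invalid data" >> beam.Filter(has_no_nan)
            | "Write to BigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=TABLE_SCHEMA,
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_APPEND,
            )
        )
```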
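
And an example invocation using the `DirectRunner` for local execution; the project, bucket, dataset, and table names below are placeholders:

```python
from apache_beam.options.pipeline_options import PipelineOptions

# Placeholder values; substitute your own project, bucket, dataset, and table.
options = PipelineOptions(
    flags=[],                      # ignore the notebook kernel's command-line args
    runner="DirectRunner",         # switch to "DataflowRunner" to run on Dataflow
    project="my-gcp-project",
    temp_location="gs://my-bucket/temp",
    region="us-central1",
)

run_pipeline(
    input_file="gs://my-bucket/data/input.csv",
    project_id="my-gcp-project",
    dataset_id="my_dataset",
    table_name="my_table",
    pipeline_options=options,
)
```

With the `DirectRunner` the job runs in the local Python process; switching to the `DataflowRunner` requires a valid `temp_location` and `region` so the job can be staged and executed on Google Cloud Dataflow.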