# DE - Python Lateral Training Assignment
This notebook contains an Apache Beam data pipeline that reads a CSV file from Google Cloud Storage (GCS) and writes its contents to a BigQuery table.
## Setup
1. **Install required packages:**
```
pip install apache-beam[gcp] google-cloud-storage
```
(`apache-beam[gcp]` already includes the core `apache-beam` package along with the GCS and BigQuery I/O connectors, so it does not need to be installed separately.)
2. **Authenticate user:** Authenticate with Google Cloud so the pipeline can read from GCS and write to BigQuery; one approach is sketched below.
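One way to perform the authentication step, assuming the notebook runs in Google Colab (outside Colab, running `gcloud auth application-default login` in a shell achieves the same thing):

```
# Assumes a Google Colab environment; skip this if Application Default
# Credentials are already configured on the machine.
from google.colab import auth

auth.authenticate_user()  # interactive prompt; credentials cover GCS and BigQuery
```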
## Pipeline Steps
1. **Read data from GCS:** The pipeline reads the CSV file from the specified GCS bucket.
2. **Parse CSV to dictionary:** The `parse_csv_to_dict` function converts each row of the CSV data into a dictionary based on the provided table schema.
3. **Remove invalid data:** Rows with `nan` values are filtered out.
4. **Write data to BigQuery:** The pipeline writes the processed data to the specified BigQuery table. A sketch of all four steps follows this list.
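The notebook itself defines `parse_csv_to_dict` and `run_pipeline`; the sketch below is a minimal reconstruction under stated assumptions, not the notebook's exact code. The `SCHEMA_FIELDS` column names are hypothetical placeholders, all columns are assumed to be strings, and the parser naively splits on commas (no quoted fields).

```
import apache_beam as beam

SCHEMA_FIELDS = ["id", "name", "value"]  # placeholder schema; use the real columns

def parse_csv_to_dict(line, fields=SCHEMA_FIELDS):
    """Zip one CSV line's values with the schema's field names."""
    return dict(zip(fields, line.split(",")))  # naive split; assumes no quoted commas

def has_no_nan(row):
    """Keep only rows with no empty or 'nan' values."""
    return all(v not in ("", "nan", "NaN") for v in row.values())

def run_pipeline(options, input_path, project_id, dataset_id, table_name):
    table_spec = f"{project_id}:{dataset_id}.{table_name}"
    with beam.Pipeline(options=options) as p:
        (
            p
            | "Read data from GCS" >> beam.io.ReadFromText(input_path, skip_header_lines=1)
            | "Parse CSV to dictionary" >> beam.Map(parse_csv_to_dict)
            | "Remove invalid data" >> beam.Filter(has_no_nan)
            | "Write data to BigQuery" >> beam.io.WriteToBigQuery(
                table_spec,
                schema=",".join(f"{f}:STRING" for f in SCHEMA_FIELDS),
                create_disposition=beam.io.BigQueryDisposition.CREATE_IF_NEEDED,
                write_disposition=beam.io.BigQueryDisposition.WRITE_TRUNCATE,
            )
        )
```

Filtering before the write keeps the BigQuery load from failing on rows with missing values.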
## Running the Pipeline
1. **Set pipeline options:** Configure the pipeline options, including the runner type and temporary location.
2. **Specify input and output parameters:** Set the input file path, project ID, dataset ID, and table name.
3. **Run the pipeline:** Call the `run_pipeline` function with the configured options and parameters, as in the sketch below.
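A hedged sketch of these three steps, reusing the `run_pipeline` signature assumed above; every ID and path is a placeholder:

```
from apache_beam.options.pipeline_options import PipelineOptions

# Step 1: pipeline options -- DirectRunner for local execution; temp_location
# is the GCS staging area that BigQuery load jobs write through.
options = PipelineOptions(
    runner="DirectRunner",
    project="your-project",                 # placeholder project ID
    temp_location="gs://your-bucket/temp",  # placeholder bucket
)

# Steps 2 and 3: set the input/output parameters and launch.
run_pipeline(
    options=options,
    input_path="gs://your-bucket/input.csv",
    project_id="your-project",
    dataset_id="your_dataset",
    table_name="your_table",
)
```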
## Note
* Ensure that you have the necessary permissions to access GCS and BigQuery.
* The `DirectRunner` is used for local execution. To run the pipeline on Google Cloud Dataflow, switch to the `DataflowRunner` (see the options sketch below).
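Switching runners is only a change of options; for example (region and staging path are placeholder values):

```
options = PipelineOptions(
    runner="DataflowRunner",
    project="your-project",
    region="us-central1",                         # placeholder region
    temp_location="gs://your-bucket/temp",
    staging_location="gs://your-bucket/staging",  # placeholder staging bucket
)
```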