from Sakar Gurubacharya

From API to S3 Data Lake: ETL Solutions in AWS

Data is everywhere you look these days—businesses thrive on it. Handling it efficiently is more important than ever. For those working in data analytics and statistics, whether in sports, finance, or marketing, the task of collecting, processing, and storing data is often cumbersome and time-consuming. Fortunately, with the advancement of cloud platforms like AWS [1], managing large-scale data has become more accessible and efficient. In this article, you will see how we can leverage AWS services to seamlessly extract statistical data from an external API, transform it into a usable format, and load it into an AWS S3 data lake using an ETL (Extract, Transform, Load) process.
The process leverages Amazon Web Services (AWS) tools such as S3, Boto3, and Glue, demonstrating how to create a robust and scalable ETL pipeline.

Objective: Data Engineering on the Cloud

The primary objective is to understand and implement a cloud-based data pipeline platform that efficiently ingests, transforms, and stores data for analysis. This approach allows businesses to leverage the scalability and flexibility of cloud services while managing large datasets seamlessly. One of the most powerful tools in this process is the ability to ingest statistical data from an external API provider into an S3 bucket. APIs (Application Programming Interfaces) enable us to access structured data from diverse sources—whether it’s financial markets, customer interactions, or operational metrics.

APIs are the backbone of modern data exchange, providing structured access to a wealth of information. However, API data often needs to be transformed and enriched before it can be effectively used for analytics. Simultaneously, cloud-based data lakes, particularly those built on Amazon S3, offer cost-effective and scalable storage for vast amounts of data. Bridging the gap between APIs and S3 requires a well-defined ETL process.

Once the data has been ingested and stored in Amazon S3, it becomes primed for querying and analytics. Leveraging AWS services like Athena for serverless querying and QuickSight for visualization, you can quickly turn raw data into valuable insights. In the following sections, I will demonstrate how to implement an ETL pipeline that seamlessly integrates data from external APIs into your cloud data lake, setting the foundation for powerful, data-driven decision-making.

Overview of the ETL Process in AWS

To make the concept clear, let's start with a high-level architecture overview.
Figure 1: High-Level Overview of ETL on the Cloud

The ETL process begins by interacting with an external API to extract raw data, typically in JSON or XML format. API authentication credentials can be securely managed using AWS Secrets Manager, ensuring secure and controlled access. Python scripts handle the extraction logic, making scheduled API calls, managing rate limits, and logging responses for monitoring and debugging. These scripts are stored in GitHub, allowing for cross-team collaboration and version tracking.
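
For instance, an API key kept in Secrets Manager can be retrieved at runtime with a few lines of Boto3. The snippet below is a minimal sketch; the secret name and region are placeholder assumptions:

import json
import boto3

def get_api_credentials(secret_name="external-api/credentials", region="eu-central-1"):
    """Fetch API credentials from AWS Secrets Manager instead of hard-coding them."""
    client = boto3.client("secretsmanager", region_name=region)
    response = client.get_secret_value(SecretId=secret_name)
    # The secret is stored as a JSON string, e.g. {"api_key": "...", "api_secret": "..."}
    return json.loads(response["SecretString"])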

To automate infrastructure provisioning, Terraform is used to define and deploy AWS resources such as S3 buckets, IAM roles, and compute instances. The extracted raw data is then ingested into an AWS S3 bucket, which acts as a data lake, i.e. a central storage repository. AWS services like Glue or Lambda can further process and transform the data into structured formats optimized for analysis. [2]

This architecture ensures scalability, automation, and security, making data ingestion from APIs seamless and efficient. By leveraging AWS cloud services, the ETL pipeline can handle large-scale data ingestion with minimal operational overhead. [3]

With this high-level overview in mind, let's now take a closer look at each stage of the ETL process.

E = Extraction: Making API Calls

The first step of the process is data extraction, where raw data is fetched from an external API. To facilitate this, we utilize a pre-existing or custom-built API client class written in Python, which encapsulates the logic for managing the interaction with the external API. This class is responsible for handling OAuth or API key-based authentication, securely managing tokens or credentials, and constructing dynamic API request URLs based on predefined endpoint patterns. The URL structure is parameterized to allow flexible querying based on required data fields such as date ranges or specific data types. Each API provider will have its own set of credentials and endpoint configurations, which are injected into the client class through environment variables or configuration files. Once the client is properly configured, a GET request is executed using the requests library to fetch the relevant raw data, typically in JSON or XML format. [4] The response is then parsed and validated before being passed along for further processing in the ETL pipeline.

[GET] /api/v1/data/endpoint?parameter1=value1&parameter2=value2

Figure 2: An example of a GET endpoint

In Figure 2, the GET request is made to the /api/v1/data/endpoint endpoint, where parameter1 and parameter2 are query parameters that define the specific data or filters being requested. The API class constructs the full URL dynamically based on input parameters and makes the request. The response from the API is usually returned in JSON format, containing the requested data.
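
To make this concrete, a minimal sketch of such a client is shown below; the class name, base URL, header scheme, and environment variable are illustrative assumptions rather than a fixed implementation:

import os
import requests

class StatsApiClient:
    """Minimal API client sketch: builds parameterized request URLs and fetches JSON."""

    def __init__(self, base_url, api_key=None):
        self.base_url = base_url.rstrip("/")
        # Credentials are injected via environment variables or AWS Secrets Manager
        self.api_key = api_key or os.environ.get("STATS_API_KEY")

    def get(self, endpoint, **params):
        """Execute a GET request against a parameterized endpoint and return parsed JSON."""
        url = f"{self.base_url}/{endpoint.lstrip('/')}"
        headers = {"Authorization": f"Bearer {self.api_key}"}
        response = requests.get(url, headers=headers, params=params, timeout=30)
        response.raise_for_status()  # surface HTTP errors early
        return response.json()

# Example usage mirroring Figure 2:
# client = StatsApiClient("https://api.example.com")
# data = client.get("/api/v1/data/endpoint", parameter1="value1", parameter2="value2")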

When working with APIs, failures can occur due to network issues, authentication errors, rate limits, or unexpected data formats. To ensure a resilient ETL pipeline, it’s important to implement retry mechanisms (such as exponential backoff for failed requests), error logging for debugging, and secure credential management to handle authentication failures. Using AWS CloudWatch for logging and monitoring helps track API failures, while storing API credentials securely in AWS Secrets Manager prevents security risks. By incorporating these best practices, we ensure that data extraction remains robust and reliable, even in the face of API inconsistencies.
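
A simple way to add this resilience is a small retry wrapper with exponential backoff around the GET call; the retry count and wait times below are illustrative assumptions:

import time
import requests

def get_with_backoff(url, params=None, headers=None, max_retries=5):
    """GET with exponential backoff: wait 1s, 2s, 4s, ... between failed attempts."""
    for attempt in range(max_retries):
        try:
            response = requests.get(url, params=params, headers=headers, timeout=30)
            # Treat rate limiting and transient server errors as retryable
            if response.status_code in (429, 500, 502, 503, 504):
                raise requests.HTTPError(f"retryable status {response.status_code}")
            response.raise_for_status()
            return response.json()
        except (requests.ConnectionError, requests.Timeout, requests.HTTPError) as exc:
            if attempt == max_retries - 1:
                raise  # retries exhausted; let the error reach the logs (e.g. CloudWatch)
            wait = 2 ** attempt
            print(f"Attempt {attempt + 1} failed ({exc}); retrying in {wait}s")
            time.sleep(wait)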

T = Transformation: Structuring and Preparing Data for Analysis

Once the data is fetched, the next crucial step is data transformation. APIs typically return data in complex, deeply nested JSON formats, which are not ideal for direct analysis or querying. Our objective here is to normalize, cleanse, and filter the raw JSON data, converting it into a simplified, tabular structure, making it easier to store in AWS S3 and query later using Amazon Athena.

To achieve this, we use data processing frameworks such as pandas [5], Spark DataFrames, and Polars, which provide robust tools for working with structured data. For relatively simple and well-structured JSON files, functions like json_normalize() in pandas or equivalent methods in Spark and Polars help recursively flatten nested JSON objects, converting them into a two-dimensional DataFrame where each nested field is expanded into its own column. This transformation not only simplifies the data but also ensures that the resulting structure is optimized for efficient querying in systems like Athena. By converting the data into a tabular format, we can quickly load it into S3 as Parquet or CSV files for long-term storage and further analysis.
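
As a small illustration of the simple case (the field names here are made up for demonstration), a nested response can be flattened and written out in just a few lines:

import pandas as pd

raw = {"player": {"id": 101, "name": "John Doe"}, "goals": 2, "assists": 1}

# json_normalize expands nested objects into dotted column names:
# player.id, player.name, goals, assists
df = pd.json_normalize(raw)

# Persist the flat table as Parquet (requires pyarrow) for storage in S3
df.to_parquet("player_stats.parquet", index=False)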

However, when the API call returns a rather complex JSON structure, there are additional steps involved to normalize it. Deeply nested fields, null values, and arrays of objects require further processing to ensure all relevant information is represented in a flat table format.

{
    "players": [
        {
            "id": 101,
            "name": "John Doe",
            "team": {
                "id": null,
                "name": "Team A",
                "coach": {
                    "name": "Jane Smith",
                    "experience_years": null
                }
            },
            "stats": {
                "season": "2023",
                "matches": [
                    {
                        "id": 1001,
                        "date": "2023-09-01",
                        "opponent": "Team B",
                        "performance": {
                            "goals": 2,
                            "assists": 1,
                            "minutes_played": 90,
                            "cards": {
                                "yellow": 1,
                                "red": 0
                            }
                        }
                    }
                ]
            }
        }
    ]
}

Figure 3: An example of complex JSON Structure

In this case, we need to flatten nested fields like "team" and "stats" and extract arrays like "matches". To do this, we use pandas' json_normalize with specific parameters to deal with nested structures and lists.

import pandas as pd

# data holds the parsed JSON response from Figure 3
df = pd.json_normalize(
    data['players'],
    record_path=['stats', 'matches'],
    meta=[
        'id',
        'name',
        ['team', 'id'],
        ['team', 'name'],
        ['team', 'coach', 'name'],
        ['team', 'coach', 'experience_years'],
        ['stats', 'season']
    ],
    meta_prefix='player_',
    record_prefix='match_'
)

Figure 4: Normalization process for complex JSON

We pass the path to the nested "matches" array using the record_path parameter and define metadata fields like player ID, name, team details, and stats using the meta parameter. Additionally, we use meta_prefix and record_prefix to ensure that columns are properly labeled, distinguishing between player and match-level data.

Even so, flattening alone is not always sufficient. In cases where data fields exist at varying hierarchy levels or must be assigned based on specific criteria, a more advanced transformation – data mapping – is required. This ensures that key attributes are properly structured, even when they do not align perfectly in the raw API response.

In complex transformations, raw API responses often contain deeply nested structures with key information spread across multiple levels, requiring data mapping to align fields with a structured schema. Unlike simple flattening, this involves selectively extracting and restructuring data while preserving contextual relationships. Fields may need reassignment or aggregation to fit the target schema, with transformation logic determining whether to retain, derive, or discard them. Context-aware imputation fills missing values by inheriting parent attributes, cross-referencing datasets, or using external metadata. Filtering enforces data integrity by rejecting incomplete records and resolving conflicts. By programmatically mapping raw attributes and intelligently filling gaps, this process ensures data consistency, accuracy, and usability for downstream analytics.

For instance, consider an API response where details about a championship, its seasons, and registered teams are dispersed, with some attributes either missing or inconsistently formatted:

{
    "championship": {
        "id": 101,
        "name": "Championship",
        "region": null
    },
    "seasons": [
        {
            "id": 202,
            "year": 2024,
            "teams": [
                {
                    "id": 303,
                    "name": "Club",
                    "category": "Professional",
                    "players": [
                        {
                            "id": 404,
                            "role": "Striker"
                        },
                        {
                            "id": 405
                        }
                    ]
                }
            ]
        }
    ]
}

Figure 5: Multilevel JSON and data hierarchy

To properly structure this data, transformation logic maps the nested attributes into a flattened schema while applying contextual filling rules:

mapped_data = []
for season in api_response["seasons"]:
    for team in season["teams"]:
        for player in team["players"]:
            transformed_record = {
                "championship_id": api_response["championship"]["id"],
                "championship_name": api_response["championship"]["name"],
                "championship_region": api_response["championship"].get("region", "Undefined"),
                "season_id": season["id"],
                "season_year": season["year"],
                "team_id": team["id"],
                "team_name": team["name"],
                "team_category": team.get("category", "Unknown"),
                "player_id": player["id"],
                "player_role": player.get("role", "Unassigned")
            }
            mapped_data.append(transformed_record)

Figure 6: Data transformation for mapping

In this process, the championship region is mapped but assigned "Undefined" if missing or null. The team category defaults to "Unknown" if not provided. The player role is assigned "Unassigned" when absent. The deeply nested structure is transformed into a structured format, where each row represents a unique player with all relevant contextual data.
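
From here, the mapped records can be turned into a DataFrame in a single step, ready for the loading stage described next; this is a brief illustrative continuation of Figure 6:

import pandas as pd

# Each dictionary in mapped_data becomes one row; keys become column names
df = pd.DataFrame(mapped_data)

# Quick sanity check of the contextual filling rules
print(df[["championship_region", "team_category", "player_role"]])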

L = Loading: Saving Data into Amazon S3

Once the data is structured into a clean DataFrame, the next step is loading it into Amazon S3. S3 serves as an ideal storage solution for data due to its scalability, security, and cost-effectiveness, making it an essential part of the data pipeline. The transformed data is uploaded programmatically using Boto3, the AWS SDK for Python, which provides an interface for interacting with AWS services, including S3. [6] The put_object method of the Boto3 S3 client is typically used to upload the DataFrame as an object in an S3 bucket.
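
As a minimal sketch of that step (the bucket name and key are placeholders, and Parquet serialization assumes pyarrow is installed), a transformed DataFrame can be serialized in memory and uploaded with put_object:

import io
import boto3
import pandas as pd

def upload_dataframe(df, bucket, key):
    """Serialize a DataFrame to Parquet in memory and upload it to S3."""
    buffer = io.BytesIO()
    df.to_parquet(buffer, index=False)
    s3_client = boto3.client("s3")
    s3_client.put_object(Bucket=bucket, Key=key, Body=buffer.getvalue())

# upload_dataframe(df, "my-data-lake-bucket", "processed/stats/players.parquet")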

To streamline and automate the storage process, a custom utility class like BotoUtils can be created. This class extends the capabilities of Boto3 and incorporates specialized functions for common tasks, such as managing S3 storage and ensuring that the data is properly formatted, named, and stored. One of the key functionalities within BotoUtils is the handle_storage function, which is specifically designed to automate the upload process while also managing other important factors like file formats, dynamic key generation, and historical tracking.

This function is responsible for several key tasks:

1. File Format Handling: Based on input parameters, it determines the correct file format (e.g., CSV, Parquet) for storing the data in S3. This is especially important when dealing with large datasets, as the format chosen directly impacts downstream performance, including the speed of queries and processing efficiency.

2. Generating a Dynamic S3 Key: The S3 key (file path within the S3 bucket) is dynamically generated using _type, base_key, extra_path, and the historize flag. This structure organizes data by maturity level—raw or processed—ensuring traceability and efficient querying. APIs influence key patterns, with batch data following scheduled paths (year/month/day) and streaming data using event-based structures. Transformations and historization further refine the key for version control and retrieval. The format follows:

{_type}/{base_key.get("source")}/{base_key.get("endpoint_key")}/{extra_path}

Figure 7: S3 Key Generation

3. Optional Historical Pathing: If the historize flag is True (which is the default), the function appends a timestamp to the key, ensuring that the data is stored in a time-partitioned manner. This is particularly useful for tracking changes over time. The timestamp is split into year, month, and day to create a clear folder structure within the S3 bucket.

if historize:
    now = self.startup_timestamp
    year = now.strftime('%Y')
    month = now.strftime('%m')
    day = now.strftime('%d')
    key += f'/year={year}/month={month}/day={day}/'

Figure 8: Historize timestamp

4. Uploading the Data to S3: Once the key is prepared, the function can then use the Boto3 put_object method to upload the object to the appropriate S3 bucket. The object is typically a transformed DataFrame, and it is uploaded to the specified location in S3 using the dynamically generated key.

s3_client.put_object(Bucket=s3_bucket_name, Key=s3_key, Body=obj)

Figure 9: S3 Put Object

The handle_storage function automates the process of organizing and uploading data to S3, ensuring consistent and error-free storage. By dynamically generating S3 keys and incorporating date-based partitioning, it facilitates efficient data management and easy historical tracking. This approach ensures scalability and performance, especially when dealing with large datasets, making it ideal for analytics and long-term data storage. Ultimately, it streamlines workflows and maintains a clean, structured storage solution.
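
Putting the pieces from Figures 7 to 9 together, a handle_storage-style helper could be sketched as follows; the exact signature, file naming, and class layout are assumptions based on the description above, not the actual BotoUtils implementation:

import boto3
from datetime import datetime, timezone

class BotoUtils:
    def __init__(self, bucket_name):
        self.s3_client = boto3.client("s3")
        self.bucket_name = bucket_name
        self.startup_timestamp = datetime.now(timezone.utc)

    def handle_storage(self, obj, _type, base_key, extra_path="", historize=True):
        """Build the S3 key from type/source/endpoint/extra_path and upload the object."""
        key = f'{_type}/{base_key.get("source")}/{base_key.get("endpoint_key")}/{extra_path}'
        if historize:
            # Append year/month/day partitions for historical tracking (see Figure 8)
            now = self.startup_timestamp
            key += f"/year={now:%Y}/month={now:%m}/day={now:%d}/"
        key += "data.parquet"  # assumed file name; format handling is omitted for brevity
        self.s3_client.put_object(Bucket=self.bucket_name, Key=key, Body=obj)
        return key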

Once the data is successfully loaded into S3, it is now ready for further analysis. With the data stored in an accessible, structured format, tools like Amazon Athena and Amazon QuickSight can easily query and analyze the data directly from S3. Athena allows for serverless SQL queries, while QuickSight provides powerful visualization capabilities to turn data into actionable insights. This integration ensures that, with minimal effort, the data is ready to be explored and leveraged for deeper analysis, enabling data-driven decision-making.

Figure 10: Analysis possibilities after loading into S3
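
For example, once the Parquet files in S3 are registered in the Glue Data Catalog (e.g. via a crawler), a query can be submitted to Athena with a few lines of Boto3; the database, table, column names, and result location below are placeholders:

import boto3

athena = boto3.client("athena")

response = athena.start_query_execution(
    QueryString=(
        "SELECT player_name, SUM(match_goals) AS total_goals "
        "FROM stats_db.player_matches "
        "GROUP BY player_name "
        "ORDER BY total_goals DESC"
    ),
    QueryExecutionContext={"Database": "stats_db"},
    ResultConfiguration={"OutputLocation": "s3://my-data-lake-bucket/athena-results/"},
)
print(response["QueryExecutionId"])  # use this id to poll for and fetch the results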

Conclusion

This article has explored the entire ETL process for ingesting, transforming, and loading statistics from an external API into Amazon S3. This workflow leverages the powerful capabilities of AWS, offering a scalable and flexible solution for handling large datasets efficiently. By automating the ETL process, data becomes more accessible, enabling quicker insights and more informed decision-making.

Now that you have an overview of the core ETL process and its capabilities, think about how this workflow could elevate your own datasets using AWS services. While the solution is inherently scalable and flexible, ensuring it aligns perfectly with your specific business requirements often requires expert customization. We specialize in helping businesses design, implement, and optimize ETL pipelines that drive tangible business outcomes.

Reach out to us to discuss how we can assist in streamlining your data operations, maximizing the value of your data, and unlocking powerful insights with AWS.

Sources

[1]   https://aws.amazon.com/
[2]   https://aws.amazon.com/datapipeline/
[3]   https://docs.aws.amazon.com/prescriptive-guidance/latest/patterns/build-an-etl-service-pipeline-to-load-data-incrementally-from-amazon-s3-to-amazon-redshift-using-aws-glue.html
[4]   https://www.restapitutorial.com/introduction/httpmethods
[5]   https://pandas.pydata.org/
[6]   https://boto3.amazonaws.com/v1/documentation/api/latest/index.html


About the author

Sakar is a Computer Science graduate who has been with Woodmark Consulting AG since March 2024. He specializes in the fields of Data Engineering and Analytics, applying his skills to build data pipelines, manage cloud data platforms, and optimize data workflows. He further cements his expertise through certifications in AWS and ongoing professional development in Databricks.
