Various apps and services collect data that must be processed, organised, and managed before it becomes meaningful, and several technologies exist to help companies do this. How does a company like redBus, a bus ticket booking platform that processes 150 billion data points daily, handle it?
At DES 2025, Ravikumar Kumarasamy, VP of engineering at redBus, explained how the company rebuilt its data platform to make raw events queryable, evolvable, and ready for reuse. He highlighted schema drift, efficient storage, and why it pays to treat raw data as a long-term asset.
That raw data streams in from the company's applications and services, but it was not stored in a way that supported historical querying or scalable reprocessing.
Therefore, the team set out to change that by building a storage framework that could infer schema on the fly, compact files for efficiency, and allow users to retrieve filtered datasets without engineering intervention.
Turning JSON haystacks into Apache Parquet
The approach begins with raw event ingestion over Apache Kafka, an open-source event streaming platform, with events arriving as JSON payloads. Instead of relying on fixed API contracts, the system infers the schema dynamically from each event.
Kumarasamy said, “I already have the schema, but the challenge is that the schema is ever evolving. We introduce new fields and information into the schema. So, instead of tracking the schema, we thought, why don’t we infer the schema from the raw data?”
The raw data includes time, source, country, ID, description, and amount. “So, now we derive a general schema from that,” Kumarasamy said.
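As a rough illustration of this idea, the sketch below unions per-event schemas into a general one. The field names mirror the examples above, but the merge rules are assumptions for illustration, not redBus's actual implementation.

```python
import json

# Illustrative event payloads; field names mirror the examples above.
events = [
    '{"event_time": "2025-01-15T10:00:00Z", "source": "android", '
    '"country": "IN", "id": 101, "amount": 250}',
    '{"event_time": "2025-01-15T10:00:05Z", "source": "web", '
    '"country": "IN", "id": 102, "description": "refund"}',
]

def infer_schema(payload: dict) -> dict:
    """Map each field to a coarse type name inferred from its value."""
    type_names = {bool: "boolean", int: "long", float: "double", str: "string"}
    return {key: type_names.get(type(value), "string")
            for key, value in payload.items()}

def merge_schemas(a: dict, b: dict) -> dict:
    """Union two inferred schemas; fields whose types conflict fall back to string."""
    merged = dict(a)
    for field, dtype in b.items():
        merged[field] = dtype if merged.get(field, dtype) == dtype else "string"
    return merged

general_schema: dict = {}
for raw in events:
    general_schema = merge_schemas(general_schema, infer_schema(json.loads(raw)))

print(general_schema)
# {'event_time': 'string', 'source': 'string', 'country': 'string',
#  'id': 'long', 'amount': 'long', 'description': 'string'}
```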
Once a schema is identified, it’s versioned, bucketed, and used to cast the incoming data into a generalised format.
Type mismatches are handled through defined casting rules: integers can be upcast to long data types, strings are treated as fallback types, and unsupported conversions are rejected early. This generalisation allows the system to normalise diverse data sources without losing their original fidelity.
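A minimal sketch of such casting rules, assuming a simple type lattice based on the examples from the talk:

```python
# Widening casts treated as safe; anything else is rejected early.
# This lattice is an assumption based on the examples in the talk.
SAFE_CASTS = {
    ("long", "double"): float,   # integer widens safely to floating point
    ("long", "string"): str,     # string is the universal fallback type
    ("double", "string"): str,
    ("boolean", "string"): str,
}

def cast_value(value, from_type: str, to_type: str):
    """Cast a value to the generalised schema's type, or fail fast."""
    if from_type == to_type:
        return value
    caster = SAFE_CASTS.get((from_type, to_type))
    if caster is None:
        raise TypeError(f"unsupported cast {from_type} -> {to_type}")
    return caster(value)

print(cast_value(42, "long", "double"))   # 42.0
print(cast_value(42, "long", "string"))   # "42" (now a string)
try:
    cast_value("abc", "string", "long")   # narrowing cast: rejected early
except TypeError as exc:
    print(exc)                            # unsupported cast string -> long
```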
The metadata extraction step is also notable. redBus pulls common fields like country, event source, and event time from every payload and appends them to the data, creating a consistent layer of information that supports filtering and querying without scanning entire datasets.
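In practice, one way to get this behaviour is to carry the extracted fields as partition columns in the Parquet layout, so filters prune whole directories rather than scanning every file. The pyarrow sketch below is illustrative, not the company's actual pipeline:

```python
import pyarrow as pa
import pyarrow.dataset as ds

# Events already cast to the generalised schema, with extracted metadata
# (country, source, event_time) carried as explicit columns.
table = pa.table({
    "country": ["IN", "IN", "SG"],
    "source": ["android", "web", "web"],
    "event_time": ["2025-01-15T10:00:00Z"] * 3,
    "payload": ['{"id": 1}', '{"id": 2}', '{"id": 3}'],
})

# Partitioning on the metadata columns lets a reader prune by country and
# source instead of scanning the entire dataset.
part = ds.partitioning(pa.schema([("country", pa.string()),
                                  ("source", pa.string())]), flavor="hive")
ds.write_dataset(table, "events", format="parquet", partitioning=part)

# A filtered read only touches the matching partition directories.
subset = ds.dataset("events", format="parquet", partitioning="hive")
print(subset.to_table(filter=(ds.field("country") == "IN")).num_rows)  # 2
```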
Events are saved in Parquet format, which keeps them compact and easy to query. redBus runs an automated system to merge the large number of small files the pipeline produces. Kumarasamy said this setup cuts storage needs by 93% while keeping the data easy to access.
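A compaction pass of this kind can be sketched with pyarrow; the directory layout and compression choice here are assumptions, not details from the talk:

```python
import glob
import pyarrow.parquet as pq

# Hypothetical layout: many small Parquet files accumulate in one partition.
small_files = sorted(glob.glob("events/country=IN/source=web/*.parquet"))

# Read all fragments (they share the generalised schema) into one table,
# then rewrite them as a single compact, compressed file. Columnar layout
# plus compression is where the storage savings come from.
table = pq.ParquetDataset(small_files).read()
pq.write_table(table, "events/country=IN/source=web/compacted.parquet",
               compression="zstd")

# A production compactor would delete the small fragments only after the
# compacted file is verified, to keep the operation crash-safe.
```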
Reusability of Data
To support exploration, the team developed an internal serverless tool called ‘Lenses’. It allows teams to extract datasets from raw data buckets using a simple interface with filter options like geography, event type, and time range. Behind the scenes, the tool creates Parquet files and gives users a link to download the data for analysis or checks.
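Lenses is an internal tool, so its interface is not public; the sketch below only suggests the general shape of such a request, with hypothetical function and path names:

```python
import pyarrow.dataset as ds
import pyarrow.parquet as pq

def extract_dataset(geography: str, event_type: str,
                    start: str, end: str, out_path: str) -> str:
    """Hypothetical Lenses-style extraction: filter the raw bucket on
    geography, event type, and time range, then hand back a Parquet file."""
    raw = ds.dataset("events", format="parquet", partitioning="hive")
    expr = ((ds.field("country") == geography)
            & (ds.field("source") == event_type)
            # ISO 8601 timestamps sort lexicographically, so string
            # comparison is enough for range filtering here.
            & (ds.field("event_time") >= start)
            & (ds.field("event_time") <= end))
    pq.write_table(raw.to_table(filter=expr), out_path)
    return out_path  # in the real tool, a download link to this file

link = extract_dataset("IN", "web", "2025-01-15T00:00:00Z",
                       "2025-01-15T23:59:59Z", "extract.parquet")
```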
“Our CRM and marketing teams, product managers and engineers use Lenses to solve problems,” Kumarasamy said.
While concerns about losing data fidelity in the transformation process were considered, Kumarasamy clarified that only the structure is reconstructed; the raw information remains intact.
A key advantage of the architecture is its temporal precision. Historical events can be retrieved accurately, down to specific hours on specific days, using metadata embedded in the file names and bucket IDs. This allows both real-time and retrospective analysis from a single unified storage layer.
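One plausible (assumed) encoding of that temporal metadata is hour-level bucketing in the object path, so a given hour on a given day maps to a single prefix:

```python
from datetime import datetime

def bucket_path(event_time: str, country: str, source: str) -> str:
    """Encode hour-level time into the object path (naming scheme assumed),
    so a specific hour on a specific day resolves to one prefix."""
    ts = datetime.fromisoformat(event_time.replace("Z", "+00:00"))
    return (f"events/country={country}/source={source}/"
            f"dt={ts:%Y-%m-%d}/hour={ts:%H}/")

print(bucket_path("2025-01-15T10:42:00Z", "IN", "web"))
# events/country=IN/source=web/dt=2025-01-15/hour=10/
```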
Together, these innovations form a modular, scalable system that balances flexibility, cost, and ease of use, transforming raw event streams into a trusted, queryable system of record.
redBus has built a flexible system that avoids the complexity of traditional data lakes. While it still faces challenges such as managing schema versions and merging files, the platform now treats raw data as more than just passing events: it is a structured, reusable source of information.