Data Platform

Tell us about your team

The data platform team develops and continuously optimizes generic data platform tooling and an internal ETL platform as a service that enables Rubikloud to ingest data from our clients. Overall, the data platform receives, ingests, transforms, and loads customer data to build the RubiCore Data Model (RDM). The RDM is then consumed by downstream Rubikloud apps and client dashboards, and used by our Machine Learning platform. Another primary focus of the team is optimizing our usage of the Hadoop software stack (YARN, Spark, etc.).

What’s your data platform’s stack?

The data platform team writes its core ETL transformations in Scala, since Scala is Spark's native language, and uses spark-submit to deploy these jobs to an Azure HDInsight cluster via Jenkins. Supporting logic is written mainly in Python 3.
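As a rough illustration of how Python support logic might wrap spark-submit for a Jenkins step, here is a minimal sketch. It is not our actual deployment code: the function name, the jar name, the main class, and the job arguments are all hypothetical, and it assumes the job runs on YARN in cluster mode. Only standard spark-submit options (`--master`, `--deploy-mode`, `--class`, `--conf`) are used.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit_command(
    jar_path: str,
    main_class: str,
    app_args: Optional[List[str]] = None,
    spark_conf: Optional[Dict[str, str]] = None,
) -> List[str]:
    """Return the argv list for submitting a Scala Spark job to YARN.

    Hypothetical helper: assumes YARN cluster mode, which may not match
    every deployment.
    """
    command = [
        "spark-submit",
        "--master", "yarn",
        "--deploy-mode", "cluster",
        "--class", main_class,
    ]
    # Sort the settings so the generated command is deterministic.
    for key, value in sorted((spark_conf or {}).items()):
        command += ["--conf", f"{key}={value}"]
    # The application jar comes after all spark-submit options,
    # followed by the arguments passed to the job's main method.
    command.append(jar_path)
    command += list(app_args or [])
    return command


if __name__ == "__main__":
    cmd = build_spark_submit_command(
        jar_path="etl-transformations.jar",           # hypothetical artifact
        main_class="com.example.etl.BuildDataModel",  # hypothetical class
        app_args=["--run-date", "2020-01-01"],        # hypothetical job args
        spark_conf={"spark.executor.memory": "4g"},
    )
    # Print a shell-safe version of the command for the build log.
    print(shlex.join(cmd))
```

Returning a list rather than a single string means a build step can hand it straight to `subprocess.run(cmd, check=True)` without worrying about shell quoting.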

Additional Tools

  • Scala
  • Python
  • Luigi + Jenkins
  • YARN + Spark
  • Kubernetes
  • Azure Storage (blob, table, file, queue)