My company has selected AWS to be its cloud provider. This is brand new territory for me. As I begin to use each service, I’ll capture what I learn here.
My focus will be on a small subset of AWS. On first logging into the AWS console and clicking the Services button, the array of offerings appears overwhelming. There seems to be a service to cater for everything possible in the cloud, from big data analytics and machine learning to massively parallel data processing.
The latter is what my company will be concentrating on: creating a suite of ETL (Extract, Transform, Load) jobs to move data around. In this first phase, we will be migrating an entire system from in-house ETL built on IBM's DataStage to Glue jobs built on AWS.
Some of the technologies that we will use are AWS Glue, AWS S3, AWS CodeCommit, AWS CloudFormation and AWS Redshift. (From now on I'll omit the AWS prefix from the name of each trademarked offering.)
The company has already built a number of these ETL processes, either as proofs of concept or for actual business purposes. Now we are extending this capability out to the rest of the development team, which is quite a learning curve for those of us from a pure Java development background.
Identity and Access Management (IAM)
Our business spans five countries, and will soon enter new ones through company acquisition. The consolidation of insurance companies requires that we make savings by having one group-sanctioned way of doing things for each business application.
For example, we will have a single accounting platform, single HR systems, one issue ticketing system across all companies. The data that can be extracted from these sources will play a key role in managing the business.
The cloud will form the central area where similar information from each unit is transferred. Some data transformation and normalisation will occur there before the data is analysed or fed on to satellite systems.
The problem we face is that it may not always be appropriate to allow different units to see each other's information, yet the group itself will need access to everything. That's where IAM comes in.
With IAM we can control access to the resources and services that we build in the cloud. Access can be granted with specific, fine-grained control or using broad, company-wide policies.
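As a sketch of what fine-grained control looks like, the snippet below builds an IAM policy document that scopes S3 read access to a single business unit's prefix. The bucket name, unit name and key layout are all invented for illustration; a real policy would be attached to a role or group via the console or boto3.

```python
import json

def unit_readonly_policy(bucket: str, unit: str) -> dict:
    """Build an IAM policy document scoping S3 reads to one unit's prefix.

    Bucket and unit names are hypothetical; the document shape is the
    standard IAM policy JSON that AWS expects.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:GetObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/{unit}/*",
                ],
            }
        ],
    }

# One policy per unit keeps each country's data private, while a group-level
# role can simply be granted the broader "{bucket}/*" resource instead.
policy = unit_readonly_policy("group-etl-data", "ireland")
print(json.dumps(policy, indent=2))
```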
Glue
Glue is the ETL service provided on AWS. When triggered, a Glue job will provision the resources it needs, configure and scale them appropriately, and run the job. It is scalable, able to spin up many parallel processing units depending on the job's requirements.
At the time of writing, Glue supports Scala and Python, with Apache Spark as the underlying technology for managing the highly parallel processing. We will be using Python to develop the jobs.
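A Glue job script is mostly ordinary Python; the Glue-specific parts (GlueContext, DynamicFrame) wrap Spark. Below is a minimal, hedged sketch of the kind of row-level transform we might map over a DynamicFrame; the field names are invented for illustration, and the Glue wiring is only indicated in comments since it needs the AWS runtime.

```python
def normalise_row(row: dict) -> dict:
    """Trim string fields and upper-case the country code on one record.

    Field names here are hypothetical examples, not a real schema.
    """
    out = {k: v.strip() if isinstance(v, str) else v for k, v in row.items()}
    if isinstance(out.get("country"), str):
        out["country"] = out["country"].upper()
    return out

# Inside a real Glue script this would be applied with something like:
#   mapped = dynamic_frame.map(f=normalise_row)
# after building the frame from the Glue Data Catalog.
print(normalise_row({"name": "  acme ", "country": "ie"}))
```

Writing the transform as a plain function like this also makes it easy to unit test outside of Glue.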
Redshift
Ultimately the data that we store in the cloud will sit in a data warehouse, upon which various business functions run reports to manage and direct the business; this is where Redshift comes in.
Redshift makes it easy to provision capacity and to set up monitoring and backup strategies. It is scalable to meet the demands of the business and offers a high level of security.
Redshift prides itself on being the most performant data warehousing technology on offer. It is also relatively cheap compared with other cloud-based services (at least according to the blurb on the AWS website).
S3
S3 (Simple Storage Service) is data storage in the cloud with a few key attributes: it is secure and scalable, with replication to protect against failures, and it is relatively cheap. It can host Parquet files that can be accessed through Redshift, thus providing a cheap database.
We will be consuming files generated from a mainframe system rather than interacting directly with it. The source files are going to be archived after processing for audit purposes. S3 is where they will arrive to be read and where they will reside in archive.
S3 will also house the Parquet files backing databases that do not need fast response times. When we first process the mainframe files, they will sit in S3 before further processing.
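One way to handle the landing-then-archive flow is a dated key scheme, sketched below. The `landing/` and `archive/` prefixes are hypothetical, and the actual move would be two boto3 calls against the bucket (`copy_object` then `delete_object`); only the key mapping is shown here.

```python
from datetime import date

def archive_key(landing_key: str, processed_on: date) -> str:
    """Map a landing key to a dated archive location, keeping the file name.

    Prefix names are made-up examples of a possible audit layout.
    """
    filename = landing_key.rsplit("/", 1)[-1]
    return f"archive/{processed_on:%Y/%m/%d}/{filename}"

# The real move would then be:
#   s3.copy_object(...); s3.delete_object(...)
print(archive_key("landing/claims/extract_001.dat", date(2019, 3, 1)))
```

Partitioning the archive by date keeps audit retrieval simple: everything processed on a given day sits under one prefix.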
CodeCommit
For source code management we will use CodeCommit, the AWS Git offering. It provides an interface, similar to GitHub and other competitors, for viewing the code, browsing past revisions and managing comments and pull requests.
CloudFormation
CloudFormation will allow us to provision the resources necessary to put everything together. It is configuration as code, used to create and tear down environments, code and data. It allows us to quickly create development or test versions of production from the same configuration, making it easy to keep the environments consistent.
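To make "configuration as code" concrete, here is a minimal, hedged CloudFormation template expressed as a Python dict (CloudFormation accepts JSON as well as YAML). The bucket resource and its name are illustrative only; a real stack would be created by passing this JSON to CloudFormation.

```python
import json

# A deliberately tiny template: one S3 bucket for incoming files.
# Resource and bucket names are invented for illustration.
template = {
    "AWSTemplateFormatVersion": "2010-09-09",
    "Description": "Landing bucket for incoming mainframe extracts",
    "Resources": {
        "LandingBucket": {
            "Type": "AWS::S3::Bucket",
            "Properties": {"BucketName": "group-etl-landing-dev"},
        }
    },
}

print(json.dumps(template, indent=2))
```

Swapping the bucket name (or adding a template parameter for it) is all it takes to stand up an identical dev or test copy of the stack.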
Lambda
Serverless computing is becoming the next big thing. If you have some code that you need to run and it doesn't rely on permanent resources, you can use Lambda to run it on demand. No server maintenance to worry about. No patching cycles. No need to fret over idle CPU cycles when your code is not in use.
You are only charged for the compute time used while your code is running, which makes smaller jobs a perfect fit for Lambda. Your code can be triggered by some of the other services available from AWS, for example through HTTP endpoints.
We have a number of small ETL jobs that are a good fit for Lambda. These generally take less than a few minutes to run, so the overhead of provisioning a cluster for a Glue job would only add delay. Instead, the code to transfer the data will reside in a Lambda function.
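As a sketch of what such a function looks like, here is a minimal Lambda handler keyed off an S3 event notification. The bucket and key below are invented, and the actual transfer work is left as a placeholder comment; only the standard S3 event shape and handler signature come from AWS.

```python
def handler(event: dict, context=None) -> dict:
    """Read the bucket/key from an S3 event and report what would be processed.

    The event shape mirrors an S3 put notification; the ETL work itself
    is a placeholder here.
    """
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = record["object"]["key"]
    # Real work (read the object, transform it, write the result out)
    # would go here via boto3.
    return {"status": "ok", "source": f"s3://{bucket}/{key}"}

# A hand-rolled event of the same shape lets us test the handler locally.
fake_event = {
    "Records": [{"s3": {"bucket": {"name": "group-etl-landing"},
                        "object": {"key": "landing/claims/extract_001.dat"}}}]
}
print(handler(fake_event))
```

Wiring the bucket's put notifications to this function means each new file triggers its own short-lived run, with no idle server in between.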
Lambda is also available in the free tier to try out. At the moment you get one million requests per month before charging kicks in, which should be ample to get most proofs of concept and initial offerings up and running. Lambda remains relatively cheap after that point too.
There is quite a lot to take in when you first log into the AWS web console. Simply click on the Services menu and a huge array of services appears.
All of these are under active development, and new ones are being added all the time. One recent addition is the Well-Architected Tool, which walks you through the process of evaluating or designing a best-practice cloud platform.
I hope to get some experience in each of the areas mentioned above, and some others, and I will post my experiences and learnings here. The pace of change in this sphere is fast: new technology is coming online all the time and old offerings are dying away. Watch this space.