Blog

Thinking out loud

Where we share the insights, questions, and observations that shape our approach.

Software development

How to build hypermedia API with Spring HATEOAS

Have you ever considered the quality of your REST API? Do you know that there are several levels of REST API? Have you ever heard the term HATEOAS? Or maybe you wonder how to implement it in Java? In this article, we answer these questions with the main emphasis on the HATEOAS concept and the implementation of that concept with the Spring HATEOAS project.

Learn more about services provided by Grape Up

You are at Grape Up blog, where our experts share their expertise gathered in projects delivered for top enterprises. See how we work.

Enabling the automotive industry to build software-defined vehicles
Empowering insurers to create insurance telematics platforms
Providing AI & advanced analytics consulting

What is HATEOAS?

H ypermedia A s T he E ngine O f A pplication S tate - is one of the constraints of the REST architecture. Neither REST nor HATEOAS is any requirement or specification. How you implement it depends only on you. At this point, you may ask yourself - how RESTful your API is without using HATEOAS? This question is answered by the REST maturity model presented by Leonard Richardson. This model consists of four levels, as set out below:

Level 0
The API implementation uses the HTTP protocol but does not utilize its full capabilities. Additionally, unique addresses for resources are not provided.
Level 1
We have a unique identifier for the resource, but each action on the resource has its own URL.
Level 2
We use HTTP methods instead of verbs describing actions, e.g., DELETE method instead of URL ... /delete .
Level 3
The term HATEOAS has been introduced. Simply speaking, it introduces hypermedia to resources. That allows you to place links in the response informing about possible actions, thereby adding the possibility to navigate through API.

Most projects these days are written using level 2. If we would like to go for the perfect RESTful API, we should consider HATEOAS.

Above, we have an example of a response from the server in the form of JSON+HAL. Such a resource consists of two parts: our data and links to actions that are possible to be performed on a given resource.

Spring HATEOAS 1.x.x

You may be asking yourself how to implement HATEOAS in Java? You can write your solution, but why reinvent the wheel? The right tool for this seems to be the Spring Hateoas project. It is a long-standing solution on the market because its origins date back to 2012, but in 2019 we had a version 1.0 release. Of course, this version introduced a few changes compared to 0.x. They will be discussed at the end of the article after presenting some examples of using this library so that you better understand what the differences between the two versions are. Let’s discuss the possibilities of the library based on a simple API that returns us a list of movies and related directors. Our domain looks like this:

@Entity

public class Movie {

@Id

@GeneratedValue

private Long id;

private String title;

private int year;

private Rating rating;

@ManyToOne

private Director director;

}

@Entity

public class Director {

@Id

@GeneratedValue

@Getter

private Long id;

@Getter

private String firstname;

@Getter

private String lastname;

@Getter

private int year;

@OneToMany(mappedBy = "director")

private Set<Movie> movies;

}

We can approach the implementation of HATEOAS in several ways. Three methods represented here are ranked from least to most recommended.

But first, we need to add some dependencies to our Spring Boot project:

<dependency>

<groupId>org.springframework.boot</groupId>

<artifactId>spring-boot-starter-hateoas</artifactId>

</dependency>

Ok, now we can consider implementation options.

Entity extends RepresentationModel with links directly in Controller class

Firstly, extend our entity models with RepresentationModel.

public class Movie extends RepresentationModel<Movie>

public class Director extends RepresentationModel<Director>

Then, add links to RepresentationModel within each controller. The example below returns all directors from the system. By adding two links to each director - to himself and to the entire collection. A link is also added to the collection. The key elements of this code are two methods with static imports:

linkTo() - responsible for creating the link
methodOn() - enables to dynamically generate the path to a given resource. We don’t need to hardcode the path, but we can refer to the method in the controller.

@GetMapping("/directors")

public ResponseEntity<CollectionModel<Director>> getAllDirectors() {

List<Director> directors = directorService.getAllDirectors();

directors.forEach(director -> {

director.add(linkTo(methodOn(DirectorController.class).getDirectorById(director.getId())).withSelfRel());

director.add(linkTo(methodOn(DirectorController.class).getDirectorMovies(director.getId())).withRel("directorMovies"));

});

Link allDirectorsLink = linkTo(methodOn(DirectorController.class).getAllDirectors()).withSelfRel());

return ResponseEntity.ok(CollectionModel.of(directors, allDirectorsLink));

}

This is the response we get after invoking such controller:

We can get a similar result when requesting a specific resource.

@GetMapping("/directors/{id}")

public ResponseEntity<Director> getDirector(@PathVariable("id") Long id) {

return directorService.getDirectorById(id)

.map(director -> {

director.add(linkTo(methodOn(DirectorController.class).getDirectorById(id)).withSelfRel());

director.add(linkTo(methodOn(DirectorController.class).getDirectorMovies(id)).withRel("directorMovies"));

director.add(linkTo(methodOn((DirectorController.class)).getAllDirectors()).withRel("directors"));

return ResponseEntity.ok(director);

})

.orElse(ResponseEntity.notFound().build());

}

The main advantage of this implementation is simplicity. But making our entity dependent on an external library is not a very good idea. Plus, the code repetition for adding links for a specific resource is immediately noticeable. You can, of course, bring it to some private method, but there is a better way.

Use Assemblers - SimpleRepresentationModelAssembler

And it’s not about assembly language, but about a special kind of class that converts our resource to RepresentationModel.

One of such assemblers is SimpleRepresentationModelAssembler. Its implementation goes as follows:

@Component

public class DirectorAssembler implements SimpleRepresentationModelAssembler<Director> {

@Override

public void addLinks(EntityModel<Director> resource) {

Long directorId = resource.getContent().getId();

resource.add(linkTo(methodOn(DirectorController.class).getDirectorById(directorId)).withSelfRel());

resource.add(linkTo(methodOn(DirectorController.class).getDirectorMovies(directorId)).withRel("directorMovies"));

}

@Override

public void addLinks(CollectionModel<EntityModel<Director>> resources) {

resources.add(linkTo(methodOn(DirectorController.class).getAllDirectors()).withSelfRel());

}

}

In this case, our entity will be wrapped in an EnityModel (this class extends RepresentationModel ) to which the links specified by us in the addLinks() will be added. Here we overwrite two addLinks() methods - one for entire data collections and the other for single resources. Then, as part of the controller, it is enough to call the toModel() or toCollectionModel() method ( addLinks() are template methods here), depending on whether we return a collection or a single representation.

@GetMapping

public ResponseEntity<CollectionModel<EntityModel<Director>>> getAllDirectors() {

return ResponseEntity.ok(directorAssembler.toCollectionModel(directorService.getAllDirectors()));

}

@GetMapping(value = "directors/{id}")

public ResponseEntity<EntityModel<Director>> getDirectorById(@PathVariable("id") Long id) {

return directorService.getDirectorById(id)

.map(director -> {

EntityModel<Director> directorRepresentation = directorAssembler.toModel(director)

.add(linkTo(methodOn(DirectorController.class).getAllDirectors()).withRel("directors"));

return ResponseEntity.ok(directorRepresentation);

})

.orElse(ResponseEntity.notFound().build());

}

The main benefit of using the SimpleRepresentationModelAssembler is the separation of our entity from the RepresentationModel , as well as the separation of the adding link logic from the controller.

The problem arises when we want to add hypermedia to the nested elements of an object. Obtaining the effect, as in the example below, is impossible in a current way.

{

"id": "M0002",

"title": "Once Upon a Time in America",

"year": 1984,

"rating": "R",

"directors": [

{

"id": "D0001",

"firstname": "Sergio",

"lastname": "Leone",

"year": 1929,

"_links": {

"self": {

"href": "http://localhost:8080/directors/D0001"

}

}

}

],

"_links": {

"self": {

"href": "http://localhost:8080/movies/M0002"

}

}

}

Create DTO class with RepresentationModelAssembler

The solution to this problem is to combine the two previous methods, modifying them slightly. In our opinion, RepresentationModelAssembler offers the most possibilities. It removes the restrictions that arose in the case of nested elements for SimpleRepresentationModelAssembler . But it also requires more code from us because we need to prepare DTOs, which are often done anyway. This is the implementation based on RepresentationModelAssembler :

@Component

public class DirectorRepresentationAssembler implements RepresentationModelAssembler<Director, DirectorRepresentation> {

@Override

public DirectorRepresentation toModel(Director entity) {

DirectorRepresentation directorRepresentation = DirectorRepresentation.builder()

.id(entity.getId())

.firstname(entity.getFirstname())

.lastname(entity.getLastname())

.year(entity.getYear())

.build();

directorRepresentation.add(linkTo(methodOn(DirectorController.class).getDirectorById(directorRepresentation.getId())).withSelfRel());

directorRepresentation.add(linkTo(methodOn(DirectorController.class).getDirectorMovies(directorRepresentation.getId())).withRel("directorMovies"));

return directorRepresentation;

}

@Override

public CollectionModel<DirectorRepresentation> toCollectionModel(Iterable<? extends Director> entities) {

CollectionModel<DirectorRepresentation> directorRepresentations = RepresentationModelAssembler.super.toCollectionModel(entities);

directorRepresentations.add(linkTo(methodOn(DirectorController.class).getAllDirectors()).withSelfRel());

return directorRepresentations;

}

}

When it comes to controller methods, they look the same as for SimpleRepresentationModelAssembler , the only difference is that in the ResponseEntity the return type is DTO - DirectorRepresentation .

@GetMapping

public ResponseEntity<CollectionModel<DirectorRepresentation>> getAllDirectors() {

return ResponseEntity.ok(directorRepresentationAssembler.toCollectionModel(directorService.getAllDirectors()));

}

@GetMapping(value = "/{id}")

public ResponseEntity<DirectorRepresentation> getDirectorById(@PathVariable("id") String id) {

return directorService.getDirectorById(id)

.map(director -> {

DirectorRepresentation directorRepresentation = directorRepresentationAssembler.toModel(director)

.add(linkTo(methodOn(DirectorController.class).getAllDirectors()).withRel("directors"));

return ResponseEntity.ok(directorRepresentation);

})

.orElse(ResponseEntity.notFound().build());

}

Here is our DTO model:

@Builder

@Getter

@EqualsAndHashCode(callSuper = false)

@Relation(itemRelation = "director", collectionRelation = "directors")

public class DirectorRepresentation extends RepresentationModel<DirectorRepresentation> {

private final String id;

private final String firstname;

private final String lastname;

private final int year;

}

The @Relation annotation allows you to configure the relationship names to be used in the HAL representation. Without it, the relationship names match the class name and a suffix List for the collection.

By default, JSON+HAL looks like this:

{

"_embedded": {

"directorRepresentationList": [

…

]

},

"_links": {

…

}

}

However, annotation @Relation can change the name of directors :

{

"_embedded": {

"directors": [

…

]

},

"_links": {

…

}

}

Summarizing the HATEOAS concept, it consists of a few pros and cons.

Pros:

If the client uses it, we can change the API address for our resources without breaking the client.
Creates good self-documentation, and table of contents of API to the person who has the first contact with our API.
Can simplify building some conditions on the frontend, e.g., whether the button should be disabled / enabled based on whether the link to corresponding the action exists.
Less coupling between frontend and backend.
Just like writing tests imposes on us to stick to the SRP principle in class construction, HATEOAS can keep us in check when designing API.

Cons:

Additional work needed on implementing non-business functionality.
Additional network overhead. The size of the transferred data is larger.
Adding links to some resources can be sometimes complicated and can introduce mess in controllers.

Changes in Spring HATEOAS 1.0

Spring HATEOAS has been available since 2012, but the first release of version 1.0 was in 2019.

The main changes concerned the changes to the package paths and names of some classes, e.g.

Old New ResourceSupport RepresentationModel Resource EntityModel Resources CollectionModel PagedResources PagedModel ResourceAssembler RepresentationModelAssembler

It is worth paying attention to a certain naming convention - the replacement of the word Resource in class names with the word Representation . It occurred because these types do not represent resources but representations, which can be enriched with hypermedia. It is also more in the spirit of REST. We are returning the resource representations, not the resources themselves. In the new version, there is a tendency to move away from constructors in favor of static construction methods - .of() .

It is also worth mentioning that the old version has no equivalent for SimpleRepresentationModelAssembler . On the other hand, the ResourceAssembler interface has only the toResource() method (equivalent - toModel() ) and no equivalent for toCollectionModel() . Such a method is found in RepresentationModelAssembler and is the toModelCollection() method.

The creators of the library have also included a script that migrates old package paths and old class names to the new version. You can check it here .

written by

Albert Bernat

written by

Aleksy Wołowiec

Software development

Practical tips to testing React apps with Selenium

If you ever had to write some automation scripts for an app with the frontend part done in React and you used Selenium Webdriver to get it to work, you’ve probably noticed that those two do not always get along very well. Perhaps you had to ‘hack’ your way through the task, and you were desperately searching for solutions to help you finish the job. I’ve been there and done that – so now you don’t have to. If you’re looking for a bunch of tricks which you can learn and expand your automation testing skillset, you’ve definitely come to the right place. Below I’ll share with you several solutions to problems I’ve encountered in my experience with testing against React with Selenium . Code examples will be presented for Python binding.

They see me scrolling

First, let’s take a look at scrolling pages. To do that, the solution that often comes to mind in automation testing is using JavaScript. Since we’re using Python here, the first search result would probably suggest using something like this:

Tips to Testing React Apps with Selenium

The first argument in the JS part is the number of pixels horizontally, and the second one is the number of pixels vertically. If we just paste window.scrollTo(0,100) into browsers’ console with some webpage opened, the result of the action will be scrolling the view vertically to the pixel position provided.

You could also try the below line of code:

And again, you can see how it works by pasting window.scrollBy(0,100) into browsers’ console – the page will scroll down by the number of pixels provided. If you do this repeatedly, you’ll eventually reach the bottom of the page.

However, that might not always work wonders for you. Perhaps you do not want to scroll the whole page, but just a part of it – the scrollbars might be confusing, and when you think it’s the whole page you need to scroll, it might be just a portion of it. In that case, here’s what you need to do. First, locate the React element you want to scroll. Then, make sure it has an ID assigned to it. If not, do it yourself or ask your friendly neighborhood developer to do it for you. Then, all you have to do is write the following line of code:

Obviously, don’t forget to change ‘scrollable_element_id’ to an ID of your element. That will perform a scroll action within the selected element to the position provided in arguments. Or, if needed, you can try .scrollBy instead of .scrollTo to get a consistent, repeatable scrolling action.

To finish off, you could also make a helper method out of it and call it whenever you need it:

I’ll be mentioning the above method in the following paragraph, so please keep in mind what scroll_view is about.

Still haven’t found what you were looking for

Now that you have moved scrolling problems out of the way, locating elements and interacting with them on massive React pages should not bother you anymore, right? Well, not exactly. If you need to perform some action on an element that exists within a page, it has to be scrolled into view so you can work with it. And Selenium does not automatically do that. Let’s assume that you’re working on a web app that has various sub-pages, or tabs. Each of those tabs contains elements of a different sort but arranged in similar tables with search bars on top of each table at the beginning of the tab. Imagine the following scenario: you navigate to the first tab, scroll the view down, then navigate to the second tab, and you want to use the search bar at the top of the page. Sounds easy, doesn’t it?

What you need to be aware of is the React’s feature which does not always move the view to the top of the page after switching subpages of the app. In this case, to interact with the aforementioned search box you need to scroll the view to the starting position. That’s why scroll_view method in previous paragraph took (0,0) as .scrollTo arguments. You could use it before interacting with an element just to make sure it’s in the view and can be found by Selenium. Here’s an example:

When it doesn’t click

Seems like a basic action like clicking on an element should be bulletproof and never fail. Yet again, miracles happen and if you’re losing your mind trying to find out what’s going on, remember that Selenium doesn’t always work great with React. If you have to deal with some stubborn element, such as a checkbox, for example, you could just simply make the code attempt the action several times:

The key here is the if statement; it has to verify whether the requested action had actually taken place. In the above case, a checkbox is selected, and Selenium has a method for verifying that. In other situations, you could just provide a specific selector which applies to a particular element when it changes its state, eg., an Xpath similar to this:

In the above example, Xpath contains generic Material-UI classes, but it could be anything as long as it points out the exact element you needed when it changed its state to whichever you wanted.

Clear situation

Testing often includes dealing with various forms that we need to fill and verify. Fortunately, Selenium’s send_keys() method usually does the job. But when it doesn’t, you could try clicking the text field before inputting the value:

It's a simple thing to do, but we might sometimes have the tendency to forget about such trivial solutions. Anyway, it gets the job done.

The trickier part might actually be getting rid of data in already filled out forms. And Selenium's .clear() method doesn't cooperate as you would expect it to do. If getting the field into focus just like in the above example doesn't work out for you:

there is a solution that uses some JavaScript (again!). Just make sure your cursor is focused on the field you want to clear and use the following line:

You can also wrap it into a nifty little helper as I did:

While this should work fine 99% of the time, there might be a situation with a stubborn text field where React quickly restores the previous value. What you can do in such a situation is experiment with sending an empty string to that field right after clearing it or sending some whitespace to it:

Just make sure it works for you!

Continuing the topic of the text in various fields, which sometimes have to be verified or checked after particular conditions are met, sometimes you need to make sure you're using the right method to extract the text value of an element. They might come in different forms, but the ones below are used quite often. Text in element could be extracted by Selenium with .get_attribute() method:

Or sometimes it's just enough to use .text() method:

It all depends on the context and the element you're working with. So don't fall into the trap of assuming that all forms and elements in the app are exactly the same. Always check twice, you'll thank yourself for that, and in the end, you'll save tons of time!

React Apps - Keep on testing!

Hopefully, the tips and tricks I presented above will prove most useful for you in your testing projects. There's definitely more to share within the testing field, so make sure you stay tuned in for other articles on our blog!

written by

Adrian Poć

Software development

How to run Selenium BDD tests in parallel with AWS Lambda - Lambda handlers

In our first article about Selenium BDD Tests in Parallel with AWS Lambda, we introduce parallelization in the Cloud and give you some insights into automating testing to accelerate your software development process. By getting familiar with the basics of Lambda Layers architecture and designing test sets, you are now ready to learn more about the Lambda handlers.

Lambda handlers

Now’s the time to run our tests on AWS. We need to create two Lambda handlers. The first one will find all scenarios from the test layer and run the second lambda in parallel for each scenario. In the end, it will generate one test report and upload it to the AWS S3 bucket.

Let’s start with the middle part. In order to connect to AWS, we need to use the boto3 library - AWS SDK for Python. It enables us to create, configure, and manage AWS services. We also import here behave __main__ function , which will be called to run behave tests from the code, not from the command line.

lambda/handler.py

import json

import logging

import os

from datetime import datetime

from subprocess import call

import boto3

from behave.__main__ import main as behave_main

REPORTS_BUCKET = 'aws-selenium-test-reports'

DATETIME_FORMAT = '%H:%M:%S'

logger = logging.getLogger()

logger.setLevel(logging.INFO)

def get_run_args(event, results_location):

test_location = f'/opt/{event["tc_name"]}'

run_args = [test_location]

if 'tags' in event.keys():

tags = event['tags'].split(' ')

for tag in tags:

run_args.append(f'-t {tag}')

run_args.append('-k')

run_args.append('-f allure_behave.formatter:AllureFormatter')

run_args.append('-o')

run_args.append(results_location)

run_args.append('-v')

run_args.append('--no-capture')

run_args.append('--logging-level')

run_args.append('DEBUG')

return run_args

What we also have above is setting arguments for our tests e.g., tags or feature file locations. But let's get to the point. Here is our Lambda handler code:

lambda/handler.py

def lambda_runner(event, context):

suffix = datetime.now().strftime(DATETIME_FORMAT)

results_location = f'/tmp/result_{suffix}'

run_args = get_run_args(event, results_location)

print(f'Running with args: {run_args}')

# behave -t @smoke -t ~@login -k -f allure_behave.formatter:AllureFormatter -o output --no-capture

try:

return_code = behave_main(run_args)

test_result = False if return_code == 1 else True

except Exception as e:

print(e)

test_result = False

response = {'test_result': test_result}

s3 = boto3.resource('s3')

for file in os.listdir(results_location):

if file.endswith('.json'):

s3.Bucket(REPORTS_BUCKET).upload_file(f'{results_location}/{file}', f'tmp_reports/{file}')

call(f'rm -rf {results_location}', shell=True)

return {

'statusCode': 200,

'body': json.dumps(response)

}

The lambda_runner method is executed with tags that are passed in the event. It will handle a feature file having a name from the event and at least one of those tags. At the end of a single test, we need to upload our results to the S3 bucket. The last thing is to return a Lambda result with a status code and a response from tests.

There’s a serverless file with a definition of max memory size, lambda timeout, used layers, and also some policies that allow us to upload the files into S3 or save the logs in CloudWatch.

lambda/serverless.yml

service: lambda-test-runner

app: lambda-test-runner

provider:

name: aws

runtime: python3.6

region: eu-central-1

memorySize: 512

timeout: 900

iamManagedPolicies:

- "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

- "arn:aws:iam::aws:policy/AmazonS3FullAccess"

functions:

lambda_runner:

handler: handler.lambda_runner

events:

- http:

path: lambda_runner

method: get

layers:

- ${cf:lambda-selenium-layer-dev.SeleniumLayerExport}

- ${cf:lambda-selenium-layer-dev.ChromedriverLayerExport}

- ${cf:lambda-selenium-layer-dev.ChromeLayerExport}

- ${cf:lambda-tests-layer-dev.FeaturesLayerExport}

Now let’s go back to the first lambda function. There will be a little more here, so we'll go through it in batches. Firstly, imports and global variables. REPORTS_BUCKET should have the same value as it’s in the environment.py file (tests layer).

test_list/handler.py

import json

import logging

import os

import shutil

import subprocess

from concurrent.futures import ThreadPoolExecutor as PoolExecutor

from datetime import date, datetime

import boto3

from botocore.client import ClientError, Config

REPORTS_BUCKET = 'aws-selenium-test-reports'

SCREENSHOTS_FOLDER = 'failed_scenarios_screenshots/'

CURRENT_DATE = str(date.today())

REPORTS_FOLDER = 'tmp_reports/'

HISTORY_FOLDER = 'history/'

TMP_REPORTS_FOLDER = f'/tmp/{REPORTS_FOLDER}'

TMP_REPORTS_ALLURE_FOLDER = f'{TMP_REPORTS_FOLDER}Allure/'

TMP_REPORTS_ALLURE_HISTORY_FOLDER = f'{TMP_REPORTS_ALLURE_FOLDER}{HISTORY_FOLDER}'

REGION = 'eu-central-1'

logger = logging.getLogger()

logger.setLevel(logging.INFO)

There are some useful functions to avoid duplication and make the code more readable. The first one will find and return all .feature files which exist on the tests layer. Then we have a few functions that let us create a new AWS bucket or folder, remove it, upload reports, or download some files.

test_list/handler.py

def get_test_cases_list() -> list:

return [file for file in os.listdir('/opt') if file.endswith('.feature')]

def get_s3_resource():

return boto3.resource('s3')

def get_s3_client():

return boto3.client('s3', config=Config(read_timeout=900, connect_timeout=900, max_pool_connections=500))

def remove_s3_folder(folder_name: str):

s3 = get_s3_resource()

bucket = s3.Bucket(REPORTS_BUCKET)

bucket.objects.filter(Prefix=folder_name).delete()

def create_bucket(bucket_name: str):

client = get_s3_client()

try:

client.head_bucket(Bucket=bucket_name)

except ClientError:

location = {'LocationConstraint': REGION}

client.create_bucket(Bucket=bucket_name, CreateBucketConfiguration=location)

def create_folder(bucket_name: str, folder_name: str):

client = get_s3_client()

client.put_object(

Bucket=bucket_name,

Body='',

Key=folder_name

)

def create_sub_folder(bucket_name: str, folder_name: str, sub_folder_name: str):

client = get_s3_client()

client.put_object(

Bucket=bucket_name,

Body='',

Key=f'{folder_name}{sub_folder_name}'

)

def upload_html_report_to_s3(report_path: str):

s3 = get_s3_resource()

current_path = os.getcwd()

os.chdir('/tmp')

shutil.make_archive('report', 'zip', report_path)

s3.Bucket(REPORTS_BUCKET).upload_file('report.zip', f'report_{str(datetime.now())}.zip')

os.chdir(current_path)

def upload_report_history_to_s3():

s3 = get_s3_resource()

current_path = os.getcwd()

os.chdir(TMP_REPORTS_ALLURE_HISTORY_FOLDER)

for file in os.listdir(TMP_REPORTS_ALLURE_HISTORY_FOLDER):

if file.endswith('.json'):

s3.Bucket(REPORTS_BUCKET).upload_file(file, f'{HISTORY_FOLDER}{file}')

os.chdir(current_path)

def download_folder_from_bucket(bucket, dist, local='/tmp'):

s3 = get_s3_resource()

paginator = s3.meta.client.get_paginator('list_objects')

for result in paginator.paginate(Bucket=bucket, Delimiter='/', Prefix=dist):

if result.get('CommonPrefixes') is not None:

for subdir in result.get('CommonPrefixes'):

download_folder_from_bucket(subdir.get('Prefix'), bucket, local)

for file in result.get('Contents', []):

destination_pathname = os.path.join(local, file.get('Key'))

if not os.path.exists(os.path.dirname(destination_pathname)):

os.makedirs(os.path.dirname(destination_pathname))

if not file.get('Key').endswith('/'):

s3.meta.client.download_file(bucket, file.get('Key'), destination_pathname)

For that handler, we also need a serverless file. There’s one additional policy AWSLambdaExecute and some actions that are required to invoke another lambda.

test_list/serverless.yml

service: lambda-test-list

app: lambda-test-list

provider:

name: aws

runtime: python3.6

region: eu-central-1

memorySize: 512

timeout: 900

iamManagedPolicies:

- "arn:aws:iam::aws:policy/CloudWatchLogsFullAccess"

- "arn:aws:iam::aws:policy/AmazonS3FullAccess"

- "arn:aws:iam::aws:policy/AWSLambdaExecute"

iamRoleStatements:

- Effect: Allow

Action:

- lambda:InvokeAsync

- lambda:InvokeFunction

Resource:

- arn:aws:lambda:eu-central-1:*:*

functions:

lambda_test_list:

handler: handler.lambda_test_list

events:

- http:

path: lambda_test_list

method: get

layers:

- ${cf:lambda-tests-layer-dev.FeaturesLayerExport}

- ${cf:lambda-selenium-layer-dev.AllureLayerExport}

And the last part of this lambda - the handler. In the beginning, we need to get a list of all test cases. Then if the action is run_tests , we get the tags from the event. In order to save reports or screenshots, we must have a bucket and folders created. The invoke_test function will be executed concurrently by the PoolExecutor. This function invokes a lambda, which runs a test with a given feature name. Then it checks the result and adds it to the statistics so that we know how many tests failed and which ones.

In the end, we want to generate one Allure report. In order to do that, we need to download all .json reports, which were uploaded to the S3 bucket after each test. If we care about trends, we can also download data from the history folder. With the allure generate command and proper parameters, we are able to create a really good looking HTML report. But we can’t see it at this point. We’ll upload that report into the S3 bucket with a newly created history folder so that in the next test execution, we can compare the results. If there are no errors, our lambda will return some statistics and links after the process will end.

test_list/handler.py

def lambda_test_list(event, context):

test_cases = get_test_cases_list()

if event['action'] == 'run_tests':

tags = event['tags']

create_bucket(bucket_name=REPORTS_BUCKET)

create_folder(bucket_name=REPORTS_BUCKET, folder_name=SCREENSHOTS_FOLDER)

create_sub_folder(

bucket_name=REPORTS_BUCKET, folder_name=SCREENSHOTS_FOLDER, sub_folder_name=f'{CURRENT_DATE}/'

)

remove_s3_folder(folder_name=REPORTS_FOLDER)

create_folder(bucket_name=REPORTS_BUCKET, folder_name=REPORTS_FOLDER)

client = boto3.client(

'lambda',

region_name=REGION,

config=Config(read_timeout=900, connect_timeout=900, max_pool_connections=500)

)

stats = {'passed': 0, 'failed': 0, 'passed_tc': [], 'failed_tc': []}

def invoke_test(tc_name):

response = client.invoke(

FunctionName='lambda-test-runner-dev-lambda_runner',

InvocationType='RequestResponse',

LogType='Tail',

Payload=f'{{"tc_name": "{tc_name}", "tags": "{tags}"}}'

)

result_payload = json.loads(response['Payload'].read())

result_body = json.loads(result_payload['body'])

test_passed = bool(result_body['test_result'])

if test_passed:

stats['passed'] += 1

stats['passed_tc'].append(tc_name)

else:

stats['failed'] += 1

stats['failed_tc'].append(tc_name)

with PoolExecutor(max_workers=500) as executor:

for _ in executor.map(invoke_test, test_cases):

pass

try:

download_folder_from_bucket(bucket=REPORTS_BUCKET, dist=REPORTS_FOLDER)

download_folder_from_bucket(bucket=REPORTS_BUCKET, dist=HISTORY_FOLDER, local=TMP_REPORTS_FOLDER)

command_generate_allure_report = [

f'/opt/allure-2.10.0/bin/allure generate --clean {TMP_REPORTS_FOLDER} -o {TMP_REPORTS_ALLURE_FOLDER}'

]

subprocess.call(command_generate_allure_report, shell=True)

upload_html_report_to_s3(report_path=TMP_REPORTS_ALLURE_FOLDER)

upload_report_history_to_s3()

remove_s3_folder(REPORTS_FOLDER)

subprocess.call('rm -rf /tmp/*', shell=True)

except Exception as e:

print(f'Error when generating report: {e}')

return {

'Passed': stats['passed'],

'Failed': stats['failed'],

'Passed TC': stats['passed_tc'],

'Failed TC': stats['failed_tc'],

'Screenshots': f'https://s3.console.aws.amazon.com/s3/buckets/{REPORTS_BUCKET}/'

f'{SCREENSHOTS_FOLDER}{CURRENT_DATE}/',

'Reports': f'https://s3.console.aws.amazon.com/s3/buckets/{REPORTS_BUCKET}/'

}

else:

return test_cases

Once we have it all set, we need to deploy our code. This shouldn’t be difficult. Let’s open a command prompt in the selenium_layer directory and execute the serverless deploy command. When it’s finished, do the same thing in the ‘tests’ directory, lambda directory, and finally in the test_list directory. The order of deployment is important because they are dependent on each other.

When everything is set, let’s navigate to our test-list-lambda in the AWS console.

We need to create a new event. I already have three, the Test one is what we’re looking for. Click on the Configure test events option.

Then select an event template, an event name, and fill JSON. In the future, you can add more tags separated with a single space. Click Create to save that event.

The last step is to click the Test button and wait for the results. In our case, it took almost one minute. The longest part of our solution is generating the Allure report when all tests are finished.

When you navigate to the reports bucket and download the latest one, you need to unpack the .zip file locally and open the index.html file in the browser. Unfortunately, most of the browsers won’t handle it that easily. If you have Allure installed, you can use the allure serve <path> command. It creates a local Jetty server instance, serves the generated report, and opens it in the default browser. But there’s also a workaround - Microsoft Edge. Just right-click on the index.html file and open it with that browser. It works!

Statistics

Everybody knows that time is money. Let’s check how much we can save. Here we have a division into the duration of the tests themselves and the entire process.

It’s really easy to find out that parallel tests are much faster. When having a set of 500 test cases, the difference is huge. It can take about 2 hours when running in a sequential approach or 2 minutes in parallel. The chart below may give a better overview.

During the release, there’s usually not that much time for doing regression tests. Same with running tests that take several hours to complete. Parallel testing may speed up the whole release process.

Well, but what is the price for that convenience? Actually not that high.

Let’s assume that we have 100 feature files, and it takes 30 seconds for each one to execute. We can set a 512MB memory size for our lambda function. Tests will be executed daily in the development environment and occasionally before releases. We can assume 50 executions of each test monthly.

Total compute (seconds) = 100 * 50 * (30s) = 150,000 seconds
Total compute (GB-s) = 150,000 * 512MB/1024 = 75,000 GB-s
Monthly compute charges = 75,000 * $0.00001667 = $1.25
Monthly request charges = 100 * 50 * $0.2/M = $0.01
Total = $1.26

It looks very promising. If you have more tests or they last longer, you can double this price. It’s still extremely low!

AWS Lambda handlers - summary

We went through quite an extended Selenium test configuration with Behave and Allure and made it work in the parallel process using AWS Lambda to achieve the shortest time waiting for results. Everything is ready to be used with your own app, just add some tests! Of course, there is still room for improvement - reports are now available in the AWS S3 bucket but could be attached to emails or served so that anybody can display them in a browser with a URL. You can also think of CI/CD practices. It's good to have continuous testing in the continuous integration process, e.g., when pushing some new changes to the main or release branch in your GIT repository in order to find all bugs as soon as possible. Hopefully, this article will help you with creating your custom testing process and speed up your work.

Sources

https://github.com/eruchlewicz/aws-lambda-python-selenium-tests

written by

Grape up Expert

Software development

Introduction to Kubernetes security: Container security

Focusing on Kubernetes security, we have to go through container security and their runtimes. All in all, clusters without containers running does not make much sense. Hardening workloads often is much harder than hardening the cluster itself. Let’s start with container configuration.

Basic rules for containers

There are two ways how you can get a container image you want to run. You can build it, or you can use an existing one. If you create your own containers, then you have more control over the process and you have a clear vision of what is inside. But it is now your responsibility to make that image as secure as possible. There are plenty of rules to make your container safer, and here we share the best practices to ensure that.

Minimal image

First of all, if you want to start fast, you set some base images with plenty of features built-in. But in the end, it is not a good idea. The larger the base is, the more issues may occur. For example, the nginx image hosted on Docker Hub has 98 known vulnerabilities, and node has more than 800. All of those issues are inherited automatically by your container - unless you mitigate each one in your custom layers. Please take a look at the graph below that shows how the number of those vulnerabilities grows.

So you have to decide if you really need that additional functionality. If not, then you can use some smaller and simpler base images. It will, for sure, lower the number of known vulnerabilities in your container. It should lower the size of the container dramatically as well.

FROM node -> FROM ubuntu

If you really want only your application running in the container, then you can use Docker’s reserved, minimal image scratch:

FROM scratchCOPY hello /CMD [“/hello”]

User vs Root

Another base rule that you should embrace are the privileges inside the container. If you do not specify any, then it uses the root user inside the container. So there is a potential risk that it gets root access on the Docker host. To minimize that threat, you have to use a dedicated user/group in the Docker image. You can use the USER directive for this purpose:

FROM myslq

COPY . /app

RUN chown -R mysql:mysql /app

USER mysql

As you can see in the above example some images have an already defined user, that you can use. In mysql, it is named mysql (what a surprise!). But sometimes you may have to create one on your own:

RUN groupadd -r test && useradd -r -s /bin/false -g test test

WORKDIR /app

COPY . /app

RUN chown -R test:test /app

USER test

Use the specific tag for a base image

Another threat is not so obvious. You may think that the newest version of your base image will be the most secure one. In general, that is true, but it may bring some new risks and issues to your image. If you do not specify a proper version:

FROM ubuntu

Docker will use the latest one. It sounds pretty handy, but in some cases, it may break your build because the version may change between the builds. Just imagine that you are dependent on some package that has been removed in the latest version of the ubuntu image. Another threat is that the latest version may introduce new vulnerabilities that are not yet discovered. To avoid the described issues, it is better to specify the version of your base image:

FROM ubuntu:18.04

If the version is more specific, then there is a lower risk that it will be changed or updated without notice. On the other hand, please note that there is a higher chance that some specific versions will be removed. In that case, it is always a good practice to use the local Docker registry and keep important images mirrored there.

Also, check and keep in mind the versioning schema for the image - focusing on how alfa, beta, and test images are versioned. Being a test rat for the new features is not really what you would like to do.

See what is inside

The rules described above are only a part of a larger set but some of the most important ones. Especially if you create your own container image. But many times, you have to use images from other teams. It may happen when you simply want to run such an image, or you want to use it as a base image.

In both cases, you are at risk that this external image will bring a lot of issues to your application. If you are not a creator of the container image, then you have to pay even more attention to the security. First of all, you should check a Dockerfile to see how the image is built. Below is an example of the Ubuntu:18.04 image Dockerfile:

FROM scratch

ADD ubuntu-bionic-core-cloudimg-amd64-root.tar.gz /

# verify that the APT lists files do not exist

RUN [ -z "$(apt-get indextargets)" ]

# (see https://bugs.launchpad.net/cloud-images/+bug/1699913)

# a few minor docker-specific tweaks

# see https://github.com/docker/docker/blob/9a9fc01af8fb5d98b8eec0740716226fadb373...

RUN set -xe \

\

(...)

# make systemd-detect-virt return "docker"

# See: https://github.com/systemd/systemd/blob/aa0c34279ee40bce2f9681b496922dedbadfca...

RUN mkdir -p /run/systemd && echo 'docker' > /run/systemd/container

CMD ["/bin/bash"]

Unfortunately, a Dockerfile is often not available, and it is not integrated into the image. You have to use Docker inspect command in order to see what is inside:

$ docker inspect ubuntu:18.04

[

{

"Id": "sha256:d27b9ffc56677946e64c1dc85413006d8f27946eeb9505140b094bade0bfb0cc",

"RepoTags": [

"ubuntu:18.04"

],

"RepoDigests": [

"ubuntu@sha256:e5b0b89c846690afe2ce325ac6c6bc3d686219cfa82166fc75c812c1011f0803"

],

"Parent": "",

"Comment": "",

"Created": "2020-07-06T21:56:11.478320417Z",

(...)

"Config": {

"Hostname": "",

"Domainname": "",

"User": "",

"AttachStdin": false,

"AttachStdout": false,

"AttachStderr": false,

"Tty": false,

"OpenStdin": false,

"StdinOnce": false,

"Env": [

"PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin"

],

"Cmd": [

"/bin/bash"

],

"ArgsEscaped": true,

"Image": "sha256:4f2a5734a710e5466a625e279994892c9dd9003d0504d99c8297b01b7138a663",

"Volumes": null,

"WorkingDir": "",

"Entrypoint": null,

"OnBuild": null,

"Labels": null

},

"Architecture": "amd64",

"Os": "linux",

"Size": 64228599,

"VirtualSize": 64228599,

(…)

}

]

It gives you all detailed information about the image in a JSON format, so you can review what is inside. Finally, you can use docker histor y to see the complete history of how the image was created.

$ docker history ubuntu:18.04 IMAGE CREATED CREATED BY SIZE d27b9ffc5667 13 days ago /bin/sh -c #(nop) CMD ["/bin/bash"] 0B <missing> 13 days ago /bin/sh -c mkdir -p /run/systemd && echo 'do… 7B <missing> 13 days ago /bin/sh -c set -xe && echo '#!/bin/sh' > /… 745B <missing> 13 days ago /bin/sh -c [ -z "$(apt-get indextargets)" ] 987kB <missing> 13 days ago /bin/sh -c #(nop) ADD file:0b40d881e3e00d68d… 63.2MB

So both commands give you the information similar to the Dockerfile content. But you have to admit that it is pretty complex and not very user-friendly. Fortunately, some tools might help you inspect Docker images. Start with checking out dive

It gives you a good view of each Docker image layer with information on what and where something has changed. The above example shows the layers of Ubuntu:18.04 image and corresponding changes to the files in that layer.

All those commands and tools should give you more confidence in order to decide whether it is safe to run a certain image.

Scan and sign

Usually, you do not have time to manually inspect images and check whether they are safe or not. Especially when you do not look for some malicious code but check for some well-known threats and vulnerabilities. This also applies when you build a container by yourself. In that case, you are probably sure that there is no malicious application or package installed, but still, you have to use some base image that may introduce some vulnerabilities. So again, there are multiple tools to help developers and operators check images.

The first two of them are focused on scanning images for Common Vulnerabilities and Exposures (CVE). All those scanners work pretty similarly – scanning images using external data with known vulnerabilities. It may come from OS vendors or non-OS data like NVD (National Vulnerability Database). Results in most cases, depend on the fact which CVE databases are used and if they are up-to-date. So the final number of detected vulnerabilities may differ for different tools. The scan process itself simply analyzes the contents and creates a history of an image with the mentioned databases.

Below are some results from open source tools Clair created by CoreOS and Trivy developed by Aquasec.

$ clair-scanner ubuntu:18.04

$ clair-scanner ubuntu:18.04

2020/07/22 10:10:02 [INFO] ▶ Start clair-scanner

2020/07/22 10:10:04 [INFO] ▶ Server listening on port 9279

2020/07/22 10:10:04 [INFO] ▶ Analyzing e86dffecb5a4284ee30b1905fef785336d438013826e4ee74a8fe7d65d95ee8f

2020/07/22 10:10:07 [INFO] ▶ Analyzing 7ff84cfee7ab786ad59579706ae939450d999e578c6c3a367112d6ab30b5b9b4

2020/07/22 10:10:07 [INFO] ▶ Analyzing 940667a71e178496a1794c59d07723a6f6716f398acade85b4673eb204156c79

2020/07/22 10:10:07 [INFO] ▶ Analyzing 22367c56cc00ec42fb1d0ca208772395cc3ea1e842fc5122ff568589e2c4e54e

2020/07/22 10:10:07 [WARN] ▶ Image [ubuntu:18.04] contains 39 total vulnerabilities

2020/07/22 10:10:07 [ERRO] ▶ Image [ubuntu:18.04] contains 39 unapproved vulnerabilities

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

| STATUS | CVE SEVERITY | PKG | PACKAGE VERSION | CVE DESCRIPTION |

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

(...)

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

| Unapproved | Medium CVE-2020-10543 | perl | 5.26.1-6ubuntu0.3 | Perl before 5.30.3 on 32-bit platforms allows a heap-based |

| | | | | buffer overflow because nested regular expression |

| | | | | quantifiers have an integer overflow. An application |

| | | | | written in Perl would only be vulnerable to this flaw if |

| | | | | it evaluates regular expressions supplied by the attacker. |

| | | | | Evaluating regular expressions in this fashion is known |

| | | | | to be dangerous since the regular expression engine does |

| | | | | not protect against denial of service attacks in this |

| | | | | usage scenario. Additionally, the target system needs a |

| | | | | sufficient amount of memory to allocate partial expansions |

| | | | | of the nested quantifiers prior to the overflow occurring. |

| | | | | This requirement is unlikely to be met on 64bit systems.] |

| | | | | http://people.ubuntu.com/~ubuntu-security/cve/CVE-2020-10543 |

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

| Unapproved | Medium CVE-2018-11236 | glibc | 2.27-3ubuntu1 | stdlib/canonicalize.c in the GNU C Library (aka glibc |

| | | | | or libc6) 2.27 and earlier, when processing very |

| | | | | long pathname arguments to the realpath function, |

| | | | | could encounter an integer overflow on 32-bit |

| | | | | architectures, leading to a stack-based buffer |

| | | | | overflow and, potentially, arbitrary code execution. |

| | | | | http://people.ubuntu.com/~ubuntu-security/cve/CVE-2018-11236 |

(...)

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

| Unapproved | Low CVE-2019-18276 | bash | 4.4.18-2ubuntu1.2 | An issue was discovered in disable_priv_mode in shell.c |

| | | | | in GNU Bash through 5.0 patch 11. By default, if Bash is |

| | | | | run with its effective UID not equal to its real UID, it |

| | | | | will drop privileges by setting its effective UID to its |

| | | | | real UID. However, it does so incorrectly. On Linux and |

| | | | | other systems that support "saved UID" functionality, |

| | | | | the saved UID is not dropped. An attacker with command |

| | | | | execution in the shell can use "enable -f" for runtime |

| | | | | loading of a new builtin, which can be a shared object that |

| | | | | calls setuid() and therefore regains privileges. However, |

| | | | | binaries running with an effective UID of 0 are unaffected. |

| | | | | http://people.ubuntu.com/~ubuntu-security/cve/CVE-2019-18276 |

+------------+-----------------------+-------+-------------------+--------------------------------------------------------------+

As mentioned above, both results are subtly different, but it is okay. If you investigate issues reported by Trivy, you will see duplicates, and in the end, both results are the same. But it is not the rule, and usually, they differ.

Based on the above reports, you should be sure if it is safe to use that particular image or not. By inspecting the image, you ensure what is running there, and thanks to the scanner, you are guaranteed about any known CVE. But it is important to underline that those vulnerabilities have to be known. In case of some new threats, you should schedule a regular scan, e.g., every week. In some of the Docker registries, this can be very easily configured, or you can use your CI/CD to run scheduled pipelines. It is also a good idea to send some notification in case of any High Severity vulnerability is found.

Active image scanning

All the above methods are passive, which means they do not actively scan or verify a running container. It was just a static analysis and scan of the image. If you want to be super secure, then you can simply add this live runtime scanner. An example of such a tool is Falco . It is an open-source project started by SysDig and now developed as CNCF Incubating Project. An extremely useful advantage provided by Falco comes to scanning for any abnormal behavior in your container. Besides, it has a built-in analyzer for Kubernetes Audit events. So taking both features together, this is a quite powerful tool to analyze and keep an eye on running containers in real-time. Below there is a quick setup of Falco with the Kubernetes cluster.

First of all, you have to run Falco. You can, of course, deploy it on Kubernetes or use the standalone version. Per the documentation, the most secure way is to run it separated from the Kubernetes cluster to provide isolation in case of a hacker attack. For testing purposes, we will do it in a different way and deploy it onto Kubernetes using Helm .

The setup is quite simple. First, we have to add the helm repository with a falco chart and simply install it. Please note that nginx pod is used for testing purposes and is not a part of Falco.

$ helm repo add falcosecurity https://falcosecurity.github.io/charts

$ helm repo update

$ helm install falco falcosecurity/falco

$ kubectl get pods

NAME READY STATUS RESTARTS AGE

falco-qq8zx 1/1 Running 0 23m

falco-t5glj 1/1 Running 0 23m

falco-w2krg 1/1 Running 0 23m

nginx-6db489d4b7-6pvjg 1/1 Running 0 25m

And that is it. Now let’s test that. Falco comes with already predefined rules, so we can, for example, try to exec into some pod and view sensitive data from files. We will use the mentioned nginx pod.

Now we should check the logs from the falco pod which runs on the same node as our nginx.

$ kubectl logs falco-w2krg

* Setting up /usr/src links from host

* Running falco-driver-loader with: driver=module, compile=y...

* Unloading falco module, if present

(...)

17:07:53.445726385: Notice A shell was spawned in a container with an attached terminal (user=root k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc shell=bash parent=runc cmdline=bash terminal=34816 container_id=b17be5f70cdc image=<NA>) k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc

17:08:39.377051667: Warning Sensitive file opened for reading by non-trusted program (user=root program=cat command=cat /etc/shadow file=/etc/shadow parent=bash gparent=<NA> ggparent=<NA> gggparent=<NA> container_id=b17be5f70cdc image=<NA>) k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc

Great, there are nice log messages about exec and reading sensitive file incidents with some additional information. The good thing is that you can easily add your own rules.

$ cat custom_rule.yaml

customRules:

example-rules.yaml: |-

- rule: shell_in_container

desc: notice shell activity within a container

condition: container.id != host and proc.name = bash

output: TEST shell in a container (user=%user.name)

priority: WARNING

To apply them you have to update the helm chart again.

$ helm install falco -f custom_rule.yaml falcosecurity/falco

$ kubectl get pods

NAME READY STATUS RESTARTS AGE

falco-7qnn8 1/1 Running 0 6m21s

falco-c54nl 1/1 Running 0 6m3s

falco-k859g 1/1 Running 0 6m11s

nginx-6db489d4b7-6pvjg 1/1 Running 0 45m

Now we can repeat the procedure and see if there is something more in the logs.

$ kubectl exec -it nginx-6db489d4b7-6pvjg /bin/bash

root@nginx-6db489d4b7-6pvjg:/# echo "Knock, knock!"

Knock, knock!

root@nginx-6db489d4b7-6pvjg:/# cat /etc/shadow > /dev/null

root@nginx-6db489d4b7-6pvjg:/# exit

$ kubectl logs falco-7qnn8

* Setting up /usr/src links from host

* Running falco-driver-loader with: driver=module, compile=yes, download=yes

* Unloading falco module, if present

(...)

17:33:35.547831851: Notice A shell was spawned in a container with an attached terminal (user=root k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc shell=bash parent=runc cmdline=bash terminal=34816 container_id=b17be5f70cdc image=<NA>) k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc

17:33:35.551194695: Warning TEST shell in a container (user=root) k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc

(...)

17:33:40.327820806: Warning Sensitive file opened for reading by non-trusted program (user=root program=cat command=cat /etc/shadow file=/etc/shadow parent=bash gparent=<NA> ggparent=<NA> gggparent=<NA> container_id=b17be5f70cdc image=<NA>) k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc k8s.ns=default k8s.pod=nginx-6db489d4b7-6pvjg container=b17be5f70cdc

You can see a new message is there. You can add more rules and customize falco to your needs. We encourage setting up gRPC and then using falco-exporter to integrate it with Prometheus to easily monitor any security incident. In addition, you may also configure falco to support Kubernetes audit events.

Is it enough? You have inspected and scanned your image. You deployed a runtime scanner to keep an eye on the running containers. But none of those methods guarantee that the image you have just pulled or started is the same you wanted to run. What if someone injected there some malicious code and you did not notice? In order to defend against such an attack, you have to somehow securely and confidently identify the image. There has to be some tool that guarantees us such confidence… and there is one!

The UpdateFramework and Notary

The key element that helps and in fact solves many of those concerns is The Update Framework (TUF) that describes the update system as “secure” if:

“it knows about the latest available updates in a timely manner,
any files it downloads are the correct files, and,
no harm results from checking or downloading files.”

(source: https://theupdateframework.github.io/security.html )

There are four principles defined by the framework that make it almost impossible to make a successful attack on such an update system.

1. The first principle is responsibility separation. In other words, there are a few different roles defined (that are used by e.g., the user or server) that are able to do different actions and use different keys for that purpose.

2. The next one is the multi-signature trust. This simply says that you need a fixed number of signatures which has to come together to perform certain actions, e.g., two developers using their keys to agree that a specific package is valid.

3. The third principle is explicit and implicit revocation. Explicit means that some parties come together and revoke another key, whereas implicit is when e.g., after some time, the repository may automatically revoke signing keys.

4. The last principle is to minimize individual key and role risks. As it says, the goal is to minimize the expected damage which can be defined by the probability of the event happening and its impact. So if there is a root role with a high impact on the system, the key it uses is kept offline.

The idea of TUF is to create and manage a set of metadata (signed by corresponding roles) that provide general information about the valid state of the repository at a specified time.

The next question is: How can Docker use this update framework, and what does it mean to you and me? First of all, Docker already uses it in the Content Trust, which definition seems to answer our first question about image correctness. As per documentation:

“Content trust provides the ability to use digital signatures for data sent to and received from remote Docker registries. These signatures allow client-side verification of the integrity and publisher of specific image tags.”

(source: https://docs.docker.com/engine/security/trust/content_trust )

To be more precise, Content Trust does not use TUF directly. Instead, it uses Notary, a tool created by Docker, which is an opinionated implementation of TUF. It keeps the TUF principles, so there are five roles (with corresponding keys), same as TUF defined, so we have:

· a root role – it uses the most important key that is used to sign the root metadata, which specifies other roles, so it is strongly advised to keep it secure offline;

· a snapshot role – this role signs snapshot metadata that contains information about file names, sizes, hashes of other (root, target and delegation) metadata files, so it ensures users about their integrity. It can be held by owner/admin or Notary service itself;

· a timestamp role – using timestamp key Notary signs metadata file which guarantee the freshness of the trusted collection, because of short expiration time. Due to that fact it is kept by Notary service to automatically regenerate when it is outdated;

· a targets role – it uses the targets key to sign the targets metadata file, with information about files in the collection (filenames, sizes and corresponding hashes) and it should be used to verify the integrity of the files inside the collection. The other usage of the targets key is to delegate trust to other peers using delegation roles.

· a delegation role – which is pretty similar to the targets role but instead of the whole content of the repository those keys ensure integrity of some (or sometimes all) of the actual content. They also can be used to delegate trust to other collaborators via lower level delegation roles.

All this metadata can be pulled or pushed to the Notary service. There are two components in the Notary service – server and signer. The server is responsible for storing the metadata (those files generated by the TUF framework underneath) for trusted collections in an associated database, generating the timestamp metadata, and the most important validating any uploaded metadata.

Notary signer stores private keys (this way they are not kept in the Notary server) and in case of a request from the Notary server it signs metadata for it. In addition, there is a Notary CLI that helps you to manage trusted collections and supports Content Trust with additional functionality. The basic interaction between client, server, and service can be described as: When the client wants to upload new metadata, after authentication (if required) metadata is validated by the server, which generates timestamp metadata (and sometimes snapshot based on what has changed) and sends it to the Notary signer for signing. After that server stores the client metadata, timestamp, and snapshot metadata which ensures that client files are the most recent and valid.

Let’s check how it works. First, run an unsigned image with Docker Content Trust (DCT) disabled. Everything works fine, so we can simply run our image v1:

$ docker run mirograpeup/hello:v1

Unable to find image 'mirograpeup/hello:v1' locally

v1: Pulling from mirograpeup/hello

Digest: sha256:be202781edb5aa6c322ec19d04aba6938b46e136a09512feed26659fb404d637

Status: Downloaded newer image for mirograpeup/hello:v1

Hello World folks!!!

Now we can check how it goes with DCT enabled. First, let’s see what happens when we want to run v2 (which is not signed as well):

$ export DOCKER_CONTENT_TRUST=1

$ docker run mirograpeup/hello:v2

docker: Error: remote trust data does not exist for docker.io/mirograpeup/hello: notary.docker.io does not have trust data for docker.io/mirograpeup/hello.

apiVersion: v1

See 'docker run --help'.

The error above is obvious – we did not specify the trust data/signatures for that image, so it fails to run. To sign the image, you have to push it to the remote repository.

$ docker push mirograpeup/hello:v2

The push refers to repository [docker.io/mirograpeup/hello]

c71acc1231cb: Layer already exists

v2: digest: sha256:be202781edb5aa6c322ec19d04aba6938b46e136a09512feed26659fb404d637 size: 524

Signing and pushing trust metadata

Enter passphrase for root key with ID 1d3d9a4:

Enter passphrase for new repository key with ID 5a9ff85:

Repeat passphrase for new repository key with ID 5a9ff85:

Finished initializing "docker.io/mirograpeup/hello"

Successfully signed docker.io/mirograpeup/hello:v2

For the first time, docker will ask you for the corresponding passphrases for the root key and repository if needed. After that, your image is signed, and we can check again if v1 or v2 can run.

$ docker run mirograpeup/hello:v1

docker: No valid trust data for v1.

See 'docker run --help'.

$ docker run mirograpeup/hello:v2

Hello World folks!!!

So it works well. You can see that it is not allowed to run unsigned images when DCT is enabled and that during the push all the signing process is done automatically. In the end, even though the process itself is a little bit complicated, you can very easily push images to the Docker repository and be sure that it is always the image you intended to run. One drawback of the DCT is that it is not supported in Kubernetes by default. You are able to work around that with admission plugins but it requires additional work.

Registry

We spoke a lot about containers security and how to run them without a headache. But besides that, you have to secure your container registry. First of all, you need to decide whether to use a hosted (e.g., Docker Hub) or on-prem registry.

One good thing about a hosted registry is that it should support (at least Docker Hub does) the Docker Content Trust by default and you just have to enable that on the client side. If you want that to be supported in the on-prem registry then you have to deploy the Notary server and configure that properly on the client side.

On the other hand, the Docker Hub does not provide image scanning and in the on-prem registries, it is usually a standard to provide such ability. Plus in most cases, those solutions support more than one image scanner. So you can choose which scanner you want to run. In some cases, like in Harbor, you are allowed to configure automatically scheduled tests and you can set up some notifications if needed. So it is a very nice thing that the on-prem registry is not worse and sometimes offers more than a free Docker Hub registry which comes with few limitations.

In addition, you have much more control over the on-prem registry. You can have a super-admin account and see all the images and statistics. But you have to maintain it and make sure it will be always up and running. Still, many companies prevent use of external registries - so in that case, you have no choice.

Whatever you choose, always make sure to use TLS communication. If your on-prem registry uses self-signed or signed by company root CA, certificates, then you have to configure Docker in Kubernetes properly. You can specify the insecure-registries option in the Docker daemon, but it may end up with fallback to HTTP which is not what we tried to prevent in the first place. A more secure option is to provide the certificates to the Docker daemon:

cp domain.crt /etc/docker/certs.d/mydomain.com:5000/ca.crt

Private registry

Also, if you want to use a private registry, you will have to provide the credentials or cache (manually load) the image to each worker node. Please note to force cached images to be used you need to block the imagePullPolicy: Always . When it comes to providing credentials, you basically have two options.

Configure Nodes

Prepare the proper docker config.json:

{

"auths": {

"https://index.docker.io/v1/": {

"username": "xxxxxxxx"

"password": "xxxxxxxx"

"email": "xxxxxxxx"

"auth": "<base64-encoded-‘username:password’>"

}

}

}

And copy that to each worker node to the kubelet configuration directory. It is usually /var/lib/kubelet, so for a single worker it would be:

$ scp config.json root@worker-node:/var/lib/kubelet/config.json

Use ImagePullSecrets

The easiest way is to use the built-in mechanism in the Kubernetes secrets:

kubectl create secret docker-registry <secret-name> \

--docker-server=<your-registry-server> \

--docker-username=<your-name> \

--docker-password=<your-password> \

--docker-email=<your-email>

or you may create a secret by providing a YAML file with the base64 encoded docker config.json:

{

"auths": {

"https://index.docker.io/v1/": {

"username": "xxxxxxxx"

"password": "xxxxxxxx"

"email": "xxxxxxxx"

"auth": "<base64-encoded-‘username:password’>"

}

}

}

secret.yml (please note the type has to be kubernetes.io/dockerconfigjson and the data should be placed under .dockerconfigjson):

apiVersion: v1

kind: Secret

metadata:

name: <secret-name>

namespace: <namespace-name>

data:

.dockerconfigjson: <base64-encoded-config-json-file>

type: kubernetes.io/dockerconfigjson

Adding the above secret, you have to keep in mind that it works only in the specified namespace, and anyone in that namespace may be able to read that. So you have to be careful who is allowed to use your namespace/cluster. Now, if you want to use this secret, you simply have to put it in the Pod spec imagePullSecrets property:

apiVersion: v1

kind: Pod

metadata:

name: private-reg

spec:

containers:

- name: private-reg-container

image: <your-private-image>

imagePullSecrets:

- name: <your-secret-name>

Container security checks

If you want to be even more secure than you can run some additional tests to check your setup. In that case, the Docker security is the thing you want to check. In order to do that, you can use Docker Bench Security. It will scan your runtime to find any issues or insecure configurations. The easiest way is to run it as a pod inside your cluster. You have to mount a few directories from the worker node and run this pod as root, so make sure you know what you are doing. The below example shows how the Pod is configured if your worker runs on Ubuntu (there are some different mount directories needed on different operating systems).

cat docker-bench.yml

apiVersion: v1

kind: Pod

metadata:

name: docker-bench

spec:

hostPID: true

hostIPC: true

hostNetwork: true

securityContext:

runAsUser: 0

containers:

- name: docker-bench

image: docker/docker-bench-security

securityContext:

privileged: true

capabilities:

add: ["AUDIT_CONTROL"]

volumeMounts:

- name: etc

mountPath: /etc

readOnly: true

- name: libsystemd

mountPath: /lib/systemd/system

readOnly: true

- name: usrbincontainerd

mountPath: /usr/bin/containerd

readOnly: true

- name: usrbinrunc

mountPath: /usr/bin/runc

readOnly: true

- name: usrlibsystemd

mountPath: /usr/lib/systemd

readOnly: true

- name: varlib

mountPath: /var/lib

readOnly: true

- name: dockersock

mountPath: /var/run/docker.sock

readOnly: true

volumes:

- name: etc

hostPath:

path: /etc

- name: libsystemd

hostPath:

path: /lib/systemd/system

- name: usrbincontainerd

hostPath:

path: /usr/bin/containerd

- name: usrbinrunc

hostPath:

path: /usr/bin/runc

- name: usrlibsystemd

hostPath:

path: /usr/lib/systemd

- name: varlib

hostPath:

path: /var/lib

- name: dockersock

hostPath:

path: /var/run/docker.sock

type: Socket

kubectl apply -f docker-bench.yml

kubectl logs docker-bench -f

# ------------------------------------------------------------------------------

# Docker Bench for Security v1.3.4

#

# Docker, Inc. (c) 2015-

#

# Checks for dozens of common best-practices around deploying Docker containers in production.

# Inspired by the CIS Docker Community Edition Benchmark v1.1.0.

# ------------------------------------------------------------------------------

Initializing Sun Sep 13 22:41:02 UTC 2020

[INFO] 1 - Host Configuration

[WARN] 1.1 - Ensure a separate partition for containers has been created

[NOTE] 1.2 - Ensure the container host has been Hardened

[INFO] 1.3 - Ensure Docker is up to date

[INFO] * Using 18.09.5, verify is it up to date as deemed necessary

[INFO] * Your operating system vendor may provide support and security maintenance for Docker

[INFO] 1.4 - Ensure only trusted users are allowed to control Docker daemon

[INFO] * docker:x:998

[WARN] 1.5 - Ensure auditing is configured for the Docker daemon

[WARN] 1.6 - Ensure auditing is configured for Docker files and directories - /var/lib/docker

[WARN] 1.7 - Ensure auditing is configured for Docker files and directories - /etc/docker

[WARN] 1.8 - Ensure auditing is configured for Docker files and directories - docker.service

[INFO] 1.9 - Ensure auditing is configured for Docker files and directories - docker.socket

[INFO] * File not found

[WARN] 1.10 - Ensure auditing is configured for Docker files and directories - /etc/default/docker

[INFO] 1.11 - Ensure auditing is configured for Docker files and directories - /etc/docker/daemon.json

[INFO] * File not found

[INFO] 1.12 - Ensure auditing is configured for Docker files and directories - /usr/bin/docker-containerd

[INFO] * File not found

[INFO] 1.13 - Ensure auditing is configured for Docker files and directories - /usr/bin/docker-runc

[INFO] * File not found

[INFO] 2 - Docker daemon configuration

[WARN] 2.1 - Ensure network traffic is restricted between containers on the default bridge

[PASS] 2.2 - Ensure the logging level is set to 'info'

[WARN] 2.3 - Ensure Docker is allowed to make changes to iptables

[PASS] 2.4 - Ensure insecure registries are not used

[PASS] 2.5 - Ensure aufs storage driver is not used

[INFO] 2.6 - Ensure TLS authentication for Docker daemon is configured

(...)

After completing these steps, you can check the logs from the Pod to see the full output from Docker Bench Security and act on any warning you see there. The task is quite demanding but gives you the most secure Docker runtime. Please note sometimes you have to leave a few warnings in order to keep everything working. Then you still have an option to provide more security using Kubernetes configuration and resources, but this is a topic for a separate article.

written by

Michał Różycki

Software development

How to run Selenium BDD tests in parallel with AWS Lambda

Have you ever felt annoyed because of the long waiting time for receiving test results? Maybe after a few hours, you’ve figured out that there had been a network connection issue in the middle of testing, and half of the results can go to the trash? That may happen when your tests are dependent on each other or when you have plenty of them and execution lasts forever. It's quite a common issue. But there’s actually a solution that can not only save your time but also your money - parallelization in the Cloud.

How it started

Developing UI tests for a few months, starting from scratch, and maintaining existing tests, I found out that it has become something huge that will be difficult to take care of very soon. An increasing number of test scenarios made every day led to bottlenecks. One day when I got to the office, it turned out that the nightly tests were not over yet. Since then, I have tried to find a way to avoid such situations.

A breakthrough was the presentation of Tomasz Konieczny during the Testwarez conference in 2019. He proved that it’s possible to run Selenium tests in parallel using AWS Lambda. There’s actually one blog that helped me with basic Selenium and Headless Chrome configuration on AWS. The Headless Chrome is a light-weighted browser that has no user interface. I went a step forward and created a solution that allows designing tests in the Behavior-Driven Development process and using the Page Object Model pattern approach, run them in parallel, and finally - build a summary report.

Setting up the project

The first thing we need to do is signing up for Amazon Web Services. Once we have an account and set proper values in credentials and config files (.aws directory), we can create a new project in PyCharm, Visual Studio Code, or in any other IDE supporting Python. We’ll need at least four directories here. We called them ‘lambda’, ‘selenium_layer’, ‘test_list’, ‘tests’ and there’s also one additional - ‘driver’, where we keep a chromedriver file, which is used when running tests locally in a sequential way.

In the beginning, we’re going to install the required libraries. Those versions work fine on AWS, but you can check newer if you want.

requirements.txt

allure_behave==2.8.6

behave==1.2.6

boto3==1.10.23

botocore==1.13.23

selenium==2.37.0

What’s important, we should install them in the proper directory - ‘site-packages’.

We’ll need also some additional packages:

Allure Commandline ( download )

Chromedriver ( download )

Headless Chromium ( download )

All those things will be deployed to AWS using Serverless Framework, which you need to install following the docs . The Serverless Framework was designed to provision the AWS Lambda Functions, Events, and infrastructure Resources safely and quickly. It translates all syntax in serverless.yml to a single AWS CloudFormation template which is used for deployments.

Architecture - Lambda Layers

Now we can create a serverless.yml file in the ‘selenium-layer’ directory and define Lambda Layers we want to create. Make sure that your .zip files have the same names as in this file. Here we can also set the AWS region in which we want to create our Lambda functions and layers.

serverless.yml

service: lambda-selenium-layer

provider:

name: aws

runtime: python3.6

region: eu-central-1

timeout: 30

layers:

selenium:

path: selenium

CompatibleRuntimes: [

"python3.6"

]

chromedriver:

package:

artifact: chromedriver_241.zip

chrome:

package:

artifact: headless-chromium_52.zip

allure:

package:

artifact: allure-commandline_210.zip

resources:

Outputs:

SeleniumLayerExport:

Value:

Ref: SeleniumLambdaLayer

Export:

Name: SeleniumLambdaLayer

ChromedriverLayerExport:

Value:

Ref: ChromedriverLambdaLayer

Export:

Name: ChromedriverLambdaLayer

ChromeLayerExport:

Value:

Ref: ChromeLambdaLayer

Export:

Name: ChromeLambdaLayer

AllureLayerExport:

Value:

Ref: AllureLambdaLayer

Export:

Name: AllureLambdaLayer

Within this file, we’re going to deploy a service consisting of four layers. Each of them plays an important role in the whole testing process.

Creating test set

What would the tests be without the scenarios? Our main assumption is to create test files running independently. This means we can run any test without others and it works. If you're following clean code, you'll probably like using the Gherkin syntax and the POM approach. Behave Framework supports both.

What gives us Gherkin? For sure, better readability and understanding. Even if you haven't had the opportunity to write tests before, you will understand the purpose of this scenario.

01.OpenLoginPage.feature

@smoke

@login

Feature: Login to service

Scenario: Login

Given Home page is opened

And User opens Login page

When User enters credentials

And User clicks Login button

Then User account page is opened

Scenario: Logout

When User clicks Logout button

Then Home page is opened

And User is not authenticated

In the beginning, we have two tags. We add them in order to run only chosen tests in different situations. For example, you can name a tag @smoke and run it as a smoke test, so that you can test very fundamental app functions. You may want to test only a part of the system like end-to-end order placing in the online store - just add the same tag for several tests.

Then we have the feature name and two scenarios. Those are quite obvious, but sometimes it’s good to name them with more details. Following steps starting with Given, When, Then and And can be reused many times. That’s the Behavior-Driven Development in practice. We’ll come back to this topic later.

Meantime, let’s check the proper configuration of the Behave project.

We definitely need a ‘feature’ directory with ‘pages’ and ‘steps’. Make the ‘feature’ folder as Sources Root. Just right-click on it and select the proper option. This is the place for our test scenario files with .feature extension.

It’s good to have some constant values in a separate file so that it will change only here when needed. Let’s call it config.json and put the URL of the tested web application.

config.json

{

"url": "http://drabinajakuba.atthost24.pl/"

}

One more thing we need is a file where we set webdriver options.

Those are required imports and some global values like, e.g. a name of AWS S3 bucket in which we want to have screenshots or local directory to store them in. As far as we know, bucket names should be unique in whole AWS S3, so you should probably change them but keep the meaning.

environment.py

import os

import platform

from datetime import date, datetime

import json

import boto3

from selenium import webdriver

from selenium.webdriver.chrome.options import Options

REPORTS_BUCKET = 'aws-selenium-test-reports'

SCREENSHOTS_FOLDER = 'failed_scenarios_screenshots/'

CURRENT_DATE = str(date.today())

DATETIME_FORMAT = '%H_%M_%S'

Then we have a function for getting given value from our config.json file. The path of this file depends on the system platform - Windows or Darwin (Mac) would be local, Linux in this case is in AWS. If you need to run these tests locally on Linux, you should probably add some environment variables and check them here.

def get_from_config(what):

if 'Linux' in platform.system():

with open('/opt/config.json') as json_file:

data = json.load(json_file)

return data[what]

elif 'Darwin' in platform.system():

with open(os.getcwd() + '/features/config.json') as json_file:

data = json.load(json_file)

return data[what]

else:

with open(os.getcwd() + '\\features\\config.json') as json_file:

data = json.load(json_file)

return data[what]

Now we can finally specify paths to chromedriver and set browser options which also depend on the system platform. There’re a few more options required on AWS.

def set_linux_driver(context):

"""

Run on AWS

"""

print("Running on AWS (Linux)")

options = Options()

options.binary_location = '/opt/headless-chromium'

options.add_argument('--allow-running-insecure-content')

options.add_argument('--ignore-certificate-errors')

options.add_argument('--disable-gpu')

options.add_argument('--headless')

options.add_argument('--window-size=1280,1000')

options.add_argument('--single-process')

options.add_argument('--no-sandbox')

options.add_argument('--disable-dev-shm-usage')

capabilities = webdriver.DesiredCapabilities().CHROME

capabilities['acceptSslCerts'] = True

capabilities['acceptInsecureCerts'] = True

context.browser = webdriver.Chrome(

'/opt/chromedriver', chrome_options=options, desired_capabilities=capabilities

)

def set_windows_driver(context):

"""

Run locally on Windows

"""

print('Running on Windows')

options = Options()

options.add_argument('--no-sandbox')

options.add_argument('--window-size=1280,1000')

options.add_argument('--headless')

context.browser = webdriver.Chrome(

os.path.dirname(os.getcwd()) + '\\driver\\chromedriver.exe', chrome_options=options

)

def set_mac_driver(context):

"""

Run locally on Mac

"""

print("Running on Mac")

options = Options()

options.add_argument('--no-sandbox')

options.add_argument('--window-size=1280,1000')

options.add_argument('--headless')

context.browser = webdriver.Chrome(

os.path.dirname(os.getcwd()) + '/driver/chromedriver', chrome_options=options

)

def set_driver(context):

if 'Linux' in platform.system():

set_linux_driver(context)

elif 'Darwin' in platform.system():

set_mac_driver(context)

else:

set_windows_driver(context)

Webdriver needs to be set before all tests, and in the end, our browser should be closed.

def before_all(context):

set_driver(context)

def after_all(context):

context.browser.quit()

Last but not least, taking screenshots of test failure. Local storage differs from the AWS bucket, so this needs to be set correctly.

def after_scenario(context, scenario):

if scenario.status == 'failed':

print('Scenario failed!')

current_time = datetime.now().strftime(DATETIME_FORMAT)

file_name = f'{scenario.name.replace(" ", "_")}-{current_time}.png'

if 'Linux' in platform.system():

context.browser.save_screenshot(f'/tmp/{file_name}')

boto3.resource('s3').Bucket(REPORTS_BUCKET).upload_file(

f'/tmp/{file_name}', f'{SCREENSHOTS_FOLDER}{CURRENT_DATE}/{file_name}'

)

else:

if not os.path.exists(SCREENSHOTS_FOLDER):

os.makedirs(SCREENSHOTS_FOLDER)

context.browser.save_screenshot(f'{SCREENSHOTS_FOLDER}/{file_name}')

Once we have almost everything set, let’s dive into single test creation. Page Object Model pattern is about what exactly hides behind Gherkin’s steps. In this approach, we treat each application view as a separate page and define its elements we want to test. First, we need a base page implementation. Those methods will be inherited by all specific pages. You should put this file in the ‘pages’ directory.

base_page_object.py

from selenium.webdriver.common.action_chains import ActionChains

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

from selenium.common.exceptions import *

import traceback

import time

from environment import get_from_config

class BasePage(object):

def __init__(self, browser, base_url=get_from_config('url')):

self.base_url = base_url

self.browser = browser

self.timeout = 10

def find_element(self, *loc):

try:

WebDriverWait(self.browser, self.timeout).until(EC.presence_of_element_located(loc))

except Exception as e:

print("Element not found", e)

return self.browser.find_element(*loc)

def find_elements(self, *loc):

try:

WebDriverWait(self.browser, self.timeout).until(EC.presence_of_element_located(loc))

except Exception as e:

print("Element not found", e)

return self.browser.find_elements(*loc)

def visit(self, url):

self.browser.get(url)

def hover(self, element):

ActionChains(self.browser).move_to_element(element).perform()

time.sleep(5)

def __getattr__(self, what):

try:

if what in self.locator_dictionary.keys():

try:

WebDriverWait(self.browser, self.timeout).until(

EC.presence_of_element_located(self.locator_dictionary[what])

)

except(TimeoutException, StaleElementReferenceException):

traceback.print_exc()

return self.find_element(*self.locator_dictionary[what])

except AttributeError:

super(BasePage, self).__getattribute__("method_missing")(what)

def method_missing(self, what):

print("No %s here!", what)

That’s a simple login page class. There’re some web elements defined in locator_dictionary and methods using those elements to e.g., enter text in the input, click a button, or read current values. Put this file in the ‘pages’ directory.

login.py

from selenium.webdriver.common.by import By

from .base_page_object import *

class LoginPage(BasePage):

def __init__(self, context):

BasePage.__init__(

self,

context.browser,

base_url=get_from_config('url'))

locator_dictionary = {

'username_input': (By.XPATH, '//input[@name="username"]'),

'password_input': (By.XPATH, '//input[@name="password"]'),

'login_button': (By.ID, 'login_btn'),

}

def enter_username(self, username):

self.username_input.send_keys(username)

def enter_password(self, password):

self.password_input.send_keys(password)

def click_login_button(self):

self.login_button.click()

What we need now is a glue that will connect page methods with Gherkin steps. In each step, we use a particular page that handles the functionality we want to simulate. Put this file in the ‘steps’ directory.

login.py

from behave import step

from environment import get_from_config

from pages import LoginPage, HomePage, NavigationPage

@step('User enters credentials')

def step_impl(context):

page = LoginPage(context)

page.enter_username('test_user')

page.enter_password('test_password')

@step('User clicks Login button')

def step_impl(context):

page = LoginPage(context)

page.click_login_button()

It seems that we have all we need to run tests locally. Of course, not every step implementation was shown above, but it should be easy to add missing ones.

If you want to read more about BDD and POM, take a look at Adrian’s article

All files in the ‘features’ directory will also be on a separate Lambda Layer. You can create a serverless.yml file with the content presented below.

serverless.yml

service: lambda-tests-layer

provider:

name: aws

runtime: python3.6

region: eu-central-1

timeout: 30

layers:

features:

path: features

CompatibleRuntimes: [

"python3.6"

]

resources:

Outputs:

FeaturesLayerExport:

Value:

Ref: FeaturesLambdaLayer

Export:

Name: FeaturesLambdaLayer

This is the first part of the series covering running Parallel Selenium tests on AWS Lambda. More here !

written by

Grape up Expert

Software development

Whose cluster is it anyway?

While researching how enterprises adopt Kubernetes, we can outline a common scenario; implementing a Kubernetes cluster in a company often starts as a proof of concept. Either developers decide they want to try something new, or the CTO does his research and decides to give it a try as it sounds promising. Typically, there is no roadmap, no real plan for the future steps, no decision to go for production.

First steps with a Kubernetes cluster in an enterprise

And then it is a huge success - a Kubernetes cluster makes managing deployments easier, it’s simple to use for developers, cheaper than the previously used platform and it just works for everyone. The security team creates the firewall rules, approves the configuration of the network overlay and load balancers. Operators create their CI/CD pipelines for the cluster deployments, backups and daily tasks. Developers rewrite configuration parsing and communication to fully utilize the ConfigMaps, Secrets and cluster internal routing and DNS. In no time you are one click from scrapping the existing infrastructure and moving everything to the Kubernetes.

This might be the point when you start thinking about providing support for your cluster and the applications in it. It may be an internal development team using your Kubernetes cluster, or PaaS for external teams. In all cases, you need a way to triage all support cases and decide which team or a person is responsible for which part of the cluster management . Let’s first split this into two scenarios.

A Kubernetes Cluster per team

If the decision is to give a full cluster or clusters for a team, there is no resource sharing, so there is less to worry about. Still, someone has to draw the line and say where a cluster operators’ responsibility ends, and the developers have to take it.

The easiest way would be to give the full admin access to the cluster, some volumes for persistent data and a set of LBs (or even one LB for ingress), and delegate the management to the development team. Such a solution would not be possible in most cases, as it requires a lot of experience from the development team to properly manage the cluster and make sure it is stable. Also, this is not always optimal from the resources perspective to create a cluster for even a small team.

The other problem is that when a team has to manage the whole cluster, the actual way it works can greatly diverge. Some teams decide to use nginx ingress and some traefik. End of the day, it is much easier to monitor and manage the uniform clusters.

Shared cluster

The alternative is to utilize the same cluster for multiple teams. There is quite a lot of configuration required to make sure the team doesn't interfere and can’t affect other teams operations, but adds a lot of flexibility when it comes to resource management and limits greatly the number of clusters which have to be managed, for example in terms of backing them up. It might be also useful if teams work on the same project or the set of projects which use the same resources or closely communicate - at the current point it is possible to communicate between the cluster using service mesh or just load balancers, but it may be the most performant solution.

Responsibility levels

If the dev team does not possess the skills required to manage a Kubernetes cluster, then the responsibility has to split between them and operators. Let’s create four examples of this kind of distribution:

Not a developer responsibility

This is probably the hardest version for the operators’ team, where the development team is only responsible for building the docker image and pushing to the correct container registry. Kubernetes on it’s own helps a lot with making sure that new version rollout does not result in a broken application via deployment strategy and health checks. If something silently breaks, it may be hard to figure out if it is a cluster failure or a result of the application update , or even database model change.

Developer can manage deployments, pods, and configuration resources

This is a better scenario. When developers are responsible for the whole application deployment by creating manifests, all configuration resources, and doing rollouts, they can and should do a smoke test afterwards to make sure everything remains operational. Additionally, they can check the logs to see what went wrong and debug in the cluster.

This is also the point where the security or operations team need to start to think about securing a cluster. There are settings on the pod level which can elevate the workload privileges, change the group it runs as or mount the system directories. This can be done for example via Open Policy Agent. Obviously, there should be no access to the other namespaces, especially the kube-system, but this can be easily done with just built-in RBAC.

Developers can manage all namespace level resources

If the previous version worked maybe we can give developers more power? We can, especially when we create quotas on everything we can. Let’s first go through additional resources that are now available and see if something seems risky (we have stripped the uncommon ones for clarity). Below you can see them gathered in two groups:

Safe ones:

Job
PersistentVolumeClaim
Ingress
PodDisruptionBudget
DaemonSet
HorizontalPodAutoscaler
CronJob
ServiceAccount

The ones we recommend to block:

NetworkPolicy
ResourceQuota
LimitRange
RoleBinding
Role

This is not really a definitive guide, just a hint. NetworkPolicy depends really on the network overlay configuration and security rules we want to enforce. ServiceAccount is also arguable depending on the use case. Other ones are commonly used to manage the resources in the shared cluster and the access to it, so should be available mainly for the cluster administrators.

DevOps multifunctional teams

Last, but not least, the famous and probably the hardest to come by approach: multifunctional teams and a DevOps role. Let’s start with the first one - moving part of the operators to work in the same team, same room, with the developers solves a lot of problems. There is no going back and forth and trying to keep in sync backlogs, sprints, and tasks for multiple teams - the work is prioritized for the team and treated as a team effort. No more waiting 3 weeks for a small change, because the whole ops team is busy with the mission-critical project . No more fighting for the change that is top-priority for the project, but gets pushed down in the queue.

Unfortunately, this means each team needs its own operators, which may be expensive and rarely possible. As a solution for that problem comes the mythical DevOps position: developer with operator skills who can part-time create and manage the cluster resources, deployments and CI/CD pipelines, and part-time work on the code. The required skill set is very broad, so it is not easy to find someone for that position, but it gets popular and may revolutionize the way teams work. Sad to say, this position is often described as an alias of the SRE position, which is not really the same thing.

Triage, delegate, and fix

The responsibility split is done, so now we should only decide on the incident response scenarios, how do we triage issues, and figure out which team is responsible for fixing it (for example by monitoring cluster health and associating it with the failure), alerting and, of course, on-call schedules. There are a lot of tools available just for that.

Eventually, there is always a question “whose cluster is it?” and if everyone knows which field or part of the cluster they manage, then there are no misunderstandings and no blaming each other for the failure. And it’s getting resolved much faster.

written by

Adam Kozłowski

Our experts

Software development

In-app purchases in iOS apps – a tester’s perspective

Year after year, Apple’s new releases of mobile devices gain a decent amount of traction in tech media coverage and keep attracting customers to obtain their quite pricey products. Promises of superior quality, straightforwardness of the integrated ecosystem, and inclusion of new, cutting edge technologies urge the company’s longtime fans and new customers alike to upgrade their devices to Californian designed phones, tablets and computers.

Resurgence

Focusing on the mobile market alone, it is impossible to neglect the significant raise in Apple’s iOS market share of mobile operating systems. Its major competitor, Google’s Android has noted 70.68% of mobile market share in April 2020 – which is around 6 percentage points less than in October 2019. On the other hand, iOS, which noted 22.09% of the market share around the same time, recently has risen to 28.79%. This trend surely pleases Apple’s board, along with anyone who strives to monetize their app ideas in App store.

Gaining revenue through in-app purchases sounds like a brilliant idea, but it requires plenty of planning, calculating risks, and evaluating funds for the project. Before publishing the software product, an idea has to be conceived, marketed, developed, and tested. Each step of this process of making an app aimed at providing paid content differs from the process of creating a custom-ordered software. And that also includes testing.

At what cost?

But wait! Testing usually includes lots of repetition. So that would mean testers have to go through many transactions. Doesn’t that entail spending lots of money? Well, not exactly. Apple provides development teams with their own in-app purchase testing tool, Sandbox. But using it doesn’t make testing all fun and games.

Sandbox allows for local development of in-app purchases without spending a dime on them. That happens by supplementing the ‘real’ Appstore account with the Sandbox one. Sounds fantastic, doesn’t it? But unfortunately, there are some inconveniences behind that.

If it ain’t broke...

First of all, Sandbox accounts have to be created manually via iTunes Connect, which leaves much to be desired in terms of performance. These accounts require an email in a valid format. Testers will need plenty of Sandbox accounts because it is actually quite easy to ‘use them up’, especially when tested software has its own sign-in system (not related to Apple ID). If by design said app account is also associated with In-app purchase, each app account will require a new Sandbox account.

Unfortunately, Apple’s Sandbox accounts can get really tricky to log into. When you’re trying to sign in to another Sandbox account, which was probably named similarly as all previous one for convenience, you’d think your muscle memory will allow you to type in the password without looking at the screen. Nothing more wrong. Sometimes, when you type in the credentials which consist of an email and a password, check twice and hit Sign In button in Sandbox login popover, nothing happens.

User is not logged in, not even a sign in error is displayed. And you try again. Every character is exactly the same as before. And eventually, you manage to log in. It’s not really a big of a deal unless you lose your temper easily testing manually, but a simple message informing why Sandbox login failed would be much more user-friendly. In automated tests you could just write the code to try to log in until the email address used as login is displayed in the Sandbox account section in iOS settings, which means that the login was successful. It’s not something testers can’t live with, but addressing the issue by Apple would greatly improve the experience of working in iOS development.

Cryptic writings

Problems arise when notifications informing that a particular Sandbox user is subscribed to an auto-renewable subscription are not delivered by Apple. Therefore, many subscription purchase attempts have to be made to actually make sure whether the development of the app went the correct way and it’s just Apple’s own system’s error, not a bug inside the app.

Speaking of errors – during testing of in-app purchase features, it can become really difficult to point out to developers what went wrong to help them debug the problem. Errors displayed are very cryptic and long; therefore, investigating the root cause of the problem can consume a substantial amount of time. There are two main reasons for that: there’s no error documentation created by Apple for those long error messages or the message displayed is very generic.

Combining this with problems which include performance drops in ‘prime time’, problems with receiving server notifications, e.g. for Autorenewing Subscriptions or simply inability to connect to iTunes store and a simple task of testing monthly subscription can turn into a major regression testing suite.

Hey, Siri...

Another issue with Sandbox testing that is not so convenient to work with and not so obvious to workaround are the irritating Sandbox login prompts. These occur randomly for the eternity of your app’s development cycle if the In-app purchases feature in the app under test includes auto-renewable subscriptions. What is problematic is that these login prompts pop-up at any given time, not just when the app is used or dropped to the background. Well, if you’re patient you can learn to live with it and dismiss it when it shows up. But problems may occur when the device used for testing said app is also utilized as a real device in automated tests, e.g. in conjunction with Appium.

This can be addressed by setting up Appium properties in testing framework to automatically dismiss system popups. That could prove somewhat helpful if the test suite doesn’t include any other interactions with system popups. Deleting the application which includes auto-renewable subscriptions from the device gets rid of the random Sandbox login prompts on the device, but that’s not how testing works. Another workaround might be building the app with subscription part removed, which requires additional work on developers’ side. These login prompts are surely a major problem which Apple should address.

Send reinforcements

Despite all that, developers and testers alike can and eventually will get through the tedious process of developing and ensuring the quality of in-app purchases in Apple’s ecosystem . A good tactic for this in manual testing is to work out a solid testing routine, which will allow for quicker troubleshooting. Being cautious about each step in the testing scenario and monitoring the environment differences such as being logged in with proper Sandbox account instead of regular Apple ID, an appropriate combination of app account and the Sandbox account or the state of the app in relation to purchases made (whether an In-app purchase has been made within a particular installation or not) is key to understanding whether the application does what is expected and transactions are successful.

While Silicon Valley’s giant rises in the mobile market again, more and more ideas will be monetized in Appstore, making profits not only for the developers but also directly for Apple, which collects a hefty portion of the money spent on apps and paid extras. Let’s hope that sooner than later Apple will address the issues that have been annoying development teams for years now and make their jobs a bit easier.

Sources:

https://gs.statcounter.com/os-market-share/mobile/worldwide

written by

Adrian Poć

Software development

Common Kubernetes failures at scale

Currently, Vanilla Kubernetes supports 5000 nodes in a single cluster. It does not mean that we can just deploy 5000 workers without consequences - some problems and edge scenarios happen only in the larger clusters. In this article, we analyze the common Kubernetes failures at scale, the issues we can encounter if we reach a certain cluster size or high load - network or compute.

Incorrect size

When the compute power requirements grow, the cluster grows in size to house the new containers. Of course, as experienced cluster operators , while adding new workers, we also increase master nodes count. Everything works well until the Kubernetes cluster size expanded slightly over 1000-1500 nodes - and now everything fails. Kubectl does not work anymore, we can’t make any new changes - what has happened?

Let’s start with what is a change for Kubernetes and what actually happens when an event occurs. Kubectl contacts the kube-apiserver through API port and requests a change. Then the change is saved in a database and used by other APIs like kube-controller-manager or kube-scheduler. This gives us two quick leads - either there is a communication problem or the database does not work.

Let’s quickly check the connection to the API with curl ( curl https://[KUBERNETES_MASTE_HOST]/api/ ) - it works. Well, that was too easy.

Now, let’s check the apiserver logs if there is something strange or alarming. And there is! We have an interesting error message in logs:

etcdserver: mvcc: database space exceeded

Let’s connect to ETCD and see what is the database size now:

And we see a round number 2GB or 4GB of database size. Why is that a problem? The disks on masters have plenty of free space.

The thing is, it is not caused by resources starvation. The maximum DB size is just a configuration value, namely quota-backend-bytes . The configuration for this was added in 1.12, but it is possible (and for large clusters highly advised) to just use separate etcd cluster to avoid slowdowns. It can be configured by environment variable:

ETCD_QUOTA_BACKEND_BYTES

Etcd itself is a very fragile solution if you think of it for the production environment. Upgrades, rollback procedure, restoring backups - those are things to be carefully considered and verified because not so many people think about it. Also, it requires A LOT of IOPS bandwidth, so optimally, it should be run on fast SSDs.

What are ndots?

Here occurs one of the most common issues which comes to mind when we think about the Kubernetes cluster failing at scale. This is the first issue faced by our team while starting with managing Kubernetes clusters, and it seems to occur after all those years to the new clusters.

Let’s start with defining ndots . And this is not something specific to Kubernetes this time. In fact, it is just a rarely used /etc/resolv.conf configuration parameter, which by default is set to 1 .

Let’s start with the structure of this file, there are only a few options available there:

nameserver - list of addresses of the DNS server used to resolve the addresses (in the order listed in a file). One address per keyword.
domain - local domain name.
sortlist - sort order of addresses returned by gethostbyname() .
options:
- ndots - maximum number of dots which must appear in hostname given for resolution before initial absolute query should happen. Ndots = 1 means if there is any dot in the name the first try will be absolute name try.
- debug , timeout , attempts … - let’s leave other ones for now
search - list of domains used for the resolution if the query has less than configure in ndots dots.

So the ndots is a name of configuration parameter which, if set to value bigger than 1 , generates more requests using the list specified in the search parameter. This is still quite cryptic, so let’s look at the example `/etc/resolve.conf` in Kubernetes pod:

nameserver 10.11.12.13
search kube-system.svc.cluster.local svc.cluster.local cluster.local
options ndots:5

With this configuration in place, if we try to resolve address test-app with this configuration, it generates 4 requests:

test-app.kube-system.svc.cluster.local
test-app.svc.cluster.local
test-app.cluster.local
test-app

If the test-app exists in the namespace, the first one will be successful. If it does not exist at all, it 4th will get out to real DNS.

How can Kubernetes, or actually CoreDNS, know if www.google.com is not inside the cluster and should not go this path?

It does not. It has 2 dots, the ndots = 5, so it will generate:

www.google.com.kube-system.svc.cluster.local
www.google.com.svc.cluster.local
www.google.com.cluster.local
www.google.com

If we look again in the docs there is a warning next to “search” option, which is easy to miss at first:

Note that this process may be slow and will generate a lot of network traffic if the servers for the listed domains are not local and that queries will time out if no server is available for one of the domains.

Not a big deal then? Not if the cluster is small, but imagine each DNS resolves request between apps in the cluster being sent 4 times for thousands of apps, running simultaneously, and one or two CoreDNS instances.

Two things can go wrong there - either the DNS can saturate the bandwidth and greatly reduce apps accessibility, or the number of requests sent to the resolver can just kill it - the key factor here will be CPU or memory.

What can be done to prevent that?

There are multiple solutions:

1. Use only fully qualified domain names (FQDN). The domain name ending with a dot is called fully qualified and is not affected by search and ndots settings. This might not be easy to change and requires well-built applications, so changing the address does not require a rebuild.

2. Change ndots in the dnsConfig parameter of the pod manifest:

dnsConfig:
options:
- name: ndots
value: "1"

This means the short domain names for pods do not work anymore, but we reduce the traffic. Also can be done for deployments which reach a lot of internet addresses, but not require local connections.

3. Limit the impact. If we deploy kube-dns (CordeDNS) on all nodes as DaemonSet with a fairly big resources pool there will be no outside traffic. This helps a lot with the bandwidth problem but still might need a deeper look into the deployed network overlay to make sure it is enough to solve all problems.

ARP cache

This is one of the nastiest failures, which can result in the full cluster outage when we scale up - even if the cluster is scaled up automatically. It is ARP cache exhaustion and (again) this is something that can be configured in underlying linux.

There are 3 config parameters associated with the number of entries in the ARP table:

gc_thresh1 - minimal number of entries kept in ARP cache.
gc_thresh2 - soft max number of entries in ARP cache (default 512).
gc_thresh3 - hard max number of entries in ARP cache (default 1024).

If the gc_thresh3 limit is exceeded, the next requests result with a neighbor table overflow error in syslog.

This one is easy to fix, just increase the limits until the error goes away, for example in /etc/sysctl.conf file (check the manual for you OS version to make sure what is the exact name of the option):

net.ipv4.neigh.default.gc_thresh1 = 256
net.ipv4.neigh.default.gc_thresh2 = 1024
net.ipv4.neigh.default.gc_thresh3 = 2048

So it’s fixed , by why did it happen in the first place? Each pod in Kubernetes has it’s own IP address (which is at least one ARP entry). Each node takes at least two entries. This means it is really easy for a bigger cluster to exhaust the default limit.

Pulling everything at once

When the operator decides to use a smaller amount of very big workers, for example, to speed up the communication between containers, there is a certain risk involved. There is always a point of time when we have to restart a node - either it is an upgrade or maintenance. Or we don’t restart it, but add a new one with a long queue of containers to be deployed.

In certain cases, especially when there are a lot of containers or just a few very big ones, we might have to download a few dozens of gigabytes, for example, 100GB. There are a lot of moving pieces that affect this scenario - container registry location, size of containers, or several containers which results in a lot of data to be transmitted - but one result: the image pull fails. And the reason is, again, the configuration.

There are two configuration parameters that lead to Kubernetes cluster failures at scale:

serialize-image-pulls - download the images one by one, without parallelization.
image-pull-progress-deadline - if images cannot be pulled before the deadline triggers it is canceled.

It might be also required to verify docker configuration on nodes if there is no limit set for parallel pulls. This should fix the issue.

Kubernetes failures at scale - sum up

This is by no means a list of all possible issues which can happen. From our experience, those are the common ones, but as the Kubernetes and software evolve, this can change very quickly. It is highly recommended to learn about Kubernetes cluster failures that happened to others, like Kubernetes failures stories and lessons learned to avoid repeating mistakes that had happened before. And remember to backup your cluster, or even better make sure you have the immutable infrastructure for everything that runs in the cluster and the cluster itself, so only data requires a backup.

written by

Adam Kozłowski

Automotive

How to expedite claims adjustment by using AI to improve virtual inspection

If we look at the claims adjustment domain from a high-level perspective, we will surely notice it is a very complex set of elements: processes, data, activities, documents, systems, and many others, depending on each other. There are many people who are involved in the process and in many cases, they struggle with a lot of inefficiency in their daily work. This is exactly where AI comes to help. AI-based solutions and mechanisms can automate, simplify, and speed up many parts of the claims adjustment process, and eventually reduce overall adjustment costs.

The claims adjustment process

Let's look at the claims adjustment process in more detail. There are multiple steps on the way: when an event that causes a loss for the customer occurs, the customer notifies the insurance company about the loss and files a claim. Then the company needs to gather all the information and documentation to understand the circumstances, assess the situation, and eventually be able to validate their responsibility and estimate the loss value. Finally, the decision needs to be made, and appropriate parties, including the customer, need to be notified about the result of the process.

At each step of this process, AI can not only introduce improvements and optimizations but also enable new possibilities and create additional value for the customer .

Let’s dive into a few examples of potential AI application to claims adjustment process in more detail.

Automated input management

The incoming correspondence related to claims is very often wrongly addressed. Statistics show that on average, 35% of messages is incorrectly addressed. A part of them is sent to a generic corporate inbox, next ones to wrong people, or sometimes even to entirely different departments. That causes a lot of confusion and requires time to reroute the message to the correct place.

AI can be very helpful in this scenario - an algorithm can analyze the subject and the content of the message, look for keywords such as claim ID, name of the customer, policy number , and automatically reroute the message to the correct recipient. Furthermore, the algorithm can analyze the context and detect if it is a new claim report or a missing attachment that should be added to an already-filed claim. Such a solution can significantly improve the effectiveness and speed up the process.

Automated processing of incoming claims

The automation of processing of incoming documents and messages could be taken one step further. What if we used an AI algorithm to analyze the content of the message? A claim report can be sent using an official form, but also as a plain email message or even as a scanned paper document – the solution could analyze the document and extract the key information about the claim so that it can be automatically added to the claim registry system. Simultaneously the algorithm could check if all the needed data, documents, and attachments are provided and if not, notify the reporter appropriately. In a "traditional" approach, this part is often manual and thus takes a lot of time. Introducing an AI-based mechanism here would drastically reduce the amount of manual work, especially in the case of well-defined and repeatable causes, e.g., car insurance claims.

Verification of reported damage

Appraisal of the filed claim and verification of reported damage is another lengthy step in the claim adjustment process. The adjuster needs to verify if the reported damage is true and if the reported case includes those that occurred previously. Computer vision techniques can be used here to automate and speed up the process - e.g., by analyzing pictures of the car taken by the customer after the accident or analyzing satellite or aerial photos of a house in case of property insurance.

Verification of incurred costs

AI-driven verification can also help identify fraudulent operations and recognize costs that are not related to the filed claim. In some cases, invoices presented for reimbursement include items or services which should not be there or which cost is calculated using too high rates. AI can help compare the presented invoices with estimated costs and indicate inflated rates or excess costs - in case of medical treatment or hospital stay. Similarly, the algorithm can verify whether the car repair costs are calculated correctly by analyzing the reported damage and comparing an average rate for corresponding repair services with the presented rate.

Such automated verification helps flag potentially fraudulent situations and saves adjuster's time. letting them focus only on those unclear cases rather than analyze each one manually.

Accelerate online claims reporting with automated VIN recognition

In the current COVID-19 situation, digital services and products are becoming critical for all the industries. Providing policyholders with the capability to effectively use online channels and virtual services is essential for the insurance industry as well.

One of our customers wanted to speed up the processing of claims reported through their mobile application. The insurer faced a challenging issue, as 8% of claims reported through the mobile application were rejected due to the bad quality of VIN images. Adjusters had problems with deciphering the Vehicle Identification Number and had to request the same information from the customer. The whole process was unnecessarily prolonged and frustrating for the policyholder.

By introducing a custom machine learning model, trained specifically for VIN recognition instead of a generic cloud service, our customer increased VIN extraction accuracy from 60% to 90% , saving on average 1,5 h per day for each adjuster. Previously rejected claims can be now processed quicker and without asking policyholders for the information they already provided resulting in increased NPS and overall customer satisfaction.

https://www.youtube.com/watch?v=oACNXmlUgtY

Those are just a few examples of how AI can improve claims adjustments. If you would like to know more about leveraging AI technologies to help your enterprises improve your business, tell us about your challenges and we will jointly work on tackling them .

written by

Roman Swoszowski

Software development

Kubernetes cluster management: Size and resources

While managing Kubernetes clusters, we can face some demanding challenges. This article helps you manage your cluster resources properly, especially in an autoscaling environment.

If you try to run a resource-hungry application, especially on a cluster which has autoscaling enabled, at some point this happens:

Image credits: https://github.com/eclipse/che/issues/14598

For the first time, it may look bad, especially if you see dozens of evicted pods in kubectl get, and you only wanted to run 5 pods. With all that claims, that you can run containers without worries about the orchestration, as Kubernetes does all of that for you, you may find it overwhelming.

Well, this is true to some extent, but the answer is - it depends, and it all boils down to a crucial topic associated with Kubernetes cluster management. Let's dive into the problem.

Learn more about services provided by Grape Up

You are at Grape Up blog, where our experts share their expertise gathered in projects delivered for top enterprises. See how we work.

Enabling the automotive industry to build software-defined vehicles
Empowering insurers to create insurance telematics platforms
Providing AI & advanced analytics consulting

Kubernetes Cluster resources management

While there is a general awareness that resources are never limitless - even in a huge cluster as a service solution, we do not often consider the exact layout of the cluster resources. And the general idea of virtualization and containerization makes it seem like resources are treated as a single, huge pool - which may not always be true. Let’s see how it looks.

Let’s assume we have a Kubernetes cluster with 16 vCPU and 64GB of RAM.

Can we run on it our beautiful AI container, which requires 20GB of memory to run? Obviously, not. Why not? We have 64GB of memory available on the cluster!

Well, not really. Let’s see how our cluster looks inside:

The Cluster again

There are 4 workers in the cluster, and each has 16GB of memory available (in practice, it will be a little bit less, because of DaemonSets and system services, which run a node and take their small share). Container hard memory limit is, in this case, 16GB, and we can’t run our container.

Moreover, it means we have to always take this limitation into account. Not just if we deploy one big container, but also in complex deployments, or even things which in general can run out-of-the-box like helm charts .

Let’s try another example.

Our next task will be a Ceph deployment to the same cluster. The target we want to achieve is a storage size of 1TB split into 10 OSDs (object storage daemons) and 3 ceph MONs (monitors). We want to put it on 2 of the nodes, and leave the other 2 for deployments which are going to use the storage. Basic and highly extensible architecture.

The first, naive approach is to just set OSDs count to 10, MONs count to 3 and add tolerations to the Ceph pods, plus of course matching taint on Node 1 and Node 2 . All ceph deployments and pods are going to have the nodeSelector set to target only nodes 1 and 2 .

Kubernetes does its thing and runs mon-1 and mon-2 on the first worker along with 5 osds, and mon-3 along with 5 osds on the second worker.

mon-1
mon-2
osd-1
osd-2
osd-3
osd-4
osd-5 mon-3
osd-6
osd-7
osd-8
osd-9
osd-10 Stateless App

It worked out! And our application can now save quite a lot of large files to Ceph very quickly, so our job becomes easier. If we also deploy the dashboard and create a replicated pool, we can even see 1TB of storage available and 10 OSDs up, that's a huge achievement!

Dashboard view example ( https://ceph.io/community/new-in-nautilus-new-dashboard-functionality/ )

The very next morning, we check the status again and see that the available storage is around 400GB and 4 OSDs in flight. What is going on? Is this a crash? Ceph is resilient, it should be immune to crashes, restart quickly, and yet it does not seem like it worked very well here.

If we now check the cluster, we can see a lot of evicted OSD pods. Even more, than we are supposed to have at all. So what really has happened? To figure this out, we need to go back to our initial deployment configuration and think it through.

Limits and ranges

We ran 13 pods, 3 of them (monitors) don’t really need a lot of resources, but OSDs do. More we use it more resources it needs because ceph caches a lot of data in memory. Plus replication and balancing data over storage containers do not come free.

So initially after the deployment, the memory situation looks more or less like this:

Node 1
mon-1 - 50MB
mon-2 - 50MB
osd-1 - 200MB
osd-2 - 200MB
osd-3 - 200MB
osd-4 - 200MB
osd-5 - 200MB

1100MB memory used Node 2
mon-3 - 50M
Bosd-6 - 200MB
osd-7 - 200MB
osd-8 - 200MB
osd-9 - 200MB
osd-10 - 200MB

1050MB memory used

After a few hours of extensive usage, something goes wrong.

Node 1
mon-1 - 250MB
mon-2 - 250MB
osd-1 - 6500MB
osd-2 - 5300MB
osd-3 - Evicted
osd-4 - Evicted
osd-5 - Evicted

12300MB memory used Node 2
mon-3 - 300MB
osd-6 - 9100MB
osd-7 - 5700MB
osd-8 - Evicted
osd-9 - Evicted
osd-10 - Evicted

15100MB memory used

We have lost almost 50% of our pods. Does it mean it’s over? No, we can lose more of them quickly, especially if the high throughput will now target the remaining pods. Does it mean we need more than 32GB of memory to run this Ceph cluster? No, we just need to correctly set limits so a single OSD can’t just use all available memory and starve other pods.

In this case, the easiest way would be to take the 30GB of memory (leave 2GB for mons - 650MB each, and set them limits properly too!) and divide it by 10 OSDs. So we have:

resources :
limits :
memory : "3000Mi"
cpu : "600m"

Is it going to work? It depends, but probably not. We have configured 15GB of memory for OSDs and 650MB for each pod. It means that first node requires: 15 + 2*0.65 = 16.3GB. A little bit too much and also not taking into account things like DaemonSets for logs running on the same node. The new version should do the trick:

resources :
limits :
memory : "2900Mi"
cpu : "600m"

Quality of Service

There is one more warning. If we also set a request for the pod to exactly match the limit, then Kubernetes treats this kind of pod differently:

resources :
requests :
memory : "2900Mi"
cpu : "600m"
limits :
memory : "2900Mi"
cpu : "600m"

This pod configuration is going to have QoS in Kubernetes set to Guaranteed . Otherwise, it is Burstable . Guaranteed pods are never evicted - by setting the same request and limit size, we confirm that we are certain what is the resource usage of this pod, so it should not be moved or managed by Kubernetes. It reduces flexibility for the scheduler but makes the whole deployment way more resilient.

Obviously, for mission-critical systems , “best-effort” is never enough.

Resources in an autoscaling environment

If we can calculate or guess the required resources correctly to match the cluster size, the limits and quality of service may be just enough. Sometimes though the configuration is more sophisticated and the cluster size is fluid - it can scale up and down horizontally and change the number of available workers.

In this case, the planning goes in two parallel paths - you need to plan for the minimal cluster size and the maximum cluster size - assuming linear scaling of resources.

It cannot be assumed that applications will act properly and leave space for the other cluster cohabitants. If the pods are allowed to scale up horizontally or vertically while the cluster is expanding, it may result in evicting other pods when it’s scaling down. To mitigate this issue, there are two main concepts available in Kubernetes: Pod Priority and Pod Disruption Budget .

Let’s start again by creating our test scenario. This time we don’t need tons of nodes, so let’s just create a cluster with two node groups: one consisting of regular instances (let’s call it persistent) and one consisting of preemptible/spot instance (let’s just call them preemptible for the sake of an experiment).

The preemptible nodes group will scale up when the CPU usage of the VM (existing node) will be over 0.7 (70%).

The advantage of the preemptible/spot instances is their price. They are much cheaper than regular VMs of the same performance. The only drawback is that there is no guarantee for their lifetime - the instance can be killed when the cloud providers decide it is required somewhere else, for maintenance purposes, or just after 24 hours. This means we can only run fault-tolerant, stateless workloads there.

Which should be most of the things which run in your cluster if you follow the 12 factors, right?

Why there is one persistent node in our cluster then? To prepare for the rare case, when none of the preemptible nodes are running, it is going to maintain the minimal set of containers to manage the operability of the application.

Our application will consist of:

Application Replicas CPUs Memory Redis cluster with one redis master - has to run on a persistent node 1 0.5 300MB Frontend application (immutable) 2 0.5 500MB Backend application (immutable) 2 0.7 500MB Video converter application (immutable) 1 1 2GB Sum 3.9 4.3GB

We can configure the redis master to work on the persistent node using a node selector. Then just deploy everything else and Bob is your uncle .

Horizontal Pod Autoscaler

Well, but we have an autoscaling nodes group and no autoscaling configured in the cluster. This means we have never really triggered cluster autoscaling and it stays all the time on two workers, because application itself does not increase replicas count. Let’s start with the Horizontal Pod Autoscaler:

Frontend:

apiVersion : autoscaling/v2beta2
kind : HorizontalPodAutoscaler
metadata :
name : frontend-hpa
spec : scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : frontend
minReplicas : 2
maxReplicas : 10
metrics :
- type : Resource
resource :
name : cpu
target :
type : Utilization
averageUtilization : 75

Backend:

apiVersion : autoscaling/v2beta2
kind : HorizontalPodAutoscaler
metadata :
name : backend-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : backend
minReplicas : 2
maxReplicas : 10
metrics :
- type : Resource
resource :
name : cpu
target :
type : Utilization
averageUtilization : 75

Video converter:

apiVersion : autoscaling/v2beta2
kind : HorizontalPodAutoscaler
metadata :
name : video-converter-hpa
spec :
scaleTargetRef :
apiVersion : apps/v1
kind : Deployment
name : video-converter
minReplicas : 1
maxReplicas : 25
metrics :
- type : Resource
resource :
name : cpu
target :
type : Utilization
averageUtilization : 25

So now we have the same configuration as we described in the deployment - the sum of minReplicas is equal. Why does the video converter have such a low target average utilization? When there are multiple conversions enqueued, it will make autoscaling quicker - if it quickly reaches 25% of average CPU usage, then the new one is spawned. This is a very trivial configuration - if you need something more sophisticated check scaling policies .

What might happen if we now test our environment and enqueue 50 video conversions each taking around 10 minutes?

It depends, but the likely scenario is that the video converter will scale up to the 25 instances. What happens with other containers in the cluster? Some of them will be evicted, maybe backend ones, maybe frontend ones, or maybe even redis. There is quite a high risk of the setup to break down and be inaccessible for the end-users.

Can we mitigate the issue? Yes, for example, we can create the priority classes and assign them lower for the video converter. The higher priority pod has, the more worth it has for the scheduler. If two pods are due to be evicted - the one with lower priority gets the pole position. If two pods of different priorities are scheduled, the higher priority one gets the precedence.

apiVersion : scheduling.k8s.io/v1
kind : PriorityClass
metadata :
name : high-priority
value : 100000
globalDefault : false
description : "This is high priority class for important workloads"

So if we give the converter lower priority, we confirm that the frontend and backend pods are more important, and in the worst case, the video converter can be expelled from the cluster.

Moreover, this is not going to guarantee that the backend can’t evict the frontend.

There is also an alternative that allows us to have better control over the scheduling of the pods. It is called…

Pod Disruption Budget

This resource allows us to configure a minimal amount of the deployment pods running at once. It is more strict than just priority because it can even block the node drain, if there is not enough space on other workers to reschedule the pod, and in result make the replicas count lower than the assigned budget.

The configuration is straightforward:

apiVersion : policy/v1beta1
kind : PodDisruptionBudget
metadata :
name : frontend-pdb
spec :
minAvailable : 2
selector :
matchLabels :
app : frontend

From now on, the frontend replica count cannot get lower than 2. We can assign this way minimums for all the pods and make sure there are always at least 1 or 2 pods which can handle the request.

This is the easiest and safest way to make sure that pod autoscaling and cluster scaling down is not going to affect the overall solution stability - as long as the minimal set of containers configured with the disruption budget can fit the minimal cluster size and it is enough to handle the bare minimum of requests.

Connecting the dots

Now we have all the required pieces to create a stable solution. We can configure HPAs to have the same min number of replicas as PDB to make the scheduler's life easier. We know our max cluster size and made sure limits are the same as requests, so pods are not evicted. Let’s see what we get with the current configuration:

Application Min. replicas Max. replicas PDB CPUs Memory A redis cluster with one redis master - has to run on a persistent node 1 1 1 0.5 300MB Frontend application (immutable) 2 10 2 0.5 500MB Backend application (immutable) 2 10 2 0.7 500MB Video converter application (immutable) 1 25 1 1 2GB Sum (min) 3.9 4.3GB Sum (max) 37.5 ~60.3GB

Not bad. It can even stay as it is, but the current max cluster size is 24 cores with 48GB of memory. With all the configurations we went through, it should be fine when we exceed that size, so there is a little bit of flexibility for the scheduler - for example if there is a very low load on frontend and backend, but a huge pile of data to be converted, then the converter can scale up to approx. 19-21 instances, which is nice to have.

There is no one design that fits all

Is there anything wrong with the current configuration? Well, there can be, but we are going into unknown depths of “it depends.”

It all starts with the simple question - what is the purpose of my solution/architecture and what are the KPIs. Let’s look again at the example - it is a video converted with a web application. A pretty basic solution that scales up if required to accommodate a higher load. But what is more important - faster conversion or more responsible UI?

It all boils down to the product requirements, and in general, it is easy to solve. There are three paths we can follow from now on:

The I don’t care path

If it does not matter from the user and product perspective just leave it and see how it performs. Maybe even two frontend pods can handle a lot of load? Or maybe nobody cares about the latency as long as nothing crashes unexpectedly? Don’t overengineer and don’t try the premature optimization - let it be and see if it’s fine. If it’s not there are still two other paths available.

The I know what matters most path

This path requires a bit of knowledge about priorities. If the priority is the smooth and scalable UI and it’s fine to have quite some conversions waiting - put the higher priority on the frontend and backend deployments as described in previous paragraphs. If the video conversion is the key - put the higher priority on it. Whatever you choose, it will be the deployment that can scale up at the expense of the other one. This is especially important if loads don’t really run in parallel most of the time, so can scale up and down independently, and the next path does not fit that scenario.

The I want to be safe path

The last path is straightforward, just put the maximums so to be close to the cluster limits, but not higher:

Application Min. replicas Max. replicas PDB CPUs Memory A redis cluster with one redis master - has to run on a persistent node 1 1 1 0.5 300MB Frontend application (immutable) 2 8 2 0.5 500MB Backend application (immutable) 2 8 2 0.7 500MB Video converter application (immutable) 1 13 1 1 2GB Sum (min) 3.9 4.3GB Sum (max) 23,1 34,3GB

Now there is some space in the memory department, so we can, for example, give the pods more memory. We are also always safe because most of the time, there will be no fighting for resources. It might happen only when the cluster will be scaling up.

Is this a perfect solution? Not really, because it is possible to fit 20 video converters at once in the cluster when there is no traffic on the UI (frontend and backend) and we artificially limit the deployment ability to scale.

Autoscaling considerations

When it comes to autoscaling, there are some things to keep in mind. First, it is not reliable - it’s impossible to say how long it will take for the cloud provider to spin up the VM. It may take seconds, and it may take minutes (in general it rarely takes less than a minute), so starting very small with the hope of autoscaling solving the peak loads may not be the greatest idea.

The other often forgotten thing is that when we scale up, then there is a point when the cluster scales down. If the deployment scales down and pods are truly stateless and can handle it gracefully - then it is not a big deal. When it comes to the cluster scaling down, we need to remember that it effectively shuts down the VMs. Sometimes something is running on them, and the scheduler has to quickly move the workload to the other workers. This is something that has to be thoughtfully tested to make sure it does not break the application operations.

Kubernetes cluster management - summary

This is the end of our quite long journey through Kubernetes cluster size and resources management. There is much more there, especially for the bigger clusters or complex problems, which may come in handy later on, like configuring the eviction policies , namespace requests and limits , or topology management useful when we have specific nodes for specific purposes. Although what we have gone through in this article should be perfectly fine and serve well even quite complex solutions . Good luck and we wish you no evicted pods in the future!

written by

Adam Kozłowski

Software development

Variable key names for Codable objects: How to make Swift Codable protocol even more useful?

It’s hard to imagine modern Swift iOS application that doesn’t work with multiple data sources like servers, local cache DB, etc, or doesn’t parse/convert data between different formats. While Swift Codable protocol is a great solution for this purpose it also has some important drawbacks when developing a complex app that deals with multiple data formats. From this article, you will know how to improve the Swift Codable mechanism and why it’s important.

Swift has a great feature for encoding/decoding data in key-value formats called Coding protocol. That is, you may choose to store data in e.g. JSON format or plist by at minimum just defining names of the keys for which the corresponding values should be stored.

Advantages and disadvantages of Swift Codable protocol

Here are the advantages of Codable protocol:

1) Type safety . You don't need typecasting or parsing the strings read from the file. Swift does for you all the low-level reading and parsing only returning you a ready to use object of a concrete type.

2) The Simplicity of usage . At a minimum, you may just declare that your type that needs to be encodable or decodable confirms to the corresponding protocol (either Codable or it's parts Decodable or Encodable). The compiler will match the keys from your data (e.g., JSON) automatically based on the names of your type's properties. In case you need advanced matching of keys' names with your type's properties (and in most real life cases you need it), you may define an enum CodingKeys that will do the mapping.

3) Extensibility . When you need some advanced parsing, you may implement initialization and encoding methods to parse/encode the data. This, for example, allows you to decode several fields of JSON combining them into a single value or make some advanced transformation before assigning value to your codable object's property.

Despite its flexibility, the Codable approach has a serious limitation. For real-life tasks, it's often needed to store the same data in several data formats at the same time. For example, data coming from a server may be stored locally as a cache. Info about user account coming from the server is often stored locally to keep user sign in. At first glance, the Swift Codable protocol can be perfectly used in this case. However, the problem is that, as soon as one data source changes names of the keys for the stored values, the data won't be readable anymore by Codable object.

As an example let's imagine a situation when an application gets user info for a user account from the server and stores it locally to be used when the app is relaunched. In this case, the proper solution for parsing JSON data from the server into a model object is to use Codable protocol. The simplest way to store the object locally would be to just use Codable to encode the object (e.g. in plist format) and to store it locally. But codable object will use a certain set of keys that is defined by server JSON field names in our example. So if the server changes names of the JSON fields it returns, we'll have to change Codable implementation to match the new fields' names. So Codable implementation will use new keys to encode/decode data. And since the same implementation is used for local data, as well the user info that was previously saved locally will become unreadable.

To generalize, if we have multiple data sources for the same keyed data, the Codable implementation will stop working as soon as one of the data sources changes the names of the keys.

Approach with multiple entities

Let's see how to improve the Swift Codable protocol to properly handle such a situation. We need a way to encode/decode from each data source without restriction to have the same key names. To do it, we may write a model object type for each data source.

Back to our example with server and local data, we’ll have the following code:

// Server user info

struct ServerUserInfo: Codable {

let user_name: String

let email_address: String

let user_age: Int

}

// Local user info to store in User Defaults

struct LocalUserInfo: Codable {

let USER_NAME: String

let EMAIL: String

let AGE: Int

}

So we have two different structures: one to encode/decode user info from server and the other to encode/decode data for local usage in User Defaults. But semantically, this is the same entity. So code that works with such object should be able to use any of the structures above interchangeably. For this purpose, we may declare the following protocol:

protocol UserInfo {

var userName: String { get }

var email: String { get }

var age: Int { get }

}

Each user info structure will then conform to the protocol:

extension LocalUserInfo: UserInfo {

var userName: String {

return USER_NAME

}

var email: String {

return EMAIL

}

var age: Int {

return AGE

}

}

extension ServerUserInfo: UserInfo {

var userName: String {

return user_name

}

var email: String {

return email_address

}

var age: Int {

return user_age

}

}

So, code that requires user info will use it via UserInfo protocol.

Such solution is a very straightforward and easy to read. However, it requires much code. That is, we have to define a separate structure for each format a particular entity can be encoded/decoded from. Additionally, we need to define a protocol describing the entity and make all the structures conform to that protocol.

Approach with variational keys

Let’s find another approach that will make it possible to use a single structure to do the encoding/decoding from different key sets for different formats. Let’s also make this approach maintain simplicity in its usage. Obviously, we cannot have Coding keys bound to properties’ names as in the previous approach. This means we’ll need to override init(from:) and encode(to:) methods from Codable protocol. Below is a UserInfo structure defined for coding in JSON format from our example.

extension UserInfo: Codable {

private enum Keys: String, CodingKey {

case userName = "user_name"

case email = "email_address"

case age = "user_age"

}

init(from decoder: Decoder) throws {

let container = try decoder.container(keyedBy: Keys.self)

self.userName = try container.decode(String.self, forKey: .userName)

self.email = try container.decode(String.self, forKey: .email)

self.age = try container.decode(Int.self, forKey: .age)

}

func encode(to encoder: Encoder) throws {

var container = encoder.container(keyedBy: Keys.self)

try container.encode(userName, forKey: .userName)

try container.encode(email, forKey: .email)

try container.encode(age, forKey: .age)

}

}

In fact, to make the code above decode and encode another data format we only need to change the keys themselves. That is, we’ve used simple enum conforming to the CodingKey protocol to define the keys. However, we may implement arbitrary type conforming to the CodingKey protocol. For example, we may choose a structure. So, a particular instance of a structure will represent the coding key used in calls to container.decode() or container.encode() . While implementation will provide info about the keys of a particular data format. The code of such structure is provided below:

struct StringKey: CodingKey {

let stringValue: String

let intValue: Int?

init?(stringValue: String) {

self.intValue = nil

self.stringValue = stringValue

}

init?(intValue: Int) {

self.intValue = intValue

self.stringValue = "\(intValue)"

}

}

So, the StringKey just wraps a concrete key for a particular data format. For example, to decode userName from JSON, we’ll create the corresponding StringKey instances specifying JSON user_name field into init?(stringValue:) method.

Now we need to find a way to define key sets for each data type. To each property from UserInfo , we need somehow assign keys that can be used to encode/decode the property’s value. E.g. for property userName corresponds to user_name key for JSON and USER_NAME key for plist format. To represent each property, we may use Swift’s KeyPath type. Also, we would like to store information about which data format each key is used for. Translating the above into code we’ll have the following:

enum CodingType {

case local

case remote

}

extension UserInfo {

static let keySet: [CodingType: [PartialKeyPath<UserInfo>: String]] = [

// for .plist stored locally

.local: [

\Self.userName: "USER_NAME",

\Self.email: "EMAIL",

\Self.age: "AGE"

],

// for JSON received from server

.remote: [

\Self.userName: "user_name",

\Self.email: "email_address",

\Self.age: "user_age"

]

]

}

To let the code inside init(from:) and encode(to:) methods aware of the decode/encode data format we may use user info from Decoder/Encoder objects:

extension CodingUserInfoKey {

static var codingTypeKey = CodingUserInfoKey(rawValue: "CodingType")

}

...

let providedType = <either .local or .remote from CodingType enum>

let decoder = JSONDecoder()

if let typeKey = CodingUserInfoKey.codingTypeKey {

decoder.userInfo[typeKey] = providedType

}

When decoding/encoding, we’ll just read the value from user info for CodingUserInfoKey.codingTypeKey key and pick the corresponding set of coding keys.

Let’s bring all the above together and see how our code will look like:

enum CodingError: Error {

case keyNotFound

case keySetNotFound

}

extension UserInfo: Codable {

static func codingKey(for keyPath: PartialKeyPath<Self>,

in keySet: [PartialKeyPath<Self>: String]) throws -> StringKey {

guard let value = keySet[keyPath],

let codingKey = StringKey(stringValue: value) else {

throw CodingError.keyNotFound

}

return codingKey

}

static func keySet(from userInfo: [CodingUserInfoKey: Any]) throws -> [PartialKeyPath<Self>: String] {

guard let typeKey = CodingUserInfoKey.codingTypeKey,

let type = userInfo[typeKey] as? CodingType,

let keySet = Self.keySets[type] else {

throw CodingError.keySetNotFound

}

return keySet

}

init(from decoder: Decoder) throws {

let keySet = try Self.keySet(from: decoder.userInfo)

let container = try decoder.container(keyedBy: StringKey.self)

self.userName = try container.decode(String.self, forKey: try Self.codingKey(for: \Self.userName,

in: keySet))

self.email = try container.decode(String.self, forKey: try Self.codingKey(for: \Self.email,

in: keySet))

self.age = try container.decode(Int.self, forKey: try Self.codingKey(for: \Self.age,

in: keySet))

}

func encode(to encoder: Encoder) throws {

let keySet = try Self.keySet(from: encoder.userInfo)

var container = encoder.container(keyedBy: StringKey.self)

try container.encode(userName, forKey: try Self.codingKey(for: \Self.userName,

in: keySet))

try container.encode(email, forKey: try Self.codingKey(for: \Self.email,

in: keySet))

try container.encode(age, forKey: try Self.codingKey(for:

\Self.age,

in: keySet))

}

}

Note we’ve added two helper static methods: codingKey(for keyPath , in keySet) and keySet(from userInfo) . Their usage makes code of init(from:) and encode(to:) more clear and straightforward.

Generalizing the solution

Let’s improve the solution with coding key sets we’ve developed to make it easier and faster to apply. The solution has some boilerplate code for transforming KeyPath of the type into a coding key and choosing the particular key set. Also, encoding/ decoding code has a repeating call to codingKey(for keyPath, in keySet) that complicates the init(from:) and encode(to:) implementation and can be reduced.

First, we’ll extract helping code into helper objects. It will be enough to just use structures for this purpose:

private protocol CodingKeyContainable {

associatedtype Coding

var keySet: [PartialKeyPath<Coding>: String] { get }

}

private extension CodingKeyContainable {

func codingKey(for keyPath: PartialKeyPath<Coding>) throws -> StringKey {

guard let value = keySet[keyPath], let codingKey = StringKey(stringValue: value) else {

throw CodingError.keyNotFound

}

return codingKey

}

}

struct DecodingContainer<CodingType>: CodingKeyContainable {

fileprivate let keySet: [PartialKeyPath<CodingType>: String]

fileprivate let container: KeyedDecodingContainer<StringKey>

func decodeValue<PropertyType: Decodable>(for keyPath: KeyPath<CodingType, PropertyType>) throws -> PropertyType {

try container.decode(PropertyType.self, forKey: try codingKey(for: keyPath as PartialKeyPath<CodingType>))

}

}

struct EncodingContainer<CodingType>: CodingKeyContainable {

fileprivate let keySet: [PartialKeyPath<CodingType>: String]

fileprivate var container: KeyedEncodingContainer<StringKey>

mutating func encodeValue<PropertyType: Encodable>(_ value: PropertyType, for keyPath: KeyPath<CodingType, PropertyType>) throws {

try container.encode(value, forKey: try codingKey(for: keyPath as PartialKeyPath<CodingType>))

}

}

Protocol CodingKeyContainable just helps us to reuse key set retrieving code in both structures.

Now let’s define our own Decodable/Encodable-like protocols. This will allow us to hide all the boilerplate code for getting the proper key set and creating a decoder/encoder object inside of the default implementation of init(from:) and encode(to:) methods. On the other hand, it will allow us to simplify decoding/encoding the concrete values by using DecodingContainer and EncodingContainer structures we’ve defined above. Another important thing is that by using the protocols, we’ll also add the requirement of implementing:

static let keySet: [CodingType: [PartialKeyPath<UserInfo>: String]] by codable types for which we want to use the approach with variational keys.

Here are our protocols:

// MARK: - Key Sets

protocol VariableCodingKeys {

static var keySets: [CodingType: [PartialKeyPath<Self>: String]] { get }

}

private extension VariableCodingKeys {

static func keySet(from userInfo: [CodingUserInfoKey: Any]) throws -> [PartialKeyPath<Self>: String] {

guard let typeKey = CodingUserInfoKey.codingTypeKey,

let type = userInfo[typeKey] as? CodingType,

let keySet = Self.keySets[type] else {

throw CodingError.keySetNotFound

}

return keySet

}

}

// MARK: - VariablyDecodable

protocol VariablyDecodable: VariableCodingKeys, Decodable {

init(from decodingContainer: DecodingContainer<Self>) throws

}

extension VariablyDecodable {

init(from decoder: Decoder) throws {

let keySet = try Self.keySet(from: decoder.userInfo)

let container = try decoder.container(keyedBy: StringKey.self)

let decodingContainer = DecodingContainer<Self>(keySet: keySet, container: container)

try self.init(from: decodingContainer)

}

}

// MARK: - VariablyEncodable

protocol VariablyEncodable: VariableCodingKeys, Encodable {

func encode(to encodingContainer: inout EncodingContainer<Self>) throws

}

extension VariablyEncodable {

func encode(to encoder: Encoder) throws {

let keySet = try Self.keySet(from: encoder.userInfo)

let container = encoder.container(keyedBy: StringKey.self)

var encodingContainer = EncodingContainer<Self>(keySet: keySet, container: container)

try self.encode(to: &encodingContainer)

}

}

typealias VariablyCodable = VariablyDecodable & VariablyEncodable

Let’s now rewrite our UserInfo structure to make it conform to newly defined VariablyCodable protocol:

extension UserInfo: VariablyCodable {

static let keySets: [CodingType: [PartialKeyPath<UserInfo>: String]] = [

// for .plist stored locally

.local: [

\Self.userName: "USER_NAME",

\Self.email: "EMAIL",

\Self.age: "AGE"

],

// for JSON received from server

.remote: [

\Self.userName: "user_name",

\Self.email: "email_address",

\Self.age: "user_age"

]

]

init(from decodingContainer: DecodingContainer<UserInfo>) throws {

self.userName = try decodingContainer.decodeValue(for: \.userName)

self.email = try decodingContainer.decodeValue(for: \.email)

self.age = try decodingContainer.decodeValue(for: \.age)

}

func encode(to encodingContainer: inout EncodingContainer<UserInfo>) throws {

try encodingContainer.encodeValue(userName, for: \.userName)

try encodingContainer.encodeValue(email, for: \.email)

try encodingContainer.encodeValue(age, for: \.age)

}

}

This is where a true power of protocols comes. By conforming to VariablyCodable our type automatically becomes Codable. Moreover, without any boilerplate code, we now have the ability to use different sets of coding keys.

Going back to the advantages of the Codable protocol we outlined at the beginning of the article, let’s check which ones VariablyCodable has.

1) Type safety . Nothing changed here comparing to the Codable protocol. VariablyCodable protocol still uses concrete types without involving any dynamic type casting.

2) The simplicity of usage . Here we don’t have declarative style option with enum describing keys and values. We always have to implement init(from:) and encode(to:) methods. However, since the minimum implementation of the methods is so simple and straightforward (each line just decodes/encodes single property) that it is comparable to defining CodingKeys enum for the Codable protocol.

3) Extensibility . Here we have more abilities comparing to the Codable protocol. Additionally to the flexibility that can be achieved by implementing init(from:) and encode(to:) methods, we have also keySets map that provides an additional layer of abstraction of coding keys.

Summary

We defined two approaches to extend the behavior of the Codable protocol in Swift to be able to use a different set of keys for different data formats. The first approach implying separate types for each data format works well for simple cases when having two data formats and a single data flow direction (e.g. decoding only). However, if your app has multiple data sources and encodes/decodes arbitrarily between those formats you may stick to approach with VariablyCodable protocol. While it needs more code to be written at the beginning, once implemented, you will gain great flexibility and extensibility in coding/decoding data for any type you need .

written by

Andrii Biehunov

Software development

Leveraging AI to improve VIN recognition - how to accelerate and automate operations in the insurance industry

Here we share our approach to automatic Vehicle Identification Number (VIN) detection and recognition using Deep Neural Networks. Our solution is robust in many aspects such as accuracy, generalization, and speed, and can be integrated into many areas in the insurance and automotive sectors.

Our goal is to provide a solution allowing us to take a picture using a mobile app and read the VIN that is present in the image. With all the similarities to any other OCR application and common features, the differences are colossal.

Our objective is to create a reliable solution and to do so we jumped directly into analysis of the real domain images.

VINs are located in many places on a car and its parts. The most readable are those printed on side doors and windshields. Here we focus on VINs from windshields.

OCR doesn’t seem to be rocket science now, does it? Well, after some initial attempts, we realized we’re not able to use any available commercial tools with success, and the problem was much harder than we had thought.

How do you like this example of KerasOCR ?

Despite many details, like the fact that VINs don’t contain the characters ‘I’, ‘O’, ‘Q’, we have very specific distortions, proportions, and fonts.

Initial approach

How can we approach the problem? The most straightforward answer is to divide the system into two components:

VIN detection VIN recognition Cropping the characters from the big image Recognizing cropped characters

In the ideal world images like that:

Will be processed this way:

After we have the intuition how the problem looks like, we can we start solving it. Needless to say, there is no “VIN reading” task available on the internet, therefore we need to design every component of our solution from scratch. Let’s introduce the most important stages we’ve created, namely:

VIN detection
VIN recognition
Training data generation
Pipeline

VIN detection

Our VIN detection solution is based on two ideas:

Encouraging users to take a photo with VIN in the center of the picture - we make that easier by showing the bounding box.
Using Character Region Awareness for Text Detection (CRAFT) - a neural network to mark VIN precisely and be more error-prone.

CRAFT

The CRAFT architecture is trying to predict a text area in the image by simultaneously predicting the probability that the given pixel is the center of some character and predicting the probability that the given pixel is the center of the space between the adjacent characters. For the details, we refer to the original paper .

The image below illustrates the operation of the network:

Before actual recognition, it had sound like a good idea to simplify the input image vector to contain all the needed information and no redundant pixels. Therefore, we wanted to crop the characters’ area from the rest of the background.

We intended to encourage a user to take a photo with a good VIN size, angle, and perspective.

Our goal was to be prepared to read VINs from any source, i.e. side doors. After many tests, we think the best idea is to send the area from the bounding box seen by users and then try to cut it more precisely using VIN detection. Therefore, our VIN detector can be interpreted more like a VIN refiner.

It would be remiss if we didn’t note that CRAFT is exceptionally unusually excellent. Some say every precious minute communing with it is pure joy.

Once the text is cropped, we need to map it to a parallel rectangle. There are dozens of design dictions such as the affine transform, resampling, rectangle, resampling for text recognition, etc.

Having ideally cropped characters makes recognition easier. But it doesn’t mean that our task is completed.

VIN recognition

Accurate recognition is a winning condition for this project. First, we want to focus on the images that are easy to recognize – without too much noise, blur, or distortions.

Sequential models

The SOTA models tend to be sequential models with the ability to recognize the entire sequences of characters (words, in popular benchmarks) without individual character annotations. It is indeed a very efficient approach but it ignores the fact that collecting character bounding boxes for synthetic images isn’t that expensive.

As a result, we devaluated supposedly the most important advantage of the sequential models. There are more, but are they worth watching out all the traps that come with them?

First of all, training attention-based model is very hard in this case because of

As you can see, the target characters we want to recognize are dependent on history. It could be possible only with a massive training dataset or careful tuning, but we omitted it.

As an alternative, we can use Connectionist Temporal Classification (CTC) models that in opposite predict labels independently of each other.

More importantly, we didn’t stop at this approach. We utilized one more algorithm with different characteristics and behavior.

YOLO

You Only Look Once is a very efficient architecture commonly used for fast and accurate object detection and recognition. Treating a character as an object and recognizing it after the detection seems to be a definitely worth trying approach to the project. We don’t have the problem and there are some interesting tweaks that can allow even more precise recognition in our case. Last but not least, we are able to have a bigger control of the system as much of the responsibility is transferred from the neural network.

However, the VIN recognition requires some specific design of YOLO. We used YOLO v2 because the latest architecture patterns are more complex in areas that do not fully address our problem.

We use 960 x 32 px input (so images cropped by CRAFT are usually resized to meet this condition). Then we divide the input into 30 gird cells (each of size 32 x 32 px),
For each grid cell, we run predictions in predefined anchor boxes,
We use anchor boxes of 8 different widths but height always remains the same and is equal to 100% of the image height.

As the results came, our approach proved to be effective in recognizing individual characters from VIN.

Metrics

Appropriate metrics becomes crucial in machine learning-based solutions as they drive your decisions and project dynamic. Fortunately, we think simple accuracy fulfills the demands of a precise system and we can omit the research in this area.

We just need to remember one fact: a typical VIN contains 17 characters, and it’s enough to miss one of them to classify the prediction as wrong. At any point of work, we measure Character Recognition Rate (CER) to understand the development better. CERs at a level 5% (5% of wrong characters) may result in accuracy lower than 75%.

About the models tuning

It's easy to notice that all OCR benchmark solutions have much bigger effective capacity that exceeds the complexity of our task despite being too general as well at the same time. That itself emphasizes the danger of overfitting and directs our focus to generalization ability.

It is important to distinguish hyperparameters tuning from architectural design. Apart from ensuring information flow through the network extracts correct features, we do not dive into extended hyperparameters tuning.

Training data generation

We skipped one important topic: the training data.

Often, we support our models with artificial data with reasonable success but this time the profit is huge. Cropped synthetized texts are so similar to the real images that we suppose we can base our models on them, and only finetune it carefully with real data.

Data generation is a laborious, tricky job. Some say your model is as good as your data. It feels like the craving and any mistake can break your material. Worse, you can spot it as late as after the training.

We have some pretty handy tools in arsenal but they are, again, too general. Therefore we had to introduce some modifications.

Actually, we were forced to generate more than 2M images. Obviously, there is no point nor possibility of using all of them. Training datasets are often crafted to resemble the real VINs in a very iterative process, day after day, font after font. Modeling a single General Motors font took us at least a few attempts.

But finally, we got there. No more T’s as 1’s, V’s as U’s, and Z’s as 2’s!

We utilized many tools. All have advantages and weaknesses and we are very demanding. We need to satisfy a few conditions:

We need a good variance in backgrounds. It’s rather hard to have a satisfying amount of windshields background, so we’d like to be able to reuse those that we have, and at the same time we don’t want to overfit to them, so we want to have some different sources. Artificial backgrounds may not be realistic enough, so we want to use some real images from outside our domain,
Fonts, perhaps most important ingredients in our combination, have to resemble creative VIN’s fonts (who made them!?) and cannot interfere with each other. At the same time, the number of car manufacturers is much higher than our collector’s impulses, so we have to be open to unknown shapes.

The below images are the example of VIN data generation for recognizers:

Putting everything together

It’s the art of AI to connect so many components into a working pipeline and not mess it up.

Moreover, we have a lot of traps here. Mind these images:

VIN labels often consist of separated strings, two rows, logos and bar codes present near the caption.

90% of end-to-end accuracy provided by our VIN reader

Under one second solely on mid-quality CPU, our solution has over 90% of end-to-end accuracy.

This result depends on the problem definition and test dataset. For example, we have to decide what to do with the images that are impossible to read by a human. Nevertheless, not regarding the dataset, we approached human-level performance which is a typical reference level in Deep Learning projects.

We also managed to develop a mobile offline version of our system with similar inference accuracy but a bit slower processing time.

App intelligence

While working on the tools designed for business , we can’t forget about the real use-case flow. With the above pipeline, we’re absolutely unresistant to photos that are impossible to read, even though we want it to be. Often similar situations happen due to:

incorrect camera focus,
light flashes,
dirt surfaces,
damaged VIN plate.

Usually, we can prevent these situations by asking users to change the angle or retake a photo, before we send it to the further processing engines.

However, the classification of these distortions is a pretty complex task! Nevertheless, we implemented a bunch of heuristics and classifiers that allow us to ensure that VIN, if recognized, is correct. For the details, you have to wait for the next post.

Last but not least, we’d like to mention that, as usual, there are a lot of additional components built around our VIN Reader . Apart from a mobile application, offline on-device recognition, we’ve implemented remote backend, pipelines, tools for tagging, semi-supervised labeling, synthesizers, and more.

https://youtu.be/oACNXmlUgtY

written by

Daniel Bulanda

Software development

The path towards enterprise level AWS infrastructure – EC2, AMI, Bastion Host, RDS

Let’s pick up the thread of our journey into the AWS Cloud, and keep discovering the intrinsics of the cloud computing universe while building a highly available, secure and fault-tolerant cloud system on the AWS platform. This article is the second one of the mini-series which walks you through the process of creating an enterprise-level AWS infrastructure and explains concepts and components of the Amazon Web Services platform. In the previous part, we scaffolded our infrastructure; specifically, we created the VPC, subnets, NAT gateways, and configured network routing. If you have missed that, we strongly encourage you to read it first. In this article, we will build on top of the work we have done in the previous part, and this time we focus on the configuration of EC2 instances, the creation of AMI images, setting up Bastion Hosts, and RDS database.

The whole series comprises of:

Part 1 - Architecture Scaffolding (VPC, Subnets, Elastic IP, NAT).
Part 2 - The Path Towards Enterprise Level AWS Infrastructure – EC2, AMI, Bastion Host, RDS.
Part 3 - Load Balancing and Application Deployment (Elastic Load Balancer)

Infrastructure overview

The diagram below presents our designed infrastructure. If you would like to learn more about design choices behind it, please read Part 1 - Architecture Scaffolding (VPC, Subnets, Elastic IP, NAT) . We have already created a VPC, subnets, NAT Gateways, and configured network routing. In this part of the series, we focus on the configuration of required EC2 instances, the creation of AMI images, setting up Bastion Hosts, and the RDS database.

AWS theory

1. Elastic cloud compute cloud (EC2)

Elastic Cloud Compute Cloud (EC2) is an Amazon service that allows you to manage your virtual computing environments, known as EC2 instances, on AWS. An EC2 instance is simply a virtual machine provisioned with a certain amount of resources such as CPU, memory, storage, and network capacity launched in a selected AWS region and availability zone. The elasticity of EC2 means that you can scale up or down resources easily, depending on your needs and requirements. The network security of your instances can be managed with the use of security groups by the configuration of protocols, ports, and IP addresses that your instances can communicate with.

There are five basic types of EC2 instances, which you can use based on your system requirements.

General Purpose,
Compute Optimized,
Memory Optimized,
Accelerated Computing,
Storage Optimized.

In our infrastructure, we will use only general-purpose instances, but if you would like to learn more about different features of instance types, see the AWS documentation.

All EC2 instances come with instance store volumes for temporary data that is deleted whenever the instance is stopped or terminated, as well as with Elastic Block Store (EBS) , which is a persistent storage volume working independently of the EC2 instance itself.

2. Amazon Machine Images (AMI)

Amazon utilizes templates of software configurations, known as Amazon Machine Images (AMI) , in order to facilitate the creation of custom EC2 instances. AMIs are image templates that contain software such as operating systems, runtime environments, and actual applications that are used to launch EC2 instances. This allows us to preconfigure our AMIs and dynamically launch new instances on the go using this image instead of always setting up VM environments from scratch. Amazon provides some ready to use AMIs on the AWS Marketplace, which you can extend, customize, and save as your own (which we will do soon).

3. Key pair

Amazon provides a secure EC2 login mechanism with the use of public-key cryptography. During the instance boot time, the public key is put in an entry within ~/.ssh/authorized_keys , and then you can securely access your instance through SSH using a private key instead of a password. The public and private keys are known as a key pair.

4. IAM role

IAM means Identity and Access Management and it defines authentication and authorization rules for your system. IAM roles are IAM identities which comprise a set of permissions that control access to AWS services and can be attached to AWS resources such as users, applications, or services. As an example, if your application needs access to a specific AWS service such as an S3 Bucket, its EC2 instance needs to have a role with appropriate permission assigned.

5. Bastion Host

Bastion Host is a special purpose instance placed in a public subnet, which is used to allow access to instances located in private subnets while providing an increased level of security. It acts as a bridge between users and private instances, and due to its exposure to potential attacks, it is configured to withstand any penetration attempts. The private instances only expose their SSH ports to a bastion host, not allowing any direct connection. What is more, bastion hosts may be configured to log any activity providing additional security auditing.

6. Amazon Relational Database Service (RDS)

6.1. RDS

RDS is an Amazon service for the management of relational databases in the cloud. As of now (23.04.2020), it supports six database engines specifically Amazon Aurora, PostgreSQL, MySQL, MariaDB, Oracle Database, and SQL Server. It is easy to configure, scale and it provides high availability and reliability with the use of Read Replicas and Multi-AZ Deployment features.

6.2. Read replicas

RDS Read Replicas are asynchronous, read-only instances that are replicas of a primary “master” db instance. They can be used for handling queries that do not require any data change, thus reliving the workload from the master node.

6.3. Multi-AZ deployment

AWS Multi-AZ Deployment is an option to allow RDS to create a secondary, standby instance in a different AZ, and replicate it synchronously with the data from the master node. Both master and standby instances run on their own physically independent infrastructures, and only the primary instance can be accessed directly. The standby replica is used as a failover in case of any master’s failure, without changing the endpoint of your DB.

This reduces downtime of your system and makes it easier to perform version upgrades or create backup snapshots, as they can be done on the spare instance. Multi-AZ is usually used only on the master instance. However, it is also possible to create read replicas with Multi-AZ deployment, which results in a resilient disaster recovery infrastructure.

Practice

We have two applications that we would like to run on our AWS infrastructure. One is a Java 11 Spring Boot application, so the EC2 which will host it is required to have Java 11 installed. The second one is a React.js frontend application, which requires a virtual machine with a Node.js environment. Therefore, as the first step, we are going to set up a Bastion Host, which will allow us to ssh our instances. Then, we will launch and configure those two EC2 instances manually in the first availability zone. Later on, we will create AMIs based on those instances and use them for the creation of EC2s in the second availability zone.

1. Availability Zone A

1.1. Bastion Host

A Bastion Host is nothing more than a special-purpose EC2 instance. Hence, in order to create a Bastion Host, go into the AWS Management Console, and search for EC2 service. Then click the Launch Instance button, and you will be shown with an EC2 launch wizard. The first step is the selection of an AMI image for your instance. You can filter AMIs and select one based on your preferences. In this article, we will use the Amazon Linux 2 AMI (HVM), SSD Volume Type image.

On the next screen, we need to choose an instance type for our image. Here, I am sticking with the AWS free tier program, so I will go with the general-purpose t2.micro type. Click Next: Configure instance Details . Here, we can define the number of instances, network settings, IAM configuration, etc. For now, let’s start with 1 instance, we will work on the scalability of our infrastructure later. In the Network section, choose your previously created VPC and public-subnet-a and enable Public IP auto-assignment. We do not need to specify any IAM role as we are not going to use any of the AWS services.

Click Next . Here you can see that the wizard automatically configures your instance with an 8GB EBS storage, which is enough for us. Click Next again. Now, we can add tags to improve the recognizability of our instance. Let’s add a Name tag bastion-a-ec2 . On the next screen, we can configure a security group for our instance. Create a new security group, name it bastion-sg .

You can see that there is already one predefined rule exposing our instance for SSH sessions from 0.0.0.0/0 (anywhere). You should change it here to allow only connections from your IP address. The important thing to note here is that in the production environment you would never expose your instances to the whole world, instead, you would whitelist the IP addresses of employees allowed to connect to your instance.

In the next step, you can review your EC2 configuration and launch it. The last action is the creation of a key pair. This is important because we need this key pair to ssh to our instance. Name the key pair e.g. user-manager-key-pair , download the private key, and store it locally on your machine. This is it, Amazon will take some time, but in the end, your EC2 instance will be launched.

In the instance description section, you can find the public IP address of your instance. We can use it to ssh to the EC2. That is where we will need previously generated and hopefully locally saved private key (*.pem file). That’s it, our instance is ready for now. However, in production, it would be a good idea to harden the security of the Bastion Host even more. If you would like to learn more about that, we recommend this article .

1.2. Backend server EC2

Now, let’s create an instance for the backend server. Click Launch instance again, choose the same AMI image as before, place it in your user-manager-vpc, private-subnet-a, and do not enable public IP auto-assignment this time. Move through the next steps as before, add a server-a-ec2 name tag. In the security group configuration, create a new security group, and modify its settings to allow SSH incoming communication only from the bastion-sg .

Launch the instance. You can create a new key pair or use the previously created one (for simplicity I recommend using the same key pair for all instances). In the end, you should have your second instance up and running.

You can see that server-a-ec2 does not have any public IP address. However, we can access it through the bastion host. First, we need to add our key to a keychain and then we can ssh to our bastion host instance adding -A flag to the ssh command. This flag enables agent-forwarding, which will let you ssh into your private instance without explicitly specifying private key again. This is a recommended way, which lets you avoid storage of the private key on the bastion host instance which could lead to a security breach.

ssh-add -k

ssh -A -i path-to-your-pem-file ec2-user@bastion-a-ec2-instance-public-ip

Then, inside your bastion host execute the command:

ssh ec2-user@server-a-ec2-instance-private-ip

Now, you should be inside your server-a-ec2 private instance. Let’s install the required software on the machine by executing those commands:

sudo yum update -y &&

sudo amazon-linux-extras enable corretto8 &&

sudo yum clean metadata &&

sudo yum install java-11-amazon-corretto &&

java --version

As a result, you should have java 11 installed on your server-a-ec2 instance. You can go back to the local command prompt by executing the exit command twice.

AMI

The ec2 instance for the backend server is ready for the deployment. In the second availability zone, we could follow exactly the same steps. However, there is an easier way. We can create an AMI image based on our pre-configured instance and use it later for the creation of the corresponding instance in availability zone b. In order to do that, go again into the Instances menu, select your instance, click Actions -> Image -> Create image . Your AMI image will be created and you will be able to find it in the Images/AMIs section.

1.3. Client application EC2

The last EC2 instance we need in the Availability Zone A will host the client application. So, let’s go once again through the process of EC2 creation. Launch instance, select the same base AMI as before, select your VPC, place the instance in the public-subnet-a , and enable public IP assignment. Then, add a client-a-ec2 Name tag, and create a new security group client-sg allowing SSH incoming connection from the bastion-sg security group. That’s it, launch it.

Now, SSH to the instance through the bastion host, and install the required software.

ssh -A -i path-to-your-pem-file ec2-user@bastion-a-ec2-instance-public-ip

Then, inside your bastion host execute the command:

ssh -A -i path-to-your-pem-file ec2-user@bastion-a-ec2-instance-public-ip

Inside client-a-ec2 command prompt, execute :

sudo yum update &&

curl -sL https://rpm.nodesource.com/setup_12.x | sudo bash - &&

sudo yum install -y nodejs &&

node -v &&

npm -v

Exit the EC2 command prompt and create a new AMI image based on it.

2. Availability Zone B

2.1. Bastion Host

Create the second bastion host instance following the same steps as for availability zone a, but this time place it in public-subnet-b , add Name tag bastion-b-ec2 , and assign to it previously created bastion-sg security group.

2.2. Backend server EC2

For the backend server EC2, go again to the Launch Instance menu, and this time instead of using Amazon’s AMI switch to My AMI’s tab and select the previously created server-ami image. Place the instance in the private-subnet-b , add a name tag server-b-ec2 , and assign to it the server-sg security group.

2.3. Client application EC2

Just as for the backend server instance, launch the client-b-ec2 using your custom AMI image. This time select the client-ami image, place EC2 in the public-subnet-b , enable automatic IP assignment, and choose the client-sg security group.

3. RDS

We have all our EC2 instances ready. The last part which we will cover in this article is the configuration of RDS. For that, go into the RDS service in the AWS Management Console and click Create database. In the database configuration window, follow the standard configuration path. Select MySQL db engine, and select Free tier template. Set your db name as user-manager-db , specify master username and password, select your user-manager-vpc , availability zone a, and make the database publicly not accessible. Create also a new user-manager-db-sg security group.

In the Additional configuration section, specify the initial db name, and finally create a database.

After AWS finishes the creation process, you will be able to get the database endpoint, which we will use to connect to the database from our application later on. Now, in order to provide high availability of the database, click the Modify button on the created database screen, and enable Multi-AZ deployment. Please, bear in mind that Multi-AZ deployment is not included in the free tier program, so if you would like to avoid any charges, skip this point.

As the last step, we need to add a rule to the user-manager-db-sg to allow incoming connections from our server-sg on port 3306 in order to allow communication between our server and the database.

EC2, AMI, Bastion Host, RDS - Summary

Congratulations, our infrastructure is almost ready for deployment. As you can see in our final diagram, the only thing which is missing is the load balancer. In the next part of the series, we will take care of that, and deploy our applications to have a fully functioning system running on AWS infrastructure!

Sources:

written by

Grape up Expert

Software development

The path towards enterprise level AWS infrastructure – architecture scaffolding

This article is the first one of the mini-series which will walk you through the process of creating an enterprise-level AWS infrastructure. By the end of this series, we will have created an infrastructure comprising a VPC with four subnets in two different availability zones with a client application, backend server, and a database deployed inside. Our architecture will be able to provide scalability and availability required by modern cloud systems. Along the way, we will explain the basic concepts and components of the Amazon Web Services platform. In this article, we will talk about the scaffolding of our architecture to be specific a Virtual Private Cloud (VPC), Subnets, Elastic IP Addresses, NAT gateways, and route tables. The whole series comprises of:

Part 1 - Architecture Scaffolding (VPC, Subnets, Elastic IP, NAT)
Part 2 - The Path Towards Enterprise Level AWS Infrastructure – EC2, AMI, Bastion Host, RDS
Part 3 - Load Balancing and Application Deployment (Elastic Load Balancer)

The cloud, as once explained in the Silicon Valley tv-series, is “this tiny little area which is becoming super important and in many ways is the future of computing.” This would be accurate, except for the fact that it is not so tiny and the future is now. So let’s delve into the universe of cloud computing and learn how to build highly available, secure and fault-tolerant cloud systems, how to utilize the AWS platform for that, what are its key components and how to deploy your applications on AWS.

Cloud computing

Over the last years, the IT industry underwent a major transformation in which most of the global enterprises moved away from their traditional IT infrastructures towards the cloud. The main reason behind that is the flexibility and scalability which comes with cloud computing, understood as provisioning of computing services such as servers, storage, databases, networking, analytic services, etc. over the Internet ( the cloud ). In this model organizations only pay for the cloud resources they are actually using and do not need to manage the physical infrastructure behind it. There are many cloud platform providers on the market with the major players being Amazon Web Services (AWS), Microsoft Azure and Google Cloud. This article focuses on services available on AWS, but bear in mind that most of the concepts explained here will have their equivalents on the other platforms.

Infrastructure overview

Let’s start with what we will build throughout this series. The goal is to create a real-life, enterprise-level AWS infrastructure that will be able to host a user management system consisting of a React.js web application, Java Spring Boot server and a relational database.

The architecture diagram is shown in figure 1. It comprises a VPC with four subnets (2 public and 2 private) distributed across two different availability zones. In public subnets are hosted a client application, a NAT gateway and a Bastion Host (more on that later), while our private subnets contain backend server and database instances. The infrastructure also includes Internet Gateway to enable access to the Internet from our VPC and a Load Balancer. The reasoning behind placing the backend server and database in private subnets is to protect those instances from being directly exposed to the Internet as they may contain sensitive data. Instead, they will only have private IP addresses and be behind a NAT gateway and a public-facing Elastic Load Balancer. Presented infrastructure provides a high level of scalability and availability through the introduction of redundancy with instances deployed in two different availability zones and the use of auto-scaling groups which provide automatic scaling and health management of the system.

Figure 2 presents the view of the user management web application system we will host on AWS:

The applications can be found on GitHub.

In this part of the article series, we will focus on the scaffolding of the infrastructure, namely allocating elastic IP addresses, setting up the VPC, creating the subnets, configuring NAT gateways and route tables.

AWS Free Tier Note

AWS provides its new users with a 12-month free tier, which gives customers the ability to use their services up to specified limits free of charge. Those limits include 750 hours per month of t2.micro size EC2 instances, 5GB of Amazon S3 storage, 750 hours of Amazon RDS per month, and much more. In the AWS Management Console, Amazon usually provides indicators in which resource choices are part of the free tier, and throughout this series, we will stick to those. If you want to be sure you will not exceed the free tier limits, remember to stop your EC2 and RDS instances whenever you finish working on AWS. You can also set up a billing alert that will notify you if you exceed the specified limit.

AWS theory

1. VPC

The first step of our journey into the wide world of the AWS infrastructure is getting to know Amazon Virtual Private Cloud (VPC). VPC allows developers to create a virtual network in which they can launch resources and have them logically isolated from other VPCs and the outside world. Within the VPC your resources have private IP addresses with which they can communicate with one another. You can control the access to all those resources inside the VPC and route outgoing traffic as you like.

Access to the VPC is configured with the use of several key structures:

Security groups - They basically work like mini firewalls defining allowed incoming and outgoing IP addresses and ports. They can be attached at the instance level, be shared among many instances and provide the possibility to allow access from other security groups instead of IPs.

Routing tables - Routing tables are responsible for determining where the network traffic from a subnet or gateway should be directed. There is a main route table associated with your VPC, and you can define custom routing tables for your subnets and gateways.

Network Access Control List (Network ACL) - It acts as an IP filtering table for incoming and outgoing traffic and can be used as an additional security layer on top of security groups. Network ACLs act similarly to the security groups, but instead of applying rules on the instance level, they apply them to the entire VPC or subnet.

2. Subnets

Instances cannot be launched directly into a VPC. They need to live inside subnets. A Subnet is an additional isolated area that has its own CIDR block, routing table, and Network Access Control List. Subnets allow you to create different behaviors in the same VPC. For instance, you can create a public subnet that can be accessed and have access to the public internet and a private subnet that is not accessible through the Internet and must go through a NAT (Network Address Translation) gateway in order to access the outside world.

3. NAT (Network Address Transfer) gateway

NAT Gateways are used in order to enable instances located in private subnets to connect to the Internet or other AWS services, while still preventing direct connections from the Internet to those instances. NAT may be useful for example when you need to install or upgrade software or OS on EC2 instances running in private subnets. AWS provides a NAT gateway managed service which requires very little administrative effort. We will use it while setting up our infrastructure.

4. Elastic IP

AWS provides a concept of Elastic IP Address which is used to facilitate the management of dynamic cloud computing. Elastic IP Address is a public, static IP Address that is associated with your AWS account and can be easily allocated to one of your EC2 instances. The idea behind it is that the address is not strongly associated with your instance but instead elasticity of the address allows in a case of any failure in the system to swiftly remap the address to another healthy instance in your account.

5. AWS Region

AWS Regions are geographical areas in which AWS has data centers. Regions are divided into Availability Zones (AZ) which are independent data centers placed relatively close to each other. Availability Zones are used to provide redundancy and data replication. The choice of AWS region for your infrastructure should be determined to take into account factors such as:

Proximity - you would usually want your application to be deployed close to your region of operation for latency or regulatory reasons.
Cost - different regions come with different pricing.
Feature selection - not all services are available in all regions, this is especially the case for newly introduced features.
Several availability zones - all regions have at least 2 AZ, but some of them have more. Depending on your needs, this may be a key factor.

Practice

AWS Region

Let’s commence with a selection of the AWS region to operate in. In the top right corner of the AWS Management Console, you can choose a region. At this point, it does not really matter which region you choose (as discussed earlier, it may for your organization). However, it is important to note that you will always only view resources launched in the currently selected region.

Elastic IP

The next step is the allocation of an elastic IP address. For that purpose, go into the AWS Management console, and find the VPC service. In the left menu bar, under the Virtual Private Cloud section, you should see the Elastic IPs link. There you can allocate a new address owned by yourself or from the pool of Amazon’s available addresses.

Availability Zone A configuration

Next, let’s create our VPC and subnets. For now, we are going to set up only Availability Zone A and we will work on High Availability after the creation of the VPC. So go again into the VPC service dashboard and click the Launch VPC Wizard button. You will be taken to the screen where you can choose what kind of a VPC configuration you want Amazon to set you up with. In order to match our target architecture as closely as possible, we are going to choose VPC with Public and Private Subnets .

The next screen allows you to set up your VPC configuration details such as:
- name,
- CIDR block,
- details of the subnets:
- name,
- IP address range - a subset of the VPC CIDR range,
- availability zone,

As shown in the architecture diagram (fig. 1), we need 4 subnets in 2 different availability zones. So let’s set our VPC CIDR to 10.0.0.0/22, and have our subnets as follows:

- public-subnet-a: 10.0.0.0/24 (zone A)
- private-subnet-a: 10.0.1.0/24 (zone A)
- public-subnet-b: 10.0.2.0/24 (zone B)
- private-subnet-b: 10.0.3.0/24 (zone B)

Set everything up as shown in figure 7. The important aspects to note here are the choice of the same availability zone for public and private subnets, and the fact that Amazon will automatically set us up with a NAT gateway for which we just need to specify our previously allocated Elastic IP Address. Now, click the Create VPC button, and Amazon will configure your VPC.

NAT gateway

When the creation of the VPC is over, go to the NAT Gateways section, and you should see the gateway created for you by AWS. To make it more recognizable, let us edit its Name tag to nat-a .

Route tables

Amazon also configured Route Tables for your VPC. Go to the Route Tables section, and you should have there two route tables associated with your VPC. One of them is the main route table of your VPC, and the second one is currently associated with your public-subnet-a. We will modify that setting a bit.

First, select the main route table, go to the routes tab and click Edit routes . There are currently two entries. The first one means Any IP address referencing local VPC CIDR should resolve locally and we shouldn’t modify it. The second one is pointing to the NAT gateway, but we will change it to configure the Internet Gateway of our VPC in order to let outgoing traffic reach the outside world.

Next, go to the Subnet Associations tab and associate the main route table with public-subnet-a. You can also edit its Name tag to main-rt . Then, select the second route table associated with your VPC, edit its routes to route every outgoing Internet request to the nat-a gateway as shown in figure 10. Associate this route table with private-subnet-a and edit its Name tag to private-a-rt .

Availability Zone B Configuration

Well done, availability zone A is configured. In order to provide High Availability, we need to set everything up in the second availability zone as well. The first step is the creation of the subnets. Go again to a VPC dashboard in the AWS management console and in the left menu bar find the Subnets section. Now, click the Create subnet button and configure everything as shown in figures 11 and 12.

public-subnet-b

private-subnet-b

NAT gateway

For availability zone B we need to create the NAT gateway manually. For that, find the NAT Gateways section in the left menu bar of the VPC dashboard, and click Create NAT Gateway . Select public-subnet-b , allocate EIP and add a Name tag with value nat-b .

Route tables

The last step is the configuration of the route tables for the subnets in availability zone B. For that, go to the Route Tables section again. Our public-subnet-b is going to have the same routing rules as the public-subnet-a, so let’s add a new association to our main-rt table for public-subnet-b. Then, click the Create route table button, name it private-b-rt , choose our VPC and click create . Next, select the newly created table go to the Routes tab and Edit routes by analogy with the private-a-rt table, but instead of directing every outside going request to nat-a gateway route it to nat-b (fig. 13).

In the end, you should have three route tables associated with your VPC as shown in figure 14.

Summary

That’s it, the scaffolding of our VPC is ready. The diagram shown in fig.15 presents a view of the created infrastructure. It is now ready for the creation of required EC2 instances, Bastion Hosts, configuration of an RDS database and deployment of our applications, which we will do in the next part of the series .

Sources:

written by

Grape up Expert

Our experts

How to ensure business continuity and workplace readiness in case of unexpected events

While the ongoing COVID-19 outbreak is affecting millions of people and causing numerous disruptions to the global economy, technology companies can undertake significant steps to assure business continuity for their employees and stakeholders. This demanding period is also a validation of company policies and may lead to continuous changes in the way we work and run projects.

When the whole world stops to narrow down the spread of COVID-19 and various industries suffer due to the lockdown, the technology companies should focus on providing its services in order to help those who are on the front line of the crisis and help the global economy recover to avoid unpleasant consequences of the pandemic. Now is the time that verifies strategies and preparation for working entirely in a remote mode, often without physical access to the office buildings, and at the same time delivering services at the highest level.

We share with you what we have done to prepare for the situation when our entire team has to work remotely and deliver services for companies located globally. We asked several of our colleagues - from IT and people operations to project managers and developers - how they contributed to business continuity planning and what it is like to work from home these days. And while the outbreak is a serious danger, we have to learn from the entire situation and do the homework to minimize issues in the future as no one can guarantee that something similar won't happen again.

Providing a toolset, remote access, and office coordination while the entire team is distributed

The last three weeks have shown that agile companies, building distributed teams, and using cloud technologies with distant access to proper tools are able to adjust to the fully remote model of work much easier. The current emergency cut down the numerous discussions questioning the necessity of moving enterprises to the cloud, providing employees with mobile workstations, planning scenarios anticipating a period when a company has to operate independently without physical access to the infrastructure located in headquarters. Those of businesses that have embraced that strategic business continuity plan avoided chaotic operations and distractions in service delivery.

Here we dive into the list of things necessary to guarantee the business going forward:

As a fully equipped workstation that enables employees to work effectively and focus on their tasks seems obvious, it becomes more critical where you have to back up developers and designers with highly performing devices needed to run more sophisticated software. So whenever you are planning your purchasing, take into account that the devices you’re buying may have to be used for weeks in domestic conditions.

To make sure that members of your team can smoothly move to remote work and communicate flawlessly with their peers and your customers, you should use tools accelerating collaboration and simplifying access to other people. Our typical tool gear consists of Slack, Zoom, Dropbox, Office 365 - including remote access to the mailbox, and Jira. It can be developed accordingly to a given team's needs.

Nowadays, we have to prepare to onboard and gear up our employees remotely. How does it work at Grape Up? We send a full package that consists of a laptop with the entire system configured and equipped with access to VPN and tools needed to start the job, headphones, a monitor, a keyboard.

VPN is now obligatory in order to allow everyone at the company to use databases, internal systems, network drives, and knowledge management platforms. As many people need flexibility in access to these resources, it’s highly recommended to use VPN on a daily basis, verify how it works and avoid thinking about it as something needed only in emergency circumstances - since now it’s a new normal. Among other important advantages, VPN helps your company with security, under the condition you manage access properly and monitor in case of any tries of attack.

Current circumstances and uncertainty may lead to growth in scams and phishing. And while VPN and used technologies increase our safety, we have to remember that proper communication can enhance security even more. It’s your job to make everyone aware of what they may face and how to treat it.

To sum up, in order to ensure that your business operations and service delivery perform impeccably in case of emergency you are obligated to prepare your company to work without physical access to your headquarters. It’s also fundamental to protect your business with the right backups in case of the worst scenarios.

And here appears one of the most challenging things - a human factor - make sure that your firm applied the right policy that tears down silos and assure that in case someone is unreachable or in emergence that there is a person with knowledge and accesses that can substitute that role.

Office management in a remote mode

How shifting to work from home impacts office management? In modern and agile organizations office coordination is often done remotely as many teams run projects in various locations. A situation like this happening right now shows that it is essential to build solutions that mean to provide your employees with mobile and flexible workstations. Being responsible for office management in a time of going fully remote means ensuring that every workspace is safe and well protected in case of any fraud trying to take advantage of the demanding circumstances.

By coordinating all the supplies and reducing things that are not needed when the whole team won’t be on-site for an unknown period of time, a company can gain some impressive cost savings. It is also important to have a plan to make all your workplaces ready to be opened when the situation changes so your employees could easily get back on the right track.

People operations, continuous engagement and remote learning

Security and taking care of the entire team is the number one priority. In business that can be easily run remotely, working from home is the best-case scenario. Companies that create a culture that empowers people to work independently, values open communication through various channels and encourages to be engaged even while the conditions are challenging, can avoid distraction in services.

How do we do it at Grape Up? Our company's culture is built on openness and collaboration - we value our weekly Lunch & Learn sessions designed to grow together and share some time on building relationships. The key here comes to thinking about it as a long-term process, no as a scenario for a demanding time.

While working remotely and willing to develop their skills, employees need well-documented resources - internal wiki, tutorials, guides, and knowledgebases. We at Grape Up promote learning by dealing with real problems together and the approach “try and I will assist you” over “I will tell you how to do it”. Our people continue helping each other in skills development, even when pair programming is done from distance.

Key to business continuity - service delivery management

Project managers, Product Owners, Scrum Masters and Service Delivery Managers play a vital role in providing business continuity and ensuring that customers are satisfied with the services, projects develop in the right direction, and the whole team is engaged yet have all the tools to work comfortably.

According to leaders of our project teams, their job, more than ever, comes to making sure that everyone is on the same page. How do they achieve it? By simplifying communication. Following the progress and letting everyone know how things stand during daily calls help to sustain engagement and chase common goals. But it's also important to do it carefully - spending a visible part of a day on calls and video meetings may lead to the opposite effect.

So when many things are similar to the typical working routine, what has changed? Pair programming is quite challenging now. To deal with it, we have worked out some kind of trade-off; half of a day work in pairs (of course remotely) and the second part by themselves.

What is often emphasized by our management team; the situation requires more empathy and understanding both for customers and colleagues. Many people feel confused and some may be affected or feel overwhelmed - it’s extremely important to be honest, informing about possible obstacles and inconveniences to improve what is possible and anticipate potential difficulties.

Coding from home

Working from home and being responsible for providing services that are crucial for many companies to exist, is nothing new to the development teams. What do they need to focus on building solutions that empower the entire industry to move forward?

First of all - a company that intends to perform well in a remote, distributed mode has to start with creating a culture that supports collaborative relationships between members of the projects and representatives of a customer. Understanding, trust, and open communication are the credentials of every fruitful cooperation. It’s extremely valuable when you cannot work face to face and take some time to get to know each other in a typical environment.

This leads us to the second thing - engagement. Teams that value creativity and encourage people to care about projects and motivate others to be active in chasing project goals can achieve impressive results even if the circumstances are difficult and communication among members is limited to the online channels.

In terms of the highly demanding situations, being responsive and always open to help your customers, both with planned tasks and with extraordinary issues, is something that builds a special bond and gives your business partners confidence that you assist their teams even when things are getting worse.

While working remotely, communication that enables asking questions and diving into some complicated topics is the most effective way to avoid misunderstanding, especially when it comes to task requirements and problem analysis. The role of a company leader should be focused on building a culture that supports dialogue and transparency - it has never been more important to talk about challenges, faced issues, and daily work. Every member of a team can help with making work more effective when sharing their experiences.

Along with the set of tools described above, the development teams can utilize two extremely useful apps; Pointing poker - browser extension to estimate task performance and Mural to create a table of good and bad experience during a retrospective.

It’s time for agile companies to provide business continuity and help the global economy recover

By moving to a remote work mode we can all help our authorities in fighting with the spread of COVID-19. The safety of employees and their families is a priority for the enterprises that feel responsible for people who build their organizations. This crisis reshapes the global economy and affects numerous industries. Agile companies that are designed to easily adjust to the changing conditions and can provide business continuity during difficult times, empower their partners to mitigate the struggles and recover.

written by

Szymon Kozak

Stay updated with our newsletter

Subscribe for fresh insights and industry analysis.