DockIT: Automatic dockerization

Docker and Dockerization

Docker® is a tool that creates lightweight, portable and self-sufficient containers for applications, allowing them to run on another infrastructure, such as a cloud. Dockerizing an application, or an entire information system, means converting it into a Docker container image, to run within a Docker container. A Docker container image is a stand-alone software package including everything needed to run its application: code, run-time, tools, libraries, settings.

Dockerization is described by a Dockerfile, a text document that contains all the commands needed to assemble a Docker container image. To deploy the application on another infrastructure, only the image is needed.

Dockerizing requires knowledge of the internal structure of the application, or of the set of applications to dockerize. It requires knowing which programs are running, their data files, their various connections and so on. However, a strategic application may have a long technical history, made of many different technologies, undocumented details and complex features. Estimating the cost of dockerization is therefore difficult, because it needs expert knowledge in many technical domains.

DockIT

DockIT is an open source tool from Primhill Computer that helps convert legacy applications into Docker container images. It is a Python command-line tool which monitors running applications and their inputs/outputs, and generates a Dockerfile containing all the commands needed by Docker to build the container image of these applications.

DockIT does not need the source code. It can work with any binary program, whatever its programming language. It detects all resources, languages and libraries created or accessed by an application and any of its sub-components, either during the execution of a command or when attaching to a running batch, without stopping it.

The kinds of resources detected include, for example:

  • Log files, which are classified with a hint of where they should be moved. The same applies to stdout and stderr.
  • Concurrent access to the same files by several processes.
  • Hidden configuration files and undocumented data files are also spotted.
  • Software libraries and dependencies.
  • Sub-processes created, and their executables. This inspection applies recursively to spawned processes.
  • Internal, "private" environment variables.
  • Network sockets and port numbers.
  • Content of IO buffers. For example, DockIT can detect and parse SQL queries sent to database servers.
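
As an illustration of how such resources can be read from a system call trace, here is a minimal Python sketch. It assumes trace lines in the usual strace text format and is not DockIT's actual parser: the names SYSCALL_RE, parse_line and opened_files are hypothetical, and only successfully opened files are collected.

```python
import re

# Matches strace lines such as:
#   1234 openat(AT_FDCWD, "/etc/hosts", O_RDONLY|O_CLOEXEC) = 3
# The leading pid is only present when strace follows sub-processes (-f).
SYSCALL_RE = re.compile(
    r'^(?:(?P<pid>\d+)\s+)?'             # optional process id
    r'(?P<name>\w+)\((?P<args>.*)\)'     # system call name and raw arguments
    r'\s+=\s+(?P<ret>-?\d+)'             # numeric return value
)
QUOTED_RE = re.compile(r'"([^"]*)"')


def parse_line(line):
    """Return (pid, syscall, first quoted argument, return value), or None."""
    match = SYSCALL_RE.match(line.strip())
    if match is None:
        return None  # signals, exit notices, unfinished calls and so on
    quoted = QUOTED_RE.search(match.group("args"))
    pid = int(match.group("pid")) if match.group("pid") else None
    return (
        pid,
        match.group("name"),
        quoted.group(1) if quoted else None,
        int(match.group("ret")),
    )


def opened_files(lines):
    """Collect the paths successfully opened, grouped by process id."""
    files = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        pid, name, path, ret = parsed
        # A negative return value means the call failed (for example ENOENT).
        if name in ("open", "openat") and path is not None and ret >= 0:
            files.setdefault(pid, set()).add(path)
    return files
```

The same pattern extends to other calls, for example connect for sockets or execve for sub-processes, by inspecting the name field.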

Benefits of using DockIT

Neither documentation nor technical expertise is needed any longer to obtain an accurate description of all IT resources required by the target application. The result is a standard, documented Dockerfile, with extra information added as comments. This file can be updated to adjust to specific needs.

You do not have to spend days studying an application to know what is needed to dockerize it. You just need to monitor it briefly with DockIT:

  • Many tedious tasks are done automatically: enumerating libraries, modules and so on is done in one go. One can then focus on the really difficult tasks of dockerizing an application.
  • DockIT analyses what the code actually does, not only what a static analysis suggests. It therefore lists only the libraries and code modules that are actually required, so the Docker images can be much smaller.
  • It is no longer necessary to be skilled in all technologies used by the application: DockIT can examine programs written in any language, because it only relies on system calls. Proprietary libraries without source code are not an obstacle for reverse-engineering.
  • DockIT can formalize the behavior of hidden scripts that cannot be detected with static code inspection. Scripts started dynamically from a sub-process are inspected along with their resources.

When a user is discovering a legacy application, DockIT gives the significant advantage of showing the overall scenario of its execution. This makes software design recovery much simpler.

Scenarios

DockIT is a command-line tool which can be used in two scenarios:

  • Executing a command: it then inspects each and every call to a system function, from the started process and any of its sub-processes. It stops when the command naturally ends.
  • Attaching to a running process, with the same behavior. It quits when Control-C is typed on the console.

Although DockIT brings a noticeable slow-down to the target process execution, it is still usable in a production context, because only some system calls are monitored.
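
The two scenarios map naturally onto the two ways of starting a tracer such as strace. The Python sketch below builds the strace argument list for each case. It is only an illustration and not necessarily how DockIT invokes its tracer: the function name build_strace_command, its parameters and the default log file name are hypothetical. The strace options used are -f (follow sub-processes), -tt (time stamps), -o (output file), -e trace= (restrict the monitored system calls) and -p (attach to a running process).

```python
def build_strace_command(command=None, pid=None, log_path="trace.log", syscalls=None):
    """Build an strace argument list for one of the two scenarios.

    Exactly one of `command` (a list of strings) or `pid` (an integer)
    must be given. `syscalls` optionally restricts the traced system calls.
    """
    if (command is None) == (pid is None):
        raise ValueError("give either a command to run or a pid to attach to")

    # -f follows sub-processes, -tt adds time stamps, -o writes to a log file.
    argv = ["strace", "-f", "-tt", "-o", log_path]

    if syscalls:
        # Monitoring only some system calls limits the slow-down.
        argv += ["-e", "trace=" + ",".join(syscalls)]

    if pid is not None:
        argv += ["-p", str(pid)]   # attach to a running process
    else:
        argv += list(command)      # start the command under the tracer
    return argv
```

The resulting list can be passed to subprocess.run. When attaching with -p, strace detaches on Control-C and leaves the target process running.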

There are many command-line options:

  -h,--help                      This message.
  -v,--verbose                   Verbose mode (Cumulative).
  -w,--warning                   Display warnings (Cumulative).
  -s,--summary <CIM class>       Prints a summary at the end: Start and end time stamps, executable name,
                                 loaded libraries, read/written/created files and timestamps, subprocesses tree.
                                 Examples:
                                 -s 'Win32_LogicalDisk.DeviceID="C:",Prop1="Value1",Prop2="Value2"'
                                 -s 'CIM_DataFile:Category=["Others","Shared libraries"]'
  -D,--dockerfile                Generates a dockerfile.
  -p,--pid <pid>                 Monitors a running process instead of starting an executable.
  -f,--format TXT|CSV|JSON       Output format. Default is TXT.
  -F,--summary-format TXT|XML    Summary output format. Default is XML.
  -i,--input <file name>         trace command input file.
  -l,--log <filename prefix>     trace command log output file.
  -t,--tracer strace|ltrace|cdb  command for generating trace log
  -S,--server <Url>              Survol url for CIM objects updates.
                                 Ex: http://127.0.0.1:80/survol/event_put.py

The execution outputs are:

  • A Dockerfile plus all files needed to build a Docker image.
  • A log file that allows the target process execution to be reproduced.
  • An XML file describing all IT resources and their relations: which process created which file, when, and so on.
  • A text file summarizing the execution blocks.
  • Events sent to a Survol server, for real-time visualisation.

Dockerfile generation

DockIT is able to generate a Dockerfile skeleton out of any execution of a process. Depending on the target batch or process, this skeleton can be just a draft or a complete enumeration of the resources to dockerize. Because it enumerates all used resources at the lowest possible level, that of system calls, no resource used during the monitored execution is missed. On the other hand, it might misinterpret some resource usage, one reason being that it does not have the source code. Despite this, the exhaustive list of resources, properly catalogued in a Dockerfile, makes dockerization much easier and more reliable. For example, DockIT provides:

  • Enumerating TCP/IP port numbers.
  • Enumerating files: log files are detected as such, classified, and given a hint of where they should be moved. It is also able to detect which files are used by several processes.
  • Used libraries and only these ones.
  • Subprocesses created.
  • Environment variables internally used.
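
To make the idea concrete, the following Python sketch turns a list of detected resources into Dockerfile text. It is a simplified illustration under stated assumptions, not DockIT's generator: the function render_dockerfile and its parameters are hypothetical, the base image has to be chosen by the user, and copied files are assumed to have been placed in the build context under the same relative path.

```python
import json


def render_dockerfile(base_image, ports=(), env=None, files=(), command=()):
    """Render a Dockerfile skeleton from detected resources.

    base_image: image name chosen by the user (cannot be detected).
    ports:      listening TCP/IP port numbers.
    env:        environment variables actually read by the application.
    files:      absolute paths of files the application needs.
    command:    the traced command, as a list of strings.
    """
    lines = ["FROM " + base_image]

    for name, value in sorted((env or {}).items()):
        lines.append('ENV {}="{}"'.format(name, value))

    for path in sorted(files):
        # Assumption: the file sits in the build context under the same
        # path, minus the leading slash.
        lines.append("COPY {} {}".format(path.lstrip("/"), path))

    for port in sorted(ports):
        lines.append("EXPOSE {}".format(port))

    if command:
        # The exec form of CMD is a JSON array of strings.
        lines.append("CMD " + json.dumps(list(command)))

    return "\n".join(lines) + "\n"


print(render_dockerfile(
    "python:3.11-slim",
    ports=[8080],
    env={"APP_MODE": "batch"},
    files=["/opt/app/run.py"],
    command=["python", "/opt/app/run.py"],
))
```

A skeleton produced this way is a starting point. Choices such as the base image, or whether a file belongs in the image or on a mounted volume, remain manual decisions.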

Once resources are properly identified, some manual adjustments to another infrastructure, such as a grid, are possible. For example:

  • Because log files are identified as such, the right strategy to handle logs can be applied: Inside / outside the container, centralized logging system etc.
  • SQL queries are identified, hinting at SQL database connections which have to be redirected.
  • File accesses can be redirected to a file server. File statistics (accesses versus volume) help choose the right storage hardware, such as SSD.
  • Parallelism is exposed, helping hardware deployment.
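
As an example of the file statistics mentioned above, the Python sketch below counts accesses and transferred volume per file, then lists the files dominated by small transfers. It is only an illustration: the input format (one path and byte count per read or write call), the function names and the 4096-byte threshold are assumptions, not DockIT output.

```python
from collections import defaultdict


def file_statistics(events):
    """Aggregate (path, byte_count) events into per-file access statistics."""
    stats = defaultdict(lambda: {"accesses": 0, "bytes": 0})
    for path, byte_count in events:
        stats[path]["accesses"] += 1
        stats[path]["bytes"] += byte_count
    return dict(stats)


def small_io_files(stats, max_average=4096):
    """List files whose average transfer size is at most `max_average` bytes.

    Many small accesses suggest latency-sensitive storage; the threshold
    is an arbitrary value chosen for illustration.
    """
    return sorted(
        path
        for path, entry in stats.items()
        if entry["bytes"] / entry["accesses"] <= max_average
    )
```

Files with many small accesses are plausible candidates for low-latency storage such as SSD, while a few large sequential transfers are less demanding.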

No competition

Dockerization of applications is not an exact science. There are several tools which share the same purpose. They all have their pros and cons, and their own specific technologies:

  • Static code analysis of the target software.
  • Specific programming languages.
  • Specific software framework or services types, such as databases.

None of them brings a general solution, because the problem is extremely complicated and relies on many arbitrary choices. Because DockIT's analysis is as close as possible to the operating system, it provides results that cannot be obtained with other tools, and it always provides them. This unique perspective, analyzing system calls on the fly, is complementary to the Dockerfile parameters produced by other tools.

This is why it has an option to edit an existing Dockerfile, adding what is missing and was not detected before.

Platform

At the moment, DockIT runs on Linux®, and is being ported to Windows®. It can be ported to other platforms as long as system calls can be intercepted, for example by some hooking feature of the target operating system.

And Survol?

DockIT is a distinct tool from Survol. Internally, they handle the same type of objects and resources as described by the CIM industrial standard. Survol and DockIT are two orthogonal technologies, based on the same concepts, to address and understand the behavior of running applications in-situ and in-vivo.

Survol displays snapshots, whereas DockIT traces behaviour over time. During its execution, DockIT reports the life cycle of detected objects: their creation, how they are used by system calls, and their destruction.