Heimdallr – Controlling the Rainbow Bridge between applications

For over a decade, I have had a rack of servers which I have used for both personal and work-related tasks. Indeed, here is a picture of my server rack from back in 2007 (with 6U of rack space containing what was at the time around $140K worth of high-end network switches).

The rack has changed a fair amount since then. I presently do not have the massive CPCI (CompactPCI) chassis mounted… I don’t know if I want to try to get a new backplane for it, along with trying to fill it out with newer 64-bit Intel and PPC cards at some point, or just continue with more systems like the Dell 2950 III or newer, which has replaced a number of those other systems (most of which were running Athlon 2500 and similar processors). But where I had 17 hosts (counting anything with an IP address as a host) visible in that picture, on 4 different subnets/VLANs, I currently have about twice that many hosts on twice that many subnets. The big difference is that I have half that many physical boxes, and the rest are either virtual machines or containers… many of which reside on that previously mentioned Dell 2950. The reason is that if I want to try a slightly different configuration, such as doing Node.js programming instead of PHP or Python, or if I want to isolate one application server from another, a few keystrokes later I am running another machine, almost as if I had gone out to the local computer store or Walmart and bought a new one. All thanks to the fact that I can allocate processors, memory and disk to a new virtual machine or container. Indeed, this is how companies do things these days, whether in their own datacenter, at some colo site, or by purchasing virtual servers or generalized compute resources from someplace such as Linode, Rackspace or AWS. And depending on what I do (e.g. whether I use a container instead of a full-blown VM), I can spin them up just as fast.

The downside

Now, this can be a bit of a pain to manage at times. If I want to run a container with its own IP addresses, or to spin up a full-blown VM, I have to allocate IP addresses for the machine. In addition, for the latter, I have to define things a bit further and say that I want a given base OS on it, with these packages out of the thousands which the OS could have installed, with a given network configuration and disk layout, and in the case of a virtual machine, with so many CPU cores, so much RAM, and so much space for a virtual disk image. And above all else, I don’t want to go through the hassle of entering a bunch of stuff to install a machine just like one I did six hours, or six months, ago… I want just a couple of commands, and to come back a bit later and have things just the way I wanted them. This is something I wanted well before I was ultimately responsible for the UN*X servers at CompuServe, or the UNIX install for the hundreds of AUDIX and Conversant machines manufactured every week when I was working at the Greater Bell Labs… and it follows a philosophy I picked up even before I started college, when I had just started using computers… which is…

Do it once by hand… OK. Do it twice by hand… start looking at how to get the computer to do it for you. Do it more than a few times… stop wasting time, making mistakes and being stupid… MAKE THE COMPUTER DO IT!

For my VM and physical machine installs, this means I use RHEL’s Anaconda and its Kickstart functionality, along with Cobbler. Where Michaelangelo goes “God, I love being a turtle!” in TMNT… for me, it is “G*d, I love being a UNIX/Linux Guru!”. With these, using the following two commands, I am installing a new machine and have its virtual console up so that I can watch the progress…

koan --system=newvm --virt
virt-manager --connect=qemu:///system --show-domain-console newvm &

But guess what… I can make it even more robust, handling things like validating names, dealing with “serial” consoles, and more, with a Bash shell script, and reduce it all down to just this:

koan-console newvm

But… there is still room for improvement. This is because:

  • Whether through the command-line interface or through the web user interface, Cobbler does not do so well at managing IP addresses. It really was not intended to, even though it can write my DNS files for me.
  • Cobbler is not set up to maintain more than the minimal information about a system needed to get it installed and up on the network. While it has a field for comments, it does not really track things like where the machine is, what hard drives are in it, etc.
  • While you can use post-install scripts to talk to Cobbler and trigger other things, like running an Ansible playbook to create entries in Nagios or other programs, or to install additional software, it is not the greatest.

And so… not being a fan of swivel-chair operations any more than I am of doing the same multi-step process repeatedly… there shall be a better way. Now, while this could be something like Puppet or Chef, I have looked at those, and none of them quite fits the bill… and so, I have decided to start a project to accomplish a few small things to begin with, and go from there. It needs to have the following functionality (for starters):

  1. It needs to be able to talk to Cobbler for install-related stuff, but at the same time start using something like phpipam for the IP address management. If I am saying I want a new VM for, say, a development exercise as part of an interview with a potential employer, there are certain subnets I want it to be on, etc. If we are talking about a web server on which I want to host a new WordPress site, it goes on another.
  2. If I want it to have access to a MySQL or PostgreSQL server, I want the rules to be created in my firewall automatically.
  3. At the same time, based on the type of server, I may want to have it added to the hosts being monitored by Nagios, or specially filtered in my logs, etc. And it may be that I want it to be included in Ansible as well.
  4. To go along with all this, I want an endpoint at which I can point the barcode scanner on my phone, scan something like a disk serial number, and pull up the information about that disk, such as when I purchased it, what machine it was last used in, etc.
  5. Should I wish more information, I also want to be able to have links which will open up a new tab talking to my filer, firewall, Cobbler or whatever (see this post for what this is replacing, in part from a programming perspective).

Given how this program will be all-seeing into my DevOps systems, and how it will be a bridge between them… what better name than Heimdallr, the guardian of Bifröst, the rainbow bridge?

It is still in the process of condensing in my mind, and I am still writing up the user stories and tasks on top of the initial set of requirements, but things like REST are our friends, and I may very well introduce the ability to add short-lived guest accounts, defaulting to read-only, as a means of showing off. I do have some other commitments, but I hope that at least the core of this will come together, using REST and MVC (I debated a little about writing this in Zend Framework 3 and PHP 7, but I already do so much PHP, and many of the other applications in this arena, such as Cobbler and Ansible, are using Python and Django, so…). My thought is that this will be a very Agile project, starting off with the core idea and going from there… beginning with talking to the database, where so much will have to be located, if it is not already, such as my disk database.

Common Table Expressions – SQL Magic by another name

Common Table Expressions, or CTEs as they are often called, are an area of SQL which some consider to be no different from magic. This is because few of us use them, and even fewer of us use them regularly enough to make them like the familiar of some powerful wizard in some book. And just like those familiars sometimes do in the books, they can give us great frustration, or even turn upon us like some demon or Djinn which has escaped the bounds we thought would control it. But all the same, they can be quite useful, and indeed necessary, for us to have a performant application.

An introduction by example

And I myself have used them in a number of instances, including:

  • Building up access control lists, where permissions can be inherited through roles.
  • Generating menu hierarchies.
  • Generating organizational charts.
  • Analyzing data to generate reports.

In many of these cases, we will see records in a database which point at other records of the same type. For example, here is one where we have an employee record containing a reference to the supervisor (their manager, boss, or whatever term is used at that level of the organization).
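
To make that concrete, here is a minimal sketch of such a table, using PostgreSQL-flavored DDL; the column names are the ones used in the CTE below, while the types and sizes are simply assumptions:

CREATE TABLE employees (
    employee_id  INTEGER PRIMARY KEY,
    fname        CHARACTER VARYING(64) NOT NULL,
    lname        CHARACTER VARYING(64) NOT NULL,
    supervisor   INTEGER REFERENCES employees (employee_id)  -- NULL for anyone with no supervisor
);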

Now, when generating the org chart, moving downward from someone in the organization, such as an executive, we could do this in one of two ways, which basically boils down to:

  1. Have our program do the work.
  2. Have our database do the work.

In general, you are actually better off doing the latter, since:

  1. The database will tend to reside on a system with lots of resources (CPU, memory, disk), while your program may be running on a system with a much smaller set of resources, which can easily be replicated, etc.
  2. Each query sent to the database uses those resources, as well as network bandwidth, with there being costs to validate and prepare each query, to run it, and to prepare the results, as well as in the transmission and consumption of those results. And this does not even consider that a new database connection may need to be created.
  3. While we can certainly use things such as prepared statements in our program to reduce the overhead, and our queries may be simple, with a properly designed database, the database server can use its knowledge, as well as the expertise of its developers, to do things far better than we can.

Think about that employee database for a moment to see why. Suppose we are talking about an employee database with 100,000 records (when I was at Lucent in what we termed the “Greater Bell Labs R&D”, we were over 150,000, so that is not an impossible figure), in which 90,000 of them are employees with nobody reporting to them. That could easily turn into 90,000 unnecessary queries sent to the server, or some similar nightmare, with all the associated execution time on both ends, and all the requests going back and forth. So this is where a form of CTE called a recursive CTE helps us greatly.

Now, while referred to as a “recursive” CTE, it is really just a repeated iteration, where the records resulting from the prior pass are used to produce a new set of records, and it is repeated until no further records are returned to be placed into the temporary working table the server uses internally. Here is the CTE for the employees table.

WITH RECURSIVE report_tree ( employee_id, fname, lname, supervisor ) AS (
    SELECT
        employee_id,
        fname,
        lname,
        supervisor
    FROM employees
    WHERE supervisor IS NULL

UNION ALL

    SELECT
        e.employee_id,
        e.fname,
        e.lname,
        e.supervisor
    FROM employees AS e
    INNER JOIN report_tree AS emps ON (emps.employee_id = e.supervisor)
)

SELECT *
FROM report_tree;

When we look at this, we see the WITH block which defines the CTE as a prefix to the SELECT which uses it. That block consists of (in this case) an initial query (sometimes referred to as the “non-recursive” or “anchor” query), a UNION ALL, and then the “recursive” portion, which references the containing CTE. In this case, we first get the batch of employees who have no supervisor, which may in fact be just a single record, or could be multiple records (such as the CEO, CTO, CFO, CLO, etc.). And in the next iteration, the bottom part will get those who report to the first set, with the following iteration getting those who report to the second set, and so on.

The one issue with the above CTE is that the output order, put bluntly, sucks like the vacuum of intergalactic space. There is no easy way of taking the output records and sorting them to even remotely resemble the organization. But there is a way in which we can address this, as we will see later.
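
(For the impatient, here is a minimal sketch of that fix applied to the employees table, using the same path trick explained in detail in the next section: build up a sortable path as we go, zero-padding the IDs so that the string sort matches the numeric order. The LPAD width of 8 is an arbitrary assumption, and the syntax follows the same PostgreSQL-flavored dialect used in the later example.)

WITH RECURSIVE report_tree ( employee_id, fname, lname, supervisor, path ) AS (
    SELECT
        employee_id,
        fname,
        lname,
        supervisor,
        LPAD(CAST(employee_id AS CHARACTER VARYING), 8, '0')   -- the root of the path
    FROM employees
    WHERE supervisor IS NULL

UNION ALL

    SELECT
        e.employee_id,
        e.fname,
        e.lname,
        e.supervisor,
        CONCAT(emps.path, '>', LPAD(CAST(e.employee_id AS CHARACTER VARYING), 8, '0'))
    FROM employees AS e
    INNER JOIN report_tree AS emps ON (emps.employee_id = e.supervisor)
)

SELECT *
FROM report_tree
ORDER BY path;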

My most recent encounter… (an in-depth look)

Now, the most recent item where I came across a structure which could use a CTE is this… I have a page which is displayed on some of my machines when I connect to them via my browser. For example, if I connect to my main administrative server, or to the machine which handles all my virtual machines, I get the following page (sanitized):

In looking to redesign this, and ultimately put it into a new application which I am writing (which I am currently calling Heimdallr), I wanted to put this into a database and turn it into a dynamic page. When you look at it, it quickly becomes apparent that it should be represented as a hierarchy. It is, after all, yet another hierarchical menu, with all but the top nodes being HTML list items, and each leaf item having an associated URL and a target window (hidden from the viewer). Indeed, here it is in its source form:

<html>
<head>
<title>Monitoring</title>
</head>
<body>
<h1>Server Info/Status</h1>
<ul>
    <li><a href="/server-info" target="server-info">Server Info</a></li>
    <li><a href="/server-status" target="server-status">Server Status</a></li>
</ul>

<h1>Builds</h1>
<ul>
<li>builds
    <ul>
        <li><a href="http://builds.example.com:8080" target="Jenkins">Jenkins</a></li>
    </ul>
</li>
<li>tuleap
    <ul>
        <li><a href="https://tuleap.example.com" target="tuleap">Tuleap</a></li>
    </ul>
</li>
<li>registry
    <ul>
        <li><a href="https://registry.example.com/v2/_catalog" target="registry">Local Docker registry</a></li>
    </ul>
</li>
<li>mirrors
    <ul>
        <li><a href="http://mirror.example.com/local/" target="mirror-local">Local Repository</a></li>
    </ul>
    <ul>
        <li><a href="http://mirror.example.com/local-secure/" target="mirror-local-secure">Local Secure</a></li>
    </ul>
</li>
</ul>
<h1>Management</h1>
<ul>
<li>firewall
    <ul>
        <li><a href="https://gateway.example.com" target="gateway" rel="noreferrer">Firewall</a></li>
    </ul>
</li>
<li>Host1
    <ul>
    <li><a href="https://host1-drac.example.com" target="host1-drac">host1 DRAC</a></li>
    <li><a href="https://host1.example.com:1311/OMSALogin?msgStatus=null" target="host1-om">Host1 OpenManage</a></li>
    </ul>
</li>
<li>nas
    <ul>
    <li><a href="https://nas.example.com/" target="nas">nas</a> (<a href="https://nas.example.com/api/docs/" target="nas-api">API Docs</a>)</li>
    </ul>
</li>
<li>Host2
    <ul>
    <li><a href="http://host2.example.com/cobbler_web/" target="cobbler">Cobbler</a></li>
    <li><a href="http://host2.example.com/phpldapadmin/" target="ldapadmin">phpldapadmin</a></li>
    </ul>
</li>
<li>Host3
    <ul>
    <li><a href="http://host3.example.com/phpipam/" target="phpiam">phpipam</a></li>
    <li><a href="http://host3.example.com/phpmyadmin/" target="phpMyAdmin">phpMyAdmin</a></li>
    <li><a href="http://host3.example.com/phpPgAdmin/" target="pgadmin">phpPgAdmin</a></li>
    </ul>
</li>
</ul>
<h1>Monitoring</h1>
<ul>
<li><a href="/cacti" target="cacti">Cacti</a></li>
<li>MRTG
<ul>
<li><a href="/mrtg/sw02.html" target="mrtg-sw02">SW02</a></li>
</ul>
</li>
<li><a href="/nagios" target="nagios">Nagios</a></li>
<li><a href="/smokeping/sm.cgi" target="smokeping">Smokeping</a></li>
</ul>

</body>
</html>

So, the first pass of this becomes the following database tables, shown in graphical format:
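
Expressed as (approximate) DDL, those two tables look something like the sketch below; the column names are the ones used in the CTE further down, while the types, sizes and constraints are my assumptions:

CREATE TABLE index_groups (
    index_group_id     INTEGER PRIMARY KEY,
    index_group_order  INTEGER NOT NULL,                 -- display order of the group headings
    index_group_name   CHARACTER VARYING(64) NOT NULL
);

CREATE TABLE index_links (
    index_link_id         INTEGER PRIMARY KEY,
    index_group_id        INTEGER REFERENCES index_groups (index_group_id),
    index_link_parent_id  INTEGER REFERENCES index_links (index_link_id),  -- NULL for top-level entries
    index_link_url        CHARACTER VARYING(255),        -- NULL for entries which only group other links
    index_link_target     CHARACTER VARYING(64),
    description           CHARACTER VARYING(64) NOT NULL
);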

The index_groups table becomes something like this (intentionally shown as if “Management” was added after the fact):

ID | Order | Text
---+-------+--------------------
 1 |   1   | Server Info/Status
 2 |   2   | Builds
 3 |   4   | Monitoring
 4 |   3   | Management

While index_links contains records like the following:

Link ID | Group ID | Parent Link ID | URL                         | Target        | Description
--------+----------+----------------+-----------------------------+---------------+------------------
   1    |    3     |                |                             |               | firewall
   2    |          |       1        | https://gateway.example.com | gateway       | Firewall
   3    |    3     |                |                             |               | Host1
   4    |          |       3        | https://host1-drac…         | drac          | host1 DRAC
   5    |          |       3        | https://host1:1311          | host1-om      | Host1 OpenManage
   6    |    3     |                |                             |               | Host2
   7    |          |       6        | https://host2…              | host2-cobbler | Cobbler

Now, as I said above, we could do this with loops making queries, but remember those 90,000 employees? While this menu only has 31 records, half of them have no children, so we would be making roughly 16 queries with no results.

NOTE: Yes, we could group them and use queries with something like WHERE index_link_parent_id IN (1, 2, 3, 4), but that quickly becomes complex, and could run into some limit. So again… why not let the SQL server do what it has been designed, and indeed is optimized, to do.
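
(For illustration, one level of that batching might look something like the query below, where the IDs in the IN list are whatever the previous level returned… and you would still have to stitch the levels back together in your own code.)

SELECT index_link_id, index_link_url, index_link_target, description
FROM index_links
WHERE index_link_parent_id IN (1, 2, 3, 4);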

So, how can we get those records out, and have them in an nice order, unlike the employee list? The first part is to have each record contain what is often referred to as an XPATH… a string which indicates how we arrived at the record in question. In this example, we are not going to use the /node1/node2/... syntax, but something similar… we will take the unique ID of each record in index_links, and add it onto a string, and each child will add theirs onto the end, using a ‘>’ character as a separator. It could be just about anything consistent, such as a colon, a dash, a period or a comma… just something which does not show up in our IDs, so we know where one ID ends and the next child’s begins. Were we to add an order to those records, we would just use it instead (or perhaps the order and ID, if we allowed for duplicate order values). And, as an added bonus, we will also use a trick to make the records of index_groups look like the records of index_links, so that we have rows for those records included in their spots in the hierarchy. And so, we have this CTE to get the data:

WITH RECURSIVE menutree (group_id, link_url, link_target, description, link_id, link_parent_id, xpath) AS (
    SELECT                                      -- Non-recursive
        index_group_id,
        CAST(NULL AS CHARACTER VARYING) AS link_url,
        CAST(NULL AS CHARACTER VARYING) AS link_target,
        CAST(index_group_name AS CHARACTER VARYING(64)),
        CAST(NULL AS INTEGER) AS link_id,
        CAST(NULL AS INTEGER) AS parent_id,
        CONCAT('', index_group_order)
    FROM index_groups

    UNION

    SELECT                                      -- Recursive
        CAST(NULL AS INTEGER),
        l.index_link_url,
        l.index_link_target,
        CAST(l.description AS CHARACTER VARYING),
        l.index_link_id,
        l.index_link_parent_id,
        CONCAT(mtree.xpath, '>', l.index_link_id)
    FROM index_links l
    INNER JOIN menutree mtree ON (
        (l.index_link_parent_id IS NULL AND l.index_group_id = mtree.group_id) 
        OR
        (l.index_link_parent_id IS NOT NULL AND l.index_link_parent_id = mtree.link_id)
    )
)
SELECT * FROM menutree
ORDER BY xpath;

The non-recursive part starts by getting the records out of index_groups, and adding in columns with NULL values to make those look like the other records, and then in the “recursive” block, we have our query to add on the next set of records. The things to note here are:

  1. We are using column names on the CTE itself to save ourselves some typing.
  2. For the ON clause of our INNER JOIN, we have two conditions OR’ed together:
    1. A condition to join our group and link records for the first round of joining, and
    2. A condition to join child links to their parent links.
  3. We are having to use CAST() to have our column types match up, which is one reason why a CTE such as this can be particularly tricky.
  4. The order is not exactly identical. In particular, the fact that the MRTG record was added last and the SW02 record placed under it, almost as an afterthought, means that the MRTG entry and its children will come last, rather than following the entry for Cacti.

But with that said… here is the output.

It should be noted that:

  1. A simple JOIN between the two tables and a sort on the group ID and description takes well over half that amount of time.
  2. We are in fact doing three JOINs, not two.
  3. Each query on index_links takes about 4.5ms, meaning that even if we group things as we produce each query level, we would be talking roughly 18ms (3 levels, and an empty 4th level), and were we to not group them, but to do a simple one-by-one query, we would spend over 60ms just querying for children for rows which have none. And that would still require our taking the results and interleaving them with their parent records, while here, we need only look at the length of the xpath column, see whether we need to add another layer or close one, and continue from there… and we could even have computed a depth in the results, had we desired to do so (see the sketch just after this list).
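
As for that depth value, here is a sketch of the same CTE with a depth column layered on; the only changes from the version above are the extra name in the column list and the two depth expressions:

WITH RECURSIVE menutree (group_id, link_url, link_target, description, link_id, link_parent_id, xpath, depth) AS (
    SELECT                                      -- Non-recursive
        index_group_id,
        CAST(NULL AS CHARACTER VARYING) AS link_url,
        CAST(NULL AS CHARACTER VARYING) AS link_target,
        CAST(index_group_name AS CHARACTER VARYING(64)),
        CAST(NULL AS INTEGER) AS link_id,
        CAST(NULL AS INTEGER) AS parent_id,
        CONCAT('', index_group_order),
        0 AS depth                              -- the group headings sit at depth 0
    FROM index_groups

    UNION

    SELECT                                      -- Recursive
        CAST(NULL AS INTEGER),
        l.index_link_url,
        l.index_link_target,
        CAST(l.description AS CHARACTER VARYING),
        l.index_link_id,
        l.index_link_parent_id,
        CONCAT(mtree.xpath, '>', l.index_link_id),
        mtree.depth + 1                         -- one level deeper than the parent row
    FROM index_links l
    INNER JOIN menutree mtree ON (
        (l.index_link_parent_id IS NULL AND l.index_group_id = mtree.group_id)
        OR
        (l.index_link_parent_id IS NOT NULL AND l.index_link_parent_id = mtree.link_id)
    )
)
SELECT * FROM menutree
ORDER BY xpath;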

For Further Reading

So, with this, here are a few handy links for you if you wish to learn more about CTEs, as they are available in many of the current SQL server implementations out there today.

One of the reasons why I ***HATE*** Windows…

A few weeks ago, I posted an entry about why I use Windows. Here is part of why I hate it, and why I use it less than willingly (to put it mildly).

For some months, I have been having a worsening problem with my browser and Windows as a whole not being able to do DNS lookups. If you are not aware of what this means, think of it this way… you want to drive someplace, but you don’t have the address or know how to get there. So, you try to make a phone call to a friend who knows where it is… only… that does not work. So while my browser knew the names of sites like “www.google.com”, “www.gmail.com” or “www.facebook.com”, it could not get them translated to values like “172.217.15.68” to allow me to do what I wanted.

Now, with a problem like this, people typically talk about changing your network settings to use a different DNS server to give you these sorts of answers, just like you might call a different person to get the address. One reason you see this is because most folks do not run their own DNS server, and do not have the ability to look at things in detail on the DNS server end… but I do, in part because it serves up addresses only it has at hand. But this also means that I have a much better view on things, to know where the problem is and is not… and rather than blame some DNS server, I know the problem is inside the machine which is running Windows… someplace. And here is how I know this…

As I sit here, as with any other day, besides the browser windows/tabs, I have connections open (using SSH) to terminal sessions on various Linux servers, along with a command.com session, and more. And add to this that for the past month or so, I have been capturing all network traffic associated with my DNS servers. And so, here is what I know/see.

  • To start with, my browser or other applications report that they are unable to look up the remote host. If you use Chrome, you probably remember this as the screen with the T-Rex, which you can also use to play a game jumping over cacti and the like, like this…

Or the just as frustrating and non-informative one like this…

  • I then pull up my command.com session, and do things like:
    • Ping 8.8.8.8 (one of Google’s DNS servers… that works fine).
    • Ping my own DNS server at 192.168.xxx.xxx, which also works fine.
    • Do a ping www.google.com or some similar host, which fails, telling me that the host name could not be found.
    • Do an nslookup www.google.com, which also fails.
  • I then pull up one of my terminal sessions on a machine other than one hosting my DNS, and do a ping www.google.com just like I did before, and that works.

At this point, I know the problem is almost certainly inside Winblows, but I confirm this by doing the following:

  • In the command.com window, ipconfig /all shows the proper DNS servers.
  • I do an ipconfig /flushdns, which reports that it succeeded, but everything still fails.
  • Even doing an ipconfig /release and then an ipconfig /renew does not fix the issue.

Generally, by this time, my antivirus (AVG) refuses to respond, and I cannot turn it off temporarily. And the only solution is to reboot. But the real kicker… having Winblows diagnose the problem gives me the following screens…

My response to this is…

The real kicker is when I open up my network traffic captures… when this is happening, there is absolutely no sign of any DNS requests from my Winblows box. NOT ONE FRELLING PACKET!!!

Now, if there were log files (nothing shows up in the Event Viewer), maybe I could figure out if, for example, AVG ended up wedged or swapped out. Or perhaps it is the Winblows DNS client. But after months of looking, nada, zilch, nothing to indicate where I might find logs or enable debugging. And this leaves me wanting to do this with Winblows, AVG and the rest…

But for now, between the cost and other issues with AVG on top of this problem, I am starting by switching to a different anti-virus. I may or may not come back to AVG, which I started using some years ago because it supported both my Winblows machines and my Android cell phones… but at this point, I am severely disinclined to do so.