DevOps at Thomas J. Watson Library

Alan McCarthy-Behler
April 14, 2021

Slack Snapshot

“The Spike”: Slack meets work from home

The first word of corporate language I remember hearing is “synergy.” I was working as a technician in a small, Detroit-based IT company in the late 1990s. I hated the word—most of us did—almost out of reflex. The same way one hates liver and 1989 Ford Tempos. Synergy seemed to exist to add glamour and professionalism to executive speech and do little else.

When I started to get interested in DevOps a few years ago, I avoided telling anyone about it. I assumed that the word itself would turn people off the same way that synergy had turned me off over twenty years earlier. Talking about project management already tends to glaze eyes over; no need to add lingo on top of that.

But then something happened. COVID-19 forced every library systems department into a sudden, unplanned work-from-home arrangement, and it became very clear, very quickly, who knew what this meant and who didn’t. If you already knew how IT had changed in the last decade, then work from home was an opportunity to continue an industry-wide project of restructuring. If you didn’t, well, then you probably did something along the lines of working longer hours spread across a growing number of projects with increasingly variable delivery dates (every project was suddenly top priority and jobs were riding on it) while simultaneously being asked to attend nearly every departmental meeting in your library. As it turns out, this situation is not unique, and DevOps was created to address it.

So what is DevOps? Simply put, DevOps is the intentional combination of Development and Operations activities. In IT, operations is your infrastructural work: your server maintenance, your desktop deployments, etc. Development is your coding: your creation of the new. Traditionally, these are separate activities, mainly because they were different business activities when IT was getting started.

A holdover of this separation is how Operations and Development used to think about project work, which was done by the “waterfall method.” As evocative a name as this might be, the method's title is derived from the top-down, carefully managed stages a project must pass through on its way to completion. A central feature of this structure is the scrutiny of change. You never alter the deliverable without passing the proposed change through some kind of control.

On the surface, this is a good thing. You wouldn’t want to rely on a service whose hours were listed, as I once saw in a shop window in Idaho, as “open when I’m here, closed when I’m not.” Nor would you want one model 2020 Car XYZ to be different from another model 2020 Car XYZ. However, there are certain conditions where this type of project structure is harmful, such as the intersection where IT meets the internet. In an online service with users numbering in the billions, you couldn’t pass every proposed change to your software through a traditional control board without falling so far behind that a vital change would take weeks or even months to deploy. This was the state of IT less than ten years ago.

The Watson Library technical ecosystem, fortunately or unfortunately, does not have billions of users. But we do have both development and operations activities. When I went to the USENIX Large Installation System Administration conference [LISA19] last year, this similarity of activity, if not working environment, was mercifully apparent. Library Systems is still Systems and could adapt some of the lessons the engineers at large companies like Facebook and Amazon had learned. It was also clear that no one seemed to be talking about DevOps in the library world, so if we wanted to implement it, we would have to do it through the lens of our IT identity rather than as a Library department. It also must be said that if it weren’t for my colleagues Mike, Daisy, and Scott in the library systems department, no change could have come. Without them, the implementation stood no chance; their ideas, opinions, and labor ultimately formed the new structure of work. They turned it into something that was ours, from us, to help us in our environment.

There are several core concepts in DevOps, but the most impactful one, at least in our case, was the gradual implementation of the concept known as Flow. Broadly put, this is the practice of removing barriers. All kinds of barriers: between staff, between systems, between ourselves. Among the steps we took to achieve this, the biggest change occurred when we rolled out an interconnectable stack of management software: Jira, Slack, and Trello. We were fortunate that two of these pieces of software, Jira and Slack, were already in use inside the Museum, and we could, and did, get help from other departments, for which we are extremely grateful.

Jira Dashboard

Watson Library Systems activity dashboard in Jira. Individual tickets and projects have their updates and statuses recorded here.

Jira is ticketing software that the Museum’s Digital Department uses to track individual issues as well as larger projects. It has many fun features, four of which were central to our goal: 1) communicating well with other software; 2) maintaining easily accessible statistics; 3) creating dashboards; and 4) making tickets visible to any user within Watson Library. Taken together, these features mean that Library Systems work is transparent to library staff. In addition to providing updates to staff about their issues, dashboards visualize intervals of library systems work. These dashboards are shared entities accessible to all. As our work is transparent to us, so is it transparent to all library staff.

Internally, we took advantage of Jira’s interconnectivity to plug it into Slack, a well-known messaging platform. We created several channels for the library—it became a central platform in our COVID response—but the main channel generated automated messages of all activity. If anything significant happens on a Jira ticket, every member of Library Systems is notified. There are no secrets among us about who is doing what.
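To make the Jira-to-Slack feed a little more concrete, here is a minimal Python sketch of the idea. It is not our actual configuration. The event fields (`key`, `summary`, `type`, `user`), the ticket keys, and the set of "significant" event types are all assumptions invented for illustration, and a real Jira webhook payload is shaped differently. The one outside detail it leans on is that a Slack incoming webhook accepts a JSON body with a `text` field.

```python
import json

# Hypothetical list of event types worth announcing; an assumption for this sketch.
SIGNIFICANT = {"created", "status_changed", "resolved"}


def to_slack_message(event):
    """Turn a simplified, hypothetical ticket event into a Slack-style payload.

    Returns None for events that are not considered significant.
    """
    if event["type"] not in SIGNIFICANT:
        return None
    action = event["type"].replace("_", " ")
    text = f'[{event["key"]}] {event["summary"]}: {action} by {event["user"]}'
    return {"text": text}


# Made-up events standing in for whatever the ticketing system emits.
events = [
    {"key": "LIB-101", "summary": "VPN timeouts", "type": "created", "user": "mike"},
    {"key": "LIB-101", "summary": "VPN timeouts", "type": "viewed", "user": "scott"},
    {"key": "LIB-101", "summary": "VPN timeouts", "type": "resolved", "user": "daisy"},
]

for event in events:
    payload = to_slack_message(event)
    if payload is not None:
        # In a real setup this JSON would be sent as an HTTP POST to the
        # channel's incoming-webhook URL; here it is only printed.
        print(json.dumps(payload))
```

The filter is the part that matters: routine noise (someone merely viewing a ticket) stays out of the channel, while anything that changes the state of the work reaches everyone.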

Slack Snapshot

All ticketing activity feeds into Slack, an internal communications platform.

At the same time, the always-online nature of Slack allowed for a much quicker response to operations-style issues in our library help desk channel, since the informal, almost casual nature of chat makes it much faster for staff to drop in to ask questions or report problems. Between our two main channels and the collection of other library channels, Systems staff had a live feed of almost all activity. This meant that responses could occur immediately, often preventatively, and with a more complete picture, as when, for example, early VPN throttling caused timeouts on the HTML5 connections used mainly by Mac users.

The final piece of software in our Flow stack, Trello, is a card-based project management tool: each project becomes a card, and cards are placed into named columns. We use this to manage and plan out projects for an entire year (broken out by fiscal quarter) and, like Jira, it sends messages into our Library Systems Slack channel. So not only do we know who is working on what that day, we also know the status of our long-term projects. We met about once a month during the summer to review our Trello implementation, discuss the arrangement of the cards, and fine-tune our plan. When we reached out to all library staff to solicit project ideas, those ideas went into Trello to be ranked and planned.
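The card-and-column arrangement described above can be sketched in a few lines of Python. This is only an illustration of the planning structure, not the Trello API or our real board: the card names, the `quarter` labels, and the numeric `rank` field are hypothetical.

```python
from collections import defaultdict


def board_by_quarter(cards):
    """Group project cards into quarter columns, highest priority (lowest rank) first."""
    columns = defaultdict(list)
    for card in sorted(cards, key=lambda c: c["rank"]):
        columns[card["quarter"]].append(card["name"])
    return dict(columns)


# Made-up cards; rank 1 is the most important project in its column.
cards = [
    {"name": "Upgrade discovery layer", "quarter": "Q2", "rank": 2},
    {"name": "Retire old file server", "quarter": "Q1", "rank": 1},
    {"name": "Staff-requested scanner setup", "quarter": "Q2", "rank": 1},
]

for quarter, names in sorted(board_by_quarter(cards).items()):
    print(quarter, names)
```

A monthly review then amounts to moving a card to a different column or changing its rank, which is easy to reason about even for a year's worth of work.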

Trello Systems Board

The Library Systems Trello board, a centralized place to plan out activity for the year.

You can imagine the effect of all these products when implemented together. Automation begins to do things for you that you never realized you spent so much time doing. Notification emails get sent automatically, useful reports are generated without needing time to wrangle the data, and Systems staff are able to read over solutions to longstanding problems instantly because all solutions are already documented. All this time saved can be put right back into Library Systems and used (and this is a recommended practice) to “pay down technical debt,” which is a shorthand way of saying that you should take some time to do all those maintenance and upgrade projects you couldn’t do before because you had no time. And as 2020 finally, at last, came to a close, our metrics showed that the four of us had completed 148 large and small projects over the previous three hundred days, a full third of which were staff-solicited. Simultaneously, we resolved over 320 help desk issues.
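Year-end numbers like these fall out almost for free once every piece of work is recorded somewhere. As a rough sketch, assuming each completed item is exported as a record with a `kind` and a `staff_solicited` flag (both field names are hypothetical, and the sample records below are invented rather than our real data), the tally looks something like this in Python:

```python
from collections import Counter


def summarize(items):
    """Count completed work by kind and compute the staff-solicited share of projects."""
    kinds = Counter(item["kind"] for item in items)
    projects = [item for item in items if item["kind"] == "project"]
    solicited = sum(1 for p in projects if p["staff_solicited"])
    share = solicited / len(projects) if projects else 0.0
    return {
        "projects": kinds["project"],
        "help_desk": kinds["help_desk"],
        "staff_solicited_share": share,
    }


# Invented sample records standing in for a real export.
completed = [
    {"kind": "project", "staff_solicited": True},
    {"kind": "project", "staff_solicited": False},
    {"kind": "project", "staff_solicited": False},
    {"kind": "help_desk", "staff_solicited": False},
    {"kind": "help_desk", "staff_solicited": False},
]

print(summarize(completed))
```

Nobody has to sit down and count at the end of the year; the report is a by-product of doing the work in the open.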

Synergy or not, without the structure of DevOps there is no way we could have achieved anything close to this level of work. And the really crazy thing is that it happened with almost no hair-pulling or teeth-gnashing. After all the planning and the implementation work and the reviews, the system worked. It did what it was supposed to do. Yes, it was a totally insane year, but not because of us.


Alan McCarthy-Behler is the associate Museum librarian for library systems.