Lessons Learned from the Introduction of Autonomous Monitoring to the EUVE Science Operations Center
In this age of shrinking NASA budgets, the lessons learned on the EUVE project are useful to other NASA missions looking for ways to reduce their operations budgets. The process of capturing knowledge from the payload controllers for implementation in an expert system is directly applicable to any mission considering a transition to autonomous monitoring in its control center. The collaboration with ARC demonstrates how a project with limited programming resources can expand the breadth of its goals without incurring the high cost of hiring additional, dedicated programmers. This dispersal of expertise across NASA centers allows future missions to easily access experts for collaborative efforts of their own. Even the criteria used to choose an expert system have widespread impacts on the implementation, including the completion time and the final cost. In this paper we discuss, from inception to completion, the areas where our experiences in moving from three shifts to one shift may offer insights for other NASA missions.
The Extreme Ultraviolet Explorer (EUVE) launched on a Delta II rocket in June of 1992. The Explorer class spacecraft carries a set of science instruments designed and built at the University of California, Berkeley (UCB). The EUVE mission was designed to conduct the first multi-band survey of the entire extreme ultraviolet (EUV) sky followed by spectroscopic observations of EUV sources. Mission operations are run from Goddard Space Flight Center (GSFC) while health and safety monitoring of the science payload is carried out in the EUVE science operations center (ESOC) at the Center for EUV Astrophysics (CEA), UCB.
Shortly after launch, it became clear that NASA's mission operations and data analysis budget faced drastic cuts. With EUVE's early scientific success, CEA sought ways to dramatically lower the mission operations budget in the hope that cost reductions would allow EUVE to continue operating past the end of the nominal mission. We looked at many areas of the project, including the possibility of reducing staffing by introducing autonomous monitoring. Because of our lack of experience in this area, we began the process by looking for a partnership with someone possessing relevant experience. We found that NASA Code X and the NASA Ames Research Center (ARC) had the knowledge and the desire to help us.
We conducted an extensive search for off-the-shelf products that would meet our needs. CEA tested products in-house for applicability, speed, and ease of use. The cost of the competing products was a factor, as was documentation and technical support. As the search progressed, package stability clearly emerged as the critical criterion. With limited manpower and a short schedule, we could not risk software deficiencies. A stable package also helps ensure the accuracy and utility of documentation. This consideration was very important, as we intended to customize the software ourselves. Several good products lacked adequate documentation; these products would have required us to hire the software company to implement the system, which would have been prohibitively expensive for our program.
Ultimately, we selected RTworks by Talarian Corporation of Mountain View, CA. RTworks displayed solid performance coupled with excellent documentation and technical support. Importantly, the generalized nature of the RTworks tools allows customizing to our needs. Moreover, the open architecture allows us to easily plug in previously existing code.
We decided to develop an intermediate knowledge representation that would serve as a deliverable product from the domain experts to the knowledge base developers. We used informal flowcharts in a series of documents, one for each of the major subsystems whose monitoring we were automating. This approach proved very useful, as it cleanly separates the issues of implementation and knowledge representation from the actual knowledge itself. We had some difficulty representing the domain knowledge in flowcharts until we freed ourselves from the perceived need to represent the knowledge sequentially. On several occasions we found that we were attempting to force the knowledge representation into a preconceived, causal flow when it was more naturally and correctly represented by an event-driven model ("event-driven" in that nothing occurs until new data are received). The data are often received in what appears to be an asynchronous fashion because of issues of data quality, dropout, or other effects of receiving our data after the level-zero processing performed at GSFC, as well as the basic complexity of our telemetry stream.
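The event-driven model can be illustrated with a minimal sketch in Python. This is not EUVE or RTworks code; the channel names, limits, and rule structure are hypothetical, and the point is only that rules run in response to arriving samples rather than in a fixed, sequential order:

```python
# Minimal sketch of an event-driven monitor: nothing happens until a new
# telemetry sample arrives, and only the rules that reference the updated
# channel are evaluated. All channel names and limits are illustrative.

class EventDrivenMonitor:
    def __init__(self):
        self.values = {}   # most recent value per engineering channel
        self.rules = {}    # channel name -> list of rule callbacks

    def add_rule(self, channel, rule):
        self.rules.setdefault(channel, []).append(rule)

    def on_update(self, channel, value):
        """Called whenever a new sample is decommutated for a channel."""
        self.values[channel] = value
        for rule in self.rules.get(channel, []):
            rule(self.values)          # each rule sees the current state

alerts = []
monitor = EventDrivenMonitor()
monitor.add_rule(
    "DET_HV",
    lambda v: alerts.append("HV out of range") if v["DET_HV"] > 5000 else None,
)

monitor.on_update("DET_TEMP", 21.5)   # different channel: no HV rule fires
monitor.on_update("DET_HV", 5150)     # rule fires only when its datum arrives
```

Because evaluation is keyed to data arrival, asynchronous or out-of-order channel updates are handled naturally, with no assumption about the sequence in which samples appear.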
A basic, underlying assumption is that one can reason using the last sample from every engineering channel. This assumption is inadequate: values from the current frame can only be reasoned about in conjunction with the most recent expected value. Because of data dropouts, the most recent expected value is not necessarily the most recent value received. Our interactive SOCtools package uses a shared memory segment to decouple the decommutation from the display of the engineering channel values. The shared memory segment keeps an individual timestamp on every engineering channel to deal with this issue (and the timestamps also conveniently serve as a semaphore for multiple, asynchronous client accesses at the individual engineering channel level).
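The per-channel timestamp idea can be sketched as follows. This is a simplified, hypothetical analogue of the SOCtools shared memory segment, not its actual implementation; a client compares the channel's timestamp against the frame time it expects, so a value hidden behind a dropout is rejected as stale rather than silently reused:

```python
# Sketch of a store with a per-channel timestamp (illustrative only).
# A reader gets a value back only if its timestamp matches the frame the
# reader expects; otherwise a dropout has hidden the expected sample.

class ChannelStore:
    def __init__(self):
        self._data = {}   # channel -> (value, timestamp)

    def write(self, channel, value, t):
        self._data[channel] = (value, t)

    def read(self, channel, expected_time, tolerance=1.0):
        """Return the value only if it is current for the expected frame."""
        if channel not in self._data:
            return None
        value, t = self._data[channel]
        if abs(t - expected_time) > tolerance:
            return None    # stale: the most recent *expected* value is missing
        return value

store = ChannelStore()
store.write("BUS_VOLT", 28.1, t=100.0)
print(store.read("BUS_VOLT", expected_time=100.5))  # 28.1: current
print(store.read("BUS_VOLT", expected_time=160.0))  # None: dropout detected
```

A real shared-memory implementation would add locking or atomic timestamp updates for concurrent clients; the timestamp comparison itself is the essential mechanism.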
The RTworks product does not maintain individual timestamps on the most recent values. However, because of the product's flexibility and the quality of the documentation, we were able to modify our customized RTdaq and RTie to handle this issue by supplementing the basic message types between the RTworks modules with a new message type. For gaps in the input stream, the engineering channels expected but missed are sent to the other RTworks clients in one of these new messages. When the RTie receives such a message, it sets the internal values to unknown for the slots corresponding to the given engineering channels. Rules do not fire when the slots they reference have unknown values (unknown is the default start-up value for all slots). In this way, all slots will either contain the most recent expected value or unknown, and thus the integrity of the most-recent-value model is maintained.
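The unknown-slot convention can be sketched in a few lines of Python. This is an illustration of the idea, not RTworks code: a gap message invalidates the affected slots, and any rule that references an unknown slot simply declines to fire:

```python
# Illustrative sketch of the "unknown" convention: a gap message marks the
# channels whose samples were expected but missed, and rules do not fire
# while any referenced slot is unknown. Names and limits are hypothetical.

UNKNOWN = object()                      # sentinel start-up / gap value

slots = {"DET_HV": UNKNOWN, "DET_TEMP": UNKNOWN}

def handle_data(updates):
    """Normal telemetry: update slots with the newly received values."""
    slots.update(updates)

def handle_gap(missed_channels):
    """Gap message: invalidate slots rather than keep stale values."""
    for ch in missed_channels:
        slots[ch] = UNKNOWN

def hv_rule():
    v = slots["DET_HV"]
    if v is UNKNOWN:
        return None                     # rule does not fire on unknown
    return "ALARM" if v > 5000 else "OK"

handle_data({"DET_HV": 5150})
print(hv_rule())                        # ALARM: slot holds a current value
handle_gap(["DET_HV"])
print(hv_rule())                        # None: dropout, rule stays silent
```

Every slot therefore holds either the most recent expected value or unknown, which is exactly the invariant the modified RTdaq/RTie messages preserve.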
Another ability we did not plan for, but clearly need, is the sophisticated grouping of page requests. Our automated systems focus on the detection of problems and then bring a person into the loop. We have no automated diagnostics that can take multiple alarms and group them together into a single problem (page request). The paging system can handle an unlimited number of page requests, but the user interface is too primitive to allow convenient handling of (acknowledging and closing) multiple, simultaneous page requests.
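One plausible approach to the grouping we lack is to collapse related alarms into a single page request before paging anyone. The sketch below is hypothetical (grouping by subsystem is just one possible heuristic, and the subsystem names are invented), but it shows the shape of the capability:

```python
# Hypothetical sketch of grouping simultaneous alarms into one page request
# per subsystem, so a controller acknowledges one page per problem rather
# than one per alarm. Subsystem names and messages are illustrative.

from collections import defaultdict

def group_alarms(alarms):
    """Collapse a burst of (subsystem, message) alarms into page requests."""
    pages = defaultdict(list)
    for subsystem, message in alarms:
        pages[subsystem].append(message)
    return dict(pages)

alarms = [
    ("telemetry", "frame sync lost"),
    ("detector",  "HV out of range"),
    ("telemetry", "CRC error rate high"),
]
print(group_alarms(alarms))   # two page requests instead of three alarms
```

A production version would also need a time window and a way to reopen a group when new alarms arrive after acknowledgment, but even this simple grouping would have eased the user-interface burden of handling simultaneous pages.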
As we settle into our new one-shift scheme, we are discovering the significance of removing humans from the control room. This move has had a profound effect on the flow of information. In the past, records were kept, but a great deal of information was exchanged face to face. During shift changes, noteworthy events could be discussed by the controllers before the ending shift departed. In our current mode of operations, controllers are separated by time and distance. As a result, record keeping and documentation have become critical issues. A controller paged at 2 a.m. will be asleep at 8 a.m. when the dayshift arrives. In order for the members of the team to act as a cohesive unit, the records left by the paged controller must be clear, complete, and unambiguous.
We also find we are not using the system as we expected we would. Many expert systems are designed to assist operations personnel rather than replace them; as such, the graphical human interface is very important. In our case, the display system is secondary, since the major focus is on automating the monitoring of payload systems during unstaffed shifts. As it turns out, our rule base, so far, is not very large (< 500 rules), and over half the rules exist simply to support the human-computer interface. This fact is particularly significant, as RTworks compiles the rules for the inference engine at runtime. The unnecessary rules introduce a performance penalty when developing an automated batch processing system. We are considering removing the display rules from the rule set used to process our tape dump data.
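The proposed partitioning of the rule base is straightforward in principle. The sketch below is purely illustrative (the tags, rule names, and counts are invented, and RTworks organizes rules differently): each rule is labeled by purpose, and the batch path loads only the monitoring rules:

```python
# Illustrative sketch of partitioning a rule base so the display-support
# rules can be excluded from the automated batch (tape dump) path, avoiding
# the runtime compilation cost of rules that will never fire unstaffed.
# Rule names, tags, and proportions are hypothetical.

rules = [
    {"name": "hv_limit",   "kind": "monitor"},
    {"name": "temp_limit", "kind": "monitor"},
    {"name": "hv_gauge",   "kind": "display"},
    {"name": "temp_plot",  "kind": "display"},
]

def rules_for(mode):
    """Batch processing loads only the monitoring rules."""
    if mode == "batch":
        return [r for r in rules if r["kind"] == "monitor"]
    return rules                # interactive mode keeps the display rules

print(len(rules_for("batch")))        # 2: display rules excluded
print(len(rules_for("interactive")))  # 4: full rule set
```

Since the inference engine compiles rules at runtime, loading the smaller set directly reduces start-up cost for the batch runs.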
We would like to thank M. Montemerlo, P. Friedland, D. Korsmeyer, and D. Atkinson for their support of the development of innovative technologies for NASA missions. We thank Dr. Guenter Riegler of NASA Headquarters and Dr. Ron Polidan and Peter Hughes of GSFC for their championing of innovation on the EUVE project. We would also like to thank all the members of the CEA staff who helped to make one-shift operations a reality in the EUVE science operations center.
This work has been supported by NASA contract NAS5-29298 and NASA Ames grant NCC2-838.
Abedini, A. & Malina, R. F. 1995, Designing an Autonomous Environment for Mission Critical Operations, Proc. SPACEOPS 1994, "Third International Symposium on Space Mission Operations and Ground Data Systems," in press.
Malina, R. F. 1994, Low-Cost Operations Approaches and Innovative Technology Testbedding at the EUVE Science Operations Center, presented at the 45th International Astronautical Congress, IAA Symp. on Small Satellite Missions, Sess. on "Low Cost Approaches for Small Satellite Mission Operations and Data Analysis," Jerusalem, Israel, 1994 October 9-14.
Morgan, T. & Malina, R. F. 1995, Advances in Autonomous Operations for the EUVE Spacecraft, "Robotic Telescopes," Proc. Astron. Soc. Pac., in press.
RTworks, Talarian Corporation, 444 Castro Street, Suite 140, Mt. View, CA 94041, (415) 965-8050.