Researchers who are transitioning from doing analysis locally on a laptop to remote or large-scale analysis often fall into a few common traps. Moving beyond local analysis isn’t necessarily difficult, but there are some good habits to develop that will make the transition more productive.

Establish a small test set

The set should be representative of the variety of data in the full data set if possible, but it does not need to produce results similar to those of the full analysis. There are two key wins that come from having a small test set defined up front.

Development loop speed 

There is a temptation, especially for those who are used to working on smaller datasets, to write code and then run it against the complete dataset. However, code is almost never perfect on the first, second, or even third attempt. If running a test takes several minutes or more, you are forced to shift your attention elsewhere while you wait, reducing your efficiency and causing even longer delays. It is best to have a small test set, even a nonsensical one, that can run in 10 seconds or less. Rapid iteration is key, especially during the early development phases.
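
As a minimal sketch of carving out such a test set up front (assuming a tabular dataset in a hypothetical full_data.csv with a category column, and the pandas library), one approach is:

    import pandas as pd

    # Read the full dataset once and keep a small, roughly representative slice.
    # The file name and column name are placeholders for the real data.
    full = pd.read_csv("full_data.csv")

    # Take up to 20 rows from each category so the variety in the data survives.
    test_set = (
        full.groupby("category", group_keys=False)
            .apply(lambda g: g.sample(min(len(g), 20), random_state=0))
    )

    # Save it next to the code so every quick development run uses the same file.
    test_set.to_csv("test_data.csv", index=False)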

Validating changes before large runs

The point of this set is to support a collection of unit tests that exercise the code and verify its core functionality. There is little worse than starting a long-running task, getting most of the way through it, and then encountering an error caused by a typo or other simple oversight that forces the whole process to start over. Tests that check basic functionality at the edges of the expected input will save a lot of time by preventing runs that can never complete successfully.
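
A few pytest-style checks can serve that purpose; in the sketch below, pipeline, clean_record, and summarize are hypothetical names standing in for the real code under test:

    # test_pipeline.py -- run with: pytest test_pipeline.py
    import pytest
    from pipeline import clean_record, summarize  # hypothetical module under test

    def test_handles_missing_value():
        # A record with a missing field should pass through without crashing.
        record = {"id": 1, "value": None}
        assert clean_record(record)["value"] is None

    def test_rejects_malformed_id():
        # Malformed input at the edge of what is expected should fail loudly.
        with pytest.raises(ValueError):
            clean_record({"id": "not-a-number", "value": 3.2})

    def test_summarize_empty_input():
        # An empty input is a cheap but valuable edge case to cover.
        assert summarize([]) == {}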

Additionally, as new errors are encountered in real data, extend both the code and the tests to handle the cases that were previously mishandled. This does not necessarily mean solving data problems during execution; rather, when faults or exceptions occur due to type errors and the like, catch the error and log it in a way that can be presented to the user as part of a list of data cleaning tasks to perform before the next attempt. In a large dataset, reporting only the first error encountered is a sure way to make the task take far too long. Instead, log each case as it is encountered and keep the program running all the way to the end, producing an easy-to-understand error log.
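
One way to sketch that pattern in Python (the CSV layout and the process_record step are assumptions for illustration) is to collect failures as they happen and report them all at the end:

    import csv
    import logging

    logging.basicConfig(filename="run_errors.log", level=logging.INFO)

    def process_record(row):
        # Stand-in for the real per-record work; here it just requires a numeric field.
        return float(row["value"])

    errors = []  # one entry per record that could not be processed
    with open("full_data.csv", newline="") as handle:
        for line_number, row in enumerate(csv.DictReader(handle), start=2):
            try:
                process_record(row)
            except (TypeError, ValueError, KeyError) as exc:
                # Log the problem and keep going instead of aborting the whole run.
                errors.append(line_number)
                logging.info("line %d: %r", line_number, exc)

    if errors:
        print(f"{len(errors)} records need cleaning; see run_errors.log")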

Execution environment

One of the keys to ensuring reproducible behavior is tracking which external libraries a program uses and, more specifically, which versions. In modern software development we depend heavily on others and their contributions to keep our own development tractable. As with everything, there are pros and cons to the way software is currently built, and further challenges and opportunities will arise as AI-generated code and AI-assisted development become more common.

For now, ensure familiarity with the concept of semantic versioning. In brief, the first number is the major version, the second is the minor version, and the third is the patch version. For most purposes, a good rule of thumb when setting up an execution environment is to pin each library at its minor version. This allows fixes to be applied at the patch level, while more substantive changes can be adopted deliberately as needed.
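
In a Python requirements file, that rule of thumb corresponds to the compatible-release operator; the libraries and versions below are placeholders, not a recommendation:

    # requirements.txt -- pinned at the minor version, patch updates allowed
    numpy~=1.26.0       # any 1.26.x release satisfies this
    pandas~=2.1.0       # 2.1.x only; moving to 2.2 is a deliberate choice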

Required libraries

In this context, required libraries refers only to the libraries that are directly imported and invoked by the code under development. Each required library may pull in further libraries of its own, but trying to enumerate these, or their versions, makes building an execution environment much more challenging. The tradeoff is that these dependencies of dependencies can introduce unexpected changes.
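
As a small illustration (the imports and versions are hypothetical), a script that begins as below needs only those two names in its requirements list; the libraries they depend on in turn, such as numpy, are resolved automatically by the installer:

    # analysis.py imports only what it calls directly...
    import pandas as pd
    import matplotlib.pyplot as plt

    # ...so the requirements list names just those direct dependencies:
    #   pandas~=2.1.0
    #   matplotlib~=3.8.0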

Packaging systems

Once a requirements list has been established, most programming environments have tools that can build an environment containing everything on that list. In Python, this can be achieved with the pip command. However, it is important to note that some libraries are written in C or C++ and require an appropriate compiler to build.
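
As a sketch of the usual workflow on a Linux or macOS system (the directory name is a placeholder), a fresh environment can be created and populated from the requirements file:

    # Create an isolated environment and install the pinned requirements into it.
    python -m venv .venv
    source .venv/bin/activate
    pip install -r requirements.txt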

This is where a tool like conda comes into play. Where pip sometimes has to build dependencies from source, conda maintains repositories of prebuilt binaries instead. Conda also supports multiple languages. The tradeoff is that conda adds a significant amount of disk usage to an environment.
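
With conda, the same requirements are usually captured in an environment file; the name, channel, and versions here are placeholders:

    # environment.yml -- build the environment with: conda env create -f environment.yml
    name: analysis
    channels:
      - conda-forge
    dependencies:
      - python=3.11
      - numpy=1.26.*
      - pandas=2.1.*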

If the intent is to rebuild the environment on every system where the code will run, this is not much of a problem. If, however, the goal is to guarantee that the exact same versions of the required libraries are available everywhere and to package the environment itself for distribution, the additional gigabytes of storage become more of a disadvantage.

Container Images w/Docker, Apptainer, or Kubernetes

Over the last decade, Docker and other container technologies have become a popular, lightweight way to build a frozen, all-inclusive execution environment. This allows a researcher to deploy an identical execution environment and code from their laptop to a high-performance computing or cloud computing environment.
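
A minimal Dockerfile for a Python analysis might look something like the sketch below; the base image tag, file names, and entry point are assumptions rather than a prescription:

    # Dockerfile -- freeze the interpreter, the libraries, and the code together
    FROM python:3.11-slim
    WORKDIR /app

    # Install the pinned requirements first so this layer is cached between builds.
    COPY requirements.txt .
    RUN pip install --no-cache-dir -r requirements.txt

    # Copy the analysis code and define the default command.
    COPY analysis.py .
    CMD ["python", "analysis.py"]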

Many cyberinfrastructure providers, such as the Minnesota Supercomputing Institute and the NSF’s ACCESS program, provide facilities for running containers from existing images using Apptainer. In cloud computing environments, Kubernetes or other orchestration tools may be more readily available.
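
On such systems, a published Docker image can typically be pulled and run without administrator access; the registry path and script below are placeholders:

    # Convert the Docker image to a local Apptainer image file (.sif), then run it.
    apptainer pull analysis.sif docker://registry.example.org/analysis:latest
    apptainer exec analysis.sif python /app/analysis.py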
 

Putting it all together

With some planning, a researcher can develop a good test suite to ensure that their code produces the expected results, create an environment that is reproducible to the degree of their choosing, and get the best of both worlds: rapid local development using a minimal data set, and the ability to run the full data set on appropriately scaled hardware elsewhere. Each case will have unique circumstances that are difficult to address with a standard solution. However, hopefully this introduction to some of the possibilities will help researchers choose a path toward a robust approach to working with big data.