
Last edited by Vaclav Svaton Nov 16, 2020


[Image: HEAppE and IT4Innovations logos]


HEAppE Middleware

High-End Application Execution Middleware, formerly known as HPC-as-a-Service Middleware

HPC-as-a-Service is a well-known term in the area of high-performance computing. It enables users to access an HPC infrastructure without the need to buy and manage their own physical servers or data-center infrastructure. Through this service, small and medium-sized enterprises (SMEs) can take advantage of the technology without an upfront investment in hardware. This approach further lowers the entry barrier for users and SMEs who are interested in utilizing massively parallel computers but often lack the necessary level of expertise in this area.

To provide this simple and intuitive access to the supercomputing infrastructure, an in-house application framework called HEAppE has been developed. The framework utilizes the mid-layer principle, known in software terminology as middleware. The middleware manages and provides information about submitted and running jobs and transfers their data between the client application and the HPC infrastructure. HEAppE can submit the required computations or simulations to the HPC infrastructure, monitor their progress, and notify the user should the need arise. It provides the necessary functions for job management, monitoring and reporting, user authentication and authorization, file transfer, encryption, and various notification mechanisms.

References

HEAppE Middleware has already been successfully used in several public and commercial projects:

  • in H2020 project LEXIS as a part of LEXIS Platform to provide the platform's job orchestrator access to a number of HPC systems in several HPC centers; https://lexis-project.eu
  • in crisis decision support system Floreon+ for What-If analysis workflow utilizing HPC clusters; https://floreon.eu
  • in the Urban Thematic Exploitation Platform (Urban-TEP) financed by ESA, as a middleware enabling sandboxed execution of user-defined Docker images on the cluster; https://urban-tep.eo.esa.int
  • in H2020 project ExCaPE as a part of Drug Discovery Platform enabling execution of drug discovery scientific pipelines on a supercomputer; http://excape-h2020.eu
  • in the area of molecular diagnostics and personalized medicine in the scope of the Moldimed project as a part of the Massive Parallel Sequencing Platform for analysis of NGS data; https://www.imtm.cz/moldimed
  • in the area of bioimage informatics as an integral part of a FIJI plugin providing unified access to HPC clusters for image data processing; http://fiji.sc

Licence and Contact Information

HEAppE Middleware is licensed under the GNU General Public License v3.0. For commercial use, contact us via support.heappe@it4i.cz regarding the proprietary license information.

Next Release Information - HEAppE Middleware v2

The new version of the middleware is in a pre-production state. We are fixing the last bugs, testing the new version, and preparing it for the next open-source release in early 2021.

Major changes include:

  • multi-platform .NET Core version
  • OpenAPI REST API
  • dockerized deployment and management
  • updated PBS and new Slurm workload manager adapter
  • SSH Agent support
  • various functional and security updates

IT4Innovations national supercomputing center

The IT4Innovations national supercomputing center operates four supercomputers: Anselm (94 TFlop/s, installed in 2013), Salomon (2 PFlop/s, installed in 2015), Barbora (826 TFlop/s, installed in 2019), and a special system for AI computation, DGX-2 (2 PFlop/s in AI, installed in 2019). A petascale EURO IT4I system will be installed at the centre in 2020 as part of the EuroHPC project. The supercomputers are available to the academic community within the Czech Republic and Europe, and to the industrial community worldwide, via HEAppE Middleware.

Salomon

The Salomon cluster consists of 1008 compute nodes, totaling 24192 compute cores with 129 TB RAM and giving over 2 PFlop/s theoretical peak performance. Each node is a powerful x86-64 computer equipped with 24 cores and at least 128 GB RAM. Nodes are interconnected by a 7D enhanced hypercube InfiniBand network and are equipped with Intel Xeon E5-2680v3 processors. The cluster consists of 576 nodes without accelerators and 432 nodes equipped with Intel Xeon Phi MIC accelerators.

https://docs.it4i.cz/salomon/hardware-overview/

Barbora

The Barbora cluster consists of 201 compute nodes, totaling 7232 compute cores with 44544 GB RAM, giving over 848 TFLOP/s theoretical peak performance. Nodes are interconnected through a fully non-blocking fat-tree InfiniBand network, and are equipped with Intel Cascade Lake processors. A few nodes are also equipped with NVIDIA Tesla V100-SXM2.

https://docs.it4i.cz/barbora/hardware-overview/

NVIDIA DGX-2

The DGX-2 is a very powerful computational node featuring high-end x86_64 processors and 16 NVIDIA V100-SXM3 GPUs. The DGX-2 introduces NVIDIA's new NVSwitch, enabling 300 GB/s chip-to-chip communication at 12 times the speed of PCIe. With NVLink2, it connects 16 NVIDIA V100-SXM3 GPUs in a single system, for a total bandwidth beyond 14 TB/s. Featuring a pair of Xeon 8168 CPUs, 1.5 TB of memory, and 30 TB of NVMe storage, the system consumes 10 kW, weighs 163.29 kg, and offers double-precision performance in excess of 130 TF.

https://docs.it4i.cz/dgx2/introduction/

Anselm

The Anselm cluster consists of 209 compute nodes, totaling 3344 compute cores with 15 TB RAM and giving over 94 TFLOP/s theoretical peak performance. Each node is a powerful x86-64 computer equipped with 16 cores, at least 64 GB RAM, and a 500 GB hard disk drive. Nodes are interconnected by a fully non-blocking fat-tree InfiniBand network and are equipped with Intel Sandy Bridge processors. A few nodes are also equipped with NVIDIA Kepler GPUs or Intel Xeon Phi MIC accelerators.

https://docs.it4i.cz/anselm/hardware-overview/

Acknowledgement

This work was supported by the Ministry of Education, Youth and Sports from the National Programme of Sustainability (NPS II) project "IT4Innovations excellence in science - LQ1602" and by the IT4Innovations infrastructure, which is supported by the Large Infrastructures for Research, Experimental Development and Innovations project "IT4Innovations National Supercomputing Center - LM2015070".

Middleware Architecture

[Figure: HEAppE Middleware architecture]

HEAppE's universally designed software architecture enables unified access to different HPC systems through a simple object-oriented client-server interface using standard web services. It thus provides HPC capabilities to the users without the necessity of managing running jobs from the command-line interface of the HPC scheduler directly on the cluster.

Web Services

UserAndLimitationManagementWs

  • AuthenticateUserDigitalSignature – user authentication using a public/private key pair
  • AuthenticateUserPassword – user authentication via username and password
  • GetCurrentUsageAndLimitationsForCurrentUser – the current usage and constraints for the user

ClusterInformationWs

  • GetCurrentClusterNodeUsage – information about the current node usage
  • ListAvailableClusters – the list of available clusters
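These calls can be used to discover which clusters and node types are available before creating a job. The following sketch assumes the WSDL-generated proxy classes used in the C# integration example below; the exact method signatures, return types (ClusterInfoExt, ClusterNodeUsageExt), and field names are assumptions and may differ in the generated stubs.

```csharp
//sketch only: proxy types, signatures, and field names are assumptions
ClusterInformationWs wsClusterInformation = new ClusterInformationWs();

//list all clusters accessible through this HEAppE instance
ClusterInfoExt[] clusters = wsClusterInformation.ListAvailableClusters();
foreach (ClusterInfoExt cluster in clusters)
{
    Console.WriteLine("Cluster: {0}", cluster.name);
}

//current usage of one node type (7 = Salomon express queue in the example below)
ClusterNodeUsageExt usage =
    wsClusterInformation.GetCurrentClusterNodeUsage(7, sessionCode);
Console.WriteLine("Nodes in use: {0}", usage.numberOfUsedNodes);
```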

JobManagementWs

  • GetCurrentInfoForJob – returns basic information describing the current job
  • ListJobsForCurrentUser – returns a list of basic information describing all user jobs
  • CreateJob – creates a new job
  • SubmitJob – submits a job to an HPC scheduler queue
  • CancelJob – cancels a running job
  • DeleteJob – deletes job files

FileTransferWs

  • DownloadFileFromCluster – downloads entire files from the cluster storage
  • DownloadPartsOfJobFilesFromCluster – downloads changes in the file based on the offset
  • EndFileTransfer – ends the file transfer
  • GetFileTransferMethod – acquires object used for data transfer
  • ListChangedFilesForJob – lists newly created or edited files and folders in the base directory

JobReportingWs

  • GetResourceUsageReportForJob – the used core-hours for the job
  • GetUserGroupResourceUsageReport – the used core-hours for a user group
  • GetUserResourceUsageReport – the used core-hours for the user
  • ListAdaptorUserGroups – a list of all users in the specified group
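For accounting purposes, the reporting calls can be combined with the session code obtained during authentication. A minimal sketch; the return type (ResourceUsageReportExt) and the exact parameter lists are assumptions based on the method names above:

```csharp
//sketch only: return type and exact parameters are assumptions
JobReportingWs wsJobReporting = new JobReportingWs();

//core-hours consumed by a single job
ResourceUsageReportExt jobReport =
    wsJobReporting.GetResourceUsageReportForJob(jobId, sessionCode);

//aggregate core-hours consumed by the authenticated user
ResourceUsageReportExt userReport =
    wsJobReporting.GetUserResourceUsageReport(userId, sessionCode);
```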

Command Template Preparation

For security purposes, HEAppE allows users to run only a pre-prepared set of so-called Command Templates. Each template defines an arbitrary script or executable file that will be executed on the cluster, any dependencies or third-party software it might require, and the type of queue that should be used for the processing (i.e. the type of computing nodes to be used on the cluster). The template also contains the set of input parameters that will be passed to the executed script at run-time. Thus, users can only execute pre-prepared command templates with a pre-defined set of input parameters. The actual value of each parameter (input from the user) can be changed by the user for each job submission.

Table 1: Example Command Template

Id:                 1
Name:               TestTemplate
Description:        Desc
Code:               Code
Executable File:    /scratch/temp/HaasTestScript/test.sh
Command Parameters: "%%{inputParam}"
Preparation Script: module load Python/2.7.9-intel-2015b;
Cluster Node Type:  7
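At run-time, HEAppE substitutes the user-supplied parameter values into the template's command-line placeholders. The snippet below illustrates the substitution for the template above; the actual expansion is performed internally by the middleware, so this is only a conceptual illustration:

```csharp
//conceptual illustration of the placeholder substitution for template 1:
//  preparation script:  module load Python/2.7.9-intel-2015b;
//  executed command:    /scratch/temp/HaasTestScript/test.sh "someStringParam"
string commandParameters = "\"%%{inputParam}\"";
string expanded = commandParameters.Replace("%%{inputParam}", "someStringParam");
Console.WriteLine(expanded);    //prints "someStringParam" (including the quotes)
```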

Workflow

[Figure: HEAppE workflow]

HEAppE Integration Example (C#)

//standard libraries and SSH.NET (Renci.SshNet), which provides ScpClient and PrivateKeyFile
using System;
using System.Collections.Generic;
using System.IO;
using System.Text;
using Renci.SshNet;

//web references for HEAppE services
using WsClient.test_ClusterInformationWs;
using WsClient.test_FileTransferWs;
using WsClient.test_JobManagementWs;
using WsClient.test_JobReportingWs;
using WsClient.test_UserAndLimitationManagementWs;

namespace WsClient
{
    class Program
    {
        static ClusterInformationWs wsClusterInformation = new ClusterInformationWs();
        static FileTransferWs wsFileTransfer = new FileTransferWs();
        static JobManagementWs wsJobManagement = new JobManagementWs();
        static JobReportingWs wsJobReporting = new JobReportingWs();
        static UserAndLimitationManagementWs wsUserAndLimitationManagement = new UserAndLimitationManagementWs();

        private static string sessionCode;	//code acquired via authentication
        
        static void Main(string[] args)
        {
            AuthenticateUserPassword();
            CreateAndSubmitTestJob();
        }

        private static void CreateAndSubmitTestJob()
        {
            //each submitted job must contain at least one task
            TaskSpecificationExt testTask = new TaskSpecificationExt();
            testTask.name = "TestJob";
            testTask.minCores = 1;		//minimum number of cores required
            testTask.maxCores = 1;		//maximum number of cores required
            testTask.walltimeLimit = 600;	//maximum time for task to run (seconds)
            testTask.standardOutputFile = "console_Stdout";
            testTask.standardErrorFile = "console_Stderr";
            testTask.progressFile = "console_Stdprog";
            testTask.logFile = "console_Stdlog";
            testTask.commandTemplateId = 1;	//commandTemplateID
            //custom environment variables for the task
            testTask.environmentVariables = new EnvironmentVariableExt[0];
            //fill the command template parameters (see Table1 for “inputParam”)
            testTask.templateParameterValues = new CommandTemplateParameterValueExt[] { 
                new CommandTemplateParameterValueExt() { commandParameterIdentifier =
                    "inputParam", parameterValue = "someStringParam" } 
            };

            //create job specification with the task above
            JobSpecificationExt testJob = new JobSpecificationExt();
            testJob.name = "TestJob";	//job name
            testJob.minCores = 1;		//minimum number of cores required
            testJob.maxCores = 1;		//maximum number of cores required
            testJob.priority = WsClient.test_JobManagementWs.JobPriorityExt.Average;
            testJob.project = "ExpTests";	//accounting project
            testJob.waitingLimit = null;	//limit for the waiting time in cluster queue 
            testJob.walltimeLimit = 600;	//maximum time for job to run (seconds)
            testJob.clusterNodeTypeId = 7;	//Salomon express queue (1h limit)
            //custom environment variables for the job
            testJob.environmentVariables = new EnvironmentVariableExt[0];
            //assign created task to job specification
            testJob.tasks = new TaskSpecificationExt[] { testTask };

            //create job
            SubmittedJobInfoExt submittedTestJob = wsJobManagement.CreateJob(testJob, sessionCode);
            Console.WriteLine("Created job ID {0}.", submittedTestJob.id);

            //upload input files
            FileTransferMethodExt ft = wsFileTransfer.GetFileTransferMethod(
                (long)submittedTestJob.id, sessionCode);
            using (MemoryStream pKeyStream = new MemoryStream(
                Encoding.UTF8.GetBytes(ft.credentials.privateKey)))
            {
                using (ScpClient scpClient = new ScpClient(ft.serverHostname,
                    ft.credentials.username, new PrivateKeyFile(pKeyStream)))
                {
                    scpClient.Connect();
                    DirectoryInfo di = new DirectoryInfo(@"C:\InputFiles\");
                    foreach (FileInfo fi in di.GetFiles())
                    {
                        Console.WriteLine("Uploading file: " + fi.Name);
                        scpClient.Upload(fi, ft.sharedBasepath + "//" + fi.Name);
                        Console.WriteLine("File uploaded.");
                    }
                }
            }
            wsFileTransfer.EndFileTransfer((long)submittedTestJob.id, ft, sessionCode);
            
            //submit the job to a cluster queue for processing
            submittedTestJob = wsJobManagement.SubmitJob((long)submittedTestJob.id, sessionCode);
            Console.WriteLine("Submitted job ID: {0}", submittedTestJob.id);

            //check status of running job (submitted/configuring/queued/running)
            SubmittedJobInfoExt submittedJob;
            long jobId = (long)submittedTestJob.id;
            do
            {
                //wait 30s before the next status check
                System.Threading.Thread.Sleep(30000);
                //get info for the job
                submittedJob = wsJobManagement.GetCurrentInfoForJob(jobId, sessionCode);
                Console.WriteLine(submittedJob.state);
                //set offsets for the stdout, stderr, stdprog, stdlog files
                //offsets can be used for the partial download of files
                List<TaskFileOffsetExt> offsets = new List<TaskFileOffsetExt>();
                foreach (SubmittedTaskInfoExt taskInfo in submittedJob.tasks)
                {
                    TaskFileOffsetExt off = new TaskFileOffsetExt();
                    off.fileType = SynchronizableFilesExt.LogFile;
                    off.submittedTaskInfoId = taskInfo.id;
                    off.offset = 0;
                    offsets.Add(off);

                    off = new TaskFileOffsetExt();
                    off.fileType = SynchronizableFilesExt.ProgressFile;
                    off.submittedTaskInfoId = taskInfo.id;
                    off.offset = 0;
                    offsets.Add(off);

                    off = new TaskFileOffsetExt();
                    off.fileType = SynchronizableFilesExt.StandardErrorFile;
                    off.submittedTaskInfoId = taskInfo.id;
                    off.offset = 0;
                    offsets.Add(off);

                    off = new TaskFileOffsetExt();
                    off.fileType = SynchronizableFilesExt.StandardOutputFile;
                    off.submittedTaskInfoId = taskInfo.id;
                    off.offset = 0;
                    offsets.Add(off);
                }

                //download stdouts based on the offsets
                JobFileContentExt[] result = wsFileTransfer.DownloadPartsOfJobFilesFromCluster(
                    jobId, offsets.ToArray(), sessionCode);
                //print each file
                foreach (JobFileContentExt file in result)
                {
                    Console.WriteLine("File: " + file.fileType + ", " + file.relativePath);
                    Console.WriteLine("TaskInfoId: " + file.submittedTaskInfoId);
                    Console.WriteLine("Offset: " + file.offset);
                    Console.WriteLine("Content: " + file.content);
                }
            }
            while (submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Submitted
                || submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Configuring 
                || submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Queued 
                || submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Running);   

            //job computation is done (finished/failed/canceled)
            submittedJob = wsJobManagement.GetCurrentInfoForJob(jobId, sessionCode);
            // job finished successfully, download result files from the cluster
            if (submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Finished) 
            {
                ft = wsFileTransfer.GetFileTransferMethod(jobId, sessionCode);
                using (MemoryStream pKeyStream = new MemoryStream(Encoding.UTF8.GetBytes(
                    ft.credentials.privateKey)))
                {
                    using (ScpClient scpClient = new ScpClient(ft.serverHostname,
                        ft.credentials.username, new PrivateKeyFile(pKeyStream)))
                    {
                        scpClient.Connect();
                        // changed result files
                        string[] changedFiles = wsFileTransfer.ListChangedFilesForJob(
                            jobId, sessionCode);
                        foreach (string file in changedFiles)
                        {
                            Console.WriteLine("Downloading file: " + file);
                            FileInfo fi = new FileInfo(@"C:\OutputFiles\" + file);
                            scpClient.Download(ft.sharedBasepath + "//" + file, fi);
                        }
                    }
                }
                wsFileTransfer.EndFileTransfer(jobId, ft, sessionCode);
            }
            // job failed or was canceled
            else if (submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Failed 
                || submittedJob.state == WsClient.test_JobManagementWs.JobStateExt.Canceled)
            {
                // do nothing
            }
        }
        
        private static void AuthenticateUserPassword()
        {
            Console.WriteLine("Authenticating user {0}...", "[testuser]");
            PasswordCredentialsExt credentials = new PasswordCredentialsExt();
            credentials.username = "[testuser]";
            credentials.password = "[testpass]";
            sessionCode = wsUserAndLimitationManagement.AuthenticateUserPassword(credentials);
            Console.WriteLine("\tAuth OK (Session GUID: {0})", sessionCode);
        }

    }
}

Sample Template Output

Authenticating user [testuser]...
        Auth OK (Session GUID: 4a5d7017-1992-45b1-8b07-43cf5d421f50)
Created job ID 71.
Uploading file: someInputFile1.txt
File uploaded.
Uploading file: someInputFile2.txt
File uploaded.
Submitted job ID: 71
Queued
File: StandardErrorFile, console_Stderr
TaskInfoId: 71
Offset: 0
Content:
File: StandardOutputFile, console_Stdout
TaskInfoId: 71
Offset: 0
Content: Input param: someStringParam
Iteration: 01

Running
File: StandardErrorFile, console_Stderr
TaskInfoId: 71
Offset: 0
Content:
File: StandardOutputFile, console_Stdout
TaskInfoId: 71
Offset: 0
Content: Input param: someStringParam
Iteration: 01
Iteration: 02

Finished
File: StandardErrorFile, console_Stderr
TaskInfoId: 71
Offset: 0
Content:
File: StandardOutputFile, console_Stdout
TaskInfoId: 71
Offset: 0
Content: Input param: someStringParam
Iteration: 01
Iteration: 02
Iteration: 03
Iteration: 04
Iteration: 05
Iteration: 06
Iteration: 07
Iteration: 08
Iteration: 09
Iteration: 10

Downloading file: /resultFile.txt
Downloading file: /console_Stdout
Downloading file: /console_Stderr
Press any key to continue . . .