How WEKA and Starfish Storage Redefine Metadata Performance

In the world of data-intensive industries, effectively managing massive datasets is paramount. Integrating WEKA and Starfish Storage creates a powerful solution to streamline metadata operations, enhance data governance, and bring previously unattainable performance levels to high-performance computing (HPC) environments. This post highlights our recent achievements in integrating Starfish with WEKA, leading to unprecedented levels of scalability, flexibility, and efficiency.

A Partnership Built for Speed and Scale

Using Starfish Storage with WEKA, we achieved a rate of over 448K scans per second across distributed Starfish agents, which works out to roughly 1.6 billion scans per hour! Even under the heavy load of production medical data, the WEKA GUI displayed up to 1 million metadata operations per second.

Figure 1 – WEKA GUI during Starfish Distributed Scanning

Figure 2 – Starfish stats showing 448,080 scans per second.

Here’s how this partnership offers a comprehensive solution for next-generation data management needs.

1. Enhanced Data Visibility and Cataloging
With Starfish, we can create data catalogs and tag data seamlessly. This is instrumental for organizations seeking to organize, audit, or optimize file management. From tracking uncompressed files to extracting metadata from over 80 file types, including scientific formats like FASTQ, OME-TIFF, and DICOM, Starfish provides an intelligent, automated approach to managing petabytes of data in WEKA environments.

2. Optimized Storage and Compliance
Starfish helps identify uncompressed files and automate compression, a key feature for storage efficiency. The system can also peek inside billions of file headers, detect compressed and uncompressed states, and take corrective actions, freeing up valuable storage space and enhancing file accessibility. Additionally, for compliance and auditing, Starfish’s ability to track sensitive data like PII is crucial for regulated industries, making the integration ideal for healthcare, finance, and research sectors.
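To make the idea of peeking inside file headers concrete, here is a minimal sketch of how compressed and uncompressed files can be told apart from their first few bytes. It is not Starfish's implementation; the function names and the `find_uncompressed` helper are hypothetical, and only the magic numbers themselves (gzip, bzip2, xz, zstd) are standard.

```python
import os

# Standard magic numbers found at the start of common compressed formats.
MAGIC_SIGNATURES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
    b"\x28\xb5\x2f\xfd": "zstd",
}


def detect_compression_from_header(header):
    """Return the format name if the header starts with a known signature, else None."""
    for magic, name in MAGIC_SIGNATURES.items():
        if header.startswith(magic):
            return name
    return None


def detect_compression(path):
    """Read only the first 8 bytes of a file and classify it."""
    with open(path, "rb") as fh:
        return detect_compression_from_header(fh.read(8))


def find_uncompressed(root):
    """Yield paths under root whose headers match none of the known signatures.

    Hypothetical helper: a real policy would also filter by file type and size.
    """
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            if detect_compression(path) is None:
                yield path
```

Reading only a handful of bytes per file is what makes this kind of check feasible across very large file counts, since the file contents never need to be read in full.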

3. Deep Archive and Data Lifecycle Management
One of the unique features Starfish brings to WEKA is automated integration with “deep archive” tiers on cloud storage. Files can be transferred to archival storage with just a few clicks and retrieved with ease. This capability is user-driven, providing control and flexibility while optimizing storage costs.

4. Powerful Automation and Workflow Optimization
Integrating Starfish allows for the automation of data workflows critical to HPC and data acquisition environments. Files generated by laboratory instruments and acquisition devices can be processed, organized, and stored without manual intervention, ensuring a seamless data flow from generation to storage.

5. Comprehensive Data Migration and Replication
With support for rapid data migration, Starfish adds a new level of flexibility to WEKA, allowing petabyte-scale data migrations and nuanced replication. The ability to replicate data at such scale while preserving ACLs, prioritization, and hash verification provides a robust solution for disaster recovery and high availability (DR/HA).
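Hash verification during replication can be illustrated with a short sketch: copy a file, hash both sides, and fail loudly on a mismatch. This is an illustration using the Python standard library, not how Starfish performs replication; `replicate_with_verification` is a hypothetical name, and the sketch does not preserve ACLs.

```python
import hashlib
import shutil


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in fixed-size chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def replicate_with_verification(src, dst):
    """Copy src to dst and confirm both files have the same SHA-256 digest.

    Returns the verified digest. Note: shutil.copy2 preserves timestamps and
    mode bits where possible, but it does NOT carry over ACLs.
    """
    shutil.copy2(src, dst)
    src_digest = sha256_of(src)
    dst_digest = sha256_of(dst)
    if src_digest != dst_digest:
        raise IOError(f"hash mismatch after copying {src} to {dst}")
    return src_digest
```

Storing the returned digest alongside the replica lets a later audit re-verify the copy without touching the source.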

6. Advanced Security and Compliance Support
Starfish enables real-time permissions auditing and policy-driven encryption checks. If Starfish detects non-compliant permissions, it can automatically apply remedial measures, ensuring compliance with data security standards and regulatory requirements.
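As a rough illustration of a permissions audit with automatic remediation, the sketch below flags world-writable files and can optionally clear that bit. The policy ("no world-writable files") and the function names are assumptions chosen for the example; they do not describe Starfish's actual rules or API.

```python
import os
import stat


def is_world_writable(mode):
    """True if the 'other' write bit is set in a permission mode."""
    return bool(mode & stat.S_IWOTH)


def remediate_mode(mode):
    """Return the mode with the 'other' write bit cleared."""
    return mode & ~stat.S_IWOTH


def audit_tree(root, fix=False):
    """Return paths of world-writable regular files under root.

    With fix=True, the offending bit is removed via os.chmod.
    Symlinks and other non-regular files are skipped.
    """
    findings = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            st = os.lstat(path)
            if not stat.S_ISREG(st.st_mode):
                continue
            mode = stat.S_IMODE(st.st_mode)
            if is_world_writable(mode):
                findings.append(path)
                if fix:
                    os.chmod(path, remediate_mode(mode))
    return findings
```

Running with `fix=False` first gives a report-only pass, which is a sensible default before letting any tool change permissions on production data.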

7. Simplified Inter-Organizational Sharing
In multi-tenant environments, Starfish’s policy-driven sharing capabilities streamline data sharing across departments and organizations. Data can be securely transferred to shared repositories, further enhancing collaboration across different teams.

8. Streamlined HPC and AI Workflows
With Starfish and WEKA working together, we can automate cloud bursting for AI and HPC jobs, moving data between on-premises and cloud resources seamlessly. This integration aligns with WEKA’s mission to support scalable, high-performance workflows across hybrid environments.

Performance Insights: 1 Million Metadata Operations Per Second

The true highlight of this integration is its performance. Starfish is an out-of-band solution: it never sits between applications and the WEKA high-performance file system. Using 10 Starfish agents, we achieved over 448K scans per second, allowing us to analyze an entire WEKA file system with remarkable speed. Starfish agents are lightweight and simple to install, so you can deploy as many agents as needed, typically in VMs, to reach the required performance. During the scan, the WEKA GUI showed the file system handling up to 1 million metadata operations per second, which is crucial for real-time applications that require rapid access and manipulation of large datasets, particularly in medical imaging, genomics, and financial analysis.
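The figures above lend themselves to some back-of-the-envelope sizing. The sketch below uses the measured 448,080 scans per second across 10 agents; the assumption that throughput scales linearly with agent count is ours for illustration and is not a measured result, and the helper names are hypothetical.

```python
import math

MEASURED_SCANS_PER_SECOND = 448_080  # aggregate rate reported in the Starfish stats
MEASURED_AGENTS = 10                 # number of Starfish agents in the test


def scans_per_hour(rate_per_second):
    """Convert a per-second scan rate to a per-hour rate."""
    return rate_per_second * 3600


def per_agent_rate(rate_per_second, agents):
    """Average scan rate contributed by each agent."""
    return rate_per_second / agents


def agents_needed(target_rate, rate_per_agent):
    """Agents required for a target rate, ASSUMING linear scaling."""
    return math.ceil(target_rate / rate_per_agent)


def hours_to_scan(entry_count, rate_per_second):
    """Estimated hours to scan a given number of file system entries."""
    return entry_count / rate_per_second / 3600


print(scans_per_hour(MEASURED_SCANS_PER_SECOND))                   # 1613088000
print(per_agent_rate(MEASURED_SCANS_PER_SECOND, MEASURED_AGENTS))  # 44808.0
```

At the measured rate, each agent averages about 44.8K scans per second, which is the number to start from when estimating how many agents a given scan window would need.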

Unlocking the Future of Scalable and Efficient Data Management

Integrating WEKA and Starfish Storage offers an unparalleled combination of high-speed metadata management and scalable data solutions. From automating compliance to enabling deep archiving, this powerful duo delivers transformative capabilities across data-centric industries. With scalable file system performance and automated metadata management, WEKA and Starfish Storage help organizations unlock new levels of efficiency, flexibility, and data governance.

Discover how the WEKA Data Platform conquers the challenge of managing lots of small files (LOSF) with unmatched performance and scalability.