AI Workers Configuration
By default, the Aindo Synthetic Data Platform is configured with only Persistent AI Workers enabled.
Persistent AI Workers
Persistent workers utilize a simple, fixed-size, queue-based worker pool. They listen for incoming AI workloads and process them sequentially based on available capacity. To increase the number of concurrent AI training jobs, you must increase the number of nodes in the pool. This configuration is ideal for bare-metal deployments where hardware resources are static.
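The fixed-size, queue-based pool described above can be sketched roughly as follows. This is a minimal illustration only, not the platform's implementation; the names `handle_workload` and `NUM_WORKERS` are hypothetical:

```python
import queue
import threading

NUM_WORKERS = 4  # fixed pool size: more concurrency requires more workers/nodes

def handle_workload(workload):
    # Placeholder for the actual AI training job.
    return f"processed {workload}"

def worker(jobs: queue.Queue, results: list):
    while True:
        workload = jobs.get()
        if workload is None:  # sentinel: shut this worker down
            jobs.task_done()
            break
        results.append(handle_workload(workload))
        jobs.task_done()

jobs: queue.Queue = queue.Queue()
results: list = []
threads = [
    threading.Thread(target=worker, args=(jobs, results))
    for _ in range(NUM_WORKERS)
]
for t in threads:
    t.start()

# Incoming workloads queue up and are processed as capacity frees.
for w in ["job-a", "job-b", "job-c"]:
    jobs.put(w)
for _ in threads:
    jobs.put(None)  # one sentinel per worker
jobs.join()
```

Because the pool size is fixed, throughput only grows by adding workers, which is why this mode maps naturally onto static bare-metal capacity.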
On-demand AI Workers
If the platform is running on a Kubernetes cluster in the cloud, On-demand AI Workers are better suited for the task. In this configuration, AI workloads are spawned as individual Kubernetes Jobs. This offers several advantages:
- Cost Efficiency: Jobs can be scheduled on node pools with autoscaling enabled, ensuring you only pay for compute resources when a workload is actually running.
- Hardware Flexibility: Multiple On-demand AI Workers can be configured to target different hardware specifications (e.g., specific GPU models or high-RAM instances).
Because of this flexibility, On-demand AI Workers are not enabled by default and must be configured manually.
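Conceptually, each workload in this mode becomes a standard Kubernetes Job. A hand-written equivalent might look like the sketch below; the platform generates these manifests itself, and the Job name suffix and container image shown here are hypothetical:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: example-worker-cpu-a1b2c3   # derived from the worker identifier
spec:
  backoffLimit: 0
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: ai-workload
          image: example.com/aindo/ai-worker:latest   # hypothetical image
          resources:
            requests:
              cpu: "16"
              memory: "32G"
```

Because each workload is its own Job, the cluster autoscaler can provision a node when a Job is pending and reclaim it once the Job completes.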
Configuring On-demand AI Workers
First, enable the Kubernetes Job Scheduler within your configuration as shown in the snippet below:
```yaml
be:
  workers:
    k8s:
      enabled: true
```

Next, configure the options for the jobs that will execute the workloads. Below is an example of a CPU-based worker configuration:
```yaml
tasks:
  manifests:
    example-worker-cpu:
      name: "CPU Worker"
      description: "16 vCPU, 32 GB RAM. 5th Gen Intel Xeon Scalable."
      resources:
        limits:
          memory: "32G"
          # cpu: omit to allow burstable usage
        requests:
          cpu: "16"
          memory: "32G"
      threads: 8
      nodeSelector:
        cloud.google.com/machine-family: c4
```

Parameter Definitions:
- example-worker-cpu: The internal identifier used as a template for the Kubernetes Job name. This must adhere to standard Kubernetes DNS naming conventions.
- name: The display name of the worker as it will appear in the user interface.
- description: A summary of the worker’s hardware or purpose.
- resources: Standard Kubernetes resource parameters defining limits and requests for CPU and memory.
- threads: The number of NumPy threads allocated for the workload.
- nodeSelector: Used to constrain the job to specific nodes (e.g., targeting a specific cloud machine family).
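The DNS naming requirement on the identifier can be checked with a simple pattern. This sketch encodes the RFC 1123 label rule that Kubernetes applies to most resource names: lowercase alphanumerics and hyphens, starting and ending with an alphanumeric, at most 63 characters. The helper name `is_valid_worker_id` is ours, not part of the platform:

```python
import re

# RFC 1123 DNS label: lowercase alphanumerics and '-', must start and end
# with an alphanumeric, 1-63 characters total.
DNS_LABEL = re.compile(r"^[a-z0-9]([-a-z0-9]{0,61}[a-z0-9])?$")

def is_valid_worker_id(identifier: str) -> bool:
    return bool(DNS_LABEL.match(identifier))

print(is_valid_worker_id("example-worker-cpu"))  # valid
print(is_valid_worker_id("CPU_Worker"))          # invalid: uppercase and underscore
```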