Training using TensorFlow
TFJob is a Kubernetes custom resource that makes it easy to run TensorFlow training jobs on Kubernetes.
A TFJob is a resource with a simple YAML representation illustrated below.
apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  labels:
    experiment: experiment10
  name: tfjob
  namespace: kubeflow
spec:
  tfReplicaSpecs:
    Ps:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
    Worker:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources: {}
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
status:
  conditions:
  - lastTransitionTime: 2018-07-29T00:31:48Z
    lastUpdateTime: 2018-07-29T00:31:48Z
    message: TFJob tfjob is running.
    reason: TFJobRunning
    status: "True"
    type: Running
  startTime: 2018-07-29T21:40:13Z
  tfReplicaStatuses:
    PS:
      active: 1
    Worker:
      active: 1
If you are not familiar with Kubernetes resources, please refer to the page Understanding Kubernetes Objects.
What makes TFJob different from built-in controllers is that the TFJob spec is designed to manage distributed TensorFlow training jobs.
A distributed TensorFlow job typically contains 0 or more of the following processes: a Chief, which orchestrates training and performs tasks such as checkpointing the model; Ps, the parameter servers that provide a distributed store for the model parameters; Worker replicas, which do the actual work of training the model; and an Evaluator, which computes evaluation metrics as the model is trained.
The field tfReplicaSpecs in the TFJob spec contains a map from the type of replica (as listed above) to the TFReplicaSpec for that replica. TFReplicaSpec consists of 3 fields: replicas (the number of replicas of this type to create), template (a PodTemplateSpec describing the pod to create for each replica), and restartPolicy (how the pods for this replica type are restarted when they exit).
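As a minimal sketch, the tfReplicaSpecs structure with these three fields looks like the following; the image and command are placeholders, not part of the benchmark example:

spec:
  tfReplicaSpecs:
    Worker:
      replicas: 2                       # number of Worker replicas to create
      restartPolicy: OnFailure          # restart behavior for each replica's pods
      template:                         # a standard Kubernetes PodTemplateSpec
        spec:
          containers:
          - name: tensorflow            # container named tensorflow, as in the examples above
            image: gcr.io/your-project/your-training-image:latest   # placeholder image
            command: ["python", "train.py"]                         # placeholder command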
Note: Before submitting a training job, you should have deployed Kubeflow to your cluster. Doing so ensures that the TFJob custom resource is available when you submit the training job.
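You can confirm that the custom resource is installed by listing the CRD; for example:

# Check that the TFJob CustomResourceDefinition exists on the cluster
kubectl get crd tfjobs.kubeflow.org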
We treat each TensorFlow job as a component in your ksonnet app.
Kubeflow ships with a ksonnet prototype suitable for running the TensorFlow CNN Benchmarks.
You can also use this prototype to generate a component, which you can then customize for your jobs.
Create the component (update version as appropriate).
CNN_JOB_NAME=mycnnjob
VERSION=v0.2-branch
ks registry add kubeflow-git github.com/kubeflow/kubeflow/tree/${VERSION}/kubeflow
ks pkg install kubeflow-git/examples
ks generate tf-job-simple ${CNN_JOB_NAME} --name=${CNN_JOB_NAME}
Submit it
ks apply ${KF_ENV} -c ${CNN_JOB_NAME}
Monitor it (please refer to the TFJob docs)
kubectl get -o yaml tfjobs ${CNN_JOB_NAME}
Delete it
ks delete ${KF_ENV} -c ${CNN_JOB_NAME}
Generating a component as in the previous step will create a file named
components/${CNN_JOB_NAME}.jsonnet
A jsonnet file is essentially a JSON file defining the manifest for your TFJob. You can modify this manifest to run your jobs.
Typically you will want to change the following values (see the sketch after this list):
Change the image to point to the docker image containing your code
Change the number and types of replicas
Change the resources (requests and limits) assigned to each replica
Set any environment variables
Attach PVs if you want to use persistent volumes for storage.
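For example, a customized Worker replica in the manifest might look roughly like the sketch below; the image, PVC name, and environment variable are placeholders to replace with your own:

Worker:
  replicas: 2                           # scale out the workers
  restartPolicy: OnFailure
  template:
    spec:
      containers:
      - name: tensorflow
        image: gcr.io/your-project/your-training-image:latest   # placeholder image containing your code
        env:
        - name: TRAIN_STEPS             # placeholder environment variable
          value: "10000"
        resources:
          requests:
            cpu: "4"
            memory: 8Gi
          limits:
            cpu: "8"
            memory: 16Gi
        volumeMounts:
        - name: training-data
          mountPath: /data
      volumes:
      - name: training-data
        persistentVolumeClaim:
          claimName: your-pvc           # placeholder PVC for storage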
To use GPUs your cluster must be configured to use GPUs; in particular, the nodes must have GPUs attached and the Kubernetes cluster must recognize the nvidia.com/gpu resource type. To attach GPUs, specify the GPU resource on the container in the replicas that should contain the GPUs; for example:
apiVersion: "kubeflow.org/v1alpha2"
kind: "TFJob"
metadata:
  name: "tf-smoke-gpu"
spec:
  tfReplicaSpecs:
    PS:
      replicas: 1
      template:
        metadata:
          creationTimestamp: null
        spec:
          containers:
          - args:
            - python
            - tf_cnn_benchmarks.py
            - --batch_size=32
            - --model=resnet50
            - --variable_update=parameter_server
            - --flush_stdout=true
            - --num_gpus=1
            - --local_parameter_device=cpu
            - --device=cpu
            - --data_format=NHWC
            image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            name: tensorflow
            ports:
            - containerPort: 2222
              name: tfjob-port
            resources:
              limits:
                nvidia.com/gpu: 1
            workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          restartPolicy: OnFailure
Follow TensorFlow’s instructions for using GPUs.
To get the status of your job
kubectl get -o yaml tfjobs $JOB
Here is sample output for an example job
apiVersion: v1
items:
- apiVersion: kubeflow.org/v1alpha2
  kind: TFJob
  metadata:
    creationTimestamp: 2018-07-29T00:31:12Z
    generation: 1
    labels:
      app.kubernetes.io/deploy-manager: ksonnet
    name: tfjob
    namespace: kubeflow
    resourceVersion: "22310"
    selfLink: /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob
    uid: b20c924b-92c6-11e8-b3ca-42010a80019c
  spec:
    tfReplicaSpecs:
      PS:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - python
              - tf_cnn_benchmarks.py
              - --batch_size=32
              - --model=resnet50
              - --variable_update=parameter_server
              - --flush_stdout=true
              - --num_gpus=1
              - --local_parameter_device=cpu
              - --device=cpu
              - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
            restartPolicy: OnFailure
      Worker:
        replicas: 1
        template:
          metadata:
            creationTimestamp: null
          spec:
            containers:
            - args:
              - python
              - tf_cnn_benchmarks.py
              - --batch_size=32
              - --model=resnet50
              - --variable_update=parameter_server
              - --flush_stdout=true
              - --num_gpus=1
              - --local_parameter_device=cpu
              - --device=cpu
              - --data_format=NHWC
              image: gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
              name: tensorflow
              ports:
              - containerPort: 2222
                name: tfjob-port
              resources: {}
              workingDir: /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
            restartPolicy: OnFailure
  status:
    conditions:
    - lastTransitionTime: 2018-07-29T00:31:48Z
      lastUpdateTime: 2018-07-29T00:31:48Z
      message: TFJob tfjob is running.
      reason: TFJobRunning
      status: "True"
      type: Running
    startTime: 2018-07-29T02:18:13Z
    tfReplicaStatuses:
      PS:
        active: 1
      Worker:
        active: 1
A TFJob has a TFJobStatus, which has an array of TFJobConditions through which the TFJob has or has not passed. Each element of the TFJobCondition array has six possible fields: type, status, reason, message, lastUpdateTime, and lastTransitionTime, as shown in the status section of the sample output above.
Success or failure of a job is determined as follows
tfReplicaStatuses provides a map indicating the number of pods for each replica type in a given state. There are three possible states: Active, Succeeded, and Failed.
During execution, TFJob will emit events to indicate what’s happening, such as the creation/deletion of pods and services. Kubernetes doesn’t retain events older than 1 hour by default. To see recent events for a job, run:
kubectl describe tfjobs ${JOB}
which will produce output like
Name:         tfjob2
Namespace:    kubeflow
Labels:       app.kubernetes.io/deploy-manager=ksonnet
Annotations:  ksonnet.io/managed={"pristine":"H4sIAAAAAAAA/+yRz27UMBDG7zzGnJ3NbkoFjZQTqEIcYEUrekBVNHEmWbOObY3HqcJq3x05UC1/ngCJHKKZbz6P5e93AgzmM3E03kENx9TRYP3TxvNYzju04YAVKDga10MN97fvfQcKJhLsURDqEzicCGqQ4avvsjX3MaCm...
API Version:  kubeflow.org/v1alpha2
Kind:         TFJob
Metadata:
  Cluster Name:
  Creation Timestamp:  2018-07-29T02:46:53Z
  Generation:          1
  Resource Version:    26872
  Self Link:           /apis/kubeflow.org/v1alpha2/namespaces/kubeflow/tfjobs/tfjob2
  UID:                 a6bc7b6f-92d9-11e8-b3ca-42010a80019c
Spec:
  Tf Replica Specs:
    PS:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
    Worker:
      Replicas:  1
      Template:
        Metadata:
          Creation Timestamp:  <nil>
        Spec:
          Containers:
            Args:
              python
              tf_cnn_benchmarks.py
              --batch_size=32
              --model=resnet50
              --variable_update=parameter_server
              --flush_stdout=true
              --num_gpus=1
              --local_parameter_device=cpu
              --device=cpu
              --data_format=NHWC
            Image:  gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3
            Name:   tensorflow
            Ports:
              Container Port:  2222
              Name:            tfjob-port
            Resources:
            Working Dir:  /opt/tf-benchmarks/scripts/tf_cnn_benchmarks
          Restart Policy:  OnFailure
Status:
  Conditions:
    Last Transition Time:  2018-07-29T02:46:55Z
    Last Update Time:      2018-07-29T02:46:55Z
    Message:               TFJob tfjob2 is running.
    Reason:                TFJobRunning
    Status:                True
    Type:                  Running
  Start Time:              2018-07-29T02:46:55Z
  Tf Replica Statuses:
    PS:
      Active:  1
    Worker:
      Active:  1
Events:
  Type     Reason                          Age                From         Message
  ----     ------                          ----               ----         -------
  Warning  SettedPodTemplateRestartPolicy  19s (x2 over 19s)  tf-operator  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-worker-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-worker-0
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-ps-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-ps-0
Here the events indicate that the pods and services were successfully created.
Logging follows standard K8s logging practices.
You can use kubectl to get standard output/error for any pods that haven’t been deleted.
First find the pod created by the job controller for the replica of interest. Pods will be named
${JOBNAME}-${REPLICA-TYPE}-${INDEX}
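For example, you can list the pods belonging to a job using the labels applied by the operator (this assumes the tf_job_name label shown in the Stackdriver queries below):

# List the pods created for a given TFJob
kubectl -n ${NAMESPACE} get pods -l tf_job_name=${JOB_NAME}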
Once you’ve identified your pod you can get the logs using kubectl.
kubectl logs ${PODNAME}
The CleanPodPolicy in the TFJob spec controls deletion of pods when a job terminates. The policy can be one of the following values: All, Running, or None.
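As a sketch, the policy is set at the top level of the TFJob spec; the field name below is an assumption, so verify it against the API version you are running:

apiVersion: kubeflow.org/v1alpha2
kind: TFJob
metadata:
  name: my-tfjob                 # placeholder name
spec:
  cleanPodPolicy: Running        # assumed field name; value is one of All, Running, or None
  tfReplicaSpecs:
    # ... replica specs as shown in the examples above ...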
If your cluster takes advantage of K8s cluster logging then your logs may also be shipped to an appropriate data store for further analysis.
See here for instructions to get logs using Stackdriver.
As described here, it’s possible to fetch the logs for a particular replica based on pod labels.
Using the Stackdriver UI you can use a query like
resource.type="k8s_container"
resource.labels.cluster_name="${CLUSTER}"
metadata.userLabels.tf_job_name="${JOB_NAME}"
metadata.userLabels.tf-replica-type="${TYPE}"
metadata.userLabels.tf-replica-index="${INDEX}"
Alternatively using gcloud
QUERY="resource.type=\"k8s_container\" "
QUERY="${QUERY} resource.labels.cluster_name=\"${CLUSTER}\" "
QUERY="${QUERY} metadata.userLabels.tf_job_name=\"${JOB_NAME}\" "
QUERY="${QUERY} metadata.userLabels.tf-replica-type=\"${TYPE}\" "
QUERY="${QUERY} metadata.userLabels.tf-replica-index=\"${INDEX}\" "
gcloud --project=${PROJECT} logging read \
--freshness=24h \
--order asc ${QUERY}
Here are some steps to follow to troubleshoot your job
Is a status present for your job? Run the command
kubectl -n ${NAMESPACE} get tfjobs -o yaml ${JOB_NAME}
If the resulting output doesn’t include a status for your job, this typically indicates the job spec is invalid.
If the TFJob spec is invalid, there should be a log message in the tf-operator logs
kubectl -n ${KUBEFLOW_NAMESPACE} logs `kubectl get pods --selector=name=tf-job-operator -o jsonpath='{.items[0].metadata.name}'`
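If the operator logs are noisy, you can, for example, filter them for lines that mention your job:

# Filter the tf-job-operator logs for entries mentioning the job name
kubectl -n ${KUBEFLOW_NAMESPACE} logs \
  `kubectl get pods --selector=name=tf-job-operator -o jsonpath='{.items[0].metadata.name}'` \
  | grep ${JOB_NAME}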
Check the events for your job to see if the pods were created
There are a number of ways to get the events; if your job is less than 1 hour old then you can do
kubectl -n ${NAMESPACE} describe tfjobs ${JOB_NAME}
The bottom of the output should include a list of events emitted by the job; e.g.
Events:
  Type     Reason                          Age                From         Message
  ----     ------                          ----               ----         -------
  Warning  SettedPodTemplateRestartPolicy  19s (x2 over 19s)  tf-operator  Restart policy in pod template will be overwritten by restart policy in replica spec
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-worker-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-worker-0
  Normal   SuccessfulCreatePod             19s                tf-operator  Created pod: tfjob2-ps-0
  Normal   SuccessfulCreateService         19s                tf-operator  Created service: tfjob2-ps-0
* Kubernetes only preserves events for **1 hour** (see [kubernetes/kubernetes#52521](https://github.com/kubernetes/kubernetes/issues/52521))
* Depending on your cluster setup events might be persisted to external storage and accessible for longer periods
* On GKE events are persisted in Stackdriver and can be accessed using the instructions in the previous section.
* If the pods and services aren't being created then this suggests the TFJob isn't being processed; common causes are
  * The TFJob spec is invalid (see above)
  * The TFJob operator isn't running
Check the events for the pods to ensure they are scheduled.
There are a number of ways to get the events; if your pod is less than 1 hour old then you can do
kubectl -n ${NAMESPACE} describe pods ${POD_NAME}
The bottom of the output should contain events like the following
Events:
  Type    Reason                 Age   From                                                  Message
  ----    ------                 ----  ----                                                  -------
  Normal  Scheduled              18s   default-scheduler                                     Successfully assigned tfjob2-ps-0 to gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt
  Normal  SuccessfulMountVolume  17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  MountVolume.SetUp succeeded for volume "default-token-h8rnv"
  Normal  Pulled                 17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Container image "gcr.io/kubeflow/tf-benchmarks-cpu:v20171202-bdab599-dirty-284af3" already present on machine
  Normal  Created                17s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Created container
  Normal  Started                16s   kubelet, gke-jl-kf-v0-2-2-default-pool-347936c1-1qkt  Started container
* Some common problems that can prevent a container from starting are
  * Insufficient resources to schedule the pod (see the command after this list)
  * The pod tries to mount a volume (or secret) that doesn't exist or is unavailable
  * The Docker image doesn't exist or can't be accessed (e.g. due to permission issues)
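For the insufficient-resources case, one way to check is to compare the pod's requests against what each node can allocate; for example:

# Show allocatable capacity and current resource requests/limits per node
kubectl describe nodes | grep -A 10 "Allocated resources"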